Today's arXiv Papers | 2024-10-18

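This post lists the latest papers collected from arXiv.org on 2024-10-18. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.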

Note: the daily paper data is fetched from arXiv.org and updated automatically at around 11:00 every morning.

Tip: if you would like to receive the daily paper data by email, leave your email address in the comments.

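[Quick read]: This paper addresses the limited mathematical ability of Transformer-based large language models (LLMs), particularly on arithmetic tasks. Its key contribution is identifying numerical precision as a major factor in how well LLMs handle mathematical tasks. Through theoretical analysis and experiments, the paper shows that low-precision Transformers need a model size that grows super-polynomially with input length to solve tasks such as iterated addition and integer multiplication, whereas standard-precision Transformers solve them efficiently with much smaller models. The finding offers guidance for improving the mathematical reasoning of LLMs.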

Link: https://arxiv.org/abs/2410.13857
Authors: Guhao Feng, Kai Yang, Yuntian Gu, Xinyue Ai, Shengjie Luo, Jiacheng Sun, Di He, Zhenguo Li, Liwei Wang
Keywords: Transformer-based Large Language, Large Language Models, Transformer-based Large, Large Language, success of Transformer-based
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments:

Abstract:Despite the remarkable success of Transformer-based Large Language Models (LLMs) across various domains, understanding and enhancing their mathematical capabilities remains a significant challenge. In this paper, we conduct a rigorous theoretical analysis of LLMs’ mathematical abilities, with a specific focus on their arithmetic performances. We identify numerical precision as a key factor that influences their effectiveness in mathematical tasks. Our results show that Transformers operating with low numerical precision fail to address arithmetic tasks, such as iterated addition and integer multiplication, unless the model size grows super-polynomially with respect to the input length. In contrast, Transformers with standard numerical precision can efficiently handle these tasks with significantly smaller model sizes. We further support our theoretical findings through empirical experiments that explore the impact of varying numerical precision on arithmetic tasks, providing valuable insights for improving the mathematical reasoning capabilities of LLMs.
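To make the role of numerical precision concrete, here is a minimal sketch of iterated addition with a half-precision accumulator. It only illustrates how rounding can stall a running sum; it is not the paper's theoretical construction, and the helper names `to_fp16` and `iterated_add` are hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def iterated_add(values, low_precision: bool) -> float:
    """Sum values left to right; optionally round the accumulator to fp16 after every step."""
    acc = 0.0
    for v in values:
        acc = acc + v
        if low_precision:
            acc = to_fp16(acc)
    return acc


ones = [1.0] * 3000
low = iterated_add(ones, low_precision=True)    # accumulator kept in half precision
full = iterated_add(ones, low_precision=False)  # accumulator kept in double precision
print(low, full)
```

In half precision the spacing between representable values reaches 2 at 2048, so adding 1 no longer changes the accumulator, while the double-precision sum reaches 3000.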

[NLP-1] Can MLLMs Understand the Deep Implication Behind Chinese Images?

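[Quick read]: This paper addresses the lack of evaluation of the higher-order ability of multimodal large language models (MLLMs) to perceive and understand Chinese visual content. Its key contribution is CII-Bench, a benchmark built from real images sourced from the Chinese internet with manually crafted answers to keep the Chinese context authentic; it also includes images that reflect Chinese traditional culture to probe how deeply models understand it. Extensive experiments on multiple MLLMs expose their limitations on high-level semantics and Chinese traditional culture images, and show that adding image emotion hints to the prompt improves accuracy for most models.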

Link: https://arxiv.org/abs/2410.13854
Authors: Chenhao Zhang, Xi Feng, Yuelin Bai, Xinrun Du, Jinchang Hou, Kaixin Deng, Guangzeng Han, Qinrui Li, Bingli Wang, Jiaheng Liu, Xingwei Qu, Yifei Zhang, Qixuan Zhao, Yiming Liang, Ziqiang Liu, Feiteng Fang, Min Yang, Wenhao Huang, Chenghua Lin, Ge Zhang, Shiwen Ni
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Chinese traditional culture
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments: 32 pages, 18 figures. Project Page: this https URL Code: this https URL Dataset: this https URL

Abstract:As the capabilities of Multimodal Large Language Models (MLLMs) continue to improve, the need for higher-order capability evaluation of MLLMs is increasing. However, there is a lack of work evaluating MLLMs for higher-order perception and understanding of Chinese visual content. To fill the gap, we introduce the Chinese Image Implication understanding Benchmark, CII-Bench, which aims to assess the higher-order perception and understanding capabilities of MLLMs for Chinese images. CII-Bench stands out in several ways compared to existing benchmarks. Firstly, to ensure the authenticity of the Chinese context, images in CII-Bench are sourced from the Chinese Internet and manually reviewed, with corresponding answers also manually crafted. Additionally, CII-Bench incorporates images that represent Chinese traditional culture, such as famous Chinese traditional paintings, which can deeply reflect the model’s understanding of Chinese traditional culture. Through extensive experiments on CII-Bench across multiple MLLMs, we have made significant findings. First, a substantial gap is observed between the performance of MLLMs and humans on CII-Bench. The highest accuracy of MLLMs reaches 64.4%, whereas human accuracy averages 78.2%, peaking at an impressive 81.0%. Second, MLLMs perform worse on Chinese traditional culture images, suggesting limitations in their ability to understand high-level semantics and the lack of a deep knowledge base of Chinese traditional culture. Finally, it is observed that most models exhibit enhanced accuracy when image emotion hints are incorporated into the prompts. We believe that CII-Bench will enable MLLMs to gain a better understanding of Chinese semantics and Chinese-specific images, advancing the journey towards expert artificial general intelligence (AGI). Our project is publicly available at this https URL.

[NLP-2] Retrospective Learning from Interactions

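[Quick read]: This paper asks how large language models (LLMs) can learn continually from the implicit feedback signals in multi-turn interactions with users. Its key contribution is ReSpect, a method that learns from these signals by retrospecting on past interactions. ReSpect identifies the cues users give when a response misses expectations: rephrasing the request, expressing frustration, or pivoting to another task. Because these signals are task-independent and occupy a relatively constrained subspace of language, the model can learn from them without additional annotation and gradually raise its task completion rate.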

Link: https://arxiv.org/abs/2410.13852
Authors: Zizhao Chen, Mustafa Omer Gul, Yiwei Chen, Gloria Geng, Anne Wu, Yoav Artzi
Keywords: naturally include implicit, include implicit feedback, users naturally include, large language models, implicit feedback signals
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Multi-turn interactions between large language models (LLMs) and users naturally include implicit feedback signals. If an LLM responds in an unexpected way to an instruction, the user is likely to signal it by rephrasing the request, expressing frustration, or pivoting to an alternative task. Such signals are task-independent and occupy a relatively constrained subspace of language, allowing the LLM to identify them even if it fails on the actual task. This creates an avenue for continually learning from interactions without additional annotations. We introduce ReSpect, a method to learn from such signals in past interactions via retrospection. We deploy ReSpect in a new multimodal interaction scenario, where humans instruct an LLM to solve an abstract reasoning task with a combinatorial solution space. Through thousands of interactions with humans, we show how ReSpect gradually improves task completion rate from 31% to 82%, all without any external annotation.

[NLP-3] Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

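[Quick read]: This paper addresses the poor performance of a single visual encoder on multimodal understanding and generation, which require different levels of information granularity. The key idea is to decouple visual encoding into separate pathways while still processing everything with a single unified Transformer architecture. The decoupling alleviates the conflict between the visual encoder's roles in understanding and generation and makes the framework more flexible: the understanding and generation components can each choose the encoding method that suits them best, which improves overall performance.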

Link: https://arxiv.org/abs/2410.13848
Authors: Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, Ping Luo
Keywords: unifies multimodal understanding, multimodal understanding, understanding and generation, understanding, introduce Janus
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Technical Report

Abstract:In this paper, we introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. Prior research, such as Chameleon, often relies on a single visual encoder for both tasks. However, due to the differing levels of information granularity required by multimodal understanding and generation, this approach can lead to suboptimal performance, particularly in multimodal understanding. To address this issue, we decouple visual encoding into separate pathways, while still leveraging a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder’s roles in understanding and generation, but also enhances the framework’s flexibility. For instance, both the multimodal understanding and generation components can independently select their most suitable encoding methods. Experiments show that Janus surpasses previous unified models and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus make it a strong candidate for next-generation unified multimodal models.

[NLP-4] SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction

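[Quick read]: This paper addresses the sharp growth in key-value (KV) cache memory that large language models (LLMs) incur on long contexts as the number of layers and the input length increase. Its key contribution is SimLayerKV, which identifies "lazy" layers that contribute less to long-context modeling and reduces their KV cache. The method rests on the observation that certain layers behave lazily with respect to long-range dependencies and that this behavior is consistent across tokens during generation for a given input. SimLayerKV finds these layers by analyzing attention weight patterns and trims their KV cache accordingly, reaching a KV cache compression ratio of up to 5× without significantly hurting model performance.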

Link: https://arxiv.org/abs/2410.13846
Authors: Xuan Zhang, Cunxiao Du, Chao Du, Tianyu Pang, Wei Gao, Min Lin
Keywords: handle long contexts, Recent advancements, large language models, long contexts, advancements in large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Recent advancements in large language models (LLMs) have extended their capabilities to handle long contexts. However, increasing the number of model layers and the length of input sequences significantly escalates the memory required to store key-value (KV) cache, posing challenges for efficient inference. To mitigate this issue, we present SimLayerKV, a simple yet effective method that reduces inter-layer KV cache redundancies by selectively dropping cache in identified lazy layers. Our approach is based on the observation that certain layers in long-context LLMs exhibit “lazy” behavior, contributing less to modeling long-range dependencies compared to non-lazy layers. By analyzing attention weight patterns, we find that the behavior of these lazy layers is consistent across tokens during generation for a given input. This insight motivates our SimLayerKV, which identifies lazy layers and reduces their KV cache accordingly. SimLayerKV is training-free, generalizable, and can be implemented with only seven lines of code. We conduct extensive experiments on three representative LLMs, e.g., LLaMA2-7B, LLaMA3-8B, and Mistral-7B across 16 tasks from the LongBench benchmark. The results demonstrate that SimLayerKV achieves a KV cache compression ratio of 5× with only a 1.2% performance drop when combined with 4-bit quantization. Our code is available at this https URL.
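The idea of trimming the cache in lazy layers can be sketched roughly as follows. The abstract only says that lazy layers are found by analyzing attention weight patterns, so the concrete criterion used here (most attention mass falling on a few initial and recent positions), the threshold, and the names `is_lazy_layer` and `trim_kv_cache` are all assumptions made for illustration, not the authors' implementation.

```python
def is_lazy_layer(attn_rows, n_initial=4, n_recent=8, threshold=0.9):
    """Assumed criterion: a layer is "lazy" when, averaged over the given query
    tokens, at least `threshold` of the attention mass falls on the first
    `n_initial` and the last `n_recent` cached positions.

    attn_rows: list of attention distributions, each a list that sums to 1.
    """
    masses = []
    for row in attn_rows:
        recent_start = max(n_initial, len(row) - n_recent)
        masses.append(sum(row[:n_initial]) + sum(row[recent_start:]))
    return sum(masses) / len(masses) >= threshold


def trim_kv_cache(kv, lazy, n_initial=4, n_recent=8):
    """Keep only initial and recent entries for a lazy layer; keep everything otherwise."""
    if not lazy or len(kv) <= n_initial + n_recent:
        return list(kv)
    return list(kv[:n_initial]) + list(kv[-n_recent:])


# Toy attention rows over 20 cached positions.
focused = [0.5] + [0.0] * 18 + [0.5]  # mass on the first and last position only
uniform = [0.05] * 20                 # mass spread over the whole context

lazy_flag = is_lazy_layer([focused, focused])
busy_flag = is_lazy_layer([uniform, uniform])
trimmed = trim_kv_cache(list(range(20)), lazy_flag)
print(lazy_flag, busy_flag, trimmed)
```

A non-lazy layer keeps its full cache, so the memory saving depends on how many layers pass the check for a given input.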

[NLP-5] A Unified View of Delta Parameter Editing in Post-Trained Large-Scale Models

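[Quick read]: This paper asks how to systematically understand and optimize delta parameters, the differences between post-trained and pre-trained parameters of large-scale models, and in particular how they behave under editing operations. Its key contribution is an analysis framework based on a Riemann sum approximation of the loss function, which sorts existing delta parameter editing methods into three classes (competitive, decreased, and improved performance) and explains, through theory and experiments, how each affects model performance. The paper also extends existing techniques such as DARE and BitDelta, reorganizing them into general expressions that make better use of delta parameter properties and broaden the applicability and effectiveness of delta parameter editing in post-trained models.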

Link: https://arxiv.org/abs/2410.13841
Authors: Qiaoyu Tang, Le Yu, Bowen Yu, Hongyu Lin, Keming Lu, Yaojie Lu, Xianpei Han, Le Sun
Keywords: adapting large-scale pre-trained, Post-training has emerged, large-scale pre-trained models, delta parameter editing, delta parameter
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Post-training has emerged as a crucial paradigm for adapting large-scale pre-trained models to various tasks, whose effects are fully reflected by delta parameters (i.e., the disparity between post-trained and pre-trained parameters). While numerous studies have explored delta parameter properties via operations like pruning, quantization, low-rank approximation, and extrapolation, a unified framework for systematically examining these characteristics has been lacking. In this paper, we propose a novel perspective based on Riemann sum approximation of the loss function to elucidate delta parameter editing operations. Our analysis categorizes existing methods into three classes based on their post-editing performance: competitive, decreased, and improved, explaining how they are expressed by the Riemann sum approximation term and how they alter the model performance. Extensive experiments on both visual and language models, including ViT, LLaMA 3, Qwen 2, and Mistral, corroborate our theoretical findings. Furthermore, we introduce extensions to existing techniques like DARE and BitDelta, highlighting their limitations in leveraging the properties of delta parameters and reorganizing them into general expressions to enhance the applicability and effectiveness of delta parameter editing in post-trained models.
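For readers unfamiliar with delta parameter editing, the two techniques named in the abstract can be sketched on plain Python lists. This is a simplified illustration of the basic DARE (random drop and rescale) and BitDelta (sign plus one shared scale) operations as commonly described, not the extended formulations proposed in this paper; the weights are made-up toy values.

```python
import random


def dare(delta, drop_rate, rng):
    """Drop each delta entry with probability `drop_rate`, rescale survivors by 1 / (1 - drop_rate)."""
    keep_scale = 1.0 / (1.0 - drop_rate)
    return [0.0 if rng.random() < drop_rate else d * keep_scale for d in delta]


def bitdelta(delta):
    """Keep one sign bit per entry plus a single shared scale (the mean absolute value)."""
    scale = sum(abs(d) for d in delta) / len(delta)
    return [scale if d >= 0 else -scale for d in delta]


# Toy weights (hypothetical values): delta = post-trained minus pre-trained.
pretrained = [0.10, -0.20, 0.30, 0.40]
post_trained = [0.15, -0.26, 0.31, 0.36]
delta = [p - q for p, q in zip(post_trained, pretrained)]

rng = random.Random(0)
edited = dare(delta, drop_rate=0.5, rng=rng)
merged = [w + d for w, d in zip(pretrained, edited)]  # edited model = base + edited delta
print(edited, bitdelta(delta), merged)
```

Rescaling by `1 / (1 - drop_rate)` keeps the expected value of each delta entry unchanged, which is the usual motivation for the drop-and-rescale step.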

[NLP-6] A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement

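[Quick read]: This paper examines a latent problem in the margin-based losses used by reinforcement learning from human feedback (RLHF) for language model alignment: they under-specify the ideal behavior on each response individually, which creates safety and performance risks. As the margin grows, the probability of dispreferred (for example, unsafe) responses may rise while the probability of preferred responses may fall. The key contribution is identifying and explaining the root cause, gradient entanglement: a margin-based loss couples the probability changes of preferred and dispreferred responses, so the two often move in the same direction. Through theoretical analysis and experiments, the paper shows that gradient entanglement is most pronounced when the inner product of the gradients of the preferred and dispreferred log-probabilities is large relative to the individual gradient norms. On this basis, it suggests algorithm designs that mitigate the under-specification of margin-based methods and improve alignment.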

Link: https://arxiv.org/abs/2410.13828
Authors: Hui Yuan, Yifan Zeng, Yue Wu, Huazheng Wang, Mengdi Wang, Liu Leqi
Keywords: Reinforcement Learning, Human Feedback, Learning from Human, predominant approach, preferred
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Reinforcement Learning from Human Feedback (RLHF) has become the predominant approach for language model (LM) alignment. At its core, RLHF uses a margin-based loss for preference optimization, specifying ideal LM behavior only by the difference between preferred and dispreferred responses. In this paper, we identify a common pitfall of margin-based methods – the under-specification of ideal LM behavior on preferred and dispreferred responses individually, which leads to two unintended consequences as the margin increases: (1) The probability of dispreferred (e.g., unsafe) responses may increase, resulting in potential safety alignment failures. (2) The probability of preferred responses may decrease, even when those responses are ideal. We demystify the reasons behind these problematic behaviors: margin-based losses couple the change in the preferred probability to the gradient of the dispreferred one, and vice versa, often preventing the preferred probability from increasing while the dispreferred one decreases, and thus causing a synchronized increase or decrease in both probabilities. We term this effect, inherent in margin-based objectives, gradient entanglement. Formally, we derive conditions for general margin-based alignment objectives under which gradient entanglement becomes concerning: the inner product of the gradients of preferred and dispreferred log-probabilities is large relative to the individual gradient norms. We theoretically investigate why such inner products can be large when aligning language models and empirically validate our findings. Empirical implications of our framework extend to explaining important differences in the training dynamics of various preference optimization algorithms, and suggesting potential algorithm designs to mitigate the under-specification issue of margin-based methods and thereby improving language model alignment.
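The coupling described in the abstract can be checked with a small first-order calculation. The sketch below assumes the simplest margin objective, one gradient-ascent step on m = log p_w - log p_l, so that the parameters move along grad_w - grad_l (any positive scalar from the loss is folded into `step`); the gradient vectors are made-up toy values and `first_order_changes` is a hypothetical helper name.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def first_order_changes(grad_w, grad_l, step=1.0):
    """First-order change of both log-probabilities after moving the parameters
    by step * (grad_w - grad_l):
        d log p_w = step * (|grad_w|^2 - <grad_w, grad_l>)
        d log p_l = step * (<grad_w, grad_l> - |grad_l|^2)
    """
    inner = dot(grad_w, grad_l)
    d_w = step * (dot(grad_w, grad_w) - inner)
    d_l = step * (inner - dot(grad_l, grad_l))
    return d_w, d_l


# Entangled case: the inner product (2) exceeds the preferred gradient's squared norm (1).
ent_w, ent_l = first_order_changes([1.0, 0.0], [2.0, 0.0])
# Orthogonal gradients: no entanglement, the two log-probabilities move apart.
ort_w, ort_l = first_order_changes([1.0, 0.0], [0.0, 1.0])
print((ent_w, ent_l), (ort_w, ort_l))
```

In the entangled case the margin still grows (the dispreferred log-probability falls faster), yet the preferred log-probability also falls, which matches the second unintended consequence listed in the abstract.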

[NLP-7] Agent Occam: A Simple Yet Strong Baseline for LLM-Based Web Agents

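[Quick read]: This paper addresses the misalignment between the observation and action representations of LLM-based web agents and the pre-training data of the underlying LLM. The key idea is to refine the agent's observation and action spaces so that they better match the LLM's capabilities. With this refinement alone, the proposed AgentOccam significantly improves results on the WebArena benchmark, surpassing the previous state of the art and concurrent work; it demonstrates the zero-shot ability of LLMs on web tasks and underlines how much careful tuning of observation and action spaces matters for LLM-based agents.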

Link: https://arxiv.org/abs/2410.13825
Authors: Ke Yang, Yao Liu, Sapana Chaudhary, Rasool Fakoor, Pratik Chaudhari, George Karypis, Huzefa Rangwala
Keywords: large language models, boosts human efficiency, web, human efficiency, agent
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Autonomy via agents using large language models (LLMs) for personalized, standardized tasks boosts human efficiency. Automating web tasks (like booking hotels within a budget) is increasingly sought after. Fulfilling practical needs, the web agent also serves as an important proof-of-concept example for various agent grounding scenarios, with its success promising advancements in many future applications. Prior research often handcrafts web agent strategies (e.g., prompting templates, multi-agent systems, search methods, etc.) and the corresponding in-context examples, which may not generalize well across all real-world scenarios. On the other hand, there has been limited study on the misalignment between a web agent’s observation/action representation and the pre-training data of the LLM it’s based on. This discrepancy is especially notable when LLMs are primarily trained for language completion rather than tasks involving embodied navigation actions and symbolic web elements. Our study enhances an LLM-based web agent by simply refining its observation and action space to better align with the LLM’s capabilities. This approach enables our base agent to significantly outperform previous methods on a wide variety of web tasks. Specifically, on WebArena, a benchmark featuring general-purpose web interaction tasks, our agent AgentOccam surpasses the previous state-of-the-art and concurrent work by 9.8 (+29.4%) and 5.9 (+15.8%) absolute points respectively, and boosts the success rate by 26.6 points (+161%) over similar plain web agents with its observation and action space alignment. We achieve this without using in-context examples, new agent roles, online feedback or search strategies. AgentOccam’s simple design highlights LLMs’ impressive zero-shot performance on web tasks, and underlines the critical role of carefully tuning observation and action spaces for LLM-based agents.

[NLP-8] Harnessing Webpage UIs for Text-Rich Visual Understanding

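[Quick read]: This paper addresses how well multimodal large language models (MLLMs) understand visual environments with dense textual content. The key idea is to use text-based large language models (LLMs) to synthesize multimodal instructions from webpage user interfaces (UIs), working from the structured text representations in webpage accessibility trees. The instructions are paired with UI screenshots to train multimodal models. The paper introduces the MultiUI dataset of 7.3 million samples covering diverse multimodal tasks and UI layouts; models trained on it perform markedly better on web UI tasks and generalize to non-web UI tasks and non-UI domains.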

Link: https://arxiv.org/abs/2410.13824
Authors: Junpeng Liu, Tianyue Ou, Yifan Song, Yuxiao Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, Xiang Yue
Keywords: dense textual content, large language models, visual understanding-the ability, text-based large language, multimodal large language
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Text-rich visual understanding, the ability to process environments where dense textual content is integrated with visuals, is crucial for multimodal large language models (MLLMs) to interact effectively with structured environments. To enhance this capability, we propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs). Despite lacking direct visual input, text-based LLMs are able to process structured text representations from webpage accessibility trees. These instructions are then paired with UI screenshots to train multimodal models. We introduce MultiUI, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts. Models trained on MultiUI not only excel in web UI tasks (achieving up to a 48% improvement on VisualWebBench and a 19.1% boost in action accuracy on a web agent dataset Mind2Web) but also generalize surprisingly well to non-web UI tasks and even to non-UI domains, such as document understanding, OCR, and chart interpretation. These results highlight the broad applicability of web UI data for advancing text-rich visual understanding across various scenarios.

[NLP-9] De-mark: Watermark Removal in Large Language Models

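[Quick read]: This paper studies the robustness of n-gram watermarks in machine-generated content and proposes De-mark, an advanced framework that removes n-gram watermarks effectively using a novel random selection probing strategy. The key idea is to use this probing strategy to assess watermark strength and identify the red-green list within the n-gram watermark; experiments on popular language models such as Llama3 and ChatGPT demonstrate its efficiency and effectiveness.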

Link: https://arxiv.org/abs/2410.13808
Authors: Ruibo Chen, Yihan Wu, Junfeng Guo, Heng Huang
Keywords: identify machine-generated content, embedding covert information, Watermarking techniques offer, language models, machine-generated content
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Watermarking techniques offer a promising way to identify machine-generated content via embedding covert information into the contents generated from language models (LMs). However, the robustness of the watermarking schemes has not been well explored. In this paper, we present De-mark, an advanced framework designed to remove n-gram-based watermarks effectively. Our method utilizes a novel querying strategy, termed random selection probing, which aids in assessing the strength of the watermark and identifying the red-green list within the n-gram watermark. Experiments on popular LMs, such as Llama3 and ChatGPT, demonstrate the efficiency and effectiveness of De-mark in watermark removal and exploitation tasks.

[NLP-10] A Watermark for Order-Agnostic Language Models

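[Quick read]: This paper addresses watermarking for order-agnostic language models (LMs), to which conventional statistical watermarking cannot be applied directly because their tokens are not generated sequentially. Its key contribution is Pattern-mark, a pattern-based watermarking framework designed specifically for order-agnostic LMs. The core innovations are a Markov-chain-based watermark generator that produces watermark key sequences with high-frequency key patterns, and a statistical pattern-based detection algorithm that recovers the key sequence during detection and runs statistical tests on the count of high-frequency patterns, improving detection efficiency, generation quality, and robustness.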

Link: https://arxiv.org/abs/2410.13805
Authors: Ruibo Chen, Yihan Wu, Yanshuo Chen, Chenxi Liu, Junfeng Guo, Heng Huang
Keywords: decoded language models, sequentially decoded language, order-agnostic LMs, language models, decoded language
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Statistical watermarking techniques are well-established for sequentially decoded language models (LMs). However, these techniques cannot be directly applied to order-agnostic LMs, as the tokens in order-agnostic LMs are not generated sequentially. In this work, we introduce Pattern-mark, a pattern-based watermarking framework specifically designed for order-agnostic LMs. We develop a Markov-chain-based watermark generator that produces watermark key sequences with high-frequency key patterns. Correspondingly, we propose a statistical pattern-based detection algorithm that recovers the key sequence during detection and conducts statistical tests based on the count of high-frequency patterns. Our extensive evaluations on order-agnostic LMs, such as ProteinMPNN and CMLM, demonstrate Pattern-mark’s enhanced detection efficiency, generation quality, and robustness, positioning it as a superior watermarking technique for order-agnostic LMs.
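As a rough intuition for "key sequences with high-frequency patterns", consider the toy sketch below. It is not the Pattern-mark algorithm: the two-state chain, the choice of alternating pairs as the frequent pattern, the chance-level baseline of 0.5, and every function name are assumptions, and the step that ties keys to generated tokens is left out entirely.

```python
import random


def markov_key_sequence(length, p_switch, rng):
    """Sample a binary key sequence from a two-state Markov chain that
    switches state with probability `p_switch` at every step."""
    keys = [rng.randrange(2)]
    for _ in range(length - 1):
        prev = keys[-1]
        keys.append(1 - prev if rng.random() < p_switch else prev)
    return keys


def alternation_rate(keys):
    """Fraction of adjacent key pairs that form the patterns (0, 1) or (1, 0)."""
    pairs = list(zip(keys, keys[1:]))
    return sum(a != b for a, b in pairs) / len(pairs)


def z_score(keys, baseline=0.5):
    """z statistic of the alternation rate against the assumed chance level."""
    n = len(keys) - 1
    rate = alternation_rate(keys)
    return (rate - baseline) / ((baseline * (1 - baseline) / n) ** 0.5)


rng = random.Random(0)
keys = markov_key_sequence(2000, p_switch=0.9, rng=rng)
print(alternation_rate(keys), z_score(keys))
```

With a switching probability of 0.9 the alternating patterns dominate, so a count-based test separates such a sequence from an unbiased one by a wide margin.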

[NLP-11] BenTo: Benchmark Task Reduction with In-Context Transferability

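[Quick read]: This paper addresses the high cost of evaluating large language models (LLMs): how to efficiently reduce the number of benchmark tasks without affecting evaluation quality. The key idea is to exploit task transferability and relevance, identifying the most representative subset of tasks by optimizing a facility location function. The paper proposes a practical, efficient metric based on in-context learning (ICL) for estimating transferability between tasks, which reduces the tasks to 5% of the original benchmark while introducing only a 4% difference in evaluation. Unlike prior methods, the approach is training-free and gradient-free and relies on ICL alone.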

Link: https://arxiv.org/abs/2410.13804
Authors: Hongyu Zhao, Ming Li, Lichao Sun, Tianyi Zhou
Keywords: Evaluating large language, large language models, Evaluating large, language models, large language
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Evaluating large language models (LLMs) is costly: it requires the generation and examination of LLM outputs on a large-scale benchmark of various tasks. This paper investigates how to efficiently reduce the tasks used to benchmark LLMs without affecting the evaluation quality. Our study reveals that task transferability and relevance provide critical information to identify the most representative subset of tasks via optimizing a facility location function. We propose a practically efficient metric for estimating the transferability between two tasks via in-context learning (ICL). By analyzing the pairwise transferability, we can reduce tasks in a modern LLM benchmark (e.g., MMLU or FLAN) to 5% while inducing only a 4% difference to the evaluation on the original benchmark. Compared to prior works, our method is training-free, gradient-free, and highly efficient requiring ICL only.
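Selecting a representative subset by optimizing a facility location function can be sketched with the standard greedy heuristic. The transferability matrix below is made up for illustration, and greedy selection is a common choice for this kind of objective rather than a confirmed detail of the paper; `sim[i][j]` is assumed to score how well task `j` transfers to task `i`.

```python
def facility_location(selected, sim):
    """f(S) = sum over all tasks i of max over j in S of sim[i][j]; 0 for an empty S."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in sim)


def greedy_select(sim, k):
    """Greedily add the task that yields the largest facility location value."""
    selected = []
    for _ in range(k):
        candidates = [j for j in range(len(sim)) if j not in selected]
        best = max(candidates, key=lambda j: facility_location(selected + [j], sim))
        selected.append(best)
    return selected


# Hypothetical transferability scores between four tasks.
sim = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.7, 1.0],
]
chosen = greedy_select(sim, 2)
print(chosen, facility_location(chosen, sim))
```

Here tasks 0 and 1 cover each other well, as do tasks 2 and 3, so the greedy pass picks one task from each cluster.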

[NLP-12] Modeling Future Conversation Turns to Teach LLMs to Ask Clarifying Questions

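[Quick read]: This paper addresses the tendency of large language models (LLMs) to presuppose a single interpretation of highly ambiguous user requests and so misread the user's intent. The key idea is to change how preference labels are assigned: LLM responses are evaluated by simulating their expected outcomes in future conversation turns, so that the model learns to ask clarifying questions when needed and can adapt to each user's interpretation and expected answer. In experiments, LLMs trained with the new method ask clarifying questions with a 5% improvement in F1 over standard methods.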

Link: https://arxiv.org/abs/2410.13788
Authors: Michael J.Q. Zhang, W. Bradley Knox, Eunsol Choi
Keywords: Large language models, Large language, language models, highly ambiguous user, Large
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) must often respond to highly ambiguous user requests. In such cases, the LLM’s best response may be to ask a clarifying question to elicit more information. We observe existing LLMs often respond by presupposing a single interpretation of such ambiguous requests, frustrating users who intended a different interpretation. We speculate this is caused by current preference data labeling practice, where LLM responses are evaluated only on their prior contexts. To address this, we propose to assign preference labels by simulating their expected outcomes in the future turns. This allows LLMs to learn to ask clarifying questions when doing so lets them generate responses tailored to each user interpretation in future turns. In experiments on open-domain QA, we compare systems trained using our proposed preference labeling methods against standard methods, which assign preferences based only on prior context. We evaluate systems on their ability to ask clarifying questions that can recover each user’s interpretation and expected answer, and find that training with our proposed method teaches LLMs to ask clarifying questions with a 5% improvement in F1, measured against the answer set from different interpretations of each query.
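To show what an F1 score against a set of answers means, here is a minimal set-level version. The abstract does not specify how F1 is computed, so exact-match set overlap is an assumption, and the example answers are invented.

```python
def set_f1(predicted, gold):
    """F1 between a set of predicted answers and a set of gold answers (exact match)."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical ambiguous query with four valid interpretations, two of which are recovered.
gold_answers = {"Paris", "Nice", "Lyon", "Lille"}
system_answers = {"Paris", "Lyon"}
score = set_f1(system_answers, gold_answers)
print(score)
```

Every recovered answer is correct (precision 1.0) but only half of the interpretations are covered (recall 0.5), so a system that asks a clarifying question and then answers each interpretation would score higher.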

[NLP-13] Looking Inward: Language Models Can Learn About Themselves by Introspection

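[Quick read]: This paper asks whether large language models (LLMs) can introspect, that is, acquire knowledge that is not contained in their training data but originates from internal states. The approach is to finetune LLMs to predict properties of their own behavior in hypothetical scenarios, such as whether they would favor the short- or long-term option for a given input. Experiments show that a model M1 finetuned to predict itself outperforms a different model M2 at predicting M1's behavior, even when M2 is trained on M1's ground-truth behavior. This suggests that M1 has privileged access to its own behavioral tendencies, which supports the introspection hypothesis. The approach still fails on more complex tasks and on tasks that require out-of-distribution generalization.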

Link: https://arxiv.org/abs/2410.13787
Authors: Felix J Binder, James Chua, Tomek Korbak, Henry Sleight, John Hughes, Robert Long, Ethan Perez, Miles Turpin, Owain Evans
Keywords: Humans acquire knowledge, Humans acquire, Humans, model, introspection
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 9 figures

Abstract:Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. Can LLMs introspect? We define introspection as acquiring knowledge that is not contained in or derived from training data but instead originates from internal states. Such a capability could enhance model interpretability. Instead of painstakingly analyzing a model’s internal workings, we could simply ask the model about its beliefs, world models, and goals. More speculatively, an introspective model might self-report on whether it possesses certain internal states such as subjective feelings or desires and this could inform us about the moral status of these states. Such self-reports would not be entirely dictated by the model’s training data. We study introspection by finetuning LLMs to predict properties of their own behavior in hypothetical scenarios. For example, “Given the input P, would your output favor the short- or long-term option?” If a model M1 can introspect, it should outperform a different model M2 in predicting M1’s behavior even if M2 is trained on M1’s ground-truth behavior. The idea is that M1 has privileged access to its own behavioral tendencies, and this enables it to predict itself better than M2 (even if M2 is generally stronger). In experiments with GPT-4, GPT-4o, and Llama-3 models (each finetuned to predict itself), we find that the model M1 outperforms M2 in predicting itself, providing evidence for introspection. Notably, M1 continues to predict its behavior accurately even after we intentionally modify its ground-truth behavior. However, while we successfully elicit introspection on simple tasks, we are unsuccessful on more complex tasks or those requiring out-of-distribution generalization. 
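
The comparison in the abstract (M1 predicting itself versus M2 predicting M1) comes down to comparing two accuracies against M1's observed behavior. The sketch below is a minimal illustration on made-up labels. The function names and the toy data are assumptions for illustration, not the authors' evaluation code.

```python
def accuracy(predictions, ground_truth):
    """Fraction of positions where the prediction matches the observed behavior."""
    if len(predictions) != len(ground_truth):
        raise ValueError("length mismatch")
    hits = sum(p == g for p, g in zip(predictions, ground_truth))
    return hits / len(ground_truth)


def introspection_gap(self_preds, cross_preds, ground_truth):
    """Self-prediction accuracy of M1 minus cross-prediction accuracy of M2."""
    return accuracy(self_preds, ground_truth) - accuracy(cross_preds, ground_truth)


# Hypothetical toy labels: does M1's output favor the short- or long-term option?
m1_behavior = ["short", "long", "long", "short", "long"]
m1_self_pred = ["short", "long", "long", "short", "short"]
m2_cross_pred = ["short", "short", "long", "long", "short"]

gap = introspection_gap(m1_self_pred, m2_cross_pred, m1_behavior)
print(f"self-prediction advantage: {gap:.2f}")
```

A positive gap is the pattern the abstract describes as evidence for introspection. A real study would also need the controls the abstract mentions, such as training M2 on M1's ground-truth behavior.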

[NLP-14] PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment

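Quick read: This paper addresses two problems that arise when large language models (LLMs) are aligned to human preferences during training: (1) alignment is not comprehensive, and (2) models are susceptible to jailbreaking attacks. The key to the solution is PopAlign, a framework that introduces diversified contrasting patterns at the prompt, model, and pipeline levels. It proposes six contrasting strategies that need no additional feedback labeling, which broadens the coverage and diversity of the preference data and in turn improves alignment.

Link: https://arxiv.org/abs/2410.13785
Authors: Zekun Moore Wang,Shawn Wang,Kang Zhu,Jiaheng Liu,Ke Xu,Jie Fu,Wangchunshu Zhou,Wenhao Huang
Keywords: preference-contrastive output pairs, involves training models, large language models, involves training, large language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 28 pages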

Abstract:Alignment of large language models (LLMs) involves training models on preference-contrastive output pairs to adjust their responses according to human preferences. To obtain such contrastive pairs, traditional methods like RLHF and RLAIF rely on limited contrasting patterns, such as varying model variants or decoding temperatures. This singularity leads to two issues: (1) alignment is not comprehensive; and thereby (2) models are susceptible to jailbreaking attacks. To address these issues, we investigate how to construct more comprehensive and diversified contrasting patterns to enhance preference data (RQ1) and verify the impact of the diversification of contrasting patterns on model alignment (RQ2). For RQ1, we propose PopAlign, a framework that integrates diversified contrasting patterns across the prompt, model, and pipeline levels, introducing six contrasting strategies that do not require additional feedback labeling procedures. Regarding RQ2, we conduct thorough experiments demonstrating that PopAlign significantly outperforms existing methods, leading to more comprehensive alignment.

[NLP-15] Quantity vs. Quality of Monolingual Source Data in Automatic Text Translation: Can It Be Too Little If It Is Too Good?

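Quick read: This paper addresses how to use monolingual data effectively to improve machine translation models in low-resource settings. The key is to use monolingual data selectively, choosing data that is high quality or close to the domain of the test data rather than using everything available. The study reports that, on low-resource English-German neural machine translation, selecting only the most useful additional data is often better than using all of it.

Link: https://arxiv.org/abs/2410.13783
Authors: Idris Abdulmumin,Bashir Shehu Galadanci,Garba Aliyu,Shamsuddeen Hassan Muhammad
Keywords: large quantities, upscale the scarcely, data, Monolingual data, scarcely available parallel
Categories: Computation and Language (cs.CL)
Comments: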

Abstract:Monolingual data, being readily available in large quantities, has been used to upscale the scarcely available parallel data to train better models for automatic translation. Self-learning, where a model is made to learn from its output, is one approach to exploit such data. However, it has been shown that too much of this data can be detrimental to the performance of the model if the available parallel data is comparatively extremely low. In this study, we investigate whether the monolingual data can also be too little and if this reduction, based on quality, has any effect on the performance of the translation model. Experiments have shown that on English-German low-resource NMT, it is often better to select only the most useful additional data, based on quality or closeness to the domain of the test data, than utilizing all of the available data.

[NLP-16] Optimal Quantization for Matrix Multiplication

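Quick read: This paper addresses how lossy compression (quantization) of large matrices can speed up matrix multiplication, which is often bottlenecked by loading the matrices from memory. The key to the solution is a universal quantizer based on nested lattices, with an explicit approximation-error guarantee for any (non-random) pair of matrices A and B that depends only on the Frobenius norms of A, B, and A^\top B. The quantizer encodes A and B independently, producing descriptions with R bits per entry, and a decoder uses those descriptions to estimate the product A^\top B. The paper also shows that for iid Gaussian matrices the quantizer achieves the lower bound on mean squared error and is thus asymptotically optimal, and that a practical low-complexity version comes close to optimal. It further derives the rate-distortion function for matrix multiplication of iid Gaussian matrices.

Link: https://arxiv.org/abs/2410.13780
Authors: Or Ordentlich,Yury Polyanskiy
Keywords: machine learning community, learning community proposed, community proposed multiple, proposed multiple methods, performing lossy compression
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: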

Abstract:Recent work in machine learning community proposed multiple methods for performing lossy compression (quantization) of large matrices. This quantization is important for accelerating matrix multiplication (main component of large language models), which is often bottlenecked by the speed of loading these matrices from memory. Unlike classical vector quantization and rate-distortion theory, the goal of these new compression algorithms is to be able to approximate not the matrices themselves, but their matrix product. Specifically, given a pair of real matrices A,B an encoder (compressor) is applied to each of them independently producing descriptions with R bits per entry. These representations subsequently are used by the decoder to estimate matrix product A^\top B . In this work, we provide a non-asymptotic lower bound on the mean squared error of this approximation (as a function of rate R ) for the case of matrices A,B with iid Gaussian entries. Algorithmically, we construct a universal quantizer based on nested lattices with an explicit guarantee of approximation error for any (non-random) pair of matrices A , B in terms of only Frobenius norms |A|_F, |B|_F and |A^\top B|_F . For iid Gaussian matrices our quantizer achieves the lower bound and is, thus, asymptotically optimal. A practical low-complexity version of our quantizer achieves performance quite close to optimal. In information-theoretic terms we derive rate-distortion function for matrix multiplication of iid Gaussian matrices.
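
The setup in the abstract (compress A and B independently at R bits per entry, then estimate A^\top B from the compressed forms) can be tried out with a much simpler quantizer than the one the paper builds. The sketch below swaps in a plain uniform scalar quantizer, not the nested-lattice quantizer, purely to show how the product error shrinks as the rate grows. The function names and matrix sizes are assumptions for illustration.

```python
import numpy as np


def quantize_uniform(x, bits):
    """Round each entry to one of 2**bits evenly spaced levels in [-max|x|, max|x|]."""
    levels = 2 ** bits
    scale = np.max(np.abs(x))
    if scale == 0:
        return x.copy()
    step = 2 * scale / (levels - 1)
    codes = np.round((x + scale) / step)  # integer codes in [0, levels - 1]
    return codes * step - scale           # dequantized values


def product_mse(a, b, bits):
    """Mean squared error of estimating A^T B from independently quantized A and B."""
    estimate = quantize_uniform(a, bits).T @ quantize_uniform(b, bits)
    return float(np.mean((estimate - a.T @ b) ** 2))


# iid Gaussian test matrices, matching the case analyzed in the abstract
rng = np.random.default_rng(0)
a = rng.standard_normal((64, 8))
b = rng.standard_normal((64, 8))

errors = {bits: product_mse(a, b, bits) for bits in (2, 4, 8)}
for bits, err in errors.items():
    print(f"R = {bits} bits/entry -> MSE of A^T B estimate: {err:.6f}")
```

Each matrix is quantized without looking at the other, which is the constraint that distinguishes this problem from quantizing the product directly.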

[NLP-17] The Mystery of the Pathological Path-star Task for Language Models EMNLP2024

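Quick read: This paper addresses the poor performance of language models on the path-star task. The key to the solution is a regularization method that trains on structured samples of the same graph with differing target nodes, which improves results across a variety of model types. The paper also provides proofs that the task is theoretically solvable and finds settings in which an encoder-only model can solve it consistently.

Link: https://arxiv.org/abs/2410.13779
Authors: Arvid Frydenlund
Keywords: Bachmann and Nagarajan, recently introduced path-star, minimal task designed, introduced path-star task, recently introduced
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: EMNLP 2024 Main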

Abstract:The recently introduced path-star task is a minimal task designed to exemplify limitations to the abilities of language models (Bachmann and Nagarajan, 2024). It involves a path-star graph where multiple arms radiate from a single starting node and each node is unique. Given the start node and a specified target node that ends an arm, the task is to generate the arm containing that target node. This is straightforward for a human but surprisingly difficult for language models, which did not outperform the random baseline. The authors hypothesized this is due to a deficiency in teacher-forcing and the next-token prediction paradigm. We demonstrate the task is learnable using teacher-forcing in alternative settings and that the issue is partially due to representation. We introduce a regularization method using structured samples of the same graph but with differing target nodes, improving results across a variety of model types. We provide RASP proofs showing the task is theoretically solvable. Finally, we find settings where an encoder-only model can consistently solve the task.
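
To make the task definition concrete, the sketch below builds one path-star instance as described in the abstract (several arms radiating from one start node, every node unique) and recovers the target arm by walking backward from the target. The node id range, the shuffled edge-list encoding, and both function names are assumptions for illustration. The backward walk only shows why the task is easy for a person and is not a method from the paper.

```python
import random


def make_path_star(num_arms, arm_length, seed=0):
    """Build one instance: shuffled edge list, start node, target leaf, answer arm."""
    rng = random.Random(seed)
    # Unique node ids: one start node plus arm_length nodes per arm.
    nodes = rng.sample(range(1000), 1 + num_arms * arm_length)
    start, rest = nodes[0], nodes[1:]
    arms = [rest[i * arm_length:(i + 1) * arm_length] for i in range(num_arms)]
    edges = []
    for arm in arms:
        path = [start] + arm
        edges.extend(zip(path, path[1:]))
    rng.shuffle(edges)
    answer = [start] + rng.choice(arms)
    return edges, start, answer[-1], answer


def solve_path_star(edges, start, target):
    """Recover the arm by following parent pointers back from the target leaf."""
    parent = {child: par for par, child in edges}
    path = [target]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1]


edges, start, target, answer = make_path_star(num_arms=5, arm_length=4)
print("start:", start, "target:", target)
print("arm:", solve_path_star(edges, start, target))
```

Read left to right, the first step from the start node is the hard one: every arm is a valid continuation until the target is taken into account, whereas every later step is forced.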

[NLP-18] Aggregation Artifacts in Subjective Tasks Collapse Large Language Models Posteriors

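Quick read: The problem this paper addresses is that, in complex subjective domains such as emotion and morality, in-context learning (ICL) relies mainly on retrieving task priors rather than truly "learning" the task, which limits model performance in these domains. The paper's key step is to examine the label aggregation used in the corresponding datasets: combining low-agreement annotations can introduce annotation artifacts that act as noise in the prompt. The study advocates modeling individual annotators rather than relying on aggregated labels and, using quantitative measures of LLM priors, shows that aggregation is a confounding factor in modeling subjective tasks. It also finds that minority annotators can both align better with LLMs and have their perspectives further amplified.

Link: https://arxiv.org/abs/2410.13776
Authors: Georgios Chochlakis,Alexandros Potamianos,Kristina Lerman,Shrikanth Narayanan
Keywords: Large Language Models, performing natural language, natural language tasks, Large Language, In-context Learning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 7 figures, 2 tables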

Abstract:In-context Learning (ICL) has become the primary method for performing natural language tasks with Large Language Models (LLMs). The knowledge acquired during pre-training is crucial for this few-shot capability, providing the model with task priors. However, recent studies have shown that ICL predominantly relies on retrieving task priors rather than “learning” to perform tasks. This limitation is particularly evident in complex subjective domains such as emotion and morality, where priors significantly influence posterior predictions. In this work, we examine whether this is the result of the aggregation used in corresponding datasets, where trying to combine low-agreement, disparate annotations might lead to annotation artifacts that create detrimental noise in the prompt. Moreover, we evaluate the posterior bias towards certain annotators by grounding our study in appropriate, quantitative measures of LLM priors. Our results indicate that aggregation is a confounding factor in the modeling of subjective tasks, and advocate focusing on modeling individuals instead. However, aggregation does not explain the entire gap between ICL and the state of the art, meaning other factors in such tasks also account for the observed phenomena. Finally, by rigorously studying annotator-level labels, we find that it is possible for minority annotators to both better align with LLMs and have their perspectives further amplified.

[NLP-19] Knowledge-Aware Query Expansion with Large Language Models for Textual and Relational Retrieval

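Quick read: This paper addresses the problem that existing LLM-based query expansion focuses only on textual similarity and overlooks relations between documents. The key to the solution is a knowledge-aware query expansion framework that augments LLMs with structured document relations from a knowledge graph (KG), so that expansions are relevant both semantically and structurally. Specifically, the framework uses document texts as rich KG node representations and applies document-based relation filtering in its Knowledge-Aware Retrieval (KAR), which lets it handle semi-structured queries with both textual and relational requirements more effectively.

Link: https://arxiv.org/abs/2410.13765
Authors: Yu Xia,Junda Wu,Sungchul Kim,Tong Yu,Ryan A. Rossi,Haoliang Wang,Julian McAuley
Keywords: Large language models, Large language, improving information search, generate query expansions, language models
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: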

Abstract:Large language models (LLMs) have been used to generate query expansions augmenting original queries for improving information search. Recent studies also explore providing LLMs with initial retrieval results to generate query expansions more grounded to document corpus. However, these methods mostly focus on enhancing textual similarities between search queries and target documents, overlooking document relations. For queries like “Find me a highly rated camera for wildlife photography compatible with my Nikon F-Mount lenses”, existing methods may generate expansions that are semantically similar but structurally unrelated to user intents. To handle such semi-structured queries with both textual and relational requirements, in this paper we propose a knowledge-aware query expansion framework, augmenting LLMs with structured document relations from knowledge graph (KG). To further address the limitation of entity-based scoring in existing KG-based methods, we leverage document texts as rich KG node representations and use document-based relation filtering for our Knowledge-Aware Retrieval (KAR). Extensive experiments on three datasets of diverse domains show the advantages of our method compared against state-of-the-art baselines on textual and relational semi-structured retrieval.

[NLP-20] MobA: A Two-Level Agent System for Efficient Mobile Task Automation

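Quick read: This paper addresses the limits of current mobile assistants, which either depend on system APIs or struggle with complex user instructions and diverse interfaces because of restricted comprehension and decision-making abilities. The key to the solution is MobA, a mobile assistant built on multimodal large language models that strengthens comprehension and planning through a two-level agent architecture. The high-level Global Agent (GA) understands user commands, tracks history memories, and plans tasks; the low-level Local Agent (LA) predicts detailed actions in the form of function calls, guided by sub-tasks and memory from the GA. An integrated Reflection Module lets the system complete tasks efficiently and handle previously unseen complex tasks.

Link: https://arxiv.org/abs/2410.13757
Authors: Zichen Zhu,Hao Tang,Yansi Li,Kunyao Lan,Yixuan Jiang,Hao Zhou,Yixiao Wang,Situo Zhang,Liangtai Sun,Lu Chen,Kai Yu
Keywords: diverse interfaces due, Current mobile assistants, Current mobile, decision-making abilities, limited by dependence
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 27 pages, 6 figures, and 5 tables. We will release our source code in a few days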

Abstract:Current mobile assistants are limited by dependence on system APIs or struggle with complex user instructions and diverse interfaces due to restricted comprehension and decision-making abilities. To address these challenges, we propose MobA, a novel Mobile phone Agent powered by multimodal large language models that enhances comprehension and planning capabilities through a sophisticated two-level agent architecture. The high-level Global Agent (GA) is responsible for understanding user commands, tracking history memories, and planning tasks. The low-level Local Agent (LA) predicts detailed actions in the form of function calls, guided by sub-tasks and memory from the GA. Integrating a Reflection Module allows for efficient task completion and enables the system to handle previously unseen complex tasks. MobA demonstrates significant improvements in task execution efficiency and completion rate in real-life evaluations, underscoring the potential of MLLM-empowered mobile assistants.

[NLP-21] LLM-Human Pipeline for Cultural Context Grounding of Conversations

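Quick read: This paper addresses the difficulty NLP models have in understanding and following social norms in conversations across cultures. The key to the solution is a "Cultural Context Schema" that combines conversational information (emotions, dialogue acts, and so on) with cultural information (social norms, norm violations, and so on). Using LLMs, the authors generate about 110k social norm and violation descriptions, refine them with automated verification strategies evaluated against human judgements, and organize them into structures called "Norm Concepts". The norm concepts and descriptions are grounded in conversations through symbolic annotation, and the resulting dataset is used for downstream tasks such as emotion, sentiment, and dialogue act detection, where it significantly improves empirical performance.

Link: https://arxiv.org/abs/2410.13727
Authors: Rajkumar Pujari,Dan Goldwasser
Keywords: adhere to well-understood, well-understood social norms, Asian cultures, Conversations, Cultural Context Schema
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 19 pages, 9 figures, 7 tables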

Abstract:Conversations often adhere to well-understood social norms that vary across cultures. For example, while “addressing parents by name” is commonplace in the West, it is rare in most Asian cultures. Adherence or violation of such norms often dictates the tenor of conversations. Humans are able to navigate social situations requiring cultural awareness quite adeptly. However, it is a hard task for NLP models. In this paper, we tackle this problem by introducing a “Cultural Context Schema” for conversations. It comprises (1) conversational information such as emotions, dialogue acts, etc., and (2) cultural information such as social norms, violations, etc. We generate ~110k social norm and violation descriptions for ~23k conversations from Chinese culture using LLMs. We refine them using automated verification strategies which are evaluated against culturally aware human judgements. We organize these descriptions into meaningful structures we call “Norm Concepts”, using an interactive human-in-loop framework. We ground the norm concepts and the descriptions in conversations using symbolic annotation. Finally, we use the obtained dataset for downstream tasks such as emotion, sentiment, and dialogue act detection. We show that it significantly improves the empirical performance.

[NLP-22] MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems

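Quick read: This paper addresses two limitations of retrieval-augmented generation (RAG) evaluation: traditional benchmarks rely on heuristic-based metrics that need human preferences as ground truth, while arena-based benchmarks need an expensive large language model (LLM) as a judge. The key to the solution is to train a learning-to-rank model as a "surrogate" judge that takes RAG evaluation heuristics as input and produces a synthetic arena-based leaderboard. With this approach the authors build MIRAGE-Bench, a standardized arena-based multilingual RAG benchmark covering 18 languages that combines heuristic features with an LLM judge. In the experiments, the surrogate judge trained on heuristic features correlates highly with GPT-4o as the teacher judge (Kendall Tau (\tau) = 0.909), which suggests the approach is an efficient way to evaluate multilingual RAG.

Link: https://arxiv.org/abs/2410.13716
Authors: Nandan Thakur,Suleman Kazi,Ge Luo,Jimmy Lin,Amin Ahmad
Keywords: Traditional Retrieval-Augmented Generation, Traditional Retrieval-Augmented, require human preferences, truth for reference, heuristic-based metrics
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: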

Abstract:Traditional Retrieval-Augmented Generation (RAG) benchmarks rely on different heuristic-based metrics for evaluation, but these require human preferences as ground truth for reference. In contrast, arena-based benchmarks, where two models compete each other, require an expensive Large Language Model (LLM) as a judge for a reliable evaluation. We present an easy and efficient technique to get the best of both worlds. The idea is to train a learning to rank model as a “surrogate” judge using RAG-based evaluation heuristics as input, to produce a synthetic arena-based leaderboard. Using this idea, We develop MIRAGE-Bench, a standardized arena-based multilingual RAG benchmark for 18 diverse languages on Wikipedia. The benchmark is constructed using MIRACL, a retrieval dataset, and extended for multilingual generation evaluation. MIRAGE-Bench evaluates RAG extensively coupling both heuristic features and LLM as a judge evaluator. In our work, we benchmark 19 diverse multilingual-focused LLMs, and achieve a high correlation (Kendall Tau ( \tau ) = 0.909) using our surrogate judge learned using heuristic features with pairwise evaluations and between GPT-4o as a teacher on the MIRAGE-Bench leaderboard using the Bradley-Terry framework. We observe proprietary and large open-source LLMs currently dominate in multilingual RAG. MIRAGE-Bench is available at: this https URL.
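
The abstract names two standard tools: the Bradley-Terry framework for turning pairwise outcomes into a leaderboard, and Kendall Tau for comparing two leaderboards. The sketch below is a minimal pure-Python version of both on made-up numbers. The win matrix, the surrogate scores, and the function names are hypothetical and say nothing about how MIRAGE-Bench itself is implemented.

```python
from itertools import combinations


def bradley_terry(wins, iterations=200):
    """Fit Bradley-Terry strengths with the standard minorize-maximize update.

    wins[i][j] is how many times model i beat model j (diagonal is zero).
    """
    n = len(wins)
    strength = [1.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum((wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                        for j in range(n) if j != i)
            updated.append(total_wins / denom)
        norm = sum(updated)
        strength = [s / norm for s in updated]  # strengths sum to 1
    return strength


def kendall_tau(x, y):
    """Kendall tau-a between two score lists (assumes no tied scores)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / pairs


# Hypothetical pairwise outcomes from a teacher judge for three models
teacher_wins = [[0, 7, 9],
                [3, 0, 6],
                [1, 4, 0]]
teacher_strength = bradley_terry(teacher_wins)

# Hypothetical scores from a surrogate judge for the same three models
surrogate_scores = [0.9, 0.5, 0.1]
print("tau:", kendall_tau(teacher_strength, surrogate_scores))
```

A tau near 1 means the surrogate orders the models almost the same way as the teacher, which is the kind of agreement the reported 0.909 describes.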

[NLP-23] On the Role of Attention Heads in Large Language Model Safety

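Quick read: This paper addresses the interpretability of safety mechanisms in large language models (LLMs), in particular how multi-head attention affects a model's safety capability. The key to the solution is a new metric, the Safety Head ImPortant Score (Ships), which assesses the contribution of individual attention heads to model safety. The paper then generalizes Ships to the dataset level and introduces the Safety Attention Head AttRibution Algorithm (Sahara) to identify the critical safety attention heads in a model. It finds that particular attention heads have a significant impact on safety: ablating a single safety head lets an aligned model (e.g., Llama-2-7b-chat) respond to 16 times more harmful queries while modifying only 0.006% of the parameters, compared with the roughly 5% modification required in previous studies. The paper also shows that attention heads act primarily as feature extractors for safety and that models fine-tuned from the same base model have overlapping safety heads.

Link: https://arxiv.org/abs/2410.13708
Authors: Zhenhong Zhou,Haiyang Yu,Xinghua Zhang,Rongwu Xu,Fei Huang,Kun Wang,Yang Liu,Junfeng Fang,Yongbin Li
Keywords: multiple language tasks, safety, language tasks, multiple language, attention
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 28 pages, 18 figures, 7 tables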

Abstract:Large language models (LLMs) achieve state-of-the-art performance on multiple language tasks, yet their safety guardrails can be circumvented, leading to harmful generations. In light of this, recent research on safety mechanisms has emerged, revealing that when safety representations or component are suppressed, the safety capability of LLMs are compromised. However, existing research tends to overlook the safety impact of multi-head attention mechanisms, despite their crucial role in various model functionalities. Hence, in this paper, we aim to explore the connection between standard attention mechanisms and safety capability to fill this gap in the safety-related mechanistic interpretability. We propose a novel metric which tailored for multi-head attention, the Safety Head ImPortant Score (Ships), to assess the individual heads’ contributions to model safety. Based on this, we generalize Ships to the dataset level and further introduce the Safety Attention Head AttRibution Algorithm (Sahara) to attribute the critical safety attention heads inside the model. Our findings show that the special attention head has a significant impact on safety. Ablating a single safety head allows aligned model (e.g., Llama-2-7b-chat) to respond to 16 times more harmful queries, while only modifying 0.006% of the parameters, in contrast to the ~ 5% modification required in previous studies. More importantly, we demonstrate that attention heads primarily function as feature extractors for safety and models fine-tuned from the same base model exhibit overlapping safety heads through comprehensive experiments. Together, our attribution approach and findings provide a novel perspective for unpacking the black box of safety mechanisms within large models.
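
To show what ablating a single attention head means mechanically, the sketch below implements a small multi-head self-attention layer in NumPy and zeroes one head's output before the output projection. Zeroing the head output is an assumption made here for simplicity; the abstract does not say how the paper performs its ablation, and the layer sizes and names are illustrative only.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, wq, wk, wv, wo, num_heads, ablate_head=None):
    """Unmasked multi-head self-attention; optionally zero one head's output."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(w):
        # Project, then reshape to (num_heads, seq_len, d_head)
        return (x @ w).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(wq), split(wk), split(wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ v  # (num_heads, seq_len, d_head)
    if ablate_head is not None:
        heads[ablate_head] = 0.0  # remove this head's contribution
    merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ wo


rng = np.random.default_rng(0)
d_model, num_heads, seq_len = 16, 4, 5
x = rng.standard_normal((seq_len, d_model))
wq, wk, wv, wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))

full = multi_head_attention(x, wq, wk, wv, wo, num_heads)
ablated = multi_head_attention(x, wq, wk, wv, wo, num_heads, ablate_head=2)
print("mean absolute change from ablating head 2:", np.abs(full - ablated).mean())
```

Because the layer output is a sum of per-head contributions, the difference between the full and ablated outputs isolates exactly what that one head contributed.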

[NLP-24] Unconstrained Model Merging for Enhanced LLM Reasoning

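Quick read: This paper addresses the proprietary data and computational resources needed to build an all-in-one large language model (LLM). The key to the solution is an unconstrained model merging framework that accommodates both homogeneous and heterogeneous model architectures, with a focus on reasoning tasks. It uses a fine-grained layer-wise weight merging strategy for homogeneous models and, for heterogeneous models, merging built on probabilistic distribution knowledge derived from instruction-response fine-tuning data. Experiments across 7 benchmarks and 9 reasoning-optimized LLMs indicate that merging yields combinatorial reasoning that surpasses simple additive effects, which the authors propose as a foundation for decentralized LLMs.

Link: https://arxiv.org/abs/2410.13699
Authors: Yiming Zhang,Baoyi He,Shengyu Zhang,Yuhao Fu,Qi Zhou,Zhijie Sang,Zijin Hong,Kejing Yang,Wenjun Wang,Jianbo Yuan,Guangning Han,Linyi Li,Chunlin Ji,Fei Wu,Hongxia Yang
Keywords: shown remarkable success, multi-step problem solving, building domain-specific large, domain-specific large language, large language models
Categories: Computation and Language (cs.CL)
Comments: Under review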

Abstract:Recent advancements in building domain-specific large language models (LLMs) have shown remarkable success, especially in tasks requiring reasoning abilities like logical inference over complex relationships and multi-step problem solving. However, creating a powerful all-in-one LLM remains challenging due to the need for proprietary data and vast computational resources. As a resource-friendly alternative, we explore the potential of merging multiple expert models into a single LLM. Existing studies on model merging mainly focus on generalist LLMs instead of domain experts, or the LLMs under the same architecture and size. In this work, we propose an unconstrained model merging framework that accommodates both homogeneous and heterogeneous model architectures with a focus on reasoning tasks. A fine-grained layer-wise weight merging strategy is designed for homogeneous models merging, while heterogeneous model merging is built upon the probabilistic distribution knowledge derived from instruction-response fine-tuning data. Across 7 benchmarks and 9 reasoning-optimized LLMs, we reveal key findings that combinatorial reasoning emerges from merging which surpasses simple additive effects. We propose that unconstrained model merging could serve as a foundation for decentralized LLMs, marking a notable progression from the existing centralized LLM framework. This evolution could enhance wider participation and stimulate additional advancement in the field of artificial intelligence, effectively addressing the constraints posed by centralized models.
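
For the homogeneous case, layer-wise weight merging can be pictured as interpolating two checkpoints with a separate coefficient for each layer. The sketch below shows that idea on toy NumPy arrays. How the paper chooses its coefficients is not stated in the abstract, so the `layer_weights` mapping, the default of 0.5, and the parameter names are all assumptions for illustration.

```python
import numpy as np


def merge_layerwise(model_a, model_b, layer_weights, default=0.5):
    """Interpolate two same-architecture checkpoints, one coefficient per layer.

    layer_weights maps a parameter name to the share taken from model_a.
    """
    if model_a.keys() != model_b.keys():
        raise ValueError("homogeneous merging needs identical parameter names")
    merged = {}
    for name, tensor_a in model_a.items():
        tensor_b = model_b[name]
        if tensor_a.shape != tensor_b.shape:
            raise ValueError(f"shape mismatch for {name}")
        alpha = layer_weights.get(name, default)
        merged[name] = alpha * tensor_a + (1.0 - alpha) * tensor_b
    return merged


# Hypothetical two-layer "experts" with matching parameter names and shapes
math_expert = {"layer0.weight": np.ones((2, 2)), "layer1.weight": np.full((2, 2), 4.0)}
code_expert = {"layer0.weight": np.zeros((2, 2)), "layer1.weight": np.full((2, 2), 2.0)}

merged = merge_layerwise(math_expert, code_expert, {"layer0.weight": 0.75})
print(merged["layer0.weight"][0, 0], merged["layer1.weight"][0, 0])
```

This only works when both models share parameter names and shapes, which is why the abstract describes a different route, based on output distributions, for heterogeneous architectures.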

[NLP-25] Exploring the Design Space of Visual Context Representation in Video MLLMs

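Quick read: This paper addresses the lack of systematic research on visual context representation in video multimodal large language models (MLLMs), that is, the scheme for selecting frames from a video and then selecting embeddings (or tokens) from each frame. The key to the solution is to formulate visual context representation as a constrained optimization problem and to model the language modeling loss as a function of the number of frames and the number of embeddings per frame, given the maximum visual context window size. By fitting the corresponding function curves through extensive experiments, the paper explores the scaling effects of frame selection and token selection, examines typical selection strategies, and derives an optimal formula for setting the two factors.

Link: https://arxiv.org/abs/2410.13694
Authors: Yifan Du,Yuqi Huo,Kun Zhou,Zijia Zhao,Haoyu Lu,Han Huang,Wayne Xin Zhao,Bingning Wang,Weipeng Chen,Ji-Rong Wen
Keywords: Video Multimodal Large, Multimodal Large Language, Multimodal Large, Large Language Models, shown remarkable capability
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Long Video MLLM; work in progress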

Abstract:Video Multimodal Large Language Models (MLLMs) have shown remarkable capability of understanding the video semantics on various downstream tasks. Despite the advancements, there is still a lack of systematic research on visual context representation, which refers to the scheme to select frames from a video and further select the tokens from a frame. In this paper, we explore the design space for visual context representation, and aim to improve the performance of video MLLMs by finding more effective representation schemes. Firstly, we formulate the task of visual context representation as a constrained optimization problem, and model the language modeling loss as a function of the number of frames and the number of embeddings (or tokens) per frame, given the maximum visual context window size. Then, we explore the scaling effects in frame selection and token selection respectively, and fit the corresponding function curve by conducting extensive empirical experiments. We examine the effectiveness of typical selection strategies and present empirical findings to determine the two factors. Furthermore, we study the joint effect of frame selection and token selection, and derive the optimal formula for determining the two factors. We demonstrate that the derived optimal settings show alignment with the best-performed results of empirical experiments. Our code and model are available at: this https URL.
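
The constrained optimization in the abstract has a simple shape: choose the number of frames and the number of tokens per frame so that their product fits the visual context window and the loss is as low as possible. The sketch below searches that space exhaustively. The loss surface `toy_loss` and all of its constants are invented for illustration; the abstract does not give the fitted function, so none of these numbers come from the paper.

```python
def toy_loss(frames, tokens_per_frame, a=2.0, alpha=0.3, b=1.0, beta=0.5, floor=1.5):
    """Hypothetical loss: falls with both factors, with diminishing returns."""
    return a * frames ** -alpha + b * tokens_per_frame ** -beta + floor


def best_allocation(max_visual_tokens, loss_fn=toy_loss):
    """Search (frames, tokens_per_frame) with frames * tokens_per_frame <= budget.

    For each frame count only the largest per-frame token count that fits is
    tried, which is sufficient when the loss never rises as tokens are added.
    """
    best = None
    for frames in range(1, max_visual_tokens + 1):
        tokens = max_visual_tokens // frames
        candidate = (loss_fn(frames, tokens), frames, tokens)
        if best is None or candidate < best:
            best = candidate
    return best


loss, frames, tokens = best_allocation(max_visual_tokens=1024)
print(f"frames={frames}, tokens per frame={tokens}, loss={loss:.4f}")
```

With a fitted loss in place of `toy_loss`, the same search would give the frame and token allocation for a given context window. The abstract reports deriving an optimal formula instead of searching.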

[NLP-26] Pose-Based Sign Language Appearance Transfer

【速读】：该论文旨在解决手语骨架姿态中说话者外观的转移问题，同时保持手语内容不变。解决方案的关键在于利用估计的姿态，将一个说话者的外观转移到另一个说话者上，确保动作和过渡的自然性，从而改进基于姿态的渲染和手语拼接，同时模糊身份信息。实验结果表明，该方法在降低说话者识别准确性的同时，对手语识别性能略有影响，揭示了隐私保护与功能性之间的权衡。

链接: https://arxiv.org/abs/2410.13675
作者: Amit Moryossef,Gerard Sant,Zifan Jiang
关键词-EN: language skeletal poses, sign language skeletal, language skeletal, skeletal poses, sign content
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

Abstract:We introduce a method for transferring the signer’s appearance in sign language skeletal poses while preserving the sign content. Using estimated poses, we transfer the appearance of one signer to another, maintaining natural movements and transitions. This approach improves pose-based rendering and sign stitching while obfuscating identity. Our experiments show that while the method reduces signer identification accuracy, it slightly harms sign recognition performance, highlighting a tradeoff between privacy and utility. Our code is available at this https URL.
摘要：我们提出了一种在保持手语内容的同时转移手语骨架姿态中手语者外观的方法。通过估计的姿态，我们将一个手语者的外观转移到另一个手语者上，同时保持自然的动作和过渡。这种方法改进了基于姿态的渲染和手语拼接，同时模糊了身份信息。我们的实验表明，该方法在降低手语者识别准确率的同时，也略微损害了手语识别性能，突显了隐私与实用性之间的权衡。我们的代码可在以下链接获取：this https URL。

[NLP-27] HEALTH-PARIKSHA: Assessing RAG Models for Health Chatbots in Real-World Multilingual Settings

【速读】：该论文试图解决在实际应用场景中对多语言大型语言模型（LLMs）进行评估的问题，特别是针对印度英语和四种印度本土语言的医疗聊天机器人数据。解决方案的关键在于采用统一的检索增强生成框架来生成响应，并通过自动化技术和人工评估相结合的方式，基于四个特定指标对模型性能进行全面评估。研究发现，不同模型在处理这些语言查询时的表现差异显著，且指令调优的印度本土语言模型在处理印度语言查询时并不总是表现良好。此外，实验表明，与英语查询相比，印度语言查询的响应在事实准确性上通常较低。最后，定性分析显示，数据集中的代码混合和文化相关查询对评估的模型提出了挑战。

链接: https://arxiv.org/abs/2410.13671
作者: Varun Gumma,Anandhita Raghunath,Mohit Jain,Sunayana Sitaram
关键词-EN: garnered significant interest, scenarios remains rare, real-world scenarios remains, Assessing the capabilities, significant interest
类目: Computation and Language (cs.CL)
备注: Under Review

Abstract:Assessing the capabilities and limitations of large language models (LLMs) has garnered significant interest, yet the evaluation of multiple models in real-world scenarios remains rare. Multilingual evaluation often relies on translated benchmarks, which typically do not capture linguistic and cultural nuances present in the source language. This study provides an extensive assessment of 24 LLMs on real world data collected from Indian patients interacting with a medical chatbot in Indian English and 4 other Indic languages. We employ a uniform Retrieval Augmented Generation framework to generate responses, which are evaluated using both automated techniques and human evaluators on four specific metrics relevant to our application. We find that models vary significantly in their performance and that instruction tuned Indic models do not always perform well on Indic language queries. Further, we empirically show that factual correctness is generally lower for responses to Indic queries compared to English queries. Finally, our qualitative work shows that code-mixed and culturally relevant queries in our dataset pose challenges to evaluated models.
摘要：评估大语言模型 (LLMs) 的能力和局限性引起了广泛关注，但在实际场景中对多个模型的评估仍然罕见。多语言评估通常依赖于翻译后的基准测试，这些测试通常无法捕捉源语言中的语言和文化细微差别。本研究对24个LLMs在从印度患者与医疗聊天机器人互动中收集的印度英语和另外4种印度语言的真实世界数据上进行了广泛评估。我们采用统一的检索增强生成框架来生成响应，这些响应通过自动化技术和人工评估者在四个与我们应用相关的特定指标上进行评估。我们发现，模型在性能上存在显著差异，并且指令调整的印度语言模型在处理印度语言查询时并不总是表现良好。此外，我们实证表明，与英语查询相比，印度语言查询的响应在事实正确性上通常较低。最后，我们的定性工作显示，数据集中的代码混合和文化相关查询对评估模型提出了挑战。

[NLP-28] signwriting-evaluation: Effective Sign Language Evaluation via SignWriting

【速读】：该论文试图解决手语书写系统SignWriting缺乏专用自动评估指标的问题，以促进手语转录和翻译模型的有效开发。解决方案的关键在于引入了一套专门针对SignWriting设计的评估指标，包括对标准指标如BLEU和chrF的适应性调整、将CLIPScore应用于SignWriting图像，以及提出了一种新颖的符号距离度量方法。这些指标不仅解决了单个手势与连续手语评估的独特挑战，还通过在SignBank语料库中的分数分布分析和最近邻搜索展示了其有效性，为未来SignWriting模型的评估提供了重要工具。

链接: https://arxiv.org/abs/2410.13668
作者: Amit Moryossef,Rotem Zilberman,Ohad Langer
关键词-EN: developing effective transcription, automatic evaluation metrics, evaluation metrics tailored, lack of automatic, presents a significant
类目: Computation and Language (cs.CL)
备注:

Abstract:The lack of automatic evaluation metrics tailored for SignWriting presents a significant obstacle in developing effective transcription and translation models for signed languages. This paper introduces a comprehensive suite of evaluation metrics specifically designed for SignWriting, including adaptations of standard metrics such as BLEU and chrF, the application of CLIPScore to SignWriting images, and a novel symbol distance metric unique to our approach. We address the distinct challenges of evaluating single signs versus continuous signing and provide qualitative demonstrations of metric efficacy through score distribution analyses and nearest-neighbor searches within the SignBank corpus. Our findings reveal the strengths and limitations of each metric, offering valuable insights for future advancements using SignWriting. This work contributes essential tools for evaluating SignWriting models, facilitating progress in the field of sign language processing. Our code is available at this https URL.
摘要：缺乏专为手语书写 (SignWriting) 设计的自动评估指标，是开发有效的手语转录和翻译模型的一大障碍。本文提出了一套专为 SignWriting 设计的全面评估指标，包括对 BLEU 和 chrF 等标准指标的适应性调整、将 CLIPScore 应用于 SignWriting 图像，以及我们独创的符号距离度量方法。我们处理了评估单个手势与连续手语时各自的独特挑战，并通过 SignBank 语料库中的分数分布分析和最近邻搜索，定性展示了这些指标的有效性。研究结果揭示了各指标的优势与局限，为未来使用 SignWriting 的研究提供了宝贵的见解。本研究为评估 SignWriting 模型提供了关键工具，推动了手语处理领域的发展。相关代码可在 this https URL 获取。
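
As background for the chrF adaptation mentioned above, the following is a simplified sketch of a character n-gram F-score. It is an assumption-laden illustration, not the paper's SignWriting-specific metric: it uses a single n-gram order rather than averaging over several orders as chrF does, it works on plain strings rather than SignWriting notation, and the function names are invented here.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count all character n-grams of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_fscore(hypothesis, reference, n=2, beta=2.0):
    """F-beta score over character n-grams (beta > 1 weights recall higher)."""
    hyp = char_ngrams(hypothesis, n)
    ref = char_ngrams(reference, n)
    # Multiset intersection: each n-gram counts min(hyp count, ref count) times.
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


if __name__ == "__main__":
    print(ngram_fscore("abcd", "abcd"))
    print(ngram_fscore("ab", "abcd"))
```

A metric adapted to SignWriting would presumably tokenize symbols and positions rather than raw characters; that step is left out here because the abstract does not describe it.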

[NLP-29] ORCHID: A Chinese Debate Corpus for Target-Independent Stance Detection and Argumentative Dialogue Summarization EMNLP2023

【速读】：该论文试图解决中文环境下目标无关立场检测和辩论总结任务的数据集不足问题。解决方案的关键在于提出了ORCHID（口语中文辩论）数据集，这是首个用于基准测试目标无关立场检测和辩论总结的中文数据集。该数据集包含1,218个真实世界的中文辩论，涉及476个独特话题，提供了2,436个立场特定的总结和14,133个完全标注的语句，为未来的研究提供了多功能测试平台，并通过实证研究展示了数据集的挑战性，并提出了将立场检测融入辩论总结的潜在可能性。

链接: https://arxiv.org/abs/2410.13667
作者: Xiutian Zhao,Ke Wang,Wei Peng
关键词-EN: receiving increasing attention, large language models, attention for years, Dialogue agents, receiving increasing
类目: Computation and Language (cs.CL)
备注: In EMNLP 2023

Abstract:Dialogue agents have been receiving increasing attention for years, and this trend has been further boosted by the recent progress of large language models (LLMs). Stance detection and dialogue summarization are two core tasks of dialogue agents in application scenarios that involve argumentative dialogues. However, research on these tasks is limited by the insufficiency of public datasets, especially for non-English languages. To address this language resource gap in Chinese, we present ORCHID (Oral Chinese Debate), the first Chinese dataset for benchmarking target-independent stance detection and debate summarization. Our dataset consists of 1,218 real-world debates that were conducted in Chinese on 476 unique topics, containing 2,436 stance-specific summaries and 14,133 fully annotated utterances. Besides providing a versatile testbed for future research, we also conduct an empirical study on the dataset and propose an integrated task. The results show the challenging nature of the dataset and suggest a potential of incorporating stance detection in summarization for argumentative dialogue.
摘要：对话代理多年来一直受到越来越多的关注，而大语言模型 (LLM) 的最新进展进一步推动了这一趋势。立场检测和对话摘要是对话代理在涉及论证性对话的应用场景中的两个核心任务。然而，这些任务的研究受到公共数据集不足的限制，尤其是对于非英语语言。为了解决中文中的这一语言资源缺口，我们提出了 ORCHID (Oral Chinese Debate)，这是首个用于基准测试目标无关立场检测和辩论摘要的中文数据集。我们的数据集包含 1,218 个在 476 个独特话题上进行的真实世界辩论，包含 2,436 个立场特定的摘要和 14,133 个完全注释的话语。除了为未来的研究提供多功能测试平台外，我们还对数据集进行了实证研究，并提出了一个综合任务。结果显示了数据集的挑战性，并表明在论证性对话的摘要中整合立场检测具有潜在价值。

[NLP-30] VL-GLUE: A Suite of Fundamental yet Challenging Visuo-Linguistic Reasoning Tasks

【速读】：该论文试图解决现有先进的人工智能模型在处理需要视觉和文本联合推理的任务时表现不佳的问题。解决方案的关键在于提出了VL-GLUE基准，这是一个包含超过10万个样本的多任务基准，涵盖了七个需要视觉语言推理的核心任务。VL-GLUE不仅包含了多样化的图像类型（如合成渲染图、日常场景、图表和复杂图示），还涵盖了广泛的领域特定文本（如烹饪、政治、体育和高中学科内容），从而模拟了现实世界中对多模态理解的需求。通过这一基准，论文展示了现有大规模视觉语言模型在此类任务上的挑战性，并鼓励开发具备强大视觉语言推理能力的系统。

链接: https://arxiv.org/abs/2410.13666
作者: Shailaja Keyur Sampat,Mutsumi Nakamura,Shankar Kailas,Kartik Aggarwal,Mandy Zhou,Yezhou Yang,Chitta Baral
关键词-EN: Deriving inference, advanced Artificial Intelligence, heterogeneous inputs, humans to perform, inference from heterogeneous
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 18 pages, 7 figures

Abstract:Deriving inference from heterogeneous inputs (such as images, text, and audio) is an important skill for humans to perform day-to-day tasks. A similar ability is desirable for the development of advanced Artificial Intelligence (AI) systems. While state-of-the-art models are rapidly closing the gap with human-level performance on diverse computer vision and NLP tasks separately, they struggle to solve tasks that require joint reasoning over visual and textual modalities. Inspired by GLUE (Wang et al., 2018), a multitask benchmark for natural language understanding, we propose VL-GLUE in this paper. VL-GLUE consists of over 100k samples spanning seven different tasks, which at their core require visuo-linguistic reasoning. Moreover, our benchmark comprises diverse image types (from synthetically rendered figures, and day-to-day scenes to charts and complex diagrams) and includes a broad variety of domain-specific text (from cooking, politics, and sports to high-school curricula), demonstrating the need for multi-modal understanding in the real world. We show that this benchmark is quite challenging for existing large-scale vision-language models and encourage development of systems that possess robust visuo-linguistic reasoning capabilities.
摘要：从异质输入（如图像、文本和音频）中推导出结论是人类执行日常任务的重要技能。开发先进的人工智能 (AI) 系统同样需要类似的能力。尽管最先进的模型在单独的计算机视觉和自然语言处理 (NLP) 任务上迅速接近人类水平的表现，但它们在需要视觉和文本模态联合推理的任务上仍面临挑战。受用于自然语言理解的多任务基准 GLUE（Wang et al., 2018）的启发，我们在本文中提出了 VL-GLUE。VL-GLUE 包含超过 10 万个样本，跨越七个不同的任务，这些任务的核心都需要视觉-语言推理。此外，我们的基准测试涵盖了多种图像类型（从合成渲染的图形、日常场景到图表和复杂图解），并包括广泛的领域特定文本（从烹饪、政治、体育到高中课程），展示了现实世界中多模态理解的需求。我们表明，这个基准对现有的视觉-语言大模型来说非常具有挑战性，并鼓励开发具备强大视觉-语言推理能力的系统。

[NLP-31] Red and blue language: Word choices in the Trump Harris 2024 presidential debate

【速读】：该论文试图解决的问题是如何在政治辩论中分析候选人的语言使用差异，特别是特朗普和哈里斯在2024年9月10日辩论中的语言特征。解决方案的关键在于分析候选人在语义和语用特征上的差异，包括价值观和意识形态的框架构建、情感诉求、词汇的具体性和特异性、以及通过单数或复数代词来称呼他人。研究发现，哈里斯常围绕恢复和赋权构建议题，而特朗普则聚焦于危机和衰退；两人在情感语言使用上相似，但特朗普表现出略高的负面倾向和较少主观性；在回应的特异性上无显著差异；抽象语言使用相似，但特朗普的变异性更大；特朗普不直接提及哈里斯的名字，而哈里斯频繁提及特朗普；哈里斯平等使用单数和复数代词，而特朗普更多使用单数代词。这些结果与关于“红蓝语言”的先前研究相关联，该研究指出了保守派（红色）和自由派（蓝色）政治意识形态相关的不同语言模式。

链接: https://arxiv.org/abs/2410.13654
作者: Philipp Wicke,Marianna M. Bolognesi
关键词-EN: candidates directly confront, undecided voters, Trump, moderator questions, peculiar type
类目: Computation and Language (cs.CL)
备注: Submitted to PLOS ONE, under review

Abstract:Political debates are a peculiar type of political discourse, in which candidates directly confront one another, addressing not only the moderator’s questions, but also their opponent’s statements, as well as the concerns of voters from both parties and undecided voters. Therefore, language is adjusted to meet specific expectations and achieve persuasion. We analyse how the language of Trump and Harris during the debate (September 10th 2024) differs in relation to the following semantic and pragmatic features, for which we formulated targeted hypotheses: framing values and ideology, appealing to emotion, using words with different degrees of concreteness and specificity, addressing others through singular or plural pronouns. Our findings include: differences in the use of figurative frames (Harris often framing issues around recovery and empowerment, Trump often focused on crisis and decline); similar use of emotional language, with Trump showing a slightly higher tendency toward negativity and toward less subjective language compared to Harris; no significant difference in the specificity of candidates’ responses; similar use of abstract language, with Trump showing more variability than Harris, depending on the subject discussed; differences in addressing the opponent, with Trump not mentioning Harris by name, while Harris refers to Trump frequently; different uses of pronouns, with Harris using both singular and plural pronouns equally, while Trump uses more singular pronouns. The results are discussed in relation to previous literature on Red and Blue language, which refers to distinct linguistic patterns associated with conservative (Red) and liberal (Blue) political ideologies.
摘要：政治辩论是一种特殊的政治话语形式，其中候选人直接交锋，不仅回答主持人提出的问题，还回应对手的陈述，以及两党选民和未决定选民的关切。因此，语言会根据特定的期望进行调整，以达到说服的目的。我们分析了特朗普和哈里斯在辩论（2024年9月10日）中的语言在以下语义和语用特征上的差异，并针对这些特征提出了具体的假设：价值观和意识形态的框架构建、情感诉求、使用具体性和特异性程度不同的词汇、通过单数或复数代词称呼他人。我们的研究发现包括：在比喻框架使用上的差异（哈里斯经常围绕恢复和赋权构建问题，而特朗普则更多关注危机和衰退）；情感语言使用上的相似性，但与哈里斯相比，特朗普的负面倾向略高，主观性语言略少；候选人在回应的特异性上没有显著差异；抽象语言使用上的相似性，但特朗普随讨论主题不同表现出比哈里斯更大的变化性；在称呼对手上的差异，特朗普未提及哈里斯的名字，而哈里斯则频繁提及特朗普；代词使用上的差异，哈里斯对单数和复数代词的使用相当，而特朗普更多使用单数代词。我们结合此前关于红蓝语言的文献讨论了这些结果，红蓝语言指的是与保守（红色）和自由（蓝色）政治意识形态相关的不同语言模式。

[NLP-32] A new approach for fine-tuning sentence transformers for intent classification and out-of-scope detection tasks

【速读】：该论文试图解决虚拟助手系统中对超出范围（OOS）查询的拒绝问题，特别是在意图分类任务中，基于Transformer的句子编码器经交叉熵微调后生成的范围内（in-scope）嵌入会分散在整个句子嵌入空间中，可能与OOS嵌入重叠，从而使OOS拒绝变得困难。解决方案的关键在于通过自动编码器学习范围内嵌入的重构损失，并将此损失作为正则化项加入到交叉熵损失中，从而在不影响意图分类性能的前提下，将拒绝OOS实例的精度-召回曲线下面积（AUC-PR）提高1-4%。

链接: https://arxiv.org/abs/2410.13649
作者: Tianyi Zhang,Atta Norouzian,Aanchan Mohan,Frederick Ducatelle
关键词-EN: redirect user queries, virtual assistant, important to reject, reject or redirect, redirect user
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Appearing at Empirical Methods in Natural Language Processing 2025 - Industry Track

Abstract:In virtual assistant (VA) systems it is important to reject or redirect user queries that fall outside the scope of the system. One of the most accurate approaches for out-of-scope (OOS) rejection is to combine it with the task of intent classification on in-scope queries, and to use methods based on the similarity of embeddings produced by transformer-based sentence encoders. Typically, such encoders are fine-tuned for the intent-classification task, using cross-entropy loss. Recent work has shown that while this produces suitable embeddings for the intent-classification task, it also tends to disperse in-scope embeddings over the full sentence embedding space. This causes the in-scope embeddings to potentially overlap with OOS embeddings, thereby making OOS rejection difficult. This is compounded when OOS data is unknown. To mitigate this issue our work proposes to regularize the cross-entropy loss with an in-scope embedding reconstruction loss learned using an auto-encoder. Our method achieves a 1-4% improvement in the area under the precision-recall curve for rejecting out-of-sample (OOS) instances, without compromising intent classification performance.
摘要：在虚拟助手 (VA) 系统中，拒绝或重定向超出系统范围的用户查询至关重要。最准确的处理超出范围 (OOS) 查询的方法之一是将此任务与范围内查询的意图分类任务相结合，并利用基于 Transformer 句子编码器生成的嵌入相似性的方法。通常，这些编码器会针对意图分类任务进行微调，使用交叉熵损失。最近的研究表明，尽管这种方法为意图分类任务生成了合适的嵌入，但它往往会使范围内嵌入在整个句子嵌入空间中分散开来。这导致范围内嵌入可能与 OOS 嵌入重叠，从而增加了 OOS 拒绝的难度。当 OOS 数据未知时，这一问题更加复杂。为缓解此问题，我们的工作提出通过使用自编码器学习到的范围内嵌入重构损失来正则化交叉熵损失。我们的方法在不损害意图分类性能的情况下，在拒绝样本外 (OOS) 实例的精确召回曲线下的面积上实现了 1-4% 的改进。
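
The regularization idea described above can be sketched as a combined loss. This is a minimal illustration under stated assumptions, not the authors' implementation: mean squared error is assumed for the reconstruction term, the two terms are assumed to be added with a hypothetical `weight` hyperparameter, and `reconstructed` stands for the auto-encoder's output for an in-scope sentence embedding.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of a single example, via a numerically stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def reconstruction_loss(embedding, reconstructed):
    """Mean squared error between an embedding and its auto-encoder output."""
    return sum((e - r) ** 2 for e, r in zip(embedding, reconstructed)) / len(embedding)


def regularized_loss(logits, label, embedding, reconstructed, weight=0.1):
    """Intent-classification loss plus a weighted reconstruction penalty.

    `weight` is a hypothetical hyperparameter; the abstract gives no value.
    """
    return cross_entropy(logits, label) + weight * reconstruction_loss(embedding, reconstructed)


if __name__ == "__main__":
    print(regularized_loss([2.0, 0.5, -1.0], 0, [0.2, 0.4], [0.1, 0.5]))
```

The intuition suggested by the abstract is that the reconstruction term keeps in-scope embeddings in a compact region the auto-encoder can reproduce, so out-of-scope inputs are easier to separate.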

[NLP-33] SimpleToM: Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs

【速读】：该论文试图解决的问题是评估大型语言模型（LLMs）在社会互动中应用“心智理论”（Theory of Mind, ToM）的能力，即模型是否能隐式地运用对他人心理状态的理解来预测行为和判断行为的合理性。解决方案的关键在于创建了一个名为SimpleToM的新数据集，该数据集包含简短且多样化的故事，每个故事都附有三个问题，分别测试不同程度的心智理论推理。通过实验，论文发现尽管模型在预测心理状态方面表现良好，但在预测行为和判断行为合理性方面表现较差。论文提出通过特定的干预措施，如提醒模型之前的心理状态答案和心智状态特定的思维链提示，可以显著提高模型在行为预测和行为判断上的准确性，但这些干预措施需要针对具体任务进行设计，且模型的自然表现仍然较低，这为LLM的实际部署提供了警示。

链接: https://arxiv.org/abs/2410.13648
作者: Yuling Gu,Oyvind Tafjord,Hyunwoo Kim,Jared Moore,Ronan Le Bras,Peter Clark,Yejin Choi
关键词-EN: large language models, attribute mental states, prior work, work testing, mental state
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:While prior work has explored whether large language models (LLMs) possess a “theory of mind” (ToM) - the ability to attribute mental states to oneself and others - there has been little work testing whether LLMs can implicitly apply such knowledge to predict behavior, or to judge whether an observed behavior is rational. Such skills are critical for appropriate interaction in social environments. We create a new dataset, SimpleToM, containing concise, diverse stories (e.g., “The can of Pringles has moldy chips in it. Mary picks up the can in the supermarket and walks to the cashier.”), each with three questions that test different degrees of ToM reasoning, asking models to predict (a) mental state (“Is Mary aware of the mold?”), (b) behavior (“Will Mary pay for the chips or report the mold?”), and (c) judgment (“Mary paid for the chips. Was that reasonable?”). To our knowledge, SimpleToM is the first dataset to systematically explore downstream reasoning requiring knowledge of mental states in realistic scenarios. Our experimental results are intriguing: While most models can reliably predict mental state on our dataset (a), they often fail to correctly predict the behavior (b), and fare even worse at judging whether given behaviors are reasonable (c), even though being correctly aware of the protagonist’s mental state should make such secondary predictions obvious. We further show that we can help models do better at (b) and (c) via interventions such as reminding the model of its earlier mental state answer and mental-state-specific chain-of-thought prompting, raising the action prediction accuracies (e.g., from 49.5% to 93.5% for GPT-4o) and judgment accuracies (e.g., from 15.3% to 94.7% in GPT-4o). While this shows that models can be coaxed to perform well, it requires task-specific interventions, and the natural model performances remain low, a cautionary tale for LLM deployment.
摘要：尽管先前的工作探讨了大语言模型 (LLMs) 是否具备“心智理论” (Theory of Mind, ToM)，即将心理状态归因于自己和他人的能力，但很少有研究测试 LLMs 是否能隐式地将这种知识应用于预测行为，或判断观察到的行为是否合理。这些技能对于在社交环境中进行适当的互动至关重要。我们创建了一个新的数据集 SimpleToM，包含简洁、多样化的故事（例如，“一罐品客薯片里有发霉的薯片。玛丽在超市里拿起这罐薯片，走向收银台。”），每个故事都有三个问题，测试不同程度的心智理论推理，要求模型预测 (a) 心理状态（“玛丽是否意识到霉菌？”），(b) 行为（“玛丽会为薯片付款还是报告霉菌？”），以及 (c) 判断（“玛丽为薯片付款了。这是合理的吗？”）。据我们所知，SimpleToM 是第一个系统地探索在现实场景中需要心理状态知识的下游推理的数据集。我们的实验结果耐人寻味：尽管大多数模型在我们的数据集上能够可靠地预测心理状态 (a)，但它们往往无法正确预测行为 (b)，并且在判断给定行为是否合理 (c) 方面表现更差，而正确意识到主角的心理状态本应使这些次级预测显而易见。我们进一步表明，通过干预措施，例如提醒模型其先前给出的心理状态答案和特定于心理状态的思维链提示，可以提高模型在 (b) 和 (c) 上的表现，提升行为预测准确率（例如，GPT-4o 从 49.5% 提高到 93.5%）和判断准确率（例如，GPT-4o 从 15.3% 提高到 94.7%）。虽然这表明模型可以被引导至良好表现，但这需要针对具体任务的干预措施，而模型的自然表现仍然较低，这对 LLM 的部署是一个警示。

[NLP-34] An Active Learning Framework for Inclusive Generation by Large Language Models

【速读】：该论文旨在解决大型语言模型（LLMs）在生成文本时未能充分代表多样性子群体的问题，特别是在训练数据中缺乏与代表性不足群体相关的关键概念时。解决方案的关键在于提出了一种基于聚类的主动学习框架，并结合知识蒸馏技术。该框架通过转换学习者模型的中间输出，首次实现了生成任务的有效主动学习。通过聚类和知识蒸馏的结合，该方法能够在无需预先了解数据分布的情况下，减少人工干预，生成更具代表性的模型。实验结果表明，该方法在性能上比基线模型提升了2%-10%，并在不同数据子群体中表现出更一致的性能和更高的词汇多样性，同时还能提升未参与学习循环的次级模型的性能。

链接: https://arxiv.org/abs/2410.13641
作者: Sabit Hassan,Anthony Sicilia,Malihe Alikhani
关键词-EN: Large Language Models, Ensuring that Large, Large Language, key concepts related, generate text representative
类目: Computation and Language (cs.CL)
备注:

Abstract:Ensuring that Large Language Models (LLMs) generate text representative of diverse sub-populations is essential, particularly when key concepts related to under-represented groups are scarce in the training data. We address this challenge with a novel clustering-based active learning framework, enhanced with knowledge distillation. The proposed framework transforms the intermediate outputs of the learner model, enabling effective active learning for generative tasks for the first time. Integration of clustering and knowledge distillation yields more representative models without prior knowledge of underlying data distribution and overbearing human efforts. We validate our approach in practice through case studies in counter-narration and style transfer. We construct two new datasets in tandem with model training, showing a performance improvement of 2%-10% over baseline models. Our results also show more consistent performance across various data subgroups and increased lexical diversity, underscoring our model’s resilience to skewness in available data. Further, our results show that the data acquired via our approach improves the performance of secondary models not involved in the learning loop, showcasing practical utility of the framework.
摘要：确保大语言模型 (LLM) 生成的文本能够代表多样化的子群体至关重要，特别是在与代表性不足群体相关的关键概念在训练数据中稀缺的情况下。我们通过一种新颖的基于聚类的主动学习框架来应对这一挑战，该框架结合了知识蒸馏技术。所提出的框架转换了学习者模型的中间输出，首次实现了生成任务的有效主动学习。聚类与知识蒸馏的结合使得模型更具代表性，无需预先了解底层数据分布，也无需过多的人工干预。我们通过反叙事和风格迁移的案例研究在实践中验证了我们的方法。我们在模型训练的同时构建了两个新数据集，显示性能比基线模型提高了 2%-10%。我们的结果还表明，在各种数据子群体中，性能更加一致，词汇多样性增加，突显了我们的模型对可用数据偏斜的适应能力。此外，我们的结果显示，通过我们的方法获取的数据能够提升未参与学习循环的次级模型的性能，展示了该框架的实际应用价值。

[NLP-35] Latent Space Chain-of-Embedding Enables Output-free LLM Self-Evaluation

【速读】：该论文试图解决大语言模型（LLM）在部署中的可靠性问题，特别是如何在没有外部反馈的情况下进行自我评估。解决方案的关键在于提出了Chain-of-Embedding（CoE）方法，通过分析LLM在推理过程中产生的所有渐进隐藏状态（即潜在的思考路径），来识别正确和错误响应之间的差异。这些差异有助于估计LLM响应的正确性，从而实现无需输出的自我评估。该方法无需训练，计算成本低，适用于大规模实时场景。

链接: https://arxiv.org/abs/2410.13640
作者: Yiming Wang,Pei Zhang,Baosong Yang,Derek F. Wong,Rui Wang
关键词-EN: LLM self-evaluation relies, deployment reliability, LLM response correctness, ability to estimate, greatly improve
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 33 pages, 18 figures, 12 tables

Abstract:LLM self-evaluation relies on the LLM’s own ability to estimate response correctness, which can greatly improve its deployment reliability. In this research track, we propose the Chain-of-Embedding (CoE) in the latent space to enable LLMs to perform output-free self-evaluation. CoE consists of all progressive hidden states produced during the inference time, which can be treated as the latent thinking path of LLMs. We find that when LLMs respond correctly and incorrectly, their CoE features differ, and these discrepancies assist us in estimating LLM response correctness. Experiments in four diverse domains and seven LLMs fully demonstrate the effectiveness of our method. Meanwhile, its label-free design, which requires no training, and its millisecond-level computational cost ensure real-time feedback in large-scale scenarios. More importantly, we provide interesting insights into LLM response correctness from the perspective of hidden state changes inside LLMs.
摘要：大语言模型 (LLM) 的自评估依赖于其自身对响应正确性的估计能力，这可以显著提高其部署的可靠性。在这一研究方向上，我们提出了在潜在空间中的嵌入链 (Chain-of-Embedding, CoE)，以使大语言模型能够进行无需输出的自我评估。CoE 由推理过程中产生的所有渐进隐藏状态组成，可以视为大语言模型的潜在思维路径。我们发现，当大语言模型正确和错误响应时，其 CoE 特征存在差异，这些差异有助于我们估计大语言模型响应的正确性。在四个不同领域和七个大语言模型上的实验充分证明了我们方法的有效性。同时，其无需标签、无需任何训练的设计以及毫秒级的计算成本，确保了在大规模场景中的实时反馈。更重要的是，我们从大语言模型内部隐藏状态变化的角度，提供了对大语言模型响应正确性的有趣见解。
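
To make the idea of a "latent thinking path" concrete, here is a small sketch that turns a sequence of per-layer hidden states into trajectory features. The specific features (step magnitude and angle between consecutive states) are assumptions chosen for illustration, since the abstract does not define the CoE features, and `trajectory_features` is a name invented here.

```python
import math


def trajectory_features(hidden_states):
    """Describe a hidden-state path by per-step magnitude and angle changes.

    hidden_states: list of equal-length vectors, one per layer (toy inputs here).
    Returns (magnitudes, angles), each with len(hidden_states) - 1 entries.
    """
    magnitudes, angles = [], []
    for prev, curr in zip(hidden_states, hidden_states[1:]):
        # Euclidean length of the step between consecutive layers.
        diff = [c - p for p, c in zip(prev, curr)]
        magnitudes.append(math.sqrt(sum(d * d for d in diff)))
        # Angle between the two states, clamped against rounding error.
        dot = sum(p * c for p, c in zip(prev, curr))
        norm = math.sqrt(sum(p * p for p in prev)) * math.sqrt(sum(c * c for c in curr))
        cos = max(-1.0, min(1.0, dot / norm)) if norm > 0 else 1.0
        angles.append(math.acos(cos))
    return magnitudes, angles


if __name__ == "__main__":
    print(trajectory_features([[1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]))
```

Features like these could then be compared between correct and incorrect responses; how the paper scores them is not stated in the abstract and is not reproduced here.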

[NLP-36] A Comparative Study on Reasoning Patterns of OpenAIs o1 Model

【速读】：该论文试图解决的问题是如何在不显著增加模型参数的情况下，提升大型语言模型（LLMs）在复杂任务（如编程和数学问题）中的推理能力。解决方案的关键在于探索和优化推理策略，特别是测试时计算方法（Test-time Compute methods），如OpenAI的o1模型。通过对比o1与其他推理方法（如BoN、Step-wise BoN、Agent Workflow和Self-Refine）在数学、编程和常识推理等领域的性能，研究发现o1在大多数数据集上表现最佳，主要得益于其高效的推理模式和优化的搜索空间。此外，论文还总结了o1的六种推理模式，并提供了详细的分析，揭示了这些推理策略在提升模型性能中的重要作用。

链接: https://arxiv.org/abs/2410.13639
作者: Siwei Wu,Zhongyuan Peng,Xinrun Du,Tuney Zheng,Minghao Liu,Jialong Wu,Jiachen Ma,Yizhi Li,Jian Yang,Wangchunshu Zhou,Qunshu Lin,Junbo Zhao,Zhaoxiang Zhang,Wenhao Huang,Ge Zhang,Chenghua Lin,J.H. Liu
关键词-EN: Enabling Large Language, Large Language Models, Enabling Large, Large Language, drawn great attention
类目: Computation and Language (cs.CL)
备注:

Abstract:Enabling Large Language Models (LLMs) to handle a wider range of complex tasks (e.g., coding, math) has drawn great attention from many researchers. As LLMs continue to evolve, merely increasing the number of model parameters yields diminishing performance improvements and heavy computational costs. Recently, OpenAI’s o1 model has shown that inference strategies (i.e., Test-time Compute methods) can also significantly enhance the reasoning capabilities of LLMs. However, the mechanisms behind these methods are still unexplored. In our work, to investigate the reasoning patterns of o1, we compare o1 with existing Test-time Compute methods (BoN, Step-wise BoN, Agent Workflow, and Self-Refine) by using OpenAI’s GPT-4o as a backbone on general reasoning benchmarks in three domains (i.e., math, coding, commonsense reasoning). Specifically, first, our experiments show that the o1 model has achieved the best performance on most datasets. Second, as for the methods of searching diverse responses (e.g., BoN), we find the reward models’ capability and the search space both limit the upper boundary of these methods. Third, as for the methods that break the problem into many sub-problems, the Agent Workflow has achieved better performance than Step-wise BoN due to the domain-specific system prompt for planning better reasoning processes. Fourth, it is worth mentioning that we have summarized six reasoning patterns of o1, and provided a detailed analysis on several reasoning benchmarks.
摘要：使大语言模型 (LLMs) 能够处理更广泛的复杂任务（例如，编码、数学）引起了许多研究者的极大关注。随着 LLMs 的不断发展，仅仅增加模型参数的数量会导致性能提升的边际效应递减和计算成本的显著增加。最近，OpenAI 的 o1 模型展示了推理策略（即测试时计算方法）也能显著增强 LLMs 的推理能力。然而，这些方法背后的机制仍未被探索。在我们的工作中，为了研究 o1 的推理模式，我们通过使用 OpenAI 的 GPT-4o 作为骨干模型，在三个领域的通用推理基准（即数学、编码、常识推理）上比较了 o1 与现有的测试时计算方法（BoN、逐步 BoN、Agent Workflow 和 Self-Refine）。具体来说，首先，我们的实验表明 o1 模型在大多数数据集上取得了最佳性能。其次，对于搜索多样化响应的方法（例如 BoN），我们发现奖励模型的能力和搜索空间都限制了这些方法的上限。第三，对于将问题分解为多个子问题的方法，Agent Workflow 由于其针对规划更好推理过程的领域特定系统提示，表现优于逐步 BoN。第四，值得一提的是，我们总结了 o1 的六种推理模式，并对几个推理基准进行了详细分析。
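
The Best-of-N (BoN) baseline named above can be sketched in a few lines. This is a generic illustration rather than the paper's setup: `toy_reward` is a hypothetical stand-in for a reward model, and the candidate answers are supplied directly instead of being sampled from an LLM.

```python
def best_of_n(candidates, reward_fn):
    """Return the candidate that the reward function scores highest."""
    if not candidates:
        raise ValueError("best_of_n needs at least one candidate")
    # max() keeps the first candidate on ties.
    return max(candidates, key=reward_fn)


def toy_reward(answer):
    """Hypothetical reward: prefers the correct answer to '2 + 2'."""
    return 1.0 if answer.strip() == "4" else 0.0


if __name__ == "__main__":
    print(best_of_n(["3", "4", "5"], toy_reward))
    print(best_of_n(["3", "5"], toy_reward))
```

The second call shows why both factors reported in the abstract matter: if no sampled candidate is correct (search space), or if the reward function cannot rank the correct one first (reward model capability), selection cannot recover the right answer.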

[NLP-37] H2OVL-Mississippi Vision Language Models Technical Report

【速读】：该论文试图解决在隐私保护和设备端应用中，小型视觉-语言模型（VLMs）的高效运行问题。解决方案的关键在于提出了H2OVL-Mississippi系列模型，包括8亿参数的H2OVL-Mississippi-0.8B和20亿参数的H2OVL-Mississippi-2B，这些模型通过在3700万图像-文本对上进行训练，显著提升了文本识别和视觉理解能力，特别是在OCRBench的文本识别部分达到了最先进水平，并且这些模型基于Apache 2.0许可证开放，旨在普及文档AI和视觉LLMs的应用。

链接: https://arxiv.org/abs/2410.13611
作者: Shaikat Galib,Shanshan Wang,Guanshuo Xu,Pascal Pfeiffer,Ryan Chesler,Mark Landry,Sri Satish Ambati
关键词-EN: on-device applications due, Smaller vision-language models, processing enterprise commercial, Smaller vision-language, enterprise commercial documents
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:Smaller vision-language models (VLMs) are becoming increasingly important for privacy-focused, on-device applications due to their ability to run efficiently on consumer hardware for processing enterprise commercial documents and images. These models require strong language understanding and visual capabilities to enhance human-machine interaction. To address this need, we present H2OVL-Mississippi, a pair of small VLMs trained on 37 million image-text pairs using 240 hours of compute on 8 x H100 GPUs. H2OVL-Mississippi-0.8B is a tiny model with 0.8 billion parameters that specializes in text recognition, achieving state of the art performance on the Text Recognition portion of OCRBench and surpassing much larger models in this area. Additionally, we are releasing H2OVL-Mississippi-2B, a 2 billion parameter model for general use cases, exhibiting highly competitive metrics across various academic benchmarks. Both models build upon our prior work with H2O-Danube language models, extending their capabilities into the visual domain. We release them under the Apache 2.0 license, making VLMs accessible to everyone, democratizing document AI and visual LLMs.
摘要：随着隐私保护和设备端应用的需求日益增长，小型视觉-语言模型 (Vision-Language Models, VLMs) 在处理企业商业文档和图像方面因其能够在消费级硬件上高效运行而变得越来越重要。这些模型需要强大的语言理解和视觉能力，以增强人机交互。为了满足这一需求，我们推出了 H2OVL-Mississippi，这是一对基于 3700 万图像-文本对训练的小型 VLMs，使用了 8 块 H100 GPU 进行 240 小时的计算。H2OVL-Mississippi-0.8B 是一个仅有 8 亿参数的微型模型，专注于文本识别，在 OCRBench 的文本识别部分达到了最先进的性能，并在此领域超越了许多更大规模的模型。此外，我们还发布了 H2OVL-Mississippi-2B，这是一个 20 亿参数的模型，适用于通用场景，在各种学术基准测试中表现出高度竞争力。这两个模型均基于我们之前在 H2O-Danube 语言模型上的工作，将其能力扩展到视觉领域。我们以 Apache 2.0 许可证发布这些模型，使 VLMs 对所有人开放，从而实现文档 AI 和视觉大语言模型的民主化。

[NLP-38] MeNTi: Bridging Medical Calculator and LLM Agent with Nested Tool Calling

【速读】：该论文试图解决在医学领域中，仅依赖工具无法完全应对复杂现实问题的情况，特别是在使用医疗计算器进行健康状态评估时。解决方案的关键在于引入MeNTi，这是一种通用代理架构，专门为大型语言模型（LLMs）设计。MeNTi通过集成专用医疗工具包，并采用元工具和嵌套调用机制，增强了LLM的工具利用能力。具体来说，MeNTi实现了灵活的工具选择和嵌套工具调用，以应对复杂的医疗场景中的实际问题，如计算器选择、槽填充和单位转换。此外，论文还引入了CalcQA基准，用于评估LLM在临床过程中使用医疗计算器进行定量评估的能力，从而验证了MeNTi框架的显著性能提升。

链接: https://arxiv.org/abs/2410.13610
作者: Yakun Zhu,Shaohang Wei,Xu Wang,Kui Xue,Xiaofan Zhang,Shaoting Zhang
关键词-EN: Large Language Models, Language Models, Large Language, Integrating tools, widespread application
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:Integrating tools into Large Language Models (LLMs) has facilitated their widespread application. Despite this, in specialized downstream task contexts, reliance solely on tools is insufficient to fully address the complexities of the real world. This particularly restricts the effective deployment of LLMs in fields such as medicine. In this paper, we focus on the downstream tasks of medical calculators, which use standardized tests to assess an individual’s health status. We introduce MeNTi, a universal agent architecture for LLMs. MeNTi integrates a specialized medical toolkit and employs meta-tool and nested calling mechanisms to enhance LLM tool utilization. Specifically, it achieves flexible tool selection and nested tool calling to address practical issues faced in intricate medical scenarios, including calculator selection, slot filling, and unit conversion. To assess the capabilities of LLMs for quantitative assessment throughout the clinical process of calculator scenarios, we introduce CalcQA. This benchmark requires LLMs to use medical calculators to perform calculations and assess patient health status. CalcQA is constructed by professional physicians and includes 100 case-calculator pairs, complemented by a toolkit of 281 medical tools. The experimental results demonstrate significant performance improvements with our framework. This research paves new directions for applying LLMs in demanding scenarios of medicine.
摘要：将工具集成到大语言模型 (LLM) 中已促进了其广泛应用。尽管如此，在专业下游任务的背景下，仅依赖工具不足以完全解决现实世界的复杂性。这在医学领域尤为明显，限制了 LLM 的有效部署。本文聚焦于医疗计算器的下游任务，这些计算器使用标准化测试来评估个体的健康状况。我们引入了 MeNTi，这是一种适用于 LLM 的通用智能体架构。MeNTi 集成了专门的医疗工具包，并采用元工具和嵌套调用机制来增强 LLM 的工具利用率。具体而言，它实现了灵活的工具选择和嵌套工具调用，以应对复杂医疗场景中的实际问题，包括计算器选择、槽位填充和单位转换。为了评估 LLM 在整个计算器场景临床过程中的定量评估能力，我们引入了 CalcQA。该基准要求 LLM 使用医疗计算器进行计算并评估患者健康状况。CalcQA 由专业医师构建，包含 100 个病例-计算器对，并辅以包含 281 个医疗工具的工具包。实验结果显示，我们的框架显著提升了性能。这项研究为在医学的苛刻场景中应用 LLM 开辟了新的方向。
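
The combination of calculator selection, slot filling, and unit conversion described above can be pictured with a toy example of nested tool calling. Everything here is hypothetical and unrelated to the actual MeNTi toolkit: a body-mass-index calculator (weight in kilograms divided by the square of height in metres) is registered in a small `TOOLKIT`, and it calls a `convert_unit` tool itself whenever a slot arrives in a non-standard unit.

```python
# Exact conversion factors for the units used in this toy example.
UNIT_FACTORS = {
    ("lb", "kg"): 0.45359237,
    ("in", "m"): 0.0254,
    ("cm", "m"): 0.01,
}


def convert_unit(value, source, target):
    """Inner tool: convert a value between two known units."""
    if source == target:
        return value
    return value * UNIT_FACTORS[(source, target)]


def bmi_calculator(weight, weight_unit, height, height_unit):
    """Outer tool: BMI = kg / m^2, with nested calls to the conversion tool."""
    kg = convert_unit(weight, weight_unit, "kg")
    m = convert_unit(height, height_unit, "m")
    return kg / (m * m)


# Hypothetical registry an agent would select a calculator from.
TOOLKIT = {"bmi": bmi_calculator}


def call_tool(name, **slots):
    """Dispatch a tool by name with the slots the agent has filled in."""
    return TOOLKIT[name](**slots)


if __name__ == "__main__":
    print(call_tool("bmi", weight=70, weight_unit="kg", height=175, height_unit="cm"))
```

In an agent setting, the LLM would choose the tool name and fill the slots from the patient case; the nesting keeps unit handling out of the model's free-text arithmetic.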

[NLP-39] Large Language Models as Narrative-Driven Recommenders

【速读】：该论文试图解决在电影推荐系统中，如何利用大型语言模型（LLMs）处理用户自由文本请求的问题。解决方案的关键在于比较不同规模和来源的LLMs（如Llama 3.2和GPT-4o）在处理此类请求时的表现，并评估多种提示策略（如零样本、身份和少样本提示）的效果。研究结果表明，LLMs能够生成与上下文相关的电影推荐，显著优于传统的推荐方法（如doc2vec），其中闭源和大规模参数模型表现最佳，而中等规模的开源模型也表现出竞争力。此外，简单的零样本提示策略在大多数模型中表现出色，证明了其在叙事驱动推荐中的有效性。

链接: https://arxiv.org/abs/2410.13604
作者: Lukas Eberhard,Thorsten Ruprechter,Denis Helic
关键词-EN: Shutter Island, user requests expressed, provide personalized suggestions, mind-bending story, aim to provide
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Under review; 19 pages

Abstract:Narrative-driven recommenders aim to provide personalized suggestions for user requests expressed in free-form text such as “I want to watch a thriller with a mind-bending story, like Shutter Island.” Although large language models (LLMs) have been shown to excel in processing general natural language queries, their effectiveness for handling such recommendation requests remains relatively unexplored. To close this gap, we compare the performance of 38 open- and closed-source LLMs of various sizes, such as Llama 3.2 and GPT-4o, in a movie recommendation setting. For this, we utilize a gold-standard, crowdworker-annotated dataset of posts from Reddit’s movie suggestion community and employ various prompting strategies, including zero-shot, identity, and few-shot prompting. Our findings demonstrate the ability of LLMs to generate contextually relevant movie recommendations, significantly outperforming other state-of-the-art approaches, such as doc2vec. While we find that closed-source and large-parameterized models generally perform best, medium-sized open-source models remain competitive, being only slightly outperformed by their more computationally expensive counterparts. Furthermore, we observe no significant differences across prompting strategies for most models, underscoring the effectiveness of simple approaches such as zero-shot prompting for narrative-driven recommendations. Overall, this work offers valuable insights for recommender system researchers as well as practitioners aiming to integrate LLMs into real-world recommendation tools.
摘要：叙事驱动推荐系统旨在根据用户以自由文本形式表达的需求，提供个性化的建议，例如“我想看一部剧情扭曲的心理惊悚片，像《禁闭岛》那样。”尽管大语言模型 (LLMs) 在处理一般自然语言查询方面表现出色，但它们在处理此类推荐请求方面的有效性仍相对未被深入探索。为了填补这一空白，我们在电影推荐场景中比较了 38 个不同规模的开放和闭源 LLMs 的性能，如 LLama 3.2 和 GPT-4o。为此，我们利用了一个来自 reddit 电影建议社区的帖子数据集，该数据集经过众包工人标注，并采用了多种提示策略，包括零样本 (zero-shot)、身份 (identity) 和少样本 (few-shot) 提示。我们的研究结果表明，LLMs 能够生成与上下文相关的电影推荐，显著优于其他最先进的方法，如 doc2vec。尽管我们发现闭源和大规模参数模型总体表现最佳，但中等规模的开放源模型仍然具有竞争力，仅略逊于其计算成本更高的同类模型。此外，我们观察到大多数模型在不同提示策略之间没有显著差异，这突显了简单方法（如零样本提示）在叙事驱动推荐中的有效性。总体而言，这项工作为推荐系统研究人员以及希望将 LLMs 整合到实际推荐工具中的从业者提供了宝贵的见解。

[NLP-40] Enhancing Fact Retrieval in PLMs through Truthfulness

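TL;DR: This paper aims to improve the retrieval of factual knowledge from pre-trained language models (PLMs). The key idea is a helper model that assesses the truthfulness of an input from the PLM's corresponding hidden-state representations; this truthfulness signal is then used to improve fact retrieval. Experiments on several masked PLMs show that the approach improves fact retrieval by up to 33%.

Link: https://arxiv.org/abs/2410.13562
Authors: Paul Youssef, Jörg Schlötterer, Christin Seifert
Keywords: Pre-trained Language Models, Pre-trained Language, pre-training phase, trained to predict, missing word
Categories: Computation and Language (cs.CL)
Comments: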

Abstract:Pre-trained Language Models (PLMs) encode various facts about the world at their pre-training phase as they are trained to predict the next or missing word in a sentence. There has been an interest in quantifying and improving the amount of facts that can be extracted from PLMs, as they have been envisioned to act as soft knowledge bases, which can be queried in natural language. Different approaches exist to enhance fact retrieval from PLMs. Recent work shows that the hidden states of PLMs can be leveraged to determine the truthfulness of the PLMs’ inputs. Leveraging this finding to improve factual knowledge retrieval remains unexplored. In this work, we investigate the use of a helper model to improve fact retrieval. The helper model assesses the truthfulness of an input based on the corresponding hidden states representations from the PLMs. We evaluate this approach on several masked PLMs and show that it enhances fact retrieval by up to 33%. Our findings highlight the potential of hidden states representations from PLMs in improving their factual knowledge retrieval.
摘要：预训练语言模型 (Pre-trained Language Models, PLMs) 在其预训练阶段编码了关于世界的各种事实，因为它们被训练来预测句子中的下一个或缺失的词。人们一直对量化和提高从 PLM 中提取事实的数量感兴趣，因为它们被设想为可以作为软知识库，能够以自然语言进行查询。存在不同的方法来增强从 PLM 中提取事实的能力。最近的研究表明，PLM 的隐藏状态可以用来确定 PLM 输入的真实性。利用这一发现来改进事实知识检索仍未被探索。在这项工作中，我们研究了使用辅助模型来改进事实检索的方法。辅助模型根据 PLM 的相应隐藏状态表示来评估输入的真实性。我们在多个掩码 PLM 上评估了这种方法，并表明它将事实检索能力提高了多达 33%。我们的研究结果突显了 PLM 的隐藏状态表示在改进其事实知识检索方面的潜力。

[NLP-41] Integrating Temporal Representations for Dynamic Memory Retrieval and Management in Large Language Models

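TL;DR: This paper addresses ineffective memory recall in conventional dialogue agents, which leads to redundant retrieval and inadequate management of unique user associations. It proposes SynapticRAG, which integrates synaptic dynamics into Retrieval-Augmented Generation (RAG). SynapticRAG integrates temporal representations into memory vectors, mimicking biological synapses by differentiating events by their occurrence times and dynamically updating memory significance. The model uses temporal scoring for memory connections and a synapse-inspired propagation control mechanism. Experiments show an improvement of up to 14.66% in memory retrieval accuracy over existing methods, including traditional RAG.

Link: https://arxiv.org/abs/2410.13553
Authors: Yuki Hou, Haruki Tamoto, Homei Miyashita
Keywords: unique user associations, Conventional dialogue agents, effective memory recall, leading to redundant, user associations
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: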

Abstract:Conventional dialogue agents often struggle with effective memory recall, leading to redundant retrieval and inadequate management of unique user associations. To address this, we propose SynapticRAG, a novel approach integrating synaptic dynamics into Retrieval-Augmented Generation (RAG). SynapticRAG integrates temporal representations into memory vectors, mimicking biological synapses by differentiating events based on occurrence times and dynamically updating memory significance. This model employs temporal scoring for memory connections and a synaptic-inspired propagation control mechanism. Experiments across English, Japanese, and Chinese datasets demonstrate SynapticRAG’s superiority over existing methods, including traditional RAG, with up to 14.66% improvement in memory retrieval accuracy. Our approach advances context-aware dialogue AI systems by enhancing long-term context maintenance and specific information extraction from conversations.
摘要： 传统的对话代理在有效记忆检索方面常常遇到困难，导致冗余检索和独特用户关联管理不足。为解决这一问题，我们提出了 SynapticRAG，这是一种将突触动力学融入检索增强生成 (RAG) 的新方法。SynapticRAG 将时间表示整合到记忆向量中，通过根据事件发生时间区分事件并动态更新记忆重要性，模拟生物突触。该模型采用时间评分机制进行记忆连接，并引入突触启发式的传播控制机制。在英语、日语和中文数据集上的实验表明，SynapticRAG 优于现有方法，包括传统的 RAG，记忆检索准确率提高了高达 14.66%。我们的方法通过增强长期上下文维护和从对话中提取特定信息，推动了上下文感知对话 AI 系统的发展。
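
The general idea of time-aware memory retrieval described above can be sketched as follows. This is only an illustration and not SynapticRAG's actual scoring: the exponential decay, the `tau` time constant, and the function names `cosine` and `temporal_score` are all assumptions introduced here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def temporal_score(query_vec, memory_vec, query_time, memory_time, tau=3600.0):
    """Hypothetical score: semantic similarity damped by the time gap (seconds)."""
    decay = math.exp(-abs(query_time - memory_time) / tau)
    return cosine(query_vec, memory_vec) * decay

# Two memories with identical content but different timestamps (illustrative data).
memories = [
    {"id": "old", "vec": [1.0, 0.0], "t": 0.0},
    {"id": "recent", "vec": [1.0, 0.0], "t": 7000.0},
]
query_vec, query_time = [1.0, 0.0], 7200.0
ranked = sorted(
    memories,
    key=lambda m: temporal_score(query_vec, m["vec"], query_time, m["t"]),
    reverse=True,
)
print([m["id"] for m in ranked])
```

With identical embeddings, the memory closer in time to the query ranks first, which is the kind of time-based differentiation between events that the abstract describes.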

[NLP-42] Bias in the Mirror: Are LLMs opinions robust to their own adversarial attacks?

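TL;DR: This paper studies how robust the biases of large language models (LLMs) are during interactions. Its key contribution is a self-debate setup: two instances of an LLM argue opposing viewpoints and try to persuade a neutral version of the model. This setup is used to evaluate how firmly biases hold and whether models are susceptible to reinforcing misinformation or shifting to harmful viewpoints. The experiments span multiple LLMs of varying sizes, origins, and languages, giving insight into bias persistence and flexibility across linguistic and cultural contexts.

Link: https://arxiv.org/abs/2410.13517
Authors: Virgile Rennard, Christos Xypolopoulos, Michalis Vazirgiannis
Keywords: Large language models, alignment processes, influencing their responses, Large language, training data
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: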

Abstract:Large language models (LLMs) inherit biases from their training data and alignment processes, influencing their responses in subtle ways. While many studies have examined these biases, little work has explored their robustness during interactions. In this paper, we introduce a novel approach where two instances of an LLM engage in self-debate, arguing opposing viewpoints to persuade a neutral version of the model. Through this, we evaluate how firmly biases hold and whether models are susceptible to reinforcing misinformation or shifting to harmful viewpoints. Our experiments span multiple LLMs of varying sizes, origins, and languages, providing deeper insights into bias persistence and flexibility across linguistic and cultural contexts.
摘要：大语言模型 (LLMs) 从其训练数据和对齐过程中继承了偏见，这些偏见以微妙的方式影响其响应。尽管许多研究已经考察了这些偏见，但很少有工作探讨它们在交互过程中的鲁棒性。在本文中，我们提出了一种新颖的方法，其中两个 LLM 实例进行自我辩论，各自持相反观点以说服模型的中性版本。通过这种方式，我们评估了偏见的牢固程度以及模型是否容易强化错误信息或转向有害观点。我们的实验涵盖了多种不同大小、来源和语言的 LLM，提供了关于偏见在不同语言和文化背景下持久性和灵活性的深入见解。

[NLP-43] GeoCoder: Solving Geometry Problems by Generating Modular Code through Vision-Language Models

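TL;DR: This paper addresses the difficulties vision-language models (VLMs) have with geometry problems, in particular performing mathematical operations not seen during pre-training and correctly applying geometry formulas. It proposes GeoCoder, which uses modular code-finetuning to generate and execute code built on a predefined geometry function library, yielding accurate and deterministic calculations. Executing code removes the stochasticity of autoregressive token prediction from the computation itself, and the function library reduces errors in formula usage. The paper also proposes RAG-GeoCoder, a retrieval-augmented variant with a non-parametric memory module that retrieves functions from the geometry library, reducing reliance on parametric memory.

Link: https://arxiv.org/abs/2410.13510
Authors: Aditya Sharma, Aman Dalmia, Mehran Kazemi, Amal Zouaq, Christopher J. Pal
Keywords: problem-solving demands advanced, mathematical knowledge effectively, process multimodal inputs, employ mathematical knowledge, Geometry problem-solving demands
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: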

Abstract:Geometry problem-solving demands advanced reasoning abilities to process multimodal inputs and employ mathematical knowledge effectively. Vision-language models (VLMs) have made significant progress in various multimodal tasks. Yet, they still struggle with geometry problems and are significantly limited by their inability to perform mathematical operations not seen during pre-training, such as calculating the cosine of an arbitrary angle, and by difficulties in correctly applying relevant geometry formulas. To overcome these challenges, we present GeoCoder, which leverages modular code-finetuning to generate and execute code using a predefined geometry function library. By executing the code, we achieve accurate and deterministic calculations, contrasting the stochastic nature of autoregressive token prediction, while the function library minimizes errors in formula usage. We also propose a multimodal retrieval-augmented variant of GeoCoder, named RAG-GeoCoder, which incorporates a non-parametric memory module for retrieving functions from the geometry library, thereby reducing reliance on parametric memory. Our modular code-finetuning approach enhances the geometric reasoning capabilities of VLMs, yielding an average improvement of over 16% across various question complexities on the GeomVerse dataset compared to other finetuning methods.
摘要：几何问题求解需要高级推理能力来处理多模态输入并有效运用数学知识。视觉语言模型 (Vision-language models, VLMs) 在各种多模态任务中取得了显著进展。然而，它们在几何问题上仍面临挑战，主要受限于无法执行预训练阶段未见过的数学运算，例如计算任意角度的余弦值，以及难以正确应用相关几何公式。为克服这些挑战，我们提出了 GeoCoder，它利用模块化代码微调来生成并执行代码，使用预定义的几何函数库。通过执行代码，我们实现了精确且确定性的计算，与自回归 Token 预测的随机性形成对比，同时函数库最小化了公式使用中的错误。我们还提出了一种多模态检索增强的 GeoCoder 变体，名为 RAG-GeoCoder，它包含一个非参数记忆模块，用于从几何库中检索函数，从而减少对参数化记忆的依赖。我们的模块化代码微调方法增强了 VLMs 的几何推理能力，在 GeomVerse 数据集上，与其它微调方法相比，在各种问题复杂度上平均提升了超过 16%。
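
To make the "predefined function library plus generated modular code" idea concrete, here is a minimal sketch. The function names (`law_of_cosines_side`, `triangle_area_sas`) and the `solve` program are hypothetical and are not taken from GeoCoder's actual library; they only illustrate why calling a vetted function is less error-prone than reproducing a formula token by token.

```python
import math

# --- Hypothetical geometry function library (names are illustrative) ---
def law_of_cosines_side(a, b, angle_deg):
    """Length of the side opposite the angle (in degrees) between sides a and b."""
    return math.sqrt(a * a + b * b - 2 * a * b * math.cos(math.radians(angle_deg)))

def triangle_area_sas(a, b, angle_deg):
    """Area of a triangle from two sides and the included angle (in degrees)."""
    return 0.5 * a * b * math.sin(math.radians(angle_deg))

# --- The kind of short modular program a model might emit for one problem ---
def solve():
    # Problem: sides 3 and 4 enclose a 90-degree angle; find the third side and the area.
    third_side = law_of_cosines_side(3.0, 4.0, 90.0)
    area = triangle_area_sas(3.0, 4.0, 90.0)
    return third_side, area

third_side, area = solve()
print(round(third_side, 6), round(area, 6))
```

The cosine of an arbitrary angle is computed by the interpreter rather than predicted by the model, which is the deterministic-calculation benefit the abstract points to.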

[NLP-44] RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards

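TL;DR: This paper addresses a weakness of supervised fine-tuning (SFT) for retrieval-augmented generation (RAG) with large language models (LLMs): it overlooks the differing data preferences of the modules in a RAG system. It proposes Differentiable Data Rewards (DDR), which trains RAG systems end to end by aligning data preferences between modules. DDR collects rewards with a rollout method and uses them to optimize each agent so that its outputs improve the performance of the whole RAG system. It significantly outperforms SFT on knowledge-intensive tasks, particularly for smaller-scale LLMs, which depend more on retrieved knowledge.

Link: https://arxiv.org/abs/2410.13509
Authors: Xinze Li, Sen Mei, Zhenghao Liu, Yukun Yan, Shuo Wang, Shi Yu, Zheni Zeng, Hao Chen, Ge Yu, Zhiyuan Liu, Maosong Sun, Chenyan Xiong
Keywords: Large Language Models, Language Models, Large Language, hallucinations in Large, RAG
Categories: Computation and Language (cs.CL)
Comments: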

Abstract:Retrieval-Augmented Generation (RAG) has proven its effectiveness in mitigating hallucinations in Large Language Models (LLMs) by retrieving knowledge from external resources. To adapt LLMs for RAG pipelines, current approaches use instruction tuning to optimize LLMs, improving their ability to utilize retrieved knowledge. This supervised fine-tuning (SFT) approach focuses on equipping LLMs to handle diverse RAG tasks using different instructions. However, it trains RAG modules to overfit training signals and overlooks the varying data preferences among agents within the RAG system. In this paper, we propose a Differentiable Data Rewards (DDR) method, which end-to-end trains RAG systems by aligning data preferences between different RAG modules. DDR works by collecting the rewards to optimize each agent with a rollout method. This method prompts agents to sample some potential responses as perturbations, evaluates the impact of these perturbations on the whole RAG system, and subsequently optimizes the agent to produce outputs that improve the performance of the RAG system. Our experiments on various knowledge-intensive tasks demonstrate that DDR significantly outperforms the SFT method, particularly for LLMs with smaller-scale parameters that depend more on the retrieved knowledge. Additionally, DDR exhibits a stronger capability to align the data preference between RAG modules. The DDR method makes generation module more effective in extracting key information from documents and mitigating conflicts between parametric memory and external knowledge. All codes are available at this https URL.
摘要：检索增强生成 (Retrieval-Augmented Generation, RAG) 通过从外部资源中检索知识，已证明其在减少大语言模型 (Large Language Models, LLMs) 中的幻觉现象方面的有效性。为了使 LLMs 适应 RAG 流程，当前方法使用指令调优来优化 LLMs，以提高其利用检索知识的能力。这种监督微调 (Supervised Fine-Tuning, SFT) 方法侧重于使 LLMs 能够使用不同指令处理多样化的 RAG 任务。然而，它训练 RAG 模块过度拟合训练信号，并忽视了 RAG 系统中不同代理之间数据偏好的差异。在本文中，我们提出了一种可微分数据奖励 (Differentiable Data Rewards, DDR) 方法，该方法通过调整不同 RAG 模块之间的数据偏好来端到端地训练 RAG 系统。DDR 通过收集奖励并使用回滚方法优化每个代理来工作。该方法促使代理采样一些潜在的响应作为扰动，评估这些扰动对整个 RAG 系统的影响，并随后优化代理以生成提高 RAG 系统性能的输出。我们在各种知识密集型任务上的实验表明，DDR 显著优于 SFT 方法，特别是在依赖更多检索知识的较小规模参数的 LLMs 中。此外，DDR 在调整 RAG 模块之间的数据偏好方面表现出更强的能力。DDR 方法使生成模块在从文档中提取关键信息和缓解参数记忆与外部知识之间的冲突方面更加有效。所有代码均可在此 https URL 获取。

[NLP-45] MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs

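TL;DR: This paper studies how well large language models (LLMs) generalize to arithmetic word problems whose proofs are more complex than those seen in training. It presents MathGAP, an evaluation framework that generates problems with arbitrarily complex arithmetic proofs, together with chain-of-thought reasoning annotations, enabling systematic study of generalization with respect to proof complexity. Most of the tested models show a significant drop in performance as proofs get deeper and wider, especially for complex, nonlinear proof structures. The paper also examines how in-context learning interacts with generalization: in-context examples from the same distribution as the test set are not always beneficial, and zero-shot prompting or a diverse set of simpler examples sometimes yields similar or higher accuracy.

Link: https://arxiv.org/abs/2410.13502
Authors: Andreas Opedal, Haruki Shirakami, Bernhard Schölkopf, Abulhair Saparov, Mrinmaya Sachan
Keywords: Large language models, Large language, solve arithmetic word, high accuracy, arithmetic word problems
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint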

Abstract:Large language models (LLMs) can solve arithmetic word problems with high accuracy, but little is known about how well they generalize to problems that are more complex than the ones on which they have been trained. Empirical investigations of such questions are impeded by two major flaws of current evaluations: (i) much of the evaluation data is contaminated, in the sense that it has already been seen during training, and (ii) benchmark datasets do not capture how problem proofs may be arbitrarily complex in various ways. As a step towards addressing these issues, we present a framework for evaluating LLMs on problems that have arbitrarily complex arithmetic proofs, called MathGAP. MathGAP generates problems that follow fixed proof specifications – along with chain-of-thought reasoning annotations – enabling systematic studies on generalization with respect to arithmetic proof complexity. We apply MathGAP to analyze how in-context learning interacts with generalization to problems that have more complex proofs. We find that among the models tested, most show a significant decrease in performance as proofs get deeper and wider. This effect is more pronounced in complex, nonlinear proof structures, which are challenging even for GPT-4o. Surprisingly, providing in-context examples from the same distribution as the test set is not always beneficial for performance. In particular, zero-shot prompting as well as demonstrating a diverse range of examples that are less complex than the test data sometimes yield similar or higher accuracies.
摘要：大语言模型 (LLMs) 能够以高准确率解决算术应用题，但对于它们在处理比训练数据更为复杂的应用题时的泛化能力，我们知之甚少。当前评估方法存在两大缺陷，阻碍了对这些问题的实证研究：(i) 评估数据中很大一部分已被污染，即在训练过程中已经见过；(ii) 基准数据集未能捕捉到问题证明在多种方式上可能具有的任意复杂性。为了解决这些问题，我们提出了一种名为 MathGAP 的框架，用于评估 LLMs 在具有任意复杂算术证明的问题上的表现。MathGAP 生成遵循固定证明规范的问题，并附带链式思维推理注释，从而能够系统地研究关于算术证明复杂性的泛化能力。我们应用 MathGAP 分析了上下文学习如何与具有更复杂证明的问题的泛化能力相互作用。我们发现，在测试的模型中，大多数模型随着证明的深度和广度增加，性能显著下降。这种效应在复杂、非线性的证明结构中更为明显，即使对于 GPT-4o 也颇具挑战性。令人惊讶的是，提供与测试集同分布的上下文示例并不总是对性能有益。特别是，零样本提示以及展示比测试数据更简单的多样化示例有时会获得相似甚至更高的准确率。

[NLP-46] Enhancing Text Generation in Joint NLG/NLU Learning Through Curriculum Learning, Semi-Supervised Training, and Advanced Optimization Techniques

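TL;DR: This paper targets coherence, diversity, and creativity in text generation while avoiding biased or inappropriate content. It works in a joint Natural Language Generation (NLG) and Natural Language Understanding (NLU) learning framework: data is prepared with preprocessing and feature extraction (POS tagging, bag of words, and TF-IDF), and Transformer-based encoders and decoders capture long-range dependencies. Pre-trained language models such as Optimized BERT are incorporated together with a Hybrid Redfox Artificial Hummingbird Algorithm (HRAHA), and reinforcement learning, semi-supervised training, improved attention mechanisms, and differentiable approximations such as the straight-through Gumbel SoftMax estimator are used to fine-tune the models and handle complex linguistic tasks.

Link: https://arxiv.org/abs/2410.13498
Authors: Rahimanuddin Shaik, Katikela Sreeharsha Kishore
Keywords: Text generation, Natural Language Generation, computational methods, Natural Language Understanding, automated process
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: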

Abstract:Text generation is the automated process of producing written or spoken language using computational methods. It involves generating coherent and contextually relevant text based on predefined rules or learned patterns. However, challenges in text generation arise from maintaining coherence, ensuring diversity and creativity, and avoiding biases or inappropriate content. This research paper developed a novel approach to improve text generation in the context of joint Natural Language Generation (NLG) and Natural Language Understanding (NLU) learning. The data is prepared by gathering and preprocessing annotated datasets, including cleaning, tokenization, stemming, and stop-word removal. Feature extraction techniques such as POS tagging, Bag of words, and Term Frequency-Inverse Document Frequency (TF-IDF) are applied. Transformer-based encoders and decoders are used, capturing long-range dependencies and improving source-target sequence modelling. Pre-trained language models like Optimized BERT are incorporated, along with a Hybrid Redfox Artificial Hummingbird Algorithm (HRAHA). Reinforcement learning with policy gradient techniques, semi-supervised training, improved attention mechanisms, and differentiable approximations like the straight-through Gumbel SoftMax estimator are employed to fine-tune the models and handle complex linguistic tasks effectively. The proposed model is implemented using Python.
摘要：文本生成是通过计算方法自动生成书面或口头语言的过程。它涉及基于预定义规则或学习模式生成连贯且上下文相关的文本。然而，文本生成面临保持连贯性、确保多样性和创造性以及避免偏见或不当内容的挑战。本研究论文开发了一种新颖的方法，以改进联合自然语言生成 (NLG) 和自然语言理解 (NLU) 学习背景下的文本生成。数据通过收集和预处理标注数据集来准备，包括清洗、分词、词干提取和停用词移除。应用了诸如词性标注、词袋模型和词频-逆文档频率 (TF-IDF) 等特征提取技术。基于 Transformer 的编码器和解码器捕捉长距离依赖关系，并改进源-目标序列建模。结合了优化版 BERT 等预训练语言模型，以及混合红狐人工蜂鸟算法 (HRAHA)。采用带有策略梯度技术的强化学习、半监督训练、改进的注意力机制和诸如直通 Gumbel SoftMax 估计器等可微分近似方法来微调模型并有效处理复杂的语言任务。所提出的模型使用 Python 实现。

[NLP-47] Repetition Neurons: How Do Language Models Produce Repetitions?

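TL;DR: This paper investigates the repetition problem in text generation by identifying and analyzing specific neurons, called repetition neurons. These neurons are activated progressively more strongly as repetition continues, suggesting that they treat repetition as a task of repeatedly copying the previous context, similar to in-context learning. The neurons are identified by comparing activation values before and after the onset of repetition, and similar patterns are observed across several pre-trained language models.

Link: https://arxiv.org/abs/2410.13497
Authors: Tatsuya Hiraoka, Kentaro Inui
Keywords: skill neurons responsible, paper introduces repetition, text generation tasks, regarded as skill, paper introduces
Categories: Computation and Language (cs.CL)
Comments: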

Abstract:This paper introduces repetition neurons, regarded as skill neurons responsible for the repetition problem in text generation tasks. These neurons are progressively activated more strongly as repetition continues, indicating that they perceive repetition as a task to copy the previous context repeatedly, similar to in-context learning. We identify these repetition neurons by comparing activation values before and after the onset of repetition in texts generated by recent pre-trained language models. We analyze the repetition neurons in three English and one Japanese pre-trained language models and observe similar patterns across them.
摘要：本文介绍了重复神经元，这些神经元被视为负责文本生成任务中重复问题的技能神经元。随着重复的持续，这些神经元的激活强度逐渐增强，表明它们将重复视为一项重复复制先前上下文的任务，类似于上下文学习。我们通过比较近期预训练语言模型生成的文本中重复发生前后的激活值，识别出这些重复神经元。我们对三款英文和一款日文预训练语言模型中的重复神经元进行了分析，并观察到它们之间存在相似的模式。
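
The identification step (comparing activations before and after the onset of repetition) can be sketched as below. This is a simplified illustration under stated assumptions: the mean-difference criterion, the `rank_repetition_neurons` name, and the toy activation matrix are hypothetical and are not the paper's exact procedure.

```python
def rank_repetition_neurons(activations, onset, top_k=2):
    """Rank neurons by how much their mean activation rises after repetition starts.

    activations[t][n] is the activation of neuron n at token position t.
    onset is the index of the first repeated token (0 < onset < len(activations)).
    """
    n_neurons = len(activations[0])
    before, after = activations[:onset], activations[onset:]
    diffs = []
    for n in range(n_neurons):
        mean_before = sum(row[n] for row in before) / len(before)
        mean_after = sum(row[n] for row in after) / len(after)
        diffs.append((mean_after - mean_before, n))
    diffs.sort(reverse=True)  # largest increase first
    return [n for _, n in diffs[:top_k]]

# Toy data: 4 token positions, 3 neurons; repetition starts at position 2.
acts = [
    [0.1, 0.5, 0.0],
    [0.2, 0.5, 0.1],
    [0.9, 0.4, 0.1],
    [1.3, 0.5, 0.0],
]
print(rank_repetition_neurons(acts, onset=2, top_k=1))
```

In the toy data, neuron 0 is the only one whose activation keeps growing once repetition begins, so it is ranked first.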

[NLP-48] Seeing Through VisualBERT: A Causal Adventure on Memetic Landscapes EMNLP

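TL;DR: This paper addresses the opacity of deep neural networks for offensive meme detection, where input-attribution methods struggle with implicitly offensive memes and non-causal attributions. It proposes a framework based on a Structural Causal Model (SCM) in which VisualBERT is trained to predict the class of an input meme from both the meme input and causal concepts, allowing transparent interpretation. Qualitative evaluation shows that the framework helps in understanding model behavior, and quantitative analysis assesses modelling choices such as de-confounding, adversarial learning, and dynamic routing. The paper also finds that input attribution methods do not guarantee causality within this framework, which raises questions about their reliability in safety-critical applications.

Link: https://arxiv.org/abs/2410.13488
Authors: Dibyanayan Bandyopadhyay, Mohammed Hasanuzzaman, Asif Ekbal
Keywords: Detecting offensive memes, standard deep neural, deep neural network, neural network systems, Detecting offensive
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at EMNLP Findings 2024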

Abstract:Detecting offensive memes is crucial, yet standard deep neural network systems often remain opaque. Various input attribution-based methods attempt to interpret their behavior, but they face challenges with implicitly offensive memes and non-causal attributions. To address these issues, we propose a framework based on a Structural Causal Model (SCM). In this framework, VisualBERT is trained to predict the class of an input meme based on both meme input and causal concepts, allowing for transparent interpretation. Our qualitative evaluation demonstrates the framework’s effectiveness in understanding model behavior, particularly in determining whether the model was right due to the right reason, and in identifying reasons behind misclassification. Additionally, quantitative analysis assesses the significance of proposed modelling choices, such as de-confounding, adversarial learning, and dynamic routing, and compares them with input attribution methods. Surprisingly, we find that input attribution methods do not guarantee causality within our framework, raising questions about their reliability in safety-critical applications. The project page is at: this https URL
摘要：检测冒犯性模因至关重要，然而标准的深度神经网络系统往往仍然是不透明的。各种基于输入归因的方法试图解释其行为，但它们在处理隐含冒犯性模因和非因果归因方面面临挑战。为了解决这些问题，我们提出了一种基于结构因果模型 (Structural Causal Model, SCM) 的框架。在该框架中，VisualBERT 被训练来根据模因输入和因果概念预测输入模因的类别，从而实现透明的解释。我们的定性评估展示了该框架在理解模型行为方面的有效性，特别是在确定模型是否因正确原因而正确，以及识别误分类背后的原因方面。此外，定量分析评估了所提出的建模选择（如去混淆、对抗学习、动态路由）的重要性，并将其与输入归因方法进行了比较。令人惊讶的是，我们发现输入归因方法在我们的框架内并不能保证因果关系，这引发了关于它们在安全关键应用中的可靠性的问题。项目页面位于：this https URL

[NLP-49] IterSelectTune: An Iterative Training Framework for Efficient Instruction-Tuning Data Selection

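TL;DR: This paper addresses how to select high-quality instruction data from large source datasets efficiently and cheaply when instruction-tuning large language models (LLMs). It proposes IterSelectTune, an iterative training policy that requires no human involvement and only limited reliance on GPT-4. By fine-tuning on approximately 20% of the source data, the method consistently outperforms models fine-tuned on the full dataset across multiple benchmarks and public test datasets, while reducing the computational resources needed for instruction tuning.

Link: https://arxiv.org/abs/2410.13464
Authors: Jielin Song, Siyu Liu, Bin Zhu, Yanghui Rao
Keywords: continue to advance, contextually appropriate responses, critical for improving, improving their ability, ability to generate
Categories: Computation and Language (cs.CL)
Comments: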

Abstract:As large language models (LLMs) continue to advance, instruction tuning has become critical for improving their ability to generate accurate and contextually appropriate responses. Although numerous instruction-tuning datasets have been developed to enhance LLM performance, selecting high-quality instruction data from large source datasets typically demands significant human effort. In this work, we introduce IterSelectTune, an efficient, cost-effective iterative training policy for selecting high-quality instruction data with no human involvement and limited reliance on GPT-4. By fine-tuning on approximately 20% of the source data, our method consistently outperforms models fine-tuned on the full dataset across multiple benchmarks and public test datasets. These results highlight the effectiveness of our approach in enhancing LLM performance while reducing the computational resources required for instruction tuning.
摘要：随着大语言模型 (Large Language Models, LLMs) 的不断进步，指令调优 (instruction tuning) 已成为提升其生成准确且上下文相关响应能力的关键。尽管已开发了众多指令调优数据集以增强 LLM 性能，但从大型源数据集中选择高质量指令数据通常需要大量的人力投入。在本研究中，我们提出了 IterSelectTune，这是一种高效且成本低廉的迭代训练策略，能够在无需人工干预且对 GPT-4 依赖有限的情况下，选择高质量的指令数据。通过在约 20% 的源数据上进行微调，我们的方法在多个基准测试和公共测试数据集上持续优于在全数据集上微调的模型。这些结果突显了我们的方法在提升 LLM 性能的同时，减少了指令调优所需的计算资源。

[NLP-50] Progressive Mixed-Precision Decoding for Efficient LLM Inference

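TL;DR: This paper addresses the deployment of large language models (LLMs) on resource-constrained devices, which is hard because of their high computational and memory demands. It proposes a phase-aware quantization method that selectively allocates precision across the phases of LLM inference, giving strong context extraction during prefill and efficient memory bandwidth utilization during decoding. It further introduces Progressive Mixed-Precision Decoding (PMPD), which gradually lowers precision deeper in the generated sequence, driven by precision-switching schedulers that operate in a task-adaptive or prompt-adaptive manner, improving efficiency while preserving output quality.

Link: https://arxiv.org/abs/2410.13461
Authors: Hao Mark Chen, Fuwen Tan, Alexandros Kouris, Royson Lee, Hongxiang Fan, Stylianos I. Venieris
Keywords: resource-constrained devices remains, devices remains challenging, remains challenging due, great potential, potential of large
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: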

Abstract:In spite of the great potential of large language models (LLMs) across various tasks, their deployment on resource-constrained devices remains challenging due to their excessive computational and memory demands. Quantization has emerged as an effective solution by storing weights in reduced precision. However, utilizing low precisions (i.e., 2/3-bit) to substantially alleviate the memory-boundedness of LLM decoding still suffers from a prohibitive performance drop. In this work, we argue that existing approaches fail to explore the diversity in computational patterns, redundancy, and sensitivity to approximations of the different phases of LLM inference, resorting to a uniform quantization policy throughout. Instead, we propose a novel phase-aware method that selectively allocates precision during different phases of LLM inference, achieving both strong context extraction during prefill and efficient memory bandwidth utilization during decoding. To further address the memory-boundedness of the decoding phase, we introduce Progressive Mixed-Precision Decoding (PMPD), a technique that enables the gradual lowering of precision deeper in the generated sequence, together with a spectrum of precision-switching schedulers that dynamically drive the precision-lowering decisions in either a task-adaptive or prompt-adaptive manner. Extensive evaluation across diverse language tasks shows that when targeting Nvidia GPUs, PMPD achieves 1.4-12.2× speedup in matrix-vector multiplications over fp16 models, while when targeting an LLM-optimized NPU, our approach delivers a throughput gain of 3.8-8.0× over fp16 models and up to 1.54× over uniform quantization approaches while preserving the output quality.
摘要：尽管大语言模型 (LLM) 在各种任务中展现出巨大的潜力，但由于其过高的计算和内存需求，在资源受限的设备上部署仍然面临挑战。量化作为一种有效解决方案，通过以降低的精度存储权重来实现。然而，利用低精度 (即 2/3 位) 来显著缓解 LLM 解码的内存瓶颈，仍然会遭遇性能大幅下降的问题。本文认为，现有方法未能充分探索 LLM 推理不同阶段在计算模式、冗余度和对近似敏感性方面的多样性，而是采用统一的量化策略。相反，我们提出了一种新颖的阶段感知方法，该方法在 LLM 推理的不同阶段选择性地分配精度，在预填充阶段实现强大的上下文提取，并在解码阶段实现高效的内存带宽利用。为进一步解决解码阶段的内存瓶颈问题，我们引入了渐进混合精度解码 (PMPD)，这是一种能够在生成的序列中逐步降低精度的技术，同时配备了一系列精度切换调度器，以任务自适应或提示自适应的方式动态驱动精度降低决策。在多种语言任务上的广泛评估表明，当目标为 Nvidia GPU 时，PMPD 在矩阵向量乘法中比 fp16 模型实现了 1.4 - 12.2 倍的加速；而当目标为 LLM 优化的 NPU 时，我们的方法在 fp16 模型上提供了 3.8 - 8.0 倍的吞吐量增益，并且在保持输出质量的同时，比统一量化方法高出 1.54 倍。
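
The notion of lowering precision as decoding progresses can be sketched with a fixed step-based schedule. This is a hedged illustration only: the paper's schedulers are task-adaptive or prompt-adaptive, whereas the static `schedule`, the `precision_for_step` helper, and the simple symmetric `quantize` function below are assumptions introduced here.

```python
def precision_for_step(step, schedule):
    """Return the bit-width active at a decoding step.

    schedule is a list of (start_step, bits) pairs sorted by start_step.
    """
    bits = schedule[0][1]
    for start, b in schedule:
        if step >= start:
            bits = b
    return bits

def quantize(weights, bits):
    """Symmetric uniform quantization (bits >= 2), returned in dequantized form."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / levels
    if scale == 0:
        return list(weights)
    return [round(w / scale) * scale for w in weights]

# Hypothetical schedule: 4-bit early in the sequence, then 3-bit, then 2-bit.
schedule = [(0, 4), (32, 3), (128, 2)]
for step in (0, 31, 32, 200):
    print(step, precision_for_step(step, schedule))

print(quantize([1.0, -1.0, 0.4], bits=2))
```

Later tokens are decoded with fewer bits per weight, which reduces the memory traffic of the memory-bound decoding phase; how aggressively precision can drop without hurting quality is exactly what the paper's schedulers are meant to decide.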

[NLP-51] Breaking the Manual Annotation Bottleneck: Creating a Comprehensive Legal Case Criticality Dataset through Semi-Automated Labeling

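TL;DR: This paper addresses case criticality prediction in the legal domain, specifically evaluating the potential influence of Swiss Federal Supreme Court decisions on future jurisprudence. Its key contribution is a semi-automatic labeling approach that produces a large multilingual dataset with two kinds of labels: the LD-Label, which identifies cases published as Leading Decisions, and the Citation-Label, which ranks cases by citation frequency and recency. This yields a much larger dataset than manual annotation would allow and supports a more nuanced evaluation of case importance. Fine-tuned multilingual models consistently outperform zero-shot baselines, demonstrating the need for task-specific adaptation.

Link: https://arxiv.org/abs/2410.13460
Authors: Ronja Stern, Ken Kawamura, Matthias Stürmer, Ilias Chalkidis, Joel Niklaus
Keywords: Predicting case criticality, Federal Supreme Court, Swiss Federal Supreme, Supreme Court decisions, court system manage
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: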

Abstract:Predicting case criticality helps legal professionals in the court system manage large volumes of case law. This paper introduces the Criticality Prediction dataset, a new resource for evaluating the potential influence of Swiss Federal Supreme Court decisions on future jurisprudence. Unlike existing approaches that rely on resource-intensive manual annotations, we semi-automatically derive labels leading to a much larger dataset than otherwise possible. Our dataset features a two-tier labeling system: (1) the LD-Label, which identifies cases published as Leading Decisions (LD), and (2) the Citation-Label, which ranks cases by their citation frequency and recency. This allows for a more nuanced evaluation of case importance. We evaluate several multilingual models, including fine-tuned variants and large language models, and find that fine-tuned models consistently outperform zero-shot baselines, demonstrating the need for task-specific adaptation. Our contributions include the introduction of this task and the release of a multilingual dataset to the research community.
摘要：预测案件的重要性有助于法律专业人士在法院系统中管理大量案例法。本文介绍了重要性预测数据集，这是一个用于评估瑞士联邦最高法院判决对未来法学潜在影响的新资源。与依赖资源密集型手动标注的现有方法不同，我们采用半自动方式生成标签，从而获得比传统方法更大的数据集。我们的数据集具有两级标注系统：(1) LD-Label，用于识别作为领先判决 (Leading Decisions, LD) 发布的案件；(2) Citation-Label，根据案件的引用频率和时效性对案件进行排序。这使得对案件重要性的评估更加细致。我们评估了多种多语言模型，包括微调变体和大语言模型，发现微调模型始终优于零样本基线，表明任务特定适应的必要性。我们的贡献包括引入这一任务并向研究社区发布多语言数据集。
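
One way to combine citation frequency and recency into a single ranking signal is a recency-weighted citation count, sketched below. The half-life weighting, the `citation_score` name, and the example cases are hypothetical; the abstract does not specify how the Citation-Label is actually computed.

```python
def citation_score(citing_years, current_year, half_life=5.0):
    """Sum of citations, each discounted by its age with the given half-life (years)."""
    return sum(0.5 ** ((current_year - y) / half_life) for y in citing_years)

# Illustrative data: years in which each case was cited.
cases = {
    "A": [2023, 2023, 2022],        # few but recent citations
    "B": [2005, 2006, 2007, 2008],  # more citations, all old
    "C": [],                        # never cited
}
ranked = sorted(cases, key=lambda c: citation_score(cases[c], 2024), reverse=True)
print(ranked)
```

Under this weighting, a case with a handful of recent citations can outrank one with more but older citations, which is the kind of nuance a frequency-only count would miss.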

[NLP-52] MedINST: Meta Dataset of Biomedical Instructions

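TL;DR: This paper addresses the scarcity of large, diverse, and well-annotated datasets for training large language models (LLMs) in the medical domain. It introduces MedINST, a multi-domain, multi-task meta-dataset of biomedical instructions comprising 133 biomedical NLP tasks and over 7 million training samples, which the authors describe as the most comprehensive biomedical instruction dataset to date. From MedINST the authors curate MedINST32, a benchmark for evaluating LLMs' cross-task generalization; several LLMs fine-tuned on MedINST and evaluated on MedINST32 show enhanced cross-task generalization.

Link: https://arxiv.org/abs/2410.13458
Authors: Wenhan Han, Meng Fang, Zihan Zhang, Yu Yin, Zirui Song, Ling Chen, Mykola Pechenizkiy, Qingyu Chen
Keywords: large language model, well-annotated datasets remains, integration of large, large language, scarcity of large
Categories: Computation and Language (cs.CL)
Comments: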

Abstract:The integration of large language model (LLM) techniques in the field of medical analysis has brought about significant advancements, yet the scarcity of large, diverse, and well-annotated datasets remains a major challenge. Medical data and tasks, which vary in format, size, and other parameters, require extensive preprocessing and standardization for effective use in training LLMs. To address these challenges, we introduce MedINST, the Meta Dataset of Biomedical Instructions, a novel multi-domain, multi-task instructional meta-dataset. MedINST comprises 133 biomedical NLP tasks and over 7 million training samples, making it the most comprehensive biomedical instruction dataset to date. Using MedINST as the meta dataset, we curate MedINST32, a challenging benchmark with different task difficulties aiming to evaluate LLMs’ generalization ability. We fine-tune several LLMs on MedINST and evaluate on MedINST32, showcasing enhanced cross-task generalization.
摘要：大语言模型 (LLM) 技术在医学分析领域的整合带来了显著的进步，然而，大规模、多样化且标注良好的数据集的稀缺性仍然是一个主要挑战。医学数据和任务在格式、大小和其他参数上各不相同，需要广泛的预处理和标准化，以便在训练 LLM 时有效使用。为了应对这些挑战，我们引入了 MedINST，即生物医学指令的元数据集，这是一个新颖的多领域、多任务的指令元数据集。MedINST 包含 133 个生物医学 NLP 任务和超过 700 万个训练样本，使其成为迄今为止最全面的生物医学指令数据集。使用 MedINST 作为元数据集，我们精心策划了 MedINST32，这是一个具有不同任务难度的挑战性基准，旨在评估 LLM 的泛化能力。我们对多个 LLM 在 MedINST 上进行了微调，并在 MedINST32 上进行了评估，展示了增强的跨任务泛化能力。

[NLP-53] Unlocking Legal Knowledge: A Multilingual Dataset for Judicial Summarization in Switzerland

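TL;DR: This paper addresses the time cost of legal research and its reliance on manually written summaries (headnotes). It introduces the Swiss Leading Decision Summarization (SLDS) dataset, which contains 18K rulings of the Swiss Federal Supreme Court in German, French, and Italian, along with German headnotes. The authors fine-tune and evaluate three mT5 variants alongside proprietary models; proprietary models perform well in zero-shot and one-shot settings, yet fine-tuned smaller models remain strongly competitive. The dataset is publicly released to support research on multilingual legal summarization and assistive technologies for legal professionals.

Link: https://arxiv.org/abs/2410.13456
Authors: Luca Rolshoven, Vishvaksenan Rasiah, Srinanda Brügger Bose, Matthias Stürmer, Joel Niklaus
Keywords: daily basis, Legal research, Legal, Swiss Federal Supreme, lawyers face
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: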

Abstract:Legal research is a time-consuming task that most lawyers face on a daily basis. A large part of legal research entails looking up relevant caselaw and bringing it in relation to the case at hand. Lawyers heavily rely on summaries (also called headnotes) to find the right cases quickly. However, not all decisions are annotated with headnotes and writing them is time-consuming. Automated headnote creation has the potential to make hundreds of thousands of decisions more accessible for legal research in Switzerland alone. To kickstart this, we introduce the Swiss Leading Decision Summarization (SLDS) dataset, a novel cross-lingual resource featuring 18K court rulings from the Swiss Federal Supreme Court (SFSC), in German, French, and Italian, along with German headnotes. We fine-tune and evaluate three mT5 variants, along with proprietary models. Our analysis highlights that while proprietary models perform well in zero-shot and one-shot settings, fine-tuned smaller models still provide a strong competitive edge. We publicly release the dataset to facilitate further research in multilingual legal summarization and the development of assistive technologies for legal professionals.
摘要：法律研究是大多数律师日常工作中耗时的一项任务。法律研究的大部分内容涉及查找相关判例法，并将其与当前案件联系起来。律师们严重依赖摘要（也称为headnotes）来快速找到正确的案件。然而，并非所有判决都附有headnotes，撰写它们也非常耗时。自动创建headnote有潜力使仅在瑞士就有数十万份判决更容易被法律研究者获取。为此，我们引入了瑞士领先判决摘要（Swiss Leading Decision Summarization, SLDS）数据集，这是一个新颖的跨语言资源，包含来自瑞士联邦最高法院（Swiss Federal Supreme Court, SFSC）的18,000份判决，涵盖德语、法语和意大利语，并附有德语headnotes。我们对三种mT5变体以及专有模型进行了微调并评估。分析表明，尽管专有模型在零样本和单样本设置下表现良好，但经过微调的小型模型仍然提供了强大的竞争优势。我们公开发布了该数据集，以促进多语言法律摘要领域的进一步研究和法律专业辅助技术的发展。

[NLP-54] Parameter-efficient Adaptation of Multilingual Multimodal Models for Low-resource ASR

【速读】：该论文试图解决低资源语言的自动语音识别（ASR）问题，由于缺乏标注训练数据，这一直是一个挑战。解决方案的关键在于结合参数高效微调（parameter-efficient fine-tuning）和仅文本适应（text-only adaptation）两种方法，利用多语言多模态模型如SeamlessM4T来提升ASR性能。具体来说，多模态模型通过仅文本适应利用未标注的文本数据，并进一步进行参数高效的ASR微调，从而在零样本设置下实现高达17%的词错误率（WER）相对减少，无需任何标注语音数据。

Link: https://arxiv.org/abs/2410.13445
Authors: Abhishek Gupta,Amruta Parulekar,Sameep Chattopadhyay,Preethi Jyothi
Keywords: Automatic speech recognition, labeled training data, training data, Automatic speech, low-resource languages remains
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Automatic speech recognition (ASR) for low-resource languages remains a challenge due to the scarcity of labeled training data. Parameter-efficient fine-tuning and text-only adaptation are two popular methods that have been used to address such low-resource settings. In this work, we investigate how these techniques can be effectively combined using a multilingual multimodal model like SeamlessM4T. Multimodal models are able to leverage unlabeled text via text-only adaptation with further parameter-efficient ASR fine-tuning, thus boosting ASR performance. We also show cross-lingual transfer from a high-resource language, achieving up to a relative 17% WER reduction over a baseline in a zero-shot setting without any labeled speech.
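To make the "relative 17% WER reduction" figure concrete, here is a minimal sketch of how word error rate and a relative reduction are usually computed. The function names and the example WER values are illustrative assumptions, not numbers or code from the paper.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction: (baseline - new) / baseline."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical values: a baseline WER of 50.0% dropping to 41.5%
# corresponds to a 17% relative reduction (8.5 points absolute).
example = relative_reduction(0.50, 0.415)
```

Note that a relative reduction is measured against the baseline WER, so a 17% relative gain can correspond to a much smaller absolute change.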

[NLP-55] NLIP_Lab-IITH Multilingual MT System for WAT24 MT Shared Task

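[Quick read]: This paper targets multilingual Indic machine translation covering 22 Indic languages from 4 language families. The key to the solution is substituting words in source sentences using bilingual dictionaries, together with pre-training and then fine-tuning language-direction-specific multilingual translation models. The proposed 243M-parameter multilingual model is evaluated on the IN22-Gen and IN22-Conv benchmarks; on IN22-Gen in the Indic-En direction it reaches an average chrF++ of 56.34 and BLEU of 30.82, and it is competitive with IndicTransv1 (a 474M-parameter model).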

Link: https://arxiv.org/abs/2410.13443
Authors: Maharaj Brahma,Pramit Sahoo,Maunendra Sankar Desarkar
Keywords: describes NLIP Lab, NLIP Lab multilingual, paper describes NLIP, NLIP Lab, Lab multilingual machine
Categories: Computation and Language (cs.CL)
Comments: WMT 24 WAT Shared Task IndicMultiMT (Best System)

Abstract:This paper describes NLIP Lab’s multilingual machine translation system for the WAT24 shared task on multilingual Indic MT for 22 scheduled languages belonging to 4 language families. We explore pre-training for Indic languages using alignment agreement objectives. We utilize bilingual dictionaries to substitute words from source sentences. Furthermore, we fine-tuned language direction-specific multilingual translation models using small and high-quality seed data. Our primary submission is a 243M-parameter multilingual translation model covering 22 Indic languages. In the IN22-Gen benchmark, we achieved an average chrF++ score of 46.80 and a BLEU score of 18.19 for the En-Indic direction. In the Indic-En direction, we achieved an average chrF++ score of 56.34 and a BLEU score of 30.82. In the IN22-Conv benchmark, we achieved an average chrF++ score of 43.43 and a BLEU score of 16.58 in the En-Indic direction, and in the Indic-En direction, we achieved an average of 52.44 and 29.77 for chrF++ and BLEU respectively. Our model is competitive with IndicTransv1 (a 474M-parameter model). Our code and models are available at this https URL.
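The abstract mentions substituting source-sentence words using bilingual dictionaries but does not give the procedure. The following is a minimal sketch of one plausible form of such substitution, assuming a word-level lookup applied with a fixed probability; the function name, the `prob` parameter, and the two-entry English-to-Hindi dictionary are all hypothetical and are not taken from the paper.

```python
import random


def substitute_words(sentence, dictionary, prob=0.3, seed=None):
    """Replace dictionary words in `sentence` with their translation.

    Each word found in `dictionary` is replaced with probability `prob`;
    all other words are kept unchanged. (Assumed scheme, for illustration.)
    """
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        target = dictionary.get(word.lower())
        if target is not None and rng.random() < prob:
            out.append(target)
        else:
            out.append(word)
    return " ".join(out)


# Hypothetical toy dictionary (English -> Hindi).
toy_dict = {"house": "घर", "water": "पानी"}
mixed = substitute_words("the house has water", toy_dict, prob=1.0)
```

With `prob=1.0` every dictionary word is replaced, and with `prob=0.0` the sentence is returned unchanged; intermediate values produce code-mixed sentences.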

[NLP-56] Similarity-Dissimilarity Loss with Supervised Contrastive Learning for Multi-label Classification

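[Quick read]: This paper addresses how to determine positive samples in multi-label classification under supervised contrastive learning, and in particular how to dynamically adjust the weights in the contrastive loss to reflect the different relations between samples. The key to the solution is introducing five distinct relations between samples and proposing a Similarity-Dissimilarity Loss, which re-weights the loss by computing the similarity and dissimilarity between positive samples and a given anchor. Experiments on multi-label text classification with the MIMIC datasets, extended to MS-COCO, show that the loss improves the performance of all tested encoders under the supervised contrastive learning paradigm.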

Link: https://arxiv.org/abs/2410.13439
Authors: Guangming Huang,Yunfei Long,Cunjin Luo,Sheng Liu
Keywords: scenario remains challenging, multi-label scenario remains, determining positive samples, remains challenging, positive samples
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Supervised contrastive learning has been explored as a way to make use of label information for multi-label classification, but determining positive samples in the multi-label scenario remains challenging. Previous studies have examined strategies for identifying positive samples, considering the label overlap proportion between anchors and samples. However, they ignore the various relations between given anchors and samples, as well as how to dynamically adjust the weights in contrastive loss functions based on different relations, leading to great ambiguity. In this paper, we introduce five distinct relations between multi-label samples and propose a Similarity-Dissimilarity Loss with contrastive learning for multi-label classification. Our loss function re-weights the loss by computing the similarity and dissimilarity between positive samples and a given anchor based on the introduced relations. We mainly conduct experiments for multi-label text classification on MIMIC datasets, then further extend the evaluation to MS-COCO. The experimental results show that our proposed loss effectively improves the performance of all encoders under the supervised contrastive learning paradigm, demonstrating its effectiveness and robustness.

[NLP-57] Think Thrice Before You Act: Progressive Thought Refinement in Large Language Models

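[Quick read]: This paper addresses two limitations of existing refinement methods for large language models (LLMs): their reliance on supervision signals to evaluate responses, which makes open-ended questions hard to handle, and their weak generalization across tasks. The key to the solution is a framework called Progressive Thought Refinement (PTR), which works in two phases. First, in the thought data construction phase, a collaborative selection strategy between a weak and a strong model builds a high-quality progressive refinement dataset that keeps thoughts and answers logically consistent and refines the answers round by round. Second, in the thought-mask fine-tuning phase, a training structure masks the "thought" and adjusts loss weights to encourage the LLM to refine prior thoughts, teaching it "how to improve" rather than "what is correct". Experiments show that PTR improves LLM performance across ten diverse tasks (average from 49.6% to 53.5%), and on open-ended tasks the quality of responses also improves substantially.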

Link: https://arxiv.org/abs/2410.13413
Authors: Chengyu Du,Jinyi Han,Yizhou Ying,Aili Chen,Qianyu He,Haokun Zhao,Sirui Xia,Haoran Guo,Jiaqing Liang,Zulong Chen,Liangyue Li,Yanghua Xiao
Keywords: Recent advancements, large language models, advancements in large, large language, providing a single
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 4 figures

Abstract:Recent advancements in large language models (LLMs) have demonstrated that progressive refinement, rather than providing a single answer, results in more accurate and thoughtful outputs. However, existing methods often rely heavily on supervision signals to evaluate previous responses, making it difficult to assess output quality in more open-ended scenarios effectively. Additionally, these methods are typically designed for specific tasks, which limits their generalization to new domains. To address these limitations, we propose Progressive Thought Refinement (PTR), a framework that enables LLMs to refine their responses progressively. PTR operates in two phases: (1) Thought data construction stage: We propose a weak and strong model collaborative selection strategy to build a high-quality progressive refinement dataset to ensure logical consistency from thought to answers, and the answers are gradually refined in each round. (2) Thought-Mask Fine-Tuning Phase: We design a training structure to mask the “thought” and adjust loss weights to encourage LLMs to refine prior thought, teaching them to implicitly understand “how to improve” rather than “what is correct.” Experimental results show that PTR significantly enhances LLM performance across ten diverse tasks (avg. from 49.6% to 53.5%) without task-specific fine-tuning. Notably, in more open-ended tasks, LLMs also demonstrate substantial improvements in the quality of responses beyond mere accuracy, suggesting that PTR truly teaches LLMs to self-improve over time.
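The abstract says the fine-tuning phase masks the "thought" and adjusts loss weights, without giving the exact objective. As a rough illustration only, the sketch below shows a generic per-segment weighted negative log-likelihood in which a weight of zero masks a segment out of the loss. The segment labels, weights, and function name are assumptions, not the paper's formulation.

```python
import math


def masked_weighted_nll(token_probs, segments, weights):
    """Weighted average negative log-likelihood over target tokens.

    token_probs: probability the model assigns to each target token
    segments:    segment label per token, e.g. "thought" or "answer"
    weights:     loss weight per segment; a weight of 0 masks the segment
    """
    num = 0.0
    den = 0.0
    for p, seg in zip(token_probs, segments):
        w = weights.get(seg, 0.0)
        num += w * -math.log(p)
        den += w
    return num / den if den else 0.0


# Hypothetical example: two thought tokens (masked) and one answer token.
loss = masked_weighted_nll(
    [0.5, 0.5, 0.25],
    ["thought", "thought", "answer"],
    {"thought": 0.0, "answer": 1.0},
)
```

In this toy setup only the answer token contributes, so the model is never penalized for the masked thought tokens but is still conditioned on them as context.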

[NLP-58] Attr-Int: A Simple and Effective Entity Alignment Framework for Heterogeneous Knowledge Graphs

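[Quick read]: This paper addresses entity alignment (EA) between heterogeneous knowledge graphs (KGs), in particular when aligned entities have non-isomorphic neighborhood structures. The key to the solution is an entity alignment framework called Attr-Int, in which attribute information interaction methods can be integrated with any embedding encoder, improving the performance of existing entity alignment techniques. The paper also introduces two new benchmarks that simulate heterogeneous real-world EA scenarios, on which Attr-Int outperforms state-of-the-art approaches.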

Link: https://arxiv.org/abs/2410.13409
Authors: Linyan Yang,Jingwei Cheng,Chuanhao Xu,Xihao Wang,Jiayi Li,Fu Zhang
Keywords: Entity alignment, knowledge graphs, task of linking, Entity, alignment
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Entity alignment (EA) refers to the task of linking entities in different knowledge graphs (KGs). Existing EA methods rely heavily on structural isomorphism. However, in real-world KGs, aligned entities usually have non-isomorphic neighborhood structures, which paralyses the application of these structure-dependent methods. In this paper, we investigate and tackle the problem of entity alignment between heterogeneous KGs. First, we propose two new benchmarks to closely simulate real-world EA scenarios of heterogeneity. Then we conduct extensive experiments to evaluate the performance of representative EA methods on the new benchmarks. Finally, we propose a simple and effective entity alignment framework called Attr-Int, in which innovative attribute information interaction methods can be seamlessly integrated with any embedding encoder for entity alignment, improving the performance of existing entity alignment techniques. Experiments demonstrate that our framework outperforms the state-of-the-art approaches on two new benchmarks.

[NLP-59] MoR: Mixture of Ranks for Low-Rank Adaptation Tuning

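[Quick read]: This paper addresses the performance bottleneck of Low-Rank Adaptation (LoRA) in capturing high-rank information, as well as the extra parameters and inference latency introduced by MoE-style LoRA methods. The key to the solution is Mixture of Ranks (MoR), which learns rank-specific information for different tasks based on the input and efficiently integrates multi-rank information, improving LoRA's multi-task capability without a large increase in parameters or latency. MoR derives high-rank information from the low-rank components through mathematical transformations, which lowers LoRA's learning difficulty; the abstract reports a 1.31% performance improvement while using 93.93% of the parameters of baseline methods.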

Link: https://arxiv.org/abs/2410.13408
Authors: Chuanyu Tang,Yilong Chen,Zhenyu Zhang,Junyuan Shang,Wenyuan Zhang,Yong Huang,Tingwen Liu
Keywords: Low-Rank Adaptation, drives research, research to align, full fine-tuning, Adaptation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages, 7 figures

Abstract:Low-Rank Adaptation (LoRA) drives research to align its performance with full fine-tuning. However, significant challenges remain: (1) Simply increasing the rank size of LoRA does not effectively capture high-rank information, which leads to a performance bottleneck. (2) MoE-style LoRA methods substantially increase parameters and inference latency, contradicting the goals of efficient fine-tuning and ease of application. To address these challenges, we introduce Mixture of Ranks (MoR), which learns rank-specific information for different tasks based on input and efficiently integrates multi-rank information. We first propose a new framework that equates the integration of multiple LoRAs to expanding the rank of LoRA. Moreover, we hypothesize that low-rank LoRA already captures sufficient intrinsic information, and MoR can derive high-rank information through mathematical transformations of the low-rank components. Thus, MoR can reduce the learning difficulty of LoRA and enhance its multi-task capabilities. MoR achieves impressive results, delivering a 1.31% performance improvement while using only 93.93% of the parameters compared to baseline methods.
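The claim that integrating multiple LoRAs amounts to expanding the rank can be checked with a general linear-algebra identity: the sum of k updates B_i A_i equals a single product of the column-wise concatenation of the B_i with the row-wise concatenation of the A_i. The sketch below verifies this for two rank-1 updates in plain Python. It illustrates only this identity, not MoR's routing or its transformations, and the matrices are made-up toy values.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def matadd(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


# Two rank-1 LoRA-style updates for a 3x3 weight: delta_i = B_i @ A_i
B1, A1 = [[1], [0], [2]], [[1, 2, 0]]
B2, A2 = [[0], [1], [1]], [[3, 0, 1]]

sum_of_loras = matadd(matmul(B1, A1), matmul(B2, A2))

# Concatenate the B's column-wise and the A's row-wise: one rank-2 update.
B_cat = [r1 + r2 for r1, r2 in zip(B1, B2)]  # shape 3x2
A_cat = A1 + A2                              # shape 2x3
single_lora = matmul(B_cat, A_cat)

same = sum_of_loras == single_lora
```

The concatenated factors have inner dimension 2, so the combined update has rank at most 2, the sum of the individual ranks.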

[NLP-60] Towards Hybrid Intelligence in Journalism: Findings and Lessons Learnt from a Collaborative Analysis of Greek Political Rhetoric by ChatGPT and Humans

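[Quick read]: This paper examines how artificial intelligence (AI) can be used to analyze political discourse, specifically the campaign speeches of political leaders ahead of Greece's 2023 general elections, by combining human experts with AI. The key to the approach is an interdisciplinary team of journalists, a political scientist, and data scientists working with a large language model (OpenAI's ChatGPT) on sentiment analysis, polarization analysis, populism detection, topic detection, and named entity recognition (NER). The paper evaluates the strengths and weaknesses of AI for political discourse analysis and stresses the importance of human oversight when AI is used in journalism projects and potentially in other societal sectors. This human-AI collaboration ("hybrid intelligence") in the digital humanities offers lessons for similar future projects.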

Link: https://arxiv.org/abs/2410.13400
Authors: Thanasis Troboukis,Kelly Kiki,Antonis Galanopoulos,Pavlos Sermpezis,Stelios Karamanidis,Ilias Dimitriadis,Athena Vakali
Keywords: employing Artificial Intelligence, Artificial Intelligence, research project titled, preparation for Greece, general elections
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments:

Abstract:This chapter introduces a research project titled “Analyzing the Political Discourse: A Collaboration Between Humans and Artificial Intelligence”, which was initiated in preparation for Greece’s 2023 general elections. The project focused on the analysis of political leaders’ campaign speeches, employing Artificial Intelligence (AI), in conjunction with an interdisciplinary team comprising journalists, a political scientist, and data scientists. The chapter delves into various aspects of political discourse analysis, including sentiment analysis, polarization, populism, topic detection, and Named Entity Recognition (NER). This experimental study investigates the capabilities of large language models (LLMs), and in particular OpenAI’s ChatGPT, for analyzing political speech, evaluates their strengths and weaknesses, and highlights the essential role of human oversight in using AI in journalism projects and potentially other societal sectors. The project stands as an innovative example of human-AI collaboration (known also as “hybrid intelligence”) within the realm of digital humanities, offering valuable insights for future initiatives.

[NLP-61] Linguistically Grounded Analysis of Language Models using Shapley Head Values

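[Quick read]: This paper studies how linguistic knowledge is encoded in language models, focusing on how morphosyntactic phenomena are processed, with the aim of improving generalization. The key to the approach is probing language models with the recently proposed Shapley Head Values (SHVs); quantitative pruning and qualitative clustering analyses show that attention heads responsible for related linguistic phenomena cluster together. SHV-based attributions reveal distinct patterns in BERT and RoBERTa, supporting the hypothesis that language models learn subnetworks corresponding to linguistic theory, with potential implications for cross-linguistic model analysis and interpretability in natural language processing.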

Link: https://arxiv.org/abs/2410.13396
Authors: Marcell Fekete,Johannes Bjerva
Keywords: generalisation capabilities, knowledge is encoded, crucial for improving, improving their generalisation, language models
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Understanding how linguistic knowledge is encoded in language models is crucial for improving their generalisation capabilities. In this paper, we investigate the processing of morphosyntactic phenomena, by leveraging a recently proposed method for probing language models via Shapley Head Values (SHVs). Using the English language BLiMP dataset, we test our approach on two widely used models, BERT and RoBERTa, and compare how linguistic constructions such as anaphor agreement and filler-gap dependencies are handled. Through quantitative pruning and qualitative clustering analysis, we demonstrate that attention heads responsible for processing related linguistic phenomena cluster together. Our results show that SHV-based attributions reveal distinct patterns across both models, providing insights into how language models organize and process linguistic information. These findings support the hypothesis that language models learn subnetworks corresponding to linguistic theory, with potential implications for cross-linguistic model analysis and interpretability in Natural Language Processing (NLP).
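As background for the attribution idea, the sketch below computes exact Shapley values for a tiny set of "attention heads" under a made-up accuracy function. It shows the classical Shapley definition only. The head names and `toy_accuracy` are hypothetical, and since exact enumeration grows exponentially with the number of heads, a realistic model would need some form of approximation.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a set function value(frozenset) -> float."""
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):  # size of the coalition that p joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        result[p] = total
    return result


def toy_accuracy(kept):
    """Hypothetical task accuracy given the set of heads that are kept."""
    score = 0.5                  # chance-level baseline
    if "h0" in kept:
        score += 0.2             # h0 helps on its own
    if "h1" in kept and "h2" in kept:
        score += 0.2             # h1 and h2 only help together
    return score


shv = shapley_values(["h0", "h1", "h2"], toy_accuracy)
```

Here `h0` receives the full 0.2 it contributes alone, while `h1` and `h2` split their joint 0.2 evenly, and the three values sum to the gap between keeping all heads and keeping none.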

[NLP-62] Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs

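[Quick read]: This paper addresses the evaluation of machine-generated text in non-English languages, where multilingual evaluation frameworks are lacking. The key to the solution is the Cross Lingual Auto Evaluation (CIA) Suite, which includes an evaluator LLM called Hercule and a purpose-built test set called Recon. Hercule learns to score responses in other languages based on reference answers in English, which sidesteps the scarcity of reference answers in the target language. Experiments show that Hercule aligns more closely with human judgments than proprietary models, which demonstrates the value of cross-lingual evaluation in low-resource scenarios, and that it is also effective for zero-shot evaluation on unseen languages.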

Link: https://arxiv.org/abs/2410.13394
Authors: Sumanth Doddapaneni,Mohammed Safi Ur Rahman Khan,Dilip Venkatesh,Raj Dabre,Anoop Kunchukuttan,Mitesh M. Khapra
Keywords: Evaluating machine-generated text, machine-generated text remains, Evaluating machine-generated, challenge in NLP, Cross Lingual Auto
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Evaluating machine-generated text remains a significant challenge in NLP, especially for non-English languages. Current methodologies, including automated metrics, human assessments, and LLM-based evaluations, predominantly focus on English, revealing a significant gap in multilingual evaluation frameworks. We introduce the Cross Lingual Auto Evaluation (CIA) Suite, an extensible framework that includes evaluator LLMs (Hercule) and a novel test set (Recon) specifically designed for multilingual evaluation. Our test set features 500 human-annotated instructions spanning various task capabilities along with human judgment scores across six languages. This would enable benchmarking of general-purpose multilingual LLMs and facilitate meta-evaluation of Evaluator LLMs. The proposed model, Hercule, is a cross-lingual evaluation model that addresses the scarcity of reference answers in the target language by learning to assign scores to responses based on easily available reference answers in English. Our experiments demonstrate that Hercule aligns more closely with human judgments compared to proprietary models, demonstrating the effectiveness of such cross-lingual evaluation in low resource scenarios. Further, it is also effective in zero-shot evaluation on unseen languages. This study is the first comprehensive examination of cross-lingual evaluation using LLMs, presenting a scalable and effective approach for multilingual assessment. All code, datasets, and models will be publicly available to enable further research in this important area.

[NLP-63] Metacognitive Monitoring: A Human Ability Beyond Generative Artificial Intelligence

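[Quick read]: This paper investigates whether large language models such as ChatGPT have metacognitive monitoring abilities similar to those of humans, particularly for predicting memory performance. The key to the approach is a cross-agent prediction model that compares the metacognitive performance of humans and ChatGPT on a language-based memory task. ChatGPT could not predict human memory performance item by item the way humans' own ratings did, indicating a fundamental difference between current LLMs and humans at the metacognitive level. The authors argue that closing this gap matters for building AI systems capable of effective self-monitoring and adaptation to human needs, especially in areas such as education and personalized learning.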

Link: https://arxiv.org/abs/2410.13392
Authors: Markus Huff,Elanur Ulakçı
Keywords: Large language models, human cognitive processes, shown impressive alignment, Large language, cognitive processes
Categories: Computation and Language (cs.CL)
Comments: 28 pages, 2 figures. arXiv admin note: substantial text overlap with arXiv:2403.05152

Abstract:Large language models (LLMs) have shown impressive alignment with human cognitive processes, raising questions about the extent of their similarity to human cognition. This study investigates whether LLMs, specifically ChatGPT, possess metacognitive monitoring abilities akin to humans, particularly in predicting memory performance on an item-by-item basis. We employed a cross-agent prediction model to compare the metacognitive performance of humans and ChatGPT in a language-based memory task involving garden-path sentences preceded by either fitting or unfitting context sentences. Both humans and ChatGPT rated the memorability of these sentences; humans then completed a surprise recognition memory test. Our findings reveal a significant positive relationship between humans’ memorability ratings and their actual recognition performance, indicating reliable metacognitive monitoring. In contrast, ChatGPT did not exhibit a similar predictive capability. Bootstrapping analyses demonstrated that none of the GPT models tested (GPT-3.5-turbo, GPT-4-turbo, GPT-4o) could accurately predict human memory performance on a per-item basis. This suggests that, despite their advanced language processing abilities and alignment with human cognition at the object level, current LLMs lack the metacognitive mechanisms that enable humans to anticipate their memory performance. These results highlight a fundamental difference between human and AI cognition at the metacognitive level. Addressing this gap is crucial for developing AI systems capable of effective self-monitoring and adaptation to human needs, thereby enhancing human-AI interactions across domains such as education and personalized learning.

[NLP-64] Remember Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant

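[Quick read]: This paper addresses the difficulty of applying multimodal large language models (MLLMs) in daily life when they lack user-specific knowledge. The key to the solution is the Retrieval Augmented Personalization (RAP) framework, which personalizes an MLLM in three steps: first, a key-value database stores user-related information (such as the user's name and avatar); second, when the user starts a conversation, a multimodal retriever fetches relevant information from the database; third, the input query and the retrieved concept information are fed into the MLLM to generate a personalized, knowledge-augmented response. RAP supports real-time concept editing by updating the external database, and a dedicated data collection pipeline and personalized training improve generation quality and alignment with user information.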

Link: https://arxiv.org/abs/2410.13360
Authors: Haoran Hao,Jiaming Han,Changsheng Li,Yu-Feng Li,Xiangyu Yue
Keywords: Retrieval Augmented Personalization, large language models, development of large, large language, significantly enhanced
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments:

Abstract:The development of large language models (LLMs) has significantly enhanced the capabilities of multimodal LLMs (MLLMs) as general assistants. However, a lack of user-specific knowledge still restricts their application in people’s daily lives. In this paper, we introduce the Retrieval Augmented Personalization (RAP) framework for MLLMs’ personalization. Starting from a general MLLM, we turn it into a personalized assistant in three steps. (a) Remember: We design a key-value database to store user-related information, e.g., user’s name, avatar and other attributes. (b) Retrieve: When the user initiates a conversation, RAP will retrieve relevant information from the database using a multimodal retriever. (c) Generate: The input query and retrieved concepts’ information are fed into MLLMs to generate personalized, knowledge-augmented responses. Unlike previous methods, RAP allows real-time concept editing via updating the external database. To further improve generation quality and alignment with user-specific information, we design a pipeline for data collection and create a specialized dataset for personalized training of MLLMs. Based on the dataset, we train a series of MLLMs as personalized multimodal assistants. By pretraining on a large-scale dataset, RAP-MLLMs can generalize to infinite visual concepts without additional finetuning. Our models demonstrate outstanding flexibility and generation quality across a variety of tasks, such as personalized image captioning, question answering and visual recognition. The code, data and models are available at this https URL.
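The Remember, Retrieve, Generate steps can be pictured with a toy pipeline. The sketch below is a minimal illustration under strong assumptions: hand-written 3-dimensional vectors stand in for the multimodal retriever's embeddings, the final step only builds a text prompt rather than calling an MLLM, and every name (`ConceptStore`, `build_prompt`, the `<toy-...>` concepts) is hypothetical rather than part of RAP's released code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class ConceptStore:
    """Remember: store (embedding, info) pairs for user-specific concepts."""

    def __init__(self):
        self.items = []

    def remember(self, embedding, info):
        self.items.append((embedding, info))

    def retrieve(self, query_embedding, top_k=1):
        """Retrieve: return the top_k infos ranked by cosine similarity."""
        ranked = sorted(self.items,
                        key=lambda kv: cosine(kv[0], query_embedding),
                        reverse=True)
        return [info for _, info in ranked[:top_k]]


def build_prompt(query, retrieved):
    """Generate (input side only): prepend retrieved concept info to the query."""
    context = "\n".join(f"- {info}" for info in retrieved)
    return f"Known concepts:\n{context}\nUser: {query}"


store = ConceptStore()
store.remember([1.0, 0.0, 0.1], "<toy-dog>: the user's dog, a corgi")
store.remember([0.0, 1.0, 0.2], "<toy-mug>: the user's blue mug")
hits = store.retrieve([0.9, 0.1, 0.0], top_k=1)
prompt = build_prompt("What is in this photo?", hits)
```

Because the concepts live in an external store, editing a concept is just a call to `remember` (or removing an entry), which mirrors the abstract's point about real-time concept editing without retraining.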

[NLP-65] LAR-ECHR: A New Legal Argument Reasoning Task and Dataset for Cases of the European Court of Human Rights

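[Quick read]: This paper addresses how to evaluate the legal reasoning capabilities of large language models (LLMs). The key to the solution is a new task, Legal Argument Reasoning (LAR), together with a dataset (LAR-ECHR) built from cases of the European Court of Human Rights (ECHR). The task is multiple choice: given the facts of a case, select the correct next statement in a chain of legal arguments. Evaluating LLMs on it exposes shortcomings in their legal reasoning (the best model, GPT-4o, reaches 75.8% accuracy), which points to room for future model improvement.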

Link: https://arxiv.org/abs/2410.13352
Authors: Odysseas S. Chlapanis,Dimitrios Galanis,Ion Androutsopoulos
Keywords: Large Language Models, Large Language, present Legal Argument, capabilities of Large, Legal Argument Reasoning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Published in Natural Legal Language Processing (NLLP) 2024 workshop

Abstract:We present Legal Argument Reasoning (LAR), a novel task designed to evaluate the legal reasoning capabilities of Large Language Models (LLMs). The task requires selecting the correct next statement (from multiple choice options) in a chain of legal arguments from court proceedings, given the facts of the case. We constructed a dataset (LAR-ECHR) for this task using cases from the European Court of Human Rights (ECHR). We evaluated seven general-purpose LLMs on LAR-ECHR and found that (a) the ranking of the models is aligned with that of LegalBench, an established US-based legal reasoning benchmark, even though LAR-ECHR is based on EU law, (b) LAR-ECHR distinguishes top models more clearly, compared to LegalBench, (c) even the best model (GPT-4o) obtains 75.8% accuracy on LAR-ECHR, indicating significant potential for further model improvement. The process followed to construct LAR-ECHR can be replicated with cases from other legal systems.

[NLP-66] Representation Learning of Structured Data for Medical Foundation Models NEURIPS2024

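[Quick read]: This paper addresses the poor performance of large language models (LLMs) on structured non-textual data in medical records, such as ICD-10 or SNOMED-CT codes. The key to the solution is the UniStruct architecture, which adapts subword tokenization specifically to structured medical codes in order to build a medical foundation model that combines unstructured text with structured data. Pre-trained on more than 1 billion tokens from an internal medical database, the model improves evaluation metrics by up to 23%, with around 2% attributed to the proposed tokenization, and on the EHRSHOT public benchmark it improves performance on over 42% of the downstream tasks.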

Link: https://arxiv.org/abs/2410.13351
Authors: Vijay Prakash Dwivedi,Viktor Schlegel,Andy T. Liu,Thanh-Tung Nguyen,Abhinav Ramesh Kashyap,Jeng Wei,Wei-Hsian Yin,Stefan Winkler,Robby T. Tan
Keywords: Large Language Models, Large Language, demonstrated remarkable performance, Language Models, including healthcare
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: NeurIPS 2024 Workshop on Unifying Representations in Neural Models (UniReps 2024)

Abstract:Large Language Models (LLMs) have demonstrated remarkable performance across various domains, including healthcare. However, their ability to effectively represent structured non-textual data, such as the alphanumeric medical codes used in records like ICD-10 or SNOMED-CT, is limited and has been particularly exposed in recent research. This paper examines the challenges LLMs face in processing medical codes due to the shortcomings of current tokenization methods. As a result, we introduce the UniStruct architecture to design a multimodal medical foundation model of unstructured text and structured data, which addresses these challenges by adapting subword tokenization techniques specifically for the structured medical codes. Our approach is validated through model pre-training on both an extensive internal medical database and a public repository of structured medical records. Trained on over 1 billion tokens on the internal medical database, the proposed model achieves up to a 23% improvement in evaluation metrics, with around 2% gain attributed to our proposed tokenization. Additionally, when evaluated on the EHRSHOT public benchmark with a 1/1000 fraction of the pre-training data, the UniStruct model improves performance on over 42% of the downstream tasks. Our approach not only enhances the representation and generalization capabilities of patient-centric models but also bridges a critical gap in representation learning models’ ability to handle complex structured medical data, alongside unstructured text.

[NLP-67] Cerberus: Efficient Inference with Adaptive Parallel Decoding and Sequential Knowledge Enhancement

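[Quick read]: This paper addresses the inference speed bottleneck of large language models (LLMs) caused by auto-regressive decoding. The key to the solution is Cerberus, an adaptive parallel decoding framework. It introduces a gating mechanism that lets the LLM choose an appropriate decoding approach at each decoding step, and a new paradigm of decoding heads that adds sequential knowledge while keeping execution parallel. Experiments show that Cerberus achieves up to a 2.12x speedup over auto-regressive decoding and outperforms Medusa, a leading parallel decoding framework, in both acceleration and generation quality.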

Link: https://arxiv.org/abs/2410.13344
Authors: Yuxuan Liu,Wenyuan Li,Laizhong Cui,Hailiang Yang
Keywords: Large language models, Large language, parallel decoding, parallel decoding frameworks, decoding
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) often face a bottleneck in inference speed due to their reliance on auto-regressive decoding. Recently, parallel decoding has shown significant promise in enhancing inference efficiency. However, we have identified two key issues with existing parallel decoding frameworks: (1) decoding heads fail to balance prediction accuracy and the parallelism of execution, and (2) parallel decoding is not a universal solution, as it can bring unnecessary overheads at some challenging decoding steps. To address these issues, we propose Cerberus, an adaptive parallel decoding framework that introduces a gating mechanism to enable LLMs to adaptively choose appropriate decoding approaches at each decoding step, along with a new paradigm of decoding heads that introduces sequential knowledge while maintaining execution parallelism. The experimental results demonstrate that Cerberus can achieve up to a 2.12x speedup compared to auto-regressive decoding, and outperforms one of the leading parallel decoding frameworks, Medusa, with a 10% - 30% increase in acceleration and superior generation quality.
摘要：大语言模型 (LLMs) 由于依赖自回归解码，经常面临推理速度的瓶颈。最近，并行解码显示出显著提升推理效率的潜力。然而，我们发现了现有并行解码框架的两个关键问题：(1) 解码头无法平衡预测准确性和执行并行性，(2) 并行解码并非通用解决方案，因为它在某些挑战性解码步骤中会带来不必要的开销。为了解决这些问题，我们提出了 Cerberus，这是一种自适应并行解码框架，引入了门控机制，使 LLMs 能够在每个解码步骤自适应地选择合适的解码方法，同时引入了一种新的解码头范式，在保持执行并行性的同时引入序列知识。实验结果表明，Cerberus 相比自回归解码可以实现高达 2.12 倍的加速，并且在加速性能上优于领先的并行解码框架 Medusa，加速提升 10% - 30%，且生成质量更优。
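
The per-step gating idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the gate scores, the `threshold`, and the helper names `decode_plan` and `tokens_proposed` are hypothetical, and the sketch assumes that a parallel step proposes one extra token per decoding head on top of the base token.

```python
from typing import List


def decode_plan(gate_scores: List[float], threshold: float = 0.5) -> List[str]:
    """Pick a decoding mode per step from a (hypothetical) gate score in [0, 1]."""
    return ["parallel" if s >= threshold else "autoregressive" for s in gate_scores]


def tokens_proposed(plan: List[str], heads: int = 3) -> int:
    """Count proposed tokens, assuming a parallel step proposes 1 + heads tokens."""
    return sum(1 + heads if mode == "parallel" else 1 for mode in plan)


# Three decoding steps; the gate is confident on the first and third.
plan = decode_plan([0.9, 0.2, 0.7])
print(plan, tokens_proposed(plan))
```

The point of the sketch is only that low-confidence steps fall back to ordinary auto-regressive decoding, so the overhead of parallel heads is paid only where the gate expects it to help.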

[NLP-68] Do LLMs Overcome Shortcut Learning? An Evaluation of Shortcut Challenges in Large Language Models

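【Quick Read】: This paper addresses the problem that large language models (LLMs) may rely on dataset biases as prediction shortcuts when handling natural language tasks, which harms their robustness and generalization. The key to the solution is Shortcut Suite, a test suite that combines six shortcut types, five evaluation metrics, and four prompting strategies to evaluate systematically how shortcuts affect LLM performance. The results show that chain-of-thought prompting notably reduces reliance on shortcuts and outperforms the other prompting strategies, which offers a new evaluation method and potential directions for improving the robustness and generalization of LLMs.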

链接: https://arxiv.org/abs/2410.13343
作者: Yu Yuan,Lili Zhao,Kai Zhang,Guangting Zheng,Qi Liu
关键词-EN: Large Language Models, natural language processing, Large Language, Language Models, language processing tasks
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable capabilities in various natural language processing tasks. However, LLMs may rely on dataset biases as shortcuts for prediction, which can significantly impair their robustness and generalization capabilities. This paper presents Shortcut Suite, a comprehensive test suite designed to evaluate the impact of shortcuts on LLMs’ performance, incorporating six shortcut types, five evaluation metrics, and four prompting strategies. Our extensive experiments yield several key findings: 1) LLMs demonstrate varying reliance on shortcuts for downstream tasks, significantly impairing their performance. 2) Larger LLMs are more likely to utilize shortcuts under zero-shot and few-shot in-context learning prompts. 3) Chain-of-thought prompting notably reduces shortcut reliance and outperforms other prompting strategies, while few-shot prompts generally underperform compared to zero-shot prompts. 4) LLMs often exhibit overconfidence in their predictions, especially when dealing with datasets that contain shortcuts. 5) LLMs generally have a lower explanation quality in shortcut-laden datasets, with errors falling into three types: distraction, disguised comprehension, and logical fallacy. Our findings offer new insights for evaluating robustness and generalization in LLMs and suggest potential directions for mitigating the reliance on shortcuts. The code is available at this https URL.
摘要：大语言模型 (LLMs) 在各种自然语言处理任务中展现了卓越的能力。然而，LLMs 可能依赖数据集中的偏差作为预测的捷径，这会显著损害其鲁棒性和泛化能力。本文介绍了 Shortcut Suite，这是一个综合测试套件，旨在评估捷径对 LLMs 性能的影响，包含了六种捷径类型、五种评估指标和四种提示策略。我们的广泛实验得出了几个关键发现：1) LLMs 在下游任务中对捷径的依赖程度不同，显著损害了其性能。2) 在零样本和少样本上下文学习提示下，较大的 LLMs 更可能利用捷径。3) 思维链提示显著减少了捷径依赖，并优于其他提示策略，而少样本提示通常表现不如零样本提示。4) LLMs 在预测时常常表现出过度自信，尤其是在处理包含捷径的数据集时。5) LLMs 在充满捷径的数据集中通常解释质量较低，错误可分为三种类型：干扰、伪装理解和逻辑谬误。我们的发现为评估 LLMs 的鲁棒性和泛化能力提供了新的见解，并指出了减少对捷径依赖的潜在方向。代码可在 this https URL 获取。

[NLP-69] Probing-RAG: Self-Probing to Guide Language Models in Selective Document Retrieval

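【Quick Read】: This paper addresses redundant retrieval steps in traditional retrieval-augmented generation (RAG), in particular retrieval that is still performed when the query needs no additional retrieval. The key to the solution is Probing-RAG, which feeds hidden-state representations from the intermediate layers of the language model to a pre-trained prober that decides dynamically whether additional retrieval is needed. This captures the model's internal cognition, so unnecessary retrieval steps are skipped and efficiency improves.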

链接: https://arxiv.org/abs/2410.13339
作者: Ingeol Baek,Hwan Chang,Byeongjeong Kim,Jimin Lee,Hwanhee Lee
关键词-EN: Retrieval-Augmented Generation, relevant external knowledge, incorporating relevant external, enhances language models, incorporating relevant
类目: Computation and Language (cs.CL)
备注: 6 figures, 13 tables

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) enhances language models by retrieving and incorporating relevant external knowledge. However, traditional retrieve-and-generate processes may not be optimized for real-world scenarios, where queries might require multiple retrieval steps or none at all. In this paper, we propose a Probing-RAG, which utilizes the hidden state representations from the intermediate layers of language models to adaptively determine the necessity of additional retrievals for a given query. By employing a pre-trained prober, Probing-RAG effectively captures the model’s internal cognition, enabling reliable decision-making about retrieving external documents. Experimental results across five open-domain QA datasets demonstrate that Probing-RAG outperforms previous methods while reducing the number of redundant retrieval steps.
摘要：检索增强生成 (Retrieval-Augmented Generation, RAG) 通过检索并整合相关外部知识来增强语言模型。然而，传统的检索与生成过程可能未针对现实场景进行优化，其中查询可能需要多次检索步骤或根本不需要检索。本文提出了一种探针增强的 RAG (Probing-RAG)，该方法利用语言模型中间层的隐藏状态表示，自适应地确定给定查询是否需要额外的检索。通过采用预训练的探针，Probing-RAG 有效地捕捉了模型的内部认知，从而能够可靠地决定是否检索外部文档。在五个开放域问答数据集上的实验结果表明，Probing-RAG 在减少冗余检索步骤的同时，优于先前的方法。
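
The retrieve-only-when-needed loop described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the prober here is a toy logistic classifier, and `needs_retrieval`, `adaptive_rag`, the stub encoder, and the stub retriever are hypothetical names standing in for the intermediate-layer hidden states and the document retriever.

```python
import math
from typing import Callable, List, Sequence


def needs_retrieval(hidden: Sequence[float], weights: Sequence[float],
                    bias: float = 0.0, threshold: float = 0.5) -> bool:
    """Toy prober: logistic regression over a hidden-state vector."""
    logit = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit)) >= threshold


def adaptive_rag(query: str,
                 encode: Callable[[str, List[str]], Sequence[float]],
                 retrieve: Callable[[str, int], str],
                 weights: Sequence[float],
                 max_retrievals: int = 3) -> List[str]:
    """Retrieve documents only while the prober says more evidence is needed."""
    context: List[str] = []
    for _ in range(max_retrievals):
        hidden = encode(query, context)  # stand-in for intermediate-layer states
        if not needs_retrieval(hidden, weights):
            break
        context.append(retrieve(query, len(context)))
    return context


# Stub encoder: the "need" signal drops once one document is in the context.
def stub_encode(query: str, context: List[str]) -> List[float]:
    return [1.0 - 2.0 * len(context)]


def stub_retrieve(query: str, index: int) -> str:
    return f"doc{index}"


print(adaptive_rag("who wrote X?", stub_encode, stub_retrieve, weights=[2.0]))
```

With these stubs the loop retrieves once and then stops; a query whose hidden state already signals sufficiency triggers no retrieval at all.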

[NLP-70] Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI Systems

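【Quick Read】: This paper addresses the safety risks that arise when large language models (LLMs) face malicious inputs, in particular "jailbreaks", where malicious inputs coerce an LLM into generating harmful content. The key lies in identifying and mitigating the intentional biases introduced by safety measures, which can make a model respond differently to different sensitive keywords (for example non-binary versus cisgender, or white versus black). The paper introduces the concept of PCJailbreak to expose the risks posed by these safety-induced biases, and proposes PCDefense, an efficient defense that prevents jailbreak attempts by injecting defense prompts before generation. This avoids the drawback of guard models such as Llama-Guard, which add inference cost after text generation.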

链接: https://arxiv.org/abs/2410.13334
作者: Isack Lee,Haebin Seong
关键词-EN: generating harmful content, demonstrate impressive proficiency, large language models, present potential safety, demonstrate impressive
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Although large language models (LLMs) demonstrate impressive proficiency in various tasks, they present potential safety risks, such as ‘jailbreaks’, where malicious inputs can coerce LLMs into generating harmful content. To address these issues, many LLM developers have implemented various safety measures to align these models. This alignment involves several techniques, including data filtering during pre-training, supervised fine-tuning, reinforcement learning from human feedback, and red-teaming exercises. These methods often introduce deliberate and intentional biases similar to Political Correctness (PC) to ensure the ethical behavior of LLMs. In this paper, we delve into the intentional biases injected into LLMs for safety purposes and examine methods to circumvent these safety alignment techniques. Notably, these intentional biases result in a jailbreaking success rate in GPT-4o models that differs by 20% between non-binary and cisgender keywords and by 16% between white and black keywords, even when the other parts of the prompts are identical. We introduce the concept of PCJailbreak, highlighting the inherent risks posed by these safety-induced biases. Additionally, we propose an efficient defense method PCDefense, which prevents jailbreak attempts by injecting defense prompts prior to generation. PCDefense stands as an appealing alternative to Guard Models, such as Llama-Guard, that require additional inference cost after text generation. Our findings emphasize the urgent need for LLM developers to adopt a more responsible approach when designing and implementing safety measures.
摘要：尽管大语言模型 (LLM) 在各种任务中展现出令人印象深刻的熟练度，但它们也存在潜在的安全风险，例如“越狱”现象，即恶意输入可能导致 LLM 生成有害内容。为应对这些问题，许多 LLM 开发者已实施了多种安全措施以使这些模型符合规范。这种对齐涉及多种技术，包括预训练期间的数据过滤、监督微调、基于人类反馈的强化学习以及红队演练。这些方法通常会引入类似于政治正确 (PC) 的有意偏见，以确保 LLM 的伦理行为。本文深入探讨了为安全目的而注入 LLM 的有意偏见，并研究了绕过这些安全对齐技术的方法。值得注意的是，这些有意偏见导致 GPT-4o 模型在非二元和顺性别关键词之间的越狱成功率相差 20%，而在白人和黑人关键词之间的成功率相差 16%，即使提示的其他部分完全相同。我们提出了 PCJailbreak 的概念，强调了这些安全诱导偏见带来的固有风险。此外，我们提出了一种高效的防御方法 PCDefense，通过在生成前注入防御提示来防止越狱尝试。PCDefense 作为一种有吸引力的替代方案，避免了像 Llama-Guard 这样的防护模型在文本生成后需要额外推理成本的问题。我们的研究结果强调了 LLM 开发者在设计和实施安全措施时采取更负责任方法的迫切需要。

[NLP-71] Fine-Tuning Language Models on Multiple Datasets for Citation Intention Classification EMNLP2024

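【Quick Read】: This paper addresses the tendency of pretrained language models (PLMs) to overfit small datasets during fine-tuning for citation intention classification (CIC). The key to the solution is a multi-task learning (MTL) framework that jointly fine-tunes a PLM on the primary dataset and several auxiliary CIC datasets, using the additional supervision signals to improve generalization. The paper also develops a data-driven task relation learning (TRL) method that controls the contribution of the auxiliary datasets, which avoids negative transfer and reduces the cost of hyper-parameter tuning. Experiments show that the framework outperforms the current state-of-the-art models by 7% to 11% on small datasets while matching the best-performing model on a large dataset.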

链接: https://arxiv.org/abs/2410.13332
作者: Zeren Shui,Petros Karypis,Daniel S. Karls,Mingjian Wen,Saurav Manchanda,Ellad B. Tadmor,George Karypis
关键词-EN: Citation intention Classification, tools classify citations, intention Classification, Citation intention, classify citations
类目: Computation and Language (cs.CL)
备注: To appear as a Findings paper at EMNLP 2024

点击查看摘要

Abstract:Citation intention Classification (CIC) tools classify citations by their intention (e.g., background, motivation) and assist readers in evaluating the contribution of scientific literature. Prior research has shown that pretrained language models (PLMs) such as SciBERT can achieve state-of-the-art performance on CIC benchmarks. PLMs are trained via self-supervision tasks on a large corpus of general text and can quickly adapt to CIC tasks via moderate fine-tuning on the corresponding dataset. Despite their advantages, PLMs can easily overfit small datasets during fine-tuning. In this paper, we propose a multi-task learning (MTL) framework that jointly fine-tunes PLMs on a dataset of primary interest together with multiple auxiliary CIC datasets to take advantage of additional supervision signals. We develop a data-driven task relation learning (TRL) method that controls the contribution of auxiliary datasets to avoid negative transfer and expensive hyper-parameter tuning. We conduct experiments on three CIC datasets and show that fine-tuning with additional datasets can improve the PLMs’ generalization performance on the primary dataset. PLMs fine-tuned with our proposed framework outperform the current state-of-the-art models by 7% to 11% on small datasets while aligning with the best-performing model on a large dataset.
摘要：引用意图分类 (Citation Intention Classification, CIC) 工具通过分类引用的意图（例如，背景、动机）来帮助读者评估科学文献的贡献。先前的研究表明，预训练语言模型 (Pretrained Language Models, PLMs) 如 SciBERT 可以在 CIC 基准测试中达到最先进的性能。PLMs 通过在大规模通用文本语料库上的自监督任务进行训练，并可以通过对相应数据集进行适度微调来快速适应 CIC 任务。尽管它们具有优势，但 PLMs 在微调过程中容易在小数据集上过拟合。在本文中，我们提出了一种多任务学习 (Multi-task Learning, MTL) 框架，该框架联合微调 PLMs 在主要感兴趣的数据集上以及多个辅助 CIC 数据集上，以利用额外的监督信号。我们开发了一种数据驱动的任务关系学习 (Task Relation Learning, TRL) 方法，该方法控制辅助数据集的贡献，以避免负迁移和昂贵的超参数调优。我们在三个 CIC 数据集上进行了实验，并表明通过添加数据集进行微调可以提高 PLMs 在主要数据集上的泛化性能。使用我们提出的框架微调的 PLMs 在小数据集上的表现比当前最先进的模型高出 7% 到 11%，而在大数据集上与最佳表现模型保持一致。
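
One common way to combine a primary task with auxiliary datasets is a weighted sum of losses, sketched below. This is an assumption made for illustration: the abstract does not give the exact objective, and in the paper the contribution of each auxiliary dataset is learned by the TRL method, whereas the hypothetical `multitask_loss` helper here takes fixed weights.

```python
from typing import Sequence


def multitask_loss(primary_loss: float,
                   aux_losses: Sequence[float],
                   aux_weights: Sequence[float]) -> float:
    """Primary loss plus a weighted sum of auxiliary-dataset losses.

    A weight of 0 switches an auxiliary dataset off, which is one simple way
    to limit negative transfer from a poorly related dataset.
    """
    if len(aux_losses) != len(aux_weights):
        raise ValueError("need exactly one weight per auxiliary loss")
    return primary_loss + sum(w * l for w, l in zip(aux_weights, aux_losses))


# One primary dataset and two auxiliary CIC datasets with fixed, made-up weights.
print(multitask_loss(1.0, [2.0, 4.0], [0.5, 0.25]))
```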

[NLP-72] Mitigating Hallucinations in Large Vision-Language Models via Summary-Guided Decoding

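【Quick Read】: This paper addresses the hallucinations that arise when large vision-language models (LVLMs) rely too heavily on language priors while generating responses to visual inputs. The key to the solution is a new method, Summary-Guided Decoding (SGD). SGD shortens the text context through summaries, which encourages the model to attend more to image information, and it controls only the image-related part-of-speech (POS) tokens in order to maintain text quality. Experiments show that SGD achieves state-of-the-art performance on object hallucination benchmarks and is Pareto-optimal among existing methods in the trade-off between precision and recall, balancing hallucination reduction against text quality.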

链接: https://arxiv.org/abs/2410.13321
作者: Kyungmin Min,Minbeom Kim,Kang-il Lee,Dongryeol Lee,Kyomin Jung
关键词-EN: Large Vision-Language Models, Large Vision-Language, demonstrate impressive capabilities, visual inputs, impressive capabilities
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) demonstrate impressive capabilities in generating detailed and coherent responses from visual inputs. However, they are prone to generate hallucinations due to an over-reliance on language priors. To address this issue, we investigate the language priors in LVLMs and make two key observations: (1) Even when predicting the tokens associated with image-related part-of-speech (POS), models increasingly rely on linguistic priors as the token sequences grow, thereby amplifying hallucinations. (2) Methods that directly calibrate LVLM’s output distribution to mitigate language priors can lead to a degradation in text quality or even exacerbate hallucinations. Based on these findings, we propose a novel method, Summary-Guided Decoding (SGD). This method naturally encourages the model to focus more on image information by reducing the text context through summaries, while controlling only the image-related POS tokens to maintain text quality. Through experiments, we demonstrate that SGD achieves state-of-the-art performance on object hallucination benchmarks. Furthermore, in terms of the trade-off between precision and recall, SGD achieves Pareto optimality among the existing methods. Lastly, we observe that although existing methods struggle to balance the reduction of object hallucinations with maintaining text quality, SGD demonstrates robustness in handling this challenge.
摘要：大型视觉-语言模型 (Large Vision-Language Models, LVLMs) 在从视觉输入生成详细且连贯的响应方面展现了令人印象深刻的能力。然而，由于过度依赖语言先验，这些模型容易产生幻觉。为了解决这一问题，我们研究了 LVLMs 中的语言先验，并得出两个关键观察结果：(1) 即使在预测与图像相关的词性 (Part-of-Speech, POS) 的 Token 时，随着 Token 序列的增长，模型越来越依赖语言先验，从而放大了幻觉。(2) 直接校准 LVLM 输出分布以减轻语言先验的方法可能会导致文本质量下降，甚至加剧幻觉。基于这些发现，我们提出了一种新颖的方法，即总结引导解码 (Summary-Guided Decoding, SGD)。该方法通过减少文本上下文（通过总结）自然地鼓励模型更多地关注图像信息，同时仅控制与图像相关的 POS Token 以保持文本质量。通过实验，我们证明了 SGD 在对象幻觉基准测试中达到了最先进的性能。此外，在精确率和召回率的权衡方面，SGD 在现有方法中实现了帕累托最优。最后，我们观察到，尽管现有方法在减少对象幻觉和保持文本质量之间难以平衡，但 SGD 在应对这一挑战时表现出了鲁棒性。

[NLP-73] Computational Approaches to Arabic-English Code-Switching

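【Quick Read】: This work addresses the challenges that Arabic-English code-switching (CS) data poses for named entity recognition (NER). The key contributions are the first annotated Arabic-English CS corpus for the NER task and two enhancement techniques, CS contextual embeddings and data augmentation, that improve NER taggers on CS data. The work also proposes several intra-word language identification approaches that determine the language type of a mixed text and identify whether it is a named entity.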

链接: https://arxiv.org/abs/2410.13318
作者: Caroline Sabty
关键词-EN: Natural Language Processing, addressing language processing, Language Processing, Modern Standard Arabic, NLP tasks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: PhD thesis

点击查看摘要

Abstract:Natural Language Processing (NLP) is a vital computational method for addressing language processing, analysis, and generation. NLP tasks form the core of many daily applications, from automatic text correction to speech recognition. While significant research has focused on NLP tasks for the English language, less attention has been given to Modern Standard Arabic and Dialectal Arabic. Globalization has also contributed to the rise of Code-Switching (CS), where speakers mix languages within conversations and even within individual words (intra-word CS). This is especially common in Arab countries, where people often switch between dialects or between dialects and a foreign language they master. CS between Arabic and English is frequent in Egypt, especially on social media. Consequently, a significant amount of code-switched content can be found online. Such code-switched data needs to be investigated and analyzed for several NLP tasks to tackle the challenges of this multilingual phenomenon and Arabic language challenges. No work has been done before for several integral NLP tasks on Arabic-English CS data. In this work, we focus on the Named Entity Recognition (NER) task and other tasks that help propose a solution for the NER task on CS data, e.g., Language Identification. This work addresses this gap by proposing and applying state-of-the-art techniques for Modern Standard Arabic and Arabic-English NER. We have created the first annotated CS Arabic-English corpus for the NER task. Also, we apply two enhancement techniques to improve the NER tagger on CS data using CS contextual embeddings and data augmentation techniques. All methods showed improvements in the performance of the NER taggers on CS data. Finally, we propose several intra-word language identification approaches to determine the language type of a mixed text and identify whether it is a named entity or not.
摘要：自然语言处理 (Natural Language Processing, NLP) 是解决语言处理、分析和生成的重要计算方法。NLP 任务构成了许多日常应用的核心，从自动文本校正到语音识别。尽管大量研究集中在英语的 NLP 任务上，但对现代标准阿拉伯语和阿拉伯方言的关注较少。全球化也促进了代码转换 (Code-Switching, CS) 的兴起，即说话者在对话中甚至在单个词内 (intra-word CS) 混合语言。这在阿拉伯国家尤为常见，人们经常在方言之间或方言与他们掌握的外语之间切换。在埃及，阿拉伯语和英语之间的代码转换在社交媒体上尤为频繁。因此，在线可以找到大量代码转换的内容。为了应对这种多语言现象和阿拉伯语的挑战，需要对这些代码转换数据进行调查和分析，以用于多个 NLP 任务。之前没有针对阿拉伯语-英语代码转换数据进行过多个核心 NLP 任务的研究。在本研究中，我们专注于命名实体识别 (Named Entity Recognition, NER) 任务以及其他有助于提出代码转换数据上 NER 任务解决方案的任务，例如语言识别。本研究通过提出并应用最先进的现代标准阿拉伯语和阿拉伯语-英语 NER 技术来填补这一空白。我们创建了首个用于 NER 任务的阿拉伯语-英语代码转换标注语料库。此外，我们应用了两种增强技术，通过代码转换上下文嵌入和数据增强技术来改进代码转换数据上的 NER 标注器。所有方法都显示出在代码转换数据上 NER 标注器性能的提升。最后，我们提出了几种词内语言识别方法，以确定混合文本的语言类型并识别其是否为命名实体。

[NLP-74] Mitigating Biases to Embrace Diversity: A Comprehensive Annotation Benchmark for Toxic Language EMNLP

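【Quick Read】: This paper addresses the inconsistency and subjectivity that descriptive instructions introduce when annotating offensive language. The key to the solution is a prescriptive annotation benchmark grounded in humanities research that ensures consistent, unbiased labeling, particularly for casual and non-mainstream language use. Using two newly annotated datasets, the study shows that LLMs can serve as effective alternatives when professional annotators are unavailable, and that smaller models fine-tuned on multi-source LLM-annotated data outperform models trained on larger, single-source human-annotated datasets. These findings highlight the value of structured guidelines in reducing subjective variability, maintaining performance with limited data, and embracing language diversity.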

链接: https://arxiv.org/abs/2410.13313
作者: Xinmeng Hou
关键词-EN: prescriptive annotation benchmark, annotation benchmark grounded, ensure consistent, unbiased labeling, study introduces
类目: Computation and Language (cs.CL)
备注: 12 pages, 9 figures, EMNLP-NLP4DH 2024

点击查看摘要

Abstract:This study introduces a prescriptive annotation benchmark grounded in humanities research to ensure consistent, unbiased labeling of offensive language, particularly for casual and non-mainstream language uses. We contribute two newly annotated datasets that achieve higher inter-annotator agreement between human and language model (LLM) annotations compared to original datasets based on descriptive instructions. Our experiments show that LLMs can serve as effective alternatives when professional annotators are unavailable. Moreover, smaller models fine-tuned on multi-source LLM-annotated data outperform models trained on larger, single-source human-annotated datasets. These findings highlight the value of structured guidelines in reducing subjective variability, maintaining performance with limited data, and embracing language diversity. Content Warning: This article only analyzes offensive language for academic purposes. Discretion is advised.
摘要：本研究引入了一个基于人文学科研究的规范性标注基准，以确保对冒犯性语言的一致且无偏见的标注，特别是针对非正式和非主流语言的使用。我们贡献了两个新标注的数据集，这些数据集在人类和语言模型（LLM）标注之间的标注者间一致性方面优于基于描述性指令的原始数据集。我们的实验表明，当专业标注者不可用时，LLM 可以作为有效的替代方案。此外，在多源 LLM 标注数据上微调的较小模型在性能上优于在更大、单一源人类标注数据集上训练的模型。这些发现突显了结构化指南在减少主观变异性、在有限数据下保持性能以及拥抱语言多样性方面的价值。内容警告：本文仅出于学术目的分析冒犯性语言。请谨慎阅读。

[NLP-75] Reference-Based Post-OCR Processing with LLM for Diacritic Languages

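【Quick Read】: This paper addresses the difficulty of extracting fine-grained OCR text from aged documents in diacritic languages, where the main challenges are unexpected artifacts, time-induced degradation, and a lack of datasets. The key to the solution is to use available content-focused ebooks as a reference base, together with large language models, to correct imperfect OCR-generated text. The method generates high-precision pseudo page-to-page labels, copes with the small strokes that are hard to recognize in historical documents, and removes various kinds of noise in post-processing, fixing missing characters, missing words, and disordered sequences. On an OCR dataset of classical Vietnamese books, it outperforms the state-of-the-art Transformer-based Vietnamese spell correction model (a mean grading score of 8.72 versus 7.03 on a 10-point scale).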

链接: https://arxiv.org/abs/2410.13305
作者: Thao Do
关键词-EN: Extracting fine-grained OCR, remains challenging due, Extracting fine-grained, languages remains challenging, time-induced degradation
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Extracting fine-grained OCR text from aged documents in diacritic languages remains challenging due to unexpected artifacts, time-induced degradation, and lack of datasets. While standalone spell correction approaches have been proposed, they show limited performance for historical documents due to numerous possible OCR error combinations and differences between modern and classical corpus distributions. We propose a method utilizing available content-focused ebooks as a reference base to correct imperfect OCR-generated text, supported by large language models. This technique generates high-precision pseudo-page-to-page labels for diacritic languages, where small strokes pose significant challenges in historical conditions. The pipeline eliminates various types of noise from aged documents and addresses issues such as missing characters, words, and disordered sequences. Our post-processing method, which generated a large OCR dataset of classical Vietnamese books, achieved a mean grading score of 8.72 on a 10-point scale. This outperformed the state-of-the-art transformer-based Vietnamese spell correction model, which scored 7.03 when evaluated on a sampled subset of the dataset. We also trained a baseline OCR model to assess and compare it with well-known engines. Experimental results demonstrate the strength of our baseline model compared to widely used open-source solutions. The resulting dataset will be released publicly to support future studies.
摘要：从带有变音符号的古旧文档中提取细粒度的 OCR 文本仍然面临挑战，原因在于意外的伪影、时间引起的退化以及缺乏数据集。尽管已有独立拼写校正方法被提出，但由于历史文档中可能存在大量 OCR 错误组合以及现代与古典语料库分布的差异，这些方法在处理历史文档时表现有限。我们提出了一种利用现有内容导向的电子书作为参考基础，结合大语言模型来校正不完美的 OCR 生成文本的方法。该技术为变音符号语言生成了高精度的伪页对页标签，在历史条件下，小笔画构成了显著的挑战。该流程消除了古旧文档中的各种噪声，并解决了字符、单词缺失以及序列混乱等问题。我们的后处理方法生成了一个包含古典越南书籍的大规模 OCR 数据集，在 10 分制评分中达到了 8.72 的平均评分。这一成绩优于当前最先进的基于 Transformer 的越南语拼写校正模型，该模型在数据集的抽样子集上的评分为 7.03。我们还训练了一个基准 OCR 模型，以评估并与其知名引擎进行比较。实验结果表明，我们的基准模型相比广泛使用的开源解决方案具有更强的性能。最终的数据集将公开发布，以支持未来的研究。

[NLP-76] Advancing Large Language Model Attribution through Self-Improving EMNLP2024

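【Quick Read】: This paper addresses the problem that text generated by large language models (LLMs) lacks citations to evidence sources, which leads to hallucinated information and poor verifiability. The key to the solution is START, a self-taught attribution framework: the model first self-constructs synthetic training data for warm-up, then iteratively improves its attribution ability using fine-grained preference supervision signals. Without relying on human annotations or more advanced models, this yields significant gains on open-domain question answering tasks.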

链接: https://arxiv.org/abs/2410.13298
作者: Lei Huang,Xiaocheng Feng,Weitao Ma,Liang Zhao,Yuchun Fan,Weihong Zhong,Dongliang Xu,Qing Yang,Hongtao Liu,Bing Qin
关键词-EN: Teaching large language, Teaching large, large language models, information-seeking systems, large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by EMNLP 2024 Main Conference

点击查看摘要

Abstract:Teaching large language models (LLMs) to generate text with citations to evidence sources can mitigate hallucinations and enhance verifiability in information-seeking systems. However, improving this capability requires high-quality attribution data, which is costly and labor-intensive. Inspired by recent advances in self-improvement that enhance LLMs without manual annotation, we present START, a Self-Taught AttRibuTion framework for iteratively improving the attribution capability of LLMs. First, to prevent models from stagnating due to initially insufficient supervision signals, START leverages the model to self-construct synthetic training data for warming up. To further self-improve the model’s attribution ability, START iteratively utilizes fine-grained preference supervision signals constructed from its sampled responses to encourage robust, comprehensive, and attributable generation. Experiments on three open-domain question-answering datasets, covering long-form QA and multi-step reasoning, demonstrate significant performance gains of 25.13% on average without relying on human annotations and more advanced models. Further analysis reveals that START excels in aggregating information across multiple sources.
摘要：教授大语言模型 (LLMs) 生成带有证据来源引用的文本，可以缓解幻觉问题并增强信息检索系统的可验证性。然而，提升这一能力需要高质量的归属数据，这既昂贵又耗费人力。受近期无需人工标注即可增强 LLMs 的自改进技术启发，我们提出了 START，一种自学习归属框架，用于迭代提升 LLMs 的归属能力。首先，为防止模型因初始监督信号不足而停滞，START 利用模型自我构建合成训练数据进行预热。为进一步自我提升模型的归属能力，START 迭代利用从其采样响应中构建的细粒度偏好监督信号，以鼓励生成稳健、全面且可归属的内容。在涵盖长篇问答和多步骤推理的三个开放领域问答数据集上的实验表明，START 在不依赖人工标注和更先进模型的情况下，平均性能提升了 25.13%。进一步分析显示，START 在整合多源信息方面表现出色。

[NLP-77] Learning to Route with Confidence Tokens

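【Quick Read】: This paper addresses the unreliability of large language model (LLM) outputs in real-world use, especially in high-stakes settings where knowing whether an output can be trusted is essential. The key to the solution is Self-REF, a lightweight training strategy that introduces confidence tokens to teach LLMs to express confidence in their answers reliably. A confidence score can be extracted from these tokens, and it significantly improves downstream routing and rejection learning compared with conventional approaches such as verbalized confidence and token-probability analysis.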

链接: https://arxiv.org/abs/2410.13284
作者: Yu-Neng Chuang,Helen Zhou,Prathusha Kameswara Sarma,Parikshit Gopalan,John Boccio,Sara Bolouki,Xia Hu
关键词-EN: Large language models, demonstrated impressive performance, Large language, language models, real-world applications
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated impressive performance on several tasks and are increasingly deployed in real-world applications. However, especially in high-stakes settings, it becomes vital to know when the output of an LLM may be unreliable. Depending on whether an answer is trustworthy, a system can then choose to route the question to another expert, or otherwise fall back on a safe default behavior. In this work, we study the extent to which LLMs can reliably indicate confidence in their answers, and how this notion of confidence can translate into downstream accuracy gains. We propose Self-REF, a lightweight training strategy to teach LLMs to express confidence in whether their answers are correct in a reliable manner. Self-REF introduces confidence tokens into the LLM, from which a confidence score can be extracted. Compared to conventional approaches such as verbalizing confidence and examining token probabilities, we demonstrate empirically that confidence tokens show significant improvements in downstream routing and rejection learning tasks.
摘要：大语言模型 (LLMs) 在多个任务上展示了令人印象深刻的表现，并越来越多地被部署在实际应用中。然而，特别是在高风险环境中，了解 LLM 的输出何时可能不可靠变得至关重要。根据答案的可信度，系统可以选择将问题转交给其他专家，或者采取安全的默认行为。在本研究中，我们探讨了 LLMs 在多大程度上能够可靠地表示对其答案的信心，以及这种信心如何转化为下游准确性的提升。我们提出了 Self-REF，一种轻量级的训练策略，教导 LLMs 以可靠的方式表达对其答案正确性的信心。Self-REF 将信心 Token 引入 LLM，从中可以提取信心分数。与传统的表达信心和检查 Token 概率的方法相比，我们通过实验证明，信心 Token 在下游路由和拒绝学习任务中显示出显著的改进。
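
How a confidence score might drive routing can be sketched as below. The abstract only says that a score can be extracted from confidence tokens, so the specifics here are assumptions: the score is taken as the normalized probability of a hypothetical "confident" token against a hypothetical "unconfident" token, and `confidence_score`, `route`, and the `threshold` value are illustrative names and settings.

```python
def confidence_score(p_confident: float, p_unconfident: float) -> float:
    """Normalize the probabilities of two (hypothetical) confidence tokens."""
    total = p_confident + p_unconfident
    if total <= 0.0:
        raise ValueError("token probabilities must sum to a positive value")
    return p_confident / total


def route(answer: str, score: float, threshold: float = 0.7,
          fallback: str = "ROUTE_TO_EXPERT") -> str:
    """Keep the model's answer if it is confident enough, otherwise defer."""
    return answer if score >= threshold else fallback


score = confidence_score(0.75, 0.25)
print(score, route("42", score), route("42", 0.3))
```

The same threshold can serve rejection instead of routing by replacing the fallback with a safe default response.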

[NLP-78] BANTH: A Multi-label Hate Speech Detection Dataset for Transliterated Bangla

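【Quick Read】: This paper addresses multi-label hate speech classification in low-resource languages, specifically transliterated Bangla. The key to the solution is BanTH, the first multi-label hate speech dataset for transliterated Bangla, which contains 37.3k samples sourced from YouTube comments, each labeled with one or more target groups. The paper further pre-trains Transformer encoders on transliterated Bangla text, reaching state-of-the-art performance on BanTH, and proposes a translation-based LLM prompting strategy that outperforms other strategies in the zero-shot setting. Together these fill an important gap in hate speech research for Bangla and other low-resource languages.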

链接: https://arxiv.org/abs/2410.13281
作者: Fabiha Haider,Fariha Tanjim Shifat,Md Farhan Ishmam,Deeparghya Dutta Barua,Md Sakib Ul Rahman Sourove,Md Fahim,Md Farhad Alam
关键词-EN: classifying hate speech, hate speech, transliterated Bangla, digital spaces, spaces has emphasized
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The proliferation of transliterated texts in digital spaces has emphasized the need for detecting and classifying hate speech in languages beyond English, particularly in low-resource languages. As online discourse can perpetuate discrimination based on target groups, e.g. gender, religion, and origin, multi-label classification of hateful content can help in comprehending hate motivation and enhance content moderation. While previous efforts have focused on monolingual or binary hate classification tasks, no work has yet addressed the challenge of multi-label hate speech classification in transliterated Bangla. We introduce BanTH, the first multi-label transliterated Bangla hate speech dataset comprising 37.3k samples. The samples are sourced from YouTube comments, where each instance is labeled with one or more target groups, reflecting the regional demographic. We establish novel transformer encoder-based baselines by further pre-training on transliterated Bangla corpus. We also propose a novel translation-based LLM prompting strategy for transliterated text. Experiments reveal that our further pre-trained encoders are achieving state-of-the-art performance on the BanTH dataset, while our translation-based prompting outperforms other strategies in the zero-shot setting. The introduction of BanTH not only fills a critical gap in hate speech research for Bangla but also sets the stage for future exploration into code-mixed and multi-label classification challenges in underrepresented languages.
摘要：数字空间中音译文本的激增突显了在英语之外的语言中检测和分类仇恨言论的必要性，特别是在资源匮乏的语言中。由于在线讨论可能基于目标群体（如性别、宗教和出身）传播歧视，仇恨内容的多标签分类有助于理解仇恨动机并增强内容审核。尽管之前的研究集中在单语或二元仇恨分类任务上，但尚未有工作解决音译孟加拉语中多标签仇恨言论分类的挑战。我们引入了 BanTH，这是首个包含 37.3k 样本的多标签音译孟加拉语仇恨言论数据集。样本来源于 YouTube 评论，每个实例都标有一个或多个目标群体，反映了地区的人口统计特征。我们通过在音译孟加拉语文本上进一步预训练，建立了基于 Transformer 编码器的新基线。我们还提出了一种基于翻译的大语言模型提示策略，用于音译文本。实验表明，我们进一步预训练的编码器在 BanTH 数据集上达到了最先进的性能，而基于翻译的提示策略在零样本设置中优于其他策略。BanTH 的引入不仅填补了孟加拉语仇恨言论研究中的关键空白，还为未来探索代表性不足语言中的代码混合和多标签分类挑战奠定了基础。

[NLP-79] SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs

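【Quick Read】: This paper addresses the quadratic complexity of attention in large language models (LLMs), which limits efficiency and scalability, especially with long context windows. The key to the solution is SeerAttention, a new attention mechanism that adds a learnable gating network which adaptively selects the important blocks of the attention map and treats the remaining blocks as sparse. This block-level sparsity balances accuracy and speedup. The paper also develops a customized FlashAttention implementation that extracts the block-level ground truth of the attention map with minimal overhead, so the gating network can be trained efficiently. SeerAttention applies to post-training and also performs well in long-context fine-tuning, where it significantly outperforms existing static or heuristic-based sparse attention methods.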

链接: https://arxiv.org/abs/2410.13276
作者: Yizhao Gao,Zhichen Zeng,Dayou Du,Shijie Cao,Hayden Kwok-Hay So,Ting Cao,Fan Yang,Mao Yang
关键词-EN: Large Language Models, modern Large Language, Language Models, Large Language, modern Large
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Attention is the cornerstone of modern Large Language Models (LLMs). Yet its quadratic complexity limits the efficiency and scalability of LLMs, especially for those with a long-context window. A promising approach addressing this limitation is to leverage the sparsity in attention. However, existing sparsity-based solutions predominantly rely on predefined patterns or heuristics to approximate sparsity. This practice falls short of fully capturing the dynamic nature of attention sparsity in language-based tasks. This paper argues that attention sparsity should be learned rather than predefined. To this end, we design SeerAttention, a new attention mechanism that augments the conventional attention with a learnable gate that adaptively selects significant blocks in an attention map and deems the remaining blocks sparse. Such block-level sparsity effectively balances accuracy and speedup. To enable efficient learning of the gating network, we develop a customized FlashAttention implementation that extracts the block-level ground truth of attention map with minimum overhead. SeerAttention not only applies to post-training, but also excels in long-context fine-tuning. Our results show that at post-training stages, SeerAttention significantly outperforms state-of-the-art static or heuristic-based sparse attention methods, while also being more versatile and flexible to adapt to varying context lengths and sparsity ratios. When applied to long-context fine-tuning with YaRN, SeerAttention can achieve a remarkable 90% sparsity ratio at a 32k context length with minimal perplexity loss, offering a 5.67x speedup over FlashAttention-2.
摘要：注意力机制是现代大语言模型 (LLM) 的基石。然而，其二次复杂性限制了 LLM 的效率和可扩展性，尤其是在长上下文窗口的情况下。一种有前景的解决方法是利用注意力中的稀疏性。然而，现有的基于稀疏性的解决方案主要依赖于预定义的模式或启发式方法来近似稀疏性。这种方法无法充分捕捉语言任务中注意力稀疏性的动态特性。本文认为，注意力稀疏性应通过学习而非预定义来实现。为此，我们设计了 SeerAttention，这是一种新的注意力机制，通过增加一个可学习的门控机制，自适应地选择注意力图中的重要块，并将其余块视为稀疏。这种块级稀疏性有效地平衡了准确性和加速。为了实现门控网络的高效学习，我们开发了一种定制的 FlashAttention 实现，该实现以最小的开销提取注意力图的块级真实值。SeerAttention 不仅适用于训练后阶段，而且在长上下文微调中也表现出色。我们的结果表明，在训练后阶段，SeerAttention 显著优于最先进的静态或启发式稀疏注意力方法，同时更具通用性和灵活性，能够适应不同的上下文长度和稀疏比率。当应用于与 YaRN 结合的长上下文微调时，SeerAttention 可以在 32k 上下文长度下实现 90% 的稀疏比率，且困惑度损失最小，比 FlashAttention-2 提供 5.67 倍的加速。
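
Block-level sparsity can be pictured as a boolean mask over attention-map blocks. The sketch below is a simplification under stated assumptions, not the SeerAttention implementation: the gate scores are made-up numbers rather than outputs of a learned gating network, selection is a plain per-row top-k (the paper's selection rule may differ), causal masking is ignored, and `select_blocks` and `sparsity` are hypothetical helper names.

```python
from typing import List


def select_blocks(gate_scores: List[List[float]], keep: int) -> List[List[bool]]:
    """For each query-block row, keep the `keep` highest-scoring key blocks."""
    mask = []
    for row in gate_scores:
        top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:keep]
        kept = set(top)
        mask.append([j in kept for j in range(len(row))])
    return mask


def sparsity(mask: List[List[bool]]) -> float:
    """Fraction of blocks that are skipped (not computed)."""
    total = sum(len(row) for row in mask)
    computed = sum(sum(row) for row in mask)
    return 1.0 - computed / total


# Two query blocks, four key blocks each; keep one block per row.
scores = [[0.9, 0.1, 0.3, 0.2],
          [0.2, 0.8, 0.1, 0.7]]
mask = select_blocks(scores, keep=1)
print(mask, sparsity(mask))
```

Only the blocks marked `True` would be passed to the attention kernel, which is where the speedup over dense attention comes from.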

[NLP-80] Breaking Chains: Unraveling the Links in Multi-Hop Knowledge Unlearning

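【Quick Read】: This paper addresses the failure of existing unlearning techniques to fully remove indirect knowledge when handling multi-hop queries. The key to the solution is MUNCH, an uncertainty-based approach that breaks a multi-hop query down into subquestions and uses the uncertainty of the unlearned model in the final decision. This strengthens the unlearning process so that indirect knowledge is removed more thoroughly, and the approach can be combined with existing unlearning techniques.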

链接: https://arxiv.org/abs/2410.13274
作者: Minseok Choi,ChaeHun Park,Dohyun Lee,Jaegul Choo
关键词-EN: giant information stores, Large language models, Large language, serve as giant, information stores
类目: Computation and Language (cs.CL)
备注: 16 pages, 5 figures

点击查看摘要

Abstract:Large language models (LLMs) serve as giant information stores, often including personal or copyrighted data, and retraining them from scratch is not a viable option. This has led to the development of various fast, approximate unlearning techniques to selectively remove knowledge from LLMs. Prior research has largely focused on minimizing the probabilities of specific token sequences by reversing the language modeling objective. However, these methods still leave LLMs vulnerable to adversarial attacks that exploit indirect references. In this work, we examine the limitations of current unlearning techniques in effectively erasing a particular type of indirect prompt: multi-hop queries. Our findings reveal that existing methods fail to completely remove multi-hop knowledge when one of the intermediate hops is unlearned. To address this issue, we propose MUNCH, a simple uncertainty-based approach that breaks down multi-hop queries into subquestions and leverages the uncertainty of the unlearned model in final decision-making. Empirical results demonstrate the effectiveness of our framework, and MUNCH can be easily integrated with existing unlearning techniques, making it a flexible and useful solution for enhancing unlearning processes.

[NLP-81] Roadmap towards Superhuman Speech Understanding using Large Language Models

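[Quick read]: This paper addresses how to extend large language models (LLMs) to speech and audio data in order to build general foundation models that handle both textual and non-textual inputs. The key to the solution is a five-level roadmap that progresses from basic automatic speech recognition (ASR) to superhuman models able to integrate non-semantic information with abstract acoustic knowledge. The authors also design the SAGI Benchmark to standardize evaluation across tasks at these levels; it reveals gaps in handling paralinguistic cues and abstract acoustic knowledge and informs future research directions.

Link: https://arxiv.org/abs/2410.13268
Authors: Fan Bu, Yuhao Zhang, Xidong Wang, Benyou Wang, Qun Liu, Haizhou Li
Keywords: create general foundation, large language models, general foundation models, abstract acoustic knowledge, foundation models capable
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Notes: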

Abstract: The success of large language models (LLMs) has prompted efforts to integrate speech and audio data, aiming to create general foundation models capable of processing both textual and non-textual inputs. Recent advances, such as GPT-4o, highlight the potential for end-to-end speech LLMs, which preserve non-semantic information and world knowledge for deeper speech understanding. To guide the development of speech LLMs, we propose a five-level roadmap, ranging from basic automatic speech recognition (ASR) to advanced superhuman models capable of integrating non-semantic information with abstract acoustic knowledge for complex tasks. Moreover, we design a benchmark, SAGI Benchmark, that standardizes critical aspects across various tasks in these five levels, uncovering challenges in using abstract acoustic knowledge and completeness of capability. Our findings reveal gaps in handling paralinguistic cues and abstract acoustic knowledge, and we offer future directions. This paper outlines a roadmap for advancing speech LLMs, introduces a benchmark for evaluation, and provides key insights into their current limitations and potential.

[NLP-82] CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models

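[Quick read]: This paper addresses the challenges that current music information retrieval systems face in handling linguistic diversity and integrating multiple musical modalities (such as ABC notation and MIDI). The key to the solution is CLaMP 2, a system compatible with 101 languages that supports both ABC notation and MIDI. CLaMP 2 is pre-trained on 1.5 million ABC-MIDI-text triplets and combines a multilingual text encoder with a multimodal music encoder, aligned via contrastive learning, to enable multilingual, multimodal music information retrieval. In addition, large language models are used to refine the multilingual descriptions, which significantly reduces textual noise and balances the language distribution. The system achieves state-of-the-art results on multilingual semantic search and music classification.

Link: https://arxiv.org/abs/2410.13267
Authors: Shangda Wu, Yashan Wang, Ruibin Yuan, Zhancheng Guo, Xu Tan, Ge Zhang, Monan Zhou, Jing Chen, Xuefeng Mu, Yuejie Gao, Yuanliang Dong, Jiafeng Liu, Xiaobing Li, Feng Yu, Maosong Sun
Keywords: managing linguistic diversity, Challenges in managing, music information retrieval, Instrument Digital Interface, current music information
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Notes: 17 pages, 10 figures, 4 tables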

Abstract:Challenges in managing linguistic diversity and integrating various musical modalities are faced by current music information retrieval systems. These limitations reduce their effectiveness in a global, multimodal music environment. To address these issues, we introduce CLaMP 2, a system compatible with 101 languages that supports both ABC notation (a text-based musical notation format) and MIDI (Musical Instrument Digital Interface) for music information retrieval. CLaMP 2, pre-trained on 1.5 million ABC-MIDI-text triplets, includes a multilingual text encoder and a multimodal music encoder aligned via contrastive learning. By leveraging large language models, we obtain refined and consistent multilingual descriptions at scale, significantly reducing textual noise and balancing language distribution. Our experiments show that CLaMP 2 achieves state-of-the-art results in both multilingual semantic search and music classification across modalities, thus establishing a new standard for inclusive and global music information retrieval.
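The abstract says only that the text and music encoders are "aligned via contrastive learning". As a rough illustration of what such an objective can look like, the sketch below computes a symmetric InfoNCE-style loss over a small similarity matrix in plain Python. The function name `info_nce`, the temperature value, and the symmetric form are assumptions for illustration, not details confirmed for CLaMP 2.

```python
import math


def info_nce(sim, temperature=0.07):
    """Symmetric contrastive loss over a square similarity matrix.

    sim[i][j] is the similarity between text i and music piece j;
    matched text-music pairs are assumed to lie on the diagonal.
    """
    n = len(sim)

    def cross_entropy_rows(matrix):
        loss = 0.0
        for i in range(n):
            logits = [x / temperature for x in matrix[i]]
            top = max(logits)  # subtract the max for numerical stability
            log_z = top + math.log(sum(math.exp(v - top) for v in logits))
            loss += log_z - logits[i]
        return loss / n

    transposed = [list(col) for col in zip(*sim)]
    # Average the text-to-music and music-to-text directions.
    return 0.5 * (cross_entropy_rows(sim) + cross_entropy_rows(transposed))


aligned = [[1.0, 0.0], [0.0, 1.0]]
misaligned = [[0.0, 1.0], [1.0, 0.0]]
print(info_nce(aligned) < info_nce(misaligned))
```

The loss is small when each text is most similar to its own music piece and large when the pairs are swapped, which is the behavior a contrastive alignment objective is meant to encourage.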

[NLP-83] From Babbling to Fluency: Evaluating the Evolution of Language Models in Terms of Human Language Acquisition

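[Quick read]: This paper addresses how to evaluate the linguistic abilities of language models (LMs) from the perspective of human language acquisition. The key to the solution is a three-stage framework for assessing LMs, ranging from preliminary word understanding to complex grammar and logical reasoning. Using this framework, the authors evaluate the generative capacities of LMs with methods from linguistic research and find that, although recent LMs outperform earlier models overall, their developmental trajectory does not strictly follow the path of human language acquisition. In generation tasks, LMs most resemble human performance in areas where information is easy to extract from the corpus, such as average word length, clauses, and auxiliary verbs. Newer models nevertheless show no significant progress on dimensions such as clauses and auxiliary verbs, which may stem from the linguistic features of the training data: register theory suggests that these features substantially shape model abilities.

Link: https://arxiv.org/abs/2410.13259
Authors: Qiyuan Yang, Pengda Wang, Luke D. Plonsky, Frederick L. Oswald, Hanjie Chen
Keywords: human language acquisition, critical perspective, language acquisition, language capabilities, language
Categories: Computation and Language (cs.CL)
Notes: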

Abstract:We examine the language capabilities of language models (LMs) from the critical perspective of human language acquisition. Building on classical language development theories, we propose a three-stage framework to assess the abilities of LMs, ranging from preliminary word understanding to complex grammar and complex logical reasoning. Using this framework, we evaluate the generative capacities of LMs using methods from linguistic research. Results indicate that although recent LMs outperform earlier models in overall performance, their developmental trajectory does not strictly follow the path of human language acquisition. Notably, in generation tasks, LMs are more similar to human performance in areas where information is easier to extract from the corpus, such as average word length, clauses, and auxiliary verbs. Newer LMs did not exhibit significant progress in terms of specific dimensions, such as clauses and auxiliary verbs, where the variation across corpora is relatively limited. Register theory offers a plausible explanation for these observations, suggesting that the linguistic features of the training data have a substantial impact on the models’ abilities.

[NLP-84] A Systematic Investigation of Knowledge Retrieval and Selection for Retrieval Augmented Generation

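[Quick read]: This paper addresses how knowledge retrieval and knowledge selection affect downstream generation performance in retrieval-augmented generation (RAG) systems. The key to the solution is to simulate different retrieval and selection conditions and assess their impact on generation outcomes. The study finds that the capability of the downstream generator, together with the complexity of the task and dataset, significantly influences how much retrieval and selection matter. In typical scenarios, improving knowledge recall is the key to better generation, and a knowledge selector adds only limited benefit when a strong generator is used on clear, well-defined tasks. For weaker generators or more ambiguous tasks and datasets, the knowledge F1 score becomes critical and the knowledge selector plays a larger role in overall performance.

Link: https://arxiv.org/abs/2410.13258
Authors: Xiangci Li, Jessica Ouyang
Keywords: integrating external knowledge, natural language generation, Retrieval-augmented generation, enhancing natural language, knowledge
Categories: Computation and Language (cs.CL)
Notes: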

Abstract:Retrieval-augmented generation (RAG) has emerged as a powerful method for enhancing natural language generation by integrating external knowledge into a model’s output. While prior work has demonstrated the importance of improving knowledge retrieval for boosting generation quality, the role of knowledge selection remains less clear. In this paper, we perform a comprehensive analysis of how knowledge retrieval and selection influence downstream generation performance in RAG systems. By simulating different retrieval and selection conditions through a controlled mixture of gold and distractor knowledge, we assess the impact of these factors on generation outcomes. Our findings indicate that the downstream generator model’s capability, as well as the complexity of the task and dataset, significantly influence the impact of knowledge retrieval and selection on the overall RAG system performance. In typical scenarios, improving the knowledge recall score is key to enhancing generation outcomes, with the knowledge selector providing a limited additional benefit when a strong generator model is used on clear, well-defined tasks. For weaker generator models or more ambiguous tasks and datasets, the knowledge F1 score becomes a critical factor, and the knowledge selector plays a more prominent role in improving overall performance.
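The knowledge recall and knowledge F1 scores mentioned above can be illustrated with a short sketch. It treats the knowledge passed to the generator and the gold knowledge as sets of passage ids; the function name `knowledge_scores` and the ids are hypothetical, and the paper's exact scoring protocol may differ.

```python
def knowledge_scores(selected, gold):
    """Return (precision, recall, f1) of selected knowledge ids vs. gold ids."""
    selected, gold = set(selected), set(gold)
    hits = len(selected & gold)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical mixture: both gold passages plus two distractors.
retrieved = ["gold_1", "gold_2", "distractor_1", "distractor_2"]
gold = ["gold_1", "gold_2"]
print(knowledge_scores(retrieved, gold))

# A selector that drops the distractors keeps recall and raises F1.
print(knowledge_scores(["gold_1", "gold_2"], gold))
```

In this toy case recall is already perfect before selection, so the selector only improves precision and F1, which matches the setting where the abstract reports that a selector matters mostly for weaker generators.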

[NLP-85] Automatic Translation Alignment Pipeline for Multilingual Digital Editions of Literary Works

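[Quick read]: This paper addresses how to align translated texts in different languages effectively when creating a Multilingual Digital Edition (MDE). The key to the solution is an automated pipeline that turns raw original and translated texts into web-based, side-by-side representations, together with new alignment evaluation metrics and visualization techniques that address the limitations of existing algorithms on literary translations.

Link: https://arxiv.org/abs/2410.13255
Authors: Maria Levchenko
Keywords: Multilingual Digital Edition, Alessandro Manzoni Italian, Russian and Chinese, Digital Edition, Multilingual Digital
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 18 pages, Computational Humanities Research Conference, December 4-6, 2024, Aarhus, Denmark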

Abstract:This paper investigates the application of translation alignment algorithms in the creation of a Multilingual Digital Edition (MDE) of Alessandro Manzoni’s Italian novel “I promessi sposi” (“The Betrothed”), with translations in eight languages (English, Spanish, French, German, Dutch, Polish, Russian and Chinese) from the 19th and 20th centuries. We identify key requirements for the MDE to improve both the reader experience and support for translation studies. Our research highlights the limitations of current state-of-the-art algorithms when applied to the translation of literary texts and outlines an automated pipeline for MDE creation. This pipeline transforms raw texts into web-based, side-by-side representations of original and translated texts with different rendering options. In addition, we propose new metrics for evaluating the alignment of literary translations and suggest visualization techniques for future analysis.

[NLP-86] Disentangling Likes and Dislikes in Personalized Generative Explainable Recommendation

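[Quick read]: This paper addresses the failure of existing explainable recommendation systems to generate explanations that accurately reflect users' post-purchase sentiments. The key to the solution is a set of new datasets and evaluation methods: users' positive and negative opinions are explicitly extracted from their post-purchase reviews, and generated explanations are evaluated on whether they align with the users' sentiments and accurately identify the users' positive and negative opinions on the target items. The paper also finds that feeding the users' predicted ratings for the target items directly into the model makes the generated explanations more sentiment-aware.

Link: https://arxiv.org/abs/2410.13248
Authors: Ryotaro Shimizu, Takashi Wada, Yu Wang, Johannes Kruse, Sean O’Brien, Sai HtaungKham, Linxin Song, Yuya Yoshikawa, Yuki Saito, Fugee Tsung, Masayuki Goto, Julian McAuley
Keywords: text generation problem, explainable recommendation generally, recommendation generally frames, standard text generation, users’ sentiments
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Notes: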

Abstract:Recent research on explainable recommendation generally frames the task as a standard text generation problem, and evaluates models simply based on the textual similarity between the predicted and ground-truth explanations. However, this approach fails to consider one crucial aspect of the systems: whether their outputs accurately reflect the users’ (post-purchase) sentiments, i.e., whether and why they would like and/or dislike the recommended items. To shed light on this issue, we introduce new datasets and evaluation methods that focus on the users’ sentiments. Specifically, we construct the datasets by explicitly extracting users’ positive and negative opinions from their post-purchase reviews using an LLM, and propose to evaluate systems based on whether the generated explanations 1) align well with the users’ sentiments, and 2) accurately identify both positive and negative opinions of users on the target items. We benchmark several recent models on our datasets and demonstrate that achieving strong performance on existing metrics does not ensure that the generated explanations align well with the users’ sentiments. Lastly, we find that existing models can provide more sentiment-aware explanations when the users’ (predicted) ratings for the target items are directly fed into the models as input. We will release our code and datasets upon acceptance.

[NLP-87] Atomic Calibration of LLMs in Long-Form Generations

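[Quick read]: This paper addresses the trustworthiness problem caused by hallucinations of large language models (LLMs) in long-form generation. The key to the solution is atomic calibration, a new approach that evaluates factuality calibration at a fine-grained level by breaking long responses into atomic claims. The paper classifies confidence elicitation methods into discriminative and generative types and shows that combining them improves calibration. Experiments show that atomic calibration is well suited to long-form generation, also improves macro (response-level) calibration results, and reveals patterns in how LLM confidence changes during generation.

Link: https://arxiv.org/abs/2410.13246
Authors: Caiqi Zhang, Ruihan Yang, Zhisong Zhang, Xinting Huang, Sen Yang, Dong Yu, Nigel Collier
Keywords: Large language models, posing significant challenges, Large language, suffer from hallucinations, posing significant
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: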

Abstract:Large language models (LLMs) often suffer from hallucinations, posing significant challenges for real-world applications. Confidence calibration, which estimates the underlying uncertainty of model predictions, is essential to enhance the LLMs’ trustworthiness. Existing research on LLM calibration has primarily focused on short-form tasks, providing a single confidence score at the response level (macro calibration). However, this approach is insufficient for long-form generations, where responses often contain more complex statements and may include both accurate and inaccurate information. Therefore, we introduce atomic calibration, a novel approach that evaluates factuality calibration at a fine-grained level by breaking down long responses into atomic claims. We classify confidence elicitation methods into discriminative and generative types and demonstrate that their combination can enhance calibration. Our extensive experiments on various LLMs and datasets show that atomic calibration is well-suited for long-form generation and can also improve macro calibration results. Additionally, atomic calibration reveals insightful patterns in LLM confidence throughout the generation process.
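One common way to quantify calibration over a set of atomic claims is the expected calibration error (ECE). The sketch below assumes each claim comes with a confidence in [0, 1] and a binary factuality label. The abstract does not state which calibration metric the paper reports, so ECE, the function name, and the toy numbers here are illustrative assumptions.

```python
def expected_calibration_error(confidences, labels, n_bins=10):
    """ECE over atomic claims: weighted gap between confidence and accuracy.

    confidences: per-claim confidence in [0, 1]
    labels: 1 if the claim is factually correct, else 0
    """
    total = len(confidences)
    if total == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, label in zip(confidences, labels):
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, label))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / total * abs(avg_conf - accuracy)
    return ece


# Hypothetical long answer split into four atomic claims.
claim_confidences = [0.95, 0.90, 0.60, 0.20]
claim_is_correct = [1, 1, 0, 0]
print(expected_calibration_error(claim_confidences, claim_is_correct))
```

Scoring claims individually lets a single response contribute both well-calibrated and poorly calibrated items, which a single response-level confidence score cannot express.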

[NLP-88] Large Language Models are Easily Confused: A Quantitative Metric, Security Implications and Typological Analysis

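[Quick read]: This paper addresses language confusion in text generation by large language models (LLMs), where the generated text is neither in the desired language nor in a contextually appropriate language. The key to the solution is a new metric, Language Confusion Entropy, which directly measures and quantifies language confusion based on language distributions informed by linguistic typology and lexical variation. Comprehensive comparisons with the Language Confusion Benchmark confirm the metric's effectiveness and reveal patterns of language confusion across LLMs. The paper further links language confusion to LLM security, finds patterns in multilingual embedding inversion attacks, and shows the value of linguistic typology for theoretically grounded interpretation and for using language similarities as a prior for LLM alignment and security.

Link: https://arxiv.org/abs/2410.13237
Authors: Yiyi Chen, Qiongxiu Li, Russa Biswas, Johannes Bjerva
Keywords: Large Language Models, Language Confusion, Language Models, Large Language, Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Notes: 17 pages, 6 figures, 14 tables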

Abstract:Language Confusion is a phenomenon where Large Language Models (LLMs) generate text that is neither in the desired language, nor in a contextually appropriate language. This phenomenon presents a critical challenge in text generation by LLMs, often appearing as erratic and unpredictable behavior. We hypothesize that there are linguistic regularities to this inherent vulnerability in LLMs and shed light on patterns of language confusion across LLMs. We introduce a novel metric, Language Confusion Entropy, designed to directly measure and quantify this confusion, based on language distributions informed by linguistic typology and lexical variation. Comprehensive comparisons with the Language Confusion Benchmark (Marchisio et al., 2024) confirm the effectiveness of our metric, revealing patterns of language confusion across LLMs. We further link language confusion to LLM security, and find patterns in the case of multilingual embedding inversion attacks. Our analysis demonstrates that linguistic typology offers theoretically grounded interpretation, and valuable insights into leveraging language similarities as a prior for LLM alignment and security.
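As intuition for an entropy-based confusion measure, the sketch below computes plain Shannon entropy over the languages detected in a model's outputs. This is only a simplified stand-in: the paper's Language Confusion Entropy is built on distributions informed by linguistic typology and lexical variation, and its exact definition is not given in the abstract. The function name and the language labels are hypothetical.

```python
import math
from collections import Counter


def language_entropy(language_labels):
    """Shannon entropy (in bits) of the detected-language distribution.

    language_labels: one detected language code per generated line or token.
    0 means every unit is in a single language; higher means more mixing.
    """
    counts = Counter(language_labels)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical outputs for a prompt that asked for German.
print(language_entropy(["de", "de", "de", "de"]))  # single language
print(language_entropy(["de", "en", "de", "en"]))  # evenly mixed
```

A typology-aware variant would presumably weight confusions between related languages differently from confusions between unrelated ones, which plain entropy cannot do.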

[NLP-89] SPIN: Self-Supervised Prompt INjection

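[Quick read]: This paper addresses the safety and reliability of large language models (LLMs) under adversarial and jailbreak attacks. The key to the solution is Self-supervised Prompt INjection (SPIN), a mechanism that detects and reverses such attacks at inference time and reduces the attack success rate by up to 87.9% without hurting performance on benign user requests. The method is also compatible with existing alignment techniques, adds an extra layer of safety, and remains resilient against adaptive attackers who are aware of the defense.

Link: https://arxiv.org/abs/2410.13236
Authors: Leon Zhou, Junfeng Yang, Chengzhi Mao
Keywords: Large Language Models, Large Language, Language Models, important applications, major concerns
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: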

Abstract:Large Language Models (LLMs) are increasingly used in a variety of important applications, yet their safety and reliability remain as major concerns. Various adversarial and jailbreak attacks have been proposed to bypass the safety alignment and cause the model to produce harmful responses. We introduce Self-supervised Prompt INjection (SPIN) which can detect and reverse these various attacks on LLMs. As our self-supervised prompt defense is done at inference-time, it is also compatible with existing alignment and adds an additional layer of safety for defense. Our benchmarks demonstrate that our system can reduce the attack success rate by up to 87.9%, while maintaining the performance on benign user requests. In addition, we discuss the situation of an adaptive attacker and show that our method is still resilient against attackers who are aware of our defense.

[NLP-90] Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation

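[Quick read]: This paper addresses the poor performance of current LLM-based web agents on long-horizon tasks, in particular irreversible mistakes such as repeatedly buying a non-refundable flight ticket. The key to the solution is a "world model" that improves decision-making by simulating the outcomes of actions. Concretely, the paper proposes a World-model-augmented (WMA) web agent and a transition-focused observation abstraction, in which the world model predicts important state changes as natural language descriptions, improving the agent's policy selection without additional training. Experiments show that the world model improves agent performance on WebArena and Mind2Web and that the agent is more cost- and time-efficient than tree-search-based agents.

Link: https://arxiv.org/abs/2410.13232
Authors: Hyungjoo Chae, Namyoung Kim, Kai Tzu-iunn Ong, Minju Gwak, Gwanwoo Song, Jihoon Kim, Sunghwan Kim, Dongha Lee, Jinyoung Yeo
Keywords: Large language models, building autonomous agents, Large language, recently gained, gained much attention
Categories: Computation and Language (cs.CL)
Notes: Work in progress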

Abstract:Large language models (LLMs) have recently gained much attention in building autonomous agents. However, the performance of current LLM-based web agents in long-horizon tasks is far from optimal, often yielding errors such as repeatedly buying a non-refundable flight ticket. By contrast, humans can avoid such an irreversible mistake, as we have an awareness of the potential outcomes (e.g., losing money) of our actions, also known as the “world model”. Motivated by this, our study first starts with preliminary analyses, confirming the absence of world models in current LLMs (e.g., GPT-4o, Claude-3.5-Sonnet, etc.). Then, we present a World-model-augmented (WMA) web agent, which simulates the outcomes of its actions for better decision-making. To overcome the challenges in training LLMs as world models predicting next observations, such as repeated elements across observations and long HTML inputs, we propose a transition-focused observation abstraction, where the prediction objectives are free-form natural language descriptions exclusively highlighting important state differences between time steps. Experiments on WebArena and Mind2Web show that our world models improve agents’ policy selection without training and demonstrate our agents’ cost- and time-efficiency compared to recent tree-search-based agents.
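The idea of simulating outcomes before acting can be sketched as a simple selection loop. In the sketch, `world_model` maps an observation and a candidate action to a natural-language description of the predicted state change, and `value_fn` scores that description. In the paper both roles are played by LLMs; the toy functions, action names, and scores below are hypothetical stand-ins so that the example runs on its own.

```python
def choose_action(observation, candidate_actions, world_model, value_fn):
    """Pick the candidate action whose simulated outcome scores highest.

    Raises ValueError if candidate_actions is empty.
    """
    scored = []
    for action in candidate_actions:
        # Predicted state change, described in free-form natural language.
        predicted_change = world_model(observation, action)
        scored.append((value_fn(predicted_change), action))
    best_score, best_action = max(scored, key=lambda pair: pair[0])
    return best_action


def toy_world_model(observation, action):
    """Stand-in for an LLM world model (hypothetical outcomes)."""
    outcomes = {
        "click_buy_again": "A second non-refundable ticket is purchased.",
        "open_order_history": "The order history page shows the existing ticket.",
    }
    return outcomes[action]


def toy_value_fn(description):
    """Stand-in for an LLM value function: penalize irreversible purchases."""
    return -1.0 if "non-refundable" in description else 1.0


print(choose_action(
    "checkout page",
    ["click_buy_again", "open_order_history"],
    toy_world_model,
    toy_value_fn,
))
```

Because only one predicted transition per candidate is evaluated, this kind of loop needs far fewer environment interactions than a tree search, which is consistent with the cost and time advantage the abstract reports.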

[NLP-91] Proof Flow: Preliminary Study on Generative Flow Network Language Model Tuning for Formal Reasoning

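[Quick read]: This paper addresses insufficient reasoning ability on complex problems, especially the difficulty open models have with sufficiently complex problems. The key to the solution is to use Generative Flow Networks (GFlowNets) as a fine-tuning method for large language models (LLMs) to unlock advanced reasoning capabilities. Specifically, in formal reasoning, and in the Neural Theorem Proving (NTP) setting in particular, the paper shows the potential of GFlowNets to improve model performance and explore the state space, avoiding the over-exploitation of high-reward actions seen in classical reinforcement learning and thereby improving generalization and helping models maintain diverse hypotheses.

Link: https://arxiv.org/abs/2410.13224
Authors: Matthew Ho, Vincent Zhu, Xiaoyin Chen, Moksh Jain, Nikolay Malkin, Edwin Zhang
Keywords: Generative Flow Networks, fundamental substrate, substrate for solving, Neural Theorem Proving, complex problems
Categories: Computation and Language (cs.CL)
Notes: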

Abstract:Reasoning is a fundamental substrate for solving novel and complex problems. Deliberate efforts in learning and developing frameworks around System 2 reasoning have made great strides, yet problems of sufficient complexity remain largely out of reach for open models. To address this gap, we examine the potential of Generative Flow Networks as a fine-tuning method for LLMs to unlock advanced reasoning capabilities. In this paper, we present a proof of concept in the domain of formal reasoning, specifically in the Neural Theorem Proving (NTP) setting, where proofs specified in a formal language such as Lean can be deterministically and objectively verified. Unlike classical reward-maximization reinforcement learning, which frequently over-exploits high-reward actions and fails to effectively explore the state space, GFlowNets have emerged as a promising approach for sampling compositional objects, improving generalization, and enabling models to maintain diverse hypotheses. Our early results demonstrate GFlowNet fine-tuning’s potential for enhancing model performance in a search setting, which is especially relevant given the paradigm shift towards inference time compute scaling and “thinking slowly.”

[NLP-92] CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy

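[Quick read]: This paper addresses the significant gap between patient needs and available mental health support, and specifically explores how large language models (LLMs) can assist professional psychotherapy. The key to the solution is a new benchmark, CBT-BENCH, for systematically evaluating cognitive behavioral therapy (CBT) assistance. CBT-BENCH includes three levels of tasks: basic CBT knowledge acquisition, cognitive model understanding, and therapeutic response generation. These cover key aspects of CBT and define a hierarchy of capability requirements, from basic knowledge recitation to real therapeutic conversations. Experimental results show that although LLMs perform well at reciting CBT knowledge, they fall short in complex scenarios that require deep analysis of patients' cognitive structures and effective responses, which points to directions for future work.

Link: https://arxiv.org/abs/2410.13218
Authors: Mian Zhang, Xianjun Yang, Xinlu Zhang, Travis Labrum, Jamie C. Chiu, Shaun M. Eack, Fei Fang, William Yang Wang, Zhiyu Zoey Chen
Keywords: health support today, mental health support, Large Language Models, support today, significant gap
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Notes: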

Abstract:There is a significant gap between patient needs and available mental health support today. In this paper, we aim to thoroughly examine the potential of using Large Language Models (LLMs) to assist professional psychotherapy. To this end, we propose a new benchmark, CBT-BENCH, for the systematic evaluation of cognitive behavioral therapy (CBT) assistance. We include three levels of tasks in CBT-BENCH: I: Basic CBT knowledge acquisition, with the task of multiple-choice questions; II: Cognitive model understanding, with the tasks of cognitive distortion classification, primary core belief classification, and fine-grained core belief classification; III: Therapeutic response generation, with the task of generating responses to patient speech in CBT therapy sessions. These tasks encompass key aspects of CBT that could potentially be enhanced through AI assistance, while also outlining a hierarchy of capability requirements, ranging from basic knowledge recitation to engaging in real therapeutic conversations. We evaluated representative LLMs on our benchmark. Experimental results indicate that while LLMs perform well in reciting CBT knowledge, they fall short in complex real-world scenarios requiring deep analysis of patients’ cognitive structures and generating effective responses, suggesting potential future work.

[NLP-93] Anchored Alignment for Self-Explanations Enhancement

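[Quick read]: This paper addresses how to improve the self-explanation ability of large language models (LLMs) in the absence of annotated rationale explanations. The key to the solution is a new alignment methodology with three main components: explanation quality assessment, self-instruction dataset generation, and model alignment. The paper also proposes a new technique, Alignment with Anchor Preference Pairs, which categorizes model outputs as consistently correct, consistently incorrect, or variable and applies a tailored strategy to each category, improving the effectiveness of Direct Preference Optimization (DPO). Experiments show that the approach significantly improves explanation quality while maintaining accuracy.

Link: https://arxiv.org/abs/2410.13216
Authors: Luis Felipe Villa-Arenas, Ata Nizamoglu, Qianli Wang, Sebastian Möller, Vera Schmitt
Keywords: annotated rationale explanations, large language models, articulate their reasoning, ability of large, large language
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: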

Abstract:In this work, we introduce a methodology for alignment designed to enhance the ability of large language models (LLMs) to articulate their reasoning (self-explanation) even in the absence of annotated rationale explanations. Our alignment methodology comprises three key components: explanation quality assessment, self-instruction dataset generation, and model alignment. Additionally, we present a novel technique called Alignment with Anchor Preference Pairs, which improves the selection of preference pairs by categorizing model outputs into three groups: consistently correct, consistently incorrect, and variable. By applying tailored strategies to each category, we enhance the effectiveness of Direct Preference Optimization (DPO). Our experimental results demonstrate that this approach significantly improves explanation quality while maintaining accuracy compared to other fine-tuning strategies.
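The three-way grouping of model outputs can be sketched as follows. The sketch assumes that, for each prompt, several sampled outputs have already been graded as correct or incorrect; the function name, the category strings, and the prompt ids are hypothetical, and the per-category strategies for building preference pairs are not shown because the abstract does not describe them.

```python
def categorize_prompt(correct_flags):
    """Classify a prompt by the correctness of its sampled outputs.

    correct_flags: one boolean per sampled output for the same prompt.
    """
    if not correct_flags:
        raise ValueError("need at least one sampled output")
    if all(correct_flags):
        return "consistently_correct"
    if not any(correct_flags):
        return "consistently_incorrect"
    return "variable"


# Hypothetical grading results for three prompts, four samples each.
graded = {
    "prompt_a": [True, True, True, True],
    "prompt_b": [False, False, False, False],
    "prompt_c": [True, False, True, False],
}
for prompt_id, flags in graded.items():
    print(prompt_id, categorize_prompt(flags))
```

Only the "variable" group contains both a correct and an incorrect output for the same prompt, which is one plausible reason the three groups call for different pairing strategies.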

[NLP-94] FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs

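[Quick read]: This paper addresses the lack of diversity and recency in existing evaluations of hallucinations in LLM-generated summaries and in evaluations of hallucination detection models. The key to the solution is FaithBench, a summarization benchmark of challenging hallucinations produced by 10 modern LLMs from 8 different families, with ground-truth annotations by human experts. "Challenging" hallucinations are summaries on which state-of-the-art hallucination detection models (including GPT-4o-as-a-judge) disagree. The results show that GPT-4o and GPT-3.5-Turbo produce the fewest hallucinations, yet even the best hallucination detection models reach only about 50% accuracy on FaithBench, leaving considerable room for improvement.

Link: https://arxiv.org/abs/2410.13210
Authors: Forrest Sheng Bao, Miaoran Li, Renyi Qu, Ge Luo, Erana Wan, Yujia Tang, Weisi Fan, Manveer Singh Tamber, Suleman Kazi, Vivek Sourabh, Mike Qi, Ruixuan Tu, Chenyu Xu, Matthew Gonzales, Ofer Mendelevitch, Amin Ahmad
Keywords: common tasks performed, Retrieval-Augmented Generation, large language models, hallucination detection models, common tasks
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: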

Abstract: Summarization is one of the most common tasks performed by large language models (LLMs), especially in applications like Retrieval-Augmented Generation (RAG). However, existing evaluations of hallucinations in LLM-generated summaries, and evaluations of hallucination detection models, both suffer from a lack of diversity and recency in the LLMs and LLM families considered. This paper introduces FaithBench, a summarization hallucination benchmark comprising challenging hallucinations made by 10 modern LLMs from 8 different families, with ground truth annotations by human experts. “Challenging” here means summaries on which popular, state-of-the-art hallucination detection models, including GPT-4o-as-a-judge, disagreed. Our results show GPT-4o and GPT-3.5-Turbo produce the fewest hallucinations. However, even the best hallucination detection models have accuracies near 50% on FaithBench, indicating lots of room for future improvement. The repo is this https URL
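The selection criterion for "challenging" summaries, namely that detectors disagree, can be sketched in a few lines. The detector names, verdict labels, and sample ids below are hypothetical; the sketch only illustrates the disagreement filter and is not FaithBench's actual data pipeline.

```python
def is_challenging(detector_verdicts):
    """True when the hallucination detectors do not all give the same verdict.

    detector_verdicts: mapping from detector name to its verdict label.
    """
    return len(set(detector_verdicts.values())) > 1


# Hypothetical verdicts from three detectors on two summaries.
samples = {
    "summary_1": {"detector_a": "consistent", "detector_b": "consistent", "detector_c": "consistent"},
    "summary_2": {"detector_a": "consistent", "detector_b": "hallucinated", "detector_c": "consistent"},
}
challenging = [sid for sid, verdicts in samples.items() if is_challenging(verdicts)]
print(challenging)
```

Filtering on disagreement concentrates human annotation effort on the cases where automatic detectors are least reliable, which helps explain why detector accuracy on the resulting benchmark sits near 50%.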

[NLP-95] BQA: Body Language Question Answering Dataset for Video Large Language Models

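[Quick read]: This paper addresses the challenge that current Video Large Language Models (VideoLLMs) face in accurately interpreting nonverbal cues such as facial expressions, eye contact, and body language. The key to the solution is BQA, a dataset of video clips annotated with 26 emotion labels, used to validate whether models can correctly interpret these emotions. Evaluation on BQA reveals how difficult body language understanding is, and the paper analyzes the biased wrong answers that models give depending on the age group and ethnicity of the individuals in the videos.

Link: https://arxiv.org/abs/2410.13206
Authors: Shintaro Ozaki, Kazuki Hayashi, Miyu Oba, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe
Keywords: body language, human communication relies, eye contact, language, facial expressions
Categories: Computation and Language (cs.CL)
Notes: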

Abstract: A large part of human communication relies on nonverbal cues such as facial expressions, eye contact, and body language. Unlike language or sign language, such nonverbal communication lacks formal rules, requiring complex reasoning based on commonsense understanding. Enabling current Video Large Language Models (VideoLLMs) to accurately interpret body language is a crucial challenge, as human unconscious actions can easily cause the model to misinterpret their intent. To address this, we propose BQA, a body language question answering dataset, to validate whether a model can correctly interpret emotions from short video clips of body language annotated with 26 emotion labels. We evaluated various VideoLLMs on BQA and revealed that understanding body language is challenging, and our analyses of the wrong answers by VideoLLMs show that certain VideoLLMs made significantly biased answers depending on the age group and ethnicity of the individuals in the video. The dataset is available.
摘要：人类交流的很大一部分依赖于非语言线索，如面部表情、眼神接触和肢体语言。与语言或手语不同，这种非语言交流缺乏正式规则，需要基于常识理解的复杂推理。使当前的视频大语言模型 (VideoLLMs) 能够准确解读肢体语言是一个关键挑战，因为人类的潜意识动作很容易导致模型误解其意图。为了解决这一问题，我们提出了一种数据集，BQA，一种肢体语言问答数据集，用于验证模型是否能从包含 26 种情感标签的肢体语言短视频片段中正确解读情感。我们在 BQA 上评估了多种 VideoLLMs，并揭示了理解肢体语言的挑战性，我们对 VideoLLMs 错误答案的分析显示，某些 VideoLLMs 根据视频中个体的年龄组和种族做出了显著偏见的回答。该数据集已公开可用。

[NLP-96] Measuring Free-Form Decision-Making Inconsistency of Language Models in Military Crisis Simulations

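【TL;DR】: This paper examines the inconsistency that can arise when language models (LMs) are relied on for automated decision-making in high-stakes settings such as military crises. The key is a BERTScore-based metric that quantifies inconsistency across free-form responses, which removes the dependence on pre-defined actions in earlier work. All tested LMs show levels of inconsistency that indicate semantic differences, even when the wargame setting is adjusted, the conflict countries are anonymized, or the sampling temperature is changed. The study also finds that, at temperature T = 0, inconsistency caused by semantically equivalent prompt variations can exceed the inconsistency caused by temperature sampling. Given the high stakes of military deployment, the paper recommends further consideration before LMs are used to inform such decisions.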

链接: https://arxiv.org/abs/2410.13204
作者: Aryan Shrivastava,Jessica Hullman,Max Lamparth
关键词-EN: actively testing LMs, multiple countries actively, countries actively testing, increasing interest, actively testing
类目: Computation and Language (cs.CL)
备注:

Abstract:There is an increasing interest in using language models (LMs) for automated decision-making, with multiple countries actively testing LMs to aid in military crisis decision-making. To scrutinize relying on LM decision-making in high-stakes settings, we examine the inconsistency of responses in a crisis simulation (“wargame”), similar to reported tests conducted by the US military. Prior work illustrated escalatory tendencies and varying levels of aggression among LMs but was constrained to simulations with pre-defined actions. This was due to the challenges associated with quantitatively measuring semantic differences and evaluating natural language decision-making without relying on pre-defined actions. In this work, we query LMs for free-form responses and use a metric based on BERTScore to measure response inconsistency quantitatively. Leveraging the benefits of BERTScore, we show that the inconsistency metric is robust to linguistic variations that preserve semantic meaning in a question-answering setting across text lengths. We show that all five tested LMs exhibit levels of inconsistency that indicate semantic differences, even when adjusting the wargame setting, anonymizing involved conflict countries, or adjusting the sampling temperature parameter T. Further qualitative evaluation shows that models recommend courses of action that share few to no similarities. We also study the impact of different prompt sensitivity variations on inconsistency at temperature T = 0. We find that inconsistency due to semantically equivalent prompt variations can exceed response inconsistency from temperature sampling for most studied models across different levels of ablations. Given the high-stakes nature of military deployment, we recommend further consideration be taken before using LMs to inform military decisions or other cases of high-stakes decision-making.
摘要：越来越多的兴趣集中在使用语言模型 (LMs) 进行自动化决策上，多个国家正在积极测试 LMs 以辅助军事危机决策。为了审视在高风险情境下依赖 LM 决策的可靠性，我们研究了在危机模拟（“战争游戏”）中响应的不一致性，这与美国军方报告的测试类似。先前的工作展示了 LMs 在模拟中的升级倾向和不同程度的攻击性，但这些研究局限于预定义行动的模拟。这是由于量化测量语义差异和在不依赖预定义行动的情况下评估自然语言决策的挑战。在本研究中，我们向 LMs 查询自由形式的响应，并使用基于 BERTScore 的指标来量化响应的不一致性。利用 BERTScore 的优势，我们展示了不一致性指标在不同文本长度的问题回答设置中对保留语义意义的语言变异具有鲁棒性。我们发现，所有五个测试的 LMs 都表现出一定程度的不一致性，表明语义差异，即使在调整战争游戏设置、匿名化涉及的冲突国家或调整采样温度参数 T 的情况下也是如此。进一步的定性评估显示，模型推荐的行动方案几乎没有相似之处。我们还研究了在温度 T = 0 时不同提示敏感性变化对不一致性的影响。我们发现，由于语义等价的提示变化引起的不一致性可以超过大多数研究模型在不同层次的消融中的温度采样响应不一致性。鉴于军事部署的高风险性质，我们建议在使用 LMs 辅助军事决策或其他高风险决策之前，应进一步考虑其可靠性。
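
The BERTScore-based inconsistency measurement described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes each response is already a list of token embedding vectors (a real setup would obtain these from a BERT-style encoder), it omits IDF weighting and baseline rescaling, and it assumes that inconsistency is defined as one minus the pairwise F1 score, averaged over all pairs of sampled responses. The function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def bertscore_f1(cand, ref):
    """Greedy-matching F1 over token embeddings (BERTScore formula, no IDF)."""
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref) for c in cand) / len(cand)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand) for r in ref) / len(ref)
    if precision + recall <= 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def inconsistency(responses):
    """Assumed definition: mean of (1 - F1) over all unordered response pairs."""
    pairs = list(combinations(responses, 2))
    return sum(1.0 - bertscore_f1(a, b) for a, b in pairs) / len(pairs)


# Toy 2-D "embeddings": three identical responses are perfectly consistent.
same = [[1.0, 0.0], [0.0, 1.0]]
print(inconsistency([same, same, same]))
```

Under this definition a score near 0 means the sampled responses are semantically close, and a score near 1 means they share almost nothing.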

[NLP-97] Meta-DiffuB: A Contextualized Sequence-to-Sequence Text Diffusion Model with Meta-Exploration

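【TL;DR】: This paper addresses the lack of context awareness in the noise scheduling of existing sequence-to-sequence text diffusion models (S2S-Diffusion), which limits generation quality. The key is the Meta-DiffuB framework, which uses Meta-Exploration to train a dedicated scheduler model that produces a contextualized noise schedule for each sentence. The scheduler works together with an S2S-Diffusion exploiter model and, according to the paper, outperforms previous S2S-Diffusion models and fine-tuned pre-trained language models (PLMs) on four Seq2Seq benchmark datasets. At inference time the scheduler can also serve as a “plug-and-play” module that enhances DiffuSeq without fine-tuning.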

链接: https://arxiv.org/abs/2410.13201
作者: Yun-Yen Chuang,Hung-Min Hsu,Kevin Lin,Chen-Sheng Gu,Ling Zhen Li,Ray-I Chang,Hung-yi Lee
关键词-EN: achieved significant success, generative modeling paradigm, generating images, generative modeling, achieved significant
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

Abstract:The diffusion model, a new generative modeling paradigm, has achieved significant success in generating images, audio, video, and text. It has been adapted for sequence-to-sequence text generation (Seq2Seq) through DiffuSeq, termed S2S Diffusion. Existing S2S-Diffusion models predominantly rely on fixed or hand-crafted rules to schedule noise during the diffusion and denoising processes. However, these models are limited by non-contextualized noise, which fails to fully consider the characteristics of Seq2Seq tasks. In this paper, we propose the Meta-DiffuB framework - a novel scheduler-exploiter S2S-Diffusion paradigm designed to overcome the limitations of existing S2S-Diffusion models. We employ Meta-Exploration to train an additional scheduler model dedicated to scheduling contextualized noise for each sentence. Our exploiter model, an S2S-Diffusion model, leverages the noise scheduled by our scheduler model for updating and generation. Meta-DiffuB achieves state-of-the-art performance compared to previous S2S-Diffusion models and fine-tuned pre-trained language models (PLMs) across four Seq2Seq benchmark datasets. We further investigate and visualize the impact of Meta-DiffuB’s noise scheduling on the generation of sentences with varying difficulties. Additionally, our scheduler model can function as a “plug-and-play” model to enhance DiffuSeq without the need for fine-tuning during the inference stage.
摘要：扩散模型，一种新的生成式建模范式，在生成图像、音频、视频和文本方面取得了显著成功。它通过 DiffuSeq 被应用于序列到序列文本生成 (Seq2Seq)，称为 S2S Diffusion。现有的 S2S-Diffusion 模型主要依赖于固定或手工制定的规则来调度扩散和去噪过程中的噪声。然而，这些模型受限于非上下文化的噪声，未能充分考虑 Seq2Seq 任务的特性。本文中，我们提出了 Meta-DiffuB 框架——一种新颖的调度器-利用器 S2S-Diffusion 范式，旨在克服现有 S2S-Diffusion 模型的局限性。我们采用元探索 (Meta-Exploration) 来训练一个额外的调度器模型，专门用于为每个句子调度上下文化的噪声。我们的利用器模型，即 S2S-Diffusion 模型，利用调度器模型调度的噪声进行更新和生成。Meta-DiffuB 在与之前 S2S-Diffusion 模型和微调后的预训练语言模型 (PLMs) 在四个 Seq2Seq 基准数据集上的比较中，达到了最先进的性能。我们进一步研究并可视化了 Meta-DiffuB 的噪声调度对生成不同难度句子影响的差异。此外，我们的调度器模型可以作为一个“即插即用”模型，在推理阶段无需微调即可增强 DiffuSeq。

[NLP-98] The Geometry of Numerical Reasoning: Language Models Compare Numeric Properties in Linear Subspaces

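【TL;DR】: This paper asks whether large language models (LLMs) use numerical attributes encoded in low-dimensional subspaces of the embedding space when answering logical comparison questions. The key is to identify these subspaces with partial least squares (PLS) regression and to show that intervening on hidden states within them changes the LLM's comparison outcomes, which supports the claim that LLMs use linearly encoded numerical information for numerical reasoning.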

链接: https://arxiv.org/abs/2410.13194
作者: Ahmed Oumar El-Shangiti,Tatsuya Hiraoka,Hilal AlQuabeh,Benjamin Heinzerling,Kentaro Inui
关键词-EN: large language models, born before Messi, logical comparison questions, Cristiano born, answering logical comparison
类目: Computation and Language (cs.CL)
备注:

Abstract:This paper investigates whether large language models (LLMs) utilize numerical attributes encoded in a low-dimensional subspace of the embedding space when answering logical comparison questions (e.g., Was Cristiano born before Messi?). We first identified these subspaces using partial least squares regression, which effectively encodes the numerical attributes associated with the entities in comparison prompts. Further, we demonstrate causality by intervening in these subspaces to manipulate hidden states, thereby altering the LLM’s comparison outcomes. Experimental results show that our findings hold for different numerical attributes, indicating that LLMs utilize the linearly encoded information for numerical reasoning.
摘要：本文探讨了大语言模型 (LLMs) 在回答逻辑比较问题（例如，Cristiano 是否比 Messi 早出生？）时，是否利用了嵌入空间中低维子空间编码的数值属性。我们首先使用偏最小二乘回归 (Partial Least Squares Regression) 识别这些子空间，该方法有效地编码了比较提示中实体关联的数值属性。进一步地，我们通过干预这些子空间来操纵隐藏状态，从而改变 LLM 的比较结果，以此展示因果关系。实验结果表明，我们的发现适用于不同的数值属性，表明 LLMs 利用线性编码的信息进行数值推理。
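
The two steps in the abstract, finding a linear direction with partial least squares and then intervening along it, can be sketched on toy data. This is a minimal single-component PLS written in plain Python, not the authors' code: the "hidden states" are made-up 3-D vectors, the attribute is a made-up birth year, and `fit_pls1`, `predict`, and `intervene` are hypothetical names. A real analysis would work on actual LLM activations, typically with a library implementation of PLS.

```python
import math


def fit_pls1(X, y):
    """One-component PLS: the direction in X that covaries most with y."""
    n, d = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(d)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(d)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector w is proportional to X^T y, normalized to unit length.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(d)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    # Scores t = Xw, then regress y on the scores.
    t = [sum(a * b for a, b in zip(row, w)) for row in Xc]
    slope = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    return {"w": w, "x_mean": x_mean, "y_mean": y_mean, "slope": slope}


def predict(model, h):
    """Attribute value decoded from hidden state h along the PLS direction."""
    score = sum((hj - mj) * wj
                for hj, mj, wj in zip(h, model["x_mean"], model["w"]))
    return model["y_mean"] + model["slope"] * score


def intervene(model, h, target):
    """Shift h along w only, so that the decoded attribute equals target."""
    alpha = (target - predict(model, h)) / model["slope"]
    return [hj + alpha * wj for hj, wj in zip(h, model["w"])]


# Toy data: four "hidden states" and the birth year each one is assumed to encode.
X = [[0.0, 1.0, 0.5], [1.0, 0.2, 0.1], [2.0, 0.9, 0.3], [3.0, 0.1, 0.7]]
y = [1980.0, 1985.0, 1990.0, 1995.0]
model = fit_pls1(X, y)
edited = intervene(model, X[0], 1992.0)
print(predict(model, X[0]), predict(model, edited))
```

Because the edit moves the state only along the unit direction `w`, everything orthogonal to the identified subspace is left unchanged, which is what makes such an intervention a test of whether that subspace is causally used.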

[NLP-99] Evaluating Self-Generated Documents for Enhancing Retrieval-Augmented Generation with Large Language Models

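【TL;DR】: This paper studies how self-generated documents (SGDs) can be used more effectively to improve large language model (LLM) performance in retrieval-augmented generation (RAG) systems. The key is a taxonomy of SGDs grounded in Systemic Functional Linguistics (SFL), combined with experiments on knowledge-intensive tasks that compare the influence of each SGD category. The results indicate which kinds of SGDs contribute most to LLM performance and give practical guidance for fusion methods based on SGD categories in knowledge-driven question answering.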

链接: https://arxiv.org/abs/2410.13192
作者: Jiatao Li,Xinyu Hu,Xunjian Yin,Xiaojun Wan
关键词-EN: retrieval-augmented generation systems, alongside retrieved content, large language model, generation systems, self-generated documents
类目: Computation and Language (cs.CL)
备注: Under Review

Abstract:In retrieval-augmented generation systems, the integration of self-generated documents (SGDs) alongside retrieved content has emerged as a promising strategy for enhancing the performance of large language models. However, previous research primarily focuses on optimizing the use of SGDs, with the inherent properties of SGDs remaining underexplored. Therefore, this paper conducts a comprehensive analysis of different types of SGDs and experiments on various knowledge-intensive tasks. We develop a taxonomy of SGDs grounded in Systemic Functional Linguistics (SFL) to compare the influence of different SGD categories. Our findings offer key insights into what kinds of SGDs most effectively contribute to improving LLMs’ performance. The results and further fusion methods based on SGD categories also provide practical guidelines for taking better advantage of SGDs to achieve significant advancements in knowledge-driven QA tasks with RAG.
摘要：在检索增强生成系统中，将自生成文档 (Self-Generated Documents, SGDs) 与检索内容相结合已成为提升大语言模型性能的一种有前景的策略。然而，以往的研究主要集中在优化 SGDs 的使用上，而 SGDs 的固有特性却未得到充分探索。因此，本文对不同类型的 SGDs 进行了全面分析，并在多种知识密集型任务上进行了实验。我们基于系统功能语言学 (Systemic Functional Linguistics, SFL) 构建了 SGDs 的分类体系，以比较不同 SGD 类别的影响。我们的研究结果揭示了哪些类型的 SGDs 最有效地促进了 LLM 性能的提升。基于 SGD 类别的结果和进一步融合方法也为如何更好地利用 SGDs 在 RAG 支持的知识驱动问答任务中取得显著进展提供了实际指导。

[NLP-100] MCQG-SRefine: Multiple Choice Question Generation and Evaluation with Iterative Self-Critique, Correction, and Comparison Feedback

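【TL;DR】: This paper addresses the unsatisfactory quality and difficulty of multiple-choice questions that current large language models such as GPT-4 generate for professional exams like the United States Medical Licensing Examination (USMLE), which the authors attribute to outdated knowledge, hallucination, and prompt sensitivity. The key is MCQG-SRefine, a framework that combines expert-driven prompt engineering with iterative self-critique and self-correction feedback, and that raises expert satisfaction with both question quality and difficulty. The paper also introduces an LLM-as-Judge automatic metric intended to replace the complex and costly expert evaluation while staying aligned with expert assessments.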

链接: https://arxiv.org/abs/2410.13191
作者: Zonghai Yao,Aditya Parashar,Huixue Zhou,Won Seok Jang,Feiyun Ouyang,Zhichao Yang,Hong Yu
关键词-EN: Medical Licensing Examination, United States Medical, States Medical Licensing, Automatic question generation, dialogue systems
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Equal contribution for the first two authors

Abstract:Automatic question generation (QG) is essential for AI and NLP, particularly in intelligent tutoring, dialogue systems, and fact verification. Generating multiple-choice questions (MCQG) for professional exams, like the United States Medical Licensing Examination (USMLE), is particularly challenging, requiring domain expertise and complex multi-hop reasoning for high-quality questions. However, current large language models (LLMs) like GPT-4 struggle with professional MCQG due to outdated knowledge, hallucination issues, and prompt sensitivity, resulting in unsatisfactory quality and difficulty. To address these challenges, we propose MCQG-SRefine, an LLM self-refine-based (Critique and Correction) framework for converting medical cases into high-quality USMLE-style questions. By integrating expert-driven prompt engineering with iterative self-critique and self-correction feedback, MCQG-SRefine significantly enhances human expert satisfaction regarding both the quality and difficulty of the questions. Furthermore, we introduce an LLM-as-Judge-based automatic metric to replace the complex and costly expert evaluation process, ensuring reliable and expert-aligned assessments.
摘要：自动问答生成 (Automatic question generation, QG) 在人工智能 (AI) 和自然语言处理 (NLP) 领域至关重要，特别是在智能辅导、对话系统和事实验证中。为专业考试生成多项选择题 (Multiple-choice question generation, MCQG)，如美国医学执照考试 (United States Medical Licensing Examination, USMLE)，尤为困难，需要领域专业知识和复杂的多步推理来生成高质量的问题。然而，当前的大语言模型 (Large Language Model, LLM) 如 GPT-4 在处理专业 MCQG 时面临知识过时、幻觉问题和提示敏感性等挑战，导致问题质量和难度不尽如人意。为应对这些挑战，我们提出了 MCQG-SRefine，一种基于 LLM 自我精炼 (Critique and Correction) 框架，用于将医学案例转化为高质量的 USMLE 风格问题。通过结合专家驱动的提示工程与迭代自我批评和自我修正反馈，MCQG-SRefine 显著提升了人类专家对问题质量和难度的满意度。此外，我们引入了一种基于 LLM 作为评判者的自动评估指标，以替代复杂且成本高昂的专家评估过程，确保评估结果的可靠性和与专家一致性。

[NLP-101] aiXcoder-7B: A Lightweight and Effective Large Language Model for Code Completion

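【TL;DR】: This paper addresses the long response times of large language models (LLMs) in code completion, which reduce developer productivity. The key is aiXcoder-7B, a lightweight and effective LLM whose advantages the authors attribute to three factors: (1) multi-objective training, in particular the proposed Structured Fill-In-the-Middle (SFIM) objective, which takes the syntax structure of code into account; (2) diverse data sampling strategies that improve understanding of cross-file context; and (3) extensive high-quality data, with 1.2 trillion unique tokens collected through a rigorous pipeline. With 7 billion parameters, aiXcoder-7B is reported to achieve higher completion accuracy than LLMs of similar size and to surpass some larger ones on several benchmarks.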

链接: https://arxiv.org/abs/2410.13187
作者: Siyuan Jiang,Jia Li,He Zong,Huanyu Liu,Hao Zhu,Shukai Hu,Erlu Li,Jiazheng Ding,Yu Han,Wei Ning,Ge Li
关键词-EN: Large Language Models, Large Language, Language Models, code completion, code
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: aiXcoder-7B is available at this https URL

Abstract:Large Language Models (LLMs) have been widely used in code completion, and researchers are focusing on scaling up LLMs to improve their accuracy. However, larger LLMs will increase the response time of code completion and decrease the developers’ productivity. In this paper, we propose a lightweight and effective LLM for code completion named aiXcoder-7B. Compared to existing LLMs, aiXcoder-7B achieves higher code completion accuracy at a smaller scale (i.e., 7 billion parameters). We attribute the superiority of aiXcoder-7B to three key factors: (1) Multi-objective training. We employ three training objectives, one of which is our proposed Structured Fill-In-the-Middle (SFIM). SFIM considers the syntax structures in code and effectively improves the performance of LLMs for code. (2) Diverse data sampling strategies. They consider inter-file relationships and enhance the capability of LLMs in understanding cross-file contexts. (3) Extensive high-quality data. We establish a rigorous data collection pipeline and consume a total of 1.2 trillion unique tokens for training aiXcoder-7B. This vast volume of data enables aiXcoder-7B to learn a broad distribution of code. We evaluate aiXcoder-7B in five popular code completion benchmarks and a new benchmark collected by this paper. The results show that aiXcoder-7B outperforms the latest six LLMs with similar sizes and even surpasses four larger LLMs (e.g., StarCoder2-15B and CodeLlama-34B), positioning aiXcoder-7B as a lightweight and effective LLM for academia and industry. Finally, we summarize three valuable insights for helping practitioners train the next generations of LLMs for code. aiXcoder-7B has been open-sourced and gained significant attention. As of the submission date, aiXcoder-7B has received 2,193 GitHub Stars.
摘要：大语言模型 (LLMs) 在代码补全中得到了广泛应用，研究人员正致力于扩展 LLMs 以提高其准确性。然而，更大的 LLMs 将增加代码补全的响应时间，从而降低开发者的生产力。本文提出了一种轻量且高效的代码补全 LLM，名为 aiXcoder-7B。与现有的 LLMs 相比，aiXcoder-7B 在规模更小（即 70 亿参数）的情况下实现了更高的代码补全准确性。我们将 aiXcoder-7B 的优越性归因于三个关键因素：(1) 多目标训练。我们采用了三种训练目标，其中之一是我们提出的结构化填空 (Structured Fill-In-the-Middle, SFIM)。SFIM 考虑了代码的语法结构，有效提升了 LLMs 在代码方面的性能。(2) 多样化的数据采样策略。这些策略考虑了文件间的关系，增强了 LLMs 在理解跨文件上下文方面的能力。(3) 广泛的高质量数据。我们建立了一个严格的数据收集流程，并为训练 aiXcoder-7B 消耗了总计 1.2 万亿个独特的 Token。这一庞大的数据量使得 aiXcoder-7B 能够学习广泛的代码分布。我们在五个流行的代码补全基准和一个由本文收集的新基准上评估了 aiXcoder-7B。结果显示，aiXcoder-7B 在相同规模的最新六个 LLMs 中表现最佳，甚至超越了四个规模更大的 LLMs（例如 StarCoder2-15B 和 CodeLlama-34B），从而将 aiXcoder-7B 定位为学术界和工业界轻量且高效的 LLM。最后，我们总结了三条有价值的见解，以帮助从业者训练下一代代码 LLMs。aiXcoder-7B 已开源并获得了显著关注。截至提交日期，aiXcoder-7B 已获得 2,193 个 GitHub Stars。
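
The idea behind Structured Fill-In-the-Middle, choosing the masked "middle" span along syntax boundaries rather than at random character positions, can be sketched with Python's standard `ast` module. This is an illustration under stated assumptions, not the aiXcoder-7B training code: the sentinel strings `<PRE>`, `<SUF>`, and `<MID>` are placeholders, the choice of the first `return` statement as the masked node is arbitrary, and the offset arithmetic assumes ASCII source (the `ast` column offsets are UTF-8 byte offsets).

```python
import ast


def char_offset(lines, lineno, col):
    """Convert a 1-based line number and column into a string offset."""
    return sum(len(line) for line in lines[: lineno - 1]) + col


def split_at_node(source, node_type=ast.Return):
    """Split source into (prefix, middle, suffix) around one complete syntax node."""
    tree = ast.parse(source)
    lines = source.splitlines(keepends=True)
    node = next((n for n in ast.walk(tree) if isinstance(n, node_type)), None)
    if node is None:
        raise ValueError("no node of the requested type in source")
    start = char_offset(lines, node.lineno, node.col_offset)
    end = char_offset(lines, node.end_lineno, node.end_col_offset)
    return source[:start], source[start:end], source[end:]


def format_sfim(source, node_type=ast.Return):
    """Arrange the three parts in prefix-suffix-middle order with placeholder sentinels."""
    prefix, middle, suffix = split_at_node(source, node_type)
    return f"<PRE>{prefix}<SUF>{suffix}<MID>{middle}"


src = "def add(a, b):\n    total = a + b\n    return total\n"
print(format_sfim(src))
```

Masking a whole node means the model is always asked to produce a syntactically complete fragment, which is the property the abstract attributes to SFIM.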

[NLP-102] Chain of Ideas: Revolutionizing Research in Novel Idea Development with LLM Agents

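【TL;DR】: This paper addresses the difficulty researchers have in identifying and generating meaningful research directions amid a rapidly growing scientific literature. The key is the Chain-of-Ideas (CoI) agent, an LLM-based agent that organizes relevant literature into a chain structure mirroring the progressive development of a research domain, which helps the LLM capture current advances and strengthens its ideation. The paper also proposes Idea Arena, an evaluation protocol that assesses idea generation methods from several perspectives and aligns closely with the preferences of human researchers. In the reported experiments the CoI agent consistently outperforms other methods at low cost.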

链接: https://arxiv.org/abs/2410.13185
作者: Long Li,Weiwen Xu,Jiayan Guo,Ruochen Zhao,Xinxuan Li,Yuqian Yuan,Boqiang Zhang,Yuming Jiang,Yifei Xin,Ronghao Dang,Deli Zhao,Yu Rong,Tian Feng,Lidong Bing
关键词-EN: Effective research ideation, Effective research, critical step, research, Effective
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages,5 figures, conference

Abstract:Effective research ideation is a critical step for scientific research. However, the exponential increase in scientific literature makes it challenging for researchers to stay current with recent advances and identify meaningful research directions. Recent developments in large language models (LLMs) suggest a promising avenue for automating the generation of novel research ideas. However, existing methods for idea generation either trivially prompt LLMs or directly expose LLMs to extensive literature without indicating useful information. Inspired by the research process of human researchers, we propose a Chain-of-Ideas (CoI) agent, an LLM-based agent that organizes relevant literature in a chain structure to effectively mirror the progressive development in a research domain. This organization facilitates LLMs to capture the current advancements in research, thereby enhancing their ideation capabilities. Furthermore, we propose Idea Arena, an evaluation protocol that can comprehensively evaluate idea generation methods from different perspectives, aligning closely with the preferences of human researchers. Experimental results indicate that the CoI agent consistently outperforms other methods and shows quality comparable to humans in research idea generation. Moreover, our CoI agent is budget-friendly, with a minimum cost of $0.50 to generate a candidate idea and its corresponding experimental design.
摘要：有效的研究构思是科学研究的关键步骤。然而，科学文献的指数级增长使得研究人员难以跟上最新进展并识别有意义的研究方向。大语言模型 (LLM) 的最新发展为自动化生成新颖研究想法提供了有前景的途径。然而，现有的想法生成方法要么简单地提示 LLM，要么直接将 LLM 暴露于大量文献中，而没有指示有用信息。受人类研究人员研究过程的启发，我们提出了一个链式想法 (Chain-of-Ideas, CoI) 智能体，这是一个基于 LLM 的智能体，它以链式结构组织相关文献，以有效反映研究领域的逐步发展。这种组织方式有助于 LLM 捕捉当前研究进展，从而增强其构思能力。此外，我们提出了想法竞技场 (Idea Arena)，一种评估协议，可以从不同角度全面评估想法生成方法，与人类研究人员的偏好高度一致。实验结果表明，CoI 智能体在研究想法生成方面始终优于其他方法，并在质量上与人类相当。此外，我们的 CoI 智能体成本友好，生成一个候选想法及其相应的实验设计的最低成本为 0.50 美元。

[NLP-103] Router-Tuning: A Simple and Effective Approach for Enabling Dynamic-Depth in Transformers

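【TL;DR】: This paper addresses the inefficiency of traditional Transformer models, which allocate a fixed amount of computation to every input token. It builds on Mixture of Depths (MoD), an existing approach that adjusts computational depth dynamically by skipping less important layers, and proposes two methods: Router-Tuning, which fine-tunes only the router on a small dataset and so greatly reduces training cost, and MindSkip, which applies attention with dynamic depths to limit the performance loss from skipping layers. The reported experiments show competitive results with markedly better efficiency, for example a 21% speedup with only a 0.2% performance drop.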

链接: https://arxiv.org/abs/2410.13184
作者: Shwai He,Tao Ge,Guoheng Sun,Bowei Tian,Xiaoyang Wang,Ang Li,Dong Yu
关键词-EN: Traditional transformer models, Traditional transformer, input token, leading to inefficient, allocate a fixed
类目: Computation and Language (cs.CL)
备注:

Abstract:Traditional transformer models often allocate a fixed amount of computational resources to every input token, leading to inefficient and unnecessary computation. To address this, the Mixture of Depths (MoD) was introduced to dynamically adjust the computational depth by skipping less important layers. Despite its promise, current MoD approaches remain under-explored and face two main challenges: (1) high training costs due to the need to train the entire model along with the routers that determine which layers to skip, and (2) the risk of performance degradation when important layers are bypassed. In response to the first issue, we propose Router-Tuning, a method that fine-tunes only the router on a small dataset, drastically reducing the computational overhead associated with full model training. For the second challenge, we propose MindSkip, which deploys Attention with Dynamic Depths. This method preserves the model’s performance while significantly enhancing computational and memory efficiency. Extensive experiments demonstrate that our approach delivers competitive results while dramatically improving the computation efficiency, e.g., 21% speedup and only a 0.2% performance drop. The code is released at this https URL.
摘要：传统的 Transformer 模型通常为每个输入 Token 分配固定的计算资源，导致计算效率低下且不必要的计算。为了解决这一问题，提出了深度混合 (Mixture of Depths, MoD) 方法，通过跳过不太重要的层来动态调整计算深度。尽管前景广阔，当前的 MoD 方法仍未得到充分探索，并面临两个主要挑战：(1) 由于需要训练整个模型以及决定跳过哪些层的路由器，导致训练成本高昂；(2) 当重要层被跳过时，存在性能下降的风险。针对第一个问题，我们提出了路由器微调 (Router-Tuning) 方法，该方法仅在小数据集上微调路由器，从而大幅减少了与全模型训练相关的计算开销。对于第二个挑战，我们提出了 MindSkip，该方法采用动态深度注意力 (Attention with Dynamic Depths)。这种方法在显著提高计算和内存效率的同时，保持了模型的性能。广泛的实验表明，我们的方法在显著提高计算效率的同时，仍能取得有竞争力的结果，例如，实现了 21% 的加速，性能仅下降 0.2%。代码已发布在 [https URL]。
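
The routing mechanism described in the abstract can be sketched as a layer wrapper whose small router decides whether the wrapped sub-layer runs at all. This is a toy under stated assumptions, not the released code: the router is a single linear unit with a sigmoid, the skip threshold of 0.5 is arbitrary, scaling the update by the gate value is one common way to keep a router trainable and is assumed here, and the "attention" sub-layer is a stand-in function. In Router-Tuning only the router parameters (`router_w`, `router_b` below) would be fine-tuned while the sub-layer stays frozen.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class RoutedLayer:
    """Wraps a sub-layer with a router that can skip it for a given input."""

    def __init__(self, sublayer, router_w, router_b, threshold=0.5):
        self.sublayer = sublayer      # frozen in a Router-Tuning style setup
        self.router_w = router_w      # trainable router weights
        self.router_b = router_b      # trainable router bias
        self.threshold = threshold    # assumed skip threshold

    def __call__(self, x):
        gate = sigmoid(sum(w * v for w, v in zip(self.router_w, x)) + self.router_b)
        if gate < self.threshold:
            # Skip: the residual stream passes through unchanged, no sub-layer cost.
            return x, False
        update = self.sublayer(x)
        # Execute: residual connection, with the update scaled by the gate.
        return [xi + gate * ui for xi, ui in zip(x, update)], True


def fake_attention(x):
    """Stand-in for an attention block; a real one would mix tokens."""
    return [2.0 * v for v in x]


skipping = RoutedLayer(fake_attention, [0.0, 0.0], -10.0)
running = RoutedLayer(fake_attention, [0.0, 0.0], 10.0)
print(skipping([1.0, -1.0]))
print(running([1.0, -1.0]))
```

The speedup reported in the abstract comes from the skip branch: when the router scores an input below the threshold, the expensive sub-layer is never evaluated.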

[NLP-104] AdaSwitch: Adaptive Switching between Small and Large Agents for Effective Cloud-Local Collaborative Learning EMNLP2024

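【TL;DR】: This paper addresses the trade-off between cost and performance that users face with large language models (LLMs). The key is a utilization paradigm in which a large cloud-based LLM and a smaller locally deployed LLM work together. The framework has a local agent, backed by the smaller LLM, that handles simpler reasoning steps, and a cloud agent, backed by the larger LLM, that handles more complex ones. Through an adaptive mechanism the local agent checks its own output for errors and proactively asks the cloud agent for help, combining the strengths of both and improving task performance and efficiency.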

链接: https://arxiv.org/abs/2410.13181
作者: Hao Sun,Jiayi Wu,Hengyi Cai,Xiaochi Wei,Yue Feng,Bo Wang,Shuaiqiang Wang,Yan Zhang,Dawei Yin
关键词-EN: Recent advancements, large language models, language models, LLMs, cloud-based LLMs
类目: Computation and Language (cs.CL)
备注: EMNLP 2024 Main Conference

Abstract:Recent advancements in large language models (LLMs) have been remarkable. Users face a choice between using cloud-based LLMs for generation quality and deploying local-based LLMs for lower computational cost. The former option is typically costly and inefficient, while the latter usually fails to deliver satisfactory performance for reasoning steps requiring deliberate thought processes. In this work, we propose a novel LLM utilization paradigm that facilitates the collaborative operation of large cloud-based LLMs and smaller local-deployed LLMs. Our framework comprises two primary modules: the local agent instantiated with a relatively smaller LLM, handling less complex reasoning steps, and the cloud agent equipped with a larger LLM, managing more intricate reasoning steps. This collaborative processing is enabled through an adaptive mechanism where the local agent introspectively identifies errors and proactively seeks assistance from the cloud agent, thereby effectively integrating the strengths of both locally-deployed and cloud-based LLMs, resulting in significant enhancements in task completion performance and efficiency. We evaluate AdaSwitch across 7 benchmarks, ranging from mathematical reasoning to complex question answering, using various types of LLMs to instantiate the local and cloud agents. The empirical results show that AdaSwitch effectively improves the performance of the local agent, and sometimes achieves competitive results compared to the cloud agent while utilizing much less computational overhead.
摘要：近年来，大语言模型 (Large Language Models, LLMs) 的进展显著。用户面临的选择是在生成质量上使用基于云的 LLM，还是在计算成本上部署本地 LLM。前者通常成本高且效率低，而后者往往无法在需要深思熟虑的推理步骤中提供令人满意的性能。在这项工作中，我们提出了一种新颖的 LLM 利用范式，促进大型云端 LLM 和较小本地部署 LLM 的协同操作。我们的框架包括两个主要模块：本地智能体 (Local Agent)，由相对较小的 LLM 实例化，处理较简单的推理步骤；云端智能体 (Cloud Agent)，配备较大的 LLM，管理更复杂的推理步骤。这种协同处理通过一种自适应机制实现，本地智能体自省识别错误并主动向云端智能体寻求帮助，从而有效整合本地部署和云端 LLM 的优势，显著提升任务完成性能和效率。我们在 7 个基准上评估了 AdaSwitch，涵盖数学推理和复杂问答，使用各种类型的 LLM 实例化本地和云端智能体。实证结果表明，AdaSwitch 有效地提升了本地智能体的性能，并且在使用较少的计算开销时，有时能达到与云端智能体相当的竞争结果。
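
The local-first control flow with escalation described in the abstract can be sketched in a few lines. This is a hypothetical illustration, not the AdaSwitch implementation: the two agents are plain functions standing in for LLM calls, the reasoning steps are toy arithmetic tuples, and the self-check simply tests whether the local agent produced an answer. The names `solve`, `local_agent`, `cloud_agent`, and `self_check` are assumptions.

```python
def solve(steps, local_agent, cloud_agent, self_check):
    """Run each step locally first; escalate to the cloud agent when the check fails."""
    trace = []
    for step in steps:
        answer = local_agent(step)
        source = "local"
        if not self_check(step, answer):
            # The local agent flagged its own output, so ask the larger model.
            answer = cloud_agent(step)
            source = "cloud"
        trace.append((step, answer, source))
    return trace


def local_agent(step):
    """Toy small model: it can only add."""
    op, a, b = step
    return a + b if op == "add" else None


def cloud_agent(step):
    """Toy large model: it handles every operation in this example."""
    op, a, b = step
    return a + b if op == "add" else a * b


def self_check(step, answer):
    """Toy introspection: treat a missing answer as a detected error."""
    return answer is not None


for step, answer, source in solve([("add", 2, 3), ("mul", 12, 12)],
                                  local_agent, cloud_agent, self_check):
    print(step, answer, source)
```

The cost saving depends on how often the self-check passes: every step that stays local avoids a call to the larger model.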

[NLP-105] EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning

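【TL;DR】: This paper addresses self-supervised speech representation learning, specifically how a better masking strategy in Masked Acoustic Modeling (MAM) can improve a model's understanding of speech. The key is EH-MAM (Easy-to-Hard adaptive Masked Acoustic Modeling), a selective and adaptive masking strategy that progressively introduces harder regions for reconstruction. A teacher model predicts frame-level losses and decides which frames to mask, so the model learns from increasingly challenging problems and acquires more effective speech representations.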

链接: https://arxiv.org/abs/2410.13179
作者: Ashish Seth,Ramaneswaran Selvakumar,S Sakshi,Sonal Kumar,Sreyan Ghosh,Dinesh Manocha
关键词-EN: Masked Acoustic Modeling, Acoustic Modeling, adaptive Masked Acoustic, Masked Acoustic, self-supervised learning approach
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注:

Abstract:In this paper, we present EH-MAM (Easy-to-Hard adaptive Masked Acoustic Modeling), a novel self-supervised learning approach for speech representation learning. In contrast to the prior methods that use random masking schemes for Masked Acoustic Modeling (MAM), we introduce a novel selective and adaptive masking strategy. Specifically, during SSL training, we progressively introduce harder regions to the model for reconstruction. Our approach automatically selects hard regions and is built on the observation that the reconstruction loss of individual frames in MAM can provide natural signals to judge the difficulty of solving the MAM pre-text task for that frame. To identify these hard regions, we employ a teacher model that first predicts the frame-wise losses and then decides which frames to mask. By learning to create challenging problems, such as identifying harder frames and solving them simultaneously, the model is able to learn more effective representations and thereby acquire a more comprehensive understanding of the speech. Quantitatively, EH-MAM outperforms several state-of-the-art baselines across various low-resource speech recognition and SUPERB benchmarks by 5%-10%. Additionally, we conduct a thorough analysis to show that the regions masked by EH-MAM effectively capture useful context across speech frames.
摘要：本文提出了一种名为 EH-MAM (Easy-to-Hard 自适应掩码声学建模) 的新型自监督学习方法，用于语音表示学习。与以往使用随机掩码策略进行掩码声学建模 (Masked Acoustic Modeling, MAM) 的方法不同，我们引入了一种新颖的选择性和自适应掩码策略。具体而言，在自监督学习 (SSL) 训练过程中，我们逐步向模型引入更难的区域进行重构。我们的方法自动选择困难区域，并基于以下观察结果构建：MAM 中各个帧的重构损失可以提供自然信号，用于判断该帧解决 MAM 前置任务的难度。为了识别这些困难区域，我们采用了一个教师模型，该模型首先预测帧级损失，然后决定哪些帧需要掩码。通过学习创建具有挑战性的问题，例如识别更难的帧并同时解决它们，模型能够学习到更有效的表示，从而获得对语音更全面的理解。定量分析表明，EH-MAM 在各种低资源语音识别和 SUPERB 基准测试中，超越了多个最先进的基线模型，性能提升达 5%-10%。此外，我们还进行了深入分析，证明 EH-MAM 掩码的区域能够有效捕捉跨语音帧的有用上下文。
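
The easy-to-hard masking idea can be sketched as a function that picks which frames to mask from the teacher's predicted per-frame losses. The abstract does not specify the schedule, so everything concrete here is an assumption: the share of masked frames taken from the hardest ones is set to grow linearly with training progress, the remainder is drawn at random, and `select_mask` and its parameters are hypothetical names.

```python
import random


def select_mask(predicted_losses, mask_ratio, progress, seed=0):
    """Choose frame indices to mask, shifting from random to hard frames.

    predicted_losses: teacher's predicted reconstruction loss per frame
    mask_ratio:       fraction of frames to mask
    progress:         training progress in [0, 1]; 0 = all random, 1 = all hard
    """
    n = len(predicted_losses)
    k = max(1, round(mask_ratio * n))
    n_hard = round(progress * k)
    # Frames ordered from hardest (highest predicted loss) to easiest.
    by_difficulty = sorted(range(n), key=lambda i: predicted_losses[i], reverse=True)
    hard = by_difficulty[:n_hard]
    # The rest of the mask budget is filled with randomly chosen other frames.
    rng = random.Random(seed)
    randoms = rng.sample(by_difficulty[n_hard:], k - n_hard)
    return sorted(hard + randoms)


losses = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6]
print(select_mask(losses, mask_ratio=0.5, progress=0.0))
print(select_mask(losses, mask_ratio=0.5, progress=1.0))
```

Early in training the mask is effectively random, as in standard MAM; late in training it concentrates on the frames the teacher expects to be hardest to reconstruct.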

[NLP-106] An Evolved Universal Transformer Memory

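【TL;DR】: This paper addresses the rising cost of handling long contexts in modern foundation models. The key is Neural Attention Memory Models (NAMMs), a learned network for memory management that improves both the performance and the efficiency of transformers. NAMMs are evolved on top of pre-trained transformers and give different layers different latent contexts focused on the most relevant information. Because they condition only on the values in the produced attention matrices, they apply to any model that uses self-attention. Trained on a small set of problems, NAMMs are reported to improve results on several long-context benchmarks while cutting input contexts to a fraction of their original size, and to transfer zero-shot from language to new transformer architectures and other input modalities such as vision and reinforcement learning.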

链接: https://arxiv.org/abs/2410.13166
作者: Edoardo Cetin,Qi Sun,Tianyu Zhao,Yujin Tang
关键词-EN: Prior methods propose, dropping specific parts, modern foundation models, Prior methods, Neural Attention Memory
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 29 pages, 14 figures. Preprint, under submission. Source code is available at this https URL

Abstract:Prior methods propose to offset the escalating costs of modern foundation models by dropping specific parts of their contexts with hand-designed rules, while attempting to preserve their original performance. We overcome this trade-off with Neural Attention Memory Models (NAMMs), introducing a learned network for memory management that improves both the performance and efficiency of transformers. We evolve NAMMs atop pre-trained transformers to provide different latent contexts focusing on the most relevant information for individual layers and attention heads. NAMMs are universally applicable to any model using self-attention as they condition exclusively on the values in the produced attention matrices. Learning NAMMs on a small set of problems, we achieve substantial performance improvements across multiple long-context benchmarks while cutting the model’s input contexts up to a fraction of the original sizes. We show the generality of our conditioning enables zero-shot transfer of NAMMs trained only on language to entirely new transformer architectures even across input modalities, with their benefits carrying over to vision and reinforcement learning.
摘要：先前的方法提出通过使用手工设计的规则丢弃现代基础模型中的特定部分上下文来抵消其不断上升的成本，同时试图保持其原始性能。我们通过引入神经注意力记忆模型 (Neural Attention Memory Models, NAMMs) 克服了这一权衡，引入了一个用于内存管理的可学习网络，该网络不仅提高了 Transformer 的性能，还提升了其效率。我们在预训练的 Transformer 之上演化 NAMMs，以提供不同的潜在上下文，这些上下文专注于对各个层和注意力机制最相关的信息。由于 NAMMs 仅基于生成的注意力矩阵中的值进行条件化，因此它们普遍适用于任何使用自注意力机制的模型。通过在一小部分问题上学习 NAMMs，我们在多个长上下文基准测试中实现了显著的性能提升，同时将模型的输入上下文大小缩减至原始尺寸的一小部分。我们展示了我们的条件化方法的通用性，使得仅在语言数据上训练的 NAMMs 能够零样本迁移到全新的 Transformer 架构，甚至跨越输入模态，其优势也延伸至视觉和强化学习领域。

[NLP-107] SLM-Mod: Small Language Models Surpass LLMs at Content Moderation

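【TL;DR】: This paper addresses the high cost of querying large language models (LLMs) for real-time content moderation and their poor fit for community-specific moderation. The key is to use open-source small language models (SLMs) for community-specific moderation. The authors fine-tune SLMs with fewer than 15B parameters and compare them against much larger open- and closed-source models, finding that SLMs outperform LLMs with 11.5% higher accuracy and 25.7% higher recall on average. The paper also shows the promise of cross-community moderation, which matters for new communities and for cross-platform moderation techniques.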

链接: https://arxiv.org/abs/2410.13155
作者: Xianyang Zhan,Agam Goyal,Yilun Chen,Eshwar Chandrasekharan,Koustuv Saha
关键词-EN: Large language models, Large language, natural language understanding, content moderation, language understanding tasks
类目: Computation and Language (cs.CL)
备注: Preprint: 15 pages, 8 figures, 8 pages

Abstract:Large language models (LLMs) have shown promise in many natural language understanding tasks, including content moderation. However, these models can be expensive to query in real-time and do not allow for a community-specific approach to content moderation. To address these challenges, we explore the use of open-source small language models (SLMs) for community-specific content moderation tasks. We fine-tune and evaluate SLMs (less than 15B parameters) by comparing their performance against much larger open- and closed-sourced models. Using 150K comments from 15 popular Reddit communities, we find that SLMs outperform LLMs at content moderation – 11.5% higher accuracy and 25.7% higher recall on average across all communities. We further show the promise of cross-community content moderation, which has implications for new communities and the development of cross-platform moderation techniques. Finally, we outline directions for future work on language model based content moderation. Code and links to HuggingFace models can be found at this https URL.
摘要：大语言模型 (LLMs) 在许多自然语言理解任务中展现了潜力，包括内容审核。然而，这些模型在实时查询时成本高昂，并且无法实现针对特定社区的内容审核方法。为了解决这些问题，我们探索了使用开源的小语言模型 (SLMs) 进行特定社区的内容审核任务。我们通过将 SLMs（参数少于 15B）的性能与更大规模的开源和闭源模型进行比较，对其进行了微调和评估。使用来自 15 个流行 Reddit 社区的 15 万条评论，我们发现 SLMs 在内容审核方面优于 LLMs——平均准确率高出 11.5%，召回率高出 25.7%。我们进一步展示了跨社区内容审核的潜力，这对新社区和跨平台审核技术的发展具有重要意义。最后，我们概述了基于语言模型的内容审核未来工作的方向。代码和 HuggingFace 模型的链接可以在以下 URL 找到。

[NLP-108] Better to Ask in English: Evaluation of Large Language Models on English, Low-resource and Cross-Lingual Settings

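【TL;DR】: This paper examines how large language models (LLMs) perform in low-resource languages, in particular widely spoken South Asian languages such as Bangla, Hindi, and Urdu. The key is an extensive evaluation of GPT-4, Llama 2, and Gemini using zero-shot prompting and five prompt settings with cross-lingual translated prompts, which shows how effective the models are in each language and highlights the model improvements and language-specific resources needed for more general-purpose NLP applications.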

链接: https://arxiv.org/abs/2410.13153
作者: Krishno Dey,Prerona Tarannum,Md. Arid Hasan,Imran Razzak,Usman Naseem
关键词-EN: Large Language Models, Large Language, Language Models, South Asia, amounts of data
类目: Computation and Language (cs.CL)
备注:

Abstract:Large Language Models (LLMs) are trained on massive amounts of data, enabling their application across diverse domains and tasks. Despite their remarkable performance, most LLMs are developed and evaluated primarily in English. Recently, a few multi-lingual LLMs have emerged, but their performance in low-resource languages, especially the most spoken languages in South Asia, is less explored. To address this gap, in this study, we evaluate LLMs such as GPT-4, Llama 2, and Gemini to analyze their effectiveness in English compared to other low-resource languages from South Asia (e.g., Bangla, Hindi, and Urdu). Specifically, we utilized zero-shot prompting and five different prompt settings to extensively investigate the effectiveness of the LLMs in cross-lingual translated prompts. The findings of the study suggest that GPT-4 outperformed Llama 2 and Gemini in all five prompt settings and across all languages. Moreover, all three LLMs performed better for English language prompts than other low-resource language prompts. This study extensively investigates LLMs in low-resource language contexts to highlight the improvements required in LLMs and language-specific resources to develop more generally purposed NLP applications.
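Summary: Large language models (LLMs) are trained on massive amounts of data, which enables their use across many domains and tasks. Despite their strong performance, most LLMs are developed and evaluated primarily in English. A few multilingual LLMs have emerged recently, but their performance in low-resource languages, especially the most widely spoken languages of South Asia, remains underexplored. To fill this gap, this study evaluates LLMs such as GPT-4, Llama 2, and Gemini, analyzing their effectiveness in English compared with South Asian low-resource languages (e.g., Bangla, Hindi, and Urdu). Specifically, we use zero-shot prompting and five different prompt settings to study the effectiveness of the LLMs on cross-lingual translated prompts. The results show that GPT-4 outperforms Llama 2 and Gemini in all five prompt settings and across all languages. Moreover, all three LLMs perform better on English prompts than on prompts in the low-resource languages. The study examines LLMs in low-resource language settings in depth to highlight the improvements needed in LLMs and language-specific resources in order to build more general-purpose NLP applications.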

[NLP-109] Mapping Bias in Vision Language Models: Signposts, Pitfalls, and the Road Ahead NAACL2025

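[Quick read]: This paper addresses fairness in vision language models (VLMs), in particular how different models perform across datasets and the biases behind those differences. The key to the approach is identifying and using datasets suited to bias detection, such as the portrait datasets UTKFace and CelebA, and improving existing datasets (such as VisoGender) to provide a more rigorous evaluation. The paper calls for more effective and carefully constructed datasets to ensure that VLMs are fair and reliable.

Link: https://arxiv.org/abs/2410.13146
Authors: Kuleen Sasse,Shan Chen,Jackson Pond,Danielle Bitterman,John Osborne
Keywords (EN): Vision Language Models, Vision Language, fairness remains under-explored, Language Models, gain widespread
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review at NAACL 2025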

Abstract:As Vision Language Models (VLMs) gain widespread use, their fairness remains under-explored. In this paper, we analyze demographic biases across five models and six datasets. We find that portrait datasets like UTKFace and CelebA are the best tools for bias detection, finding gaps in performance and fairness between LLaVa and CLIP models. However, scene based datasets like PATA, VLStereoSet fail to be useful benchmarks for bias due to their construction. As for pronoun based datasets like VisoGender, we receive mixed signals as only some subsets of the data are useful in providing insights. To alleviate this problem, we introduce a more difficult version of VisoGender to serve as a more rigorous evaluation. Based on these results, we call for more effective and carefully designed datasets to ensure VLMs are both fair and reliable.
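Summary: As Vision Language Models (VLMs) come into widespread use, their fairness remains under-explored. This paper analyzes demographic biases across five models and six datasets. We find that portrait datasets such as UTKFace and CelebA are the best tools for bias detection, revealing gaps in performance and fairness between LLaVa and CLIP models. Scene-based datasets such as PATA and VLStereoSet, however, fail to serve as useful bias benchmarks because of how they are constructed. For pronoun-based datasets such as VisoGender, the signals are mixed, since only some subsets of the data yield useful insights. To alleviate this problem, we introduce a more difficult version of VisoGender as a more rigorous evaluation. Based on these results, we call for more effective and carefully designed datasets to ensure that VLMs are both fair and reliable.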

[NLP-110] Data Defenses Against Large Language Models

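[Quick read]: This paper addresses the ethical harms that large language models (LLMs) can cause when performing inference over text, such as surveillance, labor displacement, and IP/copyright theft. The key to the approach is a new strategy called "data defenses": automatically generated adversarial prompt injections that block LLMs from performing accurate inference on the data. These injections significantly reduce the ability of LLMs to extract personally identifying information from the input text or to use copyrighted text in inference. The paper also examines the ethics of such direct resistance to LLM inference and argues that data defenses help realize important values such as data ownership, data sovereignty, and democratic control over AI systems.

Link: https://arxiv.org/abs/2410.13138
Authors: William Agnew,Harry H. Jiang,Cella Sum,Maarten Sap,Sauvik Das
Keywords (EN): Large language models, language models excel, Large language, data defenses, language models
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
Comments: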

Abstract:Large language models excel at performing inference over text to extract information, summarize information, or generate additional text. These inference capabilities are implicated in a variety of ethical harms spanning surveillance, labor displacement, and IP/copyright theft. While many policy, legal, and technical mitigations have been proposed to counteract these harms, these mitigations typically require cooperation from institutions that move slower than technical advances (i.e., governments) or that have few incentives to act to counteract these harms (i.e., the corporations that create and profit from these LLMs). In this paper, we define and build “data defenses” – a novel strategy that directly empowers data owners to block LLMs from performing inference on their data. We create data defenses by developing a method to automatically generate adversarial prompt injections that, when added to input text, significantly reduce the ability of LLMs to accurately infer personally identifying information about the subject of the input text or to use copyrighted text in inference. We examine the ethics of enabling such direct resistance to LLM inference, and argue that making data defenses that resist and subvert LLMs enables the realization of important values such as data ownership, data sovereignty, and democratic control over AI systems. We verify that our data defenses are cheap and fast to generate, work on the latest commercial and open-source LLMs, are resistant to countermeasures, and are robust to several different attack settings. Finally, we consider the security implications of LLM data defenses and outline several future research directions in this area. Our code is available at this https URL and a tool for using our defenses to protect text against LLM inference is at this https URL.
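Summary: Large language models excel at performing inference over text to extract information, summarize information, or generate additional text. These inference capabilities are implicated in a range of ethical harms, including surveillance, labor displacement, and IP/copyright theft. Many policy, legal, and technical mitigations have been proposed, but they typically require cooperation from institutions that move more slowly than the technology (i.e., governments) or that have little incentive to act (i.e., the corporations that create and profit from these LLMs). In this paper, we define and build "data defenses", a new strategy that directly empowers data owners to block LLMs from performing inference on their data. We create data defenses with a method that automatically generates adversarial prompt injections; when added to input text, they significantly reduce the ability of LLMs to accurately infer personally identifying information about the subject of the text or to use copyrighted text in inference. We examine the ethics of enabling such direct resistance to LLM inference and argue that data defenses that resist and subvert LLMs help realize important values such as data ownership, data sovereignty, and democratic control over AI systems. We verify that our data defenses are cheap and fast to generate, work on the latest commercial and open-source LLMs, resist countermeasures, and are robust across several attack settings. Finally, we consider the security implications of LLM data defenses and outline several directions for future research. Our code is available at this https URL, and a tool for using our defenses to protect text against LLM inference is at this https URL.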

[NLP-111] Retrieval-Enhanced Named Entity Recognition

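[Quick read]: This paper addresses how to apply In-Context Learning with autoregressive language models to named entity recognition (NER). The key to the approach is RENER (Retrieval-Enhanced Named Entity Recognition), which uses information retrieval to fetch similar examples from the training dataset and so strengthens the language model's ability to recognize named entities in the input text. RENER's modular design makes it independent of the underlying language model and retrieval algorithm. Experiments show state-of-the-art performance on the CrossNER collection, with information retrieval raising the F-score by up to 11 percentage points.

Link: https://arxiv.org/abs/2410.13118
Authors: Enzo Shiraishi,Raphael Y. de Camargo,Henrique L. P. Silva,Ronaldo C. Prati
Keywords (EN): named entity recognition, entity recognition, named entity, achieved good performance, In-Context Learning
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 13 pages, 6 figures, 3 tables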

Abstract:When combined with In-Context Learning, a technique that enables models to adapt to new tasks by incorporating task-specific examples or demonstrations directly within the input prompt, autoregressive language models have achieved good performance in a wide range of tasks and applications. However, this combination has not been properly explored in the context of named entity recognition, where the structure of this task poses unique challenges. We propose RENER (Retrieval-Enhanced Named Entity Recognition), a technique for named entity recognition using autoregressive language models based on In-Context Learning and information retrieval techniques. When presented with an input text, RENER fetches similar examples from a dataset of training examples that are used to enhance a language model to recognize named entities from this input text. RENER is modular and independent of the underlying language model and information retrieval algorithms. Experimental results show that in the CrossNER collection we achieve state-of-the-art performance with the proposed technique and that information retrieval can increase the F-score by up to 11 percentage points.
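Summary: Combined with In-Context Learning, a technique that lets models adapt to new tasks by including task-specific examples or demonstrations directly in the input prompt, autoregressive language models have achieved good performance across a wide range of tasks and applications. This combination has not been properly explored for named entity recognition, however, where the structure of the task poses unique challenges. We propose RENER (Retrieval-Enhanced Named Entity Recognition), a technique for named entity recognition with autoregressive language models that builds on In-Context Learning and information retrieval. Given an input text, RENER fetches similar examples from a dataset of training examples and uses them to help the language model recognize the named entities in that text. RENER is modular and independent of the underlying language model and information retrieval algorithm. Experimental results show that the proposed technique achieves state-of-the-art performance on the CrossNER collection and that information retrieval can raise the F-score by up to 11 percentage points.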
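
The retrieve-then-prompt loop described above can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the Jaccard token-overlap retriever, the prompt wording, and the names `retrieve_examples`, `build_prompt`, and `TRAIN` are all assumptions made for the example (the abstract states only that RENER is independent of the retrieval algorithm and the language model).

```python
def tokenize(text):
    """Lowercase whitespace tokenizer (a stand-in for a real tokenizer)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve_examples(query, train_set, k=2):
    """Return the k training examples whose text is most similar to the query."""
    q = tokenize(query)
    ranked = sorted(train_set,
                    key=lambda ex: jaccard(q, tokenize(ex["text"])),
                    reverse=True)
    return ranked[:k]


def build_prompt(query, examples):
    """Place the retrieved demonstrations before the query (hypothetical format)."""
    lines = ["Extract the named entities from each text."]
    for ex in examples:
        lines.append(f"Text: {ex['text']}")
        lines.append(f"Entities: {ex['entities']}")
    lines.append(f"Text: {query}")
    lines.append("Entities:")
    return "\n".join(lines)


# Hypothetical annotated training examples, invented for illustration.
TRAIN = [
    {"text": "Marie Curie won the Nobel Prize in Physics",
     "entities": "Marie Curie (person); Nobel Prize in Physics (award)"},
    {"text": "The Amazon river flows through Brazil",
     "entities": "Amazon (location); Brazil (location)"},
    {"text": "Albert Einstein won the Nobel Prize in 1921",
     "entities": "Albert Einstein (person); Nobel Prize (award)"},
]

query = "Niels Bohr won the Nobel Prize"
prompt = build_prompt(query, retrieve_examples(query, TRAIN, k=2))
```

The resulting `prompt` string would then be sent to whichever autoregressive model is in use; swapping the retriever for a dense or BM25 one only changes `retrieve_examples`.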

[NLP-112] Learning to Summarize from LLM-generated Feedback

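[Quick read]: This paper addresses hallucinations, key information omissions, and verbosity in summaries generated by large language models (LLMs). The key to the approach is using multi-dimensional LLM-generated feedback to improve summary quality so that summaries better match human preferences for faithfulness, completeness, and conciseness. The paper introduces the FeedSum dataset and shows experimentally that high-quality, multi-dimensional, fine-grained feedback significantly improves summary generation. It also compares two ways of using the feedback, supervised fine-tuning and direct preference optimization, and shows that SummLlama3-8b outperforms the nearly 10x larger Llama3-70b-instruct at generating human-preferred summaries, indicating that smaller models can achieve superior performance with appropriate training.

Link: https://arxiv.org/abs/2410.13116
Authors: Hwanjun Song,Taewon Yun,Yuho Lee,Gihun Lee,Jason Cai,Hang Su
Keywords (EN): Developing effective text, key information omissions, effective text summarizers, text summarizers remains, Developing effective
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: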

Abstract:Developing effective text summarizers remains a challenge due to issues like hallucinations, key information omissions, and verbosity in LLM-generated summaries. This work explores using LLM-generated feedback to improve summary quality by aligning the summaries with human preferences for faithfulness, completeness, and conciseness. We introduce FeedSum, a large-scale dataset containing multi-dimensional LLM feedback on summaries of varying quality across diverse domains. Our experiments show how feedback quality, dimensionality, and granularity influence preference learning, revealing that high-quality, multi-dimensional, fine-grained feedback significantly improves summary generation. We also compare two methods for using this feedback: supervised fine-tuning and direct preference optimization. Finally, we introduce SummLlama3-8b, a model that outperforms the nearly 10x larger Llama3-70b-instruct in generating human-preferred summaries, demonstrating that smaller models can achieve superior performance with appropriate training. The full dataset will be released soon. The SummLlama3-8B model is now available at this https URL.
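Summary: Developing effective text summarizers remains a challenge because summaries generated by large language models (LLMs) suffer from hallucinations, key information omissions, and verbosity. This work explores using LLM-generated feedback to improve summary quality by aligning summaries with human preferences for faithfulness, completeness, and conciseness. We introduce FeedSum, a large-scale dataset containing multi-dimensional LLM feedback on summaries of varying quality across diverse domains. Our experiments show how feedback quality, dimensionality, and granularity affect preference learning, revealing that high-quality, multi-dimensional, fine-grained feedback significantly improves summary generation. We also compare two methods for using this feedback: supervised fine-tuning and direct preference optimization. Finally, we introduce SummLlama3-8b, a model that outperforms the nearly 10x larger Llama3-70b-instruct at generating human-preferred summaries, demonstrating that smaller models can achieve superior performance with appropriate training. The full dataset will be released soon. The SummLlama3-8B model is now available at this https URL.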

[NLP-113] Controllable Generation via Locally Constrained Resampling

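[Quick read]: This paper addresses the difficulty autoregressive models have in generating complex outputs that satisfy logical constraints. The key to the approach is a tractable Bayesian conditioning method that conditions on the entire sequence, yielding more globally optimal constrained generation. Concretely, starting from a model sample, it constructs a local factorized distribution, samples from that distribution under the constraint, corrects for bias in the samples, and resamples, so that the resulting samples satisfy the constraint and stay close to the target distribution. The method performs well on tasks such as LLM detoxification and Sudoku solving, improving both the accuracy and the constraint compliance of the generated outputs.

Link: https://arxiv.org/abs/2410.13111
Authors: Kareem Ahmed,Kai-Wei Chang,Guy Van den Broeck
Keywords (EN): natural language, demonstrated an unprecedented, unprecedented ability, ability at modeling, modeling the intricacies
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: arXiv admin note: text overlap with arXiv:2312.03905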

Abstract:Autoregressive models have demonstrated an unprecedented ability at modeling the intricacies of natural language. However, they continue to struggle with generating complex outputs that adhere to logical constraints. Sampling from a fully-independent distribution subject to a constraint is hard. Sampling from an autoregressive distribution subject to a constraint is doubly hard: We have to contend not only with the hardness of the constraint but also the distribution’s lack of structure. We propose a tractable probabilistic approach that performs Bayesian conditioning to draw samples subject to a constraint. Our approach considers the entire sequence, leading to a more globally optimal constrained generation than current greedy methods. Starting from a model sample, we induce a local, factorized distribution which we can tractably condition on the constraint. To generate samples that satisfy the constraint, we sample from the conditional distribution, correct for biases in the samples and resample. The resulting samples closely approximate the target distribution and are guaranteed to satisfy the constraints. We evaluate our approach on several tasks, including LLM detoxification and solving Sudoku puzzles. We show that by disallowing a list of toxic expressions our approach is able to steer the model’s outputs away from toxic generations, outperforming similar approaches to detoxification. We conclude by showing that our approach achieves a perfect accuracy on Sudoku compared to 50% for GPT4-o and Gemini 1.5.
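Summary: Autoregressive models have shown an unprecedented ability to model the intricacies of natural language. They still struggle, however, to generate complex outputs that adhere to logical constraints. Sampling from a fully independent distribution subject to a constraint is hard. Sampling from an autoregressive distribution subject to a constraint is doubly hard: we have to contend not only with the hardness of the constraint but also with the distribution's lack of structure. We propose a tractable probabilistic approach that performs Bayesian conditioning to draw samples subject to a constraint. Our approach considers the entire sequence, leading to more globally optimal constrained generation than current greedy methods. Starting from a model sample, we induce a local, factorized distribution that we can tractably condition on the constraint. To generate samples that satisfy the constraint, we sample from the conditional distribution, correct for biases in the samples, and resample. The resulting samples closely approximate the target distribution and are guaranteed to satisfy the constraint. We evaluate the approach on several tasks, including LLM detoxification and solving Sudoku puzzles. We show that, by disallowing a list of toxic expressions, our approach steers the model's outputs away from toxic generations and outperforms similar detoxification approaches. Finally, we show that our approach achieves 100% accuracy on Sudoku, compared with 50% for GPT4-o and Gemini 1.5.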
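
The "sample from a conditioned factorized distribution, correct for bias, resample" recipe can be illustrated with plain sampling-importance-resampling on a toy problem. The sketch below is a simplified stand-in, not the paper's algorithm: the two-token toy model, its made-up probabilities, the use of per-position marginals as the factorized proposal, and the names `constrained_resample` and `FACTORS` are all assumptions for illustration.

```python
import random

# Toy autoregressive "model" over two positions (probabilities are invented).
FIRST = {"good": 0.5, "fine": 0.2, "bad": 0.3}


def second(prev):
    """Conditional distribution of token 2 given token 1."""
    if prev == "good":
        return {"good": 0.1, "fine": 0.6, "bad": 0.3}
    return {"good": 0.6, "fine": 0.1, "bad": 0.3}


def model_prob(seq):
    """Probability the toy model assigns to a two-token sequence."""
    return FIRST[seq[0]] * second(seq[0])[seq[1]]


def constrained_resample(factors, banned, n_proposals=200, n_out=50, seed=0):
    """Draw sequences that avoid `banned` tokens, reweighted toward the model."""
    rng = random.Random(seed)
    # Condition each independent factor on the constraint:
    # drop banned tokens and renormalize.
    conditioned = []
    for factor in factors:
        allowed = {t: w for t, w in factor.items() if t not in banned}
        total = sum(allowed.values())
        conditioned.append({t: w / total for t, w in allowed.items()})
    proposals, weights = [], []
    for _ in range(n_proposals):
        seq = tuple(rng.choices(list(c), weights=list(c.values()))[0]
                    for c in conditioned)
        q = 1.0
        for tok, c in zip(seq, conditioned):
            q *= c[tok]
        proposals.append(seq)
        # Importance weight corrects for sampling from q instead of the model.
        weights.append(model_prob(seq) / q)
    # Resample in proportion to the importance weights.
    return rng.choices(proposals, weights=weights, k=n_out)


# Factorized proposal: per-position marginals of the toy model.
FACTORS = [FIRST, {"good": 0.35, "fine": 0.35, "bad": 0.3}]
samples = constrained_resample(FACTORS, banned={"bad"})
```

Because the banned token is removed from the proposal before sampling, every returned sequence satisfies the constraint by construction; the importance weights only adjust how often each valid sequence appears.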

[NLP-114] A Little Human Data Goes A Long Way

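[Quick read]: This paper asks how far synthetic data can replace expensive human-annotated data in natural language processing (NLP) systems. The key to the approach is to incrementally replace human-annotated training data with synthetic data and study the effect on Fact Verification (FV) and Question Answering (QA) performance. The study finds that replacing up to 90% of the training data has only a small effect on performance, but replacing the final 10% causes a severe decline. In addition, models trained on purely synthetic data can be reliably improved by adding as few as 125 human-annotated examples, showing that a small amount of human-annotated data remains highly valuable even when large-scale annotation is infeasible.

Link: https://arxiv.org/abs/2410.13098
Authors: Dhananjay Ashok,Jonathan May
Keywords (EN): NLP systems increasingly, systems increasingly turn, creators of NLP, NLP systems, synthetic data
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: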

Abstract:Faced with an expensive human annotation process, creators of NLP systems increasingly turn to synthetic data generation. While this method shows promise, the extent to which synthetic data can replace human annotation is poorly understood. We investigate the use of synthetic data in Fact Verification (FV) and Question Answering (QA) by studying the effects of incrementally replacing human generated data with synthetic points on eight diverse datasets. Strikingly, replacing up to 90% of the training data only marginally decreases performance, but replacing the final 10% leads to severe declines. We find that models trained on purely synthetic data can be reliably improved by including as few as 125 human generated data points. We show that matching the performance gain of just a little additional human data (only 200 points) requires an order of magnitude more synthetic data and estimate price ratios at which human annotation would be a more cost-effective solution. Our results suggest that even when human annotation at scale is infeasible, there is great value to having a small proportion of the dataset being human generated.
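Summary: Faced with an expensive human annotation process, creators of natural language processing (NLP) systems increasingly turn to synthetic data generation. Although the method shows promise, the extent to which synthetic data can replace human annotation is poorly understood. We investigate the use of synthetic data in Fact Verification (FV) and Question Answering (QA) by studying the effect of incrementally replacing human-generated data with synthetic data on eight diverse datasets. Strikingly, replacing up to 90% of the training data only marginally decreases performance, but replacing the final 10% leads to severe declines. We find that models trained on purely synthetic data can be reliably improved by including as few as 125 human-generated data points. We show that matching the performance gain from just a little additional human data (only 200 points) requires an order of magnitude more synthetic data, and we estimate the price ratios at which human annotation would be the more cost-effective option. Our results suggest that even when human annotation at scale is infeasible, there is great value in having a small proportion of the dataset be human-generated.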
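
The incremental-replacement setup can be sketched as a small helper that builds a fixed-size training set with a chosen synthetic share. This is a minimal sketch under stated assumptions: the function name `mix_training_set`, the random sampling scheme, and the placeholder data are hypothetical, and the abstract does not say how the authors select which points to replace.

```python
import random


def mix_training_set(human, synthetic, synthetic_fraction, seed=0):
    """Keep the set size fixed at len(human); swap a fraction for synthetic points."""
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be between 0 and 1")
    n_total = len(human)
    n_synthetic = round(n_total * synthetic_fraction)
    if n_synthetic > len(synthetic):
        raise ValueError("not enough synthetic examples")
    rng = random.Random(seed)
    kept_human = rng.sample(human, n_total - n_synthetic)
    added_synthetic = rng.sample(synthetic, n_synthetic)
    mixed = kept_human + added_synthetic
    rng.shuffle(mixed)
    return mixed


# Placeholder examples tagged by origin (invented for illustration).
human_data = [("human", i) for i in range(100)]
synthetic_data = [("synthetic", i) for i in range(200)]

# Sweep the replacement ratio; each mix would be used to train one model.
mixes = {f: mix_training_set(human_data, synthetic_data, f)
         for f in (0.0, 0.5, 0.9, 1.0)}
```

Training and evaluating one model per entry in `mixes` would trace the performance curve the abstract describes, including the drop between the 90% and 100% synthetic settings.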

[NLP-115] Communication-Efficient and Tensorized Federated Fine-Tuning of Large Language Models

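[Quick read]: This paper addresses how to fine-tune large language models (LLMs) efficiently and privately when the data is distributed across multiple devices. The key to the approach is FedTT and FedTT+, which integrate tensorized adapters into the encoder/decoder blocks of client-side models to enable parameter-efficient fine-tuning (PEFT) within federated learning (FL). FedTT applies to both cross-device and cross-silo FL, while FedTT+ adaptively freezes portions of the tensor factors, further reducing the number of trainable parameters, improving robustness to data heterogeneity, and substantially lowering communication cost.

Link: https://arxiv.org/abs/2410.13097
Authors: Sajjad Ghiasvand,Yifan Yang,Zhiyu Xue,Mahnoosh Alizadeh,Zheng Zhang,Ramtin Pedarsani
Keywords (EN): Large Language Models, Large Language, assume that Large, Parameter-efficient fine-tuning, Language Models
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: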

Abstract:Parameter-efficient fine-tuning (PEFT) methods typically assume that Large Language Models (LLMs) are trained on data from a single device or client. However, real-world scenarios often require fine-tuning these models on private data distributed across multiple devices. Federated Learning (FL) offers an appealing solution by preserving user privacy, as sensitive data remains on local devices during training. Nonetheless, integrating PEFT methods into FL introduces two main challenges: communication overhead and data heterogeneity. In this paper, we introduce FedTT and FedTT+, methods for adapting LLMs by integrating tensorized adapters into client-side models’ encoder/decoder blocks. FedTT is versatile and can be applied to both cross-silo FL and large-scale cross-device FL. FedTT+, an extension of FedTT tailored for cross-silo FL, enhances robustness against data heterogeneity by adaptively freezing portions of tensor factors, further reducing the number of trainable parameters. Experiments on BERT and LLaMA models demonstrate that our proposed methods successfully address data heterogeneity challenges and perform on par or even better than existing federated PEFT approaches while achieving up to 10× reduction in communication cost.
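Summary: Parameter-efficient fine-tuning (PEFT) methods typically assume that Large Language Models (LLMs) are trained on data from a single device or client. Real-world scenarios, however, often require fine-tuning these models on private data distributed across multiple devices. Federated Learning (FL) offers an appealing solution that preserves user privacy, because sensitive data stays on local devices during training. Nonetheless, integrating PEFT methods into FL introduces two main challenges: communication overhead and data heterogeneity. In this paper, we introduce FedTT and FedTT+, methods that adapt LLMs by integrating tensorized adapters into the encoder/decoder blocks of client-side models. FedTT is versatile and applies to both cross-silo FL and large-scale cross-device FL. FedTT+, an extension of FedTT tailored to cross-silo FL, improves robustness to data heterogeneity by adaptively freezing portions of the tensor factors, further reducing the number of trainable parameters. Experiments on BERT and LLaMA models show that the proposed methods successfully address data heterogeneity and perform on par with or better than existing federated PEFT approaches while cutting communication cost by up to 10×.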
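
To see why tensorized adapters shrink what each client has to send, it helps to count parameters in the standard tensor-train (TT) matrix format, where a weight matrix of shape (m1·…·md) × (n1·…·nd) is stored as d cores of shape r(k-1) × mk × nk × rk with boundary ranks equal to 1. The sketch below is illustrative only: the 768 × 64 adapter shape, its factorization, and the TT rank of 5 are assumptions, not the configuration used in the paper, and the resulting ratio is not the paper's reported communication saving.

```python
from math import prod


def dense_params(in_factors, out_factors):
    """Parameter count of the full (unfactorized) weight matrix."""
    return prod(in_factors) * prod(out_factors)


def tt_params(in_factors, out_factors, rank):
    """Parameter count of a TT-matrix with a uniform internal rank."""
    d = len(in_factors)
    if len(out_factors) != d:
        raise ValueError("input and output need the same number of factors")
    ranks = [1] + [rank] * (d - 1) + [1]  # boundary ranks are 1
    return sum(ranks[k] * in_factors[k] * out_factors[k] * ranks[k + 1]
               for k in range(d))


# Hypothetical adapter projection: 768 = 8*8*12 inputs, 64 = 4*4*4 outputs.
IN_FACTORS, OUT_FACTORS, RANK = [8, 8, 12], [4, 4, 4], 5

dense = dense_params(IN_FACTORS, OUT_FACTORS)   # 49152
tensorized = tt_params(IN_FACTORS, OUT_FACTORS, RANK)  # 160 + 800 + 240 = 1200
```

In a federated round only the trainable cores need to be exchanged, so fewer core parameters translate directly into less traffic per client; freezing some cores, as FedTT+ does, would reduce the exchanged count further.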

[NLP-116] Self-Comparison for Dataset-Level Membership Inference in Large (Vision-)Language Models

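[Quick read]: This paper addresses possible copyright infringement in the training of large language models (LLMs) and vision-language models (VLMs), in particular identifying copyrighted material in training data through dataset-level membership inference attacks (MIA). The key to the approach is a new dataset-level membership inference method based on self-comparison. The method follows a member prefix with a non-member suffix (obtained by paraphrasing the member suffix), exploits the model's memorization of training data, and evaluates how the sequence likelihood changes before and after paraphrasing in order to infer dataset membership. It requires no access to ground-truth member data or to non-member data from the same distribution as the test data, which makes it more practical, and it outperforms traditional MIA and dataset inference techniques across a range of datasets and models.

Link: https://arxiv.org/abs/2410.13088
Authors: Jie Ren,Kangrui Chen,Chen Chen,Vikash Sehwag,Yue Xing,Jiliang Tang,Lingjuan Lyu
Keywords (EN): natural language processing, Large Language Models, made significant advancements, Large Language, natural language
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: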

Abstract:Large Language Models (LLMs) and Vision-Language Models (VLMs) have made significant advancements in a wide range of natural language processing and vision-language tasks. Access to large web-scale datasets has been a key factor in their success. However, concerns have been raised about the unauthorized use of copyrighted materials and potential copyright infringement. Existing methods, such as sample-level Membership Inference Attacks (MIA) and distribution-based dataset inference, distinguish member data (data used for training) and non-member data by leveraging the common observation that models tend to memorize and show greater confidence in member data. Nevertheless, these methods face challenges when applied to LLMs and VLMs, such as the requirement for ground-truth member data or non-member data that shares the same distribution as the test data. In this paper, we propose a novel dataset-level membership inference method based on Self-Comparison. We find that a member prefix followed by a non-member suffix (paraphrased from a member suffix) can further trigger the model’s memorization on training data. Instead of directly comparing member and non-member data, we introduce paraphrasing to the second half of the sequence and evaluate how the likelihood changes before and after paraphrasing. Unlike prior approaches, our method does not require access to ground-truth member data or non-member data in identical distribution, making it more practical. Extensive experiments demonstrate that our proposed method outperforms traditional MIA and dataset inference techniques across various datasets and models, including public models, fine-tuned models, and API-based commercial models.
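Summary: Large Language Models (LLMs) and Vision-Language Models (VLMs) have made significant advances across natural language processing and vision-language tasks. Access to large web-scale datasets has been a key factor in their success. Concerns have been raised, however, about the unauthorized use of copyrighted materials and potential copyright infringement. Existing methods, such as sample-level Membership Inference Attacks (MIA) and distribution-based dataset inference, distinguish member data (data used for training) from non-member data by exploiting the common observation that models tend to memorize member data and show greater confidence on it. These methods face challenges when applied to LLMs and VLMs, such as requiring ground-truth member data or non-member data that shares the distribution of the test data. In this paper, we propose a new dataset-level membership inference method based on Self-Comparison. We find that a member prefix followed by a non-member suffix (paraphrased from a member suffix) can further trigger the model's memorization of training data. Rather than comparing member and non-member data directly, we paraphrase the second half of the sequence and evaluate how the likelihood changes before and after paraphrasing. Unlike prior approaches, our method does not require access to ground-truth member data or identically distributed non-member data, which makes it more practical. Extensive experiments show that the proposed method outperforms traditional MIA and dataset inference techniques across various datasets and models, including public models, fine-tuned models, and API-based commercial models.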
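
The self-comparison idea (score the original suffix and its paraphrase under the same prefix, then look at the change) can be sketched as follows. Everything concrete here is an assumption for illustration: `loglik` and `paraphrase` are hypothetical callables that a real setup would back with a language model and a paraphraser, the toy scorer simply pretends to have memorized one sequence, and the abstract does not specify which dataset-level statistic the authors compute from these changes.

```python
def likelihood_drops(samples, paraphrase, loglik):
    """For each (prefix, suffix), return loglik(original) - loglik(paraphrased)."""
    drops = []
    for prefix, suffix in samples:
        original = loglik(prefix, suffix)
        rewritten = loglik(prefix, paraphrase(suffix))
        drops.append(original - rewritten)
    return drops


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# --- Toy stand-ins (hypothetical, for illustration only) ---
MEMORIZED = {("the cat sat", "on the mat")}


def toy_loglik(prefix, suffix):
    """Pretend model: memorized continuations get a much higher log-likelihood."""
    return -1.0 if (prefix, suffix) in MEMORIZED else -5.0


def toy_paraphrase(suffix):
    """Pretend paraphraser: a trivial rewording."""
    return suffix.replace("on the", "upon the")


candidate_dataset = [("the cat sat", "on the mat")]
unseen_dataset = [("a dog ran", "on the road")]

candidate_drop = mean(likelihood_drops(candidate_dataset, toy_paraphrase, toy_loglik))
unseen_drop = mean(likelihood_drops(unseen_dataset, toy_paraphrase, toy_loglik))
```

In this toy setting the memorized dataset shows a large likelihood drop after paraphrasing while the unseen one shows none, which is the kind of signal a dataset-level test would aggregate.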

[NLP-117] Reverse-Engineering the Reader

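[Quick read]: This paper asks whether a language model can be directly optimized to be an effective cognitive model by aligning it to human psychometric data. The key to the approach is a novel alignment technique: the language model is fine-tuned to implicitly optimize the parameters of a linear regressor that directly predicts human reading times for in-context linguistic units (such as phonemes, morphemes, or words), taking surprisal estimates from the language model as input. The results show that the method improves the model's psychometric predictive power, but they also reveal an inverse relationship between psychometric power and both the model's performance on downstream NLP tasks and its perplexity on held-out test data.

Link: https://arxiv.org/abs/2410.13086
Authors: Samuel Kiegeland,Ethan Gotlieb Wilcox,Afra Amini,David Robert Reich,Ryan Cotterell
Keywords (EN): Numerous previous studies, natural language text, Numerous previous, extent language models, pretrained on natural
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: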

Abstract:Numerous previous studies have sought to determine to what extent language models, pretrained on natural language text, can serve as useful models of human cognition. In this paper, we are interested in the opposite question: whether we can directly optimize a language model to be a useful cognitive model by aligning it to human psychometric data. To achieve this, we introduce a novel alignment technique in which we fine-tune a language model to implicitly optimize the parameters of a linear regressor that directly predicts humans’ reading times of in-context linguistic units, e.g., phonemes, morphemes, or words, using surprisal estimates derived from the language model. Using words as a test case, we evaluate our technique across multiple model sizes and datasets and find that it improves language models’ psychometric predictive power. However, we find an inverse relationship between psychometric power and a model’s performance on downstream NLP tasks as well as its perplexity on held-out test data. While this latter trend has been observed before (Oh et al., 2022; Shain et al., 2024), we are the first to induce it by manipulating a model’s alignment to psychometric data.
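Summary: Many previous studies have sought to determine to what extent language models pretrained on natural language text can serve as useful models of human cognition. In this paper, we are interested in the opposite question: whether we can directly optimize a language model to be a useful cognitive model by aligning it to human psychometric data. To this end, we introduce a novel alignment technique in which we fine-tune a language model to implicitly optimize the parameters of a linear regressor that directly predicts humans' reading times for in-context linguistic units (e.g., phonemes, morphemes, or words), using surprisal estimates derived from the language model. Using words as a test case, we evaluate the technique across multiple model sizes and datasets and find that it improves language models' psychometric predictive power. However, we find an inverse relationship between psychometric power and a model's performance on downstream NLP tasks as well as its perplexity on held-out test data. While this latter trend has been observed before (Oh et al., 2022; Shain et al., 2024), we are the first to induce it by manipulating a model's alignment to psychometric data.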
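
The regressor at the centre of this setup maps surprisal to reading time. The sketch below shows only that ingredient, with fixed word probabilities and an ordinary least-squares fit; it does not implement the paper's fine-tuning procedure, and the probabilities and reading times are invented toy numbers. Surprisal is computed as the negative base-2 log probability of a word in context.

```python
import math


def surprisal(prob):
    """Surprisal in bits: -log2 of the word's in-context probability."""
    return -math.log2(prob)


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy data: per-word probabilities from some language model and
# per-word reading times in milliseconds (both made up).
word_probs = [0.5, 0.25, 0.125]
reading_times_ms = [220.0, 240.0, 260.0]

surprisals = [surprisal(p) for p in word_probs]
intercept, slope = fit_line(surprisals, reading_times_ms)
predicted = [intercept + slope * s for s in surprisals]
```

A fit like this yields a psychometric score (for example, how well `predicted` matches held-out reading times); the alignment technique in the abstract goes further by updating the language model so that its surprisals make such a regressor fit better.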

[NLP-118] MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models

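[Quick read]: This paper addresses factual hallucination in Medical Large Vision-Language Models (Med-LVLMs) used for disease diagnosis and treatment planning, where a model may generate inaccurate or incorrect information. The key to the approach is a multimodal retrieval-augmented generation system, MMed-RAG, which introduces a domain-aware retrieval mechanism, an adaptive method for selecting retrieved contexts, and a provable RAG-based preference fine-tuning strategy. Together these significantly improve alignment and factual accuracy when retrieved contexts are introduced, yielding an average 43.8% improvement in factual accuracy across multiple medical datasets.

Link: https://arxiv.org/abs/2410.13085
Authors: Peng Xia,Kangyu Zhu,Haoran Li,Tianze Wang,Weijia Shi,Sheng Wang,Linjun Zhang,James Zou,Huaxiu Yao
Keywords (EN): Artificial Intelligence, demonstrated significant potential, potential in healthcare, treatment planning, demonstrated significant
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: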

Abstract:Artificial Intelligence (AI) has demonstrated significant potential in healthcare, particularly in disease diagnosis and treatment planning. Recent progress in Medical Large Vision-Language Models (Med-LVLMs) has opened up new possibilities for interactive diagnostic tools. However, these models often suffer from factual hallucination, which can lead to incorrect diagnoses. Fine-tuning and retrieval-augmented generation (RAG) have emerged as methods to address these issues. However, the amount of high-quality data and distribution shifts between training data and deployment data limit the application of fine-tuning methods. Although RAG is lightweight and effective, existing RAG-based approaches are not sufficiently general to different medical domains and can potentially cause misalignment issues, both between modalities and between the model and the ground truth. In this paper, we propose a versatile multimodal RAG system, MMed-RAG, designed to enhance the factuality of Med-LVLMs. Our approach introduces a domain-aware retrieval mechanism, an adaptive retrieved contexts selection method, and a provable RAG-based preference fine-tuning strategy. These innovations make the RAG process sufficiently general and reliable, significantly improving alignment when introducing retrieved contexts. Experimental results across five medical datasets (involving radiology, ophthalmology, pathology) on medical VQA and report generation demonstrate that MMed-RAG can achieve an average improvement of 43.8% in the factual accuracy of Med-LVLMs. Our data and code are available in this https URL.
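Summary: Artificial Intelligence (AI) has shown significant potential in healthcare, particularly in disease diagnosis and treatment planning. Recent progress in Medical Large Vision-Language Models (Med-LVLMs) has opened up new possibilities for interactive diagnostic tools. These models often suffer from factual hallucination, however, which can lead to incorrect diagnoses. Fine-tuning and retrieval-augmented generation (RAG) have emerged as ways to address the problem. Yet the limited amount of high-quality data and the distribution shift between training and deployment data restrict the use of fine-tuning. And although RAG is lightweight and effective, existing RAG-based approaches do not generalize well enough across medical domains and can cause misalignment, both between modalities and between the model and the ground truth. In this paper, we propose a versatile multimodal RAG system, MMed-RAG, designed to improve the factuality of Med-LVLMs. Our approach introduces a domain-aware retrieval mechanism, an adaptive method for selecting retrieved contexts, and a provable RAG-based preference fine-tuning strategy. These innovations make the RAG process sufficiently general and reliable, significantly improving alignment when retrieved contexts are introduced. Experiments on medical VQA and report generation across five medical datasets (covering radiology, ophthalmology, and pathology) show that MMed-RAG improves the factual accuracy of Med-LVLMs by 43.8% on average. Our data and code are available at this https URL.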

[NLP-119] Graph-constrained Reasoning: Faithful Reasoning on Knowledge Graphs with Large Language Models

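[Quick read]: This paper addresses inaccurate reasoning in large language models (LLMs) caused by knowledge gaps and hallucinations. The key to the approach is a new framework, Graph-Constrained Reasoning (GCR), which combines the structured knowledge in knowledge graphs (KGs) with the unstructured reasoning of LLMs to keep reasoning accurate and faithful. Specifically, GCR uses KG-Trie, a trie-based index that encodes KG reasoning paths, and integrates it into the LLM decoding process to constrain decoding, so that the LLM reasons directly on the graph and generates faithful reasoning paths grounded in the KG. GCR also pairs a lightweight KG-specialized LLM for graph-constrained reasoning with a powerful general LLM for inductive reasoning over multiple reasoning paths, achieving accurate reasoning without reasoning hallucinations.

Link: https://arxiv.org/abs/2410.13080
Authors: Linhao Luo,Zicheng Zhao,Chen Gong,Gholamreza Haffari,Shirui Pan
Keywords (EN): Large language models, Large language, impressive reasoning abilities, demonstrated impressive reasoning, reasoning
Categories: Computation and Language (cs.CL)
Comments: 21 pages, 10 figures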

Abstract:Large language models (LLMs) have demonstrated impressive reasoning abilities, but they still struggle with faithful reasoning due to knowledge gaps and hallucinations. To address these issues, knowledge graphs (KGs) have been utilized to enhance LLM reasoning through their structured knowledge. However, existing KG-enhanced methods, either retrieval-based or agent-based, encounter difficulties in accurately retrieving knowledge and efficiently traversing KGs at scale. In this work, we introduce graph-constrained reasoning (GCR), a novel framework that bridges structured knowledge in KGs with unstructured reasoning in LLMs. To eliminate hallucinations, GCR ensures faithful KG-grounded reasoning by integrating KG structure into the LLM decoding process through KG-Trie, a trie-based index that encodes KG reasoning paths. KG-Trie constrains the decoding process, allowing LLMs to directly reason on graphs and generate faithful reasoning paths grounded in KGs. Additionally, GCR leverages a lightweight KG-specialized LLM for graph-constrained reasoning alongside a powerful general LLM for inductive reasoning over multiple reasoning paths, resulting in accurate reasoning with zero reasoning hallucination. Extensive experiments on several KGQA benchmarks demonstrate that GCR achieves state-of-the-art performance and exhibits strong zero-shot generalizability to unseen KGs without additional training.
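Summary: Large language models (LLMs) have demonstrated impressive reasoning abilities, but they still struggle with faithful reasoning because of knowledge gaps and hallucinations. To address these issues, knowledge graphs (KGs) have been used to enhance LLM reasoning with their structured knowledge. Existing KG-enhanced methods, whether retrieval-based or agent-based, have difficulty retrieving knowledge accurately and traversing KGs efficiently at scale. In this work, we introduce graph-constrained reasoning (GCR), a new framework that bridges structured knowledge in KGs with unstructured reasoning in LLMs. To eliminate hallucinations, GCR ensures faithful KG-grounded reasoning by integrating KG structure into the LLM decoding process through KG-Trie, a trie-based index that encodes KG reasoning paths. KG-Trie constrains the decoding process, allowing LLMs to reason directly on graphs and generate faithful reasoning paths grounded in KGs. In addition, GCR uses a lightweight KG-specialized LLM for graph-constrained reasoning together with a powerful general LLM for inductive reasoning over multiple reasoning paths, resulting in accurate reasoning with zero reasoning hallucination. Extensive experiments on several KGQA benchmarks show that GCR achieves state-of-the-art performance and exhibits strong zero-shot generalization to unseen KGs without additional training.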
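
Trie-constrained decoding can be sketched with a nested dictionary: index the valid paths, and at each step let the decoder choose only among the children of the current prefix. This is a minimal sketch under stated assumptions, not the authors' KG-Trie: paths are tokenized as whole entity and relation names, `score` is a hypothetical stand-in for the language model's next-token preference, and the greedy loop replaces whatever decoding strategy the paper uses.

```python
END = "<end>"  # marks that a path may stop at this node


def build_trie(paths):
    """Index token sequences (reasoning paths) in a nested-dict trie."""
    root = {}
    for path in paths:
        node = root
        for token in path:
            node = node.setdefault(token, {})
        node[END] = {}
    return root


def allowed_next(trie, prefix):
    """Tokens that may follow `prefix`; empty if the prefix leaves the graph."""
    node = trie
    for token in prefix:
        if token not in node:
            return set()
        node = node[token]
    return set(node)


def constrained_greedy(trie, score):
    """Greedy decoding where only trie-allowed tokens are ever considered."""
    output = []
    while True:
        options = allowed_next(trie, output)
        if not options:
            return output
        best = max(sorted(options), key=lambda tok: score(output, tok))
        if best == END:
            return output
        output.append(best)


# Hypothetical reasoning paths starting from the entity "Alice".
PATHS = [
    ["Alice", "born_in", "Paris"],
    ["Alice", "works_at", "Acme"],
]


def prefer_birthplace(prefix, token):
    """Toy scorer standing in for model logits: favours the born_in relation."""
    return 2.0 if token == "born_in" else 1.0


trie = build_trie(PATHS)
path = constrained_greedy(trie, prefer_birthplace)
```

Because candidates are drawn only from the trie, a relation or entity that is not on some indexed path can never be emitted, however highly the scorer would rate it.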

[NLP-120] Tuning Language Models by Mixture-of-Depths Ensemble

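[Quick read]: This paper addresses the over-reliance of conventional Transformer-based large language models (LLMs) on the final layer for training and prediction, and proposes a new tuning framework, Mixture-of-Depths (MoD). The key to the approach is exploiting the predictive power of intermediate layers: late layers are trained as an ensemble that contributes to the final logits through learned routing weights. An auxiliary distillation loss and additional normalization modules ensure that the outputs of the late layers adapt to language modeling. The method shows consistent improvement across language modeling tasks, and when MoD replaces traditional trainable modules it achieves similar performance with far fewer trainable parameters, demonstrating the potential of using intermediate representations during training.

Link: https://arxiv.org/abs/2410.13077
Authors: Haoyan Luo,Lucia Specia
Keywords (EN): Transformer-based Large Language, Large Language Models, Transformer-based Large, Language Models, traditionally rely
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: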

Abstract:Transformer-based Large Language Models (LLMs) traditionally rely on final-layer loss for training and final-layer representations for predictions, potentially overlooking the predictive power embedded in intermediate layers. Surprisingly, we find that focusing training efforts on these intermediate layers can yield training losses comparable to those of final layers, with complementary test-time performance. We introduce a novel tuning framework, Mixture-of-Depths (MoD), which trains late layers as ensembles contributing to the final logits through learned routing weights. With the auxiliary distillation loss and additional normalization modules, we ensure that the outputs of the late layers adapt to language modeling. Our MoD framework, which can be integrated with any existing tuning method, shows consistent improvement on various language modelling tasks. Furthermore, by replacing traditional trainable modules with MoD, our approach achieves similar performance with significantly fewer trainable parameters, demonstrating the potential of leveraging predictive power from intermediate representations during training.
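Summary: Transformer-based Large Language Models (LLMs) traditionally rely on final-layer loss for training and final-layer representations for prediction, potentially overlooking the predictive power embedded in intermediate layers. Surprisingly, we find that focusing training on these intermediate layers can yield training losses comparable to those of the final layer, with complementary test-time performance. We introduce a novel tuning framework, Mixture-of-Depths (MoD), which trains late layers as an ensemble that contributes to the final logits through learned routing weights. With an auxiliary distillation loss and additional normalization modules, we ensure that the outputs of the late layers adapt to language modeling. The MoD framework can be integrated with any existing tuning method and shows consistent improvement on various language modeling tasks. Furthermore, by replacing traditional trainable modules with MoD, our approach achieves similar performance with significantly fewer trainable parameters, demonstrating the potential of using predictive power from intermediate representations during training.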
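
One simple way to read "late layers contribute to the final logits through learned routing weights" is a softmax-weighted sum of per-layer logits. The sketch below assumes exactly that combination rule, which the abstract does not spell out; the names `mix_logits` and `routing_scores` and the toy numbers are hypothetical, and the distillation loss and normalization modules are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_logits(layer_logits, routing_scores):
    """Weighted sum of per-layer logits; weights = softmax(routing_scores)."""
    weights = softmax(routing_scores)
    vocab_size = len(layer_logits[0])
    return [sum(w * logits[v] for w, logits in zip(weights, layer_logits))
            for v in range(vocab_size)]


# Toy logits over a 2-token vocabulary from the last two layers (made up).
late_layer_logits = [[1.0, 3.0], [3.0, 1.0]]

# Equal routing scores give a plain average of the two layers.
averaged = mix_logits(late_layer_logits, [0.0, 0.0])

# A much larger score for the final layer makes it dominate the mixture.
final_heavy = mix_logits(late_layer_logits, [0.0, 10.0])
```

In training, the routing scores would be learnable parameters updated alongside the tuned layers, so the model can decide how much each depth contributes.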

[NLP-121] PromptExp: Multi-granularity Prompt Explanation of Large Language Models

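[Quick read]: This paper addresses the interpretability of prompts for large language models (LLMs), which is hard to achieve because of the models' black-box nature. The key to the approach is a framework called OurTool, which produces multi-granularity prompt explanations by aggregating token-level insights. OurTool includes two token-level explanation approaches: an aggregation-based approach that combines local explanation techniques, and a perturbation-based approach that uses novel techniques to evaluate the impact of token masking. The framework supports both white-box and black-box explanations and extends explanations to higher granularity levels, enabling flexible analysis. Case studies and a user study confirm OurTool's effectiveness and practical value for improving LLM interpretability.

Link: https://arxiv.org/abs/2410.13073
Authors: Ximing Dong,Shaowei Wang,Dayi Lin,Gopi Krishnan Rajbahadur,Boquan Zhou,Shichao Liu,Ahmed E. Hassan
Keywords (EN): Large Language Models, Large Language, Language Models excel, text generation, natural language understanding
Categories: Computation and Language (cs.CL)
Comments: 11 pages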

Abstract:Large Language Models excel in tasks like natural language understanding and text generation. Prompt engineering plays a critical role in leveraging LLMs effectively. However, LLMs' black-box nature hinders their interpretability and effective prompt engineering. A wide range of model explanation approaches have been developed for deep learning models. However, these local explanations are designed for single-output tasks like classification and regression, and cannot be directly applied to LLMs, which generate sequences of tokens. Recent efforts in LLM explanation focus on natural language explanations, but they are prone to hallucinations and inaccuracies. To address this, we introduce OurTool, a framework for multi-granularity prompt explanations by aggregating token-level insights. OurTool introduces two token-level explanation approaches: 1. an aggregation-based approach combining local explanation techniques, and 2. a perturbation-based approach with novel techniques to evaluate token masking impact. OurTool supports both white-box and black-box explanations and extends explanations to higher granularity levels, enabling flexible analysis. We evaluate OurTool in case studies such as sentiment analysis, showing the perturbation-based approach performs best using semantic similarity to assess perturbation impact. Furthermore, we conducted a user study to confirm OurTool’s accuracy and practical value, and demonstrate its potential to enhance LLM interpretability.
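Summary: Large language models excel at tasks such as natural language understanding and text generation. Prompt engineering plays a critical role in using LLMs effectively. However, the black-box nature of LLMs hinders their interpretability and effective prompt engineering. Many model explanation approaches have been developed for deep learning models, but these local explanation methods are designed for single-output tasks such as classification and regression and cannot be applied directly to LLMs, which generate sequences of tokens. Recent work on LLM explanation focuses on natural language explanations, which are prone to hallucinations and inaccuracies. To address this, we introduce OurTool, a framework that produces multi-granularity prompt explanations by aggregating token-level insights. OurTool introduces two token-level explanation approaches: 1. an aggregation-based approach that combines local explanation techniques, and 2. a perturbation-based approach that uses novel techniques to evaluate the impact of token masking. OurTool supports both white-box and black-box explanations and extends explanations to higher granularity levels, enabling flexible analysis. We evaluate OurTool in case studies such as sentiment analysis; the perturbation-based approach performs best when semantic similarity is used to assess perturbation impact. We also conducted a user study that confirms OurTool's accuracy and practical value and demonstrates its potential to improve LLM interpretability.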
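
A perturbation-based token explanation can be sketched by masking one prompt token at a time and measuring how much the model output changes. This is a rough sketch with several stated assumptions: `model` is a hypothetical callable standing in for an LLM call, word-overlap (Jaccard) similarity is swapped in for the semantic similarity the abstract mentions, and the `[MASK]` placeholder and toy sentiment model are invented for the example.

```python
def word_overlap(a, b):
    """Jaccard similarity over words (a stand-in for semantic similarity)."""
    wa, wb = set(a.split()), set(b.split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def token_importance(tokens, model, similarity, mask="[MASK]"):
    """Score each token by how much masking it changes the model output."""
    baseline = model(" ".join(tokens))
    scores = []
    for i in range(len(tokens)):
        perturbed = tokens[:i] + [mask] + tokens[i + 1:]
        changed = model(" ".join(perturbed))
        scores.append(1.0 - similarity(baseline, changed))
    return scores


def toy_model(prompt):
    """Pretend LLM: answers positively only if the word 'great' is present."""
    return "positive review" if "great" in prompt.split() else "negative review"


prompt_tokens = ["the", "movie", "was", "great"]
scores = token_importance(prompt_tokens, toy_model, word_overlap)
most_important = prompt_tokens[scores.index(max(scores))]
```

Token-level scores like these could then be summed or averaged over words, sentences, or prompt components to obtain coarser-grained explanations.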

[NLP-122] Is Semantic Chunking Worth the Computational Cost?

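[Quick read]: This paper evaluates how well semantic chunking actually works in retrieval-augmented generation (RAG) systems, and in particular whether it delivers a significant performance gain over simple fixed-size chunking. The key to the approach is a systematic comparison on three common retrieval tasks (document retrieval, evidence retrieval, and retrieval-based answer generation). It finds that semantic chunking, despite its theoretical appeal, does not yield consistent performance gains in practice and carries a higher computational cost, so more efficient chunking strategies are needed.

Link: https://arxiv.org/abs/2410.13070
Authors: Renyi Qu, Ruixuan Tu, Forrest Bao
Keywords (EN): semantically coherent segments, Recent advances, advances in Retrieval-Augmented, aims to improve, semantically coherent
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: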

Abstract:Recent advances in Retrieval-Augmented Generation (RAG) systems have popularized semantic chunking, which aims to improve retrieval performance by dividing documents into semantically coherent segments. Despite its growing adoption, the actual benefits over simpler fixed-size chunking, where documents are split into consecutive, fixed-size segments, remain unclear. This study systematically evaluates the effectiveness of semantic chunking using three common retrieval-related tasks: document retrieval, evidence retrieval, and retrieval-based answer generation. The results show that the computational costs associated with semantic chunking are not justified by consistent performance gains. These findings challenge the previous assumptions about semantic chunking and highlight the need for more efficient chunking strategies in RAG systems.
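Summary: Recent progress in Retrieval-Augmented Generation (RAG) systems has popularized semantic chunking, which aims to improve retrieval performance by splitting documents into semantically coherent segments. Although semantic chunking is increasingly widely used, its actual advantage over simple fixed-size chunking, in which documents are split into consecutive fixed-size segments, remains unclear. This study systematically evaluates the effectiveness of semantic chunking on three common retrieval-related tasks: document retrieval, evidence retrieval, and retrieval-based answer generation. The results show that the computational cost of semantic chunking is not justified by consistent performance gains. These findings challenge earlier assumptions about semantic chunking and underline the need for more efficient chunking strategies in RAG systems.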
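
To make the two strategies concrete, here is a minimal sketch that contrasts fixed-size chunking with a simple breakpoint-style semantic chunker. It is not the paper's code: the word-overlap `jaccard` similarity and the `threshold` value are assumptions standing in for the embedding model a real semantic chunker would call, and that call is where the extra computational cost comes from.

```python
from typing import Callable, List


def fixed_size_chunks(sentences: List[str], chunk_size: int) -> List[List[str]]:
    """Split into consecutive groups of chunk_size sentences."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [sentences[i:i + chunk_size] for i in range(0, len(sentences), chunk_size)]


def jaccard(a: str, b: str) -> float:
    """Toy similarity: word overlap between two sentences (an assumption)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def semantic_chunks(
    sentences: List[str],
    similarity: Callable[[str, str], float] = jaccard,
    threshold: float = 0.2,
) -> List[List[str]]:
    """Start a new chunk whenever adjacent sentences are not similar enough."""
    chunks: List[List[str]] = []
    for sentence in sentences:
        if chunks and similarity(chunks[-1][-1], sentence) >= threshold:
            chunks[-1].append(sentence)
        else:
            chunks.append([sentence])
    return chunks


if __name__ == "__main__":
    doc = ["cats purr", "cats sleep", "stocks fell"]
    print(fixed_size_chunks(doc, 2))
    print(semantic_chunks(doc))
```

The fixed-size version needs no model at all, while the semantic version makes one similarity call per adjacent sentence pair.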

[NLP-123] Language Models as Semiotic Machines: Reconceptualizing AI Language Systems through Structuralist and Post-Structuralist Theories of Language

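[Quick read]: This paper addresses the question of how large language models (LLMs) should be understood, proposing a new framework that reconceptualizes LLMs as semiotic machines rather than imitations of human cognition. The key to the approach is to draw on structuralist and post-structuralist theories of language, in particular the views of Saussure and Derrida, and to treat LLMs as models of language itself rather than merely a mapping of human thought. By understanding LLMs as statistical approximations of sign behavior, the paper offers a new perspective for assessing their strengths and weaknesses and opens new directions for future research.

Link: https://arxiv.org/abs/2410.13065
Authors: Elad Vromen
Keywords (EN): understanding large language, large language models, human cognition, understanding large, imitations of human
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 18 pages, 2 figures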

Abstract:This paper proposes a novel framework for understanding large language models (LLMs) by reconceptualizing them as semiotic machines rather than as imitations of human cognition. Drawing from structuralist and post-structuralist theories of language, specifically the works of Ferdinand de Saussure and Jacques Derrida, I argue that LLMs should be understood as models of language itself, aligning with Derrida’s concept of ‘writing’ (l’ecriture). The paper is structured into three parts. First, I lay the theoretical groundwork by explaining how the word2vec embedding algorithm operates within Saussure’s framework of language as a relational system of signs. Second, I apply Derrida’s critique of Saussure to position ‘writing’ as the object modeled by LLMs, offering a view of the machine’s ‘mind’ as a statistical approximation of sign behavior. Finally, the third section addresses how modern LLMs reflect post-structuralist notions of unfixed meaning, arguing that the “next token generation” mechanism effectively captures the dynamic nature of meaning. By reconceptualizing LLMs as semiotic machines rather than cognitive models, this framework provides an alternative lens through which to assess the strengths and limitations of LLMs, offering new avenues for future research.
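Summary: This paper proposes a new framework for understanding large language models (LLMs) by reconceptualizing them as semiotic machines rather than imitations of human cognition. Drawing on structuralist and post-structuralist theories of language, in particular the works of Ferdinand de Saussure and Jacques Derrida, I argue that LLMs should be understood as models of language itself, in line with Derrida’s concept of ‘writing’ (l’ecriture). The paper has three parts. First, I lay the theoretical groundwork by explaining how the word2vec embedding algorithm operates within Saussure’s framework of language as a relational system of signs. Second, I apply Derrida’s critique of Saussure to position ‘writing’ as the object modeled by LLMs, offering a view of the machine’s ‘mind’ as a statistical approximation of sign behavior. Finally, the third part examines how modern LLMs reflect post-structuralist notions of unfixed meaning, arguing that the “next token generation” mechanism effectively captures the dynamic nature of meaning. By reconceptualizing LLMs as semiotic machines rather than cognitive models, this framework provides a new lens for assessing the strengths and limitations of LLMs and opens new avenues for future research.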

[NLP-124] ERAS: Evaluating the Robustness of Chinese NLP Models to Morphological Garden Path Errors NAACL

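[Quick read]: This paper addresses the "garden path errors" that Chinese NLP models tend to make when faced with morphological ambiguity. The key to the approach is a benchmark called ERAS, which measures how much a model relies on morphosyntactic context by comparing its behavior on sentences with local word segmentation ambiguities against unambiguous sentences. The results show that sentiment analysis models with character-level tokenization make implicit garden path errors even without an explicit word segmentation step, which indicates that models often fail to take morphosyntactic context fully into account when processing Chinese text.

Link: https://arxiv.org/abs/2410.13057
Authors: Qinchan Li, Sophie Hao
Keywords (EN): garden path errors, NLP models perform, orthographic word boundaries, garden path, Chinese NLP models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under review in ARR/NAACL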

Abstract:In languages without orthographic word boundaries, NLP models perform word segmentation, either as an explicit preprocessing step or as an implicit step in an end-to-end computation. This paper shows that Chinese NLP models are vulnerable to morphological garden path errors: errors caused by a failure to resolve local word segmentation ambiguities using sentence-level morphosyntactic context. We propose a benchmark, ERAS, that tests a model’s vulnerability to morphological garden path errors by comparing its behavior on sentences with and without local segmentation ambiguities. Using ERAS, we show that word segmentation models make garden path errors on locally ambiguous sentences, but do not make equivalent errors on unambiguous sentences. We further show that sentiment analysis models with character-level tokenization make implicit garden path errors, even without an explicit word segmentation step in the pipeline. Our results indicate that models’ segmentation of Chinese text often fails to account for morphosyntactic context.
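Summary: In languages without orthographic word boundaries, natural language processing (NLP) models perform word segmentation, either as an explicit preprocessing step or as an implicit step in an end-to-end computation. This paper shows that Chinese NLP models are vulnerable to morphological garden path errors: errors caused by a failure to use sentence-level morphosyntactic context to resolve local word segmentation ambiguities. We propose a benchmark, ERAS, that tests a model’s susceptibility to morphological garden path errors by comparing its behavior on sentences with and without local segmentation ambiguities. Using ERAS, we find that word segmentation models make garden path errors on locally ambiguous sentences but do not make comparable errors on unambiguous sentences. We further show that sentiment analysis models with character-level tokenization make implicit garden path errors even without an explicit word segmentation step. Our results indicate that models’ segmentation of Chinese text often fails to take morphosyntactic context fully into account.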

[NLP-125] Channel-Wise Mixed-Precision Quantization for Large Language Models

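[Quick read]: This paper addresses the excessive memory requirements that large language models (LLMs) face when deployed on edge devices. The key to the solution is a new mixed-precision quantization method called Channel-Wise Mixed-Precision Quantization (CMPQ). CMPQ assigns different quantization precisions at the channel level based on activation distributions, which lets it adapt to any bit-width constraint, and it uses a non-uniform quantization strategy together with two outlier extraction techniques that jointly preserve critical information and so minimize quantization loss. Experiments show that CMPQ not only improves performance on integer-bit quantization tasks but also achieves significant performance gains with a modest increase in memory usage, offering an adaptive and effective solution for deploying LLMs efficiently across different devices.

Link: https://arxiv.org/abs/2410.13056
Authors: Zihan Chen, Bike Xie, Jundong Li, Cong Shen
Keywords (EN): Large Language Models, Language Models, Large Language, demonstrated remarkable success, remains challenging due
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: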

Abstract:Large Language Models (LLMs) have demonstrated remarkable success across a wide range of language tasks, but their deployment on edge devices remains challenging due to the substantial memory requirements imposed by their large parameter sizes. Weight-only quantization presents a promising solution to reduce the memory footprint of LLMs. However, existing approaches primarily focus on integer-bit quantization, limiting their adaptability to fractional-bit quantization tasks and preventing the full utilization of available storage space on devices. In this paper, we introduce Channel-Wise Mixed-Precision Quantization (CMPQ), a novel mixed-precision quantization method that allocates quantization precision in a channel-wise pattern based on activation distributions. By assigning different precision levels to different weight channels, CMPQ can adapt to any bit-width constraint. CMPQ employs a non-uniform quantization strategy and incorporates two outlier extraction techniques that collaboratively preserve the critical information, thereby minimizing the quantization loss. Experiments on different sizes of LLMs demonstrate that CMPQ not only enhances performance in integer-bit quantization tasks but also achieves significant performance gains with a modest increase in memory usage. CMPQ thus represents an adaptive and effective approach to LLM quantization, offering substantial benefits across diverse device capabilities.
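Summary: Large Language Models (LLMs) have shown remarkable success across a wide range of language tasks, but deploying them on edge devices remains challenging because of the large memory requirements imposed by their parameter counts. Weight-only quantization offers a promising way to reduce the memory footprint of LLMs. However, existing methods focus mainly on integer-bit quantization, which limits their adaptability to fractional-bit quantization tasks and prevents full use of the storage available on a device. This paper introduces Channel-Wise Mixed-Precision Quantization (CMPQ), a novel mixed-precision quantization method that allocates quantization precision at the channel level based on activation distributions. By assigning different precision levels to different weight channels, CMPQ can adapt to any bit-width constraint. CMPQ uses a non-uniform quantization strategy combined with two outlier extraction techniques that jointly preserve critical information, thereby minimizing quantization loss. Experiments on LLMs of different sizes show that CMPQ not only improves performance on integer-bit quantization tasks but also achieves significant performance gains with a modest increase in memory usage. CMPQ is therefore an adaptive and effective approach to LLM quantization that offers substantial benefits across a range of device capabilities.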
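
The way channel-wise precision assignment can meet a fractional average bit budget is easiest to see in a small sketch. The code below is a simplified illustration under stated assumptions, not CMPQ itself: it ranks channels by a hypothetical per-channel importance score (for example, mean activation magnitude), gives the top fraction one extra bit, and uses plain uniform min-max quantization, whereas the paper describes a non-uniform scheme with outlier extraction.

```python
import math
from typing import List


def allocate_bits(importance: List[float], avg_bits: float) -> List[int]:
    """Give the most important channels one extra bit so the mean equals avg_bits."""
    low = math.floor(avg_bits)
    n_high = round((avg_bits - low) * len(importance))
    ranked = sorted(range(len(importance)), key=lambda c: importance[c], reverse=True)
    bits = [low] * len(importance)
    for channel in ranked[:n_high]:
        bits[channel] = low + 1
    return bits


def quantize_channel(weights: List[float], bits: int) -> List[float]:
    """Uniform min-max quantization to 2**bits levels (assumes bits >= 1)."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((w - lo) / step) * step for w in weights]


if __name__ == "__main__":
    # Hypothetical per-channel importance scores and a 3.5-bit budget.
    bits = allocate_bits([0.1, 5.0, 0.3, 2.0], avg_bits=3.5)
    print(bits, sum(bits) / len(bits))
```

With a 3.5-bit budget over four channels, two channels receive 4 bits and two receive 3, which is how a mixed-precision scheme can reach bit widths that a single integer precision cannot.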

[NLP-126] Supply Chain Network Extraction and Entity Classification Leveraging Large Language Models

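[Quick read]: This paper addresses the difficulty of mapping relationships and identifying entity roles that the complexity of supply chain networks creates. The key to the solution is to use natural language processing (NLP) and large language models (LLMs) to extract and process information from unstructured text and build a comprehensive supply chain graph. Specifically, the paper proposes a new method that uses LLMs to extract raw textual information from public sources and, in a case study of the civil engineering sector, shows how to uncover hidden relationships among companies, projects, and other entities. In addition, domain-specific LLM fine-tuning improves entity classification accuracy and so deepens the understanding of supply chain networks.

Link: https://arxiv.org/abs/2410.13051
Authors: Tong Liu, Hadi Meidani
Keywords (EN): Supply chain networks, Supply chain, increasing complexity presents, complexity presents significant, presents significant challenges
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 11 pages, 4 figures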

Abstract:Supply chain networks are critical to the operational efficiency of industries, yet their increasing complexity presents significant challenges in mapping relationships and identifying the roles of various entities. Traditional methods for constructing supply chain networks rely heavily on structured datasets and manual data collection, limiting their scope and efficiency. In contrast, recent advancements in Natural Language Processing (NLP) and large language models (LLMs) offer new opportunities for discovering and analyzing supply chain networks using unstructured text data. This paper proposes a novel approach that leverages LLMs to extract and process raw textual information from publicly available sources to construct a comprehensive supply chain graph. We focus on the civil engineering sector as a case study, demonstrating how LLMs can uncover hidden relationships among companies, projects, and other entities. Additionally, we fine-tune an LLM to classify entities within the supply chain graph, providing detailed insights into their roles and relationships. The results show that domain-specific fine-tuning improves classification accuracy, highlighting the potential of LLMs for industry-specific supply chain analysis. Our contributions include the development of a supply chain graph for the civil engineering sector, as well as a fine-tuned LLM model that enhances entity classification and understanding of supply chain networks.
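Summary: Supply chain networks are critical to the operational efficiency of industries, but their growing complexity makes it hard to map relationships and identify the roles of the various entities. Traditional methods for building supply chain networks rely heavily on structured datasets and manual data collection, which limits their scope and efficiency. By contrast, recent advances in natural language processing (NLP) and large language models (LLMs) offer new opportunities to discover and analyze supply chain networks from unstructured text. This paper proposes a new method that uses LLMs to extract and process raw textual information from publicly available sources in order to build a comprehensive supply chain graph. Taking the civil engineering sector as a case study, we show how LLMs can uncover hidden relationships among companies, projects, and other entities. We also fine-tune an LLM to classify the entities in the supply chain graph, giving detailed insight into their roles and relationships. The results show that domain-specific fine-tuning improves classification accuracy, highlighting the potential of LLMs for industry-specific supply chain analysis. Our contributions include a supply chain graph for the civil engineering sector and a fine-tuned LLM that improves entity classification and the understanding of supply chain networks.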

[NLP-127] LLM Confidence Evaluation Measures in Zero-Shot CSS Classification

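[Quick read]: This paper addresses how to assess classification confidence when large language models (LLMs) are used for automated labeling in Computational Social Science (CSS) tasks. The key to the solution is an uncertainty quantification (UQ) performance measure designed specifically for data annotation tasks, together with a comparison of five UQ strategies across three LLMs and CSS data annotation tasks. The paper also introduces a new UQ aggregation strategy that effectively identifies low-confidence LLM annotations and disproportionately uncovers data the LLM has labeled incorrectly, which significantly improves human-in-the-loop data annotation workflows.

Link: https://arxiv.org/abs/2410.13047
Authors: David Farr, Iain Cruickshank, Nico Manzonelli, Nicholas Clark, Kate Starbird, Jevin West
Keywords (EN): Computational Social Science, Assessing classification confidence, Social Science, Computational Social, large language models
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: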

Abstract:Assessing classification confidence is critical for leveraging large language models (LLMs) in automated labeling tasks, especially in the sensitive domains presented by Computational Social Science (CSS) tasks. In this paper, we make three key contributions: (1) we propose an uncertainty quantification (UQ) performance measure tailored for data annotation tasks, (2) we compare, for the first time, five different UQ strategies across three distinct LLMs and CSS data annotation tasks, (3) we introduce a novel UQ aggregation strategy that effectively identifies low-confidence LLM annotations and disproportionately uncovers data incorrectly labeled by the LLMs. Our results demonstrate that our proposed UQ aggregation strategy improves upon existing methods and can be used to significantly improve human-in-the-loop data annotation processes.
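Summary: Assessing classification confidence is critical for using large language models (LLMs) in automated labeling tasks, especially in the sensitive domains of Computational Social Science (CSS) tasks. This paper makes three key contributions: (1) we propose an uncertainty quantification (UQ) performance measure tailored to data annotation tasks; (2) we compare, for the first time, five different UQ strategies across three different LLMs and CSS data annotation tasks; (3) we introduce a novel UQ aggregation strategy that effectively identifies low-confidence LLM annotations and disproportionately uncovers data that the LLMs labeled incorrectly. Our results show that the proposed UQ aggregation strategy outperforms existing methods and can be used to significantly improve human-in-the-loop data annotation workflows.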
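
One common way to obtain a confidence signal for LLM annotations is to sample several labels per item and measure their agreement. The sketch below shows that generic idea only; it is not one of the five UQ strategies or the aggregation strategy from the paper, and the function names and the 0.8 threshold are assumptions.

```python
from collections import Counter
from typing import List, Tuple


def label_confidence(samples: List[str]) -> Tuple[str, float]:
    """Return the majority label and the share of samples that agree with it."""
    counts = Counter(samples)
    label, top = counts.most_common(1)[0]
    return label, top / len(samples)


def flag_for_review(batch: List[List[str]], threshold: float = 0.8) -> List[int]:
    """Indices of items whose agreement falls below the threshold."""
    return [i for i, samples in enumerate(batch) if label_confidence(samples)[1] < threshold]


if __name__ == "__main__":
    # Each inner list holds repeated (hypothetical) LLM labels for one item.
    batch = [["pos"] * 5, ["pos", "neg", "pos", "neg", "pos"]]
    print(flag_for_review(batch))
```

Items flagged this way would be routed to a human annotator, which is the human-in-the-loop setting the abstract describes.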

[NLP-128] LFOSum: Summarizing Long-form Opinions with Large Language Models

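[Quick read]: This paper addresses information overload from online reviews, in particular the difficulty traditional models have with long inputs and large volumes of reviews and the shortcomings of existing large language models (LLMs) in producing accurate and faithful summaries. The key elements of the solution are: 1) a new dataset of long-form user reviews in which each entity has more than a thousand reviews; 2) two training-free LLM-based summarization methods that can handle long inputs; 3) automatic evaluation metrics, including comparison against in-depth, unbiased summaries written by domain experts and new reference-free metrics that assess summary faithfulness in a more fine-grained, context-sensitive way. With these, the paper aims to improve LLM performance on long-form reviews and to produce more accurate summaries.

Link: https://arxiv.org/abs/2410.13037
Authors: Mir Tafseer Nayeem, Davood Rafiei
Keywords (EN): Online reviews play, influencing consumer decisions, Online reviews, hotels or restaurants, play a pivotal
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments: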

Abstract:Online reviews play a pivotal role in influencing consumer decisions across various domains, from purchasing products to selecting hotels or restaurants. However, the sheer volume of reviews – often containing repetitive or irrelevant content – leads to information overload, making it challenging for users to extract meaningful insights. Traditional opinion summarization models face challenges in handling long inputs and large volumes of reviews, while newer Large Language Model (LLM) approaches often fail to generate accurate and faithful summaries. To address those challenges, this paper introduces (1) a new dataset of long-form user reviews, each entity comprising over a thousand reviews, (2) two training-free LLM-based summarization approaches that scale to long inputs, and (3) automatic evaluation metrics. Our dataset of user reviews is paired with in-depth and unbiased critical summaries by domain experts, serving as a reference for evaluation. Additionally, our novel reference-free evaluation metrics provide a more granular, context-sensitive assessment of summary faithfulness. We benchmark several open-source and closed-source LLMs using our methods. Our evaluation reveals that LLMs still face challenges in balancing sentiment and format adherence in long-form summaries, though open-source models can narrow the gap when relevant information is retrieved in a focused manner.
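Summary: Online reviews play a key role in shaping consumer decisions across many domains, from buying products to choosing hotels or restaurants. However, the sheer volume of reviews, which often contain repetitive or irrelevant content, leads to information overload and makes it hard for users to extract meaningful insights. Traditional opinion summarization models struggle with long inputs and large volumes of reviews, while newer Large Language Model (LLM) approaches often fail to generate accurate and faithful summaries. To address these challenges, this paper introduces (1) a new dataset of long-form user reviews in which each entity has more than a thousand reviews, (2) two training-free LLM-based summarization methods that can handle long inputs, and (3) automatic evaluation metrics. Our user review dataset is paired with in-depth, unbiased critical summaries written by domain experts, which serve as references for evaluation. In addition, our reference-free evaluation metrics provide a more fine-grained, context-sensitive assessment of summary faithfulness. We benchmark several open-source and closed-source LLMs with these methods. Our evaluation shows that LLMs still struggle to balance sentiment and format adherence in long-form summaries, although open-source models can narrow the gap when relevant information is retrieved in a focused manner.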

[NLP-129] Sensitivity of Generative VLMs to Semantically and Lexically Altered Prompts

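[Quick read]: This paper examines how sensitive generative vision-language models (VLMs) are to lexical and semantic changes in prompts. The key to the approach is to use the SugarCrepe++ dataset to evaluate systematically how VLMs respond to lexical changes, even when these are not accompanied by semantic changes. The study finds that generative VLMs are highly sensitive to such changes and that this sensitivity affects the consistency of model outputs, exposing a limitation of current techniques in handling prompt variation.

Link: https://arxiv.org/abs/2410.13030
Authors: Sri Harsha Dumpala, Aman Jaiswal, Chandramouli Sastry, Evangelos Milios, Sageev Oore, Hassan Sajjad
Keywords (EN): generative vision-language models, vision-language models, significant influx, influx of prompt-tuning, remains unclear
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: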

Abstract:Despite the significant influx of prompt-tuning techniques for generative vision-language models (VLMs), it remains unclear how sensitive these models are to lexical and semantic alterations in prompts. In this paper, we evaluate the ability of generative VLMs to understand lexical and semantic changes in text using the SugarCrepe++ dataset. We analyze the sensitivity of VLMs to lexical alterations in prompts without corresponding semantic changes. Our findings demonstrate that generative VLMs are highly sensitive to such alterations. Additionally, we show that this vulnerability affects the performance of techniques aimed at achieving consistency in their outputs.
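Summary: Despite the surge of prompt-tuning techniques for generative vision-language models (VLMs), it remains unclear how sensitive these models are to lexical and semantic changes in prompts. In this paper, we use the SugarCrepe++ dataset to evaluate the ability of generative VLMs to understand lexical and semantic changes in text. We analyze the sensitivity of VLMs to lexical changes in prompts that are not accompanied by semantic changes. Our findings show that generative VLMs are highly sensitive to such changes. We also show that this vulnerability affects the performance of techniques that aim to make model outputs consistent.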

[NLP-130] When Not to Answer: Evaluating Prompts on GPT Models for Effective Abstention in Unanswerable Math Word Problems

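[Quick read]: This paper addresses the problem that large language models (LLMs) may hallucinate and return wrong results when given unanswerable math word problems. The key to the solution is to use specific prompts to guide GPT models to recognize unanswerable math word problems and decline to answer them, improving the accuracy and reliability of their abstention. The study runs experiments with the Unanswerable Word Math Problem (UWMP) dataset and GPT model APIs and introduces composite evaluation metrics (covering abstention, correctness, and confidence). It exposes shortcomings of current GPT models on unanswerable problems and stresses the importance of developing new models that manage uncertainty and complex reasoning better.

Link: https://arxiv.org/abs/2410.13029
Authors: Asir Saadat, Tasmia Binte Sogir, Md Taukir Azam Chowdhury, Syem Aziz
Keywords (EN): Large language models, Large language, increasingly relied, Large, math
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 11 pages, 7 figures, 2 tables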

Abstract:Large language models (LLMs) are increasingly relied upon to solve complex mathematical word problems. However, being susceptible to hallucination, they may generate inaccurate results when presented with unanswerable questions, raising concerns about their potential harm. While GPT models are now widely used and trusted, the exploration of how they can effectively abstain from answering unanswerable math problems and the enhancement of their abstention capabilities has not been rigorously investigated. In this paper, we investigate whether GPTs can appropriately respond to unanswerable math word problems by applying prompts typically used in solvable mathematical scenarios. Our experiments utilize the Unanswerable Word Math Problem (UWMP) dataset, directly leveraging GPT model APIs. Evaluation metrics are introduced, which integrate three key factors: abstention, correctness and confidence. Our findings reveal critical gaps in GPT models and the hallucination it suffers from for unsolvable problems, highlighting the need for improved models capable of better managing uncertainty and complex reasoning in math word problem-solving contexts.
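Summary: Large language models (LLMs) are increasingly used to solve complex math word problems. However, because they are prone to hallucination, they may produce inaccurate results when faced with unanswerable questions, which raises concerns about potential harm. Although GPT models are now widely used and trusted, how they can effectively abstain from answering unanswerable math problems, and how their abstention capabilities can be strengthened, has not been studied in depth. This paper investigates whether GPT models can respond appropriately to unanswerable math word problems when given prompts normally used in solvable mathematical scenarios. Our experiments use the Unanswerable Word Math Problem (UWMP) dataset and call GPT model APIs directly. We introduce evaluation metrics that combine three key factors: abstention, correctness, and confidence. Our findings reveal critical gaps in GPT models and the hallucinations they produce on unsolvable problems, underlining the need for improved models that better manage uncertainty and complex reasoning in math word problem solving.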

[NLP-131] LoRA Soups: Merging LoRAs for Practical Skill Composition Tasks

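[Quick read]: This paper addresses how to fine-tune large language models (LLMs) efficiently by composing multiple skills when training data for the target task is lacking. The key to the solution is a method for concatenating LoRA (Low-Rank Adaptation) modules, called CAT, which optimally averages LoRA modules trained separately on different skills and markedly improves performance on compositional tasks. Experiments show that on tasks such as math word problems CAT outperforms existing model merging and data mixing techniques by 43% and 12% respectively, demonstrating the advantage of model merging over data mixing for binary skill composition tasks.

Link: https://arxiv.org/abs/2410.13025
Authors: Akshara Prabhakar, Yuanzhi Li, Karthik Narasimhan, Sham Kakade, Eran Malach, Samy Jelassi
Keywords (EN): Large Language Models, Large Language, Low-Rank Adaptation, fine-tuning of Large, Language Models
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 9 pages plus references and appendices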

Abstract:Low-Rank Adaptation (LoRA) is a popular technique for parameter-efficient fine-tuning of Large Language Models (LLMs). We study how different LoRA modules can be merged to achieve skill composition – testing the performance of the merged model on a target task that involves combining multiple skills, each skill coming from a single LoRA. This setup is favorable when it is difficult to obtain training data for the target task and when it can be decomposed into multiple skills. First, we identify practically occurring use-cases that can be studied under the realm of skill composition, e.g. solving hard math-word problems with code, creating a bot to answer questions on proprietary manuals or about domain-specialized corpora. Our main contribution is to show that concatenation of LoRAs (CAT), which optimally averages LoRAs that were individually trained on different skills, outperforms existing model- and data- merging techniques; for instance on math-word problems, CAT beats these methods by an average of 43% and 12% respectively. Thus, this paper advocates model merging as an efficient way to solve compositional tasks and underscores CAT as a simple, compute-friendly and effective procedure. To our knowledge, this is the first work demonstrating the superiority of model merging over data mixing for binary skill composition tasks.
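Summary: Low-Rank Adaptation (LoRA) is a popular technique for parameter-efficient fine-tuning of Large Language Models (LLMs). We study how different LoRA modules can be merged to achieve skill composition: we test the merged model on a target task that requires combining multiple skills, each of which comes from a single LoRA. This setup is useful when training data for the target task is hard to obtain and the task can be decomposed into multiple skills. First, we identify practical use cases that can be studied as skill composition, such as solving hard math word problems with code, or building a bot that answers questions about proprietary manuals or domain-specialized corpora. Our main contribution is to show that concatenation of LoRAs (CAT), which optimally averages LoRAs trained individually on different skills, outperforms existing model-merging and data-merging techniques; on math word problems, for example, CAT beats these methods by an average of 43% and 12% respectively. This paper therefore advocates model merging as an efficient way to solve compositional tasks and highlights CAT as a simple, compute-friendly, and effective procedure. To our knowledge, this is the first work to demonstrate that model merging is superior to data mixing for binary skill composition tasks.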
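
Why concatenating LoRA factors amounts to a weighted average of the individual updates can be checked numerically. In the sketch below each LoRA update is `B @ A`; stacking the scaled `B` matrices side by side and the `A` matrices on top of each other gives a single higher-rank LoRA whose product is the weighted sum of the two updates. The fixed weights `w1` and `w2` are an assumption for illustration (the paper describes choosing them optimally), and the helper names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def scale(x: Matrix, w: float) -> Matrix:
    return [[w * v for v in row] for row in x]


def add(x: Matrix, y: Matrix) -> Matrix:
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def cat_merge(b1: Matrix, a1: Matrix, b2: Matrix, a2: Matrix,
              w1: float, w2: float) -> Tuple[Matrix, Matrix]:
    """Concatenate two LoRAs along the rank dimension with weights w1, w2."""
    b_cat = [r1 + r2 for r1, r2 in zip(scale(b1, w1), scale(b2, w2))]  # d x (r1 + r2)
    a_cat = a1 + a2                                                    # (r1 + r2) x k
    return b_cat, a_cat


if __name__ == "__main__":
    b1, a1 = [[1.0], [2.0]], [[1.0, 0.0]]   # rank-1 LoRA for skill 1
    b2, a2 = [[0.0], [1.0]], [[3.0, 1.0]]   # rank-1 LoRA for skill 2
    b_cat, a_cat = cat_merge(b1, a1, b2, a2, w1=0.5, w2=2.0)
    merged = matmul(b_cat, a_cat)
    expected = add(scale(matmul(b1, a1), 0.5), scale(matmul(b2, a2), 2.0))
    print(merged == expected)
```

The merged adapter has rank `r1 + r2`, so no information from either LoRA is lost through rank truncation.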

[NLP-132] Learning Representations for Reasoning: Generalizing Across Diverse Structures

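[Quick read]: This paper addresses generalization in reasoning over knowledge graphs and text, in particular reasoning about unseen entities and relations and about multi-step queries. The key to the solution is to design algorithms and systems that generalize across knowledge structures and query structures. Specifically, the paper proposes a dynamic programming framework to handle new entities, builds a relation graph to convert new relations, and uses graph neural networks and fuzzy logic operations to answer multi-step queries on knowledge graphs. It also proposes an algorithm that learns textual rules to improve the multi-step reasoning of large language models on text, and develops two systems that speed up machine learning development on structured data.

Link: https://arxiv.org/abs/2410.13018
Authors: Zhaocheng Zhu
Keywords (EN): logically draw conclusions, hallmark of human, ability to logically, logically draw, draw conclusions
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: PhD thesis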

Abstract:Reasoning, the ability to logically draw conclusions from existing knowledge, is a hallmark of human. Together with perception, they constitute the two major themes of artificial intelligence. While deep learning has pushed the limit of perception beyond human-level performance, the progress in reasoning domains is way behind. One fundamental reason is that reasoning problems usually have flexible structures for both knowledge and queries, and many existing models only perform well on structures seen during training. Here we aim to push the boundary of reasoning models by devising algorithms that generalize across knowledge and query structures, as well as systems that accelerate development on structured data. This thesis consists of three parts. In Part I, we study models that can inductively generalize to unseen knowledge graphs with new entity and relation vocabularies. For new entities, we propose a framework that learns neural operators in a dynamic programming algorithm computing path representations. For relations, we construct a relation graph to capture the interactions between relations, thereby converting new relations into new entities. In Part II, we propose two solutions for generalizing across multi-step queries on knowledge graphs and text respectively. For knowledge graphs, we show that multi-step queries can be solved by multiple calls of graph neural networks and fuzzy logic operations. For text, we devise an algorithm to learn explicit knowledge as textual rules to improve large language models on multi-step queries. In Part III, we propose two systems to facilitate machine learning development on structured data. Our library treats structured data as first-class citizens and removes the barrier for developing algorithms on structured data. Our node embedding system solves the GPU memory bottleneck of embedding matrices and scales to graphs with billion nodes.
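Summary: Reasoning, the ability to draw conclusions logically from existing knowledge, is a hallmark of humans. Together with perception, it forms one of the two major themes of artificial intelligence. While deep learning has pushed perception beyond human-level performance, progress in reasoning lags far behind. One fundamental reason is that reasoning problems usually have flexible structures for both knowledge and queries, and many existing models perform well only on structures seen during training. This thesis aims to push the boundary of reasoning models by designing algorithms that generalize across knowledge and query structures, together with systems that accelerate development on structured data. The thesis has three parts. In Part I, we study models that generalize inductively to unseen knowledge graphs with new entity and relation vocabularies. For new entities, we propose a framework that learns neural operators within a dynamic programming algorithm for computing path representations. For relations, we build a relation graph that captures the interactions between relations, thereby converting new relations into new entities. In Part II, we propose two solutions for generalizing across multi-step queries, on knowledge graphs and on text respectively. For knowledge graphs, we show that multi-step queries can be solved by multiple calls to graph neural networks combined with fuzzy logic operations. For text, we design an algorithm that learns explicit knowledge as textual rules to improve large language models on multi-step queries. In Part III, we propose two systems that facilitate machine learning development on structured data. Our library treats structured data as first-class citizens and removes barriers to developing algorithms on structured data. Our node embedding system solves the GPU memory bottleneck of embedding matrices and scales to graphs with billions of nodes.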
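
The role of fuzzy logic operations in multi-step queries can be sketched as follows: each sub-query produces a vector of per-entity scores in [0, 1], and logical connectives combine those vectors element by element. The sketch assumes the product fuzzy logic (product t-norm and its dual), which is one common choice; the thesis may use a different one, and the score vectors here are made-up placeholders rather than outputs of a graph neural network.

```python
from typing import List

Scores = List[float]  # one membership score in [0, 1] per candidate entity


def fuzzy_and(x: Scores, y: Scores) -> Scores:
    """Conjunction under the product t-norm."""
    return [a * b for a, b in zip(x, y)]


def fuzzy_or(x: Scores, y: Scores) -> Scores:
    """Disjunction under the probabilistic sum (dual of the product t-norm)."""
    return [a + b - a * b for a, b in zip(x, y)]


def fuzzy_not(x: Scores) -> Scores:
    """Negation as the complement of the membership score."""
    return [1.0 - a for a in x]


if __name__ == "__main__":
    # Hypothetical scores from two sub-queries over the same two entities.
    q1 = [1.0, 0.5]
    q2 = [0.5, 0.5]
    print(fuzzy_and(q1, q2), fuzzy_or(q1, q2), fuzzy_not(q2))
```

Because every operation maps scores in [0, 1] back into [0, 1], the output of one step can feed directly into the next relation projection, which is what makes chaining multiple steps possible.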

[NLP-133] LEGAL-UQA: A Low-Resource Urdu-English Dataset for Legal Question Answering

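[Quick read]: This paper addresses the scarcity of legal-domain natural language processing (NLP) resources for low-resource languages such as Urdu. The key to the solution is the LEGAL-UQA dataset, the first Urdu legal question-answering dataset based on Pakistan’s constitution, which contains 619 English-Urdu question-answer pairs together with their corresponding legal article contexts. Built through OCR extraction, manual refinement, and GPT-4-assisted translation and QA pair generation, the dataset fills a gap in Urdu legal NLP resources. Experiments evaluate the latest generalist language and embedding models on it, with Claude-3.5-Sonnet reaching 99.19% human-evaluated accuracy. The paper also fine-tunes the mt5-large-UQA-1.0 model and assesses the retrieval performance of different embedding models, providing a technical basis for access to Urdu legal information.

Link: https://arxiv.org/abs/2410.13013
Authors: Faizan Faisal, Umair Yousaf
Keywords (EN): Urdu legal question-answering, question-answering dataset derived, legal question-answering dataset, Urdu legal, Pakistan constitution
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages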

Abstract:We present LEGAL-UQA, the first Urdu legal question-answering dataset derived from Pakistan’s constitution. This parallel English-Urdu dataset includes 619 question-answer pairs, each with corresponding legal article contexts, addressing the need for domain-specific NLP resources in low-resource languages. We describe the dataset creation process, including OCR extraction, manual refinement, and GPT-4-assisted translation and generation of QA pairs. Our experiments evaluate the latest generalist language and embedding models on LEGAL-UQA, with Claude-3.5-Sonnet achieving 99.19% human-evaluated accuracy. We fine-tune mt5-large-UQA-1.0, highlighting the challenges of adapting multilingual models to specialized domains. Additionally, we assess retrieval performance, finding OpenAI’s text-embedding-3-large outperforms Mistral’s mistral-embed. LEGAL-UQA bridges the gap between global NLP advancements and localized applications, particularly in constitutional law, and lays the foundation for improved legal information access in Pakistan.
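Summary: We present LEGAL-UQA, the first Urdu legal question-answering dataset derived from Pakistan’s constitution. This parallel English-Urdu dataset contains 619 question-answer pairs, each with its corresponding legal article context, and meets the need for domain-specific natural language processing (NLP) resources in low-resource languages. We describe the dataset creation process in detail, including OCR extraction, manual refinement, and GPT-4-assisted translation and QA pair generation. Our experiments evaluate the latest generalist language and embedding models on LEGAL-UQA, with Claude-3.5-Sonnet reaching 99.19% human-evaluated accuracy. We fine-tune mt5-large-UQA-1.0, which highlights the challenges of adapting multilingual models to specialized domains. We also assess retrieval performance and find that OpenAI’s text-embedding-3-large outperforms Mistral’s mistral-embed. LEGAL-UQA bridges the gap between global NLP advances and localized applications, particularly in constitutional law, and lays the foundation for better access to legal information in Pakistan.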

[NLP-134] POROver: Improving Safety and Reducing Overrefusal in Large Language Models with Overgeneration and Preference Optimization

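[Quick read]: This paper addresses the balance between safety and usefulness in large language models, in particular the problem of models overrefusing benign prompts. The key to the solution is to generate training data with advanced teacher models (such as GPT-4o) and to reduce overrefusal with a preference optimization approach (POROver). Specifically, generating responses to both general-purpose and toxic prompts significantly improves the balance between safety and usefulness, and preference optimization algorithms, applied to carefully curated preference data, reduce the overrefusal rate from 45.2% to 15.0% while maintaining the safety level.

Link: https://arxiv.org/abs/2410.12999
Authors: Batuhan K. Karaman, Ishmam Zabir, Alon Benhaim, Vishrav Chaudhary, Mert R. Sabuncu, Xia Song
Keywords (EN): Balancing safety, recent years, critical challenge, challenge in recent, Balancing
Categories: Computation and Language (cs.CL)
Comments: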

Abstract:Balancing safety and usefulness in large language models has become a critical challenge in recent years. Models often exhibit unsafe behavior or adopt an overly cautious approach, leading to frequent overrefusal of benign prompts, which reduces their usefulness. Addressing these issues requires methods that maintain safety while avoiding overrefusal. In this work, we examine how the overgeneration of training data using advanced teacher models (e.g., GPT-4o), including responses to both general-purpose and toxic prompts, influences the safety and overrefusal balance of instruction-following language models. Additionally, we present POROver, a strategy to use preference optimization methods in order to reduce overrefusal, via employing a superior teacher model’s completions. Our results show that overgenerating completions for general-purpose prompts significantly improves the balance between safety and usefulness. Specifically, the F1 score calculated between safety and usefulness increases from 70.8% to 88.3%. Moreover, overgeneration for toxic prompts substantially reduces overrefusal, decreasing it from 94.4% to 45.2%. Furthermore, preference optimization algorithms, when applied with carefully curated preference data, can effectively reduce a model’s overrefusal from 45.2% to 15.0% while maintaining comparable safety levels. Our code and data are available at this https URL.
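Summary: Balancing safety and usefulness in large language models has become a critical challenge in recent years. Models often either behave unsafely or take an overly cautious approach, frequently overrefusing benign prompts, which reduces their usefulness. Addressing these issues requires methods that maintain safety while avoiding overrefusal. In this work, we examine how overgenerating training data with advanced teacher models (such as GPT-4o), including responses to both general-purpose and toxic prompts, affects the balance between safety and overrefusal in instruction-following language models. We also present POROver, a strategy that uses preference optimization methods together with a superior teacher model’s completions to reduce overrefusal. Our results show that overgenerating completions for general-purpose prompts significantly improves the balance between safety and usefulness. Specifically, the F1 score calculated between safety and usefulness rises from 70.8% to 88.3%. Overgeneration for toxic prompts also substantially reduces overrefusal, from 94.4% to 45.2%. Furthermore, preference optimization algorithms, when applied with carefully curated preference data, can effectively reduce a model’s overrefusal from 45.2% to 15.0% while maintaining comparable safety levels. Our code and data are available at this https URL.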
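
The abstract reports an "F1 score calculated between safety and usefulness" without giving the formula. A plausible reading, stated here as an assumption rather than the paper's definition, is the harmonic mean of a safety rate and a usefulness rate, by analogy with the usual F1 over precision and recall. A minimal sketch:

```python
def safety_usefulness_f1(safety: float, usefulness: float) -> float:
    """Harmonic mean of two rates in [0, 1]; assumed form of the combined score."""
    if safety + usefulness == 0:
        return 0.0
    return 2 * safety * usefulness / (safety + usefulness)


if __name__ == "__main__":
    # A model that is perfectly safe but refuses everything scores zero.
    print(safety_usefulness_f1(1.0, 0.0))
    print(safety_usefulness_f1(0.5, 0.5))
```

The harmonic mean penalizes imbalance: a model cannot reach a high score by maximizing safety through blanket refusal, which matches the trade-off the paper is trying to measure.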

[NLP-135] “Lets Argue Both Sides”: Argument Generation Can Force Small Models to Utilize Previously Inaccessible Reasoning Capabilities EMNLP2024

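[Quick read]: This paper addresses the poor performance of large language models (LLMs) on tasks that require strict logical reasoning. The key to the solution is a method called Argument Generation, which generates an argument for each possible inference result and asks the model to rank the arguments, thereby forcing the model to use its reasoning capabilities. The method can replace zero-shot prompting techniques without adding complexity, and it yields larger performance gains in smaller language models, revealing a complex relationship between model size and prompting method.

Link: https://arxiv.org/abs/2410.12997
Authors: Kaveh Eskandari Miandoab, Vasanth Sarathy
Keywords (EN): Large Language Models, Large Language, Argument Generation, evaluation tasks, struggle to maintain
Categories: Computation and Language (cs.CL)
Comments: Accepted to Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual at EMNLP 2024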

Abstract:Large Language Models (LLMs), despite achieving state-of-the-art results in a number of evaluation tasks, struggle to maintain their performance when logical reasoning is strictly required to correctly infer a prediction. In this work, we propose Argument Generation as a method of forcing models to utilize their reasoning capabilities when other approaches such as chain-of-thought reasoning prove insufficient. Our method involves the generation of arguments for each possible inference result, and asking the end model to rank the generated arguments. We show that Argument Generation can serve as an appropriate substitute for zero-shot prompting techniques without the requirement to add layers of complexity. Furthermore, we argue that knowledge-probing techniques such as chain-of-thought reasoning and Argument Generation are only useful when further reasoning is required to infer a prediction, making them auxiliary to more common zero-shot approaches. Finally, we demonstrate that our approach forces larger gains in smaller language models, showcasing a complex relationship between model size and prompting methods in foundation models.
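Summary: Large language models (LLMs), despite achieving state-of-the-art results on many evaluation tasks, struggle to maintain their performance when logical reasoning is strictly required to infer a prediction correctly. In this work, we propose Argument Generation as a method for forcing models to use their reasoning capabilities when other approaches, such as chain-of-thought reasoning, are insufficient. Our method generates an argument for each possible inference result and asks the end model to rank the generated arguments. We show that Argument Generation can serve as a suitable substitute for zero-shot prompting techniques without adding layers of complexity. We further argue that knowledge-probing techniques such as chain-of-thought reasoning and Argument Generation are useful only when further reasoning is required to infer a prediction, which makes them auxiliary to more common zero-shot approaches. Finally, we show that our approach produces larger gains in smaller language models, revealing a complex relationship between model size and prompting method in foundation models.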
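
The two-stage procedure described here (generate one argument per candidate answer, then ask the model to rank them) can be sketched as a small wrapper around any text-in, text-out model. The prompt wording, the `llm` callable, and the convention that the ranking step replies with a number are all assumptions made for illustration; the paper's actual prompts are not given in the abstract.

```python
from typing import Callable, List

RANK_INSTRUCTION = "Reply with the number of the strongest argument."


def argument_generation(question: str, options: List[str],
                        llm: Callable[[str], str]) -> str:
    """Pick the option whose generated argument the model ranks highest."""
    # Stage 1: one argument in favor of each possible answer.
    arguments = [
        llm(f"Question: {question}\nArgue that the answer is: {option}")
        for option in options
    ]
    # Stage 2: show all arguments and ask for the strongest one.
    listing = "\n".join(
        f"{i}. [{option}] {argument}"
        for i, (option, argument) in enumerate(zip(options, arguments), start=1)
    )
    reply = llm(f"Question: {question}\nArguments:\n{listing}\n{RANK_INSTRUCTION}")
    return options[int(reply.strip()) - 1]


if __name__ == "__main__":
    def stub_llm(prompt: str) -> str:
        """Hypothetical stand-in for a model call, used only to run the sketch."""
        return "2" if prompt.endswith(RANK_INSTRUCTION) else "a supporting argument"

    print(argument_generation("Is the premise entailed?", ["yes", "no"], stub_llm))
```

A production version would need to handle replies that are not a clean integer; the sketch omits that to keep the two stages visible.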

[NLP-136] Qtok: A Comprehensive Framework for Evaluating Multilingual Tokenizer Quality in Large Language Models

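[Quick Read]: This paper addresses the lack of attention to tokenizer quality in the development of large language models (LLMs), particularly for multilingual models. The key contribution is Qtok, a tool that applies a set of metrics (language coverage, token completeness, and distribution across languages and linguistic categories) to evaluate 13 distinct tokenizers from 58 publicly available models. The analysis reveals significant variations in token distribution across languages and categories, highlighting potential biases and room for improvement in current tokenization strategies, and it gives multilingual LLM development a systematic way to assess tokenizer quality.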

链接: https://arxiv.org/abs/2410.12989
作者: Iaroslav Chelombitko,Egor Safronov,Aleksey Komissarov
关键词-EN: Large Language Models, Large Language, considerable attention, quality, tokenizer quality
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 24 pages, 9 figures, 6 tables. Code and data available at this https URL

Abstract:In the development of Large Language Models (LLMs), considerable attention has been given to the quality of training datasets. However, the role of tokenizers in the LLM training pipeline, particularly for multilingual models, has received less focus. The quality of tokenization can significantly impact a model’s ability to handle diverse languages effectively. We introduce Qtok, a tool designed to assess tokenizer quality with a specific emphasis on their performance in multilingual contexts. Our research proposes a set of metrics for evaluating tokenizer quality, including measures of language coverage, token completeness, and distribution across languages and linguistic categories. Qtok applies these metrics to evaluate 13 distinct tokenizers from 58 publicly available models, analyzing their output across different linguistic contexts. Our analysis revealed significant variations in token distribution across languages and categories, highlighting potential biases and areas for improvement in current tokenization strategies. This research contributes to the field of tokenizer evaluation within multilingual LLM development by providing a systematic approach to assessing tokenizer quality. Our findings highlight the critical role of tokenization in multilingual LLM capability. The Qtok tool and our analysis methodology offer practical means for researchers to evaluate and improve tokenization strategies for multilingual applications. We offer a method to compare tokenizer quality across these metrics, which may be useful when selecting or adjusting tokenizers for specific multilingual LLM applications.
摘要：在大语言模型 (LLM) 的开发过程中，训练数据集的质量受到了相当多的关注。然而，在 LLM 训练流程中，特别是对于多语言模型，分词器 (tokenizer) 的作用却较少受到关注。分词的质量显著影响模型有效处理多种语言的能力。我们引入了 Qtok，这是一个旨在评估分词器质量的工具，特别强调其在多语言环境中的表现。我们的研究提出了一套评估分词器质量的指标，包括语言覆盖率、Token 完整性以及跨语言和语言类别的分布。Qtok 应用这些指标评估了来自 58 个公开模型的 13 种不同分词器，分析了它们在不同语言环境中的输出。我们的分析揭示了语言和类别之间 Token 分布的显著差异，突显了当前分词策略中潜在的偏见和改进领域。这项研究通过提供一种系统的方法来评估分词器质量，为多语言 LLM 开发领域的分词器评估做出了贡献。我们的研究结果强调了分词在多语言 LLM 能力中的关键作用。Qtok 工具和我们的分析方法为研究人员提供了实用的手段，用于评估和改进多语言应用中的分词策略。我们提供了一种在这些指标上比较分词器质量的方法，这在为特定的多语言 LLM 应用选择或调整分词器时可能非常有用。
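
To make the idea of "distribution across languages" concrete, here is a rough sketch that measures what share of a tokenizer vocabulary falls into each writing script. It is not Qtok's actual metric: the script heuristic (first word of the Unicode character name), the function names, and the toy vocabulary are all assumptions made for illustration.

```python
import unicodedata
from collections import Counter
from typing import Dict, Iterable


def script_of(token: str) -> str:
    """Rough script label taken from the Unicode name of the first letter."""
    for ch in token:
        if ch.isalpha():
            # e.g. "LATIN SMALL LETTER T" -> "LATIN"
            return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "OTHER"  # digits, punctuation, or empty tokens


def script_distribution(vocab: Iterable[str]) -> Dict[str, float]:
    """Fraction of vocabulary entries assigned to each script label."""
    counts = Counter(script_of(tok) for tok in vocab)
    total = sum(counts.values())
    return {script: n / total for script, n in counts.items()}


# Toy vocabulary (hypothetical); a real run would load a tokenizer's vocabulary.
toy_vocab = ["the", "ing", "привет", "мир", "γεια", "123", "##ed", "世界"]
dist = script_distribution(toy_vocab)
for script, share in sorted(dist.items(), key=lambda kv: -kv[1]):
    print(f"{script}: {share:.3f}")
```

A heavily skewed distribution of this kind is one simple signal of the cross-language imbalance the abstract describes, although script is only a coarse proxy for language.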

[NLP-137] Leveraging LLMs for Translating and Classifying Mental Health Data

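[Quick Read]: This paper examines how well large language models (LLMs) detect depression severity in languages other than English, focusing on Greek. The approach is to automatically translate user-generated English posts into Greek and evaluate GPT3.5-turbo on identifying depression severity. The results show that GPT3.5-turbo is not very successful in English and has varying performance in Greek. The study underscores the need for further research in lower-resource languages, and it argues that LLMs on mental health platforms require careful implementation and human supervision to avoid misdiagnosis.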

链接: https://arxiv.org/abs/2410.12985
作者: Konstantinos Skianis,A. Seza Doğruöz,John Pavlopoulos
关键词-EN: Large language models, mental health, mental health support, Large language, medical fields
类目: Computation and Language (cs.CL)
备注:

Abstract:Large language models (LLMs) are increasingly used in medical fields. In mental health support, the early identification of linguistic markers associated with mental health conditions can provide valuable support to mental health professionals, and reduce long waiting times for patients. Despite the benefits of LLMs for mental health support, there is limited research on their application in mental health systems for languages other than English. Our study addresses this gap by focusing on the detection of depression severity in Greek through user-generated posts which are automatically translated from English. Our results show that GPT3.5-turbo is not very successful in identifying the severity of depression in English, and it has a varying performance in Greek as well. Our study underscores the necessity for further research, especially in languages with less resources. Also, careful implementation is necessary to ensure that LLMs are used effectively in mental health platforms, and human supervision remains crucial to avoid misdiagnosis.
摘要：大语言模型 (LLMs) 在医疗领域中的应用日益增多。在心理健康支持方面，早期识别与心理健康状况相关的语言标记可以为心理健康专业人员提供宝贵的支持，并减少患者的长时间等待。尽管 LLMs 在心理健康支持方面具有诸多优势，但关于其在非英语心理健康系统中应用的研究仍较为有限。本研究通过关注用户生成的帖子（这些帖子自动从英语翻译成希腊语）来检测希腊语中的抑郁症严重程度，填补了这一研究空白。我们的结果显示，GPT3.5-turbo 在识别英语中的抑郁症严重程度方面表现不佳，在希腊语中的表现也存在差异。本研究强调了进一步研究的必要性，尤其是在资源较少的语言方面。此外，在心理健康平台上有效使用 LLMs 需要谨慎实施，并且人类的监督对于避免误诊仍然至关重要。

[NLP-138] BenchmarkCards: Large Language Model and Risk Reporting

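[Quick Read]: This paper addresses the lack of a standardized method for documenting the benchmarks used in pre-deployment evaluation of large language models (LLMs). The key contribution is BenchmarkCards, a structured framework for recording the properties of an LLM benchmark, such as targeted risks and evaluation methodologies (including properties such as bias and fairness), without prescribing how benchmark results should be measured or interpreted. By standardizing how these characteristics are captured and reported, BenchmarkCards helps researchers choose appropriate benchmarks and promotes transparency and reproducibility in LLM evaluation.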

链接: https://arxiv.org/abs/2410.12974
作者: Anna Sokol,Nuno Moniz,Elizabeth Daly,Michael Hind,Nitesh Chawla
关键词-EN: Large language models, Large language, introduce significant risks, offer powerful capabilities, language models
类目: Computation and Language (cs.CL)
备注:

Abstract:Large language models (LLMs) offer powerful capabilities but also introduce significant risks. One way to mitigate these risks is through comprehensive pre-deployment evaluations using benchmarks designed to test for specific vulnerabilities. However, the rapidly expanding body of LLM benchmark literature lacks a standardized method for documenting crucial benchmark details, hindering consistent use and informed selection. BenchmarkCards addresses this gap by providing a structured framework specifically for documenting LLM benchmark properties rather than defining the entire evaluation process itself. BenchmarkCards do not prescribe how to measure or interpret benchmark results (e.g., defining "correctness") but instead offer a standardized way to capture and report critical characteristics like targeted risks and evaluation methodologies, including properties such as bias and fairness. This structured metadata facilitates informed benchmark selection, enabling researchers to choose appropriate benchmarks and promoting transparency and reproducibility in LLM evaluation.
摘要：大语言模型 (LLMs) 提供了强大的能力，但也引入了显著的风险。减轻这些风险的一种方法是使用专门设计用于测试特定漏洞的基准进行全面的部署前评估。然而，迅速扩展的 LLM 基准文献缺乏一种标准化的方法来记录关键的基准细节，这阻碍了其一致使用和明智选择。BenchmarkCards 通过提供一个专门用于记录 LLM 基准属性的结构化框架来解决这一差距，而不是定义整个评估过程本身。BenchmarkCards 不规定如何测量或解释基准结果（例如，定义“正确性”），而是提供了一种标准化的方式来捕捉和报告关键特征，如目标风险和评估方法，包括偏差和公平性等属性。这种结构化的元数据有助于明智的基准选择，使研究人员能够选择适当的基准，并促进 LLM 评估中的透明度和可重复性。

[NLP-139] Evaluating the Instruction-following Abilities of Language Models using Knowledge Tasks

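[Quick Read]: This paper addresses the benchmarking of instruction following, specifically how to make both task performance and instruction-following ability easy to verify. The key idea is to adapt existing knowledge benchmarks and augment them with instructions that are either conditional on correctly answering the knowledge task or that use the space of candidate options in multiple-choice tasks. This setup measures instruction-following ability and also allows task performance to be studied with LLM-free methods. The results show that even large-scale instruction-tuned language models fail to follow simple instructions in zero-shot settings.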

链接: https://arxiv.org/abs/2410.12972
作者: Rudra Murthy,Prince Kumar,Praveen Venkateswaran,Danish Contractor
关键词-EN: focus our attention, attention on developing, easy to verify, instruction-following capabilities, task performance
类目: Computation and Language (cs.CL)
备注:

Abstract:In this work, we focus our attention on developing a benchmark for instruction-following where it is easy to verify both task performance as well as instruction-following capabilities. We adapt existing knowledge benchmarks and augment them with instructions that are a) conditional on correctly answering the knowledge task or b) use the space of candidate options in multiple-choice knowledge-answering tasks. This allows us to study model characteristics, such as their change in performance on the knowledge tasks in the presence of answer-modifying instructions and distractor instructions. In contrast to existing benchmarks for instruction following, we not only measure instruction-following capabilities but also use LLM-free methods to study task performance. We study a series of openly available large language models of varying parameter sizes (1B-405B) and closed source models namely GPT-4o-mini, GPT-4o. We find that even large-scale instruction-tuned LLMs fail to follow simple instructions in zero-shot settings. We release our dataset, the benchmark, code, and results for future work.
摘要：在本研究中，我们专注于开发一个指令跟随的基准测试，该基准测试不仅易于验证任务表现，还能评估指令跟随能力。我们改编了现有的知识基准测试，并增加了以下两种指令：a) 基于正确回答知识任务的条件指令；b) 利用多选知识回答任务中的候选选项空间的指令。这使我们能够研究模型的特性，例如在存在答案修改指令和干扰指令的情况下，模型在知识任务上的表现变化。与现有的指令跟随基准测试不同，我们不仅测量指令跟随能力，还使用无大语言模型（LLM-free）的方法来研究任务表现。我们研究了一系列公开可用的大语言模型，参数规模从1B到405B不等，以及闭源模型，如GPT-4o-mini和GPT-4o。我们发现，即使在零样本设置下，大规模指令微调的大语言模型也难以遵循简单的指令。我们公开了我们的数据集、基准测试、代码和结果，以供未来研究使用。
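
A small sketch of how a multiple-choice knowledge item might be augmented with an instruction and then scored without an LLM judge. The specific instruction ("reply with the option text in upper case"), the `Item` class, and the scoring rules are hypothetical examples, not taken from the paper's benchmark.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Item:
    question: str
    options: Dict[str, str]  # option label -> option text
    answer: str              # gold option label


def add_instruction(item: Item) -> str:
    """Build a prompt whose instruction operates on the candidate options."""
    opts = "\n".join(f"{label}. {text}" for label, text in item.options.items())
    return (
        f"{item.question}\n{opts}\n"
        "Instruction: reply with the text of the correct option in upper case, "
        "not its label."
    )


def score(item: Item, response: str) -> Dict[str, bool]:
    """Check task correctness and instruction compliance by string matching only."""
    gold_text = item.options[item.answer]
    cleaned = response.strip()
    task_correct = (
        cleaned.lower() == gold_text.lower() or cleaned.upper() == item.answer
    )
    instruction_followed = cleaned == gold_text.upper()
    return {"task_correct": task_correct, "instruction_followed": instruction_followed}


item = Item("What is the capital of France?", {"A": "Paris", "B": "Rome"}, "A")
print(add_instruction(item))
print(score(item, "PARIS"))  # knows the answer and follows the instruction
print(score(item, "A"))      # knows the answer but ignores the instruction
```

Separating the two booleans is what lets a benchmark of this kind distinguish a model that lacks the knowledge from one that has it but fails to follow the instruction.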

[NLP-140] Self-Pluralising Culture Alignment for Large Language Models

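[Quick Read]: This paper addresses how to align large language models (LLMs) with pluralistic human values across cultures. The key contribution is CultureSPA, a framework that generates questions on culture topics, obtains LLM outputs under both culture-aware and culture-unaware settings, and compares the two to detect and collect culture-related instances. These instances are used to fine-tune LLMs in either a culture-joint or a culture-specific way. Experiments show that CultureSPA significantly improves alignment to diverse cultures without compromising general abilities, and that combining it with advanced prompt engineering techniques yields further improvements.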

链接: https://arxiv.org/abs/2410.12971
作者: Shaoyang Xu,Yongqi Leng,Linhao Yu,Deyi Xiong
关键词-EN: large language models, serve pluralistic human, language models, large language, increasingly accessible
类目: Computation and Language (cs.CL)
备注: Implementation for the paper: this https URL

Abstract:As large language models (LLMs) become increasingly accessible in many countries, it is essential to align them to serve pluralistic human values across cultures. However, pluralistic culture alignment in LLMs remains an open problem. In this paper, we propose CultureSPA, a Self-Pluralising Culture Alignment framework that allows LLMs to simultaneously align to pluralistic cultures. The framework first generates questions on various culture topics, then yields LLM outputs in response to these generated questions under both culture-aware and culture-unaware settings. By comparing culture-aware/unaware outputs, we are able to detect and collect culture-related instances. These instances are employed to fine-tune LLMs to serve pluralistic cultures in either a culture-joint or culture-specific way. Extensive experiments demonstrate that CultureSPA significantly improves the alignment of LLMs to diverse cultures without compromising general abilities. Further improvements can be achieved if CultureSPA is combined with advanced prompt engineering techniques. Comparisons between culture-joint and culture-specific tuning strategies, along with variations in data quality and quantity, illustrate the robustness of our method. We also explore the mechanisms underlying CultureSPA and the relations between different cultures it reflects.
摘要：随着大语言模型 (LLM) 在许多国家变得越来越普及，使其适应多元文化的人类价值观变得至关重要。然而，大语言模型在多元文化对齐方面仍然是一个开放的问题。在本文中，我们提出了 CultureSPA，一个自多元化的文化对齐框架，使大语言模型能够同时对齐多元文化。该框架首先生成关于各种文化主题的问题，然后在文化感知和文化无感知设置下生成大语言模型的输出。通过比较文化感知/无感知的输出，我们能够检测和收集与文化相关的实例。这些实例被用于微调大语言模型，以服务于多元文化，无论是通过文化联合还是文化特定的方法。广泛的实验表明，CultureSPA 显著提高了大语言模型对多样文化的对齐能力，同时不损害其通用能力。如果将 CultureSPA 与先进的提示工程技术结合，还可以进一步提高效果。通过比较文化联合和文化特定的调优策略，以及数据质量和数量的变化，展示了我们方法的鲁棒性。我们还探讨了 CultureSPA 背后的机制及其反映的不同文化之间的关系。
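
The data-collection step, comparing culture-aware and culture-unaware outputs, can be sketched as below. The abstract does not state the exact detection criterion, so keeping an instance whenever the two outputs differ is an assumption, as are the `answer` callable and the toy data.

```python
from typing import Callable, Dict, List, Optional


def collect_culture_instances(
    questions: List[str],
    cultures: List[str],
    answer: Callable[[str, Optional[str]], str],
) -> List[Dict[str, str]]:
    """Keep (culture, question, response) triples whose aware output differs."""
    instances = []
    for question in questions:
        unaware = answer(question, None)  # culture-unaware setting
        for culture in cultures:
            aware = answer(question, culture)  # culture-aware setting
            if aware != unaware:  # assumed criterion: any difference counts
                instances.append(
                    {"culture": culture, "question": question, "response": aware}
                )
    return instances


# Toy model stand-in (hypothetical): only one question is culture-sensitive.
def toy_answer(question: str, culture: Optional[str]) -> str:
    special = {("How do you greet an elder?", "culture_A"): "Bow."}
    return special.get((question, culture), f"default answer to: {question}")


instances = collect_culture_instances(
    ["How do you greet an elder?", "What is 2 + 2?"],
    ["culture_A", "culture_B"],
    toy_answer,
)
print(instances)
```

The collected triples would then serve as fine-tuning data, pooled across cultures for culture-joint tuning or split by the `culture` field for culture-specific tuning.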

[NLP-141] Large Language Models as a Tool for Mining Object Knowledge

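[Quick Read]: This paper investigates whether large language models (LLMs) can formulate explicit knowledge about everyday physical objects, focusing on their parts and materials. Using few-shot prompting with five in-context examples and zero-shot multi-step prompting, and distinguishing the materials of a whole object from the materials of its parts, the authors produce a repository covering the parts and materials of about 2,300 objects and their subtypes. The evaluation demonstrates the coverage and soundness of the extracted knowledge. The repository is intended as an explicit knowledge source, analogous to knowledge graphs, for AI research on object structure and composition and for LLMs performing multi-hop question answering.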

链接: https://arxiv.org/abs/2410.12959
作者: Hannah YoungEun An,Lenhart K. Schubert
关键词-EN: Commonsense knowledge, essential for machines, machines to reason, knowledge, Commonsense
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:Commonsense knowledge is essential for machines to reason about the world. Large language models (LLMs) have demonstrated their ability to perform almost human-like text generation. Despite this success, they fall short as trustworthy intelligent systems, due to the opacity of the basis for their answers and a tendency to confabulate facts when questioned about obscure entities or technical domains. We hypothesize, however, that their general knowledge about objects in the everyday world is largely sound. Based on that hypothesis, this paper investigates LLMs’ ability to formulate explicit knowledge about common physical artifacts, focusing on their parts and materials. Our work distinguishes between the substances that comprise an entire object and those that constitute its parts, a previously underexplored distinction in knowledge base construction. Using few-shot with five in-context examples and zero-shot multi-step prompting, we produce a repository of data on the parts and materials of about 2,300 objects and their subtypes. Our evaluation demonstrates LLMs’ coverage and soundness in extracting knowledge. This contribution to knowledge mining should prove useful to AI research on reasoning about object structure and composition and serve as an explicit knowledge source (analogous to knowledge graphs) for LLMs performing multi-hop question answering.
摘要：常识知识对于机器理解世界至关重要。大语言模型 (LLM) 在文本生成方面已展现出近似人类的性能。尽管如此，由于其答案基础的不透明性以及在涉及模糊实体或技术领域时倾向于虚构事实，它们作为可信赖的智能系统仍存在不足。然而，我们假设，这些模型对于日常世界中对象的普遍知识大体上是可靠的。基于这一假设，本文研究了 LLM 对常见物理物品的部件和材料进行明确知识表述的能力。我们的工作区分了构成整个对象的物质与构成其部件的物质——这在知识库构建中是一个先前未被充分探索的区分。通过使用少样本 (few-shot) 和零样本 (zero-shot) 多步提示，我们生成了一个包含约 2,300 个对象及其子类型的部件和材料的数据库。我们的评估展示了 LLM 在提取知识方面的覆盖率和准确性。这一知识挖掘的贡献应有助于 AI 研究在对象结构和组成方面的推理，并作为 LLM 进行多跳问答的明确知识来源（类似于知识图谱）。

[NLP-142] Facilitating Multi-turn Function Calling for LLMs via Compositional Instruction Tuning

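[Quick Read]: This paper addresses the difficulty large language models (LLMs) have with compositional tasks that require multi-turn function calling. The key contribution is BUTTON, which generates synthetic compositional instruction-tuning data through bottom-up instruction construction and top-down trajectory generation. In the bottom-up phase, simple atomic tasks are generated from real-world scenarios, composed into compositional tasks with heuristic strategies, and paired with corresponding functions. In the top-down phase, a multi-agent environment simulates interactions among humans, assistants, and tools to collect multi-turn function-calling trajectories. The resulting dataset, BUTTONInstruct, contains 8k data points, and its effectiveness is demonstrated through experiments across various LLMs.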

链接: https://arxiv.org/abs/2410.12952
作者: Mingyang Chen,Haoze Sun,Tianpeng Li,Fan Yang,Hao Liang,Keer Lu,Bin Cui,Wentao Zhang,Zenan Zhou,Weipeng Chen
关键词-EN: Large Language Models, Large Language, Language Models, exhibited significant potential, performing diverse tasks
类目: Computation and Language (cs.CL)
备注:

Abstract:Large Language Models (LLMs) have exhibited significant potential in performing diverse tasks, including the ability to call functions or use external tools to enhance their performance. While current research on function calling by LLMs primarily focuses on single-turn interactions, this paper addresses the overlooked necessity for LLMs to engage in multi-turn function calling, which is critical for handling compositional, real-world queries that require planning with functions rather than merely using them. To facilitate this, we introduce an approach, BUTTON, which generates synthetic compositional instruction tuning data via bottom-up instruction construction and top-down trajectory generation. In the bottom-up phase, we generate simple atomic tasks based on real-world scenarios and build compositional tasks using heuristic strategies based on atomic tasks. Corresponding functions are then developed for these compositional tasks. The top-down phase features a multi-agent environment where interactions among simulated humans, assistants, and tools are utilized to gather multi-turn function calling trajectories. This approach ensures task compositionality and allows for effective function and trajectory generation by examining atomic tasks within compositional tasks. We produce a dataset BUTTONInstruct comprising 8k data points and demonstrate its effectiveness through extensive experiments across various LLMs.
摘要：大语言模型 (LLMs) 在执行多样化任务方面展现了显著的潜力，包括调用函数或使用外部工具以增强其性能的能力。尽管当前关于 LLMs 函数调用的研究主要集中在单轮交互上，本文则关注 LLMs 进行多轮函数调用的必要性——这对于处理需要函数规划但不仅限于使用函数的复合、现实世界查询至关重要。为此，我们提出了一种方法，BUTTON，通过自底向上的指令构建和自顶向下的轨迹生成来生成合成复合指令调优数据。在自底向上阶段，我们基于现实场景生成简单的原子任务，并使用基于原子任务的启发式策略构建复合任务。随后为这些复合任务开发相应的函数。自顶向下阶段则涉及一个多智能体环境，其中模拟的人类、助手和工具之间的交互用于收集多轮函数调用轨迹。这种方法确保了任务的复合性，并通过检查复合任务中的原子任务来实现有效的函数和轨迹生成。我们生成了一个包含 8k 数据点的数据集 BUTTONInstruct，并通过在各种 LLMs 上的广泛实验展示了其有效性。

[NLP-143] Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization

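[Quick Read]: This paper addresses how to make knowledge editing and unlearning in large language models more precise and effective. The key idea is to use mechanistic interpretability to localize the model components (circuits) associated with specific interpretable mechanisms, and to restrict editing or unlearning to those components. Localizing edits to the components associated with the lookup-table mechanism for factual recall leads to more robust editing and unlearning across input/output formats, resists attempts to relearn the unwanted information, and reduces unintended side effects compared to baselines. Certain localized edits also disrupt the model's latent knowledge more than other baselines, which makes unlearning more robust to various attacks.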

链接: https://arxiv.org/abs/2410.12949
作者: Phillip Guo,Aaquib Syed,Abhay Sheshadri,Aidan Ewart,Gintare Karolina Dziugaite
关键词-EN: language modeling performance, compromising general language, general language modeling, large language models, language models seek
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 20 pages, 19 figures, 7 tables

Abstract:Methods for knowledge editing and unlearning in large language models seek to edit or remove undesirable knowledge or capabilities without compromising general language modeling performance. This work investigates how mechanistic interpretability – which, in part, aims to identify model components (circuits) associated to specific interpretable mechanisms that make up a model capability – can improve the precision and effectiveness of editing and unlearning. We find a stark difference in unlearning and edit robustness when training components localized by different methods. We highlight an important distinction between methods that localize components based primarily on preserving outputs, and those finding high level mechanisms with predictable intermediate states. In particular, localizing edits/unlearning to components associated with the lookup-table mechanism for factual recall 1) leads to more robust edits/unlearning across different input/output formats, and 2) resists attempts to relearn the unwanted information, while also reducing unintended side effects compared to baselines, on both a sports facts dataset and the CounterFact dataset across multiple models. We also find that certain localized edits disrupt the latent knowledge in the model more than any other baselines, making unlearning more robust to various attacks.
摘要：在大语言模型中，知识编辑和遗忘的方法旨在在不损害通用语言建模性能的情况下，编辑或移除不希望的知识或能力。本文探讨了机制性可解释性——其部分目标是通过识别与特定可解释机制相关的模型组件（电路）来提高编辑和遗忘的精确性和有效性——如何改进编辑和遗忘的精确性和有效性。我们发现，当使用不同方法定位训练组件时，遗忘和编辑的鲁棒性存在显著差异。我们强调了基于主要保留输出定位组件的方法与寻找具有可预测中间状态的高级机制的方法之间的重要区别。特别是，将编辑/遗忘定位到与事实回忆的查找表机制相关的组件上，1) 在不同输入/输出格式下导致更鲁棒的编辑/遗忘，2) 抵抗重新学习不希望信息的尝试，同时在多个模型上的体育事实数据集和CounterFact数据集上减少了与基线相比的意外副作用。我们还发现，某些局部化编辑比其他基线更严重地破坏了模型的潜在知识，使得遗忘对各种攻击更具鲁棒性。

[NLP-144] What Do Speech Foundation Models Not Learn About Speech?

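[Quick Read]: This paper studies how speech foundation models capture non-verbal cues, how those cues are represented across model layers, and how well the representations adapt to downstream tasks. The approach is to analyze Whisper, Seamless, Wav2Vec, HuBERT, and Qwen2-Audio on the Dynamic-SUPERB benchmark, first evaluating them in a zero-shot setting and then fine-tuning on layer-wise features extracted from the models. The results suggest that some of these models perform well in zero-shot settings, that zero-shot performance correlates with better-learned representations, and that some models exhibit a convex relationship between model depth and the separability of the learned representations.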

链接: https://arxiv.org/abs/2410.12948
作者: Abdul Waheed,Hanin Atwany,Bhiksha Raj,Rita Singh
关键词-EN: Understanding how speech, speech foundation models, foundation models capture, capture non-verbal cues, speech foundation
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 20 Pages

Abstract:Understanding how speech foundation models capture non-verbal cues is crucial for improving their interpretability and adaptability across diverse tasks. In our work, we analyze several prominent models such as Whisper, Seamless, Wav2Vec, HuBERT, and Qwen2-Audio focusing on their learned representations in both paralinguistic and non-paralinguistic tasks from the Dynamic-SUPERB benchmark. Our study addresses three key questions: (1) What non-verbal cues (e.g., speaker intent, emotion, environmental context) are captured? (2) How are these cues represented across different layers of the models? and (3) To what extent can these representations be effectively adapted to downstream tasks? To answer these questions, we first evaluate the models in a zero-shot setting, followed by fine-tuning on layer-wise features extracted from these models. Our results provide insights into the models’ capacity for generalization, the characteristics of their layer-wise representations, and the degree of transformation required for downstream task adaptation. Our findings suggest that some of these models perform well on various tasks in zero-shot settings, despite not being explicitly trained for those tasks. We also observe that zero-shot performance correlates with better-learned representations. The analysis of layer-wise features demonstrates that some models exhibit a convex relationship between the separability of the learned representations and model depth, with different layers capturing task-specific features.
摘要：理解语音基础模型如何捕捉非语言线索对于提高其在多样化任务中的可解释性和适应性至关重要。在我们的研究中，我们分析了几个突出的模型，如 Whisper、Seamless、Wav2Vec、HuBERT 和 Qwen2-Audio，重点关注它们在 Dynamic-SUPERB 基准测试中在副语言和非副语言任务中的学习表示。我们的研究回答了三个关键问题：(1) 捕捉了哪些非语言线索（例如，说话者意图、情感、环境背景）？(2) 这些线索在模型的不同层中是如何表示的？以及 (3) 这些表示在多大程度上可以有效地适应下游任务？为了回答这些问题，我们首先在零样本设置下评估模型，然后对从这些模型中提取的逐层特征进行微调。我们的结果提供了关于模型泛化能力、其逐层表示特征以及适应下游任务所需转换程度的见解。我们的研究结果表明，尽管这些模型并非专门为这些任务训练，但它们在零样本设置下在各种任务中表现良好。我们还观察到，零样本性能与更好的学习表示相关。对逐层特征的分析表明，某些模型在学习的表示的可分离性与模型深度之间表现出凸关系，不同层捕捉任务特定的特征。

[NLP-145] Merge to Learn: Efficiently Adding Skills to Language Models with Model Merging EMNLP2024

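[Quick Read]: This paper addresses the high cost of adapting general-purpose language models to new skills and the risk that the models forget older skills. The key idea is a parallel-train-then-merge procedure: train on the new skill in isolation, then merge the result with the general model (for example, using task vectors). The procedure is significantly cheaper than retraining on updated data mixtures, and in experiments on scientific literature understanding, safety, and coding it is often comparably effective. It is especially well suited to adding safety features, where it dramatically improves compliance with safe prompts while preserving the ability to refuse dangerous or harmful prompts.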

链接: https://arxiv.org/abs/2410.12937
作者: Jacob Morrison,Noah A. Smith,Hannaneh Hajishirzi,Pang Wei Koh,Jesse Dodge,Pradeep Dasigi
关键词-EN: Adapting general-purpose language, instruction datasets targeting, forget older skills, Adapting general-purpose, general-purpose language models
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Findings of EMNLP 2024

Abstract:Adapting general-purpose language models to new skills is currently an expensive process that must be repeated as new instruction datasets targeting new skills are created, or can cause the models to forget older skills. In this work, we investigate the effectiveness of adding new skills to preexisting models by training on the new skills in isolation and later merging with the general model (e.g. using task vectors). In experiments focusing on scientific literature understanding, safety, and coding, we find that the parallel-train-then-merge procedure, which is significantly cheaper than retraining the models on updated data mixtures, is often comparably effective. Our experiments also show that parallel training is especially well-suited for enabling safety features in LMs relative to continued finetuning and retraining, as it dramatically improves model compliance with safe prompts while preserving its ability to refuse dangerous or harmful prompts.
摘要：将通用语言模型适应于新技能目前是一个昂贵的过程，必须在新技能的指令数据集创建时重复进行，或者可能导致模型遗忘旧技能。在本研究中，我们探讨了通过单独训练新技能，然后将训练结果与通用模型合并（例如使用任务向量）来为现有模型添加新技能的有效性。在专注于科学文献理解、安全性和编码的实验中，我们发现，与在更新数据混合上重新训练模型相比，并行训练然后合并的过程显著降低了成本，且通常同样有效。我们的实验还表明，相对于持续微调与重新训练，并行训练特别适合于在语言模型中启用安全功能，因为它显著提高了模型对安全提示的遵从性，同时保留了拒绝危险或有害提示的能力。
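
The abstract names task vectors as one way to merge. A minimal sketch of task-vector arithmetic follows, using plain Python lists in place of real tensors. The assumption that the skill model and the general model were both fine-tuned from the same base checkpoint, the `scale` parameter, and the toy weights are illustrative choices, not details from the paper.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Difference between a skill-specific model and the shared base model."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def merge(general: Weights, vector: Weights, scale: float = 1.0) -> Weights:
    """Add a scaled task vector to the general model's weights."""
    return {
        name: [g + scale * v for g, v in zip(general[name], vector[name])]
        for name in general
    }


# Toy weights (hypothetical): one parameter tensor with three values.
base = {"layer.weight": [0.0, 1.0, 2.0]}
safety = {"layer.weight": [0.5, 1.0, 1.0]}    # base fine-tuned on the new skill only
general = {"layer.weight": [1.0, 1.0, 3.0]}   # base fine-tuned on the general mixture

merged = merge(general, task_vector(safety, base), scale=0.5)
print(merged)  # {'layer.weight': [1.25, 1.0, 2.5]}
```

Because the skill is trained in isolation, only the task vector has to be recomputed when a new skill dataset appears; the general model is left untouched until the merge.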

[NLP-146] Enhancing Mathematical Reasoning in LLMs by Stepwise Correction

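[Quick Read]: This paper addresses a weakness of Best-of-N decoding for mathematical reasoning: the independently generated solutions often repeat the same mistakes, so even the highest-scored answer can be wrong. The key contribution is Stepwise Correction (StepCo), a prompting method that iterates verification and revision phases, using a process-supervised verifier to help the LLM identify and revise incorrect steps in its reasoning path. The verify-then-revise process improves answer correctness and reduces token consumption because fewer paths need to be generated. With GPT-4o as the backend, StepCo reaches an average accuracy of 94.1 across eight datasets, outperforming the state-of-the-art Best-of-N method by +2.4 while reducing token consumption by 77.8%.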

链接: https://arxiv.org/abs/2410.12934
作者: Zhenyu Wu,Qingkai Zeng,Zhihan Zhang,Zhaoxuan Tan,Chao Shen,Meng Jiang
关键词-EN: large language models, instruct large language, mathematical reasoning problems, decoding methods instruct, methods instruct large
类目: Computation and Language (cs.CL)
备注: under review

Abstract:Best-of-N decoding methods instruct large language models (LLMs) to generate multiple solutions, score each using a scoring function, and select the highest scored as the final answer to mathematical reasoning problems. However, this repeated independent process often leads to the same mistakes, making the selected solution still incorrect. We propose a novel prompting method named Stepwise Correction (StepCo) that helps LLMs identify and revise incorrect steps in their generated reasoning paths. It iterates verification and revision phases that employ a process-supervised verifier. The verify-then-revise process not only improves answer correctness but also reduces token consumption with fewer paths needed to generate. With StepCo, a series of LLMs demonstrate exceptional performance. Notably, using GPT-4o as the backend LLM, StepCo achieves an average accuracy of 94.1 across eight datasets, significantly outperforming the state-of-the-art Best-of-N method by +2.4, while reducing token consumption by 77.8%.
摘要：Best-of-N 解码方法指导大语言模型 (LLMs) 生成多个解决方案，使用评分函数对每个方案进行评分，并选择得分最高的作为最终答案来解决数学推理问题。然而，这种重复的独立过程往往导致相同的错误，使得选定的解决方案仍然不正确。我们提出了一种名为逐步校正 (Stepwise Correction, StepCo) 的新型提示方法，帮助 LLMs 识别并修正其生成推理路径中的错误步骤。该方法通过迭代验证和修订阶段，利用过程监督验证器进行校正。这种先验证后修订的过程不仅提高了答案的正确性，还通过减少生成路径的数量降低了 Token 消耗。使用 StepCo，一系列 LLMs 展示了卓越的性能。特别值得注意的是，使用 GPT-4o 作为后端 LLM，StepCo 在八个数据集上实现了 94.1 的平均准确率，显著优于最先进的 Best-of-N 方法，准确率提高了 +2.4，同时 Token 消耗减少了 77.8%。
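
The verify-then-revise loop can be sketched as follows. This is an illustrative control loop under simplifying assumptions, not the StepCo implementation: the `verify` and `revise` callables, the 0.5 threshold, the single-step replacement, and the toy arithmetic checker are all hypothetical stand-ins for the process-supervised verifier and the LLM.

```python
from typing import Callable, List


def stepwise_correction(
    steps: List[str],
    verify: Callable[[List[str]], float],
    revise: Callable[[List[str], str], str],
    threshold: float = 0.5,
    max_iters: int = 5,
) -> List[str]:
    """Repeatedly find the first low-scoring step and replace it with a revision."""
    steps = list(steps)
    for _ in range(max_iters):
        # Score each step in the context of the steps that precede it.
        scores = [verify(steps[: i + 1]) for i in range(len(steps))]
        bad = next((i for i, s in enumerate(scores) if s < threshold), None)
        if bad is None:
            break  # every step passes verification
        steps[bad] = revise(steps[:bad], steps[bad])
    return steps


# Toy verifier and reviser for steps of the form "a + b = c" (assumptions).
def toy_verify(prefix: List[str]) -> float:
    left, right = prefix[-1].split("=")
    a, b = left.split("+")
    return 1.0 if int(a) + int(b) == int(right) else 0.0


def toy_revise(prefix: List[str], step: str) -> str:
    left, _ = step.split("=")
    a, b = left.split("+")
    return f"{a.strip()} + {b.strip()} = {int(a) + int(b)}"


fixed = stepwise_correction(["2 + 3 = 6", "5 + 4 = 9"], toy_verify, toy_revise)
print(fixed)  # ['2 + 3 = 5', '5 + 4 = 9']
```

Unlike Best-of-N, which samples whole solutions independently, this loop keeps one path and repairs it in place, which is the intuition behind the lower token consumption reported in the abstract.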

[NLP-147] Interpreting token compositionality in LLMs: A robustness analysis

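[Quick Read]: This paper investigates the internal mechanisms by which large language models (LLMs) process compositional linguistic structures, and in particular their limitations in integrating semantic representations. The key contribution is Constituent-Aware Pooling (CAP), which systematically intervenes in model activations through constituent-based pooling at various model levels. The experiments show fragmented information processing: no specific layer integrates tokens into unified semantic representations based on their constituent parts, and the fragmentation intensifies with model size. The findings point to fundamental limitations of current transformer architectures in compositional semantics processing and model interpretability.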

链接: https://arxiv.org/abs/2410.12924
作者: Nura Aljaafari,Danilo S. Carvalho,André Freitas
关键词-EN: Understanding the internal, large language models, enhancing their reliability, inference processes, internal mechanisms
类目: Computation and Language (cs.CL)
备注: 15 pages, 2 Figures, 7 tables

Abstract:Understanding the internal mechanisms of large language models (LLMs) is integral to enhancing their reliability, interpretability, and inference processes. We present Constituent-Aware Pooling (CAP), a methodology designed to analyse how LLMs process compositional linguistic structures. Grounded in principles of compositionality, mechanistic interpretability, and information gain theory, CAP systematically intervenes in model activations through constituent-based pooling at various model levels. Our experiments on inverse definition modelling, hypernym and synonym prediction reveal critical insights into transformers’ limitations in handling compositional abstractions. No specific layer integrates tokens into unified semantic representations based on their constituent parts. We observe fragmented information processing, which intensifies with model size, suggesting that larger models struggle more with these interventions and exhibit greater information dispersion. This fragmentation likely stems from transformers’ training objectives and architectural design, preventing systematic and cohesive representations. Our findings highlight fundamental limitations in current transformer architectures regarding compositional semantics processing and model interpretability, underscoring the critical need for novel approaches in LLM design to address these challenges.
摘要：理解大语言模型 (LLM) 的内部机制对于提升其可靠性、可解释性和推理过程至关重要。我们提出了成分感知池化 (Constituent-Aware Pooling, CAP)，这是一种旨在分析 LLM 如何处理组合性语言结构的方法。基于组合性原则、机制可解释性和信息增益理论，CAP 通过在不同模型层级进行基于成分的池化，系统地干预模型激活。我们在逆定义建模、上位词和同义词预测实验中揭示了 Transformer 在处理组合抽象方面的关键局限性。没有特定的层级能够基于其成分部分将 Token 整合为统一的语义表示。我们观察到信息处理的分裂现象，随着模型规模的增大，这种现象加剧，表明更大的模型在这些干预下更为困难，且表现出更大的信息分散。这种分裂可能源于 Transformer 的训练目标和架构设计，阻碍了系统性和连贯性的表示。我们的研究结果突显了当前 Transformer 架构在组合语义处理和模型可解释性方面的根本性局限，强调了在 LLM 设计中采用新方法以应对这些挑战的迫切需求。
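
As an illustration of what constituent-based pooling of activations could look like, the sketch below collapses the token vectors inside each constituent span into a single vector. The abstract does not specify the pooling operator or the span format, so the mean and max options, the half-open `(start, end)` spans, and the toy activations are assumptions.

```python
from typing import List, Tuple

Vector = List[float]


def constituent_pool(
    activations: List[Vector],
    spans: List[Tuple[int, int]],
    mode: str = "mean",
) -> List[Vector]:
    """Pool token activations within each half-open span [start, end)."""
    pooled = []
    for start, end in spans:
        group = activations[start:end]
        dim = len(group[0])
        if mode == "mean":
            pooled.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
        elif mode == "max":
            pooled.append([max(v[d] for v in group) for d in range(dim)])
        else:
            raise ValueError(f"unknown pooling mode: {mode}")
    return pooled


# Toy activations (hypothetical): three tokens, two hidden dimensions.
# The first two tokens form one constituent, the third stands alone.
acts = [[1.0, 0.0], [3.0, 2.0], [5.0, 5.0]]
spans = [(0, 2), (2, 3)]

print(constituent_pool(acts, spans, "mean"))  # [[2.0, 1.0], [5.0, 5.0]]
print(constituent_pool(acts, spans, "max"))   # [[3.0, 2.0], [5.0, 5.0]]
```

In an intervention study the pooled vectors would be fed back into the model at a chosen layer; a small drop in task performance would suggest that the layer already represents the constituent as a unit.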

[NLP-148] MSc-SQL: Multi-Sample Critiquing Small Language Models For Text-To-SQL Translation NEURIPS2024

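[Quick Read]: This paper addresses the accessibility, privacy, and latency problems that arise when text-to-SQL generation relies on large closed-source models such as GPT-4. The approach is to develop small, efficient, open-source text-to-SQL models. The proposed method, MSc-SQL, samples multiple candidate SQL generations and critiques them using associated metadata. Because the critiquing model evaluates multiple outputs simultaneously, it achieves state-of-the-art performance among open-source models while remaining competitive with larger models at a much lower cost.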

链接: https://arxiv.org/abs/2410.12916
作者: Satya Krishna Gorti,Ilan Gofman,Zhaoyan Liu,Jiapeng Wu,Noël Vouitsis,Guangwei Yu,Jesse C. Cresswell,Rasa Hosseinzadeh
关键词-EN: generation enables non-experts, natural language, enables non-experts, non-experts to interact, interact with databases
类目: Computation and Language (cs.CL)
备注: 3rd Table Representation Learning Workshop at NeurIPS 2024

Abstract:Text-to-SQL generation enables non-experts to interact with databases via natural language. Recent advances rely on large closed-source models like GPT-4 that present challenges in accessibility, privacy, and latency. To address these issues, we focus on developing small, efficient, and open-source text-to-SQL models. We demonstrate the benefits of sampling multiple candidate SQL generations and propose our method, MSc-SQL, to critique them using associated metadata. Our sample critiquing model evaluates multiple outputs simultaneously, achieving state-of-the-art performance compared to other open-source models while remaining competitive with larger models at a much lower cost. Full code can be found at this http URL.
摘要：文本到 SQL 生成使非专家能够通过自然语言与数据库进行交互。最近的进展依赖于 GPT-4 等大型闭源模型，这些模型在可访问性、隐私性和延迟方面存在挑战。为了解决这些问题，我们专注于开发小型、高效且开源的文本到 SQL 模型。我们展示了采样多个候选 SQL 生成的优势，并提出了我们的方法 MSc-SQL，使用相关元数据对其进行评估。我们的样本评估模型同时评估多个输出，在与其他开源模型相比时达到了最先进的性能，同时在成本大幅降低的情况下与更大型的模型保持竞争力。完整代码可在以下网址找到：http URL。
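
A sketch of the sample-then-critique idea using Python's built-in `sqlite3` module. Treating execution results and error messages as the "associated metadata" is an assumption, and `toy_critic` is a rule-based stand-in for the learned critiquing model, which the abstract does not describe in detail.

```python
import sqlite3
from typing import Any, Dict, List


def execute_candidates(
    conn: sqlite3.Connection, candidates: List[str]
) -> List[Dict[str, Any]]:
    """Run every candidate query and record its rows or its error message."""
    results = []
    for sql in candidates:
        try:
            rows = conn.execute(sql).fetchall()
            results.append({"sql": sql, "rows": rows, "error": None})
        except sqlite3.Error as exc:
            results.append({"sql": sql, "rows": None, "error": str(exc)})
    return results


def toy_critic(results: List[Dict[str, Any]]) -> str:
    """Stand-in critic: prefer the first candidate that executes and returns rows."""
    valid = [r for r in results if r["error"] is None and r["rows"]]
    return valid[0]["sql"] if valid else results[0]["sql"]


# Toy in-memory database and sampled candidates (hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada'), (2, 'Linus')")

candidates = [
    "SELECT nam FROM users",                # misspelled column, fails to execute
    "SELECT name FROM users WHERE id = 1",  # executes and returns a row
]
results = execute_candidates(conn, candidates)
best = toy_critic(results)
print(best)
```

Showing the critic all candidates together, along with their execution metadata, is what distinguishes this setup from scoring each candidate in isolation.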

[NLP-149] A Survey on Data Synthesis and Augmentation for Large Language Models

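[Quick Read]: This survey addresses the looming data exhaustion problem facing large language models (LLMs): training datasets are expanding faster than high-quality data is being produced. The proposed direction is to improve data efficiency and explore new data sources, with synthetic data as a promising solution. The paper reviews data generation techniques, grouped into data augmentation and data synthesis, across the LLM lifecycle: data preparation, pre-training, fine-tuning, instruction-tuning, preference alignment, and applications. It also discusses the current constraints of these methods and possible paths for future work, with the aim of helping researchers quickly identify appropriate data generation strategies when building LLMs.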

链接: https://arxiv.org/abs/2410.12896
作者: Ke Wang,Jiahui Zhu,Minjie Ren,Zeming Liu,Shiwei Li,Zongye Zhang,Chenkai Zhang,Xiaoyu Wu,Qiqi Zhan,Qingjie Liu,Yunhong Wang
关键词-EN: Large Language Models, Language Models, Large Language, success of Large, availability of vast
类目: Computation and Language (cs.CL)
备注:

Abstract:The success of Large Language Models (LLMs) is inherently linked to the availability of vast, diverse, and high-quality data for training and evaluation. However, the growth rate of high-quality data is significantly outpaced by the expansion of training datasets, leading to a looming data exhaustion crisis. This underscores the urgent need to enhance data efficiency and explore new data sources. In this context, synthetic data has emerged as a promising solution. Currently, data generation primarily consists of two major approaches: data augmentation and synthesis. This paper comprehensively reviews and summarizes data generation techniques throughout the lifecycle of LLMs, including data preparation, pre-training, fine-tuning, instruction-tuning, preference alignment, and applications. Furthermore, we discuss the current constraints faced by these methods and investigate potential pathways for future development and research. Our aspiration is to equip researchers with a clear understanding of these methodologies, enabling them to swiftly identify appropriate data generation strategies in the construction of LLMs, while providing valuable insights for future exploration.
摘要：大语言模型 (LLM) 的成功与训练和评估所需的大量、多样且高质量的数据密不可分。然而，高质量数据的增速远不及训练数据集的扩展速度，导致数据枯竭危机迫在眉睫。这凸显了提升数据效率和探索新数据源的迫切需求。在此背景下，合成数据作为一种有前景的解决方案应运而生。目前，数据生成主要包含两大方法：数据增强和数据合成。本文全面回顾并总结了大语言模型生命周期中的数据生成技术，包括数据准备、预训练、微调、指令调优、偏好对齐及应用。此外，我们讨论了这些方法当前面临的限制，并探讨了未来发展和研究的可能路径。我们的目标是帮助研究人员清晰理解这些方法论，使他们在构建大语言模型时能迅速识别合适的数据生成策略，同时为未来的探索提供有价值的见解。

[NLP-150] Large Language Models and the Rationalist Empiricist Debate

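[Quick Read]: This paper asks whether the emergence of large language models (LLMs) vindicates rationalism about language learning, and whether any such vindication bears on the rationalist-empiricist debate about human language learning. The key is to distinguish two fundamental differences in how humans and LLMs learn: humans learn language despite a poverty of stimulus, whereas LLMs rely on an extremely rich stimulus; and human linguistic output is grounded in sensory experience, whereas LLM output is not. These differences indicate that the two rely on different underlying competencies to produce their output, so LLM learning is not comparable to human learning, and whether LLMs learn in an empiricist manner is irrelevant to whether humans do.

Link: https://arxiv.org/abs/2410.12895
Authors: David King
Keywords: Large Language Models, Chomsky Rationalism, Quine and Skinner, LLMs, updated version
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: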

Abstract:To many, Chomsky’s debates with Quine and Skinner are an updated version of the Rationalist Empiricist debates of the 17th century, the consensus being that Chomsky’s Rationalism was victorious. This dispute has reemerged with the advent of Large Language Models, with some arguing that LLMs vindicate rationalism because of the necessity of building in innate biases to make them work. The necessity of building in innate biases is taken to prove that empiricism hasn’t got the conceptual resources to explain linguistic competence. Such claims depend on the nature of the empiricism one is endorsing. Externalized Empiricism has no difficulties with innate apparatus once they are determined empirically (Quine 1969). Thus, externalized empiricism is not refuted because of the need to build in innate biases in LLMs. Furthermore, the relevance of LLMs to the rationalist empiricist debate in relation to humans is dubious. For any claim about whether LLMs learn in an empiricist manner to be relevant to humans, it needs to be shown that LLMs and humans learn in the same way. Two key features distinguish humans and LLMs. Humans learn despite a poverty of stimulus and LLMs learn because of an incredibly rich stimulus. Human linguistic outputs are grounded in sensory experience and LLMs are not. These differences in how the two learn indicate that they use different underlying competencies to produce their output. Therefore, any claims about whether LLMs learn in an empiricist manner are not relevant to whether humans learn in an empiricist manner.

[NLP-151] MIRROR: A Novel Approach for the Automated Evaluation of Open-Ended Question Generation NEURIPS2024

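[Quick Read]: This paper addresses automating the evaluation of question quality for automatic question generation systems, a task that requires human-like understanding and judgment that automated systems currently lack. The key is a system called MIRROR (Multi-LLM Iterative Review and Response for Optimized Rating), which uses large language models such as GPT-4, Gemini, and Llama2-70b to automate the evaluation. Through its feedback mechanism, MIRROR moves scores on metrics such as relevance, appropriateness, novelty, complexity, and grammaticality closer to human baseline scores, and it improves Pearson's correlation coefficient between GPT-4 and human experts compared with direct prompting.

Link: https://arxiv.org/abs/2410.12893
Authors: Aniket Deroy,Subhankar Maity,Sudeshna Sarkar
Keywords: stimulate critical thinking, Automatic question generation, involves evaluating question, evaluating question quality, Automatic question
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at FM-Eduassess @ NEURIPS 2024 (ORAL Paper)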

Abstract:Automatic question generation is a critical task that involves evaluating question quality by considering factors such as engagement, pedagogical value, and the ability to stimulate critical thinking. These aspects require human-like understanding and judgment, which automated systems currently lack. However, human evaluations are costly and impractical for large-scale samples of generated questions. Therefore, we propose a novel system, MIRROR (Multi-LLM Iterative Review and Response for Optimized Rating), which leverages large language models (LLMs) to automate the evaluation process for questions generated by automated question generation systems. We experimented with several state-of-the-art LLMs, such as GPT-4, Gemini, and Llama2-70b. We observed that the scores of human evaluation metrics, namely relevance, appropriateness, novelty, complexity, and grammaticality, improved when using the feedback-based approach called MIRROR, tending to be closer to the human baseline scores. Furthermore, we observed that Pearson’s correlation coefficient between GPT-4 and human experts improved when using our proposed feedback-based approach, MIRROR, compared to direct prompting for evaluation. Error analysis shows that our proposed approach, MIRROR, significantly helps to improve relevance and appropriateness.
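The agreement measure named in the abstract, Pearson's correlation coefficient, can be computed directly from paired ratings. The sketch below is a minimal illustration and not the authors' evaluation code: the function name `pearson` and the 1-5 ratings are made-up assumptions used only to show the calculation.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances (unnormalised; the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical ratings for five generated questions (illustrative numbers only).
human_scores = [4.0, 3.0, 5.0, 2.0, 4.0]
llm_scores = [4.0, 2.0, 5.0, 3.0, 4.0]
r = pearson(llm_scores, human_scores)
print(f"Pearson r = {r:.3f}")
```

A value of r closer to 1 means the LLM's ratings rise and fall with the human ratings; it says nothing about whether the two use the same absolute scale.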

[NLP-152] Multi-trait User Simulation with Adaptive Decoding for Conversational Task Assistants EMNLP2024

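[Quick Read]: This paper addresses how conversational systems can be made robust to, and efficiently simulate, the diverse interaction traits of users. The key is Multi-Trait Adaptive Decoding (mTAD), which generates diverse user profiles at decoding time by sampling from multiple trait-specific language models (LMs). mTAD offers an adaptive and scalable approach to user simulation that creates multiple user profiles without additional fine-tuning, which increases conversational diversity. Experiments validate its effectiveness and flexibility in capturing single traits and in combining diverse user simulators.

Link: https://arxiv.org/abs/2410.12891
Authors: Rafael Ferreira,David Semedo,João Magalhães
Keywords: naturally exhibit diverse, interactions that naturally, naturally exhibit, exhibit diverse conversational, trait-specific Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Preprint from EMNLP 2024 Findings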

Abstract:Conversational systems must be robust to user interactions that naturally exhibit diverse conversational traits. Capturing and simulating these diverse traits coherently and efficiently presents a complex challenge. This paper introduces Multi-Trait Adaptive Decoding (mTAD), a method that generates diverse user profiles at decoding-time by sampling from various trait-specific Language Models (LMs). mTAD provides an adaptive and scalable approach to user simulation, enabling the creation of multiple user profiles without the need for additional fine-tuning. By analyzing real-world dialogues from the Conversational Task Assistant (CTA) domain, we identify key conversational traits and develop a framework to generate profile-aware dialogues that enhance conversational diversity. Experimental results validate the effectiveness of our approach in modeling single traits using specialized LMs, which can capture less common patterns, even in out-of-domain tasks. Furthermore, the results demonstrate that mTAD is a robust and flexible framework for combining diverse user simulators.
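One way to picture decoding-time sampling from several trait-specific LMs is to combine their next-token distributions before sampling. The abstract does not say how mTAD combines the models, so the weighted mixture below is an assumption chosen for illustration; the trait names, tokens, probabilities, and function names are all hypothetical.

```python
import random


def mix_distributions(
    dists: list[dict[str, float]], weights: list[float]
) -> dict[str, float]:
    """Weighted average of next-token distributions, renormalised to sum to 1."""
    total_w = sum(weights)
    mixed: dict[str, float] = {}
    for dist, w in zip(dists, weights):
        for token, p in dist.items():
            mixed[token] = mixed.get(token, 0.0) + (w / total_w) * p
    norm = sum(mixed.values())
    return {token: p / norm for token, p in mixed.items()}


def sample_token(dist: dict[str, float], rng: random.Random) -> str:
    """Draw one token according to its probability."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


# Hypothetical next-token distributions from two trait-specific LMs.
polite_lm = {"please": 0.7, "now": 0.3}
impatient_lm = {"please": 0.2, "now": 0.8}

# Equal weights stand in for a user profile that blends both traits.
mixed = mix_distributions([polite_lm, impatient_lm], weights=[1.0, 1.0])
token = sample_token(mixed, random.Random(0))
print(mixed, token)
```

Changing the weights changes the simulated profile without retraining either model, which is the property the abstract highlights.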

[NLP-153] REFINE on Scarce Data: Retrieval Enhancement through Fine-Tuning via Model Fusion of Embedding Models

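[Quick Read]: This paper addresses errors or hallucinations in generated answers that arise when pretrained embedding models used for domain-specific document retrieval return inaccurate context. The key is a technique called REFINE, which generates synthetic data and fine-tunes the embedding model with a model fusion approach, improving retrieval in the new domain while preserving out-of-domain capability. Experiments show that standard fine-tuning with the proposed data augmentation alone already outperforms the vanilla pretrained model, and that adding model fusion improves performance further.

Link: https://arxiv.org/abs/2410.12890
Authors: Ambuje Gupta,Mrinal Rawat,Andreas Stolcke,Roberto Pieraccini
Keywords: vector store computed, Retrieval augmented generation, retrieving relevant documents, augmented generation, pipelines are commonly
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Accepted in AJCAI’24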

Abstract:Retrieval augmented generation (RAG) pipelines are commonly used in tasks such as question-answering (QA), relying on retrieving relevant documents from a vector store computed using a pretrained embedding model. However, if the retrieved context is inaccurate, the answers generated using the large language model (LLM) may contain errors or hallucinations. Although pretrained embedding models have advanced, adapting them to new domains remains challenging. Fine-tuning is a potential solution, but industry settings often lack the necessary fine-tuning data. To address these challenges, we propose REFINE, a novel technique that generates synthetic data from available documents and then uses a model fusion approach to fine-tune embeddings for improved retrieval performance in new domains, while preserving out-of-domain capability. We conducted experiments on two public datasets, SQUAD and RAG-12000, and a proprietary TOURISM dataset. Results demonstrate that even the standard fine-tuning with the proposed data augmentation technique outperforms the vanilla pretrained model. Furthermore, when combined with model fusion, the proposed approach achieves superior performance, with a 5.76% improvement in recall on the TOURISM dataset, and 6.58% and 0.32% enhancement on SQUAD and RAG-12000 respectively.
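The abstract does not describe how REFINE fuses models. As a hedged illustration of the general idea, the sketch below uses linear weight interpolation, one common form of model fusion, to blend a pretrained and a fine-tuned copy of the same embedding model. The parameter name `proj`, the values, and the mixing factor `alpha` are hypothetical.

```python
def fuse_weights(
    pretrained: dict[str, list[float]],
    finetuned: dict[str, list[float]],
    alpha: float,
) -> dict[str, list[float]]:
    """Interpolate per parameter: (1 - alpha) * pretrained + alpha * finetuned."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if pretrained.keys() != finetuned.keys():
        raise ValueError("both models must have the same parameter names")
    return {
        name: [
            (1.0 - alpha) * p + alpha * f
            for p, f in zip(pretrained[name], finetuned[name])
        ]
        for name in pretrained
    }


# Toy "models" with a single two-element parameter (illustrative values).
pretrained = {"proj": [0.0, 2.0]}
finetuned = {"proj": [1.0, 4.0]}

# alpha = 0 keeps the pretrained model, alpha = 1 keeps the fine-tuned one.
fused = fuse_weights(pretrained, finetuned, alpha=0.25)
print(fused)
```

The intuition for this kind of fusion is that staying close to the pretrained weights retains out-of-domain behaviour, while the fine-tuned component contributes the in-domain gains.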

[NLP-154] AT-RAG: An Adaptive RAG Model Enhancing Query Efficiency with Topic Filtering and Iterative Reasoning

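[Quick Read]: This paper addresses the limitations of large language models such as GPT-4 in handling complex multi-hop queries. The key is AT-RAG, a multistep RAG model that uses topic modeling (BERTopic) to dynamically assign topics to queries, improving the accuracy and efficiency of document retrieval. By integrating topic filtering with iterative reasoning, AT-RAG handles complex queries effectively, improving correctness, completeness, and relevance while reducing retrieval time and maintaining high precision. This makes it suitable for both general QA tasks and complex domain-specific challenges such as medical QA.

Link: https://arxiv.org/abs/2410.12886
Authors: Mohammad Reza Rezaei,Maziar Hafezi,Amit Satpathy,Lovell Hodge,Ebrahim Pourjafari
Keywords: Recent advancements, handling complex multi-hop, shown limitations, limitations in handling, complex multi-hop queries
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: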

Abstract:Recent advancements in QA with LLMs, like GPT-4, have shown limitations in handling complex multi-hop queries. We propose AT-RAG, a novel multistep RAG incorporating topic modeling for efficient document retrieval and reasoning. Using BERTopic, our model dynamically assigns topics to queries, improving retrieval accuracy and efficiency. We evaluated AT-RAG on multi-hop QA benchmark datasets and a medical QA case study. Results show significant improvements in correctness, completeness, and relevance compared to existing methods. AT-RAG reduces retrieval time while maintaining high precision, making it suitable for general QA tasks and complex domain-specific challenges such as medical QA. The integration of topic filtering and iterative reasoning enables our model to handle intricate queries efficiently, which makes it suitable for applications that require nuanced information retrieval and decision-making.
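Topic filtering before similarity search can be sketched in a few lines. This is a toy illustration under stated assumptions and not AT-RAG itself: the query topic is passed in directly (standing in for a BERTopic assignment), and the document ids, topics, and 2-d vectors are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(
    query_vec: list[float], query_topic: str, docs: list[dict], k: int = 2
) -> list[str]:
    """Keep only documents with the query's topic, then rank by similarity."""
    candidates = [d for d in docs if d["topic"] == query_topic]
    ranked = sorted(
        candidates, key=lambda d: cosine(query_vec, d["vec"]), reverse=True
    )
    return [d["id"] for d in ranked[:k]]


# Hypothetical document store.
docs = [
    {"id": "d1", "topic": "cardiology", "vec": [1.0, 0.0]},
    {"id": "d2", "topic": "cardiology", "vec": [0.6, 0.8]},
    {"id": "d3", "topic": "oncology", "vec": [1.0, 0.1]},
]

# d3 is close to the query in vector space but is filtered out by topic.
hits = retrieve([1.0, 0.2], "cardiology", docs, k=2)
print(hits)
```

Filtering first shrinks the candidate set that the similarity search has to score, which is one plausible source of the reduced retrieval time reported in the abstract.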

[NLP-155] Scaling Laws for Multilingual Language Models

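[Quick Read]: This paper addresses how to balance languages when pretraining multilingual language models (LMs). The key is to shift the focus from individual languages to language families and to hypothesize that the test cross-entropy loss of each language family is determined solely by its own sampling ratio, independent of the other languages in the mixture. From this hypothesis the paper derives a power-law relationship linking performance to dataset size, model size, and sampling ratios. The relationship simplifies multilingual scaling, predicts performance across combinations of these quantities, and yields the optimal sampling ratios at different model scales. Large-scale experiments show that optimal sampling ratios derived from small models also hold for much larger models, giving a resource-efficient approach to multilingual LM training.

Link: https://arxiv.org/abs/2410.12883
Authors: Yifei He,Alon Benhaim,Barun Patra,Praneetha Vaddamanu,Sanchit Ahuja,Parul Chopra,Vishrav Chaudhary,Han Zhao,Xia Song
Keywords: general-purpose decoder-only language, addressing the problem, general-purpose decoder-only, problem of balancing, sampling ratios
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: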

Abstract:We propose a novel scaling law for general-purpose decoder-only language models (LMs) trained on multilingual data, addressing the problem of balancing languages during multilingual pretraining. A primary challenge in studying multilingual scaling is the difficulty of analyzing individual language performance due to cross-lingual transfer. To address this, we shift the focus from individual languages to language families. We introduce and validate a hypothesis that the test cross-entropy loss for each language family is determined solely by its own sampling ratio, independent of other languages in the mixture. This insight simplifies the complexity of multilingual scaling and makes the analysis scalable to an arbitrary number of languages. Building on this hypothesis, we derive a power-law relationship that links performance with dataset size, model size and sampling ratios. This relationship enables us to predict performance across various combinations of the above three quantities, and derive the optimal sampling ratios at different model scales. To demonstrate the effectiveness and accuracy of our proposed scaling law, we perform a large-scale empirical study, training more than 100 models on 23 languages spanning 5 language families. Our experiments show that the optimal sampling ratios derived from small models (85M parameters) generalize effectively to models that are several orders of magnitude larger (1.2B parameters), offering a resource-efficient approach for multilingual LM training at scale.

[NLP-156] MIND: Math Informed syNthetic Dialogues for Pretraining LLMs

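[Quick Read]: This paper addresses the shortfall of existing synthetic-data approaches for pretraining large language models (LLMs) on complex, multi-hop, and mathematical reasoning tasks. The key is MIND, a large-scale and diverse Math Informed syNthetic Dialogue generation method that produces synthetic conversations from OpenWebMath (OWM), yielding a new math corpus, MIND-OWM. The study stresses that introducing knowledge gaps between dialogue participants is essential for generating high-quality math data, and it proposes an effective way to format and integrate synthetic and raw data to maximize gains in mathematical reasoning during pretraining. Compared with pretraining on raw data alone, a model pretrained on MIND-OWM improves significantly on mathematical reasoning (GSM8K, MATH), specialized knowledge (MMLU, MMLU-STEM), and general-purpose reasoning tasks.

Link: https://arxiv.org/abs/2410.12881
Authors: Syeda Nahida Akter,Shrimai Prabhumoye,John Kamalu,Sanjeev Satheesh,Eric Nyberg,Mostofa Patwary,Mohammad Shoeybi,Bryan Catanzaro
Keywords: recent large language, downstream task accuracy, large language models, improve downstream task, mathematical reasoning
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 31 pages, 5 figures, 14 tables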

Abstract:The utility of synthetic data to enhance pretraining data quality and hence to improve downstream task accuracy has been widely explored in recent large language models (LLMs). Yet, these approaches fall inadequate in complex, multi-hop and mathematical reasoning tasks as the synthetic data typically fails to add complementary knowledge to the existing raw corpus. In this work, we propose a novel large-scale and diverse Math Informed syNthetic Dialogue (MIND) generation method that improves the mathematical reasoning ability of LLMs. Specifically, using MIND, we generate synthetic conversations based on OpenWebMath (OWM), resulting in a new math corpus, MIND-OWM. Our experiments with different conversational settings reveal that incorporating knowledge gaps between dialog participants is essential for generating high-quality math data. We further identify an effective way to format and integrate synthetic and raw data during pretraining to maximize the gain in mathematical reasoning, emphasizing the need to restructure raw data rather than use it as-is. Compared to pretraining just on raw data, a model pretrained on MIND-OWM shows significant boost in mathematical reasoning (GSM8K: +13.42%, MATH: +2.30%), including superior performance in specialized knowledge (MMLU: +4.55%, MMLU-STEM: +4.28%) and general purpose reasoning tasks (GENERAL REASONING: +2.51%).

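[NLP-157] Navigating the Cultural Kaleidoscope: A Hitchhiker's Guide to Sensitivity in Large Language Models

[Quick Read]: This paper addresses cultural sensitivity problems that can arise when large language models (LLMs) are deployed in global applications. The problem is most acute in small-parameter models, which often lack the extensive training data needed to capture global cultural nuances and can therefore misrepresent or violate cultural values. The paper makes two main contributions: (1) a cultural harm test dataset for assessing model outputs across cultural contexts and exposing potential cultural insensitivities; and (2) a culturally aligned preference dataset for restoring cultural sensitivity through fine-tuning on feedback from diverse annotators. Together, these datasets help evaluate and improve the cultural sensitivity of LLMs and support their safe, ethical deployment across cultural settings, advancing more inclusive and respectful AI systems.

Link: https://arxiv.org/abs/2410.12880
Authors: Somnath Banerjee,Sayan Layek,Hari Shrawgi,Rajarshi Mandal,Avik Halder,Shanu Kumar,Sagnik Basu,Parag Agrawal,Rima Hazra,Animesh Mukherjee
Keywords: backgrounds feel respected, diverse backgrounds feel, cultural, respected and understood, increasingly deployed
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: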

Abstract:As LLMs are increasingly deployed in global applications, the importance of cultural sensitivity becomes paramount, ensuring that users from diverse backgrounds feel respected and understood. Cultural harm can arise when these models fail to align with specific cultural norms, resulting in misrepresentations or violations of cultural values. This work addresses the challenges of ensuring cultural sensitivity in LLMs, especially in small-parameter models that often lack the extensive training data needed to capture global cultural nuances. We present two key contributions: (1) A cultural harm test dataset, created to assess model outputs across different cultural contexts through scenarios that expose potential cultural insensitivities, and (2) A culturally aligned preference dataset, aimed at restoring cultural sensitivity through fine-tuning based on feedback from diverse annotators. These datasets facilitate the evaluation and enhancement of LLMs, ensuring their ethical and safe deployment across different cultural landscapes. Our results show that integrating culturally aligned feedback leads to a marked improvement in model behavior, significantly reducing the likelihood of generating culturally insensitive or harmful content. Ultimately, this work paves the way for more inclusive and respectful AI systems, fostering a future where LLMs can safely and ethically navigate the complexities of diverse cultural landscapes.

[NLP-158] Exploring transfer learning for Deep NLP systems on rarely annotated languages

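[Quick Read]: This paper addresses poor part-of-speech (POS) tagging performance for the many languages, especially low-resource ones, for which annotated data is scarce. The key is to use transfer learning and multitask learning: jointly training a POS tagging model for Hindi and Nepali, and adding auxiliary tasks such as gender and singular/plural tagging. The work uses the BLSTM-CNN-CRF deep learning architecture and compares monolingual word embeddings, vector-mapped embeddings, and jointly trained Hindi-Nepali word embeddings, finding that the jointly trained embeddings improve model performance.

Link: https://arxiv.org/abs/2410.12879
Authors: Dipendra Yadav,Tobias Strauß,Kristina Yordanova
Keywords: Natural language processing, significantly outperforming traditional, experienced rapid advancements, traditional rule-based methods, outperforming traditional rule-based
Categories: Computation and Language (cs.CL)
Comments: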

Abstract:Natural language processing (NLP) has experienced rapid advancements with the rise of deep learning, significantly outperforming traditional rule-based methods. By capturing hidden patterns and underlying structures within data, deep learning has improved performance across various NLP tasks, overcoming the limitations of rule-based systems. However, most research and development in NLP has been concentrated on a select few languages, primarily those with large numbers of speakers or financial significance, leaving many others underexplored. This lack of research is often attributed to the scarcity of adequately annotated datasets essential for training deep learning models. Despite this challenge, there is potential in leveraging the linguistic similarities between unexplored and well-studied languages, particularly those in close geographic and linguistic proximity. This thesis investigates the application of transfer learning for Part-of-Speech (POS) tagging between Hindi and Nepali, two highly similar languages belonging to the Indo-Aryan language family. Specifically, the work explores whether joint training of a POS tagging model for both languages enhances performance. Additionally, we assess whether multitask learning in Hindi, with auxiliary tasks such as gender and singular/plural tagging, can contribute to improved POS tagging accuracy. The deep learning architecture employed is the BLSTM-CNN-CRF model, trained under different conditions: monolingual word embeddings, vector-mapped embeddings, and jointly trained Hindi-Nepali word embeddings. Varying dropout rates (0.25 to 0.5) and optimizers (ADAM and AdaDelta) are also evaluated. Results indicate that jointly trained Hindi-Nepali word embeddings improve performance across all models compared to monolingual and vector-mapped embeddings.

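[NLP-159] Towards More Effective Table-to-Text Generation: Assessing In-Context Learning and Self-Evaluation with Open-Source Models

[Quick Read]: This paper addresses how to improve table-to-text generation, in particular how well current open-source language models perform on the task. The key is to explore and evaluate different in-context learning strategies, especially providing examples to the model. The study also introduces large language model (LLM) self-evaluation with chain-of-thought reasoning and compares it with human-aligned metrics such as BERTScore, in search of more reliable evaluation. It finds that providing examples markedly improves table-to-text generation, but that the agreement between LLM self-evaluation and human judgment still has room to improve.

Link: https://arxiv.org/abs/2410.12878
Authors: Sahar Iravani,Tim O. F. Conrad
Keywords: Table processing, natural language processing, key task, task in natural, significantly benefited
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 15 pages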

Abstract:Table processing, a key task in natural language processing, has significantly benefited from recent advancements in language models (LMs). However, the capabilities of LMs in table-to-text generation, which transforms structured data into coherent narrative text, require an in-depth investigation, especially with current open-source models. This study explores the effectiveness of various in-context learning strategies in LMs across benchmark datasets, focusing on the impact of providing examples to the model. More importantly, we examine a real-world use case, offering valuable insights into practical applications. To complement traditional evaluation metrics, we employ a large language model (LLM) self-evaluation approach using chain-of-thought reasoning and assess its correlation with human-aligned metrics like BERTScore. Our findings highlight the significant impact of examples in improving table-to-text generation and suggest that, while LLM self-evaluation has potential, its current alignment with human judgment could be enhanced. This points to the need for more reliable evaluation methods.

[NLP-160] Improving Instruction-Following in Language Models through Activation Steering

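[Quick Read]: This paper addresses the instruction-following ability of language models, in particular how to strengthen adherence to constraints such as output format, length, and word inclusion, even without explicit instructions. The key is to derive instruction-specific vector representations from the language model, computed as the difference in activations between inputs with and without the instruction, which enables modular activation steering. The method can steer models to follow constraints when no explicit instruction is given, improves performance when the instruction is present, and supports applying several instructions at once by composing their vectors. The paper also shows that steering vectors computed on instruction-tuned models transfer to base models and improve their performance.

Link: https://arxiv.org/abs/2410.12877
Authors: Alessandro Stolfo,Vidhisha Balachandran,Safoora Yousefi,Eric Horvitz,Besmira Nushi
Keywords: numerous real-world applications, crucial for numerous, numerous real-world, real-world applications, models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: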

Abstract:The ability to follow instructions is crucial for numerous real-world applications of language models. In pursuit of deeper insights and more powerful capabilities, we derive instruction-specific vector representations from language models and use them to steer models accordingly. These vectors are computed as the difference in activations between inputs with and without instructions, enabling a modular approach to activation steering. We demonstrate how this method can enhance model adherence to constraints such as output format, length, and word inclusion, providing inference-time control over instruction following. Our experiments across four models demonstrate how we can use the activation vectors to guide models to follow constraints even without explicit instructions and to enhance performance when instructions are present. Additionally, we explore the compositionality of activation steering, successfully applying multiple instructions simultaneously. Finally, we demonstrate that steering vectors computed on instruction-tuned models can transfer to improve base models. Our findings demonstrate that activation steering offers a practical and scalable approach for fine-grained control in language generation.
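The abstract describes steering vectors as the difference in activations between inputs with and without an instruction. The sketch below illustrates that arithmetic on toy 2-d activations. Averaging over several paired prompts and scaling the vector by a factor `scale` are assumptions made here for illustration; the abstract does not specify either, and no real model is involved.

```python
def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(
    with_instruction: list[list[float]], without_instruction: list[list[float]]
) -> list[float]:
    """Mean activation with the instruction minus mean activation without it."""
    mu_with = mean_vector(with_instruction)
    mu_without = mean_vector(without_instruction)
    return [a - b for a, b in zip(mu_with, mu_without)]


def steer(hidden: list[float], vector: list[float], scale: float = 1.0) -> list[float]:
    """Add the scaled steering vector to a hidden state at inference time."""
    return [h + scale * v for h, v in zip(hidden, vector)]


# Hypothetical activations for two prompts, each run with and without
# an instruction such as "answer in JSON".
acts_with = [[1.0, 2.0], [3.0, 2.0]]
acts_without = [[0.0, 1.0], [2.0, 1.0]]

v = steering_vector(acts_with, acts_without)
steered = steer([0.5, -0.5], v, scale=2.0)
print(v, steered)
```

Under this additive view, composing several instructions amounts to adding several such vectors to the same hidden state, which is one way to read the compositionality result in the abstract.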

[NLP-161] In-context KV-Cache Eviction for LLMs via Attention-Gate

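[Quick Read]: This paper addresses the KV-Cache becoming a performance bottleneck in large language model (LLM) inference systems, especially with ultra-large models and long-context queries. The key is Attention-Gate, a parameterized KV-Cache eviction mechanism that takes the whole context as input and produces an eviction flag for each token, realizing context-aware eviction. Attention-Gates can vary across attention heads and layers and can be tuned efficiently through continual pre-training or supervised fine-tuning to learn which tokens to discard, improving inference efficiency and model performance with minimal computational and memory overhead.

Link: https://arxiv.org/abs/2410.12876
Authors: Zihao Zeng,Bokai Lin,Tianqi Hou,Hao Zhang,Zhijie Deng
Keywords: large language models, large language, language models, KV-Cache technique, eviction
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: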

Abstract:The KV-Cache technique has become the standard for the inference of large language models (LLMs). It caches states of self-attention to avoid recomputation. Yet, it is widely criticized that KV-Cache can become a bottleneck of the LLM inference system, especially when confronted with ultra-large models and long-context queries. A natural remedy is to discard the KV-Cache for less important tokens, with StreamingLLM as an example, but the used static eviction strategies cannot flexibly adapt to varying contexts. Remedies like H2O leverage accumulative attention scores to perform dynamic eviction but suffer from the attention bias issue in capturing contextual information. This paper bridges this gap by devising a parameterized KV-Cache eviction mechanism, dubbed as Attention-Gate, which accepts the whole context as input and yields eviction flags for each token to realize in-context eviction. The subsequent self-attention module proceeds according to the flags and only the KV states for the remaining tokens need to be cached. The Attention-Gates can vary among different heads and layers and be trivially plugged into pre-trained LLMs, tuned by cost-effective continual pre-training or supervised fine-tuning objectives to acquire what to discard. The computational and memory overhead introduced by Attention-Gates is minimal. Our method is validated across multiple tasks, demonstrating both efficiency and adaptability. After a highly efficient continual pre-training, it achieves higher average accuracy and evicts more tokens compared to traditional training-free methods. In supervised fine-tuning, it not only evicts many tokens but also outperforms LoRA-finetuned LLMs on some datasets, such as RTE, where it improves accuracy by 13.9% while evicting 62.8% of tokens, showing that effective eviction of redundant tokens can even enhance performance.
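The eviction step can be illustrated with a toy single-head example: per-token gate scores become keep/evict flags, and attention then runs only over the key/value pairs that survive. This is a sketch under stated assumptions and not the paper's module. In the paper the gate is a learned component that reads the whole context, whereas here the gate logits are simply given as hypothetical inputs, and the sigmoid threshold of 0.5 is an illustrative choice.

```python
import math


def keep_flags(gate_logits: list[float], threshold: float = 0.5) -> list[bool]:
    """Keep a token when sigmoid(logit) reaches the threshold; evict otherwise."""
    return [1.0 / (1.0 + math.exp(-g)) >= threshold for g in gate_logits]


def evict(keys, values, flags):
    """Drop the key/value pairs of evicted tokens from the cache."""
    kept = [(k, v) for k, v, keep in zip(keys, values, flags) if keep]
    return [k for k, _ in kept], [v for _, v in kept]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over the cached tokens."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * v[i] for e, v in zip(exps, values)) for i in range(dim)]


# Hypothetical cache for three tokens; the gate marks the second for eviction.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values = [[1.0], [5.0], [3.0]]
flags = keep_flags([2.0, -3.0, 0.5])

kept_keys, kept_values = evict(keys, values, flags)
output = attend([1.0, 0.0], kept_keys, kept_values)
print(flags, output)
```

Only two of the three key/value pairs remain in memory after eviction, and the attention output is computed from those two alone.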

[NLP-162] On Debiasing Text Embeddings Through Context Injection

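[Quick Read]: This paper addresses text embedding models capturing and perpetuating the biases present in text. The key is a review of 19 embedding models that quantifies their biases and how well they respond to injected context as a means of debiasing. The paper finds that higher-performing embedding models are more prone to capturing common biases but are also better at incorporating context, and it proposes a simple top-k retrieval algorithm in which k is selected dynamically to mitigate bias in retrieval.

Link: https://arxiv.org/abs/2410.12874
Authors: Thomas Uriot
Keywords: leveraging textual data, NLP has made, build applications leveraging, applications leveraging textual, Current advances
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: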

Abstract:Current advances in NLP have made it increasingly feasible to build applications leveraging textual data. Generally, the core of these applications relies on having a good semantic representation of text as vectors, via specialized embedding models. However, it has been shown that these embeddings capture and perpetuate biases already present in text. While a few techniques have been proposed to debias embeddings, they do not take advantage of the recent advances in context understanding of the modern embedding models. In this paper, we fill this gap by conducting a review of 19 embedding models by quantifying their biases and how well they respond to context injection as a means of debiasing. We show that higher performing embedding models are more prone to capturing common biases, but are also better able to incorporate context. Surprisingly, we find that while models can easily embed affirmative context, they fail at embedding neutral semantics. Finally, in a retrieval task, we show that biases in embeddings can lead to non-desirable outcomes. We use our new-found insights to design a simple algorithm for top k retrieval where k is dynamically selected.
摘要：当前自然语言处理 (NLP) 的进展使得利用文本数据构建应用程序变得越来越可行。通常，这些应用程序的核心依赖于通过专门的嵌入模型将文本转化为良好的语义向量表示。然而，已有研究表明，这些嵌入模型捕捉并延续了文本中已存在的偏见。尽管已经提出了一些去偏见的技术，但它们并未充分利用现代嵌入模型在上下文理解方面的最新进展。本文通过量化 19 种嵌入模型的偏见及其对上下文注入作为去偏见手段的响应程度，填补了这一空白。我们发现，性能较高的嵌入模型更容易捕捉常见偏见，但同时也更能融入上下文。令人惊讶的是，我们发现尽管模型可以轻松嵌入肯定的上下文，但在嵌入中性语义方面却表现不佳。最后，在检索任务中，我们展示了嵌入模型中的偏见可能导致不理想的结果。我们利用新获得的见解设计了一种简单的算法，用于动态选择 k 值的 top k 检索。
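
The abstract does not spell out how k is selected, so the following is only a minimal sketch of one plausible rule, not the author's algorithm: rank candidates by similarity and cut the list at the largest drop between consecutive scores. The function name `dynamic_top_k` and the bounds `k_min` and `k_max` are assumptions introduced for illustration.

```python
def dynamic_top_k(scores, k_min=1, k_max=10):
    """Return indices of retrieved items, cutting at the largest score gap.

    Hypothetical rule for illustration only; the paper's own selection
    criterion is not described in the abstract. Requires k_min >= 1.
    """
    # Rank candidate indices by similarity score, best first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    limit = min(k_max, len(ranked))
    if limit <= k_min:
        return ranked[:limit]

    best_k, best_gap = k_min, -1.0
    # Try every cut point k and keep the one followed by the biggest drop.
    for k in range(k_min, limit):
        gap = scores[ranked[k - 1]] - scores[ranked[k]]
        if gap > best_gap:
            best_k, best_gap = k, gap
    return ranked[:best_k]
```

With a rule like this, a query whose candidates split into a clearly relevant group and a clearly irrelevant group returns only the first group, instead of a fixed number of results.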

[NLP-163] Beyond Right and Wrong: Mitigating Cold Start in Knowledge Tracing Using Large Language Model and Option Weight

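Quick read: This paper addresses the cold start problem in Knowledge Tracing (KT), particularly when learners' historical data is limited. The key to the solution is the LOKT (Large Language Model Option-weighted Knowledge Tracing) model, which combines option weights with large language models (LLMs). It goes beyond the binary correct/incorrect classification of traditional KT models by converting different types of incorrect answers into text-based ordinal categories, so that LLMs can assess a learner's knowledge state more clearly. The method focuses on the final knowledge state rather than the progression of learning, and experiments on five public datasets show that it maintains high predictive accuracy even with limited data, handling both "learner cold-start" and "system cold-start" scenarios.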

Link: https://arxiv.org/abs/2410.12872
Authors: JongWoo Kim,SeongYeub Chu,Bryan Wong,Mun Yi
Keywords: Option-weighted Knowledge Tracing, tracking learners’ knowledge, Knowledge Tracing, educational data mining, enabling personalized learning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 11 pages

Abstract: Knowledge Tracing (KT) is vital in educational data mining, enabling personalized learning by tracking learners’ knowledge states and forecasting their academic outcomes. This study introduces the LOKT (Large Language Model Option-weighted Knowledge Tracing) model to address the cold start problem, where limited historical data is available, using large language models (LLMs). While traditional KT models have incorporated option weights, our research extends this by integrating these weights into an LLM-based KT framework. Moving beyond the binary classification of correct and incorrect responses, we emphasize that different types of incorrect answers offer valuable insights into a learner’s knowledge state. By converting these responses into text-based ordinal categories, we enable LLMs to assess learner understanding with greater clarity, although our approach focuses on the final knowledge state rather than the progression of learning over time. Using five public datasets, we demonstrate that the LOKT model sustains high predictive accuracy even with limited data, effectively addressing both “learner cold-start” and “system cold-start” scenarios. These findings showcase LOKT’s potential to enhance LLM-based learning tools and support early-stage personalization.
摘要：知识追踪 (Knowledge Tracing, KT) 在教育数据挖掘中至关重要，通过追踪学习者的知识状态并预测其学术成果，实现个性化学习。本研究引入了 LOKT (Large Language Model Option-weighted Knowledge Tracing) 模型，以解决在可用历史数据有限的情况下使用大语言模型 (Large Language Model, LLM) 的冷启动问题。尽管传统的 KT 模型已纳入选项权重，但我们的研究进一步将这些权重整合到基于 LLM 的 KT 框架中。我们超越了对正确和错误响应的二元分类，强调不同类型的错误答案能够提供关于学习者知识状态的有价值信息。通过将这些响应转换为基于文本的序数类别，我们使 LLM 能够更清晰地评估学习者的理解，尽管我们的方法侧重于最终的知识状态而非学习过程的进展。使用五个公开数据集，我们证明了 LOKT 模型即使在数据有限的情况下也能保持高预测准确性，有效应对“学习者冷启动”和“系统冷启动”场景。这些发现展示了 LOKT 在增强基于 LLM 的学习工具和早期个性化支持方面的潜力。

[NLP-164] Skill Learning Using Process Mining for Large Language Model Plan Generation

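Quick read: This paper addresses three limitations of large language models (LLMs) when generating plans for complex tasks: sequential execution, the lack of control flow models, and difficulty retrieving skills. The key to the solution is a novel skill learning approach that integrates process mining techniques, using process discovery for skill acquisition, process models for skill storage, and conformance checking for skill retrieval. The approach enhances text-based plan generation and supports flexible skill discovery, parallel execution, and improved interpretability. Experimental results suggest that the skill retrieval method surpasses state-of-the-art accuracy baselines under specific conditions.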

Link: https://arxiv.org/abs/2410.12870
Authors: Andrei Cosmin Redis,Mohammadreza Fani Sani,Bahram Zarrin,Andrea Burattin
Keywords: Large language models, Large language, control flow models, hold promise, complex tasks
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: 12 pages, 5 figures, 2 tables, accepted at ICPM 2024

Abstract:Large language models (LLMs) hold promise for generating plans for complex tasks, but their effectiveness is limited by sequential execution, lack of control flow models, and difficulties in skill retrieval. Addressing these issues is crucial for improving the efficiency and interpretability of plan generation as LLMs become more central to automation and decision-making. We introduce a novel approach to skill learning in LLMs by integrating process mining techniques, leveraging process discovery for skill acquisition, process models for skill storage, and conformance checking for skill retrieval. Our methods enhance text-based plan generation by enabling flexible skill discovery, parallel execution, and improved interpretability. Experimental results suggest the effectiveness of our approach, with our skill retrieval method surpassing state-of-the-art accuracy baselines under specific conditions.
摘要：大语言模型 (LLMs) 在生成复杂任务计划方面展现出潜力，但其有效性受限于顺序执行、缺乏控制流模型以及技能检索的困难。解决这些问题对于提高计划生成的效率和可解释性至关重要，因为 LLMs 在自动化和决策过程中变得越来越重要。我们提出了一种新颖的方法，通过整合过程挖掘技术来增强 LLMs 中的技能学习，利用过程发现进行技能获取，过程模型进行技能存储，以及一致性检查进行技能检索。我们的方法通过实现灵活的技能发现、并行执行和提高可解释性，增强了基于文本的计划生成。实验结果表明，我们的方法具有有效性，在特定条件下，我们的技能检索方法超越了最先进的准确性基线。

[NLP-165] Language Model Preference Evaluation with Multiple Weak Evaluators

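Quick read: This paper addresses conflicting preferences in evaluating the output quality of large language models (LLMs): existing model-based evaluators are prone to cyclic preferences (for example, A is better than B and B is better than C, yet C is better than A), which makes evaluation results contradictory. The key to the solution is GED (Preference Graph Ensemble and Denoise), which uses multiple model-based evaluators to construct preference graphs, then ensembles and denoises these graphs to eliminate cyclic inconsistencies so that the result forms a directed acyclic graph (DAG). Theoretical guarantees and experiments show that the method is effective at recovering the ground-truth preference structure and improving evaluation reliability.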

Link: https://arxiv.org/abs/2410.12869
Authors: Zhengyu Hu,Jieyu Zhang,Zhihan Xiong,Alexander Ratner,Hui Xiong,Ranjay Krishna
Keywords: Large Language Models, Large Language, success of Large, evaluating their outputs’, critical challenge
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Despite the remarkable success of Large Language Models (LLMs), evaluating their outputs’ quality regarding preference remains a critical challenge. Existing works usually leverage a powerful LLM (e.g., GPT4) as the judge for comparing LLMs’ output pairwisely, yet such model-based evaluator is vulnerable to conflicting preference, i.e., output A is better than B, B than C, but C than A, causing contradictory evaluation results. To improve model-based preference evaluation, we introduce GED (Preference Graph Ensemble and Denoise), a novel approach that leverages multiple model-based evaluators to construct preference graphs, and then ensemble and denoise these graphs for better, non-contradictory evaluation results. In particular, our method consists of two primary stages: aggregating evaluations into a unified graph and applying a denoising process to eliminate cyclic inconsistencies, ensuring a directed acyclic graph (DAG) structure. We provide theoretical guarantees for our framework, demonstrating its efficacy in recovering the ground truth preference structure. Extensive experiments across ten benchmark datasets show that GED outperforms baseline methods in model ranking, response selection, and model alignment tasks. Notably, GED combines weaker evaluators like Llama3-8B, Mistral-7B, and Qwen2-7B to surpass the performance of stronger evaluators like Qwen2-72B, highlighting its ability to enhance evaluation reliability and improve model performance.
摘要：尽管大语言模型 (LLM) 取得了显著的成功，但评估其输出质量在偏好方面的表现仍然是一个关键挑战。现有工作通常利用一个强大的 LLM（例如 GPT4）作为评判者，对 LLM 的输出进行成对比较，然而这种基于模型的评估者容易受到偏好冲突的影响，即输出 A 优于 B，B 优于 C，但 C 又优于 A，导致评估结果矛盾。为了改进基于模型的偏好评估，我们引入了 GED（偏好图集成与去噪），这是一种新颖的方法，利用多个基于模型的评估者构建偏好图，然后对这些图进行集成和去噪，以获得更好、无矛盾的评估结果。具体而言，我们的方法包括两个主要阶段：将评估结果聚合为一个统一的图，并应用去噪过程消除循环不一致性，确保有向无环图 (DAG) 结构。我们为我们的框架提供了理论保证，证明了其在恢复真实偏好结构方面的有效性。在十个基准数据集上的广泛实验表明，GED 在模型排序、响应选择和模型对齐任务中优于基线方法。值得注意的是，GED 结合了 Llama3-8B、Mistral-7B 和 Qwen2-7B 等较弱的评估者，超越了 Qwen2-72B 等较强评估者的性能，突显了其在提升评估可靠性和改进模型性能方面的能力。
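
The two stages described above (aggregate evaluations into one graph, then remove cyclic inconsistencies) can be illustrated with a small sketch. This is not the GED implementation: the majority-vote aggregation, the "drop the weakest edge on each cycle" rule, and the function names `aggregate`, `find_cycle`, and `denoise` are all assumptions chosen to make the idea concrete.

```python
from collections import defaultdict


def aggregate(evaluator_prefs):
    """Merge (winner, loser) votes from several evaluators into net edge weights."""
    votes = defaultdict(int)
    for prefs in evaluator_prefs:
        for winner, loser in prefs:
            votes[(winner, loser)] += 1
    graph = {}
    for (a, b), n in votes.items():
        net = n - votes.get((b, a), 0)
        if net > 0:  # keep only the majority direction for each pair
            graph[(a, b)] = net
    return graph


def find_cycle(edges):
    """Return one directed cycle as a list of edges, or None if acyclic."""
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
    state, path = {}, []  # state: 1 = on current DFS path, 2 = finished

    def visit(node):
        state[node] = 1
        path.append(node)
        for nxt in adj[node]:
            if state.get(nxt) == 1:  # back edge closes a cycle
                cyc = path[path.index(nxt):] + [nxt]
                return list(zip(cyc, cyc[1:]))
            if nxt not in state:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = 2
        return None

    for node in list(adj):
        if node not in state:
            found = visit(node)
            if found:
                return found
    return None


def denoise(graph):
    """Greedily delete the lowest-weight edge of each cycle until a DAG remains."""
    graph = dict(graph)
    cycle = find_cycle(graph)
    while cycle is not None:
        weakest = min(cycle, key=lambda edge: graph[edge])
        del graph[weakest]
        cycle = find_cycle(graph)
    return graph
```

For instance, if three evaluators together give A over B three votes, B over C two votes, and C over A one vote, the aggregated graph contains the cycle A, B, C, A, and the sketch removes the single-vote edge C over A, leaving the ranking A, B, C.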

[NLP-166] IMAS: A Comprehensive Agentic Approach to Rural Healthcare Delivery

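Quick read: This paper addresses the shortage of healthcare resources that rural communities worldwide have faced since the onset of COVID-19, as experienced medical professionals migrated to urban centers. The key to the solution is an advanced agentic medical assistant system that uses large language models (LLMs) and agentic approaches to improve rural healthcare delivery through five core components: translation, medical complexity assessment, expert network integration, final medical advice generation, and response simplification. The system can perform clinical triage and diagnostics and identify cases that require specialist intervention, while accounting for cultural nuances and varying literacy levels to provide clear, actionable medical advice in local languages. Evaluation on the MedQA, PubMedQA, and JAMA datasets indicates that this integrated approach significantly improves the effectiveness of rural healthcare workers and makes healthcare more accessible and understandable.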

Link: https://arxiv.org/abs/2410.12868
Authors: Agasthya Gangavarapu,Ananya Gangavarapu
Keywords: faced significant challenges, experienced medical professionals, accessing healthcare due, rural communities worldwide, Registered Medical Practitioners
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Since the onset of COVID-19, rural communities worldwide have faced significant challenges in accessing healthcare due to the migration of experienced medical professionals to urban centers. Semi-trained caregivers, such as Community Health Workers (CHWs) and Registered Medical Practitioners (RMPs), have stepped in to fill this gap, but often lack formal training. This paper proposes an advanced agentic medical assistant system designed to improve healthcare delivery in rural areas by utilizing Large Language Models (LLMs) and agentic approaches. The system is composed of five crucial components: translation, medical complexity assessment, expert network integration, final medical advice generation, and response simplification. Our innovative framework ensures context-sensitive, adaptive, and reliable medical assistance, capable of clinical triaging, diagnostics, and identifying cases requiring specialist intervention. The system is designed to handle cultural nuances and varying literacy levels, providing clear and actionable medical advice in local languages. Evaluation results using the MedQA, PubMedQA, and JAMA datasets demonstrate that this integrated approach significantly enhances the effectiveness of rural healthcare workers, making healthcare more accessible and understandable for underserved populations. All code and supplemental materials associated with the paper and IMAS are available at this https URL.
摘要：自 COVID-19 爆发以来，全球农村社区由于经验丰富的医疗专业人员向城市中心的迁移，面临着获取医疗服务的重大挑战。半训练的护理人员，如社区健康工作者 (Community Health Workers, CHWs) 和注册医疗从业者 (Registered Medical Practitioners, RMPs)，已介入填补这一空缺，但往往缺乏正规培训。本文提出了一种先进的智能医疗助手系统，旨在通过利用大语言模型 (Large Language Models, LLMs) 和智能体方法，改善农村地区的医疗服务。该系统由五个关键组件构成：翻译、医疗复杂性评估、专家网络集成、最终医疗建议生成和响应简化。我们的创新框架确保了上下文敏感、适应性强且可靠的医疗援助，能够进行临床分诊、诊断，并识别需要专家干预的病例。该系统设计用于处理文化差异和不同识字水平，提供当地语言的清晰且可操作的医疗建议。使用 MedQA、PubMedQA 和 JAMA 数据集的评估结果表明，这种集成方法显著增强了农村医疗工作者的效率，使医疗服务对服务不足的人群更加可及和易于理解。与本文和 IMAS 相关的所有代码和补充材料均可在此 https URL 获取。

[NLP-167] Empowering Dysarthric Speech: Leveraging Advanced LLMs for Accurate Speech Correction and Multimodal Emotion Analysis

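Quick read: This paper addresses the unclear speech of people with dysarthria, a condition caused by neurological damage, by introducing a novel approach to recognizing and translating dysarthric speech so that they can communicate more effectively. The key to the solution is to use advanced large language models for accurate speech correction and multimodal emotion analysis. Dysarthric speech is first converted to text with OpenAI's Whisper model; sentence prediction is then performed with fine-tuned open-source models and benchmark models (GPT-4.o, LLaMA 3.1 70B, and Mistral 8x7B) on Groq AI accelerators. The dataset combines the TORGO dataset with Google speech data, manually labeled for emotional context, so that the framework can identify emotions and reconstruct the intended sentences with high accuracy.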

Link: https://arxiv.org/abs/2410.12867
Authors: Kaushal Attaluri,Anirudh CHVS,Sireesha Chittepu
Keywords: motor speech disorder, speech disorder caused, leading to slurred, disorder caused, caused by neurological
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 19 pages, 6 figures, 3 tables

Abstract: Dysarthria is a motor speech disorder caused by neurological damage that affects the muscles used for speech production, leading to slurred, slow, or difficult-to-understand speech. It affects millions of individuals worldwide, including those with conditions such as stroke, traumatic brain injury, cerebral palsy, Parkinson’s disease, and multiple sclerosis. Dysarthria presents a major communication barrier, impacting quality of life and social interaction. This paper introduces a novel approach to recognizing and translating dysarthric speech, empowering individuals with this condition to communicate more effectively. We leverage advanced large language models for accurate speech correction and multimodal emotion analysis. Dysarthric speech is first converted to text using the OpenAI Whisper model, followed by sentence prediction using fine-tuned open-source models and benchmark models like GPT-4.o, LLaMA 3.1 70B and Mistral 8x7B on Groq AI accelerators. The dataset used combines the TORGO dataset with Google speech data, manually labeled for emotional context. Our framework identifies emotions such as happiness, sadness, neutrality, surprise, anger, and fear, while reconstructing intended sentences from distorted speech with high accuracy. This approach demonstrates significant advancements in the recognition and interpretation of dysarthric speech.
摘要：构音障碍是一种由神经损伤引起的运动性言语障碍，影响言语产生所需的肌肉，导致言语含糊、缓慢或难以理解。全球数百万人受到构音障碍的影响，包括中风、创伤性脑损伤、脑瘫、帕金森病和多发性硬化症患者。构音障碍构成了主要的沟通障碍，影响生活质量和社交互动。本文介绍了一种新颖的方法来识别和翻译构音性言语，使患有此病症的个体能够更有效地沟通。我们利用先进的大语言模型进行准确的语音校正和多模态情感分析。首先使用 OpenAI 的 Whisper 模型将构音性言语转换为文本，然后通过微调的开源模型和基准模型如 GPT-4.o、LLaMA 3.1 70B 和 Mistral 8x7B 在 Groq AI 加速器上进行句子预测。所使用的数据集结合了 TORGO 数据集和 Google 语音数据，并手动标注了情感上下文。我们的框架能够识别快乐、悲伤、中性、惊讶、愤怒和恐惧等情感，同时以高准确度从扭曲的语音中重建意图句子。这种方法在构音性言语的识别和解释方面展示了显著的进步。

[NLP-168] Towards Homogeneous Lexical Tone Decoding from Heterogeneous Intracranial Recordings

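Quick read: This paper addresses data heterogeneity in brain-computer interfaces (BCIs) caused by physiological and instrumental factors, which hinders unified lexical tone decoding from invasive brain recordings. The key to the solution is the Homogeneity-Heterogeneity Disentangled Learning for neural Representations (H2DiLR) framework, which disentangles and learns both the homogeneity and the heterogeneity of neural representations from stereoelectroencephalography (sEEG) recordings of multiple subjects. This enables unified decoding across subjects and significantly outperforms the conventional heterogeneous decoding approach.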

Link: https://arxiv.org/abs/2410.12866
Authors: Di Wu,Siyuan Li,Chen Feng,Lu Cao,Yue Zhang,Jie Yang,Mohamad Sawan
Keywords: tonal language speakers, speech-impaired tonal language, Recent advancements, brain-computer interfaces, offering the potential
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS); Neurons and Cognition (q-bio.NC)
Comments: Preprint V1 with 10 pages main text

Abstract:Recent advancements in brain-computer interfaces (BCIs) have enabled the decoding of lexical tones from intracranial recordings, offering the potential to restore the communication abilities of speech-impaired tonal language speakers. However, data heterogeneity induced by both physiological and instrumental factors poses a significant challenge for unified invasive brain tone decoding. Traditional subject-specific models, which operate under a heterogeneous decoding paradigm, fail to capture generalized neural representations and cannot effectively leverage data across subjects. To address these limitations, we introduce Homogeneity-Heterogeneity Disentangled Learning for neural Representations (H2DiLR), a novel framework that disentangles and learns both the homogeneity and heterogeneity from intracranial recordings across multiple subjects. To evaluate H2DiLR, we collected stereoelectroencephalography (sEEG) data from multiple participants reading Mandarin materials comprising 407 syllables, representing nearly all Mandarin characters. Extensive experiments demonstrate that H2DiLR, as a unified decoding paradigm, significantly outperforms the conventional heterogeneous decoding approach. Furthermore, we empirically confirm that H2DiLR effectively captures both homogeneity and heterogeneity during neural representation learning.
摘要：近年来，脑机接口 (BCI) 的进展使得从颅内记录中解码词汇声调成为可能，为恢复语音障碍的声调语言使用者的沟通能力提供了潜力。然而，由生理和仪器因素引起的数据异质性对统一的侵入式脑声调解码构成了重大挑战。传统的特定受试者模型在异质性解码范式下运行，无法捕捉到广义的神经表征，也无法有效利用跨受试者的数据。为了解决这些局限性，我们提出了神经表征的同质性-异质性解耦学习 (H2DiLR)，这是一个新颖的框架，能够从多个受试者的颅内记录中解耦并学习同质性和异质性。为了评估 H2DiLR，我们收集了多个参与者阅读包含 407 个音节的普通话材料的立体脑电图 (sEEG) 数据，这些音节几乎涵盖了所有普通话字符。广泛的实验表明，作为统一解码范式的 H2DiLR 显著优于传统的异质性解码方法。此外，我们通过实证确认 H2DiLR 在神经表征学习过程中有效地捕捉了同质性和异质性。

[NLP-169] ELF-Gym: Evaluating Large Language Models Generated Features for Tabular Prediction

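Quick read: This paper addresses the automation of feature engineering in machine learning pipelines, and in particular how features generated by large language models (LLMs) compare with those crafted by human experts. The key to the solution is the ELF-Gym framework, which quantitatively evaluates LLM-generated features by measuring their impact on downstream model performance and their semantic and functional similarity to expert-crafted features, giving a more comprehensive evaluation. The framework uses 251 "golden" features from historical Kaggle competitions as a benchmark and shows that, in the best case, LLMs semantically capture about 56% of the golden features, but at the implementation level this drops to 13%, indicating substantial room for improvement in generating complex features.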

Link: https://arxiv.org/abs/2410.12865
Authors: Yanlin Zhang,Ning Li,Quan Gan,Weinan Zhang,David Wipf,Minjie Wang
Keywords: Crafting effective features, Crafting effective, machine learning pipelines, Large Language Models, crucial yet labor-intensive
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Crafting effective features is a crucial yet labor-intensive and domain-specific task within machine learning pipelines. Fortunately, recent advancements in Large Language Models (LLMs) have shown promise in automating various data science tasks, including feature engineering. But despite this potential, evaluations thus far are primarily based on the end performance of a complete ML pipeline, providing limited insight into precisely how LLMs behave relative to human experts in feature engineering. To address this gap, we propose ELF-Gym, a framework for Evaluating LLM-generated Features. We curated a new dataset from historical Kaggle competitions, including 251 “golden” features used by top-performing teams. ELF-Gym then quantitatively evaluates LLM-generated features by measuring their impact on downstream model performance as well as their alignment with expert-crafted features through semantic and functional similarity assessments. This approach provides a more comprehensive evaluation of disparities between LLMs and human experts, while offering valuable insights into specific areas where LLMs may have room for improvement. For example, using ELF-Gym we empirically demonstrate that, in the best-case scenario, LLMs can semantically capture approximately 56% of the golden features, but at the more demanding implementation level this overlap drops to 13%. Moreover, in other cases LLMs may fail completely, particularly on datasets that require complex features, indicating broad potential pathways for improvement.
摘要：构建有效的特征是机器学习流程中一项关键但劳动密集且领域特定的任务。幸运的是，大语言模型 (LLM) 的最新进展在自动化各种数据科学任务（包括特征工程）方面显示出潜力。尽管如此，迄今为止的评估主要基于完整机器学习管道的最终性能，对 LLM 在特征工程中相对于人类专家的行为方式提供的见解有限。为了填补这一空白，我们提出了 ELF-Gym，一个用于评估 LLM 生成特征的框架。我们从历史 Kaggle 竞赛中精心挑选了一个新数据集，其中包括 251 个由顶级团队使用的“黄金”特征。ELF-Gym 通过测量这些特征对下游模型性能的影响以及它们与专家设计特征在语义和功能相似性方面的匹配程度，来定量评估 LLM 生成的特征。这种方法提供了 LLM 与人类专家之间差异的更全面评估，同时为 LLM 在哪些具体领域有改进空间提供了有价值的见解。例如，使用 ELF-Gym，我们实证表明，在最佳情况下，LLM 可以语义上捕捉到约 56% 的黄金特征，但在更具挑战性的实现层面，这种重叠下降到 13%。此外，在其他情况下，LLM 可能完全失败，特别是在需要复杂特征的数据集上，这表明了广泛的改进途径。

[NLP-170] Investigating Implicit Bias in Large Language Models: A Large-Scale Study of Over 50 LLMs

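Quick read: This paper addresses the implicit bias that large language models (LLMs) may exhibit in decision-making, and in particular the finding that bias does not automatically decrease as models become newer or larger and can even increase in some cases. The key to the solution is to establish standardized evaluation metrics and benchmarks for bias, and to prioritize and apply bias mitigation strategies during model development. The paper stresses that broadening the detection of implicit bias gives a more comprehensive understanding of the biases in advanced models, supporting the development of fair and responsible AI systems.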

Link: https://arxiv.org/abs/2410.12864
Authors: Divyanshu Kumar,Umang Jain,Sahil Agarwal,Prashanth Harshangi
Keywords: Large Language Models, including decision-making processes, Large Language, LLM Implicit Association, Implicit Association Test
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) are being adopted across a wide range of tasks, including decision-making processes in industries where bias in AI systems is a significant concern. Recent research indicates that LLMs can harbor implicit biases even when they pass explicit bias evaluations. Building upon the frameworks of the LLM Implicit Association Test (IAT) Bias and LLM Decision Bias, this study highlights that newer or larger language models do not automatically exhibit reduced bias; in some cases, they displayed higher bias scores than their predecessors, such as in Meta’s Llama series and OpenAI’s GPT models. This suggests that increasing model complexity without deliberate bias mitigation strategies can unintentionally amplify existing biases. The variability in bias scores within and across providers underscores the need for standardized evaluation metrics and benchmarks for bias assessment. The lack of consistency indicates that bias mitigation is not yet a universally prioritized goal in model development, which can lead to unfair or discriminatory outcomes. By broadening the detection of implicit bias, this research provides a more comprehensive understanding of the biases present in advanced models and underscores the critical importance of addressing these issues to ensure the development of fair and responsible AI systems.
摘要：大语言模型 (LLMs) 正在被广泛应用于各种任务中，包括在存在 AI 系统偏见问题的行业中的决策过程。最近的研究表明，即使通过了显式偏见评估，LLMs 也可能隐藏隐性偏见。基于 LLM 隐性联想测试 (IAT) 偏见和 LLM 决策偏见的框架，本研究强调，更新或更大的语言模型并不一定会自动减少偏见；在某些情况下，它们显示的偏见得分甚至高于其前代，例如 Meta 的 Llama 系列和 OpenAI 的 GPT 模型。这表明，在没有刻意偏见缓解策略的情况下增加模型复杂性，可能会无意中放大现有偏见。不同供应商之间偏见得分的变化突显了标准化偏见评估指标和基准的必要性。缺乏一致性表明，偏见缓解在模型开发中尚未成为普遍优先考虑的目标，这可能导致不公平或歧视性的结果。通过扩大隐性偏见的检测范围，本研究提供了对先进模型中存在的偏见的更全面理解，并强调了解决这些问题以确保开发公平和负责任的 AI 系统的关键重要性。

[NLP-171] Enhancing Affinity Propagation for Improved Public Sentiment Insights

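Quick read: This paper addresses the high cost and long turnaround of traditional sentiment analysis methods, which depend on large amounts of labeled data. The key to the solution is to use unsupervised learning, specifically Affinity Propagation (AP) clustering combined with Agglomerative Hierarchical Clustering in a hybrid method, so that the sentiment structure of text data can be analyzed without a predefined number of clusters and without extensive labeled data. With TF-IDF vectorization for text representation and Principal Component Analysis (PCA) for dimensionality reduction, the method significantly outperforms K-means clustering on the evaluation metrics (Silhouette Score, Calinski-Harabasz Score, and Davies-Bouldin Index), providing a scalable and efficient unsupervised framework for analyzing public sentiment.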

Link: https://arxiv.org/abs/2410.12862
Authors: Mayimunah Nagayi,Clement Nyirenda
Keywords: including marketing, generated every day, Agglomerative Hierarchical Clustering, key factor, public sentiment
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:With the large amount of data generated every day, public sentiment is a key factor for various fields, including marketing, politics, and social research. Understanding the public sentiment about different topics can provide valuable insights. However, most traditional approaches for sentiment analysis often depend on supervised learning, which requires a significant amount of labeled data. This makes it both expensive and time-consuming to implement. This project introduces an approach using unsupervised learning techniques, particularly Affinity Propagation (AP) clustering, to analyze sentiment. AP clustering groups text data based on natural patterns, without needing predefined cluster numbers. The paper compares AP with K-means clustering, using TF-IDF Vectorization for text representation and Principal Component Analysis (PCA) for dimensionality reduction. To enhance performance, AP is combined with Agglomerative Hierarchical Clustering. This hybrid method refines clusters further, capturing both global and local sentiment structures more effectively. The effectiveness of these methods is evaluated using the Silhouette Score, Calinski-Harabasz Score, and Davies-Bouldin Index. Results show that AP with Agglomerative Hierarchical Clustering significantly outperforms K-means. This research contributes to Natural Language Processing (NLP) by proposing a scalable and efficient unsupervised learning framework for sentiment analysis, highlighting the significant societal impact of advanced AI techniques in analyzing public sentiment without the need for extensive labeled data.
摘要：随着每日产生的大量数据，公众情绪成为包括市场营销、政治和社会研究在内的多个领域的关键因素。理解不同话题的公众情绪可以提供有价值的见解。然而，大多数传统的情绪分析方法往往依赖于监督学习，这需要大量的标注数据，从而使得实施成本高且耗时。本项目介绍了一种使用无监督学习技术，特别是亲和传播 (Affinity Propagation, AP) 聚类的方法来分析情绪。AP 聚类根据自然模式对文本数据进行分组，无需预定义的聚类数量。论文将 AP 与 K-means 聚类进行比较，使用 TF-IDF 向量化进行文本表示，并使用主成分分析 (Principal Component Analysis, PCA) 进行降维。为了提高性能，AP 与凝聚层次聚类 (Agglomerative Hierarchical Clustering) 结合使用。这种混合方法进一步细化了聚类，更有效地捕捉了全局和局部的情绪结构。这些方法的有效性通过 Silhouette 分数、Calinski-Harabasz 分数和 Davies-Bouldin 指数进行评估。结果显示，AP 结合凝聚层次聚类显著优于 K-means。本研究通过提出一个可扩展且高效的无监督学习框架，为自然语言处理 (Natural Language Processing, NLP) 做出了贡献，强调了先进的 AI 技术在无需大量标注数据的情况下分析公众情绪的显著社会影响。

[NLP-172] Scaled and Inter-token Relation Enhanced Transformer for Sample-restricted Residential NILM

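Quick read: This paper addresses the difficulty of training transformer models on small-scale datasets for Non-Intrusive Load Monitoring (NILM). The key to the solution is to enhance the transformer's attention mechanism with two novel mechanisms. The inter-token relation enhancement mechanism reduces the priority given to intra-token relationships in the token similarity matrix, increasing the focus on inter-token relationships. The dynamic temperature tuning mechanism introduces a learnable temperature for the token similarity matrix, mitigating the over-smoothing caused by fixed temperature values. Both mechanisms rest on rigorous mathematical foundations, and experiments on the REDD residential NILM dataset show that they significantly improve the performance of the original transformer model.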

Link: https://arxiv.org/abs/2410.12861
Authors: Minhajur Rahman,Yasir Arafat
Keywords: Non-Intrusive Load Monitoring, Load Monitoring, yielded impressive results, Recent advancements, Non-Intrusive Load
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Submitted to 27th IEEE-ICCIT

Abstract:Recent advancements in transformer models have yielded impressive results in Non-Intrusive Load Monitoring (NILM). However, effectively training a transformer on small-scale datasets remains a challenge. This paper addresses this issue by enhancing the attention mechanism of the original transformer to improve performance. We propose two novel mechanisms: the inter-token relation enhancement mechanism and the dynamic temperature tuning mechanism. The first mechanism reduces the prioritization of intra-token relationships in the token similarity matrix during training, thereby increasing inter-token focus. The second mechanism introduces a learnable temperature tuning for the token similarity matrix, mitigating the over-smoothing problem associated with fixed temperature values. Both mechanisms are supported by rigorous mathematical foundations. We evaluate our approach using the REDD residential NILM dataset, a relatively small-scale dataset and demonstrate that our methodology significantly enhances the performance of the original transformer model across multiple appliance types.
摘要：近年来，Transformer 模型在非侵入式负载监测 (Non-Intrusive Load Monitoring, NILM) 领域取得了显著成果。然而，如何在小型数据集上有效训练 Transformer 仍然是一个挑战。本文通过增强原始 Transformer 的注意力机制来解决这一问题。我们提出了两种新颖的机制：Token 间关系增强机制和动态温度调谐机制。第一种机制在训练过程中减少 Token 相似矩阵中 Token 内关系的优先级，从而增加 Token 间关注度。第二种机制为 Token 相似矩阵引入可学习的温度调谐，缓解了固定温度值带来的过平滑问题。这两种机制均基于严格的数学基础。我们使用 REDD 住宅 NILM 数据集（一个相对小规模的数据集）评估了我们的方法，并证明我们的方法显著提升了原始 Transformer 模型在多种电器类型上的性能。
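
The abstract gives no formulas for the two mechanisms, so the following is only one way they might be written down, under stated assumptions: the diagonal penalty $\lambda \ge 0$ and the exact placement of the learnable temperature $\tau > 0$ are hypothetical, not taken from the paper.

```latex
S = Q K^{\top}, \qquad
\tilde{S}_{ij} =
\begin{cases}
S_{ij} - \lambda, & i = j \quad \text{(intra-token entry, de-emphasized)} \\
S_{ij}, & i \neq j \quad \text{(inter-token entry, unchanged)}
\end{cases}, \qquad
A = \operatorname{softmax}\!\left( \frac{\tilde{S}}{\tau} \right)
```

In this form, setting $\lambda = 0$ and fixing $\tau = \sqrt{d_k}$ recovers standard scaled dot-product attention; letting $\tau$ be learned allows the model to sharpen or flatten the attention distribution instead of committing to one fixed scale.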

[NLP-173] LLMD: A Large Language Model for Interpreting Longitudinal Medical Records

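Quick read: This paper addresses the analysis of a patient's medical history from their medical records. The key to the solution is LLMD, a large language model. LLMD continues pretraining a foundation model on domain knowledge and a large corpus of medical records collected over time and across facilities, and is then instruction fine-tuned on structuring and abstraction tasks. Its core strength is extracting and integrating complex medical information from records spanning many sources and time periods into an accurate picture of patient health. LLMD is deployed within a layered validation system to ensure output quality, and it significantly outperforms the other models evaluated on production tasks, suggesting that accuracy on existing medical knowledge benchmarks is not the most significant factor when analyzing real-world patient data.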

Link: https://arxiv.org/abs/2410.12860
Authors: Robert Porter,Adam Diehl,Benjamin Pastel,J. Henry Hinnefeld,Lawson Nerenberg,Pye Maung,Sebastien Kerbrat,Gillian Hanson,Troy Astorino,Stephen J. Tarsa
Keywords: LLMD, designed to analyze, introduce LLMD, records, language model designed
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: We introduce LLMD, a large language model designed to analyze a patient’s medical history based on their medical records. Along with domain knowledge, LLMD is trained on a large corpus of records collected over time and across facilities, as well as tasks and labels that make nuanced connections among them. This approach is critical to an accurate picture of patient health, and has distinctive advantages over models trained on knowledge alone, unlabeled records, structured EHR data, or records from a single health system. The recipe for LLMD continues pretraining a foundational model on both domain knowledge and the contents of millions of records. These span an average of 10 years of care and as many as 140 care sites per patient. LLMD is then instruction fine-tuned on structuring and abstraction tasks. The former jointly identify and normalize document metadata, provenance information, clinical named-entities, and ontology mappings, while the latter roll these into higher-level representations, such as a continuous era of time a patient was on a medication. LLMD is deployed within a layered validation system that includes continual random audits and review by experts, e.g. based on uncertainty, disease-specific rules, or use-case. LLMD exhibits large gains over both more-powerful generalized models and domain-specific models. On medical knowledge benchmarks, LLMD-8B achieves state-of-the-art accuracy on PubMedQA text responses, besting orders-of-magnitude larger models. On production tasks, we show that LLMD significantly outperforms all other models evaluated, and among alternatives, large general purpose LLMs like GPT-4o are more accurate than models emphasizing medical knowledge. We find strong evidence that accuracy on today’s medical benchmarks is not the most significant factor when analyzing real-world patient data, an insight with implications for future medical LLMs.
摘要：我们介绍了 LLMD，这是一种大语言模型，旨在基于患者的医疗记录分析其病史。除了领域知识外，LLMD 还基于长时间跨机构收集的大量记录进行训练，并针对这些记录中的细微关联进行任务和标签的训练。这种方法对于准确描绘患者健康状况至关重要，并且相较于仅基于知识、未标记记录、结构化电子健康记录数据或单一健康系统记录训练的模型，具有显著优势。LLMD 的训练方法包括在领域知识和数百万条记录内容上对基础模型进行预训练。这些记录涵盖了每位患者平均 10 年的护理时间以及多达 140 个护理站点。随后，LLMD 在结构化和抽象任务上进行指令微调。前者共同识别并标准化文档元数据、出处信息、临床命名实体和本体映射，而后者则将这些信息整合为更高层次的表示，例如患者在一段时间内持续服用某种药物的情况。LLMD 部署在一个分层验证系统中，该系统包括持续的随机审计和专家评审，例如基于不确定性、特定疾病规则或使用案例。LLMD 在性能上显著优于更强大的通用模型和领域专用模型。在医学知识基准测试中，LLMD-8B 在 PubMedQA 文本响应中达到了最先进的准确率，超越了规模更大的模型。在生产任务中，我们展示了 LLMD 显著优于所有其他评估模型，并且在备选方案中，像 GPT-4o 这样的大型通用大语言模型比强调医学知识的模型更为准确。我们发现强有力的证据表明，在分析现实世界患者数据时，当前医学基准的准确性并非最重要的因素，这一见解对未来医学大语言模型的发展具有重要意义。

[NLP-174] Enhancing Long Context Performance in LLMs Through Inner Loop Query Mechanism

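Quick read: This paper addresses two problems: the quadratic growth of computational complexity that large language models (LLMs) face with long contexts, and the weak performance of conventional retrieval-augmented generation (RAG) models on complex questions that require deeper reasoning. The key to the solution is a new method, Inner Loop Memory Augmented Tree Retrieval (ILM-TR), which issues inner-loop queries based not only on the initial query but also on intermediate findings. At inference time, the model retrieves information from the RAG system, integrates data from long documents at several levels of abstraction, and stores the generated text in Short-Term Memory (STM), which is used to form the next query. Repeating this retrieval process until the text in STM converges improves performance on long-context tests, particularly Multi-Needle In A Haystack (M-NIAH) and BABILong.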

Link: https://arxiv.org/abs/2410.12859
Authors: Yimin Tang,Yurong Xu,Ning Yan,Masood Mortazavi
Keywords: context window size, input context window, large language models, input size, window size
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract: Transformers have a quadratic scaling of computational complexity with input size, which limits the input context window size of large language models (LLMs) in both training and inference. Meanwhile, retrieval-augmented generation (RAG) based models can better handle longer contexts by using a retrieval system to filter out unnecessary information. However, most RAG methods only perform retrieval based on the initial query, which may not work well with complex questions that require deeper reasoning. We introduce a novel approach, Inner Loop Memory Augmented Tree Retrieval (ILM-TR), involving inner-loop queries, based not only on the query question itself but also on intermediate findings. At inference time, our model retrieves information from the RAG system, integrating data from lengthy documents at various levels of abstraction. Based on the information retrieved, the LLM generates texts stored in an area named Short-Term Memory (STM) which is then used to formulate the next query. This retrieval process is repeated until the text in STM converges. Our experiments demonstrate that retrieval with STM offers improvements over traditional retrieval-augmented LLMs, particularly in long context tests such as Multi-Needle In A Haystack (M-NIAH) and BABILong.
摘要：Transformer 的计算复杂度与输入大小呈二次方关系，这限制了大语言模型 (LLM) 在训练和推理中的输入上下文窗口大小。同时，基于检索增强生成 (RAG) 的模型通过使用检索系统过滤不必要的信息，能够更好地处理较长的上下文。然而，大多数 RAG 方法仅根据初始查询进行检索，这可能不适用于需要深度推理的复杂问题。我们提出了一种新颖的方法，即内循环记忆增强树检索 (ILM-TR)，该方法不仅基于查询问题本身，还基于中间发现结果进行内循环查询。在推理时，我们的模型从 RAG 系统中检索信息，整合来自长文档在不同抽象层次的数据。根据检索到的信息，LLM 生成文本并存储在名为短期记忆 (STM) 的区域中，然后用于构建下一个查询。此检索过程重复进行，直到 STM 中的文本收敛。我们的实验表明，使用 STM 进行检索相较于传统的检索增强 LLM 有所改进，特别是在多针在干草堆 (M-NIAH) 和 BABILong 等长上下文测试中。
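
The inner loop described above (retrieve, write to STM, re-query until STM stops changing) can be sketched in a few lines. This is a simplified illustration, not the ILM-TR implementation: `retrieve` and `generate` are hypothetical caller-supplied callables standing in for the RAG system and the LLM, the tree-structured retrieval over abstraction levels is not modeled, and `max_rounds` is an assumed safety cap.

```python
def inner_loop_retrieval(question, retrieve, generate, max_rounds=5):
    """Repeat retrieval with the question plus short-term memory until STM converges.

    retrieve(query) -> list of passages        (hypothetical stand-in for the RAG system)
    generate(question, passages, stm) -> str   (hypothetical stand-in for the LLM)
    """
    stm = ""  # short-term memory starts empty
    for _ in range(max_rounds):
        # The first round uses the question alone; later rounds add the STM text.
        query = question if not stm else f"{question}\n{stm}"
        passages = retrieve(query)
        new_stm = generate(question, passages, stm)
        if new_stm == stm:  # converged: another round would change nothing
            break
        stm = new_stm
    return stm
```

The point of the loop is that facts found in one round can surface passages that the original question alone would not have matched, which is the behavior the abstract attributes to inner-loop queries.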

[NLP-175] Large Language Models for Medical OSCE Assessment: A Novel Approach to Transcript Analysis

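Quick read: This paper addresses the time and cost of traditional grading for Objective Structured Clinical Examinations (OSCEs), focusing on one communication skill: medical students' ability to summarize a patient's medical history. The key to the solution is to automate grading with large language models (LLMs). The team analyzed 2,027 OSCE videos recorded at the University of Texas Southwestern Medical Center between 2019 and 2022, transcribed the speech with Whisper-v3, and evaluated several LLM techniques (zero-shot chain-of-thought prompting, retrieval augmented generation, and multi-model ensemble methods). Frontier LLMs such as GPT-4 reached a Cohen's kappa of 0.88 against human graders, indicating strong potential for LLM-based grading to augment the current process and reduce its cost.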

Link: https://arxiv.org/abs/2410.12858
Authors: Ameer Hamza Shakur,Michael J. Holcomb,David Hein,Shinyoung Kang,Thomas O. Dalton,Krystle K. Campbell,Daniel J. Scott,Andrew R. Jamieson
Keywords: Objective Structured Clinical, Structured Clinical Examinations, Grading Objective Structured, Objective Structured, Structured Clinical
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Grading Objective Structured Clinical Examinations (OSCEs) is a time-consuming and expensive process, traditionally requiring extensive manual effort from human experts. In this study, we explore the potential of Large Language Models (LLMs) to assess skills related to medical student communication. We analyzed 2,027 video-recorded OSCE examinations from the University of Texas Southwestern Medical Center (UTSW), spanning four years (2019-2022), and several different medical cases or “stations.” Specifically, our focus was on evaluating students’ ability to summarize patients’ medical history: we targeted the rubric item ‘did the student summarize the patients’ medical history?’ from the communication skills rubric. After transcribing speech audio captured by OSCE videos using Whisper-v3, we studied the performance of various LLM-based approaches for grading students on this summarization task based on their examination transcripts. Using various frontier-level open-source and proprietary LLMs, we evaluated different techniques such as zero-shot chain-of-thought prompting, retrieval augmented generation, and multi-model ensemble methods. Our results show that frontier LLM models like GPT-4 achieved remarkable alignment with human graders, demonstrating a Cohen’s kappa agreement of 0.88 and indicating strong potential for LLM-based OSCE grading to augment the current grading process. Open-source models also showed promising results, suggesting potential for widespread, cost-effective deployment. Further, we present a failure analysis identifying conditions where LLM grading may be less reliable in this context and recommend best practices for deploying LLMs in medical education settings.
摘要：客观结构化临床考试 (OSCE) 的评分是一个耗时且昂贵的流程，传统上需要大量的人工专家参与。在本研究中，我们探讨了大语言模型 (LLM) 在评估医学生沟通技能方面的潜力。我们分析了来自德克萨斯大学西南医学中心 (UTSW) 的 2,027 个视频记录的 OSCE 考试，涵盖了四年 (2019-2022) 和多个不同的医学案例或“站点”。具体而言，我们的重点是评估学生总结患者病史的能力：我们针对沟通技能评分标准中的“学生是否总结了患者的病史？”这一评分项。通过使用 Whisper-v3 转录 OSCE 视频中捕捉到的语音音频后，我们研究了基于 LLM 的不同方法在根据考试转录文本对学生进行总结任务评分方面的表现。我们使用了多种前沿的开源和专有大语言模型，评估了如零样本链式思维提示、检索增强生成和多模型集成方法等技术。结果显示，像 GPT-4 这样的前沿 LLM 模型与人工评分者取得了显著的一致性，Cohen’s kappa 一致性系数达到 0.88，表明基于 LLM 的 OSCE 评分具有增强当前评分流程的强大潜力。开源模型也显示出有希望的结果，表明其具有广泛、经济高效的部署潜力。此外，我们进行了失败分析，识别了在此情境下 LLM 评分可能不太可靠的条件，并推荐了在医学教育环境中部署 LLM 的最佳实践。
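
The agreement figure quoted above is Cohen's kappa, defined as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement between two raters and p_e is the agreement expected by chance from each rater's label frequencies. The sketch below computes it for two label lists. The function name `cohen_kappa` and the ten example grades are invented for illustration and are not data from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence

def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters who labeled the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Hypothetical grades: 1 = "student summarized the history", 0 = "did not".
human_grades = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
llm_grades = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
kappa = cohen_kappa(human_grades, llm_grades)
```

For these made-up grades the raters agree on 9 of 10 items (p_o = 0.9) and chance agreement is p_e = 0.5, so kappa is about 0.8. Raw agreement alone would overstate reliability when one label dominates, which is why kappa is the usual choice for grading studies.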

[NLP-176] Enterprise Benchmarks for Large Language Model Evaluation

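[Quick read]: This paper addresses the challenge of evaluating large language models (LLMs) on complex tasks in enterprise applications. It proposes a systematic benchmarking strategy centered on domain-specific datasets. The evaluation framework comprises 25 publicly available datasets from enterprise domains including financial services, legal, cyber security, and climate and sustainability; the varied performance of 13 models across these tasks underlines the need to choose a model according to the requirements of each task.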

Link: https://arxiv.org/abs/2410.12857
Authors: Bing Zhang,Mikio Takeuchi,Ryo Kawahara,Shubhi Asthana,Md. Maruf Hossain,Guang-Jie Ren,Kate Soule,Yada Zhu
Keywords (EN): complex tasks performed, large language models, advancement of large, large language, greater challenge
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments:

Abstract:The advancement of large language models (LLMs) has led to a greater challenge of having a rigorous and systematic evaluation of complex tasks performed, especially in enterprise applications. Therefore, LLMs need to be able to benchmark enterprise datasets for various tasks. This work presents a systematic exploration of benchmarking strategies tailored to LLM evaluation, focusing on the utilization of domain-specific datasets and consisting of a variety of NLP tasks. The proposed evaluation framework encompasses 25 publicly available datasets from diverse enterprise domains like financial services, legal, cyber security, and climate and sustainability. The diverse performance of 13 models across different enterprise tasks highlights the importance of selecting the right model based on the specific requirements of each task. Code and prompts are available on GitHub.
摘要：大语言模型 (LLM) 的进步带来了对复杂任务进行严格和系统评估的更大挑战，尤其是在企业应用中。因此，LLM 需要能够对各种任务的企业数据集进行基准测试。本文系统地探讨了针对 LLM 评估的基准测试策略，重点关注特定领域数据集的利用，并涵盖了多种自然语言处理 (NLP) 任务。所提出的评估框架包括来自金融服务、法律、网络安全和气候与可持续发展等不同企业领域的 25 个公开可用数据集。13 个模型在不同企业任务中的多样化表现突显了根据每个任务的具体需求选择合适模型的重要性。代码和提示可在 GitHub 上获取。

[NLP-177] Optimized Biomedical Question-Answering Services with LLM and Multi-BERT Integration ICDM

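[Quick read]: This paper addresses how biomedical question-answering systems can handle complex data while avoiding overfitting. The key idea is to integrate large language models (LLMs) with Multi-BERT configurations: one BERT model is frozen while another is trained, and a multi-layer perceptron (MLP) layer is added, which improves adaptability and efficiency. The approach uses large datasets such as BioASQ and BioMRC to demonstrate its ability to synthesize information, with the aim of giving healthcare professionals more reliable and responsive tools to support patient outcomes and decision-making.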

Link: https://arxiv.org/abs/2410.12856
Authors: Cheng Qian,Xianglong Shi,Shanshan Yao,Yichen Liu,Fengming Zhou,Zishu Zhang,Junaid Akram,Ali Braytee,Ali Anaissi
Keywords (EN): integrating large language, Multi-BERT configurations, present a refined, integrating large, large language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 12 figures, accepted and to be published in the proceedings of 2024 IEEE International Conference on Data Mining Workshops (ICDMW)

Abstract:We present a refined approach to biomedical question-answering (QA) services by integrating large language models (LLMs) with Multi-BERT configurations. By enhancing the ability to process and prioritize vast amounts of complex biomedical data, this system aims to support healthcare professionals in delivering better patient outcomes and informed decision-making. Through innovative use of BERT and BioBERT models, combined with a multi-layer perceptron (MLP) layer, we enable more specialized and efficient responses to the growing demands of the healthcare sector. Our approach not only addresses the challenge of overfitting by freezing one BERT model while training another but also improves the overall adaptability of QA services. The use of extensive datasets, such as BioASQ and BioMRC, demonstrates the system’s ability to synthesize critical information. This work highlights how advanced language models can make a tangible difference in healthcare, providing reliable and responsive tools for professionals to manage complex information, ultimately serving the broader goal of improved care and data-driven insights.
摘要：我们提出了一种改进的生物医学问答 (QA) 服务方法，通过将大语言模型 (LLM) 与 Multi-BERT 配置相结合。通过增强处理和优先处理大量复杂生物医学数据的能力，该系统旨在支持医疗专业人员提供更好的患者治疗效果和知情决策。通过创新性地使用 BERT 和 BioBERT 模型，结合多层感知器 (MLP) 层，我们能够更专业和高效地响应医疗行业日益增长的需求。我们的方法不仅通过冻结一个 BERT 模型同时训练另一个来解决过拟合问题，还提高了 QA 服务的整体适应性。使用广泛的生物医学数据集，如 BioASQ 和 BioMRC，展示了系统综合关键信息的能力。这项工作突显了先进语言模型如何在医疗领域产生实质性影响，为专业人员提供可靠且响应迅速的工具来管理复杂信息，最终服务于改善护理和数据驱动洞察力的更广泛目标。

[NLP-178] JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework

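[Quick read]: This paper addresses the evaluation of LLM defenses against jailbreak attacks, where existing methods fall short on explainability, generalization to complex scenarios, and multilingual settings. It proposes the JAILJUDGE benchmark, which covers diverse risk scenarios (synthetic, adversarial, in-the-wild, and multilingual prompts) together with high-quality human-annotated datasets. The JailJudge MultiAgent framework provides explainable, fine-grained scoring (1 to 10) and supports building instruction-tuning ground truth, which in turn enables JAILJUDGE Guard, an end-to-end judge model that provides reasoning and eliminates API costs. The paper also introduces the JailBoost attack enhancer and the GuardShield defense, which significantly improve attack and defense performance in zero-shot settings.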

Link: https://arxiv.org/abs/2410.12855
Authors: Fan Liu,Yue Feng,Zhao Xu,Lixin Su,Xinyu Ma,Dawei Yin,Hao Liu
Keywords (EN): evaluating LLM defenses, LLM defenses remains, enhancing LLM safety, evaluating LLM, LLM safety
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Despite advancements in enhancing LLM safety against jailbreak attacks, evaluating LLM defenses remains a challenge, with current methods often lacking explainability and generalization to complex scenarios, leading to incomplete assessments (e.g., direct judgment without reasoning, low F1 score of GPT-4 in complex cases, bias in multilingual scenarios). To address this, we present JAILJUDGE, a comprehensive benchmark featuring diverse risk scenarios, including synthetic, adversarial, in-the-wild, and multilingual prompts, along with high-quality human-annotated datasets. The JAILJUDGE dataset includes 35k+ instruction-tuning data with reasoning explainability and JAILJUDGETEST, a 4.5k+ labeled set for risk scenarios, and a 6k+ multilingual set across ten languages. To enhance evaluation with explicit reasoning, we propose the JailJudge MultiAgent framework, which enables explainable, fine-grained scoring (1 to 10). This framework supports the construction of instruction-tuning ground truth and facilitates the development of JAILJUDGE Guard, an end-to-end judge model that provides reasoning and eliminates API costs. Additionally, we introduce JailBoost, an attacker-agnostic attack enhancer, and GuardShield, a moderation defense, both leveraging JAILJUDGE Guard. Our experiments demonstrate the state-of-the-art performance of JailJudge methods (JailJudge MultiAgent, JAILJUDGE Guard) across diverse models (e.g., GPT-4, Llama-Guard) and zero-shot scenarios. JailBoost and GuardShield significantly improve jailbreak attack and defense tasks under zero-shot settings, with JailBoost enhancing performance by 29.24% and GuardShield reducing defense ASR from 40.46% to 0.15%.
摘要：尽管在增强大语言模型 (LLM) 对越狱攻击的安全性方面取得了进展，但评估 LLM 防御措施仍然是一个挑战，当前的方法往往缺乏可解释性和对复杂场景的泛化能力，导致评估不完整（例如，直接判断而没有推理，GPT-4 在复杂情况下的 F1 分数较低，多语言场景中的偏见）。为了解决这一问题，我们提出了 JAILJUDGE，这是一个综合基准，涵盖了多种风险场景，包括合成、对抗性、实际应用中和多语言提示，以及高质量的人工标注数据集。JAILJUDGE 数据集包括超过 35,000 条带有推理解释的指令调优数据和 JAILJUDGETEST，一个包含 4,500 多条标注的风险场景数据集，以及一个涵盖十种语言的 6,000 多条多语言数据集。为了通过显式推理增强评估，我们提出了 JailJudge 多智能体框架，该框架支持可解释的、细粒度的评分（1 到 10）。该框架支持指令调优基础事实的构建，并促进 JAILJUDGE Guard 的开发，这是一个端到端的判断模型，提供推理并消除 API 成本。此外，我们引入了 JailBoost，一个与攻击者无关的攻击增强器，以及 GuardShield，一个利用 JAILJUDGE Guard 的审核防御措施。我们的实验表明，JailJudge 方法（JailJudge 多智能体、JAILJUDGE Guard）在多种模型（例如 GPT-4、Llama-Guard）和零样本场景中表现出色。JailBoost 和 GuardShield 在零样本设置下显著提高了越狱攻击和防御任务的性能，其中 JailBoost 提升了 29.24% 的性能，而 GuardShield 将防御 ASR 从 40.46% 降低到 0.15%。

[NLP-179] TPO: Aligning Large Language Models with Multi-branch Multi-step Preference Trees

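[Quick read]: This paper addresses the inability of existing Direct Preference Optimization (DPO) algorithms to learn from multiple levels of preference on complex reasoning tasks. It introduces Tree Preference Optimization (TPO), which learns directly from the entire preference tree rather than from sampled pairs of preferred and dispreferred responses. TPO casts language model alignment as a Preference List Ranking problem and uses Adaptive Step Reward to adjust the reward of each reasoning step for fine-grained preference optimization, improving LLM performance on long-chain reasoning.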

Link: https://arxiv.org/abs/2410.12854
Authors: Weibin Liao,Xu Chu,Yasha Wang
Keywords (EN): Direct Preference Optimization, Preference Optimization, Preference, Direct Preference, Tree Preference Optimization
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:In the domain of complex reasoning tasks, such as mathematical reasoning, recent advancements have proposed the use of Direct Preference Optimization (DPO) to suppress output of dispreferred responses, thereby enhancing the long-chain reasoning capabilities of large language models (LLMs). To this end, these studies employed LLMs to generate preference trees via Tree-of-thoughts (ToT) and sample the paired preference responses required by the DPO algorithm. However, the DPO algorithm based on binary preference optimization is unable to learn multiple responses with varying degrees of preference/dispreference that are provided by the preference trees, resulting in incomplete preference learning. In this work, we introduce Tree Preference Optimization (TPO), which does not sample paired preference responses from the preference tree; instead, it directly learns from the entire preference tree during fine-tuning. Specifically, TPO formulates the language model alignment as a Preference List Ranking problem, where the policy can potentially learn more effectively from a ranked preference list of responses given the prompt. In addition, to further assist LLMs in identifying discriminative steps within long-chain reasoning and increase the relative reward margin in the preference list, TPO utilizes Adaptive Step Reward to adjust the reward values of each step in the trajectory for performing fine-grained preference optimization. We carry out extensive experiments on mathematical reasoning tasks to evaluate TPO. The experimental results indicate that TPO consistently outperforms DPO across three public large language models on four datasets.
摘要：在复杂的推理任务领域，如数学推理，近期研究提出了使用直接偏好优化 (Direct Preference Optimization, DPO) 来抑制不受欢迎的输出，从而增强大语言模型 (Large Language Model, LLM) 的长链推理能力。为此，这些研究利用 LLM 通过思维树 (Tree-of-thoughts, ToT) 生成偏好树，并采样 DPO 算法所需的成对偏好响应。然而，基于二元偏好优化的 DPO 算法无法学习偏好树提供的具有不同偏好/不偏好程度的多个响应，导致偏好学习不完整。在本研究中，我们引入了树偏好优化 (Tree Preference Optimization, TPO)，该方法不从偏好树中采样成对偏好响应；相反，它在微调过程中直接从整个偏好树中学习。具体而言，TPO 将语言模型对齐问题形式化为偏好列表排序问题，其中策略可以更有效地从给定提示的响应的排序偏好列表中学习。此外，为了进一步帮助 LLM 识别长链推理中的区分步骤，并增加偏好列表中的相对奖励边际，TPO 利用自适应步骤奖励 (Adaptive Step Reward) 来调整轨迹中每个步骤的奖励值，以进行细粒度的偏好优化。我们在数学推理任务上进行了广泛的实验来评估 TPO。实验结果表明，TPO 在四个数据集上的三个公共大语言模型中始终优于 DPO。
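
To make the contrast between pairwise and list-level preference learning concrete, the sketch below implements a generic listwise ranking loss, the Plackett-Luce negative log-likelihood, over a ranked list of response scores. This is a stand-in chosen for illustration and should not be read as the TPO objective: the abstract does not give the paper's loss or its Adaptive Step Reward, and the function name `plackett_luce_nll` and the example scores are assumptions.

```python
import math
from typing import Sequence

def plackett_luce_nll(scores: Sequence[float]) -> float:
    """Negative log-likelihood of a ranking under the Plackett-Luce model.

    `scores` must be ordered from most preferred to least preferred response.
    Each position i contributes -(s_i - log sum_{j >= i} exp(s_j)).
    """
    nll = 0.0
    for i in range(len(scores)):
        tail = scores[i:]
        m = max(tail)  # subtract the max for a numerically stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(s - m) for s in tail))
        nll -= scores[i] - log_norm
    return nll

# Hypothetical model scores for three responses, listed in preference order.
well_ordered = plackett_luce_nll([3.0, 2.0, 1.0])  # scores agree with the ranking
mis_ordered = plackett_luce_nll([1.0, 2.0, 3.0])   # scores contradict the ranking
```

The loss is lower when the scores agree with the whole ranked list, so one training signal covers every response in the list at once, whereas a binary pairwise loss would only see one preferred and one dispreferred response per sample.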

[NLP-180] Diversity of Thought Elicits Stronger Reasoning Capabilities in Multi-Agent Debate Frameworks

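[Quick read]: This paper addresses the tendency of large language models (LLMs) to give confidently wrong answers on mathematical reasoning tasks. It uses a multi-agent debate framework in which a diverse set of models debate to improve reasoning and factual accuracy. The study finds that multi-agent debate helps at any model scale and that gains on mathematical reasoning are largest when diverse trained models are used. For example, after 4 rounds of debate a diverse set of medium-capacity models (Gemini-Pro, Mixtral 7BX8, and PaLM 2-M) outperforms GPT-4 on the GSM-8K benchmark with 91% accuracy, suggesting that diverse cooperating agents can exhibit emergent capabilities beyond a single powerful model.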

Link: https://arxiv.org/abs/2410.12853
Authors: Mahmood Hegazy
Keywords (EN): Large language models, produce incorrect responses, natural language generation, confidently produce incorrect, Large language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages, 9 figures

Abstract:Large language models (LLMs) excel in natural language generation but often confidently produce incorrect responses, especially in tasks like mathematical reasoning. Chain-of-thought prompting, self-verification, and multi-agent debate are among the strategies proposed to improve the reasoning and factual accuracy of LLMs. Building on Du et al.'s multi-agent debate framework, we find that multi-agent debate helps at any model scale, and that diversity of thought elicits stronger reasoning in debating LLMs. Across various model sizes, performance on mathematical reasoning tasks benefits most when diverse trained models are used. Remarkably, after 4 rounds of debate, a diverse set of medium-capacity models (Gemini-Pro, Mixtral 7BX8, and PaLM 2-M) outperforms GPT-4 on the GSM-8K benchmark, scoring 91% accuracy. By comparison, when 3 instances of Gemini-Pro are used, performance only reaches 82%. Finally, this diverse set of medium-capacity models sets a new state-of-the-art performance on the ASDiv benchmark (94%). These results underscore the idea that the future of AI is agentic, with diverse cooperating agents yielding emergent capabilities beyond even the most powerful individual models.
摘要：大语言模型 (LLMs) 在自然语言生成方面表现出色，但往往自信地产生错误答案，尤其是在数学推理等任务中。链式思维提示、自我验证和多智能体辩论是提高 LLMs 推理和事实准确性的几种策略。基于 Du 等人的多智能体辩论框架，我们发现多智能体辩论在任何模型规模下都有帮助，并且思想的多样性激发了辩论中 LLMs 更强的推理能力。在各种模型规模中，使用多样化训练模型的数学推理任务表现提升最为显著。值得注意的是，经过 4 轮辩论后，一组中等容量的多样化模型（Gemini-Pro、Mixtral 7BX8 和 PaLM 2-M）在 GSM-8K 基准测试中超越了 GPT-4，准确率达到 91%。相比之下，当使用 3 个 Gemini-Pro 实例时，性能仅达到 82%。最后，这组中等容量的多样化模型在 ASDiv 基准测试中创下了新的最佳性能（94%）。这些结果强调了 AI 的未来是智能体化的，多样化的合作智能体能够产生超越最强大个体模型的涌现能力。
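
The debate procedure can be sketched as a loop in which every agent first answers independently and then revises after seeing its peers' answers, with a majority vote at the end. The code below is a toy under stated assumptions: the agents are plain Python callables rather than LLM calls, `debate`, `make_stubborn`, and `make_follower` are hypothetical names, and `rounds` here counts revision rounds after the initial answers, which may differ from how the paper counts rounds.

```python
from collections import Counter
from typing import Callable, List, Tuple

# An agent maps (question, peers' latest answers) to its own answer.
Agent = Callable[[str, List[str]], str]

def debate(question: str, agents: List[Agent], rounds: int = 3) -> Tuple[str, List[str]]:
    """Run a simple multi-agent debate and return (majority answer, final answers)."""
    # Initial answers are produced without seeing any peers.
    answers = [agent(question, []) for agent in agents]
    for _ in range(rounds):
        # Each agent revises given the other agents' previous answers.
        answers = [
            agent(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
    majority = Counter(answers).most_common(1)[0][0]
    return majority, answers

def make_stubborn(answer: str) -> Agent:
    """Toy agent that never changes its answer."""
    def agent(question: str, peer_answers: List[str]) -> str:
        return answer
    return agent

def make_follower(first_guess: str) -> Agent:
    """Toy agent that adopts the most common peer answer once it sees peers."""
    def agent(question: str, peer_answers: List[str]) -> str:
        if not peer_answers:
            return first_guess
        return Counter(peer_answers).most_common(1)[0][0]
    return agent

final, all_answers = debate(
    "toy arithmetic question",
    [make_stubborn("18"), make_stubborn("18"), make_follower("20")],
    rounds=1,
)
```

In a real setup each callable would wrap a different model, which is where the diversity studied in the paper would enter; these toy agents only show the control flow.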

[NLP-181] The Large Language Model GreekLegalRoBERTa

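[Quick read]: This paper addresses named entity recognition and multi-class legal topic classification on Greek legal text. It presents four versions of GreekLegalRoBERTa, trained on Greek legal and nonlegal text, which surpass GreekLegalBERT, Greek-LegalBERT-v2, and GreekBERT on both tasks, demonstrating the effectiveness of modern NLP techniques for domain-specific tasks in a low-resource language.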

Link: https://arxiv.org/abs/2410.12852
Authors: Vasileios Saketos,Despina-Athanasia Pantazi,Manolis Koubarakis
Keywords (EN): versions of GreekLegalRoBERTa, nonlegal text, Greek legal documents, large language models, language models trained
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:We develop four versions of GreekLegalRoBERTa, which are four large language models trained on Greek legal and nonlegal text. We show that our models surpass the performance of GreekLegalBERT, Greek-LegalBERT-v2, and GreekBERT in two tasks involving Greek legal documents: named entity recognition and multi-class legal topic classification. We view our work as a contribution to the study of domain-specific NLP tasks in low-resource languages, like Greek, using modern NLP techniques and methodologies.
摘要：我们开发了四个版本的 GreekLegalRoBERTa，这些是大语言模型，专门针对希腊语的法律和非法律文本进行训练。我们的研究表明，在涉及希腊法律文件的两个任务中，即命名实体识别和多类别法律主题分类，我们的模型在性能上超越了 GreekLegalBERT、Greek-LegalBERT-v2 和 GreekBERT。我们认为，我们的工作是对低资源语言（如希腊语）领域特定自然语言处理 (NLP) 任务研究的一个贡献，采用了现代 NLP 技术和方法论。

[NLP-182] VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models

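[Quick read]: This paper addresses the subtle characteristics of LLM outputs ("vibes") that are hard to quantify yet influence user preference. It introduces VibeCheck, a system that iteratively discovers a model's identifying traits (vibes) and uses a panel of LLM judges to measure them quantitatively. The authors validate that the discovered vibes align with those found by humans and show, on real-world user conversation data, that the vibes predict both model identity and user preference.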

Link: https://arxiv.org/abs/2410.12851
Authors: Lisa Dunlap,Krishna Mandal,Trevor Darrell,Jacob Steinhardt,Joseph E Gonzalez
Keywords (EN): Large language models, Large language, users intuitively recognize, intuitively recognize, struggle to quantify
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, unironic use of the word ‘vibe’

Abstract:Large language models (LLMs) often exhibit subtle yet distinctive characteristics in their outputs that users intuitively recognize, but struggle to quantify. These “vibes” - such as tone, formatting, or writing style - influence user preferences, yet traditional evaluations focus primarily on the single axis of correctness. We introduce VibeCheck, a system for automatically comparing a pair of LLMs by discovering identifying traits of a model (“vibes”) that are well-defined, differentiating, and user-aligned. VibeCheck iteratively discovers vibes from model outputs, then utilizes a panel of LLM judges to quantitatively measure the utility of each vibe. We validate that the vibes generated by VibeCheck align with those found in human discovery and run VibeCheck on pairwise preference data from real-world user conversations with llama-3-70b vs. GPT-4. VibeCheck reveals that Llama has a friendly, funny, and somewhat controversial vibe. These vibes predict model identity with 80% accuracy and human preference with 61% accuracy. Lastly, we run VibeCheck on a variety of models and tasks including summarization, math, and captioning to provide insight into differences in model behavior. Some of the vibes we find are that Command X prefers to add concrete intros and conclusions when summarizing in comparison to TNGL, Llama-405b often over-explains its thought process on math problems compared to GPT-4o, and GPT-4 prefers to focus on the mood and emotions of the scene when captioning compared to Gemini-1.5-Flash.
摘要：大语言模型 (LLMs) 在其输出中常常展现出微妙但独特的特征，这些特征用户能够直观地识别，但却难以量化。这些“氛围”——如语气、格式或写作风格——影响着用户的偏好，然而传统的评估主要集中在正确性这一单一维度上。我们引入了 VibeCheck，这是一个系统，通过发现模型的识别性特征（“氛围”）来自动比较一对 LLMs，这些特征定义明确、具有区分性且与用户需求一致。VibeCheck 从模型输出中迭代发现氛围，然后利用一组 LLM 评判者来量化每个氛围的实用性。我们验证了 VibeCheck 生成的氛围与人类发现的氛围相符，并在 llama-3-70b 与 GPT-4 的真实用户对话的成对偏好数据上运行 VibeCheck。结果显示，Llama 具有友好、幽默且略带争议的氛围。这些氛围在预测模型身份时准确率达到 80%，在预测人类偏好时准确率达到 61%。最后，我们在多种模型和任务上运行 VibeCheck，包括总结、数学和字幕生成，以深入了解模型行为的差异。我们发现的一些氛围包括：在总结时，Command X 倾向于添加具体的引言和结论，而 TNGL 则不然；Llama-405b 在解决数学问题时常常过度解释其思维过程，相比之下 GPT-4o 则更为简洁；GPT-4 在生成字幕时更倾向于关注场景的情绪和情感，而 Gemini-1.5-Flash 则不然。

[NLP-183] RecurFormer: Not All Transformer Heads Need Self-Attention

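[Quick read]: This paper addresses the high inference cost that the attention mechanism imposes on Transformer-based LLMs for long inputs. It observes that some attention heads concentrate their weights on tokens near the query token (termed recency aware) and proposes RecurFormer, an architecture that replaces those heads with linear recurrent neural networks, specifically the Mamba architecture. The replacement reduces cache size without evicting tokens, preserving generation quality, while the remaining attention heads keep modeling long-range dependencies. RecurFormer can reuse pre-trained Transformer weights with continual training, and experiments show it matches the original model's performance while significantly improving inference efficiency.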

Link: https://arxiv.org/abs/2410.12850
Authors: Ruiqing Yan,Linghan Zheng,Xingbo Du,Han Zou,Yufeng Guo,Jianfei Yang
Keywords (EN): modeling complex language, complex language patterns, mechanism memory overhead, Transformer-based large language, attention mechanism memory
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Transformer-based large language models (LLMs) excel in modeling complex language patterns but face significant computational costs during inference, especially with long inputs due to the attention mechanism’s memory overhead. We observe that certain attention heads exhibit a distribution where the attention weights concentrate on tokens near the query token, termed as recency aware, which focuses on local and short-range dependencies. Leveraging this insight, we propose RecurFormer, a novel architecture that replaces these attention heads with linear recurrent neural networks (RNNs), specifically the Mamba architecture. This replacement reduces the cache size without evicting tokens, thus maintaining generation quality. RecurFormer retains the ability to model long-range dependencies through the remaining attention heads and allows for reusing pre-trained Transformer-based LLMs weights with continual training. Experiments demonstrate that RecurFormer matches the original model’s performance while significantly enhancing inference efficiency. Our approach provides a practical solution to the computational challenges of Transformer-based LLMs inference, making it highly attractive for tasks involving long inputs.
摘要：基于 Transformer 的大语言模型 (LLM) 在模拟复杂语言模式方面表现出色，但在推理过程中面临显著的计算成本，特别是由于注意力机制的内存开销，长输入的处理尤为困难。我们观察到某些注意力头表现出一种分布，其中注意力权重集中在查询 Token 附近的 Token 上，这种特性被称为近期感知 (recency aware)，它关注局部和短程依赖关系。基于这一洞察，我们提出了 RecurFormer，这是一种新颖的架构，用线性循环神经网络 (RNN) 取代这些注意力头，特别是 Mamba 架构。这种替换在不驱逐 Token 的情况下减少了缓存大小，从而保持了生成质量。RecurFormer 通过剩余的注意力头保留了建模长程依赖的能力，并允许通过持续训练重用预训练的基于 Transformer 的 LLM 权重。实验表明，RecurFormer 在显著提升推理效率的同时，与原始模型的性能相匹配。我们的方法为基于 Transformer 的 LLM 推理的计算挑战提供了一个实用的解决方案，使其在涉及长输入的任务中极具吸引力。
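
One way to picture a "recency aware" head is to measure how much of each query's attention mass falls on the last few tokens. The sketch below does that for a causal attention matrix. The metric, the window size, the threshold, and the names `local_attention_mass` and `is_recency_aware` are assumptions made for illustration; the abstract does not state the criterion RecurFormer actually uses to select heads.

```python
from typing import List

def local_attention_mass(attn: List[List[float]], window: int) -> float:
    """Average share of attention that each query places on its `window` nearest keys.

    `attn[i][j]` is the weight query i gives key j in a causal head, so row i
    only has mass on j <= i and each row sums to 1.
    """
    total = 0.0
    for i, row in enumerate(attn):
        lo = max(0, i - window + 1)      # first key inside the local window
        total += sum(row[lo:i + 1])      # mass on keys lo..i (inclusive)
    return total / len(attn)

def is_recency_aware(attn: List[List[float]], window: int = 2, threshold: float = 0.9) -> bool:
    """Hypothetical rule: flag heads whose local mass exceeds a threshold."""
    return local_attention_mass(attn, window) >= threshold

# Toy heads over 4 tokens.
local_head = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]  # every query attends only to itself
uniform_head = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [1 / 3, 1 / 3, 1 / 3, 0.0],
    [0.25, 0.25, 0.25, 0.25],
]  # every query spreads attention evenly over its prefix

flags = (is_recency_aware(local_head), is_recency_aware(uniform_head))
```

Under this toy rule the first head would be a candidate for replacement by a recurrent layer, while the second would stay as attention because it still draws on distant tokens.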

[NLP-184] Prompt Engineering a Schizophrenia Chatbot: Utilizing a Multi-Agent Approach for Enhanced Compliance with Prompt Instructions

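[Quick read]: This paper addresses how to keep the output of large language models such as GPT-4 safe and ethically compliant when they provide information on mental health education platforms. It introduces a Critical Analysis Filter, in which a team of prompt-engineered LLM agents critically analyzes and refines the chatbot's responses in real time, improving the chatbot's adherence to its predefined rules and limits over the course of a conversation and keeping it transparent and accurate.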

Link: https://arxiv.org/abs/2410.12848
Authors: Per Niklas Waaler,Musarrat Hussain,Igor Molchanov,Lars Ailo Bongo,Brita Elvevåg
Keywords (EN): Large Language Models, Critical Analysis Filter, present with cognitive, cognitive impairments, hinder their ability
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Patients with schizophrenia often present with cognitive impairments that may hinder their ability to learn about their condition. These individuals could benefit greatly from education platforms that leverage the adaptability of Large Language Models (LLMs) such as GPT-4. While LLMs have the potential to make topical mental health information more accessible and engaging, their black-box nature raises concerns about ethics and safety. Prompting offers a way to produce semi-scripted chatbots with responses anchored in instructions and validated information, but prompt-engineered chatbots may drift from their intended identity as the conversation progresses. We propose a Critical Analysis Filter for achieving better control over chatbot behavior. In this system, a team of LLM agents is prompt-engineered to critically analyze and refine the chatbot’s response and deliver real-time feedback to the chatbot. To test this approach, we develop an informational schizophrenia chatbot and converse with it (with the filter deactivated) until it oversteps its scope. Once drift has been observed, AI-agents are used to automatically generate sample conversations in which the chatbot is being enticed to talk about out-of-bounds topics. We manually assign to each response a compliance score that quantifies the chatbot’s compliance with its instructions; specifically the rules about accurately conveying sources and being transparent about limitations. Activating the Critical Analysis Filter resulted in an acceptable compliance score (=2) in 67.0% of responses, compared to only 8.7% when the filter was deactivated. These results suggest that a self-reflection layer could enable LLMs to be used effectively and safely in mental health platforms, maintaining adaptability while reliably limiting their scope to appropriate use cases.
摘要：患有精神分裂症的患者常常表现出认知障碍，这可能阻碍他们了解自身病情的能力。这类人群可以从利用大语言模型（LLMs）如 GPT-4 的适应性的教育平台中获益良多。尽管 LLMs 有可能使专题心理健康信息更加易于获取和吸引人，但其黑箱特性引发了关于伦理和安全的担忧。提示（Prompting）提供了一种方法，可以生成基于指令和验证信息的半脚本聊天机器人，但经过提示工程的聊天机器人在对话过程中可能会偏离其预设身份。我们提出了一种关键分析过滤器（Critical Analysis Filter），以实现对聊天机器人行为的更好控制。在该系统中，一组经过提示的 LLM 智能体被提示工程化，以批判性地分析和精炼聊天机器人的响应，并向聊天机器人提供实时反馈。为了测试这种方法，我们开发了一个信息性精神分裂症聊天机器人，并与它进行对话（过滤器未激活），直到它超出其范围。一旦观察到偏离，AI 智能体被用于自动生成聊天机器人被诱导谈论超出范围话题的样本对话。我们手动为每个响应分配一个合规性评分，该评分量化了聊天机器人对其指令的遵守程度；特别是关于准确传达来源和透明地说明限制的规则。激活关键分析过滤器后，67.0% 的响应获得了可接受的合规性评分（=2），而过滤器未激活时仅为 8.7%。这些结果表明，自我反思层可以使 LLMs 在心理健康平台中有效且安全地使用，同时保持适应性，并可靠地将范围限制在适当的用例中。
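
The filter can be pictured as a draft, critique, and revise loop placed between the chatbot and the user. The sketch below shows only that control flow, under stated assumptions: the chatbot and the critic are stub functions rather than prompted LLM agents, all names (`filtered_reply`, `toy_chatbot`, `scope_critic`) are hypothetical, and the single keyword rule stands in for the paper's prompt-based analysis.

```python
from typing import Callable, List, Optional

Chatbot = Callable[[str, List[str]], str]          # (user message, feedback) -> draft
Critic = Callable[[str, str], Optional[str]]       # (user message, draft) -> note or None

def filtered_reply(user_msg: str, chatbot: Chatbot, critics: List[Critic],
                   max_revisions: int = 2) -> str:
    """Draft a reply, let critics review it, and revise while they object."""
    draft = chatbot(user_msg, [])
    for _ in range(max_revisions):
        # Collect every non-empty note raised by the critic agents.
        feedback = [note for critic in critics if (note := critic(user_msg, draft))]
        if not feedback:
            break  # no critic objected, so the draft is released
        draft = chatbot(user_msg, feedback)
    return draft

def toy_chatbot(user_msg: str, feedback: List[str]) -> str:
    """Stub chatbot: oversteps on the first try, complies once it gets feedback."""
    if not feedback:
        return "You could lower your dose by half."
    return "I can't advise on medication changes; please ask your care team."

def scope_critic(user_msg: str, draft: str) -> Optional[str]:
    """Stub critic: objects to drafts that appear to give dosing advice."""
    if "dose" in draft.lower():
        return "Out of scope: the reply gives individual medication advice."
    return None

reply = filtered_reply("Should I change my dose?", toy_chatbot, [scope_critic])
```

Capping the number of revisions keeps latency bounded; what to do when critics still object after the last revision (for example, fall back to a fixed refusal) is a design decision this sketch leaves open.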

[NLP-185] ACCEPT: Adaptive Codebook for Composite and Efficient Prompt Tuning EMNLP

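[Quick read]: This paper addresses the fact that in conventional prompt tuning the number of parameters grows linearly with prompt length. It proposes Adaptive Codebook for Composite and Efficient Prompt Tuning (ACCEPT), which borrows the idea of product quantization (PQ): all soft prompts share a set of learnable codebook vectors, and each prompt is distinguished by a set of adaptive weights. This substantially reduces the number of tuned parameters while achieving superior performance on large-scale pretrained language models (PLMs) across tasks such as natural language understanding (NLU) and question answering (QA).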

Link: https://arxiv.org/abs/2410.12847
Authors: Yu-Chen Lin,Wei-Hua Li,Jun-Cheng Chen,Chu-Song Chen
Keywords (EN): popular Parameter-Efficient Fine-Tuning, large-scale pretrained Language, Parameter-Efficient Fine-Tuning method, Fine-Tuning method attributed, Efficient Prompt Tuning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP Finding 2024

Abstract:Prompt Tuning has been a popular Parameter-Efficient Fine-Tuning method attributed to its remarkable performance with few updated parameters on various large-scale pretrained Language Models (PLMs). Traditionally, each prompt has been considered indivisible and updated independently, so the number of parameters increases proportionally as prompt length grows. To address this issue, we propose Adaptive Codebook for Composite and Efficient Prompt Tuning (ACCEPT). In our method, we refer to the concept of product quantization (PQ), allowing all soft prompts to share a set of learnable codebook vectors in each subspace, with each prompt differentiated by a set of adaptive weights. We achieve superior performance on 17 diverse natural language tasks including natural language understanding (NLU) and question answering (QA) tasks by tuning only 0.3% of parameters of the PLMs. Our approach also excels in few-shot and large model settings, highlighting its significant potential.
摘要：提示调优 (Prompt Tuning) 作为一种高效的参数微调方法，因其在大规模预训练语言模型 (PLMs) 上仅需更新少量参数即可取得显著性能而广受欢迎。传统上，每个提示被视为不可分割的独立单元，导致随着提示长度的增加，参数数量也成比例增长。为解决这一问题，我们提出了复合高效提示调优的自适应码本方法 (ACCEPT)。在我们的方法中，我们借鉴了乘积量化 (Product Quantization) 的概念，使得所有软提示在每个子空间中共享一组可学习的码本向量，并通过一组自适应权重来区分每个提示。我们在包括自然语言理解 (NLU) 和问答 (QA) 任务在内的 17 项多样化自然语言任务上，仅微调 PLMs 的 0.3% 参数，便取得了优异的性能。我们的方法在少样本和大模型设置下同样表现出色，凸显了其巨大的潜力。
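
The parameter saving from sharing codebooks can be illustrated with a small sketch. It assumes, as one reading of the abstract, that the embedding dimension is split into M subspaces, each subspace has K shared codebook vectors, and every prompt token stores only K weights per subspace. The function names and all sizes below are illustrative assumptions and are not the configuration used in the paper.

```python
from typing import List, Tuple

def compose_prompts(weights: List[List[List[float]]],
                    codebooks: List[List[List[float]]]) -> List[List[float]]:
    """Build soft prompts from shared codebooks.

    weights[l][m][k]   : weight of codeword k in subspace m for prompt token l
    codebooks[m][k][j] : j-th coordinate of codeword k in subspace m
    Returns prompts[l] as the concatenation of the M weighted sub-vectors.
    """
    prompts = []
    for token_weights in weights:
        row: List[float] = []
        for w_m, book_m in zip(token_weights, codebooks):
            sub_dim = len(book_m[0])
            # Weighted sum of this subspace's codewords, one coordinate at a time.
            row.extend(sum(w * vec[j] for w, vec in zip(w_m, book_m))
                       for j in range(sub_dim))
        prompts.append(row)
    return prompts

def param_counts(length: int, dim: int, subspaces: int, codewords: int) -> Tuple[int, int]:
    """Trainable parameters: plain prompt tuning vs. shared codebooks plus weights."""
    plain = length * dim
    # Codebooks hold subspaces * codewords * (dim / subspaces) = codewords * dim values.
    shared = codewords * dim + length * subspaces * codewords
    return plain, shared

# Tiny example: 2 subspaces of width 2, 2 codewords each, 2 prompt tokens.
codebooks = [[[1, 0], [0, 1]], [[2, 2], [3, 3]]]
weights = [[[1, 0], [0, 1]], [[0.5, 0.5], [1, 0]]]
prompts = compose_prompts(weights, codebooks)
plain, shared = param_counts(length=100, dim=1024, subspaces=4, codewords=8)
```

With these illustrative sizes, plain prompt tuning trains 102,400 values while the shared scheme trains 11,392, and the gap widens as the prompt gets longer because each extra token adds only M times K weights instead of a full embedding.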

[NLP-186] Accurate and Regret-aware Numerical Problem Solver for Tabular Question Answering

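[Quick read]: This paper addresses numerical questions in free-form table question answering (TableQA), where large language models (LLMs) are unreliable with numerical data. It proposes TabLaP, which uses an LLM to plan multi-step reasoning and delegates the actual numerical calculations to a Python interpreter for accuracy. TabLaP also makes a first attempt to quantify the trustworthiness of its answers, so that users can use the model with awareness of the risk. On two benchmark datasets, TabLaP improves answer accuracy over state-of-the-art models by 5.7% and 5.8%, respectively.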

Link: https://arxiv.org/abs/2410.12846
Authors: Yuxiang Wang,Jianzhong Qi,Junhao Gan
Keywords (EN): answering on free-form, free-form tables, Large Language Models, Question answering, LLMs
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Question answering on free-form tables (a.k.a. TableQA) is a challenging task because of the flexible structure and the complex schema of tables. Recent studies use Large Language Models (LLMs) for this task, exploiting their capability in understanding the questions and tabular data which are typically given in natural language and contain many textual fields, respectively. While this approach has shown promising results, it overlooks the challenges brought by numerical values which are common in tabular data, while LLMs are known to struggle with such values. We aim to address this issue and answer numerical questions. We propose a model named TabLaP that uses LLMs as a planner rather than an answer generator, exploiting LLMs’ capability in multi-step reasoning while leaving the actual numerical calculations to a Python interpreter for accurate calculation. Recognizing the inaccurate nature of LLMs, we further make a first attempt to quantify the trustworthiness of the answers produced by TabLaP, such that users can use TabLaP in a regret-aware manner. Experimental results on two benchmark datasets show that TabLaP is substantially more accurate than the state-of-the-art models, improving the answer accuracy by 5.7% and 5.8% on the two datasets, respectively.
摘要：自由格式表格的问答（又称 TableQA）是一项具有挑战性的任务，因为表格的结构灵活且模式复杂。最近的研究利用大语言模型 (LLM) 来处理这项任务，利用它们在理解自然语言问题和表格数据方面的能力，这些数据通常分别以自然语言形式给出并包含许多文本字段。尽管这种方法显示出有希望的结果，但它忽略了数值在表格数据中常见所带来的挑战，而大语言模型在处理这些数值时存在困难。我们的目标是通过提出一种名为 TabLaP 的模型来解决这一问题，并回答数值相关的问题。TabLaP 模型利用大语言模型作为规划器而非答案生成器，利用大语言模型在多步推理中的能力，同时将实际的数值计算交给 Python 解释器进行精确计算。鉴于大语言模型的不准确性，我们首次尝试量化 TabLaP 生成答案的可信度，以便用户可以在意识到可能的遗憾的情况下使用 TabLaP。在两个基准数据集上的实验结果表明，TabLaP 的准确性显著高于现有最先进的模型，分别在这两个数据集上提高了 5.7% 和 5.8% 的答案准确率。
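
The planner-plus-interpreter split can be sketched as follows: a planner turns the question into a small structured plan, and ordinary Python code carries out the arithmetic. Everything here is an assumption for illustration: the plan format, the names `execute_plan` and `stub_planner`, and the toy table are invented, and the planner is a hard-coded stub where TabLaP would call an LLM. The trustworthiness estimate described in the abstract is not modeled.

```python
from statistics import mean
from typing import Any, Callable, Dict, List

# Aggregations the executor is allowed to run (a deliberately small whitelist).
OPS: Dict[str, Callable[[List[float]], float]] = {
    "sum": sum,
    "mean": mean,
    "max": max,
    "min": min,
    "count": len,
}

def execute_plan(table: List[Dict[str, str]], plan: Dict[str, Any]) -> float:
    """Run a structured plan on a table; all arithmetic happens in Python."""
    rows = table
    if "where" in plan:
        column, value = plan["where"]
        rows = [r for r in rows if r[column] == value]
    # Cells arrive as text, so they are parsed to numbers before aggregating.
    values = [float(r[plan["column"]]) for r in rows]
    return OPS[plan["op"]](values)

def stub_planner(question: str) -> Dict[str, Any]:
    """Stand-in for the LLM planner: returns a fixed plan for the demo question."""
    return {"op": "sum", "column": "sales", "where": ("region", "EU")}

table = [
    {"region": "EU", "sales": "120.5"},
    {"region": "US", "sales": "300"},
    {"region": "EU", "sales": "79.5"},
]
answer = execute_plan(table, stub_planner("What are total EU sales?"))
```

Restricting the plan to a whitelist of operations, rather than executing model-written code directly, is a conservative choice made for this sketch; the abstract only says that calculations are left to a Python interpreter.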

[NLP-187] Toward Relieving Clinician Burden by Automatically Generating Progress Notes using Interim Hospital Data

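[Quick read]: This paper addresses the burden that writing progress notes places on clinicians by automating progress note generation. It uses the structured or tabular information in electronic health records and presents a new framework and a large dataset, ChartPNG, for the task. Baselines are established on the dataset with large language models from general and biomedical domains, and automated and manual analyses are used to assess model performance and to identify the challenges of the task and opportunities for future research.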

Link: https://arxiv.org/abs/2410.12845
Authors: Sarvesh Soni,Dina Demner-Fushman
Keywords (EN): Regular documentation, progress notes, automate progress note, progress note generation, main contributors
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at the AMIA 2024 Annual Symposium

Abstract:Regular documentation of progress notes is one of the main contributors to clinician burden. The abundance of structured chart information in medical records further exacerbates the burden; however, it also presents an opportunity to automate the generation of progress notes. In this paper, we propose a task to automate progress note generation using structured or tabular information present in electronic health records. To this end, we present a novel framework and a large dataset, ChartPNG, for the task which contains 7089 annotation instances (each having a pair of progress notes and interim structured chart data) across 1616 patients. We establish baselines on the dataset using large language models from general and biomedical domains. We perform both automated (where the best performing Biomistral model achieved a BERTScore F1 of 80.53 and MEDCON score of 19.61) and manual (where we found that the model was able to leverage relevant structured data with 76.9% accuracy) analyses to identify the challenges with the proposed task and opportunities for future research.
摘要：定期记录进展笔记是临床医生负担的主要来源之一。医疗记录中丰富的结构化图表信息进一步加剧了这一负担，但同时也为自动化生成进展笔记提供了机会。本文提出了一项任务，即利用电子健康记录中存在的结构化或表格信息来自动生成进展笔记。为此，我们提出了一种新颖的框架和一个大型数据集 ChartPNG，该数据集包含 7089 个标注实例（每个实例包含一对进展笔记和临时结构化图表数据），涵盖 1616 名患者。我们使用来自通用和生物医学领域的大语言模型在数据集上建立了基线。我们进行了自动化（其中表现最佳的 Biomistral 模型在 BERTScore F1 上达到了 80.53，在 MEDCON 评分上达到了 19.61）和手动（我们发现模型能够以 76.9% 的准确率利用相关结构化数据）分析，以识别该任务的挑战和未来研究的机会。

[NLP-188] TextLap: Customizing Language Models for Text-to-Layout Planning EMNLP

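[Quick read]: This paper addresses the automatic generation of graphical layouts for applications such as posters, flyers, advertisements, and graphical user interfaces. It customizes a large language model (LLM) to generate compelling layouts from the user's text instructions. The method, TextLap, uses a curated instruction-based layout planning dataset (InsLap) to customize LLMs as graphic designers, and it outperforms strong baselines, including GPT-4 based methods, on image generation and graphical design benchmarks.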

Link: https://arxiv.org/abs/2410.12844
Authors: Jian Chen,Ruiyi Zhang,Yufan Zhou,Jennifer Healey,Jiuxiang Gu,Zhiqiang Xu,Changyou Chen
Keywords (EN): including designing posters, Automatic generation, graphical user interfaces, real-world applications, designing posters
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to the EMNLP Findings

Abstract:Automatic generation of graphical layouts is crucial for many real-world applications, including designing posters, flyers, advertisements, and graphical user interfaces. Given the incredible ability of Large language models (LLMs) in both natural language understanding and generation, we believe that we could customize an LLM to help people create compelling graphical layouts starting with only text instructions from the user. We call our method TextLap (text-based layout planning). It uses a curated instruction-based layout planning dataset (InsLap) to customize LLMs as a graphic designer. We demonstrate the effectiveness of TextLap and show that it outperforms strong baselines, including GPT-4 based methods, for image generation and graphical design benchmarks.
摘要：自动生成图形布局对于许多实际应用至关重要，包括设计海报、传单、广告和图形用户界面。鉴于大语言模型 (LLM) 在自然语言理解和生成方面的卓越能力，我们相信可以定制一个 LLM，以帮助人们仅通过用户的文本指令创建引人注目的图形布局。我们称这种方法为 TextLap（基于文本的布局规划）。它使用一个精心策划的基于指令的布局规划数据集 (InsLap) 来定制 LLM 作为图形设计师。我们展示了 TextLap 的有效性，并证明它在图像生成和图形设计基准测试中优于包括基于 GPT-4 的方法在内的强大基线。

[NLP-189] Exploring Prompt Engineering: A Systematic Review with SWOT Analysis

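Quick read: This paper conducts a comprehensive SWOT analysis of prompt engineering techniques for large language models (LLMs), examining their strengths, weaknesses, opportunities, and threats, with the aim of improving human-AI interaction and language models' comprehension of human prompts. The analysis centers on template-based prompting and fine-tuning, discusses the problems and challenges each faces in practice, and closes with research directions for making prompt engineering more effective in human-machine communication.

Link: https://arxiv.org/abs/2410.12843
Authors: Aditi Singh, Abul Ehtesham, Gaurav Kumar Gupta, Nikhil Kumar Chatta, Saket Kumar, Tala Talaei Khoei
Keywords: Large Language Models, comprehensive SWOT analysis, comprehensive SWOT, realm of Large, Large Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages, 1 figure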

Abstract:In this paper, we conduct a comprehensive SWOT analysis of prompt engineering techniques within the realm of Large Language Models (LLMs). Emphasizing linguistic principles, we examine various techniques to identify their strengths, weaknesses, opportunities, and threats. Our findings provide insights into enhancing AI interactions and improving language model comprehension of human prompts. The analysis covers techniques including template-based approaches and fine-tuning, addressing the problems and challenges associated with each. The conclusion offers future research directions aimed at advancing the effectiveness of prompt engineering in optimizing human-machine communication.
摘要：本文对大语言模型 (LLM) 领域内的提示工程技术进行了全面的 SWOT 分析。我们强调语言学原理，通过研究各种技术来识别其优势、劣势、机会和威胁。我们的研究结果为增强 AI 交互和提高语言模型对人类提示的理解提供了见解。分析涵盖了基于模板的方法和微调等技术，并讨论了每种技术所面临的问题和挑战。结论部分提出了未来研究方向，旨在提高提示工程在优化人机通信中的有效性。

[NLP-190] A Two-Model Approach for Humour Style Recognition

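Quick read: This paper tackles humour style recognition, which is hampered by the lack of established datasets and machine learning models. It introduces a new text dataset of 1463 instances covering four humour styles (self-enhancing, self-deprecating, affiliative, and aggressive) plus non-humorous text, and establishes baselines with classic machine learning classifiers, text embedding models, and DistilBERT. It also proposes a two-model approach that improves the f1-score for affiliative humour classification by 11.61%, with consistent improvements across the 14 models tested. The work offers new tools for the computational study of humour in literature, social media, and other text.

Link: https://arxiv.org/abs/2410.12842
Authors: Mary Ogbuka Kenneth, Foaad Khosmood, Abbas Edalat
Keywords: significantly impact social, impact social interactions, human communication, mental health, fundamental aspect
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: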

Abstract:Humour, a fundamental aspect of human communication, manifests itself in various styles that significantly impact social interactions and mental health. Recognising different humour styles poses challenges due to the lack of established datasets and machine learning (ML) models. To address this gap, we present a new text dataset for humour style recognition, comprising 1463 instances across four styles (self-enhancing, self-deprecating, affiliative, and aggressive) and non-humorous text, with lengths ranging from 4 to 229 words. Our research employs various computational methods, including classic machine learning classifiers, text embedding models, and DistilBERT, to establish baseline performance. Additionally, we propose a two-model approach to enhance humour style recognition, particularly in distinguishing between affiliative and aggressive styles. Our method demonstrates an 11.61% improvement in f1-score for affiliative humour classification, with consistent improvements in the 14 models tested. Our findings contribute to the computational analysis of humour in text, offering new tools for studying humour in literature, social media, and other textual sources.
摘要：幽默，作为人类交流的基本方面，以多种风格呈现，对社会互动和心理健康产生显著影响。由于缺乏成熟的数据集和机器学习 (ML) 模型，识别不同的幽默风格面临挑战。为了填补这一空白，我们提出了一种新的用于幽默风格识别的文本数据集，包含 1463 个实例，涵盖四种风格（自我提升、自我贬低、亲和型和攻击型）和非幽默文本，字数范围从 4 到 229 字不等。我们的研究采用了多种计算方法，包括经典的机器学习分类器、文本嵌入模型和 DistilBERT，以建立基线性能。此外，我们提出了一种双模型方法来增强幽默风格的识别，特别是在区分亲和型和攻击型风格方面。我们的方法在亲和型幽默分类的 f1-score 上展示了 11.61% 的提升，并且在测试的 14 个模型中均表现出一致的改进。我们的研究成果为文本中幽默的计算分析提供了新的工具，为研究文学、社交媒体和其他文本来源中的幽默提供了新的手段。

[NLP-191] UniAutoML: A Human-Centered Framework for Unified Discriminative and Generative AutoML with Large Language Models

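Quick read: This paper addresses two shortcomings of traditional automated machine learning (AutoML) frameworks: weak support for generative models, and a lack of user engagement and transparency. It introduces UniAutoML, a framework that uses large language models (LLMs) to unify AutoML for discriminative and generative tasks. Its core innovation is a human-centered design built around a conversational user interface (CUI) that gives real-time guidance, feedback, and progress updates through natural-language interaction, which increases transparency and user control during training and, in turn, user trust. UniAutoML also includes a safety guardline that filters inputs and censors outputs to reduce the risks of LLM-generated content.

Link: https://arxiv.org/abs/2410.12841
Authors: Jiayi Guo, Liyun Zhang, Yiqin Shen
Keywords: Automated Machine Learning, Automated Machine, Machine Learning, data pre-processing, hyper-parameter searching
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 24 pages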

Abstract:Automated Machine Learning (AutoML) has simplified complex ML processes such as data pre-processing, model selection, and hyper-parameter searching. However, traditional AutoML frameworks focus solely on discriminative tasks, often falling short in tackling AutoML for generative models. Additionally, these frameworks lack interpretability and user engagement during the training process, primarily due to the absence of human-centered design. It leads to a lack of transparency in final decision-making and limited user control, potentially reducing trust and adoption of AutoML methods. To address these limitations, we introduce UniAutoML, a human-centered AutoML framework that leverages Large Language Models (LLMs) to unify AutoML for both discriminative (e.g., Transformers and CNNs for classification or regression tasks) and generative tasks (e.g., fine-tuning diffusion models or LLMs). The human-centered design of UniAutoML innovatively features a conversational user interface (CUI) that facilitates natural language interactions, providing users with real-time guidance, feedback, and progress updates for better interpretability. This design enhances transparency and user control throughout the AutoML training process, allowing users to seamlessly break down or modify the model being trained. To mitigate potential risks associated with LLM generated content, UniAutoML incorporates a safety guardline that filters inputs and censors outputs. We evaluated UniAutoML’s performance and usability through experiments on eight diverse datasets and user studies involving 25 participants, demonstrating that UniAutoML not only enhances performance but also improves user control and trust. Our human-centered design bridges the gap between AutoML capabilities and user understanding, making ML more accessible to a broader audience.
摘要：自动化机器学习 (AutoML) 简化了数据预处理、模型选择和超参数搜索等复杂的机器学习过程。然而，传统的 AutoML 框架主要专注于判别任务，在处理生成模型的 AutoML 方面往往表现不足。此外，这些框架在训练过程中缺乏可解释性和用户参与度，主要原因是缺乏以人为中心的设计。这导致了最终决策的不透明性和用户控制的局限性，可能降低了对 AutoML 方法的信任和采用率。为了解决这些局限性，我们引入了 UniAutoML，这是一个以人为中心的 AutoML 框架，利用大语言模型 (LLM) 来统一判别任务（例如，用于分类或回归任务的 Transformer 和 CNN）和生成任务（例如，微调扩散模型或 LLM）的 AutoML。UniAutoML 的人性化设计创新性地采用了对话式用户界面 (CUI)，促进了自然语言交互，为用户提供实时指导、反馈和进度更新，以提高可解释性。这种设计在整个 AutoML 训练过程中增强了透明度和用户控制，使用户能够无缝地分解或修改正在训练的模型。为了减轻与 LLM 生成内容相关的潜在风险，UniAutoML 还集成了一个安全防线，用于过滤输入和审查输出。我们通过在八个不同数据集上的实验和涉及 25 名参与者的用户研究，评估了 UniAutoML 的性能和可用性，结果表明 UniAutoML 不仅提高了性能，还增强了用户控制和信任。我们的人性化设计弥合了 AutoML 能力与用户理解之间的差距，使机器学习对更广泛的受众更加易于接触。

[NLP-192] Answering Questions in Stages: Prompt Chaining for Contract QA

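Quick read: This paper addresses the failure of simple zero-shot prompts to produce good structured answers for complex legal contract clauses. It proposes two-stage prompt chaining: a first stage produces a preliminary answer and a second stage refines it, which improves answer accuracy on complex legal text. The method is effective for multiple-choice and multiple-select questions, but needs further refinement when the linguistic variation in the clauses exceeds what can be captured by simply specifying the possible answers.

Link: https://arxiv.org/abs/2410.12840
Authors: Adam Roegiest, Radha Chitta
Keywords: understanding market trends, Finding answers, due diligence, risk mitigation, important form
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: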

Abstract:Finding answers to legal questions about clauses in contracts is an important form of analysis in many legal workflows (e.g., understanding market trends, due diligence, risk mitigation) but more important is being able to do this at scale. Prior work showed that it is possible to use large language models with simple zero-shot prompts to generate structured answers to questions, which can later be incorporated into legal workflows. Such prompts, while effective on simple and straightforward clauses, fail to perform when the clauses are long and contain information not relevant to the question. In this paper, we propose two-stage prompt chaining to produce structured answers to multiple-choice and multiple-select questions and show that they are more effective than simple prompts on more nuanced legal text. We analyze situations where this technique works well and areas where further refinement is needed, especially when the underlying linguistic variations are more than can be captured by simply specifying possible answers. Finally, we discuss future research that seeks to refine this work by improving stage one results by making them more question-specific.
摘要：在许多法律工作流程（例如，理解市场趋势、尽职调查、风险缓解）中，查找合同条款中的法律问题的答案是一种重要的分析形式，但更重要的是能够大规模地进行这种分析。先前的工作表明，使用带有简单零样本提示的大语言模型可以生成结构化的答案，这些答案随后可以被纳入法律工作流程中。然而，这种提示虽然在处理简单直接的条款时有效，但在条款较长且包含与问题无关的信息时，其表现不佳。在本文中，我们提出了一种两阶段提示链方法，用于生成单选题和多选题的结构化答案，并证明其在处理更为复杂的法律文本时比简单提示更为有效。我们分析了该技术表现良好的情况以及需要进一步改进的领域，特别是在底层语言变异超出简单指定可能答案所能捕捉的范围时。最后，我们讨论了未来的研究方向，旨在通过使第一阶段的结果更具问题针对性来改进这一工作。
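
The two-stage chaining described in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `llm` callable, the prompt wording, and the split into "draft a free-text answer" followed by "map the draft onto the allowed options" are all assumptions made for the example.

```python
from typing import Callable, List


def answer_in_stages(
    llm: Callable[[str], str],  # hypothetical: takes a prompt, returns text
    clause: str,
    question: str,
    options: List[str],
) -> List[str]:
    """Chain two prompts: a free-text draft, then a choice among options."""
    # Stage 1 (assumed design): answer the question in free text from the clause.
    stage1 = (
        "STAGE 1\n"
        f"Clause:\n{clause}\n\n"
        f"Question: {question}\n"
        "Answer briefly, using only the clause."
    )
    draft = llm(stage1)

    # Stage 2 (assumed design): map the draft onto the allowed options.
    stage2 = (
        "STAGE 2\n"
        f"Draft answer: {draft}\n"
        f"Question: {question}\n"
        "Options:\n" + "\n".join(options) + "\n"
        "Return every option that applies, one per line."
    )
    reply = llm(stage2)

    # Keep only lines that exactly match an allowed option (case-insensitive),
    # so free-form text never leaks into the structured answer.
    by_key = {opt.lower(): opt for opt in options}
    chosen: List[str] = []
    for line in reply.splitlines():
        key = line.strip().lower()
        if key in by_key and by_key[key] not in chosen:
            chosen.append(by_key[key])
    return chosen
```

Returning a list covers both question types: a multiple-choice answer is a list of length one, and a multiple-select answer may contain several options.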

[NLP-193] Capturing Bias Diversity in LLMs

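Quick read: This paper addresses the lack of diversity and representativeness in the outputs of large language models (LLMs). The key idea is a framework called BiasGPT: multiple customised instances of a GPT model are developed, each reflecting biases tied to a specific demographic characteristic such as gender, age, or race. The customised models are ultimately meant to collaborate, merging their differing perspectives into one integrated response that captures a broader range of human experiences and viewpoints, as a step toward more inclusive AI technologies.

Link: https://arxiv.org/abs/2410.12839
Authors: Purva Prasad Gosavi, Vaishnavi Murlidhar Kulkarni, Alan F. Smeaton
Keywords: Large Language Models, Large Language, paper presents research, enhancements to Large, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 2nd International Conference on Foundation and Large Language Models (FLLM2024), 26-29 November, 2024 | Dubai, UAE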

Abstract:This paper presents research on enhancements to Large Language Models (LLMs) through the addition of diversity in its generated outputs. Our study introduces a configuration of multiple LLMs which demonstrates the diversities capable with a single LLM. By developing multiple customised instances of a GPT model, each reflecting biases in specific demographic characteristics including gender, age, and race, we propose, develop and evaluate a framework for a more nuanced and representative AI dialogue which we call BiasGPT. The customised GPT models will ultimately collaborate, merging their diverse perspectives on a topic into an integrated response that captures a broad spectrum of human experiences and viewpoints. In this paper, through experiments, we demonstrate the capabilities of a GPT model to embed different biases which, when combined, can open the possibilities of more inclusive AI technologies.
摘要：本文探讨了通过增加生成式输出多样性来增强大语言模型 (LLM) 的研究。我们的研究引入了一种由多个 LLM 组成的配置，展示了单个 LLM 所能实现的多样性。通过开发多个定制化的 GPT 模型实例，每个实例反映特定人口统计特征（包括性别、年龄和种族）的偏见，我们提出、开发并评估了一个名为 BiasGPT 的框架，旨在实现更加细致和具有代表性的 AI 对话。这些定制化的 GPT 模型最终将协作，将它们对某一主题的不同视角整合成一个综合响应，捕捉到广泛的人类经验和观点。本文通过实验展示了 GPT 模型嵌入不同偏见的能力，这些偏见在结合后可以开启更多包容性 AI 技术的可能性。

[NLP-194] A Comprehensive Survey of Retrieval-Augmented Generation (RAG): Evolution, Current Landscape and Future Directions

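Quick read: This paper presents a comprehensive study of Retrieval-Augmented Generation (RAG), from foundational concepts to the current state of the art, as a way to address key limitations of large language models (LLMs) in output accuracy. RAG combines retrieval mechanisms with generative language models, integrating retrieval and generation to handle knowledge-intensive tasks. The survey covers RAG's basic architecture, technical advances, and applications in question answering, summarization, and knowledge-based tasks, and it discusses open challenges such as scalability, bias, and ethical concerns, along with future research directions.

Link: https://arxiv.org/abs/2410.12837
Authors: Shailja Gupta, Rajesh Ranjan, Surya Narayan Singh
Keywords: RAG, tracing its evolution, presents a comprehensive, current state, RAG models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 4 Figures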

Abstract:This paper presents a comprehensive study of Retrieval-Augmented Generation (RAG), tracing its evolution from foundational concepts to the current state of the art. RAG combines retrieval mechanisms with generative language models to enhance the accuracy of outputs, addressing key limitations of LLMs. The study explores the basic architecture of RAG, focusing on how retrieval and generation are integrated to handle knowledge-intensive tasks. A detailed review of the significant technological advancements in RAG is provided, including key innovations in retrieval-augmented language models and applications across various domains such as question-answering, summarization, and knowledge-based tasks. Recent research breakthroughs are discussed, highlighting novel methods for improving retrieval efficiency. Furthermore, the paper examines ongoing challenges such as scalability, bias, and ethical concerns in deployment. Future research directions are proposed, focusing on improving the robustness of RAG models, expanding the scope of application of RAG models, and addressing societal implications. This survey aims to serve as a foundational resource for researchers and practitioners in understanding the potential of RAG and its trajectory in natural language processing.
摘要：本文对检索增强生成 (Retrieval-Augmented Generation, RAG) 进行了全面研究，追溯了其从基础概念到当前最先进状态的演变过程。RAG 结合了检索机制与生成式语言模型，以提高输出准确性，解决了大语言模型 (Large Language Model, LLM) 的关键局限性。研究探讨了 RAG 的基本架构，重点分析了检索与生成如何集成以处理知识密集型任务。本文详细回顾了 RAG 在技术上的重大进展，包括检索增强语言模型的关键创新及其在问答、摘要和基于知识的任务等多个领域的应用。讨论了近期研究突破，突出了提高检索效率的新方法。此外，论文还探讨了当前面临的挑战，如可扩展性、偏见和部署中的伦理问题。提出了未来的研究方向，包括提高 RAG 模型的鲁棒性、扩展 RAG 模型的应用范围以及解决其社会影响。本调查旨在为研究人员和实践者提供一个基础资源，以理解 RAG 的潜力及其在自然语言处理中的发展轨迹。

[NLP-195] A Dutch Financial Large Language Model

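Quick read: This paper addresses the lack of a dedicated large language model (LLM) for Dutch-language finance and presents FinGEITje, the first LLM optimized for Dutch financial tasks. The key contributions are: 1) a Dutch financial instruction tuning dataset with over 140,000 samples, built with an automated translation and data processing method; 2) an open-source data construction method that makes it easier to create financial instruction datasets in other languages; 3) the first Dutch financial evaluation benchmark, with automated evaluation that uses an LLM as an independent evaluator to reduce manual intervention. Together these support FinGEITje's performance on Dutch and English financial tasks.

Link: https://arxiv.org/abs/2410.12835
Authors: Sander Noels, Jorne De Blaere, Tijl De Bie
Keywords: Dutch financial Large, Large Language Model, financial Large Language, Large Language, paper presents FinGEITje
Categories: Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
Comments: 9 pages, 1 figure, accepted at ACM ICAIF’24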

Abstract:This paper presents FinGEITje, the first Dutch financial Large Language Model (LLM) specifically designed and optimized for various financial tasks. Together with the model, we release a specialized Dutch financial instruction tuning dataset with over 140,000 samples, constructed employing an automated translation and data processing method. The open-source data construction method is provided, facilitating the creation of financial instruction datasets in different languages. To evaluate model performance, the study introduces the first Dutch financial evaluation benchmark, along with an automated evaluation method that utilizes an LLM as an independent evaluator, reducing manual intervention in performance evaluation. The experimental results highlight the superior performance of FinGEITje across five critical Dutch and English financial tasks.
摘要：本文介绍了 FinGEITje，这是首个专门为各种金融任务设计和优化的荷兰语大语言模型 (LLM)。与该模型一同发布的还有一个专门的荷兰语金融指令调优数据集，该数据集包含超过 140,000 个样本，采用了自动化翻译和数据处理方法构建。本文还提供了开源的数据构建方法，便于在不同语言中创建金融指令数据集。为了评估模型性能，研究引入了首个荷兰语金融评估基准，并采用了一种利用 LLM 作为独立评估者的自动化评估方法，减少了性能评估中的人工干预。实验结果显示，FinGEITje 在五项关键的荷兰语和英语金融任务中表现优异。

[NLP-196] Rethinking Misalignment in Vision-Language Model Adaptation from a Causal Perspective NEURIPS2024

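Quick read: This paper addresses the data misalignment that CLIP suffers from when adapted to specific tasks, where task-irrelevant knowledge interferes with predictions. It proposes Causality-Guided Semantic Decoupling and Classification (CDC): using front-door adjustment, it decouples the semantics contained in downstream task data and classifies on each semantic separately, then applies Dempster-Shafer evidence theory to assess the uncertainty of the predictions produced from the different semantics, which mitigates the interference of task-irrelevant knowledge.

Link: https://arxiv.org/abs/2410.12816
Authors: Yanan Zhang, Jiangmeng Li, Lixiang Liu, Wenwen Qiang
Keywords: Foundational Vision-Language models, exhibited impressive generalization, Foundational Vision-Language, data misalignment, exhibited impressive
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted by NeurIPS 2024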

Abstract:Foundational Vision-Language models such as CLIP have exhibited impressive generalization in downstream tasks. However, CLIP suffers from a two-level misalignment issue, i.e., task misalignment and data misalignment, when adapting to specific tasks. Soft prompt tuning has mitigated the task misalignment, yet the data misalignment remains a challenge. To analyze the impacts of the data misalignment, we revisit the pre-training and adaptation processes of CLIP and develop a structural causal model. We discover that while we expect to capture task-relevant information for downstream tasks accurately, the task-irrelevant knowledge impacts the prediction results and hampers the modeling of the true relationships between the images and the predicted classes. As task-irrelevant knowledge is unobservable, we leverage the front-door adjustment and propose Causality-Guided Semantic Decoupling and Classification (CDC) to mitigate the interference of task-irrelevant knowledge. Specifically, we decouple semantics contained in the data of downstream tasks and perform classification based on each semantic. Furthermore, we employ the Dempster-Shafer evidence theory to evaluate the uncertainty of each prediction generated by diverse semantics. Experiments conducted in multiple different settings have consistently demonstrated the effectiveness of CDC.
摘要：基础视觉-语言模型如 CLIP 在下游任务中展现了显著的泛化能力。然而，当适应特定任务时，CLIP 面临两级错位问题，即任务错位和数据错位。软提示调优缓解了任务错位问题，但数据错位仍是一个挑战。为了分析数据错位问题的影响，我们重新审视了 CLIP 的预训练和适应过程，并构建了一个结构因果模型。我们发现，尽管我们期望准确捕捉与下游任务相关的信息，但任务无关的知识影响了预测结果，并阻碍了图像与预测类别之间真实关系的建模。由于任务无关的知识是不可观测的，我们利用前门调整，并提出了因果引导的语义解耦与分类（Causality-Guided Semantic Decoupling and Classification, CDC）以减轻任务无关知识的干扰。具体而言，我们对下游任务数据中的语义进行解耦，并基于每个语义进行分类。此外，我们采用 Dempster-Shafer 证据理论来评估由不同语义生成的每个预测的不确定性。在多种不同设置下的实验一致证明了 CDC 的有效性。
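
For readers unfamiliar with Dempster-Shafer evidence theory, the sketch below shows the textbook Dempster rule for combining two sources of evidence. It is a generic illustration, not the CDC formulation from the paper; the class names and mass values are made up for the example. Mass assigned to the whole frame (here `{cat, dog}`) can be read as "don't know", which is how the theory expresses uncertainty.

```python
from itertools import product
from typing import Dict, FrozenSet

Mass = Dict[FrozenSet[str], float]


def combine(m1: Mass, m2: Mass) -> Mass:
    """Dempster's rule: multiply masses, intersect sets, renormalise."""
    combined: Mass = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            # Mass landing on the empty set measures disagreement.
            conflict += wa * wb
    if 1.0 - conflict < 1e-12:
        raise ValueError("sources are in total conflict")
    return {s: w / (1.0 - conflict) for s, w in combined.items()}


# Illustrative masses from two hypothetical semantic "views" of one image.
CAT, DOG = frozenset({"cat"}), frozenset({"dog"})
EITHER = CAT | DOG  # whole frame: mass here means uncertainty
view_a = {CAT: 0.6, EITHER: 0.4}
view_b = {DOG: 0.3, EITHER: 0.7}
fused = combine(view_a, view_b)
```

In this example the two views partly disagree, so some mass is discarded as conflict and the rest is renormalised; the fused mass left on `EITHER` is the remaining uncertainty.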

[NLP-197] Predicting the Geolocation of Tweets Using transformer models on Customized Data

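Quick read: This paper addresses tweet/user geolocation prediction and offers a flexible methodology for geotagging large text collections. It uses neural networks for natural language processing (NLP): pretrained BERT models extract features from tweet content and metadata, and locations are estimated as (longitude, latitude) coordinate pairs with two-dimensional Gaussian Mixture Models (GMMs). The models are fine-tuned on a Twitter dataset and reach a median error below 30 km on the worldwide dataset and below 15 km on the US dataset.

Link: https://arxiv.org/abs/2303.07865
Authors: Kateryna Lutsai, Christoph H. Lampert
Keywords: user geolocation prediction, geolocation prediction task, textual big data, Gaussian Mixture Models, two-dimensional Gaussian Mixture
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 31 pages, 5 tables, 9 figures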

Abstract:This research is aimed to solve the tweet/user geolocation prediction task and provide a flexible methodology for the geotagging of textual big data. The suggested approach implements neural networks for natural language processing (NLP) to estimate the location as coordinate pairs (longitude, latitude) and two-dimensional Gaussian Mixture Models (GMMs). The scope of proposed models has been finetuned on a Twitter dataset using pretrained Bidirectional Encoder Representations from Transformers (BERT) as base models. Performance metrics show a median error of fewer than 30 km on a worldwide-level, and fewer than 15 km on the US-level datasets for the models trained and evaluated on text features of tweets’ content and metadata context. Our source code and data are available at this https URL
摘要：本研究旨在解决推文/用户地理位置预测任务，并提供一种灵活的方法论用于文本大数据的地理标记。所提出的方法利用神经网络进行自然语言处理 (NLP) 来估计位置为坐标对 (经度, 纬度) 和二维高斯混合模型 (GMMs)。所提出的模型在 Twitter 数据集上进行了微调，使用预训练的 Transformer 双向编码器表示 (BERT) 作为基础模型。性能指标显示，在全球范围内的中位误差小于 30 公里，在美国范围内的数据集上训练和评估的模型中位误差小于 15 公里，这些模型基于推文内容和元数据上下文的文本特征。我们的源代码和数据可通过此 https URL 获取。
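
The headline metric here is a median error in kilometres between predicted and true coordinates. A minimal sketch of how such a metric can be computed follows; it assumes great-circle (haversine) distance on a spherical Earth of radius 6371 km, which is a common choice but not confirmed by the abstract, and the function names are illustrative.

```python
import math
import statistics
from typing import Iterable, Tuple

Coord = Tuple[float, float]  # (longitude, latitude) in degrees, as in the abstract

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def median_error_km(predicted: Iterable[Coord], actual: Iterable[Coord]) -> float:
    """Median great-circle error over paired predictions and ground truth."""
    return statistics.median(
        haversine_km(*p, *t) for p, t in zip(predicted, actual)
    )
```

The median is less sensitive than the mean to a few predictions that land on the wrong continent, which is one reason it is a common summary for geolocation tasks.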

[NLP-198] On the Use of Audio to Improve Dialogue Policies

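Quick read: This paper addresses a limitation of dialogue policy modules in dialogue systems: they rely only on text transcriptions and ignore important extralinguistic information in the user's speech. It proposes new architectures that add audio information by combining speech and text embeddings through a Double Multi-Head Attention component. Experiments show that audio embedding-aware dialogue policies outperform text-only ones, particularly with noisy transcriptions, and that the way text and audio embeddings are combined is crucial to performance. On the DSTC2 dataset, the approach achieves a 9.8% relative improvement in User Request Score over a text-only dialogue system.

Link: https://arxiv.org/abs/2410.13385
Authors: Daniel Roncel, Federico Costa, Javier Hernando
Keywords: spoken goal-oriented dialogue, spoken goal-oriented, increasingly popular, significant progress, goal-oriented dialogue systems
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: IberSpeech 2024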

Abstract:With the significant progress of speech technologies, spoken goal-oriented dialogue systems are becoming increasingly popular. One of the main modules of a dialogue system is typically the dialogue policy, which is responsible for determining system actions. This component usually relies only on audio transcriptions, being strongly dependent on their quality and ignoring very important extralinguistic information embedded in the user’s speech. In this paper, we propose new architectures to add audio information by combining speech and text embeddings using a Double Multi-Head Attention component. Our experiments show that audio embedding-aware dialogue policies outperform text-based ones, particularly in noisy transcription scenarios, and that how text and audio embeddings are combined is crucial to improve performance. We obtained a 9.8% relative improvement in the User Request Score compared to an only-text-based dialogue system on the DSTC2 dataset.
摘要：随着语音技术的显著进步，面向目标的口语对话系统正变得越来越流行。对话系统的主要模块之一通常是对话策略，负责确定系统动作。这一组件通常仅依赖于音频转录，强烈依赖于其质量，并忽略了用户语音中嵌入的非常重要的非语言信息。在本文中，我们提出了新的架构，通过结合语音和文本嵌入使用双多头注意力组件来添加音频信息。我们的实验表明，具有音频嵌入意识的对话策略优于基于文本的策略，特别是在嘈杂的转录场景中，文本和音频嵌入的结合方式对提升性能至关重要。与仅基于文本的对话系统相比，我们在DSTC2数据集上的用户请求评分中获得了9.8%的相对改进。

[NLP-199] Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation

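Quick read: This paper addresses the weak generalization of Generative Error Correction (GEC) models for Automatic Speech Recognition (ASR), especially on out-of-domain (OOD) errors and named entities (NEs). It proposes DARAG (Data- and Retrieval-Augmented Generative Error Correction), which augments the GEC training set with synthetic data produced by prompting large language models (LLMs) and text-to-speech models, simulating additional error types for the model to learn from. For OOD scenarios, test-time errors from the new domain are simulated in an unsupervised fashion. To handle named entities better, it adds retrieval-augmented correction, enriching the input with relevant entities retrieved from a database. The method is simple, scalable, and agnostic to domain and language; in experiments DARAG outperforms the baselines in both in-domain (ID) and OOD settings, with sizeable relative WER improvements.

Link: https://arxiv.org/abs/2410.13198
Authors: Sreyan Ghosh, Mohammad Sadegh Rasooli, Michael Levit, Peidong Wang, Jian Xue, Dinesh Manocha, Jinyu Li
Keywords: Automatic Speech Recognition, Speech Recognition, Automatic Speech, powerful post-processing method, performance of Automatic
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Preprint. Under Review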

Abstract:Generative Error Correction (GEC) has emerged as a powerful post-processing method to enhance the performance of Automatic Speech Recognition (ASR) systems. However, we show that GEC models struggle to generalize beyond the specific types of errors encountered during training, limiting their ability to correct new, unseen errors at test time, particularly in out-of-domain (OOD) scenarios. This phenomenon amplifies with named entities (NEs), where, in addition to insufficient contextual information or knowledge about the NEs, novel NEs keep emerging. To address these issues, we propose DARAG (Data- and Retrieval-Augmented Generative Error Correction), a novel approach designed to improve GEC for ASR in in-domain (ID) and OOD scenarios. We augment the GEC training dataset with synthetic data generated by prompting LLMs and text-to-speech models, thereby simulating additional errors from which the model can learn. For OOD scenarios, we simulate test-time errors from new domains similarly and in an unsupervised fashion. Additionally, to better handle named entities, we introduce retrieval-augmented correction by augmenting the input with entities retrieved from a database. Our approach is simple, scalable, and both domain- and language-agnostic. We experiment on multiple datasets and settings, showing that DARAG outperforms all our baselines, achieving 8% – 30% relative WER improvements in ID and 10% – 33% improvements in OOD settings.
摘要：生成式错误纠正 (Generative Error Correction, GEC) 已成为提升自动语音识别 (Automatic Speech Recognition, ASR) 系统性能的一种强大后处理方法。然而，我们发现 GEC 模型在训练过程中遇到的特定类型错误之外，难以泛化到新的、未见过的错误，尤其是在域外 (Out-of-Domain, OOD) 场景下，这一问题尤为突出。这种现象在命名实体 (Named Entities, NEs) 中尤为明显，除了缺乏足够的上下文信息或对 NEs 的知识外，新的 NEs 不断涌现。为解决这些问题，我们提出了 DARAG (Data- and Retrieval-Augmented Generative Error Correction)，这是一种旨在提升 ASR 在域内 (In-Domain, ID) 和 OOD 场景下 GEC 性能的新方法。我们通过提示大语言模型 (Large Language Model, LLM) 和文本到语音模型生成合成数据，从而扩充 GEC 训练数据集，模拟模型可以学习的额外错误。对于 OOD 场景，我们同样以无监督的方式模拟来自新领域的测试时错误。此外，为了更好地处理命名实体，我们引入了通过从数据库中检索实体来增强输入的检索增强纠正方法。我们的方法简单、可扩展，并且与领域和语言无关。我们在多个数据集和设置上进行了实验，结果显示 DARAG 优于所有基线方法，在 ID 设置下实现了 8% – 30% 的相对词错误率 (Word Error Rate, WER) 改进，在 OOD 设置下实现了 10% – 33% 的改进。
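
The results above are reported as relative word error rate (WER) improvements. As a reminder of what that means, here is a minimal sketch of the standard WER definition (word-level edit distance divided by reference length) and of a relative improvement; the function names are illustrative and the whitespace tokenisation is a simplifying assumption.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline error removed, e.g. 0.10 -> 0.08 gives 0.2."""
    return (baseline_wer - new_wer) / baseline_wer
```

A "20% relative improvement" therefore means the error rate fell by a fifth of its baseline value, not by 20 percentage points.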

[NLP-200] Exploiting Longitudinal Speech Sessions via Voice Assistant Systems for Early Detection of Cognitive Decline ALT

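Quick read: This paper addresses early detection of Mild Cognitive Impairment (MCI), in particular capturing how cognition changes over time from speech data. It runs a longitudinal study with voice assistant systems (VAS), collecting speech remotely at three-month intervals over 18 months, and proposes two methods to improve MCI detection and the prediction of cognitive changes. The first method incorporates historical data; the second predicts cognitive changes at two time points. Incorporating historical data raises the average F1-score for MCI detection from 58.6% to 71.2% with acoustic features and from 62.1% to 75.1% with linguistic features, and the prediction of cognitive changes reaches an F1-score of 73.7%, supporting the potential of VAS-based speech data for early detection of cognitive decline.

Link: https://arxiv.org/abs/2410.12885
Authors: Kristin Qi, Jiatong Shi, Caroline Summerour, John A. Batsis, Xiaohui Liang
Keywords: Mild Cognitive Impairment, Alzheimer disease, stage of Alzheimer, Mild Cognitive, Cognitive Impairment
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)
Comments: IEEE International Conference on E-health Networking, Application Services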

Abstract:Mild Cognitive Impairment (MCI) is an early stage of Alzheimer’s disease (AD), a form of neurodegenerative disorder. Early identification of MCI is crucial for delaying its progression through timely interventions. Existing research has demonstrated the feasibility of detecting MCI using speech collected from clinical interviews or digital devices. However, these approaches typically analyze data collected at limited time points, limiting their ability to identify cognitive changes over time. This paper presents a longitudinal study using voice assistant systems (VAS) to remotely collect seven-session speech data at three-month intervals across 18 months. We propose two methods to improve MCI detection and the prediction of cognitive changes. The first method incorporates historical data, while the second predicts cognitive changes at two time points. Our results indicate improvements when incorporating historical data: the average F1-score for MCI detection improves from 58.6% to 71.2% (by 12.6%) in the case of acoustic features and from 62.1% to 75.1% (by 13.0%) in the case of linguistic features. Additionally, the prediction of cognitive changes achieves an F1-score of 73.7% in the case of acoustic features. These results confirm the potential of VAS-based speech sessions for early detection of cognitive decline.
摘要：轻度认知障碍 (Mild Cognitive Impairment, MCI) 是阿尔茨海默病 (Alzheimer’s disease, AD) 的早期阶段，属于一种神经退行性疾病。早期识别 MCI 对于通过及时干预延缓其进展至关重要。现有研究表明，通过临床访谈或数字设备收集的语音数据可以用于检测 MCI。然而，这些方法通常仅分析有限时间点收集的数据，限制了其识别认知随时间变化的能力。本文提出了一项纵向研究，利用语音助手系统 (Voice Assistant Systems, VAS) 在 18 个月内每隔三个月远程收集七次语音数据。我们提出了两种方法来改进 MCI 检测和认知变化的预测。第一种方法结合了历史数据，第二种方法则预测两个时间点的认知变化。我们的结果表明，结合历史数据可以显著提升效果：在声学特征的情况下，MCI 检测的平均 F1-score 从 58.6% 提高到 71.2% (提升了 12.6%)；在语言特征的情况下，F1-score 从 62.1% 提高到 75.1% (提升了 13.0%)。此外，认知变化的预测在声学特征的情况下达到了 73.7% 的 F1-score。这些结果证实了基于 VAS 的语音会话在早期检测认知衰退方面的潜力。

Artificial Intelligence

[AI-0] How Numerical Precision Affects Mathematical Reasoning Capabilities of LLMs

[AI-1] Can MLLMs Understand the Deep Implication Behind Chinese Images?

Link: https://arxiv.org/abs/2410.13854
Authors: Chenhao Zhang, Xi Feng, Yuelin Bai, Xinrun Du, Jinchang Hou, Kaixin Deng, Guangzeng Han, Qinrui Li, Bingli Wang, Jiaheng Liu, Xingwei Qu, Yifei Zhang, Qixuan Zhao, Yiming Liang, Ziqiang Liu, Feiteng Fang, Min Yang, Wenhao Huang, Chenghua Lin, Ge Zhang, Shiwen Ni
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Chinese traditional culture
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments: 32 pages, 18 figures. Project Page: this https URL Code: this https URL Dataset: this https URL

[AI-2] Retrospective Learning from Interactions

[AI-3] Influence Functions for Scalable Data Attribution in Diffusion Models

Link: https://arxiv.org/abs/2410.13850
Authors: Bruno Mlodozeniec, Runa Eschenhagen, Juhan Bae, Alexander Immer, David Krueger, Richard Turner
Keywords: generative modelling, Diffusion models, led to significant, significant advancements, advancements in generative
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Diffusion models have led to significant advancements in generative modelling. Yet their widespread adoption poses challenges regarding data attribution and interpretability. In this paper, we aim to help address such challenges in diffusion models by developing an influence functions framework. Influence function-based data attribution methods approximate how a model’s output would have changed if some training data were removed. In supervised learning, this is usually used for predicting how the loss on a particular example would change. For diffusion models, we focus on predicting the change in the probability of generating a particular example via several proxy measurements. We show how to formulate influence functions for such quantities and how previously proposed methods can be interpreted as particular design choices in our framework. To ensure scalability of the Hessian computations in influence functions, we systematically develop K-FAC approximations based on generalised Gauss-Newton matrices specifically tailored to diffusion models. We recast previously proposed methods as specific design choices in our framework and show that our recommended method outperforms previous data attribution approaches on common evaluations, such as the Linear Data-modelling Score (LDS) or retraining without top influences, without the need for method-specific hyperparameter tuning.
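
For context, the textbook influence-function approximation that this line of work builds on can be written as below. The notation (`\ell`, `z_i`, `H`, `m`) is chosen for this note and is not taken from the paper; the paper's contribution, per the abstract, is choosing suitable measurements `m` for diffusion models and replacing the Hessian with K-FAC approximations of generalised Gauss-Newton matrices.

```latex
% Setup: \theta^\star minimises the average training loss over N examples.
\theta^\star = \arg\min_\theta \frac{1}{N} \sum_{i=1}^{N} \ell(z_i, \theta),
\qquad
H = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta^2 \, \ell(z_i, \theta^\star)

% First-order effect of removing training example z_j on the parameters:
\theta^\star_{-j} - \theta^\star \;\approx\; \frac{1}{N} \, H^{-1} \nabla_\theta \ell(z_j, \theta^\star)

% Resulting change in any differentiable measurement m(\theta), by the chain rule:
m(\theta^\star_{-j}) - m(\theta^\star) \;\approx\;
\frac{1}{N} \, \nabla_\theta m(\theta^\star)^{\top} H^{-1} \nabla_\theta \ell(z_j, \theta^\star)
```

The expensive part is the inverse-Hessian product, which is why scalable curvature approximations matter for large models.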

[AI-4] Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

[AI-5] SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction

[AI-6] Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding ICASSP2025

Link: https://arxiv.org/abs/2410.13839
Authors: Tan Dat Nguyen, Ji-Hoon Kim, Jeongsoo Choi, Shukjae Choi, Jinseok Park, Younglo Lee, Joon Son Chung
Keywords: accelerate codec-based speech, accelerate codec-based, systems with minimum, minimum sacrifice, codec-based speech synthesis
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Submitted to IEEE ICASSP 2025

Abstract:The goal of this paper is to accelerate codec-based speech synthesis systems with minimum sacrifice to speech quality. We propose an enhanced inference method that allows for flexible trade-offs between speed and quality during inference without requiring additional training. Our core idea is to predict multiple tokens per inference step of the AR module using multiple prediction heads, resulting in a linear reduction in synthesis time as the number of heads increases. Furthermore, we introduce a novel speculative decoding technique that utilises a Viterbi-based algorithm to select the optimal sequence of generated tokens at each decoding step. In our experiments, we demonstrate that the time required to predict each token is reduced by a factor of 4 to 5 compared to baseline models, with minimal quality trade-off or even improvement in terms of speech intelligibility. Audio samples are available at: this http URL.

[AI-7] ORSO: Accelerating Reward Design via Online Reward Selection and Policy Optimization

Link: https://arxiv.org/abs/2410.13837
Authors: Chen Bo Calvin Zhang,Zhang-Wei Hong,Aldo Pacchiano,Pulkit Agrawal
Keywords: reinforcement learning, hinder learning, shaping reward, critical component, component in reinforcement
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: preprint, 35 pages, 23 figures

Abstract:Reward shaping is a critical component in reinforcement learning (RL), particularly for complex tasks where sparse rewards can hinder learning. While shaping rewards have been introduced to provide additional guidance, selecting effective shaping functions remains challenging and computationally expensive. This paper introduces Online Reward Selection and Policy Optimization (ORSO), a novel approach that frames shaping reward selection as an online model selection problem. ORSO employs principled exploration strategies to automatically identify promising shaping reward functions without human intervention, balancing exploration and exploitation with provable regret guarantees. We demonstrate ORSO’s effectiveness across various continuous control tasks using the Isaac Gym simulator. Compared to traditional methods that fully evaluate each shaping reward function, ORSO significantly improves sample efficiency, reduces computational time, and consistently identifies high-quality reward functions that produce policies comparable to those generated by domain experts through hand-engineered rewards.

[AI-8] The Disparate Benefits of Deep Ensembles

Link: https://arxiv.org/abs/2410.13831
Authors: Kajetan Schweighofer,Adrian Arnaiz-Rodriguez,Sepp Hochreiter,Nuria Oliver
Keywords: Deep Neural Networks, Neural Networks, Deep Neural, Deep Ensembles, Deep
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Ensembles of Deep Neural Networks, Deep Ensembles, are widely used as a simple way to boost predictive performance. However, their impact on algorithmic fairness is not well understood yet. Algorithmic fairness investigates how a model’s performance varies across different groups, typically defined by protected attributes such as age, gender, or race. In this work, we investigate the interplay between the performance gains from Deep Ensembles and fairness. Our analysis reveals that they unevenly favor different groups in what we refer to as a disparate benefits effect. We empirically investigate this effect with Deep Ensembles applied to popular facial analysis and medical imaging datasets, where protected group attributes are given and find that it occurs for multiple established group fairness metrics, including statistical parity and equal opportunity. Furthermore, we identify the per-group difference in predictive diversity of ensemble members as the potential cause of the disparate benefits effect. Finally, we evaluate different approaches to reduce unfairness due to the disparate benefits effect. Our findings show that post-processing is an effective method to mitigate this unfairness while preserving the improved performance of Deep Ensembles.
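The two group fairness metrics named in the abstract can be sketched for the simplest case of binary predictions and a binary protected attribute. This is a minimal illustration using the common textbook definitions (absolute gap in positive prediction rate, and absolute gap in true positive rate); the paper may use different sign conventions or multi-group variants, and the toy data below is invented.

```python
def positive_rate(preds):
    """Fraction of predictions equal to 1."""
    return sum(preds) / len(preds)

def statistical_parity_difference(y_pred, group):
    """|P(y_hat=1 | group=0) - P(y_hat=1 | group=1)|."""
    g0 = [p for p, g in zip(y_pred, group) if g == 0]
    g1 = [p for p, g in zip(y_pred, group) if g == 1]
    return abs(positive_rate(g0) - positive_rate(g1))

def equal_opportunity_difference(y_true, y_pred, group):
    """Absolute gap in true positive rate between the two groups."""
    tprs = []
    for a in (0, 1):
        preds_on_positives = [
            p for t, p, g in zip(y_true, y_pred, group) if g == a and t == 1
        ]
        tprs.append(positive_rate(preds_on_positives))
    return abs(tprs[0] - tprs[1])

# Invented toy data: 4 samples per group.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]

spd = statistical_parity_difference(y_pred, group)
eod = equal_opportunity_difference(y_true, y_pred, group)
print(spd, eod)
```

Computing these quantities once for a single ensemble member and once for the full ensemble is one way to observe whether the ensemble's gains are shared evenly across groups.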

[AI-9] A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement

[AI-10] Unearthing Skill-Level Insights for Understanding Trade-Offs of Foundation Models MICRO

Link: https://arxiv.org/abs/2410.13826
Authors: Mazda Moayeri,Vidhisha Balachandran,Varun Chandrasekaran,Safoora Yousefi,Thomas Fel,Soheil Feizi,Besmira Nushi,Neel Joshi,Vibhav Vineet
Keywords: grown more complex, testing multiple skills, skills, testing multiple, multiple skills
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Code at: this http URL

Abstract:With models getting stronger, evaluations have grown more complex, testing multiple skills in one benchmark and even in the same instance at once. However, skill-wise performance is obscured when inspecting aggregate accuracy, under-utilizing the rich signal modern benchmarks contain. We propose an automatic approach to recover the underlying skills relevant for any evaluation instance, by way of inspecting model-generated rationales. After validating the relevance of rationale-parsed skills and inferring skills for 46k instances over 12 benchmarks, we observe many skills to be common across benchmarks, resulting in the curation of hundreds of skill-slices (i.e. sets of instances testing a common skill). Inspecting accuracy over these slices yields novel insights on model trade-offs: e.g., compared to GPT-4o and Claude 3.5 Sonnet, on average, Gemini 1.5 Pro is 18% more accurate in “computing molar mass”, but 19% less accurate in “applying constitutional law”, despite the overall accuracies of the three models differing by a mere 0.4%. Furthermore, we demonstrate the practical utility of our approach by showing that insights derived from skill slice analysis can generalize to held-out instances: when routing each instance to the model strongest on the relevant skills, we see a 3% accuracy improvement over our 12 dataset corpus. Our skill-slices and framework open a new avenue in model evaluation, leveraging skill-specific analyses to unlock a more granular and actionable understanding of model capabilities.
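The idea of a skill-slice can be sketched as a simple grouping step: every instance is tagged with one or more skills, and accuracy is then computed per skill and per model instead of over the whole benchmark. The record layout, model names, and data below are hypothetical; the paper infers skills from model-generated rationales, which this sketch does not attempt.

```python
from collections import defaultdict

def skill_slice_accuracy(instances):
    """Return {skill: {model: accuracy}} over instances tagged with that skill.

    Each instance is a dict with 'skills' (list of str) and
    'correct' (dict mapping model name to bool). Hypothetical layout.
    """
    hits = defaultdict(lambda: defaultdict(int))
    counts = defaultdict(int)
    for inst in instances:
        for skill in inst["skills"]:
            counts[skill] += 1
            for model, ok in inst["correct"].items():
                hits[skill][model] += int(ok)
    return {
        skill: {model: hits[skill][model] / counts[skill] for model in hits[skill]}
        for skill in counts
    }

# Invented example: three instances, two models.
instances = [
    {"skills": ["molar mass"], "correct": {"model_a": True, "model_b": False}},
    {"skills": ["molar mass", "unit conversion"],
     "correct": {"model_a": True, "model_b": True}},
    {"skills": ["constitutional law"],
     "correct": {"model_a": False, "model_b": True}},
]

acc = skill_slice_accuracy(instances)
for skill, per_model in acc.items():
    print(skill, per_model)
```

In this toy data both models answer two of three instances correctly, yet the per-skill view shows them to be strong on different skills, which is the kind of trade-off that aggregate accuracy hides.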

[AI-11] Agent Occam: A Simple Yet Strong Baseline for LLM-Based Web Agents

[AI-12] Multi-style conversion for semantic segmentation of lesions in fundus images by adversarial attacks

Link: https://arxiv.org/abs/2410.13822
Authors: Clément Playout,Renaud Duval,Marie Carole Boucher,Farida Cheriet
Keywords: global classification approach, fundus images, faces challenges, relies on fundus, challenges in achieving
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: preprint

Abstract:The diagnosis of diabetic retinopathy, which relies on fundus images, faces challenges in achieving transparency and interpretability when using a global classification approach. However, segmentation-based databases are significantly more expensive to acquire and combining them is often problematic. This paper introduces a novel method, termed adversarial style conversion, to address the lack of standardization in annotation styles across diverse databases. By training a single architecture on combined databases, the model spontaneously modifies its segmentation style depending on the input, demonstrating the ability to convert among different labeling styles. The proposed methodology adds a linear probe to detect dataset origin based on encoder features and employs adversarial attacks to condition the model’s segmentation style. Results indicate significant qualitative and quantitative improvements through dataset combination, offering avenues for improved model generalization, uncertainty estimation and continuous interpolation between annotation styles. Our approach enables training a segmentation model with diverse databases while controlling and leveraging annotation styles for improved retinopathy diagnosis.

[AI-13] Artificial Kuramoto Oscillatory Neurons

Link: https://arxiv.org/abs/2410.13821
Authors: Takeru Miyato,Sindy Löwe,Andreas Geiger,Max Welling
Keywords: abstract concepts, Artificial Kuramoto Oscillatory, form of competitive, competitive learning, compressed in order
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Code: this https URL

Abstract:It has long been known in both neuroscience and AI that “binding” between neurons leads to a form of competitive learning where representations are compressed in order to represent more abstract concepts in deeper layers of the network. More recently, it was also hypothesized that dynamic (spatiotemporal) representations play an important role in both neuroscience and AI. Building on these ideas, we introduce Artificial Kuramoto Oscillatory Neurons (AKOrN) as a dynamical alternative to threshold units, which can be combined with arbitrary connectivity designs such as fully connected, convolutional, or attentive mechanisms. Our generalized Kuramoto updates bind neurons together through their synchronization dynamics. We show that this idea provides performance improvements across a wide spectrum of tasks such as unsupervised object discovery, adversarial robustness, calibrated uncertainty quantification, and reasoning. We believe that these empirical results show the importance of rethinking our assumptions at the most basic neuronal level of neural representation, and in particular show the importance of dynamical representations.
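For readers unfamiliar with the synchronization dynamics mentioned above, the sketch below simulates the classic textbook Kuramoto model, dθ_i/dt = ω_i + (K/N) Σ_j sin(θ_j − θ_i), with a plain Euler step. It is not the generalized update used by AKOrN; the coupling strength, step size, and initial phases are arbitrary illustrative choices.

```python
import cmath
import math

def kuramoto_step(theta, omega, coupling, dt):
    """One Euler step of the classic all-to-all Kuramoto model."""
    n = len(theta)
    updated = []
    for i in range(n):
        interaction = sum(math.sin(theta[j] - theta[i]) for j in range(n))
        updated.append(theta[i] + dt * (omega[i] + coupling / n * interaction))
    return updated

def order_parameter(theta):
    """Magnitude of the mean phasor: 1.0 means fully synchronized phases."""
    return abs(sum(cmath.exp(1j * t) for t in theta) / len(theta))

# Four oscillators with identical natural frequency (illustrative values).
theta0 = [0.0, 0.5, 1.0, 1.5]
omega = [0.0, 0.0, 0.0, 0.0]

theta = list(theta0)
for _ in range(500):
    theta = kuramoto_step(theta, omega, coupling=1.0, dt=0.1)

r_start = order_parameter(theta0)
r_end = order_parameter(theta)
print(f"order parameter: {r_start:.3f} -> {r_end:.3f}")
```

The order parameter rises toward 1 as the phases pull together, which gives an intuition for how coupled oscillators can group units that belong together.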

[AI-14] Guided Reinforcement Learning for Robust Multi-Contact Loco-Manipulation

Link: https://arxiv.org/abs/2410.13817
Authors: Jean-Pierre Sleiman,Mayank Mittal,Marco Hutter
Keywords: Markov Decision Process, meticulous Markov Decision, Decision Process, Markov Decision, Reinforcement learning
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: J. P. Sleiman and M. Mittal contributed equally. Accepted for CoRL 2024 (Oral). Project website: this https URL

Abstract:Reinforcement learning (RL) often necessitates a meticulous Markov Decision Process (MDP) design tailored to each task. This work aims to address this challenge by proposing a systematic approach to behavior synthesis and control for multi-contact loco-manipulation tasks, such as navigating spring-loaded doors and manipulating heavy dishwashers. We define a task-independent MDP to train RL policies using only a single demonstration per task generated from a model-based trajectory optimizer. Our approach incorporates an adaptive phase dynamics formulation to robustly track the demonstrations while accommodating dynamic uncertainties and external disturbances. We compare our method against prior motion imitation RL works and show that the learned policies achieve higher success rates across all considered tasks. These policies learn recovery maneuvers that are not present in the demonstration, such as re-grasping objects during execution or dealing with slippages. Finally, we successfully transfer the policies to a real robot, demonstrating the practical viability of our approach.

[AI-15] A Pattern to Align Them All: Integrating Different Modalities to Define Multi-Modal Entities

Link: https://arxiv.org/abs/2410.13803
Authors: Gianluca Apriceno,Valentina Tamma,Tania Bailoni,Jacopo de Berardinis,Mauro Dragoni
Keywords: Multi-Modal Knowledge Graphs, Knowledge Graphs, foundation underpinning human, underpinning human intelligence, Knowledge Graphs extend
Categories: Artificial Intelligence (cs.AI)
Comments: 20 pages, 6 figures

Abstract:The ability to reason with and integrate different sensory inputs is the foundation underpinning human intelligence and it is the reason for the growing interest in modelling multi-modal information within Knowledge Graphs. Multi-Modal Knowledge Graphs extend traditional Knowledge Graphs by associating an entity with its possible modal representations, including text, images, audio, and videos, all of which are used to convey the semantics of the entity. Despite the increasing attention that Multi-Modal Knowledge Graphs have received, there is a lack of consensus about the definitions and modelling of modalities, whose definition is often determined by application domains. In this paper, we propose a novel ontology design pattern that captures the separation of concerns between an entity (and the information it conveys), whose semantics can have different manifestations across different media, and its realisation in terms of a physical information entity. By introducing this abstract model, we aim to facilitate the harmonisation and integration of different existing multi-modal ontologies which is crucial for many intelligent applications across different domains spanning from medicine to digital humanities.

[AI-16] Learning Graph Quantized Tokenizers for Transformers

Link: https://arxiv.org/abs/2410.13798
Authors: Limei Wang,Kaveh Hassani,Si Zhang,Dongqi Fu,Baichuan Yuan,Weilin Cong,Zhigang Hua,Hao Wu,Ning Yao,Bo Long
Keywords: architectures of Foundational, Foundational Models, Graph Neural Networks, Neural Networks, backbone architectures
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Transformers serve as the backbone architectures of Foundational Models, where a domain-specific tokenizer helps them adapt to various domains. Graph Transformers (GTs) have recently emerged as a leading model in geometric deep learning, outperforming Graph Neural Networks (GNNs) in various graph learning tasks. However, the development of tokenizers for graphs has lagged behind other modalities, with existing approaches relying on heuristics or GNNs co-trained with Transformers. To address this, we introduce GQT (Graph Quantized Tokenizer), which decouples tokenizer training from Transformer training by leveraging multi-task graph self-supervised learning, yielding robust and generalizable graph tokens. Furthermore, the GQT utilizes Residual Vector Quantization (RVQ) to learn hierarchical discrete tokens, resulting in significantly reduced memory requirements and improved generalization capabilities. By combining the GQT with token modulation, a Transformer encoder achieves state-of-the-art performance on 16 out of 18 benchmarks, including large-scale homophilic and heterophilic datasets. The code is available at: this https URL
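Residual Vector Quantization, the mechanism behind the hierarchical discrete tokens, can be sketched in a few lines: each stage picks the nearest codeword for whatever the previous stages failed to explain, so a vector becomes a short list of indices. The codebooks below are hand-written for illustration, whereas GQT learns them; the function names are hypothetical.

```python
def nearest(codebook, vec):
    """Index of the codeword with the smallest squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((c - v) ** 2 for c, v in zip(codebook[k], vec)),
    )

def rvq_encode(vec, codebooks):
    """Quantize `vec` stage by stage; return (indices, final residual)."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        k = nearest(codebook, residual)
        indices.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return indices, residual

def rvq_decode(indices, codebooks):
    """Sum the selected codewords from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for k, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[k])]
    return out

# Hand-written two-stage codebooks (coarse, then fine); purely illustrative.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],
]

indices, residual = rvq_encode([1.2, 1.0], codebooks)
reconstruction = rvq_decode(indices, codebooks)
print(indices, reconstruction)
```

Storing a few small integers per node instead of a dense embedding is where the memory saving described in the abstract would come from, under the assumption that the codebooks are shared across all nodes.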

[AI-17] Looking Inward: Language Models Can Learn About Themselves by Introspection

[AI-18] PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment

[AI-19] Optimal Quantization for Matrix Multiplication

[AI-20] Aggregation Artifacts in Subjective Tasks Collapse Large Language Models Posteriors

[AI-21] Transformer Guided Coevolution: Improved Team Formation in Multiagent Adversarial Games

Link: https://arxiv.org/abs/2410.13769
Authors: Pranav Rajbhandari,Prithviraj Dasgupta,Donald Sofge
Keywords: Masked Language Model, Language Model training, Masked Language, Language Model, adversarial game Marine
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE)

Abstract:We consider the problem of team formation within multiagent adversarial games. We propose BERTeam, a novel algorithm that uses a transformer-based deep neural network with Masked Language Model training to select the best team of players from a trained population. We integrate this with coevolutionary deep reinforcement learning, which trains a diverse set of individual players to choose teams from. We test our algorithm in the multiagent adversarial game Marine Capture-The-Flag, and we find that BERTeam learns non-trivial team compositions that perform well against unseen opponents. For this game, we find that BERTeam outperforms MCAA, an algorithm that similarly optimizes team formation.

[AI-22] Virtual Sensing for Real-Time Degradation Monitoring of Nuclear Systems: Leveraging DeepONet for Enhanced Sensing Coverage for Digital Twin-Enabling Technology

Link: https://arxiv.org/abs/2410.13762
Authors: Raisa Bentay Hossain,Farid Ahmed,Kazuma Kobayashi,Seid Koric,Diab Abueidda,Syed Bahauddin Alam
Keywords: Effective real-time monitoring, Effective real-time, technique is crucial, crucial for detecting, maintaining the structural
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:An effective real-time monitoring technique is crucial for detecting material degradation and maintaining the structural integrity of nuclear systems to ensure both safety and operational efficiency. Traditional physical sensor systems face limitations such as installation challenges, high costs, and difficulties in measuring critical parameters in hard-to-reach or harsh environments, often resulting in incomplete data coverage. Machine learning-driven virtual sensors offer a promising solution by enhancing physical sensor capabilities to monitor critical degradation indicators like pressure, velocity, and turbulence. However, conventional machine learning models struggle with real-time monitoring due to the high-dimensional nature of reactor data and the need for frequent retraining. This paper explores the use of Deep Operator Networks (DeepONet) within a digital twin (DT) framework to predict key thermal-hydraulic parameters in the hot leg of an AP-1000 Pressurized Water Reactor (PWR). In this study, DeepONet is trained with different operational conditions, which relaxes the requirement of continuous retraining, making it suitable for online and real-time prediction components for DT. Our results show that DeepONet achieves accurate predictions with low mean squared error and relative L2 error and can make predictions on unknown data 160,000 times faster than traditional finite element (FE) simulations. This speed and accuracy make DeepONet a powerful tool for tracking conditions that contribute to material degradation in real-time, enhancing reactor safety and longevity.

[AI-23] MobA: A Two-Level Agent System for Efficient Mobile Task Automation

Link: https://arxiv.org/abs/2410.13757
Authors: Zichen Zhu,Hao Tang,Yansi Li,Kunyao Lan,Yixuan Jiang,Hao Zhou,Yixiao Wang,Situo Zhang,Liangtai Sun,Lu Chen,Kai Yu
Keywords: diverse interfaces due, Current mobile assistants, Current mobile, decision-making abilities, limited by dependence
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 27 pages, 6 figures, and 5 tables. We will release our source code in a few days

[AI-24] CLIMB: Language-Guided Continual Learning for Task Planning with Iterative Model Building

Link: https://arxiv.org/abs/2410.13756
Authors: Walker Byrnes,Miroslav Bogdanovic,Avi Balakirsky,Stephen Balakirsky,Animesh Garg
Keywords: Intelligent and reliable, descriptive domain representation, reliable task planning, generalized robotics, requiring a descriptive
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 pages, 6 figures

Abstract:Intelligent and reliable task planning is a core capability for generalized robotics, requiring a descriptive domain representation that sufficiently models all object and state information for the scene. We present CLIMB, a continual learning framework for robot task planning that leverages foundation models and execution feedback to guide domain model construction. CLIMB can build a model from a natural language description, learn non-obvious predicates while solving tasks, and store that information for future problems. We demonstrate the ability of CLIMB to improve performance in common planning environments compared to baseline methods. We also develop the BlocksWorld++ domain, a simulated environment with an easily usable real counterpart, together with a curriculum of tasks with progressing difficulty for evaluating continual learning. Additional details and demonstrations for this system can be found at this https URL.

[AI-25] MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures

Link: https://arxiv.org/abs/2410.13754
Authors: Jinjie Ni,Yifan Song,Deepanway Ghosal,Bo Li,David Junhao Zhang,Xiang Yue,Fuzhao Xue,Zian Zheng,Kaichen Zhang,Mahir Shah,Kabir Jain,Yang You,Michael Shieh
Keywords: generating diverse modalities, necessitating reliable evaluations, Perceiving and generating, necessitating reliable, generating diverse
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)

Abstract:Perceiving and generating diverse modalities are crucial for AI models to effectively learn from and engage with real-world signals, necessitating reliable evaluations for their development. We identify two major issues in current evaluations: (1) inconsistent standards, shaped by different communities with varying protocols and maturity levels; and (2) significant query, grading, and generalization biases. To address these, we introduce MixEval-X, the first any-to-any real-world benchmark designed to optimize and standardize evaluations across input and output modalities. We propose multi-modal benchmark mixture and adaptation-rectification pipelines to reconstruct real-world task distributions, ensuring evaluations generalize effectively to real-world use cases. Extensive meta-evaluations show our approach effectively aligns benchmark samples with real-world task distributions and the model rankings correlate strongly with those of crowd-sourced real-world evaluations (up to 0.98). We provide comprehensive leaderboards to rerank existing models and organizations and offer insights to enhance understanding of multi-modal evaluations and inform future research.

[AI-26] Privacy-Preserving Decentralized AI with Confidential Computing

Link: https://arxiv.org/abs/2410.13752
Authors: Dayeol Lee,Jorge Antonio,Hisham Khan
Keywords: decentralized Artificial Intelligence, Artificial Intelligence, Atoma Network, paper addresses privacy, Confidential Computing
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:This paper addresses privacy protection in decentralized Artificial Intelligence (AI) using Confidential Computing (CC) within the Atoma Network, a decentralized AI platform designed for the Web3 domain. Decentralized AI distributes AI services among multiple entities without centralized oversight, fostering transparency and robustness. However, this structure introduces significant privacy challenges, as sensitive assets such as proprietary models and personal data may be exposed to untrusted participants. Cryptography-based privacy protection techniques such as zero-knowledge machine learning (zkML) suffer from prohibitive computational overhead. To address this limitation, we propose leveraging Confidential Computing (CC). Confidential Computing leverages hardware-based Trusted Execution Environments (TEEs) to provide isolation for processing sensitive data, ensuring that both model parameters and user data remain secure, even in decentralized, potentially untrusted environments. While TEEs face a few limitations, we believe they can bridge the privacy gap in decentralized AI. We explore how we can integrate TEEs into Atoma’s decentralized framework.

[AI-27] LLM-Human Pipeline for Cultural Context Grounding of Conversations

[AI-28] DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation

Link: https://arxiv.org/abs/2410.13726
Authors: Hanbo Cheng,Limin Lin,Chenyu Liu,Pengcheng Xia,Pengfei Hu,Jiefeng Ma,Jun Du,Jia Pan
Keywords: speech audio clip, realistic talking head, Talking head, Talking head generation, audio clip
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Talking head generation intends to produce vivid and realistic talking head videos from a single portrait and speech audio clip. Although significant progress has been made in diffusion-based talking head generation, almost all methods rely on autoregressive strategies, which suffer from limited context utilization beyond the current generation step, error accumulation, and slower generation speed. To address these challenges, we present DAWN (Dynamic frame Avatar With Non-autoregressive diffusion), a framework that enables all-at-once generation of dynamic-length video sequences. Specifically, it consists of two main components: (1) audio-driven holistic facial dynamics generation in the latent motion space, and (2) audio-driven head pose and blink generation. Extensive experiments demonstrate that our method generates authentic and vivid videos with precise lip motions, and natural pose/blink movements. Additionally, with a high generation speed, DAWN possesses strong extrapolation capabilities, ensuring the stable production of high-quality long videos. These results highlight the considerable promise and potential impact of DAWN in the field of talking head video generation. Furthermore, we hope that DAWN sparks further exploration of non-autoregressive approaches in diffusion models. Our code will be publicly available at this https URL.

[AI-29] Persistent Pre-Training Poisoning of LLMs

Link: https://arxiv.org/abs/2410.13722
Authors: Yiming Zhang,Javier Rando,Ivan Evtimov,Jianfeng Chi,Eric Michael Smith,Nicholas Carlini,Florian Tramèr,Daphne Ippolito
Keywords: Large language models, uncurated text datasets, text datasets consisting, Large language, language models
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:Large language models are pre-trained on uncurated text datasets consisting of trillions of tokens scraped from the Web. Prior work has shown that: (1) web-scraped pre-training datasets can be practically poisoned by malicious actors; and (2) adversaries can compromise language models after poisoning fine-tuning datasets. Our work evaluates for the first time whether language models can also be compromised during pre-training, with a focus on the persistence of pre-training attacks after models are fine-tuned as helpful and harmless chatbots (i.e., after SFT and DPO). We pre-train a series of LLMs from scratch to measure the impact of a potential poisoning adversary under four different attack objectives (denial-of-service, belief manipulation, jailbreaking, and prompt stealing), and across a wide range of model sizes (from 600M to 7B). Our main result is that poisoning only 0.1% of a model’s pre-training dataset is sufficient for three out of four attacks to measurably persist through post-training. Moreover, simple attacks like denial-of-service persist through post-training with a poisoning rate of only 0.001%.

[AI-30] Movie Gen: A Cast of Media Foundation Models

Link: https://arxiv.org/abs/2410.13720
Authors: Adam Polyak,Amit Zohar,Andrew Brown,Andros Tjandra,Animesh Sinha,Ann Lee,Apoorv Vyas,Bowen Shi,Chih-Yao Ma,Ching-Yao Chuang,David Yan,Dhruv Choudhary,Dingkang Wang,Geet Sethi,Guan Pang,Haoyu Ma,Ishan Misra,Ji Hou,Jialiang Wang,Kiran Jagadeesh,Kunpeng Li,Luxin Zhang,Mannat Singh,Mary Williamson,Matt Le,Matthew Yu,Mitesh Kumar Singh,Peizhao Zhang,Peter Vajda,Quentin Duval,Rohit Girdhar,Roshan Sumbaly,Sai Saketh Rambhatla,Sam Tsai,Samaneh Azadi,Samyak Datta,Sanyuan Chen,Sean Bell,Sharadh Ramaswamy,Shelly Sheynin,Siddharth Bhattacharya,Simran Motwani,Tao Xu,Tianhe Li,Tingbo Hou,Wei-Ning Hsu,Xi Yin,Xiaoliang Dai,Yaniv Taigman,Yaqiao Luo,Yen-Cheng Liu,Yi-Chiao Wu,Yue Zhao,Yuval Kirstain,Zecheng He,Zijian He,Albert Pumarola,Ali Thabet,Artsiom Sanakoyeu,Arun Mallya,Baishan Guo,Boris Araya,Breena Kerr,Carleigh Wood,Ce Liu,Cen Peng,Dimitry Vengertsev,Edgar Schonfeld,Elliot Blanchard,Felix Juefei-Xu,Fraylie Nord,Jeff Liang,John Hoffman,Jonas Kohler,Kaolin Fire,Karthik Sivakumar,Lawrence Chen,Licheng Yu,Luya Gao,Markos Georgopoulos,Rashel Moritz,Sara K. Sampson,Shikai Li,Simone Parmeggiani,Steve Fine,Tara Fowler,Vladan Petrovic,Yuming Du
Keywords: present Movie Gen, Movie Gen, present Movie, generates high-quality, synchronized audio
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Abstract:We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio. We also show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user’s image. Our models set a new state-of-the-art on multiple tasks: text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation. Our largest video generation model is a 30B parameter transformer trained with a maximum context length of 73K video tokens, corresponding to a generated video of 16 seconds at 16 frames-per-second. We show multiple technical innovations and simplifications on the architecture, latent spaces, training objectives and recipes, data curation, evaluation protocols, parallelization techniques, and inference optimizations that allow us to reap the benefits of scaling pre-training data, model size, and training compute for training large scale media generation models. We hope this paper helps the research community to accelerate progress and innovation in media generation models. All videos from this paper are available at this https URL.

[AI-31] MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems

[AI-32] On the Role of Attention Heads in Large Language Model Safety

[AI-33] Disjointness Violations in Wikidata

Link: https://arxiv.org/abs/2410.13707
Authors: Ege Atacan Doğan,Peter F. Patel-Schneider
Keywords: important constraint checks, correct incorrect statements, constraint checks, community-managed knowledge base, knowledge base
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Sixth International Knowledge Graph and Semantic Web Conference

Abstract:Disjointness checks are among the most important constraint checks in a knowledge base and can be used to help detect and correct incorrect statements and internal contradictions. Wikidata is a very large, community-managed knowledge base. Because of both its size and construction, Wikidata contains many incorrect statements and internal contradictions. We analyze the current modeling of disjointness on Wikidata, identify patterns that cause these disjointness violations and categorize them. We use SPARQL queries to identify each “culprit” causing a disjointness violation and lay out formulas to identify and fix conflicting information. We finally discuss how disjointness information could be better modeled and expanded in Wikidata in the future.

[AI-34] Jailbreaking LLM-Controlled Robots

Link: https://arxiv.org/abs/2410.13691
Authors: Alexander Robey,Zachary Ravichandran,Vijay Kumar,Hamed Hassani,George J. Pappas
Keywords: large language models, enabling contextual reasoning, intuitive human-robot interaction, language models, varied as manipulation
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:The recent introduction of large language models (LLMs) has revolutionized the field of robotics by enabling contextual reasoning and intuitive human-robot interaction in domains as varied as manipulation, locomotion, and self-driving vehicles. When viewed as a stand-alone technology, LLMs are known to be vulnerable to jailbreaking attacks, wherein malicious prompters elicit harmful text by bypassing LLM safety guardrails. To assess the risks of deploying LLMs in robotics, in this paper, we introduce RoboPAIR, the first algorithm designed to jailbreak LLM-controlled robots. Unlike existing, textual attacks on LLM chatbots, RoboPAIR elicits harmful physical actions from LLM-controlled robots, a phenomenon we experimentally demonstrate in three scenarios: (i) a white-box setting, wherein the attacker has full access to the NVIDIA Dolphins self-driving LLM, (ii) a gray-box setting, wherein the attacker has partial access to a Clearpath Robotics Jackal UGV robot equipped with a GPT-4o planner, and (iii) a black-box setting, wherein the attacker has only query access to the GPT-3.5-integrated Unitree Robotics Go2 robot dog. In each scenario and across three new datasets of harmful robotic actions, we demonstrate that RoboPAIR, as well as several static baselines, finds jailbreaks quickly and effectively, often achieving 100% attack success rates. Our results reveal, for the first time, that the risks of jailbroken LLMs extend far beyond text generation, given the distinct possibility that jailbroken robots could cause physical damage in the real world. Indeed, our results on the Unitree Go2 represent the first successful jailbreak of a deployed commercial robotic system. Addressing this emerging vulnerability is critical for ensuring the safe deployment of LLMs in robotics. Additional media is available at: this https URL

[AI-35] Diffusion Curriculum: Synthetic-to-Real Generative Curriculum Learning via Image-Guided Diffusion

Link: https://arxiv.org/abs/2410.13674
Authors: Yijun Liang,Shweta Bhardwaj,Tianyi Zhou
Keywords: posed significant challenges, deep neural networks, training deep neural, data, networks in practice
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Low-quality or scarce data has posed significant challenges for training deep neural networks in practice. While classical data augmentation cannot contribute very different new data, diffusion models open up a new door to build self-evolving AI by generating high-quality and diverse synthetic data through text-guided prompts. However, text-only guidance cannot control synthetic images’ proximity to the original images, resulting in out-of-distribution data detrimental to the model performance. To overcome the limitation, we study image guidance to achieve a spectrum of interpolations between synthetic and real images. With stronger image guidance, the generated images are similar to the training data but hard to learn, whereas with weaker image guidance, the synthetic images are easier for the model but contribute to a larger distribution gap with the original data. The generated full spectrum of data enables us to build a novel “Diffusion Curriculum (DisCL)”. DisCL adjusts the image guidance level of image synthesis for each training stage: it identifies and focuses on hard samples for the model and assesses the most effective guidance level of synthetic images to improve hard data learning. We apply DisCL to two challenging tasks: long-tail (LT) classification and learning from low-quality data. It focuses on high-quality, lower-guidance images to learn prototypical features as a warm-up for learning higher-guidance images that might be weak on diversity or quality. Extensive experiments showcase a gain of 2.7% and 2.1% in OOD and ID macro-accuracy when applying DisCL to the iWildCam dataset. On ImageNet-LT, DisCL improves the base model’s tail-class accuracy from 4.4% to 23.64% and leads to a 4.02% improvement in all-class accuracy.

[AI-36] A new approach for fine-tuning sentence transformers for intent classification and out-of-scope detection tasks

Link: https://arxiv.org/abs/2410.13649
Authors: Tianyi Zhang,Atta Norouzian,Aanchan Mohan,Frederick Ducatelle
Keywords: redirect user queries, virtual assistant, important to reject, reject or redirect, redirect user
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Appearing at Empirical Methods in Natural Language Processing 2025 - Industry Track

[AI-37] SimpleToM: Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs

[AI-38] Fine-Tuning Discrete Diffusion Models via Reward Optimization with Applications to DNA and Protein Design

Link: https://arxiv.org/abs/2410.13643
Authors: Chenyu Wang,Masatoshi Uehara,Yichun He,Amy Wang,Tommaso Biancalani,Avantika Lal,Tommi Jaakkola,Sergey Levine,Hanchen Wang,Aviv Regev
Keywords: strong empirical performance, Recent studies, diffusion models, biological sequence generation, models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Recent studies have demonstrated the strong empirical performance of diffusion models on discrete sequences across domains from natural language to biological sequence generation. For example, in the protein inverse folding task, conditional diffusion models have achieved impressive results in generating natural-like sequences that fold back into the original structure. However, practical design tasks often require not only modeling a conditional distribution but also optimizing specific task objectives. For instance, we may prefer protein sequences with high stability. To address this, we consider the scenario where we have pre-trained discrete diffusion models that can generate natural-like sequences, as well as reward models that map sequences to task objectives. We then formulate the reward maximization problem within discrete diffusion models, analogous to reinforcement learning (RL), while minimizing the KL divergence against pretrained diffusion models to preserve naturalness. To solve this RL problem, we propose a novel algorithm, DRAKES, that enables direct backpropagation of rewards through entire trajectories generated by diffusion models, by making the originally non-differentiable trajectories differentiable using the Gumbel-Softmax trick. Our theoretical analysis indicates that our approach can generate sequences that are both natural-like and yield high rewards. While similar tasks have been recently explored in diffusion models for continuous domains, our work addresses unique algorithmic and theoretical challenges specific to discrete diffusion models, which arise from their foundation in continuous-time Markov chains rather than Brownian motion. Finally, we demonstrate the effectiveness of DRAKES in generating DNA and protein sequences that optimize enhancer activity and protein stability, respectively, important tasks for gene therapies and protein-based therapeutics.
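
The Gumbel-Softmax trick named in this abstract can be illustrated with a minimal NumPy sketch. This is not the DRAKES algorithm itself, only the standard relaxation it builds on; the function name `gumbel_softmax`, the temperature value, and the example logits are assumptions made for illustration.

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Return a relaxed (soft) categorical sample and the perturbed logits."""
    # Gumbel(0, 1) noise via inverse transform sampling of uniform draws.
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))
    perturbed = logits + g
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    z = (perturbed - perturbed.max(axis=-1, keepdims=True)) / tau
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True), perturbed

rng = np.random.default_rng(0)
logits = np.log(np.array([0.1, 0.6, 0.3]))  # hypothetical token distribution
soft, perturbed = gumbel_softmax(logits, tau=0.1, rng=rng)
```

As the temperature `tau` shrinks, the soft sample approaches the one-hot vector at the argmax of the perturbed logits, while remaining a smooth function of `logits`, which is what allows gradients to flow through sampled sequences.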

[AI-39] Latent Space Chain-of-Embedding Enables Output-free LLM Self-Evaluation

[AI-40] Scaling Wearable Foundation Models

Link: https://arxiv.org/abs/2410.13638
Authors: Girish Narayanswamy,Xin Liu,Kumar Ayush,Yuzhe Yang,Xuhai Xu,Shun Liao,Jake Garrison,Shyam Tailor,Jake Sunshine,Yun Liu,Tim Althoff,Shrikanth Narayanan,Pushmeet Kohli,Jiening Zhan,Mark Malhotra,Shwetak Patel,Samy Abdel-Ghaffar,Daniel McDuff
Keywords: health tracking features, Wearable sensors, tracking features, variety of health, health tracking
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Abstract:Wearable sensors have become ubiquitous thanks to a variety of health tracking features. The resulting continuous and longitudinal measurements from everyday life generate large volumes of data; however, making sense of these observations for scientific and actionable insights is non-trivial. Inspired by the empirical success of generative modeling, where large neural networks learn powerful representations from vast amounts of text, image, video, or audio data, we investigate the scaling properties of sensor foundation models across compute, data, and model size. Using a dataset of up to 40 million hours of in-situ heart rate, heart rate variability, electrodermal activity, accelerometer, skin temperature, and altimeter per-minute data from over 165,000 people, we create LSM, a multimodal foundation model built on the largest wearable-signals dataset with the most extensive range of sensor modalities to date. Our results establish the scaling laws of LSM for tasks such as imputation, interpolation and extrapolation, both across time and sensor modalities. Moreover, we highlight how LSM enables sample-efficient downstream learning for tasks like exercise and activity recognition.

[AI-41] Normalizing self-supervised learning for provably reliable Change Point Detection

Link: https://arxiv.org/abs/2410.13637
Authors: Alexandra Bazarova,Evgenia Romanenkova,Alexey Zaytsev
Keywords: Change point detection, identify abrupt shifts, Change point, input data streams, point detection
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Change point detection (CPD) methods aim to identify abrupt shifts in the distribution of input data streams. Accurate estimators for this task are crucial across various real-world scenarios. Yet, traditional unsupervised CPD techniques face significant limitations, often relying on strong assumptions or suffering from low expressive power due to inherent model simplicity. In contrast, representation learning methods overcome these drawbacks by offering flexibility and the ability to capture the full complexity of the data without imposing restrictive assumptions. However, these approaches are still emerging in the CPD field and lack robust theoretical foundations to ensure their reliability. Our work addresses this gap by integrating the expressive power of representation learning with the groundedness of traditional CPD techniques. We adopt spectral normalization (SN) for deep representation learning in CPD tasks and prove that the embeddings after SN are highly informative for CPD. Our method significantly outperforms current state-of-the-art methods during the comprehensive evaluation via three standard CPD datasets.
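
Spectral normalization, as mentioned in this abstract, rescales a weight matrix by its largest singular value. A minimal sketch using power iteration follows; the function name `spectral_normalize`, the iteration count, and the example matrix are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def spectral_normalize(w, n_iter=50, seed=0):
    """Divide w by an estimate of its largest singular value."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(w.shape[0])
    u /= np.linalg.norm(u)
    # Power iteration: alternate between left and right singular vector estimates.
    for _ in range(n_iter):
        v = w.T @ u
        v /= np.linalg.norm(v)
        u = w @ v
        u /= np.linalg.norm(u)
    sigma = u @ w @ v  # estimate of the spectral norm
    return w / sigma, sigma

w = np.array([[3.0, 0.0], [0.0, 1.0], [0.0, 0.0]])  # hypothetical layer weights
w_sn, sigma = spectral_normalize(w)
```

After normalization the layer is approximately 1-Lipschitz with respect to the Euclidean norm, since the largest singular value of `w_sn` is close to 1.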

[AI-42] Spatiotemporal Object Detection for Improved Aerial Vehicle Detection in Traffic Monitoring

Link: https://arxiv.org/abs/2410.13616
Authors: Kristina Telegraph,Christos Kyrkou
Keywords: work presents advancements, Vehicle Detection Dataset, multi-class vehicle detection, vehicle detection, Spatio-Temporal Vehicle Detection
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 13 pages

Abstract:This work presents advancements in multi-class vehicle detection using UAV cameras through the development of spatiotemporal object detection models. The study introduces a Spatio-Temporal Vehicle Detection Dataset (STVD) containing 6,600 annotated sequential frame images captured by UAVs, enabling comprehensive training and evaluation of algorithms for holistic spatiotemporal perception. A YOLO-based object detection algorithm is enhanced to incorporate temporal dynamics, resulting in improved performance over single frame models. The integration of attention mechanisms into spatiotemporal models is shown to further enhance performance. Experimental validation demonstrates significant progress, with the best spatiotemporal model exhibiting a 16.22% improvement over single frame models, while it is demonstrated that attention mechanisms hold the potential for additional performance gains.

[AI-43] H2OVL-Mississippi Vision Language Models Technical Report

[AI-44] MeNTi: Bridging Medical Calculator and LLM Agent with Nested Tool Calling

[AI-45] Large Language Models as Narrative-Driven Recommenders

[AI-46] Text-Guided Multi-Property Molecular Optimization with a Diffusion Language Model

Link: https://arxiv.org/abs/2410.13597
Authors: Yida Xiong,Kun Li,Weiwei Liu,Jia Wu,Bo Du,Shirui Pan,Wenbin Hu
Keywords: meet practical industrial, task-oriented generated molecules, practical industrial requirements, crucial stage, stage in drug
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Molecular optimization (MO) is a crucial stage in drug discovery in which task-oriented generated molecules are optimized to meet practical industrial requirements. Existing mainstream MO approaches primarily utilize external property predictors to guide iterative property optimization. However, learning all molecular samples in the vast chemical space is unrealistic for predictors. As a result, errors and noise are inevitably introduced during property prediction due to the nature of approximation. This leads to discrepancy accumulation, generalization reduction and suboptimal molecular candidates. In this paper, we propose a text-guided multi-property molecular optimization method utilizing a transformer-based diffusion language model (TransDLM). TransDLM leverages standardized chemical nomenclature as semantic representations of molecules and implicitly embeds property requirements into textual descriptions, thereby preventing error propagation during the diffusion process. Guided by physically and chemically detailed textual descriptions, TransDLM samples and optimizes encoded source molecules, retaining core scaffolds of source molecules and ensuring structural similarities. Moreover, TransDLM enables simultaneous sampling of multiple molecules, making it ideal for scalable, efficient large-scale optimization through distributed computation on web platforms. Furthermore, our approach surpasses state-of-the-art methods in optimizing molecular structural similarity and enhancing chemical properties on the benchmark dataset. The code is available at: this https URL.

[AI-47] CCUP: A Controllable Synthetic Data Generation Pipeline for Pretraining Cloth-Changing Person Re-Identification Models

Link: https://arxiv.org/abs/2410.13567
Authors: Yujian Zhao,Chengru Wu,Yinong Xu,Xuanzheng Du,Ruiyu Li,Guanglin Niu
Keywords: Long-Term Person Re-Identification, garnered significant attention, challenging research topic, recently garnered significant, Cloth-changing person re-identification
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Cloth-changing person re-identification (CC-ReID), also known as Long-Term Person Re-Identification (LT-ReID), is a critical and challenging research topic in computer vision that has recently garnered significant attention. However, due to the high cost of constructing CC-ReID data, the existing data-driven models are hard to train efficiently on limited data, causing overfitting. To address this challenge, we propose a low-cost and efficient pipeline for generating controllable and high-quality synthetic data simulating the surveillance of real scenarios specific to the CC-ReID task. In particular, we construct a new self-annotated CC-ReID dataset named Cloth-Changing Unreal Person (CCUP), containing 6,000 IDs, 1,179,976 images, 100 cameras, and 26.5 outfits per individual. Based on this large-scale dataset, we introduce an effective and scalable pretrain-finetune framework for enhancing the generalization capabilities of the traditional CC-ReID models. Extensive experiments demonstrate that two typical models, namely TransReID and FIRe^2, when integrated into our framework, outperform other state-of-the-art models after pretraining on CCUP and finetuning on benchmarks such as PRCC, VC-Clothes and NKUP. CCUP is available at: this https URL.

[AI-48] Integrating Temporal Representations for Dynamic Memory Retrieval and Management in Large Language Models

[AI-49] Can Medical Vision-Language Pre-training Succeed with Purely Synthetic Data?

Link: https://arxiv.org/abs/2410.13523
Authors: Che Liu,Zhongwei Wan,Haozhe Wang,Yinda Chen,Talha Qaiser,Chen Jin,Fariba Yousefi,Nikolay Burlutskiy,Rossella Arcucci
Keywords: Medical Vision-Language Pre-training, Vision-Language Pre-training, medical image understanding, made significant progress, Large Language Models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Under Review

Abstract:Medical Vision-Language Pre-training (MedVLP) has made significant progress in enabling zero-shot tasks for medical image understanding. However, training MedVLP models typically requires large-scale datasets with paired, high-quality image-text data, which are scarce in the medical domain. Recent advancements in Large Language Models (LLMs) and diffusion models have made it possible to generate large-scale synthetic image-text pairs. This raises the question: Can MedVLP succeed using purely synthetic data? To address this, we use off-the-shelf generative models to create synthetic radiology reports and paired Chest X-ray (CXR) images, and propose an automated pipeline to build a diverse, high-quality synthetic dataset, enabling a rigorous study that isolates model and training settings, focusing entirely from the data perspective. Our results show that MedVLP models trained exclusively on synthetic data outperform those trained on real data by 3.8% in averaged AUC on zero-shot classification. Moreover, using a combination of synthetic and real data leads to a further improvement of 9.07%. Additionally, MedVLP models trained on synthetic or mixed data consistently outperform those trained on real data in zero-shot grounding, as well as in fine-tuned classification and segmentation tasks. Our analysis suggests MedVLP trained on well-designed synthetic data can outperform models trained on real datasets, which may be limited by low-quality samples and long-tailed distributions.

[AI-50] Bias in the Mirror: Are LLMs opinions robust to their own adversarial attacks?

[AI-51] MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs

[AI-52] Enhancing Text Generation in Joint NLG/NLU Learning Through Curriculum Learning, Semi-Supervised Training, and Advanced Optimization Techniques

[AI-53] Seeing Through VisualBERT: A Causal Adventure on Memetic Landscapes EMNLP

[AI-54] Breaking the Manual Annotation Bottleneck: Creating a Comprehensive Legal Case Criticality Dataset through Semi-Automated Labeling

[AI-55] Unlocking Legal Knowledge: A Multilingual Dataset for Judicial Summarization in Switzerland

[AI-56] Parameter-efficient Adaptation of Multilingual Multimodal Models for Low-resource ASR

[AI-57] Instruction-Driven Game Engine: A Poker Case Study EMNLP2024

Link: https://arxiv.org/abs/2410.13441
Authors: Hongqiu Wu,Xingyuan Liu,Yan Wang,Hai Zhao
Keywords: Instruction-Driven Game Engine, generate game-play processes, follow free-form game, free-form game descriptions, democratize game development
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: EMNLP 2024 Demo. arXiv admin note: substantial text overlap with arXiv:2404.00276

Abstract:The Instruction-Driven Game Engine (IDGE) project aims to democratize game development by enabling a large language model (LLM) to follow free-form game descriptions and generate game-play processes. The IDGE allows users to create games simply by natural language instructions, which significantly lowers the barrier for game development. We approach the learning process for IDGEs as a Next State Prediction task, wherein the model autoregressively predicts the game states given player actions. The computation of game states must be precise; otherwise, slight errors could corrupt the game-play experience. This is challenging because of the gap between stability and diversity. To address this, we train the IDGE in a curriculum manner that progressively increases its exposure to complex scenarios. Our initial progress lies in developing an IDGE for Poker, which not only supports a wide range of poker variants but also allows for highly individualized new poker games through natural language inputs. This work lays the groundwork for future advancements in transforming how games are created and played.

[AI-58] Solving Prior Distribution Mismatch in Diffusion Models via Optimal Transport

Link: https://arxiv.org/abs/2410.13431
Authors: Zhanpeng Wang,Shenghao Li,Chen Wang,Shuting Cao,Na Lei,Zhongxuan Luo
Keywords: surrounding diffusion models, knowledge surrounding diffusion, recent years, grown significantly, knowledge surrounding
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In recent years, the knowledge surrounding diffusion models (DMs) has grown significantly, though several theoretical gaps remain. Particularly noteworthy is prior error, defined as the discrepancy between the termination distribution of the forward process and the initial distribution of the reverse process. To address these deficiencies, this paper explores the deeper relationship between optimal transport (OT) theory and DMs with discrete initial distribution. Specifically, we demonstrate that the two stages of DMs fundamentally involve computing time-dependent OT. However, unavoidable prior error results in deviation during the reverse process under quadratic transport cost. By proving that as the diffusion termination time increases, the probability flow exponentially converges to the gradient of the solution to the classical Monge-Ampère equation, we establish a vital link between these fields. Therefore, static OT emerges as the most intrinsic single-step method for bridging this theoretical potential gap. Additionally, we apply these insights to accelerate sampling in both unconditional and conditional generation scenarios. Experimental results across multiple image datasets validate the effectiveness of our approach.

[AI-59] Shavette: Low Power Neural Network Acceleration via Algorithm-level Error Detection and Undervolting

Link: https://arxiv.org/abs/2410.13415
Authors: Mikael Rinkinen,Lauri Koskinen,Olli Silven,Mehdi Safarpour
Keywords: Reduced voltage operation, enabling reduced voltage, Deep Neural Network, Reduced voltage, substantial energy efficiency
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)

Abstract:Reduced voltage operation is an effective technique for substantial energy efficiency improvement in digital circuits. This brief introduces a simple approach for enabling reduced voltage operation of Deep Neural Network (DNN) accelerators by mere software modifications. Conventional approaches for enabling reduced voltage operation, e.g., Timing Error Detection (TED) systems, incur significant development costs and overheads, while not being applicable to off-the-shelf components. In contrast, the solution proposed in this paper relies on algorithm-based error detection, and hence is implemented with low development costs, does not require any circuit modifications, and is even applicable to commodity devices. By showcasing the solution through experiments on popular DNNs, i.e., LeNet and VGG16, on a GPU platform, we demonstrate 18% to 25% energy saving with no accuracy loss of the models and negligible throughput compromise (3.9%), considering the overheads from integration of the error detection schemes into the DNN. The integration of the presented algorithmic solution into the design is simpler than conventional TED-based techniques, which require extensive circuit-level modifications, cell library characterizations or special support from the design tools.
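
One classic form of algorithm-based error detection is a checksum test on matrix multiplication, the dominant operation in DNN layers. The sketch below shows that general idea only; whether the paper uses this exact scheme is an assumption, and the function name `checksum_ok`, the tolerance, and the injected fault are hypothetical.

```python
import numpy as np

def checksum_ok(a, b, c, rtol=1e-6):
    """Check C = A @ B using a column checksum, without recomputing the product."""
    # Summing the rows of C must equal (column sums of A) @ B.
    expected = a.sum(axis=0) @ b
    observed = c.sum(axis=0)
    return bool(np.allclose(observed, expected, rtol=rtol))

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 5))
b = rng.standard_normal((5, 3))
c = a @ b
clean = checksum_ok(a, b, c)

c_faulty = c.copy()
c_faulty[1, 2] += 1.0  # simulate a single corrupted output element
faulty = checksum_ok(a, b, c_faulty)
```

The check costs one extra vector-matrix product per multiplication, which is why such schemes can be added purely in software.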

[AI-60] Think Thrice Before You Act: Progressive Thought Refinement in Large Language Models

[AI-61] Attr-Int: A Simple and Effective Entity Alignment Framework for Heterogeneous Knowledge Graphs

[AI-62] MoR: Mixture of Ranks for Low-Rank Adaptation Tuning

[AI-63] Context-aware adaptive personalised recommendation: a meta-hybrid

Link: https://arxiv.org/abs/2410.13374
Authors: Peter Tibensky,Michal Kompan
Keywords: reducing the problem, wide scale, scale of e-commerce, e-commerce systems, information overload
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Recommenders are deployed across a wide range of e-commerce systems, reducing the problem of information overload. The most common approach is to choose a single recommender that the system uses to make predictions. However, users differ from each other; thus, a one-size-fits-all approach seems to be sub-optimal. In this paper, we propose a meta-hybrid recommender that uses machine learning to predict an optimal algorithm. In this way, the best-performing recommender is used for each specific session and user. This selection depends on contextual and preferential information collected about the user. We use standard MovieLens and The Movie DB datasets for offline evaluation. We show that based on the proposed model, it is possible to predict which recommender will provide the most precise recommendations to a user. The theoretical performance of our meta-hybrid outperforms separate approaches by 20-50% in normalized Discounted Gain and Root Mean Square Error metrics. However, it is hard to obtain the optimal performance based on widely-used standard information stored about users.
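
The two metrics named in this abstract, and the idea of picking the best recommender per user, can be sketched as follows. This assumes "normalized Discounted Gain" refers to the usual nDCG definition; the recommender names and relevance lists are hypothetical, and the selection shown is an oracle choice rather than the paper's learned predictor.

```python
import math

def rmse(predicted, actual):
    """Root mean square error between predicted and actual ratings."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def ndcg(relevances):
    """nDCG of a ranked list of graded relevances (ideal ordering scores 1.0)."""
    def dcg(rels):
        return sum(r / math.log2(rank + 2) for rank, r in enumerate(rels))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical rankings produced for one user by two candidate recommenders.
scores = {
    "collaborative": ndcg([3, 2, 0, 1]),
    "content_based": ndcg([0, 1, 2, 3]),
}
best = max(scores, key=scores.get)  # oracle choice for this user
```

A meta-hybrid in the sense of the abstract would replace the oracle `max` with a model that predicts `best` from contextual and preference features.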

[AI-64] MagicTailor: Component-Controllable Personalization in Text-to-Image Diffusion Models ICTAI

Link: https://arxiv.org/abs/2410.13370
Authors: Donghao Zhou,Jiancheng Huang,Jinbin Bai,Jiaze Wang,Hao Chen,Guangyong Chen,Xiaowei Hu,Pheng-Ann Heng
Keywords: Recent advancements, text prompts, enabled the creation, creation of high-quality, struggle to generate
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL

Abstract:Recent advancements in text-to-image (T2I) diffusion models have enabled the creation of high-quality images from text prompts, but they still struggle to generate images with precise control over specific visual concepts. Existing approaches can replicate a given concept by learning from reference images, yet they lack the flexibility for fine-grained customization of the individual component within the concept. In this paper, we introduce component-controllable personalization, a novel task that pushes the boundaries of T2I models by allowing users to reconfigure specific components when personalizing visual concepts. This task is particularly challenging due to two primary obstacles: semantic pollution, where unwanted visual elements corrupt the personalized concept, and semantic imbalance, which causes disproportionate learning of the concept and component. To overcome these challenges, we design MagicTailor, an innovative framework that leverages Dynamic Masked Degradation (DM-Deg) to dynamically perturb undesired visual semantics and Dual-Stream Balancing (DS-Bal) to establish a balanced learning paradigm for desired visual semantics. Extensive comparisons, ablations, and analyses demonstrate that MagicTailor not only excels in this challenging task but also holds significant promise for practical applications, paving the way for more nuanced and creative image generation.

[AI-65] Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant

[AI-66] LAR-ECHR: A New Legal Argument Reasoning Task and Dataset for Cases of the European Court of Human Rights

Link: https://arxiv.org/abs/2410.13352
Authors: Odysseas S. Chlapanis,Dimitrios Galanis,Ion Androutsopoulos
Keywords: Large Language Models, Large Language, present Legal Argument, capabilities of Large, Legal Argument Reasoning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Published in Natural Legal Language Processing (NLLP) 2024 workshop

[AI-67] Representation Learning of Structured Data for Medical Foundation Models NEURIPS2024

Link: https://arxiv.org/abs/2410.13351
Authors: Vijay Prakash Dwivedi,Viktor Schlegel,Andy T. Liu,Thanh-Tung Nguyen,Abhinav Ramesh Kashyap,Jeng Wei,Wei-Hsian Yin,Stefan Winkler,Robby T. Tan
Keywords: Large Language Models, Large Language, demonstrated remarkable performance, Language Models, including healthcare
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: NeurIPS 2024 Workshop on Unifying Representations in Neural Models (UniReps 2024)

[AI-68] Cerberus: Efficient Inference with Adaptive Parallel Decoding and Sequential Knowledge Enhancement

[AI-69] DiffImp: Efficient Diffusion Model for Probabilistic Time Series Imputation with Bidirectional Mamba Backbone

Link: https://arxiv.org/abs/2410.13338
Authors: Hongfan Gao,Wangmeng Shen,Xiangfei Qiu,Ronghui Xu,Jilin Hu,Bin Yang
Keywords: time series imputation, Probabilistic time series, time series, series imputation, real-world scenarios due
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 25 pages, 14 figures

Abstract:Probabilistic time series imputation has been widely applied in real-world scenarios due to its ability to estimate the uncertainty of imputation results. Meanwhile, denoising diffusion probabilistic models (DDPMs) have achieved great success in probabilistic time series imputation tasks with their power to model complex distributions. However, current DDPM-based probabilistic time series imputation methodologies are confronted with two types of challenges: 1) The backbone modules of the denoising parts are not capable of achieving sequence modeling with low time complexity. 2) The architecture of denoising modules cannot handle the inter-variable and bidirectional dependencies in the time series imputation problem effectively. To address the first challenge, we integrate the computationally efficient state space model, namely Mamba, as the backbone denoising module for DDPMs. To tackle the second challenge, we carefully devise several SSM-based blocks for bidirectional modeling and inter-variable relation understanding. Experimental results demonstrate that our approach can achieve state-of-the-art time series imputation results on multiple datasets, different missing scenarios and missing ratios.

[AI-70] Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI Systems

[AI-71] Improving Discrete Optimisation Via Decoupled Straight-Through Gumbel-Softmax

Link: https://arxiv.org/abs/2410.13331
Authors: Rushi Shah,Mingyuan Yan,Michael Curtis Mozer,Dianbo Liu
Keywords: non-differentiable nature poses, nature poses significant, poses significant challenges, Discrete representations play, representations play
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Discrete representations play a crucial role in many deep learning architectures, yet their non-differentiable nature poses significant challenges for gradient-based optimization. To address this issue, various gradient estimators have been developed, including the Straight-Through Gumbel-Softmax (ST-GS) estimator, which combines the Straight-Through Estimator (STE) and the Gumbel-based reparameterization trick. However, the performance of ST-GS is highly sensitive to temperature, with its selection often compromising gradient fidelity. In this work, we propose a simple yet effective extension to ST-GS by employing decoupled temperatures for forward and backward passes, which we refer to as “Decoupled ST-GS”. We show that our approach significantly enhances the original ST-GS through extensive experiments across multiple tasks and datasets. We further investigate the impact of our method on gradient fidelity from multiple perspectives, including the gradient gap and the bias-variance trade-off of estimated gradients. Our findings contribute to the ongoing effort to improve discrete optimization in deep learning, offering a practical solution that balances simplicity and effectiveness.

[AI-72] Mitigating Hallucinations in Large Vision-Language Models via Summary-Guided Decoding

[AI-73] Computational Approaches to Arabic-English Code-Switching

[AI-74] Precipitation Nowcasting Using Diffusion Transformer with Causal Attention

Link: https://arxiv.org/abs/2410.13314
Authors: ChaoRong Li,XuDong Ling,YiLan Xue,Wenjie Luo,LiHong Zhu,FengQing Qin,Yaodong Zhou,Yuanyuan Huang
Keywords: Short-term precipitation forecasting, forecasting remains challenging, remains challenging due, precipitation forecasting remains, Short-term precipitation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Short-term precipitation forecasting remains challenging due to the difficulty in capturing long-term spatiotemporal dependencies. Current deep learning methods fall short in establishing effective dependencies between conditions and forecast results, while also lacking interpretability. To address this issue, we propose a Precipitation Nowcasting Using Diffusion Transformer with Causal Attention model. Our model leverages Transformer and combines causal attention mechanisms to establish spatiotemporal queries between conditional information (causes) and forecast results (results). This design enables the model to effectively capture long-term dependencies, allowing forecast results to maintain strong causal relationships with input conditions over a wide range of time and space. We explore four variants of spatiotemporal information interactions for DTCA, demonstrating that global spatiotemporal labeling interactions yield the best performance. In addition, we introduce a Channel-To-Batch shift operation to further enhance the model’s ability to represent complex rainfall dynamics. We conducted experiments on two datasets. Compared to state-of-the-art U-Net-based methods, our approach improved the CSI (Critical Success Index) for predicting heavy precipitation by approximately 15% and 8% respectively, achieving state-of-the-art performance.
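
The CSI (Critical Success Index) reported in this abstract is conventionally computed from hits, misses, and false alarms at a rainfall threshold. A minimal sketch follows; the function name, the threshold, and the sample values are assumptions for illustration and do not come from the paper's datasets.

```python
import numpy as np

def critical_success_index(pred, obs, threshold):
    """CSI = hits / (hits + misses + false alarms) at a given intensity threshold."""
    p = pred >= threshold
    o = obs >= threshold
    hits = np.sum(p & o)            # event forecast and observed
    misses = np.sum(~p & o)         # event observed but not forecast
    false_alarms = np.sum(p & ~o)   # event forecast but not observed
    denom = hits + misses + false_alarms
    return float(hits / denom) if denom > 0 else float("nan")

obs = np.array([0.0, 5.0, 12.0, 20.0])   # hypothetical observed rainfall (mm/h)
pred = np.array([1.0, 11.0, 15.0, 3.0])  # hypothetical forecast rainfall (mm/h)
csi = critical_success_index(pred, obs, threshold=10.0)
```

In this toy case there is one hit, one miss, and one false alarm, so the CSI is 1/3. Correct negatives do not enter the score, which makes CSI suitable for rare events such as heavy precipitation.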

[AI-75] Hiformer: Hybrid Frequency Feature Enhancement Inverted Transformer for Long-Term Wind Power Prediction

Link: https://arxiv.org/abs/2410.13303
Authors: Chongyang Wan,Shunbo Lei,Yuan Luo
Keywords: mitigating environmental impact, climate change necessitates, renewable energy sources, wind energy crucial, wind power
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The increasing severity of climate change necessitates an urgent transition to renewable energy sources, making the large-scale adoption of wind energy crucial for mitigating environmental impact. However, the inherent uncertainty of wind power poses challenges for grid stability, underscoring the need for accurate wind energy prediction models to enable effective power system planning and operation. While many existing studies on wind power prediction focus on short-term forecasting, they often overlook the importance of long-term predictions. Long-term wind power forecasting is essential for effective power grid dispatch and market transactions, as it requires careful consideration of weather features such as wind speed and direction, which directly influence power output. Consequently, methods designed for short-term predictions may lead to inaccurate results and high computational costs in long-term settings. To address these limitations, we propose a novel approach called Hybrid Frequency Feature Enhancement Inverted Transformer (Hiformer). Hiformer introduces a unique structure that integrates signal decomposition technology with weather feature extraction techniques to enhance the modeling of correlations between meteorological conditions and wind power generation. Additionally, Hiformer employs an encoder-only architecture, which reduces the computational complexity associated with long-term wind power forecasting. Compared to the state-of-the-art methods, Hiformer: (i) can improve the prediction accuracy by up to 52.5%; and (ii) can reduce computational time by up to 68.5%.

[AI-76] Automating IETF Insights generation with AI

Link: https://arxiv.org/abs/2410.13301
Authors: Jaime Jiménez
Keywords: Engineering Task Force, Internet Engineering Task, IETF Insights project, Working Groups, Task Force
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: 5 pages plus Appendix

Abstract:This paper presents the IETF Insights project, an automated system that streamlines the generation of comprehensive reports on the activities of the Internet Engineering Task Force (IETF) Working Groups. The system collects, consolidates, and analyzes data from various IETF sources, including meeting minutes, participant lists, drafts, and agendas. The core components of the system include data preprocessing code and a report generation module that produces high-quality documents in LaTeX or Markdown. By integrating Large Language Models (LLMs) for summaries based on the data as ground truth, the IETF Insights project enhances the accessibility and utility of IETF records, providing a valuable overview of the IETF’s activities and contributions to the community.

[AI-77] LLM-Rank: A Graph Theoretical Approach to Pruning Large Language Models

Link: https://arxiv.org/abs/2410.13299
Authors: David Hoffmann,Kailash Budhathoki,Matthaeus Kleindessner
Keywords: necessitating effective inference, inference optimisation techniques, effective inference optimisation, large language models, deployment costs
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The evolving capabilities of large language models are accompanied by growing sizes and deployment costs, necessitating effective inference optimisation techniques. We propose a novel pruning method utilising centrality measures from graph theory, reducing both the computational requirements and the memory footprint of these models. Specifically, we devise a method for creating a weighted directed acyclic graph representation of multilayer perceptrons, to which we apply a modified version of the weighted PageRank centrality measure to compute node importance scores. In combination with uniform pruning, this leads to structured sparsity. We call this pruning method MLPRank. Furthermore, we introduce an extension to decoder-only transformer models and call it LLMRank. Both variants perform strongly: MLPRank achieves on average 6.09% higher accuracy retention than three popular baselines, and LLMRank achieves 13.42% higher accuracy retention than two popular baselines.
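
The idea of ranking neurons by graph centrality can be sketched roughly as follows. This is a minimal illustration using plain weighted PageRank, not the modified variant the abstract refers to (which is not specified here). The helper names `mlp_to_edges`, `weighted_pagerank`, and `prune_hidden`, the use of absolute weights as edge weights, and the damping factor of 0.85 are all assumptions.

```python
def mlp_to_edges(weights):
    """weights[l][j][i] is the weight from neuron i in layer l to neuron j in layer l+1.
    Returns (num_nodes, layer_offsets, edges) with edges stored as {src: {dst: |w|}}."""
    sizes = [len(weights[0][0])] + [len(w) for w in weights]
    offsets = [0]
    for s in sizes:
        offsets.append(offsets[-1] + s)
    edges = {}
    for l, w in enumerate(weights):
        for j, row in enumerate(w):
            for i, value in enumerate(row):
                src, dst = offsets[l] + i, offsets[l + 1] + j
                edges.setdefault(src, {})[dst] = abs(value)
    return offsets[-1], offsets, edges


def weighted_pagerank(n, edges, damping=0.85, iters=100):
    """Power iteration; rank flows along edges in proportion to edge weight."""
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n
        dangling = 0.0
        for src in range(n):
            out = edges.get(src, {})
            total = sum(out.values())
            if total == 0.0:          # output neurons have no outgoing edges
                dangling += rank[src]
                continue
            for dst, w in out.items():
                new[dst] += damping * rank[src] * w / total
        share = damping * dangling / n  # spread dangling mass uniformly
        rank = [r + share for r in new]
    return rank


def prune_hidden(rank, offsets, keep_ratio):
    """Uniform pruning: keep the same fraction of top-ranked neurons in each hidden layer."""
    kept = []
    for l in range(1, len(offsets) - 2):
        nodes = list(range(offsets[l], offsets[l + 1]))
        k = max(1, round(keep_ratio * len(nodes)))
        top = sorted(nodes, key=lambda v: rank[v], reverse=True)[:k]
        kept.append(sorted(v - offsets[l] for v in top))
    return kept


# Toy MLP with 2 inputs, 3 hidden neurons, 1 output (made-up weights).
w1 = [[0.9, -0.8], [0.05, 0.02], [0.4, 0.3]]
w2 = [[1.0, 0.1, -0.5]]
n, offsets, edges = mlp_to_edges([w1, w2])
rank = weighted_pagerank(n, edges)
kept = prune_hidden(rank, offsets, keep_ratio=0.67)
print(kept)  # indices of hidden neurons that survive pruning
```

Removing whole neurons rather than individual weights is what makes the resulting sparsity structured: the pruned layer is simply a smaller dense matrix.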

[AI-78] Advancing Large Language Model Attribution through Self-Improving EMNLP2024

[AI-79] Fairness-Enhancing Ensemble Classification in Water Distribution Networks

Link: https://arxiv.org/abs/2410.13296
Authors: Janine Strotherm,Barbara Hammer
Keywords: affecting decision support, decision support tools, support tools constitutes, criminal detection software, social domain affecting
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:As relevant examples such as the future criminal detection software [1] show, fairness of AI-based and social domain affecting decision support tools constitutes an important area of research. In this contribution, we investigate the applications of AI to socioeconomically relevant infrastructures such as those of water distribution networks (WDNs), where fairness issues have yet to gain a foothold. To establish the notion of fairness in this domain, we propose an appropriate definition of protected groups and group fairness in WDNs as an extension of existing definitions. We demonstrate that typical methods for the detection of leakages in WDNs are unfair in this sense. Further, we thus propose a remedy to increase the fairness which can be applied even to non-differentiable ensemble classification methods as used in this context.

[AI-80] PiLocNet: Physics-informed neural network on 3D localization with rotating point spread function

Link: https://arxiv.org/abs/2410.13295
Authors: Mingda Lu,Zitian Ao,Chao Wang,Sudhakar Prasad,Raymond H. Chan
Keywords: point spread function, previously introduced localization, introduced localization neural, neural network, localization neural network
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
Comments: 25 pages, 4 figures

Abstract:For the 3D localization problem using point spread function (PSF) engineering, we propose a novel enhancement of our previously introduced localization neural network, LocNet. The improved network is a physics-informed neural network (PINN) that we call PiLocNet. Previous works on the localization problem may be categorized separately into model-based optimization and neural network approaches. Our PiLocNet combines the unique strengths of both approaches by incorporating forward-model-based information into the network via a data-fitting loss term that constrains the neural network to yield results that are physically sensible. We additionally incorporate certain regularization terms from the variational method, which further improves the robustness of the network in the presence of image noise, as we show for the Poisson and Gaussian noise models. This framework accords interpretability to the neural network, and the results we obtain show its superiority. Although the paper focuses on the use of single-lobe rotating PSF to encode the full 3D source location, we expect the method to be widely applicable to other PSFs and imaging problems that are constrained by known forward processes.

[AI-81] SBI-RAG: Enhancing Math Word Problem Solving for Students through Schema-Based Instruction and Retrieval-Augmented Generation NEURIPS’24

Link: https://arxiv.org/abs/2410.13293
Authors: Prakhar Dixit,Tim Oates
Keywords: identify key information, math word problems, categorize problems based, Schema-based instruction, improving problem-solving accuracy
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted to the 4th MATH-AI Workshop at NeurIPS’24

Abstract:Many students struggle with math word problems (MWPs), often finding it difficult to identify key information and select the appropriate mathematical operations. Schema-based instruction (SBI) is an evidence-based strategy that helps students categorize problems based on their structure, improving problem-solving accuracy. Building on this, we propose a Schema-Based Instruction Retrieval-Augmented Generation (SBI-RAG) framework that incorporates a large language model (LLM). Our approach emphasizes step-by-step reasoning by leveraging schemas to guide solution generation. We evaluate its performance on the GSM8K dataset, comparing it with GPT-4 and GPT-3.5 Turbo, and introduce a “reasoning score” metric to assess solution quality. Our findings suggest that SBI-RAG enhances reasoning clarity and problem-solving accuracy, potentially providing educational benefits for students.

[AI-82] Learning to Route with Confidence Tokens

[AI-83] Roadmap towards Superhuman Speech Understanding using Large Language Models

[AI-84] The Latent Road to Atoms: Backmapping Coarse-grained Protein Structures with Latent Diffusion

Link: https://arxiv.org/abs/2410.13264
Authors: Xu Han,Yuancheng Sun,Kai Chen,Kang Liu,Qiwei Ye
Keywords: molecular dynamics simulations, dynamics simulations offer, offer computational efficiency, molecular dynamics, thermodynamic properties
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Paper under review

Abstract:Coarse-grained (CG) molecular dynamics simulations offer computational efficiency for exploring protein conformational ensembles and thermodynamic properties. Though coarse representations enable large-scale simulations across extended temporal and spatial ranges, the sacrifice of atomic-level details limits their utility in tasks such as ligand docking and protein-protein interaction prediction. Backmapping, the process of reconstructing all-atom structures from coarse-grained representations, is crucial for recovering these fine details. While recent machine learning methods have made strides in protein structure generation, challenges persist in reconstructing diverse atomistic conformations that maintain geometric accuracy and chemical validity. In this paper, we present Latent Diffusion Backmapping (LDB), a novel approach leveraging denoising diffusion within latent space to address these challenges. By combining discrete latent encoding with diffusion, LDB bypasses the need for equivariant and internal coordinate manipulation, significantly simplifying the training and sampling processes as well as facilitating better and wider exploration in configuration space. We evaluate LDB’s state-of-the-art performance on three distinct protein datasets, demonstrating its ability to efficiently reconstruct structures with high structural accuracy and chemical validity. Moreover, LDB shows exceptional versatility in capturing diverse protein ensembles, highlighting its capability to explore intricate conformational spaces. Our results position LDB as a powerful and scalable approach for backmapping, effectively bridging the gap between CG simulations and atomic-level analyses in computational biology.

[AI-85] A Simplifying and Learnable Graph Convolutional Attention Network for Unsupervised Knowledge Graphs Alignment

Link: https://arxiv.org/abs/2410.13263
Authors: Weishan Cai,Wenjun Ma,Yuncheng Jiang
Keywords: task depends largely, labeled data, supervision information provided, current Entity Alignment, task depends
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages, 3 figures

Abstract:The success of the current Entity Alignment (EA) task depends largely on the supervision information provided by labeled data. Given the cost of labeled data, most supervised methods are difficult to apply in practical scenarios. Therefore, a growing body of work based on contrastive learning, active learning, or other deep learning techniques has been developed to overcome the performance bottleneck caused by the lack of labeled data. However, existing unsupervised EA methods still have limitations: either their modeling complexity is high, or they cannot balance the effectiveness and practicality of alignment. To overcome these issues, we propose a Simplifying and Learnable graph convolutional attention network for Unsupervised Knowledge Graphs alignment method (SLU). Specifically, we first introduce LCAT, a new and simple framework, as the backbone network to model the graph structure of two KGs. Then we design a reconstruction method of relation structure based on potential matching relations to efficiently filter invalid neighborhood information of aligned entities, improving the usability and scalability of SLU. In addition, a consistency-based similarity function is proposed to better measure the similarity of candidate entity pairs. Finally, we conduct extensive experiments on three datasets of different sizes (15K and 100K) and different types (cross-lingual and monolingual) to verify the superiority of SLU. Experimental results show that SLU significantly improves alignment accuracy, outperforming 25 supervised or unsupervised methods and improving Hits@1 by 6.4% over the best baseline in the best case.

[AI-86] scFusionTTT: Single-cell transcriptomics and proteomics fusion with Test-Time Training layers

Link: https://arxiv.org/abs/2410.13257
Authors: Dian Meng,Bohao Xing,Xinlei Huang,Yanran Liu,Yijun Zhou,Yongjun xiao,Zitong Yu,Xubin Zheng
Keywords: Epitopes by Sequencing, Cellular Indexing, Indexing of Transcriptomes, Transcriptomes and Epitopes, paired multimodal data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Single-cell multi-omics (scMulti-omics) refers to the paired multimodal data, such as Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq), where the regulation of each cell was measured from different modalities, i.e. genes and proteins. scMulti-omics can reveal heterogeneity inside tumors and understand the distinct genetic properties of diverse cell types, which is crucial to targeted therapy. Currently, deep learning methods based on attention structures in the bioinformatics area face two challenges. The first challenge is the vast number of genes in a single cell. Traditional attention-based modules struggled to effectively leverage all gene information due to their limited capacity for long-context learning and high-complexity computing. The second challenge is that genes in the human genome are ordered and influence each other’s expression. Most of the methods ignored this sequential information. The recently introduced Test-Time Training (TTT) layer is a novel sequence modeling approach, particularly suitable for handling long contexts like genomics data because TTT layer is a linear complexity sequence modeling structure and is better suited to data with sequential relationships. In this paper, we propose scFusionTTT, a novel method for Single-Cell multimodal omics Fusion with TTT-based masked autoencoder. Of note, we combine the order information of genes and proteins in the human genome with the TTT layer, fuse multimodal omics, and enhance unimodal omics analysis. Finally, the model employs a three-stage training strategy, which yielded the best performance across most metrics in four multimodal omics datasets and four unimodal omics datasets, demonstrating the superior performance of our model. The dataset and code will be available on this https URL.

[AI-87] Automatic Translation Alignment Pipeline for Multilingual Digital Editions of Literary Works

Link: https://arxiv.org/abs/2410.13255
Authors: Maria Levchenko
Keywords: Multilingual Digital Edition, Alessandro Manzoni Italian, Russian and Chinese, Digital Edition, Multilingual Digital
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18 pages, Computational Humanities Research Conference, December 4-6, 2024, Aarhus, Denmark

[AI-88] Perceptions of Discriminatory Decisions of Artificial Intelligence: Unpacking the Role of Individual Characteristics

Link: https://arxiv.org/abs/2410.13250
Authors: Soojong Kim
Keywords: outcomes exhibiting gender, digital self-efficacy, political ideology, technical knowledge, personal differences
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:This study investigates how personal differences (digital self-efficacy, technical knowledge, belief in equality, political ideology) and demographic factors (age, education, and income) are associated with perceptions of artificial intelligence (AI) outcomes exhibiting gender and racial bias and with general attitudes towards AI. Analyses of a large-scale experiment dataset (N = 1,206) indicate that digital self-efficacy and technical knowledge are positively associated with attitudes toward AI, while liberal ideologies are negatively associated with outcome trust, higher negative emotion, and greater skepticism. Furthermore, age and income are closely connected to cognitive gaps in understanding discriminatory AI outcomes. These findings highlight the importance of promoting digital literacy skills and enhancing digital self-efficacy to maintain trust in AI and beliefs in AI usefulness and safety. The findings also suggest that the disparities in understanding problematic AI outcomes may be aligned with economic inequalities and generational gaps in society. Overall, this study sheds light on the socio-technological system in which complex interactions occur between social hierarchies, divisions, and machines that reflect and exacerbate the disparities.

[AI-89] Disentangling Likes and Dislikes in Personalized Generative Explainable Recommendation

[AI-90] Enhancing Sentiment Analysis with Collaborative AI: Architecture Predictions and Deployment Strategies

Link: https://arxiv.org/abs/2410.13247
Authors: Chaofeng Zhang,Jia Hou,Xueting Tan,Caijuan Chen,Hiroshi Hashimoto
Keywords: based artificial intelligence, artificial intelligence technologies, large language model, based artificial, advancement of large
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:The advancement of large language model (LLM) based artificial intelligence technologies has been a game-changer, particularly in sentiment analysis. This progress has enabled a shift from highly specialized research environments to practical, widespread applications within the industry. However, integrating diverse AI models for processing complex multimodal data and the associated high costs of feature extraction present significant challenges. Motivated by marketing-oriented software development needs, our study introduces a collaborative AI framework designed to efficiently distribute and resolve tasks across various AI systems to address these issues. Initially, we elucidate the key solutions derived from our development process, highlighting the role of generative AI models like ChatGPT and Google Gemini in simplifying intricate sentiment analysis tasks into manageable, phased objectives. Furthermore, we present a detailed case study utilizing our collaborative AI system in edge and cloud, showcasing its effectiveness in analyzing sentiments across diverse online media channels.

[AI-91] Atomic Calibration of LLMs in Long-Form Generations

[AI-92] Large Language Models are Easily Confused: A Quantitative Metric Security Implications and Typological Analysis

[AI-93] SPIN: Self-Supervised Prompt INjection

[AI-94] Quamba: A Post-Training Quantization Recipe for Selective State Space Models

Link: https://arxiv.org/abs/2410.13229
Authors: Hung-Yueh Chiang,Chi-Chih Chang,Natalia Frumkin,Kai-Chiang Wu,Diana Marculescu
Keywords: constant memory complexity, holding longer context, longer context lengths, State Space Models, large language models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:State Space Models (SSMs) have emerged as an appealing alternative to Transformers for large language models, achieving state-of-the-art accuracy with constant memory complexity which allows for holding longer context lengths than attention-based networks. The superior computational efficiency of SSMs in long sequence modeling positions them favorably over Transformers in many scenarios. However, improving the efficiency of SSMs on request-intensive cloud-serving and resource-limited edge applications is still a formidable task. SSM quantization is a possible solution to this problem, making SSMs more suitable for wide deployment, while still maintaining their accuracy. Quantization is a common technique to reduce the model size and to utilize the low bit-width acceleration features on modern computing units, yet existing quantization techniques are poorly suited for SSMs. Most notably, SSMs have highly sensitive feature maps within the selective scan mechanism (i.e., linear recurrence) and massive outliers in the output activations which are not present in the output of token-mixing in the self-attention modules. To address this issue, we propose a static 8-bit per-tensor SSM quantization method which suppresses the maximum values of the input activations to the selective SSM for finer quantization precision and quantizes the output activations in an outlier-free space with Hadamard transform. Our 8-bit weight-activation quantized Mamba 2.8B SSM benefits from hardware acceleration and achieves a 1.72x lower generation latency on an Nvidia Orin Nano 8G, with only a 0.9% drop in average accuracy on zero-shot tasks. The experiments demonstrate the effectiveness and practical applicability of our approach for deploying SSM-based models of all sizes on both cloud and edge platforms.
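
To see why quantizing in a Hadamard-transformed space helps with outliers, consider the toy sketch below. It illustrates only the general idea (an orthonormal rotation spreads one large activation across all coordinates, so a single per-tensor scale wastes less resolution) and is not the Quamba recipe itself. The function names, the 8-element vector, and computing the scale from the tensor being quantized (a static scheme would fix it from calibration data) are assumptions.

```python
import math


def hadamard(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two.
    With the 1/sqrt(n) normalisation the transform is its own inverse."""
    y = list(x)
    n = len(y)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in y]


def quantize_int8(x):
    """Symmetric per-tensor quantization: one scale shared by every element.
    The scale is taken from x itself here for brevity (assumption)."""
    scale = max(abs(v) for v in x) / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in x]
    return q, scale


def dequantize(q, scale):
    return [v * scale for v in q]


def l2_error(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


# Made-up activations with one massive outlier.
x = [0.3, -0.2, 0.1, 100.0, -0.4, 0.25, 0.05, -0.15]

# Quantize directly: the outlier dictates the scale, small values collapse.
q, s = quantize_int8(x)
err_direct = l2_error(x, dequantize(q, s))

# Quantize in the rotated space, then rotate back.
qh, sh = quantize_int8(hadamard(x))
err_hadamard = l2_error(x, hadamard(dequantize(qh, sh)))

print(err_direct, err_hadamard)
```

Because the rotation is orthonormal, the reconstruction error measured after rotating back equals the quantization error in the rotated space, where the scale is much finer.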

[AI-95] From PINNs to PIKANs: Recent Advances in Physics-Informed Machine Learning

Link: https://arxiv.org/abs/2410.13228
Authors: Juan Diego Toscano,Vivek Oommen,Alan John Varghese,Zongren Zou,Nazanin Ahmadi Daryakenari,Chenxi Wu,George Em Karniadakis
Keywords: Scientific Machine Learning, partial differential equations, Scientific Machine, Machine Learning, Physics-Informed Neural Networks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
Comments: physics-informed neural networks, Kolmogorov-Arnold networks, optimization algorithms, separable PINNs, self-adaptive weights, uncertainty quantification

Abstract:Physics-Informed Neural Networks (PINNs) have emerged as a key tool in Scientific Machine Learning since their introduction in 2017, enabling the efficient solution of ordinary and partial differential equations using sparse measurements. Over the past few years, significant advancements have been made in the training and optimization of PINNs, covering aspects such as network architectures, adaptive refinement, domain decomposition, and the use of adaptive weights and activation functions. A notable recent development is the Physics-Informed Kolmogorov-Arnold Networks (PIKANS), which leverage a representation model originally proposed by Kolmogorov in 1957, offering a promising alternative to traditional PINNs. In this review, we provide a comprehensive overview of the latest advancements in PINNs, focusing on improvements in network design, feature expansion, optimization techniques, uncertainty quantification, and theoretical insights. We also survey key applications across a range of fields, including biomedicine, fluid and solid mechanics, geophysics, dynamical systems, heat transfer, chemical engineering, and beyond. Finally, we review computational frameworks and software tools developed by both academia and industry to support PINN research and applications.

[AI-96] Research on Travel Route Planing Problems Based on Greedy Algorithm

Link: https://arxiv.org/abs/2410.13226
Authors: Yiquan Wang
Keywords: route planning problem, ending point, starting and ending, route planning algorithm, greedy algorithm based
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:The greedy algorithm based route planning problem is a method of finding the optimal or near optimal route between a given starting and ending point. This article first uses PCA method to reduce the dimensionality of urban evaluation indicators, extracts key principal components, and KMO and TOPSIS algorithms to reduce the dimensionality of the data. Secondly, for datasets that have not passed the KMO test, a comprehensive evaluation will be conducted using the entropy weight method and TOPSIS method. Finally, based on the greedy algorithm, a route planning algorithm was proposed and optimized to provide personalized route customization according to the different needs of tourists. We also took into account the local travel efficiency, the time required to visit tourist attractions, and necessary daily rest time to reduce costs and avoid falling into the local optimal solution.
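
A greedy route planner of the general kind described above might look like the following sketch. It is a generic nearest-neighbour heuristic with a daily time budget, not the paper's optimized algorithm; the function name `greedy_route`, the flat 2D coordinates in kilometres, the 40 km/h travel speed, and the sample attractions are hypothetical.

```python
import math


def greedy_route(start, end, attractions, daily_hours, speed_kmh=40.0):
    """attractions: {name: ((x_km, y_km), visit_hours)}.
    Repeatedly moves to the nearest unvisited attraction, skipping any stop
    that would leave too little time to reach the end point.
    Returns (ordered list of visited names, total hours used)."""
    def travel(a, b):
        return math.dist(a, b) / speed_kmh

    route, here, used = [], start, 0.0
    remaining = dict(attractions)
    while remaining:
        # Greedy choice: the closest attraction to the current position.
        name = min(remaining, key=lambda k: math.dist(here, remaining[k][0]))
        pos, visit = remaining.pop(name)
        cost = travel(here, pos) + visit
        if used + cost + travel(pos, end) > daily_hours:
            continue  # does not fit in today's budget
        route.append(name)
        used += cost
        here = pos
    return route, used + travel(here, end)


# Made-up example: three attractions, an 8-hour day.
sights = {
    "museum": ((10.0, 0.0), 2.0),
    "park": ((20.0, 0.0), 1.0),
    "tower": ((20.0, 40.0), 3.0),
}
route, hours = greedy_route((0.0, 0.0), (30.0, 0.0), sights, daily_hours=8.0)
print(route, hours)  # ['museum', 'park'] 3.75
```

The nearest-neighbour choice is fast but can get stuck in a locally optimal ordering, which is the weakness the abstract says its additional constraints are meant to reduce.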

[AI-97] CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy

[AI-98] MixEHR-Nest: Identifying Subphenotypes within Electronic Health Records through Hierarchical Guided-Topic Modeling

Link: https://arxiv.org/abs/2410.13217
Authors: Ruohan Wang,Zilong Wang,Ziyang Song,David Buckeridge,Yue Li
Keywords: electronic health records, enhance personalized medicine, Automatic subphenotyping, health records, subphenotyping from electronic
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Quantitative Methods (q-bio.QM)
Comments:

Abstract:Automatic subphenotyping from electronic health records (EHRs) provides numerous opportunities to understand diseases with unique subgroups and enhance personalized medicine for patients. However, existing machine learning algorithms either focus on specific diseases for better interpretability or produce coarse-grained phenotype topics without considering nuanced disease patterns. In this study, we propose a guided topic model, MixEHR-Nest, to infer sub-phenotype topics from thousands of diseases using multi-modal EHR data. Specifically, MixEHR-Nest detects multiple subtopics from each phenotype topic, whose prior is guided by the expert-curated phenotype concepts such as Phenotype Codes (PheCodes) or Clinical Classification Software (CCS) codes. We evaluated MixEHR-Nest on two EHR datasets: (1) the MIMIC-III dataset consisting of over 38 thousand patients from the intensive care unit (ICU) of Beth Israel Deaconess Medical Center (BIDMC) in Boston, USA; (2) the healthcare administrative database PopHR, comprising 1.3 million patients from Montreal, Canada. Experimental results demonstrate that MixEHR-Nest can identify subphenotypes with distinct patterns within each phenotype, which are predictive for disease progression and severity. Consequently, MixEHR-Nest distinguishes between type 1 and type 2 diabetes by inferring subphenotypes using CCS codes, which do not differentiate these two subtype concepts. Additionally, MixEHR-Nest not only improved the prediction accuracy of short-term mortality of ICU patients and initial insulin treatment in diabetic patients but also revealed the contributions of subphenotypes. For longitudinal analysis, MixEHR-Nest identified subphenotypes of distinct age prevalence under the same phenotypes, such as asthma, leukemia, epilepsy, and depression. The MixEHR-Nest software is available at GitHub: this https URL.

[AI-99] Anchored Alignment for Self-Explanations Enhancement

[AI-100] LLMOPT: Learning to Define and Solve General Optimization Problems from Scratch

Link: https://arxiv.org/abs/2410.13213
Authors: Caigao Jiang,Xiang Shu,Hong Qian,Xingyu Lu,Jun Zhou,Aimin Zhou,Yang Yu
Keywords: Optimization, optimization problem types, Optimization problems, LLMOPT, optimization generalization
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Optimization problems are prevalent across various scenarios. Formulating and then solving optimization problems described by natural language often requires highly specialized human expertise, which could block the widespread application of optimization-based decision making. To make problem formulating and solving automated, leveraging large language models (LLMs) has emerged as a potential way. However, this kind of way suffers from the issue of optimization generalization. Namely, the accuracy of most current LLM-based methods and the generality of optimization problem types that they can model are still limited. In this paper, we propose a unified learning-based framework called LLMOPT to boost optimization generalization. Starting from the natural language descriptions of optimization problems and a pre-trained LLM, LLMOPT constructs the introduced five-element formulation as a universal model for learning to define diverse optimization problem types. Then, LLMOPT employs the multi-instruction tuning to enhance both problem formalization and solver code generation accuracy and generality. After that, to prevent hallucinations in LLMs, such as sacrificing solving accuracy to avoid execution errors, model alignment and self-correction mechanism are adopted in LLMOPT. We evaluate the optimization generalization ability of LLMOPT and compared methods across six real-world datasets covering roughly 20 fields such as health, environment, energy and manufacturing, etc. Extensive experiment results show that LLMOPT is able to model various optimization problem types such as linear/nonlinear programming, mixed integer programming and combinatorial optimization, and achieves a notable 11.08% average solving accuracy improvement compared with the state-of-the-art methods. The code is available at this https URL.

[AI-101] AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations

Link: https://arxiv.org/abs/2410.13212
Authors: Qian Tao,Wenyuan Yu,Jingren Zhou
Keywords: Large language models, shown exceptional capabilities, Large language, text generation, video generation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 4 figures

Abstract:Large language models have shown exceptional capabilities in a wide range of tasks, such as text generation and video generation, among others. However, due to their massive parameter count, these models often require substantial storage space, imposing significant constraints on the machines deploying LLMs. To overcome this limitation, one research direction proposes to compress the models using integer replacements for floating-point numbers, in a process known as Quantization. Some recent studies suggest quantizing the key and value cache (KV Cache) of LLMs, and designing quantization techniques that treat the key and value matrices equivalently. This work delves deeper into the asymmetric structural roles of KV Cache, a phenomenon where the transformer’s output loss is more sensitive to the quantization of key matrices. We conduct a systematic examination of the attention output error resulting from key and value quantization. The phenomenon inspires us to propose an asymmetric quantization strategy. Our approach allows for 1-bit quantization of the KV cache by implementing distinct configurations for key and value matrices. We carry out experiments across a variety of datasets, demonstrating that our proposed model allows for the quantization of up to 75% decoder layers with 1 bit, while simultaneously maintaining performance levels comparable to those of the models with floating parameters.

[AI-102] Estimating the Probabilities of Rare Outputs in Language Models

Link: https://arxiv.org/abs/2410.13211
Authors: Gabriel Wu, Jacob Hilton
Keywords: machine learning model, low probability estimation, machine learning, binary property, formally-specified input distribution
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 27 pages, 9 figures

Abstract:We consider the problem of low probability estimation: given a machine learning model and a formally-specified input distribution, how can we estimate the probability of a binary property of the model’s output, even when that probability is too small to estimate by random sampling? This problem is motivated by the need to improve worst-case performance, which distribution shift can make much more likely. We study low probability estimation in the context of argmax sampling from small transformer language models. We compare two types of methods: importance sampling, which involves searching for inputs giving rise to the rare output, and activation extrapolation, which involves extrapolating a probability distribution fit to the model’s logits. We find that importance sampling outperforms activation extrapolation, but both outperform naive sampling. Finally, we explain how minimizing the probability estimate of an undesirable behavior generalizes adversarial training, and argue that new methods for low probability estimation are needed to provide stronger guarantees about worst-case performance.
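To make the contrast between naive sampling and importance sampling concrete, here is a toy sketch on a one-dimensional Gaussian tail rather than on a language model. The setup (a standard normal input, a shifted normal proposal, and the function names) is an assumption chosen for illustration and is not the paper's method.

```python
import math
import random


def naive_estimate(threshold, n, rng):
    """Fraction of N(0, 1) samples that exceed the threshold."""
    hits = sum(1 for _ in range(n) if rng.gauss(0.0, 1.0) > threshold)
    return hits / n


def importance_estimate(threshold, n, rng):
    """Sample from N(threshold, 1) and reweight by p(x) / q(x)."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(threshold, 1.0)
        if x > threshold:
            # Density ratio of N(0, 1) to N(threshold, 1) at x.
            total += math.exp(-threshold * x + 0.5 * threshold ** 2)
    return total / n


def exact_tail(threshold):
    """Closed-form P(X > threshold) for X ~ N(0, 1)."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))


if __name__ == "__main__":
    rng = random.Random(0)
    print("exact     ", exact_tail(4.0))
    print("naive     ", naive_estimate(4.0, 20000, rng))
    print("importance", importance_estimate(4.0, 20000, rng))
```

At this sample size the naive estimator typically sees zero or one hit, while the reweighted estimator concentrates its samples where the rare event happens.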

[AI-103] FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs

[AI-104] TabSeq: A Framework for Deep Learning on Tabular Data via Sequential Ordering ICPR2024

Link: https://arxiv.org/abs/2410.13203
Authors: Al Zadid Sultan Bin Habib, Kesheng Wang, Mary-Anne Hartley, Gianfranco Doretto, Donald A. Adjeroh
Keywords: Effective analysis, deep learning, levels of relevance, learning, poses a significant
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted for presentation at the 26th International Conference on Pattern Recognition (ICPR 2024) in Kolkata, India

Abstract:Effective analysis of tabular data still poses a significant problem in deep learning, mainly because features in tabular datasets are often heterogeneous and have different levels of relevance. This work introduces TabSeq, a novel framework for the sequential ordering of features, addressing the vital necessity to optimize the learning process. Features are not always equally informative, and for certain deep learning models, their random arrangement can hinder the model’s learning capacity. Finding the optimum sequence order for such features could improve the deep learning models’ learning process. The novel feature ordering technique we provide in this work is based on clustering and incorporates both local ordering and global ordering. It is designed to be used with a multi-head attention mechanism in a denoising autoencoder network. Our framework uses clustering to align comparable features and improve data organization. Multi-head attention focuses on essential characteristics, whereas the denoising autoencoder highlights important aspects by rebuilding from distorted inputs. This method improves the capability to learn from tabular data while lowering redundancy. Our research, demonstrating improved performance through appropriate feature sequence rearrangement using raw antibody microarray and two other real-world biomedical datasets, validates the impact of feature ordering. These results demonstrate that feature ordering can be a viable approach to improved deep learning of tabular data.

[AI-105] Meta-DiffuB: A Contextualized Sequence-to-Sequence Text Diffusion Model with Meta-Exploration

[AI-106] Context-Enhanced Multi-View Trajectory Representation Learning: Bridging the Gap through Self-Supervised Models

Link: https://arxiv.org/abs/2410.13196
Authors: Tangwen Qian, Junhe Li, Yile Chen, Gao Cong, Tao Sun, Fei Wang, Yongjun Xu
Keywords: travel time estimation, generic-purpose dense representations, downstream applications, travel time, similarity computation
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Modeling trajectory data with generic-purpose dense representations has become a prevalent paradigm for various downstream applications, such as trajectory classification, travel time estimation and similarity computation. However, existing methods typically rely on trajectories from a single spatial view, limiting their ability to capture the rich contextual information that is crucial for gaining deeper insights into movement patterns across different geospatial contexts. To this end, we propose MVTraj, a novel multi-view modeling method for trajectory representation learning. MVTraj integrates diverse contextual knowledge, from GPS to road network and points-of-interest to provide a more comprehensive understanding of trajectory data. To align the learning process across multiple views, we utilize GPS trajectories as a bridge and employ self-supervised pretext tasks to capture and distinguish movement patterns across different spatial views. Following this, we treat trajectories from different views as distinct modalities and apply a hierarchical cross-modal interaction module to fuse the representations, thereby enriching the knowledge derived from multiple sources. Extensive experiments on real-world datasets demonstrate that MVTraj significantly outperforms existing baselines in tasks associated with various spatial views, validating its effectiveness and practical utility in spatio-temporal modeling.

[AI-107] Golyadkin's Torment: Doppelgängers and Adversarial Vulnerability

Link: https://arxiv.org/abs/2410.13193
Authors: George I. Kamberov
Keywords: adversarial visual metamers, claimed to outperform, classifiers, adversarial Doppelgangers, visual metamers
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Logic in Computer Science (cs.LO)
Comments:

Abstract:Many machine learning (ML) classifiers are claimed to outperform humans, but they still make mistakes that humans do not. The most notorious examples of such mistakes are adversarial visual metamers. This paper aims to define and investigate the phenomenon of adversarial Doppelgangers (AD), which includes adversarial visual metamers, and to compare the performance and robustness of ML classifiers to human performance. We find that AD are inputs that are close to each other with respect to a perceptual metric defined in this paper. AD are qualitatively different from the usual adversarial examples. The vast majority of classifiers are vulnerable to AD and robustness-accuracy trade-offs may not improve them. Some classification problems may not admit any AD robust classifiers because the underlying classes are ambiguous. We provide criteria that can be used to determine whether a classification problem is well defined or not; describe the structure and attributes of an AD-robust classifier; introduce and explore the notions of conceptual entropy and regions of conceptual ambiguity for classifiers that are vulnerable to AD attacks, along with methods to bound the AD fooling rate of an attack. We define the notion of classifiers that exhibit hypersensitive behavior, that is, classifiers whose only mistakes are adversarial Doppelgangers. Improving the AD robustness of hyper-sensitive classifiers is equivalent to improving accuracy. We identify conditions guaranteeing that all classifiers with sufficiently high accuracy are hyper-sensitive. Our findings are aimed at significant improvements in the reliability and security of machine learning systems. 

[AI-108] MCQG-SRefine: Multiple Choice Question Generation and Evaluation with Iterative Self-Critique Correction and Comparison Feedback

[AI-109] CohEx: A Generalized Framework for Cohort Explanation

Link: https://arxiv.org/abs/2410.13190
Authors: Fanyu Meng, Xin Liu, Zhaodan Kong, Xin Chen
Keywords: eXplainable Artificial Intelligence, Artificial Intelligence, garnered significant attention, eXplainable Artificial, machine learning models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:eXplainable Artificial Intelligence (XAI) has garnered significant attention for enhancing transparency and trust in machine learning models. However, the scopes of most existing explanation techniques focus either on offering a holistic view of the explainee model (global explanation) or on individual instances (local explanation), while the middle ground, i.e., cohort-based explanation, is less explored. Cohort explanations offer insights into the explainee’s behavior on a specific group or cohort of instances, enabling a deeper understanding of model decisions within a defined context. In this paper, we discuss the unique challenges and opportunities associated with measuring cohort explanations, define their desired properties, and create a generalized framework for generating cohort explanations based on supervised clustering.

[AI-110] aiXcoder-7B: A Lightweight and Effective Large Language Model for Code Completion

[AI-111] Chain of Ideas: Revolutionizing Research in Novel Idea Development with LLM Agents

[AI-112] EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning

[AI-113] GeSubNet: Gene Interaction Inference for Disease Subtype Network Generation ICLR2025

Link: https://arxiv.org/abs/2410.13178
Authors: Ziwei Yang, Zheng Chen, Xin Liu, Rikuto Kotoge, Peng Chen, Yasuko Matsubara, Yasushi Sakurai, Jimeng Sun
Keywords: Retrieving gene functional, Retrieving gene, knowledge databases presents, presents a challenge, challenge due
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under review as a conference paper at ICLR 2025

Abstract:Retrieving gene functional networks from knowledge databases presents a challenge due to the mismatch between disease networks and subtype-specific variations. Current solutions, including statistical and deep learning methods, often fail to effectively integrate gene interaction knowledge from databases or explicitly learn subtype-specific interactions. To address this mismatch, we propose GeSubNet, which learns a unified representation capable of predicting gene interactions while distinguishing between different disease subtypes. Graphs generated by such representations can be considered subtype-specific networks. GeSubNet is a multi-step representation learning framework with three modules: First, a deep generative model learns distinct disease subtypes from patient gene expression profiles. Second, a graph neural network captures representations of prior gene networks from knowledge databases, ensuring accurate physical gene interactions. Finally, we integrate these two representations using an inference loss that leverages graph generation capabilities, conditioned on the patient separation loss, to refine subtype-specific information in the learned representation. GeSubNet consistently outperforms traditional methods, with average improvements of 30.6%, 21.0%, 20.1%, and 56.6% across four graph evaluation metrics, averaged over four cancer datasets. Particularly, we conduct a biological simulation experiment to assess how the behavior of selected genes from over 11,000 candidates affects subtypes or patient distributions. The results show that the generated network has the potential to identify subtype-specific genes with an 83% likelihood of impacting patient distribution shifts. The GeSubNet resource is available: this https URL

[AI-114] TCP-Diffusion: A Multi-modal Diffusion Model for Global Tropical Cyclone Precipitation Forecasting with Change Awareness

Link: https://arxiv.org/abs/2410.13175
Authors: Cheng Huang, Pan Mu, Cong Bai, Peter AG Watson
Keywords: Tropical Cyclone Precipitation, Cyclone Precipitation Diffusion, Cyclone Precipitation, Precipitation, cyclone precipitation forecasting
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments:

Abstract:Precipitation from tropical cyclones (TCs) can cause disasters such as flooding, mudslides, and landslides. Predicting such precipitation in advance is crucial, giving people time to prepare and defend against these precipitation-induced disasters. Developing deep learning (DL) rainfall prediction methods offers a new way to predict potential disasters. However, one problem is that most existing methods suffer from cumulative errors and lack physical consistency. Second, these methods overlook the importance of meteorological factors in TC rainfall and their integration with the numerical weather prediction (NWP) model. Therefore, we propose Tropical Cyclone Precipitation Diffusion (TCP-Diffusion), a multi-modal model for global tropical cyclone precipitation forecasting. It forecasts TC rainfall around the TC center for the next 12 hours at 3 hourly resolution based on past rainfall observations and multi-modal environmental variables. Adjacent residual prediction (ARP) changes the training target from the absolute rainfall value to the rainfall trend and gives our model the ability of rainfall change awareness, reducing cumulative errors and ensuring physical consistency. Considering the influence of TC-related meteorological factors and the useful information from NWP model forecasts, we propose a multi-model framework with specialized encoders to extract richer information from environmental variables and results provided by NWP models. The results of extensive experiments show that our method outperforms other DL methods and the NWP method from the European Centre for Medium-Range Weather Forecasts (ECMWF).
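The abstract says adjacent residual prediction moves the training target from the absolute rainfall value to the rainfall trend. One simple reading of that, sketched below, is to train on the change between adjacent time steps and rebuild absolute values by accumulating predicted changes onto the last observation. The function names and the scalar series are hypothetical, and the paper's actual formulation may differ.

```python
def to_residuals(series):
    """Turn absolute values into changes between adjacent time steps."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


def from_residuals(last_observed, residuals):
    """Rebuild absolute values by accumulating changes onto the last observation."""
    rebuilt = []
    current = last_observed
    for change in residuals:
        current += change
        rebuilt.append(current)
    return rebuilt


if __name__ == "__main__":
    rainfall = [2, 5, 4, 7]  # made-up rainfall values at successive 3-hour steps
    targets = to_residuals(rainfall)
    print(targets)
    print(from_residuals(rainfall[0], targets))
```

The round trip is lossless, so switching targets changes what the model is asked to learn without discarding information.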

[AI-115] An Evolved Universal Transformer Memory

Link: https://arxiv.org/abs/2410.13166
Authors: Edoardo Cetin, Qi Sun, Tianyu Zhao, Yujin Tang
Keywords: Prior methods propose, dropping specific parts, modern foundation models, Prior methods, Neural Attention Memory
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 29 pages, 14 figures. Preprint, under submission. Source code is available at this https URL

[AI-116] Utilizing Large Language Models in An Iterative Paradigm with Domain Feedback for Molecule Optimization

Link: https://arxiv.org/abs/2410.13147
Authors: Khiem Le, Nitesh V. Chawla
Keywords: Large Language Models, chemical modification, drug discovery, discovery to optimize, text
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Molecule optimization is a critical task in drug discovery to optimize desired properties of a given molecule through chemical modification. Although Large Language Models (LLMs) hold the potential to efficiently simulate this task by using natural language to direct the optimization, using them straightforwardly shows limited performance. In this work, we facilitate utilizing LLMs in an iterative paradigm by proposing a simple yet highly effective domain feedback provider, namely Re²DF. In detail, Re²DF harnesses an external toolkit, RDKit, to handle molecule hallucination when the modified molecule is chemically invalid. Otherwise, its desired properties are computed and compared to the original one, establishing reliable domain feedback with correct direction and distance towards the objective, followed by a retrieved example, to explicitly guide the LLM to refine the modified molecule. We conduct experiments across both single- and multi-property objectives with 2 thresholds, where Re²DF shows significant improvements. Particularly, for 20 single-property objectives, Re²DF enhances the Hit ratio by 16.95% and 20.76% under loose and strict thresholds, respectively. For 32 multi-property objectives, Re²DF enhances the Hit ratio by 6.04% and 5.25%.

[AI-117] Trust but Verify: Programmatic VLM Evaluation in the Wild

Link: https://arxiv.org/abs/2410.13121
Authors: Viraj Prabhu, Senthil Purushwalkam, An Yan, Caiming Xiong, Ran Xu
Keywords: plausible but incorrect, PROVE, Programmatic VLM Evaluation, incorrect responses, open-ended queries
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision-Language Models (VLMs) often generate plausible but incorrect responses to visual queries. However, reliably quantifying the effect of such hallucinations in free-form responses to open-ended queries is challenging as it requires visually verifying each claim within the response. We propose Programmatic VLM Evaluation (PROVE), a new benchmarking paradigm for evaluating VLM responses to open-ended queries. To construct PROVE, we provide a large language model (LLM) with a high-fidelity scene-graph representation constructed from a hyper-detailed image caption, and prompt it to generate diverse question-answer (QA) pairs, as well as programs that can be executed over the scene graph object to verify each QA pair. We thus construct a benchmark of 10.5k challenging but visually grounded QA pairs. Next, to evaluate free-form model responses to queries in PROVE, we propose a programmatic evaluation strategy that measures both the helpfulness and truthfulness of a response within a unified scene graph-based framework. We benchmark the helpfulness-truthfulness trade-offs of a range of VLMs on PROVE, finding that very few are in fact able to achieve a good balance between the two. Project page: this https URL.

[AI-118] Preference Diffusion for Recommendation

Link: https://arxiv.org/abs/2410.13117
Authors: Shuo Liu, An Zhang, Guoqing Hu, Hong Qian, Tat-seng Chua
Keywords: historical behavior data, Recommender systems predict, systems predict personalized, predict personalized item, item rankings based
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recommender systems predict personalized item rankings based on user preference distributions derived from historical behavior data. Recently, diffusion models (DMs) have gained attention in recommendation for their ability to model complex distributions, yet current DM-based recommenders often rely on traditional objectives like mean squared error (MSE) or recommendation objectives, which are not optimized for personalized ranking tasks or fail to fully leverage DM’s generative potential. To address this, we propose PreferDiff, a tailored optimization objective for DM-based recommenders. PreferDiff transforms BPR into a log-likelihood ranking objective and integrates multiple negative samples to better capture user preferences. Specifically, we employ variational inference to handle the intractability by minimizing the variational upper bound, and replace MSE with cosine error to improve alignment with recommendation tasks. Finally, we balance learning generation and preference to enhance the training stability of DMs. PreferDiff offers three key benefits: it is the first personalized ranking loss designed specifically for DM-based recommenders, and it improves ranking and converges faster by addressing hard negatives. We also prove that it is theoretically connected to Direct Preference Optimization, which indicates that it has the potential to align user preferences in DM-based recommenders via generative modeling. Extensive experiments across three benchmarks validate its superior recommendation performance and commendable general sequential recommendation capabilities. Our codes are available at this https URL.
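For readers unfamiliar with BPR, the starting point that the abstract says PreferDiff transforms, the sketch below computes the classic pairwise loss, -log sigmoid(s_pos - s_neg), averaged over several negative items. The averaging over negatives and the function names are illustrative assumptions; this is not the PreferDiff objective itself.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bpr_loss(pos_score, neg_scores):
    """Classic BPR loss for one positive item, averaged over negative items."""
    terms = [log_sigmoid(pos_score - neg) for neg in neg_scores]
    return -sum(terms) / len(terms)


if __name__ == "__main__":
    # Made-up scores: the loss shrinks as the positive item outranks the negatives.
    print(bpr_loss(0.0, [0.0, 1.0]))
    print(bpr_loss(3.0, [0.0, 1.0]))
```

When the positive and negative scores are equal, the loss equals log 2, and it falls toward zero as the margin grows.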

[AI-119] Learning to Summarize from LLM-generated Feedback

[AI-120] Sound Check: Auditing Audio Datasets

Link: https://arxiv.org/abs/2410.13114
Authors: William Agnew, Julia Barnett, Annie Chu, Rachel Hong, Michael Feffer, Robin Netzorg, Harry H. Jiang, Ezra Awumey, Sauvik Das
Keywords: released high quality, high quality generative, Generative audio models, generative audio products, powerful generative audio
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Generative audio models are rapidly advancing in both capabilities and public utilization: several powerful generative audio models have readily available open weights, and some tech companies have released high quality generative audio products. Yet, while prior work has enumerated many ethical issues stemming from the data on which generative visual and textual models have been trained, we have little understanding of similar issues with generative audio datasets, including those related to bias, toxicity, and intellectual property. To bridge this gap, we conducted a literature review of hundreds of audio datasets and selected seven of the most prominent to audit in more detail. We found that these datasets are biased against women, contain toxic stereotypes about marginalized communities, and contain significant amounts of copyrighted work. To enable artists to see if they are in popular audio datasets and facilitate exploration of the contents of these datasets, we developed a web-based audio dataset exploration tool at this https URL.

[AI-121] Cliqueformer: Model-Based Optimization with Structured Transformers

Link: https://arxiv.org/abs/2410.13106
Authors: Jakub Grudzien Kuba, Pieter Abbeel, Sergey Levine
Keywords: Expressive large-scale neural, large-scale neural networks, neural networks enable, networks enable training, enable training powerful
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Expressive large-scale neural networks enable training powerful models for prediction tasks. However, in many engineering and science domains, such models are intended to be used not just for prediction, but for design, e.g., creating new proteins that serve as effective therapeutics, or creating new materials or chemicals that maximize a downstream performance measure. Thus, researchers have recently grown interested in building deep learning methods that solve offline model-based optimization (MBO) problems, in which design candidates are optimized with respect to surrogate models learned from offline data. However, predictive models that are effective at predicting in-distribution properties of a design are not necessarily the best suited for creating new designs. Thus, the most successful algorithms that tackle MBO draw inspiration from reinforcement learning and generative modeling to meet the in-distribution constraints. Meanwhile, recent theoretical works have observed that exploiting the structure of the target black-box function is an effective strategy for solving MBO from offline data. Unfortunately, discovering such structure remains an open problem. In this paper, following first principles, we develop a model that learns the structure of an MBO task and empirically leads to improved designs. To this end, we introduce Cliqueformer, a scalable transformer-based architecture that learns the black-box function’s structure in the form of its functional graphical model (FGM), thus bypassing the problem of distribution shift, previously tackled by conservative approaches. We evaluate Cliqueformer on various tasks, ranging from high-dimensional black-box functions from MBO literature to real-world tasks of chemical and genetic design, consistently demonstrating its state-of-the-art performance.

[AI-122] A Little Human Data Goes A Long Way

[AI-123] Task Consistent Prototype Learning for Incremental Few-shot Semantic Segmentation

Link: https://arxiv.org/abs/2410.13094
Authors: Wenbo Xu, Yanan Wu, Haoran Jiang, Yang Wang, Qiang Wu, Jian Zhang
Keywords: Few-Shot Semantic Segmentation, Incremental Few-Shot Semantic, Semantic Segmentation, Few-Shot Semantic, segmentation capability
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: conference

Abstract:Incremental Few-Shot Semantic Segmentation (iFSS) tackles a task that requires a model to continually expand its segmentation capability on novel classes using only a few annotated examples. Typical incremental approaches encounter a challenge that the objective of the base training phase (fitting base classes with sufficient instances) does not align with the incremental learning phase (rapidly adapting to new classes with less forgetting). This disconnect can result in suboptimal performance in the incremental setting. This study introduces a meta-learning-based prototype approach that encourages the model to learn how to adapt quickly while preserving previous knowledge. Concretely, we mimic the incremental evaluation protocol during the base training session by sampling a sequence of pseudo-incremental tasks. Each task in the simulated sequence is trained using a meta-objective to enable rapid adaptation without forgetting. To enhance discrimination among class prototypes, we introduce prototype space redistribution learning, which dynamically updates class prototypes to establish optimal inter-prototype boundaries within the prototype space. Extensive experiments on iFSS datasets built upon PASCAL and COCO benchmarks show the advanced performance of the proposed approach, offering valuable insights for addressing iFSS challenges.

[AI-124] Reverse-Engineering the Reader

[AI-125] FedCAP: Robust Federated Learning via Customized Aggregation and Personalization ACSA

Link: https://arxiv.org/abs/2410.13083
Authors: Youpeng Li, Xinda Wang, Fuxun Yu, Lichao Sun, Wenbin Zhang, Xuyu Wang
Keywords: machine learning paradigm, distributed machine learning, Federated learning, emerging distributed machine, learning paradigm
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 14 pages, 12 figures, 5 tables, accepted by 2024 Annual Computer Security Applications Conference (ACSAC 2024)

Abstract:Federated learning (FL), an emerging distributed machine learning paradigm, has been applied to various privacy-preserving scenarios. However, due to its distributed nature, FL faces two key issues: the non-independent and identical distribution (non-IID) of user data and vulnerability to Byzantine threats. To address these challenges, in this paper, we propose FedCAP, a robust FL framework against both data heterogeneity and Byzantine attacks. The core of FedCAP is a model update calibration mechanism to help a server capture the differences in the direction and magnitude of model updates among clients. Furthermore, we design a customized model aggregation rule that facilitates collaborative training among similar clients while accelerating the model deterioration of malicious clients. With a Euclidean norm-based anomaly detection mechanism, the server can quickly identify and permanently remove malicious clients. Moreover, the impact of data heterogeneity and Byzantine attacks can be further mitigated through personalization on the client side. We conduct extensive experiments, comparing multiple state-of-the-art baselines, to demonstrate that FedCAP performs well in several non-IID settings and shows strong robustness under a series of poisoning attacks.
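The abstract mentions a Euclidean norm-based anomaly detection mechanism without giving its details. As a minimal sketch of the general idea only, the code below flags client updates whose L2 norm is far above the median norm. The `factor * median` rule, the default factor, and the function names are hypothetical and are not taken from FedCAP.

```python
import math
import statistics


def l2_norm(update):
    """Euclidean norm of one client's flattened model update."""
    return math.sqrt(sum(x * x for x in update))


def flag_anomalies(updates, factor=3.0):
    """Return indices of updates whose norm exceeds factor times the median norm."""
    norms = [l2_norm(u) for u in updates]
    cutoff = factor * statistics.median(norms)
    return [i for i, norm in enumerate(norms) if norm > cutoff]


if __name__ == "__main__":
    # Made-up updates: three similar clients and one with an inflated update.
    updates = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.15], [5.0, -4.0]]
    print(flag_anomalies(updates))
```

Using the median rather than the mean keeps the cutoff from being dragged upward by the very outliers it is meant to catch.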

[AI-126] Tuning Language Models by Mixture-of-Depths Ensemble

[AI-127] Language Models as Semiotic Machines: Reconceptualizing AI Language Systems through Structuralist and Post-Structuralist Theories of Language

[AI-128] Optimal Transport for Probabilistic Circuits

Link: https://arxiv.org/abs/2410.13061
Authors: Adrian Ciotinga, YooJung Choi
Keywords: optimal transport framework, optimal transport, Wasserstein distance, transport framework, distance
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Abstract:We introduce a novel optimal transport framework for probabilistic circuits (PCs). While it has been shown recently that divergences between distributions represented as certain classes of PCs can be computed tractably, to the best of our knowledge, there is no existing approach to compute the Wasserstein distance between probability distributions given by PCs. We consider a Wasserstein-type distance that restricts the coupling measure of the associated optimal transport problem to be a probabilistic circuit. We then develop an algorithm for computing this distance by solving a series of small linear programs and derive the circuit conditions under which this is tractable. Furthermore, we show that we can also retrieve the optimal transport plan between the PCs from the solutions to these linear programming problems. We then consider the empirical Wasserstein distance between a PC and a dataset, and show that we can estimate the PC parameters to minimize this distance through an efficient iterative algorithm.
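For intuition about the quantity being computed, the sketch below evaluates the Wasserstein-1 distance in the simplest possible setting: two equal-size empirical samples on the real line, where the optimal coupling just matches sorted values. This is a textbook special case with a hypothetical function name; it does not involve probabilistic circuits or the linear programs described in the abstract.

```python
def wasserstein_1d(xs, ys):
    """W1 distance between two equal-size empirical distributions on the real line."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    # In one dimension the optimal transport plan pairs the sorted samples in order.
    paired = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in paired) / len(xs)


if __name__ == "__main__":
    print(wasserstein_1d([0, 1, 2], [3, 1, 2]))
    print(wasserstein_1d([1, 2], [2, 1]))
```

Shifting every sample by a constant shifts the distance by the same constant, which is the behavior that distinguishes transport distances from divergences that only compare densities pointwise.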

[AI-129] ERAS: Evaluating the Robustness of Chinese NLP Models to Morphological Garden Path Errors NAACL

[AI-130] Channel-Wise Mixed-Precision Quantization for Large Language Models

[AI-131] Systems with Switching Causal Relations: A Meta-Causal Perspective

Link: https://arxiv.org/abs/2410.13054
Authors: Moritz Willig, Tim Nelson Tobiasch, Florian Peter Busch, Jonas Seng, Devendra Singh Dhami, Kristian Kersting
Keywords: machine learning assumes, constant underlying process, work on causality, causality in machine, machine learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 19 pages, 3 figures, 4 tables

Abstract:Most work on causality in machine learning assumes that causal relationships are driven by a constant underlying process. However, the flexibility of agents’ actions or tipping points in the environmental process can change the qualitative dynamics of the system. As a result, new causal relationships may emerge, while existing ones change or disappear, resulting in an altered causal graph. To analyze these qualitative changes on the causal graph, we propose the concept of meta-causal states, which groups classical causal models into clusters based on equivalent qualitative behavior and consolidates specific mechanism parameterizations. We demonstrate how meta-causal states can be inferred from observed agent behavior, and discuss potential methods for disentangling these states from unlabeled data. Finally, we direct our analysis towards the application of a dynamical system, showing that meta-causal states can also emerge from inherent system dynamics, and thus constitute more than a context-dependent framework in which mechanisms emerge only as a result of external factors.

[AI-132] FedGTST: Boosting Global Transferability of Federated Models via Statistics Tuning

Link: https://arxiv.org/abs/2410.13045
Authors: Evelyn Ma, Chao Pan, Rasoul Etesami, Han Zhao, Olgica Milenkovic
Keywords: substantial computational resources, performance of Transfer, Transfer Learning, demands large datasets, heavily relies
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The performance of Transfer Learning (TL) heavily relies on effective pretraining, which demands large datasets and substantial computational resources. As a result, executing TL is often challenging for individual model developers. Federated Learning (FL) addresses these issues by facilitating collaborations among clients, expanding the dataset indirectly, distributing computational costs, and preserving privacy. However, key challenges remain unresolved. First, existing FL methods tend to optimize transferability only within local domains, neglecting the global learning domain. Second, most approaches rely on indirect transferability metrics, which do not accurately reflect the final target loss or true degree of transferability. To address these gaps, we propose two enhancements to FL. First, we introduce a client-server exchange protocol that leverages cross-client Jacobian (gradient) norms to boost transferability. Second, we increase the average Jacobian norm across clients at the server, using this as a local regularizer to reduce cross-client Jacobian variance. Our transferable federated algorithm, termed FedGTST (Federated Global Transferability via Statistics Tuning), demonstrates that increasing the average Jacobian and reducing its variance allows for tighter control of the target loss. This leads to an upper bound on the target loss in terms of the source loss and source-target domain discrepancy. Extensive experiments on datasets such as MNIST to MNIST-M and CIFAR10 to SVHN show that FedGTST outperforms relevant baselines, including FedSR. On the second dataset pair, FedGTST improves accuracy by 9.8% over FedSR and 7.6% over FedIIR when LeNet is used as the backbone.

[AI-133] LFOSum: Summarizing Long-form Opinions with Large Language Models

[AI-134] Hypothesis Testing the Circuit Hypothesis in LLMs

Link: https://arxiv.org/abs/2410.13032
Authors: Claudia Shi,Nicolas Beltran-Velez,Achille Nazaret,Carolina Zheng,Adrià Garriga-Alonso,Andrew Jesson,Maggie Makar,David M. Blei
Keywords: Large language models, demonstrate surprising capabilities, Large language, demonstrate surprising, surprising capabilities
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: Code available here: this https URL

Abstract:Large language models (LLMs) demonstrate surprising capabilities, but we do not understand how they are implemented. One hypothesis suggests that these capabilities are primarily executed by small subnetworks within the LLM, known as circuits. But how can we evaluate this hypothesis? In this paper, we formalize a set of criteria that a circuit is hypothesized to meet and develop a suite of hypothesis tests to evaluate how well circuits satisfy them. The criteria focus on the extent to which the LLM’s behavior is preserved, the degree of localization of this behavior, and whether the circuit is minimal. We apply these tests to six circuits described in the research literature. We find that synthetic circuits – circuits that are hard-coded in the model – align with the idealized properties. Circuits discovered in Transformer models satisfy the criteria to varying degrees. To facilitate future empirical studies of circuits, we created the circuitry package, a wrapper around the TransformerLens library, which abstracts away lower-level manipulations of hooks and activations. The software is available at this https URL.

[AI-135] Learning Representations for Reasoning: Generalizing Across Diverse Structures

[AI-136] LEGAL-UQA: A Low-Resource Urdu-English Dataset for Legal Question Answering

[AI-137] Hiding-in-Plain-Sight (HiPS) Attack on CLIP for Targetted Object Removal from Images NEURIPS2024

Link: https://arxiv.org/abs/2410.13010
Authors: Arka Daw,Megan Hong-Thanh Chung,Maria Mahbub,Amir Sadovnik
Keywords: Machine learning models, Machine learning, focused on single-modalities, Machine, attacks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*Comments: Published in the 3rd Workshop on New Frontiers in Adversarial Machine Learning at NeurIPS 2024. 10 pages, 7 figures, 3 tables

Abstract:Machine learning models are known to be vulnerable to adversarial attacks, but traditional attacks have mostly focused on single modalities. With the rise of large multi-modal models (LMMs) like CLIP, which combine vision and language capabilities, new vulnerabilities have emerged. However, prior work on multimodal targeted attacks aims to completely change the model’s output to what the adversary wants. In many realistic scenarios, an adversary might seek to make only subtle modifications to the output, so that the changes go unnoticed by downstream models or even by humans. We introduce Hiding-in-Plain-Sight (HiPS) attacks, a novel class of adversarial attacks that subtly modify model predictions by selectively concealing target object(s), as if the target object were absent from the scene. We propose two HiPS attack variants, HiPS-cls and HiPS-cap, and demonstrate their effectiveness in transferring to downstream image captioning models, such as CLIP-Cap, for targeted object removal from image captions.

[AI-138] Flex: End-to-End Text-Instructed Visual Navigation with Foundation Models

Link: https://arxiv.org/abs/2410.13002
Authors: Makram Chahine,Alex Quach,Alaa Maalouf,Tsun-Hsuan Wang,Daniela Rus
Keywords: learning directly maps, directly maps sensory, maps sensory inputs, complex robotics tasks, creating highly integrated
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*Comments:

Abstract:End-to-end learning directly maps sensory inputs to actions, creating highly integrated and efficient policies for complex robotics tasks. However, such models are tricky to efficiently train and often struggle to generalize beyond their training scenarios, limiting adaptability to new environments, tasks, and concepts. In this work, we investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies under unseen text instructions and visual distribution shifts. To this end, we design datasets with various levels of data representation richness, refine feature extraction protocols by leveraging multi-modal foundation model encoders, and assess the suitability of different policy network heads. Our findings are synthesized in Flex (Fly-lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors, generating spatially aware embeddings that integrate semantic and visual information. These rich features form the basis for training highly robust downstream policies capable of generalizing across platforms, environments, and text-specified tasks. We demonstrate the effectiveness of this approach on quadrotor fly-to-target tasks, where agents trained via behavior cloning on a small simulated dataset successfully generalize to real-world scenes, handling diverse novel goals and command formulations.

[AI-139] SSET: Swapping-Sliding Explanation for Time Series Classifiers in Affect Detection

Link: https://arxiv.org/abs/2410.12996
Authors: Nazanin Fouladgar,Marjan Alirezaie,Kary Främling
Keywords: models make specific, make specific decisions, recently received significant, received significant attention, significant attention due
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Local explanation of machine learning (ML) models has recently received significant attention due to its ability to reduce ambiguities about why the models make specific decisions. Extensive efforts have been invested to address explainability for different data types, particularly images. However, the work on multivariate time series data is limited. A possible reason is that the conflation of time and other variables in time series data can cause the generated explanations to be incomprehensible to humans. In addition, some efforts on time series fall short of providing accurate explanations as they either ignore a context in the time domain or impose differentiability requirements on the ML models. Such restrictions impede their ability to provide valid explanations in real-world applications and non-differentiable ML settings. In this paper, we propose a swapping–sliding decision explanation for multivariate time series classifiers, called SSET. The proposal consists of swapping and sliding stages, by which salient sub-sequences causing significant drops in the prediction score are presented as explanations. In the former stage, the important variables are detected by swapping the series of interest with close train data from target classes. In the latter stage, the salient observations of these variables are explored by sliding a window over each time step. Additionally, the model measures the importance of different variables over time in a novel way characterized by multiple factors. We leverage SSET on affect detection domain where evaluations are performed on two real-world physiological time series datasets, WESAD and MAHNOB-HCI, and a deep convolutional classifier, CN-Waterfall. This classifier has shown superior performance to prior models to detect human affective states. Comparing SSET with several benchmarks, including LIME, integrated gradients, and Dynamask, we found…

[AI-140] Qtok: A Comprehensive Framework for Evaluating Multilingual Tokenizer Quality in Large Language Models

Link: https://arxiv.org/abs/2410.12989
Authors: Iaroslav Chelombitko,Egor Safronov,Aleksey Komissarov
Keywords: Large Language Models, Large Language, considerable attention, quality, tokenizer quality
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*Comments: 24 pages, 9 figures, 6 tables. Code and data available at this https URL

[AI-141] Reinforcement Learning with Euclidean Data Augmentation for State-Based Continuous Control

Link: https://arxiv.org/abs/2410.12983
Authors: Jinzhu Luo,Dingyang Chen,Qi Zhang
Keywords: Data augmentation creates, Data augmentation, Euclidean data augmentation, Data, reinforcement learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Data augmentation creates new data points by transforming the original ones for a reinforcement learning (RL) agent to learn from, which has been shown to be effective for the objective of improving the data efficiency of RL for continuous control. Prior work towards this objective has been largely restricted to perturbation-based data augmentation where new data points are created by perturbing the original ones, which has been impressively effective for tasks where the RL agent observes control states as images with perturbations including random cropping, shifting, etc. This work focuses on state-based control, where the RL agent can directly observe raw kinematic and task features, and considers an alternative data augmentation applied to these features based on Euclidean symmetries under transformations like rotations. We show that the default state features used in existing benchmark tasks that are based on joint configurations are not amenable to Euclidean transformations. We therefore advocate using state features based on configurations of the limbs (i.e., the rigid bodies connected by the joints) that instead provide rich augmented data under Euclidean transformations. With minimal hyperparameter tuning, we show this new Euclidean data augmentation strategy significantly improves both data efficiency and asymptotic performance of RL on a wide range of continuous control tasks.
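
As a rough illustration of the limb-based Euclidean augmentation described in this abstract, the sketch below rotates limb positions and velocities about a vertical axis. The feature layout (one 3D position and one 3D velocity per limb), the choice of z as the vertical axis, and the names `rotate_z` and `augment_state` are assumptions made for illustration; the abstract does not specify the authors' implementation.

```python
import math
import random


def rotate_z(vec, angle):
    """Rotate a 3D vector about the z axis by `angle` radians."""
    x, y, z = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


def augment_state(limb_positions, limb_velocities, angle=None, rng=random):
    """Return rotated copies of hypothetical limb-based state features.

    Both positions and velocities are rotated by the same angle, so the
    augmented state describes the same motion seen from a rotated frame.
    If no angle is given, one is drawn uniformly from [0, 2*pi).
    """
    if angle is None:
        angle = rng.uniform(0.0, 2.0 * math.pi)
    new_positions = [rotate_z(p, angle) for p in limb_positions]
    new_velocities = [rotate_z(v, angle) for v in limb_velocities]
    return new_positions, new_velocities
```

In a full pipeline the action and next state would presumably need to be transformed consistently as well; that part depends on the task and is not sketched here.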

[AI-142] Flash Inference: Near Linear Time Inference for Long Convolution Sequence Models and Beyond

Link: https://arxiv.org/abs/2410.12982
Authors: Costin-Andrei Oncescu,Sanket Purandare,Stratos Idreos,Sham Kakade
Keywords: sequence generative models, computational cost remains, cost remains quadratic, recent advancements, sequence length
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: 15 pages, 9 figures, 5 algorithms

Abstract:While transformers have been at the core of most recent advancements in sequence generative models, their computational cost remains quadratic in sequence length. Several subquadratic architectures have been proposed to address this computational issue. Some of them, including long convolution sequence models (LCSMs), such as Hyena, address this issue at training time but remain quadratic during inference. We propose a method for speeding up LCSMs’ exact inference to quasilinear O(L log^2 L) time, identify the key properties that make this possible, and propose a general framework that exploits these. Our approach, inspired by previous work on relaxed polynomial interpolation, is based on a tiling which helps decrease memory movement and share computation. It has the added benefit of allowing for almost complete parallelization across layers of the position-mixing part of the architecture. Empirically, we provide a proof of concept implementation for Hyena, which gets up to 1.6× end-to-end improvement over standard inference by improving 50× within the position-mixing part.

[AI-143] Large Language Models as a Tool for Mining Object Knowledge

[AI-144] Long-Tailed Backdoor Attack Using Dynamic Data Augmentation Operations

Link: https://arxiv.org/abs/2410.12955
Authors: Lu Pang,Tao Sun,Weimin Lyu,Haibin Ling,Chao Chen
Keywords: increasing security threat, deep neural networks, attention of researchers, backdoor attack, increasing security
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments:

Abstract:Recently, backdoor attack has become an increasing security threat to deep neural networks and drawn the attention of researchers. Backdoor attacks exploit vulnerabilities in third-party pretrained models during the training phase, enabling them to behave normally for clean samples and mispredict for samples with specific triggers. Existing backdoor attacks mainly focus on balanced datasets. However, real-world datasets often follow long-tailed distributions. In this paper, for the first time, we explore backdoor attack on such datasets. Specifically, we first analyze the influence of data imbalance on backdoor attack. Based on our analysis, we propose an effective backdoor attack named Dynamic Data Augmentation Operation (D^2AO). We design D^2AO selectors to select operations depending jointly on the class, sample type (clean vs. backdoored) and sample features. Meanwhile, we develop a trigger generator to generate sample-specific triggers. Through simultaneous optimization of the backdoored model and trigger generator, guided by dynamic data augmentation operation selectors, we achieve significant advancements. Extensive experiments demonstrate that our method can achieve the state-of-the-art attack performance while preserving the clean accuracy.

[AI-145] A Note on Shumailov et al. (2024): 'AI Models Collapse When Trained on Recursively Generated Data'

Link: https://arxiv.org/abs/2410.12954
Authors: Ali Borji
Keywords: study conducted, conducted by Shumailov, Kernel Density Estimation, Shumailov, Density Estimation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: Comment on this https URL

Abstract:The study conducted by Shumailov et al. (2024) demonstrates that repeatedly training a generative model on synthetic data leads to model collapse. This finding has generated considerable interest and debate, particularly given that current models have nearly exhausted the available data. In this work, we investigate the effects of fitting a distribution (through Kernel Density Estimation, or KDE) or a model to the data, followed by repeated sampling from it. Our objective is to develop a theoretical understanding of the phenomenon observed by Shumailov et al. (2024). Our results indicate that the outcomes reported are a statistical phenomenon and may be unavoidable.
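
The fit-then-resample loop this abstract studies can be sketched in a few lines. This is a minimal sketch under stated assumptions: it swaps the KDE mentioned in the abstract for a plain Gaussian fit to stay within the standard library, and the function name `recursive_fit_and_sample` and its parameters are hypothetical rather than taken from the paper.

```python
import random
import statistics


def recursive_fit_and_sample(n_samples=50, generations=20, seed=0):
    """Repeatedly fit a Gaussian to data, then replace the data with samples from the fit.

    Returns the sample standard deviation observed at each generation,
    starting with the original data (generation 0).
    """
    rng = random.Random(seed)
    data = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    stdevs = [statistics.stdev(data)]
    for _ in range(generations):
        # Fit: estimate mean and standard deviation from the current data.
        mu = statistics.fmean(data)
        sigma = statistics.stdev(data)
        # Resample: the next generation sees only synthetic data.
        data = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        stdevs.append(statistics.stdev(data))
    return stdevs
```

Because each generation re-estimates its parameters from a finite sample, the estimated spread follows a random walk rather than staying fixed, which is one way to see why the abstract calls the effect statistical.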

[AI-146] Gradient Map-Assisted Head and Neck Tumor Segmentation: A Pre-RT to Mid-RT Approach in MRI-Guided Radiotherapy

Link: https://arxiv.org/abs/2410.12941
Authors: Jintao Ren,Kim Hochreuter,Mathis Ersted Rasmussen,Jesper Folsted Kallehauge,Stine Sofia Korreman
Keywords: Radiation therapy, gross tumor volume, head and neck, neck cancer, vital part
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
*Comments:

Abstract:Radiation therapy (RT) is a vital part of treatment for head and neck cancer, where accurate segmentation of gross tumor volume (GTV) is essential for effective treatment planning. This study investigates the use of pre-RT tumor regions and local gradient maps to enhance mid-RT tumor segmentation for head and neck cancer in MRI-guided adaptive radiotherapy. By leveraging pre-RT images and their segmentations as prior knowledge, we address the challenge of tumor localization in mid-RT segmentation. A gradient map of the tumor region from the pre-RT image is computed and applied to mid-RT images to improve tumor boundary delineation. Our approach demonstrated improved segmentation accuracy for both primary GTV (GTVp) and nodal GTV (GTVn), though performance was limited by data constraints. The final DSCagg scores from the challenge’s test set evaluation were 0.534 for GTVp, 0.867 for GTVn, and a mean score of 0.70. This method shows potential for enhancing segmentation and treatment planning in adaptive radiotherapy. Team: DCPT-Stine’s group.
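
The abstract reports DSCagg without defining it. The sketch below assumes the usual aggregated Dice definition, in which intersections and volumes are summed over all cases before dividing: DSCagg = 2 Σ|A_i ∩ B_i| / Σ(|A_i| + |B_i|). Representing each mask as a set of voxel indices, the function name `aggregated_dice`, and returning 1.0 when every mask is empty are all assumptions for illustration.

```python
def aggregated_dice(predictions, references):
    """Aggregated Dice over a list of cases.

    Each prediction and reference is an iterable of voxel indices
    belonging to the segmented structure in that case.
    """
    intersection = 0
    total = 0
    for pred, ref in zip(predictions, references):
        pred, ref = set(pred), set(ref)
        intersection += len(pred & ref)
        total += len(pred) + len(ref)
    if total == 0:
        # Assumed convention: all masks empty counts as perfect agreement.
        return 1.0
    return 2.0 * intersection / total
```

Unlike a per-case average, this form does not blow up on cases where the reference mask is empty, since such cases only add to the denominator.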

[AI-147] UMambaAdj: Advancing GTV Segmentation for Head and Neck Cancer in MRI-Guided RT with UMamba and nnU-Net ResEnc Planner

Link: https://arxiv.org/abs/2410.12940
Authors: Jintao Ren,Kim Hochreuter,Jesper Folsted Kallehauge,Stine Sofia Korreman
Keywords: Magnetic Resonance Imaging, Magnetic Resonance, Resonance Imaging, superior soft-tissue contrast, plays a crucial
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
*Comments:

Abstract:Magnetic Resonance Imaging (MRI) plays a crucial role in MRI-guided adaptive radiotherapy for head and neck cancer (HNC) due to its superior soft-tissue contrast. However, accurately segmenting the gross tumor volume (GTV), which includes both the primary tumor (GTVp) and lymph nodes (GTVn), remains challenging. Recently, two deep learning segmentation innovations have shown great promise: UMamba, which effectively captures long-range dependencies, and the nnU-Net Residual Encoder (ResEnc), which enhances feature extraction through multistage residual blocks. In this study, we integrate these strengths into a novel approach, termed ‘UMambaAdj’. Our proposed method was evaluated on the HNTS-MRG 2024 challenge test set using pre-RT T2-weighted MRI images, achieving an aggregated Dice Similarity Coefficient (DSCagg) of 0.751 for GTVp and 0.842 for GTVn, with a mean DSCagg of 0.796. This approach demonstrates potential for more precise tumor delineation in MRI-guided adaptive radiotherapy, ultimately improving treatment outcomes for HNC patients. Team: DCPT-Stine’s group.

[AI-148] SoK: On Finding Common Ground in Loss Landscapes Using Deep Model Merging Techniques

Link: https://arxiv.org/abs/2410.12927
Authors: Arham Khan,Todd Nief,Nathaniel Hudson,Mansi Sakarvadia,Daniel Grzenda,Aswathy Ajith,Jordan Pettyjohn,Kyle Chard,Ian Foster
Keywords: loss landscape geometry, model merging, model, crucial to creating, creating reliable
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Understanding neural networks is crucial to creating reliable and trustworthy deep learning models. Most contemporary research in interpretability analyzes just one model at a time via causal intervention or activation analysis. Yet despite successes, these methods leave significant gaps in our understanding of the training behaviors of neural networks, how their inner representations emerge, and how we can predictably associate model components with task-specific behaviors. Seeking new insights from work in related fields, here we survey literature in the field of model merging, a field that aims to combine the abilities of various neural networks by merging their parameters and identifying task-specific model components in the process. We analyze the model merging literature through the lens of loss landscape geometry, an approach that enables us to connect observations from empirical studies on interpretability, security, model merging, and loss landscape analysis to phenomena that govern neural network training and the emergence of their inner representations. To systematize knowledge in this area, we present a novel taxonomy of model merging techniques organized by their core algorithmic principles. Additionally, we distill repeated empirical observations from the literature in these fields into characterizations of four major aspects of loss landscape geometry: mode convexity, determinism, directedness, and connectivity. We argue that by improving our understanding of the principles underlying model merging and loss landscape geometry, this work contributes to the goal of ensuring secure and trustworthy machine learning in practice.

[AI-149] Boosting Asynchronous Decentralized Learning with Model Fragmentation

Link: https://arxiv.org/abs/2410.12918
Authors: Sayan Biswas,Anne-Marie Kermarrec,Alexis Marouani,Rafael Pires,Rishi Sharma,Martijn De Vos
Keywords: train machine learning, sharing raw data, collaboratively train machine, Decentralized learning, machine learning models
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Decentralized learning (DL) is an emerging technique that allows nodes on the web to collaboratively train machine learning models without sharing raw data. Dealing with stragglers, i.e., nodes with slower compute or communication than others, is a key challenge in DL. We present DivShare, a novel asynchronous DL algorithm that achieves fast model convergence in the presence of communication stragglers. DivShare achieves this by having nodes fragment their models into parameter subsets and send, in parallel to computation, each subset to a random sample of other nodes instead of sequentially exchanging full models. The transfer of smaller fragments allows more efficient usage of the collective bandwidth and enables nodes with slow network links to quickly contribute with at least some of their model parameters. By theoretically proving the convergence of DivShare, we provide, to the best of our knowledge, the first formal proof of convergence for a DL algorithm that accounts for the effects of asynchronous communication with delays. We experimentally evaluate DivShare against two state-of-the-art DL baselines, AD-PSGD and Swift, and with two standard datasets, CIFAR-10 and MovieLens. We find that DivShare with communication stragglers lowers time-to-accuracy by up to 3.9x compared to AD-PSGD on the CIFAR-10 dataset. Compared to baselines, DivShare also achieves up to 19.4% better accuracy and 9.5% lower test loss on the CIFAR-10 and MovieLens datasets, respectively.
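
The fragment-and-scatter step described in this abstract might look roughly like the following. This is a sketch under assumptions: the abstract does not say how parameters are partitioned or how recipients are drawn, so the random partition, the `fanout` parameter, and the names `fragment_model` and `plan_transfers` are hypothetical.

```python
import random


def fragment_model(params, num_fragments, rng):
    """Randomly partition parameter indices into `num_fragments` disjoint subsets."""
    indices = list(range(len(params)))
    rng.shuffle(indices)
    return [indices[i::num_fragments] for i in range(num_fragments)]


def plan_transfers(params, node_ids, self_id, num_fragments, fanout, seed=0):
    """Pair each model fragment with a random sample of peer nodes.

    Returns a list of (recipients, payload) tuples, where payload maps
    parameter index to value for one fragment.
    """
    rng = random.Random(seed)
    peers = [n for n in node_ids if n != self_id]
    plan = []
    for fragment in fragment_model(params, num_fragments, rng):
        recipients = rng.sample(peers, min(fanout, len(peers)))
        payload = {i: params[i] for i in fragment}
        plan.append((recipients, payload))
    return plan
```

Each payload is smaller than the full model, so a node on a slow link can finish sending at least some fragments while it keeps computing, which matches the motivation given in the abstract.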

[AI-150] Fair Clustering for Data Summarization: Improved Approximation Algorithms and Complexity Insights

Link: https://arxiv.org/abs/2410.12913
Authors: Ameet Gadekar,Aristides Gionis,Suhas Thejaswi
Keywords: called cluster centers, Data summarization tasks, data points, called cluster, Data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Discrete Mathematics (cs.DM)
*Comments:

Abstract:Data summarization tasks are often modeled as k-clustering problems, where the goal is to choose k data points, called cluster centers, that best represent the dataset by minimizing a clustering objective. A popular objective is to minimize the maximum distance between any data point and its nearest center, which is formalized as the k-center problem. While in some applications all data points can be chosen as centers, in the general setting, centers must be chosen from a predefined subset of points, referred to as facilities or suppliers; this is known as the k-supplier problem. In this work, we focus on fair data summarization modeled as the fair k-supplier problem, where data consists of several groups, and a minimum number of centers must be selected from each group while minimizing the k-supplier objective. The groups can be disjoint or overlapping, leading to two distinct problem variants each with different computational complexity. We present 3-approximation algorithms for both variants, improving the previously known factor of 5. For disjoint groups, our algorithm runs in polynomial time, while for overlapping groups, we present a fixed-parameter tractable algorithm, where the exponential runtime depends only on the number of groups and centers. We show that these approximation factors match the theoretical lower bounds, assuming standard complexity theory conjectures. Finally, using an open-source implementation, we demonstrate the scalability of our algorithms on large synthetic datasets and assess the price of fairness on real-world data, comparing solution quality with and without fairness constraints.
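
To make the fair k-supplier objective concrete, the sketch below evaluates a candidate solution: it computes the maximum distance from any data point to its nearest chosen center and checks the per-group minimums. It only scores a given set of centers and is not the 3-approximation algorithm from the paper. Euclidean distance and the names `k_supplier_cost` and `satisfies_group_minimums` are assumptions for illustration.

```python
import math


def k_supplier_cost(clients, centers):
    """Maximum over clients of the distance to the nearest chosen center."""
    return max(min(math.dist(c, f) for f in centers) for c in clients)


def satisfies_group_minimums(centers, groups, minimums):
    """Check that each group contributes at least its required number of centers.

    groups maps a group name to the set of facilities in that group
    (groups may overlap); minimums maps a group name to the required count.
    """
    chosen = set(centers)
    return all(len(chosen & groups[g]) >= need for g, need in minimums.items())
```

For example, with clients at (0, 0), (4, 0) and (10, 0) and centers at (1, 0) and (9, 0), the cost is driven by the middle client, whose nearest center is 3 units away.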

[AI-151] Large Language Models and the Rationalist Empiricist Debate

[AI-152] MIRROR: A Novel Approach for the Automated Evaluation of Open-Ended Question Generation NEURIPS2024

Link: https://arxiv.org/abs/2410.12893
Authors: Aniket Deroy,Subhankar Maity,Sudeshna Sarkar
Keywords: stimulate critical thinking, Automatic question generation, involves evaluating question, evaluating question quality, Automatic question
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*Comments: Accepted at FM-Eduassess @ NEURIPS 2024 (ORAL Paper)

[AI-153] Multi-trait User Simulation with Adaptive Decoding for Conversational Task Assistants EMNLP2024

[AI-154] Using Protected Attributes to Consider Fairness in Multi-Agent Systems

Link: https://arxiv.org/abs/2410.12889
Authors: Gabriele La Malfa,Jie M. Zhang,Michael Luck,Elizabeth Black
Keywords: resource division, extensively studied, goods allocation, bargaining systems, Fairness
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Fairness in Multi-Agent Systems (MAS) has been extensively studied, particularly in reward distribution among agents in scenarios such as goods allocation, resource division, lotteries, and bargaining systems. Fairness in MAS depends on various factors, including the system’s governing rules, the behaviour of the agents, and their characteristics. Yet, fairness in human society often involves evaluating disparities between disadvantaged and privileged groups, guided by principles of Equality, Diversity, and Inclusion (EDI). Taking inspiration from the work on algorithmic fairness, which addresses bias in machine learning-based decision-making, we define protected attributes for MAS as characteristics that should not disadvantage an agent in terms of its expected rewards. We adapt fairness metrics from the algorithmic fairness literature – namely, demographic parity, counterfactual fairness, and conditional statistical parity – to the multi-agent setting, where self-interested agents interact within an environment. These metrics allow us to evaluate the fairness of MAS, with the ultimate aim of designing MAS that do not disadvantage agents based on protected attributes.
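
One plausible reading of demographic parity in this setting is that expected reward should not differ across values of a protected attribute. The sketch below measures the largest gap in mean reward between groups. The abstract does not give the paper's formal definitions, so this formulation and the names `expected_reward_by_group` and `demographic_parity_gap` are assumptions for illustration.

```python
from statistics import fmean


def expected_reward_by_group(rewards, attributes):
    """Average observed reward per protected-attribute value.

    rewards maps agent id to reward; attributes maps agent id to the
    agent's protected-attribute value.
    """
    groups = {}
    for agent, reward in rewards.items():
        groups.setdefault(attributes[agent], []).append(reward)
    return {group: fmean(values) for group, values in groups.items()}


def demographic_parity_gap(rewards, attributes):
    """Largest difference in mean reward between any two groups (0 means parity)."""
    means = expected_reward_by_group(rewards, attributes)
    return max(means.values()) - min(means.values())
```

A gap near zero would indicate that, on average, no group is disadvantaged in reward; the other metrics named in the abstract condition on additional variables and are not sketched here.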

[AI-155] MIND: Math Informed syNthetic Dialogues for Pretraining LLMs

[AI-156] Navigating the Cultural Kaleidoscope: A Hitchhiker's Guide to Sensitivity in Large Language Models

[AI-157] Towards More Effective Table-to-Text Generation: Assessing In-Context Learning and Self-Evaluation with Open-Source Models

[AI-158] Improving Instruction-Following in Language Models through Activation Steering

[AI-159] Beyond Right and Wrong: Mitigating Cold Start in Knowledge Tracing Using Large Language Model and Option Weight

[AI-160] Skill Learning Using Process Mining for Large Language Model Plan Generation

Link: https://arxiv.org/abs/2410.12870
Authors: Andrei Cosmin Redis,Mohammadreza Fani Sani,Bahram Zarrin,Andrea Burattin
Keywords: Large language models, Large language, control flow models, hold promise, complex tasks
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*Comments: 12 pages, 5 figures, 2 tables, accepted at ICPM 2024

[AI-161] Language Model Preference Evaluation with Multiple Weak Evaluators

[AI-162] IMAS: A Comprehensive Agentic Approach to Rural Healthcare Delivery

[AI-163] Empowering Dysarthric Speech: Leveraging Advanced LLMs for Accurate Speech Correction and Multimodal Emotion Analysis

[AI-164] Towards Homogeneous Lexical Tone Decoding from Heterogeneous Intracranial Recordings

[AI-165] ELF-Gym: Evaluating Large Language Models Generated Features for Tabular Prediction

[AI-166] Investigating Implicit Bias in Large Language Models: A Large-Scale Study of Over 50 LLMs

[AI-167] Scaled and Inter-token Relation Enhanced Transformer for Sample-restricted Residential NILM

[AI-168] LLMD: A Large Language Model for Interpreting Longitudinal Medical Records

[AI-169] Enhancing Long Context Performance in LLMs Through Inner Loop Query Mechanism

[AI-170] Large Language Models for Medical OSCE Assessment: A Novel Approach to Transcript Analysis

[AI-171] Enterprise Benchmarks for Large Language Model Evaluation

[AI-172] Optimized Biomedical Question-Answering Services with LLM and Multi-BERT Integration ICDM

Link: https://arxiv.org/abs/2410.12856
Authors: Cheng Qian,Xianglong Shi,Shanshan Yao,Yichen Liu,Fengming Zhou,Zishu Zhang,Junaid Akram,Ali Braytee,Ali Anaissi
Keywords: integrating large language, Multi-BERT configurations, present a refined, integrating large, large language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*Comments: 10 pages, 12 figures, accepted and to be published in the proceedings of 2024 IEEE International Conference on Data Mining Workshops (ICDMW)

[AI-173] JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework

[AI-174] TPO: Aligning Large Language Models with Multi-branch Multi-step Preference Trees

[AI-175] Diversity of Thought Elicits Stronger Reasoning Capabilities in Multi-Agent Debate Frameworks

[AI-176] VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models

Link: https://arxiv.org/abs/2410.12851
Authors: Lisa Dunlap,Krishna Mandal,Trevor Darrell,Jacob Steinhardt,Joseph E Gonzalez
Keywords: Large language models, Large language, users intuitively recognize, intuitively recognize, struggle to quantify
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*Comments: 10 pages, unironic use of the word ‘vibe’

[AI-177] RecurFormer: Not All Transformer Heads Need Self-Attention

[AI-178] Prompt Engineering a Schizophrenia Chatbot: Utilizing a Multi-Agent Approach for Enhanced Compliance with Prompt Instructions

[AI-179] ACCEPT: Adaptive Codebook for Composite and Efficient Prompt Tuning EMNLP

[AI-180] Accurate and Regret-aware Numerical Problem Solver for Tabular Question Answering

[AI-181] Toward Relieving Clinician Burden by Automatically Generating Progress Notes using Interim Hospital Data

[AI-182] Exploring Prompt Engineering: A Systematic Review with SWOT Analysis

[AI-183] A Two-Model Approach for Humour Style Recognition

[AI-184] UniAutoML: A Human-Centered Framework for Unified Discriminative and Generative AutoML with Large Language Models

[AI-185] Capturing Bias Diversity in LLMs

Link: https://arxiv.org/abs/2410.12839
Authors: Purva Prasad Gosavi,Vaishnavi Murlidhar Kulkarni,Alan F. Smeaton
Keywords: Large Language Models, Large Language, paper presents research, enhancements to Large, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*Comments: 2nd International Conference on Foundation and Large Language Models (FLLM2024), 26-29 November, 2024 | Dubai, UAE

[AI-186] A Comprehensive Survey of Retrieval-Augmented Generation (RAG): Evolution, Current Landscape and Future Directions

[AI-187] EditRoom: LLM-parameterized Graph Diffusion for Composable 3D Room Layout Editing

Link: https://arxiv.org/abs/2410.12836
Authors: Kaizhi Zheng,Xiaotong Chen,Xuehai He,Jing Gu,Linjie Li,Zhengyuan Yang,Kevin Lin,Jianfeng Wang,Lijuan Wang,Xin Eric Wang
Keywords: steep learning curve, augmented reality, virtual reality, curve of professional, steep learning
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
*Comments:

Abstract:Given the steep learning curve of professional 3D software and the time-consuming process of managing large 3D assets, language-guided 3D scene editing has significant potential in fields such as virtual reality, augmented reality, and gaming. However, recent approaches to language-guided 3D scene editing either require manual interventions or focus only on appearance modifications without supporting comprehensive scene layout changes. In response, we propose EditRoom, a unified framework capable of executing a variety of layout edits through natural language commands, without requiring manual intervention. Specifically, EditRoom leverages Large Language Models (LLMs) for command planning and generates target scenes using a diffusion-based method, enabling six types of edits: rotate, translate, scale, replace, add, and remove. To address the lack of data for language-guided 3D scene editing, we have developed an automatic pipeline to augment existing 3D scene synthesis datasets and introduced EditRoom-DB, a large-scale dataset with 83k editing pairs, for training and evaluation. Our experiments demonstrate that our approach consistently outperforms other baselines across all metrics, indicating higher accuracy and coherence in language-guided scene layout editing.

[AI-188] A transformer-based deep reinforcement learning approach to spatial navigation in a partially observable Morris Water Maze

Link: https://arxiv.org/abs/2410.12820
Authors: Marte Eggen,Inga Strümke
Keywords: fundamental cognitive skill, cognitive skill extensively, skill extensively studied, gained substantial interest, Morris Water Maze
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Navigation is a fundamental cognitive skill extensively studied in neuroscientific experiments and has lately gained substantial interest in artificial intelligence research. Recreating the task solved by rodents in the well-established Morris Water Maze (MWM) experiment, this work applies a transformer-based architecture using deep reinforcement learning – an approach previously unexplored in this context – to navigate a 2D version of the maze. Specifically, the agent leverages a decoder-only transformer architecture serving as a deep Q-network performing effective decision making in the partially observable environment. We demonstrate that the proposed architecture enables the agent to efficiently learn spatial navigation strategies, overcoming challenges associated with a limited field of vision, corresponding to the visual information available to a rodent in the MWM. Demonstrating the potential of transformer-based models for enhancing navigation performance in partially observable environments, this work suggests promising avenues for future research in artificial agents whose behavior resembles that of biological agents. Finally, the flexibility of the transformer architecture in supporting varying input sequence lengths opens opportunities for gaining increased understanding of the artificial agent’s inner representation of the environment.

[AI-189] Interactive Explainable Anomaly Detection for Industrial Settings

Link: https://arxiv.org/abs/2410.12817
Authors: Daniel Gramelt,Timon Höfer,Ute Schmid
Keywords: Convolutional Neural Networks, production lines, recognise defects, key element, element of quality
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Being able to recognise defects in industrial objects is a key element of quality assurance in production lines. Our research focuses on visual anomaly detection in RGB images. Although Convolutional Neural Networks (CNNs) achieve high accuracies in this task, end users in industrial environments receive the model’s decisions without additional explanations. Therefore, it is of interest to enrich the model’s outputs with further explanations to increase confidence in the model and speed up anomaly detection. In our work, we focus on (1) CNN-based classification models and (2) the further development of a model-agnostic explanation algorithm for black-box classifiers. Additionally, (3) we demonstrate how we can establish an interactive interface that allows users to further correct the model’s output. We present our NearCAIPI Interaction Framework, which improves AI through user interaction, and show how this approach increases the system’s trustworthiness. We also illustrate how NearCAIPI can integrate human feedback into an interactive process chain.

[AI-190] Optimizing and Evaluating Enterprise Retrieval-Augmented Generation (RAG): A Content Design Perspective

Link: https://arxiv.org/abs/2410.12812
Authors: Sarah Packowski,Inge Halilovic,Jenifer Schlotfeldt,Trish Smith
Keywords: large language models, Retrieval-augmented generation, language models, build customer-support, large language
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 6 pages, 4 figures, to be published in ICAAI 2024 conference proceedings

Abstract:Retrieval-augmented generation (RAG) is a popular technique for using large language models (LLMs) to build customer-support, question-answering solutions. In this paper, we share our team’s practical experience building and maintaining enterprise-scale RAG solutions that answer users’ questions about our software based on product documentation. Our experience has not always matched the most common patterns in the RAG literature. This paper focuses on solution strategies that are modular and model-agnostic. For example, our experience over the past few years - using different search methods and LLMs, and many knowledge base collections - has been that simple changes to the way we create knowledge base content can have a huge impact on our RAG solutions’ success. In this paper, we also discuss how we monitor and evaluate results. Common RAG benchmark evaluation techniques have not been useful for evaluating responses to novel user questions, so we have found a flexible, “human in the lead” approach is required.

[AI-191] Interpretable Rule-Based System for Radar-Based Gesture Sensing: Enhancing Transparency and Personalization in AI

Link: https://arxiv.org/abs/2410.12806
Authors: Sarah Seifi,Tobias Sukianto,Cecilia Carbonelli,Lorenzo Servadei,Robert Wille
Keywords: artificial intelligence, increasing demand, demand in artificial, effective and explainable, domains where safety
Categories: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: accepted at the 21st European Radar Conference, 4 pages, 2 figures

Abstract:The increasing demand in artificial intelligence (AI) for models that are both effective and explainable is critical in domains where safety and trust are paramount. In this study, we introduce MIRA, a transparent and interpretable multi-class rule-based algorithm tailored for radar-based gesture detection. Addressing the critical need for understandable AI, MIRA enhances user trust by providing insight into its decision-making process. We showcase the system’s adaptability through personalized rule sets that calibrate to individual user behavior, offering a user-centric AI experience. Alongside presenting a novel multi-class classification architecture, we share an extensive frequency-modulated continuous wave radar gesture dataset and evidence of the superior interpretability of our system through comparative analyses. Our research underscores MIRA’s ability to deliver both high interpretability and performance and emphasizes the potential for broader adoption of interpretable AI in safety-critical applications.

[AI-192] Design of an Efficient Fan-Shaped Clustered Trust-Based Routing Model with QoS Security-Aware Side-Chaining for IoV Deployments

Link: https://arxiv.org/abs/2410.12798
Authors: Sadaf Ravindra Suryawanshi,Praveen Gupta
Keywords: interconnected devices vehicles, Internet of Vehicles, expansion of Internet, devices vehicles, massive data traffic
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)

Abstract:The rapid expansion of Internet of Vehicles (IoV) deployments has necessitated the creation of efficient and secure routing models to manage the massive data traffic generated by interconnected devices and vehicles. For IoV deployments, we propose a novel fan-shaped trust-based routing model with Quality of Service (QoS) and security-aware side-chaining. Our method employs temporal levels of delay, throughput, Packet Delivery Ratio (PDR), and energy consumption to determine optimal routing paths, thereby ensuring efficient data transmissions. We employ the Bacterial Foraging Optimizer (BFO) algorithm to manage side-chains within the network, which dynamically adjusts side-chain configurations to optimize system performance. The technique of fan-shaped clustering is used to group nodes into efficient clusters, allowing for more efficient communication and resource utilization. We evaluate the proposed model through extensive experiments and performance analysis, and find that it significantly improves on existing blockchain-based security models. Our model achieves a 9.5% reduction in delay, a 10.5% improvement in throughput, a 2.9% improvement in PDR, and a 4.5% reduction in energy consumption compared to alternative approaches. In addition, we evaluate the model’s resistance to Sybil, Masquerading, and Flooding attacks, which are prevalent security threats for IoV deployments. Even under these attack scenarios, our model provides consistently higher QoS levels compared to existing solutions, ensuring uninterrupted and reliable data transmissions. In IoV deployments, the proposed routing model and side-chaining management approach have numerous applications and use cases, such as smart cities, industrial automation, healthcare systems, transportation networks, and environmental monitoring.

[AI-193] Disaggregating Embedding Recommendation Systems with FlexEMR

Link: https://arxiv.org/abs/2410.12794
Authors: Yibo Huang,Zhenning Yang,Jiarong Xing,Yi Dai,Yiming Qiu,Dingming Wu,Fan Lai,Ang Chen
Keywords: Efficiently serving embedding-based, serving embedding-based recommendation, large memory requirements, increasingly large memory, Efficiently serving
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Efficiently serving embedding-based recommendation (EMR) models remains a significant challenge due to their increasingly large memory requirements. Today’s practice splits the model across many monolithic servers, where a mix of GPUs, CPUs, and DRAM is provisioned in fixed proportions. This approach leads to suboptimal resource utilization and increased costs. Disaggregating embedding operations from neural network inference is a promising solution but raises novel networking challenges. In this paper, we discuss the design of FlexEMR for optimized EMR disaggregation. FlexEMR proposes two sets of techniques to tackle the networking challenges: Leveraging the temporal and spatial locality of embedding lookups to reduce data movement over the network, and designing an optimized multi-threaded RDMA engine for concurrent lookup subrequests. We outline the design space for each technique and present initial results from our early prototype.
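
The locality idea mentioned in this abstract can be illustrated with a small cache placed in front of a remote embedding store. This is only a minimal sketch under stated assumptions, not FlexEMR's actual design: the class `EmbeddingCache`, the function `fake_remote_fetch`, and the LRU eviction policy are all hypothetical choices made for illustration.

```python
from collections import OrderedDict


class EmbeddingCache:
    """Tiny LRU cache in front of a (hypothetical) remote embedding store."""

    def __init__(self, remote_fetch, capacity):
        self.remote_fetch = remote_fetch  # callable: embedding id -> vector
        self.capacity = capacity
        self.entries = OrderedDict()      # insertion order doubles as recency order
        self.hits = 0
        self.misses = 0

    def lookup(self, emb_id):
        if emb_id in self.entries:
            self.entries.move_to_end(emb_id)  # mark as most recently used
            self.hits += 1
            return self.entries[emb_id]
        self.misses += 1
        vector = self.remote_fetch(emb_id)    # stands in for a network round trip
        self.entries[emb_id] = vector
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        return vector


def fake_remote_fetch(emb_id):
    # Placeholder for a remote lookup; returns a deterministic 2-d vector.
    return [float(emb_id), float(emb_id) * 0.5]


cache = EmbeddingCache(fake_remote_fetch, capacity=2)
for emb_id in [7, 7, 3, 7, 9, 3]:
    cache.lookup(emb_id)
print(cache.hits, cache.misses)
```

Repeated lookups of a hot id (7 in this trace) are served locally, so only the misses would cross the network. A real system would also have to weigh cache consistency and batching, which this sketch ignores.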

[AI-194] Environment Scan of Generative AI Infrastructure for Clinical and Translational Science

Link: https://arxiv.org/abs/2410.12793
Authors: Betina Idnay,Zihan Xu,William G. Adams,Mohammad Adibuzzaman,Nicholas R. Anderson,Neil Bahroos,Douglas S. Bell,Cody Bumgardner,Thomas Campion,Mario Castro,James J. Cimino,I. Glenn Cohen,David Dorr,Peter L Elkin,Jungwei W. Fan,Todd Ferris,David J. Foran,David Hanauer,Mike Hogarth,Kun Huang,Jayashree Kalpathy-Cramer,Manoj Kandpal,Niranjan S. Karnik,Avnish Katoch,Albert M. Lai,Christophe G. Lambert,Lang Li,Christopher Lindsell,Jinze Liu,Zhiyong Lu,Yuan Luo,Peter McGarvey,Eneida A. Mendonca,Parsa Mirhaji,Shawn Murphy,John D. Osborne,Ioannis C. Paschalidis,Paul A. Harris,Fred Prior,Nicholas J. Shaheen,Nawar Shara,Ida Sim,Umberto Tachinardi,Lemuel R. Waitman,Rosalind J. Wright,Adrian H. Zai,Kai Zheng,Sandra Soo-Jin Lee,Bradley A. Malin,Karthik Natarajan,W. Nicholson Price II,Rui Zhang,Yiye Zhang,Hua Xu,Jiang Bian,Chunhua Weng,Yifan Peng
Keywords: Translational Science Award, Advancing Translational Sciences, translational science, Advancing Translational, Science Award
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Abstract:This study reports a comprehensive environmental scan of the generative AI (GenAI) infrastructure in the national network for clinical and translational science across 36 institutions supported by the Clinical and Translational Science Award (CTSA) Program led by the National Center for Advancing Translational Sciences (NCATS) of the National Institutes of Health (NIH) in the United States. With the rapid advancement of GenAI technologies, including large language models (LLMs), healthcare institutions face unprecedented opportunities and challenges. This research explores the current status of GenAI integration, focusing on stakeholder roles, governance structures, and ethical considerations by administering a survey among leaders of health institutions (i.e., representing academic medical centers and health systems) to assess the institutional readiness and approach towards GenAI adoption. Key findings indicate a diverse range of institutional strategies, with most organizations in the experimental phase of GenAI deployment. The study highlights significant variations in governance models, with a strong preference for centralized decision-making but notable gaps in workforce training and ethical oversight. Moreover, the results underscore the need for a more coordinated approach to GenAI governance, emphasizing collaboration among senior leaders, clinicians, information technology staff, and researchers. Our analysis also reveals concerns regarding GenAI bias, data security, and stakeholder trust, which must be addressed to ensure the ethical and effective implementation of GenAI technologies. This study offers valuable insights into the challenges and opportunities of GenAI integration in healthcare, providing a roadmap for institutions aiming to leverage GenAI for improved quality of care and operational efficiency.

[AI-195] Data-adaptive Differentially Private Prompt Synthesis for In-Context Learning

Link: https://arxiv.org/abs/2410.12085
Authors: Fengyu Gao,Ruida Zhou,Tianhao Wang,Cong Shen,Jing Yang
Keywords: Large Language Models, Large Language, Language Models, perform in-context learning, contextual information embedded
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Large Language Models (LLMs) rely on the contextual information embedded in examples/demonstrations to perform in-context learning (ICL). To mitigate the risk of LLMs potentially leaking private information contained in examples in the prompt, we introduce a novel data-adaptive differentially private algorithm called AdaDPSyn to generate synthetic examples from the private dataset and then use these synthetic examples to perform ICL. The objective of AdaDPSyn is to adaptively adjust the noise level in the data synthesis mechanism according to the inherent statistical properties of the data, thereby preserving high ICL accuracy while maintaining formal differential privacy guarantees. A key innovation in AdaDPSyn is the Precision-Focused Iterative Radius Reduction technique, which dynamically refines the aggregation radius - the scope of data grouping for noise addition - based on patterns observed in data clustering, thereby minimizing the amount of additive noise. We conduct extensive experiments on standard benchmarks and compare AdaDPSyn with DP few-shot generation algorithm (Tang et al., 2023). The experiments demonstrate that AdaDPSyn not only outperforms DP few-shot generation, but also maintains high accuracy levels close to those of non-private baselines, providing an effective solution for ICL with privacy protection.
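
To see why a smaller aggregation radius means less additive noise, consider the textbook Gaussian mechanism applied to a clipped mean. The sketch below is not AdaDPSyn and not its radius-reduction technique; it is a generic illustration that assumes scalar values, replace-one neighbouring datasets, and the classical noise bound that holds for 0 < epsilon < 1. The function names are hypothetical.

```python
import math
import random


def clip(value, center, radius):
    """Project a scalar onto the interval [center - radius, center + radius]."""
    return max(center - radius, min(center + radius, value))


def gaussian_sigma(radius, n, epsilon, delta):
    """Noise scale for the mean of n values clipped to a ball of the given radius.

    Replacing one record moves the clipped mean by at most 2 * radius / n,
    and the classical Gaussian mechanism (valid for 0 < epsilon < 1) uses
    sigma = sensitivity * sqrt(2 * ln(1.25 / delta)) / epsilon.
    """
    sensitivity = 2.0 * radius / n
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def private_mean(values, center, radius, epsilon, delta, rng):
    """Clipped mean plus Gaussian noise calibrated to the clipping radius."""
    clipped = [clip(v, center, radius) for v in values]
    sigma = gaussian_sigma(radius, len(clipped), epsilon, delta)
    return sum(clipped) / len(clipped) + rng.gauss(0.0, sigma)


wide = gaussian_sigma(radius=1.0, n=100, epsilon=0.5, delta=1e-5)
tight = gaussian_sigma(radius=0.25, n=100, epsilon=0.5, delta=1e-5)
print(wide / tight)  # noise scale shrinks linearly with the radius

rng = random.Random(0)
estimate = private_mean([0.1, 0.2, 0.15, 0.12], 0.15, 0.25, 0.5, 1e-5, rng)
```

The noise scale is proportional to the radius, so tightly clustered data that allows a smaller radius needs proportionally less noise for the same (epsilon, delta).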

[AI-196] Predicting the Geolocation of Tweets Using transformer models on Customized Data


[AI-197] Rapid and Automated Alloy Design with Graph Neural Network-Powered LLM-Driven Multi-Agent Systems

Link: https://arxiv.org/abs/2410.13768
Authors: Alireza Ghafarollahi,Markus J. Buehler
Keywords: integrating multimodal data, external knowledge including, knowledge including insights, integrating multimodal, multimodal data
Categories: Materials Science (cond-mat.mtrl-sci); Disordered Systems and Neural Networks (cond-mat.dis-nn); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Abstract:A multi-agent AI model is used to automate the discovery of new metallic alloys, integrating multimodal data and external knowledge including insights from physics via atomistic simulations. Our multi-agent system features three key components: (a) a suite of LLMs responsible for tasks such as reasoning and planning, (b) a group of AI agents with distinct roles and expertise that dynamically collaborate, and (c) a newly developed graph neural network (GNN) model for rapid retrieval of key physical properties. A set of LLM-driven AI agents collaborate to automate the exploration of the vast design space of multi-principal element alloys (MPEAs), guided by predictions from the GNN. We focus on the NbMoTa family of body-centered cubic (bcc) alloys, modeled using an ML-based interatomic potential, and target two key properties: the Peierls barrier and solute/screw dislocation interaction energy. Our GNN model accurately predicts these atomic-scale properties, providing a faster alternative to costly brute-force calculations and reducing the computational burden on multi-agent systems for physics retrieval. This AI system revolutionizes materials discovery by reducing reliance on human expertise and overcoming the limitations of direct all-atom simulations. By synergizing the predictive power of GNNs with the dynamic collaboration of LLM-based agents, the system autonomously navigates vast alloy design spaces, identifying trends in atomic-scale material properties and predicting macro-scale mechanical strength, as demonstrated by several computational experiments. This approach accelerates the discovery of advanced alloys and holds promise for broader applications in other complex systems, marking a significant step forward in automated materials design.

[AI-198] OAH-Net: A Deep Neural Network for Hologram Reconstruction of Off-axis Digital Holographic Microscope

Link: https://arxiv.org/abs/2410.13592
Authors: Wei Liu,Kerem Delikoyun,Qianyu Chen,Alperen Yildiz,Si Ko Myo,Win Sen Kuan,John Tshon Yit Soong,Matthew Edward Cove,Oliver Hayden,Hweekuan Lee
Keywords: label-free imaging technology, large-scale cellular imaging, digital holographic microscopy, Off-axis digital holographic, label-free imaging
Categories: Optics (physics.optics); Artificial Intelligence (cs.AI)
Comments: 11 pages, 4 figures

Abstract:Off-axis digital holographic microscopy is a high-throughput, label-free imaging technology that provides three-dimensional, high-resolution information about samples, particularly useful in large-scale cellular imaging. However, the hologram reconstruction process poses a significant bottleneck for timely data analysis. To address this challenge, we propose a novel reconstruction approach that integrates deep learning with the physical principles of off-axis holography. We initialized part of the network weights based on the physical principle and then fine-tuned them via weakly supervised learning. Our off-axis hologram network (OAH-Net) retrieves phase and amplitude images with errors that fall within the measurement error range attributable to hardware, and its reconstruction speed significantly surpasses the microscope’s acquisition rate. Crucially, OAH-Net demonstrates remarkable external generalization capabilities on unseen samples with distinct patterns and can be seamlessly integrated with other models for downstream tasks to achieve end-to-end real-time hologram analysis. This capability further expands off-axis holography’s applications in both biological and medical studies.

[AI-199] RGB to Hyperspectral: Spectral Reconstruction for Enhanced Surgical Imaging

Link: https://arxiv.org/abs/2410.13570
Authors: Tobias Czempiel,Alfie Roddan,Maria Leiloglou,Zepeng Hu,Kevin O’Neill,Giulio Anichini,Danail Stoyanov,Daniel Elson
Keywords: in-house neurosurgery dataset, signatures from RGB, RGB data, enhance surgical imaging, utilizing the publicly
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures, 3 tables

Abstract:This study investigates the reconstruction of hyperspectral signatures from RGB data to enhance surgical imaging, utilizing the publicly available HeiPorSPECTRAL dataset from porcine surgery and an in-house neurosurgery dataset. Various architectures based on convolutional neural networks (CNNs) and transformer models are evaluated using comprehensive metrics. Transformer models exhibit superior performance in terms of RMSE, SAM, PSNR and SSIM by effectively integrating spatial information to predict accurate spectral profiles, encompassing both visible and extended spectral ranges. Qualitative assessments demonstrate the capability to predict spectral profiles critical for informed surgical decision-making during procedures. Challenges associated with capturing both the visible and extended hyperspectral ranges are highlighted using the MAE, emphasizing the complexities involved. The findings open up the new research direction of hyperspectral reconstruction for surgical applications and clinical use cases in real-time surgical environments.
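
Two of the metrics named in this abstract, RMSE and SAM (spectral angle mapper), are easy to state for a single pixel's spectrum. The following is a minimal sketch using the standard definitions; the function names and the toy spectra are illustrative assumptions and are not taken from the paper's evaluation code.

```python
import math


def rmse(pred, target):
    """Root-mean-square error between two spectra of equal length."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def spectral_angle(pred, target):
    """Spectral angle in radians: the angle between the two spectra seen as vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp to guard against rounding error
    return math.acos(cosine)


target = [0.2, 0.4, 0.6]   # toy reference spectrum (3 bands)
scaled = [0.4, 0.8, 1.2]   # same spectral shape, twice the intensity

print(rmse(scaled, target))            # large: intensities differ
print(spectral_angle(scaled, target))  # close to zero: shapes match
```

The example shows why the two metrics are usually reported together: SAM ignores a uniform brightness change, while RMSE penalizes it.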

[AI-200] DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech NEURIPS2024

Link: https://arxiv.org/abs/2410.13342
Authors: Jan Melechovsky,Ambuj Mehrish,Berrak Sisman,Dorien Herremans
Keywords: Recent advancements, systems have enabled, textual input, Accented TTS aims, enabled the generation
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: Accepted in Audio Imagination workshop of NeurIPS 2024

Abstract:Recent advancements in Text-to-Speech (TTS) systems have enabled the generation of natural and expressive speech from textual input. Accented TTS aims to enhance user experience by making the synthesized speech more relatable to minority group listeners, and useful across various applications and contexts. Speech synthesis can further be made more flexible by allowing users to choose any combination of speaker identity and accent, resulting in a wide range of personalized speech outputs. Current models struggle to disentangle speaker and accent representation, making it difficult to accurately imitate different accents while maintaining the same speaker characteristics. We propose a novel approach to disentangle speaker and accent representations using multi-level variational autoencoders (ML-VAE) and vector quantization (VQ) to improve flexibility and enhance personalization in speech synthesis. Our proposed method addresses the challenge of effectively separating speaker and accent characteristics, enabling more fine-grained control over the synthesized speech. Code and speech samples are publicly available.

[AI-201] Active inference and deep generative modeling for cognitive ultrasound

Link: https://arxiv.org/abs/2410.13310
Authors: Ruud JG van Sloun
Keywords: unique potential, potential to offer, offer access, access to medical, Ultrasound
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Ultrasound (US) has the unique potential to offer access to medical imaging to anyone, everywhere. Devices have become ultra-portable and cost-effective, akin to the stethoscope. Nevertheless, US image quality and diagnostic efficacy are still highly operator- and patient-dependent. In difficult-to-image patients, image quality is often insufficient for reliable diagnosis. In this paper, we put forth that US imaging systems can be recast as information-seeking agents that engage in reciprocal interactions with their anatomical environment. Such agents autonomously adapt their transmit-receive sequences to fully personalize imaging and actively maximize information gain in-situ. To that end, we will show that the sequence of pulse-echo experiments that a US system performs can be interpreted as a perception-action loop: the action is the data acquisition, probing tissue with acoustic waves and recording reflections at the detection array, and perception is the inference of the anatomical and/or functional state, potentially including associated diagnostic quantities. We then equip systems with a mechanism to actively reduce uncertainty and maximize diagnostic value across a sequence of experiments, treating action and perception jointly using Bayesian inference given generative models of the environment and action-conditional pulse-echo observations. Since the representation capacity of the generative models dictates both the quality of inferred anatomical states and the effectiveness of inferred sequences of future imaging actions, we will be greatly leveraging the enormous advances in deep generative modelling that are currently disrupting many fields and society at large. Finally, we show some examples of cognitive, closed-loop US systems that perform active beamsteering and adaptive scanline selection, based on deep generative models that track anatomical belief states.

[AI-202] AI-Driven Autonomous Control of Proton-Boron Fusion Reactors Using Backpropagation Neural Networks

Link: https://arxiv.org/abs/2410.12871
Authors: Michele Laurelli
Keywords: neutron-free energy generation, path towards sustainable, neutron-free energy, energy generation, presents a promising
Categories: Plasma Physics (physics.plasm-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:Proton-boron (p-11B) fusion presents a promising path towards sustainable, neutron-free energy generation. However, its implementation is hindered by extreme operational conditions, such as plasma temperatures exceeding billions of degrees and the complexity of controlling high-energy particles. Traditional control systems face significant challenges in managing the highly dynamic and non-linear behavior of the plasma. In this paper, we propose a novel approach utilizing backpropagation-based neural networks to autonomously control key parameters in a proton-boron fusion reactor. Our method leverages real-time feedback and learning from physical data to adapt to changing plasma conditions, offering a potential breakthrough in stable and efficient p-11B fusion. Furthermore, we expand on the scalability and generalization of our approach to other fusion systems and future AI technologies.

[AI-203] Segment as You Wish – Free-Form Language-Based Segmentation for Medical Images

Link: https://arxiv.org/abs/2410.12831
Authors: Longchao Da,Rui Wang,Xiaojian Xu,Parminder Bhatia,Taha Kass-Hout,Hua Wei,Cao Xiao
Keywords: patient health condition, ensure precise diagnosis, health condition, treatment planning, imaging is crucial
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Medical imaging is crucial for diagnosing a patient’s health condition, and accurate segmentation of these images is essential for isolating regions of interest to ensure precise diagnosis and treatment planning. Existing methods primarily rely on bounding boxes or point-based prompts, while few have explored text-related prompts, despite clinicians often describing their observations and instructions in natural language. To address this gap, we first propose a RAG-based free-form text prompt generator that leverages the domain corpus to generate diverse and realistic descriptions. Then, we introduce FLanS, a novel medical image segmentation model that handles various free-form text prompts, including professional anatomy-informed queries, anatomy-agnostic position-driven queries, and anatomy-agnostic size-driven queries. Additionally, our model incorporates a symmetry-aware canonicalization module to ensure consistent, accurate segmentations across varying scan orientations and reduce confusion between the anatomical position of an organ and its appearance in the scan. FLanS is trained on a large-scale dataset of over 100k medical images from 7 public datasets. Comprehensive experiments demonstrate the model’s superior language understanding and segmentation precision, along with a deep comprehension of the relationship between them, outperforming SOTA baselines on both in-domain and out-of-domain datasets.

[AI-204] Incorporating Metabolic Information into LLMs for Anomaly Detection in Clinical Time-Series

Link: https://arxiv.org/abs/2410.12830
Authors: Maxx Richard Rahman,Ruoxuan Liu,Wolfgang Maass
Keywords: time-series holds significant, holds significant potential, clinical time-series holds, identifying suspicious patterns, time-series holds
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Anomaly detection in clinical time-series holds significant potential in identifying suspicious patterns in different biological parameters. In this paper, we propose a targeted method that incorporates the clinical domain knowledge into LLMs to improve their ability to detect anomalies. We introduce the Metabolism Pathway-driven Prompting (MPP) method, which integrates the information about metabolic pathways to better capture the structural and temporal changes in biological samples. We applied our method for doping detection in sports, focusing on steroid metabolism, and evaluated using real-world data from athletes. The results show that our method improves anomaly detection performance by leveraging metabolic context, providing a more nuanced and accurate prediction of suspicious samples in athletes’ profiles.

[AI-205] A Hierarchical conv-LSTM and LLM Integrated Model for Holistic Stock Forecasting

Link: https://arxiv.org/abs/2410.12807
Authors: Arya Chakraborty,Auhona Basu
Keywords: multifaceted data sources, financial domain presents, Convolutional Neural Networks, Long Short-Term Memory, Conv-LSTM Neural Network
Categories: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 2 figures, 2 tables

Abstract:The financial domain presents a complex environment for stock market prediction, characterized by volatile patterns and the influence of multifaceted data sources. Traditional models have leveraged either Convolutional Neural Networks (CNN) for spatial feature extraction or Long Short-Term Memory (LSTM) networks for capturing temporal dependencies, with limited integration of external textual data. This paper proposes a novel Two-Level Conv-LSTM Neural Network integrated with a Large Language Model (LLM) for comprehensive stock advising. The model harnesses the strengths of Conv-LSTM for analyzing time-series data and LLM for processing and understanding textual information from financial news, social media, and reports. In the first level, convolutional layers are employed to identify local patterns in historical stock prices and technical indicators, followed by LSTM layers to capture the temporal dynamics. The second level integrates the output with an LLM that analyzes sentiment and contextual information from textual data, providing a holistic view of market conditions. The combined approach aims to improve prediction accuracy and provide contextually rich stock advising.

Computer Vision

[CV-0] UniDrive: Towards Universal Driving Perception Across Camera Configurations

Link: https://arxiv.org/abs/2410.13864
Authors: Ye Li,Wenzhao Zheng,Xiaonan Huang,Kurt Keutzer
Keywords: demonstrated excellent performance, Vision-centric autonomous driving, demonstrated excellent, camera, driving
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint; 14 pages, 5 figures, 2 tables; Code at this https URL

Abstract:Vision-centric autonomous driving has demonstrated excellent performance with economical sensors. As the fundamental step, 3D perception aims to infer 3D information from 2D images based on 3D-2D projection. This makes driving perception models susceptible to sensor configuration (e.g., camera intrinsics and extrinsics) variations. However, generalizing across camera configurations is important for deploying autonomous driving models on different car models. In this paper, we present UniDrive, a novel framework for vision-centric autonomous driving to achieve universal perception across camera configurations. We deploy a set of unified virtual cameras and propose a ground-aware projection method to effectively transform the original images into these unified virtual views. We further propose a virtual configuration optimization method by minimizing the expected projection error between original cameras and virtual cameras. The proposed virtual camera projection can be applied to existing 3D perception methods as a plug-and-play module to mitigate the challenges posed by camera parameter variability, resulting in more adaptable and reliable driving perception models. To evaluate the effectiveness of our framework, we collect a dataset on Carla by driving the same routes while only modifying the camera configurations. Experimental results demonstrate that our method trained on one specific camera configuration can generalize to varying configurations with minor performance degradation.
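
The sensitivity to camera intrinsics described in this abstract follows directly from the pinhole projection model. The sketch below is a generic illustration, not UniDrive's ground-aware projection or its virtual-camera optimization; the function name `project` and all numeric values are assumptions chosen for the example.

```python
def project(point_cam, fx, fy, cx, cy):
    """Project a 3D point in camera coordinates to pixel coordinates (pinhole model)."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    # Perspective divide, then scale by focal length and shift by principal point.
    return (fx * x / z + cx, fy * y / z + cy)


point = (1.0, 0.5, 10.0)  # metres, in the camera frame

# The same 3D point seen through two hypothetical cameras with different focal lengths.
narrow = project(point, fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)
wide = project(point, fx=500.0, fy=500.0, cx=640.0, cy=360.0)
print(narrow, wide)
```

Halving the focal length halves the point's offset from the principal point, so a model trained on images from one configuration sees the same scene geometry at different pixel locations under another.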

[CV-1] Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens

Link: https://arxiv.org/abs/2410.13863
Authors: Lijie Fan, Tianhong Li, Siyang Qin, Yuanzhen Li, Chen Sun, Michael Rubinstein, Deqing Sun, Kaiming He, Yonglong Tian
Keywords: proven as beneficial, models, continuous tokens, large language models, Scaling
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Tech report

Abstract:Scaling up autoregressive models in vision has not proven as beneficial as in large language models. In this work, we investigate this scaling problem in the context of text-to-image generation, focusing on two critical factors: whether models use discrete or continuous tokens, and whether tokens are generated in a random or fixed raster order using BERT- or GPT-like transformer architectures. Our empirical results show that, while all models scale effectively in terms of validation loss, their evaluation performance – measured by FID, GenEval score, and visual quality – follows different trends. Models based on continuous tokens achieve significantly better visual quality than those using discrete tokens. Furthermore, the generation order and attention mechanisms significantly affect the GenEval score: random-order models achieve notably better GenEval scores compared to raster-order models. Inspired by these findings, we train Fluid, a random-order autoregressive model on continuous tokens. Fluid 10.5B model achieves a new state-of-the-art zero-shot FID of 6.16 on MS-COCO 30K, and 0.69 overall score on the GenEval benchmark. We hope our findings and results will encourage future efforts to further bridge the scaling gap between vision and language models.

[CV-2] DepthSplat: Connecting Gaussian Splatting and Depth

Link: https://arxiv.org/abs/2410.13862
Authors: Haofei Xu, Songyou Peng, Fangjinhua Wang, Hermann Blum, Daniel Barath, Andreas Geiger, Marc Pollefeys
Keywords: Gaussian splatting, connect Gaussian splatting, depth estimation, Gaussian splatting reconstructions, Gaussian
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Gaussian splatting and single/multi-view depth estimation are typically studied in isolation. In this paper, we present DepthSplat to connect Gaussian splatting and depth estimation and study their interactions. More specifically, we first contribute a robust multi-view depth model by leveraging pre-trained monocular depth features, leading to high-quality feed-forward 3D Gaussian splatting reconstructions. We also show that Gaussian splatting can serve as an unsupervised pre-training objective for learning powerful depth models from large-scale unlabelled datasets. We validate the synergy between Gaussian splatting and depth estimation through extensive ablation and cross-task transfer experiments. Our DepthSplat achieves state-of-the-art performance on ScanNet, RealEstate10K and DL3DV datasets in terms of both depth estimation and novel view synthesis, demonstrating the mutual benefits of connecting both tasks. Our code, models, and video results are available at this https URL.

[CV-3] PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

Link: https://arxiv.org/abs/2410.13861
Authors: Rongyao Fang, Chengqi Duan, Kun Wang, Hao Li, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Hongsheng Li, Xihui Liu
Keywords: Recent advancements, unified MLLM, yielded significant progress, vision-language understanding, multimodal foundation models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Recent advancements in multimodal foundation models have yielded significant progress in vision-language understanding. Initial attempts have also explored the potential of multimodal large language models (MLLMs) for visual content generation. However, existing works have insufficiently addressed the varying granularity demands of different image generation tasks within a unified MLLM paradigm - from the diversity required in text-to-image generation to the precise controllability needed in image manipulation. In this work, we propose PUMA, emPowering Unified MLLM with Multi-grAnular visual generation. PUMA unifies multi-granular visual features as both inputs and outputs of MLLMs, elegantly addressing the different granularity requirements of various image generation tasks within a unified MLLM framework. Following multimodal pretraining and task-specific instruction tuning, PUMA demonstrates proficiency in a wide range of multimodal tasks. This work represents a significant step towards a truly unified MLLM capable of adapting to the granularity demands of various visual tasks. The code and model will be released in this https URL.

[CV-4] VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding

Link: https://arxiv.org/abs/2410.13860
Authors: Runsen Xu, Zhiwei Huang, Tai Wang, Yilun Chen, Jiangmiao Pang, Dahua Lin
Keywords: scene understanding, crucial for robots, requiring integration, integration of natural, natural language
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: CoRL 2024 Camera Ready. 25 pages. A novel zero-shot 3D visual grounding framework based solely on 2D images

Abstract:3D visual grounding is crucial for robots, requiring integration of natural language and 3D scene understanding. Traditional methods that depend on supervised learning with 3D point clouds are limited by scarce datasets. Recently, zero-shot methods leveraging LLMs have been proposed to address the data issue. While effective, these methods only use object-centric information, limiting their ability to handle complex queries. In this work, we present VLM-Grounder, a novel framework using vision-language models (VLMs) for zero-shot 3D visual grounding based solely on 2D images. VLM-Grounder dynamically stitches image sequences, employs a grounding and feedback scheme to find the target object, and uses a multi-view ensemble projection to accurately estimate 3D bounding boxes. Experiments on ScanRefer and Nr3D datasets show VLM-Grounder outperforms previous zero-shot methods, achieving 51.6% Acc@0.25 on ScanRefer and 48.0% Acc on Nr3D, without relying on 3D geometry or object priors. Code is available at this https URL.

[CV-5] gamma-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models

Link: https://arxiv.org/abs/2410.13859
Authors: Yaxin Luo, Gen Luo, Jiayi Ji, Yiyi Zhou, Xiaoshuai Sun, Zhiqiang Shen, Rongrong Ji
Keywords: large language models, multimodal large language, high computational cost, computational cost remains, MoD
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Despite the significant progress in multimodal large language models (MLLMs), their high computational cost remains a barrier to real-world deployment. Inspired by the mixture of depths (MoDs) in natural language processing, we aim to address this limitation from the perspective of "activated tokens". Our key insight is that if most tokens are redundant for the layer computation, they can be skipped directly via the MoD layer. However, directly converting the dense layers of MLLMs to MoD layers leads to substantial performance degradation. To address this issue, we propose an innovative MoD adaptation strategy for existing MLLMs called γ-MoD. In γ-MoD, a novel metric is proposed to guide the deployment of MoDs in the MLLM, namely rank of attention maps (ARank). Through ARank, we can effectively identify which layer is redundant and should be replaced with the MoD layer. Based on ARank, we further propose two novel designs to maximize the computational sparsity of the MLLM while maintaining its performance, namely a shared vision-language router and masked routing learning. With these designs, more than 90% of the dense layers of the MLLM can be effectively converted to MoD layers. To validate our method, we apply it to three popular MLLMs and conduct extensive experiments on 9 benchmark datasets. Experimental results not only validate the significant efficiency benefit of γ-MoD to existing MLLMs but also confirm its generalization ability on various MLLMs. For example, with a minor performance drop, i.e., -1.5%, γ-MoD can reduce the training and inference time of LLaVA-HR by 31.0% and 53.2%, respectively.
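
The abstract names the rank of attention maps (ARank) as the signal for spotting redundant layers but does not give its exact definition. The sketch below is therefore only an assumption about the general idea: compute the numerical rank of each head's attention map and average over heads, so that a layer whose maps are close to rank 1 looks redundant. The helper names, the tolerance, and the averaging are all hypothetical.

```python
def matrix_rank(matrix, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    rows = [[float(v) for v in row] for row in matrix]
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Choose the remaining row with the largest entry in this column.
        pivot = max(range(rank, n_rows), key=lambda r: abs(rows[r][col]))
        if abs(rows[pivot][col]) <= tol:
            continue  # no usable pivot in this column
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, n_rows):
            factor = rows[r][col] / rows[rank][col]
            for c in range(col, n_cols):
                rows[r][c] -= factor * rows[rank][c]
        rank += 1
    return rank


def mean_attention_rank(attention_heads):
    """Average rank over a list of per-head attention maps (hypothetical ARank proxy)."""
    return sum(matrix_rank(head) for head in attention_heads) / len(attention_heads)


# Every token attends uniformly: all rows are identical, so the map has rank 1.
uniform_head = [[0.25] * 4 for _ in range(4)]
# Every token attends only to itself: the map is the identity, so it has full rank.
diagonal_head = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

print(mean_attention_rank([uniform_head, diagonal_head]))  # 2.5
```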

[CV-6] Can MLLMs Understand the Deep Implication Behind Chinese Images?

Link: https://arxiv.org/abs/2410.13854
Authors: Chenhao Zhang, Xi Feng, Yuelin Bai, Xinrun Du, Jinchang Hou, Kaixin Deng, Guangzeng Han, Qinrui Li, Bingli Wang, Jiaheng Liu, Xingwei Qu, Yifei Zhang, Qixuan Zhao, Yiming Liang, Ziqiang Liu, Feiteng Fang, Min Yang, Wenhao Huang, Chenghua Lin, Ge Zhang, Shiwen Ni
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Chinese traditional culture
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments: 32 pages, 18 figures. Project Page: this https URL Code: this https URL Dataset: this https URL

[CV-7] Retrospective Learning from Interactions


[CV-8] Differentiable Robot Rendering

Link: https://arxiv.org/abs/2410.13851
Authors: Ruoshi Liu, Alper Canberk, Shuran Song, Carl Vondrick
Keywords: shown unprecedented reasoning, trained on massive, massive amounts, shown unprecedented, unprecedented reasoning
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project Page: this https URL

Abstract:Vision foundation models trained on massive amounts of visual data have shown unprecedented reasoning and planning skills in open-world settings. A key challenge in applying them to robotic tasks is the modality gap between visual data and action data. We introduce differentiable robot rendering, a method allowing the visual appearance of a robot body to be directly differentiable with respect to its control parameters. Our model integrates a kinematics-aware deformable model and Gaussian Splatting and is compatible with any robot form factor and number of degrees of freedom. We demonstrate its capability and usage in applications including reconstruction of robot poses from images and controlling robots through vision language models. Quantitative and qualitative results show that our differentiable rendering model provides effective gradients for robotic control directly from pixels, setting the foundation for future applications of vision foundation models in robotics.

[CV-9] Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation


[CV-10] D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement

Link: https://arxiv.org/abs/2410.13842
Authors: Yansong Peng, Hebei Li, Peixi Wu, Yueyi Zhang, Xiaoyan Sun, Feng Wu
Keywords: outstanding localization precision, bounding box regression, Global Optimal Localization, powerful real-time object, Fine-grained Distribution Refinement
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:We introduce D-FINE, a powerful real-time object detector that achieves outstanding localization precision by redefining the bounding box regression task in DETR models. D-FINE comprises two key components: Fine-grained Distribution Refinement (FDR) and Global Optimal Localization Self-Distillation (GO-LSD). FDR transforms the regression process from predicting fixed coordinates to iteratively refining probability distributions, providing a fine-grained intermediate representation that significantly enhances localization accuracy. GO-LSD is a bidirectional optimization strategy that transfers localization knowledge from refined distributions to shallower layers through self-distillation, while also simplifying the residual prediction tasks for deeper layers. Additionally, D-FINE incorporates lightweight optimizations in computationally intensive modules and operations, achieving a better balance between speed and accuracy. Specifically, D-FINE-L / X achieves 54.0% / 55.8% AP on the COCO dataset at 124 / 78 FPS on an NVIDIA T4 GPU. When pretrained on Objects365, D-FINE-L / X attains 57.1% / 59.3% AP, surpassing all existing real-time detectors. Furthermore, our method significantly enhances the performance of a wide range of DETR models by up to 5.3% AP with negligible extra parameters and training costs. Our code and pretrained models: this https URL.

[CV-11] VidPanos: Generative Panoramic Videos from Casual Panning Videos SIGGRAPH

Link: https://arxiv.org/abs/2410.13832
Authors: Jingwei Ma, Erika Lu, Roni Paiss, Shiran Zada, Aleksander Holynski, Tali Dekel, Brian Curless, Michael Rubinstein, Forrester Cole
Keywords: Panoramic image stitching, video, Panoramic image, panoramic video, wide-angle view
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project page at this https URL. To appear at SIGGRAPH Asia 2024 (conference track)

Abstract:Panoramic image stitching provides a unified, wide-angle view of a scene that extends beyond the camera’s field of view. Stitching frames of a panning video into a panoramic photograph is a well-understood problem for stationary scenes, but when objects are moving, a still panorama cannot capture the scene. We present a method for synthesizing a panoramic video from a casually-captured panning video, as if the original video were captured with a wide-angle camera. We pose panorama synthesis as a space-time outpainting problem, where we aim to create a full panoramic video of the same length as the input video. Consistent completion of the space-time volume requires a powerful, realistic prior over video content and motion, for which we adapt generative video models. Existing generative models do not, however, immediately extend to panorama completion, as we show. We instead apply video generation as a component of our panorama synthesis system, and demonstrate how to exploit the strengths of the models while minimizing their limitations. Our system can create video panoramas for a range of in-the-wild scenes including people, vehicles, and flowing water, as well as stationary background features.

[CV-12] DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control

Link: https://arxiv.org/abs/2410.13830
Authors: Yujie Wei, Shiwei Zhang, Hangjie Yuan, Xiang Wang, Haonan Qiu, Rui Zhao, Yutong Feng, Feng Liu, Zhizhong Huang, Jiaxin Ye, Yingya Zhang, Hongming Shan
Keywords: Recent advances, customized video generation, create videos tailored, motion control, motion
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Recent advances in customized video generation have enabled users to create videos tailored to both specific subjects and motion trajectories. However, existing methods often require complicated test-time fine-tuning and struggle with balancing subject learning and motion control, limiting their real-world applications. In this paper, we present DreamVideo-2, a zero-shot video customization framework capable of generating videos with a specific subject and motion trajectory, guided by a single image and a bounding box sequence, respectively, and without the need for test-time fine-tuning. Specifically, we introduce reference attention, which leverages the model’s inherent capabilities for subject learning, and devise a mask-guided motion module to achieve precise motion control by fully utilizing the robust motion signal of box masks derived from bounding boxes. While these two components achieve their intended functions, we empirically observe that motion control tends to dominate over subject learning. To address this, we propose two key designs: 1) the masked reference attention, which integrates a blended latent mask modeling scheme into reference attention to enhance subject representations at the desired positions, and 2) a reweighted diffusion loss, which differentiates the contributions of regions inside and outside the bounding boxes to ensure a balance between subject and motion control. Extensive experimental results on a newly curated dataset demonstrate that DreamVideo-2 outperforms state-of-the-art methods in both subject customization and motion control. The dataset, code, and models will be made publicly available.
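
The reweighted diffusion loss is described only at a high level: regions inside and outside the bounding boxes contribute differently. One plausible reading, sketched below, is a mask-weighted mean squared error. The weights, function names, and normalization are assumptions for illustration and are not taken from the paper.

```python
def box_mask(height, width, box):
    """Binary mask that is 1.0 inside box = (top, left, bottom, right), ends exclusive."""
    top, left, bottom, right = box
    return [
        [1.0 if top <= r < bottom and left <= c < right else 0.0 for c in range(width)]
        for r in range(height)
    ]


def reweighted_mse(pred, target, mask, inside_weight=2.0, outside_weight=1.0):
    """Weighted MSE in which pixels inside the box count more than those outside."""
    total = 0.0
    weight_sum = 0.0
    for pred_row, target_row, mask_row in zip(pred, target, mask):
        for p, t, m in zip(pred_row, target_row, mask_row):
            w = inside_weight if m > 0.5 else outside_weight
            total += w * (p - t) ** 2
            weight_sum += w
    return total / weight_sum


mask = box_mask(2, 2, (0, 0, 1, 1))          # only the top-left pixel is inside the box
target = [[0.0, 0.0], [0.0, 0.0]]
error_inside = [[1.0, 0.0], [0.0, 0.0]]      # unit error on the inside pixel
error_outside = [[0.0, 0.0], [0.0, 1.0]]     # unit error on an outside pixel

print(reweighted_mse(error_inside, target, mask))   # 0.4
print(reweighted_mse(error_outside, target, mask))  # 0.2
```

With these illustrative weights, the same error costs twice as much inside the box, which is one way to shift the balance between subject learning and motion control.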

[CV-13] Unearthing Skill-Level Insights for Understanding Trade-Offs of Foundation Models MICRO


[CV-14] Harnessing Webpage UIs for Text-Rich Visual Understanding


[CV-15] Deep Generative Models Unveil Patterns in Medical Images Through Vision-Language Conditioning NEURIPS2024

Link: https://arxiv.org/abs/2410.13823
Authors: Xiaodan Xing, Junzhi Ning, Yang Nan, Guang Yang
Keywords: enhancing dataset size, Deep generative models, significantly advanced medical, advanced medical imaging, medical imaging analysis
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AIM-FM Workshop of NeurIPS2024

Abstract:Deep generative models have significantly advanced medical imaging analysis by enhancing dataset size and quality. Beyond mere data augmentation, our research in this paper highlights an additional, significant capacity of deep generative models: their ability to reveal and demonstrate patterns in medical images. We employ a generative structure with hybrid conditions, combining clinical data and segmentation masks to guide the image synthesis process. Furthermore, we transform the tabular clinical data into textual descriptions. This approach simplifies the handling of missing values and also enables us to leverage large pre-trained vision-language models that investigate the relations between independent clinical entries and comprehend general terms, such as gender and smoking status. Our approach differs from, and presents a more challenging task than, traditional medical report-guided synthesis because our clinical information is less visually correlated with the images. To overcome this, we introduce a text-visual embedding mechanism that strengthens the conditions, ensuring the network effectively utilizes the provided information. Our pipeline is generalizable to both GAN-based and diffusion models. Experiments on chest CT, particularly focusing on smoking status, demonstrated a consistent intensity shift in the lungs, which is in agreement with clinical observations, indicating the effectiveness of our method in capturing and visualizing the impact of specific attributes on medical image patterns. Our methods offer a new avenue for the early detection and precise visualization of complex clinical conditions with deep generative models. All code is available at this https URL.

[CV-16] Multi-style conversion for semantic segmentation of lesions in fundus images by adversarial attacks


[CV-17] ConsisSR: Delving Deep into Consistency in Diffusion-based Image Super-Resolution

Link: https://arxiv.org/abs/2410.13807
Authors: Junhao Gu, Peng-Tao Jiang, Hao Zhang, Mi Zhou, Jinwei Chen, Wenming Yang, Bo Li
Keywords: Real-world image super-resolution, Real-world image, aims at restoring, restoring high-quality, complex degradations
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Real-world image super-resolution (Real-ISR) aims at restoring high-quality (HQ) images from low-quality (LQ) inputs corrupted by unknown and complex degradations. In particular, pretrained text-to-image (T2I) diffusion models provide strong generative priors to reconstruct credible and intricate details. However, T2I generation focuses on semantic consistency while Real-ISR emphasizes pixel-level reconstruction, which hinders existing methods from fully exploiting diffusion priors. To address this challenge, we introduce ConsisSR to handle both semantic and pixel-level consistency. First, compared to coarse-grained text prompts, we exploit the more powerful CLIP image embedding and effectively leverage both modalities through our Hybrid Prompt Adapter (HPA) for semantic guidance. Second, we introduce Time-aware Latent Augmentation (TALA) to mitigate the inherent gap between T2I generation and Real-ISR consistency requirements. By randomly mixing LQ and HQ latent inputs, our model not only handles timestep-specific diffusion noise but also refines the accumulated latent representations. Last but not least, our GAN-Embedding strategy employs the pretrained Real-ESRGAN model to refine the diffusion start point. This accelerates the inference process to 10 steps while preserving sampling quality, in a training-free manner. Our method demonstrates state-of-the-art performance among both full-scale and accelerated models. The code will be made publicly available.

[CV-18] MotionBank: A Large-scale Video Motion Benchmark with Disentangled Rule-based Annotations

Link: https://arxiv.org/abs/2410.13790
Authors: Liang Xu, Shaoyang Hua, Zili Lin, Yifan Liu, Feipeng Ma, Yichao Yan, Xin Jin, Xiaokang Yang, Wenjun Zeng
Keywords: large motion model, motion, tackle the problem, human motion generation, motions
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:In this paper, we tackle the problem of how to build and benchmark a large motion model (LMM). The ultimate goal of LMM is to serve as a foundation model for versatile motion-related tasks, e.g., human motion generation, with interpretability and generalizability. Though advanced, recent LMM-related works are still limited by small-scale motion data and costly text descriptions. Besides, previous motion benchmarks primarily focus on pure body movements, neglecting the ubiquitous motions in context, i.e., humans interacting with humans, objects, and scenes. To address these limitations, we consolidate large-scale video action datasets as knowledge banks to build MotionBank, which comprises 13 video action datasets, 1.24M motion sequences, and 132.9M frames of natural and diverse human motions. Different from laboratory-captured motions, in-the-wild human-centric videos contain abundant motions in context. To facilitate better motion text alignment, we also meticulously devise a motion caption generation algorithm to automatically produce rule-based, unbiased, and disentangled text descriptions via the kinematic characteristics for each motion. Extensive experiments show that our MotionBank is beneficial for general motion-related tasks of human motion generation, motion in-context generation, and motion understanding. Video motions together with the rule-based text annotations could serve as an efficient alternative for larger LMMs. Our dataset, codes, and benchmark will be publicly available at this https URL.

[CV-19] Emphasizing Semantic Consistency of Salient Posture for Speech-Driven Gesture Generation

Link: https://arxiv.org/abs/2410.13786
Authors: Fengqi Liu, Hexiang Wang, Jingyu Gong, Ran Yi, Qianyu Zhou, Xuequan Lu, Jiangbo Lu, Lizhuang Ma
Keywords: Speech-driven gesture generation, gesture sequence synchronized, input speech signal, gesture generation aims, Speech-driven gesture
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Speech-driven gesture generation aims at synthesizing a gesture sequence synchronized with the input speech signal. Previous methods leverage neural networks to directly map a compact audio representation to the gesture sequence, ignoring the semantic association of different modalities and failing to deal with salient gestures. In this paper, we propose a novel speech-driven gesture generation method by emphasizing the semantic consistency of salient posture. Specifically, we first learn a joint manifold space for the individual representation of audio and body pose to exploit the inherent semantic association between two modalities, and propose to enforce semantic consistency via a consistency loss. Furthermore, we emphasize the semantic consistency of salient postures by introducing a weakly-supervised detector to identify salient postures, and reweighting the consistency loss to focus more on learning the correspondence between salient postures and the high-level semantics of speech content. In addition, we propose to extract audio features dedicated to facial expression and body gesture separately, and design separate branches for face and body gesture synthesis. Extensive experimental results demonstrate the superiority of our method over the state-of-the-art approaches.

[CV-20] Eyelid Fold Consistency in Facial Modeling

Link: https://arxiv.org/abs/2410.13760
Authors: Lohit Petikam, Charlie Hewitt, Fatemeh Saleh, Tadas Baltrušaitis
Keywords: human facial modeling, facial modeling, integral to identity, human facial, Eyelid
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Eyelid shape is integral to identity and likeness in human facial modeling. Human eyelids are diverse in appearance with varied skin fold and epicanthal fold morphology between individuals. Existing parametric face models express eyelid shape variation to an extent, but do not preserve sufficient likeness across a diverse range of individuals. We propose a new definition of eyelid fold consistency and implement geometric processing techniques to model diverse eyelid shapes in a unified topology. Using this method we reprocess data used to train a parametric face model and demonstrate significant improvements in face-related machine learning tasks.

[CV-21] Improving Multi-modal Large Language Model through Boosting Vision Capabilities

Link: https://arxiv.org/abs/2410.13733
Authors: Yanpeng Sun, Huaxin Zhang, Qiang Chen, Xinyu Zhang, Nong Sang, Gang Zhang, Jingdong Wang, Zechao Li
Keywords: focus on improving, boosting the vision-language, visual encoder, visual, vision-language models
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Abstract:We focus on improving visual understanding capability to boost vision-language models. We propose Arcana, a multimodal language model that introduces two crucial techniques. First, we present Multimodal LoRA (MM-LoRA), a module designed to enhance the decoder. Unlike traditional language-driven decoders, MM-LoRA consists of two parallel LoRAs – one for vision and one for language – each with its own parameters. This disentangled parameter design allows for more specialized learning in each modality and better integration of multimodal information. Second, we introduce the Query Ladder adapter (QLadder) to improve the visual encoder. QLadder employs a learnable "ladder" structure to deeply aggregate the intermediate representations from the frozen pretrained visual encoder (e.g., CLIP image encoder). This enables the model to learn new and informative visual features while retaining the powerful capabilities of the pretrained visual encoder. These techniques collectively enhance Arcana’s visual perception power, enabling it to leverage improved visual information for more accurate and contextually relevant outputs across various multimodal scenarios. Extensive experiments and ablation studies demonstrate the effectiveness and generalization capability of Arcana. The code and re-annotated data are available at this https URL.

[CV-22] DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation


[CV-23] Movie Gen: A Cast of Media Foundation Models


[CV-24] Exploring the Design Space of Visual Context Representation in Video MLLMs


[CV-25] Label-free prediction of fluorescence markers in bovine satellite cells using deep learning

Link: https://arxiv.org/abs/2410.13685
Authors: Sania Sinha, Aarham Wasit, Won Seob Kim, Jongkyoo Kim, Jiyoon Yi
Keywords: food sustainability challenges, address global food, global food sustainability, bovine satellite cells, sustainability challenges
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 4 figures

Abstract:Assessing the quality of bovine satellite cells (BSCs) is essential for the cultivated meat industry, which aims to address global food sustainability challenges. This study aims to develop a label-free method for predicting fluorescence markers in isolated BSCs using deep learning. We employed a U-Net-based CNN model to predict multiple fluorescence signals from a single bright-field microscopy image of cell culture. Two key biomarkers, DAPI and Pax7, were used to determine the abundance and quality of BSCs. The image pre-processing pipeline included fluorescence denoising to improve prediction performance and consistency. A total of 48 biological replicates were used, with statistical performance metrics such as Pearson correlation coefficient and SSIM employed for model evaluation. The model exhibited better performance with DAPI predictions due to uniform staining. Pax7 predictions were more variable, reflecting biological heterogeneity. Enhanced visualization techniques, including color mapping and image overlay, improved the interpretability of the predictions by providing better contextual and perceptual information. The findings highlight the importance of data pre-processing and demonstrate the potential of deep learning to advance non-invasive, label-free assessment techniques in the cultivated meat industry, paving the way for reliable and actionable AI-driven evaluations.
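
One of the evaluation metrics named in this abstract, the Pearson correlation coefficient, can be computed between a predicted and a measured fluorescence image by flattening both to pixel lists. The sketch below is a plain-Python illustration with made-up pixel values; it is not the authors' evaluation code, and SSIM is left out.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical flattened pixel intensities: measured stain vs. model prediction.
measured = [0.1, 0.4, 0.35, 0.8, 0.9]
predicted = [0.2, 0.8, 0.7, 1.6, 1.8]  # exactly 2x the measured values

print(round(pearson_correlation(measured, predicted), 6))  # 1.0
```

Because the coefficient is invariant to scale and offset, it rewards predictions that reproduce the spatial pattern of the stain even when absolute intensities differ.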

[CV-26] Pose-Based Sign Language Appearance Transfer


[CV-27] Diffusion Curriculum: Synthetic-to-Real Generative Curriculum Learning via Image-Guided Diffusion


[CV-28] VL-GLUE: A Suite of Fundamental yet Challenging Visuo-Linguistic Reasoning Tasks


[CV-29] DiRecNetV2: A Transformer-Enhanced Network for Aerial Disaster Recognition

Link: https://arxiv.org/abs/2410.13663
Authors: Demetris Shianios, Panayiotis Kolios, Christos Kyrkou
Keywords: Unmanned Aerial Vehicles, aerial imagery processing, real-time processing capabilities, Aerial Vehicles, Unmanned Aerial
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages

Abstract:The integration of Unmanned Aerial Vehicles (UAVs) with artificial intelligence (AI) models for aerial imagery processing in disaster assessment necessitates models that demonstrate exceptional accuracy, computational efficiency, and real-time processing capabilities. Traditionally, Convolutional Neural Networks (CNNs) demonstrate efficiency in local feature extraction but are limited in their capacity for global context interpretation. On the other hand, Vision Transformers (ViTs) show promise for improved global context interpretation through the use of attention mechanisms, although they remain underinvestigated in UAV-based disaster response applications. Bridging this research gap, we introduce DiRecNetV2, an improved hybrid model that utilizes convolutional and transformer layers. It merges the inductive biases of CNNs for robust feature extraction with the global context understanding of Transformers, maintaining a low computational load ideal for UAV applications. Additionally, we introduce a new, compact multi-label dataset of disasters to set an initial benchmark for future research, exploring how models trained on single-label data perform on a multi-label test set. The study assesses lightweight CNNs and ViTs on the AIDERSv2 dataset, based on the frames per second (FPS) for efficiency and the weighted F1 scores for classification performance. DiRecNetV2 not only achieves a weighted F1 score of 0.964 on a single-label test set but also demonstrates adaptability, with a score of 0.614 on a complex multi-label test set, while functioning at 176.13 FPS on the Nvidia Orin Jetson device.
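
The classification metric reported here is the weighted F1 score. For the single-label case it is commonly computed as the per-class F1 averaged with weights equal to each class's number of true samples; the sketch below follows that common definition, on the assumption that the paper does the same. The class names and predictions are invented for illustration.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores (single-label case)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty label lists of equal length")
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        # F1 = 2TP / (2TP + FP + FN); the denominator is positive because count >= 1
        f1 = 2 * tp / (2 * tp + fp + fn)
        total += count * f1
    return total / len(y_true)


# Hypothetical labels for six aerial frames.
y_true = ["fire", "fire", "flood", "normal", "normal", "normal"]
y_pred = ["fire", "flood", "flood", "normal", "normal", "fire"]

print(round(weighted_f1(y_true, y_pred), 4))  # 0.6778
```

In this toy case the per-class scores are 0.5 (fire), 2/3 (flood), and 0.8 (normal), and the larger "normal" class pulls the weighted average up.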

[CV-30] ActionCOMET: A Zero-shot Approach to Learn Image-specific Commonsense Concepts about Actions

Link: https://arxiv.org/abs/2410.13662
Authors: Shailaja Keyur Sampat, Yezhou Yang, Chitta Baral
Keywords: Humans observe, visually perceive, draw a wide, wide range, Humans
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 3 figures. arXiv admin note: text overlap with arXiv:2004.10796 by other authors

Abstract:Humans observe various actions being performed by other humans (physically or in videos/images) and can draw a wide range of inferences about them beyond what they can visually perceive. Such inferences include determining the aspects of the world that make action execution possible (e.g. liquid objects can undergo pouring), predicting how the world will change as a result of the action (e.g. potatoes being golden and crispy after frying), identifying high-level goals associated with the action (e.g. beat the eggs to make an omelet), and reasoning about actions that possibly precede or follow the current action (e.g. crack eggs before whisking or draining pasta after boiling). Similar reasoning ability is highly desirable in autonomous systems that would assist us in performing everyday tasks. To that end, we propose a multi-modal task to learn the aforementioned concepts about actions being performed in images. We develop a dataset consisting of 8.5k images and 59.3k inferences about actions grounded in those images, collected from an annotated cooking-video dataset. We propose ActionCOMET, a zero-shot framework to discern knowledge present in language models specific to the provided visual input. We present baseline results of ActionCOMET over the collected dataset and compare them with the performance of the best existing VQA approaches.

[CV-31] Help Me Identify: Is an LLMVQA System All We Need to Identify Visual Concepts?

Link: https://arxiv.org/abs/2410.13651
Authors: Shailaja Keyur Sampat,Maitreya Patel,Yezhou Yang,Chitta Baral
Keywords: produce convincing linguistic, convincing linguistic justification, ability to learn, small amount, data and produce
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 7 figures

Abstract:An ability to learn about new objects from a small amount of visual data and produce convincing linguistic justification about the presence/absence of certain concepts (that collectively compose the object) in novel scenarios is an important characteristic of human cognition. This is possible due to abstraction of attributes/properties that an object is composed of; e.g., an object ‘bird’ can be identified by the presence of a beak, feathers, legs, wings, etc. Inspired by this aspect of human reasoning, in this work, we present a zero-shot framework for fine-grained visual concept learning by leveraging a large language model and a Visual Question Answering (VQA) system. Specifically, we prompt GPT-3 to obtain a rich linguistic description of visual objects in the dataset. We convert the obtained concept descriptions into a set of binary questions. We pose these questions along with the query image to a VQA system and aggregate the answers to determine the presence or absence of an object in the test images. Our experiments demonstrate comparable performance with existing zero-shot visual classification methods and few-shot concept learning approaches, without substantial computational overhead, yet being fully explainable from the reasoning perspective.
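
The answer-aggregation step described in this abstract can be sketched as follows. The abstract does not state the aggregation rule, so this sketch assumes the simplest one (the fraction of "yes" answers per candidate class, then an argmax); the function names and the toy answers are hypothetical, and no real VQA system is called.

```python
def score_classes(answers_by_class):
    """Map each class to the fraction of its concept questions answered 'yes'."""
    scores = {}
    for cls, answers in answers_by_class.items():
        yes = sum(1 for a in answers if a.strip().lower() == "yes")
        scores[cls] = yes / len(answers) if answers else 0.0
    return scores


def predict(answers_by_class):
    """Return the best-scoring class together with all per-class scores."""
    scores = score_classes(answers_by_class)
    return max(scores, key=scores.get), scores


# Hypothetical VQA answers to binary concept questions for one query image,
# e.g. "Does it have a beak?", "Does it have feathers?", ...
vqa_answers = {
    "bird": ["yes", "yes", "yes", "no"],
    "fish": ["no", "no", "yes", "no"],
}
label, scores = predict(vqa_answers)
print(label, scores)
```

Because every score traces back to individual yes/no answers, a prediction can be explained by listing which concepts were judged present.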

[CV-32] Comparison of Image Preprocessing Techniques for Vehicle License Plate Recognition Using OCR: Performance and Accuracy Evaluation

Link: https://arxiv.org/abs/2410.13622
Authors: Renato Augusto Tavares
Keywords: Artificial Intelligence solutions, Artificial Intelligence, machine learning models, Intelligence solutions, Optical Character Recognition
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 13 figures

Abstract:The growing use of Artificial Intelligence solutions has led to an explosion in image capture and its application in machine learning models. However, the lack of standardization in image quality generates inconsistencies in the results of these models. To mitigate this problem, Optical Character Recognition (OCR) is often used as a preprocessing technique, but it still faces challenges in scenarios with inadequate lighting, low resolution, and perspective distortions. This work aims to explore and evaluate various preprocessing techniques, such as grayscale conversion, CLAHE in RGB, and Bilateral Filter, applied to vehicle license plate recognition. Each technique is analyzed individually and in combination, using metrics such as accuracy, precision, recall, F1-score, ROC curve, AUC, and ANOVA, to identify the most effective method. The study uses a dataset of Brazilian vehicle license plates, widely used in OCR applications. The research provides a detailed analysis of best preprocessing practices, offering insights to optimize OCR performance in real-world scenarios.

[CV-33] Enhanced Prompt-leveraged Weakly Supervised Cancer Segmentation based on Segment Anything

Link: https://arxiv.org/abs/2410.13621
Authors: Joonhyeon Song,Seohwan Yun,Seongho Yoon,Joohyeok Kim,Sangmin Lee
Keywords: robust labeled data, pathological image analysis, limited robust labeled, image analysis, addressing the challenge
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 7 figures

Abstract:This work proposes a novel approach beyond supervised learning for effective pathological image analysis, addressing the challenge of limited robust labeled data. Pathological diagnosis of diseases like cancer has conventionally relied on the evaluation of morphological features by physicians and pathologists. However, recent advancements in computer-aided diagnosis (CAD) systems are gaining significant attention as diagnostic support tools. Although the advancement of deep learning has improved CAD significantly, segmentation models typically require large pixel-level annotated datasets, and such labeling is expensive. Existing studies not based on supervised approaches still struggle with limited generalization, and no practical approach has emerged yet. To address this issue, we present a weakly supervised semantic segmentation (WSSS) model by combining class activation map and Segment Anything Model (SAM)-based pseudo-labeling. For effective pretraining, we adopt SAM, a foundation model that is pretrained on large datasets and operates in zero-shot configurations using only coarse prompts. The proposed approach transfers the enhanced Attention Dropout Layer’s knowledge to SAM, thereby generating pseudo-labels. To demonstrate the superiority of the proposed method, experimental studies are conducted on histopathological breast cancer datasets. The proposed method outperformed other WSSS methods across three datasets, demonstrating its efficiency by achieving this with only 12GB of GPU memory during training. Our code is available at: this https URL

[CV-34] LoLDU: Low-Rank Adaptation via Lower-Diag-Upper Decomposition for Parameter-Efficient Fine-Tuning

Link: https://arxiv.org/abs/2410.13618
Authors: Yiming Shi,Jiwei Wei,Yujia Wu,Ran Ran,Chengwei Sun,Shiyuan He,Yang Yang
Keywords: necessitated substantial computational, substantial computational resources, rapid growth, scale has necessitated, necessitated substantial
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 7 figures

Abstract:The rapid growth of model scale has necessitated substantial computational resources for fine-tuning. Existing approaches such as Low-Rank Adaptation (LoRA) have sought to address the problem of handling the large updated parameters in full fine-tuning. However, LoRA utilizes random initialization and optimization of low-rank matrices to approximate updated weights, which can result in suboptimal convergence and an accuracy gap compared to full fine-tuning. To address these issues, we propose LoLDU, a Parameter-Efficient Fine-Tuning (PEFT) approach that significantly reduces trainable parameters by 2600 times compared to regular PEFT methods while maintaining comparable performance. LoLDU leverages Lower-Diag-Upper Decomposition (LDU) to initialize low-rank matrices for faster convergence and orthogonality. We focus on optimizing the diagonal matrix for scaling transformations. To the best of our knowledge, LoLDU has the fewest parameters among all PEFT approaches. We conducted extensive experiments across 4 instruction-following datasets, 6 natural language understanding (NLU) datasets, 8 image classification datasets, and image generation datasets with multiple model types (LLaMA2, RoBERTa, ViT, and Stable Diffusion), providing a comprehensive and detailed analysis. Our open-source code can be accessed at this https URL.
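
For readers unfamiliar with the Lower-Diag-Upper factorization mentioned in this abstract, here is a minimal pure-Python sketch of a textbook LDU decomposition without pivoting. It only illustrates the factorization A = L·D·U; how LoLDU applies it to pretrained weights, and whether it uses pivoting, is not stated in the abstract, so the function names and the example matrix are assumptions made for illustration.

```python
def ldu_decompose(a):
    """Factor a square matrix as A = L * D * U without pivoting.

    L is unit lower triangular, D is returned as a list of diagonal entries,
    and U is unit upper triangular. Raises ZeroDivisionError on a zero pivot.
    """
    n = len(a)
    lower = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    work = [[float(x) for x in row] for row in a]  # reduced to upper triangular
    for k in range(n):
        for i in range(k + 1, n):
            factor = work[i][k] / work[k][k]
            lower[i][k] = factor
            for j in range(k, n):
                work[i][j] -= factor * work[k][j]
    diag = [work[i][i] for i in range(n)]
    # Divide each row by its pivot so that U has ones on the diagonal.
    upper = [[work[i][j] / diag[i] for j in range(n)] for i in range(n)]
    return lower, diag, upper


def reconstruct(lower, diag, upper):
    """Multiply L * diag(D) * U back together."""
    n = len(diag)
    return [
        [sum(lower[i][k] * diag[k] * upper[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


matrix = [[4.0, 3.0], [6.0, 3.0]]
lower, diag, upper = ldu_decompose(matrix)
print(diag, reconstruct(lower, diag, upper))
```

In this factorization the diagonal holds only n values for an n×n matrix, which suggests why training the diagonal alone keeps the number of trainable parameters small.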

[CV-35] Spatiotemporal Object Detection for Improved Aerial Vehicle Detection in Traffic Monitoring

[CV-36] Material Fingerprinting: Identifying and Predicting Perceptual Attributes of Material Appearance

Link: https://arxiv.org/abs/2410.13615
Authors: Jiri Filip,Filip Dechterenko,Filipp Schmidt,Jiri Lukavsky,Veronika Vilimovska,Jan Kotera,Roland W. Fleming
Keywords: possessing unique surface, material, world is abundant, play a crucial, crucial role
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 12 figures, 3 tables

Abstract:The world is abundant with diverse materials, each possessing unique surface appearances that play a crucial role in our daily perception and understanding of their properties. Despite advancements in technology enabling the capture and realistic reproduction of material appearances for visualization and quality control, the interoperability of material property information across various measurement representations and software platforms remains a complex challenge. A key to overcoming this challenge lies in the automatic identification of materials’ perceptual features, enabling intuitive differentiation of properties stored in disparate material data representations. We reasoned that for many practical purposes, a compact representation of the perceptual appearance is more useful than an exhaustive physical description. This paper introduces a novel approach to material identification by encoding perceptual features obtained from dynamic visual stimuli. We conducted a psychophysical experiment to select and validate 16 particularly significant perceptual attributes obtained from videos of 347 materials. We then gathered attribute ratings from over twenty participants for each material, creating a ‘material fingerprint’ that encodes the unique perceptual properties of each material. Finally, we trained a multi-layer perceptron model to predict the relationship between statistical and deep learning image features and their corresponding perceptual properties. We demonstrate the model’s performance in material retrieval and filtering according to individual attributes. This model represents a significant step towards simplifying the sharing and understanding of material properties in diverse digital environments regardless of their digital representation, enhancing both the accuracy and efficiency of material identification.

[CV-37] MEGA: Memory-Efficient 4D Gaussian Splatting for Dynamic Scenes

Link: https://arxiv.org/abs/2410.13613
Authors: Xinjie Zhang,Zhening Liu,Yifan Zhang,Xingtong Ge,Dailan He,Tongda Xu,Yan Wang,Zehong Lin,Shuicheng Yan,Jun Zhang
Keywords: Gaussian Splatting, capturing complex dynamic, high fidelity, recently emerged, capturing complex
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Abstract:4D Gaussian Splatting (4DGS) has recently emerged as a promising technique for capturing complex dynamic 3D scenes with high fidelity. It utilizes a 4D Gaussian representation and a GPU-friendly rasterizer, enabling rapid rendering speeds. Despite its advantages, 4DGS faces significant challenges, notably the requirement of millions of 4D Gaussians, each with extensive associated attributes, leading to substantial memory and storage cost. This paper introduces a memory-efficient framework for 4DGS. We streamline the color attribute by decomposing it into a per-Gaussian direct color component with only 3 parameters and a shared lightweight alternating current color predictor. This approach eliminates the need for spherical harmonics coefficients, which typically involve up to 144 parameters in classic 4DGS, thereby creating a memory-efficient 4D Gaussian representation. Furthermore, we introduce an entropy-constrained Gaussian deformation technique that uses a deformation field to expand the action range of each Gaussian and integrates an opacity-based entropy loss to limit the number of Gaussians, thus forcing our model to use as few Gaussians as possible to fit a dynamic scene well. With simple half-precision storage and zip compression, our framework achieves a storage reduction by approximately 190× and 125× on the Technicolor and Neural 3D Video datasets, respectively, compared to the original 4DGS. Meanwhile, it maintains comparable rendering speeds and scene representation quality, setting a new standard in the field.

[CV-38] H2OVL-Mississippi Vision Language Models Technical Report

[CV-39] DN-4DGS: Denoised Deformable Network with Temporal-Spatial Aggregation for Dynamic Scene Rendering NEURIPS2024

Link: https://arxiv.org/abs/2410.13607
Authors: Jiahao Lu,Jiacheng Deng,Ruijie Zhu,Yanzhe Liang,Wenfei Yang,Tianzhu Zhang,Xu Zhou
Keywords: Dynamic scenes rendering, Dynamic scenes, challenging problem, intriguing yet challenging, Dynamic
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2024

Abstract:Dynamic scene rendering is an intriguing yet challenging problem. Although current methods based on NeRF have achieved satisfactory performance, they still cannot reach real-time levels. Recently, 3D Gaussian Splatting (3DGS) has garnered researchers’ attention due to its outstanding rendering quality and real-time speed. Therefore, a new paradigm has been proposed: defining canonical 3D Gaussians and deforming them to individual frames in deformable fields. However, the coordinates of canonical 3D Gaussians are filled with noise, which can transfer into the deformable fields, and there is currently no method that adequately considers the aggregation of 4D information. Therefore, we propose Denoised Deformable Network with Temporal-Spatial Aggregation for Dynamic Scene Rendering (DN-4DGS). Specifically, a Noise Suppression Strategy is introduced to change the distribution of the coordinates of the canonical 3D Gaussians and suppress noise. Additionally, a Decoupled Temporal-Spatial Aggregation Module is designed to aggregate information from adjacent points and frames. Extensive experiments on various real-world datasets demonstrate that our method achieves state-of-the-art rendering quality under a real-time level.

[CV-40] Let Me Finish My Sentence: Video Temporal Grounding with Holistic Text Understanding

Link: https://arxiv.org/abs/2410.13598
Authors: Jongbhin Woo,Hyeonggon Ryu,Youngjoon Jang,Jae Won Cho,Joon Son Chung
Keywords: Video Temporal Grounding, Temporal Grounding, match text queries, Video Temporal, aims to identify
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACMMM 24

Abstract:Video Temporal Grounding (VTG) aims to identify visual frames in a video clip that match text queries. Recent studies in VTG employ cross-attention to correlate visual frames and text queries as individual token sequences. However, these approaches overlook a crucial aspect of the problem: a holistic understanding of the query sentence. A model may capture correlations between individual word tokens and arbitrary visual frames while possibly missing out on the global meaning. To address this, we introduce two primary contributions: (1) a visual frame-level gate mechanism that incorporates holistic textual information, and (2) a cross-modal alignment loss to learn the fine-grained correlation between query and relevant frames. As a result, we regularize the effect of individual word tokens and suppress irrelevant visual frames. We demonstrate that our method outperforms state-of-the-art approaches in VTG benchmarks, indicating that holistic text understanding guides the model to focus on the semantically important parts within the video.

[CV-41] Pseudo Dataset Generation for Out-of-Domain Multi-Camera View Recommendation

Link: https://arxiv.org/abs/2410.13585
Authors: Kuan-Ying Lee,Qian Zhou,Klara Nahrstedt
Keywords: indispensable in movies, systems are indispensable, multi-camera view recommendation, view recommendation datasets, view recommendation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to VCIP 2024. Project page: this https URL

Abstract:Multi-camera systems are indispensable in movies, TV shows, and other media. Selecting the appropriate camera at every timestamp has a decisive impact on production quality and audience preferences. Learning-based view recommendation frameworks can assist professionals in decision-making. However, they often struggle outside of their training domains. The scarcity of labeled multi-camera view recommendation datasets exacerbates the issue. Based on the insight that many videos are edited from the original multi-camera videos, we propose transforming regular videos into pseudo-labeled multi-camera view recommendation datasets. Promisingly, by training the model on pseudo-labeled datasets stemming from videos in the target domain, we achieve a 68% relative improvement in the model’s accuracy in the target domain and bridge the accuracy gap between in-domain and never-before-seen domains.

[CV-42] Co-Segmentation without any Pixel-level Supervision with Application to Large-Scale Sketch Classification ACCV2024

Link: https://arxiv.org/abs/2410.13582
Authors: Nikolaos-Antonios Ypsilantis,Ondřej Chum
Keywords: pre-trained Vision Transformer, work proposes, rough object localization, Vision Transformer, common object
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ACCV 2024 Main Paper + Supplementary (Appendix)

Abstract:This work proposes a novel method for object co-segmentation, i.e. pixel-level localization of a common object in a set of images, that uses no pixel-level supervision for training. Two pre-trained Vision Transformer (ViT) models are exploited: ImageNet classification-trained ViT, whose features are used to estimate rough object localization through intra-class token relevance, and a self-supervised DINO-ViT for intra-image token relevance. On recent challenging benchmarks, the method achieves state-of-the-art performance among methods trained with the same level of supervision (image labels) while being competitive with methods trained with pixel-level supervision (binary masks). The benefits of the proposed co-segmentation method are further demonstrated in the task of large-scale sketch recognition, that is, the classification of sketches into a wide range of categories. The limited amount of hand-drawn sketch training data is leveraged by exploiting readily available image-level-annotated datasets of natural images containing a large number of classes. To bridge the domain gap, the classifier is trained on a sketch-like proxy domain derived from edges detected on natural images. We show that sketch recognition significantly benefits when the classifier is trained on sketch-like structures extracted from the co-segmented area rather than from the full image. Code: this https URL.

[CV-43] DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation

Link: https://arxiv.org/abs/2410.13571
Authors: Guosheng Zhao,Chaojun Ni,Xiaofeng Wang,Zheng Zhu,Guan Huang,Xinze Chen,Boyuan Wang,Youyi Zhang,Wenjun Mei,Xingang Wang
Keywords: autonomous driving systems, Closed-loop simulation, essential for advancing, Closed-loop
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL

Abstract:Closed-loop simulation is essential for advancing end-to-end autonomous driving systems. Contemporary sensor simulation methods, such as NeRF and 3DGS, rely predominantly on conditions closely aligned with training data distributions, which are largely confined to forward-driving scenarios. Consequently, these methods face limitations when rendering complex maneuvers (e.g., lane change, acceleration, deceleration). Recent advancements in autonomous-driving world models have demonstrated the potential to generate diverse driving videos. However, these approaches remain constrained to 2D video generation, inherently lacking the spatiotemporal coherence required to capture intricacies of dynamic driving environments. In this paper, we introduce DriveDreamer4D, which enhances 4D driving scene representation leveraging world model priors. Specifically, we utilize the world model as a data machine to synthesize novel trajectory videos based on real-world driving data. Notably, we explicitly leverage structured conditions to control the spatial-temporal consistency of foreground and background elements, so the generated data adheres closely to traffic constraints. To our knowledge, DriveDreamer4D is the first to utilize video generation models for improving 4D reconstruction in driving scenarios. Experimental results reveal that DriveDreamer4D significantly enhances generation quality under novel trajectory views, achieving a relative improvement in FID by 24.5%, 39.0%, and 10.5% compared to PVG, S³Gaussian, and Deformable-GS. Moreover, DriveDreamer4D markedly enhances the spatiotemporal coherence of driving agents, which is verified by a comprehensive user study and the relative increases of 20.3%, 42.0%, and 13.7% in the NTA-IoU metric.

[CV-44] Representing Model Weights with Language using Tree Experts

Link: https://arxiv.org/abs/2410.13569
Authors: Eliahu Horwitz,Bar Cavia,Jonathan Kahana,Yedid Hoshen
Keywords: train neural networks, neural networks, model, model weights, models
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:The increasing availability of public models begs the question: can we train neural networks that use other networks as input? This paper learns to represent models within a joint space that embeds both model weights and language. However, machine learning on model weights is challenging as model weights often exhibit significant variation unrelated to the models’ semantic properties (nuisance variation). We identify a key property of real-world models: most public models belong to a small set of Model Trees, where all models within a tree are fine-tuned from a common ancestor (e.g., a foundation model). Importantly, we find that within each tree there is less nuisance variation between models. For example, while classifying models according to their training dataset generally requires complex architectures, in our case, even a linear classifier trained on a single layer is often effective. While effective, linear layers are computationally expensive as model weights are very high dimensional. To address this, we introduce Probing Experts (ProbeX), a theoretically motivated, lightweight probing method. Notably, ProbeX is the first probing method designed to learn from the weights of just a single model layer. We also construct and release a dataset that simulates the structure of public model repositories. Our results show that ProbeX can effectively map the weights of large models into a shared weight-language embedding space. Furthermore, we demonstrate the impressive generalization of our method, achieving zero-shot model classification and retrieval.

[CV-45] CCUP: A Controllable Synthetic Data Generation Pipeline for Pretraining Cloth-Changing Person Re-Identification Models

[CV-46] 360U-Former: HDR Illumination Estimation with Panoramic Adapted Vision Transformers ECCV2024

Link: https://arxiv.org/abs/2410.13566
Authors: Jack Hilliard,Adrian Hilton,Jean-Yves Guillemaut
Keywords: Recent illumination estimation, Recent illumination, focused on enhancing, enhancing the resolution, resolution and improving
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted at AIM Workshop 2024 at ECCV 2024, 18 pages, 6 figures

Abstract:Recent illumination estimation methods have focused on enhancing the resolution and improving the quality and diversity of the generated textures. However, few have explored tailoring the neural network architecture to the Equirectangular Panorama (ERP) format utilised in image-based lighting. Consequently, high dynamic range images (HDRI) results usually exhibit a seam at the side borders and textures or objects that are warped at the poles. To address this shortcoming we propose a novel architecture, 360U-Former, based on a U-Net style Vision-Transformer which leverages the work of PanoSWIN, an adapted shifted window attention tailored to the ERP format. To the best of our knowledge, this is the first purely Vision-Transformer model used in the field of illumination estimation. We train 360U-Former as a GAN to generate HDRI from a limited field of view low dynamic range image (LDRI). We evaluate our method using current illumination estimation evaluation protocols and datasets, demonstrating that our approach outperforms existing and state-of-the-art methods without the artefacts typically associated with the use of the ERP format.

[CV-47] SDI-Paste: Synthetic Dynamic Instance Copy-Paste for Video Instance Segmentation

Link: https://arxiv.org/abs/2410.13565
Authors: Sahir Shrestha,Weihao Li,Gao Zhu,Nick Barnes
Keywords: incurring minimal costs, minimal costs, incurring minimal, video, expand training datasets
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Data augmentation methods such as Copy-Paste have been studied as effective ways to expand training datasets while incurring minimal costs. While such methods have been extensively implemented for image level tasks, we found no scalable implementation of Copy-Paste built specifically for video tasks. In this paper, we leverage the recent growth in video fidelity of generative models to explore effective ways of incorporating synthetically generated objects into existing video datasets to artificially expand object instance pools. We first procure synthetic video sequences featuring objects that morph dynamically with time. Our carefully devised pipeline automatically segments then copy-pastes these dynamic instances across the frames of any target background video sequence. We name our video data augmentation pipeline Synthetic Dynamic Instance Copy-Paste, and test it on the complex task of Video Instance Segmentation which combines detection, segmentation and tracking of object instances across a video sequence. Extensive experiments on the popular Youtube-VIS 2021 dataset using two separate popular networks as baselines achieve strong gains of +2.9 AP (6.5%) and +2.1 AP (4.9%). We make our code and models publicly available.

[CV-48] Generative Location Modeling for Spatially Aware Object Insertion

Link: https://arxiv.org/abs/2410.13564
Authors: Jooyeol Yun,Davide Abati,Mohamed Omran,Jaegul Choo,Amirhossein Habibian,Auke Wiggers
Keywords: powerful tool, including object insertion, image editing tasks, including object, object
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Generative models have become a powerful tool for image editing tasks, including object insertion. However, these methods often lack spatial awareness, generating objects with unrealistic locations and scales, or unintentionally altering the scene background. A key challenge lies in maintaining visual coherence, which requires both a geometrically suitable object location and a high-quality image edit. In this paper, we focus on the former, creating a location model dedicated to identifying realistic object locations. Specifically, we train an autoregressive model that generates bounding box coordinates, conditioned on the background image and the desired object class. This formulation allows us to effectively handle sparse placement annotations and to incorporate implausible locations into a preference dataset by performing direct preference optimization. Our extensive experiments demonstrate that our generative location model, when paired with an inpainting method, substantially outperforms state-of-the-art instruction-tuned models and location modeling baselines in object insertion tasks, delivering accurate and visually coherent results.

[CV-49] RemoteDet-Mamba: A Hybrid Mamba-CNN Network for Multi-modal Object Detection in Remote Sensing Images

Link: https://arxiv.org/abs/2410.13532
Authors: Kejun Ren,Xin Wu,Lianming Xu,Li Wang
Keywords: Unmanned aerial vehicle, rapid information acquisition, Unmanned aerial, aerial vehicle, emergency response
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Unmanned aerial vehicle (UAV) remote sensing is widely applied in fields such as emergency response, owing to its advantages of rapid information acquisition and low cost. However, due to the effects of shooting distance and imaging mechanisms, the objects in the images present challenges such as small size, dense distribution, and low inter-class differentiation. To this end, we propose a multimodal remote sensing detection network that employs a quad-directional selective scanning fusion strategy called RemoteDet-Mamba. RemoteDet-Mamba simultaneously facilitates the learning of single-modal local features and the integration of patch-level global features across modalities, enhancing the distinguishability for small objects and utilizing local information to improve discrimination between different classes. Additionally, the use of Mamba’s serial processing significantly increases detection speed. Experimental results on the DroneVehicle dataset demonstrate the effectiveness of RemoteDet-Mamba, which achieves superior detection accuracy compared to state-of-the-art methods while maintaining computational efficiency and parameter count.

[CV-50] L3DG: Latent 3D Gaussian Diffusion SIGGRAPH

Link: https://arxiv.org/abs/2410.13530
Authors: Barbara Roessle,Norman Müller,Lorenzo Porzi,Samuel Rota Bulò,Peter Kontschieder,Angela Dai,Matthias Nießner
Keywords: Gaussian diffusion formulation, Gaussians, enables effective generative, latent diffusion formulation, diffusion formulation
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: SIGGRAPH Asia 2024, project page: this https URL, video: this https URL

Abstract:We propose L3DG, the first approach for generative 3D modeling of 3D Gaussians through a latent 3D Gaussian diffusion formulation. This enables effective generative 3D modeling, scaling to generation of entire room-scale scenes which can be very efficiently rendered. To enable effective synthesis of 3D Gaussians, we propose a latent diffusion formulation, operating in a compressed latent space of 3D Gaussians. This compressed latent space is learned by a vector-quantized variational autoencoder (VQ-VAE), for which we employ a sparse convolutional architecture to efficiently operate on room-scale scenes. This way, the complexity of the costly generation process via diffusion is substantially reduced, allowing higher detail on object-level generation, as well as scalability to large scenes. By leveraging the 3D Gaussian representation, the generated scenes can be rendered from arbitrary viewpoints in real-time. We demonstrate that our approach significantly improves visual quality over prior work on unconditional object-level radiance field synthesis and showcase its applicability to room-scale scene generation.

[CV-51] Generative Adversarial Synthesis of Radar Point Cloud Scenes

Link: https://arxiv.org/abs/2410.13526
Authors: Muhammad Saad Nawaz,Thomas Dallmann,Torsten Schoen,Dirk Heberling
Keywords: realistic traffic scenarios, scenarios are required, laborious to acquire, validation and verification, verification of automotive
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: ICMIM 2024; 7th IEEE MTT Conference

Abstract:For the validation and verification of automotive radars, datasets of realistic traffic scenarios are required, which, however, are laborious to acquire. In this paper, we introduce radar scene synthesis using GANs as an alternative to the real dataset acquisition and simulation-based approaches. We train a PointNet++ based GAN model to generate realistic radar point cloud scenes and use a binary classifier to evaluate the performance of scenes generated using this model against a test set of real scenes. We demonstrate that our GAN model achieves similar performance (~87%) to the real scenes test set.

[CV-52] Can Medical Vision-Language Pre-training Succeed with Purely Synthetic Data?

[CV-53] GeoCoder: Solving Geometry Problems by Generating Modular Code through Vision-Language Models

[CV-54] SAda-Net: A Self-Supervised Adaptive Stereo Estimation CNN For Remote Sensing Image Data ICPR2024

Link: https://arxiv.org/abs/2410.13500
Authors: Dominik Hirner,Friedrich Fraundorfer
Keywords: Stereo estimation, estimation has made, made many advancements, advancements in recent, recent years
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Will be presented at ICPR2024 in December 2024 in Kolkata, India

Abstract:Stereo estimation has made many advancements in recent years with the introduction of deep learning. However, the traditional supervised approach to deep learning requires the creation of accurate and plentiful ground-truth data, which is expensive to create and not available in many situations. This is especially true for remote sensing applications, where there is an excess of available data without proper ground truth. To tackle this problem, we propose a self-supervised CNN with self-improving adaptive abilities. In the first iteration, the created disparity map is inaccurate and noisy. Leveraging the left-right consistency check, we get a sparse but more accurate disparity map which is used as an initial pseudo ground-truth. This pseudo ground-truth is then adapted and updated after every epoch in the training step of the network. We use the sum of inconsistent points in order to track the network convergence. The code for our method is publicly available at: this https URL
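
As an illustration of the left-right consistency check mentioned in the abstract, here is a minimal sketch. It is not the authors' code: the function names `left_right_consistency` and `count_inconsistent`, the `max_diff` tolerance, and the list-of-lists disparity layout are all assumptions made for this example. It assumes the common convention that a left-image pixel at column `x` with disparity `d` matches the right-image pixel at column `x - d`.

```python
def left_right_consistency(disp_left, disp_right, max_diff=1.0):
    """Return a boolean mask marking left-view pixels whose disparity
    agrees with the right-view disparity at the matched column."""
    height = len(disp_left)
    width = len(disp_left[0])
    mask = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            d = disp_left[y][x]
            # Assumed convention: left pixel x corresponds to right pixel x - d.
            xr = int(round(x - d))
            if 0 <= xr < width:
                mask[y][x] = abs(d - disp_right[y][xr]) <= max_diff
    return mask


def count_inconsistent(mask):
    """Count pixels that fail the check (a possible convergence signal)."""
    return sum(1 for row in mask for ok in row if not ok)
```

Pixels that pass the check could serve as a sparse pseudo ground truth, and the count of failing pixels gives one plausible reading of the "sum of inconsistent points" used to track convergence.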

[CV-55] SemSim: Revisiting Weak-to-Strong Consistency from a Semantic Similarity Perspective for Semi-supervised Medical Image Segmentation

Link: https://arxiv.org/abs/2410.13486
Authors: Shiao Xie, Hongyi Wang, Ziwei Niu, Hao Sun, Shuyi Ouyang, Yen-Wei Chen, Lanfen Lin
Keywords: highly practical task, leveraging unlabeled samples, large-scale labeled dataset, medical image segmentation, challenging yet highly
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Semi-supervised learning (SSL) for medical image segmentation is a challenging yet highly practical task, which reduces reliance on large-scale labeled datasets by leveraging unlabeled samples. Among SSL techniques, the weak-to-strong consistency framework, popularized by FixMatch, has emerged as a state-of-the-art method in classification tasks. Notably, such a simple pipeline has also shown competitive performance in medical image segmentation. However, two key limitations still persist, impeding its efficient adaptation: (1) the neglect of contextual dependencies results in inconsistent predictions for similar semantic features, leading to incomplete object segmentation; (2) the lack of exploitation of semantic similarity between labeled and unlabeled data induces considerable class-distribution discrepancy. To address these limitations, we propose a novel semi-supervised framework based on FixMatch, named SemSim, powered by two appealing designs from a semantic similarity perspective: (1) rectifying pixel-wise prediction by reasoning about the intra-image pair-wise affinity map, thus integrating contextual dependencies explicitly into the final prediction; (2) bridging labeled and unlabeled data via a feature querying mechanism for compact class representation learning, which fully considers cross-image anatomical similarities. As reliable semantic similarity extraction depends on robust features, we further introduce an effective spatial-aware fusion module (SFM) to explore distinctive information from multiple scales. Extensive experiments show that SemSim yields consistent improvements over the state-of-the-art methods across three public segmentation benchmarks.
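
For readers unfamiliar with the weak-to-strong consistency idea that SemSim builds on, the sketch below shows the basic FixMatch-style unlabeled loss: a confident prediction on the weakly augmented view becomes a pseudo-label for the strongly augmented view. This is only the generic baseline, not SemSim itself; the function name `weak_to_strong_loss`, the 0.95 threshold, and the plain-list probability format are assumptions made for this example.

```python
import math


def weak_to_strong_loss(weak_probs, strong_probs, threshold=0.95):
    """Average cross-entropy of strong-view predictions against pseudo-labels
    taken from confident weak-view predictions (FixMatch-style)."""
    total = 0.0
    for weak, strong in zip(weak_probs, strong_probs):
        confidence = max(weak)
        if confidence >= threshold:
            # The most likely class on the weak view is the pseudo-label.
            pseudo_label = weak.index(confidence)
            total -= math.log(max(strong[pseudo_label], 1e-12))
    # Low-confidence samples contribute zero but still count in the average.
    return total / len(weak_probs) if weak_probs else 0.0
```

In segmentation the same rule would be applied per pixel rather than per image.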

[CV-56] Day-Night Adaptation: An Innovative Source-free Adaptation Framework for Medical Image Segmentation

Link: https://arxiv.org/abs/2410.13472
Authors: Ziyang Chen, Yiwen Ye, Yongsheng Pan, Yong Xia
Keywords: Distribution shifts widely, shifts widely exist, Distribution shifts, medical images acquired, semantic segmentation models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures, 6 tables

Abstract:Distribution shifts widely exist in medical images acquired from different medical centers, hindering the deployment of semantic segmentation models trained on data from one center (source domain) to another (target domain). While unsupervised domain adaptation (UDA) has shown significant promise in mitigating these shifts, it poses privacy risks due to sharing data between centers. To facilitate adaptation while preserving data privacy, source-free domain adaptation (SFDA) and test-time adaptation (TTA) have emerged as effective paradigms, relying solely on target domain data. However, the scenarios currently addressed by SFDA and TTA are limited, making them less suitable for clinical applications. In a more realistic clinical scenario, the pre-trained model is deployed in a medical center to assist with clinical tasks during the day and rests at night. During the daytime process, TTA can be employed to enhance inference performance. During the nighttime process, after collecting the test data from the day, the model can be fine-tuned utilizing SFDA to further adapt to the target domain. Building on these insights, we propose a novel adaptation framework called Day-Night Adaptation (DyNA). This framework adapts the model to the target domain through day-night loops without requiring access to source data. Specifically, we implement distinct adaptation strategies for daytime and nighttime to better meet the demands of clinical settings. During the daytime, model parameters are frozen, and a specific low-frequency prompt is trained for each test sample. Additionally, we construct a memory bank for prompt initialization and develop a warm-up mechanism to enhance prompt training. During nighttime, we integrate a global student model into the traditional teacher-student self-training paradigm to fine-tune the model while ensuring training stability…

[CV-57] SiamSeg: Self-Training with Contrastive Learning for Unsupervised Domain Adaptation in Remote Sensing

Link: https://arxiv.org/abs/2410.13471
Authors: Bin Wang, Fei Deng, Shuang Wang, Wen Luo, Zhixuan Zhang
Keywords: Semantic segmentation, remote sensing, challenging task, domain, Semantic
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Semantic segmentation of remote sensing (RS) images is a challenging task with significant potential across various applications. Deep learning, especially supervised learning with large-scale labeled datasets, has greatly advanced this field. However, acquiring high-quality labeled data is expensive and time-consuming. Moreover, variations in ground sampling distance (GSD), imaging equipment, and geographic diversity contribute to domain shifts between datasets, which pose significant challenges to models trained solely on source domain data, leading to poor cross-domain performance. Domain shift is well-known for undermining a model’s generalization ability in the target domain. To address this, unsupervised domain adaptation (UDA) has emerged as a promising solution, enabling models to learn from unlabeled target domain data while training on labeled source domain data. Recent advancements, particularly in self-supervised learning via pseudo-label generation, have shown potential in mitigating domain discrepancies. Strategies combining source and target domain images with their true and pseudo labels for self-supervised training have been effective in addressing domain bias. Despite progress in computer vision, the application of pseudo-labeling methods to RS image segmentation remains underexplored.

[CV-58] Object Pose Estimation Using Implicit Representation For Transparent Objects

Link: https://arxiv.org/abs/2410.13465
Authors: Varun Burde, Artem Moroz, Vit Zeman, Pavel Burget
Keywords: Object pose estimation, Object pose, computer vision, Object, prominent task
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Object pose estimation is a prominent task in computer vision. The object pose gives the orientation and translation of the object in real-world space, which enables applications such as manipulation and augmented reality. Objects interact with light in different ways, for example through reflection and absorption. This makes it challenging to understand the object's structure in RGB and depth channels. Recent research has been moving toward learning-based methods, which provide a more flexible and generalizable approach to object pose estimation utilizing deep learning. One such approach is the render-and-compare method, which renders the object from multiple views and compares it against the given 2D image, which often requires an object representation in the form of a CAD model. We reason that the synthetic texture of the CAD model may not be ideal for rendering and comparing operations. We show that if the object is represented as an implicit (neural) representation in the form of a Neural Radiance Field (NeRF), it exhibits a more realistic rendering of the actual scene and retains the crucial spatial features, which makes the comparison more versatile. We evaluated our NeRF implementation of the render-and-compare method on datasets of transparent objects and found that it surpassed the current state-of-the-art results.

[CV-59] Augmentation Policy Generation for Image Classification Using Large Language Models ISCAS2025

Link: https://arxiv.org/abs/2410.13453
Authors: Ant Duru, Alptekin Temizel
Keywords: Automated data augmentation, Automated data, deep learning models, image classification, generalization capability
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 2 figures, 4 tables, submitted for consideration to the International Workshop on Computational Intelligence for Multimedia Understanding (IWCIM), ISCAS 2025

Abstract:Automated data augmentation methods have significantly improved the performance and generalization capability of deep learning models in image classification. Yet, most state-of-the-art methods are optimized on common benchmark datasets, limiting their applicability to more diverse or domain-specific data, such as medical datasets. In this paper, we propose a strategy that uses large language models to automatically generate efficient augmentation policies, customized to fit the specific characteristics of any dataset and model architecture. The proposed method iteratively interacts with an LLM to obtain and refine the augmentation policies based on model performance feedback, creating a dataset-agnostic data augmentation pipeline. The proposed method was evaluated on medical imaging datasets, showing a clear improvement over state-of-the-art methods. The proposed approach offers an adaptive and scalable solution. Although it increases computational cost, it significantly boosts model robustness, automates the process, and minimizes the need for human involvement during model development.

[CV-60] Similarity-Dissimilarity Loss with Supervised Contrastive Learning for Multi-label Classification

[CV-61] Temporal-Enhanced Multimodal Transformer for Referring Multi-Object Tracking and Segmentation

Link: https://arxiv.org/abs/2410.13437
Authors: Changcheng Xiao, Qiong Cao, Yujie Zhong, Xiang Zhang, Tao Wang, Canqun Yang, Long Lan
Keywords: Referring multi-object tracking, aims to locate, locate an arbitrary, arbitrary number, maintain their identities
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Referring multi-object tracking (RMOT) is an emerging cross-modal task that aims to locate an arbitrary number of target objects and maintain their identities referred by a language expression in a video. This intricate task involves the reasoning of linguistic and visual modalities, along with the temporal association of target objects. However, the seminal work employs only loose feature fusion and overlooks the utilization of long-term information on tracked objects. In this study, we introduce a compact Transformer-based method, termed TenRMOT. We conduct feature fusion at both encoding and decoding stages to fully exploit the advantages of Transformer architecture. Specifically, we incrementally perform cross-modal fusion layer-by-layer during the encoding phase. In the decoding phase, we utilize language-guided queries to probe memory features for accurate prediction of the desired objects. Moreover, we introduce a query update module that explicitly leverages temporal prior information of the tracked objects to enhance the consistency of their trajectories. In addition, we introduce a novel task called Referring Multi-Object Tracking and Segmentation (RMOTS) and construct a new dataset named Ref-KITTI Segmentation. Our dataset consists of 18 videos with 818 expressions, and each expression averages 10.7 masks, which poses a greater challenge compared to the typical single mask in most existing referring video segmentation datasets. TenRMOT demonstrates superior performance on both the referring multi-object tracking and the segmentation tasks.

[CV-62] Performance of Gaussian Mixture Model Classifiers on Embedded Feature Spaces

Link: https://arxiv.org/abs/2410.13421
Authors: Jeremy Chopin, Rozenn Dahyot
Keywords: Data embeddings, multimodal data, analysis of multimedia, Data, Gaussian Mixture models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages

Abstract:Data embeddings with CLIP and ImageBind provide powerful features for the analysis of multimedia and/or multimodal data. We assess their performance here for classification using a layer based on Gaussian Mixture Models (GMMs) as an alternative to the standard Softmax layer. GMM-based classifiers have recently been shown to perform well as part of deep learning pipelines trained end-to-end. Our first contribution is to investigate GMM-based classification performance on the CLIP and ImageBind embedded spaces. Our second contribution is a GMM-based classifier of our own with a lower parameter count than previously proposed ones. We find that, on the embedded spaces tested, one Gaussian component in the GMMs is in most cases enough to capture each class, and we hypothesize that this may be due to the contrastive loss used for training these embedded spaces, which naturally concentrates features together for each class. We also observed that ImageBind often provides better performance than CLIP for classification of image datasets even when these embedded spaces are compressed using PCA.
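
To make the "one Gaussian component per class" finding concrete, the sketch below fits a single diagonal-covariance Gaussian to each class and classifies by maximum log-likelihood. It is a simplified stand-in rather than the classifier proposed in the paper: the diagonal covariance, the equal class priors, the variance floor, and the names `fit_class_gaussians`, `log_likelihood`, and `predict` are all assumptions made for this example.

```python
import math


def fit_class_gaussians(features, labels, var_floor=1e-6):
    """Fit one diagonal Gaussian (mean, variance per dimension) per class."""
    grouped = {}
    for x, y in zip(features, labels):
        grouped.setdefault(y, []).append(x)
    params = {}
    for y, rows in grouped.items():
        n, dim = len(rows), len(rows[0])
        mean = [sum(r[j] for r in rows) / n for j in range(dim)]
        # Floor the variance so the log-likelihood stays finite.
        var = [max(sum((r[j] - mean[j]) ** 2 for r in rows) / n, var_floor)
               for j in range(dim)]
        params[y] = (mean, var)
    return params


def log_likelihood(x, mean, var):
    """Log-density of x under a diagonal Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def predict(x, params):
    """Pick the class whose Gaussian gives x the highest log-likelihood."""
    return max(params, key=lambda y: log_likelihood(x, *params[y]))
```

In practice the inputs would be CLIP or ImageBind embedding vectors, optionally reduced with PCA as the abstract describes.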

[CV-63] RescueADI: Adaptive Disaster Interpretation in Remote Sensing Images with Autonomous Agents

Link: https://arxiv.org/abs/2410.13384
Authors: Zhuoran Liu, Danpei Zhao, Bo Yuan
Keywords: remote sensing images, sensing images, visual question-answering, remote sensing, focus on isolated
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Current methods for disaster scene interpretation in remote sensing images (RSIs) mostly focus on isolated tasks such as segmentation, detection, or visual question-answering (VQA). However, current interpretation methods often fail at tasks that require the combination of multiple perception methods and specialized tools. To fill this gap, this paper introduces Adaptive Disaster Interpretation (ADI), a novel task designed to solve requests by planning and executing multiple sequentially correlative interpretation tasks to provide a comprehensive analysis of disaster scenes. To facilitate research and application in this area, we present a new dataset named RescueADI, which contains high-resolution RSIs with annotations for three connected aspects: planning, perception, and recognition. The dataset includes 4,044 RSIs, 16,949 semantic masks, 14,483 object bounding boxes, and 13,424 interpretation requests across nine challenging request types. Moreover, we propose a new disaster interpretation method employing autonomous agents driven by large language models (LLMs) for task planning and execution, proving its efficacy in handling complex disaster interpretations. The proposed agent-based method solves various complex interpretation requests such as counting, area calculation, and path-finding without human intervention, which traditional single-task approaches cannot handle effectively. Experimental results on RescueADI demonstrate the feasibility of the proposed task and show that our method achieves an accuracy 9% higher than existing VQA methods, highlighting its advantages over conventional disaster interpretation approaches. The dataset will be publicly available.

[CV-64] Railway LiDAR semantic segmentation based on intelligent semi-automated data annotation

Link: https://arxiv.org/abs/2410.13383
Authors: Florian Wulff, Bernd Schaeufele, Julian Pfeifer, Ilja Radusch
Keywords: Automated vehicles rely, vehicles rely, accurate and robust, Automated vehicles, automated trains
Categories: Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments: This article has been accepted for publication in the IEEE VTC Fall 2024

Abstract:Automated vehicles rely on an accurate and robust perception of the environment. Like automated cars, highly automated trains require environmental perception. Although there is a lot of research based on either camera or LiDAR sensors in the automotive domain, very few contributions for this task exist yet for automated trains. Additionally, no public dataset or described approach for 3D LiDAR semantic segmentation in the railway environment exists yet. Thus, we propose an approach for point-wise 3D semantic segmentation based on the 2DPass network architecture using scans and images jointly. In addition, we present a semi-automated intelligent data annotation approach, which we use to efficiently and accurately label the required dataset recorded on a railway track in Germany. To improve performance despite a still small number of labeled scans, we apply an active learning approach to intelligently select scans for the training dataset. Our contributions are threefold: We annotate rail data including camera and LiDAR data from the railway environment, transfer label the raw LiDAR point clouds using an image segmentation network, and train a state-of-the-art 3D LiDAR semantic segmentation network efficiently leveraging active learning. The trained network achieves good segmentation results with a mean IoU of 71.48% over 9 classes.
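
The mean IoU figure reported above is a standard segmentation metric. The sketch below shows one common way to compute it from per-point class labels; the function name `mean_iou` and the choice to skip classes absent from both prediction and ground truth are assumptions made for this example, and the paper's exact evaluation protocol may differ.

```python
def mean_iou(predicted, target, num_classes):
    """Mean intersection-over-union across classes for flat label lists."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(predicted, target) if p == c and t == c)
        union = sum(1 for p, t in zip(predicted, target) if p == c or t == c)
        # Skip classes that appear in neither prediction nor ground truth.
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```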

[CV-65] Accurate Checkerboard Corner Detection under Defoucs

Link: https://arxiv.org/abs/2410.13371
Authors: Zezhun Shi
Keywords: autonomous driving, critical process, pacting applications, applications in autonomous, visible light
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Camera calibration is a critical process in 3D vision, impacting applications in autonomous driving, robotics, architecture, and so on. This paper focuses on enhancing feature extraction for chessboard corner detection, a key step in calibration. We analyze existing methods, highlighting their limitations, and propose a novel sub-pixel refinement approach based on symmetry, which significantly improves accuracy for visible light cameras. Unlike prior symmetry-based methods that assume a continuous physical pattern, our approach accounts for abrupt changes in visible light camera images and defocus effects. We introduce a simplified objective function that reduces computation time and mitigates overfitting risks. Furthermore, we derive an explicit expression for the pixel value of a blurred edge, providing insights into the relationship between pixel value and center intensity. Our method demonstrates superior performance, achieving substantial accuracy improvements over existing techniques, particularly in the context of visible light camera calibration. Our code is available from https: //github.com/spdfghi/Accurate-Checkerboard this http URL.

[CV-66] MagicTailor: Component-Controllable Personalization in Text-to-Image Diffusion Models ICTAI

[CV-67] Remember Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant

[CV-68] Self-Supervised Scene Flow Estimation with Point-Voxel Fusion and Surface Representation ICASSP2025

Link: https://arxiv.org/abs/2410.13355
Authors: Xuezhi Xiang, Xi Wang, Lei Zhang, Denis Ombati, Himaloy Himu, Xiantong Zhen
Keywords: flow estimation aims, Scene flow estimation, motion field, flow estimation, estimation aims
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The paper is under consideration at 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)

Abstract:Scene flow estimation aims to generate the 3D motion field of points between two consecutive frames of point clouds, which has wide applications in various fields. Existing point-based methods ignore the irregularity of point clouds and have difficulty capturing long-range dependencies due to the inefficiency of point-level computation. Voxel-based methods suffer from the loss of detail information. In this paper, we propose a point-voxel fusion method, where we utilize a voxel branch based on sparse grid attention and the shifted window strategy to capture long-range dependencies and a point branch to capture fine-grained features to compensate for the information loss in the voxel branch. In addition, since xyz coordinates alone can hardly describe the geometric structure of complex 3D objects in the scene, we explicitly encode the local surface information of the point cloud through the umbrella surface feature extraction (USFE) module. We verify the effectiveness of our method by conducting experiments on the Flyingthings3D and KITTI datasets. Our method outperforms all other self-supervised methods and achieves highly competitive results compared to fully supervised methods. We achieve improvements in all metrics, especially EPE, which is reduced by 8.51% and 10.52% on the KITTIo and KITTIs datasets, respectively.
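
EPE (end-point error) is the headline metric in the abstract above. A minimal sketch of how it is usually defined, as the mean Euclidean distance between predicted and ground-truth 3D flow vectors, is given below. The function name `end_point_error` and the tuple-based input format are assumptions made for this example, not taken from the paper.

```python
import math


def end_point_error(predicted_flow, true_flow):
    """Mean Euclidean distance between predicted and ground-truth flow vectors."""
    if not predicted_flow:
        return 0.0
    total = 0.0
    for pred, true in zip(predicted_flow, true_flow):
        # Per-point error is the length of the difference vector.
        total += math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)))
    return total / len(predicted_flow)
```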

[CV-69] GlossyGS: Inverse Rendering of Glossy Objects with 3D Gaussian Splatting

Link: https://arxiv.org/abs/2410.13349
Authors: Shuichang Lai, Letian Huang, Jie Guo, Kai Cheng, Bowen Pan, Xiaoxiao Long, Jiangjing Lyu, Chengfei Lv, Yanwen Guo
Keywords: computer vision, Reconstructing objects, computer graphics, posed images, crucial and complex
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Reconstructing objects from posed images is a crucial and complex task in computer graphics and computer vision. While NeRF-based neural reconstruction methods have exhibited impressive reconstruction ability, they tend to be time-consuming. Recent strategies have adopted 3D Gaussian Splatting (3D-GS) for inverse rendering, which has led to quick and effective outcomes. However, these techniques generally have difficulty in producing believable geometries and materials for glossy objects, a challenge that stems from the inherent ambiguities of inverse rendering. To address this, we introduce GlossyGS, an innovative 3D-GS-based inverse rendering framework that aims to precisely reconstruct the geometry and materials of glossy objects by integrating material priors. The key idea is the use of a micro-facet geometry segmentation prior, which helps to reduce the intrinsic ambiguities and improve the decomposition of geometries and materials. Additionally, we introduce a normal map prefiltering strategy to more accurately simulate the normal distribution of reflective surfaces. These strategies are integrated into a hybrid geometry and material representation that employs both explicit and implicit methods to depict glossy objects. We demonstrate through quantitative analysis and qualitative visualization that the proposed method is effective in reconstructing high-fidelity geometries and materials of glossy objects, and performs favorably against state-of-the-art methods.

[CV-70] Mitigating Hallucinations in Large Vision-Language Models via Summary-Guided Decoding

[CV-71] Inadequate contrast ratio of road markings as an indicator for ADAS failure

Link: https://arxiv.org/abs/2410.13320
Authors: Novel Certad, Cristina Olaverri-Monreal, Friedrich Wiesinger, Tomasz E. Burghardt
Keywords: driver assistance systems, advanced driver assistance, machine vision technologies, vision technologies utilised, road safety features
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: IRF World Congress 2024

Abstract:Road markings were reported as critical road safety features, equally needed for both human drivers and for machine vision technologies utilised by advanced driver assistance systems (ADAS) and in driving automation. Visibility of road markings is achieved because of their colour contrasting with the roadway surface. During recent testing of an open-source camera-based ADAS under several visibility conditions (day, night, rain, glare), significant failures in trajectory planning were recorded and quantified. Consistently, better ADAS reliability under poor visibility conditions was achieved with Type II road markings (i.e. structured markings, facilitating moisture drainage) as compared to Type I road markings (i.e. flat lines). To further understand these failures, an analysis of the contrast ratio of road markings, which the tested ADAS was detecting for traffic lane recognition, was performed. The highest contrast ratio (greater than 0.5, calculated per Michelson equation) was measured at night in the absence of confounding factors, with a statistically significant difference of 0.1 in favour of Type II road markings over Type I. Under daylight conditions, contrast ratio was reduced, with slightly higher values measured with Type I. The presence of rain or wet roads caused the deterioration of the contrast ratio, with Type II road markings exhibiting significantly higher contrast ratio than Type I, even though the values were low (less than 0.1). These findings matched the output of the ADAS related to traffic lane detection and underlined the importance of road marking visibility. Inadequate lane recognition by ADAS was indeed associated with a very low contrast ratio of road markings. Importantly, a specific minimum contrast ratio value could not be found, which was due to the complexity of ADAS algorithms…
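
The Michelson equation referenced in the abstract defines contrast as (L_max - L_min) / (L_max + L_min). The sketch below applies it to a marking and road-surface luminance pair; the function name `michelson_contrast`, the argument names, and the sample values are illustrative assumptions, and how the paper samples luminance from camera images is not described in the abstract.

```python
def michelson_contrast(marking_luminance, road_luminance):
    """Michelson contrast between a road marking and the surrounding surface."""
    high = max(marking_luminance, road_luminance)
    low = min(marking_luminance, road_luminance)
    if high + low == 0:
        # Both regions are completely dark, so there is no measurable contrast.
        return 0.0
    return (high - low) / (high + low)
```

With made-up luminances of 150 and 50 the ratio is 0.5, while 110 and 90 give 0.1, which happens to match the order of magnitude of the night-time and wet-road values quoted in the abstract.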

[CV-72] Precipitation Nowcasting Using Diffusion Transformer with Causal Attention

[CV-73] Enhancing Dataset Distillation via Label Inconsistency Elimination and Learning Pattern Refinement ECCV2024

Link: https://arxiv.org/abs/2410.13311
Authors: Chuhao Zhou, Chenxi Jiang, Yi Xie, Haozhi Cao, Jianfei Yang
Keywords: achieve performance similar, Data Distillation Challenge, entire original dataset, seeks to create, create a condensed
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2024 Dataset Distillation Challenge

Abstract:Dataset Distillation (DD) seeks to create a condensed dataset that, when used to train a model, enables the model to achieve performance similar to that of a model trained on the entire original dataset. It relieves the model training from processing massive data and thus reduces the computation resources, storage, and time costs. This paper illustrates our solution that ranks 1st in the ECCV-2024 Data Distillation Challenge (track 1). Our solution, Modified Difficulty-Aligned Trajectory Matching (M-DATM), introduces two key modifications to the original state-of-the-art method DATM: (1) the soft labels learned by DATM do not achieve one-to-one correspondence with the counterparts generated by the official evaluation script, so we remove the soft labels technique to alleviate such inconsistency; (2) since the removal of soft labels makes it harder for the synthetic dataset to learn late trajectory information, particularly on Tiny ImageNet, we reduce the matching range, allowing the synthetic data to concentrate more on the easier patterns. In the final evaluation, our M-DATM achieved accuracies of 0.4061 and 0.1831 on the CIFAR-100 and Tiny ImageNet datasets, ranking 1st in the Fixed Images Per Class (IPC) Track.

[CV-74] Reference-Based Post-OCR Processing with LLM for Diacritic Languages

[CV-75] PiLocNet: Physics-informed neural network on 3D localization with rotating point spread function

[CV-76] LESS: Label-Efficient and Single-Stage Referring 3D Segmentation

Link: https://arxiv.org/abs/2410.13294
Authors: Xuexun Liu, Xiaoxu Xu, Jinlong Li, Qiudan Zhang, Xu Wang, Nicu Sebe, Lin Ma
Keywords: visual-language task, task that segments, language-agnostic instance segmentation, Segmentation, query
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Referring 3D Segmentation is a visual-language task that segments all points of the specified object from a 3D point cloud described by a query sentence. Previous works follow a two-stage paradigm, first conducting language-agnostic instance segmentation and then matching with the given text query. However, the semantic concepts from the text query and the visual cues interact only separately during training, and both instance and semantic labels for each object are required, which is time-consuming and labor-intensive. To mitigate these issues, we propose a novel Referring 3D Segmentation pipeline, Label-Efficient and Single-Stage, dubbed LESS, which is supervised only by efficient binary masks. Specifically, we design a Point-Word Cross-Modal Alignment module for aligning the fine-grained features of points and textual embedding. A Query Mask Predictor module and a Query-Sentence Alignment module are introduced for coarse-grained alignment between masks and query. Furthermore, we propose an area regularization loss, which coarsely reduces irrelevant background predictions on a large scale. In addition, a point-to-point contrastive loss is proposed that concentrates on distinguishing points with subtly similar features. Through extensive experiments, we achieve state-of-the-art performance on the ScanRefer dataset, surpassing previous methods by about 3.7% mIoU using only binary labels.

[CV-77] Composing Novel Classes: A Concept-Driven Approach to Generalized Category Discovery

Link: https://arxiv.org/abs/2410.13285
Authors: Chuyu Zhang, Peiyan Gu, Xueyang Yu, Xuming He
Keywords: generalized category discovery, category discovery, class, tackle the generalized, generalized category
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review. The first two authors contributed equally

Abstract:We tackle the generalized category discovery (GCD) problem, which aims to discover novel classes in unlabeled datasets by leveraging the knowledge of known classes. Previous works utilize the known class knowledge through shared representation spaces. Despite their progress, our analysis experiments show that novel classes can achieve impressive clustering results on the feature space of a known class pre-trained model, suggesting that existing methods may not fully utilize known class knowledge. To address this, we introduce a novel concept learning framework for GCD, named ConceptGCD, that categorizes concepts into two types: derivable and underivable from known class concepts, and adopts a stage-wise learning strategy to learn them separately. Specifically, our framework first extracts known class concepts by a known class pre-trained model and then produces derivable concepts from them by a generator layer with a covariance-augmented loss. Subsequently, we expand the generator layer to learn underivable concepts in a balanced manner ensured by a concept score normalization strategy and integrate a contrastive loss to preserve previously learned concepts. Extensive experiments on various benchmark datasets demonstrate the superiority of our approach over the previous state-of-the-art methods. Code will be available soon.

[CV-78] Hybrid bundle-adjusting 3D Gaussians for view consistent rendering with pose optimization

Link: https://arxiv.org/abs/2410.13280
Authors: Yanan Guo, Ying Xie, Ying Chang, Benkui Zhang, Bo Jia, Lin Cao
Keywords: computer vision, made significant progress, synthesis has made, view synthesis, Abstract
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Photonics Asia 2024

Abstract:Novel view synthesis has made significant progress in the field of 3D computer vision. However, the rendering of view-consistent novel views from imperfect camera poses remains challenging. In this paper, we introduce a hybrid bundle-adjusting 3D Gaussians model that enables view-consistent rendering with pose optimization. This model jointly extracts image-based and neural 3D representations to simultaneously generate view-consistent images and camera poses within forward-facing scenes. The effectiveness of our model is demonstrated through extensive experiments conducted on both real and synthetic datasets. These experiments clearly illustrate that our model can effectively optimize neural scene representations while simultaneously resolving significant camera pose misalignments. The source code is available at this https URL.

[CV-79] Inductive Gradient Adjustment For Spectral Bias In Implicit Neural Representations

Link: https://arxiv.org/abs/2410.13271
Authors: Kexuan Shi,Hai Chen,Leheng Zhang,Shuhang Gu
Keywords: Implicit Neural Representations, Implicit Neural, Neural Tangent Kernel, versatile representation paradigm, computer vision tasks
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 28 pages, 12 figures

Abstract:Implicit Neural Representations (INRs), as a versatile representation paradigm, have achieved success in various computer vision tasks. Due to the spectral bias of vanilla multi-layer perceptrons (MLPs), existing methods focus on designing MLPs with sophisticated architectures or repurposing training techniques for highly accurate INRs. In this paper, we delve into the linear dynamics model of MLPs and theoretically identify the empirical Neural Tangent Kernel (eNTK) matrix as a reliable link between spectral bias and training dynamics. Based on the eNTK matrix, we propose a practical inductive gradient adjustment method, which can purposefully improve the spectral bias via inductive generalization of an eNTK-based gradient transformation matrix. We evaluate our method on different INR tasks with various INR architectures and compare it to existing training techniques. The superior representation performance clearly validates the advantage of our proposed method. Armed with our gradient adjustment method, better INRs with more enhanced texture details and sharpened edges can be learned from data by tailored improvements on spectral bias.

[CV-80] Fundus to Fluorescein Angiography Video Generation as a Retinal Generative Foundation Model

Link: https://arxiv.org/abs/2410.13242
Authors: Weiyi Zhang,Jiancheng Yang,Ruoyu Chen,Siyu Huang,Pusheng Xu,Xiaolan Chen,Shanfu Lu,Hongyu Cao,Mingguang He,Danli Shi
Keywords: Fundus fluorescein angiography, restricted accessibility compared, Fundus fluorescein, color fundus, monitoring retinal vascular
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Fundus fluorescein angiography (FFA) is crucial for diagnosing and monitoring retinal vascular issues but is limited by its invasive nature and restricted accessibility compared to color fundus (CF) imaging. Existing methods that convert CF images to FFA are confined to static image generation, missing the dynamic lesional changes. We introduce Fundus2Video, an autoregressive generative adversarial network (GAN) model that generates dynamic FFA videos from single CF images. Fundus2Video excels in video generation, achieving an FVD of 1497.12 and a PSNR of 11.77. Clinical experts have validated the fidelity of the generated videos. Additionally, the model’s generator demonstrates remarkable downstream transferability across ten external public datasets, including blood vessel segmentation, retinal disease diagnosis, systemic disease prediction, and multimodal retrieval, showcasing impressive zero-shot and few-shot capabilities. These findings position Fundus2Video as a powerful, non-invasive alternative to FFA exams and a versatile retinal generative foundation model that captures both static and temporal retinal features, enabling the representation of complex inter-modality relationships.

[CV-81] Latent Image and Video Resolution Prediction using Convolutional Neural Networks ICIP

Link: https://arxiv.org/abs/2410.13227
Authors: Rittwika Kansabanik,Adrian Barbu
Keywords: Video Quality Assessment, Quality Assessment, Video Quality, Convolutional Neural Networks, received little attention
Categories: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
Comments: Submitted to the ICIP conference

Abstract:This paper introduces a Video Quality Assessment (VQA) problem that has received little attention in the literature, called the latent resolution prediction problem. The problem arises when images or videos are upscaled from their native resolution and are reported as having a higher resolution than their native resolution. This paper formulates the problem, constructs a dataset for training and evaluation, and introduces several machine learning algorithms, including two Convolutional Neural Networks (CNNs), to address this problem. Experiments indicate that some proposed methods can predict the latent video resolution with about 95% accuracy.

[CV-82] UniG: Modelling Unitary 3D Gaussians for View-consistent 3D Reconstruction

Link: https://arxiv.org/abs/2410.13195
Authors: Jiamin Wu,Kenkun Liu,Yukai Shi,Xiaoke Jiang,Yuan Yao,Lei Zhang
Keywords: view synthesis model, present UniG, synthesis model, model that generates, generates a high-fidelity
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:In this work, we present UniG, a view-consistent 3D reconstruction and novel view synthesis model that generates a high-fidelity representation of 3D Gaussians from sparse images. Existing 3D Gaussians-based methods usually regress Gaussians per-pixel of each view, create 3D Gaussians per view separately, and merge them through point concatenation. Such a view-independent reconstruction approach often results in a view inconsistency issue, where the predicted positions of the same 3D point from different views may have discrepancies. To address this problem, we develop a DETR (DEtection TRansformer)-like framework, which treats 3D Gaussians as decoder queries and updates their parameters layer by layer by performing multi-view cross-attention (MVDFA) over multiple input images. In this way, multiple views naturally contribute to modeling a unitary representation of 3D Gaussians, thereby making 3D reconstruction more view-consistent. Moreover, because the number of 3D Gaussians used as decoder queries is independent of the number of input views, the model allows an arbitrary number of input images without causing memory explosion. Extensive experiments validate the advantages of our approach, showcasing superior performance over existing methods quantitatively (improving PSNR by 4.2 dB when trained on Objaverse and tested on the GSO benchmark) and qualitatively.

[CV-83] Golyadkin's Torment: Doppelgängers and Adversarial Vulnerability

[CV-84] FAMSeC: A Few-shot-sample-based General AI-generated Image Detection Method

Link: https://arxiv.org/abs/2410.13156
Authors: Juncong Xu,Yang Yang,Han Fang,Honggu Liu,Weiming Zhang
Keywords: raising security concerns, Forgery Awareness Module, Semantic feature-guided Contrastive, raising security, explosive growth
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The explosive growth of generative AI has saturated the internet with AI-generated images, raising security concerns and increasing the need for reliable detection methods. The primary requirement for such detection is generalizability, typically achieved by training on numerous fake images from various models. However, practical limitations, such as closed-source models and restricted access, often result in limited training samples. Therefore, training a general detector with few-shot samples is essential for modern detection mechanisms. To address this challenge, we propose FAMSeC, a general AI-generated image detection method based on a LoRA-based Forgery Awareness Module and a Semantic feature-guided Contrastive learning strategy. To effectively learn from limited samples and prevent overfitting, we developed a Forgery Awareness Module (FAM) based on LoRA, maintaining the generalization of pre-trained features. Additionally, to cooperate with FAM, we designed a Semantic feature-guided Contrastive learning strategy (SeC), making the FAM focus more on the differences between real and fake images than on the features of the samples themselves. Experiments show that FAMSeC outperforms state-of-the-art methods, enhancing classification accuracy by 14.55% with just 0.56% of the training samples.

[CV-85] Utilizing Large Language Models in An Iterative Paradigm with Domain Feedback for Molecule Optimization

[CV-86] Mapping Bias in Vision Language Models: Signposts, Pitfalls, and the Road Ahead NAACL2025

[CV-87] See Behind Walls in Real-time Using Aerial Drones and Augmented Reality

Link: https://arxiv.org/abs/2410.13139
Authors: Sikai Yang,Kang Yang,Yuning Chen,Fan Zhao,Wan Du
Keywords: enables real-time through-wall, real-time through-wall surveillance, work presents, augmented reality, framework that enables
Categories: Multiagent Systems (cs.MA); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: 6 pages

Abstract:This work presents ARD2, a framework that enables real-time through-wall surveillance using two aerial drones and an augmented reality (AR) device. ARD2 consists of two main steps: target direction estimation and contour reconstruction. In the first stage, ARD2 leverages geometric relationships between the drones, the user, and the target to project the target’s direction onto the user’s AR display. In the second stage, images from the drones are synthesized to reconstruct the target’s contour, allowing the user to visualize the target behind walls. Experimental results demonstrate the system’s accuracy in both direction estimation and contour reconstruction.

[CV-88] Unlocking the Capabilities of Masked Generative Models for Image Synthesis via Self-Guidance NEURIPS2024

Link: https://arxiv.org/abs/2410.13136
Authors: Jiwan Hur,Dong-Jae Lee,Gyojin Han,Jaehyun Choi,Yunho Jeon,Junmo Kim
Keywords: Masked generative models, shown impressive generative, impressive generative ability, continuous diffusion models, Masked generative
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2024. Code is available at: this https URL

Abstract:Masked generative models (MGMs) have shown impressive generative ability while requiring an order of magnitude fewer sampling steps than continuous diffusion models. However, MGMs still underperform in image synthesis compared to recent well-developed continuous diffusion models of similar size in terms of the quality and diversity of generated samples. A key factor in the performance of continuous diffusion models stems from the guidance methods, which enhance the sample quality at the expense of diversity. In this paper, we extend these guidance methods to a generalized guidance formulation for MGMs and propose a self-guidance sampling method, which leads to better generation quality. The proposed approach leverages an auxiliary task for semantic smoothing in vector-quantized token space, analogous to the Gaussian blur in continuous pixel space. Equipped with the parameter-efficient fine-tuning method and high-temperature sampling, MGMs with the proposed self-guidance achieve a superior quality-diversity trade-off, outperforming existing sampling methods in MGMs with more efficient training and sampling costs. Extensive experiments with the various sampling hyperparameters confirm the effectiveness of the proposed self-guidance.
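The abstract does not give the guidance formula, so the following is only a minimal sketch of the general idea behind guidance on token logits: extrapolate the model's prediction away from an auxiliary (here, "smoothed") prediction before sampling. The function names, the logit values, and the extrapolation rule `l + s * (l - l_aux)` are assumptions for illustration, not the paper's formulation.

```python
import math

def guided_logits(cond_logits, aux_logits, scale):
    """Extrapolate away from the auxiliary prediction: l + scale * (l - l_aux)."""
    return [c + scale * (c - a) for c, a in zip(cond_logits, aux_logits)]

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional sampling temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits over three codebook tokens at one masked position.
model_logits = [2.0, 1.0, 0.0]
# Stand-in for the output of a semantically smoothed auxiliary prediction.
smoothed_logits = [1.0, 1.0, 1.0]

plain_probs = softmax(model_logits)
guided_probs = softmax(guided_logits(model_logits, smoothed_logits, scale=1.0))
```

With these toy numbers, guidance sharpens the distribution toward the token the model already prefers, which is the usual quality-for-diversity trade that the abstract attributes to guidance methods.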

[CV-89] Boosting Imperceptibility of Stable Diffusion-based Adversarial Examples Generation with Momentum

Link: https://arxiv.org/abs/2410.13122
Authors: Nashrah Haque,Xiang Li,Zhehui Chen,Yanzhao Wu,Lei Yu,Arun Iyengar,Wenqi Wei
Keywords: Diffusion-based Momentum Integrated, Stable Diffusion-based Momentum, effectively mislead neural, mislead neural network, neural network classifiers
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 12 figures. To be published in IEEE TPS 2024 Proceedings. Code available on GitHub: this https URL

Abstract:We propose a novel framework, Stable Diffusion-based Momentum Integrated Adversarial Examples (SD-MIAE), for generating adversarial examples that can effectively mislead neural network classifiers while maintaining visual imperceptibility and preserving the semantic similarity to the original class label. Our method leverages the text-to-image generation capabilities of the Stable Diffusion model by manipulating token embeddings corresponding to the specified class in its latent space. These token embeddings guide the generation of adversarial images that maintain high visual fidelity. The SD-MIAE framework consists of two phases: (1) an initial adversarial optimization phase that modifies token embeddings to produce misclassified yet natural-looking images and (2) a momentum-based optimization phase that refines the adversarial perturbations. By introducing momentum, our approach stabilizes the optimization of perturbations across iterations, enhancing both the misclassification rate and visual fidelity of the generated adversarial examples. Experimental results demonstrate that SD-MIAE achieves a high misclassification rate of 79%, improving by 35% over the state-of-the-art method while preserving the imperceptibility of adversarial perturbations and the semantic similarity to the original class label, making it a practical method for robust adversarial evaluation.
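To make the role of momentum concrete, here is a minimal sketch of a momentum-accumulated gradient step in the style of momentum iterative attacks. It is not the SD-MIAE update itself: the function name, the L1 normalization, the step size, and the toy gradient are all assumptions, and a real attack would obtain the gradient from a classifier loss with respect to the token embedding.

```python
def momentum_step(embedding, grad, velocity, lr=0.1, decay=0.9):
    """One ascent step: accumulate the L1-normalized gradient into a velocity."""
    l1 = sum(abs(g) for g in grad) or 1.0  # avoid dividing by zero
    new_velocity = [decay * v + g / l1 for v, g in zip(velocity, grad)]
    new_embedding = [e + lr * v for e, v in zip(embedding, new_velocity)]
    return new_embedding, new_velocity

# Toy two-dimensional "token embedding" and a fixed, hypothetical gradient.
embedding = [0.0, 0.0]
velocity = [0.0, 0.0]
toy_grad = [1.0, -3.0]

for _ in range(2):
    embedding, velocity = momentum_step(embedding, toy_grad, velocity)
```

Because the velocity carries over between iterations, consecutive gradients that agree reinforce each other, while noisy gradients that flip sign partly cancel, which is the stabilizing effect the abstract refers to.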

[CV-90] Trust but Verify: Programmatic VLM Evaluation in the Wild

[CV-91] Task Consistent Prototype Learning for Incremental Few-shot Semantic Segmentation

[CV-92] MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models

[CV-93] BOXR: Body and head motion Optimization framework for eXtended Reality

Link: https://arxiv.org/abs/2410.13084
Authors: Ziliang Zhang,Zexin Li,Hyoseung Kim,Cong Liu
Keywords: frequent head motions, enhanced user mobility, frequent body motions, accommodating both subtle, frequent head
Categories: Systems and Control (eess.SY); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: Accepted to 45th IEEE Real-Time Systems Symposium (RTSS’24)

Abstract:The emergence of standalone XR systems has enhanced user mobility, accommodating both subtle, frequent head motions and substantial, less frequent body motions. However, the pervasively used M2D latency metric, which measures the delay between the most recent motion and its corresponding display update, only accounts for head motions. This oversight can leave users prone to motion sickness if significant body motion is involved. Although existing methods optimize M2D latency through asynchronous task scheduling and reprojection methods, they introduce challenges like resource contention between tasks and outdated pose data. These challenges are further complicated by user motion dynamics and scene changes during runtime. To address these issues, we for the first time introduce the C2D latency metric, which captures the delay caused by body motions, and present BOXR, a framework designed to co-optimize both body and head motion delays within an XR system. BOXR enhances the coordination between M2D and C2D latencies by efficiently scheduling tasks to avoid contentions while maintaining an up-to-date pose in the output frame. Moreover, BOXR incorporates a motion-driven visual inertial odometer to adjust to user motion dynamics and employs scene-dependent foveated rendering to manage changes in the scene effectively. Our evaluations show that BOXR significantly outperforms state-of-the-art solutions in 11 EuRoC MAV datasets across 4 XR applications across 3 hardware platforms. In controlled motion and scene settings, BOXR reduces M2D and C2D latencies by up to 63% and 27%, respectively and increases frame rate by up to 43%. In practical deployments, BOXR achieves substantial reductions in real-world scenarios up to 42% in M2D latency and 31% in C2D latency while maintaining remarkably low miss rates of only 1.6% for M2D requirements and 1.0% for C2D requirements.

[CV-94] A low complexity contextual stacked ensemble-learning approach for pedestrian intent prediction

Link: https://arxiv.org/abs/2410.13039
Authors: Chia-Yen Chiang,Yasmin Fathy,Gregory Slabaugh,Mona Jaber
Keywords: promoting sustainable transport, sustainable transport, form of active, active travel, travel is essential
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Walking as a form of active travel is essential in promoting sustainable transport. It is thus crucial to accurately predict pedestrian crossing intention and avoid collisions, especially with the advent of autonomous and advanced driver-assisted vehicles. Current research leverages computer vision and machine learning advances to predict near-misses; however, this often requires high computation power to yield reliable results. In contrast, this work proposes a low-complexity ensemble-learning approach that employs contextual data for predicting the pedestrian’s intent for crossing. The pedestrian is first detected, their image is then compressed using skeletonization, and contextual information is added into a stacked ensemble-learning approach. Our experiments on different datasets achieve pedestrian intent prediction performance comparable to state-of-the-art approaches with a 99.7% reduction in computational complexity. Our source code and trained models will be released upon paper acceptance.

[CV-95] Sensitivity of Generative VLMs to Semantically and Lexically Altered Prompts

[CV-96] Geometric Trajectory Diffusion Models NEURIPS2024

Link: https://arxiv.org/abs/2410.13027
Authors: Jiaqi Han,Minkai Xu,Aaron Lou,Haotian Ye,Stefano Ermon
Keywords: shown great promise, natural science domains, Generative models, promise in generating, protein design
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Published at NeurIPS 2024. 29 pages, 10 figures

Abstract:Generative models have shown great promise in generating 3D geometric systems, which is a fundamental problem in many natural science domains such as molecule and protein design. However, existing approaches only operate on static structures, neglecting the fact that physical systems are always dynamic in nature. In this work, we propose geometric trajectory diffusion models (GeoTDM), the first diffusion model for modeling the temporal distribution of 3D geometric trajectories. Modeling such distribution is challenging as it requires capturing both the complex spatial interactions with physical symmetries and temporal correspondence encapsulated in the dynamics. We theoretically justify that diffusion models with equivariant temporal kernels can lead to density with desired symmetry, and develop a novel transition kernel leveraging SE(3)-equivariant spatial convolution and temporal attention. Furthermore, to induce an expressive trajectory distribution for conditional generation, we introduce a generalized learnable geometric prior into the forward diffusion process to enhance temporal conditioning. We conduct extensive experiments on both unconditional and conditional generation in various scenarios, including physical simulation, molecular dynamics, and pedestrian motion. Empirical results on a wide suite of metrics demonstrate that GeoTDM can generate realistic geometric trajectories with significantly higher quality.

[CV-97] Interpreting and Analyzing CLIP's Zero-Shot Image Classification via Mutual Knowledge NEURIPS2024

Link: https://arxiv.org/abs/2410.13016
Authors: Fawaz Sammani,Nikos Deligiannis
Keywords: Contrastive Language-Image Pretraining, Contrastive Language-Image, textual class representation, shared embedding space, class representation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to NeurIPS 2024

Abstract:Contrastive Language-Image Pretraining (CLIP) performs zero-shot image classification by mapping images and textual class representation into a shared embedding space, then retrieving the class closest to the image. This work provides a new approach for interpreting CLIP models for image classification from the lens of mutual knowledge between the two modalities. Specifically, we ask: what concepts do both vision and language CLIP encoders learn in common that influence the joint embedding space, causing points to be closer or further apart? We answer this question via an approach of textual concept-based explanations, showing their effectiveness, and perform an analysis encompassing a pool of 13 CLIP models varying in architecture, size and pretraining datasets. We explore those different aspects in relation to mutual knowledge, and analyze zero-shot predictions. Our approach demonstrates an effective and human-friendly way of understanding zero-shot classification decisions with CLIP.
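The zero-shot procedure described in the first sentence of the abstract (embed the image and the class texts in a shared space, then pick the closest class) can be sketched as follows. The three-dimensional vectors and class names are made-up stand-ins; in practice the embeddings would come from CLIP's image and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_embedding, class_embeddings):
    """Return the class whose text embedding is closest to the image embedding."""
    scores = {name: cosine(image_embedding, emb)
              for name, emb in class_embeddings.items()}
    return max(scores, key=scores.get), scores

# Hypothetical text embeddings for three class prompts.
class_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "car": [0.0, 0.1, 0.9],
}
# Hypothetical image embedding in the same space.
image_embedding = [0.8, 0.3, 0.1]

predicted, scores = zero_shot_classify(image_embedding, class_embeddings)
```

Here `predicted` is `"cat"`, since that class embedding has the highest cosine similarity with the image embedding.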

[CV-98] Hiding-in-Plain-Sight (HiPS) Attack on CLIP for Targetted Object Removal from Images NEURIPS2024

[CV-99] Configurable Embodied Data Generation for Class-Agnostic RGB-D Video Segmentation

Link: https://arxiv.org/abs/2410.12995
Authors: Anthony Opipari,Aravindhan K Krishnan,Shreekant Gayaka,Min Sun,Cheng-Hao Kuo,Arnie Sen,Odest Chadwicke Jenkins
Keywords: improve class-agnostic video, generating large-scale datasets, video segmentation, form factors, paper presents
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in IEEE Robotics and Automation Letters October 2024

Abstract:This paper presents a method for generating large-scale datasets to improve class-agnostic video segmentation across robots with different form factors. Specifically, we consider the question of whether video segmentation models trained on generic segmentation data could be more effective for particular robot platforms if robot embodiment is factored into the data generation process. To answer this question, a pipeline is formulated for using 3D reconstructions (e.g. from HM3DSem) to generate segmented videos that are configurable based on a robot’s embodiment (e.g. sensor type, sensor placement, and illumination source). A resulting massive RGB-D video panoptic segmentation dataset (MVPd) is introduced for extensive benchmarking with foundation and video segmentation models, as well as to support embodiment-focused research in video segmentation. Our experimental findings demonstrate that using MVPd for finetuning can lead to performance improvements when transferring foundation models to certain robot embodiments, such as specific camera placements. These experiments also show that using 3D modalities (depth images and camera pose) can lead to improvements in video segmentation accuracy and consistency. The project webpage is available at this https URL

[CV-100] Explainable Binary Classification of Separable Shape Ensembles

Link: https://arxiv.org/abs/2410.12994
Authors: Zachary Grey,Nicholas Fisher,Andrew Glaws,Gunay Dogan
Keywords: representing grain boundaries, Materials scientists utilize, ensembles representing grain, scientists utilize image, utilize image segmentation
Categories: Computer Vision and Pattern Recognition (cs.CV); Statistics Theory (math.ST)
Comments: 20 pages, 7 figures

Abstract:Materials scientists utilize image segmentation of micrographs to create large curve ensembles representing grain boundaries of material microstructures. Observations of these collections of shapes can facilitate inferences about material properties and manufacturing processes. We seek to bolster this application, and related engineering/scientific tasks, using novel pattern recognition formalisms and inference over large ensembles of segmented curves – i.e., facilitate principled assessments for quantifying differences in distributions of shapes. To this end, we apply a composite integral operator to motivate accurate and efficient numerical representations of discrete planar curves over matrix manifolds. The main result is a rigid-invariant orthonormal decomposition of curve component functions into separable forms of scale variations and complementary features of undulation. We demonstrate how these separable shape tensors – given thousands of curves in an ensemble – can inform explainable binary classification of segmented images by utilizing a product maximum mean discrepancy to distinguish the shape distributions; absent labelled data, building interpretable feature spaces in seconds without high performance computation, and detecting discrepancies below cursory visual inspections.
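For readers unfamiliar with maximum mean discrepancy (MMD), the sketch below computes a plain, biased estimate of squared MMD with a Gaussian kernel between two small point sets. It is not the product MMD over separable shape tensors that the abstract describes; the kernel choice, bandwidth, and sample points are assumptions chosen only to show how the statistic separates two distributions.

```python
import math

def rbf(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""
    k_xx = sum(rbf(a, b, bandwidth) for a in xs for b in xs) / len(xs) ** 2
    k_yy = sum(rbf(a, b, bandwidth) for a in ys for b in ys) / len(ys) ** 2
    k_xy = sum(rbf(a, b, bandwidth) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2.0 * k_xy

# Two hypothetical ensembles of 2-D shape features; the second is shifted by (5, 5).
ensemble_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
ensemble_b = [(x + 5.0, y + 5.0) for x, y in ensemble_a]

same = mmd_squared(ensemble_a, ensemble_a)
different = mmd_squared(ensemble_a, ensemble_b)
```

The statistic is essentially zero when both samples coincide and grows as the two sets of features move apart, which is what makes it usable as a label-free test for differences between shape distributions.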

[CV-101] Risk Assessment for Autonomous Landing in Urban Environments using Semantic Segmentation

Link: https://arxiv.org/abs/2410.12988
Authors: Jesús Alejandro Loera-Ponce,Diego A. Mercado-Ravell,Israel Becerra-Durán,Luis Manuel Valentin-Coronado
Keywords: deep neural networks, complex urban environments, vision-based autonomous landing, autonomous landing problem, urban environments
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Abstract:In this paper, we address the vision-based autonomous landing problem in complex urban environments using deep neural networks for semantic segmentation and risk assessment. We propose employing the SegFormer, a state-of-the-art visual transformer network, for the semantic segmentation of complex, unstructured urban environments. This approach yields valuable information that can be utilized in smart autonomous landing missions, particularly in emergency landing scenarios resulting from system failures or human errors. The assessment is done in real-time flight, when images from an RGB camera on the Unmanned Aerial Vehicle (UAV) are segmented with the SegFormer into the most common classes found in urban environments. These classes are then mapped to a level of risk, considering potential material damage, damage to the drone itself, and danger to people. The proposed strategy is validated through several case studies, demonstrating the huge potential of semantic segmentation-based strategies for determining the safest landing areas for autonomous emergency landing, which we believe will help unleash the full potential of UAVs in civil applications within urban areas.
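The class-to-risk mapping step can be illustrated with a small lookup-table sketch. The class names, the risk values, and the helper functions are hypothetical; the abstract does not list the classes or risk levels used, and a real pipeline would operate on a SegFormer label mask rather than a hand-written grid.

```python
# Hypothetical risk levels per semantic class (0 = safest, 1 = most dangerous).
RISK_BY_CLASS = {
    "grass": 0.1,
    "roof": 0.4,
    "road": 0.7,
    "vehicle": 0.9,
    "person": 1.0,
}

def risk_map(label_grid):
    """Convert a grid of class labels into a grid of risk values."""
    return [[RISK_BY_CLASS[label] for label in row] for row in label_grid]

def safest_cell(risk_grid):
    """Return (row, col) of the lowest-risk cell; the first one wins on ties."""
    cells = [(value, r, c)
             for r, row in enumerate(risk_grid)
             for c, value in enumerate(row)]
    _, row, col = min(cells)
    return row, col

# Toy 2x3 segmentation result standing in for a segmented camera frame.
segmented = [
    ["road", "person", "vehicle"],
    ["roof", "grass", "road"],
]

risks = risk_map(segmented)
landing_cell = safest_cell(risks)
```

In this toy frame the grass cell at row 1, column 1 has the lowest risk, so it is selected as the landing candidate.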

[CV-102] Super-resolving Real-world Image Illumination Enhancement: A New Dataset and A Conditional Diffusion Model

Link: https://arxiv.org/abs/2410.12961
Authors: Yang Liu,Yaofang Liu,Jinshan Pan,Yuxiang Hui,Fan Jia,Raymond H. Chan,Tieyong Zeng
Keywords: developed to improve, existing super-resolution methods, well-lighted conditions, SRRIIE dataset, dataset
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code and dataset at this https URL

Abstract:Most existing super-resolution methods and datasets have been developed to improve image quality in well-lit conditions. However, these methods do not work well in real-world low-light conditions, as the images captured in such conditions lose most of their important information and contain significant unknown noise. To solve this problem, we propose the SRRIIE dataset together with an efficient method based on conditional diffusion probabilistic models. The proposed dataset contains 4800 paired low-high quality images. To ensure that the dataset is able to model real-world image degradation in low-illumination environments, we capture images using an ILDC camera and an optical zoom lens with exposure levels ranging from -6 EV to 0 EV and ISO levels ranging from 50 to 12800. We comprehensively evaluate with various reconstruction and perceptual metrics and demonstrate the practicality of the SRRIIE dataset for deep learning-based methods. We show that most existing methods are less effective in preserving the structures and sharpness of restored images under complicated noise. To overcome this problem, we revise the condition for Raw sensor data and propose a novel time-melding condition for the diffusion probabilistic model. Comprehensive quantitative and qualitative experimental results on the real-world benchmark datasets demonstrate the feasibility and effectiveness of the proposed conditional diffusion probabilistic model on Raw sensor data. Code and dataset will be available at this https URL

[CV-103] MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization

Link: https://arxiv.org/abs/2410.12957
Authors: Ruiqi Li,Siqi Zheng,Xize Cheng,Ziang Zhang,Shengpeng Ji,Zhou Zhao
Keywords: involves generating music, involves generating, Generating music, challenging task, requires a deep
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: Work in progress

Abstract:Generating music that aligns with the visual content of a video has been a challenging task, as it requires a deep understanding of visual semantics and involves generating music whose melody, rhythm, and dynamics harmonize with the visual narratives. This paper presents MuVi, a novel framework that effectively addresses these challenges to enhance the cohesion and immersive experience of audio-visual content. MuVi analyzes video content through a specially designed visual adaptor to extract contextually and temporally relevant features. These features are used to generate music that matches not only the video’s mood and theme but also its rhythm and pacing. We also introduce a contrastive music-visual pre-training scheme to ensure synchronization, based on the periodic nature of music phrases. In addition, we demonstrate that our flow-matching-based music generator has in-context learning ability, allowing us to control the style and genre of the generated music. Experimental results show that MuVi demonstrates superior performance in both audio quality and temporal synchronization. The generated music video samples are available at this https URL.

[CV-104] Long-Tailed Backdoor Attack Using Dynamic Data Augmentation Operations

[CV-105] Syn2Real Domain Generalization for Underwater Mine-like Object Detection Using Side-Scan Sonar

Link: https://arxiv.org/abs/2410.12953
Authors: Aayush Agrawal,Aniruddh Sikdar,Rajini Makam,Suresh Sundaram,Suresh Kumar Besai,Mahesh Gopi
Keywords: deep learning suffers, suffers from limitations, limitations due, Underwater mine detection, data
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 7 pages, 4 figures and 3 tables

Abstract:Underwater mine detection with deep learning suffers from limitations due to the scarcity of real-world data. This scarcity leads to overfitting, where models perform well on training data but poorly on unseen data. This paper proposes a Syn2Real (Synthetic to Real) domain generalization approach using diffusion models to address this challenge. We demonstrate that synthetic data generated with noise by DDPM and DDIM models, even if not perfectly realistic, can effectively augment real-world samples for training. The residual noise in the final sampled images improves the model’s ability to generalize to real-world data with inherent noise and high variation. The baseline Mask-RCNN model when trained on a combination of synthetic and original training datasets, exhibited approximately a 60% increase in Average Precision (AP) compared to being trained solely on the original training data. This significant improvement highlights the potential of Syn2Real domain generalization for underwater mine detection tasks.

[CV-106] Gradient Map-Assisted Head and Neck Tumor Segmentation: A Pre-RT to Mid-RT Approach in MRI-Guided Radiotherapy

[CV-107] UMambaAdj: Advancing GTV Segmentation for Head and Neck Cancer in MRI-Guided RT with UMamba and nnU-Net ResEnc Planner

[CV-108] DreamCraft3D++: Efficient Hierarchical 3D Generation with Multi-Plane Reconstruction Model

Link: https://arxiv.org/abs/2410.12928
Authors: Jingxiang Sun,Cheng Peng,Ruizhi Shao,Yuan-Chen Guo,Xiaochen Zhao,Yangguang Li,Yanpei Cao,Bo Zhang,Yebin Liu
Keywords: efficient high-quality generation, enables efficient high-quality, efficient high-quality, high-quality generation, multi-stage generation process
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:We introduce DreamCraft3D++, an extension of DreamCraft3D that enables efficient high-quality generation of complex 3D assets. DreamCraft3D++ inherits the multi-stage generation process of DreamCraft3D, but replaces the time-consuming geometry sculpting optimization with a feed-forward multi-plane based reconstruction model, speeding up the process by 1000x. For texture refinement, we propose a training-free IP-Adapter module that is conditioned on the enhanced multi-view images to enhance texture and geometry consistency, providing a 4x faster alternative to DreamCraft3D’s DreamBooth fine-tuning. Experiments on diverse datasets demonstrate DreamCraft3D++'s ability to generate creative 3D assets with intricate geometry and realistic 360° textures, outperforming state-of-the-art image-to-3D methods in quality and speed. The full implementation will be open-sourced to enable new possibilities in 3D content creation.

[CV-109] DEeR: Deviation Eliminating and Noise Regulating for Privacy-preserving Federated Low-rank Adaptation

Link: https://arxiv.org/abs/2410.12926
Authors: Meilu Zhu,Axiu Mao,Jun Liu,Yixuan Yuan
Keywords: Integrating low-rank adaptation, widespread attention recently, pretrained foundation models, received widespread attention, adapt pretrained foundation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Integrating low-rank adaptation (LoRA) with federated learning (FL) has received widespread attention recently, aiming to adapt pretrained foundation models (FMs) to downstream medical tasks via privacy-preserving decentralized training. However, owing to the direct combination of LoRA and FL, current methods generally suffer from two problems, i.e., aggregation deviation and a differential privacy (DP) noise amplification effect. To address these problems, we propose a novel privacy-preserving federated finetuning framework called Deviation Eliminating and Noise Regulating (DEeR). Specifically, we first theoretically prove that the necessary condition to eliminate aggregation deviation is guaranteeing the equivalence between LoRA parameters of clients. Based on this theoretical insight, a deviation eliminator is designed that uses an alternating minimization algorithm to iteratively optimize the zero-initialized and non-zero-initialized parameter matrices of LoRA, ensuring that the aggregation deviation is always zero during training. Furthermore, we also conduct an in-depth analysis of the noise amplification effect and find that this problem is mainly caused by the "linear relationship" between DP noise and LoRA parameters. To suppress the noise amplification effect, we propose a noise regulator that exploits two regulator factors to decouple the relationship between DP and LoRA, thereby achieving robust privacy protection and excellent finetuning performance. Additionally, we perform comprehensive ablation experiments to verify the effectiveness of the deviation eliminator and noise regulator. DEeR shows better performance on public medical datasets in comparison with state-of-the-art approaches. The code is available at this https URL.
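
The aggregation deviation named in this abstract can be illustrated numerically. What follows is a minimal sketch and not the authors' code: it assumes the common setup in which each client i holds LoRA factors B_i and A_i and the server averages the two factors separately, while the ideal global update would be the average of the products B_i A_i. The helper names (matmul, mean_matrices, deviation) and the toy values are hypothetical.

```python
# Toy illustration of aggregation deviation in federated LoRA.
# Assumption: the server averages B_i and A_i separately (FedAvg-style),
# whereas the ideal update is the average of the products B_i @ A_i.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def mean_matrices(mats):
    """Element-wise mean of equally shaped matrices."""
    n = len(mats)
    return [[sum(vals) / n for vals in zip(*rows)] for rows in zip(*mats)]

def deviation(Bs, As):
    """Max absolute gap between mean(B_i) @ mean(A_i) and mean(B_i @ A_i)."""
    separate = matmul(mean_matrices(Bs), mean_matrices(As))
    ideal = mean_matrices([matmul(B, A) for B, A in zip(Bs, As)])
    return max(abs(s - t) for rs, rt in zip(separate, ideal) for s, t in zip(rs, rt))

# Two clients with rank-1 factors: B_i is 2x1, A_i is 1x2 (hypothetical values).
Bs = [[[1.0], [0.0]], [[0.0], [1.0]]]
As_different = [[[1.0, 0.0]], [[0.0, 1.0]]]
As_shared = [[[1.0, 2.0]], [[1.0, 2.0]]]

gap_different = deviation(Bs, As_different)  # factors differ across clients
gap_shared = deviation(Bs, As_shared)        # one factor identical across clients
```

In this toy case the gap is nonzero when both factors differ across clients and vanishes when the clients share one factor, which is in the spirit of the equivalence condition the abstract mentions; the paper's actual deviation eliminator is not reproduced here.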

[CV-110] Answering Questions in Stages: Prompt Chaining for Contract QA

[CV-111] EditRoom: LLM-parameterized Graph Diffusion for Composable 3D Room Layout Editing

[CV-112] GCM-Net: Graph-enhanced Cross-Modal Infusion with a Metaheuristic-Driven Network for Video Sentiment and Emotion Analysis

Link: https://arxiv.org/abs/2410.12828
Authors: Prasad Chaudhari,Aman Kumar,Chandravardhan Singh Raghaw,Mohammad Zia Ur Rehman,Nagendra Kumar
Keywords: challenging tasks, diversity and complexity, emotion recognition, emotion, Sentiment
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Sentiment analysis and emotion recognition in videos are challenging tasks, given the diversity and complexity of the information conveyed in different modalities. Developing a highly competent framework that effectively addresses the distinct characteristics across various modalities is a primary concern in this domain. Previous studies on combined multimodal sentiment and emotion analysis often overlooked effective fusion for modality integration, intermodal contextual congruity, and the optimization of concatenated feature spaces, leading to suboptimal architectures. This paper presents a novel framework that leverages the multi-modal contextual information from utterances and applies metaheuristic algorithms to learn the contributing features for utterance-level sentiment and emotion prediction. Our Graph-enhanced Cross-Modal Infusion with a Metaheuristic-Driven Network (GCM-Net) integrates graph sampling and aggregation to recalibrate the modality features for video sentiment and emotion prediction. GCM-Net includes a cross-modal attention module determining intermodal interactions and utterance relevance. A harmonic optimization module employing a metaheuristic algorithm combines attended features, allowing for handling both single and multi-utterance inputs. To show the effectiveness of our approach, we have conducted extensive evaluations on three prominent multi-modal benchmark datasets, CMU MOSI, CMU MOSEI, and IEMOCAP. The experimental results demonstrate the efficacy of our proposed approach, showcasing accuracies of 91.56% and 86.95% for sentiment analysis on the MOSI and MOSEI datasets. We have performed emotion analysis on the IEMOCAP dataset, obtaining an accuracy of 85.66%, which signifies substantial performance enhancements over existing methods.

[CV-113] AVID: Adapting Video Diffusion Models to World Models

Link: https://arxiv.org/abs/2410.12822
Authors: Marc Rigter,Tarun Gupta,Agrin Hilmkil,Chao Ma
Keywords: Large-scale generative models, achieved remarkable success, Large-scale generative, models, number of domains
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Large-scale generative models have achieved remarkable success in a number of domains. However, for sequential decision-making problems, such as robotics, action-labelled data is often scarce and therefore scaling-up foundation models for decision-making remains a challenge. A potential solution lies in leveraging widely-available unlabelled videos to train world models that simulate the consequences of actions. If the world model is accurate, it can be used to optimize decision-making in downstream tasks. Image-to-video diffusion models are already capable of generating highly realistic synthetic videos. However, these models are not action-conditioned, and the most powerful models are closed-source which means they cannot be finetuned. In this work, we propose to adapt pretrained video diffusion models to action-conditioned world models, without access to the parameters of the pretrained model. Our approach, AVID, trains an adapter on a small domain-specific dataset of action-labelled videos. AVID uses a learned mask to modify the intermediate outputs of the pretrained model and generate accurate action-conditioned videos. We evaluate AVID on video game and real-world robotics data, and show that it outperforms existing baselines for diffusion model adaptation. Our results demonstrate that if utilized correctly, pretrained video models have the potential to be powerful tools for embodied AI.

[CV-114] Interactive Explainable Anomaly Detection for Industrial Settings

[CV-115] Rethinking Misalignment in Vision-Language Model Adaptation from a Causal Perspective NEURIPS2024

[CV-116] Leveraging generative models to characterize the failure conditions of image classifiers

Link: https://arxiv.org/abs/2410.12814
Authors: Adrien Le Coz,Stéphane Herbin,Faouzi Adjed
Keywords: Generative Adversarial Networks, failure conditions, recent Generative Adversarial, work the question, question of identifying
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We address in this work the question of identifying the failure conditions of a given image classifier. To do so, we exploit the capacity of producing controllable distributions of high quality image data made available by recent Generative Adversarial Networks (StyleGAN2): the failure conditions are expressed as directions of strong performance degradation in the generative model latent space. This strategy of analysis is used to discover corner cases that combine multiple sources of corruption, and to compare in more detail the behavior of different classifiers. The directions of degradation can also be rendered visually by generating data for better interpretability. Some degradations such as image quality can affect all classes, whereas others such as shape are more class-specific. The approach is demonstrated on the MNIST dataset, augmented with two sources of corruption (noise and blur), and shows a promising way to better understand and control the risks of exploiting Artificial Intelligence components for safety-critical applications.

[CV-117] ChatVTG: Video Temporal Grounding via Chat with Video Dialogue Large Language Models

Link: https://arxiv.org/abs/2410.12813
Authors: Mengxue Qu,Xiaodong Chen,Wu Liu,Alicia Li,Yao Zhao
Keywords: Video Temporal Grounding, Large Language Models, ground specific segments, Dialogue Large Language, Temporal Grounding
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures

Abstract:Video Temporal Grounding (VTG) aims to ground specific segments within an untrimmed video corresponding to the given natural language query. Existing VTG methods largely depend on supervised learning and extensive annotated data, which is labor-intensive and prone to human biases. To address these challenges, we present ChatVTG, a novel approach that utilizes Video Dialogue Large Language Models (LLMs) for zero-shot video temporal grounding. Our ChatVTG leverages Video Dialogue LLMs to generate multi-granularity segment captions and matches these captions with the given query for coarse temporal grounding, circumventing the need for paired annotation data. Furthermore, to obtain more precise temporal grounding results, we employ moment refinement for fine-grained caption proposals. Extensive experiments on three mainstream VTG datasets, including Charades-STA, ActivityNet-Captions, and TACoS, demonstrate the effectiveness of ChatVTG. Our ChatVTG surpasses the performance of current zero-shot methods.

[CV-118] Decoding Emotions: Unveiling Facial Expressions through Acoustic Sensing with Contrastive Attention

Link: https://arxiv.org/abs/2410.12811
Authors: Guangjing Wang,Juexing Wang,Ce Zhou,Weikang Ding,Huacheng Zeng,Tianxing Li,Qiben Yan
Keywords: users’ emotional states, holds great promise, detecting users’ emotional, recognition holds great, accurately detecting users’
Categories: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: The extended version of the 2023 IEEE INFOCOM conference paper

Abstract:Expression recognition holds great promise for applications such as content recommendation and mental healthcare by accurately detecting users’ emotional states. Traditional methods often rely on cameras or wearable sensors, which raise privacy concerns and add extra device burdens. In addition, existing acoustic-based methods struggle to maintain satisfactory performance when there is a distribution shift between the training dataset and the inference dataset. In this paper, we introduce FacER+, an active acoustic facial expression recognition system, which eliminates the requirement for external microphone arrays. FacER+ extracts facial expression features by analyzing the echoes of near-ultrasound signals emitted between the 3D facial contour and the earpiece speaker on a smartphone. This approach not only reduces background noise but also enables the identification of different expressions from various users with minimal training data. We develop a contrastive external attention-based model to consistently learn expression features across different users, reducing the distribution differences. Extensive experiments involving 20 volunteers, both with and without masks, demonstrate that FacER+ can accurately recognize six common facial expressions with over 90% accuracy in diverse, user-independent real-life scenarios, surpassing the performance of the leading acoustic sensing methods by 10%. FacER+ offers a robust and practical solution for facial expression recognition.

[CV-119] Deep-learning recognition and tracking of individual nanotubes in low-contrast microscopy videos

Link: https://arxiv.org/abs/2410.13594
Authors: Vladimir Pimonov,Said Tahir,Vincent Jourdain
Keywords: automated deep learning, homodyne polarization microscopy, in-situ homodyne polarization, deep learning, study addresses
Categories: Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 13 pages, 5 Figures, No supporting information included

Abstract:This study addresses the challenge of analyzing the growth kinetics of carbon nanotubes using in-situ homodyne polarization microscopy (HPM) by developing an automated deep learning (DL) approach. A Mask-RCNN architecture, enhanced with a ResNet-50 backbone, was employed to recognize and track individual nanotubes in microscopy videos, significantly improving the efficiency and reproducibility of kinetic data extraction. The method involves a series of video processing steps to enhance contrast and uses differential treatment techniques to manage low signal and fast kinetics. The DL model demonstrates consistency with manual measurements and increased throughput, laying the foundation for statistical studies of nanotube growth. The approach can be adapted for other types of in-situ microscopy studies, emphasizing the importance of automation in high-throughput data acquisition for research on individual nano-objects.

[CV-120] RGB to Hyperspectral: Spectral Reconstruction for Enhanced Surgical Imaging

[CV-121] Unsupervised Skull Segmentation via Contrastive MR-to-CT Modality Translation ACCV2024

Link: https://arxiv.org/abs/2410.13427
Authors: Kamil Kwarciak,Mateusz Daniol,Daria Hemmerling,Marek Wodzinski
Keywords: solved problem, skull segmentation, segmentation, skull, unsupervised skull segmentation
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 5 figures, ACCV 2024 - GAISynMeD Workshop

Abstract:The skull segmentation from CT scans can be seen as an already solved problem. However, in MR this task has a significantly greater complexity due to the presence of soft tissues rather than bones. Capturing the bone structures from MR images of the head, where the main visualization objective is the brain, is very demanding. The attempts that make use of skull stripping seem to not be well suited for this task and fail to work in many cases. On the other hand, supervised approaches require costly and time-consuming skull annotations. To overcome the difficulties we propose a fully unsupervised approach, where we do not perform the segmentation directly on MR images, but we rather perform a synthetic CT data generation via MR-to-CT translation and perform the segmentation there. We address many issues associated with unsupervised skull segmentation including the unpaired nature of MR and CT datasets (contrastive learning), low resolution and poor quality (super-resolution), and generalization capabilities. The research has a significant value for downstream tasks requiring skull segmentation from MR volumes such as craniectomy or surgery planning and can be seen as an important step towards the utilization of synthetic data in medical imaging.

[CV-122] Scalable Drift Monitoring in Medical Imaging AI

Link: https://arxiv.org/abs/2410.13174
Authors: Jameson Merkow,Felix J. Dorfner,Xiyu Yang,Alexander Ersoy,Giridhar Dasegowda,Mannudeep Kalra,Matthew P. Lungren,Christopher P. Bridge,Ivan Tarapov
Keywords: ensuring long-term reliability, advanced clinical diagnostics, artificial intelligence, long-term reliability, medical imaging
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:The integration of artificial intelligence (AI) into medical imaging has advanced clinical diagnostics but poses challenges in managing model drift and ensuring long-term reliability. To address these challenges, we develop MMC+, an enhanced framework for scalable drift monitoring, building upon the CheXstray framework that introduced real-time drift detection for medical imaging AI models using multi-modal data concordance. This work extends the original framework’s methodologies, providing a more scalable and adaptable solution for real-world healthcare settings and offers a reliable and cost-effective alternative to continuous performance monitoring addressing limitations of both continuous and periodic monitoring methods. MMC+ introduces critical improvements to the original framework, including more robust handling of diverse data streams, improved scalability with the integration of foundation models like MedImageInsight for high-dimensional image embeddings without site-specific training, and the introduction of uncertainty bounds to better capture drift in dynamic clinical environments. Validated with real-world data from Massachusetts General Hospital during the COVID-19 pandemic, MMC+ effectively detects significant data shifts and correlates them with model performance changes. While not directly predicting performance degradation, MMC+ serves as an early warning system, indicating when AI systems may deviate from acceptable performance bounds and enabling timely interventions. By emphasizing the importance of monitoring diverse data streams and evaluating data shifts alongside model performance, this work contributes to the broader adoption and integration of AI solutions in clinical settings.

[CV-123] Adversarial Neural Networks in Medical Imaging Advancements and Challenges in Semantic Segmentation

Link: https://arxiv.org/abs/2410.13099
Authors: Houze Liu,Bo Zhang,Yanlin Xiang,Yuxiang Hu,Aoran Shen,Yang Lin
Keywords: Recent advancements, artificial intelligence, advancements in artificial, precipitated a paradigm, paradigm shift
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advancements in artificial intelligence (AI) have precipitated a paradigm shift in medical imaging, particularly revolutionizing the domain of brain imaging. This paper systematically investigates the integration of deep learning – a principal branch of AI – into the semantic segmentation of brain images. Semantic segmentation serves as an indispensable technique for the delineation of discrete anatomical structures and the identification of pathological markers, essential for the diagnosis of complex neurological disorders. Historically, the reliance on manual interpretation by radiologists, while noteworthy for its accuracy, is plagued by inherent subjectivity and inter-observer variability. This limitation becomes more pronounced with the exponential increase in imaging data, which traditional methods struggle to process efficiently and effectively. In response to these challenges, this study introduces the application of adversarial neural networks, a novel AI approach that not only automates but also refines the semantic segmentation process. By leveraging these advanced neural networks, our approach enhances the precision of diagnostic outputs, reducing human error and increasing the throughput of imaging data analysis. The paper provides a detailed discussion on how adversarial neural networks facilitate a more robust, objective, and scalable solution, thereby significantly improving diagnostic accuracies in neurological evaluations. This exploration highlights the transformative impact of AI on medical imaging, setting a new benchmark for future research and clinical practice in neurology.

[CV-124] UniCoN: Universal Conditional Networks for Multi-Age Embryonic Cartilage Segmentation with Sparsely Annotated Data

Link: https://arxiv.org/abs/2410.13043
Authors: Nishchal Sapkota,Yejia Zhang,Zihao Zhao,Maria Gomez,Yuhan Hsi,Jordan A. Wilson,Kazuhiko Kawasaki,Greg Holmes,Meng Wu,Ethylin Wang Jabs,Joan T. Richtsmeier,Susan M. Motch Perrine,Danny Z. Chen
Keywords: newborns globally, head malformations, contributing to childhood, quality of life, childhood morbidity
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Osteochondrodysplasia, affecting 2-3% of newborns globally, is a group of bone and cartilage disorders that often result in head malformations, contributing to childhood morbidity and reduced quality of life. Current research on this disease using mouse models faces challenges since it involves accurately segmenting the developing cartilage in 3D micro-CT images of embryonic mice. Tackling this segmentation task with deep learning (DL) methods is laborious due to the big burden of manual image annotation, expensive due to the high acquisition costs of 3D micro-CT images, and difficult due to embryonic cartilage’s complex and rapidly changing shapes. While DL approaches have been proposed to automate cartilage segmentation, most such models have limited accuracy and generalizability, especially across data from different embryonic age groups. To address these limitations, we propose novel DL methods that can be adopted by any DL architectures – including CNNs, Transformers, or hybrid models – which effectively leverage age and spatial information to enhance model performance. Specifically, we propose two new mechanisms, one conditioned on discrete age categories and the other on continuous image crop locations, to enable an accurate representation of cartilage shape changes across ages and local shape details throughout the cranial region. Extensive experiments on multi-age cartilage segmentation datasets show significant and consistent performance improvements when integrating our conditional modules into popular DL segmentation architectures. On average, we achieve a 1.7% Dice score increase with minimal computational overhead and a 7.5% improvement on unseen data. These results highlight the potential of our approach for developing robust, universal models capable of handling diverse datasets with limited annotated data, a key challenge in DL-based medical image analysis.
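
For readers unfamiliar with the metric quoted in this abstract, the Dice score compares a predicted mask with a ground-truth mask. The snippet below is a minimal sketch on flat binary masks; the function name, the toy masks, and the convention for two empty masks are assumptions for illustration and are not taken from the paper.

```python
def dice_score(pred, target):
    """Dice = 2|P ∩ T| / (|P| + |T|) for flat binary masks (sequences of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention (an assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

# Hypothetical 5-voxel masks: 2 voxels overlap, each mask has 3 foreground voxels.
pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
score = dice_score(pred, target)  # 2 * 2 / (3 + 3)
```

A reported gain such as "1.7% Dice" would then correspond to a change in this ratio, averaged over the evaluation set.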

[CV-125] Synthesis and Perceptual Scaling of High Resolution Natural Images Using Stable Diffusion

Link: https://arxiv.org/abs/2410.13034
Authors: Leonardo Pettini,Carsten Bogler,Christian Doeller,John-Dylan Haynes
Keywords: Natural scenes, Natural, natural scene stimulus, developed natural scene, images
Categories: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)
Comments: 29 pages, 7 Figures, 5 tables

Abstract:Natural scenes are of key interest for visual perception. Previous work on natural scenes has frequently focused on collections of discrete images with considerable physical differences from stimulus to stimulus. For many purposes it would, however, be desirable to have sets of natural images that vary smoothly along a continuum (for example in order to measure quantitative properties such as thresholds or precisions). This problem has typically been addressed by morphing a source into a target image. However, this approach yields transitions between images that primarily follow their low-level physical features and that can be semantically unclear or ambiguous. Here, in contrast, we used a different approach (Stable Diffusion XL) to synthesise a custom stimulus set of photorealistic images that are characterized by gradual transitions where each image is a clearly interpretable but unique exemplar from the same category. We developed natural scene stimulus sets from six categories with 18 objects each. For each object we generated 10 graded variants that are ordered along a perceptual continuum. We validated the image set psychophysically in a large sample of participants, ensuring that stimuli for each exemplar have varying levels of perceptual confusability. This image set is of interest for studies on visual perception, attention and short- and long-term memory.

[CV-126] MyData: A Comprehensive Database of Mycetoma Tissue Microscopic Images for Histopathological Analysis

Link: https://arxiv.org/abs/2410.12833
Authors: Hyam Omar Ali,Romain Abraham,Guillaume Desoubeaux,Ahmed Fahal,Clovis Tauber
Keywords: neglected inflammatory disease, inflammatory disease prevalent, subtropical regions, chronic and neglected, neglected inflammatory
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Comments:

Abstract:Mycetoma is a chronic and neglected inflammatory disease prevalent in tropical and subtropical regions. It can lead to severe disability and social stigma. The disease is classified into two types based on the causative microorganisms: eumycetoma (fungal) and actinomycetoma (bacterial). Effective treatment strategies depend on accurately identifying the causative agents. Current identification methods include molecular, cytological, and histopathological techniques, as well as grain culturing. Among these, histopathological techniques are considered optimal for use in endemic areas, but they require expert pathologists for accurate identification, which can be challenging in rural areas lacking such expertise. The advent of digital pathology and automated image analysis algorithms offers a potential solution. This report introduces a novel dataset designed for the automated detection and classification of mycetoma using histopathological images. It includes the first database of microscopic images of mycetoma tissue, detailing the entire pipeline from species distribution and patient sampling to acquisition protocols through histological procedures. The dataset consists of images from 142 patients, totalling 864 images, each annotated with binary masks indicating the presence of grains, facilitating both detection and segmentation tasks.

[CV-127] Segment as You Wish – Free-Form Language-Based Segmentation for Medical Images

[CV-128] DyMix: Dynamic Frequency Mixup Scheduler based Unsupervised Domain Adaptation for Enhancing Alzheimer's Disease Identification

Link: https://arxiv.org/abs/2410.12827
Authors: Yooseung Shin,Kwanseok Oh,Heung-Il Suk
Keywords: brain image analysis, Alzheimer disease, Advances in deep, accuracy of Alzheimer, timely interventions
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures, 3 tables

Abstract:Advances in deep learning (DL)-based models for brain image analysis have significantly enhanced the accuracy of Alzheimer’s disease (AD) diagnosis, allowing for more timely interventions. Despite these advancements, most current DL models suffer from performance degradation when inferring on unseen domain data owing to the variations in data distributions, a phenomenon known as domain shift. To address this challenge, we propose a novel approach called the dynamic frequency mixup scheduler (DyMix) for unsupervised domain adaptation. Contrary to the conventional mixup technique, which involves simple linear interpolations between predefined data points from the frequency space, our proposed DyMix dynamically adjusts the magnitude of the frequency regions being mixed from the source and target domains. Such an adaptive strategy optimizes the model’s capacity to deal with domain variability, thereby enhancing its generalizability across the target domain. In addition, we incorporate additional strategies to further enforce the model’s robustness against domain shifts, including leveraging amplitude-phase recombination to ensure resilience to intensity variations and applying self-adversarial learning to derive domain-invariant feature representations. Experimental results on two benchmark datasets quantitatively and qualitatively validated the effectiveness of our DyMix in that we demonstrated its outstanding performance in AD diagnosis compared to state-of-the-art methods.

[CV-129] Deep Adversarial Learning with Activity-Based User Discrimination Task for Human Activity Recognition

Link: https://arxiv.org/abs/2410.12819
Authors: Francisco M. Calatrava-Nicolás,Oscar Martinez Mozos
Keywords: inertial sensors worn, human activity recognition, adversarial deep learning, deep learning framework, deep learning
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:We present a new adversarial deep learning framework for the problem of human activity recognition (HAR) using inertial sensors worn by people. Our framework incorporates a novel adversarial activity-based discrimination task that addresses inter-person variability, i.e., the fact that different people perform the same activity in different ways. Overall, our proposed framework outperforms previous approaches on three HAR datasets using a leave-one-(person)-out cross-validation (LOOCV) benchmark. Additional results demonstrate that our discrimination task yields better classification results compared to previous tasks within the same adversarial framework.
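
The leave-one-(person)-out protocol mentioned in this abstract can be sketched as follows. This is only an illustration of the splitting scheme under simple assumptions: each sample carries a subject identifier, and the function name and toy identifiers are hypothetical rather than taken from the paper.

```python
def leave_one_person_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each distinct subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test

# Hypothetical recording: six sensor windows from three people.
subject_ids = ["p1", "p1", "p2", "p3", "p3", "p3"]
folds = list(leave_one_person_out(subject_ids))
```

Each fold tests on one person's data only, so the score reflects how well a model transfers to someone it has never seen, which is exactly where inter-person variability hurts.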

Machine Learning

[LG-0] Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens

[LG-1] How Numerical Precision Affects Mathematical Reasoning Capabilities of LLMs

[LG-2] Diffusing States and Matching Scores: A New Framework for Imitation Learning

Link: https://arxiv.org/abs/2410.13855
Authors: Runzhe Wu,Yiding Chen,Gokul Swamy,Kianté Brantley,Wen Sun
Keywords: Generative Adversarial Network, Generative Adversarial Imitation, two-player zero-sum game, adversarially chosen cost, Adversarial Imitation Learning
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Adversarial Imitation Learning is traditionally framed as a two-player zero-sum game between a learner and an adversarially chosen cost function, and can therefore be thought of as the sequential generalization of a Generative Adversarial Network (GAN). A prominent example of this framework is Generative Adversarial Imitation Learning (GAIL). However, in recent years, diffusion models have emerged as a non-adversarial alternative to GANs that merely require training a score function via regression, yet produce generations of a higher quality. In response, we investigate how to lift insights from diffusion modeling to the sequential setting. We propose diffusing states and performing score-matching along diffused states to measure the discrepancy between the expert’s and learner’s states. Thus, our approach only requires training score functions to predict noises via standard regression, making it significantly easier and more stable to train than adversarial methods. Theoretically, we prove first- and second-order instance-dependent bounds with linear scaling in the horizon, proving that our approach avoids the compounding errors that stymie offline approaches to imitation learning. Empirically, we show our approach outperforms GAN-style imitation learning baselines across various continuous control problems, including complex tasks like controlling humanoids to walk, sit, and crawl.

[LG-3] AutoAL: Automated Active Learning with Differentiable Query Strategy Search

Link: https://arxiv.org/abs/2410.13853
Authors: Yifeng Wang,Xueying Zhan,Siyu Huang
Keywords: deep learning continues, continues to evolve, increasingly important, efficiency becomes increasingly, learning continues
Categories: Machine Learning (cs.LG)
Comments:

Abstract:As deep learning continues to evolve, the need for data efficiency becomes increasingly important. Considering labeling large datasets is both time-consuming and expensive, active learning (AL) provides a promising solution to this challenge by iteratively selecting the most informative subsets of examples to train deep neural networks, thereby reducing the labeling cost. However, the effectiveness of different AL algorithms can vary significantly across data scenarios, and determining which AL algorithm best fits a given task remains a challenging problem. This work presents the first differentiable AL strategy search method, named AutoAL, which is designed on top of existing AL sampling strategies. AutoAL consists of two neural nets, named SearchNet and FitNet, which are optimized concurrently under a differentiable bi-level optimization framework. For any given task, SearchNet and FitNet are iteratively co-optimized using the labeled data, learning how well a set of candidate AL algorithms perform on that task. With the optimal AL strategies identified, SearchNet selects a small subset from the unlabeled pool for querying their annotations, enabling efficient training of the task model. Experimental results demonstrate that AutoAL consistently achieves superior accuracy compared to all candidate AL algorithms and other selective AL approaches, showcasing its potential for adapting and integrating multiple existing AL methods across diverse tasks and domains. Code will be available at: this https URL.
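
To make the idea of "selecting the most informative examples" concrete, here is a minimal sketch of entropy-based uncertainty sampling, one classic active learning query strategy of the kind such a search could choose among. It is not AutoAL itself; the function names and the toy probabilities are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(pool_probs, budget):
    """Return indices of the `budget` unlabeled samples with the highest entropy."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:budget]

# Hypothetical model predictions on four unlabeled samples (binary task).
pool_probs = [[0.98, 0.02], [0.55, 0.45], [0.70, 0.30], [0.50, 0.50]]
queried = select_most_uncertain(pool_probs, budget=2)
```

The samples whose predictions are closest to uniform are sent for labeling first; other strategies rank the pool by different criteria, which is why the best choice can vary across tasks.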

[LG-4] Retrospective Learning from Interactions


[LG-5] Influence Functions for Scalable Data Attribution in Diffusion Models


[LG-6] SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction


[LG-7] A Unified View of Delta Parameter Editing in Post-Trained Large-Scale Models


[LG-8] ORSO: Accelerating Reward Design via Online Reward Selection and Policy Optimization


[LG-9] Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs

Link: https://arxiv.org/abs/2410.13835
Authors: Tianyu Guo,Druv Pai,Yu Bai,Jiantao Jiao,Michael I. Jordan,Song Mei
Keywords-EN: transformer-based large language, Practitioners have consistently, large language models, value-state drains, collectively referred
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Practitioners have consistently observed three puzzling phenomena in transformer-based large language models (LLMs): attention sinks, value-state drains, and residual-state peaks, collectively referred to as extreme-token phenomena. These phenomena are characterized by certain so-called “sink tokens” receiving disproportionately high attention weights, exhibiting significantly smaller value states, and having much larger residual-state norms than those of other tokens. These extreme tokens give rise to various challenges in LLM inference, quantization, and interpretability. We elucidate the mechanisms behind extreme-token phenomena. First, we show that these phenomena arise in very simple architectures – transformers with one to three layers – trained on a toy model, the Bigram-Backcopy (BB) task. In this setting, we identify an active-dormant mechanism, where attention heads become sinks for specific input domains while remaining non-sinks for others. Our theoretical analysis of the training dynamics reveals that these phenomena are driven by a mutual reinforcement mechanism. Building on these insights, we propose strategies to mitigate extreme-token phenomena during pretraining, including replacing softmax with ReLU and Adam with SGD. Next, we extend our analysis to pretrained LLMs, including Llama and OLMo, showing that many attention heads exhibit a similar active-dormant mechanism as in the BB task, and that the mutual reinforcement mechanism also governs the emergence of extreme-token phenomena during LLM pretraining. Our results reveal that many of the static and dynamic properties of extreme-token phenomena predicted by the BB task align with observations in pretrained LLMs.

[LG-10] The Disparate Benefits of Deep Ensembles


[LG-11] A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement


[LG-12] Unearthing Skill-Level Insights for Understanding Trade-Offs of Foundation Models MICRO


[LG-13] Artificial Kuramoto Oscillatory Neurons


[LG-14] Steering Your Generalists: Improving Robotic Foundation Models via Value Guidance

Link: https://arxiv.org/abs/2410.13816
Authors: Mitsuhiko Nakamoto,Oier Mees,Aviral Kumar,Sergey Levine
Keywords-EN: acquiring broad repertoires, diverse demonstration datasets, manipulation skills, remarkably effective, controlling a variety
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments: Conference on Robot Learning (CoRL) 2024. Project Page: this https URL

Abstract:Large, general-purpose robotic policies trained on diverse demonstration datasets have been shown to be remarkably effective both for controlling a variety of robots in a range of different scenes, and for acquiring broad repertoires of manipulation skills. However, the data that such policies are trained on is generally of mixed quality – not only are human-collected demonstrations unlikely to perform the task perfectly, but the larger the dataset is, the harder it is to curate only the highest quality examples. It also remains unclear how optimal data from one embodiment is for training on another embodiment. In this paper, we present a general and broadly applicable approach that enhances the performance of such generalist robot policies at deployment time by re-ranking their actions according to a value function learned via offline RL. This approach, which we call Value-Guided Policy Steering (V-GPS), is compatible with a wide range of different generalist policies, without needing to fine-tune or even access the weights of the policy. We show that the same value function can improve the performance of five different state-of-the-art policies with different architectures, even though they were trained on distinct datasets, attaining consistent performance improvement on multiple robotic platforms across a total of 12 tasks. Code and videos can be found at: this https URL
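
The deployment-time re-ranking described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: `sample_action` and `q_value` are hypothetical stand-ins for a black-box generalist policy and a value function learned via offline RL, and the greedy argmax selection rule is an assumption.

```python
from typing import Callable, List, TypeVar

State = TypeVar("State")
Action = TypeVar("Action")


def steer_action(
    state: State,
    sample_action: Callable[[State], Action],   # hypothetical: black-box policy sampler
    q_value: Callable[[State, Action], float],  # hypothetical: learned value function
    num_samples: int = 8,
) -> Action:
    """Sample several candidate actions and return the one the value function ranks highest."""
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    # The policy is only queried for samples; its weights are never touched.
    candidates: List[Action] = [sample_action(state) for _ in range(num_samples)]
    # Re-rank the candidates by estimated value and keep the best one (greedy choice).
    return max(candidates, key=lambda action: q_value(state, action))
```

Because the policy is used only through `sample_action`, the same value function could in principle be paired with different policies, which is the compatibility property the abstract emphasizes.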

[LG-15] Private Counterfactual Retrieval

Link: https://arxiv.org/abs/2410.13812
Authors: Mohamed Nomeir,Pasan Dissanayake,Shreya Meel,Sanghamitra Dutta,Sennur Ulukus
Keywords-EN: extremely important aspects, employing black-box machine, black-box machine learning, machine learning models, Transparency and explainability
Categories: Information Theory (cs.IT); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Signal Processing (eess.SP)
*Comments:

Abstract:Transparency and explainability are two extremely important aspects to be considered when employing black-box machine learning models in high-stakes applications. Providing counterfactual explanations is one way of catering to this requirement. However, this also poses a threat to the privacy of both the institution that is providing the explanation and the user who is requesting it. In this work, we propose multiple schemes inspired by private information retrieval (PIR) techniques which ensure the user’s privacy when retrieving counterfactual explanations. We present a scheme which retrieves the exact nearest neighbor counterfactual explanation from a database of accepted points while achieving perfect (information-theoretic) privacy for the user. While the scheme achieves perfect privacy for the user, some leakage on the database is inevitable, which we quantify using a mutual information based metric. Furthermore, we propose strategies to reduce this leakage to achieve an advanced degree of database privacy. We extend these schemes to incorporate the user’s preference on transforming their attributes, so that a more actionable explanation can be received. Since our schemes rely on finite field arithmetic, we empirically validate our schemes on real datasets to understand the trade-off between the accuracy and the finite field sizes.

[LG-16] Adversarial Testing as a Tool for Interpretability: Length-based Overfitting of Elementary Functions in Transformers

Link: https://arxiv.org/abs/2410.13802
Authors: Patrik Zavoral,Dušan Variš,Ondřej Bojar
Keywords-EN: training data, tendency to overfit, Transformer, Transformer model, aspects
Categories: Machine Learning (cs.LG)
*Comments: 9 pages, 8 figures, 2 tables; to be published

Abstract:The Transformer model has a tendency to overfit various aspects of the training data, such as the overall sequence length. We study elementary string edit functions using a defined set of error indicators to interpret the behaviour of the sequence-to-sequence Transformer. We show that generalization to shorter sequences is often possible, but confirm that longer sequences are highly problematic, although partially correct answers are often obtained. Additionally, we find that other structural characteristics of the sequences, such as subsegment length, may be equally important. We hypothesize that the models learn algorithmic aspects of the tasks simultaneously with structural aspects but adhering to the structural aspects is unfortunately often preferred by Transformer when they come into conflict.

[LG-17] Learning Graph Quantized Tokenizers for Transformers


[LG-18] Arbitrarily-Conditioned Multi-Functional Diffusion for Multi-Physics Emulation

Link: https://arxiv.org/abs/2410.13794
Authors: Da Long,Zhitong Xu,Guang Yang,Akil Narayan,Shandian Zhe
Keywords-EN: Modern physics simulation, traditional numerical approaches, Modern physics, computationally costly, physics simulation
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Modern physics simulation often involves multiple functions of interest, and traditional numerical approaches are known to be complex and computationally costly. While machine learning-based surrogate models can offer significant cost reductions, most focus on a single task, such as forward prediction, and typically lack uncertainty quantification – an essential component in many applications. To overcome these limitations, we propose Arbitrarily-Conditioned Multi-Functional Diffusion (ACMFD), a versatile probabilistic surrogate model for multi-physics emulation. ACMFD can perform a wide range of tasks within a single framework, including forward prediction, various inverse problems, and simulating data for entire systems or subsets of quantities conditioned on others. Specifically, we extend the standard Denoising Diffusion Probabilistic Model (DDPM) for multi-functional generation by modeling noise as Gaussian processes (GP). We then introduce an innovative denoising loss. The training involves randomly sampling the conditioned part and fitting the corresponding predicted noise to zero, enabling ACMFD to flexibly generate function values conditioned on any other functions or quantities. To enable efficient training and sampling, and to flexibly handle irregularly sampled data, we use GPs to interpolate function samples onto a grid, inducing a Kronecker product structure for efficient computation. We demonstrate the advantages of ACMFD across several fundamental multi-physics systems.

[LG-19] Analyzing Deep Transformer Models for Time Series Forecasting via Manifold Learning

Link: https://arxiv.org/abs/2410.13792
Authors: Ilya Kaufman,Omri Azencot
Keywords-EN: consistently achieved remarkable, achieved remarkable results, natural language processing, time series forecasting, time series
Categories: Machine Learning (cs.LG)
*Comments: Accepted to TMLR 2024

Abstract:Transformer models have consistently achieved remarkable results in various domains such as natural language processing and computer vision. However, despite ongoing research efforts to better understand these models, the field still lacks a comprehensive understanding. This is particularly true for deep time series forecasting methods, where analysis and understanding work is relatively limited. Time series data, unlike image and text information, can be more challenging to interpret and analyze. To address this, we approach the problem from a manifold learning perspective, assuming that the latent representations of time series forecasting models lie next to a low-dimensional manifold. In our study, we focus on analyzing the geometric features of these latent data manifolds, including intrinsic dimension and principal curvatures. Our findings reveal that deep transformer models exhibit similar geometric behavior across layers, and these geometric features are correlated with model performance. Additionally, we observe that untrained models initially have different structures, but they rapidly converge during training. By leveraging our geometric analysis and differentiable tools, we can potentially design new and improved deep forecasting neural networks. This approach complements existing analysis studies and contributes to a better understanding of transformer models in the context of time series forecasting. Code is released at this https URL.

[LG-20] DPLM-2: A Multimodal Diffusion Protein Language Model

Link: https://arxiv.org/abs/2410.13782
Authors: Xinyou Wang,Zaixiang Zheng,Fei Ye,Dongyu Xue,Shujian Huang,Quanquan Gu
Keywords-EN: essential macromolecules defined, living organisms, essential macromolecules, macromolecules defined, determine their three-dimensional
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*Comments:

Abstract:Proteins are essential macromolecules defined by their amino acid sequences, which determine their three-dimensional structures and, consequently, their functions in all living organisms. Therefore, generative protein modeling necessitates a multimodal approach to simultaneously model, understand, and generate both sequences and structures. However, existing methods typically use separate models for each modality, limiting their ability to capture the intricate relationships between sequence and structure. This results in suboptimal performance in tasks that require joint understanding and generation of both modalities. In this paper, we introduce DPLM-2, a multimodal protein foundation model that extends the discrete diffusion protein language model (DPLM) to accommodate both sequences and structures. To enable structural learning with the language model, 3D coordinates are converted to discrete tokens using a lookup-free quantization-based tokenizer. By training on both experimental and high-quality synthetic structures, DPLM-2 learns the joint distribution of sequence and structure, as well as their marginals and conditionals. We also implement an efficient warm-up strategy to exploit the connection between large-scale evolutionary data and structural inductive biases from pre-trained sequence-based protein language models. Empirical evaluation shows that DPLM-2 can simultaneously generate highly compatible amino acid sequences and their corresponding 3D structures, eliminating the need for a two-stage generation approach. Moreover, DPLM-2 demonstrates competitive performance in various conditional generation tasks, including folding, inverse folding, and scaffolding with multimodal motif inputs, as well as providing structure-aware representations for predictive tasks.

[LG-21] Optimal Quantization for Matrix Multiplication


[LG-22] The Mystery of the Pathological Path-star Task for Language Models EMNLP2024


[LG-23] Change Detection in Multivariate data streams: Online Analysis with Kernel-QuantTree ECML2024 ALT ECML

Link: https://arxiv.org/abs/2410.13778
Authors: Michelangelo Olmo Nogara Notarianni,Filippo Leveni,Diego Stucchi,Luca Frittoli,Giacomo Boracchi
Keywords-EN: Exponentially Weighted Moving, Kernel-QuantTree Exponentially Weighted, Weighted Moving Average, present Kernel-QuantTree Exponentially, data streams online
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: AALTD workshop at ECML 2024 ( this https URL )

Abstract:We present Kernel-QuantTree Exponentially Weighted Moving Average (KQT-EWMA), a non-parametric change-detection algorithm that combines the Kernel-QuantTree (KQT) histogram and the EWMA statistic to monitor multivariate data streams online. The resulting monitoring scheme is very flexible, since histograms can be used to model any stationary distribution, and practical, since the distribution of test statistics does not depend on the distribution of the data stream under stationary conditions (non-parametric monitoring). KQT-EWMA enables controlling false alarms by operating at a pre-determined Average Run Length ($\mathrm{ARL}_0$), which measures the expected number of stationary samples to be monitored before triggering a false alarm. This property is in contrast with most non-parametric change-detection tests, which rarely can control the $\mathrm{ARL}_0$ a priori. Our experiments on synthetic and real-world datasets demonstrate that KQT-EWMA can control $\mathrm{ARL}_0$ while achieving detection delays comparable to or lower than state-of-the-art methods designed to work in the same conditions.
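
For readers unfamiliar with the EWMA statistic mentioned here, the following is a minimal sketch of generic EWMA monitoring on a scalar stream. It is not the KQT-EWMA algorithm: the abstract says the method combines the statistic with a Kernel-QuantTree histogram and operates at a pre-determined ARL_0, whereas the scalar input, fixed `threshold`, and `baseline` below are simplifying assumptions made purely for illustration.

```python
from typing import Iterable, Optional


def ewma_monitor(
    stream: Iterable[float],
    lam: float,          # smoothing factor in (0, 1]; assumed fixed by the user
    threshold: float,    # assumed fixed alarm threshold (not calibrated to any ARL_0)
    baseline: float = 0.0,
) -> Optional[int]:
    """Return the index of the first sample at which the EWMA exceeds the threshold."""
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must lie in (0, 1]")
    z = baseline
    for t, x in enumerate(stream):
        # Standard EWMA recursion: new value = (1 - lam) * old value + lam * observation.
        z = (1.0 - lam) * z + lam * x
        if z > threshold:
            return t  # alarm raised at time t
    return None  # no alarm over the whole stream
```

Smaller values of `lam` smooth more aggressively, which typically lowers false alarms at the cost of a longer detection delay.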

[LG-24] Enhancing Retail Sales Forecasting with Optimized Machine Learning Models ICSE

Link: https://arxiv.org/abs/2410.13773
Authors: Priyam Ganguly,Isha Mukherjee
Keywords-EN: accurately predicting future, predicting future sales, accurately predicting, strategic planning, Support Vector Regression
Categories: Machine Learning (cs.LG)
*Comments: IEEE 4th ICSES 2024

Abstract:In retail sales forecasting, accurately predicting future sales is crucial for inventory management and strategic planning. Traditional methods like linear regression (LR) often fall short due to the complexity of sales data, which includes seasonality and numerous product families. Recent advancements in machine learning (ML) provide more robust alternatives. This research draws on ML methods, particularly Random Forest (RF), Gradient Boosting (GB), Support Vector Regression (SVR), and XGBoost, to improve prediction accuracy. Despite advancements, a significant gap exists in handling complex datasets with high seasonality and multiple product families. The proposed solution involves implementing and optimizing an RF model, leveraging hyperparameter tuning through randomized search cross-validation. This approach addresses the complexities of the dataset, capturing intricate patterns that traditional methods miss. The optimized RF model achieved an R-squared value of 0.945, substantially higher than the initial RF model and traditional LR, which had an R-squared of 0.531. The model reduced the root mean squared logarithmic error (RMSLE) to 1.172, demonstrating its superior predictive capability. The optimized RF model outperformed cutting-edge models like Gradient Boosting (R-squared: 0.942), SVR (R-squared: 0.940), and XGBoost (R-squared: 0.939), with lower mean squared error (MSE) and mean absolute error (MAE). The results demonstrate that the optimized RF model excels in forecasting retail sales, handling the dataset’s complexity with higher accuracy and reliability. This research highlights the importance of advanced ML techniques in predictive analytics, offering a significant improvement over traditional methods and other contemporary models.
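
As a reference for the RMSLE metric reported in this abstract, here is a small sketch of its standard definition, the square root of the mean squared difference between log(1 + prediction) and log(1 + target). The function name and the input checks are illustrative choices; the abstract does not say how the authors computed the metric.

```python
import math
from typing import Sequence


def rmsle(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared logarithmic error for non-negative targets and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(v < 0 for v in y_true) or any(v < 0 for v in y_pred):
        raise ValueError("RMSLE is defined here only for non-negative values")
    # log1p(v) computes log(1 + v) accurately, which keeps zero sales well defined.
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))
```

Because the error is taken on a log scale, RMSLE penalizes relative rather than absolute deviations, which is one common reason it is used for sales data with widely varying magnitudes.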

[LG-25] Is Prior-Free Black-Box Non-Stationary Reinforcement Learning Feasible?

Link: https://arxiv.org/abs/2410.13772
Authors: Argyrios Gerogiannis,Yu-Han Huang,Venugopal V. Veeravalli
Keywords-EN: Non-Stationary Reinforcement Learning, Reinforcement Learning, Non-Stationary Reinforcement, study the problem, problem of Non-Stationary
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:We study the problem of Non-Stationary Reinforcement Learning (NS-RL) without prior knowledge about the system’s non-stationarity. A state-of-the-art, black-box algorithm, known as MASTER, is considered, with a focus on identifying the conditions under which it can achieve its stated goals. Specifically, we prove that MASTER’s non-stationarity detection mechanism is not triggered for practical choices of horizon, leading to performance akin to a random restarting algorithm. Moreover, we show that the regret bound for MASTER, while being order optimal, stays above the worst-case linear regret until unreasonably large values of the horizon. To validate these observations, MASTER is tested for the special case of piecewise stationary multi-armed bandits, along with methods that employ random restarting, and others that use quickest change detection to restart. A simple, order-optimal random restarting algorithm that has prior knowledge of the non-stationarity is proposed as a baseline. The behavior of the MASTER algorithm is validated in simulations, and it is shown that methods employing quickest change detection are more robust and consistently outperform MASTER and other random restarting approaches.

[LG-26] Virtual Sensing for Real-Time Degradation Monitoring of Nuclear Systems: Leveraging DeepONet for Enhanced Sensing Coverage for Digital Twin-Enabling Technology


[LG-27] GDeR: Safeguarding Efficiency Balancing and Robustness via Prototypical Graph Pruning NEURIPS2024

Link: https://arxiv.org/abs/2410.13761
Authors: Guibin Zhang,Haonan Dong,Yuchen Zhang,Zhixun Li,Dingshuo Chen,Kai Wang,Tianlong Chen,Yuxuan Liang,Dawei Cheng,Kun Wang
Keywords-EN: high-quality deep models, deep models necessitates, models necessitates vast, necessitates vast amounts, Training high-quality deep
Categories: Machine Learning (cs.LG)
*Comments: NeurIPS 2024

Abstract:Training high-quality deep models necessitates vast amounts of data, resulting in overwhelming computational and memory demands. Recently, data pruning, distillation, and coreset selection have been developed to streamline data volume by retaining, synthesizing, or selecting a small yet informative subset from the full set. Among these methods, data pruning incurs the least additional training cost and offers the most practical acceleration benefits. However, it is the most vulnerable, often suffering significant performance degradation with imbalanced or biased data schema, thus raising concerns about its accuracy and reliability in on-device deployment. Therefore, there is a looming need for a new data pruning paradigm that maintains the efficiency of previous practices while ensuring balance and robustness. Unlike the fields of computer vision and natural language processing, where mature solutions have been developed to address these issues, graph neural networks (GNNs) continue to struggle with increasingly large-scale, imbalanced, and noisy datasets, lacking a unified dataset pruning solution. To achieve this, we introduce a novel dynamic soft-pruning method, GDeR, designed to update the training “basket” during the process using trainable prototypes. GDeR first constructs a well-modeled graph embedding hypersphere and then samples representative, balanced, and unbiased subsets from this embedding space, which achieves the goal we called Graph Training Debugging. Extensive experiments on five datasets across three GNN backbones demonstrate that GDeR (I) achieves or surpasses the performance of the full dataset with 30%~50% fewer training samples, (II) attains up to a 2.81x lossless training speedup, and (III) outperforms state-of-the-art pruning methods in imbalanced training and noisy training scenarios by 0.3%~4.3% and 3.6%~7.8%, respectively.

[LG-28] CLIMB: Language-Guided Continual Learning for Task Planning with Iterative Model Building


[LG-29] MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures


[LG-30] Supervised Kernel Thinning

Link: https://arxiv.org/abs/2410.13749
Authors: Albert Gong,Kyuseong Choi,Raaz Dwivedi
Keywords-EN: Dwivedi Mackey, Monte Carlo integration, generic set, Monte Carlo, kernel thinning algorithm
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*Comments:

Abstract:The kernel thinning (KT) algorithm of Dwivedi and Mackey (2024) provides a better-than-i.i.d. compression of a generic set of points. By generating high-fidelity coresets of size significantly smaller than the input points, KT is known to speed up unsupervised tasks like Monte Carlo integration, uncertainty quantification, and non-parametric hypothesis testing, with minimal loss in statistical accuracy. In this work, we generalize the KT algorithm to speed up supervised learning problems involving kernel methods. Specifically, we combine two classical algorithms, Nadaraya-Watson (NW) regression (also known as kernel smoothing) and kernel ridge regression (KRR), with KT to provide a quadratic speed-up in both training and inference times. We show how distribution compression with KT in each setting reduces to constructing an appropriate kernel, and introduce the Kernel-Thinned NW and Kernel-Thinned KRR estimators. We prove that KT-based regression estimators enjoy significantly superior computational efficiency over the full-data estimators and improved statistical efficiency over i.i.d. subsampling of the training data. En route, we also provide a novel multiplicative error guarantee for compressing with KT. We validate our design choices with both simulations and real data experiments.
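
To make the Nadaraya-Watson estimator named in this abstract concrete, here is a minimal one-dimensional sketch with a Gaussian kernel. It does not implement kernel thinning: the points passed in as `xs` and `ys` stand for whatever (possibly compressed) training set is available, and the Gaussian kernel and `bandwidth` parameter are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def gaussian_kernel(x: float, xi: float, bandwidth: float) -> float:
    """Unnormalized Gaussian kernel weight between a query x and a training point xi."""
    return math.exp(-((x - xi) ** 2) / (2.0 * bandwidth ** 2))


def nadaraya_watson(
    x: float,
    xs: Sequence[float],
    ys: Sequence[float],
    bandwidth: float = 1.0,
) -> float:
    """Kernel-weighted average of the responses ys, evaluated at the query point x."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    weights = [gaussian_kernel(x, xi, bandwidth) for xi in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all kernel weights underflowed; try a larger bandwidth")
    # Inference cost is linear in len(xs), so a smaller training set is directly cheaper.
    return sum(w * y for w, y in zip(weights, ys)) / total
```

Since each prediction touches every retained point, shrinking the training set is what reduces inference time; how well accuracy is preserved depends on how the subset is chosen, which is the question the paper addresses with KT.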

[LG-31] Theory on Score-Mismatched Diffusion Models and Zero-Shot Conditional Samplers

Link: https://arxiv.org/abs/2410.13746
Authors: Yuchen Liang,Peizhong Ju,Yingbin Liang,Ness Shroff
Keywords-EN: powerful generative technique, denoising diffusion model, generative technique, capable of transforming, meaningful data
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:

Abstract:The denoising diffusion model has recently emerged as a powerful generative technique, capable of transforming noise into meaningful data. While theoretical convergence guarantees for diffusion models are well established when the target distribution aligns with the training distribution, practical scenarios often present mismatches. One common case is in zero-shot conditional diffusion sampling, where the target conditional distribution is different from the (unconditional) training distribution. These score-mismatched diffusion models remain largely unexplored from a theoretical perspective. In this paper, we present the first performance guarantee with explicit dimensional dependencies for general score-mismatched diffusion samplers, focusing on target distributions with finite second moments. We show that score mismatches result in an asymptotic distributional bias between the target and sampling distributions, proportional to the accumulated mismatch between the target and training distributions. This result can be directly applied to zero-shot conditional samplers for any conditional model, irrespective of measurement noise. Interestingly, the derived convergence upper bound offers useful guidance for designing a novel bias-optimal zero-shot sampler in linear conditional models that minimizes the asymptotic bias. For such bias-optimal samplers, we further establish convergence guarantees with explicit dependencies on dimension and conditioning, applied to several interesting target distributions, including those with bounded support and Gaussian mixtures. Our findings are supported by numerical studies.

[LG-32] Single-Timescale Multi-Sequence Stochastic Approximation Without Fixed Point Smoothness: Theories and Applications

Link: https://arxiv.org/abs/2410.13743
Authors: Yue Huang,Zhaoxian Wu,Shiqian Ma,Qing Ling
Keywords-EN: finds diverse applications, multiple coupled sequences, involves multiple coupled, Stochastic approximation, coupled sequences
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Stochastic approximation (SA) that involves multiple coupled sequences, known as multiple-sequence SA (MSSA), finds diverse applications in the fields of signal processing and machine learning. However, existing theoretical understandings of MSSA are limited: the multi-timescale analysis implies a slow convergence rate, whereas the single-timescale analysis relies on a stringent fixed point smoothness assumption. This paper establishes tighter single-timescale analysis for MSSA, without assuming smoothness of the fixed points. Our theoretical findings reveal that, when all involved operators are strongly monotone, MSSA converges at a rate of $\tilde{\mathcal{O}}(K^{-1})$, where $K$ denotes the total number of iterations. In addition, when all involved operators are strongly monotone except for the main one, MSSA converges at a rate of $\mathcal{O}(K^{-\frac{1}{2}})$. These theoretical findings align with those established for single-sequence SA. Applying these theoretical findings to bilevel optimization and communication-efficient distributed learning offers relaxed assumptions and/or simpler algorithms with performance guarantees, as validated by numerical experiments.

[LG-33] Optimizing Probabilistic Conformal Prediction with Vectorized Non-Conformity Scores

Link: https://arxiv.org/abs/2410.13735
Authors: Minxing Zheng,Shixiang Zhu
Keywords-EN: shown significant promise, reliable decision-making hinges, accurate uncertainty quantification, Generative models, autonomous driving
Categories: Machine Learning (cs.LG); Methodology (stat.ME)
*Comments:

Abstract:Generative models have shown significant promise in critical domains such as medical diagnosis, autonomous driving, and climate science, where reliable decision-making hinges on accurate uncertainty quantification. While probabilistic conformal prediction (PCP) offers a powerful framework for this purpose, its coverage efficiency – the size of the uncertainty set – is limited when dealing with complex underlying distributions and a finite number of generated samples. In this paper, we propose a novel PCP framework that enhances efficiency by first vectorizing the non-conformity scores with ranked samples and then optimizing the shape of the prediction set by varying the quantiles for samples at the same rank. Our method delivers valid coverage while producing discontinuous and more efficient prediction sets, making it particularly suited for high-stakes applications. We demonstrate the effectiveness of our approach through experiments on both synthetic and real-world datasets.

[LG-34] Reducing the Transformer Architecture to a Minimum

Link: https://arxiv.org/abs/2410.13732
Authors: Bernhard Bermeitinger,Tomas Hrycej,Massimo Pavone,Julianus Kath,Siegfried Handschuh
Keywords-EN: Natural Language Processing, Language Processing, Computer Vision, Natural Language, Attention Mechanism
Categories: Machine Learning (cs.LG)
*Comments: 8 pages, to appear in KDIR2024

Abstract:Transformers are a widespread and successful model architecture, particularly in Natural Language Processing (NLP) and Computer Vision (CV). The essential innovation of this architecture is the Attention Mechanism, which solves the problem of extracting relevant context information from long sequences in NLP and realistic scenes in CV. A classical neural network component, a Multi-Layer Perceptron (MLP), complements the attention mechanism. Its necessity is frequently justified by its capability of modeling nonlinear relationships. However, the attention mechanism itself is nonlinear through its internal use of similarity measures. A possible hypothesis is that this nonlinearity is sufficient for modeling typical application problems. As the MLPs usually contain the most trainable parameters of the whole model, their omission would substantially reduce the parameter set size. Further components can also be reorganized to reduce the number of parameters. Under some conditions, query and key matrices can be collapsed into a single matrix of the same size. The same is true about value and projection matrices, which can also be omitted without eliminating the substance of the attention mechanism. Initially, the similarity measure was defined asymmetrically, with peculiar properties such as that a token is possibly dissimilar to itself. A possible symmetric definition requires only half of the parameters. We have laid the groundwork by testing widespread CV benchmarks: MNIST and CIFAR-10. The tests have shown that simplified transformer architectures (a) without MLP, (b) with collapsed matrices, and (c) with symmetric similarity matrices exhibit similar performance as the original architecture, saving up to 90% of parameters without hurting the classification performance.
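
The collapse of the query and key matrices mentioned in this abstract can be illustrated with a short derivation. The notation below ($x_i$ for token embeddings of dimension $d$, square $W_Q$ and $W_K$, and the omission of scaling and softmax) is an assumption made for illustration and is not taken from the paper.

```latex
% Unnormalized attention score between tokens i and j (assumed notation):
s_{ij} = (W_Q x_i)^\top (W_K x_j)
       = x_i^\top W_Q^\top W_K \, x_j
       = x_i^\top W_{QK} \, x_j,
\qquad W_{QK} := W_Q^\top W_K \in \mathbb{R}^{d \times d}.

% If W_{QK} is additionally constrained to be symmetric, then s_{ij} = s_{ji},
% and the number of free parameters drops from d^2 to
\frac{d(d+1)}{2} \approx \frac{d^2}{2}.
```

Under these assumptions the score depends on $W_Q$ and $W_K$ only through their product, so one $d \times d$ matrix replaces two, and the symmetric variant needs roughly half as many free parameters again, in line with the abstract's statement.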

[LG-35] Movie Gen: A Cast of Media Foundation Models


[LG-36] Generation through the lens of learning theory

Link: https://arxiv.org/abs/2410.13714
Authors: Vinod Raman,Ambuj Tewari
Keywords-EN: Kleinberg and Mullainathan, statistical learning theory, learning theory, abstract instance space, study generation
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: 16 pages

Abstract:We study generation through the lens of statistical learning theory. First, we abstract and formalize the results of Gold [1967], Angluin [1979, 1980], and Kleinberg and Mullainathan [2024] for language identification/generation in the limit in terms of a binary hypothesis class defined over an abstract instance space. Then, we formalize a different paradigm of generation studied by Kleinberg and Mullainathan [2024], which we call “uniform generation,” and provide a characterization of which hypothesis classes are uniformly generatable. As is standard in statistical learning theory, our characterization is in terms of the finiteness of a new combinatorial dimension we call the Closure dimension. By doing so, we are able to compare generatability with predictability (captured via PAC and online learnability) and show that these two properties of hypothesis classes are incompatible: there are classes that are generatable but not predictable and vice versa.

[LG-37] CrystalX: Ultra-Precision Crystal Structure Resolution and Error Correction Using Deep Learning

Link: https://arxiv.org/abs/2410.13713
Authors: Kaipeng Zheng, Weiran Huang, Wanli Ouyang, Han-Sen Zhong, Yuqiang Li
Keywords: Atomic structure analysis, material sciences, crystalline materials, Atomic structure, paramount endeavor
Subjects: Machine Learning (cs.LG)

Abstract:Atomic structure analysis of crystalline materials is a paramount endeavor in both chemical and material sciences. This sophisticated technique necessitates not only a solid foundation in crystallography but also a profound comprehension of the intricacies of the accompanying software, posing a significant challenge in meeting the rigorous daily demands. For the first time, we confront this challenge head-on by harnessing the power of deep learning for ultra-precise structural analysis at the full-atom level. To validate the performance of the model, named CrystalX, we employed a vast dataset comprising over 50,000 X-ray diffraction measurements derived from authentic experiments, demonstrating performance that is commensurate with human experts and adept at deciphering intricate geometric patterns. Remarkably, CrystalX revealed that even peer-reviewed publications can harbor errors that are stealthy to human scrutiny, yet CrystalX adeptly rectifies them. This deep learning model revolutionizes the time frame for crystal structure analysis, slashing it down to seconds. It has already been successfully applied in the structure analysis of newly discovered compounds in the latest research without human intervention. Overall, CrystalX marks the beginning of a new era in automating routine structural analysis within self-driving laboratories.

[LG-38] On-device Federated Learning in Smartphones for Detecting Depression from Reddit Posts

Link: https://arxiv.org/abs/2410.13709
Authors: Mustofa Ahmed, Abdul Muntakim, Nawrin Tabassum, Mohammad Asifur Rahim, Faisal Muhammad Shah
Keywords: social media posts, previous studies, detection using deep, large amounts, social media
Subjects: Machine Learning (cs.LG)
Comments: 11 pages, 7 figures, Submitted to IEEE

Abstract:Depression detection using deep learning models has been widely explored in previous studies, especially due to the large amounts of data available from social media posts. These posts provide valuable information about individuals’ mental health conditions and can be leveraged to train models and identify patterns in the data. However, distributed learning approaches have not been extensively explored in this domain. In this study, we adopt Federated Learning (FL) to facilitate decentralized training on smartphones while protecting user data privacy. We train three neural network architectures (GRU, RNN, and LSTM) on Reddit posts to detect signs of depression and evaluate their performance under heterogeneous FL settings. To optimize the training process, we leverage a common tokenizer across all client devices, which reduces the computational load. Additionally, we analyze resource consumption and communication costs on smartphones to assess their impact in a real-world FL environment. Our experimental results demonstrate that the federated models achieve comparable performance to the centralized models. This study highlights the potential of FL for decentralized mental health prediction by providing a secure and efficient model training process on edge devices.

[LG-39] On the Role of Attention Heads in Large Language Model Safety

[LG-40] Efficient Function Placement in Virtual Networks: An Online Learning Approach

Link: https://arxiv.org/abs/2410.13696
Authors: Wei Huang, Richard Combes, Hind Castel-Taleb, Badii Jouaber
Keywords: virtual function placement, function placement problem, multi-armed bandits, propose a model, virtual function
Subjects: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: 19 pages

Abstract:We propose a model for the virtual function placement problem and several novel algorithms using ideas based on multi-armed bandits. We prove that these algorithms learn the optimal placement policy rapidly, and their regret grows at a rate at most $O(NM\sqrt{T \ln T})$ while respecting the feasibility constraints with high probability. We show through numerical experiments that those algorithms both have good practical performance and modest computational complexity. Using the proposed acceleration technique, they can be used to learn in large networks where computational power is limited. Our experiments are fully reproducible, and the code is publicly available.

[LG-41] Automated Model Discovery for Tensional Homeostasis: Constitutive Machine Learning in Growth and Remodeling

Link: https://arxiv.org/abs/2410.13645
Authors: Hagen Holthusen, Tim Brepols, Kevin Linka, Ellen Kuhl
Keywords: external mechanical stimuli, Soft biological tissues, Soft biological, biological tissues exhibit, tensile stress
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Computational Engineering, Finance, and Science (cs.CE)
Comments: 46 pages, 12 figures, 5 tables

Abstract:Soft biological tissues exhibit a tendency to maintain a preferred state of tensile stress, known as tensional homeostasis, which is restored even after external mechanical stimuli. This macroscopic behavior can be described using the theory of kinematic growth, where the deformation gradient is multiplicatively decomposed into an elastic part and a part related to growth and remodeling. Recently, the concept of homeostatic surfaces was introduced to define the state of homeostasis and the evolution equations for inelastic deformations. However, identifying the optimal model and material parameters to accurately capture the macroscopic behavior of inelastic materials can only be accomplished with significant expertise, is often time-consuming, and prone to error, regardless of the specific inelastic phenomenon. To address this challenge, built-in physics machine learning algorithms offer significant potential. In this work, we extend our inelastic Constitutive Artificial Neural Networks (iCANNs) by incorporating kinematic growth and homeostatic surfaces to discover the scalar model equations, namely the Helmholtz free energy and the pseudo potential. The latter describes the state of homeostasis in a smeared sense. We evaluate the ability of the proposed network to learn from experimentally obtained tissue equivalent data at the material point level, assess its predictive accuracy beyond the training regime, and discuss its current limitations when applied at the structural level. Our source code, data, examples, and an implementation of the corresponding material subroutine are made accessible to the public at this https URL. 

[LG-42] Fine-Tuning Discrete Diffusion Models via Reward Optimization with Applications to DNA and Protein Design

[LG-43] Latent Space Chain-of-Embedding Enables Output-free LLM Self-Evaluation

[LG-44] Scaling Wearable Foundation Models

[LG-45] Normalizing self-supervised learning for provably reliable Change Point Detection

[LG-46] H2OVL-Mississippi Vision Language Models Technical Report

[LG-47] All models are wrong some are useful: Model Selection with Limited Labels

Link: https://arxiv.org/abs/2410.13609
Authors: Patrik Okanovic, Andreas Kirsch, Jannes Kasper, Torsten Hoefler, Andreas Krause, Nezihe Merve Gürel
Keywords: machine learning lifecycle, MODEL SELECTOR, model, MODEL SELECTOR reduces, learning lifecycle
Subjects: Machine Learning (cs.LG)

Abstract:With the multitude of pretrained models available thanks to the advancements in large-scale supervised and self-supervised learning, choosing the right model is becoming increasingly pivotal in the machine learning lifecycle. However, much like the training process, choosing the best pretrained off-the-shelf model for raw, unlabeled data is a labor-intensive task. To overcome this, we introduce MODEL SELECTOR, a framework for label-efficient selection of pretrained classifiers. Given a pool of unlabeled target data, MODEL SELECTOR samples a small subset of highly informative examples for labeling, in order to efficiently identify the best pretrained model for deployment on this target dataset. Through extensive experiments, we demonstrate that MODEL SELECTOR drastically reduces the need for labeled data while consistently picking the best or near-best performing model. Across 18 model collections on 16 different datasets, comprising over 1,500 pretrained models, MODEL SELECTOR reduces the labeling cost by up to 94.15% to identify the best model compared to the cost of the strongest baseline. Our results further highlight the robustness of MODEL SELECTOR in model selection, as it reduces the labeling cost by up to 72.41% when selecting a near-best model, whose accuracy is only within 1% of the best model.

[LG-48] Transformer-Based Approaches for Sensor-Based Human Activity Recognition: Opportunities and Challenges

Link: https://arxiv.org/abs/2410.13605
Authors: Clayton Souza Leite, Henry Mauranen, Aziza Zhanabatyrova, Yu Xiao
Keywords: Human Activity Recognition, sensor-based Human Activity, Activity Recognition, Human Activity, natural language processing
Subjects: Machine Learning (cs.LG)

Abstract:Transformers have excelled in natural language processing and computer vision, paving their way to sensor-based Human Activity Recognition (HAR). Previous studies show that transformers outperform their counterparts exclusively when they harness abundant data or employ compute-intensive optimization algorithms. However, neither of these scenarios is viable in sensor-based HAR due to the scarcity of data in this field and the frequent need to perform training and inference on resource-constrained devices. Our extensive investigation into various implementations of transformer-based versus non-transformer-based HAR using wearable sensors, encompassing more than 500 experiments, corroborates these concerns. We observe that transformer-based solutions pose higher computational demands, consistently yield inferior performance, and experience significant performance degradation when quantized to accommodate resource-constrained devices. Additionally, transformers demonstrate lower robustness to adversarial attacks, posing a potential threat to user trust in HAR.

[LG-49] Towards Satellite Non-IID Imagery: A Spectral Clustering-Assisted Federated Learning Approach

Link: https://arxiv.org/abs/2410.13602
Authors: Luyao Zou, Yu Min Park, Chu Myaet Thwal, Yan Kyaw Tun, Zhu Han, Choong Seon Hong
Keywords: Low Earth orbit, gathering abundant Earth, Low Earth, Internet of Things, abundant Earth observation
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures

Abstract:Low Earth orbit (LEO) satellites are capable of gathering abundant Earth observation data (EOD) to enable different Internet of Things (IoT) applications. However, to accomplish an effective EOD processing mechanism, it is imperative to investigate: 1) the challenge of processing the observed data without transmitting those large-size data to the ground because the connection between the satellites and the ground stations is intermittent, and 2) the challenge of processing the non-independent and identically distributed (non-IID) satellite data. In this paper, to cope with those challenges, we propose an orbit-based spectral clustering-assisted clustered federated self-knowledge distillation (OSC-FSKD) approach for each orbit of an LEO satellite constellation, which retains the advantage of FL that the observed data does not need to be sent to the ground. Specifically, we introduce normalized Laplacian-based spectral clustering (NLSC) into federated learning (FL) to create clustered FL in each round to address the challenge resulting from non-IID data. Particularly, NLSC is adopted to dynamically group clients into several clusters based on cosine similarities calculated by model updates. In addition, self-knowledge distillation is utilized to construct each local client, where the most recent updated local model is used to guide current local model training. Experiments demonstrate that the observation accuracy obtained by the proposed method is 1.01x, 2.15x, 1.10x, and 1.03x higher than that of the pFedSD, FedProx, FedAU, and FedALA approaches, respectively, on the SAT4 dataset. The proposed method also shows superiority when using other datasets.
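As a rough illustration of the first half of normalized Laplacian-based spectral clustering, the sketch below builds a client affinity matrix from cosine similarities of model updates and forms the symmetric normalized Laplacian I - D^(-1/2) A D^(-1/2). Two things are assumptions rather than details from the abstract: cosine similarity is rescaled to [0, 1] so it can act as a graph weight, and updates are taken to be flat, non-zero vectors. The eigendecomposition and cluster assignment that would follow are omitted.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero update vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def affinity_matrix(updates):
    """Rescale cosine similarity from [-1, 1] to [0, 1] (an assumed choice of graph weight)."""
    n = len(updates)
    return [[(1.0 + cosine_similarity(updates[i], updates[j])) / 2.0
             for j in range(n)] for i in range(n)]


def normalized_laplacian(a):
    """Symmetric normalized Laplacian: L = I - D^(-1/2) A D^(-1/2)."""
    n = len(a)
    deg = [sum(row) for row in a]
    return [[(1.0 if i == j else 0.0) - a[i][j] / math.sqrt(deg[i] * deg[j])
             for j in range(n)] for i in range(n)]


# Hypothetical client updates: the first two point in similar directions.
updates = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.2]]
laplacian = normalized_laplacian(affinity_matrix(updates))
```

A full clustering step would take the eigenvectors of this matrix for the smallest eigenvalues and group clients by their coordinates in that space.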

[LG-50] Text-Guided Multi-Property Molecular Optimization with a Diffusion Language Model

[LG-51] Towards Better Performance in Incomplete LDL: Addressing Data Imbalance

Link: https://arxiv.org/abs/2410.13579
Authors: Zhiqiang Kou, Haoyuan Xuan, Jing Wang, Yuheng Jia, Xin Geng
Keywords: Label Distribution Learning, machine learning paradigm, found widespread applications, Distribution Learning, Label Distribution
Subjects: Machine Learning (cs.LG)

Abstract:Label Distribution Learning (LDL) is a novel machine learning paradigm that addresses the problem of label ambiguity and has found widespread applications. Obtaining complete label distributions in real-world scenarios is challenging, which has led to the emergence of Incomplete Label Distribution Learning (InLDL). However, the existing InLDL methods overlook a crucial aspect of LDL data: the inherent imbalance in label distributions. To address this limitation, we propose Incomplete and Imbalance Label Distribution Learning (I$^2$LDL), a framework that simultaneously handles incomplete labels and imbalanced label distributions. Our method decomposes the label distribution matrix into a low-rank component for frequent labels and a sparse component for rare labels, effectively capturing the structure of both head and tail labels. We optimize the model using the Alternating Direction Method of Multipliers (ADMM) and derive generalization error bounds via Rademacher complexity, providing strong theoretical guarantees. Extensive experiments on 15 real-world datasets demonstrate the effectiveness and robustness of our proposed framework compared to existing InLDL methods.

[LG-52] Sample Compression Hypernetworks: From Generalization Bounds to Meta-Learning NEURIPS2024

Link: https://arxiv.org/abs/2410.13577
Authors: Benjamin Leblanc, Mathieu Bazinet, Nathaniel D’Amours, Alexandre Drouin, Pascal Germain
Keywords: sample compression theory, Reconstruction functions, deriving tight generalization, training set, compression theory
Subjects: Machine Learning (cs.LG)
Comments: Accepted at the NeurIPS 2024 workshop on Compression in Machine Learning

Abstract:Reconstruction functions are pivotal in sample compression theory, a framework for deriving tight generalization bounds. From a small sample of the training set (the compression set) and an optional stream of information (the message), they recover a predictor previously learned from the whole training set. While usually fixed, we propose to learn reconstruction functions. To facilitate the optimization and increase the expressiveness of the message, we derive a new sample compression generalization bound for real-valued messages. From this theoretical analysis, we then present a new hypernetwork architecture that outputs predictors with tight generalization guarantees when trained using an original meta-learning framework. The results of promising preliminary experiments are then reported.

[LG-53] Representing Model Weights with Language using Tree Experts

[LG-54] Ornstein-Uhlenbeck Adaptation as a Mechanism for Learning in Brains and Machines

Link: https://arxiv.org/abs/2410.13563
Authors: Jesus Garcia Fernandez, Nasir Ahmad, Marcel van Gerven
Keywords: fundamental property, organisms and engineered, Learning, biological organisms, intelligent systems
Subjects: Machine Learning (cs.LG)

Abstract:Learning is a fundamental property of intelligent systems, observed across biological organisms and engineered systems. While modern intelligent systems typically rely on gradient descent for learning, the need for exact gradients and complex information flow makes its implementation in biological and neuromorphic systems challenging. This has motivated the exploration of alternative learning mechanisms that can operate locally and do not rely on exact gradients. In this work, we introduce a novel approach that leverages noise in the parameters of the system and global reinforcement signals. Using an Ornstein-Uhlenbeck process with adaptive dynamics, our method balances exploration and exploitation during learning, driven by deviations from error predictions, akin to reward prediction error. Operating in continuous time, Ornstein-Uhlenbeck adaptation (OUA) is proposed as a general mechanism for learning dynamic, time-evolving environments. We validate our approach across diverse tasks, including supervised learning and reinforcement learning in feedforward and recurrent systems. Additionally, we demonstrate that it can perform meta-learning, adjusting hyper-parameters autonomously. Our results indicate that OUA provides a viable alternative to traditional gradient-based methods, with potential applications in neuromorphic computing. It also hints at a possible mechanism for noise-driven learning in the brain, where stochastic neurotransmitter release may guide synaptic adjustments.
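For readers unfamiliar with the underlying stochastic process, here is a minimal simulation of a plain Ornstein-Uhlenbeck process, d(theta) = rate * (mu - theta) dt + sigma dW, using an Euler-Maruyama step. This is only the textbook process, not the paper's OUA learning rule (the abstract does not give the adaptive dynamics); all parameter names are illustrative.

```python
import math
import random


def simulate_ou(theta0, mu, rate, sigma, dt, steps, rng):
    """Euler-Maruyama discretization of d(theta) = rate * (mu - theta) dt + sigma dW."""
    path = [theta0]
    theta = theta0
    for _ in range(steps):
        # Brownian increment over a step of length dt has standard deviation sqrt(dt).
        noise = rng.gauss(0.0, 1.0) * math.sqrt(dt)
        theta = theta + rate * (mu - theta) * dt + sigma * noise
        path.append(theta)
    return path


# A parameter starting at 0 is pulled toward mu = 1 while noise keeps it exploring.
path = simulate_ou(theta0=0.0, mu=1.0, rate=1.0, sigma=0.3, dt=0.1, steps=200,
                   rng=random.Random(0))
```

The drift term pulls the parameter back toward its mean while the noise term perturbs it, which is the exploration and exploitation trade-off the abstract alludes to.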

[LG-55] Adaptive and oblivious statistical adversaries are equivalent

Link: https://arxiv.org/abs/2410.13548
Authors: Guy Blanc, Gregory Valiant
Keywords: ability to perform, sample, sample-oblivious adversaries, adversary corrupts, perform a statistical
Subjects: Machine Learning (cs.LG); Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS)

Abstract:We resolve a fundamental question about the ability to perform a statistical task, such as learning, when an adversary corrupts the sample. Such adversaries are specified by the types of corruption they can make and their level of knowledge about the sample. The latter distinguishes between sample-adaptive adversaries, which know the contents of the sample when choosing the corruption, and sample-oblivious adversaries, which do not. We prove that for all types of corruptions, sample-adaptive and sample-oblivious adversaries are equivalent up to polynomial factors in the sample size. This resolves the main open question introduced by [BLMT22] and further explored in [CHLLN23]. Specifically, consider any algorithm A that solves a statistical task even when a sample-oblivious adversary corrupts its input. We show that there is an algorithm A' that solves the same task when the corresponding sample-adaptive adversary corrupts its input. The construction of A' is simple and maintains the computational efficiency of A: it requests a polynomially larger sample than A uses and then runs A on a uniformly random subsample. One of our main technical tools is a new structural result relating two distributions defined on sunflowers which may be of independent interest.
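The construction of A' described in the abstract is short enough to sketch directly. The wrapper below takes a large (possibly corrupted) sample and runs the original algorithm on a uniformly random subsample of size m. Drawing without replacement is an assumption, as are the function and argument names; how much larger the requested sample must be is determined by the paper's analysis and is not reproduced here.

```python
import random


def run_on_random_subsample(algorithm, large_sample, m, rng):
    """Run `algorithm` on m points drawn uniformly without replacement from `large_sample`."""
    if m > len(large_sample):
        raise ValueError("subsample size exceeds the available sample")
    subsample = rng.sample(large_sample, m)
    return algorithm(subsample)


def mean(points):
    """Stand-in for algorithm A: any function of a sample works here."""
    return sum(points) / len(points)


large_sample = list(range(1000))  # placeholder for the polynomially larger sample
estimate = run_on_random_subsample(mean, large_sample, 50, random.Random(0))
```

The wrapper adds only the cost of sampling, which is why the abstract can say the construction preserves the computational efficiency of A.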

[LG-56] Generative Adversarial Synthesis of Radar Point Cloud Scenes

[LG-57] PORTAL: Scalable Tabular Foundation Models via Content-Specific Tokenization NEURIPS2024

Link: https://arxiv.org/abs/2410.13516
Authors: Marco Spinaci, Marek Polewczyk, Johannes Hoffart, Markus C. Kohler, Sam Thelin, Tassilo Klein
Keywords: image domains, diverse domain, seeks to apply, apply advances, advances from natural
Subjects: Machine Learning (cs.LG)
Comments: Accepted at Table Representation Learning Workshop at NeurIPS 2024

Abstract:Self-supervised learning on tabular data seeks to apply advances from natural language and image domains to the diverse domain of tables. However, current techniques often struggle with integrating multi-domain data and require data cleaning or specific structural requirements, limiting the scalability of pre-training datasets. We introduce PORTAL (Pretraining One-Row-at-a-Time for All tabLes), a framework that handles various data modalities without the need for cleaning or preprocessing. This simple yet powerful approach can be effectively pre-trained on online-collected datasets and fine-tuned to match state-of-the-art methods on complex classification and regression tasks. This work offers a practical advancement in self-supervised learning for large-scale tabular data.

[LG-58] CERES: Critical-Event Reconstruction via Temporal Scene Graph Completion

Link: https://arxiv.org/abs/2410.13514
Authors: Efimia Panagiotaki, Georgi Pramatarov, Lars Kunze, Daniele De Martini
Keywords: paper proposes, proposes a method, method for on-demand, Autonomous Vehicles, Graph Neural Networks
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 7 pages, 8 figures

Abstract:This paper proposes a method for on-demand scenario generation in simulation, grounded on real-world data. Evaluating the behaviour of Autonomous Vehicles (AVs) in both safety-critical and regular scenarios is essential for assessing their robustness before real-world deployment. By integrating scenarios derived from real-world datasets into the simulation, we enhance the plausibility and validity of testing sets. This work introduces a novel approach that employs temporal scene graphs to capture evolving spatiotemporal relationships among scene entities from a real-world dataset, enabling the generation of dynamic scenarios in simulation through Graph Neural Networks (GNNs). User-defined action and criticality conditioning are used to ensure flexible, tailored scenario creation. Our model significantly outperforms the benchmarks in accurately predicting links corresponding to the requested scenarios. We further evaluate the validity and compatibility of our generated scenarios in an off-the-shelf simulator.

[LG-59] MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs

[LG-60] Integrating Large Language Models and Reinforcement Learning for Non-Linear Reasoning

Link: https://arxiv.org/abs/2410.13501
Authors: Yoav Alon, Cristina David
Keywords: Large Language Models, Large Language, Language Models, LLM space exploration, long-term planning
Subjects: Machine Learning (cs.LG); Programming Languages (cs.PL)

Abstract:Large Language Models (LLMs) were shown to struggle with long-term planning, which may be caused by the limited way in which they explore the space of possible solutions. We propose an architecture where a Reinforcement Learning (RL) Agent guides an LLM’s space exploration: (1) the Agent has access to domain-specific information, and can therefore make decisions about the quality of candidate solutions based on specific and relevant metrics, which were not explicitly considered by the LLM’s training objective; (2) the LLM can focus on generating immediate next steps, without the need for long-term planning. We allow non-linear reasoning by exploring alternative paths and backtracking. We evaluate this architecture on the program equivalence task, and compare it against Chain of Thought (CoT) and Tree of Thoughts (ToT). We assess both the downstream task, denoting the binary classification, and the intermediate reasoning steps. Our approach compares positively against CoT and ToT.

[LG-61] SAda-Net: A Self-Supervised Adaptive Stereo Estimation CNN For Remote Sensing Image Data ICPR2024

[LG-62] Enhancing Text Generation in Joint NLG/NLU Learning Through Curriculum Learning Semi-Supervised Training and Advanced Optimization Techniques

[LG-63] Deep Reinforcement Learning for Online Optimal Execution Strategies

Link: https://arxiv.org/abs/2410.13493
Authors: Alessandro Micheli, Mélodie Monod
Keywords: Deterministic Policy Gradient, Deep Deterministic Policy, dynamic financial markets, paper tackles, tackles the challenge
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:This paper tackles the challenge of learning non-Markovian optimal execution strategies in dynamic financial markets. We introduce a novel actor-critic algorithm based on Deep Deterministic Policy Gradient (DDPG) to address this issue, with a focus on transient price impact modeled by a general decay kernel. Through numerical experiments with various decay kernels, we show that our algorithm successfully approximates the optimal execution strategy. Additionally, the proposed algorithm demonstrates adaptability to evolving market conditions, where parameters fluctuate over time. Our findings also show that modern reinforcement learning algorithms can provide a solution that reduces the need for frequent and inefficient human intervention in optimal execution tasks.

[LG-64] Novelty-based Sample Reuse for Continuous Robotics Control

Link: https://arxiv.org/abs/2410.13490
Authors: Ke Duan, Kai Yang, Houde Liu, Xueqian Wang
Keywords: agents collect state, reinforcement learning, agents collect, environmental interactions, essential for policy
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:In reinforcement learning, agents collect state information and rewards through environmental interactions, essential for policy refinement. This process is notably time-consuming, especially in complex robotic simulations and real-world applications. Traditional algorithms usually re-engage with the environment after processing a single batch of samples, thereby failing to fully capitalize on historical data. However, frequently observed states, with reliable value estimates, require minimal updates; in contrast, rare observed states necessitate more intensive updates for achieving accurate value estimations. To address uneven sample utilization, we propose Novelty-guided Sample Reuse (NSR). NSR provides extra updates for infrequent, novel states and skips additional updates for frequent states, maximizing sample use before interacting with the environment again. Our experiments show that NSR improves the convergence rate and success rate of algorithms without significantly increasing time consumption. Our code is publicly available at this https URL.
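The idea of giving rare states more updates than frequent ones can be sketched with a simple count-based rule. Everything specific below is hypothetical: the abstract does not define the novelty measure, so this sketch uses visit counts over hashable (for example, discretized) states and a budget of `max_extra` updates divided by the count.

```python
from collections import Counter


def extra_update_counts(visited_states, max_extra=4):
    """Assign extra updates per state; the max_extra // count rule is a hypothetical novelty score."""
    counts = Counter(visited_states)
    # Rare states (low count) receive more extra updates; frequent ones receive none.
    return {state: max_extra // count for state, count in counts.items()}


batch_states = ["a", "a", "a", "a", "a", "b"]
print(extra_update_counts(batch_states))
```

In a continuous-control setting the states would first need a discretization or a learned density estimate before any such counting is possible.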

[LG-65] Seeing Through VisualBERT: A Causal Adventure on Memetic Landscapes EMNLP

[LG-66] Interpreting Temporal Graph Neural Networks with Koopman Theory

Link: https://arxiv.org/abs/2410.13469
Authors: Michele Guerra, Simone Scardapane, Filippo Maria Bianchi
Keywords: Spatiotemporal graph neural, shown promising results, graph neural networks, Spatiotemporal graph, neural networks
Subjects: Machine Learning (cs.LG)

Abstract:Spatiotemporal graph neural networks (STGNNs) have shown promising results in many domains, from forecasting to epidemiology. However, understanding the dynamics learned by these models and explaining their behaviour is significantly more complex than for models dealing with static data. Inspired by Koopman theory, which allows a simpler description of intricate, nonlinear dynamical systems, we introduce an explainability approach for temporal graphs. We present two methods to interpret the STGNN’s decision process and identify the most relevant spatial and temporal patterns in the input for the task at hand. The first relies on dynamic mode decomposition (DMD), a Koopman-inspired dimensionality reduction method. The second relies on sparse identification of nonlinear dynamics (SINDy), a popular method for discovering governing equations, which we use for the first time as a general tool for explainability. We show how our methods can correctly identify interpretable features such as infection times and infected nodes in the context of dissemination processes.

[LG-67] Truncating Trajectories in Monte Carlo Policy Evaluation: an Adaptive Approach

Link: https://arxiv.org/abs/2410.13463
Authors: Riccardo Poiani, Nicole Nobili, Alberto Maria Metelli, Marcello Restelli
Keywords: Monte Carlo, policy gradient methods, Reinforcement Learning, Policy evaluation, policy gradient
Subjects: Machine Learning (cs.LG)

Abstract:Policy evaluation via Monte Carlo (MC) simulation is at the core of many MC Reinforcement Learning (RL) algorithms (e.g., policy gradient methods). In this context, the designer of the learning system specifies an interaction budget that the agent usually spends by collecting trajectories of fixed length within a simulator. However, is this data collection strategy the best option? To answer this question, in this paper, we propose as a quality index a surrogate of the mean squared error of a return estimator that uses trajectories of different lengths, i.e., truncated. Specifically, this surrogate shows the sub-optimality of the fixed-length trajectory schedule. Furthermore, it suggests that adaptive data collection strategies that spend the available budget sequentially can allocate a larger portion of transitions in timesteps in which more accurate sampling is required to reduce the error of the final estimate. Building on these findings, we present an adaptive algorithm called Robust and Iterative Data collection strategy Optimization (RIDO). The main intuition behind RIDO is to split the available interaction budget into mini-batches. At each round, the agent determines the most convenient schedule of trajectories that minimizes an empirical and robust version of the surrogate of the estimator’s error. After discussing the theoretical properties of our method, we conclude by assessing its performance across multiple domains. Our results show that RIDO can adapt its trajectory schedule toward timesteps where more sampling is required to increase the quality of the final estimation.

[LG-68] Progressive Mixed-Precision Decoding for Efficient LLM Inference

[LG-69] Breaking the Manual Annotation Bottleneck: Creating a Comprehensive Legal Case Criticality Dataset through Semi-Automated Labeling

[LG-70] Unlocking Legal Knowledge: A Multilingual Dataset for Judicial Summarization in Switzerland

[LG-71] Fast Estimation of Partial Dependence Functions using Trees

Link: https://arxiv.org/abs/2410.13448
Authors: Jinyang Liu, Tessa Steensgaard, Marvin N. Wright, Niklas Pfister, Munir Hiabu
Keywords: Partial Dependence, machine learning model, pre-trained machine learning, based on Partial, texttt
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Many existing interpretation methods are based on Partial Dependence (PD) functions that, for a pre-trained machine learning model, capture how a subset of the features affects the predictions by averaging over the remaining features. Notable methods include Shapley additive explanations (SHAP), which computes feature contributions based on a game theoretical interpretation, and PD plots (i.e., 1-dim PD functions) that capture average marginal main effects. Recent work has connected these approaches using a functional decomposition and argues that SHAP values can be misleading since they merge main and interaction effects into a single local effect. A major advantage of SHAP compared to other PD-based interpretations, however, has been the availability of fast estimation techniques, such as TreeSHAP. In this paper, we propose a new tree-based estimator, FastPD, which efficiently estimates arbitrary PD functions. We show that FastPD consistently estimates the desired population quantity, in contrast to path-dependent TreeSHAP, which is inconsistent when features are correlated. For moderately deep trees, FastPD improves the complexity of existing methods from quadratic to linear in the number of observations. By estimating PD functions for arbitrary feature subsets, FastPD can be used to extract PD-based interpretations such as SHAP, PD plots and higher order interaction effects.
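For reference, the PD function that the abstract describes can be estimated by brute force: fix the features of interest to chosen values and average the model output over the observed values of the remaining features. The sketch below is this naive estimator, not FastPD; `model` is assumed to be any callable on a feature list, and the names are illustrative.

```python
def partial_dependence(model, data, subset, values):
    """Average model output over `data` with the features in `subset` fixed to `values`."""
    total = 0.0
    for row in data:
        point = list(row)  # copy so the dataset is not modified
        for index, value in zip(subset, values):
            point[index] = value
        total += model(point)
    return total / len(data)


def toy_model(x):
    """Stand-in for a pre-trained model."""
    return 2 * x[0] + x[1]


data = [[0, 1], [5, 3]]
print(partial_dependence(toy_model, data, subset=[0], values=[1]))  # → 4.0
```

Each evaluation costs one model call per observation, so evaluating the PD function at all n observed points takes n^2 calls, which is the quadratic cost the abstract says FastPD reduces to linear for tree models.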

[LG-72] Parameter-efficient Adaptation of Multilingual Multimodal Models for Low-resource ASR

[LG-73] Similarity-Dissimilarity Loss with Supervised Contrastive Learning for Multi-label Classification

[LG-74] Solving Prior Distribution Mismatch in Diffusion Models via Optimal Transport

[LG-75] Partially Trained Graph Convolutional Networks Resist Oversmoothing

Link: https://arxiv.org/abs/2410.13416
Authors: Dimitrios Kelesis, Dimitris Fotakis, Georgios Paliouras
Keywords: made by Kipf, generate meaningful node, meaningful node embeddings, observation made, generate meaningful
Categories: Machine Learning (cs.LG)

Abstract: In this work we investigate an observation made by Kipf & Welling, who suggested that untrained GCNs can generate meaningful node embeddings. In particular, we investigate the effect of training only a single layer of a GCN, while keeping the rest of the layers frozen. We propose a basis on which the effect of the untrained layers and their contribution to the generation of embeddings can be predicted. Moreover, we show that network width influences the dissimilarity of node embeddings produced after the initial node features pass through the untrained part of the model. Additionally, we establish a connection between partially trained GCNs and oversmoothing, showing that they are capable of reducing it. We verify our theoretical results experimentally and show the benefits of using deep networks that resist oversmoothing, in a "cold start" scenario, where there is a lack of feature information for unlabeled nodes.

[LG-76] RAMPA: Robotic Augmented Reality for Machine Programming and Automation

Link: https://arxiv.org/abs/2410.13412
Authors: Fatih Dogangun, Serdar Bahar, Yigit Yildirim, Bora Toprak Temir, Emre Ugur, Mustafa Doga Dogan
Keywords: intuitive robot training, Robotic Augmented Reality, increasingly more important, continue to enter, enter various sectors
Categories: Robotics (cs.RO); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Abstract:As robotics continue to enter various sectors beyond traditional industrial applications, the need for intuitive robot training and interaction systems becomes increasingly more important. This paper introduces Robotic Augmented Reality for Machine Programming (RAMPA), a system that utilizes the capabilities of state-of-the-art and commercially available AR headsets, e.g., Meta Quest 3, to facilitate the application of Programming from Demonstration (PfD) approaches on industrial robotic arms, such as Universal Robots UR10. Our approach enables in-situ data recording, visualization, and fine-tuning of skill demonstrations directly within the user’s physical environment. RAMPA addresses critical challenges of PfD, such as safety concerns, programming barriers, and the inefficiency of collecting demonstrations on the actual hardware. The performance of our system is evaluated against the traditional method of kinesthetic control in teaching three different robotic manipulation tasks and analyzed with quantitative metrics, measuring task performance and completion time, trajectory smoothness, system usability, user experience, and task load using standardized surveys. Our findings indicate a substantial advancement in how robotic tasks are taught and refined, promising improvements in operational safety, efficiency, and user engagement in robotic programming.

[LG-77] MoR: Mixture of Ranks for Low-Rank Adaptation Tuning

[LG-78] Predicting Breast Cancer Survival: A Survival Analysis Approach Using Log Odds and Clinical Variables

Link: https://arxiv.org/abs/2410.13404
Authors: Opeyemi Sheu Alamu, Bismar Jorge Gutierrez Choque, Syed Wajeeh Abbs Rizvi, Samah Badr Hammed, Isameldin Elamin Medani, Md Kamrul Siam, Waqar Ahmad Tahir
Keywords: global health challenge, significant global health, decisions largely dependent, treatment decisions largely, Breast cancer
Categories: Machine Learning (cs.LG)
Comments: 17 pages

Abstract: Breast cancer remains a significant global health challenge, with prognosis and treatment decisions largely dependent on clinical characteristics. Accurate prediction of patient outcomes is crucial for personalized treatment strategies. This study employs survival analysis techniques, including Cox proportional hazards and parametric survival models, to enhance the prediction of the log odds of survival in breast cancer patients. Clinical variables such as tumor size, hormone receptor status, HER2 status, age, and treatment history were analyzed to assess their impact on survival outcomes. Data from 1557 breast cancer patients were obtained from a publicly available dataset provided by the University College Hospital, Ibadan, Nigeria. This dataset was preprocessed and analyzed using both univariate and multivariate approaches to evaluate survival outcomes. Kaplan-Meier survival curves were generated to visualize survival probabilities, while the Cox proportional hazards model identified key risk factors influencing mortality. The results showed that older age, larger tumor size, and HER2-positive status were significantly associated with an increased risk of mortality. In contrast, estrogen receptor positivity and breast-conserving surgery were linked to better survival outcomes. The findings suggest that integrating these clinical variables into predictive models improves the accuracy of survival predictions, helping to identify high-risk patients who may benefit from more aggressive interventions. This study demonstrates the potential of survival analysis in optimizing breast cancer care, particularly in resource-limited settings. Future research should focus on integrating genomic data and real-world clinical outcomes to further refine these models.
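The Kaplan-Meier curves mentioned in the abstract can be illustrated with a short sketch of the product-limit estimator. The function name `kaplan_meier` and the five toy patients below are made up for illustration and are not taken from the study's dataset. An event flag of 1 means death was observed; 0 means the patient was censored.

```python
def kaplan_meier(durations, events):
    """Product-limit estimate of the survival function.

    Returns a list of (time, survival probability) pairs, one per distinct
    time at which at least one event was observed.
    """
    event_times = sorted({t for t, e in zip(durations, events) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Patients still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Observed events exactly at time t (censored patients are excluded).
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Toy follow-up times (e.g. in years) and event indicators.
durations = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
km_curve = kaplan_meier(durations, events)
for time, prob in km_curve:
    print(f"t={time}: S(t) is about {prob:.3f}")
```

Censored patients still count in the at-risk set up to their censoring time. That is what distinguishes this estimate from a plain fraction of survivors.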

[LG-79] A Self-Constructing Multi-Expert Fuzzy System for High-dimensional Data Classification

Link: https://arxiv.org/abs/2410.13390
Authors: Yingtao Ren, Yu-Cheng Chang, Thomas Do, Zehong Cao, Chin-Teng Lin
Keywords: Fuzzy Neural Networks, Neural Networks, fuzzy system, machine learning models, Fuzzy Neural
Categories: Machine Learning (cs.LG)

Abstract:Fuzzy Neural Networks (FNNs) are effective machine learning models for classification tasks, commonly based on the Takagi-Sugeno-Kang (TSK) fuzzy system. However, when faced with high-dimensional data, especially with noise, FNNs encounter challenges such as vanishing gradients, excessive fuzzy rules, and limited access to prior knowledge. To address these challenges, we propose a novel fuzzy system, the Self-Constructing Multi-Expert Fuzzy System (SOME-FS). It combines two learning strategies: mixed structure learning and multi-expert advanced learning. The former enables each base classifier to effectively determine its structure without requiring prior knowledge, while the latter tackles the issue of vanishing gradients by enabling each rule to focus on its local region, thereby enhancing the robustness of the fuzzy classifiers. The overall ensemble architecture enhances the stability and prediction performance of the fuzzy system. Our experimental results demonstrate that the proposed SOME-FS is effective in high-dimensional tabular data, especially in dealing with uncertainty. Moreover, our stable rule mining process can identify concise and core rules learned by the SOME-FS.

[LG-80] Data-Augmented Predictive Deep Neural Network: Enhancing the extrapolation capabilities of non-intrusive surrogate models

Link: https://arxiv.org/abs/2410.13376
Authors: Shuwen Sun, Lihong Feng, Peter Benner
Keywords: high computational costs, large parametric nonlinear, parametric nonlinear dynamical, nonlinear dynamical system, Numerically solving
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)

Abstract: Numerically solving a large parametric nonlinear dynamical system is challenging due to its high complexity and the high computational costs. In recent years, machine-learning-aided surrogates are being actively researched. However, many methods fail in accurately generalizing in the entire time interval [0, T], when the training data is available only in a training time interval [0, T_0], with T_0 < T. To improve the extrapolation capabilities of the surrogate models in the entire time domain, we propose a new deep learning framework, where kernel dynamic mode decomposition (KDMD) is employed to evolve the dynamics of the latent space generated by the encoder part of a convolutional autoencoder (CAE). After adding the KDMD-decoder-extrapolated data into the original data set, we train the CAE along with a feed-forward deep neural network using the augmented data. The trained network can predict future states outside the training time interval at any out-of-training parameter samples. The proposed method is tested on two numerical examples: a FitzHugh-Nagumo model and a model of incompressible flow past a cylinder. Numerical results show accurate and fast prediction performance in both the time and the parameter domain.

[LG-81] Addressing Heterogeneity and Heterophily in Graphs: A Heterogeneous Heterophilic Spectral Graph Neural Network

Link: https://arxiv.org/abs/2410.13373
Authors: Kangkang Lu, Yanhua Yu, Zhiyong Huang, Jia Li, Yuling Wang, Meiyu Liang, Xiting Qin, Yimeng Ren, Tat-Seng Chua, Xidian Wang
Keywords: garnered significant scholarly, significant scholarly attention, modeling graph structures, Graph Neural Networks, Neural Networks
Categories: Machine Learning (cs.LG)

Abstract: Graph Neural Networks (GNNs) have garnered significant scholarly attention for their powerful capabilities in modeling graph structures. Despite this, two primary challenges persist: heterogeneity and heterophily. Existing studies often address heterogeneous and heterophilic graphs separately, leaving a research gap in the understanding of heterogeneous heterophilic graphs, i.e., those that feature diverse node or relation types with dissimilar connected nodes. To address this gap, we investigate the application of spectral graph filters within heterogeneous graphs. Specifically, we propose a Heterogeneous Heterophilic Spectral Graph Neural Network (H2SGNN), which employs a dual-module approach: local independent filtering and global hybrid filtering. The local independent filtering module applies polynomial filters to each subgraph independently to adapt to different homophily, while the global hybrid filtering module captures interactions across different subgraphs. Extensive empirical evaluations on four real-world datasets demonstrate the superiority of H2SGNN compared to state-of-the-art methods.

[LG-82] Statistical testing on generative AI anomaly detection tools in Alzheimer's Disease diagnosis

Link: https://arxiv.org/abs/2410.13363
Authors: Rosemary He, Ichiro Takeuchi
Keywords: Disease is challenging, Alzheimer Disease, heterogeneity among patients, challenging to diagnose, limited understanding
Categories: Machine Learning (cs.LG)

Abstract:Alzheimer’s Disease is challenging to diagnose due to our limited understanding of its mechanism and large heterogeneity among patients. Neurodegeneration is studied widely as a biomarker for clinical diagnosis, which can be measured from time series MRI progression. On the other hand, generative AI has shown promise in anomaly detection in medical imaging and used for tasks including tumor detection. However, testing the reliability of such data-driven methods is non-trivial due to the issue of double-dipping in hypothesis testing. In this work, we propose to solve this issue with selective inference and develop a reliable generative AI method for Alzheimer’s prediction. We show that compared to traditional statistical methods with highly inflated p-values, selective inference successfully controls the false discovery rate under the desired alpha level while retaining statistical power. In practice, our pipeline could assist clinicians in Alzheimer’s diagnosis and early intervention.

[LG-83] Remember Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant

[LG-84] Representation Learning of Structured Data for Medical Foundation Models NEURIPS2024

Link: https://arxiv.org/abs/2410.13351
Authors: Vijay Prakash Dwivedi, Viktor Schlegel, Andy T. Liu, Thanh-Tung Nguyen, Abhinav Ramesh Kashyap, Jeng Wei, Wei-Hsian Yin, Stefan Winkler, Robby T. Tan
Keywords: Large Language Models, Large Language, demonstrated remarkable performance, Language Models, including healthcare
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: NeurIPS 2024 Workshop on Unifying Representations in Neural Models (UniReps 2024)

[LG-85] Do LLMs Overcome Shortcut Learning? An Evaluation of Shortcut Challenges in Large Language Models

[LG-86] Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data

Link: https://arxiv.org/abs/2410.13341
Authors: Florian E. Dorner, Vivian Y. Nastl, Moritz Hardt
Keywords: machine learning ecosystem, explosively growing machine, growing machine learning, learning ecosystem, increasingly a bottleneck
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 22 pages, 5 figures

Abstract:High quality annotations are increasingly a bottleneck in the explosively growing machine learning ecosystem. Scalable evaluation methods that avoid costly annotation have therefore become an important research ambition. Many hope to use strong existing models in lieu of costly labels to provide cheap model evaluations. Unfortunately, this method of using models as judges introduces biases, such as self-preferencing, that can distort model comparisons. An emerging family of debiasing tools promises to fix these issues by using a few high quality labels to debias a large number of model judgments. In this paper, we study how far such debiasing methods, in principle, can go. Our main result shows that when the judge is no more accurate than the evaluated model, no debiasing method can decrease the required amount of ground truth labels by more than half. Our result speaks to the severe limitations of the LLM-as-a-judge paradigm at the evaluation frontier where the goal is to assess newly released models that are possibly better than the judge. Through an empirical evaluation, we demonstrate that the sample size savings achievable in practice are even more modest than what our theoretical limit suggests. Along the way, our work provides new observations about debiasing methods for model evaluation, and points out promising avenues for future work.

[LG-87] DiffImp: Efficient Diffusion Model for Probabilistic Time Series Imputation with Bidirectional Mamba Backbone

[LG-88] Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI Systems

[LG-89] Improving Discrete Optimisation Via Decoupled Straight-Through Gumbel-Softmax

[LG-90] Precipitation Nowcasting Using Diffusion Transformer with Causal Attention

[LG-91] Hiformer: Hybrid Frequency Feature Enhancement Inverted Transformer for Long-Term Wind Power Prediction

[LG-92] LLM-Rank: A Graph Theoretical Approach to Pruning Large Language Models

[LG-93] Fairness-Enhancing Ensemble Classification in Water Distribution Networks

[LG-94] PiLocNet: Physics-informed neural network on 3D localization with rotating point spread function

[LG-95] SBI-RAG: Enhancing Math Word Problem Solving for Students through Schema-Based Instruction and Retrieval-Augmented Generation NEURIPS’24

[LG-96] An Online Learning Approach to Prompt-based Selection of Generative Models

Link: https://arxiv.org/abs/2410.13287
Authors: Xiaoyan Hu, Ho-fung Leung, Farzan Farnia
Keywords: averaged evaluation score, text-based generative models, generation model, multiple text-based generative, sample generation scheme
Categories: Machine Learning (cs.LG)

Abstract: Selecting a sample generation scheme from multiple text-based generative models is typically addressed by choosing the model that maximizes an averaged evaluation score. However, this score-based selection overlooks the possibility that different models achieve the best generation performance for different types of text prompts. An online identification of the best generation model for various input prompts can reduce the costs associated with querying sub-optimal models. In this work, we explore the possibility of varying rankings of text-based generative models for different text prompts and propose an online learning framework to predict the best data generation model for a given input prompt. The proposed framework adapts the kernelized contextual bandit (CB) methodology to a CB setting with shared context variables across arms, utilizing the generated data to update a kernel-based function that predicts which model will achieve the highest score for unseen text prompts. Additionally, we apply random Fourier features (RFF) to the kernelized CB algorithm to accelerate the online learning process and establish a \widetilde{\mathcal{O}}(\sqrt{T}) regret bound for the proposed RFF-based CB algorithm over T iterations. Our numerical experiments on real and simulated text-to-image and image-to-text generative models show RFF-UCB performs successfully in identifying the best generation model across different sample types.
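The random Fourier feature idea the abstract relies on can be sketched in isolation: an explicit random feature map whose inner product approximates a Gaussian (RBF) kernel, so kernel evaluations can be replaced by finite-dimensional dot products. This is only the feature map, not the paper's RFF-UCB algorithm. The kernel choice, `gamma`, the feature count, and all function names are assumptions made for illustration.

```python
import math
import random


def make_rff(dim, num_features, gamma, seed=0):
    """Build a feature map z with z(x) . z(y) close to exp(-gamma * ||x - y||^2)."""
    rng = random.Random(seed)
    # The RBF kernel's spectral density is Gaussian with variance 2 * gamma.
    std = math.sqrt(2.0 * gamma)
    weights = [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(num_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def features(x):
        return [
            scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(weights, offsets)
        ]

    return features


def rbf_kernel(x, y, gamma):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


gamma = 0.5
phi = make_rff(dim=2, num_features=5000, gamma=gamma)
x, y = [0.5, -1.0], [1.0, 0.0]
approx = sum(a * b for a, b in zip(phi(x), phi(y)))
exact = rbf_kernel(x, y, gamma)
print(f"exact kernel {exact:.3f}, random-feature approximation {approx:.3f}")
```

The approximation error shrinks roughly like one over the square root of the number of features, so accuracy can be traded against the cost of each online update.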

[LG-97] A Human-in-the-Loop Fairness-Aware Model Selection Framework for Complex Fairness Objective Landscapes

Link: https://arxiv.org/abs/2410.13286
Authors: Jake Robertson, Thorsten Schmidt, Frank Hutter, Noor Awad
Keywords: Fairness-aware Machine Learning, Machine Learning, frequently involving multiple, well-known Impossibility Theorem, Fairness-aware Machine
Categories: Machine Learning (cs.LG)

Abstract:Fairness-aware Machine Learning (FairML) applications are often characterized by complex social objectives and legal requirements, frequently involving multiple, potentially conflicting notions of fairness. Despite the well-known Impossibility Theorem of Fairness and extensive theoretical research on the statistical and socio-technical trade-offs between fairness metrics, many FairML tools still optimize or constrain for a single fairness objective. However, this one-sided optimization can inadvertently lead to violations of other relevant notions of fairness. In this socio-technical and empirical study, we frame fairness as a many-objective (MaO) problem by treating fairness metrics as conflicting objectives. We introduce ManyFairHPO, a human-in-the-loop, fairness-aware model selection framework that enables practitioners to effectively navigate complex and nuanced fairness objective landscapes. ManyFairHPO aids in the identification, evaluation, and balancing of fairness metric conflicts and their related social consequences, leading to more informed and socially responsible model-selection decisions. Through a comprehensive empirical evaluation and a case study on the Law School Admissions problem, we demonstrate the effectiveness of ManyFairHPO in balancing multiple fairness objectives, mitigating risks such as self-fulfilling prophecies, and providing interpretable insights to guide stakeholders in making fairness-aware modeling decisions.

[LG-98] Learning to Route with Confidence Tokens

[LG-99] Inductive Gradient Adjustment For Spectral Bias In Implicit Neural Representations

[LG-100] The Latent Road to Atoms: Backmapping Coarse-grained Protein Structures with Latent Diffusion

[LG-101] A Simplifying and Learnable Graph Convolutional Attention Network for Unsupervised Knowledge Graphs Alignment

[LG-102] scFusionTTT: Single-cell transcriptomics and proteomics fusion with Test-Time Training layers

[LG-103] FDF: Flexible Decoupled Framework for Time Series Forecasting with Conditional Denoising and Polynomial Modeling

Link: https://arxiv.org/abs/2410.13253
Authors: Jintao Zhang, Mingyue Cheng, Xiaoyu Tao, Zhiding Liu, Daoyu Wang
Keywords: numerous web applications, influencing critical decision-making, Time series forecasting, Time series, web applications
Categories: Machine Learning (cs.LG)

Abstract: Time series forecasting is vital in numerous web applications, influencing critical decision-making across industries. While diffusion models have recently gained increasing popularity for this task, we argue they suffer from a significant drawback: indiscriminate noise addition to the original time series followed by denoising, which can obscure the underlying dynamic evolving trend and complicate forecasting. To address this limitation, we propose a novel flexible decoupled framework (FDF) that learns high-quality time series representations for enhanced forecasting performance. A key characteristic of our approach is that it leverages the inherent inductive bias of time series data by decomposing it into trend and seasonal components, each modeled separately to enable decoupled analysis and modeling. Specifically, we propose an innovative Conditional Denoising Seasonal Module (CDSM) within the diffusion model, which leverages statistical information from the historical window to conditionally model the complex seasonal component. Notably, we incorporate a Polynomial Trend Module (PTM) to effectively capture the smooth trend component, thereby enhancing the model’s ability to represent temporal dependencies. Extensive experiments validate the effectiveness of our framework, demonstrating superior performance over existing methods and highlighting its flexibility in time series forecasting. We hope our work can bring a new perspective for time series forecasting. We intend to make our code publicly available as open-source in the future.

[LG-104] Disentangling Likes and Dislikes in Personalized Generative Explainable Recommendation

[LG-105] Quamba: A Post-Training Quantization Recipe for Selective State Space Models

[LG-106] From PINNs to PIKANs: Recent Advances in Physics-Informed Machine Learning

[LG-107] MixEHR-Nest: Identifying Subphenotypes within Electronic Health Records through Hierarchical Guided-Topic Modeling

[LG-108] Balancing Label Quantity and Quality for Scalable Elicitation

Link: https://arxiv.org/abs/2410.13215
Authors: Alex Mallen, Nora Belrose
Keywords: unreliable or expensive, complex codebases, Scalable oversight studies, training and evaluating, evaluating AI systems
Categories: Machine Learning (cs.LG)

Abstract:Scalable oversight studies methods of training and evaluating AI systems in domains where human judgement is unreliable or expensive, such as scientific research and software engineering in complex codebases. Recent work in this area by Burns et al. (2023) suggests that Language Models (LMs) pretrained on internet-scale corpora exhibit an inductive bias toward producing correct answers, even when finetuned on error-prone labels produced by a smaller language model. This suggests that massive pretraining combined with finetuning on imperfect human labels may be a solid baseline method for scalable oversight. In the real world, however, label quality is not fixed: practitioners face a quantity-quality tradeoff when generating finetuning data. In this paper, we explore the microeconomics of the quantity-quality tradeoff on binary NLP classification tasks used in Burns et al. (2023). We find that there are three regimes of eliciting classification knowledge from pretrained models using supervised finetuning: quantity-dominant, quality-dominant, and a mixed regime involving the use of low- and high-quality data together to attain higher accuracy at a lower cost than using either alone. We explore sample-efficient elicitation methods that make use of two datasets of differing qualities, and establish a Pareto frontier of scalable elicitation methods that optimally trade off labeling cost and classifier performance.

[LG-109] LLMOPT: Learning to Define and Solve General Optimization Problems from Scratch

[LG-110] AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations

[LG-111] Estimating the Probabilities of Rare Outputs in Language Models

[LG-112] TabSeq: A Framework for Deep Learning on Tabular Data via Sequential Ordering ICPR2024

[LG-113] Meta-DiffuB: A Contextualized Sequence-to-Sequence Text Diffusion Model with Meta-Exploration

[LG-114] Context-Enhanced Multi-View Trajectory Representation Learning: Bridging the Gap through Self-Supervised Models

[LG-115] Golyadkin's Torment: Doppelgängers and Adversarial Vulnerability

[LG-116] CohEx: A Generalized Framework for Cohort Explanation

[LG-117] EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning

[LG-118] GeSubNet: Gene Interaction Inference for Disease Subtype Network Generation ICLR2025

[LG-119] TCP-Diffusion: A Multi-modal Diffusion Model for Global Tropical Cyclone Precipitation Forecasting with Change Awareness

[LG-120] An Evolved Universal Transformer Memory

Link: https://arxiv.org/abs/2410.13166
Authors: Edoardo Cetin, Qi Sun, Tianyu Zhao, Yujin Tang
Keywords: Prior methods propose, dropping specific parts, modern foundation models, Prior methods, Neural Attention Memory
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 29 pages, 14 figures. Preprint, under submission. Source code is available at this https URL

[LG-121] Data Driven Environmental Awareness Using Wireless Signals for Efficient Spectrum Sharing

Link: https://arxiv.org/abs/2410.13159
Authors: Hossein Nasiri, Seda Dogan-Tusha, Muhammad Iqbal Rochman, Monisha Ghosh
Keywords: wireless network optimization, devices, outdoor, indoor, outdoor devices
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)

Abstract: Robust classification of the operational environment of wireless devices is becoming increasingly important for wireless network optimization, particularly in a shared spectrum environment. Distinguishing between indoor and outdoor devices can enhance reliability and improve coexistence with existing outdoor incumbents. For instance, the unlicensed but shared 6 GHz band (5.925 - 7.125 GHz) enables sharing by imposing lower transmit power for indoor unlicensed devices and a spectrum coordination requirement for outdoor devices. Further, indoor devices are prohibited from using battery power, external antennas, and weatherization to prevent outdoor operations. As these rules may be circumvented, we propose a robust indoor/outdoor classification method by leveraging the fact that the radio-frequency environment faced by a device is quite different indoors and outdoors. We first collect signal strength data from all cellular and Wi-Fi bands that can be received by a smartphone in various environments (indoor interior, indoor near windows, and outdoors), along with GPS accuracy, and then evaluate three machine learning (ML) methods: deep neural network (DNN), decision tree, and random forest to perform classification into these three categories. Our results indicate that the DNN model performs the best, particularly in minimizing the most important classification error, that of classifying outdoor devices as indoor interior devices.

[LG-122] Utilizing Large Language Models in An Iterative Paradigm with Domain Feedback for Molecule Optimization

[LG-123] Federated scientific machine learning for approximating functions and solving differential equations with data heterogeneity

Link: https://arxiv.org/abs/2410.13141
Authors: Handi Zhang, Langchen Liu, Lu Lu
Keywords: scientific machine learning, data, partial differential equations, leveraging neural networks, emerging field
Categories: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)

Abstract:By leveraging neural networks, the emerging field of scientific machine learning (SciML) offers novel approaches to address complex problems governed by partial differential equations (PDEs). In practical applications, challenges arise due to the distributed essence of data, concerns about data privacy, or the impracticality of transferring large volumes of data. Federated learning (FL), a decentralized framework that enables the collaborative training of a global model while preserving data privacy, offers a solution to the challenges posed by isolated data pools and sensitive data issues. Here, this paper explores the integration of FL and SciML to approximate complex functions and solve differential equations. We propose two novel models: federated physics-informed neural networks (FedPINN) and federated deep operator networks (FedDeepONet). We further introduce various data generation methods to control the degree of non-independent and identically distributed (non-iid) data and utilize the 1-Wasserstein distance to quantify data heterogeneity in function approximation and PDE learning. We systematically investigate the relationship between data heterogeneity and federated model performance. Additionally, we propose a measure of weight divergence and develop a theoretical framework to establish growth bounds for weight divergence in federated learning compared to traditional centralized learning. To demonstrate the effectiveness of our methods, we conducted 10 experiments, including 2 on function approximation, 5 PDE problems on FedPINN, and 3 PDE problems on FedDeepONet. These experiments demonstrate that proposed federated methods surpass the models trained only using local data and achieve competitive accuracy of centralized models trained using all data.
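As a rough illustration of how the 1-Wasserstein distance can quantify heterogeneity between two clients' data, the sketch below handles the simplest case: one-dimensional samples of equal size, where the distance reduces to the mean absolute difference between sorted samples. The function name and the toy client data are hypothetical, and the paper's own setup may differ.

```python
def wasserstein_1d(samples_a, samples_b):
    """1-Wasserstein distance between two equal-size 1-D empirical samples.

    In one dimension the optimal transport plan matches sorted samples in
    order, so the distance is the mean absolute gap between order statistics.
    """
    if len(samples_a) != len(samples_b):
        raise ValueError("this sketch assumes equal sample sizes")
    sorted_a = sorted(samples_a)
    sorted_b = sorted(samples_b)
    return sum(abs(u - v) for u, v in zip(sorted_a, sorted_b)) / len(sorted_a)


# Toy local datasets held by two hypothetical clients.
client_1 = [0.0, 1.0, 3.0]
client_2 = [5.0, 6.0, 8.0]
distance = wasserstein_1d(client_1, client_2)
print(distance)
```

A distance of zero means the two samples coincide as multisets. Larger values indicate clients whose local data are further from identically distributed.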

[LG-124] Boosting Imperceptibility of Stable Diffusion-based Adversarial Examples Generation with Momentum

[LG-125] Controllable Generation via Locally Constrained Resampling

Link: https://arxiv.org/abs/2410.13111
Authors: Kareem Ahmed, Kai-Wei Chang, Guy Van den Broeck
Keywords: natural language, demonstrated an unprecedented, unprecedented ability, ability at modeling, modeling the intricacies
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: arXiv admin note: text overlap with arXiv:2312.03905

[LG-126] Algorithmic Content Selection and the Impact of User Disengagement

Link: https://arxiv.org/abs/2410.13108
Authors: Emilio Calvano, Nika Haghtalab, Ellen Vitercik, Eric Zhao
Keywords: content selection problem, content selection, multiple rounds, content, selection problem
Categories: Machine Learning (cs.LG)

Abstract: The content selection problem of digital services is often modeled as a decision-process where a service chooses, over multiple rounds, an arm to pull from a set of arms that each return a certain reward. This classical model does not account for the possibility that users disengage when dissatisfied and thus fails to capture an important trade-off between choosing content that promotes future engagement versus immediate reward. In this work, we introduce a model for the content selection problem where dissatisfied users may disengage and where the content that maximizes immediate reward does not necessarily maximize the odds of future user engagement. We show that when each arm’s expected reward and effect on user satisfaction are linearly related, an optimal content selection policy can be computed efficiently with dynamic programming under natural assumptions about the complexity of the users’ engagement patterns. Moreover, we show that in an online learning setting where users with unknown engagement patterns arrive, there is a variant of Hedge that attains a \tfrac{1}{2}-competitive ratio regret bound. We also use our model to identify key primitives that determine how digital services should weigh engagement against revenue. For example, when it is more difficult for users to rejoin a service they are disengaged from, digital services naturally see a reduced payoff but user engagement may, counterintuitively, increase.

[LG-127] Cliqueformer: Model-Based Optimization with Structured Transformers

点击查看摘要

[LG-128] A Little Human Data Goes A Long Way

点击查看摘要

[LG-129] Communication-Efficient and Tensorized Federated Fine-Tuning of Large Language Models

点击查看摘要

[LG-130] Self-Comparison for Dataset-Level Membership Inference in Large (Vision-)Language Models

点击查看摘要

[LG-131] Reverse-Engineering the Reader

点击查看摘要

[LG-132] MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models

点击查看摘要

[LG-133] FedCAP: Robust Federated Learning via Customized Aggregation and Personalization ACSA

点击查看摘要

[LG-134] Two-Timescale Linear Stochastic Approximation: Constant Stepsizes Go a Long Way

链接: https://arxiv.org/abs/2410.13067
作者: Jeongyeol Kwon,Luke Dotson,Yudong Chen,Qiaomin Xie
关键词-EN: two-timescale stochastic approximation, stochastic approximation, diminishing stepsize schemes, beta, studies on two-timescale
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Previous studies on two-timescale stochastic approximation (SA) mainly focused on bounding mean-squared errors under diminishing stepsize schemes. In this work, we investigate constant stepsize schemes through the lens of Markov processes, proving that the iterates of both timescales converge to a unique joint stationary distribution in Wasserstein metric. We derive explicit geometric and non-asymptotic convergence rates, as well as the variance and bias introduced by constant stepsizes in the presence of Markovian noise. Specifically, with two constant stepsizes $\alpha < \beta$, we show that the biases scale linearly with both stepsizes as $\Theta(\alpha)+\Theta(\beta)$ up to higher-order terms, while the variance of the slower iterate (resp., faster iterate) scales only with its own stepsize as $O(\alpha)$ (resp., $O(\beta)$). Unlike previous work, our results require no additional assumptions such as $\beta^2 \ll \alpha$ nor extra dependence on dimensions. These fine-grained characterizations allow tail-averaging and extrapolation techniques to reduce variance and bias, improving the mean-squared error bound to $O(\beta^4 + \frac{1}{t})$ for both iterates.
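
To make the constant-stepsize, tail-averaged scheme described in this abstract concrete, here is a minimal sketch on a toy scalar problem. The toy system (fixed point x = y = 1), the function name `two_timescale_sa`, and the use of i.i.d. Gaussian noise are all assumptions made for illustration; the paper itself analyzes general linear systems under Markovian noise, and this is not the authors' code.

```python
import numpy as np


def two_timescale_sa(alpha=0.01, beta=0.1, n_steps=20000, noise_std=1.0, seed=0):
    """Toy two-timescale linear SA with constant stepsizes alpha < beta.

    Hypothetical example system (not from the paper):
      slow iterate x is driven toward the point where y == 1,
      fast iterate y tracks the current value of x.
    The unique fixed point is x = y = 1.
    Returns the tail averages (last half of the trajectory) of x and y.
    """
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0
    xs = np.empty(n_steps)
    ys = np.empty(n_steps)
    for t in range(n_steps):
        # i.i.d. noise is a simplification; the paper treats Markovian noise.
        noise_x, noise_y = noise_std * rng.standard_normal(2)
        x_next = x + alpha * ((1.0 - y) + noise_x)  # slow update, stepsize alpha
        y_next = y + beta * ((x - y) + noise_y)     # fast update, stepsize beta
        x, y = x_next, y_next
        xs[t], ys[t] = x, y
    tail = n_steps // 2  # tail-averaging: discard the first half as burn-in
    return xs[tail:].mean(), ys[tail:].mean()
```

Individual iterates keep fluctuating around the fixed point because the stepsizes never shrink; averaging over the tail of the trajectory is what suppresses that fluctuation in this sketch.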

[LG-135] Optimal Transport for Probabilistic Circuits

点击查看摘要

[LG-136] AERO: Softmax-Only LLMs for Efficient Private Inference

链接: https://arxiv.org/abs/2410.13060
作者: Nandan Kumar Jha,Brandon Reagen
关键词-EN: users’ sensitive data, raised privacy concerns, proprietary language models, private inference, sensitive data
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 35 pages, 21 figures, and 9 tables. arXiv admin note: text overlap with arXiv:2410.09637

点击查看摘要

Abstract:The pervasiveness of proprietary language models has raised privacy concerns for users’ sensitive data, emphasizing the need for private inference (PI), where inference is performed directly on encrypted inputs. However, current PI methods face prohibitively high communication and latency overheads, primarily due to nonlinear operations. In this paper, we present a comprehensive analysis to understand the role of nonlinearities in transformer-based decoder-only language models. We introduce AERO, a four-step architectural optimization framework that refines the existing LLM architecture for efficient PI by systematically removing nonlinearities such as LayerNorm and GELU and reducing FLOPs counts. For the first time, we propose a Softmax-only architecture with significantly fewer FLOPs tailored for efficient PI. Furthermore, we devise a novel entropy regularization technique to improve the performance of Softmax-only models. AERO achieves up to 4.23$\times$ communication and 1.94$\times$ latency reduction. We validate the effectiveness of AERO by benchmarking it against the state-of-the-art.

[LG-137] Systems with Switching Causal Relations: A Meta-Causal Perspective

点击查看摘要

[LG-138] Supply Chain Network Extraction and Entity Classification Leveraging Large Language Models

点击查看摘要

[LG-139] FedGTST: Boosting Global Transferability of Federated Models via Statistics Tuning

点击查看摘要

[LG-140] Hypothesis Testing the Circuit Hypothesis in LLMs

点击查看摘要

[LG-141] Sensitivity of Generative VLMs to Semantically and Lexically Altered Prompts

点击查看摘要

[LG-142] When Not to Answer: Evaluating Prompts on GPT Models for Effective Abstention in Unanswerable Math Word Problems

点击查看摘要

[LG-143] Geometric Trajectory Diffusion Models NEURIPS2024

点击查看摘要

[LG-144] LoRA Soups: Merging LoRAs for Practical Skill Composition Tasks

点击查看摘要

[LG-145] Learning Representations for Reasoning: Generalizing Across Diverse Structures

点击查看摘要

[LG-146] LEGAL-UQA: A Low-Resource Urdu-English Dataset for Legal Question Answering

点击查看摘要

[LG-147] Sample Compression Scheme Reductions

链接: https://arxiv.org/abs/2410.13012
作者: Idan Attias,Steve Hanneke,Arvind Ramaswami
关键词-EN: compression scheme, binary compression scheme, compression, mathrm, binary compression
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We present novel reductions from sample compression schemes in multiclass classification, regression, and adversarially robust learning settings to binary sample compression schemes. Assuming we have a compression scheme for binary classes of size $f(d_{\mathrm{VC}})$, where $d_{\mathrm{VC}}$ is the VC dimension, then we have the following results: (1) If the binary compression scheme is a majority-vote or a stable compression scheme, then there exists a multiclass compression scheme of size $O(f(d_{\mathrm{G}}))$, where $d_{\mathrm{G}}$ is the graph dimension. Moreover, for general binary compression schemes, we obtain a compression of size $O(f(d_{\mathrm{G}})\log|Y|)$, where $Y$ is the label space. (2) If the binary compression scheme is a majority-vote or a stable compression scheme, then there exists an $\epsilon$-approximate compression scheme for regression over $[0,1]$-valued functions of size $O(f(d_{\mathrm{P}}))$, where $d_{\mathrm{P}}$ is the pseudo-dimension. For general binary compression schemes, we obtain a compression of size $O(f(d_{\mathrm{P}})\log(1/\epsilon))$. These results would have significant implications if the sample compression conjecture, which posits that any binary concept class with a finite VC dimension admits a binary compression scheme of size $O(d_{\mathrm{VC}})$, is resolved (Littlestone and Warmuth, 1986; Floyd and Warmuth, 1995; Warmuth, 2003). Our results would then extend the proof of the conjecture immediately to other settings. We establish similar results for adversarially robust learning and also provide an example of a concept class that is robustly learnable but has no bounded-size compression scheme, demonstrating that learnability is not equivalent to having a compression scheme independent of the sample size, unlike in binary classification, where compression of size $2^{O(d_{\mathrm{VC}})}$ is attainable (Moran and Yehudayoff, 2016).

[LG-148] Hiding-in-Plain-Sight (HiPS) Attack on CLIP for Targetted Object Removal from Images NEURIPS2024

点击查看摘要

[LG-149] LLM Chain Ensembles for Scalable and Accurate Data Annotation

链接: https://arxiv.org/abs/2410.13006
作者: David Farr,Nico Manzonelli,Iain Cruickshank,Kate Starbird,Jevin West
关键词-EN: rapidly evolving domains, perform zero-shot classification, zero-shot classification makes, quality labeled data, large language models
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:The ability of large language models (LLMs) to perform zero-shot classification makes them viable solutions for data annotation in rapidly evolving domains where quality labeled data is often scarce and costly to obtain. However, the large-scale deployment of LLMs can be prohibitively expensive. This paper introduces an LLM chain ensemble methodology that aligns multiple LLMs in a sequence, routing data subsets to subsequent models based on classification uncertainty. This approach leverages the strengths of individual LLMs within a broader system, allowing each model to handle data points where it exhibits the highest confidence, while forwarding more complex cases to potentially more robust models. Our results show that the chain ensemble method often exceeds the performance of the best individual model in the chain and achieves substantial cost savings, making LLM chain ensembles a practical and efficient solution for large-scale data annotation challenges.
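
The routing idea in this abstract can be sketched as follows. The abstract does not spell out the exact routing rule, so the rule used here (each link keeps the most confident fraction of what it sees and forwards the rest) is an assumption, as are the names `chain_ensemble`, `Classifier`, and `keep_fraction`. The classifiers are plain callables standing in for LLM calls.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical interface: a classifier maps text to (label, confidence in [0, 1]).
Classifier = Callable[[str], Tuple[str, float]]


def chain_ensemble(
    items: Sequence[str],
    chain: Sequence[Classifier],
    keep_fraction: float = 0.5,
) -> List[Optional[str]]:
    """Label items with a chain of classifiers, cheapest first.

    Each non-final link keeps its most confident `keep_fraction` of the
    items it sees and forwards the rest; the final link labels whatever
    remains. This routing rule is an illustrative assumption.
    """
    labels: List[Optional[str]] = [None] * len(items)
    remaining = list(range(len(items)))
    for link_index, model in enumerate(chain):
        if not remaining:
            break
        preds = {i: model(items[i]) for i in remaining}
        if link_index == len(chain) - 1:
            keep = remaining  # last link must label everything left
        else:
            # Sort by confidence, highest first (stable for ties).
            ranked = sorted(remaining, key=lambda i: preds[i][1], reverse=True)
            keep = ranked[: int(len(ranked) * keep_fraction)]
        for i in keep:
            labels[i] = preds[i][0]
        kept = set(keep)
        remaining = [i for i in remaining if i not in kept]
    return labels
```

With a cheap first link and an expensive last link, the expensive model only sees the items the cheap model was least sure about, which is where the cost savings in such a design would come from.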

[LG-150] SSET: Swapping-Sliding Explanation for Time Series Classifiers in Affect Detection

点击查看摘要

[LG-151] Double-Bayesian Learning

链接: https://arxiv.org/abs/2410.12984
作者: Stefan Jaeger
关键词-EN: Contemporary machine learning, Contemporary machine, Bayes error, machine learning methods, model can achieve
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 14 pages, 5 figures, draft

点击查看摘要

Abstract:Contemporary machine learning methods will try to approach the Bayes error, as it is the lowest possible error any model can achieve. This paper postulates that any decision is composed of not one but two Bayesian decisions and that decision-making is, therefore, a double-Bayesian process. The paper shows how this duality implies intrinsic uncertainty in decisions and how it incorporates explainability. The proposed approach understands that Bayesian learning is tantamount to finding a base for a logarithmic function measuring uncertainty, with solutions being fixed points. Furthermore, following this approach, the golden ratio describes possible solutions satisfying Bayes’ theorem. The double-Bayesian framework suggests using a learning rate and momentum weight with values similar to those used in the literature to train neural networks with stochastic gradient descent.

[LG-152] Reinforcement Learning with Euclidean Data Augmentation for State-Based Continuous Control

点击查看摘要

[LG-153] Flash Inference: Near Linear Time Inference for Long Convolution Sequence Models and Beyond

点击查看摘要

[LG-154] Long-Tailed Backdoor Attack Using Dynamic Data Augmentation Operations

点击查看摘要

[LG-155] A Note on Shumailov et al. (2024): 'AI Models Collapse When Trained on Recursively Generated Data'

点击查看摘要

[LG-156] Syn2Real Domain Generalization for Underwater Mine-like Object Detection Using Side-Scan Sonar

点击查看摘要

[LG-157] Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization

点击查看摘要

[LG-158] Multi-modal graph neural networks for localized off-grid weather forecasting

链接: https://arxiv.org/abs/2410.12938
作者: Qidong Yang,Jonathan Giezendanner,Daniel Salles Civitarese,Johannes Jakubik,Eric Schmitt,Anirban Chandra,Jeremy Vila,Detlef Hohl,Chris Hill,Campbell Watson,Sherrie Wang
关键词-EN: generation require precise, renewable energy generation, energy generation require, Earth surface, Urgent applications
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:Urgent applications like wildfire management and renewable energy generation require precise, localized weather forecasts near the Earth’s surface. However, weather forecast products from machine learning or numerical weather models are currently generated on a global regular grid, on which a naive interpolation cannot accurately reflect fine-grained weather patterns close to the ground. In this work, we train a heterogeneous graph neural network (GNN) end-to-end to downscale gridded forecasts to off-grid locations of interest. This multi-modal GNN takes advantage of local historical weather observations (e.g., wind, temperature) to correct the gridded weather forecast at different lead times towards locally accurate forecasts. Each data modality is modeled as a different type of node in the graph. Using message passing, the node at the prediction location aggregates information from its heterogeneous neighbor nodes. Experiments using weather stations across the Northeastern United States show that our model outperforms a range of data-driven and non-data-driven off-grid forecasting methods. Our approach demonstrates how the gap between global large-scale weather models and locally accurate predictions can be bridged to inform localized decision-making.

[LG-159] Merge to Learn: Efficiently Adding Skills to Language Models with Model Merging EMNLP2024

点击查看摘要

[LG-160] SoK: On Finding Common Ground in Loss Landscapes Using Deep Model Merging Techniques

点击查看摘要

[LG-161] Fair Clustering for Data Summarization: Improved Approximation Algorithms and Complexity Insights

点击查看摘要

[LG-162] AT-RAG: An Adaptive RAG Model Enhancing Query Efficiency with Topic Filtering and Iterative Reasoning

点击查看摘要

[LG-163] Scaling Laws for Multilingual Language Models

点击查看摘要

[LG-164] Towards More Effective Table-to-Text Generation: Assessing In-Context Learning and Self-Evaluation with Open-Source Models

点击查看摘要

[LG-165] Improving Instruction-Following in Language Models through Activation Steering

点击查看摘要

[LG-166] In-context KV-Cache Eviction for LLMs via Attention-Gate

点击查看摘要

[LG-167] On Debiasing Text Embeddings Through Context Injection

点击查看摘要

[LG-168] Beyond Right and Wrong: Mitigating Cold Start in Knowledge Tracing Using Large Language Model and Option Weight

点击查看摘要

[LG-169] Skill Learning Using Process Mining for Large Language Model Plan Generation

链接: https://arxiv.org/abs/2410.12870
作者: Andrei Cosmin Redis,Mohammadreza Fani Sani,Bahram Zarrin,Andrea Burattin
关键词-EN: Large language models, Large language, control flow models, hold promise, complex tasks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: 12 pages, 5 figures, 2 tables, accepted at ICPM 2024

点击查看摘要

[LG-170] Language Model Preference Evaluation with Multiple Weak Evaluators

点击查看摘要

[LG-171] IMAS: A Comprehensive Agentic Approach to Rural Healthcare Delivery

点击查看摘要

[LG-172] Towards Homogeneous Lexical Tone Decoding from Heterogeneous Intracranial Recordings

点击查看摘要

[LG-173] ELF-Gym: Evaluating Large Language Models Generated Features for Tabular Prediction

点击查看摘要

[LG-174] Diversity of Thought Elicits Stronger Reasoning Capabilities in Multi-Agent Debate Frameworks

点击查看摘要

[LG-175] The Large Language Model GreekLegalRoBERTa

点击查看摘要

[LG-176] RecurFormer: Not All Transformer Heads Need Self-Attention

点击查看摘要

[LG-177] TextLap: Customizing Language Models for Text-to-Layout Planning EMNLP

点击查看摘要

[LG-178] UniAutoML: A Human-Centered Framework for Unified Discriminative and Generative AutoML with Large Language Models

点击查看摘要

[LG-179] Answering Questions in Stages: Prompt Chaining for Contract QA

点击查看摘要

[LG-180] Generative Reward Models

链接: https://arxiv.org/abs/2410.12832
作者: Dakota Mahan,Duy Van Phung,Rafael Rafailov,Chase Blagden,Nathan Lile,Louis Castricato,Jan-Philipp Fränken,Chelsea Finn,Alon Albalak
关键词-EN: modern Large Language, Large Language Models, Large Language, Reinforcement Learning, Language Models
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback (RLHF) has greatly improved the performance of modern Large Language Models (LLMs). The RLHF process is resource-intensive and technically challenging, generally requiring a large collection of human preference labels over model-generated outputs. Reinforcement Learning from AI Feedback (RLAIF) addresses this data collection challenge by leveraging synthetic preferences generated by an LLM. However, recent work has shown that synthetic preference labels may not align well with human preference judgments. To address this, we propose a hybrid approach that unifies RLHF and RLAIF methodologies. We introduce GenRM, an iterative algorithm that trains an LLM on self-generated reasoning traces, leading to synthetic preference labels matching human preference judgments. Empirically, we show that zero-shot LLM-based judgments under-perform compared to Bradley-Terry reward models on in-distribution tasks (between 9-36%). In contrast, GenRM achieves in-distribution accuracy comparable to Bradley-Terry models, while significantly outperforming them on out-of-distribution tasks (between 10-45%). Moreover, GenRM surpasses the performance of using LLMs as judges on both in-distribution (by 9-31%) and out-of-distribution tasks (by 2-6%). Our results show that combining the strengths of RLHF and RLAIF offers a promising approach for improving the quality of synthetic preference labels.
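
For readers unfamiliar with the Bradley-Terry reward models used as the baseline here, the sketch below shows the standard Bradley-Terry preference probability and the corresponding pairwise loss for two scalar rewards. It illustrates only the baseline formulation, not GenRM itself; the function names are chosen for this example.

```python
import math


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Under Bradley-Terry, P(chosen > rejected) = sigmoid(r_chosen - r_rejected),
    so the loss is softplus(-(r_chosen - r_rejected)), computed stably here.
    """
    margin = reward_chosen - reward_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def bradley_terry_prob(reward_chosen: float, reward_rejected: float) -> float:
    """Probability that the chosen response is preferred."""
    return math.exp(-bradley_terry_loss(reward_chosen, reward_rejected))
```

Only the reward difference matters: adding a constant to both rewards leaves the preference probability unchanged.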

[LG-181] GCM-Net: Graph-enhanced Cross-Modal Infusion with a Metaheuristic-Driven Network for Video Sentiment and Emotion Analysis

点击查看摘要

[LG-182] AVID: Adapting Video Diffusion Models to World Models

点击查看摘要

[LG-183] A transformer-based deep reinforcement learning approach to spatial navigation in a partially observable Morris Water Maze

点击查看摘要

[LG-184] Rethinking Misalignment in Vision-Language Model Adaptation from a Causal Perspective NEURIPS2024

点击查看摘要

[LG-185] Hip Fracture Patient Pathways and Agent-based Modelling

链接: https://arxiv.org/abs/2410.12804
作者: Alison N. O’Connor,Stephen E. Ryan,Gauri Vaidya,Paul Harford,Meghana Kshirsagar
关键词-EN: straining European services, significantly straining European, Increased healthcare demand, European services, straining European
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 4 pages, 2 figures

点击查看摘要

Abstract:Increased healthcare demand is significantly straining European services. Digital solutions including advanced modelling techniques offer a promising solution to optimising patient flow without impacting day-to-day healthcare provision. In this work we outline an ongoing project that aims to optimise healthcare resources using agent-based simulations.

[LG-186] Developing Guidelines for Functionally-Grounded Evaluation of Explainable Artificial Intelligence using Tabular Data

链接: https://arxiv.org/abs/2410.12803
作者: Mythreyi Velmurugan,Chun Ouyang,Yue Xu,Renuka Sindhgatta,Bemali Wickramanayake,Catarina Moreira
关键词-EN: Explainable Artificial Intelligence, Explainable Artificial, Artificial Intelligence, opaque predictive models, XAI techniques
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Explainable Artificial Intelligence (XAI) techniques are used to provide transparency to complex, opaque predictive models. However, these techniques are often designed for image and text data, and it is unclear how fit-for-purpose they are when applied to tabular data. As XAI techniques are rarely evaluated in settings with tabular data, the applicability of existing evaluation criteria and methods is also unclear and needs (re-)examination. For example, some works suggest that evaluation methods may unduly influence the evaluation results when using tabular data. This lack of clarity on evaluation procedures can lead to reduced transparency and ineffective use of XAI techniques in real world settings. In this study, we examine literature on XAI evaluation to derive guidelines on functionally-grounded assessment of local, post hoc XAI techniques. We identify 20 evaluation criteria and associated evaluation methods, and derive guidelines on when and how each criterion should be evaluated. We also identify key research gaps to be addressed by future work. Our study contributes to the body of knowledge on XAI evaluation through in-depth examination of functionally-grounded XAI evaluation protocols, and has laid the groundwork for future research on XAI evaluation.

[LG-187] Ads Supply Personalization via Doubly Robust Learning CIKM’24

链接: https://arxiv.org/abs/2410.12799
作者: Wei Shi,Chen Fu,Qi Xu,Sanjian Chen,Jizhe Zhang,Qinqin Zhu,Zhigang Hua,Shuang Yang
关键词-EN: Ads supply personalization, supply personalization aims, ads supply lies, Ads supply, user engagement
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: Accepted by CIKM’24

点击查看摘要

Abstract:Ads supply personalization aims to balance revenue and user engagement, two long-term objectives in social media ads, by tailoring the ad quantity and density. In an industry-scale system, the challenge for ads supply lies in modeling the counterfactual effects of a conservative supply treatment (e.g., a small density change) over an extended duration. In this paper, we present a streamlined framework for personalized ad supply. This framework optimally utilizes information from data collection policies through doubly robust learning. Consequently, it significantly improves the accuracy of long-term treatment effect estimates. Additionally, its low-complexity design not only results in computational cost savings compared to existing methods, but also makes it scalable for billion-scale applications. Through both offline experiments and online production tests, the framework consistently demonstrated significant improvements in top-line business metrics over months. The framework has been fully deployed to live traffic in one of the world’s largest social media platforms.

[LG-188] Data-adaptive Differentially Private Prompt Synthesis for In-Context Learning

点击查看摘要

[LG-189] Predicting the Geolocation of Tweets Using transformer models on Customized Data

点击查看摘要

[LG-190] From Gradient Clipping to Normalization for Heavy Tailed SGD

链接: https://arxiv.org/abs/2410.13849
作者: Florian Hübler,Ilyas Fatkhullin,Niao He
关键词-EN: Recent empirical evidence, machine learning applications, learning applications involve, Recent empirical, applications involve heavy-tailed
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Recent empirical evidence indicates that many machine learning applications involve heavy-tailed gradient noise, which challenges the standard assumptions of bounded variance in stochastic optimization. Gradient clipping has emerged as a popular tool to handle this heavy-tailed noise, as it achieves good performance in this setting both theoretically and practically. However, our current theoretical understanding of non-convex gradient clipping has three main shortcomings. First, the theory hinges on large, increasing clipping thresholds, which are in stark contrast to the small constant clipping thresholds employed in practice. Second, clipping thresholds require knowledge of problem-dependent parameters to guarantee convergence. Lastly, even with this knowledge, current sampling complexity upper bounds for the method are sub-optimal in nearly all parameters. To address these issues, we study convergence of Normalized SGD (NSGD). First, we establish a parameter-free sample complexity for NSGD of $\mathcal{O}\left(\varepsilon^{-\frac{2p}{p-1}}\right)$ to find an $\varepsilon$-stationary point. Furthermore, we prove tightness of this result, by providing a matching algorithm-specific lower bound. In the setting where all problem parameters are known, we show this complexity is improved to $\mathcal{O}\left(\varepsilon^{-\frac{3p-2}{p-1}}\right)$, matching the previously known lower bound for all first-order methods in all problem dependent parameters. Finally, we establish high-probability convergence of NSGD with a mild logarithmic dependence on the failure probability. Our work complements the studies of gradient clipping under heavy-tailed noise, improving the sample complexities of existing algorithms and offering an alternative mechanism to achieve high-probability convergence.
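
As a rough illustration of why normalization is a natural alternative to clipping, the sketch below implements a plain normalized gradient step: each update moves a fixed distance `step_size` in the direction of the (possibly noisy) gradient, so a single heavy-tailed gradient sample cannot cause an arbitrarily large step. The function name, the constant stepsize, and the absence of momentum or mini-batching are assumptions for this example; the exact NSGD variant and stepsize schedules analyzed in the paper may differ.

```python
import numpy as np


def normalized_sgd(grad_fn, x0, step_size=0.1, n_steps=200, eps=1e-12):
    """Plain normalized (S)GD: x <- x - step_size * g / ||g||.

    grad_fn(x) returns a (possibly stochastic) gradient estimate.
    eps only guards against division by zero when the gradient vanishes.
    """
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_steps):
        g = np.asarray(grad_fn(x), dtype=float)
        x = x - step_size * g / (np.linalg.norm(g) + eps)
    return x


def heavy_tailed_quadratic_grad(rng):
    """Gradient oracle for 0.5 * ||x||^2 with Student-t noise (1.5 degrees of
    freedom, i.e. infinite variance). A hypothetical test problem."""
    return lambda x: x + rng.standard_t(df=1.5, size=x.shape)
```

For example, `normalized_sgd(heavy_tailed_quadratic_grad(np.random.default_rng(0)), [3.0, 4.0])` runs the method under infinite-variance noise; unlike clipping, no threshold has to be tuned.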

[LG-191] Discrete distributions are learnable from metastable samples

链接: https://arxiv.org/abs/2410.13800
作者: Abhijith Jayakumar,Andrey Y. Lokhov,Sidhant Misra,Marc Vuffray
关键词-EN: Markov chain samplers, chain samplers designed, Markov chain, reversible Markov chain, metastable distribution
类目: Machine Learning (stat.ML); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*备注: Preliminary version, 26 pages

点击查看摘要

Abstract:Markov chain samplers designed to sample from multi-variable distributions often undesirably get stuck in specific regions of their state space. This causes such samplers to approximately sample from a metastable distribution which is usually quite different from the desired, stationary distribution of the chain. We show that single-variable conditionals of metastable distributions of reversible Markov chain samplers that satisfy a strong metastability condition are on average very close to those of the true distribution. This holds even when the metastable distribution is far away from the true model in terms of global metrics like Kullback-Leibler divergence or total variation distance. This property allows us to learn the true model using a conditional likelihood based estimator, even when the samples come from a metastable distribution concentrated in a small region of the state space. Explicit examples of such metastable states can be constructed from regions that effectively bottleneck the probability flow and cause poor mixing of the Markov chain. For specific cases of binary pairwise undirected graphical models, we extend our results to further rigorously show that data coming from metastable states can be used to learn the parameters of the energy function and recover the structure of the model.

[LG-192] Machine-Learning Analysis of Radiative Decays to Dark Matter at the LHC

链接: https://arxiv.org/abs/2410.13799
作者: Ernesto Arganda,Marcela Carena,Martín de los Rios,Andres D. Perez,Duncan Rocha,Rosa M. Sandá Seoane,Carlos E. M. Wagner
关键词-EN: Large Hadron Collider, High Luminosity Large, Luminosity Large Hadron, Hadron Collider, weakly interacting matter
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注: 32 pages, 9 figures, 3 tables, 4 appendices

点击查看摘要

Abstract:The search for weakly interacting matter particles (WIMPs) is one of the main objectives of the High Luminosity Large Hadron Collider (HL-LHC). In this work we use Machine Learning (ML) techniques to explore WIMP radiative decays into a Dark Matter (DM) candidate in a supersymmetric framework. The minimal supersymmetric WIMP sector includes the lightest neutralino that can provide the observed DM relic density through its co-annihilation with the second lightest neutralino and lightest chargino. Moreover, the direct DM detection cross section rates fulfill current experimental bounds and provide discovery targets for the same region of model parameters in which the radiative decay of the second lightest neutralino into a photon and the lightest neutralino is enhanced. This strongly motivates the search for radiatively decaying neutralinos which, however, suffers from strong backgrounds. We investigate the LHC reach in the search for these radiatively decaying particles by means of cut-based and ML methods and estimate its discovery potential in this well-motivated, new physics scenario.

[LG-193] Probing the Latent Hierarchical Structure of Data via Diffusion Models

链接: https://arxiv.org/abs/2410.13770
作者: Antonio Sclocchi,Alessandro Favero,Noam Itzhak Levi,Matthieu Wyart
关键词-EN: High-dimensional data, highly structured, data, High-dimensional, models
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*备注: 11 pages, 6 figures

点击查看摘要

Abstract:High-dimensional data must be highly structured to be learnable. Although the compositional and hierarchical nature of data is often put forward to explain learnability, quantitative measurements establishing these properties are scarce. Likewise, accessing the latent variables underlying such a data structure remains a challenge. In this work, we show that forward-backward experiments in diffusion-based models, where data is noised and then denoised to generate new samples, are a promising tool to probe the latent structure of data. We predict in simple hierarchical models that, in this process, changes in data occur by correlated chunks, with a length scale that diverges at a noise level where a phase transition is known to take place. Remarkably, we confirm this prediction in both text and image datasets using state-of-the-art diffusion models. Our results show how latent variable changes manifest in the data and establish how to measure these effects in real data using diffusion models.

[LG-194] Improved Convergence Rate for Diffusion Probabilistic Models

链接: https://arxiv.org/abs/2410.13738
作者: Gen Li,Yuchen Jiao
关键词-EN: Score-based diffusion models, remarkable empirical performance, Score-based diffusion, achieved remarkable empirical, remarkable empirical
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 20 pages

点击查看摘要

Abstract:Score-based diffusion models have achieved remarkable empirical performance in the field of machine learning and artificial intelligence for their ability to generate high-quality new data instances from complex distributions. Improving our understanding of diffusion models, including mainly convergence analysis for such models, has attracted a lot of interest. Despite many theoretical attempts, there still exists a significant gap between theory and practice. To close this gap, we establish an iteration complexity at the order of $d^{1/3}\varepsilon^{-2/3}$, which is better than $d^{5/12}\varepsilon^{-1}$, the best known complexity achieved before our work. This convergence analysis is based on a randomized midpoint method, which was first proposed for log-concave sampling (Shen and Lee, 2019), and then extended to diffusion models by Gupta et al. (2024). Our theory accommodates $\varepsilon$-accurate score estimates, and does not require log-concavity on the target distribution. Moreover, the algorithm can also be parallelized to run in only $O(\log^2(d/\varepsilon))$ parallel rounds in a similar way to prior works.

[LG-195] Ab initio nonparametric variable selection for scalable Symbolic Regression with large p

链接: https://arxiv.org/abs/2410.13681
作者: Shengbin Ye,Meng Li
关键词-EN: gaining increasing attention, discovering symbolic expressions, characterize nonlinear relationships, discovering symbolic, relationships in data
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Symbolic regression (SR) is a powerful technique for discovering symbolic expressions that characterize nonlinear relationships in data, gaining increasing attention for its interpretability, compactness, and robustness. However, existing SR methods do not scale to datasets with a large number of input variables (referred to as extreme-scale SR), which are common in modern scientific applications. This "large $p$" setting, often accompanied by measurement error, leads to slow performance of SR methods and overly complex expressions that are difficult to interpret. To address this scalability challenge, we propose a method called PAN+SR, which combines a key idea of ab initio nonparametric variable selection with SR to efficiently pre-screen large input spaces and reduce search complexity while maintaining accuracy. The use of nonparametric methods eliminates model misspecification, supporting a strategy called parametric-assisted nonparametric (PAN). We also extend SRBench, an open-source benchmarking platform, by incorporating high-dimensional regression problems with various signal-to-noise ratios. Our results demonstrate that PAN+SR consistently enhances the performance of 17 contemporary SR methods, enabling several to achieve state-of-the-art performance on these challenging datasets.

[LG-196] Learning Counterfactual Distributions via Kernel Nearest Neighbors

链接: https://arxiv.org/abs/2410.13381
作者: Kyuseong Choi,Jacob Feitelberg,Anish Agarwal,Raaz Dwivedi
关键词-EN: mobile app version, user weekly spend, specific mobile app, geographic locations, multiple units
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 33 pages, 2 figures

点击查看摘要

Abstract:Consider a setting with multiple units (e.g., individuals, cohorts, geographic locations) and outcomes (e.g., treatments, times, items), where the goal is to learn a multivariate distribution for each unit-outcome entry, such as the distribution of a user’s weekly spend and engagement under a specific mobile app version. A common challenge is the prevalence of missing not at random data, where observations are available only for certain unit-outcome combinations and the observation availability can be correlated with the properties of distributions themselves, i.e., there is unobserved confounding. An additional challenge is that for any observed unit-outcome entry, we only have a finite number of samples from the underlying distribution. We tackle these two challenges by casting the problem into a novel distributional matrix completion framework and introduce a kernel based distributional generalization of nearest neighbors to estimate the underlying distributions. By leveraging maximum mean discrepancies and a suitable factor model on the kernel mean embeddings of the underlying distributions, we establish consistent recovery of the underlying distributions even when data is missing not at random and positivity constraints are violated. Furthermore, we demonstrate that our nearest neighbors approach is robust to heteroscedastic noise, provided we have access to two or more measurements for the observed unit-outcome entries, a robustness not present in prior works on nearest neighbors with single measurements.

[LG-197] Active inference and deep generative modeling for cognitive ultrasound

[LG-198] A theoretical perspective on mode collapse in variational inference

Link: https://arxiv.org/abs/2410.13300
Authors: Roman Soletskyi,Marylou Gabrié,Bruno Loureiro
Keywords: expressive variational families, highly expressive variational, traditional Kullback-Leibler objective, yield suboptimal solutions, variational families
Categories: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments:

Abstract:While deep learning has expanded the possibilities for highly expressive variational families, the practical benefits of these tools for variational inference (VI) are often limited by the minimization of the traditional Kullback-Leibler objective, which can yield suboptimal solutions. A major challenge in this context is mode collapse: the phenomenon where a model concentrates on a few modes of the target distribution during training, despite being statistically capable of expressing them all. In this work, we carry out a theoretical investigation of mode collapse for the gradient flow on Gaussian mixture models. We identify the key low-dimensional statistics characterizing the flow, and derive a closed set of low-dimensional equations governing their evolution. Leveraging this compact description, we show that mode collapse is present even in statistically favorable scenarios, and identify two key mechanisms driving it: mean alignment and vanishing weight. Our theoretical findings are consistent with the implementation of VI using normalizing flows, a class of popular generative models, thereby offering practical insights.

[LG-199] Scalable Drift Monitoring in Medical Imaging AI

[LG-200] L1-Regularized ICA: A Novel Method for Analysis of Task-related fMRI Data

Link: https://arxiv.org/abs/2410.13171
Authors: Yusuke Endo,Koujin Takeda
Keywords: independent component analysis, component analysis, independent component, order to extract, ICA
Categories: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: 29 pages, 9 figures, 4 tables. Python code is available. Please contact the corresponding author for the code

Abstract:We propose a new method of independent component analysis (ICA) in order to extract appropriate features from high-dimensional data. In general, matrix factorization methods including ICA have a problem regarding the interpretability of extracted features. For the improvement of interpretability, it is considered that sparse constraint on a factorized matrix is helpful. With this background, we construct a new ICA method with sparsity. In our method, the L1-regularization term is added to the cost function of ICA, and minimization of the cost function is performed by difference of convex functions algorithm. For the validity of our proposed method, we apply it to synthetic data and real functional magnetic resonance imaging data.

[LG-201] Continuous normalizing flows for lattice gauge theories

Link: https://arxiv.org/abs/2410.13161
Authors: Mathis Gerdes,Pim de Haan,Roberto Bondesan,Miranda C. N. Cheng
Keywords: Continuous normalizing flows, Continuous normalizing, general continuous normalizing, expressive and flexible, normalizing flow architecture
Categories: High Energy Physics - Lattice (hep-lat); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); High Energy Physics - Theory (hep-th)
Comments:

Abstract:Continuous normalizing flows are known to be highly expressive and flexible, which allows for easier incorporation of large symmetries and makes them a powerful tool for sampling in lattice field theories. Building on previous work, we present a general continuous normalizing flow architecture for matrix Lie groups that is equivariant under group transformations. We apply this to lattice gauge theories in two dimensions as a proof-of-principle and demonstrate competitive performance, showing its potential as a tool for future lattice sampling tasks.

[LG-202] Learning Efficient Representations of Neutrino Telescope Events ICLR2025

Link: https://arxiv.org/abs/2410.13148
Authors: Felix J. Yu,Nicholas Kamp,Carlos A. Argüelles
Keywords: telescopes detect rare, detect rare interactions, Neutrino telescopes detect, detect rare, particles produced
Categories: High Energy Physics - Experiment (hep-ex); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: 10 pages, 6 figures. Submitted to ICLR 2025

Abstract:Neutrino telescopes detect rare interactions of particles produced in some of the most extreme environments in the Universe. This is accomplished by instrumenting a cubic-kilometer volume of naturally occurring transparent medium with light sensors. Given their substantial size and the high frequency of background interactions, these telescopes amass an enormous quantity of large variance, high-dimensional data. These attributes create substantial challenges for analyzing and reconstructing interactions, particularly when utilizing machine learning (ML) techniques. In this paper, we present a novel approach, called om2vec, that employs transformer-based variational autoencoders to efficiently represent neutrino telescope events by learning compact and descriptive latent representations. We demonstrate that these latent representations offer enhanced flexibility and improved computational efficiency, thereby facilitating downstream tasks in data analysis.

[LG-203] Distributional Matrix Completion via Nearest Neighbors in the Wasserstein Space

Link: https://arxiv.org/abs/2410.13112
Authors: Jacob Feitelberg,Kyuseong Choi,Anish Agarwal,Raaz Dwivedi
Keywords: unobserved matrix entries, sparsely observed matrix, introduce the problem, seek to impute, impute the true
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Abstract:We introduce the problem of distributional matrix completion: Given a sparsely observed matrix of empirical distributions, we seek to impute the true distributions associated with both observed and unobserved matrix entries. This is a generalization of traditional matrix completion where the observations per matrix entry are scalar valued. To do so, we utilize tools from optimal transport to generalize the nearest neighbors method to the distributional setting. Under a suitable latent factor model on probability distributions, we establish that our method recovers the distributions in the Wasserstein norm. We demonstrate through simulations that our method is able to (i) provide better distributional estimates for an entry compared to using observed samples for that entry alone, (ii) yield accurate estimates of distributional quantities such as standard deviation and value-at-risk, and (iii) inherently support heteroscedastic noise. We also prove novel asymptotic results for Wasserstein barycenters over one-dimensional distributions.
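
The abstract above mentions Wasserstein barycenters over one-dimensional distributions. A standard fact is that, in one dimension, the 2-Wasserstein barycenter has a quantile function equal to the average of the input quantile functions. The sketch below illustrates this for the simplest case, assuming equal weights and empirical distributions with the same number of samples; the function names are hypothetical and this is not the authors' implementation.

```python
def barycenter_1d(samples):
    """Equal-weight 2-Wasserstein barycenter of 1D empirical distributions.

    Assumes every sample list has the same length, so that averaging the
    order statistics is the same as averaging the quantile functions.
    """
    n = len(samples[0])
    if any(len(s) != n for s in samples):
        raise ValueError("all sample lists must have the same length")
    sorted_samples = [sorted(s) for s in samples]
    return [sum(column) / len(samples) for column in zip(*sorted_samples)]


def wasserstein2_1d(xs, ys):
    """2-Wasserstein distance between two equal-size 1D empirical distributions."""
    if len(xs) != len(ys):
        raise ValueError("sample lists must have the same length")
    squared = sum((a - b) ** 2 for a, b in zip(sorted(xs), sorted(ys)))
    return (squared / len(xs)) ** 0.5
```

A nearest-neighbors imputation in this spirit would pick neighboring rows or columns by a Wasserstein-based distance and return the barycenter of their empirical distributions; the exact neighbor selection rule is given in the paper.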

[LG-204] Contextual Bandits with Arm Request Costs and Delays

Link: https://arxiv.org/abs/2410.13109
Authors: Lai Wei,Ambuj Tewari,Michael A. Cianfrocco
Keywords: stochastic time delays, requested with stochastic, stochastic time, time delays, contextual bandit problem
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:We introduce a novel extension of the contextual bandit problem, where new sets of arms can be requested with stochastic time delays and associated costs. In this setting, the learner can select multiple arms from a decision set, with each selection taking one unit of time. The problem is framed as a special case of semi-Markov decision processes (SMDPs). The arm contexts, request times, and costs are assumed to follow an unknown distribution. We consider the regret of an online learning algorithm with respect to the optimal policy that achieves the maximum average reward. By leveraging the Bellman optimality equation, we design algorithms that can effectively select arms and determine the appropriate time to request new arms, thereby minimizing their regret. Under the realizability assumption, we analyze the proposed algorithms and demonstrate that their regret upper bounds align with established results in the contextual bandit literature. We validate the algorithms through experiments on simulated data and a movie recommendation dataset, showing that their performance is consistent with theoretical analyses.

[LG-205] Large data limits and scaling laws for tSNE

Link: https://arxiv.org/abs/2410.13063
Authors: Ryan Murray,Adam Pickarski
Keywords: t-distributed stochastic neighbor, widely-used non-linear dimension, non-linear dimension reduction, stochastic neighbor embedding, dimension reduction algorithm
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG); Analysis of PDEs (math.AP); Machine Learning (stat.ML)
Comments:

Abstract:This work considers large-data asymptotics for t-distributed stochastic neighbor embedding (tSNE), a widely-used non-linear dimension reduction algorithm. We identify an appropriate continuum limit of the tSNE objective function, which can be viewed as a combination of a kernel-based repulsion and an asymptotically-vanishing Laplacian-type regularizer. As a consequence, we show that embeddings of the original tSNE algorithm cannot have any consistent limit as $n \to \infty$. We propose a rescaled model which mitigates the asymptotic decay of the attractive energy, and which does have a consistent limit.

[LG-206] Quantum Boltzmann machine learning of ground-state energies

Link: https://arxiv.org/abs/2410.12935
Authors: Dhrumil Patel,Daniel Koch,Saahil Patel,Mark M. Wilde
Keywords: Estimating the ground-state, quantum Boltzmann machines, quantum Boltzmann, quantum, parameterized quantum circuits
Categories: Quantum Physics (quant-ph); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 7 pages, 1 figure, Supplementary material available as ‘Ancillary Files’

Abstract:Estimating the ground-state energy of Hamiltonians is a fundamental task for which it is believed that quantum computers can be helpful. Several approaches have been proposed toward this goal, including algorithms based on quantum phase estimation and hybrid quantum-classical optimizers involving parameterized quantum circuits, the latter falling under the umbrella of the variational quantum eigensolver. Here, we analyze the performance of quantum Boltzmann machines for this task, which is a less explored ansatz based on parameterized thermal states and which is not known to suffer from the barren-plateau problem. We delineate a hybrid quantum-classical algorithm for this task and rigorously prove that it converges to an $\varepsilon$-approximate stationary point of the energy function optimized over parameter space, while using a number of parameterized-thermal-state samples that is polynomial in $\varepsilon^{-1}$, the number of parameters, and the norm of the Hamiltonian being optimized. Our algorithm estimates the gradient of the energy function efficiently by means of a novel quantum circuit construction that combines classical sampling, Hamiltonian simulation, and the Hadamard test, thus overcoming a key obstacle to quantum Boltzmann machine learning that has been left open since [Amin et al., Phys. Rev. X 8, 021050 (2018)]. Additionally supporting our main claims are calculations of the gradient and Hessian of the energy function, as well as an upper bound on the matrix elements of the latter that is used in the convergence analysis.

[LG-207] Credal Two-Sample Tests of Epistemic Ignorance

Link: https://arxiv.org/abs/2410.12921
Authors: Siu Lun Chau,Antonin Schrab,Arthur Gretton,Dino Sejdinovic,Krikamol Muandet
Keywords: element captures aleatoric, captures aleatoric uncertainty, modeller partial ignorance, partial ignorance, represents epistemic uncertainty
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 39 pages

Abstract:We introduce credal two-sample testing, a new hypothesis testing framework for comparing credal sets – convex sets of probability measures where each element captures aleatoric uncertainty and the set itself represents epistemic uncertainty that arises from the modeller’s partial ignorance. Classical two-sample tests, which rely on comparing precise distributions, fail to address epistemic uncertainty due to partial ignorance. To bridge this gap, we generalise two-sample tests to compare credal sets, enabling reasoning for equality, inclusion, intersection, and mutual exclusivity, each offering unique insights into the modeller’s epistemic beliefs. We formalise these tests as two-sample tests with nuisance parameters and introduce the first permutation-based solution for this class of problems, significantly improving upon existing methods. Our approach properly incorporates the modeller’s epistemic uncertainty into hypothesis testing, leading to more robust and credible conclusions, with kernel-based implementations for real-world applications.

[LG-208] AI-Driven Autonomous Control of Proton-Boron Fusion Reactors Using Backpropagation Neural Networks

[LG-209] Incorporating Metabolic Information into LLMs for Anomaly Detection in Clinical Time-Series

[LG-210] IMeSynC: Temporal Intent Modelling with Synchronized Context Encodings for Financial Service Applications RECSYS2024

Link: https://arxiv.org/abs/2410.12825
Authors: Dwipam Katariya,Juan Manuel Origgi,Yage Wang,Thomas Caputo
Keywords: Users engage, web platforms, call centers, financial services companies, multiple channels
Categories: General Finance (q-fin.GN); Machine Learning (cs.LG)
Comments: 6 pages, Accepted at RecSys 2024

Abstract:Users engage with financial services companies through multiple channels, often interacting with mobile applications, web platforms, call centers, and physical locations to service their accounts. The resulting interactions are recorded at heterogeneous temporal resolutions across these domains. This multi-channel data can be combined and encoded to create a comprehensive representation of the customer’s journey for accurate intent prediction. This demands sequential learning solutions. NMT transformers achieve state-of-the-art sequential representation learning by encoding context and decoding for the next best action to represent long-range dependencies. However, three major challenges exist while combining multi-domain sequences within an encoder-decoder transformer architecture for intent prediction applications: a) aligning sequences with different sampling rates, b) learning temporal dynamics across multi-variate, multi-domain sequences, and c) combining dynamic and static sequences. We propose an encoder-decoder transformer model to address these challenges for contextual and sequential intent prediction in financial servicing applications. Our experiments show significant improvement over the existing tabular method.

[LG-211] Optimization of Actuarial Neural Networks with Response Surface Methodology

Link: https://arxiv.org/abs/2410.12824
Authors: Belguutei Ariuntugs,Kehelwala Dewage Gayan Madurang
Keywords: enhancing risk assessment, plays a crucial, predictive modeling, data-driven world, crucial role
Categories: Risk Management (q-fin.RM); Machine Learning (cs.LG)
Comments: This work was presented at the Actuarial Research Conference (ARC); an abstract was submitted and presented at ARC 2024.

Abstract:In the data-driven world of actuarial science, machine learning (ML) plays a crucial role in predictive modeling, enhancing risk assessment and pricing strategies. Neural networks, specifically combined actuarial neural networks (CANN), are vital for tasks such as mortality forecasting and pricing. However, optimizing hyperparameters (e.g., learning rates, layers) is essential for resource efficiency. This study utilizes a factorial design and response surface methodology (RSM) to optimize CANN performance. RSM effectively explores the hyperparameter space and captures potential curvature, outperforming traditional grid search. Our results show accurate performance predictions, identifying critical hyperparameters. By dropping statistically insignificant hyperparameters, we reduced runs from 288 to 188, with negligible loss in accuracy, achieving near-optimal out-of-sample Poisson deviance loss.

[LG-212] Advancing Spatio-temporal Storm Surge Prediction with Hierarchical Deep Neural Networks

Link: https://arxiv.org/abs/2410.12823
Authors: Saeed Saviz Naeini,Reda Snaiki,Teng Wu
Keywords: America face major, North America face, face major threats, storm surge, America face
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Comments:

Abstract:Coastal regions in North America face major threats from storm surges caused by hurricanes and nor’easters. Traditional numerical models, while accurate, are computationally expensive, limiting their practicality for real-time predictions. Recently, deep learning techniques have been developed for efficient simulation of time-dependent storm surge. To resolve the small scales of storm surge in both time and space over a long duration and a large area, these simulations typically need to employ oversized neural networks that struggle with the accumulation of prediction errors over successive time steps. To address these challenges, this study introduces a hierarchical deep neural network (HDNN) combined with a convolutional autoencoder (CAE) to accurately and efficiently predict storm surge time series. The CAE reduces the dimensionality of storm surge data, streamlining the learning process. HDNNs then map storm parameters to the low-dimensional representation of storm surge, allowing for sequential predictions across different time scales. Specifically, the current-level neural network is utilized to predict future states with a relatively large time step, which are passed as inputs to the next-level neural network for smaller time-step predictions. This process continues sequentially for all time steps. The results from different-level neural networks across various time steps are then stacked to acquire the entire time series of storm surge. The simulated low-dimensional representations are finally decoded back into storm surge time series. The proposed model was trained and tested using synthetic data from the North Atlantic Comprehensive Coastal Study. Results demonstrate its excellent performance to effectively handle high-dimensional surge data while mitigating the accumulation of prediction errors over time, making it a promising tool for advancing storm surge prediction.

[LG-213] Deep Adversarial Learning with Activity-Based User Discrimination Task for Human Activity Recognition

[LG-214] Restoring Super-High Resolution GPS Mobility Data

Link: https://arxiv.org/abs/2410.12818
Authors: Haruki Yonekura,Ren Ozeki,Hamada Rizk,Hirozumi Yamaguchi
Keywords: high-resolution GPS trajectory, reconstructing high-resolution GPS, GPS trajectory data, high-resolution GPS, balancing data utility
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: Accepted paper for the 2nd ACM SIGSPATIAL International Workshop on Geo-Privacy and Data Utility for Smart Societies (GeoPrivacy 2024)

Abstract:This paper presents a novel system for reconstructing high-resolution GPS trajectory data from truncated or synthetic low-resolution inputs, addressing the critical challenge of balancing data utility with privacy preservation in mobility applications. The system integrates transformer-based encoder-decoder models with graph convolutional networks (GCNs) to effectively capture both the temporal dependencies of trajectory data and the spatial relationships in road networks. By combining these techniques, the system is able to recover fine-grained trajectory details that are lost through data truncation or rounding, a common practice to protect user privacy. We evaluate the system on the Beijing trajectory dataset, demonstrating its superior performance over traditional map-matching algorithms and LSTM-based synthetic data generation methods. The proposed model achieves an average Fréchet distance of 0.198 km, significantly outperforming map-matching algorithms (0.632 km) and synthetic trajectory models (0.498 km). The results show that the system is not only capable of accurately reconstructing real-world trajectories but also generalizes effectively to synthetic data. These findings suggest that the system can be deployed in urban mobility applications, providing both high accuracy and robust privacy protection.
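
The abstract above reports reconstruction quality as an average Fréchet distance. For readers unfamiliar with the metric, here is a minimal sketch of the discrete Fréchet distance between two polylines, computed with the usual dynamic program. It assumes planar coordinates and Euclidean point distances; whether the paper uses the discrete or continuous variant, and how it converts GPS coordinates to kilometres, is not stated in the abstract.

```python
import math


def discrete_frechet(p, q):
    """Discrete Fréchet distance between two point sequences in the plane.

    p and q are non-empty lists of (x, y) tuples. ca[i][j] holds the distance
    for the prefixes p[:i+1] and q[:j+1].
    """
    n, m = len(p), len(q)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                best_previous = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_previous, d)
    return ca[n - 1][m - 1]
```

Unlike a point-by-point average error, this measure respects the ordering of points along each trajectory, which is why it is a common choice for comparing a reconstructed path with its ground truth.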

[LG-215] Cerebral microbleeds: Association with cognitive decline and pathology build-up

Link: https://arxiv.org/abs/2410.12809
Authors: Saima Rathore,Jatin Chaudhary,Boning Tong,Selen Bozkurt
Keywords: progression remains unclear, Alzheimer disease, role in Alzheimer, cognitive decline, Cerebral microbleeds
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
Comments: 11 pages, 2 figures

Abstract:Cerebral microbleeds, markers of brain damage from vascular and amyloid pathologies, are linked to cognitive decline in aging, but their role in Alzheimer’s disease (AD) onset and progression remains unclear. This study aimed to explore whether the presence and location of lobar microbleeds are associated with amyloid-β (Aβ)-PET, tau tangle formation (tau-PET), and longitudinal cognitive decline. We analyzed 1,573 ADNI participants with MR imaging data and information on the number and location of microbleeds. Associations between lobar microbleeds and pathology, cerebrospinal fluid (CSF), genetics, and cognition were examined, focusing on regional microbleeds and domain-specific cognitive decline using ordinary least-squares regression while adjusting for covariates. Cognitive decline was assessed with ADAS-Cog11 and its domain-specific sub-scores. Participants underwent neuropsychological testing at least twice, with a minimum two-year interval between assessments. Among the 1,573 participants (692 women, mean age 71.23 years), 373 participants had microbleeds. The presence of microbleeds was linked to cognitive decline, particularly in the semantic, language, and praxis domains for those with temporal lobe microbleeds. Microbleeds in the overall cortex were associated with language decline. Pathologically, temporal lobe microbleeds were associated with increased tau in the overall cortex, while cortical microbleeds were linked to elevated Aβ in the temporal, parietal, and frontal regions. In this mixed population, microbleeds were connected to longitudinal cognitive decline, especially in semantic and language domains, and were associated with higher baseline Aβ and tau pathology. These findings suggest that lobar microbleeds should be included in AD diagnostic and prognostic evaluations.

[LG-216] A Hierarchical conv-LSTM and LLM Integrated Model for Holistic Stock Forecasting

Information Retrieval

[IR-0] Knowledge-Aware Query Expansion with Large Language Models for Textual and Relational Retrieval

[IR-1] Disjointness Violations in Wikidata

[IR-2] Pessimistic Evaluation

Link: https://arxiv.org/abs/2410.13680
Authors: Fernando Diaz
Keywords: information access systems, information access, Traditional evaluation, information access based, focused primarily
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Traditional evaluation of information access systems has focused primarily on average utility across a set of information needs (information retrieval) or users (recommender systems). In this work, we argue that evaluating only with average metric measurements assumes utilitarian values not aligned with traditions of information access based on equal access. We advocate for pessimistic evaluation of information access systems focusing on worst case utility. These methods are (a) grounded in ethical and pragmatic concepts, (b) theoretically complementary to existing robustness and fairness methods, and (c) empirically validated across a set of retrieval and recommendation tasks. These results suggest that pessimistic evaluation should be included in existing experimentation processes to better understand the behavior of systems, especially when concerned with principles of social good.
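
To make the contrast between average and worst-case utility concrete, the sketch below compares two hypothetical systems on per-query scores. The abstract does not specify the paper's exact metrics, so the "mean over the worst fraction of queries" used here is an assumption chosen for illustration, and the function names are hypothetical.

```python
def mean_utility(scores):
    """Average utility across all queries (the traditional summary)."""
    return sum(scores) / len(scores)


def worst_case_utility(scores, fraction=0.1):
    """Mean utility over the worst-performing `fraction` of queries.

    At least one query is always included, so a tiny fraction reduces to
    the single minimum score.
    """
    k = max(1, int(len(scores) * fraction))
    return sum(sorted(scores)[:k]) / k


# Hypothetical per-query scores (e.g. nDCG) for two systems.
system_a = [0.9, 0.9, 0.9, 0.1]
system_b = [0.6, 0.6, 0.6, 0.6]
```

With these made-up numbers, `system_a` wins on `mean_utility` while `system_b` wins on `worst_case_utility(scores, fraction=0.25)`, which is the kind of disagreement a pessimistic evaluation is meant to surface.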

[IR-3] Large Language Models as Narrative-Driven Recommenders

[IR-4] Cross-Domain Sequential Recommendation via Neural Process

Link: https://arxiv.org/abs/2410.13588
Authors: Haipeng Li,Jiangxia Cao,Yiwen Gao,Yunhuai Liu,Shuchao Pang
Keywords: Cross-Domain Sequential Recommendation, Sequential Recommendation, user interest modeling, sequence-based user interest, Cross-Domain Sequential
Categories: Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Comments: Work in progress

Abstract:Cross-Domain Sequential Recommendation (CDSR) is a hot topic in sequence-based user interest modeling, which aims at utilizing a single model to predict the next items for different domains. To tackle the CDSR, many methods are focused on domain overlapped users’ behaviors fitting, which heavily relies on the same user’s different-domain item sequences collaborating signals to capture the synergy of cross-domain item-item correlation. Indeed, these overlapped users occupy a small fraction of the entire user set only, which introduces a strong assumption that the small group of domain overlapped users is enough to represent all domain user behavior characteristics. However, intuitively, such a suggestion is biased, and the insufficient learning paradigm in non-overlapped users will inevitably limit model performance. Further, it is not trivial to model non-overlapped user behaviors in CDSR because there are no other domain behaviors to collaborate with, which causes the observed single-domain users’ behavior sequences to be hard to contribute to cross-domain knowledge mining. Considering such a phenomenon, we raise a challenging and unexplored question: How to unleash the potential of non-overlapped users’ behaviors to empower CDSR?

[IR-5] Generate and Instantiate What You Prefer: Text-Guided Diffusion for Sequential Recommendation

Link: https://arxiv.org/abs/2410.13428
Authors: Guoqing Hu,Zhangyi Yang,Zhibo Cai,An Zhang,Xiang Wang
Keywords: sequential recommendation tasks, Recent advancements, generative recommendation systems, realm of sequential, shown promise
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Recent advancements in generative recommendation systems, particularly in the realm of sequential recommendation tasks, have shown promise in enhancing generalization to new items. Among these approaches, diffusion-based generative recommendation has emerged as an effective tool, leveraging its ability to capture data distributions and generate high-quality samples. Despite its effectiveness, two primary challenges have been identified: 1) the lack of consistent modeling of data distribution for oracle items; and 2) the difficulty in scaling to more informative control signals beyond historical interactions. These issues stem from the uninformative nature of ID embeddings, which necessitate random initialization and limit the incorporation of additional control signals. To address these limitations, we propose iDreamRec to involve more concrete prior knowledge to establish item embeddings, particularly through detailed item text descriptions and advanced Text Embedding Models (TEM). More importantly, by converting item descriptions into embeddings aligned with TEM, we enable the integration of intention instructions as control signals to guide the generation of oracle items. Experimental results on four datasets demonstrate that iDreamRec not only outperforms existing diffusion-based generative recommenders but also facilitates the incorporation of intention instructions for more precise and effective recommendation generation.

[IR-6] Context-aware adaptive personalised recommendation: a meta-hybrid

[IR-7] Comparing the Utility Preference and Performance of Course Material Search Functionality and Retrieval-Augmented Generation Large Language Model (RAG-LLM) AI Chatbots in Information-Seeking Tasks

Link: https://arxiv.org/abs/2410.13326
Authors: Leonardo Pasquarelli,Charles Koutcheme,Arto Hellas
Keywords: Providing sufficient support, requires substantial resources, growing enrollment numbers, students requires substantial, Providing sufficient
Categories: Computers and Society (cs.CY); Information Retrieval (cs.IR)
Comments: 12 pages, 4 figures

Abstract:Providing sufficient support for students requires substantial resources, especially considering the growing enrollment numbers. Students need help in a variety of tasks, ranging from information-seeking to requiring support with course assignments. To explore the utility of recent large language models (LLMs) as a support mechanism, we developed an LLM-powered AI chatbot that augments the answers that are produced with information from the course materials. To study the effect of the LLM-powered AI chatbot, we conducted a lab-based user study (N=14), in which the participants worked on tasks from a web software development course. The participants were divided into two groups, where one of the groups first had access to the chatbot and then to a more traditional search functionality, while another group started with the search functionality and was then given the chatbot. We assessed the participants’ performance and perceptions towards the chatbot and the search functionality and explored their preferences towards the support functionalities. Our findings highlight that both support mechanisms are seen as useful and that support mechanisms work well for specific tasks, while less so for other tasks. We also observe that students tended to prefer the second support mechanism more, where students who were first given the chatbot tended to prefer the search functionality and vice versa.

[IR-8] SBI-RAG: Enhancing Math Word Problem Solving for Students through Schema-Based Instruction and Retrieval-Augmented Generation NEURIPS’24

[IR-9] Disentangling Likes and Dislikes in Personalized Generative Explainable Recommendation

[IR-10] Starbucks: Improved Training for 2D Matryoshka Embeddings

Link: https://arxiv.org/abs/2410.13230
Authors: Shengyao Zhuang,Shuai Wang,Bevan Koopman,Guido Zuccon
Keywords: Effective approaches, embedding model depth, task requirements, scale embedding model, highly scalable
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Effective approaches that can scale embedding model depth (i.e. layers) and embedding size allow for the creation of models that are highly scalable across different computational resources and task requirements. While the recently proposed 2D Matryoshka training approach can efficiently produce a single embedding model such that its sub-layers and sub-dimensions can measure text similarity, its effectiveness is significantly worse than if smaller models were trained separately. To address this issue, we propose Starbucks, a new training strategy for Matryoshka-like embedding models, which encompasses both the fine-tuning and pre-training phases. For the fine-tuning phase, we discover that, rather than sampling a random sub-layer and sub-dimensions for each training step, providing a fixed list of layer-dimension pairs, from small to large sizes, and computing the loss across all pairs significantly improves the effectiveness of 2D Matryoshka embedding models, bringing them on par with their separately trained counterparts. To further enhance performance, we introduce a new pre-training strategy, which applies masked autoencoder language modelling to sub-layers and sub-dimensions during pre-training, resulting in a stronger backbone for subsequent fine-tuning of the embedding model. Experimental results on both semantic text similarity and retrieval benchmarks demonstrate that the proposed pre-training and fine-tuning strategies significantly improved the effectiveness over 2D Matryoshka models, enabling Starbucks models to perform more efficiently and effectively than separately trained models.

[IR-11] Research on Travel Route Planing Problems Based on Greedy Algorithm

[IR-12] MixEHR-Nest: Identifying Subphenotypes within Electronic Health Records through Hierarchical Guided-Topic Modeling

[IR-13] Transformers4NewsRec: A Transformer-based News Recommendation Framework

Link: https://arxiv.org/abs/2410.13125
Authors: Dairui Liu, Honghui Du, Boming Yang, Neil Hurley, Aonghus Lawlor, Irene Li, Derek Greene, Ruihai Dong
Keywords: language processing tasks, shown great promise, natural language processing, Pre-trained transformer models, Pre-trained transformer
Categories: Information Retrieval (cs.IR)

Abstract: Pre-trained transformer models have shown great promise in various natural language processing tasks, including personalized news recommendations. To harness the power of these models, we introduce Transformers4NewsRec, a new Python framework built on the Transformers library. This framework is designed to unify and compare the performance of various news recommendation models, including deep neural networks and graph-based models. Transformers4NewsRec offers flexibility in terms of model selection, data preprocessing, and evaluation, allowing both quantitative and qualitative analysis.

[IR-14] Retrieval-Enhanced Named Entity Recognition

[IR-15] Preference Diffusion for Recommendation

[IR-16] Is Semantic Chunking Worth the Computational Cost?

[IR-17] Supply Chain Network Extraction and Entity Classification Leveraging Large Language Models

[IR-18] LLM Confidence Evaluation Measures in Zero-Shot CSS Classification

[IR-19] LFOSum: Summarizing Long-form Opinions with Large Language Models

[IR-20] Towards Computational Analysis of Pansori Singing

Link: https://arxiv.org/abs/2410.12956
Authors: Sangheon Park, Danbinaerin Han, Dasaem Jeong
Keywords: representative vocal genres, elaborated vocal melody, vocal melody line, Korean traditional music, genres of Korean
Categories: Sound (cs.SD); Information Retrieval (cs.IR); Audio and Speech Processing (eess.AS)
Comments: Late-Breaking Demo Session of the 25th International Society for Music Information Retrieval (ISMIR) Conference, 2024

Abstract: Pansori is one of the most representative vocal genres of Korean traditional music, which has an elaborate vocal melody line with strong vibrato. Although the music is transmitted orally without any music notation, transcribing pansori music in Western staff notation has been introduced for several purposes, such as documentation of music, education, or research. In this paper, we introduce a computational analysis of pansori based on both audio and the corresponding transcription, showing how modern Music Information Retrieval tasks can be used to analyze traditional music and how this analysis reveals distinct audio characteristics of pansori.

[IR-21] REFINE on Scarce Data: Retrieval Enhancement through Fine-Tuning via Model Fusion of Embedding Models

[IR-22] AT-RAG: An Adaptive RAG Model Enhancing Query Efficiency with Topic Filtering and Iterative Reasoning

[IR-23] Enhancing Affinity Propagation for Improved Public Sentiment Insights

[IR-24] Enhancing Long Context Performance in LLMs Through Inner Loop Query Mechanism

[IR-25] A Comprehensive Survey of Retrieval-Augmented Generation (RAG): Evolution, Current Landscape and Future Directions

[IR-26] Leveraging Large Language Models to Enhance Personalized Recommendations in E-commerce CEC

Link: https://arxiv.org/abs/2410.12829
Authors: Wei Xu, Jue Xiao, Jianlong Chen
Keywords: large language model, study deeply explores, LLM, deeply explores, explores the application
Categories: Information Retrieval (cs.IR)
Comments: This paper has been accepted by the 5th International Conference on Electrical, Communication and Computer Engineering (ICECCE 2024)

Abstract: This study explores the application of large language models (LLMs) in personalized recommendation systems for e-commerce. To address the limitations of traditional recommendation algorithms in processing large-scale, multi-dimensional data, a recommendation system framework based on LLMs is proposed. In comparative experiments, the LLM-based recommendation model shows significant improvement in multiple key indicators such as precision, recall, F1 score, average click-through rate (CTR) and recommendation diversity. Specifically, precision improves from 0.75 to 0.82, recall from 0.68 to 0.77, the F1 score from 0.71 to 0.79, and CTR from 0.56 to 0.63, while recommendation diversity increases by 41.2%, from 0.34 to 0.48. The LLM captures the implicit needs of users through deep semantic understanding of user comments and product description data, and combines contextual data for dynamic recommendation to generate more accurate and diverse results. The study shows that LLMs have significant advantages in personalized recommendation, can improve user experience and promote platform sales growth, and provide strong theoretical and practical support for personalized recommendation technology in e-commerce.
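
The figures quoted in this abstract can be sanity-checked with a few lines of Python. The sketch assumes that the F1 score is the harmonic mean of the reported precision and recall (the abstract does not say how F1 was aggregated) and that the diversity gain is a relative increase. The function names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def relative_increase(before, after):
    """Relative change from `before` to `after`, as a fraction."""
    return (after - before) / before


# Figures quoted in the abstract.
baseline_f1 = f1_score(0.75, 0.68)
llm_f1 = f1_score(0.82, 0.77)
diversity_gain = relative_increase(0.34, 0.48)

print(round(baseline_f1, 2), round(llm_f1, 2), round(diversity_gain * 100, 1))
```

Under these assumptions the values round to 0.71, 0.79 and 41.2%, so the reported numbers are internally consistent.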

[IR-27] Optimizing and Evaluating Enterprise Retrieval-Augmented Generation (RAG): A Content Design Perspective

[IR-28] Ads Supply Personalization via Doubly Robust Learning CIKM’24

[IR-29] Disaggregating Embedding Recommendation Systems with FlexEMR

[IR-30] Predicting the Geolocation of Tweets Using transformer models on Customized Data

Attachment Download

Click to download today's full paper list
