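This post lists the latest papers retrieved from arXiv.org on 2024-10-03. The list is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive it by email on a schedule, leave your email address in the comments.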

Note: the daily paper data comes from arXiv.org and is refreshed automatically at around 11:00 every morning.



Overview (2024-10-03)

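491 papers were updated today, including:

  • Natural Language Processing: 103 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 127 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 105 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 180 papers (Machine Learning (cs.LG))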

Natural Language Processing

[NLP-0] Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads

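[Quick read]: This paper addresses the difficulty of deploying large language models (LLMs) for long-context tasks on consumer-grade devices such as a single Nvidia 4090 GPU, because generation inference at that length significantly increases both computation load and GPU memory footprint. The key to the solution is Locret, a framework that introduces retaining heads to score the importance of KV cache units, enabling more accurate eviction within a fixed cache size. Locret is fine-tuned on top of the frozen LLM backbone and combines eviction with chunked prefill, which significantly reduces peak GPU memory usage while preserving generation quality. Experiments show that Locret outperforms existing methods in memory efficiency and generation quality and, according to the authors, is the first framework to run 128K long-context inference with models such as Llama-3.1-8B on a single Nvidia 4090 GPU.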

Link: https://arxiv.org/abs/2410.01805
Authors: Yuxiang Huang, Binhang Yuan, Xu Han, Chaojun Xiao, Zhiyuan Liu
Keywords: Large language models, shown remarkable advances, Large language, supporting long-context comprehension, processing tasks
Categories: Computation and Language (cs.CL)
Comments: Preprints

Abstract:Large language models (LLMs) have shown remarkable advances in supporting long-context comprehension and processing tasks. However, scaling the generation inference of LLMs to such long contexts incurs significant additional computation load, and demands a substantial GPU memory footprint to maintain the key-value (KV) cache of transformer-based LLMs. Existing KV cache compression methods, such as quantization, face memory bottlenecks as context length increases, while static-sized caches, such as eviction, suffer from inefficient policies. These limitations restrict deployment on consumer-grade devices like a single Nvidia 4090 GPU. To overcome this, we propose Locret, a framework for long-context LLM inference that introduces retaining heads to evaluate the causal importance of KV cache units, allowing for more accurate eviction within a fixed cache size. Locret is fine-tuned on top of the frozen backbone LLM using a minimal amount of data from standard long-context SFT datasets. During inference, we evict low-importance cache units along with a chunked prefill pattern, significantly reducing peak GPU memory usage. We conduct an extensive empirical study to evaluate Locret, where the experimental results show that Locret outperforms the recent competitive approaches, including InfLLM, Quantization, SirLLM, and MInference, in terms of memory efficiency and the quality of generated contents – Locret achieves over a 20x and 8x KV cache compression ratio compared to the full KV cache for Phi-3-mini-128K and Llama-3.1-8B-instruct. Additionally, Locret can be combined with other methods, such as quantization and token merging. To our knowledge, Locret is the first framework capable of deploying Llama-3.1-8B or similar models on a single Nvidia 4090 GPU, enabling 128K long-context inference without compromising generation quality, and requiring little additional system optimizations.
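
The eviction idea in the abstract above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: importance scores are passed in as plain floats (in Locret they come from trained retaining heads), and `evict_to_budget` and `chunked_prefill` are hypothetical helper names.

```python
from typing import List, Tuple

# A cache unit is modelled as (token position, importance score).
CacheUnit = Tuple[int, float]

def evict_to_budget(cache: List[CacheUnit], budget: int) -> List[CacheUnit]:
    """Keep the `budget` highest-scoring units, restoring position order."""
    if len(cache) <= budget:
        return cache
    keep = sorted(cache, key=lambda unit: unit[1], reverse=True)[:budget]
    return sorted(keep, key=lambda unit: unit[0])

def chunked_prefill(scores: List[float], chunk_size: int, budget: int) -> List[int]:
    """Process the prompt chunk by chunk, evicting after every chunk.

    The cache never holds more than budget + chunk_size units at once.
    """
    cache: List[CacheUnit] = []
    for start in range(0, len(scores), chunk_size):
        chunk = scores[start:start + chunk_size]
        cache.extend((start + i, s) for i, s in enumerate(chunk))
        cache = evict_to_budget(cache, budget)
    return [pos for pos, _ in cache]

# Assumed toy scores for six tokens, chunks of two, budget of three units.
print(chunked_prefill([0.9, 0.1, 0.5, 0.8, 0.2, 0.7], chunk_size=2, budget=3))
# prints [0, 3, 5]
```

Evicting after each chunk, rather than once at the end, is what bounds peak memory in this sketch.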

[NLP-1] Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models

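[Quick read]: This paper addresses the challenge of predicting phenotypes with complex genetic bases from a small, interpretable set of variant features. The key to the solution is FREEFORM (Free-flow Reasoning and Ensembling for Enhanced Feature Output and Robust Modeling), a novel knowledge-driven framework that uses the intrinsic knowledge of pre-trained large language models (LLMs) for feature selection and engineering. The framework combines chain-of-thought and ensembling principles and outperforms several data-driven methods, particularly in low-shot regimes.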

Link: https://arxiv.org/abs/2410.01795
Authors: Joseph Lee, Shu Yang, Jae Young Baik, Xiaoxi Liu, Zhen Tan, Dawei Li, Zixuan Wen, Bojian Hou, Duy Duong-Tran, Tianlong Chen, Li Shen
Keywords: Predicting phenotypes, variant features remains, genetic bases based, bases based, remains a challenging
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Genomics (q-bio.GN)
Comments:

Abstract:Predicting phenotypes with complex genetic bases based on a small, interpretable set of variant features remains a challenging task. Conventionally, data-driven approaches are utilized for this task, yet the high dimensional nature of genotype data makes the analysis and prediction difficult. Motivated by the extensive knowledge encoded in pre-trained LLMs and their success in processing complex biomedical concepts, we set to examine the ability of LLMs in feature selection and engineering for tabular genotype data, with a novel knowledge-driven framework. We develop FREEFORM, Free-flow Reasoning and Ensembling for Enhanced Feature Output and Robust Modeling, designed with chain-of-thought and ensembling principles, to select and engineer features with the intrinsic knowledge of LLMs. Evaluated on two distinct genotype-phenotype datasets, genetic ancestry and hereditary hearing loss, we find this framework outperforms several data-driven methods, particularly on low-shot regimes. FREEFORM is available as open-source framework at GitHub: this https URL.

[NLP-2] Loki: An Open-Source Tool for Fact Verification

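[Quick read]: This paper addresses the growing problem of misinformation and presents Loki, an open-source tool for fact verification. The key is a human-centered approach: the fact-checking task is decomposed into a five-step pipeline (breaking long texts into individual claims, assessing their check-worthiness, generating queries, retrieving evidence, and verifying the claims) that balances fact-checking quality against the cost of human involvement. Rather than fully automating verification, Loki provides essential information at each step to assist human judgment, especially for general users such as journalists and content moderators. It has also been optimized for latency, robustness, and cost efficiency at a commercially usable level.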

Link: https://arxiv.org/abs/2410.01794
Authors: Haonan Li, Xudong Han, Hao Wang, Yuxia Wang, Minghan Wang, Rui Xing, Yilin Geng, Zenan Zhai, Preslav Nakov, Timothy Baldwin
Keywords: open-source tool designed, problem of misinformation, open-source tool, tool designed, designed to address
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We introduce Loki, an open-source tool designed to address the growing problem of misinformation. Loki adopts a human-centered approach, striking a balance between the quality of fact-checking and the cost of human involvement. It decomposes the fact-checking task into a five-step pipeline: breaking down long texts into individual claims, assessing their check-worthiness, generating queries, retrieving evidence, and verifying the claims. Instead of fully automating the claim verification process, Loki provides essential information at each step to assist human judgment, especially for general users such as journalists and content moderators. Moreover, it has been optimized for latency, robustness, and cost efficiency at a commercially usable level. Loki is released under an MIT license and is available on GitHub. We also provide a video presenting the system and its capabilities.
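
The five-step pipeline described above can be pictured as a chain of small functions. The skeleton below is only a sketch of that structure: every function name and every heuristic in it is a hypothetical placeholder (a real system would call an LLM and a search backend), and none of it is Loki's actual API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ClaimReport:
    claim: str
    checkworthy: bool
    queries: List[str] = field(default_factory=list)
    evidence: List[str] = field(default_factory=list)
    verdict: str = "skipped"  # default for claims that are not check-worthy

def decompose(text: str) -> List[str]:
    # Step 1 (placeholder): treat each sentence as one claim.
    return [s.strip() for s in text.split(".") if s.strip()]

def is_checkworthy(claim: str) -> bool:
    # Step 2 (placeholder heuristic): claims with a number are worth checking.
    return any(ch.isdigit() for ch in claim)

def generate_queries(claim: str) -> List[str]:
    # Step 3 (placeholder): use the claim itself as the search query.
    return [claim]

def retrieve_evidence(queries: List[str], corpus: List[str]) -> List[str]:
    # Step 4 (placeholder): substring match against a local corpus.
    return [doc for doc in corpus if any(q.lower() in doc.lower() for q in queries)]

def suggest_verdict(evidence: List[str]) -> str:
    # Step 5 (placeholder): only a suggestion; a human makes the final call.
    return "supported" if evidence else "no evidence found"

def run_pipeline(text: str, corpus: List[str]) -> List[ClaimReport]:
    reports = []
    for claim in decompose(text):
        report = ClaimReport(claim=claim, checkworthy=is_checkworthy(claim))
        if report.checkworthy:
            report.queries = generate_queries(claim)
            report.evidence = retrieve_evidence(report.queries, corpus)
            report.verdict = suggest_verdict(report.evidence)
        reports.append(report)
    return reports

corpus = ["Records show the tower is 330 metres tall"]
for r in run_pipeline("The tower is 330 metres tall. It is lovely.", corpus):
    print(r.claim, "->", r.verdict)
```

Keeping the intermediate queries and evidence on each report mirrors the human-in-the-loop design: a reviewer can inspect every step instead of trusting a single final label.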

[NLP-3] When a language model is optimized for reasoning does it still show embers of autoregression? An analysis of OpenAI o1

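[Quick read]: This paper examines whether the probability sensitivity of large language models (LLMs), which perform better on high-probability examples and tasks than on low-probability ones, persists in OpenAI's o1, a system optimized for reasoning. The finding is that optimizing a language model for reasoning can mitigate this sensitivity: o1 substantially outperforms previous LLMs, especially on rare variants of common tasks. However, it still shows the same qualitative trends, so the optimization might not fully overcome probability sensitivity.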

Link: https://arxiv.org/abs/2410.01792
Authors: R. Thomas McCoy, Shunyu Yao, Dan Friedman, Mathew D. Hardy, Thomas L. Griffiths
Keywords: Embers of Autoregression, next-word prediction, previous LLMs, important limitations, origins in next-word
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages

Abstract:In “Embers of Autoregression” (McCoy et al., 2023), we showed that several large language models (LLMs) have some important limitations that are attributable to their origins in next-word prediction. Here we investigate whether these issues persist with o1, a new system from OpenAI that differs from previous LLMs in that it is optimized for reasoning. We find that o1 substantially outperforms previous LLMs in many cases, with particularly large improvements on rare variants of common tasks (e.g., forming acronyms from the second letter of each word in a list, rather than the first letter). Despite these quantitative improvements, however, o1 still displays the same qualitative trends that we observed in previous systems. Specifically, o1 - like previous LLMs - is sensitive to the probability of examples and tasks, performing better and requiring fewer “thinking tokens” in high-probability settings than in low-probability ones. These results show that optimizing a language model for reasoning can mitigate but might not fully overcome the language model’s probability sensitivity.
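
To make the "rare variant of a common task" concrete, the acronym example from the abstract can be written as one function with a letter index. This only illustrates the two task variants; it is not the paper's evaluation code, and `acronym` is a hypothetical name.

```python
from typing import List

def acronym(words: List[str], index: int = 0) -> str:
    """Join the letter at `index` of every word, upper-cased."""
    return "".join(word[index].upper() for word in words)

words = ["apple", "banana", "cherry"]
print(acronym(words))           # common variant, first letters: prints ABC
print(acronym(words, index=1))  # rare variant, second letters: prints PAH
```

The two variants are equally easy for a program, which is what makes the accuracy gap between them a useful probe of probability sensitivity.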

[NLP-4] DreamGarden: A Designer Assistant for Growing Games from a Single Prompt

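[Quick read]: This paper explores how well coding assistants can align with game developers' workflows and what new modes of human-computer interaction can emerge from their use. The key is DreamGarden, a system whose LLM-driven planner breaks a single high-level user prompt (a dream, memory, or imagined scenario) into a hierarchical action plan, which is then distributed across specialized submodules for concrete implementation. The user sees a garden of plans and actions that grow independently and respond to intervention via seed prompts, pruning, and feedback. A user study charts directions for future work on semi-autonomous assistants and open-ended simulation design.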

Link: https://arxiv.org/abs/2410.01791
Authors: Sam Earle, Samyak Parajuli, Andrzej Banburski-Fahey
Keywords: Coding assistants, increasingly leveraged, generating code, code and making, making high-level plans
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
Comments: 21 pages + appendix, 11 figures

Abstract:Coding assistants are increasingly leveraged in game design, both generating code and making high-level plans. To what degree can these tools align with developer workflows, and what new modes of human-computer interaction can emerge from their use? We present DreamGarden, an AI system capable of assisting with the development of diverse game environments in Unreal Engine. At the core of our method is an LLM-driven planner, capable of breaking down a single, high-level prompt – a dream, memory, or imagined scenario provided by a human user – into a hierarchical action plan, which is then distributed across specialized submodules facilitating concrete implementation. This system is presented to the user as a garden of plans and actions, both growing independently and responding to user intervention via seed prompts, pruning, and feedback. Through a user study, we explore design implications of this system, charting courses for future work in semi-autonomous assistants and open-ended simulation design.

[NLP-5] Open-RAG: Enhanced Retrieval-Augmented Reasoning with Open-Source Large Language Models EMNLP2024

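[Quick read]: This paper addresses the limited reasoning capabilities of existing retrieval-augmented generation (RAG) methods, particularly with open-source large language models (LLMs). The key is Open-RAG, a framework that transforms an arbitrary dense LLM into a parameter-efficient sparse mixture-of-experts (MoE) model capable of handling complex reasoning tasks, including single- and multi-hop queries. Open-RAG trains the model to navigate distractors that appear relevant but are misleading, and uses latent learning to dynamically select relevant experts and integrate external knowledge for more accurate, contextually relevant responses. The paper also proposes a hybrid adaptive retrieval method that determines when retrieval is necessary, balancing performance gain against inference speed.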

Link: https://arxiv.org/abs/2410.01782
Authors: Shayekh Bin Islam, Md Asib Rahman, K S M Tozammel Hossain, Enamul Hoque, Shafiq Joty, Md Rizwan Parvez
Keywords: Large Language Models, Large Language, Retrieval-Augmented Generation, accuracy of Large, limited reasoning capabilities
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to EMNLP 2024 Findings. Website: this https URL . 14 pages, 7 figures, 5 tables

Abstract:Retrieval-Augmented Generation (RAG) has been shown to enhance the factual accuracy of Large Language Models (LLMs), but existing methods often suffer from limited reasoning capabilities in effectively using the retrieved evidence, particularly when using open-source LLMs. To mitigate this gap, we introduce a novel framework, Open-RAG, designed to enhance reasoning capabilities in RAG with open-source LLMs. Our framework transforms an arbitrary dense LLM into a parameter-efficient sparse mixture of experts (MoE) model capable of handling complex reasoning tasks, including both single- and multi-hop queries. Open-RAG uniquely trains the model to navigate challenging distractors that appear relevant but are misleading. As a result, Open-RAG leverages latent learning, dynamically selecting relevant experts and integrating external knowledge effectively for more accurate and contextually relevant responses. In addition, we propose a hybrid adaptive retrieval method to determine retrieval necessity and balance the trade-off between performance gain and inference speed. Experimental results show that the Llama2-7B-based Open-RAG outperforms state-of-the-art LLMs and RAG models such as ChatGPT, Self-RAG, and Command R+ in various knowledge-intensive tasks. We open-source our code and models at this https URL

[NLP-6] Composing Global Optimizers to Reasoning Tasks via Algebraic Objects in Neural Nets

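[Quick read]: This paper studies how global optimal solutions can be composed from partial solutions in 2-layer neural networks with quadratic activation and L2 loss, trained on reasoning tasks in Abelian groups such as modular addition. The key is showing that the weight space over different numbers of hidden nodes carries a semi-ring algebraic structure, and that the loss consists of monomial potentials that are ring homomorphisms, so partial solutions can be composed into global ones by ring addition and multiplication. The framework is called CoGO (Composing Global Optimizers); in the experiments, around 95% of the solutions found by gradient descent exactly match the theoretical constructions.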

Link: https://arxiv.org/abs/2410.01779
Authors: Yuandong Tian
Keywords: Abelian group, tasks in Abelian, prove rich algebraic, trained on reasoning, quadratic activation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Commutative Algebra (math.AC); Rings and Algebras (math.RA)
Comments:

Abstract:We prove rich algebraic structures of the solution space for 2-layer neural networks with quadratic activation and L_2 loss, trained on reasoning tasks in Abelian group (e.g., modular addition). Such a rich structure enables analytical construction of global optimal solutions from partial solutions that only satisfy part of the loss, despite its high nonlinearity. We coin the framework as CoGO (Composing Global Optimizers). Specifically, we show that the weight space over different numbers of hidden nodes of the 2-layer network is equipped with a semi-ring algebraic structure, and the loss function to be optimized consists of monomial potentials, which are ring homomorphism, allowing partial solutions to be composed into global ones by ring addition and multiplication. Our experiments show that around 95% of the solutions obtained by gradient descent match exactly our theoretical constructions. Although the global optimizers constructed only required a small number of hidden nodes, our analysis on gradient dynamics shows that over-parameterization asymptotically decouples training dynamics and is beneficial. We further show that training dynamics favors simpler solutions under weight decay, and thus high-order global optimizers such as perfect memorization are unfavorable.
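
To see why a 2-layer network with quadratic activation can solve modular addition exactly, here is a hand-built solution. It is a textbook Fourier-style construction chosen for illustration and is not claimed to be the CoGO construction from the paper. It relies on the identity ((x + y)^2 - (x - y)^2) / 4 = x * y, so each product below corresponds to two quadratic hidden nodes whose inputs are linear in the one-hot encodings of a and b.

```python
import math
from typing import List

def product_via_squares(x: float, y: float) -> float:
    """x * y written as a difference of two quadratic hidden-node activations."""
    return ((x + y) ** 2 - (x - y) ** 2) / 4

def forward(a: int, b: int, d: int) -> List[float]:
    """Output logits over c in 0..d-1 for the task (a + b) mod d."""
    logits = [0.0] * d
    for k in range(1, d):  # one group of hidden nodes per nonzero frequency
        t = 2 * math.pi * k / d
        ca, sa = math.cos(t * a), math.sin(t * a)
        cb, sb = math.cos(t * b), math.sin(t * b)
        cos_sum = product_via_squares(ca, cb) - product_via_squares(sa, sb)  # cos(t(a+b))
        sin_sum = product_via_squares(sa, cb) + product_via_squares(ca, sb)  # sin(t(a+b))
        for c in range(d):
            # Linear output layer: accumulates cos(t * (a + b - c)).
            logits[c] += cos_sum * math.cos(t * c) + sin_sum * math.sin(t * c)
    return logits

def predict(a: int, b: int, d: int) -> int:
    logits = forward(a, b, d)
    return max(range(d), key=logits.__getitem__)

print(predict(5, 4, 7))  # prints 2, since (5 + 4) mod 7 == 2
```

Summed over all nonzero frequencies, the logit is d - 1 at the correct answer and -1 everywhere else, so the argmax is always right. This version uses eight hidden nodes per frequency; it makes no attempt to be as compact as the solutions analysed in the paper.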

[NLP-7] DeFine: Enhancing LLM Decision-Making with Factor Profiles and Analogical Reasoning

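[Quick read]: This paper addresses how large language models (LLMs) can systematically account for uncertainty when making decisions from transcripts of spoken speech that describe complex scenarios. The key is DeFine, a new framework that constructs probabilistic factor profiles to quantify the uncertainty in such scenarios and combines them with analogical reasoning, drawing on insights from similar past experiences to guide LLMs toward critical decisions in novel situations. DeFine separates quantifying uncertainty from incorporating it into LLM decision-making, which is particularly useful in fields such as medical consultations, negotiations, and political debates, where deciding under uncertainty is vital.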

Link: https://arxiv.org/abs/2410.01772
Authors: Yebowen Hu, Xiaoyang Wang, Wenlin Yao, Yiming Lu, Daoan Zhang, Hassan Foroosh, Dong Yu, Fei Liu
Keywords: ability to reason, reason over long, long contexts, contexts and identify, Abstract
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:LLMs are ideal for decision-making due to their ability to reason over long contexts and identify critical factors. However, challenges arise when processing transcripts of spoken speech describing complex scenarios. These transcripts often contain ungrammatical or incomplete sentences, repetitions, hedging, and vagueness. For example, during a company’s earnings call, an executive might project a positive revenue outlook to reassure investors, despite significant uncertainty regarding future earnings. It is crucial for LLMs to incorporate this uncertainty systematically when making decisions. In this paper, we introduce DeFine, a new framework that constructs probabilistic factor profiles from complex scenarios. DeFine then integrates these profiles with analogical reasoning, leveraging insights from similar past experiences to guide LLMs in making critical decisions in novel situations. Our framework separates the tasks of quantifying uncertainty in complex scenarios and incorporating it into LLM decision-making. This approach is particularly useful in fields such as medical consultations, negotiations, and political debates, where making decisions under uncertainty is vital.

[NLP-8] Quantifying Generalization Complexity for Large Language Models

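[Quick read]: This paper addresses the entanglement of generalization and memorization when evaluating large language models (LLMs). The key is Scylla, a dynamic evaluation framework that quantifies generalization by assessing models on in-distribution (ID) and out-of-distribution (OOD) data through 20 tasks across 5 levels of complexity. It reveals a non-monotonic relationship between task complexity and the ID-OOD performance gap, termed the generalization valley, and identifies the critical complexity, the threshold at which reliance on non-generalizable behavior peaks, giving a more precise benchmark for understanding LLM generalization.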

Link: https://arxiv.org/abs/2410.01769
Authors: Zhenting Qi, Hongyin Luo, Xuliang Huang, Zhuokai Zhao, Yibo Jiang, Xiangjun Fan, Himabindu Lakkaraju, James Glass
Keywords: performing sophisticated tasks, shown exceptional capabilities, large language models, necessitating more precise, large language
Categories: Computation and Language (cs.CL)
Comments:

Abstract: While large language models (LLMs) have shown exceptional capabilities in understanding complex queries and performing sophisticated tasks, their generalization abilities are often deeply entangled with memorization, necessitating more precise evaluation. To address this challenge, we introduce Scylla, a dynamic evaluation framework that quantitatively measures the generalization abilities of LLMs. Scylla disentangles generalization from memorization via assessing model performance on both in-distribution (ID) and out-of-distribution (OOD) data through 20 tasks across 5 levels of complexity. Through extensive experiments, we uncover a non-monotonic relationship between task complexity and the performance gap between ID and OOD data, which we term the generalization valley. Specifically, this phenomenon reveals a critical threshold - referred to as critical complexity - where reliance on non-generalizable behavior peaks, indicating the upper bound of LLMs’ generalization capabilities. As model size increases, the critical complexity shifts toward higher levels of task complexity, suggesting that larger models can handle more complex reasoning tasks before over-relying on memorization. Leveraging Scylla and the concept of critical complexity, we benchmark 28 LLMs including both open-source models such as LLaMA and Qwen families, and closed-source models like Claude and GPT, providing a more robust evaluation and establishing a clearer understanding of LLMs’ generalization capabilities.
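
The notion of critical complexity can be illustrated with a short calculation. The sketch below assumes the gap is measured as ID accuracy minus OOD accuracy at each complexity level (the digest does not give the paper's exact metric), and the accuracy numbers are invented purely for illustration, not results from the paper.

```python
from typing import Dict, Tuple

def critical_complexity(
    id_acc: Dict[int, float], ood_acc: Dict[int, float]
) -> Tuple[int, Dict[int, float]]:
    """Return the level with the largest ID-OOD gap, plus all per-level gaps."""
    gaps = {level: id_acc[level] - ood_acc[level] for level in id_acc}
    peak_level = max(gaps, key=gaps.get)
    return peak_level, gaps

# Made-up accuracies for complexity levels 1..5 (illustrative only).
id_acc = {1: 0.95, 2: 0.90, 3: 0.80, 4: 0.50, 5: 0.20}
ood_acc = {1: 0.93, 2: 0.80, 3: 0.55, 4: 0.40, 5: 0.18}

level, gaps = critical_complexity(id_acc, ood_acc)
print(level)  # prints 3: the gap is small at both ends and peaks in the middle
```

The small gaps at the easy and hard ends with a peak in between are what make the relationship non-monotonic.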

[NLP-9] LEOPARD : A Vision Language Model For Text-Rich Multi-Image Tasks

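[Quick read]: This paper addresses two challenges that multimodal large language models (MLLMs) face on tasks involving multiple text-rich images: the scarcity of high-quality instruction-tuning datasets for this setting, and the difficulty of balancing image resolution against visual feature sequence length. The key is LEOPARD, an MLLM built on two steps. First, the authors curated about one million high-quality multimodal instruction-tuning examples tailored to text-rich, multi-image scenarios. Second, they developed an adaptive high-resolution multi-image encoding module that dynamically allocates visual sequence length based on the original aspect ratios and resolutions of the input images. As a result, the model performs strongly on text-rich, multi-image evaluations and remains competitive on general-domain evaluations.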

Link: https://arxiv.org/abs/2410.01744
Authors: Mengzhao Jia, Wenhao Yu, Kaixin Ma, Tianqing Fang, Zhihan Zhang, Siru Ouyang, Hongming Zhang, Meng Jiang, Dong Yu
Keywords: central visual element, visual element guiding, scanned documents, involving multiple text-rich, multiple text-rich images
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Our code is available at this https URL

Abstract: Text-rich images, where text serves as the central visual element guiding the overall understanding, are prevalent in real-world applications, such as presentation slides, scanned documents, and webpage snapshots. Tasks involving multiple text-rich images are especially challenging, as they require not only understanding the content of individual images but reasoning about inter-relationships and logical flows across multiple visual inputs. Despite the importance of these scenarios, current multimodal large language models (MLLMs) struggle to handle such tasks due to two key challenges: (1) the scarcity of high-quality instruction tuning datasets for text-rich multi-image scenarios, and (2) the difficulty in balancing image resolution with visual feature sequence length. To address these challenges, we propose LEOPARD, an MLLM designed specifically for handling vision-language tasks involving multiple text-rich images. First, we curated about one million high-quality multimodal instruction-tuning data, tailored to text-rich, multi-image scenarios. Second, we developed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length based on the original aspect ratios and resolutions of the input images. Experiments across a wide range of benchmarks demonstrate our model’s superior capabilities in text-rich, multi-image evaluations and competitive performance in general domain evaluations.

[NLP-10] Recursive Abstractive Processing for Retrieval in Dynamic Datasets

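[Quick read]: This paper addresses the difficulty of updating hierarchical structures in retrieval-augmented models when the dataset is dynamic. The key is a new algorithm that efficiently maintains the recursive-abstractive tree structure as documents are added or removed, without compromising performance. The paper also introduces a novel post-retrieval method that applies query-focused recursive abstractive processing to substantially improve context quality. It works as a black-box post-retrieval layer compatible with any retrieval algorithm, which overcomes the limitations of other approaches.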

Link: https://arxiv.org/abs/2410.01736
Authors: Charbel Chucri, Rami Azouz, Joachim Ott
Keywords: Recent retrieval-augmented models, retrieval-augmented models enhance, models enhance basic, Recent retrieval-augmented, retrieved text chunks
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Recent retrieval-augmented models enhance basic methods by building a hierarchical structure over retrieved text chunks through recursive embedding, clustering, and summarization. The most relevant information is then retrieved from both the original text and generated summaries. However, such approaches face limitations with dynamic datasets, where adding or removing documents over time complicates the updating of hierarchical representations formed through clustering. We propose a new algorithm to efficiently maintain the recursive-abstractive tree structure in dynamic datasets, without compromising performance. Additionally, we introduce a novel post-retrieval method that applies query-focused recursive abstractive processing to substantially improve context quality. Our method overcomes the limitations of other approaches by functioning as a black-box post-retrieval layer compatible with any retrieval algorithm. Both algorithms are validated through extensive experiments on real-world datasets, demonstrating their effectiveness in handling dynamic data and improving retrieval performance.

[NLP-11] LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits

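[Quick read]: This paper addresses the problem that a single fixed reward model (RM) may not suit every task when training large language models (LLMs). The key is LASeR (Learning to Adaptively Select Rewards), which frames the selection and use of multiple RMs as a multi-armed bandit problem and picks the most well-suited RM for each instance during iterative training. The method improves accuracy on commonsense and math reasoning tasks, improves training efficiency, and also performs better on long-context generation tasks.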

Link: https://arxiv.org/abs/2410.01735
Authors: Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
Keywords: Reward Models, play a crucial, crucial role, role in aligning, multiple RMs
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 20 pages; First two authors contributed equally. Code: this https URL

Abstract:Reward Models (RMs) play a crucial role in aligning LLMs with human preferences, enhancing their performance by ranking outputs during inference or iterative training. However, the degree to which an RM generalizes to new tasks is often not known a priori (e.g. some RMs may excel at scoring creative writing vs. math reasoning). Therefore, using only one fixed RM while training LLMs can be suboptimal. Moreover, optimizing LLMs with multiple RMs simultaneously can be prohibitively computationally-intensive and challenging due to conflicting signals from different RMs, potentially degrading performance. To address these challenges, we introduce LASeR (Learning to Adaptively Select Rewards), which iteratively trains LLMs using multiple RMs, selecting and utilizing the most well-suited RM for each instance to rank outputs and generate preference data, framed as a multi-armed bandit problem. Our results on commonsense and math reasoning tasks demonstrate that LASeR can boost iterative LLM optimization by optimizing for multiple RMs, improving the absolute average accuracy of Llama-3-8B over three datasets by 2.67% over training with ensemble RM scores while also showing superior training efficiency (e.g., a 2x speedup). Moreover, on WildChat, a benchmark of instruction-following prompts, we find that using Llama-3-8B LASeR leads to a 71.45% AlpacaEval win rate over sequentially optimizing multiple RMs. Extending to long-context generation tasks, we find that on Llama-3-8B, LASeR achieves an average improvement of 2.64 F1 and 2.42 F1 on single- and multi-document QA over random RM selection when used with best-of-n sampling. LASeR is robust to noisy rewards and generalizes to multiple settings. Finally, LASeR’s RM selection changes depending on the underlying task or instance and we verify the presence of conflicting preferences from multiple RMs that can be mitigated using LASeR.
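
Framing reward-model selection as a multi-armed bandit can be sketched with a standard UCB1 loop, where each arm is one RM. UCB1 is swapped in here as a stand-in because the digest does not say which bandit algorithm LASeR uses; `feedback` is a hypothetical callback returning how useful the chosen RM turned out to be, and the toy usefulness values are assumptions.

```python
import math
from typing import Callable, List

def ucb1_select(counts: List[int], totals: List[float], t: int) -> int:
    """Pick an arm: try each arm once, then maximise mean + exploration bonus."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        totals[arm] / counts[arm] + math.sqrt(2 * math.log(t) / counts[arm])
        for arm in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)

def run_bandit(num_rms: int, feedback: Callable[[int], float], rounds: int) -> List[int]:
    """Return how often each reward model was selected over `rounds` steps."""
    counts = [0] * num_rms
    totals = [0.0] * num_rms
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, totals, t)
        # In a real training loop, the chosen RM would rank sampled outputs here.
        counts[arm] += 1
        totals[arm] += feedback(arm)
    return counts

# Toy setting: three RMs with fixed, assumed usefulness on the current task.
usefulness = [0.2, 0.5, 0.8]
counts = run_bandit(3, lambda arm: usefulness[arm], rounds=1000)
print(counts)  # the third RM is selected most often
```

The exploration bonus shrinks as an RM is tried more often, so poorly suited RMs are still sampled occasionally, which lets the selection shift if a different RM becomes more useful.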

[NLP-12] Visual Perception in Text Strings

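[Quick read]: This paper examines how well large language models (LLMs) and multi-modal large language models (MLLMs) understand visual semantics embedded in consecutive characters. The approach selects ASCII art as a representative artifact and frames the problem as an ASCII art recognition task, building an evaluation dataset with an elaborate categorization tree, collecting a training set, and analyzing dozens of models. Although humans reach nearly 100% accuracy, state-of-the-art LLMs and MLLMs lag far behind: with text-only input, most models average only around 30% accuracy across categories. Supervised fine-tuning improves accuracy, especially with the image modality, but the results also highlight the need for better training techniques to enhance information fusion across modalities.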

Link: https://arxiv.org/abs/2410.01733
Authors: Qi Jia, Xiang Yue, Shanshan Huang, Ziheng Qin, Yizhu Liu, Bill Yuchen Lin, Yang You
Keywords: multi-modal large language, large language models, large language, Understanding visual semantics, visual semantics embedded
Categories: Computation and Language (cs.CL)
Comments:

Abstract: Understanding visual semantics embedded in consecutive characters is a crucial capability for both large language models (LLMs) and multi-modal large language models (MLLMs). This type of artifact possesses the unique characteristic that identical information can be readily formulated in both texts and images, making them a significant proxy for analyzing modern LLMs’ and MLLMs’ capabilities in modality-agnostic vision understanding. In this work, we select ASCII art as a representative artifact, where the lines and brightness used to depict each concept are rendered by characters, and we frame the problem as an ASCII art recognition task. We benchmark model performance on this task by constructing an evaluation dataset with an elaborate categorization tree and also collect a training set to elicit the models’ visual perception ability. Through a comprehensive analysis of dozens of models, results reveal that although humans can achieve nearly 100% accuracy, the state-of-the-art LLMs and MLLMs lag far behind. Models are capable of recognizing concepts depicted in ASCII art given only text inputs, as indicated by over 60% accuracy for some concepts, but most of them achieve merely around 30% accuracy when averaged across all categories. When provided with images as inputs, GPT-4o gets 82.68%, outperforming the strongest open-source MLLM by 21.95%. Although models favor different kinds of ASCII art depending on the modality provided, none of the MLLMs successfully benefit when both modalities are supplied simultaneously. Moreover, supervised fine-tuning helps improve models’ accuracy especially when provided with the image modality, but also highlights the need for better training techniques to enhance the information fusion among modalities.

[NLP-13] ComfyGen: Prompt-Adaptive Workflows for Text-to-Image Generation

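[Quick read]: This paper addresses the problem that building an effective text-to-image workflow by hand takes considerable expertise, because there are many components, their dependencies are complex, and the right choice depends on the generation prompt. The key to the solution is two LLM-based methods: one tuned on user-preference data, and a training-free one that uses the LLM to select among existing workflows. Both methods automatically tailor the workflow to each user prompt and yield better image quality than monolithic models or generic, prompt-independent workflows.

Link: https://arxiv.org/abs/2410.01731
Authors: Rinon Gal, Adi Haviv, Yuval Alaluf, Amit H. Bermano, Daniel Cohen-Or, Gal Chechik
Keywords: combine multiple specialized, multiple specialized components, evolved from simple, combine multiple, multiple specialized
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Graphics (cs.GR)
Comments: Project website: this https URL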

Abstract:The practical use of text-to-image generation has evolved from simple, monolithic models to complex workflows that combine multiple specialized components. While workflow-based approaches can lead to improved image quality, crafting effective workflows requires significant expertise, owing to the large number of available components, their complex inter-dependence, and their dependence on the generation prompt. Here, we introduce the novel task of prompt-adaptive workflow generation, where the goal is to automatically tailor a workflow to each user prompt. We propose two LLM-based approaches to tackle this task: a tuning-based method that learns from user-preference data, and a training-free method that uses the LLM to select existing flows. Both approaches lead to improved image quality when compared to monolithic models or generic, prompt-independent workflows. Our work shows that prompt-dependent flow prediction offers a new pathway to improving text-to-image generation quality, complementing existing research directions in the field.

[NLP-14] Evaluating Robustness of Reward Models for Mathematical Reasoning

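[Quick read]: This paper addresses the unreliability of existing reward-model evaluation on mathematical reasoning tasks. In particular, the math subset of RewardBench relies on a single comparison, which can give inaccurate results and a misleading picture of reward-model performance. The key to the solution is a new evaluation design, validated through the RewardMATH benchmark, which effectively represents the robustness of reward models on mathematical reasoning tasks. The results show that RewardMATH scores correlate strongly with the results of the optimized policy and effectively estimate reward overoptimization, whereas the existing benchmark shows almost no correlation.

Link: https://arxiv.org/abs/2410.01729
Authors: Sunghwan Kim, Dongjin Kang, Taeyoon Kwon, Hyungjoo Chae, Jungsoo Won, Dongha Lee, Jinyoung Yeo
Keywords: Reward models, human feedback, human preferences, Reward, key in reinforcement
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Work in progress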

Abstract:Reward models are key in reinforcement learning from human feedback (RLHF) systems, aligning the model behavior with human preferences. Particularly in the math domain, there have been plenty of studies using reward models to align policies for improving reasoning capabilities. Recently, as the importance of reward models has been emphasized, RewardBench is proposed to understand their behavior. However, we figure out that the math subset of RewardBench has different representations between chosen and rejected completions, and relies on a single comparison, which may lead to unreliable results as it only see an isolated case. Therefore, it fails to accurately present the robustness of reward models, leading to a misunderstanding of its performance and potentially resulting in reward hacking. In this work, we introduce a new design for reliable evaluation of reward models, and to validate this, we construct RewardMATH, a benchmark that effectively represents the robustness of reward models in mathematical reasoning tasks. We demonstrate that the scores on RewardMATH strongly correlate with the results of optimized policy and effectively estimate reward overoptimization, whereas the existing benchmark shows almost no correlation. The results underscore the potential of our design to enhance the reliability of evaluation, and represent the robustness of reward model. We make our code and data publicly available.

[NLP-15] Automated Knowledge Concept Annotation and Question Representation Learning for Knowledge Tracing

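[Quick read]: This paper addresses two main limitations of existing knowledge tracing (KT) methods: they depend on expert-defined knowledge concepts (KCs), which are time-consuming to produce and error-prone, and they overlook the semantics of questions and KCs. The key to the solution is KCQRL, a framework that improves KT models through automated KC annotation and question representation learning. Specifically, KCQRL uses large language models (LLMs) to generate question solutions and annotate KCs, and applies contrastive learning with a tailored false negative elimination strategy to produce semantically rich embeddings that align questions and solution steps with their KCs. These embeddings can directly replace the randomly initialized embeddings in existing KT models, which improves performance.

Link: https://arxiv.org/abs/2410.01727
Authors: Yilmazcan Ozyurt, Stefan Feuerriegel, Mrinmaya Sachan
Keywords: modeling students’ learning, students’ learning progress, progress over time, modeling students’, enable more personalized
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: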

Abstract:Knowledge tracing (KT) is a popular approach for modeling students’ learning progress over time, which can enable more personalized and adaptive learning. However, existing KT approaches face two major limitations: (1) they rely heavily on expert-defined knowledge concepts (KCs) in questions, which is time-consuming and prone to errors; and (2) KT methods tend to overlook the semantics of both questions and the given KCs. In this work, we address these challenges and present KCQRL, a framework for automated knowledge concept annotation and question representation learning that can improve the effectiveness of any existing KT model. First, we propose an automated KC annotation process using large language models (LLMs), which generates question solutions and then annotates KCs in each solution step of the questions. Second, we introduce a contrastive learning approach to generate semantically rich embeddings for questions and solution steps, aligning them with their associated KCs via a tailored false negative elimination approach. These embeddings can be readily integrated into existing KT models, replacing their randomly initialized embeddings. We demonstrate the effectiveness of KCQRL across 15 KT algorithms on two large real-world Math learning datasets, where we achieve consistent performance improvements.
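
The contrastive step with false negative elimination can be illustrated with a minimal sketch. The abstract does not spell out the elimination rule, so the code below assumes one plausible version: a candidate is dropped from the negatives when it shares at least one KC with the anchor. The function name, the KC sets, and the temperature value are all hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence, Set


def info_nce_without_false_negatives(
    sims: Sequence[float],
    positive_idx: int,
    anchor_kcs: Set[str],
    candidate_kcs: List[Set[str]],
    temperature: float = 0.1,
) -> float:
    """InfoNCE loss for one anchor, skipping negatives that share a KC with it.

    sims[i] is the similarity between the anchor and candidate i.
    The sharing rule is an assumption made for illustration only.
    """
    # Keep the positive, plus every candidate that shares no KC with the anchor.
    kept = [
        i for i in range(len(sims))
        if i == positive_idx or not (anchor_kcs & candidate_kcs[i])
    ]
    logits = [sims[i] / temperature for i in kept]
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - sims[positive_idx] / temperature


if __name__ == "__main__":
    sims = [0.9, 0.8, 0.1]
    kcs = [{"fractions"}, {"fractions"}, {"geometry"}]
    print(info_nce_without_false_negatives(sims, 0, {"fractions"}, kcs))
```

In this toy case the second candidate covers the same KC as the anchor, so treating it as a negative would penalize a pair that should stay close; removing it lowers the loss.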

[NLP-16] Auto-Demo Prompting: Leveraging Generated Outputs as Demonstrations for Enhanced Batch Prompting

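[Quick read]: This paper addresses the performance degradation that occurs with batch prompting in large language models (LLMs) as the batch size grows, caused by the model's difficulty in handling long context inputs. The key to the solution is a new method called Auto-Demo Prompting, which uses the question-output pairs from earlier questions in a batch as demonstrations for inferring later answers, thereby optimizing the model's internal representations. In this way the method combines the advantages of batch prompting and few-shot prompting and improves performance with only a slight increase in token usage.

Link: https://arxiv.org/abs/2410.01724
Authors: Longyu Feng, Mengze Hong, Chen Jason Zhang
Keywords: improve computational efficiency, multiple inputs simultaneously, large language models, aiming to improve, computational efficiency
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: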

Abstract:Batch prompting is a common technique in large language models (LLMs) used to process multiple inputs simultaneously, aiming to improve computational efficiency. However, as batch sizes increase, performance degradation often occurs due to the model’s difficulty in handling lengthy context inputs. Existing methods that attempt to mitigate these issues rely solely on batch data arrangement and majority voting rather than improving the design of the batch prompt itself. In this paper, we address these limitations by proposing “Auto-Demo Prompting,” a novel approach that leverages the question-output pairs from earlier questions within a batch as demonstrations for subsequent answer inference. We provide a formal theoretical analysis of how Auto-Demo Prompting functions within the autoregressive generation process of LLMs, illustrating how it utilizes prior outputs to optimize the model’s internal representations. Our method effectively bridges the gap between batch prompting and few-shot prompting, enhancing performance with only a slight compromise in token usage. Experimental results across five NLP tasks demonstrate its effectiveness in mitigating performance degradation and occasionally outperforming single prompts. Furthermore, it opens new avenues for applying few-shot learning techniques, such as demonstration selection, within batch prompting, making it a robust solution for real-world applications.
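
As a rough illustration of the idea, the sketch below builds a single batch prompt that asks the model to restate each question before answering it, so that earlier question-answer pairs are present in the generated context when later questions are answered. The instruction wording, the `Q<i>`/`A<i>` format, and the function name are assumptions made for this example; the abstract does not give the authors' actual prompt template.

```python
from typing import List


def build_auto_demo_prompt(questions: List[str]) -> str:
    """Build one batch prompt whose output interleaves questions and answers.

    Because generation is autoregressive, the pairs produced for earlier
    items are in context (acting like demonstrations) for later items.
    """
    lines = [
        "Answer the questions below in order.",
        "For each one, first repeat the question, then give the answer.",
        "Use this format for every item:",
        "Q<i>: <question text>",
        "A<i>: <answer>",
        "",
        "Questions:",
    ]
    for i, question in enumerate(questions, start=1):
        lines.append(f"Q{i}: {question}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_auto_demo_prompt(["What is 2 + 2?", "What is 3 + 3?"]))
```

The resulting string would be sent to whatever LLM client is in use; no particular API is assumed here.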

[NLP-17] Towards a Theoretical Understanding of Synthetic Data in LLM Post-Training: A Reverse-Bottleneck Perspective

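[Quick read]: This paper addresses the gap between the practical effect of synthetic data in post-training tasks for large language models (LLMs) and the theoretical understanding of it. The key to the solution is a detailed model of the synthetic data generation process, together with an analysis from a reverse-bottleneck perspective showing that the information gain derived from the generative model critically determines the generalization capability of the post-trained model. The paper further introduces the concept of Generalization Gain via Mutual Information (GGMI) and clarifies the relationship between generalization gain and information gain, providing a theoretical foundation for synthetic data generation techniques and for optimizing the post-training process.

Link: https://arxiv.org/abs/2410.01720
Authors: Zeyu Gan, Yong Liu
Keywords: synthetic data generation, Synthetic data, large language models, generate synthetic data, prevalent synthetic data
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: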

Abstract:Synthetic data has become a pivotal resource in post-training tasks for large language models (LLMs) due to the scarcity of high-quality, specific data. While various methods have been developed to generate synthetic data, there remains a discernible gap between the practical effects of synthetic data and our theoretical comprehension. To address this challenge, we commence by presenting a detailed modeling of the prevalent synthetic data generation process. Building upon this modeling, we demonstrate that the generalization capability of the post-trained model is critically determined by the information gain derived from the generative model, as analyzed from a novel reverse-bottleneck perspective. Moreover, we introduce the concept of Generalization Gain via Mutual Information (GGMI) and elucidate the relationship between generalization gain and information gain. This analysis serves as a theoretical foundation for synthetic data generation and further highlights its connection with the generalization capability of post-trained models, offering an understanding about the design of synthetic data generation techniques and the optimization of the post-training process. We open source our code through an anonymous GitHub repository at this https URL.

[NLP-18] Examining the Role of Relationship Alignment in Large Language Models

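[Quick read]: This paper asks how generative AI in social settings can best be personalized for users while staying accurate and realistic. The key to the approach is to evaluate how well Llama 3.0 (70B) predicts the semantic tone of comments across combinations of commenter and poster gender, age, and friendship closeness, and to measure how similar its generated comments are to human comments. The study finds that including social relationship information improves a model's ability to predict semantic tone, but that putting all social relationship information in the prompt lowers the similarity between generated and human comments, possibly because the LLM's training data did not include social context information. The paper therefore highlights that LLMs can comprehend semantics from the original post, and also that they are limited in generating personalized comments through prompting alone.

Link: https://arxiv.org/abs/2410.01708
Authors: Kristen M. Altenburger, Hongda Jiang, Robert E. Kraut, Yi-Chia Wang, Jane Dwivedi-Yu
Keywords: settings raise important, raise important questions, deployment of Generative, social settings raise, human comments
Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: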

Abstract:The rapid development and deployment of Generative AI in social settings raise important questions about how to optimally personalize them for users while maintaining accuracy and realism. Based on a Facebook public post-comment dataset, this study evaluates the ability of Llama 3.0 (70B) to predict the semantic tones across different combinations of a commenter’s and poster’s gender, age, and friendship closeness and to replicate these differences in LLM-generated comments. The study consists of two parts: Part I assesses differences in semantic tones across social relationship categories, and Part II examines the similarity between comments generated by Llama 3.0 (70B) and human comments from Part I given public Facebook posts as input. Part I results show that including social relationship information improves the ability of a model to predict the semantic tone of human comments. However, Part II results show that even without including social context information in the prompt, LLM-generated comments and human comments are equally sensitive to social context, suggesting that LLMs can comprehend semantics from the original post alone. When we include all social relationship information in the prompt, the similarity between human comments and LLM-generated comments decreases. This inconsistency may occur because LLMs did not include social context information as part of their training data. Together these results demonstrate the ability of LLMs to comprehend semantics from the original post and respond similarly to human comments, but also highlights their limitations in generalizing personalized comments through prompting alone.

[NLP-19] Interpretable Contrastive Monte Carlo Tree Search Reasoning

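[Quick read]: This paper addresses shortcomings in speed and reward-model design in existing Monte Carlo Tree Search (MCTS) reasoning algorithms for large language models (LLMs). The key points of the solution are: 1) a highly interpretable reward model based on the principle of contrastive decoding; 2) an average speed improvement of 51.9% per node through speculative decoding; 3) an improved UCT node selection strategy and backpropagation mechanism, which significantly raise reasoning performance. With these changes, SC-MCTS* with Llama-3.1-70B outperforms o1-mini by an average of 17.4% on the Blocksworld multi-step reasoning dataset.

Link: https://arxiv.org/abs/2410.01707
Authors: Zitian Gao, Boye Niu, Xuzheng He, Haotian Xu, Hongzhang Liu, Aiwei Liu, Xuming Hu, Lijie Wen
Keywords: Carlo Tree Search, Monte Carlo Tree, Large Language Models, Tree Search, Monte Carlo
Categories: Computation and Language (cs.CL)
Comments: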

Abstract:We propose SC-MCTS*: a novel Monte Carlo Tree Search (MCTS) reasoning algorithm for Large Language Models (LLMs), significantly improves both reasoning accuracy and speed. Our motivation comes from: 1. Previous MCTS LLM reasoning works often overlooked its biggest drawback–slower speed compared to CoT; 2. Previous research mainly used MCTS as a tool for LLM reasoning on various tasks with limited quantitative analysis or ablation studies of its components from reasoning interpretability perspective. 3. The reward model is the most crucial component in MCTS, however previous work has rarely conducted in-depth study or improvement of MCTS’s reward models. Thus, we conducted extensive ablation studies and quantitative analysis on components of MCTS, revealing the impact of each component on the MCTS reasoning performance of LLMs. Building on this, (i) we designed a highly interpretable reward model based on the principle of contrastive decoding and (ii) achieved an average speed improvement of 51.9% per node using speculative decoding. Additionally, (iii) we improved UCT node selection strategy and backpropagation used in previous works, resulting in significant performance improvement. We outperformed o1-mini by an average of 17.4% on the Blocksworld multi-step reasoning dataset using Llama-3.1-70B with SC-MCTS*.
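
For readers unfamiliar with UCT, the sketch below shows the textbook UCT (UCB1 applied to trees) selection rule that MCTS variants build on. It is the standard formula, not the improved selection strategy of SC-MCTS*, whose details the abstract does not give. The dictionary layout of a child node and the default exploration constant are assumptions made for the example.

```python
import math
from typing import Dict, List


def uct_score(total_value: float, visits: int, parent_visits: int,
              c: float = math.sqrt(2)) -> float:
    """Standard UCT score: mean value plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    exploitation = total_value / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration


def select_child(children: List[Dict[str, float]], parent_visits: int,
                 c: float = math.sqrt(2)) -> Dict[str, float]:
    """Pick the child with the highest UCT score."""
    return max(
        children,
        key=lambda ch: uct_score(ch["value"], int(ch["visits"]), parent_visits, c),
    )


if __name__ == "__main__":
    kids = [{"value": 9.0, "visits": 10}, {"value": 2.0, "visits": 2}]
    print(select_child(kids, parent_visits=12))
```

A larger `c` favors rarely visited nodes; a smaller `c` favors nodes whose average reward is already high.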

[NLP-20] An Exploration of Self-Supervised Mutual Information Alignment for Multi-Task Settings

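[Quick read]: This paper addresses how pluralistic alignment methods can steer language models toward individual attributes and preferences in multi-task settings. The key is the method Self-Supervised Alignment with Mutual Information (SAMI), which uses conditional mutual information to strengthen the connection between behavioral preferences and model responses. Experiments on a multi-task benchmark (MT-Bench) and on mathematical accuracy (GSM-8K) examine how SAMI behaves in multi-task settings; in particular, in the multi-attempt setting, combining SAMI with supervised fine-tuning (SFT) yields a further improvement.

Link: https://arxiv.org/abs/2410.01704
Authors: Soham Govande
Keywords: pluralistic alignment methods, steer language models, Mutual Information, Direct Preference Optimization, SAMI
Categories: Computation and Language (cs.CL)
Comments: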

Abstract:There is a growing need for pluralistic alignment methods that can steer language models towards individual attributes and preferences. One such method, Self-Supervised Alignment with Mutual Information (SAMI), uses conditional mutual information to encourage the connection between behavioral preferences and model responses. We conduct two experiments exploring SAMI in multi-task settings. First, we compare SAMI to Direct Preference Optimization (DPO) on a multi-task benchmark (MT-Bench), using a stronger model to generate training data for a weaker one across diverse categories (humanities, STEM, extraction, coding, math, reasoning, and roleplay). Our results indicate that one iteration of SAMI has a 57% win rate against DPO, with significant variation in performance between task categories. Second, we examine SAMI’s impact on mathematical accuracy (GSM-8K) relative to supervised fine-tuning (SFT). While SAMI increases zero-shot performance by 1.1%, SFT is more effective with a 3.2% boost. However, SAMI shows interesting scaling trends. When given 10 attempts, SAMI improves accuracy by 3.9%, while SFT achieves a 10.1% increase. Combining SAMI with SFT yields an additional improvement of 1.3% in multi-attempt settings, though single-attempt accuracy remains unchanged.

[NLP-21] CreDes: Causal Reasoning Enhancement and Dual-End Searching for Solving Long-Range Reasoning Problems using LLMs

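[Quick read]: This paper addresses the limitations of large language models on combinatorial optimization problems that involve long-range reasoning, which stem mainly from causal hallucinations and a huge search space. The key to the solution is the Causal Relationship Enhancement (CRE) mechanism and the Dual-End Searching (DES) method. CRE combines cause-effect interventions with the Individual Treatment Effect (ITE) to ensure that each reasoning step is causally consistent with its state transition, while DES shrinks the search space by searching the causal probability tree from the initial state and the goal state at the same time, improving both accuracy and time efficiency on long-range reasoning tasks. By integrating CRE and DES, the proposed CreDes model performs multi-step reasoning simultaneously and avoids the inefficiency of cascading single reasoning steps as in Chain-of-Thought (CoT).

Link: https://arxiv.org/abs/2410.01696
Authors: Kangsheng Wang, Xiao Zhang, Hao Liu, Songde Han, Huimin Ma, Tianyu Hu
Keywords: Large language models, handling combinatorial optimization, combinatorial optimization problems, optimization problems involving, Individual Treatment Effect
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: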

Abstract:Large language models (LLMs) have demonstrated limitations in handling combinatorial optimization problems involving long-range reasoning, partially due to causal hallucinations and huge search space. As for causal hallucinations, i.e., the inconsistency between reasoning and corresponding state transition, this paper introduces the Causal Relationship Enhancement (CRE) mechanism combining cause-effect interventions and the Individual Treatment Effect (ITE) to guarantee the solid causal rightness between each step of reasoning and state transition. As for the long causal range and huge search space limiting the performances of existing models featuring single-direction search, a Dual-End Searching (DES) approach is proposed to seek solutions by simultaneously starting from both the initial and goal states on the causal probability tree. By integrating CRE and DES (CreDes), our model has realized simultaneous multi-step reasoning, circumventing the inefficiencies from cascading multiple one-step reasoning like the Chain-of-Thought (CoT). Experiments demonstrate that CreDes significantly outperforms existing State-Of-The-Art (SOTA) solutions in long-range reasoning tasks in terms of both accuracy and time efficiency.
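
The intuition behind searching from both ends can be shown with a generic bidirectional breadth-first search. This is a plain graph-search sketch, not the DES procedure on a causal probability tree described in the paper; it assumes an undirected state graph (every move is reversible), and the `neighbors` callback and function name are hypothetical.

```python
from collections import deque
from typing import Callable, Hashable, Iterable, Optional


def bidirectional_bfs(neighbors: Callable[[Hashable], Iterable[Hashable]],
                      start: Hashable, goal: Hashable) -> Optional[int]:
    """Return the number of edges on a shortest path, or None if unreachable."""
    if start == goal:
        return 0
    dist_fwd, dist_bwd = {start: 0}, {goal: 0}
    frontier_fwd, frontier_bwd = deque([start]), deque([goal])
    while frontier_fwd and frontier_bwd:
        # Expand the smaller frontier by one full level.
        if len(frontier_fwd) <= len(frontier_bwd):
            frontier, dist, other = frontier_fwd, dist_fwd, dist_bwd
        else:
            frontier, dist, other = frontier_bwd, dist_bwd, dist_fwd
        best = None
        for _ in range(len(frontier)):
            node = frontier.popleft()
            for nxt in neighbors(node):
                if nxt in other:  # the two searches meet at nxt
                    length = dist[node] + 1 + other[nxt]
                    best = length if best is None else min(best, length)
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    frontier.append(nxt)
        if best is not None:
            return best
    return None


if __name__ == "__main__":
    ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
    print(bidirectional_bfs(lambda n: ring[n], 0, 3))
```

Each side only has to reach roughly half the solution depth, which is why a two-ended search typically touches far fewer states than a one-directional search when the branching factor is large.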

[NLP-22] U-shaped and Inverted-U Scaling behind Emergent Abilities of Large Language Models

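[Quick read]: This paper addresses the problem of predicting the emergent abilities that large language models (LLMs) show on downstream tasks. The key to the solution is a simple yet effective pipeline called Slice-and-Sandwich, which analyzes performance trends on questions of different difficulty (U-shaped and inverted-U scaling) to predict both the emergence threshold and the performance gain beyond it. Specifically, the method exploits the opposing scaling trends on easy and hard questions to forecast model performance.

Link: https://arxiv.org/abs/2410.01692
Authors: Tung-Yu Wu, Pei-Yu Lo
Keywords: Large language models, exhibit emergent abilities, Large language, downstream tasks, shown to exhibit
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint. Under review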

Abstract:Large language models (LLMs) have been shown to exhibit emergent abilities in some downstream tasks, where performance seems to stagnate at first and then improve sharply and unpredictably with scale beyond a threshold. By dividing questions in the datasets according to difficulty level by average performance, we observe U-shaped scaling for hard questions, and inverted-U scaling followed by steady improvement for easy questions. Moreover, the emergence threshold roughly coincides with the point at which performance on easy questions reverts from inverse scaling to standard scaling. Capitalizing on the observable though opposing scaling trend on easy and hard questions, we propose a simple yet effective pipeline, called Slice-and-Sandwich, to predict both the emergence threshold and model performance beyond the threshold.
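
The "slice" step, dividing questions by difficulty according to average performance, might look like the sketch below. The abstract does not state how many groups are used or where the cutoffs lie, so the two-way split at the median is an assumption, as are the function name and the input layout (one 0/1 correctness flag per model for each question).

```python
from statistics import mean, median
from typing import Dict, List, Tuple


def slice_by_difficulty(
    correct_by_question: Dict[str, List[int]],
) -> Tuple[List[str], List[str]]:
    """Split question ids into (easy, hard) by average accuracy across models."""
    average = {qid: mean(flags) for qid, flags in correct_by_question.items()}
    cutoff = median(average.values())  # assumed cutoff: the median accuracy
    easy = sorted(qid for qid, acc in average.items() if acc >= cutoff)
    hard = sorted(qid for qid, acc in average.items() if acc < cutoff)
    return easy, hard


if __name__ == "__main__":
    results = {
        "q1": [1, 1, 1, 0],
        "q2": [0, 0, 1, 0],
        "q3": [1, 1, 0, 0],
        "q4": [0, 0, 0, 0],
    }
    print(slice_by_difficulty(results))
```

Scaling curves would then be fitted separately on each group, which is where the U-shaped and inverted-U trends described in the abstract would become visible.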

[NLP-23] FactAlign: Long-form Factuality Alignment of Large Language Models EMNLP2024

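[Quick read]: This paper addresses hallucination and non-factual content in the long-form responses of large language models (LLMs), and in particular how to ensure the factual accuracy of such responses. The key to the solution is the FactAlign framework, which introduces fKTO, a fine-grained, sentence-level alignment algorithm that extends Kahneman-Tversky Optimization (KTO), and combines it with automatic factuality evaluation to improve the factuality of LLM responses while preserving their helpfulness. Experiments show that FactAlign significantly improves the factual accuracy of LLM responses and also increases the amount of information they provide, which raises the factual F1 score.

Link: https://arxiv.org/abs/2410.01691
Authors: Chao-Wei Huang, Yun-Nung Chen
Keywords: Large language models, demonstrated significant potential, Large language, information access engines, next-generation information access
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to EMNLP 2024 Findings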

Abstract:Large language models have demonstrated significant potential as the next-generation information access engines. However, their reliability is hindered by issues of hallucination and generating non-factual content. This is particularly problematic in long-form responses, where assessing and ensuring factual accuracy is complex. In this paper, we address this gap by proposing FactAlign, a novel alignment framework designed to enhance the factuality of LLMs’ long-form responses while maintaining their helpfulness. We introduce fKTO, a fine-grained, sentence-level alignment algorithm that extends the Kahneman-Tversky Optimization (KTO) alignment method. Leveraging recent advances in automatic factuality evaluation, FactAlign utilizes fine-grained factuality assessments to guide the alignment process. Our experiments on open-domain prompts and information-seeking questions demonstrate that FactAlign significantly improves the factual accuracy of LLM responses while also improving their helpfulness. Further analyses identify that FactAlign is capable of training LLMs to provide more information without losing factual precision, thus improving the factual F1 score. Our source code, datasets, and trained models are publicly available at this https URL

[NLP-24] VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment

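[Quick read]: This paper addresses how to assign credit accurately in complex reasoning tasks where a large language model (LLM) receives a reward only after executing several steps. The key to the solution is a new method called VinePPO, which exploits the flexibility of language environments to compute unbiased Monte Carlo estimates in place of a large value network, avoiding the high-variance updates and suboptimal performance that value networks cause. VinePPO needs fewer gradient updates and less wall-clock time while consistently improving results on the MATH and GSM8K datasets, which demonstrates its advantage for reinforcement learning fine-tuning of LLMs.

Link: https://arxiv.org/abs/2410.01679
Authors: Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, Nicolas Le Roux
Keywords: increasingly applied, require executing, complex reasoning tasks, Proximal Policy Optimization, enhancing model performance
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: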

Abstract:Large language models (LLMs) are increasingly applied to complex reasoning tasks that require executing several complex steps before receiving any reward. Properly assigning credit to these steps is essential for enhancing model performance. Proximal Policy Optimization (PPO), a state-of-the-art reinforcement learning (RL) algorithm used for LLM finetuning, employs value networks to tackle credit assignment. However, value networks face challenges in predicting the expected cumulative rewards accurately in complex reasoning tasks, often leading to high-variance updates and suboptimal performance. In this work, we systematically evaluate the efficacy of value networks and reveal their significant shortcomings in reasoning-heavy LLM tasks, showing that they barely outperform a random baseline when comparing alternative steps. To address this, we propose VinePPO, a straightforward approach that leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates, bypassing the need for large value networks. Our method consistently outperforms PPO and other RL-free baselines across MATH and GSM8K datasets with fewer gradient updates (up to 9x), less wall-clock time (up to 3.0x). These results emphasize the importance of accurate credit assignment in RL finetuning of LLM and demonstrate VinePPO’s potential as a superior alternative.
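
A minimal sketch of Monte Carlo value estimation for credit assignment follows. It assumes a hypothetical `rollout` callable that completes a solution from a partial reasoning state and returns the final reward, no discounting, and zero intermediate reward by default; these choices, the function names, and the sample count are illustrative and are not taken from the paper.

```python
from typing import Callable


def mc_value(state: str, rollout: Callable[[str], float], k: int = 8) -> float:
    """Estimate the value of a state as the mean reward of k sampled rollouts."""
    return sum(rollout(state) for _ in range(k)) / k


def step_advantage(state: str, next_state: str,
                   rollout: Callable[[str], float],
                   k: int = 8, step_reward: float = 0.0) -> float:
    """Advantage of the step state -> next_state, with no discounting.

    Uses sampled returns for both states instead of a learned value network.
    """
    return step_reward + mc_value(next_state, rollout, k) - mc_value(state, rollout, k)


if __name__ == "__main__":
    # Toy stand-in for "let the policy finish the solution and grade it".
    def toy_rollout(state: str) -> float:
        return 1.0 if state.endswith("correct step") else 0.0

    print(step_advantage("problem", "problem + correct step", toy_rollout))
```

This works in a language environment because any intermediate text prefix can be used as a starting point for fresh samples, something that is not generally possible in environments that cannot be reset to an arbitrary state.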

[NLP-25] Trying to be human: Linguistic traces of stochastic empathy in language models

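[Quick read]: This paper addresses how to tell generated content apart from human-written content, especially as the quality of content generated by large language models (LLMs) keeps rising. The key is to study two important factors: empathy and an incentive to appear human. Across two experiments, the paper finds that humans do better when empathy is required, while the LLM improves markedly when instructed to be as human as possible. This suggests that the LLM may hold an implicit representation of what makes a text human and can easily apply these heuristics to mimic human stochastic empathy. The finding matters for assessing claims that LLMs perform on par with humans in text generation.

Link: https://arxiv.org/abs/2410.01675
Authors: Bennett Kleinberg, Jari Zegers, Jonas Festor, Stefana Vida, Julian Präsent, Riccardo Loconte, Sanne Peereboom
Keywords: modern world, navigating the modern, Differentiating between generated, human, Differentiating
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: preprint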

Abstract:Differentiating between generated and human-written content is important for navigating the modern world. Large language models (LLMs) are crucial drivers behind the increased quality of computer-generated content. Reportedly, humans find it increasingly difficult to identify whether an AI model generated a piece of text. Our work tests how two important factors contribute to the human vs AI race: empathy and an incentive to appear human. We address both aspects in two experiments: human participants and a state-of-the-art LLM wrote relationship advice (Study 1, n=530) or mere descriptions (Study 2, n=610), either instructed to be as human as possible or not. New samples of humans (n=428 and n=408) then judged the texts’ source. Our findings show that when empathy is required, humans excel. Contrary to expectations, instructions to appear human were only effective for the LLM, so the human advantage diminished. Computational text analysis revealed that LLMs become more human because they may have an implicit representation of what makes a text human and effortlessly apply these heuristics. The model resorts to a conversational, self-referential, informal tone with a simpler vocabulary to mimic stochastic empathy. We discuss these findings in light of recent claims on the on-par performance of LLMs.

[NLP-26] Bridging Context Gaps: Leveraging Coreference Resolution for Long Contextual Understanding

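Quick read: This paper addresses the difficulty large language models (LLMs) have in understanding long contexts and answering questions over them, which stems largely from the complexity and ambiguity of long texts. The key to the solution is the Long Question Coreference Adaptation (LQCA) method, which strengthens the model's handling of coreference in long contexts through four steps: resolving coreferences within sub-documents, computing the distances between mentions, defining a representative mention for each coreference, and answering questions through mention replacement. By processing information systematically, the method gives LLMs easier-to-handle partitions of the text, which promotes better understanding and improves performance on long-context question answering in the reported experiments.

Link: https://arxiv.org/abs/2410.01671
Authors: Yanming Liu, Xinyue Peng, Jiannan Cao, Shi Bo, Yanxin Shen, Xuhong Zhang, Sheng Cheng, Xun Wang, Jianwei Yin, Tianyu Du
Keywords: Large language models, shown remarkable capabilities, Large language, executing effective question, natural language processing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Underreview version of LQCA, Bridge context gap for long context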

Abstract: Large language models (LLMs) have shown remarkable capabilities in natural language processing; however, they still face difficulties when tasked with understanding lengthy contexts and executing effective question answering. These challenges often arise due to the complexity and ambiguity present in longer texts. To enhance the performance of LLMs in such scenarios, we introduce the Long Question Coreference Adaptation (LQCA) method. This innovative framework focuses on coreference resolution tailored to long contexts, allowing the model to identify and manage references effectively. The LQCA method encompasses four key steps: resolving coreferences within sub-documents, computing the distances between mentions, defining a representative mention for coreference, and answering questions through mention replacement. By processing information systematically, the framework provides easier-to-handle partitions for LLMs, promoting better understanding. Experimental evaluations on a range of LLMs and datasets have yielded positive results, with notable improvements on OpenAI-o1-mini and GPT-4o models, highlighting the effectiveness of leveraging coreference resolution to bridge context gaps in question answering.
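The last two LQCA steps (defining a representative mention and replacing the other mentions with it) can be illustrated with a minimal Python sketch. It assumes the coreference clusters are already available as character spans. The function names and the "longest mention wins" rule are illustrative assumptions, not the paper's procedure, which also relies on distances between mentions.

```python
def choose_representative(cluster):
    """Pick the representative mention of one coreference cluster.

    Assumption for illustration only: the longest mention is the most
    informative one; ties go to the earliest mention.
    """
    return max(cluster, key=lambda m: (len(m["text"]), -m["start"]))["text"]


def replace_mentions(text, clusters):
    """Rewrite `text` so every mention is replaced by its representative."""
    edits = []
    for cluster in clusters:
        representative = choose_representative(cluster)
        for mention in cluster:
            if mention["text"] != representative:
                edits.append((mention["start"], mention["end"], representative))
    # Apply edits right to left so earlier character offsets stay valid.
    for start, end, representative in sorted(edits, reverse=True):
        text = text[:start] + representative + text[end:]
    return text


def span(text, mention):
    """Hypothetical helper: span of the first occurrence of `mention`."""
    start = text.index(mention)
    return {"text": mention, "start": start, "end": start + len(mention)}


document = "Marie Curie won the prize. She shared it."
clusters = [[span(document, "Marie Curie"), span(document, "She")]]
resolved = replace_mentions(document, clusters)
```

After replacement, a question such as "Who shared the prize?" can be answered from the second sentence alone, which is the intuition behind handing LLMs easier-to-handle partitions.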

[NLP-27] Efficient Long-range Language Modeling with Self-supervised Causal Retrieval

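Quick read: This paper addresses the poor fit between existing retrieval-based language models (RLMs) and causal language models (causal LMs), which arises because the pre-trained retriever's parameters are fixed. The key to the solution is the Grouped Cross-Attention module, which enables joint pre-training of the retriever and the causal LM and is applied to long-context modeling. Specifically, the input sequence is split into chunks, and the current chunk is used to retrieve past chunks for subsequent text generation; the retriever learns end to end which past chunks to retrieve so as to minimize the auto-regressive loss of subsequent tokens. Combined with top-k retrieval, the model can be pre-trained efficiently from scratch with context lengths of up to 64K tokens.

Link: https://arxiv.org/abs/2410.01651
Authors: Xiang Hu, Zhihao Teng, Wei Wu, Kewei Tu
Keywords: retrieval-based language models, received much attention, retrieval-based language, Recently, language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: preprint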

Abstract: Recently, retrieval-based language models (RLMs) have received much attention. However, most of them leverage a pre-trained retriever with fixed parameters, which may not adapt well to causal language models. In this work, we propose Grouped Cross-Attention, a novel module enabling joint pre-training of the retriever and causal LM, and apply it to long-context modeling. For a given input sequence, we split it into chunks and use the current chunk to retrieve past chunks for subsequent text generation. Our innovation allows the retriever to learn how to retrieve past chunks that better minimize the auto-regressive loss of subsequent tokens in an end-to-end manner. By integrating top-k retrieval, our model can be pre-trained efficiently from scratch with context lengths up to 64K tokens. Our experiments show our model, compared with long-range LM baselines, can achieve lower perplexity with comparable or lower pre-training and inference costs.
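The chunk-and-retrieve pattern described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the chunk vectors are fixed toy values and the score is a dot product, whereas the paper learns the retriever jointly with the language model. The function names are hypothetical.

```python
def split_into_chunks(tokens, chunk_size):
    """Split a token list into consecutive chunks of at most `chunk_size`."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def top_k_past_chunks(chunk_vectors, current_index, k):
    """Return indices of the k past chunks most similar to the current chunk.

    Only chunks strictly before `current_index` are candidates, which keeps
    retrieval causal. Similarity is a dot product (an assumption).
    """
    query = chunk_vectors[current_index]
    scored = []
    for past_index in range(current_index):
        score = sum(q * c for q, c in zip(query, chunk_vectors[past_index]))
        scored.append((score, past_index))
    # Highest score first; ties broken by the earlier chunk.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [index for _, index in scored[:k]]


chunks = split_into_chunks(list(range(10)), chunk_size=4)
chunk_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [1.0, 0.0]]
retrieved = top_k_past_chunks(chunk_vectors, current_index=3, k=2)
```

Because only k past chunks are attended to per step, the cost per chunk stays bounded as the context grows, which is what makes long contexts affordable in this setting.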

[NLP-28] DeIDClinic: A Multi-Layered Framework for De-identification of Clinical Free-text Data

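Quick read: This paper aims to improve de-identification in healthcare text analytics in order to better protect patient privacy. The key to the solution is combining the deep learning model ClinicalBERT with traditional de-identification methods (such as dictionary lookup and rule-based approaches) to form an enhanced version of the MASK framework. Integrating ClinicalBERT significantly improves entity recognition, reaching an F1-score of 0.9732, especially for common entities such as names, dates, and locations. The system also adds a risk assessment feature that classifies documents into risk levels by analyzing the uniqueness of context within them, guiding further de-identification efforts. Although the system performs well overall, the authors note that future work is needed on handling more complex entity occurrences and on adapting the system to different clinical settings.

Link: https://arxiv.org/abs/2410.01648
Authors: Angel Paul, Dhivin Shaji, Lifeng Han, Warren Del-Pinto, Goran Nenadic
Keywords: protecting patients’ privacy, healthcare text analytics, important in protecting, protecting patients’, patients’ privacy
Categories: Computation and Language (cs.CL)
Comments: ongoing work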

Abstract: De-identification is important in protecting patients’ privacy for healthcare text analytics. The MASK framework is one of the best on the de-identification shared task organised by n2c2/i2b2 challenges. This work enhances the MASK framework by integrating ClinicalBERT, a deep learning model specifically fine-tuned on clinical texts, alongside traditional de-identification methods like dictionary lookup and rule-based approaches. The system effectively identifies and either redacts or replaces sensitive identifiable entities within clinical documents, while also allowing users to customise the masked documents according to their specific needs. The integration of ClinicalBERT significantly improves the performance of entity recognition, achieving 0.9732 F1-score, especially for common entities such as names, dates, and locations. A risk assessment feature has also been developed, which analyses the uniqueness of context within documents to classify them into risk levels, guiding further de-identification efforts. While the system demonstrates strong overall performance, this work highlights areas for future improvement, including handling more complex entity occurrences and enhancing the system’s adaptability to different clinical settings.
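The traditional layer mentioned above (dictionary lookup plus rules) can be illustrated with a small Python sketch. It is not the MASK or DeIDClinic code: the date pattern, the `[DATE]` and `[NAME]` placeholders, and the function name are assumptions chosen for the example, and the ClinicalBERT layer is not reproduced.

```python
import re

# Rule-based layer: a deliberately narrow date pattern (dd/mm/yyyy style).
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")


def deidentify(text, name_dictionary):
    """Redact dates by rule and names by dictionary lookup."""
    masked = DATE_PATTERN.sub("[DATE]", text)
    # Longest names first, so "John Smith" is masked before "John".
    for name in sorted(name_dictionary, key=len, reverse=True):
        # Word boundaries avoid masking "Smith" inside "Smithson".
        masked = re.sub(r"\b" + re.escape(name) + r"\b", "[NAME]", masked)
    return masked


note = "John Smith was admitted on 12/03/2021."
masked_note = deidentify(note, ["John Smith"])
```

A learned model such as ClinicalBERT would typically sit alongside rules like these to catch entities that no dictionary or pattern anticipates.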

[NLP-29] On The Adaptation of Unlimiformer for Decoder-Only Transformers

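Quick read: This paper addresses the limited context length of current large language models (such as LLama-2), in particular decoder-only Transformers. The key to the solution is adapting Unlimiformer, a method that offloads cross-attention computations to a kNN index, to decoder-only models through a series of modifications that overcome its out-of-the-box incompatibility. The paper also extends the experimental setup to a new task (free-form QA) and an instruction-tuned model (a custom 6.7B GPT model). The results show that the modifications are effective on summarization, performing on par with a model that has twice the context length.

Link: https://arxiv.org/abs/2410.01637
Authors: Kian Ahrabian, Alon Benhaim, Barun Patra, Jay Pujara, Saksham Singhal, Xia Song
Keywords: prominent issues stifling, limited context length, context length, large language models, prominent issues
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 6 figures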

Abstract:One of the prominent issues stifling the current generation of large language models is their limited context length. Recent proprietary models such as GPT-4 and Claude 2 have introduced longer context lengths, 8k/32k and 100k, respectively; however, despite the efforts in the community, most common models, such as LLama-2, have a context length of 4k or less. Unlimiformer (Bertsch et al., 2023) is a recently popular vector-retrieval augmentation method that offloads cross-attention computations to a kNN index. However, its main limitation is incompatibility with decoder-only transformers out of the box. In this work, we explore practical considerations of adapting Unlimiformer to decoder-only transformers and introduce a series of modifications to overcome this limitation. Moreover, we expand the original experimental setup on summarization to include a new task (i.e., free-form QA) and an instruction-tuned model (i.e., a custom 6.7B GPT model). Our results showcase the effectiveness of these modifications on summarization, performing on par with a model with 2x the context length. Moreover, we discuss limitations and future directions for free-form QA and instruction-tuned models.
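The idea of attending only to the nearest keys can be sketched as follows. This is a toy illustration, not Unlimiformer itself: a brute-force sort stands in for the kNN index, the scores are unscaled dot products, and the function name is hypothetical.

```python
import math


def knn_attention(query, keys, values, k):
    """Softmax attention restricted to the k keys with the highest scores."""
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    # Brute-force top-k; a real system would query a kNN index instead.
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    max_score = max(scores[i] for i in top)  # subtract for numerical stability
    weights = [math.exp(scores[i] - max_score) for i in top]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum((w / total) * values[i][d] for w, i in zip(weights, top))
        for d in range(dim)
    ]


keys = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
nearest_only = knn_attention([1.0, 0.0], keys, values, k=1)
```

Restricting the softmax to k retrieved keys means the attention cost no longer depends on how many keys are stored, only on k.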

[NLP-30] A Thematic Framework for Analyzing Large-scale Self-reported Social Media Data on Opioid Use Disorder Treatment Using Buprenorphine Product

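Quick read: This paper aims to characterize the information needs that patients express on social media about treatment for Opioid Use Disorder (OUD). Using a proposed theme-based framework, the study collected 15,253 posts from Reddit's r/Suboxone community, identified five main themes, and coded 6,000 posts. The key to the solution is thematic analysis, which reveals patients' information needs during recovery concerning psychological and physical effects, the complexity of accessing medication, medication administration, tapering, and the use of substances at different stages of recovery, as well as potential misconceptions in self-treatment strategies and peer-driven advice. These findings can help improve patient education and patient-provider communication, inform the design of systematic interventions to correct treatment-related misconceptions, and generate hypotheses for future research.

Link: https://arxiv.org/abs/2410.01633
Authors: Madhusudan Basak, Omar Sharif, Sarah E. Lord, Jacob T. Borodovsky, Lisa A. Marsch, Sandra A. Springer, Edward Nunes, Charlie D. Brackett, Luke J. ArchiBald, Sarah M. Preum
Keywords: Opioid Use Disorder, key FDA-approved medications, Background, OUD, Disorder
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: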

Abstract: Background: One of the key FDA-approved medications for Opioid Use Disorder (OUD) is buprenorphine. Despite its popularity, individuals often report various information needs regarding buprenorphine treatment on social media platforms like Reddit. However, the key challenge is to characterize these needs. In this study, we propose a theme-based framework to curate and analyze large-scale data from social media to characterize self-reported treatment information needs (TINs). Methods: We collected 15,253 posts from r/Suboxone, one of the largest Reddit sub-communities for buprenorphine products. Following the standard protocol, we first identified and defined five main themes from the data and then coded 6,000 posts based on these themes, where one post can be labeled with one to three applicable themes. Finally, we determined the most frequently appearing sub-themes (topics) for each theme by analyzing samples from each group. Results: Among the 6,000 posts, 40.3% contained a single theme, 36% two themes, and 13.9% three themes. The most frequent topics for each theme or theme combination came with several key findings - prevalent reporting of psychological and physical effects during recovery, complexities in accessing buprenorphine, and significant information gaps regarding medication administration, tapering, and usage of substances during different stages of recovery. Moreover, self-treatment strategies and peer-driven advice reveal valuable insights and potential misconceptions. Conclusions: The findings obtained using our proposed framework can inform better patient education and patient-provider communication, design systematic interventions to address treatment-related misconceptions and rumors, and streamline the generation of hypotheses for future research.
Cite as: arXiv:2410.01633 [cs.CY] (or arXiv:2410.01633v1 [cs.CY] for this version), https://doi.org/10.48550/arXiv.2410.01633. Submission history: [v1] Wed, 2 Oct 2024 15:04:21 UTC (408 KB), from Madhusudan Basak.

[NLP-31] Intent Detection in the Age of LLMs EMNLP2024

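Quick read: This paper addresses intent detection in task-oriented dialogue systems, in particular the limitations of traditional supervised methods when detecting out-of-scope (OOS) intents. The key to the solution is using the intrinsic world knowledge and in-context learning ability of generative large language models (LLMs), via adaptive in-context learning and chain-of-thought prompting. The paper proposes a hybrid system that combines an uncertainty-based routing strategy with negative data augmentation, coming within 2% of native LLM accuracy with 50% less latency. Controlled experiments further show that the OOS detection capability of LLMs is significantly influenced by the scope of intent labels and the size of the label space, and a two-step approach using internal LLM representations improves OOS detection accuracy and F1-score by 5% for the Mistral-7B model.

Link: https://arxiv.org/abs/2410.01627
Authors: Gaurav Arora, Shreya Jain, Srujana Merugu
Keywords: address user utterances, task-oriented dialogue systems, dialog turn, critical component, component of task-oriented
Categories: Computation and Language (cs.CL)
Comments: Accepted at EMNLP 2024 Industry Track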

Abstract: Intent detection is a critical component of task-oriented dialogue systems (TODS) which enables the identification of suitable actions to address user utterances at each dialog turn. Traditional approaches relied on computationally efficient supervised sentence transformer encoder models, which require substantial training data and struggle with out-of-scope (OOS) detection. The emergence of generative large language models (LLMs) with intrinsic world knowledge presents new opportunities to address these challenges. In this work, we adapt 7 SOTA LLMs using adaptive in-context learning and chain-of-thought prompting for intent detection, and compare their performance with contrastively fine-tuned sentence transformer (SetFit) models to highlight prediction quality and latency tradeoff. We propose a hybrid system using uncertainty based routing strategy to combine the two approaches that along with negative data augmentation results in achieving the best of both worlds (i.e., within 2% of native LLM accuracy with 50% less latency). To better understand LLM OOS detection capabilities, we perform controlled experiments revealing that this capability is significantly influenced by the scope of intent labels and the size of the label space. We also introduce a two-step approach utilizing internal LLM representations, demonstrating empirical gains in OOS detection accuracy and F1-score by 5% for the Mistral-7B model.
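The uncertainty-based routing idea can be sketched as a simple confidence gate. This is a hedged illustration, not the paper's system: the threshold value, the toy classifiers, and all names are assumptions, with the fast classifier standing in for a SetFit-style model.

```python
def route_intent(utterance, fast_classifier, llm_classifier, threshold=0.8):
    """Use the cheap classifier when it is confident, otherwise fall back.

    `fast_classifier` returns (label, confidence); `llm_classifier` returns
    a label. Returns the chosen label and which path produced it.
    """
    label, confidence = fast_classifier(utterance)
    if confidence >= threshold:
        return label, "fast"
    return llm_classifier(utterance), "llm"


def toy_fast_classifier(utterance):
    """Hypothetical stand-in for a small sentence-transformer classifier."""
    if "balance" in utterance.lower():
        return "check_balance", 0.95
    return "check_balance", 0.30  # low confidence on unfamiliar input


def toy_llm_classifier(utterance):
    """Hypothetical stand-in for a slower LLM call."""
    return "out_of_scope"


in_scope = route_intent("What is my balance?", toy_fast_classifier, toy_llm_classifier)
uncertain = route_intent("Tell me a joke", toy_fast_classifier, toy_llm_classifier)
```

Latency falls in proportion to the share of traffic the fast path handles, so the threshold trades accuracy against speed.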

[NLP-32] Upcycling Instruction Tuning from Dense to Mixture-of-Experts via Parameter Merging

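Quick read: This paper addresses the heavy data requirements and the usual reliance on large-scale post-training when converting large language models (LLMs) from dense to Mixture-of-Experts (MoE). The key to the solution is Upcycling Instruction Tuning (UpIT), which uses intermediate checkpoints from instruction tuning of the dense model as the basis for the experts, and introduces a genetic algorithm and parameter merging to expand the number of experts while ensuring their diversity. In addition, a small amount of seed data is selected to pre-optimize the router so that each expert works as expected within the MoE model. Experiments show that UpIT performs well across data scales and upcycling settings, with strong data efficiency and stable improvement as experts or data scale up.

Link: https://arxiv.org/abs/2410.01610
Authors: Tingfeng Hui, Zhenyu Zhang, Shuohuan Wang, Yu Sun, Hua Wu, Sen Su
Keywords: language processing tasks, natural language processing, plentiful natural language, large language models, shines brightly
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: work in progress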

Abstract:Mixture-of-Experts (MoE) shines brightly in large language models (LLMs) and demonstrates outstanding performance in plentiful natural language processing tasks. However, existing methods transforming LLMs from dense to MoE face significant data requirements and typically rely on large-scale post-training. In this paper, we propose Upcycling Instruction Tuning (UpIT), a data-efficient approach for tuning a dense pre-trained model into a MoE instruction model. Specifically, we first point out that intermediate checkpoints during instruction tuning of the dense model are naturally suitable for specialized experts, and then propose an expert expansion stage to flexibly achieve models with flexible numbers of experts, where genetic algorithm and parameter merging are introduced to ensure sufficient diversity of new extended experts. To ensure that each specialized expert in the MoE model works as expected, we select a small amount of seed data that each expert excels to pre-optimize the router. Extensive experiments with various data scales and upcycling settings demonstrate the outstanding performance and data efficiency of UpIT, as well as stable improvement in expert or data scaling. Further analysis reveals the importance of ensuring expert diversity in upcycling.
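Parameter merging, in its simplest form, builds a new expert by interpolating the weights of two existing ones. The sketch below assumes plain linear interpolation with a single coefficient `alpha`; how UpIT actually selects and merges experts (including its genetic algorithm) is not specified in the abstract, so every name here is illustrative.

```python
def merge_parameters(expert_a, expert_b, alpha=0.5):
    """Create a new expert as alpha * expert_a + (1 - alpha) * expert_b.

    Experts are dicts mapping a parameter name to a flat list of floats;
    both experts must share the same names and shapes.
    """
    return {
        name: [alpha * a + (1.0 - alpha) * b
               for a, b in zip(expert_a[name], expert_b[name])]
        for name in expert_a
    }


checkpoint_early = {"w": [1.0, 3.0], "b": [0.0]}
checkpoint_late = {"w": [3.0, 5.0], "b": [2.0]}
new_expert = merge_parameters(checkpoint_early, checkpoint_late, alpha=0.5)
```

Varying `alpha`, or the pair of checkpoints being merged, is one simple way to obtain additional experts that differ from their parents.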

[NLP-33] ENTP: Encoder-only Next Token Prediction

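Quick read: This paper challenges the conventional view that causal attention is a necessary condition for next-token prediction models, arguing that this design choice is about efficiency rather than necessity. The key to the solution is Encoder-only Next Token Prediction (ENTP). Through theoretical analysis and experiments, the paper explores the differences between ENTP and decoder-only Transformers in expressive power and complexity and highlights potential advantages of ENTP; on the Triplet-Counting task in particular, ENTP can perform the task easily while a decoder-only Transformer cannot. Experiments also show superior ENTP performance on realistic tasks such as length generalization and in-context learning.

Link: https://arxiv.org/abs/2410.01600
Authors: Ethan Ewer, Daewon Chae, Thomas Zeng, Jinkyu Kim, Kangwook Lee
Keywords: Next-token prediction models, causal attention, masking future tokens, Next-token prediction, essential to prevent
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: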

Abstract:Next-token prediction models have predominantly relied on decoder-only Transformers with causal attention, driven by the common belief that causal attention is essential to prevent “cheating” by masking future tokens. We challenge this widely accepted notion and argue that this design choice is about efficiency rather than necessity. While decoder-only Transformers are still a good choice for practical reasons, they are not the only viable option. In this work, we introduce Encoder-only Next Token Prediction (ENTP). We explore the differences between ENTP and decoder-only Transformers in expressive power and complexity, highlighting potential advantages of ENTP. We introduce the Triplet-Counting task and show, both theoretically and experimentally, that while ENTP can perform this task easily, a decoder-only Transformer cannot. Finally, we empirically demonstrate ENTP’s superior performance across various realistic tasks, such as length generalization and in-context learning.
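The efficiency argument can be made concrete by comparing attention masks and counting attention pairs during generation. The sketch below rests on an assumption not spelled out in the abstract: that an encoder-only predictor re-encodes the whole prefix with full attention for every new token, while a decoder reuses cached keys and values. The function names are illustrative.

```python
def causal_mask(n):
    """Decoder-style mask: position i may attend to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def full_mask(n):
    """Encoder-style mask: every position attends to every position."""
    return [[True] * n for _ in range(n)]


def decoder_attention_pairs(n):
    """Pairs computed when generating n tokens with a key-value cache.

    Step t only adds one new query attending to t keys.
    """
    return sum(t for t in range(1, n + 1))


def entp_attention_pairs(n):
    """Pairs computed if each step re-encodes its length-t prefix in full."""
    return sum(t * t for t in range(1, n + 1))


n = 4
pairs = (decoder_attention_pairs(n), entp_attention_pairs(n))
```

Under this assumption the decoder count grows quadratically in n and the encoder-only count cubically, which is consistent with reading causal attention as an efficiency choice.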

[NLP-34] Spoken Grammar Assessment Using LLM

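Quick read: This paper addresses the fact that conventional spoken language assessment (SLA) systems are limited to evaluating pronunciation and fluency, while grammar and vocabulary assessment is left to written language assessment (WLA) systems. The key to the solution is an end-to-end SLA system that assesses grammar directly from spoken utterances, making WLA systems redundant. In addition, a large language model (LLM) is used to introduce variation into the test content, so that the questions are hard to anticipate and train for. The paper also shows that a hybrid automatic speech recognition (ASR) system with a custom-built language model outperforms the state-of-the-art ASR engine for spoken grammar assessment.

Link: https://arxiv.org/abs/2410.01579
Authors: Sunil Kumar Kopparapu, Chitralekha Bhat, Ashish Panda
Keywords: evaluating the pronunciation, pronunciation and oral, oral fluency, speaker by analysing, analysing the read
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 5 pages, 2 figures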

Abstract:Spoken language assessment (SLA) systems restrict themselves to evaluating the pronunciation and oral fluency of a speaker by analysing the read and spontaneous spoken utterances respectively. The assessment of language grammar or vocabulary is relegated to written language assessment (WLA) systems. Most WLA systems present a set of sentences from a curated finite-size database of sentences thereby making it possible to anticipate the test questions and train oneself. In this paper, we propose a novel end-to-end SLA system to assess language grammar from spoken utterances thus making WLA systems redundant; additionally, we make the assessment largely unteachable by employing a large language model (LLM) to bring in variations in the test. We further demonstrate that a hybrid automatic speech recognition (ASR) with a custom-built language model outperforms the state-of-the-art ASR engine for spoken grammar assessment.

[NLP-35] OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data

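Quick read: This paper addresses the lack of access to training data for mathematical reasoning with large language models (LLMs) and the resulting uncertainty about how best to synthesize such data. The key to the solution is a set of careful ablation experiments that identify several factors in data synthesis: (a) solution format matters for supervised fine-tuning (SFT) performance, and excessively verbose solutions hurt it; (b) data generated by a strong teacher model outperforms data generated by a weak student model; (c) SFT is robust to low-quality solutions, which allows imprecise data filtering; and (d) question diversity is crucial for data scaling gains. Based on these findings, the paper builds the OpenMathInstruct-2 dataset and, by fine-tuning Llama-3.1-8B-Base on it, substantially improves performance on MATH.

Link: https://arxiv.org/abs/2410.01560
Authors: Shubham Toshniwal, Wei Du, Ivan Moshkov, Branislav Kisacanin, Alexan Ayrapetyan, Igor Gitman
Keywords: Mathematical reasoning continues, large language model, Mathematical reasoning, development with significant, significant interest
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: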

Abstract: Mathematical reasoning continues to be a critical challenge in large language model (LLM) development with significant interest. However, most of the cutting-edge progress in mathematical reasoning with LLMs has become closed-source due to lack of access to training data. This lack of data access limits researchers from understanding the impact of different choices for synthesizing and utilizing the data. With the goal of creating a high-quality finetuning (SFT) dataset for math reasoning, we conduct careful ablation experiments on data synthesis using the recently released Llama3.1 family of models. Our experiments show that: (a) solution format matters, with excessively verbose solutions proving detrimental to SFT performance, (b) data generated by a strong teacher outperforms on-policy data generated by a weak student model, (c) SFT is robust to low-quality solutions, allowing for imprecise data filtering, and (d) question diversity is crucial for achieving data scaling gains. Based on these insights, we create the OpenMathInstruct-2 dataset, which consists of 14M question-solution pairs (≈ 600K unique questions), making it nearly eight times larger than the previous largest open-source math reasoning dataset. Finetuning the Llama-3.1-8B-Base using OpenMathInstruct-2 outperforms Llama3.1-8B-Instruct on MATH by an absolute 15.9% (51.9% → 67.8%). Finally, to accelerate the open-source efforts, we release the code, the finetuned models, and the OpenMathInstruct-2 dataset under a commercially permissive license.

[NLP-36] Integrative Decoding: Improve Factuality via Implicit Self-consistency

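Quick read: This paper addresses the limited applicability of existing self-consistency methods to open-ended generation tasks. The key to the solution is a new method called Integrative Decoding (ID), which constructs a set of inputs, each prepended with a previously sampled response, and at each decoding step selects the next token by aggregating all of their corresponding predictions, thereby implicitly incorporating self-consistency into the decoding objective. The method improves the factual accuracy of language models, with substantial gains on the TruthfulQA, Biographies, and LongFact benchmarks.

Link: https://arxiv.org/abs/2410.01556
Authors: Yi Cheng, Xiao Liang, Yeyun Gong, Wen Xiao, Song Wang, Yuji Zhang, Wenjun Hou, Kaishuai Xu, Wenge Liu, Wenjie Li, Jian Jiao, Qi Chen, Peng Cheng, Wayne Xiong
Keywords: involve repeatedly sampling, repeatedly sampling multiple, sampling multiple outputs, large language models, involve repeatedly
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: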

Abstract: Self-consistency-based approaches, which involve repeatedly sampling multiple outputs and selecting the most consistent one as the final response, prove to be remarkably effective in improving the factual accuracy of large language models. Nonetheless, existing methods usually have strict constraints on the task format, largely limiting their applicability. In this paper, we present Integrative Decoding (ID), to unlock the potential of self-consistency in open-ended generation tasks. ID operates by constructing a set of inputs, each prepended with a previously sampled response, and then processes them concurrently, with the next token being selected by aggregating all their corresponding predictions at each decoding step. In essence, this simple approach implicitly incorporates self-consistency in the decoding objective. Extensive evaluation shows that ID consistently enhances factuality over a wide range of language models, with substantial improvements on the TruthfulQA (+11.2%), Biographies (+15.4%) and LongFact (+8.5%) benchmarks. The performance gains amplify progressively as the number of sampled responses increases, indicating the potential of ID to scale up with repeated sampling.
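One decoding step of this aggregation can be sketched in Python. The abstract says only that predictions are aggregated; summing log-probabilities across inputs is an assumption made here for illustration, as are the toy distributions and the function name.

```python
import math


def integrative_step(distributions):
    """Pick the next token by aggregating predictions from several inputs.

    Each element of `distributions` maps token -> probability and comes from
    one input that was prepended with a different sampled response.
    Aggregation rule (assumed): sum of log-probabilities.
    """
    vocabulary = distributions[0].keys()
    scores = {
        token: sum(math.log(dist[token]) for dist in distributions)
        for token in vocabulary
    }
    return max(scores, key=scores.get)


# Toy next-token distributions from three differently prefixed inputs.
per_input_predictions = [
    {"Paris": 0.6, "Lyon": 0.4},
    {"Paris": 0.7, "Lyon": 0.3},
    {"Paris": 0.4, "Lyon": 0.6},
]
next_token = integrative_step(per_input_predictions)
```

A token wins only if it is supported across most of the inputs, which is how consistency among the sampled responses enters the decoding objective without an explicit voting step.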

[NLP-37] ACE: A LLM-based Negotiation Coaching System EMNLP2024

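Quick read: This paper addresses the barriers that underrepresented populations face in learning strategic bargaining and negotiation skills. The key to the solution is ACE, a negotiation coaching assistant based on a large language model (LLM), which not only acts as the user's negotiation partner but also identifies mistakes and gives targeted feedback. Using a dataset of negotiation transcripts between MBA students, together with expert consultations, the authors design an annotation scheme for detecting negotiation mistakes; ACE applies this scheme to identify mistakes and suggest improvements, which significantly improves users' negotiation performance in the reported user experiment.

Link: https://arxiv.org/abs/2410.01555
Authors: Ryan Shea, Aymen Kallala, Xin Lucy Liu, Michael W. Morris, Zhou Yu
Keywords: growing prominence, prominence of LLMs, LLMs has led, negotiation, ACE
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: EMNLP 2024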

Abstract:The growing prominence of LLMs has led to an increase in the development of AI tutoring systems. These systems are crucial in providing underrepresented populations with improved access to valuable education. One important area of education that is unavailable to many learners is strategic bargaining related to negotiation. To address this, we develop a LLM-based Assistant for Coaching nEgotiation (ACE). ACE not only serves as a negotiation partner for users but also provides them with targeted feedback for improvement. To build our system, we collect a dataset of negotiation transcripts between MBA students. These transcripts come from trained negotiators and emulate realistic bargaining scenarios. We use the dataset, along with expert consultations, to design an annotation scheme for detecting negotiation mistakes. ACE employs this scheme to identify mistakes and provide targeted feedback to users. To test the effectiveness of ACE-generated feedback, we conducted a user experiment with two consecutive trials of negotiation and found that it improves negotiation performances significantly compared to a system that doesn’t provide feedback and one which uses an alternative method of providing feedback.

[NLP-38] MedQA-CS: Benchmarking Large Language Models Clinical Skills Using an AI-SCE Framework

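Quick read: This paper addresses the lack of comprehensive evaluation of clinical skills (CS) for artificial intelligence (AI) and large language models (LLMs) in healthcare. The key to the solution is MedQA-CS, an AI-SCE framework inspired by the Objective Structured Clinical Examinations (OSCEs) used in medical education. MedQA-CS evaluates LLMs through two instruction-following tasks, LLM-as-medical-student and LLM-as-CS-examiner, designed to reflect real clinical scenarios. The framework provides publicly available data and expert annotations, and includes quantitative and qualitative assessment of LLMs as reliable judges in CS evaluation. Experiments show that MedQA-CS is more challenging than traditional multiple-choice QA benchmarks (e.g., MedQA) and, combined with them, enables a more comprehensive evaluation of LLMs' clinical capabilities.

Link: https://arxiv.org/abs/2410.01553
Authors: Zonghai Yao, Zihao Zhang, Chaolong Tang, Xingyu Bian, Youxia Zhao, Zhichao Yang, Junda Wang, Huixue Zhou, Won Seok Jang, Feiyun Ouyang, Hong Yu
Keywords: large language models, healthcare require advanced, Artificial intelligence, require advanced clinical, Structured Clinical Examinations
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: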

Abstract:Artificial intelligence (AI) and large language models (LLMs) in healthcare require advanced clinical skills (CS), yet current benchmarks fail to evaluate these comprehensively. We introduce MedQA-CS, an AI-SCE framework inspired by medical education’s Objective Structured Clinical Examinations (OSCEs), to address this gap. MedQA-CS evaluates LLMs through two instruction-following tasks, LLM-as-medical-student and LLM-as-CS-examiner, designed to reflect real clinical scenarios. Our contributions include developing MedQA-CS, a comprehensive evaluation framework with publicly available data and expert annotations, and providing the quantitative and qualitative assessment of LLMs as reliable judges in CS evaluation. Our experiments show that MedQA-CS is a more challenging benchmark for evaluating clinical skills than traditional multiple-choice QA benchmarks (e.g., MedQA). Combined with existing benchmarks, MedQA-CS enables a more comprehensive evaluation of LLMs’ clinical capabilities for both open- and closed-source LLMs.

[NLP-39] In-Context Transfer Learning: Demonstration Synthesis by Transferring Similar Tasks

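Quick read: This paper addresses the limited quality of demonstrations that large language models (LLMs) synthesize from scratch for in-context learning (ICL), a practice motivated by the high cost of labeling demonstrations. The key idea is In-Context Transfer Learning (ICTL), which synthesizes target-task demonstrations by transferring labeled demonstrations from similar source tasks. ICTL has two steps, source sampling and target transfer. First, an optimization objective that minimizes transfer error is used to sample source demonstrations similar to the target task. Then LLMs transfer the sampled demonstrations to the target task so that they match its definition and format. On Super-NI, ICTL outperforms synthesis from scratch by 2.0% on average.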

Link: https://arxiv.org/abs/2410.01548
Authors: Dingzirui Wang, Xuangliang Zhang, Qiguang Chen, Longxu Dou, Xiao Xu, Rongyu Cao, Yingwei Ma, Qingfu Zhu, Wanxiang Che, Binhua Li, Fei Huang, Yongbin Li
Keywords: large language models, target task, In-Context Transfer Learning, language models, effective approach
Categories: Computation and Language (cs.CL)
Comments:

Abstract:In-context learning (ICL) is an effective approach to help large language models (LLMs) adapt to various tasks by providing demonstrations of the target task. Considering the high cost of labeling demonstrations, many methods propose synthesizing demonstrations from scratch using LLMs. However, the quality of the demonstrations synthesized from scratch is limited by the capabilities and knowledge of LLMs. To address this, inspired by transfer learning, we propose In-Context Transfer Learning (ICTL), which synthesizes target task demonstrations by transferring labeled demonstrations from similar source tasks. ICTL consists of two steps: source sampling and target transfer. First, we define an optimization objective, which minimizes transfer error to sample source demonstrations similar to the target task. Then, we employ LLMs to transfer the sampled source demonstrations to the target task, matching the definition and format of the target task. Experiments on Super-NI show that ICTL outperforms synthesis from scratch by 2.0% on average, demonstrating the effectiveness of our method.

[NLP-40] Seeing Eye to AI: Human Alignment via Gaze-Based Response Rewards for Large Language Models

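Quick read: This paper addresses the difficulty that Reinforcement Learning from Human Feedback (RLHF) has in accurately modelling human preferences. The key idea is GazeReward, a framework that integrates eye-tracking (ET) data as implicit feedback into the Reward Model (RM). The authors explore how ET-based features can provide insight into user preferences, and ablation studies across different integration methods, LLMs, and ET generator models show that the approach significantly improves RM accuracy. The work explores the potential of cognitive data for aligning AI with human values and for NLP research.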

Link: https://arxiv.org/abs/2410.01532
Authors: Angela Lopez-Cardona, Carlos Segura, Alexandros Karatzoglou, Sergi Abadal, Ioannis Arapakis
Keywords: Natural Language Processing, Large Language Models, Language Processing, Advancements in Natural, Natural Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Advancements in Natural Language Processing (NLP) have led to the emergence of Large Language Models (LLMs) such as GPT, Llama, Claude, and Gemini, which excel across a range of tasks but require extensive fine-tuning to align their outputs with human expectations. A widely used method for achieving this alignment is Reinforcement Learning from Human Feedback (RLHF), which, despite its success, faces challenges in accurately modelling human preferences. In this paper, we introduce GazeReward, a novel framework that integrates implicit feedback, specifically eye-tracking (ET) data, into the Reward Model (RM). In addition, we explore how ET-based features can provide insights into user preferences. Through ablation studies we test our framework with different integration methods, LLMs, and ET generator models, demonstrating that our approach significantly improves the accuracy of the RM on established human preference datasets. This work advances the ongoing discussion on optimizing AI alignment with human values, exploring the potential of cognitive data for shaping future NLP research.

[NLP-41] HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models

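Quick read: This paper addresses the memory requirements and latency that make it impractical to deploy billion-parameter safety guard models alongside large language models (LLMs) on mobile devices. The key idea is to distill a large teacher safety guard model into a small one with the help of the data augmentation method HarmAug. HarmAug prompts an LLM to generate harmful instructions, has another LLM generate responses to them, and lets the teacher model label the resulting instruction-response pairs, which enlarges the training set and improves the small model. A 435-million-parameter safety guard model trained with HarmAug achieves an F1 score comparable to models with over 7 billion parameters and outperforms them in AUPRC, at a much lower computational cost.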

Link: https://arxiv.org/abs/2410.01524
Authors: Seanie Lee, Haebin Seong, Dong Bok Lee, Minki Kang, Xiaoyin Chen, Dominik Wagner, Yoshua Bengio, Juho Lee, Sung Ju Hwang
Keywords: detect malicious queries, malicious queries aimed, Safety guard models, Safety guard, existing safety guard
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Safety guard models that detect malicious queries aimed at large language models (LLMs) are essential for ensuring the secure and responsible deployment of LLMs in real-world applications. However, deploying existing safety guard models with billions of parameters alongside LLMs on mobile devices is impractical due to substantial memory requirements and latency. To reduce this cost, we distill a large teacher safety guard model into a smaller one using a labeled dataset of instruction-response pairs with binary harmfulness labels. Due to the limited diversity of harmful instructions in the existing labeled dataset, naively distilled models tend to underperform compared to larger models. To bridge the gap between small and large models, we propose HarmAug, a simple yet effective data augmentation method that involves jailbreaking an LLM and prompting it to generate harmful instructions. Given a prompt such as, “Make a single harmful instruction prompt that would elicit offensive content”, we add an affirmative prefix (e.g., “I have an idea for a prompt:”) to the LLM’s response. This encourages the LLM to continue generating the rest of the response, leading to sampling harmful instructions. Another LLM generates a response to the harmful instruction, and the teacher model labels the instruction-response pair. We empirically show that our HarmAug outperforms other relevant baselines. Moreover, a 435-million-parameter safety guard model trained with HarmAug achieves an F1 score comparable to larger models with over 7 billion parameters, and even outperforms them in AUPRC, while operating at less than 25% of their computational cost.

[NLP-42] InfiniPot: Infinite Context Processing on Memory-Constrained LLMs EMNLP2024

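Quick read: This paper addresses the memory limits that large language models (LLMs) face when handling long input contexts, particularly in resource-constrained environments such as mobile devices. The key idea is InfiniPot, a KV cache control framework that lets pre-trained LLMs manage long sequences within a fixed memory budget without additional training. It relies on Continual Context Distillation (CCD), an iterative process that compresses the cache and retains essential information using novel importance metrics, preserving critical data even without access to future context.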

Link: https://arxiv.org/abs/2410.01518
Authors: Minsoo Kim, Kyuhong Shim, Jungwook Choi, Simyung Chang
Keywords: Large Language Models, Handling long input, Large Language, challenge for Large, input contexts remains
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2024 Main

Abstract:Handling long input contexts remains a significant challenge for Large Language Models (LLMs), particularly in resource-constrained environments such as mobile devices. Our work aims to address this limitation by introducing InfiniPot, a novel KV cache control framework designed to enable pre-trained LLMs to manage extensive sequences within fixed memory constraints efficiently, without requiring additional training. InfiniPot leverages Continual Context Distillation (CCD), an iterative process that compresses and retains essential information through novel importance metrics, effectively maintaining critical data even without access to future context. Our comprehensive evaluations indicate that InfiniPot significantly outperforms models trained for long contexts in various NLP tasks, establishing its efficacy and versatility. This work represents a substantial advancement toward making LLMs applicable to a broader range of real-world scenarios.
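The idea of compressing a cache whenever it reaches a fixed budget can be illustrated with a minimal sketch. It is not InfiniPot's implementation: the per-token importance scores are taken as given inputs (the paper's own importance metrics are not described in this listing), and `compress_cache`, `stream_with_budget`, `budget`, and `keep` are hypothetical names chosen for illustration.

```python
from typing import List, Tuple

# Each cache entry is (position, importance). The importance values are an
# assumption: a stand-in for whatever metric a real system would compute.
Entry = Tuple[int, float]


def compress_cache(cache: List[Entry], keep: int) -> List[Entry]:
    """Keep the `keep` highest-scoring entries, restoring positional order."""
    if len(cache) <= keep:
        return list(cache)
    top = sorted(cache, key=lambda entry: entry[1], reverse=True)[:keep]
    return sorted(top, key=lambda entry: entry[0])


def stream_with_budget(scores: List[float], budget: int, keep: int) -> List[int]:
    """Consume tokens in order; compress whenever the cache hits `budget`.

    Only past tokens are ever inspected, so no future context is needed.
    Returns the positions that remain cached at the end.
    """
    assert 0 < keep < budget
    cache: List[Entry] = []
    for position, score in enumerate(scores):
        if len(cache) == budget:
            cache = compress_cache(cache, keep)
        cache.append((position, score))
    return [position for position, _ in cache]
```

With this structure the cache never holds more than `budget` entries, regardless of how long the input stream is.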

[NLP-43] InstaTrans: An Instruction-Aware Translation Framework for Non-English Instruction Datasets

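Quick read: This paper addresses the difficulty of generating high-quality instruction datasets for non-English languages, where tail phenomena limit performance on less frequently observed data. The key idea is InstaTrans (INSTruction-Aware TRANSlation), a translation framework tailored to instruction datasets that emphasizes complete and instruction-aware translation so that the inherent attributes of the original datasets are preserved. Fine-tuning large language models (LLMs) on datasets translated this way improves their performance in the target language at relatively low cost and broadens the accessibility of LLMs across languages.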

Link: https://arxiv.org/abs/2410.01512
Authors: Yungi Kim, Chanjun Park
Keywords: frequently observed data, generate high-quality instruction, non-English languages due, high-quality English instruction, tail phenomena
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:It is challenging to generate high-quality instruction datasets for non-English languages due to tail phenomena, which limit performance on less frequently observed data. To mitigate this issue, we propose translating existing high-quality English instruction datasets as a solution, emphasizing the need for complete and instruction-aware translations to maintain the inherent attributes of these datasets. We claim that fine-tuning LLMs with datasets translated in this way can improve their performance in the target language. To this end, we introduce a new translation framework tailored for instruction datasets, named InstaTrans (INSTruction-Aware TRANSlation). Through extensive experiments, we demonstrate the superiority of InstaTrans over other competitors in terms of completeness and instruction-awareness of translation, highlighting its potential to broaden the accessibility of LLMs across diverse languages at a relatively low cost. Furthermore, we have validated that fine-tuning LLMs with datasets translated by InstaTrans can effectively improve their performance in the target language.

[NLP-44] Disentangling Latent Shifts of In-Context Learning Through Self-Training

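Quick read: This paper addresses the stability and long-context problems of in-context learning (ICL), which worsen as the number of demonstrations grows and lead to poor generalization and inefficient inference. The key idea is STICL (Self-Training ICL), which uses self-training to disentangle the latent shifts of the demonstrations from the latent shift of the query. A teacher model generates pseudo-labels, and a student model is trained on these labels, which are encoded in an adapter module. The student exhibits weak-to-strong generalization, progressively refining its predictions, and STICL consistently outperforms traditional ICL methods and other disentangling strategies on both in-domain and out-of-domain data.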

Link: https://arxiv.org/abs/2410.01508
Authors: Josip Jukić, Jan Šnajder
Keywords: natural language processing, autoregressive large language, In-context learning, large language models, language models capable
Categories: Computation and Language (cs.CL)
Comments:

Abstract:In-context learning (ICL) has become essential in natural language processing, particularly with autoregressive large language models capable of learning from demonstrations provided within the prompt. However, ICL faces challenges with stability and long contexts, especially as the number of demonstrations grows, leading to poor generalization and inefficient inference. To address these issues, we introduce STICL (Self-Training ICL), an approach that disentangles the latent shifts of demonstrations from the latent shift of the query through self-training. STICL employs a teacher model to generate pseudo-labels and trains a student model using these labels, encoded in an adapter module. The student model exhibits weak-to-strong generalization, progressively refining its predictions over time. Our empirical results show that STICL improves generalization and stability, consistently outperforming traditional ICL methods and other disentangling strategies across both in-domain and out-of-domain data.

[NLP-45] PersonaMath: Enhancing Math Reasoning through Persona-Driven Data Augmentation

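Quick read: This paper addresses the weaker mathematical problem-solving ability of open-source large language models (LLMs). The key idea is a two-stage data augmentation approach together with the PersonaMathQA dataset. In the first stage, a closed-source LLM regenerates detailed chain-of-thought (CoT) solutions, and a novel persona-driven augmentation technique increases the quantity and diversity of the data. In the second stage, reflection is used to make full use of the more challenging and valuable questions. PersonaMath-7B, based on LLaMA-2-7B, surpasses all baseline methods on MATH and GSM8K and reaches state-of-the-art performance, which the authors attribute to the quality and diversity of the dataset and which makes model training more efficient.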

Link: https://arxiv.org/abs/2410.01504
Authors: Jing Luo, Run Luo, Longze Chen, Liang Zhu, Chang Ao, Jiaming Li, Yukun Chen, Xin Cheng, Wen Yang, Jiayuan Su, Chengming Li, Min Yang
Keywords: closed-source Large Language, Large Language Models, Large Language, demonstrate strong mathematical, mathematical problem-solving abilities
Categories: Computation and Language (cs.CL)
Comments:

Abstract:While closed-source Large Language Models (LLMs) demonstrate strong mathematical problem-solving abilities, open-source models continue to struggle with such tasks. To bridge this gap, we propose a data augmentation approach and introduce PersonaMathQA, a dataset derived from MATH and GSM8K, on which we train the PersonaMath models. Our approach consists of two stages: the first stage is learning from Persona Diversification, and the second stage is learning from Reflection. In the first stage, we regenerate detailed chain-of-thought (CoT) solutions as instructions using a closed-source LLM and introduce a novel persona-driven data augmentation technique to enhance the dataset’s quantity and diversity. In the second stage, we incorporate reflection to fully leverage more challenging and valuable questions. Evaluation of our PersonaMath models on MATH and GSM8K reveals that the PersonaMath-7B model (based on LLaMA-2-7B) achieves an accuracy of 24.2% on MATH and 68.7% on GSM8K, surpassing all baseline methods and achieving state-of-the-art performance. Notably, our dataset contains only 70.3K data points, merely 17.8% of MetaMathQA and 27% of MathInstruct, yet our model outperforms these baselines, demonstrating the high quality and diversity of our dataset, which enables more efficient model training. We open-source the PersonaMathQA dataset, PersonaMath models, and our code for public usage.

[NLP-46] DLP-LoRA: Efficient Task-Specific LoRA Fusion with a Dynamic Lightweight Plugin for Large Language Models

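Quick read: This paper addresses the cost of fine-tuning large language models (LLMs) for specific domains, and in particular the fact that existing methods for fusing multiple Low-Rank Adaptation (LoRA) modules lack context-dependent dynamic fusion and often increase inference time. The key idea is DLP-LoRA, a Dynamic Lightweight Plugin: a mini-MLP module with only 5M parameters that dynamically fuses multiple LoRAs at the sentence level using a top-p sampling strategy. By leveraging parallel computation, it keeps inference time below twice that of single-LoRA inference. The method performs well across multiple tasks and improves both performance and efficiency under composite task settings.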

Link: https://arxiv.org/abs/2410.01497
Authors: Yuxuan Zhang, Ruizhe Li
Keywords: Large Language Models, Large Language, domains remains resource-intensive, Language Models, specific domains remains
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint under review, 18 pages, 7 figures

Abstract:Recent advancements in Large Language Models (LLMs) have achieved robust performance across diverse tasks, but fine-tuning these models for specific domains remains resource-intensive. Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) address this challenge by fine-tuning a small subset of parameters. However, existing methods for fusing multiple LoRAs lack dynamic fusion based on contextual inputs and often increase inference time due to token-level operations. We propose DLP-LoRA, a Dynamic Lightweight Plugin that employs a mini-MLP module with only 5M parameters to dynamically fuse multiple LoRAs at the sentence level using top-p sampling strategies. This approach reduces inference time to less than twice that of single LoRA inference by leveraging parallel computation. Evaluations across 26 tasks, including multiple-choice questions and question answering, demonstrate that DLP-LoRA achieves an average accuracy of 92.34% on multiple-choice datasets and significant improvements in BLEU and ROUGE scores on QA datasets, outperforming different LLM backbones under composite task settings. DLP-LoRA effectively balances performance and efficiency, making it a practical solution for dynamic multi-task adaptation in LLMs. Our code is available at this https URL.
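A small sketch may help make top-p selection over adapters concrete. It assumes a classifier has already produced one probability per LoRA for the current sentence; the selection keeps the smallest set of adapters whose cumulative probability reaches `p` and renormalizes their weights. The function names and the weighted-sum fusion are illustrative assumptions, not the DLP-LoRA code.

```python
from typing import Dict, List


def top_p_select(probs: List[float], p: float = 0.9) -> Dict[int, float]:
    """Return {adapter_index: weight} for the smallest top set reaching mass p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen: List[int] = []
    total = 0.0
    for index in order:
        chosen.append(index)
        total += probs[index]
        if total >= p:
            break
    # Renormalize so the selected weights sum to 1.
    return {index: probs[index] / total for index in chosen}


def fuse_outputs(weights: Dict[int, float], outputs: List[List[float]]) -> List[float]:
    """Weighted sum of the selected adapters' output vectors (hypothetical fusion)."""
    dim = len(outputs[0])
    fused = [0.0] * dim
    for index, weight in weights.items():
        for d in range(dim):
            fused[d] += weight * outputs[index][d]
    return fused
```

Because selection happens once per sentence rather than once per token, the selected adapters can then be applied in parallel, which is the efficiency argument the abstract makes.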

[NLP-47] Extending Context Window of Large Language Models from a Distributional Perspective EMNLP2024

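Quick read: This paper addresses the suboptimal performance of existing methods for extending the context window of large language models (LLMs) based on rotary position embedding (RoPE). The key idea is to optimize context window extension from the perspective of the rotary angle distribution. The authors first estimate the distribution of rotary angles within the model and analyze how much length extension perturbs it. They then propose an extension strategy that minimizes the disturbance between rotary angle distributions, keeping them consistent with the pre-training phase and improving generalization to longer sequences. When extending LLaMA2's context window, the method reduces distributional disturbance by up to 72% at 8k and by up to 32% at 16k, and it improves on existing state-of-the-art methods by up to 4.33% on average on LongBench-E.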

Link: https://arxiv.org/abs/2410.01490
Authors: Yingsheng Wu, Yuxuan Gu, Xiaocheng Feng, Weihong Zhong, Dongliang Xu, Qing Yang, Hongtao Liu, Bing Qin
Keywords: RoPE-based large language, rotary position embedding, context window, large language models, context window extending
Categories: Computation and Language (cs.CL)
Comments: 14 pages, 8 figures, Accepted to EMNLP2024

Abstract:Scaling the rotary position embedding (RoPE) has become a common method for extending the context window of RoPE-based large language models (LLMs). However, existing scaling methods often rely on empirical approaches and lack a profound understanding of the internal distribution within RoPE, resulting in suboptimal performance in extending the context window length. In this paper, we propose to optimize the context window extending task from the view of rotary angle distribution. Specifically, we first estimate the distribution of the rotary angles within the model and analyze the extent to which length extension perturbs this distribution. Then, we present a novel extension strategy that minimizes the disturbance between rotary angle distributions to maintain consistency with the pre-training phase, enhancing the model’s capability to generalize to longer sequences. Experimental results compared to the strong baseline methods demonstrate that our approach reduces the distributional disturbance by up to 72% when extending LLaMA2’s context window to 8k, and by up to 32% when extending to 16k. On the LongBench-E benchmark, our method achieves an average improvement of up to 4.33% over existing state-of-the-art methods. Furthermore, our method maintains the model’s performance on the Hugging Face Open LLM benchmark after context window extension, with an average performance fluctuation ranging only from -0.12 to +0.22.
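For readers unfamiliar with rotary angles, the sketch below computes them in the standard RoPE form, where dimension pair `i` rotates by `position * base ** (-2 * i / head_dim)`. The `scale` argument illustrates plain linear position scaling, a common baseline for context extension; it is not the distribution-based strategy proposed in this paper, and the function names are chosen here for illustration.

```python
from typing import List


def rotary_frequencies(head_dim: int, base: float = 10000.0) -> List[float]:
    """Per-pair rotation frequencies theta_i = base^(-2i / head_dim)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def rotary_angles(position: int, head_dim: int, scale: float = 1.0,
                  base: float = 10000.0) -> List[float]:
    """Rotation angles at `position`; `scale` < 1 squeezes positions (linear scaling)."""
    return [position * scale * freq for freq in rotary_frequencies(head_dim, base)]


# With a scale of 4096 / 8192, position 8192 is mapped onto the angles that
# position 4096 had before scaling, so long inputs stay inside the angle
# range seen during pre-training.
extended = rotary_angles(8192, head_dim=4, scale=4096 / 8192)
original = rotary_angles(4096, head_dim=4)
```

Linear scaling keeps angles in range but changes how densely they are sampled, which is the kind of distributional shift the paper sets out to measure and minimize.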

[NLP-48] Small Language Models Like Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas

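Quick read: This paper questions the validity of language models that use subword-based tokenization as models of linguistic representation. The key idea is to explore tokenization-free, phoneme- and grapheme-based language models. Small models based on the Llama architecture, trained with character-level vocabularies, achieve strong linguistic performance on standard syntactic benchmarks and on novel lexical/phonetic benchmarks. Phoneme-based models without any graphemic bias almost match grapheme-based models on both the standard tasks and the novel evaluations. The findings point toward more linguistically plausible language models that are better suited to computational studies of language acquisition and processing.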

Link: https://arxiv.org/abs/2410.01487
Authors: Bastian Bunzeck, Daniel Duran, Leonie Schade, Sina Zarrieß
Keywords: Byte Pair Encoding, Pair Encoding, Byte Pair, subword-based tokenization algorithms, Current language models
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Current language models use subword-based tokenization algorithms like Byte Pair Encoding, which put their validity as models of linguistic representations into question. In this paper, we explore the potential of tokenization-free, phoneme- and grapheme-based language models. We demonstrate that small models based on the Llama architecture can achieve strong linguistic performance on standard syntactic and novel lexical/phonetic benchmarks when trained with character-level vocabularies. We further show that phoneme-based models without any graphemic biases almost match grapheme-based models in standard tasks and novel evaluations. Our findings suggest a promising direction for creating more linguistically plausible language models that are better suited for computational studies of language acquisition and processing.

[NLP-49] A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts

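Quick read: This paper addresses the substantial overhead of training and serving long-context large language models (LLMs). The key idea is to combine context length extension with a GPU-friendly KV cache reduction architecture, yielding LongGen, which fine-tunes a pretrained LLM into an efficient architecture during the length extension stage. LongGen builds on three insights: (1) sparse attention patterns such as window attention, attention sink, and blockwise sparse attention suit efficient long-context models because of their GPU-friendly memory access patterns; (2) a hybrid architecture with 1/3 full attention layers and 2/3 efficient layers balances efficiency and long-context performance; (3) lightweight training on 5B long-context data is enough to extend the hybrid model's context length from 4K to 128K. Experiments show that LongGen significantly improves efficiency in both training and inference and reduces memory overhead.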

Link: https://arxiv.org/abs/2410.01485
Authors: Suyu Ge, Xihui Lin, Yunan Zhang, Jiawei Han, Hao Peng
Keywords: incurs substantial overhead, length extension, serving long-context large, pretrained LLM, incurs substantial
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Training and serving long-context large language models (LLMs) incurs substantial overhead. To address this, two critical steps are often required: a pretrained LLM typically undergoes a separate stage for context length extension by training on long-context data, followed by architectural modifications to reduce the overhead of KV cache during serving. This paper argues that integrating length extension with a GPU-friendly KV cache reduction architecture not only reduces training overhead during length extension, but also achieves better long-context performance. This leads to our proposed LongGen, which finetunes a pretrained LLM into an efficient architecture during length extension. LongGen builds on three key insights: (1) Sparse attention patterns, such as window attention (attending to recent tokens), attention sink (initial ones), and blockwise sparse attention (strided token blocks) are well-suited for building efficient long-context models, primarily due to their GPU-friendly memory access patterns, enabling efficiency gains not just theoretically but in practice as well. (2) It is essential for the model to have direct access to all tokens. A hybrid architecture with 1/3 full attention layers and 2/3 efficient ones achieves a balanced trade-off between efficiency and long-context performance. (3) Lightweight training on 5B long-context data is sufficient to extend the hybrid model’s context length from 4K to 128K. We evaluate LongGen on both Llama-2 7B and Llama-2 70B, demonstrating its effectiveness across different scales. During training with 128K-long contexts, LongGen achieves 1.55x training speedup and reduces wall-clock time by 36%, compared to a full-attention baseline. During inference, LongGen reduces KV cache memory by 62%, achieving 1.67x prefilling speedup and 1.41x decoding speedup. 
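The sparse patterns and the 1/3 to 2/3 layer split can be pictured with a short sketch. It builds a causal mask that combines an attention sink (the first few tokens) with a sliding window (the most recent tokens), and a layer layout in which one layer in three uses full attention. Which layers are full is an assumption made here for illustration, since the listing does not say where LongGen places them, and the blockwise strided pattern is omitted for brevity.

```python
from typing import List


def sink_window_mask(seq_len: int, window: int, sinks: int) -> List[List[bool]]:
    """mask[q][k] is True when query q may attend to key k.

    A key is visible if it is not in the future and is either one of the
    first `sinks` tokens or within the last `window` tokens before q.
    """
    return [
        [(k <= q) and (k < sinks or q - k < window) for k in range(seq_len)]
        for q in range(seq_len)
    ]


def hybrid_layout(num_layers: int) -> List[str]:
    """Hypothetical placement: every third layer is full attention."""
    return ["full" if i % 3 == 0 else "sparse" for i in range(num_layers)]
```

In a sparse layer each query keeps at most `sinks + window` keys, so its KV cache stays constant in size as the context grows, while the full layers retain direct access to every token.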

[NLP-50] Agent-Driven Large Language Models for Mandarin Lyric Generation

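Quick read: This paper addresses two obstacles in melody-to-lyric generation: the scarcity of high-quality aligned data and the lack of clear standards for creativity. The key idea is a multi-agent system that decomposes the melody-to-lyric task into sub-tasks, with separate agents controlling rhyme, syllable count, lyric-melody alignment, and consistency. This decomposition gives finer control over each aspect of lyric generation, and listening tests with a diffusion-based singing voice synthesizer are used to evaluate the lyrics generated by different agent groups.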

Link: https://arxiv.org/abs/2410.01450
Authors: Hong-Hsiang Liu, Yi-Wen Liu
Keywords: Generative Large Language, Generative Large, in-context learning abilities, shown impressive in-context, impressive in-context learning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages, figures, Accepted at O-COCOSDA 2024

Abstract:Generative Large Language Models have shown impressive in-context learning abilities, performing well across various tasks with just a prompt. Previous melody-to-lyric research has been limited by scarce high-quality aligned data and unclear standards for creativity. Most efforts focused on general themes or emotions, which are less valuable given current language model capabilities. In tonal contour languages like Mandarin, pitch contours are influenced by both melody and tone, leading to variations in lyric-melody fit. Our study, validated by the Mpop600 dataset, confirms that lyricists and melody writers consider this fit during their composition process. In this research, we developed a multi-agent system that decomposes the melody-to-lyric task into sub-tasks, with each agent controlling rhyme, syllable count, lyric-melody alignment, and consistency. Listening tests were conducted via a diffusion-based singing voice synthesizer to evaluate the quality of lyrics generated by different agent groups.

[NLP-51] Analyzing Byte-Pair Encoding on Monophonic and Polyphonic Symbolic Music: A Focus on Musical Phrase Segmentation

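Quick read: This paper examines how Byte-Pair Encoding (BPE), an algorithm for building subword vocabularies, behaves when applied to symbolic music. It analyzes BPE across different instrumentations and on both monophonic and polyphonic music, and evaluates its effect on a musical phrase segmentation task. The BPE training process is found to depend heavily on the instrumentation, and BPE “supertokens” succeed in capturing abstract musical content. In phrase segmentation, BPE notably improves performance on polyphonic music, whereas on monophonic tunes it helps only within a specific range of BPE merges.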

Link: https://arxiv.org/abs/2410.01448
Authors: Dinh-Viet-Toan Le, Louis Bigo, Mikaela Keller
Keywords: Natural Language Processing, Natural Language, Language Processing, Byte-Pair Encoding, Processing to build
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted to 3rd Workshop on NLP for Music and Audio (NLP4MusA, co-located with ISMIR 2024)

Abstract:Byte-Pair Encoding (BPE) is an algorithm commonly used in Natural Language Processing to build a vocabulary of subwords, which has been recently applied to symbolic music. Given that symbolic music can differ significantly from text, particularly with polyphony, we investigate how BPE behaves with different types of musical content. This study provides a qualitative analysis of BPE’s behavior across various instrumentations and evaluates its impact on a musical phrase segmentation task for both monophonic and polyphonic music. Our findings show that the BPE training process is highly dependent on the instrumentation and that BPE “supertokens” succeed in capturing abstract musical content. In a musical phrase segmentation task, BPE notably improves performance in a polyphonic setting, but enhances performance in monophonic tunes only within a specific range of BPE merges.
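As background, the core BPE training loop is short enough to sketch. The version below repeatedly finds the most frequent adjacent token pair and merges it into a single new token, which is what produces the "supertokens" the abstract mentions. The note-name tokens used in the usage note are a simplifying assumption; real symbolic-music tokenizations encode pitch, duration, and timing in their own formats.

```python
from collections import Counter
from typing import List, Optional, Tuple

Pair = Tuple[str, str]


def most_frequent_pair(sequences: List[List[str]]) -> Optional[Pair]:
    """Return the most common adjacent token pair, or None if there is none."""
    counts: Counter = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq: List[str], pair: Pair) -> List[str]:
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged: List[str] = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + "+" + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def train_bpe(sequences: List[List[str]], num_merges: int):
    """Run up to `num_merges` merges; return the merge list and the new sequences."""
    merges: List[Pair] = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        sequences = [merge_pair(seq, pair) for seq in sequences]
        merges.append(pair)
    return merges, sequences
```

For example, `train_bpe([["C4", "E4", "G4", "C4", "E4"], ["C4", "E4", "A4"]], 1)` merges the pair `("C4", "E4")`, which occurs three times, into the single token `"C4+E4"`. The number of merges is the knob whose range the paper reports as mattering for monophonic tunes.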

[NLP-52] Geometric Signatures of Compositionality Across a Language Model's Lifetime ICLR2025

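Quick read: This paper investigates the representational mechanisms behind the strong performance of artificial language models (LMs) on compositional generalization tasks, in particular how they represent the compositionality of language. The key idea is a high-level geometric approach that relates the degree of compositionality in a dataset to the intrinsic dimensionality of its representations under an LM. The degree of dataset compositionality is reflected in the intrinsic dimensionality of the representations, and this relationship arises from linguistic features learned over training. Linear and nonlinear dimensionality are found to encode formal and semantic aspects of linguistic composition, respectively.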

Link: https://arxiv.org/abs/2410.01444
Authors: Jin Hwa Lee, Thomas Jiralerspong, Lei Yu, Yoshua Bengio, Emily Cheng
Keywords: syntactic rules, permits the infinite, expression is constructed, parts and syntactic, infinite productivity
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: Under review as a conference paper at ICLR 2025

Abstract:Compositionality, the notion that the meaning of an expression is constructed from the meaning of its parts and syntactic rules, permits the infinite productivity of human language. For the first time, artificial language models (LMs) are able to match human performance in a number of compositional generalization tasks. However, much remains to be understood about the representational mechanisms underlying these abilities. We take a high-level geometric approach to this problem by relating the degree of compositionality in a dataset to the intrinsic dimensionality of its representations under an LM, a measure of feature complexity. We find not only that the degree of dataset compositionality is reflected in representations’ intrinsic dimensionality, but that the relationship between compositionality and geometric complexity arises due to learned linguistic features over training. Finally, our analyses reveal a striking contrast between linear and nonlinear dimensionality, showing that they respectively encode formal and semantic aspects of linguistic composition.

[NLP-53] Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models

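【Quick read】: This paper addresses the reusability and composability of subnetworks (circuits) in neural networks, particularly language models. The key is an analysis of the modularity of circuits for highly compositional subtasks within a transformer-based language model. By identifying and comparing the circuits responsible for ten modular string-edit operations, the study finds that functionally similar circuits show notable node overlap and cross-task faithfulness. It further shows that the identified circuits can be reused and combined through subnetwork set operations to represent more complex functional capabilities of the model.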

Link: https://arxiv.org/abs/2410.01434
Authors: Philipp Mondorf,Sondre Wold,Barbara Plank
Keywords: implement reusable functions, implement reusable, fundamental question, reusable functions, composed to perform
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 24 pages, 17 figures

Abstract:A fundamental question in interpretability research is to what extent neural networks, particularly language models, implement reusable functions via subnetworks that can be composed to perform more complex tasks. Recent developments in mechanistic interpretability have made progress in identifying subnetworks, often referred to as circuits, which represent the minimal computational subgraph responsible for a model’s behavior on specific tasks. However, most studies focus on identifying circuits for individual tasks without investigating how functionally similar circuits relate to each other. To address this gap, we examine the modularity of neural networks by analyzing circuits for highly compositional subtasks within a transformer-based language model. Specifically, given a probabilistic context-free grammar, we identify and compare circuits responsible for ten modular string-edit operations. Our results indicate that functionally similar circuits exhibit both notable node overlap and cross-task faithfulness. Moreover, we demonstrate that the circuits identified can be reused and combined through subnetwork set operations to represent more complex functional capabilities of the model.
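
The node-overlap comparison and the set-based composition mentioned in the abstract can be pictured with ordinary set operations. This is a minimal sketch, not the authors' code: circuits are modelled as sets of hypothetical node names, overlap is measured with intersection-over-union (an assumed metric), and composition is shown as a plain union.

```python
def node_overlap(a: set, b: set) -> float:
    """Intersection-over-union of two circuits' node sets (assumed metric)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def compose(*circuits: set) -> set:
    """Combine circuits by set union, one possible subnetwork set operation."""
    combined = set()
    for circuit in circuits:
        combined |= circuit
    return combined


# Hypothetical node names; real circuits would come from a circuit-discovery method.
copy_circuit = {"L0.H1", "L1.H3", "L2.MLP"}
reverse_circuit = {"L0.H1", "L1.H3", "L3.H0"}

overlap = node_overlap(copy_circuit, reverse_circuit)
merged = compose(copy_circuit, reverse_circuit)
```

Whether a union of two circuits actually behaves like the composed task still has to be checked on the model itself; the set operation only proposes the candidate subnetwork.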

[NLP-54] Can We Further Elicit Reasoning in LLMs? Critic-Guided Planning with Retrieval-Augmentation for Solving Challenging Tasks

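【Quick read】: This paper addresses the weaknesses of current large language models (LLMs) in complex reasoning and factual correctness, especially on challenging tasks such as competitive programming and mathematics, where frequent reasoning errors and irrelevant knowledge retrieval degrade performance. The key is a new framework, CR-Planner, which uses fine-tuned critic models to guide both reasoning and retrieval. CR-Planner solves a problem by iteratively selecting and executing sub-goals: a sub-goal critic guides sub-goal selection and an execution critic guides execution. Monte Carlo Tree Search is used to collect training data for the critic models, allowing systematic exploration of action sequences and their long-term impacts. The approach significantly outperforms baselines on domain-knowledge-intensive and reasoning-heavy tasks.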

Link: https://arxiv.org/abs/2410.01428
Authors: Xingxuan Li,Weiwen Xu,Ruochen Zhao,Fangkai Jiao,Shafiq Joty,Lidong Bing
Keywords: exhibit impressive problem-solving, impressive problem-solving capabilities, large language models, factual correctness, improve factual correctness
Categories: Computation and Language (cs.CL)
Comments: Work in progress

Abstract:State-of-the-art large language models (LLMs) exhibit impressive problem-solving capabilities but may struggle with complex reasoning and factual correctness. Existing methods harness the strengths of chain-of-thought and retrieval-augmented generation (RAG) to decompose a complex problem into simpler steps and apply retrieval to improve factual correctness. These methods work well on straightforward reasoning tasks but often falter on challenging tasks such as competitive programming and mathematics, due to frequent reasoning errors and irrelevant knowledge retrieval. To address this, we introduce Critic-guided planning with Retrieval-augmentation, CR-Planner, a novel framework that leverages fine-tuned critic models to guide both reasoning and retrieval processes through planning. CR-Planner solves a problem by iteratively selecting and executing sub-goals. Initially, it identifies the most promising sub-goal from reasoning, query generation, and retrieval, guided by rewards given by a critic model named sub-goal critic. It then executes this sub-goal through sampling and selecting the optimal output based on evaluations from another critic model named execution critic. This iterative process, informed by retrieved information and critic models, enables CR-Planner to effectively navigate the solution space towards the final answer. We employ Monte Carlo Tree Search to collect the data for training the critic models, allowing for a systematic exploration of action sequences and their long-term impacts. We validate CR-Planner on challenging domain-knowledge-intensive and reasoning-heavy tasks, including competitive programming, theorem-driven math reasoning, and complex domain retrieval problems. Our experiments demonstrate that CR-Planner significantly outperforms baselines, highlighting its effectiveness in addressing challenging problems by improving both reasoning and retrieval.
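
The select-then-execute loop described in the abstract can be sketched as follows. Everything here is a stand-in: `toy_subgoal_critic`, `toy_execution_critic`, and `toy_sample` are hypothetical toy functions, whereas the paper uses fine-tuned critic models together with an LLM and a retriever.

```python
from typing import Callable, List

SUBGOALS = ["reason", "gen_query", "retrieve"]


def plan(
    state: List[str],
    subgoal_critic: Callable[[List[str], str], float],
    sample_outputs: Callable[[List[str], str], List[str]],
    execution_critic: Callable[[List[str], str], float],
    max_steps: int = 3,
) -> List[str]:
    """Iteratively pick the best-scored sub-goal, then its best-scored output."""
    for _ in range(max_steps):
        # 1) Sub-goal selection: the sub-goal critic scores each option.
        goal = max(SUBGOALS, key=lambda g: subgoal_critic(state, g))
        # 2) Execution: sample candidates, keep the one the execution critic prefers.
        candidates = sample_outputs(state, goal)
        best = max(candidates, key=lambda out: execution_critic(state, out))
        state = state + [f"{goal}:{best}"]
    return state


# Toy stand-ins (assumptions): avoid repeating the previous sub-goal,
# and prefer longer outputs.
def toy_subgoal_critic(state: List[str], goal: str) -> float:
    return 0.0 if state and state[-1].startswith(goal + ":") else 1.0


def toy_sample(state: List[str], goal: str) -> List[str]:
    return ["a", "bbb", "cc"]


def toy_execution_critic(state: List[str], out: str) -> float:
    return float(len(out))


trace = plan([], toy_subgoal_critic, toy_sample, toy_execution_critic)
```

The sketch stops after a fixed number of steps; a real planner would also need a termination check for reaching a final answer, which the abstract does not detail.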

[NLP-55] The Labyrinth of Links: Navigating the Associative Maze of Multi-modal LLMs

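【Quick read】: This paper addresses the weak association ability of multi-modal large language models (MLLMs). The key is a new benchmark that evaluates a model's ability to link observations with prior practice memory, i.e., the association task. Using an annotation-free method for converting general datasets, the paper builds three levels of association tasks (single-step, synchronous, and asynchronous) and runs a comprehensive zero-shot evaluation across multiple models and memory strategies. The results show that current open-source MLLMs perform poorly on association tasks, and that even the state-of-the-art GPT-4V(vision) shows a significant gap compared to humans. The authors expect the benchmark to inform future MLLM research.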

Link: https://arxiv.org/abs/2410.01417
Authors: Hong Li,Nanxi Li,Yuanjie Chen,Jianbin Zhu,Qinlu Guo,Cewu Lu,Yong-Lu Li
Keywords: Multi-modal Large Language, Large Language Models, Multi-modal Large, Large Language, Language Models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Multi-modal Large Language Models (MLLMs) have exhibited impressive capability. However, recently many deficiencies of MLLMs have been found compared to human intelligence, e.g., hallucination. To drive the MLLMs study, the community dedicated efforts to building larger benchmarks with complex tasks. In this paper, we propose benchmarking an essential but usually overlooked intelligence: association, a human’s basic capability to link observation and prior practice memory. To comprehensively investigate MLLM’s performance on the association, we formulate the association task and devise a standard benchmark based on adjective and verb semantic concepts. Instead of costly data annotation and curation, we propose a convenient annotation-free construction method transforming the general dataset for our association tasks. Simultaneously, we devise a rigorous data refinement process to eliminate confusion in the raw dataset. Building on this database, we establish three levels of association tasks: single-step, synchronous, and asynchronous associations. Moreover, we conduct a comprehensive investigation into the MLLMs’ zero-shot association capabilities, addressing multiple dimensions, including three distinct memory strategies, both open-source and closed-source MLLMs, cutting-edge Mixture-of-Experts (MoE) models, and the involvement of human experts. Our systematic investigation shows that current open-source MLLMs consistently exhibit poor capability in our association tasks, even the currently state-of-the-art GPT-4V(vision) also has a significant gap compared to humans. We believe our benchmark would pave the way for future MLLM studies. Our data and code are available at: this https URL.

[NLP-56] Question-guided Knowledge Graph Re-scoring and Injection for Knowledge Graph Question Answering EMNLP2024

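【Quick read】: This paper addresses a problem in knowledge graph question answering (KGQA): retrieved subgraphs contain distracting information that impedes accurate reasoning. The key is Q-KGR, a question-guided knowledge graph re-scoring method that eliminates pathways irrelevant to the input question so that the model focuses on pertinent factual knowledge, together with Knowformer, a parameter-efficient method for injecting the re-scored knowledge graph into large language models to strengthen their factual reasoning.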

Link: https://arxiv.org/abs/2410.01401
Authors: Yu Zhang,Kehai Chen,Xuefeng Bai,zhao kang,Quanjiang Guo,Min Zhang
Keywords: involves answering natural, leveraging structured information, structured information stored, Knowledge graph, answering natural language
Categories: Computation and Language (cs.CL)
Comments: Findings of EMNLP 2024

Abstract:Knowledge graph question answering (KGQA) involves answering natural language questions by leveraging structured information stored in a knowledge graph. Typically, KGQA initially retrieve a targeted subgraph from a large-scale knowledge graph, which serves as the basis for reasoning models to address queries. However, the retrieved subgraph inevitably brings distraction information for knowledge utilization, impeding the model’s ability to perform accurate reasoning. To address this issue, we propose a Question-guided Knowledge Graph Re-scoring method (Q-KGR) to eliminate noisy pathways for the input question, thereby focusing specifically on pertinent factual knowledge. Moreover, we introduce Knowformer, a parameter-efficient method for injecting the re-scored knowledge graph into large language models to enhance their ability to perform factual reasoning. Extensive experiments on multiple KGQA benchmarks demonstrate the superiority of our method over existing systems.

[NLP-57] CrowdCounter: A benchmark type-specific multi-target counterspeech dataset

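【Quick read】: This paper addresses how to counter hate speech effectively while upholding freedom of expression. The key is to support suggestion tools that help moderators and users write effective counterspeech. The paper introduces CrowdCounter, a new dataset of 3,425 hate speech-counterspeech pairs spanning six counterspeech types (empathy, humor, questioning, warning, shaming, contradiction), collected through an annotation platform designed to encourage type-specific, non-redundant, high-quality counterspeech. It also evaluates two generation frameworks, vanilla and type-controlled prompts, and finds that Flan-T5 performs best in the vanilla framework while DialoGPT is best at generating type-specific counterspeech.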

Link: https://arxiv.org/abs/2410.01400
Authors: Punyajoy Saha,Abhilash Datta,Abhik Jana,Animesh Mukherjee
Keywords: freedom of expression, presents a viable, viable alternative, alternative to banning, banning or suspending
Categories: Computation and Language (cs.CL)
Comments: 19 pages, 1 figure, 14 tables, Code available this https URL

Abstract:Counterspeech presents a viable alternative to banning or suspending users for hate speech while upholding freedom of expression. However, writing effective counterspeech is challenging for moderators/users. Hence, developing suggestion tools for writing counterspeech is the need of the hour. One critical challenge in developing such a tool is the lack of quality and diversity of the responses in the existing datasets. Hence, we introduce a new dataset - CrowdCounter containing 3,425 hate speech-counterspeech pairs spanning six different counterspeech types (empathy, humor, questioning, warning, shaming, contradiction), which is the first of its kind. The design of our annotation platform itself encourages annotators to write type-specific, non-redundant and high-quality counterspeech. We evaluate two frameworks for generating counterspeech responses - vanilla and type-controlled prompts - across four large language models. In terms of metrics, we evaluate the responses using relevance, diversity and quality. We observe that Flan-T5 is the best model in the vanilla framework across different models. Type-specific prompts enhance the relevance of the responses, although they might reduce the language quality. DialoGPT proves to be the best at following the instructions and generating the type-specific counterspeech accurately.

[NLP-58] PairDistill: Pairwise Relevance Distillation for Dense Retrieval EMNLP2024

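【Quick read】: This paper addresses the problem that existing knowledge distillation techniques for information retrieval rely mainly on pointwise rerankers, which assign absolute relevance scores and therefore lead to inconsistent comparisons between documents. The key is Pairwise Relevance Distillation (PairDistill), which uses pairwise reranking to provide fine-grained distinctions between similarly relevant documents, enriching the training of dense retrieval models. Experiments show that PairDistill outperforms existing methods and achieves new state-of-the-art results across multiple benchmarks.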

Link: https://arxiv.org/abs/2410.01383
Authors: Chao-Wei Huang,Yun-Nung Chen
Keywords: Effective information retrieval, vast datasets relies, Effective information, extract relevant information, response to queries
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2024 Main Conference

Abstract:Effective information retrieval (IR) from vast datasets relies on advanced techniques to extract relevant information in response to queries. Recent advancements in dense retrieval have showcased remarkable efficacy compared to traditional sparse retrieval methods. To further enhance retrieval performance, knowledge distillation techniques, often leveraging robust cross-encoder rerankers, have been extensively explored. However, existing approaches primarily distill knowledge from pointwise rerankers, which assign absolute relevance scores to documents, thus facing challenges related to inconsistent comparisons. This paper introduces Pairwise Relevance Distillation (PairDistill) to leverage pairwise reranking, offering fine-grained distinctions between similarly relevant documents to enrich the training of dense retrieval models. Our experiments demonstrate that PairDistill outperforms existing methods, achieving new state-of-the-art results across multiple benchmarks. This highlights the potential of PairDistill in advancing dense retrieval techniques effectively. Our source code and trained models are released at this https URL
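
One way to picture pairwise distillation is a loss that pushes the student's pairwise preference toward the teacher's. The sketch below is an assumption about the general shape of such a loss, not the paper's exact objective: the student's preference is a sigmoid of its score difference, and the loss is the binary KL divergence from the teacher's pairwise probability. All scores are made-up numbers.

```python
import math


def student_pair_prob(score_i: float, score_j: float) -> float:
    """Student's probability that document i is more relevant than document j."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))


def pairwise_kl(teacher_p: float, student_p: float, eps: float = 1e-12) -> float:
    """Binary KL(teacher || student) for one document pair."""
    t = min(max(teacher_p, eps), 1.0 - eps)
    s = min(max(student_p, eps), 1.0 - eps)
    return t * math.log(t / s) + (1.0 - t) * math.log((1.0 - t) / (1.0 - s))


# The student gives both documents the same score, so it cannot tell them apart.
p_student = student_pair_prob(2.0, 2.0)
# A pairwise teacher prefers document i with probability 0.9 (illustrative value).
loss_before = pairwise_kl(0.9, p_student)
# After the student learns to separate the two scores, the loss shrinks.
loss_after = pairwise_kl(0.9, student_pair_prob(4.2, 2.0))
```

The point of the pairwise signal is visible in the tied case: a pointwise score alone gives no gradient for ordering two equally scored documents, while the pairwise target does.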

[NLP-59] Knowledge Entropy Decay during Language Model Pretraining Hinders New Knowledge Acquisition

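【Quick read】: This paper studies how a model's tendency to broadly integrate its parametric knowledge evolves during pretraining, and how this affects knowledge acquisition and forgetting. The key is the concept of knowledge entropy, which quantifies the range of memory sources the model engages with. High knowledge entropy means the model draws on a wide range of memory sources, while low knowledge entropy means it relies on specific sources with greater certainty. The study finds that knowledge entropy declines as pretraining advances, and that this decline is closely associated with a reduced ability to acquire and retain knowledge. Experiments show that increasing the activity of inactive memory sources improves knowledge acquisition and retention, which supports the conclusion that declining knowledge entropy impairs these capabilities.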

Link: https://arxiv.org/abs/2410.01380
Authors: Jiyeon Kim,Hyunji Lee,Hyowon Cho,Joel Jang,Hyeonbin Hwang,Seungpil Won,Youbin Ahn,Dohaeng Lee,Minjoon Seo
Keywords: parametric knowledge evolves, knowledge entropy, knowledge, affects overall performance, tendency to broadly
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:In this work, we investigate how a model’s tendency to broadly integrate its parametric knowledge evolves throughout pretraining, and how this behavior affects overall performance, particularly in terms of knowledge acquisition and forgetting. We introduce the concept of knowledge entropy, which quantifies the range of memory sources the model engages with; high knowledge entropy indicates that the model utilizes a wide range of memory sources, while low knowledge entropy suggests reliance on specific sources with greater certainty. Our analysis reveals a consistent decline in knowledge entropy as pretraining advances. We also find that the decline is closely associated with a reduction in the model’s ability to acquire and retain knowledge, leading us to conclude that diminishing knowledge entropy (smaller number of active memory sources) impairs the model’s knowledge acquisition and retention capabilities. We find further support for this by demonstrating that increasing the activity of inactive memory sources enhances the model’s capacity for knowledge acquisition and retention.
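
As a rough illustration, knowledge entropy can be read as the Shannon entropy of how the model spreads its use over memory sources. The sketch assumes that a vector of non-negative usage coefficients, one per memory source, is already available; how the paper derives those coefficients is not stated in the abstract, so the inputs below are made up.

```python
import math
from typing import Sequence


def knowledge_entropy(coefficients: Sequence[float]) -> float:
    """Shannon entropy (in nats) of normalized memory-source usage."""
    total = sum(coefficients)
    if total <= 0:
        raise ValueError("coefficients must have a positive sum")
    probs = [c / total for c in coefficients]
    # Sources with zero usage contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0)


broad = knowledge_entropy([1.0, 1.0, 1.0, 1.0])   # four sources used evenly
narrow = knowledge_entropy([9.7, 0.1, 0.1, 0.1])  # one dominant source
```

Under this reading, the decline reported in the paper corresponds to usage drifting from the `broad` pattern toward the `narrow` one as pretraining proceeds.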

[NLP-60] PCQPR: Proactive Conversational Question Planning with Reflection EMNLP2024

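【Quick read】: This paper addresses the inability of traditional conversational question generation (CQG) systems to guide a conversation toward a specified conclusion. The key is a new approach, Proactive Conversational Question Planning with self-Refining (PCQPR). PCQPR combines a planning algorithm inspired by Monte Carlo Tree Search (MCTS) with the analytical capabilities of large language models (LLMs) to predict future conversation turns and continuously refine its questioning strategy, so that the generated questions steer the conversation toward the specified conclusion. This iterative self-refining mechanism is the core of the conclusion-oriented system.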

Link: https://arxiv.org/abs/2410.01363
Authors: Shasha Guo,Lizi Liao,Jing Zhang,Cuiping Li,Hong Chen
Keywords: Conversational Question Generation, customer service, Conversational Question, enhances the interactivity, Conclusion-driven Conversational Question
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by EMNLP 2024 Main

Abstract:Conversational Question Generation (CQG) enhances the interactivity of conversational question-answering systems in fields such as education, customer service, and entertainment. However, traditional CQG, focusing primarily on the immediate context, lacks the conversational foresight necessary to guide conversations toward specified conclusions. This limitation significantly restricts their ability to achieve conclusion-oriented conversational outcomes. In this work, we redefine the CQG task as Conclusion-driven Conversational Question Generation (CCQG) by focusing on proactivity, not merely reacting to the unfolding conversation but actively steering it towards a conclusion-oriented question-answer pair. To address this, we propose a novel approach, called Proactive Conversational Question Planning with self-Refining (PCQPR). Concretely, by integrating a planning algorithm inspired by Monte Carlo Tree Search (MCTS) with the analytical capabilities of large language models (LLMs), PCQPR predicts future conversation turns and continuously refines its questioning strategies. This iterative self-refining mechanism ensures the generation of contextually relevant questions strategically devised to reach a specified outcome. Our extensive evaluations demonstrate that PCQPR significantly surpasses existing CQG methods, marking a paradigm shift towards conclusion-oriented conversational question-answering systems.

[NLP-61] Assisted Data Annotation for Business Process Information Extraction from Textual Documents

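【Quick read】: Machine-learning based generation of process models from natural language text could replace the time-consuming and costly process discovery phase, but research on it is held back by a lack of large, high-quality datasets. This paper addresses that gap with two assistance features for dataset creation: a recommendation system for identifying process information in text, and a visualization of the already identified process information as a graphical business process model. In a controlled user study with 31 participants, recommendations lowered all aspects of workload (by up to 51.0%) and significantly improved annotation quality (by up to 38.9%).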

Link: https://arxiv.org/abs/2410.01356
Authors: Julian Neuberger,Han van der Aa,Lars Ackermann,Daniel Buschek,Jannic Herrmann,Stefan Jablonski
Keywords: Machine-learning based generation, Machine-learning based, expensive process discovery, process discovery phase, natural language text
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Machine-learning based generation of process models from natural language text process descriptions provides a solution for the time-intensive and expensive process discovery phase. Many organizations have to carry out this phase, before they can utilize business process management and its benefits. Yet, research towards this is severely restrained by an apparent lack of large and high-quality datasets. This lack of data can be attributed to, among other things, an absence of proper tool assistance for dataset creation, resulting in high workloads and inferior data quality. We explore two assistance features to support dataset creation, a recommendation system for identifying process information in the text and visualization of the current state of already identified process information as a graphical business process model. A controlled user study with 31 participants shows that assisting dataset creators with recommendations lowers all aspects of workload, up to -51.0%, and significantly improves annotation quality, up to +38.9%. We make all data and code available to encourage further research on additional novel assistance strategies.

[NLP-62] Layer Swapping for Zero-Shot Cross-Lingual Transfer in Large Language Models

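【Quick read】: This paper addresses the difficulty of fine-tuning large language models (LLMs) for mathematical reasoning in non-English languages, where task-specific data is often unavailable. The key is a model merging method: starting from the same pretrained model, separate "experts" are fine-tuned on English math instruction data and on generic instruction data in the target language, and the top and bottom transformer layers of the math expert are then replaced with the corresponding layers of the language expert. This improves math performance in the target language without further training. The method is simple and intuitive, and the merged models outperform the individual experts and other merging methods on the MGSM math benchmark by 10% across four major languages where math instruction data is scarce.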

Link: https://arxiv.org/abs/2410.01335
Authors: Lucas Bandarkar,Benjamin Muller,Pritish Yuvraj,Rui Hou,Nayan Singhal,Hongjiang Lv,Bing Liu
Keywords: Large Language Models, math instruction data, practice of combining, instruction data, Model merging
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 main pages, 23 pages total, 9 figures, 5 tables

Abstract:Model merging, such as model souping, is the practice of combining different models with the same architecture together without further training. In this work, we present a model merging methodology that addresses the difficulty of fine-tuning Large Language Models (LLMs) for target tasks in non-English languages, where task-specific data is often unavailable. We focus on mathematical reasoning and without in-language math data, facilitate cross-lingual transfer by composing language and math capabilities. Starting from the same pretrained model, we fine-tune separate “experts” on math instruction data in English and on generic instruction data in the target language. We then replace the top and bottom transformer layers of the math expert directly with layers from the language expert, which consequently enhances math performance in the target language. The resulting merged models outperform the individual experts and other merging methods on the math benchmark, MGSM, by 10% across four major languages where math instruction data is scarce. In addition, this layer swapping is simple, inexpensive, and intuitive, as it is based on an interpretative analysis of the most important parameter changes during the fine-tuning of each expert. The ability to successfully re-compose LLMs for cross-lingual transfer in this manner opens up future possibilities to combine model expertise, create modular solutions, and transfer reasoning capabilities across languages all post hoc.
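
The layer swap itself is mechanically simple, which the following sketch tries to show. It is an illustration under stated assumptions rather than the authors' implementation: each expert is a list of toy per-layer parameter dicts, and the numbers of swapped layers (`n_bottom`, `n_top`) are hypothetical values, since the abstract does not say how many layers are replaced.

```python
from typing import Dict, List

Layers = List[Dict[str, float]]  # one toy parameter dict per transformer layer


def swap_layers(math_expert: Layers, lang_expert: Layers,
                n_bottom: int, n_top: int) -> Layers:
    """Return the math expert with its bottom and top layers taken from the language expert."""
    if len(math_expert) != len(lang_expert):
        raise ValueError("experts must share the same architecture")
    n = len(math_expert)
    if n_bottom + n_top > n:
        raise ValueError("cannot swap more layers than the model has")
    merged = []
    for i in range(n):
        use_lang = i < n_bottom or i >= n - n_top
        source = lang_expert if use_lang else math_expert
        merged.append(dict(source[i]))  # copy so the experts stay untouched
    return merged


# Toy experts with six layers each; weights are placeholders that mark their origin.
math_expert = [{"w": 10.0 + i} for i in range(6)]
lang_expert = [{"w": 20.0 + i} for i in range(6)]
merged = swap_layers(math_expert, lang_expert, n_bottom=2, n_top=1)
origins = ["lang" if layer["w"] >= 20 else "math" for layer in merged]
```

No gradient step is involved, which matches the abstract's description of merging as combining models without further training.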

[NLP-63] Unveiling Language Skills under Circuits

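【Quick read】: This paper addresses two shortcomings of existing circuit analyses of language models (LMs): they fail to capture the full functional scope of the model, mainly because they exclude feed-forward layers, and they struggle to isolate a single language skill from text that involves multiple entangled skills. The key is the Memory Circuit, a minimum unit that fully and independently manipulates the memory-reading functionality of a language model. The transformer is disentangled into a circuit graph, an ensemble of paths connecting different memory circuits, from which the authors identify salient paths (skill paths) responsible for three language skills: the Previous Token Skill, the Induction Skill, and the In-Context Learning (ICL) Skill. Using causal effect estimation through interventions and counterfactuals, they validate three longstanding hypotheses: language skills are identifiable through circuit dissection; simple skills reside in shallow layers while complex skills are found in deeper layers; and complex skills are formed on top of simpler ones.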

Link: https://arxiv.org/abs/2410.01334
Authors: Hang Chen,Jiaying Zhu,Xinyu Yang,Wenya Wang
Keywords: language skills, complex language skills, language, skills, Simple language skills
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The exploration of language skills in language models (LMs) has always been one of the central goals in mechanistic interpretability. However, existing circuit analyses often fall short in representing the full functional scope of these models, primarily due to the exclusion of Feed-Forward layers. Additionally, isolating the effect of a single language skill from a text, which inherently involves multiple entangled skills, poses a significant challenge. To address these gaps, we introduce a novel concept, Memory Circuit, a minimum unit that fully and independently manipulates the memory-reading functionality of a language model, and disentangle the transformer model precisely into a circuit graph which is an ensemble of paths connecting different memory circuits. Based on this disentanglement, we identify salient circuit paths, named as skill paths, responsible for three crucial language skills, i.e., the Previous Token Skill, Induction Skill and In-Context Learning (ICL) Skill, leveraging causal effect estimation through interventions and counterfactuals. Our experiments on various datasets confirm the correspondence between our identified skill paths and language skills, and validate three longstanding hypotheses: 1) Language skills are identifiable through circuit dissection; 2) Simple language skills reside in shallow layers, whereas complex language skills are found in deeper layers; 3) Complex language skills are formed on top of simpler language skills. Our codes are available at: this https URL.

[NLP-64] Emotion-Aware Response Generation Using Affect-Enriched Embeddings with LLMs

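【Quick read】: This paper addresses how to improve the emotional and contextual understanding of large language models (LLMs) in automated chatbot-facilitated psychotherapy. The key is a new framework that integrates multiple emotion lexicons (NRC Emotion Lexicon, VADER, WordNet, and SentiWordNet) with LLMs such as LLAMA 2, Flan-T5, ChatGPT 3.0, and ChatGPT 4.0. Therapy session transcripts are segmented into smaller chunks, enhanced with lexical features, and embedded using BERT, GPT-3, and RoBERTa; the embeddings are stored in a FAISS vector database for efficient similarity search and clustering. The most relevant segments are retrieved and supplied as context, which improves the LLMs' ability to generate empathetic, contextually appropriate responses.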

Link: https://arxiv.org/abs/2410.01306
Authors: Abdur Rasool,Muhammad Irfan Shahzad,Hafsa Aslam,Vincent Chan
Keywords: automated chatbot-facilitated psychotherapy, chatbot-facilitated psychotherapy sessions, automated chatbot-facilitated, NRC Emotion Lexicon, including NRC Emotion
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:There is a need for empathetic and coherent responses in automated chatbot-facilitated psychotherapy sessions. This study addresses the challenge of enhancing the emotional and contextual understanding of large language models (LLMs) in psychiatric applications. We introduce a novel framework that integrates multiple emotion lexicons, including NRC Emotion Lexicon, VADER, WordNet, and SentiWordNet, with state-of-the-art LLMs such as LLAMA 2, Flan-T5, ChatGPT 3.0, and ChatGPT 4.0. The primary dataset comprises over 2,000 therapy session transcripts from the Counseling and Psychotherapy database, covering discussions on anxiety, depression, trauma, and addiction. We segment the transcripts into smaller chunks, enhancing them with lexical features and computing embeddings using BERT, GPT-3, and RoBERTa to capture semantic and emotional nuances. These embeddings are stored in a FAISS vector database, enabling efficient similarity search and clustering based on cosine similarity. Upon user query, the most relevant segments are retrieved and provided as context to the LLMs, significantly improving the models’ ability to generate empathetic and contextually appropriate responses. Experimental evaluations demonstrate that incorporating emotion lexicons enhances empathy, coherence, informativeness, and fluency scores. Our findings highlight the critical role of emotional embeddings in improving LLM performance for psychotherapy.
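
The retrieval step in the abstract (embed chunks, then fetch the most similar ones by cosine similarity) can be sketched in a few lines. This is a pure-Python stand-in, not the paper's pipeline: it does not use FAISS, and the three-dimensional vectors and chunk texts are made-up placeholders for real BERT, GPT-3, or RoBERTa embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float],
          chunks: List[Tuple[str, Sequence[float]]],
          k: int) -> List[str]:
    """Return the texts of the k chunks most similar to the query embedding."""
    ranked = sorted(chunks, key=lambda c: cosine(query, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Placeholder transcript chunks with toy embeddings.
chunks = [
    ("I feel anxious before exams", [0.9, 0.1, 0.0]),
    ("We talked about sleep routines", [0.1, 0.9, 0.1]),
    ("My worries keep me up at night", [0.7, 0.6, 0.0]),
]
context = top_k([1.0, 0.2, 0.0], chunks, k=2)
```

The retrieved `context` would then be prepended to the user query before calling the LLM; a vector index such as FAISS replaces the linear scan once the collection grows.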

[NLP-65] Revisiting Hierarchical Text Classification: Inference and Metrics CONLL2024

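【Quick read】: This paper addresses how hierarchical text classification (HTC) should be evaluated. Recent work treats HTC as a conventional multilabel classification problem and evaluates it as such; the paper instead proposes evaluating models with specifically designed hierarchical metrics. Its main contributions are an analysis of how metric choice and prediction inference method interact, a new challenging dataset, and a new theoretically motivated loss. Experiments show that simple but strong baselines are very often competitive with the latest sophisticated models, which underlines the importance of evaluation methodology when proposing new HTC methods.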

Link: https://arxiv.org/abs/2410.01305
Authors: Roman Plaud,Matthieu Labeau,Antoine Saillenfest,Thomas Bonald
Keywords: structured space organized, Hierarchical text classification, task of assigning, assigning labels, structured space
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at CoNLL 2024

Abstract:Hierarchical text classification (HTC) is the task of assigning labels to a text within a structured space organized as a hierarchy. Recent works treat HTC as a conventional multilabel classification problem, therefore evaluating it as such. We instead propose to evaluate models based on specifically designed hierarchical metrics and we demonstrate the intricacy of metric choice and prediction inference method. We introduce a new challenging dataset and we evaluate fairly, recent sophisticated models, comparing them with a range of simple but strong baselines, including a new theoretically motivated loss. Finally, we show that those baselines are very often competitive with the latest models. This highlights the importance of carefully considering the evaluation methodology when proposing new methods for HTC. Code implementation and dataset are available at this https URL.
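To make the contrast between multilabel and hierarchical evaluation concrete, here is a minimal sketch of one hierarchical metric that is common in the HTC literature: F1 computed over label sets augmented with their ancestors. Whether the paper uses exactly this definition is an assumption; the toy taxonomy is hypothetical.

```python
def with_ancestors(labels, parent):
    """Expand a label set with every ancestor of each label."""
    expanded = set()
    for label in labels:
        while label is not None:
            expanded.add(label)
            label = parent.get(label)
    return expanded

def hierarchical_f1(true_labels, pred_labels, parent):
    """F1 over ancestor-augmented label sets, so near misses get partial credit."""
    t = with_ancestors(true_labels, parent)
    p = with_ancestors(pred_labels, parent)
    overlap = len(t & p)
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(t)
    return 2 * precision * recall / (precision + recall)

# Hypothetical taxonomy: child -> parent (None marks the root).
parent = {"science": None, "physics": "science", "biology": "science", "optics": "physics"}
score = hierarchical_f1({"optics"}, {"biology"}, parent)
print(round(score, 3))
```

A flat multilabel F1 would give this prediction zero credit, while the hierarchical variant rewards it for landing in the right top-level branch.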

[NLP-66] Endless Jailbreaks with Bijection Learning

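[Quick read]: This paper studies the vulnerability of large language models (LLMs) to adversarial inputs. It introduces an attack paradigm called bijection learning, which uses the model's own reasoning ability to teach it an invertible language (a bijection) in context, yielding a practically endless supply of jailbreak prompts. Harmful requests are encoded before being passed to the model, which bypasses built-in safety mechanisms, and the responses are decoded back into English. The attack is effective across a wide range of frontier models and harm categories, and it grows stronger as models become larger and better at reasoning.

Link: https://arxiv.org/abs/2410.01294
Authors: Brian R.Y. Huang, Maximilian Li, Leonard Tang
Keywords: extensive safety training, LLMs are vulnerable, adversarial inputs, vulnerable to adversarial, bijection learning
Categories: Computation and Language (cs.CL)
Comments: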

Abstract:Despite extensive safety training, LLMs are vulnerable to adversarial inputs. In this work, we introduce a simple but powerful attack paradigm, bijection learning, that yields a practically endless set of jailbreak prompts. We exploit language models’ advanced reasoning capabilities to teach them invertible languages (bijections) in context, pass encoded queries to the model to bypass built-in safety mechanisms, and finally decode responses back into English, yielding helpful replies to harmful requests. Our approach proves effective on a wide range of frontier language models and harm categories. Bijection learning is an automated and universal attack that grows stronger with scale: larger models with more advanced reasoning capabilities are more susceptible to bijection learning jailbreaks despite stronger safety mechanisms.

[NLP-67] Mitigating Copy Bias in In-Context Learning through Neuron Pruning

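[Quick read]: This paper addresses "copying bias" in few-shot in-context learning (ICL), where large language models (LLMs) copy answers from the provided examples instead of learning the underlying pattern. The proposed method is simple: build a synthetic task, use Integrated Gradients to identify neurons that prioritize copying over generalization, and prune those neurons, which improves performance across a diverse set of ICL tasks. The method applies to different LLM architectures, including Transformers and State-Space Models, without architectural changes. Viewing ICL through a task-recognition lens, the authors also find that pruning improves the quality of task vectors, suggesting that the pruned neurons previously hindered effective task recognition.

Link: https://arxiv.org/abs/2410.01288
Authors: Ameen Ali, Lior Wolf, Ivan Titov
Keywords: Large language models, demonstrated impressive few-shot, impressive few-shot in-context, Large language, few-shot in-context learning
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: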

Abstract: Large language models (LLMs) have demonstrated impressive few-shot in-context learning (ICL) abilities. Still, we show that they are sometimes prone to a 'copying bias', where they copy answers from provided examples instead of learning the underlying patterns. In this work, we propose a novel and simple method to mitigate such copying bias. First, we create a synthetic task and use the Integrated Gradients method to identify neurons that prioritize copying over generalization. We demonstrate that pruning these neurons consistently improves performance across a diverse set of ICL tasks. We also show that our method is applicable across various LLM architectures, including Transformers and State-Space Models, without requiring modifications. In our analysis, we adopt a task-recognition perspective on ICL and examine task vectors (Hendel et al., 2023) induced by the model. We find that pruning enhances the quality of these vectors, suggesting that the pruned neurons previously hindered effective task recognition.
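The attribution step can be illustrated on a toy function. The sketch below approximates Integrated Gradients with a midpoint Riemann sum and then picks the unit with the largest absolute attribution as a pruning candidate. The quadratic "model", its weights, and the pruning rule are assumptions for illustration only; the paper works on real LLM neurons and a synthetic ICL task.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate IG: (x - baseline) times the average gradient along the straight path."""
    n = len(x)
    avg_grad = [0.0] * n
    for s in range(steps):
        alpha = (s + 0.5) / steps  # midpoint rule
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_fn(point)
        for i in range(n):
            avg_grad[i] += g[i] / steps
    return [(xi - b) * gi for xi, b, gi in zip(x, baseline, avg_grad)]

# Toy differentiable "model": f(x) = sum_i w_i * x_i^2 (hypothetical weights).
weights = [1.0, -2.0, 0.5]

def f(x):
    return sum(w * xi * xi for w, xi in zip(weights, x))

def grad_f(x):
    return [2 * w * xi for w, xi in zip(weights, x)]

x = [1.0, 2.0, -2.0]
baseline = [0.0, 0.0, 0.0]
attr = integrated_gradients(grad_f, x, baseline)

# Candidate for pruning: the unit with the largest absolute attribution.
prune_idx = max(range(len(attr)), key=lambda i: abs(attr[i]))
print(attr, prune_idx)
```

The attributions sum to `f(x) - f(baseline)`, which is the completeness property that makes Integrated Gradients convenient for ranking units.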

[NLP-68] Enhancing Training Data Attribution for Large Language Models with Fitting Error Consideration EMNLP2024

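[Quick read]: This paper tackles the interpretability problems caused by the black-box nature of large language models (LLMs), particularly for data intellectual property protection and hallucination tracing. It proposes a new training data attribution (TDA) method, Debias and Denoise Attribution (DDA), which strengthens influence functions in two ways: a debias strategy removes the knowledge bias already present in the base model before fine-tuning, and a denoise strategy uses smoothing to reduce discrepancies in influence scores caused by varying degrees of fitting during training. Experiments show that DDA significantly outperforms existing methods, reaching an average AUC of 91.64%, and generalizes and scales well across sources and model sizes such as LLaMA2, QWEN2, and Mistral.

Link: https://arxiv.org/abs/2410.01285
Authors: Kangxi Wu, Liang Pang, Huawei Shen, Xueqi Cheng
Keywords: intellectual property protection, data intellectual property, large language models, impacting issues, hallucination tracing
Categories: Computation and Language (cs.CL)
Comments: Accepted to the EMNLP 2024 main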

Abstract:The black-box nature of large language models (LLMs) poses challenges in interpreting results, impacting issues such as data intellectual property protection and hallucination tracing. Training data attribution (TDA) methods are considered effective solutions to address these challenges. Most recent TDA methods rely on influence functions, assuming the model achieves minimized empirical risk. However, achieving this criterion is difficult, and sourcing accuracy can be compromised by fitting errors during model training. In this paper, we introduce a novel TDA method called Debias and Denoise Attribution (DDA), which enhances influence functions by addressing fitting errors. Specifically, the debias strategy seeks to improve the performance of influence functions by eliminating the knowledge bias present in the base model before fine-tuning, while the denoise strategy aims to reduce discrepancies in influence scores arising from varying degrees of fitting during the training process through smoothing techniques. Experimental results demonstrate that our method significantly outperforms existing approaches, achieving an averaged AUC of 91.64%. Moreover, DDA exhibits strong generality and scalability across various sources and different-scale models like LLaMA2, QWEN2, and Mistral.

[NLP-69] Deep Learning and Machine Learning, Advancing Big Data Analytics and Management: Unveiling AI's Potential Through Tools, Techniques, and Applications

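[Quick read]: This book is an introductory guide to deep learning and machine learning for big data analytics, aimed at the difficulties that both beginners and advanced users face in understanding and applying these techniques. It systematically covers essential concepts, practical tools such as ChatGPT and Claude, hardware recommendations, and how to set up a development environment with libraries like PyTorch and TensorFlow. Step-by-step instructions, hands-on projects, and insights into future AI trends such as AutoML and edge computing round out the material.

Link: https://arxiv.org/abs/2410.01268
Authors: Pohsun Feng, Ziqian Bi, Yizhu Wen, Xuanhe Pan, Benji Peng, Ming Liu, Jiawei Xu, Keyu Chen, Junyu Liu, Caitlyn Heqi Yin, Sen Zhang, Jinlang Wang, Qian Niu, Ming Li, Tianyang Wang
Keywords: big data analytics, data analytics, deep learning, machine learning, book serves
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: This book contains 156 pages and 9 figures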

Abstract:This book serves as an introduction to deep learning and machine learning, focusing on their applications in big data analytics. It covers essential concepts, tools like ChatGPT and Claude, hardware recommendations, and practical guidance on setting up development environments using libraries like PyTorch and TensorFlow. Designed for beginners and advanced users alike, it provides step-by-step instructions, hands-on projects, and insights into AI’s future, including AutoML and edge computing.

[NLP-70] HelpSteer2-Preference: Complementing Ratings with Preferences

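[Quick read]: This paper addresses the difficulty of comparing the two popular paradigms for training reward models, Bradley-Terry style and Regression style, when no adequately matched data exists. The key step is releasing preference annotations (for Bradley-Terry training) in the HelpSteer2 dataset to complement its existing ratings (for Regression-style training), so that the two approaches can be compared on matched data. Building on that comparison, the paper proposes a new approach that combines Bradley-Terry and Regression reward modeling. A Llama-3.1-70B-Instruct model tuned this way scores 94.1 on RewardBench, ranking first among more than 140 reward models as of 1 Oct 2024, and is also shown to be effective for aligning models to follow instructions in RLHF.

Link: https://arxiv.org/abs/2410.01257
Authors: Zhilin Wang, Alexander Bukharin, Olivier Delalleau, Daniel Egert, Gerald Shen, Jiaqi Zeng, Oleksii Kuchaiev, Yi Dong
Keywords: popular paradigms, adequately matched, Regression, Regression style, Reward
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 26 pages, 3 figures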

Abstract:Reward models are critical for aligning models to follow instructions, and are typically trained following one of two popular paradigms: Bradley-Terry style or Regression style. However, there is a lack of evidence that either approach is better than the other, when adequately matched for data. This is primarily because these approaches require data collected in different (but incompatible) formats, meaning that adequately matched data is not available in existing public datasets. To tackle this problem, we release preference annotations (designed for Bradley-Terry training) to complement existing ratings (designed for Regression style training) in the HelpSteer2 dataset. To improve data interpretability, preference annotations are accompanied with human-written justifications. Using this data, we conduct the first head-to-head comparison of Bradley-Terry and Regression models when adequately matched for data. Based on insights derived from such a comparison, we propose a novel approach to combine Bradley-Terry and Regression reward modeling. A Llama-3.1-70B-Instruct model tuned with this approach scores 94.1 on RewardBench, emerging top of more than 140 reward models as of 1 Oct 2024. We also demonstrate the effectiveness of this reward model at aligning models to follow instructions in RLHF. We open-source this dataset (CC-BY-4.0 license) at this https URL and openly release the trained Reward Model at this https URL
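The two paradigms differ mainly in the loss they optimize and therefore in the data format they need. The sketch below shows the textbook forms of both losses; it does not reproduce the combined approach proposed in the paper, and the function names are illustrative assumptions.

```python
import math

def bradley_terry_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood that the chosen response beats the rejected one.

    Needs pairwise preference data: -log(sigmoid(r_chosen - r_rejected)).
    """
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))

def regression_loss(predicted_rating, human_rating):
    """Squared error against an absolute rating, so it needs per-response scores."""
    return (predicted_rating - human_rating) ** 2

# A correctly ordered pair is penalized less than a wrongly ordered one.
print(bradley_terry_loss(3.0, 1.0), bradley_terry_loss(1.0, 3.0))
print(regression_loss(3.5, 4.0))
```

The Bradley-Terry loss only constrains the difference between two rewards, whereas the regression loss pins each reward to an absolute scale, which is why the two need differently formatted annotations.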

[NLP-71] AHP-Powered LLM Reasoning for Multi-Criteria Evaluation of Open-Ended Responses EMNLP2024

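[Quick read]: This paper addresses the difficulty of evaluating answers to open-ended questions, which are highly diverse and hard to quantify. The proposed method combines large language models (LLMs) with the analytic hierarchy process (AHP): LLMs generate multiple evaluation criteria for a question, answers are compared pairwise under each criterion, and AHP turns the comparisons into a score for each answer that better reflects human judgment. Experiments on several datasets show that the method outperforms the baselines, and the paper also examines how the number of criteria, the choice of model, and the dataset affect the results.

Link: https://arxiv.org/abs/2410.01246
Authors: Xiaotian Lu, Jiyi Li, Koh Takeuchi, Hisashi Kashima
Keywords: natural language processing, open-ended questions, extensively studied, field of natural, NLP
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted for EMNLP 2024 Findings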

Abstract:Question answering (QA) tasks have been extensively studied in the field of natural language processing (NLP). Answers to open-ended questions are highly diverse and difficult to quantify, and cannot be simply evaluated as correct or incorrect, unlike close-ended questions with definitive answers. While large language models (LLMs) have demonstrated strong capabilities across various tasks, they exhibit relatively weaker performance in evaluating answers to open-ended questions. In this study, we propose a method that leverages LLMs and the analytic hierarchy process (AHP) to assess answers to open-ended questions. We utilized LLMs to generate multiple evaluation criteria for a question. Subsequently, answers were subjected to pairwise comparisons under each criterion with LLMs, and scores for each answer were calculated in the AHP. We conducted experiments on four datasets using both ChatGPT-3.5-turbo and GPT-4. Our results indicate that our approach more closely aligns with human judgment compared to the four baselines. Additionally, we explored the impact of the number of criteria, variations in models, and differences in datasets on the results.
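The AHP scoring step can be sketched as follows: each criterion yields a pairwise comparison matrix over the answers, the matrix is reduced to a priority vector, and the vectors are combined with criterion weights. The geometric-mean approximation, the matrices, and the weights below are assumptions for illustration; the paper's exact aggregation may differ.

```python
import math

def priorities(matrix):
    """Priority vector from a pairwise comparison matrix (row geometric-mean method)."""
    geo = [math.prod(row) ** (1.0 / len(row)) for row in matrix]
    total = sum(geo)
    return [g / total for g in geo]

def ahp_scores(matrices, criterion_weights):
    """Weighted sum of the per-criterion priority vectors."""
    n = len(matrices[0])
    scores = [0.0] * n
    for matrix, weight in zip(matrices, criterion_weights):
        for i, p in enumerate(priorities(matrix)):
            scores[i] += weight * p
    return scores

# Hypothetical comparisons of three answers; entry [i][j] says how strongly i beats j.
by_accuracy = [[1.0, 2.0, 4.0], [0.5, 1.0, 2.0], [0.25, 0.5, 1.0]]
by_clarity = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

p = priorities(by_accuracy)
final = ahp_scores([by_accuracy, by_clarity], [0.75, 0.25])
print(p, final)
```

In the paper's setting an LLM would fill in the comparison matrices; here they are hard-coded so the arithmetic is easy to follow.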

[NLP-72] RGD: Multi-LLM Based Agent Debugger via Refinement and Generation Guidance

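[Quick read]: This paper addresses the limited accuracy of large language models (LLMs) on code generation, especially for complex tasks that require a deep understanding of both the problem statement and the generation process. It introduces Refinement and Guidance Debugging (RGD), an architecture in which three LLM agents (Guide Agent, Debug Agent, and Feedback Agent) cooperate to decompose code generation into multiple steps and to refine and debug code iteratively. Through self-reflection and feedback, RGD markedly improves code generation and refinement, with reported gains of 9.8% on HumanEval and 16.2% on MBPP over state-of-the-art approaches.

Link: https://arxiv.org/abs/2410.01242
Authors: Haolin Jin, Zechao Sun, Yiheng Yang, Huaming Chen
Keywords: Large Language Models, Large Language, Language Models, shown incredible potential, code
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: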

Abstract: Large Language Models (LLMs) have shown incredible potential in code generation tasks, and recent research in prompt engineering have enhanced LLMs’ understanding of textual information. However, ensuring the accuracy of generated code often requires extensive testing and validation by programmers. While LLMs can typically generate code based on task descriptions, their accuracy remains limited, especially for complex tasks that require a deeper understanding of both the problem statement and the code generation process. This limitation is primarily due to the LLMs’ need to simultaneously comprehend text and generate syntactically and semantically correct code, without having the capability to automatically refine the code. In real-world software development, programmers rarely produce flawless code in a single attempt based on the task description alone, they rely on iterative feedback and debugging to refine their programs. Inspired by this process, we introduce a novel architecture of LLM-based agents for code generation and automatic debugging: Refinement and Guidance Debugging (RGD). The RGD framework is a multi-LLM-based agent debugger that leverages three distinct LLM agents: Guide Agent, Debug Agent, and Feedback Agent. RGD decomposes the code generation task into multiple steps, ensuring a clearer workflow and enabling iterative code refinement based on self-reflection and feedback. Experimental results demonstrate that RGD exhibits remarkable code generation capabilities, achieving state-of-the-art performance with a 9.8% improvement on the HumanEval dataset and a 16.2% improvement on the MBPP dataset compared to the state-of-the-art approaches and traditional direct prompting approaches. We highlight the effectiveness of the RGD framework in enhancing LLMs’ ability to generate and refine code autonomously.

[NLP-73] Automatic deductive coding in discourse analysis: an application of large language models in learning analytics

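[Quick read]: This paper addresses the time-consuming, labor-intensive nature of traditional deductive coding in the analysis of teaching and learning interactions. It uses large language models such as GPT for automatic deductive coding and relies on prompt engineering to improve their performance. With a limited number of training samples, GPT with prompt engineering achieves higher accuracy and Kappa values than both a traditional text classification method and a BERT-like pretrained language model, demonstrating the potential of LLMs for automatic deductive coding.

Link: https://arxiv.org/abs/2410.01240
Authors: Lishan Zhang, Han Wu, Xiaoshan Huang, Tengfei Duan, Hanxiang Du
Keywords: learning analytics researchers, automatic deductive coding, Deductive coding, common discourse analysis, automatic deductive
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 20 pages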

Abstract:Deductive coding is a common discourse analysis method widely used by learning science and learning analytics researchers for understanding teaching and learning interactions. It often requires researchers to manually label all discourses to be analyzed according to a theoretically guided coding scheme, which is time-consuming and labor-intensive. The emergence of large language models such as GPT has opened a new avenue for automatic deductive coding to overcome the limitations of traditional deductive coding. To evaluate the usefulness of large language models in automatic deductive coding, we employed three different classification methods driven by different artificial intelligence technologies, including the traditional text classification method with text feature engineering, BERT-like pretrained language model and GPT-like pretrained large language model (LLM). We applied these methods to two different datasets and explored the potential of GPT and prompt engineering in automatic deductive coding. By analyzing and comparing the accuracy and Kappa values of these three classification methods, we found that GPT with prompt engineering outperformed the other two methods on both datasets with limited number of training samples. By providing detailed prompt structures, the reported work demonstrated how large language models can be used in the implementation of automatic deductive coding.
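The two evaluation measures mentioned above can be computed as follows. This sketch assumes that "Kappa" means Cohen's kappa between human codes and model codes, which the abstract does not state explicitly; the code labels are hypothetical.

```python
from collections import Counter

def accuracy(gold, pred):
    """Fraction of utterances where the predicted code matches the human code."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def cohen_kappa(gold, pred):
    """Agreement corrected for the agreement expected by chance."""
    n = len(gold)
    observed = accuracy(gold, pred)
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    expected = sum(gold_counts[c] * pred_counts[c] for c in gold_counts) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical coding scheme with two codes applied to six utterances.
gold = ["question", "question", "question", "explain", "explain", "explain"]
pred = ["question", "question", "explain", "explain", "explain", "explain"]

acc = accuracy(gold, pred)
kappa = cohen_kappa(gold, pred)
print(round(acc, 3), round(kappa, 3))
```

Kappa is lower than accuracy here because half of the observed agreement would be expected by chance with these label distributions.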

[NLP-74] From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging

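[Quick read]: This paper addresses the inability of existing LLM-based debugging systems to identify and fix bugs at multiple levels of granularity when handling complex problems. It introduces the Multi-Granularity Debugger (MGDebugger), which decomposes problematic code into a hierarchical tree of subfunctions and isolates, identifies, and resolves bugs at each level, from low-level syntax errors to high-level algorithmic flaws. MGDebugger debugs iteratively in a bottom-up manner and uses an LLM-simulated Python executor to trace code execution and variable states precisely, improving debugging accuracy and repair success rate.

Link: https://arxiv.org/abs/2410.01215
Authors: Yuling Shi, Songsong Wang, Chengcheng Wan, Xiaodong Gu
Keywords: large language models, made significant strides, requiring human intervention, complex problems, large language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Programming Languages (cs.PL); Software Engineering (cs.SE)
Comments: Code and data available at this https URL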

Abstract:While large language models have made significant strides in code generation, the pass rate of the generated code is bottlenecked on subtle errors, often requiring human intervention to pass tests, especially for complex problems. Existing LLM-based debugging systems treat generated programs as monolithic units, failing to address bugs at multiple levels of granularity, from low-level syntax errors to high-level algorithmic flaws. In this paper, we introduce Multi-Granularity Debugger (MGDebugger), a hierarchical code debugger by isolating, identifying, and resolving bugs at various levels of granularity. MGDebugger decomposes problematic code into a hierarchical tree structure of subfunctions, with each level representing a particular granularity of error. During debugging, it analyzes each subfunction and iteratively resolves bugs in a bottom-up manner. To effectively test each subfunction, we propose an LLM-simulated Python executor, which traces code execution and tracks important variable states to pinpoint errors accurately. Extensive experiments demonstrate that MGDebugger outperforms existing debugging systems, achieving an 18.9% improvement in accuracy over seed generations in HumanEval and a 97.6% repair success rate in HumanEvalFix. Furthermore, MGDebugger effectively fixes bugs across different categories and difficulty levels, demonstrating its robustness and effectiveness.
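The bottom-up order in which subfunctions are visited can be sketched as a post-order traversal of the subfunction tree. Only the traversal is shown; the tree, the function names, and the idea of passing each node to a debugging step are hypothetical stand-ins, and nothing here reproduces MGDebugger itself.

```python
def bottom_up_order(tree, root):
    """Post-order traversal: every subfunction appears after all of its children."""
    order = []

    def visit(name):
        for child in tree.get(name, []):
            visit(child)
        order.append(name)

    visit(root)
    return order

# Hypothetical decomposition of a generated solution into subfunctions.
tree = {
    "solve": ["parse_input", "compute"],
    "compute": ["helper_sum"],
}
order = bottom_up_order(tree, "solve")
print(order)
```

Debugging in this order means that by the time a parent function is examined, the helpers it depends on have already been checked.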

[NLP-75] StringLLM: Understanding the String Processing Capability of Large Language Models

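[Quick read]: This paper addresses the underexplored and underdeveloped string processing capability of large language models (LLMs). It proposes StringLLM, a method for constructing benchmark datasets, and uses it to build StringBench, a series of datasets covering a wide range of string processing tasks. Systematic evaluation and in-depth analysis reveal the limitations of LLMs on these tasks, and the paper then proposes a fine-tuning approach that significantly improves their string processing capability, laying a foundation for future research.

Link: https://arxiv.org/abs/2410.01208
Authors: Xilong Wang, Hao Fu, Neil Zhenqiang Gong
Keywords: String processing, string processing capability, LLMs’ string processing, processing, String
Categories: Computation and Language (cs.CL)
Comments: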

Abstract:String processing, which mainly involves the analysis and manipulation of strings, is a fundamental component of modern computing. Despite the significant advancements of large language models (LLMs) in various natural language processing (NLP) tasks, their capability in string processing remains underexplored and underdeveloped. To bridge this gap, we present a comprehensive study of LLMs’ string processing capability. In particular, we first propose StringLLM, a method to construct datasets for benchmarking string processing capability of LLMs. We use StringLLM to build a series of datasets, referred to as StringBench. It encompasses a wide range of string processing tasks, allowing us to systematically evaluate LLMs’ performance in this area. Our evaluations indicate that LLMs struggle with accurately processing strings compared to humans. To uncover the underlying reasons for this limitation, we conduct an in-depth analysis and subsequently propose an effective approach that significantly enhances LLMs’ string processing capability via fine-tuning. This work provides a foundation for future research to understand LLMs’ string processing capability. Our code and data are available at this https URL.

[NLP-76] Gold Panning in Vocabulary: An Adaptive Method for Vocabulary Expansion of Domain-Specific LLMs EMNLP2024

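[Quick read]: This paper addresses the poor performance of large language models (LLMs) in specialized domains, which stems from limited domain-specific knowledge. It proposes VEGAD, an adaptive method that automatically identifies a valuable subset of words from a given domain vocabulary, so that vocabulary expansion uses only the words that actually help. Experiments on three Chinese datasets show that VEGAD improves performance on domain-specific tasks and on general tasks alike, demonstrating its potential for vocabulary expansion.

Link: https://arxiv.org/abs/2410.01188
Authors: Chengyuan Liu, Shihang Wang, Lizhi Qing, Kun Kuang, Yangyang Kang, Changlong Sun, Fei Wu
Keywords: Large Language Models, Language Models, Large Language, demonstrate impressive generation, impressive generation abilities
Categories: Computation and Language (cs.CL)
Comments: Accepted by EMNLP 2024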

Abstract:While Large Language Models (LLMs) demonstrate impressive generation abilities, they frequently struggle when it comes to specialized domains due to their limited domain-specific knowledge. Studies on domain-specific LLMs resort to expanding the vocabulary before fine-tuning on domain-specific corpus, aiming to decrease the sequence length and enhance efficiency during decoding, without thoroughly investigating the results of vocabulary expansion to LLMs over different domains. Our pilot study reveals that expansion with only a subset of the entire vocabulary may lead to superior performance. Guided by the discovery, this paper explores how to identify a vocabulary subset to achieve the optimal results. We introduce VEGAD, an adaptive method that automatically identifies valuable words from a given domain vocabulary. Our method has been validated through experiments on three Chinese datasets, demonstrating its effectiveness. Additionally, we have undertaken comprehensive analyses of the method. The selection of a optimal subset for expansion has shown to enhance performance on both domain-specific tasks and general tasks, showcasing the potential of VEGAD.

[NLP-77] FastLexRank: Efficient Lexical Ranking for Structuring Social Media Posts

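[Quick read]: This paper addresses the high computational and memory cost of the original LexRank algorithm. By optimizing how the stationary distribution of the sentence graph is computed, FastLexRank reduces time and memory complexity from $\mathcal{O}(n^2)$ to $\mathcal{O}(n)$ without sacrificing the quality or accuracy of the results. This makes it possible to process large datasets such as social media corpora in real time and to identify central tweets that can then be analyzed with more advanced NLP techniques.

Link: https://arxiv.org/abs/2410.01183
Authors: Mao Li, Frederick Conrad, Johann Gagnon-Bartsch
Keywords: original LexRank method, original LexRank, LexRank algorithm, original LexRank scores
Categories: Computation and Language (cs.CL); Computation (stat.CO)
Comments: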

Abstract: We present FastLexRank (this https URL), an efficient and scalable implementation of the LexRank algorithm for text ranking. Designed to address the computational and memory complexities of the original LexRank method, FastLexRank significantly reduces time and memory requirements from $\mathcal{O}(n^2)$ to $\mathcal{O}(n)$ without compromising the quality or accuracy of the results. By employing an optimized approach to calculating the stationary distribution of sentence graphs, FastLexRank maintains an identical results with the original LexRank scores while enhancing computational efficiency. This paper details the algorithmic improvements that enable the processing of large datasets, such as social media corpora, in real-time. Empirical results demonstrate its effectiveness, and we propose its use in identifying central tweets, which can be further analyzed using advanced NLP techniques. FastLexRank offers a scalable solution for text centrality calculation, addressing the growing need for efficient processing of digital content.
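The abstract does not spell out how the quadratic cost is avoided, so the following is only one plausible illustration of how a centrality score can be obtained without building the $n \times n$ similarity matrix. It assumes unit-normalized sentence embeddings and dot-product similarity, and it computes row-sum (degree) centrality rather than claiming to reproduce FastLexRank's exact procedure.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def centrality_quadratic(embs):
    """Row sums of the full similarity matrix: O(n^2) dot products."""
    return [sum(sum(a * b for a, b in zip(u, v)) for v in embs) for u in embs]

def centrality_linear(embs):
    """Same row sums via one summed embedding: O(n) dot products."""
    dim = len(embs[0])
    total = [sum(e[d] for e in embs) for d in range(dim)]
    return [sum(a * b for a, b in zip(u, total)) for u in embs]

# Hypothetical 2-dimensional sentence embeddings.
embs = [normalize(v) for v in [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]]
slow = centrality_quadratic(embs)
fast = centrality_linear(embs)
print(slow, fast)
```

The identity used is that the sum over j of e_i · e_j equals e_i · (sum over j of e_j), so the pairwise matrix never has to be materialized.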

[NLP-78] UAL-Bench: The First Comprehensive Unusual Activity Localization Benchmark

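[Quick read]: This paper addresses the localization of unusual activities in videos, which is difficult largely because such events are underrepresented in pretraining datasets. It introduces the UAL-Bench benchmark and evaluates three approaches: Video-Language Models (Vid-LLMs), instruction-tuned Vid-LLMs, and an integration of Vision-Language Models with Large Language Models (VLM-LLM), which localizes short-span unusual events and predicts their onset more accurately than Vid-LLMs. The paper also proposes a new metric, R@1, TD = p, to address limitations of existing evaluation methods, and highlights the challenges posed by long-duration videos in scenarios such as autism diagnosis, pointing to directions for future research.

Link: https://arxiv.org/abs/2410.01180
Authors: Hasnat Md Abdullah, Tian Liu, Kangda Wei, Shu Kong, Ruihong Huang
Keywords: holds practical significance, Localizing unusual activities, videos holds practical, surveillance incidents, practical significance
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: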

Abstract:Localizing unusual activities, such as human errors or surveillance incidents, in videos holds practical significance. However, current video understanding models struggle with localizing these unusual events likely because of their insufficient representation in models’ pretraining datasets. To explore foundation models’ capability in localizing unusual activity, we introduce UAL-Bench, a comprehensive benchmark for unusual activity localization, featuring three video datasets: UAG-OOPS, UAG-SSBD, UAG-FunQA, and an instruction-tune dataset: OOPS-UAG-Instruct, to improve model capabilities. UAL-Bench evaluates three approaches: Video-Language Models (Vid-LLMs), instruction-tuned Vid-LLMs, and a novel integration of Vision-Language Models and Large Language Models (VLM-LLM). Our results show the VLM-LLM approach excels in localizing short-span unusual events and predicting their onset (start time) more accurately than Vid-LLMs. We also propose a new metric, R@1, TD = p, to address limitations in existing evaluation methods. Our findings highlight the challenges posed by long-duration videos, particularly in autism diagnosis scenarios, and the need for further advancements in localization techniques. Our work not only provides a benchmark for unusual activity localization but also outlines the key challenges for existing foundation models, suggesting future research directions on this important task.

[NLP-79] Towards Inference-time Category-wise Safety Steering for Large Language Models

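[Quick read]: This paper addresses the challenge of safety alignment for large language models (LLMs), specifically how to introduce fine-grained safety control over model outputs at inference time. It uses category-specific steering vectors together with more sophisticated methods for extracting informative steering vectors, so that safety steering becomes more effective while the quality of the generated text is preserved. The approach is demonstrated on multiple LLMs and datasets, followed by a discussion of implications and best practices.

Link: https://arxiv.org/abs/2410.01174
Authors: Amrita Bhattacharjee, Shaona Ghosh, Traian Rebedea, Christopher Parisien
Keywords: large language models, variety of use-cases, active research, large language, unprecedented advancements
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: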

Abstract:While large language models (LLMs) have seen unprecedented advancements in capabilities and applications across a variety of use-cases, safety alignment of these models is still an area of active research. The fragile nature of LLMs, even models that have undergone extensive alignment and safety training regimes, warrants additional safety steering steps via training-free, inference-time methods. While recent work in the area of mechanistic interpretability has investigated how activations in latent representation spaces may encode concepts, and thereafter performed representation engineering to induce such concepts in LLM outputs, the applicability of such for safety is relatively under-explored. Unlike recent inference-time safety steering works, in this paper we explore safety steering of LLM outputs using: (i) category-specific steering vectors, thereby enabling fine-grained control over the steering, and (ii) sophisticated methods for extracting informative steering vectors for more effective safety steering while retaining quality of the generated text. We demonstrate our exploration on multiple LLMs and datasets, and showcase the effectiveness of the proposed steering method, along with a discussion on the implications and best practices.
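A category-specific steering vector can be illustrated with the simplest extraction rule from the representation-engineering literature: the difference between mean activations on unsafe and safe inputs for one harm category. This baseline is an assumption for illustration; the paper describes more sophisticated extraction methods, and the category name, activations, and `strength` parameter below are hypothetical.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(unsafe_acts, safe_acts):
    """Direction pointing from the safe mean towards the unsafe mean."""
    mu_unsafe = mean_vector(unsafe_acts)
    mu_safe = mean_vector(safe_acts)
    return [u - s for u, s in zip(mu_unsafe, mu_safe)]

def steer(activation, vector, strength=1.0):
    """Shift an activation away from the unsafe direction at inference time."""
    return [a - strength * v for a, v in zip(activation, vector)]

# Hypothetical 2-dimensional hidden activations for one harm category.
vectors = {
    "violence": steering_vector(
        unsafe_acts=[[2.0, 0.0], [4.0, 0.0]],
        safe_acts=[[1.0, 1.0], [1.0, 3.0]],
    )
}
steered = steer([3.0, 0.0], vectors["violence"], strength=1.0)
print(vectors["violence"], steered)
```

Keeping one vector per category is what allows steering to be applied selectively, for example only when a prompt is flagged for that category.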

[NLP-80] BordIRlines: A Dataset for Evaluating Cross-lingual Retrieval-Augmented Generation EMNLP2024

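[Quick read]: This paper studies the robustness of retrieval-augmented generation (RAG) systems in cross-lingual settings, focusing on queries about geopolitical disputes, which sit at the intersection of linguistic, cultural, and political boundaries. The central question is which sources to include in the context and how to weight them, particularly across languages and sources. The authors build a dataset from relevant Wikipedia pages, show that existing RAG systems respond inconsistently when given competing information in multiple languages, and outline directions for future research.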

Link: https://arxiv.org/abs/2410.01171
Authors: Bryan Li, Samar Haider, Fiona Luo, Adwait Agashe, Chris Callison-Burch
Keywords: Large language models, language models excel, Large language, models excel, excel at creative
Subjects: Computation and Language (cs.CL)
Comments: NLP for Wikipedia workshop at EMNLP 2024

Abstract:Large language models excel at creative generation but continue to struggle with the issues of hallucination and bias. While retrieval-augmented generation (RAG) provides a framework for grounding LLMs’ responses in accurate and up-to-date information, it still raises the question of bias: which sources should be selected for inclusion in the context? And how should their importance be weighted? In this paper, we study the challenge of cross-lingual RAG and present a dataset to investigate the robustness of existing systems at answering queries about geopolitical disputes, which exist at the intersection of linguistic, cultural, and political boundaries. Our dataset is sourced from Wikipedia pages containing information relevant to the given queries and we investigate the impact of including additional context, as well as the composition of this context in terms of language and source, on an LLM’s response. Our results show that existing RAG systems continue to be challenged by cross-lingual use cases and suffer from a lack of consistency when they are provided with competing information in multiple languages. We present case studies to illustrate these issues and outline steps for future research to address these challenges. We make our dataset and code publicly available at this https URL.

[NLP-81] Unifying the Scope of Bridging Anaphora Types in English: Bridging Annotations in ARRAU and GUM EMNLP2024

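[Quick read]: This paper addresses inconsistent bridging anaphora annotation across corpora and the narrow domain coverage of existing resources. The authors compare annotation guidelines and use interpretable predictive models to analyze bridging instances in GUM, GENTLE, and ARRAU, exposing large differences in what is annotated as bridging. They also release harmonized, subcategorized test sets to support reliable cross-domain evaluation of bridging resolution.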

Link: https://arxiv.org/abs/2410.01170
Authors: Lauren Levine, Amir Zeldes
Keywords: Comparing bridging annotations, disparate text domains, largely due, Comparing bridging, coreference resources
Subjects: Computation and Language (cs.CL)
Comments: The Seventh Workshop on Computational Models of Reference, Anaphora and Coreference (CRAC 2024), EMNLP 2024 Workshop, 15 November 2024

Abstract:Comparing bridging annotations across coreference resources is difficult, largely due to a lack of standardization across definitions and annotation schemas and narrow coverage of disparate text domains across resources. To alleviate domain coverage issues and consolidate schemas, we compare guidelines and use interpretable predictive models to examine the bridging instances annotated in the GUM, GENTLE and ARRAU corpora. Examining these cases, we find that there is a large difference in types of phenomena annotated as bridging. Beyond theoretical results, we release a harmonized, subcategorized version of the test sets of GUM, GENTLE and the ARRAU Wall Street Journal data to promote meaningful and reliable evaluation of bridging resolution across domains.

[NLP-82] GADFA: Generator-Assisted Decision-Focused Approach for Opinion Expressing Timing Identification

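[Quick read]: This paper asks when professional analysts should express an opinion after a news event. It introduces a new task, identifying news-triggered opinion-expressing timing, and builds a new dataset grounded in the behavior of professional stock analysts. The approach is decision-focused: a text generation model guides the classification model, improving overall performance. Experiments indicate that the generated text contributes insights from several angles that help identify the best timing for expressing an opinion.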

Link: https://arxiv.org/abs/2410.01169
Authors: Chung-Chi Chen, Hiroya Takamura, Ichiro Kobayashi, Yusuke Miyao
Keywords: capability to produce, produce coherent, coherent and convincing, text, opinion
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The advancement of text generation models has granted us the capability to produce coherent and convincing text on demand. Yet, in real-life circumstances, individuals do not continuously generate text or voice their opinions. For instance, consumers pen product reviews after weighing the merits and demerits of a product, and professional analysts issue reports following significant news releases. In essence, opinion expression is typically prompted by particular reasons or signals. Despite long-standing developments in opinion mining, the appropriate timing for expressing an opinion remains largely unexplored. To address this deficit, our study introduces an innovative task - the identification of news-triggered opinion expressing timing. We ground this task in the actions of professional stock analysts and develop a novel dataset for investigation. Our approach is decision-focused, leveraging text generation models to steer the classification model, thus enhancing overall performance. Our experimental findings demonstrate that the text generated by our model contributes fresh insights from various angles, effectively aiding in identifying the optimal timing for opinion expression.

[NLP-83] Document Type Classification using File Names

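[Quick read]: This paper targets time-sensitive applications such as digital forensics and large-scale media classification, where heavyweight deep learning models are too slow and resource-hungry for document classification. The solution pairs lightweight supervised models with a tokenization method based on TF-IDF feature extraction and classifies documents from file names alone, substantially reducing inference time. Confidence scores and a negative class for ambiguous file names separate ambiguous names from indicative ones. On a test set with a large share of data that is out of scope relative to the training set, the file name classifiers process more than 80% of the in-scope data with 96.7% accuracy while running 442.43x faster than more complex models such as DiT.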

Link: https://arxiv.org/abs/2410.01166
Authors: Zhijian Li, Stefan Larson, Kevin Leach
Keywords: Rapid document classification, large-scale media classification, Rapid document, time-sensitive applications, applications like digital
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Rapid document classification is critical in several time-sensitive applications like digital forensics and large-scale media classification. Traditional approaches that rely on heavy-duty deep learning models fall short due to high inference times over vast input datasets and computational resources associated with analyzing whole documents. In this paper, we present a method using lightweight supervised learning models, combined with a TF-IDF feature extraction-based tokenization method, to accurately and efficiently classify documents based solely on file names that substantially reduces inference time. This approach can distinguish ambiguous file names from the indicative file names through confidence scores and through using a negative class representing ambiguous file names. Our results indicate that file name classifiers can process more than 80% of the in-scope data with 96.7% accuracy when tested on a dataset with a large portion of out-of-scope data with respect to the training dataset while being 442.43x faster than more complex models such as DiT. Our method offers a crucial solution for efficiently processing vast datasets in critical scenarios, enabling fast, more reliable document classification.
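To make the file-name idea concrete, here is a rough sketch in plain Python. It is not the authors' pipeline: the tokenization rule, the nearest-centroid classifier, the `threshold` value, and the training names are all assumptions chosen for brevity, and the explicit negative class mentioned in the abstract is replaced here by simple confidence thresholding.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(file_name):
    """Split a file name into lowercase letter runs and digit runs (assumed scheme)."""
    return re.findall(r"[a-z]+|\d+", file_name.lower())


def fit_idf(docs):
    """Smoothed inverse document frequency for every token seen in training."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def tfidf(tokens, idf):
    """L2-normalised TF-IDF vector as a sparse dict; unseen tokens are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def fit_centroids(names, labels):
    """Average the TF-IDF vectors of each class and normalise the result."""
    docs = [tokenize(n) for n in names]
    idf = fit_idf(docs)
    sums = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        for tok, value in tfidf(doc, idf).items():
            sums[label][tok] += value
    centroids = {}
    for label, vec in sums.items():
        norm = math.sqrt(sum(v * v for v in vec.values()))
        centroids[label] = {t: v / norm for t, v in vec.items()}
    return idf, centroids


def classify(name, idf, centroids, threshold=0.3):
    """Return (label, score); fall back to 'ambiguous' when confidence is low."""
    vec = tfidf(tokenize(name), idf)
    scores = {
        label: sum(v * c.get(t, 0.0) for t, v in vec.items())  # cosine of unit vectors
        for label, c in centroids.items()
    }
    label, score = max(scores.items(), key=lambda kv: kv[1])
    return (label, score) if score >= threshold else ("ambiguous", score)


# Hypothetical training file names.
names = ["invoice_2023_03.pdf", "acme_invoice_0042.pdf",
         "resume_jane_doe.docx", "john_smith_resume.pdf"]
labels = ["invoice", "invoice", "resume", "resume"]
idf, centroids = fit_centroids(names, labels)

indicative = classify("invoice_final.pdf", idf, centroids)
uninformative = classify("scan_0001.tif", idf, centroids)
```

A name like `scan_0001.tif` shares no tokens with the training data, so its score is zero and it is routed to the ambiguous bucket, where a slower content-based model could take over.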

[NLP-84] Unleashing the Power of Large Language Models in Zero-shot Relation Extraction via Self-Prompting EMNLP2024

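[Quick read]: This paper addresses the weak performance of existing zero-shot relation extraction (RE) methods, which it attributes to a lack of detailed, context-specific prompts. The key is the Self-Prompting framework: a three-stage diversity approach prompts the LLM to generate multiple synthetic samples, which then serve as in-context learning samples that give explicit, context-specific guidance for RE. On benchmark datasets the method outperforms existing LLM-based zero-shot RE methods, and the experiments confirm that the generation pipeline produces high-quality synthetic data.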

Link: https://arxiv.org/abs/2410.01154
Authors: Siyi Liu, Yang Li, Jiang Li, Shan Yang, Yunshi Lan
Keywords: Large Language Models, Language Models, Large Language, Recent research, zero-shot Relation Extraction
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: EMNLP 2024 Short

Abstract:Recent research in zero-shot Relation Extraction (RE) has focused on using Large Language Models (LLMs) due to their impressive zero-shot capabilities. However, current methods often perform suboptimally, mainly due to a lack of detailed, context-specific prompts needed for understanding various sentences and relations. To address this, we introduce the Self-Prompting framework, a novel method designed to fully harness the embedded RE knowledge within LLMs. Specifically, our framework employs a three-stage diversity approach to prompt LLMs, generating multiple synthetic samples that encapsulate specific relations from scratch. These generated samples act as in-context learning samples, offering explicit and context-specific guidance to efficiently prompt LLMs for RE. Experimental evaluations on benchmark datasets show our approach outperforms existing LLM-based zero-shot RE methods. Additionally, our experiments confirm the effectiveness of our generation pipeline in producing high-quality synthetic data that enhances performance.

[NLP-85] Evaluating Deduplication Techniques for Economic Research Paper Titles with a Focus on Semantic Similarity using NLP and LLMs

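[Quick read]: This paper tackles duplicate detection in a large dataset of economic research paper titles. It combines several pairing methods, established distance measures (Levenshtein distance, cosine similarity), and semantic evaluation with an sBERT model to identify potential duplicates. The semantic similarity observed across methods suggests a low prevalence of duplicates, and a human-annotated ground truth set is used for a more conclusive assessment.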

Link: https://arxiv.org/abs/2410.01141
Authors: Doohee You, Karim Lasri, Samuel Fraiberger
Keywords: research paper titles, study investigates efficient, investigates efficient deduplication, efficient deduplication techniques, economic research paper
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages, 1 figure

Abstract:This study investigates efficient deduplication techniques for a large NLP dataset of economic research paper titles. We explore various pairing methods alongside established distance measures (Levenshtein distance, cosine similarity) and an sBERT model for semantic evaluation. Our findings suggest a potentially low prevalence of duplicates based on the observed semantic similarity across different methods. Further exploration with a human-annotated ground truth set is completed for a more conclusive assessment. The result supports the findings from the NLP- and LLM-based distance metrics.
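The two classical measures named in the abstract are easy to sketch in plain Python. This is an illustration only: the abstract does not say which vectors the cosine similarity is computed over, so a bag-of-words representation is assumed here, and the sBERT step is left out. The example titles are made up.

```python
import math
from collections import Counter


def levenshtein(a, b):
    """Edit distance via the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if characters match)
            ))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Edit distance rescaled to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def cosine_similarity(a, b):
    """Cosine similarity over whitespace-token counts (assumed representation)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


# Two made-up titles that differ only in word order.
title_a = "trade and growth"
title_b = "growth and trade"
edit_sim = levenshtein_similarity(title_a, title_b)
cos_sim = cosine_similarity(title_a, title_b)
```

The pair above shows why the measures are usually combined: word reordering leaves the bag-of-words cosine at its maximum while the character-level edit similarity drops.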

[NLP-86] Mixing It Up: The Cocktail Effect of Multi-Task Fine-Tuning on LLM Performance – A Case Study in Finance

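[Quick read]: This paper asks how to fine-tune LLMs more effectively for downstream tasks in a specific domain such as finance. The key is multi-task fine-tuning: training on a mix of related tasks rather than on the target task alone. This significantly improves performance and lets a small model such as Phi-3-Mini surpass the much larger GPT-4-o on financial benchmarks. The paper also examines general instruction data as a regularizer and the inclusion of mathematical data to strengthen numerical reasoning, both of which help on financial tasks.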

Link: https://arxiv.org/abs/2410.01109
Authors: Meni Brief, Oded Ovadia, Gil Shenderovitz, Noga Ben Yoash, Rachel Lemberg, Eitam Sheetrit
Keywords: including finance, large language models, expanded rapidly, application of large, large language
Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
Comments:

Abstract:The application of large language models (LLMs) in domain-specific contexts, including finance, has expanded rapidly. Domain-specific LLMs are typically evaluated based on their performance in various downstream tasks relevant to the domain. In this work, we present a detailed analysis of fine-tuning LLMs for such tasks. Somewhat counterintuitively, we find that in domain-specific cases, fine-tuning exclusively on the target task is not always the most effective strategy. Instead, multi-task fine-tuning - where models are trained on a cocktail of related tasks - can significantly enhance performance. We demonstrate how this approach enables a small model, such as Phi-3-Mini, to achieve state-of-the-art results, even surpassing the much larger GPT-4-o model on financial benchmarks. Our study involves a large-scale experiment, training over 200 models using several widely adopted LLMs as baselines, and empirically confirms the benefits of multi-task fine-tuning. Additionally, we explore the use of general instruction data as a form of regularization, suggesting that it helps minimize performance degradation. We also investigate the inclusion of mathematical data, finding improvements in numerical reasoning that transfer effectively to financial tasks. Finally, we note that while fine-tuning for downstream tasks leads to targeted improvements in task performance, it does not necessarily result in broader gains in domain knowledge or complex domain reasoning abilities.

[NLP-87] Approximately Aligned Decoding

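[Quick read]: This paper addresses the problem that current methods for rejecting undesired LLM outputs either require excessive computation or severely distort the output distribution. It proposes a method that balances distortion against computational efficiency, making it possible to generate long text sequences under hard-to-satisfy constraints with less amplification of low-probability outputs than existing methods. Its task-specific performance is comparable to methods that do not distort the output distribution, at a much lower computational cost.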

Link: https://arxiv.org/abs/2410.01103
Authors: Daniel Melcer, Sujan Gonugondla, Pramuditha Perera, Haifeng Qian, Wen-Hao Chiang, Yanjun Wang, Nihal Jain, Pranav Garg, Xiaofei Ma, Anoop Deoras
Keywords: Large Language Models, Language Models, Large Language, reject undesired outputs, amount of computation
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages main, 22 pages total

Abstract:It is common to reject undesired outputs of Large Language Models (LLMs); however, current methods to do so require an excessive amount of computation, or severely distort the distribution of outputs. We present a method to balance the distortion of the output distribution with computational efficiency, allowing for the generation of long sequences of text with difficult-to-satisfy constraints, with less amplification of low probability outputs compared to existing methods. We show through a series of experiments that the task-specific performance of our method is comparable to methods that do not distort the output distribution, while being much more computationally efficient.
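The trade-off described in the abstract can be seen on a toy example. The sketch below does not implement the paper's method; it only computes, exactly, the output distributions of the two existing extremes: rejection sampling (undistorted but wasteful) and per-step token masking (cheap but distorting). The two-step model and the constraint "no `x` token" are hypothetical.

```python
# Hypothetical two-step language model: P(first token) and P(second | first).
FIRST = {"A": 0.5, "B": 0.5}
SECOND = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.0, "y": 1.0}}


def allowed(seq):
    """Toy constraint: the sequence must not contain the token 'x'."""
    return "x" not in seq


def exact_conditional():
    """Distribution that rejection sampling converges to, plus its acceptance rate."""
    joint = {(f, s): FIRST[f] * SECOND[f][s] for f in FIRST for s in SECOND[f]}
    kept = {seq: p for seq, p in joint.items() if allowed(seq) and p > 0}
    acceptance = sum(kept.values())
    return {seq: p / acceptance for seq, p in kept.items()}, acceptance


def token_masking():
    """Distribution from masking forbidden tokens at each step and renormalising."""
    out = {}
    for f, pf in FIRST.items():
        legal = {s: p for s, p in SECOND[f].items() if allowed((f, s))}
        z = sum(legal.values())
        for s, p in legal.items():
            out[(f, s)] = pf * p / z
    return out


exact, acceptance = exact_conditional()
masked = token_masking()
```

Under the model, the sequence `A y` has conditional probability 1/11, but token masking emits it half the time, because it commits to `A` before discovering that most continuations of `A` are forbidden. Rejection sampling avoids that distortion at the cost of discarding about 45% of samples here, and far more when constraints are harder to satisfy.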

[NLP-88] Unlocking Korean Verbs: A User-Friendly Exploration into the Verb Lexicon COLING2025

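[Quick read]: This paper addresses how to make effective use of the linguistic information in the Sejong dictionary dataset, especially verbs and their subcategorization frames. It presents a user-friendly web interface for collecting and consolidating verb-related information, and maps that information by aligning subcategorization frames with corresponding example sentences. It also provides a Python library intended to simplify syntactic parsing and semantic role labeling, to help researchers and developers build Korean language processing applications on the dataset.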

Link: https://arxiv.org/abs/2410.01100
Authors: Seohyun Song, Eunkyul Leah Jo, Yige Chen, Jeen-Pyo Hong, Kyuwon Kim, Jin Wee, Miyoung Kang, KyungTae Lim, Jungyeul Park, Chulwoo Park
Keywords: providing extensive coverage, valuable resource, providing extensive, coverage of morphology, Sejong dictionary dataset
Subjects: Computation and Language (cs.CL)
Comments: COLING2025 System Demonstrations (Submitted)

Abstract:The Sejong dictionary dataset offers a valuable resource, providing extensive coverage of morphology, syntax, and semantic representation. This dataset can be utilized to explore linguistic information in greater depth. The labeled linguistic structures within this dataset form the basis for uncovering relationships between words and phrases and their associations with target verbs. This paper introduces a user-friendly web interface designed for the collection and consolidation of verb-related information, with a particular focus on subcategorization frames. Additionally, it outlines our efforts in mapping this information by aligning subcategorization frames with corresponding illustrative sentence examples. Furthermore, we provide a Python library that would simplify syntactic parsing and semantic role labeling. These tools are intended to assist individuals interested in harnessing the Sejong dictionary dataset to develop applications for Korean language processing.

[NLP-89] Exploring Empty Spaces: Human-in-the-Loop Data Augmentation

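[Quick read]: This paper addresses how to generate diverse data during data augmentation so that machine learning models become more robust and safe. It introduces Amplio, an interactive tool with three human-in-the-loop augmentation techniques (Augment With Concepts, Augment by Interpolation, and Augment with Large Language Model) that helps practitioners systematically identify and explore empty data spaces in unstructured text datasets and produce high-quality, diverse, and relevant model safety prompts.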

Link: https://arxiv.org/abs/2410.01088
Authors: Catherine Yeh, Donghao Ren, Yannick Assogba, Dominik Moritz, Fred Hohman
Keywords: make machine learning, machine learning models, robust and safe, crucial to make, make machine
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Data augmentation is crucial to make machine learning models more robust and safe. However, augmenting data can be challenging as it requires generating diverse data points to rigorously evaluate model behavior on edge cases and mitigate potential harms. Creating high-quality augmentations that cover these “unknown unknowns” is a time- and creativity-intensive task. In this work, we introduce Amplio, an interactive tool to help practitioners navigate “unknown unknowns” in unstructured text datasets and improve data diversity by systematically identifying empty data spaces to explore. Amplio includes three human-in-the-loop data augmentation techniques: Augment With Concepts, Augment by Interpolation, and Augment with Large Language Model. In a user study with 18 professional red teamers, we demonstrate the utility of our augmentation methods in helping generate high-quality, diverse, and relevant model safety prompts. We find that Amplio enabled red teamers to augment data quickly and creatively, highlighting the transformative potential of interactive augmentation workflows.

[NLP-90] Concept Space Alignment in Multilingual LLMs EMNLP2024

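[Quick read]: This paper examines how well multilingual LLMs generalize across languages and the implicit vector space alignment hypothesized to underlie that generalization. It evaluates the quality of linear alignments between corresponding concepts in different languages and finds that larger models exhibit very high-quality alignments. The experiments reveal two familiar weaknesses: generalization works best for typologically similar languages and for abstract concepts. For some models, such as the Llama-2 family, prompt-based embeddings align better than word embeddings, but the projections are less linear, an observation that holds across almost all model families and suggests that prompt-based methods partly break the implicitly learned alignments.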

Link: https://arxiv.org/abs/2410.01079
Authors: Qiwei Peng, Anders Søgaard
Keywords: Multilingual large language, Multilingual large, large language models, large language, models
Subjects: Computation and Language (cs.CL)
Comments: EMNLP 2024

Abstract:Multilingual large language models (LLMs) seem to generalize somewhat across languages. We hypothesize this is a result of implicit vector space alignment. Evaluating such alignment, we see that larger models exhibit very high-quality linear alignments between corresponding concepts in different languages. Our experiments show that multilingual LLMs suffer from two familiar weaknesses: generalization works best for languages with similar typology, and for abstract concepts. For some models, e.g., the Llama-2 family of models, prompt-based embeddings align better than word embeddings, but the projections are less linear – an observation that holds across almost all model families, indicating that some of the implicitly learned alignments are broken somewhat by prompt-based methods.

[NLP-91] From Natural Language to SQL: Review of LLM-based Text-to-SQL Systems

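[Quick read]: This survey traces the evolution of LLM-based text-to-SQL systems, from early rule-based models to advanced LLM approaches, and examines how LLMs have affected the field. It covers the two main families of techniques, in-context learning and fine-tuning, and the zero-shot, few-shot, and data augmentation approaches built on them. It also highlights the role of knowledge graph integration in improving contextual accuracy and schema linking, and identifies open challenges such as computational efficiency, model robustness, and data privacy.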

Link: https://arxiv.org/abs/2410.01066
Authors: Ali Mohammadjafari, Anthony S. Maida, Raju Gottumukkala
Keywords: structured SQL commands, translating natural language, natural language queries, structured SQL, SQL commands
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 5 figures, 3 tables

Abstract:Since the onset of LLMs, translating natural language queries to structured SQL commands is assuming increasing importance. Unlike previous reviews, this survey provides a comprehensive study of the evolution of LLM-based text-to-SQL systems, from early rule-based models to advanced LLM approaches, and of how LLMs have impacted this field. We discuss benchmarks, evaluation methods, and evaluation metrics. We also uniquely study the role of integrating knowledge graphs for better contextual accuracy and schema linking in these systems. Current techniques fall into two categories, in-context learning and fine-tuning, which in turn lead to approaches such as zero-shot learning, few-shot learning, and data augmentation. Finally, we highlight key challenges such as computational efficiency, model robustness, and data privacy, with perspectives on their development and on potential areas of improvement for future LLM-based text-to-SQL systems.

[NLP-92] RATIONALYST: Pre-training Process-Supervision for Improving Reasoning

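[Quick read]: This paper addresses the problem that reasoning steps generated by LLMs may be incomplete: the models mimic the logical leaps common in everyday communication in their pre-training data, so underlying rationales are often left implicit. It introduces RATIONALYST, a model for process supervision of reasoning that is pre-trained on a large collection of rationale annotations extracted from unlabeled data. The authors extract 79k rationales from a web-scale unlabelled dataset (the Pile) and a combination of reasoning datasets with minimal human intervention, which lets RATIONALYST generalize across mathematical, commonsense, scientific, and logical reasoning tasks. Fine-tuned from LLaMa-3-8B, RATIONALYST improves reasoning accuracy by an average of 3.9% on 7 representative reasoning benchmarks and outperforms significantly larger verifiers such as GPT-4 as well as similarly sized models fine-tuned on matching training sets.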

Link: https://arxiv.org/abs/2410.01044
Authors: Dongwei Jiang, Guoxuan Wang, Yining Lu, Andrew Wang, Jingyu Zhang, Chuyu Liu, Benjamin Van Durme, Daniel Khashabi
Keywords: frequently left implicit, everyday communication found, mimic logical leaps, logical leaps common, reasoning steps generated
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Our code, data, and model can be found at this repository: this https URL

Abstract:The reasoning steps generated by LLMs might be incomplete, as they mimic logical leaps common in everyday communication found in their pre-training data: underlying rationales are frequently left implicit (unstated). To address this challenge, we introduce RATIONALYST, a model for process-supervision of reasoning based on pre-training on a vast collection of rationale annotations extracted from unlabeled data. We extract 79k rationales from a web-scale unlabelled dataset (the Pile) and a combination of reasoning datasets with minimal human intervention. This web-scale pre-training for reasoning allows RATIONALYST to consistently generalize across diverse reasoning tasks, including mathematical, commonsense, scientific, and logical reasoning. Fine-tuned from LLaMa-3-8B, RATIONALYST improves the accuracy of reasoning by an average of 3.9% on 7 representative reasoning benchmarks. It also demonstrates superior performance compared to significantly larger verifiers like GPT-4 and similarly sized models fine-tuned on matching training sets.

[NLP-93] From Facts to Insights: A Study on the Generation and Evaluation of Analytical Reports for Deciphering Earnings Calls

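[Quick read]: This paper explores using LLMs to generate and evaluate analytical reports derived from Earnings Calls (ECs). The key is a multi-agent framework with specialized agents that bring diverse viewpoints and desirable topics of analysis into report generation. The findings suggest that adding agents yields more insightful reports, although reports written by human experts remain preferred in the majority of cases. The paper also examines the limitations and strengths of LLMs as evaluators of report quality in different settings and finds a significant correlation with human expert judgments across multiple dimensions.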

Link: https://arxiv.org/abs/2410.01039
Authors: Tomas Goldsack, Yang Wang, Chenghua Lin, Chung-Chi Chen
Keywords: Large Language Models, Language Models, Earnings Calls, Large Language, derived from Earnings
Subjects: Computation and Language (cs.CL)
Comments: Pre-print

Abstract:This paper explores the use of Large Language Models (LLMs) in the generation and evaluation of analytical reports derived from Earnings Calls (ECs). Addressing a current gap in research, we explore the generation of analytical reports with LLMs in a multi-agent framework, designing specialized agents that introduce diverse viewpoints and desirable topics of analysis into the report generation process. Through multiple analyses, we examine the alignment between generated and human-written reports and the impact of both individual and collective agents. Our findings suggest that the introduction of additional agents results in more insightful reports, although reports generated by human experts remain preferred in the majority of cases. Finally, we address the challenging issue of report evaluation: we examine the limitations and strengths of LLMs in assessing the quality of generated reports in different settings, revealing a significant correlation with human experts across multiple dimensions.

[NLP-94] MOSEL: 950000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages EMNLP2024

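[Quick read]: This paper addresses the gap between existing speech foundation models (SFMs) and open-source principles: despite claims to the contrary, no existing SFM has model weights, code, and training data publicly available under open-source terms. As a first step, the authors survey automatic speech recognition datasets and unlabeled speech corpora under open-source compliant licenses for the 24 official languages of the European Union, collecting 950k hours of training data in total. They also release automatic transcripts for 441k hours of unlabeled data under the permissive CC-BY license, laying groundwork for open-source SFMs for EU languages.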

Link: https://arxiv.org/abs/2410.01036
Authors: Marco Gaido, Sara Papi, Luisa Bentivogli, Alessio Brutti, Mauro Cettolo, Roberto Gretter, Marco Matassoni, Mohamed Nabih, Matteo Negri
Keywords: regulatory efforts addressing, sparked significant interest, coupled with regulatory, risks and impacts, rise of foundation
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted at EMNLP 2024 Main Conference

Abstract:The rise of foundation models (FMs), coupled with regulatory efforts addressing their risks and impacts, has sparked significant interest in open-source models. However, existing speech FMs (SFMs) fall short of full compliance with the open-source principles, even if claimed otherwise, as no existing SFM has model weights, code, and training data publicly available under open-source terms. In this work, we take the first step toward filling this gap by focusing on the 24 official languages of the European Union (EU). We collect suitable training data by surveying automatic speech recognition datasets and unlabeled speech corpora under open-source compliant licenses, for a total of 950k hours. Additionally, we release automatic transcripts for 441k hours of unlabeled data under the permissive CC-BY license, thereby facilitating the creation of open-source SFMs for the EU languages.

[NLP-95] Draft on the Fly: Adaptive Self-Speculative Decoding using Cosine Similarity

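[Quick read]: This paper addresses slow LLM inference with an on-the-fly method for generating draft models that needs no fine-tuning or black-box optimization. Simple rules produce varying draft models adapted to the input context. The authors report that the lightweight algorithm is competitive with the current state of the art for self-speculative decoding while being plug-and-play.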

Link: https://arxiv.org/abs/2410.01028
Authors: Michael R. Metel, Peng Lu, Boxing Chen, Mehdi Rezagholizadeh, Ivan Kobyzev
Keywords: large language models, faster inference, inference of large, large language, Abstract
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:We present a simple on-the-fly method for faster inference of large language models. Unlike other (self-)speculative decoding techniques, our method does not require fine-tuning or black-box optimization to generate a fixed draft model, relying instead on simple rules to generate varying draft models adapted to the input context. We show empirically that our light-weight algorithm is competitive with the current SOTA for self-speculative decoding, while being a truly plug-and-play method.
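The abstract does not spell out the "simple rules", so the sketch below is a guess based on the title: it assumes a draft model is formed by skipping the layers whose output is most similar, by cosine similarity, to their input for the current context. The function name `select_layers_to_skip`, the `num_skip` parameter, and the toy hidden states are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_layers_to_skip(hidden_states, num_skip):
    """Pick the layers that change the hidden state least.

    hidden_states[i] is the input to layer i and hidden_states[i + 1] its output,
    so a high cosine similarity suggests the layer contributes little here.
    """
    sims = [
        (cosine(hidden_states[i], hidden_states[i + 1]), i)
        for i in range(len(hidden_states) - 1)
    ]
    sims.sort(reverse=True)  # most similar (least useful) layers first
    return sorted(i for _, i in sims[:num_skip])


# Toy hidden states for a 3-layer model (hypothetical values).
hidden = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.0, 1.05]]
skipped = select_layers_to_skip(hidden, num_skip=2)
```

Because the similarities are recomputed from the current hidden states, the set of skipped layers can differ from one input context to the next, which is the sense in which the draft model varies rather than being fixed ahead of time.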

[NLP-96] Investigating the Synergistic Effects of Dropout and Residual Connections on Language Model Training

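[Quick read]: This paper studies overfitting in language model training by varying dropout rates on individual layers and on residual connections. Training a decoder on Tiny Shakespeare, the authors confirm that dropout helps regularization and residual connections help convergence, and they find an important trade-off between the depth of residual connections and the dropout applied to them for convergence and generalization.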

Link: https://arxiv.org/abs/2410.01019
Authors: Qingyang Li, Weimao Ke
Keywords: language model training, pivotal role, techniques in mitigating, mitigating overfitting, language model
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 5 pages, 4 figures

Abstract:This paper examines the pivotal role of dropout techniques in mitigating overfitting in language model training. It conducts a comprehensive investigation into the influence of variable dropout rates on both individual layers and residual connections within the context of language modeling. Our study conducts training of a decoder implementation on the classic Tiny Shakespeare data to examine the effects of the adjustments on training efficiency and validation error. Results not only confirm the benefits of dropout for regularization and residuals for convergence, but also reveal their interesting interactions. There exists an important trade-off between the depth of residual connections and the dropout on these connections for optimal deep neural network convergence and generalization.
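The interaction between dropout and residual connections described above can be illustrated with a minimal sketch. It assumes one common placement, where dropout is applied to the sublayer branch before it is added to the skip path; the abstract does not state the exact placement the authors use, and the names `dropout`, `residual_block`, and `double` are hypothetical.

```python
import random


def dropout(values, rate, rng, training=True):
    """Inverted dropout: zero each element with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def residual_block(x, sublayer, rate, rng, training=True):
    """Compute x + dropout(sublayer(x)); the skip path itself is left untouched."""
    branch = dropout(sublayer(x), rate, rng, training)
    return [xi + bi for xi, bi in zip(x, branch)]


def double(x):
    """Toy stand-in for an attention or feed-forward sublayer (assumption)."""
    return [2.0 * v for v in x]


rng = random.Random(0)
x = [1.0, 2.0, 3.0, 4.0]
eval_out = residual_block(x, double, rate=0.5, rng=rng, training=False)
train_out = residual_block(x, double, rate=0.5, rng=rng, training=True)
```

In evaluation mode the block returns `x + sublayer(x)` unchanged. In training mode each element either keeps only the skip value or receives the rescaled branch, which is why a higher rate on the branch pushes the block toward an identity mapping.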

[NLP-97] “Hiding in Plain Sight”: Designing Synthetic Dialog Generation for Uncovering Socially Situated Norms

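Quick read: This paper addresses how to generate natural dialogues that cover a wide range of interlocutor attributes, relationship types, topics, and conversational trajectories, while ensuring that the dialogues follow social norms so as to avoid conflict. The key to its approach is a framework for controlled dialogue generation, which is used to produce NormHint, a collection of dialogues analyzed to identify norm violations that may lead to conflict and to offer alternatives that adhere to social norms and favor respectful utterances while preserving the communicative intent of the original utterance. Human validation and automated analysis show that NormHint covers a wide range of conversational topics and is rated highly for naturalness.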

Link: https://arxiv.org/abs/2410.00998
Authors: Chengfei Wu, Dan Goldwasser
Keywords: Naturally situated conversations, Naturally situated, underlying social norms, situated conversations capture, social norms
Categories: Computation and Language (cs.CL)
Comments: Pre-Print

Abstract:Naturally situated conversations capture the underlying social norms appropriate for the topic of conversation, the relationship between interlocutors and their communicative intent. This paper proposes a framework for controlled generation of dialogues, spanning a wide range of interlocutor attributes (such as age group, profession and personality types), relationship types, conversation topics and conversational trajectories. We use this framework to generate NormHint, a collection of dialogues consistent with these rich settings and analyzed for norm violations leading to conflicts, and for potential steps for avoiding these conflicts by adhering to social norms and preferring respectful utterances that maintain the communicative intent of the original utterance. We present the results of human validation and automated analysis of NormHint and show that it captures a wide range of conversational topics and is scored highly by humans for the naturalness of the conversations based on the prompted context.

[NLP-98] Creative and Context-Aware Translation of East Asian Idioms with GPT-4

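Quick read: This paper addresses the human effort required to translate East Asian idioms, in particular the time and creativity needed to compile dictionaries of candidate translations. The key to its approach is using GPT-4 to generate high-quality translations: automatic evaluations of faithfulness and creativity identify Pareto-optimal prompting strategies that outperform the Google and DeepL translation engines. At low cost, the method yields far more high-quality translations per idiom than the human baseline, and all code and data are open-sourced to facilitate further research.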

Link: https://arxiv.org/abs/2410.00988
Authors: Kenan Tang, Peiyang Song, Yao Qin, Xifeng Yan
Keywords: East Asian idiom, East Asian, Asian idiom condenses, condenses rich cultural, rich cultural background
Categories: Computation and Language (cs.CL)
Comments:

Abstract:As a type of figurative language, an East Asian idiom condenses rich cultural background into only a few characters. Translating such idioms is challenging for human translators, who often resort to choosing a context-aware translation from an existing list of candidates. However, compiling a dictionary of candidate translations demands much time and creativity even for expert translators. To alleviate such burden, we evaluate if GPT-4 can help generate high-quality translations. Based on automatic evaluations of faithfulness and creativity, we first identify Pareto-optimal prompting strategies that can outperform translation engines from Google and DeepL. Then, at a low cost, our context-aware translations can achieve far more high-quality translations per idiom than the human baseline. We open-source all code and data to facilitate further research.
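To make "Pareto-optimal prompting strategies" concrete, the sketch below selects the strategies that no other strategy beats on both faithfulness and creativity. The strategy names and scores are made up for illustration and are not taken from the paper; `pareto_front` is a hypothetical helper.

```python
def pareto_front(candidates):
    """Return names of candidates not dominated on (faithfulness, creativity).

    A candidate is dominated if another one is at least as good on both
    scores and strictly better on at least one. Higher is better.
    """
    front = []
    for name, faith, creat in candidates:
        dominated = any(
            (f2 >= faith and c2 >= creat) and (f2 > faith or c2 > creat)
            for _, f2, c2 in candidates
        )
        if not dominated:
            front.append(name)
    return front


# Illustrative (made-up) scores: (strategy, faithfulness, creativity).
scores = [
    ("prompt_a", 0.90, 0.40),
    ("prompt_b", 0.70, 0.80),
    ("prompt_c", 0.60, 0.70),  # dominated by prompt_b on both axes
    ("prompt_d", 0.50, 0.95),
]
front = pareto_front(scores)
```

Here `prompt_c` drops out because `prompt_b` is better on both axes, while the other three each represent a different trade-off between the two metrics.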

[NLP-99] Automatic Speech Recognition for the Ika Language

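Quick read: This paper addresses the development of automatic speech recognition (ASR) models for low-resource languages such as Ika. The key to its approach is fine-tuning pretrained multilingual wav2vec 2.0 models. Fine-tuning on a high-quality speech dataset compiled from New Testament Bible translations in Ika reaches a word error rate (WER) of 0.5377 and a character error rate (CER) of 0.2651 with just over 1 hour of training data. The larger 1-billion-parameter model outperforms the 300-million-parameter model, which the authors attribute to its greater complexity and capacity for richer speech representations, but overfitting to the small training set limits generalization. Future work should therefore focus on expanding the dataset and exploring techniques to mitigate overfitting.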

Link: https://arxiv.org/abs/2410.00940
Authors: Uchenna Nzenwata, Daniel Ogbuigwe
Keywords: Automatic Speech Recognition, developing Automatic Speech, developing Automatic, Speech Recognition, Word Error Rate
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 5 figures. This is a pre-release version.

Abstract:We present a cost-effective approach for developing Automatic Speech Recognition (ASR) models for low-resource languages like Ika. We fine-tune the pretrained wav2vec 2.0 Massively Multilingual Speech Models on a high-quality speech dataset compiled from New Testament Bible translations in Ika. Our results show that fine-tuning multilingual pretrained models achieves a Word Error Rate (WER) of 0.5377 and Character Error Rate (CER) of 0.2651 with just over 1 hour of training data. The larger 1 billion parameter model outperforms the smaller 300 million parameter model due to its greater complexity and ability to store richer speech representations. However, we observe overfitting to the small training dataset, reducing generalizability. Our findings demonstrate the potential of leveraging multilingual pretrained models for low-resource languages. Future work should focus on expanding the dataset and exploring techniques to mitigate overfitting.
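The two metrics reported above, WER and CER, are both edit distances normalized by the reference length, computed over words and characters respectively. The following is a minimal sketch using toy English strings rather than any data from the paper; the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
word_error = wer(reference, hypothesis)   # 1 substitution + 1 deletion over 6 words
char_error = cer(reference, hypothesis)   # 5 character edits over 22 characters
```

Because a single wrong character makes a whole word count as an error, WER is typically higher than CER for the same transcript, which matches the gap between the 0.5377 and 0.2651 figures quoted above.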

[NLP-100] Text Clustering as Classification with LLMs

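Quick read: This paper addresses the need for fine-tuned embedders and sophisticated similarity metrics in text clustering. The key to its approach is using the in-context learning ability of large language models (LLMs) to recast text clustering as a classification task. First, the LLM is prompted to generate potential labels for the dataset; second, similar generated labels are merged and the LLM is prompted again to assign the most appropriate label to each sample. Experiments show that the framework matches or exceeds state-of-the-art embedding-based clustering methods without complex fine-tuning or clustering algorithms.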

Link: https://arxiv.org/abs/2410.00927
Authors: Chen Huang, Guoxiu He
Keywords: clustering remains valuable, labeling is cost-prohibitive, Text clustering remains, remains valuable, valuable in real-world
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 12 pages, 3 figures

Abstract:Text clustering remains valuable in real-world applications where manual labeling is cost-prohibitive. It facilitates efficient organization and analysis of information by grouping similar texts based on their representations. However, implementing this approach necessitates fine-tuned embedders for downstream data and sophisticated similarity metrics. To address this issue, this study presents a novel framework for text clustering that effectively leverages the in-context learning capacity of Large Language Models (LLMs). Instead of fine-tuning embedders, we propose to transform the text clustering into a classification task via LLM. First, we prompt LLM to generate potential labels for a given dataset. Second, after integrating similar labels generated by the LLM, we prompt the LLM to assign the most appropriate label to each sample in the dataset. Our framework has been experimentally proven to achieve comparable or superior performance to state-of-the-art clustering methods that employ embeddings, without requiring complex fine-tuning or clustering algorithms. We make our code available to the public for utilization at this https URL.
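The two-stage procedure described above (propose labels, merge similar ones, then classify each sample) can be sketched as a small pipeline. This is an illustration only: the three `toy_*` functions are keyword-based stand-ins for the LLM prompts, and none of the names or prompts come from the authors' released code.

```python
def cluster_by_classification(texts, propose_labels, merge_labels, assign_label):
    """Cluster texts by (1) proposing labels, (2) merging them, (3) classifying."""
    candidates = propose_labels(texts)       # stage 1: label generation
    labels = merge_labels(candidates)        # consolidate near-duplicate labels
    return {text: assign_label(text, labels) for text in texts}  # stage 2


# Toy stand-ins for LLM calls (assumptions, for runnability only).
KEYWORDS = {"sports": ("match", "goal"), "finance": ("stock", "bank")}


def toy_propose(texts):
    """Stand-in for prompting an LLM to suggest labels for the dataset."""
    labels = []
    for text in texts:
        for label, words in KEYWORDS.items():
            if any(w in text for w in words):
                labels.append(label.upper())
    return labels


def toy_merge(labels):
    """Stand-in for merging similar labels: case-insensitive de-duplication."""
    merged = []
    for label in labels:
        norm = label.lower()
        if norm not in merged:
            merged.append(norm)
    return merged


def toy_assign(text, labels):
    """Stand-in for prompting an LLM to pick the best label for one text."""
    for label in labels:
        if any(w in text for w in KEYWORDS[label]):
            return label
    return labels[0]


texts = [
    "the match ended with a late goal",
    "the bank raised rates",
    "stock prices fell",
]
clusters = cluster_by_classification(texts, toy_propose, toy_merge, toy_assign)
```

Passing the three stages in as callables keeps the pipeline independent of any particular LLM client, so the stand-ins can be swapped for real prompt calls without changing the control flow.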

[NLP-101] OmniGenBench: Automating Large-scale in-silico Benchmarking for Genomic Foundation Models

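Quick read: This paper addresses the lack of standardized benchmarking tools and open-source software for genomic foundation models (GFMs). The key to its approach is GFMBench, a framework that standardizes benchmark suites and automates benchmarking for a wide range of open-source GFMs, integrating millions of genomic sequences across hundreds of genomic tasks. GFMBench also provides user-friendly interfaces and diverse tutorials, and a public leaderboard is intended to support further progress in genome modeling.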

Link: https://arxiv.org/abs/2410.01784
Authors: Heng Yang, Jack Cole, Ke Li
Keywords: Large Language Models, Large Language, Language Models, genomic foundation models, foundation models
Categories: Genomics (q-bio.GN); Computation and Language (cs.CL)
Comments: this https URL

Abstract:The advancements in artificial intelligence in recent years, such as Large Language Models (LLMs), have fueled expectations for breakthroughs in genomic foundation models (GFMs). The code of nature, hidden in diverse genomes since the very beginning of life’s evolution, holds immense potential for impacting humans and ecosystems through genome modeling. Recent breakthroughs in GFMs, such as Evo, have attracted significant investment and attention to genomic modeling, as they address long-standing challenges and transform in-silico genomic studies into automated, reliable, and efficient paradigms. In the context of this flourishing era of consecutive technological revolutions in genomics, GFM studies face two major challenges: the lack of GFM benchmarking tools and the absence of open-source software for diverse genomics. These challenges hinder the rapid evolution of GFMs and their wide application in tasks such as understanding and synthesizing genomes, problems that have persisted for decades. To address these challenges, we introduce GFMBench, a framework dedicated to GFM-oriented benchmarking. GFMBench standardizes benchmark suites and automates benchmarking for a wide range of open-source GFMs. It integrates millions of genomic sequences across hundreds of genomic tasks from four large-scale benchmarks, democratizing GFMs for a wide range of in-silico genomic applications. Additionally, GFMBench is released as open-source software, offering user-friendly interfaces and diverse tutorials, applicable for AutoBench and complex tasks like RNA design and structure prediction. To facilitate further advancements in genome modeling, we have launched a public leaderboard showcasing the benchmark performance derived from AutoBench. GFMBench represents a step toward standardizing GFM benchmarking and democratizing GFM applications.

[NLP-102] Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech

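Quick read: This paper addresses how to make large language models (LLMs) take a user's emotion or speaking style into account when responding, without fine-tuning the LLM's weights. The key to its approach is an end-to-end system with a speech encoder trained to produce token embeddings that capture both semantic and paralinguistic information in speech. Even with the LLM weights frozen, the system produces higher-quality, more empathetic responses that match the speaker's emotion and style. Experiments show that its responses to expressive speech prompts are better than those of several baselines.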

Link: https://arxiv.org/abs/2410.01162
Authors: Wonjune Kang, Junteng Jia, Chunyang Wu, Wei Zhou, Egor Lakomkin, Yashesh Gaur, Leda Sari, Suyoun Kim, Ke Li, Jay Mahadeokar, Ozlem Kalinli
Keywords: large language models, increasingly common modality, account users’ emotions, language models, increasingly common
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments:

Abstract:As speech becomes an increasingly common modality for interacting with large language models (LLMs), it is becoming desirable to develop systems where LLMs can take into account users’ emotions or speaking styles when providing their responses. In this work, we study the potential of an LLM to understand these aspects of speech without fine-tuning its weights. To do this, we utilize an end-to-end system with a speech encoder; the encoder is trained to produce token embeddings such that the LLM’s response to an expressive speech prompt is aligned with its response to a semantically matching text prompt where the speaker’s emotion has also been specified. We find that this training framework allows the encoder to generate tokens that capture both semantic and paralinguistic information in speech and effectively convey it to the LLM, even when the LLM remains completely frozen. We also explore training on additional emotion and style-related response alignment tasks, finding that they further increase the amount of paralinguistic information explicitly captured in the speech tokens. Experiments demonstrate that our system is able to produce higher quality and more empathetic responses to expressive speech prompts compared to several baselines.

Artificial Intelligence

[AI-0] Samba: Synchronized Set-of-Sequences Modeling for Multiple Object Tracking

Link: https://arxiv.org/abs/2410.01806
Authors: Mattia Segu, Luigi Piccinelli, Siyuan Li, Yung-Hsu Yang, Bernt Schiele, Luc Van Gool
Keywords: presents unique challenges, dynamic animal groups, coordinated dance performances, team sports, complex scenarios
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multiple object tracking in complex scenarios - such as coordinated dance performances, team sports, or dynamic animal groups - presents unique challenges. In these settings, objects frequently move in coordinated patterns, occlude each other, and exhibit long-term dependencies in their trajectories. However, it remains a key open research question on how to model long-range dependencies within tracklets, interdependencies among tracklets, and the associated temporal occlusions. To this end, we introduce Samba, a novel linear-time set-of-sequences model designed to jointly process multiple tracklets by synchronizing the multiple selective state-spaces used to model each tracklet. Samba autoregressively predicts the future track query for each sequence while maintaining synchronized long-term memory representations across tracklets. By integrating Samba into a tracking-by-propagation framework, we propose SambaMOTR, the first tracker effectively addressing the aforementioned issues, including long-range dependencies, tracklet interdependencies, and temporal occlusions. Additionally, we introduce an effective technique for dealing with uncertain observations (MaskObs) and an efficient training recipe to scale SambaMOTR to longer sequences. By modeling long-range dependencies and interactions among tracked objects, SambaMOTR implicitly learns to track objects accurately through occlusions without any hand-crafted heuristics. Our approach significantly surpasses prior state-of-the-art on the DanceTrack, BFT, and SportsMOT datasets.

[AI-1] FabricDiffusion: High-Fidelity Texture Transfer for 3D Garments Generation from In-The-Wild Clothing Images SIGGRAPH

Link: https://arxiv.org/abs/2410.01801
Authors: Cheng Zhang, Yuanhao Wang, Francisco Vicente Carrasco, Chenglei Wu, Jinlong Yang, Thabo Beeler, Fernando De la Torre
Keywords: transferring fabric textures, transferring fabric, single clothing image, texture, clothing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: Accepted to SIGGRAPH Asia 2024. Project page: this https URL

Abstract:We introduce FabricDiffusion, a method for transferring fabric textures from a single clothing image to 3D garments of arbitrary shapes. Existing approaches typically synthesize textures on the garment surface through 2D-to-3D texture mapping or depth-aware inpainting via generative models. Unfortunately, these methods often struggle to capture and preserve texture details, particularly due to challenging occlusions, distortions, or poses in the input image. Inspired by the observation that in the fashion industry, most garments are constructed by stitching sewing patterns with flat, repeatable textures, we cast the task of clothing texture transfer as extracting distortion-free, tileable texture materials that are subsequently mapped onto the UV space of the garment. Building upon this insight, we train a denoising diffusion model with a large-scale synthetic dataset to rectify distortions in the input texture image. This process yields a flat texture map that enables a tight coupling with existing Physically-Based Rendering (PBR) material generation pipelines, allowing for realistic relighting of the garment under various lighting conditions. We show that FabricDiffusion can transfer various features from a single clothing image including texture patterns, material properties, and detailed prints and logos. Extensive experiments demonstrate that our model significantly outperforms state-of-the-art methods on both synthetic data and real-world, in-the-wild clothing images while generalizing to unseen textures and garment shapes.

[AI-2] Windowed MAPF with Completeness Guarantees

Link: https://arxiv.org/abs/2410.01798
Authors: Rishi Veerapaneni, Muhammad Suhail Saleem, Jiaoyang Li, Maxim Likhachev
Keywords: Traditional multi-agent path, Traditional multi-agent, compute entire start-goal, multi-agent path finding, entire start-goal paths
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:Traditional multi-agent path finding (MAPF) methods try to compute entire start-goal paths which are collision free. However, computing an entire path can take too long for MAPF systems where agents need to replan fast. Methods that address this typically employ a “windowed” approach and only try to find collision free paths for a small windowed timestep horizon. This adaptation comes at the cost of incompleteness; all current windowed approaches can become stuck in deadlock or livelock. Our main contribution is to introduce our framework, WinC-MAPF, for Windowed MAPF that enables completeness. Our framework uses heuristic update insights from single-agent real-time heuristic search algorithms as well as agent independence ideas from MAPF algorithms. We also develop Single-Step CBS (SS-CBS), an instantiation of this framework using a novel modification to CBS. We show how SS-CBS, which only plans a single step and updates heuristics, can effectively solve tough scenarios where existing windowed approaches fail.

[AI-3] When a language model is optimized for reasoning does it still show embers of autoregression? An analysis of OpenAI o1

Link: https://arxiv.org/abs/2410.01792
Authors: R. Thomas McCoy, Shunyu Yao, Dan Friedman, Mathew D. Hardy, Thomas L. Griffiths
Keywords: Embers of Autoregression, next-word prediction, previous LLMs, important limitations, origins in next-word
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages

Abstract:In “Embers of Autoregression” (McCoy et al., 2023), we showed that several large language models (LLMs) have some important limitations that are attributable to their origins in next-word prediction. Here we investigate whether these issues persist with o1, a new system from OpenAI that differs from previous LLMs in that it is optimized for reasoning. We find that o1 substantially outperforms previous LLMs in many cases, with particularly large improvements on rare variants of common tasks (e.g., forming acronyms from the second letter of each word in a list, rather than the first letter). Despite these quantitative improvements, however, o1 still displays the same qualitative trends that we observed in previous systems. Specifically, o1 - like previous LLMs - is sensitive to the probability of examples and tasks, performing better and requiring fewer “thinking tokens” in high-probability settings than in low-probability ones. These results show that optimizing a language model for reasoning can mitigate but might not fully overcome the language model’s probability sensitivity.
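The "rare variant of a common task" mentioned in the abstract is easy to state exactly in code: the common task takes the first letter of each word, the rare variant takes the second. The sketch below only illustrates the task definition; the word list is made up and `acronym` is a hypothetical helper, not part of the paper's evaluation code.

```python
def acronym(words, index=0):
    """Build an acronym from the letter at position `index` of each word."""
    return "".join(word[index].upper() for word in words)


words = ["graph", "neural", "network"]
first_letter = acronym(words)        # common variant: first letters
second_letter = acronym(words, 1)    # rare variant: second letters
```

Both variants are equally simple as programs, which is the point of the comparison: any difference in model accuracy between them reflects how often each variant appears in text rather than its algorithmic difficulty.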

[AI-4] DreamGarden: A Designer Assistant for Growing Games from a Single Prompt

Link: https://arxiv.org/abs/2410.01791
Authors: Sam Earle, Samyak Parajuli, Andrzej Banburski-Fahey
Keywords: Coding assistants, increasingly leveraged, generating code, code and making, making high-level plans
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
Comments: 21 pages + appendix, 11 figures

Abstract:Coding assistants are increasingly leveraged in game design, both generating code and making high-level plans. To what degree can these tools align with developer workflows, and what new modes of human-computer interaction can emerge from their use? We present DreamGarden, an AI system capable of assisting with the development of diverse game environments in Unreal Engine. At the core of our method is an LLM-driven planner, capable of breaking down a single, high-level prompt – a dream, memory, or imagined scenario provided by a human user – into a hierarchical action plan, which is then distributed across specialized submodules facilitating concrete implementation. This system is presented to the user as a garden of plans and actions, both growing independently and responding to user intervention via seed prompts, pruning, and feedback. Through a user study, we explore design implications of this system, charting courses for future work in semi-autonomous assistants and open-ended simulation design.

[AI-5] Investigating on RLHF methodology

Link: https://arxiv.org/abs/2410.01789
Authors: Alexey Kutalev, Sergei Markoff
Keywords: Large Language Models, Large Language, Language Models, fine-tune Large Language, specific Language Model
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 23 pages, 6 figures, 6 tables

Abstract:In this article, we investigate the alignment of Large Language Models according to human preferences. We discuss the features of training a Preference Model, which simulates human preferences, and the methods and details we found essential for achieving the best results. We also discuss using Reinforcement Learning to fine-tune Large Language Models and describe the challenges we faced and the ways to overcome them. Additionally, we present our experience with the Direct Preference Optimization method, which enables us to align a Large Language Model with human preferences without creating a separate Preference Model. As our contribution, we introduce the approach for collecting a preference dataset through perplexity filtering, which makes the process of creating such a dataset for a specific Language Model much easier and more cost-effective.
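The abstract does not say how the perplexity filter is applied, so the following is only a generic sketch of the idea: compute each sample's perplexity from per-token log-probabilities and keep samples below a threshold. The filter direction, the threshold `max_ppl`, and the sample data are all assumptions made for illustration.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean token log-probability (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def filter_by_perplexity(samples, max_ppl):
    """Keep texts whose perplexity under the model is at most `max_ppl`.

    `samples` is a list of (text, token_logprobs) pairs; in practice the
    log-probabilities would come from the language model being aligned.
    """
    return [text for text, logprobs in samples if perplexity(logprobs) <= max_ppl]


# Made-up log-probabilities for two candidate responses.
samples = [
    ("fluent answer", [-0.1, -0.2, -0.1]),
    ("odd answer", [-2.0, -3.0, -2.5]),
]
kept = filter_by_perplexity(samples, max_ppl=5.0)
```

With these numbers the first sample has a perplexity close to 1 and is kept, while the second has a perplexity of about 12 and is filtered out.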

[AI-6] Open-RAG: Enhanced Retrieval-Augmented Reasoning with Open-Source Large Language Models EMNLP2024

Link: https://arxiv.org/abs/2410.01782
Authors: Shayekh Bin Islam, Md Asib Rahman, K S M Tozammel Hossain, Enamul Hoque, Shafiq Joty, Md Rizwan Parvez
Keywords: Large Language Models, Large Language, Retrieval-Augmented Generation, accuracy of Large, limited reasoning capabilities
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to EMNLP 2024 Findings. Website: this https URL . 14 pages, 7 figures, 5 tables

Abstract:Retrieval-Augmented Generation (RAG) has been shown to enhance the factual accuracy of Large Language Models (LLMs), but existing methods often suffer from limited reasoning capabilities in effectively using the retrieved evidence, particularly when using open-source LLMs. To mitigate this gap, we introduce a novel framework, Open-RAG, designed to enhance reasoning capabilities in RAG with open-source LLMs. Our framework transforms an arbitrary dense LLM into a parameter-efficient sparse mixture of experts (MoE) model capable of handling complex reasoning tasks, including both single- and multi-hop queries. Open-RAG uniquely trains the model to navigate challenging distractors that appear relevant but are misleading. As a result, Open-RAG leverages latent learning, dynamically selecting relevant experts and integrating external knowledge effectively for more accurate and contextually relevant responses. In addition, we propose a hybrid adaptive retrieval method to determine retrieval necessity and balance the trade-off between performance gain and inference speed. Experimental results show that the Llama2-7B-based Open-RAG outperforms state-of-the-art LLMs and RAG models such as ChatGPT, Self-RAG, and Command R+ in various knowledge-intensive tasks. We open-source our code and models at this https URL

[AI-7] Composing Global Optimizers to Reasoning Tasks via Algebraic Objects in Neural Nets

Link: https://arxiv.org/abs/2410.01779
Authors: Yuandong Tian
Keywords: Abelian group, tasks in Abelian, prove rich algebraic, trained on reasoning, quadratic activation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Commutative Algebra (math.AC); Rings and Algebras (math.RA)
Comments:

Abstract:We prove rich algebraic structures of the solution space for 2-layer neural networks with quadratic activation and L_2 loss, trained on reasoning tasks in Abelian group (e.g., modular addition). Such a rich structure enables analytical construction of global optimal solutions from partial solutions that only satisfy part of the loss, despite its high nonlinearity. We coin the framework as CoGO (Composing Global Optimizers). Specifically, we show that the weight space over different numbers of hidden nodes of the 2-layer network is equipped with a semi-ring algebraic structure, and the loss function to be optimized consists of monomial potentials, which are ring homomorphism, allowing partial solutions to be composed into global ones by ring addition and multiplication. Our experiments show that around 95% of the solutions obtained by gradient descent match exactly our theoretical constructions. Although the global optimizers constructed only required a small number of hidden nodes, our analysis on gradient dynamics shows that over-parameterization asymptotically decouples training dynamics and is beneficial. We further show that training dynamics favors simpler solutions under weight decay, and thus high-order global optimizers such as perfect memorization are unfavorable.

[AI-8] DeFine: Enhancing LLM Decision-Making with Factor Profiles and Analogical Reasoning

Link: https://arxiv.org/abs/2410.01772
Authors: Yebowen Hu, Xiaoyang Wang, Wenlin Yao, Yiming Lu, Daoan Zhang, Hassan Foroosh, Dong Yu, Fei Liu
Keywords: ability to reason, reason over long, long contexts, contexts and identify, Abstract
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:LLMs are ideal for decision-making due to their ability to reason over long contexts and identify critical factors. However, challenges arise when processing transcripts of spoken speech describing complex scenarios. These transcripts often contain ungrammatical or incomplete sentences, repetitions, hedging, and vagueness. For example, during a company’s earnings call, an executive might project a positive revenue outlook to reassure investors, despite significant uncertainty regarding future earnings. It is crucial for LLMs to incorporate this uncertainty systematically when making decisions. In this paper, we introduce DeFine, a new framework that constructs probabilistic factor profiles from complex scenarios. DeFine then integrates these profiles with analogical reasoning, leveraging insights from similar past experiences to guide LLMs in making critical decisions in novel situations. Our framework separates the tasks of quantifying uncertainty in complex scenarios and incorporating it into LLM decision-making. This approach is particularly useful in fields such as medical consultations, negotiations, and political debates, where making decisions under uncertainty is vital.

[AI-9] Mimicking Human Intuition: Cognitive Belief-Driven Q-Learning ICLR25

Link: https://arxiv.org/abs/2410.01739
Authors: Xingrui Gu, Guanren Qiao, Chuyi Jiang, Tianqing Xia, Hangyu Mao
Keywords: Reinforcement learning encounters, learning encounters challenges, Reinforcement learning, encounters challenges, Reinforcement
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Under review by ICLR 25

Abstract:Reinforcement learning encounters challenges in various environments related to robustness and explainability. Traditional Q-learning algorithms cannot effectively make decisions and utilize the historical learning experience. To overcome these limitations, we propose Cognitive Belief-Driven Q-Learning (CBDQ), which integrates subjective belief modeling into the Q-learning framework, enhancing decision-making accuracy by endowing agents with human-like learning and reasoning capabilities. Drawing inspiration from cognitive science, our method maintains a subjective belief distribution over the expectation of actions, leveraging a cluster-based subjective belief model that enables agents to reason about the potential probability associated with each decision. CBDQ effectively mitigates overestimation and optimizes decision-making policies by integrating historical experiences with current contextual information, mimicking the dynamics of human decision-making. We evaluate the proposed method on discrete control benchmark tasks in various complicated environments. The results demonstrate that CBDQ exhibits stronger adaptability, robustness, and human-like characteristics in handling these environments, outperforming other baselines. We hope this work will give researchers a fresh perspective on understanding and explaining Q-learning.

[AI-10] VitaGlyph: Vitalizing Artistic Typography with Flexible Dual-branch Diffusion Models

Link: https://arxiv.org/abs/2410.01738
Authors: Kailai Feng, Yabo Zhang, Haodong Yu, Zhilong Ji, Jinfeng Bai, Hongzhi Zhang, Wangmeng Zuo
Keywords: input character, Artistic typography, readable manner, technique to visualize, visualize the meaning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: this https URL

Abstract:Artistic typography is a technique to visualize the meaning of an input character in an imaginative and readable manner. With powerful text-to-image diffusion models, existing methods directly design the overall geometry and texture of the input character, making it challenging to ensure both creativity and legibility. In this paper, we introduce a dual-branch and training-free method, namely VitaGlyph, enabling flexible artistic typography along with controllable geometry change to maintain readability. The key insight of VitaGlyph is to treat the input character as a scene composed of Subject and Surrounding, followed by rendering them under varying degrees of geometry transformation. The subject flexibly expresses the essential concept of the input character, while the surrounding enriches relevant background without altering the shape. Specifically, we implement VitaGlyph through a three-phase framework: (i) Knowledge Acquisition leverages large language models to design text descriptions of subject and surrounding. (ii) Regional Decomposition detects the part that most matches the subject description and divides the input glyph image into subject and surrounding regions. (iii) Typography Stylization first refines the structure of the subject region via Semantic Typography, and then separately renders the textures of the Subject and Surrounding regions through Controllable Compositional Generation. Experimental results demonstrate that VitaGlyph not only achieves better artistry and readability, but also manages to depict multiple customized concepts, facilitating more creative and pleasing artistic typography generation. Our code will be made publicly available at this https URL.

[AI-11] Evaluating Robustness of Reward Models for Mathematical Reasoning

Link: https://arxiv.org/abs/2410.01729
Authors: Sunghwan Kim,Dongjin Kang,Taeyoon Kwon,Hyungjoo Chae,Jungsoo Won,Dongha Lee,Jinyoung Yeo
Keywords: Reward models, human feedback, human preferences, Reward, key in reinforcement
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Work in progress

Abstract:Reward models are key in reinforcement learning from human feedback (RLHF) systems, aligning the model behavior with human preferences. Particularly in the math domain, there have been plenty of studies using reward models to align policies for improving reasoning capabilities. Recently, as the importance of reward models has been emphasized, RewardBench was proposed to understand their behavior. However, we find that the math subset of RewardBench has different representations between chosen and rejected completions, and relies on a single comparison, which may lead to unreliable results as it only sees an isolated case. Therefore, it fails to accurately present the robustness of reward models, leading to a misunderstanding of their performance and potentially resulting in reward hacking. In this work, we introduce a new design for reliable evaluation of reward models, and to validate this, we construct RewardMATH, a benchmark that effectively represents the robustness of reward models in mathematical reasoning tasks. We demonstrate that the scores on RewardMATH strongly correlate with the results of the optimized policy and effectively estimate reward overoptimization, whereas the existing benchmark shows almost no correlation. The results underscore the potential of our design to enhance the reliability of evaluation and to represent the robustness of reward models. We make our code and data publicly available.

[AI-12] Auto-Demo Prompting: Leveraging Generated Outputs as Demonstrations for Enhanced Batch Prompting

Link: https://arxiv.org/abs/2410.01724
Authors: Longyu Feng,Mengze Hong,Chen Jason Zhang
Keywords: improve computational efficiency, multiple inputs simultaneously, large language models, aiming to improve, computational efficiency
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Batch prompting is a common technique in large language models (LLMs) used to process multiple inputs simultaneously, aiming to improve computational efficiency. However, as batch sizes increase, performance degradation often occurs due to the model’s difficulty in handling lengthy context inputs. Existing methods that attempt to mitigate these issues rely solely on batch data arrangement and majority voting rather than improving the design of the batch prompt itself. In this paper, we address these limitations by proposing “Auto-Demo Prompting,” a novel approach that leverages the question-output pairs from earlier questions within a batch as demonstrations for subsequent answer inference. We provide a formal theoretical analysis of how Auto-Demo Prompting functions within the autoregressive generation process of LLMs, illustrating how it utilizes prior outputs to optimize the model’s internal representations. Our method effectively bridges the gap between batch prompting and few-shot prompting, enhancing performance with only a slight compromise in token usage. Experimental results across five NLP tasks demonstrate its effectiveness in mitigating performance degradation and occasionally outperforming single prompts. Furthermore, it opens new avenues for applying few-shot learning techniques, such as demonstration selection, within batch prompting, making it a robust solution for real-world applications.

[AI-13] Towards a Theoretical Understanding of Synthetic Data in LLM Post-Training: A Reverse-Bottleneck Perspective

Link: https://arxiv.org/abs/2410.01720
Authors: Zeyu Gan,Yong Liu
Keywords: synthetic data generation, Synthetic data, large language models, generate synthetic data, prevalent synthetic data
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Synthetic data has become a pivotal resource in post-training tasks for large language models (LLMs) due to the scarcity of high-quality, specific data. While various methods have been developed to generate synthetic data, there remains a discernible gap between the practical effects of synthetic data and our theoretical comprehension. To address this challenge, we commence by presenting a detailed modeling of the prevalent synthetic data generation process. Building upon this modeling, we demonstrate that the generalization capability of the post-trained model is critically determined by the information gain derived from the generative model, as analyzed from a novel reverse-bottleneck perspective. Moreover, we introduce the concept of Generalization Gain via Mutual Information (GGMI) and elucidate the relationship between generalization gain and information gain. This analysis serves as a theoretical foundation for synthetic data generation and further highlights its connection with the generalization capability of post-trained models, offering an understanding about the design of synthetic data generation techniques and the optimization of the post-training process. We open source our code through an anonymous GitHub repository at this https URL.

[AI-14] Performant, Memory Efficient and Scalable Multi-Agent Reinforcement Learning

Link: https://arxiv.org/abs/2410.01706
Authors: Omayma Mahjoub,Sasha Abramowitz,Ruan de Kock,Wiem Khlifi,Simon du Toit,Jemma Daniel,Louay Ben Nessir,Louise Beyers,Claude Formanek,Liam Clark,Arnu Pretorius
Keywords: multi-agent reinforcement learning, achieving strong performance, reinforcement learning, progresses towards larger, achieving strong
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Abstract:As the field of multi-agent reinforcement learning (MARL) progresses towards larger and more complex environments, achieving strong performance while maintaining memory efficiency and scalability to many agents becomes increasingly important. Although recent research has led to several advanced algorithms, to date, none fully address all of these key properties simultaneously. In this work, we introduce Sable, a novel and theoretically sound algorithm that adapts the retention mechanism from Retentive Networks to MARL. Sable’s retention-based sequence modelling architecture allows for computationally efficient scaling to a large number of agents, as well as maintaining a long temporal context, making it well-suited for large-scale partially observable environments. Through extensive evaluations across six diverse environments, we demonstrate how Sable is able to significantly outperform existing state-of-the-art methods in the majority of tasks (34 out of 45, roughly 75%). Furthermore, Sable demonstrates stable performance as we scale the number of agents, handling environments with more than a thousand agents while exhibiting a linear increase in memory usage. Finally, we conduct ablation studies to isolate the source of Sable’s performance gains and confirm its efficient computational memory usage. Our results highlight Sable’s performance and efficiency, positioning it as a leading approach to MARL at scale.

[AI-15] CreDes: Causal Reasoning Enhancement and Dual-End Searching for Solving Long-Range Reasoning Problems using LLMs

Link: https://arxiv.org/abs/2410.01696
Authors: Kangsheng Wang,Xiao Zhang,Hao Liu,Songde Han,Huimin Ma,Tianyu Hu
Keywords: Large language models, handling combinatorial optimization, combinatorial optimization problems, optimization problems involving, Individual Treatment Effect
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Large language models (LLMs) have demonstrated limitations in handling combinatorial optimization problems involving long-range reasoning, partially due to causal hallucinations and huge search space. As for causal hallucinations, i.e., the inconsistency between reasoning and corresponding state transition, this paper introduces the Causal Relationship Enhancement (CRE) mechanism combining cause-effect interventions and the Individual Treatment Effect (ITE) to guarantee the solid causal rightness between each step of reasoning and state transition. As for the long causal range and huge search space limiting the performances of existing models featuring single-direction search, a Dual-End Searching (DES) approach is proposed to seek solutions by simultaneously starting from both the initial and goal states on the causal probability tree. By integrating CRE and DES (CreDes), our model has realized simultaneous multi-step reasoning, circumventing the inefficiencies from cascading multiple one-step reasoning like the Chain-of-Thought (CoT). Experiments demonstrate that CreDes significantly outperforms existing State-Of-The-Art (SOTA) solutions in long-range reasoning tasks in terms of both accuracy and time efficiency.
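
The dual-end idea of searching from both the initial and the goal state can be illustrated with a generic bidirectional breadth-first search. This is a minimal sketch on a hypothetical toy graph, not the CreDes procedure itself, which the abstract describes as operating on a causal probability tree; the function name and the graph are assumptions made for illustration.

```python
from collections import deque


def bidirectional_search(graph, start, goal):
    """Return a path from start to goal in an undirected graph, or None."""
    if start == goal:
        return [start]
    # Each dict maps a visited node to its predecessor in that side's search tree.
    parents_fwd = {start: None}
    parents_bwd = {goal: None}
    frontier_fwd = deque([start])
    frontier_bwd = deque([goal])

    def expand(frontier, parents, other_parents):
        # Expand one full layer; return a node where the two searches meet.
        for _ in range(len(frontier)):
            node = frontier.popleft()
            for nxt in graph.get(node, ()):
                if nxt not in parents:
                    parents[nxt] = node
                    if nxt in other_parents:
                        return nxt
                    frontier.append(nxt)
        return None

    while frontier_fwd and frontier_bwd:
        meet = expand(frontier_fwd, parents_fwd, parents_bwd)
        if meet is None:
            meet = expand(frontier_bwd, parents_bwd, parents_fwd)
        if meet is not None:
            # Walk back to the start, then forward to the goal.
            path = []
            node = meet
            while node is not None:
                path.append(node)
                node = parents_fwd[node]
            path.reverse()
            node = parents_bwd[meet]
            while node is not None:
                path.append(node)
                node = parents_bwd[node]
            return path
    return None


# Hypothetical toy state graph (undirected adjacency lists).
toy_graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
    "Z": [],
}
print(bidirectional_search(toy_graph, "A", "E"))
```

Each side only explores about half the depth of the solution, which is why meeting in the middle tends to visit far fewer states than a single-direction search when the branching factor is large.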

[AI-16] From Prohibition to Adoption: How Hong Kong Universities Are Navigating ChatGPT in Academic Workflows

Link: https://arxiv.org/abs/2410.01695
Authors: Junjun Huang,Jifan Wu,Qing Wang,Kemeng Yuan,Jiefeng Li,Di Lu
Keywords: Hong Kong universities, Hong Kong, time when Hong, Kong universities, paper aims
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Abstract:This paper compares the period when Hong Kong universities banned ChatGPT with the current period, in which it has become integrated into academic processes. Driven by concerns about integrity and ethical issues in technology, institutions have adapted by moving towards a middle ground, adopting AI literacy and responsibility policies. This study examines new paradigms that have been developed to help realize these benefits while preventing negative effects on academia. Keywords: ChatGPT, Academic Integrity, AI Literacy, Ethical AI Use, Generative AI in Education, University Policy, AI Integration in Academia, Higher Education and Technology

[AI-17] U-shaped and Inverted-U Scaling behind Emergent Abilities of Large Language Models

Link: https://arxiv.org/abs/2410.01692
Authors: Tung-Yu Wu,Pei-Yu Lo
Keywords: Large language models, exhibit emergent abilities, Large language, downstream tasks, shown to exhibit
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint. Under review

Abstract:Large language models (LLMs) have been shown to exhibit emergent abilities in some downstream tasks, where performance seems to stagnate at first and then improve sharply and unpredictably with scale beyond a threshold. By dividing questions in the datasets according to difficulty level by average performance, we observe U-shaped scaling for hard questions, and inverted-U scaling followed by steady improvement for easy questions. Moreover, the emergence threshold roughly coincides with the point at which performance on easy questions reverts from inverse scaling to standard scaling. Capitalizing on the observable though opposing scaling trend on easy and hard questions, we propose a simple yet effective pipeline, called Slice-and-Sandwich, to predict both the emergence threshold and model performance beyond the threshold.

[AI-18] FactAlign: Long-form Factuality Alignment of Large Language Models EMNLP2024

Link: https://arxiv.org/abs/2410.01691
Authors: Chao-Wei Huang,Yun-Nung Chen
Keywords: Large language models, demonstrated significant potential, Large language, information access engines, next-generation information access
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to EMNLP 2024 Findings

Abstract:Large language models have demonstrated significant potential as the next-generation information access engines. However, their reliability is hindered by issues of hallucination and generating non-factual content. This is particularly problematic in long-form responses, where assessing and ensuring factual accuracy is complex. In this paper, we address this gap by proposing FactAlign, a novel alignment framework designed to enhance the factuality of LLMs’ long-form responses while maintaining their helpfulness. We introduce fKTO, a fine-grained, sentence-level alignment algorithm that extends the Kahneman-Tversky Optimization (KTO) alignment method. Leveraging recent advances in automatic factuality evaluation, FactAlign utilizes fine-grained factuality assessments to guide the alignment process. Our experiments on open-domain prompts and information-seeking questions demonstrate that FactAlign significantly improves the factual accuracy of LLM responses while also improving their helpfulness. Further analyses identify that FactAlign is capable of training LLMs to provide more information without losing factual precision, thus improving the factual F1 score. Our source code, datasets, and trained models are publicly available at this https URL

[AI-19] Why context matters in VQA and Reasoning: Semantic interventions for VLM input modalities

Link: https://arxiv.org/abs/2410.01690
Authors: Kenza Amara,Lukas Klein,Carsten Lüth,Paul Jäger,Hendrik Strobelt,Mennatallah El-Assady
Keywords: Visual Language Model, limitations of Generative, Visual Language, Language Model, Semantic Interventions
Categories: Artificial Intelligence (cs.AI)

Abstract:The various limitations of Generative AI, such as hallucinations and model failures, have made it crucial to understand the role of different modalities in Visual Language Model (VLM) predictions. Our work investigates how the integration of information from image and text modalities influences the performance and behavior of VLMs in visual question answering (VQA) and reasoning tasks. We measure this effect through answer accuracy, reasoning quality, model uncertainty, and modality relevance. We study the interplay between text and image modalities in different configurations where visual content is essential for solving the VQA task. Our contributions include (1) the Semantic Interventions (SI)-VQA dataset, (2) a benchmark study of various VLM architectures under different modality configurations, and (3) the Interactive Semantic Interventions (ISI) tool. The SI-VQA dataset serves as the foundation for the benchmark, while the ISI tool provides an interface to test and apply semantic interventions in image and text inputs, enabling more fine-grained analysis. Our results show that complementary information between modalities improves answer and reasoning quality, while contradictory information harms model performance and confidence. Image text annotations have minimal impact on accuracy and uncertainty, slightly increasing image relevance. Attention analysis confirms the dominant role of image inputs over text in VQA tasks. In this study, we evaluate state-of-the-art VLMs that allow us to extract attention coefficients for each modality. A key finding is PaliGemma’s harmful overconfidence, which poses a higher risk of silent failures compared to the LLaVA models. This work sets the foundation for rigorous analysis of modality integration, supported by datasets specifically designed for this purpose.

[AI-20] Uncertainty Quantification with Bayesian Higher Order ReLU KANs

Link: https://arxiv.org/abs/2410.01687
Authors: James Giroux,Cristiano Fanelli
Keywords: enhance computational efficiency, Higher Order, Kolmogorov-Arnold Networks, demands of Bayesian, enhance computational
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 13 pages, 7 Figures

Abstract:We introduce the first method of uncertainty quantification in the domain of Kolmogorov-Arnold Networks, specifically focusing on (Higher Order) ReLU KANs to enhance computational efficiency given the computational demands of Bayesian methods. The method we propose is general in nature, providing access to both epistemic and aleatoric uncertainties. It is also capable of generalization to various other basis functions. We validate our method through a series of closure tests, including simple one-dimensional functions and application to the domain of (Stochastic) Partial Differential Equations. Referring to the latter, we demonstrate the method’s ability to correctly identify functional dependencies introduced through the inclusion of a stochastic term. The code supporting this work can be found at this https URL

[AI-21] Positional Attention: Out-of-Distribution Generalization and Expressivity for Neural Algorithmic Reasoning

Link: https://arxiv.org/abs/2410.01686
Authors: Artur Back de Luca,George Giapitzakis,Shenghao Yang,Petar Veličković,Kimon Fountoulakis
Keywords: solve algorithmic tasks, summary statistics, growing interest, ability of neural, neural networks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
Comments: 37 pages, 22 figures

Abstract:There has been a growing interest in the ability of neural networks to solve algorithmic tasks, such as arithmetic, summary statistics, and sorting. While state-of-the-art models like Transformers have demonstrated good generalization performance on in-distribution tasks, their out-of-distribution (OOD) performance is poor when trained end-to-end. In this paper, we focus on value generalization, a common instance of OOD generalization where the test distribution has the same input sequence length as the training distribution, but the value ranges in the training and test distributions do not necessarily overlap. To address this issue, we propose that using fixed positional encodings to determine attention weights, referred to as positional attention, enhances empirical OOD performance while maintaining expressivity. We support our claim about expressivity by proving that Transformers with positional attention can effectively simulate parallel algorithms.
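
To make the idea of attention weights that depend only on position concrete, here is a minimal single-head sketch in plain Python. The sinusoidal encodings, the identity query/key projections, and the function names are illustrative assumptions, not details taken from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def positional_encoding(n, d):
    """Fixed sinusoidal encodings for n positions (an illustrative choice)."""
    return [
        [
            math.sin(pos / (10000 ** (i / d))) if i % 2 == 0
            else math.cos(pos / (10000 ** ((i - 1) / d)))
            for i in range(d)
        ]
        for pos in range(n)
    ]


def positional_attention(values, pos_enc):
    """Mix `values` with weights computed from positions only.

    Queries and keys are the positional encodings themselves (identity
    projections, an assumption), so the weights never see the input values.
    """
    n = len(values)
    d = len(pos_enc[0])
    weights = []
    for i in range(n):
        scores = [
            sum(a * b for a, b in zip(pos_enc[i], pos_enc[j])) / math.sqrt(d)
            for j in range(n)
        ]
        weights.append(softmax(scores))
    dim = len(values[0])
    outputs = [
        [sum(weights[i][j] * values[j][k] for j in range(n)) for k in range(dim)]
        for i in range(n)
    ]
    return outputs, weights
```

Because the weights are a function of position alone, rescaling the input values (for example to a range never seen in training) leaves the attention pattern unchanged and only rescales the output, which is the intuition behind value generalization in this setting.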

[AI-22] PHI-S: Distribution Balancing for Label-Free Multi-Teacher Distillation

Link: https://arxiv.org/abs/2410.01680
Authors: Mike Ranzinger,Jon Barker,Greg Heinrich,Pavlo Molchanov,Bryan Catanzaro,Andrew Tao
Keywords: heterogeneous multi-teacher knowledge, multi-teacher knowledge distillation, visual foundation models, strengths and weaknesses, distillation without labels
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Various visual foundation models have distinct strengths and weaknesses, both of which can be improved through heterogeneous multi-teacher knowledge distillation without labels, termed “agglomerative models.” We build upon this body of work by studying the effect of the teachers’ activation statistics, particularly the impact of the loss function on the resulting student model quality. We explore a standard toolkit of statistical normalization techniques to better align the different distributions and assess their effects. Further, we examine the impact on downstream teacher-matching metrics, which motivates the use of Hadamard matrices. With these matrices, we demonstrate useful properties, showing how they can be used for isotropic standardization, where each dimension of a multivariate distribution is standardized using the same scale. We call this technique “PHI Standardization” (PHI-S) and empirically demonstrate that it produces the best student model across the suite of methods studied.
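
One property that makes Hadamard matrices useful here can be checked in a few lines: rotating uncorrelated features by a normalized Hadamard matrix spreads the total variance evenly over all dimensions, so a single shared scale then standardizes every dimension. The sketch below assumes the inputs are already decorrelated (diagonal covariance) and uses the Sylvester construction; it illustrates the property only and is not the authors' implementation.

```python
def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        # Block form [[H, H], [H, -H]] doubles the size at each step.
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def rotated_variances(variances):
    """Per-dimension variances after rotating by H / sqrt(n).

    For uncorrelated inputs with covariance diag(v), this is the diagonal of
    (H / sqrt(n)) diag(v) (H / sqrt(n))^T.
    """
    n = len(variances)
    h = hadamard(n)
    return [
        sum(h[i][j] ** 2 * variances[j] for j in range(n)) / n
        for i in range(n)
    ]


# Very unequal input variances become equal to their mean after the rotation.
print(rotated_variances([4.0, 1.0, 0.25, 0.75]))  # [1.5, 1.5, 1.5, 1.5]
```

Since every entry of a Hadamard matrix is +1 or -1, each rotated dimension receives an equal share of every input dimension's variance, which is why the result is the mean variance in all coordinates.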

[AI-23] Mind Scramble: Unveiling Large Language Model Psychology Via Typoglycemia

Link: https://arxiv.org/abs/2410.01677
Authors: Miao Yu,Junyuan Mao,Guibin Zhang,Jingheng Ye,Junfeng Fang,Aoxiao Zhong,Yang Liu,Yuxuan Liang,Kun Wang,Qingsong Wen
Keywords: large language models, physical world, large language, shown promise, promise in addressing
Categories: Artificial Intelligence (cs.AI)

Abstract:Research into the external behaviors and internal mechanisms of large language models (LLMs) has shown promise in addressing complex tasks in the physical world. Studies suggest that powerful LLMs, like GPT-4, are beginning to exhibit human-like cognitive abilities, including planning, reasoning, and reflection. In this paper, we introduce a research line and methodology called LLM Psychology, leveraging human psychology experiments to investigate the cognitive behaviors and mechanisms of LLMs. We migrate the Typoglycemia phenomenon from psychology to explore the “mind” of LLMs. Unlike human brains, which rely on context and word patterns to comprehend scrambled text, LLMs use distinct encoding and decoding processes. Through Typoglycemia experiments at the character, word, and sentence levels, we observe: (I) LLMs demonstrate human-like behaviors on a macro scale, such as lower task accuracy and higher token/time consumption; (II) LLMs exhibit varying robustness to scrambled input, making Typoglycemia a benchmark for model evaluation without new datasets; (III) Different task types have varying impacts, with complex logical tasks (e.g., math) being more challenging in scrambled form; (IV) Each LLM has a unique and consistent “cognitive pattern” across tasks, revealing general mechanisms in its psychology process. We provide an in-depth analysis of hidden layers to explain these phenomena, paving the way for future research in LLM Psychology and deeper interpretability.
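
A character-level Typoglycemia input can be produced by shuffling the interior letters of each word while keeping its first and last characters in place. The sketch below is one plausible way to build such inputs; the function names, the whitespace tokenization, and the seed handling are assumptions, and the paper's exact scrambling protocol may differ.

```python
import random


def scramble_word(word, rng):
    """Shuffle interior characters, keeping the first and last ones fixed."""
    if len(word) <= 3:
        return word  # nothing to shuffle that would change the word
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]


def scramble_text(text, seed=0):
    """Scramble every whitespace-separated token with a reproducible RNG."""
    rng = random.Random(seed)
    return " ".join(scramble_word(w, rng) for w in text.split())


print(scramble_text("Scrambled sentences remain surprisingly readable"))
```

Punctuation attached to a token is treated as part of the word here, so a production version would likely separate it first.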

[AI-24] Trying to be human: Linguistic traces of stochastic empathy in language models

Link: https://arxiv.org/abs/2410.01675
Authors: Bennett Kleinberg,Jari Zegers,Jonas Festor,Stefana Vida,Julian Präsent,Riccardo Loconte,Sanne Peereboom
Keywords: modern world, navigating the modern, Differentiating between generated, human, Differentiating
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: preprint

Abstract:Differentiating between generated and human-written content is important for navigating the modern world. Large language models (LLMs) are crucial drivers behind the increased quality of computer-generated content. Reportedly, humans find it increasingly difficult to identify whether an AI model generated a piece of text. Our work tests how two important factors contribute to the human vs AI race: empathy and an incentive to appear human. We address both aspects in two experiments: human participants and a state-of-the-art LLM wrote relationship advice (Study 1, n=530) or mere descriptions (Study 2, n=610), either instructed to be as human as possible or not. New samples of humans (n=428 and n=408) then judged the texts’ source. Our findings show that when empathy is required, humans excel. Contrary to expectations, instructions to appear human were only effective for the LLM, so the human advantage diminished. Computational text analysis revealed that LLMs become more human because they may have an implicit representation of what makes a text human and effortlessly apply these heuristics. The model resorts to a conversational, self-referential, informal tone with a simpler vocabulary to mimic stochastic empathy. We discuss these findings in light of recent claims on the on-par performance of LLMs.

[AI-25] Bridging Context Gaps: Leveraging Coreference Resolution for Long Contextual Understanding

Link: https://arxiv.org/abs/2410.01671
Authors: Yanming Liu,Xinyue Peng,Jiannan Cao,Shi Bo,Yanxin Shen,Xuhong Zhang,Sheng Cheng,Xun Wang,Jianwei Yin,Tianyu Du
Keywords: Large language models, shown remarkable capabilities, Large language, executing effective question, natural language processing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under-review version of LQCA, Bridge context gap for long context

Abstract:Large language models (LLMs) have shown remarkable capabilities in natural language processing; however, they still face difficulties when tasked with understanding lengthy contexts and executing effective question answering. These challenges often arise due to the complexity and ambiguity present in longer texts. To enhance the performance of LLMs in such scenarios, we introduce the Long Question Coreference Adaptation (LQCA) method. This innovative framework focuses on coreference resolution tailored to long contexts, allowing the model to identify and manage references effectively. The LQCA method encompasses four key steps: resolving coreferences within sub-documents, computing the distances between mentions, defining a representative mention for coreference, and answering questions through mention replacement. By processing information systematically, the framework provides easier-to-handle partitions for LLMs, promoting better understanding. Experimental evaluations on a range of LLMs and datasets have yielded positive results, with notable improvements on OpenAI-o1-mini and GPT-4o models, highlighting the effectiveness of leveraging coreference resolution to bridge context gaps in question answering.

[AI-26] Finding path and cycle counting formulae in graphs with Deep Reinforcement Learning

Link: https://arxiv.org/abs/2410.01661
Authors: Jason Piquenot,Maxime Bérar,Pierre Héroux,Jean-Yves Ramel,Romain Raveaux,Sébastien Adam
Keywords: Carlo Tree Search, Monte Carlo Tree, Grammar Reinforcement Learning, reinforcement learning algorithm, presents Grammar Reinforcement
Categories: Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)

Abstract:This paper presents Grammar Reinforcement Learning (GRL), a reinforcement learning algorithm that uses Monte Carlo Tree Search (MCTS) and a transformer architecture that models a Pushdown Automaton (PDA) within a context-free grammar (CFG) framework. Taking as a use case the problem of efficiently counting paths and cycles in graphs, a key challenge in network analysis, computer science, biology, and social sciences, GRL discovers new matrix-based formulas for path/cycle counting that improve computational efficiency by factors of two to six w.r.t. state-of-the-art approaches. Our contributions include: (i) a framework for generating gramformers that operate within a CFG, (ii) the development of GRL for optimizing formulas within grammatical structures, and (iii) the discovery of novel formulas for graph substructure counting, leading to significant computational improvements.
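
For readers unfamiliar with matrix-based counting formulas, the textbook example is triangle (3-cycle) counting: in a simple undirected graph with adjacency matrix A, the number of triangles equals trace(A^3) / 6. The sketch below implements only this classic identity; it is not one of the formulas discovered by GRL, and the helper names are assumptions.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def count_triangles(adj):
    """Count 3-cycles in a simple undirected graph via trace(A^3) / 6.

    Each triangle yields six closed walks of length 3: three possible
    starting vertices times two directions.
    """
    a3 = matmul(matmul(adj, adj), adj)
    return sum(a3[i][i] for i in range(len(adj))) // 6


# Complete graph on four vertices.
k4 = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
]
print(count_triangles(k4))  # 4
```

Formulas of this kind trade combinatorial enumeration for a handful of matrix products, so reducing the number of products directly reduces the cost of counting.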

[AI-27] Conformal Generative Modeling with Improved Sample Efficiency through Sequential Greedy Filtering

Link: https://arxiv.org/abs/2410.01660
Authors: Klaus-Rudolf Kladny,Bernhard Schölkopf,Michael Muehlebach
Keywords: Generative models lack, models lack rigorous, lack rigorous statistical, Generative models, Sequential Conformal Prediction
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Generative models lack rigorous statistical guarantees for their outputs and are therefore unreliable in safety-critical applications. In this work, we propose Sequential Conformal Prediction for Generative Models (SCOPE-Gen), a sequential conformal prediction method producing prediction sets that satisfy a rigorous statistical guarantee called conformal admissibility control. This guarantee states that with high probability, the prediction sets contain at least one admissible (or valid) example. To this end, our method first samples an initial set of i.i.d. examples from a black box generative model. Then, this set is iteratively pruned via so-called greedy filters. As a consequence of the iterative generation procedure, admissibility of the final prediction set factorizes as a Markov chain. This factorization is crucial because it allows each factor to be controlled separately using conformal prediction. In comparison to prior work, our method demonstrates a large reduction in the number of admissibility evaluations during calibration. This reduction is important in safety-critical applications, where these evaluations must be conducted manually by domain experts and are therefore costly and time consuming. We highlight the advantages of our method in terms of admissibility evaluations and cardinality of the prediction sets through experiments in natural language generation and molecular graph extension tasks.

[AI-28] Efficient Long-range Language Modeling with Self-supervised Causal Retrieval

Link: https://arxiv.org/abs/2410.01651
Authors: Xiang Hu,Zhihao Teng,Wei Wu,Kewei Tu
Keywords: retrieval-based language models, received much attention, retrieval-based language, Recently, language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: preprint

Abstract:Recently, retrieval-based language models (RLMs) have received much attention. However, most of them leverage a pre-trained retriever with fixed parameters, which may not adapt well to causal language models. In this work, we propose Grouped Cross-Attention, a novel module enabling joint pre-training of the retriever and causal LM, and apply it to long-context modeling. For a given input sequence, we split it into chunks and use the current chunk to retrieve past chunks for subsequent text generation. Our innovation allows the retriever to learn how to retrieve past chunks that better minimize the auto-regressive loss of subsequent tokens in an end-to-end manner. By integrating top-k retrieval, our model can be pre-trained efficiently from scratch with context lengths up to 64K tokens. Our experiments show our model, compared with long-range LM baselines, can achieve lower perplexity with comparable or lower pre-training and inference costs.

[AI-29] shapiq: Shapley Interactions for Machine Learning NEURIPS2024

Link: https://arxiv.org/abs/2410.01649
Authors: Maximilian Muschalik,Hubert Baniecki,Fabian Fumagalli,Patrick Kolpaczki,Barbara Hammer,Eyke Hüllermeier
Keywords: Originally rooted, machine learning, important tool, Originally, machine learning models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2024

Abstract:Originally rooted in game theory, the Shapley Value (SV) has recently become an important tool in machine learning research. Perhaps most notably, it is used for feature attribution and data valuation in explainable artificial intelligence. Shapley Interactions (SIs) naturally extend the SV and address its limitations by assigning joint contributions to groups of entities, which enhance understanding of black box machine learning models. Due to the exponential complexity of computing SVs and SIs, various methods have been proposed that exploit structural assumptions or yield probabilistic estimates given limited resources. In this work, we introduce shapiq, an open-source Python package that unifies state-of-the-art algorithms to efficiently compute SVs and any-order SIs in an application-agnostic framework. Moreover, it includes a benchmarking suite containing 11 machine learning applications of SIs with pre-computed games and ground-truth values to systematically assess computational performance across domains. For practitioners, shapiq is able to explain and visualize any-order feature interactions in predictions of models, including vision transformers, language models, as well as XGBoost and LightGBM with TreeSHAP-IQ. With shapiq, we extend shap beyond feature attributions and consolidate the application of SVs and SIs in machine learning that facilitates future research. The source code and documentation are available at this https URL.
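
To show what Shapley Values and pairwise Shapley Interactions measure, the sketch below computes both exactly for a hypothetical three-player game by enumerating coalitions. It deliberately does not use the shapiq package or imitate its API; the function names and the toy game are assumptions, and the exhaustive enumeration is exactly the exponential cost that motivates approximation methods.

```python
from itertools import combinations
from math import factorial


def subsets(players):
    """Yield every subset of `players` as a frozenset."""
    for r in range(len(players) + 1):
        for combo in combinations(players, r):
            yield frozenset(combo)


def shapley_value(game, n, i):
    """Exact Shapley Value of player i in an n-player game."""
    others = [p for p in range(n) if p != i]
    total = 0.0
    for s in subsets(others):
        weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
        total += weight * (game(s | {i}) - game(s))
    return total


def shapley_interaction(game, n, i, j):
    """Exact pairwise Shapley interaction index of players i and j."""
    others = [p for p in range(n) if p not in (i, j)]
    total = 0.0
    for s in subsets(others):
        weight = factorial(len(s)) * factorial(n - len(s) - 2) / factorial(n - 1)
        # Joint contribution of i and j beyond their individual contributions.
        delta = game(s | {i, j}) - game(s | {i}) - game(s | {j}) + game(s)
        total += weight * delta
    return total


def toy_game(coalition):
    """Hypothetical game: 1 per member, plus a bonus of 2 if 0 and 1 team up."""
    return len(coalition) + (2 if {0, 1} <= coalition else 0)


values = [shapley_value(toy_game, 3, i) for i in range(3)]
print([round(v, 6) for v in values])
print(round(shapley_interaction(toy_game, 3, 0, 1), 6))
```

In this toy game the Shapley Values split the synergy bonus evenly between players 0 and 1, while the interaction index attributes the bonus to the pair itself, which is the extra information interactions provide over per-feature attributions.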

[AI-30] Stable Offline Value Function Learning with Bisimulation-based Representations

Link: https://arxiv.org/abs/2410.01643
Authors: Brahma S. Pavse,Yudong Chen,Qiaomin Xie,Josiah P. Hanna
Keywords: expected discounted return, function learning, dataset to estimate, estimate the expected, expected discounted
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under review

Abstract:In reinforcement learning, offline value function learning is the procedure of using an offline dataset to estimate the expected discounted return from each state when taking actions according to a fixed target policy. The stability of this procedure, i.e., whether it converges to its fixed point, critically depends on the representations of the state-action pairs. Poorly learned representations can make value function learning unstable, or even divergent. Therefore, it is critical to stabilize value function learning by explicitly shaping the state-action representations. Recently, the class of bisimulation-based algorithms has shown promise in shaping representations for control. However, it is still unclear if this class of methods can stabilize value function learning. In this work, we investigate this question and answer it affirmatively. We introduce a bisimulation-based algorithm called kernel representations for offline policy evaluation (KROPE). KROPE uses a kernel to shape state-action representations such that state-action pairs that have similar immediate rewards and lead to similar next state-action pairs under the target policy also have similar representations. We show that KROPE: 1) learns stable representations and 2) leads to lower value error than baselines. Our analysis provides new theoretical insight into the stability properties of bisimulation-based methods and suggests that practitioners can use these methods for stable and accurate evaluation of offline reinforcement learning agents.

[AI-31] Moral Alignment for LLM Agents

Link: https://arxiv.org/abs/2410.01639
Authors: Elizaveta Tennant, Stephen Hailes, Mirco Musolesi
Keywords: pre-trained Large Language, Large Language Models, Large Language, Decision-making agents based, pre-trained Large
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:Decision-making agents based on pre-trained Large Language Models (LLMs) are increasingly being deployed across various domains of human activity. While their applications are currently rather specialized, several research efforts are under way to develop more generalist agents. As LLM-based systems become more agentic, their influence on human activity will grow and the transparency of this will decrease. Consequently, developing effective methods for aligning them to human values is vital. The prevailing practice in alignment often relies on human preference data (e.g., in RLHF or DPO), in which values are implicit and are essentially deduced from relative preferences over different model outputs. In this work, instead of relying on human feedback, we introduce the design of reward functions that explicitly encode core human values for Reinforcement Learning-based fine-tuning of foundation agent models. Specifically, we use intrinsic rewards for the moral alignment of LLM agents. We evaluate our approach using the traditional philosophical frameworks of Deontological Ethics and Utilitarianism, quantifying moral rewards for agents in terms of actions and consequences on the Iterated Prisoner’s Dilemma (IPD) environment. We also show how moral fine-tuning can be deployed to enable an agent to unlearn a previously developed selfish strategy. Finally, we find that certain moral strategies learned on the IPD game generalize to several other matrix game environments. In summary, we demonstrate that fine-tuning with intrinsic rewards is a promising general solution for aligning LLM agents to human values, and it might represent a more transparent and cost-effective alternative to currently predominant alignment techniques. 
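
As a rough illustration of action-based versus consequence-based intrinsic rewards on the Iterated Prisoner's Dilemma, the sketch below defines one reward of each kind. Everything in it is an assumption: the payoff values in `PAYOFF`, the penalty size, and the specific deontological norm ("do not defect against an opponent who cooperated last round") are hypothetical choices for illustration and are not taken from the abstract.

```python
# Hypothetical IPD payoffs: (my payoff, opponent payoff) for each action pair.
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 4),
    ("D", "C"): (4, 0),
    ("D", "D"): (1, 1),
}


def deontological_reward(my_action, opponent_prev_action, penalty=-3.0):
    """Action-based reward: penalize defecting against a previous cooperator.

    The norm and the penalty value are illustrative assumptions.
    """
    if my_action == "D" and opponent_prev_action == "C":
        return penalty
    return 0.0


def utilitarian_reward(my_action, opponent_action):
    """Consequence-based reward: the collective payoff of both players."""
    mine, theirs = PAYOFF[(my_action, opponent_action)]
    return float(mine + theirs)


print(deontological_reward("D", "C"), utilitarian_reward("C", "C"))
```

The first reward depends only on the agent's own action in context, while the second depends on the outcome for both players, which mirrors the actions-versus-consequences distinction described in the abstract.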

[AI-32] Data Extrapolation for Text-to-image Generation on Small Datasets

Link: https://arxiv.org/abs/2410.01638
Authors: Senmao Ye, Fei Liu
Keywords: requires large amount, generation requires large, synthesizing high-quality images, requires large, large amount
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Text-to-image generation requires a large amount of training data to synthesize high-quality images. For augmenting training data, previous methods rely on data interpolations like cropping, flipping, and mixing up, which fail to introduce new information and yield only marginal improvements. In this paper, we propose a new data augmentation method for text-to-image generation using linear extrapolation. Specifically, we apply linear extrapolation only to text features, and new image data are retrieved from the internet by search engines. For the reliability of new text-image pairs, we design two outlier detectors to purify retrieved images. Based on extrapolation, we construct training samples dozens of times larger than the original dataset, resulting in a significant improvement in text-to-image performance. Moreover, we propose a NULL-guidance to refine score estimation, and apply recurrent affine transformation to fuse text information. Our model achieves FID scores of 7.91, 9.52 and 5.00 on the CUB, Oxford and COCO datasets. The code and data will be available on GitHub (this https URL).

[AI-33] Does Graph Prompt Work? A Data Operation Perspective with Theoretical Analysis

Link: https://arxiv.org/abs/2410.01635
Authors: Qunzhong Wang, Xiangguo Sun, Hong Cheng
Keywords: promising research direction, graph, graph prompting, recent years, research direction
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)

Abstract:In recent years, graph prompting has emerged as a promising research direction, enabling the learning of additional tokens or subgraphs appended to the original graphs without requiring retraining of pre-trained graph models across various applications. This novel paradigm, shifting from the traditional pretraining and finetuning to pretraining and prompting, has shown significant empirical success in simulating graph data operations, with applications ranging from recommendation systems to biological networks and graph transferring. However, despite its potential, the theoretical underpinnings of graph prompting remain underexplored, raising critical questions about its fundamental effectiveness. The lack of rigorous theoretical proof of why and how much it works hangs like a dark cloud over the graph prompting area and hinders further progress. To fill this gap, this paper introduces a theoretical framework that rigorously analyzes graph prompting from a data operation perspective. Our contributions are threefold: First, we provide a formal guarantee theorem, demonstrating graph prompts' capacity to approximate graph transformation operators, effectively linking upstream and downstream tasks. Second, we derive upper bounds on the error of these data operations by graph prompts for a single graph and extend this discussion to batches of graphs, which are common in graph model training. Third, we analyze the distribution of data operation errors, extending our theoretical findings from linear graph models (e.g., GCN) to non-linear graph models (e.g., GAT). Extensive experiments support our theoretical results and confirm the practical implications of these guarantees.

[AI-34] Entropy-Based Uncertainty Modeling for Trajectory Prediction in Autonomous Driving

Link: https://arxiv.org/abs/2410.01628
Authors: Aron Distelzweig, Andreas Look, Eitan Kosman, Faris Janjoš, Jörg Wagner, Abhinav Valada
Keywords: efficient motion planning, accurate motion prediction, accurate motion, motion planning, autonomous driving
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures, submitted to International Conference on Learning Representations (2025)

Abstract:In autonomous driving, accurate motion prediction is essential for safe and efficient motion planning. To ensure safety, planners must rely on reliable uncertainty information about the predicted future behavior of surrounding agents, yet this aspect has received limited attention. This paper addresses the so-far neglected problem of uncertainty modeling in trajectory prediction. We adopt a holistic approach that focuses on uncertainty quantification, decomposition, and the influence of model composition. Our method is based on a theoretically grounded information-theoretic approach to measure uncertainty, allowing us to decompose total uncertainty into its aleatoric and epistemic components. We conduct extensive experiments on the nuScenes dataset to assess how different model architectures and configurations affect uncertainty quantification and model robustness.
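
One common information-theoretic way to split total uncertainty into aleatoric and epistemic parts uses an ensemble: total uncertainty is the entropy of the averaged prediction, aleatoric uncertainty is the average entropy of the members, and epistemic uncertainty is the difference. The sketch below applies this to discrete distributions (for example, probabilities over a few candidate trajectory modes). The abstract does not state the paper's exact estimator, so the discrete setting and the function names are assumptions for illustration.

```python
from math import log


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * log(q) for q in p if q > 0.0)


def decompose_uncertainty(member_probs):
    """Split total predictive uncertainty of an ensemble.

    member_probs: list of discrete distributions, one per ensemble member.
    Returns (total, aleatoric, epistemic), where
      total     = entropy of the averaged distribution,
      aleatoric = mean entropy of the individual members,
      epistemic = total - aleatoric (disagreement between members).
    """
    k = len(member_probs)
    n = len(member_probs[0])
    mean = [sum(m[j] for m in member_probs) / k for j in range(n)]
    total = entropy(mean)
    aleatoric = sum(entropy(m) for m in member_probs) / k
    return total, aleatoric, total - aleatoric


# Members agree that the outcome is a coin flip: purely aleatoric.
agree = decompose_uncertainty([[0.5, 0.5], [0.5, 0.5]])
# Members are individually certain but contradict each other: purely epistemic.
disagree = decompose_uncertainty([[1.0, 0.0], [0.0, 1.0]])
print(agree, disagree)
```

Both toy cases have the same total uncertainty, yet only the second one could in principle be reduced with more training data, which is why a planner may want the two components reported separately.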

[AI-35] Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?

Link: https://arxiv.org/abs/2410.01623
Authors: Xi Chen, Kaituo Feng, Changsheng Li, Xunhao Lai, Xiangyu Yue, Ye Yuan, Guoren Wang
Keywords: Large Language Models, training Large Language, Language Models, Large Language, reducing memory usage
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Code is available at: this https URL

Abstract:Low-rank training has emerged as a promising approach for reducing memory usage in training Large Language Models (LLMs). Previous methods either rely on decomposing weight matrices (e.g., LoRA), or seek to decompose gradient matrices (e.g., GaLore) to ensure reduced memory consumption. However, both of them constrain the training in a low-rank subspace, thus inevitably leading to sub-optimal performance. This raises a question: whether it is possible to consistently preserve the low-rank constraint for memory efficiency, while achieving full-rank training (i.e., training with full-rank gradients of full-rank weights) to avoid inferior outcomes? In this paper, we propose a new plug-and-play training framework for LLMs called Fira, as the first attempt to achieve this goal. First, we observe an interesting phenomenon during LLM training: the scaling impact of adaptive optimizers (e.g., Adam) on the gradient norm remains similar from low-rank to full-rank training. Based on this observation, we propose a norm-based scaling method, which utilizes the scaling impact of low-rank optimizers as substitutes for that of original full-rank optimizers to enable full-rank training. In this way, we can preserve the low-rank constraint in the optimizer while achieving full-rank training for better performance. Moreover, we find that there are sudden gradient rises during the optimization process, potentially causing loss spikes. To address this, we further put forward a norm-growth limiter to smooth the gradient via regulating the relative increase of gradient norms. Extensive experiments on the pre-training and fine-tuning of LLMs show that Fira outperforms both LoRA and GaLore, achieving performance that is comparable to or even better than full-rank training.
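
The idea of limiting the relative increase of gradient norms can be sketched in a few lines. This is only one plausible reading of the norm-growth limiter described above: the threshold `gamma`, the function names, and the choice to rescale the whole gradient vector are assumptions, not details confirmed by the abstract.

```python
from math import sqrt


def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return sqrt(sum(x * x for x in vec))


def limit_norm_growth(grad, prev_norm, gamma=1.01):
    """Rescale grad if its norm grew by more than a factor gamma.

    grad: current gradient as a flat list of floats.
    prev_norm: norm of the previous (already limited) gradient, or None.
    gamma: hypothetical cap on the ratio between consecutive norms.
    Returns the (possibly rescaled) gradient and its norm.
    """
    norm = l2_norm(grad)
    if prev_norm is not None and prev_norm > 0.0 and norm > gamma * prev_norm:
        scale = gamma * prev_norm / norm
        grad = [x * scale for x in grad]
        norm = gamma * prev_norm
    return grad, norm


prev = None
norms = []
# The second gradient is a sudden 10x spike; the limiter caps it at 2x.
for g in ([3.0, 4.0], [30.0, 40.0], [3.0, 4.0]):
    g, prev = limit_norm_growth(g, prev, gamma=2.0)
    norms.append(prev)
print(norms)
```

Unlike clipping to a fixed absolute threshold, a relative cap like this adapts to whatever gradient scale the run currently has, so it only intervenes on sudden jumps.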

[AI-36] DRUPI: Dataset Reduction Using Privileged Information

Link: https://arxiv.org/abs/2410.01611
Authors: Shaobo Wang, Yantai Yang, Shuaiyu Zhang, Chenghao Sun, Weiya Li, Xuming Hu, Linfeng Zhang
Keywords: seeks to select, select or distill, distill samples, samples from large, smaller subsets
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Dataset reduction (DR) seeks to select or distill samples from large datasets into smaller subsets while preserving performance on target tasks. Existing methods primarily focus on pruning or synthesizing data in the same format as the original dataset, typically the input data and corresponding labels. However, in DR settings, we find it is possible to synthesize more information beyond the data-label pair as an additional learning target to facilitate model training. In this paper, we introduce Dataset Reduction Using Privileged Information (DRUPI), which enriches DR by synthesizing privileged information alongside the reduced dataset. This privileged information can take the form of feature labels or attention labels, providing auxiliary supervision to improve model learning. Our findings reveal that effective feature labels must balance between being overly discriminative and excessively diverse, with a moderate level proving optimal for improving the reduced dataset’s efficacy. Extensive experiments on ImageNet, CIFAR-10/100, and Tiny ImageNet demonstrate that DRUPI integrates seamlessly with existing dataset reduction methods, offering significant performance gains.

[AI-37] Upcycling Instruction Tuning from Dense to Mixture-of-Experts via Parameter Merging

Link: https://arxiv.org/abs/2410.01610
Authors: Tingfeng Hui, Zhenyu Zhang, Shuohuan Wang, Yu Sun, Hua Wu, Sen Su
Keywords: language processing tasks, natural language processing, plentiful natural language, large language models, shines brightly
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: work in progress

Abstract:Mixture-of-Experts (MoE) shines brightly in large language models (LLMs) and demonstrates outstanding performance in plentiful natural language processing tasks. However, existing methods transforming LLMs from dense to MoE face significant data requirements and typically rely on large-scale post-training. In this paper, we propose Upcycling Instruction Tuning (UpIT), a data-efficient approach for tuning a dense pre-trained model into a MoE instruction model. Specifically, we first point out that intermediate checkpoints during instruction tuning of the dense model are naturally suitable for specialized experts, and then propose an expert expansion stage to obtain models with flexible numbers of experts, where a genetic algorithm and parameter merging are introduced to ensure sufficient diversity of new extended experts. To ensure that each specialized expert in the MoE model works as expected, we select a small amount of seed data on which each expert excels to pre-optimize the router. Extensive experiments with various data scales and upcycling settings demonstrate the outstanding performance and data efficiency of UpIT, as well as stable improvement in expert or data scaling. Further analysis reveals the importance of ensuring expert diversity in upcycling.

[AI-38] Automated Red Teaming with GOAT: the Generative Offensive Agent Tester

Link: https://arxiv.org/abs/2410.01606
Authors: Maya Pavlova, Erik Brinkman, Krithika Iyer, Vitor Albiero, Joanna Bitton, Hailey Nguyen, Joe Li, Cristian Canton Ferrer, Ivan Evtimov, Aaron Grattafiori
Keywords: violates norms, safety training, assesses how large, produce content, content that violates
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Red teaming assesses how large language models (LLMs) can produce content that violates norms, policies, and rules set during their safety training. However, most existing automated methods in the literature are not representative of the way humans tend to interact with AI models. Common users of AI models may not have advanced knowledge of adversarial machine learning methods or access to model internals, and they do not spend a lot of time crafting a single highly effective adversarial prompt. Instead, they are likely to make use of techniques commonly shared online and exploit the multiturn conversational nature of LLMs. While manual testing addresses this gap, it is an inefficient and often expensive process. To address these limitations, we introduce the Generative Offensive Agent Tester (GOAT), an automated agentic red teaming system that simulates plain language adversarial conversations while leveraging multiple adversarial prompting techniques to identify vulnerabilities in LLMs. We instantiate GOAT with 7 red teaming attacks by prompting a general-purpose model in a way that encourages reasoning through the choices of methods available, the current target model’s response, and the next steps. Our approach is designed to be extensible and efficient, allowing human testers to focus on exploring new areas of risk while automation covers the scaled adversarial stress-testing of known risk territory. We present the design and evaluation of GOAT, demonstrating its effectiveness in identifying vulnerabilities in state-of-the-art LLMs, with an ASR@10 of 97% against Llama 3.1 and 88% against GPT-4 on the JailbreakBench dataset.

[AI-39] Elaborative Subtopic Query Reformulation for Broad and Indirect Queries in Travel Destination Recommendation RECSYS2024

Link: https://arxiv.org/abs/2410.01598
Authors: Qianfeng Wen, Yifan Liu, Joshua Zhang, George Saad, Anton Korikov, Yury Sambale, Scott Sanner
Keywords: Travel Recommender Systems, Recommender Systems, school graduation trip, Query-driven Travel Recommender, high school graduation
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 9 pages, 7 figures, The 1st Workshop on Risks, Opportunities, and Evaluation of Generative Models in Recommender Systems (ROEGEN@RecSys 2024), October 2024, Bari, Italy

Abstract:In Query-driven Travel Recommender Systems (RSs), it is crucial to understand the user intent behind challenging natural language (NL) destination queries such as the broadly worded “youth-friendly activities” or the indirect description “a high school graduation trip”. Such queries are challenging due to the wide scope and subtlety of potential user intents that confound the ability of retrieval methods to infer relevant destinations from available textual descriptions such as WikiVoyage. While query reformulation (QR) has proven effective in enhancing retrieval by addressing user intent, existing QR methods tend to focus only on expanding the range of potentially matching query subtopics (breadth) or elaborating on the potential meaning of a query (depth), but not both. In this paper, we introduce Elaborative Subtopic Query Reformulation (EQR), a large language model-based QR method that combines both breadth and depth by generating potential query subtopics with information-rich elaborations. We also release TravelDest, a novel dataset for query-driven travel destination RSs. Experiments on TravelDest show that EQR achieves significant improvements in recall and precision over existing state-of-the-art QR methods.

[AI-40] KnobGen: Controlling the Sophistication of Artwork in Sketch-Based Diffusion Models

Link: https://arxiv.org/abs/2410.01595
Authors: Pouyan Navard, Amin Karimi Monsefi, Mengxi Zhou, Wei-Lun Chao, Alper Yilmaz, Rajiv Ramnath
Keywords: Recent advances, balance fine-grained precision, significantly improved, advances in diffusion, diffusion models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Recent advances in diffusion models have significantly improved text-to-image (T2I) generation, but they often struggle to balance fine-grained precision with high-level control. Methods like ControlNet and T2I-Adapter excel at following sketches by seasoned artists but tend to be overly rigid, replicating unintentional flaws in sketches from novice users. Meanwhile, coarse-grained methods, such as sketch-based abstraction frameworks, offer more accessible input handling but lack the precise control needed for detailed, professional use. To address these limitations, we propose KnobGen, a dual-pathway framework that democratizes sketch-based image generation by seamlessly adapting to varying levels of sketch complexity and user skill. KnobGen uses a Coarse-Grained Controller (CGC) module for high-level semantics and a Fine-Grained Controller (FGC) module for detailed refinement. The relative strength of these two modules can be adjusted through our knob inference mechanism to align with the user’s specific needs. These mechanisms ensure that KnobGen can flexibly generate images from both novice sketches and those drawn by seasoned artists. This maintains control over the final output while preserving the natural appearance of the image, as evidenced on the MultiGen-20M dataset and a newly collected sketch dataset.

[AI-41] Iterated Local Search with Linkage Learning

Link: https://arxiv.org/abs/2410.01583
Authors: Renato Tinós, Michal W. Przewozniczek, Darrell Whitley, Francisco Chicano
Keywords: variable interaction graph, variable interaction, weighted variable interaction, interaction graph, interaction
Categories: Artificial Intelligence (cs.AI)

Abstract:In pseudo-Boolean optimization, a variable interaction graph represents variables as vertices, and interactions between pairs of variables as edges. In black-box optimization, the variable interaction graph may be at least partially discovered by using empirical linkage learning techniques. These methods never report false variable interactions, but they are computationally expensive. The recently proposed local search with linkage learning discovers the partial variable interaction graph as a side-effect of iterated local search. However, information about the strength of the interactions is not learned by the algorithm. We propose local search with linkage learning 2, which builds a weighted variable interaction graph that stores information about the strength of the interaction between variables. The weighted variable interaction graph can provide new insights about the optimization problem and behavior of optimizers. Experiments with NK landscapes, knapsack problem, and feature selection show that local search with linkage learning 2 is able to efficiently build weighted variable interaction graphs. In particular, experiments with feature selection show that the weighted variable interaction graphs can be used for visualizing the feature interactions in machine learning. Additionally, new transformation operators that exploit the interactions between variables can be designed. We illustrate this ability by proposing a new perturbation operator for iterated local search.

[AI-42] Spoken Grammar Assessment Using LLM

Link: https://arxiv.org/abs/2410.01579
Authors: Sunil Kumar Kopparapu, Chitralekha Bhat, Ashish Panda
Keywords: evaluating the pronunciation, pronunciation and oral, oral fluency, speaker by analysing, analysing the read
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 5 pages, 2 figures

Abstract:Spoken language assessment (SLA) systems restrict themselves to evaluating the pronunciation and oral fluency of a speaker by analysing the read and spontaneous spoken utterances respectively. The assessment of language grammar or vocabulary is relegated to written language assessment (WLA) systems. Most WLA systems present a set of sentences from a curated finite-size database of sentences thereby making it possible to anticipate the test questions and train oneself. In this paper, we propose a novel end-to-end SLA system to assess language grammar from spoken utterances thus making WLA systems redundant; additionally, we make the assessment largely unteachable by employing a large language model (LLM) to bring in variations in the test. We further demonstrate that a hybrid automatic speech recognition (ASR) with a custom-built language model outperforms the state-of-the-art ASR engine for spoken grammar assessment.

[AI-43] Computing Ex Ante Equilibrium in Heterogeneous Zero-Sum Team Games

Link: https://arxiv.org/abs/2410.01575
Authors: Naming Liu, Mingzhi Wang, Xihuai Wang, Weinan Zhang, Yaodong Yang, Youzhi Zhang, Bo An, Ying Wen
Keywords: ante equilibrium, Space Response Oracle, team policy space, heterogeneous team games, team
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)

Abstract:The ex ante equilibrium for two-team zero-sum games, where agents within each team collaborate to compete against the opposing team, is known to be the best a team can do for coordination. Many existing works on ex ante equilibrium solutions aim to extend the scope of ex ante equilibrium solving to large-scale team games based on Policy Space Response Oracle (PSRO). However, the joint team policy space constructed by the most prominent method, Team PSRO, cannot cover the entire team policy space in heterogeneous team games where teammates play distinct roles. Such insufficient policy expressiveness causes Team PSRO to be trapped in a sub-optimal ex ante equilibrium with significantly higher exploitability and to never converge to the global ex ante equilibrium. To find the global ex ante equilibrium without introducing additional computational complexity, we first parameterize heterogeneous policies for teammates, and we prove that optimizing the heterogeneous teammates’ policies sequentially can guarantee a monotonic improvement in team rewards. We further propose Heterogeneous-PSRO (H-PSRO), a novel framework for heterogeneous team games, which integrates the sequential correlation mechanism into the PSRO framework and serves as the first PSRO framework for heterogeneous team games. We prove that H-PSRO achieves lower exploitability than Team PSRO in heterogeneous team games. Empirically, H-PSRO achieves convergence in matrix heterogeneous games that are unsolvable by non-heterogeneous baselines. Further experiments reveal that H-PSRO outperforms non-heterogeneous baselines in both heterogeneous team games and homogeneous settings.

[AI-44] OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data

Link: https://arxiv.org/abs/2410.01560
Authors: Shubham Toshniwal, Wei Du, Ivan Moshkov, Branislav Kisacanin, Alexan Ayrapetyan, Igor Gitman
Keywords: Mathematical reasoning continues, large language model, Mathematical reasoning, development with significant, significant interest
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Mathematical reasoning continues to be a critical challenge in large language model (LLM) development with significant interest. However, most of the cutting-edge progress in mathematical reasoning with LLMs has become closed-source due to lack of access to training data. This lack of data access limits researchers from understanding the impact of different choices for synthesizing and utilizing the data. With the goal of creating a high-quality finetuning (SFT) dataset for math reasoning, we conduct careful ablation experiments on data synthesis using the recently released Llama3.1 family of models. Our experiments show that: (a) solution format matters, with excessively verbose solutions proving detrimental to SFT performance, (b) data generated by a strong teacher outperforms on-policy data generated by a weak student model, (c) SFT is robust to low-quality solutions, allowing for imprecise data filtering, and (d) question diversity is crucial for achieving data scaling gains. Based on these insights, we create the OpenMathInstruct-2 dataset, which consists of 14M question-solution pairs (≈600K unique questions), making it nearly eight times larger than the previous largest open-source math reasoning dataset. Finetuning the Llama-3.1-8B-Base using OpenMathInstruct-2 outperforms Llama3.1-8B-Instruct on MATH by an absolute 15.9% (51.9% → 67.8%). Finally, to accelerate the open-source efforts, we release the code, the finetuned models, and the OpenMathInstruct-2 dataset under a commercially permissive license.

[AI-45] Integrative Decoding: Improve Factuality via Implicit Self-consistency

Link: https://arxiv.org/abs/2410.01556
Authors: Yi Cheng, Xiao Liang, Yeyun Gong, Wen Xiao, Song Wang, Yuji Zhang, Wenjun Hou, Kaishuai Xu, Wenge Liu, Wenjie Li, Jian Jiao, Qi Chen, Peng Cheng, Wayne Xiong
Keywords: involve repeatedly sampling, repeatedly sampling multiple, sampling multiple outputs, large language models, involve repeatedly
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Self-consistency-based approaches, which involve repeatedly sampling multiple outputs and selecting the most consistent one as the final response, prove to be remarkably effective in improving the factual accuracy of large language models. Nonetheless, existing methods usually have strict constraints on the task format, largely limiting their applicability. In this paper, we present Integrative Decoding (ID), to unlock the potential of self-consistency in open-ended generation tasks. ID operates by constructing a set of inputs, each prepended with a previously sampled response, and then processes them concurrently, with the next token being selected by aggregating all their corresponding predictions at each decoding step. In essence, this simple approach implicitly incorporates self-consistency in the decoding objective. Extensive evaluation shows that ID consistently enhances factuality over a wide range of language models, with substantial improvements on the TruthfulQA (+11.2%), Biographies (+15.4%) and LongFact (+8.5%) benchmarks. The performance gains amplify progressively as the number of sampled responses increases, indicating the potential of ID to scale up with repeated sampling.
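
A single decoding step of the aggregation idea can be sketched as follows. Each input (one per previously sampled response) yields a next-token distribution, and the step picks the token with the best combined score. The abstract does not specify the aggregation rule, so summing log-probabilities here is an assumption, and `integrate_step` and the toy distributions are hypothetical.

```python
from math import log


def integrate_step(step_distributions):
    """Pick the next token by aggregating predictions from several inputs.

    step_distributions: list of dicts mapping token -> probability (> 0),
    one dict per input. Aggregation by summed log-probability is an
    illustrative assumption.
    """
    # Only score tokens that every input assigns a probability to.
    vocab = set.intersection(*(set(d) for d in step_distributions))
    scores = {
        tok: sum(log(d[tok]) for d in step_distributions) for tok in vocab
    }
    return max(scores, key=scores.get)


# Toy example: the third input alone would prefer "Lyon",
# but the aggregate across all three inputs favours "Paris".
dists = [
    {"Paris": 0.6, "Lyon": 0.3, "Rome": 0.1},
    {"Paris": 0.5, "Lyon": 0.1, "Rome": 0.4},
    {"Paris": 0.4, "Lyon": 0.5, "Rome": 0.1},
]
next_token = integrate_step(dists)
print(next_token)
```

Because the choice is made token by token rather than by comparing finished answers, this kind of aggregation does not require the task to have a short, exactly matchable answer format.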

[AI-46] MedQA-CS: Benchmarking Large Language Models Clinical Skills Using an AI-SCE Framework

Link: https://arxiv.org/abs/2410.01553
Authors: Zonghai Yao, Zihao Zhang, Chaolong Tang, Xingyu Bian, Youxia Zhao, Zhichao Yang, Junda Wang, Huixue Zhou, Won Seok Jang, Feiyun Ouyang, Hong Yu
Keywords: large language models, healthcare require advanced, Artificial intelligence, require advanced clinical, Structured Clinical Examinations
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Artificial intelligence (AI) and large language models (LLMs) in healthcare require advanced clinical skills (CS), yet current benchmarks fail to evaluate these comprehensively. We introduce MedQA-CS, an AI-SCE framework inspired by medical education’s Objective Structured Clinical Examinations (OSCEs), to address this gap. MedQA-CS evaluates LLMs through two instruction-following tasks, LLM-as-medical-student and LLM-as-CS-examiner, designed to reflect real clinical scenarios. Our contributions include developing MedQA-CS, a comprehensive evaluation framework with publicly available data and expert annotations, and providing the quantitative and qualitative assessment of LLMs as reliable judges in CS evaluation. Our experiments show that MedQA-CS is a more challenging benchmark for evaluating clinical skills than traditional multiple-choice QA benchmarks (e.g., MedQA). Combined with existing benchmarks, MedQA-CS enables a more comprehensive evaluation of LLMs’ clinical capabilities for both open- and closed-source LLMs.

[AI-47] Edge-preserving noise for diffusion models

Link: https://arxiv.org/abs/2410.01540
Authors: Jente Vandersanden, Sascha Holl, Xingchang Huang, Gurprit Singh
Keywords: spatial regions uniformly, neglecting potentially valuable, Classical generative diffusion, potentially valuable structural, isotropic Gaussian denoising
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)

Abstract:Classical generative diffusion models learn an isotropic Gaussian denoising process, treating all spatial regions uniformly, thus neglecting potentially valuable structural information in the data. Inspired by the long-established work on anisotropic diffusion in image processing, we present a novel edge-preserving diffusion model that is a generalization of denoising diffusion probabilistic models (DDPM). In particular, we introduce an edge-aware noise scheduler that varies between edge-preserving and isotropic Gaussian noise. We show that our model’s generative process converges faster to results that more closely match the target distribution. We demonstrate its capability to better learn the low-to-mid frequencies within the dataset, which play a crucial role in representing shapes and structural information. Our edge-preserving diffusion process consistently outperforms state-of-the-art baselines in unconditional image generation. It is also more robust for generative tasks guided by a shape-based prior, such as stroke-to-image generation. We present qualitative and quantitative results showing consistent improvements (FID score) of up to 30% for both tasks.

[AI-48] Seeing Eye to AI: Human Alignment via Gaze-Based Response Rewards for Large Language Models

Link: https://arxiv.org/abs/2410.01532
Authors: Angela Lopez-Cardona, Carlos Segura, Alexandros Karatzoglou, Sergi Abadal, Ioannis Arapakis
Keywords: Natural Language Processing, Large Language Models, Language Processing, Advancements in Natural, Natural Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Advancements in Natural Language Processing (NLP) have led to the emergence of Large Language Models (LLMs) such as GPT, Llama, Claude, and Gemini, which excel across a range of tasks but require extensive fine-tuning to align their outputs with human expectations. A widely used method for achieving this alignment is Reinforcement Learning from Human Feedback (RLHF), which, despite its success, faces challenges in accurately modelling human preferences. In this paper, we introduce GazeReward, a novel framework that integrates implicit feedback – and specifically eye-tracking (ET) data – into the Reward Model (RM). In addition, we explore how ET-based features can provide insights into user preferences. Through ablation studies we test our framework with different integration methods, LLMs, and ET generator models, demonstrating that our approach significantly improves the accuracy of the RM on established human preference datasets. This work advances the ongoing discussion on optimizing AI alignment with human values, exploring the potential of cognitive data for shaping future NLP research.

[AI-49] TiVaT: Joint-Axis Attention for Time Series Forecasting with Lead-Lag Dynamics

Link: https://arxiv.org/abs/2410.01531
Authors: Junwoo Ha, Hyukjae Kwon, Sungsoo Kim, Kisu Lee, Ha Young Kim
Keywords: real-world applications, Multivariate time series, plays a crucial, crucial role, temporal and inter-variable
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 5 figures

Abstract:Multivariate time series (MTS) forecasting plays a crucial role in various real-world applications, yet simultaneously capturing both temporal and inter-variable dependencies remains a challenge. Conventional Channel-Dependent (CD) models handle these dependencies separately, limiting their ability to model complex interactions such as lead-lag dynamics. To address these limitations, we propose TiVaT (Time-Variable Transformer), a novel architecture that integrates temporal and variate dependencies through its Joint-Axis (JA) attention mechanism. TiVaT’s ability to capture intricate variate-temporal dependencies, including asynchronous interactions, is further enhanced by the incorporation of Distance-aware Time-Variable (DTV) Sampling, which reduces noise and improves accuracy through a learned 2D map that focuses on key interactions. TiVaT effectively models both temporal and variate dependencies, consistently delivering strong performance across diverse datasets. Notably, it excels in capturing complex patterns within multivariate time series, enabling it to surpass or remain competitive with state-of-the-art methods. This positions TiVaT as a new benchmark in MTS forecasting, particularly in handling datasets characterized by intricate and challenging dependencies.

[AI-50] InstaTrans: An Instruction-Aware Translation Framework for Non-English Instruction Datasets

Link: https://arxiv.org/abs/2410.01512
Authors: Yungi Kim, Chanjun Park
Keywords: frequently observed data, generate high-quality instruction, non-English languages due, high-quality English instruction, tail phenomena
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:It is challenging to generate high-quality instruction datasets for non-English languages due to tail phenomena, which limit performance on less frequently observed data. To mitigate this issue, we propose translating existing high-quality English instruction datasets as a solution, emphasizing the need for complete and instruction-aware translations to maintain the inherent attributes of these datasets. We claim that fine-tuning LLMs with datasets translated in this way can improve their performance in the target language. To this end, we introduce a new translation framework tailored for instruction datasets, named InstaTrans (INSTruction-Aware TRANSlation). Through extensive experiments, we demonstrate the superiority of InstaTrans over other competitors in terms of completeness and instruction-awareness of translation, highlighting its potential to broaden the accessibility of LLMs across diverse languages at a relatively low cost. Furthermore, we have validated that fine-tuning LLMs with datasets translated by InstaTrans can effectively improve their performance in the target language.

[AI-51] LEGO: Learnable Expansion of Graph Operators for Multi-Modal Feature Fusion

Link: https://arxiv.org/abs/2410.01506
Authors: Dexuan Ding, Lei Wang, Liyun Zhu, Tom Gedeon, Piotr Koniusz
Keywords: computer vision tasks, diverse representations, computer vision, vision tasks, fusion
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Research paper

Abstract:In computer vision tasks, features often come from diverse representations, domains, and modalities, such as text, images, and videos. Effectively fusing these features is essential for robust performance, especially with the availability of powerful pre-trained models like vision-language models. However, common fusion methods, such as concatenation, element-wise operations, and non-linear techniques, often fail to capture structural relationships, deep feature interactions, and suffer from inefficiency or misalignment of features across domains. In this paper, we shift from high-dimensional feature space to a lower-dimensional, interpretable graph space by constructing similarity graphs that encode feature relationships at different levels, e.g., clip, frame, patch, token, etc. To capture deeper interactions, we use graph power expansions and introduce a learnable graph fusion operator to combine these graph powers for more effective fusion. Our approach is relationship-centric, operates in a homogeneous space, and is mathematically principled, resembling element-wise similarity score aggregation via multilinear polynomials. We demonstrate the effectiveness of our graph-based fusion method on video anomaly detection, showing strong performance across multi-representational, multi-modal, and multi-domain feature fusion tasks.
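
To make the idea of combining graph powers more concrete, here is a minimal sketch. It assumes the fusion is a plain weighted sum of the powers of a similarity matrix; the abstract does not specify the operator, so the function name `fuse_graph_powers` and the fixed coefficients (which would be learned in practice) are illustrative assumptions, not the authors' method.

```python
def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fuse_graph_powers(adj, coeffs):
    """Return sum_k coeffs[k-1] * adj^k for k = 1..len(coeffs).

    Hypothetical stand-in for a learnable fusion operator: here the
    coefficients are fixed numbers rather than trained parameters.
    """
    n = len(adj)
    # Start from the identity so the first multiplication yields adj^1.
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    fused = [[0.0] * n for _ in range(n)]
    for c in coeffs:
        power = matmul(power, adj)
        fused = [[f + c * p for f, p in zip(f_row, p_row)]
                 for f_row, p_row in zip(fused, power)]
    return fused


# Toy 3-node path graph: node 0 - node 1 - node 2.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
fused = fuse_graph_powers(adj, [1.0, 0.5])
```

In this toy graph, nodes 0 and 2 are not directly connected, yet the fused matrix gives them a nonzero entry because the second power counts the two-step path through node 1. This is the kind of indirect relationship a first-order similarity graph would miss.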

[AI-52] Discrete Diffusion Schrödinger Bridge Matching for Graph Transformation

Link: https://arxiv.org/abs/2410.01500
Authors: Jun Hyeong Kim, Seonghwan Kim, Seokhyun Moon, Hyeongwoo Kim, Jeheon Woo, Woo Youn Kim
Keywords: Transporting between arbitrary, generative modeling, fundamental goal, goal in generative, Schrödinger Bridge Matching
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Transporting between arbitrary distributions is a fundamental goal in generative modeling. Recently proposed diffusion bridge models provide a potential solution, but they rely on a joint distribution that is difficult to obtain in practice. Furthermore, formulations based on continuous domains limit their applicability to discrete domains such as graphs. To overcome these limitations, we propose Discrete Diffusion Schrödinger Bridge Matching (DDSBM), a novel framework that utilizes continuous-time Markov chains to solve the SB problem in a high-dimensional discrete state space. Our approach extends Iterative Markovian Fitting to discrete domains, and we have proved its convergence to the SB. Furthermore, we adapt our framework for the graph transformation and show that our design choice of underlying dynamics characterized by independent modifications of nodes and edges can be interpreted as the entropy-regularized version of optimal transport with a cost function described by the graph edit distance. To demonstrate the effectiveness of our framework, we have applied DDSBM to molecular optimization in the field of chemistry. Experimental results demonstrate that DDSBM effectively optimizes molecules’ property-of-interest with minimal graph transformation, successfully retaining other features.

[AI-53] DLP-LoRA: Efficient Task-Specific LoRA Fusion with a Dynamic Lightweight Plugin for Large Language Models

Link: https://arxiv.org/abs/2410.01497
Authors: Yuxuan Zhang, Ruizhe Li
Keywords: Large Language Models, Large Language, domains remains resource-intensive, Language Models, specific domains remains
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint under review, 18 pages, 7 figures

Abstract:Recent advancements in Large Language Models (LLMs) have achieved robust performance across diverse tasks, but fine-tuning these models for specific domains remains resource-intensive. Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) address this challenge by fine-tuning a small subset of parameters. However, existing methods for fusing multiple LoRAs lack dynamic fusion based on contextual inputs and often increase inference time due to token-level operations. We propose DLP-LoRA, a Dynamic Lightweight Plugin that employs a mini-MLP module with only 5M parameters to dynamically fuse multiple LoRAs at the sentence level using top-p sampling strategies. This approach reduces inference time to less than twice that of single LoRA inference by leveraging parallel computation. Evaluations across 26 tasks, including multiple-choice questions and question answering, demonstrate that DLP-LoRA achieves an average accuracy of 92.34% on multiple-choice datasets and significant improvements in BLEU and ROUGE scores on QA datasets, outperforming different LLM backbones under composite task settings. DLP-LoRA effectively balances performance and efficiency, making it a practical solution for dynamic multi-task adaptation in LLMs. Our code is available at this https URL.
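
As a rough illustration of top-p selection applied to adapter fusion, the sketch below keeps the smallest set of adapters whose cumulative probability reaches `p` and renormalizes their weights. The function name `top_p_select`, the example probabilities, and the idea of using the renormalized values directly as fusion weights are assumptions made for illustration; the abstract does not describe the exact procedure.

```python
def top_p_select(probs, p=0.9):
    """Keep the highest-probability adapters until their mass reaches p.

    probs: per-adapter probabilities (e.g. from a small classifier).
    Returns a dict mapping adapter index -> renormalized fusion weight.
    """
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    selected, cumulative = [], 0.0
    for index, prob in ranked:
        selected.append((index, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in selected)
    return {index: prob / total for index, prob in selected}


# Hypothetical sentence-level scores over four task-specific LoRAs.
weights = top_p_select([0.6, 0.3, 0.08, 0.02], p=0.85)
```

With these scores only the first two adapters are kept, with weights of roughly 2/3 and 1/3, so the two low-probability adapters contribute nothing for that sentence.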

[AI-54] SonicSim: A customizable simulation platform for speech processing in moving sound source scenarios

Link: https://arxiv.org/abs/2410.01481
Authors: Kai Li, Wendi Sang, Chang Zeng, Runxuan Yang, Guo Chen, Xiaolin Hu
Keywords: conditions typically requires, typically requires extensive, requires extensive data, extensive data comprising, source conditions typically
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Technical report

Abstract:The systematic evaluation of speech separation and enhancement models under moving sound source conditions typically requires extensive data comprising diverse scenarios. However, real-world datasets often contain insufficient data to meet the training and evaluation requirements of models. Although synthetic datasets offer a larger volume of data, their acoustic simulations lack realism. Consequently, neither real-world nor synthetic datasets effectively fulfill practical needs. To address these issues, we introduce SonicSim, a synthetic toolkit designed to generate highly customizable data for moving sound sources. SonicSim is developed based on the embodied AI simulation platform, Habitat-sim, supporting multi-level adjustments, including scene-level, microphone-level, and source-level, thereby generating more diverse synthetic data. Leveraging SonicSim, we constructed a moving sound source benchmark dataset, SonicSet, using the Librispeech, the Freesound Dataset 50k (FSD50K) and Free Music Archive (FMA), and 90 scenes from the Matterport3D to evaluate speech separation and enhancement models. Additionally, to validate the differences between synthetic data and real-world data, we randomly selected 5 hours of raw data without reverberation from the SonicSet validation set to record a real-world speech separation dataset, which was then compared with the corresponding synthetic datasets. Similarly, we utilized the real-world speech enhancement dataset RealMAN to validate the acoustic gap between other synthetic datasets and the SonicSet dataset for speech enhancement. The results indicate that the synthetic data generated by SonicSim can effectively generalize to real-world scenarios. Demo and code are publicly available at this https URL.

[AI-55] Peeling Back the Layers: An In-Depth Evaluation of Encoder Architectures in Neural News Recommenders RECSYS2024

Link: https://arxiv.org/abs/2410.01470
Authors: Andreea Iana, Goran Glavaš, Heiko Paulheim
Keywords: Encoder architectures play, user encoder architectures, Encoder architectures, play a pivotal, pivotal role
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted at the 12th International Workshop on News Recommendation and Analytics (INRA 2024) in conjunction with ACM RecSys 2024

Abstract:Encoder architectures play a pivotal role in neural news recommenders by embedding the semantic and contextual information of news and users. Thus, research has heavily focused on enhancing the representational capabilities of news and user encoders to improve recommender performance. Despite the significant impact of encoder architectures on the quality of news and user representations, existing analyses of encoder designs focus only on the overall downstream recommendation performance. This offers a one-sided assessment of the encoders’ similarity, ignoring more nuanced differences in their behavior, and potentially resulting in sub-optimal model selection. In this work, we perform a comprehensive analysis of encoder architectures in neural news recommender systems. We systematically evaluate the most prominent news and user encoder architectures, focusing on their (i) representational similarity, measured with the Central Kernel Alignment, (ii) overlap of generated recommendation lists, quantified with the Jaccard similarity, and (iii) the overall recommendation performance. Our analysis reveals that the complexity of certain encoding techniques is often empirically unjustified, highlighting the potential for simpler, more efficient architectures. By isolating the effects of individual components, we provide valuable insights for researchers and practitioners to make better informed decisions about encoder selection and avoid unnecessary complexity in the design of news recommenders.
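
The Jaccard similarity used to quantify overlap between recommendation lists is the size of the intersection divided by the size of the union. The sketch below shows that computation on two hypothetical top-4 lists; the item IDs and variable names are invented for illustration.

```python
def jaccard_similarity(list_a, list_b):
    """Jaccard similarity of two recommendation lists, ignoring rank order."""
    set_a, set_b = set(list_a), set(list_b)
    union = set_a | set_b
    if not union:
        # Two empty lists are treated as identical by convention.
        return 1.0
    return len(set_a & set_b) / len(union)


# Hypothetical top-4 news recommendations from two different encoders.
recs_model_a = ["n1", "n2", "n3", "n4"]
recs_model_b = ["n3", "n4", "n5", "n6"]
overlap = jaccard_similarity(recs_model_a, recs_model_b)
```

Here the lists share two of six distinct items, giving a similarity of 1/3. Two models can therefore reach similar aggregate accuracy while recommending largely different articles.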

[AI-56] TIGER: Time-frequency Interleaved Gain Extraction and Reconstruction for Efficient Speech Separation

Link: https://arxiv.org/abs/2410.01469
Authors: Mohan Xu, Kai Li, Guo Chen, Xiaolin Hu
Keywords: Time-frequency Interleaved Gain, Interleaved Gain Extraction, recent years, speech separation research, speech separation
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Technical report, demo page: this https URL

Abstract:In recent years, much speech separation research has focused primarily on improving model performance. However, for low-latency speech processing systems, high efficiency is equally important. Therefore, we propose a speech separation model with significantly reduced parameters and computational costs: Time-frequency Interleaved Gain Extraction and Reconstruction network (TIGER). TIGER leverages prior knowledge to divide frequency bands and compresses frequency information. We employ a multi-scale selective attention module to extract contextual features, while introducing a full-frequency-frame attention module to capture both temporal and frequency contextual information. Additionally, to more realistically evaluate the performance of speech separation models in complex acoustic environments, we introduce a dataset called EchoSet. This dataset includes noise and more realistic reverberation (e.g., considering object occlusions and material properties), with speech from two speakers overlapping at random proportions. Experimental results showed that models trained on EchoSet had better generalization ability than those trained on other datasets to the data collected in the physical world, which validated the practical value of the EchoSet. On EchoSet and real-world data, TIGER significantly reduces the number of parameters by 94.3% and the MACs by 95.3% while achieving performance surpassing state-of-the-art (SOTA) model TF-GridNet. This is the first speech separation model with fewer than 1 million parameters that achieves performance comparable to the SOTA model.

[AI-57] From Reward Shaping to Q-Shaping: Achieving Unbiased Learning with LLM-Guided Knowledge

Link: https://arxiv.org/abs/2410.01458
Authors: Xiefeng Wu
Keywords: accelerate agent training, incorporating domain knowledge, directly shaping Q-values, Q-value initialization, agent training
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: q-shaping, reinforcement learning, reward shaping

Abstract:Q-shaping is an extension of Q-value initialization and serves as an alternative to reward shaping for incorporating domain knowledge to accelerate agent training, thereby improving sample efficiency by directly shaping Q-values. This approach is both general and robust across diverse tasks, allowing for immediate impact assessment while guaranteeing optimality. We evaluated Q-shaping across 20 different environments using a large language model (LLM) as the heuristic provider. The results demonstrate that Q-shaping significantly enhances sample efficiency, achieving a 16.87% improvement over the best baseline in each environment and a 253.80% improvement compared to LLM-based reward shaping methods. These findings establish Q-shaping as a superior and unbiased alternative to conventional reward shaping in reinforcement learning.

[AI-58] Agent-Driven Large Language Models for Mandarin Lyric Generation

Link: https://arxiv.org/abs/2410.01450
Authors: Hong-Hsiang Liu, Yi-Wen Liu
Keywords: Generative Large Language, Generative Large, in-context learning abilities, shown impressive in-context, impressive in-context learning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages, figures, Accepted at O-COCOSDA 2024

Abstract:Generative Large Language Models have shown impressive in-context learning abilities, performing well across various tasks with just a prompt. Previous melody-to-lyric research has been limited by scarce high-quality aligned data and unclear standards for creativity. Most efforts focused on general themes or emotions, which are less valuable given current language model capabilities. In tonal contour languages like Mandarin, pitch contours are influenced by both melody and tone, leading to variations in lyric-melody fit. Our study, validated by the Mpop600 dataset, confirms that lyricists and melody writers consider this fit during their composition process. In this research, we developed a multi-agent system that decomposes the melody-to-lyric task into sub-tasks, with each agent controlling rhyme, syllable count, lyric-melody alignment, and consistency. Listening tests were conducted via a diffusion-based singing voice synthesizer to evaluate the quality of lyrics generated by different agent groups.

[AI-59] Geometric Signatures of Compositionality Across a Language Model's Lifetime ICLR2025

Link: https://arxiv.org/abs/2410.01444
Authors: Jin Hwa Lee, Thomas Jiralerspong, Lei Yu, Yoshua Bengio, Emily Cheng
Keywords: syntactic rules, permits the infinite, expression is constructed, parts and syntactic, infinite productivity
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: Under review as a conference paper at ICLR 2025

Abstract:Compositionality, the notion that the meaning of an expression is constructed from the meaning of its parts and syntactic rules, permits the infinite productivity of human language. For the first time, artificial language models (LMs) are able to match human performance in a number of compositional generalization tasks. However, much remains to be understood about the representational mechanisms underlying these abilities. We take a high-level geometric approach to this problem by relating the degree of compositionality in a dataset to the intrinsic dimensionality of its representations under an LM, a measure of feature complexity. We find not only that the degree of dataset compositionality is reflected in representations’ intrinsic dimensionality, but that the relationship between compositionality and geometric complexity arises due to learned linguistic features over training. Finally, our analyses reveal a striking contrast between linear and nonlinear dimensionality, showing that they respectively encode formal and semantic aspects of linguistic composition.

[AI-60] Fair4Free: Generating High-fidelity Fair Synthetic Samples using Data Free Distillation

Link: https://arxiv.org/abs/2410.01423
Authors: Md Fahim Sikder, Daniel de Leng, Fredrik Heintz
Keywords: latent space, work presents, student model, model, generative model
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:This work presents Fair4Free, a novel generative model to generate synthetic fair data using data-free distillation in the latent space. Fair4Free works in situations where the data is private or inaccessible. In our approach, we first train a teacher model to create fair representation and then distil the knowledge to a student model (using a smaller architecture). The process of distilling the student model is data-free, i.e. the student model does not have access to the training dataset while distilling. After the distillation, we use the distilled model to generate fair synthetic samples. Our extensive experiments show that our synthetic samples outperform state-of-the-art models in all three criteria (fairness, utility and synthetic quality) with a performance increase of 5% for fairness, 8% for utility and 12% in synthetic quality for both tabular and image datasets.

[AI-61] The Labyrinth of Links: Navigating the Associative Maze of Multi-modal LLMs

Link: https://arxiv.org/abs/2410.01417
Authors: Hong Li, Nanxi Li, Yuanjie Chen, Jianbin Zhu, Qinlu Guo, Cewu Lu, Yong-Lu Li
Keywords: Multi-modal Large Language, Large Language Models, Multi-modal Large, Large Language, Language Models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Multi-modal Large Language Models (MLLMs) have exhibited impressive capability. However, recently many deficiencies of MLLMs have been found compared to human intelligence, e.g., hallucination. To drive the MLLMs study, the community dedicated efforts to building larger benchmarks with complex tasks. In this paper, we propose benchmarking an essential but usually overlooked intelligence: association, a human’s basic capability to link observation and prior practice memory. To comprehensively investigate MLLM’s performance on the association, we formulate the association task and devise a standard benchmark based on adjective and verb semantic concepts. Instead of costly data annotation and curation, we propose a convenient annotation-free construction method transforming the general dataset for our association tasks. Simultaneously, we devise a rigorous data refinement process to eliminate confusion in the raw dataset. Building on this database, we establish three levels of association tasks: single-step, synchronous, and asynchronous associations. Moreover, we conduct a comprehensive investigation into the MLLMs’ zero-shot association capabilities, addressing multiple dimensions, including three distinct memory strategies, both open-source and closed-source MLLMs, cutting-edge Mixture-of-Experts (MoE) models, and the involvement of human experts. Our systematic investigation shows that current open-source MLLMs consistently exhibit poor capability in our association tasks, and even the current state-of-the-art GPT-4V(vision) has a significant gap compared to humans. We believe our benchmark would pave the way for future MLLM studies. Our data and code are available at: this https URL.

[AI-62] Improving Fuzzy Rule Classifier with Brain Storm Optimization and Rule Modification

Link: https://arxiv.org/abs/2410.01413
Authors: Yan Huang, Wei Liu, Xiaogang Zang
Keywords: adversely affect inductive, affect inductive learning, Brain Storm Optimization, fuzzy rule classifiers, expanding complexity
Categories: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 9 pages, 8 figures

Abstract:The expanding complexity and dimensionality in the search space can adversely affect inductive learning in fuzzy rule classifiers, thus impacting the scalability and accuracy of fuzzy systems. This research specifically addresses the challenge of diabetic classification by employing the Brain Storm Optimization (BSO) algorithm to propose a novel fuzzy system that redefines rule generation for this context. An exponential model is integrated into the standard BSO algorithm to enhance rule derivation, tailored specifically for diabetes-related data. The innovative fuzzy system is then applied to classification tasks involving diabetic datasets, demonstrating a substantial improvement in classification accuracy, as evidenced by our experiments.

[AI-63] Can We Delegate Learning to Automation?: A Comparative Study of LLM Chatbots, Search Engines, and Books

Link: https://arxiv.org/abs/2410.01396
Authors: Yeonsun Yang, Ahyeon Shin, Mincheol Kang, Jiheon Kang, Jean Young Song
Keywords: motivator behind information, Learning, Abstract, key motivator, search
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 21 pages, 14 figures

Abstract:Learning is a key motivator behind information search behavior. With the emergence of LLM-based chatbots, students are increasingly turning to these tools as their primary resource for acquiring knowledge. However, the transition from traditional resources like textbooks and web searches raises concerns among educators. They worry that these fully-automated LLMs might lead students to delegate critical steps of search as learning. In this paper, we systematically uncover three main concerns from educators’ perspectives. In response to these concerns, we conducted a mixed-methods study with 92 university students to compare three learning sources with different automation levels. Our results show that LLMs support comprehensive understanding of key concepts without promoting passive learning, though their effectiveness in knowledge retention was limited. Additionally, we found that academic performance impacted both learning outcomes and search patterns. Notably, higher-competence learners engaged more deeply with content through reading-intensive behaviors rather than relying on search activities.

[AI-64] FLAME: Adaptive and Reactive Concept Drift Mitigation for Federated Learning Deployments

Link: https://arxiv.org/abs/2410.01386
Authors: Ioannis Mavromatis, Stefano De Feo, Aftab Khan
Keywords: presents Federated Learning, paper presents Federated, Federated Learning, Internet of Things, Monitoring and Elimination
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted for Publication at EMERGE Workshop - EWSN 2024

Abstract:This paper presents Federated Learning with Adaptive Monitoring and Elimination (FLAME), a novel solution capable of detecting and mitigating concept drift in Federated Learning (FL) Internet of Things (IoT) environments. Concept drift poses significant challenges for FL models deployed in dynamic and real-world settings. FLAME leverages an FL architecture, considers a real-world FL pipeline, and proves capable of maintaining model performance and accuracy while addressing bandwidth and privacy constraints. Introducing various features and extensions on previous works, FLAME offers a robust solution to concept drift, significantly reducing computational load and communication overhead. Compared to well-known lightweight mitigation methods, FLAME demonstrates superior performance in maintaining high F1 scores and reducing resource utilisation in large-scale IoT deployments, making it a promising approach for real-world applications.

[AI-65] Knowledge Entropy Decay during Language Model Pretraining Hinders New Knowledge Acquisition

Link: https://arxiv.org/abs/2410.01380
Authors: Jiyeon Kim, Hyunji Lee, Hyowon Cho, Joel Jang, Hyeonbin Hwang, Seungpil Won, Youbin Ahn, Dohaeng Lee, Minjoon Seo
Keywords: parametric knowledge evolves, knowledge entropy, knowledge, affects overall performance, tendency to broadly
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:In this work, we investigate how a model’s tendency to broadly integrate its parametric knowledge evolves throughout pretraining, and how this behavior affects overall performance, particularly in terms of knowledge acquisition and forgetting. We introduce the concept of knowledge entropy, which quantifies the range of memory sources the model engages with; high knowledge entropy indicates that the model utilizes a wide range of memory sources, while low knowledge entropy suggests reliance on specific sources with greater certainty. Our analysis reveals a consistent decline in knowledge entropy as pretraining advances. We also find that the decline is closely associated with a reduction in the model’s ability to acquire and retain knowledge, leading us to conclude that diminishing knowledge entropy (smaller number of active memory sources) impairs the model’s knowledge acquisition and retention capabilities. We find further support for this by demonstrating that increasing the activity of inactive memory sources enhances the model’s capacity for knowledge acquisition and retention.
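
The abstract does not give a formula for knowledge entropy. As an intuition aid only, the sketch below assumes it behaves like Shannon entropy over normalized weights on memory sources; the function name and the example weights are hypothetical and may differ from the paper's definition.

```python
import math


def entropy(weights):
    """Shannon entropy (in nats) of non-negative weights after normalization."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


# Hypothetical usage patterns over four memory sources.
broad = entropy([0.25, 0.25, 0.25, 0.25])   # evenly spread usage
narrow = entropy([0.97, 0.01, 0.01, 0.01])  # reliance on one source
```

Under this assumption, evenly spread usage gives the maximum value (log 4 for four sources), while concentrating on a single source drives the value toward zero, matching the abstract's description of high versus low knowledge entropy.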

[AI-66] Theoretical Lower Bounds for the Oven Scheduling Problem

Link: https://arxiv.org/abs/2410.01368
Authors: Francesca Da Ros, Marie-Louise Lackner, Nysret Musliu
Keywords: scheduling problem arising, batch scheduling problem, parallel batch scheduling, Oven Scheduling Problem, Scheduling Problem
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS)
Comments: arXiv admin note: text overlap with arXiv:2203.12517

Abstract:The Oven Scheduling Problem (OSP) is an NP-hard real-world parallel batch scheduling problem arising in the semiconductor industry. The objective of the problem is to schedule a set of jobs on ovens while minimizing several factors, namely total oven runtime, job tardiness, and setup costs. At the same time, it must adhere to various constraints such as oven eligibility and availability, job release dates, setup times between batches, and oven capacity limitations. The key to obtaining efficient schedules is to process compatible jobs simultaneously in batches. In this paper, we develop theoretical, problem-specific lower bounds for the OSP that can be computed very quickly. We thoroughly examine these lower bounds, evaluating their quality and exploring their integration into existing solution methods. Specifically, we investigate their contribution to exact methods and a metaheuristic local search approach using simulated annealing. Moreover, these problem-specific lower bounds enable us to assess the solution quality for large instances for which exact methods often fail to provide tight lower bounds.

[AI-67] PCQPR: Proactive Conversational Question Planning with Reflection EMNLP2024

Link: https://arxiv.org/abs/2410.01363
Authors: Shasha Guo, Lizi Liao, Jing Zhang, Cuiping Li, Hong Chen
Keywords: Conversational Question Generation, customer service, Conversational Question, enhances the interactivity, Conclusion-driven Conversational Question
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by EMNLP 2024 Main

Abstract:Conversational Question Generation (CQG) enhances the interactivity of conversational question-answering systems in fields such as education, customer service, and entertainment. However, traditional CQG, focusing primarily on the immediate context, lacks the conversational foresight necessary to guide conversations toward specified conclusions. This limitation significantly restricts their ability to achieve conclusion-oriented conversational outcomes. In this work, we redefine the CQG task as Conclusion-driven Conversational Question Generation (CCQG) by focusing on proactivity, not merely reacting to the unfolding conversation but actively steering it towards a conclusion-oriented question-answer pair. To address this, we propose a novel approach, called Proactive Conversational Question Planning with self-Refining (PCQPR). Concretely, by integrating a planning algorithm inspired by Monte Carlo Tree Search (MCTS) with the analytical capabilities of large language models (LLMs), PCQPR predicts future conversation turns and continuously refines its questioning strategies. This iterative self-refining mechanism ensures the generation of contextually relevant questions strategically devised to reach a specified outcome. Our extensive evaluations demonstrate that PCQPR significantly surpasses existing CQG methods, marking a paradigm shift towards conclusion-oriented conversational question-answering systems.

[AI-68] Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion?

Link: https://arxiv.org/abs/2410.01353
Authors: Zhenyu Pan, Rongyu Cao, Yongchang Cao, Yingwei Ma, Binhua Li, Fei Huang, Han Liu, Yongbin Li
Keywords: key downstream task, enhancing developer productivity, Code completion, code completion tool, key downstream
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:Code completion, a key downstream task in code generation, is one of the most frequent and impactful methods for enhancing developer productivity in software development. As intelligent completion tools evolve, we need a robust evaluation benchmark that enables meaningful comparisons between products and guides future advancements. However, existing benchmarks focus more on coarse-grained tasks without industrial analysis resembling general code generation rather than the real-world scenarios developers encounter. Moreover, these benchmarks often rely on costly and time-consuming human annotation, and the standalone test cases fail to leverage minimal tests for maximum repository-level understanding and code coverage. To address these limitations, we first analyze business data from an industrial code completion tool and redefine the evaluation criteria to better align with the developer’s intent and desired completion behavior throughout the coding process. Based on these insights, we introduce Codev-Agent, an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage, ensuring fair and effective comparisons. Using Codev-Agent, we present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework. Codev-Bench assesses whether a code completion tool can capture a developer’s immediate intent and suggest appropriate code across diverse contexts, providing a more realistic benchmark for code completion in modern software development.

[AI-69] Takin-VC: Zero-shot Voice Conversion via Jointly Hybrid Content and Memory-Augmented Context-Aware Timbre Modeling

Link: https://arxiv.org/abs/2410.01350
Authors: Yuguang Yang, Yu Pan, Jixun Yao, Xiang Zhang, Jianhao Ye, Hongbin Zhou, Lei Xie, Lei Ma, Jianjun Zhao
Keywords: shown remarkable progress, remains considerable potential
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Work in Progress; Under Review

Abstract:Zero-shot voice conversion (VC) aims to transform the source speaker timbre into an arbitrary unseen one without altering the original speech content. While recent advancements in zero-shot VC methods have shown remarkable progress, there still remains considerable potential for improvement in terms of speaker similarity and speech naturalness. In this paper, we propose Takin-VC, a novel zero-shot VC framework based on jointly hybrid content and memory-augmented context-aware timbre modeling to tackle this challenge. Specifically, an effective hybrid content encoder, guided by neural codec training, that leverages quantized features from pre-trained WavLM and HybridFormer is first presented to extract the linguistic content of the source speech. Subsequently, we introduce an advanced cross-attention-based context-aware timbre modeling approach that learns the fine-grained, semantically associated target timbre features. To further enhance both speaker similarity and real-time performance, we utilize a conditional flow matching model to reconstruct the Mel-spectrogram of the source speech. Additionally, we advocate an efficient memory-augmented module designed to generate high-quality conditional target inputs for the flow matching process, thereby improving the overall performance of the proposed system. Experimental results demonstrate that the proposed Takin-VC method surpasses state-of-the-art zero-shot VC systems, delivering superior performance in terms of both speech naturalness and speaker similarity.

[AI-70] Life, uh, Finds a Way: Systematic Neural Search

Link: https://arxiv.org/abs/2410.01349
Authors: Alex Baranski, Jun Tani
Keywords: solve spatiotemporally continuous, tackle the challenge, challenge of rapidly, rapidly adapting, adapting an agent
Categories: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 26 pages, 5 figures

Abstract:We tackle the challenge of rapidly adapting an agent’s behavior to solve spatiotemporally continuous problems in novel settings. Animals exhibit extraordinary abilities to adapt to new contexts, a capacity unmatched by artificial systems. Instead of focusing on generalization through deep reinforcement learning, we propose viewing behavior as the physical manifestation of a search procedure, where robust problem-solving emerges from an exhaustive search across all possible behaviors. Surprisingly, this can be done efficiently using online modification of a cognitive graph that guides action, challenging the predominant view that exhaustive search in continuous spaces is impractical. We describe an algorithm that implicitly enumerates behaviors by regulating the tight feedback loop between execution of behaviors and mutation of the graph, and provide a neural implementation based on Hebbian learning and a novel high-dimensional harmonic representation inspired by entorhinal cortex. By framing behavior as search, we provide a mathematically simple and biologically plausible model for real-time behavioral adaptation, successfully solving a variety of continuous state-space navigation problems. This framework not only offers a flexible neural substrate for other applications but also presents a powerful paradigm for understanding adaptive behavior. Our results suggest potential advancements in developmental learning and unsupervised skill acquisition, paving the way for autonomous robots to master complex skills in data-sparse environments demanding flexibility.

[AI-71] PhyMPGN: Physics-encoded Message Passing Graph Network for spatiotemporal PDE systems

Link: https://arxiv.org/abs/2410.01337
Authors: Bocheng Zeng, Qi Wang, Mengtao Yan, Yang Liu, Ruizhi Chengze, Yi Zhang, Hongsheng Liu, Zidong Wang, Hao Sun
Keywords: Solving partial differential, partial differential equations, Solving partial, modeling complex dynamical, differential equations
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)

Abstract:Solving partial differential equations (PDEs) serves as a cornerstone for modeling complex dynamical systems. Recent progress has demonstrated the great benefits of data-driven neural-based models for predicting spatiotemporal dynamics (e.g., tremendous speedup gains compared with classical numerical methods). However, most existing neural models rely on rich training data, have limited extrapolation and generalization abilities, and struggle to produce precise or reliable physical predictions under intricate conditions (e.g., irregular mesh or geometry, complex boundary conditions, diverse PDE parameters, etc.). To this end, we propose a new graph learning approach, namely, Physics-encoded Message Passing Graph Network (PhyMPGN), to model spatiotemporal PDE systems on irregular meshes given small training datasets. Specifically, we incorporate a GNN into a numerical integrator to approximate the temporal marching of spatiotemporal dynamics for a given PDE system. Considering that many physical phenomena are governed by diffusion processes, we further design a learnable Laplace block, which encodes the discrete Laplace-Beltrami operator, to aid and guide the GNN learning in a physically feasible solution space. A boundary condition padding strategy is also designed to improve the model convergence and accuracy. Extensive experiments demonstrate that PhyMPGN is capable of accurately predicting various types of spatiotemporal dynamics on coarse unstructured meshes, consistently achieves state-of-the-art results, and outperforms other baselines with considerable gains.
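
To make the idea of a Laplace operator inside a numerical integrator concrete, here is a minimal sketch. It is not the PhyMPGN model: it assumes an unweighted graph Laplacian on a four-node path (the paper's block encodes the discrete Laplace-Beltrami operator and is learnable), uses forward Euler as the integrator, and omits the GNN term entirely. The names `graph_laplacian`, `euler_step`, `diffusivity`, and `dt` are hypothetical.

```python
def graph_laplacian(u, neighbors):
    """Unweighted graph Laplacian: for each node, sum of (neighbor - self)."""
    return [sum(u[j] - u[i] for j in neighbors[i]) for i in range(len(u))]


def euler_step(u, neighbors, diffusivity, dt):
    """One forward-Euler step of the diffusion equation du/dt = D * Lap(u)."""
    lap = graph_laplacian(u, neighbors)
    return [ui + dt * diffusivity * li for ui, li in zip(u, lap)]


# A path graph 0 - 1 - 2 - 3 standing in for an irregular mesh (assumption).
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

# All of the initial "heat" sits on node 0.
u = [1.0, 0.0, 0.0, 0.0]
for _ in range(50):
    u = euler_step(u, neighbors, diffusivity=1.0, dt=0.1)

# Diffusion spreads the quantity out while conserving its total.
print(u, sum(u))
```

In a physics-encoded model, a learned correction would be added to the right-hand side alongside the Laplacian term; the sketch keeps only the part that is fixed by the physics.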

[AI-72] Layer Swapping for Zero-Shot Cross-Lingual Transfer in Large Language Models

Link: https://arxiv.org/abs/2410.01335
Authors: Lucas Bandarkar, Benjamin Muller, Pritish Yuvraj, Rui Hou, Nayan Singhal, Hongjiang Lv, Bing Liu
Keywords: Large Language Models, math instruction data, practice of combining, instruction data, Model merging
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 main pages, 23 pages total, 9 figures, 5 tables

Abstract:Model merging, such as model souping, is the practice of combining different models with the same architecture together without further training. In this work, we present a model merging methodology that addresses the difficulty of fine-tuning Large Language Models (LLMs) for target tasks in non-English languages, where task-specific data is often unavailable. We focus on mathematical reasoning and without in-language math data, facilitate cross-lingual transfer by composing language and math capabilities. Starting from the same pretrained model, we fine-tune separate “experts” on math instruction data in English and on generic instruction data in the target language. We then replace the top and bottom transformer layers of the math expert directly with layers from the language expert, which consequently enhances math performance in the target language. The resulting merged models outperform the individual experts and other merging methods on the math benchmark, MGSM, by 10% across four major languages where math instruction data is scarce. In addition, this layer swapping is simple, inexpensive, and intuitive, as it is based on an interpretative analysis of the most important parameter changes during the fine-tuning of each expert. The ability to successfully re-compose LLMs for cross-lingual transfer in this manner opens up future possibilities to combine model expertise, create modular solutions, and transfer reasoning capabilities across languages all post hoc.
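
The layer-swapping step described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: each expert is represented as a plain list with one entry per transformer layer, and the function name `swap_layers` and the counts `n_bottom` and `n_top` are hypothetical; the abstract does not say how many layers are swapped.

```python
def swap_layers(math_expert, language_expert, n_bottom, n_top):
    """Return a copy of math_expert whose first n_bottom and last n_top
    layers are taken from language_expert.

    Both arguments are lists with one entry per transformer layer, and both
    experts are assumed to be fine-tuned from the same pretrained model.
    """
    if len(math_expert) != len(language_expert):
        raise ValueError("experts must share the same architecture")
    n_layers = len(math_expert)
    if n_bottom + n_top > n_layers:
        raise ValueError("cannot swap more layers than the model has")

    merged = list(math_expert)
    for i in range(n_bottom):                      # bottom (input-side) layers
        merged[i] = language_expert[i]
    for i in range(n_layers - n_top, n_layers):    # top (output-side) layers
        merged[i] = language_expert[i]
    return merged


# Strings stand in for per-layer weights in this toy example.
math_layers = [f"math_{i}" for i in range(8)]
lang_layers = [f"lang_{i}" for i in range(8)]
merged = swap_layers(math_layers, lang_layers, n_bottom=2, n_top=1)
print(merged)
```

No training happens in this step, which is what makes the merge inexpensive: the middle layers keep the math expert's parameters and the outer layers carry the language expert's.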

[AI-73] Unveiling Language Skills under Circuits

Link: https://arxiv.org/abs/2410.01334
Authors: Hang Chen, Jiaying Zhu, Xinyu Yang, Wenya Wang
Keywords: language skills, complex language skills, language, skills, Simple language skills
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:The exploration of language skills in language models (LMs) has always been one of the central goals in mechanistic interpretability. However, existing circuit analyses often fall short in representing the full functional scope of these models, primarily due to the exclusion of Feed-Forward layers. Additionally, isolating the effect of a single language skill from a text, which inherently involves multiple entangled skills, poses a significant challenge. To address these gaps, we introduce a novel concept, Memory Circuit, a minimum unit that fully and independently manipulates the memory-reading functionality of a language model, and disentangle the transformer model precisely into a circuit graph which is an ensemble of paths connecting different memory circuits. Based on this disentanglement, we identify salient circuit paths, named as skill paths, responsible for three crucial language skills, i.e., the Previous Token Skill, Induction Skill and In-Context Learning (ICL) Skill, leveraging causal effect estimation through interventions and counterfactuals. Our experiments on various datasets confirm the correspondence between our identified skill paths and language skills, and validate three longstanding hypotheses: 1) Language skills are identifiable through circuit dissection; 2) Simple language skills reside in shallow layers, whereas complex language skills are found in deeper layers; 3) Complex language skills are formed on top of simpler language skills. Our codes are available at: this https URL.

[AI-74] Fair Class-Incremental Learning using Sample Weighting

Link: https://arxiv.org/abs/2410.01324
Authors: Jaeyoung Park, Minsu Kim, Steven Euijong Whang
Keywords: class-incremental learning, average gradient vector, Model fairness, average gradient, current task
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Model fairness is becoming important in class-incremental learning for Trustworthy AI. While accuracy has been a central focus in class-incremental learning, fairness has been relatively understudied. However, naively using all the samples of the current task for training results in unfair catastrophic forgetting for certain sensitive groups including classes. We theoretically analyze that forgetting occurs if the average gradient vector of the current task data is in an “opposite direction” compared to the average gradient vector of a sensitive group, which means their inner products are negative. We then propose a fair class-incremental learning framework that adjusts the training weights of current task samples to change the direction of the average gradient vector and thus reduce the forgetting of underperforming groups and achieve fairness. For various group fairness measures, we formulate optimization problems to minimize the overall losses of sensitive groups while minimizing the disparities among them. We also show the problems can be solved with linear programming and propose an efficient Fairness-aware Sample Weighting (FSW) algorithm. Experiments show that FSW achieves better accuracy-fairness tradeoff results than state-of-the-art approaches on real datasets.
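
The forgetting condition in this abstract (a negative inner product between the current task's average gradient and a sensitive group's average gradient) can be illustrated with a toy example. This is a sketch, not the FSW algorithm: the gradients are made-up two-dimensional vectors, and the sample weights are picked by hand rather than obtained from the linear program the paper describes. All names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def weighted_average(grads, weights):
    """Weighted average of per-sample gradient vectors."""
    total = sum(weights)
    dim = len(grads[0])
    return [sum(w * g[k] for w, g in zip(weights, grads)) / total
            for k in range(dim)]


# Average gradient of a sensitive group learned in an earlier task (made up).
group_grad = [1.0, 0.0]

# Per-sample gradients of the current task (made up).
sample_grads = [[-2.0, 1.0], [1.0, 1.0]]

uniform = weighted_average(sample_grads, [1.0, 1.0])
reweighted = weighted_average(sample_grads, [0.2, 1.8])

# Negative inner product: training on the current task harms the group.
print(dot(uniform, group_grad))
# After down-weighting the conflicting sample the inner product is positive.
print(dot(reweighted, group_grad))
```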

[AI-75] Forte: Finding Outliers with Representation Typicality Estimation

Link: https://arxiv.org/abs/2410.01322
Authors: Debargha Ganguly, Warren Morningstar, Andrew Yu, Vipin Chaudhary
Keywords: produce photorealistic synthetic, generative OOD detectors, OOD detectors, virtually indistinguishable, photorealistic synthetic data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)

Abstract:Generative models can now produce photorealistic synthetic data that is virtually indistinguishable from the real data used to train them. This is a significant evolution over previous models, which could produce reasonable facsimiles of the training data, but ones that could be visually distinguished from the training data by human evaluation. Recent work on OOD detection has raised doubts that generative model likelihoods are optimal OOD detectors due to issues involving likelihood misestimation, entropy in the generative process, and typicality. We speculate that generative OOD detectors also failed because their models focused on the pixels rather than the semantic content of the data, leading to failures in near-OOD cases where the pixels may be similar but the information content is significantly different. We hypothesize that estimating typical sets using self-supervised learners leads to better OOD detectors. We introduce a novel approach that leverages representation learning, and informative summary statistics based on manifold estimation, to address all of the aforementioned issues. Our method outperforms other unsupervised approaches and achieves state-of-the-art performance on well-established challenging benchmarks and on new synthetic data detection tasks.

[AI-76] Finetuning Pre-trained Model with Limited Data for LiDAR-based 3D Object Detection by Bridging Domain Gaps IROS

Link: https://arxiv.org/abs/2410.01319
Authors: Jiyun Jang, Mincheol Chang, Jongwon Park, Jinkyu Kim
Keywords: including autonomous vehicles, including autonomous, mobile robots, largely utilized, autonomous vehicles
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Accepted in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2024

Abstract:LiDAR-based 3D object detectors have been largely utilized in various applications, including autonomous vehicles or mobile robots. However, LiDAR-based detectors often fail to adapt well to target domains with different sensor configurations (e.g., types of sensors, spatial resolution, or FOVs) and location shifts. Collecting and annotating datasets in a new setup is commonly required to reduce such gaps, but it is often expensive and time-consuming. Recent studies suggest that pre-trained backbones can be learned in a self-supervised manner with large-scale unlabeled LiDAR frames. However, despite their expressive representations, they remain challenging to generalize well without substantial amounts of data from the target domain. Thus, we propose a novel method, called Domain Adaptive Distill-Tuning (DADT), to adapt a pre-trained model with limited target data (approximately 100 LiDAR frames), retaining its representation power and preventing it from overfitting. Specifically, we use regularizers to align object-level and context-level representations between the pre-trained and finetuned models in a teacher-student architecture. Our experiments with driving benchmarks, i.e., Waymo Open dataset and KITTI, confirm that our method effectively finetunes a pre-trained model, achieving significant gains in accuracy.

[AI-77] Rethinking the Expressiveness of GNNs: A Computational Model Perspective

Link: https://arxiv.org/abs/2410.01308
Authors: Guanyu Cui, Zhewei Wei, Hsin-Hao Su
Keywords: Graph Neural Networks, graph machine learning, considerable research focusing, Graph Neural, Neural Networks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Graph Neural Networks (GNNs) are extensively employed in graph machine learning, with considerable research focusing on their expressiveness. Current studies often assess GNN expressiveness by comparing them to the Weisfeiler-Lehman (WL) tests or classical graph algorithms. However, we identify three key issues in existing analyses: (1) some studies use preprocessing to enhance expressiveness but overlook its computational costs; (2) some claim the anonymous WL test’s limited power while enhancing expressiveness using non-anonymous features, creating a mismatch; and (3) some characterize message-passing GNNs (MPGNNs) with the CONGEST model but make unrealistic assumptions about computational resources, allowing $\textsf{NP}$-Complete problems to be solved in $O(m)$ depth. We contend that a well-defined computational model is urgently needed to serve as the foundation for discussions on GNN expressiveness. To address these issues, we introduce the Resource-Limited CONGEST (RL-CONGEST) model, incorporating optional preprocessing and postprocessing to form a framework for analyzing GNN expressiveness. Our framework sheds light on computational aspects, including the computational hardness of hash functions in the WL test and the role of virtual nodes in reducing network capacity. Additionally, we suggest that high-order GNNs correspond to first-order model-checking problems, offering new insights into their expressiveness.

[AI-78] FanCric: Multi-Agentic Framework for Crafting Fantasy 11 Cricket Teams

Link: https://arxiv.org/abs/2410.01307
Authors: Mohit Bhatnagar
Keywords: Indian Premier League, deep history, increasingly captivates, global audience, intricate strategies
Categories: Artificial Intelligence (cs.AI)

Abstract:Cricket, with its intricate strategies and deep history, increasingly captivates a global audience. The Indian Premier League (IPL), epitomizing Twenty20 cricket, showcases talent in a format that lasts just a few hours as opposed to the longer forms of the game. Renowned for its fusion of technology and fan engagement, the IPL stands as the world’s most popular cricket league. This study concentrates on Dream11, India’s leading fantasy cricket league for IPL, where participants craft virtual teams based on real player performances to compete internationally. Building a winning fantasy team requires navigating various complex factors including player form and match conditions. Traditionally, this has been approached through operations research and machine learning. This research introduces the FanCric framework, an advanced multi-agent system leveraging Large Language Models (LLMs) and a robust orchestration framework to enhance fantasy team selection in cricket. FanCric employs both structured and unstructured data to surpass traditional methods by incorporating sophisticated AI technologies. The analysis involved scrutinizing approximately 12.7 million unique entries from a Dream11 contest, evaluating FanCric’s efficacy against the collective wisdom of crowds and a simpler Prompt Engineering approach. Ablation studies further assessed the impact of generating varying numbers of teams. The exploratory findings are promising, indicating that further investigation into FanCric’s capabilities is warranted to fully realize its potential in enhancing strategic decision-making using LLMs in fantasy sports and business in general.

[AI-79] Emotion-Aware Response Generation Using Affect-Enriched Embeddings with LLMs

Link: https://arxiv.org/abs/2410.01306
Authors: Abdur Rasool, Muhammad Irfan Shahzad, Hafsa Aslam, Vincent Chan
Keywords: automated chatbot-facilitated psychotherapy, chatbot-facilitated psychotherapy sessions, automated chatbot-facilitated, NRC Emotion Lexicon, including NRC Emotion
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:There is a need for empathetic and coherent responses in automated chatbot-facilitated psychotherapy sessions. This study addresses the challenge of enhancing the emotional and contextual understanding of large language models (LLMs) in psychiatric applications. We introduce a novel framework that integrates multiple emotion lexicons, including NRC Emotion Lexicon, VADER, WordNet, and SentiWordNet, with state-of-the-art LLMs such as LLAMA 2, Flan-T5, ChatGPT 3.0, and ChatGPT 4.0. The primary dataset comprises over 2,000 therapy session transcripts from the Counseling and Psychotherapy database, covering discussions on anxiety, depression, trauma, and addiction. We segment the transcripts into smaller chunks, enhancing them with lexical features and computing embeddings using BERT, GPT-3, and RoBERTa to capture semantic and emotional nuances. These embeddings are stored in a FAISS vector database, enabling efficient similarity search and clustering based on cosine similarity. Upon user query, the most relevant segments are retrieved and provided as context to the LLMs, significantly improving the models’ ability to generate empathetic and contextually appropriate responses. Experimental evaluations demonstrate that incorporating emotion lexicons enhances empathy, coherence, informativeness, and fluency scores. Our findings highlight the critical role of emotional embeddings in improving LLM performance for psychotherapy.
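
The retrieval step in this abstract (find the stored segments whose embeddings are closest to the query by cosine similarity) can be sketched with a brute-force search. This is a stand-in for illustration only: it does not use FAISS, the three-dimensional vectors are made up rather than produced by BERT, GPT-3, or RoBERTa, and the names `cosine` and `top_k` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query, segments, k):
    """Return the texts of the k segments most similar to the query."""
    ranked = sorted(segments,
                    key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]


# (text, embedding) pairs; the embeddings are toy values, not model outputs.
segments = [
    ("segment about anxiety", [0.9, 0.1, 0.0]),
    ("segment about addiction", [0.0, 0.2, 0.9]),
    ("segment about depression", [0.7, 0.6, 0.1]),
]

hits = top_k([1.0, 0.2, 0.0], segments, k=2)
print(hits)
```

The retrieved texts would then be placed in the prompt as context for the LLM. A vector index serves the same purpose as the sorted scan here, only faster on large collections.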

[AI-80] Speculative Coreset Selection for Task-Specific Fine-tuning

Link: https://arxiv.org/abs/2410.01296
Authors: Xiaoyu Zhang, Juan Zhai, Shiqing Ma, Chao Shen, Tianlin Li, Weipeng Jiang, Yang Liu
Keywords: requires significant computational, significant computational resources, large language models, target LLM, Task-specific fine-tuning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages, 4 figures, 14 tables

Abstract:Task-specific fine-tuning is essential for the deployment of large language models (LLMs), but it requires significant computational resources and time. Existing solutions have proposed coreset selection methods to improve data efficiency and reduce model training overhead, but they still have limitations: 1) Overlooking valuable samples at high pruning rates, which degrades the coreset’s performance. 2) Requiring high time overhead during coreset selection to fine-tune and evaluate the target LLM. In this paper, we introduce STAFF, a speculative coreset selection method. STAFF leverages a small model from the same family as the target LLM to efficiently estimate data scores and then verifies the scores on the target LLM to accurately identify and allocate more selection budget to important regions while maintaining coverage of easy regions. We evaluate STAFF on three LLMs and three downstream tasks and show that STAFF improves the performance of SOTA methods by up to 54.3% and reduces selection overhead by up to 70.5% at different pruning rates. Furthermore, we observe that the coreset selected by STAFF at low pruning rates (i.e., 20%) can even obtain better fine-tuning performance than the full dataset.

[AI-81] Towards a Law of Iterated Expectations for Heuristic Estimators

Link: https://arxiv.org/abs/2410.01290
Authors: Paul Christiano, Jacob Hilton, Andrea Lincoln, Eric Neyman, Mark Xu
Keywords: heuristic estimator, heuristic, estimator, mathbb, Christiano
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: 47 pages, 2 tables, 1 figure

Abstract:Christiano et al. (2022) define a heuristic estimator to be a hypothetical algorithm that estimates the values of mathematical expressions from arguments. In brief, a heuristic estimator $\mathbb{G}$ takes as input a mathematical expression $Y$ and a formal “heuristic argument” $\pi$, and outputs an estimate $\mathbb{G}(Y \mid \pi)$ of $Y$. In this work, we argue for the informal principle that a heuristic estimator ought not to be able to predict its own errors, and we explore approaches to formalizing this principle. Most simply, the principle suggests that $\mathbb{G}(Y - \mathbb{G}(Y \mid \pi) \mid \pi)$ ought to equal zero for all $Y$ and $\pi$. We argue that an ideal heuristic estimator ought to satisfy two stronger properties in this vein, which we term iterated estimation (by analogy to the law of iterated expectations) and error orthogonality. Although iterated estimation and error orthogonality are intuitively appealing, it can be difficult to determine whether a given heuristic estimator satisfies the properties. As an alternative approach, we explore accuracy: a property that (roughly) states that $\mathbb{G}$ has zero average error over a distribution of mathematical expressions. However, in the context of two estimation problems, we demonstrate barriers to creating an accurate heuristic estimator. We finish by discussing challenges and potential paths forward for finding a heuristic estimator that accords with our intuitive understanding of how such an estimator ought to behave, as well as the potential applications of heuristic estimators to understanding the behavior of neural networks.
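
For readers unfamiliar with the analogy, the classical facts about conditional expectation that the abstract alludes to can be written next to the abstract's own principle. The first two lines are standard probability (with $X$ any conditioning variable, a symbol introduced here for illustration); the third line is the principle exactly as stated in the abstract. The paper's formal definitions of iterated estimation and error orthogonality are not reproduced here.

```latex
\begin{align*}
  \mathbb{E}\bigl[\,\mathbb{E}[Y \mid X]\,\bigr] &= \mathbb{E}[Y]
    && \text{(law of iterated expectations)} \\
  \mathbb{E}\bigl[\,Y - \mathbb{E}[Y \mid X] \,\bigm|\, X\,\bigr] &= 0
    && \text{(conditional error has zero mean)} \\
  \mathbb{G}\bigl(Y - \mathbb{G}(Y \mid \pi) \,\bigm|\, \pi\bigr) &= 0
    && \text{(analogous principle for a heuristic estimator)}
\end{align*}
```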

[AI-82] Uncertainty-aware Human Mobility Modeling and Anomaly Detection

Link: https://arxiv.org/abs/2410.01281
Authors: Haomin Wen, Shurui Cao, Leman Akoglu
Keywords: GPS coordinates, bad-actor or malicious, model GPS data, malicious behavior detection, effective anomaly detection
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Given the GPS coordinates of a large collection of human agents over time, how can we model their mobility behavior toward effective anomaly detection (e.g. for bad-actor or malicious behavior detection) without any labeled data? Human mobility and trajectory modeling have been studied extensively with varying capacity to handle complex input, and performance-efficiency trade-offs. With the arrival of more expressive models in machine learning, we attempt to model GPS data as a sequence of stay-point events, each with a set of characterizing spatiotemporal features, and leverage modern sequence models such as Transformers for un/self-supervised training and inference. Notably, driven by the inherent stochasticity of certain individuals’ behavior, we equip our model with aleatoric/data uncertainty estimation. In addition, to handle data sparsity of a large variety of behaviors, we incorporate epistemic/model uncertainty into our model. Together, aleatoric and epistemic uncertainty enable a robust loss and training dynamics, as well as uncertainty-aware decision making in anomaly scoring. Experiments on large expert-simulated datasets with tens of thousands of agents demonstrate the effectiveness of our model against both forecasting and anomaly detection baselines.

[AI-83] Deep Unlearn: Benchmarking Machine Unlearning

Link: https://arxiv.org/abs/2410.01276
Authors: Xavier F. Cadet, Anastasia Borovykh, Mohammad Malekzadeh, Sara Ahmadi-Abhari, Hamed Haddadi
Keywords: trained machine learning, machine learning model, trained machine, machine learning, aims to remove
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Machine unlearning (MU) aims to remove the influence of particular data points from the learnable parameters of a trained machine learning model. This is a crucial capability in light of data privacy requirements, trustworthiness, and safety in deployed models. MU is particularly challenging for deep neural networks (DNNs), such as convolutional nets or vision transformers, as such DNNs tend to memorize a notable portion of their training dataset. Nevertheless, the community lacks a rigorous and multifaceted study that looks into the success of MU methods for DNNs. In this paper, we investigate 18 state-of-the-art MU methods across various benchmark datasets and models, with each evaluation conducted over 10 different initializations, a comprehensive evaluation involving MU over 100K models. We show that, with the proper hyperparameters, Masked Small Gradients (MSG) and Convolution Transpose (CT), consistently perform better in terms of model accuracy and run-time efficiency across different models, datasets, and initializations, assessed by population-based membership inference attacks (MIA) and per-sample unlearning likelihood ratio attacks (U-LiRA). Furthermore, our benchmark highlights the fact that comparing a MU method only with commonly used baselines, such as Gradient Ascent (GA) or Successive Random Relabeling (SRL), is inadequate, and we need better baselines like Negative Gradient Plus (NG+) with proper hyperparameter selection.

[AI-84] HelpSteer2-Preference: Complementing Ratings with Preferences

Link: https://arxiv.org/abs/2410.01257
Authors: Zhilin Wang, Alexander Bukharin, Olivier Delalleau, Daniel Egert, Gerald Shen, Jiaqi Zeng, Oleksii Kuchaiev, Yi Dong
Keywords: popular paradigms, adequately matched, Regression, Regression style, Reward
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 26 pages, 3 figures

Abstract:Reward models are critical for aligning models to follow instructions, and are typically trained following one of two popular paradigms: Bradley-Terry style or Regression style. However, there is a lack of evidence that either approach is better than the other, when adequately matched for data. This is primarily because these approaches require data collected in different (but incompatible) formats, meaning that adequately matched data is not available in existing public datasets. To tackle this problem, we release preference annotations (designed for Bradley-Terry training) to complement existing ratings (designed for Regression style training) in the HelpSteer2 dataset. To improve data interpretability, preference annotations are accompanied with human-written justifications. Using this data, we conduct the first head-to-head comparison of Bradley-Terry and Regression models when adequately matched for data. Based on insights derived from such a comparison, we propose a novel approach to combine Bradley-Terry and Regression reward modeling. A Llama-3.1-70B-Instruct model tuned with this approach scores 94.1 on RewardBench, emerging top of more than 140 reward models as of 1 Oct 2024. We also demonstrate the effectiveness of this reward model at aligning models to follow instructions in RLHF. We open-source this dataset (CC-BY-4.0 license) at this https URL and openly release the trained Reward Model at this https URL
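
The two training paradigms named in this abstract differ mainly in their loss and in the data format each needs. A minimal sketch of both follows. The Bradley-Terry loss is the standard negative log-sigmoid of the reward margin; squared error is assumed for the regression style, since the abstract does not name a specific loss. The function names are hypothetical, and the sketch does not cover the combined approach the paper proposes.

```python
import math


def bradley_terry_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written as log(1 + exp(-margin)).

    Needs a preference pair: one preferred and one rejected response.
    """
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))


def regression_loss(predicted_reward, human_rating):
    """Squared error against a scalar rating (assumed regression objective).

    Needs a single response with a numeric rating.
    """
    return (predicted_reward - human_rating) ** 2


# Ranking the preferred response higher gives a smaller Bradley-Terry loss.
print(bradley_terry_loss(2.0, 0.0), bradley_terry_loss(0.0, 2.0))
# The regression loss only looks at one response and its rating.
print(regression_loss(3.5, 4.0))
```

The incompatible input formats (pairs versus rated single responses) are the reason matched data for both paradigms was previously unavailable, as the abstract points out.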

[AI-85] AHP-Powered LLM Reasoning for Multi-Criteria Evaluation of Open-Ended Responses EMNLP2024

Link: https://arxiv.org/abs/2410.01246
Authors: Xiaotian Lu,Jiyi Li,Koh Takeuchi,Hisashi Kashima
Keywords (EN): natural language processing, open-ended questions, extensively studied, field of natural, NLP
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Accepted for EMNLP 2024 Findings

Abstract:Question answering (QA) tasks have been extensively studied in the field of natural language processing (NLP). Answers to open-ended questions are highly diverse and difficult to quantify, and cannot be simply evaluated as correct or incorrect, unlike close-ended questions with definitive answers. While large language models (LLMs) have demonstrated strong capabilities across various tasks, they exhibit relatively weaker performance in evaluating answers to open-ended questions. In this study, we propose a method that leverages LLMs and the analytic hierarchy process (AHP) to assess answers to open-ended questions. We utilized LLMs to generate multiple evaluation criteria for a question. Subsequently, answers were subjected to pairwise comparisons under each criterion with LLMs, and scores for each answer were calculated in the AHP. We conducted experiments on four datasets using both ChatGPT-3.5-turbo and GPT-4. Our results indicate that our approach more closely aligns with human judgment compared to the four baselines. Additionally, we explored the impact of the number of criteria, variations in models, and differences in datasets on the results.
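
The AHP scoring step can be sketched as follows. This is a generic AHP calculation using the row geometric-mean approximation of the priority vector; the comparison matrices, the criterion names, and the equal criterion weights are hypothetical, and the paper's exact aggregation may differ.

```python
import math


def ahp_priorities(pairwise):
    """Approximate the AHP priority vector from a reciprocal comparison matrix.

    pairwise[i][j] states how strongly item i is preferred over item j.
    Uses the normalised geometric mean of each row.
    """
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def ahp_scores(per_criterion_matrices, criterion_weights):
    """Combine per-criterion answer priorities into one score per answer."""
    n_answers = len(per_criterion_matrices[0])
    scores = [0.0] * n_answers
    for matrix, weight in zip(per_criterion_matrices, criterion_weights):
        for i, priority in enumerate(ahp_priorities(matrix)):
            scores[i] += weight * priority
    return scores


# Hypothetical pairwise judgments for three answers under two criteria.
clarity = [[1, 3, 5], [1 / 3, 1, 2], [1 / 5, 1 / 2, 1]]
accuracy = [[1, 1 / 2, 2], [2, 1, 4], [1 / 2, 1 / 4, 1]]
scores = ahp_scores([clarity, accuracy], [0.5, 0.5])
```

In the setting the abstract describes, each entry of a comparison matrix would come from an LLM's pairwise judgment under one generated criterion.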

[AI-86] RGD: Multi-LLM Based Agent Debugger via Refinement and Generation Guidance

Link: https://arxiv.org/abs/2410.01242
Authors: Haolin Jin,Zechao Sun,Yiheng Yang,Huaming Chen
Keywords (EN): Large Language Models, Large Language, Language Models, shown incredible potential, code
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Large Language Models (LLMs) have shown incredible potential in code generation tasks, and recent research in prompt engineering has enhanced LLMs’ understanding of textual information. However, ensuring the accuracy of generated code often requires extensive testing and validation by programmers. While LLMs can typically generate code based on task descriptions, their accuracy remains limited, especially for complex tasks that require a deeper understanding of both the problem statement and the code generation process. This limitation is primarily due to the LLMs’ need to simultaneously comprehend text and generate syntactically and semantically correct code, without having the capability to automatically refine the code. In real-world software development, programmers rarely produce flawless code in a single attempt based on the task description alone; they rely on iterative feedback and debugging to refine their programs. Inspired by this process, we introduce a novel architecture of LLM-based agents for code generation and automatic debugging: Refinement and Guidance Debugging (RGD). The RGD framework is a multi-LLM-based agent debugger that leverages three distinct LLM agents: Guide Agent, Debug Agent, and Feedback Agent. RGD decomposes the code generation task into multiple steps, ensuring a clearer workflow and enabling iterative code refinement based on self-reflection and feedback. Experimental results demonstrate that RGD exhibits remarkable code generation capabilities, achieving state-of-the-art performance with a 9.8% improvement on the HumanEval dataset and a 16.2% improvement on the MBPP dataset compared to the state-of-the-art approaches and traditional direct prompting approaches. We highlight the effectiveness of the RGD framework in enhancing LLMs’ ability to generate and refine code autonomously.

[AI-87] See Me and Believe Me: Causality and Intersectionality in Testimonial Injustice in Healthcare

Link: https://arxiv.org/abs/2410.01227
Authors: Kenya S. Andrews,Mesrob I. Ohannessian,Elena Zheleva
Keywords (EN): testimonial injustice, heard and understood, correctly heard, Structural Causal Model, causal discovery
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In medical settings, it is critical that all who are in need of care are correctly heard and understood. When this is not the case due to prejudices a listener has, the speaker is experiencing testimonial injustice, which, building upon recent work, we quantify by the presence of several categories of unjust vocabulary in medical notes. In this paper, we use FCI, a causal discovery method, to study the degree to which certain demographic features could lead to marginalization (e.g., age, gender, and race) by way of contributing to testimonial injustice. To achieve this, we review physicians’ notes for each patient, where we identify occurrences of unjust vocabulary, along with the demographic features present, and use causal discovery to build a Structural Causal Model (SCM) relating those demographic features to testimonial injustice. We analyze and discuss the resulting SCMs to show the interaction of these factors and how they influence the experience of injustice. Despite the potential presence of some confounding variables, we observe how one contributing feature can make a person more prone to experiencing another contributor of testimonial injustice. There is no single root of injustice and thus intersectionality cannot be ignored. These results call for considering more than singular or equalized attributes of who a person is when analyzing and improving their experiences of bias and injustice. This work is thus a first foray at using causal discovery to understand the nuanced experiences of patients in medical settings, and its insights could be used to guide design principles throughout healthcare, to build trust and promote better patient care.

[AI-88] From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging

Link: https://arxiv.org/abs/2410.01215
Authors: Yuling Shi,Songsong Wang,Chengcheng Wan,Xiaodong Gu
Keywords (EN): large language models, made significant strides, requiring human intervention, complex problems, large language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Programming Languages (cs.PL); Software Engineering (cs.SE)
Notes: Code and data available at this https URL

Abstract:While large language models have made significant strides in code generation, the pass rate of the generated code is bottlenecked on subtle errors, often requiring human intervention to pass tests, especially for complex problems. Existing LLM-based debugging systems treat generated programs as monolithic units, failing to address bugs at multiple levels of granularity, from low-level syntax errors to high-level algorithmic flaws. In this paper, we introduce Multi-Granularity Debugger (MGDebugger), a hierarchical code debugger that isolates, identifies, and resolves bugs at various levels of granularity. MGDebugger decomposes problematic code into a hierarchical tree structure of subfunctions, with each level representing a particular granularity of error. During debugging, it analyzes each subfunction and iteratively resolves bugs in a bottom-up manner. To effectively test each subfunction, we propose an LLM-simulated Python executor, which traces code execution and tracks important variable states to pinpoint errors accurately. Extensive experiments demonstrate that MGDebugger outperforms existing debugging systems, achieving an 18.9% improvement in accuracy over seed generations in HumanEval and a 97.6% repair success rate in HumanEvalFix. Furthermore, MGDebugger effectively fixes bugs across different categories and difficulty levels, demonstrating its robustness and effectiveness.

[AI-89] Polyp-SES: Automatic Polyp Segmentation with Self-Enriched Semantic Model

Link: https://arxiv.org/abs/2410.01210
Authors: Quang Vinh Nguyen,Thanh Hoang Son Vo,Sae-Ryung Kang,Soo-Hyung Kim
Keywords (EN): Automatic polyp segmentation, Automatic polyp, crucial for effective, effective diagnosis, diagnosis and treatment
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: Asian Conference on Computer Vision 2024

Abstract:Automatic polyp segmentation is crucial for effective diagnosis and treatment in colonoscopy images. Traditional methods encounter significant challenges in accurately delineating polyps due to limitations in feature representation and the handling of variability in polyp appearance. Deep learning techniques, including CNN and Transformer-based methods, have been explored to improve polyp segmentation accuracy. However, existing approaches often neglect additional semantics, restricting their ability to acquire adequate contexts of polyps in colonoscopy images. In this paper, we propose an innovative method named “Automatic Polyp Segmentation with Self-Enriched Semantic Model” to address these limitations. First, we extract a sequence of features from an input image and decode high-level features to generate an initial segmentation mask. Using the proposed self-enriched semantic module, we query potential semantics and augment deep features with additional semantics, thereby aiding the model in understanding context more effectively. Extensive experiments across five polyp benchmarks show that the proposed method outperforms state-of-the-art polyp segmentation baselines in both learning and generalization capabilities.

[AI-90] Were RNNs All We Needed?

Link: https://arxiv.org/abs/2410.01201
Authors: Leo Feng,Frederick Tung,Mohamed Osama Ahmed,Yoshua Bengio,Hossein Hajimirsadegh
Keywords (EN): limitations of Transformers, scalability limitations, renewed interest, parallelizable during training, Transformers
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The scalability limitations of Transformers regarding sequence length have renewed interest in recurrent sequence models that are parallelizable during training. As a result, many novel recurrent architectures, such as S4, Mamba, and Aaren, have been proposed that achieve comparable performance. In this work, we revisit traditional recurrent neural networks (RNNs) from over a decade ago: LSTMs (1997) and GRUs (2014). While these models were slow because they required backpropagation through time (BPTT), we show that by removing their hidden state dependencies from their input, forget, and update gates, LSTMs and GRUs no longer need BPTT and can be efficiently trained in parallel. Building on this, we introduce minimal versions (minLSTMs and minGRUs) that (1) use significantly fewer parameters than their traditional counterparts and (2) are fully parallelizable during training (175x faster for a sequence of length 512). Lastly, we show that these stripped-down versions of decade-old RNNs match the empirical performance of recent sequence models.
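
A small sketch of why removing the hidden-state dependency from the gates matters. It assumes a scalar GRU-like recurrence h_t = (1 - z_t) * h_{t-1} + z_t * c_t in which the gate z_t and candidate c_t depend only on the input x_t; the weights `w_z` and `w_h` and the function names are hypothetical. Under that assumption every coefficient is known up front, so the whole sequence can be computed from prefix products and prefix sums instead of a step-by-step loop.

```python
import math
import operator
from itertools import accumulate


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def coefficients(xs, w_z, w_h):
    """Per-step coefficients of h_t = a_t * h_{t-1} + b_t, computed from inputs only."""
    a = [1.0 - sigmoid(w_z * x) for x in xs]        # keep factor (1 - z_t)
    b = [sigmoid(w_z * x) * (w_h * x) for x in xs]  # z_t times candidate state
    return a, b


def recurrent_form(xs, w_z, w_h, h0=0.0):
    """Reference implementation: ordinary sequential loop."""
    a, b = coefficients(xs, w_z, w_h)
    h, states = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        states.append(h)
    return states


def scan_form(xs, w_z, w_h, h0=0.0):
    """Same states from prefix products and prefix sums (the scan-friendly form)."""
    a, b = coefficients(xs, w_z, w_h)
    cum_a = list(accumulate(a, operator.mul))  # A_t = a_1 * ... * a_t
    cum_ratio = list(accumulate(b_t / A_t for b_t, A_t in zip(b, cum_a)))
    return [A_t * (h0 + s_t) for A_t, s_t in zip(cum_a, cum_ratio)]
```

`accumulate` runs sequentially here; it stands in for a parallel prefix scan, which is what makes this formulation trainable in parallel. Dividing by the running product is fine for a short illustration but is not numerically robust for long sequences.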

[AI-91] Generative Diffusion-based Contract Design for Efficient AI Twins Migration in Vehicular Embodied AI Networks

Link: https://arxiv.org/abs/2410.01176
Authors: Yue Zhong,Jiawen Kang,Jinbo Wen,Dongdong Ye,Jiangtian Nie,Dusit Niyato,Xiaozheng Gao,Shengli Xie
Keywords (EN): rapidly advancing field, Embodied, enabling a wide, rapidly advancing, advancing field
Categories: Artificial Intelligence (cs.AI)

Abstract:Embodied AI is a rapidly advancing field that bridges the gap between cyberspace and physical space, enabling a wide range of applications. This evolution has led to the development of the Vehicular Embodied AI NETwork (VEANET), where advanced AI capabilities are integrated into vehicular systems to enhance autonomous operations and decision-making. Embodied agents, such as Autonomous Vehicles (AVs), are autonomous entities that can perceive their environment and take actions to achieve specific goals, actively interacting with the physical world. Embodied twins are digital models of these embodied agents, with various embodied AI twins for intelligent applications in cyberspace. In VEANET, embodied AI twins act as in-vehicle AI assistants to perform diverse tasks supporting autonomous driving using generative AI models. Due to their limited computational resources, AVs often offload computationally intensive tasks, such as constructing and updating embodied AI twins, to nearby roadside units (RSUs). However, owing to the rapid mobility of AVs and the limited coverage of a single RSU, embodied AI twins require dynamic migration from the current RSU to other RSUs in real time, which raises the challenge of selecting suitable RSUs for efficient embodied AI twin migration. Given information asymmetry, AVs cannot know the detailed information of RSUs. To this end, in this paper, we construct a multi-dimensional contract theoretical model between AVs and alternative RSUs. Considering that AVs may exhibit irrational behavior, we utilize prospect theory instead of expected utility theory to model the actual utilities of AVs. Finally, we employ a generative diffusion model-based algorithm to identify the optimal contract designs. Compared with traditional deep reinforcement learning algorithms, numerical results demonstrate the effectiveness of the proposed scheme.

[AI-92] Towards Inference-time Category-wise Safety Steering for Large Language Models

Link: https://arxiv.org/abs/2410.01174
Authors: Amrita Bhattacharjee,Shaona Ghosh,Traian Rebedea,Christopher Parisien
Keywords (EN): large language models, variety of use-cases, active research, large language, unprecedented advancements
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:While large language models (LLMs) have seen unprecedented advancements in capabilities and applications across a variety of use-cases, safety alignment of these models is still an area of active research. The fragile nature of LLMs, even models that have undergone extensive alignment and safety training regimes, warrants additional safety steering steps via training-free, inference-time methods. While recent work in the area of mechanistic interpretability has investigated how activations in latent representation spaces may encode concepts, and thereafter performed representation engineering to induce such concepts in LLM outputs, the applicability of such for safety is relatively under-explored. Unlike recent inference-time safety steering works, in this paper we explore safety steering of LLM outputs using: (i) category-specific steering vectors, thereby enabling fine-grained control over the steering, and (ii) sophisticated methods for extracting informative steering vectors for more effective safety steering while retaining quality of the generated text. We demonstrate our exploration on multiple LLMs and datasets, and showcase the effectiveness of the proposed steering method, along with a discussion on the implications and best practices.

[AI-93] Recovering Manifold Structure Using Ollivier-Ricci Curvature

Link: https://arxiv.org/abs/2410.01149
Authors: Tristan Luca Saidi,Abigail Hickok,Andrew J. Blumberg
Keywords (EN): estimated metric distortion, prune spurious edges, nearest neighbor graphs, metric distortion, Ollivier-Ricci curvature
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Geometry (cs.CG)

Abstract:We introduce ORC-ManL, a new algorithm to prune spurious edges from nearest neighbor graphs using a criterion based on Ollivier-Ricci curvature and estimated metric distortion. Our motivation comes from manifold learning: we show that when the data generating the nearest-neighbor graph consists of noisy samples from a low-dimensional manifold, edges that shortcut through the ambient space have more negative Ollivier-Ricci curvature than edges that lie along the data manifold. We demonstrate that our method outperforms alternative pruning methods and that it significantly improves performance on many downstream geometric data analysis tasks that use nearest neighbor graphs as input. Specifically, we evaluate on manifold learning, persistent homology, dimension estimation, and others. We also show that ORC-ManL can be used to improve clustering and manifold learning of single-cell RNA sequencing data. Finally, we provide empirical convergence experiments that support our theoretical findings.

[AI-94] ProxiMix: Enhancing Fairness with Proximity Samples in Subgroups

Link: https://arxiv.org/abs/2410.01145
Authors: Jingyu Hu,Jun Hong,Mengnan Du,Weiru Liu
Keywords (EN): machine learning, bias mitigation, developed for addressing, bias mitigation methods, addressing fairness issues
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Many bias mitigation methods have been developed for addressing fairness issues in machine learning. We found that using linear mixup alone, a data augmentation technique, for bias mitigation, can still retain biases present in dataset labels. Research presented in this paper aims to address this issue by proposing a novel pre-processing strategy in which both an existing mixup method and our new bias mitigation algorithm can be utilized to improve the generation of labels of augmented samples, which are proximity aware. Specifically, we proposed ProxiMix which keeps both pairwise and proximity relationships for fairer data augmentation. We conducted thorough experiments with three datasets, three ML models, and different hyperparameters settings. Our experimental results showed the effectiveness of ProxiMix from both fairness of predictions and fairness of recourse perspectives.

[AI-95] Evaluating Deduplication Techniques for Economic Research Paper Titles with a Focus on Semantic Similarity using NLP and LLMs

Link: https://arxiv.org/abs/2410.01141
Authors: Doohee You,Karim Lasri,Samuel Fraiberger
Keywords (EN): research paper titles, study investigates efficient, investigates efficient deduplication, efficient deduplication techniques, economic research paper
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 6 pages, 1 figure

Abstract:This study investigates efficient deduplication techniques for a large NLP dataset of economic research paper titles. We explore various pairing methods alongside established distance measures (Levenshtein distance, cosine similarity) and a sBERT model for semantic evaluation. Our findings suggest a potentially low prevalence of duplicates based on the observed semantic similarity across different methods. Further exploration with a human-annotated ground truth set is completed for a more conclusive assessment. The result supports findings from the NLP- and LLM-based distance metrics.
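
The two classical distance measures named in the abstract can be sketched with the standard library alone. The abstract does not say which vector representation feeds the cosine similarity, so the bag-of-words vectors below are an assumption, and the sBERT comparison is not reproduced here.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of lower-cased bag-of-words count vectors (assumed representation)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(count * vb[token] for token, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Two titles would be flagged as candidate duplicates when the edit distance is small or the cosine similarity is close to 1; the thresholds are left to the reader since the abstract does not report them.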

[AI-96] nGPT: Normalized Transformer with Representation Learning on the Hypersphere

Link: https://arxiv.org/abs/2410.01131
Authors: Ilya Loshchilov,Cheng-Ping Hsieh,Simeng Sun,Boris Ginsburg
Keywords (EN): neural network architecture, normalized Transformer, network architecture, neural network, representation learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:We propose a novel neural network architecture, the normalized Transformer (nGPT) with representation learning on the hypersphere. In nGPT, all vectors forming the embeddings, MLP, attention matrices and hidden states are unit norm normalized. The input stream of tokens travels on the surface of a hypersphere, with each layer contributing a displacement towards the target output predictions. These displacements are defined by the MLP and attention blocks, whose vector components also reside on the same hypersphere. Experiments show that nGPT learns much faster, reducing the number of training steps required to achieve the same accuracy by a factor of 4 to 20, depending on the sequence length.
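
The idea of a hidden state that stays on the unit hypersphere while each block nudges it can be sketched as below. The update rule (move a fraction `alpha` towards the normalised block output, then renormalise) and the names `unit_norm` and `hypersphere_step` are assumptions made for illustration; the abstract states only that all vectors are unit-norm and that blocks contribute displacements.

```python
import math


def unit_norm(vector):
    """Scale a vector to Euclidean length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def hypersphere_step(hidden, block_output, alpha):
    """Displace a unit-norm hidden state towards a block's output and renormalise.

    alpha = 0 leaves the state unchanged; alpha = 1 jumps to the normalised output.
    """
    target = unit_norm(block_output)
    moved = [h + alpha * (t - h) for h, t in zip(hidden, target)]
    return unit_norm(moved)
```

The sketch does not guard against a zero-length intermediate vector, which can occur when the state and the target are exactly opposite and `alpha` is 0.5.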

[AI-97] Learning to Build by Building Your Own Instructions

Link: https://arxiv.org/abs/2410.01111
Authors: Aaron Walsman,Muru Zhang,Adam Fishman,Ali Farhadi,Dieter Fox
Keywords (EN): important unsolved component, complex visual objects, Structural understanding, artificial intelligence, understanding of complex
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)

Abstract:Structural understanding of complex visual objects is an important unsolved component of artificial intelligence. To study this, we develop a new technique for the recently proposed Break-and-Make problem in LTRON where an agent must learn to build a previously unseen LEGO assembly using a single interactive session to gather information about its components and their structure. We attack this problem by building an agent that is able to make its own visual instruction book. By disassembling an unseen assembly and periodically saving images of it, the agent is able to create a set of instructions so that it has the information necessary to rebuild it. These instructions form an explicit memory that allows the model to reason about the assembly process one step at a time, avoiding the need for long-term implicit memory. This in turn allows us to train on much larger LEGO assemblies than has been possible in the past. To demonstrate the power of this model, we release a new dataset of procedurally built LEGO vehicles that contain an average of 31 bricks each and require over one hundred steps to disassemble and reassemble. We train these models using online imitation learning which allows the model to learn from its own mistakes. Finally, we also provide some small improvements to LTRON and the Break-and-Make problem that simplify the learning environment and improve usability.

[AI-98] Mixing It Up: The Cocktail Effect of Multi-Task Fine-Tuning on LLM Performance – A Case Study in Finance

Link: https://arxiv.org/abs/2410.01109
Authors: Meni Brief,Oded Ovadia,Gil Shenderovitz,Noga Ben Yoash,Rachel Lemberg,Eitam Sheetrit
Keywords (EN): including finance, large language models, expanded rapidly, application of large, large language
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)

Abstract:The application of large language models (LLMs) in domain-specific contexts, including finance, has expanded rapidly. Domain-specific LLMs are typically evaluated based on their performance in various downstream tasks relevant to the domain. In this work, we present a detailed analysis of fine-tuning LLMs for such tasks. Somewhat counterintuitively, we find that in domain-specific cases, fine-tuning exclusively on the target task is not always the most effective strategy. Instead, multi-task fine-tuning - where models are trained on a cocktail of related tasks - can significantly enhance performance. We demonstrate how this approach enables a small model, such as Phi-3-Mini, to achieve state-of-the-art results, even surpassing the much larger GPT-4-o model on financial benchmarks. Our study involves a large-scale experiment, training over 200 models using several widely adopted LLMs as baselines, and empirically confirms the benefits of multi-task fine-tuning. Additionally, we explore the use of general instruction data as a form of regularization, suggesting that it helps minimize performance degradation. We also investigate the inclusion of mathematical data, finding improvements in numerical reasoning that transfer effectively to financial tasks. Finally, we note that while fine-tuning for downstream tasks leads to targeted improvements in task performance, it does not necessarily result in broader gains in domain knowledge or complex domain reasoning abilities.

[AI-99] softmax is not enough (for sharp out-of-distribution)

Link: https://arxiv.org/abs/2410.01104
Authors: Petar Veličković,Christos Perivolaropoulos,Federico Barbero,Razvan Pascanu
Keywords (EN): make sharp decisions, property of reasoning, ability to make, reasoning systems, key property
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Notes: Comments welcome. 14 pages, 7 figures

Abstract:A key property of reasoning systems is the ability to make sharp decisions on their input data. For contemporary AI systems, a key carrier of sharp behaviour is the softmax function, with its capability to perform differentiable query-key lookups. It is a common belief that the predictive power of networks leveraging softmax arises from “circuits” which sharply perform certain kinds of computations consistently across many diverse inputs. However, for these circuits to be robust, they would need to generalise well to arbitrary valid inputs. In this paper, we dispel this myth: even for tasks as simple as finding the maximum key, any learned circuitry must disperse as the number of items grows at test time. We attribute this to a fundamental limitation of the softmax function to robustly approximate sharp functions, prove this phenomenon theoretically, and propose adaptive temperature as an ad-hoc technique for improving the sharpness of softmax at inference time.
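
The dispersion effect is easy to reproduce numerically. The sketch below assumes one item whose logit exceeds all others by a fixed `gap` (a hypothetical setup, with bounded logits) and shows that the softmax weight on that item shrinks as the number of competing items grows. Lowering the temperature sharpens the output again; the paper's adaptive rule for choosing the temperature is not reproduced here.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def weight_on_max(n_items, gap=5.0, temperature=1.0):
    """Softmax weight on the single largest logit among n_items.

    One logit equals `gap`, the remaining n_items - 1 logits are 0.
    """
    logits = [gap] + [0.0] * (n_items - 1)
    return softmax(logits, temperature)[0]
```

With a fixed gap the weight equals exp(gap) / (exp(gap) + n - 1), so it tends to zero as n grows no matter how large the fixed gap is.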

[AI-100] Approximately Aligned Decoding

Link: https://arxiv.org/abs/2410.01103
Authors: Daniel Melcer,Sujan Gonugondla,Pramuditha Perera,Haifeng Qian,Wen-Hao Chiang,Yanjun Wang,Nihal Jain,Pranav Garg,Xiaofei Ma,Anoop Deoras
Keywords (EN): Large Language Models, Language Models, Large Language, reject undesired outputs, amount of computation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 9 pages main, 22 pages total

Abstract:It is common to reject undesired outputs of Large Language Models (LLMs); however, current methods to do so require an excessive amount of computation, or severely distort the distribution of outputs. We present a method to balance the distortion of the output distribution with computational efficiency, allowing for the generation of long sequences of text with difficult-to-satisfy constraints, with less amplification of low probability outputs compared to existing methods. We show through a series of experiments that the task-specific performance of our method is comparable to methods that do not distort the output distribution, while being much more computationally efficient.
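
The trade-off the abstract describes can be illustrated on a toy model. The sketch compares two existing baselines, not the paper's method: conditioning the model on the constraint (what rejection sampling converges to, at the cost of wasted samples) and greedy per-step token masking (cheap, but it distorts the distribution). The two-token vocabulary, its probabilities, and the forbidden substring are all hypothetical.

```python
from itertools import product

TOKEN_PROBS = {"a": 0.9, "b": 0.1}  # hypothetical i.i.d. token model
LENGTH = 2                          # generate sequences of two tokens


def violates(text: str) -> bool:
    """Hypothetical constraint: the substring "aa" is forbidden."""
    return "aa" in text


def ideal_distribution():
    """Model distribution conditioned on the constraint (rejection-sampling target)."""
    weights = {}
    for tokens in product(TOKEN_PROBS, repeat=LENGTH):
        text = "".join(tokens)
        if not violates(text):
            prob = 1.0
            for token in tokens:
                prob *= TOKEN_PROBS[token]
            weights[text] = prob
    total = sum(weights.values())
    return {text: w / total for text, w in weights.items()}


def masked_distribution():
    """Per-step masking: drop tokens that would violate, renormalise, continue."""
    dist = {"": 1.0}
    for _ in range(LENGTH):
        extended = {}
        for prefix, prob in dist.items():
            allowed = {t: q for t, q in TOKEN_PROBS.items()
                       if not violates(prefix + t)}
            norm = sum(allowed.values())
            for token, q in allowed.items():
                extended[prefix + token] = prob * q / norm
        dist = extended
    return dist
```

In this toy case masking puts probability 0.9 on "ab", whereas the conditioned model gives it 0.09 / 0.19, roughly 0.47: once the first "a" is sampled, the only legal continuation is forced, so an output the model considers unlikely gets amplified.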

[AI-101] Generative AI Application for Building Industry

Link: https://arxiv.org/abs/2410.01098
Authors: Hanlong Wan,Jian Zhang,Yan Chen,Weili Xu,Fan Feng
Keywords (EN): large language models, language models, investigates the transformative, large language, building design optimization
Categories: Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Systems and Control (eess.SY)
Notes: 28 pages, 11 figures, 4 tables

Abstract:This paper investigates the transformative potential of generative AI technologies, particularly large language models (LLMs), within the building industry. By leveraging these advanced AI tools, the study explores their application across key areas such as energy code compliance, building design optimization, and workforce training. The research highlights how LLMs can automate labor-intensive processes, significantly improving efficiency, accuracy, and safety in building practices. The paper also addresses the challenges associated with interpreting complex visual and textual data in architectural plans and regulatory codes, proposing innovative solutions to enhance AI-driven compliance checking and design processes. Additionally, the study considers the broader implications of AI integration, including the development of AI-powered tools for comprehensive code compliance across various regulatory domains and the potential for AI to revolutionize workforce training through realistic simulations. This paper provides a comprehensive analysis of the current capabilities of generative AI in the building industry while outlining future directions for research and development, aiming to pave the way for smarter, more sustainable, and responsive construction practices.

[AI-102] Mechanic Maker: Accessible Game Development Via Symbolic Learning Program Synthesis AAAI

Link: https://arxiv.org/abs/2410.01096
Authors: Megan Sumner,Vardan Saini,Matthew Guzdial
Keywords (EN): highly technical practice, highly technical, Game development, traditionally requires programming, Game
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Notes: 11 pages, 8 figures, AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment

Abstract:Game development is a highly technical practice that traditionally requires programming skills. This serves as a barrier to entry for would-be developers or those hoping to use games as part of their creative expression. While there have been prior game development tools focused on accessibility, they generally still require programming, or have major limitations in terms of the kinds of games they can make. In this paper we introduce Mechanic Maker, a tool for creating a wide-range of game mechanics without programming. It instead relies on a backend symbolic learning system to synthesize game mechanics from examples. We conducted a user study to evaluate the benefits of the tool for participants with a variety of programming and game development experience. Our results demonstrated that participants’ ability to use the tool was unrelated to programming ability. We conclude that tools like ours could help democratize game development, making the practice accessible regardless of programming skills.

[AI-103] Efficient and Private Marginal Reconstruction with Local Non-Negativity NEURIPS2024

Link: https://arxiv.org/abs/2410.01091
Authors: Brett Mullins,Miguel Fuentes,Yingtai Xiao,Daniel Kifer,Cameron Musco,Daniel Sheldon
Keywords (EN): Differential privacy, millions of people, dominant standard, standard for formal, formal and quantifiable
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: To appear at NeurIPS 2024

Abstract:Differential privacy is the dominant standard for formal and quantifiable privacy and has been used in major deployments that impact millions of people. Many differentially private algorithms for query release and synthetic data contain steps that reconstruct answers to queries from answers to other queries measured by the mechanism. Reconstruction is an important subproblem for such mechanisms to economize the privacy budget, minimize error on reconstructed answers, and allow for scalability to high-dimensional datasets. In this paper, we introduce a principled and efficient postprocessing method ReM (Residuals-to-Marginals) for reconstructing answers to marginal queries. Our method builds on recent work on efficient mechanisms for marginal query release, based on making measurements using a residual query basis that admits efficient pseudoinversion, which is an important primitive used in reconstruction. An extension GReM-LNN (Gaussian Residuals-to-Marginals with Local Non-negativity) reconstructs marginals under Gaussian noise satisfying consistency and non-negativity, which often reduces error on reconstructed answers. We demonstrate the utility of ReM and GReM-LNN by applying them to improve existing private query answering mechanisms: ResidualPlanner and MWEM.

[AI-104] From Natural Language to SQL: Review of LLM-based Text-to-SQL Systems

Link: https://arxiv.org/abs/2410.01066
Authors: Ali Mohammadjafari,Anthony S. Maida,Raju Gottumukkala
Keywords (EN): structured SQL commands, translating natural language, natural language queries, structured SQL, SQL commands
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 12 pages, 5 figures, 3 tables

Abstract:Since the onset of LLMs, translating natural language queries to structured SQL commands is assuming increasing importance. Unlike the previous reviews, this survey provides a comprehensive study of the evolution of LLM-based text-to-SQL systems, from early rule-based models to advanced LLM approaches, and how LLMs impacted this field. We discuss benchmarks, evaluation methods and evaluation metrics. Also, we uniquely study the role of integration of knowledge graphs for better contextual accuracy and schema linking in these systems. The current techniques fall into two categories: in-context learning of corpus and fine-tuning, which then leads to approaches such as zero-shot, few-shot learning from the end, and data augmentation. Finally, we highlight key challenges such as computational efficiency, model robustness, and data privacy with perspectives toward their development and improvements in potential areas for the future of LLM-based text-to-SQL systems.

[AI-105] Truth or Deceit? A Bayesian Decoding Game Enhances Consistency and Reliability

Link: https://arxiv.org/abs/2410.01064
Authors: Weitong Zhang, Chengqi Zang, Bernhard Kainz
Keywords: Large Language Models, Large Language, complex scenarios, Language Models, ambiguous or complex
Subjects: Artificial Intelligence (cs.AI)

Abstract: Large Language Models (LLMs) often produce outputs that, though plausible, can lack consistency and reliability, particularly in ambiguous or complex scenarios. Challenges arise from ensuring that outputs align with both factual correctness and human intent. This is problematic in existing approaches that trade improved consistency for lower accuracy. To mitigate these challenges, we propose a novel game-theoretic approach to enhance consistency and reliability during the decoding stage of LLM output generation. Our method models the decoding process as a multistage Bayesian decoding game. This ensures consistency through Correctness Alignment and enhances reliability via Ambiguity Calibration. The model dynamically converges to a consensus on the most reliable outputs and distinguishes Valid from Specious outputs without human feedback or additional training. Our game design allows smaller models to outperform much larger models through game mechanisms (e.g., 78.1 LLaMA13B vs 76.6 PaLM540B), as well as integrating various LLM strategies and models, demonstrating the potential of game-theoretic tools to improve the truthfulness and reliability of LLMs.

[AI-106] RATIONALYST: Pre-training Process-Supervision for Improving Reasoning

Link: https://arxiv.org/abs/2410.01044
Authors: Dongwei Jiang, Guoxuan Wang, Yining Lu, Andrew Wang, Jingyu Zhang, Chuyu Liu, Benjamin Van Durme, Daniel Khashabi
Keywords: frequently left implicit, everyday communication found, mimic logical leaps, logical leaps common, reasoning steps generated
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Our code, data, and model can be found at this repository: this https URL

Abstract: The reasoning steps generated by LLMs might be incomplete, as they mimic logical leaps common in everyday communication found in their pre-training data: underlying rationales are frequently left implicit (unstated). To address this challenge, we introduce RATIONALYST, a model for process-supervision of reasoning based on pre-training on a vast collection of rationale annotations extracted from unlabeled data. We extract 79k rationales from a web-scale unlabelled dataset (the Pile) and a combination of reasoning datasets with minimal human intervention. This web-scale pre-training for reasoning allows RATIONALYST to consistently generalize across diverse reasoning tasks, including mathematical, commonsense, scientific, and logical reasoning. Fine-tuned from LLaMa-3-8B, RATIONALYST improves the accuracy of reasoning by an average of 3.9% on 7 representative reasoning benchmarks. It also demonstrates superior performance compared to significantly larger verifiers like GPT-4 and similarly sized models fine-tuned on matching training sets.

[AI-107] MOSEL: 950000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages EMNLP2024

Link: https://arxiv.org/abs/2410.01036
Authors: Marco Gaido, Sara Papi, Luisa Bentivogli, Alessio Brutti, Mauro Cettolo, Roberto Gretter, Marco Matassoni, Mohamed Nabih, Matteo Negri
Keywords: regulatory efforts addressing, sparked significant interest, coupled with regulatory, risks and impacts, rise of foundation
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted at EMNLP 2024 Main Conference

Abstract:The rise of foundation models (FMs), coupled with regulatory efforts addressing their risks and impacts, has sparked significant interest in open-source models. However, existing speech FMs (SFMs) fall short of full compliance with the open-source principles, even if claimed otherwise, as no existing SFM has model weights, code, and training data publicly available under open-source terms. In this work, we take the first step toward filling this gap by focusing on the 24 official languages of the European Union (EU). We collect suitable training data by surveying automatic speech recognition datasets and unlabeled speech corpora under open-source compliant licenses, for a total of 950k hours. Additionally, we release automatic transcripts for 441k hours of unlabeled data under the permissive CC-BY license, thereby facilitating the creation of open-source SFMs for the EU languages.

[AI-108] Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you! EMNLP2024

Link: https://arxiv.org/abs/2410.01023
Authors: Jiwan Chung, Seungwon Lim, Jaehyun Jeon, Seungbeen Lee, Youngjae Yu
Keywords: Humans possess multimodal, actively integrate information, Humans possess, form reasoning, actively integrate
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted as main paper in EMNLP 2024

Abstract:Humans possess multimodal literacy, allowing them to actively integrate information from various modalities to form reasoning. Faced with challenges like lexical ambiguity in text, we supplement this with other modalities, such as thumbnail images or textbook illustrations. Is it possible for machines to achieve a similar multimodal understanding capability? In response, we present Understanding Pun with Image Explanations (UNPIE), a novel benchmark designed to assess the impact of multimodal inputs in resolving lexical ambiguities. Puns serve as the ideal subject for this evaluation due to their intrinsic ambiguity. Our dataset includes 1,000 puns, each accompanied by an image that explains both meanings. We pose three multimodal challenges with the annotations to assess different aspects of multimodal literacy; Pun Grounding, Disambiguation, and Reconstruction. The results indicate that various Socratic Models and Visual-Language Models improve over the text-only models when given visual context, particularly as the complexity of the tasks increases.

[AI-109] Robust Guided Diffusion for Offline Black-Box Optimization

Link: https://arxiv.org/abs/2410.00983
Authors: Can (Sam) Chen, Christopher Beckham, Zixuan Liu, Xue Liu, Christopher Pal
Keywords: Offline black-box optimization, black-box optimization aims, measured properties, aims to maximize, dataset of designs
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 21 pages

Abstract: Offline black-box optimization aims to maximize a black-box function using an offline dataset of designs and their measured properties. Two main approaches have emerged: the forward approach, which learns a mapping from input to its value, thereby acting as a proxy to guide optimization, and the inverse approach, which learns a mapping from value to input for conditional generation. (a) Although proxy-free (classifier-free) diffusion shows promise in robustly modeling the inverse mapping, it lacks explicit guidance from proxies, essential for generating high-performance samples beyond the training distribution. Therefore, we propose proxy-enhanced sampling, which utilizes the explicit guidance from a trained proxy to bolster proxy-free diffusion with enhanced sampling control. (b) Yet, the trained proxy is susceptible to out-of-distribution issues. To address this, we devise the module diffusion-based proxy refinement, which seamlessly integrates insights from proxy-free diffusion back into the proxy for refinement. To sum up, we propose Robust Guided Diffusion for Offline Black-box Optimization (RGD), combining the advantages of proxy (explicit guidance) and proxy-free diffusion (robustness) for effective conditional generation. RGD achieves state-of-the-art results on various design-bench tasks, underscoring its efficacy. Our code is at this https URL.

[AI-110] Heterogeneous sound classification with the Broad Sound Taxonomy and Dataset

Link: https://arxiv.org/abs/2410.00980
Authors: Panagiota Anastasopoulou, Jessica Torrey, Xavier Serra, Frederic Font
Keywords: Automatic sound classification, enabling context-aware sound, context-aware sound processing, Automatic sound, enabling context-aware
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: DCASE2024, post-print, 5 pages, 2 figures

Abstract:Automatic sound classification has a wide range of applications in machine listening, enabling context-aware sound processing and understanding. This paper explores methodologies for automatically classifying heterogeneous sounds characterized by high intra-class variability. Our study evaluates the classification task using the Broad Sound Taxonomy, a two-level taxonomy comprising 28 classes designed to cover a heterogeneous range of sounds with semantic distinctions tailored for practical user applications. We construct a dataset through manual annotation to ensure accuracy, diverse representation within each class and relevance in real-world scenarios. We compare a variety of both traditional and modern machine learning approaches to establish a baseline for the task of heterogeneous sound classification. We investigate the role of input features, specifically examining how acoustically derived sound representations compare to embeddings extracted with pre-trained deep neural networks that capture both acoustic and semantic information about sounds. Experimental results illustrate that audio embeddings encoding acoustic and semantic information achieve higher accuracy in the classification task. After careful analysis of classification errors, we identify some underlying reasons for failure and propose actions to mitigate them. The paper highlights the need for deeper exploration of all stages of classification, understanding the data and adopting methodologies capable of effectively handling data complexity and generalizing in real-world sound environments.

[AI-111] Towards Full-parameter and Parameter-efficient Self-learning For Endoscopic Camera Depth Estimation ECCV2024

Link: https://arxiv.org/abs/2410.00979
Authors: Shuting Zhao, Chenkang Du, Kristin Qi, Xinrong Chen, Xinhan Di
Keywords: adapt depth foundation, depth estimation recently, Adaptation methods, endoscopic depth estimation, depth foundation models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: WiCV @ ECCV 2024

Abstract: Adaptation methods have recently been developed to adapt depth foundation models to endoscopic depth estimation. However, such approaches typically under-perform training since they limit the parameter search to a low-rank subspace and alter the training dynamics. Therefore, we propose a full-parameter and parameter-efficient learning framework for endoscopic depth estimation. In the first stage, the subspaces of attention, convolution and multi-layer perceptron layers are adapted simultaneously within different sub-spaces. In the second stage, a memory-efficient optimization is proposed for subspace composition, and the performance is further improved in the united sub-space. Initial experiments on the SCARED dataset demonstrate that results at the first stage improve the performance from 10.2% to 4.1% for Sq Rel, Abs Rel, RMSE and RMSE log in comparison with the state-of-the-art models.

[AI-112] ACEV: Unsupervised Intersecting Manifold Segmentation using Adaptation to Angular Change of Eigenvectors in Intrinsic Dimension

Link: https://arxiv.org/abs/2410.00930
Authors: Subhadip Boral, Rikathi Pal, Ashish Ghosh
Keywords: Intersecting manifold segmentation, Intersecting manifold, data points, focus of research, distinct properties
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Geometry (cs.CG)
Comments: 14 pages, 7 figures, 7 tables

Abstract:Intersecting manifold segmentation has been a focus of research, where individual manifolds, that intersect with other manifolds, are separated to discover their distinct properties. The proposed method is based on the intuition that when a manifold in D dimensional space with an intrinsic dimension of d intersects with another manifold, the data variance grows in more than d directions. The proposed method measures local data variances and determines their vector directions. It counts the number of vectors with non-zero variance, which determines the manifold’s intrinsic dimension. For detection of the intersection region, the method adapts to the changes in the angular gaps between the corresponding direction vectors of the child and parent using exponential moving averages using a tree structure construction. Accordingly, it includes those data points in the same manifold whose neighborhood is within the adaptive angular difference and eventually identifies the data points in the intersection area of manifolds. Data points whose inclusion in the neighborhood-identified data points increases their intrinsic dimensionality are removed based on data variance and distance. The proposed method performs better than 18 SOTA manifold segmentation methods in ARI and NMI scores over 14 real-world datasets with lesser time complexity and better stability.
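
The variance-counting intuition in this abstract can be sketched with local PCA. This is only an illustrative toy and not the ACEV algorithm itself: the function name, the neighbourhood size `k`, the relative threshold `rel_tol`, and the two synthetic sheets are all assumptions made for the example, and the angular-change and exponential-moving-average steps are not reproduced.

```python
import numpy as np

def local_intrinsic_dimension(points, center_index, k=20, rel_tol=1e-6):
    """Estimate the intrinsic dimension around one point via local PCA."""
    center = points[center_index]
    # k nearest neighbours by Euclidean distance (the point itself is included)
    dists = np.linalg.norm(points - center, axis=1)
    neighbours = points[np.argsort(dists)[:k]]
    # Eigenvalues of the local covariance are the variances along principal directions
    variances = np.linalg.eigvalsh(np.cov(neighbours, rowvar=False))
    # Count directions whose variance is non-negligible relative to the largest one
    return int(np.sum(variances > rel_tol * variances.max()))

rng = np.random.default_rng(0)
# A flat 2-D sheet in 3-D space (z = 0) ...
plane = np.column_stack([rng.uniform(-1, 1, 500), rng.uniform(-1, 1, 500), np.zeros(500)])
# ... and a second sheet (x = 0) that intersects it along the y-axis
other = np.column_stack([np.zeros(500), rng.uniform(-1, 1, 500), rng.uniform(-1, 1, 500)])
data = np.vstack([plane, other])

far = int(np.argmax(np.abs(plane[:, 0])))   # point of the first sheet far from the intersection
near = int(np.argmin(np.abs(plane[:, 0])))  # point of the first sheet close to the intersection
print(local_intrinsic_dimension(data, far), local_intrinsic_dimension(data, near))
```

On this toy data the estimate is 2 away from the intersection and rises to 3 next to it, which is the "variance grows in more than d directions" signal the abstract describes.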

[AI-113] A Knowledge-Informed Large Language Model Framework for U.S. Nuclear Power Plant Shutdown Initiating Event Classification for Probabilistic Risk Assessment

Link: https://arxiv.org/abs/2410.00929
Authors: Min Xian, Tao Wang, Sai Zhang, Fei Xu, Zhegang Ma
Keywords: low power shutdown, power shutdown probabilistic, nuclear power plants, classifying shutdown initiating, developing low power
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract: Identifying and classifying shutdown initiating events (SDIEs) is critical for developing low power shutdown probabilistic risk assessment for nuclear power plants. Existing computational approaches cannot achieve satisfactory performance due to the challenges of unavailable large, labeled datasets, imbalanced event types, and label noise. To address these challenges, we propose a hybrid pipeline that integrates a knowledge-informed machine learning model to prescreen non-SDIEs and a large language model (LLM) to classify SDIEs into four types. In the prescreening stage, we propose a set of 44 SDIE text patterns that consist of the most salient keywords and phrases from six SDIE types. Text vectorization based on the SDIE patterns generates feature vectors that are highly separable by using a simple binary classifier. The second stage builds a Bidirectional Encoder Representations from Transformers (BERT)-based LLM, which learns generic English language representations from self-supervised pretraining on a large dataset and adapts to SDIE classification by fine-tuning it on an SDIE dataset. The proposed approaches are evaluated on a dataset with 10,928 events using precision, recall ratio, F1 score, and average accuracy. The results demonstrate that the prescreening stage can exclude more than 97% non-SDIEs, and the LLM achieves an average accuracy of 93.4% for SDIE classification.

[AI-114] Optimistic Games for Combinatorial Bayesian Optimization with Application to Protein Design

Link: https://arxiv.org/abs/2409.18582
Authors: Melis Ilayda Bal, Pier Giuseppe Sessa, Mojmir Mutny, Andreas Krause
Keywords: Bayesian optimization, optimize black-box, sequential interactions, powerful framework, framework to optimize
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Biomolecules (q-bio.BM); Quantitative Methods (q-bio.QM)

Abstract: Bayesian optimization (BO) is a powerful framework to optimize black-box expensive-to-evaluate functions via sequential interactions. In several important problems (e.g. drug discovery, circuit design, neural architecture search, etc.), though, such functions are defined over large combinatorial and unstructured spaces. This makes existing BO algorithms not feasible due to the intractable maximization of the acquisition function over these domains. To address this issue, we propose GameOpt, a novel game-theoretical approach to combinatorial BO. GameOpt establishes a cooperative game between the different optimization variables, and selects points that are game equilibria of an upper confidence bound acquisition function. These are stable configurations from which no variable has an incentive to deviate, analogous to local optima in continuous domains. Crucially, this allows us to efficiently break down the complexity of the combinatorial domain into individual decision sets, making GameOpt scalable to large combinatorial spaces. We demonstrate the application of GameOpt to the challenging protein design problem and validate its performance on four real-world protein datasets. Each protein can take up to 20^X possible configurations, where X is the length of a protein, making standard BO methods infeasible. Instead, our approach iteratively selects informative protein configurations and very quickly discovers highly active protein variants compared to other baselines.
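
The idea of an equilibrium "from which no variable has an incentive to deviate" can be illustrated with a small best-response loop. This is a minimal sketch under stated assumptions and not the GameOpt implementation: `toy_score` is a made-up stand-in for the upper confidence bound acquisition function, and the sequence length `X = 6`, the random tables, and all function names are hypothetical.

```python
import numpy as np

AMINO = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
X = 6                           # toy sequence length (assumption)
INDEX = {a: i for i, a in enumerate(AMINO)}

rng = np.random.default_rng(1)
site = rng.normal(size=(X, 20))                    # per-position preferences (made up)
pair = rng.normal(scale=0.5, size=(X - 1, 20, 20)) # neighbour interactions (made up)

def toy_score(config):
    """Hypothetical stand-in for an acquisition function over sequences."""
    idx = [INDEX[a] for a in config]
    total = sum(site[p, i] for p, i in enumerate(idx))
    total += sum(pair[p, idx[p], idx[p + 1]] for p in range(X - 1))
    return float(total)

def best_response_equilibrium(score, start, alphabet):
    """Let each position pick its best symbol in turn until nobody wants to change."""
    config = list(start)
    changed = True
    while changed:  # terminates: the shared score strictly increases on every change
        changed = False
        for pos in range(len(config)):
            best_symbol, best_value = config[pos], score(config)
            for symbol in alphabet:
                trial = config.copy()
                trial[pos] = symbol
                value = score(trial)
                if value > best_value:
                    best_symbol, best_value = symbol, value
            if best_symbol != config[pos]:
                config[pos] = best_symbol
                changed = True
    return config

equilibrium = best_response_equilibrium(toy_score, "A" * X, AMINO)
print("".join(equilibrium), toy_score(equilibrium), 20 ** X)
```

Each sweep evaluates only 20 candidates per position, whereas exhaustive search over this toy space would need 20^6 = 64,000,000 evaluations; the price is that the result is only guaranteed to be stable against single-position changes.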

[AI-115] Towards a vision foundation model for comprehensive assessment of Cardiac MRI

Link: https://arxiv.org/abs/2410.01665
Authors: Athira J Jacob, Indraneel Borgohain, Teodora Chitiboi, Puneet Sharma, Dorin Comaniciu, Daniel Rueckert
Keywords: Cardiac magnetic resonance, complex modality requiring, noninvasive cardiac assessment, magnetic resonance imaging, Cardiac magnetic
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 3 figures, 4 tables

Abstract: Cardiac magnetic resonance imaging (CMR), considered the gold standard for noninvasive cardiac assessment, is a diverse and complex modality requiring a wide variety of image processing tasks for comprehensive assessment of cardiac morphology and function. Advances in deep learning have enabled the development of state-of-the-art (SoTA) models for these tasks. However, model training is challenging due to data and label scarcity, especially in the less common imaging sequences. Moreover, each model is often trained for a specific task, with no connection between related tasks. In this work, we introduce a vision foundation model for CMR assessment that is trained in a self-supervised fashion on 36 million CMR images. We then finetune the model in a supervised way for 9 clinical tasks typical of a CMR workflow, across classification, segmentation, landmark localization, and pathology detection. We demonstrate improved accuracy and robustness across all tasks, over a range of available labeled dataset sizes. We also demonstrate improved few-shot learning with fewer labeled samples, a common challenge in medical image analyses. We achieve an out-of-box performance comparable to SoTA for most clinical tasks. The proposed method thus presents a resource-efficient, unified framework for CMR assessment, with the potential to accelerate the development of deep learning-based solutions for image analysis tasks, even with few annotated data available.

[AI-116] Imaging foundation model for universal enhancement of non-ideal measurement CT

Link: https://arxiv.org/abs/2410.01591
Authors: Yuxin Liu, Rongjun Ge, Yuting He, Zhan Wu, Chenyu You, Shuo Li, Yang Chen
Keywords: measurement computed tomography, sacrifices optimal imaging, optimal imaging standards, Non-ideal measurement computed, NICT enhancement
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Non-ideal measurement computed tomography (NICT), which sacrifices optimal imaging standards for new advantages in CT imaging, is expanding the clinical application scope of CT images. However, with the reduction of imaging standards, the image quality has also been reduced, extremely limiting the clinical acceptability. Although numerous studies have demonstrated the feasibility of deep learning for the NICT enhancement in specific scenarios, their high data cost and limited generalizability have become large obstacles. The recent research on the foundation model has brought new opportunities for building a universal NICT enhancement model - bridging the image quality degradation with minimal data cost. However, owing to the challenges in the collection of large pre-training datasets and the compatibility of data variation, no success has been reported. In this paper, we propose a multi-scale integrated Transformer AMPlifier (TAMP), the first imaging foundation model for universal NICT enhancement. It has been pre-trained on a large-scale physical-driven simulation dataset with 3.6 million NICT-ICT image pairs, and is able to directly generalize to the NICT enhancement tasks with various non-ideal settings and body regions. Via the adaptation with few data, it can further achieve professional performance in real-world specific scenarios. Our extensive experiments have demonstrated that the proposed TAMP has significant potential for promoting the exploration and application of NICT and serving a wider range of medical scenarios.

[AI-117] One Wave to Explain Them All: A Unifying Perspective on Post-hoc Explainability

Link: https://arxiv.org/abs/2410.01482
Authors: Gabriel Kasmi, Amandine Brunetto, Thomas Fel, Jayneel Parekh
Keywords: deep neural networks, black-box nature hinders, nature hinders transparency, inherent black-box nature, safety-critical decision-making
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: main: 10 pages, appendix: 14 pages, 5 Tables, 25 Figures

Abstract:Despite the growing use of deep neural networks in safety-critical decision-making, their inherent black-box nature hinders transparency and interpretability. Explainable AI (XAI) methods have thus emerged to understand a model’s internal workings, and notably attribution methods also called saliency maps. Conventional attribution methods typically identify the locations – the where – of significant regions within an input. However, because they overlook the inherent structure of the input data, these methods often fail to interpret what these regions represent in terms of structural components (e.g., textures in images or transients in sounds). Furthermore, existing methods are usually tailored to a single data modality, limiting their generalizability. In this paper, we propose leveraging the wavelet domain as a robust mathematical foundation for attribution. Our approach, the Wavelet Attribution Method (WAM) extends the existing gradient-based feature attributions into the wavelet domain, providing a unified framework for explaining classifiers across images, audio, and 3D shapes. Empirical evaluations demonstrate that WAM matches or surpasses state-of-the-art methods across faithfulness metrics and models in image, audio, and 3D explainability. Finally, we show how our method explains not only the where – the important parts of the input – but also the what – the relevant patterns in terms of structural components.

[AI-118] On the Convergence of FedProx with Extrapolation and Inexact Prox

Link: https://arxiv.org/abs/2410.01410
Authors: Hanmin Li, Peter Richtárik
Keywords: federated learning algorithm, FedProx federated learning, Enhancing the FedProx, learning algorithm, FedProx federated
Subjects: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
Comments: 36 pages, 6 figures

Abstract: Enhancing the FedProx federated learning algorithm (Li et al., 2020) with server-side extrapolation, Li et al. (2024a) recently introduced the FedExProx method. Their theoretical analysis, however, relies on the assumption that each client computes a certain proximal operator exactly, which is impractical since this is virtually never possible to do in real settings. In this paper, we investigate the behavior of FedExProx without this exactness assumption in the smooth and globally strongly convex setting. We establish a general convergence result, showing that inexactness leads to convergence to a neighborhood of the solution. Additionally, we demonstrate that, with careful control, the adverse effects of this inexactness can be mitigated. By linking inexactness to biased compression (Beznosikov et al., 2023), we refine our analysis, highlighting robustness of extrapolation to inexact proximal updates. We also examine the local iteration complexity required by each client to achieve the required level of inexactness using various local optimizers. Our theoretical insights are validated through comprehensive numerical experiments.
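
To make the terms "proximal operator", "extrapolation", and "inexact prox" concrete, here is a minimal one-dimensional sketch. It rests on several assumptions that are not in the abstract: clients hold quadratics f_i(z) = 0.5 * a_i * (z - b)^2 with a shared minimizer b, the server update is taken to be x <- x + alpha * (mean_i prox_i(x) - x), and inexactness is modelled as a few damped gradient steps on the prox subproblem. It does not attempt to reproduce the paper's convergence bounds or the neighborhood effect it analyses.

```python
import numpy as np

def prox_exact(x, a, b, gamma):
    """Closed-form prox of gamma*f at x for f(z) = 0.5*a*(z-b)^2."""
    return (x + gamma * a * b) / (1.0 + gamma * a)

def prox_inexact(x, a, b, gamma, steps):
    """Approximate the same prox with a few damped gradient steps."""
    z = x
    step_size = 0.5 / (a + 1.0 / gamma)  # half of 1/L for the prox subproblem
    for _ in range(steps):
        grad = a * (z - b) + (z - x) / gamma  # gradient of f(z) + (z-x)^2 / (2*gamma)
        z = z - step_size * grad
    return z

def run(prox, alpha, rounds, x0=5.0, gamma=1.0):
    """Server loop with extrapolation parameter alpha; returns the final error."""
    curvatures = np.array([1.0, 2.0, 4.0])  # one toy client per curvature (assumption)
    minimizer = 1.0                         # shared minimizer of all clients (assumption)
    x = x0
    for _ in range(rounds):
        local = np.array([prox(x, a, minimizer, gamma) for a in curvatures])
        x = x + alpha * (local.mean() - x)  # alpha = 1 is plain averaging of prox points
    return abs(x - minimizer)

def two_step_prox(x, a, b, gamma):
    return prox_inexact(x, a, b, gamma, steps=2)

print(run(prox_exact, 1.0, 5), run(prox_exact, 1.5, 5), run(two_step_prox, 1.5, 30))
```

In this toy setting an extrapolation factor above 1 shrinks the error faster than plain averaging, and the truncated local solver still converges, only more slowly per round.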

[AI-119] Transformers Handle Endogeneity in In-Context Linear Regression

Link: https://arxiv.org/abs/2410.01265
Authors: Haodong Liang, Krishnakumar Balasubramanian, Lifeng Lai
Keywords: in-context linear regression, linear regression, explore the capability, address endogeneity, handle endogeneity effectively
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Econometrics (econ.EM); Statistics Theory (math.ST)
Comments: 30 pages

Abstract: We explore the capability of transformers to address endogeneity in in-context linear regression. Our main finding is that transformers inherently possess a mechanism to handle endogeneity effectively using instrumental variables (IV). First, we demonstrate that the transformer architecture can emulate a gradient-based bi-level optimization procedure that converges to the widely used two-stage least squares (2SLS) solution at an exponential rate. Next, we propose an in-context pretraining scheme and provide theoretical guarantees showing that the global minimizer of the pre-training loss achieves a small excess loss. Our extensive experiments validate these theoretical findings, showing that the trained transformer provides more robust and reliable in-context predictions and coefficient estimates than the 2SLS method, in the presence of endogeneity.
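
For readers unfamiliar with the 2SLS baseline mentioned here, the following is a minimal NumPy sketch of classical two-stage least squares on synthetic data. It is not the paper's transformer or its bi-level optimization; the data-generating process, sample size, and coefficient values are assumptions chosen only to make the endogeneity bias visible.

```python
import numpy as np

def two_stage_least_squares(y, x, z):
    """Classical 2SLS: project x onto the instruments z, then regress y on the projection."""
    # Stage 1: regress the endogenous regressors on the instruments
    first_stage, *_ = np.linalg.lstsq(z, x, rcond=None)
    x_hat = z @ first_stage
    # Stage 2: regress the outcome on the fitted regressors
    beta, *_ = np.linalg.lstsq(x_hat, y, rcond=None)
    return beta

rng = np.random.default_rng(0)
n = 20000
true_beta = 2.0
z = rng.normal(size=(n, 2))   # instruments: affect x, not y directly
u = rng.normal(size=n)        # unobserved confounder that drives both x and y
x = (z @ np.array([1.0, -1.0]) + u + rng.normal(size=n)).reshape(-1, 1)
y = true_beta * x[:, 0] + 3.0 * u + rng.normal(size=n)

beta_ols = np.linalg.lstsq(x, y, rcond=None)[0]  # biased because x correlates with u
beta_iv = two_stage_least_squares(y, x, z)
print(beta_ols[0], beta_iv[0])
```

Because the confounder `u` enters both `x` and `y`, ordinary least squares overestimates the coefficient, while the instrumented estimate lands close to the true value of 2.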

[AI-120] An uncertainty-aware Digital Shadow for underground multimodal CO2 storage monitoring

Link: https://arxiv.org/abs/2410.01218
Authors: Abhinav Prakash Gahlot, Rafael Orozco, Ziyi Yin, Felix J. Herrmann
Keywords: uncertainty-aware Digital Shadow, scalable Digital Shadow, Digital Shadows uncertainty, Digital Shadows neural, Ensemble Bayesian Filtering
Subjects: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)

Abstract: Geological Carbon Storage (GCS) is arguably the only scalable net-negative CO2 emission technology available. While promising, subsurface complexities and heterogeneity of reservoir properties demand a systematic approach to quantify uncertainty when optimizing production and mitigating storage risks, which include assurances of Containment and Conformance of injected supercritical CO2. As a first step towards the design and implementation of a Digital Twin for monitoring underground storage operations, a machine learning based data-assimilation framework is introduced and validated on carefully designed realistic numerical simulations. As our implementation is based on Bayesian inference but does not yet support control and decision-making, we coin our approach an uncertainty-aware Digital Shadow. To characterize the posterior distribution for the state of CO2 plumes conditioned on multi-modal time-lapse data, the envisioned Shadow combines techniques from Simulation-Based Inference (SBI) and Ensemble Bayesian Filtering to establish probabilistic baselines and assimilate multi-modal data for GCS problems that are challenged by large degrees of freedom, nonlinear multi-physics, non-Gaussianity, and computationally expensive to evaluate fluid flow and seismic simulations. To enable SBI for dynamic systems, a recursive scheme is proposed where the Digital Shadow's neural networks are trained on simulated ensembles for their state and observed data (well and/or seismic). Once training is completed, the system's state is inferred when time-lapse field data becomes available. In this computational study, we observe that a lack of knowledge on the permeability field can be factored into the Digital Shadow's uncertainty quantification. To our knowledge, this work represents the first proof of concept of an uncertainty-aware, in-principle scalable Digital Shadow.

[AI-121] RS-FME-SwinT: A Novel Feature Map Enhancement Framework Integrating Customized SwinT with Residual and Spatial CNN for Monkeypox Diagnosis

Link: https://arxiv.org/abs/2410.01216
Authors: Saddam Hussain Khan, Rashid Iqbal (Artificial Intelligence Lab, Department of Computer Systems Engineering, University of Engineering and Applied Sciences (UEAS), Swat, Pakistan)
Keywords: steadily increasing daily, cases steadily increasing, significant global concern, increasing daily, cases steadily
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 37 Pages, 5 Tables, 10 Figures

Abstract: Monkeypox (MPox) has emerged as a significant global concern, with cases steadily increasing daily. Conventional detection methods, including polymerase chain reaction (PCR) and manual examination, exhibit challenges of low sensitivity, high cost, and substantial workload. Deep learning therefore offers an automated solution; however, the available datasets pose challenges of data scarcity, texture and contrast variation, inter- and intra-class variability, and similarities with other infectious skin diseases. In this regard, a novel hybrid approach is proposed that integrates the learning capacity of Residual Learning and Spatial Exploitation Convolutional Neural Network (CNN) with a customized Swin Transformer (RS-FME-SwinT) to capture multi-scale global and local correlated features for MPox diagnosis. The proposed RS-FME-SwinT technique employs a transfer learning-based feature map enhancement (FME) technique, integrating the customized SwinT for global information capture, residual blocks for texture extraction, and spatial blocks for local contrast variations. Moreover, incorporating new inverse residual blocks within the proposed SwinT effectively captures local patterns and mitigates vanishing gradients. The proposed RS-FME-SwinT has strong learning potential for diverse features that systematically reduce intra-class MPox variation and enable precise discrimination from other skin diseases. Finally, the proposed RS-FME-SwinT is holdout cross-validated on a diverse MPox dataset and outperforms state-of-the-art CNNs and ViTs. The proposed RS-FME-SwinT achieves an accuracy of 97.80%, sensitivity of 96.82%, precision of 98.06%, and an F-score of 97.44% in MPox detection. The RS-FME-SwinT could be a valuable tool for healthcare practitioners, enabling prompt and accurate MPox diagnosis and contributing significantly to mitigation efforts.

[AI-122] A versatile machine learning workflow for high-throughput analysis of supported metal catalyst particles

Link: https://arxiv.org/abs/2410.01213
Authors: Arda Genc, Justin Marlowe, Anika Jalil, Libor Kovarik, Phillip Christopher
Keywords: Accurate and efficient, characterization of nanoparticles, efficient characterization, essential for advancing, advancing our understanding
Subjects: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)

Abstract: Accurate and efficient characterization of nanoparticles (NPs), particularly regarding particle size distribution, is essential for advancing our understanding of their structure-property relationships and facilitating their design for various applications. In this study, we introduce a novel two-stage artificial intelligence (AI)-driven workflow for NP analysis that leverages prompt engineering techniques from state-of-the-art single-stage object detection and large-scale vision transformer (ViT) architectures. This methodology was applied to transmission electron microscopy (TEM) and scanning TEM (STEM) images of heterogeneous catalysts, enabling high-resolution, high-throughput analysis of particle size distributions for supported metal catalysts. The model's performance in detecting and segmenting NPs was validated across diverse heterogeneous catalyst systems, including various metals (Cu, Ru, Pt, and PtCo), supports (silica (SiO₂), γ-alumina (γ-Al₂O₃), and carbon black), and particle diameter size distributions with means and standard deviations of 2.9 ± 1.1 nm, 1.6 ± 0.2 nm, 9.7 ± 4.6 nm, and 4 ± 1.0 nm. Additionally, the proposed machine learning (ML) approach successfully detects and segments overlapping NPs anchored on non-uniform catalytic support materials, providing critical insights into their spatial arrangements and interactions. Our AI-assisted NP analysis workflow demonstrates robust generalization across diverse datasets and can be readily applied to similar NP segmentation tasks without requiring costly model retraining.

[AI-123] Augmentation through Laundering Attacks for Audio Spoof Detection

Link: https://arxiv.org/abs/2410.01108
Authors: Hashim Ali, Surya Subramani, Hafiz Malik
Keywords: made voice cloning, including Joe Biden, developments have made, voice cloning, easily accessible
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)

Abstract:Recent text-to-speech (TTS) developments have made voice cloning (VC) more realistic, affordable, and easily accessible. This has given rise to many potential abuses of this technology, including Joe Biden’s New Hampshire deepfake robocall. Several methodologies have been proposed to detect such clones. However, these methodologies have been trained and evaluated on relatively clean databases. Recently, the ASVspoof 5 Challenge introduced a new crowd-sourced database of diverse acoustic conditions, including various spoofing attacks and codec conditions. This paper is our submission to the ASVspoof 5 Challenge and aims to investigate the performance of Audio Spoof Detection, trained using data augmentation through laundering attacks, on the ASVspoof 5 database. The results demonstrate that our system performs worst on the A18, A19, A20, A26, and A30 spoofing attacks and in the codec and compression conditions C08, C09, and C10.

[AI-124] GAMMA-PD: Graph-based Analysis of Multi-Modal Motor Impairment Assessments in Parkinson’s Disease MICCAI2024

Link: https://arxiv.org/abs/2410.00944
Authors: Favour Nerrise (1), Alice Louise Heiman (2), Ehsan Adeli (2, 3) ((1) Department of Electrical Engineering, Stanford University, Stanford, CA, USA, (2) Department of Computer Science, Stanford University, Stanford, CA, USA, (3) Department of Psychiatry and Behavioral Sciences, Stanford University, Stanford, CA, USA)
Keywords: electronic health records, health records, multi-modal medical data, rapid advancement, technology has led
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Neurons and Cognition (q-bio.NC)
Comments: Accepted by the 6th Workshop on GRaphs in biomedicAl Image anaLysis (GRAIL) at the 27th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2024). 12 pages, 3 figures, 2 tables, Source Code: this https URL

Abstract:The rapid advancement of medical technology has led to an exponential increase in multi-modal medical data, including imaging, genomics, and electronic health records (EHRs). Graph neural networks (GNNs) have been widely used to represent this data due to their prominent performance in capturing pairwise relationships. However, the heterogeneity and complexity of multi-modal medical data still pose significant challenges for standard GNNs, which struggle with learning higher-order, non-pairwise relationships. This paper proposes GAMMA-PD (Graph-based Analysis of Multi-modal Motor Impairment Assessments in Parkinson’s Disease), a novel heterogeneous hypergraph fusion framework for multi-modal clinical data analysis. GAMMA-PD integrates imaging and non-imaging data into a “hypernetwork” (patient population graph) by preserving higher-order information and similarity between patient profiles and symptom subtypes. We also design a feature-based attention-weighted mechanism to interpret feature-level contributions towards downstream decision tasks. We evaluate our approach with clinical data from the Parkinson’s Progression Markers Initiative (PPMI) and a private dataset. We demonstrate gains in predicting motor impairment symptoms in Parkinson’s disease. Our end-to-end framework also learns associations between subsets of patient characteristics to generate clinically relevant explanations for disease and symptom profiles. The source code is available at this https URL.

[AI-125] StreamEnsemble: Predictive Queries over Spatiotemporal Streaming Data

Link: https://arxiv.org/abs/2410.00933
Authors: Anderson Chaves, Eduardo Ogasawara, Patrick Valduriez, Fabio Porto
Keywords: stream data pose, processing and analysis, Predictive queries, machine learning, queries over spatiotemporal
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages

Abstract:Predictive queries over spatiotemporal (ST) stream data pose significant data processing and analysis challenges. ST data streams involve a set of time series whose data distributions may vary in space and time, exhibiting multiple distinct patterns. In this context, assuming a single machine learning model would adequately handle such variations is likely to lead to failure. To address this challenge, we propose StreamEnsemble, a novel approach to predictive queries over ST data that dynamically selects and allocates Machine Learning models according to the underlying time series distributions and model characteristics. Our experimental evaluation reveals that this method markedly outperforms traditional ensemble methods and single model approaches in terms of accuracy and time, demonstrating a significant reduction in prediction error of more than 10 times compared to traditional approaches.
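
The idea of allocating a model per time series according to its distribution can be illustrated with a toy selector. This is a sketch under strong assumptions rather than the StreamEnsemble algorithm: the summary statistics (mean and standard deviation), the L1 distance, and the candidate models `calm` and `stormy` are all hypothetical stand-ins.

```python
import statistics

def summarize(window):
    """Summarize a window of values by its mean and population std deviation."""
    return statistics.mean(window), statistics.pstdev(window)

def select_model(window, candidates):
    """Pick the candidate whose training-data summary is closest to the window."""
    mean, std = summarize(window)

    def distance(candidate):
        c_mean, c_std = candidate["train_summary"]
        return abs(mean - c_mean) + abs(std - c_std)

    return min(candidates, key=distance)

# Hypothetical candidates: each stores a summary of the data it was trained on
# and a trivial predictor standing in for a real learned model.
candidates = [
    {"name": "calm", "train_summary": (0.0, 1.0), "predict": lambda w: w[-1]},
    {"name": "stormy", "train_summary": (10.0, 5.0), "predict": lambda w: sum(w) / len(w)},
]

# One recent window per spatial region; each region gets its own model.
region_windows = {"north": [0.5, -0.3, 0.8, 0.1], "south": [4.0, 12.0, 15.0, 9.0]}
allocation = {
    region: select_model(window, candidates)["name"]
    for region, window in region_windows.items()
}
```

Re-running the selection as new windows arrive is what would make the allocation dynamic; a real system would also weigh model cost and accuracy, which this sketch ignores.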

[AI-126] IBM Quantum Computers: Evolution, Performance, and Future Directions

Link: https://arxiv.org/abs/2410.00916
Authors: M. AbuGhanem
Keywords: promising exponential speedups, IBM Quantum, classical computing limits, Quantum computers represent, Quantum
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)

Abstract:Quantum computers represent a transformative frontier in computational technology, promising exponential speedups beyond classical computing limits. IBM Quantum has led significant advancements in both hardware and software, providing access to quantum hardware via IBM Cloud since 2016, achieving a milestone with the world’s first accessible quantum computer. This article explores IBM’s quantum computing journey, focusing on the development of practical quantum computers. We summarize the evolution and advancements of IBM Quantum’s processors across generations, including their recent breakthrough surpassing the 1,000-qubit barrier. The paper reviews detailed performance metrics across various hardware, tracing their evolution over time and highlighting IBM Quantum’s transition from the noisy intermediate-scale quantum (NISQ) computing era towards fault-tolerant quantum computing capabilities.

Computer Vision

[CV-0] Samba: Synchronized Set-of-Sequences Modeling for Multiple Object Tracking

Link: https://arxiv.org/abs/2410.01806
Authors: Mattia Segu, Luigi Piccinelli, Siyuan Li, Yung-Hsu Yang, Bernt Schiele, Luc Van Gool
Keywords: presents unique challenges, dynamic animal groups, coordinated dance performances, team sports, complex scenarios
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Multiple object tracking in complex scenarios - such as coordinated dance performances, team sports, or dynamic animal groups - presents unique challenges. In these settings, objects frequently move in coordinated patterns, occlude each other, and exhibit long-term dependencies in their trajectories. However, it remains a key open research question on how to model long-range dependencies within tracklets, interdependencies among tracklets, and the associated temporal occlusions. To this end, we introduce Samba, a novel linear-time set-of-sequences model designed to jointly process multiple tracklets by synchronizing the multiple selective state-spaces used to model each tracklet. Samba autoregressively predicts the future track query for each sequence while maintaining synchronized long-term memory representations across tracklets. By integrating Samba into a tracking-by-propagation framework, we propose SambaMOTR, the first tracker effectively addressing the aforementioned issues, including long-range dependencies, tracklet interdependencies, and temporal occlusions. Additionally, we introduce an effective technique for dealing with uncertain observations (MaskObs) and an efficient training recipe to scale SambaMOTR to longer sequences. By modeling long-range dependencies and interactions among tracked objects, SambaMOTR implicitly learns to track objects accurately through occlusions without any hand-crafted heuristics. Our approach significantly surpasses prior state-of-the-art on the DanceTrack, BFT, and SportsMOT datasets.

[CV-1] EVER: Exact Volumetric Ellipsoid Rendering for Real-time View Synthesis

Link: https://arxiv.org/abs/2410.01804
Authors: Alexander Mai, Peter Hedman, George Kopanas, Dor Verbin, David Futschik, Qiangeng Xu, Falko Kuester, Jon Barron, Yinda Zhang
Keywords: Volumetric Ellipsoid Rendering, Exact Volumetric Ellipsoid, present Exact Volumetric, Volumetric Ellipsoid, differentiable emission-only volume
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:We present Exact Volumetric Ellipsoid Rendering (EVER), a method for real-time differentiable emission-only volume rendering. Unlike the recent rasterization-based approach of 3D Gaussian Splatting (3DGS), our primitive-based representation allows for exact volume rendering, rather than alpha compositing 3D Gaussian billboards. As such, unlike 3DGS, our formulation does not suffer from popping artifacts and view-dependent density, but still achieves frame rates of $\sim$30 FPS at 720p on an NVIDIA RTX4090. Since our approach is built upon ray tracing, it enables effects such as defocus blur and camera distortion (e.g., from fisheye cameras), which are difficult to achieve by rasterization. We show that our method is more accurate with fewer blending issues than 3DGS and follow-up work on view-consistent rendering, especially on the challenging large-scale scenes from the Zip-NeRF dataset, where it achieves the sharpest results among real-time techniques.

[CV-2] FabricDiffusion: High-Fidelity Texture Transfer for 3D Garments Generation from In-The-Wild Clothing Images SIGGRAPH

Link: https://arxiv.org/abs/2410.01801
Authors: Cheng Zhang, Yuanhao Wang, Francisco Vicente Carrasco, Chenglei Wu, Jinlong Yang, Thabo Beeler, Fernando De la Torre
Keywords: transferring fabric textures, transferring fabric, single clothing image, texture, clothing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: Accepted to SIGGRAPH Asia 2024. Project page: this https URL

Abstract:We introduce FabricDiffusion, a method for transferring fabric textures from a single clothing image to 3D garments of arbitrary shapes. Existing approaches typically synthesize textures on the garment surface through 2D-to-3D texture mapping or depth-aware inpainting via generative models. Unfortunately, these methods often struggle to capture and preserve texture details, particularly due to challenging occlusions, distortions, or poses in the input image. Inspired by the observation that in the fashion industry, most garments are constructed by stitching sewing patterns with flat, repeatable textures, we cast the task of clothing texture transfer as extracting distortion-free, tileable texture materials that are subsequently mapped onto the UV space of the garment. Building upon this insight, we train a denoising diffusion model with a large-scale synthetic dataset to rectify distortions in the input texture image. This process yields a flat texture map that enables a tight coupling with existing Physically-Based Rendering (PBR) material generation pipelines, allowing for realistic relighting of the garment under various lighting conditions. We show that FabricDiffusion can transfer various features from a single clothing image including texture patterns, material properties, and detailed prints and logos. Extensive experiments demonstrate that our model significantly outperforms state-of-the-art methods on both synthetic data and real-world, in-the-wild clothing images while generalizing to unseen textures and garment shapes.

[CV-3] SegEarth-OV: Towards Training-Free Open-Vocabulary Segmentation for Remote Sensing Images

Link: https://arxiv.org/abs/2410.01768
Authors: Kaiyu Li, Ruixun Liu, Xiangyong Cao, Deyu Meng, Zhi Wang
Keywords: Remote sensing image, Remote sensing, sensing image plays, water resources, disaster relief
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Remote sensing images play an irreplaceable role in fields such as agriculture, water resources, military, and disaster relief. Pixel-level interpretation is a critical aspect of remote sensing image applications; however, a prevalent limitation remains the need for extensive manual annotation. To address this, we introduce open-vocabulary semantic segmentation (OVSS) into the remote sensing context. However, due to the sensitivity of remote sensing images to low-resolution features, distorted target shapes and ill-fitting boundaries are exhibited in the prediction mask. To tackle this issue, we propose a simple and general upsampler, SimFeatUp, to restore lost spatial information in deep features in a training-free style. Further, based on the observation of the abnormal response of local patch tokens to the [CLS] token in CLIP, we propose to execute a straightforward subtraction operation to alleviate the global bias in patch tokens. Extensive experiments are conducted on 17 remote sensing datasets spanning semantic segmentation, building extraction, road detection, and flood detection tasks. Our method achieves an average of 5.8%, 8.2%, 4%, and 15.3% improvement over state-of-the-art methods on 4 tasks. All code is released at this https URL.
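
The subtraction step mentioned in this abstract can be pictured as removing a scaled copy of the [CLS] embedding from every patch embedding. A minimal sketch, assuming plain element-wise subtraction with a hypothetical weight `lam`; the abstract does not give the exact form or the weight the authors use.

```python
def subtract_global_bias(patch_tokens, cls_token, lam=0.3):
    """Return patch tokens with lam * cls_token subtracted from each one."""
    return [
        [p - lam * c for p, c in zip(patch, cls_token)]
        for patch in patch_tokens
    ]

# Two toy 3-dimensional patch embeddings and one [CLS] embedding.
patch_tokens = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
cls_token = [1.0, 0.0, -1.0]
debiased = subtract_global_bias(patch_tokens, cls_token, lam=0.5)
```

Because the same vector is subtracted from every patch, differences between patches are unchanged; only the component shared with the global token is reduced.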

[CV-4] ImageFolder: Autoregressive Image Generation with Folded Tokens

Link: https://arxiv.org/abs/2410.01756
Authors: Xiang Li, Hao Chen, Kai Qiu, Jason Kuen, Jiuxiang Gu, Bhiksha Raj, Zhe Lin
Keywords: visual generative models, diffusion models, generative models, models, token length
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL

Abstract:Image tokenizers are crucial for visual generative models, e.g., diffusion models (DMs) and autoregressive (AR) models, as they construct the latent representation for modeling. Increasing token length is a common approach to improve the image reconstruction quality. However, tokenizers with longer token lengths are not guaranteed to achieve better generation quality. There exists a trade-off between reconstruction and generation quality regarding token length. In this paper, we investigate the impact of token length on both image reconstruction and generation and provide a flexible solution to the tradeoff. We propose ImageFolder, a semantic tokenizer that provides spatially aligned image tokens that can be folded during autoregressive modeling to improve both generation efficiency and quality. To enhance the representative capability without increasing token length, we leverage dual-branch product quantization to capture different contexts of images. Specifically, semantic regularization is introduced in one branch to encourage compacted semantic information while another branch is designed to capture the remaining pixel-level details. Extensive experiments demonstrate the superior quality of image generation and shorter token length with ImageFolder tokenizer.
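
Product quantization, which this abstract relies on, splits a vector into parts and quantizes each part with its own codebook, so one token position carries several code indices. The sketch below shows generic two-way product quantization with tiny hand-written codebooks; it is not the ImageFolder tokenizer, and the codebooks and the input vector are made up for illustration.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec in squared Euclidean distance."""
    def sq_dist(code):
        return sum((a - b) ** 2 for a, b in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))

def product_quantize(vec, codebooks):
    """Split vec into len(codebooks) equal chunks and quantize each independently."""
    chunk = len(vec) // len(codebooks)
    return [
        nearest(cb, vec[i * chunk:(i + 1) * chunk])
        for i, cb in enumerate(codebooks)
    ]

def reconstruct(indices, codebooks):
    """Concatenate the selected codes back into one vector."""
    return [x for idx, cb in zip(indices, codebooks) for x in cb[idx]]

# Hypothetical codebooks, one per branch, each holding two 2-dimensional codes.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
token = [0.9, 1.1, 0.8, 0.1]
indices = product_quantize(token, codebooks)
approx = reconstruct(indices, codebooks)
```

With two codebooks of size K, a single position can represent K × K combinations without lengthening the token sequence, which is the appeal of the approach.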

[CV-5] LEOPARD: A Vision Language Model For Text-Rich Multi-Image Tasks

Link: https://arxiv.org/abs/2410.01744
Authors: Mengzhao Jia, Wenhao Yu, Kaixin Ma, Tianqing Fang, Zhihan Zhang, Siru Ouyang, Hongming Zhang, Meng Jiang, Dong Yu
Keywords: central visual element, visual element guiding, scanned documents, involving multiple text-rich, multiple text-rich images
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Our code is available at this https URL

Abstract:Text-rich images, where text serves as the central visual element guiding the overall understanding, are prevalent in real-world applications, such as presentation slides, scanned documents, and webpage snapshots. Tasks involving multiple text-rich images are especially challenging, as they require not only understanding the content of individual images but also reasoning about inter-relationships and logical flows across multiple visual inputs. Despite the importance of these scenarios, current multimodal large language models (MLLMs) struggle to handle such tasks due to two key challenges: (1) the scarcity of high-quality instruction tuning datasets for text-rich multi-image scenarios, and (2) the difficulty in balancing image resolution with visual feature sequence length. To address these challenges, we propose LEOPARD, an MLLM designed specifically for handling vision-language tasks involving multiple text-rich images. First, we curated about one million high-quality multimodal instruction-tuning instances, tailored to text-rich, multi-image scenarios. Second, we developed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length based on the original aspect ratios and resolutions of the input images. Experiments across a wide range of benchmarks demonstrate our model’s superior capabilities in text-rich, multi-image evaluations and competitive performance in general domain evaluations.

[CV-6] VitaGlyph: Vitalizing Artistic Typography with Flexible Dual-branch Diffusion Models

Link: https://arxiv.org/abs/2410.01738
Authors: Kailai Feng, Yabo Zhang, Haodong Yu, Zhilong Ji, Jinfeng Bai, Hongzhi Zhang, Wangmeng Zuo
Keywords: input character, Artistic typography, readable manner, technique to visualize, visualize the meaning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: this https URL

Abstract:Artistic typography is a technique to visualize the meaning of an input character in an imaginative and readable manner. With powerful text-to-image diffusion models, existing methods directly design the overall geometry and texture of the input character, making it challenging to ensure both creativity and legibility. In this paper, we introduce a dual-branch and training-free method, namely VitaGlyph, enabling flexible artistic typography along with controllable geometry change to maintain readability. The key insight of VitaGlyph is to treat the input character as a scene composed of Subject and Surrounding, followed by rendering them under varying degrees of geometry transformation. The subject flexibly expresses the essential concept of the input character, while the surrounding enriches relevant background without altering the shape. Specifically, we implement VitaGlyph through a three-phase framework: (i) Knowledge Acquisition leverages large language models to design text descriptions of subject and surrounding. (ii) Regional Decomposition detects the part that most matches the subject description and divides the input glyph image into subject and surrounding regions. (iii) Typography Stylization first refines the structure of the subject region via Semantic Typography, and then separately renders the textures of the Subject and Surrounding regions through Controllable Compositional Generation. Experimental results demonstrate that VitaGlyph not only achieves better artistry and readability, but also manages to depict multiple customized concepts, facilitating more creative and pleasing artistic typography generation. Our code will be made publicly available at this https URL.

[CV-7] RADAR: Robust Two-stage Modality-incomplete Industrial Anomaly Detection

Link: https://arxiv.org/abs/2410.01737
Authors: Bingchen Miao, Wenqiao Zhang, Juncheng Li, Siliang Tang, Zhaocheng Li, Haochen Shi, Jun Xiao, Yueting Zhuang
Keywords: Industrial Anomaly Detection, industrial quality inspection, Multimodal Industrial Anomaly, RGB images, Anomaly Detection
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Abstract:Multimodal Industrial Anomaly Detection (MIAD), utilizing 3D point clouds and 2D RGB images to identify the abnormal region of products, plays a crucial role in industrial quality inspection. However, the conventional MIAD setting presupposes that all 2D and 3D modalities are paired, overlooking the fact that multimodal data collected from the real world is often imperfect due to missing modalities. Consequently, MIAD models that demonstrate robustness against modality-incomplete data are highly desirable in practice. To address this practical challenge, we introduce a first-of-its-kind study that comprehensively investigates Modality-Incomplete Industrial Anomaly Detection (MIIAD), to consider the imperfect learning environment in which the multimodal information may be incomplete. Not surprisingly, we discovered that most existing MIAD approaches are inadequate for addressing MIIAD challenges, leading to significant performance degradation on the MIIAD benchmark we developed. In this paper, we propose a novel two-stage Robust modAlity-incomplete fusing and Detecting frAmewoRk, abbreviated as RADAR. Our bootstrapping philosophy is to enhance two stages in MIIAD, improving the robustness of the Multimodal Transformer: i) In feature fusion, we first explore learning modality-incomplete instruction, guiding the pre-trained Multimodal Transformer to robustly adapt to various modality-incomplete scenarios, and implement adaptive parameter learning based on a HyperNetwork; ii) In anomaly detection, we construct a real-pseudo hybrid module to highlight the distinctiveness of modality combinations, further enhancing the robustness of the MIIAD model. Our experimental results demonstrate that the proposed RADAR significantly surpasses conventional MIAD methods in terms of effectiveness and robustness on our newly created MIIAD dataset, underscoring its practical application value.

[CV-8] ComfyGen: Prompt-Adaptive Workflows for Text-to-Image Generation

Link: https://arxiv.org/abs/2410.01731
Authors: Rinon Gal, Adi Haviv, Yuval Alaluf, Amit H. Bermano, Daniel Cohen-Or, Gal Chechik
Keywords: combine multiple specialized, multiple specialized components, evolved from simple, combine multiple, multiple specialized
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Graphics (cs.GR)
Comments: Project website: this https URL

Abstract:The practical use of text-to-image generation has evolved from simple, monolithic models to complex workflows that combine multiple specialized components. While workflow-based approaches can lead to improved image quality, crafting effective workflows requires significant expertise, owing to the large number of available components, their complex inter-dependence, and their dependence on the generation prompt. Here, we introduce the novel task of prompt-adaptive workflow generation, where the goal is to automatically tailor a workflow to each user prompt. We propose two LLM-based approaches to tackle this task: a tuning-based method that learns from user-preference data, and a training-free method that uses the LLM to select existing flows. Both approaches lead to improved image quality when compared to monolithic models or generic, prompt-independent workflows. Our work shows that prompt-dependent flow prediction offers a new pathway to improving text-to-image generation quality, complementing existing research directions in the field.

[CV-9] HarmoniCa: Harmonizing Training and Inference for Better Feature Cache in Diffusion Transformer Acceleration

Link: https://arxiv.org/abs/2410.01723
Authors: Yushi Huang, Zining Wang, Ruihao Gong, Jing Liu, Xinjie Zhang, Jinyang Guo, Xianglong Liu, Jun Zhang
Keywords: Diffusion Transformers, generative tasks, gained prominence, prominence for outstanding, outstanding scalability
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code will be released soon

Abstract:Diffusion Transformers (DiTs) have gained prominence for outstanding scalability and extraordinary performance in generative tasks. However, their considerable inference costs impede practical deployment. The feature cache mechanism, which involves storing and retrieving redundant computations across timesteps, holds promise for reducing per-step inference time in diffusion models. Most existing caching methods for DiT are manually designed. Although the learning-based approach attempts to optimize strategies adaptively, it suffers from discrepancies between training and inference, which hampers both the performance and acceleration ratio. Upon detailed analysis, we pinpoint that these discrepancies primarily stem from two aspects: (1) Prior Timestep Disregard, where training ignores the effect of cache usage at earlier timesteps, and (2) Objective Mismatch, where the training target (align predicted noise in each timestep) deviates from the goal of inference (generate the high-quality image). To alleviate these discrepancies, we propose HarmoniCa, a novel method that Harmonizes training and inference with a novel learning-based Caching framework built upon Step-Wise Denoising Training (SDT) and Image Error Proxy-Guided Objective (IEPO). Compared to the traditional training paradigm, the newly proposed SDT maintains the continuity of the denoising process, enabling the model to leverage information from prior timesteps during training, similar to the way it operates during inference. Furthermore, we design IEPO, which integrates an efficient proxy mechanism to approximate the final image error caused by reusing the cached feature. Therefore, IEPO helps balance final image quality and cache utilization, resolving the issue of training that only considers the impact of cache usage on the predicted output at each timestep.
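
The feature cache mechanism described in this abstract amounts to skipping a block at some timesteps and reusing the output stored at an earlier one. A minimal sketch of that control flow, assuming a fixed hypothetical reuse schedule and a toy function in place of a transformer block; HarmoniCa itself learns when to reuse, which is not modeled here.

```python
def run_with_cache(block, inputs, reuse_schedule):
    """Run block once per timestep, reusing the cached output when scheduled."""
    cache = None
    outputs, computed = [], 0
    for x, reuse in zip(inputs, reuse_schedule):
        if reuse and cache is not None:
            out = cache        # skip the block and reuse the stored feature
        else:
            out = block(x)     # recompute and refresh the cache
            cache = out
            computed += 1
        outputs.append(out)
    return outputs, computed

def expensive_block(x):
    """Stand-in for a costly transformer block."""
    return x * x

inputs = [1.0, 1.1, 1.2, 2.0]             # one input per denoising timestep
schedule = [False, True, True, False]     # hypothetical: reuse at steps 1 and 2
outputs, computed = run_with_cache(expensive_block, inputs, schedule)
```

The reused outputs at steps 1 and 2 differ from what the block would have produced, and that error carries forward to later steps; this is the effect of earlier cache usage that the abstract says conventional training ignores.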

[CV-10] OmniSR: Shadow Removal under Direct and Indirect Lighting

Link: https://arxiv.org/abs/2410.01719
Authors: Jiamin Xu, Zelong Li, Yuxin Zheng, Chenyu Huang, Renshu Gu, Weiwei Xu, Gang Xu
Keywords: indirect illumination, shadow removal, originate from occlusions, shadow removal network, indirect
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Shadows can originate from occlusions in both direct and indirect illumination. Although most current shadow removal research focuses on shadows caused by direct illumination, shadows from indirect illumination are often just as pervasive, particularly in indoor scenes. A significant challenge in removing shadows from indirect illumination is obtaining shadow-free images to train the shadow removal network. To overcome this challenge, we propose a novel rendering pipeline for generating shadowed and shadow-free images under direct and indirect illumination, and create a comprehensive synthetic dataset that contains over 30,000 image pairs, covering various object types and lighting conditions. We also propose an innovative shadow removal network that explicitly integrates semantic and geometric priors through concatenation and attention mechanisms. The experiments show that our method outperforms state-of-the-art shadow removal techniques and can effectively generalize to indoor and outdoor scenes under various lighting conditions, enhancing the overall effectiveness and applicability of shadow removal methods.

[CV-11] COMUNI: Decomposing Common and Unique Video Signals for Diffusion-based Video Generation

Link: https://arxiv.org/abs/2410.01718
Authors: Mingzhen Sun, Weining Wang, Xinxin Zhu, Jing Liu
Keywords: similar object appearances, objects moving coherently, record objects moving, slightly changed postures, videos record objects
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Since videos record objects moving coherently, adjacent video frames have commonness (similar object appearances) and uniqueness (slightly changed postures). To prevent redundant modeling of common video signals, we propose a novel diffusion-based framework, named COMUNI, which decomposes the COMmon and UNIque video signals to enable efficient video generation. Our approach separates the decomposition of video signals from the task of video generation, thus reducing the computation complexity of generative models. In particular, we introduce CU-VAE to decompose video signals and encode them into latent features. To train CU-VAE in a self-supervised manner, we employ a cascading merge module to reconstitute video signals and a time-agnostic video decoder to reconstruct video frames. Then we propose CU-LDM to model latent features for video generation, which adopts two specific diffusion streams to simultaneously model the common and unique latent features. We further utilize additional joint modules for cross modeling of the common and unique latent features, and a novel position embedding method to ensure the content consistency and motion coherence of generated videos. The position embedding method incorporates spatial and temporal absolute position information into the joint modules. Extensive experiments demonstrate the necessity of decomposing common and unique video signals for video generation and the effectiveness and efficiency of our proposed method.

[CV-12] Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding

Link: https://arxiv.org/abs/2410.01699
Authors: Yao Teng, Han Shi, Xian Liu, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, Xihui Liu
Keywords: substantial time consumption, Jacobi decoding, models require hundreds, Speculative Jacobi Decoding, decoding
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Current large auto-regressive models can generate high-quality, high-resolution images, but these models require hundreds or even thousands of steps of next-token prediction during inference, resulting in substantial time consumption. In existing studies, Jacobi decoding, an iterative parallel decoding algorithm, has been used to accelerate auto-regressive generation and can be executed without training. However, Jacobi decoding relies on a deterministic criterion to determine the convergence of iterations. Thus, it works for greedy decoding but is incompatible with sampling-based decoding, which is crucial for visual quality and diversity in current auto-regressive text-to-image generation. In this paper, we propose a training-free probabilistic parallel decoding algorithm, Speculative Jacobi Decoding (SJD), to accelerate auto-regressive text-to-image generation. By introducing a probabilistic convergence criterion, our SJD accelerates the inference of auto-regressive text-to-image generation while maintaining the randomness in sampling-based token decoding and allowing the model to generate diverse images. Specifically, SJD lets the model predict multiple tokens at each step and accepts tokens based on the probabilistic criterion, enabling the model to generate images with fewer steps than the conventional next-token-prediction paradigm. We also investigate token initialization strategies that leverage the spatial locality of visual data to further improve the acceleration ratio under specific scenarios. We conduct experiments for our proposed SJD on multiple auto-regressive text-to-image generation models, showing the effectiveness of model acceleration without sacrificing visual quality.
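
A probabilistic acceptance criterion of the kind this abstract describes can be sketched by borrowing the rule from speculative sampling: a draft token is kept with probability min(1, p/q). This is an assumption for illustration, since the abstract states only that the criterion is probabilistic. The function name, the token ids, and the probabilities are hypothetical, and the resampling of a rejected token is omitted.

```python
import random

def accept_prefix(draft_tokens, q_probs, p_probs, rng):
    """Accept draft tokens left to right and stop at the first rejection.

    q_probs[i] is the probability with which draft token i was sampled in the
    previous iteration; p_probs[i] is its probability in the current iteration.
    """
    accepted = []
    for token, q, p in zip(draft_tokens, q_probs, p_probs):
        if rng.random() < min(1.0, p / q):
            accepted.append(token)
        else:
            break
    return accepted

rng = random.Random(0)
draft = [17, 42, 8]
accepted = accept_prefix(
    draft, q_probs=[0.5, 0.4, 0.3], p_probs=[0.6, 0.4, 0.0], rng=rng
)
```

In this toy input the first two tokens have p at least as large as q, so they are always accepted, while the third has probability zero under the current iteration and is always rejected. Accepting several tokens in one iteration is what reduces the number of forward passes.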

[CV-13] MOREL: Enhancing Adversarial Robustness through Multi-Objective Representation Learning

Link: https://arxiv.org/abs/2410.01697
Authors: Sedjro Salomon Hotegni, Sebastian Peitz
Keywords: deep neural networks, neural networks, drastically different outputs, research has shown, shown that deep
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Extensive research has shown that deep neural networks (DNNs) are vulnerable to slight adversarial perturbations - small changes to the input data that appear insignificant but cause the model to produce drastically different outputs. In addition to augmenting training data with adversarial examples generated from a specific attack method, most of the current defense strategies necessitate modifying the original model architecture components to improve robustness or performing test-time data purification to handle adversarial attacks. In this work, we demonstrate that strong feature representation learning during training can significantly enhance the original model’s robustness. We propose MOREL, a multi-objective feature representation learning approach, encouraging classification models to produce similar features for inputs within the same class, despite perturbations. Our training method involves an embedding space where cosine similarity loss and multi-positive contrastive loss are used to align natural and adversarial features from the model encoder and ensure tight clustering. Concurrently, the classifier is motivated to achieve accurate predictions. Through extensive experiments, we demonstrate that our approach significantly enhances the robustness of DNNs against white-box and black-box adversarial attacks, outperforming other methods that similarly require no architectural changes or test-time data purification. Our code is available at this https URL
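
The cosine similarity loss mentioned in this abstract pulls the features of a clean input and its adversarial counterpart toward the same direction. A minimal sketch, assuming the common form 1 − cos(u, v) averaged over pairs; the exact formulation in the paper, and the multi-positive contrastive term, are not reproduced. The feature vectors are made up.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_alignment_loss(natural_feats, adversarial_feats):
    """Mean of (1 - cosine similarity) over paired feature vectors."""
    losses = [
        1.0 - cosine_similarity(n, a)
        for n, a in zip(natural_feats, adversarial_feats)
    ]
    return sum(losses) / len(losses)

# Hypothetical encoder features for two inputs and their perturbed versions.
natural = [[1.0, 0.0], [0.0, 2.0]]
adversarial = [[2.0, 0.0], [3.0, 0.0]]
loss = cosine_alignment_loss(natural, adversarial)
```

The first pair points in the same direction and contributes zero loss despite the different magnitudes; the second pair is orthogonal and contributes one, so the loss penalizes direction changes only.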

[CV-14] PHI-S: Distribution Balancing for Label-Free Multi-Teacher Distillation

Link: https://arxiv.org/abs/2410.01680
Authors: Mike Ranzinger, Jon Barker, Greg Heinrich, Pavlo Molchanov, Bryan Catanzaro, Andrew Tao
Keywords: heterogeneous multi-teacher knowledge, multi-teacher knowledge distillation, visual foundation models, strengths and weaknesses, distillation without labels
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Various visual foundation models have distinct strengths and weaknesses, both of which can be improved through heterogeneous multi-teacher knowledge distillation without labels, termed “agglomerative models.” We build upon this body of work by studying the effect of the teachers’ activation statistics, particularly the impact of the loss function on the resulting student model quality. We explore a standard toolkit of statistical normalization techniques to better align the different distributions and assess their effects. Further, we examine the impact on downstream teacher-matching metrics, which motivates the use of Hadamard matrices. With these matrices, we demonstrate useful properties, showing how they can be used for isotropic standardization, where each dimension of a multivariate distribution is standardized using the same scale. We call this technique “PHI Standardization” (PHI-S) and empirically demonstrate that it produces the best student model across the suite of methods studied.
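Why a Hadamard matrix helps with isotropic standardization can be seen in a small sketch. It assumes the covariance has already been diagonalised (for example by PCA), so the variances are the eigenvalues; rotating by an orthonormal Hadamard matrix then gives every dimension the same variance, and a single scale factor standardizes all of them. Whether PHI-S proceeds in exactly these steps is not stated in the abstract, and the function names are hypothetical.

```python
def sylvester_hadamard(n):
    """Build an n x n Hadamard matrix (n a power of two) with +1/-1 entries."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        # Sylvester construction: [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h

def rotated_variances(eigenvalues):
    """Per-dimension variances after rotating diag(eigenvalues)
    by the orthonormal matrix H / sqrt(n)."""
    n = len(eigenvalues)
    h = sylvester_hadamard(n)
    return [
        sum(h[i][k] ** 2 * eigenvalues[k] for k in range(n)) / n
        for i in range(n)
    ]

def isotropic_scale(eigenvalues):
    """Single scale factor giving every rotated dimension unit variance."""
    mean_var = sum(eigenvalues) / len(eigenvalues)
    return mean_var ** -0.5
```

Because every entry of the Hadamard matrix has magnitude 1, each rotated dimension receives the mean of the eigenvalues as its variance, which is why one shared scale suffices.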

[CV-15] Open3DTrack: Towards Open-Vocabulary 3D Multi-Object Tracking

Link: https://arxiv.org/abs/2410.01678
Authors: Ayesha Ishaq, Mohamed El Amine Boudjoghra, Jean Lahoud, Fahad Shahbaz Khan, Salman Khan, Hisham Cholakkal, Rao Muhammad Anwer
Keywords: multiple objects’ movements, multi-object tracking plays, objects’ movements, plays a critical, critical role
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 7 pages, 4 figures, 3 tables

Abstract:3D multi-object tracking plays a critical role in autonomous driving by enabling the real-time monitoring and prediction of multiple objects’ movements. Traditional 3D tracking systems are typically constrained by predefined object categories, limiting their adaptability to novel, unseen objects in dynamic environments. To address this limitation, we introduce open-vocabulary 3D tracking, which extends the scope of 3D tracking to include objects beyond predefined categories. We formulate the problem of open-vocabulary 3D tracking and introduce dataset splits designed to represent various open-vocabulary scenarios. We propose a novel approach that integrates open-vocabulary capabilities into a 3D tracking framework, allowing for generalization to unseen object classes. Our method effectively reduces the performance gap between tracking known and novel objects through strategic adaptation. Experimental results demonstrate the robustness and adaptability of our method in diverse outdoor driving scenarios. To the best of our knowledge, this work is the first to address open-vocabulary 3D tracking, presenting a significant advancement for autonomous systems in real-world settings. Code, trained models, and dataset splits are available publicly.

[CV-16] 3DGS-DET: Empower 3D Gaussian Splatting with Boundary Guidance and Box-Focused Sampling for 3D Object Detection

Link: https://arxiv.org/abs/2410.01647
Authors: Yang Cao, Yuanliang Jv, Dan Xu
Keywords: Neural Radiance Fields, Neural Radiance, Radiance Fields, Gaussian blobs, offering a promising
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code Page: this https URL

Abstract:Neural Radiance Fields (NeRF) are widely used for novel-view synthesis and have been adapted for 3D Object Detection (3DOD), offering a promising approach to 3DOD through view-synthesis representation. However, NeRF faces inherent limitations: (i) limited representational capacity for 3DOD due to its implicit nature, and (ii) slow rendering speeds. Recently, 3D Gaussian Splatting (3DGS) has emerged as an explicit 3D representation that addresses these limitations. Inspired by these advantages, this paper introduces 3DGS into 3DOD for the first time, identifying two main challenges: (i) Ambiguous spatial distribution of Gaussian blobs: 3DGS primarily relies on 2D pixel-level supervision, resulting in unclear 3D spatial distribution of Gaussian blobs and poor differentiation between objects and background, which hinders 3DOD; (ii) Excessive background blobs: 2D images often include numerous background pixels, leading to densely reconstructed 3DGS with many noisy Gaussian blobs representing the background, negatively affecting detection. To tackle the challenge (i), we leverage the fact that 3DGS reconstruction is derived from 2D images, and propose an elegant and efficient solution by incorporating 2D Boundary Guidance to significantly enhance the spatial distribution of Gaussian blobs, resulting in clearer differentiation between objects and their background. To address the challenge (ii), we propose a Box-Focused Sampling strategy using 2D boxes to generate object probability distribution in 3D spaces, allowing effective probabilistic sampling in 3D to retain more object blobs and reduce noisy background blobs. Benefiting from our designs, our 3DGS-DET significantly outperforms the SOTA NeRF-based method, NeRF-Det, achieving improvements of +6.6 on mAP@0.25 and +8.1 on mAP@0.5 for the ScanNet dataset, and impressive +31.5 on mAP@0.25 for the ARKITScenes dataset.

[CV-17] Data Extrapolation for Text-to-image Generation on Small Datasets

Link: https://arxiv.org/abs/2410.01638
Authors: Senmao Ye, Fei Liu
Keywords: requires large amount, generation requires large, synthesizing high-quality images, requires large, large amount
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Text-to-image generation requires a large amount of training data to synthesize high-quality images. To augment training data, previous methods rely on data interpolations such as cropping, flipping, and mixing up, which fail to introduce new information and yield only marginal improvements. In this paper, we propose a new data augmentation method for text-to-image generation using linear extrapolation. Specifically, we apply linear extrapolation only to text features, and new image data are retrieved from the internet by search engines. To ensure the reliability of the new text-image pairs, we design two outlier detectors to purify the retrieved images. Based on extrapolation, we construct training samples dozens of times larger than the original dataset, resulting in a significant improvement in text-to-image performance. Moreover, we propose NULL-guidance to refine score estimation, and apply a recurrent affine transformation to fuse text information. Our model achieves FID scores of 7.91, 9.52 and 5.00 on the CUB, Oxford and COCO datasets. The code and data will be available on GitHub (this https URL).
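The difference between interpolation and linear extrapolation of a feature vector can be shown in a few lines. This is a generic sketch: the abstract does not say which pairs of text features are used or how the step size is chosen, so `neighbor` and `strength` are hypothetical parameters.

```python
def extrapolate(feature, neighbor, strength):
    """Move `feature` away from `neighbor`: f + strength * (f - neighbor).

    strength > 0 leaves the segment between the two vectors
    (extrapolation); strength in (-1, 0) stays on it (interpolation).
    """
    return [f + strength * (f - g) for f, g in zip(feature, neighbor)]
```

With a positive `strength` the result lies outside the span of the two inputs, which is how extrapolation can reach text features that interpolation-style augmentation cannot produce.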

[CV-18] LMOD: A Large Multimodal Ophthalmology Dataset and Benchmark for Large Vision-Language Models

Link: https://arxiv.org/abs/2410.01620
Authors: Zhenyue Qin, Yu Yin, Dylan Campbell, Xuansheng Wu, Ke Zou, Yih-Chung Tham, Ninghao Liu, Xiuzhen Zhang, Qingyu Chen
Keywords: Ophthalmology relies heavily, ophthalmology images, treatment planning, relies heavily, heavily on detailed
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Ophthalmology relies heavily on detailed image analysis for diagnosis and treatment planning. While large vision-language models (LVLMs) have shown promise in understanding complex visual information, their performance on ophthalmology images remains underexplored. We introduce LMOD, a dataset and benchmark for evaluating LVLMs on ophthalmology images, covering anatomical understanding, diagnostic analysis, and demographic extraction. LMOD includes 21,993 images spanning optical coherence tomography, scanning laser ophthalmoscopy, eye photos, surgical scenes, and color fundus photographs. We benchmark 13 state-of-the-art LVLMs and find that they are far from perfect at comprehending ophthalmology images. Models struggle with diagnostic analysis and demographic extraction, and reveal weaknesses in spatial reasoning, handling out-of-domain queries, and safeguards for handling biomarkers of ophthalmology images.

[CV-19] SGBA: Semantic Gaussian Mixture Model-Based LiDAR Bundle Adjustment

Link: https://arxiv.org/abs/2410.01618
Authors: Xingyu Ji, Shenghai Yuan, Jianping Li, Pengyu Yin, Haozhi Cao, Lihua Xie
Keywords: LiDAR bundle adjustment, bundle adjustment, reduce the drifts, LiDAR bundle, pose estimation
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Abstract:LiDAR bundle adjustment (BA) is an effective approach to reduce the drifts in pose estimation from the front-end. Existing works on LiDAR BA usually rely on predefined geometric features for landmark representation. This reliance restricts generalizability, as the system will inevitably deteriorate in environments where these specific features are absent. To address this issue, we propose SGBA, a LiDAR BA scheme that models the environment as a semantic Gaussian mixture model (GMM) without predefined feature types. This approach encodes both geometric and semantic information, offering a comprehensive and general representation adaptable to various environments. Additionally, to limit computational complexity while ensuring generalizability, we propose an adaptive semantic selection framework that selects the most informative semantic clusters for optimization by evaluating the condition number of the cost function. Lastly, we introduce a probabilistic feature association scheme that considers the entire probability density of assignments, which can manage uncertainties in measurement and initial pose estimation. We have conducted various experiments, and the results demonstrate that SGBA can achieve accurate and robust pose refinement even in challenging scenarios with low-quality initial pose estimation and limited geometric features. We plan to open-source the work for the benefit of the community at this https URL.

[CV-20] Saliency-Guided DETR for Moment Retrieval and Highlight Detection

Link: https://arxiv.org/abs/2410.01615
Authors: Aleksandr Gordeev, Vladimir Dokholyan, Irina Tolstykh, Maksim Kuprashevich
Keywords: limited production usage, video features efficiently, Existing approaches, video moment retrieval, features efficiently
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 1 figure, 4 tables

Abstract:Existing approaches for video moment retrieval and highlight detection are not able to align text and video features efficiently, resulting in unsatisfying performance and limited production usage. To address this, we propose a novel architecture that utilizes recent foundational video models designed for such alignment. Combined with the introduced Saliency-Guided Cross Attention mechanism and a hybrid DETR architecture, our approach significantly enhances performance in both moment retrieval and highlight detection tasks. For even better improvement, we developed InterVid-MR, a large-scale and high-quality dataset for pretraining. Using it, our architecture achieves state-of-the-art results on the QVHighlights, Charades-STA and TACoS benchmarks. The proposed approach provides an efficient and scalable solution for both zero-shot and fine-tuning scenarios in video-language tasks.

[CV-21] Gaussian Splatting in Mirrors: Reflection-Aware Rendering via Virtual Camera Optimization

Link: https://arxiv.org/abs/2410.01614
Authors: Zihan Wang, Shuzhe Wang, Matias Turkulainen, Junyuan Fang, Juho Kannala
Keywords: Gaussian Splatting, Recent advancements, facilitating real-time, view synthesis, revolutionized novel view
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: To be published on 2024 British Machine Vision Conference

Abstract:Recent advancements in 3D Gaussian Splatting (3D-GS) have revolutionized novel view synthesis, facilitating real-time, high-quality image rendering. However, in scenarios involving reflective surfaces, particularly mirrors, 3D-GS often misinterprets reflections as virtual spaces, resulting in blurred and inconsistent multi-view rendering within mirrors. Our paper presents a novel method aimed at obtaining high-quality multi-view consistent reflection rendering by modelling reflections as physically-based virtual cameras. We estimate mirror planes with depth and normal estimates from 3D-GS and define virtual cameras that are placed symmetrically about the mirror plane. These virtual cameras are then used to explain mirror reflections in the scene. To address imperfections in mirror plane estimates, we propose a straightforward yet effective virtual camera optimization method to enhance reflection quality. We collect a new mirror dataset including three real-world scenarios for more diverse evaluation. Experimental validation on both Mirror-Nerf and our real-world dataset demonstrates the efficacy of our approach. We achieve comparable or superior results while significantly reducing training time compared to the previous state of the art.
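Placing a virtual camera symmetrically about a mirror plane comes down to reflecting the camera centre across that plane. The sketch below shows only this geometric step for a plane written as normal · x + offset = 0; it is not the paper's pipeline, the function name is hypothetical, and a full virtual camera would also need its orientation reflected.

```python
def reflect_point(point, normal, offset):
    """Reflect a 3D point across the plane normal . x + offset = 0.

    The normal need not be unit length; it is normalised implicitly
    by dividing by its squared norm.
    """
    norm_sq = sum(n * n for n in normal)
    signed = (sum(n * p for n, p in zip(normal, point)) + offset) / norm_sq
    return [p - 2.0 * signed * n for p, n in zip(point, normal)]
```

Reflecting twice returns the original point, which is a quick sanity check on an estimated plane.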

[CV-22] DRUPI: Dataset Reduction Using Privileged Information

Link: https://arxiv.org/abs/2410.01611
Authors: Shaobo Wang, Yantai Yang, Shuaiyu Zhang, Chenghao Sun, Weiya Li, Xuming Hu, Linfeng Zhang
Keywords: seeks to select, select or distill, distill samples, samples from large, smaller subsets
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Dataset reduction (DR) seeks to select or distill samples from large datasets into smaller subsets while preserving performance on target tasks. Existing methods primarily focus on pruning or synthesizing data in the same format as the original dataset, typically the input data and corresponding labels. However, in DR settings, we find it is possible to synthesize more information beyond the data-label pair as an additional learning target to facilitate model training. In this paper, we introduce Dataset Reduction Using Privileged Information (DRUPI), which enriches DR by synthesizing privileged information alongside the reduced dataset. This privileged information can take the form of feature labels or attention labels, providing auxiliary supervision to improve model learning. Our findings reveal that effective feature labels must balance between being overly discriminative and excessively diverse, with a moderate level proving optimal for improving the reduced dataset’s efficacy. Extensive experiments on ImageNet, CIFAR-10/100, and Tiny ImageNet demonstrate that DRUPI integrates seamlessly with existing dataset reduction methods, offering significant performance gains.

[CV-23] DAViD: Domain Adaptive Visually-Rich Document Understanding with Synthetic Insights

Link: https://arxiv.org/abs/2410.01609
Authors: Yihao Ding, Soyeon Caren Han, Zechuan Li, Hyunsuk Chung
Keywords: convey complex information, encompassing elements, elements like charts, convey complex, Visually-rich Document Understanding
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Work in progress

Abstract:Visually-Rich Documents (VRDs), encompassing elements like charts, tables, and references, convey complex information across various fields. However, extracting information from these rich documents is labor-intensive, especially given their inconsistent formats and domain-specific requirements. While pretrained models for VRD Understanding have progressed, their reliance on large, annotated datasets limits scalability. This paper introduces the Domain Adaptive Visually-rich Document Understanding (DAViD) framework, which utilises machine-generated synthetic data for domain adaptation. DAViD integrates fine-grained and coarse-grained document representation learning and employs synthetic annotations to reduce the need for costly manual labelling. By leveraging pretrained models and synthetic data, DAViD achieves competitive performance with minimal annotated datasets. Extensive experiments validate DAViD’s effectiveness, demonstrating its ability to efficiently adapt to domain-specific VRDU tasks.

[CV-24] KnobGen: Controlling the Sophistication of Artwork in Sketch-Based Diffusion Models

Link: https://arxiv.org/abs/2410.01595
Authors: Pouyan Navard, Amin Karimi Monsefi, Mengxi Zhou, Wei-Lun Chao, Alper Yilmaz, Rajiv Ramnath
Keywords: Recent advances, balance fine-grained precision, significantly improved, advances in diffusion, diffusion models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Recent advances in diffusion models have significantly improved text-to-image (T2I) generation, but they often struggle to balance fine-grained precision with high-level control. Methods like ControlNet and T2I-Adapter excel at following sketches by seasoned artists but tend to be overly rigid, replicating unintentional flaws in sketches from novice users. Meanwhile, coarse-grained methods, such as sketch-based abstraction frameworks, offer more accessible input handling but lack the precise control needed for detailed, professional use. To address these limitations, we propose KnobGen, a dual-pathway framework that democratizes sketch-based image generation by seamlessly adapting to varying levels of sketch complexity and user skill. KnobGen uses a Coarse-Grained Controller (CGC) module for high-level semantics and a Fine-Grained Controller (FGC) module for detailed refinement. The relative strength of these two modules can be adjusted through our knob inference mechanism to align with the user’s specific needs. These mechanisms ensure that KnobGen can flexibly generate images from both novice sketches and those drawn by seasoned artists. This maintains control over the final output while preserving the natural appearance of the image, as evidenced on the MultiGen-20M dataset and a newly collected sketch dataset.

[CV-25] MM-LDM: Multi-Modal Latent Diffusion Model for Sounding Video Generation ACM-MM2024

Link: https://arxiv.org/abs/2410.01594
Authors: Mingzhen Sun, Weining Wang, Yanyuan Qiao, Jiahui Sun, Zihan Qin, Longteng Guo, Xinxin Zhu, Jing Liu
Keywords: distinct data formats, Sounding Video Generation, audio-video joint generation, generation task challenged, joint generation task
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACM MM 2024

Abstract:Sounding Video Generation (SVG) is an audio-video joint generation task challenged by high-dimensional signal spaces, distinct data formats, and different patterns of content information. To address these issues, we introduce a novel multi-modal latent diffusion model (MM-LDM) for the SVG task. We first unify the representation of audio and video data by converting them into a single or a couple of images. Then, we introduce a hierarchical multi-modal autoencoder that constructs a low-level perceptual latent space for each modality and a shared high-level semantic feature space. The former space is perceptually equivalent to the raw signal space of each modality but drastically reduces signal dimensions. The latter space serves to bridge the information gap between modalities and provides more insightful cross-modal guidance. Our proposed method achieves new state-of-the-art results with significant quality and efficiency gains. Specifically, our method achieves a comprehensive improvement on all evaluation metrics and a faster training and sampling speed on Landscape and AIST++ datasets. Moreover, we explore its performance on open-domain sounding video generation, long sounding video generation, audio continuation, video continuation, and conditional single-modal generation tasks for a comprehensive evaluation, where our MM-LDM demonstrates exciting adaptability and generalization ability.

[CV-26] Coordinate-Based Neural Representation Enabling Zero-Shot Learning for 3D Multiparametric Quantitative MRI

Link: https://arxiv.org/abs/2410.01577
Authors: Guoyan Lao, Ruimin Feng, Haikun Qi, Zhenfeng Lv, Qiangqiang Liu, Chunlei Liu, Yuyao Zhang, Hongjiang Wei
Keywords: offers tissue-specific physical, tissue-specific physical parameters, Quantitative magnetic resonance, offers tissue-specific, magnetic resonance imaging
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Quantitative magnetic resonance imaging (qMRI) offers tissue-specific physical parameters with significant potential for neuroscience research and clinical practice. However, lengthy scan times for 3D multiparametric qMRI acquisition limit its clinical utility. Here, we propose SUMMIT, an innovative imaging methodology that includes data acquisition and an unsupervised reconstruction for simultaneous multiparametric qMRI. SUMMIT first encodes multiple important quantitative properties into highly undersampled k-space. It further leverages implicit neural representation incorporated with a dedicated physics model to reconstruct the desired multiparametric maps without needing external training datasets. SUMMIT delivers co-registered T1, T2, T2*, and quantitative susceptibility mapping. Extensive simulations and phantom imaging demonstrate SUMMIT’s high accuracy. Additionally, the proposed unsupervised approach for qMRI reconstruction also introduces a novel zero-shot learning paradigm for multiparametric imaging applicable to various medical imaging modalities.

[CV-27] Fake It Until You Break It: On the Adversarial Robustness of AI-generated Image Detectors

Link: https://arxiv.org/abs/2410.01574
Authors: Sina Mavali, Jonas Ricker, David Pape, Yash Sharma, Asja Fischer, Lea Schoenherr
Keywords: offers countless possibilities, artificially generated media, misinformation campaigns, offers countless, productive tasks
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:While generative AI (GenAI) offers countless possibilities for creative and productive tasks, artificially generated media can be misused for fraud, manipulation, scams, misinformation campaigns, and more. To mitigate the risks associated with maliciously generated media, forensic classifiers are employed to identify AI-generated content. However, current forensic classifiers are often not evaluated in practically relevant scenarios, such as the presence of an attacker or when real-world artifacts like social media degradations affect images. In this paper, we evaluate state-of-the-art AI-generated image (AIGI) detectors under different attack scenarios. We demonstrate that forensic classifiers can be effectively attacked in realistic settings, even when the attacker does not have access to the target model and post-processing occurs after the adversarial examples are created, which is standard on social media platforms. These attacks can significantly reduce detection accuracy to the extent that the risks of relying on detectors outweigh their benefits. Finally, we propose a simple defense mechanism to make CLIP-based detectors, which are currently the best-performing detectors, robust against these attacks.

[CV-28] PASS: Test-Time Prompting to Adapt Styles and Semantic Shapes in Medical Image Segmentation

Link: https://arxiv.org/abs/2410.01573
Authors: Chuyan Zhang, Hao Zheng, Xin You, Yefeng Zheng, Yun Gu
Keywords: Test-time adaptation, extra training data, promising paradigm, paradigm to handle, existing TTA solutions
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IEEE TMI

Abstract:Test-time adaptation (TTA) has emerged as a promising paradigm to handle the domain shifts at test time for medical images from different institutions without using extra training data. However, existing TTA solutions for segmentation tasks suffer from (1) dependency on modifying the source training stage and access to source priors or (2) lack of emphasis on shape-related semantic knowledge that is crucial for segmentation. Research on visual prompt learning achieves source-relaxed adaptation through an extended parameter space but still neglects the full utilization of semantic features, thus motivating our work on knowledge-enriched deep prompt learning. Beyond the general concern of image style shifts, we reveal that shape variability is another crucial factor causing the performance drop. To address this issue, we propose a TTA framework called PASS (Prompting to Adapt Styles and Semantic shapes), which jointly learns two types of prompts: the input-space prompt to reformulate the style of the test image to fit into the pretrained model and the semantic-aware prompts to bridge high-level shape discrepancy across domains. Instead of naively imposing a fixed prompt, we introduce an input decorator to generate the self-regulating visual prompt conditioned on the input data. To retrieve the knowledge representations and customize target-specific shape prompts for each test sample, we propose a cross-attention prompt modulator, which performs interaction between target representations and an enriched shape prompt bank. Extensive experiments demonstrate the superior performance of PASS over state-of-the-art methods on multiple medical image segmentation datasets. The code is available at this https URL.

[CV-29] Boosting Weakly-Supervised Referring Image Segmentation via Progressive Comprehension

Link: https://arxiv.org/abs/2410.01544
Authors: Zaiquan Yang, Yuhao Liu, Jiaying Lin, Gerhard Hancke, Rynson W.H. Lau
Keywords: target object, Progressive Comprehension Network, input text description, image-text pairs, paper explores
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:This paper explores the weakly-supervised referring image segmentation (WRIS) problem, and focuses on a challenging setup where target localization is learned directly from image-text pairs. We note that the input text description typically already contains detailed information on how to localize the target object, and we also observe that humans often follow a step-by-step comprehension process (i.e., progressively utilizing target-related attributes and relations as cues) to identify the target object. Hence, we propose a novel Progressive Comprehension Network (PCNet) to leverage target-related textual cues from the input description for progressively localizing the target object. Specifically, we first use a Large Language Model (LLM) to decompose the input text description into short phrases. These short phrases are taken as target-related cues and fed into a Conditional Referring Module (CRM) in multiple stages, to allow updating the referring text embedding and enhance the response map for target localization in a multi-stage manner. Based on the CRM, we then propose a Region-aware Shrinking (RaS) loss to constrain the visual localization to be conducted progressively in a coarse-to-fine manner across different stages. Finally, we introduce an Instance-aware Disambiguation (IaD) loss to suppress instance localization ambiguity by differentiating overlapping response maps generated by different referring texts on the same image. Extensive experiments show that our method outperforms SOTA methods on three common benchmarks.

[CV-30] Edge-preserving noise for diffusion models

Link: https://arxiv.org/abs/2410.01540
Authors: Jente Vandersanden, Sascha Holl, Xingchang Huang, Gurprit Singh
Keywords: spatial regions uniformly, neglecting potentially valuable, Classical generative diffusion, potentially valuable structural, isotropic Gaussian denoising
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)

Abstract:Classical generative diffusion models learn an isotropic Gaussian denoising process, treating all spatial regions uniformly, thus neglecting potentially valuable structural information in the data. Inspired by the long-established work on anisotropic diffusion in image processing, we present a novel edge-preserving diffusion model that is a generalization of denoising diffusion probabilistic models (DDPM). In particular, we introduce an edge-aware noise scheduler that varies between edge-preserving and isotropic Gaussian noise. We show that our model’s generative process converges faster to results that more closely match the target distribution. We demonstrate its capability to better learn the low-to-mid frequencies within the dataset, which play a crucial role in representing shapes and structural information. Our edge-preserving diffusion process consistently outperforms state-of-the-art baselines in unconditional image generation. It is also more robust for generative tasks guided by a shape-based prior, such as stroke-to-image generation. We present qualitative and quantitative results showing consistent improvements (FID score) of up to 30% for both tasks.

[CV-31] Multi-Scale Fusion for Object Representation

Link: https://arxiv.org/abs/2410.01539
Authors: Rongzhen Zhao, Vivienne Wang, Juho Kannala, Joni Pajarinen
Keywords: object-level feature vectors, pixel-level feature maps, facilitates advanced visual, advanced visual tasks
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Representing images or videos as object-level feature vectors, rather than pixel-level feature maps, facilitates advanced visual tasks. Object-Centric Learning (OCL) primarily achieves this by reconstructing the input under the guidance of Variational Autoencoder (VAE) intermediate representation to drive so-called slots to aggregate as much object information as possible. However, existing VAE guidance does not explicitly address that objects can vary in pixel sizes while models typically excel at specific pattern scales. We propose Multi-Scale Fusion (MSF) to enhance VAE guidance for OCL training. To ensure objects of all sizes fall within VAE’s comfort zone, we adopt the image pyramid, which produces intermediate representations at multiple scales; to foster scale-invariance/variance in object super-pixels, we devise inter/intra-scale fusion, which augments low-quality object super-pixels of one scale with corresponding high-quality super-pixels from another scale. On standard OCL benchmarks, our technique improves mainstream methods, including state-of-the-art diffusion-based ones. The source code is available in the supplemental material.
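An image pyramid, as mentioned in the abstract, is simply the same image at successively lower resolutions. The sketch below builds one by 2x2 average pooling on a grayscale image stored as nested lists; the pooling choice, the even-size assumption and the function names are illustrative and not taken from the paper.

```python
def downsample_2x(image):
    """Halve height and width by averaging each 2x2 block.

    Assumes the image has even height and width.
    """
    return [
        [
            (image[r][c] + image[r][c + 1]
             + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
            for c in range(0, len(image[0]), 2)
        ]
        for r in range(0, len(image), 2)
    ]

def image_pyramid(image, levels):
    """Return a list of `levels` images, each half the size of the previous."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(downsample_2x(pyramid[-1]))
    return pyramid
```

An object that is too large at the finest level occupies fewer pixels at a coarser level, which is the motivation for encoding every level and fusing the results.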

[CV-32] EUFCC-CIR: a Composed Image Retrieval Dataset for GLAM Collections ECCV

Link: https://arxiv.org/abs/2410.01536
Authors: Francesc Net, Lluis Gomez
Keywords: Artificial Intelligence, explore cultural heritage, Humanities enables researchers, cultural heritage collections, Digital Humanities enables
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV Workshop (AI4DH2024)

Abstract:The intersection of Artificial Intelligence and Digital Humanities enables researchers to explore cultural heritage collections with greater depth and scale. In this paper, we present EUFCC-CIR, a dataset designed for Composed Image Retrieval (CIR) within Galleries, Libraries, Archives, and Museums (GLAM) collections. Our dataset is built on top of the EUFCC-340K image labeling dataset and contains over 180K annotated CIR triplets. Each triplet is composed of a multi-modal query (an input image plus a short text describing the desired attribute manipulations) and a set of relevant target images. The EUFCC-CIR dataset fills an existing gap in CIR-specific resources for Digital Humanities. We demonstrate the value of the EUFCC-CIR dataset by highlighting its unique qualities in comparison to other existing CIR datasets and evaluating the performance of several zero-shot CIR baselines.

[CV-33] GaussianBlock: Building Part-Aware Compositional and Editable 3D Scene by Primitives and Gaussians

Link: https://arxiv.org/abs/2410.01535
Authors: Shuyi Jiang, Qihao Zhao, Hossein Rahmani, De Wen Soh, Jun Liu, Na Zhao
Keywords: Neural Radiance Fields, Neural Radiance, Radiance Fields, achieved remarkably high, development of Neural
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recently, with the development of Neural Radiance Fields and Gaussian Splatting, 3D reconstruction techniques have achieved remarkably high fidelity. However, the latent representations learnt by these methods are highly entangled and lack interpretability. In this paper, we propose a novel part-aware compositional reconstruction method, called GaussianBlock, that enables semantically coherent and disentangled representations, allowing for precise and physical editing akin to building blocks, while simultaneously maintaining high fidelity. Our GaussianBlock introduces a hybrid representation that leverages the advantages of both primitives, known for their flexible actionability and editability, and 3D Gaussians, which excel in reconstruction quality. Specifically, we achieve semantically coherent primitives through a novel attention-guided centering loss derived from 2D semantic priors, complemented by a dynamic splitting and fusion strategy. Furthermore, we utilize 3D Gaussians that hybridize with primitives to refine structural details and enhance fidelity. Additionally, a binding inheritance strategy is employed to strengthen and maintain the connection between the two. Our reconstructed scenes are evidenced to be disentangled, compositional, and compact across diverse benchmarks, enabling seamless, direct and precise editing while maintaining high quality.

[CV-34] Toward a Holistic Evaluation of Robustness in CLIP Models NEURIPS’23

Link: https://arxiv.org/abs/2410.01534
Authors: Weijie Tu,Weijian Deng,Tom Gedeon
Keywords: Contrastive Language-Image Pre-training, Language-Image Pre-training, CLIP, CLIP models, diverse distribution shifts
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 10 figures, extension of NeurIPS’23 work: A Closer Look at the Robustness of Contrastive Language-Image Pre-Training (CLIP). arXiv admin note: text overlap with arXiv:2402.07410

Abstract:Contrastive Language-Image Pre-training (CLIP) models have shown significant potential, particularly in zero-shot classification across diverse distribution shifts. Building on existing evaluations of overall classification robustness, this work aims to provide a more comprehensive assessment of CLIP by introducing several new perspectives. First, we investigate their robustness to variations in specific visual factors. Second, we assess two critical safety objectives–confidence uncertainty and out-of-distribution detection–beyond mere classification accuracy. Third, we evaluate the finesse with which CLIP models bridge the image and text modalities. Fourth, we extend our examination to 3D awareness in CLIP models, moving beyond traditional 2D image understanding. Finally, we explore the interaction between vision and language encoders within modern large multimodal models (LMMs) that utilize CLIP as the visual backbone, focusing on how this interaction impacts classification robustness. In each aspect, we consider the impact of six factors on CLIP models: model architecture, training distribution, training set size, fine-tuning, contrastive loss, and test-time prompts. Our study uncovers several previously unknown insights into CLIP. For instance, the architecture of the visual encoder in CLIP plays a significant role in their robustness against 3D corruption. CLIP models tend to exhibit a bias towards shape when making predictions. Moreover, this bias tends to diminish after fine-tuning on ImageNet. Vision-language models like LLaVA, leveraging the CLIP vision encoder, could exhibit benefits in classification performance for challenging categories over CLIP alone. Our findings are poised to offer valuable guidance for enhancing the robustness and reliability of CLIP models.

[CV-35] Robo-MUTUAL: Robotic Multimodal Task Specification via Unimodal Learning

Link: https://arxiv.org/abs/2410.01529
Authors: Jianxiong Li,Zhihao Wang,Jinliang Zheng,Xiaoai Zhou,Guanming Wang,Guanglu Song,Yu Liu,Jingjing Liu,Ya-Qin Zhang,Junzhi Yu,Xianyuan Zhan
Keywords: holistically understand complex, Cross-modality Alignment, enhanced robotic performance, understand complex task, essential for enhanced
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: preprint

Abstract:Multimodal task specification is essential for enhanced robotic performance, where Cross-modality Alignment enables the robot to holistically understand complex task instructions. Directly annotating multimodal instructions for model training proves impractical, due to the sparsity of paired multimodal data. In this study, we demonstrate that by leveraging unimodal instructions abundant in real data, we can effectively teach robots to learn multimodal task specifications. First, we endow the robot with strong Cross-modality Alignment capabilities, by pretraining a robotic multimodal encoder using extensive out-of-domain data. Then, we employ two Collapse and Corrupt operations to further bridge the remaining modality gap in the learned multimodal representation. This approach projects different modalities of identical task goal as interchangeable representations, thus enabling accurate robotic operations within a well-aligned multimodal latent space. Evaluation across more than 130 tasks and 4000 evaluations on both simulated LIBERO benchmark and real robot platforms showcases the superior capabilities of our proposed framework, demonstrating significant advantage in overcoming data constraints in robotic learning. Website: this http URL

[CV-36] MiraGe: Editable 2D Images using Gaussian Splatting

Link: https://arxiv.org/abs/2410.01521
Authors: Joanna Waczyńska,Tomasz Szczepanik,Piotr Borycki,Sławomir Tadeja,Thomas Bohné,Przemysław Spurek
Keywords: Implicit Neural Representations, approximate discrete data, Implicit Neural, approximate discrete, discrete data
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Implicit Neural Representations (INRs) approximate discrete data through continuous functions and are commonly used for encoding 2D images. Traditional image-based INRs employ neural networks to map pixel coordinates to RGB values, capturing shapes, colors, and textures within the network’s weights. Recently, GaussianImage has been proposed as an alternative, using Gaussian functions instead of neural networks to achieve comparable quality and compression. Such a solution obtains a quality and compression ratio similar to classical INR models but does not allow image modification. In contrast, our work introduces a novel method, MiraGe, which uses mirror reflections to perceive 2D images in 3D space and employs flat-controlled Gaussians for precise 2D image editing. Our approach improves the rendering quality and allows realistic image modifications, including human-inspired perception of photos in the 3D world. Thanks to modeling images in 3D space, we obtain the illusion of 3D-based modification in 2D images. We also show that our Gaussian representation can be easily combined with a physics engine to produce physics-based modification of 2D images. Consequently, MiraGe allows for better quality than the standard approach and natural modification of 2D images.

[CV-37] UW-GS: Distractor-Aware 3D Gaussian Splatting for Enhanced Underwater Scene Reconstruction

Link: https://arxiv.org/abs/2410.01517
Authors: Haoran Wang,Nantheera Anantrasirichai,Fan Zhang,David Bull
Keywords: real-time high quality, achieve real-time high, offers the capability, high quality, Gaussian splatting
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:3D Gaussian splatting (3DGS) offers the capability to achieve real-time high quality 3D scene rendering. However, 3DGS assumes that the scene is in a clear medium environment and struggles to generate satisfactory representations in underwater scenes, where light absorption and scattering are prevalent and moving objects are involved. To overcome these limitations, we introduce a novel Gaussian Splatting-based method, UW-GS, designed specifically for underwater applications. It introduces a color appearance that models distance-dependent color variation, employs a new physics-based density control strategy to enhance clarity for distant objects, and uses a binary motion mask to handle dynamic content. Optimized with a well-designed loss function that supports scattering media and strengthened by pseudo-depth maps, UW-GS outperforms existing methods with PSNR gains up to 1.26dB. To fully verify the effectiveness of the model, we also developed a new underwater dataset, S-UW, with dynamic object masks.

[CV-38] LEGO: Learnable Expansion of Graph Operators for Multi-Modal Feature Fusion

Link: https://arxiv.org/abs/2410.01506
Authors: Dexuan Ding,Lei Wang,Liyun Zhu,Tom Gedeon,Piotr Koniusz
Keywords: computer vision tasks, diverse representations, computer vision, vision tasks, fusion
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Research paper

Abstract:In computer vision tasks, features often come from diverse representations, domains, and modalities, such as text, images, and videos. Effectively fusing these features is essential for robust performance, especially with the availability of powerful pre-trained models like vision-language models. However, common fusion methods, such as concatenation, element-wise operations, and non-linear techniques, often fail to capture structural relationships and deep feature interactions, and suffer from inefficiency or misalignment of features across domains. In this paper, we shift from high-dimensional feature space to a lower-dimensional, interpretable graph space by constructing similarity graphs that encode feature relationships at different levels, e.g., clip, frame, patch, token, etc. To capture deeper interactions, we use graph power expansions and introduce a learnable graph fusion operator to combine these graph powers for more effective fusion. Our approach is relationship-centric, operates in a homogeneous space, and is mathematically principled, resembling element-wise similarity score aggregation via multilinear polynomials. We demonstrate the effectiveness of our graph-based fusion method on video anomaly detection, showing strong performance across multi-representational, multi-modal, and multi-domain feature fusion tasks.
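
The idea of combining graph powers can be illustrated with a minimal sketch. It is not the paper's operator: the function names `matmul` and `fuse_graph_powers` are hypothetical, and fixed scalar weights stand in for whatever learnable fusion operator the authors actually use. The sketch only computes a weighted sum of the first few powers of a small similarity graph.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def fuse_graph_powers(graph, weights):
    """Return sum_k weights[k] * graph^(k+1) for a square matrix `graph`.

    Assumption: plain scalar weights replace the learnable operator
    described in the abstract.
    """
    n = len(graph)
    fused = [[0.0] * n for _ in range(n)]
    power = graph  # graph^1
    for w in weights:
        for i in range(n):
            for j in range(n):
                fused[i][j] += w * power[i][j]
        power = matmul(power, graph)  # move on to the next power
    return fused


# Toy similarity graph over two nodes; its square is the identity matrix.
toy_graph = [[0, 1], [1, 0]]
print(fuse_graph_powers(toy_graph, [0.5, 0.5]))
```

Higher powers of a similarity graph aggregate multi-hop relationships, which is one way to read the abstract's "deeper interactions"; in a trained model the weights would be learned rather than fixed.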

[CV-39] Quo Vadis RankList-based System in Face Recognition?

Link: https://arxiv.org/abs/2410.01498
Authors: Xinyi Zhang,Manuel Günther
Keywords: face recognition models, Face recognition, recognition models, recognition models perform, wild has gained
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for presentation at IJCB 2024

Abstract:Face recognition in the wild has gained a lot of focus in the last few years, and many face recognition models are designed to verify faces in medium-quality images. Especially due to the availability of large training datasets with similar conditions, deep face recognition models perform exceptionally well in such tasks. However, in other tasks where substantially less training data is available, such methods struggle, especially when required to compare high-quality enrollment images with low-quality probes. On the other hand, traditional RankList-based methods have been developed that compare faces indirectly by comparing to cohort faces with similar conditions. In this paper, we revisit these RankList methods and extend them to use the logits of the state-of-the-art DaliFace network, instead of an external cohort. We show that through a reasonable Logit-Cohort Selection (LoCoS) the performance of RankList-based functions can be improved drastically. Experiments on two challenging face recognition datasets not only demonstrate the enhanced performance of our proposed method but also set the stage for future advancements in handling diverse image qualities.

[CV-40] SinkSAM: A Monocular Depth-Guided SAM Framework for Automatic Sinkhole Segmentation

Link: https://arxiv.org/abs/2410.01473
Authors: Osher Rafaeli,Tal Svoray,Ariel Nahlieli
Keywords: influence soil degradation, significantly influence soil, Soil sinkholes significantly, remotely sensed data, sinkholes significantly influence
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 14 figures

Abstract:Soil sinkholes significantly influence soil degradation, but their irregular shapes, along with interference from shadow and vegetation, make it challenging to accurately quantify their properties using remotely sensed data. We present a novel framework for sinkhole segmentation that combines traditional topographic computations of closed depressions with the newly developed prompt-based Segment Anything Model (SAM). Within this framework, termed SinkSAM, we highlight four key improvements: (1) The integration of topographic computations with SAM enables pixel-level refinement of sinkhole boundaries segmentation; (2) A coherent mathematical prompting strategy, based on closed depressions, addresses the limitations of purely learning-based models (CNNs) in detecting and segmenting undefined sinkhole features, while improving generalization to new, unseen regions; (3) Using Depth Anything V2 monocular depth for automatic prompts eliminates photogrammetric biases, enabling sinkhole mapping without the dependence on LiDAR data; and (4) An established sinkhole database facilitates fine-tuning of SAM, improving its zero-shot performance in sinkhole segmentation. These advancements allow the deployment of SinkSAM, in an unseen test area, in the highly variable semiarid region, achieving an intersection-over-union (IoU) of 40.27% and surpassing previous results. This paper also presents the first SAM implementation for sinkhole segmentation and demonstrates the robustness of SinkSAM in extracting sinkhole maps using a single RGB image.
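
For readers unfamiliar with the reported metric, intersection-over-union (IoU) between a predicted and a reference binary mask can be computed as in the sketch below. The function name `iou` and the toy masks are illustrative assumptions and are not taken from the paper.

```python
def iou(pred, target):
    """Intersection-over-union of two binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


pred_mask = [[1, 1, 0], [0, 1, 0]]
true_mask = [[1, 0, 0], [0, 1, 1]]
print(iou(pred_mask, true_mask))  # 2 shared pixels out of 4 covered -> 0.5
```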

[CV-41] Decorrelation-based Self-Supervised Visual Representation Learning for Writer Identification

Link: https://arxiv.org/abs/2410.01441
Authors: Arkadip Maitra,Shree Mitra,Siladittya Manna,Saumik Bhattacharya,Umapada Pal
Keywords: Self-supervised learning, computer vision, developed rapidly, areas of computer, Self-supervised
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Self-supervised learning has developed rapidly over the last decade and has been applied in many areas of computer vision. Decorrelation-based self-supervised pretraining has shown great promise among non-contrastive algorithms, yielding performance on par with supervised and contrastive self-supervised baselines. In this work, we explore the decorrelation-based paradigm of self-supervised learning and apply it to learning disentangled stroke features for writer identification. Here we propose a modified formulation of SWIS, a decorrelation-based framework originally proposed for signature verification, by standardizing the features along each dimension on top of the existing framework. We show that the proposed framework outperforms the contemporary self-supervised learning framework on the writer identification benchmark and also outperforms several supervised methods. To the best of our knowledge, this work is the first of its kind to apply self-supervised learning for learning representations for writer verification tasks.

[CV-42] EVA-Gaussian: 3D Gaussian-based Real-time Human Novel View Synthesis under Diverse Camera Settings

Link: https://arxiv.org/abs/2410.01425
Authors: Yingdong Hu,Zhening Liu,Jiawei Shao,Zehong Lin,Jun Zhang
Keywords: Gaussian Splatting method, demonstrated exceptional capability, Splatting method, Gaussian Splatting, feed-forward based
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The feed-forward based 3D Gaussian Splatting method has demonstrated exceptional capability in real-time human novel view synthesis. However, existing approaches are restricted to dense viewpoint settings, which limits their flexibility in free-viewpoint rendering across a wide range of camera view angle discrepancies. To address this limitation, we propose a real-time pipeline named EVA-Gaussian for 3D human novel view synthesis across diverse camera settings. Specifically, we first introduce an Efficient cross-View Attention (EVA) module to accurately estimate the position of each 3D Gaussian from the source images. Then, we integrate the source images with the estimated Gaussian position map to predict the attributes and feature embeddings of the 3D Gaussians. Moreover, we employ a recurrent feature refiner to correct artifacts caused by geometric errors in position estimation and enhance visual fidelity. To further improve synthesis quality, we incorporate a powerful anchor loss function for both 3D Gaussian attributes and human face landmarks. Experimental results on the THuman2.0 and THumansit datasets showcase the superiority of our EVA-Gaussian approach in rendering quality across diverse camera settings. Project page: this https URL.

[CV-43] The Labyrinth of Links: Navigating the Associative Maze of Multi-modal LLMs

Link: https://arxiv.org/abs/2410.01417
Authors: Hong Li,Nanxi Li,Yuanjie Chen,Jianbin Zhu,Qinlu Guo,Cewu Lu,Yong-Lu Li
Keywords: Multi-modal Large Language, Large Language Models, Multi-modal Large, Large Language, Language Models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Multi-modal Large Language Models (MLLMs) have exhibited impressive capability. However, recently many deficiencies of MLLMs have been found compared to human intelligence, e.g., hallucination. To drive the MLLMs study, the community dedicated efforts to building larger benchmarks with complex tasks. In this paper, we propose benchmarking an essential but usually overlooked intelligence: association, a human’s basic capability to link observation and prior practice memory. To comprehensively investigate MLLM’s performance on the association, we formulate the association task and devise a standard benchmark based on adjective and verb semantic concepts. Instead of costly data annotation and curation, we propose a convenient annotation-free construction method transforming the general dataset for our association tasks. Simultaneously, we devise a rigorous data refinement process to eliminate confusion in the raw dataset. Building on this database, we establish three levels of association tasks: single-step, synchronous, and asynchronous associations. Moreover, we conduct a comprehensive investigation into the MLLMs’ zero-shot association capabilities, addressing multiple dimensions, including three distinct memory strategies, both open-source and closed-source MLLMs, cutting-edge Mixture-of-Experts (MoE) models, and the involvement of human experts. Our systematic investigation shows that current open-source MLLMs consistently exhibit poor capability in our association tasks, and even the current state-of-the-art GPT-4V(vision) has a significant gap compared to humans. We believe our benchmark would pave the way for future MLLM studies. Our data and code are available at: this https URL.

[CV-44] SHAP-CAT: A interpretable multi-modal framework enhancing WSI classification via virtual staining and shapley-value-based multimodal fusion

Link: https://arxiv.org/abs/2410.01408
Authors: Jun Wang,Yu Mao,Nan Guan,Chun Jason Xue
Keywords: promise in histopathology, demonstrated promise, multimodal, dimension reduction
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The multimodal model has demonstrated promise in histopathology. However, most multimodal models are based on H&E and genomics, adopting increasingly complex yet black-box designs. In our paper, we propose a novel interpretable multimodal framework named SHAP-CAT, which uses a Shapley-value-based dimension reduction technique for effective multimodal fusion. Starting with two paired modalities – H&E and IHC images, we employ virtual staining techniques to enhance limited input data by generating a new clinical-related modality. Lightweight bag-level representations are extracted from image modalities and a Shapley-value-based mechanism is used for dimension reduction. For each dimension of the bag-level representation, attribution values are calculated to indicate how changes in the specific dimensions of the input affect the model output. In this way, we select a few top important dimensions of bag-level representation for each image modality for late fusion. Our experimental results demonstrate that the proposed SHAP-CAT framework incorporating synthetic modalities significantly enhances model performance, yielding a 5% increase in accuracy for the BCI, an 8% increase for IHC4BC-ER, and an 11% increase for the IHC4BC-PR dataset.

[CV-45] AgriCLIP: Adapting CLIP for Agriculture and Livestock via Domain-Specialized Cross-Model Alignment

Link: https://arxiv.org/abs/2410.01407
Authors: Umair Nawaz,Muhammad Awais,Hanan Gani,Muzammal Naseer,Fahad Khan,Salman Khan,Rao Muhammad Anwer
Keywords: Capitalizing on vast, demonstrated remarkable zero-shot, remarkable zero-shot capabilities, vast amount, pre-training has demonstrated
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Capitalizing on the vast amount of image-text data, large-scale vision-language pre-training has demonstrated remarkable zero-shot capabilities and has been utilized in several applications. However, models trained on general everyday web-crawled data often exhibit sub-optimal performance for specialized domains, likely due to domain shift. Recent works have tackled this problem for some domains (e.g., healthcare) by constructing domain-specialized image-text data. However, constructing a dedicated large-scale image-text dataset for the sustainable area of agriculture and livestock is still open to research. Further, this domain requires fine-grained feature learning due to the subtle nature of the downstream tasks (e.g., nutrient deficiency detection, livestock breed classification). To address this, we present AgriCLIP, a vision-language foundational model dedicated to the domain of agriculture and livestock. First, we propose a large-scale dataset, named ALive, that leverages a customized prompt generation strategy to overcome the scarcity of expert annotations. Our ALive dataset covers crops, livestock, and fishery, with around 600,000 image-text pairs. Second, we propose a training pipeline that integrates both contrastive and self-supervised learning to learn both global semantic and local fine-grained domain-specialized features. Experiments on a diverse set of 20 downstream tasks demonstrate the effectiveness of the AgriCLIP framework, achieving an absolute gain of 7.8% in terms of average zero-shot classification accuracy, over the standard CLIP adaptation via the domain-specialized ALive dataset. Our ALive dataset and code are accessible on GitHub at this https URL.

[CV-46] Gaussian-Det: Learning Closed-Surface Gaussians for 3D Object Detection

Link: https://arxiv.org/abs/2410.01404
Authors: Hongru Yan,Yu Zheng,Yueqi Duan
Keywords: sheet metal coating, Skins wrapping, informative geometry prior, Gaussian Splatting, leverages Gaussian Splatting
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Skins wrapping around our bodies, leathers covering over the sofa, sheet metal coating the car - it suggests that objects are enclosed by a series of continuous surfaces, which provides us with informative geometry prior for objectness deduction. In this paper, we propose Gaussian-Det which leverages Gaussian Splatting as surface representation for multi-view based 3D object detection. Unlike existing monocular or NeRF-based methods which depict the objects via discrete positional data, Gaussian-Det models the objects in a continuous manner by formulating the input Gaussians as feature descriptors on a mass of partial surfaces. Furthermore, to address the numerous outliers inherently introduced by Gaussian splatting, we accordingly devise a Closure Inferring Module (CIM) for the comprehensive surface-based objectness deduction. CIM firstly estimates the probabilistic feature residuals for partial surfaces given the underdetermined nature of Gaussian Splatting, which are then coalesced into a holistic representation on the overall surface closure of the object proposal. In this way, the surface information Gaussian-Det exploits serves as the prior on the quality and reliability of objectness and the information basis of proposal refinement. Experiments on both synthetic and real-world datasets demonstrate that Gaussian-Det outperforms various existing approaches, in terms of both average precision and recall.

[CV-47] Signal Adversarial Examples Generation for Signal Detection Network via White-Box Attack

Link: https://arxiv.org/abs/2410.01393
Authors: Dongyang Li,Linyuan Wang,Guangwei Xiong,Bin Yan,Dekui Ma,Jinxian Peng
Keywords: signal detection network, signal detection tasks, signal detection, detection network, signal
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: 18 pages, 6 figures, submitted to Mobile Networks and Applications

Abstract:With the development and application of deep learning in signal detection tasks, the vulnerability of neural networks to adversarial attacks has also become a security threat to signal detection networks. This paper defines a signal adversarial examples generation model for signal detection network from the perspective of adding perturbations to the signal. The model uses the inequality relationship of L2-norm between time domain and time-frequency domain to constrain the energy of signal perturbations. Building upon this model, we propose a method for generating signal adversarial examples utilizing gradient-based attacks and Short-Time Fourier Transform. The experimental results show that under the constraint of signal perturbation energy ratio less than 3%, our adversarial attack resulted in a 28.1% reduction in the mean Average Precision (mAP), a 24.7% reduction in recall, and a 30.4% reduction in precision of the signal detection network. Compared to random noise perturbation of equivalent intensity, our adversarial attack demonstrates a significant attack effect.
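
The energy constraint mentioned above can be made concrete with a small sketch. It assumes, as an illustration only, that the "perturbation energy ratio" is the squared L2 norm of the perturbation divided by the squared L2 norm of the clean signal; the abstract does not spell out the exact definition, and the function names are hypothetical. The sketch rescales a candidate perturbation so the ratio does not exceed a budget such as 3%.

```python
import math


def energy(samples):
    """Signal energy as the sum of squared samples (squared L2 norm)."""
    return sum(v * v for v in samples)


def scale_to_energy_ratio(signal, perturbation, max_ratio=0.03):
    """Shrink `perturbation` so energy(perturbation) / energy(signal) <= max_ratio.

    Assumption: the ratio is defined on time-domain squared L2 norms.
    """
    e_sig = energy(signal)
    e_pert = energy(perturbation)
    if e_pert == 0 or e_pert <= max_ratio * e_sig:
        return list(perturbation)  # already within the budget
    scale = math.sqrt(max_ratio * e_sig / e_pert)
    return [scale * v for v in perturbation]


clean = [1.0, 1.0, 1.0, 1.0]
raw_delta = [1.0, -1.0, 1.0, -1.0]
delta = scale_to_energy_ratio(clean, raw_delta)
print(energy(delta) / energy(clean))  # approximately 0.03
```

In a gradient-based attack, a projection of this kind would typically be applied after each update step so that the perturbation stays inside the energy budget.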

[CV-48] Quantifying Cancer Likeness: A Statistical Approach for Pathological Image Diagnosis

Link: https://arxiv.org/abs/2410.01391
Authors: Toshiki Kindo
Keywords: automatically identify cancer, identify cancer regions, approach to automatically, automatically identify, statistical approach
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 3 figures

Abstract:In this paper, we present a new statistical approach to automatically identify cancer regions in pathological images. The proposed method is built from statistical theory in line with evidence-based medicine. The two core technologies are the classification information of image features, which is introduced based on information theory and for which cancer features take positive values and normal features take negative values, and the calculation technique for determining their spatial distribution. This method then estimates areas where the classification information content shows a positive value as cancer areas in the pathological image. The method achieves AUCs of 0.95 or higher in cancer classification tasks. In addition, the proposed method has the practical advantage of not requiring a precise demarcation line between cancer and normal. This frees pathologists from the monotonous and tedious work of building consensus with other pathologists.

[CV-49] Learning Physics From Video: Unsupervised Physical Parameter Estimation for Continuous Dynamical Systems

Link: https://arxiv.org/abs/2410.01376
Authors: Alejandro Castañeda Garcia,Jan van Gemert,Daan Brinks,Nergis Tömen
Keywords: Extracting physical dynamical, science and technology, Extracting physical, great interest, interest to applications
Categories: Computer Vision and Pattern Recognition (cs.CV); Computational Physics (physics.comp-ph)

Abstract:Extracting physical dynamical system parameters from videos is of great interest to applications in natural science and technology. The state-of-the-art in automatic parameter estimation from video is addressed by training supervised deep networks on large datasets. Such datasets require labels, which are difficult to acquire. While some unsupervised techniques – which depend on frame prediction – exist, they suffer from long training times, instability under different initializations, and are limited to hand-picked motion problems. In this work, we propose a method to estimate the physical parameters of any known, continuous governing equation from single videos; our solution is suitable for different dynamical systems beyond motion and is robust to initialization compared to previous approaches. Moreover, we remove the need for frame prediction by implementing a KL-divergence-based loss function in the latent space, which avoids convergence to trivial solutions and reduces model size and compute.

[CV-50] Harnessing the Latent Diffusion Model for Training-Free Image Style Transfer

Link: https://arxiv.org/abs/2410.01366
Authors: Kento Masui,Mayu Otani,Masahiro Nomura,Hideki Nakayama
Keywords: generate high-quality images, Reverse Diffusion Process, recently shown, shown the ability, ability to generate
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Abstract:Diffusion models have recently shown the ability to generate high-quality images. However, controlling their generation process still poses challenges. The image style transfer task, which transfers the visual attributes of a style image to another content image, is one of those challenges. A typical obstacle of this task is the requirement of additional training of a pre-trained model. We propose a training-free style transfer algorithm, Style Tracking Reverse Diffusion Process (STRDP), for a pretrained Latent Diffusion Model (LDM). Our algorithm employs the Adaptive Instance Normalization (AdaIN) function in a distinct manner during the reverse diffusion process of an LDM while tracking the encoding history of the style image. This algorithm enables style transfer in the latent space of LDM for reduced computational cost, and provides compatibility for various LDM models. Through a series of experiments and a user study, we show that our method can quickly transfer the style of an image without additional training. The speed, compatibility, and training-free aspect of our algorithm facilitate agile experiments with combinations of styles and LDMs for extensive application.
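
As background, the standard AdaIN operation re-normalizes content features so that their mean and standard deviation match those of the style features: AdaIN(x, y) = σ(y) · (x − μ(x)) / σ(x) + μ(y). The sketch below applies this formula to a single channel stored as a flat list. It shows only the textbook operation, not the paper's modified use inside the reverse diffusion process; the function name `adain` and the `eps` stabilizer are illustrative assumptions.

```python
from statistics import fmean, pstdev


def adain(content, style, eps=1e-5):
    """Standard AdaIN on one channel: align content statistics with style statistics.

    `eps` is an assumed small constant that avoids division by zero.
    """
    mu_c, sd_c = fmean(content), pstdev(content)
    mu_s, sd_s = fmean(style), pstdev(style)
    return [sd_s * (x - mu_c) / (sd_c + eps) + mu_s for x in content]


content_channel = [1.0, 2.0, 3.0, 4.0]
style_channel = [10.0, 20.0, 30.0, 40.0]
stylized = adain(content_channel, style_channel)
# The output keeps the content's ordering but carries the style's mean and spread.
print(fmean(stylized), pstdev(stylized))
```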

[CV-51] High-quality Animatable Eyelid Shapes from Lightweight Captures SIGGRAPH

Link: https://arxiv.org/abs/2410.01360
Authors: Junfeng Lyu,Feng Xu
Keywords: High-quality eyelid reconstruction, eyelid reconstruction, complicated deformations, subtle details, High-quality eyelid
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by SIGGRAPH Asia 2024

Abstract:High-quality eyelid reconstruction and animation are challenging for the subtle details and complicated deformations. Previous works usually suffer from the trade-off between the capture costs and the quality of details. In this paper, we propose a novel method that can achieve detailed eyelid reconstruction and animation by only using an RGB video captured by a mobile phone. Our method utilizes both static and dynamic information of eyeballs (e.g., positions and rotations) to assist the eyelid reconstruction, cooperating with an automatic eyeball calibration method to get the required eyeball parameters. Furthermore, we develop a neural eyelid control module to achieve the semantic animation control of eyelids. To the best of our knowledge, we present the first method for high-quality eyelid reconstruction and animation from lightweight captures. Extensive experiments on both synthetic and real data show that our method can provide more detailed and realistic results compared with previous methods based on the same-level capture setups. The code is available at this https URL.

[CV-52] Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-guided 3D Policy

Link: https://arxiv.org/abs/2410.01345
Authors: Ricardo Garcia,Shizhe Chen,Cordelia Schmid
Keywords: Generalizing language-conditioned robotic, Generalizing language-conditioned, suitable simulation benchmarks, language-conditioned robotic policies, significant challenge
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Generalizing language-conditioned robotic policies to new tasks remains a significant challenge, hampered by the lack of suitable simulation benchmarks. In this paper, we address this gap by introducing GemBench, a novel benchmark to assess generalization capabilities of vision-language robotic manipulation policies. GemBench incorporates seven general action primitives and four levels of generalization, spanning novel placements, rigid and articulated objects, and complex long-horizon tasks. We evaluate state-of-the-art approaches on GemBench and also introduce a new method. Our approach, 3D-LOTUS, leverages rich 3D information for action prediction conditioned on language. While 3D-LOTUS excels in both efficiency and performance on seen tasks, it struggles with novel tasks. To address this, we present 3D-LOTUS++, a framework that integrates 3D-LOTUS’s motion planning capabilities with the task planning capabilities of LLMs and the object grounding accuracy of VLMs. 3D-LOTUS++ achieves state-of-the-art performance on novel tasks of GemBench, setting a new standard for generalization in robotic manipulation. The benchmark, code and trained models are available at this https URL.

[CV-53] Cognition Transferring and Decoupling for Text-supervised Egocentric Semantic Segmentation

Link: https://arxiv.org/abs/2410.01341
Authors: Zhaofeng Shi,Heqian Qiu,Lanxiao Wang,Fanman Meng,Qingbo Wu,Hongliang Li
Keywords: Egocentric Semantic Segmentation, Text-supervised Egocentric Semantic, Text-supervised Egocentric, assign pixel-level categories, images weakly supervised
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:In this paper, we explore a novel Text-supervised Egocentric Semantic Segmentation (TESS) task that aims to assign pixel-level categories to egocentric images weakly supervised by texts from image-level labels. In this task with prospective potential, the egocentric scenes contain dense wearer-object relations and inter-object interference. However, most recent third-view methods leverage the frozen Contrastive Language-Image Pre-training (CLIP) model, which is pre-trained on the semantic-oriented third-view data and lapses in the egocentric view due to the "relation insensitive" problem. Hence, we propose a Cognition Transferring and Decoupling Network (CTDN) that first learns the egocentric wearer-object relations via correlating the image and text. Besides, a Cognition Transferring Module (CTM) is developed to distill the cognitive knowledge from the large-scale pre-trained model to our model for recognizing egocentric objects with various semantics. Based on the transferred cognition, the Foreground-background Decoupling Module (FDM) disentangles the visual representations to explicitly discriminate the foreground and background regions to mitigate false activation areas caused by foreground-background interferential objects during egocentric relation learning. Extensive experiments on four TESS benchmarks demonstrate the effectiveness of our approach, which outperforms many recent related methods by a large margin. Code will be available at this https URL.

[CV-54] VectorGraphNET: Graph Attention Networks for Accurate Segmentation of Complex Technical Drawings

Link: https://arxiv.org/abs/2410.01336
Authors: Andrea Carrara,Stavros Nousias,André Borrmann
Keywords: analyze vector data, paper introduces, extract and analyze, involves converting PDF, converting PDF files
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 27 pages, 13 figures

Abstract:This paper introduces a new approach to extract and analyze vector data from technical drawings in PDF format. Our method involves converting PDF files into SVG format and creating a feature-rich graph representation, which captures the relationships between vector entities using geometrical information. We then apply a graph attention transformer with hierarchical label definition to achieve accurate line-level segmentation. Our approach is evaluated on two datasets, including the public FloorplanCAD dataset, on which it achieves state-of-the-art results on the weighted F1 (wF1) score, surpassing existing methods. The proposed vector-based method offers a more scalable solution for large-scale technical drawing analysis compared to vision-based approaches, while also requiring significantly less GPU power than current state-of-the-art vector-based techniques. Moreover, it demonstrates improved performance in terms of wF1 on the semantic segmentation task. Our results demonstrate the effectiveness of our approach in extracting meaningful information from technical drawings, enabling new applications, and improving existing workflows in the AEC industry. Potential applications of our approach include automated building information modeling (BIM) and construction planning, which could significantly impact the efficiency and productivity of the industry.
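
The abstract reports results as a weighted F1 score. For readers unfamiliar with the metric, here is a minimal sketch of one common convention, where each class's F1 is weighted by its share of the ground-truth labels. This is an assumption: the abstract does not say how the paper weights classes (a benchmark could, for instance, weight by entity length instead), and the label names in the usage note are made up.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted F1: per-class F1 averaged by ground-truth frequency."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp  # ground-truth items of this class that were missed
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score
```

For example, with ground truth `["wall", "wall", "door", "window"]` and predictions `["wall", "door", "door", "window"]`, the per-class F1 values are 2/3, 2/3 and 1, and the support weights are 0.5, 0.25 and 0.25.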

[CV-55] Forte: Finding Outliers with Representation Typicality Estimation

Link: https://arxiv.org/abs/2410.01322
Authors: Debargha Ganguly,Warren Morningstar,Andrew Yu,Vipin Chaudhary
Keywords: produce photorealistic synthetic, generative OOD detectors, OOD detectors, virtually indistinguishable, photorealistic synthetic data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)

Abstract:Generative models can now produce photorealistic synthetic data which is virtually indistinguishable from the real data used to train them. This is a significant evolution over previous models which could produce reasonable facsimiles of the training data, but ones which could be visually distinguished from the training data by human evaluation. Recent work on OOD detection has raised doubts that generative model likelihoods are optimal OOD detectors due to issues involving likelihood misestimation, entropy in the generative process, and typicality. We speculate that generative OOD detectors also failed because their models focused on the pixels rather than the semantic content of the data, leading to failures in near-OOD cases where the pixels may be similar but the information content is significantly different. We hypothesize that estimating typical sets using self-supervised learners leads to better OOD detectors. We introduce a novel approach that leverages representation learning, and informative summary statistics based on manifold estimation, to address all of the aforementioned issues. Our method outperforms other unsupervised approaches and achieves state-of-the-art performance on well-established challenging benchmarks, and new synthetic data detection tasks.

[CV-56] Finetuning Pre-trained Model with Limited Data for LiDAR-based 3D Object Detection by Bridging Domain Gaps IROS

Link: https://arxiv.org/abs/2410.01319
Authors: Jiyun Jang,Mincheol Chang,Jongwon Park,Jinkyu Kim
Keywords: including autonomous vehicles, including autonomous, mobile robots, largely utilized, autonomous vehicles
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Accepted in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2024

Abstract:LiDAR-based 3D object detectors have been largely utilized in various applications, including autonomous vehicles or mobile robots. However, LiDAR-based detectors often fail to adapt well to target domains with different sensor configurations (e.g., types of sensors, spatial resolution, or FOVs) and location shifts. Collecting and annotating datasets in a new setup is commonly required to reduce such gaps, but it is often expensive and time-consuming. Recent studies suggest that pre-trained backbones can be learned in a self-supervised manner with large-scale unlabeled LiDAR frames. However, despite their expressive representations, they remain challenging to generalize well without substantial amounts of data from the target domain. Thus, we propose a novel method, called Domain Adaptive Distill-Tuning (DADT), to adapt a pre-trained model with limited target data (approximately 100 LiDAR frames), retaining its representation power and preventing it from overfitting. Specifically, we use regularizers to align object-level and context-level representations between the pre-trained and finetuned models in a teacher-student architecture. Our experiments with driving benchmarks, i.e., Waymo Open dataset and KITTI, confirm that our method effectively finetunes a pre-trained model, achieving significant gains in accuracy.
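
The abstract describes regularizers that align object-level and context-level representations between a pre-trained teacher and a finetuned student. The sketch below shows one way such a combined objective could look. It is only an illustration under stated assumptions: the mean-squared-error alignment term, the weights `w_obj` and `w_ctx`, and all function names are hypothetical and are not taken from the paper.

```python
def mse(a, b):
    """Mean squared error between two equally long feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distill_tuning_loss(task_loss, student_obj, teacher_obj,
                        student_ctx, teacher_ctx, w_obj=1.0, w_ctx=1.0):
    """Detection loss plus two alignment regularizers (assumed form).

    student_* come from the model being finetuned, teacher_* from the frozen
    pre-trained model; the regularizers pull the student toward the teacher.
    """
    obj_term = mse(student_obj, teacher_obj)  # object-level alignment
    ctx_term = mse(student_ctx, teacher_ctx)  # context-level alignment
    return task_loss + w_obj * obj_term + w_ctx * ctx_term
```

Keeping the teacher frozen and penalizing drift in both feature groups is one plausible way to retain representation power while limiting overfitting on roughly 100 target frames.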

[CV-57] Deep learning for action spotting in association football videos

Link: https://arxiv.org/abs/2410.01304
Authors: Silvio Giancola,Anthony Cioppa,Bernard Ghanem,Marc Van Droogenbroeck
Keywords: untrimmed video streams, action spotting, action spotting consists, spotting, timestamp in long
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 31 pages, 2 figures, 5 tables

Abstract:The task of action spotting consists in both identifying actions and precisely localizing them in time with a single timestamp in long, untrimmed video streams. Automatically extracting those actions is crucial for many sports applications, including sports analytics to produce extended statistics on game actions, coaching to provide support to video analysts, or fan engagement to automatically overlay content in the broadcast when specific actions occur. However, before 2018, no large-scale datasets for action spotting in sports were publicly available, which impeded benchmarking action spotting methods. In response, our team built the largest dataset and the most comprehensive benchmarks for sports video understanding, under the umbrella of SoccerNet. Particularly, our dataset contains a subset specifically dedicated to action spotting, called SoccerNet Action Spotting, containing more than 550 complete broadcast games annotated with almost all types of actions that can occur in a football game. This dataset is tailored to develop methods for automatic spotting of actions of interest, including deep learning approaches, by providing a large amount of manually annotated actions. To engage with the scientific community, the SoccerNet initiative organizes yearly challenges, during which participants from all around the world compete to achieve state-of-the-art performances. Thanks to our dataset and challenges, more than 60 methods were developed or published over the past five years, improving on the first baselines and making action spotting a viable option for the sports industry. This paper traces the history of action spotting in sports, from the creation of the task back in 2018, to the role it plays today in research and the sports industry.

[CV-58] LaGeM: A Large Geometry Model for 3D Representation Learning and Diffusion

Link: https://arxiv.org/abs/2410.01295
Authors: Biao Zhang,Peter Wonka
Keywords: compressed latent space, highly compressed latent, latent space, hierarchical autoencoder, paper introduces
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: For more information: this https URL

Abstract:This paper introduces a novel hierarchical autoencoder that maps 3D models into a highly compressed latent space. The hierarchical autoencoder is specifically designed to tackle the challenges arising from large-scale datasets and generative modeling using diffusion. Different from previous approaches that only work on a regular image or volume grid, our hierarchical autoencoder operates on unordered sets of vectors. Each level of the autoencoder controls different geometric levels of detail. We show that the model can be used to represent a wide range of 3D models while faithfully representing high-resolution geometry details. The training of the new architecture takes 0.70x time and 0.58x memory compared to the baseline. We also explore how the new representation can be used for generative modeling. Specifically, we propose a cascaded diffusion framework where each stage is conditioned on the previous stage. Our design extends existing cascaded designs for image and volume grids to vector sets.

[CV-59] SurgeoNet: Realtime 3D Pose Estimation of Articulated Surgical Instruments from Stereo Images using a Synthetically-trained Network

Link: https://arxiv.org/abs/2410.01293
Authors: Ahmed Tawfik Aboukhadra,Nadia Robertini,Jameel Malik,Ahmed Elhayek,Gerd Reis,Didier Stricker
Keywords: Mixed Reality, recently received substantial, received substantial focus, substantial focus due, monitoring in Mixed
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Surgery monitoring in Mixed Reality (MR) environments has recently received substantial focus due to its importance in image-based decisions, skill assessment, and robot-assisted surgery. Tracking hands and articulated surgical instruments is crucial for the success of these applications. Due to the lack of annotated datasets and the complexity of the task, only a few works have addressed this problem. In this work, we present SurgeoNet, a real-time neural network pipeline to accurately detect and track surgical instruments from a stereo VR view. Our multi-stage approach is inspired by state-of-the-art neural-network architectural design, like YOLO and Transformers. We demonstrate the generalization capabilities of SurgeoNet in challenging real-world scenarios, achieved solely through training on synthetic data. The approach can be easily extended to any new set of articulated surgical instruments. SurgeoNet’s code and data are publicly available.

[CV-60] CANVAS: Commonsense-Aware Navigation System for Intuitive Human-Robot Interaction

Link: https://arxiv.org/abs/2410.01273
Authors: Suhwan Choi,Yongjun Cho,Minchan Kim,Jaeyoon Jung,Myunchul Joe,Yubeen Park,Minseo Kim,Sungwoong Kim,Sungjae Lee,Hwiseong Park,Jiwan Chung,Youngjae Yu
Keywords: requires optimizing movements, addressing scenario-specific goals, Real-life robot navigation, Real-life robot, reaching a destination
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:Real-life robot navigation involves more than just reaching a destination; it requires optimizing movements while addressing scenario-specific goals. An intuitive way for humans to express these goals is through abstract cues like verbal commands or rough sketches. Such human guidance may lack details or be noisy. Nonetheless, we expect robots to navigate as intended. For robots to interpret and execute these abstract instructions in line with human expectations, they must share a common understanding of basic navigation concepts with humans. To this end, we introduce CANVAS, a novel framework that combines visual and linguistic instructions for commonsense-aware navigation. Its success is driven by imitation learning, enabling the robot to learn from human navigation behavior. We present COMMAND, a comprehensive dataset with human-annotated navigation results, spanning over 48 hours and 219 km, designed to train commonsense-aware navigation systems in simulated environments. Our experiments show that CANVAS outperforms the strong rule-based system ROS NavStack across all environments, demonstrating superior performance with noisy instructions. Notably, in the orchard environment, where ROS NavStack records a 0% total success rate, CANVAS achieves a total success rate of 67%. CANVAS also closely aligns with human demonstrations and commonsense constraints, even in unseen environments. Furthermore, real-world deployment of CANVAS showcases impressive Sim2Real transfer with a total success rate of 69%, highlighting the potential of learning from human demonstrations in simulated environments for real-world applications.

[CV-61] Panopticus: Omnidirectional 3D Object Detection on Resource-constrained Edge Devices

Link: https://arxiv.org/abs/2410.01270
Authors: Jeho Lee,Chanyoung Jung,Jiwon Kim,Hojung Cha
Keywords: views enables safety-critical, mobile robot navigation, enables safety-critical applications, robot navigation, omnidirectional views enables
Categories: Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments: Published at MobiCom 2024

Abstract:3D object detection with omnidirectional views enables safety-critical applications such as mobile robot navigation. Such applications increasingly operate on resource-constrained edge devices, facilitating reliable processing without privacy concerns or network delays. To enable cost-effective deployment, cameras have been widely adopted as a low-cost alternative to LiDAR sensors. However, the compute-intensive workload to achieve high performance of camera-based solutions remains challenging due to the computational limitations of edge devices. In this paper, we present Panopticus, a carefully designed system for omnidirectional and camera-based 3D detection on edge devices. Panopticus employs an adaptive multi-branch detection scheme that accounts for spatial complexities. To optimize the accuracy within latency limits, Panopticus dynamically adjusts the model’s architecture and operations based on available edge resources and spatial characteristics. We implemented Panopticus on three edge devices and conducted experiments across real-world environments based on the public self-driving dataset and our mobile 360° camera dataset. Experimental results showed that Panopticus improves accuracy by 62% on average given the strict latency objective of 33 ms. Also, Panopticus achieves a 2.1× latency reduction on average compared to baselines.
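
The abstract says Panopticus optimizes accuracy within a latency limit by adjusting the model at run time. The general idea of budgeted selection can be sketched as below. This is a simplified illustration, not the system's scheduler: the candidate dictionaries, their `latency_ms` and `accuracy` fields, and the fallback rule are all assumptions made for this example.

```python
def pick_config(candidates, latency_budget_ms):
    """Pick the most accurate candidate whose predicted latency fits the budget.

    candidates: list of dicts with hypothetical keys "name", "latency_ms"
    and "accuracy" (predicted values, e.g. from offline profiling).
    """
    feasible = [c for c in candidates if c["latency_ms"] <= latency_budget_ms]
    if not feasible:
        # Nothing fits: fall back to the fastest option (an assumed policy).
        return min(candidates, key=lambda c: c["latency_ms"])
    return max(feasible, key=lambda c: c["accuracy"])
```

With a 33 ms budget, a candidate predicted at 30 ms would be preferred over a more accurate one predicted at 55 ms, since the latter would miss the deadline.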

[CV-62] Backdooring Vision-Language Models with Out-Of-Distribution Data

Link: https://arxiv.org/abs/2410.01264
Authors: Weimin Lyu,Jiachen Yao,Saumya Gupta,Lu Pang,Tao Sun,Lingjie Yi,Lijie Hu,Haibin Ling,Chao Chen
Keywords: Large Language Models, Large Language, integrating computer vision, generate detailed text, detailed text descriptions
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The emergence of Vision-Language Models (VLMs) represents a significant advancement in integrating computer vision with Large Language Models (LLMs) to generate detailed text descriptions from visual inputs. Despite their growing importance, the security of VLMs, particularly against backdoor attacks, is underexplored. Moreover, prior works often assume attackers have access to the original training data, which is often unrealistic. In this paper, we address a more practical and challenging scenario where attackers must rely solely on Out-Of-Distribution (OOD) data. We introduce VLOOD (Backdooring Vision-Language Models with Out-of-Distribution Data), a novel approach with two key contributions: (1) demonstrating backdoor attacks on VLMs in complex image-to-text tasks while minimizing degradation of the original semantics under poisoned inputs, and (2) proposing innovative techniques for backdoor injection without requiring any access to the original training data. Our evaluation on image captioning and visual question answering (VQA) tasks confirms the effectiveness of VLOOD, revealing a critical security vulnerability in VLMs and laying the foundation for future research on securing multimodal models against sophisticated threats.

[CV-63] Aggregation of Multi Diffusion Models for Enhancing Learned Representations

Link: https://arxiv.org/abs/2410.01262
Authors: Conghan Yue,Zhengwei Peng,Shiyan Du,Zhi Ji,Dongyu Zhang
Keywords: Diffusion models, achieved remarkable success, Diffusion, classifier-free guidance conditional, Multi Diffusion Models
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Diffusion models have achieved remarkable success in image generation, particularly with the various applications of classifier-free guidance conditional diffusion models. While many diffusion models perform well when controlling a particular aspect such as style, character, or interaction, they struggle with fine-grained control due to dataset limitations and intricate model architecture design. This paper introduces a novel algorithm, Aggregation of Multi Diffusion Models (AMDM), which synthesizes features from multiple diffusion models into a specified model, enhancing its learned representations to activate specific features for fine-grained control. AMDM consists of two key components: spherical aggregation and manifold optimization. Spherical aggregation merges intermediate variables from different diffusion models with minimal manifold deviation, while manifold optimization refines these variables to align with the intermediate data manifold, enhancing sampling quality. Experimental results demonstrate that AMDM significantly improves fine-grained control without additional training or inference time, proving its effectiveness. Additionally, it reveals that diffusion models initially focus on features such as position, attributes, and style, with later stages improving generation quality and consistency. AMDM offers a new perspective for tackling the challenges of fine-grained conditional control generation in diffusion models: we can fully utilize existing conditional diffusion models that control specific aspects, or develop new ones, and then aggregate them using the AMDM algorithm. This eliminates the need for constructing complex datasets, designing intricate model architectures, and incurring high training costs. Code is available at: this https URL

[CV-64] OCC-MLLM: Empowering Multimodal Large Language Model For the Understanding of Occluded Objects CVPR2024

Link: https://arxiv.org/abs/2410.01261
Authors: Wenmo Qiu,Xinhan Di
Keywords: occluded objects, visual language multi-modal, language multi-modal models, language multi-modal, objects
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2024 T4V Workshop (5 pages, 3 figures, 2 tables)

Abstract:There is a gap in the understanding of occluded objects in existing large-scale visual language multi-modal models. Current state-of-the-art multimodal models fail to provide satisfactory results in describing occluded objects for visual-language multimodal models through universal visual encoders. Another challenge is the limited number of datasets containing image-text pairs with a large number of occluded objects. Therefore, we introduce a novel multimodal model that applies a newly designed visual encoder to understand occluded objects in RGB images. We also introduce a large-scale visual-language pair dataset for training large-scale visual-language multimodal models and understanding occluded objects. We start our experiments comparing with the state-of-the-art models.

[CV-65] Facial Action Unit Detection by Adaptively Constraining Self-Attention and Causally Deconfounding Sample

Link: https://arxiv.org/abs/2410.01251
Authors: Zhiwen Shao,Hancheng Zhu,Yong Zhou,Xiang Xiang,Bing Liu,Rui Yao,Lizhuang Ma
Keywords: Facial action unit, Facial action, self-attention weight distribution, action unit, self-attention weight
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper is accepted by International Journal of Computer Vision

Abstract:Facial action unit (AU) detection remains a challenging task, due to the subtlety, dynamics, and diversity of AUs. Recently, the prevailing techniques of self-attention and causal inference have been introduced to AU detection. However, most existing methods directly learn self-attention guided by AU detection, or employ common patterns for all AUs during causal intervention. The former often captures irrelevant information in a global range, and the latter ignores the specific causal characteristic of each AU. In this paper, we propose a novel AU detection framework called AC2D by adaptively constraining self-attention weight distribution and causally deconfounding the sample confounder. Specifically, we explore the mechanism of self-attention weight distribution, in which the self-attention weight distribution of each AU is regarded as spatial distribution and is adaptively learned under the constraint of location-predefined attention and the guidance of AU detection. Moreover, we propose a causal intervention module for each AU, in which the bias caused by training samples and the interference from irrelevant AUs are both suppressed. Extensive experiments show that our method achieves competitive performance compared to state-of-the-art AU detection approaches on challenging benchmarks, including BP4D, DISFA, GFT, and BP4D+ in constrained scenarios and Aff-Wild2 in unconstrained scenarios. The code is available at this https URL.

[CV-66] Replacement Learning: Training Vision Tasks with Fewer Learnable Parameters

Link: https://arxiv.org/abs/2410.01239
Authors: Yuming Zhang,Peizhe Wang,Shouxin Zhang,Dongzhi Guan,Jiabin Liu,Junhao Su
Keywords: enhance feature representation, Replacement Learning, deep learning models, enhance feature, feature representation
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Traditional end-to-end deep learning models often enhance feature representation and overall performance by increasing the depth and complexity of the network during training. However, this approach inevitably introduces issues of parameter redundancy and resource inefficiency, especially in deeper networks. While existing works attempt to skip certain redundant layers to alleviate these problems, challenges related to poor performance, computational complexity, and inefficient memory usage remain. To address these issues, we propose an innovative training approach called Replacement Learning, which mitigates these limitations by completely replacing all the parameters of the frozen layers with only two learnable parameters. Specifically, Replacement Learning selectively freezes the parameters of certain layers, and the frozen layers utilize parameters from adjacent layers, updating them through a parameter integration mechanism controlled by two learnable parameters. This method leverages information from surrounding structures, reduces computation, conserves GPU memory, and maintains a balance between historical context and new inputs, ultimately enhancing overall model performance. We conducted experiments across four benchmark datasets, including CIFAR-10, STL-10, SVHN, and ImageNet, utilizing various architectures such as CNNs and ViTs to validate the effectiveness of Replacement Learning. Experimental results demonstrate that our approach reduces the number of parameters, training time, and memory consumption while completely surpassing the performance of end-to-end training.
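
The abstract states that a frozen layer borrows parameters from adjacent layers through an integration mechanism controlled by two learnable parameters. One simple reading of that sentence is a linear blend of the two neighbors, sketched below. The blend form, the names `alpha` and `beta`, and the flat-list parameter layout are assumptions for illustration; the abstract does not give the actual integration rule.

```python
def integrate_parameters(prev_params, next_params, alpha, beta):
    """Build a frozen layer's parameters from its two neighbors (assumed form).

    prev_params, next_params: flat lists of floats from the adjacent layers,
    assumed to have the same shape. alpha and beta stand in for the two
    learnable scalars; only they would receive gradients for this layer.
    """
    if len(prev_params) != len(next_params):
        raise ValueError("adjacent layers must have matching parameter shapes")
    return [alpha * p + beta * n for p, n in zip(prev_params, next_params)]
```

Under this reading, the frozen layer contributes two trainable scalars instead of a full weight tensor, which is where the reduction in learnable parameters would come from.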

[CV-67] Towards Native Generative Model for 3D Head Avatar

Link: https://arxiv.org/abs/2410.01226
Authors: Yiyu Zhuang,Yuxiao He,Jiawei Zhang,Yanwen Wang,Jiahe Zhu,Yao Yao,Siyu Zhu,Xun Cao,Hao Zhu
Keywords: head, significant yet challenging, Creating, models
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Creating 3D head avatars is a significant yet challenging task for many application scenarios. Previous studies have set out to learn 3D human head generative models using massive 2D image data. Although these models are highly generalizable for human appearance, the resulting models are not 360°-renderable, and the predicted 3D geometry is unreliable. Therefore, such results cannot be used in VR, game modeling, and other scenarios that require 360°-renderable 3D head models. An intuitive idea is that 3D head models that are limited in number but high in 3D accuracy are more reliable training data for a high-quality 3D generative model. In this vein, we delve into how to learn a native generative model for 360° full head from a limited 3D head dataset. Specifically, three major problems are studied: 1) how to effectively utilize various representations for generating the 360°-renderable human head; 2) how to disentangle the appearance, shape, and motion of human faces to generate a 3D head model that can be edited by appearance and driven by motion; and 3) how to extend the generalization capability of the generative model to support downstream tasks. Comprehensive experiments are conducted to verify the effectiveness of the proposed model. We hope the proposed models and artist-designed dataset can inspire future research on learning native generative 3D head models from limited 3D datasets.

[CV-68] Perceptual Piercing: Human Visual Cue-based Object Detection in Low Visibility Conditions

Link: https://arxiv.org/abs/2410.01225
Authors: Ashutosh Kumar
Keywords: visual cortex mechanisms, learning framework inspired, framework inspired, inspired by atmospheric, atmospheric scattering
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:This study proposes a novel deep learning framework inspired by atmospheric scattering and human visual cortex mechanisms to enhance object detection under poor visibility scenarios such as fog, smoke, and haze. These conditions pose significant challenges for object recognition, impacting various sectors, including autonomous driving, aviation management, and security systems. The objective is to enhance the precision and reliability of detection systems under adverse environmental conditions. The research investigates the integration of human-like visual cues, particularly focusing on selective attention and environmental adaptability, to ascertain their impact on object detection’s computational efficiency and accuracy. This paper proposes a multi-tiered strategy that integrates an initial quick detection process, followed by targeted region-specific dehazing, and concludes with an in-depth detection phase. The approach is validated using the Foggy Cityscapes, RESIDE-beta (OTS and RTTS) datasets and is anticipated to set new performance standards in detection accuracy while significantly optimizing computational efficiency. The findings offer a viable solution for enhancing object detection in poor visibility and contribute to the broader understanding of integrating human visual principles into deep learning algorithms for intricate visual recognition challenges.
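
The abstract mentions atmospheric scattering as one inspiration for the dehazing stage. The widely used single-scattering haze model writes an observed pixel as I = J·t + A·(1 − t), where J is the haze-free intensity, t the transmission and A the atmospheric light. The sketch below shows that model and its inversion for a single scalar pixel. Whether the paper uses exactly this formulation is not stated in the abstract, and the function names and the `t_min` clamp are assumptions for this example.

```python
def add_haze(clear, transmission, airlight):
    """Forward haze model for one pixel: I = J * t + A * (1 - t)."""
    return clear * transmission + airlight * (1.0 - transmission)


def remove_haze(hazy, transmission, airlight, t_min=0.1):
    """Invert the haze model: J = (I - A) / max(t, t_min) + A.

    t_min clamps very small transmissions so noise is not amplified.
    """
    return (hazy - airlight) / max(transmission, t_min) + airlight
```

When the transmission and atmospheric light are known and `t` is above `t_min`, the two functions are exact inverses; in practice both quantities have to be estimated, which is the hard part of dehazing.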

[CV-69] Polyp-SES: Automatic Polyp Segmentation with Self-Enriched Semantic Model

Link: https://arxiv.org/abs/2410.01210
Authors: Quang Vinh Nguyen,Thanh Hoang Son Vo,Sae-Ryung Kang,Soo-Hyung Kim
Keywords: Automatic polyp segmentation, Automatic polyp, crucial for effective, effective diagnosis, diagnosis and treatment
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Asian Conference on Computer Vision 2024

Abstract:Automatic polyp segmentation is crucial for effective diagnosis and treatment in colonoscopy images. Traditional methods encounter significant challenges in accurately delineating polyps due to limitations in feature representation and the handling of variability in polyp appearance. Deep learning techniques, including CNN and Transformer-based methods, have been explored to improve polyp segmentation accuracy. However, existing approaches often neglect additional semantics, restricting their ability to acquire adequate contexts of polyps in colonoscopy images. In this paper, we propose an innovative method named "Automatic Polyp Segmentation with Self-Enriched Semantic Model" to address these limitations. First, we extract a sequence of features from an input image and decode high-level features to generate an initial segmentation mask. Using the proposed self-enriched semantic module, we query potential semantics and augment deep features with additional semantics, thereby aiding the model in understanding context more effectively. Extensive experiments across five polyp benchmarks show that the proposed method outperforms state-of-the-art polyp segmentation baselines in both learning and generalization capabilities.

[CV-70] AniSDF: Fused-Granularity Neural Surfaces with Anisotropic Encoding for High-Fidelity 3D Reconstruction

链接: https://arxiv.org/abs/2410.01202
作者: Jingnan Gao,Zhuo Chen,Yichao Yan,Xiaokang Yang
关键词-EN: recently revolutionized novel-view, recently revolutionized, geometry, achieved high-fidelity renderings, achieved high-fidelity
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project Page: this https URL

点击查看摘要

Abstract:Neural radiance fields have recently revolutionized novel-view synthesis and achieved high-fidelity renderings. However, these methods sacrifice the geometry for the rendering quality, limiting their further applications including relighting and deformation. How to synthesize photo-realistic rendering while reconstructing accurate geometry remains an unsolved problem. In this work, we present AniSDF, a novel approach that learns fused-granularity neural surfaces with physics-based encoding for high-fidelity 3D reconstruction. Different from previous neural surfaces, our fused-granularity geometry structure balances the overall structures and fine geometric details, producing accurate geometry reconstruction. To disambiguate geometry from reflective appearance, we introduce blended radiance fields to model diffuse and specularity following the anisotropic spherical Gaussian encoding, a physics-based rendering pipeline. With these designs, AniSDF can reconstruct objects with complex structures and produce high-quality renderings. Furthermore, our method is a unified model that does not require complex hyperparameter tuning for specific objects. Extensive experiments demonstrate that our method boosts the quality of SDF-based methods by a great scale in both geometry reconstruction and novel-view synthesis.

[CV-71] [Re] Network Deconvolution

链接: https://arxiv.org/abs/2410.01189
作者: Rochana R. Obadage,Kumushini Thennakoon,Sarah M. Rajtmajer,Jian Wu
关键词-EN: Network Deconvolution, convolutional neural networks, work aims, aims to reproduce, reproduce the set
类目: Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL); Machine Learning (cs.LG)
*备注: 12 pages, 5 figures

点击查看摘要

Abstract:Our work aims to reproduce the set of findings published in “Network Deconvolution” by Ye et al. (2020) [1]. That paper proposes an optimization technique for model training in convolutional neural networks. The proposed technique, “network deconvolution”, is used in convolutional neural networks to remove pixel-wise and channel-wise correlations before data is fed into each layer. In particular, we interrogate the validity of the authors’ claim that using network deconvolution instead of batch normalization improves deep learning model performance. Our effort confirms the validity of this claim, successfully reproducing the results reported in Tables 1 and 2 of the original paper. Our study involved 367 unique experiments across multiple architectures, datasets, and hyperparameter configurations. For Table 1, while there were some minor deviations in accuracy when compared to the original values (within 10%), the overall trend was consistent with the original study’s findings when training the models for 20 and 100 epochs. For Table 2, all 14 reproduced values were consistent with the original values. Additionally, we document the training and testing times for each architecture in Table 1 with 1, 20, and 100 epoch settings for both CIFAR-10 and CIFAR-100 datasets. We document the total execution times for Table 2 architectures with the ImageNet dataset. The data and software used for this reproducibility study are publicly available at this https URL.

[CV-72] UAL-Bench: The First Comprehensive Unusual Activity Localization Benchmark

链接: https://arxiv.org/abs/2410.01180
作者: Hasnat Md Abdullah,Tian Liu,Kangda Wei,Shu Kong,Ruihong Huang
关键词-EN: holds practical significance, Localizing unusual activities, videos holds practical, surveillance incidents, practical significance
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Localizing unusual activities, such as human errors or surveillance incidents, in videos holds practical significance. However, current video understanding models struggle with localizing these unusual events, likely because of their insufficient representation in the models’ pretraining datasets. To explore foundation models’ capability in localizing unusual activity, we introduce UAL-Bench, a comprehensive benchmark for unusual activity localization, featuring three video datasets (UAG-OOPS, UAG-SSBD, and UAG-FunQA) and an instruction-tuning dataset, OOPS-UAG-Instruct, to improve model capabilities. UAL-Bench evaluates three approaches: Video-Language Models (Vid-LLMs), instruction-tuned Vid-LLMs, and a novel integration of Vision-Language Models and Large Language Models (VLM-LLM). Our results show the VLM-LLM approach excels in localizing short-span unusual events and predicting their onset (start time) more accurately than Vid-LLMs. We also propose a new metric, R@1, TD = p, to address limitations in existing evaluation methods. Our findings highlight the challenges posed by long-duration videos, particularly in autism diagnosis scenarios, and the need for further advancements in localization techniques. Our work not only provides a benchmark for unusual activity localization but also outlines the key challenges for existing foundation models, suggesting future research directions on this important task.

[CV-73] GraphRevisedIE: Multimodal Information Extraction with Graph-Revised Network

链接: https://arxiv.org/abs/2410.01160
作者: Panfeng Cao,Jian Wu
关键词-EN: Key information extraction, Key information, visually rich documents, information extraction, visually rich
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Key information extraction (KIE) from visually rich documents (VRD) has been a challenging task in document intelligence because of not only the complicated and diverse layouts of VRD that make the model hard to generalize but also the lack of methods to exploit the multimodal features in VRD. In this paper, we propose a light-weight model named GraphRevisedIE that effectively embeds multimodal features such as textual, visual, and layout features from VRD and leverages graph revision and graph convolution to enrich the multimodal embedding with global context. Extensive experiments on multiple real-world datasets show that GraphRevisedIE generalizes to documents of varied layouts and achieves comparable or better performance compared to previous KIE methods. We also publish a business license dataset that contains both real-life and synthesized documents to facilitate research of document KIE.

[CV-74] Automatic Image Unfolding and Stitching Framework for Esophageal Lining Video Based on Density-Weighted Feature Matching

链接: https://arxiv.org/abs/2410.01148
作者: Muyang Li,Juming Xiong,Ruining Deng,Tianyuan Yao,Regina N Tyree,Girish Hiremath,Yuankai Huo
关键词-EN: repetitive patterns make, patterns make image, image stitching challenging, make image stitching, gastrointestinal tract
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Endoscopy is a crucial tool for diagnosing the gastrointestinal tract, but its effectiveness is often limited by a narrow field of view and the dynamic nature of the internal environment, especially in the esophagus, where complex and repetitive patterns make image stitching challenging. This paper introduces a novel automatic image unfolding and stitching framework tailored for esophageal videos captured during endoscopy. The method combines feature matching algorithms, including LoFTR, SIFT, and ORB, to create a feature filtering pool and employs a Density-Weighted Homography Optimization (DWHO) algorithm to enhance stitching accuracy. By merging consecutive frames, the framework generates a detailed panoramic view of the esophagus, enabling thorough and accurate visual analysis. Experimental results show the framework achieves low Root Mean Square Error (RMSE) and high Structural Similarity Index (SSIM) across extensive video sequences, demonstrating its potential for clinical use and improving the quality and continuity of endoscopic visual data.

[CV-75] Uncertainty-Guided Enhancement on Driving Perception System via Foundation Models

链接: https://arxiv.org/abs/2410.01144
作者: Yunhao Yang,Yuxin Hu,Mao Ye,Zaiwei Zhang,Zhichao Lu,Yi Xu,Ufuk Topcu,Ben Snyder
关键词-EN: Multimodal foundation models, costs pose challenges, offer promising advancements, financial costs pose, models offer promising
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multimodal foundation models offer promising advancements for enhancing driving perception systems, but their high computational and financial costs pose challenges. We develop a method that leverages foundation models to refine predictions from existing driving perception models – such as enhancing object classification accuracy – while minimizing the frequency of using these resource-intensive models. The method quantitatively characterizes uncertainties in the perception model’s predictions and engages the foundation model only when these uncertainties exceed a pre-specified threshold. Specifically, it characterizes uncertainty by calibrating the perception model’s confidence scores into theoretical lower bounds on the probability of correct predictions using conformal prediction. Then, it sends images to the foundation model and queries for refining the predictions only if the theoretical bound of the perception model’s outcome is below the threshold. Additionally, we propose a temporal inference mechanism that enhances prediction accuracy by integrating historical predictions, leading to tighter theoretical bounds. The method demonstrates a 10 to 15 percent improvement in prediction accuracy and reduces the number of queries to the foundation model by 50 percent, based on quantitative evaluations from driving datasets.
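
The gating rule described in this abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the lower bound below is a Hoeffding-style bound on per-bin accuracy from a calibration set, used as a simple stand-in for the paper's conformal-prediction calibration. The function names, bin count, and `delta` are all hypothetical.

```python
import math


def fit_lower_bounds(confidences, correct, num_bins=5, delta=0.1):
    """Return, per confidence bin, a lower bound on the accuracy.

    Stand-in for conformal calibration (an assumption, not the paper's
    method): empirical accuracy minus a Hoeffding margin
    sqrt(ln(1/delta) / (2n)). Empty bins get a bound of 0.0.
    """
    hits = [0] * num_bins
    counts = [0] * num_bins
    for conf, ok in zip(confidences, correct):
        b = min(int(conf * num_bins), num_bins - 1)
        counts[b] += 1
        hits[b] += int(ok)
    bounds = []
    for h, n in zip(hits, counts):
        if n == 0:
            bounds.append(0.0)
        else:
            margin = math.sqrt(math.log(1.0 / delta) / (2 * n))
            bounds.append(max(0.0, h / n - margin))
    return bounds


def should_query_foundation_model(confidence, bounds, threshold):
    """Query the expensive model only when the bound falls below threshold."""
    b = min(int(confidence * len(bounds)), len(bounds) - 1)
    return bounds[b] < threshold
```

With such a gate, high-confidence predictions whose calibrated bound clears the threshold never reach the foundation model, which is where the reduction in query count would come from.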

[CV-76] Using Interleaved Ensemble Unlearning to Keep Backdoors at Bay for Finetuning Vision Transformers

链接: https://arxiv.org/abs/2410.01128
作者: Zeyu Michael Li
关键词-EN: computer vision tasks, Vision Transformers, computer vision, vision tasks, Convolutional Neural Networks
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Vision Transformers (ViTs) have become popular in computer vision tasks. Backdoor attacks, which trigger undesirable behaviours in models during inference, threaten ViTs’ performance, particularly in security-sensitive tasks. Although backdoor defences have been developed for Convolutional Neural Networks (CNNs), they are less effective for ViTs, and defences tailored to ViTs are scarce. To address this, we present Interleaved Ensemble Unlearning (IEU), a method for finetuning clean ViTs on backdoored datasets. In stage 1, a shallow ViT is finetuned to have high confidence on backdoored data and low confidence on clean data. In stage 2, the shallow ViT acts as a “gate” to block potentially poisoned data from the defended ViT. This data is added to an unlearn set and asynchronously unlearned via gradient ascent. We demonstrate IEU’s effectiveness on three datasets against 11 state-of-the-art backdoor attacks and show its versatility by applying it to different model architectures.
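
The two ideas in stage 2, confidence-based gating and unlearning by gradient ascent, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the IEU implementation: the threshold value, the function names, and the scalar-weight update are hypothetical simplifications.

```python
def route_batch(samples, gate_confidence, threshold=0.9):
    """Split a batch into (clean, unlearn) using the gate's confidences.

    Samples on which the shallow gate model is highly confident are
    treated as potentially poisoned and sent to the unlearn set.
    The threshold of 0.9 is an assumed value.
    """
    clean, unlearn = [], []
    for sample, conf in zip(samples, gate_confidence):
        if conf >= threshold:
            unlearn.append(sample)
        else:
            clean.append(sample)
    return clean, unlearn


def sgd_step(weight, grad, lr, ascent=False):
    """One update on a scalar weight.

    Descent moves against the gradient (lowers the loss); ascent moves
    along it (raises the loss), which is the unlearning direction.
    """
    return weight + lr * grad if ascent else weight - lr * grad
```

The only difference between learning and unlearning in this sketch is the sign of the update, applied to the clean set and the unlearn set respectively.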

[CV-77] Synthetic imagery for fuzzy object detection: A comparative study

链接: https://arxiv.org/abs/2410.01124
作者: Siavash H. Khajavi,Mehdi Moshtaghi,Dikai Yu,Zixuan Liu,Kary Främling,Jan Holmström
关键词-EN: Fuzzy objects, object detection, challenging field, fuzzy, objects
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Fuzzy object detection is a challenging field of research in computer vision (CV). Distinguishing between fuzzy and non-fuzzy object detection in CV is important. Fuzzy objects such as fire, smoke, mist, and steam present significantly greater complexities in terms of visual features, blurred edges, varying shapes, opacity, and volume compared to non-fuzzy objects such as trees and cars. Collecting a balanced and diverse dataset and annotating it accurately are crucial to achieving better ML models for fuzzy objects; however, collection and annotation are still highly manual tasks. In this research, we propose and leverage an alternative method of generating and automatically annotating fully synthetic fire images based on 3D models for training an object detection model. Moreover, the performance and efficiency of ML models trained on synthetic images are compared with those of ML models trained on real imagery and mixed imagery. The findings demonstrate the effectiveness of the synthetic data for fire detection, while the performance improves as the test dataset covers a broader spectrum of real fires. Our findings illustrate that when synthetic imagery and real imagery are combined in a mixed training set, the resulting ML model outperforms models trained on real imagery as well as models trained on synthetic imagery for detection of a broad spectrum of fires. The proposed method for automating the annotation of synthetic fuzzy object imagery carries substantial implications for reducing both time and cost in creating computer vision models specifically tailored for detecting fuzzy objects.

[CV-78] RobustEMD: Domain Robust Matching for Cross-domain Few-shot Medical Image Segmentation

链接: https://arxiv.org/abs/2410.01110
作者: Yazhou Zhu,Minxian Li,Qiaolin Ye,Shidong Wang,Tong Xin,Haofeng Zhang
关键词-EN: limited annotated data, annotated data learning, image analysis scope, medical imaging data, medical image analysis
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Few-shot medical image segmentation (FSMIS) aims to learn from limited annotated data in medical image analysis. Despite the progress achieved, current FSMIS models are all trained and deployed on the same data domain, which is inconsistent with the clinical reality that medical imaging data always spans different data domains (e.g. imaging modalities, institutions and equipment sequences). How can FSMIS models be enhanced to generalize well across different specific medical imaging domains? In this paper, we focus on the matching mechanism of few-shot semantic segmentation models and introduce a domain-robust matching mechanism based on Earth Mover’s Distance (EMD) calculation for the cross-domain scenario. Specifically, we formulate the EMD transportation process between the foreground support-query features. A texture-structure-aware weight generation method, which performs Sobel-based image gradient calculation over the nodes, is introduced in the EMD matching flow to restrain the domain-relevant nodes. In addition, a point-set-level distance metric is introduced to calculate the cost of transportation from support set nodes to query set nodes. To evaluate the performance of our model, we conduct experiments on three scenarios (i.e., cross-modal, cross-sequence and cross-institution), which include eight medical datasets and involve three body regions, and the results demonstrate that our model achieves SoTA performance against the compared models.

[CV-79] Semantic Segmentation of Unmanned Aerial Vehicle Remote Sensing Images using SegFormer

链接: https://arxiv.org/abs/2410.01092
作者: Vlatko Spasev,Ivica Dimitrovski,Ivan Chorbev,Ivan Kitanovski
关键词-EN: Unmanned Aerial Vehicles, Aerial Vehicles, garnered considerable attention, UAV remote sensing, remote sensing platforms
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The escalating use of Unmanned Aerial Vehicles (UAVs) as remote sensing platforms has garnered considerable attention, proving invaluable for ground object recognition. While satellite remote sensing images face limitations in resolution and weather susceptibility, UAV remote sensing, employing low-speed unmanned aircraft, offers enhanced object resolution and agility. The advent of advanced machine learning techniques has propelled significant strides in image analysis, particularly in semantic segmentation for UAV remote sensing images. This paper evaluates the effectiveness and efficiency of SegFormer, a semantic segmentation framework, for the semantic segmentation of UAV images. SegFormer variants, ranging from real-time (B0) to high-performance (B5) models, are assessed using the UAVid dataset tailored for semantic segmentation tasks. The research details the architecture and training procedures specific to SegFormer in the context of UAV semantic segmentation. Experimental results showcase the model’s performance on the benchmark dataset, highlighting its ability to accurately delineate objects and land cover features in diverse UAV scenarios, leading to both high efficiency and performance.

[CV-80] FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks

链接: https://arxiv.org/abs/2410.01089
作者: Peiran Wu,Che Liu,Canyu Chen,Jun Li,Cosmin I. Bercea,Rossella Arcucci
关键词-EN: Visual Question Answering, Question Answering, Report Generation, Visual Question, Multimodal Large Language
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Advancements in Multimodal Large Language Models (MLLMs) have significantly improved performance on medical tasks such as Visual Question Answering (VQA) and Report Generation (RG). However, the fairness of these models across diverse demographic groups remains underexplored, despite its importance in healthcare. This oversight is partly due to the lack of demographic diversity in existing medical multimodal datasets, which complicates the evaluation of fairness. In response, we propose FMBench, the first benchmark designed to evaluate the fairness of MLLMs’ performance across diverse demographic attributes. FMBench has the following key features: 1: It includes four demographic attributes: race, ethnicity, language, and gender, across two tasks, VQA and RG, under zero-shot settings. 2: Our VQA task is free-form, enhancing real-world applicability and mitigating the biases associated with predefined choices. 3: We utilize both lexical metrics and LLM-based metrics, aligned with clinical evaluations, to assess models not only for linguistic accuracy but also from a clinical perspective. Furthermore, we introduce a new metric, Fairness-Aware Performance (FAP), to evaluate how fairly MLLMs perform across various demographic attributes. We thoroughly evaluate the performance and fairness of eight state-of-the-art open-source MLLMs, including both general and medical MLLMs, ranging from 7B to 26B parameters on the proposed benchmark. We aim for FMBench to assist the research community in refining model evaluation and driving future advancements in the field. All data and code will be released upon acceptance.

[CV-81] Deep Nets with Subsampling Layers Unwittingly Discard Useful Activations at Test-Time ECCV2024

链接: https://arxiv.org/abs/2410.01083
作者: Chiao-An Yang,Ziwei Liu,Raymond A. Yeh
关键词-EN: Subsampling layers play, Subsampling layers, spatial dimensions, layers play, play a crucial
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: ECCV 2024

点击查看摘要

Abstract:Subsampling layers play a crucial role in deep nets by discarding a portion of an activation map to reduce its spatial dimensions. This encourages the deep net to learn higher-level representations. Contrary to this motivation, we hypothesize that the discarded activations are useful and can be incorporated on the fly to improve models’ prediction. To validate our hypothesis, we propose a search and aggregate method to find useful activation maps to be used at test time. We applied our approach to the task of image classification and semantic segmentation. Extensive experiments over nine different architectures on multiple datasets show that our method consistently improves model test-time performance, complementing existing test-time augmentation techniques. Our code is available at this https URL.
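
To make concrete what a subsampling layer discards, the sketch below splits a 2D activation map into its stride-2 phases. A standard stride-2 subsampling keeps only the phase at offset (0, 0); the remaining phases are the discarded activations the abstract refers to. This is an illustration only: the helper names are hypothetical, and the paper's search-and-aggregate method is not reproduced here.

```python
def subsample(activation, stride=2, offset=(0, 0)):
    """Keep every `stride`-th row and column, starting at `offset`."""
    dy, dx = offset
    return [row[dx::stride] for row in activation[dy::stride]]


def all_phases(activation, stride=2):
    """Return every subsampled phase, keyed by its (row, col) offset.

    With stride 2 there are four phases; a conventional layer uses
    only the one at offset (0, 0).
    """
    return {
        (dy, dx): subsample(activation, stride, (dy, dx))
        for dy in range(stride)
        for dx in range(stride)
    }
```

For a 4x4 map with stride 2, each phase is a 2x2 map, so three quarters of the activations never reach the next layer in the conventional case.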

[CV-82] Pose Estimation of Buried Deep-Sea Objects using 3D Vision Deep Learning Models

链接: https://arxiv.org/abs/2410.01061
作者: Jerry Yan,Chinmay Talegaonkar,Nicholas Antipa,Eric Terrill,Sophia Merrifield
关键词-EN: San Pedro Basin, Southern California San, California San Pedro, Pedro Basin, Southern California
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Submitted to OCEANS 2024 Halifax

点击查看摘要

Abstract:We present an approach for pose and burial fraction estimation of debris field barrels found on the seabed in the Southern California San Pedro Basin. Our computational workflow leverages recent advances in foundation models for segmentation and a vision transformer-based approach to estimate the point cloud which defines the geometry of the barrel. We propose BarrelNet for estimating the 6-DOF pose and radius of buried barrels from the barrel point clouds as input. We train BarrelNet using synthetically generated barrel point clouds, and qualitatively demonstrate the potential of our approach using remotely operated vehicle (ROV) video footage of barrels found at a historic dump site. We compare our method to a traditional least squares fitting approach and show significant improvement according to our defined benchmarks.

[CV-83] ARPOV: Expanding Visualization of Object Detection in AR with Panoramic Mosaic Stitching

链接: https://arxiv.org/abs/2410.01055
作者: Erin McGowan,Ethan Brewer,Claudio Silva
关键词-EN: increasingly incorporate intelligent, incorporate intelligent features, intelligent assistant, object detection model, augmented reality
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 6 pages, 6 figures, to be published in SIBGRAPI 2024 - 37th conference on Graphics, Patterns, and Images proceedings

点击查看摘要

Abstract:As the uses of augmented reality (AR) become more complex and widely available, AR applications will increasingly incorporate intelligent features that require developers to understand the user’s behavior and surrounding environment (e.g. an intelligent assistant). Such applications rely on video captured by an AR headset, which often contains disjointed camera movement with a limited field of view that cannot capture the full scope of what the user sees at any given time. Moreover, standard methods of visualizing object detection model outputs are limited to capturing objects within a single frame and timestep, and therefore fail to capture the temporal and spatial context that is often necessary for various domain applications. We propose ARPOV, an interactive visual analytics tool for analyzing object detection model outputs tailored to video captured by an AR headset that maximizes user understanding of model performance. The proposed tool leverages panorama stitching to expand the view of the environment while automatically filtering undesirable frames, and includes interactive features that facilitate object detection model debugging. ARPOV was designed as part of a collaboration between visualization researchers and machine learning and AR experts; we validate our design choices through interviews with 5 domain experts.

[CV-84] FCE-YOLOv8: YOLOv8 with Feature Context Excitation Modules for Fracture Detection in Pediatric Wrist X-ray Images

链接: https://arxiv.org/abs/2410.01031
作者: Rui-Yang Ju,Chun-Tse Chien,Enkaer Xieerke,Jen-Shiun Chiang
关键词-EN: interpret X-ray images, suffer wrist trauma, Children often suffer, interpret X-ray, X-ray images
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: arXiv admin note: text overlap with arXiv:2407.03163

点击查看摘要

Abstract:Children often suffer wrist trauma in daily life, and radiologists usually need to analyze and interpret their X-ray images before surgeons perform surgical treatment. The development of deep learning has enabled neural networks to serve as computer-assisted diagnosis (CAD) tools to help doctors and experts in medical image diagnostics. Since the You Only Look Once Version-8 (YOLOv8) model has achieved satisfactory success in object detection tasks, it has been applied to various fracture detection tasks. This work introduces four variants of the Feature Contexts Excitation-YOLOv8 (FCE-YOLOv8) model, each incorporating a different FCE module (i.e., modules of Squeeze-and-Excitation (SE), Global Context (GC), Gather-Excite (GE), and Gaussian Context Transformer (GCT)) to enhance the model performance. Experimental results on the GRAZPEDWRI-DX dataset demonstrate that our proposed YOLOv8+GC-M3 model improves the mAP@50 value from 65.78% to 66.32%, outperforming the state-of-the-art (SOTA) model while reducing inference time. Furthermore, our proposed YOLOv8+SE-M3 model achieves the highest mAP@50 value of 67.07%, exceeding the SOTA performance. The implementation of this work is available at this https URL.
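
The mAP@50 figures above count a predicted box as correct when its intersection-over-union (IoU) with a ground-truth box is at least 0.5. The sketch below shows that matching criterion only. The helper names and the (x1, y1, x2, y2) box layout are assumptions, and the full mAP computation (ranking by confidence, precision-recall integration, averaging over classes) is omitted.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes yield an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matches_at_50(pred_box, gt_box):
    """True when the prediction overlaps the ground truth with IoU >= 0.5."""
    return iou(pred_box, gt_box) >= 0.5
```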

[CV-85] Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you! EMNLP2024

链接: https://arxiv.org/abs/2410.01023
作者: Jiwan Chung,Seungwon Lim,Jaehyun Jeon,Seungbeen Lee,Youngjae Yu
关键词-EN: Humans possess multimodal, actively integrate information, Humans possess, form reasoning, actively integrate
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted as main paper in EMNLP 2024

点击查看摘要

Abstract:Humans possess multimodal literacy, allowing them to actively integrate information from various modalities to form reasoning. Faced with challenges like lexical ambiguity in text, we supplement this with other modalities, such as thumbnail images or textbook illustrations. Is it possible for machines to achieve a similar multimodal understanding capability? In response, we present Understanding Pun with Image Explanations (UNPIE), a novel benchmark designed to assess the impact of multimodal inputs in resolving lexical ambiguities. Puns serve as the ideal subject for this evaluation due to their intrinsic ambiguity. Our dataset includes 1,000 puns, each accompanied by an image that explains both meanings. We pose three multimodal challenges with the annotations to assess different aspects of multimodal literacy: Pun Grounding, Disambiguation, and Reconstruction. The results indicate that various Socratic Models and Visual-Language Models improve over the text-only models when given visual context, particularly as the complexity of the tasks increases.

[CV-86] A Critical Assessment of Visual Sound Source Localization Models Including Negative Audio ICASSP2025

链接: https://arxiv.org/abs/2410.01020
作者: Xavier Juanola,Gloria Haro,Magdalena Fuentes
关键词-EN: Visual Sound Source, Sound Source Localization, enhanced scene understanding, visual scenes, integrating audio-visual data
类目: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: Submitted to ICASSP 2025

点击查看摘要

Abstract:The task of Visual Sound Source Localization (VSSL) involves identifying the location of sound sources in visual scenes, integrating audio-visual data for enhanced scene understanding. Despite advancements in state-of-the-art (SOTA) models, we observe three critical flaws: i) The evaluation of the models is mainly focused on sounds produced by objects that are visible in the image, ii) The evaluation often assumes prior knowledge of the size of the sounding object, and iii) No universal threshold for localization in real-world scenarios is established, as previous approaches only consider positive examples without accounting for both positive and negative cases. In this paper, we introduce a novel test set and metrics designed to complete the current standard evaluation of VSSL models by testing them in scenarios where none of the objects in the image corresponds to the audio input, i.e. a negative audio. We consider three types of negative audio: silence, noise and offscreen. Our analysis reveals that numerous SOTA models fail to appropriately adjust their predictions based on audio input, suggesting that these models may not be leveraging audio information as intended. Additionally, we provide a comprehensive analysis of the range of maximum values in the estimated audio-visual similarity maps, in both positive and negative audio cases, and show that most of the models are not discriminative enough, making them unfit to choose a universal threshold appropriate to perform sound localization without any a priori information of the sounding object, that is, object size and visibility.

[CV-87] Y-CA-Net: A Convolutional Attention Based Network for Volumetric Medical Image Segmentation

链接: https://arxiv.org/abs/2410.01003
作者: Muhammad Hamza Sharif,Muzammal Naseer,Mohammad Yaqub,Min Xu,Mohsen Guizani
关键词-EN: modeling long-range dependencies, Recent attention-based volumetric, achieved remarkable performance, Recent attention-based, Feature Mixer Module
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent attention-based volumetric segmentation (VS) methods, which focus on modeling long-range dependencies, have achieved remarkable performance in the medical domain. However, for voxel-wise prediction tasks, discriminative local features are key to the performance of VS models, and these are missing in attention-based VS methods. To resolve this issue, we deliberately incorporate a convolutional encoder branch with a transformer backbone to extract local and global features in parallel and aggregate them in a Cross Feature Mixer Module (CFMM) for better prediction of the segmentation mask. Consequently, we observe that the derived model, Y-CT-Net, achieves competitive performance on multiple medical segmentation tasks. For example, on multi-organ segmentation, Y-CT-Net achieves an 82.4% dice score, surpassing well-tuned VS Transformer/CNN-like baselines UNETR/ResNet-3D by 2.9%/1.4%. With the success of Y-CT-Net, we extend this concept to hybrid attention models, deriving the Y-CH-Net model, which brings a 3% improvement in terms of HD95 score for the same segmentation task. The effectiveness of both models, Y-CT-Net and Y-CH-Net, verifies our hypothesis and motivates us to initiate the concept of Y-CA-Net, a versatile generic architecture based upon any two encoders and a decoder backbone, to fully exploit the complementary strengths of both convolution and attention mechanisms. Based on experimental results, we argue Y-CA-Net is a key player in achieving superior results for volumetric segmentation.
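
The Dice score quoted above measures overlap between a predicted mask and a reference mask as twice the intersection divided by the sum of the two mask sizes. A minimal sketch for flat binary masks follows. The function name is hypothetical, and returning 1.0 when both masks are empty is an assumed convention, since the abstract does not say how that case is handled.

```python
def dice_score(pred, target):
    """Dice = 2 * |P intersect T| / (|P| + |T|) for flat 0/1 masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Both masks empty: treated as perfect agreement (assumption).
        return 1.0
    return 2.0 * intersection / total
```

A volumetric mask can be flattened voxel by voxel before calling the function, and multi-organ results are typically reported as the mean over per-organ scores.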

[CV-88] LaDTalk: Latent Denoising for Synthesizing Talking Head Videos with High Frequency Details

链接: https://arxiv.org/abs/2410.00990
作者: Jian Yang,Xukun Wang,Wentao Wang,Guoming Li,Qihang Fang,Ruihong Yuan,Tianyang Wang,Jason Zhaoxin Fan
关键词-EN: Virtual Reality, Audio-driven talking head, film-making and Virtual, Vector Quantised Auto, Quantised Auto Encoders
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Audio-driven talking head generation is a pivotal area within film-making and Virtual Reality. Although existing methods have made significant strides following the end-to-end paradigm, they still encounter challenges in producing videos with high-frequency details due to their limited expressivity in this domain. This limitation has prompted us to explore an effective post-processing approach to synthesize photo-realistic talking head videos. Specifically, we employ a pretrained Wav2Lip model as our foundation model, leveraging its robust audio-lip alignment capabilities. Drawing on the theory of Lipschitz Continuity, we have theoretically established the noise robustness of Vector Quantised Auto Encoders (VQAEs). Our experiments further demonstrate that the high-frequency texture deficiency of the foundation model can be temporally consistently recovered by the Space-Optimised Vector Quantised Auto Encoder (SOVQAE) we introduced, thereby facilitating the creation of realistic talking head videos. We conduct experiments on both the conventional dataset and the High-Frequency TalKing head (HFTK) dataset that we curated. The results indicate that our method, LaDTalk, achieves new state-of-the-art video quality and out-of-domain lip synchronization performance.

[CV-89] ScVLM: a Vision-Language Model for Driving Safety Critical Event Understanding

Link: https://arxiv.org/abs/2410.00982
Authors: Liang Shi,Boyu Jiang,Feng Guo
Keywords: driver assistance systems, advanced driver assistance, assistance systems research, automated driving systems, Accurately identifying
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Accurately identifying, understanding, and describing driving safety-critical events (SCEs), including crashes and near-crashes, is crucial for traffic safety, automated driving systems, and advanced driver assistance systems research and application. As SCEs are rare events, most general Vision-Language Models (VLMs) have not been trained sufficiently to link SCE videos and narratives, which could lead to hallucination and missing key safety characteristics. To tackle these challenges, we propose ScVLM, a hybrid approach that combines supervised learning and contrastive learning to improve driving video understanding and event description rationality for VLMs. The proposed approach is trained on and evaluated by more than 8,600 SCEs from the Second Strategic Highway Research Program Naturalistic Driving Study dataset, the largest publicly accessible driving dataset with videos and SCE annotations. The results demonstrate the superiority of the proposed approach in generating contextually accurate event descriptions and mitigate hallucinations from VLMs.

[CV-90] Towards Full-parameter and Parameter-efficient Self-learning For Endoscopic Camera Depth Estimation ECCV2024

Link: https://arxiv.org/abs/2410.00979
Authors: Shuting Zhao,Chenkang Du,Kristin Qi,Xinrong Chen,Xinhan Di
Keywords: adapt depth foundation, depth estimation recently, Adaptation methods, endoscopic depth estimation, depth foundation models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: WiCV @ ECCV 2024

Abstract:Adaptation methods are developed to adapt depth foundation models to endoscopic depth estimation recently. However, such approaches typically under-perform training since they limit the parameter search to a low-rank subspace and alter the training dynamics. Therefore, we propose a full-parameter and parameter-efficient learning framework for endoscopic depth estimation. At the first stage, the subspace of attention, convolution and multi-layer perception are adapted simultaneously within different sub-spaces. At the second stage, a memory-efficient optimization is proposed for subspace composition and the performance is further improved in the united sub-space. Initial experiments on the SCARED dataset demonstrate that results at the first stage improves the performance from 10.2% to 4.1% for Sq Rel, Abs Rel, RMSE and RMSE log in the comparison with the state-of-the-art models.

[CV-91] SegHeD: Segmentation of Heterogeneous Data for Multiple Sclerosis Lesions with Anatomical Constraints MICCAI

Link: https://arxiv.org/abs/2410.01766
Authors: Berke Doga Basaran,Xinru Zhang,Paul M. Matthews,Wenjia Bai
Keywords: brain magnetic resonance, monitoring multiple sclerosis, magnetic resonance, images plays, progression from brain
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 13 pages, 4 figures, MICCAI, LDTM Workshop

Abstract:Assessment of lesions and their longitudinal progression from brain magnetic resonance (MR) images plays a crucial role in diagnosing and monitoring multiple sclerosis (MS). Machine learning models have demonstrated a great potential for automated MS lesion segmentation. Training such models typically requires large-scale high-quality datasets that are consistently annotated. However, MS imaging datasets are often small, segregated across multiple sites, with different formats (cross-sectional or longitudinal), and diverse annotation styles. This poses a significant challenge to train a unified MS lesion segmentation model. To tackle this challenge, we present SegHeD, a novel multi-dataset multi-task segmentation model that can incorporate heterogeneous data as input and perform all-lesion, new-lesion, as well as vanishing-lesion segmentation. Furthermore, we account for domain knowledge about MS lesions, incorporating longitudinal, spatial, and volumetric constraints into the segmentation model. SegHeD is assessed on five MS datasets and achieves a high performance in all, new, and vanishing-lesion segmentation, outperforming several state-of-the-art methods in this field.

[CV-92] COSMIC: Compress Satellite Images Efficiently via Diffusion Compensation

Link: https://arxiv.org/abs/2410.01698
Authors: Ziyuan Zhang,Han Qiu,Maosen Zhang,Jun Liu,Bin Chen,Tianwei Zhang,Hewu Li
Keywords: rapidly increasing number, enhanced capabilities, rapidly increasing, increasing number, exceeding the transmission
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:With the rapidly increasing number of satellites in space and their enhanced capabilities, the amount of earth observation images collected by satellites is exceeding the transmission limits of satellite-to-ground links. Although existing learned image compression solutions achieve remarkable performance by using a sophisticated encoder to extract fruitful features as compression and using a decoder to reconstruct, it is still hard to directly deploy those complex encoders on current satellites’ embedded GPUs with limited computing capability and power supply to compress images in orbit. In this paper, we propose COSMIC, a simple yet effective learned compression solution to transmit satellite images. We first design a lightweight encoder (i.e. reducing FLOPs by 2.6–5×) on satellite to achieve a high image compression ratio to save satellite-to-ground links. Then, for reconstructions on the ground, to deal with the feature extraction ability degradation due to simplifying encoders, we propose a diffusion-based model to compensate image details when decoding. Our insight is that satellite’s earth observation photos are not just images but indeed multi-modal data with a nature of Text-to-Image pairing since they are collected with rich sensor data (e.g. coordinates, timestamp, etc.) that can be used as the condition for diffusion generation. Extensive experiments show that COSMIC outperforms state-of-the-art baselines on both perceptual and distortion metrics.

[CV-93] Towards a vision foundation model for comprehensive assessment of Cardiac MRI

Link: https://arxiv.org/abs/2410.01665
Authors: Athira J Jacob,Indraneel Borgohain,Teodora Chitiboi,Puneet Sharma,Dorin Comaniciu,Daniel Rueckert
Keywords: Cardiac magnetic resonance, complex modality requiring, noninvasive cardiac assessment, magnetic resonance imaging, Cardiac magnetic
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 3 figures, 4 tables

Abstract:Cardiac magnetic resonance imaging (CMR), considered the gold standard for noninvasive cardiac assessment, is a diverse and complex modality requiring a wide variety of image processing tasks for comprehensive assessment of cardiac morphology and function. Advances in deep learning have enabled the development of state-of-the-art (SoTA) models for these tasks. However, model training is challenging due to data and label scarcity, especially in the less common imaging sequences. Moreover, each model is often trained for a specific task, with no connection between related tasks. In this work, we introduce a vision foundation model trained for CMR assessment, that is trained in a self-supervised fashion on 36 million CMR images. We then finetune the model in supervised way for 9 clinical tasks typical to a CMR workflow, across classification, segmentation, landmark localization, and pathology detection. We demonstrate improved accuracy and robustness across all tasks, over a range of available labeled dataset sizes. We also demonstrate improved few-shot learning with fewer labeled samples, a common challenge in medical image analyses. We achieve an out-of-box performance comparable to SoTA for most clinical tasks. The proposed method thus presents a resource-efficient, unified framework for CMR assessment, with the potential to accelerate the development of deep learning-based solutions for image analysis tasks, even with few annotated data available.

[CV-94] Unleashing Parameter Potential of Neural Representation for Efficient Video Compression

Link: https://arxiv.org/abs/2410.01654
Authors: Gai Zhang,Xinfeng Zhang,Lv Tang,Yue Li,Kai Zhang,Li Zhang
Keywords: prominent research area, research area, INR video compression, prominent research, video compression technology
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Abstract:For decades, video compression technology has been a prominent research area. Traditional hybrid video compression framework and end-to-end frameworks continue to explore various intra- and inter-frame reference and prediction strategies based on discrete transforms and deep learning techniques. However, the emerging implicit neural representation (INR) technique models entire videos as basic units, automatically capturing intra-frame and inter-frame correlations and obtaining promising performance. INR uses a compact neural network to store video information in network parameters, effectively eliminating spatial and temporal redundancy in the original video. However, in this paper, our exploration and verification reveal that current INR video compression methods do not fully exploit their potential to preserve information. We investigate the potential of enhancing network parameter storage through parameter reuse. By deepening the network, we designed a feasible INR parameter reuse scheme to further improve compression performance. Extensive experimental results show that our method significantly enhances the rate-distortion performance of INR video compression.

[CV-95] Imaging foundation model for universal enhancement of non-ideal measurement CT

Link: https://arxiv.org/abs/2410.01591
Authors: Yuxin Liu,Rongjun Ge,Yuting He,Zhan Wu,Chenyu You,Shuo Li,Yang Chen
Keywords: measurement computed tomography, sacrifices optimal imaging, optimal imaging standards, Non-ideal measurement computed, NICT enhancement
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Non-ideal measurement computed tomography (NICT), which sacrifices optimal imaging standards for new advantages in CT imaging, is expanding the clinical application scope of CT images. However, with the reduction of imaging standards, the image quality has also been reduced, extremely limiting the clinical acceptability. Although numerous studies have demonstrated the feasibility of deep learning for the NICT enhancement in specific scenarios, their high data cost and limited generalizability have become large obstacles. The recent research on the foundation model has brought new opportunities for building a universal NICT enhancement model - bridging the image quality degradation with minimal data cost. However, owing to the challenges in the collection of large pre-training datasets and the compatibility of data variation, no success has been reported. In this paper, we propose a multi-scale integrated Transformer AMPlifier (TAMP), the first imaging foundation model for universal NICT enhancement. It has been pre-trained on a large-scale physical-driven simulation dataset with 3.6 million NICT-ICT image pairs, and is able to directly generalize to the NICT enhancement tasks with various non-ideal settings and body regions. Via the adaptation with few data, it can further achieve professional performance in real-world specific scenarios. Our extensive experiments have demonstrated that the proposed TAMP has significant potential for promoting the exploration and application of NICT and serving a wider range of medical scenarios.

[CV-96] SurgPointTransformer: Vertebrae Shape Completion with RGB-D Data

Link: https://arxiv.org/abs/2410.01443
Authors: Aidana Massalimova,Florentin Liebmann,Sascha Jecklin,Fabio Carrillo,Farshad Mazda,Philipp Fürnstahl
Keywords: intraoperative imaging technologies, systems heavily depend, generate detailed, heavily depend, depend on intraoperative
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:State-of-the-art computer- and robot-assisted surgery systems heavily depend on intraoperative imaging technologies such as CT and fluoroscopy to generate detailed 3D visualization of the patient’s anatomy. While imaging techniques are highly accurate, they are based on ionizing radiation and expose patients and clinicians. This study introduces an alternative, radiation-free approach for reconstructing the 3D spine anatomy using RGB-D data. Drawing inspiration from the 3D “mental map” that surgeons form during surgeries, we introduce SurgPointTransformer, a shape completion approach for surgical applications that can accurately reconstruct the unexposed spine regions from sparse observations of the exposed surface. Our method involves two main steps: segmentation and shape completion. The segmentation step includes spinal column localization and segmentation, followed by vertebra-wise segmentation. The segmented vertebra point clouds are then subjected to SurgPointTransformer, which leverages an attention mechanism to learn patterns between visible surface features and the underlying anatomy. For evaluation, we utilize an ex-vivo dataset of nine specimens. Their CT data is used to establish ground truth data that were used to compare to the outputs of our methods. Our method significantly outperforms the state-of-the-art baselines, achieving an average Chamfer Distance of 5.39, an F-Score of 0.85, an Earth Mover’s Distance of 0.011, and a Signal-to-Noise Ratio of 22.90 dB. This study demonstrates the potential of our reconstruction method for 3D vertebral shape completion. It enables 3D reconstruction of the entire lumbar spine and surgical guidance without ionizing radiation or invasive imaging. Our work contributes to computer-aided and robot-assisted surgery, advancing the perception and intelligence of these systems. 
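
The Chamfer Distance reported in this abstract compares a predicted point cloud with a ground-truth one. As a rough illustration only, the sketch below uses one common convention: the mean nearest-neighbour Euclidean distance, summed over both directions. The abstract does not say which variant the authors use (squared versus unsquared distances, sums versus means), so the function name `chamfer_distance` and the convention are assumptions, and values from this sketch are not comparable with the 5.39 figure above.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty point clouds.

    Assumed convention (not taken from the paper): for each point, take the
    Euclidean distance to its nearest neighbour in the other cloud, average
    per direction, then add the two directions.
    """
    def one_way(src, dst):
        # Brute-force nearest neighbour search: O(len(src) * len(dst)).
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


# Identical clouds (in any order) are at distance zero.
same = chamfer_distance([(0, 0, 0), (1, 0, 0)], [(1, 0, 0), (0, 0, 0)])
# Two single points that are 5 units apart contribute 5 in each direction.
apart = chamfer_distance([(0, 0, 0)], [(3, 4, 0)])
```

The brute-force search is fine for small vertebra-sized clouds in a demonstration; a spatial index such as a k-d tree would be the usual choice for larger inputs.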

[CV-97] CSIM: A Copula-based similarity index sensitive to local changes for Image quality assessment

Link: https://arxiv.org/abs/2410.01411
Authors: Safouane El Ghazouali,Umberto Michelucci,Yassin El Hillali,Hichem Nouira
Keywords: computer vision, computer vision applications, machine learning, Image, play an important
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Probability (math.PR)
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Abstract:Image similarity metrics play an important role in computer vision applications, as they are used in image processing, computer vision and machine learning. Furthermore, those metrics enable tasks such as image retrieval, object recognition and quality assessment, essential in fields like healthcare, astronomy and surveillance. Existing metrics, such as PSNR, MSE, SSIM, ISSM and FSIM, often face limitations in terms of either speed, complexity or sensitivity to small changes in images. To address these challenges, a novel image similarity metric, namely CSIM, that combines real-time while being sensitive to subtle image variations is investigated in this paper. The novel metric uses Gaussian Copula from probability theory to transform an image into vectors of pixel distribution associated to local image patches. These vectors contain, in addition to intensities and pixel positions, information on the dependencies between pixel values, capturing the structural relationships within the image. By leveraging the properties of Copulas, CSIM effectively models the joint distribution of pixel intensities, enabling a more nuanced comparison of image patches making it more sensitive to local changes compared to other metrics. Experimental results demonstrate that CSIM outperforms existing similarity metrics in various image distortion scenarios, including noise, compression artifacts and blur. The metric’s ability to detect subtle differences makes it suitable for applications requiring high precision, such as medical imaging, where the detection of minor anomalies can be of a high importance. The results obtained in this work can be reproduced from this Github repository: this https URL.

[CV-98] Toward Zero-Shot Learning for Visual Dehazing of Urological Surgical Robots

Link: https://arxiv.org/abs/2410.01395
Authors: Renkai Wu,Xianjin Wang,Pengchen Liang,Zhenyu Zhang,Qing Chang,Hao Tang
Keywords: profoundly influenced current, influenced current forms, urological surgical, urological surgical robot, minimally invasive surgery
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Robot-assisted surgery has profoundly influenced current forms of minimally invasive surgery. However, in transurethral suburethral urological surgical robots, they need to work in a liquid environment. This causes vaporization of the liquid when shearing and heating is performed, resulting in bubble atomization that affects the visual perception of the robot. This can lead to the need for uninterrupted pauses in the surgical procedure, which makes the surgery take longer. To address the atomization characteristics of liquids under urological surgical robotic vision, we propose an unsupervised zero-shot dehaze method (RSF-Dehaze) for urological surgical robotic vision. Specifically, the proposed Region Similarity Filling Module (RSFM) of RSF-Dehaze significantly improves the recovery of blurred region tissues. In addition, we organize and propose a dehaze dataset for robotic vision in urological surgery (USRobot-Dehaze dataset). In particular, this dataset contains the three most common urological surgical robot operation scenarios. To the best of our knowledge, we are the first to organize and propose a publicly available dehaze dataset for urological surgical robot vision. The proposed RSF-Dehaze proves the effectiveness of our method in three urological surgical robot operation scenarios with extensive comparative experiments with 20 most classical and advanced dehazing and image recovery algorithms. The proposed source code and dataset are available at this https URL .

[CV-99] Anti-biofouling Lensless Camera System with Deep Learning based Image Reconstruction

Link: https://arxiv.org/abs/2410.01365
Authors: Naoki Ide,Tomohiro Kawahara,Hiroshi Ueno,Daiki Yanagidaira,Susumu Takatsuka
Keywords: aqua culture environments, recent years, increasing demand, monitor the condition, condition of offshore
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 8 figures, Ocean Optics 2024

Abstract:In recent years, there has been an increasing demand for underwater cameras that monitor the condition of offshore structures and check the number of individuals in aquaculture environments over long observation periods. One of the significant issues with this observation is that biofouling sticks densely to the aperture and lens and prevents cameras from capturing clear images. This study examines an underwater camera that applies material technologies with high inherent resistance to biofouling and computer vision technologies based on image reconstruction by deep learning to lens-less cameras. For this purpose, our prototype camera uses a coded aperture with 1k rectangular pinholes in a thin plate of a metal, such as copper, that hinders the growth of biofouling and keeps the surface clean. Although images taken by lens-less cameras are usually not well formed due to the lack of a traditional glass-based lens, a deep learning approach using ViT (Vision Transformer) has recently been shown to reconstruct the original photo images well, and our study shows that using gated MLP (Multilayer Perceptron) also yields good results. On the other hand, bio-repellent materials require a certain degree of thickness to exhibit their effect, whereas the aperture must be sufficiently thinner than the size of the pinholes to avoid unintentional reflection and absorption on the sidewalls. Therefore, we prepared a plate sufficiently thin for image reconstruction, and we are currently testing the lens-less camera with the bio-repellent aperture in actual seawater environments to determine whether it demonstrates a sufficient anti-biofouling effect compared with a conventional camera that is only waterproofed.

[CV-100] RS-FME-SwinT: A Novel Feature Map Enhancement Framework Integrating Customized SwinT with Residual and Spatial CNN for Monkeypox Diagnosis

Link: https://arxiv.org/abs/2410.01216
Authors: Saddam Hussain Khan,Rashid Iqbal(Artificial Intelligence Lab, Department of Computer Systems Engineering, University of Engineering and Applied Sciences (UEAS), Swat, Pakistan)
Keywords: steadily increasing daily, cases steadily increasing, significant global concern, increasing daily, cases steadily
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 37 Pages, 5 Tables, 10 Figures

Abstract:Monkeypox (MPox) has emerged as a significant global concern, with cases steadily increasing daily. Conventional detection methods, including polymerase chain reaction (PCR) and manual examination, exhibit challenges of low sensitivity, high cost, and substantial workload. Therefore, deep learning offers an automated solution; however, the datasets include data scarcity, texture, contrast, inter-intra class variability, and similarities with other skin infectious diseases. In this regard, a novel hybrid approach is proposed that integrates the learning capacity of Residual Learning and Spatial Exploitation Convolutional Neural Network (CNN) with a customized Swin Transformer (RS-FME-SwinT) to capture multi-scale global and local correlated features for MPox diagnosis. The proposed RS-FME-SwinT technique employs a transfer learning-based feature map enhancement (FME) technique, integrating the customized SwinT for global information capture, residual blocks for texture extraction, and spatial blocks for local contrast variations. Moreover, incorporating new inverse residual blocks within the proposed SwinT effectively captures local patterns and mitigates vanishing gradients. The proposed RS-FME-SwinT has strong learning potential of diverse features that systematically reduce intra-class MPox variation and enable precise discrimination from other skin diseases. Finally, the proposed RS-FME-SwinT is a holdout cross-validated on a diverse MPox dataset and achieved outperformance on state-of-the-art CNNs and ViTs. The proposed RS-FME-SwinT demonstrates commendable results of an accuracy of 97.80%, sensitivity of 96.82%, precision of 98.06%, and an F-score of 97.44% in MPox detection. The RS-FME-SwinT could be a valuable tool for healthcare practitioners, enabling prompt and accurate MPox diagnosis and contributing significantly to mitigation efforts.

[CV-101] Formula-Driven Data Augmentation and Partial Retinal Layer Copying for Retinal Layer Segmentation MICCAI2024

Link: https://arxiv.org/abs/2410.01185
Authors: Tsubasa Konno,Takahiro Ninomiya,Kanta Miura,Koichi Ito,Noriko Himori,Parmanand Sharma,Toru Nakazawa,Takafumi Aoki
Keywords: Major retinal layer, retinal layer segmentation, layer segmentation methods, OCT images, Major retinal
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: The 11th OMIA Workshop on MICCAI 2024

Abstract:Major retinal layer segmentation methods from OCT images assume that the retina is flattened in advance, and thus cannot always deal with retinas that have changes in retinal structure due to ophthalmopathy and/or curvature due to myopia. To eliminate the use of flattening in retinal layer segmentation for practicality of such methods, we propose novel data augmentation methods for OCT images. Formula-driven data augmentation (FDDA) emulates a variety of retinal structures by vertically shifting each column of the OCT images according to a given mathematical formula. We also propose partial retinal layer copying (PRLC) that copies a part of the retinal layers and pastes it into a region outside the retinal layers. Through experiments using the OCT MS and Healthy Control dataset and the Duke Cyst DME dataset, we demonstrate that the use of FDDA and PRLC makes it possible to detect the boundaries of retinal layers without flattening even retinal layer segmentation methods that assume flattening of the retina.
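
The column-shifting idea behind FDDA can be sketched in a few lines. This is a minimal illustration of vertically shifting each image column by an amount given by a formula, not the authors' implementation. The helper names `shift_columns` and `parabola`, the parabolic formula, and the zero fill value are all hypothetical choices, since the abstract does not specify which formulas or padding the paper uses.

```python
import math


def shift_columns(image, offset_fn, fill=0):
    """Shift every column of a 2D image (a list of rows) down by offset_fn(x) pixels.

    Negative offsets shift upwards. Pixels shifted in from outside the image
    are set to `fill` (an assumed background value).
    """
    height, width = len(image), len(image[0])
    out = [[fill] * width for _ in range(height)]
    for x in range(width):
        dy = math.floor(offset_fn(x) + 0.5)  # round to the nearest pixel
        for y in range(height):
            src = y - dy
            if 0 <= src < height:
                out[y][x] = image[src][x]
    return out


def parabola(width, depth):
    """Hypothetical curvature formula: 0 at the edge columns, `depth` at the centre.

    Assumes width > 1.
    """
    centre = (width - 1) / 2
    return lambda x: depth * (1 - ((x - centre) / centre) ** 2)


# A flat "layer" in the top row becomes a diagonal under a linear tilt.
flat = [[1, 1, 1], [0, 0, 0], [0, 0, 0], [0, 0, 0]]
tilted = shift_columns(flat, lambda x: x)
```

In a segmentation setting the same shift would presumably be applied to the label map as well, so that layer boundaries stay aligned with the augmented image.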

[CV-102] Generating Seamless Virtual Immunohistochemical Whole Slide Images with Content and Color Consistency

Link: https://arxiv.org/abs/2410.01072
Authors: Sitong Liu,Kechun Liu,Samuel Margolis,Wenjun Wu,Stevan R. Knezevich,David E Elder,Megan M. Eguchi,Joann G Elmore,Linda Shapiro
Keywords: providing crucial diagnostic, crucial diagnostic information, providing crucial, play a vital, vital role
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)

Abstract:Immunohistochemical (IHC) stains play a vital role in a pathologist’s analysis of medical images, providing crucial diagnostic information for various diseases. Virtual staining from hematoxylin and eosin (HE)-stained whole slide images (WSIs) allows the automatic production of other useful IHC stains without the expensive physical staining process. However, current virtual WSI generation methods based on tile-wise processing often suffer from inconsistencies in content, texture, and color at tile boundaries. These inconsistencies lead to artifacts that compromise image quality and potentially hinder accurate clinical assessment and diagnoses. To address this limitation, we propose a novel consistent WSI synthesis network, CC-WSI-Net, that extends GAN models to produce seamless synthetic whole slide images. Our CC-WSI-Net integrates a content- and color-consistency supervisor, ensuring consistency across tiles and facilitating the generation of seamless synthetic WSIs while ensuring Sox10 immunohistochemistry accuracy in melanocyte detection. We validate our method through extensive image-quality analyses, objective detection assessments, and a subjective survey with pathologists. By generating high-quality synthetic WSIs, our method opens doors for advanced virtual staining techniques with broader applications in research and clinical care.

[CV-103] TransResNet: Integrating the Strengths of ViTs and CNNs for High Resolution Medical Image Segmentation via Feature Grafting

Link: https://arxiv.org/abs/2410.00986
Authors: Muhammad Hamza Sharif,Dmitry Demidov,Asif Hanif,Mohammad Yaqub,Min Xu
Keywords: medical imaging domain, underlying method, imaging domain, significantly improve, improve the diagnostic
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: The 33rd British Machine Vision Conference 2022

Abstract:High-resolution images are preferable in medical imaging domain as they significantly improve the diagnostic capability of the underlying method. In particular, high resolution helps substantially in improving automatic image segmentation. However, most of the existing deep learning-based techniques for medical image segmentation are optimized for input images having small spatial dimensions and perform poorly on high-resolution images. To address this shortcoming, we propose a parallel-in-branch architecture called TransResNet, which incorporates Transformer and CNN in a parallel manner to extract features from multi-resolution images independently. In TransResNet, we introduce Cross Grafting Module (CGM), which generates the grafted features, enriched in both global semantic and low-level spatial details, by combining the feature maps from Transformer and CNN branches through fusion and self-attention mechanism. Moreover, we use these grafted features in the decoding process, increasing the information flow for better prediction of the segmentation mask. Extensive experiments on ten datasets demonstrate that TransResNet achieves either state-of-the-art or competitive results on several segmentation tasks, including skin lesion, retinal vessel, and polyp segmentation. The source code and pre-trained models are available at this https URL.

[CV-104] Evaluating Deep Regression Models for WSI-Based Gene-Expression Prediction

Link: https://arxiv.org/abs/2410.00945
Authors: Fredrik K. Gustafsson,Mattias Rantalainen
Keywords: routine whole-slide images, accessible molecular phenotyping, widely accessible molecular, mRNA gene-expression profiles, gene-expression profiles directly
Categories: Genomics (q-bio.GN); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Prediction of mRNA gene-expression profiles directly from routine whole-slide images (WSIs) using deep learning models could potentially offer cost-effective and widely accessible molecular phenotyping. While such WSI-based gene-expression prediction models have recently emerged within computational pathology, the high-dimensional nature of the corresponding regression problem offers numerous design choices which remain to be analyzed in detail. This study provides recommendations on how deep regression models should be trained for WSI-based gene-expression prediction. For example, we conclude that training a single model to simultaneously regress all 20530 genes is a computationally efficient yet very strong baseline.

Machine Learning

[LG-0] On the expressiveness and spectral bias of KANs

Link: https://arxiv.org/abs/2410.01803
Authors: Yixuan Wang,Jonathan W. Siegel,Ziming Liu,Thomas Y. Hou
Keywords: prevalent architectural backbone, Kolmogorov-Arnold Networks, deep learning models, KANs, multi-layer perceptron
Categories: Machine Learning (cs.LG)
Comments: 17 pages, 5 figures

Abstract:Kolmogorov-Arnold Networks (KAN) were very recently proposed as a potential alternative to the prevalent architectural backbone of many deep learning models, the multi-layer perceptron (MLP). KANs have seen success in various tasks of AI for science, with their empirical efficiency and accuracy demonstrated in function regression, PDE solving, and many more scientific problems. In this article, we revisit the comparison of KANs and MLPs, with emphasis on a theoretical perspective. On the one hand, we compare the representation and approximation capabilities of KANs and MLPs. We establish that MLPs can be represented using KANs of a comparable size. This shows that the approximation and representation capabilities of KANs are at least as good as MLPs. Conversely, we show that KANs can be represented using MLPs, but that in this representation the number of parameters increases by a factor of the KAN grid size. This suggests that KANs with a large grid size may be more efficient than MLPs at approximating certain functions. On the other hand, from the perspective of learning and optimization, we study the spectral bias of KANs compared with MLPs. We demonstrate that KANs are less biased toward low frequencies than MLPs. We highlight that the multi-level learning feature specific to KANs, i.e. grid extension of splines, improves the learning process for high-frequency components. Detailed comparisons with different choices of depth, width, and grid sizes of KANs are made, shedding some light on how to choose the hyperparameters in practice.

[LG-1] PROXI: Challenging the GNNs for Link Prediction

Link: https://arxiv.org/abs/2410.01802
Authors: Astrit Tola, Jack Myrick, Baris Coskunuzer
Keywords: Graph Neural Networks, Neural Networks, Graph Neural, past decade, transformed graph representation
Categories: Machine Learning (cs.LG); Computational Geometry (cs.CG)

Abstract:Over the past decade, Graph Neural Networks (GNNs) have transformed graph representation learning. In the widely adopted message-passing GNN framework, nodes refine their representations by aggregating information from neighboring nodes iteratively. While GNNs excel in various domains, recent theoretical studies have raised concerns about their capabilities. GNNs aim to address various graph-related tasks by utilizing such node representations; however, this one-size-fits-all approach proves suboptimal for diverse tasks. Motivated by these observations, we conduct empirical tests to compare the performance of current GNN models with more conventional and direct methods in link prediction tasks. Introducing our model, PROXI, which leverages proximity information of node pairs in both graph and attribute spaces, we find that standard machine learning (ML) models perform competitively, even outperforming cutting-edge GNN models when applied to these proximity metrics derived from node neighborhoods and attributes. This holds true across both homophilic and heterophilic networks, as well as small and large benchmark datasets, including those from the Open Graph Benchmark (OGB). Moreover, we show that augmenting traditional GNNs with PROXI significantly boosts their link prediction performance. Our empirical findings corroborate the previously mentioned theoretical observations and imply that there exists ample room for enhancement in current GNN models to reach their potential.

[LG-2] Bellman Diffusion: Generative Modeling as Learning a Linear Operator in the Distribution Space

Link: https://arxiv.org/abs/2410.01796
Authors: Yangming Li, Chieh-Hsin Lai, Carola-Bibiane Schönlieb, Yuki Mitsufuji, Stefano Ermon
Keywords: Deep Generative Models, Score-based Generative Models, Generative Models, including Energy-Based Models, Deep Generative
Categories: Machine Learning (cs.LG)
Comments: Paper under review

Abstract:Deep Generative Models (DGMs), including Energy-Based Models (EBMs) and Score-based Generative Models (SGMs), have advanced high-fidelity data generation and complex continuous distribution approximation. However, their application in Markov Decision Processes (MDPs), particularly in distributional Reinforcement Learning (RL), remains underexplored, with conventional histogram-based methods dominating the field. This paper rigorously highlights that this application gap is caused by the nonlinearity of modern DGMs, which conflicts with the linearity required by the Bellman equation in MDPs. For instance, EBMs involve nonlinear operations such as exponentiating energy functions and normalizing constants. To address this, we introduce Bellman Diffusion, a novel DGM framework that maintains linearity in MDPs through gradient and scalar field modeling. With divergence-based training techniques to optimize neural network proxies and a new type of stochastic differential equation (SDE) for sampling, Bellman Diffusion is guaranteed to converge to the target distribution. Our empirical results show that Bellman Diffusion achieves accurate field estimations and is a capable image generator, converging 1.5x faster than the traditional histogram-based baseline in distributional RL tasks. This work enables the effective integration of DGMs into MDP applications, unlocking new avenues for advanced decision-making frameworks.

[LG-3] Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models

Link: https://arxiv.org/abs/2410.01795
Authors: Joseph Lee, Shu Yang, Jae Young Baik, Xiaoxi Liu, Zhen Tan, Dawei Li, Zixuan Wen, Bojian Hou, Duy Duong-Tran, Tianlong Chen, Li Shen
Keywords: Predicting phenotypes, variant features remains, genetic bases based, bases based, remains a challenging
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Genomics (q-bio.GN)

Abstract:Predicting phenotypes with complex genetic bases based on a small, interpretable set of variant features remains a challenging task. Conventionally, data-driven approaches are utilized for this task, yet the high-dimensional nature of genotype data makes the analysis and prediction difficult. Motivated by the extensive knowledge encoded in pre-trained LLMs and their success in processing complex biomedical concepts, we set out to examine the ability of LLMs in feature selection and engineering for tabular genotype data, with a novel knowledge-driven framework. We develop FREEFORM, Free-flow Reasoning and Ensembling for Enhanced Feature Output and Robust Modeling, designed with chain-of-thought and ensembling principles, to select and engineer features with the intrinsic knowledge of LLMs. Evaluated on two distinct genotype-phenotype datasets, genetic ancestry and hereditary hearing loss, we find this framework outperforms several data-driven methods, particularly in low-shot regimes. FREEFORM is available as an open-source framework on GitHub: this https URL.

[LG-4] Investigating on RLHF methodology

Link: https://arxiv.org/abs/2410.01789
Authors: Alexey Kutalev, Sergei Markoff
Keywords: Large Language Models, Large Language, Language Models, fine-tune Large Language, specific Language Model
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 23 pages, 6 figures, 6 tables

Abstract:In this article, we investigate the alignment of Large Language Models according to human preferences. We discuss the features of training a Preference Model, which simulates human preferences, and the methods and details we found essential for achieving the best results. We also discuss using Reinforcement Learning to fine-tune Large Language Models and describe the challenges we faced and the ways to overcome them. Additionally, we present our experience with the Direct Preference Optimization method, which enables us to align a Large Language Model with human preferences without creating a separate Preference Model. As our contribution, we introduce the approach for collecting a preference dataset through perplexity filtering, which makes the process of creating such a dataset for a specific Language Model much easier and more cost-effective.

[LG-5] Learning To Solve Differential Equation Constrained Optimization Problems

Link: https://arxiv.org/abs/2410.01786
Authors: Vincenzo Di Vito, Mostafa Mohammadian, Kyri Baker, Ferdinando Fioretto
Keywords: stochastic differential equations, constrained optimization plays, Differential equations, engineering fields, aerospace engineering
Categories: Machine Learning (cs.LG)

Abstract:Differential equations (DE) constrained optimization plays a critical role in numerous scientific and engineering fields, including energy systems, aerospace engineering, ecology, and finance, where optimal configurations or control strategies must be determined for systems governed by ordinary or stochastic differential equations. Despite its significance, the computational challenges associated with these problems have limited their practical use. To address these limitations, this paper introduces a learning-based approach to DE-constrained optimization that combines techniques from proxy optimization and neural differential equations. The proposed approach uses a dual-network architecture, with one approximating the control strategies, focusing on steady-state constraints, and another solving the associated DEs. This combination enables the approximation of optimal strategies while accounting for dynamic constraints in near real-time. Experiments across problems in energy optimization and finance modeling show that this method provides full compliance with dynamic constraints and it produces results up to 25 times more precise than other methods which do not explicitly model the system’s dynamic equations.

[LG-6] Open-RAG: Enhanced Retrieval-Augmented Reasoning with Open-Source Large Language Models EMNLP2024

Link: https://arxiv.org/abs/2410.01782
Authors: Shayekh Bin Islam, Md Asib Rahman, K S M Tozammel Hossain, Enamul Hoque, Shafiq Joty, Md Rizwan Parvez
Keywords: Large Language Models, Large Language, Retrieval-Augmented Generation, accuracy of Large, limited reasoning capabilities
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to EMNLP 2024 Findings. Website: this https URL. 14 pages, 7 figures, 5 tables

Abstract:Retrieval-Augmented Generation (RAG) has been shown to enhance the factual accuracy of Large Language Models (LLMs), but existing methods often suffer from limited reasoning capabilities in effectively using the retrieved evidence, particularly when using open-source LLMs. To mitigate this gap, we introduce a novel framework, Open-RAG, designed to enhance reasoning capabilities in RAG with open-source LLMs. Our framework transforms an arbitrary dense LLM into a parameter-efficient sparse mixture of experts (MoE) model capable of handling complex reasoning tasks, including both single- and multi-hop queries. Open-RAG uniquely trains the model to navigate challenging distractors that appear relevant but are misleading. As a result, Open-RAG leverages latent learning, dynamically selecting relevant experts and integrating external knowledge effectively for more accurate and contextually relevant responses. In addition, we propose a hybrid adaptive retrieval method to determine retrieval necessity and balance the trade-off between performance gain and inference speed. Experimental results show that the Llama2-7B-based Open-RAG outperforms state-of-the-art LLMs and RAG models such as ChatGPT, Self-RAG, and Command R+ in various knowledge-intensive tasks. We open-source our code and models at this https URL

[LG-7] Composing Global Optimizers to Reasoning Tasks via Algebraic Objects in Neural Nets

Link: https://arxiv.org/abs/2410.01779
Authors: Yuandong Tian
Keywords: Abelian group, tasks in Abelian, prove rich algebraic, trained on reasoning, quadratic activation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Commutative Algebra (math.AC); Rings and Algebras (math.RA)

Abstract:We prove rich algebraic structures of the solution space for 2-layer neural networks with quadratic activation and L_2 loss, trained on reasoning tasks in Abelian groups (e.g., modular addition). Such a rich structure enables analytical construction of global optimal solutions from partial solutions that only satisfy part of the loss, despite its high nonlinearity. We coin the framework as CoGO (Composing Global Optimizers). Specifically, we show that the weight space over different numbers of hidden nodes of the 2-layer network is equipped with a semi-ring algebraic structure, and the loss function to be optimized consists of monomial potentials, which are ring homomorphisms, allowing partial solutions to be composed into global ones by ring addition and multiplication. Our experiments show that around 95% of the solutions obtained by gradient descent match exactly our theoretical constructions. Although the global optimizers constructed only require a small number of hidden nodes, our analysis of gradient dynamics shows that over-parameterization asymptotically decouples training dynamics and is beneficial. We further show that training dynamics favor simpler solutions under weight decay, and thus high-order global optimizers such as perfect memorization are unfavorable.

[LG-8] TopER: Topological Embeddings in Graph Representation Learning

Link: https://arxiv.org/abs/2410.01778
Authors: Astrit Tola, Funmilola Mary Taiwom, Cuneyt Gurcan Akcora, Baris Coskunuzer
Keywords: interpret graph-structured data, allowing machine learning, play a critical, critical role, explore and interpret
Categories: Machine Learning (cs.LG); Algebraic Topology (math.AT)
Comments: 17 pages, 7 figures

Abstract:Graph embeddings play a critical role in graph representation learning, allowing machine learning models to explore and interpret graph-structured data. However, existing methods often rely on opaque, high-dimensional embeddings, limiting interpretability and practical visualization. In this work, we introduce Topological Evolution Rate (TopER), a novel, low-dimensional embedding approach grounded in topological data analysis. TopER simplifies a key topological approach, Persistent Homology, by calculating the evolution rate of graph substructures, resulting in intuitive and interpretable visualizations of graph data. This approach not only enhances the exploration of graph datasets but also delivers competitive performance in graph clustering and classification tasks. Our TopER-based models achieve or surpass state-of-the-art results across molecular, biological, and social network datasets in tasks such as classification, clustering, and visualization.

[LG-9] Trained Transformer Classifiers Generalize and Exhibit Benign Overfitting In-Context

Link: https://arxiv.org/abs/2410.01774
Authors: Spencer Frei, Gal Vardi
Keywords: supervised learning algorithms, unlabeled test, labeled training, capacity to act, act as supervised
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 34 pages

Abstract:Transformers have the capacity to act as supervised learning algorithms: by properly encoding a set of labeled training (“in-context”) examples and an unlabeled test example into an input sequence of vectors of the same dimension, the forward pass of the transformer can produce predictions for that unlabeled test example. A line of recent work has shown that when linear transformers are pre-trained on random instances for linear regression tasks, these trained transformers make predictions using an algorithm similar to that of ordinary least squares. In this work, we investigate the behavior of linear transformers trained on random linear classification tasks. Via an analysis of the implicit regularization of gradient descent, we characterize how many pre-training tasks and in-context examples are needed for the trained transformer to generalize well at test-time. We further show that in some settings, these trained transformers can exhibit “benign overfitting in-context”: when in-context examples are corrupted by label flipping noise, the transformer memorizes all of its in-context examples (including those with noisy labels) yet still generalizes near-optimally for clean test examples.

[LG-10] Bayesian Binary Search

Link: https://arxiv.org/abs/2410.01771
Authors: Vikash Singh, Matthew Khanzadeh, Vincent Davis, Harrison Rush, Emanuele Rossi, Jesse Shrader, Pietro Lio
Keywords: present Bayesian Binary, classical binary search, Bayesian Binary Search, Binary Search, search space
Categories: Machine Learning (cs.LG)

Abstract:We present Bayesian Binary Search (BBS), a novel probabilistic variant of the classical binary search/bisection algorithm. BBS leverages machine learning/statistical techniques to estimate the probability density of the search space and modifies the bisection step to split based on probability density rather than the traditional midpoint, allowing for the learned distribution of the search space to guide the search algorithm. Search space density estimation can flexibly be performed using supervised probabilistic machine learning techniques (e.g., Gaussian process regression, Bayesian neural networks, quantile regression) or unsupervised learning algorithms (e.g., Gaussian mixture models, kernel density estimation (KDE), maximum likelihood estimation (MLE)). We demonstrate significant efficiency gains of using BBS on both simulated data across a variety of distributions and in a real-world binary search use case of probing channel balances in the Bitcoin Lightning Network, for which we have deployed the BBS algorithm in a production setting.
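
The idea of splitting on probability mass rather than at the midpoint can be illustrated with a small sketch. This is not the authors' implementation: the function name `probabilistic_bisect`, the discretized `weights` prior, and the `target_at_or_below` oracle are all assumptions made for illustration. It assumes non-negative weights and a monotone oracle.

```python
from bisect import bisect_left
from itertools import accumulate


def probabilistic_bisect(weights, target_at_or_below):
    """Locate a hidden index in range(len(weights)).

    weights[i] is an (unnormalized, non-negative) prior belief that the
    target is index i. target_at_or_below(m) is an oracle that returns
    True iff target <= m. Returns (index, number_of_oracle_calls).
    """
    # prefix[i] == sum(weights[:i]); non-decreasing because weights >= 0.
    prefix = [0.0] + list(accumulate(weights))
    lo, hi = 0, len(weights) - 1
    calls = 0
    while lo < hi:
        # Mass level that halves the probability remaining in [lo, hi].
        half = (prefix[lo] + prefix[hi + 1]) / 2.0
        # Smallest j in [lo + 1, hi] with prefix[j] >= half, if any.
        j = bisect_left(prefix, half, lo + 1, hi + 1)
        # Split point m must lie in [lo, hi - 1] so both branches shrink.
        m = min(j - 1, hi - 1)
        calls += 1
        if target_at_or_below(m):
            hi = m
        else:
            lo = m + 1
    return lo, calls


if __name__ == "__main__":
    hidden = 700
    uniform = [1.0] * 1024          # uniform prior: behaves like plain bisection
    peaked = [1e-6] * 1024          # prior concentrated near the true target
    peaked[hidden] = 1.0
    for name, prior in (("uniform", uniform), ("peaked", peaked)):
        index, calls = probabilistic_bisect(prior, lambda m: hidden <= m)
        print(name, index, calls)
```

With a uniform prior the split falls at the midpoint, so the sketch reduces to classical bisection; an informative prior lets it isolate a likely target in fewer oracle calls, while a misleading prior can cost more calls than bisection.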

[LG-11] Explainable Earth Surface Forecasting under Extreme Events

Link: https://arxiv.org/abs/2410.01770
Authors: Oscar J. Pellicer-Valero, Miguel-Ángel Fernández-Torres, Chaonan Ji, Miguel D. Mahecha, Gustau Camps-Valls
Keywords: high dimensional Earth, dimensional Earth observation, Earth observation data, observation data presents, dimensional Earth
Categories: Machine Learning (cs.LG)

Abstract:With climate change-related extreme events on the rise, high-dimensional Earth observation data presents a unique opportunity for forecasting and understanding impacts on ecosystems. This is, however, impeded by the complexity of processing, visualizing, modeling, and explaining this data. To showcase how this challenge can be met, here we train a convolutional long short-term memory-based architecture on the novel DeepExtremeCubes dataset. DeepExtremeCubes includes around 40,000 long-term Sentinel-2 minicubes (January 2016-October 2022) worldwide, along with labeled extreme events, meteorological data, vegetation land cover, and topography maps, sampled from locations affected by extreme climate events and surrounding areas. When predicting future reflectances and vegetation impacts through the kernel normalized difference vegetation index, the model achieved an R^2 score of 0.9055 on the test set. Explainable artificial intelligence was used to analyze the model’s predictions during the October 2020 Central South America compound heatwave and drought event. We chose the same area exactly one year before the event as a counterfactual, finding that the average temperature and surface pressure are generally the best predictors under normal conditions. In contrast, minimum anomalies of evaporation and surface latent heat flux take the lead during the event. A change of regime is also observed in the attributions before the event, which might help assess how long the event was brewing before happening. The code to replicate all experiments and figures in this paper is publicly available at this https URL

[LG-12] Decision-Focused Uncertainty Quantification

Link: https://arxiv.org/abs/2410.01767
Authors: Santiago Cortes-Gomez, Carlos Patiño, Yewon Byun, Steven Wu, Eric Horvitz, Bryan Wilder
Keywords: decision-focused machine learning, downstream optimization problems, machine learning methods, increasing interest, interest in decision-focused
Categories: Machine Learning (cs.LG)

Abstract:There is increasing interest in “decision-focused” machine learning methods which train models to account for how their predictions are used in downstream optimization problems. Doing so can often improve performance on subsequent decision problems. However, current methods for uncertainty quantification do not incorporate any information at all about downstream decisions. We develop a framework based on conformal prediction to produce prediction sets that account for a downstream decision loss function, making them more appropriate to inform high-stakes decision-making. Our approach harnesses the strengths of conformal methods (modularity, model-agnosticism, and statistical coverage guarantees) while incorporating downstream decisions and user-specified utility functions. We prove that our methods retain standard coverage guarantees. Empirical evaluation across a range of datasets and utility metrics demonstrates that our methods achieve significantly lower decision loss compared to standard conformal methods. Additionally, we present a real-world use case in healthcare diagnosis, where our method effectively incorporates the hierarchical structure of dermatological diseases. It successfully generates sets with coherent diagnostic meaning, aiding the triage process during dermatology diagnosis and illustrating how our method can ground high-stakes decision-making on external domain knowledge.
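
For readers unfamiliar with the baseline the abstract builds on, the following is a minimal sketch of standard split conformal prediction for classification. It does not include the paper's decision-aware extension. The function names, the `1 - probability` nonconformity score, and the toy numbers are illustrative assumptions.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Return the conformal quantile of calibration nonconformity scores.

    cal_scores: scores of the TRUE label on held-out calibration examples.
    alpha: target miscoverage rate (e.g. 0.1 for 90% coverage).
    """
    n = len(cal_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: include every label.
        return float("inf")
    return sorted(cal_scores)[k - 1]


def prediction_set(probs, threshold):
    """Keep every label whose score (1 - predicted probability) is small enough.

    probs: dict mapping label -> predicted probability for one test input.
    """
    return {label for label, p in probs.items() if 1.0 - p <= threshold}


if __name__ == "__main__":
    # Hypothetical calibration scores and model output, for illustration only.
    calibration = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
    threshold = conformal_threshold(calibration, alpha=0.5)
    print(prediction_set({"a": 0.6, "b": 0.3, "c": 0.1}, threshold))
```

The threshold depends only on calibration scores, which is what makes the procedure model-agnostic. A decision-focused variant would change how sets are scored or constructed, which this sketch does not attempt.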

[LG-13] TorchSISSO: A PyTorch-Based Implementation of the Sure Independence Screening and Sparsifying Operator for Efficient and Interpretable Model Discovery

Link: https://arxiv.org/abs/2410.01752
Authors: Madhav Muthyala, Farshud Sorourifar, Joel A. Paulson
Keywords: powerful machine learning, machine learning approach, Symbolic regression, powerful machine, machine learning
Categories: Machine Learning (cs.LG)

Abstract:Symbolic regression (SR) is a powerful machine learning approach that searches for both the structure and parameters of algebraic models, offering interpretable and compact representations of complex data. Unlike traditional regression methods, SR explores progressively complex feature spaces, which can uncover simple models that generalize well, even from small datasets. Among SR algorithms, the Sure Independence Screening and Sparsifying Operator (SISSO) has proven particularly effective in the natural sciences, helping to rediscover fundamental physical laws as well as discover new interpretable equations for materials property modeling. However, its widespread adoption has been limited by performance inefficiencies and the challenges posed by its FORTRAN-based implementation, especially in modern computing environments. In this work, we introduce TorchSISSO, a native Python implementation built in the PyTorch framework. TorchSISSO leverages GPU acceleration, easy integration, and extensibility, offering a significant speed-up and improved accuracy over the original. We demonstrate that TorchSISSO matches or exceeds the performance of the original SISSO across a range of tasks, while dramatically reducing computational time and improving accessibility for broader scientific applications.

[LG-14] Not All LLM Reasoners Are Created Equal

Link: https://arxiv.org/abs/2410.01748
Authors: Arian Hosseini, Alessandro Sordoni, Daniel Toyama, Aaron Courville, Rishabh Agarwal
Keywords: problem-solving capabilities, study the depth, depth of grade-school, grade-school math, Abstract
Categories: Machine Learning (cs.LG)

Abstract:We study the depth of grade-school math (GSM) problem-solving capabilities of LLMs. To this end, we evaluate their performance on pairs of existing math word problems together so that the answer to the second problem depends on correctly answering the first problem. Our findings reveal a significant reasoning gap in most LLMs, that is, a performance difference between solving the compositional pairs and solving each question independently. This gap is more pronounced in smaller, more cost-efficient, and math-specialized models. Moreover, instruction-tuning recipes and code generation have varying effects across LLM sizes, while finetuning on GSM can lead to task overfitting. Our analysis indicates that large reasoning gaps are not due to test-set leakage, but to distraction from additional context and poor second-hop reasoning. Overall, LLMs exhibit systematic differences in their reasoning abilities, despite what their performance on standard benchmarks indicates.

[LG-15] Leray-Schauder Mappings for Operator Learning

Link: https://arxiv.org/abs/2410.01746
Authors: Emanuele Zappala
Keywords: Banach spaces, compact subspaces, present an algorithm, algorithm for learning, Leray-Schauder mappings
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 6 pages, 2 figures, 1 table. Comments are welcome!

Abstract:We present an algorithm for learning operators between Banach spaces, based on the use of Leray-Schauder mappings to learn a finite-dimensional approximation of compact subspaces. We show that the resulting method is a universal approximator of (possibly nonlinear) operators. We demonstrate the efficiency of the approach on two benchmark datasets, showing that it achieves results comparable to state-of-the-art models.

[LG-16] PreND: Enhancing Intrinsic Motivation in Reinforcement Learning through Pre-trained Network Distillation

Link: https://arxiv.org/abs/2410.01745
Authors: Mohammadamin Davoodabadi, Negin Hashemi Dijujin, Mahdieh Soleymani Baghshah
Keywords: Random Network Distillation, Pre-trained Network Distillation, Network Distillation, Intrinsic motivation, psychology of developmental
Categories: Machine Learning (cs.LG)
Comments: 8 pages, 4 figures

Abstract:Intrinsic motivation, inspired by the psychology of developmental learning in infants, stimulates exploration in agents without relying solely on sparse external rewards. Existing methods in reinforcement learning like Random Network Distillation (RND) face significant limitations, including (1) relying on raw visual inputs, leading to a lack of meaningful representations, (2) the inability to build a robust latent space, (3) poor target network initialization and (4) rapid degradation of intrinsic rewards. In this paper, we introduce Pre-trained Network Distillation (PreND), a novel approach to enhance intrinsic motivation in reinforcement learning (RL) by improving upon the widely used prediction-based method, RND. PreND addresses these challenges by incorporating pre-trained representation models into both the target and predictor networks, resulting in more meaningful and stable intrinsic rewards, while enhancing the representation learned by the model. We also tried simple but effective variants of the predictor network optimization by controlling the learning rate. Through experiments on the Atari domain, we demonstrate that PreND significantly outperforms RND, offering a more robust intrinsic motivation signal that leads to better exploration, improving overall performance and sample efficiency. This research highlights the importance of target and predictor networks representation in prediction-based intrinsic motivation, setting a new direction for improving RL agents’ learning efficiency in sparse reward environments.
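
The prediction-based mechanism that RND relies on can be sketched in a few lines. This is a toy illustration of plain RND, not PreND: single linear layers stand in for the networks, no pre-trained encoder is used, and all names and sizes are assumptions. A fixed random target map is never trained, a predictor is trained to imitate it on visited states, and the remaining squared error serves as the intrinsic reward.

```python
import random


def make_linear(dim_in, dim_out, rng):
    """Random weight matrix (dim_out rows, dim_in columns)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim_in)] for _ in range(dim_out)]


def apply(weights, x):
    """Matrix-vector product: one output per row of weights."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def intrinsic_reward(target, predictor, x):
    """Squared prediction error: large for states the predictor has not fit."""
    t, p = apply(target, x), apply(predictor, x)
    return sum((a - b) ** 2 for a, b in zip(t, p))


def train_step(target, predictor, x, lr=0.1):
    """One gradient step on the squared error; only the predictor is updated."""
    t, p = apply(target, x), apply(predictor, x)
    for row, ti, pi in zip(predictor, t, p):
        err = pi - ti
        for j, xj in enumerate(x):
            row[j] -= lr * 2.0 * err * xj


if __name__ == "__main__":
    rng = random.Random(0)
    target = make_linear(4, 3, rng)      # frozen, randomly initialized
    predictor = make_linear(4, 3, rng)   # trained on visited states
    seen, novel = [1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]
    for _ in range(50):
        train_step(target, predictor, seen)
    print("seen :", intrinsic_reward(target, predictor, seen))
    print("novel:", intrinsic_reward(target, predictor, novel))
```

The reward for a repeatedly visited state decays toward zero while an unvisited state keeps its bonus. That is the exploration signal, and also the source of the rapid reward degradation the abstract lists as a limitation.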

[LG-17] Mimicking Human Intuition: Cognitive Belief-Driven Q-Learning ICLR25

Link: https://arxiv.org/abs/2410.01739
Authors: Xingrui Gu, Guanren Qiao, Chuyi Jiang, Tianqing Xia, Hangyu Mao
Keywords: Reinforcement learning encounters, learning encounters challenges, Reinforcement learning, encounters challenges, Reinforcement
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Under review by ICLR 25

Abstract:Reinforcement learning encounters challenges in various environments related to robustness and explainability. Traditional Q-learning algorithms cannot effectively make decisions and utilize the historical learning experience. To overcome these limitations, we propose Cognitive Belief-Driven Q-Learning (CBDQ), which integrates subjective belief modeling into the Q-learning framework, enhancing decision-making accuracy by endowing agents with human-like learning and reasoning capabilities. Drawing inspiration from cognitive science, our method maintains a subjective belief distribution over the expectation of actions, leveraging a cluster-based subjective belief model that enables agents to reason about the potential probability associated with each decision. CBDQ effectively mitigates overestimation and optimizes decision-making policies by integrating historical experiences with current contextual information, mimicking the dynamics of human decision-making. We evaluate the proposed method on discrete control benchmark tasks in various complicated environments. The results demonstrate that CBDQ exhibits stronger adaptability, robustness, and human-like characteristics in handling these environments, outperforming other baselines. We hope this work will give researchers a fresh perspective on understanding and explaining Q-learning.

[LG-18] Recursive Abstractive Processing for Retrieval in Dynamic Datasets

Link: https://arxiv.org/abs/2410.01736
Authors: Charbel Chucri, Rami Azouz, Joachim Ott
Keywords: Recent retrieval-augmented models, retrieval-augmented models enhance, models enhance basic, Recent retrieval-augmented, retrieved text chunks
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Recent retrieval-augmented models enhance basic methods by building a hierarchical structure over retrieved text chunks through recursive embedding, clustering, and summarization. The most relevant information is then retrieved from both the original text and generated summaries. However, such approaches face limitations with dynamic datasets, where adding or removing documents over time complicates the updating of hierarchical representations formed through clustering. We propose a new algorithm to efficiently maintain the recursive-abstractive tree structure in dynamic datasets, without compromising performance. Additionally, we introduce a novel post-retrieval method that applies query-focused recursive abstractive processing to substantially improve context quality. Our method overcomes the limitations of other approaches by functioning as a black-box post-retrieval layer compatible with any retrieval algorithm. Both algorithms are validated through extensive experiments on real-world datasets, demonstrating their effectiveness in handling dynamic data and improving retrieval performance.

[LG-19] LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits

Link: https://arxiv.org/abs/2410.01735
Authors: Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
Keywords: Reward Models, play a crucial, crucial role, role in aligning, multiple RMs
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 20 pages; First two authors contributed equally. Code: this https URL

Abstract:Reward Models (RMs) play a crucial role in aligning LLMs with human preferences, enhancing their performance by ranking outputs during inference or iterative training. However, the degree to which an RM generalizes to new tasks is often not known a priori (e.g. some RMs may excel at scoring creative writing vs. math reasoning). Therefore, using only one fixed RM while training LLMs can be suboptimal. Moreover, optimizing LLMs with multiple RMs simultaneously can be prohibitively computationally-intensive and challenging due to conflicting signals from different RMs, potentially degrading performance. To address these challenges, we introduce LASeR (Learning to Adaptively Select Rewards), which iteratively trains LLMs using multiple RMs, selecting and utilizing the most well-suited RM for each instance to rank outputs and generate preference data, framed as a multi-armed bandit problem. Our results on commonsense and math reasoning tasks demonstrate that LASeR can boost iterative LLM optimization by optimizing for multiple RMs, improving the absolute average accuracy of Llama-3-8B over three datasets by 2.67% over training with ensemble RM scores while also showing superior training efficiency (e.g., a 2x speedup). Moreover, on WildChat, a benchmark of instruction-following prompts, we find that using Llama-3-8B LASeR leads to a 71.45% AlpacaEval win rate over sequentially optimizing multiple RMs. Extending to long-context generation tasks, we find that on Llama-3-8B, LASeR achieves an average improvement of 2.64 F1 and 2.42 F1 on single- and multi-document QA over random RM selection when used with best-of-n sampling. LASeR is robust to noisy rewards and generalizes to multiple settings. Finally, LASeR’s RM selection changes depending on the underlying task or instance and we verify the presence of conflicting preferences from multiple RMs that can be mitigated using LASeR.
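
Framing reward-model selection as a multi-armed bandit can be illustrated with a generic selector. The sketch uses the textbook UCB1 rule as a stand-in and makes no claim about which bandit algorithm or feedback signal LASeR actually uses. The class name `UCB1Selector` and the fixed per-arm rewards are hypothetical.

```python
import math


class UCB1Selector:
    """Pick one arm (here: one reward model) per round using UCB1."""

    def __init__(self, n_arms, c=2.0):
        self.counts = [0] * n_arms      # how often each arm was chosen
        self.means = [0.0] * n_arms     # running mean feedback per arm
        self.c = c                      # exploration strength

    def select(self):
        # Try every arm once before trusting the statistics.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        total = sum(self.counts)
        scores = [
            mean + math.sqrt(self.c * math.log(total) / n)
            for mean, n in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        # Incremental mean update with the observed feedback.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


if __name__ == "__main__":
    # Hypothetical usefulness of three reward models on the current task.
    usefulness = [0.2, 0.8, 0.5]
    selector = UCB1Selector(n_arms=len(usefulness))
    for _ in range(500):
        arm = selector.select()
        selector.update(arm, usefulness[arm])
    print(selector.counts)
```

In a training loop, the chosen arm would decide which reward model ranks the sampled outputs for that batch or instance, and the feedback would come from some measure of how much the resulting preference data helped. A contextual bandit would additionally condition the choice on features of the instance.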

[LG-20] Evaluating Robustness of Reward Models for Mathematical Reasoning

Link: https://arxiv.org/abs/2410.01729
Authors: Sunghwan Kim, Dongjin Kang, Taeyoon Kwon, Hyungjoo Chae, Jungsoo Won, Dongha Lee, Jinyoung Yeo
Keywords: Reward models, human feedback, human preferences, Reward, key in reinforcement
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Work in progress

Abstract:Reward models are key in reinforcement learning from human feedback (RLHF) systems, aligning the model behavior with human preferences. Particularly in the math domain, there have been plenty of studies using reward models to align policies for improving reasoning capabilities. Recently, as the importance of reward models has been emphasized, RewardBench was proposed to understand their behavior. However, we find that the math subset of RewardBench has different representations between chosen and rejected completions, and relies on a single comparison, which may lead to unreliable results as it only sees an isolated case. Therefore, it fails to accurately present the robustness of reward models, leading to a misunderstanding of their performance and potentially resulting in reward hacking. In this work, we introduce a new design for reliable evaluation of reward models, and to validate this, we construct RewardMATH, a benchmark that effectively represents the robustness of reward models in mathematical reasoning tasks. We demonstrate that the scores on RewardMATH strongly correlate with the results of optimized policy and effectively estimate reward overoptimization, whereas the existing benchmark shows almost no correlation. The results underscore the potential of our design to enhance the reliability of evaluation, and represent the robustness of reward models. We make our code and data publicly available.

[LG-21] Automated Knowledge Concept Annotation and Question Representation Learning for Knowledge Tracing

Link: https://arxiv.org/abs/2410.01727
Authors: Yilmazcan Ozyurt, Stefan Feuerriegel, Mrinmaya Sachan
Keywords (EN): modeling students’ learning, students’ learning progress, progress over time, modeling students’, enable more personalized
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Abstract:Knowledge tracing (KT) is a popular approach for modeling students’ learning progress over time, which can enable more personalized and adaptive learning. However, existing KT approaches face two major limitations: (1) they rely heavily on expert-defined knowledge concepts (KCs) in questions, which is time-consuming and prone to errors; and (2) KT methods tend to overlook the semantics of both questions and the given KCs. In this work, we address these challenges and present KCQRL, a framework for automated knowledge concept annotation and question representation learning that can improve the effectiveness of any existing KT model. First, we propose an automated KC annotation process using large language models (LLMs), which generates question solutions and then annotates KCs in each solution step of the questions. Second, we introduce a contrastive learning approach to generate semantically rich embeddings for questions and solution steps, aligning them with their associated KCs via a tailored false negative elimination approach. These embeddings can be readily integrated into existing KT models, replacing their randomly initialized embeddings. We demonstrate the effectiveness of KCQRL across 15 KT algorithms on two large real-world Math learning datasets, where we achieve consistent performance improvements.

[LG-22] Towards a Theoretical Understanding of Synthetic Data in LLM Post-Training: A Reverse-Bottleneck Perspective

Link: https://arxiv.org/abs/2410.01720
Authors: Zeyu Gan, Yong Liu
Keywords (EN): synthetic data generation, Synthetic data, large language models, generate synthetic data, prevalent synthetic data
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Synthetic data has become a pivotal resource in post-training tasks for large language models (LLMs) due to the scarcity of high-quality, specific data. While various methods have been developed to generate synthetic data, there remains a discernible gap between the practical effects of synthetic data and our theoretical comprehension. To address this challenge, we commence by presenting a detailed modeling of the prevalent synthetic data generation process. Building upon this modeling, we demonstrate that the generalization capability of the post-trained model is critically determined by the information gain derived from the generative model, as analyzed from a novel reverse-bottleneck perspective. Moreover, we introduce the concept of Generalization Gain via Mutual Information (GGMI) and elucidate the relationship between generalization gain and information gain. This analysis serves as a theoretical foundation for synthetic data generation and further highlights its connection with the generalization capability of post-trained models, offering an understanding about the design of synthetic data generation techniques and the optimization of the post-training process. We open source our code through an anonymous GitHub repository at this https URL.

[LG-23] Meta-TTT: A Meta-learning Minimax Framework For Test-Time Training

Link: https://arxiv.org/abs/2410.01709
Authors: Chen Tao, Li Shen, Soumik Mondal
Keywords (EN): unlabeled target data, unlabeled target, data during inference, aims to adapt, target data
Categories: Machine Learning (cs.LG)
Comments: 10 pages, 7 tables, 1 figure

Abstract:Test-time domain adaptation is a challenging task that aims to adapt a pre-trained model to limited, unlabeled target data during inference. Current methods that rely on self-supervision and entropy minimization underperform when the self-supervised learning (SSL) task does not align well with the primary objective. Additionally, minimizing entropy can lead to suboptimal solutions when there is limited diversity within minibatches. This paper introduces a meta-learning minimax framework for test-time training on batch normalization (BN) layers, ensuring that the SSL task aligns with the primary task while addressing minibatch overfitting. We adopt a mixed-BN approach that interpolates current test batch statistics with the statistics from source domains and propose a stochastic domain synthesizing method to improve model generalization and robustness to domain shifts. Extensive experiments demonstrate that our method surpasses state-of-the-art techniques across various domain adaptation and generalization benchmarks, significantly enhancing the pre-trained model’s robustness on unseen domains.
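
The mixed-BN idea in the abstract (interpolating current test-batch statistics with source-domain statistics) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the mixing weight `alpha`, the convex combination of variances, and the function names are all assumptions.

```python
def batch_stats(batch):
    """Per-feature mean and (biased) variance of a list of feature vectors."""
    n = len(batch)
    dims = len(batch[0])
    mean = [sum(x[d] for x in batch) / n for d in range(dims)]
    var = [sum((x[d] - mean[d]) ** 2 for x in batch) / n for d in range(dims)]
    return mean, var


def mixed_bn(batch, source_mean, source_var, alpha=0.5, eps=1e-5):
    """Normalize with a convex mix of source and test-batch statistics.

    alpha is a hypothetical mixing weight: 1.0 uses only the test batch,
    0.0 uses only the stored source statistics.
    """
    test_mean, test_var = batch_stats(batch)
    mean = [alpha * t + (1 - alpha) * s for t, s in zip(test_mean, source_mean)]
    var = [alpha * t + (1 - alpha) * s for t, s in zip(test_var, source_var)]
    return [
        [(x[d] - mean[d]) / (var[d] + eps) ** 0.5 for d in range(len(x))]
        for x in batch
    ]


batch = [[1.0, 10.0], [3.0, 14.0]]
out_test_only = mixed_bn(batch, [0.0, 0.0], [1.0, 1.0], alpha=1.0)
out_source_only = mixed_bn(batch, [0.0, 0.0], [1.0, 1.0], alpha=0.0)
```

With a tiny test batch, `alpha` close to 1.0 makes the normalization depend heavily on a few samples, which is one way to read the abstract's concern about limited diversity within minibatches.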

[LG-24] Performant, Memory Efficient and Scalable Multi-Agent Reinforcement Learning

Link: https://arxiv.org/abs/2410.01706
Authors: Omayma Mahjoub, Sasha Abramowitz, Ruan de Kock, Wiem Khlifi, Simon du Toit, Jemma Daniel, Louay Ben Nessir, Louise Beyers, Claude Formanek, Liam Clark, Arnu Pretorius
Keywords (EN): multi-agent reinforcement learning, achieving strong performance, reinforcement learning, progresses towards larger, achieving strong
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Abstract:As the field of multi-agent reinforcement learning (MARL) progresses towards larger and more complex environments, achieving strong performance while maintaining memory efficiency and scalability to many agents becomes increasingly important. Although recent research has led to several advanced algorithms, to date, none fully address all of these key properties simultaneously. In this work, we introduce Sable, a novel and theoretically sound algorithm that adapts the retention mechanism from Retentive Networks to MARL. Sable’s retention-based sequence modelling architecture allows for computationally efficient scaling to a large number of agents, as well as maintaining a long temporal context, making it well-suited for large-scale partially observable environments. Through extensive evaluations across six diverse environments, we demonstrate how Sable is able to significantly outperform existing state-of-the-art methods in the majority of tasks (34 out of 45, roughly 75%). Furthermore, Sable demonstrates stable performance as we scale the number of agents, handling environments with more than a thousand agents while exhibiting a linear increase in memory usage. Finally, we conduct ablation studies to isolate the source of Sable’s performance gains and confirm its efficient computational memory usage. Our results highlight Sable’s performance and efficiency, positioning it as a leading approach to MARL at scale.
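
The retention mechanism from Retentive Networks, which the abstract says Sable adapts, has a recurrent form with a fixed-size state and an equivalent parallel form. The sketch below shows only that basic single-head retention on toy inputs; how Sable applies it across agents and timesteps is not described in the abstract and is not modeled here. The toy tensors and the decay `gamma = 0.9` are made up.

```python
def retention_recurrent(queries, keys, values, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + k_n^T v_n, o_n = q_n S_n."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # fixed-size state, independent of length
    outputs = []
    for q, k, v in zip(queries, keys, values):
        state = [
            [gamma * state[i][j] + k[i] * v[j] for j in range(d_v)]
            for i in range(d_k)
        ]
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


def retention_parallel(queries, keys, values, gamma):
    """Parallel form: o_n = sum over m <= n of gamma^(n-m) * (q_n . k_m) * v_m."""
    d_v = len(values[0])
    outputs = []
    for n, q in enumerate(queries):
        out = [0.0] * d_v
        for m in range(n + 1):
            weight = gamma ** (n - m) * sum(a * b for a, b in zip(q, keys[m]))
            for j in range(d_v):
                out[j] += weight * values[m][j]
        outputs.append(out)
    return outputs


queries = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
keys = [[1.0, 2.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0], [2.0], [3.0]]
rec = retention_recurrent(queries, keys, values, gamma=0.9)
par = retention_parallel(queries, keys, values, gamma=0.9)
```

The recurrent form keeps a state of size `d_k x d_v` regardless of sequence length, which is the property that makes long temporal contexts cheap to maintain at inference time.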

[LG-25] MOREL: Enhancing Adversarial Robustness through Multi-Objective Representation Learning

Link: https://arxiv.org/abs/2410.01697
Authors: Sedjro Salomon Hotegni, Sebastian Peitz
Keywords (EN): deep neural networks, neural networks, drastically different outputs, research has shown, shown that deep
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Extensive research has shown that deep neural networks (DNNs) are vulnerable to slight adversarial perturbations - small changes to the input data that appear insignificant but cause the model to produce drastically different outputs. In addition to augmenting training data with adversarial examples generated from a specific attack method, most of the current defense strategies necessitate modifying the original model architecture components to improve robustness or performing test-time data purification to handle adversarial attacks. In this work, we demonstrate that strong feature representation learning during training can significantly enhance the original model’s robustness. We propose MOREL, a multi-objective feature representation learning approach, encouraging classification models to produce similar features for inputs within the same class, despite perturbations. Our training method involves an embedding space where cosine similarity loss and multi-positive contrastive loss are used to align natural and adversarial features from the model encoder and ensure tight clustering. Concurrently, the classifier is motivated to achieve accurate predictions. Through extensive experiments, we demonstrate that our approach significantly enhances the robustness of DNNs against white-box and black-box adversarial attacks, outperforming other methods that similarly require no architectural changes or test-time data purification. Our code is available at this https URL

[LG-26] Uncertainty Quantification with Bayesian Higher Order ReLU KANs

Link: https://arxiv.org/abs/2410.01687
Authors: James Giroux, Cristiano Fanelli
Keywords (EN): enhance computational efficiency, Higher Order, Kolmogorov-Arnold Networks, demands of Bayesian, enhance computational
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 13 pages, 7 Figures

Abstract:We introduce the first method of uncertainty quantification in the domain of Kolmogorov-Arnold Networks, specifically focusing on (Higher Order) ReLU KANs to enhance computational efficiency given the computational demands of Bayesian methods. The method we propose is general in nature, providing access to both epistemic and aleatoric uncertainties. It is also capable of generalization to various other basis functions. We validate our method through a series of closure tests, including simple one-dimensional functions and application to the domain of (Stochastic) Partial Differential Equations. Referring to the latter, we demonstrate the method’s ability to correctly identify functional dependencies introduced through the inclusion of a stochastic term. The code supporting this work can be found at this https URL

[LG-27] Positional Attention: Out-of-Distribution Generalization and Expressivity for Neural Algorithmic Reasoning

Link: https://arxiv.org/abs/2410.01686
Authors: Artur Back de Luca, George Giapitzakis, Shenghao Yang, Petar Veličković, Kimon Fountoulakis
Keywords (EN): solve algorithmic tasks, summary statistics, growing interest, ability of neural, neural networks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
Comments: 37 pages, 22 figures

Abstract:There has been a growing interest in the ability of neural networks to solve algorithmic tasks, such as arithmetic, summary statistics, and sorting. While state-of-the-art models like Transformers have demonstrated good generalization performance on in-distribution tasks, their out-of-distribution (OOD) performance is poor when trained end-to-end. In this paper, we focus on value generalization, a common instance of OOD generalization where the test distribution has the same input sequence length as the training distribution, but the value ranges in the training and test distributions do not necessarily overlap. To address this issue, we propose that using fixed positional encodings to determine attention weights (referred to as positional attention) enhances empirical OOD performance while maintaining expressivity. We support our claim about expressivity by proving that Transformers with positional attention can effectively simulate parallel algorithms.
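
To make the idea of positional attention concrete, here is a minimal sketch in which the attention weights are computed from fixed positional encodings alone, while the values come from the input. Learned query and key projections are omitted, and the toy encodings are invented, so this is an illustration under those assumptions rather than the paper's architecture.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def positional_attention(values, pos_enc):
    """Attention whose weights depend only on positions, never on input values."""
    weights = [
        softmax([sum(a * b for a, b in zip(p_i, p_j)) for p_j in pos_enc])
        for p_i in pos_enc
    ]
    dims = len(values[0])
    outputs = [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(dims)]
        for row in weights
    ]
    return weights, outputs


# Hypothetical fixed positional encodings for a length-3 sequence.
pos_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
small = [[1.0], [2.0], [3.0]]
large = [[10.0], [20.0], [30.0]]  # same length, different value range

w_small, out_small = positional_attention(small, pos_enc)
w_large, out_large = positional_attention(large, pos_enc)
```

Because the weights ignore the input values, the two calls share identical attention patterns and the outputs scale linearly with the inputs, which gives some intuition for why such a layer might cope with unseen value ranges.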

[LG-28] PHI-S: Distribution Balancing for Label-Free Multi-Teacher Distillation

Link: https://arxiv.org/abs/2410.01680
Authors: Mike Ranzinger, Jon Barker, Greg Heinrich, Pavlo Molchanov, Bryan Catanzaro, Andrew Tao
Keywords (EN): heterogeneous multi-teacher knowledge, multi-teacher knowledge distillation, visual foundation models, strengths and weaknesses, distillation without labels
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Various visual foundation models have distinct strengths and weaknesses, both of which can be improved through heterogeneous multi-teacher knowledge distillation without labels, termed “agglomerative models.” We build upon this body of work by studying the effect of the teachers’ activation statistics, particularly the impact of the loss function on the resulting student model quality. We explore a standard toolkit of statistical normalization techniques to better align the different distributions and assess their effects. Further, we examine the impact on downstream teacher-matching metrics, which motivates the use of Hadamard matrices. With these matrices, we demonstrate useful properties, showing how they can be used for isotropic standardization, where each dimension of a multivariate distribution is standardized using the same scale. We call this technique “PHI Standardization” (PHI-S) and empirically demonstrate that it produces the best student model across the suite of methods studied.
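
One property that may explain why Hadamard matrices are useful for isotropic standardization: rotating a distribution with diagonal covariance by an orthonormal Hadamard matrix spreads the variance evenly, so every dimension ends up with the mean eigenvalue as its variance and a single scale factor standardizes all of them. The sketch below checks this numerically. It assumes the data has already been rotated into its principal-component basis, and it is a reading of the abstract, not the paper's exact procedure.

```python
def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def rotated_variances(eigenvalues):
    """Diagonal of H diag(eigenvalues) H^T, where H = hadamard(n) / sqrt(n)."""
    n = len(eigenvalues)
    h = hadamard(n)
    return [
        sum(h[i][j] ** 2 * eigenvalues[j] for j in range(n)) / n
        for i in range(n)
    ]


# Hypothetical per-component variances after a PCA rotation.
eigenvalues = [4.0, 1.0, 0.5, 0.5]
variances = rotated_variances(eigenvalues)
shared_scale = 1.0 / variances[0] ** 0.5  # one scalar standardizes every dimension
```

Since every entry of the orthonormal Hadamard matrix has squared magnitude `1/n`, each rotated dimension receives an equal share of every eigenvalue, which is why the per-dimension variances coincide.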

[LG-29] VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment

Link: https://arxiv.org/abs/2410.01679
Authors: Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, Nicolas Le Roux
Keywords (EN): increasingly applied, require executing, complex reasoning tasks, Proximal Policy Optimization, enhancing model performance
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Abstract:Large language models (LLMs) are increasingly applied to complex reasoning tasks that require executing several complex steps before receiving any reward. Properly assigning credit to these steps is essential for enhancing model performance. Proximal Policy Optimization (PPO), a state-of-the-art reinforcement learning (RL) algorithm used for LLM finetuning, employs value networks to tackle credit assignment. However, value networks face challenges in predicting the expected cumulative rewards accurately in complex reasoning tasks, often leading to high-variance updates and suboptimal performance. In this work, we systematically evaluate the efficacy of value networks and reveal their significant shortcomings in reasoning-heavy LLM tasks, showing that they barely outperform a random baseline when comparing alternative steps. To address this, we propose VinePPO, a straightforward approach that leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates, bypassing the need for large value networks. Our method consistently outperforms PPO and other RL-free baselines across MATH and GSM8K datasets with fewer gradient updates (up to 9x) and less wall-clock time (up to 3.0x). These results emphasize the importance of accurate credit assignment in RL finetuning of LLMs and demonstrate VinePPO’s potential as a superior alternative.
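
The core idea of replacing a learned value network with Monte Carlo estimates can be sketched on a toy problem: restart from an intermediate state, sample several continuations, and average their returns. The environment below (a chain of steps that each succeed with a fixed probability) and all function names are hypothetical stand-ins for sampling continuations from a language model.

```python
import random


def rollout(remaining_steps, step_success_prob, rng):
    """Finish one episode from an intermediate state; reward 1.0 only if all steps succeed."""
    for _ in range(remaining_steps):
        if rng.random() >= step_success_prob:
            return 0.0
    return 1.0


def mc_value(remaining_steps, step_success_prob, num_rollouts, rng):
    """Unbiased Monte Carlo estimate of the state's value (mean terminal reward)."""
    returns = [
        rollout(remaining_steps, step_success_prob, rng)
        for _ in range(num_rollouts)
    ]
    return sum(returns) / num_rollouts


def step_advantage(remaining_before, remaining_after, step_success_prob, num_rollouts, rng):
    """Advantage of a step with zero immediate reward and no discounting: V(after) - V(before)."""
    value_after = mc_value(remaining_after, step_success_prob, num_rollouts, rng)
    value_before = mc_value(remaining_before, step_success_prob, num_rollouts, rng)
    return value_after - value_before


rng = random.Random(0)
value_estimate = mc_value(3, 0.5, 20000, rng)          # true value is 0.5 ** 3
advantage_estimate = step_advantage(3, 2, 0.5, 20000, rng)  # true advantage is 0.25 - 0.125
```

The estimate needs no trained critic, only the ability to resume from an intermediate state. The cost is extra sampling, which grows with the number of rollouts per state.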

[LG-30] Sparse Covariance Neural Networks

Link: https://arxiv.org/abs/2410.01669
Authors: Andrea Cavallo, Zhan Gao, Elvin Isufi
Keywords (EN): Covariance Neural Networks, Neural Networks, perform graph convolutions, Covariance Neural, Sparse coVariance Neural
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Covariance Neural Networks (VNNs) perform graph convolutions on the covariance matrix of tabular data and achieve success in a variety of applications. However, the empirical covariance matrix on which the VNNs operate may contain many spurious correlations, making VNNs’ performance inconsistent due to these noisy estimates and decreasing their computational efficiency. To tackle this issue, we put forth Sparse coVariance Neural Networks (S-VNNs), a framework that applies sparsification techniques on the sample covariance matrix before convolution. When the true covariance matrix is sparse, we propose hard and soft thresholding to improve covariance estimation and reduce computational cost. Instead, when the true covariance is dense, we propose stochastic sparsification where data correlations are dropped in probability according to principled strategies. We show that S-VNNs are more stable than nominal VNNs as well as sparse principal component analysis. By analyzing the impact of sparsification on their behavior, we provide novel connections between S-VNN stability and data distribution. We support our theoretical findings with experimental results on various application scenarios, ranging from brain data to human action recognition, and show an improved task performance, stability, and computational efficiency of S-VNNs compared with nominal VNNs.
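
Hard and soft thresholding of a sample covariance matrix, which the abstract proposes for the sparse case, can be sketched with the standard textbook operators. Leaving the diagonal (the variances) untouched and the threshold value `tau = 0.2` are assumptions of this sketch; the paper may define these details differently.

```python
import math


def hard_threshold(cov, tau):
    """Zero out off-diagonal entries whose magnitude is below tau."""
    n = len(cov)
    return [
        [cov[i][j] if i == j or abs(cov[i][j]) >= tau else 0.0 for j in range(n)]
        for i in range(n)
    ]


def soft_threshold(cov, tau):
    """Shrink off-diagonal entries toward zero by tau, clipping at zero."""
    n = len(cov)

    def shrink(c):
        return math.copysign(max(abs(c) - tau, 0.0), c)

    return [
        [cov[i][j] if i == j else shrink(cov[i][j]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical sample covariance with one weak (likely spurious) correlation.
sample_cov = [
    [1.0, 0.05, -0.6],
    [0.05, 2.0, 0.3],
    [-0.6, 0.3, 1.5],
]
hard = hard_threshold(sample_cov, tau=0.2)
soft = soft_threshold(sample_cov, tau=0.2)
```

Hard thresholding keeps surviving entries at their original values, while soft thresholding also shrinks them. Both preserve symmetry, so the result can still serve as a graph shift operator for convolution.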

[LG-31] Conformal Generative Modeling with Improved Sample Efficiency through Sequential Greedy Filtering

Link: https://arxiv.org/abs/2410.01660
Authors: Klaus-Rudolf Kladny, Bernhard Schölkopf, Michael Muehlebach
Keywords (EN): Generative models lack, models lack rigorous, lack rigorous statistical, Generative models, Sequential Conformal Prediction
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Generative models lack rigorous statistical guarantees for their outputs and are therefore unreliable in safety-critical applications. In this work, we propose Sequential Conformal Prediction for Generative Models (SCOPE-Gen), a sequential conformal prediction method producing prediction sets that satisfy a rigorous statistical guarantee called conformal admissibility control. This guarantee states that with high probability, the prediction sets contain at least one admissible (or valid) example. To this end, our method first samples an initial set of i.i.d. examples from a black box generative model. Then, this set is iteratively pruned via so-called greedy filters. As a consequence of the iterative generation procedure, admissibility of the final prediction set factorizes as a Markov chain. This factorization is crucial, because it allows each factor to be controlled separately, using conformal prediction. In comparison to prior work, our method demonstrates a large reduction in the number of admissibility evaluations during calibration. This reduction is important in safety-critical applications, where these evaluations must be conducted manually by domain experts and are therefore costly and time-consuming. We highlight the advantages of our method in terms of admissibility evaluations and cardinality of the prediction sets through experiments in natural language generation and molecular graph extension tasks.

[LG-32] Scalable and Consistent Graph Neural Networks for Distributed Mesh-based Data-driven Modeling

Link: https://arxiv.org/abs/2410.01657
Authors: Shivam Barwey, Riccardo Balin, Bethany Lusch, Saumil Patel, Ramesh Balakrishnan, Pinaki Pal, Romit Maulik, Venkatram Vishwanath
Keywords (EN): message passing layer, neural message passing, graph neural network, consistent neural message, neural network
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)

Abstract:This work develops a distributed graph neural network (GNN) methodology for mesh-based modeling applications using a consistent neural message passing layer. As the name implies, the focus is on enabling scalable operations that satisfy physical consistency via halo nodes at sub-graph boundaries. Here, consistency refers to the fact that a GNN trained and evaluated on one rank (one large graph) is arithmetically equivalent to evaluations on multiple ranks (a partitioned graph). This concept is demonstrated by interfacing GNNs with NekRS, a GPU-capable exascale CFD solver developed at Argonne National Laboratory. It is shown how the NekRS mesh partitioning can be linked to the distributed GNN training and inference routines, resulting in a scalable mesh-based data-driven modeling workflow. We study the impact of consistency on the scalability of mesh-based GNNs, demonstrating efficient scaling in consistent GNNs for up to O(1B) graph nodes on the Frontier exascale supercomputer.

[LG-33] Extending Contextual Self-Modulation: Meta-Learning Across Modalities, Task Dimensionalities, and Data Regimes

Link: https://arxiv.org/abs/2410.01655
Authors: Roussel Desmond Nzoyem, David A.W. Barton, Tom Deakin
Keywords (EN): Neural Context Flow, potent regularization mechanism, Context Flow, Neural Context, CSM
Categories: Machine Learning (cs.LG); Dynamical Systems (math.DS)
Comments: 23 pages, 11 figures, 5 tables

Abstract:Contextual Self-Modulation (CSM) is a potent regularization mechanism for the Neural Context Flow (NCF) framework which demonstrates powerful meta-learning of physical systems. However, CSM has limitations in its applicability across different modalities and in high-data regimes. In this work, we introduce two extensions: iCSM, which expands CSM to infinite-dimensional tasks, and StochasticNCF, which improves scalability. These extensions are demonstrated through comprehensive experimentation on a range of tasks, including dynamical systems with parameter variations, computer vision challenges, and curve fitting problems. iCSM embeds the contexts into an infinite-dimensional function space, as opposed to CSM which uses finite-dimensional context vectors. StochasticNCF enables the application of both CSM and iCSM to high-data scenarios by providing an unbiased approximation of meta-gradient updates through a sampled set of nearest environments. Additionally, we incorporate higher-order Taylor expansions via Taylor-Mode automatic differentiation, revealing that higher-order approximations do not necessarily enhance generalization. Finally, we demonstrate how CSM can be integrated into other meta-learning frameworks with FlashCAVIA, a computationally efficient extension of the CAVIA meta-learning framework (Zintgraf et al. 2019). FlashCAVIA outperforms its predecessor across various benchmarks and reinforces the utility of bi-level optimization techniques. Together, these contributions establish a robust framework for tackling an expanded spectrum of meta-learning tasks, offering practical insights for out-of-distribution generalization. Our open-sourced library, designed for flexible integration of self-modulation into contextual meta-learning workflows, is available at this http URL.

[LG-34] shapiq: Shapley Interactions for Machine Learning (NeurIPS 2024)

Link: https://arxiv.org/abs/2410.01649
Authors: Maximilian Muschalik, Hubert Baniecki, Fabian Fumagalli, Patrick Kolpaczki, Barbara Hammer, Eyke Hüllermeier
Keywords (EN): Originally rooted, machine learning, important tool, Originally, machine learning models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2024

Abstract:Originally rooted in game theory, the Shapley Value (SV) has recently become an important tool in machine learning research. Perhaps most notably, it is used for feature attribution and data valuation in explainable artificial intelligence. Shapley Interactions (SIs) naturally extend the SV and address its limitations by assigning joint contributions to groups of entities, which enhance understanding of black box machine learning models. Due to the exponential complexity of computing SVs and SIs, various methods have been proposed that exploit structural assumptions or yield probabilistic estimates given limited resources. In this work, we introduce shapiq, an open-source Python package that unifies state-of-the-art algorithms to efficiently compute SVs and any-order SIs in an application-agnostic framework. Moreover, it includes a benchmarking suite containing 11 machine learning applications of SIs with pre-computed games and ground-truth values to systematically assess computational performance across domains. For practitioners, shapiq is able to explain and visualize any-order feature interactions in predictions of models, including vision transformers, language models, as well as XGBoost and LightGBM with TreeSHAP-IQ. With shapiq, we extend shap beyond feature attributions and consolidate the application of SVs and SIs in machine learning that facilitates future research. The source code and documentation are available at this https URL.
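
To illustrate what Shapley Values and pairwise Shapley Interactions measure, the sketch below computes both exactly by brute-force enumeration for a tiny cooperative game. It deliberately does not use the shapiq API, and the toy game (additive player weights plus a bonus when players 0 and 1 appear together) is invented for the example. The enumeration is exponential in the number of players, which is the cost the approximation methods mentioned in the abstract are meant to avoid.

```python
from itertools import combinations
from math import factorial


def shapley_value(game, n, i):
    """Exact Shapley value of player i in an n-player game."""
    others = [p for p in range(n) if p != i]
    total = 0.0
    for size in range(len(others) + 1):
        weight = factorial(size) * factorial(n - size - 1) / factorial(n)
        for subset in combinations(others, size):
            s = frozenset(subset)
            total += weight * (game(s | {i}) - game(s))
    return total


def shapley_interaction(game, n, i, j):
    """Exact pairwise Shapley interaction index of players i and j."""
    others = [p for p in range(n) if p not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        weight = factorial(size) * factorial(n - size - 2) / factorial(n - 1)
        for subset in combinations(others, size):
            s = frozenset(subset)
            delta = game(s | {i, j}) - game(s | {i}) - game(s | {j}) + game(s)
            total += weight * delta
    return total


def toy_game(coalition):
    """Hypothetical game: additive weights plus a synergy bonus for players 0 and 1."""
    weights = [1.0, 2.0, 3.0]
    bonus = 4.0 if {0, 1} <= coalition else 0.0
    return sum(weights[p] for p in coalition) + bonus


values = [shapley_value(toy_game, 3, i) for i in range(3)]
synergy_01 = shapley_interaction(toy_game, 3, 0, 1)
synergy_02 = shapley_interaction(toy_game, 3, 0, 2)
```

The Shapley values split the synergy bonus evenly between players 0 and 1 without revealing that it comes from their combination. The interaction index attributes the bonus to the pair explicitly, which is the extra information interactions add over plain attributions.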

[LG-35] A Novel Framework of Horizontal-Vertical Hybrid Federated Learning for EdgeIoT

Link: https://arxiv.org/abs/2410.01644
Authors: Kai Li, Yilei Liang, Xin Yuan, Wei Ni, Jon Crowcroft, Chau Yuen, Ozgur B. Akan
Keywords (EN): edge computing-enabled Internet, Internet of Things, horizontal-vertical federated learning, hybrid horizontal-vertical federated, mobile edge computing-enabled
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 5 pages, 3 figures

Abstract:This letter puts forth a new hybrid horizontal-vertical federated learning (HoVeFL) for mobile edge computing-enabled Internet of Things (EdgeIoT). In this framework, certain EdgeIoT devices train local models using the same data samples but analyze disparate data features, while the others focus on the same features using non-independent and identically distributed (non-IID) data samples. Thus, even though the data features are consistent, the data samples vary across devices. The proposed HoVeFL formulates the training of local and global models to minimize the global loss function. Performance evaluations on CIFAR-10 and SVHN datasets reveal that the testing loss of HoVeFL with 12 horizontal FL devices and six vertical FL devices is 5.5% and 25.2% higher, respectively, compared to a setup with six horizontal FL devices and 12 vertical FL devices.

[LG-36] Stable Offline Value Function Learning with Bisimulation-based Representations

Link: https://arxiv.org/abs/2410.01643
Authors: Brahma S. Pavse, Yudong Chen, Qiaomin Xie, Josiah P. Hanna
Keywords (EN): expected discounted return, function learning, dataset to estimate, estimate the expected, expected discounted
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under review

Abstract:In reinforcement learning, offline value function learning is the procedure of using an offline dataset to estimate the expected discounted return from each state when taking actions according to a fixed target policy. The stability of this procedure, i.e., whether it converges to its fixed point, critically depends on the representations of the state-action pairs. Poorly learned representations can make value function learning unstable, or even divergent. Therefore, it is critical to stabilize value function learning by explicitly shaping the state-action representations. Recently, the class of bisimulation-based algorithms has shown promise in shaping representations for control. However, it is still unclear if this class of methods can stabilize value function learning. In this work, we investigate this question and answer it affirmatively. We introduce a bisimulation-based algorithm called kernel representations for offline policy evaluation (KROPE). KROPE uses a kernel to shape state-action representations such that state-action pairs that have similar immediate rewards and lead to similar next state-action pairs under the target policy also have similar representations. We show that KROPE: 1) learns stable representations and 2) leads to lower value error than baselines. Our analysis provides new theoretical insight into the stability properties of bisimulation-based methods and suggests that practitioners can use these methods for stable and accurate evaluation of offline reinforcement learning agents.

[LG-37] Moral Alignment for LLM Agents

Link: https://arxiv.org/abs/2410.01639
Authors: Elizaveta Tennant, Stephen Hailes, Mirco Musolesi
Keywords (EN): pre-trained Large Language, Large Language Models, Large Language, Decision-making agents based, pre-trained Large
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:Decision-making agents based on pre-trained Large Language Models (LLMs) are increasingly being deployed across various domains of human activity. While their applications are currently rather specialized, several research efforts are under way to develop more generalist agents. As LLM-based systems become more agentic, their influence on human activity will grow and the transparency of this will decrease. Consequently, developing effective methods for aligning them to human values is vital. The prevailing practice in alignment often relies on human preference data (e.g., in RLHF or DPO), in which values are implicit and are essentially deduced from relative preferences over different model outputs. In this work, instead of relying on human feedback, we introduce the design of reward functions that explicitly encode core human values for Reinforcement Learning-based fine-tuning of foundation agent models. Specifically, we use intrinsic rewards for the moral alignment of LLM agents. We evaluate our approach using the traditional philosophical frameworks of Deontological Ethics and Utilitarianism, quantifying moral rewards for agents in terms of actions and consequences on the Iterated Prisoner’s Dilemma (IPD) environment. We also show how moral fine-tuning can be deployed to enable an agent to unlearn a previously developed selfish strategy. Finally, we find that certain moral strategies learned on the IPD game generalize to several other matrix game environments. In summary, we demonstrate that fine-tuning with intrinsic rewards is a promising general solution for aligning LLM agents to human values, and it might represent a more transparent and cost-effective alternative to currently predominant alignment techniques. 

[LG-38] Does Graph Prompt Work? A Data Operation Perspective with Theoretical Analysis

Link: https://arxiv.org/abs/2410.01635
Authors: Qunzhong Wang, Xiangguo Sun, Hong Cheng
Keywords (EN): promising research direction, graph, graph prompting, recent years, research direction
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)

Abstract:In recent years, graph prompting has emerged as a promising research direction, enabling the learning of additional tokens or subgraphs appended to the original graphs without requiring retraining of pre-trained graph models across various applications. This novel paradigm, shifting from the traditional pretraining and finetuning to pretraining and prompting, has shown significant empirical success in simulating graph data operations, with applications ranging from recommendation systems to biological networks and graph transferring. However, despite its potential, the theoretical underpinnings of graph prompting remain underexplored, raising critical questions about its fundamental effectiveness. The lack of rigorous theoretical proof of why and how much it works hangs like a dark cloud over the graph prompt area and hinders further progress. To fill this gap, this paper introduces a theoretical framework that rigorously analyzes graph prompting from a data operation perspective. Our contributions are threefold: First, we provide a formal guarantee theorem, demonstrating graph prompts’ capacity to approximate graph transformation operators, effectively linking upstream and downstream tasks. Second, we derive upper bounds on the error of these data operations by graph prompts for a single graph and extend this discussion to batches of graphs, which are common in graph model training. Third, we analyze the distribution of data operation errors, extending our theoretical findings from linear graph models (e.g., GCN) to non-linear graph models (e.g., GAT). Extensive experiments support our theoretical results and confirm the practical implications of these guarantees.

[LG-39] Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?

Link: https://arxiv.org/abs/2410.01623
Authors: Xi Chen, Kaituo Feng, Changsheng Li, Xunhao Lai, Xiangyu Yue, Ye Yuan, Guoren Wang
Keywords: Large Language Models, training Large Language, Language Models, Large Language, reducing memory usage
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Code is available at: this https URL

Abstract:Low-rank training has emerged as a promising approach for reducing memory usage in training Large Language Models (LLMs). Previous methods either rely on decomposing weight matrices (e.g., LoRA), or seek to decompose gradient matrices (e.g., GaLore) to ensure reduced memory consumption. However, both of them constrain the training in a low-rank subspace, thus inevitably leading to sub-optimal performance. This raises a question: whether it is possible to consistently preserve the low-rank constraint for memory efficiency, while achieving full-rank training (i.e., training with full-rank gradients of full-rank weights) to avoid inferior outcomes? In this paper, we propose a new plug-and-play training framework for LLMs called Fira, as the first attempt to achieve this goal. First, we observe an interesting phenomenon during LLM training: the scaling impact of adaptive optimizers (e.g., Adam) on the gradient norm remains similar from low-rank to full-rank training. Based on this observation, we propose a norm-based scaling method, which utilizes the scaling impact of low-rank optimizers as substitutes for that of original full-rank optimizers to enable full-rank training. In this way, we can preserve the low-rank constraint in the optimizer while achieving full-rank training for better performance. Moreover, we find that there are sudden gradient rises during the optimization process, potentially causing loss spikes. To address this, we further put forward a norm-growth limiter to smooth the gradient via regulating the relative increase of gradient norms. Extensive experiments on the pre-training and fine-tuning of LLMs show that Fira outperforms both LoRA and GaLore, achieving performance that is comparable to or even better than full-rank training.
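The norm-growth limiter mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact rule, so the version below is an assumption: if the gradient norm grows by more than a factor `gamma` relative to the previous step, the gradient is rescaled so that its norm equals `gamma` times the previous norm. The function name `limit_norm_growth`, the parameter `gamma`, and its default value are hypothetical and are not taken from the paper.

```python
import math


def limit_norm_growth(grad, prev_norm, gamma=1.01):
    """Cap the relative growth of the gradient norm between two steps.

    grad      -- current gradient as a flat list of floats
    prev_norm -- gradient norm from the previous step, or None on the first step
    gamma     -- assumed maximum allowed ratio norm_t / norm_{t-1} (hypothetical)

    Returns the (possibly rescaled) gradient and the norm to carry forward.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if prev_norm is not None and prev_norm > 0.0 and norm > gamma * prev_norm:
        # Sudden rise detected: shrink the gradient so its norm is gamma * prev_norm.
        scale = gamma * prev_norm / norm
        return [g * scale for g in grad], gamma * prev_norm
    # Growth is within the allowed ratio: leave the gradient unchanged.
    return list(grad), norm
```

In a training loop, the returned norm would be passed back in as `prev_norm` on the next step, so a single spike cannot raise the allowed ceiling by more than the factor `gamma`.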

[LG-40] On Using Certified Training towards Empirical Robustness

Link: https://arxiv.org/abs/2410.01617
Authors: Alessandro De Palma, Serge Durand, Zakaria Chihani, François Terrier, Caterina Urban
Keywords: provide empirical robustness, certified training, specific adversarial, training, catastrophic overfitting
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)

Abstract:Adversarial training is arguably the most popular way to provide empirical robustness against specific adversarial examples. While variants based on multi-step attacks incur significant computational overhead, single-step variants are vulnerable to a failure mode known as catastrophic overfitting, which hinders their practical utility for large perturbations. A parallel line of work, certified training, has focused on producing networks amenable to formal guarantees of robustness against any possible attack. However, the wide gap between the best-performing empirical and certified defenses has severely limited the applicability of the latter. Inspired by recent developments in certified training, which rely on a combination of adversarial attacks with network over-approximations, and by the connections between local linearity and catastrophic overfitting, we present experimental evidence on the practical utility and limitations of using certified training towards empirical robustness. We show that, when tuned for the purpose, a recent certified training algorithm can prevent catastrophic overfitting on single-step attacks, and that it can bridge the gap to multi-step baselines under appropriate experimental settings. Finally, we present a novel regularizer for network over-approximations that can achieve similar effects while markedly reducing runtime.

[LG-41] DRUPI: Dataset Reduction Using Privileged Information

Link: https://arxiv.org/abs/2410.01611
Authors: Shaobo Wang, Yantai Yang, Shuaiyu Zhang, Chenghao Sun, Weiya Li, Xuming Hu, Linfeng Zhang
Keywords: seeks to select, select or distill, distill samples, samples from large, smaller subsets
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Dataset reduction (DR) seeks to select or distill samples from large datasets into smaller subsets while preserving performance on target tasks. Existing methods primarily focus on pruning or synthesizing data in the same format as the original dataset, typically the input data and corresponding labels. However, in DR settings, we find it is possible to synthesize more information beyond the data-label pair as an additional learning target to facilitate model training. In this paper, we introduce Dataset Reduction Using Privileged Information (DRUPI), which enriches DR by synthesizing privileged information alongside the reduced dataset. This privileged information can take the form of feature labels or attention labels, providing auxiliary supervision to improve model learning. Our findings reveal that effective feature labels must balance between being overly discriminative and excessively diverse, with a moderate level proving optimal for improving the reduced dataset’s efficacy. Extensive experiments on ImageNet, CIFAR-10/100, and Tiny ImageNet demonstrate that DRUPI integrates seamlessly with existing dataset reduction methods, offering significant performance gains.

[LG-42] Automated Red Teaming with GOAT: the Generative Offensive Agent Tester

Link: https://arxiv.org/abs/2410.01606
Authors: Maya Pavlova, Erik Brinkman, Krithika Iyer, Vitor Albiero, Joanna Bitton, Hailey Nguyen, Joe Li, Cristian Canton Ferrer, Ivan Evtimov, Aaron Grattafiori
Keywords: violates norms, safety training, assesses how large, produce content, content that violates
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Red teaming assesses how large language models (LLMs) can produce content that violates norms, policies, and rules set during their safety training. However, most existing automated methods in the literature are not representative of the way humans tend to interact with AI models. Common users of AI models may not have advanced knowledge of adversarial machine learning methods or access to model internals, and they do not spend a lot of time crafting a single highly effective adversarial prompt. Instead, they are likely to make use of techniques commonly shared online and exploit the multiturn conversational nature of LLMs. While manual testing addresses this gap, it is an inefficient and often expensive process. To address these limitations, we introduce the Generative Offensive Agent Tester (GOAT), an automated agentic red teaming system that simulates plain language adversarial conversations while leveraging multiple adversarial prompting techniques to identify vulnerabilities in LLMs. We instantiate GOAT with 7 red teaming attacks by prompting a general-purpose model in a way that encourages reasoning through the choices of methods available, the current target model’s response, and the next steps. Our approach is designed to be extensible and efficient, allowing human testers to focus on exploring new areas of risk while automation covers the scaled adversarial stress-testing of known risk territory. We present the design and evaluation of GOAT, demonstrating its effectiveness in identifying vulnerabilities in state-of-the-art LLMs, with an ASR@10 of 97% against Llama 3.1 and 88% against GPT-4 on the JailbreakBench dataset.

[LG-43] ENTP: Encoder-only Next Token Prediction

Link: https://arxiv.org/abs/2410.01600
Authors: Ethan Ewer, Daewon Chae, Thomas Zeng, Jinkyu Kim, Kangwook Lee
Keywords: Next-token prediction models, causal attention, masking future tokens, Next-token prediction, essential to prevent
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Abstract:Next-token prediction models have predominantly relied on decoder-only Transformers with causal attention, driven by the common belief that causal attention is essential to prevent “cheating” by masking future tokens. We challenge this widely accepted notion and argue that this design choice is about efficiency rather than necessity. While decoder-only Transformers are still a good choice for practical reasons, they are not the only viable option. In this work, we introduce Encoder-only Next Token Prediction (ENTP). We explore the differences between ENTP and decoder-only Transformers in expressive power and complexity, highlighting potential advantages of ENTP. We introduce the Triplet-Counting task and show, both theoretically and experimentally, that while ENTP can perform this task easily, a decoder-only Transformer cannot. Finally, we empirically demonstrate ENTP’s superior performance across various realistic tasks, such as length generalization and in-context learning.

[LG-44] Towards Model Discovery Using Domain Decomposition and PINNs

Link: https://arxiv.org/abs/2410.01599
Authors: Tirtho S. Saha, Alexander Heinlein, Cordula Reisch
Keywords: ordinary differential equations, Physics-Informed Neural Networks, complex systems represented, enhance machine learning, machine learning algorithms
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)

Abstract:We enhance machine learning algorithms for learning model parameters in complex systems represented by ordinary differential equations (ODEs) with domain decomposition methods. The study evaluates the performance of two approaches, namely (vanilla) Physics-Informed Neural Networks (PINNs) and Finite Basis Physics-Informed Neural Networks (FBPINNs), in learning the dynamics of test models with a quasi-stationary longtime behavior. We test the approaches for data sets in different dynamical regions and with varying noise levels. As a result, we find better performance for the FBPINN approach compared to the vanilla PINN approach, even in cases with data from only a quasi-stationary time domain with few dynamics.

[LG-45] SAFE: Semantic Adaptive Feature Extraction with Rate Control for 6G Wireless Communications

Link: https://arxiv.org/abs/2410.01597
Authors: Yuna Yan, Lixin Li, Xin Zhang, Wensheng Lin, Wenchi Cheng, Zhu Han
Keywords: current Deep Learning-based, Learning-based Semantic Communication, Deep Learning-based Semantic, Deep Learning-based, Adaptive Feature Extraction
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:Most current Deep Learning-based Semantic Communication (DeepSC) systems are designed and trained exclusively for particular single-channel conditions, which restricts their adaptability and overall bandwidth utilization. To address this, we propose an innovative Semantic Adaptive Feature Extraction (SAFE) framework, which significantly improves bandwidth efficiency by allowing users to select different sub-semantic combinations based on their channel conditions. This paper also introduces three advanced learning algorithms to optimize the performance of the SAFE framework as a whole. Through a series of simulation experiments, we demonstrate that the SAFE framework can effectively and adaptively extract and transmit semantics under different channel bandwidth conditions; its effectiveness is verified through objective and subjective quality evaluations.

[LG-46] DynFrs: An Efficient Framework for Machine Unlearning in Random Forest

Link: https://arxiv.org/abs/2410.01588
Authors: Shurong Wang, Zhuoyang Shen, Xinbao Qiao, Tongning Zhang, Meng Zhang
Keywords: Random Forests, regression tasks, medical diagnosis, personalized recommendations, widely recognized
Categories: Machine Learning (cs.LG)

Abstract:Random Forests are widely recognized for their established efficacy in classification and regression tasks, standing out in various domains such as medical diagnosis, finance, and personalized recommendations. These domains, however, are inherently sensitive to privacy concerns, as personal and confidential data are involved. With increasing demand for the right to be forgotten, particularly under regulations such as GDPR and CCPA, the ability to perform machine unlearning has become crucial for Random Forests. However, insufficient attention has been paid to this topic, and existing approaches face difficulties in being applied to real-world scenarios. Addressing this gap, we propose the DynFrs framework designed to enable efficient machine unlearning in Random Forests while preserving predictive accuracy. DynFrs leverages a subsampling method Occ(q) and a lazy tag strategy Lzy, and is still adaptable to any Random Forest variant. In essence, Occ(q) ensures that each sample in the training set occurs only in a proportion of trees so that the impact of deleting samples is limited, and Lzy delays the reconstruction of a tree node until necessary, thereby avoiding unnecessary modifications on tree structures. In experiments, applying DynFrs on Extremely Randomized Trees yields substantial improvements, achieving orders of magnitude faster unlearning performance and better predictive accuracy than existing machine unlearning methods for Random Forests.
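The idea behind Occ(q), that each training sample appears in only a fraction of the trees so a deletion touches few of them, can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: here each sample is assigned to `ceil(q * n_trees)` trees chosen uniformly at random, and the function names `assign_samples_to_trees` and `trees_affected_by_deletion` are hypothetical.

```python
import math
import random


def assign_samples_to_trees(n_samples, n_trees, q, seed=0):
    """Give every sample to only a fraction q of the trees (assumed scheme)."""
    rng = random.Random(seed)
    # Number of trees each sample occurs in; at least one tree per sample.
    k = max(1, math.ceil(q * n_trees))
    tree_samples = [set() for _ in range(n_trees)]
    for sample_id in range(n_samples):
        # Choose k distinct trees for this sample.
        for tree_id in rng.sample(range(n_trees), k):
            tree_samples[tree_id].add(sample_id)
    return tree_samples


def trees_affected_by_deletion(tree_samples, sample_id):
    """Return the indices of the trees that would need updating on deletion."""
    return [t for t, samples in enumerate(tree_samples) if sample_id in samples]
```

With `q = 0.25` and 8 trees, removing one sample touches 2 trees, while the remaining 6 stay untouched. The lazy tag strategy Lzy described in the abstract, which defers rebuilding a node until it is needed, is not covered by this sketch.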

[LG-47] Learning-Augmented Robust Algorithmic Recourse

Link: https://arxiv.org/abs/2410.01580
Authors: Kshitij Kayastha, Vasilis Gkatzelis, Shahin Jabbari
Keywords: major negative impact, receive undesirable outcomes, machine learning models, negative impact, high-stakes domains
Categories: Machine Learning (cs.LG)

Abstract:The widespread use of machine learning models in high-stakes domains can have a major negative impact, especially on individuals who receive undesirable outcomes. Algorithmic recourse provides such individuals with suggestions of minimum-cost improvements they can make to achieve a desirable outcome in the future. However, machine learning models often get updated over time and this can cause a recourse to become invalid (i.e., not lead to the desirable outcome). The robust recourse literature aims to choose recourses that are less sensitive, even against adversarial model changes, but this comes at a higher cost. To overcome this obstacle, we initiate the study of algorithmic recourse through the learning-augmented framework and evaluate the extent to which a designer equipped with a prediction regarding future model changes can reduce the cost of recourse when the prediction is accurate (consistency) while also limiting the cost even when the prediction is inaccurate (robustness). We propose a novel algorithm for this problem, study the robustness-consistency trade-off, and analyze how prediction accuracy affects performance.

[LG-48] Coordinate-Based Neural Representation Enabling Zero-Shot Learning for 3D Multiparametric Quantitative MRI

Link: https://arxiv.org/abs/2410.01577
Authors: Guoyan Lao, Ruimin Feng, Haikun Qi, Zhenfeng Lv, Qiangqiang Liu, Chunlei Liu, Yuyao Zhang, Hongjiang Wei
Keywords: offers tissue-specific physical, tissue-specific physical parameters, Quantitative magnetic resonance, offers tissue-specific, magnetic resonance imaging
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Quantitative magnetic resonance imaging (qMRI) offers tissue-specific physical parameters with significant potential for neuroscience research and clinical practice. However, lengthy scan times for 3D multiparametric qMRI acquisition limit its clinical utility. Here, we propose SUMMIT, an innovative imaging methodology that includes data acquisition and an unsupervised reconstruction for simultaneous multiparametric qMRI. SUMMIT first encodes multiple important quantitative properties into highly undersampled k-space. It further leverages implicit neural representation incorporated with a dedicated physics model to reconstruct the desired multiparametric maps without needing external training datasets. SUMMIT delivers co-registered T1, T2, T2*, and quantitative susceptibility mapping. Extensive simulations and phantom imaging demonstrate SUMMIT’s high accuracy. Additionally, the proposed unsupervised approach for qMRI reconstruction also introduces a novel zero-shot learning paradigm for multiparametric imaging applicable to various medical imaging modalities.

[LG-49] Fake It Until You Break It: On the Adversarial Robustness of AI-generated Image Detectors

Link: https://arxiv.org/abs/2410.01574
Authors: Sina Mavali, Jonas Ricker, David Pape, Yash Sharma, Asja Fischer, Lea Schoenherr
Keywords: offers countless possibilities, artificially generated media, misinformation campaigns, offers countless, productive tasks
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:While generative AI (GenAI) offers countless possibilities for creative and productive tasks, artificially generated media can be misused for fraud, manipulation, scams, misinformation campaigns, and more. To mitigate the risks associated with maliciously generated media, forensic classifiers are employed to identify AI-generated content. However, current forensic classifiers are often not evaluated in practically relevant scenarios, such as the presence of an attacker or when real-world artifacts like social media degradations affect images. In this paper, we evaluate state-of-the-art AI-generated image (AIGI) detectors under different attack scenarios. We demonstrate that forensic classifiers can be effectively attacked in realistic settings, even when the attacker does not have access to the target model and post-processing occurs after the adversarial examples are created, which is standard on social media platforms. These attacks can significantly reduce detection accuracy to the extent that the risks of relying on detectors outweigh their benefits. Finally, we propose a simple defense mechanism to make CLIP-based detectors, which are currently the best-performing detectors, robust against these attacks.

[LG-50] Truncated Kernel Stochastic Gradient Descent on Spheres

Link: https://arxiv.org/abs/2410.01570
Authors: JinHui Bai, Lei Shi
Keywords: T-kernel SGD, stochastic gradient descent, least-square loss function, T-kernel SGD employs, kernel SGD
Categories: Machine Learning (cs.LG)
Comments: 57 pages, 7 figures

Abstract:Inspired by the structure of spherical harmonics, we propose the truncated kernel stochastic gradient descent (T-kernel SGD) algorithm with a least-square loss function for spherical data fitting. T-kernel SGD employs a “truncation” operation, enabling the application of a series-based kernel function in stochastic gradient descent, thereby avoiding the difficulties of finding suitable closed-form kernel functions in high-dimensional spaces. In contrast to traditional kernel SGD, T-kernel SGD is more effective in balancing bias and variance by dynamically adjusting the hypothesis space during iterations. The most significant advantage of the proposed algorithm is that it can achieve theoretically optimal convergence rates using a constant step size (independent of the sample size) while overcoming the inherent saturation problem of kernel SGD. Additionally, we leverage the structure of spherical polynomials to derive an equivalent T-kernel SGD, significantly reducing storage and computational costs compared to kernel SGD. Typically, T-kernel SGD requires only $\mathcal{O}(n^{1+\frac{d}{d-1}\epsilon})$ computational complexity and $\mathcal{O}(n^{\frac{d}{d-1}\epsilon})$ storage to achieve optimal rates for the $d$-dimensional sphere, where $0<\epsilon<\frac{1}{2}$ can be arbitrarily small if the optimal fitting or the underlying space possesses sufficient regularity. This regularity is determined by the smoothness parameter of the objective function and the decaying rate of the eigenvalues of the integral operator associated with the kernel function, both of which reflect the difficulty of the estimation problem. Our main results quantitatively characterize how this prior information influences the convergence of T-kernel SGD. The numerical experiments further validate the theoretical findings presented in this paper.

[LG-51] Bayes Power for Explaining In-Context Learning Generalizations

Link: https://arxiv.org/abs/2410.01565
Authors: Samuel Müller, Noah Hollmann, Frank Hutter
Keywords: maximum likelihood estimation, likelihood estimation, primarily viewed, maximum likelihood, Traditionally
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Traditionally, neural network training has been primarily viewed as an approximation of maximum likelihood estimation (MLE). This interpretation originated in a time when training for multiple epochs on small datasets was common and performance was data bound; but it falls short in the era of large-scale single-epoch trainings ushered in by large self-supervised setups, like language models. In this new setup, performance is compute-bound, but data is readily available. As models became more powerful, in-context learning (ICL), i.e., learning in a single forward-pass based on the context, emerged as one of the dominant paradigms. In this paper, we argue that a more useful interpretation of neural network behavior in this era is as an approximation of the true posterior, as defined by the data-generating process. We demonstrate this interpretation’s power for ICL and its usefulness to predict generalizations to previously unseen tasks. We show how models become robust in-context learners by effectively composing knowledge from their training data. We illustrate this with experiments that reveal surprising generalizations, all explicable through the exact posterior. Finally, we show the inherent constraints of the generalization capabilities of posteriors and the limitations of neural networks in approximating these posteriors.

[LG-52] OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data

Link: https://arxiv.org/abs/2410.01560
Authors: Shubham Toshniwal, Wei Du, Ivan Moshkov, Branislav Kisacanin, Alexan Ayrapetyan, Igor Gitman
Keywords: Mathematical reasoning continues, large language model, Mathematical reasoning, development with significant, significant interest
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Mathematical reasoning continues to be a critical challenge in large language model (LLM) development with significant interest. However, most of the cutting-edge progress in mathematical reasoning with LLMs has become closed-source due to lack of access to training data. This lack of data access limits researchers from understanding the impact of different choices for synthesizing and utilizing the data. With the goal of creating a high-quality finetuning (SFT) dataset for math reasoning, we conduct careful ablation experiments on data synthesis using the recently released Llama3.1 family of models. Our experiments show that: (a) solution format matters, with excessively verbose solutions proving detrimental to SFT performance, (b) data generated by a strong teacher outperforms on-policy data generated by a weak student model, (c) SFT is robust to low-quality solutions, allowing for imprecise data filtering, and (d) question diversity is crucial for achieving data scaling gains. Based on these insights, we create the OpenMathInstruct-2 dataset, which consists of 14M question-solution pairs ($\approx$ 600K unique questions), making it nearly eight times larger than the previous largest open-source math reasoning dataset. Finetuning the Llama-3.1-8B-Base using OpenMathInstruct-2 outperforms Llama3.1-8B-Instruct on MATH by an absolute 15.9% (51.9% $\rightarrow$ 67.8%). Finally, to accelerate the open-source efforts, we release the code, the finetuned models, and the OpenMathInstruct-2 dataset under a commercially permissive license.

[LG-53] Integrative Decoding: Improve Factuality via Implicit Self-consistency

Link: https://arxiv.org/abs/2410.01556
Authors: Yi Cheng, Xiao Liang, Yeyun Gong, Wen Xiao, Song Wang, Yuji Zhang, Wenjun Hou, Kaishuai Xu, Wenge Liu, Wenjie Li, Jian Jiao, Qi Chen, Peng Cheng, Wayne Xiong
Keywords: involve repeatedly sampling, repeatedly sampling multiple, sampling multiple outputs, large language models, involve repeatedly
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Self-consistency-based approaches, which involve repeatedly sampling multiple outputs and selecting the most consistent one as the final response, prove to be remarkably effective in improving the factual accuracy of large language models. Nonetheless, existing methods usually have strict constraints on the task format, largely limiting their applicability. In this paper, we present Integrative Decoding (ID), to unlock the potential of self-consistency in open-ended generation tasks. ID operates by constructing a set of inputs, each prepended with a previously sampled response, and then processes them concurrently, with the next token being selected by aggregating all their corresponding predictions at each decoding step. In essence, this simple approach implicitly incorporates self-consistency in the decoding objective. Extensive evaluation shows that ID consistently enhances factuality over a wide range of language models, with substantial improvements on the TruthfulQA (+11.2%), Biographies (+15.4%) and LongFact (+8.5%) benchmarks. The performance gains amplify progressively as the number of sampled responses increases, indicating the potential of ID to scale up with repeated sampling.
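A single decoding step of the procedure described in the abstract can be sketched roughly as below. The abstract does not specify the prompt template or the aggregation rule, so both are assumptions here: inputs are built by placing a sampled response before the prompt, and predictions are aggregated by summing log-probabilities over the shared vocabulary. The names `build_inputs` and `integrate_step` are hypothetical, and the model call that produces the per-input distributions is left out.

```python
import math


def build_inputs(prompt, sampled_responses):
    """Create one input per sampled response, with the response prepended."""
    # The exact template is an assumption; the abstract only says "prepended".
    return [f"{response}\n{prompt}" for response in sampled_responses]


def integrate_step(distributions):
    """Pick the next token by aggregating next-token predictions.

    distributions -- one dict per input, mapping token -> probability (> 0).
    Aggregation by summed log-probability is an assumed choice.
    """
    # Only tokens scored by every input can be compared.
    vocab = set(distributions[0])
    for dist in distributions[1:]:
        vocab &= set(dist)
    scores = {
        token: sum(math.log(dist[token]) for dist in distributions)
        for token in vocab
    }
    return max(scores, key=scores.get)
```

For example, with three inputs whose predictions are `{"Paris": 0.6, "Lyon": 0.4}`, `{"Paris": 0.5, "Lyon": 0.5}` and `{"Paris": 0.3, "Lyon": 0.7}`, the summed log-probability favors `"Lyon"` (product 0.14 versus 0.09), even though the first input alone would have chosen `"Paris"`.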

[LG-54] Lines of Thought in Large Language Models

Link: https://arxiv.org/abs/2410.01545
Authors: Raphaël Sarfati, Toni J. B. Liu, Nicolas Boullé, Christopher J. Earls
Keywords: successive transformer layers, achieve next-token prediction, accompanying embedding space, Language Models achieve, Models achieve next-token
Categories: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)

Abstract:Large Language Models achieve next-token prediction by transporting a vectorized piece of text (prompt) across an accompanying embedding space under the action of successive transformer layers. The resulting high-dimensional trajectories realize different contextualization, or ‘thinking’, steps, and fully determine the output probability distribution. We aim to characterize the statistical properties of ensembles of these ‘lines of thought.’ We observe that independent trajectories cluster along a low-dimensional, non-Euclidean manifold, and that their path can be well approximated by a stochastic equation with few parameters extracted from data. We find it remarkable that the vast complexity of such large models can be reduced to a much simpler form, and we reflect on implications.

[LG-55] Edge-preserving noise for diffusion models

Link: https://arxiv.org/abs/2410.01540
Authors: Jente Vandersanden, Sascha Holl, Xingchang Huang, Gurprit Singh
Keywords: spatial regions uniformly, neglecting potentially valuable, Classical generative diffusion, potentially valuable structural, isotropic Gaussian denoising
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)

Abstract:Classical generative diffusion models learn an isotropic Gaussian denoising process, treating all spatial regions uniformly, thus neglecting potentially valuable structural information in the data. Inspired by the long-established work on anisotropic diffusion in image processing, we present a novel edge-preserving diffusion model that is a generalization of denoising diffusion probabilistic models (DDPM). In particular, we introduce an edge-aware noise scheduler that varies between edge-preserving and isotropic Gaussian noise. We show that our model’s generative process converges faster to results that more closely match the target distribution. We demonstrate its capability to better learn the low-to-mid frequencies within the dataset, which plays a crucial role in representing shapes and structural information. Our edge-preserving diffusion process consistently outperforms state-of-the-art baselines in unconditional image generation. It is also more robust for generative tasks guided by a shape-based prior, such as stroke-to-image generation. We present qualitative and quantitative results showing consistent improvements (FID score) of up to 30% for both tasks.

[LG-56] TiVaT: Joint-Axis Attention for Time Series Forecasting with Lead-Lag Dynamics

Link: https://arxiv.org/abs/2410.01531
Authors: Junwoo Ha, Hyukjae Kwon, Sungsoo Kim, Kisu Lee, Ha Young Kim
Keywords: real-world applications, Multivariate time series, plays a crucial, crucial role, temporal and inter-variable
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 5 figures

Abstract:Multivariate time series (MTS) forecasting plays a crucial role in various real-world applications, yet simultaneously capturing both temporal and inter-variable dependencies remains a challenge. Conventional Channel-Dependent (CD) models handle these dependencies separately, limiting their ability to model complex interactions such as lead-lag dynamics. To address these limitations, we propose TiVaT (Time-Variable Transformer), a novel architecture that integrates temporal and variate dependencies through its Joint-Axis (JA) attention mechanism. TiVaT’s ability to capture intricate variate-temporal dependencies, including asynchronous interactions, is further enhanced by the incorporation of Distance-aware Time-Variable (DTV) Sampling, which reduces noise and improves accuracy through a learned 2D map that focuses on key interactions. TiVaT effectively models both temporal and variate dependencies, consistently delivering strong performance across diverse datasets. Notably, it excels in capturing complex patterns within multivariate time series, enabling it to surpass or remain competitive with state-of-the-art methods. This positions TiVaT as a new benchmark in MTS forecasting, particularly in handling datasets characterized by intricate and challenging dependencies.

[LG-57] Bounds on L_p Errors in Density Ratio Estimation via f-Divergence Loss Functions

Link: https://arxiv.org/abs/2410.01516
Authors: Yoshiaki Kitazawa
Keywords: fundamental machine learning, machine learning technique, divergence loss functions, Density ratio estimation, divergence loss
Categories: Machine Learning (cs.LG)

Abstract:Density ratio estimation (DRE) is a fundamental machine learning technique for identifying relationships between two probability distributions. $f$-divergence loss functions, derived from variational representations of $f$-divergence, are commonly employed in DRE to achieve state-of-the-art results. This study presents a novel perspective on DRE using $f$-divergence loss functions by deriving the upper and lower bounds on $L_p$ errors. These bounds apply to any estimator within a class of Lipschitz continuous estimators, irrespective of the specific $f$-divergence loss functions utilized. The bounds are formulated as a product of terms that include the data dimension and the expected value of the density ratio raised to the power of $p$. Notably, the lower bound incorporates an exponential term dependent on the Kullback–Leibler divergence, indicating that the $L_p$ error significantly increases with the Kullback–Leibler divergence for $p > 1$, and this increase becomes more pronounced as $p$ increases. Furthermore, these theoretical findings are substantiated through numerical experiments.
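For readers unfamiliar with the setting, the standard variational representation that such loss functions build on can be written out as below. This is textbook background rather than a formula quoted from the paper; the symbols $T$, $r$, and $\hat{r}$ are notation introduced here, and taking the $L_p$ norm with respect to $Q$ is an assumption, since the abstract does not say which measure is used.

```latex
% Variational representation of an f-divergence
% (f convex with f(1) = 0, f^* its convex conjugate):
D_f(P \,\|\, Q)
  = \sup_{T} \Big\{ \mathbb{E}_{x \sim P}\big[T(x)\big]
                  - \mathbb{E}_{x \sim Q}\big[f^{*}(T(x))\big] \Big\},
\qquad
f^{*}(t) = \sup_{u > 0} \big\{ ut - f(u) \big\}.

% The supremum is attained at T^* = f'(r), with density ratio r = dP/dQ,
% so minimizing the negated objective yields an estimate of r:
\mathcal{L}_f(T)
  = \mathbb{E}_{x \sim Q}\big[f^{*}(T(x))\big]
  - \mathbb{E}_{x \sim P}\big[T(x)\big].

% Example (Kullback-Leibler): f(u) = u \log u gives
% f^{*}(t) = e^{t-1} and T^{*}(x) = 1 + \log r(x).

% L_p error of an estimate \hat{r} (measure Q assumed here):
\big\| \hat{r} - r \big\|_{L_p(Q)}
  = \Big( \mathbb{E}_{x \sim Q}\big[\, |\hat{r}(x) - r(x)|^{p} \,\big] \Big)^{1/p}.
```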

[LG-58] LEGO: Learnable Expansion of Graph Operators for Multi-Modal Feature Fusion

Link: https://arxiv.org/abs/2410.01506
Authors: Dexuan Ding, Lei Wang, Liyun Zhu, Tom Gedeon, Piotr Koniusz
Keywords: computer vision tasks, diverse representations, computer vision, vision tasks, fusion
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Research paper

Abstract:In computer vision tasks, features often come from diverse representations, domains, and modalities, such as text, images, and videos. Effectively fusing these features is essential for robust performance, especially with the availability of powerful pre-trained models like vision-language models. However, common fusion methods, such as concatenation, element-wise operations, and non-linear techniques, often fail to capture structural relationships and deep feature interactions, and they suffer from inefficiency or misalignment of features across domains. In this paper, we shift from high-dimensional feature space to a lower-dimensional, interpretable graph space by constructing similarity graphs that encode feature relationships at different levels, e.g., clip, frame, patch, and token. To capture deeper interactions, we use graph power expansions and introduce a learnable graph fusion operator to combine these graph powers for more effective fusion. Our approach is relationship-centric, operates in a homogeneous space, and is mathematically principled, resembling element-wise similarity score aggregation via multilinear polynomials. We demonstrate the effectiveness of our graph-based fusion method on video anomaly detection, showing strong performance across multi-representational, multi-modal, and multi-domain feature fusion tasks.
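
The graph-space fusion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes cosine-similarity graphs, and a fixed weight matrix stands in for the learnable fusion operator. The names `similarity_graph`, `graph_powers`, and `fuse_graphs` are hypothetical.

```python
import numpy as np

def similarity_graph(features):
    """Cosine-similarity graph (n x n) over n feature vectors."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    return unit @ unit.T

def graph_powers(adj, max_power):
    """Return [A, A^2, ..., A^max_power] for an adjacency matrix A."""
    powers = [adj]
    for _ in range(max_power - 1):
        powers.append(powers[-1] @ adj)
    return powers

def fuse_graphs(graphs, weights, max_power=2):
    """Weighted sum of graph powers across modalities.

    weights has shape (num_graphs, max_power). Assumption: in a trained
    model these coefficients would be learned; here they are fixed.
    """
    fused = np.zeros_like(graphs[0])
    for graph, weight_row in zip(graphs, weights):
        for power, weight in zip(graph_powers(graph, max_power), weight_row):
            fused += weight * power
    return fused

# Two modalities with different feature sizes but the same 5 items,
# so both similarity graphs live in the same 5 x 5 space.
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(5, 16))
video_feats = rng.normal(size=(5, 32))
graphs = [similarity_graph(text_feats), similarity_graph(video_feats)]
weights = np.full((2, 2), 0.25)
fused = fuse_graphs(graphs, weights)
print(fused.shape)
```

Because every modality is mapped to an n x n graph, features of different dimensionality can be combined without projection layers, which is one reading of the "homogeneous space" claim.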

[LG-59] Discrete Diffusion Schrödinger Bridge Matching for Graph Transformation

Link: https://arxiv.org/abs/2410.01500
Authors: Jun Hyeong Kim,Seonghwan Kim,Seokhyun Moon,Hyeongwoo Kim,Jeheon Woo,Woo Youn Kim
Keywords (EN): Transporting between arbitrary, generative modeling, fundamental goal, goal in generative, Schrödinger Bridge Matching
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Transporting between arbitrary distributions is a fundamental goal in generative modeling. Recently proposed diffusion bridge models provide a potential solution, but they rely on a joint distribution that is difficult to obtain in practice. Furthermore, formulations based on continuous domains limit their applicability to discrete domains such as graphs. To overcome these limitations, we propose Discrete Diffusion Schrödinger Bridge Matching (DDSBM), a novel framework that utilizes continuous-time Markov chains to solve the SB problem in a high-dimensional discrete state space. Our approach extends Iterative Markovian Fitting to discrete domains, and we have proved its convergence to the SB. Furthermore, we adapt our framework for the graph transformation and show that our design choice of underlying dynamics characterized by independent modifications of nodes and edges can be interpreted as the entropy-regularized version of optimal transport with a cost function described by the graph edit distance. To demonstrate the effectiveness of our framework, we have applied DDSBM to molecular optimization in the field of chemistry. Experimental results demonstrate that DDSBM effectively optimizes molecules’ property-of-interest with minimal graph transformation, successfully retaining other features.

[LG-60] DLP-LoRA: Efficient Task-Specific LoRA Fusion with a Dynamic Lightweight Plugin for Large Language Models

Link: https://arxiv.org/abs/2410.01497
Authors: Yuxuan Zhang,Ruizhe Li
Keywords (EN): Large Language Models, Large Language, domains remains resource-intensive, Language Models, specific domains remains
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint under review, 18 pages, 7 figures

Abstract:Recent advancements in Large Language Models (LLMs) have achieved robust performance across diverse tasks, but fine-tuning these models for specific domains remains resource-intensive. Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) address this challenge by fine-tuning a small subset of parameters. However, existing methods for fusing multiple LoRAs lack dynamic fusion based on contextual inputs and often increase inference time due to token-level operations. We propose DLP-LoRA, a Dynamic Lightweight Plugin that employs a mini-MLP module with only 5M parameters to dynamically fuse multiple LoRAs at the sentence level using top-p sampling strategies. This approach reduces inference time to less than twice that of single LoRA inference by leveraging parallel computation. Evaluations across 26 tasks, including multiple-choice questions and question answering, demonstrate that DLP-LoRA achieves an average accuracy of 92.34% on multiple-choice datasets and significant improvements in BLEU and ROUGE scores on QA datasets, outperforming different LLM backbones under composite task settings. DLP-LoRA effectively balances performance and efficiency, making it a practical solution for dynamic multi-task adaptation in LLMs. Our code is available at this https URL.

[LG-61] Foldable SuperNets: Scalable Merging of Transformers with Different Initializations and Tasks

Link: https://arxiv.org/abs/2410.01483
Authors: Edan Kinderman,Itay Hubara,Haggai Maron,Daniel Soudry
Keywords (EN): identical architectures trained, merge neural networks, recent methods aim, single multi-task model, identical architectures
Categories: Machine Learning (cs.LG)

Abstract:Many recent methods aim to merge neural networks (NNs) with identical architectures trained on different tasks to obtain a single multi-task model. Most existing works tackle the simpler setup of merging NNs initialized from a common pre-trained network, where simple heuristics like weight averaging work well. This work targets a more challenging goal: merging large transformers trained on different tasks from distinct initializations. First, we demonstrate that traditional merging methods fail catastrophically in this setup. To overcome this challenge, we propose Foldable SuperNet Merge (FS-Merge), a method that optimizes a SuperNet to fuse the original models using a feature reconstruction loss. FS-Merge is simple, data-efficient, and capable of merging models of varying widths. We test FS-Merge against existing methods, including knowledge distillation, on MLPs and transformers across various settings, sizes, tasks, and modalities. FS-Merge consistently outperforms them, achieving SOTA results, particularly in limited data scenarios.

[LG-62] Reducing Variance in Meta-Learning via Laplace Approximation for Regression Tasks

Link: https://arxiv.org/abs/2410.01476
Authors: Alfredo Reichlin,Gustaf Tegnér,Miguel Vasco,Hang Yin,Mårten Björkman,Danica Kragic
Keywords (EN): meta-learning algorithms aim, optimal adaptation strategy, finite set, set of sample, algorithms aim
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Given a finite set of sample points, meta-learning algorithms aim to learn an optimal adaptation strategy for new, unseen tasks. Often, this data can be ambiguous as it might belong to different tasks concurrently. This is particularly the case in meta-regression tasks. In such cases, the estimated adaptation strategy is subject to high variance due to the limited amount of support data for each task, which often leads to sub-optimal generalization performance. In this work, we address the problem of variance reduction in gradient-based meta-learning and formalize the class of problems prone to this, a condition we refer to as "task overlap". Specifically, we propose a novel approach that reduces the variance of the gradient estimate by weighting each support point individually by the variance of its posterior over the parameters. To estimate the posterior, we utilize the Laplace approximation, which allows us to express the variance in terms of the curvature of the loss landscape of our meta-learner. Experimental results demonstrate the effectiveness of the proposed method and highlight the importance of variance reduction in meta-learning.

[LG-63] Selective Aggregation for Low-Rank Adaptation in Federated Learning

Link: https://arxiv.org/abs/2410.01463
Authors: Pengxin Guo,Shuang Zeng,Yanran Wang,Huijie Fan,Feifei Wang,Liangqiong Qu
Keywords (EN): asymmetry analysis, matrices, learning general knowledge, federated learning, LoRA variants
Categories: Machine Learning (cs.LG)

Abstract:We investigate LoRA in federated learning through the lens of the asymmetry analysis of the learned A and B matrices. In doing so, we uncover that A matrices are responsible for learning general knowledge, while B matrices focus on capturing client-specific knowledge. Based on this finding, we introduce Federated Share-A Low-Rank Adaptation (FedSA-LoRA), which employs two low-rank trainable matrices A and B to model the weight update, but only A matrices are shared with the server for aggregation. Moreover, we delve into the relationship between the learned A and B matrices in other LoRA variants, such as rsLoRA and VeRA, revealing a consistent pattern. Consequently, we extend our FedSA-LoRA method to these LoRA variants, resulting in FedSA-rsLoRA and FedSA-VeRA. In this way, we establish a general paradigm for integrating LoRA with FL, offering guidance for future work on subsequent LoRA variants combined with FL. Extensive experimental results on natural language understanding and generation tasks demonstrate the effectiveness of the proposed method.
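
The share-only-A scheme in the abstract can be sketched as a single aggregation round. This is an illustration under stated assumptions, not the paper's code: equal-weight averaging is assumed (a real system might weight clients by dataset size), and `aggregate_shared_a`, `broadcast_a`, and `lora_delta` are hypothetical names.

```python
import numpy as np

def aggregate_shared_a(client_states):
    """Average only the A matrices across clients (equal weights assumed)."""
    return np.mean([state["A"] for state in client_states], axis=0)

def broadcast_a(client_states, global_a):
    """Give every client the aggregated A; each client's B stays local."""
    return [{"A": global_a.copy(), "B": state["B"]} for state in client_states]

def lora_delta(state):
    """Low-rank weight update modelled by LoRA: delta_W = B @ A."""
    return state["B"] @ state["A"]

# Three clients, each holding its own pair of low-rank matrices.
rng = np.random.default_rng(0)
d_out, d_in, rank = 6, 8, 2
clients = [
    {"A": rng.normal(size=(rank, d_in)), "B": rng.normal(size=(d_out, rank))}
    for _ in range(3)
]

global_a = aggregate_shared_a(clients)
updated = broadcast_a(clients, global_a)
print(lora_delta(updated[0]).shape)
```

After the round, all clients agree on A (the part the abstract associates with general knowledge) while their B matrices, and therefore their effective weight updates, remain client-specific.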

[LG-64] From Reward Shaping to Q-Shaping: Achieving Unbiased Learning with LLM-Guided Knowledge

Link: https://arxiv.org/abs/2410.01458
Authors: Xiefeng Wu
Keywords (EN): accelerate agent training, incorporating domain knowledge, directly shaping Q-values, Q-value initialization, agent training
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: q-shaping, reinforcement learning, reward shaping

Abstract:Q-shaping is an extension of Q-value initialization and serves as an alternative to reward shaping for incorporating domain knowledge to accelerate agent training, thereby improving sample efficiency by directly shaping Q-values. This approach is both general and robust across diverse tasks, allowing for immediate impact assessment while guaranteeing optimality. We evaluated Q-shaping across 20 different environments using a large language model (LLM) as the heuristic provider. The results demonstrate that Q-shaping significantly enhances sample efficiency, achieving a 16.87% improvement over the best baseline in each environment and a 253.80% improvement compared to LLM-based reward shaping methods. These findings establish Q-shaping as a superior and unbiased alternative to conventional reward shaping in reinforcement learning.

[LG-65] Verbalized Graph Representation Learning: A Fully Interpretable Graph Model Based on Large Language Models Throughout the Entire Process

Link: https://arxiv.org/abs/2410.01457
Authors: Xingyu Ji,Jiale Liu,Lu Li,Maojun Wang,Zeyu Zhang
Keywords (EN): Graph Neural Networks, Neural Networks, attracted significant interest, significant interest due, wide-ranging real-world applications
Categories: Machine Learning (cs.LG)
Comments: Under review. Corresponding author: Zeyu Zhang

Abstract:Representation learning on text-attributed graphs (TAGs) has attracted significant interest due to its wide-ranging real-world applications, particularly through Graph Neural Networks (GNNs). Traditional GNN methods focus on encoding the structural information of graphs, often using shallow text embeddings for node or edge attributes. This limits the model's ability to understand the rich semantic information in the data and to reason about complex downstream tasks, and it also leaves the model without interpretability. With the rise of large language models (LLMs), an increasing number of studies are combining them with GNNs for graph representation learning and downstream tasks. While these approaches effectively leverage the rich semantic information in TAG datasets, their main drawback is that they are only partially interpretable, which limits their application in critical fields. In this paper, we propose a verbalized graph representation learning (VGRL) method which is fully interpretable. In contrast to traditional graph machine learning models, which are usually optimized within a continuous parameter space, VGRL constrains this parameter space to text descriptions, which ensures complete interpretability throughout the entire process and makes it easier for users to understand and trust the decisions of the model. We conduct several studies to empirically evaluate the effectiveness of VGRL, and we believe this method can serve as a stepping stone in graph representation learning.

[LG-66] Ensembles provably learn equivariance through data augmentation

Link: https://arxiv.org/abs/2410.01452
Authors: Oskar Nordenfors,Axel Flinth
Keywords (EN): wide neural networks, infinitely wide neural, neural tangent kernel, tangent kernel limit, group equivariance emerges
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)

Abstract:Recently, it was proved that group equivariance emerges in ensembles of neural networks as the result of full augmentation in the limit of infinitely wide neural networks (neural tangent kernel limit). In this paper, we extend this result significantly. We provide a proof that this emergence does not depend on the neural tangent kernel limit at all. We also consider stochastic settings, and furthermore general architectures. For the latter, we provide a simple sufficient condition on the relation between the architecture and the action of the group for our results to hold. We validate our findings through simple numeric experiments.

[LG-67] Geometric Signatures of Compositionality Across a Language Model’s Lifetime ICLR2025

Link: https://arxiv.org/abs/2410.01444
Authors: Jin Hwa Lee,Thomas Jiralerspong,Lei Yu,Yoshua Bengio,Emily Cheng
Keywords (EN): syntactic rules, permits the infinite, expression is constructed, parts and syntactic, infinite productivity
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: Under review as a conference paper at ICLR 2025

Abstract:Compositionality, the notion that the meaning of an expression is constructed from the meaning of its parts and syntactic rules, permits the infinite productivity of human language. For the first time, artificial language models (LMs) are able to match human performance in a number of compositional generalization tasks. However, much remains to be understood about the representational mechanisms underlying these abilities. We take a high-level geometric approach to this problem by relating the degree of compositionality in a dataset to the intrinsic dimensionality of its representations under an LM, a measure of feature complexity. We find not only that the degree of dataset compositionality is reflected in representations’ intrinsic dimensionality, but that the relationship between compositionality and geometric complexity arises due to learned linguistic features over training. Finally, our analyses reveal a striking contrast between linear and nonlinear dimensionality, showing that they respectively encode formal and semantic aspects of linguistic composition.

[LG-68] Closed-loop Long-horizon Robotic Planning via Equilibrium Sequence Modeling

Link: https://arxiv.org/abs/2410.01440
Authors: Jinghan Li,Zhicheng Sun,Fei Li,Cao Sheng,Jiazhong Yu,Yadong Mu
Keywords (EN): translating high-level task, high-level task descriptions, make autonomous robots, requires translating high-level, long-horizon action sequences
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:In the endeavor to make autonomous robots take actions, task planning is a major challenge that requires translating high-level task descriptions into long-horizon action sequences. Despite recent advances in language model agents, they remain prone to planning errors and limited in their ability to plan ahead. To address these limitations in robotic planning, we advocate a self-refining scheme that iteratively refines a draft plan until an equilibrium is reached. Remarkably, this process can be optimized end-to-end from an analytical perspective without the need to curate additional verifiers or reward models, allowing us to train self-refining planners in a simple supervised learning fashion. Meanwhile, a nested equilibrium sequence modeling procedure is devised for efficient closed-loop planning that incorporates useful feedback from the environment (or an internal world model). Our method is evaluated on the VirtualHome-Env benchmark, showing advanced performance with better scaling for inference computation. Code is available at this https URL.

[LG-69] Information-Theoretical Principled Trade-off between Jailbreakability and Stealthiness on Vision Language Models

Link: https://arxiv.org/abs/2410.01438
Authors: Ching-Chia Kao,Chia-Mu Yu,Chun-Shien Lu,Chu-Song Chen
Keywords (EN): demonstrated significant advancements, recent years, artificial intelligence, transforming tasks, demonstrated significant
Categories: Machine Learning (cs.LG)

Abstract:In recent years, Vision-Language Models (VLMs) have demonstrated significant advancements in artificial intelligence, transforming tasks across various domains. Despite their capabilities, these models are susceptible to jailbreak attacks, which can compromise their safety and reliability. This paper explores the trade-off between jailbreakability and stealthiness in VLMs, presenting a novel algorithm to detect non-stealthy jailbreak attacks and enhance model robustness. We introduce a stealthiness-aware jailbreak attack using diffusion models, highlighting the challenge of detecting AI-generated content. Our approach leverages Fano’s inequality to elucidate the relationship between attack success rates and stealthiness scores, providing an explainable framework for evaluating these threats. Our contributions aim to fortify AI systems against sophisticated attacks, ensuring their outputs remain aligned with ethical standards and user expectations.

[LG-70] Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models

Link: https://arxiv.org/abs/2410.01434
Authors: Philipp Mondorf,Sondre Wold,Barbara Plank
Keywords (EN): implement reusable functions, implement reusable, fundamental question, reusable functions, composed to perform
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 24 pages, 17 figures

Abstract:A fundamental question in interpretability research is to what extent neural networks, particularly language models, implement reusable functions via subnetworks that can be composed to perform more complex tasks. Recent developments in mechanistic interpretability have made progress in identifying subnetworks, often referred to as circuits, which represent the minimal computational subgraph responsible for a model’s behavior on specific tasks. However, most studies focus on identifying circuits for individual tasks without investigating how functionally similar circuits relate to each other. To address this gap, we examine the modularity of neural networks by analyzing circuits for highly compositional subtasks within a transformer-based language model. Specifically, given a probabilistic context-free grammar, we identify and compare circuits responsible for ten modular string-edit operations. Our results indicate that functionally similar circuits exhibit both notable node overlap and cross-task faithfulness. Moreover, we demonstrate that the circuits identified can be reused and combined through subnetwork set operations to represent more complex functional capabilities of the model.

[LG-71] Adaptive teachers for amortized samplers

Link: https://arxiv.org/abs/2410.01432
Authors: Minsu Kim,Sanghyeok Choi,Taeyoung Yun,Emmanuel Bengio,Leo Feng,Jarrid Rector-Brooks,Sungsoo Ahn,Jinkyoo Park,Nikolay Malkin,Yoshua Bengio
Keywords (EN): unnormalized density, density where exact, neural network, generative flow networks, Amortized inference
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 26 pages, 12 figures

Abstract:Amortized inference is the task of training a parametric model, such as a neural network, to approximate a distribution with a given unnormalized density where exact sampling is intractable. When sampling is implemented as a sequential decision-making process, reinforcement learning (RL) methods, such as generative flow networks, can be used to train the sampling policy. Off-policy RL training facilitates the discovery of diverse, high-reward candidates, but existing methods still face challenges in efficient exploration. We propose to use an adaptive training distribution (the Teacher) to guide the training of the primary amortized sampler (the Student) by prioritizing high-loss regions. The Teacher, an auxiliary behavior model, is trained to sample high-error regions of the Student and can generalize across unexplored modes, thereby enhancing mode coverage by providing an efficient training curriculum. We validate the effectiveness of this approach in a synthetic environment designed to present an exploration challenge, two diffusion-based sampling tasks, and four biochemical discovery tasks, demonstrating its ability to improve sample efficiency and mode coverage.

[LG-72] Scalable Reinforcement Learning-based Neural Architecture Search

Link: https://arxiv.org/abs/2410.01431
Authors: Amber Cassimon,Siegfried Mercelis,Kevin Mets
Keywords (EN): Reinforcement Learning-based solution, Neural Architecture Search, single optimal architecture, Neural Architecture, Learning-based solution
Categories: Machine Learning (cs.LG)
Comments: 33 pages, 19 figures

Abstract:In this publication, we assess the ability of a novel Reinforcement Learning-based solution to the problem of Neural Architecture Search, where a Reinforcement Learning (RL) agent learns to search for good architectures, rather than to return a single optimal architecture. We consider both the NAS-Bench-101 and NAS-Bench-301 settings, and compare against various known strong baselines, such as local search and random search. We conclude that our Reinforcement Learning agent displays strong scalability with regard to the size of the search space, but limited robustness to hyperparameter changes.

[LG-73] Fair4Free: Generating High-fidelity Fair Synthetic Samples using Data Free Distillation

Link: https://arxiv.org/abs/2410.01423
Authors: Md Fahim Sikder,Daniel de Leng,Fredrik Heintz
Keywords (EN): latent space, work presents, student model, model, generative model
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:This work presents Fair4Free, a novel generative model to generate synthetic fair data using data-free distillation in the latent space. Fair4Free can be applied in situations where the data is private or inaccessible. In our approach, we first train a teacher model to create a fair representation and then distil the knowledge to a student model (using a smaller architecture). The process of distilling the student model is data-free, i.e., the student model does not have access to the training dataset while distilling. After the distillation, we use the distilled model to generate fair synthetic samples. Our extensive experiments show that our synthetic samples outperform state-of-the-art models in all three criteria (fairness, utility and synthetic quality) with a performance increase of 5% for fairness, 8% for utility and 12% for synthetic quality for both tabular and image datasets.

[LG-74] The Labyrinth of Links: Navigating the Associative Maze of Multi-modal LLMs

Link: https://arxiv.org/abs/2410.01417
Authors: Hong Li,Nanxi Li,Yuanjie Chen,Jianbin Zhu,Qinlu Guo,Cewu Lu,Yong-Lu Li
Keywords (EN): Multi-modal Large Language, Large Language Models, Multi-modal Large, Large Language, Language Models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Multi-modal Large Language Models (MLLMs) have exhibited impressive capability. However, many deficiencies of MLLMs compared to human intelligence have recently been found, e.g., hallucination. To drive the study of MLLMs, the community has dedicated efforts to building larger benchmarks with complex tasks. In this paper, we propose benchmarking an essential but usually overlooked form of intelligence: association, a human’s basic capability to link observation and prior practice memory. To comprehensively investigate MLLMs’ performance on association, we formulate the association task and devise a standard benchmark based on adjective and verb semantic concepts. Instead of costly data annotation and curation, we propose a convenient annotation-free construction method that transforms a general dataset for our association tasks. Simultaneously, we devise a rigorous data refinement process to eliminate confusion in the raw dataset. Building on this database, we establish three levels of association tasks: single-step, synchronous, and asynchronous associations. Moreover, we conduct a comprehensive investigation into the MLLMs’ zero-shot association capabilities, addressing multiple dimensions, including three distinct memory strategies, both open-source and closed-source MLLMs, cutting-edge Mixture-of-Experts (MoE) models, and the involvement of human experts. Our systematic investigation shows that current open-source MLLMs consistently exhibit poor capability in our association tasks, and even the current state-of-the-art GPT-4V(vision) has a significant gap compared to humans. We believe our benchmark would pave the way for future MLLM studies. Our data and code are available at: this https URL.

[LG-75] On Expressive Power of Looped Transformers: Theoretical Analysis and Enhancement via Timestep Encoding

Link: https://arxiv.org/abs/2410.01405
Authors: Kevin Xu,Issei Sato
Keywords (EN): Transformers offer advantages, Turing completeness, Looped Transformers offer, efficiency and Turing, Looped Transformers
Categories: Machine Learning (cs.LG)

Abstract:Looped Transformers offer advantages in parameter efficiency and Turing completeness. However, their expressive power for function approximation and approximation rate remains underexplored. In this paper, we establish approximation rates of Looped Transformers by defining the concept of the modulus of continuity for sequence-to-sequence functions. This reveals a limitation specific to the looped architecture. That is, the analysis prompts us to incorporate scaling parameters for each loop, conditioned on timestep encoding. Experimental results demonstrate that increasing the number of loops enhances performance, with further gains achieved through the timestep encoding architecture.

[LG-76] Gaussian kernel expansion with basis functions uniformly bounded in $\mathcal{L}_\infty$

Link: https://arxiv.org/abs/2410.01394
Authors: Mauro Bisiacco,Gianluigi Pillonetto
Keywords (EN): so-called feature maps, feature maps introduced, machine learning, Gaussian kernel, Gaussian kernel expansion
Categories: Machine Learning (cs.LG)

Abstract:Kernel expansions are a topic of considerable interest in machine learning, partly because of their relation to the so-called feature maps introduced in machine learning. Properties of the associated basis functions and weights (corresponding to eigenfunctions and eigenvalues in the Mercer setting) give insight into, for example, the structure of the associated reproducing kernel Hilbert space, the goodness of approximation schemes, and the convergence rates and generalization properties of kernel machines. Recent work in the literature has derived some of these results by assuming uniformly bounded basis functions in $\mathcal{L}_\infty$. Motivated by this line of research, we investigate under this constraint all possible kernel expansions of the Gaussian kernel, one of the most widely used models in machine learning. Our main result is the construction on $\mathbb{R}^2$ of a Gaussian kernel expansion with weights in $\ell_p$ for any $p > 1$. This result is optimal since we also prove that $p = 1$ cannot be reached by the Gaussian kernel, nor by any of the other radial basis function kernels commonly used in the literature. A consequence for this kind of kernel is also the non-existence of Mercer expansions on $\mathbb{R}^2$, with respect to any finite measure, whose eigenfunctions all belong to a closed ball of $\mathcal{L}_\infty$.

[LG-77] Causal Inference Tools for a Better Evaluation of Machine Learning

Link: https://arxiv.org/abs/2410.01392
Authors: Michaël Soumm
Keywords (EN): improve machine learning, machine learning, applying rigorous statistical, analyze and improve, Analysis of Variance
Categories: Machine Learning (cs.LG)

Abstract:We present a comprehensive framework for applying rigorous statistical techniques from econometrics to analyze and improve machine learning systems. We introduce key statistical methods such as Ordinary Least Squares (OLS) regression, Analysis of Variance (ANOVA), and logistic regression, explaining their theoretical foundations and practical applications in machine learning evaluation. The document serves as a guide for researchers and practitioners, detailing how these techniques can provide deeper insights into model behavior, performance, and fairness. We cover the mathematical principles behind each method, discuss their assumptions and limitations, and provide step-by-step instructions for their implementation. The paper also addresses how to interpret results, emphasizing the importance of statistical significance and effect size. Through illustrative examples, we demonstrate how these tools can reveal subtle patterns and interactions in machine learning models that are not apparent from traditional evaluation metrics. By connecting the fields of econometrics and machine learning, this work aims to equip readers with powerful analytical tools for more rigorous and comprehensive evaluation of AI systems. The framework presented here contributes to developing more robust, interpretable, and fair machine learning technologies.
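
As a rough illustration of using OLS regression for model evaluation, the sketch below regresses accuracy on two experimental factors. The data is synthetic and noise-free, invented purely for the example, and the factor names (`log_train_size`, `uses_augmentation`) and the helper `ols_fit` are hypothetical, not taken from the paper.

```python
import numpy as np

def ols_fit(design, target):
    """Ordinary least squares with an intercept column prepended."""
    X = np.column_stack([np.ones(len(design)), design])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef

# Synthetic evaluation runs (made up for illustration): each row is one run.
log_train_size = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])
uses_augmentation = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
design = np.column_stack([log_train_size, uses_augmentation])

# Accuracy generated from a known linear rule so the fit can be checked.
accuracy = 0.6 + 0.05 * log_train_size + 0.1 * uses_augmentation

intercept, size_effect, aug_effect = ols_fit(design, accuracy)
print(intercept, size_effect, aug_effect)
```

Each coefficient estimates the effect of one factor with the other held fixed, which is the kind of attribution a single aggregate metric does not provide. A real analysis would also need standard errors and significance tests, which this sketch omits.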

[LG-78] FLAME: Adaptive and Reactive Concept Drift Mitigation for Federated Learning Deployments

Link: https://arxiv.org/abs/2410.01386
Authors: Ioannis Mavromatis,Stefano De Feo,Aftab Khan
Keywords (EN): presents Federated Learning, paper presents Federated, Federated Learning, Internet of Things, Monitoring and Elimination
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted for Publication at EMERGE Workshop - EWSN 2024

Abstract:This paper presents Federated Learning with Adaptive Monitoring and Elimination (FLAME), a novel solution capable of detecting and mitigating concept drift in Federated Learning (FL) Internet of Things (IoT) environments. Concept drift poses significant challenges for FL models deployed in dynamic and real-world settings. FLAME leverages an FL architecture, considers a real-world FL pipeline, and proves capable of maintaining model performance and accuracy while addressing bandwidth and privacy constraints. Introducing various features and extensions on previous works, FLAME offers a robust solution to concept drift, significantly reducing computational load and communication overhead. Compared to well-known lightweight mitigation methods, FLAME demonstrates superior performance in maintaining high F1 scores and reducing resource utilisation in large-scale IoT deployments, making it a promising approach for real-world applications.

[LG-79] Towards Dynamic Graph Neural Networks with Provably High-Order Expressive Power

Link: https://arxiv.org/abs/2410.01367
Authors: Zhe Wang,Tianjian Zhao,Zhen Zhang,Jiawei Chen,Sheng Zhou,Yan Feng,Chun Chen,Can Wang
Keywords (EN): Dynamic Graph Neural, Graph Neural Networks, expressive power, Dynamic Graph, Graph Neural
Categories: Machine Learning (cs.LG)

Abstract:Dynamic Graph Neural Networks (DyGNNs) have garnered increasing research attention for learning representations on evolving graphs. Despite their effectiveness, the limited expressive power of existing DyGNNs hinders them from capturing important evolving patterns of dynamic graphs. Although some works attempt to enhance expressive capability with heuristic features, there remains a lack of DyGNN frameworks with provable and quantifiable high-order expressive power. To address this research gap, we first propose the k-dimensional Dynamic WL tests (k-DWL) as reference algorithms to quantify the expressive power of DyGNNs. We demonstrate that the expressive power of existing DyGNNs is upper bounded by the 1-DWL test. To enhance the expressive power, we propose Dynamic Graph Neural Network with High-order expressive power (HopeDGN), which updates the representation of the central node pair by aggregating the interaction history with neighboring node pairs. Our theoretical results demonstrate that HopeDGN can achieve expressive power equivalent to the 2-DWL test. We then present a Transformer-based implementation for the local variant of HopeDGN. Experimental results show that HopeDGN achieves performance improvements of up to 3.12%, demonstrating its effectiveness.

[LG-80] FlashMask: Efficient and Rich Mask Extension of FlashAttention

Link: https://arxiv.org/abs/2410.01359
Authors: Guoxia Wang,Jinle Zeng,Xiyuan Xiao,Siming Wu,Jiabin Yang,Lujing Zheng,Zeyu Chen,Jiang Bian,Dianhai Yu,Haifeng Wang
Keywords: vanilla attention scale, attention scale quadratically, processing long sequences, posing significant challenges, demands of vanilla
Subjects: Machine Learning (cs.LG)

Abstract:The computational and memory demands of vanilla attention scale quadratically with the sequence length N, posing significant challenges for processing long sequences in Transformer models. FlashAttention alleviates these challenges by eliminating the O(N^2) memory dependency and reducing attention latency through IO-aware memory optimizations. However, its native support for certain attention mask types is limited, and it does not inherently accommodate more complex masking requirements. Previous approaches resort to using dense masks with O(N^2) memory complexity, leading to inefficiencies. In this paper, we propose FlashMask, an extension of FlashAttention that introduces a column-wise sparse representation of attention masks. This approach efficiently represents a wide range of mask types and facilitates the development of optimized kernel implementations. By adopting this novel representation, FlashMask achieves linear memory complexity O(N), suitable for modeling long-context sequences. Moreover, this representation enables kernel optimizations that eliminate unnecessary computations by leveraging sparsity in the attention mask, without sacrificing computational accuracy, resulting in higher computational efficiency. We evaluate FlashMask’s performance in fine-tuning and alignment training of LLMs such as SFT, LoRA, DPO, and RM. FlashMask achieves significant throughput improvements, with end-to-end speedups ranging from 1.65x to 3.22x compared to existing FlashAttention dense method. Additionally, our kernel-level comparisons demonstrate that FlashMask surpasses the latest counterpart, FlexAttention, by 12.1% to 60.7% in terms of kernel TFLOPs/s, achieving 37.8% to 62.3% of the theoretical maximum FLOPs/s on the A100 GPU. The code is open-sourced on PaddlePaddle and integrated into PaddleNLP, supporting models with over 100 billion parameters for contexts up to 128K tokens.
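
To make the idea of a column-wise sparse mask concrete, here is a minimal sketch, not FlashMask's actual data layout or kernel. It assumes a causal mask over several documents packed into one sequence, and stores one integer per key column (the first query row that may no longer attend to it), so storage is O(N) rather than O(N^2). The function names are hypothetical.

```python
def column_starts_for_packed_docs(doc_lengths):
    """For each key column j, return the first query row that must NOT attend to j.

    Assumption: documents are packed back to back, and a key is visible only
    to queries of its own document, so masking begins where the document ends.
    """
    starts = []
    end = 0
    for length in doc_lengths:
        end += length
        starts.extend([end] * length)
    return starts


def may_attend(i, j, starts):
    """Query i may attend to key j if j is not in the future (causal)
    and row i lies before the masked interval of column j."""
    return j <= i < starts[j]


def dense_reference(doc_lengths):
    """Direct O(N^2) construction, used only to check the O(N) representation."""
    doc_ids = [d for d, length in enumerate(doc_lengths) for _ in range(length)]
    n = len(doc_ids)
    return [[j <= i and doc_ids[i] == doc_ids[j] for j in range(n)]
            for i in range(n)]
```

Expanding `may_attend` over all (i, j) pairs reproduces the dense mask, which is a quick way to check a compact representation before it is used anywhere performance-critical.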

[LG-81] PhyMPGN: Physics-encoded Message Passing Graph Network for spatiotemporal PDE systems

Link: https://arxiv.org/abs/2410.01337
Authors: Bocheng Zeng,Qi Wang,Mengtao Yan,Yang Liu,Ruizhi Chengze,Yi Zhang,Hongsheng Liu,Zidong Wang,Hao Sun
Keywords: Solving partial differential, partial differential equations, Solving partial, modeling complex dynamical, differential equations
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)

Abstract:Solving partial differential equations (PDEs) serves as a cornerstone for modeling complex dynamical systems. Recent progresses have demonstrated grand benefits of data-driven neural-based models for predicting spatiotemporal dynamics (e.g., tremendous speedup gain compared with classical numerical methods). However, most existing neural models rely on rich training data, have limited extrapolation and generalization abilities, and suffer to produce precise or reliable physical prediction under intricate conditions (e.g., irregular mesh or geometry, complex boundary conditions, diverse PDE parameters, etc.). To this end, we propose a new graph learning approach, namely, Physics-encoded Message Passing Graph Network (PhyMPGN), to model spatiotemporal PDE systems on irregular meshes given small training datasets. Specifically, we incorporate a GNN into a numerical integrator to approximate the temporal marching of spatiotemporal dynamics for a given PDE system. Considering that many physical phenomena are governed by diffusion processes, we further design a learnable Laplace block, which encodes the discrete Laplace-Beltrami operator, to aid and guide the GNN learning in a physically feasible solution space. A boundary condition padding strategy is also designed to improve the model convergence and accuracy. Extensive experiments demonstrate that PhyMPGN is capable of accurately predicting various types of spatiotemporal dynamics on coarse unstructured meshes, consistently achieves the state-of-the-art results, and outperforms other baselines with considerable gains.

[LG-82] Layer Swapping for Zero-Shot Cross-Lingual Transfer in Large Language Models

Link: https://arxiv.org/abs/2410.01335
Authors: Lucas Bandarkar,Benjamin Muller,Pritish Yuvraj,Rui Hou,Nayan Singhal,Hongjiang Lv,Bing Liu
Keywords: Large Language Models, math instruction data, practice of combining, instruction data, Model merging
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 main pages, 23 pages total, 9 figures, 5 tables

Abstract:Model merging, such as model souping, is the practice of combining different models with the same architecture together without further training. In this work, we present a model merging methodology that addresses the difficulty of fine-tuning Large Language Models (LLMs) for target tasks in non-English languages, where task-specific data is often unavailable. We focus on mathematical reasoning and without in-language math data, facilitate cross-lingual transfer by composing language and math capabilities. Starting from the same pretrained model, we fine-tune separate “experts” on math instruction data in English and on generic instruction data in the target language. We then replace the top and bottom transformer layers of the math expert directly with layers from the language expert, which consequently enhances math performance in the target language. The resulting merged models outperform the individual experts and other merging methods on the math benchmark, MGSM, by 10% across four major languages where math instruction data is scarce. In addition, this layer swapping is simple, inexpensive, and intuitive, as it is based on an interpretative analysis of the most important parameter changes during the fine-tuning of each expert. The ability to successfully re-compose LLMs for cross-lingual transfer in this manner opens up future possibilities to combine model expertise, create modular solutions, and transfer reasoning capabilities across languages all post hoc.
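
The swap described in the abstract can be sketched on plain lists of per-layer parameters. This is an illustrative toy, not the authors' code: the function name is hypothetical, and the number of bottom and top layers to replace is an assumption, since the abstract does not state it.

```python
def swap_layers(math_expert, language_expert, n_bottom, n_top):
    """Return a merged layer list: the math expert with its first `n_bottom`
    and last `n_top` layers replaced by the language expert's layers.

    Both inputs are sequences of per-layer parameters (any objects) from
    models fine-tuned from the same pretrained checkpoint.
    """
    if len(math_expert) != len(language_expert):
        raise ValueError("experts must have the same number of layers")
    n = len(math_expert)
    if n_bottom < 0 or n_top < 0 or n_bottom + n_top > n:
        raise ValueError("invalid number of layers to swap")

    merged = list(math_expert)          # copy; the inputs are left untouched
    for i in range(n_bottom):           # bottom (input-side) layers
        merged[i] = language_expert[i]
    for i in range(n - n_top, n):       # top (output-side) layers
        merged[i] = language_expert[i]
    return merged
```

With real checkpoints the same indexing would be applied to the layer entries of a state dict; no further training is involved, which is why the abstract describes the method as inexpensive.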

[LG-83] Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting

Link: https://arxiv.org/abs/2410.01331
Authors: Alessio Russo,Alberto Maria Metelli,Marcello Restelli
Keywords: Partially Observable Markov, Observable Markov Decision, Markov Decision Processes, Dealing with Partially, Partially Observable
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Dealing with Partially Observable Markov Decision Processes is notably a challenging task. We face an average-reward infinite-horizon POMDP setting with an unknown transition model, where we assume the knowledge of the observation model. Under this assumption, we propose the Observation-Aware Spectral (OAS) estimation technique, which enables the POMDP parameters to be learned from samples collected using a belief-based policy. Then, we propose the OAS-UCRL algorithm that implicitly balances the exploration-exploitation trade-off following the optimism in the face of uncertainty principle. The algorithm runs through episodes of increasing length. For each episode, the optimal belief-based policy of the estimated POMDP interacts with the environment and collects samples that will be used in the next episode by the OAS estimation procedure to compute a new estimate of the POMDP parameters. Given the estimated model, an optimization oracle computes the new optimal policy. We show the consistency of the OAS procedure, and we prove a regret guarantee of order $\mathcal{O}(\sqrt{T \log(T)})$ for the proposed OAS-UCRL algorithm. We compare against the oracle playing the optimal stochastic belief-based policy and show the efficient scaling of our approach with respect to the dimensionality of the state, action, and observation space. We finally conduct numerical simulations to validate and compare the proposed technique with other baseline approaches.

[LG-84] Fair Class-Incremental Learning using Sample Weighting

Link: https://arxiv.org/abs/2410.01324
Authors: Jaeyoung Park,Minsu Kim,Steven Euijong Whang
Keywords: class-incremental learning, average gradient vector, Model fairness, average gradient, current task
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Model fairness is becoming important in class-incremental learning for Trustworthy AI. While accuracy has been a central focus in class-incremental learning, fairness has been relatively understudied. However, naively using all the samples of the current task for training results in unfair catastrophic forgetting for certain sensitive groups including classes. We theoretically analyze that forgetting occurs if the average gradient vector of the current task data is in an “opposite direction” compared to the average gradient vector of a sensitive group, which means their inner products are negative. We then propose a fair class-incremental learning framework that adjusts the training weights of current task samples to change the direction of the average gradient vector and thus reduce the forgetting of underperforming groups and achieve fairness. For various group fairness measures, we formulate optimization problems to minimize the overall losses of sensitive groups while minimizing the disparities among them. We also show the problems can be solved with linear programming and propose an efficient Fairness-aware Sample Weighting (FSW) algorithm. Experiments show that FSW achieves better accuracy-fairness tradeoff results than state-of-the-art approaches on real datasets.
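
The forgetting condition in the abstract (a negative inner product between average gradient vectors) is easy to check numerically. The sketch below is a toy illustration with hypothetical function names and hand-picked weights; it does not implement the paper's FSW algorithm, which obtains the weights by linear programming.

```python
def weighted_mean(vectors, weights=None):
    """Weighted average of equal-length gradient vectors (uniform if no weights)."""
    if weights is None:
        weights = [1.0] * len(vectors)
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) / total
            for k in range(dim)]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def conflicts_with_group(task_grads, group_grads, weights=None):
    """True if the (weighted) average gradient of the current task points in an
    'opposite direction' to the group's average gradient, i.e. their inner
    product is negative, which the abstract associates with forgetting."""
    return dot(weighted_mean(task_grads, weights), weighted_mean(group_grads)) < 0
```

In this toy setting, two current-task gradients `[1, 0]` and `[-3, 0]` conflict with a group gradient `[1, 0]` under uniform weights, while down-weighting the second sample (weights `[3, 0.5]`) flips the sign of the inner product.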

[LG-85] Forte: Finding Outliers with Representation Typicality Estimation

Link: https://arxiv.org/abs/2410.01322
Authors: Debargha Ganguly,Warren Morningstar,Andrew Yu,Vipin Chaudhary
Keywords: produce photorealistic synthetic, generative OOD detectors, OOD detectors, virtually indistinguishable, photorealistic synthetic data
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)

Abstract:Generative models can now produce photorealistic synthetic data which is virtually indistinguishable from the real data used to train it. This is a significant evolution over previous models which could produce reasonable facsimiles of the training data, but ones which could be visually distinguished from the training data by human evaluation. Recent work on OOD detection has raised doubts that generative model likelihoods are optimal OOD detectors due to issues involving likelihood misestimation, entropy in the generative process, and typicality. We speculate that generative OOD detectors also failed because their models focused on the pixels rather than the semantic content of the data, leading to failures in near-OOD cases where the pixels may be similar but the information content is significantly different. We hypothesize that estimating typical sets using self-supervised learners leads to better OOD detectors. We introduce a novel approach that leverages representation learning, and informative summary statistics based on manifold estimation, to address all of the aforementioned issues. Our method outperforms other unsupervised approaches and achieves state-of-the-art performance on well-established challenging benchmarks, and new synthetic data detection tasks.

[LG-86] Fast Summation of Radial Kernels via QMC Slicing

Link: https://arxiv.org/abs/2410.01316
Authors: Johannes Hertrich,Tim Jahn,Michael Quellmalz
Keywords: large kernel sums, challenging task, computation of large, large kernel, kernel sums
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)

Abstract:The fast computation of large kernel sums is a challenging task, which arises as a subproblem in any kernel method. We approach the problem by slicing, which relies on random projections to one-dimensional subspaces and fast Fourier summation. We prove bounds for the slicing error and propose a quasi-Monte Carlo (QMC) approach for selecting the projections based on spherical quadrature rules. Numerical examples demonstrate that our QMC-slicing approach significantly outperforms existing methods like (QMC-)random Fourier features, orthogonal Fourier features or non-QMC slicing on standard test datasets.

[LG-87] Sampling from Energy-based Policies using Diffusion

Link: https://arxiv.org/abs/2410.01312
Authors: Vineet Jain,Tara Akhound-Sadegh,Siamak Ravanbakhsh
Keywords: Energy-based policies offer, modeling complex, offer a flexible, flexible framework, framework for modeling
Subjects: Machine Learning (cs.LG)

Abstract:Energy-based policies offer a flexible framework for modeling complex, multimodal behaviors in reinforcement learning (RL). In maximum entropy RL, the optimal policy is a Boltzmann distribution derived from the soft Q-function, but direct sampling from this distribution in continuous action spaces is computationally intractable. As a result, existing methods typically use simpler parametric distributions, like Gaussians, for policy representation - limiting their ability to capture the full complexity of multimodal action distributions. In this paper, we introduce a diffusion-based approach for sampling from energy-based policies, where the negative Q-function defines the energy function. Based on this approach, we propose an actor-critic method called Diffusion Q-Sampling (DQS) that enables more expressive policy representations, allowing stable learning in diverse environments. We show that our approach enhances exploration and captures multimodal behavior in continuous control tasks, addressing key limitations of existing methods.
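
For intuition, the Boltzmann policy mentioned in the abstract has a closed form when the action set is small and discrete: the probability of an action is proportional to exp(Q / temperature), with energy equal to the negative Q-value. The sketch below shows only that discrete target distribution; it is not the paper's diffusion sampler, which addresses continuous action spaces where this normalisation is intractable. The function name and the temperature parameter are assumptions.

```python
import math


def boltzmann_policy(q_values, temperature=1.0):
    """Action probabilities proportional to exp(Q / temperature).

    Equivalent to an energy-based policy with energy E(a) = -Q(a).
    Subtracting the maximum Q-value before exponentiating avoids overflow
    and does not change the resulting distribution.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    normaliser = sum(weights)
    return [w / normaliser for w in weights]
```

Lower temperatures concentrate the mass on the highest-valued action, while higher temperatures approach a uniform policy; a multimodal Q-function gives a multimodal policy, which a single Gaussian cannot represent.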

[LG-88] Getting Free Bits Back from Rotational Symmetries in LLMs

Link: https://arxiv.org/abs/2410.01309
Authors: Jiajun He,Gergely Flamich,José Miguel Hernández-Lobato
Keywords: encoding redundant information, compressing neural network, neural network weights, channel simulation, redundant information
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 14 pages, 3 figures

Abstract:Current methods for compressing neural network weights, such as decomposition, pruning, quantization, and channel simulation, often overlook the inherent symmetries within these networks and thus waste bits on encoding redundant information. In this paper, we propose a format based on bits-back coding for storing rotationally symmetric Transformer weights more efficiently than the usual array layout at the same floating-point precision. We evaluate our method on Large Language Models (LLMs) pruned by SliceGPT (Ashkboos et al., 2024) and achieve a 3-5% reduction in total bit usage for free across different model sizes and architectures without impacting model performance within a certain numerical precision.

[LG-89] Rethinking the Expressiveness of GNNs: A Computational Model Perspective

Link: https://arxiv.org/abs/2410.01308
Authors: Guanyu Cui,Zhewei Wei,Hsin-Hao Su
Keywords: Graph Neural Networks, graph machine learning, considerable research focusing, Graph Neural, Neural Networks
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Graph Neural Networks (GNNs) are extensively employed in graph machine learning, with considerable research focusing on their expressiveness. Current studies often assess GNN expressiveness by comparing them to the Weisfeiler-Lehman (WL) tests or classical graph algorithms. However, we identify three key issues in existing analyses: (1) some studies use preprocessing to enhance expressiveness but overlook its computational costs; (2) some claim the anonymous WL test’s limited power while enhancing expressiveness using non-anonymous features, creating a mismatch; and (3) some characterize message-passing GNNs (MPGNNs) with the CONGEST model but make unrealistic assumptions about computational resources, allowing NP-Complete problems to be solved in O(m) depth. We contend that a well-defined computational model is urgently needed to serve as the foundation for discussions on GNN expressiveness. To address these issues, we introduce the Resource-Limited CONGEST (RL-CONGEST) model, incorporating optional preprocessing and postprocessing to form a framework for analyzing GNN expressiveness. Our framework sheds light on computational aspects, including the computational hardness of hash functions in the WL test and the role of virtual nodes in reducing network capacity. Additionally, we suggest that high-order GNNs correspond to first-order model-checking problems, offering new insights into their expressiveness.

[LG-90] Revisiting Hierarchical Text Classification: Inference and Metrics CONLL2024

Link: https://arxiv.org/abs/2410.01305
Authors: Roman Plaud,Matthieu Labeau,Antoine Saillenfest,Thomas Bonald
Keywords: structured space organized, Hierarchical text classification, task of assigning, assigning labels, structured space
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at CoNLL 2024

Abstract:Hierarchical text classification (HTC) is the task of assigning labels to a text within a structured space organized as a hierarchy. Recent works treat HTC as a conventional multilabel classification problem, therefore evaluating it as such. We instead propose to evaluate models based on specifically designed hierarchical metrics and we demonstrate the intricacy of metric choice and prediction inference method. We introduce a new challenging dataset and we evaluate fairly, recent sophisticated models, comparing them with a range of simple but strong baselines, including a new theoretically motivated loss. Finally, we show that those baselines are very often competitive with the latest models. This highlights the importance of carefully considering the evaluation methodology when proposing new methods for HTC. Code implementation and dataset are available at this https URL.

[LG-91] Speculative Coreset Selection for Task-Specific Fine-tuning

Link: https://arxiv.org/abs/2410.01296
Authors: Xiaoyu Zhang,Juan Zhai,Shiqing Ma,Chao Shen,Tianlin Li,Weipeng Jiang,Yang Liu
Keywords: requires significant computational, significant computational resources, large language models, target LLM, Task-specific fine-tuning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages, 4 figures, 14 tables

Abstract:Task-specific fine-tuning is essential for the deployment of large language models (LLMs), but it requires significant computational resources and time. Existing solutions have proposed coreset selection methods to improve data efficiency and reduce model training overhead, but they still have limitations: 1) Overlooking valuable samples at high pruning rates, which degrades the coreset’s performance. 2) Requiring high time overhead during coreset selection to fine-tune and evaluate the target LLM. In this paper, we introduce STAFF, a speculative coreset selection method. STAFF leverages a small model from the same family as the target LLM to efficiently estimate data scores and then verifies the scores on the target LLM to accurately identify and allocate more selection budget to important regions while maintaining coverage of easy regions. We evaluate STAFF on three LLMs and three downstream tasks and show that STAFF improves the performance of SOTA methods by up to 54.3% and reduces selection overhead by up to 70.5% at different pruning rates. Furthermore, we observe that the coreset selected by STAFF at low pruning rates (i.e., 20%) can even obtain better fine-tuning performance than the full dataset.

[LG-92] Towards a Law of Iterated Expectations for Heuristic Estimators

Link: https://arxiv.org/abs/2410.01290
Authors: Paul Christiano,Jacob Hilton,Andrea Lincoln,Eric Neyman,Mark Xu
Keywords: heuristic estimator, heuristic, estimator, mathbb, Christiano
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: 47 pages, 2 tables, 1 figure

Abstract:Christiano et al. (2022) define a heuristic estimator to be a hypothetical algorithm that estimates the values of mathematical expressions from arguments. In brief, a heuristic estimator $\mathbb{G}$ takes as input a mathematical expression $Y$ and a formal “heuristic argument” $\pi$, and outputs an estimate $\mathbb{G}(Y \mid \pi)$ of $Y$. In this work, we argue for the informal principle that a heuristic estimator ought not to be able to predict its own errors, and we explore approaches to formalizing this principle. Most simply, the principle suggests that $\mathbb{G}(Y - \mathbb{G}(Y \mid \pi) \mid \pi)$ ought to equal zero for all $Y$ and $\pi$. We argue that an ideal heuristic estimator ought to satisfy two stronger properties in this vein, which we term iterated estimation (by analogy to the law of iterated expectations) and error orthogonality. Although iterated estimation and error orthogonality are intuitively appealing, it can be difficult to determine whether a given heuristic estimator satisfies the properties. As an alternative approach, we explore accuracy: a property that (roughly) states that $\mathbb{G}$ has zero average error over a distribution of mathematical expressions. However, in the context of two estimation problems, we demonstrate barriers to creating an accurate heuristic estimator. We finish by discussing challenges and potential paths forward for finding a heuristic estimator that accords with our intuitive understanding of how such an estimator ought to behave, as well as the potential applications of heuristic estimators to understanding the behavior of neural networks.
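
The analogy the abstract draws to the law of iterated expectations can be written out as a one-line identity. The left side is a standard probability fact (the conditional expectation of Y given X is itself a function of X, so conditioning on X again leaves it unchanged); the right side is the abstract's simplest form of the principle. Pairing the conditioning variable X with the heuristic argument is an informal reading of the analogy, not a statement taken from the paper.

```latex
\mathbb{E}\bigl[\, Y - \mathbb{E}[Y \mid X] \;\bigm|\; X \,\bigr]
  \;=\; \mathbb{E}[Y \mid X] - \mathbb{E}[Y \mid X] \;=\; 0
\qquad\longleftrightarrow\qquad
\mathbb{G}\bigl( Y - \mathbb{G}(Y \mid \pi) \;\bigm|\; \pi \bigr) \;=\; 0 .
```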

[LG-93] Mitigating Copy Bias in In-Context Learning through Neuron Pruning

Link: https://arxiv.org/abs/2410.01288
Authors: Ameen Ali,Lior Wolf,Ivan Titov
Keywords: Large language models, demonstrated impressive few-shot, impressive few-shot in-context, Large language, few-shot in-context learning
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Large language models (LLMs) have demonstrated impressive few-shot in-context learning (ICL) abilities. Still, we show that they are sometimes prone to a 'copying bias', where they copy answers from provided examples instead of learning the underlying patterns. In this work, we propose a novel and simple method to mitigate such copying bias. First, we create a synthetic task and use the Integrated Gradients method to identify neurons that prioritize copying over generalization. We demonstrate that pruning these neurons consistently improves performance across a diverse set of ICL tasks. We also show that our method is applicable across various LLM architectures, including Transformers and State-Space Models, without requiring modifications. In our analysis, we adopt a task-recognition perspective on ICL and examine task vectors (Hendel et al., 2023) induced by the model. We find that pruning enhances the quality of these vectors, suggesting that the pruned neurons previously hindered effective task recognition.

[LG-94] Uncertainty-aware Human Mobility Modeling and Anomaly Detection

Link: https://arxiv.org/abs/2410.01281
Authors: Haomin Wen,Shurui Cao,Leman Akoglu
Keywords: GPS coordinates, bad-actor or malicious, model GPS data, malicious behavior detection, effective anomaly detection
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Given the GPS coordinates of a large collection of human agents over time, how can we model their mobility behavior toward effective anomaly detection (e.g. for bad-actor or malicious behavior detection) without any labeled data? Human mobility and trajectory modeling have been studied extensively with varying capacity to handle complex input, and performance-efficiency trade-offs. With the arrival of more expressive models in machine learning, we attempt to model GPS data as a sequence of stay-point events, each with a set of characterizing spatiotemporal features, and leverage modern sequence models such as Transformers for un/self-supervised training and inference. Notably, driven by the inherent stochasticity of certain individuals’ behavior, we equip our model with aleatoric/data uncertainty estimation. In addition, to handle data sparsity of a large variety of behaviors, we incorporate epistemic/model uncertainty into our model. Together, aleatoric and epistemic uncertainty enable a robust loss and training dynamics, as well as uncertainty-aware decision making in anomaly scoring. Experiments on large expert-simulated datasets with tens of thousands of agents demonstrate the effectiveness of our model against both forecasting and anomaly detection baselines.

[LG-95] Sparse Autoencoders Reveal Temporal Difference Learning in Large Language Models

Link: https://arxiv.org/abs/2410.01280
Authors: Can Demircan,Tankred Saanum,Akshay K. Jagadish,Marcel Binz,Eric Schulz
Keywords: large language models, input prompt, ability to adapt, adapt based, ubiquitous feature
Subjects: Machine Learning (cs.LG)

Abstract:In-context learning, the ability to adapt based on a few examples in the input prompt, is a ubiquitous feature of large language models (LLMs). However, as LLMs’ in-context learning abilities continue to improve, understanding this phenomenon mechanistically becomes increasingly important. In particular, it is not well-understood how LLMs learn to solve specific classes of problems, such as reinforcement learning (RL) problems, in-context. Through three different tasks, we first show that Llama 3 70B can solve simple RL problems in-context. We then analyze the residual stream of Llama using Sparse Autoencoders (SAEs) and find representations that closely match temporal difference (TD) errors. Notably, these representations emerge despite the model only being trained to predict the next token. We verify that these representations are indeed causally involved in the computation of TD errors and Q-values by performing carefully designed interventions on them. Taken together, our work establishes a methodology for studying and manipulating in-context learning with SAEs, paving the way for a more mechanistic understanding.
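
For readers unfamiliar with the quantity the SAE features are compared against, the following is a minimal sketch of a textbook Q-learning temporal difference error on a tabular Q-function. The exact TD target used in the paper is not given in the abstract, so the Q-learning form, the function names, and the parameter values here are assumptions.

```python
def td_error(q, state, action, reward, next_state, gamma=0.9, terminal=False):
    """Q-learning TD error: reward + gamma * max_a' Q(s', a') - Q(s, a).

    `q` maps each state to a dict of action values. On a terminal
    transition there is no bootstrap term.
    """
    bootstrap = 0.0 if terminal else max(q[next_state].values())
    return reward + gamma * bootstrap - q[state][action]


def q_update(q, state, action, reward, next_state,
             alpha=0.5, gamma=0.9, terminal=False):
    """Move Q(s, a) a fraction `alpha` of the way along the TD error, in place."""
    delta = td_error(q, state, action, reward, next_state, gamma, terminal)
    q[state][action] += alpha * delta
    return delta
```

A positive TD error means the transition turned out better than the current estimate predicted, and a negative one means it turned out worse; this signed surprise signal is what the abstract reports finding in the model's residual stream.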

[LG-96] Deep Unlearn: Benchmarking Machine Unlearning

Link: https://arxiv.org/abs/2410.01276
Authors: Xavier F. Cadet,Anastasia Borovykh,Mohammad Malekzadeh,Sara Ahmadi-Abhari,Hamed Haddadi
Keywords: trained machine learning, machine learning model, trained machine, machine learning, aims to remove
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Machine unlearning (MU) aims to remove the influence of particular data points from the learnable parameters of a trained machine learning model. This is a crucial capability in light of data privacy requirements, trustworthiness, and safety in deployed models. MU is particularly challenging for deep neural networks (DNNs), such as convolutional nets or vision transformers, as such DNNs tend to memorize a notable portion of their training dataset. Nevertheless, the community lacks a rigorous and multifaceted study that looks into the success of MU methods for DNNs. In this paper, we investigate 18 state-of-the-art MU methods across various benchmark datasets and models, with each evaluation conducted over 10 different initializations, a comprehensive evaluation involving MU over 100K models. We show that, with the proper hyperparameters, Masked Small Gradients (MSG) and Convolution Transpose (CT), consistently perform better in terms of model accuracy and run-time efficiency across different models, datasets, and initializations, assessed by population-based membership inference attacks (MIA) and per-sample unlearning likelihood ratio attacks (U-LiRA). Furthermore, our benchmark highlights the fact that comparing a MU method only with commonly used baselines, such as Gradient Ascent (GA) or Successive Random Relabeling (SRL), is inadequate, and we need better baselines like Negative Gradient Plus (NG+) with proper hyperparameter selection.

[LG-97] CANVAS: Commonsense-Aware Navigation System for Intuitive Human-Robot Interaction

Link: https://arxiv.org/abs/2410.01273
Authors: Suhwan Choi,Yongjun Cho,Minchan Kim,Jaeyoon Jung,Myunchul Joe,Yubeen Park,Minseo Kim,Sungwoong Kim,Sungjae Lee,Hwiseong Park,Jiwan Chung,Youngjae Yu
Keywords: requires optimizing movements, addressing scenario-specific goals, Real-life robot navigation, Real-life robot, reaching a destination
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: project page this https URL

Abstract:Real-life robot navigation involves more than just reaching a destination; it requires optimizing movements while addressing scenario-specific goals. An intuitive way for humans to express these goals is through abstract cues like verbal commands or rough sketches. Such human guidance may lack details or be noisy. Nonetheless, we expect robots to navigate as intended. For robots to interpret and execute these abstract instructions in line with human expectations, they must share a common understanding of basic navigation concepts with humans. To this end, we introduce CANVAS, a novel framework that combines visual and linguistic instructions for commonsense-aware navigation. Its success is driven by imitation learning, enabling the robot to learn from human navigation behavior. We present COMMAND, a comprehensive dataset with human-annotated navigation results, spanning over 48 hours and 219 km, designed to train commonsense-aware navigation systems in simulated environments. Our experiments show that CANVAS outperforms the strong rule-based system ROS NavStack across all environments, demonstrating superior performance with noisy instructions. Notably, in the orchard environment, where ROS NavStack records a 0% total success rate, CANVAS achieves a total success rate of 67%. CANVAS also closely aligns with human demonstrations and commonsense constraints, even in unseen environments. Furthermore, real-world deployment of CANVAS showcases impressive Sim2Real transfer with a total success rate of 69%, highlighting the potential of learning from human demonstrations in simulated environments for real-world applications.

[LG-98] “No Matter What You Do!”: Mitigating Backdoor Attacks in Graph Neural Networks

Link: https://arxiv.org/abs/2410.01272
Authors: Jiale Zhang,Chengcheng Zhu,Bosen Rao,Hao Sui,Xiaobing Sun,Bing Chen,Chunyi Zhou,Shouling Ji
Keywords: Deep Neural Networks, Recent studies, backdoor, backdoor attack, hard backdoor triggers
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 18 pages, 12 figures, 9 tables

Abstract:Recent studies have exposed that GNNs are vulnerable to several adversarial attacks, among which backdoor attack is one of the toughest. Similar to Deep Neural Networks (DNNs), backdoor attacks in GNNs lie in the fact that the attacker modifies a portion of graph data by embedding triggers and enforces the model to learn the trigger feature during the model training process. Despite the massive prior backdoor defense works on DNNs, defending against backdoor attacks in GNNs is largely unexplored, severely hindering the widespread application of GNNs in real-world tasks. To bridge this gap, we present GCleaner, the first backdoor mitigation method on GNNs. GCleaner can mitigate the presence of the backdoor logic within backdoored GNNs by reversing the backdoor learning procedure, aiming to restore the model performance to a level similar to that is directly trained on the original clean dataset. To achieve this objective, we ask: How to recover universal and hard backdoor triggers in GNNs? How to unlearn the backdoor trigger feature while maintaining the model performance? We conduct the graph trigger recovery via the explanation method to identify optimal trigger locations, facilitating the search of universal and hard backdoor triggers in the feature space of the backdoored model through maximal similarity. Subsequently, we introduce the backdoor unlearning mechanism, which combines knowledge distillation and gradient-based explainable knowledge for fine-grained backdoor erasure. Extensive experimental evaluations on four benchmark datasets demonstrate that GCleaner can reduce the backdoor attack success rate to 10% with only 1% of clean data, and has almost negligible degradation in model performance, which far outperforms the state-of-the-art (SOTA) defense methods.

[LG-99] Deep Learning and Machine Learning, Advancing Big Data Analytics and Management: Unveiling AI's Potential Through Tools, Techniques, and Applications

Link: https://arxiv.org/abs/2410.01268
Authors: Pohsun Feng,Ziqian Bi,Yizhu Wen,Xuanhe Pan,Benji Peng,Ming Liu,Jiawei Xu,Keyu Chen,Junyu Liu,Caitlyn Heqi Yin,Sen Zhang,Jinlang Wang,Qian Niu,Ming Li,Tianyang Wang
Keywords: big data analytics, data analytics, deep learning, machine learning, book serves
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: This book contains 156 pages and 9 figures

Abstract:This book serves as an introduction to deep learning and machine learning, focusing on their applications in big data analytics. It covers essential concepts, tools like ChatGPT and Claude, hardware recommendations, and practical guidance on setting up development environments using libraries like PyTorch and TensorFlow. Designed for beginners and advanced users alike, it provides step-by-step instructions, hands-on projects, and insights into AI’s future, including AutoML and edge computing.

[LG-100] Aggregation of Multi Diffusion Models for Enhancing Learned Representations

Link: https://arxiv.org/abs/2410.01262
Authors: Conghan Yue,Zhengwei Peng,Shiyan Du,Zhi Ji,Dongyu Zhang
Keywords: Diffusion models, achieved remarkable success, Diffusion, classifier-free guidance conditional, Multi Diffusion Models
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Diffusion models have achieved remarkable success in image generation, particularly with the various applications of classifier-free guidance conditional diffusion models. While many diffusion models perform well when controlling for a particular aspect among style, character, and interaction, they struggle with fine-grained control due to dataset limitations and intricate model architecture design. This paper introduces a novel algorithm, Aggregation of Multi Diffusion Models (AMDM), which synthesizes features from multiple diffusion models into a specified model, enhancing its learned representations to activate specific features for fine-grained control. AMDM consists of two key components: spherical aggregation and manifold optimization. Spherical aggregation merges intermediate variables from different diffusion models with minimal manifold deviation, while manifold optimization refines these variables to align with the intermediate data manifold, enhancing sampling quality. Experimental results demonstrate that AMDM significantly improves fine-grained control without additional training or inference time, proving its effectiveness. Additionally, it reveals that diffusion models initially focus on features such as position, attributes, and style, with later stages improving generation quality and consistency. AMDM offers a new perspective for tackling the challenges of fine-grained conditional control generation in diffusion models: We can fully utilize existing conditional diffusion models that control specific aspects, or develop new ones, and then aggregate them using the AMDM algorithm. This eliminates the need for constructing complex datasets, designing intricate model architectures, and incurring high training costs. Code is available at: this https URL

[LG-101] HelpSteer2-Preference: Complementing Ratings with Preferences

Link: https://arxiv.org/abs/2410.01257
Authors: Zhilin Wang,Alexander Bukharin,Olivier Delalleau,Daniel Egert,Gerald Shen,Jiaqi Zeng,Oleksii Kuchaiev,Yi Dong
Keywords: popular paradigms, adequately matched, Regression, Regression style, Reward
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 26 pages, 3 figures

Abstract:Reward models are critical for aligning models to follow instructions, and are typically trained following one of two popular paradigms: Bradley-Terry style or Regression style. However, there is a lack of evidence that either approach is better than the other, when adequately matched for data. This is primarily because these approaches require data collected in different (but incompatible) formats, meaning that adequately matched data is not available in existing public datasets. To tackle this problem, we release preference annotations (designed for Bradley-Terry training) to complement existing ratings (designed for Regression style training) in the HelpSteer2 dataset. To improve data interpretability, preference annotations are accompanied with human-written justifications. Using this data, we conduct the first head-to-head comparison of Bradley-Terry and Regression models when adequately matched for data. Based on insights derived from such a comparison, we propose a novel approach to combine Bradley-Terry and Regression reward modeling. A Llama-3.1-70B-Instruct model tuned with this approach scores 94.1 on RewardBench, emerging top of more than 140 reward models as of 1 Oct 2024. We also demonstrate the effectiveness of this reward model at aligning models to follow instructions in RLHF. We open-source this dataset (CC-BY-4.0 license) at this https URL and openly release the trained Reward Model at this https URL
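
For readers unfamiliar with the two paradigms, the sketch below contrasts the per-example losses they are usually associated with. It is a minimal illustration, not the authors' training code: the function names are hypothetical, the Bradley-Terry loss is written in its standard pairwise logistic form, and squared error is assumed as the regression objective.

```python
import math


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss: negative log-probability that the chosen response wins.

    Under the Bradley-Terry model, P(chosen beats rejected) is
    sigmoid(reward_chosen - reward_rejected).
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) == softplus(-margin), computed in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def regression_loss(predicted_rating: float, human_rating: float) -> float:
    """Pointwise loss: squared error against an absolute human rating (assumed)."""
    return (predicted_rating - human_rating) ** 2


if __name__ == "__main__":
    # Preference data only says which response is better ...
    print(bradley_terry_loss(reward_chosen=1.3, reward_rejected=0.2))
    # ... whereas rating data assigns each response its own score.
    print(regression_loss(predicted_rating=3.5, human_rating=4.0))
```

The two functions consume differently shaped data (a pair of responses versus one response with a score), which is the format mismatch the abstract says has prevented a matched comparison.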

[LG-102] Dual Approximation Policy Optimization

Link: https://arxiv.org/abs/2410.01249
Authors: Zhihan Xiong,Maryam Fazel,Lin Xiao
Keywords: Approximation Policy Optimization, Policy Optimization, propose Dual Approximation, general function approximation, policy mirror descent
Categories: Machine Learning (cs.LG)
Comments: 30 pages, 2 figures

Abstract:We propose Dual Approximation Policy Optimization (DAPO), a framework that incorporates general function approximation into policy mirror descent methods. In contrast to the popular approach of using the L₂-norm to measure function approximation errors, DAPO uses the dual Bregman divergence induced by the mirror map for policy projection. This duality framework has both theoretical and practical implications: not only does it achieve fast linear convergence with general function approximation, but it also includes several well-known practical methods as special cases, immediately providing strong convergence guarantees.

[LG-103] ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving

Link: https://arxiv.org/abs/2410.01228
Authors: Yifan Qiao,Shu Anzai,Shan Yu,Haoran Ma,Yang Wang,Miryung Kim,Harry Xu
Keywords: leveraging large language, interactive online jobs, GPU utilization, large language models, GPU
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Abstract:Many applications are leveraging large language models (LLMs) for complex tasks, and they generally demand low inference latency and high serving throughput for interactive online jobs such as chatbots. However, the tight latency requirement and high load variance of applications pose challenges to serving systems in achieving high GPU utilization. Due to the high costs of scheduling and preemption, today’s systems generally use separate clusters to serve online and offline inference tasks, and dedicate GPUs for online inferences to avoid interference. This approach leads to underutilized GPUs because one must reserve enough GPU resources for the peak expected load, even if the average load is low. This paper proposes to harvest stranded GPU resources for offline LLM inference tasks such as document summarization and LLM benchmarking. Unlike online inferences, these tasks usually run in a batch-processing manner with loose latency requirements, making them a good fit for stranded resources that are only available shortly. To enable safe and efficient GPU harvesting without interfering with online tasks, we built ConServe, an LLM serving system that contains (1) an execution engine that preempts running offline tasks upon the arrival of online tasks, (2) an incremental checkpointing mechanism that minimizes the amount of recomputation required by preemptions, and (3) a scheduler that adaptively batches offline tasks for higher GPU utilization. Our evaluation demonstrates that ConServe achieves strong performance isolation when co-serving online and offline tasks but at a much higher GPU utilization. When colocating practical online and offline workloads on popular models such as Llama-2-7B, ConServe achieves 2.35× higher throughput than state-of-the-art online serving systems and reduces serving latency by 84× compared to existing co-serving systems.

[LG-104] See Me and Believe Me: Causality and Intersectionality in Testimonial Injustice in Healthcare

Link: https://arxiv.org/abs/2410.01227
Authors: Kenya S. Andrews,Mesrob I. Ohannessian,Elena Zheleva
Keywords: testimonial injustice, heard and understood, correctly heard, Structural Causal Model, causal discovery
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In medical settings, it is critical that all who are in need of care are correctly heard and understood. When this is not the case due to prejudices a listener has, the speaker is experiencing testimonial injustice, which, building upon recent work, we quantify by the presence of several categories of unjust vocabulary in medical notes. In this paper, we use FCI, a causal discovery method, to study the degree to which certain demographic features could lead to marginalization (e.g., age, gender, and race) by way of contributing to testimonial injustice. To achieve this, we review physicians’ notes for each patient, where we identify occurrences of unjust vocabulary, along with the demographic features present, and use causal discovery to build a Structural Causal Model (SCM) relating those demographic features to testimonial injustice. We analyze and discuss the resulting SCMs to show the interaction of these factors and how they influence the experience of injustice. Despite the potential presence of some confounding variables, we observe how one contributing feature can make a person more prone to experiencing another contributor of testimonial injustice. There is no single root of injustice and thus intersectionality cannot be ignored. These results call for considering more than singular or equalized attributes of who a person is when analyzing and improving their experiences of bias and injustice. This work is thus a first foray at using causal discovery to understand the nuanced experiences of patients in medical settings, and its insights could be used to guide design principles throughout healthcare, to build trust and promote better patient care.

[LG-105] Induced Covariance for Causal Discovery in Linear Sparse Structures

Link: https://arxiv.org/abs/2410.01221
Authors: Saeed Mohseni-Sehdeh,Walid Saad
Keywords: traditional regression models, Causal models seek, models seek, regression models, seek to unravel
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Causal models seek to unravel the cause-effect relationships among variables from observed data, as opposed to mere mappings among them, as traditional regression models do. This paper introduces a novel causal discovery algorithm designed for settings in which variables exhibit linearly sparse relationships. In such scenarios, the causal links represented by directed acyclic graphs (DAGs) can be encapsulated in a structural matrix. The proposed approach leverages the structural matrix’s ability to reconstruct data and the statistical properties it imposes on the data to identify the correct structural matrix. This method does not rely on independence tests or graph fitting procedures, making it suitable for scenarios with limited training data. Simulation results demonstrate that the proposed method outperforms the well-known PC, GES, BIC exact search, and LINGAM-based methods in recovering linearly sparse causal structures.

[LG-106] Effective Tuning Strategies for Generalist Robot Manipulation Policies

Link: https://arxiv.org/abs/2410.01220
Authors: Wenbo Zhang,Yang Li,Yanyuan Qiao,Siyuan Huang,Jiajun Liu,Feras Dayoub,Xiao Ma,Lingqiao Liu
Keywords: Generalist robot manipulation, robot manipulation policies, Generalist robot, robot manipulation, potential to generalize
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:Generalist robot manipulation policies (GMPs) have the potential to generalize across a wide range of tasks, devices, and environments. However, existing policies continue to struggle with out-of-distribution scenarios due to the inherent difficulty of collecting sufficient action data to cover extensively diverse domains. While fine-tuning offers a practical way to quickly adapt a GMP to novel domains and tasks with limited samples, we observe that the performance of the resulting GMPs differs significantly with respect to the design choices of fine-tuning strategies. In this work, we first conduct an in-depth empirical study to investigate the effect of key factors in GMP fine-tuning strategies, covering the action space, policy head, supervision signal and the choice of tunable parameters, where 2,500 rollouts are evaluated for a single configuration. We systematically discuss and summarize our findings and identify the key design choices, which we believe give a practical guideline for GMP fine-tuning. We observe that in a low-data regime, with carefully chosen fine-tuning strategies, a GMP significantly outperforms the state-of-the-art imitation learning algorithms. The results presented in this work establish a new baseline for future studies on fine-tuned GMPs, and provide a significant addition to the GMP toolbox for the community.

[LG-107] Absolute State-wise Constrained Policy Optimization: High-Probability State-wise Constraints Satisfaction

Link: https://arxiv.org/abs/2410.01212
Authors: Weiye Zhao,Feihan Li,Yifan Sun,Yujie Wang,Rui Chen,Tianhao Wei,Changliu Liu
Keywords: Enforcing state-wise safety, Enforcing state-wise, state-wise safety constraints, state-wise, reinforcement learning
Categories: Machine Learning (cs.LG)
Comments: submission to Journal of Machine Learning Research

Abstract:Enforcing state-wise safety constraints is critical for the application of reinforcement learning (RL) in real-world problems, such as autonomous driving and robot manipulation. However, existing safe RL methods only enforce state-wise constraints in expectation or enforce hard state-wise constraints with strong assumptions. The former does not exclude the probability of safety violations, while the latter is impractical. Our insight is that although it is intractable to guarantee hard state-wise constraints in a model-free setting, we can enforce state-wise safety with high probability while excluding strong assumptions. To accomplish the goal, we propose Absolute State-wise Constrained Policy Optimization (ASCPO), a novel general-purpose policy search algorithm that guarantees high-probability state-wise constraint satisfaction for stochastic systems. We demonstrate the effectiveness of our approach by training neural network policies for extensive robot locomotion tasks, where the agent must adhere to various state-wise safety constraints. Our results show that ASCPO significantly outperforms existing methods in handling state-wise constraints across challenging continuous control tasks, highlighting its potential for real-world applications.

[LG-108] Debiasing Federated Learning with Correlated Client Participation

Link: https://arxiv.org/abs/2410.01209
Authors: Zhenyu Sun,Ziyang Zhang,Zheng Xu,Gauri Joshi,Pranay Sharma,Ermin Wei
Keywords: Federated Averaging, cross-device federated learning, federated learning, millions of mobile, small subset
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:In cross-device federated learning (FL) with millions of mobile clients, only a small subset of clients participate in training in every communication round, and Federated Averaging (FedAvg) is the most popular algorithm in practice. Existing analyses of FedAvg usually assume the participating clients are independently sampled in each round from a uniform distribution, which does not reflect real-world scenarios. This paper introduces a theoretical framework that models client participation in FL as a Markov chain to study optimization convergence when clients have non-uniform and correlated participation across rounds. We apply this framework to analyze a more general and practical pattern: every client must wait a minimum number of R rounds (minimum separation) before re-participating. We theoretically prove and empirically observe that increasing minimum separation reduces the bias induced by intrinsic non-uniformity of client availability in cross-device FL systems. Furthermore, we develop an effective debiasing algorithm for FedAvg that provably converges to the unbiased optimal solution under arbitrary minimum separation and unknown client availability distribution.
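
The minimum-separation pattern described above can be made concrete with a small simulation. This is a sketch under stated assumptions, not the paper's setup: `sample_rounds` and its parameters are hypothetical names, and clients are drawn uniformly from the eligible pool, whereas the paper studies non-uniform availability.

```python
import random


def sample_rounds(num_clients, clients_per_round, min_separation, num_rounds, seed=0):
    """Draw a participation schedule with a minimum separation of R rounds.

    A client that participates in round s is ineligible for rounds
    s+1 .. s+min_separation. Requires
    num_clients >= clients_per_round * (min_separation + 1) so that enough
    clients are always eligible.
    """
    rng = random.Random(seed)
    last_round = {}  # client id -> most recent round in which it participated
    schedule = []
    for t in range(num_rounds):
        eligible = [
            c for c in range(num_clients)
            if c not in last_round or t - last_round[c] > min_separation
        ]
        chosen = rng.sample(eligible, clients_per_round)  # uniform choice (assumption)
        for c in chosen:
            last_round[c] = t
        schedule.append(chosen)
    return schedule


if __name__ == "__main__":
    for t, clients in enumerate(sample_rounds(12, 2, 3, 8)):
        print(t, clients)
```

Because each round's eligible pool depends on the previous `min_separation` rounds, participation is correlated across rounds, which is the setting the Markov-chain framework is meant to capture.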

[LG-109] Were RNNs All We Needed?

Link: https://arxiv.org/abs/2410.01201
Authors: Leo Feng,Frederick Tung,Mohamed Osama Ahmed,Yoshua Bengio,Hossein Hajimirsadegh
Keywords: limitations of Transformers, scalability limitations, renewed interest, parallelizable during training, Transformers
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The scalability limitations of Transformers regarding sequence length have renewed interest in recurrent sequence models that are parallelizable during training. As a result, many novel recurrent architectures, such as S4, Mamba, and Aaren, have been proposed that achieve comparable performance. In this work, we revisit traditional recurrent neural networks (RNNs) from over a decade ago: LSTMs (1997) and GRUs (2014). While these models were slow due to requiring to backpropagate through time (BPTT), we show that by removing their hidden state dependencies from their input, forget, and update gates, LSTMs and GRUs no longer need to BPTT and can be efficiently trained in parallel. Building on this, we introduce minimal versions (minLSTMs and minGRUs) that (1) use significantly fewer parameters than their traditional counterparts and (2) are fully parallelizable during training (175x faster for a sequence of length 512). Lastly, we show that these stripped-down versions of decade-old RNNs match the empirical performance of recent sequence models.
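
To see why dropping the hidden-state dependence from the gates removes the need for BPTT, consider the sketch below. It is an illustration, not the paper's implementation: it uses a single scalar channel with hypothetical weights `w_z` and `w_h`. Once the gate and candidate depend only on the input, every step has the linear form h_t = a_t * h_{t-1} + b_t, and such recurrences can be evaluated with a prefix scan.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def min_gru_coefficients(xs, w_z, w_h):
    """Gate and candidate depend only on x_t, so all steps are computed up front."""
    coeffs = []
    for x in xs:
        z = sigmoid(w_z * x)      # update gate, no h_{t-1} term
        h_tilde = w_h * x         # candidate state, no h_{t-1} term
        coeffs.append((1.0 - z, z * h_tilde))  # h_t = a_t * h_{t-1} + b_t
    return coeffs


def run_sequential(coeffs, h0=0.0):
    """Reference implementation: step through time one position at a time."""
    h, states = h0, []
    for a, b in coeffs:
        h = a * h + b
        states.append(h)
    return states


def combine(left, right):
    """Compose two affine maps h -> a*h + b, applying `left` first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def run_parallel(coeffs, h0=0.0):
    """Hillis-Steele inclusive scan: O(log T) rounds of independent updates."""
    items = list(coeffs)
    step = 1
    while step < len(items):
        # Every element of this list could be computed concurrently.
        items = [
            combine(items[i - step], items[i]) if i >= step else items[i]
            for i in range(len(items))
        ]
        step *= 2
    return [a * h0 + b for a, b in items]


if __name__ == "__main__":
    coeffs = min_gru_coefficients([0.5, -1.0, 2.0, 0.3, 1.5], w_z=0.8, w_h=-1.2)
    print(run_sequential(coeffs))
    print(run_parallel(coeffs))
```

Both functions produce the same hidden states up to floating-point error; the scan version does so in a logarithmic number of rounds, which is what makes training parallelizable over the sequence.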

[LG-110] Stochastic Gradient Descent with Adaptive Data

Link: https://arxiv.org/abs/2410.01195
Authors: Ethan Che,Jing Dong,Xin T. Tong
Keywords: powerful optimization technique, online learning scenarios, Stochastic gradient descent, adaptively generated data, generated data stream
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:Stochastic gradient descent (SGD) is a powerful optimization technique that is particularly useful in online learning scenarios. Its convergence analysis is relatively well understood under the assumption that the data samples are independent and identically distributed (iid). However, applying SGD to policy optimization problems in operations research involves a distinct challenge: the policy changes the environment and thereby affects the data used to update the policy. The adaptively generated data stream involves samples that are non-stationary, no longer independent from each other, and affected by previous decisions. The influence of previous decisions on the data generated introduces bias in the gradient estimate, which presents a potential source of instability for online learning not present in the iid case. In this paper, we introduce simple criteria for the adaptively generated data stream to guarantee the convergence of SGD. We show that the convergence speed of SGD with adaptive data is largely similar to the classical iid setting, as long as the mixing time of the policy-induced dynamics is factored in. Our Lyapunov-function analysis allows one to translate existing stability analysis of stochastic systems studied in operations research into convergence rates for SGD, and we demonstrate this for queueing and inventory management problems. We also showcase how our result can be applied to study the sample complexity of an actor-critic policy gradient algorithm.

[LG-111] [Re] Network Deconvolution

Link: https://arxiv.org/abs/2410.01189
Authors: Rochana R. Obadage,Kumushini Thennakoon,Sarah M. Rajtmajer,Jian Wu
Keywords: Network Deconvolution, convolutional neural networks, work aims, aims to reproduce, reproduce the set
Categories: Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL); Machine Learning (cs.LG)
Comments: 12 pages, 5 figures

Abstract:Our work aims to reproduce the set of findings published in “Network Deconvolution” by Ye et al. (2020). That paper proposes an optimization technique for model training in convolutional neural networks. The proposed technique “network deconvolution” is used in convolutional neural networks to remove pixel-wise and channel-wise correlations before data is fed into each layer. In particular, we interrogate the validity of the authors’ claim that using network deconvolution instead of batch normalization improves deep learning model performance. Our effort confirms the validity of this claim, successfully reproducing the results reported in Tables 1 and 2 of the original paper. Our study involved 367 unique experiments across multiple architectures, datasets, and hyperparameter configurations. For Table 1, while there were some minor deviations in accuracy when compared to the original values (within 10%), the overall trend was consistent with the original study’s findings when training the models for 20 and 100 epochs. For Table 2, all 14 reproduced values were consistent with the original values. Additionally, we document the training and testing times for each architecture in Table 1 with 1, 20, and 100 epoch settings for both CIFAR-10 and CIFAR-100 datasets. We document the total execution times for Table 2 architectures with the ImageNet dataset. The data and software used for this reproducibility study are publicly available at this https URL.

[LG-112] Efficient PAC Learning of Halfspaces with Constant Malicious Noise Rate

Link: https://arxiv.org/abs/2410.01186
Authors: Xiaoyu Li,Jie Shen
Keywords: Understanding noise tolerance, Understanding noise, central quest, efficient PAC learning, noise tolerance
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)

Abstract:Understanding noise tolerance of learning algorithms under certain conditions is a central quest in learning theory. In this work, we study the problem of computationally efficient PAC learning of halfspaces in the presence of malicious noise, where an adversary can corrupt both instances and labels of training samples. The best-known noise tolerance either depends on a target error rate under distributional assumptions or on a margin parameter under large-margin conditions. In this work, we show that when both types of conditions are satisfied, it is possible to achieve constant noise tolerance by minimizing a reweighted hinge loss. Our key ingredients include: 1) an efficient algorithm that finds weights to control the gradient deterioration from corrupted samples, and 2) a new analysis on the robustness of the hinge loss equipped with such weights.
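
As a rough illustration of the objective named in the abstract, the sketch below evaluates a hinge loss in which each sample carries its own weight. The function name and the normalization by total weight are assumptions made here; the paper's algorithm for choosing the weights is not reproduced.

```python
def reweighted_hinge_loss(w, samples, weights):
    """Weighted average of hinge losses max(0, 1 - y * <w, x>).

    `samples` is a list of (x, y) pairs with y in {-1, +1}; `weights` holds one
    non-negative weight per sample. Dividing by the total weight is an
    assumption of this sketch.
    """
    total = 0.0
    for (x, y), weight in zip(samples, weights):
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        total += weight * max(0.0, 1.0 - margin)
    return total / sum(weights)


if __name__ == "__main__":
    w = [1.0, 0.0]
    samples = [
        ([2.0, 0.0], 1),    # clean, correctly classified with margin
        ([-2.0, 0.0], -1),  # clean, correctly classified with margin
        ([3.0, 0.0], -1),   # corrupted label far from the boundary
    ]
    print(reweighted_hinge_loss(w, samples, [1.0, 1.0, 1.0]))  # outlier dominates
    print(reweighted_hinge_loss(w, samples, [1.0, 1.0, 0.0]))  # outlier down-weighted
```

Down-weighting the corrupted sample removes its contribution to the loss, and hence to the gradient, which is the effect the weights are meant to control.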

[LG-113] A Deep Learning Approach for Imbalanced Tabular Data in Advertiser Prospecting: A Case of Direct Mail Prospecting KDD

Link: https://arxiv.org/abs/2410.01157
Authors: Sadegh Farhang,William Hayes,Nick Murphy,Jonathan Neddenriep,Nicholas Tyris
Keywords: direct mail, growing businesses, Acquiring new customers, direct mail advertising, customers
Categories: Machine Learning (cs.LG)
Comments: Third KDD Workshop on End-to-End Customer Journey Optimization

Abstract:Acquiring new customers is a vital process for growing businesses. Prospecting is the process of identifying and marketing to potential customers using methods ranging from online digital advertising, linear television, out of home, and direct mail. Despite the rapid growth in digital advertising (particularly social and search), research shows that direct mail remains one of the most effective ways to acquire new customers. However, there is a notable gap in the application of modern machine learning techniques within the direct mail space, which could significantly enhance targeting and personalization strategies. Methodologies deployed through direct mail are the focus of this paper. In this paper, we propose a supervised learning approach for identifying new customers, i.e., prospecting, which comprises how we define labels for our data and rank potential customers. The casting of prospecting to a supervised learning problem leads to imbalanced tabular data. The current state-of-the-art approach for tabular data is an ensemble of tree-based methods like random forest and XGBoost. We propose a deep learning framework for tabular imbalanced data. This framework is designed to tackle large imbalanced datasets with a vast number of numerical and categorical features. Our framework comprises two components: an autoencoder and a feed-forward neural network. We demonstrate the effectiveness of our framework through a transparent real-world case study of prospecting in direct mail advertising. Our results show that our proposed deep learning framework outperforms the state-of-the-art tree-based random forest approach when applied in the real world.

[LG-114] Text2PDE: Latent Diffusion Models for Accessible Physics Simulation

Link: https://arxiv.org/abs/2410.01153
Authors: Anthony Zhou,Zijie Li,Michael Schneier,John R Buchanan Jr,Amir Barati Farimani
Keywords: partial differential equation, inspired numerous works, Recent advances, neural PDE solvers, PDE solvers
Categories: Machine Learning (cs.LG)
Comments: 25 pages, 7 figures

Abstract:Recent advances in deep learning have inspired numerous works on data-driven solutions to partial differential equation (PDE) problems. These neural PDE solvers can often be much faster than their numerical counterparts; however, each presents its unique limitations and generally balances training cost, numerical accuracy, and ease of applicability to different problem setups. To address these limitations, we introduce several methods to apply latent diffusion models to physics simulation. Firstly, we introduce a mesh autoencoder to compress arbitrarily discretized PDE data, allowing for efficient diffusion training across various physics. Furthermore, we investigate full spatio-temporal solution generation to mitigate autoregressive error accumulation. Lastly, we investigate conditioning on initial physical quantities, as well as conditioning solely on a text prompt to introduce text2PDE generation. We show that language can be a compact, interpretable, and accurate modality for generating physics simulations, paving the way for more usable and accessible PDE solvers. Through experiments on both uniform and structured grids, we show that the proposed approach is competitive with current neural PDE solvers in both accuracy and efficiency, with promising scaling behavior up to ~3 billion parameters. By introducing a scalable, accurate, and usable physics simulator, we hope to bring neural PDE solvers closer to practical use.

[LG-115] Recovering Manifold Structure Using Ollivier-Ricci Curvature

Link: https://arxiv.org/abs/2410.01149
Authors: Tristan Luca Saidi,Abigail Hickok,Andrew J. Blumberg
Keywords: estimated metric distortion, prune spurious edges, nearest neighbor graphs, metric distortion, Ollivier-Ricci curvature
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Geometry (cs.CG)

Abstract:We introduce ORC-ManL, a new algorithm to prune spurious edges from nearest neighbor graphs using a criterion based on Ollivier-Ricci curvature and estimated metric distortion. Our motivation comes from manifold learning: we show that when the data generating the nearest-neighbor graph consists of noisy samples from a low-dimensional manifold, edges that shortcut through the ambient space have more negative Ollivier-Ricci curvature than edges that lie along the data manifold. We demonstrate that our method outperforms alternative pruning methods and that it significantly improves performance on many downstream geometric data analysis tasks that use nearest neighbor graphs as input. Specifically, we evaluate on manifold learning, persistent homology, dimension estimation, and others. We also show that ORC-ManL can be used to improve clustering and manifold learning of single-cell RNA sequencing data. Finally, we provide empirical convergence experiments that support our theoretical findings.

[LG-116] ProxiMix: Enhancing Fairness with Proximity Samples in Subgroups

Link: https://arxiv.org/abs/2410.01145
Authors: Jingyu Hu,Jun Hong,Mengnan Du,Weiru Liu
Keywords: machine learning, bias mitigation, developed for addressing, bias mitigation methods, addressing fairness issues
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Many bias mitigation methods have been developed for addressing fairness issues in machine learning. We found that using linear mixup alone, a data augmentation technique, for bias mitigation, can still retain biases present in dataset labels. Research presented in this paper aims to address this issue by proposing a novel pre-processing strategy in which both an existing mixup method and our new bias mitigation algorithm can be utilized to improve the generation of labels of augmented samples, which are proximity aware. Specifically, we propose ProxiMix, which keeps both pairwise and proximity relationships for fairer data augmentation. We conducted thorough experiments with three datasets, three ML models, and different hyperparameter settings. Our experimental results showed the effectiveness of ProxiMix from both fairness of predictions and fairness of recourse perspectives.
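
For context, the linear mixup baseline that the abstract says can retain label bias is shown below in its standard form. This is only the baseline, not ProxiMix: the function name is hypothetical, and the proximity-aware label generation described in the paper is not reproduced.

```python
def linear_mixup(x_i, y_i, x_j, y_j, lam):
    """Standard mixup: convex combination of two samples and of their labels.

    `lam` is the mixing coefficient in [0, 1]. The augmented label is
    interpolated directly from the two original labels.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    x_mixed = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_mixed = lam * y_i + (1.0 - lam) * y_j
    return x_mixed, y_mixed


if __name__ == "__main__":
    features, label = linear_mixup([1.0, 0.0], 1.0, [0.0, 1.0], 0.0, lam=0.25)
    print(features, label)
```

Because the mixed label is a direct interpolation of the two source labels, any bias in those labels carries over to the augmented sample, which is the issue the abstract raises.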

[LG-117] Explain Like I'm Five: Using LLMs to Improve PDE Surrogate Models with Text

Link: https://arxiv.org/abs/2410.01137
Authors: Cooper Lorsung,Amir Barati Farimani
Keywords: Solving Partial Differential, Partial Differential Equations, Solving Partial, Partial Differential, Differential Equations
Categories: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 22 pages, 15 figures, 7 tables

Abstract:Solving Partial Differential Equations (PDEs) is ubiquitous in science and engineering. Computational complexity and difficulty in writing numerical solvers have motivated the development of machine learning techniques to generate solutions quickly. Many existing methods are purely data driven, relying solely on numerical solution fields, rather than known system information such as boundary conditions and governing equations. However, the recent rise in popularity of Large Language Models (LLMs) has enabled easy integration of text in multimodal machine learning models. In this work, we use pretrained LLMs to integrate various amounts of known system information into PDE learning. Our multimodal approach significantly outperforms our baseline model, FactFormer, in both next-step prediction and autoregressive rollout performance on the 2D Heat, Burgers, Navier-Stokes, and Shallow Water equations. Further analysis shows that pretrained LLMs provide a highly structured latent space that is consistent with the amount of system information provided through text.

[LG-118] nGPT: Normalized Transformer with Representation Learning on the Hypersphere

Link: https://arxiv.org/abs/2410.01131
Authors: Ilya Loshchilov, Cheng-Ping Hsieh, Simeng Sun, Boris Ginsburg
Keywords: neural network architecture, normalized Transformer, network architecture, neural network, representation learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:We propose a novel neural network architecture, the normalized Transformer (nGPT) with representation learning on the hypersphere. In nGPT, all vectors forming the embeddings, MLP, attention matrices and hidden states are unit norm normalized. The input stream of tokens travels on the surface of a hypersphere, with each layer contributing a displacement towards the target output predictions. These displacements are defined by the MLP and attention blocks, whose vector components also reside on the same hypersphere. Experiments show that nGPT learns much faster, reducing the number of training steps required to achieve the same accuracy by a factor of 4 to 20, depending on the sequence length.
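
The idea of hidden states that stay on a unit hypersphere while each layer contributes a displacement can be sketched as follows. This is a minimal illustration, not the nGPT implementation: the interpolation rule, the fixed step size `alpha`, and the function names are assumptions chosen for the example.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]

def hypersphere_step(h, block_output, alpha=0.1):
    """Move a hidden state towards a block's output, then re-project.

    h: current hidden state.
    block_output: vector suggested by an attention or MLP block.
    alpha: hypothetical step size (assumed fixed here for simplicity).
    """
    h = normalize(h)
    target = normalize(block_output)
    # Displace h towards the target, then project back onto the sphere
    # so the state keeps unit norm after every layer.
    moved = [hi + alpha * (ti - hi) for hi, ti in zip(h, target)]
    return normalize(moved)
```

Calling `hypersphere_step([3.0, 4.0], [0.0, 2.0], alpha=0.5)` returns a unit-norm vector that lies closer to the direction `[0, 1]` than the starting state did.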

[LG-119] Using Interleaved Ensemble Unlearning to Keep Backdoors at Bay for Finetuning Vision Transformers

Link: https://arxiv.org/abs/2410.01128
Authors: Zeyu Michael Li
Keywords: computer vision tasks, Vision Transformers, computer vision, vision tasks, Convolutional Neural Networks
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Vision Transformers (ViTs) have become popular in computer vision tasks. Backdoor attacks, which trigger undesirable behaviours in models during inference, threaten ViTs’ performance, particularly in security-sensitive tasks. Although backdoor defences have been developed for Convolutional Neural Networks (CNNs), they are less effective for ViTs, and defences tailored to ViTs are scarce. To address this, we present Interleaved Ensemble Unlearning (IEU), a method for finetuning clean ViTs on backdoored datasets. In stage 1, a shallow ViT is finetuned to have high confidence on backdoored data and low confidence on clean data. In stage 2, the shallow ViT acts as a “gate” to block potentially poisoned data from the defended ViT. This data is added to an unlearn set and asynchronously unlearned via gradient ascent. We demonstrate IEU’s effectiveness on three datasets against 11 state-of-the-art backdoor attacks and show its versatility by applying it to different model architectures.

[LG-120] Almost Free: Self-concordance in Natural Exponential Families and an Application to Bandits NEURIPS

Link: https://arxiv.org/abs/2410.01112
Authors: Shuai Liu, Alex Ayoub, Flore Sentenac, Xiaoqi Tan, Csaba Szepesvári
Keywords: natural exponential families, single-parameter natural exponential, subgaussian natural exponential, prove that single-parameter, self-concordant with polynomial-sized
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Neural Information Processing Systems (NeurIPS) 2024

Abstract:We prove that single-parameter natural exponential families with subexponential tails are self-concordant with polynomial-sized parameters. For subgaussian natural exponential families we establish an exact characterization of the growth rate of the self-concordance parameter. Applying these findings to bandits allows us to fill gaps in the literature: We show that optimistic algorithms for generalized linear bandits enjoy regret bounds that are both second-order (scale with the variance of the optimal arm’s reward distribution) and free of an exponential dependence on the bound of the problem parameter in the leading term. To the best of our knowledge, ours is the first regret bound for generalized linear bandits with subexponential tails, broadening the class of problems to include Poisson, exponential and gamma bandits.

[LG-121] Embedding-based statistical inference on generative models

Link: https://arxiv.org/abs/2410.01106
Authors: Hayden Helm, Aranyak Acharyya, Brandon Duderstadt, Youngser Park, Carey E. Priebe
Keywords: produce human expert, human expert level, expert level content, topics and domains, produce human
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:The recent cohort of publicly available generative models can produce human expert level content across a variety of topics and domains. Given a model in this cohort as a base model, methods such as parameter efficient fine-tuning, in-context learning, and constrained decoding have further increased generative capabilities and improved both computational and data efficiency. Entire collections of derivative models have emerged as a byproduct of these methods and each of these models has a set of associated covariates such as a score on a benchmark, an indicator for if the model has (or had) access to sensitive information, etc. that may or may not be available to the user. For some model-level covariates, it is possible to use “similar” models to predict an unknown covariate. In this paper we extend recent results related to embedding-based representations of generative models – the data kernel perspective space – to classical statistical inference settings. We demonstrate that using the perspective space as the basis of a notion of “similar” is effective for multiple model-level inference tasks.

[LG-122] softmax is not enough (for sharp out-of-distribution)

Link: https://arxiv.org/abs/2410.01104
Authors: Petar Veličković, Christos Perivolaropoulos, Federico Barbero, Razvan Pascanu
Keywords: make sharp decisions, property of reasoning, ability to make, reasoning systems, key property
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: Comments welcome. 14 pages, 7 figures

Abstract:A key property of reasoning systems is the ability to make sharp decisions on their input data. For contemporary AI systems, a key carrier of sharp behaviour is the softmax function, with its capability to perform differentiable query-key lookups. It is a common belief that the predictive power of networks leveraging softmax arises from “circuits” which sharply perform certain kinds of computations consistently across many diverse inputs. However, for these circuits to be robust, they would need to generalise well to arbitrary valid inputs. In this paper, we dispel this myth: even for tasks as simple as finding the maximum key, any learned circuitry must disperse as the number of items grows at test time. We attribute this to a fundamental limitation of the softmax function to robustly approximate sharp functions, prove this phenomenon theoretically, and propose adaptive temperature as an ad-hoc technique for improving the sharpness of softmax at inference time.
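
The dispersion effect described above can be seen in a toy setting. The sketch below assumes one key with logit `gap` and all other keys with logit 0 (an illustrative setup, not taken from the paper), and reports the softmax weight on the largest key as the number of items grows. The hand-picked lower temperature is only a stand-in for the idea of sharpening at inference time; it is not the paper's adaptive-temperature scheme.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def max_item_weight(n_items, gap=2.0, temperature=1.0):
    """Softmax weight on the single largest key among n_items.

    Assumed toy setup: one key has logit `gap`, the rest have logit 0.
    """
    logits = [gap] + [0.0] * (n_items - 1)
    return softmax(logits, temperature)[0]

if __name__ == "__main__":
    for n in (4, 64, 1024, 16384):
        # With a fixed gap, the weight on the maximum shrinks as n grows;
        # a lower temperature delays (but does not remove) that decay.
        print(n, max_item_weight(n), max_item_weight(n, temperature=0.1))
```

In this setup the weight on the maximum equals `exp(gap/T) / (exp(gap/T) + n - 1)`, so for any fixed gap and temperature it tends to zero as `n` grows.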

[LG-123] Exploiting Structure in Offline Multi-Agent RL: The Benefits of Low Interaction Rank

Link: https://arxiv.org/abs/2410.01101
Authors: Wenhao Zhan, Scott Fujimoto, Zheqing Zhu, Jason D. Lee, Daniel R. Jiang, Yonathan Efroni
Keywords: low interaction rank, offline multi-agent reinforcement, multi-agent reinforcement learning, low interaction, offline MARL
Categories: Machine Learning (cs.LG)

Abstract:We study the problem of learning an approximate equilibrium in the offline multi-agent reinforcement learning (MARL) setting. We introduce a structural assumption – the interaction rank – and establish that functions with low interaction rank are significantly more robust to distribution shift compared to general ones. Leveraging this observation, we demonstrate that utilizing function classes with low interaction rank, when combined with regularization and no-regret learning, admits decentralized, computationally and statistically efficient learning in offline MARL. Our theoretical results are complemented by experiments that showcase the potential of critic architectures with low interaction rank in offline MARL, contrasting with commonly used single-agent value decomposition architectures.

[LG-124] Efficient and Private Marginal Reconstruction with Local Non-Negativity NEURIPS2024

Link: https://arxiv.org/abs/2410.01091
Authors: Brett Mullins, Miguel Fuentes, Yingtai Xiao, Daniel Kifer, Cameron Musco, Daniel Sheldon
Keywords: Differential privacy, millions of people, dominant standard, standard for formal, formal and quantifiable
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: To appear at NeurIPS 2024

Abstract:Differential privacy is the dominant standard for formal and quantifiable privacy and has been used in major deployments that impact millions of people. Many differentially private algorithms for query release and synthetic data contain steps that reconstruct answers to queries from answers to other queries measured by the mechanism. Reconstruction is an important subproblem for such mechanisms to economize the privacy budget, minimize error on reconstructed answers, and allow for scalability to high-dimensional datasets. In this paper, we introduce a principled and efficient postprocessing method ReM (Residuals-to-Marginals) for reconstructing answers to marginal queries. Our method builds on recent work on efficient mechanisms for marginal query release, based on making measurements using a residual query basis that admits efficient pseudoinversion, which is an important primitive used in reconstruction. An extension GReM-LNN (Gaussian Residuals-to-Marginals with Local Non-negativity) reconstructs marginals under Gaussian noise satisfying consistency and non-negativity, which often reduces error on reconstructed answers. We demonstrate the utility of ReM and GReM-LNN by applying them to improve existing private query answering mechanisms: ResidualPlanner and MWEM.

[LG-125] Exploring Empty Spaces: Human-in-the-Loop Data Augmentation

Link: https://arxiv.org/abs/2410.01088
Authors: Catherine Yeh, Donghao Ren, Yannick Assogba, Dominik Moritz, Fred Hohman
Keywords: make machine learning, machine learning models, robust and safe, crucial to make, make machine
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Data augmentation is crucial to make machine learning models more robust and safe. However, augmenting data can be challenging as it requires generating diverse data points to rigorously evaluate model behavior on edge cases and mitigate potential harms. Creating high-quality augmentations that cover these “unknown unknowns” is a time- and creativity-intensive task. In this work, we introduce Amplio, an interactive tool to help practitioners navigate “unknown unknowns” in unstructured text datasets and improve data diversity by systematically identifying empty data spaces to explore. Amplio includes three human-in-the-loop data augmentation techniques: Augment With Concepts, Augment by Interpolation, and Augment with Large Language Model. In a user study with 18 professional red teamers, we demonstrate the utility of our augmentation methods in helping generate high-quality, diverse, and relevant model safety prompts. We find that Amplio enabled red teamers to augment data quickly and creatively, highlighting the transformative potential of interactive augmentation workflows.

[LG-126] Deep Nets with Subsampling Layers Unwittingly Discard Useful Activations at Test-Time ECCV2024

Link: https://arxiv.org/abs/2410.01083
Authors: Chiao-An Yang, Ziwei Liu, Raymond A. Yeh
Keywords: Subsampling layers play, Subsampling layers, spatial dimensions, layers play, play a crucial
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ECCV 2024

Abstract:Subsampling layers play a crucial role in deep nets by discarding a portion of an activation map to reduce its spatial dimensions. This encourages the deep net to learn higher-level representations. Contrary to this motivation, we hypothesize that the discarded activations are useful and can be incorporated on the fly to improve models’ prediction. To validate our hypothesis, we propose a search and aggregate method to find useful activation maps to be used at test time. We applied our approach to the task of image classification and semantic segmentation. Extensive experiments over nine different architectures on multiple datasets show that our method consistently improves model test-time performance, complementing existing test-time augmentation techniques. Our code is available at this https URL.

[LG-127] Inferring Kernel epsilon-Machines: Discovering Structure in Complex Systems

Link: https://arxiv.org/abs/2410.01076
Authors: Alexandra M. Jurgens, Nicolas Brodu
Keywords: reproducing kernel Hilbert, mechanic causal states, computational mechanic causal, kernel Hilbert space, stochastic dynamical system
Categories: Machine Learning (cs.LG)

Abstract:Previously, we showed that computational mechanics’ causal states – predictively-equivalent trajectory classes for a stochastic dynamical system – can be cast into a reproducing kernel Hilbert space. The result is a widely-applicable method that infers causal structure directly from very different kinds of observations and systems. Here, we expand this method to explicitly introduce the causal diffusion components it produces. These encode the kernel causal-state estimates as a set of coordinates in a reduced dimension space. We show how each component extracts predictive features from data and demonstrate their application on four examples: first, a simple pendulum – an exactly solvable system; second, a molecular-dynamic trajectory of n-butane – a high-dimensional system with a well-studied energy landscape; third, the monthly sunspot sequence – the longest-running available time series of direct observations; and fourth, multi-year observations of an active crop field – a set of heterogeneous observations of the same ecosystem taken for over a decade. In this way, we demonstrate that the empirical kernel causal-states algorithm robustly discovers predictive structures for systems with widely varying dimensionality and stochasticity.

[LG-128] Convergent Privacy Loss of Noisy-SGD without Convexity and Smoothness

Link: https://arxiv.org/abs/2410.01068
Authors: Eli Chien, Pan Li
Keywords: study the Differential, Differential Privacy, bounded domain, hidden-state Noisy-SGD algorithms, Noisy-SGD algorithms
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Abstract:We study the Differential Privacy (DP) guarantee of hidden-state Noisy-SGD algorithms over a bounded domain. Standard privacy analysis for Noisy-SGD assumes all internal states are revealed, which leads to a divergent Rényi DP bound with respect to the number of iterations. Ye & Shokri (2022) and Altschuler & Talwar (2022) proved convergent bounds for smooth (strongly) convex losses, and raised open questions about whether these assumptions can be relaxed. We provide positive answers by proving a convergent Rényi DP bound for non-convex non-smooth losses, where we show that requiring losses to have Hölder continuous gradient is sufficient. We also provide a strictly better privacy bound compared to state-of-the-art results for smooth strongly convex losses. Our analysis relies on the improvement of shifted divergence analysis in multiple aspects, including forward Wasserstein distance tracking, identifying the optimal shifts allocation, and the Hölder reduction lemma. Our results further elucidate the benefit of hidden-state analysis for DP and its applicability.

[LG-129] Structure-Preserving Operator Learning

Link: https://arxiv.org/abs/2410.01065
Authors: Nacime Bouziani, Nicolas Boullé
Keywords: partial differential equations, differential equations directly, data holds great, holds great promise, complex dynamics driven
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Numerical Analysis (math.NA)

Abstract:Learning complex dynamics driven by partial differential equations directly from data holds great promise for fast and accurate simulations of complex physical systems. In most cases, this problem can be formulated as an operator learning task, where one aims to learn the operator representing the physics of interest, which entails discretization of the continuous system. However, preserving key continuous properties at the discrete level, such as boundary conditions, and addressing physical systems with complex geometries is challenging for most existing approaches. We introduce a family of operator learning architectures, structure-preserving operator networks (SPONs), that preserve key mathematical and physical properties of the continuous system by leveraging finite element (FE) discretizations of the input-output spaces. SPONs are encode-process-decode architectures that are end-to-end differentiable, where the encoder and decoder follow from the discretizations of the input-output spaces. SPONs can operate on complex geometries, enforce certain boundary conditions exactly, and offer theoretical guarantees. Our framework provides a flexible way of devising structure-preserving architectures tailored to specific applications, and offers an explicit trade-off between performance and efficiency, all thanks to the FE discretization of the input-output spaces. Additionally, we introduce a multigrid-inspired SPON architecture that yields improved performance at higher efficiency. Finally, we release software to automate the design and training of SPON architectures.

[LG-130] Uncertainty Modelling and Robust Observer Synthesis using the Koopman Operator

Link: https://arxiv.org/abs/2410.01057
Authors: Steven Dahdah, James Richard Forbes
Keywords: Koopman operator, Koopman, paper proposes, nonlinear Koopman observers, robust nonlinear Koopman
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Dynamical Systems (math.DS)
Comments: 16 pages, 15 figures

Abstract:This paper proposes a robust nonlinear observer synthesis method for a population of systems modelled using the Koopman operator. The Koopman operator allows nonlinear systems to be rewritten as infinite-dimensional linear systems. A finite-dimensional approximation of the Koopman operator can be identified directly from data, yielding an approximately linear model of a nonlinear system. The proposed observer synthesis method is made possible by this linearity, which in turn allows uncertainty within a population of Koopman models to be quantified in the frequency domain. Using this uncertainty model, linear robust control techniques are used to synthesize robust nonlinear Koopman observers. A population of several dozen motor drives is used to experimentally demonstrate the proposed method. Manufacturing variation is characterized in the frequency domain, and a robust Koopman observer is synthesized using mixed $\mathcal{H}_2$-$\mathcal{H}_\infty$ optimal control.

[LG-131] Spherical Analysis of Learning Nonlinear Functionals

Link: https://arxiv.org/abs/2410.01047
Authors: Zhenyu Yang, Shuo Huang, Han Feng, Ding-Xuan Zhou
Keywords: recent years, growing interest, sets of functions, functionals defined, defined on sets
Categories: Machine Learning (cs.LG); Functional Analysis (math.FA); Machine Learning (stat.ML)

Abstract:In recent years, there has been growing interest in the field of functional neural networks. They have been proposed and studied with the aim of approximating continuous functionals defined on sets of functions on Euclidean domains. In this paper, we consider functionals defined on sets of functions on spheres. The approximation ability of deep ReLU neural networks is investigated by novel spherical analysis using an encoder-decoder framework. An encoder comes first to accommodate the infinite-dimensional nature of the domain of functionals. It utilizes spherical harmonics to help us extract the latent finite-dimensional information of functions, which in turn facilitates the next step of approximation analysis using fully connected neural networks. Moreover, real-world objects are frequently sampled discretely and are often corrupted by noise. Therefore, encoders with discrete input and those with discrete and random noise input are constructed, respectively. The approximation rates with different encoder structures are provided therein.

[LG-132] Don't Stop Me Now: Embedding Based Scheduling for LLMs

Link: https://arxiv.org/abs/2410.01035
Authors: Rana Shahout, Eran Malach, Chunwei Liu, Weifan Jiang, Minlan Yu, Michael Mitzenmacher
Keywords: interactive Large Language, Large Language Model, Large Language, impacts user engagement, directly impacts user
Categories: Machine Learning (cs.LG)

Abstract:Efficient scheduling is crucial for interactive Large Language Model (LLM) applications, where low request completion time directly impacts user engagement. Size-based scheduling algorithms like Shortest Remaining Processing Time (SRPT) aim to reduce average request completion time by leveraging known or estimated request sizes and allowing preemption by incoming jobs with shorter service times. However, two main challenges arise when applying size-based scheduling to LLM systems. First, accurately predicting output lengths from prompts is challenging and often resource-intensive, making it impractical for many systems. As a result, state-of-the-art LLM systems default to first-come, first-served scheduling, which can lead to head-of-line blocking and reduced system efficiency. Second, preemption introduces extra memory overhead to LLM systems as they must maintain intermediate states for unfinished (preempted) requests. In this paper, we propose TRAIL, a method to obtain output predictions from the target LLM itself. After generating each output token, we recycle the embedding of its internal structure as input for a lightweight classifier that predicts the remaining length for each running request. Using these predictions, we propose a prediction-based SRPT variant with limited preemption designed to account for memory overhead in LLM systems. This variant allows preemption early in request execution when memory consumption is low but restricts preemption as requests approach completion to optimize resource utilization. On the theoretical side, we derive a closed-form formula for this SRPT variant in an M/G/1 queue model, which demonstrates its potential value. In our system, we implement this preemption policy alongside our embedding-based prediction method.
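
To make the contrast between first-come, first-served and SRPT concrete, the sketch below simulates both on a single server. It is a textbook-style illustration under strong assumptions: job sizes are known exactly and preemption is free, whereas TRAIL works with predicted lengths and restricts preemption. The function names and the `(arrival_time, size)` job format are hypothetical.

```python
import heapq

def fcfs_mean_completion(jobs):
    """Mean completion time (finish - arrival) under first-come, first-served.

    jobs: non-empty list of (arrival_time, size) tuples.
    """
    t = 0.0
    total = 0.0
    for arrival, size in sorted(jobs):
        t = max(t, arrival) + size  # wait for the server, then run to the end
        total += t - arrival
    return total / len(jobs)

def srpt_mean_completion(jobs):
    """Mean completion time under preemptive SRPT with known sizes."""
    jobs = sorted(jobs)
    n = len(jobs)
    t, i, total = 0.0, 0, 0.0
    heap = []  # entries are (remaining_work, arrival_time)
    while i < n or heap:
        if not heap:
            t = max(t, jobs[i][0])  # server idle: jump to the next arrival
        while i < n and jobs[i][0] <= t:
            heapq.heappush(heap, (jobs[i][1], jobs[i][0]))
            i += 1
        remaining, arrival = heapq.heappop(heap)  # least remaining work first
        next_arrival = jobs[i][0] if i < n else float("inf")
        run = min(remaining, next_arrival - t)  # run until done or next arrival
        t += run
        if run < remaining:
            heapq.heappush(heap, (remaining - run, arrival))  # may be preempted
        else:
            total += t - arrival
    return total / n
```

For `jobs = [(0, 10), (1, 1), (2, 1)]`, FCFS makes both short jobs wait behind the long one (head-of-line blocking), while SRPT lets them preempt it and finish almost immediately.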

[LG-133] Single-Shot Learning of Stable Dynamical Systems for Long-Horizon Manipulation Tasks ICRA2025

Link: https://arxiv.org/abs/2410.01033
Authors: Alexandre St-Aubin (1), Amin Abyaneh (1), Hsiu-Chin Lin (1) ((1) McGill University)
Keywords: Mastering complex sequential, Mastering complex, complex sequential tasks, sequential tasks continues, complex sequential
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 7 pages, submitted to ICRA 2025

Abstract:Mastering complex sequential tasks continues to pose a significant challenge in robotics. While there has been progress in learning long-horizon manipulation tasks, most existing approaches lack rigorous mathematical guarantees for ensuring reliable and successful execution. In this paper, we extend previous work on learning long-horizon tasks and stable policies, focusing on improving task success rates while reducing the amount of training data needed. Our approach introduces a novel method that (1) segments long-horizon demonstrations into discrete steps defined by waypoints and subgoals, and (2) learns globally stable dynamical system policies to guide the robot to each subgoal, even in the face of sensory noise and random disturbances. We validate our approach through both simulation and real-world experiments, demonstrating effective transfer from simulation to physical robotic platforms. Code is available at this https URL

[LG-134] GPTreeO: An R package for continual regression with dividing local Gaussian processes

Link: https://arxiv.org/abs/2410.01024
Authors: Timo Braun, Anders Kvellestad, Riccardo De Bin
Keywords: scalable Gaussian process, Local Gaussian Processes, Dividing Local Gaussian, Gaussian process, continual learning problems
Categories: Machine Learning (cs.LG); Computation (stat.CO)

Abstract:We introduce GPTreeO, a flexible R package for scalable Gaussian process (GP) regression, particularly tailored to continual learning problems. GPTreeO builds upon the Dividing Local Gaussian Processes (DLGP) algorithm, in which a binary tree of local GP regressors is dynamically constructed using a continual stream of input data. In GPTreeO we extend the original DLGP algorithm by allowing continual optimisation of the GP hyperparameters, incorporating uncertainty calibration, and introducing new strategies for how the local partitions are created. Moreover, the modular code structure allows users to interface their favourite GP library to perform the local GP regression in GPTreeO. The flexibility of GPTreeO gives the user fine-grained control of the balance between computational speed, accuracy, stability and smoothness. We conduct a sensitivity analysis to show how GPTreeO’s configurable features impact the regression performance in a continual learning setting.

[LG-135] Investigating the Synergistic Effects of Dropout and Residual Connections on Language Model Training

Link: https://arxiv.org/abs/2410.01019
Authors: Qingyang Li, Weimao Ke
Keywords: language model training, pivotal role, techniques in mitigating, mitigating overfitting, language model
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 5 pages, 4 figures

Abstract:This paper examines the pivotal role of dropout techniques in mitigating overfitting in language model training. It conducts a comprehensive investigation into the influence of variable dropout rates on both individual layers and residual connections within the context of language modeling. Our study conducts training of a decoder implementation on the classic Tiny Shakespeare data to examine the effects of the adjustments on training efficiency and validation error. Results not only confirm the benefits of dropout for regularization and residuals for convergence, but also reveal their interesting interactions. There exists an important trade-off between the depth of residual connections and the dropout on these connections for optimal deep neural network convergence and generalization.

[LG-136] Machine Learning-Assisted Intrusion Detection for Enhancing Internet of Things Security

Link: https://arxiv.org/abs/2410.01016
Authors: Mona Esmaeili, Morteza Rahimi, Matin Khajavi, Dorsa Farahmand, Hadi Jabbari Saray
Keywords: Internet of Things, networked and integrated, secure IoT devices, Attacks, Things
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:Attacks against the Internet of Things (IoT) are rising as devices, applications, and interactions become more networked and integrated. The increase in cyber-attacks that target IoT networks poses a huge vulnerability and threat to the privacy, security, functionality, and availability of critical systems, which leads to operational disruptions, financial losses, identity thefts, and data breaches. To efficiently secure IoT devices, real-time detection of intrusion systems is critical, especially those using machine learning to identify threats and mitigate risks and vulnerabilities. This paper investigates the latest research on machine learning-based intrusion detection strategies for IoT security, concentrating on real-time responsiveness, detection accuracy, and algorithm efficiency. Key studies were reviewed from all well-known academic databases, and a taxonomy was provided for the existing approaches. This review also highlights existing research gaps and outlines the limitations of current IoT security frameworks to offer practical insights for future research directions and developments.

[LG-137] Back to Bayesics: Uncovering Human Mobility Distributions and Anomalies with an Integrated Statistical and Neural Framework

Link: https://arxiv.org/abs/2410.01011
Authors: Minxuan Duan, Yinlong Qian, Lingyi Zhao, Zihao Zhou, Zeeshan Rasheed, Rose Yu, Khurram Shafique
Keywords: fall short due, high dimensionality inherent, handle the complexity, Existing methods, fall short
Categories: Machine Learning (cs.LG)
Comments: 12 pages

Abstract:Existing methods for anomaly detection often fall short due to their inability to handle the complexity, heterogeneity, and high dimensionality inherent in real-world mobility data. In this paper, we propose DeepBayesic, a novel framework that integrates Bayesian principles with deep neural networks to model the underlying multivariate distributions from sparse and complex datasets. Unlike traditional models, DeepBayesic is designed to manage heterogeneous inputs, accommodating both continuous and categorical data to provide a more comprehensive understanding of mobility patterns. The framework features customized neural density estimators and hybrid architectures, allowing for flexibility in modeling diverse feature distributions and enabling the use of specialized neural networks tailored to different data types. Our approach also leverages agent embeddings for personalized anomaly detection, enhancing its ability to distinguish between normal and anomalous behaviors for individual agents. We evaluate our approach on several mobility datasets, demonstrating significant improvements over state-of-the-art anomaly detection methods. Our results indicate that incorporating personalization and advanced sequence modeling techniques can substantially enhance the ability to detect subtle and complex anomalies in spatiotemporal event sequences.

[LG-138] CktGen: Specification-Conditioned Analog Circuit Generation

Link: https://arxiv.org/abs/2410.00995
Authors: Yuxuan Hou, Jianrong Zhang, Hua Chen, Min Zhou, Faxin Yu, Hehe Fan, Yi Yang
Keywords: presents significant challenges, Automatic synthesis, circuits presents significant, significant challenges, analog circuits presents
Categories: Machine Learning (cs.LG)

Abstract:Automatic synthesis of analog circuits presents significant challenges. Existing methods usually treat the task as an optimization problem, which limits their transferability and reusability for new requirements. To address this limitation, we introduce a task that directly generates analog circuits based on given specifications, termed specification-conditioned analog circuit generation. Specifically, we propose CktGen, a simple yet effective variational autoencoder (VAE) model that maps specifications and circuits into a joint latent space and reconstructs the circuit from the latent. Moreover, given that a single specification can correspond to multiple distinct circuits, simply minimizing the distance between the mapped latent representations of the circuit and specification does not capture these one-to-many relationships. To address this, we integrate contrastive learning and classifier guidance to prevent model collapse. We conduct comprehensive experiments on the Open Circuit Benchmark (OCB) and introduce new evaluation metrics for cross-model consistency in the specification-to-circuit generation task. Experimental results demonstrate substantial improvements over existing state-of-the-art methods.

[LG-139] Tight Rates for Bandit Control Beyond Quadratics NEURIPS2024

Link: https://arxiv.org/abs/2410.00993
Authors: Y. Jennifer Sun, Zhou Lu
Keywords: Linear Quadratic Control, Linear Quadratic, Unlike classical control, Unlike classical, classical control theory
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: NeurIPS 2024

Abstract:Unlike classical control theory, such as Linear Quadratic Control (LQC), real-world control problems are highly complex. These problems often involve adversarial perturbations, bandit feedback models, and non-quadratic, adversarially chosen cost functions. A fundamental yet unresolved question is whether optimal regret can be achieved for these general control problems. The standard approach to addressing this problem involves a reduction to bandit convex optimization with memory. In the bandit setting, constructing a gradient estimator with low variance is challenging due to the memory structure and non-quadratic loss functions. In this paper, we provide an affirmative answer to this question. Our main contribution is an algorithm that achieves an $\tilde{O}(\sqrt{T})$ optimal regret for bandit non-stochastic control with strongly-convex and smooth cost functions in the presence of adversarial perturbations, improving the previously known $\tilde{O}(T^{2/3})$ regret bound from (Cassel and Koren, 2020). Our algorithm overcomes the memory issue by reducing the problem to Bandit Convex Optimization (BCO) without memory and addresses general strongly-convex costs using recent advancements in BCO from (Suggala et al., 2024). Along the way, we develop an improved algorithm for BCO with memory, which may be of independent interest.

[LG-140] Tackling the Accuracy-Interpretability Trade-off in a Hierarchy of Machine Learning Models for the Prediction of Extreme Heatwaves

Link: https://arxiv.org/abs/2410.00984
Authors: Alessandro Lovo,Amaury Lancelin,Corentin Herbert,Freddy Bouchet
Keywords: Machine Learning, Convolutional Neural Networks, Interpretable Neural Network, Learning, Intrinsically Interpretable Neural
Categories: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments:

Click to view abstract

Abstract:When performing predictions that use Machine Learning (ML), we are mainly interested in performance and interpretability. This generates a natural trade-off, where complex models generally have higher skills but are harder to explain and thus trust. Interpretability is particularly important in the climate community, where we aim at gaining a physical understanding of the underlying phenomena. Even more so when the prediction concerns extreme weather events with high impact on society. In this paper, we perform probabilistic forecasts of extreme heatwaves over France, using a hierarchy of increasingly complex ML models, which allows us to find the best compromise between accuracy and interpretability. More precisely, we use models that range from a global Gaussian Approximation (GA) to deep Convolutional Neural Networks (CNNs), with the intermediate steps of a simple Intrinsically Interpretable Neural Network (IINN) and a model using the Scattering Transform (ScatNet). Our findings reveal that CNNs provide higher accuracy, but their black-box nature severely limits interpretability, even when using state-of-the-art Explainable Artificial Intelligence (XAI) tools. In contrast, ScatNet achieves similar performance to CNNs while providing greater transparency, identifying key scales and patterns in the data that drive predictions. This study underscores the potential of interpretability in ML models for climate science, demonstrating that simpler models can rival the performance of their more complex counterparts, all the while being much easier to understand. This gained interpretability is crucial for building trust in model predictions and uncovering new scientific insights, ultimately advancing our understanding and management of extreme weather events.

[LG-141] Robust Guided Diffusion for Offline Black-Box Optimization

Link: https://arxiv.org/abs/2410.00983
Authors: Can (Sam) Chen,Christopher Beckham,Zixuan Liu,Xue Liu,Christopher Pal
Keywords: Offline black-box optimization, black-box optimization aims, measured properties, aims to maximize, dataset of designs
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 21 pages

Click to view abstract

Abstract:Offline black-box optimization aims to maximize a black-box function using an offline dataset of designs and their measured properties. Two main approaches have emerged: the forward approach, which learns a mapping from input to its value, thereby acting as a proxy to guide optimization, and the inverse approach, which learns a mapping from value to input for conditional generation. (a) Although proxy-free (classifier-free) diffusion shows promise in robustly modeling the inverse mapping, it lacks explicit guidance from proxies, essential for generating high-performance samples beyond the training distribution. Therefore, we propose proxy-enhanced sampling, which utilizes the explicit guidance from a trained proxy to bolster proxy-free diffusion with enhanced sampling control. (b) Yet, the trained proxy is susceptible to out-of-distribution issues. To address this, we devise the module diffusion-based proxy refinement, which seamlessly integrates insights from proxy-free diffusion back into the proxy for refinement. To sum up, we propose Robust Guided Diffusion for Offline Black-box Optimization (RGD), combining the advantages of proxy (explicit guidance) and proxy-free diffusion (robustness) for effective conditional generation. RGD achieves state-of-the-art results on various design-bench tasks, underscoring its efficacy. Our code is at this https URL.

[LG-142] RisingBALLER: A player is a token, a match is a sentence, A path towards a foundational model for football players data analytics

Link: https://arxiv.org/abs/2410.00943
Authors: Akedjou Achraff Adjileye
Keywords: transformer model trained, match-specific player representations, publicly available approach, approach that leverages, leverages a transformer
Categories: Machine Learning (cs.LG)
Comments: 18 pages, 6 figures. The paper will be presented at the StatsBomb Conference 2024 ( this https URL )

Click to view abstract

Abstract:In this paper, I introduce RisingBALLER, the first publicly available approach that leverages a transformer model trained on football match data to learn match-specific player representations. Drawing inspiration from advances in language modeling, RisingBALLER treats each football match as a unique sequence in which players serve as tokens, with their embeddings shaped by the specific context of the match. Through the use of masked player prediction (MPP) as a pre-training task, RisingBALLER learns foundational features for football player representations, similar to how language models learn semantic features for text representations. As a downstream task, I introduce next match statistics prediction (NMSP) to showcase the effectiveness of the learned player embeddings. The NMSP model surpasses a strong baseline commonly used for performance forecasting within the community. Furthermore, I conduct an in-depth analysis to demonstrate how the learned embeddings by RisingBALLER can be used in various football analytics tasks, such as producing meaningful positional features that capture the essence and variety of player roles beyond rigid x,y coordinates, team cohesion estimation, and similar player retrieval for more effective data-driven scouting. More than a simple machine learning model, RisingBALLER is a comprehensive framework designed to transform football data analytics by learning high-level foundational features for players, taking into account the context of each match. It offers a deeper understanding of football players beyond individual statistics.

[LG-143] MoS: Unleashing Parameter Efficiency of Low-Rank Adaptation with Mixture of Shards

Link: https://arxiv.org/abs/2410.00938
Authors: Sheng Wang,Liheng Chen,Pengan Chen,Jingwei Dong,Boyang Xue,Jiyue Jiang,Lingpeng Kong,Chuan Wu
Keywords: explosive GPU memory, GPU memory overhead, large language models, language models necessitates, numerous customized models
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The rapid scaling of large language models necessitates more lightweight finetuning methods to reduce the explosive GPU memory overhead when numerous customized models are served simultaneously. Targeting more parameter-efficient low-rank adaptation (LoRA), parameter sharing presents a promising solution. Empirically, our research into high-level sharing principles highlights the indispensable role of differentiation in reversing the detrimental effects of pure sharing. Guided by this finding, we propose Mixture of Shards (MoS), incorporating both inter-layer and intra-layer sharing schemes, and integrating four nearly cost-free differentiation strategies, namely subset selection, pair dissociation, vector sharding, and shard privatization. Briefly, it selects a designated number of shards from global pools with a Mixture-of-Experts (MoE)-like routing mechanism before sequentially concatenating them to low-rank matrices. Hence, it retains all the advantages of LoRA while offering enhanced parameter efficiency, and effectively circumvents the drawbacks of peer parameter-sharing methods. Our empirical experiments demonstrate approximately 8x parameter savings in a standard LoRA setting. The ablation study confirms the significance of each component. Our insights into parameter sharing and MoS method may illuminate future developments of more parameter-efficient finetuning methods.

[LG-144] ACEV: Unsupervised Intersecting Manifold Segmentation using Adaptation to Angular Change of Eigenvectors in Intrinsic Dimension

Link: https://arxiv.org/abs/2410.00930
Authors: Subhadip Boral,Rikathi Pal,Ashish Ghosh
Keywords: Intersecting manifold segmentation, Intersecting manifold, data points, focus of research, distinct properties
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Geometry (cs.CG)
Comments: 14 pages, 7 figures, 7 tables

Click to view abstract

Abstract:Intersecting manifold segmentation has been a focus of research, where individual manifolds, that intersect with other manifolds, are separated to discover their distinct properties. The proposed method is based on the intuition that when a manifold in D dimensional space with an intrinsic dimension of d intersects with another manifold, the data variance grows in more than d directions. The proposed method measures local data variances and determines their vector directions. It counts the number of vectors with non-zero variance, which determines the manifold’s intrinsic dimension. For detection of the intersection region, the method adapts to the changes in the angular gaps between the corresponding direction vectors of the child and parent using exponential moving averages using a tree structure construction. Accordingly, it includes those data points in the same manifold whose neighborhood is within the adaptive angular difference and eventually identifies the data points in the intersection area of manifolds. Data points whose inclusion in the neighborhood-identified data points increases their intrinsic dimensionality are removed based on data variance and distance. The proposed method performs better than 18 SOTA manifold segmentation methods in ARI and NMI scores over 14 real-world datasets with lesser time complexity and better stability.

[LG-145] A Knowledge-Informed Large Language Model Framework for U.S. Nuclear Power Plant Shutdown Initiating Event Classification for Probabilistic Risk Assessment

Link: https://arxiv.org/abs/2410.00929
Authors: Min Xian,Tao Wang,Sai Zhang,Fei Xu,Zhegang Ma
Keywords: low power shutdown, power shutdown probabilistic, nuclear power plants, classifying shutdown initiating, developing low power
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Identifying and classifying shutdown initiating events (SDIEs) is critical for developing low power shutdown probabilistic risk assessment for nuclear power plants. Existing computational approaches cannot achieve satisfactory performance due to the challenges of unavailable large, labeled datasets, imbalanced event types, and label noise. To address these challenges, we propose a hybrid pipeline that integrates a knowledge-informed machine learning model to prescreen non-SDIEs and a large language model (LLM) to classify SDIEs into four types. In the prescreening stage, we propose a set of 44 SDIE text patterns that consist of the most salient keywords and phrases from six SDIE types. Text vectorization based on the SDIE patterns generates feature vectors that are highly separable by a simple binary classifier. The second stage builds a Bidirectional Encoder Representations from Transformers (BERT)-based LLM, which learns generic English language representations from self-supervised pretraining on a large dataset and adapts to SDIE classification by fine-tuning on an SDIE dataset. The proposed approaches are evaluated on a dataset with 10,928 events using precision, recall ratio, F1 score, and average accuracy. The results demonstrate that the prescreening stage can exclude more than 97% of non-SDIEs, and the LLM achieves an average accuracy of 93.4% for SDIE classification.
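The prescreening stage described above (pattern-based text vectorization followed by a simple binary decision) can be sketched as follows. The regular expressions are hypothetical stand-ins, not the paper's 44 patterns. The hit-count threshold in `prescreen` is likewise an assumed placeholder for whatever binary classifier the authors trained.

```python
import re

# Hypothetical patterns for illustration only; the paper's actual 44 patterns are not reproduced here.
PATTERNS = [
    r"loss of (offsite|off-site) power",
    r"loss of shutdown cooling",
    r"reactor coolant .* (leak|inventory)",
    r"residual heat removal",
]
COMPILED = [re.compile(p, re.IGNORECASE) for p in PATTERNS]

def vectorize(text):
    """Binary feature vector: 1 where the pattern occurs in the text, else 0."""
    return [1 if rx.search(text) else 0 for rx in COMPILED]

def prescreen(text, min_hits=1):
    """Pass a report to the second-stage classifier only if enough patterns match."""
    return sum(vectorize(text)) >= min_hits

features = vectorize("Loss of offsite power during refueling outage")
keep_routine = prescreen("Routine surveillance test completed")
```

Only reports that pass `prescreen` would be sent on to the more expensive fine-tuned classifier in the second stage.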

[LG-146] Optimistic Games for Combinatorial Bayesian Optimization with Application to Protein Design

Link: https://arxiv.org/abs/2409.18582
Authors: Melis Ilayda Bal,Pier Giuseppe Sessa,Mojmir Mutny,Andreas Krause
Keywords: Bayesian optimization, optimize black-box, sequential interactions, powerful framework, framework to optimize
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Biomolecules (q-bio.BM); Quantitative Methods (q-bio.QM)
Comments:

Click to view abstract

Abstract:Bayesian optimization (BO) is a powerful framework to optimize black-box expensive-to-evaluate functions via sequential interactions. In several important problems (e.g. drug discovery, circuit design, neural architecture search, etc.), though, such functions are defined over large combinatorial and unstructured spaces. This makes existing BO algorithms infeasible due to the intractable maximization of the acquisition function over these domains. To address this issue, we propose GameOpt, a novel game-theoretical approach to combinatorial BO. GameOpt establishes a cooperative game between the different optimization variables, and selects points that are game equilibria of an upper confidence bound acquisition function. These are stable configurations from which no variable has an incentive to deviate, analogous to local optima in continuous domains. Crucially, this allows us to efficiently break down the complexity of the combinatorial domain into individual decision sets, making GameOpt scalable to large combinatorial spaces. We demonstrate the application of GameOpt to the challenging protein design problem and validate its performance on four real-world protein datasets. Each protein can take up to $20^X$ possible configurations, where $X$ is the length of a protein, making standard BO methods infeasible. Instead, our approach iteratively selects informative protein configurations and very quickly discovers highly active protein variants compared to other baselines.
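The idea of an equilibrium "from which no variable has an incentive to deviate" can be illustrated with plain best-response dynamics over discrete positions. This is a sketch under assumptions, not the GameOpt algorithm: `toy_score` is a made-up stand-in for an upper confidence bound acquisition function, and the three-letter alphabet replaces the 20 amino acids.

```python
def best_response_equilibrium(score, x0, alphabet, max_rounds=100):
    """Each position in turn switches to its best value while the others stay fixed."""
    x = list(x0)
    for _ in range(max_rounds):
        changed = False
        for i in range(len(x)):
            candidates = [x[:i] + [a] + x[i + 1:] for a in alphabet]
            best = max(candidates, key=score)
            if score(best) > score(x):  # strict improvement guarantees termination
                x, changed = best, True
        if not changed:
            break
    return x

def is_equilibrium(score, x, alphabet):
    """True if no single position can improve the score by deviating alone."""
    return all(
        score(x[:i] + [a] + x[i + 1:]) <= score(x)
        for i in range(len(x))
        for a in alphabet
    )

# Hypothetical objective: reward matches to a target plus a bonus for differing neighbours.
TARGET = list("ACDA")

def toy_score(x):
    matches = sum(a == b for a, b in zip(x, TARGET))
    pair_bonus = sum(0.5 for a, b in zip(x, x[1:]) if a != b)
    return matches + pair_bonus

ALPHABET = list("ACD")
x_star = best_response_equilibrium(toy_score, list("AAAA"), ALPHABET)
```

Each sweep evaluates only `len(x) * len(alphabet)` candidates rather than all `len(alphabet) ** len(x)` configurations, which is the kind of decomposition into individual decision sets that the abstract describes.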

[LG-147] DrivAerNet: A Parametric Car Dataset for Data-Driven Aerodynamic Design and Graph-Based Drag Prediction

Link: https://arxiv.org/abs/2403.08055
Authors: Mohamed Elrefaie,Angela Dai,Faez Ahmed
Keywords: high-fidelity CFD dataset, large-scale high-fidelity CFD, dynamic graph convolutional, graph convolutional neural, convolutional neural network
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Fluid Dynamics (physics.flu-dyn)
Comments:

Click to view abstract

Abstract:This study introduces DrivAerNet, a large-scale high-fidelity CFD dataset of 3D industry-standard car shapes, and RegDGCNN, a dynamic graph convolutional neural network model, both aimed at aerodynamic car design through machine learning. DrivAerNet, with its 4000 detailed 3D car meshes using 0.5 million surface mesh faces and comprehensive aerodynamic performance data comprising full 3D pressure, velocity fields, and wall-shear stresses, addresses the critical need for extensive datasets to train deep learning models in engineering applications. It is 60% larger than the previously available largest public dataset of cars, and is the only open-source dataset that also models wheels and underbody. RegDGCNN leverages this large-scale dataset to provide high-precision drag estimates directly from 3D meshes, bypassing traditional limitations such as the need for 2D image rendering or Signed Distance Fields (SDF). By enabling fast drag estimation in seconds, RegDGCNN facilitates rapid aerodynamic assessments, offering a substantial leap towards integrating data-driven methods in automotive design. Together, DrivAerNet and RegDGCNN promise to accelerate the car design process and contribute to the development of more efficient vehicles. To lay the groundwork for future innovations in the field, the dataset and code used in our study are publicly accessible at this https URL

[LG-148] Efficient 1-bit tensor approximations

Link: https://arxiv.org/abs/2410.01799
Authors: Alex W. Neal Riasanovsky,Sarah El Kazdadi
Keywords: spatially efficient decomposition, present a spatially, linear combinations
Categories: Combinatorics (math.CO); Machine Learning (cs.LG); Mathematical Software (cs.MS); Numerical Analysis (math.NA)
Comments: 16 pages, one cat picture reused a lot

Click to view abstract

Abstract:We present a spatially efficient decomposition of matrices and arbitrary-order tensors as linear combinations of tensor products of $\{-1, 1\}$-valued vectors. For any matrix $A \in \mathbb{R}^{m \times n}$, $A - R_w = S_w C_w T_w^\top = \sum_{j=1}^{w} c_j \cdot \mathbf{s}_j \mathbf{t}_j^\top$ is a $w$-width signed cut decomposition of $A$. Here $C_w = \mathrm{diag}(\mathbf{c}_w)$ for some $\mathbf{c}_w \in \mathbb{R}^w$, and $S_w$, $T_w$, and the vectors $\mathbf{s}_j, \mathbf{t}_j$ are $\{-1, 1\}$-valued. To store $(S_w, T_w, C_w)$, we may pack $w \cdot (m + n)$ bits, and require only $w$ floating point numbers. As a function of $w$, $\|R_w\|_F$ exhibits exponential decay when applied to f32 matrices with i.i.d. $\mathcal{N}(0, 1)$ entries. Choosing $w$ so that $(S_w, T_w, C_w)$ has the same memory footprint as an f16 or bf16 matrix, the relative error is comparable. Our algorithm yields efficient signed cut decompositions in 20 lines of pseudocode. It reflects a simple modification from a celebrated 1999 paper [1] of Frieze and Kannan. As a first application, we approximate the weight matrices in the open Mistral-7B-v0.1 Large Language Model to a $50\%$ spatial compression. Remarkably, all 226 remainder matrices have a relative error $<6\%$ and the expanded model closely matches Mistral-7B-v0.1 on the huggingface leaderboard [2]. Benchmark performance degrades slowly as we reduce the spatial compression from $50\%$ to $25\%$. We optimize our open source rust implementation [3] with simd instructions on avx2 and avx512 architectures. We also extend our algorithm from matrices to tensors of arbitrary order and use it to compress a picture of the first author’s cat Angus.
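A signed cut decomposition $A - R_w = \sum_{j=1}^{w} c_j \mathbf{s}_j \mathbf{t}_j^\top$ can be sketched with a simple greedy loop. This is an illustrative assumption, not the authors' 20-line algorithm or their Rust implementation. The alternating sign update in `greedy_cut` is an ordinary local search, and `storage_bits` just restates the $w \cdot (m + n)$ bits plus $w$ floats count from the abstract.

```python
import math
import random

def sign(v):
    return 1 if v >= 0 else -1

def frobenius(M):
    return math.sqrt(sum(v * v for row in M for v in row))

def bilinear(s, R, t):
    """Compute s^T R t."""
    return sum(s[i] * R[i][j] * t[j] for i in range(len(s)) for j in range(len(t)))

def greedy_cut(R, sweeps=10):
    """Alternately update sign vectors s and t so that s^T R t grows (local search only)."""
    m, n = len(R), len(R[0])
    s, t = [1] * m, [1] * n
    for _ in range(sweeps):
        s = [sign(sum(R[i][j] * t[j] for j in range(n))) for i in range(m)]
        t = [sign(sum(R[i][j] * s[i] for i in range(m))) for j in range(n)]
    return s, t

def signed_cut_decomposition(A, width):
    """Return ([(c_j, s_j, t_j)], R) with A = sum_j c_j * s_j t_j^T + R."""
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    terms = []
    for _ in range(width):
        s, t = greedy_cut(R)
        c = bilinear(s, R, t) / (m * n)  # least-squares optimal scale for fixed s, t
        for i in range(m):
            for j in range(n):
                R[i][j] -= c * s[i] * t[j]
        terms.append((c, s, t))
    return terms, R

def storage_bits(m, n, width, float_bits=32):
    """width * (m + n) sign bits plus width floating point coefficients."""
    return width * (m + n) + width * float_bits

rng = random.Random(0)
A = [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(8)]
terms, R = signed_cut_decomposition(A, width=12)
```

With the least-squares coefficient, each step lowers $\|R\|_F^2$ by $(\mathbf{s}^\top R\, \mathbf{t})^2 / (mn)$, so the residual norm never increases as the width grows.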

[LG-149] Thermodynamic Bayesian Inference

Link: https://arxiv.org/abs/2410.01793
Authors: Maxwell Aifer,Samuel Duffield,Kaelan Donatella,Denis Melanson,Phoebe Klett,Zach Belateche,Gavin Crooks,Antonio J. Martinez,Patrick J. Coles
Keywords: deep neural networks, enable rigorous uncertainty, rigorous uncertainty quantification, higher-level tasks including, fully Bayesian treatment
Categories: Statistical Mechanics (cond-mat.stat-mech); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: 20 pages, 8 figures

Click to view abstract

Abstract:A fully Bayesian treatment of complicated predictive models (such as deep neural networks) would enable rigorous uncertainty quantification and the automation of higher-level tasks including model selection. However, the intractability of sampling Bayesian posteriors over many parameters inhibits the use of Bayesian methods where they are most needed. Thermodynamic computing has emerged as a paradigm for accelerating operations used in machine learning, such as matrix inversion, and is based on the mapping of Langevin equations to the dynamics of noisy physical systems. Hence, it is natural to consider the implementation of Langevin sampling algorithms on thermodynamic devices. In this work we propose electronic analog devices that sample from Bayesian posteriors by realizing Langevin dynamics physically. Circuit designs are given for sampling the posterior of a Gaussian-Gaussian model and for Bayesian logistic regression, and are validated by simulations. It is shown, under reasonable assumptions, that the Bayesian posteriors for these models can be sampled in time scaling with $\ln(d)$, where $d$ is dimension. For the Gaussian-Gaussian model, the energy cost is shown to scale with $d \ln(d)$. These results highlight the potential for fast, energy-efficient Bayesian inference using thermodynamic computing.
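As a digital point of reference for the Langevin sampling the abstract describes, the sketch below runs the unadjusted Langevin algorithm on a one-dimensional Gaussian-Gaussian model. It simulates the dynamics in software and says nothing about the proposed analog circuits. The prior, noise variance, observation `y`, and step size are all assumed values.

```python
import math
import random

def posterior_params(y, prior_mean, prior_var, noise_var):
    """Closed-form posterior of theta for prior N(prior_mean, prior_var) and y | theta ~ N(theta, noise_var)."""
    precision = 1.0 / prior_var + 1.0 / noise_var
    mean = (prior_mean / prior_var + y / noise_var) / precision
    return mean, 1.0 / precision

def langevin_samples(y, prior_mean, prior_var, noise_var,
                     step=0.01, n_steps=60_000, burn_in=10_000, seed=0):
    """Discretised Langevin dynamics: theta <- theta - step * grad U + sqrt(2 * step) * noise."""
    rng = random.Random(seed)
    theta = prior_mean
    samples = []
    for k in range(n_steps):
        # Gradient of the negative log posterior U(theta).
        grad_u = (theta - prior_mean) / prior_var + (theta - y) / noise_var
        theta = theta - step * grad_u + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        if k >= burn_in:
            samples.append(theta)
    return samples

exact_mean, exact_var = posterior_params(y=1.0, prior_mean=0.0, prior_var=1.0, noise_var=0.25)
samples = langevin_samples(y=1.0, prior_mean=0.0, prior_var=1.0, noise_var=0.25)
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
```

The empirical mean and variance should land close to the closed-form posterior, with a small discretisation bias in the variance that shrinks as `step` decreases.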

[LG-150] Dynamical-generative downscaling of climate model ensembles

Link: https://arxiv.org/abs/2410.01776
Authors: Ignacio Lopez-Gomez,Zhong Yi Wan,Leonardo Zepeda-Núñez,Tapio Schneider,John Anderson,Fei Sha
Keywords: hazard risk assessment, natural hazard risk, Earth System Model, Regional high-resolution climate, high-resolution climate projections
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Regional high-resolution climate projections are crucial for many applications, such as agriculture, hydrology, and natural hazard risk assessment. Dynamical downscaling, the state-of-the-art method to produce localized future climate information, involves running a regional climate model (RCM) driven by an Earth System Model (ESM), but it is too computationally expensive to apply to large climate projection ensembles. We propose a novel approach combining dynamical downscaling with generative artificial intelligence to reduce the cost and improve the uncertainty estimates of downscaled climate projections. In our framework, an RCM dynamically downscales ESM output to an intermediate resolution, followed by a generative diffusion model that further refines the resolution to the target scale. This approach leverages the generalizability of physics-based models and the sampling efficiency of diffusion models, enabling the downscaling of large multi-model ensembles. We evaluate our method against dynamically-downscaled climate projections from the CMIP6 ensemble. Our results demonstrate its ability to provide more accurate uncertainty bounds on future regional climate than alternatives such as dynamical downscaling of smaller ensembles, or traditional empirical statistical downscaling methods. We also show that dynamical-generative downscaling results in significantly lower errors than bias correction and spatial disaggregation (BCSD), and captures more accurately the spectra and multivariate correlations of meteorological fields. These characteristics make the dynamical-generative framework a flexible, accurate, and efficient way to downscale large ensembles of climate projections, currently out of reach for pure dynamical downscaling.

[LG-151] SegHeD: Segmentation of Heterogeneous Data for Multiple Sclerosis Lesions with Anatomical Constraints MICCAI

Link: https://arxiv.org/abs/2410.01766
Authors: Berke Doga Basaran,Xinru Zhang,Paul M. Matthews,Wenjia Bai
Keywords: brain magnetic resonance, monitoring multiple sclerosis, magnetic resonance, images plays, progression from brain
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 13 pages, 4 figures, MICCAI, LDTM Workshop

Click to view abstract

Abstract:Assessment of lesions and their longitudinal progression from brain magnetic resonance (MR) images plays a crucial role in diagnosing and monitoring multiple sclerosis (MS). Machine learning models have demonstrated a great potential for automated MS lesion segmentation. Training such models typically requires large-scale high-quality datasets that are consistently annotated. However, MS imaging datasets are often small, segregated across multiple sites, with different formats (cross-sectional or longitudinal), and diverse annotation styles. This poses a significant challenge to train a unified MS lesion segmentation model. To tackle this challenge, we present SegHeD, a novel multi-dataset multi-task segmentation model that can incorporate heterogeneous data as input and perform all-lesion, new-lesion, as well as vanishing-lesion segmentation. Furthermore, we account for domain knowledge about MS lesions, incorporating longitudinal, spatial, and volumetric constraints into the segmentation model. SegHeD is assessed on five MS datasets and achieves a high performance in all, new, and vanishing-lesion segmentation, outperforming several state-of-the-art methods in this field.

[LG-152] Integrating Protein Sequence and Expression Level to Analysis Molecular Characterization of Breast Cancer Subtypes

Link: https://arxiv.org/abs/2410.01755
Authors: Hossein Sholehrasa
Keywords: variability pose significant, pose significant challenges, guiding effective treatment, Breast cancer complexity, Breast cancer
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Breast cancer’s complexity and variability pose significant challenges in understanding its progression and guiding effective treatment. This study aims to integrate protein sequence data with expression levels to improve the molecular characterization of breast cancer subtypes and predict clinical outcomes. Using ProtGPT2, a language model designed for protein sequences, we generated embeddings that capture the functional and structural properties of proteins sequence. These embeddings were integrated with protein expression level to form enriched biological representations, which were analyzed using machine learning methods like ensemble K-means for clustering and XGBoost for classification. Our approach enabled successful clustering of patients into biologically distinct groups and accurately predicted clinical outcomes such as survival and biomarkers status, achieving high performance metrics, notably an F1 score of 0.88 for survival and 0.87 for biomarkers status prediction. Analysis of feature importance highlighted key proteins like KMT2C, GCN1, and CLASP2, linked to hormone receptor and Human Epidermal Growth Factor Receptor 2 (HER2) expression, which play a role in tumor progression and patient outcomes, respectively. Furthermore, protein-protein interaction networks and correlation analyses revealed the interdependence of proteins that may influence breast cancer subtype behaviors. These findings suggest that integrating protein sequence and expression data provides valuable insights into tumor biology and has significant potential to enhance personalized treatment strategies in breast cancer care.

[LG-153] Smaller Confidence Intervals From IPW Estimators via Data-Dependent Coarsening COLT

Link: https://arxiv.org/abs/2410.01658
Authors: Alkis Kalavasis,Anay Mehrotra,Manolis Zampetakis
Keywords: Inverse propensity-score weighted, estimating average treatment, average treatment effects, Inverse propensity-score, IPW estimators
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Econometrics (econ.EM); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: Accepted for presentation at the 37th Conference on Learning Theory (COLT) 2024

Click to view abstract

Abstract:Inverse propensity-score weighted (IPW) estimators are prevalent in causal inference for estimating average treatment effects in observational studies. Under unconfoundedness, given accurate propensity scores and $n$ samples, the size of confidence intervals of IPW estimators scales down with $n$, and several of their variants improve the rate of scaling. However, neither IPW estimators nor their variants are robust to inaccuracies: even if a single covariate has an $\varepsilon > 0$ additive error in the propensity score, the size of confidence intervals of these estimators can increase arbitrarily. Moreover, even without errors, the rate with which the confidence intervals of these estimators go to zero with $n$ can be arbitrarily slow in the presence of extreme propensity scores (those close to 0 or 1). We introduce a family of Coarse IPW (CIPW) estimators that captures existing IPW estimators and their variants. Each CIPW estimator is an IPW estimator on a coarsened covariate space, where certain covariates are merged. Under mild assumptions, e.g., Lipschitzness in expected outcomes and sparsity of extreme propensity scores, we give an efficient algorithm to find a robust estimator: given $\varepsilon$-inaccurate propensity scores and $n$ samples, its confidence interval size scales with $\varepsilon + 1/\sqrt{n}$. In contrast, under the same assumptions, existing estimators’ confidence interval sizes are $\Omega(1)$ irrespective of $\varepsilon$ and $n$. Crucially, our estimator is data-dependent and we show that no data-independent CIPW estimator can be robust to inaccuracies.
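For readers unfamiliar with the baseline, a plain IPW estimate of the average treatment effect takes the standard form $\hat{\tau} = \frac{1}{n} \sum_i \left( \frac{T_i Y_i}{e_i} - \frac{(1 - T_i) Y_i}{1 - e_i} \right)$. The sketch below implements only this textbook estimator, not the paper's coarsened CIPW family. The toy records are invented to show how one extreme propensity score can dominate the estimate.

```python
def ipw_ate(records):
    """Plain IPW estimate of the average treatment effect.

    records: iterable of (treated, outcome, propensity) with treated in {0, 1}
    and propensity strictly between 0 and 1.
    """
    total = 0.0
    n = 0
    for treated, outcome, e in records:
        if not 0.0 < e < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        total += treated * outcome / e - (1 - treated) * outcome / (1.0 - e)
        n += 1
    return total / n

# Hypothetical data: (treated, outcome, propensity score).
balanced = [(1, 2.0, 0.5), (0, 1.0, 0.5), (1, 3.0, 0.5), (0, 1.0, 0.5)]
extreme = [(1, 2.0, 0.01), (0, 1.0, 0.5), (1, 3.0, 0.5), (0, 1.0, 0.5)]

ate_balanced = ipw_ate(balanced)
ate_extreme = ipw_ate(extreme)
```

Changing a single propensity score from 0.5 to 0.01 moves the estimate from 1.5 to about 50.5. This is the sensitivity to extreme or inaccurate propensities that motivates merging covariates.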

[LG-154] Efficient Statistics With Unknown Truncation, Polynomial Time Algorithms, Beyond Gaussians

Link: https://arxiv.org/abs/2410.01656
Authors: Jane H. Lee,Anay Mehrotra,Manolis Zampetakis
Categories: Statistics Theory (math.ST); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
Comments: Accepted for presentation at the 65th IEEE Symposium on Foundations of Computer Science (FOCS), 2024; abstract shortened for arXiv

Click to view abstract

Abstract:We study the estimation of distributional parameters when samples are shown only if they fall in some unknown set $S \subseteq \mathbb{R}^d$. Kontonis, Tzamos, and Zampetakis (FOCS’19) gave a $d^{\mathrm{poly}(1/\varepsilon)}$ time algorithm for finding $\varepsilon$-accurate parameters for the special case of Gaussian distributions with diagonal covariance matrix. Recently, Diakonikolas, Kane, Pittas, and Zarifis (COLT’24) showed that this exponential dependence on $1/\varepsilon$ is necessary even when $S$ belongs to some well-behaved classes. These works leave the following open problems which we address in this work: Can we estimate the parameters of any Gaussian or even extend beyond Gaussians? Can we design $\mathrm{poly}(d/\varepsilon)$ time algorithms when $S$ is a simple set such as a halfspace? We make progress on both of these questions by providing the following results: 1. Toward the first question, we give a $d^{\mathrm{poly}(\ell/\varepsilon)}$ time algorithm for any exponential family that satisfies some structural assumptions and any unknown set $S$ that is $\varepsilon$-approximable by degree-$\ell$ polynomials. This result has two important applications: 1a) The first algorithm for estimating arbitrary Gaussian distributions from samples truncated to an unknown $S$; and 1b) The first algorithm for linear regression with unknown truncation and Gaussian features. 2. To address the second question, we provide an algorithm with runtime $\mathrm{poly}(d/\varepsilon)$ that works for a set of exponential families (containing all Gaussians) when $S$ is a halfspace or an axis-aligned rectangle. Along the way, we develop tools that may be of independent interest, including a reduction from PAC learning with positive and unlabeled samples to PAC learning with positive and negative samples that is robust to certain covariate shifts.
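The problem setting (samples are visible only when they fall in a set $S$) can be made concrete with a small simulation. This sketch illustrates only why truncation biases naive estimates; it does not implement any of the paper's algorithms. The halfspace $S = \{x > 0\}$ and the standard normal source are assumptions.

```python
import math
import random

def truncated_samples(n, in_set, seed=0):
    """Draw from N(0, 1) but keep a sample only if it falls inside the truncation set."""
    rng = random.Random(seed)
    kept = []
    while len(kept) < n:
        x = rng.gauss(0.0, 1.0)
        if in_set(x):
            kept.append(x)
    return kept

# Assumed truncation set: the halfspace {x > 0}. The true mean of the source is 0.
samples = truncated_samples(50_000, lambda x: x > 0.0)
naive_mean = sum(samples) / len(samples)
halfnormal_mean = math.sqrt(2.0 / math.pi)  # known mean of N(0, 1) conditioned on x > 0
```

The naive sample mean converges to about 0.80 rather than the true mean 0, so the estimator has to account for the truncation set, which in the paper's setting is not even known.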

[LG-155] HRTF Estimation using a Score-based Prior

Link: https://arxiv.org/abs/2410.01562
Authors: Etienne Thuillier,Jean-Marie Lemercier,Eloi Moliner,Timo Gerkmann,Vesa Välimäki
Keywords-EN: head-related transfer function, transfer function, HRTF, present a head-related, head-related transfer
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Comments:

Click to view abstract

Abstract:We present a head-related transfer function (HRTF) estimation method which relies on a data-driven prior given by a score-based diffusion model. The HRTF is estimated in reverberant environments using natural excitation signals, e.g. human speech. The impulse response of the room is estimated along with the HRTF by optimizing a parametric model of reverberation based on the statistical behaviour of room acoustics. The posterior distribution of HRTF given the reverberant measurement and excitation signal is modelled using the score-based HRTF prior and a log-likelihood approximation. We show that the resulting method outperforms several baselines, including an oracle recommender system that assigns the optimal HRTF in our training set based on the smallest distance to the true HRTF at the given direction of arrival. In particular, we show that the diffusion prior can account for the large variability of high-frequency content in HRTFs.

[LG-156] Attention layers provably solve single-location regression

Link: https://arxiv.org/abs/2410.01537
Authors: Pierre Marion,Raphaël Berthier,Gérard Biau,Claire Boyer
Keywords-EN: Attention-based models, comprehensive theoretical understanding, internal linear representations, lack a comprehensive, token-wise sparsity
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 41 pages, 7 figures

Click to view abstract

Abstract:Attention-based models, such as Transformer, excel across various tasks but lack a comprehensive theoretical understanding, especially regarding token-wise sparsity and internal linear representations. To address this gap, we introduce the single-location regression task, where only one token in a sequence determines the output, and its position is a latent random variable, retrievable via a linear projection of the input. To solve this task, we propose a dedicated predictor, which turns out to be a simplified version of a non-linear self-attention layer. We study its theoretical properties, by showing its asymptotic Bayes optimality and analyzing its training dynamics. In particular, despite the non-convex nature of the problem, the predictor effectively learns the underlying structure. This work highlights the capacity of attention mechanisms to handle sparse token information and internal linear structures.

[LG-157] One Wave to Explain Them All: A Unifying Perspective on Post-hoc Explainability

Link: https://arxiv.org/abs/2410.01482
Authors: Gabriel Kasmi,Amandine Brunetto,Thomas Fel,Jayneel Parekh
Keywords-EN: deep neural networks, black-box nature hinders, nature hinders transparency, inherent black-box nature, safety-critical decision-making
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: main: 10 pages, appendix: 14 pages, 5 Tables, 25 Figures

Click to view abstract

Abstract:Despite the growing use of deep neural networks in safety-critical decision-making, their inherent black-box nature hinders transparency and interpretability. Explainable AI (XAI) methods have thus emerged to understand a model’s internal workings, notably attribution methods, also called saliency maps. Conventional attribution methods typically identify the locations – the where – of significant regions within an input. However, because they overlook the inherent structure of the input data, these methods often fail to interpret what these regions represent in terms of structural components (e.g., textures in images or transients in sounds). Furthermore, existing methods are usually tailored to a single data modality, limiting their generalizability. In this paper, we propose leveraging the wavelet domain as a robust mathematical foundation for attribution. Our approach, the Wavelet Attribution Method (WAM), extends the existing gradient-based feature attributions into the wavelet domain, providing a unified framework for explaining classifiers across images, audio, and 3D shapes. Empirical evaluations demonstrate that WAM matches or surpasses state-of-the-art methods across faithfulness metrics and models in image, audio, and 3D explainability. Finally, we show how our method explains not only the where – the important parts of the input – but also the what – the relevant patterns in terms of structural components.

[LG-158] Introducing Flexible Monotone Multiple Choice Item Response Theory Models and Bit Scales

Link: https://arxiv.org/abs/2410.01480
Authors: Joakim Wallmark,Maria Josefsson,Marie Wiberg
Keywords-EN: Item Response Theory, evaluating test items, powerful statistical approach, determining test taker, test taker abilities
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Click to view abstract

Abstract:Item Response Theory (IRT) is a powerful statistical approach for evaluating test items and determining test taker abilities through response analysis. An IRT model that better fits the data leads to more accurate latent trait estimates. In this study, we present a new model for multiple choice data, the monotone multiple choice (MMC) model, which we fit using autoencoders. Using both simulated scenarios and real data from the Swedish Scholastic Aptitude Test, we demonstrate empirically that the MMC model outperforms the traditional nominal response IRT model in terms of fit. Furthermore, we illustrate how the latent trait scale from any fitted IRT model can be transformed into a ratio scale, aiding in score interpretation and making it easier to compare different types of IRT models. We refer to these new scales as bit scales. Bit scales are especially useful for models for which minimal or no assumptions are made for the latent trait scale distributions, such as for the autoencoder fitted models in this study.

[LG-159] Flow Matching for Accelerated Simulation of Atomic Transport in Materials

Link: https://arxiv.org/abs/2410.01464
Authors: Juno Nam,Sulin Liu,Gavin Winter,KyuJung Jun,Soojung Yang,Rafael Gómez-Bombarelli
Keywords-EN: accelerate molecular dynamics, molecular dynamics, generative framework, framework to accelerate, accelerate molecular
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments:

Click to view abstract

Abstract:We introduce LiFlow, a generative framework to accelerate molecular dynamics (MD) simulations for crystalline materials that formulates the task as conditional generation of atomic displacements. The model uses flow matching, with a Propagator submodel to generate atomic displacements and a Corrector to locally correct unphysical geometries, and incorporates an adaptive prior based on the Maxwell-Boltzmann distribution to account for chemical and thermal conditions. We benchmark LiFlow on a dataset comprising 25-ps trajectories of lithium diffusion across 4,186 solid-state electrolyte (SSE) candidates at four temperatures. The model obtains a consistent Spearman rank correlation of 0.7-0.8 for lithium mean squared displacement (MSD) predictions on unseen compositions. Furthermore, LiFlow generalizes from short training trajectories to larger supercells and longer simulations while maintaining high accuracy. With speed-ups of up to 600,000$\times$ compared to first-principles methods, LiFlow enables scalable simulations at significantly larger length and time scales.
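
The mean squared displacement (MSD) that the abstract uses as its evaluation target can be illustrated with a minimal sketch. It assumes a trajectory stored as a list of frames, each a list of (x, y, z) positions in a fixed atom order, with no periodic-boundary unwrapping. The function name and data layout are hypothetical and are not taken from the LiFlow code.

```python
def mean_squared_displacement(trajectory):
    """MSD relative to the first frame, averaged over atoms.

    trajectory: list of frames; each frame is a list of (x, y, z) tuples,
    one per atom, in the same atom order in every frame (assumed layout).
    """
    reference = trajectory[0]
    msd = []
    for frame in trajectory:
        total = 0.0
        for (x, y, z), (x0, y0, z0) in zip(frame, reference):
            total += (x - x0) ** 2 + (y - y0) ** 2 + (z - z0) ** 2
        msd.append(total / len(reference))
    return msd


# Toy trajectory with two atoms: one hops along x, the other stays put.
toy = [
    [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
    [(1.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
    [(2.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
]
print(mean_squared_displacement(toy))
```

In a real workflow the positions would typically be unwrapped across periodic boundaries first, and the MSD would be restricted to the diffusing species (here, lithium).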

[LG-160] Approximation by Steklov Neural Network Operators

Link: https://arxiv.org/abs/2410.01426
Authors: S. N. Karaman,M. Turgay,T. Acar
Keywords-EN: Neural Network operators, Steklov Neural Network, Neural Network, present paper deals, Network operators
Categories: Functional Analysis (math.FA); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The present paper deals with the construction of a new family of Neural Network operators, namely Steklov Neural Network operators. Using a Steklov-type integral, we introduce a new version of Neural Network operators and obtain several convergence theorems for the family, including pointwise and uniform convergence and the rate of convergence via moduli of smoothness of order $r$.

[LG-161] Overpredictive Signal Analytics in Federated Learning: Algorithms and Analysis

Link: https://arxiv.org/abs/2410.01399
Authors: Vijay Anavangot
Keywords-EN: Edge signal processing, signal processing facilitates, processing facilitates distributed, Edge signal, signal
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Edge signal processing facilitates distributed learning and inference in the client-server model proposed in federated learning. In traditional machine learning, clients (IoT devices) that acquire raw signal samples can help a data center (server) learn a global signal model by pooling these distributed samples at a third-party location. Despite the promising capabilities of IoT devices, these distributed deployments often face the challenge of sensitive private data and communication rate constraints. This necessitates a learning approach that communicates a processed approximation of the distributed samples instead of the raw signals. Such a decentralized learning approach using signal approximations will be termed distributed signal analytics in this work. Overpredictive signal approximations may be desired for distributed signal analytics, especially in network demand (capacity) planning applications motivated by federated learning. In this work, we propose algorithms that compute an overpredictive signal approximation at the client devices using an efficient convex optimization framework. Tradeoffs between communication cost, sampling rate, and the signal approximation error are quantified using mathematical analysis. We also show the performance of the proposed distributed algorithms on a publicly available residential energy consumption dataset.

[LG-162] Response Estimation and System Identification of Dynamical Systems via Physics-Informed Neural Networks

Link: https://arxiv.org/abs/2410.01340
Authors: Marcus Haywood-Alexander,Giacamo Arcieri,Antonios Kamariotis,Eleni Chatzi
Keywords-EN: Structural Health Monitoring, Health Monitoring, Structural Health, structural dynamics, numerous engineering applications
Categories: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The accurate modelling of structural dynamics is crucial across numerous engineering applications, such as Structural Health Monitoring (SHM), seismic analysis, and vibration control. Often, these models originate from physics-based principles and can be derived from corresponding governing equations, often of differential equation form. However, complex system characteristics, such as nonlinearities and energy dissipation mechanisms, often imply that such models are approximative and often imprecise. This challenge is further compounded in SHM, where sensor data is often sparse, making it difficult to fully observe the system’s states. To address these issues, this paper explores the use of Physics-Informed Neural Networks (PINNs), a class of physics-enhanced machine learning (PEML) techniques, for the identification and estimation of dynamical systems. PINNs offer a unique advantage by embedding known physical laws directly into the neural network’s loss function, allowing for simple embedding of complex phenomena, even in the presence of uncertainties. This study specifically investigates three key applications of PINNs: state estimation in systems with sparse sensing, joint state-parameter estimation, when both system response and parameters are unknown, and parameter estimation within a Bayesian framework to quantify uncertainties. The results demonstrate that PINNs deliver an efficient tool across all aforementioned tasks, even in presence of modelling errors. However, these errors tend to have a more significant impact on parameter estimation, as the optimization process must reconcile discrepancies between the prescribed model and the true system behavior. Despite these challenges, PINNs show promise in dynamical system modeling, offering a robust approach to handling uncertainties.

[LG-163] Deep Kernel Posterior Learning under Infinite Variance Prior Weights

Link: https://arxiv.org/abs/2410.01284
Authors: Jorge Loría,Anindya Bhadra
Keywords-EN: bounded prior variance, infinitely wide shallow, proved that infinitely, infinitely wide, bounded prior
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 21 pages, 11 figures

Click to view abstract

Abstract:Neal (1996) proved that infinitely wide shallow Bayesian neural networks (BNN) converge to Gaussian processes (GP), when the network weights have bounded prior variance. Cho & Saul (2009) provided a useful recursive formula for deep kernel processes for relating the covariance kernel of each layer to the layer immediately below. Moreover, they worked out the form of the layer-wise covariance kernel in an explicit manner for several common activation functions. Recent works, including Aitchison et al. (2021), have highlighted that the covariance kernels obtained in this manner are deterministic and hence preclude any possibility of representation learning, which amounts to learning a non-degenerate posterior of a random kernel given the data. To address this, they propose adding artificial noise to the kernel to retain stochasticity, and develop deep kernel inverse Wishart processes. Nonetheless, this artificial noise injection could be critiqued in that it would not naturally emerge in a classic BNN architecture under an infinite-width limit. To address this, we show that a Bayesian deep neural network, where each layer width approaches infinity, and all network weights are elliptically distributed with infinite variance, converges to a process with $\alpha$-stable marginals in each layer that has a conditionally Gaussian representation. These conditional random covariance kernels could be recursively linked in the manner of Cho & Saul (2009), even though marginally the process exhibits stable behavior, and hence covariances are not even necessarily defined. We also provide useful generalizations of the recent results of Loría & Bhadra (2024) on shallow networks to multi-layer networks, and remedy the computational burden of their approach. The computational and statistical benefits over competing approaches stand out in simulations and in demonstrations on benchmark data sets.

[LG-164] Transformers Handle Endogeneity in In-Context Linear Regression

Link: https://arxiv.org/abs/2410.01265
Authors: Haodong Liang,Krishnakumar Balasubramanian,Lifeng Lai
Keywords-EN: in-context linear regression, linear regression, explore the capability, address endogeneity, handle endogeneity effectively
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Econometrics (econ.EM); Statistics Theory (math.ST)
Comments: 30 pages

Click to view abstract

Abstract:We explore the capability of transformers to address endogeneity in in-context linear regression. Our main finding is that transformers inherently possess a mechanism to handle endogeneity effectively using instrumental variables (IV). First, we demonstrate that the transformer architecture can emulate a gradient-based bi-level optimization procedure that converges to the widely used two-stage least squares (2SLS) solution at an exponential rate. Next, we propose an in-context pretraining scheme and provide theoretical guarantees showing that the global minimizer of the pre-training loss achieves a small excess loss. Our extensive experiments validate these theoretical findings, showing that the trained transformer provides more robust and reliable in-context predictions and coefficient estimates than the 2SLS method, in the presence of endogeneity.
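
For readers unfamiliar with the 2SLS baseline named in the abstract, the following is a minimal sketch of classical two-stage least squares in the scalar case (one regressor, one instrument, no intercept). It illustrates the textbook estimator only, not the transformer construction from the paper; the toy data-generating process and all function names are assumptions made for illustration.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y on x, no intercept (the toy data are mean-zero)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def two_stage_least_squares(z, x, y):
    """Scalar 2SLS: regress x on z, then regress y on the fitted values of x."""
    gamma = ols_slope(z, x)             # stage 1: x ~ gamma * z
    x_hat = [gamma * zi for zi in z]    # part of x explained by the instrument
    return ols_slope(x_hat, y)          # stage 2: y ~ beta * x_hat


def simulate(n=20000, beta=2.0, seed=0):
    """Toy endogenous design: the confounder u enters both x and y."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        zi = rng.gauss(0.0, 1.0)        # instrument, independent of u
        u = rng.gauss(0.0, 1.0)         # unobserved confounder
        xi = zi + u + rng.gauss(0.0, 1.0)
        yi = beta * xi + u + rng.gauss(0.0, 1.0)
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return z, x, y


z, x, y = simulate()
print("OLS :", ols_slope(x, y))                    # biased upward by the confounder
print("2SLS:", two_stage_least_squares(z, x, y))   # close to the true beta = 2.0
```

In this toy design the ordinary least-squares slope converges to roughly 2.33 rather than 2.0, because the regressor is correlated with the noise, while the instrumented estimate stays near the true coefficient.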

[LG-165] Revisiting Optimism and Model Complexity in the Wake of Overparameterized Machine Learning

Link: https://arxiv.org/abs/2410.01259
Authors: Pratik Patil,Jin-Hong Du,Ryan J. Tibshirani
Keywords-EN: Common practice, learning involves fitting, prediction error, machine learning involves, involves fitting
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 59 pages, 17 figures

Click to view abstract

Abstract:Common practice in modern machine learning involves fitting a large number of parameters relative to the number of observations. These overparameterized models can exhibit surprising generalization behavior, e.g., “double descent” in the prediction error curve when plotted against the raw number of model parameters, or another simplistic notion of complexity. In this paper, we revisit model complexity from first principles, by first reinterpreting and then extending the classical statistical concept of (effective) degrees of freedom. Whereas the classical definition is connected to fixed-X prediction error (in which prediction error is defined by averaging over the same, nonrandom covariate points as those used during training), our extension of degrees of freedom is connected to random-X prediction error (in which prediction error is averaged over a new, random sample from the covariate distribution). The random-X setting more naturally embodies modern machine learning problems, where highly complex models, even those complex enough to interpolate the training data, can still lead to desirable generalization performance under appropriate conditions. We demonstrate the utility of our proposed complexity measures through a mix of conceptual arguments, theory, and experiments, and illustrate how they can be used to interpret and compare arbitrary prediction models.
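
As background for the classical fixed-X notion the abstract starts from, the standard definition of effective degrees of freedom and its link to optimism can be written as follows. The notation is an assumption made here (it is not quoted from the paper): responses $y_i = \mu_i + \varepsilon_i$ with uncorrelated noise of variance $\sigma^2$, fitted values $\hat{y}_i$, and $y_i^{*}$ an independent copy of $y_i$ at the same covariate point.

```latex
% Classical (fixed-X) effective degrees of freedom and the optimism identity.
\mathrm{df}(\hat{y}) = \frac{1}{\sigma^2} \sum_{i=1}^{n} \mathrm{Cov}(\hat{y}_i, y_i),
\qquad
\mathbb{E}\Big[\tfrac{1}{n}\sum_{i=1}^{n} (y_i^{*} - \hat{y}_i)^2\Big]
- \mathbb{E}\Big[\tfrac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\Big]
= \frac{2\sigma^2}{n}\, \mathrm{df}(\hat{y}).
% For a linear smoother \hat{y} = S y, this reduces to \mathrm{df}(\hat{y}) = \mathrm{tr}(S).
```

The left-hand difference is the optimism: the expected gap between fixed-X prediction error and training error. The paper's random-X extension is not reproduced here.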

[LG-166] Resource-efficient equivariant quantum convolutional neural networks

Link: https://arxiv.org/abs/2410.01252
Authors: Koki Chinzei,Quoc Hoan Tran,Yasuhiro Endo,Hirotaka Oshima
Keywords-EN: potential quantum advantages, provide potential quantum, provide potential, neural networks, Equivariant
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 20 pages, 7 figures, 1 table

Click to view abstract

Abstract:Equivariant quantum neural networks (QNNs) are promising quantum machine learning models that exploit symmetries to provide potential quantum advantages. Despite theoretical developments in equivariant QNNs, their implementation on near-term quantum devices remains challenging due to limited computational resources. This study proposes a resource-efficient model of equivariant quantum convolutional neural networks (QCNNs) called equivariant split-parallelizing QCNN (sp-QCNN). Using a group-theoretical approach, we encode general symmetries into our model beyond the translational symmetry addressed by previous sp-QCNNs. We achieve this by splitting the circuit at the pooling layer while preserving symmetry. This splitting structure effectively parallelizes QCNNs to improve measurement efficiency in estimating the expectation value of an observable and its gradient by order of the number of qubits. Our model also exhibits high trainability and generalization performance, including the absence of barren plateaus. Numerical experiments demonstrate that the equivariant sp-QCNN can be trained and generalized with fewer measurement resources than a conventional equivariant QCNN in a noisy quantum data classification task. Our results contribute to the advancement of practical quantum machine learning algorithms.

[LG-167] Equivariant score-based generative models provably learn distributions with symmetries efficiently

Link: https://arxiv.org/abs/2410.01244
Authors: Ziyu Chen,Markos A. Katsoulakis,Benjamin J. Zhang
Keywords-EN: group symmetry, phenomena and tasks, vector fields, equivariant vector fields, real-world phenomena
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Symmetry is ubiquitous in many real-world phenomena and tasks, such as physics, images, and molecular simulations. Empirical studies have demonstrated that incorporating symmetries into generative models can provide better generalization and sampling efficiency when the underlying data distribution has group symmetry. In this work, we provide the first theoretical analysis and guarantees of score-based generative models (SGMs) for learning distributions that are invariant with respect to some group symmetry and offer the first quantitative comparison between data augmentation and adding equivariant inductive bias. First, building on recent works on the Wasserstein-1 ($\mathbf{d}_1$) guarantees of SGMs and empirical estimations of probability divergences under group symmetry, we provide an improved $\mathbf{d}_1$ generalization bound when the data distribution is group-invariant. Second, we describe the inductive bias of equivariant SGMs using Hamilton-Jacobi-Bellman theory, and rigorously demonstrate that one can learn the score of a symmetrized distribution using equivariant vector fields without data augmentations through the analysis of the optimality and equivalence of score-matching objectives. This also provides practical guidance that one does not have to augment the dataset as long as the vector field or the neural network parametrization is equivariant. Moreover, we quantify the impact of not incorporating equivariant structure into the score parametrization, by showing that non-equivariant vector fields can yield worse generalization bounds. This can be viewed as a type of model-form error that describes the missing structure of non-equivariant vector fields. Numerical simulations corroborate our analysis and highlight that data augmentations cannot replace the role of equivariant vector fields.

[LG-168] Statistical Taylor Expansion

Link: https://arxiv.org/abs/2410.01223
Authors: Chengpu Wang
Keywords-EN: Statistical Taylor expansion, Taylor expansion replaces, Taylor expansion, Statistical Taylor, input precise variables
Categories: Computation (stat.CO); Machine Learning (cs.LG)
Comments: 75 pages, 55 figures

Click to view abstract

Abstract:Statistical Taylor expansion replaces the input precise variables in a conventional Taylor expansion with random variables each with known mean and deviation, to calculate the result mean and deviation. It is based on the uncorrelated uncertainty assumption: Each input variable is measured independently with fine enough statistical precision, so that their uncertainties are independent of each other. Statistical Taylor expansion reviews that the intermediate analytic expressions can no longer be regarded as independent of each other, and the result of analytic expression should be path independent. This conclusion differs fundamentally from the conventional common approach in applied mathematics to find the best execution path for a result. This paper also presents an implementation of statistical Taylor expansion called variance arithmetic, and the tests on variance arithmetic.

[LG-169] An uncertainty-aware Digital Shadow for underground multimodal CO2 storage monitoring

Link: https://arxiv.org/abs/2410.01218
Authors: Abhinav Prakash Gahlot,Rafael Orozco,Ziyi Yin,Felix J. Herrmann
Keywords-EN: uncertainty-aware Digital Shadow, scalable Digital Shadow, Digital Shadows uncertainty, Digital Shadows neural, Ensemble Bayesian Filtering
Categories: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments:

Click to view abstract

Abstract:Geological Carbon Storage (GCS) is arguably the only scalable net-negative CO2 emission technology available. While promising, subsurface complexities and heterogeneity of reservoir properties demand a systematic approach to quantify uncertainty when optimizing production and mitigating storage risks, which include assurances of Containment and Conformance of injected supercritical CO2. As a first step towards the design and implementation of a Digital Twin for monitoring underground storage operations, a machine-learning-based data-assimilation framework is introduced and validated on carefully designed realistic numerical simulations. As our implementation is based on Bayesian inference but does not yet support control and decision-making, we coin our approach an uncertainty-aware Digital Shadow. To characterize the posterior distribution for the state of CO2 plumes conditioned on multi-modal time-lapse data, the envisioned Shadow combines techniques from Simulation-Based Inference (SBI) and Ensemble Bayesian Filtering to establish probabilistic baselines and assimilate multi-modal data for GCS problems that are challenged by large degrees of freedom, nonlinear multi-physics, non-Gaussianity, and computationally expensive-to-evaluate fluid-flow and seismic simulations. To enable SBI for dynamic systems, a recursive scheme is proposed where the Digital Shadow's neural networks are trained on simulated ensembles for their state and observed data (well and/or seismic). Once training is completed, the system's state is inferred when time-lapse field data becomes available. In this computational study, we observe that a lack of knowledge on the permeability field can be factored into the Digital Shadow's uncertainty quantification. To our knowledge, this work represents the first proof of concept of an uncertainty-aware, in-principle scalable Digital Shadow.

[LG-170] Diverse Expected Improvement (DEI): Diverse Bayesian Optimization of Expensive Computer Simulators

Link: https://arxiv.org/abs/2410.01196
Authors: John Joshua Miller,Simon Mak,Benny Sun,Sai Ranjeet Narayanan,Suo Yang,Zongxuan Sun,Kenneth S. Kim,Chol-Bum Mike Kweon
Keywords-EN: expensive black-box simulators, black-box simulators arises, expensive black-box, myriad of modern, modern scientific
Categories: Applications (stat.AP); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:The optimization of expensive black-box simulators arises in a myriad of modern scientific and engineering applications. Bayesian optimization provides an appealing solution, by leveraging a fitted surrogate model to guide the selection of subsequent simulator evaluations. In practice, however, the objective is often not to obtain a single good solution, but rather a “basket” of good solutions from which users can choose for downstream decision-making. This need arises in our motivating application for real-time control of internal combustion engines for flight propulsion, where a diverse set of control strategies is essential for stable flight control. There has been little work on this front for Bayesian optimization. We thus propose a new Diverse Expected Improvement (DEI) method that searches for diverse “$\epsilon$-optimal” solutions: locally-optimal solutions within a tolerance level $\epsilon > 0$ from a global optimum. We show that DEI yields a closed-form acquisition function under a Gaussian process surrogate model, which facilitates efficient sequential queries via automatic differentiation. This closed form further reveals a novel exploration-exploitation-diversity trade-off, which incorporates the desired diversity property within the well-known exploration-exploitation trade-off. We demonstrate the improvement of DEI over existing methods in a suite of numerical experiments, then explore the DEI in two applications on rover trajectory optimization and engine control for flight propulsion.
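
As a point of reference for the closed-form acquisition mentioned above, here is a minimal sketch of the standard Expected Improvement (EI) formula for minimization under a Gaussian posterior. This is the classical EI that DEI builds on, not the DEI acquisition itself, whose form is not given in the abstract. The function name and argument names are illustrative assumptions.

```python
import math


def expected_improvement(mu, sigma, f_best):
    """Classical EI for minimization at a point with posterior N(mu, sigma^2).

    f_best is the smallest objective value observed so far.
    """
    if sigma <= 0.0:
        # No posterior uncertainty: improvement is deterministic.
        return max(f_best - mu, 0.0)
    z = (f_best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal CDF
    return (f_best - mu) * cdf + sigma * pdf


# A point predicted slightly worse than the incumbent can still be worth
# querying if its posterior uncertainty is large.
print(expected_improvement(mu=1.2, sigma=0.05, f_best=1.0))
print(expected_improvement(mu=1.2, sigma=1.00, f_best=1.0))
```

The first term rewards points whose predicted mean beats the incumbent (exploitation) and the second rewards posterior uncertainty (exploration), which is the two-way trade-off the abstract extends with a diversity component.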

[LG-171] High-dimensional logistic regression with missing data: Imputation regularization and universality

Link: https://arxiv.org/abs/2410.01093
Authors: Kabir Aladin Verchand,Andrea Montanari
Keywords-EN: ridge-regularized logistic regression, additive noise, prediction error, Bayes optimal prediction, ridge-regularized logistic
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:We study high-dimensional, ridge-regularized logistic regression in a setting in which the covariates may be missing or corrupted by additive noise. When both the covariates and the additive corruptions are independent and normally distributed, we provide exact characterizations of both the prediction error as well as the estimation error. Moreover, we show that these characterizations are universal: as long as the entries of the data matrix satisfy a set of independence and moment conditions, our guarantees continue to hold. Universality, in turn, enables the detailed study of several imputation-based strategies when the covariates are missing completely at random. We ground our study by comparing the performance of these strategies with the conjectured performance – stemming from replica theory in statistical physics – of the Bayes optimal procedure. Our analysis yields several insights including: (i) a distinction between single imputation and a simple variant of multiple imputation and (ii) that adding a simple ridge regularization term to single-imputed logistic regression can yield an estimator whose prediction error is nearly indistinguishable from the Bayes optimal prediction error. We supplement our findings with extensive numerical experiments.

[LG-172] An Introduction to Deep Survival Analysis Models for Predicting Time-to-Event Outcomes

Link: https://arxiv.org/abs/2410.01086
Authors: George H. Chen
Keywords-EN: applications involve reasoning, applications involve, outcomes, time, time durations
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Code is available at this https URL

Click to view abstract

Abstract:Many applications involve reasoning about time durations before a critical event happens, also called time-to-event outcomes. When will a customer cancel a subscription, a coma patient wake up, or a convicted criminal reoffend? Time-to-event outcomes have been studied extensively within the field of survival analysis primarily by the statistical, medical, and reliability engineering communities, with textbooks already available in the 1970s and '80s. This monograph aims to provide a reasonably self-contained modern introduction to survival analysis. We focus on predicting time-to-event outcomes at the individual data point level with the help of neural networks. Our goal is to provide the reader with a working understanding of precisely what the basic time-to-event prediction problem is, how it differs from standard regression and classification, and how key “design patterns” have been used time after time to derive new time-to-event prediction models, from classical methods like the Cox proportional hazards model to modern deep learning approaches such as deep kernel Kaplan-Meier estimators and neural ordinary differential equation models. We further delve into two extensions of the basic time-to-event prediction setup: predicting which of several critical events will happen first along with the time until this earliest event happens (the competing risks setting), and predicting time-to-event outcomes given a time series that grows in length over time (the dynamic setting). We conclude with a discussion of a variety of topics such as fairness, causal reasoning, interpretability, and statistical guarantees. Our monograph comes with an accompanying code repository that implements every model and evaluation metric that we cover in detail.
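
To make concrete how time-to-event data differ from ordinary regression targets, the sketch below implements the classical Kaplan-Meier estimator of the survival function for right-censored data. It is a textbook baseline, not code from the monograph's repository, and the function name and input layout are assumptions: `times` holds observed durations and `events` holds 1 if the event was observed or 0 if the observation was censored.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve as a list of (time, S(time)) pairs.

    One pair is returned per distinct time at which an event was observed.
    Subjects censored at time t are counted as still at risk at t.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects sharing the same observed time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
    return curve


# Five subjects; the ones with event flag 0 were censored.
print(kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0]))
```

Censored subjects never trigger a drop in the curve, but they do shrink the at-risk set for later times, which is exactly the information a plain regression on observed durations would discard.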

[LG-173] Compressing Recurrent Neural Networks for FPGA-accelerated Implementation in Fluorescence Lifetime Imaging

Link: https://arxiv.org/abs/2410.00948
Authors: Ismail Erbas,Vikas Pandey,Aporva Amarnath,Naigang Wang,Karthik Swaminathan,Stefan T. Radev,Xavier Intes
Keywords: Fluorescence lifetime imaging, iterative fitting algorithms, studying cellular environments, Fluorescence lifetime, requires capturing large
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 8 pages, 2 figures

Abstract:Fluorescence lifetime imaging (FLI) is an important technique for studying cellular environments and molecular interactions, but its real-time application is limited by slow data acquisition, which requires capturing large time-resolved images and complex post-processing using iterative fitting algorithms. Deep learning (DL) models enable real-time inference, but can be computationally demanding due to complex architectures and large matrix operations. This makes DL models ill-suited for direct implementation on field-programmable gate array (FPGA)-based camera hardware. Model compression is thus crucial for practical deployment for real-time inference generation. In this work, we focus on compressing recurrent neural networks (RNNs), which are well-suited for FLI time-series data processing, to enable deployment on resource-constrained FPGA boards. We perform an empirical evaluation of various compression techniques, including weight reduction, knowledge distillation (KD), post-training quantization (PTQ), and quantization-aware training (QAT), to reduce model size and computational load while preserving inference accuracy. Our compressed RNN model, Seq2SeqLite, achieves a balance between computational efficiency and prediction accuracy, particularly at 8-bit precision. By applying KD, the model parameter size was reduced by 98% while retaining performance, making it suitable for concurrent real-time FLI analysis on FPGA during data capture. This work represents a big step towards integrating hardware-accelerated real-time FLI analysis for fast biological processes.
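
The abstract does not spell out which quantization scheme is used, so the following is only a generic sketch of 8-bit post-training quantization (symmetric, one scale per tensor). The function names and weight values are illustrative assumptions.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q, q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map the integer codes back to approximate floating-point weights."""
    return [scale * v for v in q]

weights = [0.52, -1.27, 0.003, 0.9]   # toy values, not from the paper
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The rounding error per weight is bounded by half the scale, so small weights such as `0.003` collapse to zero. Quantization-aware training addresses this by exposing the model to such rounding during training.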

[LG-174] Spectral Graph Sample Weighting for Interpretable Sub-cohort Analysis in Predictive Models for Neuroimaging

Link: https://arxiv.org/abs/2410.00946
Authors: Magdalini Paschali,Jiang Yu Hang,Spencer Siegel,Camila Gonzalez,Kilian Pohl,Akshay Chaudhari,Qingyu Zhao
Keywords: comprise multiple subtypes, Recent advancements, developmental trajectories, subtypes of mechanisms, severity levels
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)

Abstract:Recent advancements in medicine have confirmed that brain disorders often comprise multiple subtypes of mechanisms, developmental trajectories, or severity levels. Such heterogeneity is often associated with demographic aspects (e.g., sex) or disease-related contributors (e.g., genetics). Thus, the predictive power of machine learning models used for symptom prediction varies across subjects based on such factors. To model this heterogeneity, one can assign each training sample a factor-dependent weight, which modulates the subject’s contribution to the overall objective loss function. To this end, we propose to model the subject weights as a linear combination of the eigenbases of a spectral population graph that captures the similarity of factors across subjects. In doing so, the learned weights smoothly vary across the graph, highlighting sub-cohorts with high and low predictability. Our proposed sample weighting scheme is evaluated on two tasks. First, we predict initiation of heavy alcohol drinking in young adulthood from imaging and neuropsychological measures from the National Consortium on Alcohol and NeuroDevelopment in Adolescence (NCANDA). Next, we detect Dementia vs. Mild Cognitive Impairment (MCI) using imaging and demographic measurements in subjects from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Compared to existing sample weighting schemes, our sample weights improve interpretability and highlight sub-cohorts with distinct characteristics and varying model accuracy.
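
One way to read "weights as a linear combination of the eigenbases of a spectral population graph" is sketched below with NumPy. The Gaussian similarity kernel, the sigmoid that keeps weights positive, and all names are assumptions made for illustration. The paper's actual construction may differ.

```python
import numpy as np

def graph_eigenbasis(factors, k, sigma=1.0):
    """First k Laplacian eigenvectors of a Gaussian-similarity graph over subjects."""
    diff = factors[:, None, :] - factors[None, :, :]
    adjacency = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * sigma ** 2))
    np.fill_diagonal(adjacency, 0.0)
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    _, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues come back in ascending order
    return eigvecs[:, :k]                   # low-frequency (smooth) components

def weighted_loss(per_sample_loss, basis, theta):
    """Weights = sigmoid(basis @ theta); the sigmoid is an assumption of this sketch."""
    weights = 1.0 / (1.0 + np.exp(-(basis @ theta)))
    return float(np.sum(weights * per_sample_loss) / np.sum(weights)), weights

rng = np.random.default_rng(0)
factors = rng.normal(size=(6, 2))   # toy per-subject factors, e.g. age and sex
basis = graph_eigenbasis(factors, k=3)
loss, weights = weighted_loss(rng.random(6), basis, theta=np.zeros(3))
```

Because only the smoothest eigenvectors are kept, subjects with similar factors receive similar weights. In training, `theta` would be learned jointly with the predictive model.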

[LG-175] Evaluating Deep Regression Models for WSI-Based Gene-Expression Prediction

Link: https://arxiv.org/abs/2410.00945
Authors: Fredrik K. Gustafsson,Mattias Rantalainen
Keywords: routine whole-slide images, accessible molecular phenotyping, widely accessible molecular, mRNA gene-expression profiles, gene-expression profiles directly
Categories: Genomics (q-bio.GN); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Prediction of mRNA gene-expression profiles directly from routine whole-slide images (WSIs) using deep learning models could potentially offer cost-effective and widely accessible molecular phenotyping. While such WSI-based gene-expression prediction models have recently emerged within computational pathology, the high-dimensional nature of the corresponding regression problem offers numerous design choices which remain to be analyzed in detail. This study provides recommendations on how deep regression models should be trained for WSI-based gene-expression prediction. For example, we conclude that training a single model to simultaneously regress all 20530 genes is a computationally efficient yet very strong baseline.

[LG-176] GAMMA-PD: Graph-based Analysis of Multi-Modal Motor Impairment Assessments in Parkinson's Disease MICCAI2024

Link: https://arxiv.org/abs/2410.00944
Authors: Favour Nerrise(1),Alice Louise Heiman(2),Ehsan Adeli(2,3) ((1) Department of Electrical Engineering, Stanford University, Stanford, CA, USA, (2) Department of Computer Science, Stanford University, Stanford, CA, USA, (3) Department of Psychiatry and Behavioral Sciences, Stanford University, Stanford, CA, USA)
Keywords: electronic health records, health records, multi-modal medical data, rapid advancement, technology has led
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Neurons and Cognition (q-bio.NC)
Comments: Accepted by the 6th Workshop on GRaphs in biomedicAl Image anaLysis (GRAIL) at the 27th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2024). 12 pages, 3 figures, 2 tables, Source Code: this https URL

Abstract:The rapid advancement of medical technology has led to an exponential increase in multi-modal medical data, including imaging, genomics, and electronic health records (EHRs). Graph neural networks (GNNs) have been widely used to represent this data due to their prominent performance in capturing pairwise relationships. However, the heterogeneity and complexity of multi-modal medical data still pose significant challenges for standard GNNs, which struggle with learning higher-order, non-pairwise relationships. This paper proposes GAMMA-PD (Graph-based Analysis of Multi-modal Motor Impairment Assessments in Parkinson’s Disease), a novel heterogeneous hypergraph fusion framework for multi-modal clinical data analysis. GAMMA-PD integrates imaging and non-imaging data into a “hypernetwork” (patient population graph) by preserving higher-order information and similarity between patient profiles and symptom subtypes. We also design a feature-based attention-weighted mechanism to interpret feature-level contributions towards downstream decision tasks. We evaluate our approach with clinical data from the Parkinson’s Progression Markers Initiative (PPMI) and a private dataset. We demonstrate gains in predicting motor impairment symptoms in Parkinson’s disease. Our end-to-end framework also learns associations between subsets of patient characteristics to generate clinically relevant explanations for disease and symptom profiles. The source code is available at this https URL.

[LG-177] AR-Sieve Bootstrap for the Random Forest and a simulation-based comparison with rangerts time series prediction

Link: https://arxiv.org/abs/2410.00942
Authors: Cabrel Teguemne Fokam,Carsten Jentsch,Michel Lang,Markus Pauly
Keywords: time series prediction, including time series, Data Generating Process, Random Forest, spectrum of problems
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:The Random Forest (RF) algorithm can be applied to a broad spectrum of problems, including time series prediction. However, neither the classical IID (Independent and Identically distributed) bootstrap nor block bootstrapping strategies (as implemented in rangerts) completely account for the nature of the Data Generating Process (DGP) while resampling the observations. We propose the combination of RF with a residual bootstrapping technique where we replace the IID bootstrap with the AR-Sieve Bootstrap (ARSB), which assumes the DGP to be an autoregressive process. To assess the new model’s predictive performance, we conduct a simulation study using synthetic data generated from different types of DGPs. It turns out that ARSB provides more variation amongst the trees in the forest. Moreover, RF with ARSB shows greater accuracy compared to RF with other bootstrap strategies. However, these improvements are achieved at some efficiency costs.
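
The resampling step can be sketched as follows: fit an autoregressive model, resample its residuals IID, and regenerate a series from them. This is a simplified illustration with a fixed AR order and a least-squares fit. The function name, the toy series, and the way replicates would feed into individual trees are assumptions, not the authors' implementation.

```python
import numpy as np

def ar_sieve_bootstrap(series, order, rng):
    """Draw one bootstrap replicate of `series` from a fitted AR(order) model."""
    x = np.asarray(series, dtype=float)
    mean = x.mean()
    x = x - mean
    # Least-squares AR fit: x[t] ~ sum_j phi[j] * x[t - 1 - j]
    lagged = np.column_stack([x[order - 1 - j : len(x) - 1 - j] for j in range(order)])
    target = x[order:]
    phi, *_ = np.linalg.lstsq(lagged, target, rcond=None)
    residuals = target - lagged @ phi
    residuals -= residuals.mean()
    # Rebuild a series of the same length from IID-resampled residuals.
    new = list(x[:order])
    for _ in range(len(x) - order):
        recent = np.array(new[-1 : -order - 1 : -1])  # most recent value first
        new.append(float(recent @ phi) + rng.choice(residuals))
    return np.array(new) + mean

rng = np.random.default_rng(0)
series = np.sin(np.arange(50) / 3.0) + 0.1 * rng.normal(size=50)  # toy data
replicate = ar_sieve_bootstrap(series, order=2, rng=rng)
```

In a full AR-sieve procedure the order is usually chosen from the data (for example by an information criterion) and grows with the sample size. The fixed `order=2` here is only for brevity.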

[LG-178] StreamEnsemble: Predictive Queries over Spatiotemporal Streaming Data

Link: https://arxiv.org/abs/2410.00933
Authors: Anderson Chaves,Eduardo Ogasawara,Patrick Valduriez,Fabio Porto
Keywords: stream data pose, processing and analysis, Predictive queries, machine learning, queries over spatiotemporal
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages

Abstract:Predictive queries over spatiotemporal (ST) stream data pose significant data processing and analysis challenges. ST data streams involve a set of time series whose data distributions may vary in space and time, exhibiting multiple distinct patterns. In this context, assuming a single machine learning model would adequately handle such variations is likely to lead to failure. To address this challenge, we propose StreamEnsemble, a novel approach to predictive queries over ST data that dynamically selects and allocates Machine Learning models according to the underlying time series distributions and model characteristics. Our experimental evaluation reveals that this method markedly outperforms traditional ensemble methods and single model approaches in terms of accuracy and time, demonstrating a significant reduction in prediction error of more than 10 times compared to traditional approaches.

[LG-179] On the topology and geometry of population-based SHM

Link: https://arxiv.org/abs/2410.00923
Authors: Keith Worden,Tina A. Dardeno,Aidan J. Hughes,George Tsialiamanis
Keywords: Structural Health Monitoring, Population-Based Structural Health, Health Monitoring, Structural Health, Population-Based Structural
Categories: Machine Learning (stat.ML); Databases (cs.DB); Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:Population-Based Structural Health Monitoring (PBSHM) aims to leverage information across populations of structures in order to enhance diagnostics on those with sparse data. The discipline of transfer learning provides the mechanism for this capability. One recent paper in PBSHM proposed a geometrical view in which the structures were represented as graphs in a metric “base space” with their data captured in the “total space” of a vector bundle above the graph space. This view was more suggestive than mathematically rigorous, although it did allow certain useful arguments. One bar to more rigorous analysis was the absence of a meaningful topology on the graph space, and thus no useful notion of continuity. The current paper aims to address this problem, by moving to parametric families of structures in the base space, essentially changing points in the graph space to open balls. This allows the definition of open sets in the fibre space and thus allows continuous variation between fibres. The new ideas motivate a new geometrical mechanism for transfer learning in which data are transported from one fibre to an adjacent one; i.e., from one structure to another.

Information Retrieval

[IR-0] Elaborative Subtopic Query Reformulation for Broad and Indirect Queries in Travel Destination Recommendation RECSYS2024

Link: https://arxiv.org/abs/2410.01598
Authors: Qianfeng Wen,Yifan Liu,Joshua Zhang,George Saad,Anton Korikov,Yury Sambale,Scott Sanner
Keywords: Travel Recommender Systems, Recommender Systems, school graduation trip, Query-driven Travel Recommender, high school graduation
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 9 pages, 7 figures, The 1st Workshop on Risks, Opportunities, and Evaluation of Generative Models in Recommender Systems (ROEGEN@RecSys 2024), October 2024, Bari, Italy

Abstract:In Query-driven Travel Recommender Systems (RSs), it is crucial to understand the user intent behind challenging natural language (NL) destination queries such as the broadly worded “youth-friendly activities” or the indirect description “a high school graduation trip”. Such queries are challenging due to the wide scope and subtlety of potential user intents that confound the ability of retrieval methods to infer relevant destinations from available textual descriptions such as WikiVoyage. While query reformulation (QR) has proven effective in enhancing retrieval by addressing user intent, existing QR methods tend to focus only on expanding the range of potentially matching query subtopics (breadth) or elaborating on the potential meaning of a query (depth), but not both. In this paper, we introduce Elaborative Subtopic Query Reformulation (EQR), a large language model-based QR method that combines both breadth and depth by generating potential query subtopics with information-rich elaborations. We also release TravelDest, a novel dataset for query-driven travel destination RSs. Experiments on TravelDest show that EQR achieves significant improvements in recall and precision over existing state-of-the-art QR methods.

[IR-1] Peeling Back the Layers: An In-Depth Evaluation of Encoder Architectures in Neural News Recommenders RECSYS2024

Link: https://arxiv.org/abs/2410.01470
Authors: Andreea Iana,Goran Glavaš,Heiko Paulheim
Keywords: Encoder architectures play, user encoder architectures, Encoder architectures, play a pivotal, pivotal role
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted at the 12th International Workshop on News Recommendation and Analytics (INRA 2024) in conjunction with ACM RecSys 2024

Abstract:Encoder architectures play a pivotal role in neural news recommenders by embedding the semantic and contextual information of news and users. Thus, research has heavily focused on enhancing the representational capabilities of news and user encoders to improve recommender performance. Despite the significant impact of encoder architectures on the quality of news and user representations, existing analyses of encoder designs focus only on the overall downstream recommendation performance. This offers a one-sided assessment of the encoders’ similarity, ignoring more nuanced differences in their behavior, and potentially resulting in sub-optimal model selection. In this work, we perform a comprehensive analysis of encoder architectures in neural news recommender systems. We systematically evaluate the most prominent news and user encoder architectures, focusing on their (i) representational similarity, measured with the Centered Kernel Alignment, (ii) overlap of generated recommendation lists, quantified with the Jaccard similarity, and (iii) the overall recommendation performance. Our analysis reveals that the complexity of certain encoding techniques is often empirically unjustified, highlighting the potential for simpler, more efficient architectures. By isolating the effects of individual components, we provide valuable insights for researchers and practitioners to make better informed decisions about encoder selection and avoid unnecessary complexity in the design of news recommenders.
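
The two similarity measures named in the abstract are easy to state in code. The sketch below uses the linear variant of Centered Kernel Alignment (the paper may use a different kernel) and a plain set-based Jaccard similarity. The random matrices and item IDs are toy stand-ins, not data from the study.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two representation matrices with one row per example."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    return cross / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro"))

def jaccard(list_a, list_b):
    """Overlap of two recommendation lists, ignoring rank order."""
    a, b = set(list_a), set(list_b)
    return len(a & b) / len(a | b) if a | b else 1.0

rng = np.random.default_rng(0)
reps_a = rng.normal(size=(20, 8))   # embeddings from one encoder (toy)
reps_b = rng.normal(size=(20, 5))   # embeddings from another encoder (toy)
similarity = linear_cka(reps_a, reps_b)
overlap = jaccard(["n1", "n2", "n3"], ["n2", "n3", "n4"])
```

CKA compares embedding spaces of different widths and is invariant to rotation and uniform scaling, which is what makes it usable across architectures. Jaccard only looks at which items were recommended.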

[IR-2] Analyzing Byte-Pair Encoding on Monophonic and Polyphonic Symbolic Music: A Focus on Musical Phrase Segmentation

Link: https://arxiv.org/abs/2410.01448
Authors: Dinh-Viet-Toan Le,Louis Bigo,Mikaela Keller
Keywords: Natural Language Processing, Natural Language, Language Processing, Byte-Pair Encoding, Processing to build
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted to 3rd Workshop on NLP for Music and Audio (NLP4MusA, co-located with ISMIR 2024)

Abstract:Byte-Pair Encoding (BPE) is an algorithm commonly used in Natural Language Processing to build a vocabulary of subwords, which has been recently applied to symbolic music. Given that symbolic music can differ significantly from text, particularly with polyphony, we investigate how BPE behaves with different types of musical content. This study provides a qualitative analysis of BPE’s behavior across various instrumentations and evaluates its impact on a musical phrase segmentation task for both monophonic and polyphonic music. Our findings show that the BPE training process is highly dependent on the instrumentation and that BPE “supertokens” succeed in capturing abstract musical content. In a musical phrase segmentation task, BPE notably improves performance in a polyphonic setting, but enhances performance in monophonic tunes only within a specific range of BPE merges.
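
For readers unfamiliar with BPE, the merge loop looks roughly like this when applied to a sequence of note tokens. The note-name tokenization, the `+` separator for merged "supertokens", and the toy melody are assumptions for illustration. The paper's actual music tokenization is not described in the abstract.

```python
from collections import Counter

def learn_bpe(sequences, num_merges):
    """Greedily merge the most frequent adjacent token pair, num_merges times."""
    sequences = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in sequences:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged_token = best[0] + "+" + best[1]
        for i, seq in enumerate(sequences):
            out, j = [], 0
            while j < len(seq):
                if j + 1 < len(seq) and (seq[j], seq[j + 1]) == best:
                    out.append(merged_token)  # replace the pair with one supertoken
                    j += 2
                else:
                    out.append(seq[j])
                    j += 1
            sequences[i] = out
    return merges, sequences

melody = ["C4", "E4", "G4", "C4", "E4", "G4", "A4", "C4", "E4"]
merges, encoded = learn_bpe([melody], num_merges=2)
```

After two merges the recurring C-E-G figure becomes a single token, which is the kind of "supertoken" the abstract refers to. The number of merges is the knob the study varies.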

[IR-3] Can We Delegate Learning to Automation?: A Comparative Study of LLM Chatbots, Search Engines, and Books

Link: https://arxiv.org/abs/2410.01396
Authors: Yeonsun Yang,Ahyeon Shin,Mincheol Kang,Jiheon Kang,Jean Young Song
Keywords: motivator behind information, Learning, Abstract, key motivator, search
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 21 pages, 14 figures

Abstract:Learning is a key motivator behind information search behavior. With the emergence of LLM-based chatbots, students are increasingly turning to these tools as their primary resource for acquiring knowledge. However, the transition from traditional resources like textbooks and web searches raises concerns among educators. They worry that these fully-automated LLMs might lead students to delegate critical steps of search as learning. In this paper, we systematically uncover three main concerns from educators’ perspectives. In response to these concerns, we conducted a mixed-methods study with 92 university students to compare three learning sources with different automation levels. Our results show that LLMs support comprehensive understanding of key concepts without promoting passive learning, though their effectiveness in knowledge retention was limited. Additionally, we found that academic performance impacted both learning outcomes and search patterns. Notably, higher-competence learners engaged more deeply with content through reading-intensive behaviors rather than relying on search activities.

[IR-4] PairDistill: Pairwise Relevance Distillation for Dense Retrieval EMNLP2024

Link: https://arxiv.org/abs/2410.01383
Authors: Chao-Wei Huang,Yun-Nung Chen
Keywords: Effective information retrieval, vast datasets relies, Effective information, extract relevant information, response to queries
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2024 Main Conference

Abstract:Effective information retrieval (IR) from vast datasets relies on advanced techniques to extract relevant information in response to queries. Recent advancements in dense retrieval have showcased remarkable efficacy compared to traditional sparse retrieval methods. To further enhance retrieval performance, knowledge distillation techniques, often leveraging robust cross-encoder rerankers, have been extensively explored. However, existing approaches primarily distill knowledge from pointwise rerankers, which assign absolute relevance scores to documents, thus facing challenges related to inconsistent comparisons. This paper introduces Pairwise Relevance Distillation (PairDistill) to leverage pairwise reranking, offering fine-grained distinctions between similarly relevant documents to enrich the training of dense retrieval models. Our experiments demonstrate that PairDistill outperforms existing methods, achieving new state-of-the-art results across multiple benchmarks. This highlights the potential of PairDistill in advancing dense retrieval techniques effectively. Our source code and trained models are released at this https URL
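
The abstract does not give the training objective, so the following is only one plausible form of a pairwise distillation loss: the student's preference between two documents is the sigmoid of their score difference, and it is matched to the teacher's pairwise preference with cross-entropy. All names and numbers are hypothetical.

```python
import math

def pairwise_distillation_loss(student_score_i, student_score_j, teacher_prob_i_over_j):
    """Cross-entropy between teacher and student pairwise preferences."""
    # Student's probability that document i is more relevant than document j.
    student_prob = 1.0 / (1.0 + math.exp(-(student_score_i - student_score_j)))
    p = teacher_prob_i_over_j
    return -(p * math.log(student_prob) + (1.0 - p) * math.log(1.0 - student_prob))

# The teacher prefers document i with probability 0.9.
agree = pairwise_distillation_loss(2.0, 0.0, teacher_prob_i_over_j=0.9)
disagree = pairwise_distillation_loss(0.0, 2.0, teacher_prob_i_over_j=0.9)
```

Only the difference between the two student scores matters here, which is how a pairwise signal sidesteps the calibration problem of absolute pointwise relevance scores.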

[IR-5] Integrating Visual and Textual Inputs for Searching Large-Scale Map Collections with CLIP

Link: https://arxiv.org/abs/2410.01190
Authors: Jamie Mahowald,Benjamin Charles Germain Lee
Keywords: exploring map collections, current methods, structured metadata, prevalence and historical, historical importance
Categories: Information Retrieval (cs.IR); Digital Libraries (cs.DL)
Comments: 18 pages, 7 figures, accepted at the Computational Humanities Research Conference (CHR 2024)

Abstract:Despite the prevalence and historical importance of maps in digital collections, current methods of navigating and exploring map collections are largely restricted to catalog records and structured metadata. In this paper, we explore the potential for interactively searching large-scale map collections using natural language inputs (“maps with sea monsters”), visual inputs (i.e., reverse image search), and multimodal inputs (an example map + “more grayscale”). As a case study, we adopt 562,842 images of maps publicly accessible via the Library of Congress’s API. To accomplish this, we use the multimodal Contrastive Language-Image Pre-training (CLIP) machine learning model to generate embeddings for these maps, and we develop code to implement exploratory search capabilities with these input strategies. We present results for example searches created in consultation with staff in the Library of Congress’s Geography and Map Division and describe the strengths, weaknesses, and possibilities for these search queries. Moreover, we introduce a fine-tuning dataset of 10,504 map-caption pairs, along with an architecture for fine-tuning a CLIP model on this dataset. To facilitate re-use, we provide all of our code in documented, interactive Jupyter notebooks and place all code into the public domain. Lastly, we discuss the opportunities and challenges for applying these approaches across both digitized and born-digital collections held by galleries, libraries, archives, and museums.
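
Once every map has an embedding, all three query types reduce to nearest-neighbour search in the shared embedding space. The sketch below assumes the embeddings are already computed and uses random vectors as stand-ins. Combining an image and a text query by summing their normalized embeddings is an assumption of this example, not necessarily what the authors' notebooks do.

```python
import numpy as np

def normalize(v):
    """Scale vectors to unit length along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def search(query_embeddings, collection, top_k=3):
    """Rank collection rows by cosine similarity to the combined query."""
    query = normalize(np.sum([normalize(q) for q in query_embeddings], axis=0))
    scores = normalize(collection) @ query
    order = np.argsort(-scores)[:top_k]
    return order.tolist(), scores[order].tolist()

rng = np.random.default_rng(0)
collection = rng.normal(size=(100, 16))   # stand-in for CLIP map embeddings
image_query = collection[7]               # "an example map"
text_query = rng.normal(size=16)          # stand-in for the text embedding of "more grayscale"
ids, scores = search([image_query, text_query], collection)
```

A text-only or image-only search is the same call with a single query embedding. For a collection of several hundred thousand maps, an approximate nearest-neighbour index would likely replace the brute-force matrix product.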

[IR-6] GraphRevisedIE: Multimodal Information Extraction with Graph-Revised Network

Link: https://arxiv.org/abs/2410.01160
Authors: Panfeng Cao,Jian Wu
Keywords: Key information extraction, Key information, visually rich documents, information extraction, visually rich
Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Key information extraction (KIE) from visually rich documents (VRD) has been a challenging task in document intelligence because of not only the complicated and diverse layouts of VRD that make the model hard to generalize but also the lack of methods to exploit the multimodal features in VRD. In this paper, we propose a light-weight model named GraphRevisedIE that effectively embeds multimodal features such as textual, visual, and layout features from VRD and leverages graph revision and graph convolution to enrich the multimodal embedding with global context. Extensive experiments on multiple real-world datasets show that GraphRevisedIE generalizes to documents of varied layouts and achieves comparable or better performance compared to previous KIE methods. We also publish a business license dataset that contains both real-life and synthesized documents to facilitate research of document KIE.

[IR-7] Unleashing the Power of Large Language Models in Zero-shot Relation Extraction via Self-Prompting EMNLP2024

Link: https://arxiv.org/abs/2410.01154
Authors: Siyi Liu,Yang Li,Jiang Li,Shan Yang,Yunshi Lan
Keywords: Large Language Models, Language Models, Large Language, Recent research, zero-shot Relation Extraction
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: EMNLP 2024 Short

Abstract:Recent research in zero-shot Relation Extraction (RE) has focused on using Large Language Models (LLMs) due to their impressive zero-shot capabilities. However, current methods often perform suboptimally, mainly due to a lack of detailed, context-specific prompts needed for understanding various sentences and relations. To address this, we introduce the Self-Prompting framework, a novel method designed to fully harness the embedded RE knowledge within LLMs. Specifically, our framework employs a three-stage diversity approach to prompt LLMs, generating multiple synthetic samples that encapsulate specific relations from scratch. These generated samples act as in-context learning samples, offering explicit and context-specific guidance to efficiently prompt LLMs for RE. Experimental evaluations on benchmark datasets show our approach outperforms existing LLM-based zero-shot RE methods. Additionally, our experiments confirm the effectiveness of our generation pipeline in producing high-quality synthetic data that enhances performance.

[IR-8] Text Clustering as Classification with LLMs

Link: https://arxiv.org/abs/2410.00927
Authors: Chen Huang,Guoxiu He
Keywords: clustering remains valuable, labeling is cost-prohibitive, Text clustering remains, remains valuable, valuable in real-world
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 12 pages, 3 figures

Abstract:Text clustering remains valuable in real-world applications where manual labeling is cost-prohibitive. It facilitates efficient organization and analysis of information by grouping similar texts based on their representations. However, implementing this approach necessitates fine-tuned embedders for downstream data and sophisticated similarity metrics. To address this issue, this study presents a novel framework for text clustering that effectively leverages the in-context learning capacity of Large Language Models (LLMs). Instead of fine-tuning embedders, we propose to transform the text clustering into a classification task via LLM. First, we prompt LLM to generate potential labels for a given dataset. Second, after integrating similar labels generated by the LLM, we prompt the LLM to assign the most appropriate label to each sample in the dataset. Our framework has been experimentally proven to achieve comparable or superior performance to state-of-the-art clustering methods that employ embeddings, without requiring complex fine-tuning or clustering algorithms. We make our code available to the public for utilization at this https URL.
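
The two-stage procedure described above can be outlined as a small pipeline in which each LLM prompt is represented by a callable. Everything here is a hypothetical stand-in: the function names are invented for this sketch, and the keyword rules only imitate what an LLM would return so that the example runs without a model.

```python
def cluster_as_classification(texts, propose_labels, merge_labels, assign_label):
    """Two-stage pipeline; each callable stands in for one LLM prompt."""
    candidates = []
    for text in texts:                        # stage 1: generate candidate labels
        candidates.extend(propose_labels(text))
    labels = merge_labels(candidates)         # consolidate similar labels
    return {text: assign_label(text, labels) for text in texts}  # stage 2: classify

# Toy stand-ins for the LLM calls (keyword rules, purely illustrative).
KEYWORDS = {"sports": "match", "finance": "stock"}

def toy_propose(text):
    return [label.title() for label, kw in KEYWORDS.items() if kw in text.lower()]

def toy_merge(candidates):
    return sorted({c.lower() for c in candidates})

def toy_assign(text, labels):
    return next(label for label in labels if KEYWORDS[label] in text.lower())

texts = ["The match ended 2-1", "Stocks fell sharply", "The team won the match"]
clusters = cluster_as_classification(texts, toy_propose, toy_merge, toy_assign)
```

The merge step matters because independently generated labels tend to contain near-duplicates ("Sports" vs. "sports"). Without it, the classification stage would split one cluster into several.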

Attachments

Click to download today's full paper list