Arxiv今日论文 | 2024-10-25

本篇博文主要展示 2024-10-25 从Arxiv.org论文网站获取的最新论文列表，自动更新，按照NLP、CV、ML、AI、IR五个大方向区分，若需要邮件定时接收，请在评论区留下你的邮箱号。

说明：每日论文数据从Arxiv.org获取，每天早上12:00左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据，请在评论处留下你的邮箱。

【速读】：该论文试图解决的问题是当前大型多模态模型（LMMs）评估基准主要以英语为中心，缺乏对阿拉伯语等其他语言的支持，尤其是考虑到阿拉伯语有超过4亿的使用者。解决方案的关键在于开发了一个名为CAMEL-Bench的综合性阿拉伯语LMM评估基准。

CAMEL-Bench包含八个多样化的领域和38个子领域，涵盖多图像理解、复杂视觉感知、手写文档理解、视频理解、医学影像、植物病害、以及基于遥感的土地利用理解等多个任务，以评估模型在广泛场景中的通用性。该基准包含约29,036个问题，这些问题经过母语者的手动验证，以确保评估的可靠性。

通过对比闭源模型（如GPT-4系列）和开源LMMs的评估结果，论文揭示了现有模型在阿拉伯语任务上的显著不足，即使是表现最好的开源模型也存在显著改进空间。这一解决方案的关键在于提供了一个全面、高质量的阿拉伯语评估基准，以推动多模态模型在非英语环境中的发展。

链接: https://arxiv.org/abs/2410.18976
作者: Sara Ghaboura,Ahmed Heakl,Omkar Thawakar,Ali Alharthi,Ines Riahi,Abduljalil Saif,Jorma Laaksonen,Fahad S. Khan,Salman Khan,Rao M. Anwer
关键词-EN: Recent years, developing large multimodal, capable of performing, years have witnessed, witnessed a significant
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 10 pages, 5 figures, NAACL

点击查看摘要

Abstract:Recent years have witnessed a significant interest in developing large multimodal models (LMMs) capable of performing various visual reasoning and understanding tasks. This has led to the introduction of multiple LMM benchmarks to evaluate LMMs on different tasks. However, most existing LMM evaluation benchmarks are predominantly English-centric. In this work, we develop a comprehensive LMM evaluation benchmark for the Arabic language to represent a large population of over 400 million speakers. The proposed benchmark, named CAMEL-Bench, comprises eight diverse domains and 38 sub-domains including, multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant diseases, and remote sensing-based land use understanding to evaluate broad scenario generalizability. Our CAMEL-Bench comprises around 29,036 questions that are filtered from a larger pool of samples, where the quality is manually verified by native speakers to ensure reliable model assessment. We conduct evaluations of both closed-source, including GPT-4 series, and open-source LMMs. Our analysis reveals the need for substantial improvement, especially among the best open-source models, with even the closed-source GPT-4o achieving an overall score of 62%. Our benchmark and evaluation scripts are open-sourced.
摘要：近年来，开发能够执行多种视觉推理和理解任务的大型多模态模型（LMMs）引起了广泛关注。这促使了多个LMM基准的引入，用于评估LMM在不同任务上的表现。然而，大多数现有的LMM评估基准主要以英语为中心。在本研究中，我们为阿拉伯语开发了一个全面的LMM评估基准，以代表超过4亿人口的语言群体。该基准命名为CAMEL-Bench，涵盖了八个不同领域和38个子领域，包括多图像理解、复杂视觉感知、手写文档理解、视频理解、医学影像、植物病害以及基于遥感的土地利用理解，以评估广泛的场景通用性。我们的CAMEL-Bench包含约29,036个问题，这些问题是经过筛选的，并由母语者手动验证质量，以确保模型的可靠评估。我们对包括GPT-4系列在内的闭源LMM和开源LMM进行了评估。我们的分析显示，需要显著改进，尤其是最好的开源模型，即使是闭源的GPT-4o也仅达到62%的整体得分。我们的基准和评估脚本已开源。

[NLP-1] Unbounded: A Generative Infinite Game of Character Life Simulation

【速读】：该论文试图解决传统有限游戏（finite games）在游戏机制、叙事和角色互动方面的局限性问题。解决方案的关键在于引入生成式无限游戏（generative infinite game）的概念，并利用生成式 AI (Generative AI) 技术来创建一个超越传统有限系统的游戏体验。

具体来说，论文提出了以下两个关键技术创新：

专门设计的蒸馏大型语言模型 (LLM)，用于实时动态生成游戏机制、叙事和角色互动。
一种新的动态区域图像提示适配器 (IP-Adapter)，用于视觉模型，确保角色在多个环境中视觉生成的连贯性和灵活性。

通过这些创新，论文展示了在角色生活模拟、用户指令跟随、叙事连贯性和视觉一致性方面的显著改进，相较于传统相关方法。

链接: https://arxiv.org/abs/2410.18975
作者: Jialu Li,Yuanzhen Li,Neal Wadhwa,Yael Pritch,David E. Jacobs,Michael Rubinstein,Mohit Bansal,Nataniel Ruiz
关键词-EN: generative infinite game, introduce the concept, character life simulation, generative models, James P. Carse
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Graphics (cs.GR); Machine Learning (cs.LG)
备注: 18 pages; Project page: this https URL

点击查看摘要

Abstract:We introduce the concept of a generative infinite game, a video game that transcends the traditional boundaries of finite, hard-coded systems by using generative models. Inspired by James P. Carse’s distinction between finite and infinite games, we leverage recent advances in generative AI to create Unbounded: a game of character life simulation that is fully encapsulated in generative models. Specifically, Unbounded draws inspiration from sandbox life simulations and allows you to interact with your autonomous virtual character in a virtual world by feeding, playing with and guiding it - with open-ended mechanics generated by an LLM, some of which can be emergent. In order to develop Unbounded, we propose technical innovations in both the LLM and visual generation domains. Specifically, we present: (1) a specialized, distilled large language model (LLM) that dynamically generates game mechanics, narratives, and character interactions in real-time, and (2) a new dynamic regional image prompt Adapter (IP-Adapter) for vision models that ensures consistent yet flexible visual generation of a character across multiple environments. We evaluate our system through both qualitative and quantitative analysis, showing significant improvements in character life simulation, user instruction following, narrative coherence, and visual consistency for both characters and the environments compared to traditional related approaches.
摘要：我们提出了生成式无限游戏的概念，这是一种超越传统有限、硬编码系统的视频游戏，通过使用生成式模型实现。受James P. Carse对有限游戏和无限游戏的区分启发，我们利用生成式 AI 的最新进展，创造了《Unbounded》：一个完全封装在生成模型中的角色生活模拟游戏。具体来说，《Unbounded》借鉴了沙盒生活模拟游戏的灵感，允许玩家在一个虚拟世界中与自主的虚拟角色互动，通过喂养、玩耍和引导它——其开放式的游戏机制由大语言模型生成，其中一些机制可能是涌现的。为了开发《Unbounded》，我们在大语言模型和视觉生成领域提出了技术创新。具体来说，我们提出了：(1) 一个专门化的、蒸馏后的大语言模型，能够实时动态生成游戏机制、叙事和角色互动；(2) 一种新的动态区域图像提示适配器（IP-Adapter），用于视觉模型，确保角色在多个环境中视觉生成的连贯性和灵活性。我们通过定性和定量分析评估了我们的系统，显示在角色生活模拟、用户指令跟随、叙事连贯性以及角色和环境的视觉一致性方面，相比传统相关方法有显著改进。

[NLP-2] Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms

【速读】：该论文试图解决用户界面（UI）理解中的通用性问题，特别是在面对平台多样性、分辨率变化和数据限制等基础性挑战时。解决方案的关键在于引入Ferret-UI 2，这是一个多模态大语言模型（MLLM），具备以下三个核心创新：

支持多种平台类型：Ferret-UI 2能够理解和操作多种平台，包括iPhone、Android、iPad、Webpage和AppleTV。
高分辨率感知与自适应缩放：通过自适应缩放技术，模型能够处理不同分辨率的UI元素，确保在各种设备上的准确感知。
基于GPT-4o的任务训练数据生成：利用GPT-4o和视觉提示集（set-of-mark visual prompting）生成高级任务训练数据，增强了模型的任务适应性和用户中心交互能力。

这些创新使得Ferret-UI 2在复杂的用户中心任务中表现出色，显著优于前代模型，并展现出强大的跨平台迁移能力。

链接: https://arxiv.org/abs/2410.18967
作者: Zhangheng Li,Keen You,Haotian Zhang,Di Feng,Harsh Agrawal,Xiujun Li,Mohana Prasad Sathya Moorthy,Jeff Nichols,Yinfei Yang,Zhe Gan
关键词-EN: resolution variation, user interface, foundational issues, challenging due, Ferret-UI
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Building a generalist model for user interface (UI) understanding is challenging due to various foundational issues, such as platform diversity, resolution variation, and data limitation. In this paper, we introduce Ferret-UI 2, a multimodal large language model (MLLM) designed for universal UI understanding across a wide range of platforms, including iPhone, Android, iPad, Webpage, and AppleTV. Building on the foundation of Ferret-UI, Ferret-UI 2 introduces three key innovations: support for multiple platform types, high-resolution perception through adaptive scaling, and advanced task training data generation powered by GPT-4o with set-of-mark visual prompting. These advancements enable Ferret-UI 2 to perform complex, user-centered interactions, making it highly versatile and adaptable for the expanding diversity of platform ecosystems. Extensive empirical experiments on referring, grounding, user-centric advanced tasks (comprising 9 subtasks \times 5 platforms), GUIDE next-action prediction dataset, and GUI-World multi-platform benchmark demonstrate that Ferret-UI 2 significantly outperforms Ferret-UI, and also shows strong cross-platform transfer capabilities.
摘要：构建一个用于用户界面（UI）理解的通用模型面临诸多基础性挑战，如平台多样性、分辨率变化及数据限制等。本文介绍了一种名为 Ferret-UI 2 的多模态大语言模型（MLLM），该模型旨在跨多种平台实现通用 UI 理解，包括 iPhone、Android、iPad、Webpage 和 AppleTV。在 Ferret-UI 的基础上，Ferret-UI 2 引入了三项关键创新：支持多种平台类型、通过自适应缩放实现高分辨率感知，以及利用 GPT-4o 结合 set-of-mark 视觉提示生成高级任务训练数据。这些进步使得 Ferret-UI 2 能够执行复杂的、以用户为中心的交互任务，从而在不断扩展的平台生态系统中表现出高度的多功能性和适应性。在指称、定位、以用户为中心的高级任务（包含 9 个子任务 × 5 个平台）、GUIDE 下一步动作预测数据集以及 GUI-World 多平台基准的广泛实证实验中，Ferret-UI 2 显著优于 Ferret-UI，并展现出强大的跨平台迁移能力。

[NLP-3] Does Data Contamination Detection Work (Well) for LLM s? A Survey and Evaluation on Detection Assumptions

【速读】：该论文试图解决的问题是大型语言模型（LLMs）在评估过程中可能存在的数据污染问题，即训练数据与评估数据集之间的重叠可能导致性能评估的偏差。解决方案的关键在于系统性地审查和分类现有的数据污染检测方法，评估这些方法所依赖的假设是否在不同场景下都有效。

具体来说，论文通过回顾47篇相关文献，识别并分析了八类假设，并测试了其中三类假设作为案例研究。研究发现，基于这三类假设的检测方法在分类预训练LLMs所使用的实例时，表现接近随机猜测，这表明当前的LLMs更多地学习了数据分布而非记忆个别实例。因此，论文强调了明确陈述方法的假设并跨不同场景验证其有效性的重要性。

链接: https://arxiv.org/abs/2410.18966
作者: Yujuan Fu,Ozlem Uzuner,Meliha Yetisgen,Fei Xia
关键词-EN: Large language models, general-purpose task solvers, Large language, demonstrated great performance, language models
类目: Computation and Language (cs.CL)
备注: 2 tables and 1 figures in the main text. This is a preprint, under review

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated great performance across various benchmarks, showing potential as general-purpose task solvers. However, as LLMs are typically trained on vast amounts of data, a significant concern in their evaluation is data contamination, where overlap between training data and evaluation datasets inflates performance assessments. While multiple approaches have been developed to identify data contamination, these approaches rely on specific assumptions that may not hold universally across different settings. To bridge this gap, we systematically review 47 papers on data contamination detection, categorize the underlying assumptions, and assess whether they have been rigorously validated. We identify and analyze eight categories of assumptions and test three of them as case studies. Our analysis reveals that when classifying instances used for pretraining LLMs, detection approaches based on these three assumptions perform close to random guessing, suggesting that current LLMs learn data distributions rather than memorizing individual instances. Overall, this work underscores the importance of approaches clearly stating their underlying assumptions and testing their validity across various scenarios.
摘要：大语言模型（Large Language Models, LLMs）在多种基准测试中展现了卓越的表现，显示出作为通用任务解决器的潜力。然而，由于LLMs通常在海量数据上进行训练，其评估中的一个重大问题是数据污染，即训练数据与评估数据集之间的重叠导致性能评估的膨胀。尽管已有多项方法被开发用于识别数据污染，但这些方法依赖于特定的假设，这些假设在不同环境下可能并不普遍适用。为了填补这一空白，我们系统性地回顾了47篇关于数据污染检测的论文，分类了其背后的假设，并评估了这些假设是否经过严格验证。我们识别并分析了八类假设，并以其中三类作为案例进行测试。我们的分析表明，在分类用于预训练LLMs的实例时，基于这三类假设的检测方法表现接近随机猜测，这表明当前的LLMs学习的是数据分布而非记忆个别实例。总体而言，这项工作强调了明确陈述方法背后的假设并跨多种场景测试其有效性的重要性。

[NLP-4] OSCAR: Operating System Control via State-Aware Reasoning and Re-Planning

【速读】：该论文试图解决的问题是大型语言模型（LLMs）和大型多模态模型（LMMs）在跨多样应用场景中的泛化能力有限，从而限制了其在自动化复杂任务中的广泛应用。解决方案的关键是提出了一个名为OSCAR的通用代理系统，该系统通过状态感知推理和重规划（state-Aware reasoning and Re-planning）来控制操作系统。

OSCAR的核心功能包括：

通过标准化的控制方式（如鼠标和键盘输入）与各种桌面和移动应用程序进行交互。
处理屏幕图像以执行用户命令。
将人类指令翻译成可执行的Python代码，实现对图形用户界面（GUIs）的精确控制。
作为一个状态机运行，配备错误处理机制和动态任务重规划，以适应实时反馈和异常情况，提高系统的稳定性和适应性。

通过这些关键功能，OSCAR能够将复杂的任务流程简化为简单的自然语言命令，显著提升用户的工作效率。

链接: https://arxiv.org/abs/2410.18963
作者: Xiaoqiang Wang,Bang Liu
关键词-EN: large multimodal models, Large language models, shown great potential, large multimodal, multimodal models
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Work in progress

点击查看摘要

Abstract:Large language models (LLMs) and large multimodal models (LMMs) have shown great potential in automating complex tasks like web browsing and gaming. However, their ability to generalize across diverse applications remains limited, hindering broader utility. To address this challenge, we present OSCAR: Operating System Control via state-Aware reasoning and Re-planning. OSCAR is a generalist agent designed to autonomously navigate and interact with various desktop and mobile applications through standardized controls, such as mouse and keyboard inputs, while processing screen images to fulfill user commands. OSCAR translates human instructions into executable Python code, enabling precise control over graphical user interfaces (GUIs). To enhance stability and adaptability, OSCAR operates as a state machine, equipped with error-handling mechanisms and dynamic task re-planning, allowing it to efficiently adjust to real-time feedback and exceptions. We demonstrate OSCAR’s effectiveness through extensive experiments on diverse benchmarks across desktop and mobile platforms, where it transforms complex workflows into simple natural language commands, significantly boosting user productivity. Our code will be open-source upon publication.
摘要：大语言模型 (LLM) 和大型多模态模型 (LMM) 在自动化复杂的任务如网页浏览和游戏方面展现了巨大的潜力。然而，它们在跨多种应用的泛化能力仍然有限，这限制了其更广泛的应用。为了应对这一挑战，我们提出了 OSCAR：通过状态感知推理和重规划的操作系统控制。OSCAR 是一个通用型智能体，旨在通过标准化的控制方式（如鼠标和键盘输入）自主导航并与各种桌面和移动应用程序进行交互，同时处理屏幕图像以执行用户指令。OSCAR 将人类指令转化为可执行的 Python 代码，从而实现对图形用户界面 (GUI) 的精确控制。为了增强稳定性和适应性，OSCAR 作为一个状态机运行，配备了错误处理机制和动态任务重规划功能，使其能够高效地根据实时反馈和异常情况进行调整。我们通过在桌面和移动平台上的多样化基准测试中进行的广泛实验，展示了 OSCAR 的有效性，它将复杂的操作流程转化为简单的自然语言命令，显著提升了用户的工作效率。我们的代码将在发表后开源。

[NLP-5] Bridge-Coder: Unlocking LLM s Potential to Overcome Language Gaps in Low-Resource Code

【速读】：该论文试图解决大语言模型（LLMs）在处理低资源编程语言（LRPLs）时表现不佳的问题。具体来说，LLMs在生成高资源编程语言（HRPLs）如Python的代码时表现出色，但在处理LRPLs如Racket或D时则面临显著困难。这种性能差距加剧了数字鸿沟，阻碍了使用LRPLs的开发者从LLM的进步中平等受益，并加剧了编程社区中创新的不平等。

解决方案的关键在于提出了一种名为Bridge-Coder的新方法，该方法利用LLMs的内在能力来提升对LRPLs的处理性能。具体步骤包括：

桥接生成（Bridge Generation）：通过利用LLMs对通用知识的理解、对HRPLs的熟练掌握以及上下文学习能力，创建高质量的数据集。
桥接对齐（Bridged Alignment）：逐步改进自然语言指令与LRPLs之间的对齐。

实验结果表明，Bridge-Coder显著提升了模型在多种LRPLs上的性能，展示了该方法的有效性和泛化能力。此外，论文还详细分析了方法的关键组成部分，为未来解决LRPLs相关挑战的工作提供了有价值的见解。

链接: https://arxiv.org/abs/2410.18957
作者: Jipeng Zhang,Jianshu Zhang,Yuanzhe Li,Renjie Pi,Rui Pan,Runtao Liu,Ziqiang Zheng,Tong Zhang
关键词-EN: Large Language Models, Python but struggle, Large Language, demonstrate strong proficiency, high-resource programming languages
类目: Computation and Language (cs.CL)
备注: 15 pages, 3 figures

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate strong proficiency in generating code for high-resource programming languages (HRPLs) like Python but struggle significantly with low-resource programming languages (LRPLs) such as Racket or D. This performance gap deepens the digital divide, preventing developers using LRPLs from benefiting equally from LLM advancements and reinforcing disparities in innovation within underrepresented programming communities. While generating additional training data for LRPLs is promising, it faces two key challenges: manual annotation is labor-intensive and costly, and LLM-generated LRPL code is often of subpar quality. The underlying cause of this issue is the gap between natural language to programming language gap (NL-PL Gap), which is especially pronounced in LRPLs due to limited aligned data. In this work, we introduce a novel approach called Bridge-Coder, which leverages LLMs’ intrinsic capabilities to enhance the performance on LRPLs. Our method consists of two key stages. Bridge Generation, where we create high-quality dataset by utilizing LLMs’ general knowledge understanding, proficiency in HRPLs, and in-context learning abilities. Then, we apply the Bridged Alignment, which progressively improves the alignment between NL instructions and LRPLs. Experimental results across multiple LRPLs show that Bridge-Coder significantly enhances model performance, demonstrating the effectiveness and generalization of our approach. Furthermore, we offer a detailed analysis of the key components of our method, providing valuable insights for future work aimed at addressing the challenges associated with LRPLs.
摘要：大语言模型 (LLMs) 在为高资源编程语言 (HRPLs) 如 Python 生成代码方面表现出强大的能力，但在处理低资源编程语言 (LRPLs) 如 Racket 或 D 时则显得力不从心。这种性能差距加剧了数字鸿沟，使得使用 LRPLs 的开发者无法平等地受益于 LLM 的进步，从而在代表性不足的编程社区中加剧了创新的不平等。尽管为 LRPLs 生成额外的训练数据具有潜力，但面临两大挑战：手动标注既费时又昂贵，且 LLM 生成的 LRPL 代码质量往往不佳。这一问题的根本原因是自然语言与编程语言之间的差距 (NL-PL Gap)，在 LRPLs 中尤为明显，因为其对齐数据有限。在本研究中，我们提出了一种名为 Bridge-Coder 的新方法，利用 LLMs 的内在能力来提升对 LRPLs 的性能。我们的方法包括两个关键阶段：首先是桥接生成 (Bridge Generation)，通过利用 LLMs 对通用知识的理解、对 HRPLs 的熟练掌握以及上下文学习能力，创建高质量的数据集；接着是桥接对齐 (Bridged Alignment)，逐步改进自然语言指令与 LRPLs 之间的对齐。在多个 LRPLs 上的实验结果表明，Bridge-Coder 显著提升了模型性能，证明了我们方法的有效性和普适性。此外，我们还对方法的关键组成部分进行了详细分析，为未来解决 LRPLs 相关挑战的工作提供了宝贵的见解。

[NLP-6] BioMistral-NLU: Towards More Generalizable Medical Language Understanding through Instruction Tuning

【速读】：该论文试图解决的问题是：在医疗领域自然语言理解（NLU）任务中，现有的指令微调大型语言模型（LLMs）如ChatGPT在需要领域知识、细粒度文本理解和结构化数据提取的特定任务上表现不佳。

解决方案的关键在于：

统一提示格式：提出了一种适用于7种重要NLU任务的统一提示格式，通过跨度提取和多选问答（QA）实现。
构建指令微调数据集：利用现有的多种开源医疗NLU语料库，构建了名为MNLU-Instruct的指令微调数据集。
开发医疗NLU模型：通过在MNLU-Instruct上微调BioMistral，开发了可泛化的医疗NLU模型BioMistral-NLU。

实验结果表明，BioMistral-NLU在零样本设置下，在多个重要NLU任务上优于原始BioMistral以及专有的LLMs（如ChatGPT和GPT-4）。该研究强调了在多样化的NLU任务上进行指令微调，能够增强LLMs在不同医疗NLU任务中的泛化能力。

链接: https://arxiv.org/abs/2410.18955
作者: Yujuan Velvin Fu,Giridhar Kaushik Ramachandran,Namu Park,Kevin Lybarger,Fei Xia,Ozlem Uzuner,Meliha Yetisgen
关键词-EN: Large language models, Biomedical Language Understanding, language understanding, Language Understanding Evaluation, important NLU tasks
类目: Computation and Language (cs.CL)
备注: 3 figures an 5 tables

点击查看摘要

Abstract:Large language models (LLMs) such as ChatGPT are fine-tuned on large and diverse instruction-following corpora, and can generalize to new tasks. However, those instruction-tuned LLMs often perform poorly in specialized medical natural language understanding (NLU) tasks that require domain knowledge, granular text comprehension, and structured data extraction. To bridge the gap, we: (1) propose a unified prompting format for 7 important NLU tasks, % through span extraction and multi-choice question-answering (QA), (2) curate an instruction-tuning dataset, MNLU-Instruct, utilizing diverse existing open-source medical NLU corpora, and (3) develop BioMistral-NLU, a generalizable medical NLU model, through fine-tuning BioMistral on MNLU-Instruct. We evaluate BioMistral-NLU in a zero-shot setting, across 6 important NLU tasks, from two widely adopted medical NLU benchmarks: Biomedical Language Understanding Evaluation (BLUE) and Biomedical Language Understanding and Reasoning Benchmark (BLURB). Our experiments show that our BioMistral-NLU outperforms the original BioMistral, as well as the proprietary LLMs - ChatGPT and GPT-4. Our dataset-agnostic prompting strategy and instruction tuning step over diverse NLU tasks enhance LLMs’ generalizability across diverse medical NLU tasks. Our ablation experiments show that instruction-tuning on a wider variety of tasks, even when the total number of training instances remains constant, enhances downstream zero-shot generalization.
摘要：大语言模型（LLMs）如 ChatGPT 在大量且多样化的指令跟随语料库上进行了微调，并能泛化到新任务。然而，这些经过指令微调的 LLMs 在需要领域知识、细致文本理解和结构化数据提取的专门医疗自然语言理解（NLU）任务中往往表现不佳。为了填补这一差距，我们：（1）提出了一种统一的提示格式，适用于 7 项重要的 NLU 任务，通过跨度提取和多选题问答（QA）实现；（2）利用多样化的现有开源医疗 NLU 语料库，精心构建了一个指令微调数据集 MNLU-Instruct；（3）通过在 MNLU-Instruct 上微调 BioMistral，开发了可泛化的医疗 NLU 模型 BioMistral-NLU。我们在零样本设置下，针对 6 项重要的 NLU 任务，从两个广泛采用的医疗 NLU 基准测试——生物医学语言理解评估（BLUE）和生物医学语言理解与推理基准（BLURB）中评估了 BioMistral-NLU。实验结果表明，BioMistral-NLU 优于原始的 BioMistral，以及专有的 LLMs——ChatGPT 和 GPT-4。我们的数据集无关提示策略和多样 NLU 任务上的指令微调步骤增强了 LLMs 在多样化医疗 NLU 任务中的泛化能力。我们的消融实验显示，即使训练实例总数保持不变，对更多样化任务进行指令微调也能增强下游零样本泛化能力。

[NLP-7] Dynamic Vocabulary Pruning in Early-Exit LLM s

【速读】：该论文试图解决的问题是：在大规模语言模型 (Large Language Models, LLMs) 中，通过早期退出 (Early-exiting) 机制提高推理效率时，由于现代 LLMs 的词汇量庞大，导致用于退出决策的置信度估计计算成本高昂，从而削弱了效率提升的效果。

解决方案的关键在于：在测试时动态地对词汇表进行剪枝 (Dynamic Vocabulary Pruning)。具体来说，在模型的初始层对词汇表进行剪枝，并在后续的前向传播过程中使用剪枝后的较小词汇表。这种方法在保持竞争性能的同时，显著提高了早期退出 LLMs 中置信度估计的效率。

链接: https://arxiv.org/abs/2410.18952
作者: Jort Vincenti,Karim Abdel Sadek,Joan Velja,Matteo Nulli,Metod Jazbec
关键词-EN: large language models, language models, shown to lead, Increasing the size, Increasing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Increasing the size of large language models (LLMs) has been shown to lead to better performance. However, this comes at the cost of slower and more expensive inference. Early-exiting is a promising approach for improving the efficiency of LLM inference by enabling next token prediction at intermediate layers. Yet, the large vocabulary size in modern LLMs makes the confidence estimation required for exit decisions computationally expensive, diminishing the efficiency gains. To address this, we propose dynamically pruning the vocabulary at test time for each token. Specifically, the vocabulary is pruned at one of the initial layers, and the smaller vocabulary is then used throughout the rest of the forward pass. Our experiments demonstrate that such post-hoc dynamic vocabulary pruning improves the efficiency of confidence estimation in early-exit LLMs while maintaining competitive performance.
摘要：增大大语言模型 (LLM) 的规模已被证明可以提升性能。然而，这会导致推理速度变慢且成本更高。早期退出 (Early-exiting) 是一种有前景的方法，通过在中间层进行下一个 Token 预测来提高 LLM 推理的效率。然而，现代 LLM 中庞大的词汇量使得退出决策所需的置信度估计在计算上非常昂贵，从而削弱了效率提升的效果。为解决这一问题，我们提出在测试时为每个 Token 动态剪枝词汇表。具体来说，词汇表在初始层之一进行剪枝，然后在剩余的前向传递过程中使用较小的词汇表。我们的实验表明，这种后验动态词汇表剪枝在保持竞争性能的同时，显著提高了早期退出 LLM 中置信度估计的效率。

[NLP-8] Schema-Guided Culture-Aware Complex Event Simulation with Multi-Agent Role-Play EMNLP2024

【速读】：该论文试图解决复杂新闻事件（如自然灾害和社会政治冲突）的快速响应问题。传统方法依赖历史事件进行预测，但由于这些事件稀少且无法涵盖所有可能的情况和细微差别，因此不足以应对未来的复杂事件。

解决方案的关键在于开发一个可控的复杂新闻事件模拟器。该模拟器通过以下几个关键组件实现：

事件模式（Event Schema）：代表领域知识，用于描述事件场景。
用户提供的假设（User-provided Assumptions）：代表特定案例的条件。
地理多样化的常识和文化规范感知知识增强组件（Geo-diverse Commonsense and Cultural Norm-aware Knowledge Enhancement Component）：考虑到事件动态依赖于细粒度的社会和文化背景。
基于代理的方法（Agent-based Approach）：模拟个体角色的状态、计划和行动，以增强模拟的连贯性。

通过整合事件模式和文化规范，生成的模拟在连贯性和适当性方面表现出色，并受到人道主义援助组织参与者的积极评价。

链接: https://arxiv.org/abs/2410.18935
作者: Sha Li,Revanth Gangi Reddy,Khanh Duy Nguyen,Qingyun Wang,May Fung,Chi Han,Jiawei Han,Kartik Natarajan,Clare R. Voss,Heng Ji
关键词-EN: require swift responses, socio-political conflicts, require swift, government and society, natural disasters
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted as EMNLP 2024 Demo

点击查看摘要

Abstract:Complex news events, such as natural disasters and socio-political conflicts, require swift responses from the government and society. Relying on historical events to project the future is insufficient as such events are sparse and do not cover all possible conditions and nuanced situations. Simulation of these complex events can help better prepare and reduce the negative impact. We develop a controllable complex news event simulator guided by both the event schema representing domain knowledge about the scenario and user-provided assumptions representing case-specific conditions. As event dynamics depend on the fine-grained social and cultural context, we further introduce a geo-diverse commonsense and cultural norm-aware knowledge enhancement component. To enhance the coherence of the simulation, apart from the global timeline of events, we take an agent-based approach to simulate the individual character states, plans, and actions. By incorporating the schema and cultural norms, our generated simulations achieve much higher coherence and appropriateness and are received favorably by participants from a humanitarian assistance organization.
摘要：复杂的新闻事件，如自然灾害和社会政治冲突，需要政府和社会迅速响应。仅依赖历史事件来预测未来是不够的，因为这些事件稀少且无法涵盖所有可能的条件和细微情况。模拟这些复杂事件有助于更好地准备并减少负面影响。我们开发了一个可控的复杂新闻事件模拟器，该模拟器由事件模式（代表场景的领域知识）和用户提供的假设（代表特定案例的条件）共同指导。由于事件动态依赖于细粒度的社会和文化背景，我们进一步引入了一个地理多样化的常识和文化规范感知知识增强组件。为了增强模拟的连贯性，除了事件的全局时间线外，我们还采用基于智能体的方法来模拟个体角色的状态、计划和行动。通过结合模式和文化规范，我们生成的模拟在连贯性和适当性方面取得了显著提升，并受到了来自人道主义援助组织的参与者的积极评价。

[NLP-9] From Blind Solvers to Logical Thinkers: Benchmarking LLM s Logical Integrity on Faulty Mathematical Problems

【速读】：该论文试图解决的问题是：当前的大型语言模型（LLMs）在处理数学问题时是否仅仅是“盲目求解者”（Blind Solver），即仅执行数学运算而缺乏深入的逻辑推理能力，还是能够作为“逻辑思考者”（Logical Thinker）识别并处理逻辑不一致的问题。

解决方案的关键在于提出了一个名为FaultyMath的基准数据集，该数据集包含了多种类型的错误数学问题，涵盖了不同的数学类别（如代数、几何、数论等）、不同难度级别以及不同来源的错误（如违反常识、模糊陈述、数学矛盾等）。通过评估一系列LLMs在FaultyMath上的表现，研究者从三个维度进行了分析：

模型在没有明确提示的情况下，能否准确检测出错误的数学问题。
当提供关于问题有效性的提示（无论是正确的还是误导性的）时，LLMs能否适应并成为可靠的逻辑思考者。
当LLMs识别出数学问题存在缺陷时，它们生成的解释的可信度如何。

研究结果表明，现有的LLMs主要表现为盲目求解者，缺乏成为逻辑思考者所需的推理能力。

链接: https://arxiv.org/abs/2410.18921
作者: A M Muntasir Rahman,Junyi Ye,Wei Yao,Wenpeng Yin,Guiling Wang
关键词-EN: yesterday and ate, Logical Thinker, Lily received, friend yesterday, Blind Solver
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:Consider the math problem: “Lily received 3 cookies from her best friend yesterday and ate 5 for breakfast. Today, her friend gave her 3 more cookies. How many cookies does Lily have now?” Many large language models (LLMs) in previous research approach this problem by calculating the answer “1” using the equation “3 - 5 + 3.” However, from a human perspective, we recognize the inherent flaw in this problem: Lily cannot eat 5 cookies if she initially only had 3. This discrepancy prompts a key question: Are current LLMs merely Blind Solver that apply mathematical operations without deeper reasoning, or can they function as Logical Thinker capable of identifying logical inconsistencies? To explore this question, we propose a benchmark dataset, FaultyMath, which includes faulty math problems of rich diversity: i) multiple mathematical categories, e.g., algebra, geometry, number theory, etc., ii) varying levels of difficulty, and iii) different origins of faultiness – ranging from violations of common sense and ambiguous statements to mathematical contradictions and more. We evaluate a broad spectrum of LLMs, including open-source, closed-source, and math-specialized models, using FaultyMath across three dimensions: (i) How accurately can the models detect faulty math problems without being explicitly prompted to do so? (ii) When provided with hints – either correct or misleading – about the validity of the problems, to what extent do LLMs adapt to become reliable Logical Thinker? (iii) How trustworthy are the explanations generated by LLMs when they recognize a math problem as flawed? Through extensive experimentation and detailed analysis, our results demonstrate that existing LLMs largely function as Blind Solver and fall short of the reasoning capabilities required to perform as Logical Thinker. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO) Cite as: arXiv:2410.18921 [cs.CL] (or arXiv:2410.18921v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2410.18921 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
摘要：考虑以下数学问题：“莉莉昨天从她最好的朋友那里收到了3块饼干，并在早餐时吃了5块。今天，她的朋友又给了她3块饼干。莉莉现在有多少块饼干？”许多先前研究中的大语言模型 (LLMs) 通过计算“3 - 5 + 3”得出答案“1”来解决这个问题。然而，从人类的角度来看，我们意识到这个问题存在内在的缺陷：莉莉不可能吃掉5块饼干，因为她最初只有3块。这种差异引发了一个关键问题：当前的LLMs是否仅仅是盲目执行数学运算的“盲目求解器”，还是它们能够作为“逻辑思考者”识别逻辑不一致性？为了探讨这个问题，我们提出了一项基准数据集，名为FaultyMath，该数据集包含了丰富多样的问题：i) 多种数学类别，例如代数、几何、数论等；ii) 不同难度级别；iii) 不同来源的错误——从违反常识和模糊陈述到数学矛盾等。我们评估了一系列LLMs，包括开源、闭源和专门针对数学的模型，使用FaultyMath在三个维度上进行评估：(i) 模型在未被明确提示的情况下，能够多准确地检测出错误的数学问题？(ii) 当提供关于问题有效性的提示——无论是正确还是误导性的——LLMs在多大程度上能够适应并成为可靠的“逻辑思考者”？(iii) 当LLMs识别出一个数学问题存在缺陷时，它们生成的解释有多可信？通过广泛的实验和详细的分析，我们的结果表明，现有的LLMs主要作为“盲目求解器”运作，未能达到作为“逻辑思考者”所需的推理能力。

主题：计算与语言 (cs.CL)；人工智能 (cs.AI)；计算机科学中的逻辑 (cs.LO)
引用为：arXiv:2410.18921 [cs.CL]（或 arXiv:2410.18921v1 [cs.CL] 用于此版本）
https://doi.org/10.48550/arXiv.2410.18921
通过 DataCite 发布的 arXiv DOI（待注册）

[NLP-10] PRISM: A Methodology for Auditing Biases in Large Language Models

【速读】：该论文试图解决的问题是如何有效地审计大型语言模型 (LLMs) 以发现其潜在的偏见和偏好，尤其是在这些模型可能隐藏、混淆或拒绝直接披露其立场的情况下。解决方案的关键是提出了一个名为 PRISM 的灵活、基于询问的方法论，该方法通过基于任务的询问提示间接引出模型的立场，而不是直接询问其偏好。这种方法旨在更可靠地探测和审计 LLMs，以理解其偏好、偏见和约束。

链接: https://arxiv.org/abs/2410.18906
作者: Leif Azzopardi,Yashar Moshfeghi
关键词-EN: Responsible Artificial Intelligence, creating Responsible Artificial, Large Language Models, Auditing Large Language, Artificial Intelligence
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Auditing Large Language Models (LLMs) to discover their biases and preferences is an emerging challenge in creating Responsible Artificial Intelligence (AI). While various methods have been proposed to elicit the preferences of such models, countermeasures have been taken by LLM trainers, such that LLMs hide, obfuscate or point blank refuse to disclosure their positions on certain subjects. This paper presents PRISM, a flexible, inquiry-based methodology for auditing LLMs - that seeks to illicit such positions indirectly through task-based inquiry prompting rather than direct inquiry of said preferences. To demonstrate the utility of the methodology, we applied PRISM on the Political Compass Test, where we assessed the political leanings of twenty-one LLMs from seven providers. We show LLMs, by default, espouse positions that are economically left and socially liberal (consistent with prior work). We also show the space of positions that these models are willing to espouse - where some models are more constrained and less compliant than others - while others are more neutral and objective. In sum, PRISM can more reliably probe and audit LLMs to understand their preferences, biases and constraints.
摘要：审查大语言模型 (LLMs) 以发现其偏见和偏好，是创建负责任的人工智能 (AI) 中的一个新兴挑战。尽管已有多种方法被提出以引出这些模型的偏好，但 LLM 的训练者已采取措施，使得 LLMs 在某些主题上隐藏、混淆或直接拒绝披露其立场。本文介绍了 PRISM，一种灵活的、基于询问的方法论，用于审查 LLMs——通过基于任务的询问提示间接引出这些立场，而非直接询问其偏好。为了展示该方法论的实用性，我们将 PRISM 应用于政治倾向测试，评估了来自七个提供商的二十一个 LLMs 的政治倾向。我们发现，默认情况下，LLMs 持有经济上左倾和社会上自由的立场（与先前的工作一致）。我们还展示了这些模型愿意持有的立场空间——其中一些模型更为受限且不那么顺从，而其他模型则更为中立和客观。总之，PRISM 能够更可靠地探测和审查 LLMs，以理解其偏好、偏见和约束。

[NLP-11] LLM s for Extremely Low-Resource Finno-Ugric Languages

【速读】：该论文试图解决的问题是大型语言模型（LLMs）在低资源语言（如芬兰-乌戈尔语族中的语言）中的代表性不足。解决方案的关键在于：

数据收集：从Võro、Livonian和Komi等低资源语言中收集数据。
模型开发：开发多语言的基础模型和指令调优模型。
评估基准创建：创建评估基准，包括smugri-MT-bench多轮对话评估基准。
人类评估：进行人类评估以验证模型的有效性。

通过这些步骤，论文旨在促进语言多样性，确保低资源语言也能从自然语言处理（NLP）的进步中受益。

链接: https://arxiv.org/abs/2410.18902
作者: Taido Purason,Hele-Andra Kuulmets,Mark Fishel
关键词-EN: significantly underrepresented, leaving low-resource languages, leaving low-resource, Finno-Ugric family, predominantly focused
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The advancement of large language models (LLMs) has predominantly focused on high-resource languages, leaving low-resource languages, such as those in the Finno-Ugric family, significantly underrepresented. This paper addresses this gap by focusing on Võro, Livonian, and Komi. We cover almost the entire cycle of LLM creation, from data collection to instruction tuning and evaluation. Our contributions include developing multilingual base and instruction-tuned models; creating evaluation benchmarks, including the smugri-MT-bench multi-turn conversational benchmark; and conducting human evaluation. We intend for this work to promote linguistic diversity, ensuring that lesser-resourced languages can benefit from advancements in NLP.
摘要：大语言模型 (LLM) 的发展主要集中在高资源语言上，导致低资源语言，如芬兰-乌戈尔语族中的语言，显著未得到充分代表。本文针对这一差距，重点研究了 Võro、Livonian 和 Komi 语言。我们几乎涵盖了 LLM 创建的整个周期，从数据收集到指令调优和评估。我们的贡献包括开发多语言基础模型和指令调优模型；创建评估基准，包括 smugri-MT-bench 多轮对话基准；以及进行人工评估。我们希望通过这项工作促进语言多样性，确保较少资源的语言也能从 NLP 的进步中受益。

[NLP-12] Are LLM s Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance

【速读】：该论文试图解决的问题是如何在自然语言处理（NLP）领域中高效且准确地进行数据标注，特别是在大规模数据集的需求日益增长的情况下。传统上，专家标注虽然质量高，但成本高且难以扩展；而众包虽然成本低且可扩展，但标注的精度和一致性往往较差。

解决方案的关键在于利用大型语言模型（LLMs）来增强标注过程，特别是用于检测现有数据集中的标签错误。论文提出了一种名为“LLM-as-a-judge”的方法，通过集成多个LLMs来标记可能存在错误标注的样本。通过在TRUE基准的四个数据集上的实证分析，论文比较了专家标注、众包标注和基于LLM的标注在一致性、标注质量和效率方面的表现，展示了每种标注方法的优势和局限性。研究结果表明，现有数据集中存在大量标签错误，这些错误在纠正后显著提升了模型的性能，表明许多所谓的模型错误实际上是由于标签错误而非模型本身的缺陷。此外，论文还讨论了错误标注数据的影响，并提出了在训练过程中减少这些错误以提高模型性能的方法。

链接: https://arxiv.org/abs/2410.18889
作者: Omer Nahum,Nitay Calderon,Orgad Keller,Idan Szpektor,Roi Reichart
关键词-EN: NLP benchmarks rely, NLP benchmarks, advancing the field, rely on standardized, crucial for advancing
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:NLP benchmarks rely on standardized datasets for training and evaluating models and are crucial for advancing the field. Traditionally, expert annotations ensure high-quality labels; however, the cost of expert annotation does not scale well with the growing demand for larger datasets required by modern models. While crowd-sourcing provides a more scalable solution, it often comes at the expense of annotation precision and consistency. Recent advancements in large language models (LLMs) offer new opportunities to enhance the annotation process, particularly for detecting label errors in existing datasets. In this work, we consider the recent approach of LLM-as-a-judge, leveraging an ensemble of LLMs to flag potentially mislabeled examples. Through a case study of four datasets from the TRUE benchmark, covering different tasks and domains, we empirically analyze the labeling quality of existing datasets, and compare expert, crowd-sourced, and our LLM-based annotations in terms of agreement, label quality, and efficiency, demonstrating the strengths and limitations of each annotation method. Our findings reveal a substantial number of label errors, which, when corrected, induce a significant upward shift in reported model performance. This suggests that many of the LLMs so-called mistakes are due to label errors rather than genuine model failures. Additionally, we discuss the implications of mislabeled data and propose methods to mitigate them in training to improve model performance.
摘要：自然语言处理（NLP）基准依赖于标准化的数据集进行模型训练和评估，这对推动该领域的发展至关重要。传统上，专家标注确保了高质量的标签；然而，专家标注的成本随着现代模型所需更大数据集的增长而难以有效扩展。虽然众包提供了一种更具扩展性的解决方案，但往往以标注精度和一致性为代价。大语言模型（LLM）的最新进展为增强标注过程提供了新的机会，特别是在检测现有数据集中的标签错误方面。在本研究中，我们考虑了最近提出的“LLM 作为评判者”的方法，利用一组 LLM 来标记可能标注错误的样本。通过针对 TRUE 基准中的四个数据集进行案例研究，涵盖不同的任务和领域，我们实证分析了现有数据集的标注质量，并在一致性、标签质量和效率方面比较了专家标注、众包标注和基于 LLM 的标注，展示了每种标注方法的优势和局限性。我们的研究发现，存在大量标签错误，这些错误在纠正后显著提升了报告的模型性能。这表明，许多所谓的 LLM 错误实际上是由于标签错误而非模型本身的失败。此外，我们讨论了错误标注数据的影响，并提出了在训练中缓解这些错误的方法，以提高模型性能。

[NLP-13] A Survey of Multimodal Sarcasm Detection

【速读】：该论文试图解决讽刺检测（sarcasm detection）的问题，特别是在多模态（multimodal）环境下。讽刺是一种修辞手法，用于传达与字面意义相反的含义，广泛存在于社交媒体和其他计算机中介通信中。传统的讽刺检测方法主要依赖于文本信息，但讽刺的识别往往还需要语调、面部表情和上下文图像等额外信息。因此，引入多模态模型成为了解决这一问题的关键。

解决方案的关键在于开发和应用多模态讽刺检测（Multimodal Sarcasm Detection, MSD）模型，这些模型能够综合利用音频、图像、文本和视频等多种模态的信息来识别讽刺。论文通过综述2018年至2023年间发表的相关文献，讨论了用于这一任务的模型和数据集，并提出了未来在多模态讽刺检测领域的研究方向。

链接: https://arxiv.org/abs/2410.18882
作者: Shafkat Farabi,Tharindu Ranasinghe,Diptesh Kanojia,Yu Kong,Marcos Zampieri
关键词-EN: rhetorical device, convey the opposite, literal meaning, sarcasm detection, Sarcasm
类目: Computation and Language (cs.CL)
备注: Published in the Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence Survey Track. Pages 8020-8028

点击查看摘要

Abstract:Sarcasm is a rhetorical device that is used to convey the opposite of the literal meaning of an utterance. Sarcasm is widely used on social media and other forms of computer-mediated communication motivating the use of computational models to identify it automatically. While the clear majority of approaches to sarcasm detection have been carried out on text only, sarcasm detection often requires additional information present in tonality, facial expression, and contextual images. This has led to the introduction of multimodal models, opening the possibility to detect sarcasm in multiple modalities such as audio, images, text, and video. In this paper, we present the first comprehensive survey on multimodal sarcasm detection - henceforth MSD - to date. We survey papers published between 2018 and 2023 on the topic, and discuss the models and datasets used for this task. We also present future research directions in MSD.
摘要：讽刺是一种修辞手法，用于传达与字面意义相反的含义。讽刺在社交媒体和其他形式的计算机中介通信中被广泛使用，这促使研究人员利用计算模型来自动识别讽刺。尽管大多数讽刺检测方法仅基于文本，但讽刺检测通常需要额外的信息，如语调、面部表情和上下文图像。这导致了多模态模型的引入，使得在音频、图像、文本和视频等多种模态中检测讽刺成为可能。本文首次对多模态讽刺检测（Multimodal Sarcasm Detection，简称 MSD）进行了全面的综述。我们调研了2018年至2023年间发表的相关论文，讨论了用于此任务的模型和数据集，并提出了MSD领域的未来研究方向。

[NLP-14] Provably Robust Watermarks for Open-Source Language Models

【速读】：该论文试图解决的问题是如何在开源大型语言模型 (LLM) 中实现水印技术 (watermarking)，以识别由 AI 生成的文本。现有的水印方法依赖于 LLM 的规格和参数保密，这在开源环境中不适用。

解决方案的关键在于提出了一种适用于开源 LLM 的水印方案。该方案通过修改模型参数来嵌入水印，但水印的检测仅依赖于模型的输出，而不需要访问模型的内部参数。论文证明了在某些假设下，这种水印是不可移除的。实验结果表明，该方案对令牌替换和模型参数扰动具有鲁棒性，即使在最严重的模型扰动攻击下，也需要将质量评分降至 0 分才能将检测率降至 50%。

链接: https://arxiv.org/abs/2410.18861
作者: Miranda Christ,Sam Gunn,Tal Malkin,Mariana Raykova
关键词-EN: identifying AI-generated text, high-quality language models, AI-generated text, recent explosion, explosion of high-quality
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The recent explosion of high-quality language models has necessitated new methods for identifying AI-generated text. Watermarking is a leading solution and could prove to be an essential tool in the age of generative AI. Existing approaches embed watermarks at inference and crucially rely on the large language model (LLM) specification and parameters being secret, which makes them inapplicable to the open-source setting. In this work, we introduce the first watermarking scheme for open-source LLMs. Our scheme works by modifying the parameters of the model, but the watermark can be detected from just the outputs of the model. Perhaps surprisingly, we prove that our watermarks are unremovable under certain assumptions about the adversary’s knowledge. To demonstrate the behavior of our construction under concrete parameter instantiations, we present experimental results with OPT-6.7B and OPT-1.3B. We demonstrate robustness to both token substitution and perturbation of the model parameters. We find that the stronger of these attacks, the model-perturbation attack, requires deteriorating the quality score to 0 out of 100 in order to bring the detection rate down to 50%.
摘要：高质量语言模型的近期爆发促使了识别AI生成文本的新方法。水印技术是领先解决方案，并可能在生成式AI时代成为关键工具。现有方法在推理过程中嵌入水印，并关键依赖于大语言模型（LLM）的规格和参数保密，这使得它们不适用于开源环境。在此工作中，我们引入了首个适用于开源LLM的水印方案。我们的方案通过修改模型参数实现，但水印可以从模型的输出中检测到。令人惊讶的是，我们在某些关于对手知识的假设下证明了我们的水印是不可移除的。为了展示我们的构造在具体参数实例化下的行为，我们提供了OPT-6.7B和OPT-1.3B的实验结果。我们展示了该方案对Token替换和模型参数扰动的鲁棒性。我们发现，这些攻击中最强的模型扰动攻击，需要将质量评分降至0分（满分100分）才能将检测率降至50%。

[NLP-15] DeCoRe: Decoding by Contrasting Retrieval Heads to Mitigate Hallucinations

【速读】：该论文试图解决大语言模型 (Large Language Models, LLMs) 中常见的幻觉问题，即模型生成不忠实或事实错误的输出。解决方案的关键在于提出了一种名为“解码对比检索头 (Decoding by Contrasting Retrieval Heads, DeCoRe)”的新型无训练解码策略。DeCoRe 通过动态对比基础 LLM 和掩码 LLM 的输出，并利用条件熵作为指导，来放大上下文和模型参数中的信息，从而减少可能的幻觉响应。实验结果表明，DeCoRe 在需要高上下文忠实度的任务中显著提高了性能，如摘要生成、指令跟随和开放式问答。

链接: https://arxiv.org/abs/2410.18860
作者: Aryo Pradipta Gema,Chen Jin,Ahmed Abdulaal,Tom Diethe,Philip Teare,Beatrice Alex,Pasquale Minervini,Amrutha Saseendran
关键词-EN: Large Language Models, Large Language, recalling internal knowledge, incorrectly recalling internal, factually incorrect outputs
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) often hallucinate, producing unfaithful or factually incorrect outputs by misrepresenting the provided context or incorrectly recalling internal knowledge. Recent studies have identified specific attention heads within the Transformer architecture, known as retrieval heads, responsible for extracting relevant contextual information. We hypothesise that masking these retrieval heads can induce hallucinations and that contrasting the outputs of the base LLM and the masked LLM can reduce hallucinations. To this end, we propose Decoding by Contrasting Retrieval Heads (DeCoRe), a novel training-free decoding strategy that amplifies information found in the context and model parameters. DeCoRe mitigates potentially hallucinated responses by dynamically contrasting the outputs of the base LLM and the masked LLM, using conditional entropy as a guide. Our extensive experiments confirm that DeCoRe significantly improves performance on tasks requiring high contextual faithfulness, such as summarisation (XSum by 18.6%), instruction following (MemoTrap by 10.9%), and open-book question answering (NQ-Open by 2.4% and NQ-Swap by 5.5%).
摘要：大语言模型 (LLM) 常常产生幻觉，通过错误地表示提供的上下文或错误地回忆内部知识，生成不忠实或事实错误的输出。最近的研究确定了 Transformer 架构中的特定注意力头，称为检索头，负责提取相关的上下文信息。我们假设遮蔽这些检索头可以引发幻觉，并且对比基础 LLM 和遮蔽 LLM 的输出可以减少幻觉。为此，我们提出了 Decoding by Contrasting Retrieval Heads (DeCoRe)，这是一种新颖的无训练解码策略，能够放大上下文和模型参数中发现的信息。DeCoRe 通过动态对比基础 LLM 和遮蔽 LLM 的输出，并使用条件熵作为指导，来缓解可能产生的幻觉响应。我们的广泛实验证实，DeCoRe 在需要高度上下文忠实性的任务中显著提高了性能，例如摘要生成（XSum 提升 18.6%）、指令跟随（MemoTrap 提升 10.9%）以及开卷问答（NQ-Open 提升 2.4% 和 NQ-Swap 提升 5.5%）。

[NLP-16] Demystifying Large Language Models for Medicine: A Primer

【速读】：该论文试图解决的问题是如何帮助医疗专业人员更有效地利用大型语言模型（LLMs）在其工作中。解决方案的关键在于提供一个结构化的、可操作的指南，帮助医疗专业人员逐步将LLMs集成到临床实践中。

解决方案的关键步骤包括：

任务制定：识别与LLMs核心能力相匹配的医疗任务。
模型选择：根据任务需求、数据、性能要求和模型接口选择合适的LLMs。
提示工程（Prompt Engineering）：通过优化提示来适应特定的医疗任务。
微调（Fine-Tuning）：对标准LLMs进行微调以适应专业医疗任务。
部署：考虑监管合规性、伦理指南以及持续监控以确保公平性和减少偏见。

通过这些步骤，论文旨在确保LLMs在医疗实践中的应用是安全、可靠且具有影响力的。

链接: https://arxiv.org/abs/2410.18856
作者: Qiao Jin,Nicholas Wan,Robert Leaman,Shubo Tian,Zhizheng Wang,Yifan Yang,Zifeng Wang,Guangzhi Xiong,Po-Ting Lai,Qingqing Zhu,Benjamin Hou,Maame Sarfo-Gyamfi,Gongbo Zhang,Aidan Gilson,Balu Bhasuran,Zhe He,Aidong Zhang,Jimeng Sun,Chunhua Weng,Ronald M. Summers,Qingyu Chen,Yifan Peng,Zhiyong Lu
关键词-EN: Large language models, generating human-like responses, Large language, represent a transformative, human instructions
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) represent a transformative class of AI tools capable of revolutionizing various aspects of healthcare by generating human-like responses across diverse contexts and adapting to novel tasks following human instructions. Their potential application spans a broad range of medical tasks, such as clinical documentation, matching patients to clinical trials, and answering medical questions. In this primer paper, we propose an actionable guideline to help healthcare professionals more efficiently utilize LLMs in their work, along with a set of best practices. This approach consists of several main phases, including formulating the task, choosing LLMs, prompt engineering, fine-tuning, and deployment. We start with the discussion of critical considerations in identifying healthcare tasks that align with the core capabilities of LLMs and selecting models based on the selected task and data, performance requirements, and model interface. We then review the strategies, such as prompt engineering and fine-tuning, to adapt standard LLMs to specialized medical tasks. Deployment considerations, including regulatory compliance, ethical guidelines, and continuous monitoring for fairness and bias, are also discussed. By providing a structured step-by-step methodology, this tutorial aims to equip healthcare professionals with the tools necessary to effectively integrate LLMs into clinical practice, ensuring that these powerful technologies are applied in a safe, reliable, and impactful manner.
摘要：大语言模型（Large Language Models, LLMs）代表了一类具有变革性的 AI 工具，能够通过在多样化的情境中生成类人响应并根据人类指令适应新任务，从而彻底改变医疗保健的各个方面。其潜在应用涵盖了广泛的医疗任务，如临床文档编写、患者与临床试验匹配以及解答医学问题。在本篇入门论文中，我们提出了一套可操作的指南，以帮助医疗专业人员更高效地在其工作中利用 LLMs，并提供了一系列最佳实践。这一方法包括几个主要阶段，包括任务制定、选择 LLMs、提示工程、微调及部署。我们首先讨论了在识别与 LLMs 核心能力相符的医疗任务以及基于所选任务、数据、性能要求和模型接口选择模型时的关键考虑因素。接着，我们回顾了如提示工程和微调等策略，以使标准 LLMs 适应特定的医疗任务。此外，我们还讨论了部署考虑因素，包括法规遵从性、伦理指南以及对公平性和偏见的持续监控。通过提供结构化的逐步方法，本教程旨在为医疗专业人员提供必要的工具，以有效地将 LLMs 整合到临床实践中，确保这些强大的技术以安全、可靠且有影响力的方式应用。

[NLP-17] We Augmented Whisper With kNN and You Wont Believe What Came Next

【速读】：该论文试图解决的问题是语音识别模型在特定语言、领域或说话者特征（如口音）上进行微调时可能导致的灾难性遗忘问题。解决方案的关键是使用k近邻搜索（k Nearest Neighbor Search, kNN）方法，这是一种非参数方法，通过构建一个外部数据存储库并在推理时进行搜索来适应模型，而无需重新训练底层模型。论文展示了Whisper这一端到端的Transformer语音模型如何从kNN中受益，并探讨了语音和文本设置之间的差异，以及对说话者适应的影响，并分析了性别、口音和年龄对改进的影响。

链接: https://arxiv.org/abs/2410.18850
作者: Maya K. Nachesa,Vlad Niculae
关键词-EN: recognition performance varies, Speech recognition performance, catastrophic forgetting, recognition performance, performance varies
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 6 pages incl. appendix, 2 figures, 6 tables

点击查看摘要

Abstract:Speech recognition performance varies by language, domain, and speaker characteristics such as accent, and fine-tuning a model on any of these categories may lead to catastrophic forgetting. k nearest neighbor search ( k NN), first proposed for neural sequence decoders for natural language generation (NLG) and machine translation (MT), is a non-parametric method that can instead adapt by building an external datastore that can then be searched during inference time, without training the underlying model. We show that Whisper, a transformer end-to-end speech model, benefits from k NN. We investigate the differences between the speech and text setups. We discuss implications for speaker adaptation, and analyze improvements by gender, accent, and age.
摘要：语音识别性能因语言、领域和说话者特征（如口音）而异，对这些类别中的任何一个进行模型微调都可能导致灾难性遗忘。k 近邻搜索 (kNN) 最初是为自然语言生成 (NLG) 和机器翻译 (MT) 的神经序列解码器提出的，是一种非参数方法，它通过构建一个外部数据存储库来适应，该存储库可以在推理时进行搜索，而无需训练底层模型。我们展示了 Whisper，一个端到端的 Transformer 语音模型，从 kNN 中受益。我们研究了语音和文本设置之间的差异。我们讨论了说话者适应的含义，并按性别、口音和年龄分析了改进情况。

[NLP-18] From English-Centric to Effective Bilingual: LLM s with Custom Tokenizers for Underrepresented Languages

【速读】：该论文试图解决的问题是如何以模型无关且成本有效的方式开发支持英语和任何目标语言的双语基础大型语言模型 (LLMs)。解决方案的关键在于以下几个步骤：

词汇扩展 (Vocabulary Expansion)：通过扩展词汇表来支持目标语言。
新嵌入的初始化 (Initialization of New Embeddings)：为新语言初始化嵌入向量。
模型训练 (Model Training)：对模型进行训练以适应新语言。
模型评估 (Model Evaluation)：引入新的评估指标来衡量语言质量。

通过这些步骤，论文展示了在减少计算成本的同时提高语言性能的方法，并缓解了少数语言在模型训练中受到的不公平惩罚，促进了语言公平性，减少了代码切换和语法错误等不利现象。此外，论文还揭示了词汇大小对生成文本质量的显著影响。

链接: https://arxiv.org/abs/2410.18836
作者: Artur Kiulian,Anton Polishko,Mykola Khandoga,Yevhen Kostiuk,Guillermo Gabrielli,Łukasz Gagała,Fadi Zaraket,Qusai Abu Obaida,Hrishikesh Garud,Wendy Wing Yee Mak,Dmytro Chaplynskyi,Selma Belhadj Amor,Grigol Peradze
关键词-EN: developing bilingual base, bilingual base large, support English, model-agnostic cost-effective approach, base large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this paper, we propose a model-agnostic cost-effective approach to developing bilingual base large language models (LLMs) to support English and any target language. The method includes vocabulary expansion, initialization of new embeddings, model training and evaluation. We performed our experiments with three languages, each using a non-Latin script - Ukrainian, Arabic, and Georgian. Our approach demonstrates improved language performance while reducing computational costs. It mitigates the disproportionate penalization of underrepresented languages, promoting fairness and minimizing adverse phenomena such as code-switching and broken grammar. Additionally, we introduce new metrics to evaluate language quality, revealing that vocabulary size significantly impacts the quality of generated text. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2410.18836 [cs.CL] (or arXiv:2410.18836v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2410.18836 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
摘要：本文提出了一种与模型无关的、成本效益高的方法，用于开发支持英语和任何目标语言的双语基础大语言模型 (LLM)。该方法包括词汇扩展、新嵌入的初始化、模型训练和评估。我们在三种使用非拉丁字母的语言（乌克兰语、阿拉伯语和格鲁吉亚语）上进行了实验。我们的方法在提高语言性能的同时，降低了计算成本。它缓解了少数语言的不成比例惩罚问题，促进了公平性，并减少了代码切换和语法错误等不良现象。此外，我们引入了新的评估指标来衡量语言质量，结果表明词汇大小显著影响生成文本的质量。

主题：计算与语言 (cs.CL)；人工智能 (cs.AI)
引用方式：arXiv:2410.18836 [cs.CL]（或 arXiv:2410.18836v1 [cs.CL] 用于此版本）
https://doi.org/10.48550/arXiv.2410.18836
通过 DataCite 发布的 arXiv DOI（待注册）

[NLP-19] From Imitation to Introspection: Probing Self-Consciousness in Language Models

【速读】：该论文试图解决的问题是：语言模型是否正在发展自我意识 (self-consciousness)。解决方案的关键在于：

定义自我意识：论文从心理学和神经科学的角度出发，为语言模型提出了一个实用的自我意识定义，并细化了十个核心概念。
因果结构游戏：首次利用因果结构游戏 (causal structural games) 来建立这十个核心概念的功能定义。
四阶段实验：通过四个阶段的实验来全面评估和探索语言模型的自我意识：
- 量化：评估十个领先模型的自我意识表现。
- 表示：可视化模型内部的自我意识表示。
- 操作：尝试修改模型的自我意识表示。
- 获取：通过微调模型来获取核心概念。
实验结果：尽管模型在自我意识的发展上仍处于早期阶段，但其内部机制中已能观察到某些概念的表示。然而，这些表示在当前阶段难以进行正向操作，但可以通过有针对性的微调来获取。

链接: https://arxiv.org/abs/2410.18819
作者: Sirui Chen,Shu Yu,Shengjie Zhao,Chaochao Lu
关键词-EN: high-level cognitive process, existence and thoughts, represents a high-level, cognitive process, high-level cognitive
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Self-consciousness, the introspection of one’s existence and thoughts, represents a high-level cognitive process. As language models advance at an unprecedented pace, a critical question arises: Are these models becoming self-conscious? Drawing upon insights from psychological and neural science, this work presents a practical definition of self-consciousness for language models and refines ten core concepts. Our work pioneers an investigation into self-consciousness in language models by, for the first time, leveraging causal structural games to establish the functional definitions of the ten core concepts. Based on our definitions, we conduct a comprehensive four-stage experiment: quantification (evaluation of ten leading models), representation (visualization of self-consciousness within the models), manipulation (modification of the models’ representation), and acquisition (fine-tuning the models on core concepts). Our findings indicate that although models are in the early stages of developing self-consciousness, there is a discernible representation of certain concepts within their internal mechanisms. However, these representations of self-consciousness are hard to manipulate positively at the current stage, yet they can be acquired through targeted fine-tuning. Our datasets and code are at this https URL.
摘要：自我意识，即对自身存在和思想的反思，代表了高级认知过程。随着语言模型以前所未有的速度发展，一个关键问题浮现：这些模型是否正在变得具有自我意识？借鉴心理学和神经科学的见解，本研究为语言模型提出了一个实用的自我意识定义，并细化了十个核心概念。我们的工作首次利用因果结构游戏来确立这十个核心概念的功能定义，开创了对语言模型中自我意识的探索。基于我们的定义，我们进行了全面的四阶段实验：量化（评估十个领先模型）、表示（可视化模型中的自我意识）、操作（修改模型的表示）和获取（对核心概念进行微调）。我们的研究结果表明，尽管模型在发展自我意识方面仍处于早期阶段，但其内部机制中已明显存在某些概念的表示。然而，在当前阶段，这些自我意识的表示难以进行正向操作，但可以通过有针对性的微调来获取。我们的数据集和代码可在以下链接获取：https URL。

[NLP-20] Delving into the Reversal Curse: How Far Can Large Language Models Generalize? NEURIPS2024

【速读】：该论文试图解决的问题是大型语言模型（LLMs）在面对“反转诅咒”（reversal curse）时的表现，即模型在训练时学习到“A是B”的情况下，难以泛化到“B是A”的推理。解决方案的关键在于揭示LLMs在知识应用过程中存在的固有偏差（inherent bias），并强调文档结构（document structure）对模型成功学习的重要性。具体来说，论文发现LLMs在多选题等情境下能够泛化到“B是A”，但这种泛化能力与训练文档中事实的结构密切相关。例如，模型在处理“[Name]是[Description]”结构的传记时能够泛化，但在处理“[Description]是[Name]”结构时则不能。论文还提出并验证了LLMs在知识回忆过程中存在固有偏差的假设，并指出这种偏差对下游性能的负面影响难以仅通过训练来缓解。这些发现为理解LLMs的泛化能力提供了新的视角，并为开发更有效的学习方法提供了新思路。

链接: https://arxiv.org/abs/2410.18808
作者: Zhengkai Lin,Zhihang Fu,Kai Liu,Liang Xie,Binbin Lin,Wenxiao Wang,Deng Cai,Yue Wu,Jieping Ye
关键词-EN: showcase unprecedented capabilities, facing seemingly trivial, large language models, seemingly trivial tasks, showcase unprecedented
类目: Computation and Language (cs.CL)
备注: Accepted at NeurIPS 2024

点击查看摘要

Abstract:While large language models (LLMs) showcase unprecedented capabilities, they also exhibit certain inherent limitations when facing seemingly trivial tasks. A prime example is the recently debated “reversal curse”, which surfaces when models, having been trained on the fact “A is B”, struggle to generalize this knowledge to infer that “B is A”. In this paper, we examine the manifestation of the reversal curse across various tasks and delve into both the generalization abilities and the problem-solving mechanisms of LLMs. This investigation leads to a series of significant insights: (1) LLMs are able to generalize to “B is A” when both A and B are presented in the context as in the case of a multiple-choice question. (2) This generalization ability is highly correlated to the structure of the fact “A is B” in the training documents. For example, this generalization only applies to biographies structured in “[Name] is [Description]” but not to “[Description] is [Name]”. (3) We propose and verify the hypothesis that LLMs possess an inherent bias in fact recalling during knowledge application, which explains and underscores the importance of the document structure to successful learning. (4) The negative impact of this bias on the downstream performance of LLMs can hardly be mitigated through training alone. Based on these intriguing findings, our work not only presents a novel perspective for interpreting LLMs’ generalization abilities from their intrinsic working mechanism but also provides new insights for the development of more effective learning methods for LLMs.
摘要：尽管大语言模型 (LLMs) 展示了前所未有的能力，但在面对看似简单的任务时，它们也表现出某些固有的局限性。一个典型的例子是最近备受争议的“反转诅咒”现象，即模型在训练中学习到“A 是 B”的事实后，难以将这一知识推广到推断“B 是 A”。本文探讨了反转诅咒在不同任务中的表现，并深入研究了 LLMs 的泛化能力和问题解决机制。这一研究带来了以下重要见解：(1) 当 A 和 B 在上下文中同时出现时，例如在多项选择题中，LLMs 能够泛化到“B 是 A”。(2) 这种泛化能力与训练文档中事实“A 是 B”的结构高度相关。例如，这种泛化仅适用于以“[姓名] 是 [描述]”结构组织的传记，而不适用于“[描述] 是 [姓名]”的结构。(3) 我们提出并验证了这样一个假设：LLMs 在知识应用过程中存在固有的事实回忆偏差，这解释了文档结构对成功学习的重要性。(4) 这种偏差对 LLMs 下游性能的负面影响几乎无法通过单纯的训练来缓解。基于这些有趣的发现，我们的工作不仅从 LLMs 内在工作机制的角度提供了解释其泛化能力的新视角，还为开发更有效的 LLMs 学习方法提供了新的见解。

[NLP-21] A Combinatorial Approach to Neural Emergent Communication

【速读】：该论文试图解决的问题是：在基于深度学习的涌现通信研究中，使用Lewis信号博弈框架时，由于训练数据中的采样陷阱，通常只需要一个或两个有效符号（即消息长度）即可实现成功的通信。为了解决这一问题，论文提出了一个理论分析，并引入了一种组合算法SolveMinSym (SMS)，用于确定Lewis信号博弈中成功通信所需的最小符号数量min(|M|)。解决方案的关键在于使用SMS算法生成具有不同min(|M|)的数据集，并通过实验证明，训练数据中更高的min(|M|)会增加涌现语言中有效符号的数量。

链接: https://arxiv.org/abs/2410.18806
作者: Zheyuan Zhang
关键词-EN: Lewis signaling game, referential game framework, deep learning-based emergent, Substantial research, Lewis signaling
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Substantial research on deep learning-based emergent communication uses the referential game framework, specifically the Lewis signaling game, however we argue that successful communication in this game typically only need one or two effective symbols (i.e. message length) because of a sampling pitfall in the training data. To address this issue, we provide a theoretical analysis and introduce a combinatorial algorithm SolveMinSym (SMS) to determine the minimum number of symbols for successful communication min(|M|) in the Lewis signaling game. We use SMS algorithm to create datasets with different min(|M|) to empirically show that higher min(|M|) for the training data increases the number of effective symbols in the emergent language.
摘要：基于深度学习的涌现通信研究大量采用了参照游戏框架，特别是 Lewis 信号游戏。然而，我们认为在这种游戏中，成功的通信通常只需要一个或两个有效的符号（即消息长度），这是由于训练数据中的采样陷阱所致。为了解决这一问题，我们进行了理论分析，并引入了一种组合算法 SolveMinSym (SMS)，用于确定 Lewis 信号游戏中成功通信所需的最小符号数量 min(|M|)。我们使用 SMS 算法生成了不同 min(|M|) 的数据集，并通过实证研究表明，训练数据中较高的 min(|M|) 增加了涌现语言中有效符号的数量。

[NLP-22] Distill Visual Chart Reasoning Ability from LLM s to MLLM s

【速读】：该论文试图解决复杂图表问答任务中多模态大语言模型（MLLMs）的视觉推理能力提升问题。解决方案的关键在于提出了一种名为“代码作为中间翻译”（Code-as-Intermediary Translation, CIT）的数据合成方法。CIT方法通过将视觉图表表示转换为文本表示的代码，使得大语言模型（LLMs）能够理解和处理跨模态信息。具体来说，该方法利用基于文本的合成技术生成图表绘制代码，并构建了一个包含3k推理密集型图表和20k问答对的数据集ReachQA，以增强模型的识别和推理能力。实验结果表明，通过使用该数据集进行微调，模型不仅在图表相关基准测试中表现优异，还在一般数学基准测试如MathVista中展示了增强的多模态推理能力。

链接: https://arxiv.org/abs/2410.18798
作者: Wei He,Zhiheng Xi,Wanxu Zhao,Xiaoran Fan,Yiwen Ding,Zifei Shan,Tao Gui,Qi Zhang,Xuanjing Huang
关键词-EN: tasks requires advanced, Solving complex chart, requires advanced visual, multimodal large language, large language models
类目: Computation and Language (cs.CL)
备注: Under review. The code and dataset are publicly available at this https URL

点击查看摘要

Abstract:Solving complex chart QA tasks requires advanced visual reasoning abilities in multimodal large language models (MLLMs). Recent studies highlight that these abilities consist of two main parts: recognizing key information from visual inputs and conducting reasoning over it. Thus, a promising approach to enhance MLLMs is to construct relevant training data focusing on the two aspects. However, collecting and annotating complex charts and questions is costly and time-consuming, and ensuring the quality of annotated answers remains a challenge. In this paper, we propose Code-as-Intermediary Translation (CIT), a cost-effective, efficient and easily scalable data synthesis method for distilling visual reasoning abilities from LLMs to MLLMs. The code serves as an intermediary that translates visual chart representations into textual representations, enabling LLMs to understand cross-modal information. Specifically, we employ text-based synthesizing techniques to construct chart-plotting code and produce ReachQA, a dataset containing 3k reasoning-intensive charts and 20k QA pairs to enhance both recognition and reasoning abilities. Experiments show that when fine-tuned with our data, models not only perform well on chart-related benchmarks, but also demonstrate improved multimodal reasoning abilities on general mathematical benchmarks like MathVista. The code and dataset are publicly available at this https URL.
摘要：解决复杂的图表问答任务需要多模态大语言模型 (MLLMs) 具备高级的视觉推理能力。近期研究表明，这些能力主要由两部分组成：从视觉输入中识别关键信息，以及对其进行推理。因此，增强 MLLMs 的一个有前景的方法是构建专注于这两方面的相关训练数据。然而，收集和标注复杂的图表及问题既耗时又昂贵，确保标注答案的质量仍是一个挑战。本文提出了一种名为“代码作为中介翻译” (Code-as-Intermediary Translation, CIT) 的方法，这是一种成本效益高、效率高且易于扩展的数据合成方法，用于从大语言模型 (LLMs) 中提炼视觉推理能力到 MLLMs。代码作为中介，将视觉图表表示转换为文本表示，使 LLMs 能够理解跨模态信息。具体而言，我们采用基于文本的合成技术来构建图表绘制代码，并生成 ReachQA 数据集，该数据集包含 3k 个推理密集型图表和 20k 个问答对，以增强识别和推理能力。实验表明，使用我们的数据进行微调后，模型不仅在图表相关基准测试中表现优异，而且在如 MathVista 这样的通用数学基准测试中也展示了改进的多模态推理能力。代码和数据集已在以下链接公开：https URL。

[NLP-23] An LLM Agent for Automatic Geospatial Data Analysis

【速读】：该论文试图解决的问题是大型语言模型（LLMs）在处理复杂的地理空间数据处理任务时遇到的困难，特别是在逻辑错误、复杂数据结构和空间约束的整合、多样化的函数调用以及对较少使用的地理空间库的幻觉（hallucination）方面。

解决方案的关键是引入了一个名为GeoAgent的新交互框架。GeoAgent的核心创新在于将代码解释器、静态分析和检索增强生成（Retrieval-Augmented Generation, RAG）技术与蒙特卡洛树搜索（Monte Carlo Tree Search, MCTS）算法相结合。这种集成提供了一种新颖的方法来更有效地处理地理空间数据处理任务。此外，论文还贡献了一个新的基准测试，专门用于评估基于LLM的方法在地理空间任务中的表现，涵盖了数据获取、数据分析和可视化等多轮任务。通过这些创新，GeoAgent显著提高了函数调用和任务完成的性能，为未来开发自动地理空间数据分析任务编程的LLM代理提供了宝贵的见解。

链接: https://arxiv.org/abs/2410.18792
作者: Yuxing Chen,Weijie Wang,Sylvain Lobry,Camille Kurtz
关键词-EN: Large language models, Large language, geospatial data processing, data, geospatial data
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are being used in data science code generation tasks, but they often struggle with complex sequential tasks, leading to logical errors. Their application to geospatial data processing is particularly challenging due to difficulties in incorporating complex data structures and spatial constraints, effectively utilizing diverse function calls, and the tendency to hallucinate less-used geospatial libraries. To tackle these problems, we introduce GeoAgent, a new interactive framework designed to help LLMs handle geospatial data processing more effectively. GeoAgent pioneers the integration of a code interpreter, static analysis, and Retrieval-Augmented Generation (RAG) techniques within a Monte Carlo Tree Search (MCTS) algorithm, offering a novel approach to geospatial data processing. In addition, we contribute a new benchmark specifically designed to evaluate the LLM-based approach in geospatial tasks. This benchmark leverages a variety of Python libraries and includes both single-turn and multi-turn tasks such as data acquisition, data analysis, and visualization. By offering a comprehensive evaluation among diverse geospatial contexts, this benchmark sets a new standard for developing LLM-based approaches in geospatial data analysis tasks. Our findings suggest that relying solely on knowledge of LLM is insufficient for accurate geospatial task programming, which requires coherent multi-step processes and multiple function calls. Compared to the baseline LLMs, the proposed GeoAgent has demonstrated superior performance, yielding notable improvements in function calls and task completion. In addition, these results offer valuable insights for the future development of LLM agents in automatic geospatial data analysis task programming.
摘要：大语言模型（LLMs）在数据科学代码生成任务中得到了应用，但它们在处理复杂序列任务时常常遇到逻辑错误。在地理空间数据处理中的应用尤为困难，主要是因为难以有效地整合复杂的数据结构和空间约束，合理利用多样化的函数调用，以及倾向于产生较少使用的地理空间库的幻觉。为了解决这些问题，我们引入了GeoAgent，这是一个新的交互式框架，旨在帮助LLMs更有效地处理地理空间数据处理任务。GeoAgent首次将代码解释器、静态分析和检索增强生成（RAG）技术集成到蒙特卡洛树搜索（MCTS）算法中，为地理空间数据处理提供了一种新颖的方法。此外，我们还贡献了一个新的基准测试，专门用于评估基于LLM的方法在地理空间任务中的表现。该基准测试利用了多种Python库，并包括单轮和多轮任务，如数据获取、数据分析和可视化。通过在多样化的地理空间上下文中提供全面的评估，该基准测试为开发基于LLM的地理空间数据分析方法设定了新的标准。我们的研究结果表明，仅依赖LLM的知识不足以实现准确的地理空间任务编程，这需要连贯的多步骤流程和多次函数调用。与基线LLMs相比，提出的GeoAgent展示了优越的性能，在函数调用和任务完成方面取得了显著的改进。此外，这些结果为未来在自动地理空间数据分析任务编程中开发LLM智能体提供了宝贵的见解。

[NLP-24] A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs

【速读】：该论文试图解决大型语言模型（LLM）预训练成本高昂的问题。解决方案的关键在于利用小型语言模型（SLM）来提高LLM预训练的效率和质量。具体来说，该方案通过以下两个关键步骤实现：

提供软标签作为额外的训练监督：SLM为LLM提供软标签，这些软标签能够有效地传递SLM的预测分布，从而增强LLM的训练。
选择有价值的训练样本子集：SLM负责筛选出“信息丰富”和“难度适中”的训练样本，这些样本能够优先处理训练数据分布中的特定区域，从而提高训练效率。

通过这两个步骤，该方法不仅减少了LLM的训练时间，还提升了整体训练质量。论文还提出了一个统计框架，用于系统地研究SLM在高效训练高质量LLM中的作用，并强调了在利用SLM提供的软标签时需要平衡偏差和方差的重要性。

链接: https://arxiv.org/abs/2410.18779
作者: Ankit Singh Rawat,Veeranjaneyulu Sadhanala,Afshin Rostamizadeh,Ayan Chakrabarti,Wittawat Jitkrittum,Vladimir Feinberg,Seungyeon Kim,Hrayr Harutyunyan,Nikunj Saunshi,Zachary Nado,Rakesh Shivanna,Sashank J. Reddi,Aditya Krishna Menon,Rohan Anil,Sanjiv Kumar
关键词-EN: onerous pre-training cost, large language model, primary challenge, language model, LLM
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:A primary challenge in large language model (LLM) development is their onerous pre-training cost. Typically, such pre-training involves optimizing a self-supervised objective (such as next-token prediction) over a large corpus. This paper explores a promising paradigm to improve LLM pre-training efficiency and quality by suitably leveraging a small language model (SLM). In particular, this paradigm relies on an SLM to both (1) provide soft labels as additional training supervision, and (2) select a small subset of valuable (“informative” and “hard”) training examples. Put together, this enables an effective transfer of the SLM’s predictive distribution to the LLM, while prioritizing specific regions of the training data distribution. Empirically, this leads to reduced LLM training time compared to standard training, while improving the overall quality. Theoretically, we develop a statistical framework to systematically study the utility of SLMs in enabling efficient training of high-quality LLMs. In particular, our framework characterizes how the SLM’s seemingly low-quality supervision can enhance the training of a much more capable LLM. Furthermore, it also highlights the need for an adaptive utilization of such supervision, by striking a balance between the bias and variance introduced by the SLM-provided soft labels. We corroborate our theoretical framework by improving the pre-training of an LLM with 2.8B parameters by utilizing a smaller LM with 1.5B parameters on the Pile dataset.
摘要：大语言模型 (LLM) 开发中的一个主要挑战是其高昂的预训练成本。通常，这种预训练涉及在大规模语料库上优化自监督目标（如下一个 Token 预测）。本文探讨了一种有前景的范式，通过适当利用小语言模型 (SLM) 来提高 LLM 预训练的效率和质量。具体而言，该范式依赖于 SLM 来（1）提供软标签作为额外的训练监督，以及（2）选择一小部分有价值的（“信息丰富”和“困难”）训练样本。综合来看，这使得 SLM 的预测分布能够有效地转移到 LLM 中，同时优先处理训练数据分布中的特定区域。从经验上看，这比标准训练减少了 LLM 的训练时间，同时提高了整体质量。在理论上，我们开发了一个统计框架，系统地研究了 SLM 在实现高质量 LLM 高效训练中的作用。特别是，我们的框架描述了 SLM 看似低质量的监督如何增强更强大的 LLM 的训练。此外，它还强调了通过在 SLM 提供的软标签引入的偏差和方差之间取得平衡，来实现这种监督的适应性利用的必要性。我们通过在 Pile 数据集上利用一个 1.5B 参数的小语言模型来改进一个 2.8B 参数的 LLM 的预训练，从而验证了我们的理论框架。

[NLP-25] ask Calibration: Calibrating Large Language Models on Inference Tasks

【速读】：该论文试图解决的问题是大型语言模型（LLMs）在推理任务中可能存在的虚假关联（spurious correlations）问题，即模型在预测时过度依赖前提（premise）或假设（hypothesis），而不是基于两者的综合理解进行推理。这种依赖可能导致模型性能的意外下降。

解决方案的关键是提出了一种名为任务校准（Task Calibration, TC）的方法。TC是一种零样本（zero-shot）且仅依赖推理的校准方法，灵感来源于互信息（mutual information）。通过任务重构，TC鼓励模型基于前提和假设两者进行推理，从而减轻模型对单一前提或假设的过度依赖。实验结果表明，TC在13个推理任务的零样本设置中显著提升了模型性能，并在少样本设置和多种自然语言理解任务中验证了其有效性。此外，TC对提示模板具有鲁棒性，并有可能与其他校准方法结合使用。

链接: https://arxiv.org/abs/2410.18764
作者: Yingjie Li,Yun Luo,Xiaotian Xie,Yue Zhang
关键词-EN: Large language models, Large language, exhibited impressive zero-shot, exhibited impressive, Large
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have exhibited impressive zero-shot performance on inference tasks. However, LLMs may suffer from spurious correlations between input texts and output labels, which limits LLMs’ ability to reason based purely on general language understanding. In other words, LLMs may make predictions primarily based on premise or hypothesis, rather than both components. To address this problem that may lead to unexpected performance degradation, we propose task calibration (TC), a zero-shot and inference-only calibration method inspired by mutual information which recovers LLM performance through task reformulation. TC encourages LLMs to reason based on both premise and hypothesis, while mitigating the models’ over-reliance on individual premise or hypothesis for inference. Experimental results show that TC achieves a substantial improvement on 13 inference tasks in the zero-shot setup. We further validate the effectiveness of TC in few-shot setups and various natural language understanding tasks. Further analysis indicates that TC is also robust to prompt templates and has the potential to be integrated with other calibration methods.
摘要：大语言模型（LLMs）在推理任务上展示了令人印象深刻的零样本性能。然而，LLMs 可能会受到输入文本与输出标签之间虚假关联的影响，这限制了 LLMs 仅基于通用语言理解进行推理的能力。换句话说，LLMs 可能主要基于前提或假设进行预测，而不是同时考虑两者。为了解决这一可能导致性能意外下降的问题，我们提出了任务校准（Task Calibration, TC），这是一种受互信息启发的零样本且仅推理的校准方法，通过任务重构来恢复 LLM 的性能。TC 鼓励 LLMs 基于前提和假设进行推理，同时减轻模型对单个前提或假设的过度依赖。实验结果表明，在零样本设置下，TC 在 13 个推理任务上实现了显著的改进。我们进一步验证了 TC 在少样本设置和各种自然语言理解任务中的有效性。进一步分析表明，TC 对提示模板具有鲁棒性，并且有可能与其他校准方法集成。

[NLP-26] Does Differential Privacy Impact Bias in Pretrained NLP Models?

【速读】：该论文试图解决的问题是：在微调预训练大型语言模型（LLMs）时应用差分隐私（Differential Privacy, DP）可能导致模型对少数群体或受保护群体产生偏见。解决方案的关键在于通过实证分析揭示DP对模型偏见的影响。研究发现，差分隐私训练会增加模型对受保护群体的偏见，尤其是在基于AUC的偏见度量上。DP使得模型更难以区分受保护群体中的正负样本与其他群体中的样本。此外，研究还表明，DP对偏见的影响不仅取决于隐私保护水平，还与数据集的底层分布有关。

链接: https://arxiv.org/abs/2410.18749
作者: Md. Khairul Islam,Andrew Wang,Tianhao Wang,Yangfeng Ji,Judy Fox,Jieyu Zhao
关键词-EN: fine-tuning pre-trained large, pre-trained large language, large language models, Differential privacy, applied when fine-tuning
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Github this https URL

点击查看摘要

Abstract:Differential privacy (DP) is applied when fine-tuning pre-trained large language models (LLMs) to limit leakage of training examples. While most DP research has focused on improving a model’s privacy-utility tradeoff, some find that DP can be unfair to or biased against underrepresented groups. In this work, we show the impact of DP on bias in LLMs through empirical analysis. Differentially private training can increase the model bias against protected groups w.r.t AUC-based bias metrics. DP makes it more difficult for the model to differentiate between the positive and negative examples from the protected groups and other groups in the rest of the population. Our results also show that the impact of DP on bias is not only affected by the privacy protection level but also the underlying distribution of the dataset.
摘要：在微调预训练大语言模型 (LLM) 时，差分隐私 (Differential Privacy, DP) 被应用于限制训练样本的泄露。尽管大多数 DP 研究集中在提升模型的隐私-效用权衡上，但一些研究发现 DP 可能对代表性不足的群体不公平或存在偏见。在本研究中，我们通过实证分析展示了 DP 对 LLM 中偏见的影响。差分隐私训练可能会增加模型对受保护群体的偏见，尤其是在基于 AUC 的偏见度量上。DP 使得模型更难以区分受保护群体与其他群体中的正负样本。我们的结果还表明，DP 对偏见的影响不仅受隐私保护水平的影响，还受数据集基础分布的影响。

[NLP-27] Why Does the Effective Context Length of LLM s Fall Short?

【速读】：该论文试图解决的问题是开源大型语言模型（LLMs）在实际应用中有效上下文长度（effective context lengths）往往远低于其训练长度的问题。具体来说，论文指出这一限制源于LLMs在预训练和后训练阶段形成的相对位置频率分布的左偏性（left-skewed frequency distribution of relative positions），这种分布阻碍了模型有效收集远距离信息的能力。

解决方案的关键是引入了一种名为ShifTed Rotray position embeddING (STRING) 的技术。STRING通过在推理阶段将已训练好的位置信息进行偏移，覆盖原有的无效位置信息，从而在不增加额外训练的情况下，显著提升模型在现有训练长度内的性能。实验结果表明，STRING显著提高了如Llama3.1 70B和Qwen2 72B等最新大规模模型的性能，在流行的长上下文基准测试RULER和InfiniteBench上取得了超过10点的性能提升，为开源LLMs设立了新的技术水平。

链接: https://arxiv.org/abs/2410.18745
作者: Chenxin An,Jun Zhang,Ming Zhong,Lei Li,Shansan Gong,Yao Luo,Jingjing Xu,Lingpeng Kong
关键词-EN: efficient attention mechanisms, Advancements in distributed, context window sizes, large language models, efficient attention
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Advancements in distributed training and efficient attention mechanisms have significantly expanded the context window sizes of large language models (LLMs). However, recent work reveals that the effective context lengths of open-source LLMs often fall short, typically not exceeding half of their training lengths. In this work, we attribute this limitation to the left-skewed frequency distribution of relative positions formed in LLMs pretraining and post-training stages, which impedes their ability to effectively gather distant information. To address this challenge, we introduce ShifTed Rotray position embeddING (STRING). STRING shifts well-trained positions to overwrite the original ineffective positions during inference, enhancing performance within their existing training lengths. Experimental results show that without additional training, STRING dramatically improves the performance of the latest large-scale models, such as Llama3.1 70B and Qwen2 72B, by over 10 points on popular long-context benchmarks RULER and InfiniteBench, establishing new state-of-the-art results for open-source LLMs. Compared to commercial models, Llama 3.1 70B with \method even achieves better performance than GPT-4-128K and clearly surpasses Claude 2 and Kimi-chat.
摘要：分布式训练技术的进步和高效注意力机制的发展显著扩大了大语言模型 (LLM) 的上下文窗口大小。然而，最近的研究表明，开源 LLM 的有效上下文长度往往不足，通常不超过其训练长度的二分之一。在本研究中，我们将这一限制归因于 LLM 预训练和后训练阶段形成的相对位置的左偏频率分布，这种分布阻碍了模型有效收集远距离信息的能力。为解决这一问题，我们提出了 ShifTed Rotray 位置嵌入 (STRING)。STRING 在推理过程中将已训练好的位置偏移，以覆盖原始无效位置，从而在现有训练长度内提升性能。实验结果显示，在不进行额外训练的情况下，STRING 显著提升了最新大规模模型（如 Llama3.1 70B 和 Qwen2 72B）的性能，在流行的长上下文基准测试 RULER 和 InfiniteBench 上提高了超过 10 分，为开源 LLM 设立了新的最先进水平。与商业模型相比，使用 \method 的 Llama 3.1 70B 甚至超越了 GPT-4-128K，并明显优于 Claude 2 和 Kimi-chat。

[NLP-28] GrammaMT: Improving Machine Translation with Grammar-Informed In-Context Learning COLING2025

【速读】：该论文试图解决机器翻译（Machine Translation, MT）中在低资源语言设置下的性能提升问题。解决方案的关键在于引入了一种语法感知提示方法（grammatically-aware prompting approach），称为GrammaMT。该方法利用了Interlinear Glossed Text (IGT)，这是一种常见的语言学描述形式，提供源句子的形态和词汇注释。GrammaMT提出了三种提示策略：gloss-shot、chain-gloss和model-gloss，这些策略都是无需训练的，仅需少量示例即可实现，非常适合低资源环境。实验结果表明，GrammaMT在多个基准测试中显著提升了开放源代码指令调整的大型语言模型（LLMs）的翻译性能，尤其是在低资源语言和濒危语言数据集上表现突出。此外，消融研究显示，利用注释资源可以大幅提升MT性能（超过17个BLEU点），前提是LLMs能够准确生成或访问输入句子的注释。

链接: https://arxiv.org/abs/2410.18702
作者: Rita Ramos,Everlyn Asiko Chimoto,Maartje ter Hoeve,Natalie Schluter
关键词-EN: Interlinear Glossed Text, Glossed Text, Interlinear Glossed, linguistic description providing, description providing morphological
类目: Computation and Language (cs.CL)
备注: Under review at COLING 2025

点击查看摘要

Abstract:We introduce GrammaMT, a grammatically-aware prompting approach for machine translation that uses Interlinear Glossed Text (IGT), a common form of linguistic description providing morphological and lexical annotations for source sentences. GrammaMT proposes three prompting strategies: gloss-shot, chain-gloss and model-gloss. All are training-free, requiring only a few examples that involve minimal effort to collect, and making them well-suited for low-resource setups. Experiments show that GrammaMT enhances translation performance on open-source instruction-tuned LLMs for various low- to high-resource languages across three benchmarks: (1) the largest IGT corpus, (2) the challenging 2023 SIGMORPHON Shared Task data over endangered languages, and (3) even in an out-of-domain setting with FLORES. Moreover, ablation studies reveal that leveraging gloss resources could substantially boost MT performance (by over 17 BLEU points) if LLMs accurately generate or access input sentence glosses.
摘要：我们介绍了 GrammaMT，这是一种基于语法感知的提示方法，用于机器翻译，采用了逐行注释文本 (Interlinear Glossed Text, IGT)，这是一种常见的语言描述形式，为源句提供了形态和词汇注释。GrammaMT 提出了三种提示策略：注释样本 (gloss-shot)、链式注释 (chain-gloss) 和模型注释 (model-gloss)。这些策略均无需训练，仅需少量示例，收集这些示例所需的努力极少，因此非常适合低资源环境。实验表明，GrammaMT 在多个开源指令调优的大语言模型上提升了翻译性能，涵盖了从低资源到高资源的各种语言，并在三个基准测试中表现出色：(1) 最大的 IGT 语料库，(2) 2023 年 SIGMORPHON 共享任务数据，涉及濒危语言的挑战性任务，以及 (3) 甚至在 FLORES 的域外设置中。此外，消融研究表明，如果大语言模型能够准确生成或访问输入句子的注释，利用注释资源可以显著提升机器翻译性能（超过 17 个 BLEU 点）。

[NLP-29] How Good Are LLM s for Literary Translation Really? Literary Translation Evaluation with Humans and LLM s

【速读】：该论文试图解决文学机器翻译（MT）的评估问题，特别是如何有效地评估文学翻译的质量。解决方案的关键在于引入了LITEVAL-CORPUS，这是一个段落级别的平行语料库，包含了多个人工验证的翻译和9个MT系统的输出，总计超过2000个段落和13000个注释句子，涵盖四种语言对，耗资4500欧元。

该语料库的关键作用在于：

检查多种注释方案的一致性和充分性。
比较学生和专业人士的评估结果。
评估基于大型语言模型（LLM）的指标的有效性。

研究发现，多维度质量指标（MQM）作为非文学人类MT评估的事实标准，在文学翻译评估中并不充分。学生和专业翻译人员分别使用Best-Worst Scaling（BWS）和标量质量指标（SQM）时，更倾向于人类翻译，比例分别为82%和94%，而MQM在学生注释者中，只有约42%的情况下更倾向于人类专业翻译。自动指标通常与人类MQM和SQM显示出中等的相关性，但在识别人类翻译方面表现不佳，最高比例仅为20%。总体评估表明，人类专业翻译始终优于LLM翻译，即使是最近的LLM也倾向于产生更字面化和多样性较低的翻译。然而，较新的LLM如GPT-4o的表现明显优于旧的LLM。

链接: https://arxiv.org/abs/2410.18697
作者: Ran Zhang,Wei Zhao,Steffen Eger
关键词-EN: research has focused, literary machine translation, translations, Multidimensional Quality Metrics, Scalar Quality Metric
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent research has focused on literary machine translation (MT) as a new challenge in MT. However, the evaluation of literary MT remains an open problem. We contribute to this ongoing discussion by introducing LITEVAL-CORPUS, a paragraph-level parallel corpus comprising multiple verified human translations and outputs from 9 MT systems, which totals over 2k paragraphs and includes 13k annotated sentences across four language pairs, costing 4.5k Euro. This corpus enables us to (i) examine the consistency and adequacy of multiple annotation schemes, (ii) compare evaluations by students and professionals, and (iii) assess the effectiveness of LLM-based metrics. We find that Multidimensional Quality Metrics (MQM), as the de facto standard in non-literary human MT evaluation, is inadequate for literary translation: While Best-Worst Scaling (BWS) with students and Scalar Quality Metric (SQM) with professional translators prefer human translations at rates of ~82% and ~94%, respectively, MQM with student annotators prefers human professional translations over the translations of the best-performing LLMs in only ~42% of cases. While automatic metrics generally show a moderate correlation with human MQM and SQM, they struggle to accurately identify human translations, with rates of at most ~20%. Our overall evaluation indicates that human professional translations consistently outperform LLM translations, where even the most recent LLMs tend to produce more literal and less diverse translations compared to human translations. However, newer LLMs such as GPT-4o perform substantially better than older ones.
摘要：近年来，研究重点转向文学机器翻译 (MT)，将其视为MT领域的新挑战。然而，文学MT的评估仍是一个未解之题。我们通过引入LITEVAL-CORPUS，一个包含多个经过验证的人工翻译和9个MT系统输出的段落级平行语料库，为这一持续讨论做出了贡献。该语料库总计超过2千个段落，涵盖13千个注释句子，涉及四种语言对，耗资4.5千欧元。这一语料库使我们能够（i）检验多种注释方案的一致性和充分性，（ii）比较学生和专业人士的评估结果，以及（iii）评估基于大语言模型（LLM）的指标的有效性。我们发现，多维度质量指标（MQM）作为非文学人工MT评估的实际标准，在文学翻译中并不适用：尽管学生采用的最佳-最差比例（BWS）和专业翻译人员采用的标量质量指标（SQM）分别倾向于人工翻译的比例约为82%和94%，但MQM与学生注释者相比，仅在约42%的情况下倾向于人工专业翻译优于表现最佳的LLM翻译。虽然自动指标通常与人类MQM和SQM显示出中等的相关性，但它们在准确识别人工翻译方面表现不佳，最高比例仅为约20%。我们的总体评估表明，人工专业翻译始终优于LLM翻译，即使是最近的LLM也倾向于产生更为直译且多样性较低的翻译，相比之下，人工翻译则更具多样性。然而，较新的LLM如GPT-4o的表现明显优于旧版本。

[NLP-30] Unleashing Reasoning Capability of LLM s via Scalable Question Synthesis from Scratch

【速读】：该论文试图解决的问题是现有开源社区缺乏大规模高质量数据以及可扩展且成本可控的数据合成方法。解决方案的关键是引入了一种名为ScaleQuest的可扩展且新颖的数据合成方法。ScaleQuest利用“小规模”（例如7B参数）的开源模型从头生成问题，无需依赖种子数据或复杂的增强约束。通过这种方法，论文自动构建了一个包含100万个问题-解答对的数学推理数据集，该数据集在效果上优于现有的开源数据集。ScaleQuest能够普遍提升主流开源模型（如Mistral、Llama3、DeepSeekMath和Qwen2-Math）在MATH数据集上的表现，提升幅度达到29.2%到46.4%。特别地，仅通过微调Qwen2-Math-7B-Base模型，就能超越基于闭源数据训练的强大模型Qwen2-Math-7B-Instruct，以及如GPT-4-Turbo和Claude-3.5 Sonnet等专有模型。

链接: https://arxiv.org/abs/2410.18693
作者: Yuyang Ding,Xinyu Shi,Xiaobo Liang,Juntao Li,Qiaoming Zhu,Min Zhang
关键词-EN: capability of LLMs, important factors, factors in improving, data, data synthesis
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint. Project page: this https URL

点击查看摘要

Abstract:The availability of high-quality data is one of the most important factors in improving the reasoning capability of LLMs. Existing works have demonstrated the effectiveness of creating more instruction data from seed questions or knowledge bases. Recent research indicates that continually scaling up data synthesis from strong models (e.g., GPT-4) can further elicit reasoning performance. Though promising, the open-sourced community still lacks high-quality data at scale and scalable data synthesis methods with affordable costs. To address this, we introduce ScaleQuest, a scalable and novel data synthesis method that utilizes “small-size” (e.g., 7B) open-source models to generate questions from scratch without the need for seed data with complex augmentation constraints. With the efficient ScaleQuest, we automatically constructed a mathematical reasoning dataset consisting of 1 million problem-solution pairs, which are more effective than existing open-sourced datasets. It can universally increase the performance of mainstream open-source models (i.e., Mistral, Llama3, DeepSeekMath, and Qwen2-Math) by achieving 29.2% to 46.4% gains on MATH. Notably, simply fine-tuning the Qwen2-Math-7B-Base model with our dataset can even surpass Qwen2-Math-7B-Instruct, a strong and well-aligned model on closed-source data, and proprietary models such as GPT-4-Turbo and Claude-3.5 Sonnet.
摘要：高质量数据的可用性是提升大语言模型推理能力的关键因素之一。现有研究已证明，从种子问题或知识库中创建更多指令数据的有效性。最新研究表明，持续从强模型（如 GPT-4）中扩展数据合成，可以进一步提升推理性能。尽管前景广阔，开源社区仍缺乏大规模高质量数据和成本可控的数据合成方法。为此，我们提出了 ScaleQuest，一种可扩展且新颖的数据合成方法，利用“小规模”（如 7B）开源模型从头生成问题，无需复杂的增强约束或种子数据。通过高效的 ScaleQuest，我们自动构建了一个包含 100 万对问题-解答的数学推理数据集，其效果优于现有的开源数据集。该数据集能普遍提升主流开源模型（如 Mistral、Llama3、DeepSeekMath 和 Qwen2-Math）的性能，在 MATH 测试中实现 29.2% 至 46.4% 的提升。特别值得一提的是，仅通过微调 Qwen2-Math-7B-Base 模型，使用我们的数据集就能超越基于闭源数据训练的强大且对齐良好的模型 Qwen2-Math-7B-Instruct，以及如 GPT-4-Turbo 和 Claude-3.5 Sonnet 等专有模型。

[NLP-31] Health Misinformation in Social Networks: A Survey of IT Approaches ALT

【速读】：该论文试图解决医疗错误信息（medical misinformation）在社交网络中的普遍问题。解决方案的关键在于：

事实核查方法：包括手动和自动的事实核查方法，以识别和纠正错误信息。
虚假新闻检测方法：利用内容、传播特征或来源特征来检测虚假新闻。
缓解措施：提出对抗错误信息传播的策略和方法。
数据集和工具：提供关于健康错误信息的数据集和公开可用工具的详细列表，以支持研究和实践。

论文通过系统性地回顾相关研究，旨在帮助研究人员和从业者在这一快速变化的领域中导航，并讨论了对抗健康错误信息领域的开放挑战和未来研究方向。

链接: https://arxiv.org/abs/2410.18670
作者: Vasiliki Papanikou,Panagiotis Papadakos,Theodora Karamanidou,Thanos G. Stavropoulos,Evaggelia Pitoura,Panayiotis Tsaparas
关键词-EN: information technology, pervasive issue, issue of medical, social networks, perspective of information
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Preprint – Under review in the ACM Transactions on Computing for Healthcare (HEALTH) journal

点击查看摘要

Abstract:In this paper, we present a comprehensive survey on the pervasive issue of medical misinformation in social networks from the perspective of information technology. The survey aims at providing a systematic review of related research and helping researchers and practitioners navigate through this fast-changing field. Specifically, we first present manual and automatic approaches for fact-checking. We then explore fake news detection methods, using content, propagation features, or source features, as well as mitigation approaches for countering the spread of misinformation. We also provide a detailed list of several datasets on health misinformation and of publicly available tools. We conclude the survey with a discussion on the open challenges and future research directions in the battle against health misinformation.
摘要：本文从信息技术角度对社交网络中普遍存在的医疗错误信息问题进行了全面调查。调查旨在提供相关研究的系统性回顾，并帮助研究人员和从业者在这一快速变化的领域中导航。具体而言，我们首先介绍了事实核查的手动和自动方法。接着，我们探讨了利用内容、传播特征或来源特征进行虚假新闻检测的方法，以及对抗错误信息传播的缓解措施。此外，我们还提供了关于健康错误信息的多个数据集和公开可用工具的详细列表。最后，我们通过讨论对抗健康错误信息领域的开放挑战和未来研究方向来结束本次调查。

[NLP-32] owards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework

【速读】：该论文试图解决开放式文本生成任务中模型评估的难题，特别是在多重评价指标（如连贯性、多样性和困惑度）之间存在权衡的情况下。解决方案的关键在于提出了新的排序策略，这些策略基于部分排序方法，并引入了一种新的综合指标，旨在平衡现有的自动评价指标，从而提供更全面的文本生成质量评估。此外，论文还探讨了这些方法与人类判断的一致性，并通过实验验证了其有效性，展示了这些方法在比较解码策略和指导模型选择方面的价值。

链接: https://arxiv.org/abs/2410.18653
作者: Esteban Garces Arias,Hannah Blocher,Julian Rodemann,Meimingwei Li,Christian Heumann,Matthias Aßenmacher
关键词-EN: natural language processing, language processing due, rise of powerful, natural language, language processing
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Open-ended text generation has become a prominent task in natural language processing due to the rise of powerful (large) language models. However, evaluating the quality of these models and the employed decoding strategies remains challenging because of trade-offs among widely used metrics such as coherence, diversity, and perplexity. Decoding methods often excel in some metrics while underperforming in others, complicating the establishment of a clear ranking. In this paper, we present novel ranking strategies within this multicriteria framework. Specifically, we employ benchmarking approaches based on partial orderings and present a new summary metric designed to balance existing automatic indicators, providing a more holistic evaluation of text generation quality. Furthermore, we discuss the alignment of these approaches with human judgments. Our experiments demonstrate that the proposed methods offer a robust way to compare decoding strategies, exhibit similarities with human preferences, and serve as valuable tools in guiding model selection for open-ended text generation tasks. Finally, we suggest future directions for improving evaluation methodologies in text generation. Our codebase, datasets, and models are publicly available.
摘要：随着强大（大型）语言模型的兴起，开放式文本生成已成为自然语言处理中的一个重要任务。然而，评估这些模型及其采用的解码策略的质量仍然具有挑战性，因为常用的指标如连贯性、多样性和困惑度之间存在权衡。解码方法往往在某些指标上表现优异，而在其他指标上表现不佳，这使得建立明确的排名变得复杂。在本文中，我们在这个多标准框架内提出了新的排名策略。具体来说，我们采用了基于偏序的基准测试方法，并设计了一种新的总结性指标，旨在平衡现有的自动指标，从而提供对文本生成质量更全面的评估。此外，我们讨论了这些方法与人类判断的一致性。我们的实验表明，所提出的方法提供了一种稳健的方式来比较解码策略，显示出与人类偏好相似的特征，并作为指导开放式文本生成任务模型选择的宝贵工具。最后，我们提出了改进文本生成评估方法的未来方向。我们的代码库、数据集和模型均已公开。

[NLP-33] C2: Scalable Auto-Feedback for LLM -based Chart Generation

【速读】：该论文试图解决生成高质量图表（charts）时面临的数据稀缺和高成本问题。由于生成式 AI (Generative AI) 在处理图表生成任务时需要大量的指令、数据和代码三元组，而这些三元组的创建需要专业技术知识，因此手动策划这些数据成本高昂且难以扩展。

解决方案的关键在于引入了一个无需参考的自动反馈生成器，即 C² 框架。该框架包括两个核心组件：

自动反馈提供器 (ChartAF)：通过自动生成反馈，消除了对昂贵的人工干预的需求。
多样化的无需参考数据集 (ChartUIE-8K)：显著提高了数据多样性，通过增加查询、数据集和图表类型的数量，分别比基准提高了 5982%、1936% 和 91%。

实验结果表明，ChartAF 在反馈后的表现优于九个基线，而 ChartUIE-8K 的数据多样性得到了用户的广泛认可，94% 的参与者更喜欢 ChartUIE-8K 的查询，93% 认为这些查询与实际应用场景相符。

链接: https://arxiv.org/abs/2410.18652
作者: Woosung Koh,Jang Han Yoon,MinHyung Lee,Youngjin Song,Jaegwan Cho,Jaehyun Kang,Taehyeon Kim,Se-young Yun,Youngjae Yu,Bongshin Lee
关键词-EN: Large Language Models, Language Models presents, Models presents significant, Large Language, Language Models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:Generating high-quality charts with Large Language Models presents significant challenges due to limited data and the high cost of scaling through human curation. Instruction, data, and code triplets are scarce and expensive to manually curate as their creation demands technical expertise. To address this scalability issue, we introduce a reference-free automatic feedback generator, which eliminates the need for costly human intervention. Our novel framework, C^2 , consists of (1) an automatic feedback provider (ChartAF) and (2) a diverse, reference-free dataset (ChartUIE-8K). Quantitative results are compelling: in our first experiment, 74% of respondents strongly preferred, and 10% preferred, the results after feedback. The second post-feedback experiment demonstrates that ChartAF outperforms nine baselines. Moreover, ChartUIE-8K significantly improves data diversity by increasing queries, datasets, and chart types by 5982%, 1936%, and 91%, respectively, over benchmarks. Finally, an LLM user study revealed that 94% of participants preferred ChartUIE-8K’s queries, with 93% deeming them aligned with real-world use cases. Core contributions are available as open-source at an anonymized project site, with ample qualitative examples.
摘要：利用大语言模型生成高质量图表面临着显著的挑战，主要原因是数据有限以及通过人工筛选进行扩展的高成本。指令、数据和代码三元组稀缺且手动筛选成本高昂，因为它们的创建需要技术专长。为了解决这一可扩展性问题，我们引入了一种无需参考的自动反馈生成器，从而消除了对昂贵的人工干预的需求。我们的创新框架 C^2 包括（1）一个自动反馈提供者（ChartAF）和（2）一个多样化的、无需参考的数据集（ChartUIE-8K）。定量结果令人信服：在我们的第一次实验中，74%的受访者强烈偏好反馈后的结果，10%的受访者偏好反馈后的结果。第二次反馈后的实验表明，ChartAF 优于九个基线。此外，ChartUIE-8K 通过分别增加查询、数据集和图表类型 5982%、1936% 和 91%，显著提高了数据多样性。最后，一项大语言模型用户研究表明，94%的参与者偏好 ChartUIE-8K 的查询，其中 93% 认为这些查询与实际应用场景相符。核心贡献已在匿名项目站点上作为开源资源提供，并附有丰富的定性示例。

[NLP-34] Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model

【速读】：该论文试图解决的问题是如何有效地将弱语言模型（LM）的偏好对齐行为转移到强语言模型上，以提升强模型的对齐能力。解决方案的关键在于提出了一种名为“弱到强偏好优化 (Weak-to-Strong Preference Optimization, WSPO)”的方法。该方法通过学习弱模型在对齐前后的分布差异，来实现强模型的有效对齐。实验结果表明，WSPO显著提升了强模型在多个基准测试中的表现，证明了利用弱模型引导强模型实现高对齐能力的可行性。

链接: https://arxiv.org/abs/2410.18640
作者: Wenhong Zhu,Zhiwei He,Xiaofeng Wang,Pengfei Liu,Rui Wang
关键词-EN: Aligning language models, meet diverse user, Aligning language, area of research, key area
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Aligning language models (LMs) with human preferences has become a key area of research, enabling these models to meet diverse user needs better. Inspired by weak-to-strong generalization, where a strong LM fine-tuned on labels generated by a weaker model can consistently outperform its weak supervisor, we extend this idea to model alignment. In this work, we observe that the alignment behavior in weaker models can be effectively transferred to stronger models and even exhibit an amplification effect. Based on this insight, we propose a method called Weak-to-Strong Preference Optimization (WSPO), which achieves strong model alignment by learning the distribution differences before and after the alignment of the weak model. Experiments demonstrate that WSPO delivers outstanding performance, improving the win rate of Qwen2-7B-Instruct on Arena-Hard from 39.70 to 49.60, achieving a remarkable 47.04 length-controlled win rate on AlpacaEval 2, and scoring 7.33 on MT-bench. Our results suggest that using the weak model to elicit a strong model with a high alignment ability is feasible.
摘要：将语言模型 (LMs) 与人类偏好对齐已成为研究的关键领域，使得这些模型能够更好地满足多样化的用户需求。受到弱到强泛化的启发，即一个在弱模型生成的标签上微调的强语言模型能够持续超越其弱监督者，我们将这一思想扩展到模型对齐中。在本研究中，我们观察到弱模型中的对齐行为可以有效地转移到强模型中，甚至表现出放大效应。基于这一洞察，我们提出了一种名为弱到强偏好优化 (Weak-to-Strong Preference Optimization, WSPO) 的方法，通过学习弱模型对齐前后的分布差异来实现强模型的对齐。实验表明，WSPO 表现出色，将 Qwen2-7B-Instruct 在 Arena-Hard 上的胜率从 39.70 提升至 49.60，在 AlpacaEval 2 上实现了 47.04 的长度控制胜率，并在 MT-bench 上获得了 7.33 的评分。我们的结果表明，利用弱模型来诱导出具有高度对齐能力的强模型是可行的。

[NLP-35] Little Giants: Synthesizing High-Quality Embedding Data at Scale

【速读】：该论文试图解决的问题是如何在不依赖大规模手动标注数据集的情况下，高效生成用于训练文本嵌入模型的大规模合成数据。当前方法主要依赖于如GPT-4这样的专有模型，这些模型成本高且效率低，不适合生成大规模嵌入数据。

解决方案的关键在于引入了一个名为SPEED的框架。SPEED通过以下几个关键步骤实现高效合成数据生成：

使用开源的小型模型（8B参数）进行数据生成。
通过监督微调（supervised fine-tuning）、偏好优化（preference optimization）和自我改进（self-improvement）等技术，使这些小型开源模型能够生成高质量的数据。
SPEED显著减少了GPT API调用的次数（不到1/10），并且在仅使用合成数据进行训练的情况下，其性能超越了当前最先进的嵌入模型E5_mistral。

通过这些方法，SPEED不仅提高了数据生成的效率，还揭示了合成嵌入数据生成过程中的缩放规律（scaling law），从而为未来的研究提供了理论基础。

链接: https://arxiv.org/abs/2410.18634
作者: Haonan Chen,Liang Wang,Nan Yang,Yutao Zhu,Ziliang Zhao,Furu Wei,Zhicheng Dou
关键词-EN: manually labeled datasets, manually labeled, labeled datasets, increasingly popular, Synthetic data generation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Synthetic data generation has become an increasingly popular way of training models without the need for large, manually labeled datasets. For tasks like text embedding, synthetic data offers diverse and scalable training examples, significantly reducing the cost of human annotation. However, most current approaches rely heavily on proprietary models like GPT-4, which are expensive and inefficient for generating large-scale embedding data. In this paper, we introduce SPEED, a framework that aligns open-source small models (8B) to efficiently generate large-scale synthetic embedding data. Through supervised fine-tuning, preference optimization, and self-improvement, SPEED enables small open-source models to produce high-quality data. Remarkably, SPEED uses only less than 1/10 of the GPT API calls, outperforming the state-of-the-art embedding model E5_mistral when both are trained solely on their synthetic data. Using this efficient generator, we conduct a comprehensive study on how various factors within the alignment pipeline impact data quality and reveal the scaling law for synthetic embedding data.
摘要：合成数据生成已成为一种越来越受欢迎的训练模型方法，无需依赖大规模手动标注的数据集。对于文本嵌入等任务，合成数据提供了多样化且可扩展的训练样本，显著降低了人工标注的成本。然而，当前大多数方法严重依赖于如 GPT-4 这样的专有模型，这些模型在生成大规模嵌入数据时既昂贵又低效。本文中，我们提出了 SPEED 框架，该框架将开源小模型（8B）对齐，以高效生成大规模合成嵌入数据。通过监督微调、偏好优化和自我改进，SPEED 使小开源模型能够生成高质量的数据。值得注意的是，SPEED 仅使用了不到 GPT API 调用次数的 1/10，在仅基于其合成数据训练的情况下，其性能超越了当前最先进的嵌入模型 E5_mistral。利用这一高效生成器，我们进行了全面研究，探讨了对齐流程中各种因素对数据质量的影响，并揭示了合成嵌入数据的扩展规律。

[NLP-36] Supporting Assessment of Novelty of Design Problems Using Concept of Problem SAPPhIRE

【速读】：该论文试图解决设计问题新颖性的评估问题。解决方案的关键在于提出了一种基于SAPPhIRE因果模型的新颖性评估框架。该框架通过计算当前问题与参考问题数据库中问题的最小距离来衡量新颖性，距离的计算基于SAPPhIRE本体论中不同抽象层次的文本相似性比较。论文通过将当前问题集与历史记录中的问题集进行比较，展示了该框架的应用性，旨在通过自动化评估方法提高评估效率和适用性。

链接: https://arxiv.org/abs/2410.18629
作者: Sanjay Singh,Amaresh Chakrabarti
关键词-EN: model of causality, novelty of design, problems, SAPPhIRE model, current
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper proposes a framework for assessing the novelty of design problems using the SAPPhIRE model of causality. The novelty of a problem is measured as its minimum distance from the problems in a reference problem database. The distance is calculated by comparing the current problem and each reference past problem at the various levels of abstraction in the SAPPhIRE ontology. The basis for comparison is textual similarity. To demonstrate the applicability of the proposed framework, The current set of problems associated with an artifact, as collected from its stakeholders, were compared with the past set of problems, as collected from patents and other web sources, to assess the novelty of the current set. This approach is aimed at providing a better understanding of the degree of novelty of any given set of current problems by comparing them to similar problems available from historical records. Since manual assessment, the current mode of such assessments as reported in the literature, is a tedious process, to reduce time complexity and to afford better applicability for larger sets of problem statements, an automated assessment is proposed and used in this paper.
摘要：本文提出了一种利用 SAPPhIRE 因果模型评估设计问题新颖性的框架。问题的新颖性通过其与参考问题数据库中问题的最小距离来衡量。该距离通过比较当前问题与每个参考历史问题在 SAPPhIRE 本体论的不同抽象层次上的文本相似性来计算。为了展示所提出框架的适用性，本文将一个工件相关的问题集（从其利益相关者收集）与过去的问题集（从专利和其他网络资源收集）进行了比较，以评估当前问题集的新颖性。这种方法旨在通过将当前问题集与历史记录中可用的类似问题进行比较，更好地理解任何给定当前问题集的新颖程度。由于手动评估（文献中报告的此类评估的当前模式）是一个繁琐的过程，为了降低时间复杂度并提高对更大问题集的适用性，本文提出并使用了自动化评估方法。

[NLP-37] Prompting and Fine-Tuning of Small LLM s for Length-Controllable Telephone Call Summarization

【速读】：该论文试图解决电话通话摘要生成的问题，解决方案的关键在于利用大型语言模型 (LLMs) 进行快速开发和优化。具体步骤包括：

实验阶段：初步实验使用现有的LLMs通过提示生成通话摘要。
数据集创建：利用前沿模型创建定制的合成训练数据集，特别关注生成数据的多样性和摘要长度的可控性，以满足不同使用场景的需求。
模型评估：采用两种基于LLM的评估技术（LLM-as-a-judge-based evaluation techniques）来确保摘要的质量和相关性。

最终，通过微调的Llama-2-7B模型在事实准确性、完整性和简洁性方面与GPT-4表现相当，证明了快速构建实用且高效的通话摘要系统的潜力。

链接: https://arxiv.org/abs/2410.18624
作者: David Thulke,Yingbo Gao,Rricha Jalota,Christian Dugast,Hermann Ney
关键词-EN: utilizing large language, large language models, system utilizing large, paper explores, explores the rapid
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at the The International Conference on Foundation and Large Language Models (FLLM2024)

点击查看摘要

Abstract:This paper explores the rapid development of a telephone call summarization system utilizing large language models (LLMs). Our approach involves initial experiments with prompting existing LLMs to generate summaries of telephone conversations, followed by the creation of a tailored synthetic training dataset utilizing stronger frontier models. We place special focus on the diversity of the generated data and on the ability to control the length of the generated summaries to meet various use-case specific requirements. The effectiveness of our method is evaluated using two state-of-the-art LLM-as-a-judge-based evaluation techniques to ensure the quality and relevance of the summaries. Our results show that fine-tuned Llama-2-7B-based summarization model performs on-par with GPT-4 in terms of factual accuracy, completeness and conciseness. Our findings demonstrate the potential for quickly bootstrapping a practical and efficient call summarization system.
摘要：本文探讨了利用大语言模型 (LLM) 快速开发电话通话摘要系统的进展。我们的方法首先通过提示现有的 LLM 生成电话对话的摘要进行初步实验，随后利用更强大的前沿模型创建定制的合成训练数据集。我们特别关注生成数据的多样性以及控制生成摘要长度的能力，以满足各种使用场景的具体需求。我们采用两种基于 LLM 作为评判者的最先进评估技术来评估我们方法的有效性，以确保摘要的质量和相关性。我们的结果显示，经过微调的 Llama-2-7B 基础摘要模型在事实准确性、完整性和简洁性方面与 GPT-4 表现相当。我们的研究结果表明，快速启动一个实用且高效的通话摘要系统具有潜在的可能性。

[NLP-38] STTATTS: Unified Speech-To-Text And Text-To-Speech Model EMNLP2024

【速读】：该论文试图解决的问题是传统的语音识别 (ASR) 和语音合成 (TTS) 模型通常分别训练，导致两个独立的庞大网络，增加了计算和内存成本。解决方案的关键在于提出了一种参数高效的多任务学习方法，通过共享模型参数和联合训练目标，实现了ASR和TTS的联合学习。这种方法不仅在性能上与单独训练的模型相当，而且显著减少了所需的参数总数（约50%的减少），从而降低了计算和内存成本。该研究在资源丰富的英语和资源相对匮乏的阿拉伯语上进行了实验，并公开了训练代码和模型检查点，以促进进一步的研究。

链接: https://arxiv.org/abs/2410.18607
作者: Hawau Olamide Toyin,Hao Li,Hanan Aldarmaki
关键词-EN: distinct large networks, speech synthesis models, typically trained separately, Speech recognition, speech synthesis
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 11 pages, 4 Figures, EMNLP 2024 Findings

点击查看摘要

Abstract:Speech recognition and speech synthesis models are typically trained separately, each with its own set of learning objectives, training data, and model parameters, resulting in two distinct large networks. We propose a parameter-efficient approach to learning ASR and TTS jointly via a multi-task learning objective and shared parameters. Our evaluation demonstrates that the performance of our multi-task model is comparable to that of individually trained models while significantly saving computational and memory costs ( \sim 50% reduction in the total number of parameters required for the two tasks combined). We experiment with English as a resource-rich language, and Arabic as a relatively low-resource language due to shortage of TTS data. Our models are trained with publicly available data, and both the training code and model checkpoints are openly available for further research.
摘要：语音识别和语音合成模型通常是分别训练的，每个模型都有其独立的学习目标、训练数据和模型参数，从而形成了两个独立的大型网络。我们提出了一种参数高效的方法，通过多任务学习目标和共享参数来联合学习自动语音识别 (ASR) 和文本到语音 (TTS) 模型。我们的评估结果表明，多任务模型的性能与单独训练的模型相当，同时显著节省了计算和内存成本（两个任务所需的总参数数量减少了约 50%）。我们以英语（资源丰富的语言）和阿拉伯语（由于 TTS 数据短缺而相对资源匮乏的语言）进行了实验。我们的模型使用公开数据进行训练，训练代码和模型检查点均公开可用，以供进一步研究。

[NLP-39] aipan: Efficient and Expressive State Space Language Models with Selective Attention

【速读】：该论文试图解决自然语言处理 (NLP) 中高效长上下文语言建模的问题。尽管Transformer在语言任务中占据主导地位，但其在处理长序列时面临训练中的二次计算复杂度和推理时线性增长的内存成本问题。现有的状态空间模型 (State Space Models, SSMs) 如Mamba虽然提供了常数内存使用的替代方案，但在需要广泛上下文检索的任务中表现不佳。

解决方案的关键在于引入了一种名为Taipan的新型混合架构，该架构结合了Mamba-2与选择性注意力层 (Selective Attention Layers, SALs)。SALs能够识别需要长程交互的token，去除不重要的特征，并使用注意力模块增强其表示。这种方法在保持Mamba效率的同时，实现了类似Transformer在内存密集型任务中的性能。通过限制注意力预算，Taipan能够将准确预测扩展到上下文长度高达100万个token，同时保持计算效率。实验结果显示，Taipan在各种规模和任务中表现优异，为高效长上下文语言建模提供了有前景的解决方案。

链接: https://arxiv.org/abs/2410.18572
作者: Chien Van Nguyen,Huy Huu Nguyen,Thang M. Pham,Ruiyi Zhang,Hanieh Deilamsalehy,Puneet Mathur,Ryan A. Rossi,Trung Bui,Viet Dac Lai,Franck Dernoncourt,Thien Huu Nguyen
关键词-EN: Natural Language Processing, challenge in Natural, Language Processing, Natural Language, State Space Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Efficient long-context language modeling remains a significant challenge in Natural Language Processing (NLP). While Transformers dominate language tasks, they struggle with long sequences due to quadratic computational complexity in training and linearly scaling memory costs during inference. Recent State Space Models (SSMs) such as Mamba offer alternatives with constant memory usage, but they underperform in tasks requiring extensive in-context retrieval. We introduce Taipan, a novel hybrid architecture that combines Mamba-2 with Selective Attention Layers (SALs). These SALs identify tokens requiring long-range interactions, remove less important features, and then augment their representations using the attention module. This approach balances Mamba’s efficiency with Transformer-like performance in memory-intensive tasks. By constraining the attention budget, Taipan extends accurate predictions to context lengths of up to 1 million tokens while preserving computational efficiency. Our experiments demonstrate Taipan’s superior performance across various scales and tasks, offering a promising solution for efficient long-context language modeling.
摘要：在自然语言处理 (Natural Language Processing, NLP) 领域，高效的长上下文语言建模仍然是一个重大挑战。尽管 Transformer 在语言任务中占据主导地位，但由于其在训练中的二次计算复杂性和推理过程中线性增长的内存成本，处理长序列时仍面临困难。近期出现的如 Mamba 这样的状态空间模型 (State Space Models, SSMs) 提供了具有恒定内存使用的替代方案，但在需要广泛上下文检索的任务中表现不佳。我们提出了 Taipan，一种新颖的混合架构，结合了 Mamba-2 与选择性注意力层 (Selective Attention Layers, SALs)。这些 SALs 能够识别需要长程交互的 Token，去除不重要的特征，并使用注意力模块增强其表示。这种方法在保持 Mamba 高效性的同时，在内存密集型任务中实现了类似 Transformer 的性能。通过限制注意力预算，Taipan 将准确预测的上下文长度扩展到高达 100 万个 Token，同时保持计算效率。我们的实验表明，Taipan 在各种规模和任务中均表现出优越的性能，为高效的长上下文语言建模提供了一个有前景的解决方案。

[NLP-40] Difficult for Whom? A Study of Japanese Lexical Complexity

【速读】：该论文试图解决的问题是词汇复杂度预测 (Lexical Complexity Prediction, LCP) 和复杂词汇识别 (Complex Word Identification, CWI) 任务中，如何处理不同群体（如不同语言背景的母语者）对词汇复杂度感知的差异性。解决方案的关键在于验证和探索个性化方法在适应个体需求方面的有效性。

具体来说，论文通过重新标注数据集，验证了日本LCP数据集在目标人群中的代表性，并发现母语为中文的受试者由于中日词汇的相似性，对词汇复杂度的感知与日本母语者不同。为了探索个性化方法的可能性，论文比较了基于群体平均评分和个体评分的模型在CWI任务中的表现，发现基于群体平均评分的模型在CWI任务中表现与个体模型相似，但在LCP任务中实现个体化的良好表现较为困难。此外，论文还尝试了微调BERT模型，结果显示在所有设置下只有边际的改进。

链接: https://arxiv.org/abs/2410.18567
作者: Adam Nohejl,Akio Hayakawa,Yusuke Ide,Taro Watanabe
关键词-EN: complex word identification, lexical complexity prediction, word identification, commonly presuppose, complex word
类目: Computation and Language (cs.CL)
备注: Accepted to TSAR 2024

点击查看摘要

Abstract:The tasks of lexical complexity prediction (LCP) and complex word identification (CWI) commonly presuppose that difficult to understand words are shared by the target population. Meanwhile, personalization methods have also been proposed to adapt models to individual needs. We verify that a recent Japanese LCP dataset is representative of its target population by partially replicating the annotation. By another reannotation we show that native Chinese speakers perceive the complexity differently due to Sino-Japanese vocabulary. To explore the possibilities of personalization, we compare competitive baselines trained on the group mean ratings and individual ratings in terms of performance for an individual. We show that the model trained on a group mean performs similarly to an individual model in the CWI task, while achieving good LCP performance for an individual is difficult. We also experiment with adapting a finetuned BERT model, which results only in marginal improvements across all settings.
摘要：词汇复杂度预测 (Lexical Complexity Prediction, LCP) 和复杂词汇识别 (Complex Word Identification, CWI) 任务通常假设难以理解的词汇在目标群体中是共享的。同时，个性化方法也被提出以适应个体需求。我们通过部分复制标注验证了最近一个日语 LCP 数据集对其目标群体的代表性。通过另一次重新标注，我们展示了母语为中文的人由于中日词汇的差异而对复杂度的感知有所不同。为了探索个性化的可能性，我们比较了基于群体平均评分和个人评分的竞争性基线模型在个体性能方面的表现。我们发现，在 CWI 任务中，基于群体平均训练的模型与个体模型表现相似，而在 LCP 任务中，为个体实现良好的性能较为困难。我们还尝试了微调 BERT 模型，结果仅在所有设置中实现了边际改进。

[NLP-41] Bielik 7B v0.1: A Polish Language Model – Development Insights and Evaluation

【速读】：该论文试图解决波兰语处理中的关键挑战，特别是在生成式文本模型（Generative Text Model）的开发方面。解决方案的关键在于采用了创新的技术，包括：

加权指令交叉熵损失 (Weighted Instruction Cross-Entropy Loss)：通过平衡不同指令类型的学习，提高模型对多样化任务的适应能力。
自适应学习率 (Adaptive Learning Rate)：根据训练进度动态调整学习率，优化模型的训练效率和性能。

这些技术使得Bielik 7B v0.1模型在多项自然语言处理（NLP）任务中表现出色，特别是在RAG Reader任务中，平均得分比Mistral-7B-v0.1提高了9个百分点。此外，该模型在波兰语MT-Bench中的推理（Reasoning）和角色扮演（Role-playing）类别中也表现优异，分别为6.15/10和7.83/10。这些成果标志着波兰语AI领域的重大进步，并为该领域的语言应用提供了强大的工具。

链接: https://arxiv.org/abs/2410.18565
作者: Krzysztof Ociepa,Łukasz Flis,Krzysztof Wróbel,Adrian Gwoździej,Remigiusz Kinas
关键词-EN: generative text model, Polish language processing, generative text, Adaptive Learning Rate, Polish language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce Bielik 7B v0.1, a 7-billion-parameter generative text model for Polish language processing. Trained on curated Polish corpora, this model addresses key challenges in language model development through innovative techniques. These include Weighted Instruction Cross-Entropy Loss, which balances the learning of different instruction types, and Adaptive Learning Rate, which dynamically adjusts the learning rate based on training progress. To evaluate performance, we created the Open PL LLM Leaderboard and Polish MT-Bench, novel frameworks assessing various NLP tasks and conversational abilities. Bielik 7B v0.1 demonstrates significant improvements, achieving a 9 percentage point increase in average score compared to Mistral-7B-v0.1 on the RAG Reader task. It also excels in the Polish MT-Bench, particularly in Reasoning (6.15/10) and Role-playing (7.83/10) categories. This model represents a substantial advancement in Polish language AI, offering a powerful tool for diverse linguistic applications and setting new benchmarks in the field.
摘要：我们介绍了 Bielik 7B v0.1，这是一个用于波兰语处理的 70 亿参数生成式文本模型。该模型在经过精心筛选的波兰语语料库上进行训练，通过创新技术解决了语言模型开发中的关键挑战。这些技术包括加权指令交叉熵损失 (Weighted Instruction Cross-Entropy Loss)，它平衡了不同指令类型的学习，以及自适应学习率 (Adaptive Learning Rate)，它根据训练进度动态调整学习率。为了评估性能，我们创建了 Open PL LLM Leaderboard 和 Polish MT-Bench，这两个新颖的框架用于评估各种自然语言处理任务和对话能力。Bielik 7B v0.1 展示了显著的改进，在 RAG Reader 任务中，其平均得分比 Mistral-7B-v0.1 提高了 9 个百分点。此外，在 Polish MT-Bench 中，该模型在推理 (6.15/10) 和角色扮演 (7.83/10) 类别中表现尤为出色。这一模型代表了波兰语人工智能领域的重大进步，为多样化的语言应用提供了强大的工具，并在该领域树立了新的基准。

[NLP-42] Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data

【速读】：该论文试图解决开源视觉语言模型（Vision-Language Models, VLMs）在性能上与闭源模型相比的不足问题。解决方案的关键在于引入了一个大规模的多模态指令数据集Infinity-MM，该数据集包含4000万样本，并通过严格的质量筛选和去重处理来提升数据质量。此外，论文还提出了一种基于开源VLMs的合成指令生成方法，利用详细的图像标注和多样化的问答生成技术。通过使用这些数据，研究人员训练了一个20亿参数的VLM模型Aquila-VL-2B，该模型在同类规模模型中达到了最先进的性能（SOTA）。这一结果表明，扩展指令数据和生成合成数据能够显著提升开源模型的性能。

链接: https://arxiv.org/abs/2410.18558
作者: Shuhao Gu,Jialing Zhang,Siyuan Zhou,Kevin Yu,Zhaohu Xing,Liangdong Wang,Zhou Cao,Jintao Jia,Zhuoyi Zhang,Yixuan Wang,Zhenchong Hu,Bo-Wen Zhang,Jijie Li,Dong Liang,Yingli Zhao,Yulong Ao,Yaoqi Liu,Fangxiang Feng,Guang Liu
关键词-EN: made significant progress, recently made significant, significant progress, recently made, made significant
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) have recently made significant progress, but the limited scale and quality of open-source instruction data hinder their performance compared to closed-source models. In this work, we address this limitation by introducing Infinity-MM, a large-scale multimodal instruction dataset with 40 million samples, enhanced through rigorous quality filtering and deduplication. We also propose a synthetic instruction generation method based on open-source VLMs, using detailed image annotations and diverse question generation. Using this data, we trained a 2-billion-parameter VLM, Aquila-VL-2B, achieving state-of-the-art (SOTA) performance for models of similar scale. This demonstrates that expanding instruction data and generating synthetic data can significantly improve the performance of open-source models.
摘要：视觉-语言模型 (Vision-Language Models, VLMs) 近年来取得了显著进展，但开源指令数据的规模和质量有限，导致其性能不及闭源模型。在本研究中，我们通过引入 Infinity-MM，一个包含 4000 万样本的大规模多模态指令数据集，并通过严格的质量筛选和去重处理，解决了这一限制。我们还提出了一种基于开源 VLMs 的合成指令生成方法，利用详细的图像注释和多样化的问答生成技术。基于这些数据，我们训练了一个 20 亿参数的 VLM，即 Aquila-VL-2B，其在同类模型中达到了最先进的 (state-of-the-art, SOTA) 性能。这表明，扩展指令数据和生成合成数据可以显著提升开源模型的性能。

[NLP-43] On Explaining with Attention Matrices

【速读】：该论文试图解决的问题是关于注意力权重（Attention Weights, AW）在Transformer模型中与预测输出之间的解释性关联。早期研究认为AW具有解释性，但近期研究提出了相反的观点，认为AW并不具备解释性。论文的关键解决方案在于：

反驳了近期研究中关于AW不具备解释性的正式论点，指出这些论点是错误的。
引入了高效注意力（Efficient Attention）的概念，并展示了如何有效计算这一概念，以隔离在任务和模型中起解释作用的注意力矩阵的有效成分。
证明了高效注意力在需要上下文信息的自然语言处理（NLP）任务中对预测模型输出具有因果作用，即提供了最小必要且充分的条件。
通过实验验证了高效注意力矩阵是概率分布，并且可以有效计算，从而支持了其在基于注意力的模型行为解释中的重要作用。

综上所述，论文的关键在于引入并验证了高效注意力的概念，证明了其在模型输出预测中的因果作用，并提供了实验证据支持这一方法的有效性。

链接: https://arxiv.org/abs/2410.18541
作者: Omar Naim,Nicholas Asher
关键词-EN: paper explores, attention, efficient attention, attention weights, formal arguments
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper explores the much discussed, possible explanatory link between attention weights (AW) in transformer models and predicted output. Contrary to intuition and early research on attention, more recent prior research has provided formal arguments and empirical evidence that AW are not explanatorily relevant. We show that the formal arguments are incorrect. We introduce and effectively compute efficient attention, which isolates the effective components of attention matrices in tasks and models in which AW play an explanatory role. We show that efficient attention has a causal role (provides minimally necessary and sufficient conditions) for predicting model output in NLP tasks requiring contextual information, and we show, contrary to [7], that efficient attention matrices are probability distributions and are effectively calculable. Thus, they should play an important part in the explanation of attention based model behavior. We offer empirical experiments in support of our method illustrating various properties of efficient attention with various metrics on four datasets.
摘要：本文探讨了在 Transformer 模型中，注意力权重 (Attention Weights, AW) 与预测输出之间备受讨论的可能解释性联系。与直觉和早期关于注意力的研究相反，近期先前的研究提供了形式论证和实证证据，表明 AW 在解释上并不相关。我们证明这些形式论证是错误的。我们引入了有效计算的“高效注意力” (Efficient Attention)，该方法能够隔离在 AW 起解释作用的任务和模型中注意力矩阵的有效成分。我们展示了高效注意力在需要上下文信息的自然语言处理 (NLP) 任务中对预测模型输出具有因果作用（提供最小必要且充分的条件），并且与 [7] 相反，我们证明了高效注意力矩阵是概率分布，并且可以有效计算。因此，它们应在基于注意力的模型行为的解释中发挥重要作用。我们通过在四个数据集上进行的实证实验，使用多种指标展示了高效注意力的各种特性，以支持我们的方法。

[NLP-44] LOGO – Long cOntext aliGnment via efficient preference Optimization

【速读】：该论文试图解决长上下文模型（Long-context models, LCMs）在处理长输入序列时生成性能不足的问题，尤其是可能导致生成内容与上下文不一致（如幻觉现象）的问题。解决方案的关键在于引入了一种名为LOGO（Long cOntext aliGnment via efficient preference Optimization）的训练策略。

LOGO的核心在于通过高效的偏好优化（preference optimization）来实现长上下文的对齐。具体来说，LOGO采用了一种无参考的偏好优化策略，并结合位置合成方法来构建训练数据，从而在克服长序列带来的GPU内存限制问题的同时，提升模型的生成性能。实验结果表明，通过在单个8×A800 GPU机器上使用0.3B数据进行16小时的训练，LOGO能够使Llama-3-8B-Instruct-80K模型在实际长上下文任务中达到与GPT-4相当的性能，同时保持模型在其他任务（如语言建模和MMLU）上的原有能力。此外，LOGO还能扩展模型的上下文窗口大小，进一步提升生成性能。

链接: https://arxiv.org/abs/2410.18533
作者: Zecheng Tang,Zechen Sun,Juntao Li,Qiaoming Zhu,Min Zhang
关键词-EN: shown great potential, processing long input, conveniently and effectively, shown great, great potential
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long-context models(LCMs) have shown great potential in processing long input sequences(even more than 100M tokens) conveniently and effectively. With significant progress, recent research has pointed out that LCMs can accurately locate token-level salient information within the context. Yet, the generation performance of these LCMs is far from satisfactory and might result in misaligned responses, such as hallucinations. To enhance the generation capability of LCMs, existing works have investigated the effects of data size and quality for both pre-training and instruction tuning. Though achieving meaningful improvement, previous methods fall short in either effectiveness or efficiency. In this paper, we introduce LOGO(Long cOntext aliGnment via efficient preference Optimization), a training strategy that first introduces preference optimization for long-context alignment. To overcome the GPU memory-bound issue caused by the long sequence, LOGO employs a reference-free preference optimization strategy and adopts a position synthesis method to construct the training data. By training with only 0.3B data on a single 8 \times A800 GPU machine for 16 hours, LOGO allows the Llama-3-8B-Instruct-80K model to achieve comparable performance with GPT-4 in real-world long-context tasks while preserving the model’s original capabilities on other tasks, e.g., language modeling and MMLU. Moreover, LOGO can extend the model’s context window size while enhancing its generation performance.
摘要：长上下文模型（Long-context models, LCMs）在处理长输入序列（甚至超过 100M Token）方面展现了巨大的潜力，既便捷又高效。随着显著的进展，近期研究指出，LCMs 能够准确地在上下文中定位 Token 级别的显著信息。然而，这些 LCMs 的生成性能远未达到令人满意的水平，可能会导致响应不一致，例如出现幻觉。为了提升 LCMs 的生成能力，现有研究探讨了预训练和指令调优中数据规模和质量的影响。尽管取得了一定的改进，但先前的方法在效果或效率上仍有所欠缺。本文中，我们提出了 LOGO（通过高效偏好优化实现长上下文对齐），这是一种首次引入偏好优化以实现长上下文对齐的训练策略。为克服长序列带来的 GPU 内存限制问题，LOGO 采用了一种无参考的偏好优化策略，并采用位置合成方法构建训练数据。通过在单个 8 \times A800 GPU 机器上使用仅 0.3B 数据训练 16 小时，LOGO 使 Llama-3-8B-Instruct-80K 模型在实际长上下文任务中达到了与 GPT-4 相当的性能，同时保留了模型在其他任务（如语言建模和 MMLU）上的原有能力。此外，LOGO 还能在提升生成性能的同时扩展模型的上下文窗口大小。

[NLP-45] A Systematic Survey on Instructional Text: From Representation and Downstream NLP Tasks

【速读】：该论文试图解决复杂多步指令理解与处理的问题。当前的自然语言处理 (NLP) 系统在处理简单指令方面已取得显著进展，但在应对复杂、多步指令时仍面临挑战。论文通过系统综述现有文献，分析了相关资源、表示方案及下游任务，旨在为研究人员提供一个全面的视角，以理解复杂指令处理领域的现状、趋势、挑战和未来机会。

解决方案的关键在于：

系统综述：通过对177篇相关论文的分析，识别出该领域的关键趋势和挑战。
资源与表示方案：分析现有的资源和表示方案，为复杂指令的理解提供基础。
下游任务：探讨与复杂指令处理相关的下游任务，展示其在实际应用中的重要性。
统一视角：为AI/NLP研究人员提供一个统一的视角，整合不同研究方向，并指出未来的研究机会。

链接: https://arxiv.org/abs/2410.18529
作者: Abdulfattah Safa,Tamta Kapanadze,Arda Uzunoğlu,Gözde Gül Şahin
关键词-EN: large language models, demonstrated promising capabilities, Recent advances, advances in large, large language
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in large language models have demonstrated promising capabilities in following simple instructions through instruction tuning. However, real-world tasks often involve complex, multi-step instructions that remain challenging for current NLP systems. Despite growing interest in this area, there lacks a comprehensive survey that systematically analyzes the landscape of complex instruction understanding and processing. Through a systematic review of the literature, we analyze available resources, representation schemes, and downstream tasks related to instructional text. Our study examines 177 papers, identifying trends, challenges, and opportunities in this emerging field. We provide AI/NLP researchers with essential background knowledge and a unified view of various approaches to complex instruction understanding, bridging gaps between different research directions and highlighting future research opportunities.
摘要：近年来，大语言模型的发展展示了通过指令调优遵循简单指令的显著能力。然而，现实世界中的任务通常涉及复杂的多步骤指令，这对当前的自然语言处理（NLP）系统来说仍然是一个挑战。尽管这一领域引起了越来越多的关注，但缺乏一个系统的综述来全面分析复杂指令理解和处理的全貌。通过系统地回顾相关文献，我们分析了与指令文本相关的可用资源、表示方案和下游任务。我们的研究涵盖了177篇论文，识别了这一新兴领域的趋势、挑战和机遇。我们为AI/NLP研究人员提供了必要的背景知识，并提供了一个统一的视角，来审视各种复杂指令理解的方法，填补了不同研究方向之间的空白，并突出了未来的研究机会。

[NLP-46] KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing ICLR2025

【速读】：该论文试图解决大型语言模型（LLMs）在推理过程中由于KV缓存（KV cache）占用大量GPU内存的问题。解决方案的关键在于提出了一种名为KVSharer的即插即用方法，通过在层间共享KV缓存来实现层级压缩。与传统方法不同，KVSharer发现共享相似度较低的KV缓存反而能更好地保持模型性能，这一反直觉现象是其核心创新点。实验结果表明，KVSharer能够减少30%的KV缓存计算量，降低内存消耗，同时不影响模型性能，并能实现至少1.3倍的生成加速。此外，KVSharer还兼容现有的层内KV缓存压缩方法，结合使用可以进一步节省内存。

链接: https://arxiv.org/abs/2410.18517
作者: Yifei Yang,Zouying Cao,Qiguang Chen,Libo Qin,Dongjie Yang,Hai Zhao,Zhi Chen
关键词-EN: substantial GPU memory, GPU memory requirements, substantial GPU, expanded model sizes, large language models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Under Review by ICLR2025

点击查看摘要

Abstract:The development of large language models (LLMs) has significantly expanded model sizes, resulting in substantial GPU memory requirements during inference. The key and value storage of the attention map in the KV (key-value) cache accounts for more than 80% of this memory consumption. Nowadays, most existing KV cache compression methods focus on intra-layer compression within a single Transformer layer but few works consider layer-wise compression. In this paper, we propose a plug-and-play method called \textitKVSharer, which shares the KV cache between layers to achieve layer-wise compression. Rather than intuitively sharing based on higher similarity, we discover a counterintuitive phenomenon: sharing dissimilar KV caches better preserves the model performance. Experiments show that \textitKVSharer can reduce KV cache computation by 30%, thereby lowering memory consumption without significantly impacting model performance and it can also achieve at least 1.3 times generation acceleration. Additionally, we verify that \textitKVSharer is compatible with existing intra-layer KV cache compression methods, and combining both can further save memory.
摘要：大语言模型 (LLM) 的发展显著扩大了模型规模，导致推理过程中对 GPU 内存的需求大幅增加。注意力机制中的键值存储 (KV cache) 占据了超过 80% 的内存消耗。目前，大多数现有的 KV cache 压缩方法主要集中在单个 Transformer 层内的层内压缩，而很少有研究考虑层间压缩。本文提出了一种即插即用的方法，称为 \textitKVSharer，通过在层间共享 KV cache 来实现层间压缩。我们发现了一个反直觉的现象：共享相似度较低的 KV cache 更能保留模型性能。实验表明，\textitKVSharer 可以将 KV cache 计算量减少 30%，从而在不显著影响模型性能的情况下降低内存消耗，并且还能实现至少 1.3 倍的速度提升。此外，我们验证了 \textitKVSharer 与现有的层内 KV cache 压缩方法兼容，两者的结合可以进一步节省内存。

[NLP-47] CCI3.0-HQ: a large-scale Chinese dataset of high quality designed for pre-training large language models

【速读】：该论文试图解决中文网络数据集质量不高的问题，解决方案的关键在于开发了一种新颖的两阶段混合过滤管道（two-stage hybrid filtering pipeline），显著提升了数据质量。具体来说，通过这种过滤管道，生成了高质量的500GB子集CCI3.0-HQ，并在此基础上训练了一个0.5B参数的模型，该模型在10个基准测试中表现优于CCI3.0、SkyPile和WanjuanV1，尤其是在零样本设置下。此外，该过滤过程有效地将Qwen2-72B-instruct模型的能力提炼到一个紧凑的0.5B模型中，实现了中文网页数据分类的最佳F1分数。

链接: https://arxiv.org/abs/2410.18505
作者: Liangdong Wang,Bo-Wen Zhang,Chengwei Wu,Hanyu Zhao,Xiaofeng Shi,Shuhao Gu,Jijie Li,Quanyue Ma,TengFei Pan,Guang Liu
关键词-EN: Chinese Corpora Internet, https URL, Corpora Internet, enhances data quality, two-stage hybrid filtering
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present CCI3.0-HQ (this https URL), a high-quality 500GB subset of the Chinese Corpora Internet 3.0 (CCI3.0)(this https URL), developed using a novel two-stage hybrid filtering pipeline that significantly enhances data quality. To evaluate its effectiveness, we trained a 0.5B parameter model from scratch on 100B tokens across various datasets, achieving superior performance on 10 benchmarks in a zero-shot setting compared to CCI3.0, SkyPile, and WanjuanV1. The high-quality filtering process effectively distills the capabilities of the Qwen2-72B-instruct model into a compact 0.5B model, attaining optimal F1 scores for Chinese web data classification. We believe this open-access dataset will facilitate broader access to high-quality language models.
摘要：我们呈现了 CCI3.0-HQ（链接），这是中国互联网语料库 3.0 (CCI3.0)（链接）的一个高质量 500GB 子集，通过一种新颖的两阶段混合过滤管道显著提升了数据质量。为了评估其有效性，我们在多个数据集上从头训练了一个 0.5B 参数的模型，处理了 100B Token，在零样本设置下，该模型在 10 个基准测试中表现优于 CCI3.0、SkyPile 和 WanjuanV1。高质量的过滤过程有效地将 Qwen2-72B-instruct 模型的能力提炼到一个紧凑的 0.5B 模型中，实现了中文网页数据分类的最佳 F1 分数。我们相信，这个开放访问的数据集将促进高质量语言模型的更广泛应用。

[NLP-48] ChineseSafe: A Chinese Benchmark for Evaluating Safety in Large Language Models

【速读】：该论文试图解决的问题是：在大型语言模型 (LLMs) 的快速发展背景下，如何评估和提升这些模型在识别中文环境中的非法和不安全内容的能力。

解决方案的关键在于：

提出了一个名为 ChineseSafe 的中文安全基准，该基准包含 205,034 个样本，涵盖 4 个类别和 10 个子类别的安全问题，特别针对中国互联网内容审查的法规要求，增加了政治敏感性、色情内容以及变体/同音字等特殊类型的非法内容。
采用两种方法评估了流行 LLMs 的法律风险，包括开源模型和 API，揭示了这些模型在某些安全问题类型上的脆弱性，从而在中国法律框架下存在风险。
通过这些研究和评估，为开发者和研究人员提供了一个指导框架，以促进 LLMs 的安全性。

链接: https://arxiv.org/abs/2410.18491
作者: Hengxiang Zhang,Hongfu Gao,Qiang Hu,Guanhua Chen,Lili Yang,Bingyi Jing,Hongxin Wei,Bing Wang,Haifeng Bai,Lei Yang
关键词-EN: Large language models, identifying unsafe content, Large language, Chinese Internet content, increasingly important
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the rapid development of Large language models (LLMs), understanding the capabilities of LLMs in identifying unsafe content has become increasingly important. While previous works have introduced several benchmarks to evaluate the safety risk of LLMs, the community still has a limited understanding of current LLMs’ capability to recognize illegal and unsafe content in Chinese contexts. In this work, we present a Chinese safety benchmark (ChineseSafe) to facilitate research on the content safety of large language models. To align with the regulations for Chinese Internet content moderation, our ChineseSafe contains 205,034 examples across 4 classes and 10 sub-classes of safety issues. For Chinese contexts, we add several special types of illegal content: political sensitivity, pornography, and variant/homophonic words. Moreover, we employ two methods to evaluate the legal risks of popular LLMs, including open-sourced models and APIs. The results reveal that many LLMs exhibit vulnerability to certain types of safety issues, leading to legal risks in China. Our work provides a guideline for developers and researchers to facilitate the safety of LLMs. Our results are also available at this https URL.
摘要：随着大语言模型（Large Language Models, LLMs）的快速发展，理解这些模型在识别不安全内容方面的能力变得愈发重要。尽管先前的研究已经引入了多个基准来评估LLMs的安全风险，但学术界对于当前LLMs在识别中文环境下的非法和不安全内容的能力仍知之甚少。在本研究中，我们提出了一项中文安全基准（ChineseSafe），以促进对大语言模型内容安全性的研究。为了与中国的互联网内容监管规定相符，我们的ChineseSafe包含了205,034个样本，涵盖了4个类别和10个子类别的安全问题。针对中文环境，我们特别增加了几种特殊类型的非法内容：政治敏感性、色情内容以及变体/同音字。此外，我们采用了两种方法来评估流行LLMs的法律风险，包括开源模型和API。结果显示，许多LLMs在某些类型的安全问题上表现出脆弱性，从而在中国面临法律风险。我们的工作为开发者和研究人员提供了一个指导，以促进LLMs的安全性。我们的研究结果也可在此https URL获取。

[NLP-49] Dialog2Flow: Pre-training Soft-Contrastive Action-Driven Sentence Embeddings for Automatic Dialog Flow Extraction EMNLP2024

【速读】：该论文试图解决从无标注对话中高效提取结构化工作流程的问题。解决方案的关键在于引入了一种名为Dialog2Flow (D2F) 嵌入的新型嵌入方法，该方法通过将话语映射到一个潜在空间，根据其交流和信息功能（即它们所代表的动作）进行分组。D2F嵌入使得对话可以被建模为潜在空间中的连续轨迹，并通过聚类D2F嵌入，将潜在空间量化，从而将对话转换为一系列区域/动作ID，便于提取底层工作流程。此外，论文还构建了一个综合数据集，并通过引入一种新的软对比损失函数来指导表示学习过程，显著提升了性能。

链接: https://arxiv.org/abs/2410.18481
作者: Sergio Burdisso,Srikanth Madikeri,Petr Motlicek
关键词-EN: Efficiently deriving structured, Efficiently deriving, deriving structured workflows, unannotated dialogs remains, computational linguistics
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to EMNLP 2024 main conference

点击查看摘要

Abstract:Efficiently deriving structured workflows from unannotated dialogs remains an underexplored and formidable challenge in computational linguistics. Automating this process could significantly accelerate the manual design of workflows in new domains and enable the grounding of large language models in domain-specific flowcharts, enhancing transparency and controllability. In this paper, we introduce Dialog2Flow (D2F) embeddings, which differ from conventional sentence embeddings by mapping utterances to a latent space where they are grouped according to their communicative and informative functions (i.e., the actions they represent). D2F allows for modeling dialogs as continuous trajectories in a latent space with distinct action-related regions. By clustering D2F embeddings, the latent space is quantized, and dialogs can be converted into sequences of region/action IDs, facilitating the extraction of the underlying workflow. To pre-train D2F, we build a comprehensive dataset by unifying twenty task-oriented dialog datasets with normalized per-turn action annotations. We also introduce a novel soft contrastive loss that leverages the semantic information of these actions to guide the representation learning process, showing superior performance compared to standard supervised contrastive loss. Evaluation against various sentence embeddings, including dialog-specific ones, demonstrates that D2F yields superior qualitative and quantitative results across diverse domains.
摘要：从无标注对话中高效地推导出结构化工作流程仍然是计算语言学中一个未被充分探索且极具挑战性的问题。自动化这一过程可以显著加速新领域中工作流程的手动设计，并使大语言模型能够在特定领域的流程图中实现基础，从而增强透明度和可控性。本文中，我们引入了 Dialog2Flow (D2F) 嵌入，它与传统的句子嵌入不同，通过将话语映射到一个潜在空间，在该空间中根据其交流和信息功能（即它们所代表的动作）进行分组。D2F 允许将对话建模为潜在空间中具有不同动作相关区域的连续轨迹。通过聚类 D2F 嵌入，潜在空间被量化，对话可以转换为区域/动作 ID 序列，从而便于提取底层工作流程。为了预训练 D2F，我们通过统一二十个面向任务的对话数据集并进行每轮标准化动作标注，构建了一个综合数据集。我们还引入了一种新的软对比损失，利用这些动作的语义信息来指导表示学习过程，与标准监督对比损失相比，表现出更优越的性能。在与各种句子嵌入（包括特定于对话的嵌入）的评估中，D2F 在多个领域中均展现出优越的定性和定量结果。

[NLP-50] Iterative Self-Tuning LLM s for Enhanced Jailbreaking Capabilities

【速读】：该论文试图解决的问题是当前生成对抗性后缀（adversarial suffixes）的方法在计算成本高且攻击成功率（Attack Success Rates, ASR）低的问题，尤其是在面对高度安全对齐的模型如Llama2和Llama3时。

解决方案的关键是引入了一个名为ADV-LLM的迭代自调优过程，该过程能够生成具有增强越狱能力的对抗性LLM。ADV-LLM框架显著降低了生成对抗性后缀的计算成本，同时在多种开源LLM上实现了接近100%的ASR。此外，它还展示了强大的攻击可转移性，能够在未针对优化的闭源模型如GPT-3.5和GPT-4上实现高ASR。ADV-LLM不仅提升了越狱能力，还通过生成用于研究LLM安全的大型数据集，为未来的安全对齐研究提供了有价值的见解。

链接: https://arxiv.org/abs/2410.18469
作者: Chung-En Sun,Xiaodong Liu,Weiwei Yang,Tsui-Wei Weng,Hao Cheng,Aidan San,Michel Galley,Jianfeng Gao
关键词-EN: trigger unintended responses, harmful queries bypass, Large Language Models, Attack Success Rates, Language Models
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 18 pages

点击查看摘要

Abstract:Recent research has shown that Large Language Models (LLMs) are vulnerable to automated jailbreak attacks, where adversarial suffixes crafted by algorithms appended to harmful queries bypass safety alignment and trigger unintended responses. Current methods for generating these suffixes are computationally expensive and have low Attack Success Rates (ASR), especially against well-aligned models like Llama2 and Llama3. To overcome these limitations, we introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability. Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs. Moreover, it exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3. Beyond improving jailbreak ability, ADV-LLM provides valuable insights for future safety alignment research through its ability to generate large datasets for studying LLM safety. Our code is available at: this https URL
摘要：最近的研究表明，大语言模型 (Large Language Models, LLMs) 容易受到自动化越狱攻击，即通过算法生成的对抗性后缀附加在有害查询上，绕过安全对齐并触发意外响应。当前生成这些后缀的方法计算成本高且攻击成功率 (Attack Success Rate, ASR) 低，尤其是在针对像 Llama2 和 Llama3 这样高度对齐的模型时。为了克服这些限制，我们提出了 ADV-LLM，一种迭代自调优过程，用于生成具有增强越狱能力的对抗性 LLM。我们的框架显著降低了生成对抗性后缀的计算成本，同时在多种开源 LLM 上实现了近 100% 的 ASR。此外，它展示了强大的攻击迁移性，对闭源模型如 GPT-3.5 实现了 99% 的 ASR，对 GPT-4 实现了 49% 的 ASR，尽管仅在 Llama3 上进行了优化。除了提升越狱能力外，ADV-LLM 还通过其生成用于研究 LLM 安全的大型数据集的能力，为未来的安全对齐研究提供了宝贵的见解。我们的代码可在以下链接获取：this https URL

[NLP-51] Skywork-Reward: Bag of Tricks for Reward Modeling in LLM s

【速读】：该论文试图解决的问题是如何提升大型语言模型（LLMs）的奖励建模（reward modeling）效果。解决方案的关键在于采用数据中心化的方法，通过有效的数据选择和过滤策略来精选高质量的开源偏好数据集。具体来说，论文提出了Skywork-Reward数据集，该数据集仅包含80K的偏好对，远小于现有的数据集。基于这一精选数据集，开发了Skywork-Reward模型系列，包括Skywork-Reward-Gemma-27B和Skywork-Reward-Llama-3.1-8B，其中Skywork-Reward-Gemma-27B目前在RewardBench排行榜上位居榜首。这些技术和数据集显著提升了RewardBench上多个顶级模型的性能，展示了其在实际偏好学习应用中的实际影响。

链接: https://arxiv.org/abs/2410.18451
作者: Chris Yuhao Liu,Liang Zeng,Jiacai Liu,Rui Yan,Jujie He,Chaojie Wang,Shuicheng Yan,Yang Liu,Yahui Zhou
关键词-EN: enhance reward modeling, modeling for LLMs, focusing specifically, methods to enhance, enhance reward
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this report, we introduce a collection of methods to enhance reward modeling for LLMs, focusing specifically on data-centric techniques. We propose effective data selection and filtering strategies for curating high-quality open-source preference datasets, culminating in the Skywork-Reward data collection, which contains only 80K preference pairs – significantly smaller than existing datasets. Using this curated dataset, we developed the Skywork-Reward model series – Skywork-Reward-Gemma-27B and Skywork-Reward-Llama-3.1-8B – with the former currently holding the top position on the RewardBench leaderboard. Notably, our techniques and datasets have directly enhanced the performance of many top-ranked models on RewardBench, highlighting the practical impact of our contributions in real-world preference learning applications.
摘要：在本报告中，我们介绍了一系列用于增强大语言模型（LLM）奖励建模的方法，特别聚焦于以数据为中心的技术。我们提出了有效的数据选择和过滤策略，用于策划高质量的开源偏好数据集，最终形成了Skywork-Reward数据集，该数据集仅包含80K个偏好对——显著小于现有数据集。基于这一精选数据集，我们开发了Skywork-Reward模型系列——Skywork-Reward-Gemma-27B和Skywork-Reward-Llama-3.1-8B——其中前者目前在RewardBench排行榜上位居榜首。值得注意的是，我们的技术和数据集直接提升了RewardBench上许多顶尖模型的性能，突显了我们在实际偏好学习应用中的贡献的实际影响。

[NLP-52] oolFlow: Boosting LLM Tool-Calling Through Natural and Coherent Dialogue Synthesis

【速读】：该论文试图解决现有监督微调 (Supervised Fine-Tuning, SFT) 方法在增强大型语言模型 (Large Language Models, LLMs) 工具调用能力时，数据合成过程中存在的两个主要问题：

工具组合的相关性不足：当前的数据合成方法随机抽取工具，导致工具之间缺乏相关性，难以有效组合，从而降低了数据多样性。
对话轮次间的连贯性缺失：现有工作忽视了对话轮次之间的连贯性，导致合成数据与真实场景存在差距。

解决方案的关键：
论文提出了一种基于图的抽样策略 (Graph-based Sampling strategy) 和一种计划生成策略 (Planned-generation strategy) 来解决上述问题。具体来说：

基于图的抽样策略：通过图结构抽样更相关的工具组合，提高工具之间的相关性。
计划生成策略：创建引导对话合成的计划，确保对话轮次之间的连贯性。

通过将这两种策略集成，并使多个代理交互式地合成对话数据，论文构建了一个名为 ToolFlow 的工具调用数据合成流程。实验结果表明，使用 ToolFlow 生成的合成对话在自然性和连贯性方面有所提升，并且通过在 LLaMA-3.1-8B 模型上进行 SFT 实验，证明了该方法在工具调用性能上可与 GPT-4 相媲美，同时保持了强大的通用能力。

链接: https://arxiv.org/abs/2410.18447
作者: Zezhong Wang,Xingshan Zeng,Weiwen Liu,Liangyou Li,Yasheng Wang,Lifeng Shang,Xin Jiang,Qun Liu,Kam-Fai Wong
关键词-EN: Large Language Models, Large Language, Supervised fine-tuning, Language Models, common method
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Supervised fine-tuning (SFT) is a common method to enhance the tool calling capabilities of Large Language Models (LLMs), with the training data often being synthesized. The current data synthesis process generally involves sampling a set of tools, formulating a requirement based on these tools, and generating the call statements. However, tools sampled randomly lack relevance, making them difficult to combine and thus reducing the diversity of the data. Additionally, current work overlooks the coherence between turns of dialogues, leading to a gap between the synthesized data and real-world scenarios. To address these issues, we propose a Graph-based Sampling strategy to sample more relevant tool combinations, and a Planned-generation strategy to create plans that guide the synthesis of coherent dialogues. We integrate these two strategies and enable multiple agents to synthesize the dialogue data interactively, resulting in our tool-calling data synthesis pipeline ToolFlow. Data quality assessments demonstrate improvements in the naturalness and coherence of our synthesized dialogues. Finally, we apply SFT on LLaMA-3.1-8B using 8,000 synthetic dialogues generated with ToolFlow. Results show that the model achieves tool-calling performance comparable to or even surpassing GPT-4, while maintaining strong general capabilities.
摘要：监督微调 (Supervised Fine-Tuning, SFT) 是提升大语言模型 (Large Language Models, LLMs) 工具调用能力的常用方法，其训练数据通常是通过合成生成的。当前的数据合成过程一般包括抽取一组工具、基于这些工具制定需求并生成调用语句。然而，随机抽取的工具缺乏相关性，难以组合，从而降低了数据的多样性。此外，现有工作忽视了对话轮次之间的连贯性，导致合成数据与真实场景之间存在差距。为解决这些问题，我们提出了一种基于图的抽样策略 (Graph-based Sampling) 来抽取更具相关性的工具组合，以及一种计划生成策略 (Planned-generation) 来创建指导合成连贯对话的计划。我们将这两种策略整合，并使多个 AI 智能体 (AI Agent) 交互式地合成对话数据，从而形成了我们的工具调用数据合成流程 ToolFlow。数据质量评估显示，我们合成的对话在自然性和连贯性方面有所提升。最后，我们使用 ToolFlow 生成的 8,000 条合成对话对 LLaMA-3.1-8B 进行 SFT 训练。结果表明，该模型在工具调用性能上达到了与 GPT-4 相当甚至超越的水平，同时保持了强大的通用能力。

[NLP-53] Evaluating and Improving Automatic Speech Recognition Systems for Korean Meteorological Experts

【速读】：该论文试图解决将自动语音识别 (Automatic Speech Recognition, ASR) 技术集成到自然语言查询系统中，以提高韩国气象学家天气预报效率的问题。具体挑战包括韩国天气领域特有的词汇和语言复杂性。

解决方案的关键在于：

构建了一个由母语为韩语的说话者录制的口语查询评估数据集，用于评估ASR模型在特定领域的表现。
通过评估多种多语言ASR模型配置，识别了与领域特定术语相关的性能限制。
实施了一种基于文本到语音的数据增强方法，该方法在保持通用领域性能的同时，提高了对专业术语的识别能力。

这些贡献为未来在韩国天气预报领域中ASR技术的进一步发展奠定了基础。

链接: https://arxiv.org/abs/2410.18444
作者: ChaeHun Park,Hojun Cho,Jaegul Choo
关键词-EN: integrating Automatic Speech, Automatic Speech Recognition, explores integrating Automatic, Automatic Speech, paper explores integrating
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper explores integrating Automatic Speech Recognition (ASR) into natural language query systems to improve weather forecasting efficiency for Korean meteorologists. We address challenges in developing ASR systems for the Korean weather domain, specifically specialized vocabulary and Korean linguistic intricacies. To tackle these issues, we constructed an evaluation dataset of spoken queries recorded by native Korean speakers. Using this dataset, we assessed various configurations of a multilingual ASR model family, identifying performance limitations related to domain-specific terminology. We then implemented a simple text-to-speech-based data augmentation method, which improved the recognition of specialized terms while maintaining general-domain performance. Our contributions include creating a domain-specific dataset, comprehensive ASR model evaluations, and an effective augmentation technique. We believe our work provides a foundation for future advancements in ASR for the Korean weather forecasting domain.
摘要：本文探讨了将自动语音识别 (Automatic Speech Recognition, ASR) 集成到自然语言查询系统中，以提高韩国气象学家天气预报效率的方法。我们针对在韩国天气领域开发 ASR 系统所面临的挑战，特别是专业词汇和韩语语言复杂性进行了研究。为解决这些问题，我们构建了一个由母语为韩语的说话者录制的口语查询评估数据集。利用该数据集，我们评估了多种多语言 ASR 模型配置，识别出与领域特定术语相关的性能限制。随后，我们实施了一种基于文本到语音的数据增强方法，该方法在保持通用领域性能的同时，提高了对专业术语的识别能力。我们的贡献包括创建了一个领域特定的数据集、全面的 ASR 模型评估以及一种有效的增强技术。我们相信，我们的工作为未来在韩国天气预报领域 ASR 的进一步发展奠定了基础。

[NLP-54] Can Code-Switched Texts Activate a Knowledge Switch in LLM s? A Case Study on English-Korean Code-Switching

【速读】：该论文试图解决的问题是探究代码转换（Code-switching, CS）在多语言大型语言模型（Multilingual Large Language Models, LLMs）中激活语言特定知识的效果。解决方案的关键在于：

数据集构建：论文提出了一个合成的中英代码转换问答数据集（EnKoQA），用于评估模型在代码转换情境下的知识激活能力。
知识激活过程细分：将知识激活过程细分为知识识别（Knowledge Identification）和知识利用（Knowledge Leveraging），并通过实验分析不同多语言LLMs在这两个方面的表现。
实验结果：实验结果表明，相比于纯英文文本，代码转换能够更有效地激活LLMs中的语言特定知识，尤其是在语言特定领域。此外，模型在单语能力上的表现与代码转换效果之间存在相关性，特别是与韩语熟练度的相关性。

通过这些关键点，论文揭示了代码转换在多语言LLMs中激活语言特定知识的重要性和潜力。

链接: https://arxiv.org/abs/2410.18436
作者: Seoyeon Kim,Huiseo Kim,Chanjun Park,Jinyoung Yeo,Dongha Lee
关键词-EN: convey subtle cultural, multilingual speakers alternate, lost in translation, speakers alternate, convey subtle
类目: Computation and Language (cs.CL)
备注: 19 pages, 6 figures

点击查看摘要

Abstract:Code-switching (CS), a phenomenon where multilingual speakers alternate between languages in a discourse, can convey subtle cultural and linguistic nuances that can be otherwise lost in translation. Recent state-of-the-art multilingual large language models (LLMs) demonstrate excellent multilingual abilities in various aspects including understanding CS, but the power of CS in eliciting language-specific knowledge is yet to be discovered. Therefore, we investigate the effectiveness of code-switching on a wide range of multilingual LLMs in terms of knowledge activation, or the act of identifying and leveraging knowledge for reasoning. To facilitate the research, we first present EnKoQA, a synthetic English-Korean CS question-answering dataset. We provide a comprehensive analysis on a variety of multilingual LLMs by subdividing activation process into knowledge identification and knowledge leveraging. Our experiments demonstrate that compared to English text, CS can faithfully activate knowledge inside LLMs, especially on language-specific domains. In addition, the performance gap between CS and English is larger in models that show excellent monolingual abilities, suggesting that there exists a correlation with CS and Korean proficiency.
摘要：代码转换（Code-switching, CS）是指多语者在话语中交替使用不同语言的现象，能够传达微妙的文化和语言细微差别，而这些细微差别在翻译中可能会丢失。最近，最先进的多语言大语言模型（LLMs）在包括理解代码转换在内的多个方面展示了卓越的多语言能力，但代码转换在引发特定语言知识方面的潜力尚未被充分发掘。因此，我们研究了代码转换在广泛的多语言大语言模型中激活知识的效果，即识别和利用知识进行推理的行为。为了促进研究，我们首先提出了EnKoQA，一个合成性的英韩代码转换问答数据集。我们通过对知识激活过程进行细分，即知识识别和知识利用，对多种多语言大语言模型进行了全面分析。我们的实验表明，与英文文本相比，代码转换能够更忠实地激活大语言模型中的知识，尤其是在特定语言领域。此外，代码转换与英文之间的性能差距在那些单语能力出色的模型中更大，这表明代码转换与韩语熟练度之间存在相关性。

[NLP-55] Building Dialogue Understanding Models for Low-resource Language Indonesian from Scratch

【速读】：该论文试图解决低资源语言（如印尼语）在对话系统中的意图分类和槽填充任务中的性能提升问题。具体来说，论文探讨了如何利用资源丰富的语言（如英语）的数据来训练低资源语言的模型，并提出了一个有效的跨语言迁移框架。

解决方案的关键是提出了一个名为“双信心频率跨语言迁移框架 (Bi-Confidence-Frequency Cross-Lingual transfer framework, BiCF)”的框架。该框架由三个主要部分组成：

BiCF Mixing：用于混合不同语言的数据。
潜在空间优化 (Latent Space Refinement)：用于优化模型的潜在表示。
联合解码器 (Joint Decoder)：用于同时进行意图分类和槽填充。

通过这些组件，BiCF框架能够有效地利用英语数据来提升印尼语对话系统的性能，并在不同规模的手动标注印尼语数据上表现出可靠且成本效益高的效果。此外，论文还发布了大规模的精细标注对话数据集 (ID-WOZ) 和印尼语BERT模型 (ID-BERT)，以支持进一步的研究。

链接: https://arxiv.org/abs/2410.18430
作者: Donglin Di,Weinan Zhang,Yue Zhang,Fanglin Wang
关键词-EN: low-resource languages raises, resources of resource-rich, attention recently, raises much attention, Latent Space Refinement
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Making use of off-the-shelf resources of resource-rich languages to transfer knowledge for low-resource languages raises much attention recently. The requirements of enabling the model to reach the reliable performance lack well guided, such as the scale of required annotated data or the effective framework. To investigate the first question, we empirically investigate the cost-effectiveness of several methods to train the intent classification and slot-filling models for Indonesia (ID) from scratch by utilizing the English data. Confronting the second challenge, we propose a Bi-Confidence-Frequency Cross-Lingual transfer framework (BiCF), composed by BiCF Mixing'', Latent Space Refinement’’ and ``Joint Decoder’', respectively, to tackle the obstacle of lacking low-resource language dialogue data. Extensive experiments demonstrate our framework performs reliably and cost-efficiently on different scales of manually annotated Indonesian data. We release a large-scale fine-labeled dialogue dataset (ID-WOZ) and ID-BERT of Indonesian for further research.
摘要：利用资源丰富语言的现成资源来转移知识以支持低资源语言的研究，近期引起了广泛关注。然而，如何使模型达到可靠性能的要求缺乏明确的指导，例如所需标注数据的数量或有效的框架。为了探讨第一个问题，我们通过利用英语数据，从零开始训练意图分类和槽填充模型，实证研究了几种方法的成本效益。针对第二个挑战，我们提出了一种双信心频率跨语言转移框架（BiCF），该框架由“BiCF混合”、“潜在空间优化”和“联合解码器”组成，以解决缺乏低资源语言对话数据的障碍。广泛的实验表明，我们的框架在不同规模的手动标注印度尼西亚语数据上表现可靠且成本效益高。我们发布了一个大规模精细标注的对话数据集（ID-WOZ）和印度尼西亚语的ID-BERT，以供进一步研究。

[NLP-56] Large Language Models Reflect the Ideology of their Creators

【速读】：该论文试图解决的问题是大型语言模型（LLMs）在不同语言和文化背景下展现出的意识形态立场的多样性。具体来说，论文关注LLMs在描述历史人物和当前政治事件时，如何反映出其设计者或训练数据中的世界观和价值观。

解决方案的关键在于通过对比不同LLMs在英语和中文环境下对同一人物或事件的描述，分析其中体现的道德评价和规范性差异。研究通过提示一组多样化的流行LLMs描述大量历史和争议性人物，并比较这些描述在不同语言中的差异，揭示了LLMs在不同语言环境下的意识形态倾向。此外，研究还比较了西方和非西方LLMs在处理地缘政治冲突相关人物时的规范性分歧。

研究结果表明，LLMs的意识形态立场往往反映了其创造者的世界观，这引发了关于技术努力和监管措施在实现LLMs意识形态“中立性”方面的有效性和潜在风险的讨论。

链接: https://arxiv.org/abs/2410.18417
作者: Maarten Buyl,Alexander Rogiers,Sander Noels,Iris Dominguez-Catena,Edith Heiter,Raphael Romero,Iman Johary,Alexandru-Cristian Mara,Jefrey Lijffijt,Tijl De Bie
关键词-EN: generate natural language, question answering, trained on vast, vast amounts, amounts of data
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are trained on vast amounts of data to generate natural language, enabling them to perform tasks like text summarization and question answering. These models have become popular in artificial intelligence (AI) assistants like ChatGPT and already play an influential role in how humans access information. However, the behavior of LLMs varies depending on their design, training, and use. In this paper, we uncover notable diversity in the ideological stance exhibited across different LLMs and languages in which they are accessed. We do this by prompting a diverse panel of popular LLMs to describe a large number of prominent and controversial personalities from recent world history, both in English and in Chinese. By identifying and analyzing moral assessments reflected in the generated descriptions, we find consistent normative differences between how the same LLM responds in Chinese compared to English. Similarly, we identify normative disagreements between Western and non-Western LLMs about prominent actors in geopolitical conflicts. Furthermore, popularly hypothesized disparities in political goals among Western models are reflected in significant normative differences related to inclusion, social inequality, and political scandals. Our results show that the ideological stance of an LLM often reflects the worldview of its creators. This raises important concerns around technological and regulatory efforts with the stated aim of making LLMs ideologically `unbiased’, and it poses risks for political instrumentalization. Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG) Cite as: arXiv:2410.18417 [cs.CL] (or arXiv:2410.18417v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2410.18417 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
摘要：大语言模型 (LLMs) 通过训练大量数据来生成自然语言，使其能够执行文本摘要和问答等任务。这些模型在人工智能 (AI) 助手如 ChatGPT 中变得流行，并在人类获取信息的方式中扮演着重要角色。然而，LLMs 的行为因其设计、训练和使用方式的不同而有所差异。本文揭示了不同 LLMs 及它们所使用的语言之间在意识形态立场上显著的多样性。我们通过提示一组多样化的流行 LLMs 来描述近期世界历史中众多著名且具争议的人物，这些描述既包括英文也包括中文。通过识别和分析生成描述中反映的道德评价，我们发现同一 LLM 在用中文和英文回应时存在一致的规范性差异。同样，我们发现西方与非西方 LLMs 在涉及地缘政治冲突中的重要角色时存在规范性分歧。此外，关于西方模型政治目标的普遍假设在涉及包容性、社会不平等和政治丑闻方面反映出显著的规范性差异。我们的研究结果表明，LLM 的意识形态立场往往反映了其创造者的世界观。这引发了关于旨在使 LLMs 意识形态“无偏见”的技术和监管努力的重要关切，并带来了政治工具化的风险。

主题：计算与语言 (cs.CL); 机器学习 (cs.LG)
引用方式：arXiv:2410.18417 [cs.CL]
(或 arXiv:2410.18417v1 [cs.CL] 用于此版本)
https://doi.org/10.48550/arXiv.2410.18417
通过 DataCite 发布的 arXiv 编号 DOI (待注册)

[NLP-57] Decoding on Graphs: Faithful and Sound Reasoning on Knowledge Graphs through Generation of Well-Formed Chains

【速读】：该论文试图解决的问题是如何在知识图谱（Knowledge Graphs, KGs）和大型语言模型（Large Language Models, LLMs）之间实现深度协同，以提高基于知识图谱的问答系统（KGQA）的性能。现有研究主要依赖于子图检索或迭代提示方法，忽略了LLMs的逐步推理能力与KGs结构化知识之间的潜在协同作用。

解决方案的关键在于提出了一个名为DoG（Decoding on Graphs）的新框架。该框架的核心创新点包括：

定义了“良好形成的链”（well-formed chain）：这是一个由知识图谱中相互关联的事实三元组组成的序列，从问题实体开始，最终导向答案。这一概念被认为是实现KGQA中可靠和合理推理的基础。
图感知约束解码（graph-aware constrained decoding）：通过从知识图谱的拓扑结构中提取的约束来调节LLMs的解码过程，确保生成的推理链是良好形成的，同时充分利用LLMs的逐步推理能力。
训练无关的方法（training-free approach）：DoG框架不需要额外的训练，能够直接在知识图谱上提供基于事实的、合理的推理路径。

通过这些关键创新，DoG在各种KGQA任务中展示了优越且稳健的性能，并证明了其对不同开源LLMs的通用适用性。

链接: https://arxiv.org/abs/2410.18415
作者: Kun Li,Tianhua Zhang,Xixin Wu,Hongyin Luo,James Glass,Helen Meng
关键词-EN: reliable knowledge sources, Knowledge Graphs, reliable knowledge, knowledge sources, Knowledge
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Knowledge Graphs (KGs) can serve as reliable knowledge sources for question answering (QA) due to their structured representation of knowledge. Existing research on the utilization of KG for large language models (LLMs) prevalently relies on subgraph retriever or iterative prompting, overlooking the potential synergy of LLMs’ step-wise reasoning capabilities and KGs’ structural nature. In this paper, we present DoG (Decoding on Graphs), a novel framework that facilitates a deep synergy between LLMs and KGs. We first define a concept, well-formed chain, which consists of a sequence of interrelated fact triplets on the KGs, starting from question entities and leading to answers. We argue that this concept can serve as a principle for making faithful and sound reasoning for KGQA. To enable LLMs to generate well-formed chains, we propose graph-aware constrained decoding, in which a constraint derived from the topology of the KG regulates the decoding process of the LLMs. This constrained decoding method ensures the generation of well-formed chains while making full use of the step-wise reasoning capabilities of LLMs. Based on the above, DoG, a training-free approach, is able to provide faithful and sound reasoning trajectories grounded on the KGs. Experiments across various KGQA tasks with different background KGs demonstrate that DoG achieves superior and robust performance. DoG also shows general applicability with various open-source LLMs.
摘要：知识图谱 (Knowledge Graphs, KGs) 因其对知识的结构化表示，可以作为问答 (Question Answering, QA) 的可靠知识来源。现有关于利用知识图谱进行大语言模型 (Large Language Models, LLMs) 的研究主要依赖于子图检索或迭代提示，忽略了 LLMs 逐步推理能力与 KGs 结构特性之间的潜在协同作用。本文提出了 DoG (Decoding on Graphs)，这是一种新颖的框架，旨在促进 LLMs 与 KGs 之间的深度协同。我们首先定义了一个概念——“良好形成的链”，它由一系列在 KGs 上相互关联的事实三元组组成，从问题实体开始，导向答案。我们认为这一概念可以作为 KGQA 中进行忠实且合理推理的原则。为了使 LLMs 能够生成良好形成的链，我们提出了图感知约束解码，其中从 KG 拓扑结构中派生的约束规范了 LLMs 的解码过程。这种约束解码方法确保了生成良好形成的链，同时充分利用了 LLMs 的逐步推理能力。基于上述方法，DoG 作为一种无需训练的方法，能够提供基于 KGs 的忠实且合理的推理轨迹。在不同背景 KGs 的各种 KGQA 任务中的实验表明，DoG 实现了优越且稳健的性能。DoG 还展示了与各种开源 LLMs 的通用适用性。

[NLP-58] MoMQ: Mixture-of-Experts Enhances Multi-Dialect Query Generation across Relational and Non-Relational Databases

【速读】：该论文试图解决的问题是如何在多数据库方言（multi-dialect）环境下生成结构化查询语言（SQL），特别是在云服务提供商需要支持多种数据库方言（如Cosmos DB、Amazon Aurora、Lindorm）的场景下。

解决方案的关键是提出了一个基于混合专家模型（Mixture-of-Experts, MoE）的多方言查询生成框架，称为MoMQ。MoMQ的核心创新点包括：

方言专家组：为每种数据库方言设计了一个专门的专家组（dialect expert group），用于处理方言特定的知识，从而减少查询生成过程中的干扰。
多级路由策略：采用多级路由策略来管理方言特定的知识，确保在生成查询时能够准确地调用相应的方言专家。
共享专家组：引入了一个共享专家组（shared expert group），用于解决数据分布不均衡的问题，通过共享高资源方言的知识来提升低资源方言的性能。
高质量的多方言查询生成基准：开发了一个涵盖关系型和非关系型数据库（如MySQL、PostgreSQL、Cypher for Neo4j、nGQL for NebulaGraph）的高质量基准，用于评估和验证MoMQ的性能。

通过这些创新，MoMQ在资源不均衡的场景下表现出了有效性和鲁棒性。

链接: https://arxiv.org/abs/2410.18406
作者: Zhisheng Lin,Yifu Liu,Zhiling Luo,Jinyang Gao,Yu Li
关键词-EN: large language models, translating natural language, structured query language, language models, natural language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The improvement in translating natural language to structured query language (SQL) can be attributed to the advancements in large language models (LLMs). Open-source LLMs, tailored for specific database dialects such as MySQL, have shown great performance. However, cloud service providers are looking for a unified database manager service (e.g., Cosmos DB from Azure, Amazon Aurora from AWS, Lindorm from AlibabaCloud) that can support multiple dialects. This requirement has led to the concept of multi-dialect query generation, which presents challenges to LLMs. These challenges include syntactic differences among dialects and imbalanced data distribution across multiple dialects. To tackle these challenges, we propose MoMQ, a novel Mixture-of-Experts-based multi-dialect query generation framework across both relational and non-relational databases. MoMQ employs a dialect expert group for each dialect and a multi-level routing strategy to handle dialect-specific knowledge, reducing interference during query generation. Additionally, a shared expert group is introduced to address data imbalance, facilitating the transfer of common knowledge from high-resource dialects to low-resource ones. Furthermore, we have developed a high-quality multi-dialect query generation benchmark that covers relational and non-relational databases such as MySQL, PostgreSQL, Cypher for Neo4j, and nGQL for NebulaGraph. Extensive experiments have shown that MoMQ performs effectively and robustly even in resource-imbalanced scenarios.
摘要：自然语言到结构化查询语言（SQL）的翻译改进得益于大语言模型（LLM）的进步。针对特定数据库方言（如 MySQL）的开源 LLM 已展现出卓越的性能。然而，云服务提供商正在寻求一种统一的数据库管理服务（例如 Azure 的 Cosmos DB、AWS 的 Amazon Aurora、阿里云的 Lindorm），以支持多种方言。这一需求催生了多方言查询生成的概念，这对 LLM 提出了挑战。这些挑战包括方言间的句法差异以及多方言数据分布的不均衡。为应对这些挑战，我们提出了 MoMQ，一种基于专家混合（Mixture-of-Experts）的多方言查询生成框架，涵盖关系型和非关系型数据库。MoMQ 为每种方言配备了一个方言专家组，并采用多级路由策略来处理方言特有的知识，从而减少查询生成过程中的干扰。此外，引入了一个共享专家组来解决数据不均衡问题，促进高资源方言的通用知识向低资源方言的转移。我们还开发了一个高质量的多方言查询生成基准，涵盖了 MySQL、PostgreSQL、Neo4j 的 Cypher 以及 NebulaGraph 的 nGQL 等关系型和非关系型数据库。广泛的实验表明，MoMQ 在资源不均衡的情况下依然表现出色且稳健。

[NLP-59] SPEED: A Multilingual Event Extraction Framework for Epidemic Prediction and Preparedness EMNLP2024

【速读】：该论文试图解决的问题是如何从多语言社交媒体中提取与流行病相关的事件信息，以提供早期预警。当前的研究主要集中在英文帖子，而流行病可能在世界各地发生，早期讨论往往使用当地非英语语言。

解决方案的关键在于引入了首个多语言事件提取框架SPEED++，该框架能够处理多种疾病和语言。具体步骤包括：

扩展了先前的流行病本体，增加了20个论元角色。
构建了多语言事件提取数据集SPEED++，包含5.1K条四种语言的推文，涉及四种疾病。
开发了零样本跨语言跨疾病模型，利用多语言预训练，仅使用英文COVID数据进行训练，展示了其在65种不同语言和疾病中提取流行病相关事件的有效性。

实验结果表明，该框架能够在全球讨论开始前3周从中文微博帖子中提取COVID-19的早期预警信息，且无需中文训练数据。此外，该框架还能聚合社区关于症状和治疗措施的讨论，有助于检测错误信息和监控公众关注度。

链接: https://arxiv.org/abs/2410.18393
作者: Tanmay Parekh,Jeffrey Kwan,Jiarui Yu,Sparsh Johri,Hyosang Ahn,Sreya Muppalla,Kai-Wei Chang,Wei Wang,Nanyun Peng
关键词-EN: latest societal trends, Social media, societal trends, place where communities, communities discuss
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注: Accepted at EMNLP 2024

点击查看摘要

Abstract:Social media is often the first place where communities discuss the latest societal trends. Prior works have utilized this platform to extract epidemic-related information (e.g. infections, preventive measures) to provide early warnings for epidemic prediction. However, these works only focused on English posts, while epidemics can occur anywhere in the world, and early discussions are often in the local, non-English languages. In this work, we introduce the first multilingual Event Extraction (EE) framework SPEED++ for extracting epidemic event information for a wide range of diseases and languages. To this end, we extend a previous epidemic ontology with 20 argument roles; and curate our multilingual EE dataset SPEED++ comprising 5.1K tweets in four languages for four diseases. Annotating data in every language is infeasible; thus we develop zero-shot cross-lingual cross-disease models (i.e., training only on English COVID data) utilizing multilingual pre-training and show their efficacy in extracting epidemic-related events for 65 diverse languages across different diseases. Experiments demonstrate that our framework can provide epidemic warnings for COVID-19 in its earliest stages in Dec 2019 (3 weeks before global discussions) from Chinese Weibo posts without any training in Chinese. Furthermore, we exploit our framework’s argument extraction capabilities to aggregate community epidemic discussions like symptoms and cure measures, aiding misinformation detection and public attention monitoring. Overall, we lay a strong foundation for multilingual epidemic preparedness.
摘要：社交媒体往往是社区讨论最新社会趋势的首选平台。先前的研究利用这一平台提取与流行病相关的信息（如感染情况、预防措施），以提供流行病预测的早期预警。然而，这些研究仅关注英文帖子，而流行病可能在全球任何地方发生，早期的讨论往往使用当地非英语语言。在本研究中，我们引入了首个多语言事件抽取（Event Extraction, EE）框架 SPEED++，用于提取多种疾病和语言的流行病事件信息。为此，我们扩展了先前的流行病本体，增加了20个论元角色；并构建了包含5.1万条推文的多语言EE数据集 SPEED++，涵盖四种语言和四种疾病。由于在每种语言上标注数据不可行，我们开发了零样本跨语言跨疾病模型（即仅使用英文COVID数据进行训练），利用多语言预训练技术，展示了其在不同疾病中提取65种多样语言流行病相关事件的有效性。实验表明，我们的框架能够在2019年12月（全球讨论前3周）从中文微博帖子中提供COVID-19的早期预警，而无需任何中文训练数据。此外，我们利用框架的论元抽取能力，汇总社区关于症状和治疗措施的流行病讨论，有助于检测错误信息和监控公众关注度。总体而言，我们为多语言流行病准备奠定了坚实基础。

[NLP-60] Monolingual and Multilingual Misinformation Detection for Low-Resource Languages: A Comprehensive Survey

【速读】：该论文试图解决低资源语言环境下的虚假信息检测问题。解决方案的关键在于：

数据资源 (Data Resources)：强调在低资源语言中收集和构建高质量数据集的重要性。
模型开发 (Model Development)：探讨适用于低资源语言的模型开发策略，包括语言无关模型 (Language-Agnostic Models) 和多模态技术 (Multi-Modal Techniques)。
文化和语言背景 (Cultural and Linguistic Context)：认识到不同文化和语言背景下虚假信息的传播方式和检测方法的差异。
实际应用 (Real-World Applications)：关注模型在实际应用中的有效性和适应性。
研究努力 (Research Efforts)：提倡跨学科合作和增强对社会责任AI研究的支持。

论文强调了构建稳健、包容的系统以应对不同语言和文化背景下虚假信息的重要性。

链接: https://arxiv.org/abs/2410.18390
作者: Xinyu Wang,Wenbo Zhang,Sarah Rajtmajer
关键词-EN: global digital landscape, today global digital, transcends linguistic boundaries, digital landscape, today global
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In today’s global digital landscape, misinformation transcends linguistic boundaries, posing a significant challenge for moderation systems. While significant advances have been made in misinformation detection, the focus remains largely on monolingual high-resource contexts, with low-resource languages often overlooked. This survey aims to bridge that gap by providing a comprehensive overview of the current research on low-resource language misinformation detection in both monolingual and multilingual settings. We review the existing datasets, methodologies, and tools used in these domains, identifying key challenges related to: data resources, model development, cultural and linguistic context, real-world applications, and research efforts. We also examine emerging approaches, such as language-agnostic models and multi-modal techniques, while emphasizing the need for improved data collection practices, interdisciplinary collaboration, and stronger incentives for socially responsible AI research. Our findings underscore the need for robust, inclusive systems capable of addressing misinformation across diverse linguistic and cultural contexts.
摘要：在当今全球数字环境中，错误信息跨越语言界限，对内容审核系统构成重大挑战。尽管在错误信息检测方面取得了显著进展，但研究重点主要集中在单语种高资源环境下，低资源语言往往被忽视。本综述旨在填补这一空白，提供当前关于低资源语言错误信息检测研究的全面概述，涵盖单语种和多语种环境。我们回顾了现有数据集、方法和工具，识别出与以下方面相关的关键挑战：数据资源、模型开发、文化与语言背景、实际应用以及研究努力。我们还探讨了新兴方法，如语言无关模型和多模态技术，同时强调改进数据收集实践、跨学科合作以及增强对社会责任型AI研究激励的必要性。我们的研究结果强调，需要建立强大且包容的系统，以应对不同语言和文化背景下的错误信息问题。

[NLP-61] WAFFLE: Multi-Modal Model for Automated Front-End Development

【速读】：该论文试图解决在将用户界面设计（UI designs）转换为功能性网页的HTML代码过程中遇到的两个主要挑战：(1) 有效表示HTML的层次结构以供大型语言模型（LLMs）理解，(2) 弥合UI设计的视觉特性与HTML代码的文本格式之间的差距。

解决方案的关键是引入了一种名为Waffle的新微调策略，该策略包括两个主要组成部分：

结构感知注意力机制（structure-aware attention mechanism），用于增强LLMs对HTML结构的理解。
对比微调方法（contrastive fine-tuning approach），用于对齐LLMs对UI图像和HTML代码的理解。

通过这种策略，微调后的模型在HTML匹配度、CW-SSIM、CLIP和LLEM等指标上均表现优异，显著超越了当前的微调方法。

链接: https://arxiv.org/abs/2410.18362
作者: Shanchao Liang,Nan Jiang,Shangshu Qian,Lin Tan
关键词-EN: Web development involves, development involves turning, experienced developers due, HTML hierarchical structures, Web development
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Web development involves turning UI designs into functional webpages, which can be difficult for both beginners and experienced developers due to the complexity of HTML’s hierarchical structures and styles. While Large Language Models (LLMs) have shown promise in generating source code, two major challenges persist in UI-to-HTML code generation: (1) effectively representing HTML’s hierarchical structure for LLMs, and (2) bridging the gap between the visual nature of UI designs and the text-based format of HTML code. To tackle these challenges, we introduce Waffle, a new fine-tuning strategy that uses a structure-aware attention mechanism to improve LLMs’ understanding of HTML’s structure and a contrastive fine-tuning approach to align LLMs’ understanding of UI images and HTML code. Models fine-tuned with Waffle show up to 9.00 pp (percentage point) higher HTML match, 0.0982 higher CW-SSIM, 32.99 higher CLIP, and 27.12 pp higher LLEM on our new benchmark WebSight-Test and an existing benchmark Design2Code, outperforming current fine-tuning methods.
摘要：网页开发涉及将用户界面设计转化为功能性网页，这对初学者和经验丰富的开发者来说都具有挑战性，主要原因是 HTML 的层次结构和样式的复杂性。尽管大语言模型 (LLM) 在生成源代码方面显示出潜力，但在 UI 到 HTML 代码生成中仍存在两大挑战：(1) 有效表示 HTML 的层次结构以供 LLM 使用，以及 (2) 弥合 UI 设计的视觉特性与 HTML 代码的文本格式之间的差距。为了应对这些挑战，我们提出了 Waffle，这是一种新的微调策略，使用结构感知注意力机制来增强 LLM 对 HTML 结构的理解，并通过对比微调方法来对齐 LLM 对 UI 图像和 HTML 代码的理解。经过 Waffle 微调的模型在我们的新基准 WebSight-Test 和现有的基准 Design2Code 上表现出色，HTML 匹配度提高了 9.00 个百分点 (pp)，CW-SSIM 提高了 0.0982，CLIP 提高了 32.99，LLEM 提高了 27.12 个百分点，优于当前的微调方法。

[NLP-62] Improving Model Factuality with Fine-grained Critique-based Evaluator

【速读】：该论文试图解决的问题是如何评估和提升语言模型（LMs）生成内容的事实性（factuality）。具体来说，论文旨在检测语言模型生成内容中的事实错误，并指导开发更符合事实的模型。

解决方案的关键是训练一个名为FenCE的事实性评估器（factuality evaluator），该评估器能够为语言模型生成器提供基于声明级别的事实性反馈。FenCE通过数据增强技术，结合公共判断数据集进行训练，具备以下两个主要功能：

生成带有评分的文本批评。
基于多样化的源文档做出声明级别的事实性判断。

论文进一步提出了一个框架，利用FenCE来改进语言模型生成器的事实性。具体步骤包括：生成一组候选响应，使用FenCE对每个响应进行修订和评分，确保修订后的响应不引入较少为人所知的事实，并通过偏好高评分修订响应的方式训练生成器。实验结果表明，数据增强方法提高了评估器的准确性，并在事实性评分上显著优于现有最先进的微调方法。

链接: https://arxiv.org/abs/2410.18359
作者: Yiqing Xie,Wenxuan Zhou,Pradyot Prakash,Di Jin,Yuning Mao,Quintin Fettes,Arya Talebzadeh,Sinong Wang,Han Fang,Carolyn Rose,Daniel Fried,Hejia Zhang
关键词-EN: detect factual errors, factual errors produced, factual models, Factuality evaluation aims, language models
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Factuality evaluation aims to detect factual errors produced by language models (LMs) and hence guide the development of more factual models. Towards this goal, we train a factuality evaluator, FenCE, that provides LM generators with claim-level factuality feedback. We conduct data augmentation on a combination of public judgment datasets to train FenCE to (1) generate textual critiques along with scores and (2) make claim-level judgment based on diverse source documents obtained by various tools. We then present a framework that leverages FenCE to improve the factuality of LM generators by constructing training data. Specifically, we generate a set of candidate responses, leverage FenCE to revise and score each response without introducing lesser-known facts, and train the generator by preferring highly scored revised responses. Experiments show that our data augmentation methods improve the evaluator’s accuracy by 2.9% on LLM-AggreFact. With FenCE, we improve Llama3-8B-chat’s factuality rate by 14.45% on FActScore, outperforming state-of-the-art factuality finetuning methods by 6.96%.
摘要：事实性评估旨在检测语言模型 (LMs) 产生的事实错误，从而指导开发更符合事实的模型。为此，我们训练了一个事实性评估器，FenCE，它为语言模型生成器提供声明级别的事实性反馈。我们通过对公共判断数据集的组合进行数据增强，训练 FenCE 以 (1) 生成带有评分的文本批评，以及 (2) 基于通过各种工具获取的多样化源文档进行声明级别的判断。随后，我们提出了一种框架，利用 FenCE 通过构建训练数据来提高语言模型生成器的事实性。具体而言，我们生成一组候选响应，利用 FenCE 对每个响应进行修订和评分，而不引入鲜为人知的事实，并通过偏好高评分的修订响应来训练生成器。实验表明，我们的数据增强方法将评估器在大语言模型-AggreFact 上的准确性提高了 2.9%。通过使用 FenCE，我们将 Llama3-8B-chat 在 FActScore 上的事实性率提高了 14.45%，超过了当前最先进的事实性微调方法 6.96%。

[NLP-63] AdaEDL: Early Draft Stopping for Speculative Decoding of Large Language Models via an Entropy-based Lower Bound on Token Acceptance Probability NEURIPS2024

【速读】：该论文试图解决的问题是如何在不牺牲大型语言模型（LLM）准确性的前提下，提高其推理时间。具体来说，论文关注的是在推测解码（speculative decoding）过程中，如何动态调整草稿长度（draft length）以优化性能，尤其是在草稿生成成本高且接受草稿令牌数量变化较大的情况下。

解决方案的关键是提出了一种自适应基于熵的草稿长度（Adaptive Entropy-based Draft Length, AdaEDL）方法。AdaEDL通过近似计算当前观察到的草稿对数熵的下限来估计草稿令牌的预期接受概率，从而实现草稿生成过程的早期停止。这种方法无需训练，也不依赖于数据集特定的草稿停止预测器，因此可以无缝集成到现有的LLM系统中。实验结果表明，AdaEDL在各种设置和数据集上均优于静态草稿长度推测解码方法，性能提升范围为10%-57%，并且在高采样温度场景下表现出更强的鲁棒性。

链接: https://arxiv.org/abs/2410.18351
作者: Sudhanshu Agrawal,Wonseok Jeon,Mingu Lee
关键词-EN: Large Language Models, modern Large Language, Large Language, Language Models, Speculative decoding
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Workshop on Efficient Natural Language and Signal Processing at NeurIPS 2024

点击查看摘要

Abstract:Speculative decoding is a powerful technique that attempts to circumvent the autoregressive constraint of modern Large Language Models (LLMs). The aim of speculative decoding techniques is to improve the average inference time of a large, target model without sacrificing its accuracy, by using a more efficient draft model to propose draft tokens which are then verified in parallel. The number of draft tokens produced in each drafting round is referred to as the draft length and is often a static hyperparameter chosen based on the acceptance rate statistics of the draft tokens. However, setting a static draft length can negatively impact performance, especially in scenarios where drafting is expensive and there is a high variance in the number of tokens accepted. Adaptive Entropy-based Draft Length (AdaEDL) is a simple, training and parameter-free criteria which allows for early stopping of the token drafting process by approximating a lower bound on the expected acceptance probability of the drafted token based on the currently observed entropy of the drafted logits. We show that AdaEDL consistently outperforms static draft-length speculative decoding by 10%-57% as well as other training-free draft-stopping techniques by upto 10% in a variety of settings and datasets. At the same time, we show that AdaEDL is more robust than these techniques and preserves performance in high-sampling-temperature scenarios. Since it is training-free, in contrast to techniques that rely on the training of dataset-specific draft-stopping predictors, AdaEDL can seamlessly be integrated into a variety of pre-existing LLM systems.
摘要：推测性解码是一种强大的技术，旨在绕过现代大语言模型 (LLM) 的自回归约束。推测性解码技术的目标是提高大型目标模型的平均推理时间，同时不牺牲其准确性，通过使用更高效的草稿模型来提出草稿 Token，然后并行验证这些 Token。每次草稿轮次中生成的草稿 Token 数量称为草稿长度，通常是一个根据草稿 Token 接受率统计数据选择的静态超参数。然而，设置静态草稿长度可能会对性能产生负面影响，特别是在草稿生成成本高且接受 Token 数量方差较大的情况下。自适应基于熵的草稿长度 (AdaEDL) 是一种简单、无需训练且无参数的标准，通过基于当前观察到的草稿对数熵近似预期接受概率的下限，从而允许提前停止 Token 草稿生成过程。我们展示了 AdaEDL 在各种设置和数据集中，持续优于静态草稿长度推测性解码 10%-57%，并且比其他无需训练的草稿停止技术高出最多 10%。同时，我们表明 AdaEDL 比这些技术更具鲁棒性，在高采样温度场景下仍能保持性能。由于无需训练，与依赖于特定数据集草稿停止预测器训练的技术相比，AdaEDL 可以无缝集成到各种现有的大语言模型系统中。

[NLP-64] Aggregated Knowledge Model: Enhancing Domain-Specific QA with Fine-Tuned and Retrieval-Augmented Generation Models

【速读】：该论文试图解决在Lawrence Berkeley国家实验室（LBL）的科学信息技术（ScienceIT）领域中，如何提升封闭领域问答系统（QA）的性能问题。解决方案的关键在于：

数据处理与模型选择：通过将ScienceIT文档转化为结构化的上下文-问题-答案三元组，并利用最新的生成式AI模型（如AWS Bedrock、GCP PaLM2、Meta LLaMA2、OpenAI GPT-4、Google Gemini-Pro）进行数据驱动的分析。
模型比较与集成：详细比较了两种微调的大型语言模型和五种检索增强生成（RAG）模型，并引入聚合知识模型（Aggregated Knowledge Model, AKM），该模型通过K-means聚类从七个模型中综合选择最具代表性的答案。
性能评估：通过多指标评估这些模型在LBL ScienceIT环境中的有效性和适用性，结果显示AKM显著提升了性能，证明了微调和检索增强策略的集成优势。

这些关键点共同构成了论文解决方案的核心，旨在为特定领域开发专门的问答系统提供实用见解。

链接: https://arxiv.org/abs/2410.18344
作者: Fengchen Liu,Jordan Jung,Wei Feinstein,Jeff DAmbrogia,Gary Jung
关键词-EN: Science Information Technology, Berkeley National Laboratory, Lawrence Berkeley National, closed-domain Question Answering, enhancing closed-domain Question
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper introduces a novel approach to enhancing closed-domain Question Answering (QA) systems, focusing on the specific needs of the Lawrence Berkeley National Laboratory (LBL) Science Information Technology (ScienceIT) domain. Utilizing a rich dataset derived from the ScienceIT documentation, our study embarks on a detailed comparison of two fine-tuned large language models and five retrieval-augmented generation (RAG) models. Through data processing techniques, we transform the documentation into structured context-question-answer triples, leveraging the latest Large Language Models (AWS Bedrock, GCP PaLM2, Meta LLaMA2, OpenAI GPT-4, Google Gemini-Pro) for data-driven insights. Additionally, we introduce the Aggregated Knowledge Model (AKM), which synthesizes responses from the seven models mentioned above using K-means clustering to select the most representative answers. The evaluation of these models across multiple metrics offers a comprehensive look into their effectiveness and suitability for the LBL ScienceIT environment. The results demonstrate the potential benefits of integrating fine-tuning and retrieval-augmented strategies, highlighting significant performance improvements achieved with the AKM. The insights gained from this study can be applied to develop specialized QA systems tailored to specific domains.
摘要：本文提出了一种增强封闭领域问答系统（Question Answering, QA）的新方法，特别针对劳伦斯伯克利国家实验室（Lawrence Berkeley National Laboratory, LBL）的科学信息技术（Science Information Technology, ScienceIT）领域的特定需求。利用从ScienceIT文档中提取的丰富数据集，本研究详细比较了两种经过微调的大语言模型和五种检索增强生成（Retrieval-Augmented Generation, RAG）模型。通过数据处理技术，我们将文档转化为结构化的上下文-问题-答案三元组，并利用最新的大语言模型（AWS Bedrock、GCP PaLM2、Meta LLaMA2、OpenAI GPT-4、Google Gemini-Pro）进行数据驱动的洞察。此外，我们引入了聚合知识模型（Aggregated Knowledge Model, AKM），该模型通过K-means聚类从上述七种模型中综合选择最具代表性的答案。通过对这些模型在多个指标上的评估，我们全面审视了它们在LBL ScienceIT环境中的有效性和适用性。结果表明，结合微调和检索增强策略的集成具有显著的性能提升潜力，特别是通过AKM实现的显著性能改进。本研究获得的洞察可用于开发针对特定领域的专业化问答系统。

[NLP-65] Assessing the Creativity of LLM s in Proposing Novel Solutions to Mathematical Problems

【速读】：该论文试图解决的问题是如何评估和提升大型语言模型（LLMs）在数学推理中的创造性能力。传统研究主要关注AI生成数学问题解决方案的正确性，而本文则强调AI不仅应能提供正确答案，还应具备或辅助人类开发新颖数学解决方案的能力。

解决方案的关键在于引入了一个名为CreativeMath的新框架和基准测试，该框架涵盖从中学课程到奥林匹克竞赛级别的数学问题，旨在评估LLMs在已知解决方案基础上提出创新解决方案的能力。实验结果表明，尽管LLMs在标准数学任务上表现良好，但在创造性问题解决方面的能力差异显著，其中Gemini-1.5-Pro模型在生成新颖解决方案方面表现尤为突出。这一研究为评估AI创造力开辟了新领域，揭示了LLMs在促进数学创新方面的优势与局限，并为未来AI辅助数学发现的发展奠定了基础。

链接: https://arxiv.org/abs/2410.18336
作者: Junyi Ye,Jingyi Gu,Xinyun Zhao,Wenpeng Yin,Guiling Wang
关键词-EN: complex and multifaceted, Large Language Models, mathematical, solutions, mathematical capabilities
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The mathematical capabilities of AI systems are complex and multifaceted. Most existing research has predominantly focused on the correctness of AI-generated solutions to mathematical problems. In this work, we argue that beyond producing correct answers, AI systems should also be capable of, or assist humans in, developing novel solutions to mathematical challenges. This study explores the creative potential of Large Language Models (LLMs) in mathematical reasoning, an aspect that has received limited attention in prior research. We introduce a novel framework and benchmark, CreativeMath, which encompasses problems ranging from middle school curricula to Olympic-level competitions, designed to assess LLMs’ ability to propose innovative solutions after some known solutions have been provided. Our experiments demonstrate that, while LLMs perform well on standard mathematical tasks, their capacity for creative problem-solving varies considerably. Notably, the Gemini-1.5-Pro model outperformed other LLMs in generating novel solutions. This research opens a new frontier in evaluating AI creativity, shedding light on both the strengths and limitations of LLMs in fostering mathematical innovation, and setting the stage for future developments in AI-assisted mathematical discovery.
摘要：AI系统的数学能力复杂且多面。现有的大多数研究主要集中在AI生成的数学问题解决方案的正确性上。在本研究中，我们认为，除了产生正确答案外，AI系统还应具备或协助人类开发数学挑战的新颖解决方案的能力。本研究探讨了大语言模型（LLMs）在数学推理中的创造潜力，这一方面在先前的研究中受到的关注有限。我们引入了一个新颖的框架和基准，名为CreativeMath，该基准涵盖了从中学课程到奥林匹克级别竞赛的问题，旨在评估在提供一些已知解决方案后，LLMs提出创新解决方案的能力。我们的实验表明，尽管LLMs在标准数学任务上表现良好，但它们在创造性问题解决方面的能力差异显著。值得注意的是，Gemini-1.5-Pro模型在生成新颖解决方案方面优于其他LLMs。这项研究为评估AI创造力开辟了新的领域，揭示了LLMs在促进数学创新方面的优势和局限，并为未来AI辅助数学发现的发展奠定了基础。

[NLP-66] Measuring individual semantic networks: A simulation study

【速读】：该论文试图解决的问题是如何准确捕捉个体在语义网络中的差异，以推进对语义记忆机制的理解。解决方案的关键在于评估和改进从行为范式中构建个体语义网络的方法。具体来说，论文通过恢复模拟研究，比较了两种不同的行为范式（自由联想和相关性判断任务）在估计个体语义网络时的心理测量特性。研究发现，虽然成功推断语义网络是可行的，但绝对网络特征的估计存在严重偏差，导致不同范式和设计配置之间的比较往往不具意义。然而，在同一范式和设计配置内的比较可以准确且具有普适性，特别是在设计中使用中等数量的提示词、中等数量的响应以及包含多样化词汇的提示集时。这些发现有助于评估以往关于语义网络结构的研究，并为设计更可靠地揭示个体差异的新研究提供了指导。

链接: https://arxiv.org/abs/2410.18326
作者: Samuel Aeschbach,Rui Mata,Dirk U. Wulff
关键词-EN: Accurately capturing individual, Accurately capturing, semantic networks, individual semantic networks, semantic
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Accurately capturing individual differences in semantic networks is fundamental to advancing our mechanistic understanding of semantic memory. Past empirical attempts to construct individual-level semantic networks from behavioral paradigms may be limited by data constraints. To assess these limitations and propose improved designs for the measurement of individual semantic networks, we conducted a recovery simulation investigating the psychometric properties underlying estimates of individual semantic networks obtained from two different behavioral paradigms: free associations and relatedness judgment tasks. Our results show that successful inference of semantic networks is achievable, but they also highlight critical challenges. Estimates of absolute network characteristics are severely biased, such that comparisons between behavioral paradigms and different design configurations are often not meaningful. However, comparisons within a given paradigm and design configuration can be accurate and generalizable when based on designs with moderate numbers of cues, moderate numbers of responses, and cue sets including diverse words. Ultimately, our results provide insights that help evaluate past findings on the structure of semantic networks and design new studies capable of more reliably revealing individual differences in semantic networks.
摘要：准确捕捉语义网络中的个体差异对于深化我们对语义记忆机制的理解至关重要。以往通过行为范式构建个体层面语义网络的实证尝试可能受到数据限制的影响。为了评估这些限制并提出改进个体语义网络测量设计的方案，我们进行了一项恢复模拟研究，探讨了从两种不同行为范式（自由联想和相关性判断任务）中获得的个体语义网络估计的心理测量特性。研究结果表明，成功推断语义网络是可行的，但也凸显了关键挑战。绝对网络特征的估计存在严重偏差，导致不同行为范式和设计配置之间的比较往往缺乏意义。然而，在给定范式和设计配置内进行的比较，当基于具有适度提示数量、适度响应数量和包含多样化词汇的提示集的设计时，可以实现准确且具有普适性。最终，我们的研究结果为评估过去关于语义网络结构的研究发现提供了见解，并设计了能够更可靠地揭示个体语义网络差异的新研究。

[NLP-67] CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation

【速读】：该论文试图解决大语言模型（Large Language Models, LLMs）在推理过程中高计算成本和内存需求的问题。解决方案的关键在于提出了一种无需多层感知机（MLP）的自适应稀疏激活推理方法，称为CoreInfer。

CoreInfer的核心创新在于引入了句子级核心神经元的概念，即对于给定句子最关键的神经元子集。通过探索核心神经元与句子语义之间的关联，论文发现核心神经元在语义上表现出稳定性和相似性，这一发现被先前研究忽视。基于这一发现，论文设计了两种基于语义的方法来预测核心神经元，以适应不同的输入场景。

在CoreInfer中，核心神经元在预填充阶段确定并在编码阶段固定，从而实现了零成本的稀疏推理。实验结果表明，CoreInfer在多种模型和任务上均表现出色，特别是在NVIDIA TITAN XP GPU上，相较于Huggingface实现和PowerInfer，分别实现了10.33倍和2.72倍的加速。

链接: https://arxiv.org/abs/2410.18311
作者: Qinsi Wang,Saeed Vahidian,Hancheng Ye,Jianyang Gu,Jianyi Zhang,Yiran Chen
关键词-EN: Large language models, Large language, core neurons, exciting AI applications, billions of parameters
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Project page: this https URL

点击查看摘要

Abstract:Large language models (LLMs) with billions of parameters have sparked a new wave of exciting AI applications. However, their high computational costs and memory demands during inference pose significant challenges. Adaptive sparse activation inference, which activates only a small number of neurons for each token, offers a novel way to accelerate model inference without degrading performance, showing great potential for resource-constrained hardware devices. Nevertheless, existing methods predict activated neurons based on individual tokens with additional MLP, which involve frequent changes in activation maps and resource calls, limiting the acceleration benefits of sparse activation. In this paper, we introduce CoreInfer, an MLP-free adaptive sparse activation inference method based on sentence-level prediction. Specifically, we propose the concept of sentence-wise core neurons, which refers to the subset of neurons most critical for a given sentence, and empirically demonstrate its effectiveness. To determine the core neurons, we explore the correlation between core neurons and the sentence’s semantics. Remarkably, we discovered that core neurons exhibit both stability and similarity in relation to the sentence’s semantics – an insight overlooked by previous studies. Building on this finding, we further design two semantic-based methods for predicting core neurons to fit different input scenarios. In CoreInfer, the core neurons are determined during the pre-filling stage and fixed during the encoding stage, enabling zero-cost sparse inference. We evaluated the model generalization and task generalization of CoreInfer across various models and tasks. Notably, on an NVIDIA TITAN XP GPU, CoreInfer achieved a 10.33 times and 2.72 times speedup compared to the Huggingface implementation and PowerInfer, respectively.
摘要：拥有数十亿参数的大语言模型 (LLMs) 引发了新一轮令人振奋的 AI 应用浪潮。然而，这些模型在推理过程中的高计算成本和内存需求带来了显著的挑战。自适应稀疏激活推理通过仅激活每个 Token 的一小部分神经元，提供了一种在不降低性能的情况下加速模型推理的新方法，显示出在资源受限硬件设备上的巨大潜力。然而，现有方法基于单个 Token 预测激活神经元，并依赖额外的多层感知机 (MLP)，这导致了激活映射的频繁变化和资源调用的增加，限制了稀疏激活带来的加速效果。本文中，我们提出了 CoreInfer，一种基于句子级预测的无 MLP 自适应稀疏激活推理方法。具体而言，我们提出了句子级核心神经元的概念，即对给定句子最关键的神经元子集，并通过实验验证了其有效性。为了确定核心神经元，我们探讨了核心神经元与句子语义之间的关联。值得注意的是，我们发现核心神经元在句子语义方面表现出稳定性和相似性——这一洞察被先前研究所忽视。基于这一发现，我们进一步设计了两种基于语义的方法来预测核心神经元，以适应不同的输入场景。在 CoreInfer 中，核心神经元在预填充阶段确定并在编码阶段固定，从而实现了零成本的稀疏推理。我们评估了 CoreInfer 在不同模型和任务上的模型泛化能力和任务泛化能力。特别地，在 NVIDIA TITAN XP GPU 上，CoreInfer 相较于 Huggingface 实现和 PowerInfer 分别实现了 10.33 倍和 2.72 倍的加速。

[NLP-68] Robust and Explainable Depression Identification from Speech Using Vowel-Based Ensemble Learning Approaches ALT

【速读】：该论文试图解决从语音中识别抑郁症的问题，并提出了一种可解释的机器学习算法。解决方案的关键在于：

基于语音生成证据：利用抑郁症影响运动控制和元音生成的证据，采用预训练的元音嵌入（vowel-based embeddings），这些嵌入整合了语义上有意义的语言单元。
集成学习方法：将问题分解为与特定抑郁症症状和严重程度水平相关的组成部分。具体探索了两种方法：
- 自底向上方法：使用8个模型预测个体患者健康问卷-8（Patient Health Questionnaire-8, PHQ-8）项目的分数。
- 自顶向下方法：使用专家混合模型（Mixture of Experts, MoE），结合路由模块评估抑郁症的严重程度。

这两种方法均展示了与最先进基线相当的性能，表明了其鲁棒性和对数据集均值/中值的较低敏感性。此外，系统的可解释性有助于辅助临床医生进行抑郁症的诊断和筛查。

链接: https://arxiv.org/abs/2410.18298
作者: Kexin Feng,Theodora Chaspari
关键词-EN: study investigates explainable, investigates explainable machine, explainable machine learning, machine learning algorithms, study investigates
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: accepted at the IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI 2024)

点击查看摘要

Abstract:This study investigates explainable machine learning algorithms for identifying depression from speech. Grounded in evidence from speech production that depression affects motor control and vowel generation, pre-trained vowel-based embeddings, that integrate semantically meaningful linguistic units, are used. Following that, an ensemble learning approach decomposes the problem into constituent parts characterized by specific depression symptoms and severity levels. Two methods are explored: a “bottom-up” approach with 8 models predicting individual Patient Health Questionnaire-8 (PHQ-8) item scores, and a “top-down” approach using a Mixture of Experts (MoE) with a router module for assessing depression severity. Both methods depict performance comparable to state-of-the-art baselines, demonstrating robustness and reduced susceptibility to dataset mean/median values. System explainability benefits are discussed highlighting their potential to assist clinicians in depression diagnosis and screening.
摘要：本研究探讨了用于从语音中识别抑郁症的可解释机器学习算法。基于语音生成领域的证据，抑郁症影响运动控制和元音生成，因此采用了预训练的基于元音的嵌入（vowel-based embeddings），这些嵌入整合了语义上有意义的语言单元。随后，采用集成学习方法将问题分解为具有特定抑郁症症状和严重程度水平的组成部分。研究探索了两种方法：一种是“自下而上”的方法，使用8个模型预测个体患者健康问卷-8（Patient Health Questionnaire-8, PHQ-8）项目得分；另一种是“自上而下”的方法，使用带有路由模块的专家混合模型（Mixture of Experts, MoE）来评估抑郁症的严重程度。两种方法的性能均与最先进的基线方法相当，显示出鲁棒性并减少了数据集均值/中值的影响。讨论了系统可解释性的优势，强调其在协助临床医生进行抑郁症诊断和筛查方面的潜力。

[NLP-69] Kenyan Sign Language (KSL) Dataset: Using Artificial Intelligence (AI) in Bridging Communication Barrier among the Deaf Learners

【速读】：该论文试图解决肯尼亚聋人社区与听力正常人群之间的语言障碍问题。解决方案的关键在于开发一个数字化的、开放访问的AI辅助技术数据集，该数据集能够将英语翻译成肯尼亚手语（KSL），从而促进包容性并消除聋人学习者在肯尼亚的语言障碍。具体来说，解决方案包括构建一个包含口语英语和视频记录的肯尼亚手语的数据集，以及将肯尼亚手语符号转录为基于HamNoSys系统的音素级接口。

链接: https://arxiv.org/abs/2410.18295
作者: Lilian Wanzare,Joel Okutoyi,Maurine Kang’ahi,Mildred Ayere
关键词-EN: Kenyan Sign Language, Sign Language, Kenyan Sign, KSL, deaf
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 14 pages, to be published in 3rd International Conference on Artificial Intelligence and Robotics (MIRG-ICAIR 2023)

点击查看摘要

Abstract:Kenyan Sign Language (KSL) is the primary language used by the deaf community in Kenya. It is the medium of instruction from Pre-primary 1 to university among deaf learners, facilitating their education and academic achievement. Kenyan Sign Language is used for social interaction, expression of needs, making requests and general communication among persons who are deaf in Kenya. However, there exists a language barrier between the deaf and the hearing people in Kenya. Thus, the innovation on AI4KSL is key in eliminating the communication barrier. Artificial intelligence for KSL is a two-year research project (2023-2024) that aims to create a digital open-access AI of spontaneous and elicited data from a representative sample of the Kenyan deaf community. The purpose of this study is to develop AI assistive technology dataset that translates English to KSL as a way of fostering inclusion and bridging language barriers among deaf learners in Kenya. Specific objectives are: Build KSL dataset for spoken English and video recorded Kenyan Sign Language and to build transcriptions of the KSL signs to a phonetic-level interface of the sign language. In this paper, the methodology for building the dataset is described. Data was collected from 48 teachers and tutors of the deaf learners and 400 learners who are Deaf. Participants engaged mainly in sign language elicitation tasks through reading and singing. Findings of the dataset consisted of about 14,000 English sentences with corresponding KSL Gloss derived from a pool of about 4000 words and about 20,000 signed KSL videos that are either signed words or sentences. The second level of data outcomes consisted of 10,000 split and segmented KSL videos. The third outcome of the dataset consists of 4,000 transcribed words into five articulatory parameters according to HamNoSys system.
摘要：肯尼亚手语（Kenyan Sign Language, KSL）是肯尼亚聋人社区使用的主要语言。它是聋人学生从学前班到大学的主要教学媒介，有助于他们的教育和学术成就。肯尼亚手语用于社会互动、需求表达、请求提出以及肯尼亚聋人之间的日常交流。然而，肯尼亚的聋人与听人之间存在语言障碍。因此，AI4KSL的创新对于消除这种沟通障碍至关重要。KSL的人工智能是一个为期两年的研究项目（2023-2024），旨在创建一个数字化的开放访问AI，处理来自肯尼亚聋人社区代表性样本的自发和诱发数据。本研究的目的在于开发一种AI辅助技术数据集，将英语翻译为KSL，以此促进包容性并弥合肯尼亚聋人学生之间的语言障碍。具体目标包括：构建用于口语英语和视频记录肯尼亚手语的KSL数据集，以及将KSL手势转录为手语的音素级接口。本文描述了构建数据集的方法。数据从48名聋人教师和辅导员以及400名聋人学生中收集。参与者主要通过阅读和唱歌进行手语诱发任务。数据集的发现包括约14,000条英语句子及其对应的KSL Gloss，这些句子源自约4000个词汇，以及约20,000个手语视频，这些视频要么是手语词汇，要么是手语句子。第二级数据成果包括10,000个分割和分段的KSL视频。数据集的第三项成果包括根据HamNoSys系统转录的4,000个词汇，分为五个发音参数。

[NLP-70] LEGO: Language Model Building Blocks

【速读】：该论文试图解决的问题是：大型语言模型 (Large Language Models, LLMs) 在自然语言处理 (Natural Language Processing, NLP) 中虽然功能强大，但其数据收集、预训练、微调和推理过程成本高昂。而任务特定的小型语言模型 (Small Language Models, SLMs) 虽然成本较低，但缺乏鲁棒性和泛化能力。

解决方案的关键是提出了一种名为 LEGO 的新技术，该技术通过从大型语言模型中提取小型语言模型并重新组合它们，来实现成本效益和性能的平衡。LEGO 利用了先进的 LLM 剪枝策略，创建了任务和用户特定的 SLM 构建块，这些构建块在微调和推理过程中高效，同时保持用户数据的隐私。LEGO 还结合了联邦学习 (Federated Learning) 和一种新的聚合方案，用于 LLM 的重构，从而在不增加高成本的情况下保持模型的鲁棒性，并保护用户数据的隐私。实验结果表明，LEGO 能够实现模型异质性，并减轻数据异质性的影响，同时保持 LLM 的鲁棒性。

链接: https://arxiv.org/abs/2410.18287
作者: Shrenik Bhansali,Alwin Jin,Tyler Lizzo,Larry Heck
关键词-EN: Large language models, natural language processing, Large language, essential in natural, language processing
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are essential in natural language processing (NLP) but are costly in data collection, pre-training, fine-tuning, and inference. Task-specific small language models (SLMs) offer a cheaper alternative but lack robustness and generalization. This paper proposes LEGO, a novel technique to extract SLMs from an LLM and recombine them. Using state-of-the-art LLM pruning strategies, we can create task- and user-specific SLM building blocks that are efficient for fine-tuning and inference while also preserving user data privacy. LEGO utilizes Federated Learning and a novel aggregation scheme for the LLM reconstruction, maintaining robustness without high costs and preserving user data privacy. We experimentally demonstrate the versatility of LEGO, showing its ability to enable model heterogeneity and mitigate the effects of data heterogeneity while maintaining LLM robustness.
摘要：大语言模型 (LLM) 在自然语言处理 (NLP) 中至关重要，但在数据收集、预训练、微调及推理过程中成本高昂。针对特定任务的小语言模型 (SLM) 提供了一种成本较低的替代方案，但缺乏鲁棒性和泛化能力。本文提出了一种名为 LEGO 的新技术，通过从 LLM 中提取 SLM 并重新组合，以实现高效的任务和用户特定 SLM 构建。利用最先进的 LLM 剪枝策略，我们能够创建适用于微调和推理的高效 SLM 模块，同时保护用户数据隐私。LEGO 结合了联邦学习和一种新颖的聚合方案进行 LLM 重构，在不增加高成本的情况下保持鲁棒性，并保护用户数据隐私。实验结果表明，LEGO 具有广泛的适用性，能够在保持 LLM 鲁棒性的同时，实现模型异质性并缓解数据异质性的影响。

[NLP-71] Multilingual Hallucination Gaps in Large Language Models

【速读】：该论文试图解决的问题是大型语言模型（LLMs）在多语言自由文本生成中产生的幻觉现象，特别是在不同语言之间存在的幻觉差异，即多语言幻觉差距（multilingual hallucination gaps）。解决方案的关键在于量化这些幻觉现象，并扩展了FactScore指标的框架以适应多语言环境。通过使用LLaMA、Qwen和Aya系列的LLMs，在19种语言中生成传记文本，并与维基百科页面进行比较，研究揭示了幻觉率在不同语言间的变化，尤其是高资源语言和低资源语言之间的差异，从而提出了评估多语言自由文本生成中幻觉现象的挑战。

链接: https://arxiv.org/abs/2410.18270
作者: Cléa Chataigner,Afaf Taïk,Golnoosh Farnadi
关键词-EN: Large language models, traditional search engines, resembles human language, Large language, alternatives to traditional
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used as alternatives to traditional search engines given their capacity to generate text that resembles human language. However, this shift is concerning, as LLMs often generate hallucinations, misleading or false information that appears highly credible. In this study, we explore the phenomenon of hallucinations across multiple languages in freeform text generation, focusing on what we call multilingual hallucination gaps. These gaps reflect differences in the frequency of hallucinated answers depending on the prompt and language used. To quantify such hallucinations, we used the FactScore metric and extended its framework to a multilingual setting. We conducted experiments using LLMs from the LLaMA, Qwen, and Aya families, generating biographies in 19 languages and comparing the results to Wikipedia pages. Our results reveal variations in hallucination rates, especially between high and low resource languages, raising important questions about LLM multilingual performance and the challenges in evaluating hallucinations in multilingual freeform text generation.
摘要：大语言模型（LLMs）因其能够生成类似人类语言的文本，正越来越多地被用作传统搜索引擎的替代品。然而，这种转变令人担忧，因为 LLMs 常常生成幻觉，即看似高度可信的误导性或虚假信息。在本研究中，我们探讨了自由文本生成中跨多种语言的幻觉现象，重点关注我们称之为多语言幻觉差距的问题。这些差距反映了根据提示和使用的语言不同，幻觉答案出现的频率差异。为了量化这些幻觉，我们采用了 FactScore 指标，并将其框架扩展到多语言环境中。我们使用 LLaMA、Qwen 和 Aya 系列的大语言模型进行了实验，生成了 19 种语言的传记，并将结果与维基百科页面进行了比较。我们的研究结果揭示了幻觉率的差异，特别是在高资源语言和低资源语言之间，这提出了关于大语言模型多语言性能的重要问题，以及在多语言自由文本生成中评估幻觉的挑战。

[NLP-72] Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models

【速读】：该论文试图解决强化学习从人类反馈 (RLHF) 中计算效率低下的问题。当前主流的 RLHF 方法是同步在线和策略内强化学习 (online and on-policy RL)，即从大型语言模型 (LLM) 策略中同步生成样本，使用奖励模型进行标注，并基于 LLM 自身输出的反馈进行学习。尽管这种方法性能良好，但其计算效率较低。

论文提出的解决方案之关键是将生成和学习过程分离，从而实现异步生成新样本的同时对旧样本进行训练，这被称为在线但策略外强化学习 (online but off-policy RLHF)。这种异步训练方法能够加速训练并实现更优的计算资源利用。然而，异步训练依赖于一个尚未充分探索的领域，即在旧模型样本上进行学习。论文探讨了在这种策略外数据下，异步训练能够容忍多少策略外性以加速学习但同时保持性能的问题。

研究结果表明，在线直接偏好优化 (online DPO) 算法对策略外数据的鲁棒性最强，并且这种鲁棒性随着策略模型规模的增加而增强。论文还探讨了进一步的计算优化方法，但发现这些优化方法会带来性能上的代价，从而形成一种权衡。最终，通过在指令跟随任务上训练 LLaMA 3.1 8B 模型，论文验证了异步 RLHF 的可扩展性，实现了比同步运行快 40% 的训练速度，同时保持了最终性能。

链接: https://arxiv.org/abs/2410.18252
作者: Michael Noukhovitch,Shengyi Huang,Sophie Xhonneux,Arian Hosseini,Rishabh Agarwal,Aaron Courville
关键词-EN: large language model, LLM own outputs, LLM, synchronously generating, large language
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: code at this https URL

点击查看摘要

Abstract:The dominant paradigm for RLHF is online and on-policy RL: synchronously generating from the large language model (LLM) policy, labelling with a reward model, and learning using feedback on the LLM’s own outputs. While performant, this paradigm is computationally inefficient. Inspired by classical deep RL literature, we propose separating generation and learning in RLHF. This enables asynchronous generation of new samples while simultaneously training on old samples, leading to faster training and more compute-optimal scaling. However, asynchronous training relies on an underexplored regime, online but off-policy RLHF: learning on samples from previous iterations of our model. To understand the challenges in this regime, we investigate a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? Among several RLHF algorithms we tested, we find that online DPO is most robust to off-policy data, and robustness increases with the scale of the policy model. We study further compute optimizations for asynchronous RLHF but find that they come at a performance cost, giving rise to a trade-off. Finally, we verify the scalability of asynchronous RLHF by training LLaMA 3.1 8B on an instruction-following task 40% faster than a synchronous run while matching final performance.
摘要：当前主流的强化学习从人类反馈 (RLHF) 方法主要采用在线和策略内 (on-policy) 的强化学习模式：同步从大语言模型 (LLM) 策略中生成数据，使用奖励模型进行标注，并通过反馈来学习 LLM 自身的输出。尽管这种方法在性能上表现出色，但其计算效率较低。受经典深度强化学习文献的启发，我们提出将生成和学习过程在 RLHF 中分离。这一改进使得在训练旧样本的同时异步生成新样本成为可能，从而加速训练并实现更优的计算资源利用。然而，异步训练依赖于一个尚未充分探索的领域，即在线但策略外 (off-policy) 的 RLHF：利用模型先前迭代生成的样本进行学习。为了理解这一领域的挑战，我们研究了一个基本问题：在保持性能的前提下，异步训练能容忍多少策略外数据以加速学习？在我们测试的几种 RLHF 算法中，我们发现在线直接策略优化 (DPO) 对策略外数据的鲁棒性最强，且这种鲁棒性随着策略模型规模的增大而增强。我们进一步研究了异步 RLHF 的计算优化，但发现这些优化往往伴随着性能的牺牲，从而形成了一种权衡。最后，我们通过在指令跟随任务上训练 LLaMA 3.1 8B 模型，验证了异步 RLHF 的可扩展性，结果表明其训练速度比同步运行快 40%，同时最终性能相当。

[NLP-73] Multi-Draft Speculative Sampling: Canonical Architectures and Theoretical Limits

【速读】：该论文试图解决多草稿推测采样（multi-draft speculative sampling）中的最优草稿选择方案问题。具体来说，论文探讨了在多个独立采样的草稿模型中，如何通过一种最优的草稿选择方案来匹配目标模型的输出分布。

解决方案的关键在于将最优草稿选择方案分解为两个步骤：

第一步使用重要性采样（Importance Sampling, IS）类型的方案选择一个中间标记（token）。
第二步应用单草稿推测采样（single-draft speculative sampling）生成最终的输出标记。

此外，论文还针对两个相同的草稿模型，提出了接受概率等于一的必要且充分条件，并给出了最优接受概率的显式表达式。理论分析还启发了一种基于加权重要性采样的新型标记级别选择方案。实验结果表明，在多种场景下，该方案在可实现的块效率和标记速率方面均优于基线方案。

链接: https://arxiv.org/abs/2410.18234
作者: Ashish Khisti,M.Reza Ebrahimi,Hassan Dbouk,Arash Behboodi,Roland Memisevic,Christos Louizos
关键词-EN: draft models, multi-draft speculative sampling, proposal sequences, sequences are sampled, sampled independently
类目: Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We consider multi-draft speculative sampling, where the proposal sequences are sampled independently from different draft models. At each step, a token-level draft selection scheme takes a list of valid tokens as input and produces an output token whose distribution matches that of the target model. Previous works have demonstrated that the optimal scheme (which maximizes the probability of accepting one of the input tokens) can be cast as a solution to a linear program. In this work we show that the optimal scheme can be decomposed into a two-step solution: in the first step an importance sampling (IS) type scheme is used to select one intermediate token; in the second step (single-draft) speculative sampling is applied to generate the output token. For the case of two identical draft models we further 1) establish a necessary and sufficient condition on the distributions of the target and draft models for the acceptance probability to equal one and 2) provide an explicit expression for the optimal acceptance probability. Our theoretical analysis also motives a new class of token-level selection scheme based on weighted importance sampling. Our experimental results demonstrate consistent improvements in the achievable block efficiency and token rates over baseline schemes in a number of scenarios.
摘要：我们考虑了多草稿推测采样，其中提议序列从不同的草稿模型中独立采样。在每一步，一个基于Token级别的草稿选择方案接收一组有效Token作为输入，并生成一个输出Token，其分布与目标模型相匹配。先前的研究已经证明，最优方案（即最大化接受输入Token之一的概率）可以被视为一个线性规划问题的解。在本研究中，我们展示了最优方案可以分解为两步解决方案：第一步使用一种类似于重要性采样（Importance Sampling, IS）的方案来选择一个中间Token；第二步则应用单草稿推测采样来生成输出Token。对于两个相同的草稿模型的情况，我们进一步1）确立了目标模型和草稿模型分布的必要且充分条件，使得接受概率等于一；2）提供了最优接受概率的显式表达式。我们的理论分析还启发了一类基于加权重要性采样的Token级别选择方案。我们的实验结果表明，在多种场景下，与基线方案相比，可实现的块效率和Token速率均有所提升。

[NLP-74] Generalizations across filler-gap dependencies in neural language models CONLL2024

【速读】：该论文试图解决的问题是如何从有限的输入中推导出填空-间隙依赖关系（filler-gap dependencies）的结构泛化。填空-间隙依赖关系虽然在表面形式上多样，但共享一种结构泛化。论文的关键解决方案在于通过控制神经语言模型（NLM）的输入，探究模型是否能够形成对填空-间隙依赖关系的共享表示。研究结果表明，尽管NLM能够区分语法正确与不正确的填空-间隙依赖关系，但它们依赖于输入的表面属性，而非共享的泛化。这一发现强调了在语言习得模型中引入特定语言归纳偏差（linguistic inductive biases）的必要性。

链接: https://arxiv.org/abs/2410.18225
作者: Katherine Howitt,Sathvik Nair,Allison Dods,Robert Melvin Hopkins
关键词-EN: Humans develop, making structural generalizations, filler-gap dependencies, develop their grammars, grammars by making
类目: Computation and Language (cs.CL)
备注: accepted at CoNLL 2024

点击查看摘要

Abstract:Humans develop their grammars by making structural generalizations from finite input. We ask how filler-gap dependencies, which share a structural generalization despite diverse surface forms, might arise from the input. We explicitly control the input to a neural language model (NLM) to uncover whether the model posits a shared representation for filler-gap dependencies. We show that while NLMs do have success differentiating grammatical from ungrammatical filler-gap dependencies, they rely on superficial properties of the input, rather than on a shared generalization. Our work highlights the need for specific linguistic inductive biases to model language acquisition.
摘要：人类通过从有限的输入中进行结构概括来发展其语法。我们探讨了填充-空位依赖关系（filler-gap dependencies）如何从输入中产生，尽管这些依赖关系在表面形式上多样，但共享一个结构概括。我们明确控制神经语言模型（NLM）的输入，以揭示模型是否为填充-空位依赖关系假设了一个共享的表示。我们发现，尽管NLM能够成功区分语法上正确与不正确的填充-空位依赖关系，但它们依赖于输入的表面属性，而不是共享的概括。我们的工作强调了在语言习得模型中需要特定的语言归纳偏置。

[NLP-75] Optimizing the role of human evaluation in LLM -based spoken document summarization systems

【速读】：该论文试图解决的问题是如何有效评估生成式 AI (Generative AI) 在口语文档摘要中的表现。由于大型语言模型 (LLMs) 在摘要任务中展现出的创造性、流畅性和从大规模语料库中提取信息的能力，传统的自动评估方法如 ROUGE 和 BERTScore 在性能上尚无法与人工评估相媲美。

解决方案的关键在于借鉴社会科学的方法论，提出一种专门针对生成式 AI 内容的口语文档摘要评估范式。论文详细阐述了评估标准和最佳实践指南，以确保实验设计的稳健性、可重复性和人工评估的可靠性。此外，论文还通过两个案例研究展示了这些以人为中心的评估方法在一家美国大型科技公司中的实际应用。

链接: https://arxiv.org/abs/2410.18218
作者: Margaret Kroll,Kelsey Kraus
关键词-EN: emergence of powerful, shift in abstractive, powerful LLMs, paradigm shift, abstractive summarization
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:The emergence of powerful LLMs has led to a paradigm shift in abstractive summarization of spoken documents. The properties that make LLMs so valuable for this task – creativity, ability to produce fluent speech, and ability to abstract information from large corpora – also present new challenges to evaluating their content. Quick, cost-effective automatic evaluations such as ROUGE and BERTScore offer promise, but do not yet show competitive performance when compared to human evaluations. We draw on methodologies from the social sciences to propose an evaluation paradigm for spoken document summarization explicitly tailored for generative AI content. We provide detailed evaluation criteria and best practices guidelines to ensure robustness in the experimental design, replicability, and trustworthiness of human evaluation studies. We additionally include two case studies that show how these human-in-the-loop evaluation methods have been implemented at a major U.S. technology company.
摘要：大语言模型 (LLM) 的崛起引发了口语文档摘要领域的范式转变。这些模型之所以在这一任务中如此宝贵，是因为它们具备创造力、生成流畅语音的能力以及从大规模语料库中抽象信息的能力。然而，这些特性也为评估其内容带来了新的挑战。尽管像 ROUGE 和 BERTScore 这样的快速、成本效益高的自动评估方法显示出潜力，但与人工评估相比，其性能尚未达到竞争水平。我们借鉴社会科学的方法论，提出了一种专门针对生成式 AI 内容的口语文档摘要评估范式。我们提供了详细的评估标准和最佳实践指南，以确保实验设计的稳健性、可重复性以及人工评估研究的可靠性。此外，我们还包含了两个案例研究，展示了这些人在回路中的评估方法如何在美国一家主要科技公司中实施。

[NLP-76] Advancing NLP Security by Leveraging LLM s as Adversarial Engines

【速读】：该论文试图解决自然语言处理 (NLP) 安全性的问题，特别是如何利用大型语言模型 (Large Language Models, LLMs) 生成多样化的对抗攻击。解决方案的关键在于利用 LLMs 的复杂语言理解和生成能力，创建更有效、语义连贯且类似人类的对抗样本，涵盖多种攻击类型，如对抗性补丁 (adversarial patches)、通用扰动 (universal perturbations) 和定向攻击 (targeted attacks)。通过这种方式，论文提出了一种范式转变，旨在增强模型的鲁棒性，揭示新的漏洞，并推动防御机制的创新，从而为关键应用开发更安全、可靠和可信的 NLP 系统。

链接: https://arxiv.org/abs/2410.18215
作者: Sudarshan Srinivasan,Maria Mahbub,Amir Sadovnik
关键词-EN: leveraging Large Language, position paper proposes, Large Language Models, advancing NLP security, leveraging Large
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 5 pages

点击查看摘要

Abstract:This position paper proposes a novel approach to advancing NLP security by leveraging Large Language Models (LLMs) as engines for generating diverse adversarial attacks. Building upon recent work demonstrating LLMs’ effectiveness in creating word-level adversarial examples, we argue for expanding this concept to encompass a broader range of attack types, including adversarial patches, universal perturbations, and targeted attacks. We posit that LLMs’ sophisticated language understanding and generation capabilities can produce more effective, semantically coherent, and human-like adversarial examples across various domains and classifier architectures. This paradigm shift in adversarial NLP has far-reaching implications, potentially enhancing model robustness, uncovering new vulnerabilities, and driving innovation in defense mechanisms. By exploring this new frontier, we aim to contribute to the development of more secure, reliable, and trustworthy NLP systems for critical applications.
摘要：本文提出了一种利用大语言模型 (LLM) 作为生成多样化对抗攻击引擎的新方法，以推进自然语言处理 (NLP) 的安全性。基于近期研究显示 LLM 在生成词级对抗样本方面的有效性，我们主张将这一概念扩展到更广泛的攻击类型，包括对抗性补丁、通用扰动和定向攻击。我们认为，LLM 的复杂语言理解和生成能力可以在各种领域和分类器架构中生成更有效、语义连贯且更接近人类的对抗样本。这种对抗性 NLP 范式的转变具有深远的影响，可能增强模型的鲁棒性，揭示新的漏洞，并推动防御机制的创新。通过探索这一新领域，我们旨在为开发更安全、可靠和可信的 NLP 系统，特别是在关键应用中，做出贡献。

[NLP-77] owards Understanding the Fragility of Multilingual LLM s against Fine-Tuning Attacks

【速读】：该论文试图解决的问题是多语言大型语言模型（LLMs）在面对微调攻击时的安全性问题。具体来说，论文发现微调攻击具有跨语言泛化性，即通过在一种语言中使用少量对抗性选择的指令跟随示例，可以轻易地破坏多语言LLMs在其他语言中的安全性（例如，多语言LLMs无法拒绝其他语言中的有害提示）。

解决方案的关键在于提出了一个新的方法，称为安全信息定位（Safety Information Localization, SIL），用于识别模型参数空间中的安全相关信息。通过SIL，论文验证了安全相关信息是语言无关的假设，并发现仅改变20%的权重参数就足以在微调攻击中破坏所有语言的安全对齐。此外，论文还提供了证据支持替代路径假设，即冻结安全相关参数并不能防止微调攻击，并展示了其攻击向量仍然可以破解适应新语言的LLMs。

链接: https://arxiv.org/abs/2410.18210
作者: Samuele Poppi,Zheng-Xin Yong,Yifei He,Bobbie Chern,Han Zhao,Aobo Yang,Jianfeng Chi
关键词-EN: sparked widespread concerns, advancements in Large, Large Language Models, Recent advancements, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 14 pages, 6 figures, 7 tables

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have sparked widespread concerns about their safety. Recent work demonstrates that safety alignment of LLMs can be easily removed by fine-tuning with a few adversarially chosen instruction-following examples, i.e., fine-tuning attacks. We take a further step to understand fine-tuning attacks in multilingual LLMs. We first discover cross-lingual generalization of fine-tuning attacks: using a few adversarially chosen instruction-following examples in one language, multilingual LLMs can also be easily compromised (e.g., multilingual LLMs fail to refuse harmful prompts in other languages). Motivated by this finding, we hypothesize that safety-related information is language-agnostic and propose a new method termed Safety Information Localization (SIL) to identify the safety-related information in the model parameter space. Through SIL, we validate this hypothesis and find that only changing 20% of weight parameters in fine-tuning attacks can break safety alignment across all languages. Furthermore, we provide evidence to the alternative pathways hypothesis for why freezing safety-related parameters does not prevent fine-tuning attacks, and we demonstrate that our attack vector can still jailbreak LLMs adapted to new languages.
摘要：近年来，大语言模型 (LLM) 的进步引发了对其安全性的广泛关注。最近的研究表明，通过使用少量精心选择的指令跟随示例进行微调，即微调攻击，可以轻易地移除 LLM 的安全对齐。我们进一步探讨了多语言 LLM 中的微调攻击。首先，我们发现了微调攻击的跨语言泛化现象：使用一种语言中的少量精心选择的指令跟随示例，多语言 LLM 也可以轻易地被破坏（例如，多语言 LLM 在其他语言中无法拒绝有害提示）。基于这一发现，我们假设与安全相关的信息是语言无关的，并提出了一种名为安全信息定位 (SIL) 的新方法，用于在模型参数空间中识别与安全相关的信息。通过 SIL，我们验证了这一假设，并发现仅改变微调攻击中 20% 的权重参数就可以破坏所有语言中的安全对齐。此外，我们为为什么冻结与安全相关的参数无法防止微调攻击提供了替代路径假设的证据，并展示了我们的攻击向量仍然可以破解适应新语言的 LLM。

[NLP-78] CorrectionLM: Self-Corrections with SLM for Dialogue State Tracking

【速读】：该论文试图解决小语言模型（SLMs）在自我改进方面的局限性问题。当前的修正方法通常依赖于从大语言模型（LLMs）中提取知识，这带来了显著的计算需求。论文提出的解决方案之关键是引入了一个名为CORRECTIONLM的新型修正框架，该框架允许SLMs通过上下文示例进行自我修正，而无需LLM的参与。这一方法在低资源设置下的两个对话状态跟踪（DST）任务中实现了与最先进LLM相当的结果，同时大幅降低了计算成本。

链接: https://arxiv.org/abs/2410.18209
作者: Chia-Hsuan Lee,Hao Cheng,Mari Ostendorf
关键词-EN: Large language models, language models, small language models, demonstrated self-improvement capabilities, Large language
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated self-improvement capabilities via feedback and refinement, but current small language models (SLMs) have had limited success in this area. Existing correction approaches often rely on distilling knowledge from LLMs, which imposes significant computation demands. In this work, we introduce CORRECTIONLM, a novel correction framework that enables SLMs to self-correct using in-context exemplars without LLM involvement. Applied to two dialogue state tracking (DST) tasks in low-resource settings, CORRECTIONLM achieves results similar to a state-of-the-art LLM at a small fraction of the computation costs.
摘要：大语言模型 (LLM) 已经展示了通过反馈和优化实现自我改进的能力，但目前的小语言模型 (SLM) 在这一领域的表现有限。现有的校正方法通常依赖于从 LLM 中提取知识，这带来了显著的计算需求。在本研究中，我们提出了 CORRECTIONLM，这是一种新颖的校正框架，使 SLM 能够在不涉及 LLM 的情况下，通过上下文示例进行自我校正。在低资源环境下应用于两个对话状态跟踪 (DST) 任务时，CORRECTIONLM 以极低的计算成本实现了与最先进 LLM 相当的结果。

[NLP-79] ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment

【速读】：该论文试图解决的问题是如何在特定任务（如自动形式化或代码生成）中优化语言模型（LM）的性能，特别是在数据选择方面。现有方法要么完全忽略任务特定的需求，要么依赖于无法捕捉任务细微模式的近似方法。此外，考虑目标分布的方法通常依赖于简单且可能带有噪声的表示（如哈希n-gram特征），这可能导致碰撞并引入噪声。

解决方案的关键是引入了一个名为ZIP-FIT的数据选择框架，该框架利用gzip压缩来直接测量潜在训练数据与目标任务分布之间的对齐程度。通过在自动形式化和Python代码生成任务上的广泛评估，ZIP-FIT显著优于领先的基线方法（如DSIR和D4）。使用ZIP-FIT选择的数据训练的模型在达到最低交叉熵损失方面比基线方法快85.1%，表明更好的任务对齐可以更高效地学习。此外，ZIP-FIT的数据选择速度比DSIR快65.8%，比D4快两个数量级。研究还表明，较小但高度对齐的数据集通常优于较大但针对性较弱的数据集，强调了高质量数据的重要性。

总之，该论文通过展示任务感知数据选择对高效领域适应的重要性，以及压缩提供了一种原则性的任务对齐测量方法，为数据质量、任务对齐和模型学习效率之间的关系提供了新的见解。

链接: https://arxiv.org/abs/2410.18194
作者: Elyas Obbad,Iddah Mlauzi,Brando Miranda,Rylan Schaeffer,Kamal Obbad,Suhana Bedi,Sanmi Koyejo
关键词-EN: Toggle, Data, Data selection, code, Toggle Hugging Face
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Data selection is crucial for optimizing language model (LM) performance on specific tasks, yet most existing methods fail to effectively consider the target task distribution. Current approaches either ignore task-specific requirements entirely or rely on approximations that fail to capture the nuanced patterns needed for tasks like Autoformalization or code generation. Methods that do consider the target distribution often rely on simplistic, sometimes noisy, representations, like hashed n-gram features, which can lead to collisions and introduce noise. We introduce ZIP-FIT, a data selection framework that uses gzip compression to directly measure alignment between potential training data and the target task distribution. In extensive evaluations on Autoformalization and Python code generation, ZIP-FIT significantly outperforms leading baselines like DSIR and D4. Models trained on ZIP-FIT-selected data achieve their lowest cross-entropy loss up to 85.1% faster than baselines, demonstrating that better task alignment leads to more efficient learning. In addition, ZIP-FIT performs selection up to 65.8% faster than DSIR and two orders of magnitude faster than D4. Notably, ZIP-FIT shows that smaller, well-aligned datasets often outperform larger but less targeted ones, demonstrating that a small amount of higher quality data is superior to a large amount of lower quality data. Our results imply that task-aware data selection is crucial for efficient domain adaptation, and that compression offers a principled way to measure task alignment. By showing that targeted data selection can dramatically improve task-specific performance, our work provides new insights into the relationship between data quality, task alignment, and model learning efficiency. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2410.18194 [cs.LG] (or arXiv:2410.18194v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2410.18194 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Brando Miranda [view email] [v1] Wed, 23 Oct 2024 18:01:06 UTC (6,649 KB) Full-text links: Access Paper: View a PDF of the paper titled ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment, by Elyas Obbad and 6 other authorsView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: cs.LG prev | next new | recent | 2024-10 Change to browse by: cs cs.AI cs.CL References Citations NASA ADSGoogle Scholar Semantic Scholar a export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Core recommender toggle CORE Recommender (What is CORE?) IArxiv recommender toggle IArxiv Recommender (What is IArxiv?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack
摘要：数据选择对于优化语言模型（LM）在特定任务上的性能至关重要，然而大多数现有方法未能有效考虑目标任务的分布。当前的方法要么完全忽略任务特定的需求，要么依赖于无法捕捉诸如自动形式化或代码生成等任务所需细微模式的近似方法。那些确实考虑目标分布的方法通常依赖于简单且有时嘈杂的表示，如哈希 n-gram 特征，这可能导致冲突并引入噪声。我们引入了 ZIP-FIT，这是一个利用 gzip 压缩直接测量潜在训练数据与目标任务分布之间对齐的数据选择框架。在自动形式化和 Python 代码生成的广泛评估中，ZIP-FIT 显著优于 DSIR 和 D4 等领先基线。使用 ZIP-FIT 选择的数据训练的模型，其交叉熵损失达到最低的速度比基线快 85.1%，这表明更好的任务对齐可以带来更高效的学习。此外，ZIP-FIT 的数据选择速度比 DSIR 快 65.8%，比 D4 快两个数量级。值得注意的是，ZIP-FIT 表明，较小但高度对齐的数据集往往优于较大但针对性较弱的数据集，这表明少量高质量数据优于大量低质量数据。我们的结果表明，任务感知的数据选择对于高效的领域适应至关重要，并且压缩提供了一种原则性的方法来衡量任务对齐。通过展示目标数据选择可以显著提高任务特定性能，我们的工作为数据质量、任务对齐和模型学习效率之间的关系提供了新的见解。

主题：机器学习（cs.LG）；人工智能（cs.AI）；计算与语言（cs.CL）
引用为：arXiv:2410.18194 [cs.LG]（或 arXiv:2410.18194v1 [cs.LG] 用于此版本）
https://doi.org/10.48550/arXiv.2410.18194

提交历史：从 Brando Miranda [查看电子邮件]
[v1] 2024年10月23日 18:01:06 UTC（6,649 KB）
全文链接：访问论文：查看标题为“ZIP-FIT：通过基于压缩的对齐实现无嵌入数据选择”的 PDF，作者为 Elyas Obbad 及其他 6 位作者
查看 PDF HTML（实验性）TeX 源码其他格式查看许可证
当前浏览上下文：cs.LG
上一篇 | 下一篇新 | 最近 | 2024-10
更改浏览方式：cs cs.AI cs.CL
参考文献引文 NASA ADS Google Scholar Semantic Scholar
导出 BibTeX 引文加载中… BibTeX 格式化引文加载中… 数据由以下机构提供：
书签已勾选=“已勾选”> 书目工具书目和引文工具
书目浏览器切换书目浏览器（什么是浏览器？）
Litmaps 切换 Litmaps（什么是 Litmaps？）
scite.ai 切换 scite 智能引文（什么是智能引文？）
代码、数据、媒体与此文章相关的代码、数据和媒体
alphaXiv 切换 alphaXiv（什么是 alphaXiv？）
代码链接切换 CatalyzeX 论文代码查找器（什么是 CatalyzeX？）
DagsHub 切换 DagsHub（什么是 DagsHub？）
GotitPub 切换 Gotit.pub（什么是 GotitPub？）
Huggingface 切换 Hugging Face（什么是 Huggingface？）
代码链接切换代码与论文（什么是代码与论文？）
ScienceCast 切换 ScienceCast（什么是 ScienceCast？）
演示演示
Replicate 切换 Replicate（什么是 Replicate？）
Spaces 切换 Hugging Face Spaces（什么是 Spaces？）
Spaces 切换 TXYZ.AI（什么是 TXYZ.AI？）
相关论文推荐和搜索工具
影响力花朵链接影响力花朵（什么是影响力花朵？）
Connected Papers 切换 Connected Papers（什么是 Connected Papers？）
CORE 推荐器切换 CORE 推荐器（什么是 CORE？）
IArxiv 推荐器切换 IArxiv 推荐器（什么是 IArxiv？）
作者地点机构主题
关于 arXivLabs
arXivLabs：社区合作者的实验项目 arXivLabs 是一个框架，允许合作者在我们的网站上直接开发和分享新的 arXiv 功能。无论是个人还是组织，与 arXivLabs 合作的人都接受了我们的开放性、社区、卓越性和用户数据隐私的价值观。arXiv 致力于这些价值观，并且只与遵守这些价值观的合作伙伴合作。有一个项目可以为 arXiv 社区增加价值？了解更多关于 arXivLabs 的信息。

本文的哪些作者是支持者？ | 禁用 MathJax（什么是 MathJax？）
mathjaxToggle();

关于帮助联系 arXiv 点击此处联系 arXiv 联系订阅 arXiv 邮件列表点击此处订阅订阅
版权隐私政策网页无障碍辅助 arXiv 运营状态
通过电子邮件或 Slack 获取状态通知

[NLP-80] Gazelle: An Instruction Dataset for Arabic Writing Assistance EMNLP2024

【速读】：该论文试图解决阿拉伯语在生成式 AI (Generative AI) 写作辅助工具开发中面临的数据稀缺问题。解决方案的关键在于提出了一个名为 Gazelle 的综合性阿拉伯语写作辅助数据集，并提供了一个评估框架，以增强阿拉伯语写作辅助工具的开发。通过这一数据集和评估框架，研究者能够更有效地训练和评估模型，从而克服阿拉伯语处理中的复杂性，推动更高效的 AI 驱动的阿拉伯语写作工具的发展。

链接: https://arxiv.org/abs/2410.18163
作者: Samar M. Magdy,Fakhraddin Alwajih,Sang Yun Kwon,Reem Abdel-Salam,Muhammad Abdul-Mageed
关键词-EN: cognitive processes involved, intricate cognitive processes, Arabic writing assistance, Arabic writing, writing assistance
类目: Computation and Language (cs.CL)
备注: EMNLP2024 Finding Camara-ready version

点击查看摘要

Abstract:Writing has long been considered a hallmark of human intelligence and remains a pinnacle task for artificial intelligence (AI) due to the intricate cognitive processes involved. Recently, rapid advancements in generative AI, particularly through the development of Large Language Models (LLMs), have significantly transformed the landscape of writing assistance. However, underrepresented languages like Arabic encounter significant challenges in the development of advanced AI writing tools, largely due to the limited availability of data. This scarcity constrains the training of effective models, impeding the creation of sophisticated writing assistance technologies. To address these issues, we present Gazelle, a comprehensive dataset for Arabic writing assistance. In addition, we offer an evaluation framework designed to enhance Arabic writing assistance tools. Our human evaluation of leading LLMs, including GPT-4, GPT-4o, Cohere Command R+, and Gemini 1.5 Pro, highlights their respective strengths and limitations in addressing the challenges of Arabic writing. Our findings underscore the need for continuous model training and dataset enrichment to manage the complexities of Arabic language processing, paving the way for more effective AI-powered Arabic writing tools.
摘要：写作长期以来被视为人类智能的标志，也是人工智能（AI）领域的一项巅峰任务，因其涉及复杂的认知过程。近年来，生成式 AI 的快速发展，特别是通过大语言模型（LLMs）的开发，显著改变了写作辅助工具的格局。然而，像阿拉伯语这样的非主流语言在开发高级 AI 写作工具方面面临重大挑战，主要原因是数据的有限可用性。这种数据稀缺性限制了有效模型的训练，阻碍了复杂写作辅助技术的创建。为解决这些问题，我们提出了 Gazelle，一个全面的阿拉伯语写作辅助数据集。此外，我们还提供了一个评估框架，旨在增强阿拉伯语写作辅助工具的性能。我们对领先的 LLMs 进行了人类评估，包括 GPT-4、GPT-4o、Cohere Command R+ 和 Gemini 1.5 Pro，突显了它们在应对阿拉伯语写作挑战方面的各自优势和局限性。我们的研究结果强调了持续模型训练和数据集丰富化的必要性，以应对阿拉伯语处理的复杂性，为更有效的 AI 驱动的阿拉伯语写作工具铺平道路。

[NLP-81] Future Token Prediction – Causal Language Modelling with Per-Token Semantic State Vector for Multi-Token Prediction

【速读】：该论文试图解决的问题是：现有的因果解码器仅变换模型（如GPT）在生成语言模型时，仅基于前一个token预测下一个token，导致顶层嵌入向量（top layer embedding vectors）过于关注当前token，而未能充分捕捉未来文本的整体意义。

解决方案的关键是引入了一种新的预训练方法，称为未来token预测（Future Token Prediction, FTP）。在FTP中，一个大型变换编码器生成每个token位置的顶层嵌入向量，这些向量通过线性扩展投影到一个伪序列，然后由一个小型变换解码器进行交叉注意力处理，以预测从该位置开始的未来N个token。这种方法使得顶层嵌入向量能够更好地表示文本的主题，并在生成文本时表现出更好的主题连贯性。

链接: https://arxiv.org/abs/2410.18160
作者: Nicholas Walker
关键词-EN: Generative Pre-trained Transformers, Generative Pre-trained, Causal decoder-only transformer, generative language modelling, Causal decoder-only
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 15 pages, 7 figures, 3 tables

点击查看摘要

Abstract:Causal decoder-only transformer models used for generative language modelling, such as Generative Pre-trained Transformers (GPT), are trained to predict the next token in a sequence based only on its previous tokens. Despite this simple training objective, they have proved to be powerful AI tools. However, only predicting the next token results in top layer embedding vectors that are highly token-focused. There may be benefits in generating embedding vectors at each token position that better capture the overall meaning of longer sequences of future text. Recent studies matching brain scans with deep language models suggest that humans also predict upcoming words when listening or reading but consider multiple future tokens rather than just one. This research investigates a new pretraining method called Future Token Prediction (FTP). In FTP, a large transformer encoder generates top layer embedding vectors for each token position, which, instead of being passed to a language head, are linearly and expansively projected to a pseudo-sequence, which is cross attended to by a small transformer decoder to predict the next N tokens forward from that position in the sequence. The top layer embedding vectors from FTP models exhibit distinct properties compared to those from standard GPT models, varying smoothly along a text sequence as measured by cosine similarity between adjacent tokens. Text generated by FTP models show improved topic coherence compared to standard GPT-like models trained with the same prediction perplexity for the next single token. The vectors are shown to better represent the topic of text based on the results of text classification examples. On a toy, but complex, coding problem, FTP networks produce significantly better results than GPT networks. Comments: 15 pages, 7 figures, 3 tables Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG) ACMclasses: I.2.6; I.2.7 Cite as: arXiv:2410.18160 [cs.CL] (or arXiv:2410.18160v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2410.18160 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
摘要：用于生成语言建模的因果解码器专用 Transformer 模型，如生成式预训练 Transformer (Generative Pre-trained Transformers, GPT)，经过训练仅基于其前面的 Token 来预测序列中的下一个 Token。尽管训练目标简单，但这些模型已被证明是强大的 AI 工具。然而，仅预测下一个 Token 会导致顶层嵌入向量高度集中于 Token 本身。在每个 Token 位置生成能够更好地捕捉未来较长文本序列整体意义的嵌入向量可能会有益处。最近的研究将脑扫描与深度语言模型匹配，表明人类在听或读时也会预测即将到来的词语，但会考虑多个未来的 Token 而非仅一个。本研究探讨了一种新的预训练方法，称为未来 Token 预测 (Future Token Prediction, FTP)。在 FTP 中，一个大型 Transformer 编码器为每个 Token 位置生成顶层嵌入向量，这些向量不是传递给语言头，而是线性且扩展地投影到一个伪序列，该伪序列由一个小型 Transformer 解码器交叉关注，以预测从该位置开始序列中接下来的 N 个 Token。FTP 模型的顶层嵌入向量与标准 GPT 模型的顶层嵌入向量相比，表现出明显的特性，通过相邻 Token 之间的余弦相似度测量，这些向量在文本序列中平滑变化。与使用相同预测困惑度训练的标准 GPT 类模型相比，FTP 模型生成的文本显示出更好的主题连贯性。根据文本分类示例的结果，这些向量更好地代表了文本的主题。在一个玩具但复杂的编码问题上，FTP 网络比 GPT 网络产生了显著更好的结果。

评论：15 页，7 幅图，3 个表格
主题：计算与语言 (cs.CL); 机器学习 (cs.LG)
ACM 分类：I.2.6; I.2.7
引用为：arXiv:2410.18160 [cs.CL]
（或 arXiv:2410.18160v1 [cs.CL] 用于此版本）
https://doi.org/10.48550/arXiv.2410.18160
arXiv 发布的 DOI 通过 DataCite（注册待定）

[NLP-82] Meaning Typed Prompting: A Technique for Efficient Reliable Structured Output Generation

【速读】：该论文试图解决的问题是如何将大型语言模型 (Large Language Models, LLMs) 扩展到需要可靠结构化输出的高级应用中。现有方法通常依赖于刚性的 JSON 模式，这可能导致输出不可靠、推理能力下降以及计算开销增加，从而限制了 LLMs 在复杂任务中的适应性。

解决方案的关键是引入了一种名为“意义类型提示 (Meaning Typed Prompting, MTP)”的技术。MTP 通过在提示过程中整合类型、意义和抽象概念（如变量和类），利用表达性类型定义来增强输出清晰度，减少对复杂抽象的依赖，从而简化了开发过程并提高了实现效率。这种方法使 LLMs 能够更好地理解关系并更有效地生成结构化数据。实证评估表明，MTP 在多个基准测试中在准确性、可靠性、一致性和令牌效率方面优于现有框架。论文还提出了 Semantix 框架，该框架实现了 MTP，并提供了其实际应用的见解。

链接: https://arxiv.org/abs/2410.18146
作者: Chandra Irugalbandara
关键词-EN: Large Language Models, Extending Large Language, Extending Large, Language Models, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注:

点击查看摘要

Abstract:Extending Large Language Models (LLMs) to advanced applications requires reliable structured output generation. Existing methods which often rely on rigid JSON schemas, can lead to unreliable outputs, diminished reasoning capabilities, and increased computational overhead, limiting LLMs’ adaptability for complex tasks. We introduce Meaning Typed Prompting (MTP), a technique for efficient structured output generation that integrates types, meanings, and abstractions, such as variables and classes, into the prompting process. By utilizing expressive type definitions, MTP enhances output clarity and reduces dependence on complex abstractions, simplifying development, and improving implementation efficiency. This enables LLMs to understand relationships and generate structured data more effectively. Empirical evaluations on multiple benchmarks demonstrate that MTP outperforms existing frameworks in accuracy, reliability, consistency, and token efficiency. We present Semantix, a framework that implements MTP, providing practical insights into its application.
摘要：将大语言模型 (LLM) 扩展到高级应用需要可靠的结构化输出生成。现有方法通常依赖于刚性的 JSON 模式，这可能导致输出不可靠、推理能力下降以及计算开销增加，从而限制了 LLM 在复杂任务中的适应性。我们引入了意义类型提示 (Meaning Typed Prompting, MTP)，这是一种高效的结构化输出生成技术，它将类型、意义和抽象概念（如变量和类）整合到提示过程中。通过利用表达性的类型定义，MTP 增强了输出清晰度，减少了对外部复杂抽象的依赖，简化了开发过程，并提高了实现效率。这使得 LLM 能够更好地理解关系并更有效地生成结构化数据。在多个基准上的实证评估表明，MTP 在准确性、可靠性、一致性和 Token 效率方面优于现有框架。我们提出了 Semantix 框架，该框架实现了 MTP，并提供了其实际应用的实用见解。

[NLP-83] Analyzing Nobel Prize Literature with Large Language Models

【速读】：该论文试图解决的问题是评估高级大型语言模型（Large Language Models, LLMs）在文学分析中的能力，特别是与人类研究生水平的分析能力进行比较。解决方案的关键在于通过对比LLMs和人类在分析诺贝尔奖获奖短篇小说（如Han Kang的《九章》和Jon Fosse的《友谊》）时的表现，来探讨AI在处理复杂文学元素（如主题分析、互文性、文化历史背景、语言和结构创新、角色发展）方面的优势和局限性。

研究通过定性和定量评估方法，考察了AI在分析任务中的连贯性、创造性和对文本的忠实度，揭示了AI在结构化任务中的强大分析能力，但在情感细微差别和整体连贯性方面仍存在不足，这些领域是人类解释的强项。该研究强调了人机协作在人文学科中的潜力，为文学研究和更广泛的领域开辟了新的机会。

链接: https://arxiv.org/abs/2410.18142
作者: Yang Zhenyuan,Liu Zhengliang,Zhang Jing,Lu Cen,Tai Jiaxin,Zhong Tianyang,Li Yiwei,Zhao Siyan,Yao Teng,Liu Qing,Yang Jinlin,Liu Qixin,Li Zhaowei,Wang Kexin,Ma Longjun,Zhu Dajiang,Ren Yudan,Ge Bao,Zhang Wei,Qiang Ning,Zhang Tuo,Liu Tianming
关键词-EN: Large Language Models, advanced Large Language, Large Language, Language Models, advanced Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This study examines the capabilities of advanced Large Language Models (LLMs), particularly the o1 model, in the context of literary analysis. The outputs of these models are compared directly to those produced by graduate-level human participants. By focusing on two Nobel Prize-winning short stories, ‘Nine Chapters’ by Han Kang, the 2024 laureate, and ‘Friendship’ by Jon Fosse, the 2023 laureate, the research explores the extent to which AI can engage with complex literary elements such as thematic analysis, intertextuality, cultural and historical contexts, linguistic and structural innovations, and character development. Given the Nobel Prize’s prestige and its emphasis on cultural, historical, and linguistic richness, applying LLMs to these works provides a deeper understanding of both human and AI approaches to interpretation. The study uses qualitative and quantitative evaluations of coherence, creativity, and fidelity to the text, revealing the strengths and limitations of AI in tasks typically reserved for human expertise. While LLMs demonstrate strong analytical capabilities, particularly in structured tasks, they often fall short in emotional nuance and coherence, areas where human interpretation excels. This research underscores the potential for human-AI collaboration in the humanities, opening new opportunities in literary studies and beyond.
摘要：本研究探讨了先进的大语言模型（LLMs），特别是 o1 模型，在文学分析中的能力。研究直接将这些模型的输出与研究生水平的人类参与者的输出进行比较。通过聚焦于两位诺贝尔奖得主的短篇小说，即韩江（2024 年得主）的《九章》和约恩·福瑟（2023 年得主）的《友谊》，研究探索了 AI 在处理复杂文学元素（如主题分析、互文性、文化与历史背景、语言与结构创新以及角色发展）方面的程度。鉴于诺贝尔奖的声望及其对文化、历史和语言丰富性的重视，将 LLMs 应用于这些作品，不仅加深了对人类解读方式的理解，也揭示了 AI 解读方式的独特性。研究采用定性和定量评估方法，对连贯性、创造性和对文本的忠实度进行评估，揭示了 AI 在通常由人类专业知识主导的任务中的优势与局限。尽管 LLMs 在结构化任务中表现出强大的分析能力，但在情感细微差别和连贯性方面往往不足，这些领域正是人类解读的优势所在。本研究强调了人机协作在人文领域的潜力，为文学研究及其他领域开辟了新的机遇。

[NLP-84] SmartRAG: Jointly Learn RAG-Related Tasks From the Environment Feedback

【速读】：该论文试图解决的问题是：现有的检索增强生成系统（RAG systems）中的多个模块通常是单独训练的，这可能导致系统整体性能不佳。论文提出，像RAG这样的多模块系统应该进行联合优化以达到最佳性能。

解决方案的关键在于设计了一个名为 SmartRAG 的特定管道，其中包括一个策略网络（policy network）和一个检索器（retriever）。策略网络具有三个主要功能：1) 决定何时进行检索；2) 生成最适合检索器的查询；3) 生成最终的回答，无论是否使用检索到的信息。论文提出使用强化学习算法（reinforcement learning algorithm）对整个系统进行联合优化，奖励机制旨在鼓励系统以最小的检索成本实现最佳性能。通过联合优化，所有模块都能感知其他模块的工作方式，从而作为一个整体系统找到最佳的合作方式。实证结果表明，联合优化的SmartRAG比单独优化的系统表现更好。

链接: https://arxiv.org/abs/2410.18141
作者: Jingsheng Gao,Linxu Li,Weiyuan Li,Yuzhuo Fu,Bin Dai
关键词-EN: RAG systems consist, multiple modules, modules, incorporates multiple modules, RAG
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:RAG systems consist of multiple modules to work together. However, these modules are usually separately trained. We argue that a system like RAG that incorporates multiple modules should be jointly optimized to achieve optimal performance. To demonstrate this, we design a specific pipeline called \textbfSmartRAG that includes a policy network and a retriever. The policy network can serve as 1) a decision maker that decides when to retrieve, 2) a query rewriter to generate a query most suited to the retriever, and 3) an answer generator that produces the final response with/without the observations. We then propose to jointly optimize the whole system using a reinforcement learning algorithm, with the reward designed to encourage the system to achieve the best performance with minimal retrieval cost. When jointly optimized, all the modules can be aware of how other modules are working and thus find the best way to work together as a complete system. Empirical results demonstrate that the jointly optimized SmartRAG can achieve better performance than separately optimized counterparts.
摘要：RAG 系统由多个模块组成，这些模块协同工作。然而，这些模块通常是单独训练的。我们认为，像 RAG 这样包含多个模块的系统应该进行联合优化以达到最佳性能。为了证明这一点，我们设计了一个特定的管道，称为 \textbfSmartRAG，其中包括一个策略网络和一个检索器。策略网络可以作为 1) 一个决策者，决定何时进行检索，2) 一个查询重写器，生成最适合检索器的查询，以及 3) 一个答案生成器，根据是否有观察结果生成最终的响应。我们随后提出使用强化学习算法对整个系统进行联合优化，奖励设计旨在鼓励系统以最小的检索成本实现最佳性能。当联合优化时，所有模块都能了解其他模块的工作方式，从而找到最佳的合作方式，作为一个完整的系统共同工作。实证结果表明，联合优化的 SmartRAG 比单独优化的对应系统表现更好。

[NLP-85] hering Broken Themes: Aligning Neural Topic Models with Labels and Authors

【速读】：该论文试图解决的问题是现有的神经主题模型（neural topic models）生成的主题往往与人类意图不一致的问题。尽管存在标签和作者信息等元数据（metadata），但这些信息尚未被有效地整合到神经主题模型中。

解决方案的关键是引入了一种名为FANToM的新方法，该方法能够将标签和作者信息与神经主题模型进行对齐。FANToM允许在可用时包含这些元数据，从而生成可解释的主题和每个主题的作者分布。通过学习标签、主题和作者之间的对齐关系，FANToM展示了比传统主题模型更高的表达能力。实验结果表明，FANToM在主题质量和一致性方面优于现有模型，并且能够识别作者的兴趣和相似性。

链接: https://arxiv.org/abs/2410.18140
作者: Mayank Nagda,Phil Ostheimer,Sophie Fellenz
关键词-EN: large document collections, extracting semantic information, document collections, Topic models, extracting semantic
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Topic models are a popular approach for extracting semantic information from large document collections. However, recent studies suggest that the topics generated by these models often do not align well with human intentions. While metadata such as labels and authorship information is available, it has not yet been effectively incorporated into neural topic models. To address this gap, we introduce FANToM, a novel method for aligning neural topic models with both labels and authorship information. FANToM allows for the inclusion of this metadata when available, producing interpretable topics and author distributions for each topic. Our approach demonstrates greater expressiveness than conventional topic models by learning the alignment between labels, topics, and authors. Experimental results show that FANToM improves upon existing models in terms of both topic quality and alignment. Additionally, it identifies author interests and similarities.
摘要：主题模型是从大型文档集合中提取语义信息的一种流行方法。然而，最近的研究表明，这些模型生成的主题往往与人类的意图不一致。尽管诸如标签和作者信息等元数据是可用的，但尚未有效地整合到神经主题模型中。为了填补这一空白，我们提出了FANToM，这是一种新颖的方法，用于将神经主题模型与标签和作者信息对齐。FANToM允许在可用时包含这些元数据，从而为每个主题生成可解释的主题和作者分布。我们的方法通过学习标签、主题和作者之间的对齐关系，展示了比传统主题模型更大的表达能力。实验结果表明，FANToM在主题质量和一致性方面均优于现有模型。此外，它还能识别作者的兴趣和相似性。

[NLP-86] R2Gen-Mamba: A Selective State Space Model for Radiology Report Generation

【速读】：该论文试图解决放射报告生成中的自动化问题，特别是现有方法（主要基于Transformer）在计算资源上的高消耗问题。解决方案的关键在于提出了R2Gen-Mamba，这是一种结合了Mamba的高效序列处理能力和Transformer架构上下文优势的新型自动放射报告生成方法。Mamba的低计算复杂性使得R2Gen-Mamba在训练和推理效率上有所提升，同时还能生成高质量的报告。通过在两个包含超过21万对X光图像-报告的基准数据集上的实验，证明了R2Gen-Mamba在报告质量和计算效率方面优于几种最先进的方法。

链接: https://arxiv.org/abs/2410.18135
作者: Yongheng Sun,Yueh Z. Lee,Genevieve A. Woodard,Hongtu Zhu,Chunfeng Lian,Mingxia Liu
关键词-EN: manual annotation process, Radiology report generation, automatic report generation, report generation methods, report generation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 4 pages pages for ISBI2025

点击查看摘要

Abstract:Radiology report generation is crucial in medical imaging,but the manual annotation process by physicians is time-consuming and labor-intensive, necessitating the develop-ment of automatic report generation methods. Existingresearch predominantly utilizes Transformers to generateradiology reports, which can be computationally intensive,limiting their use in real applications. In this work, we presentR2Gen-Mamba, a novel automatic radiology report genera-tion method that leverages the efficient sequence processingof the Mamba with the contextual benefits of Transformerarchitectures. Due to lower computational complexity ofMamba, R2Gen-Mamba not only enhances training and in-ference efficiency but also produces high-quality this http URL results on two benchmark datasets with morethan 210,000 X-ray image-report pairs demonstrate the ef-fectiveness of R2Gen-Mamba regarding report quality andcomputational efficiency compared with several state-of-the-art methods. The source code can be accessed online.
摘要：放射报告生成在医学影像中至关重要，但由医生进行的手动标注过程既耗时又费力，因此需要开发自动报告生成方法。现有研究主要利用 Transformer 来生成放射报告，这在计算上较为密集，限制了其在实际应用中的使用。在本研究中，我们提出了 R2Gen-Mamba，这是一种新颖的自动放射报告生成方法，结合了 Mamba 的高效序列处理能力和 Transformer 架构的上下文优势。由于 Mamba 的计算复杂度较低，R2Gen-Mamba 不仅提高了训练和推理效率，还在两个包含超过 210,000 对 X 光图像-报告的基准数据集上生成了高质量的报告。实验结果表明，与几种最先进的方法相比，R2Gen-Mamba 在报告质量和计算效率方面均表现出色。源代码可在线获取。

[NLP-87] Graph Contrastive Learning via Cluster-refined Negative Sampling for Semi-supervised Text Classification

【速读】：该论文试图解决图对比学习 (Graph Contrastive Learning, GCL) 在文本分类任务中存在的负样本采样偏差问题，特别是由于相似节点被错误地配对为负样本而导致的过度聚类 (over-clustering) 现象。过度聚类会导致同一类别的实例被分割到不同的聚类中，从而影响模型的分类性能。

解决方案的关键在于提出了一个基于聚类优化的负样本采样策略，称为 ClusterText。具体步骤包括：

结合预训练模型 Bert 和图神经网络 (Graph Neural Networks) 来学习文本表示。
引入聚类优化策略，通过聚类学习到的文本表示来获取伪标签 (pseudo labels)。对于每个文本节点，其负样本集从不同的聚类中抽取。
提出自校正机制 (self-correction mechanism)，通过计算同一聚类内节点之间的欧几里得距离，选择距离较远的节点作为负样本，以缓解聚类不一致导致的真实负样本损失问题。

通过这些策略，ClusterText 能够有效提取大量数据中的重要信息，并在文本分类任务中表现出优越性。

链接: https://arxiv.org/abs/2410.18130
作者: Wei Ai,Jianbin Li,Ze Wang,Jiayi Du,Tao Meng,Yuntao Shou,Keqin Li
关键词-EN: generate self-supervised signals, facilitating model training, text classification, widely applied, ability to generate
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 7 pages, 3 figures

点击查看摘要

Abstract:Graph contrastive learning (GCL) has been widely applied to text classification tasks due to its ability to generate self-supervised signals from unlabeled data, thus facilitating model training. However, existing GCL-based text classification methods often suffer from negative sampling bias, where similar nodes are incorrectly paired as negative pairs. This can lead to over-clustering, where instances of the same class are divided into different clusters. To address the over-clustering issue, we propose an innovative GCL-based method of graph contrastive learning via cluster-refined negative sampling for semi-supervised text classification, namely ClusterText. Firstly, we combine the pre-trained model Bert with graph neural networks to learn text representations. Secondly, we introduce a clustering refinement strategy, which clusters the learned text representations to obtain pseudo labels. For each text node, its negative sample set is drawn from different clusters. Additionally, we propose a self-correction mechanism to mitigate the loss of true negative samples caused by clustering inconsistency. By calculating the Euclidean distance between each text node and other nodes within the same cluster, distant nodes are still selected as negative samples. Our proposed ClusterText demonstrates good scalable computing, as it can effectively extract important information from from a large amount of data. Experimental results demonstrate the superiority of ClusterText in text classification tasks.
摘要：图对比学习 (Graph Contrastive Learning, GCL) 由于能够从未标记数据中生成自监督信号，从而促进模型训练，已被广泛应用于文本分类任务。然而，现有的基于 GCL 的文本分类方法往往存在负采样偏差问题，即相似节点被错误地配对为负样本对。这会导致过度聚类，即将同一类别的实例划分到不同的聚类中。为解决过度聚类问题，我们提出了一种基于聚类优化负采样的图对比学习方法，用于半监督文本分类，即 ClusterText。首先，我们将预训练模型 Bert 与图神经网络结合，学习文本表示。其次，我们引入了一种聚类优化策略，通过对学习到的文本表示进行聚类以获得伪标签。对于每个文本节点，其负样本集从不同聚类中抽取。此外，我们提出了一种自校正机制，以缓解因聚类不一致导致的真实负样本损失问题。通过计算每个文本节点与同一聚类内其他节点之间的欧几里得距离，仍然选择距离较远的节点作为负样本。我们提出的 ClusterText 展示了良好的可扩展计算能力，能够有效地从大量数据中提取重要信息。实验结果表明，ClusterText 在文本分类任务中具有优越性。

[NLP-88] Optimizing Preference Alignment with Differentiable NDCG Ranking

【速读】：该论文试图解决的问题是当前基于人类反馈的强化学习（Reinforcement Learning from Human Feedback, RLHF）在偏好对齐（preference alignment）任务中表现不佳的问题。具体来说，现有的方法（如DPO）在处理成对偏好数据时，其排名准确率低于60%，未能充分捕捉序列中的理想偏好关系。

解决方案的关键是提出了直接排名偏好优化（Direct Ranking Preference Optimization, DRPO）方法。DRPO将人类偏好对齐视为一个学习排序（Learning-to-Rank, LTR）任务，并利用NDCG（Normalized Discounted Cumulative Gain）这一广泛使用的LTR指标来优化响应列表的排名。由于NDCG的不可微性，论文提出了diffNDCG损失，这是一个通过排序网络实现的NDCG的可微近似。此外，为了提高生成响应的质量，论文还提出了基于边际的自适应排名策略得分（Adaptive Rank Policy Score）。实验结果表明，DRPO显著优于现有的基线方法，提升了生成响应的质量。

链接: https://arxiv.org/abs/2410.18127
作者: Jiacong Zhou,Xianyun Wang,Jun Yu
关键词-EN: Aligning large language, Aligning large, large language models, large language, safety by ensuring
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 10 pages

点击查看摘要

Abstract:Aligning large language models with human preferences improves interaction quality and safety by ensuring outputs better reflect human values. A promising strategy involves Reinforcement Learning from Human Feedback (RLHF), starting with collecting and ranking responses generated by a supervised fine-tuning model to refine alignment. Current methods (DPO) focus on learning from pairwise preference data, categorizing responses into preferred and less preferred pairs, and optimizing by maximizing pairwise margins. Recent studies have uncovered a substantial discrepancy between the theoretical aspirations of preference learning and its real-world results. Current preference alignment techniques underperform expectations, with ranking accuracies below 60% on standard datasets. This suggests existing methods inadequately capture ideal preference relationships within sequences. To address this challenge, this paper introduces \underlineDirect \underlineRanking \underlinePreference \underlineOptimization (DRPO), a novel method that views human preference alignment as a Learning-to-Rank (LTR) task. DRPO leverages NDCG, a widely used LTR metric, to optimize the ranking of responses within lists based on preference data, thereby enhancing ranking accuracies. Due to the nondifferentiability of NDCG, we propose diffNDCG loss, a differentiable approximation facilitated by a sorting network to simulate NDCG. Furthermore, to improve the quality of generated response, we propose a novel margin-based Adaptive Rank Policy Score. Extensive experiments have shown that DRPO outperforms existing baseline methods, enhancing the quality of the generated responses.
摘要：将大语言模型与人类偏好对齐，通过确保输出更好地反映人类价值观，从而提高交互质量和安全性。一个有前景的策略是基于人类反馈的强化学习（Reinforcement Learning from Human Feedback, RLHF），从收集和排序由监督微调模型生成的响应开始，以优化对齐效果。当前的方法（如 DPO）主要关注从成对偏好数据中学习，将响应分类为偏好和非偏好对，并通过最大化成对边际来进行优化。最近的研究揭示了偏好学习理论期望与实际结果之间存在显著差距。现有的偏好对齐技术未能达到预期效果，在标准数据集上的排序准确率低于 60%。这表明现有方法未能充分捕捉序列内的理想偏好关系。为应对这一挑战，本文提出了直接排序偏好优化（Direct Ranking Preference Optimization, DRPO），这是一种将人类偏好对齐视为排序学习（Learning-to-Rank, LTR）任务的新方法。DRPO 利用广泛使用的 LTR 指标 NDCG 来优化基于偏好数据的响应列表排序，从而提高排序准确率。由于 NDCG 的不可微性，我们提出了 diffNDCG 损失，这是一种通过排序网络模拟 NDCG 的可微近似。此外，为提高生成响应的质量，我们提出了一种基于边际的自适应排序策略评分。大量实验表明，DRPO 优于现有的基线方法，显著提升了生成响应的质量。

[NLP-89] Yesterdays News: Benchmarking Multi-Dimensional Out-of-Distribution Generalisation of Misinformation Detection Models

【速读】：该论文试图解决生成式 AI 模型在处理虚假信息（misinformation）时面临的分布外泛化（out-of-distribution generalisation）问题。虚假信息的变化速度远快于人工标注的速度，导致训练数据与推理数据之间存在分布偏移。因此，模型需要具备在不同分布下进行泛化的能力，而现有数据集对此问题的研究较少。

解决方案的关键在于引入了一个名为 misinfo-general 的基准数据集，该数据集用于评估模型在分布外泛化方面的能力。论文识别了六个泛化轴：时间（time）、事件（event）、主题（topic）、发布者（publisher）、政治偏见（political bias）和虚假信息类型（misinformation type），并设计了相应的评估程序。此外，论文还分析了一些基线模型，指出了它们在满足重要需求方面的不足。

链接: https://arxiv.org/abs/2410.18122
作者: Ivo Verhoeven,Pushkar Mishra,Ekaterina Shutova
关键词-EN: paper introduces misinfo-general, evaluating misinformation models’, misinformation models’ ability, introduces misinfo-general, paper introduces
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper introduces misinfo-general, a benchmark dataset for evaluating misinformation models’ ability to perform out-of-distribution generalisation. Misinformation changes rapidly, much quicker than moderators can annotate at scale, resulting in a shift between the training and inference data distributions. As a result, misinformation models need to be able to perform out-of-distribution generalisation, an understudied problem in existing datasets. We identify 6 axes of generalisation-time, event, topic, publisher, political bias, misinformation type-and design evaluation procedures for each. We also analyse some baseline models, highlighting how these fail important desiderata.
摘要：本文介绍了 misinfo-general，这是一个用于评估错误信息模型进行分布外泛化能力的基准数据集。错误信息的演变速度远快于大规模标注的速度，导致训练数据和推理数据之间的分布发生偏移。因此，错误信息模型需要具备分布外泛化的能力，这是现有数据集中研究不足的问题。我们识别了六个泛化轴——时间、事件、主题、发布者、政治偏见、错误信息类型——并为每个轴设计了评估程序。此外，我们还分析了一些基线模型，突显了这些模型在重要需求上的失败之处。

[NLP-90] Improving Embedding Accuracy for Document Retrieval Using Entity Relationship Maps and Model-Aware Contrastive Sampling

【速读】：该论文试图解决在文档检索增强生成 (Document Retrieval-Augmented Generation, RAG) 任务中，如何提高文本特征提取模型的事实聚焦能力。解决方案的关键在于采用了两种训练技术：

预收敛中断微调 (Pre-convergence interrupted fine-tuning)：使用结构化实体关系图 (Structured Entity Relationship Maps) 作为训练数据输入，旨在引导模型关注事实内容而非语义风格，从而提升纯文本性能。
模型感知对比采样 (Model-Aware Contrastive Sampling)：创建一个平衡且均匀分布的硬负样本和软负样本集合，这些样本直接基于基础模型的能力生成。

这两种方法的结合显著提升了模型在纯文本查询/文档对检索中的表现，达到了90.86%的绝对排名@1准确率，相较于其他领先模型提高了6.26%，并且在训练数据输入上下文大小方面平均减少了37.71%。

链接: https://arxiv.org/abs/2410.18105
作者: Thea Aviss
关键词-EN: Advanced Processing, Processing for Epistemic, Document Retrieval-Augmented Generation, Retrieval-Augmented Generation, Structured Entity Relationship
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 Pages, 9 Figures

点击查看摘要

Abstract:In this paper we present APEX-Embedding-7B (Advanced Processing for Epistemic eXtraction), a 7-billion parameter decoder-only text Feature Extraction Model, specifically designed for Document Retrieval-Augmented Generation (RAG) tasks. Our approach employs two training techniques that yield an emergent improvement in factual focus: (1) Pre-convergence interrupted fine-tuning using Structured Entity Relationship Maps as training data input: designed to shift the model’s attention and create a bias towards factual content rather than semantic style - this enhances plain text performance despite not being directly trained for it; and (2) Model-Aware Contrastive Sampling, creating a balanced and evenly distributed collation map of hard and soft negatives directly informed by the base model’s competency. This combined methodology yields significant improvements, enhancing plain text query/document pair retrieval to achieve an absolute rank@1 accuracy of 90.86% (an increase of 6.26% compared to the next leading model) in our evaluation, and reducing training data input context size by an average of 37.71% compared to plain text for both queries and document texts. Based on our evaluations, our model establishes a new state-of-the-art standard in text feature extraction for longer context document retrieval tasks.
摘要：本文介绍了 APEX-Embedding-7B（高级知识提取处理模型），这是一个专为文档检索增强生成 (RAG) 任务设计的 70 亿参数解码器专用文本特征提取模型。我们的方法采用了两种训练技术，显著提升了事实关注度：(1) 使用结构化实体关系图作为训练数据输入的预收敛中断微调：旨在转移模型的注意力，使其偏向于事实内容而非语义风格，尽管未直接针对此进行训练，但增强了纯文本性能；(2) 模型感知对比采样，创建了一个由基础模型能力直接指导的硬负样本和软负样本平衡且均匀分布的整理图。这种综合方法带来了显著改进，将纯文本查询/文档对检索的绝对排名@1 准确率提升至 90.86%（比次优模型提高了 6.26%），并将训练数据输入上下文大小平均减少了 37.71%（相比纯文本查询和文档文本）。基于我们的评估，该模型为长上下文文档检索任务的文本特征提取设立了新的行业标准。

[NLP-91] RingGesture: A Ring-Based Mid-Air Gesture Typing System Powered by a Deep-Learning Word Prediction Framework

【速读】：该论文试图解决轻量级增强现实（AR）眼镜在手部追踪中由于摄像头数量限制而导致的输入设备不足的问题。解决方案的关键在于提出了一种基于戒指的空中手势输入技术，称为RingGesture。该技术利用电极标记手势轨迹的起点和终点，并结合惯性测量单元（IMU）传感器进行手部追踪，从而实现类似于VR头戴设备中的射线投射式空中手势输入，将手部运动无缝转换为光标导航。

为了提高输入的准确性和速度，论文还提出了一种新颖的深度学习单词预测框架，称为Score Fusion。该框架由三个主要组件构成：a) 单词手势解码模型，b) 空间拼写校正模型，c) 轻量级上下文语言模型。通过融合这三个模型的评分，Score Fusion能够更精确地预测最可能的单词，从而显著提升输入速度和准确性。

链接: https://arxiv.org/abs/2410.18100
作者: Junxiao Shen,Roger Boldu,Arpit Kalla,Michael Glueck,Hemant Bhaskar Surale Amy Karlson
关键词-EN: lightweight augmented reality, modern computing experience, augmented reality, critical capability, modern computing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Text entry is a critical capability for any modern computing experience, with lightweight augmented reality (AR) glasses being no exception. Designed for all-day wearability, a limitation of lightweight AR glass is the restriction to the inclusion of multiple cameras for extensive field of view in hand tracking. This constraint underscores the need for an additional input device. We propose a system to address this gap: a ring-based mid-air gesture typing technique, RingGesture, utilizing electrodes to mark the start and end of gesture trajectories and inertial measurement units (IMU) sensors for hand tracking. This method offers an intuitive experience similar to raycast-based mid-air gesture typing found in VR headsets, allowing for a seamless translation of hand movements into cursor navigation. To enhance both accuracy and input speed, we propose a novel deep-learning word prediction framework, Score Fusion, comprised of three key components: a) a word-gesture decoding model, b) a spatial spelling correction model, and c) a lightweight contextual language model. In contrast, this framework fuses the scores from the three models to predict the most likely words with higher precision. We conduct comparative and longitudinal studies to demonstrate two key findings: firstly, the overall effectiveness of RingGesture, which achieves an average text entry speed of 27.3 words per minute (WPM) and a peak performance of 47.9 WPM. Secondly, we highlight the superior performance of the Score Fusion framework, which offers a 28.2% improvement in uncorrected Character Error Rate over a conventional word prediction framework, Naive Correction, leading to a 55.2% improvement in text entry speed for RingGesture. Additionally, RingGesture received a System Usability Score of 83 signifying its excellent usability.
摘要：文本输入是现代计算体验中的关键能力，轻量级增强现实（AR）眼镜也不例外。轻量级AR眼镜设计为全天可穿戴，其局限性在于无法包含多个摄像头以实现广泛的手部追踪视野。这一限制凸显了需要额外的输入设备。我们提出了一种系统来填补这一空白：基于戒指的空中手势输入技术，RingGesture，利用电极标记手势轨迹的起点和终点，并使用惯性测量单元（IMU）传感器进行手部追踪。这种方法提供了类似于VR头显中基于射线的空中手势输入的直观体验，允许手部运动无缝转换为光标导航。为了提高准确性和输入速度，我们提出了一种新颖的深度学习单词预测框架，Score Fusion，由三个关键组件组成：a) 单词手势解码模型，b) 空间拼写校正模型，和 c) 轻量级上下文语言模型。该框架通过融合三个模型的分数来预测最可能的单词，从而提高精度。我们进行了比较和纵向研究，以展示两个关键发现：首先，RingGesture的整体有效性，其平均文本输入速度达到27.3词每分钟（WPM），峰值性能为47.9 WPM。其次，我们强调了Score Fusion框架的优越性能，与传统的单词预测框架Naive Correction相比，未校正字符错误率提高了28.2%，从而使RingGesture的文本输入速度提高了55.2%。此外，RingGesture获得了83分的系统可用性评分，表明其出色的可用性。

[NLP-92] Gesture2Text: A Generalizable Decoder for Word-Gesture Keyboards in XR Through Trajectory Coarse Discretization and Pre-training

【速读】：该论文试图解决在扩展现实（Extended Reality, XR）环境中使用词手势键盘（word-gesture keyboards, WGK）进行文本输入时，由于交互模式、键盘尺寸和视觉反馈的多样性导致的词手势轨迹数据模式复杂性问题。具体来说，现有的模板匹配解码方法（如SHARK^2）在处理噪声轨迹时存在解码不准确的问题，而传统的基于神经网络的解码器（neural decoders）虽然能提高准确性，但需要大量数据进行训练且实现复杂。

解决方案的关键在于提出了一种新的通用神经解码器，该解码器通过在大规模粗略离散化的词手势轨迹数据上进行预训练来实现。这种方法不仅易于实现，而且具有高解码准确性。预训练的神经解码器在增强现实（Augmented Reality, AR）和虚拟现实（Virtual Reality, VR）环境中的空中和表面WGK系统中表现出良好的通用性，平均Top-4准确率达到90.4%，显著优于SHARK^2和传统神经解码器。此外，该解码器在量化后仅4 MB大小，且能在Quest 3上以97毫秒的实时速度运行，不牺牲准确性。

链接: https://arxiv.org/abs/2410.18099
作者: Junxiao Shen,Khadija Khaldi,Enmin Zhou,Hemant Bhaskar Surale,Amy Karlson
关键词-EN: Extended Reality, Text entry, interaction for Extended, WGK, WGK systems
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Graphics (cs.GR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Text entry with word-gesture keyboards (WGK) is emerging as a popular method and becoming a key interaction for Extended Reality (XR). However, the diversity of interaction modes, keyboard sizes, and visual feedback in these environments introduces divergent word-gesture trajectory data patterns, thus leading to complexity in decoding trajectories into text. Template-matching decoding methods, such as SHARK^2, are commonly used for these WGK systems because they are easy to implement and configure. However, these methods are susceptible to decoding inaccuracies for noisy trajectories. While conventional neural-network-based decoders (neural decoders) trained on word-gesture trajectory data have been proposed to improve accuracy, they have their own limitations: they require extensive data for training and deep-learning expertise for implementation. To address these challenges, we propose a novel solution that combines ease of implementation with high decoding accuracy: a generalizable neural decoder enabled by pre-training on large-scale coarsely discretized word-gesture trajectories. This approach produces a ready-to-use WGK decoder that is generalizable across mid-air and on-surface WGK systems in augmented reality (AR) and virtual reality (VR), which is evident by a robust average Top-4 accuracy of 90.4% on four diverse datasets. It significantly outperforms SHARK^2 with a 37.2% enhancement and surpasses the conventional neural decoder by 7.4%. Moreover, the Pre-trained Neural Decoder’s size is only 4 MB after quantization, without sacrificing accuracy, and it can operate in real-time, executing in just 97 milliseconds on Quest 3.
摘要：词手势键盘（Word-Gesture Keyboards, WGK）作为一种新兴的文本输入方法，正逐渐成为扩展现实（Extended Reality, XR）中的关键交互方式。然而，这些环境中交互模式的多样性、键盘尺寸以及视觉反馈的差异，导致了词手势轨迹数据模式的多样性，从而增加了将轨迹解码为文本的复杂性。模板匹配解码方法，如SHARK^2，因其易于实现和配置而被广泛应用于这些WGK系统中。然而，这些方法在处理噪声轨迹时容易出现解码不准确的问题。尽管基于神经网络的传统解码器（神经解码器）已经在词手势轨迹数据上进行了训练，以提高解码精度，但它们也存在自身的局限性：需要大量的训练数据和深度学习专业知识。为了应对这些挑战，我们提出了一种结合了易于实现和高解码精度的新解决方案：一种通过在大规模粗略离散化词手势轨迹上预训练而实现的可泛化神经解码器。这种方法产生了一个即用型的WGK解码器，能够在增强现实（Augmented Reality, AR）和虚拟现实（Virtual Reality, VR）中的空中和表面WGK系统中通用，这在四个多样化的数据集上得到了验证，其平均Top-4准确率达到了90.4%。该解码器显著优于SHARK^2，提升了37.2%，并且比传统神经解码器高出7.4%。此外，经过量化的预训练神经解码器的大小仅为4 MB，且在不影响准确性的前提下，能够在Quest 3上以97毫秒的速度实时运行。

[NLP-93] M3EL: A Multi-task Multi-topic Dataset for Multi-modal Entity Linking

【速读】：该论文试图解决多模态实体链接 (Multi-modal Entity Linking, MEL) 领域中现有数据集规模小、主题类型稀缺以及任务覆盖有限的问题。这些问题导致现有数据集无法有效提升多模态模型的实体链接能力。

解决方案的关键在于：

数据集构建：提出并发布了一个大规模的多模态实体链接数据集 M^3EL，包含 79,625 个实例，覆盖 9 种多模态任务和 5 种不同主题。
模态增强训练策略：提出了一种模态增强的训练策略，以提高模型对多模态任务的适应性。
模型训练与评估：利用 M^3EL 数据集，基于 CLIP (ViT-B-32) 模型训练了 CLIP_ND 模型，并与现有的多模态基线模型进行比较。实验结果表明，M^3EL 数据集有效解决了现有数据集的问题，CLIP_ND 模型在各种任务上的准确率显著提升，平均提高了 9.3% 到 25%。

链接: https://arxiv.org/abs/2410.18096
作者: Fang Wang,Shenglin Yin,Xiaoying Bai,Minghao Hu,Tianwei Yan,Yi Liang
关键词-EN: Entity Linking, Multi-modal Entity Linking, textit, entity linking capabilities, fundamental component
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-modal Entity Linking (MEL) is a fundamental component for various downstream tasks. However, existing MEL datasets suffer from small scale, scarcity of topic types and limited coverage of tasks, making them incapable of effectively enhancing the entity linking capabilities of multi-modal models. To address these obstacles, we propose a dataset construction pipeline and publish M^3EL , a large-scale dataset for MEL. M^3EL includes 79,625 instances, covering 9 diverse multi-modal tasks, and 5 different topics. In addition, to further improve the model’s adaptability to multi-modal tasks, We propose a modality-augmented training strategy. Utilizing M^3EL as a corpus, train the \textitCLIP_\textitND model based on \textitCLIP (\textitViT-\textitB-\textit32) , and conduct a comparative analysis with an existing multi-modal baselines. Experimental results show that the existing models perform far below expectations (ACC of 49.4%-75.8%), After analysis, it was obtained that small dataset sizes, insufficient modality task coverage, and limited topic diversity resulted in poor generalisation of multi-modal models. Our dataset effectively addresses these issues, and the \textitCLIP_\textitND model fine-tuned with M^3EL shows a significant improvement in accuracy, with an average improvement of 9.3% to 25% across various tasks. Our dataset is available at this https URL.
摘要：多模态实体链接 (Multi-modal Entity Linking, MEL) 是多种下游任务的基础组件。然而，现有的 MEL 数据集存在规模小、主题类型稀缺以及任务覆盖范围有限的问题，无法有效提升多模态模型的实体链接能力。为解决这些障碍，我们提出了一种数据集构建流程，并发布了 M^3EL，一个大规模的 MEL 数据集。M^3EL 包含 79,625 个实例，涵盖 9 种多样化的多模态任务和 5 种不同主题。此外，为进一步提高模型对多模态任务的适应性，我们提出了一种模态增强训练策略。利用 M^3EL 作为语料库，基于 \textit{CLIP} (\textit{ViT}-\textit{B}-\textit{32}) 模型训练 \textit{CLIP}\textit{ND} 模型，并与现有的多模态基线进行对比分析。实验结果表明，现有模型表现远低于预期（准确率在 49.4%-75.8% 之间）。分析发现，数据集规模小、模态任务覆盖不足以及主题多样性有限导致多模态模型的泛化能力较差。我们的数据集有效解决了这些问题，经过 M^3EL 微调的 \textit{CLIP}\textit{ND} 模型在准确率上显著提升，平均提升幅度在 9.3% 至 25% 之间，涵盖了多种任务。我们的数据集可通过此 https URL 获取。

[NLP-94] Speech perception: a model of word recognition

【速读】：该论文试图解决的问题是如何在考虑声音之间相关性的情况下，建立一个有效的语音感知模型。解决方案的关键在于：

模型设计：提出了一种基于吸引子（attractors）的下降动力学（descent dynamics）模型，用于描述语音感知过程。
词汇生成：通过该模型生成的词汇库（lexicon）具有合理的词长分布，即短词丰富，长词较少。
解码策略：分别研究了在存在听觉错误的情况下，短词和长词的解码行为。对于短词，算法能够快速检索或提出另一个有效词；而对于长词，虽然成功解码的速度仍然较快，但存在有限概率导致永久迷失，算法在合适的词汇景观中徘徊而无法确定一个词。

链接: https://arxiv.org/abs/2410.18590
作者: Jean-Marc Luck,Anita Mehta
关键词-EN: correlations between sounds, speech perception, account effects, effects of correlations, Words
类目: atistical Mechanics (cond-mat.stat-mech); Computation and Language (cs.CL)
备注: 22 pages, 19 figures, 1 table

点击查看摘要

Abstract:We present a model of speech perception which takes into account effects of correlations between sounds. Words in this model correspond to the attractors of a suitably chosen descent dynamics. The resulting lexicon is rich in short words, and much less so in longer ones, as befits a reasonable word length distribution. We separately examine the decryption of short and long words in the presence of mishearings. In the regime of short words, the algorithm either quickly retrieves a word, or proposes another valid word. In the regime of longer words, the behaviour is markedly different. While the successful decryption of words continues to be relatively fast, there is a finite probability of getting lost permanently, as the algorithm wanders round the landscape of suitable words without ever settling on one.
摘要：我们提出了一种考虑声音间相关性影响的语音感知模型。在该模型中，词语对应于一个适当选择的下降动力学的吸引子。由此产生的词典中，短词丰富，而长词则相对较少，这与合理的词长分布相符。我们分别研究了在存在误听情况下，短词和长词的解密过程。在短词的情况下，算法要么迅速检索到一个词，要么提出另一个有效的词。而在长词的情况下，行为则显著不同。尽管成功解密词的速度仍然相对较快，但存在有限概率导致永久迷失，即算法在合适的词的范围内徘徊，却始终无法确定一个词。

人工智能

[AI-0] PixelGaussian: Generalizable 3D Gaussian Reconstruction from Arbitrary Views

链接: https://arxiv.org/abs/2410.18979
作者: Xin Fei,Wenzhao Zheng,Yueqi Duan,Wei Zhan,Masayoshi Tomizuka,Kurt Keutzer,Jiwen Lu
关键词-EN: efficient feed-forward framework, Gaussian, Cascade Gaussian Adapter, learning generalizable, feed-forward framework
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Code is available at: this https URL

点击查看摘要

Abstract:We propose PixelGaussian, an efficient feed-forward framework for learning generalizable 3D Gaussian reconstruction from arbitrary views. Most existing methods rely on uniform pixel-wise Gaussian representations, which learn a fixed number of 3D Gaussians for each view and cannot generalize well to more input views. Differently, our PixelGaussian dynamically adapts both the Gaussian distribution and quantity based on geometric complexity, leading to more efficient representations and significant improvements in reconstruction quality. Specifically, we introduce a Cascade Gaussian Adapter to adjust Gaussian distribution according to local geometry complexity identified by a keypoint scorer. CGA leverages deformable attention in context-aware hypernetworks to guide Gaussian pruning and splitting, ensuring accurate representation in complex regions while reducing redundancy. Furthermore, we design a transformer-based Iterative Gaussian Refiner module that refines Gaussian representations through direct image-Gaussian interactions. Our PixelGaussian can effectively reduce Gaussian redundancy as input views increase. We conduct extensive experiments on the large-scale ACID and RealEstate10K datasets, where our method achieves state-of-the-art performance with good generalization to various numbers of views. Code: this https URL.

[AI-1] 3D-Adapter: Geometry-Consistent Multi-View Diffusion for High-Quality 3D Generation

链接: https://arxiv.org/abs/2410.18974
作者: Hansheng Chen,Bokui Shen,Yulin Liu,Ruoxi Shi,Linqi Zhou,Connor Z. Lin,Jiayuan Gu,Hao Su,Gordon Wetzstein,Leonidas Guibas
关键词-EN: significantly advanced open-domain, image diffusion models, advanced open-domain, significantly advanced, Multi-view image diffusion
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Project page: this https URL

点击查看摘要

Abstract:Multi-view image diffusion models have significantly advanced open-domain 3D object generation. However, most existing models rely on 2D network architectures that lack inherent 3D biases, resulting in compromised geometric consistency. To address this challenge, we introduce 3D-Adapter, a plug-in module designed to infuse 3D geometry awareness into pretrained image diffusion models. Central to our approach is the idea of 3D feedback augmentation: for each denoising step in the sampling loop, 3D-Adapter decodes intermediate multi-view features into a coherent 3D representation, then re-encodes the rendered RGBD views to augment the pretrained base model through feature addition. We study two variants of 3D-Adapter: a fast feed-forward version based on Gaussian splatting and a versatile training-free version utilizing neural fields and meshes. Our extensive experiments demonstrate that 3D-Adapter not only greatly enhances the geometry quality of text-to-multi-view models such as Instant3D and Zero123++, but also enables high-quality 3D generation using the plain text-to-image Stable Diffusion. Furthermore, we showcase the broad application potential of 3D-Adapter by presenting high quality results in text-to-3D, image-to-3D, text-to-texture, and text-to-avatar tasks.

[AI-2] Deep Insights into Cognitive Decline: A Survey of Leveraging Non-Intrusive Modalities with Deep Learning Techniques

链接: https://arxiv.org/abs/2410.18972
作者: David Ortiz-Perez,Manuel Benavent-Lledo,Jose Garcia-Rodriguez,David Tomás,M. Flores Vizcaya-Moreno
关键词-EN: reduced cognitive abilities, Cognitive decline, part of aging, natural part, resulting in reduced
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Cognitive decline is a natural part of aging, often resulting in reduced cognitive abilities. In some cases, however, this decline is more pronounced, typically due to disorders such as Alzheimer’s disease. Early detection of anomalous cognitive decline is crucial, as it can facilitate timely professional intervention. While medical data can help in this detection, it often involves invasive procedures. An alternative approach is to employ non-intrusive techniques such as speech or handwriting analysis, which do not necessarily affect daily activities. This survey reviews the most relevant methodologies that use deep learning techniques to automate the cognitive decline estimation task, including audio, text, and visual processing. We discuss the key features and advantages of each modality and methodology, including state-of-the-art approaches like Transformer architecture and foundation models. In addition, we present works that integrate different modalities to develop multimodal models. We also highlight the most significant datasets and the quantitative results from studies using these resources. From this review, several conclusions emerge. In most cases, the textual modality achieves the best results and is the most relevant for detecting cognitive decline. Moreover, combining various approaches from individual modalities into a multimodal model consistently enhances performance across nearly all scenarios.

[AI-3] ConceptDrift: Uncovering Biases through the Lens of Foundational Models

链接: https://arxiv.org/abs/2410.18970
作者: Cristian Daniel Păduraru,Antonio Bărbălau,Radu Filipescu,Andrei Liviu Nicolicioiu,Elena Burceanu
关键词-EN: intrinsic biases, pre-trained models, analysing misclassified samples, semi-automated human-computer validation, Abstract
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Datasets and pre-trained models come with intrinsic biases. Most methods rely on spotting them by analysing misclassified samples, in a semi-automated human-computer validation. In contrast, we propose ConceptDrift, a method which analyzes the weights of a linear probe, learned on top a foundational model. We capitalize on the weight update trajectory, which starts from the embedding of the textual representation of the class, and proceeds to drift towards embeddings that disclose hidden biases. Different from prior work, with this approach we can pin-point unwanted correlations from a dataset, providing more than just possible explanations for the wrong predictions. We empirically prove the efficacy of our method, by significantly improving zero-shot performance with biased-augmented prompting. Our method is not bounded to a single modality, and we experiment in this work with both image (Waterbirds, CelebA, Nico++) and text datasets (CivilComments).

[AI-4] Context is Key: A Benchmark for Forecasting with Essential Textual Information

链接: https://arxiv.org/abs/2410.18959
作者: Andrew Robert Williams,Arjun Ashok,Étienne Marcotte,Valentina Zantedeschi,Jithendaraa Subramanian,Roland Riachi,James Requeima,Alexandre Lacoste,Irina Rish,Nicolas Chapados,Alexandre Drouin
关键词-EN: task in decision, decision making, models, Forecasting, Abstract
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: Preprint; under review. First two authors contributed equally

点击查看摘要

Abstract:Forecasting is a critical task in decision making across various domains. While numerical data provides a foundation, it often lacks crucial context necessary for accurate predictions. Human forecasters frequently rely on additional information, such as background knowledge or constraints, which can be efficiently communicated through natural language. However, the ability of existing forecasting models to effectively integrate this textual information remains an open question. To address this, we introduce “Context is Key” (CiK), a time series forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context, requiring models to integrate both modalities. We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters, and propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark. Our experiments highlight the importance of incorporating contextual information, demonstrate surprising performance when using LLM-based forecasting models, and also reveal some of their critical shortcomings. By presenting this benchmark, we aim to advance multimodal forecasting, promoting models that are both accurate and accessible to decision-makers with varied technical expertise. The benchmark can be visualized at this https URL .

[AI-5] ANAVI: Audio Noise Awareness using Visuals of Indoor environments for NAVIgation

链接: https://arxiv.org/abs/2410.18932
作者: Vidhi Jain,Rishi Veerapaneni,Yonatan Bisk
关键词-EN: robot path planning, propose Audio Noise, quieter robot path, Audio Noise Awareness, path planning
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 8th Conference on Robot Learning (CoRL) 2024

点击查看摘要

Abstract:We propose Audio Noise Awareness using Visuals of Indoors for NAVIgation for quieter robot path planning. While humans are naturally aware of the noise they make and its impact on those around them, robots currently lack this awareness. A key challenge in achieving audio awareness for robots is estimating how loud will the robot’s actions be at a listener’s location? Since sound depends upon the geometry and material composition of rooms, we train the robot to passively perceive loudness using visual observations of indoor environments. To this end, we generate data on how loud an ‘impulse’ sounds at different listener locations in simulated homes, and train our Acoustic Noise Predictor (ANP). Next, we collect acoustic profiles corresponding to different actions for navigation. Unifying ANP with action acoustics, we demonstrate experiments with wheeled (Hello Robot Stretch) and legged (Unitree Go2) robots so that these robots adhere to the noise constraints of the environment. See code and data at this https URL

[AI-6] SegLLM : Multi-round Reasoning Segmentation

链接: https://arxiv.org/abs/2410.18923
作者: XuDong Wang,Shaolun Zhang,Shufan Li,Konstantinos Kallidromitis,Kehan Li,Yusuke Kato,Kazuki Kozuka,Trevor Darrell
关键词-EN: exploiting conversational memory, textual outputs, exploiting conversational, conversational memory, interactive reasoning segmentation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 22 pages, 10 figures, 11 tables

点击查看摘要

Abstract:We present SegLLM, a novel multi-round interactive reasoning segmentation model that enhances LLM-based segmentation by exploiting conversational memory of both visual and textual outputs. By leveraging a mask-aware multimodal LLM, SegLLM re-integrates previous segmentation results into its input stream, enabling it to reason about complex user intentions and segment objects in relation to previously identified entities, including positional, interactional, and hierarchical relationships, across multiple interactions. This capability allows SegLLM to respond to visual and text queries in a chat-like manner. Evaluated on the newly curated MRSeg benchmark, SegLLM outperforms existing methods in multi-round interactive reasoning segmentation by over 20%. Additionally, we observed that training on multi-round reasoning segmentation data enhances performance on standard single-round referring segmentation and localization tasks, resulting in a 5.5% increase in cIoU for referring expression segmentation and a 4.5% improvement in Acc@0.5 for referring expression localization.

[AI-7] Dynamic 3D Gaussian Tracking for Graph-Based Neural Dynamics Modeling

链接: https://arxiv.org/abs/2410.18912
作者: Mingtong Zhang,Kaifeng Zhang,Yunzhu Li
关键词-EN: encode rich information, objects encode rich, encode rich, Graph Neural Networks, dynamics
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Project Page: this https URL

点击查看摘要

Abstract:Videos of robots interacting with objects encode rich information about the objects’ dynamics. However, existing video prediction approaches typically do not explicitly account for the 3D information from videos, such as robot actions and objects’ 3D states, limiting their use in real-world robotic applications. In this work, we introduce a framework to learn object dynamics directly from multi-view RGB videos by explicitly considering the robot’s action trajectories and their effects on scene dynamics. We utilize the 3D Gaussian representation of 3D Gaussian Splatting (3DGS) to train a particle-based dynamics model using Graph Neural Networks. This model operates on sparse control particles downsampled from the densely tracked 3D Gaussian reconstructions. By learning the neural dynamics model on offline robot interaction data, our method can predict object motions under varying initial configurations and unseen robot actions. The 3D transformations of Gaussians can be interpolated from the motions of control particles, enabling the rendering of predicted future object states and achieving action-conditioned video prediction. The dynamics model can also be applied to model-based planning frameworks for object manipulation tasks. We conduct experiments on various kinds of deformable materials, including ropes, clothes, and stuffed animals, demonstrating our framework’s ability to model complex shapes and dynamics. Our project page is available at this https URL.

[AI-8] SkillM imicGen: Automated Demonstration Generation for Efficient Skill Learning and Deployment

链接: https://arxiv.org/abs/2410.18907
作者: Caelan Garrett,Ajay Mandlekar,Bowen Wen,Dieter Fox
关键词-EN: Imitation learning, costly and resource-intensive, effective paradigm, paradigm for robot, acquiring large datasets
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Imitation learning from human demonstrations is an effective paradigm for robot manipulation, but acquiring large datasets is costly and resource-intensive, especially for long-horizon tasks. To address this issue, we propose SkillMimicGen (SkillGen), an automated system for generating demonstration datasets from a few human demos. SkillGen segments human demos into manipulation skills, adapts these skills to new contexts, and stitches them together through free-space transit and transfer motion. We also propose a Hybrid Skill Policy (HSP) framework for learning skill initiation, control, and termination components from SkillGen datasets, enabling skills to be sequenced using motion planning at test-time. We demonstrate that SkillGen greatly improves data generation and policy learning performance over a state-of-the-art data generation framework, resulting in the capability to produce data for large scene variations, including clutter, and agents that are on average 24% more successful. We demonstrate the efficacy of SkillGen by generating over 24K demonstrations across 18 task variants in simulation from just 60 human demonstrations, and training proficient, often near-perfect, HSP agents. Finally, we apply SkillGen to 3 real-world manipulation tasks and also demonstrate zero-shot sim-to-real transfer on a long-horizon assembly task. Videos, and more at this https URL.

[AI-9] Creating and Repairing Robot Programs in Open-World Domains ACL

链接: https://arxiv.org/abs/2410.18893
作者: Claire Schlesinger,Arjun Guha,Joydeep Biswas
关键词-EN: Large Language Models, Language Models, Large Language, produce robot programs, natural language
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Under review at ACL Rolling Review

点击查看摘要

Abstract:Using Large Language Models (LLMs) to produce robot programs from natural language has allowed for robot systems that can complete a higher diversity of tasks. However, LLM-generated programs may be faulty, either due to ambiguity in instructions, misinterpretation of the desired task, or missing information about the world state. As these programs run, the state of the world changes and they gather new information. When a failure occurs, it is important that they recover from the current world state and avoid repeating steps that they they previously completed successfully. We propose RoboRepair, a system which traces the execution of a program up until error, and then runs an LLM-produced recovery program that minimizes repeated actions. To evaluate the efficacy of our system, we create a benchmark consisting of eleven tasks with various error conditions that require the generation of a recovery program. We compare the efficiency of the recovery program to a plan built with an oracle that has foreknowledge of future errors. Comments: Under review at ACL Rolling Review Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI) Cite as: arXiv:2410.18893 [cs.RO] (or arXiv:2410.18893v1 [cs.RO] for this version) https://doi.org/10.48550/arXiv.2410.18893 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-10] Improving Small-Scale Large Language Models Function Calling for Reasoning Tasks

链接: https://arxiv.org/abs/2410.18890
作者: Graziano A. Manduzio,Federico A. Galatolo,Mario G. C. A. Cimino,Enzo Pasquale Scilingo,Lorenzo Cominelli
关键词-EN: demonstrated exceptional capabilities, Large Language Models, Recent advancements, advancements in Large, natural language understanding
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have demonstrated exceptional capabilities in natural language understanding and generation. While these models excel in general complex reasoning tasks, they still face challenges in mathematical problem-solving and logical reasoning. To address these limitations, researchers have explored function calling abilities, allowing LLMs to execute provided functions and utilize their outputs for task completion. However, concentrating on specific tasks can be very inefficient for large-scale LLMs to be used, because of the expensive cost of training and inference stages they need in terms of computational resources. This study introduces a novel framework for training smaller language models in function calling, focusing on specific logical and mathematical reasoning tasks. The approach aims to improve performances of small-scale models for these tasks using function calling, ensuring a high level of accuracy. Our framework employs an agent that, given a problem and a set of callable functions, queries the LLM by injecting a description and examples of the usable functions into the prompt and managing their calls in a step-by-step reasoning chain. This process is used to create a dataset of correct and incorrect reasoning chain chat completions from a large-scale LLM. This dataset is used to train a smaller LLM using Reinforcement Learning from Human Feedback (RLHF), specifically employing the Direct Preference Optimization (DPO) technique. Experimental results demonstrate how the proposed approach balances the trade-off between model size and performance, improving the ability of function calling for reasoning tasks, in smaller models.

[AI-11] Diff-Instruct: Training One-step Text-to-image Generator Model to Align with Human Preferences

链接: https://arxiv.org/abs/2410.18881
作者: Weijian Luo
关键词-EN: swift inference efficiency, models offer advantages, flexible architectures, generation performance, inference efficiency
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:One-step text-to-image generator models offer advantages such as swift inference efficiency, flexible architectures, and state-of-the-art generation performance. In this paper, we study the problem of aligning one-step generator models with human preferences for the first time. Inspired by the success of reinforcement learning using human feedback (RLHF), we formulate the alignment problem as maximizing expected human reward functions while adding an Integral Kullback-Leibler divergence term to prevent the generator from diverging. By overcoming technical challenges, we introduce Diff-Instruct++ (DI++), the first, fast-converging and image data-free human preference alignment method for one-step text-to-image generators. We also introduce novel theoretical insights, showing that using CFG for diffusion distillation is secretly doing RLHF with DI++. Such an interesting finding brings understanding and potential contributions to future research involving CFG. In the experiment sections, we align both UNet-based and DiT-based one-step generators using DI++, which use the Stable Diffusion 1.5 and the PixelArt- \alpha as the reference diffusion processes. The resulting DiT-based one-step text-to-image model achieves a strong Aesthetic Score of 6.19 and an Image Reward of 1.24 on the COCO validation prompt dataset. It also achieves a leading Human preference Score (HPSv2.0) of 28.48, outperforming other open-sourced models such as Stable Diffusion XL, DMD2, SD-Turbo, as well as PixelArt- \alpha . Both theoretical contributions and empirical evidence indicate that DI++ is a strong human-preference alignment approach for one-step text-to-image models.

[AI-12] Guiding Empowerment Model: Liberating Neurodiversity in Online Higher Education

链接: https://arxiv.org/abs/2410.18876
作者: Hannah Beaux,Pegah Karimi,Otilia Pop,Rob Clark
关键词-EN: practice full paper, innovative practice full, full paper, innovative practice, practice full
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: 9 pages, 1 Figure, 1 Table, Accepted in FIE 2024

点击查看摘要

Abstract:In this innovative practice full paper, we address the equity gap for neurodivergent and situationally limited learners by identifying the spectrum of dynamic factors that impact learning and function. Educators have shown a growing interest in identifying learners’ cognitive abilities and learning preferences to measure their impact on academic achievement. Often institutions employ one-size-fits-all approaches leaving the burden on disabled students to self-advocate or tolerate inadequate support. Emerging frameworks guide neurodivergent learners through instructional approaches, such as online education. However, these frameworks fail to address holistic environmental needs or recommend technology interventions, particularly for those with undisclosed learning or developmental disabilities and situational limitations. In this article, we integrate a neurodivergent perspective through secondary research of around 100 articles to introduce a Guiding Empowerment Model involving key cognitive and situational factors that contextualize day-to-day experiences affecting learner ability. We synthesize three sample student profiles that highlight user problems in functioning. We use this model to evaluate sample learning platform features and other supportive technology solutions. The proposed approach augments frameworks such as Universal Design for Learning to consider factors including various sensory processing differences, social connection challenges, and environmental limitations. We suggest that by applying the mode through technology-enabled features such as customizable task management, guided varied content access, and guided multi-modal collaboration, major learning barriers of neurodivergent and situationally limited learners will be removed to activate the successful pursuit of their academic goals.

[AI-13] he Cat and Mouse Game: The Ongoing Arms Race Between Diffusion Models and Detection Methods

链接: https://arxiv.org/abs/2410.18866
作者: Linda Laurier,Ave Giulietta,Arlo Octavia,Meade Cleti
关键词-EN: offering unmatched realism, offering unmatched, unmatched realism, realism and control, synthetic media generation
类目: Artificial Intelligence (cs.AI)
*备注: 10 pages, 1 figure

点击查看摘要

Abstract:The emergence of diffusion models has transformed synthetic media generation, offering unmatched realism and control over content creation. These advancements have driven innovation across fields such as art, design, and scientific visualization. However, they also introduce significant ethical and societal challenges, particularly through the creation of hyper-realistic images that can facilitate deepfakes, misinformation, and unauthorized reproduction of copyrighted material. In response, the need for effective detection mechanisms has become increasingly urgent. This review examines the evolving adversarial relationship between diffusion model development and the advancement of detection methods. We present a thorough analysis of contemporary detection strategies, including frequency and spatial domain techniques, deep learning-based approaches, and hybrid models that combine multiple methodologies. We also highlight the importance of diverse datasets and standardized evaluation metrics in improving detection accuracy and generalizability. Our discussion explores the practical applications of these detection systems in copyright protection, misinformation prevention, and forensic analysis, while also addressing the ethical implications of synthetic media. Finally, we identify key research gaps and propose future directions to enhance the robustness and adaptability of detection methods in line with the rapid advancements of diffusion models. This review emphasizes the necessity of a comprehensive approach to mitigating the risks associated with AI-generated content in an increasingly digital world.

[AI-14] DL-Polycube: Deep learning enhanced polycube method for high-quality hexahedral mesh generation and volumetric spline construction

链接: https://arxiv.org/abs/2410.18852
作者: Yuxuan Yu,Yuzhuo Fang,Hua Tong,Yongjie Jessica Zhang
关键词-EN: generate high-quality hexahedral, polycube structures, surface triangular meshes, high-quality hexahedral, triangular meshes
类目: Computational Geometry (cs.CG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:In this paper, we present a novel algorithm that integrates deep learning with the polycube method (DL-Polycube) to generate high-quality hexahedral (hex) meshes, which are then used to construct volumetric splines for isogeometric analysis. Our DL-Polycube algorithm begins by establishing a connection between surface triangular meshes and polycube structures. We employ deep neural network to classify surface triangular meshes into their corresponding polycube structures. Following this, we combine the acquired polycube structural information with unsupervised learning to perform surface segmentation of triangular meshes. This step addresses the issue of segmentation not corresponding to a polycube while reducing manual intervention. Quality hex meshes are then generated from the polycube structures, with employing octree subdivision, parametric mapping and quality improvement techniques. The incorporation of deep learning for creating polycube structures, combined with unsupervised learning for segmentation of surface triangular meshes, substantially accelerates hex mesh generation. Finally, truncated hierarchical B-splines are constructed on the generated hex meshes. We extract trivariate Bézier elements from these splines and apply them directly in isogeometric analysis. We offer several examples to demonstrate the robustness of our DL-Polycube algorithm.

[AI-15] Expanding AI Awareness Through Everyday Interactions with AI: A Reflective Journal Study

链接: https://arxiv.org/abs/2410.18845
作者: Ashish Hingle,Aditya Johri
关键词-EN: continues to expand, programs are poised, producers and users, technology programs, interactions
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: Accepted and presented at the Frontiers in Education 2024 (FIE2024)

点击查看摘要

Abstract:As the application of AI continues to expand, students in technology programs are poised to be both producers and users of the technologies. They are also positioned to engage with AI applications within and outside the classroom. While focusing on the curriculum when examining students’ AI knowledge is common, extending this connection to students’ everyday interactions with AI provides a more complete picture of their learning. In this paper, we explore student’s awareness and engagement with AI in the context of school and their daily lives. Over six weeks, 22 undergraduate students participated in a reflective journal study and submitted a weekly journal entry about their interactions with AI. The participants were recruited from a technology and society course that focuses on the implications of technology on people, communities, and processes. In their weekly journal entries, participants reflected on interactions with AI on campus (coursework, advertises campus events, or seminars) and beyond (social media, news, or conversations with friends and family). The journal prompts were designed to help them think through what they had read, watched, or been told and reflect on the development of their own perspectives, knowledge, and literacy on the topic. Overall, students described nine categories of interactions: coursework, news and current events, using software and applications, university events, social media related to their work, personal discussions with friends and family, interacting with content, and gaming. Students reported that completing the diaries allowed them time for reflection and made them more aware of the presence of AI in their daily lives and of its potential benefits and drawbacks. This research contributes to the ongoing work on AI awareness and literacy by bringing in perspectives from beyond a formal educational context.

[AI-16] Learning to Explore with Lagrangians for Bandits under Unknown Linear Constraints

链接: https://arxiv.org/abs/2410.18844
作者: Udvas Das,Debabrota Basu
关键词-EN: conducting user studies, models multiple real-world, decision space naturally, multiple real-world problems, bandits models multiple
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Pure exploration in bandits models multiple real-world problems, such as tuning hyper-parameters or conducting user studies, where different safety, resource, and fairness constraints on the decision space naturally appear. We study these problems as pure exploration in multi-armed bandits with unknown linear constraints, where the aim is to identify an r \textit-good feasible policy . First, we propose a Lagrangian relaxation of the sample complexity lower bound for pure exploration under constraints. We show how this lower bound evolves with the sequential estimation of constraints. Second, we leverage the Lagrangian lower bound and the properties of convex optimisation to propose two computationally efficient extensions of Track-and-Stop and Gamified Explorer, namely LATS and LAGEX. To this end, we propose a constraint-adaptive stopping rule, and while tracking the lower bound, use pessimistic estimate of the feasible set at each step. We show that these algorithms achieve asymptotically optimal sample complexity upper bounds up to constraint-dependent constants. Finally, we conduct numerical experiments with different reward distributions and constraints that validate efficient performance of LAGEX and LATS with respect to baselines.

[AI-17] From Efficiency to Equity: Measuring Fairness in Preference Learning

链接: https://arxiv.org/abs/2410.18841
作者: Shreeyash Gowaikar,Hugo Berard,Rashid Mushkani,Shin Koseki
关键词-EN: increasingly influence decision-making, fairly represent diverse, increasingly influence, influence decision-making, fairly represent
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As AI systems, particularly generative models, increasingly influence decision-making, ensuring that they are able to fairly represent diverse human preferences becomes crucial. This paper introduces a novel framework for evaluating epistemic fairness in preference learning models inspired by economic theories of inequality and Rawlsian justice. We propose metrics adapted from the Gini Coefficient, Atkinson Index, and Kuznets Ratio to quantify fairness in these models. We validate our approach using two datasets: a custom visual preference dataset (AI-EDI-Space) and the Jester Jokes dataset. Our analysis reveals variations in model performance across users, highlighting potential epistemic injustices. We explore pre-processing and in-processing techniques to mitigate these inequalities, demonstrating a complex relationship between model efficiency and fairness. This work contributes to AI ethics by providing a framework for evaluating and improving epistemic fairness in preference learning models, offering insights for developing more inclusive AI systems in contexts where diverse human preferences are crucial.

[AI-18] owards Visual Text Design Transfer Across Languages

链接: https://arxiv.org/abs/2410.18823
作者: Yejin Choi,Jiwan Chung,Sumin Shim,Giyeong Oh,Youngjae Yu
关键词-EN: album covers, visual text generation, plays a critical, critical role, film posters
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Visual text design plays a critical role in conveying themes, emotions, and atmospheres in multimodal formats such as film posters and album covers. Translating these visual and textual elements across languages extends the concept of translation beyond mere text, requiring the adaptation of aesthetic and stylistic features. To address this, we introduce a novel task of Multimodal Style Translation (MuST-Bench), a benchmark designed to evaluate the ability of visual text generation models to perform translation across different writing systems while preserving design intent. Our initial experiments on MuST-Bench reveal that existing visual text generation models struggle with the proposed task due to the inadequacy of textual descriptions in conveying visual design. In response, we introduce SIGIL, a framework for multimodal style translation that eliminates the need for style descriptions. SIGIL enhances image generation models through three innovations: glyph latent for multilingual settings, pretrained VAEs for stable style guidance, and an OCR model with reinforcement learning feedback for optimizing readable character generation. SIGIL outperforms existing baselines by achieving superior style consistency and legibility while maintaining visual fidelity, setting itself apart from traditional description-based approaches. We release MuST-Bench publicly for broader use and exploration this https URL.

[AI-19] Applying Neural Monte Carlo Tree Search to Unsignalized Multi-intersection Scheduling for Autonomous Vehicles

链接: https://arxiv.org/abs/2410.18786
作者: Yucheng Shi,Wenlong Wang,Xiaowen Tao,Ivana Dusparic,Vinny Cahill
关键词-EN: highly dynamic systems, shared resources, resources by autonomous, Neural Monte Carlo, Monte Carlo Tree
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Dynamic scheduling of access to shared resources by autonomous systems is a challenging problem, characterized as being NP-hard. The complexity of this task leads to a combinatorial explosion of possibilities in highly dynamic systems where arriving requests must be continuously scheduled subject to strong safety and time constraints. An example of such a system is an unsignalized intersection, where automated vehicles’ access to potential conflict zones must be dynamically scheduled. In this paper, we apply Neural Monte Carlo Tree Search (NMCTS) to the challenging task of scheduling platoons of vehicles crossing unsignalized intersections. Crucially, we introduce a transformation model that maps successive sequences of potentially conflicting road-space reservation requests from platoons of vehicles into a series of board-game-like problems and use NMCTS to search for solutions representing optimal road-space allocation schedules in the context of past allocations. To optimize search, we incorporate a prioritized re-sampling method with parallel NMCTS (PNMCTS) to improve the quality of training data. To optimize training, a curriculum learning strategy is used to train the agent to schedule progressively more complex boards culminating in overlapping boards that represent busy intersections. In a busy single four-way unsignalized intersection simulation, PNMCTS solved 95% of unseen scenarios, reducing crossing time by 43% in light and 52% in heavy traffic versus first-in, first-out control. In a 3x3 multi-intersection network, the proposed method maintained free-flow in light traffic when all intersections are under control of PNMCTS and outperformed state-of-the-art RL-based traffic-light controllers in average travel time by 74.5% and total throughput by 16% in heavy traffic.

[AI-20] Should We Really Edit Language Models? On the Evaluation of Edited Language Models NEURIPS2024

链接: https://arxiv.org/abs/2410.18785
作者: Qi Li,Xiang Liu,Zhenheng Tang,Peijie Dong,Zeyu Li,Xinglin Pan,Xiaowen Chu
关键词-EN: increasingly popular alternative, editing methods, editing, efficiently updating knowledge, language models
类目: Artificial Intelligence (cs.AI)
*备注: NeurIPS 2024 this https URL

点击查看摘要

Abstract:Model editing has become an increasingly popular alternative for efficiently updating knowledge within language models. Current methods mainly focus on reliability, generalization, and locality, with many methods excelling across these criteria. Some recent works disclose the pitfalls of these editing methods such as knowledge distortion or conflict. However, the general abilities of post-edited language models remain unexplored. In this paper, we perform a comprehensive evaluation on various editing methods and different language models, and have following findings. (1) Existing editing methods lead to inevitable performance deterioration on general benchmarks, indicating that existing editing methods maintain the general abilities of the model within only a few dozen edits. When the number of edits is slightly large, the intrinsic knowledge structure of the model is disrupted or even completely damaged. (2) Instruction-tuned models are more robust to editing, showing less performance drop on general knowledge after editing. (3) Language model with large scale is more resistant to editing compared to small model. (4) The safety of the edited model, is significantly weakened, even for those safety-aligned models. Our findings indicate that current editing methods are only suitable for small-scale knowledge updates within language models, which motivates further research on more practical and reliable editing methods. The details of code and reproduction can be found in this https URL.

[AI-21] Robust Watermarking Using Generative Priors Against Image Editing: From Benchmarking to Advances

链接: https://arxiv.org/abs/2410.18775
作者: Shilin Lu,Zihan Zhou,Jiayou Lu,Yuanzhi Zhu,Adams Wai-Kin Kong
关键词-EN: Current image watermarking, image editing techniques, editing techniques, Current image, editing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Current image watermarking methods are vulnerable to advanced image editing techniques enabled by large-scale text-to-image models. These models can distort embedded watermarks during editing, posing significant challenges to copyright protection. In this work, we introduce W-Bench, the first comprehensive benchmark designed to evaluate the robustness of watermarking methods against a wide range of image editing techniques, including image regeneration, global editing, local editing, and image-to-video generation. Through extensive evaluations of eleven representative watermarking methods against prevalent editing techniques, we demonstrate that most methods fail to detect watermarks after such edits. To address this limitation, we propose VINE, a watermarking method that significantly enhances robustness against various image editing techniques while maintaining high image quality. Our approach involves two key innovations: (1) we analyze the frequency characteristics of image editing and identify that blurring distortions exhibit similar frequency properties, which allows us to use them as surrogate attacks during training to bolster watermark robustness; (2) we leverage a large-scale pretrained diffusion model SDXL-Turbo, adapting it for the watermarking task to achieve more imperceptible and robust watermark embedding. Experimental results show that our method achieves outstanding watermarking performance under various image editing techniques, outperforming existing methods in both image quality and robustness. Code is available at this https URL.

[AI-22] Cellpose a morphological analysis tool for feature extraction of stained cell images

链接: https://arxiv.org/abs/2410.18738
作者: Israel A. Huaman,Fares D.E. Ghorabe,Sofya S. Chumakova,Alexandra A. Pisarenko,Alexey E. Dudaev,Tatiana G. Volova,Galina A. Ryltseva,Sviatlana A. Ulasevich,Ekaterina I. Shishatskaya,Ekaterina V. Skorb,Pavel S. Zun
关键词-EN: processing tools present, Advanced image segmentation, study cell processes, Advanced image, processing tools
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Advanced image segmentation and processing tools present an opportunity to study cell processes and their dynamics. However, image analysis is often routine and time-consuming. Nowadays, alternative data-driven approaches using deep learning are potentially offering automatized, accurate, and fast image analysis. In this paper, we extend the applications of Cellpose, a state-of-the-art cell segmentation framework, with feature extraction capabilities to assess morphological characteristics. We also introduce a dataset of DAPI and FITC stained cells to which our new method is applied.

[AI-23] AI Readiness in Healthcare through Storytelling XAI ECAI

链接: https://arxiv.org/abs/2410.18725
作者: Akshat Dubey,Zewen Yang,Georges Hattab
关键词-EN: impacting everyday life, radically impacting everyday, Artificial Intelligence, Explainable Artificial Intelligence, everyday life
类目: Artificial Intelligence (cs.AI)
*备注: Pre-print of the accepted manuscript in EXPLIMED - First Workshop on Explainable Artificial Intelligence for the Medical Domain, European Conference on Artificial Intelligence (ECAI) - 2024, Santiago de Compostela, Spain

点击查看摘要

Abstract:Artificial Intelligence is rapidly advancing and radically impacting everyday life, driven by the increasing availability of computing power. Despite this trend, the adoption of AI in real-world healthcare is still limited. One of the main reasons is the trustworthiness of AI models and the potential hesitation of domain experts with model predictions. Explainable Artificial Intelligence (XAI) techniques aim to address these issues. However, explainability can mean different things to people with different backgrounds, expertise, and goals. To address the target audience with diverse needs, we develop storytelling XAI. In this research, we have developed an approach that combines multi-task distillation with interpretability techniques to enable audience-centric explainability. Using multi-task distillation allows the model to exploit the relationships between tasks, potentially improving interpretability as each task supports the other leading to an enhanced interpretability from the perspective of a domain expert. The distillation process allows us to extend this research to large deep models that are highly complex. We focus on both model-agnostic and model-specific methods of interpretability, supported by textual justification of the results in healthcare through our use case. Our methods increase the trust of both the domain experts and the machine learning experts to enable a responsible AI.

[AI-24] GeoLoRA: Geometric integration for parameter efficient fine-tuning

链接: https://arxiv.org/abs/2410.18720
作者: Steffen Schotthöfer,Emanuele Zangrando,Gianluca Ceruti,Francesco Tudisco,Jonas Kusch
关键词-EN: pre-trained neural networks, Low-Rank Adaptation, pre-trained neural, neural networks, parameter-efficient fine-tuning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Low-Rank Adaptation (LoRA) has become a widely used method for parameter-efficient fine-tuning of large-scale, pre-trained neural networks. However, LoRA and its extensions face several challenges, including the need for rank adaptivity, robustness, and computational efficiency during the fine-tuning process. We introduce GeoLoRA, a novel approach that addresses these limitations by leveraging dynamical low-rank approximation theory. GeoLoRA requires only a single backpropagation pass over the small-rank adapters, significantly reducing computational cost as compared to similar dynamical low-rank training methods and making it faster than popular baselines such as AdaLoRA. This allows GeoLoRA to efficiently adapt the allocated parameter budget across the model, achieving smaller low-rank adapters compared to heuristic methods like AdaLoRA and LoRA, while maintaining critical convergence, descent, and error-bound theoretical guarantees. The resulting method is not only more efficient but also more robust to varying hyperparameter settings. We demonstrate the effectiveness of GeoLoRA on several state-of-the-art benchmarks, showing that it outperforms existing methods in both accuracy and computational efficiency.

[AI-25] LLM -based Online Prediction of Time-varying Graph Signals

链接: https://arxiv.org/abs/2410.18718
作者: Dayu Qin,Yi Yan,Ercan Engin Kuruoglu
关键词-EN: leverages large language, large language models, temporal smoothness, large language, exploiting spatial
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this paper, we propose a novel framework that leverages large language models (LLMs) for predicting missing values in time-varying graph signals by exploiting spatial and temporal smoothness. We leverage the power of LLM to achieve a message-passing scheme. For each missing node, its neighbors and previous estimates are fed into and processed by LLM to infer the missing observations. Tested on the task of the online prediction of wind-speed graph signals, our model outperforms online graph filtering algorithms in terms of accuracy, demonstrating the potential of LLMs in effectively addressing partially observed signals in graphs.

[AI-26] Low-Latency Video Anonymization for Crowd Anomaly Detection: Privacy vs. Performance

链接: https://arxiv.org/abs/2410.18717
作者: Mulugeta Weldezgina Asres,Lei Jiao,Christian Walter Omlin
关键词-EN: artificial intelligence promise, intelligence promise ample, promise ample potential, Recent advancements, surveillance cameras
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 16pages, 8 figures, 9 tables

点击查看摘要

Abstract:Recent advancements in artificial intelligence promise ample potential in monitoring applications with surveillance cameras. However, concerns about privacy and model bias have made it challenging to utilize them in public. Although de-identification approaches have been proposed in the literature, aiming to achieve a certain level of anonymization, most of them employ deep learning models that are computationally demanding for real-time edge deployment. In this study, we revisit conventional anonymization solutions for privacy protection and real-time video anomaly detection (VAD) applications. We propose a novel lightweight adaptive anonymization for VAD (LA3D) that employs dynamic adjustment to enhance privacy protection. We evaluated the approaches on publicly available privacy and VAD data sets to examine the strengths and weaknesses of the different anonymization techniques and highlight the promising efficacy of our approach. Our experiment demonstrates that LA3D enables substantial improvement in the privacy anonymization capability without majorly degrading VAD efficacy.

[AI-27] Ali-AUG: Innovative Approaches to Labeled Data Augmentation using One-Step Diffusion Model

链接: https://arxiv.org/abs/2410.18678
作者: Ali Hamza,Aizea Lojo,Adrian Núñez-Marcos,Aitziber Atutxa
关键词-EN: paper introduces Ali-AUG, efficient labeled data, paper introduces, labeled data, limited labeled data
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper introduces Ali-AUG, a novel single-step diffusion model for efficient labeled data augmentation in industrial applications. Our method addresses the challenge of limited labeled data by generating synthetic, labeled images with precise feature insertion. Ali-AUG utilizes a stable diffusion architecture enhanced with skip connections and LoRA modules to efficiently integrate masks and images, ensuring accurate feature placement without affecting unrelated image content. Experimental validation across various industrial datasets demonstrates Ali-AUG’s superiority in generating high-quality, defect-enhanced images while maintaining rapid single-step inference. By offering precise control over feature insertion and minimizing required training steps, our technique significantly enhances data augmentation capabilities, providing a powerful tool for improving the performance of deep learning models in scenarios with limited labeled data. Ali-AUG is especially useful for use cases like defective product image generation to train AI-based models to improve their ability to detect defects in manufacturing processes. Using different data preparation strategies, including Classification Accuracy Score (CAS) and Naive Augmentation Score (NAS), we show that Ali-AUG improves model performance by 31% compared to other augmentation methods and by 45% compared to models without data augmentation. Notably, Ali-AUG reduces training time by 32% and supports both paired and unpaired datasets, enhancing flexibility in data preparation.

[AI-28] GADT: Enhancing Transferable Adversarial Attacks through Gradient-guided Adversarial Data Transformation

链接: https://arxiv.org/abs/2410.18648
作者: Yating Ma,Xiaogang Xu,Liming Fang,Zhe Liu
关键词-EN: adding Adversarial Noise, Current Transferable Adversarial, optimizing Data Augmentation, Current Transferable, Adversarial Noise
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Current Transferable Adversarial Examples (TAE) are primarily generated by adding Adversarial Noise (AN). Recent studies emphasize the importance of optimizing Data Augmentation (DA) parameters along with AN, which poses a greater threat to real-world AI applications. However, existing DA-based strategies often struggle to find optimal solutions due to the challenging DA search procedure without proper guidance. In this work, we propose a novel DA-based attack algorithm, GADT. GADT identifies suitable DA parameters through iterative antagonism and uses posterior estimates to update AN based on these parameters. We uniquely employ a differentiable DA operation library to identify adversarial DA parameters and introduce a new loss function as a metric during DA optimization. This loss term enhances adversarial effects while preserving the original image content, maintaining attack crypticity. Extensive experiments on public datasets with various networks demonstrate that GADT can be integrated with existing transferable attack methods, updating their DA parameters effectively while retaining their AN formulation strategies. Furthermore, GADT can be utilized in other black-box attack scenarios, e.g., query-based attacks, offering a new avenue to enhance attacks on real-world AI applications in both research and industrial contexts.

[AI-29] Smart ETL and LLM -based contents classification: the European Smart Tourism Tools Observatory experience

链接: https://arxiv.org/abs/2410.18641
作者: Diogo Cosme,António Galvão,Fernando Brito e Abreu
关键词-EN: online European Smart, Smart Tourism Tools, European Smart Tourism, research project focuses, online European
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Purpose: Our research project focuses on improving the content update of the online European Smart Tourism Tools (STTs) Observatory by incorporating and categorizing STTs. The categorization is based on their taxonomy, and it facilitates the end user’s search process. The use of a Smart ETL (Extract, Transform, and Load) process, where \emphSmart indicates the use of Artificial Intelligence (AI), is central to this endeavor. Methods: The contents describing STTs are derived from PDF catalogs, where PDF-scraping techniques extract QR codes, images, links, and text information. Duplicate STTs between the catalogs are removed, and the remaining ones are classified based on their text information using Large Language Models (LLMs). Finally, the data is transformed to comply with the Dublin Core metadata structure (the observatory’s metadata structure), chosen for its wide acceptance and flexibility. Results: The Smart ETL process to import STTs to the observatory combines PDF-scraping techniques with LLMs for text content-based classification. Our preliminary results have demonstrated the potential of LLMs for text content-based classification. Conclusion: The proposed approach’s feasibility is a step towards efficient content-based classification, not only in Smart Tourism but also adaptable to other fields. Future work will mainly focus on refining this classification process. Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) ACMclasses: H.3.3; I.2.7; I.5.2 Cite as: arXiv:2410.18641 [cs.IR] (or arXiv:2410.18641v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2410.18641 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Fernando Brito E Abreu [view email] [v1] Thu, 24 Oct 2024 11:10:54 UTC (1,572 KB)

[AI-30] Diffusion Attribution Score: Evaluating Training Data Influence in Diffusion Model

链接: https://arxiv.org/abs/2410.18639
作者: Jinxu Lin,Linwei Tao,Minjing Dong,Chang Xu
关键词-EN: diffusion models, increasingly popular, major concern, diffusion, misuse of copyrighted
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As diffusion models become increasingly popular, the misuse of copyrighted and private images has emerged as a major concern. One promising solution to mitigate this issue is identifying the contribution of specific training samples in generative models, a process known as data attribution. Existing data attribution methods for diffusion models typically quantify the contribution of a training sample by evaluating the change in diffusion loss when the sample is included or excluded from the training process. However, we argue that the direct usage of diffusion loss cannot represent such a contribution accurately due to the calculation of diffusion loss. Specifically, these approaches measure the divergence between predicted and ground truth distributions, which leads to an indirect comparison between the predicted distributions and cannot represent the variances between model behaviors. To address these issues, we aim to measure the direct comparison between predicted distributions with an attribution score to analyse the training sample importance, which is achieved by Diffusion Attribution Score (DAS). Underpinned by rigorous theoretical analysis, we elucidate the effectiveness of DAS. Additionally, we explore strategies to accelerate DAS calculations, facilitating its application to large-scale diffusion models. Our extensive experiments across various datasets and diffusion models demonstrate that DAS significantly surpasses previous benchmarks in terms of the linear data-modelling score, establishing new state-of-the-art performance.

[AI-31] Multi-agent cooperation through learning-aware policy gradients

链接: https://arxiv.org/abs/2410.18636
作者: Alexander Meulemans,Seijin Kobayashi,Johannes von Oswald,Nino Scherrer,Eric Elmoznino,Blake Richards,Guillaume Lajoie,Blaise Agüera y Arcas,João Sacramento
关键词-EN: fail to cooperate, posing a fundamental, individuals often fail, fundamental challenge, challenge for multi-agent
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Self-interested individuals often fail to cooperate, posing a fundamental challenge for multi-agent learning. How can we achieve cooperation among self-interested, independent learning agents? Promising recent work has shown that in certain tasks cooperation can be established between learning-aware agents who model the learning dynamics of each other. Here, we present the first unbiased, higher-derivative-free policy gradient algorithm for learning-aware reinforcement learning, which takes into account that other agents are themselves learning through trial and error based on multiple noisy trials. We then leverage efficient sequence models to condition behavior on long observation histories that contain traces of the learning dynamics of other agents. Training long-context policies with our algorithm leads to cooperative behavior and high returns on standard social dilemmas, including a challenging environment where temporally-extended action coordination is required. Finally, we derive from the iterated prisoner’s dilemma a novel explanation for how and when cooperation arises among self-interested learning-aware agents.

[AI-32] Wavetable Synthesis Using CVAE for Timbre Control Based on Semantic Label

链接: https://arxiv.org/abs/2410.18628
作者: Tsugumasa Yutani,Yuya Yamamoto,Shuyo Nakatani,Hiroko Terasawa
关键词-EN: modern music production, Synthesizers are essential, music production, essential in modern, modern music
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
*备注: 6 pages, 4 figures, Accepted at APSIPA ASC 2024

点击查看摘要

Abstract:Synthesizers are essential in modern music production. However, their complex timbre parameters, often filled with technical terms, require expertise. This research introduces a method of timbre control in wavetable synthesis that is intuitive and sensible and utilizes semantic labels. Using a conditional variational autoencoder (CVAE), users can select a wavetable and define the timbre with labels such as bright, warm, and rich. The CVAE model, featuring convolutional and upsampling layers, effectively captures the wavetable nuances, ensuring real-time performance owing to their processing in the time domain. Experiments demonstrate that this approach allows for real-time, effective control of the timbre of the wavetable using semantic inputs and aims for intuitive timbre control through data-based semantic control.

[AI-33] SAMG: State-Action-Aware Offline-to-Online Reinforcement Learning with Offline Model Guidance

链接: https://arxiv.org/abs/2410.18626
作者: Liyu Zhang,Haochi Wu,Xu Wan,Quan Kong,Ruilong Deng,Mingyang Sun
关键词-EN: subsequent online fine-tuning, reinforcement learning, offline, utilizes pre-trained models, Offline Model Guidance
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The offline-to-online (O2O) paradigm in reinforcement learning (RL) utilizes pre-trained models on offline datasets for subsequent online fine-tuning. However, conventional O2O RL algorithms typically require maintaining and retraining the large offline datasets to mitigate the effects of out-of-distribution (OOD) data, which limits their efficiency in exploiting online samples. To address this challenge, we introduce a new paradigm called SAMG: State-Action-Conditional Offline-to-Online Reinforcement Learning with Offline Model Guidance. In particular, rather than directly training on offline data, SAMG freezes the pre-trained offline critic to provide offline values for each state-action pair to deliver compact offline information. This framework eliminates the need for retraining with offline data by freezing and leveraging these values of the offline model. These are then incorporated with the online target critic using a Bellman equation weighted by a policy state-action-aware coefficient. This coefficient, derived from a conditional variational auto-encoder (C-VAE), aims to capture the reliability of the offline data on a state-action level. SAMG could be easily integrated with existing Q-function based O2O RL algorithms. Theoretical analysis shows good optimality and lower estimation error of SAMG. Empirical evaluations demonstrate that SAMG outperforms four state-of-the-art O2O RL algorithms in the D4RL benchmark.

[AI-34] FairQueue: Rethinking Prompt Learning for Fair Text-to-Image Generation NEURIPS24

链接: https://arxiv.org/abs/2410.18615
作者: Christopher T.H Teo,Milad Abdollahzadeh,Xinda Ma,Ngai-man Cheung
关键词-EN: target Sensitive Attribute, Sensitive Attribute, fair image generation, learning has emerged, Recently
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted in NeurIPS24

点击查看摘要

Abstract:Recently, prompt learning has emerged as the state-of-the-art (SOTA) for fair text-to-image (T2I) generation. Specifically, this approach leverages readily available reference images to learn inclusive prompts for each target Sensitive Attribute (tSA), allowing for fair image generation. In this work, we first reveal that this prompt learning-based approach results in degraded sample quality. Our analysis shows that the approach’s training objective – which aims to align the embedding differences of learned prompts and reference images – could be sub-optimal, resulting in distortion of the learned prompts and degraded generated images. To further substantiate this claim, as our major contribution, we deep dive into the denoising subnetwork of the T2I model to track down the effect of these learned prompts by analyzing the cross-attention maps. In our analysis, we propose a novel prompt switching analysis: I2H and H2I. Furthermore, we propose new quantitative characterization of cross-attention maps. Our analysis reveals abnormalities in the early denoising steps, perpetuating improper global structure that results in degradation in the generated samples. Building on insights from our analysis, we propose two ideas: (i) Prompt Queuing and (ii) Attention Amplification to address the quality issue. Extensive experimental results on a wide range of tSAs show that our proposed method outperforms SOTA approach’s image generation quality, while achieving competitive fairness. More resources at FairQueue Project site: this https URL

[AI-35] ripCast: Pre-training of Masked 2D Transformers for Trip Time Series Forecasting ICONIP2024

链接: https://arxiv.org/abs/2410.18612
作者: Yuhua Liao,Zetian Wang,Peng Wei,Qiangqiang Nie,Zhenhua Zhang
关键词-EN: shown great success, Deep learning, time series, models have shown, shown great
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted by ICONIP 2024

点击查看摘要

Abstract:Deep learning and pre-trained models have shown great success in time series forecasting. However, in the tourism industry, time series data often exhibit a leading time property, presenting a 2D structure. This introduces unique challenges for forecasting in this sector. In this study, we propose a novel modelling paradigm, TripCast, which treats trip time series as 2D data and learns representations through masking and reconstruction processes. Pre-trained on large-scale real-world data, TripCast notably outperforms other state-of-the-art baselines in in-domain forecasting scenarios and demonstrates strong scalability and transferability in out-domain forecasting scenarios.

[AI-36] AgentS tore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant

链接: https://arxiv.org/abs/2410.18603
作者: Chengyou Jia,Minnan Luo,Zhuohang Dang,Qiushi Sun,Fangzhi Xu,Junlin Hu,Tianbao Xie,Zhiyong Wu
关键词-EN: Digital agents capable, attracted considerable attention, considerable attention due, enhance human-computer interaction, complex computer tasks
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Digital agents capable of automating complex computer tasks have attracted considerable attention due to their immense potential to enhance human-computer interaction. However, existing agent methods exhibit deficiencies in their generalization and specialization capabilities, especially in handling open-ended computer tasks in real-world environments. Inspired by the rich functionality of the App store, we present AgentStore, a scalable platform designed to dynamically integrate heterogeneous agents for automating computer tasks. AgentStore empowers users to integrate third-party agents, allowing the system to continuously enrich its capabilities and adapt to rapidly evolving operating systems. Additionally, we propose a novel core \textbfMetaAgent with the \textbfAgentToken strategy to efficiently manage diverse agents and utilize their specialized and generalist abilities for both domain-specific and system-wide tasks. Extensive experiments on three challenging benchmarks demonstrate that AgentStore surpasses the limitations of previous systems with narrow capabilities, particularly achieving a significant improvement from 11.21% to 23.85% on the OSWorld benchmark, more than doubling the previous results. Comprehensive quantitative and qualitative results further demonstrate AgentStore’s ability to enhance agent systems in both generalization and specialization, underscoring its potential for developing the specialized generalist computer assistant. All our codes will be made publicly available in this https URL.

[AI-37] Aligning CodeLLM s with Direct Preference Optimization

链接: https://arxiv.org/abs/2410.18585
作者: Yibo Miao,Bofei Gao,Shanghaoran Quan,Junyang Lin,Daoguang Zan,Jiaheng Liu,Jian Yang,Tianyu Liu,Zhijie Deng
关键词-EN: large language models, diverse domains, year has witnessed, witnessed the rapid, rapid progress
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The last year has witnessed the rapid progress of large language models (LLMs) across diverse domains. Among them, CodeLLMs have garnered particular attention because they can not only assist in completing various programming tasks but also represent the decision-making and logical reasoning capabilities of LLMs. However, current CodeLLMs mainly focus on pre-training and supervised fine-tuning scenarios, leaving the alignment stage, which is important for post-training LLMs, under-explored. This work first identifies that the commonly used PPO algorithm may be suboptimal for the alignment of CodeLLM because the involved reward rules are routinely coarse-grained and potentially flawed. We then advocate addressing this using the DPO algorithm. Based on only preference data pairs, DPO can render the model rank data automatically, giving rise to a fine-grained rewarding pattern more robust than human intervention. We also contribute a pipeline for collecting preference pairs for DPO on CodeLLMs. Studies show that our method significantly improves the performance of existing CodeLLMs on benchmarks such as MBPP and HumanEval.

[AI-38] SIKeD: Self-guided Iterative Knowledge Distillation for mathematical reasoning

链接: https://arxiv.org/abs/2410.18574
作者: Shivam Adarsh,Kumar Shridhar,Caglar Gulcehre,Nicholas Monath,Mrinmaya Sachan
关键词-EN: Large Language Models, Large Language, Language Models, intermediate reasoning process, reasoning process required
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) can transfer their reasoning skills to smaller models by teaching them to generate the intermediate reasoning process required to solve multistep reasoning tasks. While LLMs can accurately solve reasoning tasks through a variety of strategies, even without fine-tuning, smaller models are not expressive enough to fit the LLMs distribution on all strategies when distilled and tend to prioritize one strategy over the others. This reliance on one strategy poses a challenge for smaller models when attempting to solve reasoning tasks that may be difficult with their preferred strategy. To address this, we propose a distillation method SIKeD (Self-guided Iterative Knowledge Distillation for mathematical reasoning), where the LLM teaches the smaller model to approach a task using different strategies and the smaller model uses its self-generated on-policy outputs to choose the most suitable strategy for the given task. The training continues in a self-guided iterative manner, where for each training iteration, a decision is made on how to combine the LLM data with the self-generated outputs. Unlike traditional distillation methods, SIKeD allows the smaller model to learn which strategy is suitable for a given task while continuously learning to solve a task using different strategies. Our experiments on various mathematical reasoning datasets show that SIKeD significantly outperforms traditional distillation techniques across smaller models of different sizes. Our code is available at: this https URL

[AI-39] Zero-shot Object Navigation with Vision-Language Models Reasoning ICPR

链接: https://arxiv.org/abs/2410.18570
作者: Congcong Wen,Yisiyuan Huang,Hao Huang,Yanjia Huang,Shuaihang Yuan,Yu Hao,Hui Lin,Yu-Shen Liu,Yi Fang
关键词-EN: traditional methods require, methods require substantial, require substantial training, substantial training data, Zero-shot object navigation
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Accepted by the International Conference on Pattern Recognition (ICPR) for Oral presentation

点击查看摘要

Abstract:Object navigation is crucial for robots, but traditional methods require substantial training data and cannot be generalized to unknown environments. Zero-shot object navigation (ZSON) aims to address this challenge, allowing robots to interact with unknown objects without specific training data. Language-driven zero-shot object navigation (L-ZSON) is an extension of ZSON that incorporates natural language instructions to guide robot navigation and interaction with objects. In this paper, we propose a novel Vision Language model with a Tree-of-thought Network (VLTNet) for L-ZSON. VLTNet comprises four main modules: vision language model understanding, semantic mapping, tree-of-thought reasoning and exploration, and goal identification. Among these modules, Tree-of-Thought (ToT) reasoning and exploration module serves as a core component, innovatively using the ToT reasoning framework for navigation frontier selection during robot exploration. Compared to conventional frontier selection without reasoning, navigation using ToT reasoning involves multi-path reasoning processes and backtracking when necessary, enabling globally informed decision-making with higher accuracy. Experimental results on PASTURE and RoboTHOR benchmarks demonstrate the outstanding performance of our model in LZSON, particularly in scenarios involving complex natural language as target instructions.

[AI-40] Explainable News Summarization – Analysis and mitigation of Disagreement Problem

链接: https://arxiv.org/abs/2410.18560
作者: Seema Aswani,Sujala D. Shetty
关键词-EN: provide valuable understanding, text summarization provide, summarization provide valuable, XAI methods, XAI
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Explainable AI (XAI) techniques for text summarization provide valuable understanding of how the summaries are generated. Recent studies have highlighted a major challenge in this area, known as the disagreement problem. This problem occurs when different XAI methods offer contradictory explanations for the summary generated from the same input article. This inconsistency across XAI methods has been evaluated using predefined metrics designed to quantify agreement levels between them, revealing significant disagreement. This impedes the reliability and interpretability of XAI in this area. To address this challenge, we propose a novel approach that utilizes sentence transformers and the k-means clustering algorithm to first segment the input article and then generate the explanation of the summary generated for each segment. By producing regional or segmented explanations rather than comprehensive ones, a decrease in the observed disagreement between XAI methods is hypothesized. This segmentation-based approach was used on two news summarization datasets, namely Extreme Summarization(XSum) and CNN-DailyMail, and the experiment was conducted using multiple disagreement metrics. Our experiments validate the hypothesis by showing a significant reduction in disagreement among different XAI methods. Additionally, a JavaScript visualization tool is developed, that is easy to use and allows users to interactively explore the color-coded visualization of the input article and the machine-generated summary based on the attribution scores of each sentences.

[AI-41] Complexity Matters: Effective Dimensionality as a Measure for Adversarial Robustness

链接: https://arxiv.org/abs/2410.18556
作者: David Khachaturov,Robert Mullins
关键词-EN: Quantifying robustness, anticipating trends, robustness, Quantifying, effective dimensionality
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Quantifying robustness in a single measure for the purposes of model selection, development of adversarial training methods, and anticipating trends has so far been elusive. The simplest metric to consider is the number of trainable parameters in a model but this has previously been shown to be insufficient at explaining robustness properties. A variety of other metrics, such as ones based on boundary thickness and gradient flatness have been proposed but have been shown to be inadequate proxies for robustness. In this work, we investigate the relationship between a model’s effective dimensionality, which can be thought of as model complexity, and its robustness properties. We run experiments on commercial-scale models that are often used in real-world environments such as YOLO and ResNet. We reveal a near-linear inverse relationship between effective dimensionality and adversarial robustness, that is models with a lower dimensionality exhibit better robustness. We investigate the effect of a variety of adversarial training methods on effective dimensionality and find the same inverse linear relationship present, suggesting that effective dimensionality can serve as a useful criterion for model selection and robustness evaluation, providing a more nuanced and effective metric than parameter count or previously-tested measures. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR) Cite as: arXiv:2410.18556 [cs.LG] (or arXiv:2410.18556v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2410.18556 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-42] IMAN: An Adaptive Network for Robust NPC Mortality Prediction with Missing Modalities

链接: https://arxiv.org/abs/2410.18551
作者: Yejing Huo,Guoheng Huang,Lianglun Cheng,Jianbin He,Xuhang Chen,Xiaochen Yuan,Guo Zhong,Chi-Man Pun
关键词-EN: improving patient outcomes, optimizing treatment strategies, nasopharyngeal carcinoma, patient outcomes, Accurate prediction
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: The paper has been accepted by BIBM 2024

点击查看摘要

Abstract:Accurate prediction of mortality in nasopharyngeal carcinoma (NPC), a complex malignancy particularly challenging in advanced stages, is crucial for optimizing treatment strategies and improving patient outcomes. However, this predictive process is often compromised by the high-dimensional and heterogeneous nature of NPC-related data, coupled with the pervasive issue of incomplete multi-modal data, manifesting as missing radiological images or incomplete diagnostic reports. Traditional machine learning approaches suffer significant performance degradation when faced with such incomplete data, as they fail to effectively handle the high-dimensionality and intricate correlations across modalities. Even advanced multi-modal learning techniques like Transformers struggle to maintain robust performance in the presence of missing modalities, as they lack specialized mechanisms to adaptively integrate and align the diverse data types, while also capturing nuanced patterns and contextual relationships within the complex NPC data. To address these problem, we introduce IMAN: an adaptive network for robust NPC mortality prediction with missing modalities.

[AI-43] PRACT: Optimizing Principled Reasoning and Acting of LLM Agent CONLL2024

链接: https://arxiv.org/abs/2410.18528
作者: Zhiwei Liu,Weiran Yao,Jianguo Zhang,Rithesh Murthy,Liangwei Yang,Zuxin Liu,Tian Lan,Ming Zhu,Juntao Tan,Shirley Kokane,Thai Hoang,Juan Carlos Niebles,Shelby Heinecke,Huan Wang,Silvio Savarese,Caiming Xiong
关键词-EN: Reasoning and Acting, Principled Reasoning, introduce the Principled, action principles, enforcing action principles
类目: Artificial Intelligence (cs.AI)
*备注: Accepted to SIG CoNLL 2024

点击查看摘要

Abstract:We introduce the Principled Reasoning and Acting (PRAct) framework, a novel method for learning and enforcing action principles from trajectory data. Central to our approach is the use of text gradients from a reflection and optimization engine to derive these action principles. To adapt action principles to specific task requirements, we propose a new optimization framework, Reflective Principle Optimization (RPO). After execution, RPO employs a reflector to critique current action principles and an optimizer to update them accordingly. We develop the RPO framework under two scenarios: Reward-RPO, which uses environmental rewards for reflection, and Self-RPO, which conducts self-reflection without external rewards. Additionally, two RPO methods, RPO-Traj and RPO-Batch, is introduced to adapt to different settings. Experimental results across four environments demonstrate that the PRAct agent, leveraging the RPO framework, effectively learns and applies action principles to enhance performance.

[AI-44] Scaling up Masked Diffusion Models on Text

链接: https://arxiv.org/abs/2410.18514
作者: Shen Nie,Fengqi Zhu,Chao Du,Tianyu Pang,Qian Liu,Guangtao Zeng,Min Lin,Chongxuan Li
关键词-EN: Masked diffusion models, Masked diffusion, remain underexplored, shown promise, effectiveness in core
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Masked diffusion models (MDMs) have shown promise in language modeling, yet their scalability and effectiveness in core language tasks, such as text generation and language understanding, remain underexplored. This paper establishes the first scaling law for MDMs, demonstrating a scaling rate comparable to autoregressive models (ARMs) and a relatively small compute gap. Motivated by their scalability, we train a family of MDMs with up to 1.1 billion (B) parameters to systematically evaluate their performance against ARMs of comparable or larger sizes. Fully leveraging the probabilistic formulation of MDMs, we propose a simple yet effective \emphunsupervised classifier-free guidance that effectively exploits large-scale unpaired data, boosting performance for conditional inference. In language understanding, a 1.1B MDM shows competitive results, outperforming the larger 1.5B GPT-2 model on four out of eight zero-shot benchmarks. In text generation, MDMs provide a flexible trade-off compared to ARMs utilizing KV-cache: MDMs match the performance of ARMs while being 1.4 times faster, or achieve higher quality than ARMs at a higher computational cost. Moreover, MDMs address challenging tasks for ARMs by effectively handling bidirectional reasoning and adapting to temporal shifts in data. Notably, a 1.1B MDM breaks the \emphreverse curse encountered by much larger ARMs with significantly more data and computation, such as Llama-2 (13B) and GPT-3 (175B). Our code is available at \urlthis https URL.

[AI-45] A framework for GNSS-based solutions performance analysis in an ERTMS context

链接: https://arxiv.org/abs/2410.18510
作者: Juliette Marais(COSYS-LEOST),Quentin Mayolle(IRT Railenium),Martin Fasquelle(IRT Railenium),Vincent Tardif,Emilie Chéneau-Grehalle
关键词-EN: Global Navigation Satellite, Navigation Satellite System, Global Navigation, GNSS-based solution introduction, rail applications GNSS
类目: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Context Progresses in GNSS-based solution introduction in rail applications GNSS (Global Navigation Satellite System) is now used in most of our travels and each of our smartphone apps. Most of the usages are not safety-critical. But Europe identified GNSS for more applications and to be integrated in rail in general as part of the toolset to help railway to contribute to reduce transport carbon footprint. To increase the use of trains in European transports, railways must improve their attractiveness for passengers and freight, but also increase reliability, availability and efficiency by reducing capital expenditure and operational costs. GNSS is part of the global digitalization scheme of freight that aims to offer added value to the clients knowledge of accurate time of arrival, continuous monitoring of transport conditions (temperature, humidity…). But a major challenge will be to reach stringent applications and in particular, GNSS is today seen as a realistic and serious game changer for the future of the ERTMS (European Rail Traffic Management System). The localisation function is today performed with both odometry and balises. Odometer provides a continuous train position in time from a reference point. But as the distance delivered by the odometer shows a growing bias with distance, due to wear and wheel sliding, the use of on-track balises allows to reduce this error. Future systems will be based on on-board localisation solutions with GNSS receivers. It will allow the development of new concepts for moving blocks, virtual coupling and automation. Its use for train integrity is also investigated. But the environmental conditions of track and surroundings configuration, i.e, tunnels, dense urban areas or vegetation often degrade positioning performance and thus its efficiency and safety. Indeed, GNSS satellites are moving and their visibility (availability and relative position from the receiver) vary with time. Moreover, for optimal performance, the system requires open sky environments, which are the cases of most of the aeronautical uses but not of train uses. Trains often circulate in areas where signal reception can be disturbed (multipath, intentional or unintentional interferences) and thus, performances degraded. If many progresses have been made in the past years to develop more robust receivers [Puccitelli, 2022], multi-sensor solutions [CLUG website] or missing tools such as Digital Maps [Crespillo, 2023], in projects such as the Shift2Rail Project X2Rail-5 or CLUG, some questions remain and in particular related to performance evaluation. How can we evaluate performances in a dynamic environment (train, satellite, obstacles)? How can we be sure that every configuration has been tested? What is the impact of a failure (inaccuracy, missed detection) on operation? Some of these issues are addressed in the on-going R2DATO project funded by Europe’s rail.

[AI-46] SFB-net for cardiac segmentation: Bridging the semantic gap with attention

链接: https://arxiv.org/abs/2410.18503
作者: Nicolas Portal(SU),Nadjia Kachenoura,Thomas Dietenbeck,Catherine Achard
关键词-EN: deep learning algorithms, cardiac image segmentation, past few years, deep learning, image segmentation
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the past few years, deep learning algorithms have been widely used for cardiac image segmentation. However, most of these architectures rely on convolutions that hardly model long-range dependencies, limiting their ability to extract contextual information. In order to tackle this issue, this article introduces the Swin Filtering Block network (SFB-net) which takes advantage of both conventional and swin transformer layers. The former are used to introduce spatial attention at the bottom of the network, while the latter are applied to focus on high level semantically rich features between the encoder and decoder. An average Dice score of 92.4 was achieved on the ACDC dataset. To the best of our knowledge, this result outperforms any other work on this dataset. The average Dice score of 87.99 obtained on the M\amp;M’s dataset demonstrates that the proposed method generalizes well to data from different vendors and centres.

[AI-47] LLM as a code generator in Agile Model Driven Development

链接: https://arxiv.org/abs/2410.18489
作者: Ahmed R. Sadik,Sebastian Brulin,Markus Olhofer,Antonello Ceravola,Frank Joublin
关键词-EN: Leveraging Large Language, Leveraging Large, Large Language Models, Model Driven Development, Large Language
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Robotics (cs.RO); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Leveraging Large Language Models (LLM) like GPT4 in the auto generation of code represents a significant advancement, yet it is not without its challenges. The ambiguity inherent in natural language descriptions of software poses substantial obstacles to generating deployable, structured artifacts. This research champions Model Driven Development (MDD) as a viable strategy to overcome these challenges, proposing an Agile Model Driven Development (AMDD) approach that employs GPT4 as a code generator. This approach enhances the flexibility and scalability of the code auto generation process and offers agility that allows seamless adaptation to changes in models or deployment environments. We illustrate this by modeling a multi agent Unmanned Vehicle Fleet (UVF) system using the Unified Modeling Language (UML), significantly reducing model ambiguity by integrating the Object Constraint Language (OCL) for code structure meta modeling, and the FIPA ontology language for communication semantics meta modeling. Applying GPT4 auto generation capabilities yields Java and Python code that is compatible with the JADE and PADE frameworks, respectively. Our thorough evaluation of the auto generated code verifies its alignment with expected behaviors and identifies enhancements in agent interactions. Structurally, we assessed the complexity of code derived from a model constrained solely by OCL meta models, against that influenced by both OCL and FIPA ontology meta models. The results indicate that the ontology constrained meta model produces inherently more complex code, yet its cyclomatic complexity remains within manageable levels, suggesting that additional meta model constraints can be incorporated without exceeding the high risk threshold for complexity.

[AI-48] Gene-Metabolite Association Prediction with Interactive Knowledge Transfer Enhanced Graph for Metabolite Production

链接: https://arxiv.org/abs/2410.18475
作者: Kexuan Xin,Qingyun Wang,Junyu Chen,Pengfei Yu,Huimin Zhao,Heng Ji
关键词-EN: presents significant challenges, rapidly evolving field, production enhancement presents, enhancement presents significant, metabolite production enhancement
类目: Artificial Intelligence (cs.AI)
*备注: 10 PAGES, 4 FIGURES; bibm 2024

点击查看摘要

Abstract:In the rapidly evolving field of metabolic engineering, the quest for efficient and precise gene target identification for metabolite production enhancement presents significant challenges. Traditional approaches, whether knowledge-based or model-based, are notably time-consuming and labor-intensive, due to the vast scale of research literature and the approximation nature of genome-scale metabolic model (GEM) simulations. Therefore, we propose a new task, Gene-Metabolite Association Prediction based on metabolic graphs, to automate the process of candidate gene discovery for a given pair of metabolite and candidate-associated genes, as well as presenting the first benchmark containing 2474 metabolites and 1947 genes of two commonly used microorganisms Saccharomyces cerevisiae (SC) and Issatchenkia orientalis (IO). This task is challenging due to the incompleteness of the metabolic graphs and the heterogeneity among distinct metabolisms. To overcome these limitations, we propose an Interactive Knowledge Transfer mechanism based on Metabolism Graph (IKT4Meta), which improves the association prediction accuracy by integrating the knowledge from different metabolism graphs. First, to build a bridge between two graphs for knowledge transfer, we utilize Pretrained Language Models (PLMs) with external knowledge of genes and metabolites to help generate inter-graph links, significantly alleviating the impact of heterogeneity. Second, we propagate intra-graph links from different metabolic graphs using inter-graph links as anchors. Finally, we conduct the gene-metabolite association prediction based on the enriched metabolism graphs, which integrate the knowledge from multiple microorganisms. Experiments on both types of organisms demonstrate that our proposed methodology outperforms baselines by up to 12.3% across various link prediction frameworks.

[AI-49] Beyond Multiple-Choice Accuracy: Real-World Challenges of Implementing Large Language Models in Healthcare

链接: https://arxiv.org/abs/2410.18460
作者: Yifan Yang,Qiao Jin,Qingqing Zhu,Zhizheng Wang,Francisco Erramuspe Álvarez,Nicholas Wan,Benjamin Hou,Zhiyong Lu
关键词-EN: Large Language Models, Large Language, Language Models, gained significant attention, human-level capabilities
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have gained significant attention in the medical domain for their human-level capabilities, leading to increased efforts to explore their potential in various healthcare applications. However, despite such a promising future, there are multiple challenges and obstacles that remain for their real-world uses in practical settings. This work discusses key challenges for LLMs in medical applications from four unique aspects: operational vulnerabilities, ethical and social considerations, performance and assessment difficulties, and legal and regulatory compliance. Addressing these challenges is crucial for leveraging LLMs to their full potential and ensuring their responsible integration into healthcare.

[AI-50] Verifying Non-friendly Formal Verification Designs: Can We Start Earlier?

链接: https://arxiv.org/abs/2410.18454
作者: Bryan Olmos,Daniel Gerl,Aman Kumar,Djones Lettnin
关键词-EN: Systems on Chips, technological advancements, complex due, due to technological, High-level Equivalence Checking
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注: Published in DVCon Europe 2024

点击查看摘要

Abstract:The design of Systems on Chips (SoCs) is becoming more and more complex due to technological advancements. Missed bugs can cause drastic failures in safety-critical environments leading to the endangerment of lives. To overcome these drastic failures, formal property verification (FPV) has been applied in the industry. However, there exist multiple hardware designs where the results of FPV are not conclusive even for long runtimes of model-checking tools. For this reason, the use of High-level Equivalence Checking (HLEC) tools has been proposed in the last few years. However, the procedure for how to use it inside an industrial toolchain has not been defined. For this reason, we proposed an automated methodology based on metamodeling techniques which consist of two main steps. First, an untimed algorithmic description written in C++ is verified in an early stage using generated assertions; the advantage of this step is that the assertions at the software level run in seconds and we can start our analysis with conclusive results about our algorithm before starting to write the RTL (Register Transfer Level) design. Second, this algorithmic description is verified against its sequential design using HLEC and the respective metamodel parameters. The results show that the presented methodology can find bugs early related to the algorithmic description and prepare the setup for the HLEC verification. This helps to reduce the verification efforts to set up the tool and write the properties manually which is always error-prone. The proposed framework can help teams working on datapaths to verify and make decisions in an early stage of the verification flow.

[AI-51] he Nature of Mathematical Modeling and Probabilistic Optimization Engineering in Generative AI

链接: https://arxiv.org/abs/2410.18441
作者: Fulu Li
关键词-EN: mathematical problem formulations, components in Transformer, probabilistic optimization explorations, Transformer model, give an in-depth
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 19 pages, 3 figures

点击查看摘要

Abstract:In this paper, we give an in-depth analysis on the mathematical problem formulations and the probabilistic optimization explorations for some of the key components in Transformer model [33] in the field of generative AI. We explore and discuss some potential further enhancement for current state of the art methods for some key underlying technologies of generative AI models from algorithmic and probabilistic optimization perspective. In particular, we present an optimal solution for sub-word encoding (SWE) based on similar initial settings as that of byte-pair encoding (BPE) algorithm in [9] with similar objectives as that of WordPiece approach in [28, 31] to maximize the likelihood of the training data. We also present cross entropy optimization method to optimize hyperparameters for word2vec model [17]. In addition, we propose a factored combination of rotary positional encoding (RoPE) [32] and attention with linear biases (ALiBi) [23] with a harmonic series. We also present a probabilistic FlashAttention [6, 7] (PrFlashAttention) method with a probability distribution over block distances in the matrix to decide which block is likely to participate in a given round of attention computation while maintaining the lower triangle shape of the tensor for autoregressive language models by re-shaping the tensors. Finally, we present staircase adaptive quantization (SAQ) of key-value (KV) cache for multi-query attention (MQA) based on the framework presented in [16] to have gradual quantization degradation while achieving reasonable model quality and cost savings.

[AI-52] Link Synthesize Retrieve: Universal Document Linking for Zero-Shot Information Retrieval EMNLP2024

链接: https://arxiv.org/abs/2410.18385
作者: Dae Yon Hwang,Bilal Taha,Harshit Pande,Yaroslav Nechaev
关键词-EN: lack historical query, historical query traffic, information retrieval, significant challenge, existing users
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Accepted for publication at EMNLP 2024 Main Conference

点击查看摘要

Abstract:Despite the recent advancements in information retrieval (IR), zero-shot IR remains a significant challenge, especially when dealing with new domains, languages, and newly-released use cases that lack historical query traffic from existing users. For such cases, it is common to use query augmentations followed by fine-tuning pre-trained models on the document data paired with synthetic queries. In this work, we propose a novel Universal Document Linking (UDL) algorithm, which links similar documents to enhance synthetic query generation across multiple datasets with different characteristics. UDL leverages entropy for the choice of similarity models and named entity recognition (NER) for the link decision of documents using similarity scores. Our empirical studies demonstrate the effectiveness and universality of the UDL across diverse datasets and IR models, surpassing state-of-the-art methods in zero-shot cases. The developed code for reproducibility is included in this https URL

[AI-53] Integrating Canonical Neural Units and Multi-Scale Training for Handwritten Text Recognition

链接: https://arxiv.org/abs/2410.18374
作者: Zi-Rui Wang
关键词-EN: hidden Markov model, connectionist temporal classification, segmentation-free research efforts, hidden Markov, Markov model
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The segmentation-free research efforts for addressing handwritten text recognition can be divided into three categories: connectionist temporal classification (CTC), hidden Markov model and encoder-decoder methods. In this paper, inspired by the above three modeling methods, we propose a new recognition network by using a novel three-dimensional (3D) attention module and global-local context information. Based on the feature maps of the last convolutional layer, a series of 3D blocks with different resolutions are split. Then, these 3D blocks are fed into the 3D attention module to generate sequential visual features. Finally, by integrating the visual features and the corresponding global-local context features, a well-designed representation can be obtained. Main canonical neural units including attention mechanisms, fully-connected layer, recurrent unit and convolutional layer are efficiently organized into a network and can be jointly trained by the CTC loss and the cross-entropy loss. Experiments on the latest Chinese handwritten text datasets (the SCUT-HCCDoc and the SCUT-EPT) and one English handwritten text dataset (the IAM) show that the proposed method can make a new milestone.

[AI-54] A Unimodal Speaker-Level Membership Inference Detector for Contrastive Pretraining

链接: https://arxiv.org/abs/2410.18371
作者: Ruoxi Cheng,Yizhong Ding,Shuirong Cao,Shitong Shao,Zhiqiang Wang
关键词-EN: disclose PII, Contrastive Language-Audio Pretraining, combined with related, PII, CLAP
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Audio can disclose PII, particularly when combined with related text data. Therefore, it is essential to develop tools to detect privacy leakage in Contrastive Language-Audio Pretraining(CLAP). Existing MIAs need audio as input, risking exposure of voiceprint and requiring costly shadow models. To address these challenges, we propose USMID, a textual unimodal speaker-level membership inference detector for CLAP models, which queries the target model using only text data and does not require training shadow models. We randomly generate textual gibberish that are clearly not in training dataset. Then we extract feature vectors from these texts using the CLAP model and train a set of anomaly detectors on them. During inference, the feature vector of each test text is input into the anomaly detector to determine if the speaker is in the training set (anomalous) or not (normal). If available, USMID can further enhance detection by integrating real audio of the tested speaker. Extensive experiments on various CLAP model architectures and datasets demonstrate that USMID outperforms baseline methods using only text data.

[AI-55] Contextual Biasing to Improve Domain-specific Custom Vocabulary Audio Transcription without Explicit Fine-Tuning of Whisper Model

链接: https://arxiv.org/abs/2410.18363
作者: Vishakha Lall,Yisi Liu
关键词-EN: Automated Speech Recognition, Whisper Automated Speech, OpenAI Whisper Automated, Automated Speech, Speech Recognition model
类目: Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:OpenAI’s Whisper Automated Speech Recognition model excels in generalizing across diverse datasets and domains. However, this broad adaptability can lead to diminished performance in tasks requiring recognition of specific vocabularies. Addressing this challenge typically involves fine-tuning the model, which demands extensive labeled audio data that is often difficult to acquire and unavailable for specific domains. In this study, we propose a method to enhance transcription accuracy without explicit fine-tuning or altering model parameters, using a relatively small training dataset. Our method leverages contextual biasing, to direct Whisper model’s output towards a specific vocabulary by integrating a neural-symbolic prefix tree structure to guide the model’s transcription output. To validate our approach, we conducted experiments using a validation dataset comprising maritime data collected within a simulated training environment. A comparison between the original Whisper models of varying parameter sizes and our biased model revealed a notable reduction in transcription word error rate and enhanced performance of downstream applications. Our findings suggest that this methodology holds promise for improving speech-to-text translation performance in domains characterized by limited vocabularies.

[AI-56] Data Publishing in Mechanics and Dynamics: Challenges Guidelines and Examples from Engineering Design

链接: https://arxiv.org/abs/2410.18358
作者: Henrik Ebel,Jan van Delden,Timo Lüddecke,Aditya Borse,Rutwik Gulakala,Marcus Stoffel,Manish Yadav,Merten Stender,Leon Schindler,Kristin Miriam de Payrebrune,Maximilian Raff,C. David Remy,Benedict Röder,Peter Eberhard
关键词-EN: artificial neural networks, gained increasing importance, deep artificial neural, Data-based methods, neural networks
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Emerging Technologies (cs.ET); Systems and Control (eess.SY)
*备注: 21 pages, 8 figures

点击查看摘要

Abstract:Data-based methods have gained increasing importance in engineering, especially but not only driven by successes with deep artificial neural networks. Success stories are prevalent, e.g., in areas such as data-driven modeling, control and automation, as well as surrogate modeling for accelerated simulation. Beyond engineering, generative and large-language models are increasingly performing and helping with tasks that, previously, were solely associated with creative human processes. Thus, it seems timely to seek artificial-intelligence-support for engineering design tasks to automate, help with, or accelerate purpose-built designs of engineering systems, e.g., in mechanics and dynamics, where design so far requires a lot of specialized knowledge. However, research-wise, compared to established, predominantly first-principles-based methods, the datasets used for training, validation, and test become an almost inherent part of the overall methodology. Thus, data publishing becomes just as important in (data-driven) engineering science as appropriate descriptions of conventional methodology in publications in the past. This article analyzes the value and challenges of data publishing in mechanics and dynamics, in particular regarding engineering design tasks, showing that the latter raise also challenges and considerations not typical in fields where data-driven methods have been booming originally. Possible ways to deal with these challenges are discussed and a set of examples from across different design problems shows how data publishing can be put into practice. The analysis, discussions, and examples are based on the research experience made in a priority program of the German research foundation focusing on research on artificially intelligent design assistants in mechanics and dynamics.

[AI-57] he Impact of Generative Artificial Intelligence on Ideation and the performance of Innovation Teams (Preprint)

链接: https://arxiv.org/abs/2410.18357
作者: Michael Gindert,Marvin Lutz Müller
关键词-EN: Generative Artificial Intelligence, Artificial Intelligence, Generative Artificial, idea generation phase, Knowledge Spill-over Theory
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 24 pages, 5 figures, Author Contributions: Michael Gindert: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data Curation, Writing - Original Draft, Writing - Review Editing, Visualization, Project administration, Funding acquisition Marvin Lutz Müller: Validation, Investigation, Resources, Writing - Review Editing, Supervision

点击查看摘要

Abstract:This study investigates the impact of Generative Artificial Intelligence (GenAI) on the dynam-ics and performance of innovation teams during the idea generation phase of the innovation process. Utilizing a custom AI-augmented ideation tool, the study applies the Knowledge Spill-over Theory of Entrepreneurship to understand the effects of AI on knowledge spillover, gen-eration and application. Through a framed field experiment with participants divided into exper-imental and control groups, findings indicate that AI-augmented teams generated higher quali-ty ideas in less time. GenAI application led to improved efficiency, knowledge exchange, in-creased satisfaction and engagement as well as enhanced idea diversity. These results high-light the transformative role of the field of AI within the innovation management domain and shows that GenAI has a positive impact on important elements of the Knowledge Spillover Theory of Entrepeneurship, emphasizing its potential impact on innovation, entrepreneurship, and economic growth. Future research should further explore the dynamic interaction be-tween GenAI and creative processes.

[AI-58] Geometric Feature Enhanced Knowledge Graph Embedding and Spatial Reasoning

链接: https://arxiv.org/abs/2410.18345
作者: Lei Hu,Wenwen Li,Yunqiang Zhu
关键词-EN: strong knowledge support, including data retrieval, Geospatial Knowledge Graphs, providing strong knowledge, knowledge graph embedding
类目: Artificial Intelligence (cs.AI)
*备注: 4 pages, 1 figure, Accepted for the 7th ACM SIGSPATIAL International Workshop on AI for Geographic Knowledge Discovery

点击查看摘要

Abstract:Geospatial Knowledge Graphs (GeoKGs) model geoentities (e.g., places and natural features) and spatial relationships in an interconnected manner, providing strong knowledge support for geographic applications, including data retrieval, question-answering, and spatial reasoning. However, existing methods for mining and reasoning from GeoKGs, such as popular knowledge graph embedding (KGE) techniques, lack geographic awareness. This study aims to enhance general-purpose KGE by developing new strategies and integrating geometric features of spatial relations, including topology, direction, and distance, to infuse the embedding process with geographic intuition. The new model is tested on downstream link prediction tasks, and the results show that the inclusion of geometric features, particularly topology and direction, improves prediction accuracy for both geoentities and spatial relations. Our research offers new perspectives for integrating spatial concepts and principles into the GeoKG mining process, providing customized GeoAI solutions for geospatial challenges.

[AI-59] Search-Based Path Planning among Movable Obstacles

链接: https://arxiv.org/abs/2410.18333
作者: Zhongqiang Ren,Bunyod Suvonov,Guofei Chen,Botao He,Yijie Liao,Cornelia Fermuller,Ji Zhang
关键词-EN: cost collision-free path, Movable Obstacles, paper investigates Path, investigates Path planning, minimum cost collision-free
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper investigates Path planning Among Movable Obstacles (PAMO), which seeks a minimum cost collision-free path among static obstacles from start to goal while allowing the robot to push away movable obstacles (i.e., objects) along its path when needed. To develop planners that are complete and optimal for PAMO, the planner has to search a giant state space involving both the location of the robot as well as the locations of the objects, which grows exponentially with respect to the number of objects. The main idea in this paper is that, only a small fraction of this giant state space needs to be explored during planning as guided by a heuristic, and most of the objects far away from the robot are intact, which thus leads to runtime efficient algorithms. Based on this idea, this paper introduces two PAMO formulations, i.e., bi-objective and resource constrained problems in an occupancy grid, and develops PAMO*, a search method with completeness and solution optimality guarantees, to solve the two problems. We then further extend PAMO* to hybrid-state PAMO* to plan in continuous spaces with high-fidelity interaction between the robot and the objects. Our results show that, PAMO* can often find optimal solutions within a second in cluttered environments with up to 400 objects.

[AI-60] Self-Supervised Learning for Time Series: A Review Critique of FITS

链接: https://arxiv.org/abs/2410.18318
作者: Andreas Løvendahl Eefsen,Nicholas Erup Larsen,Oliver Glozmann Bork Hansen,Thor Højhus Avenstrup
关键词-EN: Accurate time series, time series forecasting, highly valuable endeavour, Accurate time, time series
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: arXiv:2307.03756v3 45 pages, 36 figures

点击查看摘要

Abstract:Accurate time series forecasting is a highly valuable endeavour with applications across many industries. Despite recent deep learning advancements, increased model complexity, and larger model sizes, many state-of-the-art models often perform worse or on par with simpler models. One of those cases is a recently proposed model, FITS, claiming competitive performance with significantly reduced parameter counts. By training a one-layer neural network in the complex frequency domain, we are able to replicate these results. Our experiments on a wide range of real-world datasets further reveal that FITS especially excels at capturing periodic and seasonal patterns, but struggles with trending, non-periodic, or random-resembling behavior. With our two novel hybrid approaches, where we attempt to remedy the weaknesses of FITS by combining it with DLinear, we achieve the best results of any known open-source model on multivariate regression and promising results in multiple/linear regression on price datasets, on top of vastly improving upon what FITS achieves as a standalone model.

[AI-61] Countering Autonomous Cyber Threats

链接: https://arxiv.org/abs/2410.18312
作者: Kade M. Heckel,Adrian Weller
关键词-EN: Foundation Models present, fluent natural language, present dual-use concerns, dual-use concerns broadly, Models present dual-use
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 76 pages, MPhil Thesis

点击查看摘要

Abstract:With the capability to write convincing and fluent natural language and generate code, Foundation Models present dual-use concerns broadly and within the cyber domain specifically. Generative AI has already begun to impact cyberspace through a broad illicit marketplace for assisting malware development and social engineering attacks through hundreds of malicious-AI-as-a-services tools. More alarming is that recent research has shown the potential for these advanced models to inform or independently execute offensive cyberspace operations. However, these previous investigations primarily focused on the threats posed by proprietary models due to the until recent lack of strong open-weight model and additionally leave the impacts of network defenses or potential countermeasures unexplored. Critically, understanding the aptitude of downloadable models to function as offensive cyber agents is vital given that they are far more difficult to govern and prevent their misuse. As such, this work evaluates several state-of-the-art FMs on their ability to compromise machines in an isolated network and investigates defensive mechanisms to defeat such AI-powered attacks. Using target machines from a commercial provider, the most recently released downloadable models are found to be on par with a leading proprietary model at conducting simple cyber attacks with common hacking tools against known vulnerabilities. To mitigate such LLM-powered threats, defensive prompt injection (DPI) payloads for disrupting the malicious cyber agent’s workflow are demonstrated to be effective. From these results, the implications for AI safety and governance with respect to cybersecurity is analyzed.

[AI-62] 1-2-3-Go! Policy Synthesis for Parameterized Markov Decision Processes via Decision-Tree Learning and Generalization

链接: https://arxiv.org/abs/2410.18293
作者: Muqsit Azeem,Debraj Chakraborty,Sudeep Kanav,Jan Kretinsky,Mohammadsadegh Mohagheghi,Stefanie Mohr,Maximilian Weininger
关键词-EN: methods remains limited, probabilistic model checking, remains limited, advances in probabilistic, verification methods remains
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Systems and Control (eess.SY)
*备注: Preprint. Under review

点击查看摘要

Abstract:Despite the advances in probabilistic model checking, the scalability of the verification methods remains limited. In particular, the state space often becomes extremely large when instantiating parameterized Markov decision processes (MDPs) even with moderate values. Synthesizing policies for such \emphhuge MDPs is beyond the reach of available tools. We propose a learning-based approach to obtain a reasonable policy for such huge MDPs. The idea is to generalize optimal policies obtained by model-checking small instances to larger ones using decision-tree learning. Consequently, our method bypasses the need for explicit state-space exploration of large models, providing a practical solution to the state-space explosion problem. We demonstrate the efficacy of our approach by performing extensive experimentation on the relevant models from the quantitative verification benchmark set. The experimental results indicate that our policies perform well, even when the size of the model is orders of magnitude beyond the reach of state-of-the-art analysis tools. Comments: Preprint. Under review Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Systems and Control (eess.SY) Cite as: arXiv:2410.18293 [cs.AI] (or arXiv:2410.18293v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2410.18293 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-63] Screw Geometry Meets Bandits: Incremental Acquisition of Demonstrations to Generate Manipulation Plans

链接: https://arxiv.org/abs/2410.18275
作者: Dibyendu Das,Aditya Patankar,Nilanjan Chakraborty,C.R. Ramakrishnan,I.V. Ramakrishnan
关键词-EN: complex manipulation task, methodically obtaining, generate manipulation plans, set of kinesthetic, demonstrations
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 8 pages, 6 figures, under review in IEEE Robotics and Automation Letters

点击查看摘要

Abstract:In this paper, we study the problem of methodically obtaining a sufficient set of kinesthetic demonstrations, one at a time, such that a robot can be confident of its ability to perform a complex manipulation task in a given region of its workspace. Although Learning from Demonstrations has been an active area of research, the problems of checking whether a set of demonstrations is sufficient, and systematically seeking additional demonstrations have remained open. We present a novel approach to address these open problems using (i) a screw geometric representation to generate manipulation plans from demonstrations, which makes the sufficiency of a set of demonstrations measurable; (ii) a sampling strategy based on PAC-learning from multi-armed bandit optimization to evaluate the robot’s ability to generate manipulation plans in a subregion of its task space; and (iii) a heuristic to seek additional demonstration from areas of weakness. Thus, we present an approach for the robot to incrementally and actively ask for new demonstration examples until the robot can assess with high confidence that it can perform the task successfully. We present experimental results on two example manipulation tasks, namely, pouring and scooping, to illustrate our approach. A short video on the method: this https URL

[AI-64] Backdoor in Seconds: Unlocking Vulnerabilities in Large Pre-trained Models via Model Editing

链接: https://arxiv.org/abs/2410.18267
作者: Dongliang Guo,Mengxuan Hu,Zihan Guan,Junfeng Guo,Thomas Hartvigsen,Sheng Li
关键词-EN: Large pre-trained models, Large pre-trained, achieved notable success, pre-trained models, pre-trained
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large pre-trained models have achieved notable success across a range of downstream tasks. However, recent research shows that a type of adversarial attack ( \textiti.e., backdoor attack) can manipulate the behavior of machine learning models through contaminating their training dataset, posing significant threat in the real-world application of large pre-trained model, especially for those customized models. Therefore, addressing the unique challenges for exploring vulnerability of pre-trained models is of paramount importance. Through empirical studies on the capability for performing backdoor attack in large pre-trained models ( \textite.g., ViT), we find the following unique challenges of attacking large pre-trained models: 1) the inability to manipulate or even access large training datasets, and 2) the substantial computational resources required for training or fine-tuning these models. To address these challenges, we establish new standards for an effective and feasible backdoor attack in the context of large pre-trained models. In line with these standards, we introduce our EDT model, an \textbfEfficient, \textbfData-free, \textbfTraining-free backdoor attack method. Inspired by model editing techniques, EDT injects an editing-based lightweight codebook into the backdoor of large pre-trained models, which replaces the embedding of the poisoned image with the target image without poisoning the training dataset or training the victim model. Our experiments, conducted across various pre-trained models such as ViT, CLIP, BLIP, and stable diffusion, and on downstream tasks including image classification, image captioning, and image generation, demonstrate the effectiveness of our method. Our code is available in the supplementary material.

[AI-65] Context-Augmented Code Generation Using Programming Knowledge Graphs

链接: https://arxiv.org/abs/2410.18251
作者: Iman Saberi,Fatemeh Fard
关键词-EN: Large Language Models, Large Language, frequently face difficulties, significantly improved code, Language Models
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: 20 pages, Conference

点击查看摘要

Abstract:Large Language Models (LLMs) and Code-LLMs (CLLMs) have significantly improved code generation, but, they frequently face difficulties when dealing with challenging and complex problems. Retrieval-Augmented Generation (RAG) addresses this issue by retrieving and integrating external knowledge at the inference time. However, retrieval models often fail to find most relevant context, and generation models, with limited context capacity, can hallucinate when given irrelevant data. We present a novel framework that leverages a Programming Knowledge Graph (PKG) to semantically represent and retrieve code. This approach enables fine-grained code retrieval by focusing on the most relevant segments while reducing irrelevant context through a tree-pruning technique. PKG is coupled with a re-ranking mechanism to reduce even more hallucinations by selectively integrating non-RAG solutions. We propose two retrieval approaches-block-wise and function-wise-based on the PKG, optimizing context granularity. Evaluations on the HumanEval and MBPP benchmarks show our method improves pass@1 accuracy by up to 20%, and outperforms state-of-the-art models by up to 34% on MBPP. Our contributions include PKG-based retrieval, tree pruning to enhance retrieval precision, a re-ranking method for robust solution selection and a Fill-in-the-Middle (FIM) enhancer module for automatic code augmentation with relevant comments and docstrings.

[AI-66] Efficient Inference for Augmented Large Language Models

链接: https://arxiv.org/abs/2410.18248
作者: Rana Shahout,Cong Liang,Shiji Xin,Qianru Lao,Yong Cui,Minlan Yu,Michael Mitzenmacher
关键词-EN: Large Language Models, Augmented Large Language, Language Models, Large Language, integrating external data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Augmented Large Language Models (LLMs) enhance the capabilities of standalone LLMs by integrating external data sources through API calls. In interactive LLM applications, efficient scheduling is crucial for maintaining low request completion times, directly impacting user engagement. However, these augmentations introduce scheduling challenges due to the need to manage limited memory for cached information (KV caches). As a result, traditional size-based scheduling algorithms, such as Shortest Job First (SJF), become less effective at minimizing completion times. Existing work focuses only on handling requests during API calls by preserving, discarding, or swapping memory without considering how to schedule requests with API calls. In this paper, we propose LAMPS, a novel LLM inference framework for augmented LLMs. LAMPS minimizes request completion time through a unified scheduling approach that considers the total length of requests and their handling strategies during API calls. Recognizing that LLM inference is memory-bound, our approach ranks requests based on their consumption of memory over time, which depends on both the output sizes and how a request is managed during its API calls. To implement our scheduling, LAMPS predicts the strategy that minimizes memory waste of a request during its API calls, aligning with but improving upon existing approaches. We also propose starvation prevention techniques and optimizations to mitigate the overhead of our scheduling. We implement LAMPS on top of vLLM and evaluate its performance against baseline LLM inference systems, demonstrating improvements in end-to-end latency by 27%-85% and reductions in TTFT by 4%-96% compared to the existing augmented-LLM system, with even greater gains over vLLM.

[AI-67] Human-Agent Coordination in Games under Incomplete Information via Multi-Step Intent

链接: https://arxiv.org/abs/2410.18242
作者: Shenghui Chen,Ruihan Zhao,Sandeep Chinchali,Ufuk Topcu
关键词-EN: Strategic coordination, incomplete information, coordination between autonomous, turn-based cooperative games, Strategic
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Strategic coordination between autonomous agents and human partners under incomplete information can be modeled as turn-based cooperative games. We extend a turn-based game under incomplete information, the shared-control game, to allow players to take multiple actions per turn rather than a single action. The extension enables the use of multi-step intent, which we hypothesize will improve performance in long-horizon tasks. To synthesize cooperative policies for the agent in this extended game, we propose an approach featuring a memory module for a running probabilistic belief of the environment dynamics and an online planning algorithm called IntentMCTS. This algorithm strategically selects the next action by leveraging any communicated multi-step intent via reward augmentation while considering the current belief. Agent-to-agent simulations in the Gnomes at Night testbed demonstrate that IntentMCTS requires fewer steps and control switches than baseline methods. A human-agent user study corroborates these findings, showing an 18.52% higher success rate compared to the heuristic baseline and a 5.56% improvement over the single-step prior work. Participants also report lower cognitive load, frustration, and higher satisfaction with the IntentMCTS agent partner.

[AI-68] Characterising Open Source Co-opetition in Company-hosted Open Source Software Projects: The Cases of PyTorch TensorFlow and Transformers

链接: https://arxiv.org/abs/2410.18241
作者: Cailean Osborne,Farbod Daneshyan,Runzhi He,Hengzhi Ye,Yuxia Zhang,Minghui Zhou
关键词-EN: open source co-opetition, including market rivals, open source, source co-opetition, OSS projects
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 26 pages, 2 figures, 9 tables

点击查看摘要

Abstract:Companies, including market rivals, have long collaborated on the development of open source software (OSS), resulting in a tangle of co-operation and competition known as “open source co-opetition”. While prior work investigates open source co-opetition in OSS projects that are hosted by vendor-neutral foundations, we have a limited understanding thereof in OSS projects that are hosted and governed by one company. Given their prevalence, it is timely to investigate open source co-opetition in such contexts. Towards this end, we conduct a mixed-methods analysis of three company-hosted OSS projects in the artificial intelligence (AI) industry: Meta’s PyTorch (prior to its donation to the Linux Foundation), Google’s TensorFlow, and Hugging Face’s Transformers. We contribute three key findings. First, while the projects exhibit similar code authorship patterns between host and external companies (80%/20% of commits), collaborations are structured differently (e.g., decentralised vs. hub-and-spoke networks). Second, host and external companies engage in strategic, non-strategic, and contractual collaborations, with varying incentives and collaboration practices. Some of the observed collaborations are specific to the AI industry (e.g., hardware-software optimizations or AI model integrations), while others are typical of the broader software industry (e.g., bug fixing or task outsourcing). Third, single-vendor governance creates a power imbalance that influences open source co-opetition practices and possibilities, from the host company’s singular decision-making power (e.g., the risk of license change) to their community involvement strategy (e.g., from over-control to over-delegation). We conclude with recommendations for future research.

[AI-69] Bayesian optimization for robust robotic grasping using a sensorized compliant hand

链接: https://arxiv.org/abs/2410.18237
作者: Juan G. Lechuz-Sierra,Ana Elvira H. Martin,Ashok M. Sundaram,Ruben Martinez-Cantin,Máximo A. Roa
关键词-EN: grasp objects based, objects based, tactile perception, Bayesian optimization, enable multiple applications
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:One of the first tasks we learn as children is to grasp objects based on our tactile perception. Incorporating such skill in robots will enable multiple applications, such as increasing flexibility in industrial processes or providing assistance to people with physical disabilities. However, the difficulty lies in adapting the grasping strategies to a large variety of tasks and objects, which can often be unknown. The brute-force solution is to learn new grasps by trial and error, which is inefficient and ineffective. In contrast, Bayesian optimization applies active learning by adding information to the approximation of an optimal grasp. This paper proposes the use of Bayesian optimization techniques to safely perform robotic grasping. We analyze different grasp metrics to provide realistic grasp optimization in a real system including tactile sensors. An experimental evaluation in the robotic system shows the usefulness of the method for performing unknown object grasping even in the presence of noise and uncertainty inherent to a real-world environment.

[AI-70] Data Augmentation for Automated Adaptive Rodent Training

链接: https://arxiv.org/abs/2410.18221
作者: Dibyendu Das,Alfredo Fontanini,Joshua F. Kogan,Haibin Ling,C.R. Ramakrishnan,I.V. Ramakrishnan
关键词-EN: Fully optimized automation, Fully optimized, behavioral training protocols, optimized automation, training protocols
类目: Artificial Intelligence (cs.AI)
*备注: 5 pages, 3 figures

点击查看摘要

Abstract:Fully optimized automation of behavioral training protocols for lab animals like rodents has long been a coveted goal for researchers. It is an otherwise labor-intensive and time-consuming process that demands close interaction between the animal and the researcher. In this work, we used a data-driven approach to optimize the way rodents are trained in labs. In pursuit of our goal, we looked at data augmentation, a technique that scales well in data-poor environments. Using data augmentation, we built several artificial rodent models, which in turn would be used to build an efficient and automatic trainer. Then we developed a novel similarity metric based on the action probability distribution to measure the behavioral resemblance of our models to that of real rodents.

[AI-71] Neural Cover Selection for Image Steganography

链接: https://arxiv.org/abs/2410.18216
作者: Karl Chahine,Hyeji Kim
关键词-EN: effective message concealment, selecting an optimal, pivotal for effective, optimal cover image, cover
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In steganography, selecting an optimal cover image, referred to as cover selection, is pivotal for effective message concealment. Traditional methods have typically employed exhaustive searches to identify images that conform to specific perceptual or complexity metrics. However, the relationship between these metrics and the actual message hiding efficacy of an image is unclear, often yielding less-than-ideal steganographic outcomes. Inspired by recent advancements in generative models, we introduce a novel cover selection framework, which involves optimizing within the latent space of pretrained generative models to identify the most suitable cover images, distinguishing itself from traditional exhaustive search methods. Our method shows significant advantages in message recovery and image quality. We also conduct an information-theoretic analysis of the generated cover images, revealing that message hiding predominantly occurs in low-variance pixels, reflecting the waterfilling algorithm’s principles in parallel Gaussian channels. Our code can be found at: this https URL.

[AI-72] PyTSC: A Unified Platform for Multi-Agent Reinforcement Learning in Traffic Signal Control

链接: https://arxiv.org/abs/2410.18202
作者: Rohit Bokade,Xiaoning Jin
关键词-EN: Multi-Agent Reinforcement Learning, Traffic Signal Control, Reinforcement Learning, Signal Control, Multi-Agent Reinforcement
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
*备注: 13 pages

点击查看摘要

Abstract:Multi-Agent Reinforcement Learning (MARL) presents a promising approach for addressing the complexity of Traffic Signal Control (TSC) in urban environments. However, existing platforms for MARL-based TSC research face challenges such as slow simulation speeds and convoluted, difficult-to-maintain codebases. To address these limitations, we introduce PyTSC, a robust and flexible simulation environment that facilitates the training and evaluation of MARL algorithms for TSC. PyTSC integrates multiple simulators, such as SUMO and CityFlow, and offers a streamlined API, empowering researchers to explore a broad spectrum of MARL approaches efficiently. PyTSC accelerates experimentation and provides new opportunities for advancing intelligent traffic management systems in real-world applications.

[AI-73] abDPT: Scaling Tabular Foundation Models

链接: https://arxiv.org/abs/2410.18164
作者: Junwei Ma,Valentin Thomas,Rasa Hosseinzadeh,Hamidreza Kamkari,Alex Labach,Jesse C. Cresswell,Keyvan Golestan,Guangwei Yu,Maksims Volkovs,Anthony L. Caterini
关键词-EN: faced by neural, neural networks, hampered the progress, ICL, tabular foundation models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: Minimal TabDPT interface to provide predictions on new datasets available at the following link: this https URL

点击查看摘要

Abstract:The challenges faced by neural networks on tabular data are well-documented and have hampered the progress of tabular foundation models. Techniques leveraging in-context learning (ICL) have shown promise here, allowing for dynamic adaptation to unseen data. ICL can provide predictions for entirely new datasets without further training or hyperparameter tuning, therefore providing very fast inference when encountering a novel task. However, scaling ICL for tabular data remains an issue: approaches based on large language models cannot efficiently process numeric tables, and tabular-specific techniques have not been able to effectively harness the power of real data to improve performance and generalization. We are able to overcome these challenges by training tabular-specific ICL-based architectures on real data with self-supervised learning and retrieval, combining the best of both worlds. Our resulting model – the Tabular Discriminative Pre-trained Transformer (TabDPT) – achieves state-of-the-art performance on the CC18 (classification) and CTR23 (regression) benchmarks with no task-specific fine-tuning, demonstrating the adapatability and speed of ICL once the model is pre-trained. TabDPT also demonstrates strong scaling as both model size and amount of available data increase, pointing towards future improvements simply through the curation of larger tabular pre-training datasets and training larger models.

[AI-74] Physics-informed Neural Networks for Functional Differential Equations: Cylindrical Approximation and Its Convergence Guarantees NEURIPS2024

链接: https://arxiv.org/abs/2410.18153
作者: Taiki Miyagawa,Takeru Yokota
关键词-EN: functional differential equations, differential equations, cylindrical approximation, FDEs, cylindrical
类目: Numerical Analysis (math.NA); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); High Energy Physics - Theory (hep-th); Machine Learning (stat.ML)
*备注: Accepted at NeurIPS 2024. Both authors contributed equally. Some contents are omitted due to arXiv’s storage limit. Please refer to the full paper at OpenReview (NeurIPS 2024) or this https URL

点击查看摘要

Abstract:We propose the first learning scheme for functional differential equations (FDEs). FDEs play a fundamental role in physics, mathematics, and optimal control. However, the numerical analysis of FDEs has faced challenges due to its unrealistic computational costs and has been a long standing problem over decades. Thus, numerical approximations of FDEs have been developed, but they often oversimplify the solutions. To tackle these two issues, we propose a hybrid approach combining physics-informed neural networks (PINNs) with the \textitcylindrical approximation. The cylindrical approximation expands functions and functional derivatives with an orthonormal basis and transforms FDEs into high-dimensional PDEs. To validate the reliability of the cylindrical approximation for FDE applications, we prove the convergence theorems of approximated functional derivatives and solutions. Then, the derived high-dimensional PDEs are numerically solved with PINNs. Through the capabilities of PINNs, our approach can handle a broader class of functional derivatives more efficiently than conventional discretization-based methods, improving the scalability of the cylindrical approximation. As a proof of concept, we conduct experiments on two FDEs and demonstrate that our model can successfully achieve typical L^1 relative error orders of PINNs \sim 10^-3 . Overall, our work provides a strong backbone for physicists, mathematicians, and machine learning experts to analyze previously challenging FDEs, thereby democratizing their numerical analysis, which has received limited attention. Code is available at \urlthis https URL.

[AI-75] Deep Autoencoder with SVD-Like Convergence and Flat Minima

链接: https://arxiv.org/abs/2410.18148
作者: Nithin Somasekharan,Shaowu Pan
关键词-EN: low-dimensional intrinsic latent, intrinsic latent space, latent space increases, modal analysis, latent space
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
*备注: 14 pages

点击查看摘要

Abstract:Representation learning for high-dimensional, complex physical systems aims to identify a low-dimensional intrinsic latent space, which is crucial for reduced-order modeling and modal analysis. To overcome the well-known Kolmogorov barrier, deep autoencoders (AEs) have been introduced in recent years, but they often suffer from poor convergence behavior as the rank of the latent space increases. To address this issue, we propose the learnable weighted hybrid autoencoder, a hybrid approach that combines the strengths of singular value decomposition (SVD) with deep autoencoders through a learnable weighted framework. We find that the introduction of learnable weighting parameters is essential - without them, the resulting model would either collapse into a standard POD or fail to exhibit the desired convergence behavior. Additionally, we empirically find that our trained model has a sharpness thousands of times smaller compared to other models. Our experiments on classical chaotic PDE systems, including the 1D Kuramoto-Sivashinsky and forced isotropic turbulence datasets, demonstrate that our approach significantly improves generalization performance compared to several competing methods, paving the way for robust representation learning of high-dimensional, complex physical systems.

[AI-76] MEC-IP: Efficient Discovery of Markov Equivalent Classes via Integer Programming

链接: https://arxiv.org/abs/2410.18147
作者: Abdelmonem Elrefaey,Rong Pan
关键词-EN: Markov Equivalent Class, Integer Programming, Equivalent Class, Bayesian Networks, Markov Equivalent
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This paper presents a novel Integer Programming (IP) approach for discovering the Markov Equivalent Class (MEC) of Bayesian Networks (BNs) through observational data. The MEC-IP algorithm utilizes a unique clique-focusing strategy and Extended Maximal Spanning Graphs (EMSG) to streamline the search for MEC, thus overcoming the computational limitations inherent in other existing algorithms. Our numerical results show that not only a remarkable reduction in computational time is achieved by our algorithm but also an improvement in causal discovery accuracy is seen across diverse datasets. These findings underscore this new algorithm’s potential as a powerful tool for researchers and practitioners in causal discovery and BNSL, offering a significant leap forward toward the efficient and accurate analysis of complex data structures.

[AI-77] owards Edge General Intelligence via Large Language Models : Opportunities and Challenges

链接: https://arxiv.org/abs/2410.18125
作者: Handi Chen,Weipeng Deng,Shuo Yang,Jinfeng Xu,Zhihan Jiang,Edith C.H. Ngai,Jiangchuan Liu,Xue Liu
关键词-EN: Edge General Intelligence, Large Language Models, General Intelligence, delivering real-time, localized services
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:Edge Intelligence (EI) has been instrumental in delivering real-time, localized services by leveraging the computational capabilities of edge networks. The integration of Large Language Models (LLMs) empowers EI to evolve into the next stage: Edge General Intelligence (EGI), enabling more adaptive and versatile applications that require advanced understanding and reasoning capabilities. However, systematic exploration in this area remains insufficient. This survey delineates the distinctions between EGI and traditional EI, categorizing LLM-empowered EGI into three conceptual systems: centralized, hybrid, and decentralized. For each system, we detail the framework designs and review existing implementations. Furthermore, we evaluate the performance and throughput of various Small Language Models (SLMs) that are more suitable for development on edge devices. This survey provides researchers with a comprehensive vision of EGI, offering insights into its vast potential and establishing a foundation for future advancements in this rapidly evolving field.

[AI-78] Movement Control of Smart Mosques Domes using CSRNet and Fuzzy Logic Techniques

链接: https://arxiv.org/abs/2410.18123
作者: Anas H. Blasi,Mohammad Awis Al Lababede,Mohammed A. Alsuwaiket
关键词-EN: Saudi Arabia, Saudi Arabia weather, preserved clean, mosque, places of Allah
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Mosques are worship places of Allah and must be preserved clean, immaculate, provide all the comforts of the worshippers in them. The prophet’s mosque in Medina/ Saudi Arabia is one of the most important mosques for Muslims. It occupies second place after the sacred mosque in Mecca/ Saudi Arabia, which is in constant overcrowding by all Muslims to visit the prophet Mohammad’s tomb. This paper aims to propose a smart dome model to preserve the fresh air and allow the sunlight to enter the mosque using artificial intelligence techniques. The proposed model controls domes movements based on the weather conditions and the overcrowding rates in the mosque. The data have been collected from two different resources, the first one from the database of Saudi Arabia weather’s history, and the other from Shanghai Technology Database. Congested Scene Recognition Network (CSRNet) and Fuzzy techniques have applied using Python programming language to control the domes to be opened and closed for a specific time to renew the air inside the mosque. Also, this model consists of several parts that are connected for controlling the mechanism of opening/closing domes according to weather data and the situation of crowding in the mosque. Finally, the main goal of this paper has been achieved, and the proposed model has worked efficiently and specifies the exact duration time to keep the domes open automatically for a few minutes for each hour head.

[AI-79] Point Cloud Compression with Bits-back Coding

链接: https://arxiv.org/abs/2410.18115
作者: Nguyen Quang Hieu,Minh Nguyen,Dinh Thai Hoang,Diep N. Nguyen,Eryk Dutkiewicz
关键词-EN: point cloud, point cloud data, geometric attributes, lossless compression method, compression ratio
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This paper is under reviewed in IEEE Robotics and Automation Letters

点击查看摘要

Abstract:This paper introduces a novel lossless compression method for compressing geometric attributes of point cloud data with bits-back coding. Our method specializes in using a deep learning-based probabilistic model to estimate the Shannon’s entropy of the point cloud information, i.e., geometric attributes of the 3D floating points. Once the entropy of the point cloud dataset is estimated with a convolutional variational autoencoder (CVAE), we use the learned CVAE model to compress the geometric attributes of the point clouds with the bits-back coding technique. The novelty of our method with bits-back coding specializes in utilizing the learned latent variable model of the CVAE to compress the point cloud data. By using bits-back coding, we can capture the potential correlation between the data points, such as similar spatial features like shapes and scattering regions, into the lower-dimensional latent space to further reduce the compression ratio. The main insight of our method is that we can achieve a competitive compression ratio as conventional deep learning-based approaches, while significantly reducing the overhead cost of storage and/or communicating the compression codec, making our approach more applicable in practical scenarios. Throughout comprehensive evaluations, we found that the cost for the overhead is significantly small, compared to the reduction of the compression ratio when compressing large point cloud datasets. Experiment results show that our proposed approach can achieve a compression ratio of 1.56 bit-per-point on average, which is significantly lower than the baseline approach such as Google’s Draco with a compression ratio of 1.83 bit-per-point.

[AI-80] Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond

链接: https://arxiv.org/abs/2410.18114
作者: Shanshan Han
关键词-EN: Significant progress, safety, human civilization, efforts, future
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Significant progress has been made in AI safety. However, as this field thrives, a critical question emerges: Are our current efforts aligned with the broader perspective of history and human civilization? This paper presents a blueprint for an advanced human society and leverages this vision to guide contemporary AI safety efforts. We outline a future where the Internet of Everything becomes reality, and create a roadmap that represents significant technological advancements towards this envisioned future. For each stage of advancement, we forecast potential AI safety issues that humanity may face. By projecting current efforts against this blueprint, we examine the alignment between the present motivations and long-term needs. We identify gaps in current approaches and highlight unique challenges and missions that demand increasing attention from AI safety practitioners in the 2020s, addressing critical areas that must not be overlooked in shaping a safe and responsible future for AI development. This vision paper aims to offer a broader perspective on AI safety, emphasizing that our efforts should not only address immediate concerns but also anticipate potential risks in the expanding AI landscape, thereby fostering AI’s role in promoting a more secure and sustainable future for human civilization.

[AI-81] NaVIP: An Image-Centric Indoor Navigation Solution for Visually Impaired People

链接: https://arxiv.org/abs/2410.18109
作者: Jun Yu,Yifan Zhang,Badrinadh Aila,Vinod Namboodiri
关键词-EN: Visually Impaired People, challenging due, absence of satellite, Impaired People, Visually Impaired
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 40 pages, 20 figures

点击查看摘要

Abstract:Indoor navigation is challenging due to the absence of satellite positioning. This challenge is manifold greater for Visually Impaired People (VIPs) who lack the ability to get information from wayfinding signage. Other sensor signals (e.g., Bluetooth and LiDAR) can be used to create turn-by-turn navigation solutions with position updates for users. Unfortunately, these solutions require tags to be installed all around the environment or the use of fairly expensive hardware. Moreover, these solutions require a high degree of manual involvement that raises costs, thus hampering scalability. We propose an image dataset and associated image-centric solution called NaVIP towards visual intelligence that is infrastructure-free and task-scalable, and can assist VIPs in understanding their surroundings. Specifically, we start by curating large-scale phone camera data in a four-floor research building, with 300K images, to lay the foundation for creating an image-centric indoor navigation and exploration solution for inclusiveness. Every image is labelled with precise 6DoF camera poses, details of indoor PoIs, and descriptive captions to assist VIPs. We benchmark on two main aspects: 1) positioning system and 2) exploration support, prioritizing training scalability and real-time inference, to validate the prospect of image-based solution towards indoor navigation. The dataset, code, and model checkpoints are made publicly available at this https URL.

[AI-82] In-Context Code-Text Learning for Bimodal Software Engineering

链接: https://arxiv.org/abs/2410.18107
作者: Xunzhu Tang,Liran Wang,Yonghui Liu,Linzheng Chai,Jian Yang,Zhoujun Li,Haoye Tian,Jacques Klein,Tegawende F. Bissyande
关键词-EN: Bimodal software analysis, analysis initially appeared, software analysis initially, Bimodal software, software engineering
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Bimodal software analysis initially appeared to be within reach with the advent of large language models. Unfortunately, the complex interplay of natural language text and code in software engineering, presents unique challenges that prevent pretrained models to generalize to a variety of tasks. We postulate that in-context learning for the code-text bimodality is a promising avenue. This paper thus introduces a comprehensive study of in-context code-text learning, focusing on leveraging pretrained CodeLLAMA models. We consider a diverse dataset encompassing 23 software engineering tasks, which we transform in an in-context learning format. To effectively extract informative features, we propose a configurable prompt template. Our proposed pipeline, InCTRL, then unifies prompt learning across various software engineering tasks. Extensive evaluation on the study datasets demonstrates the superiority of INCTRL-models in few-shot performance, surpassing state-of-the-art models including the support model, CodeLLAMA. Typically, we observe that applied to the CodeLLAMA model, INCTRL brings improvements in terms of precision (at least about 12%) and recall (up to 93.88%) on various tasks. For example, on the task of program repair, INCTRL improves the BLEU score of CodeLLAMA by 85 points, while for clone detection, INCTRL achieves an improvement of 69 percentage points. Moreover, INCTRL-models offer state-of-the-art performance when using retrieval-augmented generation on individual downstream tasks. Finally, we qualitatively analyze the benefits of INCTRL over CodeLLAMA and open-source all models for broader impact. We make our code and dataset publicly available at: \begincenter \urlthis https URL \endcenter Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Cite as: arXiv:2410.18107 [cs.SE] (or arXiv:2410.18107v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2410.18107 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Xunzhu Tang [view email] [v1] Tue, 8 Oct 2024 19:42:00 UTC (945 KB)

[AI-83] ENWAR: A RAG-empowered Multi-Modal LLM Framework for Wireless Environment Perception

链接: https://arxiv.org/abs/2410.18104
作者: Ahmad M. Nazar,Abdulkadir Celik,Mohamed Y. Selim,Asmaa Abdallah,Daji Qiao,Ahmed M. Eltawil
关键词-EN: Large language models, advancing network management, hold significant promise, Large language, advancing network
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) hold significant promise in advancing network management and orchestration in 6G and beyond networks. However, existing LLMs are limited in domain-specific knowledge and their ability to handle multi-modal sensory data, which is critical for real-time situational awareness in dynamic wireless environments. This paper addresses this gap by introducing ENWAR, an ENvironment-aWARe retrieval augmented generation-empowered multi-modal LLM framework. ENWAR seamlessly integrates multi-modal sensory inputs to perceive, interpret, and cognitively process complex wireless environments to provide human-interpretable situational awareness. ENWAR is evaluated on the GPS, LiDAR, and camera modality combinations of DeepSense6G dataset with state-of-the-art LLMs such as Mistral-7b/8x7b and LLaMa3.1-8/70/405b. Compared to general and often superficial environmental descriptions of these vanilla LLMs, ENWAR delivers richer spatial analysis, accurately identifies positions, analyzes obstacles, and assesses line-of-sight between vehicles. Results show that ENWAR achieves key performance indicators of up to 70% relevancy, 55% context recall, 80% correctness, and 86% faithfulness, demonstrating its efficacy in multi-modal perception and interpretation.

[AI-84] Multiple Global Peaks Big Bang-Big Crunch Algorithm for Multimodal Optimization

链接: https://arxiv.org/abs/2410.18102
作者: Fabio Stroppa
关键词-EN: Big Bang-Big Crunch, Peaks Big Bang-Big, Global Peaks Big, Multiple Global Peaks, multidimensional search spaces
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*备注: 23 pages

点击查看摘要

Abstract:The main challenge of multimodal optimization problems is identifying multiple peaks with high accuracy in multidimensional search spaces with irregular landscapes. This work proposes the Multiple Global Peaks Big Bang-Big Crunch algorithm, which addresses the challenge of multimodal optimization problems by introducing a specialized mechanism for each operator. Inspired by the evolution of the universe, Multiple Global Peaks Big Bang-Big Crunch groups the best individuals of the population into cluster-based centers of mass and then expands them with a progressively lower disturbance to guarantee convergence. During this process, it (i) applies a distance-based filtering to remove unnecessary elites such that the ones on smaller peaks are not lost, (ii) promotes isolated individuals based on their niche count after clustering, and (iii) balances exploration and exploitation during offspring generation to target specific accuracy levels. Experimental results on twenty multimodal benchmark test functions show that Multiple Gloal Peaks Big Bang-Big Crunch generally performs better or competitively with respect to other state-of-the-art multimodal optimization algorithms.

[AI-85] RRADistill: Distilling LLM s Passage Ranking Ability for Document Re-Ranking of Long-Tail Queries in a Search Engine EMNLP2024

链接: https://arxiv.org/abs/2410.18097
作者: Nayoung Choi,Youngjune Lee,Gyu-Hwung Cho,Haeyu Jeong,Jungmin Kong,Saehun Kim,Keunchan Park,Jaeho Choi,Sarah Cho,Inchang Jeong,Gyohee Nam,Sunghoon Han,Wonil Yang
关键词-EN: Large Language Models, Large Language, Language Models, semantic relationships, lengthy and complex
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to EMNLP 2024 Industry Track. First two authors contributed equally

点击查看摘要

Abstract:Large Language Models (LLMs) excel at understanding the semantic relationships between queries and documents, even with lengthy and complex long-tail queries. These queries are challenging for feedback-based rankings due to sparse user engagement and limited feedback, making LLMs’ ranking ability highly valuable. However, the large size and slow inference of LLMs necessitate the development of smaller, more efficient models (sLLMs). Recently, integrating ranking label generation into distillation techniques has become crucial, but existing methods underutilize LLMs’ capabilities and are cumbersome. Our research, RRADistill: Re-Ranking Ability Distillation, propose an efficient label generation pipeline and novel sLLM training methods for both encoder and decoder models. We introduce an encoder-based method using a Term Control Layer to capture term matching signals and a decoder-based model with a ranking layer for enhanced understanding. A/B testing on a Korean-based search platform, validates the effectiveness of our approach in improving re-ranking for long-tail queries.

[AI-86] Ethical Leadership in the Age of AI Challenges Opportunities and Framework for Ethical Leadership

链接: https://arxiv.org/abs/2410.18095
作者: Udaya Chandrika Kandasamy
关键词-EN: Artificial Intelligence, Ethical leadership, Ethical, leadership, rapidly changing
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 15 pages, submitted to SAGE for review

点击查看摘要

Abstract:Artificial Intelligence is currently and rapidly changing the way organizations and businesses operate. Ethical leadership has become significantly important since organizations and businesses across various sectors are evolving with AI. Organizations and businesses may be facing several challenges and potential opportunities when using AI. Ethical leadership plays a central role in guiding organizations in facing those challenges and maximizing on those opportunities. This article explores the essence of ethical leadership in the age of AI, starting with a simplified introduction of ethical leadership and AI, then dives into an understanding of ethical leadership, its characteristics and importance, the ethical challenges AI causes including bias in AI algorithms. The opportunities for ethical leadership in the age of AI answers the question: What actionable strategies can leaders employ to address the challenges and leverage opportunities? and describes the benefits for organizations through these opportunities. A proposed framework for ethical leadership is presented in this article, incorporating the core components: fairness, transparency, sustainability etc. Through the importance of interdisciplinary collaboration, case studies of ethical leadership in AI, and recommendations, this article emphasizes that ethical leadership in the age of AI is morally essential and strategically advantageous.

[AI-87] Liver Cancer Knowledge Graph Construction based on dynamic entity replacement and masking strategies RoBERTa-BiLSTM-CRF model

链接: https://arxiv.org/abs/2410.18090
作者: YiChi Zhang,HaiLing Wang,YongBin Gao,XiaoJun Hu,YingFang Fan,ZhiJun Fang
关键词-EN: common malignant tumor, Liver cancer, Liver cancer ranks, common malignant, malignant tumor
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Background: Liver cancer ranks as the fifth most common malignant tumor and the second most fatal in our country. Early diagnosis is crucial, necessitating that physicians identify liver cancer in patients at the earliest possible stage. However, the diagnostic process is complex and demanding. Physicians must analyze a broad spectrum of patient data, encompassing physical condition, symptoms, medical history, and results from various examinations and tests, recorded in both structured and unstructured medical formats. This results in a significant workload for healthcare professionals. In response, integrating knowledge graph technology to develop a liver cancer knowledge graph-assisted diagnosis and treatment system aligns with national efforts toward smart healthcare. Such a system promises to mitigate the challenges faced by physicians in diagnosing and treating liver cancer. Methods: This paper addresses the major challenges in building a knowledge graph for hepatocellular carcinoma diagnosis, such as the discrepancy between public data sources and real electronic medical records, the effective integration of which remains a key issue. The knowledge graph construction process consists of six steps: conceptual layer design, data preprocessing, entity identification, entity normalization, knowledge fusion, and graph visualization. A novel Dynamic Entity Replacement and Masking Strategy (DERM) for named entity recognition is proposed. Results: A knowledge graph for liver cancer was established, including 7 entity types such as disease, symptom, and constitution, containing 1495 entities. The recognition accuracy of the model was 93.23%, the recall was 94.69%, and the F1 score was 93.96%. Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) Cite as: arXiv:2410.18090 [cs.IR] (or arXiv:2410.18090v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2410.18090 Focus to learn more arXiv-issued DOI via DataCite

[AI-88] Empowering Cognitive Digital Twins with Generative Foundation Models: Developing a Low-Carbon Integrated Freight Transportation System

链接: https://arxiv.org/abs/2410.18089
作者: Xueping Li,Haowen Xu,Jose Tupayachi,Olufemi Omitaomu,Xudong Wang
关键词-EN: Effective monitoring, low-carbon economies, advancing sustainable, transportation is essential, essential for advancing
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Effective monitoring of freight transportation is essential for advancing sustainable, low-carbon economies. Traditional methods relying on single-modal data and discrete simulations fall short in optimizing intermodal systems holistically. These systems involve interconnected processes that affect shipping time, costs, emissions, and socio-economic factors. Developing digital twins for real-time awareness, predictive analytics, and urban logistics optimization requires extensive efforts in knowledge discovery, data integration, and multi-domain simulation. Recent advancements in generative AI offer new opportunities to streamline digital twin development by automating knowledge discovery and data integration, generating innovative simulation and optimization solutions. These models extend digital twins’ capabilities by promoting autonomous workflows for data engineering, analytics, and software development. This paper proposes an innovative paradigm that leverages generative AI to enhance digital twins for urban research and operations. Using freight decarbonization as a case study, we propose a conceptual framework employing transformer-based language models to enhance an urban digital twin through foundation models. We share preliminary results and our vision for more intelligent, autonomous, and general-purpose digital twins for optimizing integrated freight systems from multimodal to synchromodal paradigms.

[AI-89] CUPID: A Real-Time Session-Based Reciprocal Recommendation System for a One-on-One Social Discovery Platform

链接: https://arxiv.org/abs/2410.18087
作者: Beomsu Kim,Sangbum Kim,Minchan Kim,Joonyoung Yi,Sungjoo Ha,Suhyun Lee,Youngsoo Lee,Gihun Yeom,Buru Chang,Gihun Lee
关键词-EN: social discovery platform, study introduces CUPID, social discovery, study introduces, discovery platform
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: The 2nd International Workshop on User Understanding from Big Data Workshop (DMU2 2024)

点击查看摘要

Abstract:This study introduces CUPID, a novel approach to session-based reciprocal recommendation systems designed for a real-time one-on-one social discovery platform. In such platforms, low latency is critical to enhance user experiences. However, conventional session-based approaches struggle with high latency due to the demands of modeling sequential user behavior for each recommendation process. Additionally, given the reciprocal nature of the platform, where users act as items for each other, training recommendation models on large-scale datasets is computationally prohibitive using conventional methods. To address these challenges, CUPID decouples the time-intensive user session modeling from the real-time user matching process to reduce inference time. Furthermore, CUPID employs a two-phase training strategy that separates the training of embedding and prediction layers, significantly reducing the computational burden by decreasing the number of sequential model inferences by several hundredfold. Extensive experiments on large-scale Azar datasets demonstrate CUPID’s effectiveness in a real-world production environment. Notably, CUPID reduces response latency by more than 76% compared to non-asynchronous systems, while significantly improving user engagement.

[AI-90] xtureMeDefect: LLM -based Defect Texture Generation for Railway Components on Mobile Devices

链接: https://arxiv.org/abs/2410.18085
作者: Rahatara Ferdousi,M. Anwar Hossain,Abdulmotaleb El Saddik
关键词-EN: including gaming, gaming and entertainment, image generation, generation, applications
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
*备注: 6 Pages, 8 figures

点击查看摘要

Abstract:Texture image generation has been studied for various applications, including gaming and entertainment. However, context-specific realistic texture generation for industrial applications, such as generating defect textures on railway components, remains unexplored. A mobile-friendly, LLM-based tool that generates fine-grained defect characteristics offers a solution to the challenge of understanding the impact of defects from actual occurrences. We introduce TextureMeDefect, an innovative tool leveraging an LLM-based AI-Inferencing engine. The tool allows users to create realistic defect textures interactively on images of railway components taken with smartphones or tablets. We conducted a multifaceted evaluation to assess the relevance of the generated texture, time, and cost in using this tool on iOS and Android platforms. We also analyzed the software usability score (SUS) across three scenarios. TextureMeDefect outperformed traditional image generation tools by generating meaningful textures faster, showcasing the potential of AI-driven mobile applications on consumer-grade devices.

[AI-91] Uncovering the Genetic Basis of Glioblastoma Heterogeneity through Multimodal Analysis of Whole Slide Images and RNA Sequencing Data

链接: https://arxiv.org/abs/2410.18710
作者: Ahmad Berjaoui,Louis Roussel,Eduardo Hugo Sanchez,Elizabeth Cohen-Jonathan Moyal(CRCT, IUCT Oncopole)
关键词-EN: highly aggressive form, brain cancer characterized, poor prognosis, highly aggressive, aggressive form
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Glioblastoma is a highly aggressive form of brain cancer characterized by rapid progression and poor prognosis. Despite advances in treatment, the underlying genetic mechanisms driving this aggressiveness remain poorly understood. In this study, we employed multimodal deep learning approaches to investigate glioblastoma heterogeneity using joint image/RNA-seq analysis. Our results reveal novel genes associated with glioblastoma. By leveraging a combination of whole-slide images and RNA-seq, as well as introducing novel methods to encode RNA-seq data, we identified specific genetic profiles that may explain different patterns of glioblastoma progression. These findings provide new insights into the genetic mechanisms underlying glioblastoma heterogeneity and highlight potential targets for therapeutic intervention.

[AI-92] Enhancing Graph Attention Neural Network Performance for Marijuana Consumption Classification through Large-scale Augmented Granger Causality (lsAGC) Analysis of Functional MR Images

链接: https://arxiv.org/abs/2410.18506
作者: Ali Vosoughi,Akhil Kasturi,Axel Wismueller
关键词-EN: Augmented Granger Causality, Magnetic Resonance Imaging, large-scale Augmented Granger, functional Magnetic Resonance, resting-state functional Magnetic
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
*备注: 17 pages

点击查看摘要

Abstract:In the present research, the effectiveness of large-scale Augmented Granger Causality (lsAGC) as a tool for gauging brain network connectivity was examined to differentiate between marijuana users and typical controls by utilizing resting-state functional Magnetic Resonance Imaging (fMRI). The relationship between marijuana consumption and alterations in brain network connectivity is a recognized fact in scientific literature. This study probes how lsAGC can accurately discern these changes. The technique used integrates dimension reduction with the augmentation of source time-series in a model that predicts time-series, which helps in estimating the directed causal relationships among fMRI time-series. As a multivariate approach, lsAGC uncovers the connection of the inherent dynamic system while considering all other time-series. A dataset of 60 adults with an ADHD diagnosis during childhood, drawn from the Addiction Connectome Preprocessed Initiative (ACPI), was used in the study. The brain connections assessed by lsAGC were utilized as classification attributes. A Graph Attention Neural Network (GAT) was chosen to carry out the classification task, particularly for its ability to harness graph-based data and recognize intricate interactions between brain regions, making it appropriate for fMRI-based brain connectivity data. The performance was analyzed using a five-fold cross-validation system. The average accuracy achieved by the correlation coefficient method was roughly 52.98%, with a 1.65 standard deviation, whereas the lsAGC approach yielded an average accuracy of 61.47%, with a standard deviation of 1.44. The suggested method enhances the body of knowledge in the field of neuroimaging-based classification and emphasizes the necessity to consider directed causal connections in brain network connectivity analysis when studying marijuana’s effects on the brain.

[AI-93] Multi-Stage Airway Segmentation in Lung CT Based on Multi-scale Nested Residual UNet

链接: https://arxiv.org/abs/2410.18456
作者: Bingyu Yang,Huai Liao,Xinyan Huang,Qingyao Tian,Jinlin Wu,Jingdi Hu,Hongbin Liu
关键词-EN: pulmonary interventional procedures, interventional procedures, Residual Multi-scale Modules, quantitative assessment, assessment of lung
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Accurate and complete segmentation of airways in chest CT images is essential for the quantitative assessment of lung diseases and the facilitation of pulmonary interventional procedures. Although deep learning has led to significant advancements in medical image segmentation, maintaining airway continuity remains particularly challenging. This difficulty arises primarily from the small and dispersed nature of airway structures, as well as class imbalance in CT scans. To address these challenges, we designed a Multi-scale Nested Residual U-Net (MNR-UNet), incorporating multi-scale inputs and Residual Multi-scale Modules (RMM) into a nested residual framework to enhance information flow, effectively capturing the intricate details of small airways and mitigating gradient vanishing. Building on this, we developed a three-stage segmentation pipeline to optimize the training of the MNR-UNet. The first two stages prioritize high accuracy and sensitivity, while the third stage focuses on repairing airway breakages to balance topological completeness and correctness. To further address class imbalance, we introduced a weighted Breakage-Aware Loss (wBAL) to heighten focus on challenging samples, penalizing breakages and thereby extending the length of the airway tree. Additionally, we proposed a hierarchical evaluation framework to offer more clinically meaningful analysis. Validation on both in-house and public datasets demonstrates that our approach achieves superior performance in detecting more accurate airway voxels and identifying additional branches, significantly improving airway topological completeness. The code will be released publicly following the publication of the paper.

[AI-94] E2E-Swin-Unet: An Enhanced End-to-End Swin-Unet Architecture With Dual Decoders For PTMC Segmentation

链接: https://arxiv.org/abs/2410.18239
作者: Maryam Dialameh,Hossein Rajabzadeh,Moslem Sadeghi-Goughari,Jung Suk Sim,Hyock Ju Kwon
关键词-EN: Efficiently managing papillary, minimizing patient discomfort, patient discomfort poses, Efficiently managing, papillary thyroid microcarcinoma
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Efficiently managing papillary thyroid microcarcinoma (PTMC) while minimizing patient discomfort poses a significant clinical challenge. Radiofrequency ablation (RFA) offers a less invasive alternative to surgery and radiation therapy for PTMC treatment, characterized by shorter recovery times and reduced pain. As an image-guided procedure, RFA generates localized heat by delivering high-frequency electrical currents through electrodes to the targeted area under ultrasound imaging guidance. However, the precision and skill required by operators for accurate guidance using current ultrasound B-mode imaging technologies remain significant challenges. To address these challenges, we develop a novel AI segmentation model, E2E-Swin-Unet++. This model enhances ultrasound B-mode imaging by enabling real-time identification and segmentation of PTMC tumors and monitoring of the region of interest for precise targeting during treatment. E2E-Swin- Unet++ is an advanced end-to-end extension of the Swin-Unet architecture, incorporating thyroid region information to minimize the risk of false PTMC segmentation while providing fast inference capabilities. Experimental results on a real clinical RFA dataset demonstrate the superior performance of E2E-Swin-Unet++ compared to related models. Our proposed solution significantly improves the precision and control of RFA ablation treatment by enabling real-time identification and segmentation of PTMC margins during the procedure.

[AI-95] A Hybrid Graph Neural Network for Enhanced EEG-Based Depression Detection

链接: https://arxiv.org/abs/2410.18103
作者: Yiye Wang,Wenming Zheng,Yang Li,Hao Yang
关键词-EN: Graph Neural Network, Graph neural, Graph, Neural Network, previous GNN-based
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph neural networks (GNNs) are becoming increasingly popular for EEG-based depression detection. However, previous GNN-based methods fail to sufficiently consider the characteristics of depression, thus limiting their performance. Firstly, studies in neuroscience indicate that depression patients exhibit both common and individualized brain abnormal patterns. Previous GNN-based approaches typically focus either on fixed graph connections to capture common abnormal brain patterns or on adaptive connections to capture individualized patterns, which is inadequate for depression detection. Secondly, brain network exhibits a hierarchical structure, which includes the arrangement from channel-level graph to region-level graph. This hierarchical structure varies among individuals and contains significant information relevant to detecting depression. Nonetheless, previous GNN-based methods overlook these individualized hierarchical information. To address these issues, we propose a Hybrid GNN (HGNN) that merges a Common Graph Neural Network (CGNN) branch utilizing fixed connection and an Individualized Graph Neural Network (IGNN) branch employing adaptive connections. The two branches capture common and individualized depression patterns respectively, complementing each other. Furthermore, we enhance the IGNN branch with a Graph Pooling and Unpooling Module (GPUM) to extract individualized hierarchical information. Extensive experiments on two public datasets show that our model achieves state-of-the-art performance.

[AI-96] Molecular Dynamics and Machine Learning Unlock Possibilities in Beauty Design – A Perspective

链接: https://arxiv.org/abs/2410.18101
作者: Yuzhi Xu,Haowei Ni,Qinhui Gao,Chia-Hua Chang,Yanran Huo,Fanyu Zhao,Shiyu Hu,Wei Xia,Yike Zhang,Radu Grovu,Min He,John. Z. H. Zhang,Yuanqing Wang
关键词-EN: Computational molecular design, molecular dynamics approaches, Computational molecular, small molecule therapeutics, molecular entities
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Computational molecular design – the endeavor to design molecules, with various missions, aided by machine learning and molecular dynamics approaches, has been widely applied to create valuable new molecular entities, from small molecule therapeutics to protein biologics. In the small data regime, physics-based approaches model the interaction between the molecule being designed and proteins of key physiological functions, providing structural insights into the mechanism. When abundant data has been collected, a quantitative structure-activity relationship (QSAR) can be more directly constructed from experimental data, from which machine learning can distill key insights to guide the design of the next round of experiment design. Machine learning methodologies can also facilitate physical modeling, from improving the accuracy of force fields and extending them to unseen chemical spaces, to more directly enhancing the sampling on the conformational spaces. We argue that these techniques are mature enough to be applied to not just extend the longevity of life, but the beauty it manifests. In this perspective, we review the current frontiers in the research \ development of skin care products, as well as the statistical and physical toolbox applicable to addressing the challenges in this industry. Feasible interdisciplinary research projects are proposed to harness the power of machine learning tools to design innovative, effective, and inexpensive skin care products.

[AI-97] Self-supervised inter-intra period-aware ECG representation learning for detecting atrial fibrillation

链接: https://arxiv.org/abs/2410.18094
作者: Xiangqian Zhu,Mengnan Shi,Xuexin Yu,Chang Liu,Xiaocong Lian,Jintao Fei,Jiangying Luo,Xin Jin,Ping Zhang,Xiangyang Ji
关键词-EN: encountered clinical arrhythmia, commonly encountered clinical, Atrial fibrillation, increased mortality, encountered clinical
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Preprint submitted to Biomedical Signal Processing and Control

点击查看摘要

Abstract:Atrial fibrillation is a commonly encountered clinical arrhythmia associated with stroke and increased mortality. Since professional medical knowledge is required for annotation, exploiting a large corpus of ECGs to develop accurate supervised learning-based atrial fibrillation algorithms remains challenging. Self-supervised learning (SSL) is a promising recipe for generalized ECG representation learning, eliminating the dependence on expensive labeling. However, without well-designed incorporations of knowledge related to atrial fibrillation, existing SSL approaches typically suffer from unsatisfactory capture of robust ECG representations. In this paper, we propose an inter-intra period-aware ECG representation learning approach. Considering ECGs of atrial fibrillation patients exhibit the irregularity in RR intervals and the absence of P-waves, we develop specific pre-training tasks for interperiod and intraperiod representations, aiming to learn the single-period stable morphology representation while retaining crucial interperiod features. After further fine-tuning, our approach demonstrates remarkable AUC performances on the BTCH dataset, \textiti.e., 0.953/0.996 for paroxysmal/persistent atrial fibrillation detection. On commonly used benchmarks of CinC2017 and CPSC2021, the generalization capability and effectiveness of our methodology are substantiated with competitive results.

[AI-98] wo-Stage Radio Map Construction with Real Environments and Sparse Measurements

链接: https://arxiv.org/abs/2410.18092
作者: Yifan Wang,Shu Sun,Na Liu,Lianming Xu,Li Wang
关键词-EN: map estimation reduces, Radio map, radio map estimation, environment-aware radio map, Radio map construction
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Radio map construction based on extensive measurements is accurate but expensive and time-consuming, while environment-aware radio map estimation reduces the costs at the expense of low accuracy. Considering accuracy and costs, a first-predict-then-correct (FPTC) method is proposed by leveraging generative adversarial networks (GANs). A primary radio map is first predicted by a radio map prediction GAN (RMP-GAN) taking environmental information as input. Then, the prediction result is corrected by a radio map correction GAN (RMC-GAN) with sparse measurements as guidelines. Specifically, the self-attention mechanism and residual-connection blocks are introduced to RMP-GAN and RMC-GAN to improve the accuracy, respectively. Experimental results validate that the proposed FPTC-GANs method achieves the best radio map construction performance, compared with the state-of-the-art methods.

[AI-99] Predicting Fine-grained Behavioral and Psychological Symptoms of Dementia Based on Machine Learning and Smart Wearable Devices

链接: https://arxiv.org/abs/2410.18091
作者: Benny Wei-Yun Hsu,Yu-Ming Chen,Yuan-Han Yang,Vincent S. Tseng
关键词-EN: Psychological Symptoms, BPSD, BPSD prediction, Behavioral and Psychological, impact dementia care
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Behavioral and Psychological Symptoms of Dementia (BPSD) impact dementia care substantially, affecting both patients and caregivers. Effective management and early detection of BPSD are crucial to reduce the stress and burden on caregivers and healthcare systems. Despite the advancements in machine learning for dementia prediction, there is a considerable gap in utilizing these methods for BPSD prediction. This study aims to fill this gap by presenting a novel personalized framework for BPSD prediction, utilizing physiological signals from smart wearable devices. Our personalized fine-grained BPSD prediction method accurately predicts BPSD occurrences by extracting individual behavioral patterns, while the generalized models identify diverse patterns and differentiate between various BPSD symptoms. Detailed comparisons between the proposed personalized method and conventional generalized methods reveals substantial improvements across all performance metrics, including a 16.0% increase in AUC. These results demonstrate the potential of our proposed method in advancing dementia care by enabling proactive interventions and improving patient outcomes in real-world scenarios. To the best of our knowledge, this is the first study that leverages physiological signals from smart wearable devices to predict BPSD, marking a significant stride in dementia care research.

[AI-100] Real-time Sub-milliwatt Epilepsy Detection Implemented on a Spiking Neural Network Edge Inference Processor

链接: https://arxiv.org/abs/2410.16613
作者: Ruixin Lia,Guoxu Zhaoa,Dylan Richard Muir,Yuya Ling,Karla Burelo,Mina Khoei,Dong Wang,Yannan Xing,Ning Qiao
关键词-EN: Analyzing electroencephalogram, existing technologies aimed, epileptic seizure status, epileptic seizures, subject presents
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Analyzing electroencephalogram (EEG) signals to detect the epileptic seizure status of a subject presents a challenge to existing technologies aimed at providing timely and efficient diagnosis. In this study, we aimed to detect interictal and ictal periods of epileptic seizures using a spiking neural network (SNN). Our proposed approach provides an online and real-time preliminary diagnosis of epileptic seizures and helps to detect possible pathological this http URL validate our approach, we conducted experiments using multiple datasets. We utilized a trained SNN to identify the presence of epileptic seizures and compared our results with those of related studies. The SNN model was deployed on Xylo, a digital SNN neuromorphic processor designed to process temporal signals. Xylo efficiently simulates spiking leaky integrate-and-fire neurons with exponential input synapses. Xylo has much lower energy requirments than traditional approaches to signal processing, making it an ideal platform for developing low-power seizure detection this http URL proposed method has a high test accuracy of 93.3% and 92.9% when classifying ictal and interictal periods. At the same time, the application has an average power consumption of 87.4 uW(IO power) + 287.9 uW(computational power) when deployed to Xylo. Our method demonstrates excellent low-latency performance when tested on multiple datasets. Our work provides a new solution for seizure detection, and it is expected to be widely used in portable and wearable devices in the future.

计算机视觉

[CV-0] Framer: Interactive Frame Interpolation

链接: https://arxiv.org/abs/2410.18978
作者: Wen Wang,Qiuyu Wang,Kecheng Zheng,Hao Ouyang,Zhekai Chen,Biao Gong,Hao Chen,Yujun Shen,Chunhua Shen
关键词-EN: targets producing smoothly, producing smoothly transitioning, smoothly transitioning frames, interactive frame interpolation, user creativity
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL

点击查看摘要

Abstract:We propose Framer for interactive frame interpolation, which targets producing smoothly transitioning frames between two images as per user creativity. Concretely, besides taking the start and end frames as inputs, our approach supports customizing the transition process by tailoring the trajectory of some selected keypoints. Such a design enjoys two clear benefits. First, incorporating human interaction mitigates the issue arising from numerous possibilities of transforming one image to another, and in turn enables finer control of local motions. Second, as the most basic form of interaction, keypoints help establish the correspondence across frames, enhancing the model to handle challenging cases (e.g., objects on the start and end frames are of different shapes and styles). It is noteworthy that our system also offers an “autopilot” mode, where we introduce a module to estimate the keypoints and refine the trajectory automatically, to simplify the usage in practice. Extensive experimental results demonstrate the appealing performance of Framer on various applications, such as image morphing, time-lapse video generation, cartoon interpolation, etc. The code, the model, and the interface will be released to facilitate further research.

[CV-1] MotionCLR: Motion Generation and Training-free Editing via Understanding Attention Mechanisms

链接: https://arxiv.org/abs/2410.18977
作者: Ling-Hao Chen,Wenxun Dai,Xuan Ju,Shunlin Lu,Lei Zhang
关键词-EN: motion, research delves, problem of interactive, human motion generation, motion generation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: MotionCLR v1 technical report

点击查看摘要

Abstract:This research delves into the problem of interactive editing of human motion generation. Previous motion diffusion models lack explicit modeling of the word-level text-motion correspondence and good explainability, hence restricting their fine-grained editing ability. To address this issue, we propose an attention-based motion diffusion model, namely MotionCLR, with CLeaR modeling of attention mechanisms. Technically, MotionCLR models the in-modality and cross-modality interactions with self-attention and cross-attention, respectively. More specifically, the self-attention mechanism aims to measure the sequential similarity between frames and impacts the order of motion features. By contrast, the cross-attention mechanism works to find the fine-grained word-sequence correspondence and activate the corresponding timesteps in the motion sequence. Based on these key properties, we develop a versatile set of simple yet effective motion editing methods via manipulating attention maps, such as motion (de-)emphasizing, in-place motion replacement, and example-based motion generation, etc. For further verification of the explainability of the attention mechanism, we additionally explore the potential of action-counting and grounded motion generation ability via attention maps. Our experimental results show that our method enjoys good generation and editing ability with good explainability.

[CV-2] Where Am I and What Will I See: An Auto-Regressive Model for Spatial Localization and View Prediction

链接: https://arxiv.org/abs/2410.18962
作者: Junyi Chen,Di Huang,Weicai Ye,Wanli Ouyang,Tong He
关键词-EN: machine to perceive, dimensions within space, Spatial, Generative Spatial Transformer, Spatial intelligence
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Spatial intelligence is the ability of a machine to perceive, reason, and act in three dimensions within space and time. Recent advancements in large-scale auto-regressive models have demonstrated remarkable capabilities across various reasoning tasks. However, these models often struggle with fundamental aspects of spatial reasoning, particularly in answering questions like “Where am I?” and “What will I see?”. While some attempts have been done, existing approaches typically treat them as separate tasks, failing to capture their interconnected nature. In this paper, we present Generative Spatial Transformer (GST), a novel auto-regressive framework that jointly addresses spatial localization and view prediction. Our model simultaneously estimates the camera pose from a single image and predicts the view from a new camera pose, effectively bridging the gap between spatial awareness and visual prediction. The proposed innovative camera tokenization method enables the model to learn the joint distribution of 2D projections and their corresponding spatial perspectives in an auto-regressive manner. This unified training paradigm demonstrates that joint optimization of pose estimation and novel view synthesis leads to improved performance in both tasks, for the first time, highlighting the inherent relationship between spatial awareness and visual prediction.

[CV-3] Stable Consistency Tuning: Understanding and Improving Consistency Models

链接: https://arxiv.org/abs/2410.18958
作者: Fu-Yun Wang,Zhengyang Geng,Hongsheng Li
关键词-EN: superior generation quality, slow generation speed, generation speed due, achieve superior generation, consistency
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Code is available at this https URL

点击查看摘要

Abstract:Diffusion models achieve superior generation quality but suffer from slow generation speed due to the iterative nature of denoising. In contrast, consistency models, a new generative family, achieve competitive performance with significantly faster sampling. These models are trained either through consistency distillation, which leverages pretrained diffusion models, or consistency training/tuning directly from raw data. In this work, we propose a novel framework for understanding consistency models by modeling the denoising process of the diffusion model as a Markov Decision Process (MDP) and framing consistency model training as the value estimation through Temporal Difference~(TD) Learning. More importantly, this framework allows us to analyze the limitations of current consistency training/tuning strategies. Built upon Easy Consistency Tuning (ECT), we propose Stable Consistency Tuning (SCT), which incorporates variance-reduced learning using the score identity. SCT leads to significant performance improvements on benchmarks such as CIFAR-10 and ImageNet-64. On ImageNet-64, SCT achieves 1-step FID 2.42 and 2-step FID 1.55, a new SoTA for consistency models.

[CV-4] Large Spatial Model: End-to-end Unposed Images to Semantic 3D

链接: https://arxiv.org/abs/2410.18956
作者: Zhiwen Fan,Jian Zhang,Wenyan Cong,Peihao Wang,Renjie Li,Kairun Wen,Shijie Zhou,Achuta Kadambi,Zhangyang Wang,Danfei Xu,Boris Ivanovic,Marco Pavone,Yue Wang
关键词-EN: Reconstructing and understanding, limited number, well-established problem, Reconstructing, LSM
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project Website: this https URL

点击查看摘要

Abstract:Reconstructing and understanding 3D structures from a limited number of images is a well-established problem in computer vision. Traditional methods usually break this task into multiple subtasks, each requiring complex transformations between different data representations. For instance, dense reconstruction through Structure-from-Motion (SfM) involves converting images into key points, optimizing camera parameters, and estimating structures. Afterward, accurate sparse reconstructions are required for further dense modeling, which is subsequently fed into task-specific neural networks. This multi-step process results in considerable processing time and increased engineering complexity. In this work, we present the Large Spatial Model (LSM), which processes unposed RGB images directly into semantic radiance fields. LSM simultaneously estimates geometry, appearance, and semantics in a single feed-forward operation, and it can generate versatile label maps by interacting with language at novel viewpoints. Leveraging a Transformer-based architecture, LSM integrates global geometry through pixel-aligned point maps. To enhance spatial attribute regression, we incorporate local context aggregation with multi-scale fusion, improving the accuracy of fine local details. To tackle the scarcity of labeled 3D semantic data and enable natural language-driven scene manipulation, we incorporate a pre-trained 2D language-based segmentation model into a 3D-consistent semantic feature field. An efficient decoder then parameterizes a set of semantic anisotropic Gaussians, facilitating supervised end-to-end learning. Extensive experiments across various tasks show that LSM unifies multiple 3D vision tasks directly from unposed images, achieving real-time semantic 3D reconstruction for the first time. Comments: Project Website: this https URL Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2410.18956 [cs.CV] (or arXiv:2410.18956v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2410.18956 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-5] Sort-free Gaussian Splatting via Weighted Sum Rendering

链接: https://arxiv.org/abs/2410.18931
作者: Qiqi Hou,Randall Rauwendaal,Zifeng Li,Hoang Le,Farzad Farhadzadeh,Fatih Porikli,Alexei Bourd,Amir Said
关键词-EN: attracting considerable attention, maintaining low complexity, considerable attention due, recover high-fidelity details, scene reconstruction
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recently, 3D Gaussian Splatting (3DGS) has emerged as a significant advancement in 3D scene reconstruction, attracting considerable attention due to its ability to recover high-fidelity details while maintaining low complexity. Despite the promising results achieved by 3DGS, its rendering performance is constrained by its dependence on costly non-commutative alpha-blending operations. These operations mandate complex view dependent sorting operations that introduce computational overhead, especially on the resource-constrained platforms such as mobile phones. In this paper, we propose Weighted Sum Rendering, which approximates alpha blending with weighted sums, thereby removing the need for sorting. This simplifies implementation, delivers superior performance, and eliminates the “popping” artifacts caused by sorting. Experimental results show that optimizing a generalized Gaussian splatting formulation to the new differentiable rendering yields competitive image quality. The method was implemented and tested in a mobile device GPU, achieving on average 1.23\times faster rendering.

[CV-6] Multi-Class Abnormality Classification in Video Capsule Endoscopy Using Deep Learning

链接: https://arxiv.org/abs/2410.18879
作者: Arnav Samal,Ranya
关键词-EN: report outlines Team, video capsule endoscopy, convolutional neural networks, Capsule Vision, capsule endoscopy frames
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This report outlines Team Seq2Cure’s deep learning approach for the Capsule Vision 2024 Challenge, leveraging an ensemble of convolutional neural networks (CNNs) and transformer-based architectures for multi-class abnormality classification in video capsule endoscopy frames. The dataset comprised over 50,000 frames from three public sources and one private dataset, labeled across 10 abnormality classes. To overcome the limitations of traditional CNNs in capturing global context, we integrated CNN and transformer models within a multi-model ensemble. Our approach achieved a balanced accuracy of 86.34 percent and a mean AUC-ROC score of 0.9908 on the validation set, with significant improvements in classifying complex abnormalities. Code is available at this http URL .

[CV-7] Probabilistic Language-Image Pre-Training

链接: https://arxiv.org/abs/2410.18857
作者: Sanghyuk Chun,Wonjae Kim,Song Park,Sangdoo Yun
关键词-EN: Vision-language models, embed aligned image-text, embed aligned, deterministic embeddings, joint space
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Code: this https URL 23 pages, 5.7 MB

点击查看摘要

Abstract:Vision-language models (VLMs) embed aligned image-text pairs into a joint space but often rely on deterministic embeddings, assuming a one-to-one correspondence between images and texts. This oversimplifies real-world relationships, which are inherently many-to-many, with multiple captions describing a single image and vice versa. We introduce Probabilistic Language-Image Pre-training (ProLIP), the first probabilistic VLM pre-trained on a billion-scale image-text dataset using only probabilistic objectives, achieving a strong zero-shot capability (e.g., 74.6% ImageNet zero-shot accuracy with ViT-B/16). ProLIP efficiently estimates uncertainty by an “uncertainty token” without extra parameters. We also introduce a novel inclusion loss that enforces distributional inclusion relationships between image-text pairs and between original and masked inputs. Experiments demonstrate that, by leveraging uncertainty estimates, ProLIP benefits downstream tasks and aligns with intuitive notions of uncertainty, e.g., shorter texts being more uncertain and more general inputs including specific ones. Utilizing text uncertainties, we further improve ImageNet accuracy from 74.6% to 75.8% (under a few-shot setting), supporting the practical advantages of our probabilistic approach. The code is available at this https URL

[CV-8] Multi-Scale Diffusion: Enhancing Spatial Layout in High-Resolution Panoramic Image Generation

链接: https://arxiv.org/abs/2410.18830
作者: Xiaoyu Zhang,Teng Zhou,Xinlong Zhang,Jia Wei,Yongchuan Tang
关键词-EN: recently gained recognition, high-quality content, recently gained, gained recognition, diverse and high-quality
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion models have recently gained recognition for generating diverse and high-quality content, especially in the domain of image synthesis. These models excel not only in creating fixed-size images but also in producing panoramic images. However, existing methods often struggle with spatial layout consistency when producing high-resolution panoramas, due to the lack of guidance of the global image layout. In this paper, we introduce the Multi-Scale Diffusion (MSD) framework, a plug-and-play module that extends the existing panoramic image generation framework to multiple resolution levels. By utilizing gradient descent techniques, our method effectively incorporates structural information from low-resolution images into high-resolution outputs. A comprehensive evaluation of the proposed method was conducted, comparing it with the prior works in qualitative and quantitative dimensions. The evaluation results demonstrate that our method significantly outperforms others in generating coherent high-resolution panoramas.

[CV-9] Binocular-Guided 3D Gaussian Splatting with View Consistency for Sparse View Synthesis NEURIPS2024

链接: https://arxiv.org/abs/2410.18822
作者: Liang Han,Junsheng Zhou,Yu-Shen Liu,Zhizhong Han
关键词-EN: computer vision, Gaussian Splatting, vital yet challenging, challenging task, Gaussian
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by NeurIPS 2024. Project page: this https URL

点击查看摘要

Abstract:Novel view synthesis from sparse inputs is a vital yet challenging task in 3D computer vision. Previous methods explore 3D Gaussian Splatting with neural priors (e.g. depth priors) as an additional supervision, demonstrating promising quality and efficiency compared to the NeRF based methods. However, the neural priors from 2D pretrained models are often noisy and blurry, which struggle to precisely guide the learning of radiance fields. In this paper, We propose a novel method for synthesizing novel views from sparse views with Gaussian Splatting that does not require external prior as supervision. Our key idea lies in exploring the self-supervisions inherent in the binocular stereo consistency between each pair of binocular images constructed with disparity-guided image warping. To this end, we additionally introduce a Gaussian opacity constraint which regularizes the Gaussian locations and avoids Gaussian redundancy for improving the robustness and efficiency of inferring 3D Gaussians from sparse views. Extensive experiments on the LLFF, DTU, and Blender datasets demonstrate that our method significantly outperforms the state-of-the-art methods.

[CV-10] Learning Global Object-Centric Representations via Disentangled Slot Attention

链接: https://arxiv.org/abs/2410.18809
作者: Tonglin Chen,Yinxuan Huang,Zhimeng Shen,Jinghao Huang,Bin Li,Xiangyang Xue
关键词-EN: amidst changing factors, objects amidst changing, swiftly identify objects, identify objects amidst, amidst changing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Global Object-Centric Representations, Object Identification, Unsupervised Learning, Disentangled Learning

点击查看摘要

Abstract:Humans can discern scene-independent features of objects across various environments, allowing them to swiftly identify objects amidst changing factors such as lighting, perspective, size, and position and imagine the complete images of the same object in diverse settings. Existing object-centric learning methods only extract scene-dependent object-centric representations, lacking the ability to identify the same object across scenes as humans. Moreover, some existing methods discard the individual object generation capabilities to handle complex scenes. This paper introduces a novel object-centric learning method to empower AI systems with human-like capabilities to identify objects across scenes and generate diverse scenes containing specific objects by learning a set of global object-centric representations. To learn the global object-centric representations that encapsulate globally invariant attributes of objects (i.e., the complete appearance and shape), this paper designs a Disentangled Slot Attention module to convert the scene features into scene-dependent attributes (such as scale, position and orientation) and scene-independent representations (i.e., appearance and shape). Experimental results substantiate the efficacy of the proposed method, demonstrating remarkable proficiency in global object-centric representation learning, object identification, scene generation with specific objects and scene decomposition.

[CV-11] Fast constrained sampling in pre-trained diffusion models

链接: https://arxiv.org/abs/2410.18804
作者: Alexandros Graikos,Nebojsa Jojic,Dimitris Samaras
关键词-EN: widely adopted, generative image models, Stable Diffusion, dominated the field, Diffusion models
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion models have dominated the field of large, generative image models, with the prime examples of Stable Diffusion and DALL-E 3 being widely adopted. These models have been trained to perform text-conditioned generation on vast numbers of image-caption pairs and as a byproduct, have acquired general knowledge about natural image statistics. However, when confronted with the task of constrained sampling, e.g. generating the right half of an image conditioned on the known left half, applying these models is a delicate and slow process, with previously proposed algorithms relying on expensive iterative operations that are usually orders of magnitude slower than text-based inference. This is counter-intuitive, as image-conditioned generation should rely less on the difficult-to-learn semantic knowledge that links captions and imagery, and should instead be achievable by lower-level correlations among image pixels. In practice, inverse models are trained or tuned separately for each inverse problem, e.g. by providing parts of images during training as an additional condition, to allow their application in realistic settings. However, we argue that this is not necessary and propose an algorithm for fast-constrained sampling in large pre-trained diffusion models (Stable Diffusion) that requires no expensive backpropagation operations through the model and produces results comparable even to the state-of-the-art \emphtuned models. Our method is based on a novel optimization perspective to sampling under constraints and employs a numerical approximation to the expensive gradients, previously computed using backpropagation, incurring significant speed-ups.

[CV-12] Learning Geodesics of Geometric Shape Deformations From Images

链接: https://arxiv.org/abs/2410.18797
作者: Nian Wu,Miaomiao Zhang
关键词-EN: named geodesic deformable, paper presents, deformation fields derived, geodesic deformable networks, time enables
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper presents a novel method, named geodesic deformable networks (GDN), that for the first time enables the learning of geodesic flows of deformation fields derived from images. In particular, the capability of our proposed GDN being able to predict geodesics is important for quantifying and comparing deformable shape presented in images. The geodesic deformations, also known as optimal transformations that align pairwise images, are often parameterized by a time sequence of smooth vector fields governed by nonlinear differential equations. A bountiful literature has been focusing on learning the initial conditions (e.g., initial velocity fields) based on registration networks. However, the definition of geodesics central to deformation-based shape analysis is blind to the networks. To address this problem, we carefully develop an efficient neural operator to treat the geodesics as unknown mapping functions learned from the latent deformation spaces. A composition of integral operators and smooth activation functions is then formulated to effectively approximate such mappings. In contrast to previous works, our GDN jointly optimizes a newly defined geodesic loss, which adds additional benefits to promote the network regularizability and generalizability. We demonstrate the effectiveness of GDN on both 2D synthetic data and 3D real brain magnetic resonance imaging (MRI).

[CV-13] WARP-LCA: Efficient Convolutional Sparse Coding with Locally Competitive Algorithm

链接: https://arxiv.org/abs/2410.18794
作者: Geoffrey Kasenbacher,Felix Ehret,Gerrit Ecke,Sebastian Otte
关键词-EN: locally competitive algorithm, LCA, locally competitive, wide range, competitive algorithm
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:The locally competitive algorithm (LCA) can solve sparse coding problems across a wide range of use cases. Recently, convolution-based LCA approaches have been shown to be highly effective for enhancing robustness for image recognition tasks in vision pipelines. To additionally maximize representational sparsity, LCA with hard-thresholding can be applied. While this combination often yields very good solutions satisfying an \ell_0 sparsity criterion, it comes with significant drawbacks for practical application: (i) LCA is very inefficient, typically requiring hundreds of optimization cycles for convergence; (ii) the use of hard-thresholding results in a non-convex loss function, which might lead to suboptimal minima. To address these issues, we propose the Locally Competitive Algorithm with State Warm-up via Predictive Priming (WARP-LCA), which leverages a predictor network to provide a suitable initial guess of the LCA state based on the current input. Our approach significantly improves both convergence speed and the quality of solutions, while maintaining and even enhancing the overall strengths of LCA. We demonstrate that WARP-LCA converges faster by orders of magnitude and reaches better minima compared to conventional LCA. Moreover, the learned representations are more sparse and exhibit superior properties in terms of reconstruction and denoising quality as well as robustness when applied in deep recognition pipelines. Furthermore, we apply WARP-LCA to image denoising tasks, showcasing its robustness and practical effectiveness. Our findings confirm that the naive use of LCA with hard-thresholding results in suboptimal minima, whereas initializing LCA with a predictive guess results in better outcomes. This research advances the field of biologically inspired deep learning by providing a novel approach to convolutional sparse coding.

[CV-14] Schedule Your Edit: A Simple yet Effective Diffusion Noise Schedule for Image Editing NEURIPS2024

链接: https://arxiv.org/abs/2410.18756
作者: Haonan Lin,Mengmeng Wang,Jiahao Wang,Wenbin An,Yan Chen,Yong Liu,Feng Tian,Guang Dai,Jingdong Wang,Qianying Wang
关键词-EN: Text-guided diffusion models, diverse modifications driven, significantly advanced image, Text-guided diffusion, text prompts
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in NeurIPS 2024

点击查看摘要

Abstract:Text-guided diffusion models have significantly advanced image editing, enabling high-quality and diverse modifications driven by text prompts. However, effective editing requires inverting the source image into a latent space, a process often hindered by prediction errors inherent in DDIM inversion. These errors accumulate during the diffusion process, resulting in inferior content preservation and edit fidelity, especially with conditional inputs. We address these challenges by investigating the primary contributors to error accumulation in DDIM inversion and identify the singularity problem in traditional noise schedules as a key issue. To resolve this, we introduce the Logistic Schedule, a novel noise schedule designed to eliminate singularities, improve inversion stability, and provide a better noise space for image editing. This schedule reduces noise prediction errors, enabling more faithful editing that preserves the original content of the source image. Our approach requires no additional retraining and is compatible with various existing editing methods. Experiments across eight editing tasks demonstrate the Logistic Schedule’s superior performance in content preservation and edit fidelity compared to traditional noise schedules, highlighting its adaptability and effectiveness.

[CV-15] Rectified Diffusion Guidance for Conditional Generation

链接: https://arxiv.org/abs/2410.18737
作者: Mengfei Xia,Nan Xue,Yujun Shen,Ran Yi,Tieliang Gong,Yong-Jin Liu
关键词-EN: unconditional score functions, combines the conditional, conditional and unconditional, unconditional score, score functions
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Classifier-Free Guidance (CFG), which combines the conditional and unconditional score functions with two coefficients summing to one, serves as a practical technique for diffusion model sampling. Theoretically, however, denoising with CFG cannot be expressed as a reciprocal diffusion process, which may consequently leave some hidden risks during use. In this work, we revisit the theory behind CFG and rigorously confirm that the improper configuration of the combination coefficients (i.e., the widely used summing-to-one version) brings about expectation shift of the generative distribution. To rectify this issue, we propose ReCFG with a relaxation on the guidance coefficients such that denoising with ReCFG strictly aligns with the diffusion theory. We further show that our approach enjoys a closed-form solution given the guidance strength. That way, the rectified coefficients can be readily pre-computed via traversing the observed data, leaving the sampling speed barely affected. Empirical evidence on real-world data demonstrate the compatibility of our post-hoc design with existing state-of-the-art diffusion models, including both class-conditioned ones (e.g., EDM2 on ImageNet) and text-conditioned ones (e.g., SD3 on CC12M), without any retraining. We will open-source the code to facilitate further research.

[CV-16] VoxelKeypointFusion: Generalizable Multi-View Multi-Person Pose Estimation

链接: https://arxiv.org/abs/2410.18723
作者: Daniel Bermuth,Alexander Poeppel,Wolfgang Reif
关键词-EN: rapidly evolving field, computer vision, formidable challenge, rapidly evolving, evolving field
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:In the rapidly evolving field of computer vision, the task of accurately estimating the poses of multiple individuals from various viewpoints presents a formidable challenge, especially if the estimations should be reliable as well. This work presents an extensive evaluation of the generalization capabilities of multi-view multi-person pose estimators to unseen datasets and presents a new algorithm with strong performance in this task. It also studies the improvements by additionally using depth information. Since the new approach can not only generalize well to unseen datasets, but also to different keypoints, the first multi-view multi-person whole-body estimator is presented. To support further research on those topics, all of the work is publicly accessible.

[CV-17] ChatSearch: a Dataset and a Generative Retrieval Model for General Conversational Image Retrieval

链接: https://arxiv.org/abs/2410.18715
作者: Zijia Zhao,Longteng Guo,Tongtian Yue,Erdong Hu,Shuai Shao,Zehuan Yuan,Hua Huang,Jing Liu
关键词-EN: retrieval, general conversational image, image, general conversational, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we investigate the task of general conversational image retrieval on open-domain images. The objective is to search for images based on interactive conversations between humans and computers. To advance this task, we curate a dataset called ChatSearch. This dataset includes a multi-round multimodal conversational context query for each target image, thereby requiring the retrieval system to find the accurate image from database. Simultaneously, we propose a generative retrieval model named ChatSearcher, which is trained end-to-end to accept/produce interleaved image-text inputs/outputs. ChatSearcher exhibits strong capability in reasoning with multimodal context and can leverage world knowledge to yield visual retrieval results. It demonstrates superior performance on the ChatSearch dataset and also achieves competitive results on other image retrieval tasks and visual conversation tasks. We anticipate that this work will inspire further research on interactive multimodal retrieval systems. Our dataset will be available at this https URL.

[CV-18] PESFormer: Boosting Macro- and Micro-expression Spotting with Direct Timestamp Encoding

链接: https://arxiv.org/abs/2410.18695
作者: Wang-Wang Yu,Kai-Fu Yang,Xiangrui Hu,Jingwen Jiang,Hong-Mei Yan,Yong-Jie Li
关键词-EN: categorize temporal expression, temporal expression instances, micro-expression spotting aims, task of macro, aims to precisely
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The task of macro- and micro-expression spotting aims to precisely localize and categorize temporal expression instances within untrimmed videos. Given the sparse distribution and varying durations of expressions, existing anchor-based methods often represent instances by encoding their deviations from predefined anchors. Additionally, these methods typically slice the untrimmed videos into fixed-length sliding windows. However, anchor-based encoding often fails to capture all training intervals, and slicing the original video as sliding windows can result in valuable training intervals being discarded. To overcome these limitations, we introduce PESFormer, a simple yet effective model based on the vision transformer architecture to achieve point-to-interval expression spotting. PESFormer employs a direct timestamp encoding (DTE) approach to replace anchors, enabling binary classification of each timestamp instead of optimizing entire ground truths. Thus, all training intervals are retained in the form of discrete timestamps. To maximize the utilization of training intervals, we enhance the preprocessing process by replacing the short videos produced through the sliding window this http URL, we implement a strategy that involves zero-padding the untrimmed training videos to create uniform, longer videos of a predetermined duration. This operation efficiently preserves the original training intervals and eliminates video slice this http URL qualitative and quantitative evaluations on three datasets – CAS(ME)^2, CAS(ME)^3 and SAMM-LV – demonstrate that our PESFormer outperforms existing techniques, achieving the best performance.

[CV-19] ODDN: Addressing Unpaired Data Challenges in Open-World Deepfake Detection on Online Social Networks

链接: https://arxiv.org/abs/2410.18687
作者: Renshuai Tao,Manyi Le,Chuangchuang Tan,Huan Liu,Haotong Qin,Yao Zhao
关键词-EN: varying image quality, handling varying image, online social networks, handling varying, significant advances
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages, 4 figures

点击查看摘要

Abstract:Despite significant advances in deepfake detection, handling varying image quality, especially due to different compressions on online social networks (OSNs), remains challenging. Current methods succeed by leveraging correlations between paired images, whether raw or compressed. However, in open-world scenarios, paired data is scarce, with compressed images readily available but corresponding raw versions difficult to obtain. This imbalance, where unpaired data vastly outnumbers paired data, often leads to reduced detection performance, as existing methods struggle without corresponding raw images. To overcome this issue, we propose a novel approach named the open-world deepfake detection network (ODDN), which comprises two core modules: open-world data aggregation (ODA) and compression-discard gradient correction (CGC). ODA effectively aggregates correlations between compressed and raw samples through both fine-grained and coarse-grained analyses for paired and unpaired data, respectively. CGC incorporates a compression-discard gradient correction to further enhance performance across diverse compression methods in OSN. This technique optimizes the training gradient to ensure the model remains insensitive to compression variations. Extensive experiments conducted on 17 popular deepfake datasets demonstrate the superiority of the ODDN over SOTA baselines.

[CV-20] Every Component Counts: Rethinking the Measure of Success for Medical Semantic Segmentation in Multi-Instance Segmentation Tasks

链接: https://arxiv.org/abs/2410.18684
作者: Alexander Jaus,Constantin Seibold,Simon Reiß,Zdravko Marinov,Keyi Li,Zeling Ye,Stefan Krieg,Jens Kleesiek,Rainer Stiefelhagen
关键词-EN: segmentation evaluation protocol, multi-instance detection scenario, semantic segmentation evaluation, existing semantic segmentation, semantic segmentation metrics
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We present Connected-Component~(CC)-Metrics, a novel semantic segmentation evaluation protocol, targeted to align existing semantic segmentation metrics to a multi-instance detection scenario in which each connected component matters. We motivate this setup in the common medical scenario of semantic metastases segmentation in a full-body PET/CT. We show how existing semantic segmentation metrics suffer from a bias towards larger connected components contradicting the clinical assessment of scans in which tumor size and clinical relevance are uncorrelated. To rebalance existing segmentation metrics, we propose to evaluate them on a per-component basis thus giving each tumor the same weight irrespective of its size. To match predictions to ground-truth segments, we employ a proximity-based matching criterion, evaluating common metrics locally at the component of interest. Using this approach, we break free of biases introduced by large metastasis for overlap-based metrics such as Dice or Surface Dice. CC-Metrics also improves distance-based metrics such as Hausdorff Distances which are uninformative for small changes that do not influence the maximum or 95th percentile, and avoids pitfalls introduced by directly combining counting-based metrics with overlap-based metrics as it is done in Panoptic Quality.

[CV-21] Rigid Single-Slice-in-Volume registration via rotation-equivariant 2D/3D feature matching

链接: https://arxiv.org/abs/2410.18683
作者: Stefan Brandstätter,Philipp Seeböck,Christoph Fürböck,Svitlana Pochepnia,Helmut Prosch,Georg Langs
关键词-EN: surgical navigation, environmental understanding, autonomous systems, augmented reality, essential in tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:2D to 3D registration is essential in tasks such as diagnosis, surgical navigation, environmental understanding, navigation in robotics, autonomous systems, or augmented reality. In medical imaging, the aim is often to place a 2D image in a 3D volumetric observation to w. Current approaches for rigid single slice in volume registration are limited by requirements such as pose initialization, stacks of adjacent slices, or reliable anatomical landmarks. Here, we propose a self-supervised 2D/3D registration approach to match a single 2D slice to the corresponding 3D volume. The method works in data without anatomical priors such as images of tumors. It addresses the dimensionality disparity and establishes correspondences between 2D in-plane and 3D out-of-plane rotation-equivariant features by using group equivariant CNNs. These rotation-equivariant features are extracted from the 2D query slice and aligned with their 3D counterparts. Results demonstrate the robustness of the proposed slice-in-volume registration on the NSCLC-Radiomics CT and KIRBY21 MRI datasets, attaining an absolute median angle error of less than 2 degrees and a mean-matching feature accuracy of 89% at a tolerance of 3 pixels.

[CV-22] Enhancing pretraining efficiency for medical image segmentation via transferability metrics

链接: https://arxiv.org/abs/2410.18677
作者: Gábor Hidy,Bence Bakos,András Lukács
关键词-EN: deep neural networks, training deep neural, medical image segmentation, neural networks, scarcity of labeled
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:In medical image segmentation tasks, the scarcity of labeled training data poses a significant challenge when training deep neural networks. When using U-Net-style architectures, it is common practice to address this problem by pretraining the encoder part on a large general-purpose dataset like ImageNet. However, these methods are resource-intensive and do not guarantee improved performance on the downstream task. In this paper we investigate a variety of training setups on medical image segmentation datasets, using ImageNet-pretrained models. By examining over 300 combinations of models, datasets, and training methods, we find that shorter pretraining often leads to better results on the downstream task, providing additional proof to the well-known fact that the accuracy of the model on ImageNet is a poor indicator for downstream performance. As our main contribution, we introduce a novel transferability metric, based on contrastive learning, that measures how robustly a pretrained model is able to represent the target data. In contrast to other transferability scores, our method is applicable to the case of transferring from ImageNet classification to medical image segmentation. We apply our robustness score by measuring it throughout the pretraining phase to indicate when the model weights are optimal for downstream transfer. This reduces pretraining time and improves results on the target task.

[CV-23] 3D Shape Completion with Test-Time Training

链接: https://arxiv.org/abs/2410.18668
作者: Michael Schopf-Kuester,Zorah Lähner,Michael Moeller
关键词-EN: restoring incomplete shapes, addresses the problem, missing parts, shape completion, work addresses
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This work addresses the problem of \textitshape completion, i.e., the task of restoring incomplete shapes by predicting their missing parts. While previous works have often predicted the fractured and restored shape in one step, we approach the task by separately predicting the fractured and newly restored parts, but ensuring these predictions are interconnected. We use a decoder network motivated by related work on the prediction of signed distance functions (DeepSDF). In particular, our representation allows us to consider test-time-training, i.e., finetuning network parameters to match the given incomplete shape more accurately during inference. While previous works often have difficulties with artifacts around the fracture boundary, we demonstrate that our overfitting to the fractured parts leads to significant improvements in the restoration of eight different shape categories of the ShapeNet data set in terms of their chamfer distances.

[CV-24] DreamClear: High-Capacity Real-World Image Restoration with Privacy-Safe Dataset Curation NEURIPS2024

链接: https://arxiv.org/abs/2410.18666
作者: Yuang Ai,Xiaoqiang Zhou,Huaibo Huang,Xiaotian Han,Zhengyu Chen,Quanzeng You,Hongxia Yang
关键词-EN: significant challenges due, scenarios presents significant, presents significant challenges, Image restoration, significant challenges
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:Image restoration (IR) in real-world scenarios presents significant challenges due to the lack of high-capacity models and comprehensive datasets. To tackle these issues, we present a dual strategy: GenIR, an innovative data curation pipeline, and DreamClear, a cutting-edge Diffusion Transformer (DiT)-based image restoration model. GenIR, our pioneering contribution, is a dual-prompt learning pipeline that overcomes the limitations of existing datasets, which typically comprise only a few thousand images and thus offer limited generalizability for larger models. GenIR streamlines the process into three stages: image-text pair construction, dual-prompt based fine-tuning, and data generation filtering. This approach circumvents the laborious data crawling process, ensuring copyright compliance and providing a cost-effective, privacy-safe solution for IR dataset construction. The result is a large-scale dataset of one million high-quality images. Our second contribution, DreamClear, is a DiT-based image restoration model. It utilizes the generative priors of text-to-image (T2I) diffusion models and the robust perceptual capabilities of multi-modal large language models (MLLMs) to achieve photorealistic restoration. To boost the model’s adaptability to diverse real-world degradations, we introduce the Mixture of Adaptive Modulator (MoAM). It employs token-wise degradation priors to dynamically integrate various restoration experts, thereby expanding the range of degradations the model can address. Our exhaustive experiments confirm DreamClear’s superior performance, underlining the efficacy of our dual strategy for real-world image restoration. Code and pre-trained models will be available at: this https URL.

[CV-25] Moving Object Segmentation in Point Cloud Data using Hidden Markov Models IROS2024

链接: https://arxiv.org/abs/2410.18638
作者: Vedant Bhandari,Jasmin James,Tyson Phillips,P. Ross McAree
关键词-EN: Autonomous agents require, Autonomous agents, identify dynamic objects, planning and navigation, require the capability
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to the IEEE IROS 2024 workshop on Long-Term Perception for Autonomy in Dynamic Human-shared Environments: What Do Robots Need?

点击查看摘要

Abstract:Autonomous agents require the capability to identify dynamic objects in their environment for safe planning and navigation. Incomplete and erroneous dynamic detections jeopardize the agent’s ability to accomplish its task. Dynamic detection is a challenging problem due to the numerous sources of uncertainty inherent in the problem’s inputs and the wide variety of applications, which often lead to use-case-tailored solutions. We propose a robust learning-free approach to segment moving objects in point cloud data. The foundation of the approach lies in modelling each voxel using a hidden Markov model (HMM), and probabilistically integrating beliefs into a map using an HMM filter. The proposed approach is tested on benchmark datasets and consistently performs better than or as well as state-of-the-art methods with strong generalized performance across sensor characteristics and environments. The approach is open-sourced at this https URL.

[CV-26] A Cranial-Feature-Based Registration Scheme for Robotic Micromanipulation Using a Microscopic Stereo Camera System

链接: https://arxiv.org/abs/2410.18630
作者: Xiaofeng Lin,Saúl Alexis Heredia Pérez,Kanako Harada
关键词-EN: Biological specimens exhibit, specimens exhibit significant, exhibit significant variations, Biological specimens, challenging autonomous robotic
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: Accepted by Advanced Robotics, Vol. 38, Issue 21

点击查看摘要

Abstract:Biological specimens exhibit significant variations in size and shape, challenging autonomous robotic manipulation. We focus on the mouse skull window creation task to illustrate these challenges. The study introduces a microscopic stereo camera system (MSCS) enhanced by the linear model for depth perception. Alongside this, a precise registration scheme is developed for the partially exposed mouse cranial surface, employing a CNN-based constrained and colorized registration strategy. These methods are integrated with the MSCS for robotic micromanipulation tasks. The MSCS demonstrated a high precision of 0.10 mm \pm 0.02 mm measured in a step height experiment and real-time performance of 30 FPS in 3D reconstruction. The registration scheme proved its precision, with a translational error of 1.13 mm \pm 0.31 mm and a rotational error of 3.38 ^\circ \pm 0.89 ^\circ tested on 105 continuous frames with an average speed of 1.60 FPS. This study presents the application of a MSCS and a novel registration scheme in enhancing the precision and accuracy of robotic micromanipulation in scientific and surgical settings. The innovations presented here offer automation methodology in handling the challenges of microscopic manipulation, paving the way for more accurate, efficient, and less invasive procedures in various fields of microsurgery and scientific research.

[CV-27] Environment Maps Editing using Inverse Rendering and Adversarial Implicit Functions

链接: https://arxiv.org/abs/2410.18622
作者: Antonio D’Orazio,Davide Sforza,Fabio Pellacini,Iacopo Masi
关键词-EN: High Dynamic Range, Editing High Dynamic, Standard Dynamic Range, High Dynamic, Dynamic Range
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注:

点击查看摘要

Abstract:Editing High Dynamic Range (HDR) environment maps using an inverse differentiable rendering architecture is a complex inverse problem due to the sparsity of relevant pixels and the challenges in balancing light sources and background. The pixels illuminating the objects are a small fraction of the total image, leading to noise and convergence issues when the optimization directly involves pixel values. HDR images, with pixel values beyond the typical Standard Dynamic Range (SDR), pose additional challenges. Higher learning rates corrupt the background during optimization, while lower learning rates fail to manipulate light sources. Our work introduces a novel method for editing HDR environment maps using a differentiable rendering, addressing sparsity and variance between values. Instead of introducing strong priors that extract the relevant HDR pixels and separate the light sources, or using tricks such as optimizing the HDR image in the log space, we propose to model the optimized environment map with a new variant of implicit neural representations able to handle HDR images. The neural representation is trained with adversarial perturbations over the weights to ensure smooth changes in the output when it receives gradients from the inverse rendering. In this way, we obtain novel and cheap environment maps without relying on latent spaces of expensive generative models, maintaining the original visual consistency. Experimental results demonstrate the method’s effectiveness in reconstructing the desired lighting effects while preserving the fidelity of the map and reflections on objects in the scene. Our approach can pave the way to interesting tasks, such as estimating a new environment map given a rendering with novel light sources, maintaining the initial perceptual features, and enabling brush stroke-based editing of existing environment maps.

[CV-28] Rethinking Softmax: Self-Attention with Polynomial Activations

链接: https://arxiv.org/abs/2410.18613
作者: Hemanth Saratchandran,Jianqiao Zheng,Yiping Ji,Wenbo Zhang,Simon Lucey
关键词-EN: Frobenius norm, paper challenges, challenges the conventional, conventional belief, transformers is effective
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This paper challenges the conventional belief that softmax attention in transformers is effective primarily because it generates a probability distribution for attention allocation. Instead, we theoretically show that its success lies in its ability to implicitly regularize the Frobenius norm of the attention matrix during training. We then explore alternative activations that regularize the Frobenius norm of the attention matrix, demonstrating that certain polynomial activations can achieve this effect, making them suitable for attention-based architectures. Empirical results indicate these activations perform comparably or better than softmax across various computer vision and language tasks, suggesting new possibilities for attention mechanisms beyond softmax.

[CV-29] On Model-Free Re-ranking for Visual Place Recognition with Deep Learned Local Features

链接: https://arxiv.org/abs/2410.18573
作者: Tomáš Pivoňka,Libor Přeučil
关键词-EN: local visual features, visual place recognition, subset of candidates, chooses the best-matching, pre-selected subset
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: 12 pages, 9 figures

点击查看摘要

Abstract:Re-ranking is the second stage of a visual place recognition task, in which the system chooses the best-matching images from a pre-selected subset of candidates. Model-free approaches compute the image pair similarity based on a spatial comparison of corresponding local visual features, eliminating the need for computationally expensive estimation of a model describing transformation between images. The article focuses on model-free re-ranking based on standard local visual features and their applicability in long-term autonomy systems. It introduces three new model-free re-ranking methods that were designed primarily for deep-learned local visual features. These features evince high robustness to various appearance changes, which stands as a crucial property for use with long-term autonomy systems. All the introduced methods were employed in a new visual place recognition system together with the D2-net feature detector (Dusmanu, 2019) and experimentally tested with diverse, challenging public datasets. The obtained results are on par with current state-of-the-art methods, affirming that model-free approaches are a viable and worthwhile path for long-term visual place recognition.

[CV-30] Research on gesture recognition method based on SEDCNN-SVM

链接: https://arxiv.org/abs/2410.18557
作者: Mingjin Zhang,Jiahao Wang,Jianming Wang,Qi Wang
关键词-EN: surface electromyographic signal, Gesture recognition based, low-level signal features, based on surface, surface electromyographic
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Gesture recognition based on surface electromyographic signal (sEMG) is one of the most used methods. The traditional manual feature extraction can only extract some low-level signal features, this causes poor classifier performance and low recognition accuracy when dealing with some complex signals. A recognition method, namely SEDCNN-SVM, is proposed to recognize sEMG of different gestures. SEDCNN-SVM consists of an improved deep convolutional neural network (DCNN) and a support vector machine (SVM). The DCNN can automatically extract and learn the feature information of sEMG through the convolution operation of the convolutional layer, so that it can capture the complex and high-level features in the data. The Squeeze and Excitation Networks (SE-Net) and the residual module were added to the model, so that the feature representation of each channel could be improved, the loss of feature information in convolutional operations was reduced, useful feature information was captured, and the problem of network gradient vanishing was eased. The SVM can improve the generalization ability and classification accuracy of the model by constructing an optimal hyperplane of the feature space. Hence, the SVM was used to replace the full connection layer and the Softmax function layer of the DCNN, the use of a suitable kernel function in SVM can improve the model’s generalization ability and classification accuracy. To verify the effectiveness of the proposed classification algorithm, this method is analyzed and compared with other comparative classification methods. The recognition accuracy of SEDCNN-SVM can reach 0.955, it is significantly improved compared with other classification methods, the SEDCNN-SVM model is recognized online in real time.

[CV-31] Local and Global Graph Modeling with Edge-weighted Graph Attention Network for Handwritten Mathematical Expression Recognition

链接: https://arxiv.org/abs/2410.18555
作者: Yejing Xie,Richard Zanibbi,Harold Mouchère
关键词-EN: Handwritten Mathematical Expression, graph-based modeling techniques, leveraging graph-based modeling, approach to Handwritten, Handwritten Mathematical
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we present a novel approach to Handwritten Mathematical Expression Recognition (HMER) by leveraging graph-based modeling techniques. We introduce an End-to-end model with an Edge-weighted Graph Attention Mechanism (EGAT), designed to perform simultaneous node and edge classification. This model effectively integrates node and edge features, facilitating the prediction of symbol classes and their relationships within mathematical expressions. Additionally, we propose a stroke-level Graph Modeling method for both local (LGM) and global (GGM) information, which applies an end-to-end model to Online HMER tasks, transforming the recognition problem into node and edge classification tasks in graph structure. By capturing both local and global graph features, our method ensures comprehensive understanding of the expression structure. Through the combination of these components, our system demonstrates superior performance in symbol detection, relation classification, and expression-level recognition.

[CV-32] Interpretable Representation Learning from Videos using Nonlinear Priors BMVC2024

链接: https://arxiv.org/abs/2410.18539
作者: Marian Longa,João F. Henriques
关键词-EN: make machines’ decisions, machines’ decisions understandable, Learning interpretable representations, Additive Noise Model, Gaussian Mixture Model
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted to BMVC 2024 (Oral)

点击查看摘要

Abstract:Learning interpretable representations of visual data is an important challenge, to make machines’ decisions understandable to humans and to improve generalisation outside of the training distribution. To this end, we propose a deep learning framework where one can specify nonlinear priors for videos (e.g. of Newtonian physics) that allow the model to learn interpretable latent variables and use these to generate videos of hypothetical scenarios not observed at training time. We do this by extending the Variational Auto-Encoder (VAE) prior from a simple isotropic Gaussian to an arbitrary nonlinear temporal Additive Noise Model (ANM), which can describe a large number of processes (e.g. Newtonian physics). We propose a novel linearization method that constructs a Gaussian Mixture Model (GMM) approximating the prior, and derive a numerically stable Monte Carlo estimate of the KL divergence between the posterior and prior GMMs. We validate the method on different real-world physics videos including a pendulum, a mass on a spring, a falling object and a pulsar (rotating neutron star). We specify a physical prior for each experiment and show that the correct variables are learned. Once a model is trained, we intervene on it to change different physical variables (such as oscillation amplitude or adding air drag) to generate physically correct videos of hypothetical scenarios that were not observed previously.

[CV-33] SMITE: Segment Me In TimE

链接: https://arxiv.org/abs/2410.18538
作者: Amirhossein Alimohammadi,Sauradip Nag,Saeid Asgari Taghanaki,Andrea Tagliasacchi,Ghassan Hamarneh,Ali Mahdavi Amiri
关键词-EN: presents significant challenges, video presents significant, Segmenting an object, significant challenges, video presents
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Technical report. Project page is at \url{ this https URL }

点击查看摘要

Abstract:Segmenting an object in a video presents significant challenges. Each pixel must be accurately labelled, and these labels must remain consistent across frames. The difficulty increases when the segmentation is with arbitrary granularity, meaning the number of segments can vary arbitrarily, and masks are defined based on only one or a few sample images. In this paper, we address this issue by employing a pre-trained text to image diffusion model supplemented with an additional tracking mechanism. We demonstrate that our approach can effectively manage various segmentation scenarios and outperforms state-of-the-art alternatives.

[CV-34] Beyond Color and Lines: Zero-Shot Style-Specific Image Variations with Coordinated Semantics

链接: https://arxiv.org/abs/2410.18537
作者: Jinghao Hu,Yuhe Zhang,GuoHua Geng,Liuyuxin Yang,JiaRui Yan,Jingtao Cheng,YaDong Zhang,Kang Li
关键词-EN: primarily considered, considered in terms, artistic elements, Traditionally, style
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages,6 figures

点击查看摘要

Abstract:Traditionally, style has been primarily considered in terms of artistic elements such as colors, brushstrokes, and lighting. However, identical semantic subjects, like people, boats, and houses, can vary significantly across different artistic traditions, indicating that style also encompasses the underlying semantics. Therefore, in this study, we propose a zero-shot scheme for image variation with coordinated semantics. Specifically, our scheme transforms the image-to-image problem into an image-to-text-to-image problem. The image-to-text operation employs vision-language models e.g., BLIP) to generate text describing the content of the input image, including the objects and their positions. Subsequently, the input style keyword is elaborated into a detailed description of this style and then merged with the content text using the reasoning capabilities of ChatGPT. Finally, the text-to-image operation utilizes a Diffusion model to generate images based on the text prompt. To enable the Diffusion model to accommodate more styles, we propose a fine-tuning strategy that injects text and style constraints into cross-attention. This ensures that the output image exhibits similar semantics in the desired style. To validate the performance of the proposed scheme, we constructed a benchmark comprising images of various styles and scenes and introduced two novel metrics. Despite its simplicity, our scheme yields highly plausible results in a zero-shot manner, particularly for generating stylized images with high-fidelity semantics.

[CV-35] Unsupervised semantic segmentation of urban high-density multispectral point clouds

链接: https://arxiv.org/abs/2410.18520
作者: Oona Oinonen,Lassi Ruoppa,Josef Taher,Matti Lehtomäki,Leena Matikainen,Kirsi Karila,Teemu Hakala,Antero Kukko,Harri Kaartinen,Juha Hyyppä
关键词-EN: airborne laser scanning, acquisition costs decrease, accurate urban airborne, urban airborne laser, highly accurate urban
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 22 pages, 11 figures

点击查看摘要

Abstract:The availability of highly accurate urban airborne laser scanning (ALS) data will increase rapidly in the future, especially as acquisition costs decrease, for example through the use of drones. Current challenges in data processing are related to the limited spectral information and low point density of most ALS datasets. Another challenge will be the growing need for annotated training data, frequently produced by manual processes, to enable semantic interpretation of point clouds. This study proposes to semantically segment new high-density (1200 points per square metre on average) multispectral ALS data with an unsupervised ground-aware deep clustering method GroupSP inspired by the unsupervised GrowSP algorithm. GroupSP divides the scene into superpoints as a preprocessing step. The neural network is trained iteratively by grouping the superpoints and using the grouping assignments as pseudo-labels. The predictions for the unseen data are given by over-segmenting the test set and mapping the predicted classes into ground truth classes manually or with automated majority voting. GroupSP obtained an overall accuracy (oAcc) of 97% and a mean intersection over union (mIoU) of 80%. When compared to other unsupervised semantic segmentation methods, GroupSP outperformed GrowSP and non-deep K-means. However, a supervised random forest classifier outperformed GroupSP. The labelling efforts in GroupSP can be minimal; it was shown, that the GroupSP can semantically segment seven urban classes (building, high vegetation, low vegetation, asphalt, rock, football field, and gravel) with oAcc of 95% and mIoU of 75% using only 0.004% of the available annotated points in the mapping assignment. Finally, the multispectral information was examined; adding each new spectral channel improved the mIoU. Additionally, echo deviation was valuable, especially when distinguishing ground-level classes.

[CV-36] A Note on Geometric Calibration of Multiple Cameras and Projectors

链接: https://arxiv.org/abs/2410.18511
作者: Tomislav Petkovic(FER),Simone Gasparini(IRIT-REVA),Tomislav Pribanic(FER)
关键词-EN: Geometric calibration, essential step, Geometric, calibration, well-known geometric calibration
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Geometric calibration of cameras and projectors is an essential step that must be performed before any imaging system can be used. There are many well-known geometric calibration methods for calibrating systems comprised of multiple cameras, but simultaneous geometric calibration of multiple projectors and cameras has received less attention. This leaves unresolved several practical issues which must be considered to achieve the simplicity of use required for real world applications. In this work we discuss several important components of a real-world geometric calibration procedure used in our laboratory to calibrate surface imaging systems comprised of many projectors and cameras. We specifically discuss the design of the calibration object and the image processing pipeline used to analyze it in the acquired images. We also provide quantitative calibration results in the form of reprojection errors and compare them to the classic approaches such as Zhang’s calibration method.

[CV-37] Synth4Seg – Learning Defect Data Synthesis for Defect Segmentation using Bi-level Optimization

链接: https://arxiv.org/abs/2410.18490
作者: Shancong Mou,Raviteja Vemulapalli,Shiyu Li,Yuxuan Liu,C Thomas,Meng Cao,Haoping Bai,Oncel Tuzel,Ping Huang,Jiulong Shan,Jianjun Shi
关键词-EN: scarcity poses challenges, data scarcity poses, supervised deep learning, advanced manufacturing, supervised deep
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Defect segmentation is crucial for quality control in advanced manufacturing, yet data scarcity poses challenges for state-of-the-art supervised deep learning. Synthetic defect data generation is a popular approach for mitigating data challenges. However, many current methods simply generate defects following a fixed set of rules, which may not directly relate to downstream task performance. This can lead to suboptimal performance and may even hinder the downstream task. To solve this problem, we leverage a novel bi-level optimization-based synthetic defect data generation framework. We use an online synthetic defect generation module grounded in the commonly-used Cut\Paste framework, and adopt an efficient gradient-based optimization algorithm to solve the bi-level optimization problem. We achieve simultaneous training of the defect segmentation network, and learn various parameters of the data synthesis module by maximizing the validation performance of the trained defect segmentation network. Our experimental results on benchmark datasets under limited data settings show that the proposed bi-level optimization method can be used for learning the most effective locations for pasting synthetic defects thereby improving the segmentation performance by up to 18.3% when compared to pasting defects at random locations. We also demonstrate up to 2.6% performance gain by learning the importance weights for different augmentation-specific defect data sources when compared to giving equal importance to all the data sources.

[CV-38] Monge-Ampere Regularization for Learning Arbitrary Shapes from Point Clouds

链接: https://arxiv.org/abs/2410.18477
作者: Chuanxiang Yang,Yuanfeng Zhou,Guangshun Wei,Long Ma,Junhui Hou,Yuan Liu,Wenping Wang
关键词-EN: signed distance function, modeling watertight shapes, distance function, implicit geometry representations, watertight shapes
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:As commonly used implicit geometry representations, the signed distance function (SDF) is limited to modeling watertight shapes, while the unsigned distance function (UDF) is capable of representing various surfaces. However, its inherent theoretical shortcoming, i.e., the non-differentiability at the zero level set, would result in sub-optimal reconstruction quality. In this paper, we propose the scaled-squared distance function (S ^2 DF), a novel implicit surface representation for modeling arbitrary surface types. S ^2 DF does not distinguish between inside and outside regions while effectively addressing the non-differentiability issue of UDF at the zero level set. We demonstrate that S ^2 DF satisfies a second-order partial differential equation of Monge-Ampere-type, allowing us to develop a learning pipeline that leverages a novel Monge-Ampere regularization to directly learn S ^2 DF from raw unoriented point clouds without supervision from ground-truth S ^2 DF values. Extensive experiments across multiple datasets show that our method significantly outperforms state-of-the-art supervised approaches that require ground-truth surface information as supervision for training. The code will be publicly available at this https URL.

[CV-39] Learn 2 Rage: Experiencing The Emotional Roller Coaster That Is Reinforcement Learning

链接: https://arxiv.org/abs/2410.18462
作者: Lachlan Mares,Stefan Podgorski,Ian Reid
关键词-EN: Autonomous Racing Virtual, Racing Virtual Challenge, Race Autonomous Racing, teams winning submission, Racing Virtual
类目: ystems and Control (eess.SY); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:This work presents the experiments and solution outline for our teams winning submission in the Learn To Race Autonomous Racing Virtual Challenge 2022 hosted by AIcrowd. The objective of the Learn-to-Race competition is to push the boundary of autonomous technology, with a focus on achieving the safety benefits of autonomous driving. In the description the competition is framed as a reinforcement learning (RL) challenge. We focused our initial efforts on implementation of Soft Actor Critic (SAC) variants. Our goal was to learn non-trivial control of the race car exclusively from visual and geometric features, directly mapping pixels to control actions. We made suitable modifications to the default reward policy aiming to promote smooth steering and acceleration control. The framework for the competition provided real time simulation, meaning a single episode (learning experience) is measured in minutes. Instead of pursuing parallelisation of episodes we opted to explore a more traditional approach in which the visual perception was processed (via learned operators) and fed into rule-based controllers. Such a system, while not as academically “attractive” as a pixels-to-actions approach, results in a system that requires less training, is more explainable, generalises better and is easily tuned and ultimately out-performed all other agents in the competition by a large margin.

[CV-40] Integrating Deep Feature Extraction and Hybrid ResNet-DenseNet Model for Multi-Class Abnormality Detection in Endoscopic Images

链接: https://arxiv.org/abs/2410.18457
作者: Aman Sagar,Preeti Mehta,Monika Shrivastva,Suchi Kumari
关键词-EN: Video Capsule Endoscopy, deep learning framework, Capsule Endoscopy, Video Capsule, abnormalities in Video
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 10 pages, 5 figures, CVIP challenge report including the validation results

点击查看摘要

Abstract:This paper presents a deep learning framework for the multi-class classification of gastrointestinal abnormalities in Video Capsule Endoscopy (VCE) frames. The aim is to automate the identification of ten GI abnormality classes, including angioectasia, bleeding, and ulcers, thereby reducing the diagnostic burden on gastroenterologists. Utilizing an ensemble of DenseNet and ResNet architectures, the proposed model achieves an overall accuracy of 94% across a well-structured dataset. Precision scores range from 0.56 for erythema to 1.00 for worms, with recall rates peaking at 98% for normal findings. This study emphasizes the importance of robust data preprocessing techniques, including normalization and augmentation, in enhancing model performance. The contributions of this work lie in developing an effective AI-driven tool that streamlines the diagnostic process in gastroenterology, ultimately improving patient care and clinical outcomes.

[CV-41] Segmentation-aware Prior Assisted Joint Global Information Aggregated 3D Building Reconstruction

链接: https://arxiv.org/abs/2410.18433
作者: Hongxin Peng,Yongjian Liao,Weijun Li,Chuanyu Fu,Guoxin Zhang,Ziquan Ding,Zijie Huang,Qiku Cao,Shuting Cai
关键词-EN: precise engineering surveying, Multi-View Stereo plays, quantitative analysis, monitoring and maintenance, Multi-View Stereo
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multi-View Stereo plays a pivotal role in civil engineering by facilitating 3D modeling, precise engineering surveying, quantitative analysis, as well as monitoring and maintenance. It serves as a valuable tool, offering high-precision and real-time spatial information crucial for various engineering projects. However, Multi-View Stereo algorithms encounter challenges in reconstructing weakly-textured regions within large-scale building scenes. In these areas, the stereo matching of pixels often fails, leading to inaccurate depth estimations. Based on the Segment Anything Model and RANSAC algorithm, we propose an algorithm that accurately segments weakly-textured regions and constructs their plane priors. These plane priors, combined with triangulation priors, form a reliable prior candidate set. Additionally, we introduce a novel global information aggregation cost function. This function selects optimal plane prior information based on global information in the prior candidate set, constrained by geometric consistency during the depth estimation update process. Experimental results on both the ETH3D benchmark dataset, aerial dataset, building dataset and real scenarios substantiate the superior performance of our method in producing 3D building models compared to other state-of-the-art methods. In summary, our work aims to enhance the completeness and density of 3D building reconstruction, carrying implications for broader applications in urban planning and virtual reality.

[CV-42] FreCaS: Efficient Higher-Resolution Image Generation via Frequency-aware Cascaded Sampling

链接: https://arxiv.org/abs/2410.18410
作者: Zhengqiang Zhang,Ruihuang Li,Lei Zhang
关键词-EN: high computational cost, training size remains, challenging task due, great success, computational cost
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:While image generation with diffusion models has achieved a great success, generating images of higher resolution than the training size remains a challenging task due to the high computational cost. Current methods typically perform the entire sampling process at full resolution and process all frequency components simultaneously, contradicting with the inherent coarse-to-fine nature of latent diffusion models and wasting computations on processing premature high-frequency details at early diffusion stages. To address this issue, we introduce an efficient \textbfFre quency-aware \textbfCa scaded \textbfS ampling framework, \textbfFreCaS in short, for higher-resolution image generation. FreCaS decomposes the sampling process into cascaded stages with gradually increased resolutions, progressively expanding frequency bands and refining the corresponding details. We propose an innovative frequency-aware classifier-free guidance (FA-CFG) strategy to assign different guidance strengths for different frequency components, directing the diffusion model to add new details in the expanded frequency domain of each stage. Additionally, we fuse the cross-attention maps of previous and current stages to avoid synthesizing unfaithful layouts. Experiments demonstrate that FreCaS significantly outperforms state-of-the-art methods in image quality and generation speed. In particular, FreCaS is about 2.86 \times and 6.07 \times faster than ScaleCrafter and DemoFusion in generating a 2048 \times 2048 image using a pre-trained SDXL model and achieves an FID _b improvement of 11.6 and 3.7, respectively. FreCaS can be easily extended to more complex models such as SD3. The source code of FreCaS can be found at \href\textthis https URLthis https URL .

[CV-43] Scale Propagation Network for Generalizable Depth Completion

链接: https://arxiv.org/abs/2410.18408
作者: Haotian Wang,Meng Yang,Xinhu Zheng,Gang Hua
关键词-EN: inferring dense depth, Depth completion, inferring dense, crucial for robust, dense depth maps
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Major revision in IEEE Transactions on Pattern Analysis and Machine Intelligence

点击查看摘要

Abstract:Depth completion, inferring dense depth maps from sparse measurements, is crucial for robust 3D perception. Although deep learning based methods have made tremendous progress in this problem, these models cannot generalize well across different scenes that are unobserved in training, posing a fundamental limitation that yet to be overcome. A careful analysis of existing deep neural network architectures for depth completion, which are largely borrowing from successful backbones for image analysis tasks, reveals that a key design bottleneck actually resides in the conventional normalization layers. These normalization layers are designed, on one hand, to make training more stable, on the other hand, to build more visual invariance across scene scales. However, in depth completion, the scale is actually what we want to robustly estimate in order to better generalize to unseen scenes. To mitigate, we propose a novel scale propagation normalization (SP-Norm) method to propagate scales from input to output, and simultaneously preserve the normalization operator for easy convergence. More specifically, we rescale the input using learned features of a single-layer perceptron from the normalized input, rather than directly normalizing the input as conventional normalization layers. We then develop a new network architecture based on SP-Norm and the ConvNeXt V2 backbone. We explore the composition of various basic blocks and architectures to achieve superior performance and efficient inference for generalizable depth completion. Extensive experiments are conducted on six unseen datasets with various types of sparse depth maps, i.e., randomly sampled 0.1%/1%/10% valid pixels, 4/8/16/32/64-line LiDAR points, and holes from Structured-Light. Our model consistently achieves the best accuracy with faster speed and lower memory when compared to state-of-the-art methods.

[CV-44] DMVC: Multi-Camera Video Compression Network aimed at Improving Deep Learning Accuracy

链接: https://arxiv.org/abs/2410.18400
作者: Huan Cui(1 and 2),Qing Li(3),Hanling Wang(1),Yong jiang(1) ((1) Tsinghua University, (2) Peking University, (3) Peng Cheng Laboratory)
关键词-EN: introduce a cutting-edge, age of ubiquitous, compression framework tailored, serve machine learning, machine learning
类目: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:We introduce a cutting-edge video compression framework tailored for the age of ubiquitous video data, uniquely designed to serve machine learning applications. Unlike traditional compression methods that prioritize human visual perception, our innovative approach focuses on preserving semantic information critical for deep learning accuracy, while efficiently reducing data size. The framework operates on a batch basis, capable of handling multiple video streams simultaneously, thereby enhancing scalability and processing efficiency. It features a dual reconstruction mode: lightweight for real-time applications requiring swift responses, and high-precision for scenarios where accuracy is crucial. Based on a designed deep learning algorithms, it adeptly segregates essential information from redundancy, ensuring machine learning tasks are fed with data of the highest relevance. Our experimental results, derived from diverse datasets including urban surveillance and autonomous vehicle navigation, showcase DMVC’s superiority in maintaining or improving machine learning task accuracy, while achieving significant data compression. This breakthrough paves the way for smarter, scalable video analysis systems, promising immense potential across various applications from smart city infrastructure to autonomous systems, establishing a new benchmark for integrating video compression with machine learning.

[CV-45] CloudEye: A New Paradigm of Video Analysis System for Mobile Visual Scenarios

链接: https://arxiv.org/abs/2410.18399
作者: Huan Cui(1 and 2),Qing Li(3),Hanling Wang(1),Yong jiang(1) ((1) Tsinghua University, (2) Peking University, (3) Peng Cheng Laboratory)
关键词-EN: play a vital, vital role, role in numerous, Mobile, mobile vision
类目: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Mobile deep vision systems play a vital role in numerous scenarios. However, deep learning applications in mobile vision scenarios face problems such as tight computing resources. With the development of edge computing, the architecture of edge clouds has mitigated some of the issues related to limited computing resources. However, it has introduced increased latency. To address these challenges, we designed CloudEye which consists of Fast Inference Module, Feature Mining Module and Quality Encode Module. CloudEye is a real-time, efficient mobile visual perception system that leverages content information mining on edge servers in a mobile vision system environment equipped with edge servers and coordinated with cloud servers. Proven by sufficient experiments, we develop a prototype system that reduces network bandwidth usage by 69.50%, increases inference speed by 24.55%, and improves detection accuracy by 67.30%

[CV-46] You Only Look Around: Learning Illumination Invariant Feature for Low-light Object Detection NEURIPS2024

链接: https://arxiv.org/abs/2410.18398
作者: Mingbo Hong,Shen Cheng,Haibin Huang,Haoqiang Fan,Shuaicheng Liu
关键词-EN: introduce YOLA, YOLA, object detection, Unlike previous works, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by NeurIPS2024

点击查看摘要

Abstract:In this paper, we introduce YOLA, a novel framework for object detection in low-light scenarios. Unlike previous works, we propose to tackle this challenging problem from the perspective of feature learning. Specifically, we propose to learn illumination-invariant features through the Lambertian image formation model. We observe that, under the Lambertian assumption, it is feasible to approximate illumination-invariant feature maps by exploiting the interrelationships between neighboring color channels and spatially adjacent pixels. By incorporating additional constraints, these relationships can be characterized in the form of convolutional kernels, which can be trained in a detection-driven manner within a network. Towards this end, we introduce a novel module dedicated to the extraction of illumination-invariant features from low-light images, which can be easily integrated into existing object detection frameworks. Our empirical findings reveal significant improvements in low-light object detection tasks, as well as promising results in both well-lit and over-lit scenarios. Code is available at \urlthis https URL.

[CV-47] Irregular Tensor Low-Rank Representation for Hyperspectral Image Representation

链接: https://arxiv.org/abs/2410.18388
作者: Bo Han,Yuheng Jia,Hui Liu,Junhui Hou
关键词-EN: tensor low-rank representation, alleviate spectral variations, hyperspectral image, Spectral variation, low-rank representation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Spectral variation is a common problem for hyperspectral image (HSI) representation. Low-rank tensor representation is an important approach to alleviate spectral variations. However, the spatial distribution of the HSI is always irregular, while the previous tensor low-rank representation methods can only be applied to the regular data cubes, which limits the performance. To remedy this issue, in this paper we propose a novel irregular tensor low-rank representation model. We first segment the HSI data into several irregular homogeneous regions. Then, we propose a novel irregular tensor low-rank representation method that can efficiently model the irregular 3D cubes. We further use a non-convex nuclear norm to pursue the low-rankness and introduce a negative global low-rank term that improves global consistency. This proposed model is finally formulated as a convex-concave optimization problem and solved by alternative augmented Lagrangian method. Through experiments on four public datasets, the proposed method outperforms the existing low-rank based HSI methods significantly. Code is available at: this https URL.

[CV-48] Interpretable Bilingual Multimodal Large Language Model for Diverse Biomedical Tasks

链接: https://arxiv.org/abs/2410.18387
作者: Lehan Wang,Haonan Wang,Honglong Yang,Jiaji Mao,Zehong Yang,Jun Shen,Xiaomeng Li
关键词-EN: Large Languange Models, Multimodal Large Languange, Large Languange, achieving impressive results, medical Multimodal Large
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Technical Report

点击查看摘要

Abstract:Several medical Multimodal Large Languange Models (MLLMs) have been developed to address tasks involving visual images with textual instructions across various medical modalities, achieving impressive results. Most current medical generalist models are region-agnostic, treating the entire image as a holistic representation. However, they struggle to identify which specific regions they are focusing on when generating a this http URL mimic the behavior of doctors, who typically begin by reviewing the entire image before concentrating on specific regions for a thorough evaluation, we aim to enhance the capability of medical MLLMs in understanding anatomical regions within entire medical scans. To achieve it, we first formulate Region-Centric tasks and construct a large-scale dataset, MedRegInstruct, to incorporate regional information into training. Combining our collected dataset with other medical multimodal corpora for training, we propose a Region-Aware medical MLLM, MedRegA, which is the first bilingual generalist medical AI system to simultaneously handle image-level and region-level medical vision-language tasks across a broad range of modalities. Our MedRegA not only enables three region-centric tasks, but also achieves the best performance for visual question answering, report generation and medical image classification over 8 modalities, showcasing significant versatility. Experiments demonstrate that our model can not only accomplish powerful performance across various medical vision-language tasks in bilingual settings, but also recognize and detect structures in multimodal medical scans, boosting the interpretability and user interactivity of medical MLLMs. Our project page is this https URL.

[CV-49] Real-time 3D-aware Portrait Video Relighting CVPR2024

链接: https://arxiv.org/abs/2410.18355
作者: Ziqi Cai,Kaiwen Jiang,Shu-Yu Chen,Yu-Kun Lai,Hongbo Fu,Boxin Shi,Lin Gao
关键词-EN: Synthesizing realistic videos, Synthesizing realistic, Neural Radiance Fields, viewing angles benefits, angles benefits
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: Accepted to CVPR 2024 (Highlight). Project page: this http URL

点击查看摘要

Abstract:Synthesizing realistic videos of talking faces under custom lighting conditions and viewing angles benefits various downstream applications like video conferencing. However, most existing relighting methods are either time-consuming or unable to adjust the viewpoints. In this paper, we present the first real-time 3D-aware method for relighting in-the-wild videos of talking faces based on Neural Radiance Fields (NeRF). Given an input portrait video, our method can synthesize talking faces under both novel views and novel lighting conditions with a photo-realistic and disentangled 3D representation. Specifically, we infer an albedo tri-plane, as well as a shading tri-plane based on a desired lighting condition for each video frame with fast dual-encoders. We also leverage a temporal consistency network to ensure smooth transitions and reduce flickering artifacts. Our method runs at 32.98 fps on consumer-level hardware and achieves state-of-the-art results in terms of reconstruction quality, lighting error, lighting instability, temporal consistency and inference speed. We demonstrate the effectiveness and interactivity of our method on various portrait videos with diverse lighting and viewing conditions.

[CV-50] hermal Chameleon: Task-Adaptive Tone-mapping for Radiometric Thermal-Infrared images

链接: https://arxiv.org/abs/2410.18340
作者: Dong-Guw Lee,Jeongyun Kim,Younggun Cho,Ayoung Kim
关键词-EN: challenging outdoor environments, Thermal Infrared, Thermal Chameleon Network, TIR images, imaging provides robust
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: Published in IEEE Robotics and Automation Letters (2024)

点击查看摘要

Abstract:Thermal Infrared (TIR) imaging provides robust perception for navigating in challenging outdoor environments but faces issues with poor texture and low image contrast due to its 14/16-bit format. Conventional methods utilize various tone-mapping methods to enhance contrast and photometric consistency of TIR images, however, the choice of tone-mapping is largely dependent on knowing the task and temperature dependent priors to work well. In this paper, we present Thermal Chameleon Network (TCNet), a task-adaptive tone-mapping approach for RAW 14-bit TIR images. Given the same image, TCNet tone-maps different representations of TIR images tailored for each specific task, eliminating the heuristic image rescaling preprocessing and reliance on the extensive prior knowledge of the scene temperature or task-specific characteristics. TCNet exhibits improved generalization performance across object detection and monocular depth estimation, with minimal computational overhead and modular integration to existing architectures for various tasks. Project Page: this https URL

[CV-51] AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models

链接: https://arxiv.org/abs/2410.18325
作者: Kim Sung-Bin,Oh Hyun-Bin,JungMok Lee,Arda Senocak,Joon Son Chung,Tae-Hyun Oh
关键词-EN: Large Language Models, Large Language, significant paradigm shift, success of Large, Language Models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: URL: this https URL

点击查看摘要

Abstract:Following the success of Large Language Models (LLMs), expanding their boundaries to new modalities represents a significant paradigm shift in multimodal understanding. Human perception is inherently multimodal, relying not only on text but also on auditory and visual cues for a complete understanding of the world. In recognition of this fact, audio-visual LLMs have recently emerged. Despite promising developments, the lack of dedicated benchmarks poses challenges for understanding and evaluating models. In this work, we show that audio-visual LLMs struggle to discern subtle relationships between audio and visual signals, leading to hallucinations, underscoring the need for reliable benchmarks. To address this, we introduce AVHBench, the first comprehensive benchmark specifically designed to evaluate the perception and comprehension capabilities of audio-visual LLMs. Our benchmark includes tests for assessing hallucinations, as well as the cross-modal matching and reasoning abilities of these models. Our results reveal that most existing audio-visual LLMs struggle with hallucinations caused by cross-interactions between modalities, due to their limited capacity to perceive complex multimodal signals and their relationships. Additionally, we demonstrate that simple training with our AVHBench improves robustness of audio-visual LLMs against hallucinations.

[CV-52] Calibrating Deep Neural Network using Euclidean Distance

链接: https://arxiv.org/abs/2410.18321
作者: Wenhao Liang,Chang Dong,Liangwei Zheng,Zhengyang Li,Wei Zhang,Weitong Chen
关键词-EN: real-world scenarios, fundamental aspect, aspect of real-world, perfect information, information is rarely
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Uncertainty is a fundamental aspect of real-world scenarios, where perfect information is rarely available. Humans naturally develop complex internal models to navigate incomplete data and effectively respond to unforeseen or partially observed events. In machine learning, Focal Loss is commonly used to reduce misclassification rates by emphasizing hard-to-classify samples. However, it does not guarantee well-calibrated predicted probabilities and may result in models that are overconfident or underconfident. High calibration error indicates a misalignment between predicted probabilities and actual outcomes, affecting model reliability. This research introduces a novel loss function called Focal Calibration Loss (FCL), designed to improve probability calibration while retaining the advantages of Focal Loss in handling difficult samples. By minimizing the Euclidean norm through a strictly proper loss, FCL penalizes the instance-wise calibration error and constrains bounds. We provide theoretical validation for proposed method and apply it to calibrate CheXNet for potential deployment in web-based health-care systems. Extensive evaluations on various models and datasets demonstrate that our method achieves SOTA performance in both calibration and accuracy metrics.

[CV-53] KhmerST: A Low-Resource Khmer Scene Text Detection and Recognition Benchmark ACCV2024

链接: https://arxiv.org/abs/2410.18277
作者: Vannkinh Nom,Souhail Bakkali,Muhammad Muzzamil Luqman,Mickaël Coustaty,Jean-Marc Ogier
关键词-EN: extensive training data, Developing effective scene, Developing effective, training data, costly to obtain
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ACCV 2024

点击查看摘要

Abstract:Developing effective scene text detection and recognition models hinges on extensive training data, which can be both laborious and costly to obtain, especially for low-resourced languages. Conventional methods tailored for Latin characters often falter with non-Latin scripts due to challenges like character stacking, diacritics, and variable character widths without clear word boundaries. In this paper, we introduce the first Khmer scene-text dataset, featuring 1,544 expert-annotated images, including 997 indoor and 547 outdoor scenes. This diverse dataset includes flat text, raised text, poorly illuminated text, distant and partially obscured text. Annotations provide line-level text and polygonal bounding box coordinates for each scene. The benchmark includes baseline models for scene-text detection and recognition tasks, providing a robust starting point for future research endeavors. The KhmerST dataset is publicly accessible at this https URL.

[CV-54] CARLA2Real: a tool for reducing the sim2real gap in CARLA simulator

链接: https://arxiv.org/abs/2410.18238
作者: Stefanos Pasios,Nikos Nikolaidis
关键词-EN: self-driving cars, robots and drones, indispensable for research, autonomous systems, autonomous robots
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 22 pages

点击查看摘要

Abstract:Simulators are indispensable for research in autonomous systems such as self-driving cars, autonomous robots and drones. Despite significant progress in various simulation aspects, such as graphical realism, an evident gap persists between the virtual and real-world environments. Since the ultimate goal is to deploy the autonomous systems in the real world, closing the sim2real gap is of utmost importance. In this paper, we employ a state-ofthe-art approach to enhance the photorealism of simulated data, aligning them with the visual characteristics of real-world datasets. Based on this, we developed CARLA2Real, an easy-to-use, publicly available tool (plug-in) for the widely used and open-source CARLA simulator. This tool enhances the output of CARLA in near realtime, achieving a frame rate of 13 FPS, translating it to the visual style and realism of real-world datasets such as Cityscapes, KITTI, and Mapillary Vistas. By employing the proposed tool, we generated synthetic datasets from both the simulator and the enhancement model outputs, including their corresponding ground truth annotations for tasks related to autonomous driving. Then, we performed a number of experiments to evaluate the impact of the proposed approach on feature extraction and semantic segmentation methods when trained on the enhanced synthetic data. The results demonstrate that the sim2real gap is significant and can indeed be reduced by the introduced approach.

[CV-55] MsMorph: An Unsupervised pyramid learning network for brain image registration

链接: https://arxiv.org/abs/2410.18228
作者: Jiaofen Nan,Gaodeng Fan,Kaifan Zhang,Chen Zhao,Fubao Zhu,Weihua Zhou
关键词-EN: medical image analysis, image pairs, crucial technique, image, image analysis
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 18 pages, 10 figures

点击查看摘要

Abstract:In the field of medical image analysis, image registration is a crucial technique. Despite the numerous registration models that have been proposed, existing methods still fall short in terms of accuracy and interpretability. In this paper, we present MsMorph, a deep learning-based image registration framework aimed at mimicking the manual process of registering image pairs to achieve more similar deformations, where the registered image pairs exhibit consistency or similarity in features. By extracting the feature differences between image pairs across various as-pects using gradients, the framework decodes semantic information at different scales and continuously compen-sates for the predicted deformation field, driving the optimization of parameters to significantly improve registration accuracy. The proposed method simulates the manual approach to registration, focusing on different regions of the image pairs and their neighborhoods to predict the deformation field between the two images, which provides strong interpretability. We compared several existing registration methods on two public brain MRI datasets, including LPBA and Mindboggle. The experimental results show that our method consistently outperforms state of the art in terms of metrics such as Dice score, Hausdorff distance, average symmetric surface distance, and non-Jacobian. The source code is publicly available at this https URL

[CV-56] Automated Defect Detection and Grading of Piarom Dates Using Deep Learning

链接: https://arxiv.org/abs/2410.18208
作者: Nasrin Azimi,Danial Mohammad Rezaei
关键词-EN: present significant challenges, high-value variety cultivated, variety cultivated predominantly, significant challenges due, predominantly in Iran
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Grading and quality control of Piarom dates, a premium and high-value variety cultivated predominantly in Iran, present significant challenges due to the complexity and variability of defects, as well as the absence of specialized automated systems tailored to this fruit. Traditional manual inspection methods are labor intensive, time consuming, and prone to human error, while existing AI-based sorting solutions are insufficient for addressing the nuanced characteristics of Piarom dates. In this study, we propose an innovative deep learning framework designed specifically for the real-time detection, classification, and grading of Piarom dates. Leveraging a custom dataset comprising over 9,900 high-resolution images annotated across 11 distinct defect categories, our framework integrates state-of-the-art object detection algorithms and Convolutional Neural Networks (CNNs) to achieve high precision in defect identification. Furthermore, we employ advanced segmentation techniques to estimate the area and weight of each date, thereby optimizing the grading process according to industry standards. Experimental results demonstrate that our system significantly outperforms existing methods in terms of accuracy and computational efficiency, making it highly suitable for industrial applications requiring real-time processing. This work not only provides a robust and scalable solution for automating quality control in the Piarom date industry but also contributes to the broader field of AI-driven food inspection technologies, with potential applications across various agricultural products.

[CV-57] Rethinking Positive Pairs in Contrastive Learning

链接: https://arxiv.org/abs/2410.18200
作者: Jiantao Wu,Shentong Mo,Zhenhua Feng,Sara Atito,Josef Kitler,Muhammad Awais
关键词-EN: traditionally assumes positive, Contrastive learning, closely related samples, learning, assumes positive pairs
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Contrastive learning, a prominent approach to representation learning, traditionally assumes positive pairs are closely related samples (the same image or class) and negative pairs are distinct samples. We challenge this assumption by proposing to learn from arbitrary pairs, allowing any pair of samples to be positive within our this http URL primary challenge of the proposed approach lies in applying contrastive learning to disparate pairs which are semantically distant. Motivated by the discovery that SimCLR can separate given arbitrary pairs (e.g., garter snake and table lamp) in a subspace, we propose a feature filter in the condition of class pairs that creates the requisite subspaces by gate vectors selectively activating or deactivating dimensions. This filter can be optimized through gradient descent within a conventional contrastive learning mechanism. We present Hydra, a universal contrastive learning framework for visual representations that extends conventional contrastive learning to accommodate arbitrary pairs. Our approach is validated using IN1K, where 1K diverse classes compose 500,500 pairs, most of them being distinct. Surprisingly, Hydra achieves superior performance in this challenging setting. Additional benefits include the prevention of dimensional collapse and the discovery of class relationships. Our work highlights the value of learning common features of arbitrary pairs and potentially broadens the applicability of contrastive learning techniques on the sample pairs with weak relationships. Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) Cite as: arXiv:2410.18200 [cs.CV] (or arXiv:2410.18200v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2410.18200 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-58] Personalized Instance-based Navigation Toward User-Specific Objects in Realistic Environments NEURIPS2024

链接: https://arxiv.org/abs/2410.18195
作者: Luca Barsellotti,Roberto Bigazzi,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara
关键词-EN: grown significantly, research interest, Personalized Instance-based Navigation, indoor environments, large navigation datasets
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: NeurIPS 2024 Datasets and Benchmarks Track. Project page: this https URL

点击查看摘要

Abstract:In the last years, the research interest in visual navigation towards objects in indoor environments has grown significantly. This growth can be attributed to the recent availability of large navigation datasets in photo-realistic simulated environments, like Gibson and Matterport3D. However, the navigation tasks supported by these datasets are often restricted to the objects present in the environment at acquisition time. Also, they fail to account for the realistic scenario in which the target object is a user-specific instance that can be easily confused with similar objects and may be found in multiple locations within the environment. To address these limitations, we propose a new task denominated Personalized Instance-based Navigation (PIN), in which an embodied agent is tasked with locating and reaching a specific personal object by distinguishing it among multiple instances of the same category. The task is accompanied by PInNED, a dedicated new dataset composed of photo-realistic scenes augmented with additional 3D objects. In each episode, the target object is presented to the agent using two modalities: a set of visual reference images on a neutral background and manually annotated textual descriptions. Through comprehensive evaluations and analyses, we showcase the challenges of the PIN task as well as the performance and shortcomings of currently available methods designed for object-driven navigation, considering modular and end-to-end agents.

[CV-59] Advancing Super-Resolution in Neural Radiance Fields via Variational Diffusion Strategies

链接: https://arxiv.org/abs/2410.18137
作者: Shrey Vishen,Jatin Sarabu,Chinmay Bharathulwar,Rithwick Lakshmanan,Vishnu Srinivas
关键词-EN: Variational Score Distilling, Renoised Score Distillation, neural rendering, VSD score facilitates, view-consistent super-resolution
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: All our code is available at this https URL

点击查看摘要

Abstract:We present a novel method for diffusion-guided frameworks for view-consistent super-resolution (SR) in neural rendering. Our approach leverages existing 2D SR models in conjunction with advanced techniques such as Variational Score Distilling (VSD) and a LoRA fine-tuning helper, with spatial training to significantly boost the quality and consistency of upscaled 2D images compared to the previous methods in the literature, such as Renoised Score Distillation (RSD) proposed in DiSR-NeRF (1), or SDS proposed in DreamFusion. The VSD score facilitates precise fine-tuning of SR models, resulting in high-quality, view-consistent images. To address the common challenge of inconsistencies among independent SR 2D images, we integrate Iterative 3D Synchronization (I3DS) from the DiSR-NeRF framework. Our quantitative benchmarks and qualitative results on the LLFF dataset demonstrate the superior performance of our system compared to existing methods such as DiSR-NeRF.

[CV-60] A Deep Learning Approach to Estimate Canopy Height and Uncertainty by Integrating Seasonal Optical SAR and Limited GEDI LiDAR Data over Northern Forests

链接: https://arxiv.org/abs/2410.18108
作者: Jose B. Castro,Cheryl Rogers,Camile Sothe,Dominic Cyr,Alemu Gonsamo
关键词-EN: climate change mitigation, evaluating aboveground biomass, Accurate forest canopy, supporting ecosystem monitoring, ecosystem monitoring services
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate forest canopy height estimation is essential for evaluating aboveground biomass and carbon stock dynamics, supporting ecosystem monitoring services like timber provisioning, climate change mitigation, and biodiversity conservation. However, despite advancements in spaceborne LiDAR technology, data for northern high latitudes remain limited due to orbital and sampling constraints. This study introduces a methodology for generating spatially continuous, high-resolution canopy height and uncertainty estimates using Deep Learning Regression models. We integrate multi-source, multi-seasonal satellite data from Sentinel-1, Landsat, and ALOS-PALSAR-2, with spaceborne GEDI LiDAR as reference data. Our approach was tested in Ontario, Canada, and validated with airborne LiDAR, demonstrating strong performance. The best results were achieved by incorporating seasonal Sentinel-1 and Landsat features alongside PALSAR data, yielding an R-square of 0.72, RMSE of 3.43 m, and bias of 2.44 m. Using seasonal data instead of summer-only data improved variability by 10%, reduced error by 0.45 m, and decreased bias by 1 m. The deep learning model’s weighting strategy notably reduced errors in tall canopy height estimates compared to a recent global model, though it overestimated lower canopy heights. Uncertainty maps highlighted greater uncertainty near forest edges, where GEDI measurements are prone to errors and SAR data may encounter backscatter issues like foreshortening, layover, and shadow. This study enhances canopy height estimation techniques in areas lacking spaceborne LiDAR coverage, providing essential tools for forestry, environmental monitoring, and carbon stock estimation.

[CV-61] Highly efficient non-rigid registration in k-space with application to cardiac Magnetic Resonance Imaging

链接: https://arxiv.org/abs/2410.18834
作者: Aya Ghoul,Kerstin Hammernik,Andreas Lingg,Patrick Krumm,Daniel Rueckert,Sergios Gatidis,Thomas Küstner
关键词-EN: Magnetic Resonance Imaging, Magnetic Resonance, Resonance Imaging, MR-guided radiotherapy, acquisition and reconstruction
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In Magnetic Resonance Imaging (MRI), high temporal-resolved motion can be useful for image acquisition and reconstruction, MR-guided radiotherapy, dynamic contrast-enhancement, flow and perfusion imaging, and functional assessment of motion patterns in cardiovascular, abdominal, peristaltic, fetal, or musculoskeletal imaging. Conventionally, these motion estimates are derived through image-based registration, a particularly challenging task for complex motion patterns and high dynamic resolution. The accelerated scans in such applications result in imaging artifacts that compromise the motion estimation. In this work, we propose a novel self-supervised deep learning-based framework, dubbed the Local-All Pass Attention Network (LAPANet), for non-rigid motion estimation directly from the acquired accelerated Fourier space, i.e. k-space. The proposed approach models non-rigid motion as the cumulative sum of local translational displacements, following the Local All-Pass (LAP) registration technique. LAPANet was evaluated on cardiac motion estimation across various sampling trajectories and acceleration rates. Our results demonstrate superior accuracy compared to prior conventional and deep learning-based registration methods, accommodating as few as 2 lines/frame in a Cartesian trajectory and 3 spokes/frame in a non-Cartesian trajectory. The achieved high temporal resolution (less than 5 ms) for non-rigid motion opens new avenues for motion detection, tracking and correction in dynamic and real-time MRI applications.

[CV-62] Single-Shot Phase Diversity Wavefront Sensing in Deep Turbulence via Metasurface Optics

链接: https://arxiv.org/abs/2410.18789
作者: Arturo Martin Jimenez,Marc Baltes,Jackson Cornelius,Neset Akozbek,Zachary Coppens
关键词-EN: Free-space optical communication, minimal capital costs, Free-space optical, systems offer high-bandwidth, capital costs
类目: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Free-space optical communication (FSOC) systems offer high-bandwidth and secure communication with minimal capital costs. Adaptive optics (AO) are typically added to these systems to decrease atmospheric channel losses; however, the performance of traditional AO wavefront sensors degrades in long-range, deep turbulence conditions. Alternative wavefront sensors using phase diversity can successfully reconstruct wavefronts in deep turbulence, but current implementations require bulky setups with high latency. In this work, we employ a nanostructured birefringent metasurface optic that enables low-latency phase diversity wavefront sensing in a compact form factor. We prove the effectiveness of this approach in mid-to-high turbulence (Rytov numbers from 0.2 to 0.6) through simulation and experimental demonstration. In both cases an average 16-fold increase in signal from the corrected beam is obtained. Our approach opens a pathway for compact, robust wavefront sensing that enhances range and accuracy of FSOC systems.

[CV-63] ransferring Knowledge from High-Quality to Low-Quality MRI for Adult Glioma Diagnosis MICCAI2024

链接: https://arxiv.org/abs/2410.18698
作者: Yanguang Zhao,Long Bai,Zhaoxi Zhang,Yanan Wu,Mobarakol Islam,Hongliang Ren
关键词-EN: Magnetic Resonance Imaging, requires early diagnosis, SSA Adult Glioma, deadly brain tumor, low-quality Magnetic Resonance
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Technical Report, MICCAI 2024 BraTS-SSA Challenge Runner Up

点击查看摘要

Abstract:Glioma, a common and deadly brain tumor, requires early diagnosis for improved prognosis. However, low-quality Magnetic Resonance Imaging (MRI) technology in Sub-Saharan Africa (SSA) hinders accurate diagnosis. This paper presents our work in the BraTS Challenge on SSA Adult Glioma. We adopt the model from the BraTS-GLI 2021 winning solution and utilize it with three training strategies: (1) initially training on the BraTS-GLI 2021 dataset with fine-tuning on the BraTS-Africa dataset, (2) training solely on the BraTS-Africa dataset, and (3) training solely on the BraTS-Africa dataset with 2x super-resolution enhancement. Results show that initial training on the BraTS-GLI 2021 dataset followed by fine-tuning on the BraTS-Africa dataset has yielded the best results. This suggests the importance of high-quality datasets in providing prior knowledge during training. Our top-performing model achieves Dice scores of 0.882, 0.840, and 0.926, and Hausdorff Distance (95%) scores of 15.324, 37.518, and 13.971 for enhancing tumor, tumor core, and whole tumor, respectively, in the validation phase. In the final phase of the competition, our approach successfully secured second place overall, reflecting the strength and effectiveness of our model and training strategies. Our approach provides insights into improving glioma diagnosis in SSA, showing the potential of deep learning in resource-limited settings and the importance of transfer learning from high-quality datasets.

[CV-64] A Joint Representation Using Continuous and Discrete Features for Cardiovascular Diseases Risk Prediction on Chest CT Scans

链接: https://arxiv.org/abs/2410.18610
作者: Minfeng Xu,Chen-Chen Fan,Yan-Jie Zhou,Wenchao Guo,Pan Liu,Jing Qi,Le Lu,Hanqing Chao,Kunlun He
关键词-EN: leading health concern, global mortality rates, Cardiovascular diseases, CVD risk prediction, CVD risk
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 23 pages, 9 figures

点击查看摘要

Abstract:Cardiovascular diseases (CVD) remain a leading health concern and contribute significantly to global mortality rates. While clinical advancements have led to a decline in CVD mortality, accurately identifying individuals who could benefit from preventive interventions remains an unsolved challenge in preventive cardiology. Current CVD risk prediction models, recommended by guidelines, are based on limited traditional risk factors or use CT imaging to acquire quantitative biomarkers, and still have limitations in predictive accuracy and applicability. On the other hand, end-to-end trained CVD risk prediction methods leveraging deep learning on CT images often fail to provide transparent and explainable decision grounds for assisting physicians. In this work, we proposed a novel joint representation that integrates discrete quantitative biomarkers and continuous deep features extracted from chest CT scans. Our approach initiated with a deep CVD risk classification model by capturing comprehensive continuous deep learning features while jointly obtaining currently clinical-established quantitative biomarkers via segmentation models. In the feature joint representation stage, we use an instance-wise feature-gated mechanism to align the continuous and discrete features, followed by a soft instance-wise feature interaction mechanism fostering independent and effective feature interaction for the final CVD risk prediction. Our method substantially improves CVD risk predictive performance and offers individual contribution analysis of each biomarker, which is important in assisting physicians’ decision-making processes. We validated our method on a public chest low-dose CT dataset and a private external chest standard-dose CT patient cohort of 17,207 CT volumes from 6,393 unique subjects, and demonstrated superior predictive performance, achieving AUCs of 0.875 and 0.843, respectively.

[CV-65] Uncertainty-Error correlations in Evidential Deep Learning models for biomedical segmentation

链接: https://arxiv.org/abs/2410.18461
作者: Hai Siong Tan,Kuancheng Wang,Rafe Mcbeth
关键词-EN: Evidential Deep Learning, uncertainty quantification framework, Deep Learning applied, Evidential Deep, Deep Learning
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
*备注: 15 pages

点击查看摘要

Abstract:In this work, we examine the effectiveness of an uncertainty quantification framework known as Evidential Deep Learning applied in the context of biomedical image segmentation. This class of models involves assigning Dirichlet distributions as priors for segmentation labels, and enables a few distinct definitions of model uncertainties. Using the cardiac and prostate MRI images available in the Medical Segmentation Decathlon for validation, we found that Evidential Deep Learning models with U-Net backbones generally yielded superior correlations between prediction errors and uncertainties relative to the conventional baseline equipped with Shannon entropy measure, Monte-Carlo Dropout and Deep Ensemble methods. We also examined these models’ effectiveness in active learning, finding that relative to the standard Shannon entropy-based sampling, they yielded higher point-biserial uncertainty-error correlations while attaining similar performances in Dice-Sorensen coefficients. These superior features of EDL models render them well-suited for segmentation tasks that warrant a critical sensitivity in detecting large model errors.

[CV-66] Predicting total time to compress a video corpus using online inference systems

链接: https://arxiv.org/abs/2410.18260
作者: Xin Shu,Vibhoothi Vibhoothi,Anil Kokaram
关键词-EN: cloud video services, services and VOD, video, important for resource, resource management
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by IEEE International Conference on Visual Communications and Image Processing (VCIP) 2024

点击查看摘要

Abstract:Predicting the computational cost of compressing/transcoding clips in a video corpus is important for resource management of cloud services and VOD (Video On Demand) providers. Currently, customers of cloud video services are unaware of the cost of transcoding their files until the task is completed. Previous work concentrated on predicting perclip compression time, and thus estimating the cost of video compression. In this work, we propose new Machine Learning (ML) systems which predict cost for the entire corpus instead. This is a more appropriate goal since users are not interested in per-clip cost but instead the cost for the whole corpus. In this work, we evaluate our systems with respect to two video codecs (x264, x265) and a novel high-quality video corpus. We find that the accuracy of aggregate time prediction for a video corpus more than two times better than using per-clip predictions. Furthermore, we present an online inference framework in which we update the ML models as files are processed. A consideration of video compute overhead and appropriate choice of ML predictor for each fraction of corpus completed yields a prediction error of less than 5%. This is approximately two times better than previous work which proposed generalised predictors.

[CV-67] Bridging the Diagnostic Divide: Classical Computer Vision and Advanced AI methods for distinguishing ITB and CD through CTE Scans

链接: https://arxiv.org/abs/2410.18161
作者: Shashwat Gupta,L. Gokulnath,Akshan Aggarwal,Mahim Naz,Rajnikanth Yadav,Priyanka Bagade
关键词-EN: Intestinal Tuberculosis, Computed Tomography Enterography, Differentiating between Intestinal, significant clinical challenge, leverages Computed Tomography
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: 9 pages, 3 figures, 3 algorithms

点击查看摘要

Abstract:Differentiating between Intestinal Tuberculosis (ITB) and Crohn’s Disease (CD) poses a significant clinical challenge due to their similar symptoms, clinical presentations, and imaging features. This study leverages Computed Tomography Enterography (CTE) scans, deep learning, and traditional computer vision to address this diagnostic dilemma. A consensus among radiologists from renowned institutions has recognized the visceral-to-subcutaneous fat (VF/SF) ratio as a surrogate biomarker for differentiating between ITB and CD. Previously done manually, we propose a novel 2D image computer vision algorithm for auto-segmenting subcutaneous fat to automate this ratio calculation, enhancing diagnostic efficiency and objectivity. As a benchmark, we compare the results to those obtained using the TotalSegmentator tool, a popular deep learning-based software for automatic segmentation of anatomical structures, and manual calculations by radiologists. We also demonstrated the performance on 3D CT volumes using a slicing method and provided a benchmark comparison of the algorithm with the TotalSegmentator tool. Additionally, we propose a scoring approach to integrate scores from radiological features, such as the fat ratio and pulmonary TB probability, into a single score for diagnosis. We trained a ResNet10 model on a dataset of CTE scans with samples from ITB, CD, and normal patients, achieving an accuracy of 75%. To enhance interpretability and gain clinical trust, we integrated the explainable AI technique Grad-CAM with ResNet10 to explain the model’s predictions. Due to the small dataset size (100 total cases), the feature-based scoring system is considered more reliable and trusted by radiologists compared to the deep learning model for disease diagnosis.

机器学习

[LG-0] On the Crucial Role of Initialization for Matrix Factorization

链接: https://arxiv.org/abs/2410.18965
作者: Bingcong Li,Liang Zhang,Aryan Mokhtari,Niao He
关键词-EN: Scaled Gradient Descent, matrix factorization problem, nonsmooth optimization, Nystrom initialization, classical low-rank matrix
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:This work revisits the classical low-rank matrix factorization problem and unveils the critical role of initialization in shaping convergence rates for such nonconvex and nonsmooth optimization. We introduce Nystrom initialization, which significantly improves the global convergence of Scaled Gradient Descent (ScaledGD) in both symmetric and asymmetric matrix factorization tasks. Specifically, we prove that ScaledGD with Nystrom initialization achieves quadratic convergence in cases where only linear rates were previously known. Furthermore, we extend this initialization to low-rank adapters (LoRA) commonly used for finetuning foundation models. Our approach, NoRA, i.e., LoRA with Nystrom initialization, demonstrates superior performance across various downstream tasks and model scales, from 1B to 7B parameters, in large language and diffusion models.

[LG-1] Learning to Look: Seeking Information for Decision Making via Policy Factorization

链接: https://arxiv.org/abs/2410.18964
作者: Shivin Dass,Jiaheng Hu,Ben Abbatematteo,Peter Stone,Roberto Martín-Martín
关键词-EN: robot manipulation tasks, Markov Decision Processes, Contextual Markov Decision, tasks require active, performed successfully
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Project Website: this https URL

点击查看摘要

Abstract:Many robot manipulation tasks require active or interactive exploration behavior in order to be performed successfully. Such tasks are ubiquitous in embodied domains, where agents must actively search for the information necessary for each stage of a task, e.g., moving the head of the robot to find information relevant to manipulation, or in multi-robot domains, where one scout robot may search for the information that another robot needs to make informed decisions. We identify these tasks with a new type of problem, factorized Contextual Markov Decision Processes, and propose DISaM, a dual-policy solution composed of an information-seeking policy that explores the environment to find the relevant contextual information and an information-receiving policy that exploits the context to achieve the manipulation goal. This factorization allows us to train both policies separately, using the information-receiving one to provide reward to train the information-seeking policy. At test time, the dual agent balances exploration and exploitation based on the uncertainty the manipulation policy has on what the next best action is. We demonstrate the capabilities of our dual policy solution in five manipulation tasks that require information-seeking behaviors, both in simulation and in the real-world, where DISaM significantly outperforms existing methods. More information at this https URL.

[LG-2] Learning Structured Compressed Sensing with Automatic Resource Allocation

链接: https://arxiv.org/abs/2410.18954
作者: Han Wang,Eduardo Pérez,Iris A. M. Huijben,Hans van Gorp,Ruud van Sloun,Florian Römer
关键词-EN: Multidimensional data acquisition, requires extensive time, Multidimensional data, poses significant challenges, storage and processing
类目: Machine Learning (cs.LG)
*备注: Unsupervised Learning, Information Theory, Compressed Sensing, Subsampling

点击查看摘要

Abstract:Multidimensional data acquisition often requires extensive time and poses significant challenges for hardware and software regarding data storage and processing. Rather than designing a single compression matrix as in conventional compressed sensing, structured compressed sensing yields dimension-specific compression matrices, reducing the number of optimizable parameters. Recent advances in machine learning (ML) have enabled task-based supervised learning of subsampling matrices, albeit at the expense of complex downstream models. Additionally, the sampling resource allocation across dimensions is often determined in advance through heuristics. To address these challenges, we introduce Structured COmpressed Sensing with Automatic Resource Allocation (SCOSARA) with an information theory-based unsupervised learning strategy. SCOSARA adaptively distributes samples across sampling dimensions while maximizing Fisher information content. Using ultrasound localization as a case study, we compare SCOSARA to state-of-the-art ML-based and greedy search algorithms. Simulation results demonstrate that SCOSARA can produce high-quality subsampling matrices that achieve lower Cramér-Rao Bound values than the baselines. In addition, SCOSARA outperforms other ML-based algorithms in terms of the number of trainable parameters, computational complexity, and memory requirements while automatically choosing the number of samples per axis.

[LG-3] Adjusted Overfitting Regression

链接: https://arxiv.org/abs/2410.18950
作者: Dylan Wilson
关键词-EN: distance-based regression, adjust overfitting, Abstract, regression, overfitting
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, I will introduce a new form of regression, that can adjust overfitting and underfitting through, “distance-based regression.” Overfitting often results in finding false patterns causing inaccurate results, so by having a new approach that minimizes overfitting, more accurate predictions can be derived. Then I will proceed with a test of my regression form and show additional ways to optimize the regression. Finally, I will apply my new technique to a specific data set to demonstrate its practical value.

[LG-4] LoRANN: Low-Rank Matrix Factorization for Approximate Nearest Neighbor Search NEURIPS2024

链接: https://arxiv.org/abs/2410.18926
作者: Elias Jääsaari,Ville Hyvönen,Teemu Roos
关键词-EN: Approximate nearest neighbor, machine learning pipelines, include retrieval-augmented generation, cases include retrieval-augmented, Approximate nearest
类目: Machine Learning (cs.LG)
*备注: Accepted to NeurIPS 2024

点击查看摘要

Abstract:Approximate nearest neighbor (ANN) search is a key component in many modern machine learning pipelines; recent use cases include retrieval-augmented generation (RAG) and vector databases. Clustering-based ANN algorithms, that use score computation methods based on product quantization (PQ), are often used in industrial-scale applications due to their scalability and suitability for distributed and disk-based implementations. However, they have slower query times than the leading graph-based ANN algorithms. In this work, we propose a new supervised score computation method based on the observation that inner product approximation is a multivariate (multi-output) regression problem that can be solved efficiently by reduced-rank regression. Our experiments show that on modern high-dimensional data sets, the proposed reduced-rank regression (RRR) method is superior to PQ in both query latency and memory usage. We also introduce LoRANN, a clustering-based ANN library that leverages the proposed score computation method. LoRANN is competitive with the leading graph-based algorithms and outperforms the state-of-the-art GPU ANN methods on high-dimensional data sets.

[LG-5] Optimizing Edge Offloading Decisions for Object Detection

链接: https://arxiv.org/abs/2410.18919
作者: Jiaming Qiu,Ruiqi Wang,Brooks Hu,Roch Guerin,Chenyang Lu
关键词-EN: performing real-time object, Recent advances, produced embedded devices, embedded devices capable, embedded devices
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: SEC 2024

点击查看摘要

Abstract:Recent advances in machine learning and hardware have produced embedded devices capable of performing real-time object detection with commendable accuracy. We consider a scenario in which embedded devices rely on an onboard object detector, but have the option to offload detection to a more powerful edge server when local accuracy is deemed too low. Resource constraints, however, limit the number of images that can be offloaded to the edge. Our goal is to identify which images to offload to maximize overall detection accuracy under those constraints. To that end, the paper introduces a reward metric designed to quantify potential accuracy improvements from offloading individual images, and proposes an efficient approach to make offloading decisions by estimating this reward based only on local detection results. The approach is computationally frugal enough to run on embedded devices, and empirical findings indicate that it outperforms existing alternatives in improving detection accuracy even when the fraction of offloaded images is small.

[LG-6] Using Parametric PINNs for Predicting Internal and External Turbulent Flows NEURIPS’2024

链接: https://arxiv.org/abs/2410.18917
作者: Shinjan Ghosh,Amit Chakraborty,Georgia Olympia Brikis,Biswadip Dey
关键词-EN: solvers employing two-equation, Computational fluid dynamics, employing two-equation eddy, two-equation eddy viscosity, Reynolds-averaged Navier-Stokes
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: To be presented at the Data-driven and Differentiable Simulations, Surrogates, and Solvers (D3S3) Workshop at NeurIPS’2024

点击查看摘要

Abstract:Computational fluid dynamics (CFD) solvers employing two-equation eddy viscosity models are the industry standard for simulating turbulent flows using the Reynolds-averaged Navier-Stokes (RANS) formulation. While these methods are computationally less expensive than direct numerical simulations, they can still incur significant computational costs to achieve the desired accuracy. In this context, physics-informed neural networks (PINNs) offer a promising approach for developing parametric surrogate models that leverage both existing, but limited CFD solutions and the governing differential equations to predict simulation outcomes in a computationally efficient, differentiable, and near real-time manner. In this work, we build upon the previously proposed RANS-PINN framework, which only focused on predicting flow over a cylinder. To investigate the efficacy of RANS-PINN as a viable approach to building parametric surrogate models, we investigate its accuracy in predicting relevant turbulent flow variables for both internal and external flows. To ensure training convergence with a more complex loss function, we adopt a novel sampling approach that exploits the domain geometry to ensure a proper balance among the contributions from various regions within the solution domain. The effectiveness of this framework is then demonstrated for two scenarios that represent a broad class of internal and external flow problems.

[LG-7] sting Support Size More Efficiently Than Learning Histograms

链接: https://arxiv.org/abs/2410.18915
作者: Renato Ferreira Pinto Jr.,Nathaniel Harms
关键词-EN: unknown probability distribution, epsilon, unknown probability, samples, log
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: 40 pages

点击查看摘要

Abstract:Consider two problems about an unknown probability distribution p : 1. How many samples from p are required to test if p is supported on n elements or not? Specifically, given samples from p , determine whether it is supported on at most n elements, or it is " \epsilon -far" (in total variation distance) from being supported on n elements. 2. Given m samples from p , what is the largest lower bound on its support size that we can produce? The best known upper bound for problem (1) uses a general algorithm for learning the histogram of the distribution p , which requires \Theta(\tfracn\epsilon^2 \log n) samples. We show that testing can be done more efficiently than learning the histogram, using only O(\tfracn\epsilon \log n \log(1/\epsilon)) samples, nearly matching the best known lower bound of \Omega(\tfracn\epsilon \log n) . This algorithm also provides a better solution to problem (2), producing larger lower bounds on support size than what follows from previous work. The proof relies on an analysis of Chebyshev polynomial approximations outside the range where they are designed to be good approximations, and the paper is intended as an accessible self-contained exposition of the Chebyshev polynomial method. Comments: 40 pages Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG) Cite as: arXiv:2410.18915 [cs.DS] (or arXiv:2410.18915v1 [cs.DS] for this version) https://doi.org/10.48550/arXiv.2410.18915 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Renato Ferreira Pinto Jr. [view email] [v1] Thu, 24 Oct 2024 17:05:34 UTC (58 KB)

[LG-8] ArterialNet: Reconstructing Arterial Blood Pressure Waveform with Wearable Pulsatile Signals a Cohort-Aware Approach

链接: https://arxiv.org/abs/2410.18895
作者: Sicong Huang,Roozbeh Jafari,Bobak J. Mortazavi
关键词-EN: Continuous arterial blood, arterial blood pressure, Continuous arterial, diastolic blood pressure, blood pressure
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Continuous arterial blood pressure (ABP) monitoring is invasive but essential for hemodynamic monitoring. Recent techniques have reconstructed ABP non-invasively using pulsatile signals but produced inaccurate systolic and diastolic blood pressure (SBP and DBP) values and were sensitive to individual variability. ArterialNet integrates generalized pulsatile-to-ABP signal translation and personalized feature extraction using hybrid loss functions and regularization. We validated ArterialNet using the MIMIC-III dataset and achieved a root mean square error (RMSE) of 5.41 mmHg, with at least a 58% lower standard deviation. ArterialNet reconstructed ABP with an RMSE of 7.99 mmHg in remote health scenarios. ArterialNet achieved superior performance in ABP reconstruction and SBP and DBP estimations, with significantly reduced subject variance, demonstrating its potential in remote health settings. We also ablated ArterialNet architecture to investigate the contributions of each component and evaluated its translational impact and robustness by conducting a series of ablations on data quality and availability.

[LG-9] Meta-Learning with Heterogeneous Tasks

链接: https://arxiv.org/abs/2410.18894
作者: Zhaofeng Si,Shu Hu,Kaiyi Ji,Siwei Lyu
关键词-EN: handle few-shot scenarios, equip machine learning, machine learning models, Tasks Robust Meta-learning, heterogeneous tasks
类目: Machine Learning (cs.LG)
*备注: 13 pages, 4 figures

点击查看摘要

Abstract:Meta-learning is a general approach to equip machine learning models with the ability to handle few-shot scenarios when dealing with many tasks. Most existing meta-learning methods work based on the assumption that all tasks are of equal importance. However, real-world applications often present heterogeneous tasks characterized by varying difficulty levels, noise in training samples, or being distinctively different from most other tasks. In this paper, we introduce a novel meta-learning method designed to effectively manage such heterogeneous tasks by employing rank-based task-level learning objectives, Heterogeneous Tasks Robust Meta-learning (HeTRoM). HeTRoM is proficient in handling heterogeneous tasks, and it prevents easy tasks from overwhelming the meta-learner. The approach allows for an efficient iterative optimization algorithm based on bi-level optimization, which is then improved by integrating statistical guidance. Our experimental results demonstrate that our method provides flexibility, enabling users to adapt to diverse task settings and enhancing the meta-learner’s overall performance.

[LG-10] End-to-end Training for Recommendation with Language-based User Profiles

链接: https://arxiv.org/abs/2410.18870
作者: Zhaolin Gao,Joyce Zhou,Yijia Dai,Thorsten Joachims
关键词-EN: online platforms maintain, platforms maintain user, online platforms, platforms maintain, user profiles
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many online platforms maintain user profiles for personalization. Unfortunately, these profiles are typically not interpretable or easily modifiable by the user. To remedy this shortcoming, we explore natural language-based user profiles, as they promise enhanced transparency and scrutability of recommender systems. While existing work has shown that language-based profiles from standard LLMs can be effective, such generalist LLMs are unlikely to be optimal for this task. In this paper, we introduce LangPTune, the first end-to-end learning method for training LLMs to produce language-based user profiles that optimize recommendation effectiveness. Through comprehensive evaluations of LangPTune across various training configurations and benchmarks, we demonstrate that our approach significantly outperforms existing profile-based methods. In addition, it approaches performance levels comparable to state-of-the-art, less transparent recommender systems, providing a robust and interpretable alternative to conventional systems. Finally, we validate the relative interpretability of these language-based user profiles through user studies involving crowdworkers and GPT-4-based evaluations. Implementation of LangPTune can be found at this https URL.

[LG-11] A Riemannian Framework for Learning Reduced-order Lagrangian Dynamics

链接: https://arxiv.org/abs/2410.18868
作者: Katharina Friedl,Noémie Jaquier,Jens Lundell,Tamim Asfour,Danica Kragic
关键词-EN: incorporating physical consistency, neural networks display, display increased generalization, increased generalization capabilities, learning nonlinear dynamic
类目: Machine Learning (cs.LG)
*备注: 28 pages, 16 figures

点击查看摘要

Abstract:By incorporating physical consistency as inductive bias, deep neural networks display increased generalization capabilities and data efficiency in learning nonlinear dynamic models. However, the complexity of these models generally increases with the system dimensionality, requiring larger datasets, more complex deep networks, and significant computational effort. We propose a novel geometric network architecture to learn physically-consistent reduced-order dynamic parameters that accurately describe the original high-dimensional system behavior. This is achieved by building on recent advances in model-order reduction and by adopting a Riemannian perspective to jointly learn a structure-preserving latent space and the associated low-dimensional dynamics. Our approach enables accurate long-term predictions of the high-dimensional dynamics of rigid and deformable systems with increased data efficiency by inferring interpretable and physically plausible reduced Lagrangian models.

[LG-12] FedSPD: A Soft-clustering Approach for Personalized Decentralized Federated Learning

链接: https://arxiv.org/abs/2410.18862
作者: I-Cheng Lin,Osman Yagan,Carlee Joe-Wong
关键词-EN: recently gained popularity, Federated learning, personalized federated learning, recently gained, gained popularity
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated learning has recently gained popularity as a framework for distributed clients to collaboratively train a machine learning model using local data. While traditional federated learning relies on a central server for model aggregation, recent advancements adopt a decentralized framework, enabling direct model exchange between clients and eliminating the single point of failure. However, existing decentralized frameworks often assume all clients train a shared model. Personalizing each client’s model can enhance performance, especially with heterogeneous client data distributions. We propose FedSPD, an efficient personalized federated learning algorithm for the decentralized setting, and show that it learns accurate models even in low-connectivity networks. To provide theoretical guarantees on convergence, we introduce a clustering-based framework that enables consensus on models for distinct data clusters while personalizing to unique mixtures of these clusters at different clients. This flexibility, allowing selective model updates based on data distribution, substantially reduces communication costs compared to prior work on personalized federated learning in decentralized settings. Experimental results on real-world datasets show that FedSPD outperforms multiple decentralized variants of personalized federated learning algorithms, especially in scenarios with low-connectivity networks.

[LG-13] MazeNet: An Accurate Fast and Scalable Deep Learning Solution for Steiner Minimum Trees ICLR2025

链接: https://arxiv.org/abs/2410.18832
作者: Gabriel Díaz Ramos,Toros Arikan,Richard G. Baraniuk
关键词-EN: Steiner Minimum Tree, Rectilinear Steiner Minimum, Avoiding Rectilinear Steiner, Obstacle Avoiding Rectilinear, Minimum Tree
类目: Machine Learning (cs.LG)
*备注: 15 pages, 15 figures. Submitted to ICLR 2025

点击查看摘要

Abstract:The Obstacle Avoiding Rectilinear Steiner Minimum Tree (OARSMT) problem, which seeks the shortest interconnection of a given number of terminals in a rectilinear plane while avoiding obstacles, is a critical task in integrated circuit design, network optimization, and robot path planning. Since OARSMT is NP-hard, exact algorithms scale poorly with the number of terminals, leading practical solvers to sacrifice accuracy for large problems. We propose MazeNet, a deep learning-based method that learns to solve the OARSMT from data. MazeNet reframes OARSMT as a maze-solving task that can be addressed with a recurrent convolutional neural network (RCNN). A key hallmark of MazeNet is its scalability: we only need to train the RCNN blocks on mazes with a small number of terminals; larger mazes can be solved by replicating the same pre-trained blocks to create a larger network. Across a wide range of experiments, MazeNet achieves perfect OARSMT-solving accuracy, significantly reduces runtime compared to classical exact algorithms, and can handle more terminals than state-of-the-art approximate algorithms.

[LG-14] Language-Agnostic Modeling of Source Reliability on Wikipedia

链接: https://arxiv.org/abs/2410.18803
作者: Jacopo D’Ignazi,Andreas Kaltenbrunner,Yelena Mejova,Michele Tizzani,Kyriaki Kalimeri,Mariano Beiró,Pablo Aragón
关键词-EN: combat disinformation, verification through reliable, reliable sources, Climate Change, source reliability
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Over the last few years, content verification through reliable sources has become a fundamental need to combat disinformation. Here, we present a language-agnostic model designed to assess the reliability of sources across multiple language editions of Wikipedia. Utilizing editorial activity data, the model evaluates source reliability within different articles of varying controversiality such as Climate Change, COVID-19, History, Media, and Biology topics. Crafting features that express domain usage across articles, the model effectively predicts source reliability, achieving an F1 Macro score of approximately 0.80 for English and other high-resource languages. For mid-resource languages, we achieve 0.65 while the performance of low-resource languages varies; in all cases, the time the domain remains present in the articles (which we dub as permanence) is one of the most predictive features. We highlight the challenge of maintaining consistent model performance across languages of varying resource levels and demonstrate that adapting models from higher-resource languages can improve performance. This work contributes not only to Wikipedia’s efforts in ensuring content verifiability but in ensuring reliability across diverse user-generated content in various language communities.

[LG-15] PointPatchRL – Masked Reconstruction Improves Reinforcement Learning on Point Clouds

链接: https://arxiv.org/abs/2410.18800
作者: Balázs Gyenes,Nikolai Franke,Philipp Becker,Gerhard Neumann
关键词-EN: Perceiving the environment, crucial for Reinforcement, point clouds, Reinforcement Learning, point
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 18 pages, 15 figures, accepted for publication at the 8th Conference on Robot Learning (CoRL 2024)

点击查看摘要

Abstract:Perceiving the environment via cameras is crucial for Reinforcement Learning (RL) in robotics. While images are a convenient form of representation, they often complicate extracting important geometric details, especially with varying geometries or deformable objects. In contrast, point clouds naturally represent this geometry and easily integrate color and positional data from multiple camera views. However, while deep learning on point clouds has seen many recent successes, RL on point clouds is under-researched, with only the simplest encoder architecture considered in the literature. We introduce PointPatchRL (PPRL), a method for RL on point clouds that builds on the common paradigm of dividing point clouds into overlapping patches, tokenizing them, and processing the tokens with transformers. PPRL provides significant improvements compared with other point-cloud processing architectures previously used for RL. We then complement PPRL with masked reconstruction for representation learning and show that our method outperforms strong model-free and model-based baselines on image observations in complex manipulation tasks containing deformable objects and variations in target object geometry. Videos and code are available at this https URL

[LG-16] Adapting MLOps for Diverse In-Network Intelligence in 6G Era: Challenges and Solutions

链接: https://arxiv.org/abs/2410.18793
作者: Peizheng Li,Ioannis Mavromatis,Tim Farnham,Adnan Aijaz,Aftab Khan
关键词-EN: Seamless integration, artificial intelligence, crucial step, Seamless, operations
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: 7 pages, 5 figures. This paper has been submitted to IEEE for possible publication

点击查看摘要

Abstract:Seamless integration of artificial intelligence (AI) and machine learning (ML) techniques with wireless systems is a crucial step for 6G AInization. However, such integration faces challenges in terms of model functionality and lifecycle management. ML operations (MLOps) offer a systematic approach to tackle these challenges. Existing approaches toward implementing MLOps in a centralized platform often overlook the challenges posed by diverse learning paradigms and network heterogeneity. This article provides a new approach to MLOps targeting the intricacies of future wireless networks. Considering unique aspects of the future radio access network (RAN), we formulate three operational pipelines, namely reinforcement learning operations (RLOps), federated learning operations (FedOps), and generative AI operations (GenOps). These pipelines form the foundation for seamlessly integrating various learning/inference capabilities into networks. We outline the specific challenges and proposed solutions for each operation, facilitating large-scale deployment of AI-Native 6G networks.

[LG-17] Denoising diffusion probabilistic models are optimally adaptive to unknown low dimensionality

链接: https://arxiv.org/abs/2410.18784
作者: Zhihan Huang,Yuting Wei,Yuxin Chen
关键词-EN: mainstream generative model, diffusion probabilistic model, denoising diffusion probabilistic, generative model, probabilistic model
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Numerical Analysis (math.NA); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The denoising diffusion probabilistic model (DDPM) has emerged as a mainstream generative model in generative AI. While sharp convergence guarantees have been established for the DDPM, the iteration complexity is, in general, proportional to the ambient data dimension, resulting in overly conservative theory that fails to explain its practical efficiency. This has motivated the recent work Li and Yan (2024a) to investigate how the DDPM can achieve sampling speed-ups through automatic exploitation of intrinsic low dimensionality of data. We strengthen this prior work by demonstrating, in some sense, optimal adaptivity to unknown low dimensionality. For a broad class of data distributions with intrinsic dimension k , we prove that the iteration complexity of the DDPM scales nearly linearly with k , which is optimal when using KL divergence to measure distributional discrepancy. Our theory is established based on a key observation: the DDPM update rule is equivalent to running a suitably parameterized SDE upon discretization, where the nonlinear component of the drift term is intrinsically low-dimensional.

[LG-18] Attention-based Citywide Electric Vehicle Charging Demand Prediction Approach Considering Urban Region and Dynamic Influences

链接: https://arxiv.org/abs/2410.18766
作者: Haoxuan Kuang,Kunxiang Deng,Linlin You,Jun Li
关键词-EN: green energy development, charging infrastructure planning, facilitating vehicle electrification, Electric vehicle charging, vacant charging pile
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Electric vehicle charging demand prediction is important for vacant charging pile recommendation and charging infrastructure planning, thus facilitating vehicle electrification and green energy development. The performance of previous spatio-temporal studies is still far from satisfactory because the traditional graphs are difficult to model non-pairwise spatial relationships and multivariate temporal features are not adequately taken into account. To tackle these issues, we propose an attention-based heterogeneous multivariate data fusion approach (AHMDF) for citywide electric vehicle charging demand prediction, which incorporates geo-based clustered hypergraph and multivariate gated Transformer to considers both static and dynamic influences. To learn non-pairwise relationships, we cluster service areas by the types and numbers of points of interest in the areas and develop attentive hypergraph networks accordingly. Graph attention mechanisms are used for information propagation between neighboring areas. Additionally, we improve the Transformer encoder utilizing gated mechanisms so that it can selectively learn dynamic auxiliary information and temporal features. Experiments on an electric vehicle charging benchmark dataset demonstrate the effectiveness of our proposed approach compared with a broad range of competing baselines. Furthermore, we demonstrate the impact of dynamic influences on prediction results in different areas of the city and the effectiveness of our clustering method.

[LG-19] Retrieval-Augmented Diffusion Models for Time Series Forecasting NEURIPS2024

链接: https://arxiv.org/abs/2410.18712
作者: Jingwei Liu,Ling Yang,Hongyan Li,Shenda Hong
关键词-EN: time series diffusion, series diffusion models, time series, remains highly unstable, received considerable focus
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2024

点击查看摘要

Abstract:While time series diffusion models have received considerable focus from many recent works, the performance of existing models remains highly unstable. Factors limiting time series diffusion models include insufficient time series datasets and the absence of guidance. To address these limitations, we propose a Retrieval- Augmented Time series Diffusion model (RATD). The framework of RATD consists of two parts: an embedding-based retrieval process and a reference-guided diffusion model. In the first part, RATD retrieves the time series that are most relevant to historical time series from the database as references. The references are utilized to guide the denoising process in the second part. Our approach allows leveraging meaningful samples within the database to aid in sampling, thus maximizing the utilization of datasets. Meanwhile, this reference-guided mechanism also compensates for the deficiencies of existing time series diffusion models in terms of guidance. Experiments and visualizations on multiple datasets demonstrate the effectiveness of our approach, particularly in complicated prediction tasks.

[LG-20] Exploiting Interpretable Capabilities with Concept-Enhanced Diffusion and Prototype Networks NEURIPS2024

链接: https://arxiv.org/abs/2410.18705
作者: Alba Carballo-Castro,Sonia Laguna,Moritz Vandenhirtz,Julia E. Vogt
关键词-EN: increasingly gained importance, gained importance due, Concept-based machine learning, making neural networks, Concept-based machine
类目: Machine Learning (cs.LG)
*备注: Interpretable AI: Past, Present and Future Workshop at NeurIPS 2024

点击查看摘要

Abstract:Concept-based machine learning methods have increasingly gained importance due to the growing interest in making neural networks interpretable. However, concept annotations are generally challenging to obtain, making it crucial to leverage all their prior knowledge. By creating concept-enriched models that incorporate concept information into existing architectures, we exploit their interpretable capabilities to the fullest extent. In particular, we propose Concept-Guided Conditional Diffusion, which can generate visual representations of concepts, and Concept-Guided Prototype Networks, which can create a concept prototype dataset and leverage it to perform interpretable concept prediction. These results open up new lines of research by exploiting pre-existing information in the quest for rendering machine learning more human-understandable.

[LG-21] BATON: Enhancing Batch-wise Inference Efficiency for Large Language Models via Dynamic Re-batching

链接: https://arxiv.org/abs/2410.18701
作者: Peizhuang Cong,Qizhi Chen,Haochen Zhao,Tong Yang
关键词-EN: Large Language Models, Large Language, interactive web services, Language Models, capabilities of Large
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The advanced capabilities of Large Language Models (LLMs) have inspired the development of various interactive web services or applications, such as ChatGPT, which offer query inference services for users. Unlike traditional DNN model, the inference of LLM entails different iterations of forward computation for different queries, which result in efficiency challenges for existing run-to-completion batch-wise inference. Hence, some methods refine batch-wise inference to iteration-level by duplicating all nonlinear layers of LLM. However, this approach not only increases resource usage but also introduces idle computations to the batch due to the prefilling of newly added queries. Therefore, we propose BATON, an efficient batch-wise LLM inference scheme by dynamically adjusting processing batch, which can achieve near-zero idle computations without incurring additional resource consumption. To do so, BATON 1) shapes the vectors involved in the inference of the newly inserted query and processing batch to align dimensions and generates a new attention mask based on vector shaping to ensure inference correctness, which enables query inserting without consuming additional resource; 2) embeds prefilled Keys and Values of the new query into the KV_Cache of the processing batch by leveraging the prefilling and decoding separation mechanism, eliminating idle computations to the batch introduced by the prefilling process of the new query. Experimental results show that compared to the state-of-the-art solution Orca, BATON improves query processing by up to 1.75 times.

[LG-22] Hierarchical Multimodal LLM s with Semantic Space Alignment for Enhanced Time Series Classification

链接: https://arxiv.org/abs/2410.18686
作者: Xiaoyu Tao,Tingyue Pan,Mingyue Cheng,Yucong Luo
关键词-EN: Leveraging large language, garnered increasing attention, Leveraging large, time series, time series data
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Leveraging large language models (LLMs) has garnered increasing attention and introduced novel perspectives in time series classification. However, existing approaches often overlook the crucial dynamic temporal information inherent in time series data and face challenges in aligning this data with textual semantics. To address these limitations, we propose HiTime, a hierarchical multi-modal model that seamlessly integrates temporal information into LLMs for multivariate time series classification (MTSC). Our model employs a hierarchical feature encoder to capture diverse aspects of time series data through both data-specific and task-specific embeddings. To facilitate semantic space alignment between time series and text, we introduce a dual-view contrastive alignment module that bridges the gap between modalities. Additionally, we adopt a hybrid prompting strategy to fine-tune the pre-trained LLM in a parameter-efficient manner. By effectively incorporating dynamic temporal features and ensuring semantic alignment, HiTime enables LLMs to process continuous time series data and achieves state-of-the-art classification performance through text generation. Extensive experiments on benchmark datasets demonstrate that HiTime significantly enhances time series classification accuracy compared to most competitive baseline methods. Our findings highlight the potential of integrating temporal features into LLMs, paving the way for advanced time series analysis. The code is publicly available for further research and validation. Our codes are publicly available1.

[LG-23] Homomorphism Counts as Structural Encodings for Graph Learning

链接: https://arxiv.org/abs/2410.18676
作者: Linus Bao,Emily Jin,Michael Bronstein,İsmail İlkan Ceylan,Matthias Lanzinger
关键词-EN: well-known Transformer architecture, Graph Transformers, Transformer architecture, Transformers are popular, well-known Transformer
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Transformers are popular neural networks that extend the well-known Transformer architecture to the graph domain. These architectures operate by applying self-attention on graph nodes and incorporating graph structure through the use of positional encodings (e.g., Laplacian positional encoding) or structural encodings (e.g., random-walk structural encoding). The quality of such encodings is critical, since they provide the necessary \textitgraph inductive biases to condition the model on graph structure. In this work, we propose \textitmotif structural encoding (MoSE) as a flexible and powerful structural encoding framework based on counting graph homomorphisms. Theoretically, we compare the expressive power of MoSE to random-walk structural encoding and relate both encodings to the expressive power of standard message passing neural networks. Empirically, we observe that MoSE outperforms other well-known positional and structural encodings across a range of architectures, and it achieves state-of-the-art performance on widely studied molecular property prediction datasets.

[LG-24] NIDS Neural Networks Using Sliding Time Window Data Processing with Trainable Activations and its Generalization Capability

链接: https://arxiv.org/abs/2410.18658
作者: Anton Raskovalov,Nikita Gabdullin,Ilya Androsov
关键词-EN: intrusion detection systems, flow data preprocessed, network intrusion detection, paper presents neural, presents neural networks
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 18 pages, 3 figures, 9 tables

点击查看摘要

Abstract:This paper presents neural networks for network intrusion detection systems (NIDS), that operate on flow data preprocessed with a time window. It requires only eleven features which do not rely on deep packet inspection and can be found in most NIDS datasets and easily obtained from conventional flow collectors. The time window aggregates information with respect to hosts facilitating the identification of flow signatures that are missed by other aggregation methods. Several network architectures are studied and the use of Kalmogorov-Arnold Network (KAN)-inspired trainable activation functions that help to achieve higher accuracy with simpler network structure is proposed. The reported training accuracy exceeds 99% for the proposed method with as little as twenty neural network input features. This work also studies the generalization capability of NIDS, a crucial aspect that has not been adequately addressed in the previous studies. The generalization experiments are conducted using CICIDS2017 dataset and a custom dataset collected as part of this study. It is shown that the performance metrics decline significantly when changing datasets, and the reduction in performance metrics can be attributed to the difference in signatures of the same type flows in different datasets, which in turn can be attributed to the differences between the underlying networks. It is shown that the generalization accuracy of some neural networks can be very unstable and sensitive to random initialization parameters, and neural networks with fewer parameters and well-tuned activations are more stable and achieve higher accuracy.

[LG-25] Learning dissipative Hamiltonian dynamics with reproducing kernel Hilbert spaces and random Fourier features

链接: https://arxiv.org/abs/2410.18656
作者: Torbjørn Smith,Olav Egeland
关键词-EN: dissipative Hamiltonian dynamics, learning dissipative Hamiltonian, noisy dataset, paper presents, limited and noisy
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:This paper presents a new method for learning dissipative Hamiltonian dynamics from a limited and noisy dataset. The method uses the Helmholtz decomposition to learn a vector field as the sum of a symplectic and a dissipative vector field. The two vector fields are learned using two reproducing kernel Hilbert spaces, defined by a symplectic and a curl-free kernel, where the kernels are specialized to enforce odd symmetry. Random Fourier features are used to approximate the kernels to reduce the dimension of the optimization problem. The performance of the method is validated in simulations for two dissipative Hamiltonian systems, and it is shown that the method improves predictive accuracy significantly compared to a method where a Gaussian separable kernel is used.

[LG-26] Understanding Players as if They Are Talking to the Game in a Customized Language: A Pilot Study EMNLP2024

链接: https://arxiv.org/abs/2410.18605
作者: Tianze Wang,Maryam Honari-Jahromi,Styliani Katsarou,Olga Mikheeva,Theodoros Panagiotakopoulos,Oleg Smirnov,Lele Cao,Sahar Asadi
关键词-EN: pilot study explores, customized natural language, pilot study, study explores, explores the application
类目: Machine Learning (cs.LG)
*备注: published in Workshop on Customizable NLP at EMNLP 2024

点击查看摘要

Abstract:This pilot study explores the application of language models (LMs) to model game event sequences, treating them as a customized natural language. We investigate a popular mobile game, transforming raw event data into textual sequences and pretraining a Longformer model on this data. Our approach captures the rich and nuanced interactions within game sessions, effectively identifying meaningful player segments. The results demonstrate the potential of self-supervised LMs in enhancing game design and personalization without relying on ground-truth labels.

[LG-27] Differential Informed Auto-Encoder

链接: https://arxiv.org/abs/2410.18593
作者: Jinrui Zhang
关键词-EN: original data, original data domain, encoder was trained, differential equations, original
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this article, an encoder was trained to obtain the inner structure of the original data by obtain a differential equations. A decoder was trained to resample the original data domain, to generate new data that obey the differential structure of the original data using the physics-informed neural network.

[LG-28] Knowledge Distillation Using Frontier Open-source LLM s: Generalizability and the Role of Synthetic Data

链接: https://arxiv.org/abs/2410.18588
作者: Anup Shirgaonkar,Nikhil Pandey,Nazmiye Ceren Abay,Tolga Aktas,Vijay Aski
关键词-EN: Leading open-source large, natural language understanding, open-source large language, Leading open-source, language understanding tasks
类目: Machine Learning (cs.LG)
*备注: 25 pages, 4 figures

点击查看摘要

Abstract:Leading open-source large language models (LLMs) such as Llama-3.1-Instruct-405B are extremely capable at generating text, answering questions, and solving a variety of natural language understanding tasks. However, they incur higher inference cost and latency compared to smaller LLMs. Knowledge distillation provides a way to use outputs from these large, capable teacher models to train smaller student models which can be used for inference at lower cost and latency, while retaining comparable accuracy. We investigate the efficacy of distillation using the Llama-3.1-405B-Instruct teacher and the smaller Llama-3.1-8B-Instruct and Llama-3.1-70B-Instruct student models. Contributions of this work include (a) We evaluate the generalizability of distillation with the above Llama-3.1 teacher-student pairs across different tasks and datasets (b) We show that using synthetic data during distillation significantly improves the accuracy of 8B and 70B models, and when used with reasoning chains, even matches or surpasses the zero-shot accuracy of 405B model on some datasets © We empirically show that distillation enables 8B and 70B models to internalize 405B’s reasoning ability by using only standard fine-tuning (without customizing any loss function). This allows cost and latency-efficient student model inference. (d) We show pitfalls in evaluation of distillation, and present task-specific evaluation, including both human and LLM-grading, and ground-truth based traditional accuracy benchmarks. This methodical study brings out the fundamental importance of synthetic data quality in knowledge distillation, and of combining multiple, task-specific ways of accuracy and quality evaluation in assessing the effectiveness of distillation.

[LG-29] Benchmarking Graph Learning for Drug-Drug Interaction Prediction

链接: https://arxiv.org/abs/2410.18583
作者: Zhenqian Shen,Mingyang Zhou,Yongqi Zhang,Quanming Yao
关键词-EN: Predicting drug-drug interaction, identifying potential adverse, beneficial combination therapies, potential adverse interactions, Predicting drug-drug
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Predicting drug-drug interaction (DDI) plays an important role in pharmacology and healthcare for identifying potential adverse interactions and beneficial combination therapies between drug pairs. Recently, a flurry of graph learning methods have been introduced to predict drug-drug interactions. However, evaluating existing methods has several limitations, such as the absence of a unified comparison framework for DDI prediction methods, lack of assessments in meaningful real-world scenarios, and insufficient exploration of side information usage. In order to address these unresolved limitations in the literature, we propose a DDI prediction benchmark on graph learning. We first conduct unified evaluation comparison among existing methods. To meet realistic scenarios, we further evaluate the performance of different methods in settings with new drugs involved and examine the performance across different DDI types. Component analysis is conducted on the biomedical network to better utilize side information. Through this work, we hope to provide more insights for the problem of DDI prediction. Our implementation and data is open-sourced at this https URL.

[LG-30] Probing Ranking LLM s: Mechanistic Interpretability in Information Retrieval

链接: https://arxiv.org/abs/2410.18527
作者: Tanya Chowdhury,James Allan
关键词-EN: par with GPT, feature extraction capabilities, powerful feature extraction, extraction capabilities, GPT models
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 9 pages

点击查看摘要

Abstract:Transformer networks, especially those with performance on par with GPT models, are renowned for their powerful feature extraction capabilities. However, the nature and correlation of these features with human-engineered ones remain unclear. In this study, we delve into the mechanistic workings of state-of-the-art, fine-tuning-based passage-reranking transformer networks. Our approach involves a probing-based, layer-by-layer analysis of neurons within ranking LLMs to identify individual or groups of known human-engineered and semantic features within the network’s activations. We explore a wide range of features, including lexical, document structure, query-document interaction, advanced semantic, interaction-based, and LLM-specific features, to gain a deeper understanding of the underlying mechanisms that drive ranking decisions in LLMs. Our results reveal a set of features that are prominently represented in LLM activations, as well as others that are notably absent. Additionally, we observe distinct behaviors of LLMs when processing low versus high relevance queries and when encountering out-of-distribution query and document sets. By examining these features within activations, we aim to enhance the interpretability and performance of LLMs in ranking tasks. Our findings provide valuable insights for the development of more effective and transparent ranking models, with significant implications for the broader information retrieval community. All scripts and code necessary to replicate our findings are made available. Comments: 9 pages Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG) Cite as: arXiv:2410.18527 [cs.IR] (or arXiv:2410.18527v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2410.18527 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-31] Assured Automatic Programming via Large Language Models

链接: https://arxiv.org/abs/2410.18494
作者: Martin Mirchev,Andreea Costea,Abhishek Kr Singh,Abhik Roychoudhury
关键词-EN: natural language requirements, AI-based coding engines, natural language, convert natural language, intent
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注:

点击查看摘要

Abstract:With the advent of AI-based coding engines, it is possible to convert natural language requirements to executable code in standard programming languages. However, AI-generated code can be unreliable, and the natural language requirements driving this code may be ambiguous. In other words, the intent may not be accurately captured in the code generated from AI-coding engines like Copilot. The goal of our work is to discover the programmer intent, while generating code which conforms to the intent and a proof of this conformance. Our approach to intent discovery is powered by a novel repair engine called program-proof co-evolution, where the object of repair is a tuple (code, logical specification, test) generated by an LLM from the same natural language description. The program and the specification capture the initial operational and declarative description of intent, while the test represents a concrete, albeit partial, understanding of the intent. Our objective is to achieve consistency between the program, the specification, and the test by incrementally refining our understanding of the user intent. Reaching consistency through this repair process provides us with a formal, logical description of the intent, which is then translated back into natural language for the developer’s inspection. The resultant intent description is now unambiguous, though expressed in natural language. We demonstrate how the unambiguous intent discovered through our approach increases the percentage of verifiable auto-generated programs on a recently proposed dataset in the Dafny programming language.

[LG-32] Graph Pre-Training Models Are Strong Anomaly Detectors

链接: https://arxiv.org/abs/2410.18487
作者: Jiashun Cheng,Zinan Zheng,Yang Liu,Jianheng Tang,Hongwei Wang,Yu Rong,Jia Li,Fugee Tsung
关键词-EN: Graph Neural Networks, Neural Networks, shown promising results, recently shown promising, Graph Neural
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Anomaly Detection (GAD) is a challenging and practical research topic where Graph Neural Networks (GNNs) have recently shown promising results. The effectiveness of existing GNNs in GAD has been mainly attributed to the simultaneous learning of node representations and the classifier in an end-to-end manner. Meanwhile, graph pre-training, the two-stage learning paradigm such as DGI and GraphMAE, has shown potential in leveraging unlabeled graph data to enhance downstream tasks, yet its impact on GAD remains under-explored. In this work, we show that graph pre-training models are strong graph anomaly detectors. Specifically, we demonstrate that pre-training is highly competitive, markedly outperforming the state-of-the-art end-to-end training models when faced with limited supervision. To understand this phenomenon, we further uncover pre-training enhances the detection of distant, under-represented, unlabeled anomalies that go beyond 2-hop neighborhoods of known anomalies, shedding light on its superior performance against end-to-end models. Moreover, we extend our examination to the potential of pre-training in graph-level anomaly detection. We envision this work to stimulate a re-evaluation of pre-training’s role in GAD and offer valuable insights for future research.

[LG-33] Classifier Clustering and Feature Alignment for Federated Learning under Distributed Concept Drift

链接: https://arxiv.org/abs/2410.18478
作者: Junbao Chen,Jingfeng Xue,Yong Wang,Zhenyan Liu,Lu Huang
关键词-EN: Data heterogeneity, distributed concept drift, concept drift, distributed concept, tackling this problem
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Data heterogeneity is one of the key challenges in federated learning, and many efforts have been devoted to tackling this problem. However, distributed concept drift with data heterogeneity, where clients may additionally experience different concept drifts, is a largely unexplored area. In this work, we focus on real drift, where the conditional distribution P(Y|X) changes. We first study how distributed concept drift affects the model training and find that local classifier plays a critical role in drift adaptation. Moreover, to address data heterogeneity, we study the feature alignment under distributed concept drift, and find two factors that are crucial for feature alignment: the conditional distribution P(Y|X) and the degree of data heterogeneity. Motivated by the above findings, we propose FedCCFA, a federated learning framework with classifier clustering and feature alignment. To enhance collaboration under distributed concept drift, FedCCFA clusters local classifiers at class-level and generates clustered feature anchors according to the clustering results. Assisted by these anchors, FedCCFA adaptively aligns clients’ feature spaces based on the entropy of label distribution P(Y) , alleviating the inconsistency in feature space. Our results demonstrate that FedCCFA significantly outperforms existing methods under various concept drift settings. Code is available at this https URL.

[LG-34] What If the Input is Expanded in OOD Detection? NEURIPS2024

链接: https://arxiv.org/abs/2410.18472
作者: Boxuan Zhang,Jianing Zhu,Zengmao Wang,Tongliang Liu,Bo Du,Bo Han
关键词-EN: machine learning models, identify OOD inputs, unknown classes, open world, aims to identify
类目: Machine Learning (cs.LG)
*备注: accepted by NeurIPS 2024

点击查看摘要

Abstract:Out-of-distribution (OOD) detection aims to identify OOD inputs from unknown classes, which is important for the reliable deployment of machine learning models in the open world. Various scoring functions are proposed to distinguish it from in-distribution (ID) data. However, existing methods generally focus on excavating the discriminative information from a single input, which implicitly limits its representation dimension. In this work, we introduce a novel perspective, i.e., employing different common corruptions on the input space, to expand that. We reveal an interesting phenomenon termed confidence mutation, where the confidence of OOD data can decrease significantly under the corruptions, while the ID data shows a higher confidence expectation considering the resistance of semantic features. Based on that, we formalize a new scoring method, namely, Confidence aVerage (CoVer), which can capture the dynamic differences by simply averaging the scores obtained from different corrupted inputs and the original ones, making the OOD and ID distributions more separable in detection tasks. Extensive experiments and analyses have been conducted to understand and verify the effectiveness of CoVer. The code is publicly available at: this https URL.

[LG-35] Doubly Non-Central Beta Matrix Factorization for Stable Dimensionality Reduction of Bounded Support Matrix Data

链接: https://arxiv.org/abs/2410.18425
作者: Anjali N. Albert,Patrick Flaherty,Aaron Schein
关键词-EN: bounded support, problem of developing, developing interpretable, entries have bounded, computationally efficient matrix
类目: Machine Learning (cs.LG)
*备注: 33 pages, 18 figures

点击查看摘要

Abstract:We consider the problem of developing interpretable and computationally efficient matrix decomposition methods for matrices whose entries have bounded support. Such matrices are found in large-scale DNA methylation studies and many other settings. Our approach decomposes the data matrix into a Tucker representation wherein the number of columns in the constituent factor matrices is not constrained. We derive a computationally efficient sampling algorithm to solve for the Tucker decomposition. We evaluate the performance of our method using three criteria: predictability, computability, and stability. Empirical results show that our method has similar performance as other state-of-the-art approaches in terms of held-out prediction and computational complexity, but has significantly better performance in terms of stability to changes in hyper-parameters. The improved stability results in higher confidence in the results in applications where the constituent factors are used to generate and test scientific hypotheses such as DNA methylation analysis of cancer samples.

[LG-36] A Causal Graph-Enhanced Gaussian Process Regression for Modeling Engine-out NOx

链接: https://arxiv.org/abs/2410.18424
作者: Shrenik Zinage,Ilias Bilionis,Peter Meckl
关键词-EN: stringent regulatory requirements, diesel compression ignition, compression ignition engines, ignition engines require, engines require accurate
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The stringent regulatory requirements on nitrogen oxides (NOx) emissions from diesel compression ignition engines require accurate and reliable models for real-time monitoring and diagnostics. Although traditional methods such as physical sensors and virtual engine control module (ECM) sensors provide essential data, they are only used for estimation. Ubiquitous literature primarily focuses on deterministic models with little emphasis on capturing the uncertainties due to sensors. The lack of probabilistic frameworks restricts the applicability of these models for robust diagnostics. The objective of this paper is to develop and validate a probabilistic model to predict engine-out NOx emissions using Gaussian process regression. Our approach is as follows. We employ three variants of Gaussian process models: the first with a standard radial basis function kernel with input window, the second incorporating a deep kernel using convolutional neural networks to capture temporal dependencies, and the third enriching the deep kernel with a causal graph derived via graph convolutional networks. The causal graph embeds physics knowledge into the learning process. All models are compared against a virtual ECM sensor using both quantitative and qualitative metrics. We conclude that our model provides an improvement in predictive performance when using an input window and a deep kernel structure. Even more compelling is the further enhancement achieved by the incorporation of a causal graph into the deep kernel. These findings are corroborated across different validation datasets.

[LG-37] SkiLD: Unsupervised Skill Discovery Guided by Factor Interactions

链接: https://arxiv.org/abs/2410.18416
作者: Zizhao Wang,Jiaheng Hu,Caleb Chuck,Stephen Chen,Roberto Martín-Martín,Amy Zhang,Scott Niekum,Peter Stone
关键词-EN: Unsupervised skill discovery, skill discovery carries, skill discovery, learn reusable skills, Unsupervised skill
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Unsupervised skill discovery carries the promise that an intelligent agent can learn reusable skills through autonomous, reward-free environment interaction. Existing unsupervised skill discovery methods learn skills by encouraging distinguishable behaviors that cover diverse states. However, in complex environments with many state factors (e.g., household environments with many objects), learning skills that cover all possible states is impossible, and naively encouraging state diversity often leads to simple skills that are not ideal for solving downstream tasks. This work introduces Skill Discovery from Local Dependencies (Skild), which leverages state factorization as a natural inductive bias to guide the skill learning process. The key intuition guiding Skild is that skills that induce bdiverse interactions/b between state factors are often more valuable for solving downstream tasks. To this end, Skild develops a novel skill learning objective that explicitly encourages the mastering of skills that effectively induce different interactions within an environment. We evaluate Skild in several domains with challenging, long-horizon sparse reward tasks including a realistic simulated household robot domain, where Skild successfully learns skills with clear semantic meaning and shows superior performance compared to existing unsupervised reinforcement learning methods that only maximize state coverage.

[LG-38] Enhancing Feature-Specific Data Protection via Bayesian Coordinate Differential Privacy

链接: https://arxiv.org/abs/2410.18404
作者: Maryam Aliakbarpour,Syomantak Chaudhuri,Thomas A. Courtade,Alireza Fallah,Michael I. Jordan
关键词-EN: Local Differential Privacy, trust external parties, Local Differential, Bayesian Coordinate Differential, offers strong privacy
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Local Differential Privacy (LDP) offers strong privacy guarantees without requiring users to trust external parties. However, LDP applies uniform protection to all data features, including less sensitive ones, which degrades performance of downstream tasks. To overcome this limitation, we propose a Bayesian framework, Bayesian Coordinate Differential Privacy (BCDP), that enables feature-specific privacy quantification. This more nuanced approach complements LDP by adjusting privacy protection according to the sensitivity of each feature, enabling improved performance of downstream tasks without compromising privacy. We characterize the properties of BCDP and articulate its connections with standard non-Bayesian privacy frameworks. We further apply our BCDP framework to the problems of private mean estimation and ordinary least-squares regression. The BCDP-based approach obtains improved accuracy compared to a purely LDP-based approach, without compromising on privacy.

[LG-39] Low-Rank Tensor Learning by Generalized Nonconvex Regularization

链接: https://arxiv.org/abs/2410.18402
作者: Sijia Xia,Michael K. Ng,Xiongjun Zhang
关键词-EN: low-rank tensor learning, underlying tensor, tensor, tensor learning, low-rank tensor
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we study the problem of low-rank tensor learning, where only a few of training samples are observed and the underlying tensor has a low-rank structure. The existing methods are based on the sum of nuclear norms of unfolding matrices of a tensor, which may be suboptimal. In order to explore the low-rankness of the underlying tensor effectively, we propose a nonconvex model based on transformed tensor nuclear norm for low-rank tensor learning. Specifically, a family of nonconvex functions are employed onto the singular values of all frontal slices of a tensor in the transformed domain to characterize the low-rankness of the underlying tensor. An error bound between the stationary point of the nonconvex model and the underlying tensor is established under restricted strong convexity on the loss function (such as least squares loss and logistic regression) and suitable regularity conditions on the nonconvex penalty function. By reformulating the nonconvex function into the difference of two convex functions, a proximal majorization-minimization (PMM) algorithm is designed to solve the resulting model. Then the global convergence and convergence rate of PMM are established under very mild conditions. Numerical experiments are conducted on tensor completion and binary classification to demonstrate the effectiveness of the proposed method over other state-of-the-art methods.

[LG-40] Revisiting Differentiable Structure Learning: Inconsistency of ell_1 Penalty and Beyond

链接: https://arxiv.org/abs/2410.18396
作者: Kaifeng Jin,Ignavier Ng,Kun Zhang,Biwei Huang
关键词-EN: differentiable structure learning, Recent advances, learning directed acyclic, directed acyclic graphs, differentiable structure
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Recent advances in differentiable structure learning have framed the combinatorial problem of learning directed acyclic graphs as a continuous optimization problem. Various aspects, including data standardization, have been studied to identify factors that influence the empirical performance of these methods. In this work, we investigate critical limitations in differentiable structure learning methods, focusing on settings where the true structure can be identified up to Markov equivalence classes, particularly in the linear Gaussian case. While Ng et al. (2024) highlighted potential non-convexity issues in this setting, we demonstrate and explain why the use of \ell_1 -penalized likelihood in such cases is fundamentally inconsistent, even if the global optimum of the optimization problem can be found. To resolve this limitation, we develop a hybrid differentiable structure learning method based on \ell_0 -penalized likelihood with hard acyclicity constraint, where the \ell_0 penalty can be approximated by different techniques including Gumbel-Softmax. Specifically, we first estimate the underlying moral graph, and use it to restrict the search space of the optimization problem, which helps alleviate the non-convexity issue. Experimental results show that the proposed method enhances empirical performance both before and after data standardization, providing a more reliable path for future advancements in differentiable structure learning, especially for learning Markov equivalence classes.

[LG-41] A contrastive-learning approach for auditory attention detection

链接: https://arxiv.org/abs/2410.18395
作者: Seyed Ali Alavi Bajestan,Mark Pitt,Donald S. Williamson
关键词-EN: single sound source, Carrying conversations, sounds overlap, single sound, conversations in multi-sound
类目: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Carrying conversations in multi-sound environments is one of the more challenging tasks, since the sounds overlap across time and frequency making it difficult to understand a single sound source. One proposed approach to help isolate an attended speech source is through decoding the electroencephalogram (EEG) and identifying the attended audio source using statistical or machine learning techniques. However, the limited amount of data in comparison to other machine learning problems and the distributional shift between different EEG recordings emphasizes the need for a self supervised approach that works with limited data to achieve a more robust solution. In this paper, we propose a method based on self supervised learning to minimize the difference between the latent representations of an attended speech signal and the corresponding EEG signal. This network is further finetuned for the auditory attention classification task. We compare our results with previously published methods and achieve state-of-the-art performance on the validation set.

[LG-42] Faster Algorithms for User-Level Private Stochastic Convex Optimization NEURIPS2024

链接: https://arxiv.org/abs/2410.18391
作者: Andrew Lowy,Daogao Liu,Hilal Asi
关键词-EN: stochastic convex optimization, study private stochastic, private stochastic convex, convex optimization, user-level differential privacy
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Optimization and Control (math.OC)
*备注: NeurIPS 2024

点击查看摘要

Abstract:We study private stochastic convex optimization (SCO) under user-level differential privacy (DP) constraints. In this setting, there are n users (e.g., cell phones), each possessing m data items (e.g., text messages), and we need to protect the privacy of each user’s entire collection of data items. Existing algorithms for user-level DP SCO are impractical in many large-scale machine learning scenarios because: (i) they make restrictive assumptions on the smoothness parameter of the loss function and require the number of users to grow polynomially with the dimension of the parameter space; or (ii) they are prohibitively slow, requiring at least (mn)^3/2 gradient computations for smooth losses and (mn)^3 computations for non-smooth losses. To address these limitations, we provide novel user-level DP algorithms with state-of-the-art excess risk and runtime guarantees, without stringent assumptions. First, we develop a linear-time algorithm with state-of-the-art excess risk (for a non-trivial linear-time algorithm) under a mild smoothness assumption. Our second algorithm applies to arbitrary smooth losses and achieves optimal excess risk in \approx (mn)^9/8 gradient computations. Third, for non-smooth loss functions, we obtain optimal excess risk in n^11/8 m^5/4 gradient computations. Moreover, our algorithms do not require the number of users to grow polynomially with the dimension.

[LG-43] Harnessing PU Learning for Enhanced Cloud-based DDoS Detection: A Comparative Analysis

链接: https://arxiv.org/abs/2410.18380
作者: Robert Dilworth,Charan Gudla
关键词-EN: enhanced Distributed, Support Vector Machine, Random Forest, application of Positive-Unlabeled, Random Forest achieving
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 9 pages, 5 figures

点击查看摘要

Abstract:This paper explores the application of Positive-Unlabeled (PU) learning for enhanced Distributed Denial-of-Service (DDoS) detection in cloud environments. Utilizing the \textttBCCC-cPacket-Cloud-DDoS-2024 dataset, we implement PU learning with four machine learning algorithms: XGBoost, Random Forest, Support Vector Machine, and Naïve Bayes. Our results demonstrate the superior performance of ensemble methods, with XGBoost and Random Forest achieving F_1 scores exceeding 98%. We quantify the efficacy of each approach using metrics including F_1 score, ROC AUC, Recall, and Precision. This study bridges the gap between PU learning and cloud-based anomaly detection, providing a foundation for addressing Context-Aware DDoS Detection in multi-cloud environments. Our findings highlight the potential of PU learning in scenarios with limited labeled data, offering valuable insights for developing more robust and adaptive cloud security mechanisms.

[LG-44] Delta: A Cloud-assisted Data Enrichment Framework for On-Device Continual Learning

链接: https://arxiv.org/abs/2410.18378
作者: Chen Gong,Zhenzhe Zheng,Fan Wu,Xiaofeng Jia,Guihai Chen
关键词-EN: necessitating on-device continual, on-device continual learning, modern mobile applications, consistent model performance, users frequently encounter
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In modern mobile applications, users frequently encounter various new contexts, necessitating on-device continual learning (CL) to ensure consistent model performance. While existing research predominantly focused on developing lightweight CL frameworks, we identify that data scarcity is a critical bottleneck for on-device CL. In this work, we explore the potential of leveraging abundant cloud-side data to enrich scarce on-device data, and propose a private, efficient and effective data enrichment framework Delta. Specifically, Delta first introduces a directory dataset to decompose the data enrichment problem into device-side and cloud-side sub-problems without sharing sensitive data. Next, Delta proposes a soft data matching strategy to effectively solve the device-side sub-problem with sparse user data, and an optimal data sampling scheme for cloud server to retrieve the most suitable dataset for enrichment with low computational complexity. Further, Delta refines the data sampling scheme by jointly considering the impact of enriched data on both new and past contexts, mitigating the catastrophic forgetting issue from a new aspect. Comprehensive experiments across four typical mobile computing tasks with varied data modalities demonstrate that Delta could enhance the overall model accuracy by an average of 15.1%, 12.4%, 1.1% and 5.6% for visual, IMU, audio and textual tasks compared with few-shot CL, and consistently reduce the communication costs by over 90% compared to federated CL.

[LG-45] Multi-objective Optimization in CPU Design Space Exploration: Attention is All You Need

链接: https://arxiv.org/abs/2410.18368
作者: Runzhen Xue,Hao Wu,Mingyu Yan,Ziheng Xiao,Xiaochun Ye,Dongrui Fan
关键词-EN: meet specific objectives, enables architects, guiding decisions, architects to systematically, systematically evaluate
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注:

点击查看摘要

Abstract:Design space exploration (DSE) enables architects to systematically evaluate various design options, guiding decisions on the most suitable configurations to meet specific objectives such as optimizing performance, power, and area. However, the growing complexity of modern CPUs has dramatically increased the number of micro-architectural parameters and expanded the overall design space, making DSE more challenging and time-consuming. Existing DSE frameworks struggle in large-scale design spaces due to inaccurate models and limited insights into parameter impact, hindering efficient identification of optimal micro-architectures within tight timeframes. In this work, we introduce AttentionDSE. Its key idea is to use the attention mechanism to establish a direct mapping of micro-architectural parameters to their contributions to predicted performance. This approach enhances both the prediction accuracy and interpretability of the performance model. Furthermore, the weights are dynamically adjusted, enabling the model to respond to design changes and effectively pinpoint the key micro-architectural parameters/components responsible for performance bottlenecks. Thus, AttentionDSE accurately, purposefully, and rapidly discovers optimal designs. Experiments on SPEC 2017 demonstrate that AttentionDSE significantly reduces exploration time by over 80% and achieves 3.9% improvement in Pareto Hypervolume compared to state-of-the-art DSE frameworks while maintaining superior prediction accuracy and efficiency with an increasing number of parameters. Subjects: Machine Learning (cs.LG); Hardware Architecture (cs.AR) Cite as: arXiv:2410.18368 [cs.LG] (or arXiv:2410.18368v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2410.18368 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-46] Assessing Alcohol Use Disorder: Insights from Lifestyle Background and Family History with Machine Learning Techniques

链接: https://arxiv.org/abs/2410.18354
作者: Chenlan Wang,Gaojian Huang,Yue Luo
关键词-EN: Alcohol Use Disorder, family history contribute, personal background, family history, developing Alcohol
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study explored how lifestyle, personal background, and family history contribute to the risk of developing Alcohol Use Disorder (AUD). Survey data from the All of Us Program was utilized to extract information on AUD status, lifestyle, personal background, and family history for 6,016 participants. Key determinants of AUD were identified using decision trees including annual income, recreational drug use, length of residence, sex/gender, marital status, education level, and family history of AUD. Data visualization and Chi-Square Tests of Independence were then used to assess associations between identified factors and AUD. Afterwards, machine learning techniques including decision trees, random forests, and Naive Bayes were applied to predict an individual’s likelihood of developing AUD. Random forests were found to achieve the highest accuracy (82%), compared to Decision Trees and Naive Bayes. Findings from this study can offer insights that help parents, healthcare professionals, and educators develop strategies to reduce AUD risk, enabling early intervention and targeted prevention efforts.

[LG-47] Precision Soil Quality Analysis Using Transformer-based Data Fusion Strategies: A Systematic Review

链接: https://arxiv.org/abs/2410.18353
作者: Mahdi Saki,Rasool Keshavarz,Daniel Franklin,Mehran Abolhasan,Justin Lipman,Negin Shariati
关键词-EN: agricultural remote sensing, remote sensing, recent advancements, soil, data fusion
类目: Machine Learning (cs.LG)
*备注: 14 pages, 9 figures, 4 tables, Journal

点击查看摘要

Abstract:This review explores the most recent advancements in transformer-based data fusion techniques in agricultural remote sensing (RS), with a particular focus on soil analysis. Utilizing a systematic, data-driven approach, we demonstrate that transformers have significantly outperformed conventional deep learning and machine learning methods since 2022, achieving prediction performance between 92% and 97%. The review is specifically focused on soil analysis, due to the importance of soil condition in optimizing crop productivity and ensuring sustainable farming practices. Transformer-based models have shown remarkable capabilities in handling complex multivariate soil data, improving the accuracy of soil moisture prediction, soil element analysis, and other soil-related applications. This systematic review primarily focuses on 1) analysing research trends and patterns in the literature, both chronologically and technically, and 2) conducting a comparative analysis of data fusion approaches, considering factors such as data types, techniques, and RS applications. Finally, we propose a roadmap for implementing data fusion methods in agricultural RS.

[LG-48] FedBaF: Federated Learning Aggregation Biased by a Foundation Model

链接: https://arxiv.org/abs/2410.18352
作者: Jong-Ik Park,Srinivasa Pranav,José M. F. Moura,Carlee Joe-Wong
关键词-EN: leading technology organizations, technology organizations due, foundation model, Federated Learning Aggregation, Federated Learning
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Foundation models are now a major focus of leading technology organizations due to their ability to generalize across diverse tasks. Existing approaches for adapting foundation models to new applications often rely on Federated Learning (FL) and disclose the foundation model weights to clients when using it to initialize the global model. While these methods ensure client data privacy, they compromise model and information security. In this paper, we introduce Federated Learning Aggregation Biased by a Foundation Model (FedBaF), a novel method for dynamically integrating pre-trained foundation model weights during the FL aggregation phase. Unlike conventional methods, FedBaF preserves the confidentiality of the foundation model while still leveraging its power to train more accurate models, especially in non-IID and adversarial scenarios. Our comprehensive experiments use Pre-ResNet and foundation models like Vision Transformer to demonstrate that FedBaF not only matches, but often surpasses the test accuracy of traditional weight initialization methods by up to 11.4% in IID and up to 15.8% in non-IID settings. Additionally, FedBaF applied to a Transformer-based language model significantly reduced perplexity by up to 39.2%.

[LG-49] Unified Microphone Conversion: Many-to-Many Device Mapping via Feature-wise Linear Modulation ICASSP2025

链接: https://arxiv.org/abs/2410.18322
作者: Myeonghoon Ryu,Hongseok Oh,Suji Lee,Han Park
关键词-EN: Unified Microphone Conversion, introduce Unified Microphone, Microphone Conversion, sound event classification, event classification systems
类目: ound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注: Currently under review for ICASSP 2025

点击查看摘要

Abstract:In this study, we introduce Unified Microphone Conversion, a unified generative framework to enhance the resilience of sound event classification systems against device variability. Building on the limitations of previous works, we condition the generator network with frequency response information to achieve many-to-many device mapping. This approach overcomes the inherent limitation of CycleGAN, requiring separate models for each device pair. Our framework leverages the strengths of CycleGAN for unpaired training to simulate device characteristics in audio recordings and significantly extends its scalability by integrating frequency response related information via Feature-wise Linear Modulation. The experiment results show that our method outperforms the state-of-the-art method by 2.6% and reducing variability by 0.8% in macro-average F1 score.

[LG-50] NexusIndex: Integrating Advanced Vector Indexing and Multi-Model Embeddings for Robust Fake News Detection

链接: https://arxiv.org/abs/2410.18294
作者: Solmaz Seyed Monir,Dongfang Zhao
关键词-EN: scalable detection mechanisms, digital platforms, platforms has underscored, robust and scalable, scalable detection
类目: Information Retrieval (cs.IR); Databases (cs.DB); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 9 pages, 3 figures

点击查看摘要

Abstract:The proliferation of fake news on digital platforms has underscored the need for robust and scalable detection mechanisms. Traditional methods often fall short in handling large and diverse datasets due to limitations in scalability and accuracy. In this paper, we propose NexusIndex, a novel framework and model that enhances fake news detection by integrating advanced language models, an innovative FAISSNexusIndex layer, and attention mechanisms. Our approach leverages multi-model embeddings to capture rich contextual and semantic nuances, significantly improving text interpretation and classification accuracy. By transforming articles into high-dimensional embeddings and indexing them efficiently, NexusIndex facilitates rapid similarity searches across extensive collections of news articles. The FAISSNexusIndex layer further optimizes this process, enabling real-time detection and enhancing the system’s scalability and performance. Our experimental results demonstrate that NexusIndex outperforms state-of-the-art methods in efficiency and accuracy across diverse datasets.

[LG-51] Augmenting Training Data with Vector-Quantized Variational Autoencoder for Classifying RF Signals

链接: https://arxiv.org/abs/2410.18283
作者: Srihari Kamesh Kompella,Kemal Davaslioglu,Yalin E. Sagduyu,Sastry Kompella
关键词-EN: Radio frequency, important part, wireless, signals, Radio
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
*备注: IEEE Milcom 2024

点击查看摘要

Abstract:Radio frequency (RF) communication has been an important part of civil and military communication for decades. With the increasing complexity of wireless environments and the growing number of devices sharing the spectrum, it has become critical to efficiently manage and classify the signals that populate these frequencies. In such scenarios, the accurate classification of wireless signals is essential for effective spectrum management, signal interception, and interference mitigation. However, the classification of wireless RF signals often faces challenges due to the limited availability of labeled training data, especially under low signal-to-noise ratio (SNR) conditions. To address these challenges, this paper proposes the use of a Vector-Quantized Variational Autoencoder (VQ-VAE) to augment training data, thereby enhancing the performance of a baseline wireless classifier. The VQ-VAE model generates high-fidelity synthetic RF signals, increasing the diversity and fidelity of the training dataset by capturing the complex variations inherent in RF communication signals. Our experimental results show that incorporating VQ-VAE-generated data significantly improves the classification accuracy of the baseline model, particularly in low SNR conditions. This augmentation leads to better generalization and robustness of the classifier, overcoming the constraints imposed by limited real-world data. By improving RF signal classification, the proposed approach enhances the efficacy of wireless communication in both civil and tactical settings, ensuring reliable and secure operations. This advancement supports critical decision-making and operational readiness in environments where communication fidelity is essential.

[LG-52] Hamiltonian Matching for Symplectic Neural Integrators

链接: https://arxiv.org/abs/2410.18262
作者: Priscilla Canizares,Davide Murari,Carola-Bibiane Schönlieb,Ferdia Sherry,Zakhar Shumaylov
关键词-EN: Hamilton equations, particle physics, including astronomy, quantum mechanics, climate science
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
*备注: NeurReps 2024

点击查看摘要

Abstract:Hamilton’s equations of motion form a fundamental framework in various branches of physics, including astronomy, quantum mechanics, particle physics, and climate science. Classical numerical solvers are typically employed to compute the time evolution of these systems. However, when the system spans multiple spatial and temporal scales numerical errors can accumulate, leading to reduced accuracy. To address the challenges of evolving such systems over long timescales, we propose SympFlow, a novel neural network-based symplectic integrator, which is the composition of a sequence of exact flow maps of parametrised time-dependent Hamiltonian functions. This architecture allows for a backward error analysis: we can identify an underlying Hamiltonian function of the architecture and use it to define a Hamiltonian matching objective function, which we use for training. In numerical experiments, we show that SympFlow exhibits promising results, with qualitative energy conservation behaviour similar to that of time-stepping symplectic integrators.

[LG-53] Assessment of Developmental Dysgraphia Utilising a Display Tablet

链接: https://arxiv.org/abs/2410.18230
作者: Jiri Mekyska,Zoltan Galaz,Katarina Safarova,Vojtech Zvoncak,Lukas Cunek,Tomas Urbanek,Jana Marie Havigerova,Jirina Bednarova,Jan Mucha,Michal Gavenciak,Zdenek Smekal,Marcos Faundez-Zanuy
关键词-EN: online handwriting processing, developmental dysgraphia, increasing popularity, processing has increasing, child writes
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 16 pages

点击查看摘要

Abstract:Even though the computerised assessment of developmental dysgraphia (DD) based on online handwriting processing has increasing popularity, most of the solutions are based on a setup, where a child writes on a paper fixed to a digitizing tablet that is connected to a computer. Although this approach enables the standard way of writing using an inking pen, it is difficult to be administered by children themselves. The main goal of this study is thus to explore, whether the quantitative analysis of online handwriting recorded via a display screen tablet could sufficiently support the assessment of DD as well. For the purpose of this study, we enrolled 144 children (attending the 3rd and 4th class of a primary school), whose handwriting proficiency was assessed by a special education counsellor, and who assessed themselves by the Handwriting Proficiency Screening Questionnaires for Children (HPSQ C). Using machine learning models based on a gradient-boosting algorithm, we were able to support the DD diagnosis with up to 83.6% accuracy. The HPSQ C total score was estimated with a minimum error equal to 10.34 %. Children with DD spent significantly higher time in-air, they had a higher number of pen elevations, a bigger height of on-surface strokes, a lower in-air tempo, and a higher variation in the angular velocity. Although this study shows a promising impact of DD assessment via display tablets, it also accents the fact that modelling of subjective scores is challenging and a complex and data-driven quantification of DD manifestations is needed.

[LG-54] Melody Construction for Persian lyrics using LSTM recurrent neural networks

链接: https://arxiv.org/abs/2410.18203
作者: Farshad Jafari,Farzad Didehvar,Amin Gheibi
关键词-EN: present paper investigated, paper investigated automatic, investigated automatic melody, automatic melody construction, present paper
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:The present paper investigated automatic melody construction for Persian lyrics as an input. It was assumed that there is a phonological correlation between the lyric syllables and the melody in a song. A seq2seq neural network was developed to investigate this assumption, trained on parallel syllable and note sequences in Persian songs to suggest a pleasant melody for a new sequence of syllables. More than 100 pieces of Persian music were collected and converted from the printed version to the digital format due to the lack of a dataset on Persian digital music. Finally, 14 new lyrics were given to the model as input, and the suggested melodies were performed and recorded by music experts to evaluate the trained model. The evaluation was conducted using an audio questionnaire, which more than 170 persons answered. According to the answers about the pleasantness of melody, the system outputs scored an average of 3.005 from 5, while the human-made melodies for the same lyrics obtained an average score of 4.078.

[LG-55] Dreaming Learning NEURIPS2024

链接: https://arxiv.org/abs/2410.18156
作者: Alessandro Londei,Matteo Benati,Denise Lanzieri,Vittorio Loreto
关键词-EN: Incorporating novelties, deep learning systems, learning systems remains, challenging problem, novelties into deep
类目: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注: Accepted at the NeurIPS 2024 workshop on Intrinsically Motivated Open-ended Learning

点击查看摘要

Abstract:Incorporating novelties into deep learning systems remains a challenging problem. Introducing new information to a machine learning system can interfere with previously stored data and potentially alter the global model paradigm, especially when dealing with non-stationary sources. In such cases, traditional approaches based on validation error minimization offer limited advantages. To address this, we propose a training algorithm inspired by Stuart Kauffman’s notion of the Adjacent Possible. This novel training methodology explores new data spaces during the learning phase. It predisposes the neural network to smoothly accept and integrate data sequences with different statistical characteristics than expected. The maximum distance compatible with such inclusion depends on a specific parameter: the sampling temperature used in the explorative phase of the present method. This algorithm, called Dreaming Learning, anticipates potential regime shifts over time, enhancing the neural network’s responsiveness to non-stationary events that alter statistical properties. To assess the advantages of this approach, we apply this methodology to unexpected statistical changes in Markov chains and non-stationary dynamics in textual sequences. We demonstrated its ability to improve the auto-correlation of generated textual sequences by \sim 29% and enhance the velocity of loss convergence by \sim 100% in the case of a paradigm shift in Markov chains.

[LG-56] Music102: An D_12-equivariant transformer for chord progression accompaniment

链接: https://arxiv.org/abs/2410.18151
作者: Weiliang Luo
关键词-EN: advanced model built, enhancing chord progression, chord progression accompaniment, aimed at enhancing, progression accompaniment
类目: ound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注: 10 pages, 3 figures

点击查看摘要

Abstract:We present Music102, an advanced model built upon the Music101 prototype, aimed at enhancing chord progression accompaniment through a D12-equivariant transformer. Inspired by group theory and symbolic music structures, Music102 leverages musical symmetry–such as transposition and reflection operations–integrating these properties into the transformer architecture. By encoding prior music knowledge, the model maintains equivariance across both melody and chord sequences. The POP909 dataset was employed to train and evaluate Music102, revealing significant improvements over Music101 in both weighted loss and exact accuracy metrics, despite using fewer parameters. This work showcases the adaptability of self-attention mechanisms and layer normalization to the discrete musical domain, addressing challenges in computational music analysis. With its stable and flexible neural framework, Music102 sets the stage for further exploration in equivariant music generation and computational composition tools, bridging mathematical theory with practical music performance.

[LG-57] Efficient Adaptive Federated Optimization

链接: https://arxiv.org/abs/2410.18117
作者: Su Hyeong Lee,Sidharth Sharma,Manzil Zaheer,Tian Li
关键词-EN: Adaptive optimization plays, maximizing its performance, optimization plays, plays a pivotal, pivotal role
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Adaptive optimization plays a pivotal role in federated learning, where simultaneous server and client-side adaptivity have been shown to be essential for maximizing its performance. However, the scalability of jointly adaptive systems is often constrained by limited resources in communication and memory. In this paper, we introduce a class of efficient adaptive algorithms, named FedAda^2 , designed specifically for large-scale, cross-device federated environments. FedAda^2 optimizes communication efficiency by avoiding the transfer of preconditioners between the server and clients. At the same time, it leverages memory-efficient adaptive optimizers on the client-side to reduce on-device memory consumption. Theoretically, we demonstrate that FedAda^2 achieves the same convergence rates for general, non-convex objectives as its more resource-intensive counterparts that directly integrate joint adaptivity. Empirically, we showcase the benefits of joint adaptivity and the effectiveness of FedAda^2 on both image and text datasets.

[LG-58] Scalable Co-Clustering for Large-Scale Data through Dynamic Partitioning and Hierarchical Merging

链接: https://arxiv.org/abs/2410.18113
作者: Zihan Wu,Zhaoke Huang,Hong Yan
关键词-EN: simultaneously clusters rows, Co-clustering simultaneously clusters, rows and columns, revealing more fine-grained, fine-grained groups
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 8 pages, 2 figures

点击查看摘要

Abstract:Co-clustering simultaneously clusters rows and columns, revealing more fine-grained groups. However, existing co-clustering methods suffer from poor scalability and cannot handle large-scale data. This paper presents a novel and scalable co-clustering method designed to uncover intricate patterns in high-dimensional, large-scale datasets. Specifically, we first propose a large matrix partitioning algorithm that partitions a large matrix into smaller submatrices, enabling parallel co-clustering. This method employs a probabilistic model to optimize the configuration of submatrices, balancing the computational efficiency and depth of analysis. Additionally, we propose a hierarchical co-cluster merging algorithm that efficiently identifies and merges co-clusters from these submatrices, enhancing the robustness and reliability of the process. Extensive evaluations validate the effectiveness and efficiency of our method. Experimental results demonstrate a significant reduction in computation time, with an approximate 83% decrease for dense matrices and up to 30% for sparse matrices.

[LG-59] OPTIMA: Optimized Policy for Intelligent Multi-Agent Systems Enables Coordination-Aware Autonomous Vehicles

链接: https://arxiv.org/abs/2410.18112
作者: Rui Du,Kai Zhao,Jinlong Hou,Qiang Zhang,Peter Zhang
关键词-EN: Coordination among connected, communication technologies, advancing due, due to developments, developments in control
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Coordination among connected and autonomous vehicles (CAVs) is advancing due to developments in control and communication technologies. However, much of the current work is based on oversimplified and unrealistic task-specific assumptions, which may introduce vulnerabilities. This is critical because CAVs not only interact with their environment but are also integral parts of it. Insufficient exploration can result in policies that carry latent risks, highlighting the need for methods that explore the environment both extensively and efficiently. This work introduces OPTIMA, a novel distributed reinforcement learning framework for cooperative autonomous vehicle tasks. OPTIMA alternates between thorough data sampling from environmental interactions and multi-agent reinforcement learning algorithms to optimize CAV cooperation, emphasizing both safety and efficiency. Our goal is to improve the generality and performance of CAVs in highly complex and crowded scenarios. Furthermore, the industrial-scale distributed training system easily adapts to different algorithms, reward functions, and strategies.

[LG-60] Data Efficiency for Large Recommendation Models

链接: https://arxiv.org/abs/2410.18111
作者: Kshitij Jain,Jingru Xie,Kevin Regan,Cheng Chen,Jie Han,Steve Li,Zhuoshu Li,Todd Phillips,Myles Sussman,Matt Troup,Angel Yu,Jia Zhuo
关键词-EN: online advertising industry, changing user behavior, dollar online advertising, multi-billion dollar online, rapidly changing user
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large recommendation models (LRMs) are fundamental to the multi-billion dollar online advertising industry, processing massive datasets of hundreds of billions of examples before transitioning to continuous online training to adapt to rapidly changing user behavior. The massive scale of data directly impacts both computational costs and the speed at which new methods can be evaluated (RD velocity). This paper presents actionable principles and high-level frameworks to guide practitioners in optimizing training data requirements. These strategies have been successfully deployed in Google’s largest Ads CTR prediction models and are broadly applicable beyond LRMs. We outline the concept of data convergence, describe methods to accelerate this convergence, and finally, detail how to optimally balance training data volume with model size. Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG) Cite as: arXiv:2410.18111 [cs.IR] (or arXiv:2410.18111v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2410.18111 Focus to learn more arXiv-issued DOI via DataCite

[LG-61] uning-free coreset Markov chain Monte Carlo

链接: https://arxiv.org/abs/2410.18973
作者: Naitong Chen,Jonathan H. Huggins,Trevor Campbell
关键词-EN: reduce computational cost, Markov chain Monte, chain Monte Carlo, Bayesian coreset, Coreset Markov chain
类目: Computation (stat.CO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A Bayesian coreset is a small, weighted subset of a data set that replaces the full data during inference to reduce computational cost. The state-of-the-art coreset construction algorithm, Coreset Markov chain Monte Carlo (Coreset MCMC), uses draws from an adaptive Markov chain targeting the coreset posterior to train the coreset weights via stochastic gradient optimization. However, the quality of the constructed coreset, and thus the quality of its posterior approximation, is sensitive to the stochastic optimization learning rate. In this work, we propose a learning-rate-free stochastic gradient optimization procedure, Hot-start Distance over Gradient (Hot DoG), for training coreset weights in Coreset MCMC without user tuning effort. Empirical results demonstrate that Hot DoG provides higher quality posterior approximations than other learning-rate-free stochastic gradient methods, and performs competitively to optimally-tuned ADAM.

[LG-62] A Random Matrix Theory Perspective on the Spectrum of Learned Features and Asymptotic Generalization Capabilities

链接: https://arxiv.org/abs/2410.18938
作者: Yatin Dandi,Luca Pesce,Hugo Cui,Florent Krzakala,Yue M. Lu,Bruno Loureiro
关键词-EN: two-layer neural networks, neural networks, key property, capacity of adapting, adapting to data
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:A key property of neural networks is their capacity of adapting to data during training. Yet, our current mathematical understanding of feature learning and its relationship to generalization remain limited. In this work, we provide a random matrix analysis of how fully-connected two-layer neural networks adapt to the target function after a single, but aggressive, gradient descent step. We rigorously establish the equivalence between the updated features and an isotropic spiked random feature model, in the limit of large batch size. For the latter model, we derive a deterministic equivalent description of the feature empirical covariance matrix in terms of certain low-dimensional operators. This allows us to sharply characterize the impact of training in the asymptotic feature spectrum, and in particular, provides a theoretical grounding for how the tails of the feature spectrum modify with training. The deterministic equivalent further yields the exact asymptotic generalization error, shedding light on the mechanisms behind its improvement in the presence of feature learning. Our result goes beyond standard random matrix ensembles, and therefore we believe it is of independent technical interest. Different from previous work, our result holds in the challenging maximal learning rate regime, is fully rigorous and allows for finitely supported second layer initialization, which turns out to be crucial for studying the functional expressivity of the learned features. This provides a sharp description of the impact of feature learning in the generalization of two-layer neural networks, beyond the random features and lazy training regimes.

[LG-63] AutoStep: Locally adaptive involutive MCMC

链接: https://arxiv.org/abs/2410.18929
作者: Tiange Liu,Nikola Surjanovic,Miguel Biron-Lattes,Alexandre Bouchard-Côté,Trevor Campbell
关键词-EN: chain Monte Carlo, Markov chain Monte, common Markov chain, Monte Carlo, common Markov
类目: Computation (stat.CO); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Many common Markov chain Monte Carlo (MCMC) kernels can be formulated using a deterministic involutive proposal with a step size parameter. Selecting an appropriate step size is often a challenging task in practice; and for complex multiscale targets, there may not be one choice of step size that works well globally. In this work, we address this problem with a novel class of involutive MCMC methods – AutoStep MCMC – that selects an appropriate step size at each iteration adapted to the local geometry of the target distribution. We prove that AutoStep MCMC is \pi -invariant and has other desirable properties under mild assumptions on the target distribution \pi and involutive proposal. Empirical results examine the effect of various step size selection design choices, and show that AutoStep MCMC is competitive with state-of-the-art methods in terms of effective sample size per unit cost on a range of challenging target distributions.

[LG-64] Learning k-body Hamiltonians via compressed sensing

链接: https://arxiv.org/abs/2410.18928
作者: Muzhou Ma,Steven T. Flammia,John Preskill,Yu Tong
关键词-EN: necessarily geometrically local, unknown Pauli terms, unknown Pauli, geometrically local, study the problem
类目: Quantum Physics (quant-ph); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: 30+12 pages, 1 figure

点击查看摘要

Abstract:We study the problem of learning a k -body Hamiltonian with M unknown Pauli terms that are not necessarily geometrically local. We propose a protocol that learns the Hamiltonian to precision \epsilon with total evolution time \mathcalO(M^1/2+1/p/\epsilon) up to logarithmic factors, where the error is quantified by the \ell^p -distance between Pauli coefficients. Our learning protocol uses only single-qubit control operations and a GHZ state initial state, is non-adaptive, is robust against SPAM errors, and performs well even if M and k are not precisely known in advance or if the Hamiltonian is not exactly M -sparse. Methods from the classical theory of compressed sensing are used for efficiently identifying the M terms in the Hamiltonian from among all possible k -body Pauli operators. We also provide a lower bound on the total evolution time needed in this learning task, and we discuss the operational interpretations of the \ell^1 and \ell^2 error metrics. In contrast to previous works, our learning protocol requires neither geometric locality nor any other relaxed locality conditions.

[LG-65] MissNODAG: Differentiable Cyclic Causal Graph Learning from Incomplete Data

链接: https://arxiv.org/abs/2410.18918
作者: Muralikrishnna G. Sethuraman,Razieh Nabi,Faramarz Fekri
关键词-EN: biological networks, complicated by feedback, feedback loops, loops and incomplete, Causal discovery
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Causal discovery in real-world systems, such as biological networks, is often complicated by feedback loops and incomplete data. Standard algorithms, which assume acyclic structures or fully observed data, struggle with these challenges. To address this gap, we propose MissNODAG, a differentiable framework for learning both the underlying cyclic causal graph and the missingness mechanism from partially observed data, including data missing not at random. Our framework integrates an additive noise model with an expectation-maximization procedure, alternating between imputing missing values and optimizing the observed data likelihood, to uncover both the cyclic structures and the missingness mechanism. We demonstrate the effectiveness of MissNODAG through synthetic experiments and an application to real-world gene perturbation data.

[LG-66] Modulated Adaptive Fourier Neural Operators for Temporal Interpolation of Weather Forecasts

链接: https://arxiv.org/abs/2410.18904
作者: Jussi Leinonen,Boris Bonev,Thorsten Kurth,Yair Cohen
关键词-EN: inherently long time, forecast models based, limited temporal resolution, weather forecast models, long time steps
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注: 8 pages, 5 figures

点击查看摘要

Abstract:Weather and climate data are often available at limited temporal resolution, either due to storage limitations, or in the case of weather forecast models based on deep learning, their inherently long time steps. The coarse temporal resolution makes it difficult to capture rapidly evolving weather events. To address this limitation, we introduce an interpolation model that reconstructs the atmospheric state between two points in time for which the state is known. The model makes use of a novel network layer that modifies the adaptive Fourier neural operator (AFNO), which has been previously used in weather prediction and other applications of machine learning to physics problems. The modulated AFNO (ModAFNO) layer takes an embedding, here computed from the interpolation target time, as an additional input and applies a learned shift-scale operation inside the AFNO layers to adapt them to the target time. Thus, one model can be used to produce all intermediate time steps. Trained to interpolate between two time steps 6 h apart, the ModAFNO-based interpolation model produces 1 h resolution intermediate time steps that are visually nearly indistinguishable from the actual corresponding 1 h resolution data. The model reduces the RMSE loss of reconstructing the intermediate steps by approximately 50% compared to linear interpolation. We also demonstrate its ability to reproduce the statistics of extreme weather events such as hurricanes and heat waves better than 6 h resolution data. The ModAFNO layer is generic and is expected to be applicable to other problems, including weather forecasting with tunable lead time.

[LG-67] Exploring the Universe with SNAD: Anomaly Detection in Astronomy

链接: https://arxiv.org/abs/2410.18875
作者: Alina A. Volnova,Patrick D. Aleo,Anastasia Lavrukhina,Etienne Russeil,Timofey Semenikhin,Emmanuel Gangler,Emille E. O. Ishida,Matwey V. Kornilov,Vladimir Korolev,Konstantin Malanchev,Maria V. Pruzhinskaya,Sreevarsha Sreejith
关键词-EN: detecting astronomical anomalies, machine learning algorithms, large-scale surveys, primary focus, focus on detecting
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 14 pages, 4 figures

点击查看摘要

Abstract:SNAD is an international project with a primary focus on detecting astronomical anomalies within large-scale surveys, using active learning and other machine learning algorithms. The work carried out by SNAD not only contributes to the discovery and classification of various astronomical phenomena but also enhances our understanding and implementation of machine learning techniques within the field of astrophysics. This paper provides a review of the SNAD project and summarizes the advancements and achievements made by the team over several years.

[LG-68] Omics-driven hybrid dynamic modeling of bioprocesses with uncertainty estimation

链接: https://arxiv.org/abs/2410.18864
作者: Sebastián Espinel-Ríos,José Montaño López,José L. Avalos
关键词-EN: integrates machine-learning tools, omics-driven modeling pipeline, work presents, presents an omics-driven, pipeline that integrates
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This work presents an omics-driven modeling pipeline that integrates machine-learning tools to facilitate the dynamic modeling of multiscale biological systems. Random forests and permutation feature importance are proposed to mine omics datasets, guiding feature selection and dimensionality reduction for dynamic modeling. Continuous and differentiable machine-learning functions can be trained to link the reduced omics feature set to key components of the dynamic model, resulting in a hybrid model. As proof of concept, we apply this framework to a high-dimensional proteomics dataset of \textitSaccharomyces cerevisiae . After identifying key intracellular proteins that correlate with cell growth, targeted dynamic experiments are designed, and key model parameters are captured as functions of the selected proteins using Gaussian processes. This approach captures the dynamic behavior of yeast strains under varying proteome profiles while estimating the uncertainty in the hybrid model’s predictions. The outlined modeling framework is adaptable to other scenarios, such as integrating additional layers of omics data for more advanced multiscale biological systems, or employing alternative machine-learning methods to handle larger datasets. Overall, this study outlines a strategy for leveraging omics data to inform multiscale dynamic modeling in systems biology and bioprocess engineering.

[LG-69] Bilinear Sequence Regression: A Model for Learning from Long Sequences of High-dimensional Tokens

链接: https://arxiv.org/abs/2410.18858
作者: Vittorio Erba,Emanuele Troiani,Luca Biggio,Antoine Maillard,Lenka Zdeborová
关键词-EN: so-called large language, neural networks processing, large language models, high-dimensional vectors called, vectors called tokens
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Current progress in artificial intelligence is centered around so-called large language models that consist of neural networks processing long sequences of high-dimensional vectors called tokens. Statistical physics provides powerful tools to study the functioning of learning with neural networks and has played a recognized role in the development of modern machine learning. The statistical physics approach relies on simplified and analytically tractable models of data. However, simple tractable models for long sequences of high-dimensional tokens are largely underexplored. Inspired by the crucial role models such as the single-layer teacher-student perceptron (aka generalized linear regression) played in the theory of fully connected neural networks, in this paper, we introduce and study the bilinear sequence regression (BSR) as one of the most basic models for sequences of tokens. We note that modern architectures naturally subsume the BSR model due to the skip connections. Building on recent methodological progress, we compute the Bayes-optimal generalization error for the model in the limit of long sequences of high-dimensional tokens, and provide a message-passing algorithm that matches this performance. We quantify the improvement that optimal learning brings with respect to vectorizing the sequence of tokens and learning via simple linear regression. We also unveil surprising properties of the gradient descent algorithms in the BSR model.

[LG-70] High-dimensional Analysis of Knowledge Distillation: Weak-to-Strong Generalization and Scaling Laws

链接: https://arxiv.org/abs/2410.18837
作者: M. Emrullah Ildiz,Halil Alperen Gozeten,Ege Onur Taga,Marco Mondelli,Samet Oymak
关键词-EN: machine learning scenarios, learning scenarios rely, surrogate model, model, growing number
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A growing number of machine learning scenarios rely on knowledge distillation where one uses the output of a surrogate model as labels to supervise the training of a target model. In this work, we provide a sharp characterization of this process for ridgeless, high-dimensional regression, under two settings: (i) model shift, where the surrogate model is arbitrary, and (ii) distribution shift, where the surrogate model is the solution of empirical risk minimization with out-of-distribution data. In both cases, we characterize the precise risk of the target model through non-asymptotic bounds in terms of sample size and data distribution under mild conditions. As a consequence, we identify the form of the optimal surrogate model, which reveals the benefits and limitations of discarding weak features in a data-dependent fashion. In the context of weak-to-strong (W2S) generalization, this has the interpretation that (i) W2S training, with the surrogate as the weak model, can provably outperform training with strong labels under the same data budget, but (ii) it is unable to improve the data scaling law. We validate our results on numerical experiments both on ridgeless regression and on neural network architectures.

[LG-71] Fully Stochastic Primal-dual Gradient Algorithm for Non-convex Optimization on Random Graphs

链接: https://arxiv.org/abs/2410.18774
作者: Chung-Yiu Yau,Haoming Liu,Hoi-To Wai
关键词-EN: underline, Stochastic decentralized optimization, FSPDA, suffer from issues, synchronization overhead
类目: Optimization and Control (math.OC); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 15 pages, 7 figures

点击查看摘要

Abstract:Stochastic decentralized optimization algorithms often suffer from issues such as synchronization overhead and intermittent communication. This paper proposes a \underline\rm F ully \underline\rm S tochastic \underline\rm P rimal \underline\rm D ual gradient \underline\rm A lgorithm (FSPDA) that suggests an asynchronous decentralized procedure with (i) sparsified non-blocking communication on random undirected graphs and (ii) local stochastic gradient updates. FSPDA allows multiple local gradient steps to accelerate convergence to stationarity while finding a consensual solution with stochastic primal-dual updates. For problems with smooth (possibly non-convex) objective function, we show that FSPDA converges to an \mathrm\mathcalO( \it \sigma /\sqrtnT ) -stationary solution after \mathrm\it T iterations without assuming data heterogeneity. The performance of FSPDA is on par with state-of-the-art algorithms whose convergence depend on static graph and synchronous updates. To our best knowledge, FSPDA is the first asynchronous algorithm that converges exactly under the non-convex setting. Numerical experiments are presented to show the benefits of FSPDA.

[LG-72] Remote Detection of Applications for Improved Beam Tracking in mmWave/sub-THz 5G/6G Systems

链接: https://arxiv.org/abs/2410.18637
作者: Alexander Shurakov,Margarita Ershova,Abdukodir Khakimov,Anatoliy Prikhodko,Evgeny Mokrov,Vyacheslav Begishev,Galina Chulkova,Yevgeni Koucheryavy,Gregory Gol’tsman
关键词-EN: Synchronization Signal Blocks, received signal strength, Beam tracking, millimeter wave, essential functionality
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Instrumentation and Detectors (physics.ins-det)
*备注:

点击查看摘要

Abstract:Beam tracking is an essential functionality of millimeter wave (mmWave, 30-100 GHz) and sub-terahertz (sub-THz, 100-300 GHz) 5G/6G systems. It operates by performing antenna sweeping at both base station (BS) and user equipment (UE) sides using the Synchronization Signal Blocks (SSB). The optimal frequency of beam tracking events is not specified by 3GPP standards and heavily depends on the micromobility properties of the applications currently utilized by the user. In absence of explicit signalling for the type of application at the air interface, in this paper, we propose a way to remotely detect it at the BS side based on the received signal strength pattern. To this aim, we first perform a multi-stage measurement campaign at 156 GHz, belonging to the sub-THz band, to obtain the received signal strength traces of popular smartphone applications. Then, we proceed applying conventional statistical Mann-Whitney tests and various machine learning (ML) based classification techniques to discriminate applications remotely. Our results show that Mann-Whitney test can be used to differentiate between fast and slow application classes with a confidence of 0.95 inducing class detection delay on the order of 1 s after application initialization. With the same time budget, random forest classifiers can differentiate between applications with fast and slow micromobility with 80% accuracy using received signal strength metric only. The accuracy of detecting a specific application however is lower, reaching 60%. By utilizing the proposed technique one can estimate the optimal values of the beam tracking intervals without adding additional signalling to the air interface.

[LG-73] Evolutionary Dispersal of Ecological Species via Multi-Agent Deep Reinforcement Learning

链接: https://arxiv.org/abs/2410.18621
作者: Wonhyung Choi,Inkyung Ahn
关键词-EN: Understanding species dynamics, ecosystem studies, Understanding species, dynamics in heterogeneous, heterogeneous environments
类目: Populations and Evolution (q-bio.PE); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:Understanding species dynamics in heterogeneous environments is essential for ecosystem studies. Traditional models assumed homogeneous habitats, but recent approaches include spatial and temporal variability, highlighting species migration. We adopt starvation-driven diffusion (SDD) models as nonlinear diffusion to describe species dispersal based on local resource conditions, showing advantages for species survival. However, accurate prediction remains challenging due to model simplifications. This study uses multi-agent reinforcement learning (MARL) with deep Q-networks (DQN) to simulate single species and predator-prey interactions, incorporating SDD-type rewards. Our simulations reveal evolutionary dispersal strategies, providing insights into species dispersal mechanisms and validating traditional mathematical models.

[LG-74] Optimal Equivariant Architectures from the Symmetries of Matrix-Element Likelihoods

链接: https://arxiv.org/abs/2410.18553
作者: Daniel Maître,Vishal S. Ngairangbam,Michael Spannowsky
关键词-EN: Matrix-Element Method, cornerstone of data, neural network, Method, data analysis
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 31 pages, 6 figures, 3 tables

点击查看摘要

Abstract:The Matrix-Element Method (MEM) has long been a cornerstone of data analysis in high-energy physics. It leverages theoretical knowledge of parton-level processes and symmetries to evaluate the likelihood of observed events. In parallel, the advent of geometric deep learning has enabled neural network architectures that incorporate known symmetries directly into their design, leading to more efficient learning. This paper presents a novel approach that combines MEM-inspired symmetry considerations with equivariant neural network design for particle physics analysis. Even though Lorentz invariance and permutation invariance overall reconstructed objects are the largest and most natural symmetry in the input domain, we find that they are sub-optimal in most practical search scenarios. We propose a longitudinal boost-equivariant message-passing neural network architecture that preserves relevant discrete symmetries. We present numerical studies demonstrating MEM-inspired architectures achieve new state-of-the-art performance in distinguishing di-Higgs decays to four bottom quarks from the QCD background, with enhanced sample and parameter efficiencies. This synergy between MEM and equivariant deep learning opens new directions for physics-informed architecture design, promising more powerful tools for probing physics beyond the Standard Model.

[LG-75] Evolving Voices Based on Temporal Poisson Factorisation

链接: https://arxiv.org/abs/2410.18486
作者: Jan Vávra(1 and 2),Bettina Grün(1),Paul Hofmarcher(2) ((1) Vienna University of Economics and Business, (2) Paris-Lodron University of Salzburg)
关键词-EN: world is evolving, flexible topic models, Poisson factorisation model, Poisson factorisation, political speech data
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注: main paper: 19 pages (2 single figures, 3 double figures, 3 tables) appendix: 9 pages (3 quadruple figures, 1 table) references: 3 pages

点击查看摘要

Abstract:The world is evolving and so is the vocabulary used to discuss topics in speech. Analysing political speech data from more than 30 years requires the use of flexible topic models to uncover the latent topics and their change in prevalence over time as well as the change in the vocabulary of the topics. We propose the temporal Poisson factorisation (TPF) model as an extension to the Poisson factorisation model to model sparse count data matrices obtained based on the bag-of-words assumption from text documents with time stamps. We discuss and empirically compare different model specifications for the time-varying latent variables consisting either of a flexible auto-regressive structure of order one or a random walk. Estimation is based on variational inference where we consider a combination of coordinate ascent updates with automatic differentiation using batching of documents. Suitable variational families are proposed to ease inference. We compare results obtained using independent univariate variational distributions for the time-varying latent variables to those obtained with a multivariate variant. We discuss in detail the results of the TPF model when analysing speeches from 18 sessions in the U.S. Senate (1981-2016).

[LG-76] Structure Language Models for Protein Conformation Generation

链接: https://arxiv.org/abs/2410.18403
作者: Jiarui Lu,Xiaoyin Chen,Stephen Zhewen Lu,Chence Shi,Hongyu Guo,Yoshua Bengio,Jian Tang
关键词-EN: advancing drug discovery, adopt multiple structural, Proteins adopt multiple, multiple structural conformations, diverse biological functions
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注: Preprint. Under Review

点击查看摘要

Abstract:Proteins adopt multiple structural conformations to perform their diverse biological functions, and understanding these conformations is crucial for advancing drug discovery. Traditional physics-based simulation methods often struggle with sampling equilibrium conformations and are computationally expensive. Recently, deep generative models have shown promise in generating protein conformations as a more efficient alternative. However, these methods predominantly rely on the diffusion process within a 3D geometric space, which typically centers around the vicinity of metastable states and is often inefficient in terms of runtime. In this paper, we introduce Structure Language Modeling (SLM) as a novel framework for efficient protein conformation generation. Specifically, the protein structures are first encoded into a compact latent space using a discrete variational auto-encoder, followed by conditional language modeling that effectively captures sequence-specific conformation distributions. This enables a more efficient and interpretable exploration of diverse ensemble modes compared to existing methods. Based on this general framework, we instantiate SLM with various popular LM architectures as well as proposing the ESMDiff, a novel BERT-like structure language model fine-tuned from ESM3 with masked diffusion. We verify our approach in various scenarios, including the equilibrium dynamics of BPTI, conformational change pairs, and intrinsically disordered proteins. SLM provides a highly efficient solution, offering a 20-100x speedup than existing methods in generating diverse conformations, shedding light on promising avenues for future research.

[LG-77] Stabilizing black-box model selection with the inflated argmax

链接: https://arxiv.org/abs/2410.18268
作者: Melissa Adrian,Jake A. Soloff,Rebecca Willett
关键词-EN: Model selection, LASSO model selection, process of choosing, class of candidate, Model
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Model selection is the process of choosing from a class of candidate models given data. For instance, methods such as the LASSO and sparse identification of nonlinear dynamics (SINDy) formulate model selection as finding a sparse solution to a linear system of equations determined by training data. However, absent strong assumptions, such methods are highly unstable: if a single data point is removed from the training set, a different model may be selected. This paper presents a new approach to stabilizing model selection that leverages a combination of bagging and an “inflated” argmax operation. Our method selects a small collection of models that all fit the data, and it is stable in that, with high probability, the removal of any training point will result in a collection of selected models that overlaps with the original collection. In addition to developing theoretical guarantees, we illustrate this method in (a) a simulation in which strongly correlated covariates make standard LASSO model selection highly unstable and (b) a Lotka-Volterra model selection problem focused on identifying how competition in an ecosystem influences species’ abundances. In both settings, the proposed method yields stable and compact collections of selected models, outperforming a variety of benchmarks.

[LG-78] Stochastic gradient descent in high dimensions for multi-spiked tensor PCA

链接: https://arxiv.org/abs/2410.18162
作者: Gérard Ben Arous,Cédric Gerbelot,Vanessa Piccolo
关键词-EN: online stochastic gradient, stochastic gradient descent, multi-spiked tensor model, high dimensions, dimensions of online
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
*备注: 58 pages, 10 figures. This is part of our manuscript arXiv:2408.06401

点击查看摘要

Abstract:We study the dynamics in high dimensions of online stochastic gradient descent for the multi-spiked tensor model. This multi-index model arises from the tensor principal component analysis (PCA) problem with multiple spikes, where the goal is to estimate r unknown signal vectors within the N -dimensional unit sphere through maximum likelihood estimation from noisy observations of a p -tensor. We determine the number of samples and the conditions on the signal-to-noise ratios (SNRs) required to efficiently recover the unknown spikes from natural random initializations. We show that full recovery of all spikes is possible provided a number of sample scaling as N^p-2 , matching the algorithmic threshold identified in the rank-one case [Ben Arous, Gheissari, Jagannath 2020, 2021]. Our results are obtained through a detailed analysis of a low-dimensional system that describes the evolution of the correlations between the estimators and the spikes, while controlling the noise in the dynamics. We find that the spikes are recovered sequentially in a process we term “sequential elimination”: once a correlation exceeds a critical threshold, all correlations sharing a row or column index become sufficiently small, allowing the next correlation to grow and become macroscopic. The order in which correlations become macroscopic depends on their initial values and the corresponding SNRs, leading to either exact recovery or recovery of a permutation of the spikes. In the matrix case, when p=2 , if the SNRs are sufficiently separated, we achieve exact recovery of the spikes, whereas equal SNRs lead to recovery of the subspace spanned by the spikes.

[LG-79] Using Platts scaling for calibration after undersampling – limitations and how to address them

链接: https://arxiv.org/abs/2410.18144
作者: Nathan Phelps,Daniel J. Lizotte,Douglas G. Woolford
关键词-EN: Platt scaling, Platt, training datasets prior, balanced training datasets, highly imbalanced
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:When modelling data where the response is dichotomous and highly imbalanced, response-based sampling where a subset of the majority class is retained (i.e., undersampling) is often used to create more balanced training datasets prior to modelling. However, the models fit to this undersampled data, which we refer to as base models, generate predictions that are severely biased. There are several calibration methods that can be used to combat this bias, one of which is Platt’s scaling. Here, a logistic regression model is used to model the relationship between the base model’s original predictions and the response. Despite its popularity for calibrating models after undersampling, Platt’s scaling was not designed for this purpose. Our work presents what we believe is the first detailed study focused on the validity of using Platt’s scaling to calibrate models after undersampling. We show analytically, as well as via a simulation study and a case study, that Platt’s scaling should not be used for calibration after undersampling without critical thought. If Platt’s scaling would have been able to successfully calibrate the base model had it been trained on the entire dataset (i.e., without undersampling), then Platt’s scaling might be appropriate for calibration after undersampling. If this is not the case, we recommend a modified version of Platt’s scaling that fits a logistic generalized additive model to the logit of the base model’s predictions, as it is both theoretically motivated and performed well across the settings considered in our study.

[LG-80] Generative Design of Functional Metal Complexes Utilizing the Internal Knowledge of Large Language Models

链接: https://arxiv.org/abs/2410.18136
作者: Jieyu Lu,Zhangde Song,Qiyuan Zhao,Yuanqi Du,Yirui Cao,Haojun Jia,Chenru Duan
关键词-EN: Designing functional transition, transition metal complexes, functional transition metal, faces challenges due, Designing functional
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Designing functional transition metal complexes (TMCs) faces challenges due to the vast search space of metals and ligands, requiring efficient optimization strategies. Traditional genetic algorithms (GAs) are commonly used, employing random mutations and crossovers driven by explicit mathematical objectives to explore this space. Transferring knowledge between different GA tasks, however, is difficult. We integrate large language models (LLMs) into the evolutionary optimization framework (LLM-EO) and apply it in both single- and multi-objective optimization for TMCs. We find that LLM-EO surpasses traditional GAs by leveraging the chemical knowledge of LLMs gained during their extensive pretraining. Remarkably, without supervised fine-tuning, LLMs utilize the full historical data from optimization processes, outperforming those focusing only on top-performing TMCs. LLM-EO successfully identifies eight of the top-20 TMCs with the largest HOMO-LUMO gaps by proposing only 200 candidates out of a 1.37 million TMCs space. Through prompt engineering using natural language, LLM-EO introduces unparalleled flexibility into multi-objective optimizations, thereby circumventing the necessity for intricate mathematical formulations. As generative models, LLMs can suggest new ligands and TMCs with unique properties by merging both internal knowledge and external chemistry data, thus combining the benefits of efficient optimization and molecular generation. With increasing potential of LLMs as pretrained foundational models and new post-training inference strategies, we foresee broad applications of LLM-based evolutionary optimization in chemistry and materials design.

[LG-81] OWPCP: A Deep Learning Model to Predict Octanol-Water Partition Coefficient

链接: https://arxiv.org/abs/2410.18118
作者: Mohammadjavad Maleki,Sobhan Zahiri
关键词-EN: including pharmaceuticals, physicochemical properties, environmental and separation, separation science, great importance
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: Preprint

点击查看摘要

Abstract:The physicochemical properties of chemical compounds have great importance in several areas, including pharmaceuticals, environmental and separation science. Among these are physicochemical properties such as the octanol-water partition coefficient, which has been considered an important index pointing out lipophilicity and hydrophilicity. It affects drug absorption and membrane permeability. Following Lipinski’s rule of five, logP was identified as one of the key determinants of the stability of chemical entities and, as such, needed state-of-the-art methods for measuring lipophilicity. This paper presents a deep-learning model, OWPCP, developed to compute logP using Morgan fingerprints and MACCS keys as input features. It uses the interconnection of such molecular representations with logP values extracted from 26,254 compounds. The dataset was prepared to contain a wide range of chemical structures with differing molecular weights and polar surface area. Hyperparameter optimization was conducted using the Keras Tuner alongside the Hyperband algorithm to enhance the performance. OWPCP demonstrated outstanding performance compared to current computational methods, achieving an MAE=0.247 on the test set and outperforming all previous DL models. Remarkably, while one of the most accurate recent models is based on experimental data on retention time to make predictions, OWPCP manages computing logP efficiently without depending on these factors, being, therefore, very useful during early-stage drug discovery. Our model outperforms the best model, which leverages Retention Time, and our model does not require any experimental data. Further validation of the model performance was done across different functional groups, and it showed very high accuracy, especially for compounds that contain aliphatic OH groups. The results have indicated that OWPCP provides a reliable prediction of logP.

[LG-82] A Case Study of Next Portfolio Prediction for Mutual Funds

链接: https://arxiv.org/abs/2410.18098
作者: Guilherme Thomaz,Denis Maua
关键词-EN: Mutual funds aim, mutual fund portfolio, market averages, mutual fund, aim to generate
类目: Portfolio Management (q-fin.PM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mutual funds aim to generate returns above market averages. While predicting their future portfolio allocations can bring economic advantages, the task remains challenging and largely unexplored. To fill that gap, this work frames mutual fund portfolio prediction as a Next Novel Basket Recommendation (NNBR) task, focusing on predicting novel items in a fund’s next portfolio. We create a comprehensive benchmark dataset using publicly available data and evaluate the performance of various recommender system models on the NNBR task. Our findings reveal that predicting novel items in mutual fund portfolios is inherently more challenging than predicting the entire portfolio or only repeated items. While state-of-the-art NBR models are outperformed by simple heuristics when considering both novel and repeated items together, autoencoder-based approaches demonstrate superior performance in predicting only new items. The insights gained from this study highlight the importance of considering domain-specific characteristics when applying recommender systems to mutual fund portfolio prediction. The performance gap between predicting the entire portfolio or repeated items and predicting novel items underscores the complexity of the NNBR task in this domain and the need for continued research to develop more robust and adaptable models for this critical financial application. Subjects: Portfolio Management (q-fin.PM); Machine Learning (cs.LG) Cite as: arXiv:2410.18098 [q-fin.PM] (or arXiv:2410.18098v1 [q-fin.PM] for this version) https://doi.org/10.48550/arXiv.2410.18098 Focus to learn more arXiv-issued DOI via DataCite

信息检索

附件下载

点击下载今日全部论文列表

Arxiv今日论文 | 2024-10-25

目录

概览 (2024-10-25)

自然语言处理

人工智能

计算机视觉

机器学习

信息检索

附件下载