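This post lists the latest papers retrieved from arXiv.org on 2024-10-22. It is updated automatically and groups papers into five broad areas: NLP, CV, ML, AI, and IR. To receive the daily paper list by email, leave your email address in the comments.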
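Overview (2024-10-22)
956 papers were updated today. A paper cross-listed in several categories is counted in each, so the per-category counts below add up to more than 956:
- Natural language processing: 200 papers (Computation and Language (cs.CL))
- Artificial intelligence: 306 papers (Artificial Intelligence (cs.AI))
- Computer vision: 204 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine learning: 329 papers (Machine Learning (cs.LG))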
Natural Language Processing
[NLP-0] xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
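Quick read: This paper addresses how to capture temporal information across multiple frames efficiently in multimodal language models for video. The key idea is a "temporal encoder" that works alongside the conventional visual tokenizer and maps the token sequence from multiple frames to a compact set of visual tokens, sharply reducing the number of visual tokens needed (for example, from 4608 to 32). With this design, BLIP-3-Video reaches video question-answering accuracy comparable to much larger models (for example, 34B parameters) while being much smaller (4B parameters) and more computationally efficient.
Link: https://arxiv.org/abs/2410.16267
Authors: Michael S. Ryoo, Honglu Zhou, Shrikant Kendre, Can Qin, Le Xue, Manli Shu, Silvio Savarese, Ran Xu, Caiming Xiong, Juan Carlos Niebles
Keywords: efficiently capture temporal, capture temporal information, multimodal language model, multiple frames, multimodal language
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: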
Abstract:We present xGen-MM-Vid (BLIP-3-Video): a multimodal language model for videos, particularly designed to efficiently capture temporal information over multiple frames. BLIP-3-Video takes advantage of the ‘temporal encoder’ in addition to the conventional visual tokenizer, which maps a sequence of tokens over multiple frames into a compact set of visual tokens. This enables BLIP3-Video to use much fewer visual tokens than its competing models (e.g., 32 vs. 4608 tokens). We explore different types of temporal encoders, including learnable spatio-temporal pooling as well as sequential models like Token Turing Machines. We experimentally confirm that BLIP-3-Video obtains video question-answering accuracies comparable to much larger state-of-the-art models (e.g., 34B), while being much smaller (i.e., 4B) and more efficient by using fewer visual tokens. The project website is at this https URL
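The token reduction described in the abstract can be illustrated with a minimal sketch. It assumes attention pooling with a fixed set of query vectors as a stand-in for the "learnable spatio-temporal pooling" the abstract mentions; the function name `attention_pool`, the embedding size, and the random initialisation are hypothetical and are not taken from the paper. Only the token counts 4608 and 32 come from the abstract.

```python
import math
import random

def attention_pool(tokens, queries):
    """Pool len(tokens) vectors into len(queries) vectors via softmax attention."""
    dim = len(tokens[0])
    pooled = []
    for q in queries:
        # Scaled dot-product score between this query and every input token.
        scores = [sum(qi * ti for qi, ti in zip(q, t)) / math.sqrt(dim) for t in tokens]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Each output token is a weighted average of all input tokens.
        pooled.append([sum(w * t[k] for w, t in zip(weights, tokens)) / total
                       for k in range(dim)])
    return pooled

random.seed(0)
dim = 8  # toy embedding size (assumption)
# 4608 frame-level visual tokens, the count quoted in the abstract.
frame_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(4608)]
# 32 query vectors: random here, learned during training in a real model.
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(32)]

video_tokens = attention_pool(frame_tokens, queries)
print(len(frame_tokens), "->", len(video_tokens))  # 4608 -> 32
```

The output length depends only on the number of queries, which is why the language model sees a fixed, small token budget regardless of how many frames are encoded.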
[NLP-1] CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution
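Quick read: This paper addresses the high cost and poor reproducibility of subjective, human-based evaluation of large language models (LLMs). The key contribution is CompassJudger-1, an open-source, all-in-one judge model that can perform single-response scoring and two-model comparison, evaluate according to a specified format, generate critiques, and carry out diverse tasks like a general LLM. The paper also establishes JudgerBench, a benchmark for assessing different judge models under a unified setting, covering a variety of subjective evaluation tasks and a wide range of topics. By open-sourcing both, the authors aim to foster collaboration and progress in LLM evaluation methods.
Link: https://arxiv.org/abs/2410.16256
Authors: Maosong Cao, Alexander Lam, Haodong Duan, Hongwei Liu, Songyang Zhang, Kai Chen
Keywords: Efficient and accurate, large language models, continuous improvement, improvement of large, large language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Technical Report, Code and Models: this https URL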
Abstract: Efficient and accurate evaluation is crucial for the continuous improvement of large language models (LLMs). Among various assessment methods, subjective evaluation has garnered significant attention due to its superior alignment with real-world usage scenarios and human preferences. However, human-based evaluations are costly and lack reproducibility, making precise automated evaluators (judgers) vital in this process. In this report, we introduce CompassJudger-1, the first open-source all-in-one judge LLM. CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility. It is capable of: 1. Performing unitary scoring and two-model comparisons as a reward model; 2. Conducting evaluations according to specified formats; 3. Generating critiques; 4. Executing diverse tasks like a general LLM. To assess the evaluation capabilities of different judge models under a unified setting, we have also established JudgerBench, a new benchmark that encompasses various subjective evaluation tasks and covers a wide range of topics. CompassJudger-1 offers a comprehensive solution for various evaluation tasks while maintaining the flexibility to adapt to diverse requirements. Both CompassJudger and JudgerBench are released and available to the research community at https://github.com/open-compass/CompassJudger. We believe that by open-sourcing these tools, we can foster collaboration and accelerate progress in LLM evaluation methodologies.
[NLP-2] Can Knowledge Editing Really Correct Hallucinations?
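Quick read: This paper asks whether knowledge editing methods are actually effective at correcting hallucinations in large language models (LLMs). A common problem with existing evaluation datasets is that they do not ensure the LLM actually produces a hallucinated answer before editing, which makes it hard to assess directly how well different editing methods correct hallucinations. The key contribution is HalluEditBench, a comprehensive benchmark built on a large hallucination dataset spanning 9 domains, 26 topics, and more than 6,000 hallucinations. It evaluates knowledge editing methods along five dimensions (Efficacy, Generalization, Portability, Locality, and Robustness) and offers new insights into the potential and limitations of these methods for correcting hallucinations.
Link: https://arxiv.org/abs/2410.16251
Authors: Baixiang Huang, Canyu Chen, Xiongxiao Xu, Ali Payani, Kai Shu
Keywords: Large Language Models, Large Language, Language Models, knowledge editing methods, knowledge editing
Categories: Computation and Language (cs.CL)
Comments: The first two authors contributed equally to this work. The main paper is 10 pages long, with 35 pages total. The code, results, dataset, and additional resources are available on the project website: this https URL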
Abstract:Large Language Models (LLMs) suffer from hallucinations, referring to the non-factual information in generated content, despite their superior capacities across tasks. Meanwhile, knowledge editing has been developed as a new popular paradigm to correct the erroneous factual knowledge encoded in LLMs with the advantage of avoiding retraining from scratch. However, one common issue of existing evaluation datasets for knowledge editing is that they do not ensure LLMs actually generate hallucinated answers to the evaluation questions before editing. When LLMs are evaluated on such datasets after being edited by different techniques, it is hard to directly adopt the performance to assess the effectiveness of different knowledge editing methods in correcting hallucinations. Thus, the fundamental question remains insufficiently validated: Can knowledge editing really correct hallucinations in LLMs? We proposed HalluEditBench to holistically benchmark knowledge editing methods in correcting real-world hallucinations. First, we rigorously construct a massive hallucination dataset with 9 domains, 26 topics and more than 6,000 hallucinations. Then, we assess the performance of knowledge editing methods in a holistic way on five dimensions including Efficacy, Generalization, Portability, Locality, and Robustness. Through HalluEditBench, we have provided new insights into the potentials and limitations of different knowledge editing methods in correcting hallucinations, which could inspire future improvements and facilitate the progress in the field of knowledge editing.
[NLP-3] Analyzing Context Contributions in LLM-based Machine Translation
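Quick read: This paper studies how large language models (LLMs) use the input context in machine translation (MT). It provides a comprehensive analysis of how LLMs draw on different context parts, such as few-shot examples and the source text, when translating, and reports several key findings: 1) the source side of few-shot examples contributes more than the target side, regardless of translation direction; 2) fine-tuning LLMs on parallel data changes the contribution patterns of the different context parts; 3) there is a positional bias, with earlier few-shot examples contributing more to the translated sequence. In addition, inspecting anomalous context contributions can reveal pathological translations such as hallucinations. These findings shed light on the internal workings of LLM-based MT beyond what is known for standard encoder-decoder models.
Link: https://arxiv.org/abs/2410.16246
Authors: Emmanouil Zaranis, Nuno M. Guerreiro, André F. T. Martins
Keywords: Large language models, leverage in-context learning, Large language, performance in machine, demonstrated the ability
Categories: Computation and Language (cs.CL)
Comments: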
Abstract:Large language models (LLMs) have achieved state-of-the-art performance in machine translation (MT) and demonstrated the ability to leverage in-context learning through few-shot examples. However, the mechanisms by which LLMs use different parts of the input context remain largely unexplored. In this work, we provide a comprehensive analysis of context utilization in MT, studying how LLMs use various context parts, such as few-shot examples and the source text, when generating translations. We highlight several key findings: (1) the source part of few-shot examples appears to contribute more than its corresponding targets, irrespective of translation direction; (2) finetuning LLMs with parallel data alters the contribution patterns of different context parts; and (3) there is a positional bias where earlier few-shot examples have higher contributions to the translated sequence. Finally, we demonstrate that inspecting anomalous context contributions can potentially uncover pathological translations, such as hallucinations. Our findings shed light on the internal workings of LLM-based MT which go beyond those known for standard encoder-decoder MT models.
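[NLP-4] ToW: Thoughts of Words Improve Reasoning in Large Language Models
Quick read: This paper addresses two fundamental drawbacks of existing next-word prediction training: it induces factual hallucination, and it is an inefficient way for models to learn the reasoning implicit in raw text. The key idea is Thoughts of Words (ToW), a novel training-time data-augmentation method. ToW injects fine-grained thoughts into pre-training text that explain what the next word should be and how it relates to the preceding context, which improves reasoning and reduces hallucination. In practice, the paper obtains ToW annotations by distilling them from larger models; continual pre-training on only 70K ToW annotations improves reasoning performance by 7% to 9% on average and reduces hallucination by up to 10%.
Link: https://arxiv.org/abs/2410.16235
Authors: Zhikun Xu, Ming Shen, Jacob Dineen, Zhaonan Li, Xiao Ye, Shijie Lu, Aswin RRV, Chitta Baral, Ben Zhou
Keywords: training-time data-augmentation method, next-word prediction, training-time data-augmentation, data-augmentation method, views next-word prediction
Categories: Computation and Language (cs.CL)
Comments: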
Abstract:We introduce thoughts of words (ToW), a novel training-time data-augmentation method for next-word prediction. ToW views next-word prediction as a core reasoning task and injects fine-grained thoughts explaining what the next word should be and how it is related to the previous contexts in pre-training texts. Our formulation addresses two fundamental drawbacks of existing next-word prediction learning schemes: they induce factual hallucination and are inefficient for models to learn the implicit reasoning processes in raw texts. While there are many ways to acquire such thoughts of words, we explore the first step of acquiring ToW annotations through distilling from larger models. After continual pre-training with only 70K ToW annotations, we effectively improve models’ reasoning performances by 7% to 9% on average and reduce model hallucination by up to 10%. At the same time, ToW is entirely agnostic to tasks and applications, introducing no additional biases on labels or semantics.
[NLP-5] Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping
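Quick read: This paper addresses the reliance of UI/UX design automation on high-fidelity inputs, and in particular how to turn rough sketches into webpage prototypes quickly. The key contribution is Sketch2Code, a benchmark that evaluates state-of-the-art vision-language models (VLMs) on converting sketches into webpage prototypes. It also supports interactive agent evaluation that mimics real design workflows: a VLM-based agent iteratively refines its output by communicating with a simulated user, either passively receiving feedback or proactively asking clarification questions. The results show that existing VLMs struggle with sketch conversion, while a user study finds that UI/UX experts prefer agents that ask questions proactively, which underlines the need for more effective multi-turn conversational agent paradigms.
Link: https://arxiv.org/abs/2410.16232
Authors: Ryan Li, Yanzhe Zhang, Diyi Yang
Keywords: conceptualize early-stage ideas, early-stage ideas, natural and accessible, accessible medium, designers to conceptualize
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: preprint, 9 pages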
Abstract:Sketches are a natural and accessible medium for UI designers to conceptualize early-stage ideas. However, existing research on UI/UX automation often requires high-fidelity inputs like Figma designs or detailed screenshots, limiting accessibility and impeding efficient design iteration. To bridge this gap, we introduce Sketch2Code, a benchmark that evaluates state-of-the-art Vision Language Models (VLMs) on automating the conversion of rudimentary sketches into webpage prototypes. Beyond end-to-end benchmarking, Sketch2Code supports interactive agent evaluation that mimics real-world design workflows, where a VLM-based agent iteratively refines its generations by communicating with a simulated user, either passively receiving feedback instructions or proactively asking clarification questions. We comprehensively analyze ten commercial and open-source models, showing that Sketch2Code is challenging for existing VLMs; even the most capable models struggle to accurately interpret sketches and formulate effective questions that lead to steady improvement. Nevertheless, a user study with UI/UX experts reveals a significant preference for proactive question-asking over passive feedback reception, highlighting the need to develop more effective paradigms for multi-turn conversational agents.
[NLP-6] Building A Coding Assistant via the Retrieval-Augmented Language Model
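Quick read: This paper addresses how to build an effective code assistant for code-related tasks by mimicking the knowledge-seeking behavior of humans during coding. The key contribution is COde assistaNt viA retrieval-augmeNted language model (CONAN), which consists of a code-structure-aware retriever (CONAN-R) and a retrieval-augmented generation model based on dual-view code representations (CONAN-G). CONAN-R pretrains CodeT5 with Code-Documentation Alignment and Masked Entity Prediction tasks so that the language model becomes aware of code structure and learns effective representations of code snippets and documentation. CONAN-G uses a dual-view code representation mechanism that treats code documentation descriptions as prompts, helping the language model understand code semantics for retrieval-augmented code generation. Experiments show that CONAN performs well across code generation tasks and significantly outperforms previous retrieval-augmented code generation models.
Link: https://arxiv.org/abs/2410.16229
Authors: Xinze Li, Hanbin Wang, Zhenghao Liu, Shi Yu, Shuo Wang, Shuo Wang, Yukun Yan, Yukai Fu, Yu Gu, Ge Yu
Keywords: Pretrained language models, code, Pretrained language, language models, code generation
Categories: Computation and Language (cs.CL)
Comments: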
Abstract:Pretrained language models have shown strong effectiveness in code-related tasks, such as code retrieval, code generation, code summarization, and code completion tasks. In this paper, we propose COde assistaNt viA retrieval-augmeNted language model (CONAN), which aims to build a code assistant by mimicking the knowledge-seeking behaviors of humans during coding. Specifically, it consists of a code structure aware retriever (CONAN-R) and a dual-view code representation-based retrieval-augmented generation model (CONAN-G). CONAN-R pretrains CodeT5 using Code-Documentation Alignment and Masked Entity Prediction tasks to make language models code structure-aware and learn effective representations for code snippets and documentation. Then CONAN-G designs a dual-view code representation mechanism for implementing a retrieval-augmented code generation model. CONAN-G regards the code documentation descriptions as prompts, which help language models better understand the code semantics. Our experiments show that CONAN achieves convincing performance on different code generation tasks and significantly outperforms previous retrieval augmented code generation models. Our further analyses show that CONAN learns tailored representations for both code snippets and documentation by aligning code-documentation data pairs and capturing structural semantics by masking and predicting entities in the code data. Additionally, the retrieved code snippets and documentation provide necessary information from both program language and natural language to assist the code generation process. CONAN can also be used as an assistant for Large Language Models (LLMs), providing LLMs with external knowledge in shorter code document lengths to improve their effectiveness on various code tasks. It shows the ability of CONAN to extract necessary information and help filter out the noise from retrieved code documents.
[NLP-7] On Creating an English-Thai Code-switched Machine Translation in Medical Domain
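Quick read: This paper addresses inaccurate translation of terminology in medical-domain machine translation. The key idea is code-switched (CS) translation, which keeps medical terms in English within the translated text to improve accuracy and professional usability. The authors generate CS medical translation data, fine-tune a CS translation model on it, and show in automatic and human preference evaluations that it is competitive with strong baselines such as Google NMT and GPT models. Medical professionals in particular significantly prefer CS translations for their terminological accuracy.
Link: https://arxiv.org/abs/2410.16221
Authors: Parinthapat Pengpun, Krittamate Tiankanon, Amrest Chinkamol, Jiramet Kinchagawat, Pitchaya Chairuengjitjaras, Pasit Supholkhan, Pubordee Aussavavirojekul, Chiraphat Boonnag, Kanyakorn Veerakanjana, Hirunkul Phimsiri, Boonthicha Sae-jia, Nattawach Sataudom, Piyalitt Ittichaiwong, Peerat Limkonchotiwat
Keywords: enhancing healthcare quality, disseminating medical knowledge, medical domain plays, Google Neural Machine, domain plays
Categories: Computation and Language (cs.CL)
Comments: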
Abstract:Machine translation (MT) in the medical domain plays a pivotal role in enhancing healthcare quality and disseminating medical knowledge. Despite advancements in English-Thai MT technology, common MT approaches often underperform in the medical field due to their inability to precisely translate medical terminologies. Our research prioritizes not merely improving translation accuracy but also maintaining medical terminology in English within the translated text through code-switched (CS) translation. We developed a method to produce CS medical translation data, fine-tuned a CS translation model with this data, and evaluated its performance against strong baselines, such as Google Neural Machine Translation (NMT) and GPT-3.5/GPT-4. Our model demonstrated competitive performance in automatic metrics and was highly favored in human preference evaluations. Our evaluation result also shows that medical professionals significantly prefer CS translations that maintain critical English terms accurately, even if it slightly compromises fluency. Our code and test set are publicly available this https URL.
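[NLP-8] Pre-training Distillation for Large Language Models: A Design Space Exploration
Quick read: This paper studies how to apply knowledge distillation (KD) during the pre-training phase of large language models (LLMs), which it calls pre-training distillation (PD). The key contribution is a systematic exploration of the PD design space across four aspects: logits processing, loss selection, scaling law, and offline versus online logits. The experiments find that larger student models generally benefit more from pre-training distillation, while a larger teacher model does not necessarily give better results. The study is intended as guidance for future practice in pre-training distillation.
Link: https://arxiv.org/abs/2410.16215
Authors: Hao Peng, Xin Lv, Yushi Bai, Zijun Yao, Jiajie Zhang, Lei Hou, Juanzi Li
Keywords: transfer knowledge, smaller student model, large teacher model, aims to transfer, Knowledge distillation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: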
Abstract:Knowledge distillation (KD) aims to transfer knowledge from a large teacher model to a smaller student model. Previous work applying KD in the field of large language models (LLMs) typically focused on the post-training phase, where the student LLM learns directly from instructions and corresponding responses generated by the teacher model. In this paper, we extend KD to the pre-training phase of LLMs, named pre-training distillation (PD). We first conduct a preliminary experiment using GLM-4-9B as the teacher LLM to distill a 1.9B parameter student LLM, validating the effectiveness of PD. Considering the key impact factors of distillation, we systematically explore the design space of pre-training distillation across four aspects: logits processing, loss selection, scaling law, and offline or online logits. We conduct extensive experiments to explore the design space of pre-training distillation and find better configurations and interesting conclusions, such as larger student LLMs generally benefiting more from pre-training distillation, while a larger teacher LLM does not necessarily guarantee better results. We hope our exploration of the design space will inform future practices in pre-training distillation.
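To make "logits processing" and "loss selection" concrete, here is a minimal sketch of one possible distillation loss for a single token position. It is an assumption-laden illustration, not the paper's configuration: the top-k truncation, the temperature, the KL-divergence objective, and the helper names `truncate_top_k` and `kd_loss` are all hypothetical choices.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax; entries equal to -inf get probability 0."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def truncate_top_k(logits, k):
    """Keep the k largest logits and mask the rest (ties at the cut-off are kept)."""
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else float("-inf") for x in logits]

def kd_loss(teacher_logits, student_logits, k=2, temperature=1.0):
    """KL(teacher || student) over the vocabulary for one token position."""
    p_teacher = softmax(truncate_top_k(teacher_logits, k), temperature)
    p_student = softmax(student_logits, temperature)
    # Terms with zero teacher probability contribute nothing to the KL sum.
    return sum(pt * (math.log(pt) - math.log(ps))
               for pt, ps in zip(p_teacher, p_student) if pt > 0.0)

# Toy logits over a 4-word vocabulary (made-up numbers).
teacher = [4.0, 2.0, 0.5, -1.0]
student = [3.0, 2.5, 0.0, -0.5]
print(kd_loss(teacher, student, k=2))
```

Storing only the top-k teacher logits is one way offline logits could be kept small on disk; whether that matches the paper's offline setting is not stated in the abstract.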
[NLP-9] Compute-Constrained Data Selection
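Quick read: This paper addresses how to select data efficiently for fine-tuning large language models (LLMs) when compute is constrained. The key idea is a cost-aware utility function that models data selection as a trade-off between the initial selection cost and the training gain. Extensive experiments across multiple tasks validate the model and show that many powerful data selection methods are almost never compute-optimal, while cheaper alternatives dominate both theoretically and empirically.
Link: https://arxiv.org/abs/2410.16208
Authors: Junjie Oscar Yin, Alexander M. Rush
Keywords: Data selection, selection scales directly, training data needed, data selection scales, Data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: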
Abstract:Data selection can reduce the amount of training data needed to finetune LLMs; however, the efficacy of data selection scales directly with its compute. Motivated by the practical challenge of compute-constrained finetuning, we consider the setting in which both the cost of selecting data and training are budgeted for. We first formalize the problem of data selection with a cost-aware utility function, and model the data selection problem as trading off initial-selection cost for training gain. We run a comprehensive sweep of experiments across multiple tasks, varying compute budget by scaling finetuning tokens, model sizes, and data selection compute. These experiments show the validity of this model in real-world experiments. Interestingly we find that many powerful data selection methods are almost never compute-optimal, and that cheaper data selection alternatives dominate both from a theoretical and empirical perspective.
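The trade-off between selection cost and training gain can be sketched with a toy utility function. Everything in this example is an assumption made for illustration: the logarithmic form of the gain, the method names, and all numbers are invented and do not come from the paper.

```python
import math

def utility(budget, selection_cost, quality):
    """Toy cost-aware utility: compute left after selection is spent on training."""
    training_compute = budget - selection_cost
    if training_compute <= 0:
        return 0.0  # selection alone exhausts the budget
    # Diminishing returns in training compute, scaled by data quality.
    return quality * math.log1p(training_compute)

# Hypothetical methods: (selection cost, data-quality multiplier).
methods = {
    "random": (0.0, 1.0),
    "cheap_filter": (5.0, 1.15),
    "model_based": (400.0, 1.3),
}

def best_method(budget):
    """Return the method with the highest toy utility under the given budget."""
    return max(methods, key=lambda name: utility(budget, *methods[name]))

for budget in (100, 100_000):
    print(budget, best_method(budget))
```

With these made-up numbers the expensive method only pays off once the total budget is far larger than its selection cost, which is the kind of crossover a cost-aware formulation is meant to expose.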
[NLP-10] CoT-TL: Low-Resource Temporal Knowledge Representation of Planning Instructions Using Chain-of-Thought Reasoning IROS2024
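Quick read: This paper addresses how autonomous agents can interpret uncertain natural language instructions for planning tasks and turn them into executable plans. The key contribution is the CoT-TL framework, which translates natural language instructions into Linear Temporal Logic (LTL) using data-efficient in-context learning, extending chain-of-thought reasoning and semantic roles to meet the requirements of formal logic generation. The approach makes LTL generation more transparent and better justified, and it uses model checking to validate the syntax of the LTL output. It achieves state-of-the-art accuracy in low-data scenarios without large-scale fine-tuning or intermediate translations.
Link: https://arxiv.org/abs/2410.16207
Authors: Kumar Manas, Stefan Zwicklbauer, Adrian Paschke
Keywords: Autonomous agents, interpreting uncertain natural, Linear Temporal Logic, agents often face, face the challenge
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
Comments: Accepted for publication in Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024), Abu Dhabi 14-18 October 2024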
Abstract:Autonomous agents often face the challenge of interpreting uncertain natural language instructions for planning tasks. Representing these instructions as Linear Temporal Logic (LTL) enables planners to synthesize actionable plans. We introduce CoT-TL, a data-efficient in-context learning framework for translating natural language specifications into LTL representations. CoT-TL addresses the limitations of large language models, which typically rely on extensive fine-tuning data, by extending chain-of-thought reasoning and semantic roles to align with the requirements of formal logic creation. This approach enhances the transparency and rationale behind LTL generation, fostering user trust. CoT-TL achieves state-of-the-art accuracy across three diverse datasets in low-data scenarios, outperforming existing methods without fine-tuning or intermediate translations. To improve reliability and minimize hallucinations, we incorporate model checking to validate the syntax of the generated LTL output. We further demonstrate CoT-TL’s effectiveness through ablation studies and evaluations on unseen LTL structures and formulas in a new dataset. Finally, we validate CoT-TL’s practicality by integrating it into a QuadCopter for multi-step drone planning based on natural language instructions.
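The idea of rejecting syntactically invalid LTL before it reaches a planner can be sketched as follows. This is a plain recursive-descent syntax check, not the model checker the paper uses, and the grammar is an assumption: operators `G`, `F`, `X`, `U`, `!`, `&`, `|`, `->`, parentheses, and lowercase atomic propositions.

```python
import re

TOKEN = re.compile(r"\s*(->|[GFXU!&|()]|[a-z_][a-z0-9_]*)")
ATOM = re.compile(r"[a-z_][a-z0-9_]*")

def tokenize(text):
    """Split a formula into tokens; raise ValueError on unknown characters."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def is_valid_ltl(text):
    """Return True if text parses under the toy grammar described above."""
    try:
        tokens = tokenize(text)
    except ValueError:
        return False
    pos = 0

    def unary():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("formula ended early")
        tok = tokens[pos]
        pos += 1
        if tok in ("!", "G", "F", "X"):   # prefix operators
            unary()
        elif tok == "(":
            formula()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
        elif not ATOM.fullmatch(tok):     # anything else must be an atom
            raise ValueError(f"unexpected token {tok!r}")

    def formula():
        nonlocal pos
        unary()
        if pos < len(tokens) and tokens[pos] in ("&", "|", "->", "U"):
            pos += 1                      # binary operators, right-associative
            formula()

    try:
        formula()
    except ValueError:
        return False
    return pos == len(tokens)

print(is_valid_ltl("G (request -> F grant)"))  # True
print(is_valid_ltl("G (request U"))            # False
```

Because the uppercase letters `G`, `F`, `X`, and `U` are reserved as operators, atomic propositions must be lowercase in this sketch.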
[NLP-11] Systematic Review: Text Processing Algorithms in Machine Learning and Deep Learning for Mental Health Detection on Social Media
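Quick read: This systematic review examines the biases and methodological challenges in machine learning methods for detecting depression on social media. Its key recommendations are to: 1) diversify data sources and reduce the reliance on Twitter and English-language content; 2) standardize data preprocessing protocols so that linguistic nuances such as negation are handled correctly; 3) unify model development practices, including hyperparameter tuning and proper dataset partitioning; 4) address class imbalance and use appropriate evaluation metrics; 5) improve reporting transparency by documenting methodological details. Overcoming these challenges would allow more robust and generalizable models and more accurate depression detection worldwide.
Link: https://arxiv.org/abs/2410.16204
Authors: Yuchen Cao, Jianglai Dai, Zhongyan Wang, Yeyubei Zhang, Xiaorui Shen, Yunchong Liu, Yexin Tian
Keywords: depression necessitates innovative, necessitates innovative detection, early intervention, global rise, necessitates innovative
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: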
Abstract:The global rise in depression necessitates innovative detection methods for early intervention. Social media provides a unique opportunity to identify depression through user-generated posts. This systematic review evaluates machine learning (ML) models for depression detection on social media, focusing on biases and methodological challenges throughout the ML lifecycle. A search of PubMed, IEEE Xplore, and Google Scholar identified 47 relevant studies published after 2010. The Prediction model Risk Of Bias ASsessment Tool (PROBAST) was utilized to assess methodological quality and risk of bias. Significant biases impacting model reliability and generalizability were found. There is a predominant reliance on Twitter (63.8%) and English-language content (over 90%), with most studies focusing on users from the United States and Europe. Non-probability sampling methods (approximately 80%) limit representativeness. Only 23% of studies explicitly addressed linguistic nuances like negations, crucial for accurate sentiment analysis. Inconsistent hyperparameter tuning was observed, with only 27.7% properly tuning models. About 17% did not adequately partition data into training, validation, and test sets, risking overfitting. While 74.5% used appropriate evaluation metrics for imbalanced data, others relied on accuracy without addressing class imbalance, potentially skewing results. Reporting transparency varied, often lacking critical methodological details. These findings highlight the need to diversify data sources, standardize preprocessing protocols, ensure consistent model development practices, address class imbalance, and enhance reporting transparency. By overcoming these challenges, future research can develop more robust and generalizable ML models for depression detection on social media, contributing to improved mental health outcomes globally.
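The point about relying on accuracy under class imbalance can be shown with a small worked example. The data are a made-up toy set (10 positive and 90 negative users), and the "classifier" simply predicts the negative class for everyone; none of these numbers come from the review.

```python
def accuracy_and_f1(y_true, y_pred):
    """Compute accuracy and the F1 score of the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, f1

# Toy imbalanced data: 10 depressed users (1) and 90 non-depressed users (0).
y_true = [1] * 10 + [0] * 90
always_negative = [0] * 100  # a model that never flags anyone

acc, f1 = accuracy_and_f1(y_true, always_negative)
print(f"accuracy={acc:.2f} f1={f1:.2f}")  # accuracy=0.90 f1=0.00
```

A model that detects no one still scores 90% accuracy here, while its F1 score of 0 exposes the failure, which is why metrics suited to imbalanced data matter for this task.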
[NLP-12] Information for Conversation Generation: Proposals Utilising Knowledge Graphs ISWC2024
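Quick read: This paper addresses the lower-quality responses that large language models (LLMs) produce in conversation generation because of missing relevant content, hallucinations, perceived poor emotional capability, and an inability to keep a character consistent. The key idea is to use knowledge graphs to enhance LLM generation, through three proposals: 1) dynamic knowledge graph embeddings and recommendation, to integrate new information and select relevant knowledge for response generation; 2) storing entities with emotional values as additional features, to align better with the emotion of the user input; 3) integrating character information through narrative bubbles, to maintain character consistency and make it easy to incorporate new information.
Link: https://arxiv.org/abs/2410.16196
Authors: Alex Clay, Ernesto Jiménez-Ruiz
Keywords: frequently used tools, tools for conversational, Knowledge, conversational generation, Abstract
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 7 pages with citations, 1 figure, accepted to the ISWC 2024 Special Session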
Abstract:LLMs are frequently used tools for conversational generation. Without additional information LLMs can generate lower quality responses due to lacking relevant content and hallucinations, as well as the perception of poor emotional capability, and an inability to maintain a consistent character. Knowledge graphs are commonly used forms of external knowledge and may provide solutions to these challenges. This paper introduces three proposals, utilizing knowledge graphs to enhance LLM generation. Firstly, dynamic knowledge graph embeddings and recommendation could allow for the integration of new information and the selection of relevant knowledge for response generation. Secondly, storing entities with emotional values as additional features may provide knowledge that is better emotionally aligned with the user input. Thirdly, integrating character information through narrative bubbles would maintain character consistency, as well as introducing a structure that would readily incorporate new information.
[NLP-13] Contamination Report for Multilingual Benchmarks
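[Quick Read]: This paper addresses contamination of multilingual benchmarks in large language models (LLMs): test datasets may be present in a model's pre-training or post-training data, which distorts evaluation results. The approach uses a black-box test to check whether 7 frequently used multilingual benchmarks are contaminated in 7 popular open and closed LLMs, and finds that almost all models show signs of contamination with almost all of the benchmarks tested. The findings can help the community choose suitable multilingual benchmarks so that evaluation results stay accurate and fair.
Link: https://arxiv.org/abs/2410.16186
Authors: Sanchit Ahuja,Varun Gumma,Sunayana Sitaram
Keywords: Large Language Model, datasets in Large, Large Language, pre-training or post-training, post-training data
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 2 tables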
Abstract:Benchmark contamination refers to the presence of test datasets in Large Language Model (LLM) pre-training or post-training data. Contamination can lead to inflated scores on benchmarks, compromising evaluation results and making it difficult to determine the capabilities of models. In this work, we study the contamination of popular multilingual benchmarks in LLMs that support multiple languages. We use the Black Box test to determine whether 7 frequently used multilingual benchmarks are contaminated in 7 popular open and closed LLMs and find that almost all models show signs of being contaminated with almost all the benchmarks we test. Our findings can help the community determine the best set of benchmarks to use for multilingual evaluation.
[NLP-14] RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style
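[Quick Read]: This paper addresses the failure of existing reward model benchmarks to account for subtle content differences and style biases when evaluating models. The key contribution is RM-Bench, a new benchmark that measures reward models more accurately by testing their sensitivity to subtle content differences and their resistance to style biases. Experiments show that RM-Bench correlates strongly with policy model performance, making it a reliable reference for selecting reward models to align language models.
Link: https://arxiv.org/abs/2410.16184
Authors: Yantao Liu,Zijun Yao,Rui Min,Yixin Cao,Lei Hou,Juanzi Li
Keywords: Inference Scaling Laws, Human Feedback, Scaling Laws, Reinforcement Learning, Learning from Human
Categories: Computation and Language (cs.CL)
Comments: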
Abstract:Reward models are critical in techniques like Reinforcement Learning from Human Feedback (RLHF) and Inference Scaling Laws, where they guide language model alignment and select optimal responses. Despite their importance, existing reward model benchmarks often evaluate models by asking them to distinguish between responses generated by models of varying power. However, this approach fails to assess reward models on subtle but critical content changes and variations in style, resulting in a low correlation with policy model performance. To this end, we introduce RM-Bench, a novel benchmark designed to evaluate reward models based on their sensitivity to subtle content differences and resistance to style biases. Extensive experiments demonstrate that RM-Bench strongly correlates with policy model performance, making it a reliable reference for selecting reward models to align language models effectively. We evaluate nearly 40 reward models on RM-Bench. Our results reveal that even state-of-the-art models achieve an average performance of only 46.6%, which falls short of random-level accuracy (50%) when faced with style bias interference. These findings highlight the significant room for improvement in current reward models. Related code and data are available at this https URL.
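The comparison between 46.6% and a 50% random baseline suggests a pairwise metric: a reward model is counted as correct when it scores the preferred response above the rejected one. A minimal sketch of such a metric follows. The `length_biased_score` function is a hypothetical stand-in for a style-biased reward model, and the toy pairs are invented for illustration; neither comes from RM-Bench.

```python
from typing import Callable, Sequence, Tuple


def pairwise_accuracy(
    score: Callable[[str, str], float],
    pairs: Sequence[Tuple[str, str, str]],
) -> float:
    """Fraction of (prompt, chosen, rejected) triples where chosen outscores rejected."""
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(1 for p, c, r in pairs if score(p, c) > score(p, r))
    return correct / len(pairs)


# Hypothetical stand-in for a style-biased reward model:
# it ignores correctness and simply prefers longer answers.
def length_biased_score(prompt: str, response: str) -> float:
    return float(len(response))


# Toy data (invented): (prompt, chosen, rejected).
toy_pairs = [
    ("2+2?", "4", "The answer is definitely 5, as is well known."),
    ("Capital of France?", "Paris is the capital of France.", "Lyon"),
]

acc = pairwise_accuracy(length_biased_score, toy_pairs)
print(acc)
```

In the first pair the verbose wrong answer wins, so a scorer driven by style alone lands at chance level on this toy set. That is the kind of failure a style-controlled benchmark is meant to surface.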
[NLP-15] MagicPIG: LSH Sampling for Efficient LLM Generation
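[Quick Read]: This paper addresses the KV cache becoming a bottleneck when large language models (LLMs) serve long context windows. The key contribution is MagicPIG, a heterogeneous system based on Locality Sensitive Hashing (LSH) that approximates the attention output by sampling keys and values rather than selecting those with the highest attention scores, which substantially reduces the attention workload while preserving accuracy. MagicPIG stores the LSH hash tables and runs the attention computation on the CPU, so it can serve longer contexts and larger batch sizes while raising decoding throughput and lowering latency.
Link: https://arxiv.org/abs/2410.16179
Authors: Zhuoming Chen,Ranajoy Sadhukhan,Zihao Ye,Yang Zhou,Jianyu Zhang,Niklas Nolte,Yuandong Tian,Matthijs Douze,Leon Bottou,Zhihao Jia,Beidi Chen
Keywords: Large language models, Large language, gained significant attention, long context windows, windows have gained
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: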
Abstract:Large language models (LLMs) with long context windows have gained significant attention. However, the KV cache, stored to avoid re-computation, becomes a bottleneck. Various dynamic sparse or TopK-based attention approximation methods have been proposed to leverage the common insight that attention is sparse. In this paper, we first show that TopK attention itself suffers from quality degradation in certain downstream tasks because attention is not always as sparse as expected. Rather than selecting the keys and values with the highest attention scores, sampling with theoretical guarantees can provide a better estimation for attention output. To make the sampling-based approximation practical in LLM generation, we propose MagicPIG, a heterogeneous system based on Locality Sensitive Hashing (LSH). MagicPIG significantly reduces the workload of attention computation while preserving high accuracy for diverse tasks. MagicPIG stores the LSH hash tables and runs the attention computation on the CPU, which allows it to serve longer contexts and larger batch sizes with high approximation accuracy. MagicPIG can improve decoding throughput by 1.9x to 3.9x across various GPU hardware and achieve 110ms decoding latency on a single RTX 4090 for Llama-3.1-8B-Instruct model with a context of 96k tokens. The code is available at this https URL.
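The idea of retrieving keys through LSH instead of ranking all of them can be sketched with SimHash (random-hyperplane hashing). This is a simplified illustration under stated assumptions, not the MagicPIG system: the function names, table sizes, and random data are hypothetical, and the sketch leaves out any reweighting of the sampled keys that a sampling-based estimator with guarantees would need.

```python
import math
import random


def simhash(vec, planes):
    """Sign pattern of vec against a set of random hyperplanes."""
    return tuple(sum(v * p for v, p in zip(vec, plane)) >= 0 for plane in planes)


def build_tables(keys, dim, n_tables, n_bits, rng):
    """One hash table per hyperplane set: signature -> list of key indices."""
    tables = []
    for _ in range(n_tables):
        planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
        buckets = {}
        for i, key in enumerate(keys):
            buckets.setdefault(simhash(key, planes), []).append(i)
        tables.append((planes, buckets))
    return tables


def lsh_candidates(query, tables):
    """Indices of keys that collide with the query in at least one table."""
    found = set()
    for planes, buckets in tables:
        found.update(buckets.get(simhash(query, planes), []))
    return sorted(found)


def attention(query, keys, values, idx):
    """Softmax attention restricted to the key/value rows listed in idx."""
    if not idx:
        raise ValueError("no candidate keys")
    scores = [sum(q * k for q, k in zip(query, keys[i])) for i in idx]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * values[i][d] for w, i in zip(weights, idx)) / total
            for d in range(len(values[0]))]


# Hypothetical toy setup: 200 random keys/values of dimension 8.
rng = random.Random(0)
dim, n = 8, 200
keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
values = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
query = list(keys[17])  # a query identical to key 17, so they always collide

tables = build_tables(keys, dim, n_tables=4, n_bits=6, rng=rng)
idx = lsh_candidates(query, tables)
approx = attention(query, keys, values, idx)
exact = attention(query, keys, values, list(range(n)))
print(len(idx), "of", n, "keys visited")
```

Only the keys that share a bucket with the query are scored, so the cost of one decoding step depends on the number of collisions rather than on the full context length.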
[NLP-16] Exploring Pretraining via Active Forgetting for Improving Cross Lingual Transfer for Decoder Language Models
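[Quick Read]: This paper addresses the limited effectiveness of large language models (LLMs) on languages other than English. The key contribution is a pretraining strategy that uses active forgetting to achieve cross-lingual transfer in decoder-only LLMs. Models pretrained this way learn better multilingual representations and therefore perform better when adapting to new languages and on downstream tasks.
Link: https://arxiv.org/abs/2410.16168
Authors: Divyanshu Aggarwal,Ashutosh Sathe,Sunayana Sitaram
Keywords: Large Language Models, multitude of NLP, demonstrate exceptional capabilities, Large Language, NLP tasks
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 11 tables, 12 figures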
Abstract:Large Language Models (LLMs) demonstrate exceptional capabilities in a multitude of NLP tasks. However, the efficacy of such models to languages other than English is often limited. Prior works have shown that encoder-only models such as BERT or XLM-RoBERTa show impressive cross lingual transfer of their capabilities from English to other languages. In this work, we propose a pretraining strategy that uses active forgetting to achieve similar cross lingual transfer in decoder-only LLMs. We show that LLMs pretrained with active forgetting are highly effective when adapting to new and unseen languages. Through extensive experimentation, we find that LLMs pretrained with active forgetting are able to learn better multilingual representations which translates to better performance in many downstream tasks.
[NLP-17] Beyond Filtering: Adaptive Image-Text Quality Enhancement for MLLM Pretraining
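[Quick Read]: This paper addresses how to assess and improve the quality of image-text pairs used to train multimodal large language models (MLLMs). Existing filter-based data quality enhancement discards a large share of high-quality image data because of poor semantic alignment between images and texts, which hurts data utilization and scalability. The proposed Adaptive Image-Text Quality Enhancer (AITQE) dynamically assesses and enhances image-text pair quality: it rewrites the text of low-quality pairs, and it uses a negative sample learning strategy that adds low-quality samples during training to strengthen its evaluative ability. Unlike earlier approaches, AITQE improves quality while keeping changes to the text distribution minimal, so it makes effective use of raw data and scales efficiently as data volume grows.
Link: https://arxiv.org/abs/2410.16166
Authors: Han Huang,Yuqi Huo,Zijia Zhao,Haoyu Lu,Shu Wu,Bingning Wang,Qiang Liu,Weipeng Chen,Liang Wang
Keywords: Multimodal large language, made significant strides, large language models, textual modalities, Multimodal large
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: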
Abstract:Multimodal large language models (MLLMs) have made significant strides by integrating visual and textual modalities. A critical factor in training MLLMs is the quality of image-text pairs within multimodal pretraining datasets. However, de facto filter-based data quality enhancement paradigms often discard a substantial portion of high-quality image data due to inadequate semantic alignment between images and texts, leading to inefficiencies in data utilization and scalability. In this paper, we propose the Adaptive Image-Text Quality Enhancer (AITQE), a model that dynamically assesses and enhances the quality of image-text pairs. AITQE employs a text rewriting mechanism for low-quality pairs and incorporates a negative sample learning strategy to improve evaluative capabilities by integrating deliberately selected low-quality samples during training. Unlike prior approaches that significantly alter text distributions, our method minimally adjusts text to preserve data volume while enhancing quality. Experimental results demonstrate that AITQE surpasses existing methods on various benchmarks, effectively leveraging raw data and scaling efficiently with increasing data volumes. We hope our work will inspire future works. The code and model are available at: this https URL.
[NLP-18] From Tokens to Materials: Leveraging Language Models for Scientific Discovery
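[Quick Read]: This paper addresses the use of language models to predict material properties in materials science. The key is to combine a domain-specific language model (such as MatBERT) with information-dense embeddings, specifically embeddings taken from the third layer of MatBERT and combined with a context-averaging approach, to capture the relationship between compound names and material properties more effectively. The paper also stresses the importance of specialized tokenization that preserves complete compound names while keeping token counts consistent, which improves prediction performance.
Link: https://arxiv.org/abs/2410.16165
Authors: Yuwei Wan,Tong Xie,Nan Wu,Wenjie Zhang,Chunyu Kit,Bram Hoex
Keywords: Exploring the predictive, Generative Pre-trained Transformers, Bidirectional Encoder Representations, ongoing interest, language model embeddings
Categories: Computation and Language (cs.CL); Databases (cs.DB)
Comments: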
Abstract:Exploring the predictive capabilities of language models in material science is an ongoing interest. This study investigates the application of language model embeddings to enhance material property prediction in materials science. By evaluating various contextual embedding methods and pre-trained models, including Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformers (GPT), we demonstrate that domain-specific models, particularly MatBERT significantly outperform general-purpose models in extracting implicit knowledge from compound names and material properties. Our findings reveal that information-dense embeddings from the third layer of MatBERT, combined with a context-averaging approach, offer the most effective method for capturing material-property relationships from the scientific literature. We also identify a crucial “tokenizer effect,” highlighting the importance of specialized text processing techniques that preserve complete compound names while maintaining consistent token counts. These insights underscore the value of domain-specific training and tokenization in materials science applications and offer a promising pathway for accelerating the discovery and development of new materials through AI-driven approaches.
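One way to read "third-layer embeddings combined with context averaging" is: for each sentence that mentions a compound, average the layer-3 vectors of the compound's tokens, then average those vectors over all sentences. The sketch below follows that reading on toy numbers. The data layout (index 0 holding the embedding-layer output, so index 3 is the third layer), the function names, and the vectors are assumptions made for illustration, not the paper's pipeline.

```python
def mean_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def compound_embedding(contexts, layer=3):
    """Average a compound's token vectors at `layer`, then average over contexts.

    contexts: list of (hidden_states, (start, end)) where
      hidden_states[l][t] is the vector of token t at layer l, and
      tokens start..end-1 spell the compound name in that sentence.
    """
    per_context = []
    for hidden_states, (start, end) in contexts:
        token_vectors = hidden_states[layer][start:end]
        per_context.append(mean_vectors(token_vectors))
    return mean_vectors(per_context)


def toy_hidden_states(layer3_vectors):
    """Hypothetical model output: zeros for layers 0-2, given vectors at layer 3."""
    zeros = [[0.0, 0.0] for _ in layer3_vectors]
    return [zeros, zeros, zeros, layer3_vectors]


# Two invented sentences mentioning the same compound.
context_a = (toy_hidden_states([[1.0, 2.0], [3.0, 4.0]]), (0, 2))  # compound = tokens 0-1
context_b = (toy_hidden_states([[9.0, 9.0], [4.0, 5.0]]), (1, 2))  # compound = token 1

embedding = compound_embedding([context_a, context_b])
print(embedding)
```

Averaging inside each sentence first keeps a compound that is split into many sub-word tokens from outweighing one that maps to a single token, which relates to the tokenizer effect the abstract describes.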
[NLP-19] Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Composite Spatial Reasoning
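[Quick Read]: This paper addresses the weak performance of vision language models (VLMs) on two-dimensional spatial reasoning, especially in tasks that involve navigation and interaction with physical environments. The key is to train the model on basic spatial capabilities (direction comprehension, distance estimation, and localization) so that it performs better on composite spatial reasoning tasks. The proposed Sparkle framework fine-tunes VLMs using synthetic data generation and targeted supervision, which yields clear gains on the basic capabilities and improves generalization to composite and out-of-distribution spatial reasoning tasks.
Link: https://arxiv.org/abs/2410.16162
Authors: Yihong Tang,Ao Qu,Zhaokai Wang,Dingyi Zhuang,Zhaofeng Wu,Wei Ma,Shenhao Wang,Yunhan Zheng,Zhan Zhao,Jinhua Zhao
Keywords: Vision language models, Vision language, spatial reasoning, spatial, basic spatial capabilities
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: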
Abstract:Vision language models (VLMs) have demonstrated impressive performance across a wide range of downstream tasks. However, their proficiency in spatial reasoning remains limited, despite its crucial role in tasks involving navigation and interaction with physical environments. Specifically, much of the spatial reasoning in these tasks occurs in two-dimensional (2D) environments, and our evaluation reveals that state-of-the-art VLMs frequently generate implausible and incorrect responses to composite spatial reasoning problems, including simple pathfinding tasks that humans can solve effortlessly at a glance. To address this, we explore an effective approach to enhance 2D spatial reasoning within VLMs by training the model on basic spatial capabilities. We begin by disentangling the key components of 2D spatial reasoning: direction comprehension, distance estimation, and localization. Our central hypothesis is that mastering these basic spatial capabilities can significantly enhance a model’s performance on composite spatial tasks requiring advanced spatial understanding and combinatorial problem-solving. To investigate this hypothesis, we introduce Sparkle, a framework that fine-tunes VLMs on these three basic spatial capabilities by synthetic data generation and targeted supervision to form an instruction dataset for each capability. Our experiments demonstrate that VLMs fine-tuned with Sparkle achieve significant performance gains, not only in the basic tasks themselves but also in generalizing to composite and out-of-distribution spatial reasoning tasks (e.g., improving from 13.5% to 40.0% on the shortest path problem). These findings underscore the effectiveness of mastering basic spatial capabilities in enhancing composite spatial problem-solving, offering insights for improving VLMs’ spatial reasoning capabilities.
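Synthetic supervision for the three basic capabilities is easy to generate because the ground truth follows directly from coordinates. The sketch below builds one direction, one distance, and one localization example from two random points on a grid. It is a text-only illustration with hypothetical field names and question wording; the actual Sparkle generator, its image rendering, and its prompts are not described in the abstract and are not reproduced here.

```python
import math
import random


def direction(a, b):
    """Compass-style direction of point b relative to point a (y grows northward)."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    vertical = "north" if dy > 0 else "south" if dy < 0 else ""
    horizontal = "east" if dx > 0 else "west" if dx < 0 else ""
    return (vertical + horizontal) or "same position"


def make_examples(rng, grid=10):
    """One instruction example per basic capability for a random two-object scene."""
    a = (rng.randrange(grid), rng.randrange(grid))
    b = (rng.randrange(grid), rng.randrange(grid))
    scene = {"A": a, "B": b}  # would be rendered as an image for a VLM
    return [
        {"task": "direction", "scene": scene,
         "question": "In which direction is B relative to A?",
         "answer": direction(a, b)},
        {"task": "distance", "scene": scene,
         "question": "How far apart are A and B?",
         "answer": round(math.dist(a, b), 2)},
        {"task": "localization", "scene": scene,
         "question": "What are the grid coordinates of B?",
         "answer": b},
    ]


examples = make_examples(random.Random(0))
for example in examples:
    print(example["task"], "->", example["answer"])
```

Because every answer is computed rather than annotated, a generator of this kind can produce as many examples per capability as fine-tuning requires.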
[NLP-20] Limpeh ga li gong: Challenges in Singlish Annotations
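[Quick Read]: This paper addresses part-of-speech (POS) tagging for Singlish (Colloquial Singapore English). The key contribution is a parallel Singlish dataset with direct English translations and POS tags, where both translation and annotation were done by native Singlish speakers to ensure data quality. Automatic transition-based and transformer-based taggers reach only about 80% accuracy against the human-annotated labels. The study also lays out the challenges of Singlish annotation, including inconsistencies in form and semantics, highly context-dependent particles, structurally unique expressions, and variation across mediums, providing a basis for future research.
Link: https://arxiv.org/abs/2410.16156
Authors: Lynnette Hui Xian Ng,Luo Qi Chan
Keywords: Colloquial Singapore English, multicultural Singapore, Singapore English, Natural Language Processing, Colloquial Singapore
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: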
Abstract:Singlish, or Colloquial Singapore English, is a language formed from oral and social communication within multicultural Singapore. In this work, we work on a fundamental Natural Language Processing (NLP) task: Parts-Of-Speech (POS) tagging of Singlish sentences. For our analysis, we build a parallel Singlish dataset containing direct English translations and POS tags, with translation and POS annotation done by native Singlish speakers. Our experiments show that automatic transition- and transformer-based taggers perform with only about 80% accuracy when evaluated against human-annotated POS labels, suggesting that there is indeed room for improvement on computational analysis of the language. We provide an exposition of challenges in Singlish annotation: its inconsistencies in form and semantics, the highly context-dependent particles of the language, its structurally unique expressions, and the variation of the language on different mediums. Our task definition, resultant labels and results reflect the challenges in analysing colloquial languages formulated from a variety of dialects, and pave the way for future studies beyond POS tagging.
[NLP-21] A Troublemaker with Contagious Jailbreak Makes Chaos in Honest Towns
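[Quick Read]: This paper addresses attacks on multi-agent systems whose agents have independent memory, in particular the challenges posed by non-complete graph structures and large-scale systems. The key contribution is the Adversarial Replication Contagious Jailbreak (ARCJ) method, which optimizes a retrieval suffix so that poisoned samples are retrieved more easily and a replication suffix so that they become contagious. It improves attack effectiveness by 23.51%, 18.95%, and 52.93% in the line topology, star topology, and 100-agent settings, respectively.
Link: https://arxiv.org/abs/2410.16155
Authors: Tianyi Men,Pengfei Cao,Zhuoran Jin,Yubo Chen,Kang Liu,Jun Zhao
Keywords: large language models, language models, development of large, large language, Troublemaker Makes Chaos
Categories: Computation and Language (cs.CL)
Comments: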
Abstract:With the development of large language models, they are widely used as agents in various fields. A key component of agents is memory, which stores vital information but is susceptible to jailbreak attacks. Existing research mainly focuses on single-agent attacks and shared memory attacks. However, real-world scenarios often involve independent memory. In this paper, we propose the Troublemaker Makes Chaos in Honest Town (TMCHT) task, a large-scale, multi-agent, multi-topology text-based attack evaluation framework. TMCHT involves one attacker agent attempting to mislead an entire society of agents. We identify two major challenges in multi-agent attacks: (1) Non-complete graph structure, (2) Large-scale systems. We attribute these challenges to a phenomenon we term toxicity disappearing. To address these issues, we propose an Adversarial Replication Contagious Jailbreak (ARCJ) method, which optimizes the retrieval suffix to make poisoned samples more easily retrieved and optimizes the replication suffix to make poisoned samples have contagious ability. We demonstrate the superiority of our approach in TMCHT, with 23.51%, 18.95%, and 52.93% improvements in line topology, star topology, and 100-agent settings. Encourage community attention to the security of multi-agent systems.
[NLP-22] Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages
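[Quick Read]: This paper addresses the underrepresentation of non-English languages and non-Western cultural contexts in multimodal large language models (MLLMs), and introduces the Pangea model. The key contribution is PangeaIns, a diverse 6M instruction dataset spanning 39 languages that combines high-quality English instructions, machine-translated instructions, and culturally relevant multimodal tasks to ensure cross-cultural coverage. The paper also introduces the PangeaBench evaluation suite, which assesses models across 47 languages, and open-sources the data, code, and trained checkpoints to support the development of inclusive and robust multilingual MLLMs.
Link: https://arxiv.org/abs/2410.16153
Authors: Xiang Yue,Yueqi Song,Akari Asai,Seungone Kim,Jean de Dieu Nyandwi,Simran Khanuja,Anjali Kantharuban,Lintang Sutawika,Sathyanarayanan Ramamoorthy,Graham Neubig
Keywords: diverse cultural contexts, multimodal large language, recent advances, predominantly focused, cultural contexts underrepresented
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 52 pages, 27 figures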
Abstract:Despite recent advances in multimodal large language models (MLLMs), their development has predominantly focused on English- and western-centric datasets and tasks, leaving most of the world’s languages and diverse cultural contexts underrepresented. This paper introduces Pangea, a multilingual multimodal LLM trained on PangeaIns, a diverse 6M instruction dataset spanning 39 languages. PangeaIns features: 1) high-quality English instructions, 2) carefully machine-translated instructions, and 3) culturally relevant multimodal tasks to ensure cross-cultural coverage. To rigorously assess models’ capabilities, we introduce PangeaBench, a holistic evaluation suite encompassing 14 datasets covering 47 languages. Results show that Pangea significantly outperforms existing open-source models in multilingual settings and diverse cultural contexts. Ablation studies further reveal the importance of English data proportions, language popularity, and the number of multimodal training samples on overall performance. We fully open-source our data, code, and trained checkpoints, to facilitate the development of inclusive and robust multilingual MLLMs, promoting equity and accessibility across a broader linguistic and cultural spectrum.
[NLP-23] 1-bit AI Infra: Part 1.1 Fast and Lossless BitNet b1.58 Inference on CPUs
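[Quick Read]: This paper aims to improve the speed and energy efficiency of 1-bit large language models (LLMs) and to enable their local deployment on a wide range of devices. The key contribution is this http URL, a tailored software stack with a set of CPU-optimized kernels that support fast and lossless inference of ternary BitNet b1.58 models. Extensive experiments show speedups of 2.37x to 6.17x on x86 CPUs and 1.37x to 5.07x on ARM CPUs, a substantial gain in 1-bit LLM inference efficiency.
Link: https://arxiv.org/abs/2410.16144
Authors: Jinheng Wang,Hansong Zhou,Ting Song,Shaoguang Mao,Shuming Ma,Hongyu Wang,Yan Xia,Furu Wei
Keywords: Large Language Models, Large Language, Recent advances, Language Models, present a promising
Categories: Computation and Language (cs.CL)
Comments: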
Abstract:Recent advances in 1-bit Large Language Models (LLMs), such as BitNet and BitNet b1.58, present a promising approach to enhancing the efficiency of LLMs in terms of speed and energy consumption. These developments also enable local LLM deployment across a broad range of devices. In this work, we introduce this http URL, a tailored software stack designed to unlock the full potential of 1-bit LLMs. Specifically, we develop a set of kernels to support fast and lossless inference of ternary BitNet b1.58 LLMs on CPUs. Extensive experiments demonstrate that this http URL achieves significant speedups, ranging from 2.37x to 6.17x on x86 CPUs and from 1.37x to 5.07x on ARM CPUs, across various model sizes. The code is available at this https URL.
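Ternary weights make exact, multiplication-free arithmetic possible: with every weight in {-1, 0, +1}, a dot product reduces to adding some activations and subtracting others. The sketch below shows this on integer activations and checks it against an ordinary matrix-vector product. It illustrates the arithmetic only, with hypothetical function names and toy values; it is not the optimized kernels from the paper's software stack.

```python
def ternary_matvec(weights, x):
    """Compute W @ x for W with entries in {-1, 0, +1} using only add and subtract."""
    out = []
    for row in weights:
        acc = 0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi
            elif w == -1:
                acc -= xi
            elif w != 0:
                raise ValueError("weights must be ternary")
        out.append(acc)
    return out


def dense_matvec(weights, x):
    """Reference implementation with ordinary multiplications."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


# Toy values (invented): a 2x3 ternary weight matrix and integer activations.
W = [[1, 0, -1],
     [-1, -1, 1]]
x = [3, -2, 5]

y = ternary_matvec(W, x)
print(y)
```

On integer activations the two routines agree exactly, which is one sense in which inference over ternary weights can be lossless with respect to a reference computation.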
[NLP-24] A Psycholinguistic Evaluation of Language Models Sensitivity to Argument Roles
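[Quick Read]: This paper examines how sensitive large language models are to argument roles (who did what to whom), and compares them with humans in real-time language processing. The approach replicates psycholinguistic experiments to test whether language models distinguish verbs in plausible versus implausible contexts. The models can tell plausible verbs from implausible ones, but none captures the selective patterns that human comprehenders show during real-time verb prediction, which indicates that the mechanism by which models detect verb plausibility differs from the one underlying human real-time sentence processing.
Link: https://arxiv.org/abs/2410.16139
Authors: Eun-Kyoung Rosa Lee,Sathvik Nair,Naomi Feldman
Keywords: replicating psycholinguistic studies, argument role processing, human argument role, large language models’, language models’ sensitivity
Categories: Computation and Language (cs.CL)
Comments: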
Abstract:We present a systematic evaluation of large language models’ sensitivity to argument roles, i.e., who did what to whom, by replicating psycholinguistic studies on human argument role processing. In three experiments, we find that language models are able to distinguish verbs that appear in plausible and implausible contexts, where plausibility is determined through the relation between the verb and its preceding arguments. However, none of the models capture the same selective patterns that human comprehenders exhibit during real-time verb prediction. This indicates that language models’ capacity to detect verb plausibility does not arise from the same mechanism that underlies human real-time sentence processing.
[NLP-25] Do LLMs write like humans? Variation in grammatical and rhetorical styles
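[Quick Read]: This paper asks how text generated by large language models (LLMs) differs from human-written text, particularly in rhetorical style. The approach builds parallel corpora of human- and LLM-written texts and compares them using Douglas Biber's set of lexical, grammatical, and rhetorical features. It finds systematic differences between LLMs and humans that persist from smaller models to larger ones and are larger for instruction-tuned models than for base models. Despite their growing capabilities, LLMs still struggle to match human styles, so these linguistic features offer new signals for detecting LLM-generated text.
Link: https://arxiv.org/abs/2410.16107
Authors: Alex Reinhart,David West Brown,Ben Markey,Michael Laudenbach,Kachatad Pantusen,Ronald Yurko,Gordon Weinberg
Keywords: Large language models, Large language, answers questions, solves problems, writing grammatical text
Categories: Computation and Language (cs.CL)
Comments: 29 pages, 4 figures, 11 tables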
Abstract:Large language models (LLMs) are capable of writing grammatical text that follows instructions, answers questions, and solves problems. As they have advanced, it has become difficult to distinguish their output from human-written text. While past research has found some differences in surface features such as word choice and punctuation, and developed classifiers to detect LLM output, none has studied the rhetorical styles of LLMs. Using several variants of Llama 3 and GPT-4o, we construct two parallel corpora of human- and LLM-written texts from common prompts. Using Douglas Biber’s set of lexical, grammatical, and rhetorical features, we identify systematic differences between LLMs and humans and between different LLMs. These differences persist when moving from smaller models to larger ones, and are larger for instruction-tuned models than base models. This demonstrates that despite their advanced abilities, LLMs struggle to match human styles, and hence more advanced linguistic features can detect patterns in their behavior not previously recognized.
[NLP-26] Analysing the Residual Stream of Language Models Under Knowledge Conflicts NEURIPS2024
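[Quick Read]: This paper addresses the risk that large language models (LLMs) rely on outdated or incorrect information when their parametric knowledge conflicts with the context. The key is to analyse the residual stream of the LLM to identify knowledge conflicts and to predict which source of knowledge the model will rely on. Through probing tasks, the paper finds that LLMs internally register a knowledge-conflict signal in the residual stream, and that probing intermediate activations detects it accurately. This allows conflicts to be detected before an answer is generated, without modifying the input or the model parameters. The residual stream also shows markedly different patterns depending on whether the model resolves a conflict with contextual or parametric knowledge, which can be used to estimate model behaviour under conflict and to prevent unexpected answers before they are produced.
Link: https://arxiv.org/abs/2410.16090
Authors: Yu Zhao,Xiaotang Du,Giwon Hong,Aryo Pradipta Gema,Alessio Devoto,Hongru Wang,Xuanli He,Kam-Fai Wong,Pasquale Minervini
Keywords: Large language models, Large language, residual stream, knowledge, store a significant
Categories: Computation and Language (cs.CL)
Comments: Foundation Model Interventions Workshop @ NeurIPS 2024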
Abstract:Large language models (LLMs) can store a significant amount of factual knowledge in their parameters. However, their parametric knowledge may conflict with the information provided in the context. Such conflicts can lead to undesirable model behaviour, such as reliance on outdated or incorrect information. In this work, we investigate whether LLMs can identify knowledge conflicts and whether it is possible to know which source of knowledge the model will rely on by analysing the residual stream of the LLM. Through probing tasks, we find that LLMs can internally register the signal of knowledge conflict in the residual stream, which can be accurately detected by probing the intermediate model activations. This allows us to detect conflicts within the residual stream before generating the answers without modifying the input or model parameters. Moreover, we find that the residual stream shows significantly different patterns when the model relies on contextual knowledge versus parametric knowledge to resolve conflicts. This pattern can be employed to estimate the behaviour of LLMs when conflict happens and prevent unexpected answers before producing the answers. Our analysis offers insights into how LLMs internally manage knowledge conflicts and provides a foundation for developing methods to control the knowledge selection processes.
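Probing, as used above, usually means fitting a small classifier on frozen activations and checking whether it can recover a label. The sketch below trains a logistic-regression probe with plain gradient descent. The activations are synthetic stand-ins (one coordinate carries a hypothetical "conflict" signal, the rest are noise), so it shows only the mechanics of a linear probe, not the paper's data, layers, or results.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(xs, ys, lr=0.5, epochs=200):
    """Fit logistic regression on (activation, label) pairs by batch gradient descent."""
    dim = len(xs[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(xs)
    return w, b


def predict(w, b, x):
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)


# Synthetic "residual stream" vectors: coordinate 0 encodes the label, others are noise.
rng = random.Random(0)


def fake_activation(label):
    x = [rng.gauss(0, 0.5) for _ in range(4)]
    x[0] = rng.gauss(1.0 if label else -1.0, 0.3)
    return x


labels = [i % 2 for i in range(200)]  # 1 = conflict, 0 = no conflict (hypothetical)
acts = [fake_activation(y) for y in labels]

w, b = train_probe(acts[:150], labels[:150])
held_out = list(zip(acts[150:], labels[150:]))
accuracy = sum(predict(w, b, x) == y for x, y in held_out) / len(held_out)
print(accuracy)
```

The probe reads activations without changing the model, which matches the abstract's point that conflicts can be detected without modifying the input or the model parameters.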
[NLP-27] Fine-Tuning LLMs for Reliable Medical Question-Answering Services ICDM
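[Quick Read]: This paper addresses the accuracy and reliability of information in medical question-answering services. The key is to fine-tune large language models such as LLaMA-2 and Mistral with the rsDoRA+ and ReRAG techniques. rsDoRA+ improves efficiency by combining decomposed model weights, varied learning rates for the low-rank matrices, and rank stabilization, while ReRAG further improves answer accuracy through retrieval on demand and question rewriting. Together, these techniques make medical information services faster and more dependable, supporting more efficient decision-making and greater patient trust.
Link: https://arxiv.org/abs/2410.16088
Authors: Ali Anaissi,Ali Braytee,Junaid Akram
Keywords: Large Language Models, fine-tuned Large Language, Large Language, Language Models, present an advanced
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 10 figures, accepted and to be published in the proceedings of 2024 IEEE International Conference on Data Mining Workshops (ICDMW)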
Abstract:We present an advanced approach to medical question-answering (QA) services, using fine-tuned Large Language Models (LLMs) to improve the accuracy and reliability of healthcare information. Our study focuses on optimizing models like LLaMA-2 and Mistral, which have shown great promise in delivering precise, reliable medical answers. By leveraging comprehensive datasets, we applied fine-tuning techniques such as rsDoRA+ and ReRAG. rsDoRA+ enhances model performance through a combination of decomposed model weights, varied learning rates for low-rank matrices, and rank stabilization, leading to improved efficiency. ReRAG, which integrates retrieval on demand and question rewriting, further refines the accuracy of the responses. This approach enables healthcare providers to access fast, dependable information, aiding in more efficient decision-making and fostering greater patient trust. Our work highlights the potential of fine-tuned LLMs to significantly improve the quality and accessibility of medical information services, ultimately contributing to better healthcare outcomes for all.
[NLP-28] CartesianMoE: Boosting Knowledge Sharing among Experts via Cartesian Product Routing in Mixture-of-Experts
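Quick read: The paper addresses the knowledge-sharing challenge in Mixture-of-Experts (MoE) models, in particular the sensitivity to routing accuracy caused by insufficient knowledge sharing among experts. The key to its solution is CartesianMoE, which shares knowledge among experts in a "multiplication"-like manner rather than the conventional "addition" manner, integrating expert knowledge more effectively. Experiments show that CartesianMoE outperforms previous MoE models in both perplexity and downstream task performance, and that it makes expert routing more robust.
Link: https://arxiv.org/abs/2410.16077
Authors: Zhenpeng Su, Xing Wu, Zijia Lin, Yizhe Xiong, Minxuan Lv, Guangyuan Ma, Hui Chen, Songlin Hu, Guiguang Ding
Keywords: Large language models, Large language, community recently, attracting much attention, Large
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Notes: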
Abstract:Large language models (LLMs) have been attracting much attention from the community recently, due to their remarkable performance in all kinds of downstream tasks. According to the well-known scaling law, scaling up a dense LLM enhances its capabilities, but also significantly increases the computational complexity. Mixture-of-Experts (MoE) models address that by allowing the model size to grow without substantially raising training or inference costs. Yet MoE models face challenges regarding knowledge sharing among experts, making their performance somewhat sensitive to routing accuracy. To tackle that, previous works introduced shared experts and combined their outputs with those of the top K routed experts in an "addition" manner. In this paper, inspired by collective matrix factorization to learn shared knowledge among data, we propose CartesianMoE, which implements more effective knowledge sharing among experts in more like a "multiplication" manner. Extensive experimental results indicate that CartesianMoE outperforms previous MoE models for building LLMs, in terms of both perplexity and downstream task performance. And we also find that CartesianMoE achieves better expert routing robustness.
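As a rough illustration of the "addition" baseline that the abstract contrasts with CartesianMoE, the sketch below adds a shared expert's output to a weighted sum of the top-K routed experts. It is a toy under stated assumptions, not the paper's implementation: the scalar experts, the softmax router, the renormalization of the top-K weights, and the names `softmax` and `moe_with_shared_expert` are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of router scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_with_shared_expert(x, router_scores, routed_experts, shared_expert, k=2):
    """Add a shared expert's output to the weighted top-k routed outputs.

    Assumption: the top-k router weights are renormalized to sum to 1.
    """
    probs = softmax(router_scores)
    # Indices of the k experts with the highest router probability.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    routed = sum(probs[i] / norm * routed_experts[i](x) for i in top)
    # "Addition" manner: shared-expert output plus routed-expert output.
    return shared_expert(x) + routed, top


# Hypothetical scalar experts, used only for illustration.
experts = [lambda x: 1.0 * x, lambda x: 2.0 * x, lambda x: 3.0 * x, lambda x: 4.0 * x]
shared = lambda x: 0.5 * x

y, chosen = moe_with_shared_expert(1.0, [0.0, 0.0, 5.0, 5.0], experts, shared, k=2)
print(y, chosen)
```

In this toy setting the router strongly prefers the last two experts, so the output is the shared expert's contribution plus an equal-weight mix of those two. How CartesianMoE replaces this addition with a "multiplication"-like combination is not described in the abstract and is not sketched here.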
[NLP-29] On-Device LLMs for SMEs: Challenges and Opportunities
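Quick read: The paper addresses the infrastructure challenges that small and medium-sized enterprises (SMEs) face when deploying large language models (LLMs) on-device. It approaches the problem from two angles. On the hardware side, it discusses processing units such as GPUs and TPUs, efficient memory and storage solutions, and strategies for effective deployment under limited computational resources. On the software side, it examines framework compatibility, operating system optimization, and specialized libraries tailored to resource-constrained environments. By systematically analyzing these challenges and opportunities, the paper offers SMEs practical technical insights and aims to strengthen their technological resilience when integrating LLMs.
Link: https://arxiv.org/abs/2410.16070
Authors: Jeremy Stephen Gabriel Yee Zhi Wen, Pai Chet Ng, Zhengkui Wang, Ian McLoughlin, Aik Beng Ng, Simon See
Keywords: Large Language Models, deploying Large Language, Language Models, Large Language, medium-sized enterprises
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: 9 pages, 1 figure. The work is supported by the SIT-NVIDIA Joint AI Centre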
Abstract:This paper presents a systematic review of the infrastructure requirements for deploying Large Language Models (LLMs) on-device within the context of small and medium-sized enterprises (SMEs), focusing on both hardware and software perspectives. From the hardware viewpoint, we discuss the utilization of processing units like GPUs and TPUs, efficient memory and storage solutions, and strategies for effective deployment, addressing the challenges of limited computational resources typical in SME settings. From the software perspective, we explore framework compatibility, operating system optimization, and the use of specialized libraries tailored for resource-constrained environments. The review is structured to first identify the unique challenges faced by SMEs in deploying LLMs on-device, followed by an exploration of the opportunities that both hardware innovations and software adaptations offer to overcome these obstacles. Such a structured review provides practical insights, contributing significantly to the community by enhancing the technological resilience of SMEs in integrating LLMs.
[NLP-30] Rolling the DICE on Idiomaticity: How LLMs Fail to Grasp Context
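Quick read: The paper evaluates whether large language models (LLMs) can effectively use context to disambiguate idiomatic expressions. The key to its approach is a new, controlled contrastive dataset for testing LLMs under varying contexts and collocational frequencies. The results show that LLMs often fail on idioms whose interpretation depends on the surrounding context, that they perform better on higher-probability sentences, and that collocational frequency also affects performance.
Link: https://arxiv.org/abs/2410.16069
Authors: Maggie Mi, Aline Villavicencio, Nafise Sadat Moosavi
Keywords: Human processing, idioms occur, factors like familiarity, idioms relies, relies on understanding
Categories: Computation and Language (cs.CL)
Notes: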
Abstract:Human processing of idioms relies on understanding the contextual sentences in which idioms occur, as well as language-intrinsic features such as frequency and speaker-intrinsic factors like familiarity. While LLMs have shown high performance on idiomaticity detection tasks, this success may be attributed to reasoning shortcuts in existing datasets. To this end, we construct a novel, controlled contrastive dataset designed to test whether LLMs can effectively use context to disambiguate idiomatic meaning. Additionally, we explore how collocational frequency and sentence probability influence model performance. Our findings reveal that LLMs often fail to resolve idiomaticity when it is required to attend to the surrounding context, and that models perform better on sentences that have higher likelihood. The collocational frequency of expressions also impacts performance. We make our code and dataset publicly available.
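[NLP-31] Surprise! Uniform Information Density Isn't the Whole Story: Predicting Surprisal Contours in Long-form Discourse EMNLP2024
Quick read: The paper seeks to explain why information rate fluctuates in texts and discourse, going beyond the traditional Uniform Information Density (UID) hypothesis. Its key proposal is the Structured Context Hypothesis: speakers modulate information rate according to their position in the hierarchical structure of a discourse. Using predictors derived from discourse structure to predict the surprisal contours of naturally occurring discourse, the study finds that hierarchical predictors significantly predict information contours and that deeply nested hierarchical predictors are more predictive than shallow ones. The work offers new testable hypotheses about fluctuations in information rate.
Link: https://arxiv.org/abs/2410.16062
Authors: Eleftheria Tsipidi, Franz Nowak, Ryan Cotterell, Ethan Wilcox, Mario Giulianelli, Alex Warstadt
Keywords: achieve efficient communication, Uniform Information Density, distribute information evenly, Information Density, efficient communication
Categories: Computation and Language (cs.CL)
Notes: EMNLP 2024 (main conference)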
Abstract:The Uniform Information Density (UID) hypothesis posits that speakers tend to distribute information evenly across linguistic units to achieve efficient communication. Of course, information rate in texts and discourses is not perfectly uniform. While these fluctuations can be viewed as theoretically uninteresting noise on top of a uniform target, another explanation is that UID is not the only functional pressure regulating information content in a language. Speakers may also seek to maintain interest, adhere to writing conventions, and build compelling arguments. In this paper, we propose one such functional pressure; namely that speakers modulate information rate based on location within a hierarchically-structured model of discourse. We term this the Structured Context Hypothesis and test it by predicting the surprisal contours of naturally occurring discourses extracted from large language models using predictors derived from discourse structure. We find that hierarchical predictors are significant predictors of a discourse’s information contour and that deeply nested hierarchical predictors are more predictive than shallow ones. This work takes an initial step beyond UID to propose testable hypotheses for why the information rate fluctuates in predictable ways.
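Surprisal, the quantity whose contour the paper predicts, is the negative log-probability of a unit given its context. The sketch below computes a surprisal contour in bits from per-token probabilities and, as one simple way to quantify departure from uniformity, its variance. The probabilities are made up for illustration, and the function names and the choice of variance as a uniformity measure are assumptions rather than the authors' setup.

```python
import math


def surprisal_bits(prob):
    """Surprisal of an outcome with probability `prob`, in bits."""
    return -math.log2(prob)


def surprisal_contour(token_probs):
    """Map per-token probabilities to a sequence of surprisal values."""
    return [surprisal_bits(p) for p in token_probs]


def contour_variance(contour):
    """Population variance of a contour; 0 means perfectly uniform surprisal."""
    mean = sum(contour) / len(contour)
    return sum((s - mean) ** 2 for s in contour) / len(contour)


# Hypothetical probabilities a language model might assign to four tokens.
probs = [0.5, 0.25, 0.125, 0.5]
contour = surprisal_contour(probs)
print(contour, contour_variance(contour))
```

Under a strictly uniform information rate the variance would be zero; the paper's question is which structural predictors explain the remaining fluctuation.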
[NLP-32] Large Language Models Know What To Say But Not When To Speak EMNLP2024
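Quick read: The paper addresses the fact that existing large language models (LLMs), when predicting opportunities to speak (Transition Relevance Places, TRPs) in natural, unscripted conversation, focus only on turn-final TRPs and ignore within-turn TRPs. The key to its approach is a new dataset of participant-labeled within-turn TRPs, which is used to evaluate state-of-the-art LLMs on predicting opportunities to speak, paving the way for more naturalistic dialogue systems.
Link: https://arxiv.org/abs/2410.16044
Authors: Muhammad Umair, Vasanth Sarathy, JP de Ruiter
Keywords: coherent verbal interactions, Large Language Models, Transition Relevance Places, fundamental mechanism, mechanism in human
Categories: Computation and Language (cs.CL)
Notes: EMNLP 2024 (Findings)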
Abstract:Turn-taking is a fundamental mechanism in human communication that ensures smooth and coherent verbal interactions. Recent advances in Large Language Models (LLMs) have motivated their use in improving the turn-taking capabilities of Spoken Dialogue Systems (SDS), such as their ability to respond at appropriate times. However, existing models often struggle to predict opportunities for speaking – called Transition Relevance Places (TRPs) – in natural, unscripted conversations, focusing only on turn-final TRPs and not within-turn TRPs. To address these limitations, we introduce a novel dataset of participant-labeled within-turn TRPs and use it to evaluate the performance of state-of-the-art LLMs in predicting opportunities for speaking. Our experiments reveal the current limitations of LLMs in modeling unscripted spoken interactions, highlighting areas for improvement and paving the way for more naturalistic dialogue systems.
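[NLP-33] TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling
Quick read: The paper addresses how to improve computational efficiency during large language model inference while maintaining high-quality output. The key to its solution is the TreeBoN framework, which integrates a speculative tree-search strategy into Best-of-N (BoN) sampling. While generating multiple responses, it iteratively branches and prunes low-quality responses to reduce computational overhead, and it uses token-level rewards from Direct Preference Optimization (DPO) to guide tree expansion and prune low-quality paths, improving efficiency while keeping output quality high.
Link: https://arxiv.org/abs/2410.16033
Authors: Jiahao Qiu, Yifu Lu, Yifan Zeng, Jiacheng Guo, Jiayi Geng, Huazheng Wang, Kaixuan Huang, Yue Wu, Mengdi Wang
Keywords: Inference-time alignment enhances, large language models, requiring additional training, presents challenges due, balancing computational efficiency
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: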
Abstract:Inference-time alignment enhances the performance of large language models without requiring additional training or fine-tuning but presents challenges due to balancing computational efficiency with high-quality output. Best-of-N (BoN) sampling, as a simple yet powerful approach, generates multiple responses and selects the best one, achieving improved performance but with a high computational cost. We propose TreeBoN, a novel framework that integrates a speculative tree-search strategy into Best-of-N (BoN) Sampling. TreeBoN maintains a set of parent nodes, iteratively branching and pruning low-quality responses, thereby reducing computational overhead while maintaining high output quality. Our approach also leverages token-level rewards from Direct Preference Optimization (DPO) to guide tree expansion and prune low-quality paths. We evaluate TreeBoN using AlpacaFarm, UltraFeedback, GSM8K, HH-RLHF, and TutorEval datasets, demonstrating consistent improvements. Specifically, TreeBoN achieves a 65% win rate at maximum lengths of 192 and 384 tokens, outperforming standard BoN with the same computational cost. Furthermore, TreeBoN achieves around a 60% win rate across longer responses, showcasing its scalability and alignment efficacy.
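The plain Best-of-N baseline described in the abstract (generate several responses, keep the one with the highest reward) can be sketched as follows. This is only the baseline, not TreeBoN's tree search, and the generator and reward function are hypothetical stand-ins: `toy_generate` picks random words and `toy_reward` scores a response by its length.

```python
import random


def best_of_n(prompt, generate, reward, n=4, seed=0):
    """Sample n candidate responses and return the highest-reward one."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    scores = [reward(prompt, c) for c in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index], candidates, scores


def toy_generate(prompt, rng):
    """Hypothetical sampler: appends three random words to the prompt."""
    words = ["ok", "fine", "good", "great", "excellent"]
    return prompt + " " + " ".join(rng.choice(words) for _ in range(3))


def toy_reward(prompt, response):
    """Hypothetical reward model: longer responses score higher."""
    return len(response)


best, candidates, scores = best_of_n("Q: how are you?", toy_generate, toy_reward, n=4)
print(best)
```

Every candidate is generated in full before scoring, which is the cost the abstract attributes to BoN. TreeBoN, per the abstract, reduces that cost by pruning low-quality partial responses during generation.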
[NLP-34] ComPO: Community Preferences for Language Model Personalization
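Quick read: The paper addresses the problem that conventional language model training relies on the preferences of an "average" user, so the generated outputs fail to meet the needs of diverse user groups. The key to its solution is ComPO, a method that personalizes preference optimization by contextualizing the probability distribution of model outputs with the preference provider. Specifically, the paper introduces ComPRed, a dataset of community-level preferences, and shows that conditioning the language model on a community identifier (the subreddit name) during preference tuning substantially improves performance, whereas random community identifiers significantly reduce it, supporting the method's effectiveness in tailoring responses to community preferences.
Link: https://arxiv.org/abs/2410.16027
Authors: Sachin Kumar, Chan Young Park, Yulia Tsvetkov, Noah A. Smith, Hannaneh Hajishirzi
Keywords: Conventional algorithms, human feedback rely, disregarding subjectivity, finer-grained variations, algorithms for training
Categories: Computation and Language (cs.CL)
Notes: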
Abstract:Conventional algorithms for training language models (LMs) with human feedback rely on preferences that are assumed to account for an “average” user, disregarding subjectivity and finer-grained variations. Recent studies have raised concerns that aggregating such diverse and often contradictory human feedback to finetune models results in generic models that generate outputs not preferred by many user groups, as they tend to average out styles and norms. To address this issue, we draw inspiration from recommendation systems and propose ComPO, a method to personalize preference optimization in LMs by contextualizing the probability distribution of model outputs with the preference provider. Focusing on group-level preferences rather than individuals, we collect and release ComPRed, a question answering dataset with community-level preferences from Reddit. This dataset facilitates studying diversity in preferences without incurring privacy concerns associated with individual feedback. Our experiments reveal that conditioning language models on a community identifier (i.e., subreddit name) during preference tuning substantially enhances model performance. Conversely, replacing this context with random subreddit identifiers significantly diminishes performance, highlighting the effectiveness of our approach in tailoring responses to communities’ preferences.
[NLP-35] CA*: Addressing Evaluation Pitfalls in Computation-Aware Latency for Simultaneous Speech Translation
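Quick read: The paper addresses the overestimation of latency in evaluations of simultaneous speech translation (SimulST) systems, which stems from a fundamental misconception underlying existing latency evaluation approaches. It shows that this misconception affects not only latency measurement in streaming settings but also segment-level latency evaluation across different metrics. The key to its solution is a modification that correctly measures computation-aware latency, overcoming the limitations of existing metrics.
Link: https://arxiv.org/abs/2410.16011
Authors: Xi Xu, Wenda Xu, Siqi Ouyang, Lei Li
Keywords: Simultaneous speech translation, balance translation quality, Simultaneous speech, making latency measurement, response time
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: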
Abstract:Simultaneous speech translation (SimulST) systems must balance translation quality with response time, making latency measurement crucial for evaluating their real-world performance. However, there has been a longstanding belief that current metrics yield unrealistically high latency measurements in unsegmented streaming settings. In this paper, we investigate this phenomenon, revealing its root cause in a fundamental misconception underlying existing latency evaluation approaches. We demonstrate that this issue affects not only streaming but also segment-level latency evaluation across different metrics. Furthermore, we propose a modification to correctly measure computation-aware latency for SimulST systems, addressing the limitations present in existing metrics.
[NLP-36] Exploring Continual Fine-Tuning for Enhancing Language Ability in Large Language Model
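Quick read: The paper addresses how large language models (LLMs) can learn new languages over time without harming their performance on languages they already handle well (usually English). The key to its approach is continual fine-tuning (CFT), studied as a two-phase process: Phase 1 mainly builds task ability and Phase 2 mainly builds language ability. The paper finds that the similarity between Phase 2 and Phase 1 tasks determines adaptability: when the phase-wise datasets are similar, task ability does not deteriorate after Phase 2, and when they are not, it does. To counter this deterioration, it analyzes tailored variants of two CFT methods, layer freezing and generative replay, and shows that they improve language ability while preserving task performance.
Link: https://arxiv.org/abs/2410.16006
Authors: Divyanshu Aggarwal, Sankarshan Damle, Navin Goyal, Satya Lokam, Sunayana Sitaram
Keywords: Large Language Models, Large Language, Phase, LLM, adaptability of Large
Categories: Computation and Language (cs.CL)
Notes: 19 pages, 6 tables, 4 figures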
Abstract:A common challenge towards the adaptability of Large Language Models (LLMs) is their ability to learn new languages over time without hampering the model’s performance on languages in which the model is already proficient (usually English). Continual fine-tuning (CFT) is the process of sequentially fine-tuning an LLM to enable the model to adapt to downstream tasks with varying data distributions and time shifts. This paper focuses on the language adaptability of LLMs through CFT. We study a two-phase CFT process in which an English-only end-to-end fine-tuned LLM from Phase 1 (predominantly Task Ability) is sequentially fine-tuned on a multilingual dataset – comprising task data in new languages – in Phase 2 (predominantly Language Ability). We observe that the "similarity" of Phase 2 tasks with Phase 1 determines the LLM’s adaptability. For similar phase-wise datasets, the LLM after Phase 2 does not show deterioration in task ability. In contrast, when the phase-wise datasets are not similar, the LLM’s task ability deteriorates. We test our hypothesis on the open-source \mis\ and \llm\ models with multiple phase-wise dataset pairs. To address the deterioration, we analyze tailored variants of two CFT methods: layer freezing and generative replay. Our findings demonstrate their effectiveness in enhancing the language ability of LLMs while preserving task performance, in comparison to relevant baselines.
[NLP-37] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
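Quick read: The paper addresses context-memory knowledge conflicts in large language models (LLMs), where knowledge stored in the model's parameters conflicts with information in the input context and leads to undesirable behaviour such as reliance on outdated or incorrect information. The key to its solution is SpARE, a training-free representation engineering method that uses pre-trained sparse auto-encoders (SAEs) to control the knowledge selection behaviour of LLMs. SpARE identifies the functional features that control knowledge selection and uses them to edit the LLM's internal activations at inference time, resolving knowledge conflicts in open-domain question answering and outperforming existing representation engineering and contrastive decoding methods.
Link: https://arxiv.org/abs/2410.15999
Authors: Yu Zhao, Alessio Devoto, Giwon Hong, Xiaotang Du, Aryo Pradipta Gema, Hongru Wang, Kam-Fai Wong, Pasquale Minervini
Keywords: Large language models, Large language, store a significant, significant amount, amount of factual
Categories: Computation and Language (cs.CL)
Notes: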
Abstract:Large language models (LLMs) can store a significant amount of factual knowledge in their parameters. However, their parametric knowledge may conflict with the information provided in the context. This phenomenon, known as context-memory knowledge conflicts, can lead to undesirable model behaviour, such as reliance on outdated or incorrect information. Analysing the internal activations of LLMs, we find that they can internally register the signals of knowledge conflict at mid-layers. Such signals allow us to detect whether a knowledge conflict occurs and use inference-time intervention strategies to resolve it. In this work, we propose SpARE, a training-free representation engineering method that uses pre-trained sparse auto-encoders (SAEs) to control the knowledge selection behaviour of LLMs. SpARE identifies the functional features that control the knowledge selection behaviours and applies them to edit the internal activations of LLMs at inference time. Our experimental results show that SpARE can effectively control the usage of either knowledge source to resolve knowledge conflict in open-domain question-answering tasks, surpassing existing representation engineering methods (+10%) as well as contrastive decoding methods (+15%).
[NLP-38] 1024m at SMM4H 2024: Tasks 3 5 6 – Ensembles of Transformers and Large Language Models for Medical Text Classification ACL2024
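Quick read: The paper addresses social media text classification across three SMM4H 2024 tasks: classifying texts on the impact of nature and outdoor spaces on the author's mental health (Task 3), binary classification of tweets reporting children's health disorders (Task 5), and binary classification of users self-reporting their age (Task 6). Its approach relies on Transformers, large language models, and their ensembles, and it reports the performance, advantages, and drawbacks of each approach across the tasks.
Link: https://arxiv.org/abs/2410.15998
Authors: Ram Mohan Rao Kadiyala, M.V.P. Chandra Sekhara Rao
Keywords: users reporting information, Social media, Large Language Models, Binary classification, great source
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: short paper, ACL 2024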
Abstract:Social media is a great source of data in which users report information about their health and how various things have affected them. This paper presents various approaches using Transformers and Large Language Models and their ensembles, along with their performance, advantages, and drawbacks, for several tasks of SMM4H’24: classifying texts on the impact of nature and outdoor spaces on the author’s mental health (Task 3), binary classification of tweets reporting children’s health disorders such as asthma, autism, ADHD, and speech disorder (Task 5), and binary classification of users self-reporting their age (Task 6).
[NLP-39] Augmenting Legal Decision Support Systems with LLM-based NLI for Analyzing Social Media Evidence EMNLP2024
Quick read: The paper addresses the classification problem in Legal Natural Language Inference (L-NLI): deciding whether the relationship between a review and a complaint is entailed, contradicted, or neutral. The authors describe their system and provide detailed model and error analyses. The system was the winning submission in the NLLP 2024 shared task.
Link: https://arxiv.org/abs/2410.15990
Authors: Ram Mohan Rao Kadiyala, Siddartha Pullakhandam, Kanwal Mehreen, Subhasya Tippareddy, Ashay Srivastava
Keywords: entry for NLLP, Natural Language Inference, Legal Natural Language, shared task, Natural Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 8 pages, accepted to EMNLP 2024
Abstract:This paper presents our system description and error analysis of our entry for NLLP 2024 shared task on Legal Natural Language Inference (L-NLI) \citep{hagag2024legallenssharedtask2024}. The task required classifying these relationships as entailed, contradicted, or neutral, indicating any association between the review and the complaint. Our system emerged as the winning submission, significantly outperforming other entries with a substantial margin and demonstrating the effectiveness of our approach in legal text analysis. We provide a detailed analysis of the strengths and limitations of each model and approach tested, along with a thorough error analysis and suggestions for future improvements. This paper aims to contribute to the growing field of legal NLP by offering insights into advanced techniques for natural language inference in legal contexts, making it accessible to both experts and newcomers in the field.
[NLP-40] Large Language Models for Cross-lingual Emotion Detection ACL2024
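Quick read: The paper addresses cross-lingual emotion detection. The key to its solution is combining large language models (LLMs) and their ensembles to understand and categorize emotions across languages. The ensemble approach outperformed other submissions by a large margin, and the paper compares the benefits and limitations of each model and includes an error analysis with suggestions for future improvement.
Link: https://arxiv.org/abs/2410.15974
Authors: Ram Mohan Rao Kadiyala
Keywords: detailed system description, cross-lingual emotion detection, focused on cross-lingual, presents a detailed, detailed system
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 6 pages, accepted to ACL 2024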
Abstract:This paper presents a detailed system description of our entry for the WASSA 2024 Task 2, focused on cross-lingual emotion detection. We utilized a combination of large language models (LLMs) and their ensembles to effectively understand and categorize emotions across different languages. Our approach not only outperformed other submissions by a large margin, but also demonstrated the strength of integrating multiple models to enhance performance. Additionally, we conducted a thorough comparison of the benefits and limitations of each model used. An error analysis is included along with suggested areas for future improvement. This paper aims to offer a clear and comprehensive understanding of advanced techniques in emotion detection, making it accessible even to those new to the field.
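The abstract does not say how the ensembles were formed. One common option, shown here purely as an assumption, is majority voting over the labels predicted by several models. The labels, the three "models", and the tie-breaking rule (prefer the earliest-listed model) are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one list by majority vote.

    `predictions` holds one list of labels per model, all the same length.
    Ties go to the label predicted by the earliest-listed model.
    """
    ensembled = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        top = max(counts.values())
        # First label, in model order, that reaches the top count.
        winner = next(label for label in labels if counts[label] == top)
        ensembled.append(winner)
    return ensembled


# Hypothetical predictions from three models on three texts.
model_a = ["joy", "anger", "fear"]
model_b = ["joy", "sadness", "fear"]
model_c = ["anger", "sadness", "joy"]

combined = majority_vote([model_a, model_b, model_c])
print(combined)
```

Weighted voting or averaging class probabilities are equally plausible alternatives; which of these, if any, the system used is not stated in the abstract.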
[NLP-41] Policy-driven Knowledge Selection and Response Generation for Document-grounded Dialogue
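Quick read: The paper addresses the need to correctly understand the dialogue context in document-grounded dialogue (DGD) in order to select suitable knowledge and generate appropriate responses. The key to its solution is a dialogue policy with two guiding signals: utterance function and topic transfer intent. The utterance function reflects the purpose and style of an utterance, while the topic transfer intent reflects its topic and content. The paper proposes a framework that uses the dialogue policy for the two core tasks, knowledge selection (KS) and response generation (RG). The framework has two modules: a policy planner that uses policy-aware dialogue representations to select knowledge and predict the response policy, and a generator that uses policy/knowledge-aware dialogue representations to generate the response. The policy-driven model achieves state-of-the-art performance on three public benchmarks.
Link: https://arxiv.org/abs/2410.15970
Authors: Longxuan Ma, Jiapeng Li, Mingda Li, Wei-Nan Zhang, Ting Liu
Keywords: Document-grounded dialogue, dialogue, dialogue policy, DGD, policy
Categories: Computation and Language (cs.CL)
Notes: 29 pages, 9 figures, 14 tables, TOIS 2024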
Abstract:Document-grounded dialogue (DGD) uses documents as external knowledge for dialogue generation. Correctly understanding the dialogue context is crucial for selecting knowledge from the document and generating proper responses. In this paper, we propose using a dialogue policy to help the dialogue understanding in DGD. Our dialogue policy consists of two kinds of guiding signals: utterance function and topic transfer intent. The utterance function reflects the purpose and style of an utterance, and the topic transfer intent reflects the topic and content of an utterance. We propose a novel framework exploiting our dialogue policy for two core tasks in DGD, namely knowledge selection (KS) and response generation (RG). The framework consists of two modules: the Policy planner leverages policy-aware dialogue representation to select knowledge and predict the policy of the response; the generator uses policy/knowledge-aware dialogue representation for response generation. Our policy-driven model gets state-of-the-art performance on three public benchmarks and we provide a detailed analysis of the experimental results. Our code/data will be released on GitHub.
[NLP-42] Self-Explained Keywords Empower Large Language Models for Code Generation
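Quick read: The paper addresses the problem that, because of the long-tail distribution of their training data, large language models (LLMs) misunderstand or overlook low-frequency keywords during code generation. The key to its solution is SEK (Self-Explained Keywords), in which the LLM itself extracts and explains the keywords in the problem description and ranks them by frequency. According to the paper's analysis, this shifts the model's attention from low-frequency keywords to their high-frequency counterparts, and experiments on several benchmarks show significant gains in code generation.
Link: https://arxiv.org/abs/2410.15966
Authors: Lishui Fan, Mouxiang Chen, Zhongxin Liu
Keywords: Large language models, achieved impressive performance, Large language, code generation, language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Notes: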
Abstract:Large language models (LLMs) have achieved impressive performance in code generation. However, due to the long-tail distribution of LLMs’ training data, low-frequency terms are typically underrepresented in the training process. Consequently, LLMs often misunderstand or overlook problem-specific, low-frequency keywords during code generation, compromising the accuracy of the generated code. To address this, we propose a novel technique named SEK (Self-Explained Keywords), which empowers an LLM for better code generation by extracting and explaining the key terms in the problem description with the LLM itself and ranking them based on frequency. Comprehensive experiments across three benchmarks, i.e., HumanEval(+), MBPP(+), and APPS, with five representative LLMs, show that SEK can significantly improve LLMs in code generation, yielding substantial and consistent gains. For instance, SEK improves the Pass@1 of DeepSeek-Coder-V2-Instruct from 85.4% to 93.3% on the HumanEval benchmark. Further analysis confirms that SEK enables the LLMs to shift their attention from low-frequency keywords to their corresponding high-frequency counterparts.
摘要:大语言模型 (LLMs) 在代码生成方面取得了令人瞩目的表现。然而,由于 LLMs 训练数据的长尾分布,低频词汇在训练过程中通常代表性不足。因此,LLMs 在代码生成过程中常常误解或忽略特定问题的低频关键词,从而影响生成代码的准确性。为了解决这一问题,我们提出了一种名为 SEK (Self-Explained Keywords) 的新技术,该技术通过 LLM 自身提取和解释问题描述中的关键术语,并根据频率对其进行排序,从而提升 LLM 的代码生成能力。我们在三个基准测试(即 HumanEval(+)、MBPP(+) 和 APPS)上对五个代表性 LLMs 进行了全面的实验,结果表明 SEK 能够显著提升 LLMs 的代码生成能力,带来显著且一致的增益。例如,SEK 将 DeepSeek-Coder-V2-Instruct 在 Humaneval 基准测试中的 Pass@1 从 85.4% 提升至 93.3%。进一步的分析证实,SEK 使 LLMs 能够将注意力从低频关键词转移到相应的高频关键词上。
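The steps named in the abstract (extract keywords, explain them, rank them by frequency, then prompt for code) can be sketched roughly as below. This is only an illustration under stated assumptions: the explanations are hard-coded here, whereas in SEK they come from the LLM itself, and the `corpus_freq` table, the prompt wording, and the rarest-first ordering are hypothetical choices that the abstract does not specify.

```python
from typing import Dict, List, Tuple


def rank_keywords(explained: Dict[str, str],
                  corpus_freq: Dict[str, int]) -> List[Tuple[str, str]]:
    """Order (keyword, explanation) pairs by corpus frequency, rarest first.

    The rarest-first direction is an assumption made for this sketch.
    """
    return sorted(explained.items(), key=lambda kv: corpus_freq.get(kv[0], 0))


def build_prompt(problem: str, ranked: List[Tuple[str, str]]) -> str:
    """Append the ranked keyword explanations to the problem description."""
    lines = [problem, "", "Keywords and explanations:"]
    lines += [f"- {kw}: {expl}" for kw, expl in ranked]
    return "\n".join(lines)


# Hypothetical explanations; in SEK these are produced by the LLM itself.
explained = {
    "list": "an ordered sequence of items",
    "palindrome": "a sequence that reads the same forwards and backwards",
}
corpus_freq = {"list": 90000, "palindrome": 120}  # made-up counts

ranked = rank_keywords(explained, corpus_freq)
prompt = build_prompt("Return True if the list is a palindrome.", ranked)
print(prompt)
```

The enriched prompt would then be sent to the code-generation model in place of the bare problem description.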
[NLP-43] Systematic Exploration of Dialogue Summarization Approaches for Reproducibility, Comparative Assessment, and Methodological Innovations for Advancing Natural Language Processing in Abstractive Summarization
[Quick Read]: This paper examines the reproducibility of dialogue summarization models in NLP, in particular whether the informativeness and quality of the generated summaries can be verified. The authors re-run and re-evaluate several models on the AMI dataset, including Hierarchical Memory Networks (HMNet) and several variants of Pointer-Generator Networks (PGN), and use human assessment to quantify informativeness and quality, which exposes the discrepancies between the original studies and the reproduction.
Link: https://arxiv.org/abs/2410.15962
Authors: Yugandhar Reddy Gogireddy,Jithendra Reddy Gogireddy
Keywords: natural language processing, Reproducibility in scientific, dialogue summarization models, language processing, experimental findings
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Reproducibility in scientific research, particularly within the realm of natural language processing (NLP), is essential for validating and verifying the robustness of experimental findings. This paper delves into the reproduction and evaluation of dialogue summarization models, focusing specifically on the discrepancies observed between original studies and our reproduction efforts. Dialogue summarization is a critical aspect of NLP, aiming to condense conversational content into concise and informative summaries, thus aiding in efficient information retrieval and decision-making processes. Our research involved a thorough examination of several dialogue summarization models using the AMI (Augmented Multi-party Interaction) dataset. The models assessed include Hierarchical Memory Networks (HMNet) and various versions of Pointer-Generator Networks (PGN), namely PGN(DKE), PGN(DRD), PGN(DTS), and PGN(DALL). The primary objective was to evaluate the informativeness and quality of the summaries generated by these models through human assessment, a method that introduces subjectivity and variability in the evaluation process. The analysis began with Dataset 1, where the sample standard deviation of 0.656 indicated a moderate dispersion of data points around the mean.
摘要:科学研究的再现性,特别是在自然语言处理 (NLP) 领域,对于验证和确认实验结果的稳健性至关重要。本文深入探讨了对话摘要模型的再现与评估,特别关注了原始研究与我们的再现工作之间观察到的差异。对话摘要是 NLP 的一个重要方面,旨在将对话内容浓缩为简洁且信息丰富的摘要,从而有助于高效的信息检索和决策过程。我们的研究涉及对使用 AMI (Augmented Multi-party Interaction) 数据集的多个对话摘要模型进行全面检查。评估的模型包括分层记忆网络 (HMNet) 和多种版本的指针生成网络 (PGN),即 PGN(DKE)、PGN(DRD)、PGN(DTS) 和 PGN(DALL)。主要目标是评估这些模型生成的摘要的信息量和质量,通过人工评估,这种方法引入了评估过程中的主观性和变异性。分析从数据集 1 开始,样本标准差为 0.656,表明数据点在均值周围的分散程度适中。
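The abstract reports a sample standard deviation (0.656 for "Dataset 1") without showing how such a figure is obtained. As a reminder, the sketch below computes a sample standard deviation by hand and checks it against the standard library. The ratings are hypothetical and are not the paper's data.

```python
import math
import statistics

# Hypothetical human ratings of summary quality (not taken from the paper).
scores = [3.0, 4.0, 2.5, 3.5, 4.5]

mean = sum(scores) / len(scores)
# Sample variance divides by n - 1 (Bessel's correction), not by n.
sample_var = sum((x - mean) ** 2 for x in scores) / (len(scores) - 1)
sample_sd = math.sqrt(sample_var)

print(f"mean={mean:.3f}  sample_sd={sample_sd:.3f}")
print("matches statistics.stdev:",
      math.isclose(sample_sd, statistics.stdev(scores)))
```

A larger value means the ratings are more widely dispersed around the mean, which is how the abstract interprets its figure.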
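[NLP-44] Do Large Language Models Have an English Accent? Evaluating and Improving the Naturalness of Multilingual LLMs
[Quick Read]: This paper tackles the unnatural output of current LLMs in non-English languages, where English-centric bias carries vocabulary and grammar patterns over into other languages. It introduces new automatic corpus-level metrics for the lexical and syntactic naturalness of LLM output in a multilingual setting; an evaluation on a curated French and Chinese benchmark reveals a tendency toward English-influenced patterns. The paper also proposes a simple and effective alignment method that improves naturalness in a target language and domain without hurting performance on general-purpose benchmarks.
Link: https://arxiv.org/abs/2410.15956
Authors: Yanzhu Guo,Simone Conia,Zelin Zhou,Min Li,Saloni Potdar,Henry Xiao
Keywords: Current Large Language, Large Language Models, Current Large, strong English-centric biases, exhibit strong English-centric
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: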
Abstract:Current Large Language Models (LLMs) are predominantly designed with English as the primary language, and even the few that are multilingual tend to exhibit strong English-centric biases. Much like speakers who might produce awkward expressions when learning a second language, LLMs often generate unnatural outputs in non-English languages, reflecting English-centric patterns in both vocabulary and grammar. Despite the importance of this issue, the naturalness of multilingual LLM outputs has received limited attention. In this paper, we address this gap by introducing novel automatic corpus-level metrics to assess the lexical and syntactic naturalness of LLM outputs in a multilingual context. Using our new metrics, we evaluate state-of-the-art LLMs on a curated benchmark in French and Chinese, revealing a tendency towards English-influenced patterns. To mitigate this issue, we also propose a simple and effective alignment method to improve the naturalness of an LLM in a target language and domain, achieving consistent improvements in naturalness without compromising the performance on general-purpose benchmarks. Our work highlights the importance of developing multilingual metrics, resources and methods for the new wave of multilingual LLMs.
摘要:当前的大语言模型 (LLM) 主要以英语为主要语言设计,即使是那些多语言的模型,也往往表现出强烈的英语中心偏见。就像在学习第二语言时可能产生尴尬表达的说话者一样,LLM 在非英语语言中经常生成不自然的输出,反映出词汇和语法上的英语中心模式。尽管这一问题的重要性不言而喻,但多语言 LLM 输出的自然性却鲜少受到关注。本文通过引入新的自动语料库级指标,来评估多语言背景下 LLM 输出的词汇和句法自然性,填补了这一空白。利用我们的新指标,我们对法语和中文的精选基准上的最先进 LLM 进行了评估,揭示了其倾向于英语影响模式的倾向。为了缓解这一问题,我们还提出了一种简单且有效的对齐方法,以提高目标语言和领域中 LLM 的自然性,在不损害通用基准性能的情况下实现了自然性的持续改进。我们的工作强调了为新一代多语言 LLM 开发多语言指标、资源和方法的重要性。
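[NLP-45] Findings of the Third Shared Task on Multilingual Coreference Resolution
[Quick Read]: This paper gives an overview of the third shared task on multilingual coreference resolution, held as part of the CRAC 2024 workshop. The main change from earlier editions is that participants are no longer given gold slots for zero anaphora, which makes the task harder and more realistic. The task also covers a more diverse set of languages, with a particular focus on historical languages, and uses version 1.2 of the CorefUD collection for training and evaluation.
Link: https://arxiv.org/abs/2410.15949
Authors: Michal Novák,Barbora Dohnalová,Miloslav Konopík,Anna Nedoluzhko,Martin Popel,Ondřej Pražák,Jakub Sido,Milan Straka,Zdeněk Žabokrtský,Daniel Zeman
Keywords: multilingual coreference resolution, held as part, paper presents, presents an overview, shared task
Categories: Computation and Language (cs.CL)
Comments: Accepted to CRAC 2024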
Abstract:The paper presents an overview of the third edition of the shared task on multilingual coreference resolution, held as part of the CRAC 2024 workshop. Similarly to the previous two editions, the participants were challenged to develop systems capable of identifying mentions and clustering them based on identity coreference. This year’s edition took another step towards real-world application by not providing participants with gold slots for zero anaphora, increasing the task’s complexity and realism. In addition, the shared task was expanded to include a more diverse set of languages, with a particular focus on historical languages. The training and evaluation data were drawn from version 1.2 of the multilingual collection of harmonized coreference resources CorefUD, encompassing 21 datasets across 15 languages. 6 systems competed in this shared task.
摘要:本文概述了作为 CRAC 2024 研讨会一部分的第三届多语言共指消解共享任务。与前两届类似,参与者被要求开发能够识别提及并根据身份共指进行聚类的系统。今年的版本进一步迈向实际应用,不再为参与者提供零代词的黄金槽位,从而增加了任务的复杂性和现实性。此外,共享任务扩展到包括更多样化的语言集合,特别关注历史语言。训练和评估数据来自多语言协调共指资源 CorefUD 的 1.2 版本,涵盖了 15 种语言的 21 个数据集。共有 6 个系统参与了此次共享任务。
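[NLP-46] CausalGraph2LLM: Evaluating LLMs for Causal Queries
[Quick Read]: This paper studies how well LLMs understand and reason over causal graphs. It proposes CausalGraph2LLM, a comprehensive benchmark covering a variety of causal graph settings, with causal queries divided into graph-level and node-level queries. Benchmarking both open-source and closed models shows that LLMs are highly sensitive to how the graph is encoded: even capable models such as GPT-4 and Gemini-1.5 show deviations of about 60%. The same sensitivity appears in downstream causal intervention tasks, and the authors observe that models can display biases when given contextual information about a causal graph, possibly stemming from their parametric memory.
Link: https://arxiv.org/abs/2410.15939
Authors: Ivaxi Sheth,Bahare Fatemi,Mario Fritz
Keywords: Causality is essential, interpret true relationships, causal, scientific research, enabling researchers
Categories: Computation and Language (cs.CL)
Comments: Code - this https URL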
Abstract:Causality is essential in scientific research, enabling researchers to interpret true relationships between variables. These causal relationships are often represented by causal graphs, which are directed acyclic graphs. With the recent advancements in Large Language Models (LLMs), there is an increasing interest in exploring their capabilities in causal reasoning and their potential use to hypothesize causal graphs. These tasks necessitate the LLMs to encode the causal graph effectively for subsequent downstream tasks. In this paper, we propose a comprehensive benchmark, CausalGraph2LLM, encompassing a variety of causal graph settings to assess the causal graph understanding capability of LLMs. We categorize the causal queries into two types: graph-level and node-level queries. We benchmark both open-sourced and closed models for our study. Our findings reveal that while LLMs show promise in this domain, they are highly sensitive to the encoding used. Even capable models like GPT-4 and Gemini-1.5 exhibit sensitivity to encoding, with deviations of about 60%. We further demonstrate this sensitivity for downstream causal intervention tasks. Moreover, we observe that LLMs can often display biases when presented with contextual information about a causal graph, potentially stemming from their parametric memory.
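Summary: Causality is essential in scientific research because it lets researchers interpret the true relationships between variables. These relationships are usually represented by causal graphs, which are directed acyclic graphs. With recent advances in large language models (LLMs), there is growing interest in their capacity for causal reasoning and in their potential use for hypothesizing causal graphs. Such tasks require the LLM to encode the causal graph effectively for downstream use. This paper proposes a comprehensive benchmark, CausalGraph2LLM, that covers a variety of causal graph settings to assess how well LLMs understand causal graphs. Causal queries are divided into graph-level and node-level queries, and both open-source and closed models are benchmarked. The findings show that, although LLMs are promising in this area, they are highly sensitive to the encoding used: even capable models such as GPT-4 and Gemini-1.5 show deviations of about 60%. The same sensitivity is demonstrated on downstream causal intervention tasks. In addition, LLMs often display biases when given contextual information about a causal graph, which may stem from their parametric memory.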
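To make "sensitivity to the encoding" concrete, the sketch below serializes the same small causal graph in two different textual forms and computes the ground truth for one node-level query. The two encodings, the function names, and the example graph are assumptions made for illustration; they are not the benchmark's actual formats.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (cause, effect)


def encode_edge_list(edges: List[Edge]) -> str:
    """Encoding 1: one 'cause -> effect' item per edge."""
    return "; ".join(f"{a} -> {b}" for a, b in edges)


def encode_adjacency(edges: List[Edge]) -> str:
    """Encoding 2: one sentence per cause, listing all of its effects."""
    children: Dict[str, List[str]] = {}
    for a, b in edges:
        children.setdefault(a, []).append(b)
    return ". ".join(f"{a} causes {', '.join(bs)}" for a, bs in children.items())


def parents(edges: List[Edge], node: str) -> List[str]:
    """Ground truth for a node-level query: the direct causes of `node`."""
    return sorted(a for a, b in edges if b == node)


# Hypothetical three-node directed acyclic graph.
edges = [("smoking", "tar"), ("tar", "cancer"), ("smoking", "cancer")]

print(encode_edge_list(edges))
print(encode_adjacency(edges))
print(parents(edges, "cancer"))
```

Both strings describe the same graph, so an encoding-robust model should answer "what are the direct causes of cancer?" identically for either one; the abstract reports that current models often do not.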
[NLP-47] Yeah, Un, Oh: Continuous and Real-time Backchannel Prediction with Fine-tuning of Voice Activity Projection
[Quick Read]: This paper addresses real-time, continuous prediction of backchannels (short responses such as "yeah" and "oh") in dialogue systems, with the aim of more natural and fluent conversation. The method fine-tunes a Voice Activity Projection (VAP) model so that it predicts both the timing and the type of backchannels frame by frame on unbalanced, real-world data. The model is first pre-trained on a general dialogue corpus and then fine-tuned on a dataset focused on backchannel behavior; it outperforms baseline methods and performs robustly in real-time settings.
Link: https://arxiv.org/abs/2410.15929
Authors: Koji Inoue,Divesh Lala,Gabriel Skantze,Tatsuya Kawahara
Keywords: short backchannel utterances, Voice Activity Projection, human conversations, play a crucial, crucial role
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:
Abstract:In human conversations, short backchannel utterances such as “yeah” and “oh” play a crucial role in facilitating smooth and engaging dialogue. These backchannels signal attentiveness and understanding without interrupting the speaker, making their accurate prediction essential for creating more natural conversational agents. This paper proposes a novel method for real-time, continuous backchannel prediction using a fine-tuned Voice Activity Projection (VAP) model. While existing approaches have relied on turn-based or artificially balanced datasets, our approach predicts both the timing and type of backchannels in a continuous and frame-wise manner on unbalanced, real-world datasets. We first pre-train the VAP model on a general dialogue corpus to capture conversational dynamics and then fine-tune it on a specialized dataset focused on backchannel behavior. Experimental results demonstrate that our model outperforms baseline methods in both timing and type prediction tasks, achieving robust performance in real-time environments. This research offers a promising step toward more responsive and human-like dialogue systems, with implications for interactive spoken dialogue applications such as virtual assistants and robots.
摘要:在人类对话中,短小的反馈话语如“嗯”和“哦”在促进流畅且引人入胜的对话中起着至关重要的作用。这些反馈信号在不打断说话者的情况下表明了注意力和理解,因此对其准确预测对于创建更自然的对话代理至关重要。本文提出了一种使用微调的语音活动投影 (Voice Activity Projection, VAP) 模型进行实时连续反馈预测的新方法。现有方法依赖于基于回合或人工平衡的数据集,而我们的方法在不平衡的真实世界数据集上以连续和逐帧的方式预测反馈的时机和类型。我们首先在通用对话语料库上预训练 VAP 模型以捕捉对话动态,然后在专注于反馈行为的专用数据集上对其进行微调。实验结果表明,我们的模型在时机和类型预测任务中均优于基线方法,在实时环境中表现出稳健的性能。这项研究为更响应和类人的对话系统迈出了有希望的一步,对虚拟助手和机器人等交互式语音对话应用具有重要意义。
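[NLP-48] Mitigating Object Hallucination via Concentric Causal Attention NEURIPS2024
[Quick Read]: This paper addresses object hallucination in Large Vision Language Models (LVLMs), where the generated text is not factually aligned with the image input. The authors link the problem to the long-term decay of Rotary Position Encoding (RoPE): the model hallucinates more when the relevant visual cues are far from the instruction tokens in the input sequence. Their remedy is Concentric Causal Attention (CCA), a positional alignment strategy that reduces the relative distance between visual and instruction tokens, which strengthens their interaction, improves perception, and alleviates object hallucination.
Link: https://arxiv.org/abs/2410.15926
Authors: Yun Xing,Yiheng Li,Ivan Laptev,Shijian Lu
Keywords: Recent Large Vision, Vision Language Models, Large Vision Language, Vision Language, present remarkable zero-shot
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: To appear at NeurIPS 2024. Code is available at this https URL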
Abstract:Recent Large Vision Language Models (LVLMs) present remarkable zero-shot conversational and reasoning capabilities given multimodal queries. Nevertheless, they suffer from object hallucination, a phenomenon where LVLMs are prone to generate textual responses not factually aligned with image inputs. Our pilot study reveals that object hallucination is closely tied with Rotary Position Encoding (RoPE), a widely adopted positional dependency modeling design in existing LVLMs. Due to the long-term decay in RoPE, LVLMs tend to hallucinate more when relevant visual cues are distant from instruction tokens in the multimodal input sequence. Additionally, we observe a similar effect when reversing the sequential order of visual tokens during multimodal alignment. Our tests indicate that long-term decay in RoPE poses challenges to LVLMs while capturing visual-instruction interactions across long distances. We propose Concentric Causal Attention (CCA), a simple yet effective positional alignment strategy that mitigates the impact of RoPE long-term decay in LVLMs by naturally reducing relative distance between visual and instruction tokens. With CCA, visual tokens can better interact with instruction tokens, thereby enhancing model’s perception capability and alleviating object hallucination. Without bells and whistles, our positional alignment method surpasses existing hallucination mitigation strategies by large margins on multiple object hallucination benchmarks.
摘要:近期的大视觉语言模型 (Large Vision Language Models, LVLMs) 在面对多模态查询时展现出显著的零样本对话和推理能力。然而,这些模型存在对象幻觉 (object hallucination) 的问题,即 LVLMs 倾向于生成与图像输入不符的文本响应。我们的初步研究表明,对象幻觉与 Rotary Position Encoding (RoPE) 密切相关,RoPE 是现有 LVLMs 中广泛采用的位置依赖性建模设计。由于 RoPE 的长期衰减,当相关的视觉线索与多模态输入序列中的指令 Token 距离较远时,LVLMs 更容易产生幻觉。此外,我们在多模态对齐过程中观察到,当视觉 Token 的顺序被反转时,也会出现类似的效果。我们的测试表明,RoPE 的长期衰减对 LVLMs 在捕捉远距离视觉-指令交互方面构成了挑战。我们提出了同心因果注意力 (Concentric Causal Attention, CCA),这是一种简单而有效的位置对齐策略,通过自然减少视觉 Token 与指令 Token 之间的相对距离,减轻了 RoPE 长期衰减对 LVLMs 的影响。通过 CCA,视觉 Token 能更好地与指令 Token 交互,从而增强模型的感知能力并缓解对象幻觉。在不增加复杂性的情况下,我们的位置对齐方法在多个对象幻觉基准测试中大幅超越现有的幻觉缓解策略。
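The "long-term decay" of RoPE mentioned above can be illustrated with a small numeric sketch. It does not implement CCA; it only shows, for an all-ones query and key (an assumption chosen to keep the arithmetic simple), that the rotary inner product tends to shrink as the relative distance between two tokens grows. The dimension of 64 and base of 10000 are the commonly used defaults, assumed here.

```python
import math


def rope_score(distance: int, dim: int = 64, base: float = 10000.0) -> float:
    """Inner product of an all-ones query and key after RoPE.

    Each 2-D pair (1, 1) rotated by distance * theta_i contributes
    |(1, 1)|^2 * cos(distance * theta_i) = 2 * cos(distance * theta_i),
    with theta_i = base ** (-2 * i / dim).
    """
    return sum(2.0 * math.cos(distance * base ** (-2.0 * i / dim))
               for i in range(dim // 2))


# Average score over ten nearby distances versus ten distant ones.
near = sum(rope_score(m) for m in range(1, 11)) / 10
far = sum(rope_score(m) for m in range(500, 510)) / 10

print(f"distance 0:       {rope_score(0):.1f}")
print(f"distances 1-10:   {near:.1f} (average)")
print(f"distances 500-509: {far:.1f} (average)")
```

Under this view, visual tokens that sit hundreds of positions away from the instruction tokens receive systematically weaker attention scores, which is the effect the authors aim to reduce by shortening the relative distance.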
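[NLP-49] DefVerify: Do Hate Speech Models Reflect Their Dataset's Definition?
[Quick Read]: This paper addresses the difficulty of making sure that a deployed predictive model actually encodes domain-specific requirements, taking hate speech detection as the case study. It proposes DefVerify, a three-step procedure that (i) encodes a user-specified definition of hate speech, (ii) quantifies how far the model reflects that definition, and (iii) tries to identify the point of failure in the workflow. The authors apply DefVerify to six popular hate speech benchmark datasets to find gaps between definition and model behavior.
Link: https://arxiv.org/abs/2410.15911
Authors: Urja Khurana,Eric Nalisnick,Antske Fokkens
Keywords: hate speech, eventually be deployed, hate speech detection, difficult to ensure, ensure that domain-specific
Categories: Computation and Language (cs.CL)
Comments: Preprint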
Abstract:When building a predictive model, it is often difficult to ensure that domain-specific requirements are encoded by the model that will eventually be deployed. Consider researchers working on hate speech detection. They will have an idea of what is considered hate speech, but building a model that reflects their view accurately requires preserving those ideals throughout the workflow of data set construction and model training. Complications such as sampling bias, annotation bias, and model misspecification almost always arise, possibly resulting in a gap between the domain specification and the model’s actual behavior upon deployment. To address this issue for hate speech detection, we propose DefVerify: a 3-step procedure that (i) encodes a user-specified definition of hate speech, (ii) quantifies to what extent the model reflects the intended definition, and (iii) tries to identify the point of failure in the workflow. We use DefVerify to find gaps between definition and model behavior when applied to six popular hate speech benchmark datasets.
摘要:在构建预测模型时,通常难以确保模型能够准确编码领域特定的需求。以仇恨言论检测领域的研究人员为例,他们对于何为仇恨言论有一定的理解,但要构建一个准确反映其观点的模型,需要在数据集构建和模型训练的整个工作流程中保持这些理念。然而,采样偏差、标注偏差和模型错误指定等问题几乎总是会出现,可能导致领域规范与模型实际部署后的行为之间存在差距。为了解决仇恨言论检测中的这一问题,我们提出了 DefVerify:一个三步流程,包括 (i) 编码用户指定的仇恨言论定义,(ii) 量化模型在多大程度上反映了预期的定义,以及 (iii) 尝试识别工作流程中的失败点。我们使用 DefVerify 来发现当应用于六个流行的仇恨言论基准数据集时,定义与模型行为之间的差距。
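[NLP-50] Using GPT Models for Qualitative and Quantitative News Analytics in the 2024 US Presidental Election Process
[Quick Read]: This paper studies automated qualitative and quantitative analysis of news, applied to coverage of the 2024 US presidential election process. The approach combines the Google Search API with the GPT-4o model through retrieval-augmented generation (RAG): news from different sources and time periods is analyzed, the GPT model produces quantitative scores, and Bayesian regression over those scores yields trend lines. The distributions of the regression parameters make it possible to analyze uncertainty in the election process.
Link: https://arxiv.org/abs/2410.15884
Authors: Bohdan M. Pavlyshenko
Keywords: Google Search API, Google Search, Search API, retrieval-augmented generation, RAG
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: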
Abstract:The paper considers an approach of using Google Search API and GPT-4o model for qualitative and quantitative analyses of news through retrieval-augmented generation (RAG). This approach was applied to analyze news about the 2024 US presidential election process. Different news sources for different time periods have been analyzed. Quantitative scores generated by GPT model have been analyzed using Bayesian regression to derive trend lines. The distributions found for the regression parameters allow for the analysis of uncertainty in the election process. The obtained results demonstrate that using the GPT models for news analysis, one can get informative analytics and provide key insights that can be applied in further analyses of election processes.
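Summary: This paper considers an approach that uses the Google Search API and the GPT-4o model for qualitative and quantitative analysis of news through retrieval-augmented generation (RAG). The approach is applied to news about the 2024 US presidential election process, covering different news sources over different time periods. Quantitative scores generated by the GPT model are analyzed with Bayesian regression to derive trend lines. The distributions found for the regression parameters allow the uncertainty in the election process to be analyzed. The results show that GPT models can produce informative news analytics and key insights that can be applied in further analyses of election processes.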
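As a rough illustration of "Bayesian regression to derive trend lines", the sketch below fits a straight line to a sequence of scores and returns a posterior mean and variance for the intercept and slope. It is a minimal conjugate model assuming a known noise variance and independent zero-mean Gaussian priors; the paper's actual model and data are not given in the abstract, and the scores here are made up.

```python
from typing import List, Tuple


def bayes_linreg(xs: List[float], ys: List[float],
                 noise_var: float = 1.0, prior_var: float = 100.0
                 ) -> Tuple[Tuple[float, float], Tuple[float, float]]:
    """Posterior over (intercept, slope) for y = a + b*x + noise.

    Assumes known noise variance and an independent N(0, prior_var)
    prior on each parameter. Returns (means, variances).
    """
    n = len(xs)
    sx, sxx = sum(xs), sum(x * x for x in xs)
    sy, sxy = sum(ys), sum(x * y for x, y in zip(xs, ys))
    # Posterior precision A = X^T X / noise_var + I / prior_var (2x2, symmetric).
    a11 = n / noise_var + 1.0 / prior_var
    a12 = sx / noise_var
    a22 = sxx / noise_var + 1.0 / prior_var
    det = a11 * a22 - a12 * a12
    # Posterior covariance is the inverse of A.
    c11, c12, c22 = a22 / det, -a12 / det, a11 / det
    b1, b2 = sy / noise_var, sxy / noise_var
    means = (c11 * b1 + c12 * b2, c12 * b1 + c22 * b2)
    return means, (c11, c22)


# Hypothetical weekly scores that rise by 2 per week from a baseline of 1.
weeks = [0.0, 1.0, 2.0, 3.0, 4.0]
scores = [1.0, 3.0, 5.0, 7.0, 9.0]

(intercept, slope), (var_intercept, var_slope) = bayes_linreg(
    weeks, scores, noise_var=0.25)
print(f"slope mean={slope:.3f}, slope variance={var_slope:.4f}")
```

The posterior variance of the slope is what allows statements about uncertainty in the trend, rather than a single point estimate.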
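[NLP-51] Principles of semantic and functional efficiency in grammatical patterning
[Quick Read]: This paper asks why grammatical features such as number and gender show consistent organizational patterns across languages. It unifies two fundamental properties of grammar, semantic encoding and agreement-based predictability, into a single information-theoretic objective under cognitive constraints. The analyses show that grammatical organization provably inherits from perceptual attributes, but that grammars empirically prioritize functional goals, favoring efficient language processing over semantic encoding.
Link: https://arxiv.org/abs/2410.15865
Authors: Emily Cheng,Francesca Franzon
Keywords: number and gender, gender serve, serve two central, central functions, functions in human
Categories: Computation and Language (cs.CL)
Comments: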
Abstract:Grammatical features such as number and gender serve two central functions in human languages. While they encode salient semantic attributes like numerosity and animacy, they also offload sentence processing cost by predictably linking words together via grammatical agreement. Grammars exhibit consistent organizational patterns across diverse languages, invariably rooted in a semantic foundation, a widely confirmed but still theoretically unexplained phenomenon. To explain the basis of universal grammatical patterns, we unify two fundamental properties of grammar, semantic encoding and agreement-based predictability, into a single information-theoretic objective under cognitive constraints. Our analyses reveal that grammatical organization provably inherits from perceptual attributes, but that grammars empirically prioritize functional goals, promoting efficient language processing over semantic encoding.
摘要:语法特征,如数和性,在人类语言中具有两个核心功能。它们不仅编码了显著的语义属性,如数量和生命性,还通过语法一致性可预测地连接词语,从而减轻句子处理成本。语法在不同语言中表现出一致的组织模式,这些模式始终根植于语义基础,这是一个广泛证实但仍未得到理论解释的现象。为了解释普遍语法模式的根源,我们将语法的两个基本属性——语义编码和基于一致性的可预测性——统一在认知约束下的单一信息论目标中。我们的分析揭示,语法组织确实继承了感知属性,但语法在经验上优先考虑功能目标,促进语言处理效率而非语义编码。
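[NLP-52] Did somebody say “Gest-IT”? A pilot exploration of multimodal data management
[Quick Read]: This paper is a pilot exploration of how to build, manage, and analyze a multimodal corpus, aimed at studying how gesture-making patterns vary in conversations between sighted people and people with visual impairment. The Gest-IT resource uses a three-layer annotation with orthographic, prosodic, and gestural transcriptions, and the authors propose a unified CoNLL-U corpus as a basis for future work.
Link: https://arxiv.org/abs/2410.15825
Authors: Ludovica Pannitto,Lorenzo Albanesi,Laura Marion,Federica Maria Martines,Carmelo Caruso,Claudia S. Bianchini,Francesca Masini,Caterina Mauri
Keywords: management and analysis, paper presents, presents a pilot, pilot exploration, multimodal corpus
Categories: Computation and Language (cs.CL)
Comments: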
Abstract:The paper presents a pilot exploration of the construction, management and analysis of a multimodal corpus. Through a three-layer annotation that provides orthographic, prosodic, and gestural transcriptions, the Gest-IT resource allows one to investigate the variation of gesture-making patterns in conversations between sighted people and people with visual impairment. After discussing the transcription methods and technical procedures employed in our study, we propose a unified CoNLL-U corpus and indicate our future steps.
摘要:本文介绍了多模态语料库构建、管理和分析的初步探索。通过提供正字法、韵律和手势转录的三层标注,Gest-IT 资源使得研究有视力和视觉障碍者之间对话中的手势模式变化成为可能。在讨论了本研究中采用的转录方法和技术流程后,我们提出了一个统一的 CoNLL-U 语料库,并指出了未来的研究方向。
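For readers unfamiliar with CoNLL-U, the sketch below parses a single token line into its ten tab-separated fields. How Gest-IT actually stores prosodic and gestural layers is not stated in the abstract; placing a `Gesture=nod` attribute in the MISC column is purely a hypothetical choice for this example, as is the sample token.

```python
from typing import Dict

# The ten standard CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_token(line: str) -> Dict[str, str]:
    """Split one CoNLL-U token line into a field-name -> value mapping."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
    return dict(zip(FIELDS, values))


def parse_misc(misc: str) -> Dict[str, str]:
    """Parse 'Key=Value|Key=Value' attributes; '_' means the field is empty."""
    if misc == "_":
        return {}
    return dict(item.split("=", 1) for item in misc.split("|"))


# Hypothetical token line; the Gesture attribute is an assumption.
line = "1\tciao\tciao\tINTJ\t_\t_\t0\troot\t_\tGesture=nod"
token = parse_token(line)
print(token["form"], token["upos"], parse_misc(token["misc"]))
```

Keeping every layer in one token-aligned file is what makes a unified corpus convenient to query with existing CoNLL-U tooling.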
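[NLP-53] Improve Dense Passage Retrieval with Entailment Tuning EMNLP2024
[Quick Read]: This paper addresses the ambiguous definition of relevance in retrieval systems and proposes entailment tuning to improve the embeddings of dense retrievers. Building on the observation that a major class of relevance aligns with entailment in natural language inference (NLI), the method unifies retrieval data and NLI data using existence claims as a bridge, then trains the retriever to predict the claims entailed by a passage through a variant of masked prediction. The method plugs into existing dense retrieval methods, and experiments show that it is effective.
Link: https://arxiv.org/abs/2410.15801
Authors: Lu Dai,Hao Liu,Hui Xiong
Keywords: open-domain question answering, downstream NLP tasks, downstream NLP, retrieval-augmented generation, NLP tasks
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: EMNLP 2024 Main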
Abstract:Retrieval module can be plugged into many downstream NLP tasks to improve their performance, such as open-domain question answering and retrieval-augmented generation. The key to a retrieval system is to calculate relevance scores to query and passage pairs. However, the definition of relevance is often ambiguous. We observed that a major class of relevance aligns with the concept of entailment in NLI tasks. Based on this observation, we designed a method called entailment tuning to improve the embedding of dense retrievers. Specifically, we unify the form of retrieval data and NLI data using existence claim as a bridge. Then, we train retrievers to predict the claims entailed in a passage with a variant task of masked prediction. Our method can be efficiently plugged into current dense retrieval methods, and experiments show the effectiveness of our method.
摘要:检索模块可以插入到许多下游自然语言处理 (NLP) 任务中以提升其性能,例如开放域问答和检索增强生成。检索系统的关键在于计算查询与段落对的相关性分数。然而,相关性的定义往往模糊不清。我们观察到,相关性的一大类与自然语言推理 (NLI) 任务中的蕴含概念相吻合。基于这一观察,我们设计了一种称为蕴含调优的方法来改进密集检索器的嵌入。具体来说,我们通过存在声明作为桥梁,统一了检索数据和 NLI 数据的格式。然后,我们训练检索器以预测段落中蕴含的声明,采用了一种掩码预测的变体任务。我们的方法可以高效地集成到当前的密集检索方法中,实验结果显示了该方法的有效性。
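[NLP-54] Learning-to-Defer for Extractive Question Answering
[Quick Read]: This paper addresses two limitations of pre-trained language models in extractive question answering: they struggle in complex scenarios that need nuanced interpretation or inferential reasoning, and their size makes deployment on resource-constrained devices difficult. It introduces an adapted two-stage Learning-to-Defer mechanism that selectively defers to human experts or larger models without retraining the language model, which maintains computational efficiency while improving reliability and accuracy in ambiguous contexts. The authors prove Bayes consistency and $(\mathcal{H}, \mathcal{R})$-consistency of the surrogate loss, which guarantees the optimality of the final solution.
Link: https://arxiv.org/abs/2410.15761
Authors: Montreuil Yannis,Carlier Axel,Ng Lai Xing,Ooi Wei Tsang
Keywords: large-scale textual corpora, contextual language understanding, profoundly impacted, impacted the field, field of extractive
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 25 pages, 17 main paper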
Abstract:Pre-trained language models have profoundly impacted the field of extractive question-answering, leveraging large-scale textual corpora to enhance contextual language understanding. Despite their success, these models struggle in complex scenarios that demand nuanced interpretation or inferential reasoning beyond immediate textual cues. Furthermore, their size poses deployment challenges on resource-constrained devices. Addressing these limitations, we introduce an adapted two-stage Learning-to-Defer mechanism that enhances decision-making by enabling selective deference to human experts or larger models without retraining language models in the context of question-answering. This approach not only maintains computational efficiency but also significantly improves model reliability and accuracy in ambiguous contexts. We establish the theoretical soundness of our methodology by proving Bayes and $(\mathcal{H}, \mathcal{R})$-consistency of our surrogate loss function, guaranteeing the optimality of the final solution. Empirical evaluations on the SQuADv2 dataset illustrate performance gains from integrating human expertise and leveraging larger models. Our results further demonstrate that deferring a minimal number of queries allows the smaller model to achieve performance comparable to their larger counterparts while preserving computing efficiency, thus broadening the applicability of pre-trained language models in diverse operational environments.
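Summary: Pre-trained language models have profoundly influenced extractive question answering by using large-scale text corpora to improve contextual language understanding. Despite this success, they still struggle in complex scenarios that call for nuanced interpretation or for inference beyond immediate textual cues, and their size makes deployment on resource-constrained devices difficult. To address these limitations, the paper introduces an adapted two-stage Learning-to-Defer mechanism that improves decision-making by selectively deferring to human experts or larger models in the question-answering setting, without retraining the language model. The approach maintains computational efficiency and significantly improves reliability and accuracy in ambiguous contexts. Its theoretical soundness is established by proving Bayes consistency and $(\mathcal{H}, \mathcal{R})$-consistency of the surrogate loss function, which guarantees the optimality of the final solution. Empirical evaluation on the SQuADv2 dataset shows performance gains from integrating human expertise and from using larger models. The results further show that deferring a minimal number of queries lets the smaller model reach performance comparable to its larger counterparts while preserving computing efficiency, which broadens the applicability of pre-trained language models across diverse operating environments.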
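The deferral decision itself can be pictured as a cost comparison. In the paper the decision rule is learned through a surrogate loss; the sketch below skips the learning step entirely and assumes that per-query error estimates and consultation costs are already available. The agent names and all numbers are hypothetical.

```python
from typing import Dict


def choose_agent(est_error: Dict[str, float],
                 consult_cost: Dict[str, float]) -> str:
    """Pick the agent with the lowest estimated error plus consultation cost.

    Agents without an entry in consult_cost are treated as free to use.
    """
    return min(est_error, key=lambda a: est_error[a] + consult_cost.get(a, 0.0))


# Hypothetical consultation costs: the small model is free, experts are not.
consult_cost = {"small_model": 0.0, "large_model": 0.1, "human": 0.3}

# Hypothetical per-query error estimates for an easy and a hard question.
easy_query = {"small_model": 0.05, "large_model": 0.03, "human": 0.01}
hard_query = {"small_model": 0.60, "large_model": 0.20, "human": 0.05}

print(choose_agent(easy_query, consult_cost))
print(choose_agent(hard_query, consult_cost))
```

With these numbers the easy query stays with the small model, while the hard one is deferred, which mirrors the abstract's claim that deferring only a few queries can recover most of the larger model's accuracy.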
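[NLP-55] Natural Language Querying System Through Entity Enrichment
[Quick Read]: This paper addresses querying databases through a natural language interface, in a solution designed for a French enterprise that wants to offer such an interface to its clients. The approach relies on entity enrichment to translate natural language queries into database queries. The database is treated through a logical paradigm, which suggests the approach can be adapted to different database models, and preliminary experiments show good precision.
Link: https://arxiv.org/abs/2410.15753
Authors: Joshua Amavi,Mirian Halfeld Ferrari(LIFO, Pamda),Nicolas Hiot(LIFO, Pamda)
Keywords: domain expert querying, expert querying system, domain expert, expert querying, querying system
Categories: Computation and Language (cs.CL); Databases (cs.DB)
Comments: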
Abstract:This paper focuses on a domain expert querying system over databases. It presents a solution designed for a French enterprise interested in offering a natural language interface for its clients. The approach, based on entity enrichment, aims at translating natural language queries into database queries. In this paper, the database is treated through a logical paradigm, suggesting the adaptability of our approach to different database models. The good precision of our method is shown through some preliminary experiments.
摘要:本文聚焦于一个基于数据库的领域专家查询系统。该系统为一家法国企业设计,旨在为其客户提供自然语言接口。我们的方法基于实体增强,旨在将自然语言查询转换为数据库查询。在本文中,数据库通过逻辑范式进行处理,表明我们的方法适用于不同的数据库模型。通过一些初步实验,展示了我们方法的良好精确性。
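[NLP-56] Toeing the Party Line: Election Manifestos as a Key to Understand Political Discourse on Twitter EMNLP
[Quick Read]: This paper addresses the constantly shifting nature of political discourse on Twitter, and in particular how to predict the relative positions of parties from Twitter data without manual annotation. The method extends recent manifesto-based approaches to Twitter and uses hashtags as a signal to fine-tune text representations. The resulting positioning is stable and reflects manifesto positioning, even when only smaller subsets of tweets from shorter time periods are available.
Link: https://arxiv.org/abs/2410.15743
Authors: Maximilian Maurer,Tanise Ceron,Sebastian Padó,Gabriella Lapesa
Keywords: politicians continuously make, continuously make statements, moving target, politicians continuously, continuously make
Categories: Computation and Language (cs.CL)
Comments: 9 pages, accepted at EMNLP (Findings) 2024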
Abstract:Political discourse on Twitter is a moving target: politicians continuously make statements about their positions. It is therefore crucial to track their discourse on social media to understand their ideological positions and goals. However, Twitter data is also challenging to work with since it is ambiguous and often dependent on social context, and consequently, recent work on political positioning has tended to focus strongly on manifestos (parties’ electoral programs) rather than social media. In this paper, we extend recently proposed methods to predict pairwise positional similarities between parties from the manifesto case to the Twitter case, using hashtags as a signal to fine-tune text representations, without the need for manual annotation. We verify the efficacy of fine-tuning and conduct a series of experiments that assess the robustness of our method for low-resource scenarios. We find that our method yields stable positioning reflective of manifesto positioning, both in scenarios with all tweets of candidates across years available and when only smaller subsets from shorter time periods are available. This indicates that it is possible to reliably analyze the relative positioning of actors forgoing manual annotation, even in the noisier context of social media.
摘要:Twitter 上的政治言论是一个不断变化的靶子:政治人物不断发表关于其立场的声明。因此,追踪他们在社交媒体上的言论以理解其意识形态立场和目标至关重要。然而,Twitter 数据也极具挑战性,因为它具有模糊性且往往依赖于社会背景,因此,最近关于政治定位的研究倾向于主要关注宣言(政党的选举纲领)而非社交媒体。在本文中,我们扩展了最近提出的方法,将预测宣言案例中政党间位置相似性的方法应用于 Twitter 案例,使用标签作为信号来微调文本表示,而无需手动标注。我们验证了微调的有效性,并进行了一系列实验,评估了我们的方法在低资源场景下的稳健性。我们发现,无论是在所有年份候选人的所有推文可用的情况下,还是仅在较短时间段内的小子集可用的情况下,我们的方法都能产生稳定的定位,反映出宣言定位。这表明,即使在社交媒体这种更嘈杂的环境中,也可以在不进行手动标注的情况下可靠地分析参与者的相对定位。
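[NLP-57] Who's Who: Large Language Models Meet Knowledge Conflicts in Practice EMNLP2024
[Quick Read]: This paper addresses conflicting information in the retrieval context of retrieval-augmented generation (RAG). It introduces WhoQA, a benchmark dataset that induces conflicts by asking about a common property of entities that share the same name, and uses it to evaluate how LLMs behave under knowledge conflicts. The authors recommend that models transparently inform users about conflicts rather than decide what to present based on their inherent biases. Experiments show that, despite the simplicity of WhoQA questions, knowledge conflicts significantly degrade LLM performance in RAG settings.
Link: https://arxiv.org/abs/2410.15737
Authors: Quang Hieu Pham,Hoang Ngo,Anh Tuan Luu,Dat Quoc Nguyen
Keywords: static memory limits, Retrieval-augmented generation, methods are viable, viable solutions, solutions for addressing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted to EMNLP 2024 Findings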
Abstract:Retrieval-augmented generation (RAG) methods are viable solutions for addressing the static memory limits of pre-trained language models. Nevertheless, encountering conflicting sources of information within the retrieval context is an inevitable practical challenge. In such situations, the language models are recommended to transparently inform users about the conflicts rather than autonomously deciding what to present based on their inherent biases. To analyze how current large language models (LLMs) align with our recommendation, we introduce WhoQA, a public benchmark dataset to examine model’s behavior in knowledge conflict situations. We induce conflicts by asking about a common property among entities having the same name, resulting in questions with up to 8 distinctive answers. WhoQA evaluation set includes 5K questions across 13 Wikidata property types and 150K Wikipedia entities. Our experiments show that despite the simplicity of WhoQA questions, knowledge conflicts significantly degrades LLMs’ performance in RAG settings.
摘要:检索增强生成 (Retrieval-augmented generation, RAG) 方法是解决预训练语言模型静态记忆限制的可行方案。然而,在检索上下文中遇到信息来源冲突是一个不可避免的实际挑战。在这种情况下,建议语言模型透明地告知用户冲突,而不是基于其固有偏见自主决定呈现内容。为了分析当前大语言模型 (Large Language Model, LLM) 如何符合我们的建议,我们引入了 WhoQA,这是一个用于检验模型在知识冲突情境下行为的公开基准数据集。我们通过询问具有相同名称实体的共同属性来引发冲突,从而产生最多有 8 个不同答案的问题。WhoQA 评估集包含 5K 个问题,涵盖 13 种 Wikidata 属性类型和 150K 个 Wikipedia 实体。我们的实验表明,尽管 WhoQA 问题简单,但知识冲突显著降低了 LLM 在 RAG 设置中的性能。
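The behavior the authors recommend (report conflicts instead of silently picking one answer) can be sketched as below. The sketch assumes that an answer has already been extracted from each retrieved passage; the function name, the output wording, and the example sources are hypothetical and are not part of WhoQA.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def summarize_answers(evidence: List[Tuple[str, str]]) -> str:
    """Combine (source_id, answer) pairs, flagging any disagreement."""
    by_answer: Dict[str, List[str]] = defaultdict(list)
    for source, answer in evidence:
        by_answer[answer].append(source)
    if not by_answer:
        return "No answer found"
    if len(by_answer) == 1:
        # All sources agree, so the single answer can be returned directly.
        return next(iter(by_answer))
    # Sources disagree: list every candidate answer with its supporting sources.
    parts = [f"{ans} (per {', '.join(srcs)})" for ans, srcs in by_answer.items()]
    return "Sources conflict: " + "; ".join(parts)


# Hypothetical retrieval results for a question about a person's birth year,
# where two different people share the same name.
evidence = [("doc1", "1950"), ("doc2", "1987"), ("doc3", "1950")]
print(summarize_answers(evidence))
```

Listing each candidate with its sources leaves the choice to the user, rather than letting the model's inherent biases decide which same-named entity is meant.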
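[NLP-58] Reducing annotator bias by belief elicitation
[Quick Read]: This paper addresses annotator bias in data annotation: systematic disagreement that can be traced back to annotators' backgrounds and that can lead to representational bias against minority group perspectives. It proposes a simple method that places no requirements on the number of annotators or annotations per instance: instead of asking annotators for their own judgements, it asks for their beliefs about other annotators' judgements. In two controlled, survey-based experiments, the systematic differences between the two groups of annotators are consistently reduced when beliefs are elicited, which could improve the generalisability of AI systems and prevent harm to under-represented groups.
Link: https://arxiv.org/abs/2410.15726
Authors: Terne Sasha Thorn Jakobsen,Andreas Bjerre-Nielsen,Robert Böhm
Keywords: Artificial Intelligence, development of Artificial, Crowdsourced annotations, annotator bias, bias
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); General Economics (econ.GN)
Comments: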
Abstract:Crowdsourced annotations of data play a substantial role in the development of Artificial Intelligence (AI). It is broadly recognised that annotations of text data can contain annotator bias, where systematic disagreement in annotations can be traced back to differences in the annotators’ backgrounds. Being unaware of such annotator bias can lead to representational bias against minority group perspectives and therefore several methods have been proposed for recognising bias or preserving perspectives. These methods typically require either a substantial number of annotators or annotations per data instance. In this study, we propose a simple method for handling bias in annotations without requirements on the number of annotators or instances. Instead, we ask annotators about their beliefs of other annotators’ judgements of an instance, under the hypothesis that these beliefs may provide more representative and less biased labels than judgements. The method was examined in two controlled, survey-based experiments involving Democrats and Republicans (n=1,590) asked to judge statements as arguments and then report beliefs about others’ judgements. The results indicate that bias, defined as systematic differences between the two groups of annotators, is consistently reduced when asking for beliefs instead of judgements. Our proposed method therefore has the potential to reduce the risk of annotator bias, thereby improving the generalisability of AI systems and preventing harm to unrepresented socio-demographic groups, and we highlight the need for further studies of this potential in other tasks and downstream applications.
[NLP-59] Mitigating Hallucinations of Large Language Models in Medical Information Extraction via Contrastive Decoding EMNLP2024
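Quick read: This paper addresses the hallucination problem that large language models (LLMs) face in Medical Information Extraction (MIE) tasks. The key is a new method called ALternate Contrastive Decoding (ALCD). ALCD recasts MIE as an identify-and-classify process and separates the identification and classification functions of the LLM by selectively masking the optimization of tokens during fine-tuning. At inference time it alternately contrasts the output distributions of the sub-task models, strengthening identification and classification while limiting the influence of the LLM's other inherent abilities. The paper also proposes an alternate adaptive constraint strategy that adjusts the scale and scope of the contrastive tokens more effectively, and it reports significant improvements over conventional decoding methods in resolving hallucinations.
Link: https://arxiv.org/abs/2410.15702
Authors: Derong Xu, Ziheng Zhang, Zhihong Zhu, Zhenxi Lin, Qidong Liu, Xian Wu, Tong Xu, Xiangyu Zhao, Yefeng Zheng, Enhong Chen
Keywords (EN): attracted extensive interests, large language models, Medical Information Extraction, large language, attracted extensive
Categories: Computation and Language (cs.CL)
Comments: Accepted by EMNLP 2024 Findings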
Abstract: The impressive capabilities of large language models (LLMs) have attracted extensive interest in applying LLMs to the medical field. However, the complex nature of clinical environments presents significant hallucination challenges for LLMs, hindering their widespread adoption. In this paper, we address these hallucination issues in the context of Medical Information Extraction (MIE) tasks by introducing ALternate Contrastive Decoding (ALCD). We begin by redefining MIE tasks as an identify-and-classify process. We then separate the identification and classification functions of LLMs by selectively masking the optimization of tokens during fine-tuning. During the inference stage, we alternately contrast output distributions derived from sub-task models. This approach aims to selectively enhance the identification and classification capabilities while minimizing the influence of other inherent abilities in LLMs. Additionally, we propose an alternate adaptive constraint strategy to more effectively adjust the scale and scope of contrastive tokens. Through comprehensive experiments on two different backbones and six diverse medical information extraction tasks, ALCD demonstrates significant improvements in resolving hallucination issues compared to conventional decoding methods.
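The abstract does not spell out ALCD's scoring rule, so the following is only a minimal Python sketch of generic contrastive decoding, the family of techniques that ALCD builds on. It assumes two next-token distributions are already available as plain dictionaries. The function names, the parameters `alpha` and `beta`, and the toy probabilities are all hypothetical and are not taken from the paper.

```python
import math


def contrastive_scores(p_main, p_contrast, alpha=1.0, beta=0.1):
    """Score tokens by log p_main - alpha * log p_contrast.

    Assumption: both arguments map every candidate token to a strictly
    positive probability. Tokens whose main-model probability falls below
    beta * max(p_main) are dropped first (a plausibility constraint), so
    that very unlikely tokens cannot win purely through the contrast term.
    """
    cutoff = beta * max(p_main.values())
    scores = {}
    for token, prob in p_main.items():
        if prob < cutoff:
            continue  # implausible under the main model, so skip it
        scores[token] = math.log(prob) - alpha * math.log(p_contrast[token])
    return scores


def pick_token(p_main, p_contrast, alpha=1.0, beta=0.1):
    """Return the token with the highest contrastive score."""
    scores = contrastive_scores(p_main, p_contrast, alpha, beta)
    return max(scores, key=scores.get)


# Toy distributions over three placeholder tokens (hypothetical numbers).
P_MAIN = {"A": 0.5, "B": 0.4, "C": 0.1}
P_CONTRAST = {"A": 0.6, "B": 0.2, "C": 0.2}
```

With these toy numbers and `beta=0.3`, token `C` is filtered out by the plausibility cutoff, and `B` wins even though `A` has the higher raw probability, because the contrast distribution also favours `A`.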
[NLP-60] InternLM2.5-StepProver: Advancing Automated Theorem Proving via Expert Iteration on Large-Scale LEAN Problems
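Quick read: This paper studies the use of large language models (LLMs) for mathematical theorem proving in formal languages such as LEAN. The key is the expert iteration paradigm applied to a large-scale LEAN problem dataset (Lean-workbook): the model attempts proofs and is iteratively self-trained on the proofs it discovers. The paper also trains a critic model that selects relatively easy problems for the policy model to attempt and guides the search toward deeper proofs. The resulting model achieves open-source state-of-the-art results on several benchmarks.
Link: https://arxiv.org/abs/2410.15700
Authors: Zijian Wu, Suozhi Huang, Zhejian Zhou, Huaiyuan Ying, Jiayu Wang, Dahua Lin, Kai Chen
Keywords (EN): utilizing formal languages, mathematical theorem proving, Large Language Models, formal languages, Large Language
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: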
Abstract: Large Language Models (LLMs) have emerged as powerful tools in mathematical theorem proving, particularly when utilizing formal languages such as LEAN. The major learning paradigm is expert iteration, which necessitates a pre-defined dataset comprising numerous mathematical problems. In this process, LLMs attempt to prove problems within the dataset and iteratively refine their capabilities through self-training on the proofs they discover. We propose using the large-scale LEAN problem dataset Lean-workbook for expert iteration, with more than 20,000 CPU days. During expert iteration, we found log-linear trends relating the number of solved problems to proof length and CPU usage. We train a critic model to select relatively easy problems for policy models to attempt and to guide the model to search for deeper proofs. InternLM2.5-StepProver achieves open-source state-of-the-art on MiniF2F, Lean-Workbook-Plus, ProofNet, and Putnam benchmarks. Specifically, it achieves a pass rate of 65.9% on the MiniF2F-test and proves (or disproves) 17.0% of problems in Lean-Workbook-Plus, a significant improvement over the 9.5% of problems proved when Lean-Workbook-Plus was released. We open-source our models and searched proofs at this https URL and this https URL.
[NLP-61] Tokenization as Finite-State Transduction
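Quick read: This paper studies tokenization, the step in modern neural language model pipelines that converts input text into a sequence of subword tokens. The key is a finite-state transduction framework, introduced from first principles, that can efficiently encode all possible tokenizations of a regular language. The paper shows constructively that two popular schemes, Byte-Pair Encoding (BPE) and MaxMatch (WordPiece), fit within this framework. This is notable for BPE, which resembles a context-free grammar and does not tokenize strings from left to right. One application is guided generation, where a language model's outputs are constrained to match a pattern while still adhering to the underlying tokenizer's canonical tokenization.
Link: https://arxiv.org/abs/2410.15696
Authors: Marco Cognetta, Naoaki Okazaki
Keywords (EN): modern neural language, neural language model, language model pipelines, step in modern, modern neural
Categories: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
Comments: 10 pages + 5 pages in appendix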
Abstract: Tokenization is the first step in modern neural language model pipelines where an input text is converted to a sequence of subword tokens. We introduce from first principles a finite-state transduction framework which can efficiently encode all possible tokenizations of a regular language. We then constructively show that Byte-Pair Encoding (BPE) and MaxMatch (WordPiece), two popular tokenization schemes, fit within this framework. For BPE, this is particularly surprising given its resemblance to context-free grammar and the fact that it does not tokenize strings from left to right. An application of this is to guided generation, where the outputs of a language model are constrained to match some pattern. Here, patterns are encoded at the character level, which creates a mismatch between the constraints and the model’s subword vocabulary. While past work has focused only on constraining outputs without regard to the underlying tokenization algorithm, our framework allows for simultaneously constraining the model outputs to match a specified pattern while also adhering to the underlying tokenizer’s canonical tokenization.
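To make the difference between "all possible tokenizations" and a tokenizer's canonical tokenization concrete, here is a small Python sketch. It is not the paper's finite-state transducer construction. It directly implements greedy longest-match (MaxMatch-style) tokenization and a brute-force enumeration over a toy vocabulary. The vocabulary and function names are hypothetical, and WordPiece's `##` continuation prefix is omitted for simplicity.

```python
def max_match(text, vocab):
    """Greedy left-to-right longest-match tokenization (MaxMatch-style)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a vocabulary hit.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry matches at position {i}")
    return tokens


def all_tokenizations(text, vocab):
    """Enumerate every way to split text into vocabulary items (exponential)."""
    if not text:
        return [[]]
    results = []
    for j in range(1, len(text) + 1):
        if text[:j] in vocab:
            for rest in all_tokenizations(text[j:], vocab):
                results.append([text[:j]] + rest)
    return results


# Hypothetical toy vocabulary.
TOY_VOCAB = {"a", "b", "c", "ab", "bc", "abc"}
```

For the string `abc`, the enumeration yields four segmentations, while `max_match` returns only one of them. A character-level pattern constraint accepts all four, which is the mismatch with the tokenizer's canonical output that the abstract describes.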
[NLP-62] Efficient Terminology Integration for LLM-based Translation in Specialized Domains
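Quick read: This paper addresses the limited accuracy of conventional machine translation methods on specialized terminology in domains such as patents, finance, and biomedicine. The key is to extract terms systematically and build a glossary using the Trie Tree algorithm, then reconstruct the training data so that the large language model (LLM) learns to integrate these specialized terms. This improves terminology accuracy and consistency while using a smaller amount of training data, and the paper reports the highest translation score to date among participants in the WMT patent task.
Link: https://arxiv.org/abs/2410.15690
Authors: Sejoon Kim, Mingi Sung, Jeonghwan Lee, Hyunkuk Lim, Jorge Froilan Gimenez Perez
Keywords (EN): large parallel corpora, Traditional machine translation, typically involve training, Traditional machine, involve training models
Categories: Computation and Language (cs.CL)
Comments: Accepted to WMT 2024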
Abstract: Traditional machine translation methods typically involve training models directly on large parallel corpora, with limited emphasis on specialized terminology. However, in specialized fields such as patent, finance, or biomedical domains, terminology is crucial for translation, with many terms that need to be translated following agreed-upon conventions. In this paper we introduce a methodology that efficiently trains models with a smaller amount of data while preserving the accuracy of terminology translation. We achieve this through a systematic process of term extraction and glossary creation using the Trie Tree algorithm, followed by data reconstruction to teach the LLM how to integrate these specialized terms. This methodology enhances the model’s ability to handle specialized terminology and ensures high-quality translations, particularly in fields where term consistency is crucial. Our approach has demonstrated exceptional performance, achieving the highest translation score among participants in the WMT patent task to date, showcasing its effectiveness and broad applicability in specialized translation domains where general methods often fall short.
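The abstract names a trie but gives no details of how it is used, so the following Python sketch shows only one plausible reading: a character-level trie holding glossary terms, scanned over a sentence to find the longest matching term at each position. The class and function names, the glossary entries, and the placeholder translations are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.translation = None  # set only where a glossary term ends


def build_trie(glossary):
    """Insert each source-language term, character by character."""
    root = TrieNode()
    for term, translation in glossary.items():
        node = root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.translation = translation
    return root


def find_terms(text, root):
    """Return (start, end, term, translation) for the longest match at each position.

    Matching is purely character-based, so a term can also match inside a
    longer word; a real system would likely add word-boundary checks.
    """
    matches = []
    i = 0
    while i < len(text):
        node, j, best = root, i, None
        while j < len(text) and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
            if node.translation is not None:
                best = (i, j, text[i:j], node.translation)
        if best is not None:
            matches.append(best)
            i = best[1]  # continue after the matched term
        else:
            i += 1
    return matches


# Hypothetical glossary with placeholder target-side strings.
GLOSSARY = {"prior": "<PRIOR>", "prior art": "<PRIOR_ART>", "claim": "<CLAIM>"}
MATCHES = find_terms("the prior art in claim 1", build_trie(GLOSSARY))
```

Here the longer entry `prior art` is preferred over its prefix `prior`, which is the usual reason to use a trie for glossary lookup: all candidate terms sharing a prefix are checked in a single pass.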
[NLP-63] DomainSum: A Hierarchical Benchmark for Fine-Grained Domain Shift in Abstractive Text Summarization
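Quick read: This paper addresses how abstractive summarization models perform and generalize across domains. The key is DomainSum, a hierarchical benchmark that captures fine-grained domain shifts in abstractive summarization and categorizes them into three levels: genre, style, and topic. Through comprehensive benchmark analysis, the paper shows that these shifts follow a hierarchical structure, and it evaluates the domain generalization of commonly used pre-trained language models (PLMs) and large language models (LLMs) in in-domain and cross-domain settings.
Link: https://arxiv.org/abs/2410.15687
Authors: Haohan Yuan, Haopeng Zhang
Keywords (EN): documents affect performance, abstractive summarization focuses, single-domain applications, focuses on single-domain, documents affect
Categories: Computation and Language (cs.CL)
Comments: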
Abstract:Most research on abstractive summarization focuses on single-domain applications, often neglecting how domain shifts between documents affect performance and the generalization ability of summarization models. To address this issue, we introduce DomainSum, a hierarchical benchmark designed to capture fine-grained domain shifts in abstractive summarization. We categorize these shifts into three levels: genre, style, and topic, and demonstrate through comprehensive benchmark analysis that they follow a hierarchical structure. Furthermore, we evaluate the domain generalization capabilities of commonly used pre-trained language models (PLMs) and large language models (LLMs) in in-domain and cross-domain settings.
[NLP-64] Revealing and Mitigating the Local Pattern Shortcuts of Mamba
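Quick read: This paper addresses Mamba's weak performance on tasks that require handling distributed key information. The key is a global selection module added to the Mamba model. With only 4M extra parameters, it raises the performance of a 130M Mamba model on tasks with distributed information from 0 to 80.54 points, compensating for Mamba's difficulty with non-local information.
Link: https://arxiv.org/abs/2410.15678
Authors: Wangjie You, Zecheng Tang, Juntao Li, Lili Yao, Min Zhang
Keywords (EN): Large language models, Large language, advanced significantly due, memory demands limit, State Space Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: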
Abstract: Large language models (LLMs) have advanced significantly due to the attention mechanism, but their quadratic complexity and linear memory demands limit their performance on long-context tasks. Recently, researchers introduced Mamba, an advanced model built upon State Space Models (SSMs) that offers linear complexity and constant memory. Although Mamba is reported to match or surpass the performance of attention-based models, our analysis reveals a performance gap: Mamba excels in tasks that involve localized key information but faces challenges with tasks that require handling distributed key information. Our controlled experiments suggest that this inconsistency arises from Mamba’s reliance on local pattern shortcuts, which enable the model to remember local key information within its limited memory but hinder its ability to retain more dispersed information. Therefore, we introduce a global selection module into the Mamba model to address this issue. Experiments on both existing and proposed synthetic tasks, as well as real-world tasks, demonstrate the effectiveness of our method. Notably, with the introduction of only 4M extra parameters, our approach enables the Mamba model (130M) to achieve a significant improvement on tasks with distributed information, increasing its performance from 0 to 80.54 points.
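As a rough illustration of what "linear complexity and constant memory" means for a state space model, the Python sketch below runs a scalar linear recurrence over a sequence while keeping a single state value. It is a toy with fixed, hypothetical parameters `a`, `b`, and `c`. Mamba itself uses input-dependent (selective) parameters and vector-valued states, and the paper's global selection module is not shown here.

```python
def ssm_scan(xs, a=0.5, b=1.0, c=1.0):
    """Run h_t = a * h_{t-1} + b * x_t and emit y_t = c * h_t.

    One pass over the input (linear time) with a single scalar state
    (constant memory), regardless of sequence length.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # the only thing carried between steps
        ys.append(c * h)
    return ys


# An impulse at the first step fades geometrically in later outputs.
IMPULSE_RESPONSE = ssm_scan([1.0, 0.0, 0.0])
```

Because everything the model knows about earlier inputs has to fit in the fixed-size state, the contribution of an old input shrinks as the sequence grows. This gives some intuition for why information spread across a long context can be harder to retain than a single local cue.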
[NLP-65] Learning to Generate and Evaluate Fact-checking Explanations with Transformers
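Quick read: This paper targets the spread of misinformation in an era dominated by digital platforms by developing transformer-based Explainable Artificial Intelligence (XAI) models that assess the veracity of information and generate human-accessible explanations. The key points are: 1) generative models that produce fact-checking explanations, with the best-performing model reaching a ROUGE-1 score of 47.77; 2) metric learning models that automatically evaluate explanation quality and show a moderately strong correlation with human judgements on objective dimensions such as (self-)contradiction and hallucination, with a Matthews Correlation Coefficient (MCC) of around 0.7. These methods improve the transparency and reliability of AI-driven fact-checking systems and are a first step toward reducing reliance on manual assessment, which strengthens users' trust in such systems.
Link: https://arxiv.org/abs/2410.15669
Authors: Darius Feher, Abdullah Khered, Hao Zhang, Riza Batista-Navarro, Viktor Schlegel
Keywords (EN): assessing information veracity, era increasingly dominated, Explainable Artificial Intelligence, digital platforms
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Forthcoming in Engineering Applications of Artificial Intelligence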
Abstract: In an era increasingly dominated by digital platforms, the spread of misinformation poses a significant challenge, highlighting the need for solutions capable of assessing information veracity. Our research contributes to the field of Explainable Artificial Intelligence (XAI) by developing transformer-based fact-checking models that contextualise and justify their decisions by generating human-accessible explanations. Importantly, we also develop models for automatic evaluation of explanations for fact-checking verdicts across different dimensions such as (self)-contradiction, hallucination, convincingness and overall quality. By introducing human-centred evaluation methods and developing specialised datasets, we emphasise the need for aligning Artificial Intelligence (AI)-generated explanations with human judgements. This approach not only advances theoretical knowledge in XAI but also holds practical implications by enhancing the transparency, reliability and users’ trust in AI-driven fact-checking systems. Furthermore, the development of our metric learning models is a first step towards potentially increasing efficiency and reducing reliance on extensive manual assessment. Based on experimental results, our best performing generative model achieved a ROUGE-1 score of 47.77, demonstrating superior performance in generating fact-checking explanations, particularly when provided with high-quality evidence. Additionally, the best performing metric learning model showed a moderately strong correlation with human judgements on objective dimensions such as (self)-contradiction and hallucination, achieving a Matthews Correlation Coefficient (MCC) of around 0.7.
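For readers unfamiliar with the Matthews Correlation Coefficient reported above, the Python sketch below computes it for binary labels from confusion-matrix counts using the standard formula. It assumes a binary setting with labels 0 and 1; the paper may use a multi-class variant, and the example labels are made up for illustration.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Binary MCC: (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column is empty
    return (tp * tn - fp * fn) / denom


# Hypothetical human labels versus metric-model predictions.
HUMAN = [1, 1, 0, 0]
MODEL = [1, 0, 0, 0]
EXAMPLE_MCC = matthews_corrcoef(HUMAN, MODEL)
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect agreement), so a value of around 0.7 indicates substantial but imperfect agreement with human judgements.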
[NLP-66] RAC: Efficient LLM Factuality Correction with Retrieval Augmentation
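Quick read: This paper addresses the factual errors that large language models (LLMs) can produce in natural language processing tasks. The key is a low-latency post-correction method called Retrieval Augmented Correction (RAC). RAC decomposes the LLM's output into atomic facts and uses retrieved content to verify and correct them at a fine-grained level, improving factuality without additional fine-tuning. The method is general and works with any instruction-tuned LLM, has much lower latency than prior approaches, and yields improvements of up to 30% over state-of-the-art baselines on two factuality evaluation datasets.
Link: https://arxiv.org/abs/2410.15667
Authors: Changmao Li, Jeffrey Flanigan
Keywords (EN): Large Language Models, natural language processing, exhibit impressive results, produce factually incorrect, Large Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: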
Abstract: Large Language Models (LLMs) exhibit impressive results across a wide range of natural language processing (NLP) tasks, yet they can often produce factually incorrect outputs. This paper introduces a simple but effective low-latency post-correction method, Retrieval Augmented Correction (RAC), aimed at enhancing the factual performance of LLMs without requiring additional fine-tuning. Our method is general and can be used with any instruction-tuned LLM, and has greatly reduced latency compared to prior approaches. RAC decomposes the LLM’s output into atomic facts and applies a fine-grained verification and correction process with retrieved content to verify and correct the LLM-generated output. Our extensive experiments show that RAC yields up to 30% improvements over state-of-the-art baselines across two popular factuality evaluation datasets, validating its efficacy and robustness both with and without the integration of Retrieval-Augmented Generation (RAG) across different LLMs. Our code is at this https URL
[NLP-67] Scalable Data Ablation Approximations for Language Models through Modular Training and Merging EMNLP2024
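Quick read: This paper addresses how the composition of training data for large language models (LLMs) affects downstream performance, and in particular the prohibitive training cost of data ablation studies that explore many candidate data mixtures. The key is to train separate models on subsets of the training corpus and to approximate the effect of a candidate mixture by evaluating the parameter average of the corresponding models. Because earlier training computation is reused, amortized training cost scales only linearly with new data, which makes incremental data assessment and mixing far cheaper.
Link: https://arxiv.org/abs/2410.15661
Authors: Clara Na, Ian Magnusson, Ananya Harsh Jha, Tom Sherborne, Emma Strubell, Jesse Dodge, Pradeep Dasigi
Keywords (EN): Large Language Models, Large Language, Language Models, Training data compositions, data
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: EMNLP 2024. 17 pages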
Abstract:Training data compositions for Large Language Models (LLMs) can significantly affect their downstream performance. However, a thorough data ablation study exploring large sets of candidate data mixtures is typically prohibitively expensive since the full effect is seen only after training the models; this can lead practitioners to settle for sub-optimal data mixtures. We propose an efficient method for approximating data ablations which trains individual models on subsets of a training corpus and reuses them across evaluations of combinations of subsets. In continued pre-training experiments, we find that, given an arbitrary evaluation set, the perplexity score of a single model trained on a candidate set of data is strongly correlated with perplexity scores of parameter averages of models trained on distinct partitions of that data. From this finding, we posit that researchers and practitioners can conduct inexpensive simulations of data ablations by maintaining a pool of models that were each trained on partitions of a large training corpus, and assessing candidate data mixtures by evaluating parameter averages of combinations of these models. This approach allows for substantial improvements in amortized training efficiency – scaling only linearly with respect to new data – by enabling reuse of previous training computation, opening new avenues for improving model performance through rigorous, incremental data assessment and mixing.
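The core operation described above, averaging the parameters of models trained on different data partitions, can be sketched in a few lines of Python. This is a minimal illustration, assuming that each model is a dictionary from parameter name to a flat list of floats and that all models share the same architecture. The pool contents, partition names, and function names are hypothetical, and uniform averaging is an assumption rather than a detail confirmed by the abstract.

```python
def average_parameters(models):
    """Uniformly average parameters across models with identical shapes."""
    averaged = {}
    for name in models[0]:
        # Group the i-th entry of this parameter across all models.
        columns = zip(*(model[name] for model in models))
        averaged[name] = [sum(col) / len(models) for col in columns]
    return averaged


def merged_model_for_mixture(pool, partitions):
    """Approximate a model trained on the union of partitions by merging."""
    return average_parameters([pool[p] for p in partitions])


# Hypothetical pool: one small "model" per data partition.
POOL = {
    "news": {"w": [1.0, 3.0]},
    "code": {"w": [3.0, 5.0]},
    "wiki": {"w": [5.0, 1.0]},
}
NEWS_PLUS_CODE = merged_model_for_mixture(POOL, ["news", "code"])
```

Each candidate mixture then costs only an average and an evaluation rather than a new training run. Adding a new partition means training one more model for the pool, which is where the linear scaling mentioned in the abstract comes from.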
[NLP-68] CL-HOI: Cross-Level Human-Object Interaction Distillation from Vision Large Language Models
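Quick read: This paper addresses the dependence of human-object interaction (HOI) detection on extensive manual annotation. The key is a Cross-Level HOI distillation (CL-HOI) framework that distills instance-level HOIs from the image-level understanding of Vision Large Language Models (VLLMs) without manual annotations. The framework has two stages: context distillation, carried out by a Visual Linguistic Translator (VLT), and interaction distillation, carried out by an Interaction Cognition Network (ICN). Contrastive distillation losses transfer image-level context and interaction knowledge from the teacher model to the student model, enabling instance-level HOI detection.
Link: https://arxiv.org/abs/2410.15657
Authors: Jianjun Gao, Chen Cai, Ruoyu Wang, Wenyang Liu, Kim-Hui Yap, Kratika Garg, Boon-Siew Han
Keywords (EN): Vision Language Models, Large Language Models, Vision Large Language, Language Models, instance-level HOI detection
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: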
Abstract: Human-object interaction (HOI) detection has seen advancements with Vision Language Models (VLMs), but these methods often depend on extensive manual annotations. Vision Large Language Models (VLLMs) can inherently recognize and reason about interactions at the image level but are computationally heavy and not designed for instance-level HOI detection. To overcome these limitations, we propose a Cross-Level HOI distillation (CL-HOI) framework, which distills instance-level HOIs from VLLMs’ image-level understanding without the need for manual annotations. Our approach involves two stages: context distillation, where a Visual Linguistic Translator (VLT) converts visual information into linguistic form, and interaction distillation, where an Interaction Cognition Network (ICN) reasons about spatial, visual, and context relations. We design contrastive distillation losses to transfer image-level context and interaction knowledge from the teacher to the student model, enabling instance-level HOI detection. Evaluations on HICO-DET and V-COCO datasets demonstrate that our CL-HOI surpasses existing weakly supervised methods and VLLM supervised methods, showing its efficacy in detecting HOIs without manual labels.
[NLP-69] Resource-Efficient Medical Report Generation using Large Language Models
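Quick read: This paper addresses medical report generation, specifically the automatic writing of radiology reports for chest X-ray images. The key is a lightweight framework built on vision-enabled large language models (LLMs). The authors explore different model sizes and enhancement approaches such as prefix tuning to improve the LLMs' text generation, and they evaluate on the MIMIC-CXR dataset, where the resource-efficient framework generates patient-specific reports with strong medical contextual understanding and high precision.
Link: https://arxiv.org/abs/2410.15642
Authors: Abdullah, Ameer Hamza, Seong Tae Kim
Keywords (EN): chest X-ray images, X-ray images, chest X-ray, automatically writing radiology, automatically writing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: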
Abstract: Medical report generation is the task of automatically writing radiology reports for chest X-ray images. Manually composing these reports is a time-consuming process that is also prone to human errors. Generating medical reports can therefore help reduce the burden on radiologists. In other words, we can promote greater clinical automation in the medical domain. In this work, we propose a new framework leveraging vision-enabled Large Language Models (LLMs) for the task of medical report generation. We introduce a lightweight solution that achieves better or comparable performance compared to previous solutions on the task of medical report generation. We conduct extensive experiments exploring different model sizes and enhancement approaches, such as prefix tuning, to improve the text generation abilities of the LLMs. We evaluate our approach on a prominent large-scale radiology report dataset, MIMIC-CXR. Our results demonstrate the capability of our resource-efficient framework to generate patient-specific reports with strong medical contextual understanding and high precision.
[NLP-70] SMILES-Prompting: A Novel Approach to LLM Jailbreak Attacks in Chemical Synthesis
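Quick read: This paper examines a security vulnerability of large language models (LLMs) in chemistry: their capacity to spread dangerous information, in particular instructions for synthesizing hazardous substances. It evaluates several existing attack methods and introduces a new one, SMILES-prompting, which refers to chemical substances using the Simplified Molecular-Input Line-Entry System (SMILES) and can effectively bypass current safety mechanisms. The paper stresses the urgent need for stronger domain-specific safeguards to prevent misuse of LLMs and improve their potential for positive social impact.
Link: https://arxiv.org/abs/2410.15641
Authors: Aidan Wong, He Cao, Zijing Liu, Yu Li
Keywords (EN): large language models, propagate dangerous information, language models, dangerous information, increasing integration
Categories: Computation and Language (cs.CL)
Comments: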
Abstract:The increasing integration of large language models (LLMs) across various fields has heightened concerns about their potential to propagate dangerous information. This paper specifically explores the security vulnerabilities of LLMs within the field of chemistry, particularly their capacity to provide instructions for synthesizing hazardous substances. We evaluate the effectiveness of several prompt injection attack methods, including red-teaming, explicit prompting, and implicit prompting. Additionally, we introduce a novel attack technique named SMILES-prompting, which uses the Simplified Molecular-Input Line-Entry System (SMILES) to reference chemical substances. Our findings reveal that SMILES-prompting can effectively bypass current safety mechanisms. These findings highlight the urgent need for enhanced domain-specific safeguards in LLMs to prevent misuse and improve their potential for positive social impact.
[NLP-71] Can Large Language Models Invent Algorithms to Improve Themselves?
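Quick read: This paper addresses the fact that algorithms for improving large language models (LLMs) still depend on human design and imagination. The key is a framework called Self-Developing, which lets LLMs autonomously generate, apply, and evaluate model-improving algorithms, so that both the model and the algorithms improve continuously. On mathematical reasoning tasks, the framework produces models that surpass the seed model and consistently outperform models created with human-designed algorithms, and the discovered algorithms transfer to out-of-domain models.
Link: https://arxiv.org/abs/2410.15639
Authors: Yoichi Ishibashi, Taro Yano, Masafumi Oyamada
Keywords (EN): Large Language Models, Large Language, shown remarkable performance, remarkable performance improvements, rapidly gaining adoption
Categories: Computation and Language (cs.CL)
Comments: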
Abstract:Large Language Models (LLMs) have shown remarkable performance improvements and are rapidly gaining adoption in industry. However, the methods for improving LLMs are still designed by humans, which restricts the invention of new model-improving algorithms to human expertise and imagination. To address this, we propose the Self-Developing framework, which enables LLMs to autonomously generate and learn model-improvement algorithms. In this framework, the seed model generates, applies, and evaluates model-improving algorithms, continuously improving both the seed model and the algorithms themselves. In mathematical reasoning tasks, Self-Developing not only creates models that surpass the seed model but also consistently outperforms models created using human-designed algorithms. Additionally, these LLM-discovered algorithms demonstrate strong effectiveness, including transferability to out-of-domain models.
[NLP-72] Selecting Influential Samples for Long Context Alignment via Homologous Models Guidance and Contextual Awareness Measurement
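[Quick read]: This paper addresses how to make large language models handle instructions with extremely long contexts effectively, where the main obstacle is building a high-quality long-context alignment dataset. The key to the solution is the GATEAU framework, which uses Homologous Models' Guidance (HMG) and Contextual Awareness Measurement (CAM) to identify high-quality samples rich in long-range dependencies. HMG estimates how hard a response is to generate because of long-range dependencies by comparing the perplexity of the response under two homologous models with different context windows. CAM estimates how hard the long input context is to understand by checking whether the model's attention focuses on the important segments. The most challenging samples selected by these two measures are used as training data, which improves instruction following and long-context understanding.
Link: https://arxiv.org/abs/2410.15633
Authors: Shuzheng Si, Haozhe Zhao, Gang Chen, Yunshui Li, Kangyang Luo, Chuancheng Lv, Kaikai An, Fanchao Qi, Baobao Chang, Maosong Sun
Keywords (EN): large language models, extremely long contexts, long-range dependencies, fully investigated, expansion of large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)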
Abstract:The expansion of large language models to effectively handle instructions with extremely long contexts has yet to be fully investigated. The primary obstacle lies in constructing a high-quality long instruction-following dataset devised for long context alignment. Existing studies have attempted to scale up the available data volume by synthesizing long instruction-following samples. However, indiscriminately increasing the quantity of data without a well-defined strategy for ensuring data quality may introduce low-quality samples and restrict the final performance. To bridge this gap, we aim to address the unique challenge of long-context alignment, i.e., modeling the long-range dependencies for handling instructions and lengthy input contexts. We propose GATEAU, a novel framework designed to identify the influential and high-quality samples enriched with long-range dependency relations by utilizing crafted Homologous Models’ Guidance (HMG) and Contextual Awareness Measurement (CAM). Specifically, HMG attempts to measure the difficulty of generating corresponding responses due to the long-range dependencies, using the perplexity scores of the response from two homologous models with different context windows. Also, the role of CAM is to measure the difficulty of understanding the long input contexts due to long-range dependencies by evaluating whether the model’s attention is focused on important segments. Built upon both proposed methods, we select the most challenging samples as the influential data to effectively frame the long-range dependencies, thereby achieving better performance of LLMs. Comprehensive experiments indicate that GATEAU effectively identifies samples enriched with long-range dependency relations and the model trained on these selected samples exhibits better instruction-following and long-context understanding capabilities.
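The HMG idea of comparing response perplexity under two homologous models can be sketched as follows. This is a minimal illustration, not the GATEAU implementation: the abstract does not give the exact scoring formula, so the difference-of-perplexities form, the function names, and the `score` field are all assumptions made for this example.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a response from its per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def hmg_score(logprobs_short_ctx, logprobs_long_ctx):
    """Assumed scoring form (hypothetical, not from the paper).

    The score is large when the short-context model finds the response much
    harder than the long-context model does, which suggests the response
    depends on information far back in the input.
    """
    return perplexity(logprobs_short_ctx) - perplexity(logprobs_long_ctx)


def select_top(samples, k):
    """Keep the k samples with the highest score (hypothetical 'score' field)."""
    return sorted(samples, key=lambda s: s["score"], reverse=True)[:k]
```

In practice the two log-probability lists would come from scoring the same response with the two models; here they are plain lists so the sketch stays self-contained.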
[NLP-73] Improving Parallel Program Performance Through DSL-Driven Code Generation with LLM Optimizers
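[Quick read]: This paper addresses how to map computations to processors and assign data to memory efficiently in parallel programming. The key to the solution is an LLM-based optimizer that automatically generates optimized mapper code, which sharply reduces the workload of performance engineers and improves application performance. Concretely, the paper introduces a domain-specific language (DSL) that hides the complexity of low-level code generation and defines a structured search space for the LLM to explore. An LLM optimizer then improves the agentic system that generates the mapper code. In under ten minutes the method finds mappers that beat expert-written ones, with up to 1.34x speedup on scientific applications and up to 1.31x of the expert-designed solution on parallel matrix multiplication algorithms.
Link: https://arxiv.org/abs/2410.15625
Authors: Anjiang Wei, Allen Nie, Thiago S. F. X. Teixeira, Rohan Yadav, Wonchan Lee, Ke Wang, Alex Aiken
Keywords (EN): Mapping computations, computations to processors, processors and assigning, assigning data, data to memory
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Notes: 26 pages, 8 figures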
Abstract:Mapping computations to processors and assigning data to memory are critical for maximizing performance in parallel programming. These mapping decisions are managed through the development of specialized low-level system code, called mappers, crafted by performance engineers. Each mapper is tailored to a specific application and optimized for the underlying machine architecture, a process that requires days of refinement and tuning from an expert. Despite advances in system research, automating mapper generation remains a challenge due to the complexity of making millions of decisions to find the optimal solution and generate the solution as code. We introduce an approach that leverages recent advances in LLM-based optimizers for mapper design. In under ten minutes, our method automatically discovers mappers that surpass human expert designs in scientific applications by up to 1.34X speedup. For parallel matrix multiplication algorithms, our mapper achieves up to 1.31X of the expert-designed solution. To achieve this, we simplify the complexity of low-level code generation by introducing a domain-specific language (DSL) that abstracts the low-level system programming details and defines a structured search space for LLMs to explore. To maximize the application performance, we use an LLM optimizer to improve an agentic system that generates the mapper code. As a result, this approach significantly reduces the workload for performance engineers while achieving substantial performance gains across diverse applications. Finally, our results demonstrate the effectiveness of LLM-based optimization in system design and suggest its potential for addressing other complex system challenges.
[NLP-74] Guardians of Discourse: Evaluating LLMs on Multilingual Offensive Language Detection
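[Quick read]: This paper addresses the lack of thorough evaluation of large language models (LLMs) on offensive language detection in multilingual settings. The key contribution is the first systematic evaluation of three LLMs (GPT-3.5, Flan-T5, and Mistral) on offensive language detection in English, Spanish, and German, together with a study of how different prompt languages and augmented translation data affect performance in non-English contexts. The paper also discusses how inherent bias in the LLMs and in the datasets contributes to mispredictions on sensitive topics.
Link: https://arxiv.org/abs/2410.15623
Authors: Jianfei He, Lilin Wang, Jiaying Wang, Zhenyu Liu, Hongbin Na, Zimu Wang, Wei Wang, Qi Chen
Keywords (EN): Identifying offensive language, social media era, Identifying offensive, offensive language detection, social media analytics
Categories: Computation and Language (cs.CL)
Notes: Accepted at UIC 2024 proceedings. Accepted version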
Abstract:Identifying offensive language is essential for maintaining safety and sustainability in the social media era. Though large language models (LLMs) have demonstrated encouraging potential in social media analytics, they lack thorough evaluation when in offensive language detection, particularly in multilingual environments. We for the first time evaluate multilingual offensive language detection of LLMs in three languages: English, Spanish, and German with three LLMs, GPT-3.5, Flan-T5, and Mistral, in both monolingual and multilingual settings. We further examine the impact of different prompt languages and augmented translation data for the task in non-English contexts. Furthermore, we discuss the impact of the inherent bias in LLMs and the datasets in the mispredictions related to sensitive topics.
[NLP-75] Acoustic Model Optimization over Multiple Data Sources: Merging and Valuation
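[Quick read]: This paper addresses the problem that, with growing privacy awareness and the sheer scale of speech data, developers of automatic speech recognition (ASR) systems can no longer train an acoustic model on the complete dataset. The key to the solution is a two-stage approach: first train multiple acoustic models on different data subsets, then merge them into one high-quality acoustic model using two new algorithms, the Genetic Merge Algorithm (GMA) and the SGD-Based Optimizational Merge Algorithm (SOMA). SOMA relieves GMA's efficiency bottleneck while maintaining model accuracy. In addition, Shapley values are used to score the contribution of each trained model, giving a fair basis for evaluating data effectiveness and for incentivizing data providers.
Link: https://arxiv.org/abs/2410.15620
Authors: Victor Junqiu Wei, Weicheng Wang, Di Jiang, Conghui Tan, Rongzhong Lian
Keywords (EN): Automatic Speech Recognition, infeasible for Automatic, Speech Recognition, Automatic Speech, system developers
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)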
Abstract:Due to the rising awareness of privacy protection and the voluminous scale of speech data, it is becoming infeasible for Automatic Speech Recognition (ASR) system developers to train the acoustic model with complete data as before. For example, the data may be owned by different curators, and it is not allowed to share with others. In this paper, we propose a novel paradigm to solve salient problems plaguing the ASR field. In the first stage, multiple acoustic models are trained based upon different subsets of the complete speech data, while in the second phase, two novel algorithms are utilized to generate a high-quality acoustic model based upon those trained on data subsets. We first propose the Genetic Merge Algorithm (GMA), which is a highly specialized algorithm for optimizing acoustic models but suffers from low efficiency. We further propose the SGD-Based Optimizational Merge Algorithm (SOMA), which effectively alleviates the efficiency bottleneck of GMA and maintains superior model accuracy. Extensive experiments on public data show that the proposed methods can significantly outperform the state-of-the-art. Furthermore, we introduce Shapley Value to estimate the contribution score of the trained models, which is useful for evaluating the effectiveness of the data and providing fair incentives to their curators.
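The Shapley-value contribution score mentioned for the trained models can be illustrated with an exact computation over a small set of players. This is a generic sketch of the Shapley value itself, not the paper's estimator: the `value` function (for example, accuracy of the model merged from a coalition) and the model names are hypothetical, and exact enumeration is only practical for a handful of models.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values by averaging marginal gains over all orderings.

    players: list of hashable identifiers (here, hypothetical model names).
    value:   function mapping a frozenset of players to a number; value of
             the empty coalition should be defined as well.
    """
    totals = {p: 0.0 for p in players}
    n_orderings = 0
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            extended = coalition | {p}
            # Marginal contribution of p when joining this coalition.
            totals[p] += value(extended) - value(coalition)
            coalition = extended
        n_orderings += 1
    return {p: total / n_orderings for p, total in totals.items()}
```

By construction the scores add up to the value of the full coalition minus the value of the empty one, which is what makes them usable as a fair split of credit among data curators.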
[NLP-76] Interventional Speech Noise Injection for ASR Generalizable Spoken Language Understanding
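[Quick read]: This paper addresses the impact of erroneous transcriptions from automatic speech recognition (ASR) systems on spoken language understanding (SLU) models. The key to the solution is a new, less biased augmentation method that cuts off the non-causal effect of noise and injects noise that is plausible for any ASR system, which improves the robustness and generalizability of SLU models to unseen ASR systems.
Link: https://arxiv.org/abs/2410.15609
Authors: Yeonjoon Jung, Jaeseong Lee, Seungtaek Choi, Dohyeon Lee, Minsoo Kim, Seung-won Hwang
Keywords (EN): spoken language understanding, pre-trained language models, SLU models, ASR, pre-trained language
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Notes: 9 pages, 3 figures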
Abstract:Recently, pre-trained language models (PLMs) have been increasingly adopted in spoken language understanding (SLU). However, automatic speech recognition (ASR) systems frequently produce inaccurate transcriptions, leading to noisy inputs for SLU models, which can significantly degrade their performance. To address this, our objective is to train SLU models to withstand ASR errors by exposing them to noises commonly observed in ASR systems, referred to as ASR-plausible noises. Speech noise injection (SNI) methods have pursued this objective by introducing ASR-plausible noises, but we argue that these methods are inherently biased towards specific ASR systems, or ASR-specific noises. In this work, we propose a novel and less biased augmentation method of introducing the noises that are plausible to any ASR system, by cutting off the non-causal effect of noises. Experimental results and analyses demonstrate the effectiveness of our proposed methods in enhancing the robustness and generalizability of SLU models against unseen ASR systems by introducing more diverse and plausible ASR noises in advance.
[NLP-77] Moonshine: Speech Recognition for Live Transcription and Voice Commands
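[Quick read]: This paper addresses compute efficiency in live speech transcription and voice command processing. The key to the solution is an encoder-decoder Transformer that uses Rotary Position Embedding (RoPE) instead of traditional absolute position embeddings and is trained without zero-padding, which makes the encoder more efficient at inference time. In the reported experiments, Moonshine Tiny needs 5x less compute to transcribe a 10-second speech segment with no increase in word error rate on standard evaluation datasets, which shows its potential for real-time and resource-constrained applications.
Link: https://arxiv.org/abs/2410.15608
Authors: Nat Jeffries, Evan King, Manjunath Kudlur, Guy Nicholson, James Wang, Pete Warden
Keywords (EN): voice command processing, Rotary Position Embedding, paper introduces Moonshine, recognition models optimized, command processing
Categories: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Notes: 7 pages, 6 figures, 3 tables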
Abstract:This paper introduces Moonshine, a family of speech recognition models optimized for live transcription and voice command processing. Moonshine is based on an encoder-decoder transformer architecture and employs Rotary Position Embedding (RoPE) instead of traditional absolute position embeddings. The model is trained on speech segments of various lengths, but without using zero-padding, leading to greater efficiency for the encoder during inference time. When benchmarked against OpenAI’s Whisper this http URL, Moonshine Tiny demonstrates a 5x reduction in compute requirements for transcribing a 10-second speech segment while incurring no increase in word error rates across standard evaluation datasets. These results highlight Moonshine’s potential for real-time and resource-constrained applications.
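For readers unfamiliar with Rotary Position Embedding, the following is a minimal sketch of the general technique, not Moonshine's code. It assumes the common adjacent-pair convention and a base of 10000; the pairing layout and base actually used by Moonshine are not stated in the abstract.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of vec by position-dependent angles.

    Pair k (indices 2k, 2k+1) is rotated by position * base**(-2k/d),
    where d is the vector length. d must be even.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # i = 2k, so exponent is -2k/d
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out
```

The property that makes this a relative encoding is that the dot product of a rotated query at position m and a rotated key at position n depends only on m - n, so shifting both positions by the same amount leaves attention scores unchanged.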
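[NLP-78] A Comprehensive Survey of Datasets, Theories, Variants, and Applications in Direct Preference Optimization
[Quick read]: This paper is a comprehensive review of the challenges and opportunities of Direct Preference Optimization (DPO) for aligning large language models (LLMs) with human preferences. It analyzes DPO's theoretical foundations, variants, relevant preference datasets, and applications, categorizes recent studies by key research questions to give a full picture of the current landscape, and proposes future research directions for model alignment.
Link: https://arxiv.org/abs/2410.15595
Authors: Wenyi Xiao, Zechuan Wang, Leilei Gan, Shuai Zhao, Wanggui He, Luu Anh Tuan, Long Chen, Hao Jiang, Zhou Zhao, Fei Wu
Keywords (EN): aligning policy models, large language models, Direct Preference Optimization, aligning policy, increasingly critical
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)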
Abstract:With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, acting as an RL-free alternative to Reinforcement Learning from Human Feedback (RLHF). Despite DPO’s various advancements and inherent limitations, an in-depth review of these aspects is currently lacking in the literature. In this work, we present a comprehensive review of the challenges and opportunities in DPO, covering theoretical analyses, variants, relevant preference datasets, and applications. Specifically, we categorize recent studies on DPO based on key research questions to provide a thorough understanding of DPO’s current landscape. Additionally, we propose several future research directions to offer insights on model alignment for the research community.
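As background for the survey's subject, the standard DPO objective for a single preference pair can be written in a few lines. This is a sketch of the widely used loss, with sequence log-probabilities passed in as plain numbers; the argument names and the default `beta` are choices made for this example, and none of the variants the survey covers are shown.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the total log-probability of a full response under the
    policy or the frozen reference model. The loss is
    -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio))).
    """
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    return softplus(-margin)  # equals -log(sigmoid(margin))
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss falls as the policy raises the chosen response's likelihood relative to the rejected one, which is why no separate reward model or RL loop is needed.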
[NLP-79] AMPLE: Emotion-Aware Multimodal Fusion Prompt Learning for Fake News Detection
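[Quick read]: This paper addresses the difficulty of detecting fake news in large datasets, in particular that traditional methods focus on textual features and underuse semantic and emotional elements. The key to the solution is the Emotion-Aware Multimodal Fusion Prompt Learning (AMPLE) framework, which combines text sentiment analysis, multimodal data, and hybrid prompt templates: it extracts emotional elements from text and integrates multimodal data with Multi-Head Cross-Attention and similarity-aware fusion. The method performs strongly in both few-shot and data-rich settings, indicates the potential of emotional cues for fake news detection, and points to integrating large language models for text sentiment extraction as a direction for further improvement.
Link: https://arxiv.org/abs/2410.15591
Authors: Xiaoman Xu, Xiangrun Li, Taihang Wang, Ye Jiang
Keywords (EN): Detecting fake, diversity and complexity, challenging due, traditional approaches
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)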
Abstract:Detecting fake news in large datasets is challenging due to its diversity and complexity, with traditional approaches often focusing on textual features while underutilizing semantic and emotional elements. Current methods also rely heavily on large annotated datasets, limiting their effectiveness in more nuanced analysis. To address these challenges, this paper introduces the Emotion-Aware Multimodal Fusion Prompt Learning (AMPLE) framework, which combines text sentiment analysis with multimodal data and hybrid prompt templates. This framework extracts emotional elements from texts by leveraging sentiment analysis tools. It then employs Multi-Head Cross-Attention (MCA) mechanisms and similarity-aware fusion methods to integrate multimodal data. The proposed AMPLE framework demonstrates strong performance on two public datasets in both few-shot and data-rich settings, with results indicating the potential of emotional aspects in fake news detection. Furthermore, the study explores the impact of integrating large language models with this method for text sentiment extraction, revealing substantial room for further improvement. The code can be found at this https URL
[NLP-80] Language Models are Symbolic Learners in Arithmetic
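[Quick read]: This paper investigates whether large language models (LLMs) use partial products when learning arithmetic, and provides experimental evidence that LLMs rely mainly on symbolic learning rather than numerical computation in arithmetic tasks. The key is a set of subgroup experiments that show how LLMs break arithmetic tasks into subgroups and how subgroup complexity and subgroup selection affect performance. The results indicate that LLMs select subgroups in an easy-to-hard order, and that position-level accuracy follows a U-shaped pattern: the model quickly learns the easiest patterns at the first and last positions and learns the harder patterns at the middle positions progressively. The finding underscores the value of subgroup-level quantification for understanding LLM behavior on arithmetic.
Link: https://arxiv.org/abs/2410.15580
Authors: Chunyuan Deng, Zhiqi Li, Roy Xie, Ruidi Chang, Hanjie Chen
Keywords (EN): Large Language Models, Large Language, Language Models, language modeling, numerical computation
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)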
Abstract:Large Language Models (LLMs) are thought to struggle with arithmetic learning due to the inherent differences between language modeling and numerical computation, but concrete evidence has been lacking. This work responds to this claim through a two-side experiment. We first investigate whether LLMs leverage partial products during arithmetic learning. We find that although LLMs can identify some partial products after learning, they fail to leverage them for arithmetic tasks, conversely. We then explore how LLMs approach arithmetic symbolically by breaking tasks into subgroups, hypothesizing that difficulties arise from subgroup complexity and selection. Our results show that when subgroup complexity is fixed, LLMs treat a collection of different arithmetic operations similarly. By analyzing position-level accuracy across different training sizes, we further observe that it follows a U-shaped pattern: LLMs quickly learn the easiest patterns at the first and last positions, while progressively learning the more difficult patterns in the middle positions. This suggests that LLMs select subgroup following an easy-to-hard paradigm during learning. Our work confirms that LLMs are pure symbolic learners in arithmetic tasks and underscores the importance of understanding them deeply through subgroup-level quantification.
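Two notions from this abstract, partial products and position-level accuracy, are easy to make concrete. The sketch below is illustrative only: the paper's exact definitions and experimental setup are not given in the abstract, so the standard long-multiplication decomposition and a simple per-digit match rate over equal-width answer strings are assumed here.

```python
def partial_products(a, b):
    """Standard long-multiplication partial products of a * b.

    Assumes b is a non-negative integer; one partial product per digit of b,
    least significant digit first.
    """
    parts = []
    for power, digit in enumerate(reversed(str(b))):
        parts.append(a * int(digit) * 10 ** power)
    return parts


def position_accuracy(predictions, targets):
    """Fraction of correct digits at each position.

    Assumes every target string has the same width; a prediction that is too
    short counts as wrong at the missing positions.
    """
    width = len(targets[0])
    correct = [0] * width
    for pred, target in zip(predictions, targets):
        for i in range(width):
            if i < len(pred) and pred[i] == target[i]:
                correct[i] += 1
    return [c / len(targets) for c in correct]
```

For example, 12 x 34 decomposes into 12 x 4 = 48 and 12 x 30 = 360, which sum to 408. A U-shaped curve in the sense of the abstract would show up as higher values at the two ends of the list returned by `position_accuracy` than in the middle.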
[NLP-81] Generalized Probabilistic Attention Mechanism in Transformers
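[Quick read]: This paper addresses two well-known issues of the attention mechanism in the Transformer architecture: rank-collapse and gradient vanishing. The key to the solution is a new class of attention mechanism, the generalized probabilistic attention mechanism (GPAM), and its dual-attention implementation (daGPAM), which mitigates both issues at once. Unlike conventional attention, GPAM allows negative attention scores while keeping the total sum fixed. The paper gives theoretical evidence that daGPAM mitigates two issues that are hard to resolve simultaneously with conventional attention, and validates the claim empirically.
Link: https://arxiv.org/abs/2410.15578
Authors: DongNyeong Heo, Heeyoul Choi
Keywords (EN): widely adopted due, attention mechanism, attention, Transformer architecture, conventional attention mechanisms
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)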
Abstract:The Transformer architecture has become widely adopted due to its demonstrated success, attributed to the attention mechanism at its core. Despite these successes, the attention mechanism of Transformers is associated with two well-known issues: rank-collapse and gradient vanishing. In this paper, we present a theoretical analysis that it is inherently difficult to address both issues simultaneously in the conventional attention mechanism. To handle these issues, we introduce a novel class of attention mechanism, referred to as generalized probabilistic attention mechanism (GPAM), and its dual-attention implementation within the Transformer architecture. Unlike conventional attention mechanisms, GPAM allows for negative attention scores while preserving a fixed total sum. We provide theoretical evidence that the proposed dual-attention GPAM (daGPAM) effectively mitigates both the rank-collapse and gradient vanishing issues which are difficult to resolve simultaneously with the conventional attention mechanisms. Furthermore, we empirically validate this theoretical evidence, demonstrating the superiority of daGPAM compared to other alternative attention mechanisms that were proposed to address the same issues. Additionally, we demonstrate the practical benefits of GPAM in natural language processing tasks, such as language modeling and neural machine translation.
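The property that attention weights may be negative while their sum stays fixed can be illustrated with a toy construction. The abstract does not give the daGPAM formula, so the combination below (two softmax distributions mixed with a coefficient `lam`) is an assumption chosen only to exhibit that property; it should not be read as the paper's definition.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def signed_attention(pos_scores, neg_scores, lam=0.5):
    """Hypothetical signed weights: (1 + lam) * softmax(pos) - lam * softmax(neg).

    Both softmax outputs sum to 1, so the result sums to (1 + lam) - lam = 1,
    yet individual weights can fall below zero.
    """
    pos = softmax(pos_scores)
    neg = softmax(neg_scores)
    return [(1.0 + lam) * p - lam * n for p, n in zip(pos, neg)]
```

With `lam = 0` this reduces to ordinary softmax attention, where every weight is non-negative; a positive `lam` lets the weights leave that range without changing their total.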
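[NLP-82] A Survey of Conversational Search
[Quick read]: This paper surveys recent advances and future directions in conversational search, a paradigm for next-generation search engines. The approach relies on natural language processing (NLP) and large language models (LLMs) to improve user-system interaction by supporting complex queries, maintaining context across multi-turn dialogue, and providing robust information integration and processing. The paper highlights how query reformulation, search clarification, conversational retrieval, and response generation work together, and discusses the use of LLMs to enhance these systems along with the challenges and opportunities in the field.
Link: https://arxiv.org/abs/2410.15576
Authors: Fengran Mo, Kelong Mao, Ziliang Zhao, Hongjin Qian, Haonan Chen, Yiruo Cheng, Xiaoxi Li, Yutao Zhu, Zhicheng Dou, Jian-Yun Nie
Keywords (EN): Conversational search, search engines, modern information access, search, conversational search systems
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Notes: 35 pages, 8 figures, continue to update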
Abstract:As a cornerstone of modern information access, search engines have become indispensable in everyday life. With the rapid advancements in AI and natural language processing (NLP) technologies, particularly large language models (LLMs), search engines have evolved to support more intuitive and intelligent interactions between users and systems. Conversational search, an emerging paradigm for next-generation search engines, leverages natural language dialogue to facilitate complex and precise information retrieval, thus attracting significant attention. Unlike traditional keyword-based search engines, conversational search systems enhance user experience by supporting intricate queries, maintaining context over multi-turn interactions, and providing robust information integration and processing capabilities. Key components such as query reformulation, search clarification, conversational retrieval, and response generation work in unison to enable these sophisticated interactions. In this survey, we explore the recent advancements and potential future directions in conversational search, examining the critical modules that constitute a conversational search system. We highlight the integration of LLMs in enhancing these systems and discuss the challenges and opportunities that lie ahead in this dynamic field. Additionally, we provide insights into real-world applications and robust evaluations of current conversational search systems, aiming to guide future research and development in conversational search.
[NLP-83] Neural Search Space in Gboard Decoder
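[Quick read]: This paper addresses the sparsity problem of N-gram language models under on-device model size constraints, and the loss in decoding quality caused by their limited context length. The key to the solution is Neural Search Space, which replaces the N-gram language model with a neural network language model (NN-LM) and constructs the search space dynamically during decoding. Specifically, the NN-LM's outputs are converted into the language Finite State Transducer (FST) at runtime, which involves redesigning the FST structure, tuning the pruning strategy, and optimizing data structures. Online experiments show a reduction in Words Modified Ratio of [0.26%, 1.19%] across locales with acceptable latency increases.
Link: https://arxiv.org/abs/2410.15575
Authors: Yanxiang Zhang, Yuanbo Zhang, Haicheng Sun, Yun Wang, Billy Dou, Gary Sivek, Shumin Zhai
Keywords (EN): Finite State Transducers, Gboard Decoder produces, language Finite State, Decoder produces suggestions, Gboard Decoder
Categories: Computation and Language (cs.CL)
Notes: 10 pages, 7 figures, 3 tables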
Abstract:Gboard Decoder produces suggestions by looking for paths that best match input touch points on the context aware search space, which is backed by the language Finite State Transducers (FST). The language FST is currently an N-gram language model (LM). However, N-gram LMs, limited in context length, are known to have sparsity problem under device model size constraint. In this paper, we propose Neural Search Space which substitutes the N-gram LM with a Neural Network LM (NN-LM) and dynamically constructs the search space during decoding. Specifically, we integrate the long range context awareness of NN-LM into the search space by converting its outputs given context, into the language FST at runtime. This involves language FST structure redesign, pruning strategy tuning, and data structure optimizations. Online experiments demonstrate improved quality results, reducing Words Modified Ratio by [0.26%, 1.19%] on various locales with acceptable latency increases. This work opens new avenues for further improving keyboard decoding quality by enhancing neural LM more directly.
[NLP-84] OpenMU: Your Swiss Army Knife for Music Understanding
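[Quick read]: This paper addresses data scarcity in training multimodal language models to understand music. The key to the solution is OpenMU-Bench, a large-scale benchmark suite built by leveraging existing datasets and bootstrapping new annotations. OpenMU-Bench broadens the scope of music understanding to include lyrics understanding and music tool usage, and the music understanding model OpenMU trained on it outperforms baselines such as MU-Llama. Both OpenMU and OpenMU-Bench are open-sourced to support future research in music understanding and more efficient creative music production.
Link: https://arxiv.org/abs/2410.15573
Authors: Mengjie Zhao, Zhi Zhong, Zhuoyuan Mao, Shiqi Yang, Wei-Hsiang Liao, Shusuke Takahashi, Hiromi Wakaki, Yuki Mitsufuji
Keywords (EN): large-scale benchmark suite, data scarcity issue, training multimodal language, multimodal language models, large-scale benchmark
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Notes: Resources: this https URL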
Abstract:We present OpenMU-Bench, a large-scale benchmark suite for addressing the data scarcity issue in training multimodal language models to understand music. To construct OpenMU-Bench, we leveraged existing datasets and bootstrapped new annotations. OpenMU-Bench also broadens the scope of music understanding by including lyrics understanding and music tool usage. Using OpenMU-Bench, we trained our music understanding model, OpenMU, with extensive ablations, demonstrating that OpenMU outperforms baseline models such as MU-Llama. Both OpenMU and OpenMU-Bench are open-sourced to facilitate future research in music understanding and to enhance creative music production efficiency.
[NLP-85] Leveraging Retrieval-Augmented Generation for Culturally Inclusive Hakka Chatbots: Design Insights and User Perceptions
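[Quick read]: This paper addresses the limited accuracy and cultural relevance of traditional large language models in culturally specific domains, with a focus on preserving and promoting Taiwanese Hakka culture. The key to the solution is Retrieval-Augmented Generation (RAG): by integrating external databases with generative AI models, the chatbot can give precise, culturally grounded answers. Dynamic information retrieval extends the chatbot's knowledge base so that it can handle complex queries about Hakka culture, helping maintain and celebrate Hakka cultural identity on digital platforms.
Link: https://arxiv.org/abs/2410.15572
Authors: Chen-Chi Chang, Han-Pi Chang, Hung-Shin Lee
Keywords (EN): Taiwanese Hakka culture, Retrieval-Augmented Generation, heritage of Taiwanese, Taiwanese Hakka, technological innovation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Accepted to IEEE RASSE 2024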
Abstract:In an era where cultural preservation is increasingly intertwined with technological innovation, this study introduces a groundbreaking approach to promoting and safeguarding the rich heritage of Taiwanese Hakka culture through the development of a Retrieval-Augmented Generation (RAG)-enhanced chatbot. Traditional large language models (LLMs), while powerful, often fall short in delivering accurate and contextually rich responses, particularly in culturally specific domains. By integrating external databases with generative AI models, RAG technology bridges this gap, empowering chatbots to not only provide precise answers but also resonate deeply with the cultural nuances that are crucial for authentic interactions. This study delves into the intricate process of augmenting the chatbot’s knowledge base with targeted cultural data, specifically curated to reflect the unique aspects of Hakka traditions, language, and practices. Through dynamic information retrieval, the RAG-enhanced chatbot becomes a versatile tool capable of handling complex inquiries that demand an in-depth understanding of Hakka cultural context. This is particularly significant in an age where digital platforms often dilute cultural identities, making the role of culturally aware AI systems more critical than ever. System usability studies conducted as part of our research reveal a marked improvement in both user satisfaction and engagement, highlighting the chatbot’s effectiveness in fostering a deeper connection with Hakka culture. The feedback underscores the potential of RAG technology to not only enhance user experience but also to serve as a vital instrument in the broader mission of ethnic mainstreaming and cultural celebration.
[NLP-86] Stacking Small Language Models for Generalizability
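[Quick read]: This paper addresses the high cost of training and running large language models (LLMs) in resource-limited settings. Its key idea is a method called fine-tuning stacks of language models (FSLM): several small language models (SLMs) are stacked, and each is fine-tuned for one specific task, so that high-level reasoning is broken into multiple lower-level steps. This lowers training and inference cost and improves interpretability, because each SLM passes its output to the next one in natural language. Evaluated on common natural language benchmarks, FSLM shows promising early results as a cost-effective alternative to LLMs.
Link: https://arxiv.org/abs/2410.15570
Authors: Laurence Liang
Keywords: Recent advances show, Recent advances, generalize strong performance, generalize strong, advances show
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: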
Abstract:Recent advances show that large language models (LLMs) generalize strong performance across different natural language benchmarks. However, the large size of LLMs makes training and inference expensive and impractical to run in resource-limited settings. This paper introduces a new approach called fine-tuning stacks of language models (FSLM), which involves stacking small language models (SLM) as an alternative to LLMs. By fine-tuning each SLM to perform a specific task, this approach breaks down high level reasoning into multiple lower-level steps that specific SLMs are responsible for. As a result, FSLM allows for lower training and inference costs, and also improves model interpretability as each SLM communicates with the subsequent one through natural language. By evaluating FSLM on common natural language benchmarks, this paper highlights promising early results toward generalizable performance using FSLM as a cost-effective alternative to LLMs.
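The stacking idea can be sketched as a chain of stages that exchange plain text. This is a minimal illustration, not the paper's implementation: the two stage functions are hypothetical toy stand-ins for fine-tuned SLMs, and `run_stack` is a name introduced here.

```python
from typing import Callable, List

# Each stage stands in for one fine-tuned small language model (SLM).
# A stage reads natural-language text and returns natural-language text.
Stage = Callable[[str], str]


def run_stack(stages: List[Stage], prompt: str) -> List[str]:
    """Run the stages in order and keep every intermediate message."""
    messages = [prompt]
    for stage in stages:
        messages.append(stage(messages[-1]))
    return messages


# Toy stand-ins (hypothetical): real stages would call fine-tuned SLMs.
def extract_numbers(text: str) -> str:
    nums = [tok for tok in text.replace("?", " ").split() if tok.isdigit()]
    return "numbers: " + " ".join(nums)


def add_numbers(text: str) -> str:
    nums = [int(tok) for tok in text.split()[1:]]
    return "answer: " + str(sum(nums))


trace = run_stack([extract_numbers, add_numbers], "What is 2 plus 3 ?")
for message in trace:
    print(message)
```

Keeping the full trace is what makes the stack inspectable: every intermediate message is readable text rather than a hidden activation.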
[NLP-87] Pruning Foundation Models for High Accuracy without Retraining EMNLP2024
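[Quick read]: This paper addresses the difficulty of deploying foundation models and large language models (LLMs), whose parameter counts and compute demands are massive. Its key contribution is a post-training pruning method that needs no retraining: the authors formulate layer-wise LLM compression as a problem of pruning multiple weights simultaneously, derive an optimal solution, and build a pruning algorithm for both unstructured and semi-structured sparsity. Experiments show that it outperforms state-of-the-art baselines across LLM families, including transformer-based and Mamba-based models.
Link: https://arxiv.org/abs/2410.15567
Authors: Pu Zhao, Fei Sun, Xuan Shen, Pinrui Yu, Zhenglun Kong, Yanzhi Wang, Xue Lin
Keywords: deploy foundation models, large language models, parameters and computations, challenging to deploy, deploy foundation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by EMNLP 2024 findings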
Abstract:Despite the superior performance, it is challenging to deploy foundation models or large language models (LLMs) due to their massive parameters and computations. While pruning is a promising technique to reduce model size and accelerate the inference, the traditional pruning techniques can hardly be applied for LLMs as they need to finetune the model on the full dataset with multiple epochs consuming massive data and hardware resources. To deal with this problem, post-training pruning methods are proposed to prune LLMs in one-shot without retraining. However, their accuracy after pruning may suffer from certain performance degradation due to the lack of retraining with massive data. To address this issue, in this paper, we first formulate the post-training problem for layer-wise LLM compression to simultaneously prune multiple weights in LLMs. Next, we provide an optimal solution for this problem and design our post-training pruning algorithm for both unstructured and semi-structured sparsity. Our extensive experiments demonstrate the superior performance of the proposed methods in comparison to SOTA baselines across various LLM families including transformer-based LLMs and Mamba-based LLMs. Code link: this https URL
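To make "semi-structured sparsity" concrete, the sketch below applies an N:M pattern, assuming the common 2:4 case where 2 of every 4 consecutive weights are kept. It uses plain magnitude pruning as a stand-in and is not the optimal solution the paper derives; `prune_n_of_m` is a hypothetical helper name.

```python
from typing import List


def prune_n_of_m(row: List[float], n: int = 2, m: int = 4) -> List[float]:
    """Keep the n largest-magnitude weights in every group of m; zero the rest."""
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    pruned: List[float] = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Indices of the n entries with the largest absolute value.
        ranked = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(ranked[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
sparse = prune_n_of_m(weights)
print(sparse)
```

Unstructured sparsity would instead pick the weights to drop across the whole row with no per-group constraint.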
[NLP-88] Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following
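[Quick read]: This paper addresses the lack of benchmarks for evaluating how well large language models (LLMs) follow multi-turn and multilingual instructions. Its key contribution is Multi-IF, a benchmark built with a hybrid framework that combines LLM and human annotators. Multi-IF extends IFEval with multi-turn sequences and translates the English prompts into 7 additional languages, yielding 4,501 multilingual conversations of three turns each. An evaluation of 14 state-of-the-art LLMs shows that the benchmark is considerably harder than existing ones: every model fails more often with each additional turn, and languages with non-Latin scripts (Hindi, Russian, and Chinese) generally show higher error rates, which suggests limits in the models' multilingual capabilities.
Link: https://arxiv.org/abs/2410.15553
Authors: Yun He, Di Jin, Chaoqi Wang, Chloe Bi, Karishma Mandyam, Hejia Zhang, Chen Zhu, Ning Li, Tengyu Xu, Hongjiang Lv, Shruti Bhosale, Chenguang Zhu, Karthik Abinav Sankararaman, Eryk Helenowski, Melanie Kambadur, Aditya Tayade, Hao Ma, Han Fang, Sinong Wang
Keywords: Large Language Models, aligning model outputs, demonstrated impressive capabilities, Large Language, user expectations
Categories: Computation and Language (cs.CL)
Comments: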
Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities in various tasks, including instruction following, which is crucial for aligning model outputs with user expectations. However, evaluating LLMs’ ability to follow instructions remains challenging due to the complexity and subjectivity of human language. Current benchmarks primarily focus on single-turn, monolingual instructions, which do not adequately reflect the complexities of real-world applications that require handling multi-turn and multilingual interactions. To address this gap, we introduce Multi-IF, a new benchmark designed to assess LLMs’ proficiency in following multi-turn and multilingual instructions. Multi-IF, which utilizes a hybrid framework combining LLM and human annotators, expands upon the IFEval by incorporating multi-turn sequences and translating the English prompts into another 7 languages, resulting in a dataset of 4,501 multilingual conversations, where each has three turns. Our evaluation of 14 state-of-the-art LLMs on Multi-IF reveals that it presents a significantly more challenging task than existing benchmarks. All the models tested showed a higher rate of failure in executing instructions correctly with each additional turn. For example, o1-preview drops from 0.877 at the first turn to 0.707 at the third turn in terms of average accuracy over all languages. Moreover, languages with non-Latin scripts (Hindi, Russian, and Chinese) generally exhibit higher error rates, suggesting potential limitations in the models’ multilingual capabilities. We release Multi-IF prompts and the evaluation code base to encourage further research in this critical area.
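The size of the o1-preview drop quoted in the abstract is easier to judge as a relative change. The short calculation below uses only the two accuracies given there; the variable names are introduced here for illustration.

```python
# Average accuracy over all languages for o1-preview, as reported in the abstract.
first_turn = 0.877
third_turn = 0.707

absolute_drop = first_turn - third_turn
relative_drop_pct = 100 * absolute_drop / first_turn

print(f"absolute drop: {absolute_drop:.3f}")
print(f"relative drop: {relative_drop_pct:.1f}%")
```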
[NLP-89] WHoW: A Cross-domain Approach for Analysing Conversation Moderation
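[Quick read]: This paper analyses the facilitation strategies of moderators across domains and scenarios, and proposes an evaluation framework called WHoW. The framework characterises moderation by the moderator's motive (Why), dialogue act (How), and target speaker (Who). The authors annotate 5,657 moderation sentences with human judges and 15,494 sentences with GPT-4o, drawn from TV debates and radio panel discussions. The cross-domain comparison reveals distinct strategies: debate moderators emphasise coordination and facilitate interaction through questions and instructions, whereas panel discussion moderators prioritise information provision and actively take part in the discussion. The framework supports automatic large-scale analysis of moderation behaviour and the development of moderator agents.
Link: https://arxiv.org/abs/2410.15551
Authors: Ming-Bin Chen, Lea Frermann, Jey Han Lau
Keywords: dialogue acts, propose WHoW, examining their motives, target speaker, analyzing the facilitation
Categories: Computation and Language (cs.CL)
Comments: 36 pages (including appendix, 10 pages main text), 8 figures, 16 tables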
Abstract:We propose WHoW, an evaluation framework for analyzing the facilitation strategies of moderators across different domains/scenarios by examining their motives (Why), dialogue acts (How) and target speaker (Who). Using this framework, we annotated 5,657 moderation sentences with human judges and 15,494 sentences with GPT-4o from two domains: TV debates and radio panel discussions. Comparative analysis demonstrates the framework’s cross-domain generalisability and reveals distinct moderation strategies: debate moderators emphasise coordination and facilitate interaction through questions and instructions, while panel discussion moderators prioritize information provision and actively participate in discussions. Our analytical framework works for different moderation scenarios, enhances our understanding of moderation behaviour through automatic large-scale analysis, and facilitates the development of moderator agents.
[NLP-90] Grammatical Error Correction for Low-Resource Languages: The Case of Zarma
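[Quick read]: This paper addresses grammatical error correction (GEC) for Zarma, a low-resource West African language. It compares rule-based methods, machine translation (MT) models, and large language models (LLMs) on this task. The MT-based approach using the M2M100 model performs best, with a detection rate of 95.82% and a suggestion accuracy of 78.90% in automatic evaluation, and a score of 3.0 out of 5.0 for logical/grammar error correction in evaluation by native speakers, which indicates the potential of MT models for GEC in low-resource languages.
Link: https://arxiv.org/abs/2410.15539
Authors: Mamadou K. Keita, Christopher Homan, Sofiane Abdoulaye Hamani, Adwoa Bremang, Marcos Zampieri, Habibatou Abdoulaye Alfari, Elysabhete Amadou Ibrahim, Dennis Owusu
Keywords: West Africa, improving written materials, people in West, Grammatical error correction, Grammatical error
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: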
Abstract:Grammatical error correction (GEC) is important for improving written materials for low-resource languages like Zarma, spoken by over 5 million people in West Africa. Yet it remains a challenging problem. This study compares rule-based methods, machine translation (MT) models, and large language models (LLMs) for GEC in Zarma. We evaluate each approach’s effectiveness on our manually-built dataset of over 250,000 examples using synthetic and human-annotated data. Our experiments show that the MT-based approach using the M2M100 model outperforms others, achieving a detection rate of 95.82% and a suggestion accuracy of 78.90% in automatic evaluations, and scoring 3.0 out of 5.0 in logical/grammar error correction during manual evaluations (MEs) by native speakers. The rule-based method achieved perfect detection (100%) and high suggestion accuracy (96.27%) for spelling corrections but struggled with context-level errors. LLMs like MT5-small showed moderate performance with a detection rate of 90.62% and a suggestion accuracy of 57.15%. Our work highlights the potential of MT models to enhance GEC in low-resource languages, paving the way for more inclusive NLP tools.
[NLP-91] Do RAG Systems Cover What Matters? Evaluating and Optimizing Responses with Sub-Question Coverage
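[Quick read]: This paper addresses the difficulty of evaluating retrieval-augmented generation (RAG) systems on open-ended questions that lack definitive answers and require coverage of multiple sub-topics. Its key contribution is an evaluation framework based on sub-question coverage: each question is decomposed into sub-questions classified as core, background, or follow-up, and this categorization drives a fine-grained evaluation protocol. The study finds that existing answer engines miss around 50% of core sub-questions, that sub-question coverage metrics rank responses with 82% accuracy against human preference annotations, and that using core sub-questions improves both retrieval and generation in a RAG system, giving a 74% win rate over a baseline without sub-questions.
Link: https://arxiv.org/abs/2410.15531
Authors: Kaige Xie, Philippe Laban, Prafulla Kumar Choubey, Caiming Xiong, Chien-Sheng Wu
Keywords: Evaluating retrieval-augmented generation, Evaluating retrieval-augmented, systems remains challenging, remains challenging, multiple sub-topics
Categories: Computation and Language (cs.CL)
Comments: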
Abstract:Evaluating retrieval-augmented generation (RAG) systems remains challenging, particularly for open-ended questions that lack definitive answers and require coverage of multiple sub-topics. In this paper, we introduce a novel evaluation framework based on sub-question coverage, which measures how well a RAG system addresses different facets of a question. We propose decomposing questions into sub-questions and classifying them into three types – core, background, and follow-up – to reflect their roles and importance. Using this categorization, we introduce a fine-grained evaluation protocol that provides insights into the retrieval and generation characteristics of RAG systems, including three commercial generative answer engines: this http URL, Perplexity AI, and Bing Chat. Interestingly, we find that while all answer engines cover core sub-questions more often than background or follow-up ones, they still miss around 50% of core sub-questions, revealing clear opportunities for improvement. Further, sub-question coverage metrics prove effective for ranking responses, achieving 82% accuracy compared to human preference annotations. Lastly, we also demonstrate that leveraging core sub-questions enhances both retrieval and answer generation in a RAG system, resulting in a 74% win rate over the baseline that lacks sub-questions.
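A per-type coverage score of the kind described above can be sketched as follows. The sketch assumes the hard part, judging whether a response covers a sub-question, has already been done; the boolean judgements and the `coverage_by_type` helper are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (sub-question type, whether the response covers it); values are made up.
Judgement = Tuple[str, bool]


def coverage_by_type(judgements: List[Judgement]) -> Dict[str, float]:
    """Fraction of sub-questions covered, computed separately per type."""
    covered: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for sub_type, is_covered in judgements:
        total[sub_type] += 1
        covered[sub_type] += int(is_covered)
    return {t: covered[t] / total[t] for t in total}


judgements = [
    ("core", True), ("core", False), ("core", True), ("core", False),
    ("background", True), ("background", False),
    ("follow-up", False), ("follow-up", False),
]
scores = coverage_by_type(judgements)
print(scores)
```

Reporting the three types separately is what lets a finding such as "around 50% of core sub-questions are missed" be stated at all; a single pooled score would hide it.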
[NLP-92] M-RewardBench: Evaluating Reward Models in Multilingual Settings
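[Quick read]: This paper addresses the evaluation of reward models (RMs) in multilingual settings. Its key contribution is M-RewardBench, a first-of-its-kind multilingual RM evaluation benchmark with 2.87k preference instances in 23 typologically diverse languages, which tests the chat, safety, reasoning, and translation capabilities of RMs. A systematic evaluation of a wide range of reward models reveals a significant performance gap between English and non-English languages and shows that RM preferences can change substantially from one language to another. RM performance also improves with better translation quality and is higher for high-resource languages.
Link: https://arxiv.org/abs/2410.15522
Authors: Srishti Gureja, Lester James V. Miranda, Shayekh Bin Islam, Rishabh Maheshwary, Drishti Sharma, Gusti Winata, Nathan Lambert, Sebastian Ruder, Sara Hooker, Marzieh Fadaee
Keywords: language modeling process, modeling process, LLMs today, today by enabling, enabling the integration
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages, 6 figures, 10 tables. Website: this https URL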
Abstract:Reward models (RMs) have driven the state-of-the-art performance of LLMs today by enabling the integration of human feedback into the language modeling process. However, RMs are primarily trained and evaluated in English, and their capabilities in multilingual settings remain largely understudied. In this work, we conduct a systematic evaluation of several reward models in multilingual settings. We first construct the first-of-its-kind multilingual RM evaluation benchmark, M-RewardBench, consisting of 2.87k preference instances for 23 typologically diverse languages, that tests the chat, safety, reasoning, and translation capabilities of RMs. We then rigorously evaluate a wide range of reward models on M-RewardBench, offering fresh insights into their performance across diverse languages. We identify a significant gap in RMs’ performances between English and non-English languages and show that RM preferences can change substantially from one language to another. We also present several findings on how different multilingual aspects impact RM performance. Specifically, we show that the performance of RMs is improved with improved translation quality. Similarly, we demonstrate that the models exhibit better performance for high-resource languages. We release M-RewardBench dataset and the codebase in this study to facilitate a better understanding of RM evaluation in multilingual settings.
[NLP-93] SceneGraMMi: Scene Graph-boosted Hybrid-fusion for Multi-Modal Misinformation Veracity Prediction
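[Quick read]: This paper addresses the limitations of existing multi-modal misinformation detection methods in capturing semantic cues, key regions, and cross-modal similarities. Its key contribution is SceneGraMMi, a scene graph-boosted hybrid-fusion approach that integrates scene graphs across modalities to improve detection. SceneGraMMi consistently outperforms state-of-the-art methods on four benchmark datasets. An ablation study shows the contribution of each component, and Shapley values are used to examine the explainability of the model's decisions.
Link: https://arxiv.org/abs/2410.15517
Authors: Swarang Joshi, Siddharth Mavani, Joel Alex, Arnav Negi, Rahul Mishra, Ponnurangam Kumaraguru
Keywords: broader societal narratives, undermines individual knowledge, affects broader societal, Misinformation undermines individual, societal narratives
Categories: Computation and Language (cs.CL)
Comments: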
Abstract:Misinformation undermines individual knowledge and affects broader societal narratives. Despite growing interest in the research community in multi-modal misinformation detection, existing methods exhibit limitations in capturing semantic cues, key regions, and cross-modal similarities within multi-modal datasets. We propose SceneGraMMi, a Scene Graph-boosted Hybrid-fusion approach for Multi-modal Misinformation veracity prediction, which integrates scene graphs across different modalities to improve detection performance. Experimental results across four benchmark datasets show that SceneGraMMi consistently outperforms state-of-the-art methods. In a comprehensive ablation study, we highlight the contribution of each component, while Shapley values are employed to examine the explainability of the model’s decision-making process.
[NLP-94] Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer?
[Quick read]: This paper studies reverse question answering (RQA), in which a model is given an answer and must generate a question that has that answer, and evaluates how large language models (LLMs) perform on it. The key is to test question answering (QA) and RQA jointly, comparing their difficulty and assessing reasoning consistency across question and answer types. By identifying the question and answer types that lead to RQA errors, the paper suggests improvements for LLM RQA reasoning.
Link: https://arxiv.org/abs/2410.15512
Authors: Nishant Balepur, Feng Gu, Abhilasha Ravichander, Shi Feng, Jordan Boyd-Graber, Rachel Rudinger
Keywords: input questions-is popular, reverse question answering, RQA, producing correct answers, RQA errors
Categories: Computation and Language (cs.CL)
Comments: In-progress preprint
Abstract:Question answering (QA), producing correct answers for input questions, is popular, but we test a reverse question answering (RQA) task: given an input answer, generate a question with that answer. Past work tests QA and RQA separately, but we test them jointly, comparing their difficulty, aiding benchmark design, and assessing reasoning consistency. 16 LLMs run QA and RQA with trivia questions/answers, showing: 1) Versus QA, LLMs are much less accurate in RQA for numerical answers, but slightly more accurate in RQA for textual answers; 2) LLMs often answer their own invalid questions from RQA accurately in QA, so RQA errors are not from knowledge gaps alone; 3) RQA errors correlate with question difficulty and inversely correlate with answer frequencies in the Dolma corpus; and 4) LLMs struggle to give valid multi-hop questions. By finding question and answer types yielding RQA errors, we suggest improvements for LLM RQA reasoning.
[NLP-95] Exploring Curriculum Learning for Vision-Language Tasks: A Study on Small-Scale Multimodal Training CONLL
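[Quick read]: This paper addresses how to train machine learning models efficiently when data and compute are limited. It examines three primary variables (curriculum learning, text-only pretraining, and model type) in a limited-data regime and evaluates them on multimodal and unimodal tasks. Curriculum learning benefits multimodal evaluations over non-curriculum models, particularly when combined with text-only pretraining; on text-only tasks, it appears to help models with fewer trainable parameters. The authors suggest that architectural differences and training design may explain these results.
Link: https://arxiv.org/abs/2410.15509
Authors: Rohan Saha, Abrar Fahim, Alona Fyshe, Alex Murphy
Keywords: train large machine, specialized domains, learning, large machine learning, train large
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: CoNLL BabyLM Challenge 2024 camera ready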
Abstract:For specialized domains, there is often not a wealth of data with which to train large machine learning models. In such limited data / compute settings, various methods exist aiming to "do more with less", such as finetuning from a pretrained model, modulating difficulty levels as data are presented to a model (curriculum learning), and considering the role of model type / size. Approaches to efficient machine learning also take inspiration from human learning by considering use cases where machine learning systems have access to approximately the same number of words experienced by a 13 year old child (100M words). We investigate the role of 3 primary variables in a limited data regime as part of the multimodal track of the BabyLM challenge. We contrast: (i) curriculum learning, (ii) pretraining (with text-only data), (iii) model type. We modulate these variables and assess them on two types of tasks: (a) multimodal (text+image), and (b) unimodal (text-only) tasks. We find that curriculum learning benefits multimodal evaluations over non-curriculum learning models, particularly when combining text-only pretraining. On text-only tasks, curriculum learning appears to help models with smaller trainable parameter counts. We suggest possible reasons based on architectural differences and training designs as to why one might observe such results.
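Curriculum learning, in the sense of presenting data in order of difficulty, can be sketched in a few lines. The difficulty measure here (word count of a caption) is an assumption chosen for illustration and may differ from what the paper uses; `curriculum_order` and `staged_batches` are hypothetical helper names.

```python
from typing import Callable, List


def curriculum_order(samples: List[str],
                     difficulty: Callable[[str], float]) -> List[str]:
    """Order training samples from easiest to hardest."""
    return sorted(samples, key=difficulty)


def staged_batches(ordered: List[str], stages: int) -> List[List[str]]:
    """Split the ordered samples into consecutive stages of growing difficulty."""
    size = -(-len(ordered) // stages)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


captions = [
    "a dog",
    "two children playing football in a park",
    "a red car on a road",
    "a cat",
]
# Assumption: word count is a crude proxy for difficulty.
ordered = curriculum_order(captions, lambda text: len(text.split()))
batches = staged_batches(ordered, stages=2)
print(batches)
```

A non-curriculum baseline would feed the same samples in shuffled order, so the two setups differ only in presentation order.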
[NLP-96] RoMemes: A multimodal meme corpus for the Romanian language
[Quick read]: This paper addresses the challenge that AI systems face in processing and understanding multimodal Internet memes. Its key contribution is a curated dataset of real memes in Romanian with multiple annotation levels. Baseline algorithms demonstrate the usability of the dataset, and the results indicate that further research is needed to improve how AI tools process Internet memes.
Link: https://arxiv.org/abs/2410.15497
Authors: Vasile Păiş, Sara Niţă, Alexandru-Iulius Jerpelea, Luca Pană, Eric Curea
Keywords: online media, social networks, increasingly more popular, popular in online, convey powerful messages
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 7 tables, 1 figure, submitted to The 19th International Conference on Linguistic Resources and Tools for Natural Language Processing (ConsILR 2024)
Abstract:Memes are becoming increasingly popular in online media, especially in social networks. They usually combine graphical representations (images, drawings, animations or video) with text to convey powerful messages. In order to extract, process and understand the messages, AI applications need to employ multimodal algorithms. In this paper, we introduce a curated dataset of real memes in the Romanian language, with multiple annotation levels. Baseline algorithms were employed to demonstrate the usability of the dataset. Results indicate that further research is needed to improve the processing capabilities of AI tools when faced with Internet memes.
[NLP-97] “What is the value of templates?” Rethinking Document Information Extraction Datasets for LLMs EMNLP
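[Quick read]: This paper addresses the problem that prompt-response datasets generated from simple templates are insufficient for training robust models for visually rich document understanding (VRDU). Its key contribution is K2Q, a collection of five datasets converted from key information extraction (KIE) to a prompt-response format with a variety of bespoke templates; its questions can span multiple entities and be extractive or boolean. Training on these diverse questions improves the performance and robustness of VRDU models compared with training on simpler templates.
Link: https://arxiv.org/abs/2410.15484
Authors: Ran Zmigrod, Pranav Shetty, Mathieu Sibue, Zhiqiang Ma, Armineh Nourbakhsh, Xiaomo Liu, Manuela Veloso
Keywords: rich document understanding, visually rich document, large language models, document understanding, rise of large
Categories: Computation and Language (cs.CL)
Comments: Accepted to EMNLP Findings 2024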
Abstract:The rise of large language models (LLMs) for visually rich document understanding (VRDU) has kindled a need for prompt-response, document-based datasets. As annotating new datasets from scratch is labor-intensive, the existing literature has generated prompt-response datasets from available resources using simple templates. For the case of key information extraction (KIE), one of the most common VRDU tasks, past work has typically employed the template “What is the value for the key?”. However, given the variety of questions encountered in the wild, simple and uniform templates are insufficient for creating robust models in research and industrial contexts. In this work, we present K2Q, a diverse collection of five datasets converted from KIE to a prompt-response format using a plethora of bespoke templates. The questions in K2Q can span multiple entities and be extractive or boolean. We empirically compare the performance of seven baseline generative models on K2Q with zero-shot prompting. We further compare three of these models when training on K2Q versus training on simpler templates to motivate the need of our work. We find that creating diverse and intricate KIE questions enhances the performance and robustness of VRDU models. We hope this work encourages future studies on data quality for generative model training.
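The conversion from KIE annotations to prompt-response pairs can be sketched as below. Only the single uniform template comes from the abstract; the "diverse" templates, the example fields, and the `to_prompt_response` helper are hypothetical, and the actual K2Q templates are not reproduced here.

```python
import random
from typing import Dict, List

# The uniform template quoted in the abstract, with a placeholder for the key.
SIMPLE_TEMPLATE = "What is the value for the {key}?"
# Hypothetical varied templates, for illustration only.
DIVERSE_TEMPLATES = [
    "What is the {key} shown on this document?",
    "Can you find the {key} in the document?",
    "Which {key} does the document list?",
]


def to_prompt_response(fields: Dict[str, str], templates: List[str],
                       rng: random.Random) -> List[Dict[str, str]]:
    """Turn KIE key-value annotations into prompt-response pairs."""
    pairs = []
    for key, value in fields.items():
        template = rng.choice(templates)
        pairs.append({"prompt": template.format(key=key), "response": value})
    return pairs


fields = {"invoice number": "INV-0042", "total amount": "118.00"}
simple = to_prompt_response(fields, [SIMPLE_TEMPLATE], random.Random(0))
diverse = to_prompt_response(fields, DIVERSE_TEMPLATES, random.Random(0))
print(simple)
print(diverse)
```

With one template every prompt has the same surface form, which is the uniformity the paper argues against; sampling from a larger pool varies the wording while the responses stay unchanged.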
[NLP-98] Mitigating Forgetting in LLM Supervised Fine-Tuning and Preference Learning
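[Quick read]: This paper addresses the sub-optimality of sequential post-training of pre-trained large language models (LLMs), in which supervised fine-tuning (SFT) is followed by preference learning (RLHF or DPO). Because the two stages run one after the other, the model gradually forgets the first stage's training during the second. The paper proves this sub-optimality theoretically and proposes a joint post-training framework that has theoretical convergence guarantees and empirically outperforms sequential post-training at similar computational cost.
Link: https://arxiv.org/abs/2410.15483
Authors: Heshan Fernando, Han Shen, Parikshit Ram, Yi Zhou, Horst Samulowitz, Nathalie Baracaldo, Tianyi Chen
Keywords: safe LLM applications, supervised fine-tuning, preference learning, SFT and RLHF, typically consists
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: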
Abstract:Post-training of pre-trained LLMs, which typically consists of the supervised fine-tuning (SFT) stage and the preference learning (RLHF or DPO) stage, is crucial to effective and safe LLM applications. The widely adopted approach in post-training popular open-source LLMs is to sequentially perform SFT and RLHF/DPO. However, sequential training is sub-optimal in terms of SFT and RLHF/DPO trade-off: the LLM gradually forgets about the first stage’s training when undergoing the second stage’s training. We theoretically prove the sub-optimality of sequential post-training. Furthermore, we propose a practical joint post-training framework that has theoretical convergence guarantees and empirically outperforms the sequential post-training framework, while having similar computational cost. Our code is available at this https URL.
[NLP-99] Hey GPT, Can You be More Racist? Analysis from Crowdsourced Attempts to Elicit Biased Content from Generative AI
[Quick read]: This paper studies how non-expert users perceive and interact with biases in generative AI (GenAI) tools. It reports findings from a university-level competition that challenged participants to design prompts eliciting biased outputs from GenAI tools. A quantitative and qualitative analysis of the submissions identifies a diverse set of biases in GenAI and the strategies participants used to induce them.
Link: https://arxiv.org/abs/2410.15467
Authors: Hangzhi Guo, Pranav Narayanan Venkit, Eunchae Jang, Mukund Srinath, Wenbo Zhang, Bonam Mingole, Vipul Gupta, Kush R. Varshney, S. Shyam Sundar, Amulya Yadav
Keywords: large language models, addressing societal biases, societal biases inherent, widespread adoption, adoption of large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:
Abstract:The widespread adoption of large language models (LLMs) and generative AI (GenAI) tools across diverse applications has amplified the importance of addressing societal biases inherent within these technologies. While the NLP community has extensively studied LLM bias, research investigating how non-expert users perceive and interact with biases from these systems remains limited. As these technologies become increasingly prevalent, understanding this question is crucial to inform model developers in their efforts to mitigate bias. To address this gap, this work presents the findings from a university-level competition, which challenged participants to design prompts for eliciting biased outputs from GenAI tools. We quantitatively and qualitatively analyze the competition submissions and identify a diverse set of biases in GenAI and strategies employed by participants to induce bias in GenAI. Our findings provide unique insights into how non-expert users perceive and interact with biases from GenAI tools.
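[NLP-100] Keep Guessing? When Considering Inference Scaling, Mind the Baselines
[Quick read]: This paper addresses how to measure accurately the coverage gains obtained by scaling inference compute in large language models (LLMs) through repeated sampling. Its key contribution is a baseline that enumerates answers by their prevalence in the training set. Comparing this baseline with repeated model sampling shows that, for some LLMs, repeated sampling does no better than frequency-based guessing, which allows a more accurate measurement of how much repeated sampling improves coverage beyond prompt-agnostic guessing.
Link: https://arxiv.org/abs/2410.15466
Authors: Gal Yona, Or Honovich, Omer Levy, Roee Aharoni
Keywords: Scaling inference compute, Scaling inference, sampling consistently increases, large language models, fraction of problems
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: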
Abstract:Scaling inference compute in large language models (LLMs) through repeated sampling consistently increases the coverage (fraction of problems solved) as the number of samples increases. We conjecture that this observed improvement is partially due to the answer distribution of standard evaluation benchmarks, which is skewed towards a relatively small set of common answers. To test this conjecture, we define a baseline that enumerates answers according to their prevalence in the training set. Experiments spanning two domains – mathematical reasoning and factual knowledge – reveal that this baseline outperforms repeated model sampling for some LLMs, while the coverage for others is on par with that of a mixture strategy that obtains k answers by using only 10 model samples and similarly guessing the remaining k-10 attempts via enumeration. Our baseline enables a more accurate measurement of how much repeated sampling improves coverage in such settings beyond prompt-agnostic guessing.
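Summary: Scaling the inference compute of large language models (LLMs) through repeated sampling steadily raises coverage (the fraction of problems solved) as the number of samples grows. We conjecture that part of this observed improvement comes from the answer distribution of standard evaluation benchmarks, which is skewed toward a relatively small set of common answers. To test this conjecture, we define a baseline that enumerates answers in order of their prevalence in the training set. Experiments in two domains, mathematical reasoning and factual knowledge, show that this baseline outperforms repeated model sampling for some LLMs, while for others the coverage is on par with a mixture strategy that obtains k answers by using only 10 model samples and guessing the remaining k-10 attempts by enumeration. Our baseline makes it possible to measure more accurately how much repeated sampling improves coverage in these settings beyond prompt-agnostic guessing.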
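The enumeration baseline described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names `enumeration_baseline` and `coverage` and the toy data are assumptions made for the example.

```python
from collections import Counter


def enumeration_baseline(train_answers, k):
    """Return the k most frequent training-set answers, most common first."""
    return [answer for answer, _ in Counter(train_answers).most_common(k)]


def coverage(guesses_per_problem, gold_answers):
    """Fraction of problems whose gold answer appears among its guesses."""
    solved = sum(
        gold in guesses
        for guesses, gold in zip(guesses_per_problem, gold_answers)
    )
    return solved / len(gold_answers)


# Toy data, invented for illustration only.
train_answers = ["2", "0", "1", "2", "1", "2", "4", "0", "2"]
test_gold = ["2", "1", "7", "0"]

for k in (1, 2, 3):
    guesses = enumeration_baseline(train_answers, k)
    # The baseline is prompt-agnostic: every problem gets the same k guesses.
    print(k, coverage([guesses] * len(test_gold), test_gold))
```

Coverage of a model under repeated sampling can be computed with the same `coverage` function by passing each problem's k sampled answers, which makes the two curves directly comparable.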
[NLP-101] A Novel Interpretability Metric for Explaining Bias in Language Models: Applications on Multilingual Models from Southeast Asia
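[Quick read]: This paper addresses bias attribution in pretrained language models (PLMs), that is, how to quantify and identify the specific words responsible for biased behavior. The key to the solution is a new metric, the bias attribution score, which draws on information theory to measure how much each word contributes to biased behavior. Applying the metric to multilingual PLMs, including models from Southeast Asia, the study confirms sexist and homophobic bias in these models and finds that words relating to crime, intimate relationships, and helping are major drivers of bias, suggesting that PLMs should be used with more caution on these topics.
Link: https://arxiv.org/abs/2410.15464
Authors: Lance Calvin Lim Gamboa, Mark Lee
Keywords: measure token-level contributions, bias attribution score, pretrained language models
Categories: Computation and Language (cs.CL)
Comments: Accepted for oral presentation at the 38th Pacific Asia Conference on Language, Information, and Computation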
Abstract:Work on bias in pretrained language models (PLMs) focuses on bias evaluation and mitigation and fails to tackle the question of bias attribution and [...]. We propose a novel metric, the bias attribution score, which draws from information theory to measure token-level contributions to biased behavior in PLMs. We then demonstrate the utility of this metric by applying it on multilingual PLMs, including models from Southeast Asia which have not yet been thoroughly examined in bias evaluation literature. Our results confirm the presence of sexist and homophobic bias in Southeast Asian PLMs. Interpretability and semantic analyses also reveal that PLM bias is strongly induced by words relating to crime, intimate relationships, and helping among other discursive categories, suggesting that these are topics where PLMs strongly reproduce bias from pretraining data and where PLMs should be used with more caution.
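Summary: Research on bias in pretrained language models (PLMs) has focused on bias evaluation and mitigation and has not addressed the question of bias attribution. This paper proposes a new metric, the bias attribution score, which draws on information theory to measure token-level contributions to biased behavior in PLMs. We then demonstrate the metric's usefulness by applying it to multilingual PLMs, including Southeast Asian models that have not yet been thoroughly examined in the bias evaluation literature. Our results confirm the presence of sexist and homophobic bias in Southeast Asian PLMs. Interpretability and semantic analyses further reveal that PLM bias is strongly induced by words relating to crime, intimate relationships, and helping, among other discursive categories, which suggests that these are topics where PLMs strongly reproduce bias from their pretraining data and where PLMs should be used with more caution.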
[NLP-102] MedLogic-AQA: Enhancing Medical Question Answering with Abstractive Models Focusing on Logical Structures
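[Quick read]: This paper addresses the difficulty existing systems have in understanding and handling complex logical structures in medical question answering. The key to the solution is a new abstractive QA system, MedLogic-AQA, which uses first-order logic (FOL) rules extracted from the context and the question to generate logically grounded answers. A trained Logic-Understanding (LU) model generates logic triples, which are then integrated into the training of MedLogic-AQA so that answer generation can reason effectively. Combining logical reasoning with abstractive QA in this way lets the system produce answers that are logically sound, relevant, and informative.
Link: https://arxiv.org/abs/2410.15463
Authors: Aizan Zafar, Kshitij Mishra, Asif Ekbal
Keywords: delivering accurate responses, intricate medical queries, Medical question-answering, medical queries, pivotal in delivering
Categories: Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
Comments: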
Abstract:In Medical question-answering (QA) tasks, the need for effective systems is pivotal in delivering accurate responses to intricate medical queries. However, existing approaches often struggle to grasp the intricate logical structures and relationships inherent in medical contexts, thus limiting their capacity to furnish precise and nuanced answers. In this work, we address this gap by proposing a novel Abstractive QA system MedLogic-AQA that harnesses First Order Logic (FOL) based rules extracted from both context and questions to generate well-grounded answers. Through initial experimentation, we identified six pertinent first-order logical rules, which were then used to train a Logic-Understanding (LU) model capable of generating logical triples for a given context, question, and answer. These logic triples are then integrated into the training of MedLogic-AQA, enabling effective and coherent reasoning during answer generation. This distinctive fusion of logical reasoning with abstractive QA equips our system to produce answers that are logically sound, relevant, and engaging. Evaluation with respect to both automated and human-based demonstrates the robustness of MedLogic-AQA against strong baselines. Through empirical assessments and case studies, we validate the efficacy of MedLogic-AQA in elevating the quality and comprehensiveness of answers in terms of reasoning as well as informativeness
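Summary: In medical question-answering (QA) tasks, building effective systems is essential for giving accurate answers to complex medical queries. However, existing methods often struggle to grasp the intricate logical structures and relationships in medical contexts, which limits their ability to give precise and nuanced answers. In this study, we fill this gap by proposing a novel abstractive QA system, MedLogic-AQA, which uses first-order logic (FOL) rules extracted from the context and the question to generate logically grounded answers. Through initial experiments, we identified six relevant first-order logic rules, which were then used to train a Logic-Understanding (LU) model that can generate logic triples for a given context, question, and answer. These logic triples are then integrated into the training of MedLogic-AQA, enabling effective and coherent reasoning during answer generation. This distinctive combination of logical reasoning with abstractive QA allows our system to generate answers that are logically sound, relevant, and engaging. Automated and human evaluation confirm the robustness of MedLogic-AQA against strong baselines. Through empirical assessment and case studies, we validate the effectiveness of MedLogic-AQA in improving both the reasoning quality and the informativeness of answers.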
[NLP-103] Hallucination Detox: Sensitive Neuron Dropout (SeND) for Large Language Model Training
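[Quick read]: This paper aims to address hallucinations that arise during the training of large language models (LLMs), that is, outputs that are factually incorrect or irrelevant to the user input. The key to the solution is a new training protocol, Sensitive Neuron Dropout (SeND), which reduces hallucinations by deterministically dropping, during training, the sensitive neurons that show significant variability on a dataset. The paper also develops an efficient hallucination detection metric, Efficient EigenScore (EES), which approximates the traditional EigenScore at twice the speed and is integrated into the SeND protocol, making it computationally scalable and effective at reducing hallucinations. Experiments show that the method improves LLM reliability at test time by up to 40% compared with normal training, and offers an efficient way to improve factual accuracy when adapting LLMs to domains such as Wikipedia and medical datasets.
Link: https://arxiv.org/abs/2410.15460
Authors: Shahrad Mohammadzadeh, Juan David Guerra, Marco Bonizzato, Reihaneh Rabbany, Golnoosh Farnadi
Keywords: user input-have grown, large language models, input-have grown, large language, increasingly deployed
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Spectral Theory (math.SP)
Comments: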
Abstract:As large language models (LLMs) become increasingly deployed across various industries, concerns regarding their reliability, particularly due to hallucinations (outputs that are factually inaccurate or irrelevant to user input), have grown. Our research investigates the relationship between the training process and the emergence of hallucinations to address a key gap in existing research that focuses primarily on post hoc detection and mitigation strategies. Using models from the Pythia suite (70M-12B parameters) and several hallucination detection metrics, we analyze hallucination trends throughout training and explore LLM internal dynamics. We introduce SEnsitive Neuron Dropout (SeND), a novel training protocol designed to mitigate hallucinations by reducing variance during training. SeND achieves this by deterministically dropping neurons with significant variability on a dataset, referred to as Sensitive Neurons. In addition, we develop an unsupervised hallucination detection metric, Efficient EigenScore (EES), which approximates the traditional EigenScore at 2x speed. This efficient metric is integrated into our protocol, allowing SeND to be both computationally scalable and effective at reducing hallucinations. Our empirical evaluation demonstrates that our approach improves LLM reliability at test time by up to 40% compared to normal training while also providing an efficient method to improve factual accuracy when adapting LLMs to domains such as Wikipedia and Medical datasets.
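Summary: As large language models (LLMs) are deployed widely across industries, concerns about their reliability have grown, particularly because of hallucinations, that is, outputs that are factually inaccurate or irrelevant to user input. Our research investigates the relationship between the training process and the emergence of hallucinations, filling a gap left by existing work that focuses mainly on post hoc detection and mitigation strategies. Using models from the Pythia suite (70M-12B parameters) and several hallucination detection metrics, we analyze hallucination trends throughout training and explore the internal dynamics of LLMs. We introduce SEnsitive Neuron Dropout (SeND), a novel training protocol designed to mitigate hallucinations by reducing variance during training. SeND does this by deterministically dropping neurons that show significant variability on a dataset, referred to as Sensitive Neurons. In addition, we develop an unsupervised hallucination detection metric, Efficient EigenScore (EES), which approximates the traditional EigenScore at 2x speed. This efficient metric is integrated into our protocol, making SeND both computationally scalable and effective at reducing hallucinations. Our empirical evaluation shows that, compared with normal training, our method improves LLM reliability at test time by up to 40%, and also provides an effective way to improve factual accuracy when adapting LLMs to domains such as Wikipedia and medical datasets.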
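The idea of deterministically dropping high-variability neurons can be illustrated with a small sketch. The abstract does not specify the exact selection criterion, so the choices here are assumptions: variability is measured as per-neuron variance across examples, and a fixed fraction is dropped. The names `sensitive_neurons` and `drop_neurons` are hypothetical.

```python
from statistics import pvariance


def sensitive_neurons(activations, drop_fraction):
    """Indices of the neurons whose activations vary most across examples.

    Assumption: "variability" is per-neuron population variance.
    """
    n_neurons = len(activations[0])
    variances = [
        pvariance([row[j] for row in activations]) for j in range(n_neurons)
    ]
    n_drop = int(n_neurons * drop_fraction)
    ranked = sorted(range(n_neurons), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:n_drop])


def drop_neurons(vector, dropped):
    """Deterministically zero the selected neurons (no random mask)."""
    dropped = set(dropped)
    return [0.0 if j in dropped else v for j, v in enumerate(vector)]


# Toy activations: 3 examples x 4 neurons, invented for illustration.
activations = [
    [0.1, 2.0, 0.5, -1.0],
    [0.1, -2.0, 0.6, 1.0],
    [0.2, 0.0, 0.5, 0.0],
]
dropped = sensitive_neurons(activations, drop_fraction=0.5)
print(dropped)
print(drop_neurons(activations[0], dropped))
```

Unlike standard dropout, the mask here is a function of the data rather than a random draw, which is the contrast the abstract draws with "deterministically dropping".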
[NLP-104] CROPE: Evaluating In-Context Adaptation of Vision and Language Models to Culture-Specific Concepts
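[Quick read]: This paper addresses the limitations of vision and language models (VLMs) in handling culture-specific concepts, and in particular the need to separate parametric knowledge acquired during training from contextual knowledge supplied at inference time through visual and textual descriptions. The key to the solution is CROPE, a visual question answering benchmark designed to separate these two knowledge sources by evaluating how well models understand and adapt to culture-specific concepts. By comparing performance on culture-specific and common concepts and testing how models use multimodal (visual and textual) information, the study finds that current VLMs have significant limitations in cultural understanding and adaptability and need further improvement to become more culturally inclusive.
Link: https://arxiv.org/abs/2410.15453
Authors: Malvina Nikandrou, Georgios Pantazopoulos, Nikolas Vitsakis, Ioannis Konstas, Alessandro Suglia
Keywords: Vision and Language, demonstrate cultural knowledge, Language models, Vision, Language
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: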
Abstract:As Vision and Language models (VLMs) become accessible across the globe, it is important that they demonstrate cultural knowledge. In this paper, we introduce CROPE, a visual question answering benchmark designed to probe the knowledge of culture-specific concepts and evaluate the capacity for cultural adaptation through contextual information. This allows us to distinguish between parametric knowledge acquired during training and contextual knowledge provided during inference via visual and textual descriptions. Our evaluation of several state-of-the-art open VLMs shows large performance disparities between culture-specific and common concepts in the parametric setting. Moreover, experiments with contextual knowledge indicate that models struggle to effectively utilize multimodal information and bind culture-specific concepts to their depictions. Our findings reveal limitations in the cultural understanding and adaptability of current VLMs that need to be addressed toward more culturally inclusive models.
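Summary: As vision and language models (VLMs) become available around the world, their ability to demonstrate cultural knowledge becomes essential. In this paper, we introduce CROPE, a visual question answering benchmark designed to probe models' knowledge of culture-specific concepts and to evaluate their capacity for cultural adaptation through contextual information. This lets us distinguish parametric knowledge acquired during training from contextual knowledge provided at inference time through visual and textual descriptions. Our evaluation of several state-of-the-art open VLMs shows large performance gaps between culture-specific and common concepts in the parametric setting. Experiments with contextual knowledge further indicate that models struggle to use multimodal information effectively and to bind culture-specific concepts to their depictions. Our findings reveal limitations in the cultural understanding and adaptability of current VLMs that must be addressed to move toward more culturally inclusive models.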
[NLP-105] Evaluating Consistencies in LLM responses through a Semantic Clustering of Question Answering IJCAI2024
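[Quick read]: This paper addresses the inconsistency of large language model (LLM) outputs: for the same query, LLM answers are not consistent, which is mainly attributed to randomness in token sampling. It proposes a new evaluation method that assesses the semantic consistency of LLM answers and compares different techniques, the crux being to recognize sentences that differ syntactically but mean the same thing. The method is applied to existing techniques (such as the RAG pattern and Zero-shot-CoT) and quantified on the TruthfulQA dataset to assess how well these techniques improve the consistency of LLM answers.
Link: https://arxiv.org/abs/2410.15440
Authors: Yanggyu Lee, Jihie Kim
Keywords: Large Language Model, Language Model, Large Language, providing reliable information, LLM outputs lack
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to the Trustworthy AI Workshop at IJCAI 2024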
Abstract:In the realm of Large Language Model (LLM) functionalities, providing reliable information is paramount, yet reports suggest that LLM outputs lack consistency. This inconsistency, often attributed to randomness in token sampling, undermines user trust as it leads to varying responses even for identical queries. In this paper, we present a new approach for evaluating the semantic consistency of LLMs, including a comparison of alternative techniques. Our approach evaluates whether LLM responses are semantically congruent for a given question, recognizing that syntactically different sentences may convey the same meaning. Heretofore, to enhance LLM consistency, two main approaches have been explored: leveraging external knowledge as context, as in the RAG pattern, or using Zero-shot-CoT to improve the performance of the LLM itself. We apply our evaluation approach to these techniques and compare the impact of these methods on LLM response consistency across different domains of question answering tasks. Using the TruthfulQA dataset to assess LLM responses, the study induces N responses per question from the LLM and clusters semantically equivalent sentences to measure semantic consistency across 37 categories. Through this, it quantitatively analyzes the effectiveness of the aforementioned methods in improving LLM performance before and after their adoption.
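Summary: Among the capabilities of large language models (LLMs), providing reliable information is essential, yet reports indicate that LLM outputs lack consistency. This inconsistency, usually attributed to randomness in token sampling, erodes user trust because it produces different responses even for identical queries. This paper proposes a new method for evaluating the semantic consistency of LLMs, including a comparison of different techniques. Our method evaluates whether an LLM's responses to a given question are semantically consistent, recognizing that syntactically different sentences may convey the same meaning. To date, two main approaches to improving LLM consistency have been explored: using external knowledge as context (as in the RAG pattern) or using zero-shot chain of thought (Zero-shot-CoT) to improve the performance of the LLM itself. We apply our evaluation method to these techniques and show their effect on LLM response consistency across question answering tasks in different domains. Using the TruthfulQA dataset to assess LLM responses, the study elicits N responses per question from the LLM and clusters semantically equivalent sentences to measure semantic consistency across 37 categories. In this way, it quantitatively analyzes how effective the above methods are at improving LLM performance before and after their adoption.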
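The "sample N responses, cluster equivalent ones, measure consistency" procedure can be sketched as follows. The abstract does not give the clustering algorithm or the exact score, so this sketch assumes greedy clustering and defines consistency as the share of responses in the largest cluster. The equivalence test `toy_equivalent` is a deliberately crude stand-in for a real semantic-equivalence model.

```python
def cluster_responses(responses, equivalent):
    """Greedy clustering: put each response in the first cluster it matches."""
    clusters = []
    for response in responses:
        for cluster in clusters:
            if equivalent(cluster[0], response):
                cluster.append(response)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append([response])
    return clusters


def consistency(responses, equivalent):
    """Share of responses that fall in the largest equivalence cluster."""
    clusters = cluster_responses(responses, equivalent)
    return max(len(cluster) for cluster in clusters) / len(responses)


def toy_equivalent(a, b):
    """Stand-in test: same set of lowercase words, ignoring order and periods."""
    def normalize(text):
        return frozenset(text.lower().replace(".", "").split())
    return normalize(a) == normalize(b)


# Four hypothetical responses to the same question.
responses = [
    "Paris is the capital.",
    "The capital is Paris.",
    "the capital is paris",
    "It is Lyon.",
]
print(consistency(responses, toy_equivalent))
```

In practice the `equivalent` callable would wrap an entailment or embedding-similarity model; the clustering and scoring code stays the same.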
[NLP-106] A Comprehensive Evaluation of Cognitive Biases in LLMs
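[Quick read]: This paper addresses cognitive biases in large language models (LLMs) and provides a large-scale evaluation framework. The key to the solution is a general-purpose test framework that generates test cases reliably and at scale to detect cognitive biases in LLMs. The paper also builds a benchmark dataset of 30,000 test cases and runs a comprehensive bias evaluation of 20 state-of-the-art LLMs. The results find evidence of all 30 tested biases in at least some of the LLMs, and the framework code is released to encourage future research on bias in LLMs.
Link: https://arxiv.org/abs/2410.15413
Authors: Simon Malberg, Roman Poletukhin, Carolin M. Schuster, Georg Groh
Keywords: large language models, cognitive biases, large language, language models, decision-making scenarios
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: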
Abstract:We present a large-scale evaluation of 30 cognitive biases in 20 state-of-the-art large language models (LLMs) under various decision-making scenarios. Our contributions include a novel general-purpose test framework for reliable and large-scale generation of tests for LLMs, a benchmark dataset with 30,000 tests for detecting cognitive biases in LLMs, and a comprehensive assessment of the biases found in the 20 evaluated LLMs. Our work confirms and broadens previous findings suggesting the presence of cognitive biases in LLMs by reporting evidence of all 30 tested biases in at least some of the 20 LLMs. We publish our framework code to encourage future research on biases in LLMs: this https URL
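Summary: We carry out a large-scale evaluation of 30 cognitive biases in 20 state-of-the-art large language models (LLMs) under a variety of decision-making scenarios. Our contributions include a novel general-purpose test framework for generating tests for LLMs reliably and at scale; a benchmark dataset of 30,000 tests for detecting cognitive biases in LLMs; and a comprehensive assessment of the biases found in the 20 evaluated LLMs. Our work confirms and extends earlier findings that point to cognitive biases in LLMs, reporting evidence of all 30 tested biases in at least some of the 20 LLMs. We release our framework code to encourage future research on biases in LLMs.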
[NLP-107] IPO: Interpretable Prompt Optimization for Vision-Language Models NEURIPS2024
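[Quick read]: This paper addresses the fact that the downstream performance of pretrained vision-language models such as CLIP depends heavily on the specificity of the input text prompts, and that gradient-based prompt learning tends to overfit and to produce prompts humans cannot understand. The key to the solution is a simple but interpretable prompt optimizer (IPO) that uses large language models (LLMs) to generate text prompts dynamically. A Prompt Optimization Prompt guides the LLM in creating effective prompts and stores past prompts with their performance metrics, providing rich in-context information. A large multimodal model (LMM) additionally generates image descriptions, strengthening the interaction between the textual and visual modalities, so that dataset-specific prompts can be created that improve generalization while remaining understandable to humans.
Link: https://arxiv.org/abs/2410.15397
Authors: Yingjun Du, Wenfang Sun, Cees G. M. Snoek
Keywords: CLIP have remarkably, Pre-trained vision-language models, Pre-trained vision-language, prompts, downstream tasks
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2024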
Abstract:Pre-trained vision-language models like CLIP have remarkably adapted to various downstream tasks. Nonetheless, their performance heavily depends on the specificity of the input text prompts, which requires skillful prompt template engineering. Instead, current approaches to prompt optimization learn the prompts through gradient descent, where the prompts are treated as adjustable parameters. However, these methods tend to lead to overfitting of the base classes seen during training and produce prompts that are no longer understandable by humans. This paper introduces a simple but interpretable prompt optimizer (IPO) that utilizes large language models (LLMs) to generate textual prompts dynamically. We introduce a Prompt Optimization Prompt that not only guides LLMs in creating effective prompts but also stores past prompts with their performance metrics, providing rich in-context information. Additionally, we incorporate a large multimodal model (LMM) to condition on visual content by generating image descriptions, which enhance the interaction between textual and visual modalities. This allows for the creation of dataset-specific prompts that improve generalization performance, while maintaining human comprehension. Extensive testing across 11 datasets reveals that IPO not only improves the accuracy of existing gradient-descent-based prompt learning methods but also considerably enhances the interpretability of the generated prompts. By leveraging the strengths of LLMs, our approach ensures that the prompts remain human-understandable, thereby facilitating better transparency and oversight for vision-language models.
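Summary: Pretrained vision-language models such as CLIP have adapted remarkably well to a variety of downstream tasks. However, their performance depends heavily on the specificity of the input text prompts, which calls for skilled prompt template engineering. Current prompt optimization methods learn the prompts by gradient descent, treating them as adjustable parameters. These methods, however, tend to overfit the base classes seen during training and to produce prompts that humans can no longer understand. This paper introduces a simple but interpretable prompt optimizer (IPO) that uses large language models (LLMs) to generate text prompts dynamically. We introduce a Prompt Optimization Prompt that not only guides the LLM in creating effective prompts but also stores past prompts with their performance metrics, providing rich in-context information. In addition, we incorporate a large multimodal model (LMM) that conditions on visual content by generating image descriptions, strengthening the interaction between the textual and visual modalities. This makes it possible to create dataset-specific prompts that improve generalization performance while remaining comprehensible to humans. Extensive testing on 11 datasets shows that IPO not only improves the accuracy of existing gradient-descent-based prompt learning methods but also considerably improves the interpretability of the generated prompts. By drawing on the strengths of LLMs, our method keeps prompts human-understandable, providing better transparency and oversight for vision-language models.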
[NLP-108] CalibraEval: Calibrating Prediction Distribution to Mitigate Selection Bias in LLMs-as-Judges
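[Quick read]: This paper addresses the selection bias that large language models (LLMs) show when used as automated evaluators for pairwise comparison of candidate responses. The key to the solution is a new method, CalibraEval, which recasts debiasing as an optimization task that adjusts the observed prediction distribution to align with an unbiased prediction distribution. Concretely, the paper proposes a non-parametric order-preserving algorithm (NOA) that exploits the partial order relationships between model prediction distributions, removing the need for explicit labels and precise mathematical functions. The method mitigates selection bias effectively and outperforms existing debiasing methods on multiple benchmarks.
Link: https://arxiv.org/abs/2410.15393
Authors: Haitao Li, Junjie Chen, Qingyao Ai, Zhumin Chu, Yujia Zhou, Qian Dong, Yiqun Liu
Keywords: generated natural language, gaining widespread attention, demonstrated promising capabilities, rapidly gaining widespread, large language models
Categories: Computation and Language (cs.CL)
Comments: 13 pages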
Abstract:The use of large language models (LLMs) as automated evaluation tools to assess the quality of generated natural language, known as LLMs-as-Judges, has demonstrated promising capabilities and is rapidly gaining widespread attention. However, when applied to pairwise comparisons of candidate responses, LLM-based evaluators often exhibit selection bias. Specifically, their judgments may become inconsistent when the option positions or ID tokens are swapped, compromising the effectiveness and fairness of the evaluation result. To address this challenge, we introduce CalibraEval, a novel label-free method for mitigating selection bias during inference. Specifically, CalibraEval reformulates debiasing as an optimization task aimed at adjusting observed prediction distributions to align with unbiased prediction distributions. To solve this optimization problem, we propose a non-parametric order-preserving algorithm (NOA). This algorithm leverages the partial order relationships between model prediction distributions, thereby eliminating the need for explicit labels and precise mathematical function [...]. Evaluations of LLMs in multiple representative benchmarks demonstrate that CalibraEval effectively mitigates selection bias and improves performance compared to existing debiasing methods. This work marks a step toward building more robust and unbiased automated evaluation frameworks, paving the way for improved reliability in AI-driven assessments.
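Summary: Using large language models (LLMs) as automated tools for evaluating the quality of generated natural language, known as LLMs-as-Judges, has shown encouraging capabilities and is quickly attracting wide attention. However, when applied to pairwise comparisons of candidate responses, LLM-based evaluators often exhibit selection bias. Specifically, their judgments may become inconsistent when option positions or ID tokens are swapped, which undermines the validity and fairness of the evaluation results. To address this challenge, we introduce CalibraEval, a novel label-free method for mitigating selection bias during inference. Specifically, CalibraEval recasts debiasing as an optimization task that adjusts the observed prediction distribution to align with the unbiased prediction distribution. To solve this optimization problem, we propose a non-parametric order-preserving algorithm (NOA). The algorithm exploits the partial order relationships between model prediction distributions, removing the need for explicit labels and precise mathematical functions. Evaluations of LLMs on multiple representative benchmarks show that CalibraEval mitigates selection bias effectively and improves performance over existing debiasing methods. This work is a step toward more robust and unbiased automated evaluation frameworks and toward greater reliability in AI-driven assessment.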
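The position-swap inconsistency described above can be checked with a short sketch. This is not CalibraEval or NOA, whose details the abstract does not give; it shows only the swap test and a simple order-averaging correction. The judge `biased_judge`, its additive `position_bias`, and the `quality` scores are hypothetical.

```python
def biased_judge(first, second, quality, position_bias=0.2):
    """Hypothetical judge: P(first option wins), inflated for the first slot."""
    base = quality[first] / (quality[first] + quality[second])
    return min(1.0, base + position_bias)


def swap_consistent(judge, a, b, quality):
    """True if the same response wins regardless of which slot it occupies."""
    a_wins_in_first_slot = judge(a, b, quality) > 0.5
    a_wins_in_second_slot = judge(b, a, quality) < 0.5
    return a_wins_in_first_slot == a_wins_in_second_slot


def swap_averaged(judge, a, b, quality):
    """P(a beats b), averaged over both presentation orders."""
    return 0.5 * (judge(a, b, quality) + (1.0 - judge(b, a, quality)))


# Invented quality scores: resp_2 is slightly better than resp_1.
quality = {"resp_1": 4.0, "resp_2": 5.0}
print(swap_consistent(biased_judge, "resp_1", "resp_2", quality))
print(swap_averaged(biased_judge, "resp_1", "resp_2", quality))
```

With this toy judge the verdict flips when the options are swapped, and averaging over both orders cancels the additive bias. Real selection bias need not be additive, which is one reason a learned calibration such as the one the paper proposes may be needed.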
[NLP-109] BERTtime Stories: Investigating the Role of Synthetic Story Data in Language pre-training
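[Quick read]: This paper addresses efficient language pre-training under data constraints, specifically in the Strict and Strict-Small tracks of the BabyLM Challenge. The key to the solution is studying the effect of synthetic story data (such as TinyStories) on language pre-training. By training GPT-Neo models, generating story completions, and combining them with the BabyLM dataset, the study finds that synthetic data can offer modest gains in some cases but overall has a negative influence on linguistic understanding. This suggests that synthetic data has potential for augmenting language models in low-resource settings but should be used with care.
Link: https://arxiv.org/abs/2410.15365
Authors: Nikitas Theodoropoulos, Giorgos Filandrianos, Vassilis Lyberatos, Maria Lymperaiou, Giorgos Stamou
Keywords: Strict and Strict-Small, BabyLM Challenge, describe our contribution, Strict-Small tracks, synthetic story data
Categories: Computation and Language (cs.CL)
Comments: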
Abstract:We describe our contribution to the Strict and Strict-Small tracks of the 2nd iteration of the BabyLM Challenge. The shared task is centered around efficient pre-training given data constraints motivated by human development. In response, we study the effect of synthetic story data in language pre-training using TinyStories: a recently introduced dataset of short stories. Initially, we train GPT-Neo models on subsets of TinyStories, while varying the amount of available data. We find that, even with access to less than 100M words, the models are able to generate high-quality, original completions to a given story, and acquire substantial linguistic knowledge. To measure the effect of synthetic story data, we train LTG-BERT encoder models on a combined dataset of: a subset of TinyStories, story completions generated by GPT-Neo, and a subset of the BabyLM dataset. Our experimentation reveals that synthetic data can occasionally offer modest gains, but overall have a negative influence on linguistic understanding. Our work offers an initial study on synthesizing story data in low resource settings and underscores their potential for augmentation in data-constrained language modeling. We publicly release our models and implementation on our GitHub.
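Summary: We describe our contribution to the Strict and Strict-Small tracks of the second edition of the BabyLM Challenge. The shared task centers on efficient pre-training under data constraints inspired by human development. In response, we study the effect of synthetic story data on language pre-training using TinyStories, a recently introduced dataset of short stories. First, we train GPT-Neo models on subsets of TinyStories while varying the amount of available data. We find that even with fewer than 100 million words, the models can generate high-quality, original story completions and acquire substantial linguistic knowledge. To measure the effect of synthetic story data, we train LTG-BERT encoder models on a combined dataset consisting of a subset of TinyStories, story completions generated by GPT-Neo, and a subset of the BabyLM dataset. Our experiments show that synthetic data can occasionally bring modest gains but overall has a negative influence on linguistic understanding. Our work offers an initial study of synthesizing story data in low-resource settings and highlights its potential for augmentation in data-constrained language modeling. We publicly release our models and implementation on GitHub.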
[NLP-110] Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models
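[Quick read]: This paper addresses the vulnerability of aligned large language models (LLMs) to jailbreak adversarial attacks. The key to the solution is Faster-GCG, an efficient adversarial jailbreak method that refines the original GCG attack algorithm in depth, substantially lowering computational cost while raising attack success rate and transferability, and performing well on both open-source and closed-source LLMs.
Link: https://arxiv.org/abs/2410.15362
Authors: Xiao Li, Zhuhong Li, Qiongxiu Li, Bingze Lee, Jinghao Cui, Xiaolin Hu
Keywords: Large Language Models, Language Models, Aligned Large Language, Large Language, demonstrated remarkable performance
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: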
Abstract:Aligned Large Language Models (LLMs) have demonstrated remarkable performance across various tasks. However, LLMs remain susceptible to jailbreak adversarial attacks, where adversaries manipulate prompts to elicit malicious responses that aligned LLMs should have avoided. Identifying these vulnerabilities is crucial for understanding the inherent weaknesses of LLMs and preventing their potential misuse. One pioneering work in jailbreaking is the GCG attack, a discrete token optimization algorithm that seeks to find a suffix capable of jailbreaking aligned LLMs. Despite the success of GCG, we find it suboptimal, requiring significantly large computational costs, and the achieved jailbreaking performance is limited. In this work, we propose Faster-GCG, an efficient adversarial jailbreak method by delving deep into the design of GCG. Experiments demonstrate that Faster-GCG can surpass the original GCG with only 1/10 of the computational cost, achieving significantly higher attack success rates on various open-source aligned LLMs. In addition, we demonstrate that Faster-GCG exhibits improved attack transferability when testing on closed-source LLMs such as ChatGPT.
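Summary: Aligned large language models (LLMs) have shown remarkable performance across a variety of tasks. However, LLMs remain vulnerable to jailbreak adversarial attacks, in which adversaries manipulate prompts to elicit malicious responses that aligned LLMs should have avoided. Identifying these vulnerabilities is essential for understanding the inherent weaknesses of LLMs and preventing their potential misuse. A pioneering work in jailbreaking is the GCG attack, a discrete token optimization algorithm that searches for a suffix capable of jailbreaking aligned LLMs. Despite GCG's success, we find it suboptimal: it requires large computational cost and achieves limited jailbreaking performance. In this study, we propose Faster-GCG, an efficient adversarial jailbreak method developed through a close examination of GCG's design. Experiments show that Faster-GCG surpasses the original GCG at only 1/10 of its computational cost, achieving significantly higher attack success rates on various open-source aligned LLMs. We also show that Faster-GCG exhibits better attack transferability when tested on closed-source LLMs such as ChatGPT.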
[NLP-111] CompAct: Compressed Activations for Memory-Efficient LLM Training
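[Quick read]: This paper addresses the high peak GPU memory utilization of large language models (LLMs) during pre-training and fine-tuning. The key to the solution is to store low-rank, compressed activations from the compute graph, reducing the memory needed for the backward pass and thus significantly lowering peak memory utilization. Specifically, the proposed CompAct technique compresses activations with random projection matrices, which avoids additional memory overhead. Unlike earlier methods that only reduce optimizer overhead or the number of trained parameters, CompAct saves 25-30% of memory in pre-training and 50% in fine-tuning, substantially improving the compute-performance tradeoff.
Link: https://arxiv.org/abs/2410.15352
Authors: Yara Shamshoum, Nitzan Hodos, Yuval Sieradzki, Assaf Schuster
Keywords: GPU by 25-30, utilization on GPU, peak memory utilization, GPU, memory
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: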
Abstract:We introduce CompAct, a technique that reduces peak memory utilization on GPU by 25-30% for pretraining and 50% for fine-tuning of LLMs. Peak device memory is a major limiting factor in training LLMs, with various recent works aiming to reduce model memory. However most works don’t target the largest component of allocated memory during training: the model’s compute graph, which is stored for the backward pass. By storing low-rank, compressed activations to be used in the backward pass we greatly reduce the required memory, unlike previous methods which only reduce optimizer overheads or the number of trained parameters. Our compression uses random projection matrices, thus avoiding additional memory overheads. Comparisons with previous techniques for either pretraining or fine-tuning show that CompAct substantially improves existing compute-performance tradeoffs. We expect CompAct’s savings to scale even higher for larger models.
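Summary: We introduce CompAct, a technique that reduces peak GPU memory utilization by 25-30% for pre-training and by 50% for fine-tuning of large language models (LLMs). Peak device memory is a major limiting factor in training LLMs, and many recent works aim to reduce model memory. However, most of them do not target the largest component of memory allocated during training: the model's compute graph, which is stored for the backward pass. By storing low-rank, compressed activations for use in the backward pass, we greatly reduce the memory required, whereas previous methods only reduce optimizer overhead or the number of trained parameters. Our compression uses random projection matrices and therefore avoids additional memory overhead. Compared with previous techniques for pre-training or fine-tuning, CompAct substantially improves existing compute-performance tradeoffs. We expect CompAct's savings to grow further as models get larger.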
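A rough sketch of compressing a stored activation with a seeded random projection follows. It is an illustration under assumptions, not the CompAct implementation: it assumes a single linear layer `y = x @ w`, a Gaussian projection scaled by `1/sqrt(r)`, and a weight gradient approximated from the compressed activation. All function names are hypothetical.

```python
import random


def projection(d, r, seed):
    """d x r Gaussian matrix, regenerated from its seed instead of stored."""
    rng = random.Random(seed)
    scale = 1.0 / r ** 0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(r)] for _ in range(d)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def forward(x, w, r, seed):
    """Compute y = x @ w and keep only the compressed activation x @ P."""
    compressed = matmul(x, projection(len(x[0]), r, seed))
    return matmul(x, w), compressed


def backward_weight_grad(compressed, grad_y, d, r, seed):
    """Approximate grad_w = x^T @ grad_y using x ~ compressed @ P^T."""
    p = projection(d, r, seed)
    return matmul(p, matmul(transpose(compressed), grad_y))


rng = random.Random(0)
n, d, m, r = 4, 8, 2, 4  # batch size, input width, output width, rank
x = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
w = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(d)]
y, compressed = forward(x, w, r, seed=123)
grad_y = [[1.0] * m for _ in range(n)]
grad_w = backward_weight_grad(compressed, grad_y, d, r, seed=123)
print(len(compressed) * len(compressed[0]), "stored floats instead of", n * d)
print(len(grad_w), "x", len(grad_w[0]), "gradient estimate")
```

Because the projection is rebuilt from its seed in the backward pass, only the `n x r` compressed activation has to be kept between passes. The gradient is an estimate, since the expected value of `P @ P^T` is the identity but any single draw is not.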
[NLP-112] EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models
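[Quick read]: This paper addresses the low serving efficiency of existing large language model (LLM) systems on complex inputs, in particular the limited cache reuse of prefix-based context caching in scenarios such as few-shot learning, multi-document QA, and retrieval-augmented generation, where prefixes vary. The key to the solution is position-independent context caching (PIC), realized in the EPIC system through two core designs: AttnLink, which exploits static attention sparsity to minimize the recomputation needed to recover accuracy, and KVSplit, a customizable chunking method that preserves semantic coherence. Together they enable modular KV cache reuse regardless of where a token chunk appears. Experiments show that EPIC improves time-to-first-token (TTFT) by up to 8x and throughput by up to 7x with negligible or no accuracy loss, significantly improving the scalability and efficiency of LLM inference.
Link: https://arxiv.org/abs/2410.15332
Authors: Junhao Hu, Wenrui Huang, Haoyi Wang, Weidong Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, Tao Xie
Keywords: Large Language Models, Large Language, Language Models, range of applications, wide range
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Comments: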
Abstract:Large Language Models (LLMs) are critical for a wide range of applications, but serving them efficiently becomes increasingly challenging as inputs become more complex. Context caching improves serving performance by exploiting inter-request dependency and reusing key-value (KV) cache across requests, thus improving time-to-first-token (TTFT). However, existing prefix-based context caching requires exact token prefix matches, limiting cache reuse in few-shot learning, multi-document QA, or retrieval-augmented generation, where prefixes may vary. In this paper, we present EPIC, an LLM serving system that introduces position-independent context caching (PIC), enabling modular KV cache reuse regardless of token chunk position (or prefix). EPIC features two key designs: AttnLink, which leverages static attention sparsity to minimize recomputation for accuracy recovery, and KVSplit, a customizable chunking method that preserves semantic coherence. Our experiments demonstrate that EPIC delivers up to 8x improvements in TTFT and 7x throughput over existing systems, with negligible or no accuracy loss. By addressing the limitations of traditional caching approaches, EPIC enables more scalable and efficient LLM inference.
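Summary: Large language models (LLMs) are essential to many applications, but serving them efficiently becomes harder as inputs grow more complex. Context caching improves serving performance by exploiting dependencies between requests and reusing the key-value (KV) cache across them, which improves time-to-first-token (TTFT). However, existing prefix-based context caching requires exact token prefix matches, which limits cache reuse in scenarios such as few-shot learning, multi-document QA, and retrieval-augmented generation, where prefixes may differ. In this paper, we present EPIC, an LLM serving system that introduces position-independent context caching (PIC), enabling modular KV cache reuse regardless of token chunk position (or prefix). EPIC has two key designs: AttnLink, which exploits static attention sparsity to minimize the recomputation needed to recover accuracy, and KVSplit, a customizable chunking method that preserves semantic coherence. Our experiments show that EPIC improves TTFT by up to 8x and throughput by up to 7x, with negligible or no accuracy loss. By addressing the limitations of traditional caching approaches, EPIC enables more scalable and efficient LLM inference.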
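The difference between prefix-based and position-independent reuse can be made concrete by counting reusable chunks. The sketch below covers only the bookkeeping. It does not model KV tensors or the recomputation step that AttnLink performs to recover accuracy, and the classes `PrefixCache` and `ChunkCache` are hypothetical names, not part of EPIC.

```python
import hashlib


def key(tokens):
    """Stable cache key for a token sequence."""
    return hashlib.sha256(" ".join(tokens).encode()).hexdigest()


class PrefixCache:
    """Reuses work only for the longest run of leading chunks seen before."""

    def __init__(self):
        self.seen = set()

    def lookup_and_store(self, chunks):
        hits, prefix, matching = 0, [], True
        for chunk in chunks:
            prefix = prefix + chunk
            k = key(prefix)
            if matching and k in self.seen:
                hits += 1
            else:
                matching = False  # once the prefix diverges, reuse stops
            self.seen.add(k)
        return hits


class ChunkCache:
    """Reuses any chunk seen before, wherever it sits in the request."""

    def __init__(self):
        self.seen = set()

    def lookup_and_store(self, chunks):
        hits = 0
        for chunk in chunks:
            k = key(chunk)
            if k in self.seen:
                hits += 1
            self.seen.add(k)
        return hits


# Two retrieval-style requests that share documents in a different order.
doc_a = ["alpha", "passage"]
doc_b = ["beta", "passage"]
request_1 = [doc_a, doc_b, ["question", "one"]]
request_2 = [doc_b, doc_a, ["question", "two"]]

prefix_cache, chunk_cache = PrefixCache(), ChunkCache()
prefix_cache.lookup_and_store(request_1)
chunk_cache.lookup_and_store(request_1)
prefix_hits = prefix_cache.lookup_and_store(request_2)
chunk_hits = chunk_cache.lookup_and_store(request_2)
print(prefix_hits, chunk_hits)  # prints: 0 2
```

Reordering the two documents defeats the prefix cache entirely, while the chunk-level cache still finds both. Cached KV entries computed at one position are not directly valid at another, which is the accuracy gap the abstract says EPIC closes with limited recomputation.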
[NLP-113] A Survey of Uncertainty Estimation in LLMs: Theory Meets Practice
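[Quick read]: This paper addresses the quantification of uncertainty in large language model (LLM) predictions, in particular the lack of systematic classification and theoretical grounding in the existing literature. The key to the solution is to clearly distinguish the definitions of uncertainty and confidence, and to categorize uncertainty estimation methods from theoretical perspectives such as Bayesian inference, information theory, and ensemble strategies. The paper also discusses techniques for applying uncertainty in practice (for example, out-of-distribution detection, data annotation, and question clarification), aiming to give a comprehensive theoretical and applied view of uncertainty estimation for LLMs and to encourage more reliable and effective methods.
Link: https://arxiv.org/abs/2410.15326
Authors: Hsiu-Yuan Huang, Yutong Yang, Zhaoxi Zhang, Sanwoo Lee, Yunfang Wu
Keywords: large language models, enhancing application credibility, continue to evolve, large language, uncertainty
Categories: Computation and Language (cs.CL)
Comments: 9 pages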
Abstract:As large language models (LLMs) continue to evolve, understanding and quantifying the uncertainty in their predictions is critical for enhancing application credibility. However, the existing literature relevant to LLM uncertainty estimation often relies on heuristic approaches, lacking systematic classification of the methods. In this survey, we clarify the definitions of uncertainty and confidence, highlighting their distinctions and implications for model predictions. On this basis, we integrate theoretical perspectives, including Bayesian inference, information theory, and ensemble strategies, to categorize various classes of uncertainty estimation methods derived from heuristic approaches. Additionally, we address challenges that arise when applying these methods to LLMs. We also explore techniques for incorporating uncertainty into diverse applications, including out-of-distribution detection, data annotation, and question clarification. Our review provides insights into uncertainty estimation from both definitional and theoretical angles, contributing to a comprehensive understanding of this critical aspect in LLMs. We aim to inspire the development of more reliable and effective uncertainty estimation approaches for LLMs in real-world scenarios.
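Summary: As large language models (LLMs) continue to evolve, understanding and quantifying the uncertainty in their predictions is essential for making applications more trustworthy. However, the existing literature on LLM uncertainty estimation often relies on heuristic approaches and lacks a systematic classification of the methods. In this survey, we clarify the definitions of uncertainty and confidence, highlighting how they differ and what they imply for model predictions. On this basis, we bring together theoretical perspectives, including Bayesian inference, information theory, and ensemble strategies, to categorize the various classes of uncertainty estimation methods derived from heuristic approaches. We also discuss the challenges that arise when applying these methods to LLMs, and we examine techniques for incorporating uncertainty into diverse applications, including out-of-distribution detection, data annotation, and question clarification. Our review offers insight into uncertainty estimation from both definitional and theoretical angles, contributing to a fuller understanding of this key aspect of LLMs. We aim to inspire the development of more reliable and effective uncertainty estimation approaches for LLMs in real-world scenarios.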
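One common way to make the uncertainty/confidence distinction concrete is sketched below. The survey's own definitions are not given in the abstract, so this adopts a widely used convention as an assumption: confidence is the probability of the chosen answer, and uncertainty is the entropy of the whole predictive distribution.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def confidence(probs):
    """Probability assigned to the most likely answer."""
    return max(probs)


# Two invented distributions over four candidate answers.
peaked = [0.7, 0.1, 0.1, 0.1]
two_way = [0.7, 0.3, 0.0, 0.0]

for name, probs in (("peaked", peaked), ("two_way", two_way)):
    print(name, confidence(probs), round(predictive_entropy(probs), 3))
```

Both distributions give the top answer the same confidence, yet their entropies differ because the remaining probability mass is spread differently, which is why the two quantities are worth keeping apart.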
[NLP-114] Causality for Large Language Models
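[Quick read]: This paper addresses the over-reliance of large language models (LLMs) on probabilistic modeling, which leads them to capture spurious correlations rooted in linguistic patterns and social stereotypes rather than the true causal relationships between entities and events. The key to the solution is to integrate causality into the entire LLM lifecycle, from token embedding learning and foundation model training through fine-tuning, alignment, inference, and evaluation, in order to build more interpretable, reliable, and causally informed models. The paper also proposes six future research directions for strengthening the causal reasoning of LLMs and addressing the limitations of current models.
Link: https://arxiv.org/abs/2410.15319
Authors: Anpeng Wu, Kun Kuang, Minqin Zhu, Yingrong Wang, Yujia Zheng, Kairong Han, Baohong Li, Guangyi Chen, Fei Wu, Kun Zhang
Keywords: achieving unprecedented success, vast datasets, achieving unprecedented, billions or trillions, trillions of parameters
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: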
Abstract:Recent breakthroughs in artificial intelligence have driven a paradigm shift, where large language models (LLMs) with billions or trillions of parameters are trained on vast datasets, achieving unprecedented success across a series of language tasks. However, despite these successes, LLMs still rely on probabilistic modeling, which often captures spurious correlations rooted in linguistic patterns and social stereotypes, rather than the true causal relationships between entities and events. This limitation renders LLMs vulnerable to issues such as demographic biases, social stereotypes, and LLM hallucinations. These challenges highlight the urgent need to integrate causality into LLMs, moving beyond correlation-driven paradigms to build more reliable and ethically aligned AI systems. While many existing surveys and studies focus on utilizing prompt engineering to activate LLMs for causal knowledge or developing benchmarks to assess their causal reasoning abilities, most of these efforts rely on human intervention to activate pre-trained models. How to embed causality into the training process of LLMs and build more general and intelligent models remains unexplored. Recent research highlights that LLMs function as causal parrots, capable of reciting causal knowledge without truly understanding or applying it. These prompt-based methods are still limited to human interventional improvements. This survey aims to address this gap by exploring how causality can enhance LLMs at every stage of their lifecycle, from token embedding learning and foundation model training to fine-tuning, alignment, inference, and evaluation, paving the way for more interpretable, reliable, and causally-informed models. Additionally, we further outline six promising future directions to advance LLM development, enhance their causal reasoning capabilities, and address the current limitations these models face.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Cite as: arXiv:2410.15319 [cs.CL] (or arXiv:2410.15319v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2410.15319 (arXiv-issued DOI via DataCite, pending registration)
[NLP-115] Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
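[Quick read]: This paper addresses the difficulty of applying large language models to speech tasks, in particular how to integrate the audio and text modalities effectively. The key to the solution is Ichigo, a mixed-modal model that quantizes speech into discrete tokens and processes speech and text with a single uniform Transformer-based architecture, enabling joint reasoning and generation across modalities without separate adapters. The approach reaches state-of-the-art results among open-source speech language models on speech question-answering benchmarks and lowers the latency to first token to 111 ms.
Link: https://arxiv.org/abs/2410.15316
Authors: Alan Dao (Gia Tuan Dao), Dinh Bach Vu, Huy Hoang Ha
Keywords: speech-based tasks remains, tasks remains challenging, remains challenging due, natural language processing, Large Language Models
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: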
Abstract:Large Language Models (LLMs) have revolutionized natural language processing, but their application to speech-based tasks remains challenging due to the complexities of integrating audio and text modalities. This paper introduces Ichigo, a mixed-modal model that seamlessly processes interleaved sequences of speech and text. Utilizing a tokenized early-fusion approach, Ichigo quantizes speech into discrete tokens and employs a uniform transformer-based architecture for both speech and text modalities. This method enables joint reasoning and generation across modalities without the need for separate adapters. We present a comprehensive training methodology, including pre-training on multilingual speech recognition datasets and fine-tuning on a curated instruction dataset. Ichigo demonstrates state-of-the-art performance on speech question-answering benchmarks, outperforming existing open-source speech language models and achieving comparable results to cascaded systems. Notably, Ichigo exhibits a latency of just 111 ms to first token generation, significantly lower than current models. Our approach not only advances the field of multimodal AI but also provides a framework for smaller research teams to contribute effectively to open-source speech-language models.
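The tokenized early-fusion idea in the abstract above can be illustrated with a minimal sketch. It assumes a toy text vocabulary, a hypothetical speech codebook of four codes, and hypothetical `<sound_start>` / `<sound_end>` / `<sound_k>` token names; none of these come from the paper, and the real quantizer and tokenizer are not reproduced here. The point is only that speech codes and text share one id space, so a single Transformer can consume an interleaved sequence.

```python
from typing import Dict, List, Sequence, Tuple

# Toy text vocabulary (hypothetical; a real model would use a subword tokenizer).
TEXT_VOCAB = ["<pad>", "<sound_start>", "<sound_end>", "what", "is", "this", "?"]
NUM_SPEECH_CODES = 4  # hypothetical codebook size of the speech quantizer


def build_vocab(text_vocab: Sequence[str], num_speech_codes: int) -> Dict[str, int]:
    """Extend the text vocabulary with one token per speech code."""
    vocab = {tok: i for i, tok in enumerate(text_vocab)}
    offset = len(vocab)  # speech tokens are placed after all text tokens
    for code in range(num_speech_codes):
        vocab[f"<sound_{code}>"] = offset + code
    return vocab


def encode(segments: Sequence[Tuple[str, list]], vocab: Dict[str, int]) -> List[int]:
    """Flatten interleaved speech/text segments into one id sequence."""
    ids: List[int] = []
    for kind, payload in segments:
        if kind == "text":
            ids.extend(vocab[word] for word in payload)
        elif kind == "speech":
            # payload holds codebook indices produced by some quantizer (assumed given)
            ids.append(vocab["<sound_start>"])
            ids.extend(vocab[f"<sound_{code}>"] for code in payload)
            ids.append(vocab["<sound_end>"])
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return ids


vocab = build_vocab(TEXT_VOCAB, NUM_SPEECH_CODES)
segments = [("speech", [2, 0, 3]), ("text", ["what", "is", "this", "?"])]
ids = encode(segments, vocab)
print(ids)
```

Because both modalities end up as ordinary token ids, no modality-specific adapter is needed in this sketch; the embedding table simply has more rows.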
[NLP-116] KTCR: Improving Implicit Hate Detection with Knowledge Transfer driven Concept Refinement
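[Quick read]: This paper addresses the poor performance of machine learning models on emerging and implicit forms of hate content. The key to the solution is a Knowledge Transfer-driven Concept Refinement method that distills and refines the concepts related to implicit hate samples through novel prototype alignment and concept losses, combined with data augmentation based on concept activation vectors. Experiments on several publicly available datasets show that incorporating implicit samples that reflect new hate patterns through concept refinement improves model performance beyond baseline results while maintaining cross-dataset generalization.
Link: https://arxiv.org/abs/2410.15314
Authors: Samarth Garg, Vivek Hruday Kavuri, Gargi Shroff, Rahul Mishra
Keywords: emerging social movements, machine learning models, previously unrecognized hate, political contexts, political events
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 4 figures, 2 algorithms, 5 tables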
Abstract:The constant shifts in social and political contexts, driven by emerging social movements and political events, lead to new forms of hate content and previously unrecognized hate patterns that machine learning models may not have captured. Some recent literature proposes the data augmentation-based techniques to enrich existing hate datasets by incorporating samples that reveal new implicit hate patterns. This approach aims to improve the model’s performance on out-of-domain implicit hate instances. It is observed, that further addition of more samples for augmentation results in the decrease of the performance of the model. In this work, we propose a Knowledge Transfer-driven Concept Refinement method that distills and refines the concepts related to implicit hate samples through novel prototype alignment and concept losses, alongside data augmentation based on concept activation vectors. Experiments with several publicly available datasets show that incorporating additional implicit samples reflecting new hate patterns through concept refinement enhances the model’s performance, surpassing baseline results while maintaining cross-dataset generalization capabilities. DISCLAIMER: This paper contains explicit statements that are potentially offensive.
[NLP-117] Who is Undercover? Guiding LLMs to Explore Multi-Perspective Team Tactic in the Game
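[Quick read]: This paper addresses the challenges large language models (LLMs) face in open decision-making problems within complex scenarios. The key to the solution is the Multi-Perspective Team Tactic (MPTT) framework, which uses the game "Who is Undercover?" (WIU) to cultivate human-like language expression logic, multi-dimensional thinking, and self-perception in LLMs. By alternating speaking and voting sessions and integrating techniques such as self-perspective, identity-determination, self-reflection, self-summary, and multi-round find-teammates, MPTT lets LLM agents make rational decisions through strategic concealment and communication, fostering human-like trust. Preliminary results indicate that MPTT combined with WIU leverages the cognitive capabilities of LLMs to build a decision-making framework that can simulate real society, aiding minority groups in communication and expression and promoting fairness and diversity in decision-making.
Link: https://arxiv.org/abs/2410.15311
Authors: Ruiqi Dong, Zhixuan Liao, Guangwei Lai, Yuhan Ma, Danni Ma, Chenyou Fan
Keywords: Large Language Models, Large Language, Language Models, Multi-Perspective Team Tactic, open decision-making problems
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: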
Abstract:Large Language Models (LLMs) are pivotal AI agents in complex tasks but still face challenges in open decision-making problems within complex scenarios. To address this, we use the language logic game "Who is Undercover?" (WIU) as an experimental platform to propose the Multi-Perspective Team Tactic (MPTT) framework. MPTT aims to cultivate LLMs’ human-like language expression logic, multi-dimensional thinking, and self-perception in complex scenarios. By alternating speaking and voting sessions, integrating techniques like self-perspective, identity-determination, self-reflection, self-summary and multi-round find-teammates, LLM agents make rational decisions through strategic concealment and communication, fostering human-like trust. Preliminary results show that MPTT, combined with WIU, leverages LLMs’ cognitive capabilities to create a decision-making framework that can simulate real society. This framework aids minority groups in communication and expression, promoting fairness and diversity in decision-making. Additionally, our Human-in-the-loop experiments demonstrate that LLMs can learn and align with human behaviors through interactive, indicating their potential for active participation in societal decision-making.
[NLP-118] LlamaLens: Specialized Multilingual LLM for Analyzing News and Social Media Content
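[Quick read]: This paper addresses the limitations of large language models (LLMs) on domain-specific problems, in particular downstream natural language processing (NLP) tasks in multilingual settings. The key to the solution is LlamaLens, a specialized multilingual model for analyzing news and social media content. The experimental setup covers 19 tasks represented by 52 datasets in Arabic, English, and Hindi; LlamaLens outperforms the current state of the art (SOTA) on 16 test sets and achieves comparable performance on 10. The authors describe the work as the first attempt to tackle domain specificity and multilinguality together, and they release the models and resources publicly.
Link: https://arxiv.org/abs/2410.15308
Authors: Mohamed Bayan Kmainasi, Ali Ezzat Shahroor, Maram Hasanain, Sahinur Rahman Laskar, Naeemul Hassan, Firoj Alam
Keywords: demonstrated remarkable success, Large Language Models, general-purpose task solvers, Large Language, downstream NLP tasks
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: LLMs, Multilingual, Language Diversity, Large Language Models, Social Media, News Media, Specialized LLMs, Fact-checking, Media Analysis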
Abstract:Large Language Models (LLMs) have demonstrated remarkable success as general-purpose task solvers across various fields, including NLP, healthcare, finance, and law. However, their capabilities remain limited when addressing domain-specific problems, particularly in downstream NLP tasks. Research has shown that models fine-tuned on instruction-based downstream NLP datasets outperform those that are not fine-tuned. While most efforts in this area have primarily focused on resource-rich languages like English and broad domains, little attention has been given to multilingual settings and specific domains. To address this gap, this study focuses on developing a specialized LLM, LlamaLens, for analyzing news and social media content in a multilingual context. To the best of our knowledge, this is the first attempt to tackle both domain specificity and multilinguality, with a particular focus on news and social media. Our experimental setup includes 19 tasks, represented by 52 datasets covering Arabic, English, and Hindi. We demonstrate that LlamaLens outperforms the current state-of-the-art (SOTA) on 16 testing sets, and achieves comparable performance on 10 sets. We make the models and resources publicly available for the research community.(this https URL)
[NLP-119] Does ChatGPT Have a Poetic Style?
Link: https://arxiv.org/abs/2410.15299
Authors: Melanie Walsh, Anna Preus, Elizabeth Gronski
Keywords:
Categories: Computation and Language (cs.CL)
Comments: CHR 2024: Computational Humanities Research Conference
[NLP-120] Redefining Proactivity for Information Seeking Dialogue
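[Quick read]: This paper addresses the lack of proactivity in Information-Seeking Dialogue (ISD) agents, which tend to respond reactively and fail to generate responses that keep users engaged in sustained conversations. The key to the solution is a new definition of proactivity that focuses on enhancing the proactiveness of each generated response by introducing new information related to the initial query. The authors construct a proactive dialogue dataset of 2,000 single-turn conversations and introduce several automatic metrics for response proactiveness that correlate highly with human annotation. They also propose two Chain-of-Thought (CoT) prompts, the 3-step CoT and the 3-in-1 CoT, which outperform standard prompts by up to 90% in the zero-shot setting.
Link: https://arxiv.org/abs/2410.15297
Authors: Jing Yang Lee, Seokhwan Kim, Kartik Mehta, Jiun-Yu Kao, Yu-Hsiang Lin, Arpit Gupta
Keywords: provide accurate responses, user queries, addressing user queries, aim to provide, provide accurate
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: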
Abstract:Information-Seeking Dialogue (ISD) agents aim to provide accurate responses to user queries. While proficient in directly addressing user queries, these agents, as well as LLMs in general, predominantly exhibit reactive behavior, lacking the ability to generate proactive responses that actively engage users in sustained conversations. However, existing definitions of proactive dialogue in this context do not focus on how each response actively engages the user and sustains the conversation. Hence, we present a new definition of proactivity that focuses on enhancing the 'proactiveness' of each generated response via the introduction of new information related to the initial query. To this end, we construct a proactive dialogue dataset comprising 2,000 single-turn conversations, and introduce several automatic metrics to evaluate response 'proactiveness' which achieved high correlation with human annotation. Additionally, we introduce two innovative Chain-of-Thought (CoT) prompts, the 3-step CoT and the 3-in-1 CoT prompts, which consistently outperform standard prompts by up to 90% in the zero-shot setting.
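[NLP-121] Training Language Models to Critique With Multi-agent Feedback
[Quick read]: This paper addresses the difficulty of improving the critique ability of large language models (LLMs), in particular the limitation that supervised fine-tuning (SFT) on critiques generated by a single model propagates the flaws of those critiques into the learned model. The key to the solution is MultiCritique, a data generation pipeline that uses multi-agent feedback in both the SFT and reinforcement learning (RL) stages: it aggregates high-quality critiques from multiple agents and improves the preference accuracy of critique quality, which makes RL more effective. Experiments show that a 7B model fine-tuned on the resulting MultiCritiqueDataset significantly surpasses other 7B-13B open-source models and approaches the performance of 70B LLMs and GPT-4.
Link: https://arxiv.org/abs/2410.15287
Authors: Tian Lan, Wenwei Zhang, Chengqi Lyu, Shuaibin Li, Chen Xu, Heyan Huang, Dahua Lin, Xian-Ling Mao, Kai Chen
Keywords: presents significant challenges, Critique ability, Critique, capability of humans, presents significant
Categories: Computation and Language (cs.CL)
Comments: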
Abstract:Critique ability, a meta-cognitive capability of humans, presents significant challenges for LLMs to improve. Recent works primarily rely on supervised fine-tuning (SFT) using critiques generated by a single LLM like GPT-4. However, these model-generated critiques often exhibit flaws due to the inherent complexity of the critique. Consequently, fine-tuning LLMs on such flawed critiques typically limits the model’s performance and propagates these flaws into the learned model. To overcome these challenges, this paper proposes a novel data generation pipeline, named MultiCritique, that improves the critique ability of LLMs by utilizing multi-agent feedback in both the SFT and reinforcement learning (RL) stages. First, our data generation pipeline aggregates high-quality critiques from multiple agents instead of a single model, with crucial information as input for simplifying the critique. Furthermore, our pipeline improves the preference accuracy of critique quality through multi-agent feedback, facilitating the effectiveness of RL in improving the critique ability of LLMs. Based on our proposed MultiCritique data generation pipeline, we construct the MultiCritiqueDataset for the SFT and RL fine-tuning stages. Extensive experimental results on two benchmarks demonstrate: 1) the superior quality of our constructed SFT dataset compared to existing critique datasets; 2) additional improvements to the critique ability of LLMs brought by the RL stage. Notably, our fine-tuned 7B model significantly surpasses other advanced 7B-13B open-source models, approaching the performance of advanced 70B LLMs and GPT-4. Codes, datasets and model weights will be publicly available.
[NLP-122] Large Language Models for Autonomous Driving (LLM4AD): Concept Benchmark Simulation and Real-Vehicle Experiment
Link: https://arxiv.org/abs/2410.15281
Authors: Can Cui, Yunsheng Ma, Zichong Yang, Yupeng Zhou, Peiran Liu, Juanwu Lu, Lingxi Li, Yaobin Chen, Jitesh H. Panchal, Amr Abdelraouf, Rohit Gupta, Kyungtae Han, Ziran Wang
Keywords:
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:
[NLP-123] BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression
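[Quick read]: This paper addresses a problem in retrieval-augmented generation for multi-hop questions: as the number of retrieved documents grows, the input length to the LLM grows linearly, which increases inference latency and degrades long-context understanding. The key to the solution is BRIEF (Bridging Retrieval and Inference through Evidence Fusion), a lightweight approach that compresses retrieved documents into highly dense textual summaries for in-context learning, performing query-aware multi-hop reasoning across documents. To train the compressor, the authors curate synthetic data by extracting atomic proposition expressions that encapsulate distinct factoids from the source documents and composing them into synthetic summaries, which yields more concise summaries and stronger open-domain question answering (QA) performance.
Link: https://arxiv.org/abs/2410.15277
Authors: Yuankai Li, Jia-Chen Gu, Di Wu, Kai-Wei Chang, Nanyun Peng
Keywords: integrating external knowledge, supplement large language, Retrieval-augmented generation, large language models, external knowledge
Categories: Computation and Language (cs.CL)
Comments: Project page: this https URL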
Abstract:Retrieval-augmented generation (RAG) can supplement large language models (LLMs) by integrating external knowledge. However, as the number of retrieved documents increases, the input length to LLMs grows linearly, causing a dramatic increase in latency and a degradation in long-context understanding. This is particularly serious for multi-hop questions that require a chain of reasoning across documents. To accelerate inference, reduce costs, and minimize distractions, this paper presents BRIEF (Bridging Retrieval and Inference through Evidence Fusion), a lightweight approach that performs query-aware multi-hop reasoning by compressing retrieved documents into highly dense textual summaries to integrate into in-context learning. To enable learning compression for multi-hop reasoning, we curate synthetic data by extracting atomic proposition expressions that encapsulate distinct factoids from the source documents to compose synthetic summaries. Based on our synthetic data built entirely by open-source models, BRIEF generates more concise summaries and enables a range of LLMs to achieve exceptional open-domain question answering (QA) performance. For example, on HotpotQA, BRIEF improves the compression rate by 2 times compared to the state-of-the-art baseline, while outperforming it by 3.00% EM and 4.16% F1 with Flan-UL2 as the reader LM. It also generates more concise summaries than proprietary GPT-3.5, while demonstrating nearly identical QA performance.
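To make the compress-then-read idea in the abstract above concrete, here is a minimal sketch. It swaps BRIEF's learned compressor for a plain keyword-overlap sentence filter, which is an assumption made purely for illustration and does not perform real multi-hop reasoning. The example documents, the stopword list, and the definition of compression rate as original word count divided by summary word count are likewise assumptions, not taken from the paper.

```python
import re
from typing import List, Set

# Small hypothetical stopword list, only to keep the overlap test meaningful.
STOPWORDS = frozenset({"the", "a", "of", "in", "is", "was", "who", "what", "which"})


def content_words(text: str) -> Set[str]:
    """Lowercased alphanumeric words with stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def compress(query: str, documents: List[str]) -> str:
    """Keep only sentences that share a content word with the query.

    Stand-in for a learned compressor: a real system would generate a
    dense summary rather than filter sentences.
    """
    query_words = content_words(query)
    kept = []
    for doc in documents:
        for sentence in re.split(r"(?<=[.!?])\s+", doc.strip()):
            if sentence and content_words(sentence) & query_words:
                kept.append(sentence)
    return " ".join(kept)


def compression_rate(documents: List[str], summary: str) -> float:
    """Assumed definition: words in retrieved documents / words in summary."""
    original = sum(len(doc.split()) for doc in documents)
    return original / max(len(summary.split()), 1)


docs = [
    "Film X was directed by Jane Roe. It premiered in 1999 to mixed reviews.",
    "Jane Roe was born in Oslo. She studied architecture before turning to cinema.",
]
query = "Where was the director of Film X born?"
summary = compress(query, docs)
rate = compression_rate(docs, summary)
reader_prompt = f"Context: {summary}\n\nQuestion: {query}\nAnswer:"
print(summary)
print(round(rate, 2))
```

The reader model then sees only `reader_prompt`, which is shorter than the concatenated documents; the quality of the answer depends entirely on whether the compressor kept the evidence needed for every hop.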
[NLP-124] TAGExplainer: Narrating Graph Explanations for Text-Attributed Graph Learning Models
Link: https://arxiv.org/abs/2410.15268
Authors: Bo Pan, Zhen Xiong, Guanchen Wu, Zheng Zhang, Yifei Zhang, Liang Zhao
Keywords:
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
[NLP-125] When Machine Unlearning Meets Retrieval-Augmented Generation (RAG): Keep Secret or Forget Knowledge?
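[Quick read]: This paper addresses the problem that large language models (LLMs) such as ChatGPT and Gemini can inadvertently learn and retain sensitive information and harmful content during training, and proposes a lightweight unlearning framework based on Retrieval-Augmented Generation (RAG). The key to the solution is to simulate the effects of forgetting by modifying the external knowledge base of RAG, without directly interacting with the unlearned LLM. The construction of unlearned knowledge is treated as a constrained optimization problem, from which two key components underpinning RAG-based unlearning are derived. The framework is particularly suited to closed-source LLMs, where existing unlearning methods often fail. Extensive experiments on open-source and closed-source models indicate that the approach satisfies five criteria: effectiveness, universality, harmlessness, simplicity, and robustness.
Link: https://arxiv.org/abs/2410.15267
Authors: Shang Wang, Tianqing Zhu, Dayong Ye, Wanlei Zhou
Keywords: powerful natural language, language generation capabilities, shown their powerful, powerful natural, natural language generation
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: 15 pages, 9 figures, 9 tables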
Abstract:The deployment of large language models (LLMs) like ChatGPT and Gemini has shown their powerful natural language generation capabilities. However, these models can inadvertently learn and retain sensitive information and harmful content during training, raising significant ethical and legal concerns. To address these issues, machine unlearning has been introduced as a potential solution. While existing unlearning methods take into account the specific characteristics of LLMs, they often suffer from high computational demands, limited applicability, or the risk of catastrophic forgetting. To address these limitations, we propose a lightweight unlearning framework based on Retrieval-Augmented Generation (RAG) technology. By modifying the external knowledge base of RAG, we simulate the effects of forgetting without directly interacting with the unlearned LLM. We approach the construction of unlearned knowledge as a constrained optimization problem, deriving two key components that underpin the effectiveness of RAG-based unlearning. This RAG-based approach is particularly effective for closed-source LLMs, where existing unlearning methods often fail. We evaluate our framework through extensive experiments on both open-source and closed-source models, including ChatGPT, Gemini, Llama-2-7b-chat-hf, and PaLM 2. The results demonstrate that our approach meets five key unlearning criteria: effectiveness, universality, harmlessness, simplicity, and robustness. Meanwhile, this approach can extend to multimodal large language models and LLM-based agents.
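The central idea of the abstract above, steering a model through the retrieval knowledge base rather than through its weights, can be sketched as follows. This is a toy illustration only: the word-overlap retriever, the `min_overlap` threshold, the entry format, and the wording of the confidentiality entry are all hypothetical, and the paper's constrained-optimization construction of unlearned knowledge is not reproduced here.

```python
from typing import Dict, List, Optional, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase and split on whitespace after stripping basic punctuation."""
    return set(text.lower().replace("?", " ").replace(".", " ").split())


def retrieve(query: str, knowledge_base: List[Dict[str, str]],
             min_overlap: int = 2) -> Optional[Dict[str, str]]:
    """Return the entry whose key shares the most words with the query."""
    query_words = tokenize(query)
    best, best_score = None, 0
    for entry in knowledge_base:
        score = len(query_words & tokenize(entry["key"]))
        if score > best_score:
            best, best_score = entry, score
    return best if best_score >= min_overlap else None


def build_prompt(query: str, knowledge_base: List[Dict[str, str]]) -> str:
    """Prepend retrieved knowledge, if any, to the user's question."""
    hit = retrieve(query, knowledge_base)
    if hit is None:
        return query  # unrelated queries reach the model unchanged
    return f"Context: {hit['text']}\n\nQuestion: {query}"


# Hypothetical "unlearned knowledge" entry for one forgotten target.
kb = [{
    "key": "alice smith home address",
    "text": "Information about this topic is confidential. Decline to answer.",
}]

p_forgotten = build_prompt("What is Alice Smith's home address?", kb)
p_unrelated = build_prompt("What is the capital of France?", kb)
print(p_forgotten)
print(p_unrelated)
```

Since only the knowledge base and the prompt change, the same mechanism applies to a closed-source model reached through an API, which matches the setting the abstract highlights.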
[NLP-126] Back to School: Translation Using Grammar Books
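[Quick read]: This paper addresses the poor machine translation quality for low-resource languages. The key to the solution is to include existing linguistic reference material, such as bilingual dictionaries and grammar books, in the prompt of a large language model (LLM) such as GPT-4. Experiments on 16 topologically diverse low-resource languages show that this method can improve the machine translation performance of LLMs.
Link: https://arxiv.org/abs/2410.15263
Authors: Jonathan Hus, Antonios Anastasopoulos
Keywords: produce high quality, languages perform exceptionally, high quality translations, resource languages perform, perform exceptionally
Categories: Computation and Language (cs.CL)
Comments: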
Abstract:Machine translation systems for high resource languages perform exceptionally well and produce high quality translations. Unfortunately, the vast majority of languages are not considered high resource and lack the quantity of parallel sentences needed to train such systems. These under-represented languages are not without resources, however, and bilingual dictionaries and grammar books are available as linguistic reference material. With current large language models (LLMs) supporting near book-length contexts, we can begin to use the available material to ensure advancements are shared among all of the world’s languages. In this paper, we demonstrate incorporating grammar books in the prompt of GPT-4 to improve machine translation and evaluate the performance on 16 topologically diverse low-resource languages, using a combination of reference material to show that the machine translation performance of LLMs can be improved using this method.
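As a rough illustration of putting reference material in the prompt, the sketch below assembles a translation prompt from a bilingual dictionary and a grammar excerpt. The prompt layout, the function name `build_translation_prompt`, the character budget, and the made-up language "Examplish" with its two dictionary entries are all hypothetical; the abstract does not specify the prompt format the authors used.

```python
from typing import Dict


def build_translation_prompt(sentence: str, src_lang: str, tgt_lang: str,
                             dictionary: Dict[str, str], grammar_excerpt: str,
                             max_grammar_chars: int = 2000) -> str:
    """Assemble a prompt with dictionary lookups and a grammar excerpt."""
    # Look up each word of the sentence in the bilingual dictionary.
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    entries = [f"{w} -> {dictionary[w]}" for w in words if w in dictionary]
    parts = [
        f"Translate the following {src_lang} sentence into {tgt_lang}.",
        "Dictionary entries:\n" + ("\n".join(entries) if entries else "(none found)"),
        # Truncate the grammar text to a fixed budget (assumed; long-context
        # models may accept far more, up to a full book).
        "Grammar notes:\n" + grammar_excerpt[:max_grammar_chars],
        f"Sentence: {sentence}",
        "Translation:",
    ]
    return "\n\n".join(parts)


prompt = build_translation_prompt(
    sentence="Kata mira luno.",
    src_lang="Examplish",  # made-up language, for illustration only
    tgt_lang="English",
    dictionary={"kata": "dog", "mira": "sees"},
    grammar_excerpt="Word order is subject-verb-object. Nouns are not marked for number.",
)
print(prompt)
```

Words missing from the dictionary (here `luno`) are simply omitted from the lookup section, leaving the model to rely on the grammar notes and its own prior knowledge.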
[NLP-127] Lossless KV Cache Compression to 2%
Link: https://arxiv.org/abs/2410.15252
Authors: Zhen Yang, J.N.Han, Kan Wu, Ruobing Xie, An Wang, Xingwu Sun, Zhanhui Kang
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
[NLP-128] On the Diversity of Synthetic Data and its Impact on Training Large Language Models
Link: https://arxiv.org/abs/2410.15226
Authors: Hao Chen, Abdul Waheed, Xiang Li, Yidong Wang, Jindong Wang, Bhiksha Raj, Marah I. Abdin
Keywords:
Categories: Computation and Language (cs.CL)
Comments:
[NLP-129] Chasing Random: Instruction Selection Strategies Fail to Generalize
Link: https://arxiv.org/abs/2410.15225
Authors: Harshita Diddee, Daphne Ippolito
Keywords:
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
[NLP-130] Fine-tuning foundational models to code diagnoses from veterinary health records
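[Quick read]: This paper addresses the limited interoperability of veterinary electronic health records (EHRs), in particular the challenges caused by inconsistent data formats and data siloing. The key to the solution is to automate veterinary diagnosis coding with natural language processing (NLP), specifically by fine-tuning pre-trained large language models (LLMs). The study covers all 7,739 distinct SNOMED-CT diagnosis codes recognized by the Colorado State University Veterinary Teaching Hospital and fine-tunes ten freely available pre-trained LLMs on free-text notes from 246,473 manually coded patient visits, achieving better performance than previous efforts. The approach improves the quality of veterinary EHRs and paves the way for integrated health databases spanning species and institutions, supporting both animal and human health research.
Link: https://arxiv.org/abs/2410.15186
Authors: Mayla R. Boguslav, Adam Kiehl, David Kott, G. Joseph Strecker, Tracy Webb, Nadia Saklou, Terri Ward, Michael Kirby
Keywords: medical records represent, Natural Language Processing, Veterinary medical records, Veterinary, Veterinary Teaching Hospital
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 26 pages, 5 figures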
Abstract:Veterinary medical records represent a large data resource for application to veterinary and One Health clinical research efforts. Use of the data is limited by interoperability challenges including inconsistent data formats and data siloing. Clinical coding using standardized medical terminologies enhances the quality of medical records and facilitates their interoperability with veterinary and human health records from other sites. Previous studies, such as DeepTag and VetTag, evaluated the application of Natural Language Processing (NLP) to automate veterinary diagnosis coding, employing long short-term memory (LSTM) and transformer models to infer a subset of Systemized Nomenclature of Medicine - Clinical Terms (SNOMED-CT) diagnosis codes from free-text clinical notes. This study expands on these efforts by incorporating all 7,739 distinct SNOMED-CT diagnosis codes recognized by the Colorado State University (CSU) Veterinary Teaching Hospital (VTH) and by leveraging the increasing availability of pre-trained large language models (LLMs). Ten freely-available pre-trained LLMs were fine-tuned on the free-text notes from 246,473 manually-coded veterinary patient visits included in the CSU VTH’s electronic health records (EHRs), which resulted in superior performance relative to previous efforts. The most accurate results were obtained when expansive labeled data were used to fine-tune relatively large clinical LLMs, but the study also showed that comparable results can be obtained using more limited resources and non-clinical LLMs. The results of this study contribute to the improvement of the quality of veterinary EHRs by investigating accessible methods for automated coding and support both animal and human health research by paving the way for more integrated and comprehensive health databases that span species and institutions.
[NLP-131] The Computational Anatomy of Humility: Modeling Intellectual Humility in Online Public Discourse
Link: https://arxiv.org/abs/2410.15182
Authors: Xiaobo Guo, Neil Potnis, Melody Yu, Nabeel Gillani, Soroush Vosoughi
Keywords:
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL); Databases (cs.DB)
Comments:
[NLP-132] Uncovering Autoregressive LLM Knowledge of Thematic Fit in Event Representation
Link: https://arxiv.org/abs/2410.15173
Authors: Safeyah Khaled Alshemali, Daniel Bauer, Yuval Marton
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 3 figures
[NLP-133] An Electoral Approach to Diversify LLM-based Multi-Agent Collective Decision-Making EMNLP2024
Link: https://arxiv.org/abs/2410.15168
Authors: Xiutian Zhao, Ke Wang, Wei Peng
Keywords:
Categories: Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2024
[NLP-134] Explaining Graph Neural Networks with Large Language Models: A Counterfactual Perspective for Molecular Property Prediction
Link: https://arxiv.org/abs/2410.15165
Authors: Yinhan He, Zaiyi Zheng, Patrick Soga, Yaozhen Zhu, Yushun Dong, Jundong Li
Keywords:
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Biomolecules (q-bio.BM)
Comments:
[NLP-135] Evaluation Of P300 Speller Performance Using Large Language Models Along With Cross-Subject Training
Link: https://arxiv.org/abs/2410.15161
Authors: Nithin Parthasarathy, James Soetedjo, Saarang Panchavati, Nitya Parthasarathy, Corey Arnold, Nader Pouratian, William Speier
Keywords:
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: 21 pages, 11 figures, 1 table. arXiv admin note: substantial text overlap with arXiv:2405.13329
[NLP-136] Evaluating Deep Unlearning in Large Language Models
Link: https://arxiv.org/abs/2410.15153
Authors: Ruihan Wu, Chhavi Yadav, Russ Salakhutdinov, Kamalika Chaudhuri
Keywords:
Categories: Computation and Language (cs.CL)
Comments:
[NLP-137] Less is More: Parameter-Efficient Selection of Intermediate Tasks for Transfer Learning EMNLP2024
Link: https://arxiv.org/abs/2410.15148
Authors: David Schulte, Felix Hamborg, Alan Akbik
Keywords:
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: EMNLP 2024 Main Conference
[NLP-138] A survey of neural-network-based methods utilising comparable data for finding translation equivalents
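[Quick read]: This survey addresses the automatic induction of bilingual dictionary components, specifically translation equivalents, in natural language processing (NLP) applications. It focuses on neural-network-based methods that use comparable data and analyzes them from a lexicographic perspective, identifying the methods that integrate lexicographic viewpoints. The paper argues that NLP can benefit from lexicographic insights and encourages a closer connection between the two fields to support further research on neural-network-based methods utilising comparable data.
Link: https://arxiv.org/abs/2410.15144
Authors: Michaela Denisová, Pavel Rychlý
Keywords: natural language processing, inducing bilingual dictionary, bilingual dictionary components, language processing, importance of inducing
Categories: Computation and Language (cs.CL)
Comments: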
Abstract:The importance of inducing bilingual dictionary components in many natural language processing (NLP) applications is indisputable. However, the dictionary compilation process requires extensive work and combines two disciplines, NLP and lexicography, while the former often omits the latter. In this paper, we present the most common approaches from NLP that endeavour to automatically induce one of the essential dictionary components, translation equivalents and focus on the neural-network-based methods using comparable data. We analyse them from a lexicographic perspective since their viewpoints are crucial for improving the described methods. Moreover, we identify the methods that integrate these viewpoints and can be further exploited in various applications that require them. This survey encourages a connection between the NLP and lexicography fields as the NLP field can benefit from lexicographic insights, and it serves as a helping and inspiring material for further research in the context of neural-network-based methods utilising comparable data.
[NLP-139] CAST: Corpus-Aware Self-similarity Enhanced Topic modelling
Link: https://arxiv.org/abs/2410.15136
Authors: Yanan Ma, Chenghao Xiao, Chenhan Yuan, Sabine N van der Veer, Lamiece Hassan, Chenghua Lin, Goran Nenadic
Keywords:
Categories: Computation and Language (cs.CL)
Comments:
[NLP-140] Augmenting the Veracity and Explanations of Complex Fact Checking via Iterative Self-Revision with LLMs
Link: https://arxiv.org/abs/2410.15135
Authors: Xiaocheng Zhang, Xi Wang, Yifei Lu, Zhuangzhuang Ye, Jianing Wang, Mengjiao Bao, Peng Yan, Xiaohong Su
Keywords:
Categories: Computation and Language (cs.CL)
Comments:
[NLP-141] MELT: Materials-aware Continued Pre-training for Language Model Adaptation to Materials Science EMNLP2024
Link: https://arxiv.org/abs/2410.15126
Authors: Junho Kim, Yeachan Kim, Jun-Hyung Park, Yerim Oh, Suho Kim, SangKeun Lee
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at EMNLP 2024 (Findings)
[NLP-142] Coarse-to-Fine Highlighting: Reducing Knowledge Hallucination in Large Language Models
Link: https://arxiv.org/abs/2410.15116
Authors: Qitan Lv, Jie Wang, Hanzhu Chen, Bin Li, Yongdong Zhang, Feng Wu
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
[NLP-143] On Designing Effective RL Reward at Training Time for LLM Reasoning
Link: https://arxiv.org/abs/2410.15115
Authors: Jiaxuan Gao, Shusheng Xu, Wenjie Ye, Weilin Liu, Chuyi He, Wei Fu, Zhiyu Mei, Guangju Wang, Yi Wu
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
[NLP-144] Toward Robust RALMs: Revealing the Impact of Imperfect Retrieval on Retrieval-Augmented Language Models ACL
Link: https://arxiv.org/abs/2410.15107
Authors: Seong-Il Park, Jay-Yoon Lee
Keywords:
Categories: Computation and Language (cs.CL)
Comments: Accepted for publication in Transactions of the Association for Computational Linguistics (TACL)
[NLP-145] Towards Safer Heuristics With XPlain
Link: https://arxiv.org/abs/2410.15086
Authors: Pantea Karimi, Solal Pirelli, Siva Kesava Reddy Kakarla, Ryan Beckett, Santiago Segarra, Beibin Li, Pooria Namyar, Behnaz Arzani
Keywords:
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI); Performance (cs.PF)
Comments:
[NLP-146] Weakly-supervised diagnosis identification from Italian discharge letters
Link: https://arxiv.org/abs/2410.15051
Authors: Vittorio Torri, Elisa Barbieri, Anna Cantarutti, Carlo Giaquinto, Francesca Ieva
Keywords:
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 39 pages, 4 figures
[NLP-147] Are LLMs Good Zero-Shot Fallacy Classifiers? EMNLP2024
Link: https://arxiv.org/abs/2410.15050
Authors: Fengjun Pan, Xiaobao Wu, Zongrui Li, Anh Tuan Luu
Keywords:
Categories: Computation and Language (cs.CL)
Comments: Accepted to EMNLP2024 main conference
[NLP-148] mHumanEval – A Multilingual Benchmark to Evaluate Large Language Models for Code Generation
Link: https://arxiv.org/abs/2410.15037
Authors: Nishat Raihan, Antonios Anastasopoulos, Marcos Zampieri
Keywords:
Categories: Computation and Language (cs.CL)
Comments: 30 Pages
[NLP-149] Improving General Text Embedding Model: Tackling Task Conflict and Data Imbalance through Model Merging
Link: https://arxiv.org/abs/2410.15035
Authors: Mingxin Li, Zhijie Nie, Yanzhao Zhang, Dingkun Long, Richong Zhang, Pengjun Xie
Keywords:
Categories: Computation and Language (cs.CL)
Comments: working in progress
[NLP-150] Enhancing Multimodal Sentiment Analysis for Missing Modality through Self-Distillation and Unified Modality Cross-Attention
Link: https://arxiv.org/abs/2410.15029
Authors: Yuzhe Weng, Haotian Wang, Tian Gao, Kewei Li, Shutong Niu, Jun Du
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
[NLP-151] Theoretical Aspects of Bias and Diversity in Minimum Bayes Risk Decoding
Link: https://arxiv.org/abs/2410.15021
Authors: Hidetaka Kamigaito, Hiroyuki Deguchi, Yusuke Sakai, Katsuhiko Hayashi, Taro Watanabe
Keywords:
Categories: Computation and Language (cs.CL)
Comments:
[NLP-152] A Survey of Ontology Expansion for Conversational Understanding EMNLP2024
Link: https://arxiv.org/abs/2410.15019
Authors: Jinggui Liang, Yuxia Wu, Yuan Fang, Hao Fei, Lizi Liao
Keywords:
Categories: Computation and Language (cs.CL)
Comments: Accepted by EMNLP 2024, code and data are available at this https URL: this https URL
[NLP-153] DM-Codec: Distilling Multimodal Representations for Speech Tokenization
Link: https://arxiv.org/abs/2410.15017
Authors: Md Mubtasim Ahasan, Md Fahim, Tasnim Mohiuddin, A K M Mahbubur Rahman, Aman Chadha, Tariq Iqbal, M Ashraful Amin, Md Mofijul Islam, Amin Ahsan Ali
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:
[NLP-154] Transit Pulse: Utilizing Social Media as a Source for Customer Feedback and Information Extraction with Large Language Model
Link: https://arxiv.org/abs/2410.15016
Authors: Jiahao Wang, Amer Shalaby
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: 17 pages, 21 figures
[NLP-155] CAP: Data Contamination Detection via Consistency Amplification
Link: https://arxiv.org/abs/2410.15005
Authors: Yi Zhao, Jing Li, Linyi Yang
Keywords:
Categories: Computation and Language (cs.CL)
Comments:
[NLP-156] ChitroJera: A Regionally Relevant Visual Question Answering Dataset for Bangla
Link: https://arxiv.org/abs/2410.14991
Authors: Deeparghya Dutta Barua, Md Sakib Ul Rahman Sourove, Md Farhan Ishmam, Fabiha Haider, Fariha Tanjim Shifat, Md Fahim, Md Farhad Alam
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
[NLP-157] Do Large Language Models Truly Grasp Mathematics? An Empirical Exploration
Link: https://arxiv.org/abs/2410.14979
Authors: Wei Xie, Shuoyoucheng Ma, Zhenhua Wang, Enze Wang, Baosheng Wang, Jinshu Su
Keywords:
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
[NLP-158] Subversive Characters and Stereotyping Readers: Characterizing Queer Relationalities with Dialogue-Based Relation Extraction
Link: https://arxiv.org/abs/2410.14978
Authors: Kent K. Chang, Anna Ho, David Bamman
Keywords:
Categories: Computation and Language (cs.CL)
Comments: CHR 2024 camera-ready
[NLP-159] BrainECHO: Semantic Brain Signal Decoding through Vector-Quantized Spectrogram Reconstruction for Whisper-Enhanced Text Generation
Link: https://arxiv.org/abs/2410.14971
Authors: Jilong Li, Zhenxi Song, Jiaqi Wang, Min Zhang, Zhiguo Zhang
Keywords:
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:
[NLP-160] ChronoFact: Timeline-based Temporal Fact Verification
Link: https://arxiv.org/abs/2410.14964
Authors: Anab Maulana Barik, Wynne Hsu, Mong Li Lee
Keywords:
Categories: Computation and Language (cs.CL)
Comments:
[NLP-161] SemiHVision: Enhancing Medical Multimodal Models with a Semi-Human Annotated Dataset and Fine-Tuned Instruction Generation
Link: https://arxiv.org/abs/2410.14948
Authors: Junda Wang, Yujan Ting, Eric Z. Chen, Hieu Tran, Hong Yu, Weijing Huang, Terrence Chen
Keywords:
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
[NLP-162] Baichuan Alignment Technical Report
Link: https://arxiv.org/abs/2410.14940
Authors: Mingan Lin, Fan Yang, Yanjun Shen, Haoze Sun, Tianpeng Li, Tao Zhang, Chenzheng Zhu, Tao Zhang, Miao Zheng, Xu Li, Yijie Zhou, Mingyang Chen, Yanzhao Qin, Youquan Li, Hao Liang, Fei Li, Yadong Li, Mang Wang, Guosheng Dong, Kun Fang, Jianhua Xu, Bin Cui, Wentao Zhang, Zenan Zhou, Weipeng Chen
Keywords:
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
[NLP-163] A Hybrid Defense Strategy for Boosting Adversarial Robustness in Vision-Language Models
Link: https://arxiv.org/abs/2410.14911
Authors: Yuhan Liang, Yijun Li, Yumeng Niu, Qianhe Shen, Hangyu Liu
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
[NLP-164] From Test-Taking to Test-Making: Examining LLM Authoring of Commonsense Assessment Items EMNLP2024
Link: https://arxiv.org/abs/2410.14897
Authors: Melissa Roemmele, Andrew S. Gordon
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at Findings of EMNLP 2024
[NLP-165] Class-RAG: Content Moderation with Retrieval Augmented Generation ACL
Link: https://arxiv.org/abs/2410.14881
Authors: Jianfa Chen, Emily Shen, Trupti Bavalatti, Xiaowen Lin, Yongkai Wang, Shuming Hu, Harihar Subramanyam, Ksheeraj Sai Vepuri, Ming Jiang, Ji Qi, Li Chen, Nan Jiang, Ankit Jain
Keywords:
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages, submit to ACL
[NLP-166] Which LLMs are Difficult to Detect? A Detailed Analysis of Potential Factors Contributing to Difficulties in LLM Text Detection NEURIPS2024
Link: https://arxiv.org/abs/2410.14875
Authors: Shantanu Thorat, Tianbao Yang
Keywords:
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at NeurIPS 2024 - Safe Generative AI Workshop
[NLP-167] How to Evaluate Reward Models for RLHF
Link: https://arxiv.org/abs/2410.14872
Authors: Evan Frick, Tianle Li, Connor Chen, Wei-Lin Chiang, Anastasios N. Angelopoulos, Jiantao Jiao, Banghua Zhu, Joseph E. Gonzalez, Ion Stoica
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
[NLP-168] DFlow: Diverse Dialogue Flow Simulation with Large Language Models
Link: https://arxiv.org/abs/2410.14853
Authors: Wanyu Du, Song Feng, James Gung, Lijia Sun, Yi Zhang, Saab Mansour, Yanjun Qi
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages
[NLP-169] Making LLMs Vulnerable to Prompt Injection via Poisoning Alignment
Link: https://arxiv.org/abs/2410.14827
Authors: Zedian Shao, Hongbin Liu, Jaden Mu, Neil Zhenqiang Gong
Keywords:
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
[NLP-170] SPRIG: Improving Large Language Model Performance by System Prompt Optimization
Link: https://arxiv.org/abs/2410.14826
Authors: Lechen Zhang, Tolga Ergen, Lajanugen Logeswaran, Moontae Lee, David Jurgens
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:
[NLP-171] A Complexity-Based Theory of Compositionality
Link: https://arxiv.org/abs/2410.14817
Authors: Eric Elmoznino, Thomas Jiralerspong, Yoshua Bengio, Guillaume Lajoie
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
[NLP-172] Adapting Multilingual LLMs to Low-Resource Languages using Continued Pre-training and Synthetic Corpus
Link: https://arxiv.org/abs/2410.14815
Authors: Raviraj Joshi, Kanishk Singla, Anusha Kamath, Raunak Kalani, Rakesh Paul, Utkarsh Vaidya, Sanjay Singh Chauhan, Niranjan Wartikar, Eileen Long
Keywords:
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
[NLP-173] Effects of Soft-Domain Transfer and Named Entity Information on Deception Detection
Link: https://arxiv.org/abs/2410.14814
Authors: Steven Triplett, Simon Minami, Rakesh Verma
Keywords:
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
[NLP-174] Isolated Causal Effects of Natural Language
Link: https://arxiv.org/abs/2410.14812
Authors: Victoria Lin, Louis-Philippe Morency, Eli Ben-Michael
Keywords:
Categories: Computation and Language (cs.CL); Methodology (stat.ME)
Comments:
[NLP-175] Cross-Document Event-Keyed Summarization
Link: https://arxiv.org/abs/2410.14795
Authors: William Walden, Pavlo Kuchmiichuk, Alexander Martin, Chihsheng Jin, Angela Cao, Claire Sun, Curisia Allen, Aaron Steven White
Keywords:
Categories: Computation and Language (cs.CL)
Comments:
[NLP-176] What's New in My Data? Novelty Exploration via Contrastive Generation
Link: https://arxiv.org/abs/2410.14765
Authors: Masaru Isonuma, Ivan Titov
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
[NLP-177] Enabling Scalable Evaluation of Bias Patterns in Medical LLMs
Link: https://arxiv.org/abs/2410.14763
Authors: Hamed Fayyaz, Raphael Poulain, Rahmatollah Beheshti
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
[NLP-178] Controllable Discovery of Intents: Incremental Deep Clustering Using Semi-Supervised Contrastive Learning
Link: https://arxiv.org/abs/2410.14755
Authors: Mrinal Rawat, Hithesh Sankararaman, Victor Barres
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted in IJCNLP’23
[NLP-179] Collaboratively adding new knowledge to an LLM
Link: https://arxiv.org/abs/2410.14753
Authors: Rhui Dih Lee, Laura Wynter
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
[NLP-180] TimeSeriesExam: A time series understanding exam NEURIPS’24
Link: https://arxiv.org/abs/2410.14752
Authors: Yifu Cai, Arjun Choudhry, Mononito Goswami, Artur Dubrawski
Keywords:
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at NeurIPS’24 Time Series in the Age of Large Models Workshop
[NLP-181] ETF: An Entity Tracing Framework for Hallucination Detection in Code Summaries
Link: https://arxiv.org/abs/2410.14748
Authors: Kishan Maharaj, Vitobha Munigala, Srikanth G. Tamilselvam, Prince Kumar, Sayandeep Sen, Palani Kodeswaran, Abhijit Mishra, Pushpak Bhattacharyya
Keywords:
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages, 6 Figures, 5 Tables
[NLP-182] Accounting for Sycophancy in Language Model Uncertainty Estimation
Link: https://arxiv.org/abs/2410.14746
Authors: Anthony Sicilia, Mert Inan, Malihe Alikhani
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:
[NLP-183] SemiEvol: Semi-supervised Fine-tuning for LLM Adaptation
Link: https://arxiv.org/abs/2410.14745
Authors: Junyu Luo, Xiao Luo, Xiusi Chen, Zhiping Xiao, Wei Ju, Ming Zhang
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
[NLP-184] Eliciting Uncertainty in Chain-of-Thought to Mitigate Bias against Forecasting Harmful User Behaviors
Link: https://arxiv.org/abs/2410.14744
Authors: Anthony Sicilia, Malihe Alikhani
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
[NLP-185] Agent Skill Acquisition for Large Language Models via CycleQD
Link: https://arxiv.org/abs/2410.14735
Authors: So Kuroki, Taishi Nakamura, Takuya Akiba, Yujin Tang
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments:
[NLP-186] Knowledge Graph Embeddings: A Comprehensive Survey on Capturing Relation Properties
Link: https://arxiv.org/abs/2410.14733
Authors: Guanglin Niu
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 22 pages, 8 figures, 3 tables, this paper is a modified English version of our article already published in Computer Science journal (in Chinese), released to facilitate communication among international researchers in the relevant fields
[NLP-187] MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection
Link: https://arxiv.org/abs/2410.14731
Authors: Bokai Lin, Zihao Zeng, Zipeng Xiao, Siqi Kou, Tianqi Hou, Xiaofeng Gao, Hao Zhang, Zhijie Deng
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
[NLP-188] Tokens on Demand: Token Condensation as Training-free Test-time Adaptation
Link: https://arxiv.org/abs/2410.14729
Authors: Zixin Wang, Dong Gong, Sen Wang, Zi Huang, Yadan Luo
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 18 pages, 7 figures
[NLP-189] Rethinking Token Reduction for State Space Models EMNLP2024
Link: https://arxiv.org/abs/2410.14725
Authors: Zheng Zhan, Yushu Wu, Zhenglun Kong, Changdi Yang, Yifan Gong, Xuan Shen, Xue Lin, Pu Zhao, Yanzhi Wang
Keywords:
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: EMNLP 2024
[NLP-190] A Systematic Survey on Large Language Models for Algorithm Design
Link: https://arxiv.org/abs/2410.14716
Authors: Fei Liu, Yiming Yao, Ping Guo, Zhiyuan Yang, Xi Lin, Xialiang Tong, Mingxuan Yuan, Zhichao Lu, Zhenkun Wang, Qingfu Zhang
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
[NLP-191] QuAILoRA: Quantization-Aware Initialization for LoRA NEURIPS
Link: https://arxiv.org/abs/2410.14713
Authors: Neal Lawton, Aishwarya Padmakumar, Judith Gaspers, Jack FitzGerald, Anoop Kumar, Greg Ver Steeg, Aram Galstyan
Keywords:
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 12 pages, 7 figures. Submitted to the 4th NeurIPS Workshop on Efficient Natural Language and Speech Processing (ENLSP-IV)
[NLP-192] A two-stage transliteration approach to improve performance of a multilingual ASR
Link: https://arxiv.org/abs/2410.14709
Authors: Rohit Kumar
Keywords:
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:
[NLP-193] Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark
Link: https://arxiv.org/abs/2410.14702
Authors: Himanshu Gupta, Shreyas Verma, Ujjwala Anantheswaran, Kevin Scaria, Mihir Parmar, Swaroop Mishra, Chitta Baral
Keywords:
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 49 pages, (10 pages paper, 9 pages references, 30 pages appendix)
[NLP-194] BrainTransformers: SNN-LLM
Link: https://arxiv.org/abs/2410.14687
Authors: Zhengzheng Tang
Keywords:
Categories: Neural and Evolutionary Computing (cs.NE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
[NLP-195] RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph
Link: https://arxiv.org/abs/2410.14684
Authors: Siru Ouyang, Wenhao Yu, Kaixin Ma, Zilin Xiao, Zhihan Zhang, Mengzhao Jia, Jiawei Han, Hongming Zhang, Dong Yu
Keywords:
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Work in progress
[NLP-196] Measuring Free-Form Decision-Making Inconsistency of Language Models in Military Crisis Simulations
Link: https://arxiv.org/abs/2410.13204
Authors: Aryan Shrivastava, Jessica Hullman, Max Lamparth
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:
[NLP-197] ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time
Link: https://arxiv.org/abs/2410.06625
Authors: Yi Ding, Bolian Li, Ruqi Zhang
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 27 pages
[NLP-198] Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning
Link: https://arxiv.org/abs/2410.16130
Authors: Chun-Yi Kuan, Hung-yi Lee
Keywords:
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: 5 pages, 1 figure
[NLP-199] CPE-Pro: A Structure-Sensitive Deep Learning Model for Protein Representation and Origin Evaluation
Link: https://arxiv.org/abs/2410.15592
Authors: Wenrui Gou, Wenhui Ge, YangTan, Guisheng Fan, Mingchen Li, Huiqun Yu
Keywords:
Categories: Biomolecules (q-bio.BM); Computation and Language (cs.CL); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments:
Artificial Intelligence
[AI-0] Reflection-Bench: probing AI intelligence with reflection
Link: https://arxiv.org/abs/2410.16270
Authors: Lingyu Li, Yixu Wang, Haiquan Zhao, Shuqi Kong, Yan Teng, Chunbo Li, Yingchun Wang
Keywords: intelligent systems’ interaction, unexpected outcomes, behaviors in response, response to unexpected, fundamental to intelligent
Categories: Artificial Intelligence (cs.AI)
Comments: 11 pages, 7 figures, 2 tables
Abstract:The ability to adapt beliefs or behaviors in response to unexpected outcomes, reflection, is fundamental to intelligent systems’ interaction with the world. From a cognitive science perspective, this serves as a core principle of intelligence applicable to both human and AI systems. To address the debate on the intelligence of large language models (LLMs), we propose Reflection-Bench, a comprehensive benchmark comprising 7 tasks spanning core cognitive functions crucial for reflection, including perception, memory, belief updating, decision-making, prediction, counterfactual thinking, and meta-reflection. We evaluate the performances of 13 prominent LLMs such as OpenAI o1, GPT-4, Claude 3.5 Sonnet, etc. The results indicate that current LLMs still lack satisfactory reflection ability. We discuss the underlying causes of these results and suggest potential avenues for future research. In conclusion, Reflection-Bench offers both evaluation tools and inspiration for developing AI capable of reliably interacting with the environment. Our data and code are available at this https URL.
[AI-1] xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
Link: https://arxiv.org/abs/2410.16267
Authors: Michael S. Ryoo, Honglu Zhou, Shrikant Kendre, Can Qin, Le Xue, Manli Shu, Silvio Savarese, Ran Xu, Caiming Xiong, Juan Carlos Niebles
Keywords: efficiently capture temporal, capture temporal information, multimodal language model, multiple frames, multimodal language
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:We present xGen-MM-Vid (BLIP-3-Video): a multimodal language model for videos, particularly designed to efficiently capture temporal information over multiple frames. BLIP-3-Video takes advantage of the ‘temporal encoder’ in addition to the conventional visual tokenizer, which maps a sequence of tokens over multiple frames into a compact set of visual tokens. This enables BLIP3-Video to use much fewer visual tokens than its competing models (e.g., 32 vs. 4608 tokens). We explore different types of temporal encoders, including learnable spatio-temporal pooling as well as sequential models like Token Turing Machines. We experimentally confirm that BLIP-3-Video obtains video question-answering accuracies comparable to much larger state-of-the-art models (e.g., 34B), while being much smaller (i.e., 4B) and more efficient by using fewer visual tokens. The project website is at this https URL
[AI-2] 3DGS-Enhancer: Enhancing Unbounded 3D Gaussian Splatting with View-consistent 2D Diffusion Priors NEURIPS2024
Link: https://arxiv.org/abs/2410.16266
Authors: Xi Liu, Chaoyi Zhou, Siyu Huang
Keywords: Novel-view synthesis aims, achieved notable success, Gaussian splatting, Novel-view synthesis, multiple input images
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by NeurIPS 2024 Spotlight
Abstract:Novel-view synthesis aims to generate novel views of a scene from multiple input images or videos, and recent advancements like 3D Gaussian splatting (3DGS) have achieved notable success in producing photorealistic renderings with efficient pipelines. However, generating high-quality novel views under challenging settings, such as sparse input views, remains difficult due to insufficient information in under-sampled areas, often resulting in noticeable artifacts. This paper presents 3DGS-Enhancer, a novel pipeline for enhancing the representation quality of 3DGS representations. We leverage 2D video diffusion priors to address the challenging 3D view consistency problem, reformulating it as achieving temporal consistency within a video generation process. 3DGS-Enhancer restores view-consistent latent features of rendered novel views and integrates them with the input views through a spatial-temporal decoder. The enhanced views are then used to fine-tune the initial 3DGS model, significantly improving its rendering performance. Extensive experiments on large-scale datasets of unbounded scenes demonstrate that 3DGS-Enhancer yields superior reconstruction performance and high-fidelity rendering results compared to state-of-the-art methods. The project webpage is this https URL .
[AI-3] CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution
Link: https://arxiv.org/abs/2410.16256
Authors: Maosong Cao, Alexander Lam, Haodong Duan, Hongwei Liu, Songyang Zhang, Kai Chen
Keywords: Efficient and accurate, large language models, continuous improvement, improvement of large, large language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Technical Report, Code and Models: this https URL
Abstract:Efficient and accurate evaluation is crucial for the continuous improvement of large language models (LLMs). Among various assessment methods, subjective evaluation has garnered significant attention due to its superior alignment with real-world usage scenarios and human preferences. However, human-based evaluations are costly and lack reproducibility, making precise automated evaluators (judgers) vital in this process. In this report, we introduce CompassJudger-1, the first open-source all-in-one judge LLM. CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility. It is capable of: 1. Performing unitary scoring and two-model comparisons as a reward model; 2. Conducting evaluations according to specified formats; 3. Generating critiques; 4. Executing diverse tasks like a general LLM. To assess the evaluation capabilities of different judge models under a unified setting, we have also established JudgerBench, a new benchmark that encompasses various subjective evaluation tasks and covers a wide range of topics. CompassJudger-1 offers a comprehensive solution for various evaluation tasks while maintaining the flexibility to adapt to diverse requirements. Both CompassJudger and JudgerBench are released and available to the research community at https://github.com/open-compass/CompassJudger. We believe that by open-sourcing these tools, we can foster collaboration and accelerate progress in LLM evaluation methodologies.
[AI-4] MoRE: Multi-Modal Contrastive Pre-training with Transformers on X-Rays ECGs and Diagnostic Report
Link: https://arxiv.org/abs/2410.16239
Authors: Samrajya Thapa, Koushik Howlader, Subhankar Bhattacharjee, Wei le
Keywords: Multi-Modal Contrastive Pre-training, Contrastive Pre-training Framework, synergistically combines X-rays, Contrastive Pre-training, Pre-training Framework
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures, 9 tables. Supplementary detail in Appendix. Code made available in Github for reproducibility
Abstract:In this paper, we introduce a novel Multi-Modal Contrastive Pre-training Framework that synergistically combines X-rays, electrocardiograms (ECGs), and radiology/cardiology reports. Our approach leverages transformers to encode these diverse modalities into a unified representation space, aiming to enhance diagnostic accuracy and facilitate comprehensive patient assessments. We utilize LoRA-Peft to significantly reduce trainable parameters in the LLM and incorporate recent linear attention dropping strategy in the Vision Transformer(ViT) for smoother attention. Furthermore, we provide novel multimodal attention explanations and retrieval for our model. To the best of our knowledge, we are the first to propose an integrated model that combines X-ray, ECG, and Radiology/Cardiology Report with this approach. By utilizing contrastive loss, MoRE effectively aligns modality-specific features into a coherent embedding, which supports various downstream tasks such as zero-shot classification and multimodal retrieval. Employing our proposed methodology, we achieve state-of-the-art (SOTA) on the Mimic-IV, CheXpert, Edema Severity, and PtbXl downstream datasets, surpassing existing multimodal approaches. Our proposed framework shows significant improvements in capturing intricate inter-modal relationships and its robustness in medical diagnosis that establishes a framework for future research in multimodal learning in the healthcare sector.
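The abstract above says that a contrastive loss aligns modality-specific features in a shared embedding space. The sketch below is only an illustration of a generic symmetric InfoNCE-style loss between two modalities, written in plain Python. It is not the authors' implementation: the function names, the temperature value, and the restriction to two modalities are all assumptions made for this example.

```python
import math

def normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(scores, target):
    # Numerically stable -log softmax(scores)[target].
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[target]

def contrastive_loss(xs, ys, temperature=0.1):
    # xs[i] and ys[i] are embeddings of the same sample in two modalities
    # (hypothetical example: an X-ray embedding and a report embedding).
    xs = [normalize(v) for v in xs]
    ys = [normalize(v) for v in ys]
    sim = [[sum(a * b for a, b in zip(x, y)) / temperature for y in ys] for x in xs]
    n = len(xs)
    # Each x should pick out its own y, and each y its own x.
    x_to_y = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    y_to_x = sum(cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)) / n
    return 0.5 * (x_to_y + y_to_x)

aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss is small when matching pairs point in the same direction and large when the pairing is shuffled, which is the behaviour a contrastive objective is meant to reward.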
[AI-5] Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping
Link: https://arxiv.org/abs/2410.16232
Authors: Ryan Li, Yanzhe Zhang, Diyi Yang
Keywords: conceptualize early-stage ideas, early-stage ideas, natural and accessible, accessible medium, designers to conceptualize
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: preprint, 9 pages
Abstract:Sketches are a natural and accessible medium for UI designers to conceptualize early-stage ideas. However, existing research on UI/UX automation often requires high-fidelity inputs like Figma designs or detailed screenshots, limiting accessibility and impeding efficient design iteration. To bridge this gap, we introduce Sketch2Code, a benchmark that evaluates state-of-the-art Vision Language Models (VLMs) on automating the conversion of rudimentary sketches into webpage prototypes. Beyond end-to-end benchmarking, Sketch2Code supports interactive agent evaluation that mimics real-world design workflows, where a VLM-based agent iteratively refines its generations by communicating with a simulated user, either passively receiving feedback instructions or proactively asking clarification questions. We comprehensively analyze ten commercial and open-source models, showing that Sketch2Code is challenging for existing VLMs; even the most capable models struggle to accurately interpret sketches and formulate effective questions that lead to steady improvement. Nevertheless, a user study with UI/UX experts reveals a significant preference for proactive question-asking over passive feedback reception, highlighting the need to develop more effective paradigms for multi-turn conversational agents.
[AI-6] Pre-training Distillation for Large Language Models: A Design Space Exploration
Link: https://arxiv.org/abs/2410.16215
Authors: Hao Peng, Xin Lv, Yushi Bai, Zijun Yao, Jiajie Zhang, Lei Hou, Juanzi Li
Keywords: transfer knowledge, smaller student model, large teacher model, aims to transfer, Knowledge distillation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Knowledge distillation (KD) aims to transfer knowledge from a large teacher model to a smaller student model. Previous work applying KD in the field of large language models (LLMs) typically focused on the post-training phase, where the student LLM learns directly from instructions and corresponding responses generated by the teacher model. In this paper, we extend KD to the pre-training phase of LLMs, named pre-training distillation (PD). We first conduct a preliminary experiment using GLM-4-9B as the teacher LLM to distill a 1.9B parameter student LLM, validating the effectiveness of PD. Considering the key impact factors of distillation, we systematically explore the design space of pre-training distillation across four aspects: logits processing, loss selection, scaling law, and offline or online logits. We conduct extensive experiments to explore the design space of pre-training distillation and find better configurations and interesting conclusions, such as larger student LLMs generally benefiting more from pre-training distillation, while a larger teacher LLM does not necessarily guarantee better results. We hope our exploration of the design space will inform future practices in pre-training distillation.
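As a rough illustration of the logit-based distillation the abstract above refers to, the sketch below computes a temperature-scaled KL divergence between a teacher's and a student's token distributions. This is the textbook distillation loss, not necessarily the configuration the paper settles on; the function names and the default temperature are assumptions made for this example.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax with max-subtraction for numerical stability.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=1.0):
    # KL(teacher || student) over one token's vocabulary distribution.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

same = kd_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1])
different = kd_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0])
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge. In a pre-training setting it would typically be averaged over all token positions and possibly mixed with the usual language-modelling loss.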
[AI-7] Comprehensive benchmarking of large language models for RNA secondary structure prediction
Link: https://arxiv.org/abs/2410.16212
Authors: L.I. Zablocki, L.A. Bugnon, M. Gerard, L. Di Persia, G. Stegmayer, D.H. Milone
Keywords: DNA and proteins, developed recently, large language models, RNA, Inspired
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments:
Abstract:Inspired by the success of large language models (LLM) for DNA and proteins, several LLM for RNA have been developed recently. RNA-LLM uses large datasets of RNA sequences to learn, in a self-supervised way, how to represent each RNA base with a semantically rich numerical vector. This is done under the hypothesis that obtaining high-quality RNA representations can enhance data-costly downstream tasks. Among them, predicting the secondary structure is a fundamental task for uncovering RNA functional mechanisms. In this work we present a comprehensive experimental analysis of several pre-trained RNA-LLM, comparing them for the RNA secondary structure prediction task in an unified deep learning framework. The RNA-LLM were assessed with increasing generalization difficulty on benchmark datasets. Results showed that two LLM clearly outperform the other models, and revealed significant challenges for generalization in low-homology scenarios.
[AI-8] Compute-Constrained Data Selection
Link: https://arxiv.org/abs/2410.16208
Authors: Junjie Oscar Yin, Alexander M. Rush
Keywords: Data selection, selection scales directly, training data needed, data selection scales, Data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract:Data selection can reduce the amount of training data needed to finetune LLMs; however, the efficacy of data selection scales directly with its compute. Motivated by the practical challenge of compute-constrained finetuning, we consider the setting in which both the cost of selecting data and training are budgeted for. We first formalize the problem of data selection with a cost-aware utility function, and model the data selection problem as trading off initial-selection cost for training gain. We run a comprehensive sweep of experiments across multiple tasks, varying compute budget by scaling finetuning tokens, model sizes, and data selection compute. These experiments show the validity of this model in real-world experiments. Interestingly we find that many powerful data selection methods are almost never compute-optimal, and that cheaper data selection alternatives dominate both from a theoretical and empirical perspective.
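To make the budgeting idea concrete, here is a small back-of-the-envelope sketch: a fixed compute budget is split between data selection and training. The "about 6 FLOPs per parameter per training token" rule of thumb and both function names are assumptions added for illustration; the abstract does not give the paper's cost model.

```python
def training_flops(params, tokens):
    """Rule-of-thumb training cost: roughly 6 FLOPs per parameter per token."""
    return 6 * params * tokens

def tokens_affordable(budget_flops, params, selection_flops):
    """Training tokens left after paying for data selection (0 if over budget)."""
    remaining = budget_flops - selection_flops
    return max(0, remaining // (6 * params))

# Example: a 1B-parameter model with a budget equal to training on 1M tokens.
params = 1_000_000_000
budget = training_flops(params, 1_000_000)
# A selection method costing as much as training on 250k tokens
# leaves room for only 750k training tokens.
selection_cost = training_flops(params, 250_000)
left = tokens_affordable(budget, params, selection_cost)
```

Under this accounting, an expensive selection method has to improve the per-token training gain enough to make up for the tokens it displaces.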
[AI-9] Improve Vision Language Model Chain-of-thought Reasoning
Link: https://arxiv.org/abs/2410.16198
Authors: Ruohong Zhang, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, Yiming Yang
Keywords: vision language models, interpretability and trustworthiness, vision language, crucial for improving, improving interpretability
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages + appendix
Abstract:Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness. However, current training recipes lack robust CoT reasoning data, relying on datasets dominated by short annotations with minimal rationales. In this work, we show that training VLM on short answers does not generalize well to reasoning tasks that require more detailed responses. To address this, we propose a two-fold approach. First, we distill rationales from GPT-4o model to enrich the training data and fine-tune VLMs, boosting their CoT performance. Second, we apply reinforcement learning to further calibrate reasoning quality. Specifically, we construct positive (correct) and negative (incorrect) pairs of model-generated reasoning chains, by comparing their predictions with annotated short answers. Using this pairwise data, we apply the Direct Preference Optimization algorithm to refine the model’s reasoning abilities. Our experiments demonstrate significant improvements in CoT reasoning on benchmark datasets and better generalization to direct answer prediction as well. This work emphasizes the importance of incorporating detailed rationales in training and leveraging reinforcement learning to strengthen the reasoning capabilities of VLMs.
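The pair-construction step described in the abstract can be sketched as follows. The `Answer:` marker used to extract a prediction, the exact-match comparison, and the output field names are hypothetical conventions chosen for this example, not details from the paper.

```python
from itertools import product

def extract_answer(chain: str) -> str:
    """Hypothetical convention: the final answer follows the last 'Answer:'."""
    return chain.rsplit("Answer:", 1)[-1].strip().lower()

def build_preference_pairs(question, chains, gold_answer):
    """Pair every correct reasoning chain (chosen) with every incorrect one (rejected)."""
    gold = gold_answer.strip().lower()
    correct = [c for c in chains if extract_answer(c) == gold]
    wrong = [c for c in chains if extract_answer(c) != gold]
    return [
        {"prompt": question, "chosen": good, "rejected": bad}
        for good, bad in product(correct, wrong)
    ]
```

Records in this prompt/chosen/rejected shape are what a preference-optimization trainer typically consumes; questions where all sampled chains are correct (or all wrong) yield no pairs.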
[AI-10] Information for Conversation Generation: Proposals Utilising Knowledge Graphs ISWC2024
Link: https://arxiv.org/abs/2410.16196
Authors: Alex Clay, Ernesto Jiménez-Ruiz
Keywords: frequently used tools, tools for conversational, Knowledge, conversational generation, Abstract
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 7 pages with citations, 1 figure, accepted to the ISWC 2024 Special Session
Abstract:LLMs are frequently used tools for conversational generation. Without additional information LLMs can generate lower quality responses due to lacking relevant content and hallucinations, as well as the perception of poor emotional capability, and an inability to maintain a consistent character. Knowledge graphs are commonly used forms of external knowledge and may provide solutions to these challenges. This paper introduces three proposals, utilizing knowledge graphs to enhance LLM generation. Firstly, dynamic knowledge graph embeddings and recommendation could allow for the integration of new information and the selection of relevant knowledge for response generation. Secondly, storing entities with emotional values as additional features may provide knowledge that is better emotionally aligned with the user input. Thirdly, integrating character information through narrative bubbles would maintain character consistency, as well as introducing a structure that would readily incorporate new information.
[AI-11] Learning How to Vote With Principles: Axiomatic Insights Into the Collective Decisions of Neural Networks
Link: https://arxiv.org/abs/2410.16170
Authors: Levin Hornischer, Zoi Terzopoulou
Keywords: voting theory, neural networks, collective decisions, transparency in collective, voting
Categories: Artificial Intelligence (cs.AI)
Comments: 15 pages, 8 figures, 7 tables
Abstract:Can neural networks be applied in voting theory, while satisfying the need for transparency in collective decisions? We propose axiomatic deep voting: a framework to build and evaluate neural networks that aggregate preferences, using the well-established axiomatic method of voting theory. Our findings are: (1) Neural networks, despite being highly accurate, often fail to align with the core axioms of voting rules, revealing a disconnect between mimicking outcomes and reasoning. (2) Training with axiom-specific data does not enhance alignment with those axioms. (3) By solely optimizing axiom satisfaction, neural networks can synthesize new voting rules that often surpass and substantially differ from existing ones. This offers insights for both fields: For AI, important concepts like bias and value-alignment are studied in a mathematically rigorous way; for voting theory, new areas of the space of voting rules are explored.
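To show what checking a rule against an axiom can look like, here is a minimal sketch that tests two classical axioms, anonymity and neutrality, on a single preference profile. The plurality and dictatorship rules are stand-ins for whatever aggregator is under test (for instance a trained network); none of the function names come from the paper.

```python
from collections import Counter
from itertools import permutations

def plurality(profile):
    """Return the set of candidates with the most first-place votes."""
    counts = Counter(ranking[0] for ranking in profile)
    best = max(counts.values())
    return {c for c, n in counts.items() if n == best}

def dictatorship(profile):
    """Always pick voter 0's top choice (this rule violates anonymity)."""
    return {profile[0][0]}

def satisfies_anonymity(rule, profile):
    """The outcome must not change when the voters are reordered."""
    expected = rule(profile)
    return all(rule(list(p)) == expected for p in permutations(profile))

def satisfies_neutrality(rule, profile, candidates):
    """Renaming the candidates must rename the winners in the same way."""
    winners = rule(profile)
    for perm in permutations(candidates):
        rename = dict(zip(candidates, perm))
        renamed = [tuple(rename[c] for c in ranking) for ranking in profile]
        if rule(renamed) != {rename[w] for w in winners}:
            return False
    return True
```

A check like this covers only one profile; estimating how often a learned rule satisfies an axiom would mean repeating it over many sampled profiles.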
[AI-12] GenAI Assisting Medical Training
Link: https://arxiv.org/abs/2410.16164
Authors: Stefan Fritsch, Matthias Tschoepe, Vitor Fortes Rey, Lars Krupp, Agnes Gruenerbl, Eloise Monger, Sarah Travenna
Keywords: require precise skills, essential for nurses, nurses and require, require precise, Medical procedures
Categories: Artificial Intelligence (cs.AI)
Comments: 2 pages, 2 figures
Abstract:Medical procedures such as venipuncture and cannulation are essential for nurses and require precise skills. Learning this skill, in turn, is a challenge for educators due to the number of teachers per class and the complexity of the task. The study aims to help students with skill acquisition and alleviate the educator’s workload by integrating generative AI methods to provide real-time feedback on medical procedures such as venipuncture and cannulation.
[AI-13] Warped Diffusion: Solving Video Inverse Problems with Image Diffusion Models NEURIPS2024
Link: https://arxiv.org/abs/2410.16152
Authors: Giannis Daras, Weili Nie, Karsten Kreis, Alex Dimakis, Morteza Mardani, Nikola Borislavov Kovachki, Arash Vahdat
Keywords: suffers from flickering, space diffusion models, function space diffusion, naively for solving, image models naively
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted in NeurIPS 2024
Abstract:Using image models naively for solving inverse video problems often suffers from flickering, texture-sticking, and temporal inconsistency in generated videos. To tackle these problems, in this paper, we view frames as continuous functions in the 2D space, and videos as a sequence of continuous warping transformations between different frames. This perspective allows us to train function space diffusion models only on images and utilize them to solve temporally correlated inverse problems. The function space diffusion models need to be equivariant with respect to the underlying spatial transformations. To ensure temporal consistency, we introduce a simple post-hoc test-time guidance towards (self)-equivariant solutions. Our method allows us to deploy state-of-the-art latent diffusion models such as Stable Diffusion XL to solve video inverse problems. We demonstrate the effectiveness of our method for video inpainting and 8× video super-resolution, outperforming existing techniques based on noise transformations. We provide generated video results: this https URL.
[AI-14] Small Contributions Small Networks: Efficient Neural Network Pruning Based on Relative Importance
Link: https://arxiv.org/abs/2410.16151
Authors: Mostafa Hussien, Mahmoud Afifi, Kim Khoa Nguyen, Mohamed Cheriet
Keywords: achieving remarkable performance, Recent advancements, scaled neural networks, achieving remarkable, range of tasks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Abstract:Recent advancements have scaled neural networks to unprecedented sizes, achieving remarkable performance across a wide range of tasks. However, deploying these large-scale models on resource-constrained devices poses significant challenges due to substantial storage and computational requirements. Neural network pruning has emerged as an effective technique to mitigate these limitations by reducing model size and complexity. In this paper, we introduce an intuitive and interpretable pruning method based on activation statistics, rooted in information theory and statistical analysis. Our approach leverages the statistical properties of neuron activations to identify and remove weights with minimal contributions to neuron outputs. Specifically, we build a distribution of weight contributions across the dataset and utilize its parameters to guide the pruning process. Furthermore, we propose a Pruning-aware Training strategy that incorporates an additional regularization term to enhance the effectiveness of our pruning method. Extensive experiments on multiple datasets and network architectures demonstrate that our method consistently outperforms several baseline and state-of-the-art pruning techniques.
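One way to picture pruning by weight contribution is the sketch below, which scores each weight of a single neuron by its average share of the neuron's absolute pre-activation over a dataset. The specific statistic (mean relative contribution), the fixed threshold, and the function names are assumptions made for illustration; the abstract does not state the paper's actual criterion.

```python
from statistics import mean

def relative_contributions(weights, activations):
    """Mean share of |w_j * x_j| in the neuron's total absolute pre-activation.

    weights: list of floats for one neuron.
    activations: list of input vectors, one per data sample.
    """
    shares = [[] for _ in weights]
    for x in activations:
        terms = [abs(w * xi) for w, xi in zip(weights, x)]
        total = sum(terms) or 1.0  # guard against all-zero inputs
        for j, t in enumerate(terms):
            shares[j].append(t / total)
    return [mean(s) for s in shares]

def prune_neuron(weights, activations, threshold=0.05):
    """Zero out weights whose mean relative contribution is below the threshold."""
    contrib = relative_contributions(weights, activations)
    return [w if c >= threshold else 0.0 for w, c in zip(weights, contrib)]
```

Unlike plain magnitude pruning, a score of this kind depends on the data: a large weight attached to an input that is almost always near zero can still be removed.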
[AI-15] PODTILE: Facilitating Podcast Episode Browsing with Auto-generated Chapters CIKM
Link: https://arxiv.org/abs/2410.16148
Authors: Azin Ghazimatin, Ekaterina Garmash, Gustavo Penha, Kristen Sheets, Martin Achenbach, Oguz Semerci, Remi Galvez, Marcus Tannenberg, Sahitya Mantravadi, Divya Narayanan, Ofeliya Kalaydzhyan, Douglas Cole, Ben Carterette, Ann Clifton, Paul N. Bennett, Claudia Hauff, Mounia Lalmas
Keywords: locate relevant sections, long-form talk-audio content, relevant sections, long-form talk-audio, find it challenging
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures, CIKM industry track 2024
Abstract:Listeners of long-form talk-audio content, such as podcast episodes, often find it challenging to understand the overall structure and locate relevant sections. A practical solution is to divide episodes into chapters–semantically coherent segments labeled with titles and timestamps. Since most episodes on our platform at Spotify currently lack creator-provided chapters, automating the creation of chapters is essential. Scaling the chapterization of podcast episodes presents unique challenges. First, episodes tend to be less structured than written texts, featuring spontaneous discussions with nuanced transitions. Second, the transcripts are usually lengthy, averaging about 16,000 tokens, which necessitates efficient processing that can preserve context. To address these challenges, we introduce PODTILE, a fine-tuned encoder-decoder transformer to segment conversational data. The model simultaneously generates chapter transitions and titles for the input transcript. To preserve context, each input text is augmented with global context, including the episode’s title, description, and previous chapter titles. In our intrinsic evaluation, PODTILE achieved an 11% improvement in ROUGE score over the strongest baseline. Additionally, we provide insights into the practical benefits of auto-generated chapters for listeners navigating episode content. Our findings indicate that auto-generated chapters serve as a useful tool for engaging with less popular podcasts. Finally, we present empirical evidence that using chapter titles can enhance effectiveness of sparse retrieval in search tasks.
[AI-16] Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs
Link: https://arxiv.org/abs/2410.16135
Authors: Kang Zhao, Tao Yuan, Han Bao, Zhenfeng Su, Chang Gao, Zhaofeng Sun, Zichen Liang, Liping Jing, Jianfei Chen
Keywords: sparse tensor cores, sparsity, tensor cores, cores on GPUs, M-sparse Transformers
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:To date, 2:4 sparsity has stood as the only sparse pattern that can be accelerated using sparse tensor cores on GPUs. In practice, 2:4 sparsity often yields low actual speedups (≤ 1.3) and requires fixed sparse ratios, meaning that other ratios, such as 4:8, 8:16, or those exceeding 50% sparsity, do not incur any speedups on GPUs. Recent studies suggest that V:N:M sparsity is promising in addressing these limitations of 2:4 sparsity. However, regarding accuracy, the effects of V:N:M sparsity on broader Transformer models, such as vision Transformers and large language models (LLMs), are largely unexamined. Moreover, some specific issues related to V:N:M sparsity, such as how to select appropriate V and M values, remain unresolved. In this study, we thoroughly investigate the application of V:N:M sparsity in vision models and LLMs across multiple tasks, from pretraining to downstream tasks. We propose three key approaches to enhance the applicability and accuracy of V:N:M-sparse Transformers, including heuristic V and M selection, V:N:M-specific channel permutation, and three-staged LoRA training techniques. Experimental results show that, with our methods, the DeiT-small achieves lossless accuracy at 64:2:5 sparsity, while the DeiT-base maintains accuracy even at 64:2:8 sparsity. In addition, the fine-tuned Llama2-7B at 64:2:5 sparsity performs comparably or better than training-free 2:4 sparse alternatives on downstream tasks. More importantly, V:N:M-sparse Transformers offer a wider range of speedup-accuracy trade-offs compared to 2:4 sparsity. Overall, our exploration largely facilitates the V:N:M sparsity to act as a truly effective acceleration solution for Transformers in cost-sensitive inference scenarios.
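The abstract does not define the V:N:M pattern itself. The sketch below assumes the scheme as it is usually described in earlier work on this format: the weight matrix is tiled into V x M blocks, four columns are kept per block, and N:4 sparsity is then applied row by row within those four columns. The magnitude-based selection and the function name are illustrative choices, not the paper's method.

```python
def vnm_mask(matrix, v, n, m):
    """Build a 0/1 mask for V:N:M sparsity using magnitude-based selection.

    Assumed scheme: split the matrix into v x m blocks, keep the 4 columns
    with the largest L1 norm in each block, then keep the n largest-magnitude
    entries among those 4 columns in every row (n:4 sparsity).
    """
    rows, cols = len(matrix), len(matrix[0])
    if rows % v or cols % m or m < 4 or n > 4:
        raise ValueError("shape must tile into v x m blocks with m >= 4 and n <= 4")
    mask = [[0] * cols for _ in range(rows)]
    for r0 in range(0, rows, v):
        for c0 in range(0, cols, m):
            block_cols = range(c0, c0 + m)
            # L1 norm of each column restricted to this block
            norms = {
                c: sum(abs(matrix[r][c]) for r in range(r0, r0 + v))
                for c in block_cols
            }
            kept_cols = sorted(block_cols, key=lambda c: norms[c], reverse=True)[:4]
            for r in range(r0, r0 + v):
                top = sorted(kept_cols, key=lambda c: abs(matrix[r][c]), reverse=True)[:n]
                for c in top:
                    mask[r][c] = 1
    return mask
```

Under this reading, each row keeps N of every M weights, so a 64:2:5 pattern retains 2/5 of the weights (60% sparsity) and 64:2:8 retains 2/8 (75% sparsity), beyond the fixed 50% of plain 2:4.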
[AI-17] A Data-driven Crowd Simulation Framework Integrating Physics-informed Machine Learning with Navigation Potential Fields
Link: https://arxiv.org/abs/2410.16132
Authors: Runkang Guo, Bin Chen, Qi Zhang, Yong Zhao, Xiao Wang, Zhengqiu Zhu
Keywords: Traditional rule-based physical, singular physical formulas, Physics-informed Machine Learning, integrates Physics-informed Machine, Traditional rule-based
Categories: Artificial Intelligence (cs.AI)
Abstract:Traditional rule-based physical models are limited by their reliance on singular physical formulas and parameters, making it difficult to effectively tackle the intricate tasks associated with crowd simulation. Recent research has introduced deep learning methods to tackle these issues, but most current approaches focus primarily on generating pedestrian trajectories, often lacking interpretability and failing to provide real-time dynamic simulation. To address the aforementioned issues, we propose a novel data-driven crowd simulation framework that integrates Physics-informed Machine Learning (PIML) with navigation potential fields. Our approach leverages the strengths of both physical models and PIML. Specifically, we design an innovative Physics-informed Spatio-temporal Graph Convolutional Network (PI-STGCN) as a data-driven module to predict pedestrian movement trends based on crowd spatio-temporal data. Additionally, we construct a physical model of navigation potential fields based on flow field theory to guide pedestrian movements, thereby reinforcing physical constraints during the simulation. In our framework, navigation potential fields are dynamically computed and updated based on the movement trends predicted by the PI-STGCN, while the updated crowd dynamics, guided by these fields, subsequently feed back into the PI-STGCN. Comparative experiments on two publicly available large-scale real-world datasets across five scenes demonstrate that our proposed framework outperforms existing rule-based methods in accuracy and fidelity. The similarity between simulated and actual pedestrian trajectories increases by 10.8%, while the average error is reduced by 4%. Moreover, our framework exhibits greater adaptability and better interpretability compared to methods that rely solely on deep learning for trajectory generation.
[AI-18] SMART: Self-learning Meta-strategy Agent for Reasoning Tasks
Link: https://arxiv.org/abs/2410.16128
Authors: Rongxing Liu, Kumar Shridhar, Manish Prajapat, Patrick Xia, Mrinmaya Sachan
Keywords: Tasks requiring deductive, requiring deductive reasoning, involving multiple steps, demand adaptive strategies, rationales or programs
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:Tasks requiring deductive reasoning, especially those involving multiple steps, often demand adaptive strategies such as intermediate generation of rationales or programs, as no single approach is universally optimal. While Language Models (LMs) can enhance their outputs through iterative self-refinement and strategy adjustments, they frequently fail to apply the most effective strategy in their first attempt. This inefficiency raises the question: Can LMs learn to select the optimal strategy in the first attempt, without a need for refinement? To address this challenge, we introduce SMART (Self-learning Meta-strategy Agent for Reasoning Tasks), a novel framework that enables LMs to autonomously learn and select the most effective strategies for various reasoning tasks. We model the strategy selection process as a Markov Decision Process and leverage reinforcement learning-driven continuous self-improvement to allow the model to find the suitable strategy to solve a given task. Unlike traditional self-refinement methods that rely on multiple inference passes or external feedback, SMART allows an LM to internalize the outcomes of its own reasoning processes and adjust its strategy accordingly, aiming for correct solutions on the first attempt. Our experiments across various reasoning datasets and with different model architectures demonstrate that SMART significantly enhances the ability of models to choose optimal strategies without external guidance (+15 points on the GSM8K dataset). By achieving higher accuracy with a single inference pass, SMART not only improves performance but also reduces computational costs for refinement-based strategies, paving the way for more efficient and intelligent reasoning in LMs.
[AI-19] SeaDAG: Semi-autoregressive Diffusion for Conditional Directed Acyclic Graph Generation
Link: https://arxiv.org/abs/2410.16119
Authors: Xinyi Zhou, Xing Li, Yingzhao Lian, Yiwen Wang, Lei Chen, Mingxuan Yuan, Jianye Hao, Guangyong Chen, Pheng Ann Heng
Keywords: Directed Acyclic Graphs, Directed Acyclic, Acyclic Graphs, introduce SeaDAG, Directed
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:We introduce SeaDAG, a semi-autoregressive diffusion model for conditional generation of Directed Acyclic Graphs (DAGs). Considering their inherent layer-wise structure, we simulate layer-wise autoregressive generation by designing different denoising speed for different layers. Unlike conventional autoregressive generation that lacks a global graph structure view, our method maintains a complete graph structure at each diffusion step, enabling operations such as property control that require the full graph structure. Leveraging this capability, we evaluate the DAG properties during training by employing a graph property decoder. We explicitly train the model to learn graph conditioning with a condition loss, which enhances the diffusion model’s capacity to generate graphs that are both realistic and aligned with specified properties. We evaluate our method on two representative conditional DAG generation tasks: (1) circuit generation from truth tables, where precise DAG structures are crucial for realizing circuit functionality, and (2) molecule generation based on quantum properties. Our approach demonstrates promising results, generating high-quality and realistic DAGs that closely align with given conditions.
[AI-20] Addressing Spectral Bias of Deep Neural Networks by Multi-Grade Deep Learning
Link: https://arxiv.org/abs/2410.16105
Authors: Ronglong Fang, Yuesheng Xu
Keywords: DNNs typically exhibit, high-frequency features, typically exhibit, exhibit a tendency, tendency to prioritize
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Deep neural networks (DNNs) suffer from spectral bias: they tend to prioritize learning the lower-frequency components of a function and struggle to capture its high-frequency features. This paper addresses this issue. Note that a function having only low-frequency components may be well represented by a shallow neural network (SNN), a network having only a few layers. Observing that a composition of low-frequency functions can effectively approximate a high-frequency function, we propose to learn a function containing high-frequency components by composing several SNNs, each of which learns certain low-frequency information from the given data. We implement the proposed idea by exploiting the multi-grade deep learning (MGDL) model, a recently introduced model that trains a DNN incrementally, grade by grade: the current grade learns, from the residue of the previous grade, only an SNN composed with the SNNs trained in the preceding grades as features. We apply MGDL to synthetic, manifold, colored-image, and MNIST datasets, all characterized by the presence of high-frequency features. Our study reveals that MGDL excels at representing functions containing high-frequency information. Specifically, the neural networks learned in each grade adeptly capture some low-frequency information, allowing their compositions with SNNs learned in the previous grades to effectively represent the high-frequency features. Our experimental results underscore the efficacy of MGDL in addressing the spectral bias inherent in DNNs. By leveraging MGDL, we offer insights into overcoming the spectral bias limitation of DNNs, thereby enhancing the performance and applicability of deep learning models in tasks requiring the representation of high-frequency information. This study confirms that the proposed method offers a promising solution to address the spectral bias of DNNs.
[AI-21] Multi-Sensor Fusion for UAV Classification Based on Feature Maps of Image and Radar Data
Link: https://arxiv.org/abs/2410.16089
Authors: Nikos Sakellariou(1), Antonios Lalas(1), Konstantinos Votis(1), Dimitrios Tzovaras(1) ((1) Centre for Research and Technology Hellas, Information Technologies Institute)
Keywords: modern UAVs make, Deep Neural Network, unique cost, contemporary society, Convolutional Neural Network
Categories: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: 10 pages, 6 figures
Abstract:The unique cost, flexibility, speed, and efficiency of modern UAVs make them an attractive choice in many applications in contemporary society. This, however, causes an ever-increasing number of reported malicious or accidental incidents, rendering the need for the development of UAV detection and classification mechanisms essential. We propose a methodology for developing a system that fuses already processed multi-sensor data into a new Deep Neural Network to increase its classification accuracy towards UAV detection. The DNN model fuses high-level features extracted from individual object detection and classification models associated with thermal, optronic, and radar data. Additionally, emphasis is given to the model’s Convolutional Neural Network (CNN) based architecture that combines the features of the three sensor modalities by stacking the extracted image features of the thermal and optronic sensor achieving higher classification accuracy than each sensor alone.
[AI-22] Fine-Tuning LLMs for Reliable Medical Question-Answering Services ICDM
Link: https://arxiv.org/abs/2410.16088
Authors: Ali Anaissi, Ali Braytee, Junaid Akram
Keywords: Large Language Models, fine-tuned Large Language, Large Language, Language Models, present an advanced
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 10 figures, accepted and to be published in the proceedings of 2024 IEEE International Conference on Data Mining Workshops (ICDMW)
Abstract:We present an advanced approach to medical question-answering (QA) services, using fine-tuned Large Language Models (LLMs) to improve the accuracy and reliability of healthcare information. Our study focuses on optimizing models like LLaMA-2 and Mistral, which have shown great promise in delivering precise, reliable medical answers. By leveraging comprehensive datasets, we applied fine-tuning techniques such as rsDoRA+ and ReRAG. rsDoRA+ enhances model performance through a combination of decomposed model weights, varied learning rates for low-rank matrices, and rank stabilization, leading to improved efficiency. ReRAG, which integrates retrieval on demand and question rewriting, further refines the accuracy of the responses. This approach enables healthcare providers to access fast, dependable information, aiding in more efficient decision-making and fostering greater patient trust. Our work highlights the potential of fine-tuned LLMs to significantly improve the quality and accessibility of medical information services, ultimately contributing to better healthcare outcomes for all.
[AI-23] Critical Example Mining for Vehicle Trajectory Prediction using Flow-based Generative Models
Link: https://arxiv.org/abs/2410.16083
Authors: Zhezhang Ding, Huijing Zhao
Keywords: Precise trajectory prediction, complex driving scenarios, Precise trajectory, trajectory prediction models, autonomous vehicles
Categories: Artificial Intelligence (cs.AI)
Comments: 8 pages, 6 figures
Abstract:Precise trajectory prediction in complex driving scenarios is essential for autonomous vehicles. In practice, different driving scenarios present varying levels of difficulty for trajectory prediction models. However, most existing research focuses on the average precision of prediction results, while ignoring the underlying distribution of the input scenarios. This paper proposes a critical example mining method that utilizes a data-driven approach to estimate the rareness of the trajectories. By combining the rareness estimation of observations with whole trajectories, the proposed method effectively identifies a subset of data that is relatively hard to predict before feeding them to a specific prediction model. The experimental results show that the mined subset has higher prediction error when applied to different downstream prediction models, reaching +108.1% error (more than twice the dataset average) when mining 5% of the samples. Further analysis indicates that the mined critical examples include uncommon cases such as sudden braking and cancelled lane changes, which helps to better understand and improve the performance of prediction models.
[AI-24] On-Device LLMs for SMEs: Challenges and Opportunities
Link: https://arxiv.org/abs/2410.16070
Authors: Jeremy Stephen Gabriel Yee Zhi Wen, Pai Chet Ng, Zhengkui Wang, Ian McLoughlin, Aik Beng Ng, Simon See
Keywords: Large Language Models, deploying Large Language, Language Models, Large Language, medium-sized enterprises
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 9 pages, 1 figure. The work is supported by the SIT-NVIDIA Joint AI Centre
Abstract:This paper presents a systematic review of the infrastructure requirements for deploying Large Language Models (LLMs) on-device within the context of small and medium-sized enterprises (SMEs), focusing on both hardware and software perspectives. From the hardware viewpoint, we discuss the utilization of processing units like GPUs and TPUs, efficient memory and storage solutions, and strategies for effective deployment, addressing the challenges of limited computational resources typical in SME settings. From the software perspective, we explore framework compatibility, operating system optimization, and the use of specialized libraries tailored for resource-constrained environments. The review is structured to first identify the unique challenges faced by SMEs in deploying LLMs on-device, followed by an exploration of the opportunities that both hardware innovations and software adaptations offer to overcome these obstacles. Such a structured review provides practical insights, contributing significantly to the community by enhancing the technological resilience of SMEs in integrating LLMs.
[AI-25] Integrated Image-Text Based on Semi-supervised Learning for Small Sample Instance Segmentation
Link: https://arxiv.org/abs/2410.16063
Authors: Ruting Chi, Zhiyi Huang, Yuexing Han
Keywords: Small sample instance, sample instance segmentation, sample instance, Small sample, instance segmentation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract:Small sample instance segmentation is a very challenging task, and many existing methods follow the meta-learning training strategy, which pre-trains models on a support set and fine-tunes them on a query set. The pre-training phase, which is highly task-related, requires a significant amount of additional training time and the selection of closely related datasets to ensure effectiveness. This article proposes a novel small sample instance segmentation solution from the perspective of maximizing the utilization of existing information without increasing annotation burden and training costs. The proposed method designs two modules to address the problems encountered in small sample instance segmentation. First, it helps the model fully utilize unlabeled data by learning to generate pseudo labels, increasing the number of available samples. Second, by integrating the features of text and image, more accurate classification results can be obtained. These two modules are suitable for box-free and box-dependent frameworks. In this way, the proposed method not only improves the performance of small sample instance segmentation, but also greatly reduces reliance on pre-training. We have conducted experiments on three datasets from different scenes: on land, underwater, and under a microscope. As evidenced by our experiments, integrated image-text corrects the confidence of classification, and pseudo labels help the model obtain more precise masks. All the results demonstrate the effectiveness and superiority of our method.
[AI-26] TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling
Link: https://arxiv.org/abs/2410.16033
Authors: Jiahao Qiu, Yifu Lu, Yifan Zeng, Jiacheng Guo, Jiayi Geng, Huazheng Wang, Kaixuan Huang, Yue Wu, Mengdi Wang
Keywords: Inference-time alignment enhances, large language models, requiring additional training, presents challenges due, balancing computational efficiency
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:Inference-time alignment enhances the performance of large language models without requiring additional training or fine-tuning but presents challenges due to balancing computational efficiency with high-quality output. Best-of-N (BoN) sampling, as a simple yet powerful approach, generates multiple responses and selects the best one, achieving improved performance but with a high computational cost. We propose TreeBoN, a novel framework that integrates a speculative tree-search strategy into Best-of-N (BoN) Sampling. TreeBoN maintains a set of parent nodes, iteratively branching and pruning low-quality responses, thereby reducing computational overhead while maintaining high output quality. Our approach also leverages token-level rewards from Direct Preference Optimization (DPO) to guide tree expansion and prune low-quality paths. We evaluate TreeBoN using AlpacaFarm, UltraFeedback, GSM8K, HH-RLHF, and TutorEval datasets, demonstrating consistent improvements. Specifically, TreeBoN achieves a 65% win rate at maximum lengths of 192 and 384 tokens, outperforming standard BoN with the same computational cost. Furthermore, TreeBoN achieves around a 60% win rate across longer responses, showcasing its scalability and alignment efficacy.
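The contrast between plain Best-of-N and a layered branch-and-prune variant can be sketched with a toy generator. Everything here is a simplified assumption: `generate` and `reward` are placeholder callables, the toy reward is just a digit sum, and the real method scores partial responses with token-level rewards from a DPO-trained model rather than with this stand-in.

```python
import random

def best_of_n(generate, reward, prompt, n, length):
    """Plain Best-of-N: sample n full responses and keep the highest-reward one."""
    candidates = [generate(prompt, length) for _ in range(n)]
    return max(candidates, key=reward)

def tree_bon(generate, reward, prompt, n, length, layers, keep):
    """Layered variant: extend the surviving parents by one segment each,
    producing about n children per layer, then prune to the `keep` best
    partial responses and repeat. Assumes length is divisible by layers."""
    segment = length // layers
    parents = [prompt]
    for _ in range(layers):
        children_per_parent = max(1, n // len(parents))
        children = [
            generate(parent, segment)
            for parent in parents
            for _ in range(children_per_parent)
        ]
        parents = sorted(children, key=reward, reverse=True)[:keep]
    return parents[0]

# Toy setup: a "response" is a tuple of digits and the reward is their sum.
rng = random.Random(0)

def toy_generate(prefix, k):
    """Append k random digits to the prefix (stand-in for an LLM sampler)."""
    return prefix + tuple(rng.randint(0, 9) for _ in range(k))

toy_reward = sum
```

Both functions generate the same number of tokens in total here (n full-length samples versus n segments per layer), which mirrors the equal-compute comparison mentioned in the abstract; the tree version spends that budget only on continuations of promising prefixes.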
[AI-27] TimeMixer++: A General Time Series Pattern Machine for Universal Predictive Analysis
Link: https://arxiv.org/abs/2410.16032
Authors: Shiyu Wang,Jiawei Li,Xiaoming Shi,Zhou Ye,Baichuan Mo,Wenze Lin,Shengtong Ju,Zhixuan Chu,Ming Jin
Keywords: Time series, Time, series, multi-scale time series, Time series analysis
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Time series analysis plays a critical role in numerous applications, supporting tasks such as forecasting, classification, anomaly detection, and imputation. In this work, we present the time series pattern machine (TSPM), a model designed to excel in a broad range of time series tasks through powerful representation and pattern extraction capabilities. Traditional time series models often struggle to capture universal patterns, limiting their effectiveness across diverse tasks. To address this, we define multiple scales in the time domain and various resolutions in the frequency domain, employing various mixing strategies to extract intricate, task-adaptive time series patterns. Specifically, we introduce a general-purpose TSPM that processes multi-scale time series using (1) multi-resolution time imaging (MRTI), (2) time image decomposition (TID), (3) multi-scale mixing (MCM), and (4) multi-resolution mixing (MRM) to extract comprehensive temporal patterns. MRTI transforms multi-scale time series into multi-resolution time images, capturing patterns across both temporal and frequency domains. TID leverages dual-axis attention to extract seasonal and trend patterns, while MCM hierarchically aggregates these patterns across scales. MRM adaptively integrates all representations across resolutions. This method achieves state-of-the-art performance across 8 time series analytical tasks, consistently surpassing both general-purpose and task-specific models. Our work marks a promising step toward the next generation of TSPMs, paving the way for further advancements in time series analysis.
[AI-28] A New Approach to Solving SMAC Task: Generating Decision Tree Code from Large Language Models
Link: https://arxiv.org/abs/2410.16024
Authors: Yue Deng,Weiyu Ma,Yuxin Fan,Yin Zhang,Haifeng Zhang,Jian Zhao
Keywords: StarCraft Multi-Agent Challenge, multi-agent reinforcement learning, defeat enemy forces, Multi-Agent Challenge, StarCraft Multi-Agent
Categories: Artificial Intelligence (cs.AI)
Abstract:StarCraft Multi-Agent Challenge (SMAC) is one of the most commonly used experimental environments in multi-agent reinforcement learning (MARL), where the specific task is to control a set number of allied units to defeat enemy forces. Traditional MARL algorithms often require interacting with the environment for up to 1 million steps to train a model, and the resulting policies are typically non-interpretable with weak transferability. In this paper, we propose a novel approach to solving SMAC tasks called LLM-SMAC. In our framework, agents leverage large language models (LLMs) to generate decision tree code by providing task descriptions. The model further self-reflects using feedback from the rewards provided by the environment. We conduct experiments in the SMAC and demonstrate that our method can produce high-quality, interpretable decision trees with minimal environmental exploration. Moreover, these models exhibit strong transferability, successfully applying to similar SMAC environments without modification. We believe this approach offers a new direction for solving decision-making tasks in the future.
[AI-29] Massimo: Public Queue Monitoring and Management using Mass-Spring Model
Link: https://arxiv.org/abs/2410.16012
Authors: Abhijeet Kumar,Unnati Singh,Rajdeep Chatterjee,Tathagata Bandyopadhyay
Keywords: customer satisfaction, control and regulation, important in order, order to avoid, avoid the traffic
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP)
Comments: 8 pages, 6 figures, 3 algorithms, 3 tables
Abstract:An efficient system for queue control and regulation in public spaces is very important in order to avoid traffic jams and to improve customer satisfaction. This article offers a detailed road map, based on a merger of intelligent systems, for creating efficient queue systems in public places. By using technologies such as computer vision, machine learning algorithms, and deep learning, our system provides accurate information about whether a place is crowded and what measures need to be taken.
[AI-30] CA*: Addressing Evaluation Pitfalls in Computation-Aware Latency for Simultaneous Speech Translation
Link: https://arxiv.org/abs/2410.16011
Authors: Xi Xu,Wenda Xu,Siqi Ouyang,Lei Li
Keywords: Simultaneous speech translation, balance translation quality, Simultaneous speech, making latency measurement, response time
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract:Simultaneous speech translation (SimulST) systems must balance translation quality with response time, making latency measurement crucial for evaluating their real-world performance. However, there has been a longstanding belief that current metrics yield unrealistically high latency measurements in unsegmented streaming settings. In this paper, we investigate this phenomenon, revealing its root cause in a fundamental misconception underlying existing latency evaluation approaches. We demonstrate that this issue affects not only streaming but also segment-level latency evaluation across different metrics. Furthermore, we propose a modification to correctly measure computation-aware latency for SimulST systems, addressing the limitations present in existing metrics.
[AI-31] Are Language Model Logits Calibrated?
Link: https://arxiv.org/abs/2410.16007
Authors: Charles Lovering,Michael Krumdick,Viet Dac Lai,Nilesh Kumar,Varshini Reddy,Rik Koncel-Kedziorski,Chris Tanner
Keywords: information is factual, information is probabilistic, output probabilities, good Language Models, information
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages (main), 24 pages (appendix), under review
Abstract:Some information is factual (e.g., “Paris is in France”), whereas other information is probabilistic (e.g., “the coin flip will be a [Heads/Tails].”). We believe that good Language Models (LMs) should understand and reflect this nuance. Our work investigates this by testing if LMs’ output probabilities are calibrated to their textual contexts. We define model “calibration” as the degree to which the output probabilities of candidate tokens are aligned with the relative likelihood that should be inferred from the given context. For example, if the context concerns two equally likely options (e.g., heads or tails for a fair coin), the output probabilities should reflect this. Likewise, context that concerns non-uniformly likely events (e.g., rolling a six with a die) should also be appropriately captured with proportionate output probabilities. We find that even in simple settings the best LMs (1) are poorly calibrated, and (2) have systematic biases (e.g., preferred colors and sensitivities to word orderings). For example, gpt-4o-mini often picks the first of two options presented in the prompt regardless of the options’ implied likelihood, whereas Llama-3.1-8B picks the second. Our other consistent finding is mode-collapse: Instruction-tuned models often over-allocate probability mass on a single option. These systematic biases introduce non-intuitive model behavior, making models harder for users to understand.
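The notion of calibration defined in this abstract (output probabilities should match the likelihoods implied by the context) can be illustrated with a small distance computation. This is a hedged sketch only: total variation distance is an assumed choice of metric, not necessarily the one used in the paper, and the probability values below are made up for illustration rather than measured from any model.

```python
def normalize(probs):
    """Rescale candidate-token probabilities so they sum to 1."""
    total = sum(probs.values())
    return {token: p / total for token, p in probs.items()}


def total_variation(model_probs, expected_probs):
    """Distance between model and context-implied distributions.
    0.0 means the two match exactly; 1.0 means they are disjoint."""
    p = normalize(model_probs)
    q = normalize(expected_probs)
    tokens = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in tokens)


# A fair coin implies a 50/50 split over the two candidate tokens.
expected = {"Heads": 0.5, "Tails": 0.5}
# Made-up example of mode collapse: most mass lands on one option.
collapsed = {"Heads": 0.95, "Tails": 0.05}

if __name__ == "__main__":
    print(total_variation(collapsed, expected))
```

A die-rolling context would be handled the same way by passing the implied distribution (for example 1/6 for "six" and 5/6 for "not six") as `expected_probs`.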
[AI-32] 1024m at SMM4H 2024: Tasks 3, 5, 6 – Ensembles of Transformers and Large Language Models for Medical Text Classification ACL2024
Link: https://arxiv.org/abs/2410.15998
Authors: Ram Mohan Rao Kadiyala,M.V.P. Chandra Sekhara Rao
Keywords: users reporting information, Social media, Large Language Models, Binary classification, great source
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: short paper, ACL 2024
Abstract:Social media is a great source of data from users reporting information regarding their health and how various things have affected them. This paper presents various approaches using Transformers and Large Language Models and their ensembles, along with their performance, advantages, and drawbacks, for various tasks of SMM4H’24: classifying texts on the impact of nature and outdoor spaces on the author’s mental health (Task 3), binary classification of tweets reporting their children’s health disorders such as asthma, autism, ADHD, and speech disorder (Task 5), and binary classification of users self-reporting their age (Task 6).
[AI-33] Augmenting Legal Decision Support Systems with LLM-based NLI for Analyzing Social Media Evidence EMNLP2024
Link: https://arxiv.org/abs/2410.15990
Authors: Ram Mohan Rao Kadiyala,Siddartha Pullakhandam,Kanwal Mehreen,Subhasya Tippareddy,Ashay Srivastava
Keywords: entry for NLLP, Natural Language Inference, Legal Natural Language, shared task, Natural Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, accepted to EMNLP 2024
Abstract:This paper presents our system description and error analysis of our entry for the NLLP 2024 shared task on Legal Natural Language Inference (L-NLI) \citep{hagag2024legallenssharedtask2024}. The task required classifying these relationships as entailed, contradicted, or neutral, indicating any association between the review and the complaint. Our system emerged as the winning submission, significantly outperforming other entries with a substantial margin and demonstrating the effectiveness of our approach in legal text analysis. We provide a detailed analysis of the strengths and limitations of each model and approach tested, along with a thorough error analysis and suggestions for future improvements. This paper aims to contribute to the growing field of legal NLP by offering insights into advanced techniques for natural language inference in legal contexts, making it accessible to both experts and newcomers in the field.
[AI-34] Analyzing Closed-loop Training Techniques for Realistic Traffic Agent Models in Autonomous Highway Driving Simulations
Link: https://arxiv.org/abs/2410.15987
Authors: Matthias Bitzer,Reinis Cimurs,Benjamin Coors,Johannes Goth,Sebastian Ziesche,Philipp Geiger,Maximilian Naumann
Keywords: autonomous vehicles, plays a crucial, crucial role, rapid development, development and safe
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 15 pages, 6 figures, 4 tables
Abstract:Simulation plays a crucial role in the rapid development and safe deployment of autonomous vehicles. Realistic traffic agent models are indispensable for bridging the gap between simulation and the real world. Many existing approaches for imitating human behavior are based on learning from demonstration. However, these approaches are often constrained by focusing on individual training strategies. Therefore, to foster a broader understanding of realistic traffic agent modeling, in this paper, we provide an extensive comparative analysis of different training principles, with a focus on closed-loop methods for highway driving simulation. We experimentally compare (i) open-loop vs. closed-loop multi-agent training, (ii) adversarial vs. deterministic supervised training, (iii) the impact of reinforcement losses, and (iv) the impact of training alongside log-replayed agents to identify suitable training techniques for realistic agent modeling. Furthermore, we identify promising combinations of different closed-loop training methods.
[AI-35] PROMPTHEUS: A Human-Centered Pipeline to Streamline SLRs with LLMs
Link: https://arxiv.org/abs/2410.15978
Authors: João Pedro Fernandes Torres,Catherine Muligan,Joaquim Jorge,Catarina Moreira
Keywords: publications poses significant, poses significant challenges, academic publications poses, researchers conducting timely, accurate Systematic Literature
Categories: Artificial Intelligence (cs.AI)
Abstract:The growing volume of academic publications poses significant challenges for researchers conducting timely and accurate Systematic Literature Reviews, particularly in fast-evolving fields like artificial intelligence. This growth of academic literature also makes it increasingly difficult for lay people to access scientific knowledge effectively, meaning academic literature is often misrepresented in the popular press and, more broadly, in society. Traditional SLR methods are labor-intensive and error-prone, and they struggle to keep up with the rapid pace of new research. To address these issues, we developed PROMPTHEUS: an AI-driven pipeline solution that automates the SLR process using Large Language Models. We aimed to enhance efficiency by reducing the manual workload while maintaining the precision and coherence required for comprehensive literature synthesis. PROMPTHEUS automates key stages of the SLR process, including systematic search, data extraction, topic modeling using BERTopic, and summarization with transformer models. Evaluations conducted across five research domains demonstrate that PROMPTHEUS reduces review time, achieves high precision, and provides coherent topic organization, offering a scalable and effective solution for conducting literature reviews in an increasingly crowded research landscape. In addition, such tools may reduce the increasing mistrust in science by making summarization more accessible to laypeople. The code for this project can be found on the GitHub repository at this https URL
[AI-36] Enabling Energy-Efficient Deployment of Large Language Models on Memristor Crossbar: A Synergy of Large and Small
Link: https://arxiv.org/abs/2410.15977
Authors: Zhehui Wang,Tao Luo,Cheng Liu,Weichen Liu,Rick Siow Mong Goh,Weng-Fai Wong
Keywords: garnered substantial attention, substantial attention due, Large language models, diverse domains, garnered substantial
Categories: Artificial Intelligence (cs.AI)
Abstract:Large language models (LLMs) have garnered substantial attention due to their promising applications in diverse domains. Nevertheless, the increasing size of LLMs comes with a significant surge in the computational requirements for training and deployment. Memristor crossbars have emerged as a promising solution, which demonstrated a small footprint and remarkably high energy efficiency in computer vision (CV) models. Memristors possess higher density compared to conventional memory technologies, making them highly suitable for effectively managing the extreme model size associated with LLMs. However, deploying LLMs on memristor crossbars faces three major challenges. Firstly, the size of LLMs increases rapidly, already surpassing the capabilities of state-of-the-art memristor chips. Secondly, LLMs often incorporate multi-head attention blocks, which involve non-weight stationary multiplications that traditional memristor crossbars cannot support. Third, while memristor crossbars excel at performing linear operations, they are not capable of executing complex nonlinear operations in LLM such as softmax and layer normalization. To address these challenges, we present a novel architecture for the memristor crossbar that enables the deployment of state-of-the-art LLM on a single chip or package, eliminating the energy and time inefficiencies associated with off-chip communication. Our testing on BERT_Large showed negligible accuracy loss. Compared to traditional memristor crossbars, our architecture achieves enhancements of up to 39X in area overhead and 18X in energy consumption. Compared to modern TPU/GPU systems, our architecture demonstrates at least a 68X reduction in the area-delay product and a significant 69% energy consumption reduction.
[AI-37] Large Language Models for Cross-lingual Emotion Detection ACL2024
Link: https://arxiv.org/abs/2410.15974
Authors: Ram Mohan Rao Kadiyala
Keywords: detailed system description, cross-lingual emotion detection, focused on cross-lingual, presents a detailed, detailed system
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 pages, accepted to ACL 2024
Abstract:This paper presents a detailed system description of our entry for the WASSA 2024 Task 2, focused on cross-lingual emotion detection. We utilized a combination of large language models (LLMs) and their ensembles to effectively understand and categorize emotions across different languages. Our approach not only outperformed other submissions by a large margin, but also demonstrated the strength of integrating multiple models to enhance performance. Additionally, we conducted a thorough comparison of the benefits and limitations of each model used. An error analysis is included along with suggested areas for future improvement. This paper aims to offer a clear and comprehensive understanding of advanced techniques in emotion detection, making it accessible even to those new to the field.
[AI-38] Karush-Kuhn-Tucker Condition-Trained Neural Networks (KKT Nets)
Link: https://arxiv.org/abs/2410.15973
Authors: Shreya Arvind,Rishabh Pomaje,Rajshekhar V Bhat
Keywords: KKT Loss, solving convex optimization, dual variables satisfying, convex optimization problems, KKT
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC)
Abstract:This paper presents a novel approach to solving convex optimization problems by leveraging the fact that, under certain regularity conditions, the Karush-Kuhn-Tucker (KKT) conditions are necessary and sufficient for a set of primal and dual variables to be optimal. Similar to Theory-Trained Neural Networks (TTNNs), the parameters of the convex optimization problem are input to the neural network, and the expected outputs are the optimal primal and dual variables. One choice of loss function in this case, which we refer to as the KKT Loss, measures how well the network’s outputs satisfy the KKT conditions. We demonstrate the effectiveness of this approach using a linear program as an example. For this problem, we observe that minimizing the KKT Loss alone outperforms training the network with a weighted sum of the KKT Loss and a Data Loss (the mean-squared error between the ground truth optimal solutions and the network’s output). Moreover, minimizing only the Data Loss yields inferior results compared to those obtained by minimizing the KKT Loss. While the approach is promising, the obtained primal and dual solutions are not sufficiently close to the ground truth optimal solutions. In the future, we aim to develop improved models to obtain solutions closer to the ground truth and extend the approach to other problem classes.
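To make the idea of a loss that measures KKT violation concrete, here is a minimal sketch for a linear program. It rests on assumptions not stated in the abstract: the LP is taken in the form "minimize c^T x subject to A x <= b", the four KKT residuals are combined as an unweighted sum of squares, and the function name `kkt_loss` is hypothetical. The paper's actual formulation and weighting may differ.

```python
def kkt_loss(c, A, b, x, lam):
    """Sum of squared KKT violations for: minimize c^T x  s.t.  A x <= b.
    x holds the primal variables, lam the dual variables (one per constraint)."""
    m, n = len(A), len(c)
    # Constraint values g_i = a_i^T x - b_i (feasible when g_i <= 0).
    g = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
    # Stationarity: c + A^T lam = 0.
    stationarity = [c[j] + sum(A[i][j] * lam[i] for i in range(m))
                    for j in range(n)]
    loss = sum(s ** 2 for s in stationarity)
    # Primal feasibility: penalize only positive g_i.
    loss += sum(max(0.0, g_i) ** 2 for g_i in g)
    # Dual feasibility: penalize only negative multipliers.
    loss += sum(max(0.0, -lam_i) ** 2 for lam_i in lam)
    # Complementary slackness: lam_i * g_i = 0.
    loss += sum((lam_i * g_i) ** 2 for lam_i, g_i in zip(lam, g))
    return loss


# Toy LP (an illustrative assumption): minimize -x1 - x2 with x1 <= 1, x2 <= 1.
c = [-1.0, -1.0]
A = [[1.0, 0.0], [0.0, 1.0]]
b = [1.0, 1.0]

if __name__ == "__main__":
    print(kkt_loss(c, A, b, x=[1.0, 1.0], lam=[1.0, 1.0]))  # optimal pair
    print(kkt_loss(c, A, b, x=[0.0, 0.0], lam=[0.0, 0.0]))  # non-optimal pair
```

In a training setup like the one described, `x` and `lam` would be network outputs and this quantity would be minimized without needing ground-truth solutions, whereas a Data Loss would compare them to solver outputs.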
[AI-39] Self-Explained Keywords Empower Large Language Models for Code Generation
Link: https://arxiv.org/abs/2410.15966
Authors: Lishui Fan,Mouxiang Chen,Zhongxin Liu
Keywords: Large language models, achieved impressive performance, Large language, code generation, language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Abstract:Large language models (LLMs) have achieved impressive performance in code generation. However, due to the long-tail distribution of LLMs’ training data, low-frequency terms are typically underrepresented in the training process. Consequently, LLMs often misunderstand or overlook problem-specific, low-frequency keywords during code generation, compromising the accuracy of the generated code. To address this, we propose a novel technique named SEK (Self-Explained Keywords), which empowers an LLM for better code generation by extracting and explaining the key terms in the problem description with the LLM itself and ranking them based on frequency. Comprehensive experiments across three benchmarks, i.e., HumanEval(+), MBPP(+), and APPS, with five representative LLMs, show that SEK can significantly improve LLMs in code generation, yielding substantial and consistent gains. For instance, SEK improves the Pass@1 of DeepSeek-Coder-V2-Instruct from 85.4% to 93.3% on the HumanEval benchmark. Further analysis confirms that SEK enables the LLMs to shift their attention from low-frequency keywords to their corresponding high-frequency counterparts.
[AI-40] Systematic Exploration of Dialogue Summarization Approaches for Reproducibility, Comparative Assessment, and Methodological Innovations for Advancing Natural Language Processing in Abstractive Summarization
Link: https://arxiv.org/abs/2410.15962
Authors: Yugandhar Reddy Gogireddy,Jithendra Reddy Gogireddy
Keywords: natural language processing, Reproducibility in scientific, dialogue summarization models, language processing, experimental findings
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract:Reproducibility in scientific research, particularly within the realm of natural language processing (NLP), is essential for validating and verifying the robustness of experimental findings. This paper delves into the reproduction and evaluation of dialogue summarization models, focusing specifically on the discrepancies observed between original studies and our reproduction efforts. Dialogue summarization is a critical aspect of NLP, aiming to condense conversational content into concise and informative summaries, thus aiding in efficient information retrieval and decision-making processes. Our research involved a thorough examination of several dialogue summarization models using the AMI (Augmented Multi-party Interaction) dataset. The models assessed include Hierarchical Memory Networks (HMNet) and various versions of Pointer-Generator Networks (PGN), namely PGN(DKE), PGN(DRD), PGN(DTS), and PGN(DALL). The primary objective was to evaluate the informativeness and quality of the summaries generated by these models through human assessment, a method that introduces subjectivity and variability in the evaluation process. The analysis began with Dataset 1, where the sample standard deviation of 0.656 indicated a moderate dispersion of data points around the mean.
[AI-41] AI-Driven Innovations in Modern Cloud Computing
Link: https://arxiv.org/abs/2410.15960
Authors: Animesh Kumar
Keywords: rapid technological transformation, witnessed rapid technological, landscape evolved exponentially, scalable application development, evolved exponentially leading
Categories: Artificial Intelligence (cs.AI)
Comments: 5 pages, 3 figures
Abstract:The world has witnessed rapid technological transformation over the past couple of decades, and with the advent of cloud computing the landscape evolved exponentially, leading to efficient and scalable application development. In the past couple of years, the digital ecosystem has brought in numerous innovations with the integration of Artificial Intelligence, commonly known as AI. This paper explores how AI and cloud computing intersect to deliver transformative capabilities for modernizing applications by providing services and infrastructure. Harnessing the combined potential of AI and cloud technologies, technology providers can now exploit intelligent resource management, predictive analytics, and automated deployment and scaling with enhanced security, allowing them to offer innovative solutions to their customers. Furthermore, by leveraging cloud and AI technologies, businesses can reap rich rewards in the form of reduced operational costs and improved service delivery. This paper further addresses associated challenges, such as data privacy concerns, and how they can be mitigated with robust AI governance frameworks.
[AI-42] Do Large Language Models Have an English Accent? Evaluating and Improving the Naturalness of Multilingual LLMs
Link: https://arxiv.org/abs/2410.15956
Authors: Yanzhu Guo,Simone Conia,Zelin Zhou,Min Li,Saloni Potdar,Henry Xiao
Keywords: Current Large Language, Large Language Models, Current Large, strong English-centric biases, exhibit strong English-centric
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract:Current Large Language Models (LLMs) are predominantly designed with English as the primary language, and even the few that are multilingual tend to exhibit strong English-centric biases. Much like speakers who might produce awkward expressions when learning a second language, LLMs often generate unnatural outputs in non-English languages, reflecting English-centric patterns in both vocabulary and grammar. Despite the importance of this issue, the naturalness of multilingual LLM outputs has received limited attention. In this paper, we address this gap by introducing novel automatic corpus-level metrics to assess the lexical and syntactic naturalness of LLM outputs in a multilingual context. Using our new metrics, we evaluate state-of-the-art LLMs on a curated benchmark in French and Chinese, revealing a tendency towards English-influenced patterns. To mitigate this issue, we also propose a simple and effective alignment method to improve the naturalness of an LLM in a target language and domain, achieving consistent improvements in naturalness without compromising the performance on general-purpose benchmarks. Our work highlights the importance of developing multilingual metrics, resources and methods for the new wave of multilingual LLMs.
[AI-43] TS-ACL: A Time Series Analytic Continual Learning Framework for Privacy-Preserving and Class-Incremental Pattern Recognition
Link: https://arxiv.org/abs/2410.15954
Authors: Kejia Fan,Jiaxu Li,Songning Lai,Linpu Lv,Anfeng Liu,Jianheng Tang,Houbing Herbert Song,Huiping Zhuang
Keywords: Time Series Classification, incrementally train models, streaming time series, Series Classification, Class-incremental Learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages, 3 figures, 2 tables
Abstract:Class-incremental Learning (CIL) in Time Series Classification (TSC) aims to incrementally train models using the streaming time series data that arrives continuously. The main problem in this scenario is catastrophic forgetting, i.e., training models with new samples inevitably leads to the forgetting of previously learned knowledge. Among existing methods, the replay-based methods achieve satisfactory performance but compromise privacy, while exemplar-free methods protect privacy but suffer from low accuracy. However, more critically, owing to their reliance on gradient-based update techniques, these existing methods fundamentally cannot solve the catastrophic forgetting problem. In TSC scenarios with continuously arriving data and temporally shifting distributions, these methods become even less practical. In this paper, we propose a Time Series Analytic Continual Learning framework, called TS-ACL. Inspired by analytical learning, TS-ACL transforms neural network updates into gradient-free linear regression problems, thereby fundamentally mitigating catastrophic forgetting. Specifically, employing a pre-trained and frozen feature extraction encoder, TS-ACL only needs to update its analytic classifier recursively in a lightweight manner that is highly suitable for real-time applications and large-scale data processing. Additionally, we theoretically demonstrate that the model obtained recursively through the TS-ACL is exactly equivalent to a model trained on the complete dataset in a centralized manner, thereby establishing the property of absolute knowledge memory. Extensive experiments validate the superior performance of our TS-ACL.
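The equivalence claimed in this abstract, that a recursively updated linear model matches one fitted on all data at once, is a known property of recursive least squares and can be checked numerically. The sketch below is a simplification under stated assumptions: it uses ridge regression with a scalar target and a 2-dimensional feature vector (standing in for frozen encoder features), processes one sample at a time, and the function names are hypothetical. It is not the TS-ACL implementation.

```python
def recursive_ridge(samples, gamma=1.0):
    """Gradient-free recursive update of ridge-regression weights.
    samples: list of (feature_vector, target) pairs seen one at a time."""
    d = len(samples[0][0])
    # R tracks the inverse of (X^T X + gamma * I), starting from I / gamma.
    R = [[(1.0 / gamma if i == j else 0.0) for j in range(d)] for i in range(d)]
    w = [0.0] * d
    for x, y in samples:
        Rx = [sum(R[i][j] * x[j] for j in range(d)) for i in range(d)]
        denom = 1.0 + sum(x[i] * Rx[i] for i in range(d))
        gain = [v / denom for v in Rx]
        err = y - sum(w[i] * x[i] for i in range(d))
        w = [w[i] + gain[i] * err for i in range(d)]
        # Sherman-Morrison rank-one update of the inverse (R is symmetric).
        R = [[R[i][j] - gain[i] * Rx[j] for j in range(d)] for i in range(d)]
    return w


def batch_ridge_2d(samples, gamma=1.0):
    """Closed-form ridge solution on the full dataset (2 features only)."""
    a = gamma + sum(x[0] * x[0] for x, _ in samples)
    b = sum(x[0] * x[1] for x, _ in samples)
    c = gamma + sum(x[1] * x[1] for x, _ in samples)
    u = sum(x[0] * y for x, y in samples)
    v = sum(x[1] * y for x, y in samples)
    det = a * c - b * b
    return [(c * u - b * v) / det, (a * v - b * u) / det]


# Made-up toy data for illustration.
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0),
        ([1.0, 1.0], 2.5), ([2.0, -1.0], 0.5)]

if __name__ == "__main__":
    print(recursive_ridge(data))
    print(batch_ridge_2d(data))
```

Because the recursive version never revisits earlier samples, it needs no replay buffer, which is the property the abstract connects to privacy preservation.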
[AI-44] User-centric evaluation of explainability of AI with and for humans: a comprehensive empirical study
Link: https://arxiv.org/abs/2410.15952
Authors: Szymon Bobek,Paloma Korycińska,Monika Krakowska,Maciej Mozolewski,Dorota Rak,Magdalena Zych,Magdalena Wójcik,Grzegorz J. Nalepa
Keywords: Human-Centered Artificial Intelligence, eXplainable Artificial Intelligence, Artificial Intelligence, Gradient Boosting Classifier, Human-Centered Artificial
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:This study is situated in the field of Human-Centered Artificial Intelligence (HCAI) and focuses on the results of a user-centered assessment of commonly used eXplainable Artificial Intelligence (XAI) algorithms, specifically investigating how humans understand and interact with the explanations provided by these algorithms. To achieve this, we employed a multi-disciplinary approach that included state-of-the-art research methods from social sciences to measure the comprehensibility of explanations generated by a state-of-the-art machine learning model, specifically the Gradient Boosting Classifier (XGBClassifier). We conducted an extensive empirical user study involving interviews with 39 participants from three different groups, each with varying expertise in data science, data visualization, and domain-specific knowledge related to the dataset used for training the machine learning model. Participants were asked a series of questions to assess their understanding of the model’s explanations. To ensure replicability, we built the model using a publicly available dataset from the UC Irvine Machine Learning Repository, focusing on edible and non-edible mushrooms. Our findings reveal limitations in existing XAI methods and confirm the need for new design principles and evaluation techniques that address the specific information needs and user perspectives of different classes of AI stakeholders. We believe that the results of our research and the cross-disciplinary methodology we developed can be successfully adapted to various data types and user profiles, thus promoting dialogue and addressing opportunities in HCAI research. To support this, we are making the data resulting from our study publicly available.
[AI-45] Redefining Finance: The Influence of Artificial Intelligence (AI) and Machine Learning (ML)
Link: https://arxiv.org/abs/2410.15951
Authors: Animesh Kumar
Keywords: Artificial Intelligence, Machine Learning, transformation of technologies, rapid transformation, finance is disrupting
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 1 figure
Abstract:With the rapid transformation of technologies, the fusion of Artificial Intelligence (AI) and Machine Learning (ML) in finance is disrupting the entire ecosystem and the operations that were followed for decades. In the current landscape, financial institutions make increasingly data-driven decisions, with an appetite for automation while mitigating risks. The segments of financial institutions that are being heavily influenced are retail banking, wealth management, corporate banking, and the payment ecosystem. The solutions range from customer onboarding through fraud detection and prevention to enhanced customer service. Financial institutions are leapfrogging by integrating AI and ML into mainstream applications, enhancing operational efficiency through advanced predictive analytics, extending personalized customer experiences, and using automation to minimize risk with fraud detection techniques. However, with the adoption of AI and ML, it is imperative that financial institutions also address ethical and regulatory challenges by putting in place robust governance frameworks and responsible AI practices.
[AI-46] Developing Retrieval Augmented Generation (RAG) based LLM Systems from PDFs: An Experience Report
Link: https://arxiv.org/abs/2410.15944
Authors: Ayman Asad Khan,Md Toufique Hasan,Kai Kristian Kemell,Jussi Rasku,Pekka Abrahamsson
Keywords: Retrieval Augmented Generation, primary data source, PDF documents, Large Language Models, Augmented Generation
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 36 pages, 8 figures, 2 tables, and python code snippets
Abstract:This paper presents an experience report on the development of Retrieval Augmented Generation (RAG) systems using PDF documents as the primary data source. The RAG architecture combines generative capabilities of Large Language Models (LLMs) with the precision of information retrieval. This approach has the potential to redefine how we interact with and augment both structured and unstructured knowledge in generative models to enhance transparency, accuracy, and contextuality of responses. The paper details the end-to-end pipeline, from data collection, preprocessing, to retrieval indexing and response generation, highlighting technical challenges and practical solutions. We aim to offer insights to researchers and practitioners developing similar systems using two distinct approaches: OpenAI’s Assistant API with GPT Series and Llama’s open-source models. The practical implications of this research lie in enhancing the reliability of generative AI systems in various sectors where domain-specific knowledge and real-time information retrieval is important. The Python code used in this work is also available at: this https URL.
[AI-47] Centrality-aware Product Retrieval and Ranking EMNLP2024
Link: https://arxiv.org/abs/2410.15930
Authors: Hadeel Saadany,Swapnil Bhosale,Samarth Agrawal,Diptesh Kanojia,Constantin Orasan,Zhe Wu
Keywords: user intent, improving user experience, paper addresses, addresses the challenge, challenge of improving
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Notes: EMNLP 2024: Industry track
Abstract:This paper addresses the challenge of improving user experience on e-commerce platforms by enhancing product ranking relevant to users’ search queries. Ambiguity and complexity of user queries often lead to a mismatch between the user’s intent and retrieved product titles or documents. Recent approaches have proposed the use of Transformer-based models, which need millions of annotated query-title pairs during the pre-training stage, and this data often does not take user intent into account. To tackle this, we curate samples from existing datasets at eBay, manually annotated with buyer-centric relevance scores and centrality scores, which reflect how well the product title matches the users’ intent. We introduce a User-intent Centrality Optimization (UCO) approach for existing models, which optimises for the user intent in semantic product search. To that end, we propose a dual-loss based optimisation to handle hard negatives, i.e., product titles that are semantically relevant but do not reflect the user’s intent. Our contributions include curating challenging evaluation sets and implementing UCO, resulting in significant product ranking efficiency improvements observed for different evaluation metrics. Our work aims to ensure that the most buyer-centric titles for a query are ranked higher, thereby, enhancing the user experience on e-commerce platforms.
[AI-48] GReFEL: Geometry-Aware Reliable Facial Expression Learning under Bias and Imbalanced Data Distribution ACCV2024
Link: https://arxiv.org/abs/2410.15927
Authors: Azmine Toushik Wasi,Taki Hasan Rafi,Raima Islam,Karlo Serbetar,Dong Kyu Chae
Keywords: Reliable facial expression, facial expression characteristics, distinctive facial expression, facial expression learning, facial expression
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: ACCV 2024. Extended version of ARBEx (arXiv:2305.01486)
Abstract:Reliable facial expression learning (FEL) involves the effective learning of distinctive facial expression characteristics for more reliable, unbiased and accurate predictions in real-life settings. However, current systems struggle with FEL tasks because of the variance in people’s facial expressions due to their unique facial structures, movements, tones, and demographics. Biased and imbalanced datasets compound this challenge, leading to wrong and biased prediction labels. To tackle these, we introduce GReFEL, leveraging Vision Transformers and a facial geometry-aware anchor-based reliability balancing module to combat imbalanced data distributions, bias, and uncertainty in facial expression learning. Integrating local and global data with anchors that learn different facial data points and structural features, our approach adjusts biased and mislabeled emotions caused by intra-class disparity, inter-class similarity, and scale sensitivity, resulting in comprehensive, accurate, and reliable facial expression predictions. Our model outperforms current state-of-the-art methodologies, as demonstrated by extensive experiments on various datasets.
[AI-49] Bench4Merge: A Comprehensive Benchmark for Merging in Realistic Dense Traffic with Micro-Interactive Vehicles
Link: https://arxiv.org/abs/2410.15912
Authors: Zhengming Wang,Junli Wang,Pengfei Li,Zhaohan Li,Peng Li,Yilun Chen
Keywords: motion planning capabilities, motion planning, dense traffic remains, significant challenge, remains a significant
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Notes: 6 pages, 7 figures, IEEE international conference on robotics and automation
Abstract:While the capabilities of autonomous driving have advanced rapidly, merging into dense traffic remains a significant challenge, many motion planning methods for this scenario have been proposed but it is hard to evaluate them. Most existing closed-loop simulators rely on rule-based controls for other vehicles, which results in a lack of diversity and randomness, thus failing to accurately assess the motion planning capabilities in highly interactive scenarios. Moreover, traditional evaluation metrics are insufficient for comprehensively evaluating the performance of merging in dense traffic. In response, we proposed a closed-loop evaluation benchmark for assessing motion planning capabilities in merging scenarios. Our approach involves other vehicles trained in large scale datasets with micro-behavioral characteristics that significantly enhance the complexity and diversity. Additionally, we have restructured the evaluation mechanism by leveraging large language models to assess each autonomous vehicle merging onto the main road. Extensive experiments have demonstrated the advanced nature of this evaluation benchmark. Through this benchmark, we have obtained an evaluation of existing methods and identified common issues. The environment and vehicle motion planning models we have designed can be accessed at this https URL
[AI-50] Diverse Policies Recovering via Pointwise Mutual Information Weighted Imitation Learning
Link: https://arxiv.org/abs/2410.15910
Authors: Hanlin Yang,Jian Yao,Weiming Liu,Qing Wang,Hanmin Qin,Hansheng Kong,Kirk Tang,Jiechao Xiong,Chao Yu,Kai Li,Junliang Xing,Hongwu Chen,Juchao Zhuo,Qiang Fu,Yang Wei,Haobo Fu
Keywords: important research topic, diverse policies recovering, recovering diverse policies, diverse policies, policies recovering methods
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Notes: 18 pages, 6 figures
Abstract:Recovering a spectrum of diverse policies from a set of expert trajectories is an important research topic in imitation learning. After determining a latent style for a trajectory, previous diverse policies recovering methods usually employ a vanilla behavioral cloning learning objective conditioned on the latent style, treating each state-action pair in the trajectory with equal importance. Based on an observation that in many scenarios, behavioral styles are often highly relevant with only a subset of state-action pairs, this paper presents a new principled method in diverse policies recovery. In particular, after inferring or assigning a latent style for a trajectory, we enhance the vanilla behavioral cloning by incorporating a weighting mechanism based on pointwise mutual information. This additional weighting reflects the significance of each state-action pair’s contribution to learning the style, thus allowing our method to focus on state-action pairs most representative of that style. We provide theoretical justifications for our new objective, and extensive empirical evaluations confirm the effectiveness of our method in recovering diverse policies from expert data.
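To make the weighting idea in this abstract concrete, here is a minimal sketch of pointwise-mutual-information weighting over discrete data. It is not the authors' objective: the count-based PMI estimate, the clipping of negative PMI to zero, and the names `pmi_table` and `weighted_bc_loss` are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Tuple

# A sample pairs a (discretized) state-action pair with a style label.
Sample = Tuple[Hashable, Hashable]


def pmi_table(samples: List[Sample]) -> Dict[Sample, float]:
    """Empirical PMI(x; z) = log( p(x, z) / (p(x) * p(z)) ) for each observed pair."""
    n = len(samples)
    joint = Counter(samples)
    count_x = Counter(x for x, _ in samples)
    count_z = Counter(z for _, z in samples)
    return {
        (x, z): math.log((c / n) / ((count_x[x] / n) * (count_z[z] / n)))
        for (x, z), c in joint.items()
    }


def weighted_bc_loss(
    samples: List[Sample],
    neg_log_likelihoods: List[float],
    pmi: Dict[Sample, float],
) -> float:
    """Weighted mean of per-sample behavioral cloning losses.

    Assumption: weights are max(PMI, 0), so pairs that are not
    indicative of their style contribute nothing.
    """
    weights = [max(pmi[s], 0.0) for s in samples]
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return sum(w * l for w, l in zip(weights, neg_log_likelihoods)) / total
```

In this sketch a state-action pair that appears equally often under every style gets PMI 0 and drops out of the loss, while a pair seen only under one style is weighted up.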
[AI-51] IGMaxHS – An Incremental MaxSAT Solver with Support for XOR Clauses
Link: https://arxiv.org/abs/2410.15897
Authors: Ole Lübke
Keywords: XOR constraints, MaxSAT solving capabilities, XOR, MaxSAT-based method, method for error
Categories: Artificial Intelligence (cs.AI)
Notes: Presented at the 15th International Workshop on Pragmatics of SAT (PoS 2024, see this https URL)
Abstract:Recently, a novel, MaxSAT-based method for error correction in quantum computing has been proposed that requires both incremental MaxSAT solving capabilities and support for XOR constraints, but no dedicated MaxSAT solver fulfilling these criteria existed yet. We alleviate that and introduce IGMaxHS, which is based on the existing solvers iMaxHS and GaussMaxHS, but poses fewer restrictions on the XOR constraints than GaussMaxHS. IGMaxHS is fuzz tested with xwcnfuzz, an extension of wcnfuzz that can directly output XOR constraints. As a result, IGMaxHS is the only solver that reported neither incorrect unsatisfiability verdicts nor invalid models nor incoherent cost model combinations in a final fuzz testing comparison of all three solvers with 10000 instances. We detail the steps required for implementing Gaussian elimination on XOR constraints in CDCL SAT solvers, and extend the recently proposed re-entrant incremental MaxSAT solver application program interface to allow for incremental addition of XOR constraints. Finally, we show that IGMaxHS is capable of decoding quantum color codes through simulation with the Munich Quantum Toolkit.
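As background for the Gaussian elimination step mentioned in this abstract, the following is a small standalone sketch of solving a system of XOR constraints over GF(2). It does not reproduce the solver's integration with CDCL or any IGMaxHS interface; the function name `solve_xor_system` and the constraint encoding are hypothetical.

```python
from typing import List, Optional, Tuple

# (variables, parity): the XOR of the listed 0-based variables must equal parity.
XorConstraint = Tuple[List[int], int]


def solve_xor_system(constraints: List[XorConstraint], num_vars: int) -> Optional[List[int]]:
    """Return one satisfying 0/1 assignment, or None if the system is inconsistent."""
    # Pack each row into an int: bit i is the coefficient of variable i,
    # bit num_vars holds the right-hand side.
    rows = []
    for variables, parity in constraints:
        row = (parity & 1) << num_vars
        for v in variables:
            row ^= 1 << v  # a variable listed twice cancels out
        rows.append(row)

    pivot_cols = []
    rank = 0
    for col in range(num_vars):
        # Look for a not-yet-used row with a 1 in this column.
        pivot = next((r for r in range(rank, len(rows)) if (rows[r] >> col) & 1), None)
        if pivot is None:
            continue  # free variable
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        # Gauss-Jordan step: clear this column in every other row.
        for r in range(len(rows)):
            if r != rank and (rows[r] >> col) & 1:
                rows[r] ^= rows[rank]
        pivot_cols.append(col)
        rank += 1

    # A remaining row with no variables but parity 1 encodes 0 = 1.
    if any(row == 1 << num_vars for row in rows[rank:]):
        return None

    assignment = [0] * num_vars  # free variables default to 0
    for r, col in enumerate(pivot_cols):
        assignment[col] = (rows[r] >> num_vars) & 1
    return assignment
```

For example, `solve_xor_system([([0, 1], 1), ([1, 2], 1)], 3)` yields an assignment in which exactly one of x0, x1 and exactly one of x1, x2 is true.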
[AI-52] Model Mimic Attack: Knowledge Distillation for Provably Transferable Adversarial Examples
Link: https://arxiv.org/abs/2410.15889
Authors: Kirill Lukyanov,Andrew Perminov,Denis Turdakov,Mikhail Pautov
Keywords: artificial neural networks, vulnerability of artificial, setting is widely, widely studied, black-box adversarial attacks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:The vulnerability of artificial neural networks to adversarial perturbations in the black-box setting is widely studied in the literature. The majority of attack methods to construct these perturbations suffer from an impractically large number of queries required to find an adversarial example. In this work, we focus on knowledge distillation as an approach to conduct transfer-based black-box adversarial attacks and propose an iterative training of the surrogate model on an expanding dataset. This work is the first, to our knowledge, to provide provable guarantees on the success of knowledge distillation-based attack on classification neural networks: we prove that if the student model has enough learning capabilities, the attack on the teacher model is guaranteed to be found within the finite number of distillation iterations.
[AI-53] How to Build a Pre-trained Multimodal model for Simultaneously Chatting and Decision-making?
Link: https://arxiv.org/abs/2410.15885
Authors: Zuojin Tang,Bin Hu,Chenyang Zhao,De Ma,Gang Pan,Bin Liu
Keywords: models typically map, typically map text, map text input, typically map, action
Categories: Artificial Intelligence (cs.AI)
Abstract:Existing large pre-trained models typically map text input to text output in an end-to-end manner, such as ChatGPT, or map a segment of text input to a hierarchy of action decisions, such as OpenVLA. However, humans can simultaneously generate text and actions when receiving specific input signals. For example, a driver can make precise driving decisions while conversing with a friend in the passenger seat. Motivated by this observation, we consider the following question in this work: is it possible to construct a pre-trained model that can provide both language interaction and precise decision-making capabilities in dynamic open scenarios. We provide a definitive answer to this question by developing a new model architecture termed Visual Language Action model for Chatting and Decision Making (VLA4CD), and further demonstrating its performance in challenging autonomous driving tasks. Specifically, we leverage LoRA to fine-tune a pre-trained LLM with data of multiple modalities covering language, visual, and action. Unlike the existing LoRA operations used for LLM fine-tuning, we have designed new computational modules and training cost functions for VLA4CD. These designs enable VLA4CD to provide continuous-valued action decisions while outputting text responses. In contrast, existing LLMs can only output text responses, and current VLA models can only output action decisions. Moreover, these VLA models handle action data by discretizing and then tokenizing the discretized actions, a method unsuitable for complex decision-making tasks involving high-dimensional continuous-valued action vectors, such as autonomous driving. The experimental results on CARLA validate that: (1) our proposed model construction method is effective; (2) compared to the SOTA VLA model, VLA4CD can provide more accurate real-time decision-making while retaining the text interaction capability inherent to LLMs.
[AI-54] Using GPT Models for Qualitative and Quantitative News Analytics in the 2024 US Presidental Election Process
Link: https://arxiv.org/abs/2410.15884
Authors: Bohdan M. Pavlyshenko
Keywords: Google Search API, Google Search, Search API, retrieval-augmented generation, RAG
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract:The paper considers an approach of using Google Search API and GPT-4o model for qualitative and quantitative analyses of news through retrieval-augmented generation (RAG). This approach was applied to analyze news about the 2024 US presidential election process. Different news sources for different time periods have been analyzed. Quantitative scores generated by GPT model have been analyzed using Bayesian regression to derive trend lines. The distributions found for the regression parameters allow for the analysis of uncertainty in the election process. The obtained results demonstrate that using the GPT models for news analysis, one can get informative analytics and provide key insights that can be applied in further analyses of election processes.
[AI-55] MI-VisionShot: Few-shot adaptation of vision-language models for slide-level classification of histopathological images
Link: https://arxiv.org/abs/2410.15881
Authors: Pablo Meseguer,Rocío del Amor,Valery Naranjo
Keywords: made remarkable strides, Vision-language supervision, supervision has made, made remarkable, remarkable strides
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: Manuscript accepted for oral presentation at KES-InnovationInMedicine 2024 held on Madeira, Portugal
Abstract:Vision-language supervision has made remarkable strides in learning visual representations from textual guidance. In digital pathology, vision-language models (VLM), pre-trained on curated datasets of histological image-captions, have been adapted to downstream tasks, such as region of interest classification. Zero-shot transfer for slide-level prediction has been formulated by MI-Zero, but it exhibits high variability depending on the textual prompts. Inspired by prototypical learning, we propose MI-VisionShot, a training-free adaptation method on top of VLMs to predict slide-level labels in few-shot learning scenarios. Our framework takes advantage of the excellent representation learning of VLM to create prototype-based classifiers under a multiple-instance setting by retrieving the most discriminative patches within each slide. Experimentation through different settings shows the ability of MI-VisionShot to surpass zero-shot transfer with lower variability, even in low-shot scenarios. Code coming soon at https://github.com/cvblab/MIVisionShot.
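The prototype-based, training-free classification described here can be illustrated with a generic sketch. The abstract does not say how discriminative patches are retrieved, so the top-k cosine-similarity rule below, the `top_k` parameter, and all function names are assumptions rather than the MI-VisionShot procedure; embeddings are plain lists of floats standing in for VLM features.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_prototypes(support: Dict[str, List[Vector]]) -> Dict[str, List[float]]:
    """One prototype per class: the mean of its few-shot support embeddings."""
    return {label: mean_vector(vectors) for label, vectors in support.items()}


def classify_slide(patches: List[Vector], prototypes: Dict[str, List[float]], top_k: int = 2) -> str:
    """Score each class by the mean similarity of its top_k closest patches (assumed rule)."""
    if not patches:
        raise ValueError("a slide needs at least one patch embedding")
    scores = {}
    for label, prototype in prototypes.items():
        best = sorted((cosine(p, prototype) for p in patches), reverse=True)[:top_k]
        scores[label] = sum(best) / len(best)
    return max(scores, key=scores.get)
```

No parameters are trained in this sketch: adding a class only requires averaging a few labeled embeddings.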
[AI-56] FlickerFusion: Intra-trajectory Domain Generalizing Multi-Agent RL NEURIPS’24
Link: https://arxiv.org/abs/2410.15876
Authors: Woosung Koh,Wonbeen Oh,Siyeol Kim,Suhin Shin,Hyeongjin Kim,Jaein Jang,Junghyun Lee,Se-Young Yun
Keywords: Multi-agent reinforcement learning, addressing complex cooperative, complex cooperative tasks, Multi-agent reinforcement, demonstrated significant potential
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Notes: NeurIPS '24 Open-World Agents Workshop
Abstract:Multi-agent reinforcement learning has demonstrated significant potential in addressing complex cooperative tasks across various real-world applications. However, existing MARL approaches often rely on the restrictive assumption that the number of entities (e.g., agents, obstacles) remains constant between training and inference. This overlooks scenarios where entities are dynamically removed or added during the inference trajectory – a common occurrence in real-world environments like search and rescue missions and dynamic combat situations. In this paper, we tackle the challenge of intra-trajectory dynamic entity composition under zero-shot out-of-domain (OOD) generalization, where such dynamic changes cannot be anticipated beforehand. Our empirical studies reveal that existing MARL methods suffer significant performance degradation and increased uncertainty in these scenarios. In response, we propose FlickerFusion, a novel OOD generalization method that acts as a universally applicable augmentation technique for MARL backbone methods. Our results show that FlickerFusion not only achieves superior inference rewards but also uniquely reduces uncertainty vis-à-vis the backbone, compared to existing methods. For standardized evaluation, we introduce MPEv2, an enhanced version of Multi Particle Environments (MPE), consisting of 12 benchmarks. Benchmarks, implementations, and trained models are organized and open-sourced at this http URL, accompanied by ample demo video renderings.
[AI-57] Mesa-Extrapolation: A Weave Position Encoding Method for Enhanced Extrapolation in LLMs NEURIPS2024
Link: https://arxiv.org/abs/2410.15859
Authors: Xin Ma,Yang Liu,Jingjing Liu,Xiaoxu Ma
Keywords: Large language models, max training lengths, Large language, challenging extrapolation problem, Position Encoding
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: Accepted by NeurIPS 2024. arXiv admin note: text overlap with arXiv:2305.19466 by other authors
Abstract:Large language models (LLMs), although having revolutionized many fields, still suffer from the challenging extrapolation problem, where the inference ability of LLMs sharply declines beyond their max training lengths. In this work, we conduct a theoretical analysis to better understand why No Position Encoding (NoPE) fails outside its effective range, as well as examining the power of Position Encoding (PE) in this context. Our findings reveal that with meticulous weave position, PE can indeed be extended beyond effective range. Our theorems establish that LLMs equipped with weave PE can achieve improved extrapolation performance without additional cost. Furthermore, we introduce a novel weave PE method, Mesa-Extrapolation, which utilizes a chunk-based triangular attention matrix and applies Stair PE to manage the final chunk. This method not only retains competitive performance but also offers substantial benefits such as significantly reduced memory demand and faster inference speed. Extensive experiments validate the effectiveness of Mesa-Extrapolation, demonstrating its potential as a scalable solution to enhancing LLMs applicative reach.
[AI-58] Random Token Fusion for Multi-View Medical Diagnosis NEURIPS2024
Link: https://arxiv.org/abs/2410.15847
Authors: Jingyu Guo,Christos Matsoukas,Fredrik Strand,Kevin Smith
Keywords: deep learning-based models, deep learning-based, multi-view medical diagnosis, fuse information, imaging perspectives
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Originally published at the NeurIPS 2024 Workshop on Advancements In Medical Foundation Models: Explainability, Robustness, Security, and Beyond (AIM-FM)
Abstract:In multi-view medical diagnosis, deep learning-based models often fuse information from different imaging perspectives to improve diagnostic performance. However, existing approaches are prone to overfitting and rely heavily on view-specific features, which can lead to trivial solutions. In this work, we introduce Random Token Fusion (RTF), a novel technique designed to enhance multi-view medical image analysis using vision transformers. By integrating randomness into the feature fusion process during training, RTF addresses the issue of overfitting and enhances the robustness and accuracy of diagnostic models without incurring any additional cost at inference. We validate our approach on standard mammography and chest X-ray benchmark datasets. Through extensive experiments, we demonstrate that RTF consistently improves the performance of existing fusion methods, paving the way for a new generation of multi-view medical foundation models.
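One way to picture "integrating randomness into the feature fusion process during training" is a per-position random choice between the token sequences of two views. The abstract does not give the mixing rule, so the Bernoulli selection, the equal-length requirement, and the `p_a` parameter below are assumptions; the sketch covers only the training-time step and says nothing about what the model does at inference.

```python
import random
from typing import List, Sequence

Token = Sequence[float]


def random_token_fusion(
    tokens_a: List[Token],
    tokens_b: List[Token],
    rng: random.Random,
    p_a: float = 0.5,
) -> List[Token]:
    """At each position keep the token from view A with probability p_a, else view B."""
    if len(tokens_a) != len(tokens_b):
        raise ValueError("both views must provide the same number of tokens")
    return [a if rng.random() < p_a else b for a, b in zip(tokens_a, tokens_b)]


# Usage: a seeded generator makes the mixing reproducible across runs.
view_a = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
view_b = [[2.0, 0.0], [2.0, 1.0], [2.0, 2.0]]
mixed = random_token_fusion(view_a, view_b, random.Random(0))
```

Because the fused sequence changes on every training step, a downstream model in this sketch cannot rely on any single view always supplying a given position.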
[AI-59] Long-distance Geomagnetic Navigation in GNSS-denied Environments with Deep Reinforcement Learning
Link: https://arxiv.org/abs/2410.15837
Authors: Wenqi Bai,Xiaohui Zhang,Shiliang Zhang,Songnan Yang,Yushuai Li,Tingwen Huang
Keywords: drawn increasing attention, navigation satellite systems, Geomagnetic navigation, external navigation services, global navigation satellite
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Abstract:Geomagnetic navigation has drawn increasing attention with its capacity in navigating through complex environments and its independence from external navigation services like global navigation satellite systems (GNSS). Existing studies on geomagnetic navigation, i.e., matching navigation and bionic navigation, rely on pre-stored map or extensive searches, leading to limited applicability or reduced navigation efficiency in unexplored areas. To address the issues with geomagnetic navigation in areas where GNSS is unavailable, this paper develops a deep reinforcement learning (DRL)-based mechanism, especially for long-distance geomagnetic navigation. The designed mechanism trains an agent to learn and gain the magnetoreception capacity for geomagnetic navigation, rather than using any pre-stored map or extensive and expensive searching approaches. Particularly, we integrate the geomagnetic gradient-based parallel approach into geomagnetic navigation. This integration mitigates the over-exploration of the learning agent by adjusting the geomagnetic gradient, such that the obtained gradient is aligned towards the destination. We explore the effectiveness of the proposed approach via detailed numerical simulations, where we implement twin delayed deep deterministic policy gradient (TD3) in realizing the proposed approach. The results demonstrate that our approach outperforms existing metaheuristic and bionic navigation methods in long-distance missions under diverse navigation conditions.
[AI-60] LLM4GRN: Discovering Causal Gene Regulatory Networks with LLMs – Evaluation through Synthetic Data Generation
Link: https://arxiv.org/abs/2410.15828
Authors: Tejumade Afonja,Ivaxi Sheth,Ruta Binkyte,Waqar Hanif,Thomas Ulas,Matthias Becker,Mario Fritz
Keywords: single-cell RNA sequencing, Gene regulatory networks, RNA sequencing, single-cell RNA, Gene regulatory
Categories: Artificial Intelligence (cs.AI)
Abstract:Gene regulatory networks (GRNs) represent the causal relationships between transcription factors (TFs) and target genes in single-cell RNA sequencing (scRNA-seq) data. Understanding these networks is crucial for uncovering disease mechanisms and identifying therapeutic targets. In this work, we investigate the potential of large language models (LLMs) for GRN discovery, leveraging their learned biological knowledge alone or in combination with traditional statistical methods. We develop a task-based evaluation strategy to address the challenge of unavailable ground truth causal graphs. Specifically, we use the GRNs suggested by LLMs to guide causal synthetic data generation and compare the resulting data against the original dataset. Our statistical and biological assessments show that LLMs can support statistical modeling and data synthesis for biological research.
[AI-61] The effect of fine-tuning on language model toxicity NEURIPS2024
Link: https://arxiv.org/abs/2410.15821
Authors: Will Hawkins,Brent Mittelstadt,Chris Russell
Keywords: cost-effective parameter efficient, parameter efficient fine-tuning, increasingly popular, improvements in cost-effective, cost-effective parameter
Categories: Artificial Intelligence (cs.AI)
Notes: To be presented at NeurIPS 2024 Safe Generative AI Workshop
Abstract:Fine-tuning language models has become increasingly popular following the proliferation of open models and improvements in cost-effective parameter efficient fine-tuning. However, fine-tuning can influence model properties such as safety. We assess how fine-tuning can impact different open models’ propensity to output toxic content. We assess the impacts of fine-tuning Gemma, Llama, and Phi models on toxicity through three experiments. We compare how toxicity is reduced by model developers during instruction-tuning. We show that small amounts of parameter-efficient fine-tuning on developer-tuned models via low-rank adaptation on a non-adversarial dataset can significantly alter these results across models. Finally, we highlight the impact of this in the wild, demonstrating how toxicity rates of models fine-tuned by community contributors can deviate in hard-to-predict ways.
[AI-62] MAC Revivo: Artificial Intelligence Paves the Way
Link: https://arxiv.org/abs/2410.15820
Authors: Jinzhe Pan,Jingqing Wang,Zelin Yun,Zhiyong Xiao,Yuehui Ouyang,Wenchi Cheng,Wei Zhang
Keywords: deployed smart devices, Internet of Things, Bluetooth capabilities, capabilities in Internet, Medium Access Control
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Abstract:The vast adoption of Wi-Fi and/or Bluetooth capabilities in Internet of Things (IoT) devices, along with the rapid growth of deployed smart devices, has caused significant interference and congestion in the industrial, scientific, and medical (ISM) bands. Traditional Wi-Fi Medium Access Control (MAC) design faces significant challenges in managing increasingly complex wireless environments while ensuring network Quality of Service (QoS) performance. This paper explores the potential integration of advanced Artificial Intelligence (AI) methods into the design of Wi-Fi MAC protocols. We propose AI-MAC, an innovative approach that employs machine learning algorithms to dynamically adapt to changing network conditions, optimize channel access, mitigate interference, and ensure deterministic latency. By intelligently predicting and managing interference, AI-MAC aims to provide a robust solution for next generation of Wi-Fi networks, enabling seamless connectivity and enhanced QoS. Our experimental results demonstrate that AI-MAC significantly reduces both interference and latency, paving the way for more reliable and efficient wireless communications in the increasingly crowded ISM band.
[AI-63] LiMTR: Time Series Motion Prediction for Diverse Road Users through Multimodal Feature Integration NEURIPS2024
Link: https://arxiv.org/abs/2410.15819
Authors: Camiel Oerlemans,Bram Grooten,Michiel Braat,Alaa Alassi,Emilia Silvas,Decebal Constantin Mocanu
Keywords: densely populated areas, road users accurately, Predicting the behavior, populated areas, behavior of road
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted at the NeurIPS 2024 workshop Time Series in the Age of Large Models. Code available at this https URL
Abstract:Predicting the behavior of road users accurately is crucial to enable the safe operation of autonomous vehicles in urban or densely populated areas. Therefore, there has been a growing interest in time series motion prediction research, leading to significant advancements in state-of-the-art techniques in recent years. However, the potential of using LiDAR data to capture more detailed local features, such as a person’s gaze or posture, remains largely unexplored. To address this, we develop a novel multimodal approach for motion prediction based on the PointNet foundation model architecture, incorporating local LiDAR features. Evaluation on the Waymo Open Dataset shows a performance improvement of 6.20% and 1.58% in minADE and mAP respectively, when integrated and compared with the previous state-of-the-art MTR. We open-source the code of our LiMTR model.
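For readers unfamiliar with the minADE figure quoted above, the metric in its common textbook form is the smallest average point-wise displacement among the candidate trajectories a model predicts. The sketch below follows that general definition only; the Waymo Open Dataset evaluation protocol (time horizons, agent types, averaging) is not reproduced, and the function names are illustrative.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Trajectory = List[Point]


def average_displacement_error(predicted: Trajectory, truth: Trajectory) -> float:
    """Mean Euclidean distance between matching time steps of two trajectories."""
    if not truth or len(predicted) != len(truth):
        raise ValueError("trajectories must be non-empty and equally long")
    return sum(math.dist(p, t) for p, t in zip(predicted, truth)) / len(truth)


def min_ade(candidates: List[Trajectory], truth: Trajectory) -> float:
    """ADE of the best candidate; multimodal predictors are scored on their closest guess."""
    return min(average_displacement_error(c, truth) for c in candidates)
```

Taking the minimum means a model is not penalized for proposing several plausible futures, as long as one of them is close to what actually happened.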
[AI-64] Kaninfradet3D:A Road-side Camera-LiDAR Fusion 3D Perception Model based on Nonlinear Feature Extraction and Intrinsic Correlation
Link: https://arxiv.org/abs/2410.15814
Authors: Pei Liu(1),Nanfang Zheng(2),Yiqun Li(2),Junlan Chen(2),Ziyuan Pu(2) ((1) Intelligent Transportation Thrust, Systems Hub, The Hong Kong University of Science and Technology (Guangzhou), (2) Transportation, Southeast University)
Keywords: AI-assisted driving, numerous methods, emerged for ego-vehicle, development of AI-assisted, methods have emerged
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract:With the development of AI-assisted driving, numerous methods have emerged for ego-vehicle 3D perception tasks, but there has been limited research on roadside perception. With its ability to provide a global view and a broader sensing range, the roadside perspective is worth developing. LiDAR provides precise three-dimensional spatial information, while cameras offer semantic information. These two modalities are complementary in 3D detection. However, adding camera data does not increase accuracy in some studies since the information extraction and fusion procedure is not sufficiently reliable. Recently, Kolmogorov-Arnold Networks (KANs) have been proposed as replacements for MLPs, which are better suited for high-dimensional, complex data. Both the camera and the LiDAR provide high-dimensional information, and employing KANs should enhance the extraction of valuable features to produce better fusion outcomes. This paper proposes Kaninfradet3D, which optimizes the feature extraction and fusion modules. To extract features from complex high-dimensional data, the model’s encoder and fuser modules were improved using KAN Layers. Cross-attention was applied to enhance feature fusion, and visual comparisons verified that camera features were more evenly integrated. This addressed the issue of camera features being abnormally concentrated, negatively impacting fusion. Compared to the benchmark, our approach shows improvements of +9.87 mAP and +10.64 mAP in the two viewpoints of the TUMTraf Intersection Dataset and an improvement of +1.40 mAP in the roadside end of the TUMTraf V2X Cooperative Perception Dataset. The results indicate that Kaninfradet3D can effectively fuse features, demonstrating the potential of applying KANs in roadside perception tasks.
[AI-65] RAG4ITOps: A Supervised Fine-Tunable and Comprehensive RAG Framework for IT Operations and Maintenance EMNLP2024
Link: https://arxiv.org/abs/2410.15805
Authors: Tianyang Zhang,Zhuoxuan Jiang,Shengguang Bai,Tianrui Zhang,Lin Lin,Yang Liu,Jiawei Ren
Keywords: Question Answering, demands on Question, supervised fine-tunable framework, Large Language Models, Retrieval Augmented Generation
Categories: Artificial Intelligence (cs.AI)
Notes: Accepted by EMNLP 2024 Industry Track
Abstract: With the ever-increasing demands on Question Answering (QA) systems for IT operations and maintenance, an efficient and supervised fine-tunable framework is necessary to ensure data security, private deployment, and continuous upgrading. Although Large Language Models (LLMs) have notably improved open-domain QA performance, how to efficiently handle enterprise-exclusive corpora and build domain-specific QA systems is still under-studied for industrial applications. In this paper, we propose a general and comprehensive framework based on Retrieval Augmented Generation (RAG) that facilitates the whole business process of establishing QA systems for IT operations and maintenance. In accordance with the prevailing RAG method, our proposed framework, named RAG4ITOps, consists of two major stages: (1) Models Fine-tuning & Data Vectorization, and (2) Online QA System Process. In Stage 1, we leverage a contrastive learning method with two negative sampling strategies to fine-tune the embedding model, and design instruction templates to fine-tune the LLM with a Retrieval Augmented Fine-Tuning method. In Stage 2, an efficient QA system process is built for serving. We collect enterprise-exclusive corpora from the cloud computing domain, and extensive experiments show that our method achieves better results than its counterparts on two kinds of QA tasks. Our experiments also provide a case for applying RAG4ITOps to real-world enterprise-level applications.
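The abstract mentions contrastive fine-tuning of the embedding model with sampled negatives but does not state the loss. As a rough illustration only, the sketch below assumes an InfoNCE-style objective over cosine similarities, which is a common choice for this kind of training. The function names, the toy vectors, and the `temperature` value are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(query, positive, negatives, temperature=0.05):
    """InfoNCE loss for one query: -log softmax score of the positive passage.

    Assumption: similarity is cosine and the temperature is a free
    hyperparameter; neither detail comes from the abstract.
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]


# Toy 2-D embeddings: the loss is small when the positive is close to the query
# and large when a negative is closer than the positive.
query = [1.0, 0.0]
easy = info_nce(query, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
hard = info_nce(query, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
print(easy < hard)
```

How the negatives are chosen (the abstract's "two negative sampling strategies") only changes what goes into the `negatives` list; the loss itself stays the same.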
[AI-66] Deep Learning and Data Augmentation for Detecting Self-Admitted Technical Debt
Link: https://arxiv.org/abs/2410.15804
Authors: Edi Sutoyo, Paris Avgeriou, Andrea Capiluppi
Keywords: Self-Admitted Technical Debt, Self-Admitted Technical, Technical Debt, SATD, refers to circumstances
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Accepted to be published at the 2024 31st Asia-Pacific Software Engineering Conference (APSEC)
Abstract:Self-Admitted Technical Debt (SATD) refers to circumstances where developers use textual artifacts to explain why the existing implementation is not optimal. Past research in detecting SATD has focused on either identifying SATD (classifying SATD items as SATD or not) or categorizing SATD (labeling instances as SATD that pertain to requirement, design, code, test debt, etc.). However, the performance of these approaches remains suboptimal, particularly for specific types of SATD, such as test and requirement debt, primarily due to extremely imbalanced datasets. To address these challenges, we build on earlier research by utilizing BiLSTM architecture for the binary identification of SATD and BERT architecture for categorizing different types of SATD. Despite their effectiveness, both architectures struggle with imbalanced data. Therefore, we employ a large language model data augmentation strategy to mitigate this issue. Furthermore, we introduce a two-step approach to identify and categorize SATD across various datasets derived from different artifacts. Our contributions include providing a balanced dataset for future SATD researchers and demonstrating that our approach significantly improves SATD identification and categorization performance compared to baseline methods.
[AI-67] Habaek: High-performance water segmentation through dataset expansion and inductive bias optimization
Link: https://arxiv.org/abs/2410.15794
Authors: Hanseon Joo, Eunji Lee, Minjong Cheon
Keywords: water resource management, critical to disaster, disaster response, resource management, Water segmentation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract:Water segmentation is critical to disaster response and water resource management. Authorities may employ high-resolution photography to monitor rivers, lakes, and reservoirs, allowing for more proactive management in agriculture, industry, and conservation. Deep learning has improved flood monitoring by allowing models like CNNs, U-Nets, and transformers to handle large volumes of satellite and aerial data. However, these models usually have significant processing requirements, limiting their usage in real-time applications. This research proposes upgrading the SegFormer model for water segmentation by data augmentation with datasets such as ADE20K and RIWA to boost generalization. We examine how inductive bias affects attention-based models and discover that SegFormer performs better on bigger datasets. To further demonstrate the function of data augmentation, Low-Rank Adaptation (LoRA) is used to lower processing complexity while preserving accuracy. We show that the suggested Habaek model outperforms current models in segmentation, with an Intersection over Union (IoU) ranging from 0.91986 to 0.94397. In terms of F1-score, recall, accuracy, and precision, Habaek performs better than rival models, indicating its potential for real-world applications. This study highlights the need to enhance structures and include datasets for effective water segmentation.
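The abstract reports results as Intersection over Union (IoU). For readers unfamiliar with the metric, here is a minimal sketch of IoU for binary water masks, assuming masks are given as nested lists of 0/1. The mask values are made up for illustration, and treating two empty masks as a perfect match is a convention chosen here, not something the paper specifies.

```python
def iou(pred, target):
    """Intersection over Union of two same-shaped binary masks (nested lists of 0/1)."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: if neither mask contains water, count it as a perfect match.
    return intersection / union if union else 1.0


# Hypothetical 2x3 masks: 1 = water pixel, 0 = background.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(iou(pred, target))  # 2 shared pixels / 4 pixels in the union = 0.5
```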
[AI-68] WildOcc: A Benchmark for Off-Road 3D Semantic Occupancy Prediction
Link: https://arxiv.org/abs/2410.15792
Authors: Heng Zhai, Jilin Mei, Chen Min, Liang Chen, Fangzhou Zhao, Yu Hu
Keywords: semantic occupancy prediction, semantic occupancy, occupancy prediction, occupancy prediction tasks, semantic
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: 3D semantic occupancy prediction is an essential part of autonomous driving, focusing on capturing the geometric details of scenes. Off-road environments are rich in geometric information, so 3D semantic occupancy prediction is well suited to reconstructing such scenes. However, most research concentrates on on-road environments, and few methods are designed for off-road 3D semantic occupancy prediction due to the lack of relevant datasets and benchmarks. In response to this gap, we introduce WildOcc, to our knowledge the first benchmark to provide dense occupancy annotations for off-road 3D semantic occupancy prediction tasks. A ground truth generation pipeline is proposed in this paper, which employs a coarse-to-fine reconstruction to achieve a more realistic result. Moreover, we introduce a multi-modal 3D semantic occupancy prediction framework, which fuses spatio-temporal information from multi-frame images and point clouds at the voxel level. In addition, a cross-modality distillation function is introduced, which transfers geometric knowledge from point clouds to image features.
[AI-69] Arithmetic Transformers Can Length-Generalize in Both Operand Length and Count
Link: https://arxiv.org/abs/2410.15787
Authors: Hanseul Cho, Jaeyoung Cha, Srinadh Bhojanapalli, Chulhee Yun
Keywords: length generalization, meaning they fail, encountered during training, fail to generalize, generalize to sequences
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: 38 pages, 16 figures
Abstract:Transformers often struggle with length generalization, meaning they fail to generalize to sequences longer than those encountered during training. While arithmetic tasks are commonly used to study length generalization, certain tasks are considered notoriously difficult, e.g., multi-operand addition (requiring generalization over both the number of operands and their lengths) and multiplication (requiring generalization over both operand lengths). In this work, we achieve approximately 2-3x length generalization on both tasks, which is the first such achievement in arithmetic Transformers. We design task-specific scratchpads enabling the model to focus on a fixed number of tokens per each next-token prediction step, and apply multi-level versions of Position Coupling (Cho et al., 2024; McLeish et al., 2024) to let Transformers know the right position to attend to. On the theory side, we prove that a 1-layer Transformer using our method can solve multi-operand addition, up to operand length and operand count that are exponential in embedding dimension.
[AI-70] An Efficient System for Automatic Map Storytelling – A Case Study on Historical Maps
Link: https://arxiv.org/abs/2410.15780
Authors: Ziyi Liu, Claudio Affolter, Sidi Wu, Yizi Chen, Lorenz Hurni
Keywords: provide valuable information, maps provide valuable, provide valuable, valuable information, information and knowledge
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract:Historical maps provide valuable information and knowledge about the past. However, as they often feature non-standard projections, hand-drawn styles, and artistic elements, it is challenging for non-experts to identify and interpret them. While existing image captioning methods have achieved remarkable success on natural images, their performance on maps is suboptimal as maps are underrepresented in their pre-training process. Despite the recent advance of GPT-4 in text recognition and map captioning, it still has a limited understanding of maps, as its performance wanes when texts (e.g., titles and legends) in maps are missing or inaccurate. Besides, it is inefficient or even impractical to fine-tune the model with users’ own datasets. To address these problems, we propose a novel and lightweight map-captioning counterpart. Specifically, we fine-tune the state-of-the-art vision-language model CLIP to generate captions relevant to historical maps and enrich the captions with GPT-3.5 to tell a brief story regarding where, what, when and why of a given map. We propose a novel decision tree architecture to only generate captions relevant to the specified map type. Our system shows invariance to text alterations in maps. The system can be easily adapted and extended to other map types and scaled to a larger map captioning system. The code is open-sourced at this https URL.
[AI-71] Reducing Hallucinations in Vision-Language Models via Latent Space Steering
Link: https://arxiv.org/abs/2410.15778
Authors: Sheng Liu, Haotian Ye, James Zou
Keywords: large language models, large vision-language models, poses a challenge, language models, vision-language models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Notes: 21 pages
Abstract:Hallucination poses a challenge to the deployment of large vision-language models (LVLMs) in applications. Unlike in large language models (LLMs), hallucination in LVLMs often arises from misalignments between visual inputs and textual outputs. This paper investigates the underlying mechanisms of hallucination, focusing on the unique structure of LVLMs that distinguishes them from large language models (LLMs). We identify that hallucinations often arise from the sensitivity of text decoders to vision inputs, a natural phenomenon when image encoders and text decoders are pre-trained separately. Inspired by this, we introduce Visual and Textual Intervention (VTI), a novel technique designed to reduce hallucinations by steering latent space representations during inference to enhance the stability of vision features. As a task-agnostic test-time intervention, VTI can be easily applied to any problem without additional cost. Extensive experiments demonstrate that it can effectively reduce hallucinations and outperform baseline methods across multiple metrics, highlighting the critical role of vision feature stability in LVLMs.
[AI-72] A roadmap for generative mapping: unlocking the power of generative AI for map-making
Link: https://arxiv.org/abs/2410.15770
Authors: Sidi Wu, Katharina Henggeler, Yizi Chen, Lorenz Hurni
Keywords: communicating spatial knowledge, presenting spatial phenomena, spatial knowledge, serving as valuable, presenting spatial
Categories: Artificial Intelligence (cs.AI)
Abstract:Maps are broadly relevant across various fields, serving as valuable tools for presenting spatial phenomena and communicating spatial knowledge. However, map-making is still largely confined to those with expertise in GIS and cartography due to the specialized software and complex workflow involved, from data processing to visualization. While generative AI has recently demonstrated its remarkable capability in creating various types of content and its wide accessibility to the general public, its potential in generating maps is yet to be fully realized. This paper highlights the key applications of generative AI in map-making, summarizes recent advancements in generative AI, identifies the specific technologies required and the challenges of using current methods, and provides a roadmap for developing a generative mapping system (GMS) to make map-making more accessible.
[AI-73] Learning to Synthesize Graphics Programs for Geometric Artworks ICPR2024
Link: https://arxiv.org/abs/2410.15768
Authors: Qi Bing, Chaoyi Zhang, Weidong Cai
Keywords: Creating and understanding, human ability, hallmark of human, Creating, understanding art
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Notes: ICPR 2024
Abstract:Creating and understanding art has long been a hallmark of human ability. When presented with finished digital artwork, professional graphic artists can intuitively deconstruct and replicate it using various drawing tools, such as the line tool, paint bucket, and layer features, including opacity and blending modes. While most recent research in this field has focused on art generation, proposing a range of methods, these often rely on the concept of artwork being represented as a final image. To bridge the gap between pixel-level results and the actual drawing process, we present an approach that treats a set of drawing tools as executable programs. This method predicts a sequence of steps to achieve the final image, allowing for understandable and resolution-independent reproductions under the usage of a set of drawing commands. Our experiments demonstrate that our program synthesizer, Art2Prog, can comprehensively understand complex input images and reproduce them using high-quality executable programs. The experimental results evidence the potential of machines to grasp higher-level information from images and generate compact program-level descriptions.
[AI-74] DeepIcon: A Hierarchical Network for Layer-wise Icon Vectorization
Link: https://arxiv.org/abs/2410.15760
Authors: Qi Bing, Chaoyi Zhang, Weidong Cai
Keywords: technique of rasterization, well-established technique, poses a significant, field of computer, Scalable Vector Graphics
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: Accepted as Oral Presentation at DICTA 2024
Abstract:In contrast to the well-established technique of rasterization, vectorization of images poses a significant challenge in the field of computer graphics. Recent learning-based methods for converting raster images to vector formats frequently suffer from incomplete shapes, redundant path prediction, and a lack of accuracy in preserving the semantics of the original content. These shortcomings severely hinder the utility of these methods for further editing and manipulation of images. To address these challenges, we present DeepIcon, a novel hierarchical image vectorization network specifically tailored for generating variable-length icon vector graphics based on the raster image input. Our experimental results indicate that DeepIcon can efficiently produce Scalable Vector Graphics (SVGs) directly from raster images, bypassing the need for a differentiable rasterizer while also demonstrating a profound understanding of the image contents.
[AI-75] Automated Proof Generation for Rust Code via Self-Evolution
Link: https://arxiv.org/abs/2410.15756
Authors: Tianyu Chen, Shuai Lu, Shan Lu, Yeyun Gong, Chenyuan Yang, Xuheng Li, Md Rakib Hossain Misu, Hao Yu, Nan Duan, Peng Cheng, Fan Yang, Shuvendu K Lahiri, Tao Xie, Lidong Zhou
Keywords: Ensuring correctness, Ensuring, Rust code, SAFE, proof
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Abstract:Ensuring correctness is crucial for code generation. Formal verification offers a definitive assurance of correctness, but demands substantial human effort in proof construction and hence raises a pressing need for automation. The primary obstacle lies in the severe lack of data - there is much less proof than code for LLMs to train upon. In this paper, we introduce SAFE, a novel framework that overcomes the lack of human-written proof to enable automated proof generation of Rust code. SAFE establishes a self-evolving cycle where data synthesis and fine-tuning collaborate to enhance the model capability, leveraging the definitive power of a symbolic verifier in telling correct proof from incorrect ones. SAFE also re-purposes the large number of synthesized incorrect proofs to train the self-debugging capability of the fine-tuned models, empowering them to fix incorrect proofs based on the verifier’s feedback. SAFE demonstrates superior efficiency and precision compared to GPT-4o. Through tens of thousands of synthesized proofs and the self-debugging mechanism, we improve the capability of open-source models, initially unacquainted with formal verification, to automatically write proof for Rust code. This advancement leads to a significant improvement in performance, achieving a 70.50% accuracy rate in a benchmark crafted by human experts, a significant leap over GPT-4o’s performance of 24.46%.
[AI-76] Alchemy: Amplifying Theorem-Proving Capability through Symbolic Mutation
Link: https://arxiv.org/abs/2410.15748
Authors: Shaonan Wu, Shuai Lu, Yeyun Gong, Nan Duan, Ping Wei
Keywords: Neural Theorem Proving, experienced experts, proofs are challenging, challenging to write, Formal proofs
Categories: Artificial Intelligence (cs.AI)
Abstract:Formal proofs are challenging to write even for experienced experts. Recent progress in Neural Theorem Proving (NTP) shows promise in expediting this process. However, the formal corpora available on the Internet are limited compared to the general text, posing a significant data scarcity challenge for NTP. To address this issue, this work proposes Alchemy, a general framework for data synthesis that constructs formal theorems through symbolic mutation. Specifically, for each candidate theorem in Mathlib, we identify all invocable theorems that can be used to rewrite or apply to it. Subsequently, we mutate the candidate theorem by replacing the corresponding term in the statement with its equivalent form or antecedent. As a result, our method increases the number of theorems in Mathlib by an order of magnitude, from 110k to 6M. Furthermore, we perform continual pretraining and supervised finetuning on this augmented corpus for large language models. Experimental results demonstrate the effectiveness of our approach, achieving a 5% absolute performance improvement on Leandojo benchmark. Additionally, our synthetic data achieve a 2.5% absolute performance gain on the out-of-distribution miniF2F benchmark. To provide further insights, we conduct a comprehensive analysis of synthetic data composition and the training paradigm, offering valuable guidance for developing a strong theorem prover.
[AI-77] GIG: Graph Data Imputation With Graph Differential Dependencies
Link: https://arxiv.org/abs/2410.15747
Authors: Jiang Hua, Michael Bewong, Selasi Kwashie, MD Geaur Rahman, Junwei Hu, Xi Guo, Zaiwen Fen
Keywords: database instances, ensuring consistency, addresses the challenge, challenge of imputing, Data
Categories: Artificial Intelligence (cs.AI)
Notes: 12 pages, 4 figures, published to ADC
Abstract: Data imputation addresses the challenge of imputing missing values in database instances, ensuring consistency with the overall semantics of the dataset. Although several heuristics that rely on statistical methods and ad-hoc rules have been proposed, these do not generalise well and often lack data context. Consequently, they also lack explainability. The existing techniques also mostly focus on the relational data context, making them unsuitable for wider application contexts such as graph data. In this paper, we propose a graph data imputation approach called GIG which relies on graph differential dependencies (GDDs). GIG learns the GDDs from a given knowledge graph and uses these rules to train a transformer model, which then predicts the value of missing data within the graph. By leveraging GDDs, GIG incorporates semantic knowledge into the data imputation process, making it more reliable and explainable. Experimental results on seven real-world datasets highlight GIG’s effectiveness compared to existing state-of-the-art approaches.
[AI-78] Unleashing the Potential of Vision-Language Pre-Training for 3D Zero-Shot Lesion Segmentation via Mask-Attribute Alignment
Link: https://arxiv.org/abs/2410.15744
Authors: Yankai Jiang, Wenhui Lei, Xiaofan Zhang, Shaoting Zhang
Keywords: medical vision-language pre-training, vision-language pre-training models, driven significant progress, Recent advancements, zero-shot disease recognition
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract:Recent advancements in medical vision-language pre-training models have driven significant progress in zero-shot disease recognition. However, transferring image-level knowledge to pixel-level tasks, such as lesion segmentation in 3D CT scans, remains a critical challenge. Due to the complexity and variability of pathological visual characteristics, existing methods struggle to align fine-grained lesion features not encountered during training with disease-related textual representations. In this paper, we present Malenia, a novel multi-scale lesion-level mask-attribute alignment framework, specifically designed for 3D zero-shot lesion segmentation. Malenia improves the compatibility between mask representations and their associated elemental attributes, explicitly linking the visual features of unseen lesions with the extensible knowledge learned from previously seen ones. Furthermore, we design a Cross-Modal Knowledge Injection module to enhance both visual and textual features with mutually beneficial information, effectively guiding the generation of segmentation results. Comprehensive experiments across three datasets and 12 lesion categories validate the superior performance of Malenia. Codes will be publicly available.
[AI-79] Who’s Who: Large Language Models Meet Knowledge Conflicts in Practice EMNLP2024
Link: https://arxiv.org/abs/2410.15737
Authors: Quang Hieu Pham, Hoang Ngo, Anh Tuan Luu, Dat Quoc Nguyen
Keywords: static memory limits, Retrieval-augmented generation, methods are viable, viable solutions, solutions for addressing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Notes: Accepted to EMNLP 2024 Findings
Abstract: Retrieval-augmented generation (RAG) methods are viable solutions for addressing the static memory limits of pre-trained language models. Nevertheless, encountering conflicting sources of information within the retrieval context is an inevitable practical challenge. In such situations, the language models are recommended to transparently inform users about the conflicts rather than autonomously deciding what to present based on their inherent biases. To analyze how current large language models (LLMs) align with our recommendation, we introduce WhoQA, a public benchmark dataset to examine models’ behavior in knowledge conflict situations. We induce conflicts by asking about a common property among entities having the same name, resulting in questions with up to 8 distinctive answers. The WhoQA evaluation set includes 5K questions across 13 Wikidata property types and 150K Wikipedia entities. Our experiments show that despite the simplicity of WhoQA questions, knowledge conflicts significantly degrade LLMs’ performance in RAG settings.
[AI-80] AutoTrain: No-code training for state-of-the-art models
Link: https://arxiv.org/abs/2410.15735
Authors: Abhishek Thakur
Keywords: crucial part, part of developing, developing solutions, tailored to specific, specific industrial
Categories: Artificial Intelligence (cs.AI)
Abstract:With the advancements in open-source models, training (or finetuning) models on custom datasets has become a crucial part of developing solutions which are tailored to specific industrial or open-source applications. Yet, there is no single tool which simplifies the process of training across different types of modalities or tasks. We introduce AutoTrain (aka AutoTrain Advanced) – an open-source, no code tool/library which can be used to train (or finetune) models for different kinds of tasks such as: large language model (LLM) finetuning, text classification/regression, token classification, sequence-to-sequence task, finetuning of sentence transformers, visual language model (VLM) finetuning, image classification/regression and even classification and regression tasks on tabular data. AutoTrain Advanced is an open-source library providing best practices for training models on custom datasets. The library is available at this https URL. AutoTrain can be used in fully local mode or on cloud machines and works with tens of thousands of models shared on Hugging Face Hub and their variations.
[AI-81] Reducing annotator bias by belief elicitation
Link: https://arxiv.org/abs/2410.15726
Authors: Terne Sasha Thorn Jakobsen, Andreas Bjerre-Nielsen, Robert Böhm
Keywords: Artificial Intelligence, development of Artificial, Crowdsourced annotations, annotator bias, bias
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); General Economics (econ.GN)
Abstract:Crowdsourced annotations of data play a substantial role in the development of Artificial Intelligence (AI). It is broadly recognised that annotations of text data can contain annotator bias, where systematic disagreement in annotations can be traced back to differences in the annotators’ backgrounds. Being unaware of such annotator bias can lead to representational bias against minority group perspectives and therefore several methods have been proposed for recognising bias or preserving perspectives. These methods typically require either a substantial number of annotators or annotations per data instance. In this study, we propose a simple method for handling bias in annotations without requirements on the number of annotators or instances. Instead, we ask annotators about their beliefs of other annotators’ judgements of an instance, under the hypothesis that these beliefs may provide more representative and less biased labels than judgements. The method was examined in two controlled, survey-based experiments involving Democrats and Republicans (n=1,590) asked to judge statements as arguments and then report beliefs about others’ judgements. The results indicate that bias, defined as systematic differences between the two groups of annotators, is consistently reduced when asking for beliefs instead of judgements. Our proposed method therefore has the potential to reduce the risk of annotator bias, thereby improving the generalisability of AI systems and preventing harm to unrepresented socio-demographic groups, and we highlight the need for further studies of this potential in other tasks and downstream applications.
[AI-82] Timetable Nodes for Public Transport Network
Link: https://arxiv.org/abs/2410.15715
Authors: Andrii Rohovyi, Peter J. Stuckey, Toby Walsh
Keywords: transport networks, Time-dependent Contraction Hierarchies, transport, networks, navigation systems
Categories: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Computational Geometry (cs.CG)
Abstract: Faster pathfinding in time-dependent transport networks is an important and challenging problem in navigation systems. There are two main types of transport networks: road networks for car driving and public transport route networks. The solutions that work well in road networks, such as Time-dependent Contraction Hierarchies and other graph-based approaches, do not usually apply in transport networks. In transport networks, non-graph solutions such as CSA and RAPTOR show the best results compared to graph-based techniques. In our work, we propose a method that advances graph-based approaches by using different optimization techniques from computational geometry to speed up the search process in transport networks. We apply a new pre-computation step, which we call timetable nodes (TTN). Our inspiration comes from an iterative search problem in computational geometry. We implement two versions of the TTN: one uses a Combined Search Tree (TTN-CST), and the second uses Fractional Cascading (TTN-FC). Both of these approaches decrease the asymptotic complexity of reaching new nodes from $O(k \times \log|C|)$ to $O(k + \log(k) + \log(|C|))$, where $k$ is the number of outgoing edges from a node and $|C|$ is the size of the timetable information (total outgoing edges). Our solution suits any other time-dependent network and can be integrated into other pathfinding algorithms. Our experiments indicate that this pre-computation significantly enhances the performance on high-density graphs. This study showcases how leveraging computational geometry can enhance pathfinding in transport networks, enabling faster pathfinding in scenarios involving large numbers of outgoing edges.
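To make the complexity claim concrete, the sketch below contrasts the baseline (one binary search per outgoing edge, the k log|C| term) with a single binary search over the merged departure times of a node followed by a precomputed lookup. This is not the paper's TTN-CST or TTN-FC structure: the merged table here is a simple, memory-hungry stand-in that stores a full answer row per departure time, and all names and timetable values are hypothetical.

```python
from bisect import bisect_left


def earliest_departures_naive(edges, t):
    """Baseline: one binary search per outgoing edge, O(k log |C|) per node."""
    result = {}
    for target, departures in edges.items():
        i = bisect_left(departures, t)  # index of first departure at or after t
        result[target] = departures[i] if i < len(departures) else None
    return result


def build_combined_table(edges):
    """Precompute the answer row for every distinct departure time at the node."""
    times = sorted({d for departures in edges.values() for d in departures})
    rows = [earliest_departures_naive(edges, time) for time in times]
    return times, rows


def earliest_departures_combined(edges, times, rows, t):
    """One binary search over the merged times, then a precomputed row."""
    i = bisect_left(times, t)
    if i == len(times):  # no departure on any edge at or after t
        return {target: None for target in edges}
    return rows[i]


# Hypothetical node with two outgoing edges and sorted departure times.
edges = {"B": [5, 15, 25], "C": [10, 20]}
times, rows = build_combined_table(edges)
print(earliest_departures_naive(edges, 12))
print(earliest_departures_combined(edges, times, rows, 12))
```

The stand-in works because no edge has a departure between the query time and the next merged time, so the row stored for that merged time is also the answer for the query. Fractional cascading reaches a similar query cost without storing a full row per time, which is the memory saving this sketch gives up.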
[AI-83] Offline reinforcement learning for job-shop scheduling problems
Link: https://arxiv.org/abs/2410.15714
Authors: Imanol Echeverria, Maialen Murua, Roberto Santana
Keywords: shown significant potential, Recent advances, solving combinatorial optimization, shown significant, significant potential
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Recent advances in deep learning have shown significant potential for solving combinatorial optimization problems in real-time. Unlike traditional methods, deep learning can generate high-quality solutions efficiently, which is crucial for applications like routing and scheduling. However, existing approaches like deep reinforcement learning (RL) and behavioral cloning have notable limitations, with deep RL suffering from slow learning and behavioral cloning relying solely on expert actions, which can lead to generalization issues and neglect of the optimization objective. This paper introduces a novel offline RL method designed for combinatorial optimization problems with complex constraints, where the state is represented as a heterogeneous graph and the action space is variable. Our approach encodes actions in edge attributes and balances expected rewards with the imitation of expert solutions. We demonstrate the effectiveness of this method on job-shop scheduling and flexible job-shop scheduling benchmarks, achieving superior performance compared to state-of-the-art techniques.
[AI-84] InternLM2.5-StepProver: Advancing Automated Theorem Proving via Expert Iteration on Large-Scale LEAN Problems
Link: https://arxiv.org/abs/2410.15700
Authors: Zijian Wu, Suozhi Huang, Zhejian Zhou, Huaiyuan Ying, Jiayu Wang, Dahua Lin, Kai Chen
Keywords: utilizing formal languages, mathematical theorem proving, Large Language Models, formal languages, Large Language
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract:Large Language Models (LLMs) have emerged as powerful tools in mathematical theorem proving, particularly when utilizing formal languages such as LEAN. The major learning paradigm is expert iteration, which necessitates a pre-defined dataset comprising numerous mathematical problems. In this process, LLMs attempt to prove problems within the dataset and iteratively refine their capabilities through self-training on the proofs they discover. We propose to use large scale LEAN problem datasets Lean-workbook for expert iteration with more than 20,000 CPU days. During expert iteration, we found log-linear trends between solved problem amount with proof length and CPU usage. We train a critic model to select relatively easy problems for policy models to make trials and guide the model to search for deeper proofs. InternLM2.5-StepProver achieves open-source state-of-the-art on MiniF2F, Lean-Workbook-Plus, ProofNet, and Putnam benchmarks. Specifically, it achieves a pass of 65.9% on the MiniF2F-test and proves (or disproves) 17.0% of problems in Lean-Workbook-Plus which shows a significant improvement compared to only 9.5% of problems proved when Lean-Workbook-Plus was released. We open-source our models and searched proofs at this https URL and this https URL.
[AI-85] PALMS: Plane-based Accessible Indoor Localization Using Mobile Smartphones
Link: https://arxiv.org/abs/2410.15694
Authors: Yunqian Cheng, Roberto Manduchi
Keywords: innovative indoor global, floor plans, mobile smartphones, smartphones that utilizes, utilizes publicly
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 3 figures, accepted to the 14th International Conference on Indoor Positioning and Indoor Navigation (IPIN) 2024, Best Presentation Award
Abstract:In this paper, we present PALMS, an innovative indoor global localization and relocalization system for mobile smartphones that utilizes publicly available floor plans. Unlike most vision-based methods that require constant visual input, our system adopts a dynamic form of localization that considers a single instantaneous observation and odometry data. The core contribution of this work is the introduction of a particle filter initialization method that leverages the Certainly Empty Space (CES) constraint along with principal orientation matching. This approach creates a spatial probability distribution of the device’s location, significantly improving localization accuracy and reducing particle filter convergence time. Our experimental evaluations demonstrate that PALMS outperforms traditional methods with uniformly initialized particle filters, providing a more efficient and accessible approach to indoor wayfinding. By eliminating the need for prior environmental fingerprinting, PALMS provides a scalable and practical approach to indoor navigation.
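The abstract above contrasts uniformly initialized particle filters with an initialization drawn from a spatial probability distribution. The following is a minimal sketch of that general idea only, not the PALMS method: the grid `prob_grid`, the function name `init_particles`, and the `cell_size` parameter are hypothetical, and the grid simply stands in for whatever distribution the CES constraint and orientation matching would produce.

```python
import random


def init_particles(prob_grid, n_particles, cell_size=1.0, seed=0):
    """Sample (x, y) particle positions from a 2-D grid of location weights.

    prob_grid is a hypothetical stand-in for a spatial probability map;
    cells with weight 0 (e.g. ruled-out regions) never receive particles.
    """
    rng = random.Random(seed)
    cells = [(r, c) for r, row in enumerate(prob_grid) for c in range(len(row))]
    weights = [p for row in prob_grid for p in row]
    if sum(weights) <= 0:
        raise ValueError("probability grid must contain positive mass")
    # Weighted sampling with replacement over grid cells.
    chosen = rng.choices(cells, weights=weights, k=n_particles)
    particles = []
    for r, c in chosen:
        # Jitter uniformly inside the chosen cell so particles do not stack.
        x = (c + rng.random()) * cell_size
        y = (r + rng.random()) * cell_size
        particles.append((x, y))
    return particles


if __name__ == "__main__":
    grid = [[0.0, 0.2],
            [0.0, 0.8]]  # left column is ruled out entirely
    print(init_particles(grid, 5))
```

Compared with uniform initialization, all particles start in plausible cells, which is the intuition behind the faster convergence the abstract reports.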
[AI-86] Geographical Node Clustering and Grouping to Guarantee Data IIDness in Federated Learning
Link: https://arxiv.org/abs/2410.15693
Authors: Minkwon Lee, Hyoil Kim, Changhee Joo
Keywords: Federated learning, Federated, large number, non-IID dataset problem, non-IID dataset
Categories: Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: 10 pages, 7 figures
Abstract: Federated learning (FL) is a decentralized AI mechanism suitable for a large number of devices, as in smart IoT. A major challenge of FL is the non-IID dataset problem, which originates from the heterogeneous data collected by FL participants and degrades the performance of the trained global model. There have been various attempts to rectify non-IID datasets, mostly focusing on manipulating the collected data. This paper, however, proposes a novel approach that ensures data IIDness by properly clustering and grouping mobile IoT nodes based on their geographical characteristics, so that each FL group can achieve an IID dataset. We first provide experimental evidence for the independence and identicalness of IoT data as a function of inter-device distance, and then propose Dynamic Clustering and Partial-Steady Grouping algorithms that partition FL participants to achieve near-IIDness in their datasets while considering device mobility. Our mechanism outperforms benchmark grouping algorithms by at least 110 times in terms of the joint cost between the number of dropout devices and the evenness in per-group device count, with only a mild increase in the number of groups, of up to 0.93 groups.
[AI-87] NetSafe: Exploring the Topological Safety of Multi-agent Networks
Link: https://arxiv.org/abs/2410.15686
Authors: Miao Yu, Shilong Wang, Guibin Zhang, Junyuan Mao, Chenlong Yin, Qijiong Liu, Qingsong Wen, Kun Wang, Yang Wang
Keywords: Large language models, showing growing applications, Large language, language models, showing growing
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:
Abstract: Large language models (LLMs) have empowered nodes within multi-agent networks with intelligence, showing growing applications in both academia and industry. However, how to prevent these networks from generating malicious information remains unexplored, and previous research on the safety of a single LLM is challenging to transfer. In this paper, we focus on the safety of multi-agent networks from a topological perspective, investigating which topological properties contribute to safer networks. To this end, we propose a general framework, NetSafe, along with an iterative RelCom interaction to unify existing diverse LLM-based agent frameworks, laying the foundation for generalized topological safety research. We identify several critical phenomena when multi-agent networks are exposed to attacks involving misinformation, bias, and harmful information, termed Agent Hallucination and Aggregation Safety. Furthermore, we find that highly connected networks are more susceptible to the spread of adversarial attacks, with task performance in a Star Graph Topology decreasing by 29.7%. In addition, our proposed static metrics align more closely with real-world dynamic evaluations than traditional graph-theoretic metrics, indicating that networks with greater average distances from attackers exhibit enhanced safety. In conclusion, our work introduces a new topological perspective on the safety of LLM-based multi-agent networks and discovers several unreported phenomena, paving the way for future research to explore the safety of such networks.
[AI-88] Revealing and Mitigating the Local Pattern Shortcuts of Mamba
Link: https://arxiv.org/abs/2410.15678
Authors: Wangjie You, Zecheng Tang, Juntao Li, Lili Yao, Min Zhang
Keywords: Large language models, Large language, advanced significantly due, memory demands limit, State Space Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract: Large language models (LLMs) have advanced significantly due to the attention mechanism, but their quadratic complexity and linear memory demands limit their performance on long-context tasks. Recently, researchers introduced Mamba, an advanced model built upon State Space Models (SSMs) that offers linear complexity and constant memory. Although Mamba is reported to match or surpass the performance of attention-based models, our analysis reveals a performance gap: Mamba excels in tasks that involve localized key information but faces challenges with tasks that require handling distributed key information. Our controlled experiments suggest that this inconsistency arises from Mamba’s reliance on local pattern shortcuts, which enable the model to remember local key information within its limited memory but hinder its ability to retain more dispersed information. Therefore, we introduce a global selection module into the Mamba model to address this issue. Experiments on both existing and proposed synthetic tasks, as well as real-world tasks, demonstrate the effectiveness of our method. Notably, with the introduction of only 4M extra parameters, our approach enables the Mamba model (130M) to achieve a significant improvement on tasks with distributed information, increasing its performance from 0 to 80.54 points.
[AI-89] Learning to Generate and Evaluate Fact-checking Explanations with Transformers
Link: https://arxiv.org/abs/2410.15669
Authors: Darius Feher, Abdullah Khered, Hao Zhang, Riza Batista-Navarro, Viktor Schlegel
Keywords: assessing information veracity, era increasingly dominated, Explainable Artificial Intelligence, digital platforms
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Forthcoming in Engineering Applications of Artificial Intelligence
Abstract: In an era increasingly dominated by digital platforms, the spread of misinformation poses a significant challenge, highlighting the need for solutions capable of assessing information veracity. Our research contributes to the field of Explainable Artificial Intelligence (XAI) by developing transformer-based fact-checking models that contextualise and justify their decisions by generating human-accessible explanations. Importantly, we also develop models for automatic evaluation of explanations for fact-checking verdicts across different dimensions such as (self)-contradiction, hallucination, convincingness and overall quality. By introducing human-centred evaluation methods and developing specialised datasets, we emphasise the need for aligning Artificial Intelligence (AI)-generated explanations with human judgements. This approach not only advances theoretical knowledge in XAI but also holds practical implications by enhancing the transparency, reliability and users’ trust in AI-driven fact-checking systems. Furthermore, the development of our metric learning models is a first step towards potentially increasing efficiency and reducing reliance on extensive manual assessment. Based on experimental results, our best performing generative model achieved a ROUGE-1 score of 47.77, demonstrating superior performance in generating fact-checking explanations, particularly when provided with high-quality evidence. Additionally, the best performing metric learning model showed a moderately strong correlation with human judgements on objective dimensions such as (self)-contradiction and hallucination, achieving a Matthews Correlation Coefficient (MCC) of around 0.7.
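For readers unfamiliar with the Matthews Correlation Coefficient cited above, here is a minimal sketch of the standard binary-label formula. It assumes labels encoded as 0/1 and is not taken from the paper's code; the function name and the toy labels are illustrative only.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Binary MCC: (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    human = [1, 1, 1, 0, 0, 0]   # toy human judgements
    model = [1, 1, 0, 0, 0, 1]   # toy model predictions
    print(round(matthews_corrcoef(human, model), 3))
```

MCC ranges from -1 (total disagreement) through 0 (chance level) to 1 (perfect agreement), so a value around 0.7 indicates fairly strong agreement with the human labels.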
[AI-90] RAC: Efficient LLM Factuality Correction with Retrieval Augmentation
Link: https://arxiv.org/abs/2410.15667
Authors: Changmao Li, Jeffrey Flanigan
Keywords: Large Language Models, natural language processing, exhibit impressive results, produce factually incorrect, Large Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract: Large Language Models (LLMs) exhibit impressive results across a wide range of natural language processing (NLP) tasks, yet they can often produce factually incorrect outputs. This paper introduces a simple but effective low-latency post-correction method, Retrieval Augmented Correction (RAC), aimed at enhancing the factual performance of LLMs without requiring additional fine-tuning. Our method is general and can be used with any instruction-tuned LLM, and has greatly reduced latency compared to prior approaches. RAC decomposes the LLM’s output into atomic facts and applies a fine-grained verification and correction process with retrieved content to verify and correct the LLM-generated output. Our extensive experiments show that RAC yields up to 30% improvements over state-of-the-art baselines across two popular factuality evaluation datasets, validating its efficacy and robustness both with and without the integration of Retrieval-Augmented Generation (RAG) across different LLMs. Our code is at this https URL
[AI-91] Long Term Memory: The Foundation of AI Self-Evolution
Link: https://arxiv.org/abs/2410.15665
Authors: Xun Jiang, Feng Li, Han Zhao, Jiaying Wang, Jun Shao, Shihao Xu, Shu Zhang, Weiling Chen, Xavier Tang, Yize Chen, Mengyue Wu, Weizhi Ma, Mengdi Wang, Tianqiao Chen
Keywords: Large language models, achieving human-level performance, Large language, demonstrated impressive capabilities, language understanding
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 56 pages, 13 figures
Abstract:Large language models (LLMs) like GPTs, trained on vast datasets, have demonstrated impressive capabilities in language understanding, reasoning, and planning, achieving human-level performance in various tasks. Most studies focus on enhancing these models by training on ever-larger datasets to build more powerful foundation models. While training stronger models is important, enabling models to evolve during inference is equally crucial, a process we refer to as AI self-evolution. Unlike large-scale training, self-evolution may rely on limited data or interactions. Inspired by the columnar organization of the human cerebral cortex, we hypothesize that AI models could develop cognitive abilities and build internal representations through iterative interactions with their environment. To achieve this, models need long-term memory (LTM) to store and manage processed interaction data. LTM supports self-evolution by representing diverse experiences across environments and agents. In this report, we explore AI self-evolution and its potential to enhance models during inference. We examine LTM’s role in lifelong learning, allowing models to evolve based on accumulated interactions. We outline the structure of LTM and the systems needed for effective data retention and representation. We also classify approaches for building personalized models with LTM data and show how these models achieve self-evolution through interaction. Using LTM, our multi-agent framework OMNE achieved first place on the GAIA benchmark, demonstrating LTM’s potential for AI self-evolution. Finally, we present a roadmap for future research, emphasizing the importance of LTM for advancing AI technology and its practical applications.
[AI-92] LightFusionRec: Lightweight Transformers-Based Cross-Domain Recommendation Model
Link: https://arxiv.org/abs/2410.15656
Authors: Vansh Kharidia, Dhruvi Paprunia, Prashasti Kanikar
Keywords: textual feature extraction, paper presents LightFusionRec, paper presents, integrates DistilBERT, DistilBERT for textual
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract: This paper presents LightFusionRec, a novel lightweight cross-domain recommendation system that integrates DistilBERT for textual feature extraction and FastText for genre embedding. The methodology addresses important issues in recommendation systems, such as data sparsity, computational efficiency, and cold start. By fusing genre vector embeddings with natural language processing algorithms, LightFusionRec uses a small amount of information to produce precise and contextually relevant recommendations across many media formats. Tests conducted on extensive movie and book datasets show notable improvements in recommendation quality compared to conventional methods. Because of its lightweight design, the model can be used for a variety of purposes and allows for on-device inference. LightFusionRec is a noteworthy development in cross-domain recommendation systems, providing accurate and scalable recommendations to improve user experience on digital content platforms.
[AI-93] Opportunities and Challenges of Generative-AI in Finance
Link: https://arxiv.org/abs/2410.15653
Authors: Akshar Prabhu Desai, Ganesh Satish Mallya, Mohammad Luqman, Tejasvi Ravi, Nithya Kota, Pranjul Yadav
Keywords: Machine Learning, created widespread impact, Gen-AI techniques, Gen-AI, mining have created
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract: Machine Learning and data mining have created widespread impact across various domains. However, these techniques are limited in their ability to reason, understand and generalize with respect to language-specific tasks. These challenges were overcome with the advancement of LLMs/Gen-AI. Gen-AI techniques improve understanding of context and nuance in language modeling, translate between languages, handle large volumes of data, provide fast, low-latency responses, and can be fine-tuned for various tasks and domains. In this manuscript, we present a comprehensive overview of the applications of Gen-AI techniques in the finance domain. In particular, we present the opportunities and challenges associated with the usage of Gen-AI techniques in finance. We also illustrate the various methodologies which can be used to train Gen-AI and present the various application areas of Gen-AI techniques in the finance ecosystem. To the best of our knowledge, this work represents the most comprehensive summarization of Gen-AI techniques within the financial domain. The analysis is designed to give a deep overview of areas marked for substantial advancement while pinpointing those warranting future prioritization. We also hope that this work will serve as a conduit between finance and other domains, thus fostering the cross-pollination of innovative concepts and practices.
[AI-94] Voice-Enabled AI Agents can Perform Common Scams
Link: https://arxiv.org/abs/2410.15650
Authors: Richard Fang, Dylan Bowman, Daniel Kang
Keywords: highly capable LLMs, Recent advances, advances in multi-modal, highly capable, capable LLMs
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract: Recent advances in multi-modal, highly capable LLMs have enabled voice-enabled AI agents. These agents are enabling new applications, such as voice-enabled autonomous customer service. However, as with all AI capabilities, these new capabilities have the potential for dual use. In this work, we show that voice-enabled AI agents can perform the actions necessary to carry out common scams. To do so, we select a list of common scams collected by the government and construct voice-enabled agents with directions to perform these scams. We conduct experiments on our voice-enabled agents and show that they can indeed perform the actions necessary to autonomously carry out such scams. Our results raise questions around the widespread deployment of voice-enabled AI agents.
[AI-95] Boosting Jailbreak Transferability for Large Language Models
Link: https://arxiv.org/abs/2410.15645
Authors: Hanqing Liu, Lifeng Zhou, Huanqian Yan
Keywords: Large language models, produce harmful content, drawn significant attention, circumvent security measures, Large language
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract: Large language models have drawn significant attention to the challenge of safe alignment, especially regarding jailbreak attacks that circumvent security measures to produce harmful content. To address the limitations of existing methods like GCG, which perform well in single-model attacks but lack transferability, we propose several enhancements, including a scenario induction template, optimized suffix selection, and the integration of a re-suffix attack mechanism to reduce inconsistent outputs. Our approach has shown superior performance in extensive experiments across various benchmarks, achieving nearly 100% success rates in both attack execution and transferability. Notably, our method won the online first place in the AISG-hosted Global Challenge for Safe and Secure LLMs.
[AI-96] Procedural Content Generation in Games: A Survey with Insights on Emerging LLM Integration
Link: https://arxiv.org/abs/2410.15644
Authors: Mahdi Farrokhi Maleki, Richard Zhao
Keywords: Procedural Content Generation, Procedural Content, Content Generation, automatic creation, Large Language Models
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract: Procedural Content Generation (PCG) is defined as the automatic creation of game content using algorithms. PCG has a long history in both the game industry and the academic world. It can increase player engagement and ease the work of game designers. While recent advances in deep learning approaches in PCG have enabled researchers and practitioners to create more sophisticated content, it is the arrival of Large Language Models (LLMs) that truly disrupted the trajectory of PCG advancement. This survey explores the differences between various algorithms used for PCG, including search-based methods, machine learning-based methods, other frequently used methods (e.g., noise functions), and the newcomer, LLMs. We also provide a detailed discussion on combined methods. Furthermore, we compare these methods based on the type of content they generate and the publication dates of their respective papers. Finally, we identify gaps in the existing academic work and suggest possible directions for future research.
[AI-97] Resource-Efficient Medical Report Generation using Large Language Models
Link: https://arxiv.org/abs/2410.15642
Authors: Abdullah, Ameer Hamza, Seong Tae Kim
Keywords: chest X-ray images, X-ray images, chest X-ray, automatically writing radiology, automatically writing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract: Medical report generation is the task of automatically writing radiology reports for chest X-ray images. Manually composing these reports is a time-consuming process that is also prone to human error. Generating medical reports automatically can therefore help reduce the burden on radiologists and promote greater clinical automation in the medical domain. In this work, we propose a new framework leveraging vision-enabled Large Language Models (LLMs) for the task of medical report generation. We introduce a lightweight solution that achieves better or comparable performance compared with previous solutions on this task. We conduct extensive experiments exploring different model sizes and enhancement approaches, such as prefix tuning, to improve the text generation abilities of the LLMs. We evaluate our approach on a prominent large-scale radiology report dataset, MIMIC-CXR. Our results demonstrate the capability of our resource-efficient framework to generate patient-specific reports with strong medical contextual understanding and high precision.
[AI-98] Selecting Influential Samples for Long Context Alignment via Homologous Models Guidance and Contextual Awareness Measurement
Link: https://arxiv.org/abs/2410.15633
Authors: Shuzheng Si, Haozhe Zhao, Gang Chen, Yunshui Li, Kangyang Luo, Chuancheng Lv, Kaikai An, Fanchao Qi, Baobao Chang, Maosong Sun
Keywords: large language models, extremely long contexts, long-range dependencies, fully investigated, expansion of large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:The expansion of large language models to effectively handle instructions with extremely long contexts has yet to be fully investigated. The primary obstacle lies in constructing a high-quality long instruction-following dataset devised for long context alignment. Existing studies have attempted to scale up the available data volume by synthesizing long instruction-following samples. However, indiscriminately increasing the quantity of data without a well-defined strategy for ensuring data quality may introduce low-quality samples and restrict the final performance. To bridge this gap, we aim to address the unique challenge of long-context alignment, i.e., modeling the long-range dependencies for handling instructions and lengthy input contexts. We propose GATEAU, a novel framework designed to identify the influential and high-quality samples enriched with long-range dependency relations by utilizing crafted Homologous Models’ Guidance (HMG) and Contextual Awareness Measurement (CAM). Specifically, HMG attempts to measure the difficulty of generating corresponding responses due to the long-range dependencies, using the perplexity scores of the response from two homologous models with different context windows. Also, the role of CAM is to measure the difficulty of understanding the long input contexts due to long-range dependencies by evaluating whether the model’s attention is focused on important segments. Built upon both proposed methods, we select the most challenging samples as the influential data to effectively frame the long-range dependencies, thereby achieving better performance of LLMs. Comprehensive experiments indicate that GATEAU effectively identifies samples enriched with long-range dependency relations and the model trained on these selected samples exhibits better instruction-following and long-context understanding capabilities.
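The abstract above says HMG compares the perplexity of a response under two homologous models with different context windows, but it does not state how the two scores are combined. The sketch below therefore only illustrates the ingredients: perplexity computed from per-token log-probabilities, and a plain difference used as an assumed stand-in for the paper's actual score. The function names and the toy log-probabilities are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood), natural-log inputs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_gap(short_ctx_logprobs, long_ctx_logprobs):
    """Assumed score: how much worse the short-context model predicts
    the same response than the long-context model does."""
    return perplexity(short_ctx_logprobs) - perplexity(long_ctx_logprobs)


if __name__ == "__main__":
    # Toy per-token log-probs of one response under the two models.
    short_model = [math.log(0.25)] * 4   # perplexity 4
    long_model = [math.log(0.5)] * 4     # perplexity 2
    print(perplexity_gap(short_model, long_model))
```

Under this assumption, a large positive gap suggests the response depends on context that only the longer window can see, which matches the abstract's notion of a sample rich in long-range dependencies.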
[AI-99] Improving Parallel Program Performance Through DSL-Driven Code Generation with LLM Optimizers
Link: https://arxiv.org/abs/2410.15625
Authors: Anjiang Wei, Allen Nie, Thiago S. F. X. Teixeira, Rohan Yadav, Wonchan Lee, Ke Wang, Alex Aiken
Keywords: Mapping computations, computations to processors, processors and assigning, assigning data, data to memory
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 26 pages, 8 figures
Abstract: Mapping computations to processors and assigning data to memory are critical for maximizing performance in parallel programming. These mapping decisions are managed through the development of specialized low-level system code, called mappers, crafted by performance engineers. Each mapper is tailored to a specific application and optimized for the underlying machine architecture, a process that requires days of refinement and tuning from an expert. Despite advances in system research, automating mapper generation remains a challenge due to the complexity of making millions of decisions to find the optimal solution and generate the solution as code. We introduce an approach that leverages recent advances in LLM-based optimizers for mapper design. In under ten minutes, our method automatically discovers mappers that surpass human expert designs in scientific applications, with up to 1.34X speedup. For parallel matrix multiplication algorithms, our mapper achieves up to 1.31X the performance of the expert-designed solution. To achieve this, we reduce the complexity of low-level code generation by introducing a domain-specific language (DSL) that abstracts the low-level system programming details and defines a structured search space for LLMs to explore. To maximize the application performance, we use an LLM optimizer to improve an agentic system that generates the mapper code. As a result, this approach significantly reduces the workload for performance engineers while achieving substantial performance gains across diverse applications. Finally, our results demonstrate the effectiveness of LLM-based optimization in system design and suggest its potential for addressing other complex system challenges.
[AI-100] Weighted Diversified Sampling for Efficient Data-Driven Single-Cell Gene-Gene Interaction Discovery
Link: https://arxiv.org/abs/2410.15616
Authors: Yifan Wu, Yuntao Yang, Zirui Liu, Zhao Li, Khushbu Pahwa, Rongbin Li, Wenjin Zheng, Xia Hu, Zhaozhuo Xu
Keywords: complex human diseases, Gene-gene interactions play, human diseases, Gene-gene interactions, play a crucial
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Gene-gene interactions play a crucial role in the manifestation of complex human diseases. Uncovering significant gene-gene interactions is a challenging task. Here, we present an innovative approach utilizing data-driven computational tools, leveraging an advanced Transformer model, to unearth noteworthy gene-gene interactions. Despite the efficacy of Transformer models, their parameter intensity presents a bottleneck in data ingestion, hindering data efficiency. To mitigate this, we introduce a novel weighted diversified sampling algorithm. This algorithm computes the diversity score of each data sample in just two passes of the dataset, facilitating efficient subset generation for interaction discovery. Our extensive experimentation demonstrates that by sampling a mere 1% of the single-cell dataset, we achieve performance comparable to that of utilizing the entire dataset.
[AI-101] Reinforced Imitative Trajectory Planning for Urban Automated Driving
Link: https://arxiv.org/abs/2410.15607
Authors: Di Zeng, Ling Zheng, Xiantong Yang, Yinong Li
Keywords: urban automated driving, automated driving due, automated driving, designing reward functions, difficulty in designing
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 19 pages, 9 figures
Abstract:Reinforcement learning (RL) faces challenges in trajectory planning for urban automated driving due to the poor convergence of RL and the difficulty in designing reward functions. The convergence problem is alleviated by combining RL with supervised learning. However, most existing approaches only reason one step ahead and lack the capability to plan for multiple future steps. Besides, although inverse reinforcement learning holds promise for solving the reward function design issue, existing methods for automated driving impose a linear structure assumption on reward functions, making them difficult to apply to urban automated driving. In light of these challenges, this paper proposes a novel RL-based trajectory planning method that integrates RL with imitation learning to enable multi-step planning. Furthermore, a transformer-based Bayesian reward function is developed, providing effective reward signals for RL in urban scenarios. Moreover, a hybrid-driven trajectory planning framework is proposed to enhance safety and interpretability. The proposed methods were validated on the large-scale real-world urban automated driving nuPlan dataset. The results demonstrated the significant superiority of the proposed methods over the baselines in terms of the closed-loop metrics. The code is available at this https URL.
[AI-102] Deep Active Learning with Manifold-preserving Trajectory Sampling
Link: https://arxiv.org/abs/2410.15605
Authors: Yingrui Ji, Vijaya Sindhoori Kaza, Nishanth Artham, Tianyang Wang
Keywords: minimizing labeling effort, enhance model performance, Active learning, labeling effort, unlabeled data
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract: Active learning (AL) optimizes the selection of unlabeled data for annotation (labeling), aiming to enhance model performance while minimizing labeling effort. The key question in AL is which unlabeled data should be selected for annotation. Existing deep AL methods arguably suffer from bias incurred by labeled data, which makes up a much smaller share than unlabeled data in the AL context. We observe that this issue is severe across different types of data, both vision and non-vision. To address it, we propose a novel method, Manifold-Preserving Trajectory Sampling (MPTS), which aims to enforce the feature space learned from labeled data to represent a more accurate manifold. By doing so, we expect to effectively correct the bias incurred by labeled data, which can cause a biased selection of unlabeled data. Despite its focus on the manifold, the proposed method can be conveniently implemented by performing distribution mapping with MMD (Maximum Mean Discrepancy). Extensive experiments on various vision and non-vision benchmark datasets demonstrate the superiority of our method. Our source code can be found here.
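The abstract above mentions distribution mapping with MMD. As background, here is a minimal sketch of the standard biased estimator of squared MMD with a Gaussian (RBF) kernel. It is not the MPTS implementation: the kernel choice, the `gamma` bandwidth parameter, and the toy feature vectors are assumptions made for illustration.

```python
import math


def rbf_kernel(a, b, gamma):
    """Gaussian kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate: mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    def mean_kernel(group_a, group_b):
        total = sum(rbf_kernel(a, b, gamma) for a in group_a for b in group_b)
        return total / (len(group_a) * len(group_b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


if __name__ == "__main__":
    labeled = [(0.0, 0.0), (0.0, 1.0)]      # toy labeled features
    unlabeled = [(5.0, 5.0), (5.0, 6.0)]    # toy unlabeled features
    print(mmd_squared(labeled, labeled))    # identical samples give 0
    print(mmd_squared(labeled, unlabeled))  # distant samples give a large value
```

A value near zero means the two feature sets look alike under the kernel, so a training loss that penalizes this quantity pulls the labeled-data features toward the overall data distribution.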
[AI-103] P-YOLOv8: Efficient and Accurate Real-Time Detection of Distracted Driving
Link: https://arxiv.org/abs/2410.15602
Authors: Mohamed R. Elshamy, Heba M. Emara, Mohamed R. Shoaib, Abdel-Hameed A. Badawy
Keywords: Distracted driving, critical safety issue, injuries worldwide, critical safety, leads to numerous
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract: Distracted driving is a critical safety issue that leads to numerous fatalities and injuries worldwide. This study addresses the urgent need for efficient and real-time machine learning models to detect distracted driving behaviors. It introduces a real-time object detection system built on the Pretrained YOLOv8 (P-YOLOv8) model and optimized for both speed and accuracy. This approach addresses the computational constraints and latency limitations commonly associated with conventional detection models. The study demonstrates P-YOLOv8’s versatility in both object detection and image classification tasks using the Distracted Driver Detection dataset from State Farm, which includes 22,424 images across ten behavior categories. Our research explores the application of P-YOLOv8 for image classification, evaluating its performance compared to deep learning models such as VGG16, VGG19, and ResNet. Some traditional models struggle with low accuracy, while others achieve high accuracy but come with high computational costs and slow detection speeds, making them unsuitable for real-time applications. P-YOLOv8 addresses these issues by achieving competitive accuracy with significant computational cost and efficiency advantages. In particular, P-YOLOv8 yields a lightweight model of only 2.84 MB with a lower number of parameters, totaling 1,451,098, owing to its innovative architecture. It achieves a high accuracy of 99.46% with this small model size, opening new directions for deployment on inexpensive and small embedded devices using Tiny Machine Learning (TinyML). The experimental results show robust performance, making P-YOLOv8 a cost-effective solution for real-time deployment. This study provides a detailed analysis of P-YOLOv8’s architecture, training, and performance benchmarks, highlighting its potential for real-time use in detecting distracted driving.
[AI-104] Patrol Security Game: Defending Against Adversary with Freedom in Attack Timing, Location, and Duration
Link: https://arxiv.org/abs/2410.15600
Authors: Hao-Tsung Yang,Ting-Kai Weng,Ting-Yu Chang,Kin Sum Liu,Shan Lin,Jie Gao,Shih-Yu Tsai
Keywords: extensive-form Stackelberg game, Patrol Security Game, Security Game, Stackelberg game, extensive-form Stackelberg
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Robotics (cs.RO)
Comments: Under review of TCPS
Abstract:We explored the Patrol Security Game (PSG), a robotic patrolling problem modeled as an extensive-form Stackelberg game, where the attacker determines the timing, location, and duration of their attack. Our objective is to devise a patrolling schedule with an infinite time horizon that minimizes the attacker’s payoff. We demonstrated that PSG can be transformed into a combinatorial minimax problem with a closed-form objective function. By constraining the defender’s strategy to a time-homogeneous first-order Markov chain (i.e., the patroller’s next move depends solely on their current location), we proved that the optimal solution in cases of zero penalty involves either minimizing the expected hitting time or return time, depending on the attacker model, and that these solutions can be computed efficiently. Additionally, we observed that increasing the randomness in the patrol schedule reduces the attacker’s expected payoff in high-penalty cases. However, the minimax problem becomes non-convex in other scenarios. To address this, we formulated a bi-criteria optimization problem incorporating two objectives: expected maximum reward and entropy. We proposed three graph-based algorithms and one deep reinforcement learning model, designed to efficiently balance the trade-off between these two objectives. Notably, the third algorithm can identify the optimal deterministic patrol schedule, though its runtime grows exponentially with the number of patrol spots. Experimental results validate the effectiveness and scalability of our solutions, demonstrating that our approaches outperform state-of-the-art baselines on both synthetic and real-world crime datasets.
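The two quantities this abstract keeps returning to, expected hitting time and schedule randomness (entropy), can both be computed for a small first-order Markov patrol. The sketch below is only an illustration under stated assumptions: the transition matrix is a made-up example, the chain is assumed ergodic, and the function names are hypothetical rather than taken from the paper.

```python
import math

def expected_hitting_times(P, target, iters=10_000, tol=1e-12):
    """Expected number of steps to first reach `target` from each state.

    Solves h[i] = 1 + sum_{j != target} P[i][j] * h[j] by fixed-point iteration.
    """
    n = len(P)
    h = [0.0] * n
    for _ in range(iters):
        new_h = [
            0.0 if i == target
            else 1.0 + sum(P[i][j] * h[j] for j in range(n) if j != target)
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(h, new_h)) < tol:
            return new_h
        h = new_h
    return h

def stationary_distribution(P, iters=10_000):
    """Power iteration; assumes the chain is ergodic (an assumption of this sketch)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def entropy_rate(P):
    """Entropy rate (in nats) of the patrol schedule: -sum_i pi_i sum_j P_ij log P_ij."""
    pi = stationary_distribution(P)
    n = len(P)
    return -sum(
        pi[i] * P[i][j] * math.log(P[i][j])
        for i in range(n) for j in range(n) if P[i][j] > 0
    )

# Hypothetical 3-site patrol: from each site, move to either other site with probability 1/2.
uniform_patrol = [
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
]
```

Comparing `uniform_patrol` with a deterministic cycle shows the tension the abstract describes: the cycle has zero entropy, while the randomized walk is less predictable but reaches a given site on a different expected schedule.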
[AI-105] A Comprehensive Comparative Study of Individual ML Models and Ensemble Strategies for Network Intrusion Detection Systems
Link: https://arxiv.org/abs/2410.15597
Authors: Ismail Bibers,Osvaldo Arreche,Mustafa Abdallah
Keywords: network intrusion detection, devising artificial intelligence, intrusion detection systems, intrusion detection, network intrusion
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:The escalating frequency of intrusions in networked systems has spurred the exploration of new research avenues in devising artificial intelligence (AI) techniques for intrusion detection systems (IDS). Various AI techniques have been used to automate network intrusion detection tasks, yet each model possesses distinct strengths and weaknesses. Selecting the optimal model for a given dataset can pose a challenge, necessitating the exploration of ensemble methods to enhance generalization and applicability in network intrusion detection. This paper addresses this gap by conducting a comprehensive evaluation of diverse individual models and both simple and advanced ensemble methods for network IDS. We introduce an ensemble learning framework tailored for assessing individual models and ensemble methods in network intrusion detection tasks. Our framework encompasses the loading of input datasets, training of individual models and ensemble methods, and the generation of evaluation metrics. Furthermore, we incorporate all features across individual models and ensemble techniques. The study presents results for our framework, encompassing 14 methods, including various bagging, stacking, blending, and boosting techniques applied to multiple base learners such as decision trees and neural networks, among others. We evaluate the framework using two distinct network intrusion datasets, RoEduNet-SIMARGL2021 and CICIDS-2017, each possessing unique characteristics. Additionally, we categorize AI models based on their performances on our evaluation metrics and via their confusion matrices. Our assessment demonstrates the efficacy of learning across most setups explored in this study. Furthermore, we contribute to the community by releasing our source code, providing a foundational ensemble learning framework for network intrusion detection.
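As a rough illustration of what combining base learners means here, the sketch below implements hard majority voting, one of the simplest ensemble strategies. It is an assumption that voting resembles the "simple" ensembles in the study; the abstract does not say so. The three threshold rules and the feature names are invented for the example.

```python
from collections import Counter

def majority_vote(classifiers, sample):
    """Return the label predicted by the largest number of base classifiers."""
    votes = Counter(clf(sample) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Hypothetical base learners: each flags a network flow from a single feature.
def by_packet_rate(flow):
    return "attack" if flow["packets_per_s"] > 1000 else "benign"

def by_failed_logins(flow):
    return "attack" if flow["failed_logins"] > 5 else "benign"

def by_port_count(flow):
    return "attack" if flow["distinct_ports"] > 100 else "benign"

ensemble = [by_packet_rate, by_failed_logins, by_port_count]
```

With an odd number of binary classifiers there are no ties, which is why the example uses three base learners.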
[AI-106] A Comprehensive Survey of Datasets, Theories, Variants, and Applications in Direct Preference Optimization
Link: https://arxiv.org/abs/2410.15595
Authors: Wenyi Xiao,Zechuan Wang,Leilei Gan,Shuai Zhao,Wanggui He,Luu Anh Tuan,Long Chen,Hao Jiang,Zhou Zhao,Fei Wu
Keywords: aligning policy models, large language models, Direct Preference Optimization, aligning policy, increasingly critical
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract:With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, acting as an RL-free alternative to Reinforcement Learning from Human Feedback (RLHF). Despite DPO’s various advancements and inherent limitations, an in-depth review of these aspects is currently lacking in the literature. In this work, we present a comprehensive review of the challenges and opportunities in DPO, covering theoretical analyses, variants, relevant preference datasets, and applications. Specifically, we categorize recent studies on DPO based on key research questions to provide a thorough understanding of DPO’s current landscape. Additionally, we propose several future research directions to offer insights on model alignment for the research community.
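For readers unfamiliar with why DPO counts as RL-free, the standard DPO objective for a single preference pair is just a logistic loss on log-probability ratios between the policy and a frozen reference model. The sketch below computes that per-pair loss. The argument names and the default `beta` are illustrative choices, not something the survey prescribes.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(m)) equals softplus(-m)
    return softplus(-margin)
```

When the policy equals the reference, the margin is zero and the loss is log 2. The loss falls as the policy raises the chosen response's likelihood relative to the rejected one.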
[AI-107] AMPLE: Emotion-Aware Multimodal Fusion Prompt Learning for Fake News Detection
Link: https://arxiv.org/abs/2410.15591
Authors: Xiaoman Xu,Xiangrun Li,Taihang Wang,Ye Jiang
Keywords: Detecting fake, diversity and complexity, challenging due, traditional approaches
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract:Detecting fake news in large datasets is challenging due to its diversity and complexity, with traditional approaches often focusing on textual features while underutilizing semantic and emotional elements. Current methods also rely heavily on large annotated datasets, limiting their effectiveness in more nuanced analysis. To address these challenges, this paper introduces the Emotion-Aware Multimodal Fusion Prompt Learning (AMPLE) framework, which combines text sentiment analysis with multimodal data and hybrid prompt templates. This framework extracts emotional elements from texts by leveraging sentiment analysis tools. It then employs Multi-Head Cross-Attention (MCA) mechanisms and similarity-aware fusion methods to integrate multimodal data. The proposed AMPLE framework demonstrates strong performance on two public datasets in both few-shot and data-rich settings, with results indicating the potential of emotional aspects in fake news detection. Furthermore, the study explores the impact of integrating large language models with this method for text sentiment extraction, revealing substantial room for further improvement. The code can be found at this https URL
[AI-108] OpenMU: Your Swiss Army Knife for Music Understanding
Link: https://arxiv.org/abs/2410.15573
Authors: Mengjie Zhao,Zhi Zhong,Zhuoyuan Mao,Shiqi Yang,Wei-Hsiang Liao,Shusuke Takahashi,Hiromi Wakaki,Yuki Mitsufuji
Keywords: large-scale benchmark suite, data scarcity issue, training multimodal language, multimodal language models, large-scale benchmark
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: Resources: this https URL
Abstract:We present OpenMU-Bench, a large-scale benchmark suite for addressing the data scarcity issue in training multimodal language models to understand music. To construct OpenMU-Bench, we leveraged existing datasets and bootstrapped new annotations. OpenMU-Bench also broadens the scope of music understanding by including lyrics understanding and music tool usage. Using OpenMU-Bench, we trained our music understanding model, OpenMU, with extensive ablations, demonstrating that OpenMU outperforms baseline models such as MU-Llama. Both OpenMU and OpenMU-Bench are open-sourced to facilitate future research in music understanding and to enhance creative music production efficiency.
[AI-109] Leveraging Retrieval-Augmented Generation for Culturally Inclusive Hakka Chatbots: Design Insights and User Perceptions
Link: https://arxiv.org/abs/2410.15572
Authors: Chen-Chi Chang,Han-Pi Chang,Hung-Shin Lee
Keywords: Taiwanese Hakka culture, Retrieval-Augmented Generation, heritage of Taiwanese, Taiwanese Hakka, technological innovation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to IEEE RASSE 2024
Abstract:In an era where cultural preservation is increasingly intertwined with technological innovation, this study introduces a groundbreaking approach to promoting and safeguarding the rich heritage of Taiwanese Hakka culture through the development of a Retrieval-Augmented Generation (RAG)-enhanced chatbot. Traditional large language models (LLMs), while powerful, often fall short in delivering accurate and contextually rich responses, particularly in culturally specific domains. By integrating external databases with generative AI models, RAG technology bridges this gap, empowering chatbots to not only provide precise answers but also resonate deeply with the cultural nuances that are crucial for authentic interactions. This study delves into the intricate process of augmenting the chatbot’s knowledge base with targeted cultural data, specifically curated to reflect the unique aspects of Hakka traditions, language, and practices. Through dynamic information retrieval, the RAG-enhanced chatbot becomes a versatile tool capable of handling complex inquiries that demand an in-depth understanding of Hakka cultural context. This is particularly significant in an age where digital platforms often dilute cultural identities, making the role of culturally aware AI systems more critical than ever. System usability studies conducted as part of our research reveal a marked improvement in both user satisfaction and engagement, highlighting the chatbot’s effectiveness in fostering a deeper connection with Hakka culture. The feedback underscores the potential of RAG technology to not only enhance user experience but also to serve as a vital instrument in the broader mission of ethnic mainstreaming and cultural celebration.
[AI-110] Stacking Small Language Models for Generalizability
Link: https://arxiv.org/abs/2410.15570
Authors: Laurence Liang
Keywords: Recent advances show, Recent advances, generalize strong performance, generalize strong, advances show
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:Recent advances show that large language models (LLMs) generalize strong performance across different natural language benchmarks. However, the large size of LLMs makes training and inference expensive and impractical to run in resource-limited settings. This paper introduces a new approach called fine-tuning stacks of language models (FSLM), which involves stacking small language models (SLM) as an alternative to LLMs. By fine-tuning each SLM to perform a specific task, this approach breaks down high level reasoning into multiple lower-level steps that specific SLMs are responsible for. As a result, FSLM allows for lower training and inference costs, and also improves model interpretability as each SLM communicates with the subsequent one through natural language. By evaluating FSLM on common natural language benchmarks, this paper highlights promising early results toward generalizable performance using FSLM as a cost-effective alternative to LLMs.
[AI-111] Pruning Foundation Models for High Accuracy without Retraining EMNLP2024
Link: https://arxiv.org/abs/2410.15567
Authors: Pu Zhao,Fei Sun,Xuan Shen,Pinrui Yu,Zhenglun Kong,Yanzhi Wang,Xue Lin
Keywords: deploy foundation models, large language models, parameters and computations, challenging to deploy, deploy foundation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by EMNLP 2024 findings
Abstract:Despite the superior performance, it is challenging to deploy foundation models or large language models (LLMs) due to their massive parameters and computations. While pruning is a promising technique to reduce model size and accelerate the inference, the traditional pruning techniques can hardly be applied for LLMs as they need to finetune the model on the full dataset with multiple epochs consuming massive data and hardware resources. To deal with this problem, post-training pruning methods are proposed to prune LLMs in one-shot without retraining. However, their accuracy after pruning may suffer from certain performance degradation due to the lack of retraining with massive data. To address this issue, in this paper, we first formulate the post-training problem for layer-wise LLM compression to simultaneously prune multiple weights in LLMs. Next, we provide an optimal solution for this problem and design our post-training pruning algorithm for both unstructured and semi-structured sparsity. Our extensive experiments demonstrate the superior performance of the proposed methods in comparison to SOTA baselines across various LLM families including transformer-based LLMs and Mamba-based LLMs. Code link: this https URL
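To make "semi-structured sparsity" concrete, the sketch below applies an N:M pattern (here 2:4, assumed as a common example) to one weight row by keeping the largest-magnitude entries in every group of four. Note that plain magnitude selection is a simple baseline swapped in for illustration. It is not the paper's optimal layer-wise solution, which the abstract does not spell out.

```python
def prune_n_of_m(row, n=2, m=4):
    """Keep the n largest-magnitude weights in each consecutive group of m; zero the rest."""
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    pruned = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Indices of the n entries with the largest absolute value in this group.
        keep = set(sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned

weights = [0.1, -0.9, 0.5, 0.05, 1.0, 0.2, -0.3, 0.0]
sparse = prune_n_of_m(weights)
```

Every group of four ends up with exactly two nonzero slots, which is the regular pattern that makes this form of sparsity hardware-friendly compared with unstructured pruning.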
[AI-112] Bayesian Concept Bottleneck Models with LLM Priors
Link: https://arxiv.org/abs/2410.15555
Authors: Jean Feng,Avni Kothari,Luke Zier,Chandan Singh,Yan Shuo Tan
Keywords: Concept Bottleneck Models, Bottleneck Models, Concept Bottleneck, Large Language Models, aiming to achieve
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract:Concept Bottleneck Models (CBMs) have been proposed as a compromise between white-box and black-box models, aiming to achieve interpretability without sacrificing accuracy. The standard training procedure for CBMs is to predefine a candidate set of human-interpretable concepts, extract their values from the training data, and identify a sparse subset as inputs to a transparent prediction model. However, such approaches are often hampered by the tradeoff between enumerating a sufficiently large set of concepts to include those that are truly relevant versus controlling the cost of obtaining concept extractions. This work investigates a novel approach that sidesteps these challenges: BC-LLM iteratively searches over a potentially infinite set of concepts within a Bayesian framework, in which Large Language Models (LLMs) serve as both a concept extraction mechanism and prior. BC-LLM is broadly applicable and multi-modal. Despite imperfections in LLMs, we prove that BC-LLM can provide rigorous statistical inference and uncertainty quantification. In experiments, it outperforms comparator methods including black-box models, converges more rapidly towards relevant concepts and away from spuriously correlated ones, and is more robust to out-of-distribution samples.
[AI-113] A Plug-and-Play Fully On-the-Job Real-Time Reinforcement Learning Algorithm for a Direct-Drive Tandem-Wing Experiment Platforms Under Multiple Random Operating Conditions
Link: https://arxiv.org/abs/2410.15554
Authors: Zhang Minghao,Song Bifeng,Yang Xiaojun,Wang Liang
Keywords: Concerto Reinforcement Learning, unstable aerodynamic interference, aerodynamic interference generated, biomimetic systems poses, systems poses substantial
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 63 pages, 32 figures
Abstract:The nonlinear and unstable aerodynamic interference generated by the tandem wings of such biomimetic systems poses substantial challenges for motion control, especially under multiple random operating conditions. To address these challenges, the Concerto Reinforcement Learning Extension (CRL2E) algorithm has been developed. This plug-and-play, fully on-the-job, real-time reinforcement learning algorithm incorporates a novel Physics-Inspired Rule-Based Policy Composer Strategy with a Perturbation Module alongside a lightweight network optimized for real-time control. To validate the performance and the rationality of the module design, experiments were conducted under six challenging operating conditions, comparing seven different algorithms. The results demonstrate that the CRL2E algorithm achieves safe and stable training within the first 500 steps, improving tracking accuracy by 14 to 66 times compared to the Soft Actor-Critic, Proximal Policy Optimization, and Twin Delayed Deep Deterministic Policy Gradient algorithms. Additionally, CRL2E significantly enhances performance under various random operating conditions, with improvements in tracking accuracy ranging from 8.3% to 60.4% compared to the Concerto Reinforcement Learning (CRL) algorithm. The convergence speed of CRL2E is 36.11% to 57.64% faster than the CRL algorithm with only the Composer Perturbation and 43.52% to 65.85% faster than the CRL algorithm when both the Composer Perturbation and Time-Interleaved Capability Perturbation are introduced, especially in conditions where the standard CRL struggles to converge. Hardware tests indicate that the optimized lightweight network structure excels in weight loading and average inference time, meeting real-time control requirements.
[AI-114] GRS: Generating Robotic Simulation Tasks from Real-World Images
Link: https://arxiv.org/abs/2410.15536
Authors: Alex Zook,Fan-Yun Sun,Josef Spjut,Valts Blukis,Stan Birchfield,Jonathan Tremblay
Keywords: computer vision, address the challenge, GRS, task, introduce GRS
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Abstract:We introduce GRS (Generating Robotic Simulation tasks), a novel system to address the challenge of real-to-sim in robotics, computer vision, and AR/VR. GRS enables the creation of digital twin simulations from single real-world RGB-D observations, complete with diverse, solvable tasks for virtual agent training. We use state-of-the-art vision-language models (VLMs) to achieve a comprehensive real-to-sim pipeline. GRS operates in three stages: 1) scene comprehension using SAM2 for object segmentation and VLMs for object description, 2) matching identified objects with simulation-ready assets, and 3) generating contextually appropriate robotic tasks. Our approach ensures simulations align with task specifications by generating test suites designed to verify adherence to the task specification. We introduce a router that iteratively refines the simulation and test code to ensure the simulation is solvable by a robot policy while remaining aligned to the task specification. Our experiments demonstrate the system’s efficacy in accurately identifying object correspondence, which allows us to generate task environments that closely match input environments, and enhance automated simulation task generation through our novel router mechanism.
[AI-115] Improving Clinical Documentation with AI: A Comparative Study of Sporo AI Scribe and GPT-4o mini
Link: https://arxiv.org/abs/2410.15528
Authors: Chanseo Lee,Sonu Kumar,Kimon A. Vogt,Sam Meraj
Keywords: Electronic Health Records, AI-powered medical scribes, burden in healthcare, promising solution, solution to alleviate
Categories: Artificial Intelligence (cs.AI)
Abstract:AI-powered medical scribes have emerged as a promising solution to alleviate the documentation burden in healthcare. Ambient AI scribes provide real-time transcription and automated data entry into Electronic Health Records (EHRs), with the potential to improve efficiency, reduce costs, and enhance scalability. Despite early success, the accuracy of AI scribes remains critical, as errors can lead to significant clinical consequences. Additionally, AI scribes face challenges in handling the complexity and variability of medical language and ensuring the privacy of sensitive patient data. This case study aims to evaluate Sporo Health’s AI scribe, a multi-agent system leveraging fine-tuned medical LLMs, by comparing its performance with OpenAI’s GPT-4o Mini on multiple performance metrics. Using a dataset of de-identified patient conversation transcripts, AI-generated summaries were compared to clinician-generated notes (the ground truth) based on clinical content recall, precision, and F1 scores. Evaluations were further supplemented by clinician satisfaction assessments using a modified Physician Documentation Quality Instrument revision 9 (PDQI-9), rated by both a medical student and a physician. The results show that Sporo AI consistently outperformed GPT-4o Mini, achieving higher recall, precision, and overall F1 scores. Moreover, the AI generated summaries provided by Sporo were rated more favorably in terms of accuracy, comprehensiveness, and relevance, with fewer hallucinations. These findings demonstrate that Sporo AI Scribe is an effective and reliable tool for clinical documentation, enhancing clinician workflows while maintaining high standards of privacy and security.
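The recall, precision, and F1 comparison described above can be sketched as set overlap between the clinical items in an AI summary and those in the clinician's note. How the study extracts and matches items is not stated in the abstract, so exact-match sets and the example findings below are assumptions made for illustration only.

```python
def precision_recall_f1(predicted_items, reference_items):
    """Precision, recall, and F1 of predicted items against ground-truth items."""
    predicted, reference = set(predicted_items), set(reference_items)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical content items from a summary and from the clinician's note.
summary_items = ["chest pain", "hypertension", "aspirin 81 mg", "follow-up in 2 weeks"]
note_items = ["chest pain", "hypertension", "family history of CAD"]
p, r, f1 = precision_recall_f1(summary_items, note_items)
```

In this toy case two of the four summary items are supported by the note (precision 0.5), and two of the three note items are recovered (recall 2/3). An unsupported item such as a hallucinated finding lowers precision without affecting recall.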
[AI-116] M-RewardBench: Evaluating Reward Models in Multilingual Settings
Link: https://arxiv.org/abs/2410.15522
Authors: Srishti Gureja,Lester James V. Miranda,Shayekh Bin Islam,Rishabh Maheshwary,Drishti Sharma,Gusti Winata,Nathan Lambert,Sebastian Ruder,Sara Hooker,Marzieh Fadaee
Keywords: language modeling process, modeling process, LLMs today, today by enabling, enabling the integration
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages, 6 figures, 10 tables. Website: this https URL
Abstract:Reward models (RMs) have driven the state-of-the-art performance of LLMs today by enabling the integration of human feedback into the language modeling process. However, RMs are primarily trained and evaluated in English, and their capabilities in multilingual settings remain largely understudied. In this work, we conduct a systematic evaluation of several reward models in multilingual settings. We first construct the first-of-its-kind multilingual RM evaluation benchmark, M-RewardBench, consisting of 2.87k preference instances for 23 typologically diverse languages, that tests the chat, safety, reasoning, and translation capabilities of RMs. We then rigorously evaluate a wide range of reward models on M-RewardBench, offering fresh insights into their performance across diverse languages. We identify a significant gap in RMs’ performances between English and non-English languages and show that RM preferences can change substantially from one language to another. We also present several findings on how different multilingual aspects impact RM performance. Specifically, we show that the performance of RMs is improved with improved translation quality. Similarly, we demonstrate that the models exhibit better performance for high-resource languages. We release M-RewardBench dataset and the codebase in this study to facilitate a better understanding of RM evaluation in multilingual settings.
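A common way to score a reward model on preference instances is to count how often it assigns the chosen response a higher reward than the rejected one, and then break that accuracy down by language. The sketch below assumes this protocol and a hypothetical record layout (`language`, `prompt`, `chosen`, `rejected`); neither is confirmed by the abstract, and the length-based reward is a deliberately naive stand-in for a real model.

```python
from collections import defaultdict

def accuracy_by_language(instances, score):
    """Fraction of instances per language where score(chosen) > score(rejected)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for inst in instances:
        lang = inst["language"]
        total[lang] += 1
        if score(inst["prompt"], inst["chosen"]) > score(inst["prompt"], inst["rejected"]):
            correct[lang] += 1
    return {lang: correct[lang] / total[lang] for lang in total}

def length_reward(prompt, response):
    """Toy reward that simply prefers longer responses (illustrative only)."""
    return len(response)

toy_instances = [
    {"language": "en", "prompt": "q1", "chosen": "a detailed answer", "rejected": "no"},
    {"language": "en", "prompt": "q2", "chosen": "short", "rejected": "a long wrong answer"},
    {"language": "de", "prompt": "q3", "chosen": "eine gute Antwort", "rejected": "nein"},
]
per_language = accuracy_by_language(toy_instances, length_reward)
```

Reporting the result per language rather than as one pooled number is what exposes the English versus non-English gap the abstract refers to.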
[AI-117] Exploring Curriculum Learning for Vision-Language Tasks: A Study on Small-Scale Multimodal Training CONLL
Link: https://arxiv.org/abs/2410.15509
Authors: Rohan Saha,Abrar Fahim,Alona Fyshe,Alex Murphy
Keywords: train large machine, specialized domains, learning, large machine learning, train large
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: CoNLL BabyLM Challenge 2024 camera ready
Abstract:For specialized domains, there is often not a wealth of data with which to train large machine learning models. In such limited data / compute settings, various methods exist aiming to do more with less, such as finetuning from a pretrained model, modulating difficulty levels as data are presented to a model (curriculum learning), and considering the role of model type / size. Approaches to efficient machine learning also take inspiration from human learning by considering use cases where machine learning systems have access to approximately the same number of words experienced by a 13-year-old child (100M words). We investigate the role of 3 primary variables in a limited data regime as part of the multimodal track of the BabyLM challenge. We contrast: (i) curriculum learning, (ii) pretraining (with text-only data), (iii) model type. We modulate these variables and assess them on two types of tasks: (a) multimodal (text+image), and (b) unimodal (text-only) tasks. We find that curriculum learning benefits multimodal evaluations over non-curriculum learning models, particularly when combined with text-only pretraining. On text-only tasks, curriculum learning appears to help models with smaller trainable parameter counts. We suggest possible reasons based on architectural differences and training designs as to why one might observe such results.
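Curriculum learning, as described above, amounts to ordering training data by some difficulty score and exposing the model to progressively harder material. The sketch below builds cumulative training stages from such an ordering. Using caption word count as the difficulty proxy and growing the pool cumulatively are assumptions of this example, not details taken from the paper.

```python
import math

def curriculum_stages(examples, difficulty, num_stages):
    """Sort examples from easy to hard and return cumulative training pools, one per stage."""
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    ordered = sorted(examples, key=difficulty)
    stage_size = math.ceil(len(ordered) / num_stages)
    # Stage k trains on the easiest (k + 1) * stage_size examples.
    return [ordered[: (k + 1) * stage_size] for k in range(num_stages)]

def word_count(caption):
    """Hypothetical difficulty proxy: longer captions are treated as harder."""
    return len(caption.split())

captions = [
    "a small brown dog runs across the wet grass",
    "a dog",
    "two children play football",
    "a red car",
    "an old man reads a newspaper on a bench",
]
stages = curriculum_stages(captions, word_count, num_stages=2)
```

A non-curriculum baseline would instead shuffle `captions` and train on all of them from the first step, which is the comparison the abstract reports.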
[AI-118] Anonymising Elderly and Pathological Speech: Voice Conversion Using DDSP and Query-by-Example INTERSPEECH2024
Link: https://arxiv.org/abs/2410.15500
Authors: Suhita Ghosh,Melanie Jouaiti,Arnab Das,Yamini Sinha,Tim Polzehl,Ingo Siegert,Sebastian Stober
Keywords: changing personal identifiers, Speech anonymisation aims, retaining linguistic content, protect speaker identity, anonymisation aims
Categories: Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS); Quantitative Methods (q-bio.QM)
Comments: Accepted in Interspeech 2024
Abstract:Speech anonymisation aims to protect speaker identity by changing personal identifiers in speech while retaining linguistic content. Current methods fail to retain prosody and unique speech patterns found in elderly and pathological speech domains, which is essential for remote health monitoring. To address this gap, we propose a voice conversion-based method (DDSP-QbE) using differentiable digital signal processing and query-by-example. The proposed method, trained with novel losses, aids in disentangling linguistic, prosodic, and domain representations, enabling the model to adapt to uncommon speech patterns. Objective and subjective evaluations show that DDSP-QbE significantly outperforms the voice conversion state-of-the-art concerning intelligibility, prosody, and domain preservation across diverse datasets, pathologies, and speakers while maintaining quality and speaker anonymity. Experts validate domain preservation by analysing twelve clinically pertinent domain attributes.
[AI-119] Improving Voice Quality in Speech Anonymization With Just Perception-Informed Losses NEURIPS2024
Link: https://arxiv.org/abs/2410.15499
Authors: Suhita Ghosh,Tim Thiele,Frederic Lorbeer,Frank Dreyer,Sebastian Stober
Keywords: effective speech anonymization, cloud-based speech assistants, retaining critical information, speech anonymization, subsequent tasks
Categories: Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted in NeurIPS 2024 Workshop (Audio Imagination)
Abstract:The increasing use of cloud-based speech assistants has heightened the need for effective speech anonymization, which aims to obscure a speaker’s identity while retaining critical information for subsequent tasks. One approach to achieving this is through voice conversion. While existing methods often emphasize complex architectures and training techniques, our research underscores the importance of loss functions inspired by the human auditory system. Our proposed loss functions are model-agnostic, incorporating handcrafted and deep learning-based features to effectively capture quality representations. Through objective and subjective evaluations, we demonstrate that a VQVAE-based model, enhanced with our perception-driven losses, surpasses the vanilla model in terms of naturalness, intelligibility, and prosody while maintaining speaker anonymity. These improvements are consistently observed across various datasets, languages, target speakers, and genders.
[AI-120] SEA: State-Exchange Attention for High-Fidelity Physics-Based Transformers NEURIPS2024
Link: https://arxiv.org/abs/2410.15495
Authors: Parsa Esmati,Amirhossein Dadashzadeh,Vahid Goodarzi,Nicolas Larrosa,Nicolo Grilli
Keywords: Current approaches, high rollout errors, rollout error accumulation, estimating field variables, approaches using sequential
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted in 38th Conference on Neural Information Processing Systems (NeurIPS 2024)
Abstract:Current approaches using sequential networks have shown promise in estimating field variables for dynamical systems, but they are often limited by high rollout errors. The unresolved issue of rollout error accumulation results in unreliable estimations as the network predicts further into the future, with each step’s error compounding and leading to an increase in inaccuracy. Here, we introduce the State-Exchange Attention (SEA) module, a novel transformer-based module enabling information exchange between encoded fields through multi-head cross-attention. The cross-field multidirectional information exchange design enables all state variables in the system to exchange information with one another, capturing physical relationships and symmetries between fields. In addition, we incorporate a ViT-like architecture to generate spatially coherent mesh embeddings, further improving the model’s ability to capture spatial dependencies in the data. This enhances the model’s ability to represent complex interactions between the field variables, resulting in reduced rollout error accumulation. Our results show that the Transformer model integrated with the State-Exchange Attention (SEA) module outperforms competitive baseline models, including the PbGMR-GMUS Transformer-RealNVP and GMR-GMUS Transformer, with a reduction in error of 88% and 91%, respectively, achieving state-of-the-art performance. Furthermore, we demonstrate that the SEA module alone can reduce errors by 97% for state variables that are highly dependent on other states of the system.
[AI-121] Dynamic Intelligence Assessment: Benchmarking LLMs on the Road to AGI with a Focus on Model Confidence
Link: https://arxiv.org/abs/2410.15490
Authors: Norbert Tihanyi,Tamas Bisztray,Richard A. Dubniczky,Rebeka Toth,Bertalan Borsos,Bilel Cherif,Mohamed Amine Ferrag,Lajos Muzsai,Ridhi Jain,Ryan Marinelli,Lucas C. Cordeiro,Merouane Debbah
Keywords: machine intelligence evolves, test and compare, Dynamic Intelligence Assessment, models, intelligence evolves
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Abstract:As machine intelligence evolves, the need to test and compare the problem-solving abilities of different AI models grows. However, current benchmarks are often overly simplistic, allowing models to perform uniformly well, making it difficult to distinguish their capabilities. Additionally, benchmarks typically rely on static question-answer pairs, which models might memorize or guess. To address these limitations, we introduce the Dynamic Intelligence Assessment (DIA), a novel methodology for testing AI models using dynamic question templates and improved metrics across multiple disciplines such as mathematics, cryptography, cybersecurity, and computer science. The accompanying DIA-Bench dataset, which includes 150 diverse and challenging task templates with mutable parameters, is presented in various formats such as text, PDFs, compiled binaries, and visual puzzles. Our framework introduces four new metrics to assess a model’s reliability and confidence across multiple attempts. These metrics revealed that even simple questions are frequently answered incorrectly when posed in varying forms, highlighting significant gaps in models’ reliability. Notably, models like GPT-4o tended to overestimate their mathematical abilities, while ChatGPT-4o demonstrated better decision-making and performance through effective tool usage. We evaluated eight state-of-the-art large language models (LLMs) using DIA-Bench, showing that current models struggle with complex tasks and often display unexpectedly low confidence, even with simpler questions. The DIA framework sets a new standard for assessing not only problem-solving but also a model’s adaptive intelligence and ability to assess its own limitations. The dataset is publicly available on our project’s website.
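The idea of a dynamic question template with mutable parameters can be sketched as a generator that produces a fresh instance, with its ground-truth answer, on every attempt, so a memorized answer does not help. Everything below is hypothetical: the modular-exponentiation template, the function names, and the two summary numbers are illustrative and are not the four metrics defined by the DIA framework.

```python
import random
import re

def make_modexp_instance(rng):
    """Generate one question instance and its ground-truth answer from the template."""
    base, exponent, modulus = rng.randint(2, 50), rng.randint(2, 20), rng.randint(3, 97)
    question = f"Compute {base}^{exponent} mod {modulus}."
    return question, pow(base, exponent, modulus)

def evaluate(solver, attempts, seed=0):
    """Pose `attempts` fresh instances and summarize how consistently the solver succeeds."""
    rng = random.Random(seed)
    results = []
    for _ in range(attempts):
        question, answer = make_modexp_instance(rng)
        results.append(solver(question) == answer)
    return {"accuracy": sum(results) / attempts, "all_correct": all(results)}

def tool_using_solver(question):
    """Stand-in for a model that delegates the arithmetic to a tool."""
    base, exponent, modulus = map(int, re.findall(r"\d+", question))
    return pow(base, exponent, modulus)

def memorizing_solver(question):
    """Stand-in for a model that always repeats one memorized answer."""
    return -1
```

Requiring success on every one of several re-parameterized attempts is stricter than single-shot accuracy, which is the kind of reliability gap the abstract says static question-answer pairs can hide.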
[AI-122] Generative AI Agents in Autonomous Machines: A Safety Perspective
Link: https://arxiv.org/abs/2410.15489
Authors: Jason Jabbour,Vijay Janapa Reddi
Keywords: Generative Artificial Intelligence, Artificial Intelligence, major paradigm shift, Generative Artificial, autonomous machines
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:The integration of Generative Artificial Intelligence (AI) into autonomous machines represents a major paradigm shift in how these systems operate and unlocks new solutions to problems once deemed intractable. Although generative AI agents provide unparalleled capabilities, they also have unique safety concerns. These challenges require robust safeguards, especially for autonomous machines that operate in high-stakes environments. This work investigates the evolving safety requirements when generative models are integrated as agents into physical autonomous machines, comparing these to safety considerations in less critical AI applications. We explore the challenges and opportunities to ensure the safe deployment of generative AI-driven autonomous machines. Furthermore, we provide a forward-looking perspective on the future of AI-driven autonomous systems and emphasize the importance of evaluating and communicating safety risks. As an important step towards addressing these concerns, we recommend the development and implementation of comprehensive safety scorecards for the use of generative AI technologies in autonomous machines.
[AI-123] Mitigating Forgetting in LLM Supervised Fine-Tuning and Preference Learning
Link: https://arxiv.org/abs/2410.15483
Authors: Heshan Fernando,Han Shen,Parikshit Ram,Yi Zhou,Horst Samulowitz,Nathalie Baracaldo,Tianyi Chen
Keywords: safe LLM applications, supervised fine-tuning, preference learning, SFT and RLHF, typically consists
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract:Post-training of pre-trained LLMs, which typically consists of the supervised fine-tuning (SFT) stage and the preference learning (RLHF or DPO) stage, is crucial to effective and safe LLM applications. The widely adopted approach in post-training popular open-source LLMs is to sequentially perform SFT and RLHF/DPO. However, sequential training is sub-optimal in terms of the SFT and RLHF/DPO trade-off: the LLM gradually forgets the first stage's training when undergoing the second stage's training. We theoretically prove the sub-optimality of sequential post-training. Furthermore, we propose a practical joint post-training framework that has theoretical convergence guarantees and empirically outperforms the sequential post-training framework, while having similar computational cost. Our code is available at this https URL.
[AI-124] Multi-Layer Feature Fusion with Cross-Channel Attention-Based U-Net for Kidney Tumor Segmentation
Link: https://arxiv.org/abs/2410.15472
Authors: Fnu Neha,Arvind K. Bansal
Keywords: show significant heterogeneity, renal cell carcinoma, cell carcinoma, show significant, significant heterogeneity
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 8 pages
Abstract:Renal tumors, especially renal cell carcinoma (RCC), show significant heterogeneity, posing challenges for diagnosis using radiology images such as MRI, echocardiograms, and CT scans. U-Net based deep learning techniques are emerging as a promising approach for automated medical image segmentation for minimally invasive diagnosis of renal tumors. However, current techniques need further improvements in accuracy to become clinically useful to radiologists. In this study, we present an improved U-Net based model for end-to-end automated semantic segmentation of CT scan images to identify renal tumors. The model uses residual connections across convolution layers, integrates a multi-layer feature fusion (MFF) and cross-channel attention (CCA) within encoder blocks, and incorporates skip connections augmented with additional information derived using MFF and CCA. We evaluated our model on the KiTS19 dataset, which contains data from 210 patients. For kidney segmentation, our model achieves a Dice Similarity Coefficient (DSC) of 0.97 and a Jaccard index (JI) of 0.95. For renal tumor segmentation, our model achieves a DSC of 0.96 and a JI of 0.91. Based on a comparison of available DSC scores, our model outperforms the current leading models.
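The two scores reported in this abstract have standard set-based definitions: DSC = 2|A∩B| / (|A| + |B|) and JI = |A∩B| / |A∪B|. The sketch below computes both for a pair of binary masks. The function name `dice_and_jaccard`, the flat 0/1 list representation, and the toy masks are illustrative assumptions and are not taken from the paper.

```python
def dice_and_jaccard(pred, truth):
    """Return (DSC, JI) for two equally sized binary masks given as flat 0/1 sequences."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    pred_size = sum(1 for p in pred if p)
    truth_size = sum(1 for t in truth if t)
    union = pred_size + truth_size - intersection
    if union == 0:
        # Both masks are empty: treating this as perfect agreement is a convention.
        return 1.0, 1.0
    dice = 2 * intersection / (pred_size + truth_size)
    jaccard = intersection / union
    return dice, jaccard


# Toy example: two 6-element masks that overlap in two positions.
pred = [1, 1, 1, 0, 0, 0]
truth = [0, 1, 1, 1, 0, 0]
dice, jaccard = dice_and_jaccard(pred, truth)
print(f"DSC={dice:.3f}, JI={jaccard:.3f}")
```

For a single pair of masks the two scores are tied by JI = DSC / (2 - DSC). Scores averaged over many cases need not satisfy that identity exactly.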
[AI-125] How Aligned are Generative Models to Humans in High-Stakes Decision-Making?
Link: https://arxiv.org/abs/2410.15471
Authors: Sarah Tan,Keri Mallari,Julius Adebayo,Albert Gordo,Martin T. Wells,Kori Inkpen
Keywords: Large generative models, Large generative, high-stakes decision-making, increasingly being considered, considered for high-stakes
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:Large generative models (LMs) are increasingly being considered for high-stakes decision-making. This work considers how such models compare to humans and predictive AI models on a specific case of recidivism prediction. We combine three datasets (COMPAS predictive AI risk scores, human recidivism judgements, and photos) into a dataset on which we study the properties of several state-of-the-art, multimodal LMs. Beyond accuracy and bias, we focus on studying human-LM alignment on the task of recidivism prediction. We investigate whether these models can be steered towards human decisions, the impact of adding photos, and whether anti-discrimination prompting is effective. We find that LMs can be steered to outperform humans and COMPAS using in-context learning. We find anti-discrimination prompting to have unintended effects, causing some models to inhibit themselves and significantly reduce their number of positive predictions.
[AI-126] Data Augmentation via Diffusion Model to Enhance AI Fairness
Link: https://arxiv.org/abs/2410.15470
Authors: Christina Hastings Blow,Lijun Qian,Camille Gibson,Pamela Obiomon,Xishuang Dong
Keywords: outcomes genuinely reflect, interests of users, transparency and explainability, systems by ensuring, outcomes genuinely
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: arXiv admin note: text overlap with arXiv:2312.12560
Abstract:AI fairness seeks to improve the transparency and explainability of AI systems by ensuring that their outcomes genuinely reflect the best interests of users. Data augmentation, which involves generating synthetic data from existing datasets, has gained significant attention as a solution to data scarcity. In particular, diffusion models have become a powerful technique for generating synthetic data, especially in fields like computer vision. This paper explores the potential of diffusion models to generate synthetic tabular data to improve AI fairness. The Tabular Denoising Diffusion Probabilistic Model (Tab-DDPM), a diffusion model adaptable to any tabular dataset and capable of handling various feature types, was utilized with different amounts of generated data for data augmentation. Additionally, reweighting samples from AIF360 was employed to further enhance AI fairness. Five traditional machine learning models, namely Decision Tree (DT), Gaussian Naive Bayes (GNB), K-Nearest Neighbors (KNN), Logistic Regression (LR), and Random Forest (RF), were used to validate the proposed approach. Experimental results demonstrate that the synthetic data generated by Tab-DDPM improves fairness in binary classification.
[AI-127] AssemblyComplete: 3D Combinatorial Construction with Deep Reinforcement Learning
Link: https://arxiv.org/abs/2410.15469
Authors: Alan Chen,Changliu Liu
Keywords: real-world collaborative tasks, collaborative tasks, assembly, critical goal, goal in robotics
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Submitted to 2025 American Control Conference (ACC)
Abstract:A critical goal in robotics and autonomy is to teach robots to adapt to real-world collaborative tasks, particularly in automatic assembly. The ability of a robot to understand the original intent of an incomplete assembly and complete missing features without human instruction is valuable but challenging. This paper introduces 3D combinatorial assembly completion, which is demonstrated using combinatorial unit primitives (i.e., Lego bricks). Combinatorial assembly is challenging due to the possible assembly combinations and complex physical constraints (e.g., no brick collisions, structure stability, inventory constraints, etc.). To address these challenges, we propose a two-part deep reinforcement learning (DRL) framework that tackles teaching the robot to understand the objective of an incomplete assembly and learning a construction policy to complete the assembly. The robot queries a stable object library to facilitate assembly inference and guide learning. In addition to the robot policy, an action mask is developed to rule out invalid actions that violate physical constraints for object-oriented construction. We demonstrate the proposed framework’s feasibility and robustness in a variety of assembly scenarios in which the robot satisfies real-life assembly with respect to both solution and runtime quality. Furthermore, results demonstrate that the proposed framework effectively infers and assembles incomplete structures for unseen and unique object types.
[AI-128] Hey GPT Can You be More Racist? Analysis from Crowdsourced Attempts to Elicit Biased Content from Generative AI
Link: https://arxiv.org/abs/2410.15467
Authors: Hangzhi Guo,Pranav Narayanan Venkit,Eunchae Jang,Mukund Srinath,Wenbo Zhang,Bonam Mingole,Vipul Gupta,Kush R. Varshney,S. Shyam Sundar,Amulya Yadav
Keywords: large language models, addressing societal biases, societal biases inherent, widespread adoption, adoption of large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract:The widespread adoption of large language models (LLMs) and generative AI (GenAI) tools across diverse applications has amplified the importance of addressing societal biases inherent within these technologies. While the NLP community has extensively studied LLM bias, research investigating how non-expert users perceive and interact with biases from these systems remains limited. As these technologies become increasingly prevalent, understanding this question is crucial to inform model developers in their efforts to mitigate bias. To address this gap, this work presents the findings from a university-level competition, which challenged participants to design prompts for eliciting biased outputs from GenAI tools. We quantitatively and qualitatively analyze the competition submissions and identify a diverse set of biases in GenAI and strategies employed by participants to induce bias in GenAI. Our findings provide unique insights into how non-expert users perceive and interact with biases from GenAI tools.
[AI-129] Keep Guessing? When Considering Inference Scaling Mind the Baselines
Link: https://arxiv.org/abs/2410.15466
Authors: Gal Yona,Or Honovich,Omer Levy,Roee Aharoni
Keywords: Scaling inference compute, Scaling inference, sampling consistently increases, large language models, fraction of problems
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract:Scaling inference compute in large language models (LLMs) through repeated sampling consistently increases the coverage (fraction of problems solved) as the number of samples increases. We conjecture that this observed improvement is partially due to the answer distribution of standard evaluation benchmarks, which is skewed towards a relatively small set of common answers. To test this conjecture, we define a baseline that enumerates answers according to their prevalence in the training set. Experiments spanning two domains – mathematical reasoning and factual knowledge – reveal that this baseline outperforms repeated model sampling for some LLMs, while the coverage for others is on par with that of a mixture strategy that obtains k answers by using only 10 model samples and similarly guessing the remaining k-10 attempts via enumeration. Our baseline enables a more accurate measurement of how much repeated sampling improves coverage in such settings beyond prompt-agnostic guessing.
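The enumeration baseline and the mixture strategy described in this abstract can be illustrated with a small sketch. The function names, the toy answer lists, and the choice to skip enumerated answers that the model samples already produced are assumptions made for illustration. They are not taken from the paper's code.

```python
from collections import Counter


def enumeration_baseline(train_answers, k):
    """Return the k most frequent training-set answers, most common first."""
    return [ans for ans, _ in Counter(train_answers).most_common(k)]


def mixture_guesses(model_samples, train_answers, k, n_model=10):
    """Use up to n_model model samples, then fill the remaining attempts by enumeration."""
    guesses = list(model_samples[:n_model])
    for ans in enumeration_baseline(train_answers, k=len(set(train_answers))):
        if len(guesses) >= k:
            break
        if ans not in guesses:  # assumption: do not spend an attempt on a repeated answer
            guesses.append(ans)
    return guesses[:k]


def coverage(test_answers, guesses):
    """Fraction of test problems whose gold answer appears among the guesses."""
    guess_set = set(guesses)
    return sum(1 for a in test_answers if a in guess_set) / len(test_answers)


# Toy data: the answer distribution is skewed towards "1" and "0".
train = ["1", "1", "1", "1", "0", "0", "0", "2", "2", "4"]
test = ["1", "0", "7", "2"]

cov_k2 = coverage(test, enumeration_baseline(train, k=2))
cov_k3 = coverage(test, enumeration_baseline(train, k=3))
cov_mix = coverage(test, mixture_guesses(["7", "7"], train, k=4, n_model=2))
print(cov_k2, cov_k3, cov_mix)
```

In this simplified version the same guess list is applied to every test problem, which matches the prompt-agnostic nature of the baseline. A real mixture would draw separate model samples per problem.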
[AI-130] Hallucination Detox: Sensitive Neuron Dropout (SeND) for Large Language Model Training
Link: https://arxiv.org/abs/2410.15460
Authors: Shahrad Mohammadzadeh,Juan David Guerra,Marco Bonizzato,Reihaneh Rabbany,Golnoosh Farnadi
Keywords: user input-have grown, large language models, input-have grown, large language, increasingly deployed
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Spectral Theory (math.SP)
Abstract:As large language models (LLMs) become increasingly deployed across various industries, concerns regarding their reliability have grown, particularly due to hallucinations (outputs that are factually inaccurate or irrelevant to user input). Our research investigates the relationship between the training process and the emergence of hallucinations to address a key gap in existing research that focuses primarily on post hoc detection and mitigation strategies. Using models from the Pythia suite (70M-12B parameters) and several hallucination detection metrics, we analyze hallucination trends throughout training and explore LLM internal dynamics. We introduce SEnsitive Neuron Dropout (SeND), a novel training protocol designed to mitigate hallucinations by reducing variance during training. SeND achieves this by deterministically dropping neurons with significant variability on a dataset, referred to as Sensitive Neurons. In addition, we develop an unsupervised hallucination detection metric, Efficient EigenScore (EES), which approximates the traditional EigenScore at 2x speed. This efficient metric is integrated into our protocol, allowing SeND to be both computationally scalable and effective at reducing hallucinations. Our empirical evaluation demonstrates that our approach improves LLM reliability at test time by up to 40% compared to normal training while also providing an efficient method to improve factual accuracy when adapting LLMs to domains such as Wikipedia and Medical datasets.
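The idea of deterministically dropping the most variable neurons can be sketched in a few lines. The abstract does not say how variability is measured, so this sketch assumes plain per-neuron variance of activations across a dataset. The function name `sensitive_neuron_mask`, the `drop_fraction` parameter, and the toy activations are hypothetical.

```python
from statistics import pvariance


def sensitive_neuron_mask(activations, drop_fraction):
    """Return a 0/1 mask over neurons; 0 marks the highest-variance neurons to drop.

    activations: list of per-example activation vectors, all of the same length.
    """
    n_neurons = len(activations[0])
    # Variance of each neuron's activation across the examples (an assumed measure).
    variances = [pvariance([row[j] for row in activations]) for j in range(n_neurons)]
    n_drop = int(n_neurons * drop_fraction)
    ranked = sorted(range(n_neurons), key=lambda j: variances[j], reverse=True)
    dropped = set(ranked[:n_drop])
    return [0 if j in dropped else 1 for j in range(n_neurons)]


# Toy activations for 3 examples and 4 neurons; neuron 1 fluctuates the most.
activations = [
    [0.1, 5.0, 0.2, 1.0],
    [0.1, -4.0, 0.3, 1.1],
    [0.2, 3.0, 0.2, 0.9],
]
mask = sensitive_neuron_mask(activations, drop_fraction=0.25)
print(mask)
```

Unlike standard random dropout, the mask here is a deterministic function of the observed activations, which is the property the abstract emphasizes.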
[AI-131] Heterogeneous Graph Reinforcement Learning for Dependency-aware Multi-task Allocation in Spatial Crowdsourcing
Link: https://arxiv.org/abs/2410.15449
Authors: Yong Zhao,Zhengqiu Zhu,Chen Gao,En Wang,Jincai Huang,Fei-Yue Wang
Keywords: Spatial Crowdsourcing, academia and industry, task allocation, gaining traction, platforms becoming increasingly
Categories: Artificial Intelligence (cs.AI)
Abstract:Spatial Crowdsourcing (SC) is gaining traction in both academia and industry, with tasks on SC platforms becoming increasingly complex and requiring collaboration among workers with diverse skills. Recent research addresses complex tasks by dividing them into subtasks with dependencies and assigning them to suitable workers. However, the dependencies among subtasks and their heterogeneous skill requirements, as well as the need for efficient utilization of workers' limited work time in the multi-task allocation mode, pose challenges in achieving an optimal task allocation scheme. Therefore, this paper formally investigates the problem of Dependency-aware Multi-task Allocation (DMA) and presents a well-designed framework to solve it, known as Heterogeneous Graph Reinforcement Learning-based Task Allocation (HGRL-TA). To address the challenges associated with representing and embedding diverse problem instances to ensure robust generalization, we propose a multi-relation graph model and a Compound-path-based Heterogeneous Graph Attention Network (CHANet) for effectively representing and capturing intricate relations among tasks and workers, as well as providing an embedding of the problem state. The task allocation decision is determined sequentially by a policy network, which is trained jointly with CHANet using the proximal policy optimization algorithm. Extensive experimental results demonstrate the effectiveness and generality of the proposed HGRL-TA in solving the DMA problem, leading to average profits that are 21.78% higher than those achieved using the metaheuristic methods.
[AI-132] Concept Complement Bottleneck Model for Interpretable Medical Image Diagnosis
Link: https://arxiv.org/abs/2410.15446
Authors: Hongmei Wang,Junlin Hou,Hao Chen
Keywords: trustworthy artificial intelligence, received extensive attention, concepts, received extensive, trustworthy artificial
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures, submitted to IEEE TRANSACTIONS ON MEDICAL IMAGING
Abstract:Models based on human-understandable concepts have received extensive attention to improve model interpretability for trustworthy artificial intelligence in the field of medical image analysis. These methods can provide convincing explanations for model decisions but heavily rely on the detailed annotation of pre-defined concepts. Consequently, they may not be effective in cases where concepts or annotations are incomplete or low-quality. Although some methods automatically discover effective and new visual concepts rather than using pre-defined concepts, or find some human-understandable concepts via large language models, they are prone to veering away from medical diagnostic evidence and are challenging to understand. In this paper, we propose a concept complement bottleneck model for interpretable medical image diagnosis with the aim of complementing the existing concept set and finding new concepts bridging the gap between explainable models. Specifically, we propose to use concept adapters for specific concepts to mine the concept differences and score concepts in their own attention channels to support nearly fair concept learning. Then, we devise a concept complement strategy to learn new concepts while jointly using known concepts to improve model performance. Comprehensive experiments on medical datasets demonstrate that our model outperforms the state-of-the-art competitors in concept detection and disease diagnosis tasks while providing diverse explanations to ensure model interpretability effectively.
[AI-133] Exploring Social Desirability Response Bias in Large Language Models: Evidence from GPT-4 Simulations
Link: https://arxiv.org/abs/2410.15442
Authors: Sanguk Lee,Kai-Qi Yang,Tai-Quan Peng,Ruth Heo,Hui Liu
Keywords: Large language models, Gallup World Poll, social desirability response, simulate human-like responses, Large language
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract:Large language models (LLMs) are employed to simulate human-like responses in social surveys, yet it remains unclear if they develop biases like social desirability response (SDR) bias. To investigate this, GPT-4 was assigned personas from four societies, using data from the 2022 Gallup World Poll. These synthetic samples were then prompted with or without a commitment statement intended to induce SDR. The results were mixed. While the commitment statement increased SDR index scores, suggesting SDR bias, it reduced civic engagement scores, indicating an opposite trend. Additional findings revealed demographic associations with SDR scores and showed that the commitment statement had limited impact on GPT-4’s predictive performance. The study underscores potential avenues for using LLMs to investigate biases in both humans and LLMs themselves.
[AI-134] Evaluating Consistencies in LLM responses through a Semantic Clustering of Question Answering IJCAI2024
Link: https://arxiv.org/abs/2410.15440
Authors: Yanggyu Lee,Jihie Kim
Keywords: Large Language Model, Language Model, Large Language, providing reliable information, LLM outputs lack
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to the Trustworthy AI Workshop at IJCAI 2024
Abstract:In the realm of Large Language Model (LLM) functionalities, providing reliable information is paramount, yet reports suggest that LLM outputs lack consistency. This inconsistency, often attributed to randomness in token sampling, undermines user trust as it leads to varying responses even for identical queries. In this paper, we present a new approach for evaluating the semantic consistency of LLMs, including a comparison of alternative techniques. Our approach evaluates whether LLM responses are semantically congruent for a given question, recognizing that syntactically different sentences may convey the same meaning. Heretofore, two main approaches have been explored to enhance LLM consistency: leveraging external knowledge as context, as in the RAG pattern, or using Zero-shot-CoT to improve the performance of the LLM itself. We apply our evaluation approach to these techniques and compare the impact of these methods on LLM response consistency across different domains of question answering tasks. Using the TruthfulQA dataset to assess LLM responses, the study induces N responses per question from the LLM and clusters semantically equivalent sentences to measure semantic consistency across 37 categories. Through this, it quantitatively analyzes the effectiveness of the aforementioned methods in improving LLM performance before and after their adoption.
[AI-135] Unveiling and Consulting Core Experts in Retrieval-Augmented MoE-based LLMs
Link: https://arxiv.org/abs/2410.15438
Authors: Xin Zhou,Ping Nie,Yiwen Guo,Haojie Wei,Zhanqiu Zhang,Pasquale Minervini,Ruotian Ma,Tao Gui,Qi Zhang,Xuanjing Huang
Keywords: Large Language Models, Large Language, solve knowledge-intensive tasks, Retrieval-Augmented Generation, Language Models
Categories: Artificial Intelligence (cs.AI)
Abstract:Retrieval-Augmented Generation (RAG) significantly improved the ability of Large Language Models (LLMs) to solve knowledge-intensive tasks. While existing research seeks to enhance RAG performance by retrieving higher-quality documents or designing RAG-specific LLMs, the internal mechanisms within LLMs that contribute to the effectiveness of RAG systems remain underexplored. In this paper, we aim to investigate these internal mechanisms within the popular Mixture-of-Expert (MoE)-based LLMs and demonstrate how to improve RAG by examining expert activations in these LLMs. Our controlled experiments reveal that several core groups of experts are primarily responsible for RAG-related behaviors. The activation of these core experts can signify the model’s inclination towards external/internal knowledge and adjust its behavior. For instance, we identify core experts that can (1) indicate the sufficiency of the model’s internal knowledge, (2) assess the quality of retrieved documents, and (3) enhance the model’s ability to utilize context. Based on these findings, we propose several strategies to enhance RAG’s efficiency and effectiveness through expert activation. Experimental results across various datasets and MoE-based LLMs show the effectiveness of our method.
[AI-136] Power Plays: Unleashing Machine Learning Magic in Smart Grids
Link: https://arxiv.org/abs/2410.15423
Authors: Abdur Rashid,Parag Biswas,abdullah al masum,MD Abdullah Al Nasim,Kishor Datta Gupta
Keywords: machine learning, modern energy networks, grid systems represents, represents a transformative, transformative step
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages, 1 figure
Abstract:The integration of machine learning into smart grid systems represents a transformative step in enhancing the efficiency, reliability, and sustainability of modern energy networks. By adding advanced data analytics, these systems can better manage the complexities of renewable energy integration, demand response, and predictive maintenance. Machine learning algorithms analyze vast amounts of data from smart meters, sensors, and other grid components to optimize energy distribution, forecast demand, and detect irregularities that could indicate potential failures. This enables more precise load balancing, reduces operational costs, and enhances the resilience of the grid against disturbances. Furthermore, the use of predictive models helps in anticipating equipment failures, thereby improving the reliability of the energy supply. As smart grids continue to evolve, the role of machine learning in managing decentralized energy sources and enabling real-time decision-making will become increasingly critical. However, the deployment of these technologies also raises challenges related to data privacy, security, and the need for robust infrastructure. To address these issues, this research focuses on realizing the full potential of smart grids, ensuring they meet growing energy demands while maintaining a focus on sustainability and efficiency using machine learning techniques. Furthermore, this research will help determine how essential the smart grid is with the aid of machine learning. Multiple ML algorithms are covered along with their pros and cons. The future scope of these algorithms is also discussed.
[AI-137] Where to Build Food Banks and Pantries: A Two-Level Machine Learning Approach
Link: https://arxiv.org/abs/2410.15420
Authors: Gavin Ruan,Ziqi Guo,Guang Lin
Keywords: million Americans, Americans currently suffer, food bank, food, pantry locations
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 4 figures
Abstract:Over 44 million Americans currently suffer from food insecurity, of whom 13 million are children. Across the United States, thousands of food banks and pantries serve as vital sources of food and other forms of aid for food insecure families. By optimizing food bank and pantry locations, food would become more accessible to families who desperately require it. In this work, we introduce a novel two-level optimization framework, which utilizes the K-Medoids clustering algorithm in conjunction with the Open-Source Routing Machine engine, to optimize food bank and pantry locations based on real road distances to houses and house blocks. Our proposed framework can also factor in considerations such as median household income using a pseudo-weighted K-Medoids algorithm. Testing conducted with California and Indiana household data, as well as comparisons with real food bank and pantry locations, showed that, interestingly, our proposed framework yields food pantry locations superior to the existing ones and saves significant distance for households, while there is a marginal penalty on the first-level food bank to food pantry distance. Overall, we believe that the second-level benefits of this framework far outweigh any drawbacks and yield a net benefit.
[AI-138] CASET: Complexity Analysis using Simple Execution Traces for CS* submissions
Link: https://arxiv.org/abs/2410.15419
Authors: Aaryen Mehta,Gagan Aryan
Keywords: pre-defined test suite, pre-defined test, test suite, suite and compare, reference results
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 5 pages
Abstract:The most common method to auto-grade a student's submission in a CS1 or a CS2 course is to run it against a pre-defined test suite and compare the results against reference results. However, this technique cannot be used if the correctness of the solution goes beyond simple output, such as the algorithm used to obtain the result. There is no convenient method for the graders to identify the kind of algorithm used in solving a problem. They must read the source code and understand the algorithm implemented and its features, which makes the process tedious. We propose CASET (Complexity Analysis using Simple Execution Traces), a novel tool to analyze the time complexity of algorithms using dynamic traces and unsupervised machine learning. CASET makes it convenient for tutors to classify the submissions for a program into time complexity baskets. Thus, tutors can identify the algorithms used by the submissions without necessarily going through the code written by the students. CASET's analysis can be used to improve grading and provide detailed feedback for submissions that try to match the results without a proper algorithm, for example, by hard-coding a binary result or pattern-matching the visible or common inputs. We show the effectiveness of CASET by computing the time complexity of many classes of algorithms like sorting, searching and those using the dynamic programming paradigm.
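One simple way to see how execution traces can reveal time complexity is to count executed steps at several input sizes and fit the slope of log(steps) against log(n). This is a stand-in technique chosen for illustration, not CASET's method, which the abstract describes as based on dynamic traces and unsupervised machine learning. All function names and the step-counting instrumentation below are assumptions.

```python
import math


def loglog_slope(sizes, step_counts):
    """Least-squares slope of log(steps) versus log(n); roughly the polynomial degree."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(c) for c in step_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def steps_linear_scan(n):
    """Count loop iterations of a full linear scan over n items."""
    steps = 0
    for _ in range(n):
        steps += 1
    return steps


def steps_bubble_sort(n):
    """Count comparisons made by bubble sort on a reversed list of n items."""
    data = list(range(n, 0, -1))
    steps = 0
    for i in range(n):
        for j in range(n - 1 - i):
            steps += 1
            if data[j] > data[j + 1]:
                data[j], data[j + 1] = data[j + 1], data[j]
    return steps


sizes = [50, 100, 200, 400]
slope_linear = loglog_slope(sizes, [steps_linear_scan(n) for n in sizes])
slope_quadratic = loglog_slope(sizes, [steps_bubble_sort(n) for n in sizes])
print(f"linear scan: {slope_linear:.2f}, bubble sort: {slope_quadratic:.2f}")
```

A slope near 1 suggests linear growth and a slope near 2 suggests quadratic growth. Logarithmic factors are not separated cleanly by this fit, which is one reason a real tool would need something more robust.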
[AI-139] A Comprehensive Evaluation of Cognitive Biases in LLMs
Link: https://arxiv.org/abs/2410.15413
Authors: Simon Malberg,Roman Poletukhin,Carolin M. Schuster,Georg Groh
Keywords: large language models, cognitive biases, large language, language models, decision-making scenarios
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract:We present a large-scale evaluation of 30 cognitive biases in 20 state-of-the-art large language models (LLMs) under various decision-making scenarios. Our contributions include a novel general-purpose test framework for reliable and large-scale generation of tests for LLMs, a benchmark dataset with 30,000 tests for detecting cognitive biases in LLMs, and a comprehensive assessment of the biases found in the 20 evaluated LLMs. Our work confirms and broadens previous findings suggesting the presence of cognitive biases in LLMs by reporting evidence of all 30 tested biases in at least some of the 20 LLMs. We publish our framework code to encourage future research on biases in LLMs: this https URL
[AI-140] PEAS: A Strategy for Crafting Transferable Adversarial Examples
Link: https://arxiv.org/abs/2410.15409
Authors: Bar Avraham,Yisroel Mirsky
Keywords: machine learning systems, learning systems, target model, Black box attacks, threat to machine
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Abstract:Black box attacks, where adversaries have limited knowledge of the target model, pose a significant threat to machine learning systems. Adversarial examples generated with a substitute model often suffer from limited transferability to the target model. While recent work explores ranking perturbations for improved success rates, these methods see only modest gains. We propose a novel strategy called PEAS that can boost the transferability of existing black box attacks. PEAS leverages the insight that samples which are perceptually equivalent exhibit significant variability in their adversarial transferability. Our approach first generates a set of images from an initial sample via subtle augmentations. We then evaluate the transferability of adversarial perturbations on these images using a set of substitute models. Finally, the most transferable adversarial example is selected and used for the attack. Our experiments show that PEAS can double the performance of existing attacks, achieving a 2.5x improvement in attack success rates on average over current ranking methods. We thoroughly evaluate PEAS on ImageNet and CIFAR-10, analyze hyperparameter impacts, and provide an ablation study to isolate each component’s importance.
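The final selection step described in this abstract (score each candidate on a set of substitute models, then keep the most transferable one) can be sketched as follows. The toy threshold classifiers, the candidate vectors, and the scoring rule (number of substitute models fooled) are hypothetical and only illustrate the selection logic, not the paper's implementation.

```python
def select_most_transferable(candidates, substitute_models, true_label):
    """Return the candidate that is misclassified by the most substitute models."""

    def fooled_count(x):
        return sum(1 for model in substitute_models if model(x) != true_label)

    return max(candidates, key=fooled_count)


# Hypothetical substitute models: threshold classifiers on the sum of the features.
substitute_models = [
    lambda x: int(sum(x) > 1.0),
    lambda x: int(sum(x) > 1.2),
    lambda x: int(sum(x) > 1.5),
]

# Hypothetical adversarial examples, each built from a different augmented copy
# of the same original sample whose true label is 1.
candidates = [[0.6, 0.7], [0.5, 0.6], [0.4, 0.5]]

best = select_most_transferable(candidates, substitute_models, true_label=1)
print(best)
```

The attacker would then submit only `best` to the black box target, on the assumption that agreement among substitute models predicts success against the target.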
[AI-141] XAI-based Feature Ensemble for Enhanced Anomaly Detection in Autonomous Driving Systems
Link: https://arxiv.org/abs/2410.15405
Authors: Sazid Nazat,Mustafa Abdallah
Keywords: introduced significant challenges, ensuring transportation security, technology has introduced, security and reliability, rapid advancement
Categories: Artificial Intelligence (cs.AI)
Comments: 31 pages, 4 figures (including the subfigures)
Abstract:The rapid advancement of autonomous vehicle (AV) technology has introduced significant challenges in ensuring transportation security and reliability. Traditional AI models for anomaly detection in AVs are often opaque, posing difficulties in understanding and trusting their decision-making processes. This paper proposes a novel feature ensemble framework that integrates multiple Explainable AI (XAI) methods (SHAP, LIME, and DALEX) with various AI models to enhance both anomaly detection and interpretability. By fusing top features identified by these XAI methods across six diverse AI models (Decision Trees, Random Forests, Deep Neural Networks, K Nearest Neighbors, Support Vector Machines, and AdaBoost), the framework creates a robust and comprehensive set of features critical for detecting anomalies. These feature sets, produced by our feature ensemble framework, are evaluated using independent classifiers (CatBoost, Logistic Regression, and LightGBM) to ensure unbiased performance. We evaluated our feature ensemble approach on two popular autonomous driving datasets (VeReMi and Sensor). Our feature ensemble technique demonstrates improved accuracy, robustness, and transparency of AI models, contributing to safer and more trustworthy autonomous driving systems.
[AI-142] MMCS: A Multimodal Medical Diagnosis System Integrating Image Analysis and Knowledge-based Departmental Consultation
链接: https://arxiv.org/abs/2410.15403
作者: Yi Ren,HanZhi Zhang,Weibin Li,Diandong Liu,Tianyi Zhang,Jie He
关键词-EN: present MMCS, medical images, medical, facial paralysis, facial
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:We present MMCS, a system capable of recognizing medical images and patient facial details, and providing professional medical diagnoses. The system consists of two core components. The first component is the analysis of medical images and videos. We trained a specialized multimodal medical model capable of interpreting medical images and accurately analyzing patients’ facial emotions and facial paralysis conditions. The model achieved an accuracy of 72.59% on the FER2013 facial emotion recognition dataset, with a 91.1% accuracy in recognizing the happy emotion. In facial paralysis recognition, the model reached an accuracy of 92%, which is 30% higher than that of GPT-4o. Based on this model, we developed a parser for analyzing facial movement videos of patients with facial paralysis, achieving precise grading of the paralysis severity. In tests on 30 videos of facial paralysis patients, the system demonstrated a grading accuracy of 83.3%. The second component is the generation of professional medical responses. We employed a large language model, integrated with a medical knowledge base, to generate professional diagnoses based on the analysis of medical images or videos. The core innovation lies in our development of a department-specific knowledge base routing management mechanism, in which the large language model categorizes data by medical departments and, during the retrieval process, determines the appropriate knowledge base to query. This significantly improves retrieval accuracy in the RAG (retrieval-augmented generation) process. This mechanism led to an average increase of 4 percentage points in accuracy for various large language models on MedQA. The code is open-sourced and available at: this https URL.
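The department-specific routing idea can be sketched as a two-stage lookup: first choose a department, then search only that department's knowledge base. In the sketch below a keyword matcher stands in for the large language model router, and the department names, keywords, and placeholder documents are all hypothetical.

```python
# Placeholder knowledge bases; the entries are dummy strings, not medical facts.
KNOWLEDGE_BASES = {
    "neurology": ["neurology note A", "neurology note B"],
    "cardiology": ["cardiology note A"],
}

# Stand-in for the LLM router described in the abstract (an assumption).
ROUTER_KEYWORDS = {
    "neurology": ("facial", "paralysis"),
    "cardiology": ("chest", "heart"),
}

def route_department(query):
    """Pick a department by keyword, or None when nothing matches."""
    q = query.lower()
    for dept, words in ROUTER_KEYWORDS.items():
        if any(w in q for w in words):
            return dept
    return None

def retrieve(query):
    """Return (department, candidate documents) for a query."""
    dept = route_department(query)
    # Search only the routed knowledge base; fall back to all of them.
    pools = [KNOWLEDGE_BASES[dept]] if dept else list(KNOWLEDGE_BASES.values())
    return dept, [doc for pool in pools for doc in pool]
```

Restricting retrieval to one knowledge base shrinks the candidate pool, which is the mechanism the abstract credits for the improved retrieval accuracy.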
[AI-143] The Best Defense is a Good Offense: Countering LLM-Powered Cyberattacks
链接: https://arxiv.org/abs/2410.15396
作者: Daniel Ayzenshteyn,Roy Weiss,Yisroel Mirsky
关键词-EN: large language models, continue to evolve, language models, large language, automating cyberattacks
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:As large language models (LLMs) continue to evolve, their potential use in automating cyberattacks becomes increasingly likely. With capabilities such as reconnaissance, exploitation, and command execution, LLMs could soon become integral to autonomous cyber agents, capable of launching highly sophisticated attacks. In this paper, we introduce novel defense strategies that exploit the inherent vulnerabilities of attacking LLMs. By targeting weaknesses such as biases, trust in input, memory limitations, and their tunnel-vision approach to problem-solving, we develop techniques to mislead, delay, or neutralize these autonomous agents. We evaluate our defenses under black-box conditions, starting with single prompt-response scenarios and progressing to real-world tests using custom-built CTF machines. Our results show defense success rates of up to 90%, demonstrating the effectiveness of turning LLM vulnerabilities into defensive strategies against LLM-driven cyber threats.
[AI-144] Synthetic Data Generation for Residential Load Patterns via Recurrent GAN and Ensemble Method
链接: https://arxiv.org/abs/2410.15379
作者: Xinyu Liang,Ziheng Wang,Hao Wang
关键词-EN: accurately represent actual, represent actual electricity, actual electricity consumption, power system planning, Generating synthetic residential
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 12 pages
点击查看摘要
Abstract:Generating synthetic residential load data that can accurately represent actual electricity consumption patterns is crucial for effective power system planning and operation. The necessity for synthetic data is underscored by the inherent challenges associated with using real-world load data, such as privacy considerations and logistical complexities in large-scale data collection. In this work, we tackle the above-mentioned challenges by developing the Ensemble Recurrent Generative Adversarial Network (ERGAN) framework to generate high-fidelity synthetic residential load data. ERGAN leverages an ensemble of recurrent Generative Adversarial Networks, augmented by a loss function that concurrently takes into account adversarial loss and differences between statistical properties. Our developed ERGAN can capture diverse load patterns across various households, thereby enhancing the realism and diversity of the synthetic data generated. Comprehensive evaluations demonstrate that our method consistently outperforms established benchmarks in the synthetic generation of residential load data across various performance metrics including diversity, similarity, and statistical measures. The findings confirm the potential of ERGAN as an effective tool for energy applications requiring synthetic yet realistic load data. We also make the generated synthetic residential load patterns publicly available.
[AI-145] Explainability of Point Cloud Neural Networks Using SMILE: Statistical Model-Agnostic Interpretability with Local Explanations
链接: https://arxiv.org/abs/2410.15374
作者: Seyed Mohammad Ahmadi,Koorosh Aslansefat,Ruben Valcarce-Dineiro,Joshua Barnfather
关键词-EN: considerable safety risks, pose considerable safety, today world, significance of explainable, lack of transparency
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 17 pages, 9 figures
点击查看摘要
Abstract:In today’s world, the significance of explainable AI (XAI) is growing in robotics and point cloud applications, as the lack of transparency in decision-making can pose considerable safety risks, particularly in autonomous systems. As these technologies are integrated into real-world environments, ensuring that model decisions are interpretable and trustworthy is vital for operational reliability and safety assurance. This study explores the implementation of SMILE, a novel explainability method originally designed for deep neural networks, on point cloud-based models. SMILE builds on LIME by incorporating Empirical Cumulative Distribution Function (ECDF) statistical distances, offering enhanced robustness and interpretability, particularly when the Anderson-Darling distance is used. The approach demonstrates superior performance in terms of fidelity loss, R2 scores, and robustness across various kernel widths, perturbation numbers, and clustering configurations. Moreover, this study introduces a stability analysis for point cloud data using the Jaccard index, establishing a new benchmark and baseline for model stability in this field. The study further identifies dataset biases in the classification of the ‘person’ category, emphasizing the necessity for more comprehensive datasets in safety-critical applications like autonomous driving and robotics. The results underscore the potential of advanced explainability models and highlight areas for future research, including the application of alternative surrogate models and explainability techniques in point cloud data.
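The Jaccard index used for the stability analysis compares two sets by the ratio of their intersection to their union. A minimal sketch follows, assuming stability is measured between the sets of features (for example point clusters) that two explanation runs mark as most important; the run contents are made up.

```python
def jaccard_index(a, b):
    """Return |A ∩ B| / |A ∪ B|, treating two empty sets as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical top-4 cluster ids selected by two explanation runs.
run_1 = {3, 7, 12, 20}
run_2 = {3, 7, 15, 20}
stability = jaccard_index(run_1, run_2)  # 3 shared / 5 total = 0.6
```

A value of 1.0 means both runs selected exactly the same features, and a value of 0.0 means they share none.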
[AI-146] FrameBridge: Improving Image-to-Video Generation with Bridge Models
链接: https://arxiv.org/abs/2410.15371
作者: Yuji Wang,Zehua Chen,Xiaoyu Chen,Jun Zhu,Jianfei Chen
关键词-EN: gaining increasing attention, gaining increasing, increasing attention, wide application, models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Image-to-video (I2V) generation is gaining increasing attention with its wide application in video synthesis. Recently, diffusion-based I2V models have achieved remarkable progress given their novel design on network architecture, cascaded framework, and motion representation. However, restricted by their noise-to-data generation process, diffusion-based methods inevitably suffer the difficulty to generate video samples with both appearance consistency and temporal coherence from an uninformative Gaussian noise, which may limit their synthesis quality. In this work, we present FrameBridge, taking the given static image as the prior of video target and establishing a tractable bridge model between them. By formulating I2V synthesis as a frames-to-frames generation task and modelling it with a data-to-data process, we fully exploit the information in input image and facilitate the generative model to learn the image animation process. In two popular settings of training I2V models, namely fine-tuning a pre-trained text-to-video (T2V) model or training from scratch, we further propose two techniques, SNR-Aligned Fine-tuning (SAF) and neural prior, which improve the fine-tuning efficiency of diffusion-based T2V models to FrameBridge and the synthesis quality of bridge-based I2V models respectively. Experiments conducted on WebVid-2M and UCF-101 demonstrate that: (1) our FrameBridge achieves superior I2V quality in comparison with the diffusion counterpart (zero-shot FVD 83 vs. 176 on MSR-VTT and non-zero-shot FVD 122 vs. 171 on UCF-101); (2) our proposed SAF and neural prior effectively enhance the ability of bridge-based I2V models in the scenarios of fine-tuning and training from scratch. Demo samples can be visited at: this https URL.
[AI-147] Ethical AI in Retail: Consumer Privacy and Fairness
链接: https://arxiv.org/abs/2410.15369
作者: Anthonette Adanyin
关键词-EN: artificial intelligence, transformed the industry, enabling more personalized, efficient operations, adoption of artificial
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 17 pages, 2 figures, 3 tables
点击查看摘要
Abstract:The adoption of artificial intelligence (AI) in retail has significantly transformed the industry, enabling more personalized services and efficient operations. However, the rapid implementation of AI technologies raises ethical concerns, particularly regarding consumer privacy and fairness. This study aims to analyze the ethical challenges of AI applications in retail, explore ways retailers can implement AI technologies ethically while remaining competitive, and provide recommendations on ethical AI practices. A descriptive survey design was used to collect data from 300 respondents across major e-commerce platforms. Data were analyzed using descriptive statistics, including percentages and mean scores. Findings show a high level of concern among consumers regarding the amount of personal data collected by AI-driven retail applications, with many expressing a lack of trust in how their data is managed. Fairness is another major issue, as a majority believe AI systems do not treat consumers equally, raising concerns about algorithmic bias. It was also found that AI can enhance business competitiveness and efficiency without compromising ethical principles, such as data privacy and fairness. Data privacy and transparency were highlighted as critical areas where retailers need to focus their efforts, indicating a strong demand for stricter data protection protocols and ongoing scrutiny of AI systems. The study concludes that retailers must prioritize transparency, fairness, and data protection when deploying AI systems. The study recommends ensuring transparency in AI processes, conducting regular audits to address biases, incorporating consumer feedback in AI development, and emphasizing consumer data privacy.
[AI-148] Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models
链接: https://arxiv.org/abs/2410.15362
作者: Xiao Li,Zhuhong Li,Qiongxiu Li,Bingze Lee,Jinghao Cui,Xiaolin Hu
关键词-EN: Large Language Models, Language Models, Aligned Large Language, Large Language, demonstrated remarkable performance
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
*备注:
点击查看摘要
Abstract:Aligned Large Language Models (LLMs) have demonstrated remarkable performance across various tasks. However, LLMs remain susceptible to jailbreak adversarial attacks, where adversaries manipulate prompts to elicit malicious responses that aligned LLMs should have avoided. Identifying these vulnerabilities is crucial for understanding the inherent weaknesses of LLMs and preventing their potential misuse. One pioneering work in jailbreaking is the GCG attack, a discrete token optimization algorithm that seeks to find a suffix capable of jailbreaking aligned LLMs. Despite the success of GCG, we find it suboptimal: it incurs a large computational cost and achieves limited jailbreaking performance. In this work, we propose Faster-GCG, an efficient adversarial jailbreak method, by delving deep into the design of GCG. Experiments demonstrate that Faster-GCG can surpass the original GCG with only 1/10 of the computational cost, achieving significantly higher attack success rates on various open-source aligned LLMs. In addition, we demonstrate that Faster-GCG exhibits improved attack transferability when testing on closed-source LLMs such as ChatGPT.
[AI-149] A Survey of Hallucination in Large Visual Language Models
链接: https://arxiv.org/abs/2410.15359
作者: Wei Lan,Wenyi Chen,Qingfeng Chen,Shirui Pan,Huiyu Zhou,Yi Pan
关键词-EN: Large Language Models, Visual Language Models, Large Visual Language, Language Models, Large Language
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Large Visual Language Models (LVLMs) enhance user interaction and enrich user experience by integrating the visual modality on the basis of Large Language Models (LLMs). They have demonstrated powerful information processing and generation capabilities. However, the existence of hallucinations has limited the potential and practical effectiveness of LVLMs in various fields. Although much work has been devoted to hallucination mitigation and correction, few reviews summarize this issue. In this survey, we first introduce the background of LVLMs and hallucinations. Then, the structure of LVLMs and the main causes of hallucination generation are introduced. Further, we summarize recent works on hallucination correction and mitigation. In addition, the available hallucination evaluation benchmarks for LVLMs are presented from judgmental and generative perspectives. Finally, we suggest some future research directions to enhance the dependability and utility of LVLMs.
[AI-150] LAC: Graph Contrastive Learning with Learnable Augmentation in Continuous Space
链接: https://arxiv.org/abs/2410.15355
作者: Zhenyu Lin,Hongzheng Li,Yingxia Shao,Guanhua Ye,Yawen Li,Quanqing Xu
关键词-EN: Graph Contrastive Learning, generating high-quality node, Contrastive Learning, Contrastive Learning frameworks, high-quality node representations
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Graph Contrastive Learning frameworks have demonstrated success in generating high-quality node representations. The existing research on efficient data augmentation methods and ideal pretext tasks for graph contrastive learning remains limited, resulting in suboptimal node representation in the unsupervised setting. In this paper, we introduce LAC, a graph contrastive learning framework with learnable data augmentation in an orthogonal continuous space. To capture the representative information in the graph data during augmentation, we introduce a continuous view augmenter that applies both a masked topology augmentation module and a cross-channel feature augmentation module to adaptively augment the topological information and the feature information within an orthogonal continuous space, respectively. The orthogonal nature of continuous space ensures that the augmentation process avoids dimension collapse. To enhance the effectiveness of pretext tasks, we propose an information-theoretic principle named InfoBal and introduce corresponding pretext tasks. These tasks enable the continuous view augmenter to maintain consistency in the representative information across views while maximizing diversity between views, and allow the encoder to fully utilize the representative information in the unsupervised setting. Our experimental results show that LAC significantly outperforms the state-of-the-art frameworks.
[AI-151] YOLO-RD: Introducing Relevant and Compact Explicit Knowledge to YOLO by Retriever-Dictionary
链接: https://arxiv.org/abs/2410.15346
作者: Hao-Tang Tsui,Chien-Yao Wang,Hong-Yuan Mark Liao
关键词-EN: refining training strategies, Identifying and localizing, fundamental challenge, training strategies, Visual Language Models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Identifying and localizing objects within images is a fundamental challenge, and numerous efforts have been made to enhance model accuracy by experimenting with diverse architectures and refining training strategies. Nevertheless, a prevalent limitation in existing models is overemphasizing the current input while ignoring the information from the entire dataset. We introduce an innovative Retriever-Dictionary (RD) module to address this issue. This architecture enables YOLO-based models to efficiently retrieve features from a Dictionary that contains the insight of the dataset, which is built by the knowledge from Visual Models (VM), Large Language Models (LLM), or Visual Language Models (VLM). The flexible RD enables the model to incorporate such explicit knowledge that enhances the ability to benefit multiple tasks, specifically, segmentation, detection, and classification, from pixel to image level. The experiments show that using the RD significantly improves model performance, achieving more than a 3% increase in mean Average Precision for object detection with less than a 1% increase in model parameters. Beyond 1-stage object detection models, the RD module improves the effectiveness of 2-stage models and DETR-based architectures, such as Faster R-CNN and Deformable DETR.
[AI-152] POSE: Pose estimation Of virtual Sync Exhibit system
链接: https://arxiv.org/abs/2410.15343
作者: Hao-Tang Tsui,Yu-Rou Tuan,Jia-You Chen
关键词-EN: portable MetaVerse implementation, make virtual avatars, MetaVerse implementation, portable MetaVerse, synchronized actions
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:This work is a portable MetaVerse implementation, and we use 3D pose estimation with AI to make virtual avatars do synchronized actions and interact with the environment. The motivation is that we find it inconvenient to use joysticks and sensors when playing with fitness rings. In order to replace joysticks and reduce costs, we developed a platform that can control virtual avatars through pose estimation to identify the movements of real people, and we also implemented a multi-process to achieve modularization and reduce the overall latency.
[AI-153] IKDP: Inverse Kinematics through Diffusion Process
链接: https://arxiv.org/abs/2410.15341
作者: Hao-Tang Tsui,Yu-Rou Tuan,Hong-Han Shuai
关键词-EN: Denoising Diffusion Probabilistic, Diffusion Probabilistic Model, target in space, problem in robotics, endpoint reaches
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:It is a common problem in robotics to specify the position of each joint of the robot so that the endpoint reaches a certain target in space. This can be solved in two ways, forward kinematics method and inverse kinematics method. However, inverse kinematics cannot be solved by an algorithm. The common method is the Jacobian inverse technique, and some people have tried to find the answer by machine learning. In this project, we will show how to use the Conditional Denoising Diffusion Probabilistic Model to integrate the solution of calculating IK. Index Terms: Inverse kinematics, Denoising Diffusion Probabilistic Model, self Attention, Transformer
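For context, the Jacobian inverse technique mentioned in this abstract as the common baseline can be sketched on a two-link planar arm. This is not the diffusion-based method the paper proposes. The link lengths, initial angles, step size, and the function names `forward` and `solve_ik` are assumptions chosen for the example.

```python
import math

L1, L2 = 1.0, 1.0  # link lengths of a hypothetical 2-link planar arm

def forward(t1, t2):
    """Forward kinematics: joint angles -> end-effector position."""
    x = L1 * math.cos(t1) + L2 * math.cos(t1 + t2)
    y = L1 * math.sin(t1) + L2 * math.sin(t1 + t2)
    return x, y

def solve_ik(target, t1=0.2, t2=1.2, step=0.5, tol=1e-9, max_iter=200):
    """Iterate damped Jacobian-inverse updates toward the target."""
    for _ in range(max_iter):
        x, y = forward(t1, t2)
        ex, ey = target[0] - x, target[1] - y
        if math.hypot(ex, ey) < tol:
            break
        # Analytic 2x2 Jacobian of the end-effector position.
        j11 = -L1 * math.sin(t1) - L2 * math.sin(t1 + t2)
        j12 = -L2 * math.sin(t1 + t2)
        j21 = L1 * math.cos(t1) + L2 * math.cos(t1 + t2)
        j22 = L2 * math.cos(t1 + t2)
        det = j11 * j22 - j12 * j21
        if abs(det) < 1e-12:
            break  # singular configuration: the inverse does not exist
        # Joint update = step * J^{-1} * error (closed-form 2x2 inverse).
        t1 += step * (j22 * ex - j12 * ey) / det
        t2 += step * (-j21 * ex + j11 * ey) / det
    return t1, t2

angles = solve_ik((1.0, 1.0))
```

The iteration returns only one of the possible joint configurations and stops at singularities, which are two of the limitations that motivate learning-based alternatives.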
[AI-154] FoMo: A Foundation Model for Mobile Traffic Forecasting with Diffusion Model
链接: https://arxiv.org/abs/2410.15322
作者: Haoye Chai,Shiyuan Zhang,Xiaoqian Qi,Yong Li
关键词-EN: offering substantial potential, enhancing service quality, anticipate network dynamics, improving user experience, performance in advance
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 17 pages, 11 figures
点击查看摘要
Abstract:Mobile traffic forecasting allows operators to anticipate network dynamics and performance in advance, offering substantial potential for enhancing service quality and improving user experience. However, existing models are often task-oriented and are trained with tailored data, which limits their effectiveness in diverse mobile network tasks of Base Station (BS) deployment, resource allocation, energy optimization, etc. and hinders generalization across different urban environments. Foundation models have made remarkable strides across various domains of NLP and CV due to their multi-tasking adaption and zero/few-shot learning capabilities. In this paper, we propose an innovative Foundation model for Mobile traffic forecasting (FoMo), aiming to handle diverse forecasting tasks of short/long-term predictions and distribution generation across multiple cities to support network planning and optimization. FoMo combines diffusion models and transformers, where various spatio-temporal masks are proposed to enable FoMo to learn intrinsic features of different tasks, and a contrastive learning strategy is developed to capture the correlations between mobile traffic and urban contexts, thereby improving its transfer learning capability. Extensive experiments on 9 real-world datasets demonstrate that FoMo outperforms current models concerning diverse forecasting tasks and zero/few-shot learning, showcasing a strong universality. We further deploy the FoMo on the JiuTian optimization platform of China Mobile, where we use the predicted mobile data to formulate network planning and optimization applications, including BS deployment, resource block scheduling, and BS sleep control.
[AI-155] Causality for Large Language Models
链接: https://arxiv.org/abs/2410.15319
作者: Anpeng Wu,Kun Kuang,Minqin Zhu,Yingrong Wang,Yujia Zheng,Kairong Han,Baohong Li,Guangyi Chen,Fei Wu,Kun Zhang
关键词-EN: achieving unprecedented success, vast datasets, achieving unprecedented, billions or trillions, trillions of parameters
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Recent breakthroughs in artificial intelligence have driven a paradigm shift, where large language models (LLMs) with billions or trillions of parameters are trained on vast datasets, achieving unprecedented success across a series of language tasks. However, despite these successes, LLMs still rely on probabilistic modeling, which often captures spurious correlations rooted in linguistic patterns and social stereotypes, rather than the true causal relationships between entities and events. This limitation renders LLMs vulnerable to issues such as demographic biases, social stereotypes, and LLM hallucinations. These challenges highlight the urgent need to integrate causality into LLMs, moving beyond correlation-driven paradigms to build more reliable and ethically aligned AI systems. While many existing surveys and studies focus on utilizing prompt engineering to activate LLMs for causal knowledge or developing benchmarks to assess their causal reasoning abilities, most of these efforts rely on human intervention to activate pre-trained models. How to embed causality into the training process of LLMs and build more general and intelligent models remains unexplored. Recent research highlights that LLMs function as causal parrots, capable of reciting causal knowledge without truly understanding or applying it. These prompt-based methods are still limited to human interventional improvements. This survey aims to address this gap by exploring how causality can enhance LLMs at every stage of their lifecycle-from token embedding learning and foundation model training to fine-tuning, alignment, inference, and evaluation-paving the way for more interpretable, reliable, and causally-informed models. Additionally, we further outline six promising future directions to advance LLM development, enhance their causal reasoning capabilities, and address the current limitations these models face. 
[AI-156] SNAP: Stopping Catastrophic Forgetting in Hebbian Learning with Sigmoidal Neuronal Adaptive Plasticity
链接: https://arxiv.org/abs/2410.15318
作者: Tianyi Xu,Patrick Zheng,Shiyan Liu,Sicheng Lyu,Isabeau Prémont-Schwarz
关键词-EN: Artificial Neural Networks, Neural Networks, Stochastic Gradient Descent, Artificial Neural, Existing Machine Learning
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 6 pages, 11 figures, accepted at Montréal AI and Neuroscience (MAIN) 2024 conference
点击查看摘要
Abstract:Artificial Neural Networks (ANNs) suffer from catastrophic forgetting, where the learning of new tasks causes the catastrophic forgetting of old tasks. Existing Machine Learning (ML) algorithms, including those using Stochastic Gradient Descent (SGD) and Hebbian Learning, typically update their weights linearly with experience, i.e., independently of their current strength. This contrasts with biological neurons, which at intermediate strengths are very plastic, but consolidate with Long-Term Potentiation (LTP) once they reach a certain strength. We hypothesize this mechanism might help mitigate catastrophic forgetting. We introduce Sigmoidal Neuronal Adaptive Plasticity (SNAP), an artificial approximation to Long-Term Potentiation for ANNs, by having the weights follow a sigmoidal growth behaviour, allowing the weights to consolidate and stabilize when they reach sufficiently large or small values. We then compare SNAP to linear weight growth and exponential weight growth and see that SNAP completely prevents the forgetting of previous tasks for Hebbian Learning but not for SGD-based learning.
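The contrast between linear and sigmoidal weight growth can be illustrated with a logistic update, in which plasticity scales with w * (1 - w) and so vanishes as a weight approaches 0 or 1. The abstract does not give SNAP's exact update rule, so the logistic form, the (0, 1) weight range, and the learning rate below are assumptions.

```python
def linear_update(w, hebb, lr=0.1):
    """Linear growth: the change ignores the current weight strength."""
    return w + lr * hebb

def sigmoidal_update(w, hebb, lr=0.1):
    """Logistic growth: plasticity w * (1 - w) vanishes near 0 and 1."""
    return w + lr * hebb * w * (1.0 - w)

# A consolidated weight receives 20 opposing updates from a "new task".
w_lin = w_sig = 0.99
for _ in range(20):
    w_lin = linear_update(w_lin, -1.0)
    w_sig = sigmoidal_update(w_sig, -1.0)
```

After the loop the linearly updated weight has been driven below zero, while the sigmoidally updated weight remains above 0.9, which is the consolidation effect the abstract describes.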
[AI-157] Synergistic Dual Spatial-aware Generation of Image-to-Text and Text-to-Image
链接: https://arxiv.org/abs/2410.15312
作者: Yu Zhao,Hao Fei,Xiangtai Li,Libo Qin,Jiayi Ji,Hongyuan Zhu,Meishan Zhang,Min Zhang,Jianguo Wei
关键词-EN: visual spatial understanding, spatial understanding, spatial, Spatial Dual Discrete, Dual Discrete Diffusion
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In the visual spatial understanding (VSU) area, spatial image-to-text (SI2T) and spatial text-to-image (ST2I) are two fundamental tasks that appear in dual form. Existing methods for standalone SI2T or ST2I perform imperfectly in spatial understanding, due to the difficulty of 3D-wise spatial feature modeling. In this work, we consider modeling the SI2T and ST2I together under a dual learning framework. Within the dual framework, we then propose to represent the 3D spatial scene features with a novel 3D scene graph (3DSG) representation that can be shared and beneficial to both tasks. Further, inspired by the intuition that the easier 3D→image and 3D→text processes also exist symmetrically in the ST2I and SI2T, respectively, we propose the Spatial Dual Discrete Diffusion (SD³) framework, which utilizes the intermediate features of the 3D→X processes to guide the hard X→3D processes, such that the overall ST2I and SI2T will benefit each other. On the visual spatial understanding dataset VSD, our system outperforms the mainstream T2I and I2T methods significantly. Further in-depth analysis reveals how our dual learning strategy advances.
[AI-158] Who is Undercover? Guiding LLMs to Explore Multi-Perspective Team Tactic in the Game
链接: https://arxiv.org/abs/2410.15311
作者: Ruiqi Dong,Zhixuan Liao,Guangwei Lai,Yuhan Ma,Danni Ma,Chenyou Fan
关键词-EN: Large Language Models, Large Language, Language Models, Multi-Perspective Team Tactic, open decision-making problems
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
*备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are pivotal AI agents in complex tasks but still face challenges in open decision-making problems within complex scenarios. To address this, we use the language logic game “Who is Undercover?” (WIU) as an experimental platform to propose the Multi-Perspective Team Tactic (MPTT) framework. MPTT aims to cultivate LLMs’ human-like language expression logic, multi-dimensional thinking, and self-perception in complex scenarios. By alternating speaking and voting sessions, integrating techniques like self-perspective, identity-determination, self-reflection, self-summary and multi-round find-teammates, LLM agents make rational decisions through strategic concealment and communication, fostering human-like trust. Preliminary results show that MPTT, combined with WIU, leverages LLMs’ cognitive capabilities to create a decision-making framework that can simulate real society. This framework aids minority groups in communication and expression, promoting fairness and diversity in decision-making. Additionally, our Human-in-the-loop experiments demonstrate that LLMs can learn and align with human behaviors through interaction, indicating their potential for active participation in societal decision-making.
[AI-159] LlamaLens: Specialized Multilingual LLM for Analyzing News and Social Media Content
链接: https://arxiv.org/abs/2410.15308
作者: Mohamed Bayan Kmainasi,Ali Ezzat Shahroor,Maram Hasanain,Sahinur Rahman Laskar,Naeemul Hassan,Firoj Alam
关键词-EN: demonstrated remarkable success, Large Language Models, general-purpose task solvers, Large Language, downstream NLP tasks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: LLMs, Multilingual, Language Diversity, Large Language Models, Social Media, News Media, Specialized LLMs, Fact-checking, Media Analysis
点击查看摘要
Abstract:Large Language Models (LLMs) have demonstrated remarkable success as general-purpose task solvers across various fields, including NLP, healthcare, finance, and law. However, their capabilities remain limited when addressing domain-specific problems, particularly in downstream NLP tasks. Research has shown that models fine-tuned on instruction-based downstream NLP datasets outperform those that are not fine-tuned. While most efforts in this area have primarily focused on resource-rich languages like English and broad domains, little attention has been given to multilingual settings and specific domains. To address this gap, this study focuses on developing a specialized LLM, LlamaLens, for analyzing news and social media content in a multilingual context. To the best of our knowledge, this is the first attempt to tackle both domain specificity and multilinguality, with a particular focus on news and social media. Our experimental setup includes 19 tasks, represented by 52 datasets covering Arabic, English, and Hindi. We demonstrate that LlamaLens outperforms the current state-of-the-art (SOTA) on 16 testing sets, and achieves comparable performance on 10 sets. We make the models and resources publicly available for the research community (this https URL).
[AI-160] Redefining Proactivity for Information Seeking Dialogue
链接: https://arxiv.org/abs/2410.15297
作者: Jing Yang Lee,Seokhwan Kim,Kartik Mehta,Jiun-Yu Kao,Yu-Hsiang Lin,Arpit Gupta
关键词-EN: provide accurate responses, user queries, addressing user queries, aim to provide, provide accurate
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Information-Seeking Dialogue (ISD) agents aim to provide accurate responses to user queries. While proficient in directly addressing user queries, these agents, as well as LLMs in general, predominantly exhibit reactive behavior, lacking the ability to generate proactive responses that actively engage users in sustained conversations. However, existing definitions of proactive dialogue in this context do not focus on how each response actively engages the user and sustains the conversation. Hence, we present a new definition of proactivity that focuses on enhancing the ‘proactiveness’ of each generated response via the introduction of new information related to the initial query. To this end, we construct a proactive dialogue dataset comprising 2,000 single-turn conversations, and introduce several automatic metrics to evaluate response ‘proactiveness’ which achieved high correlation with human annotation. Additionally, we introduce two innovative Chain-of-Thought (CoT) prompts, the 3-step CoT and the 3-in-1 CoT prompts, which consistently outperform standard prompts by up to 90% in the zero-shot setting.
[AI-161] Fractional-order spike-timing-dependent gradient descent for multi-layer spiking neural networks
链接: https://arxiv.org/abs/2410.15293
作者: Yi Yang,Richard M. Voyles,Haiyan H. Zhang,Robert A. Nawrocki
关键词-EN: Accumulated detailed knowledge, Accumulated detailed, bio-inspired spiking neural, spiking neural networks, deep neural networks
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 15 pages, 12 figures
点击查看摘要
Abstract:Accumulated detailed knowledge about the neuronal activities in human brains has brought more attention to bio-inspired spiking neural networks (SNNs). In contrast to non-spiking deep neural networks (DNNs), SNNs can encode and transmit spatiotemporal information more efficiently by exploiting biologically realistic and low-power event-driven neuromorphic architectures. However, the supervised learning of SNNs still remains a challenge because the spike-timing-dependent plasticity (STDP) of connected spiking neurons is difficult to implement and interpret in existing backpropagation learning schemes. This paper proposes a fractional-order spike-timing-dependent gradient descent (FO-STDGD) learning model by considering a derived nonlinear activation function that describes the relationship between the quasi-instantaneous firing rate and the temporal membrane potentials of nonleaky integrate-and-fire neurons. The training strategy can be generalized to any fractional orders between 0 and 2 since the FO-STDGD incorporates the fractional gradient descent method into the calculation of spike-timing-dependent loss gradients. The proposed FO-STDGD model is tested on the MNIST and DVS128 Gesture datasets and its accuracy under different network structures and fractional orders is analyzed. It can be found that the classification accuracy increases as the fractional order increases, and specifically, the case of fractional order 1.9 improves by 155% relative to the case of fractional order 1 (traditional gradient descent). In addition, our scheme demonstrates the state-of-the-art computational efficacy for the same SNN structure and training epochs.
[AI-162] Contextual Augmented Multi-Model Programming (CAMP): A Hybrid Local-Cloud Copilot Framework
链接: https://arxiv.org/abs/2410.15285
作者: Yuchen Wang,Shangxin Guo,Chee Wei Tan
关键词-EN: cloud-based Large Language, Large Language Models, Large Language, cloud-based Large, revolutionized AI-assisted programming
类目: Artificial Intelligence (cs.AI)
*备注: 12 pages, 3 figures, 4 tables
点击查看摘要
Abstract:The advancements in cloud-based Large Language Models (LLMs) have revolutionized AI-assisted programming. However, their integration into certain local development environments like ones within the Apple software ecosystem (e.g., iOS apps, macOS) remains challenging due to computational demands and sandboxed constraints. This paper presents CAMP, a multi-model AI-assisted programming framework that consists of a local model that employs Retrieval-Augmented Generation (RAG) to retrieve contextual information from the codebase to facilitate context-aware prompt construction, thus optimizing the performance of the cloud model and empowering LLMs’ capabilities in local Integrated Development Environments (IDEs). The methodology is actualized in Copilot for Xcode, an AI-assisted programming tool crafted for Xcode that employs the RAG module to address software constraints and enables diverse generative programming tasks, including automatic code completion, documentation, error detection, and intelligent user-agent interaction. The results from objective experiments on generated code quality and subjective experiments on user adoption collectively demonstrate the pilot success of the proposed system and mark its significant contributions to the realm of AI-assisted programming.
[AI-163] Large Language Models for Autonomous Driving (LLM4AD): Concept Benchmark Simulation and Real-Vehicle Experiment
链接: https://arxiv.org/abs/2410.15281
作者: Can Cui,Yunsheng Ma,Zichong Yang,Yupeng Zhou,Peiran Liu,Juanwu Lu,Lingxi Li,Yaobin Chen,Jitesh H. Panchal,Amr Abdelraouf,Rohit Gupta,Kyungtae Han,Ziran Wang
关键词-EN:
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
*备注:
[AI-164] ContextDet: Temporal Action Detection with Adaptive Context Aggregation
链接: https://arxiv.org/abs/2410.15279
作者: Ning Wang,Yun Xiao,Xiaopeng Peng,Xiaojun Chang,Xuanhong Wang,Dingyi Fang
关键词-EN: Temporal action detection, video understanding due, recognizes action segments, Temporal action, variable segment lengths
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注:
点击查看摘要
Abstract:Temporal action detection (TAD), which locates and recognizes action segments, remains a challenging task in video understanding due to variable segment lengths and ambiguous boundaries. Existing methods treat neighboring contexts of an action segment indiscriminately, leading to imprecise boundary predictions. We introduce a single-stage ContextDet framework, which makes use of large-kernel convolutions in TAD for the first time. Our model features a pyramid adaptive context aggregation (ACA) architecture, capturing long context and improving action discriminability. Each ACA level consists of two novel modules. The context attention module (CAM) identifies salient contextual information, encourages context diversity, and preserves context integrity through a context gating block (CGB). The long context module (LCM) makes use of a mixture of large- and small-kernel convolutions to adaptively gather long-range context and fine-grained local features. Additionally, by varying the length of these large kernels across the ACA pyramid, our model provides lightweight yet effective context aggregation and action discrimination. We conducted extensive experiments and compared our model with a number of advanced TAD methods on six challenging TAD benchmarks: MultiThumos, Charades, FineAction, EPIC-Kitchens 100, Thumos14, and HACS, demonstrating superior accuracy at reduced inference speed.
[AI-165] Performance-Driven QUBO for Recommender Systems on Quantum Annealers
链接: https://arxiv.org/abs/2410.15272
作者: Jiayang Niu,Jie Li,Ke Deng,Mark Sanderson,Yongli Ren
关键词-EN: Unconstrained Binary Optimization, Quadratic Unconstrained Binary, Analysis Quadratic Unconstrained, solve QUBO problems, Counterfactual Analysis Quadratic
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:We propose Counterfactual Analysis Quadratic Unconstrained Binary Optimization (CAQUBO) to solve QUBO problems for feature selection in recommender systems. CAQUBO leverages counterfactual analysis to measure the impact of individual features and feature combinations on model performance and employs the measurements to construct the coefficient matrix for a quantum annealer to select the optimal feature combinations for recommender systems, thereby improving their final recommendation performance. By establishing explicit connections between features and the recommendation performance, the proposed approach demonstrates superior performance compared to the state-of-the-art quantum annealing methods. Extensive experiments indicate that integrating quantum computing with counterfactual analysis holds great promise for addressing these challenges.
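The abstract above describes turning counterfactual measurements into a QUBO coefficient matrix. The sketch below is only an illustration of that general idea, not the paper's construction: the performance-drop numbers are made up, the mapping "larger drop gives a more negative coefficient" is an assumption, and an exhaustive search over fixed-size subsets stands in for the quantum annealer.

```python
from itertools import combinations

# Hypothetical counterfactual measurements: how much a validation metric
# drops when a single feature or a feature pair is removed.
single_drop = [0.30, 0.05, 0.20]
pair_drop = {(0, 1): 0.02, (0, 2): 0.25, (1, 2): 0.01}

n = len(single_drop)
# Upper-triangular QUBO matrix; larger drops become more negative coefficients
# (an assumed mapping, chosen so that minimizing energy keeps useful features).
Q = [[0.0] * n for _ in range(n)]
for i, d in enumerate(single_drop):
    Q[i][i] = -d
for (i, j), d in pair_drop.items():
    Q[i][j] = -d


def energy(selected):
    """x^T Q x for the binary vector with ones at the `selected` indices."""
    return sum(Q[i][j] for i in selected for j in selected if i <= j)


def best_subset(k):
    """Exhaustive search over k-feature subsets (stand-in for the annealer)."""
    return min(combinations(range(n), k), key=energy)


chosen = best_subset(2)
```

With these toy numbers the search keeps features 0 and 2, the pair whose joint removal hurts the most. Exhaustive search only works for a handful of features, which is the gap an annealer is meant to fill.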
[AI-166] AI Can Enhance Creativity in Social Networks
链接: https://arxiv.org/abs/2410.15264
作者: Raiyan Abdul Baten,Ali Sarosh Bangash,Krish Veera,Gourab Ghoshal,Ehsan Hoque
关键词-EN:
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:
[AI-167] HyQE: Ranking Contexts with Hypothetical Query Embeddings
链接: https://arxiv.org/abs/2410.15262
作者: Weichao Zhou,Jiaxin Zhang,Hilaf Hasson,Anu Singh,Wenchao Li
关键词-EN: retrieval-augmented systems, commonly employed, employed to reorder, contexts, user query
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In retrieval-augmented systems, context ranking techniques are commonly employed to reorder the retrieved contexts based on their relevance to a user query. A standard approach is to measure this relevance through the similarity between contexts and queries in the embedding space. However, such similarity often fails to capture the relevance. Alternatively, large language models (LLMs) have been used for ranking contexts. However, they can encounter scalability issues when the number of candidate contexts grows and the context window sizes of the LLMs remain constrained. Additionally, these approaches require fine-tuning LLMs with domain-specific data. In this work, we introduce a scalable ranking framework that combines embedding similarity and LLM capabilities without requiring LLM fine-tuning. Our framework uses a pre-trained LLM to hypothesize the user query based on the retrieved contexts and ranks the context based on the similarity between the hypothesized queries and the user query. Our framework is efficient at inference time and is compatible with many other retrieval and ranking techniques. Experimental results show that our method improves the ranking performance across multiple benchmarks. The complete code and data are available at this https URL
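The ranking idea in this abstract (hypothesize queries from each context, then compare them with the user query) can be sketched as follows. This is a minimal illustration under several assumptions: the hypothesized queries are hard-coded strings rather than LLM output, a bag-of-words vector replaces a real embedding model, and taking the maximum similarity per context is an aggregation choice made here, not one confirmed by the source.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_contexts(user_query, hypothesized):
    """Rank contexts by their best hypothesized-query similarity to the user query.

    `hypothesized` maps each context id to queries an LLM might have produced
    for that context; here they are hypothetical, hand-written strings.
    """
    q = embed(user_query)
    scores = {
        ctx: max(cosine(q, embed(h)) for h in queries)
        for ctx, queries in hypothesized.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


hypothesized = {
    "ctx_cache": ["how does kv cache compression work"],
    "ctx_rank": ["how to rank retrieved contexts", "what is context ranking"],
}
ranking = rank_contexts("how to rank contexts for a query", hypothesized)
```

Because the hypothesized queries depend only on the contexts, they could be generated and embedded ahead of time, leaving a single query embedding and a similarity lookup at inference.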
[AI-168] Lossless KV Cache Compression to 2%
链接: https://arxiv.org/abs/2410.15252
作者: Zhen Yang,J.N.Han,Kan Wu,Ruobing Xie,An Wang,Xingwu Sun,Zhanhui Kang
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:
[AI-169] Tensor-Fused Multi-View Graph Contrastive Learning
链接: https://arxiv.org/abs/2410.15247
作者: Yujia Wu,Junyi Mo,Elynn Chen,Yuzhou Chen
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
[AI-170] Economic Anthropology in the Era of Generative Artificial Intelligence
链接: https://arxiv.org/abs/2410.15238
作者: Zachary Sheldon,Peeyush Kumar
关键词-EN: generative artificial intelligence, artificial intelligence, paper explores, explores the intersection, generative artificial
类目: Artificial Intelligence (cs.AI); General Economics (econ.GN)
*备注:
点击查看摘要
Abstract:This paper explores the intersection of economic anthropology and generative artificial intelligence (GenAI). It examines how large language models (LLMs) can simulate human decision-making and the inductive biases present in AI research. The study introduces two AI models: C.A.L.L.O.N. (Conventionally Average Late Liberal ONtology) and M.A.U.S.S. (More Accurate Understanding of Society and its Symbols). The former is trained on standard data, while the latter is adapted with anthropological knowledge. The research highlights how anthropological training can enhance LLMs’ ability to recognize diverse economic systems and concepts. The findings suggest that integrating economic anthropology with AI can provide a more pluralistic understanding of economics and improve the sustainability of non-market economic systems.
[AI-171] Jailbreaking and Mitigation of Vulnerabilities in Large Language Models
链接: https://arxiv.org/abs/2410.15236
作者: Benji Peng,Ziqian Bi,Qian Niu,Ming Liu,Pohsun Feng,Tianyang Wang,Lawrence K.Q. Yan,Yizhu Wen,Yichao Zhang,Caitlyn Heqi Yin
关键词-EN:
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
[AI-172] Bias Amplification: Language Models as Increasingly Biased Media
链接: https://arxiv.org/abs/2410.15234
作者: Ze Wang,Zekun Wu,Jeremy Zhang,Navya Jain,Xin Guan,Adriano Koshiyama
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注: Submitted to ARR Rolling Review October
[AI-173] Chasing Random: Instruction Selection Strategies Fail to Generalize
链接: https://arxiv.org/abs/2410.15225
作者: Harshita Diddee,Daphne Ippolito
关键词-EN:
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:
[AI-174] AutoFLUKA: A Large Language Model Based Framework for Automating Monte Carlo Simulations in FLUKA
链接: https://arxiv.org/abs/2410.15222
作者: Zavier Ndum Ndum,Jian Tao,John Ford,Yang Liu
关键词-EN:
类目: Artificial Intelligence (cs.AI); High Energy Physics - Experiment (hep-ex); Nuclear Experiment (nucl-ex); Computational Physics (physics.comp-ph); Medical Physics (physics.med-ph)
*备注: 58 pages including text, figures, references and appendices
[AI-175] IntersectionZoo: Eco-driving for Benchmarking Multi-Agent Contextual Reinforcement Learning
链接: https://arxiv.org/abs/2410.15221
作者: Vindula Jayawardana,Baptiste Freydt,Ao Qu,Cameron Hickert,Zhongxia Yan,Cathy Wu
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
*备注: In review
[AI-176] Low-cost Robust Night-time Aerial Material Segmentation through Hyperspectral Data and Sparse Spatio-Temporal Learning ICONIP
链接: https://arxiv.org/abs/2410.15208
作者: Chandrajit Bajaj,Minh Nguyen,Shubham Bhardwaj
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to the International Conference on Neural Information Processing (ICONIP) 2024. To be published in Springer-Nature Communications in Computer and Information Science (CCIS) Series
[AI-177] Medical-GAT: Cancer Document Classification Leveraging Graph-Based Residual Network for Scenarios with Limited Data
链接: https://arxiv.org/abs/2410.15198
作者: Elias Hossain,Tasfia Nuzhat,Shamsul Masum,Shahram Rahimi,Sudip Mittal,Noorbakhsh Amiri Golilarz
关键词-EN: Accurate classification, crucial for healthcare, healthcare management, learning models, cancer
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Accurate classification of cancer-related medical abstracts is crucial for healthcare management and research. However, obtaining large, labeled datasets in the medical domain is challenging due to privacy concerns and the complexity of clinical data. This scarcity of annotated data impedes the development of effective machine learning models for cancer document classification. To address this challenge, we present a curated dataset of 1,874 biomedical abstracts, categorized into thyroid cancer, colon cancer, lung cancer, and generic topics. Our research focuses on leveraging this dataset to improve classification performance, particularly in data-scarce scenarios. We introduce a Residual Graph Attention Network (R-GAT) with multiple graph attention layers that capture the semantic information and structural relationships within cancer-related documents. Our R-GAT model is compared with various techniques, including transformer-based models such as Bidirectional Encoder Representations from Transformers (BERT), RoBERTa, and domain-specific models like BioBERT and Bio+ClinicalBERT. We also evaluated deep learning models (CNNs, LSTMs) and traditional machine learning models (Logistic Regression, SVM). Additionally, we explore ensemble approaches that combine deep learning models to enhance classification. Various feature extraction methods are assessed, including Term Frequency-Inverse Document Frequency (TF-IDF) with unigrams and bigrams, Word2Vec, and tokenizers from BERT and RoBERTa. The R-GAT model outperforms other techniques, achieving precision, recall, and F1 scores of 0.99, 0.97, and 0.98 for thyroid cancer; 0.96, 0.94, and 0.95 for colon cancer; 0.96, 0.99, and 0.97 for lung cancer; and 0.95, 0.96, and 0.95 for generic topics.
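As a quick consistency check on the figures quoted in this abstract, F1 can be recomputed as the harmonic mean of precision and recall. The inputs are the two-decimal values from the abstract, so the recomputation is only approximate; the dictionary keys are labels chosen here for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as quoted in the abstract
reported = {
    "thyroid": (0.99, 0.97, 0.98),
    "colon": (0.96, 0.94, 0.95),
    "lung": (0.96, 0.99, 0.97),
    "generic": (0.95, 0.96, 0.95),
}
recomputed = {name: round(f1(p, r), 2) for name, (p, r, _) in reported.items()}
```

For all four classes the recomputed value matches the reported F1 to two decimals.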
[AI-178] Augmented Lagrangian-Based Safe Reinforcement Learning Approach for Distribution System Volt/VAR Control
链接: https://arxiv.org/abs/2410.15188
作者: Guibin Chen
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注: arXiv admin note: substantial text overlap with arXiv:2209.09772
[AI-179] Fine-tuning foundational models to code diagnoses from veterinary health records
链接: https://arxiv.org/abs/2410.15186
作者: Mayla R. Boguslav,Adam Kiehl,David Kott,G. Joseph Strecker,Tracy Webb,Nadia Saklou,Terri Ward,Michael Kirby
关键词-EN: medical records represent, Natural Language Processing, Veterinary medical records, Veterinary, Veterinary Teaching Hospital
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 26 pages, 5 figures
点击查看摘要
Abstract:Veterinary medical records represent a large data resource for application to veterinary and One Health clinical research efforts. Use of the data is limited by interoperability challenges including inconsistent data formats and data siloing. Clinical coding using standardized medical terminologies enhances the quality of medical records and facilitates their interoperability with veterinary and human health records from other sites. Previous studies, such as DeepTag and VetTag, evaluated the application of Natural Language Processing (NLP) to automate veterinary diagnosis coding, employing long short-term memory (LSTM) and transformer models to infer a subset of Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT) diagnosis codes from free-text clinical notes. This study expands on these efforts by incorporating all 7,739 distinct SNOMED-CT diagnosis codes recognized by the Colorado State University (CSU) Veterinary Teaching Hospital (VTH) and by leveraging the increasing availability of pre-trained large language models (LLMs). Ten freely-available pre-trained LLMs were fine-tuned on the free-text notes from 246,473 manually-coded veterinary patient visits included in the CSU VTH’s electronic health records (EHRs), which resulted in superior performance relative to previous efforts. The most accurate results were obtained when expansive labeled data were used to fine-tune relatively large clinical LLMs, but the study also showed that comparable results can be obtained using more limited resources and non-clinical LLMs. The results of this study contribute to the improvement of the quality of veterinary EHRs by investigating accessible methods for automated coding and support both animal and human health research by paving the way for more integrated and comprehensive health databases that span species and institutions.
[AI-180] Action abstractions for amortized sampling
链接: https://arxiv.org/abs/2410.15184
作者: Oussama Boussif,Léna Néhale Ezzine,Joseph D Viviano,Michał Koziarski,Moksh Jain,Nikolay Malkin,Emmanuel Bengio,Rim Assouel,Yoshua Bengio
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
[AI-181] Enhancing Robot Navigation Policies with Task-Specific Uncertainty Management
链接: https://arxiv.org/abs/2410.15178
作者: Gokul Puthumanaillam,Paulo Padrao,Jose Fuentes,Leonardo Bobadilla,Melkior Ornik
关键词-EN:
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
[AI-182] Uncovering Autoregressive LLM Knowledge of Thematic Fit in Event Representation
链接: https://arxiv.org/abs/2410.15173
作者: Safeyah Khaled Alshemali,Daniel Bauer,Yuval Marton
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 15 pages, 3 figures
[AI-183] Linguistic Fuzzy Information Evolution with Random Leader Election Mechanism for Decision-Making Systems
链接: https://arxiv.org/abs/2410.15171
作者: Qianlei Jia,Witold Pedrycz
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注:
[AI-184] SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation
链接: https://arxiv.org/abs/2410.15164
作者: Jingxuan Chen,Derek Yuen,Bin Xie,Yuhao Yang,Gongwei Chen,Zhihao Wu,Li Yixing,Xurui Zhou,Weiwen Liu,Shuai Wang,Kaiwen Zhou,Rui Shao,Liqiang Nie,Yasheng Wang,Jianye Hao,Jun Wang,Kun Shao
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注:
[AI-185] Optimizing Large Language Models for Dynamic Constraints through Human-in-the-Loop Discriminators
链接: https://arxiv.org/abs/2410.15163
作者: Timothy Wei,Annabelle Miin,Anastasia Miin
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注:
[AI-186] Simulation-Based Optimistic Policy Iteration For Multi-Agent MDPs with Kullback-Leibler Control Cost
链接: https://arxiv.org/abs/2410.15156
作者: Khaled Nakhleh,Ceyhun Eksin,Sabit Ekin
关键词-EN:
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
*备注:
[AI-187] MCCoder: Streamlining Motion Control with LLM-Assisted Code Generation and Rigorous Verification
链接: https://arxiv.org/abs/2410.15154
作者: Yin Li,Liangwei Wang,Shiyuan Piao,Boo-Ho Yang,Ziyue Li,Wei Zeng,Fugee Tsung
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注:
[AI-188] Budgeted Online Continual Learning by Adaptive Layer Freezing and Frequency-based Sampling
链接: https://arxiv.org/abs/2410.15143
作者: Minhyuk Seo,Hyunseo Koh,Jonghyun Choi
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:
[AI-189] Generalized Flow Matching for Transition Dynamics Modeling
链接: https://arxiv.org/abs/2410.15128
作者: Haibo Wang,Yuxuan Qiu,Yanze Wang,Rob Brekelmans,Yuanqi Du
关键词-EN: Simulating transition dynamics, understanding protein folding, Simulating transition, wide real-world applications, protein folding
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biological Physics (physics.bio-ph); Chemical Physics (physics.chem-ph)
*备注:
点击查看摘要
Abstract:Simulating transition dynamics between metastable states is a fundamental challenge in dynamical systems and stochastic processes with wide real-world applications in understanding protein folding, chemical reactions and neural activities. However, the computational challenge often lies in sampling exponentially many paths in which only a small fraction ends in the target metastable state due to the existence of high energy barriers. To amortize the cost, we propose a data-driven approach to warm up the simulation by learning nonlinear interpolations from local dynamics. Specifically, we infer a potential energy function from local dynamics data. To find plausible paths between two metastable states, we formulate a generalized flow matching framework that learns a vector field to sample probable paths between the two marginal densities under the learned energy function. Furthermore, we iteratively refine the model by assigning importance weights to the sampled paths and buffering more likely paths for training. We validate the effectiveness of the proposed method to sample probable paths on both synthetic and real-world molecular systems.
[AI-190] Reinfier and Reintrainer: Verification and Interpretation-Driven Safe Deep Reinforcement Learning Frameworks
链接: https://arxiv.org/abs/2410.15127
作者: Zixuan Yang,Jiaqi Zheng,Guihai Chen
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
[AI-191] MELT: Materials-aware Continued Pre-training for Language Model Adaptation to Materials Science EMNLP2024
链接: https://arxiv.org/abs/2410.15126
作者: Junho Kim,Yeachan Kim,Jun-Hyung Park,Yerim Oh,Suho Kim,SangKeun Lee
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted at EMNLP 2024 (Findings)
[AI-192] Coarse-to-Fine Highlighting: Reducing Knowledge Hallucination in Large Language Models
链接: https://arxiv.org/abs/2410.15116
作者: Qitan Lv,Jie Wang,Hanzhu Chen,Bin Li,Yongdong Zhang,Feng Wu
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:
[AI-193] On Designing Effective RL Reward at Training Time for LLM Reasoning
链接: https://arxiv.org/abs/2410.15115
作者: Jiaxuan Gao,Shusheng Xu,Wenjie Ye,Weilin Liu,Chuyi He,Wei Fu,Zhiyu Mei,Guangju Wang,Yi Wu
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:
[AI-194] A Prompt Refinement-based Large Language Model for Metro Passenger Flow Forecasting under Delay Conditions
链接: https://arxiv.org/abs/2410.15111
作者: Ping Huang,Yuxin He,Hao Wang,Jingjing Chen,Qin Luo
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注: 14 pages, 2 figures
[AI-195] Incorporating Group Prior into Variational Inference for Tail-User Behavior Modeling in CTR Prediction
链接: https://arxiv.org/abs/2410.15098
作者: Han Xu,Taoxing Pan,Zhiqiang Liu,Xiaoxiao Xu,Lantao Hu
关键词-EN:
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:
[AI-196] GDPO: Learning to Directly Align Language Models with Diversity Using GFlowNets
链接: https://arxiv.org/abs/2410.15096
作者: Oh Joon Kwon,Daiki E. Matsunaga,Kee-Eung Kim
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注:
[AI-197] DPVS-Shapley: Faster and Universal Contribution Evaluation Component in Federated Learning
链接: https://arxiv.org/abs/2410.15093
作者: Ketin Yin,Zonghao Guo,ZhengHan Qin
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Science and Game Theory (cs.GT)
*备注:
[AI-198] Towards Safer Heuristics With XPlain
链接: https://arxiv.org/abs/2410.15086
作者: Pantea Karimi,Solal Pirelli,Siva Kesava Reddy Kakarla,Ryan Beckett,Santiago Segarra,Beibin Li,Pooria Namyar,Behnaz Arzani
关键词-EN:
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI); Performance (cs.PF)
*备注:
[AI-199] A Distribution Semantics for Probabilistic Term Rewriting
链接: https://arxiv.org/abs/2410.15081
作者: Germán Vidal
关键词-EN:
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI)
*备注: Submitted for publication
[AI-200] EPT-1.5 Technical Report
链接: https://arxiv.org/abs/2410.15076
作者: Roberto Molinaro,Jordan Dane Daubinet,Alexander Jakob Dautel,Andreas Schlueter,Alex Grigoryev,Nikoo Ekhtiari,Bas Steunebrink,Kevin Thiart,Roan John Song,Henry Martin,Leonie Wagner,Andrea Giussani,Marvin Vincent Gabler
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注:
[AI-201] SLIC: Secure Learned Image Codec through Compressed Domain Watermarking to Defend Image Manipulation
链接: https://arxiv.org/abs/2410.15075
作者: Chen-Hsiu Huang,Ja-Ling Wu
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: accepted by ACM Multimedia Asia 2024
[AI-202] LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound
链接: https://arxiv.org/abs/2410.15074
作者: Xuechen Guo,Wenhao Chai,Shi-Yan Li,Gaoang Wang
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
[AI-203] Personalized Federated Learning with Adaptive Feature Aggregation and Knowledge Transfer
链接: https://arxiv.org/abs/2410.15073
作者: Keting Yin,Jiayi Mao
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:
[AI-204] A Cycle Ride to HDR: Semantics Aware Self-Supervised Framework for Unpaired LDR-to-HDR Image Translation
链接: https://arxiv.org/abs/2410.15068
作者: Hrishav Bakul Barua,Stefanov Kalin,Lemuel Lai En Che,Dhall Abhinav,Wong KokSheik,Krishnasamy Ganesh
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Submitted to IEEE
[AI-205] A Prompt Engineering Approach and a Knowledge Graph based Framework for Tackling Legal Implications of Large Language Model Answers
链接: https://arxiv.org/abs/2410.15064
作者: George Hannah,Rita T. Sousa,Ioannis Dasoulas,Claudia d’Amato
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注: 27 pages, 2 figures
[AI-206] A Dual-Fusion Cognitive Diagnosis Framework for Open Student Learning Environments
链接: https://arxiv.org/abs/2410.15054
作者: Yuanhao Liu,Shuo Liu,Yimeng Liu,Jingwen Yang,Hong Qian
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注:
[AI-207] Mining Glitch Tokens in Large Language Models via Gradient-based Discrete Optimization
链接: https://arxiv.org/abs/2410.15052
作者: Zihui Wu,Haichang Gao,Ping Wang,Shudong Zhang,Zhaoxiang Liu,Shiguo Lian
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注:
[AI-208] MorphAgent: Empowering Agents through Self-Evolving Profiles and Decentralized Collaboration
链接: https://arxiv.org/abs/2410.15048
作者: Siyuan Lu,Jiaqi Shao,Bing Luo,Tao Lin
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注:
[AI-209] Mind the Remaining: Mechanism Design for Robust Federated Unlearning
链接: https://arxiv.org/abs/2410.15045
作者: Jiaqi Shao,Tao Lin,Bing Luo
关键词-EN:
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
*备注:
[AI-210] Adversarial Training: A Survey
链接: https://arxiv.org/abs/2410.15042
作者: Mengnan Zhao,Lihe Zhang,Jingwen Ye,Huchuan Lu,Baocai Yin,Xinchao Wang
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
[AI-211] Retrieval Augmented Diffusion Model for Structure-informed Antibody Design and Optimization
链接: https://arxiv.org/abs/2410.15040
作者: Zichen Wang,Yaokun Ji,Jianing Tian,Shuangjia Zheng
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注:
[AI-212] A General-Purpose Multimodal Foundation Model for Dermatology
链接: https://arxiv.org/abs/2410.15038
作者: Siyuan Yan,Zhen Yu,Clare Primiero,Cristina Vico-Alonso,Zhonghua Wang,Litao Yang,Philipp Tschandl,Ming Hu,Gin Tan,Vincent Tang,Aik Beng Ng,David Powell,Paul Bonnington,Simon See,Monika Janda,Victoria Mar,Harald Kittler,H. Peter Soyer,Zongyuan Ge
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 56 pages; Technical report
[AI-213] Enhancing Multimodal Sentiment Analysis for Missing Modality through Self-Distillation and Unified Modality Cross-Attention
链接: https://arxiv.org/abs/2410.15029
作者: Yuzhe Weng,Haotian Wang,Tian Gao,Kewei Li,Shutong Niu,Jun Du
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:
[AI-214] A Novel Reinforcement Learning Model for Post-Incident Malware Investigations
链接: https://arxiv.org/abs/2410.15028
作者: Dipo Dunsin,Mohamed Chahine Ghanem,Karim Ouazzane,Vassil Vassilev
关键词-EN: cyber incident response, Research proposes, Markov Decision Process, Reinforcement Learning, optimise malware forensics
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 8 pages. arXiv admin note: substantial text overlap with arXiv:2408.01999
点击查看摘要
Abstract:This research proposes a novel Reinforcement Learning (RL) model to optimise malware forensics investigation during cyber incident response. It aims to improve forensic investigation efficiency by reducing false negatives and adapting current practices to evolving malware signatures. The proposed RL framework leverages techniques such as Q-learning and the Markov Decision Process (MDP) to train the system to identify malware patterns in live memory dumps, thereby automating forensic tasks. The RL model is based on a detailed malware workflow diagram that guides the analysis of malware artefacts using static and behavioural techniques as well as machine learning algorithms. Furthermore, it seeks to address challenges in the UK justice system by ensuring the accuracy of forensic evidence. We conduct testing and evaluation in controlled environments, using datasets created with Windows operating systems to simulate malware infections. The experimental results demonstrate that RL improves malware detection rates compared to conventional methods, with the RL model’s performance varying depending on the complexity and learning rate of the environment. The study concludes that while RL offers promising potential for automating malware forensics, its efficacy across diverse malware types requires ongoing refinement of reward systems and feature extraction methods.
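For readers unfamiliar with the Q-learning and MDP terms used in this abstract, the following is a textbook tabular Q-learning loop on a toy four-state chain. It is not the paper's malware-forensics environment: the states, the reward, and the hyperparameters (`ALPHA`, `GAMMA`, `EPSILON`) are all hypothetical and chosen only to keep the sketch small.

```python
import random

N_STATES, ACTIONS = 4, (0, 1)  # toy chain; action 1 moves toward the goal
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2  # hypothetical hyperparameters


def step(state, action):
    """Hypothetical environment: reward 1 only on reaching the last state."""
    nxt = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)


def train(episodes=200, seed=0):
    """Run tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0
        while s != N_STATES - 1:
            # Explore with probability EPSILON, otherwise act greedily
            if rng.random() < EPSILON:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda x: q[s][x])
            nxt, r = step(s, a)
            # Q-learning update: bootstrap from the best next-state value
            q[s][a] += ALPHA * (r + GAMMA * max(q[nxt]) - q[s][a])
            s = nxt
    return q


q_table = train()
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
```

After training, the greedy policy moves toward the goal from every non-terminal state. The abstract's remark that performance depends on the learning rate corresponds to `ALPHA` here, which sets how far each update moves the stored value toward its bootstrapped target.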
[AI-215] A Recommendation Model Utilizing Separation Embedding and Self-Attention for Feature Mining
链接: https://arxiv.org/abs/2410.15026
作者: Wenyi Liu,Rui Wang,Yuanshuai Luo,Jianjun Wei,Zihao Zhao,Junming Huang
关键词-EN:
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:
[AI-216] LLM-Driven Learning Analytics Dashboard for Teachers in EFL Writing Education EMNLP2024
链接: https://arxiv.org/abs/2410.15025
作者: Minsun Kim,SeonGyeom Kim,Suyoun Lee,Yoosang Yoon,Junho Myung,Haneul Yoo,Hyunseung Lim,Jieun Han,Yoonsu Kim,So-Yeon Ahn,Juho Kim,Alice Oh,Hwajung Hong,Tak Yeon Lee
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: EMNLP 2024 Workshop CustomNLP4U. arXiv admin note: text overlap with arXiv:2405.19691
[AI-217] DM-Codec: Distilling Multimodal Representations for Speech Tokenization
链接: https://arxiv.org/abs/2410.15017
作者: Md Mubtasim Ahasan,Md Fahim,Tasnim Mohiuddin,A K M Mahbubur Rahman,Aman Chadha,Tariq Iqbal,M Ashraful Amin,Md Mofijul Islam,Amin Ahsan Ali
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:
[AI-218] Transit Pulse: Utilizing Social Media as a Source for Customer Feedback and Information Extraction with Large Language Model
链接: https://arxiv.org/abs/2410.15016
作者: Jiahao Wang,Amer Shalaby
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 17 pages, 21 figures
[AI-219] DST-TransitNet: A Dynamic Spatio-Temporal Deep Learning Model for Scalable and Efficient Network-Wide Prediction of Station-Level Transit Ridership
链接: https://arxiv.org/abs/2410.15013
作者: Jiahao Wang,Amer Shalaby
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
*备注: 16 pages, 22 figures. Accepted by TRB 2025
[AI-220] FlexMol: A Flexible Toolkit for Benchmarking Molecular Relational Learning
链接: https://arxiv.org/abs/2410.15010
作者: Sizhe Liu,Jun Xia,Lecheng Zhang,Yuchen Liu,Yue Liu,Wenjie Du,Zhangyang Gao,Bozhen Hu,Cheng Tan,Hongxin Xiang,Stan Z. Li
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
[AI-221] A comparative study of NeuralODE and Universal ODE approaches to solving Chandrasekhar White Dwarf equation
链接: https://arxiv.org/abs/2410.14998
作者: Raymundo Vazquez Martinez,Raj Abhijit Dandekar,Rajat Dandekar,Sreedath Panat
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
[AI-222] Improving Pronunciation and Accent Conversion through Knowledge Distillation And Synthetic Ground-Truth from Native TTS ICASSP2025
链接: https://arxiv.org/abs/2410.14997
作者: Tuan Nam Nguyen,Seymanur Akti,Ngoc Quan Pham,Alexander Waibel
关键词-EN:
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: Submitted to ICASSP 2025
[AI-223] AutoFPDesigner: Automated Flight Procedure Design Based on Multi-Agent Large Language Model
链接: https://arxiv.org/abs/2410.14989
作者: Longtao Zhu,Hongyu Yang,Ge Song,Xin Ma,Yanxin Zhang,Yulong Ji
关键词-EN:
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 21 pages, 18 figures, 5 tables
[AI-224] NeuralMAG: Fast and Generalizable Micromagnetic Simulation with Deep Neural Nets
链接: https://arxiv.org/abs/2410.14986
作者: Yunqi Cai,Jiangnan Li,Dong Wang
关键词-EN:
类目: Machine Learning (cs.LG); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Artificial Intelligence (cs.AI)
*备注:
[AI-225] Do Large Language Models Truly Grasp Mathematics? An Empirical Exploration
链接: https://arxiv.org/abs/2410.14979
作者: Wei Xie,Shuoyoucheng Ma,Zhenhua Wang,Enze Wang,Baosheng Wang,Jinshu Su
关键词-EN:
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:
[AI-226] Reflexive Guidance: Improving OoDD in Vision-Language Models via Self-Guided Image-Adaptive Concept Generation
链接: https://arxiv.org/abs/2410.14975
作者: Seulbi Lee,Jihyo Kim,Sangheum Hwang
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: The first two authors contributed equally
[AI-227] BrainECHO: Semantic Brain Signal Decoding through Vector-Quantized Spectrogram Reconstruction for Whisper-Enhanced Text Generation
链接: https://arxiv.org/abs/2410.14971
作者: Jilong Li,Zhenxi Song,Jiaqi Wang,Min Zhang,Zhiguo Zhang
关键词-EN:
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:
[AI-228] Taming the Long Tail in Human Mobility Prediction NEURIPS2024
链接: https://arxiv.org/abs/2410.14970
作者: Xiaohang Xu,Renhe Jiang,Chuang Yang,Zipei Fan,Kaoru Sezaki
关键词-EN:
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 2024
[AI-229] LangGFM: A Large Language Model Alone Can be a Powerful Graph Foundation Model
链接: https://arxiv.org/abs/2410.14961
作者: Tianqianjin Lin,Pengwei Yan,Kaisong Song,Zhuoren Jiang,Yangyang Kang,Jun Lin,Weikang Yuan,Junjie Cao,Changlong Sun,Xiaozhong Liu
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注: under review
[AI-230] Offline-to-online Reinforcement Learning for Image-based Grasping with Scarce Demonstrations
链接: https://arxiv.org/abs/2410.14957
作者: Bryan Chan,Anson Leung,James Bergstra
关键词-EN:
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
[AI-231] LSS-SKAN: Efficient Kolmogorov-Arnold Networks based on Single-Parameterized Function
链接: https://arxiv.org/abs/2410.14951
作者: Zhijie Chen,Xinglin Zhang
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注: 25 pages, 14 figures, experiment codes are available at this https URL , and SKAN’s Python library code are available at this https URL
[AI-232] Optimally Solving Colored Generalized Sliding-Tile Puzzles: Complexity and Bounds
链接: https://arxiv.org/abs/2410.14947
作者: Marcus Gozon,Jingjin Yu
关键词-EN:
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: WAFR 2024 Conference Version
[AI-233] DEL-Ranking: Ranking-Correction Denoising Framework for Elucidating Molecular Affinities in DNA-Encoded Libraries
链接: https://arxiv.org/abs/2410.14946
作者: Hanqun Cao,Chunbin Gu,Mutian He,Ning Ma,Chang-yu Hsieh,Pheng-Ann Heng
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
*备注:
[AI-234] Water quality polluted by total suspended solids classified within an Artificial Neural Network approach
链接: https://arxiv.org/abs/2410.14929
作者: I. Luviano Soto,Y. Concha Sánchez,A. Raya
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 42 pages, 8 figures and 2 tables
[AI-235] Cooperation and Fairness in Multi-Agent Reinforcement Learning
链接: https://arxiv.org/abs/2410.14916
作者: Jasmine Jerry Aloor,Siddharth Nayak,Sydney Dolan,Hamsa Balakrishnan
关键词-EN:
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: Manuscript accepted in ACM Journal on Autonomous Transportation Systems
[AI-236] A Hybrid Defense Strategy for Boosting Adversarial Robustness in Vision-Language Models
链接: https://arxiv.org/abs/2410.14911
作者: Yuhan Liang,Yijun Li,Yumeng Niu,Qianhe Shen,Hangyu Liu
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:
[AI-237] From Test-Taking to Test-Making: Examining LLM Authoring of Commonsense Assessment Items EMNLP2024
链接: https://arxiv.org/abs/2410.14897
作者: Melissa Roemmele,Andrew S. Gordon
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted at Findings of EMNLP 2024
[AI-238] Truncated Consistency Models
链接: https://arxiv.org/abs/2410.14895
作者: Sangyun Lee,Yilun Xu,Tomas Geffner,Giulia Fanti,Karsten Kreis,Arash Vahdat,Weili Nie
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:
[AI-239] Soft-Label Integration for Robust Toxicity Classification NEURIPS24
链接: https://arxiv.org/abs/2410.14894
作者: Zelei Cheng,Xian Wu,Jiahao Yu,Shuo Han,Xin-Qiang Cai,Xinyu Xing
关键词-EN:
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Accepted by Neurips 24
[AI-240] Reasoning, Memorization, and Fine-Tuning Language Models for Non-Cooperative Games
链接: https://arxiv.org/abs/2410.14890
作者: Yunhao Yang,Leonard Berthellemy,Ufuk Topcu
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注:
[AI-241] Self-Satisfied: An end-to-end framework for SAT generation and prediction
链接: https://arxiv.org/abs/2410.14888
作者: Christopher R. Serrano,Jonathan Gallagher,Kenji Yamada,Alexei Kopylov,Michael A. Warren
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注: 22 pages
[AI-242] Class-RAG: Content Moderation with Retrieval Augmented Generation ACL
链接: https://arxiv.org/abs/2410.14881
作者: Jianfa Chen,Emily Shen,Trupti Bavalatti,Xiaowen Lin,Yongkai Wang,Shuming Hu,Harihar Subramanyam,Ksheeraj Sai Vepuri,Ming Jiang,Ji Qi,Li Chen,Nan Jiang,Ankit Jain
关键词-EN:
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 11 pages, submit to ACL
[AI-243] Vital Insight: Assisting Experts' Sensemaking Process of Multi-modal Personal Tracking Data Using Visualization and LLM
链接: https://arxiv.org/abs/2410.14879
作者: Jiachen Li,Justin Steinberg,Xiwen Li,Akshat Choube,Bingsheng Yao,Dakuo Wang,Elizabeth Mynatt,Varun Mishra
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:
[AI-244] How to Evaluate Reward Models for RLHF
链接: https://arxiv.org/abs/2410.14872
作者: Evan Frick,Tianle Li,Connor Chen,Wei-Lin Chiang,Anastasios N. Angelopoulos,Jiantao Jiao,Banghua Zhu,Joseph E. Gonzalez,Ion Stoica
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:
[AI-245] Joint Verification and Refinement of Language Models for Safety-Constrained Planning
链接: https://arxiv.org/abs/2410.14865
作者: Yunhao Yang,William Ward,Zichao Hu,Joydeep Biswas,Ufuk Topcu
关键词-EN:
类目: Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Robotics (cs.RO)
*备注:
[AI-246] DFlow: Diverse Dialogue Flow Simulation with Large Language Models
链接: https://arxiv.org/abs/2410.14853
作者: Wanyu Du,Song Feng,James Gung,Lijia Sun,Yi Zhang,Saab Mansour,Yanjun Qi
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 16 pages
[AI-247] Making LLMs Vulnerable to Prompt Injection via Poisoning Alignment
链接: https://arxiv.org/abs/2410.14827
作者: Zedian Shao,Hongbin Liu,Jaden Mu,Neil Zhenqiang Gong
关键词-EN:
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:
[AI-248] SPRIG: Improving Large Language Model Performance by System Prompt Optimization
链接: https://arxiv.org/abs/2410.14826
作者: Lechen Zhang,Tolga Ergen,Lajanugen Logeswaran,Moontae Lee,David Jurgens
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:
[AI-249] A Complexity-Based Theory of Compositionality
链接: https://arxiv.org/abs/2410.14817
作者: Eric Elmoznino,Thomas Jiralerspong,Yoshua Bengio,Guillaume Lajoie
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
[AI-250] The S2 Hierarchical Discrete Global Grid as a Nexus for Data Representation, Integration, and Querying Across Geospatial Knowledge Graphs
链接: https://arxiv.org/abs/2410.14808
作者: Shirly Stephen,Mitchell Faulk,Krzysztof Janowicz,Colby Fisher,Thomas Thelen,Rui Zhu,Pascal Hitzler,Cogan Shimizu,Kitty Currier,Mark Schildhauer,Dean Rehberger,Zhangyu Wang,Antrea Christou
关键词-EN:
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:
[AI-251] Aligning AI Agents via Information-Directed Sampling
链接: https://arxiv.org/abs/2410.14807
作者: Hong Jun Jeon,Benjamin Van Roy
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
[AI-252] DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents
链接: https://arxiv.org/abs/2410.14803
作者: Taiyi Wang,Zhihao Wu,Jianheng Liu,Jianye Hao,Jun Wang,Kun Shao
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Systems and Control (eess.SY)
*备注: Paper and Appendix, 24 pages
[AI-253] Deep Generic Dynamic Object Detection Based on Dynamic Grid Maps
链接: https://arxiv.org/abs/2410.14799
作者: Rujiao Yan,Linda Schubert,Alexander Kamm,Matthias Komar,Matthias Schreier
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 10 pages, 6 figures, IEEE IV24
[AI-254] SSL-NBV: A Self-Supervised-Learning-Based Next-Best-View algorithm for Efficient 3D Plant Reconstruction by a Robot
链接: https://arxiv.org/abs/2410.14790
作者: Jianchao Ci,Eldert J. van Henten,Xin Wang,Akshay K. Burusa,Gert Kootstra
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 22 pages, 11 figures, 1 table
[AI-255] Evaluating Quantized Large Language Models for Code Generation on Low-Resource Language Benchmarks
链接: https://arxiv.org/abs/2410.14766
作者: Enkhbold Nyamsuren
关键词-EN:
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注:
[AI-256] What's New in My Data? Novelty Exploration via Contrastive Generation
链接: https://arxiv.org/abs/2410.14765
作者: Masaru Isonuma,Ivan Titov
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:
[AI-257] Enabling Scalable Evaluation of Bias Patterns in Medical LLMs
链接: https://arxiv.org/abs/2410.14763
作者: Hamed Fayyaz,Raphael Poulain,Rahmatollah Beheshti
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:
[AI-258] Controllable Discovery of Intents: Incremental Deep Clustering Using Semi-Supervised Contrastive Learning
链接: https://arxiv.org/abs/2410.14755
作者: Mrinal Rawat,Hithesh Sankararaman,Victor Barres
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted in IJCNLP’23
[AI-259] Collaboratively adding new knowledge to an LLM
链接: https://arxiv.org/abs/2410.14753
作者: Rhui Dih Lee,Laura Wynter
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:
[AI-260] TimeSeriesExam: A time series understanding exam NEURIPS’24
链接: https://arxiv.org/abs/2410.14752
作者: Yifu Cai,Arjun Choudhry,Mononito Goswami,Artur Dubrawski
关键词-EN:
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Accepted at NeurIPS’24 Time Series in the Age of Large Models Workshop
[AI-261] ETF: An Entity Tracing Framework for Hallucination Detection in Code Summaries
链接: https://arxiv.org/abs/2410.14748
作者: Kishan Maharaj,Vitobha Munigala,Srikanth G. Tamilselvam,Prince Kumar,Sayandeep Sen,Palani Kodeswaran,Abhijit Mishra,Pushpak Bhattacharyya
关键词-EN:
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 11 pages, 6 Figures, 5 Tables
[AI-262] Accounting for Sycophancy in Language Model Uncertainty Estimation
链接: https://arxiv.org/abs/2410.14746
作者: Anthony Sicilia,Mert Inan,Malihe Alikhani
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:
[AI-263] SemiEvol: Semi-supervised Fine-tuning for LLM Adaptation
链接: https://arxiv.org/abs/2410.14745
作者: Junyu Luo,Xiao Luo,Xiusi Chen,Zhiping Xiao,Wei Ju,Ming Zhang
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:
[AI-264] Eliciting Uncertainty in Chain-of-Thought to Mitigate Bias against Forecasting Harmful User Behaviors
链接: https://arxiv.org/abs/2410.14744
作者: Anthony Sicilia,Malihe Alikhani
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:
[AI-265] Efficient Deep Learning Board: Training Feedback Is Not All You Need
链接: https://arxiv.org/abs/2410.14743
作者: Lina Gong,Qi Gao,Peng Li,Mingqiang Wei,Fei Wu
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
[AI-266] Toward a Unified Graph-Based Representation of Medical Data for Precision Oncology Medicine
链接: https://arxiv.org/abs/2410.14739
作者: Davide Belluomo,Tiziana Calamoneri,Giacomo Paesani,Ivano Salvo
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注: 19 pages, 1 figure, 14 tables, CIBB 2024 conference
[AI-267] Agent Skill Acquisition for Large Language Models via CycleQD
链接: https://arxiv.org/abs/2410.14735
作者: So Kuroki,Taishi Nakamura,Takuya Akiba,Yujin Tang
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注:
[AI-268] Knowledge Graph Embeddings: A Comprehensive Survey on Capturing Relation Properties
链接: https://arxiv.org/abs/2410.14733
作者: Guanglin Niu
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 22 pages, 8 figures, 3 tables, this paper is a modified English version of our article already published in Computer Science journal (in Chinese), released to facilitate communication among international researchers in the relevant fields
[AI-269] MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection
链接: https://arxiv.org/abs/2410.14731
作者: Bokai Lin,Zihao Zeng,Zipeng Xiao,Siqi Kou,Tianqi Hou,Xiaofeng Gao,Hao Zhang,Zhijie Deng
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:
[AI-270] On the Relation Between Linear Diffusion and Power Iteration
链接: https://arxiv.org/abs/2410.14730
作者: Dana Weitzner,Mauricio Delbracio,Peyman Milanfar,Raja Giryes
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:
[AI-271] Tokens on Demand: Token Condensation as Training-free Test-time Adaptation
链接: https://arxiv.org/abs/2410.14729
作者: Zixin Wang,Dong Gong,Sen Wang,Zi Huang,Yadan Luo
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 18 pages, 7 figures
[AI-272] A Phenomenological AI Foundation Model for Physical Signals
链接: https://arxiv.org/abs/2410.14724
作者: Jaime Lien,Laura I. Galindez Olascoaga,Hasan Dogan,Nicholas Gillian,Brandon Barbello,Leonardo Giusti,Ivan Poupyrev
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注:
[AI-273] The Representation of Meaningful Precision and Accuracy
链接: https://arxiv.org/abs/2410.14721
作者: A Mani
关键词-EN:
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注: 16 Pages
[AI-274] A Systematic Survey on Large Language Models for Algorithm Design
链接: https://arxiv.org/abs/2410.14716
作者: Fei Liu,Yiming Yao,Ping Guo,Zhiyuan Yang,Xi Lin,Xialiang Tong,Mingxuan Yuan,Zhichao Lu,Zhenkun Wang,Qingfu Zhang
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:
[AI-275] Animating the Past: Reconstruct Trilobite via Video Generation
链接: https://arxiv.org/abs/2410.14715
作者: Xiaoran Wu,Zien Huang,Chonghan Yu
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
[AI-276] Abstracting Situation Calculus Action Theories
链接: https://arxiv.org/abs/2410.14712
作者: Bita Banihashemi,Giuseppe De Giacomo,Yves Lespérance
关键词-EN:
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
*备注: 60 pages, 1 figure
[AI-277] G2D2: Gradient-guided Discrete Diffusion for image inverse problem solving
链接: https://arxiv.org/abs/2410.14710
作者: Naoki Murata,Chieh-Hsin Lai,Yuhta Takida,Toshimitsu Uesaka,Bac Nguyen,Stefano Ermon,Yuki Mitsufuji
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
[AI-278] FACMIC: Federated Adaptative CLIP Model for Medical Image Classification MICCAI2024
链接: https://arxiv.org/abs/2410.14707
作者: Yihang Wu,Christian Desrosiers,Ahmad Chaddad
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注: Accepted in MICCAI 2024
[AI-279] Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark
链接: https://arxiv.org/abs/2410.14702
作者: Himanshu Gupta,Shreyas Verma,Ujjwala Anantheswaran,Kevin Scaria,Mihir Parmar,Swaroop Mishra,Chitta Baral
关键词-EN:
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 49 pages, (10 pages paper, 9 pages references, 30 pages appendix)
[AI-280] Self-Supervised Keypoint Detection with Distilled Depth Keypoint Representation
链接: https://arxiv.org/abs/2410.14700
作者: Aman Anand,Elyas Rashno,Amir Eskandari,Farhana Zulkernine
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
[AI-281] Green vehicle routing problem that jointly optimizes delivery speed and routing based on the characteristics of electric vehicles
链接: https://arxiv.org/abs/2410.14691
作者: YY.Feng
关键词-EN:
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
*备注:
[AI-282] Rethinking VLMs and LLMs for Image Classification
链接: https://arxiv.org/abs/2410.14690
作者: Avi Cooper,Keizo Kato,Chia-Hsien Shih,Hiroaki Yamane,Kasper Vinken,Kentaro Takemoto,Taro Sunagawa,Hao-Wei Yeh,Jin Yamanaka,Ian Mason,Xavier Boix
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:
[AI-283] Leveraging Event Streams with Deep Reinforcement Learning for End-to-End UAV Tracking
链接: https://arxiv.org/abs/2410.14685
作者: Ala Souissi(Lab-STICC_RAMBO, IMT Atlantique - INFO),Hajer Fradi(Lab-STICC_RAMBO, IMT Atlantique - INFO),Panagiotis Papadakis(Lab-STICC_RAMBO, IMT Atlantique - INFO)
关键词-EN:
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注:
[AI-284] RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph
链接: https://arxiv.org/abs/2410.14684
作者: Siru Ouyang,Wenhao Yu,Kaixin Ma,Zilin Xiao,Zhihan Zhang,Mengzhao Jia,Jiawei Han,Hongming Zhang,Dong Yu
关键词-EN:
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Work in progress
[AI-285] ET-Plan-Bench: Embodied Task-level Planning Benchmark Towards Spatial-Temporal Cognition with Foundation Models
链接: https://arxiv.org/abs/2410.14682
作者: Lingfeng Zhang,Yuening Wang,Hongjian Gu,Atia Hamidizadeh,Zhanguang Zhang,Yuecheng Liu,Yutong Wang,David Gamaliel Arcos Bravo,Junyi Dong,Shunbo Zhou,Tongtong Cao,Yuzheng Zhuang,Yingxue Zhang,Jianye Hao
关键词-EN:
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
[AI-286] Influence of Backdoor Paths on Causal Link Prediction
链接: https://arxiv.org/abs/2410.14680
作者: Utkarshani Jaimini,Cory Henson,Amit Sheth
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注:
[AI-287] HyperCausalLP: Causal Link Prediction using Hyper-Relational Knowledge Graph
链接: https://arxiv.org/abs/2410.14679
作者: Utkarshani Jaimini,Cory Henson,Amit Sheth
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注: arXiv admin note: text overlap with arXiv:2405.02327
[AI-288] Leveraging Large Language Models for Enhancing Public Transit Services
链接: https://arxiv.org/abs/2410.14147
作者: Jiahao Wang,Amer Shalaby
关键词-EN:
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 24 pages, 18 figures, submitting to Journal of ITS
[AI-289] Measuring Free-Form Decision-Making Inconsistency of Language Models in Military Crisis Simulations
链接: https://arxiv.org/abs/2410.13204
作者: Aryan Shrivastava,Jessica Hullman,Max Lamparth
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:
[AI-290] QT-DoG: Quantization-aware Training for Domain Generalization
链接: https://arxiv.org/abs/2410.06020
作者: Saqib Javed,Hieu Le,Mathieu Salzmann
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: Code will be released soon
[AI-291] Modeling dynamic neural activity by combining naturalistic video stimuli and stimulus-independent latent factors
链接: https://arxiv.org/abs/2410.16136
作者: Finn Schmidt,Suhas Shrinivasan,Polina Turishcheva,Fabian H. Sinz
关键词-EN:
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
*备注:
[AI-292] Multimodal Flare Forecasting with Deep Learning
链接: https://arxiv.org/abs/2410.16116
作者: Grégoire Francisco,Sabrina Guastavino,Teresa Barata,João Fernandes,Dario Del Moro
关键词-EN:
类目: Solar and Stellar Astrophysics (astro-ph.SR); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:
[AI-293] Neural Quantum Propagators for Driven-Dissipative Quantum Dynamics
链接: https://arxiv.org/abs/2410.16091
作者: Jiaji Zhang,Carlos L. Benavides-Riveros,Lipeng Chen
关键词-EN:
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
*备注: 7 pages, comment are welcome!
[AI-294] Resilient Temporal GCN for Smart Grid State Estimation Under Topology Inaccuracies
链接: https://arxiv.org/abs/2410.16008
作者: Seyed Hamed Haghshenas,Mia Naeini
关键词-EN:
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 9 pages, 5 figures
[AI-295] AI-Driven Approaches for Glaucoma Detection – A Comprehensive Review
链接: https://arxiv.org/abs/2410.15947
作者: Yuki Hagiwara,Octavia-Andreaa Ciora,Maureen Monnet,Gino Lancho,Jeanette Miriam Lorenz
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:
[AI-296] LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec ICASSP2025
链接: https://arxiv.org/abs/2410.15764
作者: Yiwei Guo,Zhihan Li,Chenpeng Du,Hankun Wang,Xie Chen,Kai Yu
关键词-EN:
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
*备注: 5 pages, 2 figures, 4 tables. Submitted to ICASSP 2025. Demo page: this https URL
[AI-297] Towards Kriging-informed Conditional Diffusion for Regional Sea-Level Data Downscaling
链接: https://arxiv.org/abs/2410.15628
作者: Subhankar Ghosh,Arun Sharma,Jayant Gupta,Aneesh Subramanian,Shashi Shekhar
关键词-EN:
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:
[AI-298] AttCDCNet: Attention-enhanced Chest Disease Classification using X-Ray Images
链接: https://arxiv.org/abs/2410.15437
作者: Omar Hesham Khater,Abdullahi Sani Shuaib,Sami Ul Haq,Abdul Jabbar Siddiqui
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:
[AI-299] Implicit neural representation for free-breathing MR fingerprinting (INR-MRF): co-registered 3D whole-liver water T1, water T2, proton density fat fraction, and R2* mapping
链接: https://arxiv.org/abs/2410.15175
作者: Chao Li,Jiahao Li,Jinwei Zhang,Eddy Solomon,Alexey V. Dimov,Pascal Spincemaille,Thanh D. Nguyen,Martin R. Prince,Yi Wang
关键词-EN:
类目: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注:
[AI-300] Pathologist-like explainable AI for interpretable Gleason grading in prostate cancer
链接: https://arxiv.org/abs/2410.15012
作者: Gesa Mittmann,Sara Laiouar-Pedari,Hendrik A. Mehrtens,Sarah Haggenmüller,Tabea-Clara Bucher,Tirtha Chanda,Nadine T. Gaisa,Mathias Wagner,Gilbert Georg Klamminger,Tilman T. Rau,Christina Neppl,Eva Maria Compérat,Andreas Gocht,Monika Hämmerle,Niels J. Rupp,Jula Westhoff,Irene Krücken,Maximillian Seidl,Christian M. Schürch,Marcus Bauer,Wiebke Solass,Yu Chun Tam,Florian Weber,Rainer Grobholz,Jaroslaw Augustyniak,Thomas Kalinski,Christian Hörner,Kirsten D. Mertz,Constanze Döring,Andreas Erbersdobler,Gabriele Deubler,Felix Bremmer,Ulrich Sommer,Michael Brodhun,Jon Griffin,Maria Sarah L. Lenon,Kiril Trpkov,Liang Cheng,Fei Chen,Angelique Levi,Guoping Cai,Tri Q. Nguyen,Ali Amin,Alessia Cimadamore,Ahmed Shabaik,Varsha Manucha,Nazeel Ahmad,Nidia Messias,Francesca Sanguedolce,Diana Taheri,Ezra Baraban,Liwei Jia,Rajal B. Shah,Farshid Siadat,Nicole Swarbrick,Kyung Park,Oudai Hassan,Siamak Sakhaie,Michelle R. Downes,Hiroshi Miyamoto,Sean R. Williamson,Tim Holland-Letz,Carolin V. Schneider,Jakob Nikolas Kather,Yuri Tolkach,Titus J. Brinker
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 58 pages, 15 figures (incl. supplementary)
[AI-301] On the Sparsity of the Strong Lottery Ticket Hypothesis
链接: https://arxiv.org/abs/2410.14754
作者: Emanuele Natale(CNRS, COATI, I3S, UniCA),Davide Ferré(UniCA, CNRS, Inria, I3S),Giordano Giambartolomei,Frédéric Giroire(I3S, COMUE UCA, COATI),Frederik Mallmann-Trenn
关键词-EN:
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
[AI-302] Learning Cortico-Muscular Dependence through Orthonormal Decomposition of Density Ratios
链接: https://arxiv.org/abs/2410.14697
作者: Shihan Ma,Bo Hu,Tianyu Jia,Alexander Kenneth Clarke,Blanka Zicher,Arnault H. Caillet,Dario Farina,Jose C. Principe
关键词-EN:
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注:
[AI-303] REBIND: Enhancing ground-state molecular conformation via force-based graph rewiring
链接: https://arxiv.org/abs/2410.14696
作者: Taewon Kim,Hyunjin Seo,Sungsoo Ahn,Eunho Yang
关键词-EN:
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注: 17 pages, 4 figures, 5 tables
[AI-304] Achieving Generalization in Orchestrating GNSS Interference Monitoring Stations Through Pseudo-Labeling
链接: https://arxiv.org/abs/2410.14686
作者: Lucas Heublein,Tobias Feigl,Alexander Rügamer,Felix Ott
关键词-EN:
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: DGON Positioning and Navigation for Intelligent Transport Systems (POSNAV)
[AI-305] Brain-Aware Readout Layers in GNNs: Advancing Alzheimer's early Detection and Neuroimaging
链接: https://arxiv.org/abs/2410.14683
作者: Jiwon Youn,Dong Woo Kang,Hyun Kook Lim,Mansu Kim
关键词-EN:
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:
计算机视觉
[CV-0] MvDrag3D: Drag-based Creative 3D Editing via Multi-view Generation-Reconstruction Priors
链接: https://arxiv.org/abs/2410.16272
作者: Honghua Chen,Yushi Lan,Yongwei Chen,Yifan Zhou,Xingang Pan
关键词-EN: content creation, capabilities of image, Drag-based editing, image generative models, editing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 pages, 10 figures, conference
Abstract:Drag-based editing has become popular in 2D content creation, driven by the capabilities of image generative models. However, extending this technique to 3D remains a challenge. Existing 3D drag-based editing methods, whether employing explicit spatial transformations or relying on implicit latent optimization within limited-capacity 3D generative models, fall short in handling significant topology changes or generating new textures across diverse object categories. To overcome these limitations, we introduce MVDrag3D, a novel framework for more flexible and creative drag-based 3D editing that leverages multi-view generation and reconstruction priors. At the core of our approach is the usage of a multi-view diffusion model as a strong generative prior to perform consistent drag editing over multiple rendered views, which is followed by a reconstruction model that reconstructs 3D Gaussians of the edited object. While the initial 3D Gaussians may suffer from misalignment between different views, we address this via view-specific deformation networks that adjust the position of Gaussians to be well aligned. In addition, we propose a multi-view score function that distills generative priors from multiple views to further enhance the view consistency and visual quality. Extensive experiments demonstrate that MVDrag3D provides a precise, generative, and flexible solution for 3D drag-based editing, supporting more versatile editing effects across various object categories and 3D representations.
[CV-1] FrugalNeRF: Fast Convergence for Few-shot Novel View Synthesis without Learned Priors
Link: https://arxiv.org/abs/2410.16271
Authors: Chin-Yang Lin,Chung-Ho Wu,Chang-Han Yeh,Shih-Han Yen,Cheng Sun,Yu-Lun Liu
Keywords: Neural Radiance Fields, Neural Radiance, Radiance Fields, face significant challenges, face significant
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:Neural Radiance Fields (NeRF) face significant challenges in few-shot scenarios, primarily due to overfitting and long training times for high-fidelity rendering. Existing methods, such as FreeNeRF and SparseNeRF, use frequency regularization or pre-trained priors but struggle with complex scheduling and bias. We introduce FrugalNeRF, a novel few-shot NeRF framework that leverages weight-sharing voxels across multiple scales to efficiently represent scene details. Our key contribution is a cross-scale geometric adaptation scheme that selects pseudo ground truth depth based on reprojection errors across scales. This guides training without relying on externally learned priors, enabling full utilization of the training data. It can also integrate pre-trained priors, enhancing quality without slowing convergence. Experiments on LLFF, DTU, and RealEstate-10K show that FrugalNeRF outperforms other few-shot NeRF methods while significantly reducing training time, making it a practical solution for efficient and accurate 3D scene reconstruction.
[CV-2] SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
Link: https://arxiv.org/abs/2410.16268
Authors: Shuangrui Ding,Rui Qian,Xiaoyi Dong,Pan Zhang,Yuhang Zang,Yuhang Cao,Yuwei Guo,Dahua Lin,Jiaqi Wang
Keywords: powerful foundation model, downstream video applications, foundation model, video object segmentation, segmentation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:The Segment Anything Model 2 (SAM 2) has emerged as a powerful foundation model for object segmentation in both images and videos, paving the way for various downstream video applications. The crucial design of SAM 2 for video segmentation is its memory module, which prompts object-aware memories from previous frames for current frame prediction. However, its greedy-selection memory design suffers from the “error accumulation” problem, where an errored or missed mask will cascade and influence the segmentation of the subsequent frames, which limits the performance of SAM 2 toward complex long-term videos. To this end, we introduce SAM2Long, an improved training-free video object segmentation strategy, which considers the segmentation uncertainty within each frame and chooses the video-level optimal results from multiple segmentation pathways in a constrained tree search manner. In practice, we maintain a fixed number of segmentation pathways throughout the video. For each frame, multiple masks are proposed based on the existing pathways, creating various candidate branches. We then select the same fixed number of branches with higher cumulative scores as the new pathways for the next frame. After processing the final frame, the pathway with the highest cumulative score is chosen as the final segmentation result. Benefiting from its heuristic search design, SAM2Long is robust toward occlusions and object reappearances, and can effectively segment and track objects for complex long-term videos. Notably, SAM2Long achieves an average improvement of 3.0 points across all 24 head-to-head comparisons, with gains of up to 5.3 points in J&F on long-term video object segmentation benchmarks such as SA-V and LVOS. The code is released at this https URL.
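The pathway selection described in the SAM2Long abstract reads like a beam-style search, and a minimal Python sketch may make it concrete. Everything below is an illustrative assumption rather than the authors' code: the `propose` callback, the mask ids, and the scores in `toy_propose` are hypothetical stand-ins for SAM 2's mask proposals and confidence scores.

```python
from typing import Callable, List, Sequence, Tuple

# A pathway is (cumulative score, mask id chosen at each frame so far).
Pathway = Tuple[float, List[int]]


def constrained_tree_search(
    num_frames: int,
    propose: Callable[[int, Sequence[int]], List[Tuple[int, float]]],
    num_pathways: int = 3,
) -> Pathway:
    """Keep a fixed number of pathways and return the best one at the end.

    `propose(frame, history)` is a hypothetical callback that returns
    candidate (mask_id, score) pairs for the given frame; it is assumed to
    return at least one candidate.
    """
    pathways: List[Pathway] = [(0.0, [])]
    for frame in range(num_frames):
        candidates: List[Pathway] = []
        # Branch every surviving pathway into its candidate masks.
        for score, history in pathways:
            for mask_id, mask_score in propose(frame, history):
                candidates.append((score + mask_score, history + [mask_id]))
        # Keep only the branches with the highest cumulative scores.
        candidates.sort(key=lambda c: c[0], reverse=True)
        pathways = candidates[:num_pathways]
    return max(pathways, key=lambda p: p[0])


def toy_propose(frame: int, history: Sequence[int]) -> List[Tuple[int, float]]:
    """Toy scores where the locally best first mask leads to a worse video."""
    if frame == 0:
        return [(0, 0.75), (1, 0.5)]
    return [(0, 0.25)] if history[-1] == 0 else [(1, 1.0)]


greedy = constrained_tree_search(2, toy_propose, num_pathways=1)
searched = constrained_tree_search(2, toy_propose, num_pathways=2)
print(greedy, searched)
```

With a single pathway the search degenerates to greedy selection and commits to mask 0; keeping two pathways lets the lower-scored first mask survive long enough to win on cumulative score.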
[CV-3] xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
Link: https://arxiv.org/abs/2410.16267
Authors: Michael S. Ryoo,Honglu Zhou,Shrikant Kendre,Can Qin,Le Xue,Manli Shu,Silvio Savarese,Ran Xu,Caiming Xiong,Juan Carlos Niebles
Keywords: efficiently capture temporal, capture temporal information, multimodal language model, multiple frames, multimodal language
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:We present xGen-MM-Vid (BLIP-3-Video): a multimodal language model for videos, particularly designed to efficiently capture temporal information over multiple frames. BLIP-3-Video takes advantage of the ‘temporal encoder’ in addition to the conventional visual tokenizer, which maps a sequence of tokens over multiple frames into a compact set of visual tokens. This enables BLIP3-Video to use much fewer visual tokens than its competing models (e.g., 32 vs. 4608 tokens). We explore different types of temporal encoders, including learnable spatio-temporal pooling as well as sequential models like Token Turing Machines. We experimentally confirm that BLIP-3-Video obtains video question-answering accuracies comparable to much larger state-of-the-art models (e.g., 34B), while being much smaller (i.e., 4B) and more efficient by using fewer visual tokens. The project website is at this https URL
[CV-4] 3DGS-Enhancer: Enhancing Unbounded 3D Gaussian Splatting with View-consistent 2D Diffusion Priors NEURIPS2024
Link: https://arxiv.org/abs/2410.16266
Authors: Xi Liu,Chaoyi Zhou,Siyu Huang
Keywords: Novel-view synthesis aims, achieved notable success, Gaussian splatting, Novel-view synthesis, multiple input images
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by NeurIPS 2024 Spotlight
Abstract:Novel-view synthesis aims to generate novel views of a scene from multiple input images or videos, and recent advancements like 3D Gaussian splatting (3DGS) have achieved notable success in producing photorealistic renderings with efficient pipelines. However, generating high-quality novel views under challenging settings, such as sparse input views, remains difficult due to insufficient information in under-sampled areas, often resulting in noticeable artifacts. This paper presents 3DGS-Enhancer, a novel pipeline for enhancing the representation quality of 3DGS representations. We leverage 2D video diffusion priors to address the challenging 3D view consistency problem, reformulating it as achieving temporal consistency within a video generation process. 3DGS-Enhancer restores view-consistent latent features of rendered novel views and integrates them with the input views through a spatial-temporal decoder. The enhanced views are then used to fine-tune the initial 3DGS model, significantly improving its rendering performance. Extensive experiments on large-scale datasets of unbounded scenes demonstrate that 3DGS-Enhancer yields superior reconstruction performance and high-fidelity rendering results compared to state-of-the-art methods. The project webpage is this https URL .
[CV-5] Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance
Link: https://arxiv.org/abs/2410.16261
Authors: Zhangwei Gao,Zhe Chen,Erfei Cui,Yiming Ren,Weiyun Wang,Jinguo Zhu,Hao Tian,Shenglong Ye,Junjun He,Xizhou Zhu,Lewei Lu,Tong Lu,Yu Qiao,Jifeng Dai,Wenhai Wang
Keywords: Multimodal large language, demonstrated impressive performance, Multimodal large, large language models, spectrum of domains
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical report
Abstract:Multimodal large language models (MLLMs) have demonstrated impressive performance in vision-language tasks across a broad spectrum of domains. However, the large model scale and associated high computational costs pose significant challenges for training and deploying MLLMs on consumer-grade GPUs or edge devices, thereby hindering their widespread application. In this work, we introduce Mini-InternVL, a series of MLLMs with parameters ranging from 1B to 4B, which achieves 90% of the performance with only 5% of the parameters. This significant improvement in efficiency and effectiveness makes our models more accessible and applicable in various real-world scenarios. To further promote the adoption of our models, we develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer and outperform specialized models in downstream tasks, including autonomous driving, medical images, and remote sensing. We believe that our study can provide valuable insights and resources to advance the development of efficient and effective MLLMs. Code is available at this https URL.
[CV-6] Agent-to-Sim: Learning Interactive Behavior Models from Casual Longitudinal Videos
Link: https://arxiv.org/abs/2410.16259
Authors: Gengshan Yang,Andrea Bajcsy,Shunsuke Saito,Angjoo Kanazawa
Keywords: longitudinal video collections, casual longitudinal video, ATS learns natural, learning interactive behavior, framework for learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
Comments: Project page: this https URL
Abstract:We present Agent-to-Sim (ATS), a framework for learning interactive behavior models of 3D agents from casual longitudinal video collections. Different from prior works that rely on marker-based tracking and multiview cameras, ATS learns natural behaviors of animal and human agents non-invasively through video observations recorded over a long time-span (e.g., a month) in a single environment. Modeling 3D behavior of an agent requires persistent 3D tracking (e.g., knowing which point corresponds to which) over a long time period. To obtain such data, we develop a coarse-to-fine registration method that tracks the agent and the camera over time through a canonical 3D space, resulting in a complete and persistent spacetime 4D representation. We then train a generative model of agent behaviors using paired data of perception and motion of an agent queried from the 4D reconstruction. ATS enables real-to-sim transfer from video recordings of an agent to an interactive behavior simulator. We demonstrate results on pets (e.g., cat, dog, bunny) and human given monocular RGBD videos captured by a smartphone.
[CV-7] Elucidating the design space of language models for image generation
Link: https://arxiv.org/abs/2410.16257
Authors: Xuantong Liu,Shaozhe Hao,Xianbiao Qi,Tianyang Hu,Jun Wang,Rong Xiao,Yuan Yao
Keywords: adopt Large Language, Large Language Models, adopt Large, Large Language, language models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:The success of autoregressive (AR) language models in text generation has inspired the computer vision community to adopt Large Language Models (LLMs) for image generation. However, considering the essential differences between text and image modalities, the design space of language models for image generation remains underexplored. We observe that image tokens exhibit greater randomness compared to text tokens, which presents challenges when training with token prediction. Nevertheless, AR models demonstrate their potential by effectively learning patterns even from a seemingly suboptimal optimization problem. Our analysis also reveals that while all models successfully grasp the importance of local information in image generation, smaller models struggle to capture the global context. In contrast, larger models showcase improved capabilities in this area, helping to explain the performance gains achieved when scaling up model size. We further elucidate the design space of language models for vision generation, including tokenizer choice, model choice, model scalability, vocabulary design, and sampling strategy through extensive comparative experiments. Our work is the first to analyze the optimization behavior of language models in vision generation, and we believe it can inspire more effective designs when applying LMs to other domains. Finally, our elucidated language model for image generation, termed as ELM, achieves state-of-the-art performance on the ImageNet 256*256 benchmark. The code is available at this https URL.
[CV-8] Revisiting Deep Feature Reconstruction for Logical and Structural Industrial Anomaly Detection
Link: https://arxiv.org/abs/2410.16255
Authors: Sukanya Patra,Souhaib Ben Taieb
Keywords: alter object appearances, presents challenges due, diverse anomaly types, Industrial anomaly detection, limited training data
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted in Transactions on Machine Learning Research (TMLR). Link to OpenReview: this https URL
Abstract:Industrial anomaly detection is crucial for quality control and predictive maintenance, but it presents challenges due to limited training data, diverse anomaly types, and external factors that alter object appearances. Existing methods commonly detect structural anomalies, such as dents and scratches, by leveraging multi-scale features from image patches extracted through deep pre-trained networks. However, significant memory and computational demands often limit their practical application. Additionally, detecting logical anomalies-such as images with missing or excess elements-requires an understanding of spatial relationships that traditional patch-based methods fail to capture. In this work, we address these limitations by focusing on Deep Feature Reconstruction (DFR), a memory- and compute-efficient approach for detecting structural anomalies. We further enhance DFR into a unified framework, called ULSAD, which is capable of detecting both structural and logical anomalies. Specifically, we refine the DFR training objective to improve performance in structural anomaly detection, while introducing an attention-based loss mechanism using a global autoencoder-like network to handle logical anomaly detection. Our empirical evaluation across five benchmark datasets demonstrates the performance of ULSAD in detecting and localizing both structural and logical anomalies, outperforming eight state-of-the-art methods. An extensive ablation study further highlights the contribution of each component to the overall performance improvement. Our code is available at this https URL
[CV-9] MoRE: Multi-Modal Contrastive Pre-training with Transformers on X-Rays ECGs and Diagnostic Report
Link: https://arxiv.org/abs/2410.16239
Authors: Samrajya Thapa,Koushik Howlader,Subhankar Bhattacharjee,Wei le
Keywords: Multi-Modal Contrastive Pre-training, Contrastive Pre-training Framework, synergistically combines X-rays, Contrastive Pre-training, Pre-training Framework
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures, 9 tables. Supplementary detail in Appendix. Code made available in Github for reproducibility
Abstract:In this paper, we introduce a novel Multi-Modal Contrastive Pre-training Framework that synergistically combines X-rays, electrocardiograms (ECGs), and radiology/cardiology reports. Our approach leverages transformers to encode these diverse modalities into a unified representation space, aiming to enhance diagnostic accuracy and facilitate comprehensive patient assessments. We utilize LoRA-Peft to significantly reduce trainable parameters in the LLM and incorporate a recent linear attention dropping strategy in the Vision Transformer (ViT) for smoother attention. Furthermore, we provide novel multimodal attention explanations and retrieval for our model. To the best of our knowledge, we are the first to propose an integrated model that combines X-ray, ECG, and Radiology/Cardiology Report with this approach. By utilizing contrastive loss, MoRE effectively aligns modality-specific features into a coherent embedding, which supports various downstream tasks such as zero-shot classification and multimodal retrieval. Employing our proposed methodology, we achieve state-of-the-art (SOTA) on the Mimic-IV, CheXpert, Edema Severity, and PtbXl downstream datasets, surpassing existing multimodal approaches. Our proposed framework shows significant improvements in capturing intricate inter-modal relationships and its robustness in medical diagnosis that establishes a framework for future research in multimodal learning in the healthcare sector.
[CV-10] LLaVA-KD: A Framework of Distilling Multimodal Large Language Models
Link: https://arxiv.org/abs/2410.16236
Authors: Yuxuan Cai,Jiangning Zhang,Haoyang He,Xinwei He,Ao Tong,Zhenye Gan,Chengjie Wang,Xiang Bai
Keywords: Large Language Models, Multimodal Large Language, Large Language, explore Multimodal Large, Language Models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review
Abstract:The success of Large Language Models (LLM) has led researchers to explore Multimodal Large Language Models (MLLM) for unified visual and linguistic understanding. However, the increasing model size and computational complexity of MLLM limit their use in resource-constrained environments. Small-scale MLLM (s-MLLM) aims to retain the capabilities of the large-scale model (l-MLLM) while reducing computational demands, but this results in a significant decline in performance. To address the aforementioned issues, we propose a novel LLaVA-KD framework to transfer knowledge from l-MLLM to s-MLLM. Specifically, we introduce Multimodal Distillation (MDist) to minimize the divergence between the visual-textual output distributions of l-MLLM and s-MLLM, and Relation Distillation (RDist) to transfer l-MLLM’s ability to model correlations between visual features. Additionally, we propose a three-stage training scheme to fully exploit the potential of s-MLLM: 1) Distilled Pre-Training to align visual-textual representations, 2) Supervised Fine-Tuning to equip the model with multimodal understanding, and 3) Distilled Fine-Tuning to further transfer l-MLLM capabilities. Our approach significantly improves performance without altering the small model’s architecture. Extensive experiments and ablation studies validate the effectiveness of each proposed component. Code will be available at this https URL.
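As a rough illustration of what "minimizing the divergence between output distributions" can mean in distillation, the sketch below computes a KL divergence between a teacher's and a student's next-token distributions from raw logits. The abstract does not specify the divergence or its direction, so the choice of KL(teacher || student) and the function names are assumptions made for this example only.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a single logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def kl_teacher_to_student(
    teacher_logits: Sequence[float], student_logits: Sequence[float]
) -> float:
    """KL(teacher || student) for one token position (assumed direction)."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(t, s))


# The loss is zero when the student matches the teacher and positive otherwise.
print(kl_teacher_to_student([2.0, 0.0, -1.0], [2.0, 0.0, -1.0]))
print(kl_teacher_to_student([2.0, 0.0, -1.0], [0.0, 0.0, 0.0]))
```

In a training loop this quantity would be averaged over token positions and minimized with respect to the student's parameters; that part is omitted here.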
[CV-11] Managing Bandwidth: The Key to Cloud-Assisted Autonomous Driving
Link: https://arxiv.org/abs/2410.16227
Authors: Alexander Krentsel,Peter Schafhalter,Joseph E. Gonzalez,Sylvia Ratnasamy,Scott Shenker,Ion Stoica
Keywords: Prevailing wisdom asserts, critical real-time control, real-time control systems, Prevailing wisdom, wisdom asserts
Categories: Networking and Internet Architecture (cs.NI); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments: 6 pages
Abstract:Prevailing wisdom asserts that one cannot rely on the cloud for critical real-time control systems like self-driving cars. We argue that we can, and must. Following the trends of increasing model sizes, improvements in hardware, and evolving mobile networks, we identify an opportunity to offload parts of time-sensitive and latency-critical compute to the cloud. Doing so requires carefully allocating bandwidth to meet strict latency SLOs, while maximizing benefit to the car.
[CV-12] Improve Vision Language Model Chain-of-thought Reasoning
Link: https://arxiv.org/abs/2410.16198
Authors: Ruohong Zhang,Bowen Zhang,Yanghao Li,Haotian Zhang,Zhiqing Sun,Zhe Gan,Yinfei Yang,Ruoming Pang,Yiming Yang
Keywords: vision language models, interpretability and trustworthiness, vision language, crucial for improving, improving interpretability
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages + appendix
Abstract:Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness. However, current training recipes lack robust CoT reasoning data, relying on datasets dominated by short annotations with minimal rationales. In this work, we show that training VLM on short answers does not generalize well to reasoning tasks that require more detailed responses. To address this, we propose a two-fold approach. First, we distill rationales from GPT-4o model to enrich the training data and fine-tune VLMs, boosting their CoT performance. Second, we apply reinforcement learning to further calibrate reasoning quality. Specifically, we construct positive (correct) and negative (incorrect) pairs of model-generated reasoning chains, by comparing their predictions with annotated short answers. Using this pairwise data, we apply the Direct Preference Optimization algorithm to refine the model’s reasoning abilities. Our experiments demonstrate significant improvements in CoT reasoning on benchmark datasets and better generalization to direct answer prediction as well. This work emphasizes the importance of incorporating detailed rationales in training and leveraging reinforcement learning to strengthen the reasoning capabilities of VLMs.
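The second step in this abstract, building correct/incorrect pairs and applying Direct Preference Optimization, can be sketched in a few lines of Python. This is an assumption-laden illustration, not the paper's pipeline: the `Chain` type, the string-matching rule in `normalize`, and the default `beta` are all hypothetical, and the loss is the standard per-pair DPO objective written out for scalar log-probabilities.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chain:
    rationale: str   # model-generated reasoning text
    prediction: str  # final answer extracted from the chain


def normalize(answer: str) -> str:
    """Hypothetical matching rule: case- and whitespace-insensitive."""
    return answer.strip().lower()


def build_preference_pairs(chains: List[Chain], gold: str) -> List[Tuple[Chain, Chain]]:
    """Pair every chain whose prediction matches the annotated short answer
    (positive) with every chain whose prediction does not (negative)."""
    target = normalize(gold)
    correct = [c for c in chains if normalize(c.prediction) == target]
    incorrect = [c for c in chains if normalize(c.prediction) != target]
    return [(pos, neg) for pos in correct for neg in incorrect]


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """Per-pair DPO loss, -log(sigmoid(margin)), from sequence log-probabilities."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # Stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


chains = [Chain("2 + 2 is 4", "4"), Chain("count the dots", " 4 "), Chain("guess", "5")]
pairs = build_preference_pairs(chains, gold="4")
print(len(pairs), dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

Questions for which every sampled chain is correct, or every chain is wrong, yield no pairs under this rule and would simply be skipped.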
[CV-13] Training Better Deep Learning Models Using Human Saliency
Link: https://arxiv.org/abs/2410.16190
Authors: Aidan Boyd,Patrick Tinsley,Kevin W. Bowyer,Adam Czajka
Keywords: convolutional neural network, deep convolutional neural, judgement about salient, convolutional neural, training
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:This work explores how human judgement about salient regions of an image can be introduced into deep convolutional neural network (DCNN) training. Traditionally, training of DCNNs is purely data-driven. This often results in learning features of the data that are only coincidentally correlated with class labels. Human saliency can guide network training using our proposed new component of the loss function that ConveYs Brain Oversight to Raise Generalization (CYBORG) and penalizes the model for using non-salient regions. This mechanism produces DCNNs achieving higher accuracy and generalization compared to using the same training data without human salience. Experimental results demonstrate that CYBORG applies across multiple network architectures and problem domains (detection of synthetic faces, iris presentation attacks and anomalies in chest X-rays), while requiring significantly less data than training without human saliency guidance. Visualizations show that CYBORG-trained models’ saliency is more consistent across independent training runs than traditionally-trained models, and also in better agreement with humans. To lower the cost of collecting human annotations, we also explore using deep learning to provide automated annotations. CYBORG training of CNNs addresses important issues such as reducing the appetite for large training sets, increasing interpretability, and reducing fragility by generalizing better to new types of data.
[CV-14] A Framework for Evaluating Predictive Models Using Synthetic Image Covariates and Longitudinal Data
Link: https://arxiv.org/abs/2410.16177
Authors: Simon Deltadahl,Andreu Vall,Vijay Ivaturi,Niklas Korsbo
Keywords: addressing privacy concerns, healthcare research, synthesizing patient data, visual acuity, acuity over time
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We present a novel framework for synthesizing patient data with complex covariates (e.g., eye scans) paired with longitudinal observations (e.g., visual acuity over time), addressing privacy concerns in healthcare research. Our approach introduces controlled association in latent spaces generating each data modality, enabling the creation of complex covariate-longitudinal observation pairs. This framework facilitates the development of predictive models and provides openly available benchmarking datasets for healthcare research. We demonstrate our framework using optical coherence tomography (OCT) scans, though it is applicable across domains. Using 109,309 2D OCT scan slices, we trained an image generative model combining a variational autoencoder and a diffusion model. Longitudinal observations were simulated using a nonlinear mixed effect (NLME) model from a low-dimensional space of random effects. We generated 1.1M OCT scan slices paired with five sets of longitudinal observations at controlled association levels (100%, 50%, 10%, 5.26%, and 2% of between-subject variability). To assess the framework, we modeled synthetic longitudinal observations with another NLME model, computed empirical Bayes estimates of random effects, and trained a ResNet to predict these estimates from synthetic OCT scans. We then incorporated ResNet predictions into the NLME model for patient-individualized predictions. Prediction accuracy on withheld data declined as intended with reduced association between images and longitudinal measurements. Notably, in all but the 2% case, we achieved within 50% of the theoretical best possible prediction on withheld data, demonstrating our ability to detect even weak signals. This confirms the effectiveness of our framework in generating synthetic data with controlled levels of association, providing a valuable tool for healthcare research.
[CV-15] Beyond Filtering: Adaptive Image-Text Quality Enhancement for MLLM Pretraining
Link: https://arxiv.org/abs/2410.16166
Authors: Han Huang,Yuqi Huo,Zijia Zhao,Haoyu Lu,Shu Wu,Bingning Wang,Qiang Liu,Weipeng Chen,Liang Wang
Keywords: Multimodal large language, made significant strides, large language models, textual modalities, Multimodal large
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
Abstract:Multimodal large language models (MLLMs) have made significant strides by integrating visual and textual modalities. A critical factor in training MLLMs is the quality of image-text pairs within multimodal pretraining datasets. However, de facto filter-based data quality enhancement paradigms often discard a substantial portion of high-quality image data due to inadequate semantic alignment between images and texts, leading to inefficiencies in data utilization and scalability. In this paper, we propose the Adaptive Image-Text Quality Enhancer (AITQE), a model that dynamically assesses and enhances the quality of image-text pairs. AITQE employs a text rewriting mechanism for low-quality pairs and incorporates a negative sample learning strategy to improve evaluative capabilities by integrating deliberately selected low-quality samples during training. Unlike prior approaches that significantly alter text distributions, our method minimally adjusts text to preserve data volume while enhancing quality. Experimental results demonstrate that AITQE surpasses existing methods on various benchmarks, effectively leveraging raw data and scaling efficiently with increasing data volumes. We hope our work will inspire future works. The code and model are available at: this https URL.
[CV-16] Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models
Link: https://arxiv.org/abs/2410.16163
Authors: Yufei Zhan,Hongyin Zhao,Yousong Zhu,Fan Yang,Ming Tang,Jinqiao Wang
Keywords: achieved significant breakthroughs, Large Language Models, Large Multimodal Models, tasks, auto-regressive modeling
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been submitted to the IEEE for possible publication. Codes and data will be later released at this https URL
Abstract:Large Multimodal Models (LMMs) have achieved significant breakthroughs in various vision-language and vision-centric tasks based on auto-regressive modeling. However, these models typically focus on either vision-centric tasks, such as visual grounding and region description, or vision-language tasks, like image caption and multi-scenario VQAs. None of the LMMs have yet comprehensively unified both types of tasks within a single model, as seen in Large Language Models in the natural language processing field. Furthermore, even with abundant multi-task instruction-following data, directly stacking these data for universal capabilities extension remains challenging. To address these issues, we introduce a novel multi-dimension curated and consolidated multimodal dataset, named CCMD-8M, which overcomes the data barriers of unifying vision-centric and vision-language tasks through multi-level data curation and multi-task consolidation. More importantly, we present Griffon-G, a general large multimodal model that addresses both vision-centric and vision-language tasks within a single end-to-end paradigm. Griffon-G resolves the training collapse issue encountered during the joint optimization of these tasks, achieving better training efficiency. Evaluations across multimodal benchmarks, general Visual Question Answering (VQA) tasks, scene text-centric VQA tasks, document-related VQA tasks, Referring Expression Comprehension, and object detection demonstrate that Griffon-G surpasses the advanced LMMs and achieves expert-level performance in complicated vision-centric tasks.
[CV-17] Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Composite Spatial Reasoning
Link: https://arxiv.org/abs/2410.16162
Authors: Yihong Tang,Ao Qu,Zhaokai Wang,Dingyi Zhuang,Zhaofeng Wu,Wei Ma,Shenhao Wang,Yunhan Zheng,Zhan Zhao,Jinhua Zhao
Keywords: Vision language models, Vision language, spatial reasoning, spatial, basic spatial capabilities
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
Abstract:Vision language models (VLMs) have demonstrated impressive performance across a wide range of downstream tasks. However, their proficiency in spatial reasoning remains limited, despite its crucial role in tasks involving navigation and interaction with physical environments. Specifically, much of the spatial reasoning in these tasks occurs in two-dimensional (2D) environments, and our evaluation reveals that state-of-the-art VLMs frequently generate implausible and incorrect responses to composite spatial reasoning problems, including simple pathfinding tasks that humans can solve effortlessly at a glance. To address this, we explore an effective approach to enhance 2D spatial reasoning within VLMs by training the model on basic spatial capabilities. We begin by disentangling the key components of 2D spatial reasoning: direction comprehension, distance estimation, and localization. Our central hypothesis is that mastering these basic spatial capabilities can significantly enhance a model’s performance on composite spatial tasks requiring advanced spatial understanding and combinatorial problem-solving. To investigate this hypothesis, we introduce Sparkle, a framework that fine-tunes VLMs on these three basic spatial capabilities by synthetic data generation and targeted supervision to form an instruction dataset for each capability. Our experiments demonstrate that VLMs fine-tuned with Sparkle achieve significant performance gains, not only in the basic tasks themselves but also in generalizing to composite and out-of-distribution spatial reasoning tasks (e.g., improving from 13.5% to 40.0% on the shortest path problem). These findings underscore the effectiveness of mastering basic spatial capabilities in enhancing composite spatial problem-solving, offering insights for improving VLMs’ spatial reasoning capabilities.
[CV-18] Metric as Transform: Exploring beyond Affine Transform for Interpretable Neural Network
Link: https://arxiv.org/abs/2410.16159
Authors: Suman Sapkota
Keywords: Artificial Neural Networks, Radial Basis Function, Basis Function Network, Artificial Neural, Convolutional Neural Network
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments: 22 pages, 20 figures, 3 tables
Abstract:Artificial Neural Networks of varying architectures are generally paired with affine transformation at the core. However, we find dot product neurons with global influence less interpretable as compared to local influence of Euclidean distance (as used in Radial Basis Function Network). In this work, we explore the generalization of dot product neurons to l^p-norm, metrics, and beyond. We find that metrics as transform performs similarly to affine transform when used in MultiLayer Perceptron or Convolutional Neural Network. Moreover, we explore various properties of Metrics, compare them with Affine, and present multiple cases where metrics seem to provide better interpretability. We develop interpretable local-dictionary-based Neural Networks and use them to understand and reject adversarial examples.
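To make the contrast in this abstract concrete, the sketch below places a conventional affine (dot-product) layer next to a layer whose units output the l^p distance between the input and a learned center. It is only a plain-Python illustration under assumed names (`affine_layer`, `lp_distance_layer`, `centers`); the paper's actual layers, activations, and training details are not given in the abstract and are not reproduced here.

```python
from typing import List, Sequence


def affine_layer(
    x: Sequence[float], weights: Sequence[Sequence[float]], biases: Sequence[float]
) -> List[float]:
    """Standard dot-product units: y_j = w_j . x + b_j."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
            for w, b in zip(weights, biases)]


def lp_distance_layer(
    x: Sequence[float], centers: Sequence[Sequence[float]], p: float = 2.0
) -> List[float]:
    """Metric units: y_j = ||x - c_j||_p, the distance to a per-unit center."""
    return [sum(abs(x_i - c_i) ** p for x_i, c_i in zip(x, c)) ** (1.0 / p)
            for c in centers]


x = [3.0, 4.0]
centers = [[0.0, 0.0], [3.0, 4.0]]
print(affine_layer(x, weights=centers, biases=[0.0, 0.0]))
print(lp_distance_layer(x, centers, p=2.0))
print(lp_distance_layer(x, centers, p=1.0))
```

A distance unit responds to how close the input is to its center, so its output is zero exactly at the center, whereas a dot-product unit keeps growing along the weight direction; this locality is the property the abstract ties to interpretability.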
[CV-19] Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages
Link: https://arxiv.org/abs/2410.16153
Authors: Xiang Yue, Yueqi Song, Akari Asai, Seungone Kim, Jean de Dieu Nyandwi, Simran Khanuja, Anjali Kantharuban, Lintang Sutawika, Sathyanarayanan Ramamoorthy, Graham Neubig
Keywords (EN): diverse cultural contexts, multimodal large language, recent advances, predominantly focused, cultural contexts underrepresented
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 52 pages, 27 figures
Abstract:Despite recent advances in multimodal large language models (MLLMs), their development has predominantly focused on English- and western-centric datasets and tasks, leaving most of the world’s languages and diverse cultural contexts underrepresented. This paper introduces Pangea, a multilingual multimodal LLM trained on PangeaIns, a diverse 6M instruction dataset spanning 39 languages. PangeaIns features: 1) high-quality English instructions, 2) carefully machine-translated instructions, and 3) culturally relevant multimodal tasks to ensure cross-cultural coverage. To rigorously assess models’ capabilities, we introduce PangeaBench, a holistic evaluation suite encompassing 14 datasets covering 47 languages. Results show that Pangea significantly outperforms existing open-source models in multilingual settings and diverse cultural contexts. Ablation studies further reveal the importance of English data proportions, language popularity, and the number of multimodal training samples on overall performance. We fully open-source our data, code, and trained checkpoints, to facilitate the development of inclusive and robust multilingual MLLMs, promoting equity and accessibility across a broader linguistic and cultural spectrum.
[CV-20] Warped Diffusion: Solving Video Inverse Problems with Image Diffusion Models NEURIPS2024
Link: https://arxiv.org/abs/2410.16152
Authors: Giannis Daras, Weili Nie, Karsten Kreis, Alex Dimakis, Morteza Mardani, Nikola Borislavov Kovachki, Arash Vahdat
Keywords (EN): suffers from flickering, space diffusion models, function space diffusion, naively for solving, image models naively
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted in NeurIPS 2024
Abstract:Using image models naively for solving inverse video problems often suffers from flickering, texture-sticking, and temporal inconsistency in generated videos. To tackle these problems, in this paper, we view frames as continuous functions in the 2D space, and videos as a sequence of continuous warping transformations between different frames. This perspective allows us to train function space diffusion models only on images and utilize them to solve temporally correlated inverse problems. The function space diffusion models need to be equivariant with respect to the underlying spatial transformations. To ensure temporal consistency, we introduce a simple post-hoc test-time guidance towards (self)-equivariant solutions. Our method allows us to deploy state-of-the-art latent diffusion models such as Stable Diffusion XL to solve video inverse problems. We demonstrate the effectiveness of our method for video inpainting and 8× video super-resolution, outperforming existing techniques based on noise transformations. We provide generated video results: this https URL.
[CV-21] Towards Combating Frequency Simplicity-biased Learning for Domain Generalization NEURIPS2024
Link: https://arxiv.org/abs/2410.16146
Authors: Xilin He, Jingyu Hu, Qinliang Lin, Cheng Luo, Weicheng Xie, Siyang Song, Muhammad Haris Khan, Linlin Shen
Keywords (EN): learn transferable knowledge, unseen target domains, learning behavior, Domain generalization methods, generalization methods aim
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2024
Abstract:Domain generalization methods aim to learn transferable knowledge from source domains that can generalize well to unseen target domains. Recent studies show that neural networks frequently suffer from a simplicity-biased learning behavior which leads to over-reliance on specific frequency sets, known as frequency shortcuts, instead of semantic information, resulting in poor generalization performance. Although previous data augmentation techniques successfully enhance generalization performance, they tend to apply more frequency shortcuts, thereby creating an illusion of generalization improvement. In this paper, we aim to prevent such learning behavior of applying frequency shortcuts from a data-driven perspective. Given the theoretical justification of models' biased learning behavior on different spatial frequency components, which is based on the dataset frequency properties, we argue that the learning behavior on various frequency components could be manipulated by changing the dataset statistical structure in the Fourier domain. Intuitively, as frequency shortcuts are hidden in the dominant and highly dependent frequencies of the dataset structure, dynamically perturbing the over-relied-upon frequency components could prevent the application of frequency shortcuts. To this end, we propose two effective data augmentation modules designed to collaboratively and adaptively adjust the frequency characteristics of the dataset, aiming to dynamically influence the learning behavior of the model and ultimately serving as a strategy to mitigate shortcut learning. Code is available at AdvFrequency (this https URL).
[CV-22] Increasing Interpretability of Neural Networks By Approximating Human Visual Saliency
Link: https://arxiv.org/abs/2410.16115
Authors: Aidan Boyd, Mohamed Trabelsi, Huseyin Uzunalioglu, Dan Kushnir
Keywords (EN): Understanding specifically, decision-making process, Understanding, saliency, interpretability
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Understanding specifically where a model focuses on within an image is critical for human interpretability of the decision-making process. Deep learning-based solutions are prone to learning coincidental correlations in training datasets, causing over-fitting and reducing the explainability. Recent advances have shown that guiding models to human-defined regions of saliency within individual images significantly increases performance and interpretability. Human-guided models also exhibit greater generalization capabilities, as coincidental dataset features are avoided. Results show that models trained with saliency incorporation display an increase in interpretability of up to 30% over models trained without saliency information. The collection of this saliency information, however, can be costly, laborious and in some cases infeasible. To address this limitation, we propose a combination strategy of saliency incorporation and active learning to reduce the human annotation data required by 80% while maintaining the interpretability and performance increase from human saliency. Extensive experimentation outlines the effectiveness of the proposed approach across five public datasets and six active learning criteria.
[CV-23] LMHaze: Intensity-aware Image Dehazing with a Large-scale Multi-intensity Real Haze Dataset
Link: https://arxiv.org/abs/2410.16095
Authors: Ruikun Zhang, Hao Yang, Yan Yang, Ying Fu, Liyuan Pan
Keywords (EN): recent years, drawn a significant, significant attention, attention in recent, haze intensities
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Image dehazing has drawn significant attention in recent years. Learning-based methods usually require paired hazy and corresponding ground truth (haze-free) images for training. However, it is difficult to collect real-world image pairs, which hinders the development of existing methods. Although several works partially alleviate this issue by using synthetic datasets or small-scale real datasets, the haze intensity distribution bias and scene homogeneity in existing datasets limit the generalization ability of these methods, particularly when encountering images with previously unseen haze intensities. In this work, we present LMHaze, a large-scale, high-quality real-world dataset. LMHaze comprises paired hazy and haze-free images captured in diverse indoor and outdoor environments, spanning multiple scenarios and haze intensities. It contains over 5K high-resolution image pairs, surpassing the size of the biggest existing real-world dehazing dataset by over 25 times. Meanwhile, to better handle images with different haze intensities, we propose a mixture-of-experts model based on Mamba (MoE-Mamba) for dehazing, which dynamically adjusts the model parameters according to the haze intensity. Moreover, with our proposed dataset, we conduct a new large multimodal model (LMM)-based benchmark study to simulate human perception for evaluating dehazed images. Experiments demonstrate that the LMHaze dataset improves dehazing performance in real scenarios and that our dehazing method provides better results compared to state-of-the-art methods.
[CV-24] Final Report for CHESS: Cloud High-Performance Computing and Edge for Science and Security
Link: https://arxiv.org/abs/2410.16093
Authors: Nathan Tallent, Jan Strube, Luanzheng Guo, Hyungro Lee, Jesun Firoz, Sayan Ghosh, Bo Fang, Oceane Bel, Steven Spurgeon, Sarah Akers, Christina Doty, Erol Cromwell
Keywords (EN): multiple information sources, spanning lab instruments, theory-experiment cycle requires, cycle requires effective, continuum spanning lab
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV); Performance (cs.PF); Systems and Control (eess.SY)
Abstract:Automating the theory-experiment cycle requires effective distributed workflows that utilize a computing continuum spanning lab instruments, edge sensors, computing resources at multiple facilities, data sets distributed across multiple information sources, and potentially cloud. Unfortunately, the obvious methods for constructing continuum platforms, orchestrating workflow tasks, and curating datasets over time fail to achieve scientific requirements for performance, energy, security, and reliability. Furthermore, achieving the best use of continuum resources depends upon the efficient composition and execution of workflow tasks, i.e., combinations of numerical solvers, data analytics, and machine learning. Pacific Northwest National Laboratory’s LDRD “Cloud, High-Performance Computing (HPC), and Edge for Science and Security” (CHESS) has developed a set of interrelated capabilities for enabling distributed scientific workflows and curating datasets. This report describes the results and successes of CHESS from the perspective of open science.
[CV-25] Integrated Image-Text Based on Semi-supervised Learning for Small Sample Instance Segmentation
Link: https://arxiv.org/abs/2410.16063
Authors: Ruting Chi, Zhiyi Huang, Yuexing Han
Keywords (EN): Small sample instance, sample instance segmentation, sample instance, Small sample, instance segmentation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract:Small sample instance segmentation is a very challenging task, and many existing methods follow the training strategy of meta-learning, which pre-trains models on the support set and fine-tunes them on the query set. The pre-training phase, which is highly task related, requires a significant amount of additional training time and the selection of datasets with close proximity to ensure effectiveness. The article proposes a novel small sample instance segmentation solution from the perspective of maximizing the utilization of existing information without increasing annotation burden and training costs. The proposed method designs two modules to address the problems encountered in small sample instance segmentation. First, it helps the model fully utilize unlabeled data by learning to generate pseudo labels, increasing the number of available samples. Second, by integrating the features of text and image, more accurate classification results can be obtained. These two modules are suitable for box-free and box-dependent frameworks. In this way, the proposed method not only improves the performance of small sample instance segmentation, but also greatly reduces reliance on pre-training. We have conducted experiments on three datasets from different scenes: on land, underwater and under the microscope. As evidenced by our experiments, integrated image-text corrects the confidence of classification, and pseudo labels help the model obtain more precise masks. All the results demonstrate the effectiveness and superiority of our method.
[CV-26] Label Filling via Mixed Supervision for Medical Image Segmentation from Noisy Annotations
Link: https://arxiv.org/abs/2410.16057
Authors: Ming Li, Wei Shen, Qingli Li, Yan Wang
Keywords (EN): medical image segmentation, success of medical, medical image, requires a large, large number
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:The success of medical image segmentation usually requires a large number of high-quality labels. But since the labeling process is usually affected by the raters’ varying skill levels and characteristics, the estimated masks provided by different raters usually suffer from high inter-rater variability. In this paper, we propose a simple yet effective Label Filling framework, termed as LF-Net, predicting the groundtruth segmentation label given only noisy annotations during training. The fundamental idea of label filling is to supervise the segmentation model by a subset of pixels with trustworthy labels, meanwhile filling labels of other pixels by mixed supervision. More concretely, we propose a qualified majority voting strategy, i.e., a threshold voting scheme is designed to model agreement among raters and the majority-voted labels of the selected subset of pixels are regarded as supervision. To fill labels of other pixels, two types of mixed auxiliary supervision are proposed: a soft label learned from intrinsic structures of noisy annotations, and raters’ characteristics labels which propagate individual rater’s characteristics information. LF-Net has two main advantages. 1) Training with trustworthy pixels incorporates training with confident supervision, guiding the direction of groundtruth label learning. 2) Two types of mixed supervision prevent over-fitting issues when the network is supervised by a subset of pixels, and guarantee high fidelity with the true label. Results on five datasets of diverse imaging modalities show that our LF-Net boosts segmentation accuracy in all datasets compared with state-of-the-art methods, with even a 7% improvement in DSC for MS lesion segmentation.
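The threshold voting idea in this abstract can be sketched as follows. This is a rough illustration under assumptions, not LF-Net itself: masks are flattened lists of 0/1 labels, the function name `qualified_majority_vote` and the `threshold` value are hypothetical, and pixels without sufficient agreement are simply marked `None` (in the paper they would be handled by the mixed auxiliary supervision).

```python
def qualified_majority_vote(annotations, threshold=0.75):
    """Keep a pixel label only when enough raters agree on it.

    annotations: list of per-rater binary masks, each a flat list of 0/1
                 values of equal length (an assumed input format).
    threshold:   minimum fraction of raters that must agree (hypothetical
                 value; it should exceed 0.5 so the outcome is unambiguous).
    Returns a list with 1 or 0 for trusted pixels and None elsewhere.
    """
    n_raters = len(annotations)
    labels = []
    for votes in zip(*annotations):  # iterate pixel by pixel across raters
        foreground_fraction = sum(votes) / n_raters
        if foreground_fraction >= threshold:
            labels.append(1)         # trusted foreground
        elif 1.0 - foreground_fraction >= threshold:
            labels.append(0)         # trusted background
        else:
            labels.append(None)      # disagreement: label to be filled later
    return labels


if __name__ == "__main__":
    raters = [
        [1, 1, 0, 0],
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 0, 0],
    ]
    print(qualified_majority_vote(raters, threshold=0.75))
```

In this toy input, the first three pixels reach the 75% agreement level and keep a label, while the last pixel (two votes each way) is left unlabeled.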
[CV-27] Benchmarking Pathology Foundation Models: Adaptation Strategies and Scenarios
Link: https://arxiv.org/abs/2410.16038
Authors: Jeaung Lee, Jeewoo Lim, Keunho Byeon, Jin Tae Kwak
Keywords (EN): pathology-specific foundation models, demonstrated enhanced learning, enhanced learning capability, foundation models, recently emerged
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:In computational pathology, several foundation models have recently emerged and demonstrated enhanced learning capability for analyzing pathology images. However, adapting these models to various downstream tasks remains challenging, particularly when faced with datasets from different sources and acquisition conditions, as well as limited data availability. In this study, we benchmark four pathology-specific foundation models across 14 datasets and two scenarios (consistency assessment and flexibility assessment), addressing diverse adaptation scenarios and downstream tasks. In the consistency assessment scenario, involving five fine-tuning methods, we found that the parameter-efficient fine-tuning approach was both efficient and effective for adapting pathology-specific foundation models to diverse datasets within the same downstream task. In the flexibility assessment scenario under data-limited environments, utilizing five few-shot learning methods, we observed that the foundation models benefited more from the few-shot learning methods that involve modification during the testing phase only. These findings provide insights that could guide the deployment of pathology-specific foundation models in real clinical settings, potentially improving the accuracy and reliability of pathology image analysis. The code for this study is available at: this https URL.
[CV-28] Improving the Multi-label Atomic Activity Recognition by Robust Visual Feature and Advanced Attention @ ROAD Atomic Activity Recognition 2024
Link: https://arxiv.org/abs/2410.16037
Authors: Jiamin Cao, Lingqi Wang, Kexin Zhang, Yuting Yang, Licheng Jiao, Yuwei Guo
Keywords (EN): activity recognition task, atomic activity recognition, multi-label atomic activity, action recognition task, recognition task
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Road++ Track3 proposes a multi-label atomic activity recognition task in traffic scenarios, which can be standardized as a 64-class multi-label video action recognition task. In the multi-label atomic activity recognition task, the robustness of visual feature extraction remains a key challenge, which directly affects the model performance and generalization ability. To cope with these issues, our team optimized three aspects: data processing, model and post-processing. Firstly, an appropriate resolution and video sampling strategy are selected, and a fixed sampling strategy is set on the validation and test sets. Secondly, in terms of model training, the team selects a variety of visual backbone networks for feature extraction, and then introduces the action-slot model, which is trained on the training and validation sets and used for inference on the test set. Finally, for post-processing, the team combined the strengths and weaknesses of different models through weighted fusion, and the final mAP on the test set was 58%, which is 4% higher than the challenge baseline.
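The abstract does not say how the weighted fusion is computed, so the following is only a generic sketch of one common form: a weighted average of per-class scores across models, followed by a per-class threshold for multi-label prediction. The function names, the weights, and the 0.5 threshold are all assumptions for illustration and are not taken from the paper.

```python
def weighted_fusion(model_scores, weights):
    """Fuse per-class scores from several models by weighted averaging.

    model_scores: list of score vectors, one per model, each of length
                  n_classes (assumed to be probabilities in [0, 1]).
    weights:      one non-negative weight per model (hypothetical values,
                  e.g. chosen from validation performance).
    """
    total_weight = sum(weights)
    n_classes = len(model_scores[0])
    return [
        sum(w * scores[k] for w, scores in zip(weights, model_scores)) / total_weight
        for k in range(n_classes)
    ]


def multilabel_decision(fused_scores, threshold=0.5):
    """Multi-label output: every class whose fused score reaches the threshold."""
    return [k for k, s in enumerate(fused_scores) if s >= threshold]


if __name__ == "__main__":
    scores_a = [0.9, 0.2, 0.6]   # scores from a stronger model
    scores_b = [0.5, 0.6, 0.2]   # scores from a weaker model
    fused = weighted_fusion([scores_a, scores_b], weights=[3.0, 1.0])
    print(fused)
    print(multilabel_decision(fused))
```

Giving the stronger model a larger weight lets it dominate where the two models disagree, while the weaker model can still tip borderline classes.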
[CV-29] Few-shot target-driven instance detection based on open-vocabulary object detection models
Link: https://arxiv.org/abs/2410.16028
Authors: Ben Crulis, Barthelemy Serres, Cyril De Runz, Gilles Venturini
Keywords (EN): Current large open, large open vision, Current large, open vision models, few-shot object recognition
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Current large open vision models could be useful for one-shot and few-shot object recognition. Nevertheless, gradient-based re-training solutions are costly. On the other hand, open-vocabulary object detection models bring visual and textual concepts closer together in the same latent space, allowing zero-shot detection via prompting at small computational cost. We propose a lightweight method to turn the latter into one-shot or few-shot object recognition models without requiring textual descriptions. Our experiments on the TEgO dataset using the YOLO-World model as a base show that performance increases with the model size, the number of examples and the use of image augmentation.
[CV-30] START: A Generalized State Space Model with Saliency-Driven Token-Aware Transformation NEURIPS2024
Link: https://arxiv.org/abs/2410.16020
Authors: Jintao Guo, Lei Qi, Yinghuan Shi, Yang Gao
Keywords (EN): unseen target domains, multiple source domains, aims to enable, generalize to unseen, unseen target
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS2024. The code is available at this https URL
Abstract:Domain Generalization (DG) aims to enable models to generalize to unseen target domains by learning from multiple source domains. Existing DG methods primarily rely on convolutional neural networks (CNNs), which inherently learn texture biases due to their limited receptive fields, making them prone to overfitting source domains. While some works have introduced transformer-based methods (ViTs) for DG to leverage the global receptive field, these methods incur high computational costs due to the quadratic complexity of self-attention. Recently, advanced state space models (SSMs), represented by Mamba, have shown promising results in supervised learning tasks by achieving linear complexity in sequence length during training and fast RNN-like computation during inference. Inspired by this, we investigate the generalization ability of the Mamba model under domain shifts and find that input-dependent matrices within SSMs could accumulate and amplify domain-specific features, thus hindering model generalization. To address this issue, we propose a novel SSM-based architecture with saliency-based token-aware transformation (namely START), which achieves state-of-the-art (SOTA) performances and offers a competitive alternative to CNNs and ViTs. Our START can selectively perturb and suppress domain-specific features in salient tokens within the input-dependent matrices of SSMs, thus effectively reducing the discrepancy between different domains. Extensive experiments on five benchmarks demonstrate that START outperforms existing SOTA DG methods with efficient linear complexity. Our code is available at this https URL.
[CV-31] Multispectral Texture Synthesis using RGB Convolutional Neural Networks
Link: https://arxiv.org/abs/2410.16019
Authors: Sélim Ollivier, Yann Gousseau, Sidonie Lefebvre
Keywords (EN): synthesis algorithms rely, RGB images, deep features, algorithms rely, rely on style
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:State-of-the-art RGB texture synthesis algorithms rely on style distances that are computed through statistics of deep features. These deep features are extracted by classification neural networks that have been trained on large datasets of RGB images. Extending such synthesis methods to multispectral images is not straightforward, since the pre-trained networks are designed for and have been trained on RGB images. In this work, we propose two solutions to extend these methods to multispectral imaging. Neither of them requires additional training of the neural network from which the second-order neural statistics are extracted. The first one consists in optimizing over batches of random triplets of spectral bands throughout training. The second one projects multispectral pixels onto a 3-dimensional space. We further explore the benefit of a color transfer operation upstream of the projection to avoid the potentially abnormal color distributions induced by the projection. Our experiments compare the performances of the various methods through different metrics. We demonstrate that they can be used to perform exemplar-based texture synthesis, achieve good visual quality, and come close to state-of-the-art methods on RGB bands.
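The first solution, sampling random triplets of spectral bands so that a 3-channel (RGB-trained) network can be applied to multispectral data, can be sketched as below. This is an illustrative guess at the sampling step only: the abstract does not specify whether bands within a triplet are distinct or ordered, so the choices here (distinct bands, random order) and the function names are assumptions.

```python
import random


def sample_band_triplets(num_bands, batch_size, rng=None):
    """Draw a batch of random triplets of distinct spectral band indices.

    num_bands:  number of spectral bands in the multispectral image.
    batch_size: how many triplets to draw for one optimization step.
    rng:        optional random.Random instance for reproducibility.
    """
    rng = rng or random.Random()
    return [tuple(rng.sample(range(num_bands), 3)) for _ in range(batch_size)]


def select_bands(pixel, triplet):
    """Build a pseudo-RGB pixel from a multispectral pixel and a band triplet."""
    return tuple(pixel[band] for band in triplet)


if __name__ == "__main__":
    rng = random.Random(0)
    triplets = sample_band_triplets(num_bands=8, batch_size=4, rng=rng)
    pixel = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]  # one 8-band pixel
    for triplet in triplets:
        print(triplet, select_bands(pixel, triplet))
```

Each pseudo-RGB image formed this way could then be passed through the pre-trained RGB network to compute the style statistics, with the triplets resampled at every step.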
[CV-32] Massimo: Public Queue Monitoring and Management using Mass-Spring Model
Link: https://arxiv.org/abs/2410.16012
Authors: Abhijeet Kumar, Unnati Singh, Rajdeep Chatterjee, Tathagata Bandyopadhyay
Keywords (EN): customer satisfaction, control and regulation, important in order, order to avoid, avoid the traffic
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP)
Comments: 8 pages, 6 figures, 3 algorithms, 3 tables
Abstract:An efficient system of queue control and regulation in public spaces is very important for avoiding traffic jams and improving customer satisfaction. This article offers a detailed road map, based on a combination of intelligent systems, for creating efficient queue systems in public places. By utilizing different technologies, i.e., computer vision, machine learning algorithms and deep learning, our system provides accurate information about whether a place is crowded and the necessary measures to be taken.
[CV-33] 3D-GANTex: 3D Face Reconstruction with StyleGAN3-based Multi-View Images and 3DDFA based Mesh Generation
Link: https://arxiv.org/abs/2410.16009
Authors: Rohit Das, Tzung-Han Lin, Ko-Chih Wang
Keywords (EN): information to work, Geometry, Morphable Models, ill-posed problem, texture estimation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 4 figures, 2 tables, pre-print version
Abstract:Geometry and texture estimation from a single face image is an ill-posed problem since there is very little information to work with. The problem becomes harder when the face is rotated at a different angle. This paper tries to tackle this problem by introducing a novel method for texture estimation from a single image using StyleGAN and 3D Morphable Models. The method begins by generating multi-view faces using the latent space of the GAN. Then 3DDFA trained on 3DMM estimates a 3D face mesh as well as a high-resolution texture map that is consistent with the estimated face shape. The results show that the generated mesh is of high quality with a near-accurate texture representation.
[CV-34] Visual Representation Learning Guided By Multi-modal Prior Knowledge
Link: https://arxiv.org/abs/2410.15981
Authors: Hongkuan Zhou, Lavdim Halilaj, Sebastian Monka, Stefan Schmid, Yuqicheng Zhu, Bo Xiong, Steffen Staab
Keywords (EN): deep neural networks, facing distribution shifts, neural networks, computer vision, remarkable success
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract:Despite the remarkable success of deep neural networks (DNNs) in computer vision, they fail to remain high-performing when facing distribution shifts between training and testing data. In this paper, we propose Knowledge-Guided Visual representation learning (KGV), a distribution-based learning approach leveraging multi-modal prior knowledge, to improve generalization under distribution shift. We use prior knowledge from two distinct modalities: 1) a knowledge graph (KG) with hierarchical and association relationships; and 2) generated synthetic images of visual elements semantically represented in the KG. The respective embeddings are generated from the given modalities in a common latent space, i.e., visual embeddings from original and synthetic images as well as knowledge graph embeddings (KGEs). These embeddings are aligned via a novel variant of translation-based KGE methods, where the node and relation embeddings of the KG are modeled as Gaussian distributions and translations respectively. We claim that incorporating multi-modal prior knowledge enables more regularized learning of image representations. Thus, the models are able to better generalize across different data distributions. We evaluate KGV on different image classification tasks with major or minor distribution shifts, namely road sign classification across datasets from Germany, China, and Russia, image classification with the mini-ImageNet dataset and its variants, as well as the DVM-CAR dataset. The results demonstrate that KGV consistently exhibits higher accuracy and data efficiency than the baselines across all experiments.
[CV-35] Granularity Matters in Long-Tail Learning
Link: https://arxiv.org/abs/2410.15980
Authors: Shizhen Zhao, Xin Wen, Jiahui Liu, Chuofan Ma, Chunfeng Yuan, Xiaojuan Qi
Keywords (EN): data distributions remains, Balancing training, distributions remains, remains a long-standing, long-standing challenge
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Balancing training on long-tail data distributions remains a long-standing challenge in deep learning. While methods such as re-weighting and re-sampling help alleviate the imbalance issue, limited sample diversity continues to hinder models from learning robust and generalizable feature representations, particularly for tail classes. In contrast to existing methods, we offer a novel perspective on long-tail learning, inspired by an observation: datasets with finer granularity tend to be less affected by data imbalance. In this paper, we investigate this phenomenon through both quantitative and qualitative studies, showing that increased granularity enhances the generalization of learned features in tail categories. Motivated by these findings, we propose a method to increase dataset granularity through category extrapolation. Specifically, we introduce open-set auxiliary classes that are visually similar to existing ones, aiming to enhance representation learning for both head and tail classes. This forms the core contribution and insight of our approach. To automate the curation of auxiliary data, we leverage large language models (LLMs) as knowledge bases to search for auxiliary categories and retrieve relevant images through web crawling. To prevent the overwhelming presence of auxiliary classes from disrupting training, we introduce a neighbor-silencing loss that encourages the model to focus on class discrimination within the target dataset. During inference, the classifier weights for auxiliary categories are masked out, leaving only the target class weights for use. Extensive experiments and ablation studies on three standard long-tail benchmarks demonstrate the effectiveness of our approach, notably outperforming strong baseline methods that use the same amount of data. The code will be made publicly available.
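The inference-time masking described in this abstract can be illustrated with a small sketch. It assumes, purely for illustration, that the classifier outputs one logit per class and that the target classes occupy the first `num_target_classes` positions, with auxiliary classes after them; the function name and this layout are hypothetical and not taken from the paper's code.

```python
def predict_target_class(logits, num_target_classes):
    """Predict among target classes only, ignoring auxiliary-class logits.

    logits:             one score per class, target classes first and
                        auxiliary classes last (an assumed layout).
    num_target_classes: number of classes in the original target dataset.
    """
    # Masking out the auxiliary classifier weights is equivalent to
    # discarding their logits before taking the argmax.
    target_logits = logits[:num_target_classes]
    return max(range(num_target_classes), key=lambda k: target_logits[k])


if __name__ == "__main__":
    # Three target classes followed by two auxiliary classes.
    logits = [1.2, 0.3, 2.5, 4.0, 0.1]
    print(predict_target_class(logits, num_target_classes=3))
```

In this toy input the largest logit belongs to an auxiliary class, but the prediction is still taken from the target classes, which is the effect the masking is meant to have.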
[CV-36] Zero-Shot Scene Reconstruction from Single Images with Deep Prior Assembly NEURIPS2024
Link: https://arxiv.org/abs/2410.15971
Authors: Junsheng Zhou, Yu-Shen Liu, Zhizhong Han
Keywords (EN): language and vision, leading a revolution, large models, deep priors, Large language
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: To appear at NeurIPS 2024. Project page: this https URL
Abstract:Large language and vision models have been leading a revolution in visual computing. By greatly scaling up sizes of data and model parameters, the large models learn deep priors which lead to remarkable performance in various tasks. In this work, we present deep prior assembly, a novel framework that assembles diverse deep priors from large models for scene reconstruction from single images in a zero-shot manner. We show that this challenging task can be done without extra knowledge, simply by generalizing one deep prior in one sub-task. To this end, we introduce novel methods related to poses, scales, and occlusion parsing which are key to enabling deep priors to work together in a robust way. Deep prior assembly does not require any 3D or 2D data-driven training in the task and demonstrates superior performance in generalizing priors to open-world scenes. We conduct evaluations on various datasets, and report analysis, numerical and visual comparisons with the latest methods to show our superiority. Project page: this https URL.
[CV-37] A Paradigm Shift in Mouza Map Vectorization: A Human-Machine Collaboration Approach
Link: https://arxiv.org/abs/2410.15961
Authors: Mahir Shahriar Dhrubo, Samira Akter, Anwarul Bashir Shuaib, Md Toki Tahmid, Zahid Hasan, A. B. M. Alim Al Islam
Keywords (EN): significant challenge due, hand-drawn cadastral maps, Efficient vectorization, poses a significant, complex structures
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages including reference, 14 figures, 4 tables
Abstract:Efficient vectorization of hand-drawn cadastral maps, such as Mouza maps in Bangladesh, poses a significant challenge due to their complex structures. Current manual digitization methods are time-consuming and labor-intensive. Our study proposes a semi-automated approach to streamline the digitization process, saving both time and human resources. Our methodology focuses on separating the plot boundaries and plot identifiers and applying our digitization methodology to convert both of them into vectorized format. To accomplish full vectorization, Convolutional Neural Network (CNN) models are utilized for pre-processing and plot number detection along with our smoothing algorithms based on the diversity of vector maps. The CNN models are trained with our own labeled dataset, generated from the maps, and smoothing algorithms are introduced from the various observations of the map's vector formats. Further human intervention remains essential for precision. We have evaluated our methods on several maps and provided both quantitative and qualitative results with a user study. The results demonstrate that our methodology outperforms the existing map digitization processes significantly.
[CV-38] Diffusion Transformer Policy
Link: https://arxiv.org/abs/2410.15959
Authors: Zhi Hou, Tianyi Zhang, Yuwen Xiong, Hengjun Pu, Chengyang Zhao, Ronglei Tong, Yu Qiao, Jifeng Dai, Yuntao Chen
Keywords: Recent large visual-language, Diffusion Transformer Policy, small action head, Recent large, diffusion transformer
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint
Abstract:Recent large visual-language action models pretrained on diverse robot datasets have demonstrated the potential for generalizing to new environments with a few in-domain data. However, those approaches usually predict discretized or continuous actions by a small action head, which limits the ability in handling diverse action spaces. In contrast, we model the continuous action with a large multi-modal diffusion transformer, dubbed as Diffusion Transformer Policy, in which we directly denoise action chunks by a large transformer model rather than a small action head. By leveraging the scaling capability of transformers, the proposed approach can effectively model continuous end-effector actions across large diverse robot datasets, and achieve better generalization performance. Extensive experiments demonstrate Diffusion Transformer Policy pretrained on diverse robot data can generalize to different embodiments, including simulation environments like Maniskill2 and Calvin, as well as the real-world Franka arm. Specifically, without bells and whistles, the proposed approach achieves state-of-the-art performance with only a single third-view camera stream in the Calvin novel task setting (ABC-D), improving the average number of tasks completed in a row of 5 to 3.6, and the pretraining stage significantly facilitates the success sequence length on the Calvin by over 1.2. The code will be publicly available.
[CV-39] CamI2V: Camera-Controlled Image-to-Video Diffusion Model
Link: https://arxiv.org/abs/2410.15957
Authors: Guangcong Zheng, Teng Li, Rui Jiang, Yehao Lu, Tao Wu, Xi Li
Keywords: user-friendly and physics-related, camera, Recently, physics-related condition, camera pose
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Recently, camera pose, as a user-friendly and physics-related condition, has been introduced into text-to-video diffusion models for camera control. However, existing methods simply inject camera conditions through a side input. These approaches neglect the inherent physical knowledge of camera pose, resulting in imprecise camera control, inconsistencies, and also poor interpretability. In this paper, we emphasize the necessity of integrating explicit physical constraints into model design. Epipolar attention is proposed for modeling all cross-frame relationships from a novel perspective of noised condition. This ensures that features are aggregated from corresponding epipolar lines in all noised frames, overcoming the limitations of current attention mechanisms in tracking displaced features across frames, especially when features move significantly with the camera and become obscured by noise. Additionally, we introduce register tokens to handle cases without intersections between frames, commonly caused by rapid camera movements, dynamic objects, or occlusions. To support image-to-video, we propose the multiple guidance scale to allow for precise control for image, text, and camera, respectively. Furthermore, we establish a more robust and reproducible evaluation pipeline to solve the inaccuracy and instability of existing camera control measurement. We achieve a 25.5% improvement in camera controllability on RealEstate10K while maintaining strong generalization to out-of-domain images. Only 24GB and 12GB are required for training and inference, respectively. We plan to release checkpoints, along with training and evaluation codes. Dynamic videos are best viewed at this https URL.
[CV-40] MBPU: A Plug-and-Play State Space Model for Point Cloud Upsamping with Fast Point Rendering
Link: https://arxiv.org/abs/2410.15941
Authors: Jiayi Song, Weidong Yang, Zhijun Li, Wen-Ming Chen, Ben Fei
Keywords: holding potential applications, sparse input captured, point cloud upsampling, point cloud, uniform point clouds
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:The task of point cloud upsampling (PCU) is to generate dense and uniform point clouds from sparse input captured by 3D sensors like LiDAR. It has potential real-world applications yet remains a challenging task. Existing deep learning-based methods have shown significant achievements in this field. However, they still face limitations in effectively handling long sequences and addressing the issue of shrinkage artifacts around the surface of the point cloud. Inspired by the newly proposed Mamba, in this paper, we introduce a network named MBPU built on top of the Mamba architecture, which performs well in long sequence modeling, especially for large-scale point cloud upsampling, and achieves fast convergence speed. Moreover, MBPU is an arbitrary-scale upsampling framework as the predictor of point distance in the point refinement phase. We simultaneously predict the 3D position shift and 1D point-to-point distance as regression quantities to constrain the global features while ensuring the accuracy of local details. We also introduce a fast differentiable renderer to further enhance the fidelity of the upsampled point cloud and reduce artifacts. Notably, by virtue of our fast point rendering, MBPU yields high-quality upsampled point clouds by effectively eliminating surface noise. Extensive experiments have demonstrated that our MBPU outperforms other off-the-shelf methods in terms of point cloud upsampling, especially for large-scale point clouds.
[CV-41] Focus on BEV: Self-calibrated Cycle View Transformation for Monocular Birds-Eye-View Segmentation
Link: https://arxiv.org/abs/2410.15932
Authors: Jiawei Zhao, Qixing Jiang, Xuede Li, Junfeng Luo
Keywords: segmentation aims, aims to establish, establish a spatial, spatial mapping, maps from monocular
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Birds-Eye-View (BEV) segmentation aims to establish a spatial mapping from the perspective view to the top view and estimate the semantic maps from monocular images. Recent studies have encountered difficulties in view transformation due to the disruption of BEV-agnostic features in image space. To tackle this issue, we propose a novel FocusBEV framework consisting of (i) a self-calibrated cross view transformation module to suppress the BEV-agnostic image areas and focus on the BEV-relevant areas in the view transformation stage, (ii) a plug-and-play ego-motion-based temporal fusion module to exploit the spatiotemporal structure consistency in BEV space with a memory bank, and (iii) an occupancy-agnostic IoU loss to mitigate both semantic and positional uncertainties. Experimental evidence demonstrates that our approach achieves new state-of-the-art on two popular benchmarks, i.e., 29.2% mIoU on nuScenes and 35.2% mIoU on Argoverse.
[CV-42] GReFEL: Geometry-Aware Reliable Facial Expression Learning under Bias and Imbalanced Data Distribution ACCV2024
Link: https://arxiv.org/abs/2410.15927
Authors: Azmine Toushik Wasi, Taki Hasan Rafi, Raima Islam, Karlo Serbetar, Dong Kyu Chae
Keywords: Reliable facial expression, facial expression characteristics, distinctive facial expression, facial expression learning, facial expression
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ACCV 2024. Extended version of ARBEx (arXiv:2305.01486)
Abstract:Reliable facial expression learning (FEL) involves the effective learning of distinctive facial expression characteristics for more reliable, unbiased and accurate predictions in real-life settings. However, current systems struggle with FEL tasks because of the variance in people’s facial expressions due to their unique facial structures, movements, tones, and demographics. Biased and imbalanced datasets compound this challenge, leading to wrong and biased prediction labels. To tackle these, we introduce GReFEL, leveraging Vision Transformers and a facial geometry-aware anchor-based reliability balancing module to combat imbalanced data distributions, bias, and uncertainty in facial expression learning. Integrating local and global data with anchors that learn different facial data points and structural features, our approach adjusts biased and mislabeled emotions caused by intra-class disparity, inter-class similarity, and scale sensitivity, resulting in comprehensive, accurate, and reliable facial expression predictions. Our model outperforms current state-of-the-art methodologies, as demonstrated by extensive experiments on various datasets.
[CV-43] Mitigating Object Hallucination via Concentric Causal Attention NEURIPS2024
Link: https://arxiv.org/abs/2410.15926
Authors: Yun Xing, Yiheng Li, Ivan Laptev, Shijian Lu
Keywords: Recent Large Vision, Vision Language Models, Large Vision Language, Vision Language, present remarkable zero-shot
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: To appear at NeurIPS 2024. Code is available at this https URL
Abstract:Recent Large Vision Language Models (LVLMs) present remarkable zero-shot conversational and reasoning capabilities given multimodal queries. Nevertheless, they suffer from object hallucination, a phenomenon where LVLMs are prone to generate textual responses not factually aligned with image inputs. Our pilot study reveals that object hallucination is closely tied with Rotary Position Encoding (RoPE), a widely adopted positional dependency modeling design in existing LVLMs. Due to the long-term decay in RoPE, LVLMs tend to hallucinate more when relevant visual cues are distant from instruction tokens in the multimodal input sequence. Additionally, we observe a similar effect when reversing the sequential order of visual tokens during multimodal alignment. Our tests indicate that long-term decay in RoPE poses challenges to LVLMs while capturing visual-instruction interactions across long distances. We propose Concentric Causal Attention (CCA), a simple yet effective positional alignment strategy that mitigates the impact of RoPE long-term decay in LVLMs by naturally reducing relative distance between visual and instruction tokens. With CCA, visual tokens can better interact with instruction tokens, thereby enhancing model’s perception capability and alleviating object hallucination. Without bells and whistles, our positional alignment method surpasses existing hallucination mitigation strategies by large margins on multiple object hallucination benchmarks.
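To make the RoPE mechanism named in this abstract concrete, here is a minimal pure-Python sketch of rotary position encoding. It is not the authors' code. It only illustrates the standard property that a query-key dot product depends on the relative offset between positions. The function name `rope`, the base of 10000, and the toy vectors are assumptions for illustration. Neither the long-term decay analysis nor CCA itself is reproduced here.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (standard RoPE).

    Assumption: `vec` has even length; `base` follows the common default of 10000.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # frequency falls as the pair index grows
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical toy query/key vectors.
q = [1.0, 0.5, -0.3, 0.8]
k = [0.2, -0.7, 0.9, 0.4]

# The same relative offset (2) at different absolute positions gives the same score.
score_near_start = dot(rope(q, 3), rope(k, 1))
score_shifted = dot(rope(q, 12), rope(k, 10))
```

Because only the offset matters, any scheme that changes the position indices assigned to visual tokens changes their effective distance to the instruction tokens, which is the lever the abstract describes.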
[CV-44] Are Large-scale Soft Labels Necessary for Large-scale Dataset Distillation? NEURIPS2024
Link: https://arxiv.org/abs/2410.15919
Authors: Lingao Xiao, Yang He
Keywords: soft labels, auxiliary soft labels, soft labels exceeds, large-scale soft labels, storage for auxiliary
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2024
Abstract:In ImageNet-condensation, the storage for auxiliary soft labels exceeds that of the condensed dataset by over 30 times. However, are large-scale soft labels necessary for large-scale dataset distillation? In this paper, we first discover that the high within-class similarity in condensed datasets necessitates the use of large-scale soft labels. This high within-class similarity can be attributed to the fact that previous methods use samples from different classes to construct a single batch for batch normalization (BN) matching. To reduce the within-class similarity, we introduce class-wise supervision during the image synthesizing process by batching the samples within classes, instead of across classes. As a result, we can increase within-class diversity and reduce the size of required soft labels. A key benefit of improved image diversity is that soft label compression can be achieved through simple random pruning, eliminating the need for complex rule-based strategies. Experiments validate our discoveries. For example, when condensing ImageNet-1K to 200 images per class, our approach compresses the required soft labels from 113 GB to 2.8 GB (40x compression) with a 2.6% performance gain. Code is available at: this https URL
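As a rough illustration of the "simple random pruning" mentioned in this abstract, the sketch below keeps a uniformly random subset of stored soft-label records and checks the quoted compression ratio. The abstract does not say what a stored record contains or how pruning is scheduled, so `random_prune`, the record layout, and the seed are hypothetical.

```python
import random

def random_prune(soft_labels, keep_ratio, seed=0):
    """Keep a uniformly random subset of soft-label records (hypothetical layout)."""
    rng = random.Random(seed)
    keep = max(1, round(len(soft_labels) * keep_ratio))
    kept_indices = sorted(rng.sample(range(len(soft_labels)), keep))
    return [soft_labels[i] for i in kept_indices]

# Ratio implied by the figures quoted in the abstract: 113 GB down to 2.8 GB.
compression_ratio = 113 / 2.8

# Toy stand-in for stored soft labels: 400 records, pruned to 1/40 of the original.
records = list(range(400))
pruned = random_prune(records, keep_ratio=1 / 40)
```

The point of the example is only that no rule-based selection is involved: which records survive is decided by a random draw.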
[CV-45] Leveraging CORAL-Correlation Consistency Network for Semi-Supervised Left Atrium MRI Segmentation
Link: https://arxiv.org/abs/2410.15916
Authors: Xinze Li, Runlin Huang, Zhenghao Wu, Bohan Yang, Wentao Fan, Chengzhang Zhu, Weifeng Su
Keywords: medical image segmentation, medical image, left atrium, image segmentation, labeled images
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 3 figures, Accepted by 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2024)
Abstract:Semi-supervised learning (SSL) has been widely used to learn from both a few labeled images and many unlabeled images to overcome the scarcity of labeled samples in medical image segmentation. Most current SSL-based segmentation methods use pixel values directly to identify similar features in labeled and unlabeled data. They usually fail to accurately capture the intricate attachment structures in the left atrium, such as areas of inconsistent density or outward curvature, which adds to the complexity of the task. In this paper, we delve into this issue and introduce an effective solution, the CORAL (Correlation-Aligned)-Correlation Consistency Network (CORN), to capture the global structure shape and local details of the left atrium. Diverging from previous methods focused on each local pixel value, the CORAL-Correlation Consistency Module (CCM) in the CORN leverages second-order statistical information to capture global structural features by minimizing the distribution discrepancy between labeled and unlabeled samples in feature space. Yet, direct construction of features from unlabeled data frequently results in "Sample Selection Bias", leading to flawed supervision. We thus further propose the Dynamic Feature Pool (DFP) for the CCM, which utilizes a confidence-based filtering strategy to remove incorrectly selected features and regularize both teacher and student models by constraining the similarity matrix to be consistent. Extensive experiments on the Left Atrium dataset have shown that the proposed CORN outperforms previous state-of-the-art semi-supervised learning methods.
[CV-46] Hybrid Architecture for Real-Time Video Anomaly Detection: Integrating Spatial and Temporal Analysis
Link: https://arxiv.org/abs/2410.15909
Authors: Fabien Poirier
Keywords: inspired by human, architecture for real-time, human behavior, behavior by combining, real-time anomaly detection
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:We propose a new architecture for real-time anomaly detection in video data, inspired by human behavior by combining spatial and temporal analyses. This approach uses two distinct models: for temporal analysis, a recurrent convolutional network (CNN + RNN) is employed, associating VGG19 and a GRU to process video sequences. Regarding spatial analysis, it is performed using YOLOv7 to analyze individual images. These two analyses can be carried out either in parallel, with a final prediction that combines the results of both analyses, or in series, where the spatial analysis enriches the data before the temporal analysis. In this article, we will compare these two architectural configurations with each other, to evaluate the effectiveness of our hybrid approach in video anomaly detection.
[CV-47] TexPro: Text-guided PBR Texturing with Procedural Material Modeling
Link: https://arxiv.org/abs/2410.15891
Authors: Ziqiang Dang, Wenqi Dong, Zesong Yang, Bangbang Yang, Liang Li, Yuewen Ma, Zhaopeng Cui
Keywords: high-fidelity material generation, present TexPro, typically generate RGB, generate RGB textures, texture generation methods
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: In submission. Supplementary material is included at the end of the main paper (5 pages, 2 figures)
Abstract:In this paper, we present TexPro, a novel method for high-fidelity material generation for input 3D meshes given text prompts. Unlike existing text-conditioned texture generation methods that typically generate RGB textures with baked lighting, TexPro is able to produce diverse texture maps via procedural material modeling, which enables physical-based rendering, relighting, and additional benefits inherent to procedural materials. Specifically, we first generate multi-view reference images given the input textual prompt by employing the latest text-to-image model. We then derive texture maps through a rendering-based optimization with recent differentiable procedural materials. To this end, we design several techniques to handle the misalignment between the generated multi-view images and 3D meshes, and introduce a novel material agent that enhances material classification and matching by exploring both part-level understanding and object-aware material reasoning. Experiments demonstrate the superiority of the proposed method over existing SOTAs and its capability of relighting.
[CV-48] Foundation Models for Slide-level Cancer Subtyping in Digital Pathology
Link: https://arxiv.org/abs/2410.15886
Authors: Pablo Meseguer, Rocío del Amor, Adrian Colomer, Valery Naranjo
Keywords: computer vision due, fine-tuning approach, widely adopted, adopted in computer, computer vision
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Manuscript accepted for oral presentation at Decision Science Alliance - International Summer Conference (DSA-ISC) 2024 held in Valencia, Spain
Abstract:Since the emergence of the ImageNet dataset, the pretraining and fine-tuning approach has become widely adopted in computer vision due to the ability of ImageNet-pretrained models to learn a wide variety of visual features. However, a significant challenge arises when adapting these models to domain-specific fields, such as digital pathology, due to substantial gaps between domains. To address this limitation, foundation models (FM) have been trained on large-scale in-domain datasets to learn the intricate features of histopathology images. In cancer diagnosis, whole-slide image (WSI) prediction is essential for patient prognosis, and multiple instance learning (MIL) has been implemented to handle the giga-pixel size of WSI. As MIL frameworks rely on patch-level feature aggregation, this work aims to compare the performance of various feature extractors developed under different pretraining strategies for cancer subtyping on WSI under a MIL framework. Results demonstrate the ability of foundation models to surpass ImageNet-pretrained models for the prediction of six skin cancer subtypes
[CV-49] Distributed Learning for UAV Swarms
Link: https://arxiv.org/abs/2410.15882
Authors: Chen Hu, Hanchi Ren, Jingjing Deng, Xianghua Xie
Keywords: Unmanned Aerial Vehicle, Unmanned Aerial, Aerial Vehicle, making Federated Learning, deployed in dynamic
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Abstract:Unmanned Aerial Vehicle (UAV) swarms are increasingly deployed in dynamic, data-rich environments for applications such as environmental monitoring and surveillance. These scenarios demand efficient data processing while maintaining privacy and security, making Federated Learning (FL) a promising solution. FL allows UAVs to collaboratively train global models without sharing raw data, but challenges arise due to the non-Independent and Identically Distributed (non-IID) nature of the data collected by UAVs. In this study, we integrate state-of-the-art FL methods into a UAV swarm application and investigate the performance of multiple aggregation methods (namely FedAvg, FedProx, FedOpt, and MOON), with a particular focus on tackling non-IID data on a variety of datasets, specifically MNIST for baseline performance, CIFAR10 for natural object classification, EuroSAT for environment monitoring, and CelebA for surveillance. These algorithms were selected to cover improved techniques on both client-side updates and global aggregation. Results show that while all algorithms perform comparably on IID data, their performance deteriorates significantly under non-IID conditions. FedProx demonstrated the most stable overall performance, emphasising the importance of regularising local updates in non-IID environments to mitigate drastic deviations in local models.
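Two of the methods compared in this abstract are easy to sketch. FedAvg aggregates client models by a sample-weighted average. FedProx adds a proximal term (mu/2)·||w − w_global||² to each client's local objective, which is the "regularising local updates" the abstract credits for stability. The code below is a minimal illustration with flat parameter lists, not the study's implementation. The function names, learning rate, and toy values are assumptions.

```python
def fedavg(client_weights, client_sizes):
    """Server side: average client parameter vectors, weighted by local sample counts."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]

def fedprox_step(w, w_global, grad, lr, mu):
    """Client side: one gradient step on loss + (mu / 2) * ||w - w_global||^2.

    The proximal term contributes mu * (w - w_global) to the gradient,
    pulling the local model back toward the global one. mu = 0 recovers plain SGD.
    """
    return [
        wi - lr * (gi + mu * (wi - wgi))
        for wi, wgi, gi in zip(w, w_global, grad)
    ]

# Hypothetical example: two UAV clients holding 1 and 3 samples.
global_model = fedavg([[1.0, 0.0], [3.0, 4.0]], [1, 3])

# With a zero task gradient, only the proximal pull moves the local weight.
pulled = fedprox_step([1.0], [0.0], [0.0], lr=0.1, mu=1.0)
```

Under non-IID data, local gradients point in different directions on different clients, so the proximal pull limits how far any one client drifts before aggregation.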
[CV-50] MI-VisionShot: Few-shot adaptation of vision-language models for slide-level classification of histopathological images
Link: https://arxiv.org/abs/2410.15881
Authors: Pablo Meseguer, Rocío del Amor, Valery Naranjo
Keywords: made remarkable strides, Vision-language supervision, supervision has made, made remarkable, remarkable strides
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Manuscript accepted for oral presentation at KES-InnovationInMedicine 2024 held on Madeira, Portugal
Abstract:Vision-language supervision has made remarkable strides in learning visual representations from textual guidance. In digital pathology, vision-language models (VLM), pre-trained on curated datasets of histological image-captions, have been adapted to downstream tasks, such as region of interest classification. Zero-shot transfer for slide-level prediction has been formulated by MI-Zero, but it exhibits high variability depending on the textual prompts. Inspired by prototypical learning, we propose MI-VisionShot, a training-free adaptation method on top of VLMs to predict slide-level labels in few-shot learning scenarios. Our framework takes advantage of the excellent representation learning of VLM to create prototype-based classifiers under a multiple-instance setting by retrieving the most discriminative patches within each slide. Experimentation through different settings shows the ability of MI-VisionShot to surpass zero-shot transfer with lower variability, even in low-shot scenarios. Code coming soon at https://github.com/cvblab/MIVisionShot.
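The prototype-based classification idea in this abstract can be sketched in a few lines: average the embeddings of the few labeled examples per class, then assign a query to the class whose prototype is most similar. This is a generic prototypical classifier under assumed names and toy two-dimensional embeddings. The patch retrieval step and the VLM encoder described in the abstract are omitted.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def build_prototypes(support):
    """One prototype per class: the mean of its few-shot support embeddings."""
    return {label: mean_vector(vectors) for label, vectors in support.items()}

def classify(query, prototypes):
    """Training-free prediction: pick the most similar class prototype."""
    return max(prototypes, key=lambda label: cosine(query, prototypes[label]))

# Hypothetical 2-D embeddings for two classes with two shots each.
support = {
    "class_a": [[1.0, 0.0], [0.9, 0.1]],
    "class_b": [[0.0, 1.0], [0.1, 0.9]],
}
prototypes = build_prototypes(support)
```

No gradient step is involved, which matches the "training-free" description: adapting to new classes only requires recomputing the means.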
[CV-51] Visual Motif Identification: Elaboration of a Curated Comparative Dataset and Classification Methods ECCV2024
Link: https://arxiv.org/abs/2410.15866
Authors: Adam Phillips (1), Daniel Grandes Rodriguez (1), Miriam Sánchez-Manzano (1), Alan Salvadó (1), Manuel Garin (1), Gloria Haro (1), Coloma Ballester (1) ((1) Universitat Pompeu Fabra, Barcelona, Spain)
Keywords: recurrent iconographic compositions, aesthetic significance, recurrent iconographic, iconographic compositions, compositions that carry
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 11 figures, one table, to be published in the conference proceedings of ECCV 2024
Abstract:In cinema, visual motifs are recurrent iconographic compositions that carry artistic or aesthetic significance. Their use throughout the history of visual arts and media is interesting to researchers and filmmakers alike. Our goal in this work is to recognise and classify these motifs by proposing a new machine learning model that uses a custom dataset to that end. We show how features extracted from a CLIP model can be leveraged by using a shallow network and an appropriate loss to classify images into 20 different motifs, with surprisingly good results: an F1-score of 0.91 on our test set. We also present several ablation studies justifying the input features, architecture and hyperparameters used.
[CV-52] Random Token Fusion for Multi-View Medical Diagnosis NEURIPS2024
Link: https://arxiv.org/abs/2410.15847
Authors: Jingyu Guo, Christos Matsoukas, Fredrik Strand, Kevin Smith
Keywords: deep learning-based models, deep learning-based, multi-view medical diagnosis, fuse information, imaging perspectives
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Originally published at the NeurIPS 2024 Workshop on Advancements In Medical Foundation Models: Explainability, Robustness, Security, and Beyond (AIM-FM)
Abstract:In multi-view medical diagnosis, deep learning-based models often fuse information from different imaging perspectives to improve diagnostic performance. However, existing approaches are prone to overfitting and rely heavily on view-specific features, which can lead to trivial solutions. In this work, we introduce Random Token Fusion (RTF), a novel technique designed to enhance multi-view medical image analysis using vision transformers. By integrating randomness into the feature fusion process during training, RTF addresses the issue of overfitting and enhances the robustness and accuracy of diagnostic models without incurring any additional cost at inference. We validate our approach on standard mammography and chest X-ray benchmark datasets. Through extensive experiments, we demonstrate that RTF consistently improves the performance of existing fusion methods, paving the way for a new generation of multi-view medical foundation models.
[CV-53] LiOn-XA: Unsupervised Domain Adaptation via LiDAR-Only Cross-Modal Adversarial Training IROS2024
Link: https://arxiv.org/abs/2410.15833
Authors: Thomas Kreutz, Jens Lemke, Max Mühlhäuser, Alejandro Sanchez Guinea
Keywords: combines LiDAR-Only Cross-Modal, cloud semantic segmentation, domain gap arising, point cloud semantic, propose LiOn-XA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint. Paper has been accepted at IROS 2024
Abstract:In this paper, we propose LiOn-XA, an unsupervised domain adaptation (UDA) approach that combines LiDAR-Only Cross-Modal (X) learning with Adversarial training for 3D LiDAR point cloud semantic segmentation to bridge the domain gap arising from environmental and sensor setup changes. Unlike existing works that exploit multiple data modalities like point clouds and RGB image data, we address UDA in scenarios where RGB images might not be available and show that two distinct LiDAR data representations can learn from each other for UDA. More specifically, we leverage 3D voxelized point clouds to preserve important geometric structure in combination with 2D projection-based range images that provide information such as object orientations or surfaces. To further align the feature space between both domains, we apply adversarial training using both features and predictions of both 2D and 3D neural networks. Our experiments on 3 real-to-real adaptation scenarios demonstrate the effectiveness of our approach, achieving new state-of-the-art performance when compared to previous uni- and multi-model UDA methods. Our source code is publicly available at this https URL.
[CV-54] LiMTR: Time Series Motion Prediction for Diverse Road Users through Multimodal Feature Integration NEURIPS2024
Link: https://arxiv.org/abs/2410.15819
Authors: Camiel Oerlemans, Bram Grooten, Michiel Braat, Alaa Alassi, Emilia Silvas, Decebal Constantin Mocanu
Keywords: densely populated areas, road users accurately, Predicting the behavior, populated areas, behavior of road
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the NeurIPS 2024 workshop Time Series in the Age of Large Models. Code available at this https URL
Abstract:Predicting the behavior of road users accurately is crucial to enable the safe operation of autonomous vehicles in urban or densely populated areas. Therefore, there has been a growing interest in time series motion prediction research, leading to significant advancements in state-of-the-art techniques in recent years. However, the potential of using LiDAR data to capture more detailed local features, such as a person’s gaze or posture, remains largely unexplored. To address this, we develop a novel multimodal approach for motion prediction based on the PointNet foundation model architecture, incorporating local LiDAR features. Evaluation on the Waymo Open Dataset shows a performance improvement of 6.20% and 1.58% in minADE and mAP respectively, when integrated and compared with the previous state-of-the-art MTR. We open-source the code of our LiMTR model.
[CV-55] Kaninfradet3D: A Road-side Camera-LiDAR Fusion 3D Perception Model based on Nonlinear Feature Extraction and Intrinsic Correlation
Link: https://arxiv.org/abs/2410.15814
Authors: Pei Liu (1), Nanfang Zheng (2), Yiqun Li (2), Junlan Chen (2), Ziyuan Pu (2) ((1) Intelligent Transportation Thrust, Systems Hub, The Hong Kong University of Science and Technology (Guangzhou), (2) Transportation, Southeast University)
Keywords: AI-assisted driving, numerous methods, emerged for ego-vehicle, development of AI-assisted, methods have emerged
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract:With the development of AI-assisted driving, numerous methods have emerged for ego-vehicle 3D perception tasks, but there has been limited research on roadside perception. With its ability to provide a global view and a broader sensing range, the roadside perspective is worth developing. LiDAR provides precise three-dimensional spatial information, while cameras offer semantic information. These two modalities are complementary in 3D detection. However, adding camera data does not increase accuracy in some studies since the information extraction and fusion procedure is not sufficiently reliable. Recently, Kolmogorov-Arnold Networks (KANs) have been proposed as replacements for MLPs, which are better suited for high-dimensional, complex data. Both the camera and the LiDAR provide high-dimensional information, and employing KANs should enhance the extraction of valuable features to produce better fusion outcomes. This paper proposes Kaninfradet3D, which optimizes the feature extraction and fusion modules. To extract features from complex high-dimensional data, the model’s encoder and fuser modules were improved using KAN Layers. Cross-attention was applied to enhance feature fusion, and visual comparisons verified that camera features were more evenly integrated. This addressed the issue of camera features being abnormally concentrated, negatively impacting fusion. Compared to the benchmark, our approach shows improvements of +9.87 mAP and +10.64 mAP in the two viewpoints of the TUMTraf Intersection Dataset and an improvement of +1.40 mAP in the roadside end of the TUMTraf V2X Cooperative Perception Dataset. The results indicate that Kaninfradet3D can effectively fuse features, demonstrating the potential of applying KANs in roadside perception tasks.
[CV-56] Data-Efficient CLIP-Powered Dual-Branch Networks for Source-Free Unsupervised Domain Adaptation
Link: https://arxiv.org/abs/2410.15811
Authors: Yongguang Li, Yueqi Cao, Jindong Li, Qi Wang, Shengsheng Wang
Keywords: Unsupervised Domain Adaptation, source domain, source domain samples, Domain Adaptation, labeled source domain
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Source-Free Unsupervised Domain Adaptation (SF-UDA) aims to transfer a model’s performance from a labeled source domain to an unlabeled target domain without direct access to source samples, addressing data privacy issues. However, most existing SF-UDA approaches assume the availability of abundant source domain samples, which is often impractical due to the high cost of data annotation. In this paper, we explore a more challenging scenario where direct access to source domain samples is restricted, and the source domain contains only a few samples. To tackle the dual challenges of limited source data and privacy concerns, we introduce a data-efficient, CLIP-powered dual-branch network (CDBN in short). We design a cross-modal dual-branch network that integrates source domain class semantics into the unsupervised fine-tuning of the target domain. It preserves the class information from the source domain while enhancing the model’s generalization to the target domain. Additionally, we propose an unsupervised optimization strategy driven by accurate classification and diversity, which aims to retain the classification capability learned from the source domain while producing more confident and diverse predictions in the target domain. Extensive experiments across 31 transfer tasks on 7 public datasets demonstrate that our approach achieves state-of-the-art performance compared to existing methods.
[CV-57] Assisted Physical Interaction: Autonomous Aerial Robots with Neural Network Detection Navigation and Safety Layers
Link: https://arxiv.org/abs/2410.15802
Authors: Andrea Berra, Viswa Narayanan Sankaranarayanan, Achilleas Santi Seisa, Julien Mellet, Udayanga G.W.K.N. Gamage, Sumeet Gajanan Satpute, Fabio Ruggiero, Vincenzo Lippiello, Silvia Tolu, Matteo Fumagalli, George Nikolakopoulos, Miguel Ángel Trujillo Soto, Guillermo Heredia
Keywords: autonomous aerial physical, aerial physical interaction, industrial settings, paper introduces, autonomous aerial
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments: 8 pages, 14 figures, ICUAS 2024
Abstract:The paper introduces a novel framework for safe and autonomous aerial physical interaction in industrial settings. It comprises two main components: a neural network-based target detection system enhanced with edge computing for reduced onboard computational load, and a control barrier function (CBF)-based controller for safe and precise maneuvering. The target detection system is trained on a dataset under challenging visual conditions and evaluated for accuracy across various unseen data with changing lighting conditions. Depth features are utilized for target pose estimation, with the entire detection framework offloaded into low-latency edge computing. The CBF-based controller enables the UAV to converge safely to the target for precise contact. Simulated evaluations of both the controller and target detection are presented, alongside an analysis of real-world detection performance.
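To make the idea of a CBF-based safety layer concrete, here is a textbook one-dimensional sketch. It assumes a single-integrator model (the commanded velocity directly changes the distance to the target) and the barrier h = distance - d_safe. The paper's actual UAV dynamics and CBF design are not given in the abstract, so all names and values below are illustrative assumptions.

```python
def cbf_filter(u_nominal, distance, d_safe, alpha):
    """Return the closest command to u_nominal that satisfies dh/dt + alpha * h >= 0.

    With h = distance - d_safe and d(distance)/dt = u, the condition
    reduces to u >= -alpha * h.
    """
    h = distance - d_safe
    return max(u_nominal, -alpha * h)

def simulate(d0=2.0, d_safe=0.1, u_nominal=-1.0, alpha=2.0, dt=0.01, steps=500):
    """Euler-integrate the approach while filtering a constant approach command."""
    d = d0
    trajectory = [d]
    for _ in range(steps):
        u = cbf_filter(u_nominal, d, d_safe, alpha)
        d += dt * u
        trajectory.append(d)
    return trajectory

trajectory = simulate()
```

In this sketch the nominal command is passed through unchanged while the vehicle is far away, and is progressively slowed near the target so that the distance converges to d_safe without crossing it.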
[CV-58] Habaek: High-performance water segmentation through dataset expansion and inductive bias optimization
Link: https://arxiv.org/abs/2410.15794
Authors: Hanseon Joo, Eunji Lee, Minjong Cheon
Keywords: water resource management, critical to disaster, disaster response, resource management, Water segmentation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract:Water segmentation is critical to disaster response and water resource management. Authorities may employ high-resolution photography to monitor rivers, lakes, and reservoirs, allowing for more proactive management in agriculture, industry, and conservation. Deep learning has improved flood monitoring by allowing models like CNNs, U-Nets, and transformers to handle large volumes of satellite and aerial data. However, these models usually have significant processing requirements, limiting their usage in real-time applications. This research proposes upgrading the SegFormer model for water segmentation by data augmentation with datasets such as ADE20K and RIWA to boost generalization. We examine how inductive bias affects attention-based models and discover that SegFormer performs better on bigger datasets. To further demonstrate the function of data augmentation, Low-Rank Adaptation (LoRA) is used to lower processing complexity while preserving accuracy. We show that the suggested Habaek model outperforms current models in segmentation, with an Intersection over Union (IoU) ranging from 0.91986 to 0.94397. In terms of F1-score, recall, accuracy, and precision, Habaek performs better than rival models, indicating its potential for real-world applications. This study highlights the need to enhance structures and include datasets for effective water segmentation.
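For reference, the Intersection over Union (IoU) metric reported above can be computed for binary water masks as follows. This is a minimal sketch with toy masks; the mask values are made up for illustration and are not data from the paper.

```python
def iou(pred, target):
    """Intersection over Union for two binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention assumed here: two empty masks count as a perfect match.
    return inter / union if union else 1.0

# Toy 2x3 masks: 1 marks a water pixel.
pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
score = iou(pred, target)  # 2 shared pixels out of 4 in the union
```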
[CV-59] WildOcc: A Benchmark for Off-Road 3D Semantic Occupancy Prediction
Link: https://arxiv.org/abs/2410.15792
Authors: Heng Zhai, Jilin Mei, Chen Min, Liang Chen, Fangzhou Zhao, Yu Hu
Keywords: semantic occupancy prediction, semantic occupancy, occupancy prediction, occupancy prediction tasks, semantic
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract:3D semantic occupancy prediction is an essential part of autonomous driving, focusing on capturing the geometric details of scenes. Off-road environments are rich in geometric information, which makes them well suited to reconstruction through 3D semantic occupancy prediction. However, most research concentrates on on-road environments, and few methods are designed for off-road 3D semantic occupancy prediction due to the lack of relevant datasets and benchmarks. In response to this gap, we introduce WildOcc, to our knowledge the first benchmark to provide dense occupancy annotations for off-road 3D semantic occupancy prediction tasks. A ground truth generation pipeline is proposed in this paper, which employs a coarse-to-fine reconstruction to achieve a more realistic result. Moreover, we introduce a multi-modal 3D semantic occupancy prediction framework, which fuses spatio-temporal information from multi-frame images and point clouds at voxel level. In addition, a cross-modality distillation function is introduced, which transfers geometric knowledge from point clouds to image features.
[CV-60] An Efficient System for Automatic Map Storytelling – A Case Study on Historical Maps
Link: https://arxiv.org/abs/2410.15780
Authors: Ziyi Liu, Claudio Affolter, Sidi Wu, Yizi Chen, Lorenz Hurni
Keywords: provide valuable information, maps provide valuable, provide valuable, valuable information, information and knowledge
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract:Historical maps provide valuable information and knowledge about the past. However, as they often feature non-standard projections, hand-drawn styles, and artistic elements, it is challenging for non-experts to identify and interpret them. While existing image captioning methods have achieved remarkable success on natural images, their performance on maps is suboptimal as maps are underrepresented in their pre-training process. Despite the recent advance of GPT-4 in text recognition and map captioning, it still has a limited understanding of maps, as its performance wanes when texts (e.g., titles and legends) in maps are missing or inaccurate. Besides, it is inefficient or even impractical to fine-tune the model with users’ own datasets. To address these problems, we propose a novel and lightweight map-captioning counterpart. Specifically, we fine-tune the state-of-the-art vision-language model CLIP to generate captions relevant to historical maps and enrich the captions with GPT-3.5 to tell a brief story regarding where, what, when and why of a given map. We propose a novel decision tree architecture to only generate captions relevant to the specified map type. Our system shows invariance to text alterations in maps. The system can be easily adapted and extended to other map types and scaled to a larger map captioning system. The code is open-sourced at this https URL.
[CV-61] Reducing Hallucinations in Vision-Language Models via Latent Space Steering
Link: https://arxiv.org/abs/2410.15778
Authors: Sheng Liu, Haotian Ye, James Zou
Keywords: large language models, large vision-language models, poses a challenge, language models, vision-language models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: 21 pages
Abstract:Hallucination poses a challenge to the deployment of large vision-language models (LVLMs) in applications. Unlike in large language models (LLMs), hallucination in LVLMs often arises from misalignments between visual inputs and textual outputs. This paper investigates the underlying mechanisms of hallucination, focusing on the unique structure of LVLMs that distinguishes them from large language models (LLMs). We identify that hallucinations often arise from the sensitivity of text decoders to vision inputs, a natural phenomenon when image encoders and text decoders are pre-trained separately. Inspired by this, we introduce Visual and Textual Intervention (VTI), a novel technique designed to reduce hallucinations by steering latent space representations during inference to enhance the stability of vision features. As a task-agnostic test-time intervention, VTI can be easily applied to any problem without additional cost. Extensive experiments demonstrate that it can effectively reduce hallucinations and outperform baseline methods across multiple metrics, highlighting the critical role of vision feature stability in LVLMs.
[CV-62] Generalizing Motion Planners with Mixture of Experts for Autonomous Driving
Link: https://arxiv.org/abs/2410.15774
Authors: Qiao Sun, Huimin Wang, Jiahao Zhan, Fan Nie, Xin Wen, Leimeng Xu, Kun Zhan, Peng Jia, Xianpeng Lang, Hang Zhao
Keywords: sparked significant research, Large real-world driving, Large real-world, sparked significant, significant research
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 3 figures
Abstract:Large real-world driving datasets have sparked significant research into various aspects of data-driven motion planners for autonomous driving. These include data augmentation, model architecture, reward design, training strategies, and planner pipelines. These planners promise better generalizations on complicated and few-shot cases than previous methods. However, experiment results show that many of these approaches produce limited generalization abilities in planning performance due to overly complex designs or training paradigms. In this paper, we review and benchmark previous methods focusing on generalizations. The experimental results indicate that as models are appropriately scaled, many design elements become redundant. We introduce StateTransformer-2 (STR2), a scalable, decoder-only motion planner that uses a Vision Transformer (ViT) encoder and a mixture-of-experts (MoE) causal Transformer architecture. The MoE backbone addresses modality collapse and reward balancing by expert routing during training. Extensive experiments on the NuPlan dataset show that our method generalizes better than previous approaches across different test sets and closed-loop simulations. Furthermore, we assess its scalability on billions of real-world urban driving scenarios, demonstrating consistent accuracy improvements as both data and model size grow.
[CV-63] Learning to Synthesize Graphics Programs for Geometric Artworks ICPR2024
Link: https://arxiv.org/abs/2410.15768
Authors: Qi Bing, Chaoyi Zhang, Weidong Cai
Keywords: Creating and understanding, human ability, hallmark of human, Creating, understanding art
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: ICPR 2024
Abstract:Creating and understanding art has long been a hallmark of human ability. When presented with finished digital artwork, professional graphic artists can intuitively deconstruct and replicate it using various drawing tools, such as the line tool, paint bucket, and layer features, including opacity and blending modes. While most recent research in this field has focused on art generation, proposing a range of methods, these often rely on the concept of artwork being represented as a final image. To bridge the gap between pixel-level results and the actual drawing process, we present an approach that treats a set of drawing tools as executable programs. This method predicts a sequence of steps to achieve the final image, allowing for understandable and resolution-independent reproductions under the usage of a set of drawing commands. Our experiments demonstrate that our program synthesizer, Art2Prog, can comprehensively understand complex input images and reproduce them using high-quality executable programs. The experimental results evidence the potential of machines to grasp higher-level information from images and generate compact program-level descriptions.
[CV-64] Improving Instance Optimization in Deformable Image Registration with Gradient Projection
Link: https://arxiv.org/abs/2410.15767
Authors: Yi Zhang, Yidong Zhao, Qian Tao
Keywords: Deformable image registration, Deformable image, image similarity, requiring a delicate, deformation regularity
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: L2R 2024 Challenge Paper
Abstract:Deformable image registration is inherently a multi-objective optimization (MOO) problem, requiring a delicate balance between image similarity and deformation regularity. These conflicting objectives often lead to poor optimization outcomes, such as being trapped in unsatisfactory local minima or experiencing slow convergence. Deep learning methods have recently gained popularity in this domain due to their efficiency in processing large datasets and achieving high accuracy. However, they often underperform during test time compared to traditional optimization techniques, which further explore iterative, instance-specific gradient-based optimization. This performance gap is more pronounced when a distribution shift between training and test data exists. To address this issue, we focus on the instance optimization (IO) paradigm, which involves additional optimization for test-time instances based on a pre-trained model. IO effectively combines the generalization capabilities of deep learning with the fine-tuning advantages of instance-specific optimization. Within this framework, we emphasize the use of gradient projection to mitigate conflicting updates in MOO. This technique projects conflicting gradients into a common space, better aligning the dual objectives and enhancing optimization stability. We validate our method using a state-of-the-art foundation model on the 3D Brain inter-subject registration task (LUMIR) from the Learn2Reg 2024 Challenge. Our results show significant improvements over standard gradient descent, leading to more accurate and reliable registration results.
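One common way to realize gradient projection for two conflicting objectives is the PCGrad-style rule: when the gradients have a negative dot product, remove from each the component along the other before summing. The sketch below assumes that rule. The abstract does not state the exact projection the authors use, and `g_sim` and `g_reg` are hypothetical gradients of the similarity and regularity terms.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_if_conflicting(g, other):
    """Remove from g its component along `other` when the two gradients conflict."""
    d = dot(g, other)
    norm_sq = dot(other, other)
    if d >= 0 or norm_sq == 0:
        return list(g)  # no conflict: leave the gradient unchanged
    return [gi - (d / norm_sq) * oi for gi, oi in zip(g, other)]

def combined_update(g_sim, g_reg):
    """Project each gradient against the other, then sum the results."""
    p_sim = project_if_conflicting(g_sim, g_reg)
    p_reg = project_if_conflicting(g_reg, g_sim)
    return [a + b for a, b in zip(p_sim, p_reg)]

# Toy conflicting gradients (dot product is -1).
g_sim = [1.0, 0.0]
g_reg = [-1.0, 1.0]
update = combined_update(g_sim, g_reg)
```

After projection the combined update no longer points against either objective, which is the stabilizing effect the abstract attributes to the technique.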
[CV-65] How Important are Data Augmentations to Close the Domain Gap for Object Detection in Orbit?
Link: https://arxiv.org/abs/2410.15766
Authors: Maximilian Ulmer, Leonard Klüpfel, Maximilian Durner, Rudolph Triebel
Keywords: spaceborne computer vision, domain gap, computer vision, crucial for autonomous, on-orbit servicing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:We investigate the efficacy of data augmentations to close the domain gap in spaceborne computer vision, crucial for autonomous operations like on-orbit servicing. As the use of computer vision in space increases, challenges such as hostile illumination and low signal-to-noise ratios significantly hinder performance. While learning-based algorithms show promising results, their adoption is limited by the need for extensive annotated training data and the domain gap that arises from differences between synthesized and real-world imagery. This study explores domain generalization in terms of data augmentations – classical color and geometric transformations, corruptions, and noise – to enhance model performance across the domain gap. To this end, we conduct a large-scale experiment using a hyperparameter optimization pipeline that samples hundreds of different configurations and searches for the best set to bridge the domain gap. As a reference task, we use 2D object detection and evaluate on the SPEED+ dataset that contains real hardware-in-the-loop satellite images in its test set. Moreover, we evaluate four popular object detectors, including Mask R-CNN, Faster R-CNN, YOLO-v7, and the open set detector GroundingDINO, and highlight their trade-offs between performance, inference speed, and training time. Our results underscore the vital role of data augmentations in bridging the domain gap, improving model performance, robustness, and reliability for critical space applications. As a result, we propose two novel data augmentations specifically developed to emulate the visual effects observed in orbital imagery. We conclude by recommending the most effective augmentations for advancing computer vision in challenging orbital environments. Code for training detectors and hyperparameter search will be made publicly available.
[CV-66] DeepIcon: A Hierarchical Network for Layer-wise Icon Vectorization
Link: https://arxiv.org/abs/2410.15760
Authors: Qi Bing, Chaoyi Zhang, Weidong Cai
Keywords: technique of rasterization, well-established technique, poses a significant, field of computer, Scalable Vector Graphics
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted as Oral Presentation at DICTA 2024
Abstract:In contrast to the well-established technique of rasterization, vectorization of images poses a significant challenge in the field of computer graphics. Recent learning-based methods for converting raster images to vector formats frequently suffer from incomplete shapes, redundant path prediction, and a lack of accuracy in preserving the semantics of the original content. These shortcomings severely hinder the utility of these methods for further editing and manipulation of images. To address these challenges, we present DeepIcon, a novel hierarchical image vectorization network specifically tailored for generating variable-length icon vector graphics based on the raster image input. Our experimental results indicate that DeepIcon can efficiently produce Scalable Vector Graphics (SVGs) directly from raster images, bypassing the need for a differentiable rasterizer while also demonstrating a profound understanding of the image contents.
[CV-67] Unleashing the Potential of Vision-Language Pre-Training for 3D Zero-Shot Lesion Segmentation via Mask-Attribute Alignment
Link: https://arxiv.org/abs/2410.15744
Authors: Yankai Jiang, Wenhui Lei, Xiaofan Zhang, Shaoting Zhang
Keywords: medical vision-language pre-training, vision-language pre-training models, driven significant progress, Recent advancements, zero-shot disease recognition
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract:Recent advancements in medical vision-language pre-training models have driven significant progress in zero-shot disease recognition. However, transferring image-level knowledge to pixel-level tasks, such as lesion segmentation in 3D CT scans, remains a critical challenge. Due to the complexity and variability of pathological visual characteristics, existing methods struggle to align fine-grained lesion features not encountered during training with disease-related textual representations. In this paper, we present Malenia, a novel multi-scale lesion-level mask-attribute alignment framework, specifically designed for 3D zero-shot lesion segmentation. Malenia improves the compatibility between mask representations and their associated elemental attributes, explicitly linking the visual features of unseen lesions with the extensible knowledge learned from previously seen ones. Furthermore, we design a Cross-Modal Knowledge Injection module to enhance both visual and textual features with mutually beneficial information, effectively guiding the generation of segmentation results. Comprehensive experiments across three datasets and 12 lesion categories validate the superior performance of Malenia. Codes will be publicly available.
[CV-68] ViMoE: An Empirical Study of Designing Vision Mixture-of-Experts
Link: https://arxiv.org/abs/2410.15732
Authors: Xumeng Han, Longhui Wei, Zhiyang Dou, Zipeng Wang, Chenhui Qiang, Xin He, Yingfei Sun, Zhenjun Han, Qi Tian
Keywords: demonstrating excellent scalability, increasing model capacity, demonstrating excellent, multiple domains, promising approach
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Mixture-of-Experts (MoE) models embody the divide-and-conquer concept and are a promising approach for increasing model capacity, demonstrating excellent scalability across multiple domains. In this paper, we integrate the MoE structure into the classic Vision Transformer (ViT), naming it ViMoE, and explore the potential of applying MoE to vision through a comprehensive study on image classification. However, we observe that the performance is sensitive to the configuration of MoE layers, making it challenging to obtain optimal results without careful design. The underlying cause is that inappropriate MoE layers lead to unreliable routing and hinder experts from effectively acquiring helpful knowledge. To address this, we introduce a shared expert to learn and capture common information, serving as an effective way to construct stable ViMoE. Furthermore, we demonstrate how to analyze expert routing behavior, revealing which MoE layers are capable of specializing in handling specific information and which are not. This provides guidance for retaining the critical layers while removing redundancies, thereby advancing ViMoE to be more efficient without sacrificing accuracy. We aspire for this work to offer new insights into the design of vision MoE models and provide valuable empirical guidance for future research.
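The role of a shared expert alongside routed experts can be sketched as follows. This is an illustrative toy layer, assuming top-1 routing and linear experts. The actual ViMoE routing rule, expert count, and expert architecture are not specified in the abstract, and every name and weight below is hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, x):
    """Matrix-vector product with weights given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def moe_with_shared_expert(x, router_w, expert_ws, shared_w):
    """Top-1 routed expert output, scaled by its gate, plus an always-on shared expert."""
    gates = softmax(linear(router_w, x))
    top = max(range(len(gates)), key=gates.__getitem__)
    routed = linear(expert_ws[top], x)
    shared = linear(shared_w, x)  # sees every token regardless of routing
    out = [s + gates[top] * r for s, r in zip(shared, routed)]
    return out, top

# Toy 2-d token, two routed experts, one shared expert.
x = [1.0, 0.0]
router_w = [[2.0, 0.0], [0.0, 2.0]]
expert_ws = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]
shared_w = [[0.5, 0.0], [0.0, 0.5]]
out, chosen = moe_with_shared_expert(x, router_w, expert_ws, shared_w)
```

Because the shared expert processes every input, common information has a stable path even when routing is unreliable. That is the motivation the abstract gives for adding it.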
[CV-69] Object-Centric Temporal Consistency via Conditional Autoregressive Inductive Biases
Link: https://arxiv.org/abs/2410.15728
Authors: Cristian Meo, Akihiro Nakano, Mircea Lică, Aniket Didolkar, Masahiro Suzuki, Anirudh Goyal, Mengmi Zhang, Justin Dauwels, Yutaka Matsuo, Yoshua Bengio
Keywords: Unsupervised object-centric learning, Unsupervised object-centric, learning compositional representations, learning compositional, promising approach
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract:Unsupervised object-centric learning from videos is a promising approach towards learning compositional representations that can be applied to various downstream tasks, such as prediction and reasoning. Recently, it was shown that pretrained Vision Transformers (ViTs) can be useful to learn object-centric representations on real-world video datasets. However, while these approaches succeed at extracting objects from the scenes, the slot-based representations fail to maintain temporal consistency across consecutive frames in a video, i.e. the mapping of objects to slots changes across the video. To address this, we introduce Conditional Autoregressive Slot Attention (CA-SA), a framework that enhances the temporal consistency of extracted object-centric representations in video-centric vision tasks. Leveraging an autoregressive prior network to condition representations on previous timesteps and a novel consistency loss function, CA-SA predicts future slot representations and imposes consistency across frames. We present qualitative and quantitative results showing that our proposed method outperforms the considered baselines on downstream tasks, such as video prediction and visual question-answering tasks.
[CV-70] Students Rather Than Experts: A New AI For Education Pipeline To Model More Human-Like And Personalised Early Adolescences
Link: https://arxiv.org/abs/2410.15701
Authors: Yiping Ma, Shiyu Hu, Xuchen Li, Yipei Wang, Shiqing Liu, Kang Hao Cheong
Keywords: virtual student agents, large language models, virtual student, student agents, providing new opportunities
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:The capabilities of large language models (LLMs) have been applied in expert systems across various domains, providing new opportunities for AI in Education. Educational interactions involve a cyclical exchange between teachers and students. Current research predominantly focuses on using LLMs to simulate teachers, leveraging their expertise to enhance student learning outcomes. However, the simulation of students, which could improve teachers’ instructional skills, has received insufficient attention due to the challenges of modeling and evaluating virtual students. This research asks: Can LLMs be utilized to develop virtual student agents that mimic human-like behavior and individual variability? Unlike expert systems focusing on knowledge delivery, virtual students must replicate learning difficulties, emotional responses, and linguistic uncertainties. These traits present significant challenges in both modeling and evaluation. To address these issues, this study focuses on language learning as a context for modeling virtual student agents. We propose a novel AI4Education framework, called SOE (Scene-Object-Evaluation), to systematically construct LVSA (LLM-based Virtual Student Agents). By curating a dataset of personalized teacher-student interactions with various personality traits, question types, and learning stages, and fine-tuning LLMs using LoRA, we conduct multi-dimensional evaluation experiments. Specifically, we: (1) develop a theoretical framework for generating LVSA; (2) integrate human subjective evaluation metrics into GPT-4 assessments, demonstrating a strong correlation between human evaluators and GPT-4 in judging LVSA authenticity; and (3) validate that LLMs can generate human-like, personalized virtual student agents in educational contexts, laying a foundation for future applications in pre-service teacher training and multi-agent simulation environments.
[CV-71] PALMS: Plane-based Accessible Indoor Localization Using Mobile Smartphones
Link: https://arxiv.org/abs/2410.15694
Authors: Yunqian Cheng, Roberto Manduchi
Keywords: innovative indoor global, floor plans, mobile smartphones, smartphones that utilizes, utilizes publicly
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 3 figures, accepted to the 14th International Conference on Indoor Positioning and Indoor Navigation (IPIN) 2024, Best Presentation Award
Abstract:In this paper, we present PALMS, an innovative indoor global localization and relocalization system for mobile smartphones that utilizes publicly available floor plans. Unlike most vision-based methods that require constant visual input, our system adopts a dynamic form of localization that considers a single instantaneous observation and odometry data. The core contribution of this work is the introduction of a particle filter initialization method that leverages the Certainly Empty Space (CES) constraint along with principal orientation matching. This approach creates a spatial probability distribution of the device’s location, significantly improving localization accuracy and reducing particle filter convergence time. Our experimental evaluations demonstrate that PALMS outperforms traditional methods with uniformly initialized particle filters, providing a more efficient and accessible approach to indoor wayfinding. By eliminating the need for prior environmental fingerprinting, PALMS provides a scalable and practical approach to indoor navigation.
[CV-72] Enhancing SNN-based Spatio-Temporal Learning: A Benchmark Dataset and Cross-Modality Attention Model
Link: https://arxiv.org/abs/2410.15689
Authors: Shibo Zhou, Bo Yang, Mengwen Yuan, Runhao Jiang, Rui Yan, Gang Pan, Huajin Tang
Keywords: Spiking Neural Networks, Artificial Neural Networks, low power consumption, Neural Networks, Spiking Neural
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Abstract:Spiking Neural Networks (SNNs), renowned for their low power consumption, brain-inspired architecture, and spatio-temporal representation capabilities, have garnered considerable attention in recent years. Similar to Artificial Neural Networks (ANNs), high-quality benchmark datasets are of great importance to the advances of SNNs. However, our analysis indicates that many prevalent neuromorphic datasets lack strong temporal correlation, preventing SNNs from fully exploiting their spatio-temporal representation capabilities. Meanwhile, the integration of event and frame modalities offers more comprehensive visual spatio-temporal information. Yet, the SNN-based cross-modality fusion remains underexplored. In this work, we present a neuromorphic dataset called DVS-SLR that can better exploit the inherent spatio-temporal properties of SNNs. Compared to existing datasets, it offers advantages in terms of higher temporal correlation, larger scale, and more varied scenarios. In addition, our neuromorphic dataset contains corresponding frame data, which can be used for developing SNN-based fusion methods. By virtue of the dual-modal feature of the dataset, we propose a Cross-Modality Attention (CMA) based fusion method. The CMA model efficiently utilizes the unique advantages of each modality, allowing for SNNs to learn both temporal and spatial attention scores from the spatio-temporal features of event and frame modalities, subsequently allocating these scores across modalities to enhance their synergy. Experimental results demonstrate that our method not only improves recognition accuracy but also ensures robustness across diverse scenarios.
[CV-73] RANSAC Back to SOTA: A Two-stage Consensus Filtering for Real-time 3D Registration
Link: https://arxiv.org/abs/2410.15682
Authors: Pengcheng Shi, Shaocheng Yan, Yilin Xiao, Xinyi Liu, Yongjun Zhang, Jiayuan Li
Keywords: Correspondence-based point cloud, Correspondence-based point, point cloud registration, plays a key, computer vision
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 8 pages, 8 figures
Abstract:Correspondence-based point cloud registration (PCR) plays a key role in robotics and computer vision. However, challenges like sensor noise, object occlusions, and descriptor limitations inevitably result in numerous outliers. The RANSAC family is the most popular outlier removal solution. However, the required number of iterations grows exponentially with the outlier ratio, rendering it far inferior to existing methods (SC2PCR [1], MAC [2], etc.) in terms of accuracy or speed. Thus, we propose a two-stage consensus filtering (TCF) that elevates RANSAC to state-of-the-art (SOTA) speed and accuracy. First, one-point RANSAC obtains a consensus set based on length consistency. Subsequently, two-point RANSAC refines the set via angle consistency. Then, three-point RANSAC computes a coarse pose and removes outliers based on the distances between transformed correspondences. Drawing on optimizations from one-point and two-point RANSAC, three-point RANSAC requires only a few iterations. Finally, iteratively reweighted least squares (IRLS) is applied to yield the optimal pose. Experiments on the large-scale KITTI and ETH datasets demonstrate our method achieves up to three-orders-of-magnitude speedup compared to MAC while maintaining registration accuracy and recall. Our code is available at this https URL.
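The length-consistency test behind the one-point stage can be sketched as follows. A rigid transform preserves distances, so if correspondences i and j are both inliers, the distance between their source points should match the distance between their target points. The code is a simplified illustration: it tries every correspondence as the seed instead of sampling randomly, and the tolerance and toy points are assumptions rather than values from the paper.

```python
import math

def length_consistent_set(src, dst, seed, tol):
    """Indices j whose distance to the seed point is preserved between src and dst."""
    keep = []
    for j in range(len(src)):
        d_src = math.dist(src[seed], src[j])
        d_dst = math.dist(dst[seed], dst[j])
        if abs(d_src - d_dst) <= tol:
            keep.append(j)
    return keep

def one_point_consensus(src, dst, tol=0.05):
    """Try every correspondence as the seed and keep the largest consistent set."""
    best = []
    for seed in range(len(src)):
        cand = length_consistent_set(src, dst, seed, tol)
        if len(cand) > len(best):
            best = cand
    return best

# Four inliers related by a pure translation of (1, 2, 3), plus one outlier (index 4).
src = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (0, 0, 3), (5, 5, 5)]
dst = [(1, 2, 3), (2, 2, 3), (1, 4, 3), (1, 2, 6), (9, 0, 1)]
consensus = one_point_consensus(src, dst)
```

Only a single correspondence is needed per hypothesis, which is what keeps the number of iterations low at this stage. The angle-consistency and three-point stages described above would then refine this set.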
[CV-74] TALoS: Enhancing Semantic Scene Completion via Test-time Adaptation on the Line of Sight NEURIPS2024
Link: https://arxiv.org/abs/2410.15674
Authors: Hyun-Kurl Jang, Jihun Kim, Hyeokjun Kweon, Kuk-Jin Yoon
Keywords: perform geometric completion, semantic segmentation simultaneously, Semantic Scene Completion, aims to perform, segmentation simultaneously
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at NeurIPS 2024. Code is available at this https URL
Abstract:Semantic Scene Completion (SSC) aims to perform geometric completion and semantic segmentation simultaneously. Despite the promising results achieved by existing studies, the inherently ill-posed nature of the task presents significant challenges in diverse driving scenarios. This paper introduces TALoS, a novel test-time adaptation approach for SSC that exploits the information available in driving environments. Specifically, we focus on the fact that observations made at one moment can serve as Ground Truth (GT) for scene completion at another moment. Given the characteristics of the LiDAR sensor, an observation of an object at a certain location confirms both 1) the occupation of that location and 2) the absence of obstacles along the line of sight from the LiDAR to that point. TALoS utilizes these observations to obtain self-supervision about occupancy and emptiness, guiding the model to adapt to the scene at test time. In a similar manner, we aggregate reliable SSC predictions among multiple moments and leverage them as semantic pseudo-GT for adaptation. Further, to leverage future observations that are not accessible at the current time, we present a dual optimization scheme that uses a model whose update is delayed until the future observation is available. Evaluations on the SemanticKITTI validation and test sets demonstrate that TALoS significantly improves the performance of the pre-trained SSC model. Our code is available at this https URL.
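The line-of-sight observation described above (a LiDAR return marks its own voxel as occupied and the voxels between the sensor and the return as empty) can be sketched as follows. The sketch samples the ray at fixed steps, which is an approximation and not an exact voxel traversal. The grid layout, voxel size, and function names are illustrative assumptions, not the TALoS implementation.

```python
import math

def voxel_of(point, voxel_size):
    """Integer voxel index containing a 3D point."""
    return tuple(math.floor(c / voxel_size) for c in point)

def line_of_sight_labels(sensor, hit, voxel_size=1.0, step=0.25):
    """Return (empty_voxels, occupied_voxel) for one LiDAR return.

    `step` is the sampling interval along the ray, in units of voxel_size.
    """
    occupied = voxel_of(hit, voxel_size)
    length = math.dist(sensor, hit)
    empty = set()
    n = int(length / (step * voxel_size))
    for i in range(n + 1):
        t = (i * step * voxel_size) / length if length else 0.0
        p = tuple(s + t * (h - s) for s, h in zip(sensor, hit))
        v = voxel_of(p, voxel_size)
        if v != occupied:
            empty.add(v)  # the ray passed through this voxel unobstructed
    return empty, occupied

# A ray along the x axis from the sensor to a return 4 m away.
empty, occupied = line_of_sight_labels((0.5, 0.5, 0.5), (4.5, 0.5, 0.5))
```

Labels of this kind could then supervise occupancy predictions made at a different moment, in the spirit of the self-supervision the abstract describes.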
[CV-75] Calibration of ordinal regression networks
Link: https://arxiv.org/abs/2410.15658
Authors: Daehwan Kim, Haejun Chung, Ikbeom Jang
Keywords: deep neural networks, Recent studies, produce over-confident predictions, studies have shown, shown that deep
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Abstract:Recent studies have shown that deep neural networks are not well-calibrated and produce over-confident predictions. The miscalibration issue primarily stems from the minimization of cross-entropy, which aims to align predicted softmax probabilities with one-hot labels. In ordinal regression tasks, this problem is compounded by an additional challenge: the expectation that softmax probabilities should exhibit unimodal distribution is not met with cross-entropy. Rather, the ordinal regression literature has focused on unimodality and overlooked calibration. To address these issues, we propose a novel loss function that introduces order-aware calibration, ensuring that prediction confidence adheres to ordinal relationships between classes. It incorporates soft ordinal encoding and label-smoothing-based regularization to enforce both calibration and unimodality. Extensive experiments across three popular ordinal regression benchmarks demonstrate that our approach achieves state-of-the-art calibration without compromising accuracy.
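To make "soft ordinal encoding" concrete, here is one common formulation, given only as a sketch: the target is a softmax over negative absolute rank distance, so probability mass peaks at the true rank and decays on both sides (unimodal by construction). Whether the paper uses exactly this form is an assumption; the function name and the `temperature` parameter are hypothetical.

```python
import math

def soft_ordinal_label(true_rank, num_classes, temperature=1.0):
    """Soft target: softmax of -|k - true_rank| / temperature over ranks k."""
    logits = [-abs(k - true_rank) / temperature for k in range(num_classes)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

target = soft_ordinal_label(true_rank=2, num_classes=5)
print([round(p, 3) for p in target])
```

A cross-entropy loss against such a target penalizes predictions far from the true rank more than near misses, unlike a one-hot target.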
[CV-76] CL-HOI: Cross-Level Human-Object Interaction Distillation from Vision Large Language Models
Link: https://arxiv.org/abs/2410.15657
Authors: Jianjun Gao, Chen Cai, Ruoyu Wang, Wenyang Liu, Kim-Hui Yap, Kratika Garg, Boon-Siew Han
Keywords: Vision Language Models, Large Language Models, Vision Large Language, Language Models, instance-level HOI detection
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Abstract:Human-object interaction (HOI) detection has seen advancements with Vision Language Models (VLMs), but these methods often depend on extensive manual annotations. Vision Large Language Models (VLLMs) can inherently recognize and reason about interactions at the image level but are computationally heavy and not designed for instance-level HOI detection. To overcome these limitations, we propose a Cross-Level HOI distillation (CL-HOI) framework, which distills instance-level HOIs from VLLMs' image-level understanding without the need for manual annotations. Our approach involves two stages: context distillation, where a Visual Linguistic Translator (VLT) converts visual information into linguistic form, and interaction distillation, where an Interaction Cognition Network (ICN) reasons about spatial, visual, and context relations. We design contrastive distillation losses to transfer image-level context and interaction knowledge from the teacher to the student model, enabling instance-level HOI detection. Evaluations on HICO-DET and V-COCO datasets demonstrate that our CL-HOI surpasses existing weakly supervised methods and VLLM-supervised methods, showing its efficacy in detecting HOIs without manual labels.
[CV-77] Resource-Efficient Medical Report Generation using Large Language Models
Link: https://arxiv.org/abs/2410.15642
Authors: Abdullah, Ameer Hamza, Seong Tae Kim
Keywords: chest X-ray images, X-ray images, chest X-ray, automatically writing radiology, automatically writing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract:Medical report generation is the task of automatically writing radiology reports for chest X-ray images. Manually composing these reports is a time-consuming process that is also prone to human errors. Generating medical reports can therefore help reduce the burden on radiologists. In other words, we can promote greater clinical automation in the medical domain. In this work, we propose a new framework leveraging vision-enabled Large Language Models (LLMs) for the task of medical report generation. We introduce a lightweight solution that achieves better or comparable performance compared to previous solutions on the task of medical report generation. We conduct extensive experiments exploring different model sizes and enhancement approaches, such as prefix tuning to improve the text generation abilities of the LLMs. We evaluate our approach on a prominent large-scale radiology report dataset - MIMIC-CXR. Our results demonstrate the capability of our resource-efficient framework to generate patient-specific reports with strong medical contextual understanding and high precision.
[CV-78] LucidFusion: Generating 3D Gaussians with Arbitrary Unposed Images
Link: https://arxiv.org/abs/2410.15636
Authors: Hao He, Yixun Liang, Luozhou Wang, Yuanhao Cai, Xinli Xu, Hao-Xiang Guo, Xiang Wen, Yingcong Chen
Keywords: Recent large reconstruction, made notable progress, large reconstruction models, Recent large, generating high-quality
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 12 figures, project page: coming soon
Abstract:Recent large reconstruction models have made notable progress in generating high-quality 3D objects from single images. However, these methods often struggle with controllability, as they lack information from multiple views, leading to incomplete or inconsistent 3D reconstructions. To address this limitation, we introduce LucidFusion, a flexible end-to-end feed-forward framework that leverages the Relative Coordinate Map (RCM). Unlike traditional methods linking images to the 3D world through pose, LucidFusion utilizes RCM to align geometric features coherently across different views, making it highly adaptable for 3D generation from arbitrary, unposed images. Furthermore, LucidFusion seamlessly integrates with the original single-image-to-3D pipeline, producing detailed 3D Gaussians at a resolution of 512 × 512, making it well-suited for a wide range of applications.
[CV-79] Fully Explicit Dynamic Gaussian Splatting NEURIPS2024
Link: https://arxiv.org/abs/2410.15629
Authors: Junoh Lee, Chang-Yeon Won, Hyunjun Jung, Inhwan Bae, Hae-Gon Jeon
Keywords: high-quality rendering results, Gaussian Splatting, leveraging dense, dynamic Gaussians, dynamic
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted at NeurIPS 2024
Abstract:3D Gaussian Splatting has shown fast and high-quality rendering results in static scenes by leveraging dense 3D prior and explicit representations. Unfortunately, the benefits of the prior and representation do not involve novel view synthesis for dynamic motions. Ironically, this is because the main barrier is the reliance on them, which requires increasing training and rendering times to account for dynamic motions. In this paper, we design an Explicit 4D Gaussian Splatting (Ex4DGS). Our key idea is to firstly separate static and dynamic Gaussians during training, and to explicitly sample positions and rotations of the dynamic Gaussians at sparse timestamps. The sampled positions and rotations are then interpolated to represent both spatially and temporally continuous motions of objects in dynamic scenes as well as reducing computational cost. Additionally, we introduce a progressive training scheme and a point-backtracking technique that improves Ex4DGS’s convergence. We initially train Ex4DGS using short timestamps and progressively extend timestamps, which makes it work well with a few point clouds. The point-backtracking is used to quantify the cumulative error of each Gaussian over time, enabling the detection and removal of erroneous Gaussians in dynamic scenes. Comprehensive experiments on various scenes demonstrate the state-of-the-art rendering quality from our method, achieving fast rendering of 62 fps on a single 2080Ti GPU.
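The idea of storing a dynamic Gaussian's state only at sparse timestamps and interpolating in between can be sketched as below. The abstract does not say which interpolant is used, so plain linear interpolation of positions is an assumption here (rotations would need a rotation-aware scheme such as spherical interpolation, omitted for brevity). All names and data are hypothetical.

```python
from bisect import bisect_right

def interpolate_position(key_times, key_positions, t):
    """Linearly interpolate a 3D position at time t from sparse keyframes.
    key_times must be sorted; t outside the range clamps to the end keyframes."""
    if t <= key_times[0]:
        return key_positions[0]
    if t >= key_times[-1]:
        return key_positions[-1]
    hi = bisect_right(key_times, t)   # first keyframe strictly after t
    lo = hi - 1
    w = (t - key_times[lo]) / (key_times[hi] - key_times[lo])
    return tuple((1 - w) * a + w * b
                 for a, b in zip(key_positions[lo], key_positions[hi]))

# Three keyframes for one Gaussian centre (toy values).
times = [0.0, 10.0, 20.0]
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
pos = interpolate_position(times, positions, 15.0)
print(pos)
```

Only the keyframe values need to be stored and optimized, which is where the memory and compute savings of sparse sampling would come from.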
[CV-80] Erasing Undesirable Concepts in Diffusion Models with Adversarial Preservation
Link: https://arxiv.org/abs/2410.15618
Authors: Anh Bui, Long Vuong, Khanh Doan, Trung Le, Paul Montague, Tamas Abraham, Dinh Phung
Keywords: unfiltered internet data, generating visually striking, inadvertently produce undesirable, visually striking content, Diffusion models excel
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Abstract:Diffusion models excel at generating visually striking content from text but can inadvertently produce undesirable or harmful content when trained on unfiltered internet data. A practical solution is to selectively remove target concepts from the model, but this may impact the remaining concepts. Prior approaches have tried to balance this by introducing a loss term to preserve neutral content or a regularization term to minimize changes in the model parameters, yet resolving this trade-off remains challenging. In this work, we propose to identify and preserve the concepts most affected by parameter changes, termed adversarial concepts. This approach ensures stable erasure with minimal impact on the other concepts. We demonstrate the effectiveness of our method using the Stable Diffusion model, showing that it outperforms state-of-the-art erasure methods in eliminating unwanted content while maintaining the integrity of other unrelated elements. Our code is available at this https URL.
[CV-81] Joint Top-Down and Bottom-Up Frameworks for 3D Visual Grounding ICPR2024
Link: https://arxiv.org/abs/2410.15615
Authors: Yang Liu, Daizong Liu, Wei Hu
Keywords: point cloud scene, cloud scene based, visual grounding-locating, point cloud, text descriptions
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICPR2024
Abstract:This paper tackles the challenging task of 3D visual grounding, i.e., locating a specific object in a 3D point cloud scene based on text descriptions. Existing methods fall into two categories: top-down and bottom-up methods. Top-down methods rely on a pre-trained 3D detector to generate and select the best bounding box, resulting in time-consuming processes. Bottom-up methods directly regress object bounding boxes with coarse-grained features, producing worse results. To combine their strengths while addressing their limitations, we propose a joint top-down and bottom-up framework, aiming to enhance the performance while improving the efficiency. Specifically, in the first stage, we propose a bottom-up based proposal generation module, which utilizes lightweight neural layers to efficiently regress and cluster several coarse object proposals instead of using a complex 3D detector. Then, in the second stage, we introduce a top-down based proposal consolidation module, which utilizes graph design to effectively aggregate and propagate the query-related object contexts among the generated proposals for further refinement. By jointly training these two modules, we can avoid the inherent drawbacks of the complex proposals in the top-down framework and the coarse proposals in the bottom-up framework. Experimental results on the ScanRefer benchmark show that our framework is able to achieve the state-of-the-art performance.
[CV-82] Exploring Stronger Transformer Representation Learning for Occluded Person Re-Identification
Link: https://arxiv.org/abs/2410.15613
Authors: Zhangjian Ji, Donglin Cheng, Kai Feng
Keywords: diverse camera perspectives, person re-identification remains, transformer-based person re-identification, person re-identification, person re-identification framework
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Due to some complex factors (e.g., occlusion, pose variation and diverse camera perspectives), extracting stronger feature representation in person re-identification remains a challenging task. In this paper, we proposed a novel self-supervision and supervision combining transformer-based person re-identification framework, namely SSSC-TransReID. Different from the general transformer-based person re-identification models, we designed a self-supervised contrastive learning branch, which can enhance the feature representation for person re-identification without negative samples or additional pre-training. In order to train the contrastive learning branch, we also proposed a novel random rectangle mask strategy to simulate the occlusion in real scenes, so as to enhance the feature representation for occlusion. Finally, we utilized the joint-training loss function to integrate the advantages of supervised learning with ID tags and self-supervised contrastive learning without negative samples, which can reinforce the ability of our model to excavate stronger discriminative features, especially for occlusion. Extensive experimental results on several benchmark datasets show our proposed model obtains superior Re-ID performance consistently and outperforms the state-of-the-art ReID methods by large margins on the mean average precision (mAP) and Rank-1 accuracy.
[CV-83] Deep Active Learning with Manifold-preserving Trajectory Sampling
Link: https://arxiv.org/abs/2410.15605
Authors: Yingrui Ji, Vijaya Sindhoori Kaza, Nishanth Artham, Tianyang Wang
Keywords: minimizing labeling effort, enhance model performance, Active learning, labeling effort, unlabeled data
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract:Active learning (AL) optimizes the selection of unlabeled data for annotation (labeling), aiming to enhance model performance while minimizing labeling effort. The key question in AL is which unlabeled data should be selected for annotation. Existing deep AL methods arguably suffer from bias incurred by labeled data, which takes a much lower percentage than unlabeled data in the AL context. We observe that such an issue is severe in different types of data, such as vision and non-vision data. To address this issue, we propose a novel method, namely Manifold-Preserving Trajectory Sampling (MPTS), aiming to enforce the feature space learned from labeled data to represent a more accurate manifold. By doing so, we expect to effectively correct the bias incurred by labeled data, which can cause a biased selection of unlabeled data. Despite its focus on manifold, the proposed method can be conveniently implemented by performing distribution mapping with MMD (Maximum Mean Discrepancies). Extensive experiments on various vision and non-vision benchmark datasets demonstrate the superiority of our method. Our source code can be found here.
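For readers unfamiliar with MMD, the quantity mentioned above can be computed as in the following sketch. It shows the standard biased estimator of squared MMD with an RBF kernel on raw feature vectors; how MPTS applies it to model trajectories is not described in the abstract and is not reproduced. The kernel bandwidth `gamma` and the toy feature sets are assumptions.

```python
import math

def rbf(x, y, gamma=1.0):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of squared Maximum Mean Discrepancy between two samples."""
    def mean_kernel(a, b):
        return sum(rbf(p, q, gamma) for p in a for q in b) / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)

# Toy feature sets (hypothetical): "near" overlaps the labeled features, "far" does not.
labeled = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3)]
near = [(0.1, 0.0), (0.2, 0.2), (0.0, 0.3)]
far = [(3.0, 3.0), (3.2, 2.9), (2.8, 3.1)]
print(mmd2(labeled, near), mmd2(labeled, far))
```

A small value indicates that two feature distributions are close; minimizing it between labeled and unlabeled features is one way to realize "distribution mapping".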
[CV-84] P-YOLOv8: Efficient and Accurate Real-Time Detection of Distracted Driving
Link: https://arxiv.org/abs/2410.15602
Authors: Mohamed R. Elshamy, Heba M. Emara, Mohamed R. Shoaib, Abdel-Hameed A. Badawy
Keywords: Distracted driving, critical safety issue, injuries worldwide, critical safety, leads to numerous
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract:Distracted driving is a critical safety issue that leads to numerous fatalities and injuries worldwide. This study addresses the urgent need for efficient and real-time machine learning models to detect distracted driving behaviors. Leveraging the Pretrained YOLOv8 (P-YOLOv8) model, a real-time object detection system is introduced, optimized for both speed and accuracy. This approach addresses the computational constraints and latency limitations commonly associated with conventional detection models. The study demonstrates P-YOLOv8's versatility in both object detection and image classification tasks using the Distracted Driver Detection dataset from State Farm, which includes 22,424 images across ten behavior categories. Our research explores the application of P-YOLOv8 for image classification, evaluating its performance compared to deep learning models such as VGG16, VGG19, and ResNet. Some traditional models often struggle with low accuracy, while others achieve high accuracy but come with high computational costs and slow detection speeds, making them unsuitable for real-time applications. P-YOLOv8 addresses these issues by achieving competitive accuracy with significant computational cost and efficiency advantages. In particular, P-YOLOv8 generates a lightweight model with a size of only 2.84 MB and a lower number of parameters, totaling 1,451,098, due to its innovative architecture. It achieves a high accuracy of 99.46 percent with this small model size, opening new directions for deployment on inexpensive and small embedded devices using Tiny Machine Learning (TinyML). The experimental results show robust performance, making P-YOLOv8 a cost-effective solution for real-time deployment. This study provides a detailed analysis of P-YOLOv8’s architecture, training, and performance benchmarks, highlighting its potential for real-time use in detecting distracted driving.
[CV-85] Deep Learning and Machine Learning – Object Detection and Semantic Segmentation: From Theory to Applications
Link: https://arxiv.org/abs/2410.15584
Authors: Jintao Ren, Ziqian Bi, Qian Niu, Junyu Liu, Benji Peng, Sen Zhang, Xuanhe Pan, Jinlang Wang, Keyu Chen, Caitlyn Heqi Yin, Pohsun Feng, Yizhu Wen, Tianyang Wang, Silin Chen, Ming Li, Jiawei Xu, Ming Liu
Keywords: combining theoretical foundations, semantic segmentation, combining theoretical, practical applications, offers an in-depth
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 167 pages
Abstract:This book offers an in-depth exploration of object detection and semantic segmentation, combining theoretical foundations with practical applications. It covers state-of-the-art advancements in machine learning and deep learning, with a focus on convolutional neural networks (CNNs), YOLO architectures, and transformer-based approaches like DETR. The book also delves into the integration of artificial intelligence (AI) techniques and large language models for enhanced object detection in complex environments. A thorough discussion of big data analysis is presented, highlighting the importance of data processing, model optimization, and performance evaluation metrics. By bridging the gap between traditional methods and modern deep learning frameworks, this book serves as a comprehensive guide for researchers, data scientists, and engineers aiming to leverage AI-driven methodologies in large-scale object detection tasks.
[CV-86] ARTS: Semi-Analytical Regressor using Disentangled Skeletal Representations for Human Mesh Recovery from Videos ACM-MM2024
Link: https://arxiv.org/abs/2410.15582
Authors: Tao Tang, Hong Liu, Yingxuan You, Ti Wang, Wenhao Li
Keywords: made significant progress, simultaneously estimating human, image features limits, low-resolution image features, image features
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACM MM 2024. Project page: this https URL
Abstract:Although existing video-based 3D human mesh recovery methods have made significant progress, simultaneously estimating human pose and shape from low-resolution image features limits their performance. These image features lack sufficient spatial information about the human body and contain various noises (e.g., background, lighting, and clothing), which often results in inaccurate pose and inconsistent motion. Inspired by the rapid advance in human pose estimation, we discover that compared to image features, skeletons inherently contain accurate human pose and motion. Therefore, we propose a novel semiAnalytical Regressor using disenTangled Skeletal representations for human mesh recovery from videos, called ARTS. Specifically, a skeleton estimation and disentanglement module is proposed to estimate the 3D skeletons from a video and decouple them into disentangled skeletal representations (i.e., joint position, bone length, and human motion). Then, to fully utilize these representations, we introduce a semi-analytical regressor to estimate the parameters of the human mesh model. The regressor consists of three modules: Temporal Inverse Kinematics (TIK), Bone-guided Shape Fitting (BSF), and Motion-Centric Refinement (MCR). TIK utilizes joint position to estimate initial pose parameters and BSF leverages bone length to regress bone-aligned shape parameters. Finally, MCR combines human motion representation with image features to refine the initial human model parameters. Extensive experiments demonstrate that our ARTS surpasses existing state-of-the-art video-based methods in both per-frame accuracy and temporal consistency on popular benchmarks: 3DPW, MPI-INF-3DHP, and Human3.6M. Code is available at this https URL.
[CV-87] Multimodal Learning for Embryo Viability Prediction in Clinical IVF MICCAI2024
Link: https://arxiv.org/abs/2410.15581
Authors: Junsik Kim, Zhiyi Shi, Davin Jeong, Johannes Knittel, Helen Y. Yang, Yonghyun Song, Wanhua Li, Yicong Li, Dalit Ben-Yosef, Daniel Needleman, Hanspeter Pfister
Keywords: clinical In-Vitro Fertilization, In-Vitro Fertilization, successful pregnancy, transfer is important, important to increasing
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to MICCAI 2024
Abstract:In clinical In-Vitro Fertilization (IVF), identifying the most viable embryo for transfer is important to increasing the likelihood of a successful pregnancy. Traditionally, this process involves embryologists manually assessing embryos’ static morphological features at specific intervals using light microscopy. This manual evaluation is not only time-intensive and costly, due to the need for expert analysis, but also inherently subjective, leading to variability in the selection process. To address these challenges, we develop a multimodal model that leverages both time-lapse video data and Electronic Health Records (EHRs) to predict embryo viability. One of the primary challenges of our research is to effectively combine time-lapse video and EHR data, owing to their inherent differences in modality. We comprehensively analyze our multimodal model with various modality inputs and integration approaches. Our approach will enable fast and automated embryo viability predictions in scale for clinical IVF.
[CV-88] Online Pseudo-Label Unified Object Detection for Multiple Datasets Training
Link: https://arxiv.org/abs/2410.15569
Authors: XiaoJun Tang, Jingru Wang, Zeyu Shangguan, Darun Tang, Yuyu Liu
Keywords: Unified Object Detection, object detection scenarios, comprehensive object detection, Object Detection, Pseudo-Label Unified Object
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:The Unified Object Detection (UOD) task aims to achieve object detection of all merged categories through training on multiple datasets, and is of great significance in comprehensive object detection scenarios. In this paper, we conduct a thorough analysis of the cross-dataset missing-annotation issue, and propose an Online Pseudo-Label Unified Object Detection scheme. Our method uses a periodically updated teacher model to generate pseudo-labels for the unlabelled objects in each sub-dataset. This periodic update strategy better ensures that the accuracy of the teacher model reaches a local maximum and maximizes the quality of the pseudo-labels. In addition, we survey the influence of overlapped region proposals on the accuracy of box regression. We propose a category-specific box regression and a pseudo-label RPN head to improve the recall rate of the Region Proposal Network (RPN). Our experimental results on commonly used benchmarks (e.g., COCO, Objects365 and OpenImages) indicate that our online pseudo-label UOD method achieves higher accuracy than existing SOTA methods.
[CV-89] A Dual Process VLA: Efficient Robotic Manipulation Leveraging VLM
Link: https://arxiv.org/abs/2410.15549
Authors: ByungOk Han, Jaehong Kim, Jinhyeok Jang
Keywords: receiving increasing attention, integrating visual context, Dual Process VLA, linguistic commands, receiving increasing
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages
Abstract:Vision-Language-Action (VLA) models are receiving increasing attention for their ability to enable robots to perform complex tasks by integrating visual context with linguistic commands. However, achieving efficient real-time performance remains challenging due to the high computational demands of existing models. To overcome this, we propose Dual Process VLA (DP-VLA), a hierarchical framework inspired by dual-process theory. DP-VLA utilizes a Large System 2 Model (L-Sys2) for complex reasoning and decision-making, while a Small System 1 Model (S-Sys1) handles real-time motor control and sensory processing. By leveraging Vision-Language Models (VLMs), the L-Sys2 operates at low frequencies, reducing computational overhead, while the S-Sys1 ensures fast and accurate task execution. Experimental results on the RoboCasa dataset demonstrate that DP-VLA achieves faster inference and higher task success rates, providing a scalable solution for advanced robotic applications.
[CV-90] TrackMe: A Simple and Effective Multiple Object Tracking Annotation Tool
Link: https://arxiv.org/abs/2410.15518
Authors: Thinh Phan, Isaac Phillips, Andrew Lockett, Michael T. Kidd, Ngan Le
Keywords: animal behavior understanding, understanding and monitoring, key topics, topics that attract, attract a lot
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Object tracking, especially animal tracking, is one of the key topics that attract a lot of attention due to its benefits for animal behavior understanding and monitoring. Recent state-of-the-art tracking methods are founded on deep learning architectures for object detection, appearance feature extraction and track association. Despite the good tracking performance, these methods are trained and evaluated on common objects such as humans and cars. To perform well on animals, there is a need to create large datasets of different types in multiple conditions. The dataset construction comprises data collection and data annotation. In this work, we put more focus on the latter task. Particularly, we renovate the well-known tool, LabelMe, so as to assist common users with or without in-depth knowledge of computer science to annotate the data with less effort. The new tool, named TrackMe, inherits the simplicity, high compatibility with varied systems, minimal hardware requirement and convenient feature utilization from its predecessor. TrackMe is an upgraded version with essential features for multiple object tracking annotation.
[CV-91] Exploring Curriculum Learning for Vision-Language Tasks: A Study on Small-Scale Multimodal Training CONLL
Link: https://arxiv.org/abs/2410.15509
Authors: Rohan Saha, Abrar Fahim, Alona Fyshe, Alex Murphy
Keywords: train large machine, specialized domains, learning, large machine learning, train large
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: CoNLL BabyLM Challenge 2024 camera ready
Abstract:For specialized domains, there is often not a wealth of data with which to train large machine learning models. In such limited data / compute settings, various methods exist aiming to do more with less, such as finetuning from a pretrained model, modulating difficulty levels as data are presented to a model (curriculum learning), and considering the role of model type / size. Approaches to efficient machine learning also take inspiration from human learning by considering use cases where machine learning systems have access to approximately the same number of words experienced by a 13 year old child (100M words). We investigate the role of 3 primary variables in a limited data regime as part of the multimodal track of the BabyLM challenge. We contrast: (i) curriculum learning, (ii) pretraining (with text-only data), (iii) model type. We modulate these variables and assess them on two types of tasks: (a) multimodal (text+image), and (b) unimodal (text-only) tasks. We find that curriculum learning benefits multimodal evaluations over non-curriculum learning models, particularly when combined with text-only pretraining. On text-only tasks, curriculum learning appears to help models with smaller trainable parameter counts. We suggest possible reasons based on architectural differences and training designs as to why one might observe such results.
[CV-92] Taming Mambas for Voxel Level 3D Medical Image Segmentation
Link: https://arxiv.org/abs/2410.15496
Authors: Luca Lumetti, Vittorio Pipoli, Kevin Marchesini, Elisa Ficarra, Costantino Grana, Federico Bolelli
Keywords: Convolutional Neural Networks, Transformer-based architectures, employing Convolutional Neural, Recurrent Neural Network, Neural Networks
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Recently, the field of 3D medical segmentation has been dominated by deep learning models employing Convolutional Neural Networks (CNNs) and Transformer-based architectures, each with their distinctive strengths and limitations. CNNs are constrained by a local receptive field, whereas transformers are hindered by their substantial memory requirements as well as their data hungriness, making them not ideal for processing 3D medical volumes at a fine-grained level. For these reasons, fully convolutional neural networks, such as nnUNet, still dominate the scene when segmenting medical structures in large 3D medical volumes. Despite numerous advancements towards developing transformer variants with subquadratic time and memory complexity, these models still fall short in content-based reasoning. A recent breakthrough is Mamba, a Recurrent Neural Network (RNN) based on State Space Models (SSMs) outperforming Transformers in many long-context tasks (million-length sequences) on famous natural language processing and genomic benchmarks while keeping a linear complexity.
[CV-93] Event-based Sensor Fusion and Application on Odometry: A Survey
Link: https://arxiv.org/abs/2410.15480
Authors: Jiaqiang Zhang, Xianjia Yu, Ha Sier, Haizhou Zhang, Tomi Westerlund
Keywords: wide dynamic range, offering notable advantages, low lighting, inspired by biological, offering notable
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IPAS2025: this https URL
Abstract:Event cameras, inspired by biological vision, are asynchronous sensors that detect changes in brightness, offering notable advantages in environments characterized by high-speed motion, low lighting, or wide dynamic range. These distinctive properties render event cameras particularly effective for sensor fusion in robotics and computer vision, especially in enhancing traditional visual or LiDAR-inertial odometry. Conventional frame-based cameras suffer from limitations such as motion blur and drift, which can be mitigated by the continuous, low-latency data provided by event cameras. Similarly, LiDAR-based odometry encounters challenges related to the loss of geometric information in environments such as corridors. To address these limitations, unlike the existing event camera-related surveys, this paper presents a comprehensive overview of recent advancements in event-based sensor fusion for odometry applications particularly, investigating fusion strategies that incorporate frame-based cameras, inertial measurement units (IMUs), and LiDAR. The survey critically assesses the contributions of these fusion methods to improving odometry performance in complex environments, while highlighting key applications, and discussing the strengths, limitations, and unresolved challenges. Additionally, it offers insights into potential future research directions to advance event-based sensor fusion for next-generation odometry applications.
[CV-94] Generalized Multimodal Fusion via Poisson-Nernst-Planck Equation NEURIPS2024
Link: https://arxiv.org/abs/2410.15475
Authors: Jiayu Xiong, Jing Wang, Hengjing Xiang, Jun Xue, Chen Xu, Zhouqiang Jiang
Keywords: highlighted significant advancements, Previous studies, studies have highlighted, highlighted significant, significant advancements
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2024 Rejected paper, 28 pages
Abstract:Previous studies have highlighted significant advancements in multimodal fusion. Nevertheless, such methods often encounter challenges regarding the efficacy of feature extraction, data integrity, consistency of feature dimensions, and adaptability across various downstream tasks. This paper proposes a generalized multimodal fusion method (GMF) via the Poisson-Nernst-Planck (PNP) equation, which adeptly addresses the aforementioned issues. Theoretically, the optimization objective for traditional multimodal tasks is formulated and redefined by integrating information entropy and the flow of gradient backward step. Leveraging these theoretical insights, the PNP equation is applied to feature fusion, rethinking multimodal features through the framework of charged particles in physics and controlling their movement through dissociation, concentration, and reconstruction. Building on these theoretical foundations, GMF disassociated features which extracted by the unimodal feature extractor into modality-specific and modality-invariant subspaces, thereby reducing mutual information and subsequently lowering the entropy of downstream tasks. The identifiability of the feature’s origin enables our approach to function independently as a frontend, seamlessly integrated with a simple concatenation backend, or serve as a prerequisite for other modules. Experimental results on multiple downstream tasks show that the proposed GMF achieves performance close to the state-of-the-art (SOTA) accuracy while utilizing fewer parameters and computational resources. Furthermore, by integrating GMF with advanced fusion methods, we surpass the SOTA results.
[CV-95] Multi-Layer Feature Fusion with Cross-Channel Attention-Based U-Net for Kidney Tumor Segmentation
Link: https://arxiv.org/abs/2410.15472
Authors: Fnu Neha, Arvind K. Bansal
Keywords: show significant heterogeneity, renal cell carcinoma, cell carcinoma, show significant, significant heterogeneity
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 8 pages
Abstract:Renal tumors, especially renal cell carcinoma (RCC), show significant heterogeneity, posing challenges for diagnosis using radiology images such as MRI, echocardiograms, and CT scans. U-Net based deep learning techniques are emerging as a promising approach for automated medical image segmentation for minimally invasive diagnosis of renal tumors. However, current techniques need further improvements in accuracy to become clinically useful to radiologists. In this study, we present an improved U-Net based model for end-to-end automated semantic segmentation of CT scan images to identify renal tumors. The model uses residual connections across convolution layers, integrates a multi-layer feature fusion (MFF) and cross-channel attention (CCA) within encoder blocks, and incorporates skip connections augmented with additional information derived using MFF and CCA. We evaluated our model on the KiTS19 dataset, which contains data from 210 patients. For kidney segmentation, our model achieves a Dice Similarity Coefficient (DSC) of 0.97 and a Jaccard index (JI) of 0.95. For renal tumor segmentation, our model achieves a DSC of 0.96 and a JI of 0.91. Based on a comparison of available DSC scores, our model outperforms the current leading models.
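For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how a Dice Similarity Coefficient and a Jaccard index can be computed for a pair of binary masks. The function name `dice_and_jaccard` and the toy masks are illustrative assumptions, not taken from the paper, and the paper's own evaluation protocol on KiTS19 may aggregate scores differently.

```python
from typing import Sequence, Tuple


def dice_and_jaccard(pred: Sequence[int], target: Sequence[int]) -> Tuple[float, float]:
    """Return (DSC, JI) for two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_size = sum(1 for p in pred if p)
    target_size = sum(1 for t in target if t)
    union = pred_size + target_size - intersection
    if union == 0:
        # Both masks are empty: treat as perfect agreement (a convention assumed here).
        return 1.0, 1.0
    dsc = 2.0 * intersection / (pred_size + target_size)
    ji = intersection / union
    return dsc, ji


# Toy example: a hypothetical 6-pixel prediction and ground truth.
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
dsc, ji = dice_and_jaccard(pred, target)
print(f"DSC={dsc:.3f}, JI={ji:.3f}")
```

For a single pair of masks the two scores are tied by JI = DSC / (2 - DSC), so either one determines the other; averages over many cases need not satisfy this identity exactly.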
[CV-96] EVA: An Embodied World Model for Future Video Anticipation
Link: https://arxiv.org/abs/2410.15461
Authors: Xiaowei Chi, Hengyuan Zhang, Chun-Kai Fan, Xingqun Qi, Rongyu Zhang, Anthony Chen, Chi-min Chan, Wei Xue, Wenhan Luo, Shanghang Zhang, Yike Guo
Keywords: simulate comprehensive interactions, displaying crucial roles, integrate raw data, world model, video prediction
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Robotics (cs.RO)
Abstract: World models integrate raw data from various modalities, such as images and language, to simulate comprehensive interactions in the world, thereby playing crucial roles in fields like mixed reality and robotics. Yet, applying the world model for accurate video prediction is quite challenging due to the complex and dynamic intentions of the various scenes in practice. In this paper, inspired by the human rethinking process, we decompose the complex video prediction into four meta-tasks that enable the world model to handle this issue in a more fine-grained manner. Alongside these tasks, we introduce a new benchmark named Embodied Video Anticipation Benchmark (EVA-Bench) to provide a well-rounded evaluation. EVA-Bench focuses on evaluating the video prediction ability of human and robot actions, presenting significant challenges for both the language model and the generation model. Targeting embodied video prediction, we propose the Embodied Video Anticipator (EVA), a unified framework aiming at video understanding and generation. EVA integrates a video generation model with a visual language model, effectively combining reasoning capabilities with high-quality generation. Moreover, to enhance the generalization of our framework, we tailor-designed a multi-stage pretraining paradigm that adaptively ensembles LoRA to produce high-fidelity results. Extensive experiments on EVA-Bench highlight the potential of EVA to significantly improve performance in embodied scenes, paving the way for large-scale pre-trained models in real-world prediction tasks.
[CV-97] Allegro: Open the Black Box of Commercial-Level Video Generation Model
Link: https://arxiv.org/abs/2410.15458
Authors: Yuan Zhou, Qiuyue Wang, Yuxuan Cai, Huan Yang
Keywords: Significant advancements, open-source community contributing, community contributing, contributing a wealth
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract: Significant advancements have been made in the field of video generation, with the open-source community contributing a wealth of research papers and tools for training high-quality models. However, despite these efforts, the available information and resources remain insufficient for achieving commercial-level performance. In this report, we open the black box and introduce Allegro, an advanced video generation model that excels in both quality and temporal consistency. We also highlight the current limitations in the field and present a comprehensive methodology for training high-performance, commercial-level video generation models, addressing key aspects such as data, model architecture, training pipeline, and evaluation. Our user study shows that Allegro surpasses existing open-source models and most commercial models, ranking just behind Hailuo and Kling. Code: this https URL, Model: this https URL, Gallery: this https URL.
[CV-98] CROPE: Evaluating In-Context Adaptation of Vision and Language Models to Culture-Specific Concepts
Link: https://arxiv.org/abs/2410.15453
Authors: Malvina Nikandrou, Georgios Pantazopoulos, Nikolas Vitsakis, Ioannis Konstas, Alessandro Suglia
Keywords: Vision and Language, demonstrate cultural knowledge, Language models, Vision, Language
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Abstract:As Vision and Language models (VLMs) become accessible across the globe, it is important that they demonstrate cultural knowledge. In this paper, we introduce CROPE, a visual question answering benchmark designed to probe the knowledge of culture-specific concepts and evaluate the capacity for cultural adaptation through contextual information. This allows us to distinguish between parametric knowledge acquired during training and contextual knowledge provided during inference via visual and textual descriptions. Our evaluation of several state-of-the-art open VLMs shows large performance disparities between culture-specific and common concepts in the parametric setting. Moreover, experiments with contextual knowledge indicate that models struggle to effectively utilize multimodal information and bind culture-specific concepts to their depictions. Our findings reveal limitations in the cultural understanding and adaptability of current VLMs that need to be addressed toward more culturally inclusive models.
[CV-99] Concept Complement Bottleneck Model for Interpretable Medical Image Diagnosis
Link: https://arxiv.org/abs/2410.15446
Authors: Hongmei Wang, Junlin Hou, Hao Chen
Keywords: trustworthy artificial intelligence, received extensive attention, concepts, received extensive, trustworthy artificial
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures, submitted to IEEE TRANSACTIONS ON MEDICAL IMAGING
Abstract: Models based on human-understandable concepts have received extensive attention as a way to improve model interpretability for trustworthy artificial intelligence in the field of medical image analysis. These methods can provide convincing explanations for model decisions but rely heavily on the detailed annotation of pre-defined concepts. Consequently, they may not be effective in cases where concepts or annotations are incomplete or low-quality. Although some methods automatically discover effective new visual concepts rather than using pre-defined concepts, or find human-understandable concepts via large language models, they are prone to veering away from medical diagnostic evidence and are challenging to understand. In this paper, we propose a concept complement bottleneck model for interpretable medical image diagnosis with the aim of complementing the existing concept set and finding new concepts bridging the gap between explainable models. Specifically, we propose to use concept adapters for specific concepts to mine the concept differences and score concepts in their own attention channels to support almost fair concept learning. Then, we devise a concept complement strategy to learn new concepts while jointly using known concepts to improve model performance. Comprehensive experiments on medical datasets demonstrate that our model outperforms the state-of-the-art competitors in concept detection and disease diagnosis tasks while providing diverse explanations to ensure model interpretability effectively.
[CV-100] MDFI-Net: Multiscale Differential Feature Interaction Network for Accurate Retinal Vessel Segmentation
Link: https://arxiv.org/abs/2410.15444
Authors: Yiwang Dong, Xiangyu Deng
Keywords: achieved suboptimal outcomes, medical image segmentation, image segmentation tasks, segmentation tasks due, vessel segmentation achieved
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract: The accurate segmentation of retinal vessels in fundus images is a great challenge in medical image segmentation tasks due to their highly complex structure from other this http URL, deep-learning based methods for retinal vessel segmentation achieved suboptimal outcomes, since vessels with indistinct features are prone to being overlooked in deeper layers of the network. Additionally, the abundance of redundant information in the background poses significant interference to feature extraction, thus increasing the segmentation difficulty. To address this issue, this paper proposes a feature-enhanced interaction network based on DPCN, named MDFI-Net. We design a feature enhancement structure, the Deformable-convolutional Pulse Coupling Network (DPCN), to provide an enhanced feature iteration sequence to the segmentation network in a simple and efficient manner. Subsequently, these features will interact within the segmentation this http URL experiments were conducted on publicly available retinal vessel segmentation datasets to validate the effectiveness of our network structure. Experimental results of our algorithm show that the detection accuracy of the retinal blood vessel achieves 97.91%, 97.97% and 98.16% across all datasets. Finally, plentiful experimental results also prove that the proposed MDFI-Net achieves segmentation performance superior to state-of-the-art methods on public datasets.
[CV-101] MedDiff-FM: A Diffusion-based Foundation Model for Versatile Medical Image Applications
Link: https://arxiv.org/abs/2410.15432
Authors: Yongrui Yu, Yannian Gu, Shaoting Zhang, Xiaofan Zhang
Keywords: achieved significant success, diffusion foundation model, diffusion foundation, foundation model, medical image domains
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Diffusion models have achieved significant success in both the natural image and medical image domains, encompassing a wide range of applications. Previous investigations in medical images have often been constrained to specific anatomical regions, particular applications, and limited datasets, resulting in isolated diffusion models. This paper introduces a diffusion-based foundation model to address a diverse range of medical image tasks, namely MedDiff-FM. MedDiff-FM leverages 3D CT images from multiple publicly available datasets, covering anatomical regions from head to abdomen, to pre-train a diffusion foundation model, and explores the capabilities of the diffusion foundation model across a variety of application scenarios. The diffusion foundation model handles multi-level image processing both at the image-level and patch-level, and utilizes position embedding to establish multi-level spatial relationships as well as anatomical structures and region classes to control certain anatomical regions. MedDiff-FM manages several downstream tasks seamlessly, including image denoising, anomaly detection, and image synthesis. MedDiff-FM is also capable of performing lesion generation and lesion inpainting by rapidly fine-tuning the diffusion foundation model using ControlNet with task-specific conditions. Experimental results demonstrate the effectiveness of MedDiff-FM in addressing diverse downstream medical image tasks.
[CV-102] BoostAdapter: Improving Test-Time Adaptation via Regional Bootstrapping NEURIPS2024
Link: https://arxiv.org/abs/2410.15430
Authors: Taolin Zhang, Jinpeng Wang, Hang Guo, Tao Dai, Bin Chen, Shu-Tao Xia
Keywords: pretrained vision-language models, raised great interest, recent researches, pretrained vision-language, vision-language models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2024
Abstract: Adaptation of pretrained vision-language models such as CLIP to various downstream tasks has raised great interest in recent research. Previous works have proposed a variety of test-time adaptation (TTA) methods to achieve strong generalization without any knowledge of the target domain. However, existing training-required TTA approaches like TPT necessitate entropy minimization that involves large computational overhead, while training-free methods like TDA overlook the potential for information mining from the test samples themselves. In this paper, we break down the design of existing popular training-required and training-free TTA methods and bridge the gap between them within our framework. Specifically, we maintain a light-weight key-value memory for feature retrieval from instance-agnostic historical samples and instance-aware boosting samples. The historical samples are filtered from the testing data stream and serve to extract useful information from the target distribution, while the boosting samples are drawn from regional bootstrapping and capture the knowledge of the test sample itself. We theoretically justify the rationality behind our method and empirically verify its effectiveness on both the out-of-distribution and the cross-domain datasets, showcasing its applicability in real-world situations.
[CV-103] Accelerated Sub-Image Search For Variable-Size Patches Identification Based On Virtual Time Series Transformation And Segmentation
Link: https://arxiv.org/abs/2410.15425
Authors: Mogens Plessen
Keywords: fields requiring spot, requiring spot spraying, fixed-size objects, small-scale reference image, hay bales
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 9 figures, 3 tables
Abstract: This paper addresses two tasks: (i) fixed-size objects such as hay bales are to be identified in an aerial image for a given reference image of the object, and (ii) variable-size patches such as areas on fields requiring spot spraying or other handling are to be identified in an image for a given small-scale reference image. Both tasks are related. The second differs in that identified sub-images similar to the reference image are further clustered before patch contours are determined by solving a traveling salesman problem. Both tasks are complex in that the exact number of similar sub-images is not known a priori. The main contribution of this paper is the presentation of an acceleration mechanism for sub-image search that is based on a transformation of an image to multivariate time series along the RGB-channels and subsequent segmentation to reduce the 2D search space in the image. Two variations of the acceleration mechanism are compared to exhaustive search on diverse synthetic and real-world images. Quantitatively, the proposed method results in solve time reductions of up to 2 orders of magnitude, while qualitatively delivering comparable visual results. The proposed method is neural network-free and does not use any image pre-processing.
[CV-104] MMCS: A Multimodal Medical Diagnosis System Integrating Image Analysis and Knowledge-based Departmental Consultation
Link: https://arxiv.org/abs/2410.15403
Authors: Yi Ren, HanZhi Zhang, Weibin Li, Diandong Liu, Tianyi Zhang, Jie He
Keywords: present MMCS, medical images, medical, facial paralysis, facial
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract: We present MMCS, a system capable of recognizing medical images and patient facial details, and providing professional medical diagnoses. The system consists of two core components. The first component is the analysis of medical images and videos. We trained a specialized multimodal medical model capable of interpreting medical images and accurately analyzing patients’ facial emotions and facial paralysis conditions. The model achieved an accuracy of 72.59% on the FER2013 facial emotion recognition dataset, with a 91.1% accuracy in recognizing the happy emotion. In facial paralysis recognition, the model reached an accuracy of 92%, which is 30% higher than that of GPT-4o. Based on this model, we developed a parser for analyzing facial movement videos of patients with facial paralysis, achieving precise grading of the paralysis severity. In tests on 30 videos of facial paralysis patients, the system demonstrated a grading accuracy of 83.3%. The second component is the generation of professional medical responses. We employed a large language model, integrated with a medical knowledge base, to generate professional diagnoses based on the analysis of medical images or videos. The core innovation lies in our development of a department-specific knowledge base routing management mechanism, in which the large language model categorizes data by medical departments and, during the retrieval process, determines the appropriate knowledge base to query. This significantly improves retrieval accuracy in the RAG (retrieval-augmented generation) process. This mechanism led to an average increase of 4 percentage points in accuracy for various large language models on the MedQA this http URL code is open-sourced and available at: this https URL.
[CV-105] IPO: Interpretable Prompt Optimization for Vision-Language Models NEURIPS2024
Link: https://arxiv.org/abs/2410.15397
Authors: Yingjun Du, Wenfang Sun, Cees G. M. Snoek
Keywords: CLIP have remarkably, Pre-trained vision-language models, Pre-trained vision-language, prompts, downstream tasks
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2024
Abstract: Pre-trained vision-language models like CLIP have remarkably adapted to various downstream tasks. Nonetheless, their performance heavily depends on the specificity of the input text prompts, which requires skillful prompt template engineering. Instead, current approaches to prompt optimization learn the prompts through gradient descent, where the prompts are treated as adjustable parameters. However, these methods tend to lead to overfitting of the base classes seen during training and produce prompts that are no longer understandable by humans. This paper introduces a simple but interpretable prompt optimizer (IPO) that utilizes large language models (LLMs) to generate textual prompts dynamically. We introduce a Prompt Optimization Prompt that not only guides LLMs in creating effective prompts but also stores past prompts with their performance metrics, providing rich in-context information. Additionally, we incorporate a large multimodal model (LMM) to condition on visual content by generating image descriptions, which enhance the interaction between textual and visual modalities. This allows for the creation of dataset-specific prompts that improve generalization performance, while maintaining human comprehension. Extensive testing across 11 datasets reveals that IPO not only improves the accuracy of existing gradient-descent-based prompt learning methods but also considerably enhances the interpretability of the generated prompts. By leveraging the strengths of LLMs, our approach ensures that the prompts remain human-understandable, thereby facilitating better transparency and oversight for vision-language models.
[CV-106] EF-3DGS: Event-Aided Free-Trajectory 3D Gaussian Splatting
Link: https://arxiv.org/abs/2410.15392
Authors: Bohao Liao, Wei Zhai, Zengyu Wan, Tianzhu Zhang, Yang Cao, Zheng-Jun Zha
Keywords: Event Generation Model, wide applications, Linear Event Generation, Event Generation, Event
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL
Abstract:Scene reconstruction from casually captured videos has wide applications in real-world scenarios. With recent advancements in differentiable rendering techniques, several methods have attempted to simultaneously optimize scene representations (NeRF or 3DGS) and camera poses. Despite recent progress, existing methods relying on traditional camera input tend to fail in high-speed (or equivalently low-frame-rate) scenarios. Event cameras, inspired by biological vision, record pixel-wise intensity changes asynchronously with high temporal resolution, providing valuable scene and motion information in blind inter-frame intervals. In this paper, we introduce the event camera to aid scene construction from a casually captured video for the first time, and propose Event-Aided Free-Trajectory 3DGS, called EF-3DGS, which seamlessly integrates the advantages of event cameras into 3DGS through three key components. First, we leverage the Event Generation Model (EGM) to fuse events and frames, supervising the rendered views observed by the event stream. Second, we adopt the Contrast Maximization (CMax) framework in a piece-wise manner to extract motion information by maximizing the contrast of the Image of Warped Events (IWE), thereby calibrating the estimated poses. Besides, based on the Linear Event Generation Model (LEGM), the brightness information encoded in the IWE is also utilized to constrain the 3DGS in the gradient domain. Third, to mitigate the absence of color information of events, we introduce photometric bundle adjustment (PBA) to ensure view consistency across events and this http URL evaluate our method on the public Tanks and Temples benchmark and a newly collected real-world dataset, RealEv-DAVIS. Our project page is this https URL.
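The Contrast Maximization idea mentioned above can be illustrated with a toy one-dimensional sketch: events are warped back along a candidate motion, accumulated into an Image of Warped Events, and the candidate that yields the sharpest (highest-variance) image is kept. Everything below is an assumption made for illustration (the 1D setting, the constant-velocity motion model, the grid search, and all function names); it is not the piece-wise formulation or the pose calibration used in EF-3DGS.

```python
from typing import List, Sequence, Tuple

Event = Tuple[float, float]  # (x position in pixels, timestamp in seconds)


def image_of_warped_events(events: Sequence[Event], velocity: float, width: int) -> List[int]:
    """Warp each event back to t = 0 along a candidate velocity and count hits per pixel."""
    image = [0] * width
    for x, t in events:
        warped = int(round(x - velocity * t))
        if 0 <= warped < width:  # drop events that leave the sensor after warping
            image[warped] += 1
    return image


def contrast(image: Sequence[int]) -> float:
    """Variance of the event-count image; sharper images have higher variance."""
    mean = sum(image) / len(image)
    return sum((c - mean) ** 2 for c in image) / len(image)


def best_velocity(events: Sequence[Event], candidates: Sequence[float], width: int) -> float:
    """Grid search over candidate velocities for the maximum-contrast warp."""
    return max(candidates, key=lambda v: contrast(image_of_warped_events(events, v, width)))


# Synthetic events from a single edge that starts at x = 3 and moves at 2 px/s.
events = [(3.0 + 2.0 * t, t) for t in (0.0, 0.5, 1.0, 1.5, 2.0)]
estimated = best_velocity(events, [0.0, 1.0, 2.0, 3.0], width=10)
print("estimated velocity:", estimated)
```

With the correct velocity all five synthetic events collapse onto one pixel, which is why the variance peaks there; wrong candidates smear the events across neighbouring pixels.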
[CV-107] Layout-your-3D: Controllable and Precise 3D Generation with 2D Blueprint
Link: https://arxiv.org/abs/2410.15391
Authors: Junwei Zhou, Xueting Li, Lu Qi, Ming-Hsuan Yang
Keywords: tedious optimization processes, Abstract, require tedious optimization, generation, plausible object interactions
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 17 figures
Abstract: We present Layout-Your-3D, a framework that allows controllable and compositional 3D generation from text prompts. Existing text-to-3D methods often struggle to generate assets with plausible object interactions or require tedious optimization processes. To address these challenges, our approach leverages 2D layouts as a blueprint to facilitate precise and plausible control over 3D generation. Starting with a 2D layout provided by a user or generated from a text description, we first create a coarse 3D scene using a carefully designed initialization process based on efficient reconstruction models. To enforce coherent global 3D layouts and enhance the quality of instance appearances, we propose a collision-aware layout optimization process followed by instance-wise refinement. Experimental results demonstrate that Layout-Your-3D yields more reasonable and visually appealing compositional 3D assets while significantly reducing the time required for each prompt. Additionally, Layout-Your-3D can be easily applied to downstream tasks, such as 3D editing and object insertion. Our project page is available at: this https URL
[CV-108] LoRA-IR: Taming Low-Rank Experts for Efficient All-in-One Image Restoration
Link: https://arxiv.org/abs/2410.15385
Authors: Yuang Ai, Huaibo Huang, Ran He
Keywords: incorporating degradation-specific information, prompt modules, achieved remarkable performance, achieved remarkable, incorporating degradation-specific
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Prompt-based all-in-one image restoration (IR) frameworks have achieved remarkable performance by incorporating degradation-specific information into prompt modules. Nevertheless, handling the complex and diverse degradations encountered in real-world scenarios remains a significant challenge. To address this challenge, we propose LoRA-IR, a flexible framework that dynamically leverages compact low-rank experts to facilitate efficient all-in-one image restoration. Specifically, LoRA-IR consists of two training stages: degradation-guided pre-training and parameter-efficient fine-tuning. In the pre-training stage, we enhance the pre-trained CLIP model by introducing a simple mechanism that scales it to higher resolutions, allowing us to extract robust degradation representations that adaptively guide the IR network. In the fine-tuning stage, we refine the pre-trained IR network using low-rank adaptation (LoRA). Built upon a Mixture-of-Experts (MoE) architecture, LoRA-IR dynamically integrates multiple low-rank restoration experts through a degradation-guided router. This dynamic integration mechanism significantly enhances our model’s adaptability to diverse and unknown degradations in complex real-world scenarios. Extensive experiments demonstrate that LoRA-IR achieves state-of-the-art performance across 14 image restoration tasks and 29 benchmarks. Code and pre-trained models will be available at: this https URL.
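As a rough illustration of how several low-rank experts can be blended by a router, the sketch below forms an effective weight W' = W + sum_k g_k * (B_k A_k), where the gates g_k are a softmax over router scores. This is a generic mixture-of-LoRA sketch under stated assumptions: the helper names, the tiny matrices, and the use of raw scores as router input are all hypothetical, and the paper's degradation-guided router and CLIP-based guidance are not modelled here.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x r) and b (r x n)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax used to turn router scores into gates."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def merged_weight(base: Matrix, experts: Sequence[Tuple[Matrix, Matrix]],
                  router_scores: Sequence[float]) -> Matrix:
    """Return base + sum_k gate_k * (B_k @ A_k) for low-rank experts (B_k, A_k)."""
    gates = softmax(router_scores)
    merged = [row[:] for row in base]  # copy so the frozen base weight stays untouched
    for gate, (b, a) in zip(gates, experts):
        delta = matmul(b, a)  # rank-r update with the same shape as the base weight
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += gate * value
    return merged


# Two hypothetical rank-1 experts on a 2x2 base weight.
base = [[1.0, 0.0], [0.0, 1.0]]
expert_a = ([[1.0], [0.0]], [[0.0, 1.0]])  # contributes to entry (0, 1)
expert_b = ([[0.0], [1.0]], [[1.0, 0.0]])  # contributes to entry (1, 0)
weight = merged_weight(base, [expert_a, expert_b], router_scores=[0.0, 0.0])
print(weight)
```

Equal router scores give each expert a gate of 0.5; skewing the scores toward one expert shifts the merged weight toward that expert's update while the base weight itself is never modified.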
[CV-109] Neural Active Structure-from-Motion in Dark and Textureless Environment
Link: https://arxiv.org/abs/2410.15378
Authors: Kazuto Ichimaru, Diego Thomas, Takafumi Iwaguchi, Hiroshi Kawasaki
Keywords: low light illumination, light illumination, structured light, low light, equivalent surfaces
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in Asian Conference on Computer Vision 2024
Abstract: Active 3D measurement, especially structured light (SL), has been widely used in various fields for its robustness against textureless surfaces, or surfaces made effectively textureless by low-light illumination. In addition, reconstruction of large scenes by moving the SL system has become popular; however, there have been few practical techniques to obtain the system’s precise pose information only from images, since most conventional techniques are based on image features, which cannot be retrieved in textureless environments. In this paper, we propose a simultaneous shape reconstruction and pose estimation technique for SL systems from an image set where patterns sparsely projected onto the scene are observed (i.e. no scene texture information), which we call Active SfM. To achieve this, we propose a full optimization framework of the volumetric shape that employs neural signed distance fields (Neural-SDF) for SL with the goal of not only reconstructing the scene shape but also estimating the poses for each motion of the system. Experimental results show that the proposed method is able to achieve accurate shape reconstruction as well as pose estimation from images where only projected patterns are observed.
[CV-110] ActiveNeuS: Neural Signed Distance Fields for Active Stereo
Link: https://arxiv.org/abs/2410.15376
Authors: Kazuto Ichimaru, Takaki Ikeda, Diego Thomas, Takafumi Iwaguchi, Hiroshi Kawasaki
Keywords: intensively researched, illumination or scattering, open problem, problem and intensively, Neural Signed Distance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in International Conference on 3D Vision 2024
Abstract: 3D-shape reconstruction in extreme environments, such as low illumination or scattering conditions, has been an open problem and intensively researched. Active stereo is one potential solution for such environments because of its robustness and high accuracy. However, active stereo systems usually consist of specialized system configurations with complicated algorithms, which narrow their application. In this paper, we propose Neural Signed Distance Field for active stereo systems to enable implicit correspondence search and triangulation in generalized Structured Light. With our technique, textureless surfaces, or surfaces made effectively textureless by low-light conditions, are successfully reconstructed even with a small number of captured images. Experiments were conducted to confirm that the proposed method could achieve state-of-the-art reconstruction quality under such severe conditions. We also demonstrated that the proposed method worked in an underwater scenario.
[CV-111] Explainability of Point Cloud Neural Networks Using SMILE: Statistical Model-Agnostic Interpretability with Local Explanations
Link: https://arxiv.org/abs/2410.15374
Authors: Seyed Mohammad Ahmadi, Koorosh Aslansefat, Ruben Valcarce-Dineiro, Joshua Barnfather
Keywords: considerable safety risks, pose considerable safety, today world, significance of explainable, lack of transparency
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 9 figures
Abstract:In today’s world, the significance of explainable AI (XAI) is growing in robotics and point cloud applications, as the lack of transparency in decision-making can pose considerable safety risks, particularly in autonomous systems. As these technologies are integrated into real-world environments, ensuring that model decisions are interpretable and trustworthy is vital for operational reliability and safety assurance. This study explores the implementation of SMILE, a novel explainability method originally designed for deep neural networks, on point cloud-based models. SMILE builds on LIME by incorporating Empirical Cumulative Distribution Function (ECDF) statistical distances, offering enhanced robustness and interpretability, particularly when the Anderson-Darling distance is used. The approach demonstrates superior performance in terms of fidelity loss, R2 scores, and robustness across various kernel widths, perturbation numbers, and clustering configurations. Moreover, this study introduces a stability analysis for point cloud data using the Jaccard index, establishing a new benchmark and baseline for model stability in this field. The study further identifies dataset biases in the classification of the ‘person’ category, emphasizing the necessity for more comprehensive datasets in safety-critical applications like autonomous driving and robotics. The results underscore the potential of advanced explainability models and highlight areas for future research, including the application of alternative surrogate models and explainability techniques in point cloud data.
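Two ingredients named in the abstract, an ECDF-based distance between samples and the Jaccard index as a stability measure, can be sketched in a few lines. Note the loud assumptions: the sketch uses the Kolmogorov-Smirnov statistic as a simple stand-in for an ECDF distance (the abstract highlights the Anderson-Darling distance instead), the exponential kernel and its `width` parameter are illustrative, and none of the function names come from the SMILE codebase.

```python
import math
from typing import Sequence, Set


def ks_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest absolute gap between the empirical CDFs of two samples."""
    def ecdf(sample: Sequence[float], x: float) -> float:
        return sum(1 for v in sample if v <= x) / len(sample)

    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)


def kernel_weight(distance: float, width: float = 0.25) -> float:
    """Turn a distance into a locality weight in (0, 1]; closer samples weigh more."""
    return math.exp(-(distance ** 2) / (width ** 2))


def jaccard(first: Set[int], second: Set[int]) -> float:
    """Overlap of two sets of selected features; 1.0 means identical explanations."""
    if not first and not second:
        return 1.0
    return len(first & second) / len(first | second)


# A hypothetical original sample and a perturbed version of it.
original = [1.0, 2.0, 3.0, 4.0]
perturbed = [3.0, 4.0, 5.0, 6.0]
distance = ks_distance(original, perturbed)
weight = kernel_weight(distance)

# Hypothetical top-3 feature indices from two explanation runs.
stability = jaccard({0, 1, 2}, {1, 2, 3})
print(distance, weight, stability)
```

In a LIME-style pipeline the weight would scale each perturbed sample's influence on the local surrogate model, and the Jaccard score would compare the features selected across repeated runs.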
[CV-112] DynaVINS++: Robust Visual-Inertial State Estimator in Dynamic Environments by Adaptive Truncated Least Squares and Stable State Recovery
Link: https://arxiv.org/abs/2410.15373
Authors: Seungwon Song, Hyungtae Lim, Alex Junho Lee, Hyun Myung
Keywords: visual-inertial navigation systems, approaches remain vulnerable, suddenly start moving, abruptly dynamic objects, robust visual-inertial navigation
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 7 figures. S. Song, H. Lim, A. J. Lee and H. Myung, “DynaVINS++: Robust Visual-Inertial State Estimator in Dynamic Environments by Adaptive Truncated Least Squares and Stable State Recovery,” in IEEE Robotics and Automation Letters, vol. 9, no. 10, pp. 9127-9134, Oct. 2024
Abstract:Despite extensive research in robust visual-inertial navigation systems (VINS) in dynamic environments, many approaches remain vulnerable to objects that suddenly start moving, which are referred to as abruptly dynamic objects. In addition, most approaches have considered the effect of dynamic objects only at the feature association level. In this study, we observed that the state estimation diverges when errors from false correspondences owing to moving objects incorrectly propagate into the IMU bias terms. To overcome these problems, we propose a robust VINS framework called DynaVINS++, which employs a) an adaptive truncated least squares method that adaptively adjusts the truncation range using both feature association and IMU preintegration to effectively minimize the effect of the dynamic objects while reducing the computational cost, and b) stable state recovery with a bias consistency check to correct misestimated IMU bias and to prevent the divergence caused by abruptly dynamic objects. As verified in both public and real-world datasets, our approach shows promising performance in dynamic environments, including scenes with abruptly dynamic objects.
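As a rough illustration of the truncated least squares idea named in the abstract above, the sketch below caps each squared residual at a fixed truncation level. This is the generic textbook form, not the paper's adaptive scheme; the function name, the residual values, and the fixed threshold `c` are assumptions for illustration only.

```python
import numpy as np


def truncated_least_squares(residuals, c):
    """Sum of min(r_i^2, c^2): residuals larger than c contribute a constant."""
    r2 = np.square(np.asarray(residuals, dtype=float))
    return float(np.minimum(r2, c ** 2).sum())


# Hypothetical residuals: two inliers and one outlier (e.g. a moving object).
residuals = [0.1, -0.2, 5.0]
cost = truncated_least_squares(residuals, c=1.0)
plain = float(np.square(residuals).sum())
print(cost, plain)
```

Because the outlier's contribution is capped at `c ** 2`, it no longer dominates the cost the way it does in the plain sum of squares.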
[CV-113] FrameBridge: Improving Image-to-Video Generation with Bridge Models
Link: https://arxiv.org/abs/2410.15371
Authors: Yuji Wang, Zehua Chen, Xiaoyu Chen, Jun Zhu, Jianfei Chen
Keywords: gaining increasing attention, gaining increasing, increasing attention, wide application, models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Image-to-video (I2V) generation is gaining increasing attention with its wide application in video synthesis. Recently, diffusion-based I2V models have achieved remarkable progress given their novel design on network architecture, cascaded framework, and motion representation. However, restricted by their noise-to-data generation process, diffusion-based methods inevitably suffer the difficulty to generate video samples with both appearance consistency and temporal coherence from an uninformative Gaussian noise, which may limit their synthesis quality. In this work, we present FrameBridge, taking the given static image as the prior of video target and establishing a tractable bridge model between them. By formulating I2V synthesis as a frames-to-frames generation task and modelling it with a data-to-data process, we fully exploit the information in input image and facilitate the generative model to learn the image animation process. In two popular settings of training I2V models, namely fine-tuning a pre-trained text-to-video (T2V) model or training from scratch, we further propose two techniques, SNR-Aligned Fine-tuning (SAF) and neural prior, which improve the fine-tuning efficiency of diffusion-based T2V models to FrameBridge and the synthesis quality of bridge-based I2V models respectively. Experiments conducted on WebVid-2M and UCF-101 demonstrate that: (1) our FrameBridge achieves superior I2V quality in comparison with the diffusion counterpart (zero-shot FVD 83 vs. 176 on MSR-VTT and non-zero-shot FVD 122 vs. 171 on UCF-101); (2) our proposed SAF and neural prior effectively enhance the ability of bridge-based I2V models in the scenarios of fine-tuning and training from scratch. Demo samples can be visited at: this https URL.
[CV-114] Scene Graph Generation with Role-Playing Large Language Models NEURIPS2024
Link: https://arxiv.org/abs/2410.15364
Authors: Guikun Chen, Jin Li, Wenguan Wang
Keywords: standard zero-shot pipeline, Current approaches, text classifiers, scene graph generation, open-vocabulary scene graph
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: NeurIPS 2024. Code: this https URL
Abstract:Current approaches for open-vocabulary scene graph generation (OVSGG) use vision-language models such as CLIP and follow a standard zero-shot pipeline – computing similarity between the query image and the text embeddings for each category (i.e., text classifiers). In this work, we argue that the text classifiers adopted by existing OVSGG methods, i.e., category-/part-level prompts, are scene-agnostic as they remain unchanged across contexts. Using such fixed text classifiers not only struggles to model visual relations with high variance, but also falls short in adapting to distinct contexts. To plug these intrinsic shortcomings, we devise SDSGG, a scene-specific description based OVSGG framework where the weights of text classifiers are adaptively adjusted according to the visual content. In particular, to generate comprehensive and diverse descriptions oriented to the scene, an LLM is asked to play different roles (e.g., biologist and engineer) to analyze and discuss the descriptive features of a given scene from different views. Unlike previous efforts simply treating the generated descriptions as mutually equivalent text classifiers, SDSGG is equipped with an advanced renormalization mechanism to adjust the influence of each text classifier based on its relevance to the presented scene (this is what the term “specific” means). Furthermore, to capture the complicated interplay between subjects and objects, we propose a new lightweight module called mutual visual adapter. It refines CLIP’s ability to recognize relations by learning an interaction-aware semantic space. Extensive experiments on prevalent benchmarks show that SDSGG outperforms top-leading methods by a clear margin.
[CV-115] YOLO-RD: Introducing Relevant and Compact Explicit Knowledge to YOLO by Retriever-Dictionary
Link: https://arxiv.org/abs/2410.15346
Authors: Hao-Tang Tsui, Chien-Yao Wang, Hong-Yuan Mark Liao
Keywords: refining training strategies, Identifying and localizing, fundamental challenge, training strategies, Visual Language Models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Identifying and localizing objects within images is a fundamental challenge, and numerous efforts have been made to enhance model accuracy by experimenting with diverse architectures and refining training strategies. Nevertheless, a prevalent limitation in existing models is overemphasizing the current input while ignoring the information from the entire dataset. We introduce an innovative Retriever-Dictionary (RD) module to address this issue. This architecture enables YOLO-based models to efficiently retrieve features from a Dictionary that contains the insight of the dataset, which is built by the knowledge from Visual Models (VM), Large Language Models (LLM), or Visual Language Models (VLM). The flexible RD enables the model to incorporate such explicit knowledge that enhances the ability to benefit multiple tasks, specifically, segmentation, detection, and classification, from pixel to image level. The experiments show that using the RD significantly improves model performance, achieving more than a 3% increase in mean Average Precision for object detection with less than a 1% increase in model parameters. Beyond 1-stage object detection models, the RD module improves the effectiveness of 2-stage models and DETR-based architectures, such as Faster R-CNN and Deformable DETR.
[CV-116] Modality-Fair Preference Optimization for Trustworthy MLLM Alignment
Link: https://arxiv.org/abs/2410.15334
Authors: Songtao Jiang, Yan Zhang, Ruizhe Chen, Yeying Jin, Zuozhu Liu
Keywords: Direct Preference Optimization, aligning large language, Direct Preference, Preference Optimization, large language models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Direct Preference Optimization (DPO) is effective for aligning large language models (LLMs), but when applied to multimodal models (MLLMs), it often favors text over image information, leading to unreliable outputs and visual hallucinations. To address this, we propose Modality-Fair Preference Optimization (MFPO) to balance text and image preferences. First, we found that the lack of image-related rewards in preference data biases optimization toward text, so we created automated, fine-grained image preference data to correct this. Then, we designed a learning objective to ensure the model captures both text and image preferences while maintaining high-quality outputs. Finally, we use a multi-stage alignment approach to stabilize training and improve learning across both modalities. Extensive experiments demonstrate that MFPO significantly enhances MLLM trustworthiness. On models like LLaVA-v1.5 (7B, 13B), our approach reduces hallucinations substantially. On the 7B model, MFPO outperforms GPT-4V and achieves a nearly 40% improvement over previous methods on Object HalBench, as well as achieving state-of-the-art performance on both Object HalBench and AMBER when combined with the latest LLaVA-v1.6. Code will be released.
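For readers unfamiliar with the DPO objective that the abstract above builds on, here is a minimal sketch of the standard DPO loss, not of MFPO itself. It assumes the summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model are already available; all names and numbers are hypothetical.

```python
import numpy as np


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss: mean of -log sigmoid(beta * (chosen gain - rejected gain))."""
    policy_chosen = np.asarray(policy_chosen, dtype=float)
    policy_rejected = np.asarray(policy_rejected, dtype=float)
    ref_chosen = np.asarray(ref_chosen, dtype=float)
    ref_rejected = np.asarray(ref_rejected, dtype=float)
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed stably with logaddexp.
    return float(np.mean(np.logaddexp(0.0, -margin)))


# Hypothetical log-probabilities for two preference pairs.
loss_neutral = dpo_loss([-10.0, -12.0], [-10.0, -12.0], [-10.0, -12.0], [-10.0, -12.0])
loss_better = dpo_loss([-8.0, -9.0], [-14.0, -15.0], [-10.0, -12.0], [-10.0, -12.0])
print(loss_neutral, loss_better)
```

When the policy matches the reference, the loss equals log 2; it falls as the policy raises the chosen response relative to the rejected one.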
[CV-117] Open-vocabulary vs. Closed-set: Best Practice for Few-shot Object Detection Considering Text Describability
Link: https://arxiv.org/abs/2410.15315
Authors: Yusuke Hosoya, Masanori Suganuma, Takayuki Okatani
Keywords: garnered significant attention, Open-vocabulary object detection, detecting specific classes, Open-vocabulary object, OVD
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 3 figures
Abstract:Open-vocabulary object detection (OVD), detecting specific classes of objects using only their linguistic descriptions (e.g., class names) without any image samples, has garnered significant attention. However, in real-world applications, the target class concepts are often hard to describe in text and the only way to specify target objects is to provide their image examples, yet it is often challenging to obtain a good number of samples. Thus, there is a high demand from practitioners for few-shot object detection (FSOD). A natural question arises: Can the benefits of OVD extend to FSOD for object classes that are difficult to describe in text? Compared to traditional methods that learn only predefined classes (referred to in this paper as closed-set object detection, COD), can the extra cost of OVD be justified? To answer these questions, we propose a method to quantify the “text-describability” of object detection datasets using the zero-shot image classification accuracy with CLIP. This allows us to categorize various OD datasets with different text-describability and empirically evaluate the FSOD performance of OVD and COD methods within each category. Our findings reveal that: i) there is little difference between OVD and COD for object classes with low text-describability under equal conditions in OD pretraining; and ii) although OVD can learn from more diverse data than OD-specific data, thereby increasing the volume of training data, it can be counterproductive for classes with low text-describability. These findings provide practitioners with valuable guidance amidst the recent advancements of OVD methods.
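The zero-shot classification accuracy that the abstract above uses as a describability score can be sketched as follows. The sketch assumes image and class-name embeddings have already been produced by a CLIP-like encoder; the toy 2-D vectors, labels, and function name are hypothetical and stand in for real embeddings.

```python
import numpy as np


def zero_shot_accuracy(image_emb, text_emb, labels):
    """Assign each image to the class whose text embedding is most cosine-similar."""
    image_emb = np.asarray(image_emb, dtype=float)
    text_emb = np.asarray(text_emb, dtype=float)
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = image_emb @ text_emb.T          # shape: (num_images, num_classes)
    preds = sims.argmax(axis=1)
    return float((preds == np.asarray(labels)).mean())


# Toy embeddings: two classes, three images (the third is misclassified).
text_emb = [[1.0, 0.0], [0.0, 1.0]]
image_emb = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 1]
accuracy = zero_shot_accuracy(image_emb, text_emb, labels)
print(accuracy)
```

Under this reading, a dataset whose class names yield low zero-shot accuracy would count as having low text-describability.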
[CV-118] Synergistic Dual Spatial-aware Generation of Image-to-Text and Text-to-Image
Link: https://arxiv.org/abs/2410.15312
Authors: Yu Zhao, Hao Fei, Xiangtai Li, Libo Qin, Jiayi Ji, Hongyuan Zhu, Meishan Zhang, Min Zhang, Jianguo Wei
Keywords: visual spatial understanding, spatial understanding, spatial, Spatial Dual Discrete, Dual Discrete Diffusion
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:In the visual spatial understanding (VSU) area, spatial image-to-text (SI2T) and spatial text-to-image (ST2I) are two fundamental tasks that appear in dual form. Existing methods for standalone SI2T or ST2I perform imperfectly in spatial understanding, due to the difficulty of 3D-wise spatial feature modeling. In this work, we consider modeling the SI2T and ST2I together under a dual learning framework. During the dual framework, we then propose to represent the 3D spatial scene features with a novel 3D scene graph (3DSG) representation that can be shared and beneficial to both tasks. Further, inspired by the intuition that the easier 3D→image and 3D→text processes also exist symmetrically in the ST2I and SI2T, respectively, we propose the Spatial Dual Discrete Diffusion (SD³) framework, which utilizes the intermediate features of the 3D→X processes to guide the hard X→3D processes, such that the overall ST2I and SI2T will benefit each other. On the visual spatial understanding dataset VSD, our system outperforms the mainstream T2I and I2T methods significantly. Further in-depth analysis reveals how our dual learning strategy advances.
[CV-119] ContextDet: Temporal Action Detection with Adaptive Context Aggregation
Link: https://arxiv.org/abs/2410.15279
Authors: Ning Wang, Yun Xiao, Xiaopeng Peng, Xiaojun Chang, Xuanhong Wang, Dingyi Fang
Keywords: Temporal action detection, video understanding due, recognizes action segments, Temporal action, variable segment lengths
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments:
Abstract:Temporal action detection (TAD), which locates and recognizes action segments, remains a challenging task in video understanding due to variable segment lengths and ambiguous boundaries. Existing methods treat neighboring contexts of an action segment indiscriminately, leading to imprecise boundary predictions. We introduce a single-stage ContextDet framework, which makes use of large-kernel convolutions in TAD for the first time. Our model features a pyramid adaptive context aggregation (ACA) architecture, capturing long context and improving action discriminability. Each ACA level consists of two novel modules. The context attention module (CAM) identifies salient contextual information, encourages context diversity, and preserves context integrity through a context gating block (CGB). The long context module (LCM) makes use of a mixture of large- and small-kernel convolutions to adaptively gather long-range context and fine-grained local features. Additionally, by varying the length of these large kernels across the ACA pyramid, our model provides lightweight yet effective context aggregation and action discrimination. We conducted extensive experiments and compared our model with a number of advanced TAD methods on six challenging TAD benchmarks: MultiThumos, Charades, FineAction, EPIC-Kitchens 100, Thumos14, and HACS, demonstrating superior accuracy at reduced inference speed.
[CV-120] Can LVLMs Describe Videos like Humans? A Five-in-One Video Annotations Benchmark for Better Human-Machine Comparison
Link: https://arxiv.org/abs/2410.15270
Authors: Shiyu Hu, Xuchen Li, Xuzhao Li, Jing Zhang, Yipei Wang, Xin Zhao, Kang Hao Cheong
Keywords: Large vision-language models, sparking researchers’ interest, made significant strides, Large vision-language, human-like multimodal understanding
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Large vision-language models (LVLMs) have made significant strides in addressing complex video tasks, sparking researchers’ interest in their human-like multimodal understanding capabilities. Video description serves as a fundamental task for evaluating video comprehension, necessitating a deep understanding of spatial and temporal dynamics, which presents challenges for both humans and machines. Thus, investigating whether LVLMs can describe videos as comprehensively as humans (through reasonable human-machine comparisons using video captioning as a proxy task) will enhance our understanding and application of these models. However, current benchmarks for video comprehension have notable limitations, including short video durations, brief annotations, and reliance on a single annotator’s perspective. These factors hinder a comprehensive assessment of LVLMs’ ability to understand complex, lengthy videos and prevent the establishment of a robust human baseline that accurately reflects human video comprehension capabilities. To address these issues, we propose a novel benchmark, FIOVA (Five In One Video Annotations), designed to evaluate the differences between LVLMs and human understanding more comprehensively. FIOVA includes 3,002 long video sequences (averaging 33.6 seconds) that cover diverse scenarios with complex spatiotemporal relationships. Each video is annotated by five distinct annotators, capturing a wide range of perspectives and resulting in captions that are 4-15 times longer than existing benchmarks, thereby establishing a robust baseline that represents human understanding comprehensively for the first time in video description tasks. Using the FIOVA benchmark, we conducted an in-depth evaluation of six state-of-the-art LVLMs, comparing their performance with humans. More detailed information can be found at this https URL.
[CV-121] GSSF: Generalized Structural Sparse Function for Deep Cross-modal Metric Learning
Link: https://arxiv.org/abs/2410.15266
Authors: Haiwen Diao, Ying Zhang, Shang Gao, Jiawen Zhu, Long Chen, Huchuan Lu
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 12 pages, 9 figures, Accepted by TIP2024
[CV-122] Modeling Visual Memorability Assessment with Autoencoders Reveals Characteristics of Memorable Images
Link: https://arxiv.org/abs/2410.15235
Authors: Elham Bagheri, Yalda Mohsenzadeh
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-123] Deep Learning-based Detection of Bacterial Swarm Motion Using a Single Image
Link: https://arxiv.org/abs/2410.15229
Authors: Yuzhu Li, Hao Li, Weijie Chen, Keelan O’Riordan, Neha Mani, Yuxuan Qi, Tairan Liu, Sridhar Mani, Aydogan Ozcan
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applied Physics (physics.app-ph); Medical Physics (physics.med-ph)
Comments: 17 Pages, 4 Figures
[CV-124] Low-cost Robust Night-time Aerial Material Segmentation through Hyperspectral Data and Sparse Spatio-Temporal Learning ICONIP
Link: https://arxiv.org/abs/2410.15208
Authors: Chandrajit Bajaj, Minh Nguyen, Shubham Bhardwaj
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to the International Conference on Neural Information Processing (ICONIP) 2024. To be published in Springer-Nature Communications in Computer and Information Science (CCIS) Series
[CV-125] Unsupervised Domain Adaptation Approaches for Chessboard Recognition
Link: https://arxiv.org/abs/2410.15206
Authors: Wassim Jabbour, Enzo Benoit-Jeannin, Oscar Bedford, Saif Shahin
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 30 pages, 23 figures
[CV-126] CLIPtortionist: Zero-shot Text-driven Deformation for Manufactured 3D Shapes
Link: https://arxiv.org/abs/2410.15199
Authors: Xianghao Xu, Srinath Sridhar, Daniel Ritchie
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:
[CV-127] Budgeted Online Continual Learning by Adaptive Layer Freezing and Frequency-based Sampling
Link: https://arxiv.org/abs/2410.15143
Authors: Minhyuk Seo, Hyunseo Koh, Jonghyun Choi
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-128] Standardizing Generative Face Video Compression using Supplemental Enhancement Information
Link: https://arxiv.org/abs/2410.15105
Authors: Bolin Chen, Yan Ye, Jie Chen, Ru-Ling Liao, Shanzhi Yin, Shiqi Wang, Kaifa Yang, Yue Li, Yiling Xu, Ye-Kui Wang, Shiv Gehlot, Guan-Ming Su, Peng Yin, Sean McCarthy, Gary J. Sullivan
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-129] CosFairNet: A Parameter-Space based Approach for Bias Free Learning
Link: https://arxiv.org/abs/2410.15094
Authors: Rajeev Ranjan Dwivedi, Priyadarshini Kumari, Vinod K Kurmi
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
[CV-130] Spatial-Mamba: Effective Visual State Space Models via Structure-Aware State Fusion
Link: https://arxiv.org/abs/2410.15091
Authors: Chaodong Xiao, Minghan Li, Zhengqiang Zhang, Deyu Meng, Lei Zhang
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 8 figures, 5 tables
[CV-131] SLIC: Secure Learned Image Codec through Compressed Domain Watermarking to Defend Image Manipulation
Link: https://arxiv.org/abs/2410.15075
Authors: Chen-Hsiu Huang, Ja-Ling Wu
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: accepted by ACM Multimedia Asia 2024
[CV-132] LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound
Link: https://arxiv.org/abs/2410.15074
Authors: Xuechen Guo, Wenhao Chai, Shi-Yan Li, Gaoang Wang
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
[CV-133] A Cycle Ride to HDR: Semantics Aware Self-Supervised Framework for Unpaired LDR-to-HDR Image Translation
Link: https://arxiv.org/abs/2410.15068
Authors: Hrishav Bakul Barua, Stefanov Kalin, Lemuel Lai En Che, Dhall Abhinav, Wong KokSheik, Krishnasamy Ganesh
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Submitted to IEEE
[CV-134] A Survey on All-in-One Image Restoration: Taxonomy, Evaluation and Future Trends
Link: https://arxiv.org/abs/2410.15067
Authors: Junjun Jiang, Zengyuan Zuo, Gang Wu, Kui Jiang, Xianming Liu
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:
[CV-135] EndoMetric: Near-light metric scale monocular SLAM ICRA2025
Link: https://arxiv.org/abs/2410.15065
Authors: Raúl Iranzo, Víctor M. Batlle, Juan D. Tardós, José M.M. Montiel
Keywords: SLAM with endoscopic, Geometric reconstruction, recent years, endoscopic images, significant advancements
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICRA 2025
Abstract:Geometric reconstruction and SLAM with endoscopic images have seen significant advancements in recent years. In most medical specialties, the endoscopes used are monocular, and the algorithms applied are typically extensions of those designed for external environments, resulting in 3D reconstructions up to an unknown scale factor. In this paper, we take advantage of the fact that standard endoscopes are equipped with near-light sources positioned at a small but non-zero baseline from the camera. By leveraging the inverse-square law of light decay, we enable, for the first time, monocular reconstructions with accurate metric scale. This paves the way to transform any endoscope into a metric device, which is essential for practical applications such as measuring polyps, stenosis, or the extent of tissue affected by disease.
[CV-136] BYOCL: Build Your Own Consistent Latent with Hierarchical Representative Latent Clustering
Link: https://arxiv.org/abs/2410.15060
Authors: Jiayue Dai, Yunya Wang, Yihan Fang, Yuetong Chen, Butian Xiong
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 5 figures
[CV-137] A General-Purpose Multimodal Foundation Model for Dermatology
Link: https://arxiv.org/abs/2410.15038
Authors: Siyuan Yan, Zhen Yu, Clare Primiero, Cristina Vico-Alonso, Zhonghua Wang, Litao Yang, Philipp Tschandl, Ming Hu, Gin Tan, Vincent Tang, Aik Beng Ng, David Powell, Paul Bonnington, Simon See, Monika Janda, Victoria Mar, Harald Kittler, H. Peter Soyer, Zongyuan Ge
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 56 pages; Technical report
[CV-138] Cutting-Edge Detection of Fatigue in Drivers: A Comparative Study of Object Detection Models
Link: https://arxiv.org/abs/2410.15030
Authors: Amelia Jones
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
[CV-139] Group Diffusion Transformers are Unsupervised Multitask Learners
Link: https://arxiv.org/abs/2410.15027
Authors: Lianghua Huang, Wei Wang, Zhi-Fan Wu, Huanzhang Dou, Yupeng Shi, Yutong Feng, Chen Liang, Yu Liu, Jingren Zhou
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-140] MambaSOD: Dual Mamba-Driven Cross-Modal Fusion Network for RGB-D Salient Object Detection
Link: https://arxiv.org/abs/2410.15015
Authors: Yue Zhan, Zhihong Zeng, Haijun Liu, Xiaoheng Tan, Yinli Tian
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-141] DiffuseST: Unleashing the Capability of the Diffusion Model for Style Transfer
Link: https://arxiv.org/abs/2410.15007
Authors: Ying Hu, Chenyi Zhuang, Pan Gao
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted to ACMMM Asia 2024. Code is available at this https URL
[CV-142] How Many Van Goghs Does It Take to Van Gogh? Finding the Imitation Threshold NEURIPS2024
Link: https://arxiv.org/abs/2410.15002
Authors: Sahil Verma, Royi Rassin, Arnav Das, Gantavya Bhatt, Preethi Seshadri, Chirag Shah, Jeff Bilmes, Hannaneh Hajishirzi, Yanai Elazar
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ATTRIB, RegML, and SafeGenAI workshops at NeurIPS 2024 and NLLP Workshop 2024
[CV-143] Making Every Frame Matter: Continuous Video Understanding for Large Models via Adaptive State Modeling
Link: https://arxiv.org/abs/2410.14993
Authors: Hao Wu, Donglin Bai, Shiqi Jiang, Qianxi Zhang, Yifan Yang, Ting Cao, Fengyuan Xu
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-144] ChitroJera: A Regionally Relevant Visual Question Answering Dataset for Bangla
Link: https://arxiv.org/abs/2410.14991
Authors: Deeparghya Dutta Barua, Md Sakib Ul Rahman Sourove, Md Farhan Ishmam, Fabiha Haider, Fariha Tanjim Shifat, Md Fahim, Md Farhad Alam
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
[CV-145] SeaS: Few-shot Industrial Anomaly Image Generation with Separation and Sharing Fine-tuning
Link: https://arxiv.org/abs/2410.14987
Authors: Zhewei Dai, Shilei Zeng, Haotian Liu, Xurui Li, Feng Xue, Yu Zhou
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-146] D-SarcNet: A Dual-stream Deep Learning Framework for Automatic Analysis of Sarcomere Structures in Fluorescently Labeled hiPSC-CMs
Link: https://arxiv.org/abs/2410.14983
Authors: Huyen Le, Khiet Dang, Nhung Nguyen, Mai Tran, Hieu Pham
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for oral presentation at IEEE International Conference on Bioinformatics and Biomedicine 2024 (IEEE BIBM 2024)
[CV-147] DCDepth: Progressive Monocular Depth Estimation in Discrete Cosine Domain NEURIPS-2024
Link: https://arxiv.org/abs/2410.14980
Authors: Kun Wang, Zhiqiang Yan, Junkai Fan, Wanlu Zhu, Xiang Li, Jun Li, Jian Yang
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS-2024
[CV-148] 3D Multi-Object Tracking Employing MS-GLMB Filter for Autonomous Driving
Link: https://arxiv.org/abs/2410.14977
Authors: Linh Van Ma, Muhammad Ishfaq Hussain, Kin-Choong Yow, Moongu Jeon
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 2024 International Conference on Control, Automation and Information Sciences (ICCAIS), November 26th to 28th, 2024 in Ho Chi Minh City
[CV-149] Reflexive Guidance: Improving OoDD in Vision-Language Models via Self-Guided Image-Adaptive Concept Generation
Link: https://arxiv.org/abs/2410.14975
Authors: Seulbi Lee, Jihyo Kim, Sangheum Hwang
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: The first two authors contributed equally
[CV-150] Visual Navigation of Digital Libraries: Retrieval and Classification of Images in the National Library of Norway's Digitised Book Collection
Link: https://arxiv.org/abs/2410.14969
Authors: Marie Roald, Magnus Breder Birkenes, Lars Gunnarsønn Bagøien Johnsen
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 13 pages, 2 figures, 4 tables, Accepted to the 2024 Computational Humanities Research Conference (CHR)
[CV-151] Neural Radiance Field Image Refinement through End-to-End Sampling Point Optimization
Link: https://arxiv.org/abs/2410.14958
Authors: Kazuhiro Ohta, Satoshi Ono
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
[CV-152] SemiHVision: Enhancing Medical Multimodal Models with a Semi-Human Annotated Dataset and Fine-Tuned Instruction Generation
Link: https://arxiv.org/abs/2410.14948
Authors: Junda Wang, Yujan Ting, Eric Z. Chen, Hieu Tran, Hong Yu, Weijing Huang, Terrence Chen
Keywords:
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-153] Part-Whole Relational Fusion Towards Multi-Modal Scene Understanding
Link: https://arxiv.org/abs/2410.14944
Authors: Yi Liu, Chengxin Li, Shoukun Xu, Jungong Han
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-154] Water quality polluted by total suspended solids classified within an Artificial Neural Network approach
Link: https://arxiv.org/abs/2410.14929
Authors: I. Luviano Soto, Y. Concha Sánchez, A. Raya
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 42 pages, 8 figures and 2 tables
[CV-155] Adversarial Score identity Distillation: Rapidly Surpassing the Teacher in One Step
Link: https://arxiv.org/abs/2410.14919
Authors: Mingyuan Zhou, Huangjie Zheng, Yi Gu, Zhendong Wang, Hai Huang
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
[CV-156] A Hybrid Defense Strategy for Boosting Adversarial Robustness in Vision-Language Models
Link: https://arxiv.org/abs/2410.14911
Authors: Yuhan Liang, Yijun Li, Yumeng Niu, Qianhe Shen, Hangyu Liu
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
[CV-157] DRACO: Differentiable Reconstruction for Arbitrary CBCT Orbits
Link: https://arxiv.org/abs/2410.14900
Authors: Chengze Ye, Linda-Sophie Schneider, Yipeng Sun, Mareike Thies, Siyuan Mei, Andreas Maier
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-158] Truncated Consistency Models
Link: https://arxiv.org/abs/2410.14895
Authors: Sangyun Lee, Yilun Xu, Tomas Geffner, Giulia Fanti, Karsten Kreis, Arash Vahdat, Weili Nie
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-159] On the Influence of Shape, Texture and Color for Learning Semantic Segmentation
Link: https://arxiv.org/abs/2410.14878
Authors: Annika Mütze, Natalie Grabowsky, Edgar Heinert, Matthias Rottmann, Hanno Gottschalk
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-160] Improving Vision Transformers by Overlapping Heads in Multi-Head Self-Attention
Link: https://arxiv.org/abs/2410.14874
Authors: Tianxiao Zhang, Bo Luo, Guanghui Wang
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-161] SYNOSIS: Image synthesis pipeline for machine vision in metal surface inspection
Link: https://arxiv.org/abs/2410.14844
Authors: Juraj Fulir, Natascha Jeziorski, Lovro Bosnar, Hans Hagen, Claudia Redenbach, Petra Gospodnetić, Tobias Herrfurth, Marcus Trost, Thomas Gischkat
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE); Graphics (cs.GR)
Comments: Initial preprint, 21 pages, 21 figures, 6 tables
[CV-162] Automated Road Extraction from Satellite Imagery Integrating Dense Depthwise Dilated Separable Spatial Pyramid Pooling with DeepLabV3
链接: https://arxiv.org/abs/2410.14836
作者: Arpan Mahara,Md Rezaul Karim Khan,Naphtali D. Rishe,Wenjia Wang,Seyed Masoud Sadjadi
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 9 pages, 5 figures
[CV-163] ackling domain generalization for out-of-distribution endoscopic imaging MICCAI2024
链接: https://arxiv.org/abs/2410.14821
作者: Mansoor Ali Teevno,Gilberto Ochoa-Ruiz,Sharib Ali
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: The paper was accepted at Machine Learning in Medical Imaging (MLMI) workshop at MICCAI 2024 in Marrakesh
[CV-164] GESH-Net: Graph-Enhanced Spherical Harmonic Convolutional Networks for Cortical Surface Registration
Link: https://arxiv.org/abs/2410.14805
Authors: Ruoyu Zhang,Lihui Wang,Kun Tang,Jingwen Xu,Hongjiang Wei
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-165] Deep Generic Dynamic Object Detection Based on Dynamic Grid Maps
Link: https://arxiv.org/abs/2410.14799
Authors: Rujiao Yan,Linda Schubert,Alexander Kamm,Matthias Komar,Matthias Schreier
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures, IEEE IV24
[CV-166] SSL-NBV: A Self-Supervised-Learning-Based Next-Best-View algorithm for Efficient 3D Plant Reconstruction by a Robot
Link: https://arxiv.org/abs/2410.14790
Authors: Jianchao Ci,Eldert J. van Henten,Xin Wang,Akshay K. Burusa,Gert Kootstra
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 22 pages, 11 figures, 1 table
[CV-167] A Survey on Computational Solutions for Reconstructing Complete Objects by Reassembling Their Fractured Parts
Link: https://arxiv.org/abs/2410.14770
Authors: Jiaxin Lu,Yongqing Liang,Huijun Han,Jiacheng Hua,Junfeng Jiang,Xin Li,Qixing Huang
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 36 pages, 22 figures
[CV-168] CFTS-GAN: Continual Few-Shot Teacher Student for Generative Adversarial Networks
Link: https://arxiv.org/abs/2410.14749
Authors: Munsif Ali,Leonardo Rossi,Massimo Bertozzi
Keywords:
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-169] Tokens on Demand: Token Condensation as Training-free Test-time Adaptation
Link: https://arxiv.org/abs/2410.14729
Authors: Zixin Wang,Dong Gong,Sen Wang,Zi Huang,Yadan Luo
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 18 pages, 7 figures
[CV-170] SGLP: A Similarity Guided Fast Layer Partition Pruning for Compressing Large Deep Models
Link: https://arxiv.org/abs/2410.14720
Authors: Yuqi Li,Yao Lu,Zeyu Dong,Chuanguang Yang,Yihao Chen,Jianping Gou
Keywords:
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages
[CV-171] Animating the Past: Reconstruct Trilobite via Video Generation
Link: https://arxiv.org/abs/2410.14715
Authors: Xiaoran Wu,Zien Huang,Chonghan Yu
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
[CV-172] G2D2: Gradient-guided Discrete Diffusion for image inverse problem solving
Link: https://arxiv.org/abs/2410.14710
Authors: Naoki Murata,Chieh-Hsin Lai,Yuhta Takida,Toshimitsu Uesaka,Bac Nguyen,Stefano Ermon,Yuki Mitsufuji
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
[CV-173] FACMIC: Federated Adaptative CLIP Model for Medical Image Classification MICCAI2024
Link: https://arxiv.org/abs/2410.14707
Authors: Yihang Wu,Christian Desrosiers,Ahmad Chaddad
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: Accepted in MICCAI 2024
[CV-174] Optimizing Parking Space Classification: Distilling Ensembles into Lightweight Classifiers ICMLA
Link: https://arxiv.org/abs/2410.14705
Authors: Paulo Luza Alves,André Hochuli,Luiz Eduardo de Oliveira,Paulo Lisboa de Almeida
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted for presentation at the International Conference on Machine Learning and Applications (ICMLA) 2024
[CV-175] Self-Supervised Keypoint Detection with Distilled Depth Keypoint Representation
Link: https://arxiv.org/abs/2410.14700
Authors: Aman Anand,Elyas Rashno,Amir Eskandari,Farhana Zulkernine
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
[CV-176] Deep Learning Enhanced Road Traffic Analysis: Scalable Vehicle Detection and Velocity Estimation Using PlanetScope Imagery
Link: https://arxiv.org/abs/2410.14698
Authors: Maciej Adamiak,Yulia Grinblat,Julian Psotta,Nir Fulman,Himshikhar Mazumdar,Shiyu Tang,Alexander Zipf
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
[CV-177] Deep Domain Isolation and Sample Clustered Federated Learning for Semantic Segmentation
Link: https://arxiv.org/abs/2410.14693
Authors: Matthis Manthe(LIRIS, CREATIS),Carole Lartizien(MYRIAD),Stefan Duffner(LIRIS)
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:
[CV-178] Rethinking VLMs and LLMs for Image Classification
Link: https://arxiv.org/abs/2410.14690
Authors: Avi Cooper,Keizo Kato,Chia-Hsien Shih,Hiroaki Yamane,Kasper Vinken,Kentaro Takemoto,Taro Sunagawa,Hao-Wei Yeh,Jin Yamanaka,Ian Mason,Xavier Boix
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-179] ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time
Link: https://arxiv.org/abs/2410.06625
Authors: Yi Ding,Bolian Li,Ruqi Zhang
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 27 pages
[CV-180] QT-DoG: Quantization-aware Training for Domain Generalization
Link: https://arxiv.org/abs/2410.06020
Authors: Saqib Javed,Hieu Le,Mathieu Salzmann
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Code will be released soon
[CV-181] Deep Radiomics Detection of Clinically Significant Prostate Cancer on Multicenter MRI: Initial Comparison to PI-RADS Assessment
Link: https://arxiv.org/abs/2410.16238
Authors: G. A. Nketiah(1,2),M. R. Sunoqrot(1,2),E. Sandsmark(2),S. Langørgen(2),K. M. Selnæs(1,2),H. Bertilsson(1,3),M. Elschot(1,2),T. F. Bathen(1,2) (for the PCa-MAP Consortium. (1) Department of Circulation and Medical Imaging, Norwegian University of Science and Technology, Trondheim, Norway, (2) Department of Radiology and Nuclear Medicine, St. Olavs Hospital, Trondheim University Hospital, Trondheim, Norway, (3) Department of Urology, St. Olavs Hospital, Trondheim University Hospital, Trondheim, Norway)
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 4 figures, 4 tables
[CV-182] An Explainable Contrastive-based Dilated Convolutional Network with Transformer for Pediatric Pneumonia Detection
Link: https://arxiv.org/abs/2410.16143
Authors: Chandravardhan Singh Raghaw,Parth Shirish Bhore,Mohammad Zia Ur Rehman,Nagendra Kumar
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-183] Multimodal Flare Forecasting with Deep Learning
Link: https://arxiv.org/abs/2410.16116
Authors: Grégoire Francisco,Sabrina Guastavino,Teresa Barata,João Fernandes,Dario Del Moro
Keywords:
Categories: Solar and Stellar Astrophysics (astro-ph.SR); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-184] AI-Driven Approaches for Glaucoma Detection – A Comprehensive Review
Link: https://arxiv.org/abs/2410.15947
Authors: Yuki Hagiwara,Octavia-Andreaa Ciora,Maureen Monnet,Gino Lancho,Jeanette Miriam Lorenz
Keywords:
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-185] Seismic Phase Picking
Link: https://arxiv.org/abs/2410.15907
Authors: Yuchen Wang,Ruihuan Wang
Keywords:
Categories: Geophysics (physics.geo-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-186] R2I-rPPG: A Robust Region of Interest Selection Method for Remote Photoplethysmography to Extract Heart Rate
Link: https://arxiv.org/abs/2410.15851
Authors: Sandeep Nagar,Mark Hasegawa-Johnson,David G. Beiser,Narendra Ahuja
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: preprint
[CV-187] FusionLungNet: Multi-scale Fusion Convolution with Refinement Network for Lung CT Image Segmentation
Link: https://arxiv.org/abs/2410.15812
Authors: Sadjad Rezvani,Mansoor Fateh,Yeganeh Jalali,Amirreza Fateh
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-188] Transforming Blood Cell Detection and Classification with Advanced Deep Learning Models: A Comparative Study
Link: https://arxiv.org/abs/2410.15670
Authors: Shilpa Choudhary,Sandeep Kumar,Pammi Sri Siddhaarth,Guntu Charitasri
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages, 4884 Words, 17 Figures, 10 Tables
[CV-189] Towards Kriging-informed Conditional Diffusion for Regional Sea-Level Data Downscaling
Link: https://arxiv.org/abs/2410.15628
Authors: Subhankar Ghosh,Arun Sharma,Jayant Gupta,Aneesh Subramanian,Shashi Shekhar
Keywords:
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
[CV-190] Topology-Aware Exploration of Circle of Willis for CTA and MRA: Segmentation, Detection, and Classification MICCAI2024
Link: https://arxiv.org/abs/2410.15614
Authors: Minghui Zhang,Xin You,Hanxiao Zhang,Yun Gu
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
Comments: Participation technical report for TopCoW24 challenge @ MICCAI 2024
[CV-191]
Link: https://arxiv.org/abs/2410.15521
Authors: Yuhang Li,Shiqi Chen,Bijie Bai,Aydogan Ozcan
Keywords:
Categories: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Applied Physics (physics.app-ph)
Comments: 21 Pages, 8 Figures
[CV-192] AttCDCNet: Attention-enhanced Chest Disease Classification using X-Ray Images
Link: https://arxiv.org/abs/2410.15437
Authors: Omar Hesham Khater,Abdullahi Sani Shuaib,Sami Ul Haq,Abdul Jabbar Siddiqui
Keywords:
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-193] Discriminating image representations with principal distortions
Link: https://arxiv.org/abs/2410.15433
Authors: Jenelle Feather,David Lipshutz,Sarah E. Harvey,Alex H. Williams,Eero P. Simoncelli
Keywords:
Categories: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
[CV-194] Improving 3D Medical Image Segmentation at Boundary Regions using Local Self-attention and Global Volume Mixing
Link: https://arxiv.org/abs/2410.15360
Authors: Daniya Najiha Abdul Kareem,Mustansar Fiaz,Noa Novershtern,Jacob Hanna,Hisham Cholakkal
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-195] Extensions on low-complexity DCT approximations for larger blocklengths based on minimal angle similarity
Link: https://arxiv.org/abs/2410.15244
Authors: A. P. Radünz,L. Portella,R. S. Oliveira,F. M. Bayer,R. J. Cintra
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP); Numerical Analysis (math.NA); Methodology (stat.ME)
Comments: Fixed typos. 27 pages, 6 figures, 5 tables
[CV-196] Automated Segmentation and Analysis of Cone Photoreceptors in Multimodal Adaptive Optics Imaging
Link: https://arxiv.org/abs/2410.15158
Authors: Prajol Shrestha,Mikhail Kulyabin,Aline Sindel,Hilde R. Pedersen,Stuart Gilson,Rigmor Baraas,Andreas Maier
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-197] EViT-Unet: U-Net Like Efficient Vision Transformer for Medical Image Segmentation on Mobile and Edge Devices
Link: https://arxiv.org/abs/2410.15036
Authors: Xin Li,Wenhui Zhu,Xuanzhao Dong,Oana M. Dumitrascu,Yalin Wang
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 3 figures
[CV-198] Pathologist-like explainable AI for interpretable Gleason grading in prostate cancer
Link: https://arxiv.org/abs/2410.15012
作者: Gesa Mittmann,Sara Laiouar-Pedari,Hendrik A. Mehrtens,Sarah Haggenmüller,Tabea-Clara Bucher,Tirtha Chanda,Nadine T. Gaisa,Mathias Wagner,Gilbert Georg Klamminger,Tilman T. Rau,Christina Neppl,Eva Maria Compérat,Andreas Gocht,Monika Hämmerle,Niels J. Rupp,Jula Westhoff,Irene Krücken,Maximillian Seidl,Christian M. Schürch,Marcus Bauer,Wiebke Solass,Yu Chun Tam,Florian Weber,Rainer Grobholz,Jaroslaw Augustyniak,Thomas Kalinski,Christian Hörner,Kirsten D. Mertz,Constanze Döring,Andreas Erbersdobler,Gabriele Deubler,Felix Bremmer,Ulrich Sommer,Michael Brodhun,Jon Griffin,Maria Sarah L. Lenon,Kiril Trpkov,Liang Cheng,Fei Chen,Angelique Levi,Guoping Cai,Tri Q. Nguyen,Ali Amin,Alessia Cimadamore,Ahmed Shabaik,Varsha Manucha,Nazeel Ahmad,Nidia Messias,Francesca Sanguedolce,Diana Taheri,Ezra Baraban,Liwei Jia,Rajal B. Shah,Farshid Siadat,Nicole Swarbrick,Kyung Park,Oudai Hassan,Siamak Sakhaie,Michelle R. Downes,Hiroshi Miyamoto,Sean R. Williamson,Tim Holland-Letz,Carolin V. Schneider,Jakob Nikolas Kather,Yuri Tolkach,Titus J. Brinker
Keywords:
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 58 pages, 15 figures (incl. supplementary)
[CV-199] Quanta Video Restoration
Link: https://arxiv.org/abs/2410.14994
Authors: Prateek Chennuri,Yiheng Chi,Enze Jiang,G. M. Dilshan Godaliyadda,Abhiram Gnanasambandam,Hamid R. Sheikh,Istvan Gyongy,Stanley H. Chan
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-200] Non-Invasive to Invasive: Enhancing FFA Synthesis from CFP with a Benchmark Dataset and a Novel Network
Link: https://arxiv.org/abs/2410.14965
Authors: Hongqiu Wang,Zhaohu Xing,Weitong Wu,Yijun Yang,Qingqing Tang,Meixia Zhang,Yanwu Xu,Lei Zhu
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: ACMMM 24 MCHM
[CV-201] A novel approach towards the classification of Bone Fracture from Musculoskeletal Radiography images using Attention Based Transfer Learning
Link: https://arxiv.org/abs/2410.14833
Authors: Sayeda Sanzida Ferdous Ruhi,Fokrun Nahar,Adnan Ferdous Ashrafi
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 3 tables, 4 figures, submitted to 27th International Conference on Computer and Information Technology (ICCIT) to be held during 20-22 December, 2024
[CV-202] Medical AI for Early Detection of Lung Cancer: A Survey
Link: https://arxiv.org/abs/2410.14769
Authors: Guohui Cai,Ying Cai,Zeyu Zhang,Yuanzhouhan Cao,Lin Wu,Daji Ergu,Zhinbin Liao,Yang Zhao
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-203] Brain-Aware Readout Layers in GNNs: Advancing Alzheimer's early Detection and Neuroimaging
Link: https://arxiv.org/abs/2410.14683
Authors: Jiwon Youn,Dong Woo Kang,Hyun Kook Lim,Mansu Kim
Keywords:
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:
Machine Learning
[LG-0] xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
Link: https://arxiv.org/abs/2410.16267
Authors: Michael S. Ryoo,Honglu Zhou,Shrikant Kendre,Can Qin,Le Xue,Manli Shu,Silvio Savarese,Ran Xu,Caiming Xiong,Juan Carlos Niebles
Keywords: efficiently capture temporal, capture temporal information, multimodal language model, multiple frames, multimodal language
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract: We present xGen-MM-Vid (BLIP-3-Video): a multimodal language model for videos, particularly designed to efficiently capture temporal information over multiple frames. BLIP-3-Video takes advantage of the ‘temporal encoder’ in addition to the conventional visual tokenizer, which maps a sequence of tokens over multiple frames into a compact set of visual tokens. This enables BLIP-3-Video to use far fewer visual tokens than its competing models (e.g., 32 vs. 4608 tokens). We explore different types of temporal encoders, including learnable spatio-temporal pooling as well as sequential models like Token Turing Machines. We experimentally confirm that BLIP-3-Video obtains video question-answering accuracies comparable to much larger state-of-the-art models (e.g., 34B), while being much smaller (i.e., 4B) and more efficient by using fewer visual tokens. The project website is at this https URL
[LG-1] Revisiting Deep Feature Reconstruction for Logical and Structural Industrial Anomaly Detection
Link: https://arxiv.org/abs/2410.16255
Authors: Sukanya Patra,Souhaib Ben Taieb
Keywords: alter object appearances, presents challenges due, diverse anomaly types, Industrial anomaly detection, limited training data
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted in Transactions on Machine Learning Research (TMLR). Link to OpenReview: this https URL
Abstract: Industrial anomaly detection is crucial for quality control and predictive maintenance, but it presents challenges due to limited training data, diverse anomaly types, and external factors that alter object appearances. Existing methods commonly detect structural anomalies, such as dents and scratches, by leveraging multi-scale features from image patches extracted through deep pre-trained networks. However, significant memory and computational demands often limit their practical application. Additionally, detecting logical anomalies, such as images with missing or excess elements, requires an understanding of spatial relationships that traditional patch-based methods fail to capture. In this work, we address these limitations by focusing on Deep Feature Reconstruction (DFR), a memory- and compute-efficient approach for detecting structural anomalies. We further enhance DFR into a unified framework, called ULSAD, which is capable of detecting both structural and logical anomalies. Specifically, we refine the DFR training objective to improve performance in structural anomaly detection, while introducing an attention-based loss mechanism using a global autoencoder-like network to handle logical anomaly detection. Our empirical evaluation across five benchmark datasets demonstrates the performance of ULSAD in detecting and localizing both structural and logical anomalies, outperforming eight state-of-the-art methods. An extensive ablation study further highlights the contribution of each component to the overall performance improvement. Our code is available at this https URL
[LG-2] Distribution Learning with Valid Outputs Beyond the Worst-Case
Link: https://arxiv.org/abs/2410.16253
Authors: Nick Rittler,Kamalika Chaudhuri
Keywords: Generative models, times produce, unnatural sounds, images with generation, generation artifacts
Categories: Machine Learning (cs.LG)
Comments:
Abstract: Generative models at times produce “invalid” outputs, such as images with generation artifacts and unnatural sounds. Validity-constrained distribution learning attempts to address this problem by requiring that the learned distribution have a provably small fraction of its mass in invalid parts of space, something which standard loss minimization does not always ensure. To this end, a learner in this model can guide the learning via “validity queries”, which allow it to ascertain the validity of individual examples. Prior work on this problem takes a worst-case stance, showing that proper learning requires an exponential number of validity queries, and demonstrating an improper algorithm which, while generating guarantees in a wide range of settings, makes an atypical polynomial number of validity queries. In this work, we take a first step towards characterizing regimes where guaranteeing validity is easier than in the worst case. We show that when the data distribution lies in the model class and the log-loss is minimized, the number of samples required to ensure validity has a weak dependence on the validity requirement. Additionally, we show that when the validity region belongs to a VC-class, a limited number of validity queries are often sufficient.
[LG-3] Implicit Regularization for Tubal Tensor Factorizations via Gradient Descent
Link: https://arxiv.org/abs/2410.16247
Authors: Santhosh Karnik,Anna Veselovska,Mark Iwen,Felix Krahmer
Keywords: provide a rigorous, rigorous analysis, lazy training regime, implicit regularization, tensor factorization problem
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: 58 pages, 4 figures
Abstract: We provide a rigorous analysis of implicit regularization in an overparametrized tensor factorization problem beyond the lazy training regime. For matrix factorization problems, this phenomenon has been studied in a number of works. A particular challenge has been to design universal initialization strategies which provably lead to implicit regularization in gradient-descent methods. At the same time, it has been argued by Cohen et al. (2016) that more general classes of neural networks can be captured by considering tensor factorizations. However, in the tensor case, implicit regularization has only been rigorously established for gradient flow or in the lazy training regime. In this paper, we prove the first tensor result of its kind for gradient descent rather than gradient flow. We focus on the tubal tensor product and the associated notion of low tubal rank, encouraged by the relevance of this model for image data. We establish that gradient descent in an overparametrized tensor factorization model with a small random initialization exhibits an implicit bias towards solutions of low tubal rank. Our theoretical findings are illustrated in an extensive set of numerical simulations showcasing the dynamics predicted by our theory as well as the crucial role of using a small random initialization.
[LG-4] MoRE: Multi-Modal Contrastive Pre-training with Transformers on X-Rays, ECGs, and Diagnostic Report
Link: https://arxiv.org/abs/2410.16239
Authors: Samrajya Thapa,Koushik Howlader,Subhankar Bhattacharjee,Wei Le
Keywords: Multi-Modal Contrastive Pre-training, Contrastive Pre-training Framework, synergistically combines X-rays, Contrastive Pre-training, Pre-training Framework
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures, 9 tables. Supplementary detail in Appendix. Code made available in Github for reproducibility
Abstract: In this paper, we introduce a novel Multi-Modal Contrastive Pre-training Framework that synergistically combines X-rays, electrocardiograms (ECGs), and radiology/cardiology reports. Our approach leverages transformers to encode these diverse modalities into a unified representation space, aiming to enhance diagnostic accuracy and facilitate comprehensive patient assessments. We utilize LoRA-Peft to significantly reduce trainable parameters in the LLM and incorporate a recent linear attention dropping strategy in the Vision Transformer (ViT) for smoother attention. Furthermore, we provide novel multimodal attention explanations and retrieval for our model. To the best of our knowledge, we are the first to propose an integrated model that combines X-ray, ECG, and Radiology/Cardiology Report with this approach. By utilizing contrastive loss, MoRE effectively aligns modality-specific features into a coherent embedding, which supports various downstream tasks such as zero-shot classification and multimodal retrieval. Employing our proposed methodology, we achieve state-of-the-art (SOTA) results on the Mimic-IV, CheXpert, Edema Severity, and PtbXl downstream datasets, surpassing existing multimodal approaches. Our proposed framework shows significant improvements in capturing intricate inter-modal relationships and in robustness for medical diagnosis, establishing a foundation for future research in multimodal learning in the healthcare sector.
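The abstract above says MoRE uses a contrastive loss to align modality-specific features but does not give its exact form. As a rough illustration only, the sketch below shows a CLIP-style symmetric InfoNCE loss between two modalities in plain Python. The two-modality setup, the `temperature` value, and the toy vectors named `xray` and `report` are assumptions for illustration, not the authors' implementation.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(a, b, temperature=0.1):
    """Symmetric InfoNCE: matched pairs (a[i], b[i]) should score highest."""
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    # Cosine similarities scaled by the (assumed) temperature.
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

# Hypothetical toy embeddings: row i of each list belongs to the same patient.
xray = [[1.0, 0.0], [0.0, 1.0]]
report = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = symmetric_contrastive_loss(xray, report)
shuffled_loss = symmetric_contrastive_loss(xray, report[::-1])
print(aligned_loss, shuffled_loss)
```

With correctly paired rows the loss is small, and reversing one modality's order makes it much larger, which is the signal that pulls matched embeddings together during pre-training.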
[LG-5] A Realistic Threat Model for Large Language Model Jailbreaks
Link: https://arxiv.org/abs/2410.16222
Authors: Valentyn Boreiko,Alexander Panfilov,Vaclav Voracek,Matthias Hein,Jonas Geiping
Keywords: obtain harmful responses, plethora of jailbreaking, proposed to obtain, obtain harmful, harmful responses
Categories: Machine Learning (cs.LG)
Comments:
Abstract: A plethora of jailbreaking attacks have been proposed to obtain harmful responses from safety-tuned LLMs. In their original settings, these methods all largely succeed in coercing the target output, but their attacks vary substantially in fluency and computational effort. In this work, we propose a unified threat model for the principled comparison of these methods. Our threat model combines constraints in perplexity, measuring how far a jailbreak deviates from natural text, and computational budget, in total FLOPs. For the former, we build an N-gram model on 1T tokens, which, in contrast to model-based perplexity, allows for an LLM-agnostic and inherently interpretable evaluation. We adapt popular attacks to this new, realistic threat model, with which we, for the first time, benchmark these attacks on equal footing. After a rigorous comparison, we not only find attack success rates against safety-tuned modern models to be lower than previously presented but also find that attacks based on discrete optimization significantly outperform recent LLM-based attacks. Being inherently interpretable, our threat model allows for a comprehensive analysis and comparison of jailbreak attacks. We find that effective attacks exploit and abuse infrequent N-grams, either selecting N-grams absent from real-world text or rare ones, e.g. specific to code datasets.
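To make the perplexity constraint described above concrete, here is a minimal sketch of N-gram perplexity on a toy corpus. It assumes a bigram model with add-one smoothing and whitespace tokenization. The paper builds its model on 1T tokens and does not specify these details in the abstract, so every name and choice below is illustrative.

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Count how often each token appears as a context and each bigram occurs."""
    contexts = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens, tokens[1:]))
    return contexts, bigrams

def bigram_perplexity(tokens, contexts, bigrams, vocab_size):
    """Perplexity of `tokens` under an add-one-smoothed bigram model."""
    pairs = list(zip(tokens, tokens[1:]))
    log_prob = 0.0
    for prev, cur in pairs:
        # Add-one smoothing keeps unseen bigrams at a small non-zero probability.
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(pairs))

# Toy reference corpus standing in for natural text.
corpus = "the cat sat on the mat and the dog sat on the rug".split()
contexts, bigrams = train_bigram(corpus)
vocab_size = len(set(corpus))

natural = "the cat sat on the rug".split()
unusual = "rug the on mat dog cat".split()
ppl_natural = bigram_perplexity(natural, contexts, bigrams, vocab_size)
ppl_unusual = bigram_perplexity(unusual, contexts, bigrams, vocab_size)
print(ppl_natural, ppl_unusual)
```

A threat model of this kind would accept a candidate jailbreak only if its perplexity stays below a chosen threshold. Because the score decomposes into individual N-gram counts, one can inspect which rare N-grams drive a high value.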
[LG-6] Comprehensive benchmarking of large language models for RNA secondary structure prediction
Link: https://arxiv.org/abs/2410.16212
Authors: L.I. Zablocki,L.A. Bugnon,M. Gerard,L. Di Persia,G. Stegmayer,D.H. Milone
Keywords: DNA and proteins, developed recently, large language models, RNA, Inspired
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments:
Abstract: Inspired by the success of large language models (LLMs) for DNA and proteins, several LLMs for RNA have been developed recently. An RNA-LLM uses large datasets of RNA sequences to learn, in a self-supervised way, how to represent each RNA base with a semantically rich numerical vector. This is done under the hypothesis that obtaining high-quality RNA representations can enhance data-costly downstream tasks. Among them, predicting the secondary structure is a fundamental task for uncovering RNA functional mechanisms. In this work we present a comprehensive experimental analysis of several pre-trained RNA-LLMs, comparing them for the RNA secondary structure prediction task in a unified deep learning framework. The RNA-LLMs were assessed with increasing generalization difficulty on benchmark datasets. Results showed that two LLMs clearly outperform the other models, and revealed significant challenges for generalization in low-homology scenarios.
[LG-7] Compute-Constrained Data Selection
Link: https://arxiv.org/abs/2410.16208
Authors: Junjie Oscar Yin,Alexander M. Rush
Keywords: Data selection, selection scales directly, training data needed, data selection scales, Data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract: Data selection can reduce the amount of training data needed to finetune LLMs; however, the efficacy of data selection scales directly with its compute. Motivated by the practical challenge of compute-constrained finetuning, we consider the setting in which both the cost of selecting data and training are budgeted for. We first formalize the problem of data selection with a cost-aware utility function, and model the data selection problem as trading off initial-selection cost for training gain. We run a comprehensive sweep of experiments across multiple tasks, varying compute budget by scaling finetuning tokens, model sizes, and data selection compute. These experiments show the validity of this model in real-world experiments. Interestingly, we find that many powerful data selection methods are almost never compute-optimal, and that cheaper data selection alternatives dominate both from a theoretical and empirical perspective.
[LG-8] CoT-TL: Low-Resource Temporal Knowledge Representation of Planning Instructions Using Chain-of-Thought Reasoning IROS2024
Link: https://arxiv.org/abs/2410.16207
Authors: Kumar Manas,Stefan Zwicklbauer,Adrian Paschke
Keywords: Autonomous agents, interpreting uncertain natural, Linear Temporal Logic, agents often face, face the challenge
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
Comments: Accepted for publication in Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024), Abu Dhabi 14-18 October 2024
Abstract: Autonomous agents often face the challenge of interpreting uncertain natural language instructions for planning tasks. Representing these instructions as Linear Temporal Logic (LTL) enables planners to synthesize actionable plans. We introduce CoT-TL, a data-efficient in-context learning framework for translating natural language specifications into LTL representations. CoT-TL addresses the limitations of large language models, which typically rely on extensive fine-tuning data, by extending chain-of-thought reasoning and semantic roles to align with the requirements of formal logic creation. This approach enhances the transparency and rationale behind LTL generation, fostering user trust. CoT-TL achieves state-of-the-art accuracy across three diverse datasets in low-data scenarios, outperforming existing methods without fine-tuning or intermediate translations. To improve reliability and minimize hallucinations, we incorporate model checking to validate the syntax of the generated LTL output. We further demonstrate CoT-TL’s effectiveness through ablation studies and evaluations on unseen LTL structures and formulas in a new dataset. Finally, we validate CoT-TL’s practicality by integrating it into a QuadCopter for multi-step drone planning based on natural language instructions.
[LG-9] Systematic Review: Text Processing Algorithms in Machine Learning and Deep Learning for Mental Health Detection on Social Media
Link: https://arxiv.org/abs/2410.16204
Authors: Yuchen Cao,Jianglai Dai,Zhongyan Wang,Yeyubei Zhang,Xiaorui Shen,Yunchong Liu,Yexin Tian
Keywords: depression necessitates innovative, necessitates innovative detection, early intervention, global rise, necessitates innovative
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract: The global rise in depression necessitates innovative detection methods for early intervention. Social media provides a unique opportunity to identify depression through user-generated posts. This systematic review evaluates machine learning (ML) models for depression detection on social media, focusing on biases and methodological challenges throughout the ML lifecycle. A search of PubMed, IEEE Xplore, and Google Scholar identified 47 relevant studies published after 2010. The Prediction model Risk Of Bias ASsessment Tool (PROBAST) was utilized to assess methodological quality and risk of bias. Significant biases impacting model reliability and generalizability were found. There is a predominant reliance on Twitter (63.8%) and English-language content (over 90%), with most studies focusing on users from the United States and Europe. Non-probability sampling methods (approximately 80%) limit representativeness. Only 23% of studies explicitly addressed linguistic nuances like negations, crucial for accurate sentiment analysis. Inconsistent hyperparameter tuning was observed, with only 27.7% properly tuning models. About 17% did not adequately partition data into training, validation, and test sets, risking overfitting. While 74.5% used appropriate evaluation metrics for imbalanced data, others relied on accuracy without addressing class imbalance, potentially skewing results. Reporting transparency varied, often lacking critical methodological details. These findings highlight the need to diversify data sources, standardize preprocessing protocols, ensure consistent model development practices, address class imbalance, and enhance reporting transparency. By overcoming these challenges, future research can develop more robust and generalizable ML models for depression detection on social media, contributing to improved mental health outcomes globally.
[LG-10] A Trust-Region Method for Graphical Stein Variational Inference
Link: https://arxiv.org/abs/2410.16195
Authors: Liam Pavlovic, David M. Rosen
Keywords: Stein variational inference, sample-based approximate Bayesian, approximate Bayesian inference, Stein variational, Bayesian inference
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract:Stein variational inference (SVI) is a sample-based approximate Bayesian inference technique that generates a sample set by jointly optimizing the samples’ locations to minimize an information-theoretic measure of discrepancy with the target probability distribution. SVI thus provides a fast and significantly more sample-efficient approach to Bayesian inference than traditional (random-sampling-based) alternatives. However, the optimization techniques employed in existing SVI methods struggle to address problems in which the target distribution is high-dimensional, poorly-conditioned, or non-convex, which severely limits the range of their practical applicability. In this paper, we propose a novel trust-region optimization approach for SVI that successfully addresses each of these challenges. Our method builds upon prior work in SVI by leveraging conditional independences in the target distribution (to achieve high-dimensional scaling) and second-order information (to address poor conditioning), while additionally providing an effective adaptive step control procedure, which is essential for ensuring convergence on challenging non-convex optimization problems. Experimental results show our method achieves superior numerical performance, both in convergence rate and sample accuracy, and scales better in high-dimensional distributions, than previous SVI techniques.
[LG-11] MagicPIG: LSH Sampling for Efficient LLM Generation
Link: https://arxiv.org/abs/2410.16179
Authors: Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, Beidi Chen
Keywords: Large language models, Large language, gained significant attention, long context windows, windows have gained
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract:Large language models (LLMs) with long context windows have gained significant attention. However, the KV cache, stored to avoid re-computation, becomes a bottleneck. Various dynamic sparse or TopK-based attention approximation methods have been proposed to leverage the common insight that attention is sparse. In this paper, we first show that TopK attention itself suffers from quality degradation in certain downstream tasks because attention is not always as sparse as expected. Rather than selecting the keys and values with the highest attention scores, sampling with theoretical guarantees can provide a better estimation for attention output. To make the sampling-based approximation practical in LLM generation, we propose MagicPIG, a heterogeneous system based on Locality Sensitive Hashing (LSH). MagicPIG significantly reduces the workload of attention computation while preserving high accuracy for diverse tasks. MagicPIG stores the LSH hash tables and runs the attention computation on the CPU, which allows it to serve longer contexts and larger batch sizes with high approximation accuracy. MagicPIG can improve decoding throughput by $1.9\sim3.9\times$ across various GPU hardware and achieve 110ms decoding latency on a single RTX 4090 for Llama-3.1-8B-Instruct model with a context of 96k tokens. The code is available at this https URL.
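As a rough illustration of locality-sensitive hashing in general, the sketch below implements SimHash (random-hyperplane LSH) in plain Python. It is an assumption that this family is a useful mental model here; the abstract does not say which LSH scheme MagicPIG uses, and the function names `make_hyperplanes` and `simhash` are hypothetical.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def simhash(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(v * h for v, h in zip(vector, plane)) >= 0.0 else 0
        for plane in hyperplanes
    )


def hamming(a, b):
    """Number of differing bits between two signatures."""
    return sum(x != y for x, y in zip(a, b))


planes = make_hyperplanes(dim=4, n_bits=16, seed=0)
query = [0.9, 0.1, -0.3, 0.5]
scaled = [2.0 * q for q in query]   # same direction as the query
opposite = [-q for q in query]      # opposite direction

# Vectors with a small angle to the query tend to share signature bits,
# so matching signatures can serve as a cheap candidate filter for keys.
print(hamming(simhash(query, planes), simhash(scaled, planes)))
print(hamming(simhash(query, planes), simhash(opposite, planes)))
```

The signature depends only on the direction of a vector, which is why the rescaled query collides on every bit while the opposite vector differs on every bit.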
[LG-12] DMM: Distributed Matrix Mechanism for Differentially-Private Federated Learning using Packed Secret Sharing
Link: https://arxiv.org/abs/2410.16161
Authors: Alexander Bienstock, Ujjwal Kumar, Antigoni Polychroniadou
Keywords: Federated Learning, traction recently, industry and academia, machine learning model, gained lots
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Abstract:Federated Learning (FL) has gained a lot of traction recently, both in industry and academia. In FL, a machine learning model is trained using data from various end-users arranged in committees across several rounds. Since such data can often be sensitive, a primary challenge in FL is providing privacy while still retaining utility of the model. Differential Privacy (DP) has become the main measure of privacy in the FL setting. DP comes in two flavors: central and local. In the former, a centralized server is trusted to receive the users’ raw gradients from a training step, and then perturb their aggregation with some noise before releasing the next version of the model. In the latter (more private) setting, noise is applied on users’ local devices, and only the aggregation of users’ noisy gradients is revealed even to the server. Great strides have been made in improving the privacy-utility trade-off in the central DP setting by utilizing the so-called matrix mechanism. However, progress has mostly stalled in the local DP setting. In this work, we introduce the distributed matrix mechanism to achieve the best of both worlds: local DP together with the better privacy-utility trade-off of the matrix mechanism. We accomplish this by proposing a cryptographic protocol that securely transfers sensitive values across rounds, which makes use of packed secret sharing. This protocol accommodates the dynamic participation of users per training round required by FL, including those that may drop out from the computation. We provide experiments which show that our mechanism indeed significantly improves the privacy-utility trade-off of FL models compared to previous local DP mechanisms, with little added overhead.
[LG-13] Metric as Transform: Exploring beyond Affine Transform for Interpretable Neural Network
Link: https://arxiv.org/abs/2410.16159
Authors: Suman Sapkota
Keywords: Artificial Neural Networks, Radial Basis Function, Basis Function Network, Artificial Neural, Convolutional Neural Network
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments: 22 pages, 20 figures, 3 tables
Abstract:Artificial Neural Networks of varying architectures are generally paired with affine transformation at the core. However, we find dot product neurons with global influence less interpretable as compared to the local influence of Euclidean distance (as used in Radial Basis Function Network). In this work, we explore the generalization of dot product neurons to $l^p$-norm, metrics, and beyond. We find that using metrics as the transform performs similarly to the affine transform when used in a MultiLayer Perceptron or Convolutional Neural Network. Moreover, we explore various properties of metrics, compare them with the affine transform, and present multiple cases where metrics seem to provide better interpretability. We develop interpretable local-dictionary-based neural networks and use them to understand and reject adversarial examples.
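The contrast between a dot-product neuron and a distance-based neuron can be sketched in a few lines. This is a minimal illustration under assumed definitions (the names `dot_neuron` and `distance_neuron` and the toy inputs are hypothetical, not the paper's implementation): the dot product keeps growing along the weight direction, whereas the $l^p$ distance is smallest exactly at the stored center.

```python
def dot_neuron(x, w, b=0.0):
    """Affine pre-activation: w . x + b (responds along a whole direction)."""
    return sum(xi * wi for xi, wi in zip(x, w)) + b


def distance_neuron(x, center, p=2.0):
    """l^p distance between the input and a stored center (responds locally)."""
    return sum(abs(xi - ci) ** p for xi, ci in zip(x, center)) ** (1.0 / p)


center = [1.0, 1.0]
near = [1.0, 1.0]    # identical to the center
far = [10.0, 10.0]   # same direction, much larger norm

# The dot product is larger for the far point than for the point at the center.
print(dot_neuron(near, center), dot_neuron(far, center))
# The distance is zero at the center and grows as the input moves away.
print(distance_neuron(near, center), distance_neuron(far, center))
```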
[LG-14] Unsupervised Replay Strategies for Continual Learning with Limited Data
Link: https://arxiv.org/abs/2410.16154
Authors: Anthony Bazhenov, Pahan Dewasurendra, Giri P. Krishnan, Jean Erik Delanois
Keywords: Artificial neural networks, Artificial neural, neural networks, face challenges, challenges with continuous
Categories: Machine Learning (cs.LG)
Abstract:Artificial neural networks (ANNs) show limited performance with scarce or imbalanced training data and face challenges with continuous learning, such as forgetting previously learned data after training on new tasks. In contrast, the human brain can learn continuously and from just a few examples. This research explores the impact of ‘sleep’, an unsupervised phase incorporating stochastic activation with local Hebbian learning rules, on ANNs trained incrementally with limited and imbalanced datasets, specifically MNIST and Fashion MNIST. We discovered that introducing a sleep phase significantly enhanced accuracy in models trained with limited data. When a few tasks were trained sequentially, sleep replay not only rescued previously learned information that had been catastrophically forgotten following new task training but often enhanced performance in prior tasks, especially those trained with limited data. This study highlights the multifaceted role of sleep replay in augmenting learning efficiency and facilitating continual learning in ANNs.
[LG-15] Warped Diffusion: Solving Video Inverse Problems with Image Diffusion Models NEURIPS2024
Link: https://arxiv.org/abs/2410.16152
Authors: Giannis Daras, Weili Nie, Karsten Kreis, Alex Dimakis, Morteza Mardani, Nikola Borislavov Kovachki, Arash Vahdat
Keywords: suffers from flickering, space diffusion models, function space diffusion, naively for solving, image models naively
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted in NeurIPS 2024
Abstract:Using image models naively for solving inverse video problems often suffers from flickering, texture-sticking, and temporal inconsistency in generated videos. To tackle these problems, in this paper, we view frames as continuous functions in the 2D space, and videos as a sequence of continuous warping transformations between different frames. This perspective allows us to train function space diffusion models only on images and utilize them to solve temporally correlated inverse problems. The function space diffusion models need to be equivariant with respect to the underlying spatial transformations. To ensure temporal consistency, we introduce a simple post-hoc test-time guidance towards (self)-equivariant solutions. Our method allows us to deploy state-of-the-art latent diffusion models such as Stable Diffusion XL to solve video inverse problems. We demonstrate the effectiveness of our method for video inpainting and $8\times$ video super-resolution, outperforming existing techniques based on noise transformations. We provide generated video results: this https URL.
[LG-16] Small Contributions Small Networks: Efficient Neural Network Pruning Based on Relative Importance
Link: https://arxiv.org/abs/2410.16151
Authors: Mostafa Hussien, Mahmoud Afifi, Kim Khoa Nguyen, Mohamed Cheriet
Keywords: achieving remarkable performance, Recent advancements, scaled neural networks, achieving remarkable, range of tasks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Abstract:Recent advancements have scaled neural networks to unprecedented sizes, achieving remarkable performance across a wide range of tasks. However, deploying these large-scale models on resource-constrained devices poses significant challenges due to substantial storage and computational requirements. Neural network pruning has emerged as an effective technique to mitigate these limitations by reducing model size and complexity. In this paper, we introduce an intuitive and interpretable pruning method based on activation statistics, rooted in information theory and statistical analysis. Our approach leverages the statistical properties of neuron activations to identify and remove weights with minimal contributions to neuron outputs. Specifically, we build a distribution of weight contributions across the dataset and utilize its parameters to guide the pruning process. Furthermore, we propose a Pruning-aware Training strategy that incorporates an additional regularization term to enhance the effectiveness of our pruning method. Extensive experiments on multiple datasets and network architectures demonstrate that our method consistently outperforms several baseline and state-of-the-art pruning techniques.
[LG-17] Modelling Structured Data Learning with Restricted Boltzmann Machines in the Teacher-Student Setting
Link: https://arxiv.org/abs/2410.16150
Authors: Robin Thériault, Francesco Tosello, Daniele Tantari
Keywords: Restricted Boltzmann machines, Restricted Boltzmann, Boltzmann machines, rich underlying structure, generative models capable
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn)
Comments: 51 pages, 21 figures
Abstract:Restricted Boltzmann machines (RBMs) are generative models capable of learning data with a rich underlying structure. We study the teacher-student setting where a student RBM learns structured data generated by a teacher RBM. The amount of structure in the data is controlled by adjusting the number of hidden units of the teacher and the correlations in the rows of the weights, a.k.a. patterns. In the absence of correlations, we validate the conjecture that the performance is independent of the number of teacher patterns and hidden units of the student RBMs, and we argue that the teacher-student setting can be used as a toy model for studying the lottery ticket hypothesis. Beyond this regime, we find that the critical amount of data required to learn the teacher patterns decreases with both their number and correlations. In both regimes, we find that, even with a relatively large dataset, it becomes impossible to learn the teacher patterns if the inference temperature used for regularization is kept too low. In our framework, the student can learn teacher patterns one-to-one or many-to-one, generalizing previous findings about the teacher-student setting with two hidden units to any arbitrary finite number of hidden units.
[LG-18] Towards Combating Frequency Simplicity-biased Learning for Domain Generalization NEURIPS2024
Link: https://arxiv.org/abs/2410.16146
Authors: Xilin He, Jingyu Hu, Qinliang Lin, Cheng Luo, Weicheng Xie, Siyang Song, Muhammad Haris Khan, Linlin Shen
Keywords: learn transferable knowledge, unseen target domains, learning behavior, Domain generalization methods, generalization methods aim
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2024
Abstract:Domain generalization methods aim to learn transferable knowledge from source domains that can generalize well to unseen target domains. Recent studies show that neural networks frequently suffer from a simplicity-biased learning behavior which leads to over-reliance on specific frequency sets, known as frequency shortcuts, instead of semantic information, resulting in poor generalization performance. Although previous data augmentation techniques successfully enhance generalization performance, they tend to apply more frequency shortcuts, thereby causing hallucinations of generalization improvement. In this paper, we aim to prevent such learning behavior of applying frequency shortcuts from a data-driven perspective. Given the theoretical justification of models’ biased learning behavior on different spatial frequency components, which is based on the dataset frequency properties, we argue that the learning behavior on various frequency components could be manipulated by changing the dataset statistical structure in the Fourier domain. Intuitively, as frequency shortcuts are hidden in the dominant and highly dependent frequencies of the dataset structure, dynamically perturbing the over-relied-upon frequency components could prevent the application of frequency shortcuts. To this end, we propose two effective data augmentation modules designed to collaboratively and adaptively adjust the frequency characteristic of the dataset, aiming to dynamically influence the learning behavior of the model and ultimately serving as a strategy to mitigate shortcut learning. Code is available at AdvFrequency (this https URL).
[LG-19] Theoretical Insights into Line Graph Transformation on Graph Learning
Link: https://arxiv.org/abs/2410.16138
Authors: Fan Yang, Xingyue Huang
Keywords: Line graph transformation, line graph corresponds, Line graph, graph, graph transformation
Categories: Machine Learning (cs.LG); Combinatorics (math.CO); Machine Learning (stat.ML)
Comments: 21 pages, code available at this https URL
Abstract:Line graph transformation has been widely studied in graph theory, where each node in a line graph corresponds to an edge in the original graph. This has inspired a series of graph neural networks (GNNs) applied to transformed line graphs, which have proven effective in various graph representation learning tasks. However, there is limited theoretical study on how line graph transformation affects the expressivity of GNN models. In this study, we focus on two types of graphs known to be challenging to the Weisfeiler-Leman (WL) tests: Cai-Fürer-Immerman (CFI) graphs and strongly regular graphs, and show that applying line graph transformation helps exclude these challenging graph properties, thus potentially assisting WL tests in distinguishing these graphs. We empirically validate our findings by conducting a series of experiments that compare the accuracy and efficiency of graph isomorphism tests and GNNs on both line-transformed and original graphs across these graph structure types.
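The line graph construction itself is simple to write down. The sketch below assumes a simple undirected graph given as an edge list (no self-loops); the function name `line_graph` is illustrative and is not taken from the paper's code. Each original edge becomes a node, and two such nodes are adjacent when the original edges share an endpoint.

```python
from itertools import combinations


def line_graph(edges):
    """Return the line graph of a simple undirected graph as an adjacency dict.

    Each key is an original edge (as a sorted tuple); its value is the set of
    other edges that share an endpoint with it.
    """
    # Normalize edge orientation and drop duplicates while preserving order.
    nodes = list(dict.fromkeys(tuple(sorted(e)) for e in edges))
    adjacency = {e: set() for e in nodes}
    for e, f in combinations(nodes, 2):
        if set(e) & set(f):  # the two edges share at least one endpoint
            adjacency[e].add(f)
            adjacency[f].add(e)
    return adjacency


# A path 0-1-2-3 has three edges; its line graph is a path on three nodes.
path = [(0, 1), (1, 2), (2, 3)]
for edge, neighbours in line_graph(path).items():
    print(edge, "->", sorted(neighbours))
```

As a sanity check, the line graph of a triangle is again a triangle, and the line graph of a star with three leaves is also a triangle.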
[LG-20] Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs
Link: https://arxiv.org/abs/2410.16135
Authors: Kang Zhao, Tao Yuan, Han Bao, Zhenfeng Su, Chang Gao, Zhaofeng Sun, Zichen Liang, Liping Jing, Jianfei Chen
Keywords: sparse tensor cores, sparsity, tensor cores, cores on GPUs, M-sparse Transformers
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:To date, 2:4 sparsity has stood as the only sparse pattern that can be accelerated using sparse tensor cores on GPUs. In practice, 2:4 sparsity often yields low actual speedups ($\leq 1.3$) and requires fixed sparse ratios, meaning that other ratios, such as 4:8, 8:16, or those exceeding 50% sparsity, do not incur any speedups on GPUs. Recent studies suggest that V:N:M sparsity is promising in addressing these limitations of 2:4 sparsity. However, regarding accuracy, the effects of V:N:M sparsity on broader Transformer models, such as vision Transformers and large language models (LLMs), are largely unexamined. Moreover, some specific issues related to V:N:M sparsity, such as how to select appropriate V and M values, remain unresolved. In this study, we thoroughly investigate the application of V:N:M sparsity in vision models and LLMs across multiple tasks, from pretraining to downstream tasks. We propose three key approaches to enhance the applicability and accuracy of V:N:M-sparse Transformers, including heuristic V and M selection, V:N:M-specific channel permutation, and three-staged LoRA training techniques. Experimental results show that, with our methods, the DeiT-small achieves lossless accuracy at 64:2:5 sparsity, while the DeiT-base maintains accuracy even at 64:2:8 sparsity. In addition, the fine-tuned Llama2-7B at 64:2:5 sparsity performs comparably or better than training-free 2:4 sparse alternatives on downstream tasks. More importantly, V:N:M-sparse Transformers offer a wider range of speedup-accuracy trade-offs compared to 2:4 sparsity. Overall, our exploration helps V:N:M sparsity act as a truly effective acceleration solution for Transformers in cost-sensitive inference scenarios.
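For readers unfamiliar with the baseline pattern, the following sketch shows plain magnitude-based N:M pruning of a weight row, with 2:4 as the default. It is only an illustration of the 2:4 constraint (at most 2 nonzeros in every group of 4 consecutive weights); it does not implement the paper's V:N:M scheme, channel permutation, or LoRA training, and the function name `prune_n_of_m` is hypothetical.

```python
def prune_n_of_m(row, n=2, m=4):
    """Keep the n largest-magnitude values in each group of m consecutive weights.

    All other weights in the group are set to zero. With n=2, m=4 this yields
    the 2:4 pattern (50% sparsity with a fixed per-group structure).
    """
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    pruned = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Indices of the n entries with the largest absolute value.
        keep = set(sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n])
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


weights = [0.1, -0.9, 0.5, 0.05, 1.0, 0.2, -0.3, 0.0]
print(prune_n_of_m(weights))
# → [0.0, -0.9, 0.5, 0.0, 1.0, 0.0, -0.3, 0.0]
```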
[LG-21] SMART: Self-learning Meta-strategy Agent for Reasoning Tasks
Link: https://arxiv.org/abs/2410.16128
Authors: Rongxing Liu, Kumar Shridhar, Manish Prajapat, Patrick Xia, Mrinmaya Sachan
Keywords: Tasks requiring deductive, requiring deductive reasoning, involving multiple steps, demand adaptive strategies, rationales or programs
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:Tasks requiring deductive reasoning, especially those involving multiple steps, often demand adaptive strategies such as intermediate generation of rationales or programs, as no single approach is universally optimal. While Language Models (LMs) can enhance their outputs through iterative self-refinement and strategy adjustments, they frequently fail to apply the most effective strategy in their first attempt. This inefficiency raises the question: Can LMs learn to select the optimal strategy in the first attempt, without a need for refinement? To address this challenge, we introduce SMART (Self-learning Meta-strategy Agent for Reasoning Tasks), a novel framework that enables LMs to autonomously learn and select the most effective strategies for various reasoning tasks. We model the strategy selection process as a Markov Decision Process and leverage reinforcement learning-driven continuous self-improvement to allow the model to find the suitable strategy to solve a given task. Unlike traditional self-refinement methods that rely on multiple inference passes or external feedback, SMART allows an LM to internalize the outcomes of its own reasoning processes and adjust its strategy accordingly, aiming for correct solutions on the first attempt. Our experiments across various reasoning datasets and with different model architectures demonstrate that SMART significantly enhances the ability of models to choose optimal strategies without external guidance (+15 points on the GSM8K dataset). By achieving higher accuracy with a single inference pass, SMART not only improves performance but also reduces computational costs for refinement-based strategies, paving the way for more efficient and intelligent reasoning in LMs.
[LG-22] MNIST-Nd: a set of naturalistic datasets to benchmark clustering across dimensions
Link: https://arxiv.org/abs/2410.16124
Authors: Polina Turishcheva, Laura Hansel, Martin Ritzert, Marissa A. Weis, Alexander S. Ecker
Keywords: Driven by advances, large-scale high-dimensional datasets, recording technology, large-scale high-dimensional, scientific disciplines
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract:Driven by advances in recording technology, large-scale high-dimensional datasets have emerged across many scientific disciplines. Especially in biology, clustering is often used to gain insights into the structure of such datasets, for instance to understand the organization of different cell types. However, clustering is known to scale poorly to high dimensions, even though the exact impact of dimensionality is unclear as current benchmark datasets are mostly two-dimensional. Here we propose MNIST-Nd, a set of synthetic datasets that share a key property of real-world datasets, namely that individual samples are noisy and clusters do not perfectly separate. MNIST-Nd is obtained by training mixture variational autoencoders with 2 to 64 latent dimensions on MNIST, resulting in six datasets with comparable structure but varying dimensionality. It thus offers the chance to disentangle the impact of dimensionality on clustering. Preliminary common clustering algorithm benchmarks on MNIST-Nd suggest that Leiden is the most robust for growing dimensions.
[LG-23] Extracting Spatiotemporal Data from Gradients with Large Language Models
Link: https://arxiv.org/abs/2410.16121
Authors: Lele Zheng, Yang Cao, Renhe Jiang, Kenjiro Taura, Yulong Shen, Sheng Li, Masatoshi Yoshikawa
Keywords: Recent works show, Recent works, key privacy promise, spatiotemporal federated learning, sensitive user data
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: arXiv admin note: substantial text overlap with arXiv:2407.08529
Abstract:Recent works show that sensitive user data can be reconstructed from gradient updates, breaking the key privacy promise of federated learning. While success was demonstrated primarily on image data, these methods do not directly transfer to other domains, such as spatiotemporal data. To understand privacy risks in spatiotemporal federated learning, we first propose Spatiotemporal Gradient Inversion Attack (ST-GIA), a gradient attack algorithm tailored to spatiotemporal data that successfully reconstructs the original location from gradients. Furthermore, the absence of priors in attacks on spatiotemporal data has hindered the accurate reconstruction of real client data. To address this limitation, we propose ST-GIA+, which utilizes an auxiliary language model to guide the search for potential locations, thereby successfully reconstructing the original data from gradients. In addition, we design an adaptive defense strategy to mitigate gradient inversion attacks in spatiotemporal federated learning. By dynamically adjusting the perturbation levels, we can offer tailored protection for varying rounds of training data, thereby achieving a better trade-off between privacy and utility than current state-of-the-art methods. Through intensive experimental analysis on three real-world datasets, we reveal that the proposed defense strategy can well preserve the utility of spatiotemporal federated learning with effective security protection.
[LG-24] SeaDAG: Semi-autoregressive Diffusion for Conditional Directed Acyclic Graph Generation
Link: https://arxiv.org/abs/2410.16119
Authors: Xinyi Zhou, Xing Li, Yingzhao Lian, Yiwen Wang, Lei Chen, Mingxuan Yuan, Jianye Hao, Guangyong Chen, Pheng Ann Heng
Keywords: Directed Acyclic Graphs, Directed Acyclic, Acyclic Graphs, introduce SeaDAG, Directed
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:We introduce SeaDAG, a semi-autoregressive diffusion model for conditional generation of Directed Acyclic Graphs (DAGs). Considering their inherent layer-wise structure, we simulate layer-wise autoregressive generation by designing different denoising speeds for different layers. Unlike conventional autoregressive generation that lacks a global graph structure view, our method maintains a complete graph structure at each diffusion step, enabling operations such as property control that require the full graph structure. Leveraging this capability, we evaluate the DAG properties during training by employing a graph property decoder. We explicitly train the model to learn graph conditioning with a condition loss, which enhances the diffusion model’s capacity to generate graphs that are both realistic and aligned with specified properties. We evaluate our method on two representative conditional DAG generation tasks: (1) circuit generation from truth tables, where precise DAG structures are crucial for realizing circuit functionality, and (2) molecule generation based on quantum properties. Our approach demonstrates promising results, generating high-quality and realistic DAGs that closely align with given conditions.
[LG-25] Interpreting Microbiome Relative Abundance Data Using Symbolic Regression
Link: https://arxiv.org/abs/2410.16109
Authors: Swagatam Haldar, Christoph Stein-Thoeringer, Vadim Borisov
Keywords: developing effective diagnostic, therapeutic strategies, developing effective, effective diagnostic, diagnostic and therapeutic
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 5 pages, 2 figures
Abstract:Understanding the complex interactions within the microbiome is crucial for developing effective diagnostic and therapeutic strategies. Traditional machine learning models often lack interpretability, which is essential for clinical and biological insights. This paper explores the application of symbolic regression (SR) to microbiome relative abundance data, with a focus on colorectal cancer (CRC). SR, known for its high interpretability, is compared against traditional machine learning models, e.g., random forest, gradient boosting decision trees. These models are evaluated based on performance metrics such as F1 score and accuracy. We utilize 71 studies encompassing, from various cohorts, over 10,000 samples across 749 species features. Our results indicate that SR not only competes reasonably well in terms of predictive performance, but also excels in model interpretability. SR provides explicit mathematical expressions that offer insights into the biological relationships within the microbiome, a crucial advantage for clinical and biological interpretation. Our experiments also show that SR can help understand complex models like XGBoost via knowledge distillation. To aid in reproducibility and further research, we have made the code openly available at this https URL .
[LG-26] Addressing Spectral Bias of Deep Neural Networks by Multi-Grade Deep Learning
Link: https://arxiv.org/abs/2410.16105
Authors: Ronglong Fang, Yuesheng Xu
Keywords: DNNs typically exhibit, high-frequency features, typically exhibit, exhibit a tendency, tendency to prioritize
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Deep neural networks (DNNs) suffer from spectral bias: they tend to prioritize learning the lower-frequency components of a function and struggle to capture its high-frequency features. This paper aims to address this issue. Notice that a function having only low-frequency components may be well represented by a shallow neural network (SNN), a network having only a few layers. By observing that a composition of low-frequency functions can effectively approximate a high-frequency function, we propose to learn a function containing high-frequency components by composing several SNNs, each of which learns certain low-frequency information from the given data. We implement the proposed idea by exploiting the multi-grade deep learning (MGDL) model, a recently introduced model that trains a DNN incrementally, grade by grade: the current grade learns, from the residue of the previous grade, only an SNN composed with the SNNs trained in the preceding grades, which serve as features. We apply MGDL to synthetic, manifold, colored-image, and MNIST datasets, all characterized by the presence of high-frequency features. Our study reveals that MGDL excels at representing functions containing high-frequency information. Specifically, the neural networks learned in each grade adeptly capture some low-frequency information, allowing their compositions with SNNs learned in the previous grades to effectively represent the high-frequency features. Our experimental results underscore the efficacy of MGDL in addressing the spectral bias inherent in DNNs. By leveraging MGDL, we offer insights into overcoming the spectral bias limitation of DNNs, thereby enhancing the performance and applicability of deep learning models in tasks requiring the representation of high-frequency information. This study confirms that the proposed method offers a promising solution to address the spectral bias of DNNs.
[LG-27] LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics
Link: https://arxiv.org/abs/2410.16103
Authors: Thomas Robert, Mher Safaryan, Ionut-Vlad Modoranu, Dan Alistarh
Keywords: performs adaptive optimization, adaptive optimization steps, full parameter space, lower dimensional subspaces, training large models
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 36 pages
Abstract:We introduce LDAdam, a memory-efficient optimizer for training large models, that performs adaptive optimization steps within lower dimensional subspaces, while consistently exploring the full parameter space during training. This strategy keeps the optimizer’s memory footprint to a fraction of the model size. LDAdam relies on a new projection-aware update rule for the optimizer states that allows for transitioning between subspaces, i.e., estimation of the statistics of the projected gradients. To mitigate the errors due to low-rank projection, LDAdam integrates a new generalized error feedback mechanism, which explicitly accounts for both gradient and optimizer state compression. We prove the convergence of LDAdam under standard assumptions, and show that LDAdam allows for accurate and efficient fine-tuning and pre-training of language models.
[LG-28] ExDBN: Exact learning of Dynamic Bayesian Networks
Link: https://arxiv.org/abs/2410.16100
Authors: Pavel Rytíř, Aleš Wodecki, Georgios Korpas, Jakub Mareček
Keywords: recent years, received much attention, attention in recent, capturing causal relationships, utilizing Bayesian networks
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 12 pages
Abstract:Causal learning from data has received much attention in recent years. One way of capturing causal relationships is by utilizing Bayesian networks. There, one recovers a weighted directed acyclic graph, in which random variables are represented by vertices, and the weights associated with each edge represent the strengths of the causal relationships between them. This concept is extended to capture dynamic effects by introducing a dependency on past data, which may be captured by the structural equation model, which is utilized in the present contribution to formulate a score-based learning approach. A mixed-integer quadratic program is formulated and an algorithmic solution proposed, in which the pre-generation of exponentially many acyclicity constraints is avoided by utilizing the so-called branch-and-cut (“lazy constraint”) method. Comparing the novel approach to the state of the art, we show that the proposed approach turns out to produce excellent results when applied to small and medium-sized synthetic instances of up to 25 time-series. Lastly, two interesting applications in bio-science and finance, to which the method is directly applied, further stress the opportunities in developing highly accurate, globally convergent solvers that can handle modest instances.
[LG-29] CartesianMoE: Boosting Knowledge Sharing among Experts via Cartesian Product Routing in Mixture-of-Experts
Link: https://arxiv.org/abs/2410.16077
Authors: Zhenpeng Su, Xing Wu, Zijia Lin, Yizhe Xiong, Minxuan Lv, Guangyuan Ma, Hui Chen, Songlin Hu, Guiguang Ding
Keywords: Large language models, Large language, community recently, attracting much attention, Large
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLM) have been attracting much attention from the community recently, due to their remarkable performance in all kinds of downstream tasks. According to the well-known scaling law, scaling up a dense LLM enhances its capabilities, but also significantly increases the computational complexity. Mixture-of-Experts (MoE) models address that by allowing the model size to grow without substantially raising training or inference costs. Yet MoE models face challenges regarding knowledge sharing among experts, making their performance somehow sensitive to routing accuracy. To tackle that, previous works introduced shared experts and combined their outputs with those of the top K routed experts in an “addition” manner. In this paper, inspired by collective matrix factorization to learn shared knowledge among data, we propose CartesianMoE, which implements more effective knowledge sharing among experts in more like a “multiplication” manner. Extensive experimental results indicate that CartesianMoE outperforms previous MoE models for building LLMs, in terms of both perplexity and downstream task performance. And we also find that CartesianMoE achieves better expert routing robustness.
[LG-30] Near-Optimal Algorithm for Non-Stationary Kernelized Bandits
Link: https://arxiv.org/abs/2410.16052
Authors: Shogo Iwazaki, Shion Takeno
Keywords: time-varying Bayesian optimization, called time-varying Bayesian, unknown reward function, Bayesian optimization, time-varying Bayesian
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 24 pages, 2 figures
Abstract:This paper studies a non-stationary kernelized bandit (KB) problem, also called time-varying Bayesian optimization, where one seeks to minimize the regret under an unknown reward function that varies over time. In particular, we focus on a near-optimal algorithm whose regret upper bound matches the regret lower bound. For this goal, we show the first algorithm-independent regret lower bound for non-stationary KB with squared exponential and Matérn kernels, which reveals that an existing optimization-based KB algorithm with slight modification is near-optimal. However, this existing algorithm suffers from feasibility issues due to its huge computational cost. Therefore, we propose a novel near-optimal algorithm called restarting phased elimination with random permutation (R-PERP), which bypasses the huge computational cost. A technical key point is the simple permutation procedures of query candidates, which enable us to derive a novel tighter confidence bound tailored to the non-stationary problems.
[LG-31] TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling
Link: https://arxiv.org/abs/2410.16033
Authors: Jiahao Qiu, Yifu Lu, Yifan Zeng, Jiacheng Guo, Jiayi Geng, Huazheng Wang, Kaixuan Huang, Yue Wu, Mengdi Wang
Keywords: Inference-time alignment enhances, large language models, requiring additional training, presents challenges due, balancing computational efficiency
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Inference-time alignment enhances the performance of large language models without requiring additional training or fine-tuning but presents challenges due to balancing computational efficiency with high-quality output. Best-of-N (BoN) sampling, as a simple yet powerful approach, generates multiple responses and selects the best one, achieving improved performance but with a high computational cost. We propose TreeBoN, a novel framework that integrates a speculative tree-search strategy into Best-of-N (BoN) Sampling. TreeBoN maintains a set of parent nodes, iteratively branching and pruning low-quality responses, thereby reducing computational overhead while maintaining high output quality. Our approach also leverages token-level rewards from Direct Preference Optimization (DPO) to guide tree expansion and prune low-quality paths. We evaluate TreeBoN using AlpacaFarm, UltraFeedback, GSM8K, HH-RLHF, and TutorEval datasets, demonstrating consistent improvements. Specifically, TreeBoN achieves a 65% win rate at maximum lengths of 192 and 384 tokens, outperforming standard BoN with the same computational cost. Furthermore, TreeBoN achieves around a 60% win rate across longer responses, showcasing its scalability and alignment efficacy.
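For readers unfamiliar with the baseline that the abstract above builds on, plain Best-of-N sampling can be sketched in a few lines: draw N responses, score each with a reward function, and keep the highest-scoring one. This is only the baseline, not TreeBoN's tree search or its DPO-derived token-level rewards. The generator and reward below (`make_toy_generator`, `toy_reward`) are hypothetical stand-ins for a language model and a reward model.

```python
from typing import Callable, Iterable, List


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    reward: Callable[[str, str], float],
    n: int,
) -> str:
    """Sample n responses for the prompt and return the one with the highest reward."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward(prompt, response))


def make_toy_generator(responses: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM: replays a fixed sequence of responses."""
    iterator = iter(responses)
    return lambda prompt: next(iterator)


def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a reward model: counts distinct words."""
    return float(len(set(response.split())))


if __name__ == "__main__":
    generator = make_toy_generator(["a a a", "a b c d", "a b"])
    print(best_of_n("question", generator, toy_reward, n=3))
```

The cost of this baseline grows linearly with N because every candidate is generated in full, which is the overhead the abstract says the tree-search variant aims to reduce.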
[LG-32] TimeMixer++: A General Time Series Pattern Machine for Universal Predictive Analysis
Link: https://arxiv.org/abs/2410.16032
Authors: Shiyu Wang, Jiawei Li, Xiaoming Shi, Zhou Ye, Baichuan Mo, Wenze Lin, Shengtong Ju, Zhixuan Chu, Ming Jin
Keywords: Time series, Time, series, multi-scale time series, Time series analysis
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Time series analysis plays a critical role in numerous applications, supporting tasks such as forecasting, classification, anomaly detection, and imputation. In this work, we present the time series pattern machine (TSPM), a model designed to excel in a broad range of time series tasks through powerful representation and pattern extraction capabilities. Traditional time series models often struggle to capture universal patterns, limiting their effectiveness across diverse tasks. To address this, we define multiple scales in the time domain and various resolutions in the frequency domain, employing various mixing strategies to extract intricate, task-adaptive time series patterns. Specifically, we introduce a general-purpose TSPM that processes multi-scale time series using (1) multi-resolution time imaging (MRTI), (2) time image decomposition (TID), (3) multi-scale mixing (MCM), and (4) multi-resolution mixing (MRM) to extract comprehensive temporal patterns. MRTI transforms multi-scale time series into multi-resolution time images, capturing patterns across both temporal and frequency domains. TID leverages dual-axis attention to extract seasonal and trend patterns, while MCM hierarchically aggregates these patterns across scales. MRM adaptively integrates all representations across resolutions. This method achieves state-of-the-art performance across 8 time series analytical tasks, consistently surpassing both general-purpose and task-specific models. Our work marks a promising step toward the next generation of TSPMs, paving the way for further advancements in time series analysis.
[LG-33] Natural GaLore: Accelerating GaLore for memory-efficient LLM Training and Fine-tuning
Link: https://arxiv.org/abs/2410.16029
Authors: Arijit Das
Keywords: Training LLMs presents, Training LLMs, growing size, memory challenges due, optimizer states
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 10 pages, 3 tables, 3 figures
Abstract:Training LLMs presents significant memory challenges due to growing size of data, weights, and optimizer states. Techniques such as data and model parallelism, gradient checkpointing, and offloading strategies address this issue but are often infeasible due to hardware constraints. To mitigate memory usage, alternative methods like Parameter-Efficient-Fine-Tuning (PEFT) and GaLore approximate weights or optimizer states. PEFT methods, such as LoRA, have gained popularity for fine-tuning LLMs, though they require a full-rank warm start. In contrast, GaLore allows full-parameter learning while being more memory-efficient. This work introduces Natural GaLore, a simple drop in replacement for AdamW, which efficiently applies the inverse Empirical Fisher Information Matrix to low-rank gradients using Woodbury’s Identity. We demonstrate that incorporating second-order information speeds up optimization significantly, especially when the iteration budget is limited. Empirical pretraining on 60M, 130M, 350M, and 1.1B parameter Llama models on C4 data demonstrate significantly lower perplexity over GaLore without additional memory overhead. By fine-tuning RoBERTa on the GLUE benchmark using Natural GaLore, we demonstrate significant reduction in gap 86.05% vs 86.28% for full-finetuning. Furthermore, fine-tuning the TinyLlama 1.1B model for function calling using the TinyAgent framework shows that Natural GaLore achieving 83.09% accuracy on the TinyAgent dataset, significantly outperforms 16-bit LoRA at 80.06% and even surpasses GPT4-Turbo by 4%, all while using 30% less memory. All code to reproduce the results are available at: this https URL
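As background for the abstract above, the Woodbury identity is what makes applying an inverse of a low-rank-plus-diagonal matrix cheap. The derivation below is a generic illustration, not the paper's exact construction: the damping constant $\lambda$ and the matrix $G$ of $r$ stacked gradient vectors are assumptions introduced here.

```latex
% Woodbury matrix identity (general form)
(A + UCV)^{-1} = A^{-1} - A^{-1} U \left( C^{-1} + V A^{-1} U \right)^{-1} V A^{-1}

% Assumed setting: damped empirical Fisher F = \lambda I_d + G G^{\top},
% with G \in \mathbb{R}^{d \times r} and r \ll d.
% Taking A = \lambda I_d, U = G, C = I_r, V = G^{\top} gives
F^{-1} g = \frac{1}{\lambda} \left[ g - G \left( \lambda I_r + G^{\top} G \right)^{-1} G^{\top} g \right]
```

Under these assumptions only an $r \times r$ system has to be solved, instead of a $d \times d$ one.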
[LG-34] Information-Theoretic Minimax Regret Bounds for Reinforcement Learning based on Duality
Link: https://arxiv.org/abs/2410.16013
Authors: Raghav Bongole, Amaury Gouverneur, Borja Rodríguez-Gálvez, Tobias J. Oechtering, Mikael Skoglund
Keywords: Markov Decision Processes, regret, minimax regret, study agents acting, minimax
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments:
Abstract:We study agents acting in an unknown environment where the agent’s goal is to find a robust policy. We consider robust policies as policies that achieve high cumulative rewards for all possible environments. To this end, we consider agents minimizing the maximum regret over different environment parameters, leading to the study of minimax regret. This research focuses on deriving information-theoretic bounds for minimax regret in Markov Decision Processes (MDPs) with a finite time horizon. Building on concepts from supervised learning, such as minimum excess risk (MER) and minimax excess risk, we use recent bounds on the Bayesian regret to derive minimax regret bounds. Specifically, we establish minimax theorems and use bounds on the Bayesian regret to perform minimax regret analysis using these minimax theorems. Our contributions include defining a suitable minimax regret in the context of MDPs, finding information-theoretic bounds for it, and applying these bounds in various scenarios.
[LG-35] Massimo: Public Queue Monitoring and Management using Mass-Spring Model
Link: https://arxiv.org/abs/2410.16012
Authors: Abhijeet Kumar, Unnati Singh, Rajdeep Chatterjee, Tathagata Bandyopadhyay
Keywords: customer satisfaction, control and regulation, important in order, order to avoid, avoid the traffic
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP)
Comments: 8 pages, 6 figures, 3 algorithms, 3 tables
Abstract:An efficient system for queue control and regulation in public spaces is important for avoiding traffic jams and improving customer satisfaction. This article offers a detailed road map, based on a combination of intelligent systems, for creating efficient queue systems in public places. Using computer vision, machine learning algorithms, and deep learning, our system provides accurate information about whether a place is crowded and what action needs to be taken.
[LG-36] 1024m at SMM4H 2024: Tasks 3, 5 & 6 – Ensembles of Transformers and Large Language Models for Medical Text Classification ACL2024
Link: https://arxiv.org/abs/2410.15998
Authors: Ram Mohan Rao Kadiyala, M.V.P. Chandra Sekhara Rao
Keywords: users reporting information, Social media, Large Language Models, Binary classification, great source
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Short paper, ACL 2024
Abstract:Social media is a great source of data from users reporting information about their health and how various things have affected them. This paper presents various approaches using Transformers, Large Language Models, and their ensembles, along with their performance, advantages, and drawbacks, for several SMM4H’24 tasks: classifying texts on the impact of nature and outdoor spaces on the author’s mental health (Task 3), binary classification of tweets reporting children’s health disorders such as asthma, autism, ADHD, and speech disorder (Task 5), and binary classification of users self-reporting their age (Task 6).
[LG-37] MultiRC: Joint Learning for Time Series Anomaly Prediction and Detection with Multi-scale Reconstructive Contrast
Link: https://arxiv.org/abs/2410.15997
Authors: Shiyan Hu, Kai Zhao, Xiangfei Qiu, Yang Shu, Jilin Hu, Bin Yang, Chenjuan Guo
Keywords: unsupervised time series, proposed for unsupervised, diverse reaction time, time series anomaly, series anomaly detection
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Many methods have been proposed for unsupervised time series anomaly detection. Despite some progress, research on predicting future anomalies is still relatively scarce. Predicting anomalies is particularly challenging due to the diverse reaction time and the lack of labeled data. To address these challenges, we propose MultiRC to integrate reconstructive and contrastive learning for joint learning of anomaly prediction and detection, with multi-scale structure and adaptive dominant period mask to deal with the diverse reaction time. MultiRC also generates negative samples to provide essential training momentum for the anomaly prediction tasks and prevent model degradation. We evaluate seven benchmark datasets from different fields. For both anomaly prediction and detection tasks, MultiRC outperforms existing state-of-the-art methods.
[LG-38] Augmenting Legal Decision Support Systems with LLM-based NLI for Analyzing Social Media Evidence EMNLP2024
Link: https://arxiv.org/abs/2410.15990
Authors: Ram Mohan Rao Kadiyala, Siddartha Pullakhandam, Kanwal Mehreen, Subhasya Tippareddy, Ashay Srivastava
Keywords: entry for NLLP, Natural Language Inference, Legal Natural Language, shared task, Natural Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, accepted to EMNLP 2024
Abstract:This paper presents our system description and error analysis of our entry for NLLP 2024 shared task on Legal Natural Language Inference (L-NLI) \citep{hagag2024legallenssharedtask2024}. The task required classifying these relationships as entailed, contradicted, or neutral, indicating any association between the review and the complaint. Our system emerged as the winning submission, significantly outperforming other entries with a substantial margin and demonstrating the effectiveness of our approach in legal text analysis. We provide a detailed analysis of the strengths and limitations of each model and approach tested, along with a thorough error analysis and suggestions for future improvements. This paper aims to contribute to the growing field of legal NLP by offering insights into advanced techniques for natural language inference in legal contexts, making it accessible to both experts and newcomers in the field.
[LG-39] Analyzing Closed-loop Training Techniques for Realistic Traffic Agent Models in Autonomous Highway Driving Simulations
Link: https://arxiv.org/abs/2410.15987
Authors: Matthias Bitzer, Reinis Cimurs, Benjamin Coors, Johannes Goth, Sebastian Ziesche, Philipp Geiger, Maximilian Naumann
Keywords: autonomous vehicles, plays a crucial, crucial role, rapid development, development and safe
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 15 pages, 6 figures, 4 tables
Abstract:Simulation plays a crucial role in the rapid development and safe deployment of autonomous vehicles. Realistic traffic agent models are indispensable for bridging the gap between simulation and the real world. Many existing approaches for imitating human behavior are based on learning from demonstration. However, these approaches are often constrained by focusing on individual training strategies. Therefore, to foster a broader understanding of realistic traffic agent modeling, in this paper, we provide an extensive comparative analysis of different training principles, with a focus on closed-loop methods for highway driving simulation. We experimentally compare (i) open-loop vs. closed-loop multi-agent training, (ii) adversarial vs. deterministic supervised training, (iii) the impact of reinforcement losses, and (iv) the impact of training alongside log-replayed agents to identify suitable training techniques for realistic agent modeling. Furthermore, we identify promising combinations of different closed-loop training methods.
[LG-40] Visual Representation Learning Guided By Multi-modal Prior Knowledge
Link: https://arxiv.org/abs/2410.15981
Authors: Hongkuan Zhou, Lavdim Halilaj, Sebastian Monka, Stefan Schmid, Yuqicheng Zhu, Bo Xiong, Steffen Staab
Keywords: deep neural networks, facing distribution shifts, neural networks, computer vision, remarkable success
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Despite the remarkable success of deep neural networks (DNNs) in computer vision, they fail to remain high-performing when facing distribution shifts between training and testing data. In this paper, we propose Knowledge-Guided Visual representation learning (KGV), a distribution-based learning approach leveraging multi-modal prior knowledge, to improve generalization under distribution shift. We use prior knowledge from two distinct modalities: 1) a knowledge graph (KG) with hierarchical and association relationships; and 2) generated synthetic images of visual elements semantically represented in the KG. The respective embeddings are generated from the given modalities in a common latent space, i.e., visual embeddings from original and synthetic images as well as knowledge graph embeddings (KGEs). These embeddings are aligned via a novel variant of translation-based KGE methods, where the node and relation embeddings of the KG are modeled as Gaussian distributions and translations respectively. We claim that incorporating multi-modal prior knowledge enables more regularized learning of image representations. Thus, the models are able to better generalize across different data distributions. We evaluate KGV on different image classification tasks with major or minor distribution shifts, namely road sign classification across datasets from Germany, China, and Russia, image classification with the mini-ImageNet dataset and its variants, as well as the DVM-CAR dataset. The results demonstrate that KGV consistently exhibits higher accuracy and data efficiency than the baselines across all experiments.
[LG-41] Large Language Models for Cross-lingual Emotion Detection ACL2024
Link: https://arxiv.org/abs/2410.15974
Authors: Ram Mohan Rao Kadiyala
Keywords: detailed system description, cross-lingual emotion detection, focused on cross-lingual, presents a detailed, detailed system
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 pages, accepted to ACL 2024
Abstract:This paper presents a detailed system description of our entry for the WASSA 2024 Task 2, focused on cross-lingual emotion detection. We utilized a combination of large language models (LLMs) and their ensembles to effectively understand and categorize emotions across different languages. Our approach not only outperformed other submissions by a large margin, but also demonstrated the strength of integrating multiple models to enhance performance. Additionally, we conducted a thorough comparison of the benefits and limitations of each model used. An error analysis is included along with suggested areas for future improvement. This paper aims to offer a clear and comprehensive understanding of advanced techniques in emotion detection, making it accessible even to those new to the field.
[LG-42] Karush-Kuhn-Tucker Condition-Trained Neural Networks (KKT Nets)
Link: https://arxiv.org/abs/2410.15973
Authors: Shreya Arvind, Rishabh Pomaje, Rajshekhar V Bhat
Keywords: KKT Loss, solving convex optimization, dual variables satisfying, convex optimization problems, KKT
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC)
Comments:
Abstract:This paper presents a novel approach to solving convex optimization problems by leveraging the fact that, under certain regularity conditions, any set of primal or dual variables satisfying the Karush-Kuhn-Tucker (KKT) conditions is necessary and sufficient for optimality. Similar to Theory-Trained Neural Networks (TTNNs), the parameters of the convex optimization problem are input to the neural network, and the expected outputs are the optimal primal and dual variables. A choice for the loss function in this case is a loss, which we refer to as the KKT Loss, that measures how well the network’s outputs satisfy the KKT conditions. We demonstrate the effectiveness of this approach using a linear program as an example. For this problem, we observe that minimizing the KKT Loss alone outperforms training the network with a weighted sum of the KKT Loss and a Data Loss (the mean-squared error between the ground truth optimal solutions and the network’s output). Moreover, minimizing only the Data Loss yields inferior results compared to those obtained by minimizing the KKT Loss. While the approach is promising, the obtained primal and dual solutions are not sufficiently close to the ground truth optimal solutions. In the future, we aim to develop improved models to obtain solutions closer to the ground truth and extend the approach to other problem classes.
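To make the notion of a "KKT Loss" in the abstract above concrete, the sketch below scores a candidate primal/dual pair by how far it is from satisfying the KKT conditions of a linear program. Several things here are assumptions for illustration: the LP form (minimize c·x subject to Ax ≤ b), the choice of squared residuals, the equal weighting of the four terms, and the function name `kkt_loss`. The paper's exact loss is not given in the abstract.

```python
from typing import List, Sequence


def kkt_loss(
    c: Sequence[float],
    A: Sequence[Sequence[float]],
    b: Sequence[float],
    x: Sequence[float],
    lam: Sequence[float],
) -> float:
    """Sum of squared KKT violations for: minimize c.x subject to A x <= b.

    x is the candidate primal solution, lam the candidate dual multipliers.
    """
    m, n = len(A), len(c)
    # Constraint residuals a_i.x - b_i (should be <= 0 at a feasible point).
    slack: List[float] = [
        sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)
    ]
    # Stationarity residual: c + A^T lam should vanish.
    stationarity = [
        c[j] + sum(A[i][j] * lam[i] for i in range(m)) for j in range(n)
    ]
    primal_feasibility = sum(max(0.0, s) ** 2 for s in slack)
    dual_feasibility = sum(max(0.0, -l) ** 2 for l in lam)
    stationarity_term = sum(r ** 2 for r in stationarity)
    complementary_slackness = sum((l * s) ** 2 for l, s in zip(lam, slack))
    return (
        primal_feasibility
        + dual_feasibility
        + stationarity_term
        + complementary_slackness
    )


if __name__ == "__main__":
    # Toy LP: minimize -x1 - x2 subject to x1 <= 1 and x2 <= 1.
    c = [-1.0, -1.0]
    A = [[1.0, 0.0], [0.0, 1.0]]
    b = [1.0, 1.0]
    print(kkt_loss(c, A, b, x=[1.0, 1.0], lam=[1.0, 1.0]))  # optimal pair
    print(kkt_loss(c, A, b, x=[0.0, 0.0], lam=[0.0, 0.0]))  # violates stationarity
```

A loss of this kind needs no ground-truth solutions, which is what distinguishes it from the Data Loss described in the abstract.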
[LG-43] TS-ACL: A Time Series Analytic Continual Learning Framework for Privacy-Preserving and Class-Incremental Pattern Recognition
Link: https://arxiv.org/abs/2410.15954
Authors: Kejia Fan, Jiaxu Li, Songning Lai, Linpu Lv, Anfeng Liu, Jianheng Tang, Houbing Herbert Song, Huiping Zhuang
Keywords: Time Series Classification, incrementally train models, streaming time series, Series Classification, Class-incremental Learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages, 3 figures, 2 tables
Abstract:Class-incremental Learning (CIL) in Time Series Classification (TSC) aims to incrementally train models using the streaming time series data that arrives continuously. The main problem in this scenario is catastrophic forgetting, i.e., training models with new samples inevitably leads to the forgetting of previously learned knowledge. Among existing methods, the replay-based methods achieve satisfactory performance but compromise privacy, while exemplar-free methods protect privacy but suffer from low accuracy. However, more critically, owing to their reliance on gradient-based update techniques, these existing methods fundamentally cannot solve the catastrophic forgetting problem. In TSC scenarios with continuously arriving data and temporally shifting distributions, these methods become even less practical. In this paper, we propose a Time Series Analytic Continual Learning framework, called TS-ACL. Inspired by analytical learning, TS-ACL transforms neural network updates into gradient-free linear regression problems, thereby fundamentally mitigating catastrophic forgetting. Specifically, employing a pre-trained and frozen feature extraction encoder, TS-ACL only needs to update its analytic classifier recursively in a lightweight manner that is highly suitable for real-time applications and large-scale data processing. Additionally, we theoretically demonstrate that the model obtained recursively through the TS-ACL is exactly equivalent to a model trained on the complete dataset in a centralized manner, thereby establishing the property of absolute knowledge memory. Extensive experiments validate the superior performance of our TS-ACL.
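The equivalence claim in the abstract above (incremental updates matching joint training) is a general property of ridge regression, and a toy version can be checked directly. The sketch below is not TS-ACL itself: it assumes features already come from a frozen encoder, fixes the number of classes up front, and accumulates the sufficient statistics of a ridge-regression classifier instead of using the paper's recursive update. The class `AnalyticClassifier` and helper `solve` are hypothetical names.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def solve(M: Matrix, B: Matrix) -> Matrix:
    """Solve M W = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


class AnalyticClassifier:
    """Gradient-free ridge-regression classifier on one-hot targets."""

    def __init__(self, dim: int, n_classes: int, ridge: float = 1.0) -> None:
        # gram holds X^T X + ridge * I; cross holds X^T Y for one-hot Y.
        self.gram: Matrix = [
            [ridge if i == j else 0.0 for j in range(dim)] for i in range(dim)
        ]
        self.cross: Matrix = [[0.0] * n_classes for _ in range(dim)]

    def update(self, features: Sequence[Sequence[float]], labels: Sequence[int]) -> None:
        """Fold a new batch into the statistics; earlier samples are never revisited."""
        for x, y in zip(features, labels):
            for i, xi in enumerate(x):
                for j, xj in enumerate(x):
                    self.gram[i][j] += xi * xj
                self.cross[i][y] += xi

    def weights(self) -> Matrix:
        return solve(self.gram, self.cross)

    def predict(self, x: Sequence[float]) -> int:
        W = self.weights()
        scores = [sum(x[i] * W[i][k] for i in range(len(x))) for k in range(len(W[0]))]
        return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    phase1 = ([[1.0, 0.0], [0.9, 0.1]], [0, 0])  # class 0 arrives first
    phase2 = ([[0.0, 1.0], [0.1, 0.9]], [1, 1])  # class 1 arrives later
    incremental = AnalyticClassifier(dim=2, n_classes=2)
    incremental.update(*phase1)
    incremental.update(*phase2)
    print(incremental.predict([1.0, 0.0]), incremental.predict([0.0, 1.0]))
```

Because the closed-form solution depends only on the accumulated statistics, training phase by phase gives the same weights as training once on all data, and no raw samples need to be stored.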
[LG-44] User-centric evaluation of explainability of AI with and for humans: a comprehensive empirical study
Link: https://arxiv.org/abs/2410.15952
Authors: Szymon Bobek, Paloma Korycińska, Monika Krakowska, Maciej Mozolewski, Dorota Rak, Magdalena Zych, Magdalena Wójcik, Grzegorz J. Nalepa
Keywords: Human-Centered Artificial Intelligence, eXplainable Artificial Intelligence, Artificial Intelligence, Gradient Boosting Classifier, Human-Centered Artificial
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:This study is located in the Human-Centered Artificial Intelligence (HCAI) and focuses on the results of a user-centered assessment of commonly used eXplainable Artificial Intelligence (XAI) algorithms, specifically investigating how humans understand and interact with the explanations provided by these algorithms. To achieve this, we employed a multi-disciplinary approach that included state-of-the-art research methods from social sciences to measure the comprehensibility of explanations generated by a state-of-the-art machine learning model, specifically the Gradient Boosting Classifier (XGBClassifier). We conducted an extensive empirical user study involving interviews with 39 participants from three different groups, each with varying expertise in data science, data visualization, and domain-specific knowledge related to the dataset used for training the machine learning model. Participants were asked a series of questions to assess their understanding of the model’s explanations. To ensure replicability, we built the model using a publicly available dataset from the UC Irvine Machine Learning Repository, focusing on edible and non-edible mushrooms. Our findings reveal limitations in existing XAI methods and confirm the need for new design principles and evaluation techniques that address the specific information needs and user perspectives of different classes of AI stakeholders. We believe that the results of our research and the cross-disciplinary methodology we developed can be successfully adapted to various data types and user profiles, thus promoting dialogue and addressing opportunities in HCAI research. To support this, we are making the data resulting from our study publicly available.
[LG-45] GReFEL: Geometry-Aware Reliable Facial Expression Learning under Bias and Imbalanced Data Distribution ACCV2024
Link: https://arxiv.org/abs/2410.15927
Authors: Azmine Toushik Wasi, Taki Hasan Rafi, Raima Islam, Karlo Serbetar, Dong Kyu Chae
Keywords: Reliable facial expression, facial expression characteristics, distinctive facial expression, facial expression learning, facial expression
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ACCV 2024. Extended version of ARBEx (arXiv:2305.01486)
Abstract:Reliable facial expression learning (FEL) involves the effective learning of distinctive facial expression characteristics for more reliable, unbiased and accurate predictions in real-life settings. However, current systems struggle with FEL tasks because of the variance in people’s facial expressions due to their unique facial structures, movements, tones, and demographics. Biased and imbalanced datasets compound this challenge, leading to wrong and biased prediction labels. To tackle these, we introduce GReFEL, leveraging Vision Transformers and a facial geometry-aware anchor-based reliability balancing module to combat imbalanced data distributions, bias, and uncertainty in facial expression learning. Integrating local and global data with anchors that learn different facial data points and structural features, our approach adjusts biased and mislabeled emotions caused by intra-class disparity, inter-class similarity, and scale sensitivity, resulting in comprehensive, accurate, and reliable facial expression predictions. Our model outperforms current state-of-the-art methodologies, as demonstrated by extensive experiments on various datasets.
[LG-46] Diverse Policies Recovering via Pointwise Mutual Information Weighted Imitation Learning
Link: https://arxiv.org/abs/2410.15910
Authors: Hanlin Yang, Jian Yao, Weiming Liu, Qing Wang, Hanmin Qin, Hansheng Kong, Kirk Tang, Jiechao Xiong, Chao Yu, Kai Li, Junliang Xing, Hongwu Chen, Juchao Zhuo, Qiang Fu, Yang Wei, Haobo Fu
Keywords: important research topic, diverse policies recovering, recovering diverse policies, diverse policies, policies recovering methods
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 18 pages, 6 figures
Abstract:Recovering a spectrum of diverse policies from a set of expert trajectories is an important research topic in imitation learning. After determining a latent style for a trajectory, previous diverse policies recovering methods usually employ a vanilla behavioral cloning learning objective conditioned on the latent style, treating each state-action pair in the trajectory with equal importance. Based on an observation that in many scenarios, behavioral styles are often highly relevant to only a subset of state-action pairs, this paper presents a new principled method in diverse policies recovery. In particular, after inferring or assigning a latent style for a trajectory, we enhance the vanilla behavioral cloning by incorporating a weighting mechanism based on pointwise mutual information. This additional weighting reflects the significance of each state-action pair’s contribution to learning the style, thus allowing our method to focus on state-action pairs most representative of that style. We provide theoretical justifications for our new objective, and extensive empirical evaluations confirm the effectiveness of our method in recovering diverse policies from expert data.
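Pointwise mutual information, the quantity the abstract above uses for weighting, is defined as PMI(x; y) = log p(x, y) / (p(x) p(y)). The sketch below estimates it from co-occurrence counts of (style, action) pairs. The discrete toy data, the function name `pmi_table`, and the use of raw counts are assumptions for illustration; how the paper estimates PMI and turns it into a loss weight is not described in the abstract.

```python
import math
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple

Pair = Tuple[Hashable, Hashable]


def pmi_table(pairs: Iterable[Pair]) -> Dict[Pair, float]:
    """Estimate PMI(x; y) = log p(x, y) / (p(x) p(y)) from observed (x, y) pairs."""
    observed = list(pairs)
    total = len(observed)
    joint = Counter(observed)
    left = Counter(x for x, _ in observed)
    right = Counter(y for _, y in observed)
    return {
        (x, y): math.log((count / total) / ((left[x] / total) * (right[y] / total)))
        for (x, y), count in joint.items()
    }


if __name__ == "__main__":
    # Hypothetical (style, action) observations from expert trajectories.
    data = (
        [("aggressive", "attack")] * 3
        + [("aggressive", "move")]
        + [("passive", "move")] * 3
        + [("passive", "attack")]
    )
    for pair, value in sorted(pmi_table(data).items()):
        print(pair, round(value, 3))
```

Pairs with positive PMI occur together more often than independence would predict, so they are the ones most indicative of a given style; pairs with PMI near zero carry little style information.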
[LG-47] Model Mimic Attack: Knowledge Distillation for Provably Transferable Adversarial Examples
Link: https://arxiv.org/abs/2410.15889
Authors: Kirill Lukyanov, Andrew Perminov, Denis Turdakov, Mikhail Pautov
Keywords: artificial neural networks, vulnerability of artificial, setting is widely, widely studied, black-box adversarial attacks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:The vulnerability of artificial neural networks to adversarial perturbations in the black-box setting is widely studied in the literature. The majority of attack methods to construct these perturbations suffer from an impractically large number of queries required to find an adversarial example. In this work, we focus on knowledge distillation as an approach to conduct transfer-based black-box adversarial attacks and propose an iterative training of the surrogate model on an expanding dataset. This work is the first, to our knowledge, to provide provable guarantees on the success of knowledge distillation-based attack on classification neural networks: we prove that if the student model has enough learning capabilities, the attack on the teacher model is guaranteed to be found within the finite number of distillation iterations.
[LG-48] Using GPT Models for Qualitative and Quantitative News Analytics in the 2024 US Presidental Election Process
Link: https://arxiv.org/abs/2410.15884
Authors: Bohdan M. Pavlyshenko
Keywords: Google Search API, Google Search, Search API, retrieval-augmented generation, RAG
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract:The paper considers an approach of using Google Search API and GPT-4o model for qualitative and quantitative analyses of news through retrieval-augmented generation (RAG). This approach was applied to analyze news about the 2024 US presidential election process. Different news sources for different time periods have been analyzed. Quantitative scores generated by GPT model have been analyzed using Bayesian regression to derive trend lines. The distributions found for the regression parameters allow for the analysis of uncertainty in the election process. The obtained results demonstrate that using the GPT models for news analysis, one can get informative analytics and provide key insights that can be applied in further analyses of election processes.
[LG-49] Distributed Learning for UAV Swarms
链接: https://arxiv.org/abs/2410.15882
作者: Chen Hu,Hanchi Ren,Jingjing Deng,Xianghua Xie
关键词-EN: Unmanned Aerial Vehicle, Unmanned Aerial, Aerial Vehicle, making Federated Learning, deployed in dynamic
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:
点击查看摘要
Abstract:Unmanned Aerial Vehicle (UAV) swarms are increasingly deployed in dynamic, data-rich environments for applications such as environmental monitoring and surveillance. These scenarios demand efficient data processing while maintaining privacy and security, making Federated Learning (FL) a promising solution. FL allows UAVs to collaboratively train global models without sharing raw data, but challenges arise due to the non-Independent and Identically Distributed (non-IID) nature of the data collected by UAVs. In this study, we show an integration of state-of-the-art FL methods into UAV swarm applications and investigate the performance of multiple aggregation methods (namely FedAvg, FedProx, FedOpt, and MOON) with a particular focus on tackling non-IID data on a variety of datasets, specifically MNIST for baseline performance, CIFAR10 for natural object classification, EuroSAT for environment monitoring, and CelebA for surveillance. These algorithms were selected to cover improved techniques on both client-side updates and global aggregation. Results show that while all algorithms perform comparably on IID data, their performance deteriorates significantly under non-IID conditions. FedProx demonstrated the most stable overall performance, emphasising the importance of regularising local updates in non-IID environments to mitigate drastic deviations in local models.
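The point about regularising local updates can be sketched on a toy problem. FedProx adds a proximal term (mu/2)·||w − w_global||² to each client's local objective, and setting mu = 0 recovers a plain FedAvg-style local step. The scalar quadratic client losses, the learning rate, and the function names below are assumptions chosen for illustration and are not taken from the paper's experiments.

```python
from typing import List, Tuple


def local_update(
    w_global: float,
    target: float,
    mu: float,
    lr: float = 0.1,
    steps: int = 20,
) -> float:
    """Gradient descent on 0.5*(w - target)^2 + (mu/2)*(w - w_global)^2."""
    w = w_global
    for _ in range(steps):
        # Gradient of the client loss plus gradient of the proximal term.
        grad = (w - target) + mu * (w - w_global)
        w -= lr * grad
    return w


def run_round(w_global: float, targets: List[float], mu: float) -> Tuple[float, List[float]]:
    """One communication round: local training on every client, then a plain average."""
    local_models = [local_update(w_global, t, mu) for t in targets]
    new_global = sum(local_models) / len(local_models)
    return new_global, local_models


def max_drift(w_global: float, local_models: List[float]) -> float:
    """Largest distance between any local model and the global model it started from."""
    return max(abs(w - w_global) for w in local_models)


# Three clients with very different optima stand in for non-IID data.
client_targets = [-5.0, 0.0, 10.0]
_, fedavg_models = run_round(0.0, client_targets, mu=0.0)   # no proximal term
_, fedprox_models = run_round(0.0, client_targets, mu=1.0)  # with proximal term
```

Comparing `max_drift(0.0, fedavg_models)` with `max_drift(0.0, fedprox_models)` shows the proximal term keeping local models closer to the global one, which is the mechanism the abstract credits for stability under non-IID data.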
[LG-50] FlickerFusion: Intra-trajectory Domain Generalizing Multi-Agent RL NEURIPS’24
链接: https://arxiv.org/abs/2410.15876
作者: Woosung Koh,Wonbeen Oh,Siyeol Kim,Suhin Shin,Hyeongjin Kim,Jaein Jang,Junghyun Lee,Se-Young Yun
关键词-EN: Multi-agent reinforcement learning, addressing complex cooperative, complex cooperative tasks, Multi-agent reinforcement, demonstrated significant potential
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: NeurIPS '24 Open-World Agents Workshop
点击查看摘要
Abstract:Multi-agent reinforcement learning has demonstrated significant potential in addressing complex cooperative tasks across various real-world applications. However, existing MARL approaches often rely on the restrictive assumption that the number of entities (e.g., agents, obstacles) remains constant between training and inference. This overlooks scenarios where entities are dynamically removed or added during the inference trajectory – a common occurrence in real-world environments like search and rescue missions and dynamic combat situations. In this paper, we tackle the challenge of intra-trajectory dynamic entity composition under zero-shot out-of-domain (OOD) generalization, where such dynamic changes cannot be anticipated beforehand. Our empirical studies reveal that existing MARL methods suffer significant performance degradation and increased uncertainty in these scenarios. In response, we propose FlickerFusion, a novel OOD generalization method that acts as a universally applicable augmentation technique for MARL backbone methods. Our results show that FlickerFusion not only achieves superior inference rewards but also uniquely reduces uncertainty vis-à-vis the backbone, compared to existing methods. For standardized evaluation, we introduce MPEv2, an enhanced version of Multi Particle Environments (MPE), consisting of 12 benchmarks. Benchmarks, implementations, and trained models are organized and open-sourced at this http URL, accompanied by ample demo video renderings.
[LG-51] Enabling Asymmetric Knowledge Transfer in Multi-Task Learning with Self-Auxiliaries
链接: https://arxiv.org/abs/2410.15875
作者: Olivier Graffeuille,Yun Sing Koh,Joerg Wicker,Moritz Lehmann
关键词-EN: Knowledge transfer, typically viewed, transfer, asymmetric task relationships, Knowledge
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Knowledge transfer in multi-task learning is typically viewed as a dichotomy; positive transfer, which improves the performance of all tasks, or negative transfer, which hinders the performance of all tasks. In this paper, we investigate the understudied problem of asymmetric task relationships, where knowledge transfer aids the learning of certain tasks while hindering the learning of others. We propose an optimisation strategy that includes additional cloned tasks named self-auxiliaries into the learning process to flexibly transfer knowledge between tasks asymmetrically. Our method can exploit asymmetric task relationships, benefiting from the positive transfer component while avoiding the negative transfer component. We demonstrate that asymmetric knowledge transfer provides substantial improvements in performance compared to existing multi-task optimisation strategies on benchmark computer vision problems.
[LG-52] Mesa-Extrapolation: A Weave Position Encoding Method for Enhanced Extrapolation in LLMs NEURIPS2024
链接: https://arxiv.org/abs/2410.15859
作者: Xin Ma,Yang Liu,Jingjing Liu,Xiaoxu Ma
关键词-EN: Large language models, max training lengths, Large language, challenging extrapolation problem, Position Encoding
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: accepted by NeurIPS 2024. arXiv admin note: text overlap with arXiv:2305.19466 by other authors
点击查看摘要
Abstract:Large language models (LLMs), although having revolutionized many fields, still suffer from the challenging extrapolation problem, where the inference ability of LLMs sharply declines beyond their max training lengths. In this work, we conduct a theoretical analysis to better understand why No Position Encoding (NoPE) fails outside its effective range, as well as examining the power of Position Encoding (PE) in this context. Our findings reveal that with meticulous weave position, PE can indeed be extended beyond effective range. Our theorems establish that LLMs equipped with weave PE can achieve improved extrapolation performance without additional cost. Furthermore, we introduce a novel weave PE method, Mesa-Extrapolation, which utilizes a chunk-based triangular attention matrix and applies Stair PE to manage the final chunk. This method not only retains competitive performance but also offers substantial benefits such as significantly reduced memory demand and faster inference speed. Extensive experiments validate the effectiveness of Mesa-Extrapolation, demonstrating its potential as a scalable solution to enhancing LLMs' applicative reach.
[LG-53] Towards Optimal Adapter Placement for Efficient Transfer Learning
链接: https://arxiv.org/abs/2410.15858
作者: Aleksandra I. Nowak,Otniel-Bogdan Mercea,Anurag Arnab,Jonas Pfeiffer,Yann Dauphin,Utku Evci
关键词-EN: Parameter-efficient transfer learning, adapt pre-trained models, Parameter-efficient transfer, transfer learning, aims to adapt
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Parameter-efficient transfer learning (PETL) aims to adapt pre-trained models to new downstream tasks while minimizing the number of fine-tuned parameters. Adapters, a popular approach in PETL, inject additional capacity into existing networks by incorporating low-rank projections, achieving performance comparable to full fine-tuning with significantly fewer parameters. This paper investigates the relationship between the placement of an adapter and its performance. We observe that adapter location within a network significantly impacts its effectiveness, and that the optimal placement is task-dependent. To exploit this observation, we introduce an extended search space of adapter connections, including long-range and recurrent adapters. We demonstrate that even randomly selected adapter placements from this expanded space yield improved results, and that high-performing placements often correlate with high gradient rank. Our findings reveal that a small number of strategically placed adapters can match or exceed the performance of the common baseline of adding adapters in every block, opening a new avenue for research into optimal adapter placement strategies.
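For readers unfamiliar with adapters, the low-rank projection idea can be sketched as a bottleneck module with a residual connection: project the hidden vector down to rank r, apply a nonlinearity, project back up, and add the result to the input. This is a generic adapter sketch in plain Python. The ReLU nonlinearity, the omission of biases, and the names `adapter` and `adapter_param_count` are assumptions, and the sketch does not model the long-range or recurrent placements the paper studies.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def adapter(h: Vector, w_down: Matrix, w_up: Matrix) -> Vector:
    """Bottleneck adapter: h + W_up * relu(W_down * h).

    w_down has shape (r, d) and w_up has shape (d, r), where d = len(h).
    """
    z = [max(0.0, x) for x in matvec(w_down, h)]  # rank-r bottleneck
    delta = matvec(w_up, z)                       # back to dimension d
    return [a + b for a, b in zip(h, delta)]      # residual connection


def adapter_param_count(d: int, r: int) -> int:
    """Trainable weights of one adapter (biases ignored): d*r down plus r*d up."""
    return 2 * d * r


hidden = [2.0, 1.0, 1.0, 1.0]
down = [[1.0, 0.0, 0.0, 0.0]]        # r = 1, d = 4
up = [[0.5], [0.0], [0.0], [0.0]]
adapted = adapter(hidden, down, up)
```

With a zero up-projection the adapter is the identity, so inserting it leaves the pre-trained network's behaviour unchanged until training moves the weights. For d = 768 and r = 8 the adapter has 12,288 weights, compared with 589,824 for a full d × d layer.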
[LG-54] TEXEL: A neuromorphic processor with on-chip learning for beyond-CMOS device integration
链接: https://arxiv.org/abs/2410.15854
作者: Hugh Greatorex,Ole Richter,Michele Mastella,Madison Cotteret,Philipp Klein,Maxime Fabre,Arianna Rubino,Willian Soares Girão,Junren Chen,Martin Ziegler,Laura Bégon-Lours,Giacomo Indiveri,Elisabetta Chicca
关键词-EN: shown great potential, Recent advances, memory technologies, advances in memory, shown great
类目: Neural and Evolutionary Computing (cs.NE); Hardware Architecture (cs.AR); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: 17 pages, 7 figures. Supplementary material: 8 pages, 4 figures
点击查看摘要
Abstract:Recent advances in memory technologies, devices and materials have shown great potential for integration into neuromorphic electronic systems. However, a significant gap remains between the development of these materials and the realization of large-scale, fully functional systems. One key challenge is determining which devices and materials are best suited for specific functions and how they can be paired with CMOS circuitry. To address this, we introduce TEXEL, a mixed-signal neuromorphic architecture designed to explore the integration of on-chip learning circuits and novel two- and three-terminal devices. TEXEL serves as an accessible platform to bridge the gap between CMOS-based neuromorphic computation and the latest advancements in emerging devices. In this paper, we demonstrate the readiness of TEXEL for device integration through comprehensive chip measurements and simulations. TEXEL provides a practical system for testing bio-inspired learning algorithms alongside emerging devices, establishing a tangible link between brain-inspired computation and cutting-edge device research.
[LG-55] Focus Where It Matters: Graph Selective State Focused Attention Networks
链接: https://arxiv.org/abs/2410.15849
作者: Shikhar Vashistha,Neetesh Kumar
关键词-EN: lose individual node, individual node characteristics, node characteristics due, States Focused Attention, lack scalability
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Traditional graph neural networks (GNNs) lack scalability and lose individual node characteristics due to over-smoothing, especially in the case of deeper networks. This results in sub-optimal feature representation, affecting the model’s performance on tasks involving dynamically changing graphs. To address this issue, we present Graph Selective States Focused Attention Networks (GSANs) based neural network architecture for graph-structured data. The GSAN is enabled by multi-head masked self-attention (MHMSA) and selective state space modeling (S3M) layers to overcome the limitations of GNNs. In GSAN, the MHMSA allows GSAN to dynamically emphasize crucial node connections, particularly in evolving graph environments. The S3M layer enables the network to adjust dynamically to changing node states and improves predictions of node behavior in varying contexts without needing prior knowledge of the graph structure. Furthermore, the S3M layer enhances the generalization to unseen structures and interprets how node states influence link importance. With this, GSAN performs effectively on inductive and transductive tasks and overcomes the issues that traditional GNNs experience. To analyze the performance behavior of GSAN, a set of state-of-the-art comparative experiments are conducted on graph benchmark datasets, including the Cora, Citeseer, and Pubmed citation networks and protein-protein-interaction datasets. As an outcome, GSAN improved the classification accuracy by 1.56%, 8.94%, 0.37%, and 1.54% on F1-score, respectively.
[LG-56] Random Token Fusion for Multi-View Medical Diagnosis NEURIPS2024
链接: https://arxiv.org/abs/2410.15847
作者: Jingyu Guo,Christos Matsoukas,Fredrik Strand,Kevin Smith
关键词-EN: deep learning-based models, deep learning-based, multi-view medical diagnosis, fuse information, imaging perspectives
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Originally published at the NeurIPS 2024 Workshop on Advancements In Medical Foundation Models: Explainability, Robustness, Security, and Beyond (AIM-FM)
点击查看摘要
Abstract:In multi-view medical diagnosis, deep learning-based models often fuse information from different imaging perspectives to improve diagnostic performance. However, existing approaches are prone to overfitting and rely heavily on view-specific features, which can lead to trivial solutions. In this work, we introduce Random Token Fusion (RTF), a novel technique designed to enhance multi-view medical image analysis using vision transformers. By integrating randomness into the feature fusion process during training, RTF addresses the issue of overfitting and enhances the robustness and accuracy of diagnostic models without incurring any additional cost at inference. We validate our approach on standard mammography and chest X-ray benchmark datasets. Through extensive experiments, we demonstrate that RTF consistently improves the performance of existing fusion methods, paving the way for a new generation of multi-view medical foundation models.
[LG-57] Modelling Concurrent RTP Flows for End-to-end Predictions of QoS in Real Time Communications
链接: https://arxiv.org/abs/2410.15846
作者: Tailai Song,Paolo Garza,Michela Meo,Maurizio Matteo Munafò
关键词-EN: Real-time Transport Protocol, Transport Protocol, based real-time communications, Real-time Transport, real-time communications
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注:
点击查看摘要
Abstract:The Real-time Transport Protocol (RTP)-based real-time communications (RTC) applications, exemplified by video conferencing, have experienced an unparalleled surge in popularity and development in recent years. In pursuit of optimizing their performance, the prediction of Quality of Service (QoS) metrics emerges as a pivotal endeavor, bolstering network monitoring and proactive solutions. However, contemporary approaches are confined to individual RTP flows and metrics, falling short in relationship capture and computational efficiency. To this end, we propose Packet-to-Prediction (P2P), a novel deep learning (DL) framework that hinges on raw packets to simultaneously process concurrent RTP flows and perform end-to-end prediction of multiple QoS metrics. Specifically, we implement a streamlined architecture, namely length-free Transformer with cross and neighbourhood attention, capable of handling an unlimited number of RTP flows, and employ a multi-task learning paradigm to forecast four key metrics in a single shot. Our work is based on extensive traffic collected during real video calls, and conclusively, P2P excels comparative models in both prediction performance and temporal efficiency.
[LG-58] Private, Efficient and Scalable Kernel Learning for Medical Image Analysis
链接: https://arxiv.org/abs/2410.15840
作者: Anika Hannemann,Arjhun Swaminathan,Ali Burak Ünal,Mete Akgün
关键词-EN: modern medicine, key in modern, Medical imaging, diagnostic medical imaging, medical imaging reveals
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Medical imaging is key in modern medicine. From magnetic resonance imaging (MRI) to microscopic imaging for blood cell detection, diagnostic medical imaging reveals vital insights into patient health. To predict diseases or provide individualized therapies, machine learning techniques like kernel methods have been widely used. Nevertheless, there are multiple challenges for implementing kernel methods. Medical image data often originates from various hospitals and cannot be combined due to privacy concerns, and the high dimensionality of image data presents another significant obstacle. While randomised encoding offers a promising direction, existing methods often struggle with a trade-off between accuracy and efficiency. Addressing the need for efficient privacy-preserving methods on distributed image data, we introduce OKRA (Orthonormal K-fRAmes), a novel randomized encoding-based approach for kernel-based machine learning. This technique, tailored for widely used kernel functions, significantly enhances scalability and speed compared to current state-of-the-art solutions. Through experiments conducted on various clinical image datasets, we evaluated model quality, computational performance, and resource overhead. Additionally, our method outperforms comparable approaches
[LG-59] Explainability of Highly Associated Fuzzy Churn Patterns in Binary Classification PAKDD2024
链接: https://arxiv.org/abs/2410.15827
作者: D.Y.C. Wang,Lars Arne Jordanger,Jerry Chun-Wei Lin
关键词-EN: Customer churn, telecommunications sector, influences both costs, costs and profits, Fuzzy Churn Patterns
类目: Machine Learning (cs.LG)
*备注: 18 pages single columns, 4 figures, This paper is an extended version of a work originally presented at the 6th International Workshop on Utility-Driven Mining and Learning (held in conjunction with the 28th Pacific-Asia Conference on Knowledge Discovery and Data Mining - PAKDD 2024) on May 7, 2024
点击查看摘要
Abstract:Customer churn, particularly in the telecommunications sector, influences both costs and profits. As the explainability of models becomes increasingly important, this study emphasizes not only the explainability of customer churn through machine learning models, but also the importance of identifying multivariate patterns and setting soft bounds for intuitive interpretation. The main objective is to use a machine learning model and fuzzy-set theory with top-$k$ HUIM to identify highly associated patterns of customer churn with intuitive identification, referred to as Highly Associated Fuzzy Churn Patterns (HAFCP). Moreover, this method aids in uncovering association rules among multiple features across low, medium, and high distributions. Such discoveries are instrumental in enhancing the explainability of findings. Experiments show that when the top-5 HAFCPs are included in five datasets, a mixture of performance results is observed, with some showing notable improvements. It becomes clear that high importance features enhance explanatory power through their distribution and patterns associated with other features. As a result, the study introduces an innovative approach that improves the explainability and effectiveness of customer churn prediction models.
[LG-60] LiMTR: Time Series Motion Prediction for Diverse Road Users through Multimodal Feature Integration NEURIPS2024
链接: https://arxiv.org/abs/2410.15819
作者: Camiel Oerlemans,Bram Grooten,Michiel Braat,Alaa Alassi,Emilia Silvas,Decebal Constantin Mocanu
关键词-EN: densely populated areas, road users accurately, Predicting the behavior, populated areas, behavior of road
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at the NeurIPS 2024 workshop Time Series in the Age of Large Models. Code available at this https URL
点击查看摘要
Abstract:Predicting the behavior of road users accurately is crucial to enable the safe operation of autonomous vehicles in urban or densely populated areas. Therefore, there has been a growing interest in time series motion prediction research, leading to significant advancements in state-of-the-art techniques in recent years. However, the potential of using LiDAR data to capture more detailed local features, such as a person’s gaze or posture, remains largely unexplored. To address this, we develop a novel multimodal approach for motion prediction based on the PointNet foundation model architecture, incorporating local LiDAR features. Evaluation on the Waymo Open Dataset shows a performance improvement of 6.20% and 1.58% in minADE and mAP respectively, when integrated and compared with the previous state-of-the-art MTR. We open-source the code of our LiMTR model.
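The minADE figure quoted above is a standard trajectory-prediction metric: the average displacement error (ADE) is the mean Euclidean distance between predicted and ground-truth positions over all time steps, and minADE takes the smallest ADE among the K candidate trajectories a model outputs. The sketch below shows only this basic definition. The 2-D point representation and the function names are assumptions, and the Waymo Open Dataset's official evaluation applies further conventions that are not reproduced here.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]
Trajectory = Sequence[Point]


def ade(pred: Trajectory, truth: Trajectory) -> float:
    """Average displacement error: mean L2 distance over matching time steps."""
    if len(pred) != len(truth) or len(truth) == 0:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(truth)


def min_ade(candidates: Sequence[Trajectory], truth: Trajectory) -> float:
    """minADE: the ADE of the best of the K candidate trajectories."""
    return min(ade(c, truth) for c in candidates)


ground_truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
predictions = [
    [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],  # constant 1 m lateral offset, ADE = 1.0
    [(0.0, 0.0), (1.0, 0.0), (2.0, 1.5)],  # error only at the last step, ADE = 0.5
]
best = min_ade(predictions, ground_truth)
```

Because only the best candidate counts, minADE rewards models that cover several plausible futures. Lower values are better.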
[LG-61] Deep Learning and Data Augmentation for Detecting Self-Admitted Technical Debt
链接: https://arxiv.org/abs/2410.15804
作者: Edi Sutoyo,Paris Avgeriou,Andrea Capiluppi
关键词-EN: Self-Admitted Technical Debt, Self-Admitted Technical, Technical Debt, SATD, refers to circumstances
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to be published at the 2024 31st Asia-Pacific Software Engineering Conference (APSEC)
点击查看摘要
Abstract:Self-Admitted Technical Debt (SATD) refers to circumstances where developers use textual artifacts to explain why the existing implementation is not optimal. Past research in detecting SATD has focused on either identifying SATD (classifying SATD items as SATD or not) or categorizing SATD (labeling instances as SATD that pertain to requirement, design, code, test debt, etc.). However, the performance of these approaches remains suboptimal, particularly for specific types of SATD, such as test and requirement debt, primarily due to extremely imbalanced datasets. To address these challenges, we build on earlier research by utilizing BiLSTM architecture for the binary identification of SATD and BERT architecture for categorizing different types of SATD. Despite their effectiveness, both architectures struggle with imbalanced data. Therefore, we employ a large language model data augmentation strategy to mitigate this issue. Furthermore, we introduce a two-step approach to identify and categorize SATD across various datasets derived from different artifacts. Our contributions include providing a balanced dataset for future SATD researchers and demonstrating that our approach significantly improves SATD identification and categorization performance compared to baseline methods.
[LG-62] On the VC dimension of deep group convolutional neural networks
链接: https://arxiv.org/abs/2410.15800
作者: Anna Sepliarskaia,Sophie Langer,Johannes Schmidt-Hieber
关键词-EN: Group Convolutional Neural, Convolutional Neural Networks, ReLU activation function, Group Convolutional, capabilities of Group
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We study the generalization capabilities of Group Convolutional Neural Networks (GCNNs) with ReLU activation function by deriving upper and lower bounds for their Vapnik-Chervonenkis (VC) dimension. Specifically, we analyze how factors such as the number of layers, weights, and input dimension affect the VC dimension. We further compare the derived bounds to those known for other types of neural networks. Our findings extend previous results on the VC dimension of continuous GCNNs with two layers, thereby providing new insights into the generalization properties of GCNNs, particularly regarding the dependence on the input resolution of the data.
[LG-63] Arithmetic Transformers Can Length-Generalize in Both Operand Length and Count
链接: https://arxiv.org/abs/2410.15787
作者: Hanseul Cho,Jaeyoung Cha,Srinadh Bhojanapalli,Chulhee Yun
关键词-EN: length generalization, meaning they fail, encountered during training, fail to generalize, generalize to sequences
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 38 pages, 16 figures
点击查看摘要
Abstract:Transformers often struggle with length generalization, meaning they fail to generalize to sequences longer than those encountered during training. While arithmetic tasks are commonly used to study length generalization, certain tasks are considered notoriously difficult, e.g., multi-operand addition (requiring generalization over both the number of operands and their lengths) and multiplication (requiring generalization over both operand lengths). In this work, we achieve approximately 2-3x length generalization on both tasks, which is the first such achievement in arithmetic Transformers. We design task-specific scratchpads enabling the model to focus on a fixed number of tokens per each next-token prediction step, and apply multi-level versions of Position Coupling (Cho et al., 2024; McLeish et al., 2024) to let Transformers know the right position to attend to. On the theory side, we prove that a 1-layer Transformer using our method can solve multi-operand addition, up to operand length and operand count that are exponential in embedding dimension.
[LG-64] Reducing Hallucinations in Vision-Language Models via Latent Space Steering
链接: https://arxiv.org/abs/2410.15778
作者: Sheng Liu,Haotian Ye,James Zou
关键词-EN: large language models, large vision-language models, poses a challenge, language models, vision-language models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: 21 pages
点击查看摘要
Abstract:Hallucination poses a challenge to the deployment of large vision-language models (LVLMs) in applications. Unlike in large language models (LLMs), hallucination in LVLMs often arises from misalignments between visual inputs and textual outputs. This paper investigates the underlying mechanisms of hallucination, focusing on the unique structure of LVLMs that distinguishes them from large language models (LLMs). We identify that hallucinations often arise from the sensitivity of text decoders to vision inputs, a natural phenomenon when image encoders and text decoders are pre-trained separately. Inspired by this, we introduce Visual and Textual Intervention (VTI), a novel technique designed to reduce hallucinations by steering latent space representations during inference to enhance the stability of vision features. As a task-agnostic test-time intervention, VTI can be easily applied to any problem without additional cost. Extensive experiments demonstrate that it can effectively reduce hallucinations and outperform baseline methods across multiple metrics, highlighting the critical role of vision feature stability in LVLMs.
[LG-65] High-Fidelity Transfer of Functional Priors for Wide Bayesian Neural Networks by Learning Activations
链接: https://arxiv.org/abs/2410.15777
作者: Marcin Sendera,Amin Sorkhei,Tomasz Kuśmierczyk
关键词-EN: embedding beliefs directly, uncertainty quantification, Neural Networks provide, risk-aware decision-making, Function-space priors
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Function-space priors in Bayesian Neural Networks provide a more intuitive approach to embedding beliefs directly into the model’s output, thereby enhancing regularization, uncertainty quantification, and risk-aware decision-making. However, imposing function-space priors on BNNs is challenging. We address this task through optimization techniques that explore how trainable activations can accommodate complex priors and match intricate target function distributions. We discuss critical learning challenges, including identifiability, loss construction, and symmetries that arise in this context. Furthermore, we enable evidence maximization to facilitate model selection by conditioning the functional priors on additional hyperparameters. Our empirical findings demonstrate that even BNNs with a single wide hidden layer, when equipped with these adaptive trainable activations and conditioning strategies, can effectively achieve high-fidelity function-space priors, providing a robust and flexible framework for enhancing Bayesian neural network performance.
[LG-66] Mislabeled examples detection viewed as probing machine learning models: concepts, survey and extensive benchmark
链接: https://arxiv.org/abs/2410.15772
作者: Thomas George,Pierre Nodet,Alexis Bondu,Vincent Lemaire
关键词-EN: machine learning datasets, real-world machine learning, advocating the development, mislabeled detection methods, ubiquitous in real-world
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Mislabeled examples are ubiquitous in real-world machine learning datasets, advocating the development of techniques for automatic detection. We show that most mislabeled detection methods can be viewed as probing trained machine learning models using a few core principles. We formalize a modular framework that encompasses these methods, parameterized by only 4 building blocks, as well as a Python library that demonstrates that these principles can actually be implemented. The focus is on classifier-agnostic concepts, with an emphasis on adapting methods developed for deep learning models to non-deep classifiers for tabular data. We benchmark existing methods on (artificial) Completely At Random (NCAR) as well as (realistic) Not At Random (NNAR) labeling noise from a variety of tasks with imperfect labeling rules. This benchmark provides new insights as well as limitations of existing methods in this setup.
[LG-67] Solving Sparse High-Dimensional-Output Regression via Compression NEURIPS2024
链接: https://arxiv.org/abs/2410.15762
作者: Renyuan Li,Zhehui Chen,Guanyi Wang
关键词-EN: scientific data analysis, analysis for decision-making, scientific data, data analysis, Multi-Output Regression
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: Accepted at NeurIPS 2024
点击查看摘要
Abstract:Multi-Output Regression (MOR) has been widely used in scientific data analysis for decision-making. Unlike traditional regression models, MOR aims to simultaneously predict multiple real-valued outputs given an input. However, the increasing dimensionality of the outputs poses significant challenges regarding interpretability and computational scalability for modern MOR applications. As a first step to address these challenges, this paper proposes a Sparse & High-dimensional-Output REgression (SHORE) model by incorporating additional sparsity requirements to resolve the output interpretability, and then designs a computationally efficient two-stage optimization framework capable of solving SHORE with provable accuracy via compression on outputs. Theoretically, we show that the proposed framework is computationally scalable while maintaining the same order of training loss and prediction loss before-and-after compression under arbitrary or relatively weak sample set conditions. Empirically, numerical results further validate the theoretical findings, showcasing the efficiency and accuracy of the proposed framework.
[LG-68] Learning-to-Defer for Extractive Question Answering
链接: https://arxiv.org/abs/2410.15761
作者: Montreuil Yannis,Carlier Axel,Ng Lai Xing,Ooi Wei Tsang
关键词-EN: large-scale textual corpora, contextual language understanding, profoundly impacted, impacted the field, field of extractive
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 25 pages, 17 main paper
点击查看摘要
Abstract:Pre-trained language models have profoundly impacted the field of extractive question-answering, leveraging large-scale textual corpora to enhance contextual language understanding. Despite their success, these models struggle in complex scenarios that demand nuanced interpretation or inferential reasoning beyond immediate textual cues. Furthermore, their size poses deployment challenges on resource-constrained devices. Addressing these limitations, we introduce an adapted two-stage Learning-to-Defer mechanism that enhances decision-making by enabling selective deference to human experts or larger models without retraining language models in the context of question-answering. This approach not only maintains computational efficiency but also significantly improves model reliability and accuracy in ambiguous contexts. We establish the theoretical soundness of our methodology by proving Bayes and $(\mathcal{H}, \mathcal{R})$-consistency of our surrogate loss function, guaranteeing the optimality of the final solution. Empirical evaluations on the SQuADv2 dataset illustrate performance gains from integrating human expertise and leveraging larger models. Our results further demonstrate that deferring a minimal number of queries allows the smaller model to achieve performance comparable to their larger counterparts while preserving computing efficiency, thus broadening the applicability of pre-trained language models in diverse operational environments.
[LG-69] DeepVigor+: Scalable and Accurate Semi-Analytical Fault Resilience Analysis for Deep Neural Network
Link: https://arxiv.org/abs/2410.15742
Authors: Mohammad Hasan Ahmadilivani, Jaan Raik, Masoud Daneshtalab, Maksim Jenihhin
Keywords: safety-critical applications necessitates, applications necessitates rigorous, Growing exploitation, necessitates rigorous safety, Deep Neural Networks
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Signal Processing (eess.SP)
Comments: 14 pages, 9 figures, 8 tables, 16 equations. The source code is accessible via: this https URL
Abstract:Growing exploitation of Machine Learning (ML) in safety-critical applications necessitates rigorous safety analysis. Hardware reliability assessment is a major concern with respect to measuring the level of safety. Quantifying the reliability of emerging ML models, including Deep Neural Networks (DNNs), is highly complex due to their enormous size in terms of the number of parameters and computations. Conventionally, Fault Injection (FI) is applied to perform a reliability measurement. However, performing FI on modern-day DNNs is prohibitively time-consuming if an acceptable confidence level is to be achieved. In order to speed up FI for large DNNs, statistical FI has been proposed. However, the run-time for the large DNN models is still considerably long. In this work, we introduce DeepVigor+, a scalable, fast and accurate semi-analytical method as an efficient alternative for reliability measurement in DNNs. DeepVigor+ implements a fault propagation analysis model and attempts to acquire Vulnerability Factors (VFs) as reliability metrics in an optimal way. The results indicate that DeepVigor+ obtains VFs for DNN models with an error less than 1% and 14.9 up to 26.9 times fewer simulations than the best-known state-of-the-art statistical FI enabling an accurate reliability analysis for emerging DNNs within a few minutes.
[LG-70] Object-Centric Temporal Consistency via Conditional Autoregressive Inductive Biases
Link: https://arxiv.org/abs/2410.15728
Authors: Cristian Meo, Akihiro Nakano, Mircea Lică, Aniket Didolkar, Masahiro Suzuki, Anirudh Goyal, Mengmi Zhang, Justin Dauwels, Yutaka Matsuo, Yoshua Bengio
Keywords: Unsupervised object-centric learning, Unsupervised object-centric, learning compositional representations, learning compositional, promising approach
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Unsupervised object-centric learning from videos is a promising approach towards learning compositional representations that can be applied to various downstream tasks, such as prediction and reasoning. Recently, it was shown that pretrained Vision Transformers (ViTs) can be useful to learn object-centric representations on real-world video datasets. However, while these approaches succeed at extracting objects from the scenes, the slot-based representations fail to maintain temporal consistency across consecutive frames in a video, i.e. the mapping of objects to slots changes across the video. To address this, we introduce Conditional Autoregressive Slot Attention (CA-SA), a framework that enhances the temporal consistency of extracted object-centric representations in video-centric vision tasks. Leveraging an autoregressive prior network to condition representations on previous timesteps and a novel consistency loss function, CA-SA predicts future slot representations and imposes consistency across frames. We present qualitative and quantitative results showing that our proposed method outperforms the considered baselines on downstream tasks, such as video prediction and visual question-answering tasks.
[LG-71] S-CFE: Simple Counterfactual Explanations
Link: https://arxiv.org/abs/2410.15723
Authors: Shpresim Sadiku, Moritz Wagner, Sai Ganesh Nagarajan, Sebastian Pokutta
Keywords: finding optimal sparse, finding optimal, manifold-aligned counterfactual explanations, optimal sparse
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Abstract:We study the problem of finding optimal sparse, manifold-aligned counterfactual explanations for classifiers. Canonically, this can be formulated as an optimization problem with multiple non-convex components, including classifier loss functions and manifold alignment (or plausibility) metrics. The added complexity of enforcing sparsity, or shorter explanations, complicates the problem further. Existing methods often focus on specific models and plausibility measures, relying on convex $\ell_1$ regularizers to enforce sparsity. In this paper, we tackle the canonical formulation using the accelerated proximal gradient (APG) method, a simple yet efficient first-order procedure capable of handling smooth non-convex objectives and non-smooth $\ell_p$ (where $0 \leq p < 1$) regularizers. This enables our approach to seamlessly incorporate various classifiers and plausibility measures while producing sparser solutions. Our algorithm only requires differentiable data-manifold regularizers and supports box constraints for bounded feature ranges, ensuring the generated counterfactuals remain actionable. Finally, experiments on real-world datasets demonstrate that our approach effectively produces sparse, manifold-aligned counterfactual explanations while maintaining proximity to the factual data and computational efficiency.
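To make the idea of a proximal step with a sparsity penalty and box constraints more concrete, here is a minimal sketch for the p = 0 case only. It is not the authors' implementation: the function names `prox_l0_box` and `prox_gradient_step`, the penalty weight `lam`, and the choice to operate on a perturbation vector `delta` are assumptions made for illustration, and the acceleration (momentum) part of APG is left out.

```python
def prox_l0_box(z, lam, step, lo, hi):
    """Per coordinate, minimise 0.5*(d - z_i)**2 + step*lam*[d != 0] with lo_i <= d <= hi_i."""
    out = []
    for zi, l, h in zip(z, lo, hi):
        c = min(max(zi, l), h)  # best value if the coordinate is kept: clip to the box
        cost_keep = 0.5 * (c - zi) ** 2 + (step * lam if c != 0.0 else 0.0)
        cost_zero = 0.5 * zi ** 2  # cost of zeroing the coordinate (no sparsity penalty)
        zero_allowed = l <= 0.0 <= h
        out.append(0.0 if zero_allowed and cost_zero <= cost_keep else c)
    return out


def prox_gradient_step(delta, grad, lam, step, lo, hi):
    """One plain (non-accelerated) proximal gradient step on the perturbation `delta`.

    `grad` is the gradient of the smooth part of the objective (hypothetical here).
    """
    z = [d - step * g for d, g in zip(delta, grad)]
    return prox_l0_box(z, lam, step, lo, hi)


if __name__ == "__main__":
    # Small entries are zeroed, large ones are kept and clipped to the box.
    print(prox_l0_box([0.1, 2.0, -3.0], lam=0.5, step=1.0, lo=[-1.0] * 3, hi=[1.0] * 3))
```

Because both the penalty and the box are separable across coordinates, the proximal problem can be solved exactly one coordinate at a time by comparing the "keep" and "zero" costs, which is what the loop does.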
[LG-72] Traffic Matrix Estimation based on Denoising Diffusion Probabilistic Model
Link: https://arxiv.org/abs/2410.15716
Authors: Xinyu Yuan, Yan Qiao, Pei Zhao, Rongyao Hu, Benchu Zhang
Keywords: traffic matrix estimation, tackle TME problems, TME problem, decades of years, traffic matrix
Categories: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments:
Abstract:The traffic matrix estimation (TME) problem has been widely researched for decades. Recent progress in deep generative models offers new opportunities to tackle TME problems in a more advanced way. In this paper, we leverage the powerful distribution-learning ability of denoising diffusion probabilistic models (DDPMs) and, for the first time, adopt DDPM to address the TME problem. To ensure good performance of DDPM in learning the distributions of TMs, we design a preprocessing module to reduce the dimensions of TMs while keeping the data variety of each OD flow. To improve the estimation accuracy, we parameterize the noise factors in DDPM and transform the TME problem into a gradient-descent optimization problem. Finally, we compare our method with state-of-the-art TME methods on two real-world TM datasets; the experimental results strongly demonstrate the superiority of our method on both TM synthesis and TM estimation.
[LG-73] Offline reinforcement learning for job-shop scheduling problems
Link: https://arxiv.org/abs/2410.15714
Authors: Imanol Echeverria, Maialen Murua, Roberto Santana
Keywords: shown significant potential, Recent advances, solving combinatorial optimization, shown significant, significant potential
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent advances in deep learning have shown significant potential for solving combinatorial optimization problems in real-time. Unlike traditional methods, deep learning can generate high-quality solutions efficiently, which is crucial for applications like routing and scheduling. However, existing approaches like deep reinforcement learning (RL) and behavioral cloning have notable limitations, with deep RL suffering from slow learning and behavioral cloning relying solely on expert actions, which can lead to generalization issues and neglect of the optimization objective. This paper introduces a novel offline RL method designed for combinatorial optimization problems with complex constraints, where the state is represented as a heterogeneous graph and the action space is variable. Our approach encodes actions in edge attributes and balances expected rewards with the imitation of expert solutions. We demonstrate the effectiveness of this method on job-shop scheduling and flexible job-shop scheduling benchmarks, achieving superior performance compared to state-of-the-art techniques.
[LG-74] Estimating Individual Dose-Response Curves under Unobserved Confounders from Observational Data
Link: https://arxiv.org/abs/2410.15706
Authors: Shutong Chen, Yang Li
Keywords: addressing causal questions, estimating causal effects, continuously varied treatments, diverse domains, social sciences
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Estimating an individual’s potential response to continuously varied treatments is crucial for addressing causal questions across diverse domains, from healthcare to social sciences. However, existing methods are limited either to estimating causal effects of binary treatments, or to scenarios where all confounding variables are measurable. In this work, we present ContiVAE, a novel framework for estimating causal effects of continuous treatments, measured by individual dose-response curves, considering the presence of unobserved confounders using observational data. Leveraging a variational auto-encoder with a Tilted Gaussian prior distribution, ContiVAE models the hidden confounders as latent variables, and is able to predict the potential outcome of any treatment level for each individual while effectively capturing the heterogeneity among individuals. Experiments on semi-synthetic datasets show that ContiVAE outperforms existing methods by up to 62%, demonstrating its robustness and flexibility. Application on a real-world dataset illustrates its practical utility.
[LG-75] Residual vector quantization for KV cache compression in large language model
Link: https://arxiv.org/abs/2410.15704
Authors: Ankur Kumar
Keywords: requirements during decoding, relied on scalar, reduce the memory, memory requirements, scalar quantization techniques
Categories: Machine Learning (cs.LG)
Comments:
Abstract:KV cache compression methods have mainly relied on scalar quantization techniques to reduce the memory requirements during decoding. In this work, we apply residual vector quantization, which has been widely used for high-fidelity audio compression, to compress the KV cache in large language models (LLMs). We adapt the standard recipe with minimal changes to compress the output of any key or value projection matrix in a pretrained LLM: we scale the vector by its standard deviation, divide channels into groups and then quantize each group with the same residual vector quantizer. We learn the codebook using an exponential moving average, and there are no other learnable parameters, including the input and output projections normally used in a vector quantization setup. We find that a residual depth of 8 recovers most of the performance of the unquantized model. We also find that grouping non-contiguous channels together works better than grouping contiguous channels for compressing the key matrix, and the method further benefits from lightweight fine-tuning of the LLM together with the quantization. Overall, the proposed technique is competitive with existing quantization methods while being much simpler and results in 5.5x compression compared to half precision.
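The recipe in this abstract (scale by the standard deviation, split channels into groups, quantize each group with a shared residual vector quantizer) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the codebooks are given rather than learned with an exponential moving average, the grouping is contiguous for simplicity (the abstract reports that non-contiguous grouping works better for keys), and the names `rvq_encode`, `rvq_decode`, and `compress` are hypothetical.

```python
import statistics


def nearest(codebook, v):
    """Index of the codeword with the smallest squared Euclidean distance to v."""
    return min(range(len(codebook)),
               key=lambda k: sum((c - x) ** 2 for c, x in zip(codebook[k], v)))


def rvq_encode(v, codebooks):
    """Residual VQ: each stage quantizes what the previous stages left over."""
    residual = list(v)
    codes = []
    for cb in codebooks:  # the number of codebooks is the residual depth
        k = nearest(cb, residual)
        codes.append(k)
        residual = [r - c for r, c in zip(residual, cb[k])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct a group by summing the selected codeword of every stage."""
    out = [0.0] * len(codebooks[0][0])
    for k, cb in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, cb[k])]
    return out


def compress(vec, group_size, codebooks):
    """Scale by the standard deviation, split into contiguous groups, encode each group."""
    scale = statistics.pstdev(vec) or 1.0  # guard against a constant vector
    scaled = [x / scale for x in vec]
    groups = [scaled[i:i + group_size] for i in range(0, len(scaled), group_size)]
    return scale, [rvq_encode(g, codebooks) for g in groups]


if __name__ == "__main__":
    # Toy codebooks with depth 2 (the abstract uses depth 8).
    cb1 = [[1.0, 1.0], [-1.0, -1.0]]
    cb2 = [[0.25, 0.0], [0.0, 0.25], [0.0, 0.0]]
    codes = rvq_encode([1.2, 1.0], [cb1, cb2])
    print(codes, rvq_decode(codes, [cb1, cb2]))
```

Only the integer codes and one scale per vector need to be stored, which is where the memory saving comes from; decoding multiplies the reconstructed groups by the stored scale.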
[LG-76] Solving Continual Offline RL through Selective Weights Activation on Aligned Spaces
Link: https://arxiv.org/abs/2410.15698
Authors: Jifeng Hu, Sili Huang, Li Shen, Zhejian Yang, Shengchao Hu, Shisong Tang, Hechang Chen, Yi Chang, Dacheng Tao, Lichao Sun
Keywords: shown impressive ability, Continual offline reinforcement, offline reinforcement learning, diffusion-based lifelong learning, lifelong learning systems
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Continual offline reinforcement learning (CORL) has shown impressive ability in diffusion-based lifelong learning systems by modeling the joint distributions of trajectories. However, most research only focuses on limited continual task settings where the tasks have the same observation and action space, which deviates from the realistic demands of training agents in various environments. In view of this, we propose Vector-Quantized Continual Diffuser, named VQ-CD, to break the barrier of different spaces between various tasks. Specifically, our method contains two complementary components: quantized spaces alignment, which provides a unified basis, and selective weights activation, which builds on it. In the quantized spaces alignment, we leverage vector quantization to align the different state and action spaces of various tasks, facilitating continual training in the same space. Then, we propose to leverage a unified diffusion model coupled with an inverse dynamics model to master all tasks by selectively activating different weights according to task-related sparse masks. Finally, we conduct extensive experiments on 15 continual learning (CL) tasks, including conventional CL task settings (identical state and action spaces) and general CL task settings (various state and action spaces). Compared with 16 baselines, our method reaches SOTA performance.
[LG-77] Enhancing SNN-based Spatio-Temporal Learning: A Benchmark Dataset and Cross-Modality Attention Model
Link: https://arxiv.org/abs/2410.15689
Authors: Shibo Zhou, Bo Yang, Mengwen Yuan, Runhao Jiang, Rui Yan, Gang Pan, Huajin Tang
Keywords: Spiking Neural Networks, Artificial Neural Networks, low power consumption, Neural Networks, Spiking Neural
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:
Abstract:Spiking Neural Networks (SNNs), renowned for their low power consumption, brain-inspired architecture, and spatio-temporal representation capabilities, have garnered considerable attention in recent years. Similar to Artificial Neural Networks (ANNs), high-quality benchmark datasets are of great importance to the advances of SNNs. However, our analysis indicates that many prevalent neuromorphic datasets lack strong temporal correlation, preventing SNNs from fully exploiting their spatio-temporal representation capabilities. Meanwhile, the integration of event and frame modalities offers more comprehensive visual spatio-temporal information. Yet, the SNN-based cross-modality fusion remains underexplored. In this work, we present a neuromorphic dataset called DVS-SLR that can better exploit the inherent spatio-temporal properties of SNNs. Compared to existing datasets, it offers advantages in terms of higher temporal correlation, larger scale, and more varied scenarios. In addition, our neuromorphic dataset contains corresponding frame data, which can be used for developing SNN-based fusion methods. By virtue of the dual-modal feature of the dataset, we propose a Cross-Modality Attention (CMA) based fusion method. The CMA model efficiently utilizes the unique advantages of each modality, allowing for SNNs to learn both temporal and spatial attention scores from the spatio-temporal features of event and frame modalities, subsequently allocating these scores across modalities to enhance their synergy. Experimental results demonstrate that our method not only improves recognition accuracy but also ensures robustness across diverse scenarios.
[LG-78] MIK: Modified Isolation Kernel for Biological Sequence Visualization, Classification, and Clustering
Link: https://arxiv.org/abs/2410.15688
Authors: Sarwan Ali, Prakash Chourasia, Haris Mansoor, Bipin koirala, Murray Patterson
Keywords: Stochastic Neighbor Embedding, t-Distributed Stochastic Neighbor, Neighbor Embedding, Stochastic Neighbor, visualizing high-dimensional data
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Machine Learning (stat.ML)
Comments:
Abstract:The t-Distributed Stochastic Neighbor Embedding (t-SNE) has emerged as a popular dimensionality reduction technique for visualizing high-dimensional data. It computes pairwise similarities between data points by default using an RBF kernel and random initialization (in low-dimensional space), which successfully captures the overall structure but may struggle to preserve the local structure efficiently. This research proposes a novel approach called the Modified Isolation Kernel (MIK) as an alternative to the Gaussian kernel, which is built upon the concept of the Isolation Kernel. MIK uses adaptive density estimation to capture local structures more accurately and integrates robustness measures. It also assigns higher similarity values to nearby points and lower values to distant points. Comparative experiments with the standard Gaussian kernel, the isolation kernel, and several initialization techniques, including random, PCA, and random walk initializations, are used to assess the proposed approach (MIK). Additionally, we compare the computational efficiency of all 3 kernels with 3 different initialization methods. Our experimental results demonstrate several advantages of the proposed kernel (MIK) and initialization method selection. It exhibits improved preservation of the local and global structure and enables better visualization of clusters and subclusters in the embedded space. These findings contribute to advancing dimensionality reduction techniques and provide researchers and practitioners with an effective tool for data exploration, visualization, and analysis in various domains.
[LG-79] Federated Learning with MMD-based Early Stopping for Adaptive GNSS Interference Classification
Link: https://arxiv.org/abs/2410.15681
Authors: Nishant S. Gaikwad, Lucas Heublein, Nisha L. Raichur, Tobias Feigl, Christopher Mutschler, Felix Ott
Keywords: enables multiple devices, enables multiple, Federated learning, Federated, collaboratively train
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:
Abstract:Federated learning (FL) enables multiple devices to collaboratively train a global model while maintaining data on local servers. Each device trains the model on its local server and shares only the model updates (i.e., gradient weights) during the aggregation step. A significant challenge in FL is managing the feature distribution of novel, unbalanced data across devices. In this paper, we propose an FL approach using few-shot learning and aggregation of the model weights on a global server. We introduce a dynamic early stopping method to balance out-of-distribution classes based on representation learning, specifically utilizing the maximum mean discrepancy of feature embeddings between local and global models. An exemplary application of FL is orchestrating machine learning models along highways for interference classification based on snapshots from global navigation satellite system (GNSS) receivers. Extensive experiments on four GNSS datasets from two real-world highways and controlled environments demonstrate that our FL method surpasses state-of-the-art techniques in adapting to both novel interference classes and multipath scenarios.
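For readers unfamiliar with the maximum mean discrepancy (MMD) mentioned above, the following is a minimal sketch of the standard biased squared-MMD estimate with an RBF kernel between two sets of feature embeddings. The abstract does not give the exact stopping rule, so `should_stop` and its `threshold` parameter are purely hypothetical, as are the kernel choice and the `gamma` bandwidth.

```python
import math


def rbf(x, y, gamma):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd2(X, Y, gamma=1.0):
    """Biased estimate of the squared MMD between samples X and Y."""
    def mean_k(A, B):
        return sum(rbf(a, b, gamma) for a in A for b in B) / (len(A) * len(B))
    return mean_k(X, X) + mean_k(Y, Y) - 2.0 * mean_k(X, Y)


def should_stop(local_emb, global_emb, threshold, gamma=1.0):
    """Hypothetical rule: stop local training once the embeddings are close enough."""
    return mmd2(local_emb, global_emb, gamma) < threshold


if __name__ == "__main__":
    local = [[0.0, 0.0], [0.0, 1.0]]
    shifted = [[5.0, 5.0], [5.0, 6.0]]
    print(mmd2(local, local), mmd2(local, shifted))
```

The estimate is zero when both sample sets coincide and grows as the two embedding distributions drift apart, which is what makes it usable as a distribution-shift signal between local and global models.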
[LG-80] RAC: Efficient LLM Factuality Correction with Retrieval Augmentation
Link: https://arxiv.org/abs/2410.15667
Authors: Changmao Li, Jeffrey Flanigan
Keywords: Large Language Models, natural language processing, exhibit impressive results, produce factually incorrect, Large Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Large Language Models (LLMs) exhibit impressive results across a wide range of natural language processing (NLP) tasks, yet they can often produce factually incorrect outputs. This paper introduces a simple but effective low-latency post-correction method, Retrieval Augmented Correction (RAC), aimed at enhancing the factual performance of LLMs without requiring additional fine-tuning. Our method is general and can be used with any instruction-tuned LLM, and has greatly reduced latency compared to prior approaches. RAC decomposes the LLM’s output into atomic facts and applies a fine-grained verification and correction process with retrieved content to verify and correct the LLM-generated output. Our extensive experiments show that RAC yields up to 30% improvements over state-of-the-art baselines across two popular factuality evaluation datasets, validating its efficacy and robustness both with and without the integration of Retrieval-Augmented Generation (RAG) across different LLMs. Our code is at this https URL
[LG-81] Long Term Memory: The Foundation of AI Self-Evolution
Link: https://arxiv.org/abs/2410.15665
Authors: Xun Jiang, Feng Li, Han Zhao, Jiaying Wang, Jun Shao, Shihao Xu, Shu Zhang, Weiling Chen, Xavier Tang, Yize Chen, Mengyue Wu, Weizhi Ma, Mengdi Wang, Tianqiao Chen
Keywords: Large language models, achieving human-level performance, Large language, demonstrated impressive capabilities, language understanding
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 56 pages, 13 figures
Abstract:Large language models (LLMs) like GPTs, trained on vast datasets, have demonstrated impressive capabilities in language understanding, reasoning, and planning, achieving human-level performance in various tasks. Most studies focus on enhancing these models by training on ever-larger datasets to build more powerful foundation models. While training stronger models is important, enabling models to evolve during inference is equally crucial, a process we refer to as AI self-evolution. Unlike large-scale training, self-evolution may rely on limited data or interactions. Inspired by the columnar organization of the human cerebral cortex, we hypothesize that AI models could develop cognitive abilities and build internal representations through iterative interactions with their environment. To achieve this, models need long-term memory (LTM) to store and manage processed interaction data. LTM supports self-evolution by representing diverse experiences across environments and agents. In this report, we explore AI self-evolution and its potential to enhance models during inference. We examine LTM’s role in lifelong learning, allowing models to evolve based on accumulated interactions. We outline the structure of LTM and the systems needed for effective data retention and representation. We also classify approaches for building personalized models with LTM data and show how these models achieve self-evolution through interaction. Using LTM, our multi-agent framework OMNE achieved first place on the GAIA benchmark, demonstrating LTM’s potential for AI self-evolution. Finally, we present a roadmap for future research, emphasizing the importance of LTM for advancing AI technology and its practical applications.
[LG-82] Scalable Data Ablation Approximations for Language Models through Modular Training and Merging EMNLP2024
Link: https://arxiv.org/abs/2410.15661
Authors: Clara Na, Ian Magnusson, Ananya Harsh Jha, Tom Sherborne, Emma Strubell, Jesse Dodge, Pradeep Dasigi
Keywords: Large Language Models, Large Language, Language Models, Training data compositions, data
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: EMNLP 2024. 17 pages
Abstract:Training data compositions for Large Language Models (LLMs) can significantly affect their downstream performance. However, a thorough data ablation study exploring large sets of candidate data mixtures is typically prohibitively expensive since the full effect is seen only after training the models; this can lead practitioners to settle for sub-optimal data mixtures. We propose an efficient method for approximating data ablations which trains individual models on subsets of a training corpus and reuses them across evaluations of combinations of subsets. In continued pre-training experiments, we find that, given an arbitrary evaluation set, the perplexity score of a single model trained on a candidate set of data is strongly correlated with perplexity scores of parameter averages of models trained on distinct partitions of that data. From this finding, we posit that researchers and practitioners can conduct inexpensive simulations of data ablations by maintaining a pool of models that were each trained on partitions of a large training corpus, and assessing candidate data mixtures by evaluating parameter averages of combinations of these models. This approach allows for substantial improvements in amortized training efficiency – scaling only linearly with respect to new data – by enabling reuse of previous training computation, opening new avenues for improving model performance through rigorous, incremental data assessment and mixing.
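The core operation described here, evaluating a candidate data mixture by averaging the parameters of models trained on its partitions, can be illustrated with a small sketch. Everything below is an assumption for illustration: models are represented as plain dictionaries of float lists, `pool` maps hypothetical partition names to such models, and the evaluation step (computing perplexity of the merged model) is left out.

```python
from itertools import combinations


def average_params(models):
    """Uniform parameter average of models that share the same parameter names and shapes."""
    n = len(models)
    return {
        name: [sum(vals) / n for vals in zip(*(m[name] for m in models))]
        for name in models[0]
    }


def candidate_mixtures(pool, size):
    """Yield (partition names, merged parameters) for every combination of `size` partitions."""
    for names in combinations(sorted(pool), size):
        yield names, average_params([pool[n] for n in names])


if __name__ == "__main__":
    # Hypothetical pool: one small "model" per data partition.
    pool = {
        "code": {"w": [1.0, 2.0]},
        "news": {"w": [3.0, 4.0]},
        "wiki": {"w": [5.0, 6.0]},
    }
    for names, merged in candidate_mixtures(pool, 2):
        print(names, merged)
```

Each partition model is trained once and then reused across every mixture that contains it, which is the source of the amortized savings the abstract describes.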
[LG-83] Calibration of ordinal regression networks
Link: https://arxiv.org/abs/2410.15658
Authors: Daehwan Kim, Haejun Chung, Ikbeom Jang
Keywords: deep neural networks, Recent studies, produce over-confident predictions, studies have shown, shown that deep
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent studies have shown that deep neural networks are not well-calibrated and produce over-confident predictions. The miscalibration issue primarily stems from the minimization of cross-entropy, which aims to align predicted softmax probabilities with one-hot labels. In ordinal regression tasks, this problem is compounded by an additional challenge: the expectation that softmax probabilities should exhibit a unimodal distribution is not met with cross-entropy. Rather, the ordinal regression literature has focused on unimodality and overlooked calibration. To address these issues, we propose a novel loss function that introduces order-aware calibration, ensuring that prediction confidence adheres to ordinal relationships between classes. It incorporates soft ordinal encoding and label-smoothing-based regularization to enforce both calibration and unimodality. Extensive experiments across three popular ordinal regression benchmarks demonstrate that our approach achieves state-of-the-art calibration without compromising accuracy.
[LG-84] Accounting for Missing Covariates in Heterogeneous Treatment Estimation
Link: https://arxiv.org/abs/2410.15655
Authors: Khurram Yamin, Vibhhu Sharma, Ed Kennedy, Bryan Wilder
Keywords: causal inference require, separate target population, treatment effects estimated, applications of causal, make decisions
Categories: Machine Learning (cs.LG); Methodology (stat.ME)
Comments:
Abstract:Many applications of causal inference require using treatment effects estimated on a study population to make decisions in a separate target population. We consider the challenging setting where there are covariates that are observed in the target population that were not seen in the original study. Our goal is to estimate the tightest possible bounds on heterogeneous treatment effects conditioned on such newly observed covariates. We introduce a novel partial identification strategy based on ideas from ecological inference; the main idea is that estimates of conditional treatment effects for the full covariate set must marginalize correctly when restricted to only the covariates observed in both populations. Furthermore, we introduce a bias-corrected estimator for these bounds and prove that it enjoys fast convergence rates and statistical guarantees (e.g., asymptotic normality). Experimental results on both real and synthetic data demonstrate that our framework can produce bounds that are much tighter than would otherwise be possible.
[LG-85] Understanding and Alleviating Memory Consumption in RLHF for LLMs
Link: https://arxiv.org/abs/2410.15651
Authors: Jin Zhou, Hanmei Yang, Steven (Jiaxun) Tang, Mingcan Xiang, Hui Guan, Tongping Liu
Keywords: Human Feedback, Reinforcement Learning, Learning with Human, large language models, aligning large language
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Fine-tuning with Reinforcement Learning with Human Feedback (RLHF) is essential for aligning large language models (LLMs). However, RLHF often encounters significant memory challenges. This study is the first to examine memory usage in the RLHF context, exploring various memory management strategies and unveiling the reasons behind excessive memory consumption. Additionally, we introduce a simple yet effective approach that substantially reduces the memory required for RLHF fine-tuning.
[LG-86] Linking Model Intervention to Causal Interpretation in Model Explanation
Link: https://arxiv.org/abs/2410.15648
Authors: Debo Cheng, Ziqi Xu, Jiuyong Li, Lin Liu, Kui Yu, Thuc Duy Le, Jixue Liu
Keywords: model intervention effect, intervention effect, model intervention, Intervention, model
Categories: Machine Learning (cs.LG); Methodology (stat.ME)
Comments:
Abstract:Intervention intuition is often used in model explanation where the intervention effect of a feature on the outcome is quantified by the difference of a model prediction when the feature value is changed from the current value to the baseline value. Such a model intervention effect of a feature is inherently an association. In this paper, we study the conditions under which an intuitive model intervention effect has a causal interpretation, i.e., when it indicates whether a feature is a direct cause of the outcome. This work links the model intervention effect to the causal interpretation of a model. Such an interpretation capability is important since it indicates whether a machine learning model is trustworthy to domain experts. The conditions also reveal the limitations of using a model intervention effect for causal interpretation in an environment with unobserved features. Experiments on semi-synthetic datasets have been conducted to validate theorems and show the potential for using the model intervention effect for model interpretation.
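The model intervention effect defined in the first sentence of this abstract is simple to compute. The sketch below assumes a model exposed as a Python callable on a feature dictionary; the function name `intervention_effect`, the toy linear model, and the baseline value are hypothetical and chosen only for illustration.

```python
def intervention_effect(predict, x, feature, baseline_value):
    """Prediction at x minus prediction after setting one feature to its baseline value."""
    x_baseline = dict(x)  # copy so the original instance is left untouched
    x_baseline[feature] = baseline_value
    return predict(x) - predict(x_baseline)


if __name__ == "__main__":
    # Toy linear model standing in for any trained predictor.
    def predict(row):
        return 2.0 * row["a"] + 3.0 * row["b"]

    instance = {"a": 1.0, "b": 2.0}
    print(intervention_effect(predict, instance, "b", baseline_value=0.0))
```

As the abstract stresses, this quantity describes how the model responds to the change, so reading it as a causal effect of the feature on the real outcome requires the additional conditions the paper studies.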
[LG-87] Deep Graph Attention Networks
Link: https://arxiv.org/abs/2410.15640
Authors: Jun Kato, Airi Mita, Keita Gobara, Akihiro Inokuchi
Keywords: realworld objects, representing various realworld, GAT, number of layers, number
Categories: Machine Learning (cs.LG)
Comments: 8 pages, 6 figures
Abstract:Graphs are useful for representing various real-world objects. However, graph neural networks (GNNs) tend to suffer from over-smoothing, where the representations of nodes of different classes become similar as the number of layers increases, leading to performance degradation. A method that does not require protracted tuning of the number of layers is needed to effectively construct a graph attention network (GAT), a type of GNN. Therefore, we introduce a method called “DeepGAT” for predicting the class to which nodes belong in a deep GAT. It avoids over-smoothing in a GAT by ensuring that nodes in different classes are not similar at each layer. Using DeepGAT to predict class labels, a 15-layer network is constructed without the need to tune the number of layers. DeepGAT prevented over-smoothing and achieved a 15-layer GAT with similar performance to a 2-layer GAT, as indicated by the similar attention coefficients. DeepGAT enables the training of a large network to acquire similar attention coefficients to a network with few layers. It avoids the over-smoothing problem and obviates the need to tune the number of layers, thus saving time and enhancing GNN performance.
[LG-88] Large Deviations and Improved Mean-squared Error Rates of Nonlinear SGD: Heavy-tailed Noise and Power of Symmetry
Link: https://arxiv.org/abs/2410.15637
Authors: Aleksandar Armacki, Shuhua Yu, Dragana Bajovic, Dusan Jakovetic, Soummya Kar
Keywords: nonlinear stochastic gradient, stochastic gradient methods, presence of heavy-tailed, study large deviations, mean-squared error
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR)
Comments: 30 pages. arXiv admin note: text overlap with arXiv:2410.13954
Abstract:We study large deviations and mean-squared error (MSE) guarantees of a general framework of nonlinear stochastic gradient methods in the online setting, in the presence of heavy-tailed noise. Unlike existing works that rely on the closed form of a nonlinearity (typically clipping), our framework treats the nonlinearity in a black-box manner, allowing us to provide unified guarantees for a broad class of bounded nonlinearities, including many popular ones, like sign, quantization, normalization, as well as component-wise and joint clipping. We provide several strong results for a broad range of step-sizes in the presence of heavy-tailed noise with symmetric probability density function, positive in a neighbourhood of zero and potentially unbounded moments. In particular, for non-convex costs we provide a large deviation upper bound for the minimum norm-squared of gradients, showing an asymptotic tail decay on an exponential scale, at a rate $\sqrt{t} / \log(t)$. We establish the accompanying rate function, showing an explicit dependence on the choice of step-size, nonlinearity, noise and problem parameters. Next, for non-convex costs and the minimum norm-squared of gradients, we derive the optimal MSE rate $\widetilde{\mathcal{O}}(t^{-1/2})$. Moreover, for strongly convex costs and the last iterate, we provide an MSE rate that can be made arbitrarily close to the optimal rate $\mathcal{O}(t^{-1})$, improving on the state-of-the-art results in the presence of heavy-tailed noise. Finally, we establish almost sure convergence of the minimum norm-squared of gradients, providing an explicit rate, which can be made arbitrarily close to $o(t^{-1/4})$.
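To make the setting in this abstract concrete, the following is a minimal illustrative sketch of SGD in which a bounded nonlinearity (sign, joint clipping, or normalization) is applied to the noisy gradient before the step. It is not the authors' code. The quadratic cost, the Cauchy noise, the step-size schedule `a / t**delta`, and all function names are assumptions chosen only for illustration.

```python
import math
import random


def sign(g):
    """Component-wise sign nonlinearity."""
    return [(x > 0) - (x < 0) for x in g]


def clip(g, radius=1.0):
    """Joint clipping: rescale g so its Euclidean norm is at most `radius`."""
    norm = math.sqrt(sum(x * x for x in g))
    scale = 1.0 if norm <= radius else radius / norm
    return [scale * x for x in g]


def normalize(g, eps=1e-12):
    """Normalization: map g to (approximately) unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in g))
    return [x / (norm + eps) for x in g]


def cauchy_noise(rng, dim):
    """Symmetric heavy-tailed noise (standard Cauchy, no finite mean)."""
    return [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(dim)]


def nonlinear_sgd(nonlinearity, x0, steps=5000, a=1.0, delta=0.75, seed=0):
    """Run x <- x - (a / t**delta) * nonlinearity(grad + noise).

    Assumed toy cost: f(x) = 0.5 * ||x||^2, whose gradient is x itself.
    """
    rng = random.Random(seed)
    x = list(x0)
    for t in range(1, steps + 1):
        noisy_grad = [g + z for g, z in zip(x, cauchy_noise(rng, len(x)))]
        step = a / (t ** delta)
        update = nonlinearity(noisy_grad)
        x = [xi - step * ui for xi, ui in zip(x, update)]
    return x


if __name__ == "__main__":
    for name, fn in [("sign", sign), ("clip", clip), ("normalize", normalize)]:
        final = nonlinear_sgd(fn, [5.0, -5.0])
        print(name, math.sqrt(sum(v * v for v in final)))
```

Because every nonlinearity here has bounded output, a single huge noise sample can move the iterate by at most one step size, which is the intuition behind treating the nonlinearity as a black box.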
[LG-89] Improving Parallel Program Performance Through DSL-Driven Code Generation with LLM Optimizers
Link: https://arxiv.org/abs/2410.15625
Authors: Anjiang Wei,Allen Nie,Thiago S. F. X. Teixeira,Rohan Yadav,Wonchan Lee,Ke Wang,Alex Aiken
Keywords: Mapping computations, computations to processors, processors and assigning, assigning data, data to memory
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 26 pages, 8 figures
Click to view abstract
Abstract:Mapping computations to processors and assigning data to memory are critical for maximizing performance in parallel programming. These mapping decisions are managed through the development of specialized low-level system code, called mappers, crafted by performance engineers. Each mapper is tailored to a specific application and optimized for the underlying machine architecture, a process that requires days of refinement and tuning from an expert. Despite advances in system research, automating mapper generation remains a challenge due to the complexity of making millions of decisions to find the optimal solution and generate the solution as code. We introduce an approach that leverages recent advances in LLM-based optimizers for mapper design. In under ten minutes, our method automatically discovers mappers that surpass human expert designs in scientific applications by up to 1.34X speedup. For parallel matrix multiplication algorithms, our mapper achieves up to 1.31X of the expert-designed solution. To achieve this, we simplify the complexity of low-level code generation by introducing a domain-specific language (DSL) that abstracts the low-level system programming details and defines a structured search space for LLMs to explore. To maximize the application performance, we use an LLM optimizer to improve an agentic system that generates the mapper code. As a result, this approach significantly reduces the workload for performance engineers while achieving substantial performance gains across diverse applications. Finally, our results demonstrate the effectiveness of LLM-based optimization in system design and suggest its potential for addressing other complex system challenges.
[LG-90] Test-time Adaptation for Cross-modal Retrieval with Query Shift
Link: https://arxiv.org/abs/2410.15624
Authors: Haobin Li,Peng Hu,Qianjun Zhang,Xi Peng,Xiting Liu,Mouxing Yang
Keywords: query shift, methods heavily relies, query, existing cross-modal retrieval, retrieval methods heavily
Subjects: Machine Learning (cs.LG)
Comments: 22 pages, 8 figures
Click to view abstract
Abstract:The success of most existing cross-modal retrieval methods heavily relies on the assumption that the given queries follow the same distribution as the source domain. However, such an assumption is easily violated in real-world scenarios due to the complexity and diversity of queries, thus leading to the query shift problem. Specifically, query shift refers to an online query stream originating from a domain that follows a different distribution from the source one. In this paper, we observe that query shift would not only diminish the uniformity (namely, within-modality scatter) of the query modality but also amplify the gap between query and gallery modalities. Based on the observations, we propose a novel method dubbed Test-time adaptation for Cross-modal Retrieval (TCR). In brief, TCR employs a novel module to refine the query predictions (namely, retrieval results of the query) and a joint objective to prevent query shift from disturbing the common space, thus achieving online adaptation for the cross-modal retrieval models with query shift. Extensive experiments demonstrate the effectiveness of the proposed TCR against query shift. The code will be released upon acceptance.
[LG-91] Erasing Undesirable Concepts in Diffusion Models with Adversarial Preservation
Link: https://arxiv.org/abs/2410.15618
Authors: Anh Bui,Long Vuong,Khanh Doan,Trung Le,Paul Montague,Tamas Abraham,Dinh Phung
Keywords: unfiltered internet data, generating visually striking, inadvertently produce undesirable, visually striking content, Diffusion models excel
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Click to view abstract
Abstract:Diffusion models excel at generating visually striking content from text but can inadvertently produce undesirable or harmful content when trained on unfiltered internet data. A practical solution is to selectively remove target concepts from the model, but this may impact the remaining concepts. Prior approaches have tried to balance this by introducing a loss term to preserve neutral content or a regularization term to minimize changes in the model parameters, yet resolving this trade-off remains challenging. In this work, we propose to identify and preserve the concepts most affected by parameter changes, termed adversarial concepts. This approach ensures stable erasure with minimal impact on the other concepts. We demonstrate the effectiveness of our method using the Stable Diffusion model, showing that it outperforms state-of-the-art erasure methods in eliminating unwanted content while maintaining the integrity of other unrelated elements. Our code is available at this https URL.
[LG-92] Long-time Integration of Nonlinear Wave Equations with Neural Operators
Link: https://arxiv.org/abs/2410.15617
Authors: Guanhang Lei,Zhen Lei,Lei Shi
Keywords: Partial Differential Equations, Partial Differential, types of Partial, Differential Equations, shown promise
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:Neural operators have shown promise in solving many types of Partial Differential Equations (PDEs). They are significantly faster compared to traditional numerical solvers once they have been trained with a certain amount of observed data. However, their numerical performance in solving time-dependent PDEs, particularly in long-time prediction of dynamic systems, still needs improvement. In this paper, we focus on solving the long-time integration of nonlinear wave equations via neural operators by replacing the initial condition with the prediction in a recurrent manner. Given limited observed temporal trajectory data, we utilize some intrinsic features of these nonlinear wave equations, such as conservation laws and well-posedness, to improve the algorithm design and reduce accumulated error. Our numerical experiments examine these improvements in the Korteweg-de Vries (KdV) equation, the sine-Gordon equation, and a semilinear wave equation on the irregular domain.
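The recurrent scheme mentioned in this abstract, where each prediction replaces the initial condition for the next step, can be sketched as a plain autoregressive rollout. This is only an illustration of the idea. `toy_operator` is a hypothetical stand-in for a trained neural operator, and the discrete "mass" is an assumed example of a conserved quantity used to monitor accumulated error.

```python
def rollout(operator, u0, n_steps):
    """Apply `operator` recurrently: each output becomes the next input."""
    trajectory = [u0]
    u = u0
    for _ in range(n_steps):
        u = operator(u)
        trajectory.append(u)
    return trajectory


def toy_operator(u, shift=1):
    """Hypothetical one-step solution operator: periodic shift of grid values.

    A trained neural operator would take this role in practice.
    """
    return u[-shift:] + u[:-shift]


def mass(u):
    """Assumed conserved quantity: the sum of the grid values."""
    return sum(u)


def conservation_drift(trajectory, quantity=mass):
    """Largest deviation of the conserved quantity from its initial value."""
    q0 = quantity(trajectory[0])
    return max(abs(quantity(u) - q0) for u in trajectory)


if __name__ == "__main__":
    u0 = [0.0, 1.0, 2.0, 1.0]
    traj = rollout(toy_operator, u0, n_steps=8)
    print(traj[-1], conservation_drift(traj))
```

For a learned operator the drift would generally be nonzero, so tracking it over the rollout gives a cheap diagnostic of long-time error growth.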
[LG-93] In-Trajectory Inverse Reinforcement Learning: Learn Incrementally From An Ongoing Trajectory
Link: https://arxiv.org/abs/2410.15612
Authors: Shicheng Liu,Minghui Zhu
Keywords: Inverse reinforcement learning, Inverse reinforcement, current IRL works, ongoing trajectory, reward function
Subjects: Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:Inverse reinforcement learning (IRL) aims to learn a reward function and a corresponding policy that best fit the demonstrated trajectories of an expert. However, current IRL works cannot learn incrementally from an ongoing trajectory because they have to wait to collect at least one complete trajectory to learn. To bridge the gap, this paper considers the problem of learning a reward function and a corresponding policy while observing the initial state-action pair of an ongoing trajectory and continually updating the learned reward and policy when new state-action pairs of the ongoing trajectory are observed. We formulate this problem as an online bi-level optimization problem where the upper level dynamically adjusts the learned reward according to the newly observed state-action pairs with the help of a meta-regularization term, and the lower level learns the corresponding policy. We propose a novel algorithm to solve this problem and guarantee that the algorithm achieves sub-linear local regret $O(\sqrt{T} + \log T + \sqrt{T}\log T)$. If the reward function is linear, we prove that the proposed algorithm achieves sub-linear regret $O(\log T)$. Experiments are used to validate the proposed algorithm.
[LG-94] On The Global Convergence Of Online RLHF With Neural Parametrization
Link: https://arxiv.org/abs/2410.15610
Authors: Mudit Gaur,Amrit Singh Bedi,Raghu Pasupathy,Vaneet Aggarwal
Keywords: Human Feedback, large language models, importance of Reinforcement, aligning large language, Reinforcement Learning
Subjects: Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:The importance of Reinforcement Learning from Human Feedback (RLHF) in aligning large language models (LLMs) with human values cannot be overstated. RLHF is a three-stage process that includes supervised fine-tuning (SFT), reward learning, and policy learning. Although there are several offline and online approaches to aligning LLMs, they often suffer from distribution shift issues. These issues arise from the inability to accurately capture the distributional interdependence between the reward learning and policy learning stages. Consequently, this has led to various approximated approaches, but the theoretical insights and motivations remain largely limited to tabular settings, which do not hold in practice. This gap between theoretical insights and practical implementations is critical. It is challenging to address this gap as it requires analyzing the performance of AI alignment algorithms in neural network-parameterized settings. Although bi-level formulations have shown promise in addressing distribution shift issues, they suffer from the hyper-gradient problem, and current approaches lack efficient algorithms to solve this. In this work, we tackle these challenges employing the bi-level formulation laid out in Kwon et al. (2024) along with the assumption of Weak Gradient Domination to demonstrate convergence in an RLHF setup, obtaining a sample complexity of $\epsilon^{-\frac{7}{2}}$. Our key contributions are twofold: (i) We propose a bi-level formulation for AI alignment in parameterized settings and introduce a first-order approach to solve this problem. (ii) We analyze the theoretical convergence rates of the proposed algorithm and derive state-of-the-art bounds. To the best of our knowledge, this is the first work to establish convergence rate bounds and global optimality for the RLHF framework in neural network-parameterized settings.
[LG-95] Moonshine: Speech Recognition for Live Transcription and Voice Commands
Link: https://arxiv.org/abs/2410.15608
Authors: Nat Jeffries,Evan King,Manjunath Kudlur,Guy Nicholson,James Wang,Pete Warden
Keywords: voice command processing, Rotary Position Embedding, paper introduces Moonshine, recognition models optimized, command processing
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 7 pages, 6 figures, 3 tables
Click to view abstract
Abstract:This paper introduces Moonshine, a family of speech recognition models optimized for live transcription and voice command processing. Moonshine is based on an encoder-decoder transformer architecture and employs Rotary Position Embedding (RoPE) instead of traditional absolute position embeddings. The model is trained on speech segments of various lengths, but without using zero-padding, leading to greater efficiency for the encoder during inference time. When benchmarked against OpenAI’s Whisper this http URL, Moonshine Tiny demonstrates a 5x reduction in compute requirements for transcribing a 10-second speech segment while incurring no increase in word error rates across standard evaluation datasets. These results highlight Moonshine’s potential for real-time and resource-constrained applications.
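For readers unfamiliar with Rotary Position Embedding, the sketch below shows the commonly used formulation: consecutive pairs of vector components are rotated by an angle proportional to the token position. The abstract does not specify which RoPE variant Moonshine uses, so the pairing scheme, the `base` value of 10000, and the function names here are assumptions.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by angle position * base**(-i/d).

    `i` runs over even indices, so the exponent equals -2k/d for pair k.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0, 4.0]
    k = [0.5, -1.0, 2.0, 0.0]
    # Same relative offset (2) at different absolute positions.
    print(dot(rope(q, 3), rope(k, 1)))
    print(dot(rope(q, 10), rope(k, 8)))
```

The two printed scores agree up to floating-point error because the rotated query-key product depends only on the relative offset between positions, which is the property that distinguishes RoPE from absolute position embeddings.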
[LG-96] All You Need is an Improving Column: Enhancing Column Generation for Parallel Machine Scheduling via Transformers
Link: https://arxiv.org/abs/2410.15601
Authors: Amira Hijazi,Osman Ozaltin,Reha Uzsoy
Keywords: parallel machine scheduling, network-enhanced column generation, machine scheduling problem, negative reduced cost, neural network-enhanced column
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Click to view abstract
Abstract:We present a neural network-enhanced column generation (CG) approach for a parallel machine scheduling problem. The proposed approach utilizes an encoder-decoder attention model, namely the transformer and pointer architectures, to develop job sequences with negative reduced cost and thus generate columns to add to the master problem. By training the neural network offline and using it in inference mode to predict columns with negative reduced cost, we achieve significant computational time savings compared to dynamic programming (DP). Since the exact DP procedure is used to verify that no further columns with negative reduced cost can be identified at termination, the optimality guarantee of the original CG procedure is preserved. For small to medium-sized instances, our approach achieves an average 45% reduction in computation time compared to solving the subproblems with DP. Furthermore, the model generalizes not only to unseen, larger problem instances from the same probability distribution but also to instances from different probability distributions than those presented at training time. For large-sized instances, the proposed approach achieves an 80% improvement in the objective value in under 500 seconds, demonstrating both its scalability and efficiency.
[LG-97] A Comprehensive Comparative Study of Individual ML Models and Ensemble Strategies for Network Intrusion Detection Systems
Link: https://arxiv.org/abs/2410.15597
Authors: Ismail Bibers,Osvaldo Arreche,Mustafa Abdallah
Keywords: network intrusion detection, devising artificial intelligence, intrusion detection systems, intrusion detection, network intrusion
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:The escalating frequency of intrusions in networked systems has spurred the exploration of new research avenues in devising artificial intelligence (AI) techniques for intrusion detection systems (IDS). Various AI techniques have been used to automate network intrusion detection tasks, yet each model possesses distinct strengths and weaknesses. Selecting the optimal model for a given dataset can pose a challenge, necessitating the exploration of ensemble methods to enhance generalization and applicability in network intrusion detection. This paper addresses this gap by conducting a comprehensive evaluation of diverse individual models and both simple and advanced ensemble methods for network IDS. We introduce an ensemble learning framework tailored for assessing individual models and ensemble methods in network intrusion detection tasks. Our framework encompasses the loading of input datasets, training of individual models and ensemble methods, and the generation of evaluation metrics. Furthermore, we incorporate all features across individual models and ensemble techniques. The study presents results for our framework, encompassing 14 methods, including various bagging, stacking, blending, and boosting techniques applied to multiple base learners such as decision trees and neural networks, among others. We evaluate the framework using two distinct network intrusion datasets, RoEduNet-SIMARGL2021 and CICIDS-2017, each possessing unique characteristics. Additionally, we categorize AI models based on their performances on our evaluation metrics and via their confusion matrices. Our assessment demonstrates the efficacy of learning across most setups explored in this study. Furthermore, we contribute to the community by releasing our source code, providing a foundational ensemble learning framework for network intrusion detection.
[LG-98] A Comprehensive Survey of Datasets, Theories, Variants, and Applications in Direct Preference Optimization
Link: https://arxiv.org/abs/2410.15595
Authors: Wenyi Xiao,Zechuan Wang,Leilei Gan,Shuai Zhao,Wanggui He,Luu Anh Tuan,Long Chen,Hao Jiang,Zhou Zhao,Fei Wu
Keywords: aligning policy models, large language models, Direct Preference Optimization, aligning policy, increasingly critical
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, acting as an RL-free alternative to Reinforcement Learning from Human Feedback (RLHF). Despite DPO’s various advancements and inherent limitations, an in-depth review of these aspects is currently lacking in the literature. In this work, we present a comprehensive review of the challenges and opportunities in DPO, covering theoretical analyses, variants, relevant preference datasets, and applications. Specifically, we categorize recent studies on DPO based on key research questions to provide a thorough understanding of DPO’s current landscape. Additionally, we propose several future research directions to offer insights on model alignment for the research community.
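As background for this survey, the standard DPO objective for a single preference pair can be written in a few lines. The sketch below computes the per-example loss from sequence log-probabilities under the policy and a frozen reference model. The argument names and the default `beta=0.1` are assumptions for illustration, and the many DPO variants the survey covers modify this basic form.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen ratio - rejected ratio)).

    Each argument is the total log-probability of a full response.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m).
    return softplus(-margin)


if __name__ == "__main__":
    # Policy identical to the reference: the loss equals log(2).
    print(dpo_loss(-1.0, -2.0, -1.0, -2.0))
    # Policy raises the chosen response's likelihood: the loss decreases.
    print(dpo_loss(-0.5, -2.0, -1.0, -2.0))
```

No reward model or sampling loop appears anywhere in the computation, which is what makes DPO an RL-free alternative to RLHF.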
[LG-99] SSMT: Few-Shot Traffic Forecasting with Single Source Meta-Transfer ICPR2024
Link: https://arxiv.org/abs/2410.15589
Authors: Kishor Kumar Bhaumik,Minha Kim,Fahim Faisal Niloy,Amin Ahsan Ali,Simon S. Woo
Keywords: Intelligent Transportation Systems, Transportation Systems, Intelligent Transportation, intelligent traffic prediction, intelligent traffic
Subjects: Machine Learning (cs.LG)
Comments: ICPR 2024
Click to view abstract
Abstract:Traffic forecasting in Intelligent Transportation Systems (ITS) is vital for intelligent traffic prediction. Yet, ITS often relies on data from traffic sensors or vehicle devices, where certain cities might not have all those smart devices or enabling infrastructures. Also, recent studies have employed meta-learning to generalize spatial-temporal traffic networks, utilizing data from multiple cities for effective traffic forecasting for data-scarce target cities. However, collecting data from multiple cities can be costly and time-consuming. To tackle this challenge, we introduce Single Source Meta-Transfer Learning (SSMT) which relies only on a single source city for traffic prediction. Our method harnesses this transferred knowledge to enable few-shot traffic forecasting, particularly when the target city possesses limited data. Specifically, we use memory-augmented attention to store the heterogeneous spatial knowledge from the source city and selectively recall it for the data-scarce target city. We extend the idea of sinusoidal positional encoding to establish meta-learning tasks by leveraging diverse temporal traffic patterns from the source city. Moreover, to capture a more generalized representation of the positions, we introduce a meta-positional encoding that learns the optimal representation of the temporal pattern across all the tasks. We experiment on five real-world benchmark datasets to demonstrate that our method outperforms several existing methods in time series traffic prediction.
[LG-100] Multimodal Learning for Embryo Viability Prediction in Clinical IVF MICCAI2024
Link: https://arxiv.org/abs/2410.15581
Authors: Junsik Kim,Zhiyi Shi,Davin Jeong,Johannes Knittel,Helen Y. Yang,Yonghyun Song,Wanhua Li,Yicong Li,Dalit Ben-Yosef,Daniel Needleman,Hanspeter Pfister
Keywords: clinical In-Vitro Fertilization, In-Vitro Fertilization, successful pregnancy, transfer is important, important to increasing
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to MICCAI 2024
Click to view abstract
Abstract:In clinical In-Vitro Fertilization (IVF), identifying the most viable embryo for transfer is important for increasing the likelihood of a successful pregnancy. Traditionally, this process involves embryologists manually assessing embryos’ static morphological features at specific intervals using light microscopy. This manual evaluation is not only time-intensive and costly, due to the need for expert analysis, but also inherently subjective, leading to variability in the selection process. To address these challenges, we develop a multimodal model that leverages both time-lapse video data and Electronic Health Records (EHRs) to predict embryo viability. One of the primary challenges of our research is to effectively combine time-lapse video and EHR data, owing to their inherent differences in modality. We comprehensively analyze our multimodal model with various modality inputs and integration approaches. Our approach will enable fast and automated embryo viability predictions at scale for clinical IVF.
[LG-101] Language Models are Symbolic Learners in Arithmetic
Link: https://arxiv.org/abs/2410.15580
Authors: Chunyuan Deng,Zhiqi Li,Roy Xie,Ruidi Chang,Hanjie Chen
Keywords: Large Language Models, Large Language, Language Models, language modeling, numerical computation
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:Large Language Models (LLMs) are thought to struggle with arithmetic learning due to the inherent differences between language modeling and numerical computation, but concrete evidence has been lacking. This work responds to this claim through a two-sided experiment. We first investigate whether LLMs leverage partial products during arithmetic learning. We find that although LLMs can identify some partial products after learning, they conversely fail to leverage them for arithmetic tasks. We then explore how LLMs approach arithmetic symbolically by breaking tasks into subgroups, hypothesizing that difficulties arise from subgroup complexity and selection. Our results show that when subgroup complexity is fixed, LLMs treat a collection of different arithmetic operations similarly. By analyzing position-level accuracy across different training sizes, we further observe that it follows a U-shaped pattern: LLMs quickly learn the easiest patterns at the first and last positions, while progressively learning the more difficult patterns in the middle positions. This suggests that LLMs select subgroups following an easy-to-hard paradigm during learning. Our work confirms that LLMs are pure symbolic learners in arithmetic tasks and underscores the importance of understanding them deeply through subgroup-level quantification.
[LG-102] Generalized Probabilistic Attention Mechanism in Transformers
Link: https://arxiv.org/abs/2410.15578
Authors: DongNyeong Heo,Heeyoul Choi
Keywords: widely adopted due, attention mechanism, attention, Transformer architecture, conventional attention mechanisms
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:The Transformer architecture has become widely adopted due to its demonstrated success, attributed to the attention mechanism at its core. Despite these successes, the attention mechanism of Transformers is associated with two well-known issues: rank-collapse and gradient vanishing. In this paper, we present a theoretical analysis that it is inherently difficult to address both issues simultaneously in the conventional attention mechanism. To handle these issues, we introduce a novel class of attention mechanism, referred to as generalized probabilistic attention mechanism (GPAM), and its dual-attention implementation within the Transformer architecture. Unlike conventional attention mechanisms, GPAM allows for negative attention scores while preserving a fixed total sum. We provide theoretical evidence that the proposed dual-attention GPAM (daGPAM) effectively mitigates both the rank-collapse and gradient vanishing issues which are difficult to resolve simultaneously with the conventional attention mechanisms. Furthermore, we empirically validate this theoretical evidence, demonstrating the superiority of daGPAM compared to other alternative attention mechanisms that were proposed to address the same issues. Additionally, we demonstrate the practical benefits of GPAM in natural language processing tasks, such as language modeling and neural machine translation.
[LG-103] Stacking Small Language Models for Generalizability
Link: https://arxiv.org/abs/2410.15570
Authors: Laurence Liang
Keywords: Recent advances show, Recent advances, generalize strong performance, generalize strong, advances show
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:Recent advances show that large language models (LLMs) generalize strong performance across different natural language benchmarks. However, the large size of LLMs makes training and inference expensive and impractical to run in resource-limited settings. This paper introduces a new approach called fine-tuning stacks of language models (FSLM), which involves stacking small language models (SLM) as an alternative to LLMs. By fine-tuning each SLM to perform a specific task, this approach breaks down high level reasoning into multiple lower-level steps that specific SLMs are responsible for. As a result, FSLM allows for lower training and inference costs, and also improves model interpretability as each SLM communicates with the subsequent one through natural language. By evaluating FSLM on common natural language benchmarks, this paper highlights promising early results toward generalizable performance using FSLM as a cost-effective alternative to LLMs.
[LG-104] Pruning Foundation Models for High Accuracy without Retraining EMNLP2024
Link: https://arxiv.org/abs/2410.15567
Authors: Pu Zhao,Fei Sun,Xuan Shen,Pinrui Yu,Zhenglun Kong,Yanzhi Wang,Xue Lin
Keywords: deploy foundation models, large language models, parameters and computations, challenging to deploy, deploy foundation
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by EMNLP 2024 findings
Click to view abstract
Abstract:Despite the superior performance, it is challenging to deploy foundation models or large language models (LLMs) due to their massive parameters and computations. While pruning is a promising technique to reduce model size and accelerate the inference, the traditional pruning techniques can hardly be applied for LLMs as they need to finetune the model on the full dataset with multiple epochs consuming massive data and hardware resources. To deal with this problem, post-training pruning methods are proposed to prune LLMs in one-shot without retraining. However, their accuracy after pruning may suffer from certain performance degradation due to the lack of retraining with massive data. To address this issue, in this paper, we first formulate the post-training problem for layer-wise LLM compression to simultaneously prune multiple weights in LLMs. Next, we provide an optimal solution for this problem and design our post-training pruning algorithm for both unstructured and semi-structured sparsity. Our extensive experiments demonstrate the superior performance of the proposed methods in comparison to SOTA baselines across various LLM families including transformer-based LLMs and Mamba-based LLMs. Code link: this https URL
[LG-105] Reward Maximization for Pure Exploration: Minimax Optimal Good Arm Identification for Nonparametric Multi-Armed Bandits
Link: https://arxiv.org/abs/2410.15564
Authors: Brian Cho,Dominik Meier,Kyra Gan,Nathan Kallus
Keywords: tasks of reward, reward maximization, maximization and pure, pure exploration, multi-armed bandits
Subjects: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Comments:
Click to view abstract
Abstract:In multi-armed bandits, the tasks of reward maximization and pure exploration are often at odds with each other. The former focuses on exploiting arms with the highest means, while the latter may require constant exploration across all arms. In this work, we focus on good arm identification (GAI), a practical bandit inference objective that aims to label arms with means above a threshold as quickly as possible. We show that GAI can be efficiently solved by combining a reward-maximizing sampling algorithm with a novel nonparametric anytime-valid sequential test for labeling arm means. We first establish that our sequential test maintains error control under highly nonparametric assumptions and asymptotically achieves the minimax optimal e-power, a notion of power for anytime-valid tests. Next, by pairing regret-minimizing sampling schemes with our sequential test, we provide an approach that achieves minimax optimal stopping times for labeling arms with means above a threshold, under an error probability constraint. Our empirical results validate our approach beyond the minimax setting, reducing the expected number of samples for all stopping times by at least 50% across both synthetic and real-world settings.
[LG-106] How to Find the Exact Pareto Front for Multi-Objective MDPs?
Link: https://arxiv.org/abs/2410.15557
Authors: Yining Li,Peizhong Ju,Ness B. Shroff
Keywords: Multi-objective Markov Decision, Markov Decision Processes, Pareto front, Multi-objective Markov, Decision Processes
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Click to view abstract
Abstract:Multi-objective Markov Decision Processes (MDPs) are receiving increasing attention, as real-world decision-making problems often involve conflicting objectives that cannot be addressed by a single-objective MDP. The Pareto front identifies the set of policies that cannot be dominated, providing a foundation for finding optimal solutions that can efficiently adapt to various preferences. However, finding the Pareto front is a highly challenging problem. Most existing methods either (i) rely on traversing the continuous preference space, which is impractical and results in approximations that are difficult to evaluate against the true Pareto front, or (ii) focus solely on deterministic Pareto optimal policies, from which there are no known techniques to characterize the full Pareto front. Moreover, finding the structure of the Pareto front itself remains unclear even in the context of dynamic programming. This work addresses the challenge of efficiently discovering the Pareto front. By investigating the geometric structure of the Pareto front in MO-MDP, we uncover a key property: the Pareto front is on the boundary of a convex polytope whose vertices all correspond to deterministic policies, and neighboring vertices of the Pareto front differ by only one state-action pair of the deterministic policy, almost surely. This insight transforms the global comparison across all policies into a localized search among deterministic policies that differ by only one state-action pair, drastically reducing the complexity of searching for the exact Pareto front. We develop an efficient algorithm that identifies the vertices of the Pareto front by solving a single-objective MDP only once and then traversing the edges of the Pareto front, making it more efficient than existing methods. Our empirical studies demonstrate the effectiveness of our theoretical strategy in discovering the Pareto front.
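Two building blocks behind the localized search described in this abstract can be sketched simply: enumerating the deterministic policies that differ from a given one in exactly one state-action pair, and checking Pareto dominance between value vectors. This is not the authors' algorithm. The tuple encoding of a policy (one action index per state) and the function names are assumptions, and the filter below only compares a finite set of points rather than characterizing the full front.

```python
def neighbors(policy, n_actions):
    """Yield deterministic policies differing from `policy` in exactly one state.

    `policy` is a tuple where policy[s] is the action index chosen in state s.
    """
    for state, action in enumerate(policy):
        for alt in range(n_actions):
            if alt != action:
                yield policy[:state] + (alt,) + policy[state + 1:]


def dominates(u, v):
    """True if value vector u is at least as good as v everywhere and better somewhere."""
    return (all(a >= b for a, b in zip(u, v))
            and any(a > b for a, b in zip(u, v)))


def non_dominated(points):
    """Keep the points that no other point in the set dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    print(list(neighbors((0, 0, 0), n_actions=2)))
    values = [(1, 3), (2, 2), (3, 1), (1, 1), (2, 1)]
    print(non_dominated(values))
```

A policy over S states and A actions has only S * (A - 1) such neighbors, compared with A ** S deterministic policies overall, which illustrates why a local search over neighbors is far cheaper than a global comparison.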
[LG-107] Gradient Rewiring for Editable Graph Neural Network Training NEURIPS2024
Link: https://arxiv.org/abs/2410.15556
Authors: Zhimeng Jiang,Zirui Liu,Xiaotian Han,Qizhang Feng,Hongye Jin,Qiaoyu Tan,Kaixiong Zhou,Na Zou,Xia Hu
Keywords: Deep neural networks, Deep neural, neural networks, natural language processing, neural network training
Categories: Machine Learning (cs.LG)
Notes: NeurIPS 2024
Abstract:Deep neural networks are ubiquitously adopted in many applications, such as computer vision, natural language processing, and graph analytics. However, well-trained neural networks can make prediction errors after deployment as the world changes. Model editing involves updating the base model to correct prediction errors with less accessible training data and computational resources. Despite recent advances in model editors in computer vision and natural language processing, editable training in graph neural networks (GNNs) is rarely explored. The challenge with editable GNN training lies in the inherent information aggregation across neighbors, which can lead model editors to affect the predictions of other nodes unintentionally. In this paper, we first observe that the gradients of the cross-entropy loss for the target node and for the training nodes are significantly inconsistent, which indicates that directly fine-tuning the base model using the loss on the target node deteriorates the performance on training nodes. Motivated by the gradient inconsistency observation, we propose a simple yet effective Gradient Rewiring method for Editable graph neural network training, named GRE. Specifically, we first store the anchor gradient of the loss on training nodes to preserve the locality. Subsequently, we rewire the gradient of the loss on the target node to preserve performance on the training nodes using the anchor gradient. Experiments demonstrate the effectiveness of GRE on various model architectures and graph datasets across multiple editing situations. The source code is available at this https URL
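The abstract does not spell out the rewiring rule, so the following is only a sketch of one plausible reading: when the target-node gradient conflicts with the stored anchor gradient (negative inner product), remove the conflicting component by projection, in the style of gradient-projection methods. This is an assumption for illustration, not necessarily the update GRE uses; `rewire_gradient` and the toy vectors are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rewire_gradient(target_grad, anchor_grad):
    """Project target_grad so it no longer opposes anchor_grad.

    Assumed rule (not taken from the paper): if the two gradients conflict,
    subtract the component of target_grad that lies along anchor_grad.
    """
    inner = dot(target_grad, anchor_grad)
    if inner >= 0:
        return list(target_grad)  # no conflict, keep the gradient unchanged
    coef = inner / dot(anchor_grad, anchor_grad)
    return [g - coef * a for g, a in zip(target_grad, anchor_grad)]


# Toy gradients: the target-node gradient opposes the anchor along the second axis.
target = [1.0, -2.0]
anchor = [0.0, 1.0]
rewired = rewire_gradient(target, anchor)
print(rewired)
```

After projection the rewired gradient is orthogonal to the anchor gradient, so to first order a step along it does not increase the loss on the training nodes.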
[LG-108] Bayesian Concept Bottleneck Models with LLM Priors
Link: https://arxiv.org/abs/2410.15555
Authors: Jean Feng,Avni Kothari,Luke Zier,Chandan Singh,Yan Shuo Tan
Keywords: Concept Bottleneck Models, Bottleneck Models, Concept Bottleneck, Large Language Models, aiming to achieve
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Notes:
Abstract:Concept Bottleneck Models (CBMs) have been proposed as a compromise between white-box and black-box models, aiming to achieve interpretability without sacrificing accuracy. The standard training procedure for CBMs is to predefine a candidate set of human-interpretable concepts, extract their values from the training data, and identify a sparse subset as inputs to a transparent prediction model. However, such approaches are often hampered by the tradeoff between enumerating a sufficiently large set of concepts to include those that are truly relevant versus controlling the cost of obtaining concept extractions. This work investigates a novel approach that sidesteps these challenges: BC-LLM iteratively searches over a potentially infinite set of concepts within a Bayesian framework, in which Large Language Models (LLMs) serve as both a concept extraction mechanism and prior. BC-LLM is broadly applicable and multi-modal. Despite imperfections in LLMs, we prove that BC-LLM can provide rigorous statistical inference and uncertainty quantification. In experiments, it outperforms comparator methods including black-box models, converges more rapidly towards relevant concepts and away from spuriously correlated ones, and is more robust to out-of-distribution samples.
[LG-109] A Plug-and-Play Fully On-the-Job Real-Time Reinforcement Learning Algorithm for a Direct-Drive Tandem-Wing Experiment Platforms Under Multiple Random Operating Conditions
Link: https://arxiv.org/abs/2410.15554
Authors: Zhang Minghao,Song Bifeng,Yang Xiaojun,Wang Liang
Keywords: Concerto Reinforcement Learning, unstable aerodynamic interference, aerodynamic interference generated, biomimetic systems poses, systems poses substantial
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Notes: 63 pages, 32 figures
Abstract:The nonlinear and unstable aerodynamic interference generated by the tandem wings of such biomimetic systems poses substantial challenges for motion control, especially under multiple random operating conditions. To address these challenges, the Concerto Reinforcement Learning Extension (CRL2E) algorithm has been developed. This plug-and-play, fully on-the-job, real-time reinforcement learning algorithm incorporates a novel Physics-Inspired Rule-Based Policy Composer Strategy with a Perturbation Module alongside a lightweight network optimized for real-time control. To validate the performance and the rationality of the module design, experiments were conducted under six challenging operating conditions, comparing seven different algorithms. The results demonstrate that the CRL2E algorithm achieves safe and stable training within the first 500 steps, improving tracking accuracy by 14 to 66 times compared to the Soft Actor-Critic, Proximal Policy Optimization, and Twin Delayed Deep Deterministic Policy Gradient algorithms. Additionally, CRL2E significantly enhances performance under various random operating conditions, with improvements in tracking accuracy ranging from 8.3% to 60.4% compared to the Concerto Reinforcement Learning (CRL) algorithm. The convergence speed of CRL2E is 36.11% to 57.64% faster than the CRL algorithm with only the Composer Perturbation and 43.52% to 65.85% faster than the CRL algorithm when both the Composer Perturbation and Time-Interleaved Capability Perturbation are introduced, especially in conditions where the standard CRL struggles to converge. Hardware tests indicate that the optimized lightweight network structure excels in weight loading and average inference time, meeting real-time control requirements.
[LG-110] Hiding in Plain Sight: Reframing Hardware Trojan Benchmarking as a HideSeek Modification
Link: https://arxiv.org/abs/2410.15550
Authors: Amin Sarihi,Ahmad Patooghy,Peter Jamieson,Abdel-Hameed A. Badawy
Keywords: hardware design space, Hardware Trojan, advancing security research, hardware design, work focuses
Categories: Cryptography and Security (cs.CR); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Notes: arXiv admin note: substantial text overlap with arXiv:2402.17918
Abstract:This work focuses on advancing security research in the hardware design space by formally defining the realistic problem of Hardware Trojan (HT) detection. The goal is to model HT detection more closely to the real world, i.e., describing the problem as The Seeker’s Dilemma where a detecting agent is unaware of whether circuits are infected by HTs or not. Using this theoretical problem formulation, we create a benchmark that consists of a mixture of HT-free and HT-infected restructured circuits while preserving their original functionalities. The restructured circuits are randomly infected by HTs, causing a situation where the defender is uncertain if a circuit is infected or not. We believe that our innovative benchmark and methodology of creating benchmarks will help the community judge the detection quality of different methods by comparing their success rates in circuit classification. We use our developed benchmark to evaluate three state-of-the-art HT detection tools to show baseline results for this approach. We use Principal Component Analysis to assess the strength of our benchmark, where we observe that some restructured HT-infected circuits are mapped closely to HT-free circuits, leading to significant label misclassification by detectors.
[LG-111] Distributed Thompson sampling under constrained communication
Link: https://arxiv.org/abs/2410.15543
Authors: Saba Zerefa,Zhaolin Ren,Haitong Ma,Na Li
Keywords: distributed Thompson sampling, Thompson sampling, Bayesian optimization, Bayesian Simple Regret, surrogate model
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
Notes: 9 pages
Abstract:In Bayesian optimization, a black-box function is maximized via the use of a surrogate model. We apply distributed Thompson sampling, using a Gaussian process as a surrogate model, to approach the multi-agent Bayesian optimization problem. In our distributed Thompson sampling implementation, each agent receives sampled points from neighbors, where the communication network is encoded in a graph; each agent utilizes a Gaussian process to model the objective function. We demonstrate a theoretical bound on Bayesian Simple Regret, where the bound depends on the size of the largest complete subgraph of the communication graph. Unlike in batch Bayesian optimization, this bound is applicable in cases where the communication graph amongst agents is constrained. When compared to sequential Thompson sampling, our bound guarantees faster convergence with respect to time as long as there is a fully connected subgraph of at least two agents. We confirm the efficacy of our algorithm with numerical simulations on traditional optimization test functions, illustrating the significance of graph connectivity on improving regret convergence.
[LG-112] Grammatical Error Correction for Low-Resource Languages: The Case of Zarma
Link: https://arxiv.org/abs/2410.15539
Authors: Mamadou K. Keita,Christopher Homan,Sofiane Abdoulaye Hamani,Adwoa Bremang,Marcos Zampieri,Habibatou Abdoulaye Alfari,Elysabhete Amadou Ibrahim,Dennis Owusu
Keywords: West Africa, improving written materials, people in West, Grammatical error correction, Grammatical error
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes:
Abstract:Grammatical error correction (GEC) is important for improving written materials for low-resource languages like Zarma, which is spoken by over 5 million people in West Africa. Yet it remains a challenging problem. This study compares rule-based methods, machine translation (MT) models, and large language models (LLMs) for GEC in Zarma. We evaluate each approach’s effectiveness on our manually-built dataset of over 250,000 examples using synthetic and human-annotated data. Our experiments show that the MT-based approach using the M2M100 model outperforms others, achieving a detection rate of 95.82% and a suggestion accuracy of 78.90% in automatic evaluations, and scoring 3.0 out of 5.0 in logical/grammar error correction during manual evaluations (MEs) by native speakers. The rule-based method achieved perfect detection (100%) and high suggestion accuracy (96.27%) for spelling corrections but struggled with context-level errors. LLMs like MT5-small showed moderate performance with a detection rate of 90.62% and a suggestion accuracy of 57.15%. Our work highlights the potential of MT models to enhance GEC in low-resource languages, paving the way for more inclusive NLP tools.
[LG-113] SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training NEURIPS2024
Link: https://arxiv.org/abs/2410.15526
Authors: Jinda Jia,Cong Xie,Hanlin Lu,Daoce Wang,Hao Feng,Chengming Zhang,Baixi Sun,Haibin Lin,Zhi Zhang,Xin Liu,Dingwen Tao
Keywords: Sharded Data Parallelism, Recent years, memory usage, Sharded Data, Data Parallelism
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Notes: Accepted by NeurIPS 2024
Abstract:Recent years have witnessed a clear trend towards language models with an ever-increasing number of parameters, as well as the growing training overhead and memory usage. Distributed training, particularly through Sharded Data Parallelism (ShardedDP) which partitions optimizer states among workers, has emerged as a crucial technique to mitigate training time and memory usage. Yet, a major challenge in the scalability of ShardedDP is the intensive communication of weights and gradients. While compression techniques can alleviate this issue, they often result in worse accuracy. Driven by this limitation, we propose SDP4Bit (Toward 4Bit Communication Quantization in Sharded Data Parallelism for LLM Training), which effectively reduces the communication of weights and gradients to nearly 4 bits via two novel techniques: quantization on weight differences, and two-level gradient smooth quantization. Furthermore, SDP4Bit presents an algorithm-system co-design with runtime optimization to minimize the computation overhead of compression. In addition to the theoretical guarantees of convergence, we empirically evaluate the accuracy of SDP4Bit on the pre-training of GPT models with up to 6.7 billion parameters, and the results demonstrate a negligible impact on training loss. Furthermore, speed experiments show that SDP4Bit achieves up to 4.08× speedup in end-to-end throughput on a scale of 128 GPUs.
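One way to see why quantizing weight differences can be gentler than quantizing weights directly: differences between consecutive iterates are typically much smaller in magnitude, so the same 4-bit grid covers them with a much finer step. The sketch below uses a plain symmetric uniform quantizer with a single scale, which is an assumption for illustration and not the quantizer described in the paper; all names and values are hypothetical.

```python
def quantize_4bit(values):
    """Symmetric uniform quantization to integer codes in [-7, 7] with one scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to floating-point values."""
    return [c * scale for c in codes]


# Hypothetical weights before and after one optimizer step.
prev = [0.5000, -1.2000, 0.8000, 2.0000]
new = [0.5030, -1.2010, 0.7980, 2.0040]

# Option A: quantize the new weights directly.
codes_w, scale_w = quantize_4bit(new)
direct = dequantize(codes_w, scale_w)
err_direct = max(abs(a - b) for a, b in zip(new, direct))

# Option B: quantize only the difference and add it to the previous weights,
# which the receiver is assumed to already hold.
diff = [n - p for n, p in zip(new, prev)]
codes_d, scale_d = quantize_4bit(diff)
recon = [p + d for p, d in zip(prev, dequantize(codes_d, scale_d))]
err_diff = max(abs(a - b) for a, b in zip(new, recon))

print(err_direct, err_diff)
```

In this toy case the reconstruction error of option B is orders of magnitude smaller, because the quantization step is set by the largest difference rather than the largest weight.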
[LG-114] MIRA: A Method of Federated MultI-Task Learning for LaRge LAnguage Models
Link: https://arxiv.org/abs/2410.15524
Authors: Ahmed Elbakary,Chaouki Ben Issaid,Tamer ElBatt,Karim Seddik,Mehdi Bennis
Keywords: Large Language Models, fine-tuning Large Language, Large Language, Language Models, inspired by Multi-Task
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Notes:
Abstract:In this paper, we introduce a method for fine-tuning Large Language Models (LLMs), inspired by Multi-Task learning in a federated manner. Our approach leverages the structure of each client’s model and enables a learning scheme that considers other clients’ tasks and data distribution. To mitigate the extensive computational and communication overhead often associated with LLMs, we utilize a parameter-efficient fine-tuning method, specifically Low-Rank Adaptation (LoRA), reducing the number of trainable parameters. Experimental results, with different datasets and models, demonstrate the proposed method’s effectiveness compared to existing frameworks for federated fine-tuning of LLMs in terms of average and local performances. The proposed scheme outperforms existing baselines by achieving lower local loss for each client while maintaining comparable global performance.
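For readers unfamiliar with LoRA, the reduction in trainable parameters comes from replacing a full update of a weight matrix W (d_out × d_in) with a low-rank product B·A, where B is d_out × r and A is r × d_in. A small sketch of the arithmetic follows; the layer size and rank are arbitrary assumptions, not values from the paper, and the helper names are hypothetical.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for full fine-tuning vs. a rank-r LoRA adapter."""
    full = d_out * d_in
    lora = rank * (d_in + d_out)  # B is d_out x r, A is r x d_in
    return full, lora


def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, b, a, scaling=1.0):
    """Effective weight W + scaling * (B @ A); W itself stays frozen."""
    delta = matmul(b, a)
    return [[wij + scaling * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora, lora / full)

# Tiny rank-1 example: 2x2 frozen weight plus a 2x1 times 1x2 adapter.
w_eff = lora_effective_weight([[1.0, 0.0], [0.0, 1.0]], [[1.0], [2.0]], [[3.0, 4.0]])
print(w_eff)
```

Only A and B would need to be trained and communicated between clients, which is what makes the approach attractive in a federated setting.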
[LG-115] M-RewardBench: Evaluating Reward Models in Multilingual Settings
Link: https://arxiv.org/abs/2410.15522
Authors: Srishti Gureja,Lester James V. Miranda,Shayekh Bin Islam,Rishabh Maheshwary,Drishti Sharma,Gusti Winata,Nathan Lambert,Sebastian Ruder,Sara Hooker,Marzieh Fadaee
Keywords: language modeling process, modeling process, LLMs today, today by enabling, enabling the integration
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 16 pages, 6 figures, 10 tables. Website: this https URL
Abstract:Reward models (RMs) have driven the state-of-the-art performance of LLMs today by enabling the integration of human feedback into the language modeling process. However, RMs are primarily trained and evaluated in English, and their capabilities in multilingual settings remain largely understudied. In this work, we conduct a systematic evaluation of several reward models in multilingual settings. We first construct the first-of-its-kind multilingual RM evaluation benchmark, M-RewardBench, consisting of 2.87k preference instances for 23 typologically diverse languages, that tests the chat, safety, reasoning, and translation capabilities of RMs. We then rigorously evaluate a wide range of reward models on M-RewardBench, offering fresh insights into their performance across diverse languages. We identify a significant gap in RMs’ performances between English and non-English languages and show that RM preferences can change substantially from one language to another. We also present several findings on how different multilingual aspects impact RM performance. Specifically, we show that the performance of RMs improves with better translation quality. Similarly, we demonstrate that the models exhibit better performance for high-resource languages. We release the M-RewardBench dataset and the codebase used in this study to facilitate a better understanding of RM evaluation in multilingual settings.
[LG-116] Generating Tabular Data Using Heterogeneous Sequential Feature Forest Flow Matching
Link: https://arxiv.org/abs/2410.15516
Authors: Ange-Clément Akazan,Alexia Jolicoeur-Martineau,Ioannis Mitliagkas
Keywords: advancing machine learning, Privacy and regulatory, regulatory constraints make, regulatory constraints, vital to advancing
Categories: Machine Learning (cs.LG)
Notes:
Abstract:Privacy and regulatory constraints make data generation vital to advancing machine learning without relying on real-world datasets. A leading approach for tabular data generation is the Forest Flow (FF) method, which combines Flow Matching with XGBoost. Despite its good performance, FF is slow and makes errors when treating categorical variables as one-hot continuous features. It is also highly sensitive to small changes in the initial conditions of the ordinary differential equation (ODE). To overcome these limitations, we develop Heterogeneous Sequential Feature Forest Flow (HS3F). Our method generates data sequentially (feature-by-feature), reducing the dependency on noisy initial conditions through the additional information from previously generated features. Furthermore, it generates categorical variables using multinomial sampling (from an XGBoost classifier) instead of flow matching, improving generation speed. We also use a Runge-Kutta 4th order (Rg4) ODE solver for improved performance over the Euler solver used in FF. Our experiments with 25 datasets reveal that HS3F produces higher quality and more diverse synthetic data than FF, especially for categorical variables. It also generates data 21-27 times faster for datasets with ≥20% categorical variables. HS3F further demonstrates enhanced robustness to affine transformation in flow ODE initial conditions compared to FF. This study not only validates the HS3F but also unveils promising new strategies to advance generative models.
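The accuracy gap between the Euler and 4th-order Runge-Kutta solvers mentioned above is easy to see on a toy problem. The sketch below integrates dx/dt = x from x(0) = 1 to t = 1, where the exact answer is e. The velocity field here is a stand-in chosen for illustration; in flow matching it would be a learned model, and none of these function names come from the paper.

```python
import math


def euler_solve(f, x0, t0, t1, steps):
    """Explicit Euler: one evaluation of f per step, first-order accurate."""
    h = (t1 - t0) / steps
    x, t = x0, t0
    for _ in range(steps):
        x = x + h * f(t, x)
        t += h
    return x


def rk4_solve(f, x0, t0, t1, steps):
    """Classical Runge-Kutta: four evaluations of f per step, fourth-order accurate."""
    h = (t1 - t0) / steps
    x, t = x0, t0
    for _ in range(steps):
        k1 = f(t, x)
        k2 = f(t + h / 2, x + h / 2 * k1)
        k3 = f(t + h / 2, x + h / 2 * k2)
        k4 = f(t + h, x + h * k3)
        x = x + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return x


def velocity(t, x):
    """Toy velocity field dx/dt = x (stand-in for a learned flow)."""
    return x


euler_err = abs(euler_solve(velocity, 1.0, 0.0, 1.0, 10) - math.e)
rk4_err = abs(rk4_solve(velocity, 1.0, 0.0, 1.0, 10) - math.e)
print(euler_err, rk4_err)
```

With the same ten steps, RK4 costs four times as many function evaluations but is several orders of magnitude more accurate on this problem.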
[LG-117] Exploring Curriculum Learning for Vision-Language Tasks: A Study on Small-Scale Multimodal Training CONLL
Link: https://arxiv.org/abs/2410.15509
Authors: Rohan Saha,Abrar Fahim,Alona Fyshe,Alex Murphy
Keywords: train large machine, specialized domains, learning, large machine learning, train large
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Notes: CoNLL BabyLM Challenge 2024 camera ready
Abstract:For specialized domains, there is often not a wealth of data with which to train large machine learning models. In such limited data / compute settings, various methods exist aiming to do more with less, such as finetuning from a pretrained model, modulating difficulty levels as data are presented to a model (curriculum learning), and considering the role of model type / size. Approaches to efficient machine learning also take inspiration from human learning by considering use cases where machine learning systems have access to approximately the same number of words experienced by a 13-year-old child (100M words). We investigate the role of 3 primary variables in a limited data regime as part of the multimodal track of the BabyLM challenge. We contrast: (i) curriculum learning, (ii) pretraining (with text-only data), (iii) model type. We modulate these variables and assess them on two types of tasks: (a) multimodal (text+image), and (b) unimodal (text-only) tasks. We find that curriculum learning benefits multimodal evaluations over non-curriculum learning models, particularly when combined with text-only pretraining. On text-only tasks, curriculum learning appears to help models with smaller trainable parameter counts. We suggest possible reasons based on architectural differences and training designs as to why one might observe such results.
[LG-118] SEA: State-Exchange Attention for High-Fidelity Physics-Based Transformers NEURIPS2024
Link: https://arxiv.org/abs/2410.15495
Authors: Parsa Esmati,Amirhossein Dadashzadeh,Vahid Goodarzi,Nicolas Larrosa,Nicolo Grilli
Keywords: Current approaches, high rollout errors, rollout error accumulation, estimating field variables, approaches using sequential
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: Accepted in 38th Conference on Neural Information Processing Systems (NeurIPS 2024)
Abstract:Current approaches using sequential networks have shown promise in estimating field variables for dynamical systems, but they are often limited by high rollout errors. The unresolved issue of rollout error accumulation results in unreliable estimations as the network predicts further into the future, with each step’s error compounding and leading to an increase in inaccuracy. Here, we introduce the State-Exchange Attention (SEA) module, a novel transformer-based module enabling information exchange between encoded fields through multi-head cross-attention. The cross-field multidirectional information exchange design enables all state variables in the system to exchange information with one another, capturing physical relationships and symmetries between fields. In addition, we incorporate a ViT-like architecture to generate spatially coherent mesh embeddings, further improving the model’s ability to capture spatial dependencies in the data. This enhances the model’s ability to represent complex interactions between the field variables, resulting in reduced rollout error accumulation. Our results show that the Transformer model integrated with the State-Exchange Attention (SEA) module outperforms competitive baseline models, including the PbGMR-GMUS Transformer-RealNVP and GMR-GMUS Transformer, with a reduction in error of 88% and 91%, respectively, achieving state-of-the-art performance. Furthermore, we demonstrate that the SEA module alone can reduce errors by 97% for state variables that are highly dependent on other states of the system.
[LG-119] Reinforcement Learning for Dynamic Memory Allocation
Link: https://arxiv.org/abs/2410.15492
Authors: Arisrei Lim,Abhiram Maddukuri
Keywords: reinforcement learning, recent years, range of tasks, gained popularity, wide range
Categories: Machine Learning (cs.LG); Operating Systems (cs.OS)
Notes:
Abstract:In recent years, reinforcement learning (RL) has gained popularity and has been applied to a wide range of tasks. One such popular domain where RL has been effective is resource management problems in systems. We look to extend work on RL for resource management problems by considering the novel domain of dynamic memory allocation management. We consider dynamic memory allocation to be a suitable domain for RL since current algorithms like first-fit, best-fit, and worst-fit can fail to adapt to changing conditions and can lead to fragmentation and suboptimal efficiency. In this paper, we present a framework in which an RL agent continuously learns from interactions with the system to improve memory management tactics. We evaluate our approach through various experiments using high-level and low-level action spaces and examine different memory allocation patterns. Our results show that RL can successfully train agents that can match and surpass traditional allocation strategies, particularly in environments characterized by adversarial request patterns. We also explore the potential of history-aware policies that leverage previous allocation requests to enhance the allocator’s ability to handle complex request patterns. Overall, we find that RL offers a promising avenue for developing more adaptive and efficient memory allocation strategies, potentially overcoming limitations of hardcoded allocation algorithms.
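The three classical baselines named in the abstract differ only in which sufficiently large free block they pick. A minimal sketch, assuming the allocator state is just a list of free block sizes (a simplification for illustration, not the paper's environment):

```python
def first_fit(free_blocks, size):
    """Index of the first free block that is large enough, or None."""
    for i, block in enumerate(free_blocks):
        if block >= size:
            return i
    return None


def best_fit(free_blocks, size):
    """Index of the smallest free block that is large enough, or None."""
    candidates = [(block, i) for i, block in enumerate(free_blocks) if block >= size]
    return min(candidates)[1] if candidates else None


def worst_fit(free_blocks, size):
    """Index of the largest free block that is large enough, or None."""
    candidates = [(block, i) for i, block in enumerate(free_blocks) if block >= size]
    return max(candidates)[1] if candidates else None


# Hypothetical free list (sizes in bytes) and a 212-byte request.
free_list = [100, 500, 200, 300, 600]
choices = (first_fit(free_list, 212), best_fit(free_list, 212), worst_fit(free_list, 212))
print(choices)
```

Each rule is fixed in advance and ignores the request history, which is the rigidity a learned, history-aware policy is meant to address.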
[LG-120] Structural Causality-based Generalizable Concept Discovery Models
Link: https://arxiv.org/abs/2410.15491
Authors: Sanchit Sinha,Guangzhi Xiong,Aidong Zhang
Keywords: explainable deep neural, deep neural network, neural network architectures, generative factors, utilized semantic concepts
Categories: Machine Learning (cs.LG); Methodology (stat.ME)
Notes:
Abstract:The rising need for explainable deep neural network architectures has utilized semantic concepts as explainable units. Several approaches utilizing disentangled representation learning estimate the generative factors and utilize them as concepts for explaining DNNs. However, even though the generative factors for a dataset remain fixed, concepts are not fixed entities and vary based on downstream tasks. In this paper, we propose a disentanglement mechanism utilizing a variational autoencoder (VAE) for learning mutually independent generative factors for a given dataset and subsequently learning task-specific concepts using a structural causal model (SCM). Our method assumes generative factors and concepts to form a bipartite graph, with directed causal edges from generative factors to concepts. Experiments are conducted on datasets with known generative factors: D-sprites and Shapes3D. On specific downstream tasks, our proposed method successfully learns task-specific concepts which are explained well by the causal edges from the generative factors. Lastly, separate from current causal concept discovery methods, our methodology is generalizable to an arbitrary number of concepts and flexible to any downstream tasks.
[LG-121] Generative AI Agents in Autonomous Machines: A Safety Perspective
Link: https://arxiv.org/abs/2410.15489
Authors: Jason Jabbour,Vijay Janapa Reddi
Keywords: Generative Artificial Intelligence, Artificial Intelligence, major paradigm shift, Generative Artificial, autonomous machines
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:
Abstract:The integration of Generative Artificial Intelligence (AI) into autonomous machines represents a major paradigm shift in how these systems operate and unlocks new solutions to problems once deemed intractable. Although generative AI agents provide unparalleled capabilities, they also have unique safety concerns. These challenges require robust safeguards, especially for autonomous machines that operate in high-stakes environments. This work investigates the evolving safety requirements when generative models are integrated as agents into physical autonomous machines, comparing these to safety considerations in less critical AI applications. We explore the challenges and opportunities to ensure the safe deployment of generative AI-driven autonomous machines. Furthermore, we provide a forward-looking perspective on the future of AI-driven autonomous systems and emphasize the importance of evaluating and communicating safety risks. As an important step towards addressing these concerns, we recommend the development and implementation of comprehensive safety scorecards for the use of generative AI technologies in autonomous machines.
[LG-122] Mitigating Forgetting in LLM Supervised Fine-Tuning and Preference Learning
Link: https://arxiv.org/abs/2410.15483
Authors: Heshan Fernando,Han Shen,Parikshit Ram,Yi Zhou,Horst Samulowitz,Nathalie Baracaldo,Tianyi Chen
Keywords: safe LLM applications, supervised fine-tuning, preference learning, SFT and RLHF, typically consists
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes:
Abstract:Post-training of pre-trained LLMs, which typically consists of the supervised fine-tuning (SFT) stage and the preference learning (RLHF or DPO) stage, is crucial to effective and safe LLM applications. The widely adopted approach in post-training popular open-source LLMs is to sequentially perform SFT and RLHF/DPO. However, sequential training is sub-optimal in terms of the SFT and RLHF/DPO trade-off: the LLM gradually forgets about the first stage’s training when undergoing the second stage’s training. We theoretically prove the sub-optimality of sequential post-training. Furthermore, we propose a practical joint post-training framework that has theoretical convergence guarantees and empirically outperforms the sequential post-training framework at a similar computational cost. Our code is available at this https URL.
[LG-123] Optimizing Backward Policies in GFlowNets via Trajectory Likelihood Maximization
Link: https://arxiv.org/abs/2410.15474
Authors: Timofei Gritsaev,Nikita Morozov,Sergey Samsonov,Daniil Tiapkin
Keywords: Generative Flow Networks, Flow Networks, Generative Flow, generative models, models that learn
Categories: Machine Learning (cs.LG)
Notes:
Abstract:Generative Flow Networks (GFlowNets) are a family of generative models that learn to sample objects with probabilities proportional to a given reward function. The key concept behind GFlowNets is the use of two stochastic policies: a forward policy, which incrementally constructs compositional objects, and a backward policy, which sequentially deconstructs them. Recent results show a close relationship between GFlowNet training and entropy-regularized reinforcement learning (RL) problems with a particular reward design. However, this connection applies only in the setting of a fixed backward policy, which might be a significant limitation. As a remedy to this problem, we introduce a simple backward policy optimization algorithm that involves direct maximization of the value function in an entropy-regularized Markov Decision Process (MDP) over intermediate rewards. We provide an extensive experimental evaluation of the proposed approach across various benchmarks in combination with both RL and GFlowNet algorithms and demonstrate its faster convergence and mode discovery in complex environments.
[LG-124] Bayesian data fusion for distributed learning
Link: https://arxiv.org/abs/2410.15473
Authors: Peng Wu,Tales Imbiriba,Pau Closas
Keywords: main challenges, handling non-independent, non-independent and identically, occur in practice, practice due
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (stat.ML)
Notes:
Abstract:One of the main challenges of federated learning (FL) is handling non-independent and identically distributed (non-IID) client data, which may occur in practice due to unbalanced datasets and use of different data sources across clients. Knowledge sharing and model personalization are key strategies for addressing this issue. Clustered federated learning is a class of FL methods that groups clients that observe similarly distributed data into clusters, such that every client is typically associated with one data distribution and participates in training a model for that distribution alongside their cluster peers. In this paper, we present a unified Bayesian framework for clustered FL which associates clients to clusters. Then we propose several practical algorithms to handle the otherwise growing data associations in a way that trades off performance and computational complexity. This work provides insights on client-cluster associations and enables client knowledge sharing in new ways. The proposed framework circumvents the need for unique client-cluster associations, which is seen to increase the performance of the resulting models in a variety of experiments.
[LG-125] Multi-Layer Feature Fusion with Cross-Channel Attention-Based U-Net for Kidney Tumor Segmentation
Link: https://arxiv.org/abs/2410.15472
Authors: Fnu Neha,Arvind K. Bansal
Keywords: show significant heterogeneity, renal cell carcinoma, cell carcinoma, show significant, significant heterogeneity
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: 8 pages
Abstract:Renal tumors, especially renal cell carcinoma (RCC), show significant heterogeneity, posing challenges for diagnosis using radiology images such as MRI, echocardiograms, and CT scans. U-Net based deep learning techniques are emerging as a promising approach for automated medical image segmentation for minimally invasive diagnosis of renal tumors. However, current techniques need further improvements in accuracy to become clinically useful to radiologists. In this study, we present an improved U-Net based model for end-to-end automated semantic segmentation of CT scan images to identify renal tumors. The model uses residual connections across convolution layers, integrates a multi-layer feature fusion (MFF) and cross-channel attention (CCA) within encoder blocks, and incorporates skip connections augmented with additional information derived using MFF and CCA. We evaluated our model on the KiTS19 dataset, which contains data from 210 patients. For kidney segmentation, our model achieves a Dice Similarity Coefficient (DSC) of 0.97 and a Jaccard index (JI) of 0.95. For renal tumor segmentation, our model achieves a DSC of 0.96 and a JI of 0.91. Based on a comparison of available DSC scores, our model outperforms the current leading models.
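For reference, the two overlap metrics reported above are defined on a predicted mask P and a ground-truth mask T as DSC = 2|P ∩ T| / (|P| + |T|) and JI = |P ∩ T| / |P ∪ T|. On a single mask pair they are tied by JI = DSC / (2 − DSC). A small sketch on flattened binary masks (the toy masks are made up for illustration):

```python
def dice_and_jaccard(pred, truth):
    """Dice coefficient and Jaccard index for two flattened binary masks."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    pred_sum = sum(1 for p in pred if p)
    truth_sum = sum(1 for t in truth if t)
    union = pred_sum + truth_sum - inter
    # Convention assumed here: two empty masks count as a perfect match.
    dice = 2 * inter / (pred_sum + truth_sum) if (pred_sum + truth_sum) else 1.0
    jaccard = inter / union if union else 1.0
    return dice, jaccard


# Toy 1-D masks: 3 predicted pixels, 3 true pixels, 2 in common.
pred_mask = [1, 1, 1, 0, 0, 0]
true_mask = [0, 1, 1, 1, 0, 0]
dice, jaccard = dice_and_jaccard(pred_mask, true_mask)
print(dice, jaccard)
```

The identity holds per mask pair only. Scores averaged over many scans, as benchmark results usually are, need not satisfy it exactly.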
[LG-126] How Aligned are Generative Models to Humans in High-Stakes Decision-Making?
Link: https://arxiv.org/abs/2410.15471
Authors: Sarah Tan, Keri Mallari, Julius Adebayo, Albert Gordo, Martin T. Wells, Kori Inkpen
Keywords: Large generative models, Large generative, high-stakes decision-making, increasingly being considered, considered for high-stakes
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:Large generative models (LMs) are increasingly being considered for high-stakes decision-making. This work considers how such models compare to humans and predictive AI models on a specific case of recidivism prediction. We combine three datasets – COMPAS predictive AI risk scores, human recidivism judgements, and photos – into a dataset on which we study the properties of several state-of-the-art, multimodal LMs. Beyond accuracy and bias, we focus on studying human-LM alignment on the task of recidivism prediction. We investigate if these models can be steered towards human decisions, the impact of adding photos, and whether anti-discrimination prompting is effective. We find that LMs can be steered to outperform humans and COMPAS using in-context learning. We find anti-discrimination prompting to have unintended effects, causing some models to inhibit themselves and significantly reduce their number of positive predictions.
[LG-127] Data Augmentation via Diffusion Model to Enhance AI Fairness
Link: https://arxiv.org/abs/2410.15470
Authors: Christina Hastings Blow, Lijun Qian, Camille Gibson, Pamela Obiomon, Xishuang Dong
Keywords: outcomes genuinely reflect, interests of users, transparency and explainability, systems by ensuring, outcomes genuinely
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: arXiv admin note: text overlap with arXiv:2312.12560
Abstract:AI fairness seeks to improve the transparency and explainability of AI systems by ensuring that their outcomes genuinely reflect the best interests of users. Data augmentation, which involves generating synthetic data from existing datasets, has gained significant attention as a solution to data scarcity. In particular, diffusion models have become a powerful technique for generating synthetic data, especially in fields like computer vision. This paper explores the potential of diffusion models to generate synthetic tabular data to improve AI fairness. The Tabular Denoising Diffusion Probabilistic Model (Tab-DDPM), a diffusion model adaptable to any tabular dataset and capable of handling various feature types, was utilized with different amounts of generated data for data augmentation. Additionally, reweighting samples from AIF360 was employed to further enhance AI fairness. Five traditional machine learning models-Decision Tree (DT), Gaussian Naive Bayes (GNB), K-Nearest Neighbors (KNN), Logistic Regression (LR), and Random Forest (RF)-were used to validate the proposed approach. Experimental results demonstrate that the synthetic data generated by Tab-DDPM improves fairness in binary classification.
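The sample reweighting step mentioned in this abstract can be sketched as follows. This assumes the classic reweighing scheme, where each sample gets the weight P(group) * P(label) / P(group, label); whether the paper uses exactly this variant is an assumption. The code does not call AIF360, and `reweighing_weights` and the toy data are hypothetical.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Weight each sample by P(group) * P(label) / P(group, label)."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [
        group_counts[g] * label_counts[y] / (n * joint_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]


def weighted_positive_rate(groups, labels, weights, group):
    """Share of the group's total weight that falls on positive samples."""
    total = sum(w for g, w in zip(groups, weights) if g == group)
    positive = sum(w for g, y, w in zip(groups, labels, weights) if g == group and y == 1)
    return positive / total


# Toy data (hypothetical): group "a" receives the positive label more often than group "b".
groups = ["a", "a", "a", "b", "b", "b", "b", "b"]
labels = [1, 1, 0, 1, 0, 0, 0, 0]
weights = reweighing_weights(groups, labels)
rate_a = weighted_positive_rate(groups, labels, weights, "a")
rate_b = weighted_positive_rate(groups, labels, weights, "b")
print(weights, rate_a, rate_b)
```

After reweighting, both groups have the same weighted positive rate, equal to the overall base rate, which is the property a downstream classifier trained with these sample weights benefits from.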
[LG-128] Efficient Model Extraction via Boundary Sampling
Link: https://arxiv.org/abs/2410.15429
Authors: Maor Biton Dor, Yisroel Mirsky
Keywords: advances the current, terms of efficiency, data-free model extraction, paper introduces, significantly advances
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Abstract:This paper introduces a novel data-free model extraction attack that significantly advances the current state-of-the-art in terms of efficiency, accuracy, and effectiveness. Traditional black-box methods rely on using the victim’s model as an oracle to label a vast number of samples within high-confidence areas. This approach not only requires an extensive number of queries but also results in a less accurate and less transferable model. In contrast, our method innovates by focusing on sampling low-confidence areas (along the decision boundaries) and employing an evolutionary algorithm to optimize the sampling process. These novel contributions allow for a dramatic reduction in the number of queries needed by the attacker by a factor of 10x to 600x while simultaneously improving the accuracy of the stolen model. Moreover, our approach improves boundary alignment, resulting in better transferability of adversarial examples from the stolen model to the victim’s model (increasing the attack success rate from 60% to 82% on average). Finally, we accomplish all of this with a strict black-box assumption on the victim, with no knowledge of the target’s architecture or dataset. We demonstrate our attack on three datasets with increasingly larger resolutions and compare our performance to four state-of-the-art model extraction attacks.
[LG-129] Accelerated Sub-Image Search For Variable-Size Patches Identification Based On Virtual Time Series Transformation And Segmentation
Link: https://arxiv.org/abs/2410.15425
Authors: Mogens Plessen
Keywords: fields requiring spot, requiring spot spraying, fixed-size objects, small-scale reference image, hay bales
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 9 figures, 3 tables
Abstract:This paper addresses two tasks: (i) fixed-size objects such as hay bales are to be identified in an aerial image for a given reference image of the object, and (ii) variable-size patches such as areas on fields requiring spot spraying or other handling are to be identified in an image for a given small-scale reference image. Both tasks are related. The second differs in that identified sub-images similar to the reference image are further clustered before patch contours are determined by solving a traveling salesman problem. Both tasks are complex in that the exact number of similar sub-images is not known a priori. The main focus of this paper is the presentation of an acceleration mechanism for sub-image search that is based on a transformation of an image to multivariate time series along the RGB-channels and subsequent segmentation to reduce the 2D search space in the image. Two variations of the acceleration mechanism are compared to exhaustive search on diverse synthetic and real-world images. Quantitatively, the proposed method yields solve time reductions of up to 2 orders of magnitude, while qualitatively delivering comparable visual results. The proposed method is neural network-free and does not use any image pre-processing.
[LG-130] Power Plays: Unleashing Machine Learning Magic in Smart Grids
Link: https://arxiv.org/abs/2410.15423
Authors: Abdur Rashid, Parag Biswas, abdullah al masum, MD Abdullah Al Nasim, Kishor Datta Gupta
Keywords: machine learning, modern energy networks, grid systems represents, represents a transformative, transformative step
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages, 1 figure
Abstract:The integration of machine learning into smart grid systems represents a transformative step in enhancing the efficiency, reliability, and sustainability of modern energy networks. By adding advanced data analytics, these systems can better manage the complexities of renewable energy integration, demand response, and predictive maintenance. Machine learning algorithms analyze vast amounts of data from smart meters, sensors, and other grid components to optimize energy distribution, forecast demand, and detect irregularities that could indicate potential failures. This enables more precise load balancing, reduces operational costs, and enhances the resilience of the grid against disturbances. Furthermore, the use of predictive models helps in anticipating equipment failures, thereby improving the reliability of the energy supply. As smart grids continue to evolve, the role of machine learning in managing decentralized energy sources and enabling real-time decision-making will become increasingly critical. However, the deployment of these technologies also raises challenges related to data privacy, security, and the need for robust infrastructure. In addressing these issues, the authors of this research focus on realizing the full potential of smart grids, ensuring they meet the growing energy demands while maintaining a focus on sustainability and efficiency using machine learning techniques. Furthermore, this research will help determine the smart grid’s essentiality with the aid of machine learning. Multiple ML algorithms have been integrated along with their pros and cons. The future scope of these algorithms is also integrated.
[LG-131] Where to Build Food Banks and Pantries: A Two-Level Machine Learning Approach
Link: https://arxiv.org/abs/2410.15420
Authors: Gavin Ruan, Ziqi Guo, Guang Lin
Keywords: million Americans, Americans currently suffer, food bank, food, pantry locations
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 4 figures
Abstract:Over 44 million Americans currently suffer from food insecurity, of whom 13 million are children. Across the United States, thousands of food banks and pantries serve as vital sources of food and other forms of aid for food insecure families. By optimizing food bank and pantry locations, food would become more accessible to families who desperately require it. In this work, we introduce a novel two-level optimization framework, which utilizes the K-Medoids clustering algorithm in conjunction with the Open-Source Routing Machine engine, to optimize food bank and pantry locations based on real road distances to houses and house blocks. Our proposed framework also has the adaptability to factor in considerations such as median household income using a pseudo-weighted K-Medoids algorithm. Testing conducted with California and Indiana household data, as well as comparisons with real food bank and pantry locations showed that interestingly, our proposed framework yields food pantry locations superior to those of real existing ones and saves significant distance for households, while there is a marginal penalty on the first level food bank to food pantry distance. Overall, we believe that the second-level benefits of this framework far outweigh any drawbacks and yield a net benefit result.
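The clustering step described in this abstract can be sketched with a small weighted K-Medoids routine on a precomputed distance matrix. Several things here are assumptions: the toy matrix stands in for road distances that the paper obtains from the Open-Source Routing Machine, the per-household `weights` are only one possible reading of the "pseudo-weighted" variant, and the alternating update used below is a common simple heuristic rather than necessarily the authors' solver. All locations are assumed to be distinct.

```python
def weighted_k_medoids(dist, weights, initial_medoids, max_iter=100):
    """Alternate assignment and medoid updates on a precomputed distance matrix."""

    def assign(medoids):
        # Attach every point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(len(dist)):
            nearest = min(medoids, key=lambda m: dist[i][m])
            clusters[nearest].append(i)
        return clusters

    medoids = sorted(initial_medoids)
    for _ in range(max_iter):
        clusters = assign(medoids)
        # In each cluster, pick the member with the lowest weighted total distance.
        updated = sorted(
            min(members, key=lambda c: sum(weights[i] * dist[i][c] for i in members))
            for members in clusters.values()
        )
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(medoids)


# Toy example (hypothetical): six households on a line, distance = absolute difference.
positions = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in positions] for a in positions]

uniform_medoids, uniform_clusters = weighted_k_medoids(dist, [1] * 6, [0, 3])
# Giving household 2 a larger weight pulls the first facility towards it.
weighted_medoids, _ = weighted_k_medoids(dist, [1, 1, 5, 1, 1, 1], [0, 3])
print(uniform_medoids, weighted_medoids)
```

Medoids are always actual data points, which suits facility placement: a pantry can only be sited at a real candidate location. The alternating scheme can stop in a local optimum, so the result depends on the initial medoids.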
[LG-132] Dynamic Contrastive Learning for Time Series Representation
Link: https://arxiv.org/abs/2410.15416
Authors: Abdul-Kazeem Shamba, Kerstin Bach, Gavin Taylor
Keywords: Understanding events, time series, Understanding, time, series
Categories: Machine Learning (cs.LG)
Abstract:Understanding events in time series is an important task in a variety of contexts. However, human analysis and labeling are expensive and time-consuming. Therefore, it is advantageous to learn embeddings for moments in time series in an unsupervised way, which allows for good performance in classification or detection tasks after later minimal human labeling. In this paper, we propose dynamic contrastive learning (DynaCL), an unsupervised contrastive representation learning framework for time series that uses temporal adjacent steps to define positive pairs. DynaCL adopts N-pair loss to dynamically treat all samples in a batch as positive or negative pairs, enabling efficient training and addressing the challenges of complicated sampling of positives. We demonstrate that DynaCL embeds instances from time series into semantically meaningful clusters, which allows superior performance on downstream tasks on a variety of public time series datasets. Our findings also reveal that high scores on unsupervised clustering metrics do not guarantee that the representations are useful in downstream tasks.
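The idea of using temporally adjacent steps as positives and the rest of the batch as negatives can be sketched with a softmax-style N-pair loss. This is only an illustration under assumptions: the exact loss, similarity function, and batching used by DynaCL may differ, and `adjacent_npair_loss` and the toy embeddings are hypothetical.

```python
import math


def adjacent_npair_loss(embeddings):
    """Mean loss where step i+1 is the positive for step i and all other steps are negatives."""

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    n = len(embeddings)
    total = 0.0
    for i in range(n - 1):
        anchor = embeddings[i]
        # Similarity of the anchor to every other sample in the batch.
        logits = [dot(anchor, embeddings[j]) for j in range(n) if j != i]
        positive = dot(anchor, embeddings[i + 1])
        # Numerically stable log-sum-exp over positive and negative logits.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_denominator - positive
    return total / (n - 1)


def on_circle(angles):
    """Toy unit-norm embeddings (hypothetical data)."""
    return [[math.cos(a), math.sin(a)] for a in angles]


# Embeddings that drift smoothly over time versus the same embeddings in shuffled order.
smooth_loss = adjacent_npair_loss(on_circle([0.0, 0.3, 0.6, 0.9, 1.2, 1.5]))
shuffled_loss = adjacent_npair_loss(on_circle([0.0, 1.5, 0.3, 1.2, 0.6, 0.9]))
print(smooth_loss, shuffled_loss)
```

The loss is lower when neighbouring time steps have similar embeddings, so minimizing it pushes an encoder towards representations that vary smoothly in time without needing labels or a separate positive-sampling procedure.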
[LG-133] PEAS: A Strategy for Crafting Transferable Adversarial Examples
Link: https://arxiv.org/abs/2410.15409
Authors: Bar Avraham, Yisroel Mirsky
Keywords: machine learning systems, learning systems, target model, Black box attacks, threat to machine
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Abstract:Black box attacks, where adversaries have limited knowledge of the target model, pose a significant threat to machine learning systems. Adversarial examples generated with a substitute model often suffer from limited transferability to the target model. While recent work explores ranking perturbations for improved success rates, these methods see only modest gains. We propose a novel strategy called PEAS that can boost the transferability of existing black box attacks. PEAS leverages the insight that samples which are perceptually equivalent exhibit significant variability in their adversarial transferability. Our approach first generates a set of images from an initial sample via subtle augmentations. We then evaluate the transferability of adversarial perturbations on these images using a set of substitute models. Finally, the most transferable adversarial example is selected and used for the attack. Our experiments show that PEAS can double the performance of existing attacks, achieving a 2.5x improvement in attack success rates on average over current ranking methods. We thoroughly evaluate PEAS on ImageNet and CIFAR-10, analyze hyperparameter impacts, and provide an ablation study to isolate each component’s importance.
[LG-134] IPO: Interpretable Prompt Optimization for Vision-Language Models NEURIPS2024
Link: https://arxiv.org/abs/2410.15397
Authors: Yingjun Du, Wenfang Sun, Cees G. M. Snoek
Keywords: CLIP have remarkably, Pre-trained vision-language models, Pre-trained vision-language, prompts, downstream tasks
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2024
Abstract:Pre-trained vision-language models like CLIP have remarkably adapted to various downstream tasks. Nonetheless, their performance heavily depends on the specificity of the input text prompts, which requires skillful prompt template engineering. Instead, current approaches to prompt optimization learn the prompts through gradient descent, where the prompts are treated as adjustable parameters. However, these methods tend to lead to overfitting of the base classes seen during training and produce prompts that are no longer understandable by humans. This paper introduces a simple but interpretable prompt optimizer (IPO), that utilizes large language models (LLMs) to generate textual prompts dynamically. We introduce a Prompt Optimization Prompt that not only guides LLMs in creating effective prompts but also stores past prompts with their performance metrics, providing rich in-context information. Additionally, we incorporate a large multimodal model (LMM) to condition on visual content by generating image descriptions, which enhance the interaction between textual and visual modalities. This allows for the creation of dataset-specific prompts that improve generalization performance, while maintaining human comprehension. Extensive testing across 11 datasets reveals that IPO not only improves the accuracy of existing gradient-descent-based prompt learning methods but also considerably enhances the interpretability of the generated prompts. By leveraging the strengths of LLMs, our approach ensures that the prompts remain human-understandable, thereby facilitating better transparency and oversight for vision-language models.
[LG-135] Synthetic Data Generation for Residential Load Patterns via Recurrent GAN and Ensemble Method
Link: https://arxiv.org/abs/2410.15379
Authors: Xinyu Liang, Ziheng Wang, Hao Wang
Keywords: accurately represent actual, represent actual electricity, actual electricity consumption, power system planning, Generating synthetic residential
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages
Abstract:Generating synthetic residential load data that can accurately represent actual electricity consumption patterns is crucial for effective power system planning and operation. The necessity for synthetic data is underscored by the inherent challenges associated with using real-world load data, such as privacy considerations and logistical complexities in large-scale data collection. In this work, we tackle the above-mentioned challenges by developing the Ensemble Recurrent Generative Adversarial Network (ERGAN) framework to generate high-fidelity synthetic residential load data. ERGAN leverages an ensemble of recurrent Generative Adversarial Networks, augmented by a loss function that concurrently takes into account adversarial loss and differences between statistical properties. Our developed ERGAN can capture diverse load patterns across various households, thereby enhancing the realism and diversity of the synthetic data generated. Comprehensive evaluations demonstrate that our method consistently outperforms established benchmarks in the synthetic generation of residential load data across various performance metrics including diversity, similarity, and statistical measures. The findings confirm the potential of ERGAN as an effective tool for energy applications requiring synthetic yet realistic load data. We also make the generated synthetic residential load patterns publicly available.
[LG-136] Explainability of Point Cloud Neural Networks Using SMILE: Statistical Model-Agnostic Interpretability with Local Explanations
Link: https://arxiv.org/abs/2410.15374
Authors: Seyed Mohammad Ahmadi, Koorosh Aslansefat, Ruben Valcarce-Dineiro, Joshua Barnfather
Keywords: considerable safety risks, pose considerable safety, today world, significance of explainable, lack of transparency
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 9 figures
Abstract:In today’s world, the significance of explainable AI (XAI) is growing in robotics and point cloud applications, as the lack of transparency in decision-making can pose considerable safety risks, particularly in autonomous systems. As these technologies are integrated into real-world environments, ensuring that model decisions are interpretable and trustworthy is vital for operational reliability and safety assurance. This study explores the implementation of SMILE, a novel explainability method originally designed for deep neural networks, on point cloud-based models. SMILE builds on LIME by incorporating Empirical Cumulative Distribution Function (ECDF) statistical distances, offering enhanced robustness and interpretability, particularly when the Anderson-Darling distance is used. The approach demonstrates superior performance in terms of fidelity loss, R2 scores, and robustness across various kernel widths, perturbation numbers, and clustering configurations. Moreover, this study introduces a stability analysis for point cloud data using the Jaccard index, establishing a new benchmark and baseline for model stability in this field. The study further identifies dataset biases in the classification of the ‘person’ category, emphasizing the necessity for more comprehensive datasets in safety-critical applications like autonomous driving and robotics. The results underscore the potential of advanced explainability models and highlight areas for future research, including the application of alternative surrogate models and explainability techniques in point cloud data.
[LG-137] Hybrid Memory Replay: Blending Real and Distilled Data for Class Incremental Learning
Link: https://arxiv.org/abs/2410.15372
Authors: Jiangtao Kong, Jiacheng Shi, Ashley Gao, Shaohan Hu, Tianyi Zhou, Huajie Shao
Keywords: Incremental learning, retaining knowledge learned, previous tasks, aims to acquire, exemplars
Categories: Machine Learning (cs.LG)
Abstract:Incremental learning (IL) aims to acquire new knowledge from current tasks while retaining knowledge learned from previous tasks. Replay-based IL methods store a set of exemplars from previous tasks in a buffer and replay them when learning new tasks. However, there is usually a size-limited buffer that cannot store adequate real exemplars to retain the knowledge of previous tasks. In contrast, data distillation (DD) can reduce the exemplar buffer’s size, by condensing a large real dataset into a much smaller set of more information-compact synthetic exemplars. Nevertheless, DD’s performance gain on IL quickly vanishes as the number of synthetic exemplars grows. To overcome the weaknesses of real-data and synthetic-data buffers, we instead optimize a hybrid memory including both types of data. Specifically, we propose an innovative modification to DD that distills synthetic data from a sliding window of checkpoints in history (rather than checkpoints on multiple training trajectories). Conditioned on the synthetic data, we then optimize the selection of real exemplars to provide complementary improvement to the DD objective. The optimized hybrid memory combines the strengths of synthetic and real exemplars, effectively mitigating catastrophic forgetting in Class IL (CIL) when the buffer size for exemplars is limited. Notably, our method can be seamlessly integrated into most existing replay-based CIL models. Extensive experiments across multiple benchmarks demonstrate that our method significantly outperforms existing replay-based baselines.
[LG-138] FrameBridge: Improving Image-to-Video Generation with Bridge Models
Link: https://arxiv.org/abs/2410.15371
Authors: Yuji Wang, Zehua Chen, Xiaoyu Chen, Jun Zhu, Jianfei Chen
Keywords: gaining increasing attention, gaining increasing, increasing attention, wide application, models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:Image-to-video (I2V) generation is gaining increasing attention with its wide application in video synthesis. Recently, diffusion-based I2V models have achieved remarkable progress given their novel design on network architecture, cascaded framework, and motion representation. However, restricted by their noise-to-data generation process, diffusion-based methods inevitably suffer the difficulty to generate video samples with both appearance consistency and temporal coherence from an uninformative Gaussian noise, which may limit their synthesis quality. In this work, we present FrameBridge, taking the given static image as the prior of video target and establishing a tractable bridge model between them. By formulating I2V synthesis as a frames-to-frames generation task and modelling it with a data-to-data process, we fully exploit the information in input image and facilitate the generative model to learn the image animation process. In two popular settings of training I2V models, namely fine-tuning a pre-trained text-to-video (T2V) model or training from scratch, we further propose two techniques, SNR-Aligned Fine-tuning (SAF) and neural prior, which improve the fine-tuning efficiency of diffusion-based T2V models to FrameBridge and the synthesis quality of bridge-based I2V models respectively. Experiments conducted on WebVid-2M and UCF-101 demonstrate that: (1) our FrameBridge achieves superior I2V quality in comparison with the diffusion counterpart (zero-shot FVD 83 vs. 176 on MSR-VTT and non-zero-shot FVD 122 vs. 171 on UCF-101); (2) our proposed SAF and neural prior effectively enhance the ability of bridge-based I2V models in the scenarios of fine-tuning and training from scratch. Demo samples can be visited at: this https URL.
[LG-139] Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models
Link: https://arxiv.org/abs/2410.15362
Authors: Xiao Li, Zhuhong Li, Qiongxiu Li, Bingze Lee, Jinghao Cui, Xiaolin Hu
Keywords: Large Language Models, Language Models, Aligned Large Language, Large Language, demonstrated remarkable performance
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Abstract:Aligned Large Language Models (LLMs) have demonstrated remarkable performance across various tasks. However, LLMs remain susceptible to jailbreak adversarial attacks, where adversaries manipulate prompts to elicit malicious responses that aligned LLMs should have avoided. Identifying these vulnerabilities is crucial for understanding the inherent weaknesses of LLMs and preventing their potential misuse. One pioneering work in jailbreaking is the GCG attack, a discrete token optimization algorithm that seeks to find a suffix capable of jailbreaking aligned LLMs. Despite the success of GCG, we find it suboptimal, requiring significantly large computational costs, and the achieved jailbreaking performance is limited. In this work, we propose Faster-GCG, an efficient adversarial jailbreak method by delving deep into the design of GCG. Experiments demonstrate that Faster-GCG can surpass the original GCG with only 1/10 of the computational cost, achieving significantly higher attack success rates on various open-source aligned LLMs. In addition, we demonstrate that Faster-GCG exhibits improved attack transferability when testing on closed-source LLMs such as ChatGPT.
[LG-140] Wireless Link Quality Estimation Using LSTM Model
Link: https://arxiv.org/abs/2410.15357
Authors: Yuki Kanto, Kohei Watabe
Keywords: mobile communication devices, necessitating stable communication, high-capacity wireless networks, recent years, outdoor environments
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments: This paper was submitted to IEEE Network Operations and Management Symposium
Abstract:In recent years, various services have been provided through high-speed and high-capacity wireless networks on mobile communication devices, necessitating stable communication regardless of indoor or outdoor environments. To achieve stable communication, it is essential to implement proactive measures, such as switching to an alternative path and ensuring data buffering before the communication quality becomes unstable. The technology of Wireless Link Quality Estimation (WLQE), which predicts the communication quality of wireless networks in advance, plays a crucial role in this context. In this paper, we propose a novel WLQE model for estimating the communication quality of wireless networks by leveraging sequential information. Our proposed method is based on Long Short-Term Memory (LSTM), enabling highly accurate estimation by considering the sequential information of link quality. We conducted a comparative evaluation with the conventional model, stacked autoencoder-based link quality estimator (LQE-SAE), using a dataset recorded in real-world environmental conditions. Our LSTM-based LQE model demonstrates its superiority, achieving a 4.0% higher accuracy and a 4.6% higher macro-F1 score than the LQE-SAE model in the evaluation.
[LG-141] LAC: Graph Contrastive Learning with Learnable Augmentation in Continuous Space
Link: https://arxiv.org/abs/2410.15355
Authors: Zhenyu Lin, Hongzheng Li, Yingxia Shao, Guanhua Ye, Yawen Li, Quanqing Xu
Keywords: Graph Contrastive Learning, generating high-quality node, Contrastive Learning, Contrastive Learning frameworks, high-quality node representations
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Graph Contrastive Learning frameworks have demonstrated success in generating high-quality node representations. The existing research on efficient data augmentation methods and ideal pretext tasks for graph contrastive learning remains limited, resulting in suboptimal node representation in the unsupervised setting. In this paper, we introduce LAC, a graph contrastive learning framework with learnable data augmentation in an orthogonal continuous space. To capture the representative information in the graph data during augmentation, we introduce a continuous view augmenter, that applies both a masked topology augmentation module and a cross-channel feature augmentation module to adaptively augment the topological information and the feature information within an orthogonal continuous space, respectively. The orthogonal nature of continuous space ensures that the augmentation process avoids dimension collapse. To enhance the effectiveness of pretext tasks, we propose an information-theoretic principle named InfoBal and introduce corresponding pretext tasks. These tasks enable the continuous view augmenter to maintain consistency in the representative information across views while maximizing diversity between views, and allow the encoder to fully utilize the representative information in the unsupervised setting. Our experimental results show that LAC significantly outperforms the state-of-the-art frameworks.
[LG-142] CompAct: Compressed Activations for Memory-Efficient LLM Training
Link: https://arxiv.org/abs/2410.15352
Authors: Yara Shamshoum, Nitzan Hodos, Yuval Sieradzki, Assaf Schuster
Keywords: GPU by 25-30, utilization on GPU, peak memory utilization, GPU, memory
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Abstract:We introduce CompAct, a technique that reduces peak memory utilization on GPU by 25-30% for pretraining and 50% for fine-tuning of LLMs. Peak device memory is a major limiting factor in training LLMs, with various recent works aiming to reduce model memory. However most works don’t target the largest component of allocated memory during training: the model’s compute graph, which is stored for the backward pass. By storing low-rank, compressed activations to be used in the backward pass we greatly reduce the required memory, unlike previous methods which only reduce optimizer overheads or the number of trained parameters. Our compression uses random projection matrices, thus avoiding additional memory overheads. Comparisons with previous techniques for either pretraining or fine-tuning show that CompAct substantially improves existing compute-performance tradeoffs. We expect CompAct’s savings to scale even higher for larger models.
[LG-143] ConSinger: Efficient High-Fidelity Singing Voice Generation with Minimal Steps
Link: https://arxiv.org/abs/2410.15342
Authors: Yulin Song, Guorui Sang, Jing Yu, Chuangbai Xiao
Keywords: Singing voice synthesis, Singing voice, duration and pitch, high-fidelity singing voice, system is expected
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Singing voice synthesis, Consistency models, diffusion models
Abstract:A singing voice synthesis (SVS) system is expected to generate high-fidelity singing voice from given music scores (lyrics, duration and pitch). Recently, diffusion models have performed well in this field. However, sacrificing inference speed in exchange for high-quality sample generation limits their application scenarios. In order to obtain high-quality synthetic singing voice more efficiently, we propose a singing voice synthesis method based on the consistency model, ConSinger, to achieve high-fidelity singing voice synthesis with minimal steps. The model is trained by applying a consistency constraint, and the generation quality is greatly improved at the expense of a small amount of inference speed. Our experiments show that ConSinger is highly competitive with the baseline model in terms of generation speed and quality. Audio samples are available at this https URL.
[LG-144] EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models
Link: https://arxiv.org/abs/2410.15332
Authors: Junhao Hu, Wenrui Huang, Haoyi Wang, Weidong Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, Tao Xie
Keywords: Large Language Models, Large Language, Language Models, range of applications, wide range
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Abstract: Large Language Models (LLMs) are critical for a wide range of applications, but serving them efficiently becomes increasingly challenging as inputs become more complex. Context caching improves serving performance by exploiting inter-request dependency and reusing key-value (KV) cache across requests, thus improving time-to-first-token (TTFT). However, existing prefix-based context caching requires exact token prefix matches, limiting cache reuse in few-shot learning, multi-document QA, or retrieval-augmented generation, where prefixes may vary. In this paper, we present EPIC, an LLM serving system that introduces position-independent context caching (PIC), enabling modular KV cache reuse regardless of token chunk position (or prefix). EPIC features two key designs: AttnLink, which leverages static attention sparsity to minimize recomputation for accuracy recovery, and KVSplit, a customizable chunking method that preserves semantic coherence. Our experiments demonstrate that EPIC delivers up to 8x improvements in TTFT and 7x throughput over existing systems, with negligible or no accuracy loss. By addressing the limitations of traditional caching approaches, EPIC enables more scalable and efficient LLM inference.
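The difference between prefix-based and position-independent reuse described in the abstract above can be sketched with a cache keyed by chunk content alone. This is a minimal illustration of the lookup idea only, not the EPIC system: `ChunkCache` and `fake_kv` are hypothetical names, and the stand-in computation ignores the cross-chunk attention that a real system would still need to recompute.

```python
import hashlib

class ChunkCache:
    """Hypothetical cache keyed by chunk content, not by prefix position."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(tokens):
        # Hash the token ids of one chunk; the chunk's position is not part of the key.
        text = " ".join(str(t) for t in tokens)
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_or_compute(self, tokens, compute):
        key = self._key(tokens)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = compute(tokens)
        return self._store[key]

def fake_kv(tokens):
    # Stand-in for the expensive per-chunk prefill computation.
    return [t * 2 for t in tokens]

cache = ChunkCache()
doc_a, doc_b = (1, 2, 3), (4, 5)

# First request sees the chunks in order A, B.
for chunk in (doc_a, doc_b):
    cache.get_or_compute(chunk, fake_kv)

# Second request sees the same chunks in order B, A. An exact-prefix cache
# would find no match here; the content-keyed cache reuses both chunks.
for chunk in (doc_b, doc_a):
    cache.get_or_compute(chunk, fake_kv)
```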
[LG-145] FoMo: A Foundation Model for Mobile Traffic Forecasting with Diffusion Model
Link: https://arxiv.org/abs/2410.15322
Authors: Haoye Chai, Shiyuan Zhang, Xiaoqian Qi, Yong Li
Keywords: offering substantial potential, enhancing service quality, anticipate network dynamics, improving user experience, performance in advance
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 17 pages, 11 figures
Abstract:Mobile traffic forecasting allows operators to anticipate network dynamics and performance in advance, offering substantial potential for enhancing service quality and improving user experience. However, existing models are often task-oriented and are trained with tailored data, which limits their effectiveness in diverse mobile network tasks of Base Station (BS) deployment, resource allocation, energy optimization, etc. and hinders generalization across different urban environments. Foundation models have made remarkable strides across various domains of NLP and CV due to their multi-tasking adaption and zero/few-shot learning capabilities. In this paper, we propose an innovative Foundation model for Mobile traffic forecasting (FoMo), aiming to handle diverse forecasting tasks of short/long-term predictions and distribution generation across multiple cities to support network planning and optimization. FoMo combines diffusion models and transformers, where various spatio-temporal masks are proposed to enable FoMo to learn intrinsic features of different tasks, and a contrastive learning strategy is developed to capture the correlations between mobile traffic and urban contexts, thereby improving its transfer learning capability. Extensive experiments on 9 real-world datasets demonstrate that FoMo outperforms current models concerning diverse forecasting tasks and zero/few-shot learning, showcasing a strong universality. We further deploy the FoMo on the JiuTian optimization platform of China Mobile, where we use the predicted mobile data to formulate network planning and optimization applications, including BS deployment, resource block scheduling, and BS sleep control.
[LG-146] SNAP: Stopping Catastrophic Forgetting in Hebbian Learning with Sigmoidal Neuronal Adaptive Plasticity
Link: https://arxiv.org/abs/2410.15318
Authors: Tianyi Xu, Patrick Zheng, Shiyan Liu, Sicheng Lyu, Isabeau Prémont-Schwarz
Keywords: Artificial Neural Networks, Neural Networks, Stochastic Gradient Descent, Artificial Neural, Existing Machine Learning
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 pages, 11 figures, accepted at Montréal AI and Neuroscience (MAIN) 2024 conference
Abstract: Artificial Neural Networks (ANNs) suffer from catastrophic forgetting, where learning new tasks causes old tasks to be forgotten. Existing Machine Learning (ML) algorithms, including those using Stochastic Gradient Descent (SGD) and Hebbian Learning, typically update their weights linearly with experience, i.e., independently of their current strength. This contrasts with biological neurons, which at intermediate strengths are very plastic, but consolidate with Long-Term Potentiation (LTP) once they reach a certain strength. We hypothesize this mechanism might help mitigate catastrophic forgetting. We introduce Sigmoidal Neuronal Adaptive Plasticity (SNAP), an artificial approximation to Long-Term Potentiation for ANNs in which the weights follow a sigmoidal growth behaviour, allowing them to consolidate and stabilize when they reach sufficiently large or small values. We then compare SNAP to linear weight growth and exponential weight growth and see that SNAP completely prevents the forgetting of previous tasks for Hebbian Learning but not for SGD-based learning.
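The contrast between linear and sigmoidal weight growth described in the abstract above can be sketched with a toy update rule. The logistic form `w * (1 - w)` used here is an assumption chosen for illustration (one simple rule whose trajectory is sigmoidal for weights in the open interval 0 to 1); the paper's actual SNAP rule may differ, and the function names are hypothetical.

```python
def linear_update(w, signal, lr=0.1):
    """Weight change is independent of the current weight value."""
    return w + lr * signal

def sigmoidal_update(w, signal, lr=0.1):
    """Assumed logistic-style rule: the plasticity factor w * (1 - w)
    shrinks toward zero as the weight approaches 0 or 1."""
    return w + lr * signal * w * (1.0 - w)

# Compare how far one opposing signal moves an intermediate weight (0.5)
# and a nearly consolidated weight (0.99) under each rule.
for w in (0.5, 0.99):
    print(w, linear_update(w, -1.0) - w, sigmoidal_update(w, -1.0) - w)

# Repeated positive signals push the weight toward 1 without overshooting.
w_grow = 0.5
for _ in range(50):
    w_grow = sigmoidal_update(w_grow, 1.0)
```

Under the linear rule both weights move by the same amount, while under the assumed sigmoidal rule the weight near 1 barely changes, which is the consolidation behaviour the abstract attributes to SNAP.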
[LG-147] On Cold Posteriors of Probabilistic Neural Networks: Understanding the Cold Posterior Effect and A New Way to Learn Cold Posteriors with Tight Generalization Guarantees
Link: https://arxiv.org/abs/2410.15310
Authors: Yijie Zhang
Keywords: updating beliefs based, Bayes’ theorem, Bayesian, machine learning, Bayesian deep learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: PhD thesis
Abstract: Bayesian inference provides a principled probabilistic framework for quantifying uncertainty by updating beliefs based on prior knowledge and observed data through Bayes’ theorem. In Bayesian deep learning, neural network weights are treated as random variables with prior distributions, allowing for a probabilistic interpretation and quantification of predictive uncertainty. However, Bayesian methods lack theoretical generalization guarantees for unseen data. PAC-Bayesian analysis addresses this limitation by offering a frequentist framework to derive generalization bounds for randomized predictors, thereby certifying the reliability of Bayesian methods in machine learning. Temperature T, or inverse temperature λ = 1/T, originally from statistical mechanics in physics, naturally arises in various areas of statistical inference, including Bayesian inference and PAC-Bayesian analysis. In Bayesian inference, when T < 1 (“cold” posteriors), the likelihood is up-weighted, resulting in a sharper posterior distribution. Conversely, when T > 1 (“warm” posteriors), the likelihood is down-weighted, leading to a more diffuse posterior distribution. By balancing the influence of observed data and prior regularization, temperature adjustments can address issues of underfitting or overfitting in Bayesian models, bringing improved predictive performance.
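The effect of temperature on posterior sharpness can be checked in closed form on a small conjugate model. The sketch below assumes a Gaussian prior on a scalar mean with known noise variance and raises only the likelihood to the power 1/T, matching the description in the abstract; the function name and the example data are illustrative.

```python
def tempered_posterior(data, temperature, prior_var=1.0, noise_var=1.0):
    """Posterior over a Gaussian mean with prior N(0, prior_var), where the
    likelihood is raised to the power lam = 1 / temperature."""
    lam = 1.0 / temperature
    precision = 1.0 / prior_var + lam * len(data) / noise_var
    mean = (lam * sum(data) / noise_var) / precision
    return mean, 1.0 / precision

data = [1.0, 2.0, 3.0]
cold_mean, cold_var = tempered_posterior(data, temperature=0.5)   # T below 1
bayes_mean, bayes_var = tempered_posterior(data, temperature=1.0)  # standard Bayes
warm_mean, warm_var = tempered_posterior(data, temperature=2.0)   # T above 1
```

In this model the posterior variance grows with T, so the cold posterior is the sharpest and sits closest to the sample mean, while the warm posterior is the most diffuse and is pulled furthest toward the prior mean of zero.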
[LG-148] Symmetry Nonnegative Matrix Factorization Algorithm Based on Self-paced Learning
Link: https://arxiv.org/abs/2410.15306
Authors: Lei Wang, Liang Du, Peng Zhou, Peng Wu
Keywords: symmetric nonnegative matrix, nonnegative matrix factorization, factorization algorithm based, matrix factorization algorithm, symmetric nonnegative
Subjects: Machine Learning (cs.LG)
Comments: in Chinese language
Abstract:A symmetric nonnegative matrix factorization algorithm based on self-paced learning was proposed to improve the clustering performance of the model. It could make the model better distinguish normal samples from abnormal samples in an error-driven way. A weight variable that could measure the degree of difficulty to all samples was assigned in this method, and the variable was constrained by adopting both hard-weighting and soft-weighting strategies to ensure the rationality of the model. Cluster analysis was carried out on multiple data sets such as images and texts, and the experimental results showed the effectiveness of the proposed algorithm.
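The hard-weighting and soft-weighting strategies mentioned in the abstract above are commonly written as simple functions of a per-sample loss and a pace threshold. The sketch below uses the standard binary and linear forms from the self-paced learning literature; whether the paper uses exactly these forms is an assumption, and the function names and loss values are illustrative.

```python
def hard_weight(loss, threshold):
    """Binary weight: a sample counts only if its loss is below the threshold."""
    return 1.0 if loss < threshold else 0.0

def soft_weight(loss, threshold):
    """Linear soft weight: falls from 1 toward 0 as the loss nears the threshold."""
    return max(0.0, 1.0 - loss / threshold)

# Illustrative per-sample reconstruction errors: easy, medium, hard.
losses = [0.1, 0.25, 0.8]
hard = [hard_weight(l, 0.5) for l in losses]
soft = [soft_weight(l, 0.5) for l in losses]
```

Raising the threshold over the course of training admits progressively harder samples, which is how an error-driven scheme can down-weight abnormal samples early on.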
[LG-149] Multiple Kernel Clustering via Local Regression Integration
Link: https://arxiv.org/abs/2410.15304
Authors: Liang Du, Xin Ren, Haiying Zhang, Peng Zhou
Keywords:
Subjects: Machine Learning (cs.LG)
Comments: in Chinese language
[LG-150] Likelihood-Free Inference and Hierarchical Data Assimilation for Geological Carbon Storage
Link: https://arxiv.org/abs/2410.15302
Authors: Wenchao Teng, Louis J. Durlofsky
Keywords: Data assimilation, carbon storage operations, management and expansion, Data, expansion of geological
Subjects: Machine Learning (cs.LG); Dynamical Systems (math.DS); Geophysics (physics.geo-ph)
Comments:
Abstract:Data assimilation will be essential for the management and expansion of geological carbon storage operations. In traditional data assimilation approaches a fixed set of geological hyperparameters, such as mean and standard deviation of log-permeability, is often assumed. Such hyperparameters, however, may be highly uncertain in practical CO2 storage applications. In this study, we develop a hierarchical data assimilation framework for carbon storage that treats hyperparameters as uncertain variables characterized by hyperprior distributions. To deal with the computationally intractable likelihood function in hyperparameter estimation, we apply a likelihood-free (or simulation-based) inference algorithm, specifically sequential Monte Carlo-based approximate Bayesian computation (SMC-ABC), to draw independent posterior samples of hyperparameters given dynamic monitoring-well data. In the second step we use an ensemble smoother with multiple data assimilation (ESMDA) procedure to provide posterior realizations of grid-block permeability. To reduce computational costs, a 3D recurrent R-U-Net deep-learning surrogate model is applied for forward function evaluations. The accuracy of the surrogate model is established through comparisons to high-fidelity simulation results. A rejection sampling (RS) procedure for data assimilation is applied to provide reference posterior results. Detailed data assimilation results from SMC-ABC-ESMDA are compared to those from the reference RS method. These include marginal posterior distributions of hyperparameters, pairwise posterior samples, and history matching results for pressure and saturation at the monitoring location. Close agreement is achieved with ‘converged’ RS results, for two synthetic true models, in all quantities considered. Importantly, the SMC-ABC-ESMDA procedure provides speedup of 1-2 orders of magnitude relative to RS for the two cases.
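Likelihood-free inference, as used in the abstract above, replaces likelihood evaluation with forward simulation and a distance check. The sketch below shows plain rejection ABC on a toy Gaussian-mean problem; it is deliberately not SMC-ABC (which refines the tolerance over a sequence of populations) and has no connection to the paper's geological model. The function name, prior range, and tolerance are illustrative assumptions.

```python
import random

def abc_rejection(observed_mean, n_obs, n_draws, tol, seed=0):
    """Plain rejection ABC for the mean of a unit-variance Gaussian."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        mu = rng.uniform(-5.0, 5.0)                        # draw from the prior
        sim = [rng.gauss(mu, 1.0) for _ in range(n_obs)]   # run the simulator
        # Keep the draw only if the simulated summary matches the observation.
        if abs(sum(sim) / n_obs - observed_mean) <= tol:
            accepted.append(mu)
    return accepted

samples = abc_rejection(observed_mean=2.0, n_obs=20, n_draws=5000, tol=0.2)
posterior_mean = sum(samples) / len(samples)
```

The accepted draws approximate the posterior without the likelihood ever being evaluated; the cost is that most simulations are rejected, which is the inefficiency that sequential variants and surrogate models aim to reduce.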
[LG-151] Unsupervised feature selection algorithm framework based on neighborhood interval disturbance fusion
Link: https://arxiv.org/abs/2410.15294
Authors: Xiaolin Lv, Liang Du, Peng Zhou, Peng Wu
Keywords: unsupervised feature selection, Feature selection technology, Feature selection, data dimensionality reduction, key technology
Subjects: Machine Learning (cs.LG)
Comments: in Chinese language
Abstract: Feature selection is a key technology for data dimensionality reduction. Because collected data samples often lack label information, unsupervised feature selection has attracted more attention. The universality and stability of many unsupervised feature selection algorithms are very low and greatly affected by the dataset structure. For this reason, many researchers have been keen to improve the stability of the algorithm. This paper attempts to preprocess the data set and use an interval method to approximate the data set, experimentally verifying the advantages and disadvantages of the new interval data set. This paper deals with these data sets from the global perspective and proposes a new algorithm, an unsupervised feature selection algorithm based on neighborhood interval disturbance fusion (NIDF). This method can realize the joint learning of the final score of the feature and the approximate data interval. By comparing with the original unsupervised feature selection methods and several existing feature selection frameworks, the superiority of the proposed model is verified.
[LG-152] Fractional-order spike-timing-dependent gradient descent for multi-layer spiking neural networks
Link: https://arxiv.org/abs/2410.15293
Authors: Yi Yang, Richard M. Voyles, Haiyan H. Zhang, Robert A. Nawrocki
Keywords: Accumulated detailed knowledge, Accumulated detailed, bio-inspired spiking neural, spiking neural networks, deep neural networks
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 15 pages, 12 figures
Abstract:Accumulated detailed knowledge about the neuronal activities in human brains has brought more attention to bio-inspired spiking neural networks (SNNs). In contrast to non-spiking deep neural networks (DNNs), SNNs can encode and transmit spatiotemporal information more efficiently by exploiting biologically realistic and low-power event-driven neuromorphic architectures. However, the supervised learning of SNNs still remains a challenge because the spike-timing-dependent plasticity (STDP) of connected spiking neurons is difficult to implement and interpret in existing backpropagation learning schemes. This paper proposes a fractional-order spike-timing-dependent gradient descent (FO-STDGD) learning model by considering a derived nonlinear activation function that describes the relationship between the quasi-instantaneous firing rate and the temporal membrane potentials of nonleaky integrate-and-fire neurons. The training strategy can be generalized to any fractional orders between 0 and 2 since the FO-STDGD incorporates the fractional gradient descent method into the calculation of spike-timing-dependent loss gradients. The proposed FO-STDGD model is tested on the MNIST and DVS128 Gesture datasets and its accuracy under different network structure and fractional orders is analyzed. It can be found that the classification accuracy increases as the fractional order increases, and specifically, the case of fractional order 1.9 improves by 155% relative to the case of fractional order 1 (traditional gradient descent). In addition, our scheme demonstrates the state-of-the-art computational efficacy for the same SNN structure and training epochs.
[LG-153] LTPNet Integration of Deep Learning and Environmental Decision Support Systems for Renewable Energy Demand Forecasting
Link: https://arxiv.org/abs/2410.15286
Authors: Te Li, Mengze Zhang, Yan Zhou
Keywords: sustainable business development, increasingly severe global, severe global environmental, energy demand forecasting, meeting renewable energy
Subjects: Machine Learning (cs.LG); General Economics (econ.GN)
Comments: 25 pages
Abstract:Against the backdrop of increasingly severe global environmental changes, accurately predicting and meeting renewable energy demands has become a key challenge for sustainable business development. Traditional energy demand forecasting methods often struggle with complex data processing and low prediction accuracy. To address these issues, this paper introduces a novel approach that combines deep learning techniques with environmental decision support systems. The model integrates advanced deep learning techniques, including LSTM and Transformer, and PSO algorithm for parameter optimization, significantly enhancing predictive performance and practical applicability. Results show that our model achieves substantial improvements across various metrics, including a 30% reduction in MAE, a 20% decrease in MAPE, a 25% drop in RMSE, and a 35% decline in MSE. These results validate the model’s effectiveness and reliability in renewable energy demand forecasting. This research provides valuable insights for applying deep learning in environmental decision support systems.
[LG-154] TRIZ Method for Urban Building Energy Optimization: GWO-SARIMA-LSTM Forecasting model
Link: https://arxiv.org/abs/2410.15283
Authors: Shirong Zheng, Shaobo Liu, Zhenhong Zhang, Dian Gu, Chunqiu Xia, Huadong Pang, Enock Mintah Ampaw
Keywords: energy consumption, building energy consumption, global climate change, energy consumption optimization, energy consumption prediction
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 29 pages
Abstract:With the advancement of global climate change and sustainable development goals, urban building energy consumption optimization and carbon emission reduction have become the focus of research. Traditional energy consumption prediction methods often lack accuracy and adaptability due to their inability to fully consider complex energy consumption patterns, especially in dealing with seasonal fluctuations and dynamic changes. This study proposes a hybrid deep learning model that combines TRIZ innovation theory with GWO, SARIMA and LSTM to improve the accuracy of building energy consumption prediction. TRIZ plays a key role in model design, providing innovative solutions to achieve an effective balance between energy efficiency, cost and comfort by systematically analyzing the contradictions in energy consumption optimization. GWO is used to optimize the parameters of the model to ensure that the model maintains high accuracy under different conditions. The SARIMA model focuses on capturing seasonal trends in the data, while the LSTM model handles short-term and long-term dependencies in the data, further improving the accuracy of the prediction. The main contribution of this research is the development of a robust model that leverages the strengths of TRIZ and advanced deep learning techniques, improving the accuracy of energy consumption predictions. Our experiments demonstrate a significant 15% reduction in prediction error compared to existing models. This innovative approach not only enhances urban energy management but also provides a new framework for optimizing energy use and reducing carbon emissions, contributing to sustainable development.
[LG-155] Neural Normalized Compression Distance and the Disconnect Between Compression and Classification
Link: https://arxiv.org/abs/2410.15280
Authors: John Hurwitz, Charles Nicholas, Edward Raff
Keywords:
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted to Machine Learning and Compression Workshop at 38th Conference on Neural Information Processing Systems
[LG-156] Onboard Health Estimation using Distribution of Relaxation Times for Lithium-ion Batteries
Link: https://arxiv.org/abs/2410.15271
Authors: Muhammad Aadil Khan, Sai Thatipamula, Simona Onori
Keywords:
Subjects: Machine Learning (cs.LG)
Comments: 6 pages, 7 figures
[LG-157] TAGExplainer: Narrating Graph Explanations for Text-Attributed Graph Learning Models
Link: https://arxiv.org/abs/2410.15268
Authors: Bo Pan, Zhen Xiong, Guanchen Wu, Zheng Zhang, Yifei Zhang, Liang Zhao
Keywords:
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
[LG-158] Learning-Augmented Algorithms for the Bahncard Problem
Link: https://arxiv.org/abs/2410.15257
Authors: Hailiang Zhao, Xueyan Tang, Peng Chen, Shuiguang Deng
Keywords:
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Optimization and Control (math.OC)
Comments: This paper has been accepted by the 38th Conference on Neural Information Processing Systems (NeurIPS 2024)
[LG-159] Multimodal Policies with Physics-informed Representations
Link: https://arxiv.org/abs/2410.15250
Authors: Haodong Feng, Peiyan Hu, Yue Wang, Dixia Fan
Keywords: PDE systems, PDE, PDE loss, PDE systems rely, leverages PDE loss
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:In the control problems of the PDE systems, observation is important to make the decision. However, the observation is generally sparse and missing in practice due to the limitation and fault of sensors. The above challenges cause observations with uncertain quantities and modalities. Therefore, how to leverage the uncertain observations as the states in control problems of the PDE systems has become a scientific problem. The dynamics of PDE systems rely on the initial conditions, boundary conditions, and PDE formula. Given the above three elements, PINNs can be used to solve the PDE systems. In this work, we discover that the neural network can also be used to identify and represent the PDE systems using PDE loss and sparse data loss. Inspired by the above discovery, we propose a Physics-Informed Representation (PIR) algorithm for multimodal policies in PDE systems’ control. It leverages PDE loss to fit the neural network and data loss calculated on the observations with random quantities and modalities to propagate the information of initial conditions and boundary conditions into the inputs. The inputs can be the learnable parameters or the output of the encoders. Then, under the environments of the PDE systems, such inputs are the representation of the current state. In our experiments, the PIR illustrates the superior consistency with the features of the ground truth compared with baselines, even when there are missing modalities. Furthermore, PIR has been successfully applied in the downstream control tasks where the robot leverages the learned state by PIR faster and more accurately, passing through the complex vortex street from a random starting location to reach a random target.
[LG-160] FastSTI: A Fast Conditional Pseudo Numerical Diffusion Model for Spatio-temporal Traffic Data Imputation
Link: https://arxiv.org/abs/2410.15248
Authors: Shaokang Cheng, Nada Osman, Shiru Qu, Lamberto Ballan
Keywords:
Subjects: Machine Learning (cs.LG)
Comments: This paper has been accepted by IEEE Transactions on Intelligent Transportation Systems for publication. Permission from IEEE must be obtained for all other uses, in any current or future media
[LG-161] Tensor-Fused Multi-View Graph Contrastive Learning
Link: https://arxiv.org/abs/2410.15247
Authors: Yujia Wu, Junyi Mo, Elynn Chen, Yuzhou Chen
Keywords:
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
[LG-162] Conditional Uncertainty Quantification for Tensorized Topological Neural Networks
Link: https://arxiv.org/abs/2410.15241
Authors: Yujia Wu, Bo Yang, Yang Zhao, Elynn Chen, Yuzhou Chen, Zheshi Zheng
Keywords:
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: arXiv admin note: text overlap with arXiv:2401.12007
[LG-163] Conditional Prediction ROC Bands for Graph Classification
Link: https://arxiv.org/abs/2410.15239
Authors: Yujia Wu, Bo Yang, Elynn Chen, Yuzhou Chen, Zheshi Zheng
Keywords:
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
[LG-164] Jailbreaking and Mitigation of Vulnerabilities in Large Language Models
Link: https://arxiv.org/abs/2410.15236
Authors: Benji Peng, Ziqian Bi, Qian Niu, Ming Liu, Pohsun Feng, Tianyang Wang, Lawrence K.Q. Yan, Yizhu Wen, Yichao Zhang, Caitlyn Heqi Yin
Keywords:
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
[LG-165] A Semidefinite Relaxation Approach for Fair Graph Clustering
Link: https://arxiv.org/abs/2410.15233
Authors: Sina Baharlouei, Sadra Sabouri
Keywords: ensuring equitable representation, Fair graph clustering, network analysis, crucial for ensuring, ensuring equitable
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
Comments:
Abstract: Fair graph clustering is crucial for ensuring equitable representation and treatment of diverse communities in network analysis. Traditional methods often ignore disparities among social, economic, and demographic groups, perpetuating biased outcomes and reinforcing inequalities. This study introduces fair graph clustering within the framework of the disparate impact doctrine, treating it as a joint optimization problem integrating clustering quality and fairness constraints. Given the NP-hard nature of this problem, we employ a semidefinite relaxation approach to approximate the underlying optimization problem. For up to medium-sized graphs, we utilize a singular value decomposition-based algorithm, while for larger graphs, we propose a novel algorithm based on the alternating direction method of multipliers (ADMM). Unlike existing methods, our formulation allows for tuning the trade-off between clustering quality and fairness. Experimental results on graphs generated from the standard stochastic block model demonstrate the superiority of our approach in achieving an optimal accuracy-fairness trade-off compared to state-of-the-art methods.
[LG-166] Deep Learning-based Detection of Bacterial Swarm Motion Using a Single Image
Link: https://arxiv.org/abs/2410.15229
Authors: Yuzhu Li, Hao Li, Weijie Chen, Keelan O’Riordan, Neha Mani, Yuxuan Qi, Tairan Liu, Sridhar Mani, Aydogan Ozcan
Keywords:
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applied Physics (physics.app-ph); Medical Physics (physics.med-ph)
Comments: 17 Pages, 4 Figures
[LG-167] IntersectionZoo: Eco-driving for Benchmarking Multi-Agent Contextual Reinforcement Learning
Link: https://arxiv.org/abs/2410.15221
Authors: Vindula Jayawardana, Baptiste Freydt, Ao Qu, Cameron Hickert, Zhongxia Yan, Cathy Wu
Keywords:
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments: In review
[LG-168] Science Time Series: Deep Learning in Hydrology
Link: https://arxiv.org/abs/2410.15218
Authors: Junyang He, Ying-Jung Chen, Anushka Idamekorala, Geoffrey Fox
Keywords:
Subjects: Machine Learning (cs.LG)
Comments:
[LG-169] Future-Guided Learning: A Predictive Approach To Enhance Time-Series Forecasting
Link: https://arxiv.org/abs/2410.15217
Authors: Skye Gunasekaran, Assel Kembay, Hugo Ladret, Rui-Jie Zhu, Laurent Perrinet, Omid Kavehei, Jason Eshraghian
Keywords:
Subjects: Machine Learning (cs.LG)
Comments:
[LG-170] The Shifting Paradigm in AI: Why Generative Artificial Intelligence is the new Economic Variable
Link: https://arxiv.org/abs/2410.15212
Authors: Subramanyam Sahoo, Kamlesh Dutta
Keywords:
Subjects: Computational Engineering, Finance, and Science (cs.CE); Computers and Society (cs.CY); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: 26 pages, 8 figures, Accepted at National Conference on Advances in Marketing Paradigms for Research, Innovation and Technology (AMRIT 2023)
[LG-171] Low-cost Robust Night-time Aerial Material Segmentation through Hyperspectral Data and Sparse Spatio-Temporal Learning
Link: https://arxiv.org/abs/2410.15208
Authors: Chandrajit Bajaj, Minh Nguyen, Shubham Bhardwaj
Keywords:
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to the International Conference on Neural Information Processing (ICONIP) 2024. To be published in Springer-Nature Communications in Computer and Information Science (CCIS) Series
[LG-172] Unsupervised Domain Adaptation Approaches for Chessboard Recognition
Link: https://arxiv.org/abs/2410.15206
Authors: Wassim Jabbour, Enzo Benoit-Jeannin, Oscar Bedford, Saif Shahin
Keywords:
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 30 pages, 23 figures
[LG-173] Action abstractions for amortized sampling
Link: https://arxiv.org/abs/2410.15184
Authors: Oussama Boussif, Léna Néhale Ezzine, Joseph D Viviano, Michał Koziarski, Moksh Jain, Nikolay Malkin, Emmanuel Bengio, Rim Assouel, Yoshua Bengio
Keywords:
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
[LG-174] GUIDE: Real-Time Human-Shaped Agents
Link: https://arxiv.org/abs/2410.15181
Authors: Lingyu Zhang, Zhengran Ji, Nicholas R Waytowich, Boyuan Chen
Keywords:
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Comments:
[LG-175] Enhancing Robot Navigation Policies with Task-Specific Uncertainty Management
Link: https://arxiv.org/abs/2410.15178
Authors: Gokul Puthumanaillam, Paulo Padrao, Jose Fuentes, Leonardo Bobadilla, Melkior Ornik
Keywords:
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:
[LG-176] Beyond Pruning Criteria: The Dominant Role of Fine-Tuning and Adaptive Ratios in Neural Network Robustness
Link: https://arxiv.org/abs/2410.15176
Authors: Lincen Bai, Hedi Tabia, Raúl Santos-Rodríguez
Keywords:
Subjects: Machine Learning (cs.LG)
Comments:
[LG-177] Crafting Tomorrow: The Influence of Design Choices on Fresh Content in Social Media Recommendation
Link: https://arxiv.org/abs/2410.15174
Authors: Srijan Saket, Mohit Agarwal, Rishabh Mehrotra
Keywords:
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments:
[LG-178] Explaining Graph Neural Networks with Large Language Models: A Counterfactual Perspective for Molecular Property Prediction
Link: https://arxiv.org/abs/2410.15165
Authors: Yinhan He, Zaiyi Zheng, Patrick Soga, Yaozhen Zhu, Yushun Dong, Jundong Li
Keywords:
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Biomolecules (q-bio.BM)
Comments:
[LG-179] Pipeline Gradient-based Model Training on Analog In-memory Accelerators
Link: https://arxiv.org/abs/2410.15155
Authors: Zhaoxian Wu, Quan Xiao, Tayfun Gokmen, Hsinyu Tsai, Kaoutar El Maghraoui, Tianyi Chen
Keywords:
Subjects: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Optimization and Control (math.OC)
Comments:
[LG-180] Less is More: Parameter-Efficient Selection of Intermediate Tasks for Transfer Learning
Link: https://arxiv.org/abs/2410.15148
Authors: David Schulte, Felix Hamborg, Alan Akbik
Keywords:
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: EMNLP 2024 Main Conference
[LG-181] Budgeted Online Continual Learning by Adaptive Layer Freezing and Frequency-based Sampling
Link: https://arxiv.org/abs/2410.15143
Authors: Minhyuk Seo, Hyunseo Koh, Jonghyun Choi
Keywords:
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
[LG-182] Collaborative State Fusion in Partially Known Multi-agent Environments
Link: https://arxiv.org/abs/2410.15137
Authors: Tianlong Zhou, Jun Shang, Weixiong Rao
Keywords:
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:
[LG-183] Generalized Flow Matching for Transition Dynamics Modeling
Link: https://arxiv.org/abs/2410.15128
Authors: Haibo Wang, Yuxuan Qiu, Yanze Wang, Rob Brekelmans, Yuanqi Du
Keywords: Simulating transition dynamics, understanding protein folding, Simulating transition, wide real-world applications, protein folding
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biological Physics (physics.bio-ph); Chemical Physics (physics.chem-ph)
Comments:
Abstract: Simulating transition dynamics between metastable states is a fundamental challenge in dynamical systems and stochastic processes with wide real-world applications in understanding protein folding, chemical reactions and neural activities. However, the computational challenge often lies in sampling exponentially many paths, of which only a small fraction ends in the target metastable state due to the existence of high energy barriers. To amortize the cost, we propose a data-driven approach to warm up the simulation by learning nonlinear interpolations from local dynamics. Specifically, we infer a potential energy function from local dynamics data. To find plausible paths between two metastable states, we formulate a generalized flow matching framework that learns a vector field to sample probable paths between the two marginal densities under the learned energy function. Furthermore, we iteratively refine the model by assigning importance weights to the sampled paths and buffering more likely paths for training. We validate the effectiveness of the proposed method to sample probable paths on both synthetic and real-world molecular systems.
[LG-184] Reinfier and Reintrainer: Verification and Interpretation-Driven Safe Deep Reinforcement Learning Frameworks
Link: https://arxiv.org/abs/2410.15127
Authors: Zixuan Yang, Jiaqi Zheng, Guihai Chen
Keywords:
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
[LG-185] Generalizable Prediction Model of Molten Salt Mixture Density with Chemistry-Informed Transfer Learning
链接: https://arxiv.org/abs/2410.15120
作者: Julian Barra,Shayan Shahbazi,Anthony Birri,Rajni Chahal,Ibrahim Isah,Muhammad Nouman Anwar,Tyler Starkus,Prasanna Balaprakash,Stephen Lam
关键词-EN:
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注: Manuscript contains 25 pages including references and other information. Manuscript contains 4 figures and 3 tables. To be submitted to ACS Journal of Chemical Theory and Computation
[LG-186] Accelerating k-Means Clustering with Cover Trees
链接: https://arxiv.org/abs/2410.15117
作者: Andreas Lang,Erich Schubert
关键词-EN:
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
[LG-187] On Designing Effective RL Reward at Training Time for LLM Reasoning
链接: https://arxiv.org/abs/2410.15115
作者: Jiaxuan Gao,Shusheng Xu,Wenjie Ye,Weilin Liu,Chuyi He,Wei Fu,Zhiyu Mei,Guangju Wang,Yi Wu
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:
[LG-188] CosFairNet: A Parameter-Space based Approach for Bias Free Learning
链接: https://arxiv.org/abs/2410.15094
作者: Rajeev Ranjan Dwivedi,Priyadarshini Kumari,Vinod K Kurmi
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:
[LG-189] DPVS-Shapley: Faster and Universal Contribution Evaluation Component in Federated Learning
链接: https://arxiv.org/abs/2410.15093
作者: Ketin Yin,Zonghao Guo,ZhengHan Qin
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Science and Game Theory (cs.GT)
*备注:
[LG-190] Personalized Federated Learning with Adaptive Feature Aggregation and Knowledge Transfer
链接: https://arxiv.org/abs/2410.15073
作者: Keting Yin,Jiayi Mao
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:
[LG-191] A Cycle Ride to HDR: Semantics Aware Self-Supervised Framework for Unpaired LDR-to-HDR Image Translation
链接: https://arxiv.org/abs/2410.15068
作者: Hrishav Bakul Barua,Stefanov Kalin,Lemuel Lai En Che,Dhall Abhinav,Wong KokSheik,Krishnasamy Ganesh
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Submitted to IEEE
[LG-192] Deep Equilibrium Algorithmic Reasoning
链接: https://arxiv.org/abs/2410.15059
作者: Dobrik Georgiev,JJ Wilson,Davide Buffelli,Pietro Liò
关键词-EN:
类目: Machine Learning (cs.LG)
*备注:
[LG-193] Weakly-supervised diagnosis identification from Italian discharge letters
链接: https://arxiv.org/abs/2410.15051
作者: Vittorio Torri,Elisa Barbieri,Anna Cantarutti,Carlo Giaquinto,Francesca Ieva
关键词-EN:
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 39 pages, 4 figures
[LG-194] Testing the Efficacy of Hyperparameter Optimization Algorithms in Short-Term Load Forecasting
链接: https://arxiv.org/abs/2410.15047
作者: Tugrul Cabir Hakyemez,Omer Adar
关键词-EN:
类目: Machine Learning (cs.LG)
*备注: This is a conference paper submitted to 2nd IEEE INTERNATIONAL CONFERENCE ON IoT, COMMUNICATION AND AUTOMATION TECHNOLOGY (ICICAT 2024). It is currently under review
[LG-195] Adversarial Training: A Survey
链接: https://arxiv.org/abs/2410.15042
作者: Mengnan Zhao,Lihe Zhang,Jingwen Ye,Huchuan Lu,Baocai Yin,Xinchao Wang
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
[LG-196] Iterative Methods via Locally Evolving Set Process NEURIPS2024
链接: https://arxiv.org/abs/2410.15020
作者: Baojian Zhou,Yifan Sun,Reza Babanezhad Harikandeh,Xingzhi Guo,Deqing Yang,Yanghua Xiao
关键词-EN:
类目: Machine Learning (cs.LG)
*备注: 58 pages, 15 figures, NeurIPS 2024
[LG-197] Transit Pulse: Utilizing Social Media as a Source for Customer Feedback and Information Extraction with Large Language Model
链接: https://arxiv.org/abs/2410.15016
作者: Jiahao Wang,Amer Shalaby
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 17 pages, 21 figures
[LG-198] DST-TransitNet: A Dynamic Spatio-Temporal Deep Learning Model for Scalable and Efficient Network-Wide Prediction of Station-Level Transit Ridership
链接: https://arxiv.org/abs/2410.15013
作者: Jiahao Wang,Amer Shalaby
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
*备注: 16 pages, 22 figures. Accepted by TRB 2025
[LG-199] FlexMol: A Flexible Toolkit for Benchmarking Molecular Relational Learning
链接: https://arxiv.org/abs/2410.15010
作者: Sizhe Liu,Jun Xia,Lecheng Zhang,Yuchen Liu,Yue Liu,Wenjie Du,Zhangyang Gao,Bozhen Hu,Cheng Tan,Hongxin Xiang,Stan Z. Li
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
[LG-200] Faster Inference Time for GNNs using coarsening
链接: https://arxiv.org/abs/2410.15001
作者: Shubhajit Roy,Hrriday Ruparel,Kishan Ved,Anirban Dasgupta
关键词-EN:
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
[LG-201] A comparative study of NeuralODE and Universal ODE approaches to solving Chandrasekhar White Dwarf equation
链接: https://arxiv.org/abs/2410.14998
作者: Raymundo Vazquez Martinez,Raj Abhijit Dandekar,Rajat Dandekar,Sreedath Panat
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
[LG-202] Learning Infinite-Horizon Average-Reward Linear Mixture MDPs of Bounded Span
链接: https://arxiv.org/abs/2410.14992
作者: Woojin Chae,Kihyuk Hong,Yufan Zhang,Ambuj Tewari,Dabeen Lee
关键词-EN:
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
[LG-203] Audio Processing using Pattern Recognition for Music Genre Classification
链接: https://arxiv.org/abs/2410.14990
作者: Sivangi Chatterjee,Srishti Ganguly,Avik Bose,Hrithik Raj Prasad,Arijit Ghosal
关键词-EN:
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:
[LG-204] NeuralMAG: Fast and Generalizable Micromagnetic Simulation with Deep Neural Nets
链接: https://arxiv.org/abs/2410.14986
作者: Yunqi Cai,Jiangnan Li,Dong Wang
关键词-EN:
类目: Machine Learning (cs.LG); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Artificial Intelligence (cs.AI)
*备注:
[LG-205] Do Large Language Models Truly Grasp Mathematics? An Empirical Exploration
链接: https://arxiv.org/abs/2410.14979
作者: Wei Xie,Shuoyoucheng Ma,Zhenhua Wang,Enze Wang,Baosheng Wang,Jinshu Su
关键词-EN:
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:
[LG-206] MENTOR: Mixture-of-Experts Network with Task-Oriented Perturbation for Visual Reinforcement Learning
链接: https://arxiv.org/abs/2410.14972
作者: Suning Huang,Zheyu Zhang,Tianhai Liang,Yihan Xu,Zhehao Kou,Chenhao Lu,Guowei Xu,Zhengrong Xue,Huazhe Xu
关键词-EN:
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
[LG-207] Taming the Long Tail in Human Mobility Prediction NEURIPS2024
链接: https://arxiv.org/abs/2410.14970
作者: Xiaohang Xu,Renhe Jiang,Chuang Yang,Zipei Fan,Kaoru Sezaki
关键词-EN:
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 2024
[LG-208] Visual Navigation of Digital Libraries: Retrieval and Classification of Images in the National Library of Norway's Digitised Book Collection
链接: https://arxiv.org/abs/2410.14969
作者: Marie Roald,Magnus Breder Birkenes,Lars Gunnarsønn Bagøien Johnsen
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 13 pages, 2 figures, 4 tables, Accepted to the 2024 Computational Humanities Research Conference (CHR)
[LG-209] AugInsert: Learning Robust Visual-Force Policies via Data Augmentation for Object Assembly Tasks
链接: https://arxiv.org/abs/2410.14968
作者: Ryan Diaz,Adam Imdieke,Vivek Veeriah,Karthik Desingh
关键词-EN:
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
[LG-210] Deep Learning for Weather Forecasting: A CNN-LSTM Hybrid Model for Predicting Historical Temperature Data
链接: https://arxiv.org/abs/2410.14963
作者: Yuhao Gong,Yuchen Zhang,Fei Wang,Chi-Han Lee
关键词-EN:
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:
[LG-211] LangGFM: A Large Language Model Alone Can be a Powerful Graph Foundation Model
链接: https://arxiv.org/abs/2410.14961
作者: Tianqianjin Lin,Pengwei Yan,Kaisong Song,Zhuoren Jiang,Yangyang Kang,Jun Lin,Weikang Yuan,Junjie Cao,Changlong Sun,Xiaozhong Liu
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注: under review
[LG-212] Neural Radiance Field Image Refinement through End-to-End Sampling Point Optimization
链接: https://arxiv.org/abs/2410.14958
作者: Kazuhiro Ohta,Satoshi Ono
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:
[LG-213] Offline-to-online Reinforcement Learning for Image-based Grasping with Scarce Demonstrations
链接: https://arxiv.org/abs/2410.14957
作者: Bryan Chan,Anson Leung,James Bergstra
关键词-EN:
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
[LG-214] A Fast AI Surrogate for Coastal Ocean Circulation Models
链接: https://arxiv.org/abs/2410.14952
作者: Zelin Xu,Jie Ren,Yupu Zhang,Jose Maria Gonzalez Ondina,Maitane Olabarrieta,Tingsong Xiao,Wenchong He,Zibo Liu,Shigang Chen,Kaleb Smith,Zhe Jiang
关键词-EN:
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:
[LG-215] Straightness of Rectified Flow: A Theoretical Insight into Wasserstein Convergence
链接: https://arxiv.org/abs/2410.14949
作者: Vansh Bansal,Saptarshi Roy,Purnamrita Sarkar,Alessandro Rinaldo
关键词-EN:
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
[LG-216] DEL-Ranking: Ranking-Correction Denoising Framework for Elucidating Molecular Affinities in DNA-Encoded Libraries
链接: https://arxiv.org/abs/2410.14946
作者: Hanqun Cao,Chunbin Gu,Mutian He,Ning Ma,Chang-yu Hsieh,Pheng-Ann Heng
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
*备注:
[LG-217] ImmerseDiffusion: A Generative Spatial Audio Latent Diffusion Model
链接: https://arxiv.org/abs/2410.14945
作者: Mojtaba Heydari,Mehrez Souden,Bruno Conejo,Joshua Atkins
关键词-EN:
类目: Sound (cs.SD); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: This work pioneers a Latent Diffusion Model for generating text-prompted ambisonic spatial audio
[LG-218] Baichuan Alignment Technical Report
链接: https://arxiv.org/abs/2410.14940
作者: Mingan Lin,Fan Yang,Yanjun Shen,Haoze Sun,Tianpeng Li,Tao Zhang,Chenzheng Zhu,Tao Zhang,Miao Zheng,Xu Li,Yijie Zhou,Mingyang Chen,Yanzhao Qin,Youquan Li,Hao Liang,Fei Li,Yadong Li,Mang Wang,Guosheng Dong,Kun Fang,Jianhua Xu,Bin Cui,Wentao Zhang,Zenan Zhou,Weipeng Chen
关键词-EN:
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:
[LG-219] HiPPO-KAN: Efficient KAN Model for Time Series Analysis
链接: https://arxiv.org/abs/2410.14939
作者: SangJong Lee,Jin-Kwang Kim,JunHo Kim,TaeHan Kim,James Lee
关键词-EN:
类目: Machine Learning (cs.LG)
*备注: 16 pages, 6 figures, 2 tables
[LG-220] Water quality polluted by total suspended solids classified within an Artificial Neural Network approach
链接: https://arxiv.org/abs/2410.14929
作者: I. Luviano Soto,Y. Concha Sánchez,A. Raya
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 42 pages, 8 figures and 2 tables
[LG-221] Adversarial Score identity Distillation: Rapidly Surpassing the Teacher in One Step
链接: https://arxiv.org/abs/2410.14919
作者: Mingyuan Zhou,Huangjie Zheng,Yi Gu,Zhendong Wang,Hai Huang
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:
[LG-222] ReeFRAME: Reeb Graph based Trajectory Analysis Framework to Capture Top-Down and Bottom-Up Patterns of Life
链接: https://arxiv.org/abs/2410.14913
作者: Chandrakanth Gudavalli,Bowen Zhang,Connor Levenson,Kin Gwn Lore,B. S. Manjunath
关键词-EN:
类目: Machine Learning (cs.LG)
*备注: GeoAnomalies Workshop @ ACM Sigspatial 2024
[LG-223] Truncated Consistency Models
链接: https://arxiv.org/abs/2410.14895
作者: Sangyun Lee,Yilun Xu,Tomas Geffner,Giulia Fanti,Karsten Kreis,Arash Vahdat,Weili Nie
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:
[LG-224] Soft-Label Integration for Robust Toxicity Classification NEURIPS24
链接: https://arxiv.org/abs/2410.14894
作者: Zelei Cheng,Xian Wu,Jiahao Yu,Shuo Han,Xin-Qiang Cai,Xinyu Xing
关键词-EN:
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 24
[LG-225] Self-Satisfied: An end-to-end framework for SAT generation and prediction
链接: https://arxiv.org/abs/2410.14888
作者: Christopher R. Serrano,Jonathan Gallagher,Kenji Yamada,Alexei Kopylov,Michael A. Warren
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注: 22 pages
[LG-226] Zero-shot Generalist Graph Anomaly Detection with Unified Neighborhood Prompts
链接: https://arxiv.org/abs/2410.14886
作者: Chaoxi Niu,Hezhe Qiao,Changlu Chen,Ling Chen,Guansong Pang
关键词-EN:
类目: Machine Learning (cs.LG)
*备注: 19 pages
[LG-227] Which LLMs are Difficult to Detect? A Detailed Analysis of Potential Factors Contributing to Difficulties in LLM Text Detection NEURIPS2024
链接: https://arxiv.org/abs/2410.14875
作者: Shantanu Thorat,Tianbao Yang
关键词-EN:
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Accepted at NeurIPS 2024 - Safe Generative AI Workshop
[LG-228] How to Evaluate Reward Models for RLHF
链接: https://arxiv.org/abs/2410.14872
作者: Evan Frick,Tianle Li,Connor Chen,Wei-Lin Chiang,Anastasios N. Angelopoulos,Jiantao Jiao,Banghua Zhu,Joseph E. Gonzalez,Ion Stoica
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:
[LG-229] FedSpaLLM: Federated Pruning of Large Language Models
链接: https://arxiv.org/abs/2410.14852
作者: Guangji Bai,Yijiang Li,Zilinghan Li,Liang Zhao,Kibaek Kim
关键词-EN:
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Preprint
[LG-230] Rank Suggestion in Non-negative Matrix Factorization: Residual Sensitivity to Initial Conditions (RSIC)
链接: https://arxiv.org/abs/2410.14838
作者: Marc A. Tunnell,Zachary J. DeBruine,Erin Carrier
关键词-EN:
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: under submission to TMLR
[LG-231] Topological obstruction to the training of shallow ReLU neural networks
链接: https://arxiv.org/abs/2410.14837
作者: Marco Nurisso,Pierrick Leroy,Francesco Vaccarino
关键词-EN:
类目: Machine Learning (cs.LG); Algebraic Geometry (math.AG); Algebraic Topology (math.AT)
*备注: 23 pages, 5 figures
[LG-232] Automated Road Extraction from Satellite Imagery Integrating Dense Depthwise Dilated Separable Spatial Pyramid Pooling with DeepLabV3
链接: https://arxiv.org/abs/2410.14836
作者: Arpan Mahara,Md Rezaul Karim Khan,Naphtali D. Rishe,Wenjia Wang,Seyed Masoud Sadjadi
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 9 pages, 5 figures
[LG-233] Making LLMs Vulnerable to Prompt Injection via Poisoning Alignment
链接: https://arxiv.org/abs/2410.14827
作者: Zedian Shao,Hongbin Liu,Jaden Mu,Neil Zhenqiang Gong
关键词-EN:
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:
[LG-234] SPRIG: Improving Large Language Model Performance by System Prompt Optimization
链接: https://arxiv.org/abs/2410.14826
作者: Lechen Zhang,Tolga Ergen,Lajanugen Logeswaran,Moontae Lee,David Jurgens
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:
[LG-235] A Complexity-Based Theory of Compositionality
链接: https://arxiv.org/abs/2410.14817
作者: Eric Elmoznino,Thomas Jiralerspong,Yoshua Bengio,Guillaume Lajoie
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
[LG-236] Adapting Multilingual LLMs to Low-Resource Languages using Continued Pre-training and Synthetic Corpus
链接: https://arxiv.org/abs/2410.14815
作者: Raviraj Joshi,Kanishk Singla,Anusha Kamath,Raunak Kalani,Rakesh Paul,Utkarsh Vaidya,Sanjay Singh Chauhan,Niranjan Wartikar,Eileen Long
关键词-EN:
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:
[LG-237] Effects of Soft-Domain Transfer and Named Entity Information on Deception Detection
链接: https://arxiv.org/abs/2410.14814
作者: Steven Triplett,Simon Minami,Rakesh Verma
关键词-EN:
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:
[LG-238] Aligning AI Agents via Information-Directed Sampling
链接: https://arxiv.org/abs/2410.14807
作者: Hong Jun Jeon,Benjamin Van Roy
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
[LG-239] DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents
链接: https://arxiv.org/abs/2410.14803
作者: Taiyi Wang,Zhihao Wu,Jianheng Liu,Jianye Hao,Jun Wang,Kun Shao
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Systems and Control (eess.SY)
*备注: Paper and Appendix, 24 pages
[LG-240] Implicit Regularization of Sharpness-Aware Minimization for Scale-Invariant Problems NEURIPS2024
链接: https://arxiv.org/abs/2410.14802
作者: Bingcong Li,Liang Zhang,Niao He
关键词-EN:
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: NeurIPS 2024
[LG-241] Evaluating Quantized Large Language Models for Code Generation on Low-Resource Language Benchmarks
链接: https://arxiv.org/abs/2410.14766
作者: Enkhbold Nyamsuren
关键词-EN:
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注:
[LG-242] What's New in My Data? Novelty Exploration via Contrastive Generation
链接: https://arxiv.org/abs/2410.14765
作者: Masaru Isonuma,Ivan Titov
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:
[LG-243] Multifidelity Kolmogorov-Arnold Networks
链接: https://arxiv.org/abs/2410.14764
作者: Amanda A. Howard,Bruno Jacob,Panos Stinis
关键词-EN:
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
[LG-244] Constrained Recurrent Bayesian Forecasting for Crack Propagation
链接: https://arxiv.org/abs/2410.14761
作者: Sara Yasmine Ouerk,Olivier Vo Van,Mouadh Yagoubi
关键词-EN:
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
[LG-245] Mitigating Embedding Collapse in Diffusion Models for Categorical Data
链接: https://arxiv.org/abs/2410.14758
作者: Bac Nguyen,Chieh-Hsin Lai,Yuhta Takida,Naoki Murata,Toshimitsu Uesaka,Stefano Ermon,Yuki Mitsufuji
关键词-EN:
类目: Machine Learning (cs.LG)
*备注:
[LG-246] Controllable Discovery of Intents: Incremental Deep Clustering Using Semi-Supervised Contrastive Learning
链接: https://arxiv.org/abs/2410.14755
作者: Mrinal Rawat,Hithesh Sankararaman,Victor Barres
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted in IJCNLP’23
[LG-247] Collaboratively adding new knowledge to an LLM
链接: https://arxiv.org/abs/2410.14753
作者: Rhui Dih Lee,Laura Wynter
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:
[LG-248] CFTS-GAN: Continual Few-Shot Teacher Student for Generative Adversarial Networks
链接: https://arxiv.org/abs/2410.14749
作者: Munsif Ali,Leonardo Rossi,Massimo Bertozzi
关键词-EN:
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:
[LG-249] Efficient Deep Learning Board: Training Feedback Is Not All You Need
链接: https://arxiv.org/abs/2410.14743
作者: Lina Gong,Qi Gao,Peng Li,Mingqiang Wei,Fei Wu
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
[LG-250] ArrivalNet: Predicting City-wide Bus/Tram Arrival Time with Two-dimensional Temporal Variation Modeling
链接: https://arxiv.org/abs/2410.14742
作者: Zirui Li,Patrick Wolf,Meng Wang
关键词-EN:
类目: Machine Learning (cs.LG)
*备注: Under review at IEEE T-ITS
[LG-251] CAKD: A Correlation-Aware Knowledge Distillation Framework Based on Decoupling Kullback-Leibler Divergence
链接: https://arxiv.org/abs/2410.14741
作者: Zao Zhang,Huaming Chen,Pei Ning,Nan Yang,Dong Yuan
关键词-EN:
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
[LG-252] Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching
链接: https://arxiv.org/abs/2410.14740
作者: Jie Peng,Zhang Cao,Huaizhi Qu,Zhengyu Zhang,Chang Guo,Yanyong Zhang,Zhichao Zhang,Tianlong Chen
关键词-EN:
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 24 pages, 13 figures
[LG-253] Advancements In Heart Disease Prediction: A Machine Learning Approach For Early Detection And Risk Assessment
链接: https://arxiv.org/abs/2410.14738
作者: Balaji Shesharao Ingole,Vishnu Ramineni,Nikhil Bangad,Koushik Kumar Ganeeb,Priyankkumar Patel
关键词-EN:
类目: Machine Learning (cs.LG)
*备注:
[LG-254] Knowledge Graph Embeddings: A Comprehensive Survey on Capturing Relation Properties
链接: https://arxiv.org/abs/2410.14733
作者: Guanglin Niu
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 22 pages, 8 figures, 3 tables, this paper is a modified English version of our article already published in Computer Science journal (in Chinese), released to facilitate communication among international researchers in the relevant fields
[LG-255] SIFM: A Foundation Model for Multi-granularity Arctic Sea Ice Forecasting
链接: https://arxiv.org/abs/2410.14732
作者: Jingyi Xu,Yeqi Luo,Weidong Yang,Keyi Liu,Shengnan Wang,Ben Fei,Lei Bai
关键词-EN:
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: 10 pages, 7 figures
[LG-256] MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection
链接: https://arxiv.org/abs/2410.14731
作者: Bokai Lin,Zihao Zeng,Zipeng Xiao,Siqi Kou,Tianqi Hou,Xiaofeng Gao,Hao Zhang,Zhijie Deng
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:
[LG-257] On the Relation Between Linear Diffusion and Power Iteration
链接: https://arxiv.org/abs/2410.14730
作者: Dana Weitzner,Mauricio Delbracio,Peyman Milanfar,Raja Giryes
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:
[LG-258] Tokens on Demand: Token Condensation as Training-free Test-time Adaptation
链接: https://arxiv.org/abs/2410.14729
作者: Zixin Wang,Dong Gong,Sen Wang,Zi Huang,Yadan Luo
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 18 pages, 7 figures
[LG-259] Leveraging Intra-Period and Inter-Period Features for Enhanced Passenger Flow Prediction of Subway Stations
链接: https://arxiv.org/abs/2410.14727
作者: Xiannan Huang,Chao Yang,Quan Yuan
关键词-EN:
类目: Machine Learning (cs.LG)
*备注: accepted by TRBAM 2024
[LG-260] Incorporating Long-term Data in Training Short-term Traffic Prediction Model
链接: https://arxiv.org/abs/2410.14726
作者: Xiannan Huang,Shuhan Qiu,Yan Cheng,Quan Yuan,Chao Yang
关键词-EN:
类目: Machine Learning (cs.LG)
*备注: submitted to IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS
[LG-261] Rethinking Token Reduction for State Space Models EMNLP2024
链接: https://arxiv.org/abs/2410.14725
作者: Zheng Zhan,Yushu Wu,Zhenglun Kong,Changdi Yang,Yifan Gong,Xuan Shen,Xue Lin,Pu Zhao,Yanzhi Wang
关键词-EN:
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注: EMNLP 2024
[LG-262] A Phenomenological AI Foundation Model for Physical Signals
链接: https://arxiv.org/abs/2410.14724
作者: Jaime Lien,Laura I. Galindez Olascoaga,Hasan Dogan,Nicholas Gillian,Brandon Barbello,Leonardo Giusti,Ivan Poupyrev
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注:
[LG-263] BeniFul: Backdoor Defense via Middle Feature Analysis for Deep Neural Networks
链接: https://arxiv.org/abs/2410.14723
作者: Xinfu Li,Junying Zhang,Xindi Ma
关键词-EN:
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
[LG-264] The Representation of Meaningful Precision and Accuracy
链接: https://arxiv.org/abs/2410.14721
作者: A Mani
关键词-EN:
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注: 16 Pages
[LG-265] SGLP: A Similarity Guided Fast Layer Partition Pruning for Compressing Large Deep Models
链接: https://arxiv.org/abs/2410.14720
作者: Yuqi Li,Yao Lu,Zeyu Dong,Chuanguang Yang,Yihao Chen,Jianping Gou
关键词-EN:
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 20 pages
[LG-266] A Systematic Survey on Large Language Models for Algorithm Design
链接: https://arxiv.org/abs/2410.14716
作者: Fei Liu,Yiming Yao,Ping Guo,Zhiyuan Yang,Xi Lin,Xialiang Tong,Mingxuan Yuan,Zhichao Lu,Zhenkun Wang,Qingfu Zhang
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:
[LG-267] QuAILoRA: Quantization-Aware Initialization for LoRA NEURIPS
链接: https://arxiv.org/abs/2410.14713
作者: Neal Lawton,Aishwarya Padmakumar,Judith Gaspers,Jack FitzGerald,Anoop Kumar,Greg Ver Steeg,Aram Galstyan
关键词-EN:
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注: 12 pages, 7 figures. Submitted to the 4th NeurIPS Workshop on Efficient Natural Language and Speech Processing (ENLSP-IV)
[LG-268] G2D2: Gradient-guided Discrete Diffusion for image inverse problem solving
链接: https://arxiv.org/abs/2410.14710
作者: Naoki Murata,Chieh-Hsin Lai,Yuhta Takida,Toshimitsu Uesaka,Bac Nguyen,Stefano Ermon,Yuki Mitsufuji
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
[LG-269] Transformers are Efficient Compilers Provably
链接: https://arxiv.org/abs/2410.14706
作者: Xiyu Zhai,Runlong Zhou,Liao Zhang,Simon Shaolei Du
关键词-EN:
类目: Programming Languages (cs.PL); Machine Learning (cs.LG)
*备注: 65 pages
[LG-270] Optimizing Parking Space Classification: Distilling Ensembles into Lightweight Classifiers ICMLA
链接: https://arxiv.org/abs/2410.14705
作者: Paulo Luza Alves,André Hochuli,Luiz Eduardo de Oliveira,Paulo Lisboa de Almeida
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted for presentation at the International Conference on Machine Learning and Applications (ICMLA) 2024
[LG-271] Deep Learning Enhanced Road Traffic Analysis: Scalable Vehicle Detection and Velocity Estimation Using PlanetScope Imagery
链接: https://arxiv.org/abs/2410.14698
作者: Maciej Adamiak,Yulia Grinblat,Julian Psotta,Nir Fulman,Himshikhar Mazumdar,Shiyu Tang,Alexander Zipf
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:
[LG-272] Rethinking VLMs and LLMs for Image Classification
链接: https://arxiv.org/abs/2410.14690
作者: Avi Cooper,Keizo Kato,Chia-Hsien Shih,Hiroaki Yamane,Kasper Vinken,Kentaro Takemoto,Taro Sunagawa,Hao-Wei Yeh,Jin Yamanaka,Ian Mason,Xavier Boix
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:
[LG-273] BrainTransformers: SNN-LLM
链接: https://arxiv.org/abs/2410.14687
作者: Zhengzheng Tang
关键词-EN:
类目: Neural and Evolutionary Computing (cs.NE); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:
[LG-274] Leveraging Large Language Models for Enhancing Public Transit Services
链接: https://arxiv.org/abs/2410.14147
作者: Jiahao Wang,Amer Shalaby
关键词-EN:
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 24 pages, 18 figures, submitting to Journal of ITS
[LG-275] ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time
链接: https://arxiv.org/abs/2410.06625
作者: Yi Ding,Bolian Li,Ruqi Zhang
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 27 pages
[LG-276] QT-DoG: Quantization-aware Training for Domain Generalization
链接: https://arxiv.org/abs/2410.06020
作者: Saqib Javed,Hieu Le,Mathieu Salzmann
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: Code will be released soon
[LG-277] Theoretical Limitations of Ensembles in the Age of Overparameterization
链接: https://arxiv.org/abs/2410.16201
作者: Niclas Dern,John P. Cunningham,Geoff Pleiss
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 26 pages, 12 figures
[LG-278] Integer linear programming for unsupervised training set selection in molecular machine learning
链接: https://arxiv.org/abs/2410.16122
作者: Matthieu Haeberle,Puck van Gerwen,Ruben Laplaza,Ksenia R. Briling,Jan Weinreich,Friedrich Eisenbrand,Clemence Corminboeuf
关键词-EN:
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注: 31 pages + SI (15 pages)
[LG-279] Statistical Inference for Temporal Difference Learning with Linear Function Approximation
链接: https://arxiv.org/abs/2410.16106
作者: Weichen Wu,Gen Li,Yuting Wei,Alessandro Rinaldo
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
[LG-280] On the Geometry of Regularization in Adversarial Training: High-Dimensional Asymptotics and Generalization Bounds
链接: https://arxiv.org/abs/2410.16073
作者: Matteo Vilucchio,Nikolaos Tsilivis,Bruno Loureiro,Julia Kempe
关键词-EN:
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
[LG-281] GFlowNets for Hamiltonian decomposition in groups of compatible operators NEURIPS2024
链接: https://arxiv.org/abs/2410.16041
作者: Isaac L. Huidobro-Meezs,Jun Dai,Guillaume Rabusseau,Rodrigo A. Vargas-Hernández
关键词-EN:
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 8 pages, 2 figures. Accepted for Machine Learning and the Physical Sciences Workshop, NeurIPS 2024. Submission Number: 167
[LG-282] Resilient Temporal GCN for Smart Grid State Estimation Under Topology Inaccuracies
链接: https://arxiv.org/abs/2410.16008
作者: Seyed Hamed Haghshenas,Mia Naeini
关键词-EN:
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 9 pages, 5 figures
[LG-283] A quantitative Robbins-Siegmund theorem
链接: https://arxiv.org/abs/2410.15986
作者: Morenikeji Neri,Thomas Powell
关键词-EN:
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Logic (math.LO); Probability (math.PR)
*备注: 30 pages
[LG-284] State Estimation Using Sparse DEIM and Recurrent Neural Networks
链接: https://arxiv.org/abs/2410.15982
作者: Mohammad Farazmand
关键词-EN:
类目: Dynamical Systems (math.DS); Machine Learning (cs.LG); Numerical Analysis (math.NA); Chaotic Dynamics (nlin.CD)
*备注:
[LG-285] Automatic Differentiation of Optimization Algorithms with Time-Varying Updates
链接: https://arxiv.org/abs/2410.15923
作者: Sheheryar Mehmood,Peter Ochs
关键词-EN:
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
[LG-286] On the Design and Performance of Machine Learning Based Error Correcting Decoders
链接: https://arxiv.org/abs/2410.15899
作者: Yuncheng Yuan,Péter Scheepers,Lydia Tasiou,Yunus Can Gültekin,Federico Corradi,Alex Alvarado
关键词-EN:
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 6 pages, 4 figures, submitted for possible presentation in a conference
[LG-287] R2I-rPPG: A Robust Region of Interest Selection Method for Remote Photoplethysmography to Extract Heart Rate
Link: https://arxiv.org/abs/2410.15851
Authors: Sandeep Nagar, Mark Hasegawa-Johnson, David G. Beiser, Narendra Ahuja
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: preprint
[LG-288] Solvation Free Energies from Neural Thermodynamic Integration
Link: https://arxiv.org/abs/2410.15815
Authors: Bálint Máté, François Fleuret, Tristan Bereau
Categories: Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
[LG-289] Mean-Field Simulation-Based Inference for Cosmological Initial Conditions NEURIPS2024
Link: https://arxiv.org/abs/2410.15808
Authors: Oleg Savchenko, Florian List, Guillermo Franco Abellán, Noemi Anau Montel, Christoph Weniger
Categories: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: Accepted for the NeurIPS 2024 workshop Machine Learning and the Physical Sciences; 5 + 4 pages, 3 figures
[LG-290] SeisLM: a Foundation Model for Seismic Waveforms
Link: https://arxiv.org/abs/2410.15765
Authors: Tianlin Liu, Jannes Münchmeyer, Laura Laurenti, Chris Marone, Maarten V. de Hoop, Ivan Dokmanić
Categories: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
[LG-291] Two-stage Learning-to-Defer for Multi-Task Learning
Link: https://arxiv.org/abs/2410.15729
Authors: Montreuil Yannis, Yeo Shu Heng, Carlier Axel, Ng Lai Xing, Ooi Wei Tsang
Categories: Machine Learning (stat.ML); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 32 pages, 17 main paper
[LG-292] Learning signals defined on graphs with optimal transport and Gaussian process regression
Link: https://arxiv.org/abs/2410.15721
Authors: Raphaël Carpintero Perez (CMAP), Sébastien da Veiga (ENSAI, CREST), Josselin Garnier (CMAP), Brian Staber
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
[LG-293] Towards Kriging-informed Conditional Diffusion for Regional Sea-Level Data Downscaling
Link: https://arxiv.org/abs/2410.15628
Authors: Subhankar Ghosh, Arun Sharma, Jayant Gupta, Aneesh Subramanian, Shashi Shekhar
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[LG-294] CPE-Pro: A Structure-Sensitive Deep Learning Model for Protein Representation and Origin Evaluation
Link: https://arxiv.org/abs/2410.15592
Authors: Wenrui Gou, Wenhui Ge, Yang Tan, Guisheng Fan, Mingchen Li, Huiqun Yu
Categories: Biomolecules (q-bio.BM); Computation and Language (cs.CL); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
[LG-295] Predicting adaptively chosen observables in quantum systems
Link: https://arxiv.org/abs/2410.15501
Authors: Jerry Huang, Laura Lewis, Hsin-Yuan Huang, John Preskill
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 10 pages, 4 figures + 39-page appendix
[LG-296] Discriminating image representations with principal distortions
Link: https://arxiv.org/abs/2410.15433
Authors: Jenelle Feather, David Lipshutz, Sarah E. Harvey, Alex H. Williams, Eero P. Simoncelli
Categories: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
[LG-297] Tighter Performance Theory of FedExProx
Link: https://arxiv.org/abs/2410.15368
Authors: Wojciech Anyszka, Kaja Gruntkowska, Alexander Tyurin, Peter Richtárik
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 43 pages, 4 figures
[LG-298] A Novel Characterization of the Population Area Under the Risk Coverage Curve (AURC) and Rates of Finite Sample Estimators
Link: https://arxiv.org/abs/2410.15361
Authors: Han Zhou, Jordy Van Landeghem, Teodora Popordanoska, Matthew B. Blaschko
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
[LG-299] Diffusion-PINN Sampler
Link: https://arxiv.org/abs/2410.15336
Authors: Zhekun Shi, Longlin Yu, Tianyu Xie, Cheng Zhang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 33 pages, 7 figures
[LG-300] Amortized Probabilistic Conditioning for Optimization, Simulation and Inference
Link: https://arxiv.org/abs/2410.15320
Authors: Paul E. Chang, Nasrulloh Loka, Daolang Huang, Ulpu Remes, Samuel Kaski, Luigi Acerbi
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 33 pages, 21 figures
[LG-301] Physically Guided Deep Unsupervised Inversion for 1D Magnetotelluric Models
Link: https://arxiv.org/abs/2410.15274
Authors: Paul Goyes-Peñafiel, Umair bin Waheed, Henry Arguello
Categories: Geophysics (physics.geo-ph); Machine Learning (cs.LG); Applied Physics (physics.app-ph)
Comments: 5 pages, 6 figures, github repository, submitted to IEEE-GRSL
[LG-302] Robust Low-rank Tensor Train Recovery
Link: https://arxiv.org/abs/2410.15224
Authors: Zhen Qin, Zhihui Zhu
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Signal Processing (eess.SP)
[LG-303] Learning the Rolling Penny Dynamics
Link: https://arxiv.org/abs/2410.15201
Authors: Baiyue Wang, Anthony Bloch
Categories: Dynamical Systems (math.DS); Machine Learning (cs.LG); Mathematical Physics (math-ph)
[LG-304] HACSurv: A Hierarchical Copula-based Approach for Survival Analysis with Dependent Competing Risks
Link: https://arxiv.org/abs/2410.15180
Authors: Xin Liu, Weijia Zhang, Min-Ling Zhang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
[LG-305] Controllable RANSAC-based Anomaly Detection via Hypothesis Testing
Link: https://arxiv.org/abs/2410.15133
Authors: Le Hong Phong, Ho Ngoc Luat, Vo Nguyen Le Duy
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
[LG-306] The shape of the brain's connections is predictive of cognitive performance: an explainable machine learning study
Link: https://arxiv.org/abs/2410.15108
Authors: Yui Lo, Yuqian Chen, Dongnan Liu, Wan Liu, Leo Zekelman, Jarrett Rushmore, Fan Zhang, Yogesh Rathi, Nikos Makris, Alexandra J. Golby, Weidong Cai, Lauren J. O’Donnell
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
[LG-307] Asymptotic Time-Uniform Inference for Parameters in Averaged Stochastic Approximation
Link: https://arxiv.org/abs/2410.15057
Authors: Chuhan Xie, Kaicheng Jin, Jiadong Liang, Zhihua Zhang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 35 pages, 4 figures
[LG-308] Statistical Inference for Feature Selection after Optimal Transport-based Domain Adaptation
Link: https://arxiv.org/abs/2410.15022
Authors: Nguyen Thang Loi, Duong Tan Loc, Vo Nguyen Le Duy
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
[LG-309] Time-Varying Convex Optimization with O(n) Computational Complexity
Link: https://arxiv.org/abs/2410.15009
Authors: M. Rostami, S. S. Kia
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2310.07925
[LG-310] Achieving O(1/N) Optimality Gap in Restless Bandits through Diffusion Approximation
Link: https://arxiv.org/abs/2410.15003
Authors: Chen Yan, Weina Wang, Lei Ying
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Probability (math.PR)
Comments: 31 pages, 6 figures
[LG-311] Wave (from) Polarized Light Learning (WPLL) method: high resolution spatio-temporal measurements of water surface waves in laboratory setups
Link: https://arxiv.org/abs/2410.14988
Authors: Noam Ginio (1), Michael Lindenbaum (2 and 3), Barak Fishbain (1), Dan Liberzon (1 and 3) ((1) Faculty of Civil and Environmental Engineering, Technion, Haifa, Israel, (2) Faculty of Computer Science, Technion, Haifa, Israel, (3) Interdisciplinary program for Marine Engineering, Technion, Haifa, Israel)
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Comments: 20 pages, 17 figures, 5 tables, under review in Applied Ocean Research Journal
[LG-312] 2D Basement Relief Inversion using Sparse Regularization
Link: https://arxiv.org/abs/2410.14942
Authors: Francisco Márcio Barboza, Arthur Anthony da Cunha Romão E Silva, Bruno Motta de Carvalho
Categories: Geophysics (physics.geo-ph); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 8 pages, 13 figures, Submitted to Acta Geophysica
[LG-313] Can AI weather models predict out-of-distribution gray swan tropical cyclones?
Link: https://arxiv.org/abs/2410.14932
Authors: Y. Qiang Sun, Pedram Hassanzadeh, Mohsen Zand, Ashesh Chattopadhyay, Jonathan Weare, Dorian S. Abbot
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
[LG-314] Hierarchical Reinforced Trader (HRT): A Bi-Level Approach for Optimizing Stock Selection and Execution
Link: https://arxiv.org/abs/2410.14927
Authors: Zijie Zhao, Roy E. Welsch
Categories: Trading and Market Microstructure (q-fin.TR); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
[LG-315] Predictive variational inference: Learn the predictively optimal posterior distribution
Link: https://arxiv.org/abs/2410.14843
Authors: Jinlin Lai, Yuling Yao
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
[LG-316] Multi-Task Dynamic Pricing in Credit Market with Contextual Information
Link: https://arxiv.org/abs/2410.14839
Authors: Adel Javanmard, Jingwei Ji, Renyuan Xu
Categories: Pricing of Securities (q-fin.PR); Machine Learning (cs.LG)
[LG-317] Differentially Private Covariate Balancing Causal Inference
Link: https://arxiv.org/abs/2410.14789
Authors: Yuki Ohnishi, Jordan Awan
Categories: Methodology (stat.ME); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 30 pages
[LG-318] Simultaneously Solving FBSDEs with Neural Operators of Logarithmic Depth, Constant Width, and Sub-Linear Rank
Link: https://arxiv.org/abs/2410.14788
Authors: Takashi Furuya, Anastasis Kratsios
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR); Computational Finance (q-fin.CP)
Comments: 36 pages + references
[LG-319] Privacy for Free in the Over-Parameterized Regime
Link: https://arxiv.org/abs/2410.14787
Authors: Simone Bombari, Marco Mondelli
Categories: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
[LG-320] High-Dimensional Tensor Discriminant Analysis with Incomplete Tensors
Link: https://arxiv.org/abs/2410.14783
Authors: Elynn Chen, Yuefeng Han, Jiayu Li
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
[LG-321] Machine Learning Aided Modeling of Granular Materials: A Review
Link: https://arxiv.org/abs/2410.14767
Authors: Mengqi Wang, Krishna Kumar, Y. T. Feng, Tongming Qu, Min Wang
Categories: Geophysics (physics.geo-ph); Soft Condensed Matter (cond-mat.soft); Machine Learning (cs.LG)
Comments: Submitted to Archives of Computational Methods in Engineering
[LG-322] Advancing Physics Data Analysis through Machine Learning and Physics-Informed Neural Networks
Link: https://arxiv.org/abs/2410.14760
Authors: Vasileios Vatellis
Categories: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG)
Comments: 7 pages
[LG-323] Universal approximation results for neural networks with non-polynomial activation function over non-compact domains
Link: https://arxiv.org/abs/2410.14759
Authors: Ariel Neufeld, Philipp Schmocker
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Classical Analysis and ODEs (math.CA)
Comments: arXiv admin note: substantial text overlap with arXiv:2312.08410
[LG-324] On the Sparsity of the Strong Lottery Ticket Hypothesis
Link: https://arxiv.org/abs/2410.14754
Authors: Emanuele Natale (CNRS, COATI, I3S, UniCA), Davide Ferré (UniCA, CNRS, Inria, I3S), Giordano Giambartolomei, Frédéric Giroire (I3S, COMUE UCA, COATI), Frederik Mallmann-Trenn
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[LG-325] Continuous Wavelet Transformation and VGG16 Deep Neural Network for Stress Classification in PPG Signals
Link: https://arxiv.org/abs/2410.14747
Authors: Yasin Hasanpoor, Bahram Tarvirdizadeh, Khalil Alipour, Mohammad Ghamari
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
Comments: 4 figures
[LG-326] A Transformer Based Generative Chemical Language AI Model for Structural Elucidation of Organic Compounds
Link: https://arxiv.org/abs/2410.14719
Authors: Xiaofeng Tan
Categories: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Biomolecules (q-bio.BM)
Comments: 31 pages
[LG-327] REBIND: Enhancing ground-state molecular conformation via force-based graph rewiring
Link: https://arxiv.org/abs/2410.14696
Authors: Taewon Kim, Hyunjin Seo, Sungsoo Ahn, Eunho Yang
Categories: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments: 17 pages, 4 figures, 5 tables
[LG-328] Achieving Generalization in Orchestrating GNSS Interference Monitoring Stations Through Pseudo-Labeling
Link: https://arxiv.org/abs/2410.14686
Authors: Lucas Heublein, Tobias Feigl, Alexander Rügamer, Felix Ott
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: DGON Positioning and Navigation for Intelligent Transport Systems (POSNAV)
Information Retrieval
[IR-0] Limpeh ga li gong: Challenges in Singlish Annotations
Link: https://arxiv.org/abs/2410.16156
Authors: Lynnette Hui Xian Ng, Luo Qi Chan
Keywords: Colloquial Singapore English, multicultural Singapore, Singapore English, Natural Language Processing, Colloquial Singapore
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Abstract: Singlish, or Colloquial Singapore English, is a language formed from oral and social communication within multicultural Singapore. In this work, we address a fundamental Natural Language Processing (NLP) task: Parts-Of-Speech (POS) tagging of Singlish sentences. For our analysis, we build a parallel Singlish dataset containing direct English translations and POS tags, with translation and POS annotation done by native Singlish speakers. Our experiments show that automatic transition-based and transformer-based taggers reach only about 80% accuracy when evaluated against human-annotated POS labels, suggesting that there is indeed room for improvement in computational analysis of the language. We provide an exposition of challenges in Singlish annotation: its inconsistencies in form and semantics, the highly context-dependent particles of the language, its structurally unique expressions, and the variation of the language across different mediums. Our task definition, resultant labels and results reflect the challenges in analysing colloquial languages formulated from a variety of dialects, and pave the way for future studies beyond POS tagging.
[IR-1] PODTILE: Facilitating Podcast Episode Browsing with Auto-generated Chapters CIKM
Link: https://arxiv.org/abs/2410.16148
Authors: Azin Ghazimatin, Ekaterina Garmash, Gustavo Penha, Kristen Sheets, Martin Achenbach, Oguz Semerci, Remi Galvez, Marcus Tannenberg, Sahitya Mantravadi, Divya Narayanan, Ofeliya Kalaydzhyan, Douglas Cole, Ben Carterette, Ann Clifton, Paul N. Bennett, Claudia Hauff, Mounia Lalmas
Keywords: locate relevant sections, long-form talk-audio content, relevant sections, long-form talk-audio, find it challenging
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures, CIKM industry track 2024
Abstract:Listeners of long-form talk-audio content, such as podcast episodes, often find it challenging to understand the overall structure and locate relevant sections. A practical solution is to divide episodes into chapters–semantically coherent segments labeled with titles and timestamps. Since most episodes on our platform at Spotify currently lack creator-provided chapters, automating the creation of chapters is essential. Scaling the chapterization of podcast episodes presents unique challenges. First, episodes tend to be less structured than written texts, featuring spontaneous discussions with nuanced transitions. Second, the transcripts are usually lengthy, averaging about 16,000 tokens, which necessitates efficient processing that can preserve context. To address these challenges, we introduce PODTILE, a fine-tuned encoder-decoder transformer to segment conversational data. The model simultaneously generates chapter transitions and titles for the input transcript. To preserve context, each input text is augmented with global context, including the episode’s title, description, and previous chapter titles. In our intrinsic evaluation, PODTILE achieved an 11% improvement in ROUGE score over the strongest baseline. Additionally, we provide insights into the practical benefits of auto-generated chapters for listeners navigating episode content. Our findings indicate that auto-generated chapters serve as a useful tool for engaging with less popular podcasts. Finally, we present empirical evidence that using chapter titles can enhance effectiveness of sparse retrieval in search tasks.
[IR-2] Unleashing the Potential of Multi-Channel Fusion in Retrieval for Personalized Recommendations
Link: https://arxiv.org/abs/2410.16080
Authors: Junjie Huang, Jiarui Qin, Jianghao Lin, Ziming Feng, Yong Yu, Weinan Zhang
Keywords: modern digital services, managing information overload, Recommender systems, digital services, pivotal in managing
Categories: Information Retrieval (cs.IR)
Comments: 12 pages, 8 figures
Abstract:Recommender systems (RS) are pivotal in managing information overload in modern digital services. A key challenge in RS is efficiently processing vast item pools to deliver highly personalized recommendations under strict latency constraints. Multi-stage cascade ranking addresses this by employing computationally efficient retrieval methods to cover diverse user interests, followed by more precise ranking models to refine the results. In the retrieval stage, multi-channel retrieval is often used to generate distinct item subsets from different candidate generators, leveraging the complementary strengths of these methods to maximize coverage. However, forwarding all retrieved items overwhelms downstream rankers, necessitating truncation. Despite advancements in individual retrieval methods, multi-channel fusion, the process of efficiently merging multi-channel retrieval results, remains underexplored. We are the first to identify and systematically investigate multi-channel fusion in the retrieval stage. Current industry practices often rely on heuristic approaches and manual designs, which often lead to suboptimal performance. Moreover, traditional gradient-based methods like SGD are unsuitable for this task due to the non-differentiable nature of the selection process. In this paper, we explore advanced channel fusion strategies by assigning systematically optimized weights to each channel. We utilize black-box optimization techniques, including the Cross Entropy Method and Bayesian Optimization for global weight optimization, alongside policy gradient-based approaches for personalized merging. Our methods enhance both personalization and flexibility, achieving significant performance improvements across multiple datasets and yielding substantial gains in real-world deployments, offering a scalable solution for optimizing multi-channel fusion in retrieval.
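The global weight search mentioned in the abstract can be illustrated with a small sketch. It is only an illustration under stated assumptions, not the authors' implementation: each channel is assumed to be a dict mapping item to score, the objective is assumed to be recall@k against a known set of relevant items, and the function names (`fuse`, `recall_at_k`, `cross_entropy_method`) and all hyperparameters are hypothetical. The Cross Entropy Method samples weight vectors from a Gaussian, keeps the best ones, and refits the Gaussian to them, so no gradient through the top-k selection is needed.

```python
import random
import statistics


def fuse(channels, weights, k):
    """Merge channel results with a weighted score sum and keep the top k items."""
    scores = {}
    for channel, weight in zip(channels, weights):
        for item, score in channel.items():
            scores[item] = scores.get(item, 0.0) + weight * score
    # Ties are broken by item name so the result is deterministic.
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:k]


def recall_at_k(channels, weights, relevant, k):
    """Fraction of relevant items that survive truncation to k items."""
    return len(set(fuse(channels, weights, k)) & relevant) / len(relevant)


def cross_entropy_method(channels, relevant, k,
                         iterations=20, samples=50, elite=10, seed=0):
    """Black-box search over non-negative channel weights (hypothetical settings)."""
    rng = random.Random(seed)
    n = len(channels)
    mean, std = [1.0] * n, [1.0] * n
    best_w = list(mean)
    best_score = recall_at_k(channels, best_w, relevant, k)
    for _ in range(iterations):
        population = [
            [max(0.0, rng.gauss(m, s)) for m, s in zip(mean, std)]
            for _ in range(samples)
        ]
        population.sort(
            key=lambda w: recall_at_k(channels, w, relevant, k), reverse=True
        )
        elites = population[:elite]
        # Refit the sampling distribution to the elite weight vectors.
        mean = [statistics.fmean([w[d] for w in elites]) for d in range(n)]
        std = [statistics.pstdev([w[d] for w in elites]) + 1e-3 for d in range(n)]
        top_score = recall_at_k(channels, elites[0], relevant, k)
        if top_score > best_score:
            best_w, best_score = elites[0], top_score
    return best_w, best_score


# Toy data (made up): two channels, and the user actually wants items d and e.
channels = [{"a": 1.0, "b": 0.9, "c": 0.1}, {"d": 1.0, "e": 0.8, "a": 0.2}]
relevant = {"d", "e"}
weights, score = cross_entropy_method(channels, relevant, k=2)
print(weights, score)
```

The sketch learns one global weight vector; the personalized merging described in the abstract would instead condition the weights on the user, which is outside what this example covers.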
[IR-3] Surprising Patterns in Musical Influence Networks
Link: https://arxiv.org/abs/2410.15996
Authors: Flavio Figueiredo, Tales Panoutsos, Nazareno Andrade
Keywords: contemporary Western music, provided valuable insights, Western music, contemporary Western, Analyzing musical influence
Categories: Information Retrieval (cs.IR)
Comments: To appear in the Latin American Musical Information Retrieval Workshop
Abstract:Analyzing musical influence networks, such as those formed by artist influence or sampling, has provided valuable insights into contemporary Western music. Here, computational methods like centrality rankings help identify influential artists. However, little attention has been given to how influence changes over time. In this paper, we apply Bayesian Surprise to track the evolution of musical influence networks. Using two networks – one of artist influence and another of covers, remixes, and samples – our results reveal significant periods of change in network structure. Additionally, we demonstrate that Bayesian Surprise is a flexible framework for testing various hypotheses on network evolution with real-world data.
[IR-4] Developing Retrieval Augmented Generation (RAG) based LLM Systems from PDFs: An Experience Report
Link: https://arxiv.org/abs/2410.15944
Authors: Ayman Asad Khan, Md Toufique Hasan, Kai Kristian Kemell, Jussi Rasku, Pekka Abrahamsson
Keywords: Retrieval Augmented Generation, primary data source, PDF documents, Large Language Models, Augmented Generation
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 36 pages, 8 figures, 2 tables, and python code snippets
Abstract: This paper presents an experience report on the development of Retrieval Augmented Generation (RAG) systems using PDF documents as the primary data source. The RAG architecture combines generative capabilities of Large Language Models (LLMs) with the precision of information retrieval. This approach has the potential to redefine how we interact with and augment both structured and unstructured knowledge in generative models to enhance transparency, accuracy, and contextuality of responses. The paper details the end-to-end pipeline, from data collection and preprocessing to retrieval indexing and response generation, highlighting technical challenges and practical solutions. We aim to offer insights to researchers and practitioners developing similar systems using two distinct approaches: OpenAI’s Assistant API with GPT Series and Llama’s open-source models. The practical implications of this research lie in enhancing the reliability of generative AI systems in various sectors where domain-specific knowledge and real-time information retrieval are important. The Python code used in this work is also available at: this https URL.
[IR-5] Centrality-aware Product Retrieval and Ranking EMNLP2024
Link: https://arxiv.org/abs/2410.15930
Authors: Hadeel Saadany, Swapnil Bhosale, Samarth Agrawal, Diptesh Kanojia, Constantin Orasan, Zhe Wu
Keywords: user intent, improving user experience, paper addresses, addresses the challenge, challenge of improving
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: EMNLP 2024: Industry track
Abstract: This paper addresses the challenge of improving user experience on e-commerce platforms by enhancing product ranking relevant to users’ search queries. Ambiguity and complexity of user queries often lead to a mismatch between the user’s intent and retrieved product titles or documents. Recent approaches have proposed the use of Transformer-based models, which need millions of annotated query-title pairs during the pre-training stage, and this data often does not take user intent into account. To tackle this, we curate samples from existing datasets at eBay, manually annotated with buyer-centric relevance scores and centrality scores, which reflect how well the product title matches the users’ intent. We introduce a User-intent Centrality Optimization (UCO) approach for existing models, which optimises for the user intent in semantic product search. To that end, we propose a dual-loss based optimisation to handle hard negatives, i.e., product titles that are semantically relevant but do not reflect the user’s intent. Our contributions include curating challenging evaluation sets and implementing UCO, resulting in significant product ranking efficiency improvements observed for different evaluation metrics. Our work aims to ensure that the most buyer-centric titles for a query are ranked higher, thereby enhancing the user experience on e-commerce platforms.
[IR-6] Using GPT Models for Qualitative and Quantitative News Analytics in the 2024 US Presidential Election Process
Link: https://arxiv.org/abs/2410.15884
Authors: Bohdan M. Pavlyshenko
Keywords: Google Search API, Google Search, Search API, retrieval-augmented generation, RAG
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract: The paper considers an approach that uses the Google Search API and the GPT-4o model for qualitative and quantitative analyses of news through retrieval-augmented generation (RAG). This approach was applied to analyze news about the 2024 US presidential election process. Different news sources for different time periods have been analyzed. Quantitative scores generated by the GPT model have been analyzed using Bayesian regression to derive trend lines. The distributions found for the regression parameters allow for the analysis of uncertainty in the election process. The obtained results demonstrate that, by using GPT models for news analysis, one can get informative analytics and key insights that can be applied in further analyses of election processes.
[IR-7] Improve Dense Passage Retrieval with Entailment Tuning EMNLP2024
Link: https://arxiv.org/abs/2410.15801
Authors: Lu Dai, Hao Liu, Hui Xiong
Keywords: open-domain question answering, downstream NLP tasks, downstream NLP, retrieval-augmented generation, NLP tasks
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: EMNLP 2024 Main
Abstract: A retrieval module can be plugged into many downstream NLP tasks to improve their performance, such as open-domain question answering and retrieval-augmented generation. The key to a retrieval system is to calculate relevance scores for query and passage pairs. However, the definition of relevance is often ambiguous. We observed that a major class of relevance aligns with the concept of entailment in NLI tasks. Based on this observation, we designed a method called entailment tuning to improve the embedding of dense retrievers. Specifically, we unify the form of retrieval data and NLI data using existence claims as a bridge. Then, we train retrievers to predict the claims entailed in a passage with a variant task of masked prediction. Our method can be efficiently plugged into current dense retrieval methods, and experiments show the effectiveness of our method.
[IR-8] Who's Who: Large Language Models Meet Knowledge Conflicts in Practice EMNLP2024
Link: https://arxiv.org/abs/2410.15737
Authors: Quang Hieu Pham, Hoang Ngo, Anh Tuan Luu, Dat Quoc Nguyen
Keywords: static memory limits, Retrieval-augmented generation, methods are viable, viable solutions, solutions for addressing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted to EMNLP 2024 Findings
Abstract: Retrieval-augmented generation (RAG) methods are viable solutions for addressing the static memory limits of pre-trained language models. Nevertheless, encountering conflicting sources of information within the retrieval context is an inevitable practical challenge. In such situations, the language models are recommended to transparently inform users about the conflicts rather than autonomously deciding what to present based on their inherent biases. To analyze how current large language models (LLMs) align with our recommendation, we introduce WhoQA, a public benchmark dataset to examine model behavior in knowledge conflict situations. We induce conflicts by asking about a common property among entities having the same name, resulting in questions with up to 8 distinctive answers. The WhoQA evaluation set includes 5K questions across 13 Wikidata property types and 150K Wikipedia entities. Our experiments show that despite the simplicity of WhoQA questions, knowledge conflicts significantly degrade LLMs’ performance in RAG settings.
[IR-9] Automatic Search of Multiword Place Names on Historical Maps
Link: https://arxiv.org/abs/2410.15586
Authors: Rhett Olson, Jina Kim, Yao-Yi Chiang
Keywords: Historical maps, scanned historical maps, maps, Historical, multiword place
Categories: Information Retrieval (cs.IR); Digital Libraries (cs.DL)
Comments: 4 pages, 4 figures, and 2 tables. To be published in proceedings ACM SIGSPATIAL 2024 GeoSearch Workshop
Abstract:Historical maps are invaluable sources of information about the past, and scanned historical maps are increasingly accessible in online libraries. To retrieve maps from these large libraries that contain specific places of interest, previous work has applied computer vision techniques to recognize words on historical maps, enabling searches for maps that contain specific place names. However, searching for multiword place names is challenging due to complex layouts of text labels on historical maps. This paper proposes an efficient query method for searching a given multiword place name on historical maps. Using existing methods to recognize words on historical maps, we link single-word text labels into potential multiword phrases by constructing minimum spanning trees. These trees aim to link pairs of text labels that are spatially close and have similar height, angle, and capitalization. We then query these trees for the given multiword place name. We evaluate the proposed method in two experiments: 1) to evaluate the accuracy of the minimum spanning tree approach at linking multiword place names and 2) to evaluate the number and time range of maps retrieved by the query approach. The resulting maps reveal how places using multiword names have changed on a large number of maps from across history.
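The linking-and-query idea in the abstract can be sketched as follows. This is a minimal illustration, not the paper's method in detail: the `Label` fields, the edge-cost formula and its weights, and the function names are all assumptions made for the example. Single-word labels are connected by a minimum spanning tree whose edge cost grows with spatial distance and with differences in height, angle, and capitalization; a multiword query then matches if its words appear in order along a path in the tree.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Label:
    text: str
    x: float       # centre of the text label on the map image
    y: float
    height: float  # text height in pixels
    angle: float   # text rotation in degrees


def edge_cost(a, b):
    """Hypothetical dissimilarity: distance plus style differences."""
    distance = math.hypot(a.x - b.x, a.y - b.y)
    height_diff = abs(a.height - b.height)
    angle_diff = abs(a.angle - b.angle)
    cap_diff = 0.0 if a.text.isupper() == b.text.isupper() else 1.0
    return distance + 2.0 * height_diff + 0.5 * angle_diff + 20.0 * cap_diff


def minimum_spanning_tree(labels):
    """Kruskal's algorithm with union-find over the complete label graph."""
    parent = list(range(len(labels)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted(
        (edge_cost(labels[i], labels[j]), i, j)
        for i, j in combinations(range(len(labels)), 2)
    )
    tree = []
    for _, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            tree.append((i, j))
    return tree


def contains_phrase(labels, tree, phrase):
    """True if the phrase's words occur in order along a path of tree edges."""
    words = [w.lower() for w in phrase.split()]
    neighbours = {i: set() for i in range(len(labels))}
    for i, j in tree:
        neighbours[i].add(j)
        neighbours[j].add(i)

    def walk(node, k, visited):
        if labels[node].text.lower() != words[k]:
            return False
        if k == len(words) - 1:
            return True
        return any(
            walk(nxt, k + 1, visited | {node})
            for nxt in neighbours[node] if nxt not in visited
        )

    return any(walk(i, 0, frozenset()) for i in range(len(labels)))


# Toy labels (made up): two nearby capitalised words and one distant word.
labels = [
    Label("LOS", 0, 0, 10, 0),
    Label("ANGELES", 30, 0, 10, 0),
    Label("river", 200, 200, 5, 45),
]
tree = minimum_spanning_tree(labels)
print(contains_phrase(labels, tree, "Los Angeles"))
```

A spanning tree keeps every label reachable, so a practical system would likely also bound edge cost or path length; that refinement is left out here.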
[IR-10] A Survey of Conversational Search
Link: https://arxiv.org/abs/2410.15576
Authors: Fengran Mo, Kelong Mao, Ziliang Zhao, Hongjin Qian, Haonan Chen, Yiruo Cheng, Xiaoxi Li, Yutao Zhu, Zhicheng Dou, Jian-Yun Nie
Keywords: Conversational search, search engines, modern information access, search, conversational search systems
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 35 pages, 8 figures, continue to update
Abstract:As a cornerstone of modern information access, search engines have become indispensable in everyday life. With the rapid advancements in AI and natural language processing (NLP) technologies, particularly large language models (LLMs), search engines have evolved to support more intuitive and intelligent interactions between users and systems. Conversational search, an emerging paradigm for next-generation search engines, leverages natural language dialogue to facilitate complex and precise information retrieval, thus attracting significant attention. Unlike traditional keyword-based search engines, conversational search systems enhance user experience by supporting intricate queries, maintaining context over multi-turn interactions, and providing robust information integration and processing capabilities. Key components such as query reformulation, search clarification, conversational retrieval, and response generation work in unison to enable these sophisticated interactions. In this survey, we explore the recent advancements and potential future directions in conversational search, examining the critical modules that constitute a conversational search system. We highlight the integration of LLMs in enhancing these systems and discuss the challenges and opportunities that lie ahead in this dynamic field. Additionally, we provide insights into real-world applications and robust evaluations of current conversational search systems, aiming to guide future research and development in conversational search.
[IR-11] ConTReGen: Context-driven Tree-structured Retrieval for Open-domain Long-form Text Generation EMNLP’24
Link: https://arxiv.org/abs/2410.15511
Authors: Kashob Kumar Roy, Pritom Saha Akash, Kevin Chen-Chuan Chang, Lucian Popa
Keywords: Open-domain long-form text, Open-domain long-form, long-form text generation, text generation requires, generation requires generating
Categories: Information Retrieval (cs.IR)
Comments: Accepted at EMNLP’24 Findings
Abstract:Open-domain long-form text generation requires generating coherent, comprehensive responses that address complex queries with both breadth and depth. This task is challenging due to the need to accurately capture diverse facets of input queries. Existing iterative retrieval-augmented generation (RAG) approaches often struggle to delve deeply into each facet of complex queries and integrate knowledge from various sources effectively. This paper introduces ConTReGen, a novel framework that employs a context-driven, tree-structured retrieval approach to enhance the depth and relevance of retrieved content. ConTReGen integrates a hierarchical, top-down in-depth exploration of query facets with a systematic bottom-up synthesis, ensuring comprehensive coverage and coherent integration of multifaceted information. Extensive experiments on multiple datasets, including LFQA and ODSUM, alongside a newly introduced dataset, ODSUM-WikiHow, demonstrate that ConTReGen outperforms existing state-of-the-art RAG models.
[IR-12] Deep Class-guided Hashing for Multi-label Cross-modal Retrieval
Link: https://arxiv.org/abs/2410.15387
Authors: Hao Chen, Lei Zhu, Xinghui Zhu
Keywords: inter-class structural relationships, efficient retrieval advantages, Deep hashing, structural relationships, inter-class structural
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Deep hashing, due to its low cost and efficient retrieval advantages, is widely valued in cross-modal retrieval. However, existing cross-modal hashing methods either explore the relationships between data points, which inevitably leads to intra-class dispersion, or explore the relationships between data points and categories while ignoring the preservation of inter-class structural relationships, resulting in suboptimal hash codes. To maintain both intra-class aggregation and inter-class structural relationships, this paper proposes the Deep Class-guided Hashing (DCGH) method. Specifically, we use a proxy loss as the mainstay to maintain intra-class aggregation of data, combine it with a pairwise loss to maintain inter-class structural relationships, and on this basis further propose a variance constraint to address the semantic bias caused by the combination. A large number of comparative experiments on three benchmark datasets show that the DCGH method performs comparably to or better than existing cross-modal retrieval methods. The code for the implementation of our DCGH framework is available at this https URL.
[IR-13] Performance-Driven QUBO for Recommender Systems on Quantum Annealers
Link: https://arxiv.org/abs/2410.15272
Authors: Jiayang Niu, Jie Li, Ke Deng, Mark Sanderson, Yongli Ren
Keywords: Unconstrained Binary Optimization, Quadratic Unconstrained Binary, Analysis Quadratic Unconstrained, solve QUBO problems, Counterfactual Analysis Quadratic
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:We propose Counterfactual Analysis Quadratic Unconstrained Binary Optimization (CAQUBO) to solve QUBO problems for feature selection in recommender systems. CAQUBO leverages counterfactual analysis to measure the impact of individual features and feature combinations on model performance and employs the measurements to construct the coefficient matrix for a quantum annealer to select the optimal feature combinations for recommender systems, thereby improving their final recommendation performance. By establishing explicit connections between features and the recommendation performance, the proposed approach demonstrates superior performance compared to the state-of-the-art quantum annealing methods. Extensive experiments indicate that integrating quantum computing with counterfactual analysis holds great promise for addressing these challenges.
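For readers unfamiliar with QUBO, a problem instance is a coefficient matrix Q, and the task is to find the binary vector x that minimizes x^T Q x. The sketch below illustrates that formulation for feature selection under stated assumptions: the matrix values are invented (they are not CAQUBO's counterfactual measurements), the function names are hypothetical, and an exhaustive search stands in for the quantum annealer.

```python
from itertools import product
from typing import List, Sequence, Tuple

def qubo_energy(Q: Sequence[Sequence[float]], x: Sequence[int]) -> float:
    """Energy x^T Q x of a binary selection vector x."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def solve_qubo_brute_force(Q: Sequence[Sequence[float]]) -> Tuple[int, ...]:
    """Try all 2^n assignments; feasible only for tiny n (annealer stand-in)."""
    n = len(Q)
    return min(product((0, 1), repeat=n), key=lambda x: qubo_energy(Q, x))

# Invented coefficients: a negative diagonal rewards selecting a feature,
# a positive off-diagonal term penalizes selecting a redundant pair.
Q: List[List[float]] = [
    [-3.0, 4.0, 0.0],   # feature 0 is useful but redundant with feature 1
    [0.0, -2.0, 0.0],   # feature 1 is moderately useful
    [0.0, 0.0, -1.0],   # feature 2 is mildly useful and independent
]

best = solve_qubo_brute_force(Q)
print(best, qubo_energy(Q, best))
```

With these made-up coefficients the minimizer keeps features 0 and 2 and drops feature 1, because the redundancy penalty outweighs feature 1's individual benefit.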
[IR-14] HyQE: Ranking Contexts with Hypothetical Query Embeddings
Link: https://arxiv.org/abs/2410.15262
Authors: Weichao Zhou, Jiaxin Zhang, Hilaf Hasson, Anu Singh, Wenchao Li
Keywords: retrieval-augmented systems, commonly employed, employed to reorder, contexts, user query
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:In retrieval-augmented systems, context ranking techniques are commonly employed to reorder the retrieved contexts based on their relevance to a user query. A standard approach is to measure this relevance through the similarity between contexts and queries in the embedding space. However, such similarity often fails to capture the relevance. Alternatively, large language models (LLMs) have been used for ranking contexts. However, they can encounter scalability issues when the number of candidate contexts grows and the context window sizes of the LLMs remain constrained. Additionally, these approaches require fine-tuning LLMs with domain-specific data. In this work, we introduce a scalable ranking framework that combines embedding similarity and LLM capabilities without requiring LLM fine-tuning. Our framework uses a pre-trained LLM to hypothesize the user query based on the retrieved contexts and ranks the context based on the similarity between the hypothesized queries and the user query. Our framework is efficient at inference time and is compatible with many other retrieval and ranking techniques. Experimental results show that our method improves the ranking performance across multiple benchmarks. The complete code and data are available at this https URL
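The ranking idea in this abstract (hypothesize queries from each context, then compare them with the user query) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the hypothesized queries are hard-coded strings standing in for LLM output, a bag-of-words vector stands in for a real embedding model, and scoring a context by its best-matching hypothesized query is an assumed aggregation rule.

```python
import math
from collections import Counter
from typing import Dict, List

def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank_contexts(user_query: str,
                  hypothetical_queries: Dict[str, List[str]]) -> List[str]:
    """Order context ids by similarity of their hypothesized queries to the user query."""
    q = embed(user_query)
    scores = {
        ctx: max((cosine(q, embed(h)) for h in hyps), default=0.0)
        for ctx, hyps in hypothetical_queries.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical LLM output: queries that each retrieved context could answer.
hypothetical_queries = {
    "ctx_wine": ["which wines are produced in france"],
    "ctx_paris": ["what is the capital of france"],
}
ranking = rank_contexts("capital of france", hypothetical_queries)
print(ranking)
```

Because the hypothesized queries depend only on the contexts, they could be generated and embedded ahead of time, leaving one embedding of the user query to compute per request.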
[IR-15] Crafting Tomorrow: The Influence of Design Choices on Fresh Content in Social Media Recommendation
Link: https://arxiv.org/abs/2410.15174
Authors: Srijan Saket, Mohit Agarwal, Rishabh Mehrotra
Keywords:
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments:
[IR-16] Mining Asymmetric Intertextuality
Link: https://arxiv.org/abs/2410.15145
Authors: Pak Kin Lau, Stuart Michael McManus
Keywords:
Categories: Information Retrieval (cs.IR)
Comments:
[IR-17] Incorporating Group Prior into Variational Inference for Tail-User Behavior Modeling in CTR Prediction
Link: https://arxiv.org/abs/2410.15098
Authors: Han Xu, Taoxing Pan, Zhiqiang Liu, Xiaoxiao Xu, Lantao Hu
Keywords:
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
[IR-18] A Recommendation Model Utilizing Separation Embedding and Self-Attention for Feature Mining
Link: https://arxiv.org/abs/2410.15026
Authors: Wenyi Liu, Rui Wang, Yuanshuai Luo, Jianjun Wei, Zihao Zhao, Junming Huang
Keywords:
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
[IR-19] Transit Pulse: Utilizing Social Media as a Source for Customer Feedback and Information Extraction with Large Language Model
Link: https://arxiv.org/abs/2410.15016
Authors: Jiahao Wang, Amer Shalaby
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: 17 pages, 21 figures
[IR-20] Visual Navigation of Digital Libraries: Retrieval and Classification of Images in the National Library of Norway's Digitised Book Collection
Link: https://arxiv.org/abs/2410.14969
Authors: Marie Roald, Magnus Breder Birkenes, Lars Gunnarsønn Bagøien Johnsen
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 13 pages, 2 figures, 4 tables, Accepted to the 2024 Computational Humanities Research Conference (CHR)
[IR-21] The S2 Hierarchical Discrete Global Grid as a Nexus for Data Representation, Integration, and Querying Across Geospatial Knowledge Graphs
Link: https://arxiv.org/abs/2410.14808
Authors: Shirly Stephen, Mitchell Faulk, Krzysztof Janowicz, Colby Fisher, Thomas Thelen, Rui Zhu, Pascal Hitzler, Cogan Shimizu, Kitty Currier, Mark Schildhauer, Dean Rehberger, Zhangyu Wang, Antrea Christou
Keywords:
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:
[IR-22] Attribute-Based Semantic Type Detection and Data Quality Assessment
Link: https://arxiv.org/abs/2410.14692
Authors: Marcelo Valentim Silva, Hannes Herrmann, Valerie Maxville
Keywords:
Categories: Databases (cs.DB); Information Retrieval (cs.IR)
Comments: 10 pages, 9 tables, sent for approval at BDCAT 2024