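This post lists the latest papers retrieved from arXiv.org on 2024-09-20. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.
Note: if you would like the daily paper list by email, leave your email address in the comments; the email is sent automatically at around 10:30 each day. [Email delivery is currently failing, so no new subscribers are being added for now.]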
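Overview (2024-09-20)
385 papers were updated today, including:
- Natural language processing: 55 papers (Computation and Language (cs.CL))
- Artificial intelligence: 98 papers (Artificial Intelligence (cs.AI))
- Computer vision: 99 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine learning: 107 papers (Machine Learning (cs.LG))
Natural Language Processing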
[NLP-0] CLAIR-A: Leveraging Large Language Models to Judge Audio Captions
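This paper addresses the problem that existing evaluation methods for Automated Audio Captioning (AAC) fail to fully reflect human judgment. The key to the solution is CLAIR-A, a method that uses the zero-shot capabilities of large language models (LLMs) to evaluate candidate audio captions by directly asking the LLM for a semantic distance score. CLAIR-A not only predicts human quality judgments better than traditional metrics, but also offers greater transparency and interpretability by explaining the reasoning behind its scores. [See the abstract for details.]
Link: https://arxiv.org/abs/2409.12962
Authors: Tsung-Han Wu, Joseph E. Gonzalez, Trevor Darrell, David M. Chan
Keywords (EN): Automated Audio Captioning, Audio Captioning, generate natural language, natural language descriptions, Automated Audio
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Code is publicly available at this https URL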
Abstract:The Automated Audio Captioning (AAC) task asks models to generate natural language descriptions of an audio input. Evaluating these machine-generated audio captions is a complex task that requires considering diverse factors, among them, auditory scene understanding, sound-object inference, temporal coherence, and the environmental context of the scene. While current methods focus on specific aspects, they often fail to provide an overall score that aligns well with human judgment. In this work, we propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models (LLMs) to evaluate candidate audio captions by directly asking LLMs for a semantic distance score. In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics, with a 5.8% relative accuracy improvement compared to the domain-specific FENSE metric and up to 11% over the best general-purpose measure on the Clotho-Eval dataset. Moreover, CLAIR-A offers more transparency by allowing the language model to explain the reasoning behind its scores, with these explanations rated up to 30% better by human evaluators than those provided by baseline methods. CLAIR-A is made publicly available at this https URL.
[NLP-1] MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines
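This paper addresses the limitations of current AI search engines in handling multimodal user queries, in particular their weak ability to search over interleaved text and image information. The key to the solution is a carefully designed pipeline, MMSearch-Engine, that gives large multimodal models (LMMs) multimodal search capabilities, together with the MMSearch benchmark for evaluating them. LMMs are assessed on three individual tasks (requery, rerank, and summarization) and on one end-to-end task with a complete search process. Experiments show that GPT-4o with MMSearch-Engine outperforms the commercial product Perplexity Pro on the end-to-end task, which supports the effectiveness of the pipeline. [See the abstract for details.]
Link: https://arxiv.org/abs/2409.12959
Authors: Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanmin Wu, Jiayi Lei, Pengshuo Qiu, Pan Lu, Zehui Chen, Guanglu Song, Peng Gao, Yu Liu, Chunyuan Li, Hongsheng Li
Keywords (EN): Large Language Models, Large Multimodal Models, Large Language, Language Models, multimodal search
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Project Page: this https URL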
Abstract:The advent of Large Language Models (LLMs) has paved the way for AI search engines, e.g., SearchGPT, showcasing a new paradigm in human-internet interaction. However, most current AI search engines are limited to text-only settings, neglecting the multimodal user queries and the text-image interleaved nature of website information. Recently, Large Multimodal Models (LMMs) have made impressive strides. Yet, whether they can function as AI search engines remains under-explored, leaving the potential of LMMs in multimodal search an open question. To this end, we first design a delicate pipeline, MMSearch-Engine, to empower any LMMs with multimodal search capabilities. On top of this, we introduce MMSearch, a comprehensive evaluation benchmark to assess the multimodal search performance of LMMs. The curated dataset contains 300 manually collected instances spanning 14 subfields, which involves no overlap with the current LMMs’ training data, ensuring the correct answer can only be obtained within searching. By using MMSearch-Engine, the LMMs are evaluated by performing three individual tasks (requery, rerank, and summarization), and one challenging end-to-end task with a complete searching process. We conduct extensive experiments on closed-source and open-source LMMs. Among all tested models, GPT-4o with MMSearch-Engine achieves the best results, which surpasses the commercial product, Perplexity Pro, in the end-to-end task, demonstrating the effectiveness of our proposed pipeline. We further present error analysis to unveil current LMMs still struggle to fully grasp the multimodal search tasks, and conduct ablation study to indicate the potential of scaling test-time computation for AI search engine. We hope MMSearch may provide unique insights to guide the future development of multimodal AI search engine. Project Page: this https URL
[NLP-2] MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions
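This paper addresses the dependence on data annotation when creating instruction tuning datasets for low-resource languages. The key to the solution is a new method, Multilingual Reverse Instructions (MURI), which uses reverse instructions and a translation pipeline to generate high-quality instruction-output pairs from existing human-written texts in low-resource languages, without human annotators or pre-existing multilingual models. The method sources texts from different native domains and applies filters to ensure cultural relevance and diversity. The resulting MURI-IT dataset contains more than 2 million instruction-output pairs, and its effectiveness is validated through evaluation by native speakers and fine-tuning experiments with mT5 models. [See the abstract for details.]
Link: https://arxiv.org/abs/2409.12958
Authors: Abdullatif Köksal, Marion Thaler, Ayyoob Imani, Ahmet Üstün, Anna Korhonen, Hinrich Schütze
Keywords (EN): tuning enhances large, Instruction tuning enhances, enhances large language, Instruction tuning, instruction tuning datasets
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)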
Abstract:Instruction tuning enhances large language models (LLMs) by aligning them with human preferences across diverse tasks. Traditional approaches to create instruction tuning datasets face serious challenges for low-resource languages due to their dependence on data annotation. This work introduces a novel method, Multilingual Reverse Instructions (MURI), which generates high-quality instruction tuning datasets for low-resource languages without requiring human annotators or pre-existing multilingual models. Utilizing reverse instructions and a translation pipeline, MURI produces instruction-output pairs from existing human-written texts in low-resource languages. This method ensures cultural relevance and diversity by sourcing texts from different native domains and applying filters to eliminate inappropriate content. Our dataset, MURI-IT, includes more than 2 million instruction-output pairs across 200 languages. Evaluation by native speakers and fine-tuning experiments with mT5 models demonstrate the approach’s effectiveness for both NLU and open-ended generation. We publicly release datasets and models at this https URL.
[NLP-3] Re-Introducing LayerNorm: Geometric Meaning, Irreversibility and a Comparative Study with RMSNorm
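This paper examines the geometric effect of LayerNorm in the transformer architecture, in particular how LayerNorm influences the norm and orientation of hidden vectors. The key insight is the intrinsic link between LayerNorm and the uniform vector, which lets the standardization step be explained in three steps: (i) remove the component of the vector along the uniform vector, (ii) normalize the remaining vector, and (iii) scale the resulting vector by $\sqrt{d}$. The paper also introduces the "irreversibility" property of LayerNorm. By comparing the hidden representations of LayerNorm-based and RMSNorm-based models, it finds that removing the component along the uniform vector is redundant in practice, which supports using RMSNorm in place of LayerNorm because it is more computationally efficient with comparable downstream performance. [See the abstract for details.]
Link: https://arxiv.org/abs/2409.12951
Authors: Akshat Gupta, Atahan Ozdemir, Gopala Anumanchipalli
Keywords (EN): uniform vector, Layer normalization, vector, transformer architecture, LayerNorm
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)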
Abstract: Layer normalization is a pivotal step in the transformer architecture. This paper delves into the less explored geometric implications of this process, examining how LayerNorm influences the norm and orientation of hidden vectors in the representation space. We show that the definition of LayerNorm is innately linked to the uniform vector, defined as $\boldsymbol{1} = [1, 1, 1, 1, \cdots, 1]^T \in \mathbb{R}^d$. We then show that the standardization step in LayerNorm can be understood in three simple steps: (i) remove the component of a vector along the uniform vector, (ii) normalize the remaining vector, and (iii) scale the resultant vector by $\sqrt{d}$, where $d$ is the dimensionality of the representation space. We also introduce the property of "irreversibility" for LayerNorm, where we show that the information lost during the normalization process cannot be recovered. In other words, unlike batch normalization, LayerNorm cannot learn an identity transform. While we present possible arguments for removing the component along the uniform vector, the choice of removing this component seems arbitrary and not well motivated by the original authors. To evaluate the usefulness of this step, we compare the hidden representations of LayerNorm-based LLMs with models trained using RMSNorm and show that all LLMs naturally align representations orthogonal to the uniform vector, presenting the first mechanistic evidence that removing the component along the uniform vector in LayerNorm is a redundant step. Our findings support the use of RMSNorm over LayerNorm as it is not only more computationally efficient with comparable downstream performance, but also learns a similar distribution of hidden representations that operate orthogonal to the uniform vector.
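The three-step reading of LayerNorm described in the abstract can be checked numerically. The sketch below is an illustration written for this digest, not code from the paper: the function names are made up here, the learnable gain and bias are omitted, and the epsilon term is dropped so the equality is exact. It compares the textbook standardization with the geometric version, and includes an RMSNorm variant that skips step (i).

```python
import math

def layernorm_standardize(x):
    """Textbook standardization: subtract the mean, divide by the population std."""
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d
    return [(v - mean) / math.sqrt(var) for v in x]

def geometric_standardize(x):
    """Same result, computed via the three geometric steps."""
    d = len(x)
    # (i) Remove the component along the uniform vector 1 = [1, ..., 1]^T.
    #     The projection of x onto 1 is mean(x) * 1.
    mean = sum(x) / d
    rest = [v - mean for v in x]
    # (ii) Normalize the remaining vector to unit length.
    norm = math.sqrt(sum(v * v for v in rest))
    unit = [v / norm for v in rest]
    # (iii) Scale by sqrt(d).
    return [math.sqrt(d) * v for v in unit]

def rmsnorm(x):
    """RMSNorm without gain: steps (ii) and (iii) only, no mean removal."""
    d = len(x)
    rms = math.sqrt(sum(v * v for v in x) / d)
    return [v / rms for v in x]

x = [1.0, 2.0, 4.0, 7.0]
a = layernorm_standardize(x)
b = geometric_standardize(x)
# A shifted and positively rescaled copy of x maps to the same output,
# which is one way to see that the input cannot be recovered.
c = layernorm_standardize([3.0 * v + 5.0 for v in x])
```

Here `a`, `b`, and `c` agree up to floating-point error, and the output of `rmsnorm(x)` has norm $\sqrt{d}$ like the LayerNorm output, but keeps its component along the uniform vector.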
[NLP-4] Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation
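This paper addresses the comprehensive evaluation of large language models (LLMs) in retrieval-augmented generation (RAG) systems. The key to the solution is FRAMES (Factuality, Retrieval, And reasoning MEasurement Set), an evaluation dataset designed to test an LLM's ability to provide factual responses, to retrieve, and to reason toward a final answer. By offering a unified evaluation framework, FRAMES fills the gap left by earlier work that evaluated these abilities in isolation, especially for multi-hop questions that require integrating information from multiple sources. The paper also shows that a multi-step retrieval pipeline significantly improves accuracy, from 0.40 to 0.66. [See the abstract for details.]
Link: https://arxiv.org/abs/2409.12941
Authors: Satyapriya Krishna, Kalpesh Krishna, Anhad Mohananey, Steven Schwarcz, Adam Stambler, Shyam Upadhyay, Manaal Faruqui
Keywords (EN): Large Language Models, Large Language, Language Models, demonstrated significant performance, demonstrated significant
Categories: Computation and Language (cs.CL)
Comments: arXiv preprint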
Abstract:Large Language Models (LLMs) have demonstrated significant performance improvements across various cognitive tasks. An emerging application is using LLMs to enhance retrieval-augmented generation (RAG) capabilities. These systems require LLMs to understand user queries, retrieve relevant information, and synthesize coherent and accurate responses. Given the increasing real-world deployment of such systems, comprehensive evaluation becomes crucial. To this end, we propose FRAMES (Factuality, Retrieval, And reasoning MEasurement Set), a high-quality evaluation dataset designed to test LLMs’ ability to provide factual responses, assess retrieval capabilities, and evaluate the reasoning required to generate final answers. While previous work has provided datasets and benchmarks to evaluate these abilities in isolation, FRAMES offers a unified framework that provides a clearer picture of LLM performance in end-to-end RAG scenarios. Our dataset comprises challenging multi-hop questions that require the integration of information from multiple sources. We present baseline results demonstrating that even state-of-the-art LLMs struggle with this task, achieving 0.40 accuracy with no retrieval. The accuracy is significantly improved with our proposed multi-step retrieval pipeline, achieving an accuracy of 0.66 (50% improvement). We hope our work will help bridge evaluation gaps and assist in developing more robust and capable RAG systems.
[NLP-5] LogicPro: Improving Complex Logical Reasoning via Program-Guided Learning
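This paper addresses the weakness of large language models (LLMs) in complex logical reasoning. The key to the solution is to use widely available algorithmic problems and their code solutions to build diverse and difficult test samples, and then to combine the intermediate variable outputs of the code solutions with complex reasoning questions to derive the reasoning process and the final answer. This yields a dataset that is sufficiently difficult, diverse, and scalable, and the intermediate variable values guide a high-quality reasoning process, leading to significant improvements across multiple models and datasets. [See the abstract for details.]
Link: https://arxiv.org/abs/2409.12929
Authors: Jin Jiang, Yuchen Yan, Yang Liu, Yonggang Jin, Shuai Peng, Mengdi Zhang, Xunliang Cai, Yixin Cao, Liangcai Gao, Zhi Tang
Keywords (EN): enhance Large Language, Large Language Models, Large Language, enhance Large, complex Logical reasoning
Categories: Computation and Language (cs.CL)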
Abstract: In this paper, we present a novel approach, called LogicPro, to enhance the complex logical reasoning of Large Language Models (LLMs) through program examples. We do this effectively by simply utilizing widely available algorithmic problems and their code solutions. First, we constructed diverse test sample inputs based on algorithmic questions and code solutions. Then, we designed different complex reasoning questions based on algorithmic problems and test samples. Finally, combining the intermediate variable outputs of the code solutions and the complex reasoning questions, we derived the reasoning process and the final answer. With this approach, we can construct a dataset that is sufficiently difficult (all models are ineffective), diverse (synthesized from 2,360 different algorithmic questions), and scalable (building different test samples and collecting more algorithmic questions). In addition, we obtain a high-quality reasoning process guided by the values of intermediate variables. As a result, our approach achieves significant improvements in multiple models for the BBH$^{27}$, GSM8K, HellaSwag, LogiQA, ReClor, and RTE datasets, outperforming a wide range of existing reasoning datasets.
[NLP-6] Defending against Reverse Preference Attacks is Difficult
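This paper addresses the vulnerability of safety-aligned large language models (LLMs) in adversarial reinforcement learning (RL) settings. It proposes Reverse Preference Attacks (RPA), a class of attacks that use adversarial reward during reinforcement learning from human feedback to make LLMs learn harmful behavior. To counter these attacks, the paper explores several defense mechanisms based on Constrained Markov Decision Processes. "Online" defenses, in which the defender controls the loss function and minimizes the negative log likelihood of refusals, can effectively protect LLMs against RPAs. By contrast, "offline" defenses are less effective against RPAs. [See the abstract for details.]
Link: https://arxiv.org/abs/2409.12914
Authors: Domenic Rosati, Giles Edkins, Harsh Raj, David Atanasov, Subhabrata Majumdar, Janarthanan Rajendran, Frank Rudzicz, Hassan Sajjad
Keywords (EN): aligning Large Language, Large Language Models, Large Language, ensuring safe behaviour, aligning Large
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)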
Abstract: While there has been progress towards aligning Large Language Models (LLMs) with human values and ensuring safe behaviour at inference time, safety-aligned LLMs are known to be vulnerable to training-time attacks such as supervised fine-tuning (SFT) on harmful datasets. In this paper, we ask if LLMs are vulnerable to adversarial reinforcement learning. Motivated by this goal, we propose Reverse Preference Attacks (RPA), a class of attacks to make LLMs learn harmful behavior using adversarial reward during reinforcement learning from human feedback (RLHF). RPAs expose a critical safety gap of safety-aligned LLMs in RL settings: they easily explore the harmful text generation policies to optimize adversarial reward. To protect against RPAs, we explore a host of mitigation strategies. Leveraging Constrained Markov-Decision Processes, we adapt a number of mechanisms to defend against harmful fine-tuning attacks into the RL setting. Our experiments show that "online" defenses that are based on the idea of minimizing the negative log likelihood of refusals -- with the defender having control of the loss function -- can effectively protect LLMs against RPAs. However, trying to defend model weights using "offline" defenses that operate under the assumption that the defender has no control over the loss function are less effective in the face of RPAs. These findings show that attacks done using RL can be used to successfully undo safety alignment in open-weight LLMs and use them for malicious purposes.
[NLP-7] Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization
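This paper addresses the high cost and low efficiency of starting large language model pre-training from randomly initialized parameters. The key to the solution is HyperCloning, a method that expands the parameters of a pre-trained small language model into the parameter space of a larger model while preserving the small model's functionality and predictive power. The large model therefore inherits the small model's accuracy before training starts, which significantly reduces the GPU hours and resources needed for pre-training. [See the abstract for details.]
Link: https://arxiv.org/abs/2409.12903
Authors: Mohammad Samragh, Iman Mirzadeh, Keivan Alizadeh Vahid, Fartash Faghri, Minsik Cho, Moin Nabi, Devang Naik, Mehrdad Farajtabar
Keywords (EN): language models, large language models, begins with randomly, models, model
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)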
Abstract:The pre-training phase of language models often begins with randomly initialized parameters. With the current trends in scaling models, training their large number of parameters can be extremely slow and costly. In contrast, small language models are less expensive to train, but they often cannot achieve the accuracy of large models. In this paper, we explore an intriguing idea to connect these two different regimes: Can we develop a method to initialize large language models using smaller pre-trained models? Will such initialization bring any benefits in terms of training time and final accuracy? In this paper, we introduce HyperCloning, a method that can expand the parameters of a pre-trained language model to those of a larger model with increased hidden dimensions. Our method ensures that the larger model retains the functionality of the smaller model. As a result, the larger model already inherits the predictive power and accuracy of the smaller model before the training starts. We demonstrate that training such an initialized model results in significant savings in terms of GPU hours required for pre-training large language models.
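To make "the larger model retains the functionality of the smaller model" concrete, here is a minimal sketch of one function-preserving way to double the width of a single linear layer. The abstract does not spell out the expansion rule, so the tiling scheme below is an assumption chosen for illustration, and `clone_linear` and `matvec` are hypothetical helper names, not the paper's API.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def clone_linear(W):
    """Expand an (out x in) weight matrix to (2*out x 2*in).

    Assumed scheme: tile W / 2 into a 2 x 2 block grid. If the wider layer
    receives the duplicated input [x; x], it outputs the duplicated [y; y],
    where y = W x is the small layer's output.
    """
    half = [[w / 2.0 for w in row] for row in W]
    top = [row + row for row in half]          # [W/2, W/2]
    return top + [list(row) for row in top]    # stacked twice

W = [[1.0, -2.0],
     [0.5, 3.0]]
x = [2.0, 1.0]

y_small = matvec(W, x)                    # output of the small layer
y_big = matvec(clone_linear(W), x + x)    # wide layer on duplicated input
```

Because `y_big` is just `y_small` repeated, stacking such layers leaves the network's predictions unchanged at initialization, which is the property the abstract relies on.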
[NLP-8] Knowledge-Based Domain-Oriented Data Augmentation for Enhancing Unsupervised Sentence Embedding
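This paper addresses the lack of attention to few-shot domain data when large language models (LLMs) are used for data augmentation in unsupervised sentence embedding models. The key to the solution is a pipeline-based data augmentation method that uses an LLM to synthesize a domain-specific dataset. It generates positive and negative samples through entity- and quantity-aware augmentation and uses an entity knowledge graph to synthesize samples with fine-grained semantic distinctions, which increases the diversity and relevance of the training samples. The paper also proposes a Gaussian-decayed gradient-assisted Contrastive Sentence Embedding (GCSE) model to reduce noise in the synthetic data and improve the model's discrimination, thereby limiting the impact of negative sample noise. [See the abstract for details.]
Link: https://arxiv.org/abs/2409.12887
Authors: Peichao Lai, Zhengfeng Zhang, Bin Cui
Keywords (EN): received significant attention, language processing tasks, downstream natural language, natural language processing, unsupervised sentence embedding
Categories: Computation and Language (cs.CL)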
Abstract:Recently, unsupervised sentence embedding models have received significant attention in downstream natural language processing tasks. Using large language models (LLMs) for data augmentation has led to considerable improvements in previous studies. Nevertheless, these strategies emphasize data augmentation with extensive generic corpora, neglecting the consideration of few-shot domain data. The synthesized data lacks fine-grained information and may introduce negative sample noise. This study introduces a novel pipeline-based data augmentation method that leverages LLM to synthesize the domain-specific dataset. It produces both positive and negative samples through entity- and quantity-aware augmentation, utilizing an entity knowledge graph to synthesize samples with fine-grained semantic distinctions, increasing training sample diversity and relevance. We then present a Gaussian-decayed gradient-assisted Contrastive Sentence Embedding (GCSE) model to reduce synthetic data noise and improve model discrimination to reduce negative sample noise. Experimental results demonstrate that our approach achieves state-of-the-art semantic textual similarity performance with fewer synthetic data samples and lesser LLM parameters, demonstrating its efficiency and robustness in varied backbones.
[NLP-9] Enhancing E-commerce Product Title Translation with Retrieval-Augmented Generation and Large Language Models CIKM
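This paper addresses the accuracy of multilingual product title translation in e-commerce, particularly when titles are short, lack context, and contain specialized terminology. The key to the solution is a retrieval-augmented generation (RAG) approach that draws on the bilingual product information already available in e-commerce: it retrieves similar bilingual examples and uses them as few-shot prompts to enhance LLM-based product title translation. Experiments show that the approach significantly improves translation quality for language pairs where the LLM has limited proficiency, with chrF gains of up to 15.3%. [See the abstract for details.]
Link: https://arxiv.org/abs/2409.12880
Authors: Bryan Zhang, Taichi Nakatani, Stephan Walter
Keywords (EN): stores enable multilingual, E-commerce stores enable, product title translation, title translation, accurate product title
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 pages. In Proceedings of ACM CIKM Workshop on Data-Centric AI (CIKM DCAI 2024)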
Abstract: E-commerce stores enable multilingual product discovery, which requires accurate product title translation. Multilingual large language models (LLMs) have shown promising capacity to perform machine translation tasks, and they can also enhance and translate product titles cross-lingually in one step. However, product title translation often requires more than just language conversion because titles are short, lack context, and contain specialized terminology. This study proposes a retrieval-augmented generation (RAG) approach that leverages existing bilingual product information in e-commerce by retrieving similar bilingual examples and incorporating them as few-shot prompts to enhance LLM-based product title translation. Experiment results show that our proposed RAG approach improves product title translation quality with chrF score gains of up to 15.3% for language pairs where the LLM has limited proficiency.
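The retrieve-then-prompt idea can be sketched in a few lines. Everything below is an assumption made for illustration: the abstract does not say which retriever or prompt template the authors use, so word-level Jaccard similarity stands in for the retriever, the catalog entries are made-up English-German pairs, and `retrieve` and `build_prompt` are hypothetical helper names.

```python
def tokens(title):
    """Lowercased word set of a title."""
    return set(title.lower().split())

def jaccard(a, b):
    """Word-overlap similarity between two titles (stand-in retriever score)."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def retrieve(query, pairs, k=2):
    """Return the k bilingual pairs whose source title is most similar to the query."""
    return sorted(pairs, key=lambda p: jaccard(query, p[0]), reverse=True)[:k]

def build_prompt(query, examples, src="English", tgt="German"):
    """Assemble a few-shot translation prompt from the retrieved pairs."""
    lines = [f"Translate the product title from {src} to {tgt}."]
    for source_title, target_title in examples:
        lines.append(f"{src}: {source_title}\n{tgt}: {target_title}")
    lines.append(f"{src}: {query}\n{tgt}:")
    return "\n\n".join(lines)

# Made-up bilingual catalog entries, for illustration only.
CATALOG = [
    ("Stainless steel water bottle 500 ml", "Edelstahl Trinkflasche 500 ml"),
    ("Wireless Bluetooth headphones", "Kabellose Bluetooth-Kopfhörer"),
    ("Stainless steel lunch box", "Edelstahl Brotdose"),
]

query = "Stainless steel travel mug 350 ml"
examples = retrieve(query, CATALOG)
prompt = build_prompt(query, examples)
```

The resulting `prompt` would then be sent to whichever LLM performs the translation; the retrieved pairs give the model in-domain terminology that a short title alone does not provide.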
[NLP-10] A New Perspective on ADHD Research: Knowledge Graph Construction with LLMs and Network Based Insights
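This paper addresses the complexity and heterogeneity of research on Attention-Deficit/Hyperactivity Disorder (ADHD). It builds a comprehensive knowledge graph (KG) and applies network analysis, in particular k-core techniques, to identify the nodes and relationships that are central to understanding ADHD. The key to the solution is combining large language models (LLMs) with retrieval-augmented generation (RAG) to develop a context-aware chatbot, providing deeper understanding of ADHD and a useful tool for research and clinical applications. [See the abstract for details.]
Link: https://arxiv.org/abs/2409.12853
Authors: Hakan T. Otal, Stephen V. Faraone, M. Abdullah Canbaz
Keywords (EN): diverse contributing factors, large language models, Hyperactivity Disorder, contributing factors, language models
Categories: Social and Information Networks (cs.SI); Computation and Language (cs.CL)
Comments: 14 pages, 2 figures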
Abstract:Attention-Deficit/Hyperactivity Disorder (ADHD) is a challenging disorder to study due to its complex symptomatology and diverse contributing factors. To explore how we can gain deeper insights on this topic, we performed a network analysis on a comprehensive knowledge graph (KG) of ADHD, constructed by integrating scientific literature and clinical data with the help of cutting-edge large language models. The analysis, including k-core techniques, identified critical nodes and relationships that are central to understanding the disorder. Building on these findings, we developed a context-aware chatbot using Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG), enabling accurate and informed interactions. Our knowledge graph not only advances the understanding of ADHD but also provides a powerful tool for research and clinical applications.
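For readers unfamiliar with the k-core technique mentioned in the abstract: the k-core of a graph is the largest subgraph in which every node has at least k neighbors, and it can be found by repeatedly peeling off low-degree nodes. The sketch below is a plain-Python illustration on a toy graph with placeholder node names; it is not the authors' pipeline and does not use their knowledge graph.

```python
def k_core(edges, k):
    """Return the node set of the k-core of an undirected graph.

    edges: iterable of (u, v) pairs. Nodes with fewer than k remaining
    neighbors are removed repeatedly until none are left to remove.
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    changed = True
    while changed:
        changed = False
        low = [n for n, nbrs in neighbors.items() if len(nbrs) < k]
        for n in low:
            # Detach n from its remaining neighbors, then drop it.
            for m in neighbors[n]:
                neighbors[m].discard(n)
            del neighbors[n]
            changed = True
    return set(neighbors)

# Toy graph: a triangle (a, b, c) with a two-node tail (d, e).
EDGES = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
core2 = k_core(EDGES, 2)
```

On this toy graph the tail is peeled away and only the triangle survives as the 2-core. In a knowledge graph, the nodes that remain in high-k cores are the densely interconnected concepts.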
[NLP-11] Lexicon-Based Sentiment Analysis on Text Polarities with Evaluation of Classification Models
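This paper addresses multi-class sentiment analysis, in which text is classified as positive, negative, or neutral. The key to the solution is using lexicon-based methods (such as Text Blob and Vader Sentiment) to identify the intensity of emotion and the subjectivity in text, and evaluating classification models built with machine learning (Naive Bayes, Support Vector Machines, Multinomial Logistic Regression, Random Forest, and Extreme Gradient Boosting). Random Forest performs best, with an accuracy of 81%. The paper also explores personality judgment based on a Twitter user's online activity. [See the abstract for details.]
Link: https://arxiv.org/abs/2409.12840
Authors: Muhammad Raees, Samina Fazilat
Keywords (EN): Sentiment analysis possesses, Sentiment analysis, digital platforms, Sentiment, possesses the potential
Categories: Computation and Language (cs.CL)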
Abstract: Sentiment analysis possesses the potential of diverse applicability on digital platforms. Sentiment analysis extracts the polarity to understand the intensity and subjectivity in the text. This work uses a lexicon-based method to perform sentiment analysis and shows an evaluation of classification models trained over textual data. The lexicon-based methods identify the intensity of emotion and subjectivity at word levels. The categorization identifies the informative words inside a text and specifies the quantitative ranking of the polarity of words. This work is based on a multi-class problem of text being labeled as positive, negative, or neutral. A Twitter sentiment dataset containing 1.6 million unprocessed tweets is used with lexicon-based methods like Text Blob and Vader Sentiment to introduce the neutrality measure on text. The analysis of lexicons shows how the word count and the intensity classify the text. A comparative analysis of machine learning models, Naive Bayes, Support Vector Machines, Multinomial Logistic Regression, Random Forest, and Extreme Gradient (XG) Boost, is performed across multiple performance metrics. The best estimations are achieved through Random Forest with an accuracy score of 81%. Additionally, sentiment analysis is applied for a personality judgment case against a Twitter profile based on online activity.
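A lexicon-based labeler with a neutrality band can be sketched as follows. This is a toy illustration of the general idea only: the six-word lexicon, its scores, and the 0.5 threshold are invented for this example, and the code does not reproduce Text Blob, Vader Sentiment, or the paper's labeling procedure.

```python
# Toy word-level polarity lexicon (scores are made up for illustration).
LEXICON = {
    "good": 1.0, "great": 2.0, "love": 2.0,
    "bad": -1.0, "terrible": -2.0, "hate": -2.0,
}

def polarity(text):
    """Average lexicon score of the words in the text that the lexicon knows."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

def label(text, threshold=0.5):
    """Map polarity to positive / negative / neutral using a neutrality band."""
    p = polarity(text)
    if p > threshold:
        return "positive"
    if p < -threshold:
        return "negative"
    return "neutral"

labels = [label(t) for t in (
    "I love this great phone",
    "terrible battery, bad screen",
    "good camera but bad battery",
    "the phone arrived today",
)]
```

Texts with no lexicon words, or with positive and negative words that cancel out, fall inside the band and are labeled neutral. Labels produced this way can then serve as training targets for the classifiers compared in the paper.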
[NLP-12] FoodPuzzle: Developing Large Language Model Agents as Flavor Scientists
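This paper addresses the tension between the food industry's need for rapid innovation and precise flavor formulation and its traditional reliance on iterative, subjective testing. The key to the solution is a new problem domain for scientific agents, namely generating hypotheses for flavor profile sourcing and understanding, together with the FoodPuzzle benchmark and a new scientific agent approach that combines in-context learning and retrieval-augmented techniques, which significantly improves efficiency and accuracy on flavor profile prediction tasks. [See the abstract for details.]
Link: https://arxiv.org/abs/2409.12832
Authors: Tenghao Huang, Donghee Lee, John Sweeney, Jiatong Shi, Emily Steliotes, Matthew Lange, Jonathan May, Muhao Chen
Keywords (EN): industry is increasingly, increasingly challenged, rapid innovation, innovation and precise, flavor profile creation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)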
Abstract:Flavor development in the food industry is increasingly challenged by the need for rapid innovation and precise flavor profile creation. Traditional flavor research methods typically rely on iterative, subjective testing, which lacks the efficiency and scalability required for modern demands. This paper presents three contributions to address the challenges. Firstly, we define a new problem domain for scientific agents in flavor science, conceptualized as the generation of hypotheses for flavor profile sourcing and understanding. To facilitate research in this area, we introduce the FoodPuzzle, a challenging benchmark consisting of 978 food items and 1,766 flavor molecules profiles. We propose a novel Scientific Agent approach, integrating in-context learning and retrieval augmented techniques to generate grounded hypotheses in the domain of food science. Experimental results indicate that our model significantly surpasses traditional methods in flavor profile prediction tasks, demonstrating its potential to transform flavor development practices.
[NLP-13] Language Models Learn to Mislead Humans via RLHF
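This paper addresses the problem that, on complex tasks, language models (LMs) trained with reinforcement learning from human feedback (RLHF) may produce errors that are hard to detect and may even become better at convincing humans that their outputs are correct, a phenomenon the authors call "U-SOPHISTRY". The paper's main contribution is experimental evidence that RLHF does not improve the models' accuracy at completing the task but does increase the chance that human evaluators misjudge the outputs, especially under time constraints. The paper also shows that existing probing methods (such as those for detecting intentionally deceptive models) fail to identify this unintended behavior, and it calls for more research to help humans align models. [See the abstract for details.]
Link: https://arxiv.org/abs/2409.12822
Authors: Jiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Samuel R. Bowman, He He, Shi Feng
Keywords (EN): Language models, produce errors, hard to detect, Language, RLHF
Categories: Computation and Language (cs.CL)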
Abstract:Language models (LMs) can produce errors that are hard to detect for humans, especially when the task is complex. RLHF, the most popular post-training method, may exacerbate this problem: to achieve higher rewards, LMs might get better at convincing humans that they are right even when they are wrong. We study this phenomenon under a standard RLHF pipeline, calling it “U-SOPHISTRY” since it is Unintended by model developers. Specifically, we ask time-constrained (e.g., 3-10 minutes) human subjects to evaluate the correctness of model outputs and calculate humans’ accuracy against gold labels. On a question-answering task (QuALITY) and programming task (APPS), RLHF makes LMs better at convincing our subjects but not at completing the task correctly. RLHF also makes the model harder to evaluate: our subjects’ false positive rate increases by 24.1% on QuALITY and 18.3% on APPS. Finally, we show that probing, a state-of-the-art approach for detecting Intended Sophistry (e.g. backdoored LMs), does not generalize to U-SOPHISTRY. Our results highlight an important failure mode of RLHF and call for more research in assisting humans to align them.
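The false positive rate mentioned in the abstract can be illustrated with a small sketch. It assumes the rate is the share of incorrect model outputs (by gold label) that human evaluators nevertheless approved; that definition, the function name, and the toy labels are illustrative assumptions, not taken from the paper.

```python
def false_positive_rate(gold_correct, human_approved):
    """Share of incorrect outputs (gold label False) that evaluators approved."""
    approvals_on_wrong = [
        approved
        for gold, approved in zip(gold_correct, human_approved)
        if not gold
    ]
    if not approvals_on_wrong:
        raise ValueError("no incorrect outputs to evaluate")
    return sum(approvals_on_wrong) / len(approvals_on_wrong)


# Hypothetical toy data: one correct output and four incorrect ones.
gold = [True, False, False, False, False]
approved_before = [True, True, False, False, False]  # evaluator verdicts, initial model
approved_after = [True, True, True, False, False]    # evaluator verdicts, after RLHF

fpr_before = false_positive_rate(gold, approved_before)
fpr_after = false_positive_rate(gold, approved_after)
```

In this toy setup the evaluators approve one of four wrong outputs before and two of four after, so the rate rises while the number of correct outputs stays the same.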
[NLP-14] Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal Contamination
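This paper addresses the evaluation of language models on university entrance exam questions in a bilingual setting. Its key contribution is UNED-ACCESS 2024, a bilingual dataset of 1003 university-entrance-level multiple-choice questions in Spanish and English, on which current open-source and proprietary models are evaluated in a zero-shot setting. The results show that reasoning questions are challenging for models, that smaller models perform worse in Spanish than in English, and that the gap between languages reaches up to 37% for smaller models. For the best models the gap between the two languages is negligible, and model ranking on UNED-ACCESS 2024 correlates highly with ranking on MMLU, suggesting that the dataset is diverse and representative enough to measure model performance by discipline. [See the abstract for details]
Link: https://arxiv.org/abs/2409.12746
Authors: Eva Sánchez Salido,Roser Morante,Julio Gonzalo,Guillermo Marco,Jorge Carrillo-de-Albornoz,Laura Plaza,Enrique Amigó,Andrés Fernández,Alejandro Benito-Santos,Adrián Ghajari Espinosa,Victor Fresno
Keywords (EN): university entrance level, entrance level exams, article we present, university entrance, entrance level
Categories: Computation and Language (cs.CL)
Comments: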
Abstract:In this article we present UNED-ACCESS 2024, a bilingual dataset that consists of 1003 multiple-choice questions of university entrance level exams in Spanish and English. Questions are originally formulated in Spanish and translated manually into English, and have not ever been publicly released. A selection of current open-source and proprietary models are evaluated in a uniform zero-shot experimental setting both on the UNED-ACCESS 2024 dataset and on an equivalent subset of MMLU questions. Results show that (i) reasoning questions are challenging for models, (ii) smaller models perform worse than larger models and degrade faster in Spanish than in English and (iii) the performance gap between languages is negligible for the best models and grows up to 37% for smaller models. Model ranking on UNED-ACCESS 2024 is almost identical in English and Spanish, and has also a high correlation (0.98 Pearson) with ranking on MMLU, suggesting that a small dataset is sufficiently diverse and representative to measure performance by discipline.
[NLP-15] Fine Tuning Large Language Models for Medicine: The Role and Importance of Direct Parameter Optimization
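This paper addresses how to use large language model (LLM) fine tuning effectively in medicine, and in particular when to use Supervised Fine Tuning (SFT) versus Direct Parameter Optimization (DPO) for different medical natural language processing tasks. It compares SFT and DPO on five common medical tasks (classification with text data, classification with numeric data, clinical reasoning, summarization, and clinical triage) and finds that SFT is suited to simple text classification, while DPO improves performance on the more complex tasks of clinical reasoning, summarization, and clinical triage. The finding underlines the importance of DPO for fine tuning in medicine and points to gaps in current software tooling that hold back wider use of DPO. [See the abstract for details]
Link: https://arxiv.org/abs/2409.12741
Authors: Thomas Savage,Stephen Ma,Abdessalem Boukil,Vishwesh Patel,Ekanath Rangan,Ivan Rodriguez,Jonathan H Chen
Keywords (EN): Large Language Model, Direct Parameter Optimization, Supervised Fine Tuning, fine tuning, Large Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: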
Abstract:Large Language Model (LLM) fine tuning is underutilized in the field of medicine. Two of the most common methods of fine tuning are Supervised Fine Tuning (SFT) and Direct Parameter Optimization (DPO), but there is little guidance informing users when to use either technique. In this investigation, we compare the performance of SFT and DPO for five common natural language tasks in medicine: Classification with text data, Classification with numeric data, Clinical Reasoning, Summarization, and Clinical Triage. We find that SFT alone is sufficient for Classification with text data, whereas DPO improves performance for the more complex tasks of Clinical Reasoning, Summarization and Clinical Triage. Our results establish the role and importance of DPO fine tuning within medicine, and consequently call attention to current software gaps that prevent widespread deployment of this technique.
[NLP-16] Edu-Values: Towards Evaluating the Chinese Education Values of Large Language Models
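This paper addresses the alignment of large language models (LLMs) with Chinese education values. Its key contribution is the Edu-Values benchmark, an evaluation tool designed for Chinese education values that covers seven dimensions: professional ideology, cultural literacy, educational knowledge and skills, education laws and regulations, teachers' professional ethics, basic competencies, and subject knowledge. Using 1,418 carefully designed questions (multiple-choice, multi-modal question answering, subjective analysis, adversarial prompts, and questions on traditional Chinese culture), the authors run human and automatic evaluation on 11 state-of-the-art LLMs. They find that, owing to differences in educational culture, Chinese LLMs significantly outperform English LLMs; that LLMs fall short on teachers' professional ethics and basic competencies; and that LLMs do well on multiple-choice questions but poorly on subjective analysis and multi-modal tasks. The results support the effectiveness and potential of the benchmark. [See the abstract for details]
Link: https://arxiv.org/abs/2409.12739
Authors: Peiyi Zhang,Yazhou Zhang,Bo Wang,Lu Rong,Jing Qin
Keywords (EN): large language models, concerns about aligning, recent evolution, evolution of large, language models
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 5 figures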
Abstract:With the recent evolution of large language models (LLMs), concerns about aligning such models with human values have grown. Previous research has primarily focused on assessing LLMs’ performance in terms of the Helpful, Honest, Harmless (3H) basic principles, while often overlooking their alignment with educational values in the Chinese context. To fill this gap, we present Edu-Values, the first Chinese education values evaluation benchmark designed to measure LLMs’ alignment ability across seven dimensions: professional ideology, cultural literacy, educational knowledge and skills, education laws and regulations, teachers’ professional ethics, basic competencies, and subject knowledge. We meticulously design and compile 1,418 questions, including multiple-choice, multi-modal question answering, subjective analysis, adversarial prompts, and questions on traditional Chinese culture. We conduct both human evaluation and automatic evaluation over 11 state-of-the-art (SoTA) LLMs, and highlight three main findings: (1) due to differences in educational culture, Chinese LLMs significantly outperform English LLMs, with Qwen 2 ranking the first with a score of 81.37; (2) LLMs perform well in subject knowledge and teaching skills but struggle with teachers’ professional ethics and basic competencies; (3) LLMs excel at multiple-choice questions but perform poorly on subjective analysis and multi-modal tasks. This demonstrates the effectiveness and potential of the proposed benchmark. Our dataset is available at this https URL.
[NLP-17] MEXMA: Token-level objectives improve sentence representations
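This paper addresses the information loss, especially at the token level, that arises when cross-lingual sentence encoders are trained with sentence-level objectives only, which degrades the sentence representation. Its key contribution is MEXMA, a new approach that integrates sentence-level and token-level objectives. Specifically, MEXMA uses the sentence representation in one language to predict masked tokens in another language, with both the sentence representation and all tokens directly updating the encoder. The approach greatly improves cross-lingual sentence representation quality and outperforms existing cross-lingual sentence encoders on several tasks. [See the abstract for details]
Link: https://arxiv.org/abs/2409.12737
Authors: João Maria Janeiro,Benjamin Piwowarski,Patrick Gallinari,Loïc Barrault
Keywords (EN): sentence representation, sentence, pre-trained cross-lingual sentence, sentence encoders approaches, cross-lingual sentence encoders
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 12 figures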
Abstract:Current pre-trained cross-lingual sentence encoder approaches use sentence-level objectives only. This can lead to loss of information, especially for tokens, which then degrades the sentence representation. We propose MEXMA, a novel approach that integrates both sentence-level and token-level objectives. The sentence representation in one language is used to predict masked tokens in another language, with both the sentence representation and all tokens directly updating the encoder. We show that adding token-level objectives greatly improves the sentence representation quality across several tasks. Our approach outperforms current pre-trained cross-lingual sentence encoders on bi-text mining as well as several downstream tasks. We also analyse the information encoded in our tokens, and how the sentence representation is built from them.
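[NLP-18] LLM-Measure: Generating Valid, Consistent, and Reproducible Text-Based Measures for Social Science Research
This paper addresses how to generate text-based concept measures for social science research in a valid, consistent, reproducible, and efficient way. The key idea is to use the internal hidden states of large language models (LLMs): the method learns a concept vector that captures how the LLM internally represents the target concept, then estimates a text's concept value by projecting the text's LLM hidden states onto that vector. Replication studies across several social science research contexts show that the method produces highly valid, consistent, and reproducible text-based measures. [See the abstract for details]
Link: https://arxiv.org/abs/2409.12722
Authors: Yi Yang,Hanyu Duan,Jiaxin Liu,Kar Yan Tam
Keywords (EN): generating text-based concept, science research necessitates, social science research, necessitates the development, text-based concept measures
Categories: Computation and Language (cs.CL)
Comments: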
Abstract:The increasing use of text as data in social science research necessitates the development of valid, consistent, reproducible, and efficient methods for generating text-based concept measures. This paper presents a novel method that leverages the internal hidden states of large language models (LLMs) to generate these concept measures. Specifically, the proposed method learns a concept vector that captures how the LLM internally represents the target concept, then estimates the concept value for text data by projecting the text’s LLM hidden states onto the concept vector. Three replication studies demonstrate the method’s effectiveness in producing highly valid, consistent, and reproducible text-based measures across various social science research contexts, highlighting its potential as a valuable tool for the research community.
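The projection step described in the abstract above can be illustrated with a minimal sketch. It assumes hidden states are already available as plain lists of floats, and it learns the concept vector as the unit-length difference between the mean hidden state of high-concept and low-concept example texts. That learning rule, the function names, and the toy vectors are illustrative assumptions, not the paper's procedure.

```python
from math import sqrt


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def learn_concept_vector(high_states, low_states):
    """Assumed rule: unit-length difference of the two class means."""
    high_mean = mean_vector(high_states)
    low_mean = mean_vector(low_states)
    diff = [h - l for h, l in zip(high_mean, low_mean)]
    norm = sqrt(sum(x * x for x in diff))
    if norm == 0:
        raise ValueError("class means coincide; no direction to learn")
    return [x / norm for x in diff]


def concept_value(hidden_state, concept_vector):
    """Scalar projection of a text's hidden state onto the concept vector."""
    return sum(h * c for h, c in zip(hidden_state, concept_vector))


# Toy 3-dimensional "hidden states" (hypothetical values, not model output).
high = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
low = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
vector = learn_concept_vector(high, low)
score = concept_value([1.5, 7.0, 1.0], vector)
```

Because the concept vector has unit length, the score is the length of the hidden state's component along the concept direction; components orthogonal to it (the second and third coordinates here) do not affect the measure.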
[NLP-19] Exploring Large Language Models for Product Attribute Value Identification
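This paper addresses the reliance of existing product attribute value identification (PAVI) methods on large amounts of task-specific training data and their weak generalization to new attributes. The key idea is to use large language models (LLMs) such as LLaMA and Mistral: a two-step prompting approach significantly improves performance in zero-shot settings, and instruction fine-tuning further boosts performance when training data is available, yielding more data-efficient and robust PAVI. [See the abstract for details]
Link: https://arxiv.org/abs/2409.12695
Authors: Kassem Sabeh,Mouna Kacimi,Johann Gamper,Robert Litschko,Barbara Plank
Keywords (EN): involves automatically identifying, automatically identifying attributes, involves automatically, enabling features, automatically identifying
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: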
Abstract:Product attribute value identification (PAVI) involves automatically identifying attributes and their values from product information, enabling features like product search, recommendation, and comparison. Existing methods primarily rely on fine-tuning pre-trained language models, such as BART and T5, which require extensive task-specific training data and struggle to generalize to new attributes. This paper explores large language models (LLMs), such as LLaMA and Mistral, as data-efficient and robust alternatives for PAVI. We propose various strategies: comparing one-step and two-step prompt-based approaches in zero-shot settings and utilizing parametric and non-parametric knowledge through in-context learning examples. We also introduce a dense demonstration retriever based on a pre-trained T5 model and perform instruction fine-tuning to explicitly train LLMs on task-specific instructions. Extensive experiments on two product benchmarks show that our two-step approach significantly improves performance in zero-shot settings, and instruction fine-tuning further boosts performance when using training data, demonstrating the practical benefits of using LLMs for PAVI.
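[NLP-20] Connecting Ideas in Lower-Resource Scenarios: NLP for National Varieties, Creoles and Other Low-resource Scenarios COLING2025
This tutorial paper addresses the challenges large language models face when processing lower-resource languages, such as dialects, sociolects, and Creoles. It identifies and responds to problems common to data-poor contexts by revisiting past research ideas and connecting them to the present field, with the aim of fostering collaboration and knowledge exchange among researchers and overcoming the obstacles that data scarcity creates. [See the abstract for details]
Link: https://arxiv.org/abs/2409.12683
Authors: Aditya Joshi,Diptesh Kanojia,Heather Lent,Hour Kaing,Haiyue Song
Keywords (EN): large language models, national or social, language models struggle, excellent results, results on benchmarks
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Selected as a full-day tutorial at COLING 2025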
Abstract:Despite excellent results on benchmarks over a small subset of languages, large language models struggle to process text from languages situated in ‘lower-resource’ scenarios such as dialects/sociolects (national or social varieties of a language), Creoles (languages arising from linguistic contact between multiple languages) and other low-resource languages. This introductory tutorial will identify common challenges, approaches, and themes in natural language processing (NLP) research for confronting and overcoming the obstacles inherent to data-poor contexts. By connecting past ideas to the present field, this tutorial aims to ignite collaboration and cross-pollination between researchers working in these scenarios. Our notion of ‘lower-resource’ broadly denotes the outstanding lack of data required for model training - and may be applied to scenarios apart from the three covered in the tutorial.
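[NLP-21] Text2Traj2Text: Learning-by-Synthesis Framework for Contextual Captioning of Human Movement Trajectories
This paper addresses contextual captioning of shopper trajectory data in retail stores and proposes a new learning-by-synthesis framework called Text2Traj2Text. The key idea is to use large language models to synthesize diverse, realistic contextual captions together with the corresponding movement trajectories on a store map. Although the captioning model is trained on fully synthesized data, it generalizes well to trajectories and captions created by real human subjects. In a systematic evaluation, the framework outperforms competitive approaches on the ROUGE and BERT Score metrics. [See the abstract for details]
Link: https://arxiv.org/abs/2409.12670
Authors: Hikaru Asano,Ryo Yonetani,Taiki Sekii,Hiroki Ouchi
Keywords (EN): shopper trajectory data, paper presents, contexts behind shopper, shopper trajectory, trajectory data
Categories: Computation and Language (cs.CL)
Comments: To appear in the International Natural Language Generation Conference (INLG 2024)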
Abstract:This paper presents Text2Traj2Text, a novel learning-by-synthesis framework for captioning possible contexts behind shopper’s trajectory data in retail stores. Our work will impact various retail applications that need better customer understanding, such as targeted advertising and inventory management. The key idea is leveraging large language models to synthesize a diverse and realistic collection of contextual captions as well as the corresponding movement trajectories on a store map. Despite being learned from fully synthesized data, the captioning model can generalize well to trajectories/captions created by real human subjects. Our systematic evaluation confirmed the effectiveness of the proposed framework over competitive approaches in terms of ROUGE and BERT Score metrics.
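[NLP-22] Exploring the topics, sentiments and hate speech in the Spanish information environment
This paper examines the distribution of hate speech and negative sentiment, and the topics involved, in public reactions to news published by five major Spanish media outlets in January 2021. The study uses the BERTopic unsupervised framework to extract 81 topics, which are named manually with the help of large language models (LLMs) and grouped into nine primary categories. Social issues, expressions and slang, and political issues are the most discussed topics; reactions are mainly negative or neutral, with little positive sentiment. Although the share of hate speech is low (3.98%), online responses to social and political topics show high toxicity. [See the abstract for details]
Link: https://arxiv.org/abs/2409.12658
Authors: ALEJANDRO BUITRAGO LOPEZ,Javier Pastor-Galindo,José Antonio Ruipérez-Valiente
Keywords (EN): leading to radicalization, Spanish media outlets, digital era, transformed communication, facilitated the spread
Categories: Computation and Language (cs.CL)
Comments: 24 pages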
Abstract:In the digital era, the internet and social media have transformed communication but have also facilitated the spread of hate speech and disinformation, leading to radicalization, polarization, and toxicity. This is especially concerning for media outlets due to their significant role in shaping public discourse. This study examines the topics, sentiments, and hate prevalence in 337,807 response messages (website comments and tweets) to news from five Spanish media outlets (La Vanguardia, ABC, El País, El Mundo, and 20 Minutos) in January 2021. These public reactions were originally labeled as distinct types of hate by experts following an original procedure, and they are now classified into three sentiment values (negative, neutral, or positive) and main topics. The BERTopic unsupervised framework was used to extract 81 topics, manually named with the help of Large Language Models (LLMs) and grouped into nine primary categories. Results show social issues (22.22%), expressions and slang (20.35%), and political issues (11.80%) as the most discussed. Content is mainly negative (62.7%) and neutral (28.57%), with low positivity (8.73%). Toxic narratives relate to conversation expressions, gender, feminism, and COVID-19. Despite low levels of hate speech (3.98%), the study confirms high toxicity in online responses to social and political topics.
[NLP-23] Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards
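This paper addresses the incomplete and incorrect information found in scientific leaderboards used to evaluate and compare competitive methods. Its key contributions are SciLead, a manually curated scientific leaderboard dataset, and a comprehensive large language model (LLM)-based framework for automatic leaderboard construction, developed for three experimental settings that simulate real-world scenarios in which the task, dataset, and evaluation metric (TDM) triple is fully defined, partially defined, or undefined. LLMs often identify TDM triples correctly but still struggle to extract result values from publications. [See the abstract for details]
Link: https://arxiv.org/abs/2409.12656
Authors: Furkan Şahinuç,Thy Thy Tran,Yulia Grishina,Yufang Hou,Bei Chen,Iryna Gurevych
Keywords (EN): comparing competitive methods, standardized ranking systems, competitive methods, standardized ranking, ranking systems
Categories: Computation and Language (cs.CL)
Comments: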
Abstract:Scientific leaderboards are standardized ranking systems that facilitate evaluating and comparing competitive methods. Typically, a leaderboard is defined by a task, dataset, and evaluation metric (TDM) triple, allowing objective performance assessment and fostering innovation through benchmarking. However, the exponential increase in publications has made it infeasible to construct and maintain these leaderboards manually. Automatic leaderboard construction has emerged as a solution to reduce manual labor. Existing datasets for this task are based on the community-contributed leaderboards without additional curation. Our analysis shows that a large portion of these leaderboards are incomplete, and some of them contain incorrect information. In this work, we present SciLead, a manually-curated Scientific Leaderboard dataset that overcomes the aforementioned problems. Building on this dataset, we propose three experimental settings that simulate real-world scenarios where TDM triples are fully defined, partially defined, or undefined during leaderboard construction. While previous research has only explored the first setting, the latter two are more representative of real-world applications. To address these diverse settings, we develop a comprehensive LLM-based framework for constructing leaderboards. Our experiments and analysis reveal that various LLMs often correctly identify TDM triples while struggling to extract result values from publications. We make our code and data publicly available.
[NLP-24] Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries
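This paper addresses the leakage and automatic-scoring difficulties of existing long-context reasoning evaluations for large language models. Its key contribution is Michelangelo, a new evaluation framework that constructs tasks requiring the model to "chisel away" irrelevant information to reveal a latent structure, thereby testing the model's understanding of long contexts. The model is then queried for details of that latent structure, which gives a high-signal evaluation of its ability to synthesize long-context information. [See the abstract for details]
Link: https://arxiv.org/abs/2409.12640
Authors: Kiran Vodrahalli,Santiago Ontanon,Nilesh Tripuraneni,Kelvin Xu,Sanil Jain,Rakesh Shivanna,Jeffrey Hui,Nishanth Dikkala,Mehran Kazemi,Bahare Fatemi,Rohan Anil,Ethan Dyer,Siamak Shakeri,Roopali Vij,Harsh Mehta,Vinay Ramasesh,Quoc Le,Ed Chi,Yifeng Lu,Orhan Firat,Angeliki Lazaridou,Jean-Baptiste Lespiau,Nithya Attaluri,Kate Olszewska
Keywords (EN): introduce Michelangelo, unleaked long-context reasoning, automatically score, easy to automatically, large language models
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: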
Abstract:We introduce Michelangelo: a minimal, synthetic, and unleaked long-context reasoning evaluation for large language models which is also easy to automatically score. This evaluation is derived via a novel, unifying framework for evaluations over arbitrarily long contexts which measure the model’s ability to do more than retrieve a single piece of information from its context. The central idea of the Latent Structure Queries framework (LSQ) is to construct tasks which require a model to “chisel away” the irrelevant information in the context, revealing a latent structure in the context. To verify a model’s understanding of this latent structure, we query the model for details of the structure. Using LSQ, we produce three diagnostic long-context evaluations across code and natural-language domains intended to provide a stronger signal of long-context language model capabilities. We perform evaluations on several state-of-the-art models and demonstrate both that a) the proposed evaluations are high-signal and b) that there is significant room for improvement in synthesizing long-context information.
[NLP-25] CamelEval: Advancing Culturally Aligned Arabic Language Models and Benchmarks
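This paper addresses the limited understanding of, and responsiveness to, Arabic culture in existing large language models (LLMs) used in Arabic-English bilingual settings. Its key contribution is Juhaina, a bilingual LLM that supports instruction following, open-ended question answering, information provisioning, and text processing and is specifically designed to align with the values and preferences of Arabic speakers. Juhaina has 9.24 billion parameters and a context window of up to 8,192 tokens. Evaluated on the new CamelEval benchmark, it outperforms comparable models in generating Arabic responses, in the factual accuracy of information about the region, and in cultural understanding, with the aim of offering more than 400 million Arabic speakers AI technology that fits their cultural context. [See the abstract for details]
Link: https://arxiv.org/abs/2409.12623
Authors: Zhaozhi Qian,Faroq Altam,Muhammad Saleh Saeed Alqurishi,Riad Souissi
Keywords (EN): artificial intelligence systems, modern artificial intelligence, Large Language Models, Large Language, intelligence systems
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: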
Abstract:Large Language Models (LLMs) are the cornerstones of modern artificial intelligence systems. This paper introduces Juhaina, an Arabic-English bilingual LLM specifically designed to align with the values and preferences of Arabic speakers. Juhaina inherently supports advanced functionalities such as instruction following, open-ended question answering, information provisioning, and text processing. Our model contains 9.24 billion parameters and is trained on a context window of up to 8,192 tokens. This paper details the creation process of Juhaina and provides an extensive empirical evaluation. Furthermore, we identify the limitations of widely-adopted Open Arabic LLM Leaderboard (OALL) and propose a new evaluation benchmark, CamelEval. Our findings demonstrate that Juhaina surpasses existing LLMs of comparable sizes, such as the Llama and Gemma families, in generating helpful responses in Arabic, providing factually accurate information about the region, and understanding nuanced cultural aspects. We aspire for Juhaina to democratize cutting-edge AI technologies, serving over 400 million Arabic speakers by offering LLMs that not only communicate in their language but also comprehend their culture. We publicly release all models on Huggingface at this https URL.
[NLP-26] Iteration of Thought: Leveraging Inner Dialogue for Autonomous Large Language Model Reasoning
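Motivated by the way iterative human engagement improves the responses of large language models (LLMs), this paper proposes the Iteration of Thought (IoT) framework, which generates "thought"-provoking prompts and adapts its reasoning path dynamically to the evolving context, without generating alternate explorative thoughts that are ultimately discarded. The framework has three components: (1) an Inner Dialogue Agent (IDA) that generates instructive, context-specific prompts; (2) an LLM Agent (LLMA) that processes these prompts to refine its responses; and (3) an iterative prompting loop that implements a conversation between the two. The paper introduces two variants: Autonomous Iteration of Thought (AIoT), in which an LLM decides when to stop iterating, and Guided Iteration of Thought (GIoT), which always runs a fixed number of iterations. In experiments on several datasets, IoT shows significant improvements on complex reasoning tasks while reducing human intervention, enabling more adaptive and efficient reasoning systems. [See the abstract for details]
Link: https://arxiv.org/abs/2409.12618
Authors: Santosh Kumar Radha,Yasamin Nouri Jelyani,Ara Ghukasyan,Oktay Goktas
Keywords (EN): large language models, advanced language processing, language processing power, Iterative human engagement, language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: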
Abstract:Iterative human engagement is a common and effective means of leveraging the advanced language processing power of large language models (LLMs). Using well-structured prompts in a conversational manner, human users can effectively influence an LLM to develop more thoughtful and accurate responses. Motivated by this insight, we propose the Iteration of Thought (IoT) framework for enhancing LLM responses by generating “thought”-provoking prompts vis-à-vis an input query and the current iteration of an LLM’s response. Unlike static or semi-static approaches, e.g. Chain of Thought (CoT) or Tree of Thoughts (ToT), IoT adapts its reasoning path dynamically, based on evolving context, and without generating alternate explorative thoughts which are ultimately discarded. The three components of the IoT framework are (1) an Inner Dialogue Agent (IDA) responsible for generating instructive, context-specific prompts; (2) an LLM Agent (LLMA) that processes these prompts to refine its responses; and (3) an iterative prompting loop that implements a conversation between the former two components. We introduce two variants of our framework: Autonomous Iteration of Thought (AIoT), where an LLM decides when to stop iterating, and Guided Iteration of Thought (GIoT), which always forces a fixed number of iterations. We investigate the performance of IoT across various datasets, spanning complex reasoning tasks from the GPQA dataset, explorative problem-solving in Game of 24, puzzle solving in Mini Crosswords, and multi-hop question answering from the HotpotQA dataset. Our results show that IoT represents a viable paradigm for autonomous response refinement in LLMs, showcasing significant improvements over CoT and thereby enabling more adaptive and efficient reasoning systems that minimize human intervention.
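The three-component loop described in the abstract above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `iteration_of_thought`, its parameters, and the toy agents are hypothetical names, the agents are deterministic stand-ins rather than model calls, and the stopping decision that an LLM would make in AIoT is represented by a plain callable.

```python
from typing import Callable, Optional


def iteration_of_thought(
    query: str,
    ida: Callable[[str, str], str],
    llma: Callable[[str, str], str],
    max_iterations: int,
    should_stop: Optional[Callable[[str, str], bool]] = None,
) -> str:
    """Run the prompting loop and return the final response.

    With should_stop=None the loop always runs max_iterations times
    (GIoT-style); otherwise it ends early once should_stop returns True
    (AIoT-style).
    """
    response = llma(query, "")  # initial answer with no guiding prompt
    for _ in range(max_iterations):
        if should_stop is not None and should_stop(query, response):
            break
        prompt = ida(query, response)   # IDA: context-specific prompt
        response = llma(query, prompt)  # LLMA: refined response
    return response


# Deterministic stand-ins for the two agents (hypothetical; no model is called).
def toy_ida(query: str, response: str) -> str:
    return "refine:" + response


def toy_llma(query: str, prompt: str) -> str:
    if not prompt:
        return "draft"
    return prompt.replace("refine:", "") + "+"


guided = iteration_of_thought("q", toy_ida, toy_llma, max_iterations=2)
autonomous = iteration_of_thought(
    "q", toy_ida, toy_llma, max_iterations=5,
    should_stop=lambda q, r: r.endswith("+"),
)
```

With the toy agents, the guided run refines the draft exactly twice, while the autonomous run stops after the first refinement because the stop check is satisfied. Passing the two agents as callables keeps the loop independent of any particular model API.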
[NLP-27] Enhancing SLM via ChatGPT and Dataset Augmentation
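This paper addresses the underperformance of small language models (SLMs) on natural language inference (NLI) tasks. The key to the solution is dataset augmentation and knowledge distillation using ChatGPT-3.5-Turbo. Specifically, the paper enriches the ANLI dataset with synthetic data containing two forms of rationales: information extraction and informed reasoning. T5-Small is then fine-tuned on the augmented datasets, which raises NLI classification accuracy by 1.3% and 2.3%, respectively. The approach improves small-model performance and offers a cost-effective fine-tuning strategy that contributes to more efficient and capable NLP systems. [See the abstract for details]
Link: https://arxiv.org/abs/2409.12599
Authors: Tom Pieper, Mohamad Ballout, Ulf Krumnack, Gunther Heidemann, Kai-Uwe Kühnberger
Keywords: Natural Language Inference, small language models, strategic dataset augmentation, Language Inference, language models
Categories: Computation and Language (cs.CL)
Comments: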
Abstract:This paper explores the enhancement of small language models through strategic dataset augmentation via ChatGPT-3.5-Turbo, in the domain of Natural Language Inference (NLI). By employing knowledge distillation-based techniques and synthetic dataset augmentation, we aim to bridge the performance gap between large language models (LLMs) and small language models (SLMs) without the immense cost of human annotation. Our methods involve two forms of rationale generation–information extraction and informed reasoning–to enrich the ANLI dataset. We then fine-tune T5-Small on these augmented datasets, evaluating its performance against an established benchmark. Our findings reveal that the incorporation of synthetic rationales significantly improves the model’s ability to comprehend natural language, leading to 1.3% and 2.3% higher classification accuracy, respectively, on the ANLI dataset, demonstrating the potential of leveraging LLMs for dataset augmentation. This approach not only enhances the performance of smaller models on complex tasks but also introduces a cost-effective method for fine-tuning smaller language models. By advancing our understanding of knowledge distillation and fine-tuning strategies, this work contributes to the ongoing effort to create more capable and efficient NLP systems.
[NLP-28] Efficient Knowledge Distillation: Empowering Small Language Models with Teacher Model Insights
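This paper addresses how to improve small language models for deployment in real-world applications. The key to the solution is a simple yet effective knowledge distillation method: a teacher model with roughly 3 billion parameters identifies the most influential tokens in its decision-making process, and the important tokens, selected by attribution scores (for example, from saliency maps), are passed to the student model as rationales. Tests on four diverse datasets show improvement over both standard fine-tuning and state-of-the-art knowledge distillation models. An analysis of the extracted tokens shows that in 68% of cases, specifically in datasets where the labels are part of the answer, the extracted tokens are part of the ground truth. [See the abstract for details]
Link: https://arxiv.org/abs/2409.12586
Authors: Mohamad Ballout, Ulf Krumnack, Gunther Heidemann, Kai-Uwe Kühnberger
Keywords: Enhancing small language, real-life application deployment, significant challenge facing, small language models, Enhancing small
Categories: Computation and Language (cs.CL)
Comments: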
Abstract:Enhancing small language models for real-life application deployment is a significant challenge facing the research community. Due to the difficulties and costs of using large language models, researchers are seeking ways to effectively deploy task-specific small models. In this work, we introduce a simple yet effective knowledge distillation method to improve the performance of small language models. Our approach utilizes a teacher model with approximately 3 billion parameters to identify the most influential tokens in its decision-making process. These tokens are extracted from the input based on their attribution scores relative to the output, using methods like saliency maps. These important tokens are then provided as rationales to a student model, aiming to distill the knowledge of the teacher model. This method has proven to be effective, as demonstrated by testing it on four diverse datasets, where it shows improvement over both standard fine-tuning methods and state-of-the-art knowledge distillation models. Furthermore, we explore explanations of the success of the model by analyzing the important tokens extracted from the teacher model. Our findings reveal that in 68% of cases, specifically in datasets where labels are part of the answer, such as multiple-choice questions, the extracted tokens are part of the ground truth.
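The token-selection step in the abstract above (keep the input tokens with the highest attribution scores and hand them to the student as a rationale) might look roughly like the sketch below. Everything here is an assumption for illustration: the function names, the made-up scores, and the prompt format in `build_student_input` are hypothetical, and the attribution scores would in practice come from a saliency method applied to the teacher model.

```python
from typing import List, Sequence


def extract_rationale(tokens: Sequence[str], scores: Sequence[float], k: int = 3) -> List[str]:
    """Return the k highest-scoring tokens, kept in their original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]


def build_student_input(question: str, rationale: List[str]) -> str:
    """Hypothetical prompt format; the paper's actual format is not given here."""
    return f"{question} [rationale: {' '.join(rationale)}]"


tokens = ["the", "capital", "of", "France", "is", "Paris"]
scores = [0.01, 0.40, 0.02, 0.90, 0.05, 0.70]  # made-up attribution scores
rationale = extract_rationale(tokens, scores, k=3)
student_input = build_student_input("What is the capital of France?", rationale)
```

Keeping the selected tokens in source order is a design choice of this sketch; it makes the rationale read like a compressed version of the input.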
[NLP-29] RAD-Bench: Evaluating Large Language Models' Capabilities in Retrieval Augmented Dialogues
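This paper addresses a gap in existing benchmarks: none evaluates how well large language models (LLMs) use external retrieval to produce more precise responses across multi-turn dialogues. The key to the solution is RAD-Bench (Retrieval Augmented Dialogue), a benchmark designed to evaluate two abilities, Retrieval Synthesis and Retrieval Reasoning. Using discriminative questions, retrieved contexts, and corresponding reference answers, RAD-Bench assesses how effectively LLMs integrate and reason over context to maintain and improve conversation quality over multiple turns. [See the abstract for details]
Link: https://arxiv.org/abs/2409.12558
Authors: Tzu-Lin Kuo, Feng-Ting Liao, Mu-Wei Hsieh, Fu-Chieh Chang, Po-Chun Hsu, Da-Shan Shiu
Keywords: Large Language Models, Large Language, Search-Augmented Generation, Retrieval-Augmented Generation, Language Models
Categories: Computation and Language (cs.CL)
Comments: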
Abstract:In real-world applications with Large Language Models (LLMs), external retrieval mechanisms - such as Search-Augmented Generation (SAG), tool utilization, and Retrieval-Augmented Generation (RAG) - are often employed to enhance the quality of augmented generations in dialogues. These approaches often come with multi-turn dialogue, where each interaction is enriched by relevant information retrieved from external sources. Existing benchmarks either assess LLMs’ chat abilities in multi-turn dialogues or their use of retrieval for augmented responses in single-turn settings. However, there is a gap in evaluating LLMs’ ability to leverage retrieval for more precise responses across multiple turns. To address this limitation, we introduce RAD-Bench (Retrieval Augmented Dialogue), a benchmark designed to evaluate LLMs’ capabilities in multi-turn dialogues following retrievals, essential for their deployment in context-rich applications. RAD-Bench evaluates two key abilities of LLMs: Retrieval Synthesis and Retrieval Reasoning. These are measured using discriminative questions and retrieved contexts, and corresponding reference answers, assessing how effectively LLMs integrate and reason with context to maintain and enhance conversation quality over multiple turns. Our evaluation results on commonly used LLMs reveal that model performance deteriorates as additional layers of conditions or constraints are applied across conversation turns, even when accurate retrieved contexts are provided.
[NLP-30] Enhancing Knowledge Distillation of Large Language Models through Efficient Multi-Modal Distribution Alignment
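This paper addresses the difficulty student models have in learning the multi-modal probability distributions predicted by teacher LLMs during knowledge distillation. The key to the solution is Ranking Loss based Knowledge Distillation (RLKD), which adds a word-level ranking loss to keep the ranking of peak predictions consistent between teacher and student. This exploits the fine-grained information between categories in the predicted distributions, improves the student's ability to learn multi-modal distributions, and yields significant gains on downstream tasks. [See the abstract for details]
Link: https://arxiv.org/abs/2409.12545
Authors: Tianyu Peng, Jiajun Zhang
Keywords: effective model compression, large language models, model compression method, transfer the internal, internal capabilities
Categories: Computation and Language (cs.CL)
Comments: 18 pages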
Abstract: Knowledge distillation (KD) is an effective model compression method that can transfer the internal capabilities of large language models (LLMs) to smaller ones. However, the multi-modal probability distribution predicted by teacher LLMs causes difficulties for student models to learn. In this paper, we first demonstrate the importance of multi-modal distribution alignment with experiments and then highlight the inefficiency of existing KD approaches in learning multi-modal distributions. To address this problem, we propose Ranking Loss based Knowledge Distillation (RLKD), which encourages the consistency of the ranking of peak predictions between the teacher and student models. By incorporating word-level ranking loss, we ensure excellent compatibility with existing distillation objectives while fully leveraging the fine-grained information between different categories in the peaks of the two predicted distributions. Experimental results demonstrate that our method enables the student model to better learn the multi-modal distributions of the teacher model, leading to a significant performance improvement in various downstream tasks.
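To make "consistency of the ranking of peak predictions" concrete, the sketch below measures how far a student's ordering of the teacher's top-k tokens is from the teacher's own ordering, using one minus the Spearman rank correlation. This is an assumption chosen for illustration, not the loss defined in the paper: the function names are hypothetical, and because hard ranks are not differentiable this version is a diagnostic rather than a training objective.

```python
from typing import List, Sequence


def _ranks(values: Sequence[float]) -> List[int]:
    """Rank positions (0 = largest); ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks


def peak_rank_disagreement(teacher: Sequence[float], student: Sequence[float], k: int = 3) -> float:
    """Return 1 - Spearman correlation over the teacher's top-k tokens.

    0.0 means the student orders the teacher's peaks identically;
    2.0 means the order is fully reversed.
    """
    if len(teacher) != len(student):
        raise ValueError("distributions must cover the same vocabulary")
    k = min(k, len(teacher))
    if k < 2:
        raise ValueError("need at least two peaks to compare rankings")
    peaks = sorted(range(len(teacher)), key=lambda i: teacher[i], reverse=True)[:k]
    t_rank = _ranks([teacher[i] for i in peaks])
    s_rank = _ranks([student[i] for i in peaks])
    d2 = sum((a - b) ** 2 for a, b in zip(t_rank, s_rank))
    rho = 1 - 6 * d2 / (k * (k * k - 1))  # Spearman's formula (no ties)
    return 1 - rho


teacher = [0.50, 0.30, 0.15, 0.05]
same_order = [0.40, 0.35, 0.20, 0.05]      # different values, same ranking
reversed_order = [0.10, 0.30, 0.50, 0.10]  # teacher's peaks in reverse order
agree = peak_rank_disagreement(teacher, same_order)
disagree = peak_rank_disagreement(teacher, reversed_order)
```

Note that `same_order` scores zero even though its probabilities differ from the teacher's, which is the sense in which a ranking term can complement a value-matching distillation objective.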
[NLP-31] Profiling Patient Transcript Using Large Language Model Reasoning Augmentation for Alzheimer's Disease Detection
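This paper addresses a limitation of automated Alzheimer's disease (AD) detection from spontaneous speech: existing methods do not model a patient's global linguistic characteristics. The key to the solution is a patient-level transcript profiling framework based on large language model (LLM) reasoning augmentation, which systematically elicits linguistic deficit attributes, generates embeddings of them, and integrates these into an Albert model for AD detection. The method improves accuracy and F1 score and also makes the model more interpretable. [See the abstract for details]
Link: https://arxiv.org/abs/2409.12541
Authors: Chin-Po Chen, Jeng-Lin Li
Keywords: Alzheimer disease, gradual decline, Alzheimer, language capabilities, detection
Categories: Computation and Language (cs.CL)
Comments: accepted to EMBC 2024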
Abstract:Alzheimer’s disease (AD) stands as the predominant cause of dementia, characterized by a gradual decline in speech and language capabilities. Recent deep-learning advancements have facilitated automated AD detection through spontaneous speech. However, common transcript-based detection methods directly model text patterns in each utterance without a global view of the patient’s linguistic characteristics, resulting in limited discriminability and interpretability. Despite the enhanced reasoning abilities of large language models (LLMs), there remains a gap in fully harnessing the reasoning ability to facilitate AD detection and model interpretation. Therefore, we propose a patient-level transcript profiling framework leveraging LLM-based reasoning augmentation to systematically elicit linguistic deficit attributes. The summarized embeddings of the attributes are integrated into an Albert model for AD detection. The framework achieves 8.51% ACC and 8.34% F1 improvements on the ADReSS dataset compared to the baseline without reasoning augmentation. Our further analysis shows the effectiveness of our identified linguistic deficit attributes and the potential to use LLM for AD detection interpretation.
[NLP-32] Should RAG Chatbots Forget Unimportant Conversations? Exploring Importance and Forgetting with Psychological Insights
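This paper addresses the drop in retrieval accuracy of Retrieval-Augmented Generation (RAG) as long-term conversations accumulate content. The key to the solution is LUFY, a method grounded in psychological principles that focuses on emotionally arousing memories and retains less than 10% of the conversation. LUFY significantly improves user experience in long-term conversations, underscoring the importance of forgetting unimportant parts of a conversation. [See the abstract for details]
Link: https://arxiv.org/abs/2409.12524
Authors: Ryuichi Sumida, Koji Inoue, Tatsuya Kawahara
Keywords: degrades retrieval accuracy, increasing memory load, progress degrades retrieval, Retrieval-Augmented Generation, conversations progress degrades
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: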
Abstract:While Retrieval-Augmented Generation (RAG) has shown promise in enhancing long-term conversations, the increasing memory load as conversations progress degrades retrieval accuracy. Drawing on psychological insights, we propose LUFY, a simple yet effective method that focuses on emotionally arousing memories and retains less than 10% of the conversation. In the user experiment, participants interacted with three types of RAG chatbots, each for 2 hours over 4 sessions, marking the most extensive assessment of a chatbot’s long-term capabilities to date – more than four times longer than any existing benchmark. The results demonstrate that prioritizing arousing memories while forgetting the majority of the conversation significantly enhances user experience. This study pushes the frontier of long-term conversations and highlights the importance of forgetting unimportant parts of conversations. Code and Dataset: this https URL
[NLP-33] Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models
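This paper addresses problems that arise when applying knowledge distillation (KD) to autoregressive language models: because the teacher's distribution is fixed, the student is hard to train and computation is costly. The key to the solution is Online Knowledge Distillation (OKD), which integrates small online modules into the teacher network and trains them concurrently with the student. The teacher's parameters are thus updated dynamically to adapt to the student's distribution, which improves distillation and substantially reduces training time. [See the abstract for details]
Link: https://arxiv.org/abs/2409.12512
Authors: Jun Rao, Xuebo Liu, Zepeng Lin, Liang Ding, Jing Li, Dacheng Tao
Keywords: compresses large teacher, large teacher models, technique that compresses, compresses large, teacher
Categories: Computation and Language (cs.CL)
Comments: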
Abstract:Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them. The success of KD in auto-regressive language models mainly relies on Reverse KL for mode-seeking and student-generated output (SGO) to combat exposure bias. Our theoretical analyses and experimental validation reveal that while Reverse KL effectively mimics certain features of the teacher distribution, it fails to capture most of its behaviors. Conversely, SGO incurs higher computational costs and presents challenges in optimization, particularly when the student model is significantly smaller than the teacher model. These constraints are primarily due to the immutable distribution of the teacher model, which fails to adjust adaptively to models of varying sizes. We introduce Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model. This strategy abolishes the necessity for on-policy sampling and merely requires minimal updates to the parameters of the teacher’s online module during training, thereby allowing dynamic adaptation to the student’s distribution to make distillation better. Extensive results across multiple generation datasets show that OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
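The abstract above refers to Reverse KL as "mode-seeking". A small worked example, with made-up three-token distributions, shows what that means: under forward KL a student that spreads mass over both teacher modes scores better, while under reverse KL a student that commits to a single mode scores better. The distributions and names below are assumptions for illustration and are not taken from the paper.

```python
import math
from typing import Sequence


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [0.49, 0.02, 0.49]   # two modes separated by a low-probability token
one_mode = [0.96, 0.02, 0.02]  # student that commits to a single mode
spread = [0.34, 0.32, 0.34]    # student that covers both modes (and the gap)

# Forward KL: KL(teacher || student). Reverse KL: KL(student || teacher).
forward = {"one_mode": kl(teacher, one_mode), "spread": kl(teacher, spread)}
reverse = {"one_mode": kl(one_mode, teacher), "spread": kl(spread, teacher)}
```

Forward KL heavily penalizes `one_mode` for assigning almost no mass to the teacher's second mode, whereas reverse KL penalizes `spread` for putting mass where the teacher has little.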
[NLP-34] LLMR: Knowledge Distillation with a Large Language Model-Induced Reward COLING2024
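This paper addresses the difficulty of deploying large language models in resource-constrained environments. The key to the solution is LLMR, a novel knowledge distillation (KD) method based on a reward function induced from large language models. Experiments on multiple dialogue generation and summarization datasets show that LLMR outperforms traditional KD methods across tasks and datasets. [See the abstract for details]
Link: https://arxiv.org/abs/2409.12500
Authors: Dongheng Li, Yongchang Hao, Lili Mou
Keywords: demonstrated remarkable performance, natural language processing, Large language models, increasingly popular, popular and demonstrated
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by LREC-COLING 2024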
Abstract: Large language models have become increasingly popular and demonstrated remarkable performance in various natural language processing (NLP) tasks. However, these models are typically computationally expensive and difficult to deploy in resource-constrained environments. In this paper, we propose LLMR, a novel knowledge distillation (KD) method based on a reward function induced from large language models. We conducted experiments on multiple datasets in the dialogue generation and summarization tasks. Empirical results demonstrate that our LLMR approach consistently outperforms traditional KD methods in different tasks and datasets.
[NLP-35] CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs
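This paper addresses the inefficiency of the prefilling phase of large language models on long-context tasks. The key observation is that query criticality is local during prefilling: adjacent query tokens tend to attend to similar subsets of the past key-value (KV) cache. Building on this, the paper proposes CritiPrefill, which partitions the input sequence's queries and KV cache into segments and blocks, estimates query criticality with a segment-wise algorithm, and prunes non-critical computation in self-attention. This significantly accelerates prefilling with minimal quality loss. [See the abstract for details]
Link: https://arxiv.org/abs/2409.12490
Authors: Junlin Lv, Yuan Feng, Xike Xie, Xin Jia, Qirong Peng, Guiming Xie
Keywords: Large language models, achieved notable success, Large language, quadratic computation complexity, language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: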
Abstract:Large language models have achieved notable success across various domains, yet efficient inference is still limited by the quadratic computation complexity of the attention mechanism. The inference consists of prefilling and decoding phases. Although several attempts have been made to accelerate decoding, the inefficiency of the prefilling phase, especially for long-context tasks, remains a challenge. In this paper, we observe a locality in query criticality during the prefilling phase of long-context processing: adjacent query tokens tend to focus on similar subsets of the past Key-Value (KV) cache. Based on this observation, we propose CritiPrefill, a criticality-based segment-wise prefilling method. This method partitions the input sequence’s queries and KV cache into segments and blocks, utilizing a segment-wise algorithm to estimate the query criticality. By pruning non-critical computations between query segments and cache blocks in the self-attention mechanism, the prefilling process can be significantly accelerated. Extensive evaluations on multiple long-context datasets show up to 2.7x speedup on Llama3-8B and 3.0x speedup on Yi-9B for 128K context length on a single A100 GPU, with minimal quality degradation.
[NLP-36] AutoMode-ASR: Learning to Select ASR Systems for Better Quality and Cost
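This paper addresses the trade-off between transcription quality and cost in automatic speech recognition (ASR). The key to the solution is AutoMode-ASR, a framework that trains a decision model to select the optimal ASR system for each audio input, improving transcription quality while lowering cost. The framework ensembles multiple binary classifiers, each deciding the preference between two systems, and uses features such as audio embeddings, quality estimation, and signal properties. Experiments show a 16.2% relative reduction in word error rate (WER), a 65% cost saving, and a 75% speed improvement. [See the abstract for details]
Link: https://arxiv.org/abs/2409.12476
Authors: Ahmet Gündüz, Yunsu Kim, Kamer Ali Yuksel, Mohamed Al-Badrashiny, Thiago Castro Ferreira, Hassan Sawaf
Keywords: integrates multiple ASR, effectively integrates multiple, multiple ASR systems, present AutoMode-ASR, multiple ASR
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: SPECOM 2024 Conference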
Abstract:We present AutoMode-ASR, a novel framework that effectively integrates multiple ASR systems to enhance the overall transcription quality while optimizing cost. The idea is to train a decision model to select the optimal ASR system for each segment based solely on the audio input before running the systems. We achieve this by ensembling binary classifiers determining the preference between two systems. These classifiers are equipped with various features, such as audio embeddings, quality estimation, and signal properties. Additionally, we demonstrate how using a quality estimator can further improve performance with minimal cost increase. Experimental results show a relative reduction in WER of 16.2%, a cost saving of 65%, and a speed improvement of 75%, compared to using a single-best model for all segments. Our framework is compatible with commercial and open-source black-box ASR systems as it does not require changes in model codes.
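One simple way to turn pairwise binary classifiers into a single choice, as the abstract above describes, is round-robin voting: every pair of systems is compared once, and the system with the most wins is selected. The sketch below assumes that aggregation rule, which the abstract does not specify. `select_system`, `toy_prefers_first`, the system names, and the `snr_db` feature are all hypothetical; a real classifier would be trained on audio embeddings, quality estimates, and signal properties.

```python
from itertools import combinations
from typing import Callable, Dict, Sequence


def select_system(
    systems: Sequence[str],
    prefers_first: Callable[[str, str, Dict[str, float]], bool],
    features: Dict[str, float],
) -> str:
    """Pick the system that wins the most pairwise comparisons.

    `prefers_first(a, b, features)` stands in for a binary classifier that
    returns True when system `a` is preferred over `b` for this segment.
    Ties are resolved in favor of the system listed first.
    """
    wins = {name: 0 for name in systems}
    for a, b in combinations(systems, 2):
        winner = a if prefers_first(a, b, features) else b
        wins[winner] += 1
    return max(systems, key=lambda name: wins[name])


def toy_prefers_first(a: str, b: str, features: Dict[str, float]) -> bool:
    """Toy rule: noisy audio favors the larger system, clean audio the cheaper one."""
    strength = {"small": 0, "medium": 1, "large": 2}
    noisy = features["snr_db"] < 10
    return strength[a] > strength[b] if noisy else strength[a] < strength[b]


systems = ["small", "medium", "large"]
choice_noisy = select_system(systems, toy_prefers_first, {"snr_db": 5.0})
choice_clean = select_system(systems, toy_prefers_first, {"snr_db": 30.0})
```

Because the decision uses only features of the audio, no ASR system has to be run before the choice is made, which is where the cost saving in the abstract comes from.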
[NLP-37] Familiarity-aware Evidence Compression for Retrieval Augmented Generation
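This paper addresses the difficulty Retrieval Augmented Generation (RAG) has in filtering out inconsistent and irrelevant information from externally retrieved evidence. The key to the solution is FaviComp (Familiarity-aware Evidence Compression), a novel training-free evidence compression technique that combines token probabilities from the compression model and the target model to lower the perplexity of the compressed evidence with respect to the target model. The evidence becomes easier for the target model to understand and use, which helps integrate parametric and non-parametric knowledge in complex tasks. [See the abstract for details]
Link: https://arxiv.org/abs/2409.12468
Authors: Dongwon Jung, Qin Liu, Tenghao Huang, Ben Zhou, Muhao Chen
Keywords: Retrieval Augmented Generation, Augmented Generation, Retrieval Augmented, improves large language, large language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: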
Abstract: Retrieval Augmented Generation (RAG) improves large language models (LMs) by incorporating non-parametric knowledge through evidence retrieval from external sources. However, it often struggles to filter out inconsistent and irrelevant information that can distract the LM from its tasks. While compressing the retrieved evidence with a compression model aims to address this issue, the compressed evidence may still be unfamiliar to the target model used for the downstream task, potentially failing to utilize the evidence effectively. We propose FaviComp (Familiarity-aware Evidence Compression), a novel training-free evidence compression technique that makes retrieved evidence more familiar to the target model, while seamlessly integrating parametric knowledge from the model. Specifically, FaviComp proactively lowers the perplexity of the compressed evidence with regard to the target model by combining token probabilities from both the compression model and the target model to generate context that is more familiar to the target model. This approach balances the integration of parametric and non-parametric knowledge, which is especially helpful in complex tasks where the retrieved evidence set may not contain all the necessary information. Experimental results demonstrate that FaviComp consistently outperforms existing baselines in multiple open-domain QA datasets, achieving high compression rates and showcasing the effective integration of both parametric and non-parametric knowledge.
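The idea of "combining token probabilities from both the compression model and the target model" can be illustrated with a single decoding step. The sketch below assumes a weighted sum of log-probabilities with a mixing weight `alpha`; the abstract does not state the exact combination rule, so treat this rule, the function name, and the toy distributions as assumptions for illustration.

```python
import math
from typing import Dict


def pick_next_token(p_comp: Dict[str, float], p_target: Dict[str, float], alpha: float = 0.5) -> str:
    """Choose the token with the best mixed log-probability.

    p_comp:   next-token probabilities from the compression model
    p_target: next-token probabilities from the target model
    alpha:    weight on the target model (0 = compression model only)
    All probabilities are assumed to be strictly positive.
    """
    shared = sorted(set(p_comp) & set(p_target))  # sorted for determinism
    if not shared:
        raise ValueError("the two distributions share no tokens")

    def score(token: str) -> float:
        return (1 - alpha) * math.log(p_comp[token]) + alpha * math.log(p_target[token])

    return max(shared, key=score)


# Made-up distributions: the compressor prefers a word the target finds unlikely.
p_comp = {"automobile": 0.6, "car": 0.4}
p_target = {"automobile": 0.1, "car": 0.9}
compressor_only = pick_next_token(p_comp, p_target, alpha=0.0)
mixed = pick_next_token(p_comp, p_target, alpha=0.5)
```

With `alpha=0.5` the step picks the wording the target model finds more probable, which is the sense in which the compressed evidence becomes lower-perplexity for the target.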
[NLP-38] CodePlan: Unlocking Reasoning Potential in Large Language Models by Scaling Code-form Planning
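This paper addresses the weak planning ability of large language models (LLMs) on complex multi-step reasoning tasks. The key to the solution is CODEPLAN, a scalable paradigm that strengthens planning by having LLMs generate and follow code-form plans written as pseudocode. CODEPLAN uses the structured and versatile nature of code to capture the rich semantics and control flow inherent in complex reasoning. Its core advantage is that code-form plans can be extracted automatically from large, diverse text corpora without task-specific datasets, which allows efficient scaling and improves reasoning across scenarios. [See the abstract for details]
Link: https://arxiv.org/abs/2409.12452
Authors: Jiaxin Wen, Jian Guan, Hongning Wang, Wei Wu, Minlie Huang
Keywords: large language models, traditional natural language, natural language processing, language processing tasks, language models
Categories: Computation and Language (cs.CL)
Comments: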
Abstract: Despite the remarkable success of large language models (LLMs) on traditional natural language processing tasks, their planning ability remains a critical bottleneck in tackling complex multi-step reasoning tasks. Existing approaches mainly rely on prompting or task-specific fine-tuning, often suffering from weak robustness and cross-task generalization. To address the limitation, we introduce CODEPLAN, a scalable paradigm that empowers LLMs to generate and follow code-form plans: pseudocode that outlines high-level, structured reasoning processes. By leveraging the structured and versatile nature of code, CODEPLAN effectively captures the rich semantics and control flows inherent to sophisticated reasoning. Importantly, CODEPLAN allows the automatic extraction of code-form plans from massive, wide-ranging text corpora without the need for curated, task-specific datasets. This enables it to scale up efficiently and improve reasoning capabilities across diverse scenarios. To train CODEPLAN, we construct a large-scale dataset of 2M examples that integrate code-form plans with standard prompt-response pairs from existing corpora. With minimal computation overhead during both training and inference, CODEPLAN achieves a 25.1% relative improvement compared with directly generating responses, averaged across 13 challenging multi-step reasoning benchmarks, spanning mathematical reasoning, symbolic reasoning, instruction-following, multi-hop QA, and decision-making tasks. Further analysis reveals CODEPLAN’s increasing performance gains on more complex reasoning tasks, as well as significant data efficiency thanks to its generalization ability.
[NLP-39] Incremental and Data-Efficient Concept Formation to Support Masked Word Prediction
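This paper addresses efficient language model learning, specifically for masked word prediction. The key to the solution is Cobweb4L, which builds on Cobweb, an incremental system that learns a hierarchy of probabilistic concepts, each storing word frequencies associated with its concept label. Cobweb4L encodes words and their context as instances using an attribute-value representation, and it uses an information-theoretic variant of category utility together with a new performance mechanism that draws on multiple concepts to generate predictions. This significantly outperforms prior Cobweb performance mechanisms that use only a single node. Cobweb4L also learns rapidly, with performance comparable or superior to Word2Vec, and outperforms BERT on the same task when less training data is used. [See the abstract for details]
Link: https://arxiv.org/abs/2409.12440
Authors: Xin Lian, Nishant Baglodi, Christopher J. MacLellan
Keywords: efficient language model, language model learning, supports masked word, paper introduces, efficient language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by the Eleventh Annual Conference on Advances in Cognitive Systems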
Abstract:This paper introduces Cobweb4L, a novel approach for efficient language model learning that supports masked word prediction. The approach builds on Cobweb, an incremental system that learns a hierarchy of probabilistic concepts. Each concept stores the frequencies of words that appear in instances tagged with that concept label. The system utilizes an attribute value representation to encode words and their surrounding context into instances. Cobweb4L uses the information theoretic variant of category utility and a new performance mechanism that leverages multiple concepts to generate predictions. We demonstrate that with these extensions it significantly outperforms prior Cobweb performance mechanisms that use only a single node to generate predictions. Further, we demonstrate that Cobweb4L learns rapidly and achieves performance comparable to and even superior to Word2Vec. Next, we show that Cobweb4L and Word2Vec outperform BERT in the same task with less training data. Finally, we discuss future work to make our conclusions more robust and inclusive.
[NLP-40] Enhancing Logical Reasoning in Large Language Models through Graph-based Synthetic Data
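This paper addresses the limitations of large language models (LLMs) on complex logical reasoning tasks that involve long reasoning chains. The key to the solution is supervised fine-tuning (SFT) on graph-based synthetic reasoning data. Experiments show that this training significantly improves LLM performance on inductive reasoning and spatial reasoning tasks without hurting performance on other standard evaluation benchmarks. [See the abstract for details]
Link: https://arxiv.org/abs/2409.12437
Authors: Jiaming Zhou, Abbas Ghaddar, Ge Zhang, Liheng Ma, Yaochen Hu, Soumyasundar Pal, Mark Coates, Bin Wang, Yingxue Zhang, Jianye Hao
Keywords: Large Language Models, long reasoning chains, complex logical reasoning, involve long reasoning, strategies for Large
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: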
Abstract:Despite recent advances in training and prompting strategies for Large Language Models (LLMs), these models continue to face challenges with complex logical reasoning tasks that involve long reasoning chains. In this work, we explore the potential and limitations of using graph-based synthetic reasoning data as training signals to enhance LLMs’ reasoning capabilities. Our extensive experiments, conducted on two established natural language reasoning tasks – inductive reasoning and spatial reasoning – demonstrate that supervised fine-tuning (SFT) with synthetic graph-based reasoning data effectively enhances LLMs’ reasoning performance without compromising their effectiveness on other standard evaluation benchmarks.
[NLP-41] Linguistic Minimal Pairs Elicit Linguistic Similarity in Large Language Models
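This paper addresses how to better understand the internal linguistic representations of large language models (LLMs), in particular how these models capture and process linguistic knowledge. The key to the solution is to use linguistic minimal pairs to probe differences in LLM internal activations and to quantify the captured linguistic knowledge by measuring the similarity of those differences. Large-scale experiments covering more than 100 LLMs and 150k minimal pairs reveal properties of linguistic similarity from four key aspects: consistency across LLMs, relation to theoretical categorizations, dependence on semantic context, and cross-lingual alignment of linguistic phenomena. [See the abstract for details]
Link: https://arxiv.org/abs/2409.12435
Authors: Xinyu Zhou, Delong Chen, Samuel Cahyawijaya, Xufeng Duan, Zhenguang G. Cai
Keywords: Large Language Models, Language Models, minimal pairs, Large Language, representations of Large
Categories: Computation and Language (cs.CL)
Comments: Code and data are available at this https URL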
Abstract: We introduce a novel analysis that leverages linguistic minimal pairs to probe the internal linguistic representations of Large Language Models (LLMs). By measuring the similarity between LLM activation differences across minimal pairs, we quantify and gain insight into the linguistic knowledge captured by LLMs. Our large-scale experiments, spanning 100+ LLMs and 150k minimal pairs in three languages, reveal properties of linguistic similarity from four key aspects: consistency across LLMs, relation to theoretical categorizations, dependence on semantic context, and cross-lingual alignment of relevant phenomena. Our findings suggest that 1) Linguistic similarity is significantly influenced by training data exposure, leading to higher cross-LLM agreement in higher-resource languages. 2) Linguistic similarity strongly aligns with fine-grained theoretical linguistic categories but weakly with broader ones. 3) Linguistic similarity shows a weak correlation with semantic similarity, showing its context-dependent nature. 4) LLMs exhibit limited cross-lingual alignment in their understanding of relevant linguistic phenomena. This work demonstrates the potential of minimal pairs as a window into the neural representations of language in LLMs, shedding light on the relationship between LLMs and linguistic theory.
摘要:我们提出了一种新颖的分析方法,利用语言学中的最小对 (linguistic minimal pairs) 来探究大语言模型 (Large Language Models, LLMs) 的内部语言表征。通过测量最小对之间 LLM 激活差异的相似性,我们量化了并深入了解了 LLMs 所捕捉到的语言学知识。我们的大规模实验涵盖了 100 多个 LLMs 和 150,000 个最小对,涉及三种语言,从四个关键方面揭示了语言相似性的特性:LLMs 之间的一致性、与理论分类的关系、对语义上下文的依赖性以及相关现象的跨语言对齐。我们的研究发现:1) 语言相似性显著受到训练数据暴露的影响,导致在高资源语言中 LLMs 之间的一致性更高。2) 语言相似性与细粒度的理论语言学类别高度一致,但与更广泛的类别一致性较弱。3) 语言相似性与语义相似性显示出弱相关性,表明其依赖于上下文的特性。4) LLMs 在理解相关语言现象时表现出有限的跨语言对齐能力。这项工作展示了最小对作为窥探 LLMs 中语言神经表征的窗口的潜力,揭示了 LLMs 与语言学理论之间的关系。
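The core measurement in this abstract, the similarity between activation differences across minimal pairs, can be sketched in a few lines. This is a minimal illustration only: the abstract does not name the similarity function, so cosine similarity is an assumption here, and the activation vectors are made-up numbers rather than outputs of any real model. The function names are hypothetical.

```python
import math

def activation_difference(acceptable, unacceptable):
    """Element-wise difference between the activations of the two sentences in a minimal pair."""
    return [a - b for a, b in zip(acceptable, unacceptable)]

def cosine_similarity(u, v):
    """Cosine similarity of two vectors; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def linguistic_similarity(pair_a, pair_b):
    """Similarity between the activation differences of two minimal pairs (assumed: cosine)."""
    diff_a = activation_difference(*pair_a)
    diff_b = activation_difference(*pair_b)
    return cosine_similarity(diff_a, diff_b)

# Toy activations (invented for illustration): each pair is (acceptable, unacceptable).
pair_1 = ([1.0, 2.0, 0.0], [0.0, 2.0, 0.0])  # difference = [1, 0, 0]
pair_2 = ([3.0, 1.0, 1.0], [1.0, 1.0, 1.0])  # difference = [2, 0, 0]
pair_3 = ([0.0, 1.0, 0.0], [0.0, 0.0, 0.0])  # difference = [0, 1, 0]

same_direction = linguistic_similarity(pair_1, pair_2)
orthogonal = linguistic_similarity(pair_1, pair_3)
```

In this toy setup, pairs whose differences point in the same direction score high, while pairs whose differences are orthogonal score zero, which is the intuition behind grouping minimal pairs by linguistic phenomenon.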
[NLP-42] Zero-to-Strong Generalization: Eliciting Strong Capabilities of Large Language Models Iteratively without Gold Labels
该论文试图解决大语言模型(LLMs)在缺乏高质量标注数据的情况下如何实现强大性能的问题。解决方案的关键在于提出了一种名为“零到强泛化”的新范式,通过迭代地利用LLMs自身对未标注数据进行标注,并通过过滤机制保留高质量标签,逐步解锁LLMs在下游任务中的潜力。实验结果表明,该方法在分类和推理任务中均有效,适用于上下文学习和微调,且对不同规模的模型均适用。【详细内容请查看摘要】
链接: https://arxiv.org/abs/2409.12425
作者: Chaoqun Liu,Qin Chao,Wenxuan Zhang,Xiaobao Wu,Boyang Li,Anh Tuan Luu,Lidong Bing
关键词-EN: Large Language Models, Large Language, demonstrated remarkable performance, Language Models, demonstrated remarkable
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 15 pages
Abstract:Large Language Models (LLMs) have demonstrated remarkable performance through supervised fine-tuning or in-context learning using gold labels. However, this paradigm is limited by the availability of gold labels, while in certain scenarios, LLMs may need to perform tasks that are too complex for humans to provide such labels. To tackle this challenge, this study explores whether solely utilizing unlabeled data can elicit strong model capabilities. We propose a new paradigm termed zero-to-strong generalization. We iteratively prompt LLMs to annotate unlabeled data and retain high-quality labels by filtering. Surprisingly, we observe that this iterative process gradually unlocks LLMs’ potential on downstream tasks. Our experiments on extensive classification and reasoning tasks confirm the effectiveness of our proposed framework. Our analysis indicates that this paradigm is effective for both in-context learning and fine-tuning, and for various model sizes.
摘要:大语言模型 (LLM) 通过监督微调或使用黄金标签进行上下文学习,展示了显著的性能。然而,这种范式受限于黄金标签的可用性,而在某些场景中,LLM 可能需要执行人类难以提供此类标签的复杂任务。为了应对这一挑战,本研究探讨了仅利用未标记数据是否能激发强大的模型能力。我们提出了一种新的范式,称为零到强泛化。我们迭代地提示 LLM 对未标记数据进行标注,并通过筛选保留高质量标签。令人惊讶的是,我们观察到这种迭代过程逐渐解锁了 LLM 在下游任务中的潜力。我们在广泛的分类和推理任务上的实验证实了我们提出的框架的有效性。我们的分析表明,这种范式对上下文学习和微调都有效,并且适用于各种模型规模。
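The iterative annotate-then-filter loop described above can be sketched as follows. This is a toy under stated assumptions, not the authors' implementation: `toy_annotate` is a hypothetical stand-in for prompting an LLM, the confidence rule and the threshold are invented, and filtering by a confidence score is only one plausible reading of "retain high-quality labels by filtering".

```python
def zero_to_strong(unlabeled, annotate, rounds=3, threshold=0.75):
    """Iteratively self-label data, keeping only confident labels as demonstrations."""
    demonstrations = []  # starts empty: no gold labels are ever used
    for _ in range(rounds):
        kept = []
        for text in unlabeled:
            # The annotator sees the demonstrations retained in the previous round.
            label, confidence = annotate(text, demonstrations)
            if confidence >= threshold:
                kept.append((text, label))
        demonstrations = kept
    return demonstrations

def toy_annotate(text, demonstrations):
    """Hypothetical annotator: a keyword rule whose confidence grows with more demonstrations."""
    label = "positive" if ("great" in text or "good" in text) else "negative"
    confidence = 0.5
    if "great" in text or "terrible" in text:
        confidence += 0.3  # strong cue words are easy cases
    confidence += 0.15 * len(demonstrations)
    return label, min(1.0, confidence)

texts = ["great movie", "terrible plot", "kind of good"]
after_one_round = zero_to_strong(texts, toy_annotate, rounds=1)
after_two_rounds = zero_to_strong(texts, toy_annotate, rounds=2)
```

In this toy run, the harder example "kind of good" is rejected in the first round and accepted in the second, once the two easy examples are available as demonstrations. That mirrors, in miniature, the gradual improvement the abstract reports.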
[NLP-43] Textualized Agent-Style Reasoning for Complex Tasks by Multiple Round LLM Generation
该论文试图解决大语言模型在链式思维提示过程中面临的三个问题:幻觉问题、解释性受限和生成不可控。解决方案的关键在于提出了AgentCOT框架,这是一个基于LLM的自主代理框架,通过多轮LLM生成以代理风格方式解决复杂问题。AgentCOT在每一步选择并执行一个动作,生成带有支持证据的中间结果,并将步骤索引整合到推理过程中,形成复杂的推理逻辑图结构。论文还引入了两种新策略来增强AgentCOT的性能,并通过广泛的实验验证了其有效性。【详细内容请查看摘要】
链接: https://arxiv.org/abs/2409.12411
作者: Chen Liang,Zhifan Feng,Zihe Liu,Wenbin Jiang,Jinan Xu,Yufeng Chen,Yong Wang
关键词-EN: prompting significantly boosts, large language models, restricted interpretability, round LLM generation, prompting significantly
类目: Computation and Language (cs.CL)
备注:
Abstract:Chain-of-thought prompting significantly boosts the reasoning ability of large language models but still faces three issues: hallucination problem, restricted interpretability, and uncontrollable generation. To address these challenges, we present AgentCOT, an LLM-based autonomous agent framework, which can solve complex problems in an agent-style manner by multiple round LLM generation. At each step, AgentCOT selects an action and executes it to yield an intermediate result with supporting evidence. In addition, we integrate the step’s index into the reasoning process to form a graph structure for complex inference logic. We introduce two new strategies to enhance the performance of AgentCOT. We conduct extensive experiments to verify the effectiveness of our method on six common benchmarks. Results exhibit that our method brings in substantial improvements over current competitive approaches.
摘要:链式思维提示 (Chain-of-thought prompting) 显著提升了大语言模型 (LLM) 的推理能力,但仍面临三个问题:幻觉问题 (hallucination problem)、解释性受限 (restricted interpretability) 和生成不可控 (uncontrollable generation)。为解决这些挑战,我们提出了 AgentCOT,一个基于大语言模型的自主智能体框架 (llm-based autonomous agent framework),它通过多轮大语言模型生成以智能体方式解决复杂问题。在每一步,AgentCOT 选择一个动作并执行,产生带有支持证据的中间结果。此外,我们将步骤索引整合到推理过程中,形成复杂推理逻辑的图结构。我们引入了两种新策略来增强 AgentCOT 的性能。我们进行了广泛的实验,验证了我们的方法在六个常见基准上的有效性。结果显示,我们的方法在当前竞争性方法上带来了显著的改进。
[NLP-44] Mutual Information-based Representations Disentanglement for Unaligned Multimodal Language Sequences
该论文试图解决未对齐多模态语言序列中信息冗余和模型过拟合的问题。解决方案的关键在于提出了一种基于互信息的最小化表示解耦方法(MIRD),通过设计一种新的解耦框架,联合学习单一的模态无关表示,并利用互信息最小化约束来确保表示的优越解耦,从而消除多模态联合表示中的信息冗余。此外,通过引入未标注数据来缓解有限标注数据导致的互信息估计难题,并进一步防止过拟合,提升模型性能。【详细内容请查看摘要】
链接: https://arxiv.org/abs/2409.12408
作者: Fan Qian,Jiqing Han,Jianchen Li,Yongjun He,Tieran Zheng,Guibin Zheng
关键词-EN: multimodal joint representation, joint representation, refined multimodal joint, effectively integrating information, multimodal joint
类目: Computation and Language (cs.CL); Multimedia (cs.MM)
备注: 31 pages, 8 figures
Abstract:The key challenge in unaligned multimodal language sequences lies in effectively integrating information from various modalities to obtain a refined multimodal joint representation. Recently, the disentangle and fuse methods have achieved the promising performance by explicitly learning modality-agnostic and modality-specific representations and then fusing them into a multimodal joint representation. However, these methods often independently learn modality-agnostic representations for each modality and utilize orthogonal constraints to reduce linear correlations between modality-agnostic and modality-specific representations, neglecting to eliminate their nonlinear correlations. As a result, the obtained multimodal joint representation usually suffers from information redundancy, leading to overfitting and poor generalization of the models. In this paper, we propose a Mutual Information-based Representations Disentanglement (MIRD) method for unaligned multimodal language sequences, in which a novel disentanglement framework is designed to jointly learn a single modality-agnostic representation. In addition, the mutual information minimization constraint is employed to ensure superior disentanglement of representations, thereby eliminating information redundancy within the multimodal joint representation. Furthermore, the challenge of estimating mutual information caused by the limited labeled data is mitigated by introducing unlabeled data. Meanwhile, the unlabeled data also help to characterize the underlying structure of multimodal data, consequently further preventing overfitting and enhancing the performance of the models. Experimental results on several widely used benchmark datasets validate the effectiveness of our proposed approach.
摘要:在未对齐的多模态语言序列中,关键挑战在于如何有效地整合来自不同模态的信息,以获得一个精细的多模态联合表示。近期,解耦与融合方法通过显式学习模态无关和模态特定的表示,然后将它们融合成一个多模态联合表示,取得了显著的性能提升。然而,这些方法通常独立地为每个模态学习模态无关的表示,并利用正交约束来减少模态无关和模态特定表示之间的线性相关性,却忽略了消除它们之间的非线性相关性。因此,所获得的多模态联合表示通常存在信息冗余,导致模型过拟合和泛化能力差。本文提出了一种基于互信息的表示解耦方法 (Mutual Information-based Representations Disentanglement, MIRD),用于未对齐的多模态语言序列。在该方法中,设计了一种新的解耦框架,以联合学习单一的模态无关表示。此外,采用互信息最小化约束来确保表示的优越解耦,从而消除多模态联合表示中的信息冗余。同时,通过引入未标记数据来缓解由于有限标记数据导致的互信息估计难题。未标记数据还有助于表征多模态数据的潜在结构,从而进一步防止过拟合并提升模型的性能。在多个广泛使用的基准数据集上的实验结果验证了我们提出方法的有效性。
[NLP-45] Preference Alignment Improves Language Model-Based TTS
该论文试图解决如何通过优化语言模型(LM)来提升文本到语音(TTS)系统的性能问题。解决方案的关键在于采用偏好对齐算法,特别是直接偏好优化(DPO),通过调整LM使其与奖励模型的偏好对齐,从而提高生成语音的自然度、可理解性和说话者相似度。实验结果表明,偏好对齐算法在低资源场景和跨领域应用中均表现出色,显著提升了TTS系统的整体质量,甚至在某些评估中超越了人类语音的表现。【详细内容请查看摘要】
链接: https://arxiv.org/abs/2409.12403
作者: Jinchuan Tian,Chunlei Zhang,Jiatong Shi,Hao Zhang,Jianwei Yu,Shinji Watanabe,Dong Yu
关键词-EN: based systems offer, systems offer competitive, offer competitive performance, Recent advancements, preference alignment algorithms
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advancements in text-to-speech (TTS) have shown that language model (LM)-based systems offer competitive performance to their counterparts. Further optimization can be achieved through preference alignment algorithms, which adjust LMs to align with the preferences of reward models, enhancing the desirability of the generated content. This study presents a thorough empirical evaluation of how preference alignment algorithms, particularly Direct Preference Optimization (DPO), enhance LM-based TTS. With a 1.15B parameter LM-based TTS model, we demonstrate that preference alignment consistently improves intelligibility, speaker similarity, and proxy subjective evaluation scores, with the latter two metrics surpassing even human speech in certain evaluations. We also show preference alignment is applicable to low-resource scenarios and effectively generalized to out-of-domain applications.
摘要:近期在文本到语音 (Text-to-Speech, TTS) 领域的进展表明,基于语言模型 (Language Model, LM) 的系统在性能上能够与传统系统相媲美。通过偏好对齐算法,可以进一步优化这些系统,使其与奖励模型的偏好对齐,从而提升生成内容的质量。本研究对偏好对齐算法,特别是直接偏好优化 (Direct Preference Optimization, DPO),如何增强基于 LM 的 TTS 进行了全面的实证评估。我们使用一个拥有 11.5 亿 (1.15B) 参数的基于 LM 的 TTS 模型,证明了偏好对齐在可理解性、说话者相似度以及代理主观评估分数方面带来了一致的改进,其中后两项指标在某些评估中甚至超过了人类语音。此外,我们还展示了偏好对齐在低资源场景中的适用性,并能有效泛化到域外应用。
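For readers unfamiliar with Direct Preference Optimization, the per-example loss can be sketched as below. This is the standard DPO objective, the negative log-sigmoid of the scaled difference in log-ratios between the policy and a frozen reference model. How the paper builds preference pairs for TTS is not stated in the abstract, so the inputs here are plain sequence log-probabilities with invented values, and `beta=0.1` is an arbitrary illustrative choice.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: the margin is zero and the loss equals log(2).
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)

# Policy raises the chosen sample and lowers the rejected one: the loss drops.
improved = dpo_loss(-0.5, -2.0, -1.0, -1.0)
```

The loss falls below log(2) only when the policy assigns relatively more probability to the preferred output than the reference does, which is how the model is pushed toward outputs the reward model favors.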
[NLP-46] Small Language Models are Equation Reasoners
该论文试图解决小型语言模型(如T5)在算术推理任务中表现不佳的问题。解决方案的关键在于采用“仅方程式”格式(equation-only format),即将自然语言表达的算术推理统一转换为数学方程式。通过这种格式,论文实验结果表明,该方法显著提升了小型模型在算术推理任务中的能力,特别是在非常小的模型(如T5-Tiny)中效果尤为明显。【详细内容请查看摘要】
链接: https://arxiv.org/abs/2409.12393
作者: Bumjun Kim,Kunha Lee,Juyeon Kim,Sangam Lee
关键词-EN: enabled Large Language, achieve remarkable performance, enabled Large, Large Language Model, including arithmetic problem-solving
类目: Computation and Language (cs.CL)
备注: 6 pages, 2 figures
Abstract:Chain-of-Thought (CoT) reasoning has enabled Large Language Model (LLM) to achieve remarkable performance in various NLP tasks, including arithmetic problem-solving. However, this success does not generalize to small language model (sLM) like T5, due to their limited capacity and absence of emergent abilities associated with larger models. Recent works to enhance sLM through knowledge distillation have yielded some improvements but still face significant limitations, particularly high ambiguity from the variability in natural language expressions and substantial computational costs. In this paper, we investigate why sLM perform poorly on arithmetic reasoning tasks and hypothesize that natural language format variability introduces high ambiguity for these smaller models. Based on this hypothesis, we conduct experiments with equation-only format, which is a reasoning format that unifies arithmetic reasoning previously expressed in natural language formats into mathematical equations. Experiment results demonstrate that equation-only format effectively boosts the arithmetic reasoning abilities of sLM, especially in very small models like T5-Tiny.
摘要:思维链 (Chain-of-Thought, CoT) 推理使得大语言模型 (Large Language Model, LLM) 在包括算术问题解决在内的多种自然语言处理 (NLP) 任务中取得了显著的性能。然而,这种成功并未推广到像 T5 这样的小型语言模型 (small language model, sLM),这主要是因为它们的容量有限,且缺乏与更大模型相关的涌现能力 (emergent abilities)。近期通过知识蒸馏提升 sLM 的工作虽然取得了一些改进,但仍面临显著的局限性,特别是自然语言表达的多变性带来的高歧义性和巨大的计算成本。本文探讨了 sLM 在算术推理任务中表现不佳的原因,并假设自然语言格式的多变性为这些较小的模型引入了高歧义性。基于这一假设,我们进行了仅使用方程格式的实验,这是一种将先前以自然语言格式表达的算术推理统一为数学方程的推理格式。实验结果表明,仅使用方程格式有效地提升了 sLM 的算术推理能力,特别是在像 T5-Tiny 这样的非常小的模型中。
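To make the contrast between a natural-language rationale and an equation-only rationale concrete, here is a small sketch. The exact target format used in the paper is not given in the abstract, so the semicolon-separated `lhs=rhs` layout, the example problem, and the checker function are all illustrative assumptions.

```python
import ast
import operator

# Only the four basic arithmetic operators are allowed in a step.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def _evaluate(node):
    """Safely evaluate a parsed arithmetic expression (numbers and + - * / only)."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    raise ValueError("unsupported expression")

def check_equation_rationale(rationale):
    """Verify every 'lhs=rhs' step and return the final right-hand side."""
    result = None
    for step in rationale.split(";"):
        lhs, rhs = step.split("=")
        value = _evaluate(ast.parse(lhs.strip(), mode="eval"))
        result = float(rhs)
        if abs(value - result) > 1e-9:
            raise ValueError(f"inconsistent step: {step.strip()}")
    return result

# The same (invented) word problem in both rationale styles.
natural_language = "Tom has 3 boxes with 4 pens each, so 12 pens; he buys 5 more, so 17."
equation_only = "3*4=12; 12+5=17"

answer = check_equation_rationale(equation_only)
```

The equation-only string has a single canonical surface form per step, whereas the natural-language rationale could be phrased in many ways. This reduction in variability is the property the abstract hypothesizes helps very small models.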
[NLP-47] Channel-Aware Domain-Adaptive Generative Adversarial Network for Robust Speech Recognition ICASSP2025
该论文旨在解决预训练自动语音识别(ASR)系统在面对未见过的录音环境和条件导致的通道不匹配问题。解决方案的关键在于提出了一种新的通道感知数据模拟方法,该方法结合了通道提取技术和生成对抗网络(GANs)。具体来说,首先训练一个能够从任意音频中提取嵌入的通道编码器,然后利用少量目标域数据提取通道嵌入,并以此指导基于GAN的语音合成器生成具有目标域通道特性的语音,同时保留输入语音的音素内容。这种方法在Hakka Across Taiwan (HAT)和Taiwanese Across Taiwan (TAT)语料库上的实验结果表明,相对于基线模型,字符错误率(CER)分别降低了20.02%和9.64%,验证了该方法在弥合源域和目标域声学差异方面的有效性。【详细内容请查看摘要】
链接: https://arxiv.org/abs/2409.12386
作者: Chien-Chun Wang,Li-Wei Chen,Cheng-Kang Chou,Hung-Shin Lee,Berlin Chen,Hsin-Min Wang
关键词-EN: demonstrate impressive performance, systems demonstrate impressive, unseen recording environments, automatic speech recognition, channel mismatch stemming
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Submitted to ICASSP 2025
Abstract:While pre-trained automatic speech recognition (ASR) systems demonstrate impressive performance on matched domains, their performance often degrades when confronted with channel mismatch stemming from unseen recording environments and conditions. To mitigate this issue, we propose a novel channel-aware data simulation method for robust ASR training. Our method harnesses the synergistic power of channel-extractive techniques and generative adversarial networks (GANs). We first train a channel encoder capable of extracting embeddings from arbitrary audio. On top of this, channel embeddings are extracted using a minimal amount of target-domain data and used to guide a GAN-based speech synthesizer. This synthesizer generates speech that faithfully preserves the phonetic content of the input while mimicking the channel characteristics of the target domain. We evaluate our method on the challenging Hakka Across Taiwan (HAT) and Taiwanese Across Taiwan (TAT) corpora, achieving relative character error rate (CER) reductions of 20.02% and 9.64%, respectively, compared to the baselines. These results highlight the efficacy of our channel-aware data simulation method for bridging the gap between source- and target-domain acoustics.
摘要:尽管预训练的自动语音识别 (ASR) 系统在匹配的领域中表现出令人印象深刻的性能,但当面对来自未见过的录音环境和条件的信道不匹配时,其性能通常会下降。为了缓解这一问题,我们提出了一种新的信道感知数据模拟方法,用于鲁棒的 ASR 训练。我们的方法利用了信道提取技术和生成对抗网络 (GAN) 的协同作用。首先,我们训练了一个能够从任意音频中提取嵌入的信道编码器。在此基础上,使用少量目标域数据提取信道嵌入,并用于指导基于 GAN 的语音合成器。该合成器生成的语音在忠实保留输入语音的音素内容的同时,模仿了目标域的信道特征。我们在具有挑战性的 Hakka Across Taiwan (HAT) 和 Taiwanese Across Taiwan (TAT) 语料库上评估了我们的方法,相较于基线方法,字符错误率 (CER) 分别相对降低了 20.02% 和 9.64%。这些结果突显了我们的信道感知数据模拟方法在弥合源域和目标域声学差异方面的有效性。
[NLP-48] Measuring Sound Symbolism in Audio-visual Models
该论文试图解决的问题是预训练的视听模型是否能够表现出与人类相似的音义关联(即声音象征性)。解决方案的关键在于开发了一个专门的数据集,包含合成的图像和音频样本,并采用非参数方法在零样本设置下评估这些模型。研究发现,特别是在基于语音数据训练的模型中,模型的输出与已知的声音象征性模式存在显著相关性,这表明这些模型能够捕捉类似于人类语言处理的音义连接。【详细内容请查看摘要】
链接: https://arxiv.org/abs/2409.12306
作者: Wei-Cheng Tseng,Yi-Jen Shih,David Harwath,Raymond Mooney
关键词-EN: gained substantial attention, substantial attention recently, demonstrated superior performance, gained substantial, substantial attention
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: SLT 2024
Abstract:Audio-visual pre-trained models have gained substantial attention recently and demonstrated superior performance on various audio-visual tasks. This study investigates whether pre-trained audio-visual models demonstrate non-arbitrary associations between sounds and visual representations – known as sound symbolism – which is also observed in humans. We developed a specialized dataset with synthesized images and audio samples and assessed these models using a non-parametric approach in a zero-shot setting. Our findings reveal a significant correlation between the models’ outputs and established patterns of sound symbolism, particularly in models trained on speech data. These results suggest that such models can capture sound-meaning connections akin to human language processing, providing insights into both cognitive architectures and machine learning strategies.
摘要:视听预训练模型近期获得了广泛关注,并在多种视听任务中展现出卓越性能。本研究探讨了这些预训练的视听模型是否表现出声音与视觉表征之间的非任意关联——即声音象征性——这在人类中也存在。我们开发了一个专门的数据集,包含合成的图像和音频样本,并在零样本设置下采用非参数方法评估了这些模型。研究发现,模型的输出与已知的声音象征性模式之间存在显著相关性,特别是在基于语音数据训练的模型中。这些结果表明,此类模型能够捕捉类似于人类语言处理的声音意义连接,为认知架构和机器学习策略提供了见解。
[NLP-49] RAG-Modulo: Solving Sequential Tasks using Experience Critics and Language Models
该论文试图解决现有基于大型语言模型(LLM)的决策方法在处理复杂、长时程任务时缺乏从过去交互中学习和记忆能力的问题。解决方案的关键在于提出了RAG-Modulo框架,该框架通过引入记忆组件,使LLM代理能够自动检索并整合过去的相关经验作为上下文示例,从而提供更具上下文感知的反馈,增强决策的准确性。此外,通过不断更新记忆,代理能够随时间改进其性能,展现出学习能力。实验结果表明,RAG-Modulo框架在BabyAI和AlfWorld等复杂任务领域中显著提升了任务成功率和效率,超越了现有最先进的基线方法。【详细内容请查看摘要】
链接: https://arxiv.org/abs/2409.12294
作者: Abhinav Jain,Chris Jermaine,Vaibhav Unhelkar
关键词-EN: Large language models, Large language, language models, observation uncertainties, recently emerged
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 8 pages, 5 figures
Abstract:Large language models (LLMs) have recently emerged as promising tools for solving challenging robotic tasks, even in the presence of action and observation uncertainties. Recent LLM-based decision-making methods (also referred to as LLM-based agents), when paired with appropriate critics, have demonstrated potential in solving complex, long-horizon tasks with relatively few interactions. However, most existing LLM-based agents lack the ability to retain and learn from past interactions - an essential trait of learning-based robotic systems. We propose RAG-Modulo, a framework that enhances LLM-based agents with a memory of past interactions and incorporates critics to evaluate the agents’ decisions. The memory component allows the agent to automatically retrieve and incorporate relevant past experiences as in-context examples, providing context-aware feedback for more informed decision-making. Further by updating its memory, the agent improves its performance over time, thereby exhibiting learning. Through experiments in the challenging BabyAI and AlfWorld domains, we demonstrate significant improvements in task success rates and efficiency, showing that the proposed RAG-Modulo framework outperforms state-of-the-art baselines.
摘要:大语言模型 (LLM) 最近作为解决具有动作和观察不确定性的复杂机器人任务的有力工具而崭露头角。基于 LLM 的决策方法(也称为基于 LLM 的智能体),在配备适当的评价器时,展示了在相对较少的交互中解决复杂、长周期任务的潜力。然而,大多数现有的基于 LLM 的智能体缺乏保留和从过去的交互中学习的能力——这是基于学习的机器人系统的基本特征。我们提出了 RAG-Modulo 框架,该框架通过增加对过去交互的记忆并结合评价器来评估智能体的决策,从而增强基于 LLM 的智能体。记忆组件使智能体能够自动检索并整合相关的过去经验作为上下文示例,为更明智的决策提供上下文感知的反馈。此外,通过更新其记忆,智能体随着时间的推移提高其性能,从而表现出学习能力。通过在具有挑战性的 BabyAI 和 AlfWorld 领域进行的实验,我们展示了任务成功率和效率的显著提升,表明所提出的 RAG-Modulo 框架优于最先进的基线方法。
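The memory component described above, which retrieves relevant past interactions and uses them as in-context examples, can be sketched roughly as follows. This is a simplified illustration under assumptions: the class and function names are hypothetical, word-overlap (Jaccard) similarity stands in for whatever retriever the paper actually uses, and the tasks are invented BabyAI-style strings.

```python
def jaccard(a, b):
    """Word-overlap similarity between two task descriptions."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

class ExperienceMemory:
    """Stores past (task, action, critic feedback) triples and retrieves the most similar ones."""

    def __init__(self):
        self.entries = []

    def add(self, task, action, feedback):
        self.entries.append((task, action, feedback))

    def retrieve(self, task, k=2):
        ranked = sorted(self.entries, key=lambda e: jaccard(task, e[0]), reverse=True)
        return ranked[:k]

def build_prompt(task, memory, k=2):
    """Prepend retrieved experiences as in-context examples before the new task."""
    blocks = []
    for past_task, action, feedback in memory.retrieve(task, k):
        blocks.append(f"Task: {past_task}\nAction: {action}\nCritic: {feedback}")
    blocks.append(f"Task: {task}\nAction:")
    return "\n\n".join(blocks)

memory = ExperienceMemory()
memory.add("pick up the blue ball", "goto ball; pickup", "ok")
memory.add("open the green door", "goto door; toggle", "ok")
memory.add("go to the red key", "pickup key", "error: task only requires reaching the key")

best = memory.retrieve("pick up the red ball", k=1)
prompt = build_prompt("pick up the red ball", memory, k=2)
```

Calling `memory.add(...)` after every episode is what lets the agent improve over time in this sketch, since later prompts draw on a growing pool of experiences, including ones the critic flagged as errors.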
[NLP-50] Making Large Language Models into World Models with Precondition and Effect Knowledge
该论文试图解决如何利用大型语言模型(LLMs)作为世界模型的问题,即如何使LLMs具备预测动作适用性和执行动作后世界状态变化的能力。解决方案的关键在于通过微调两个独立的LLMs,一个用于预测动作的前提条件(precondition prediction),另一个用于预测动作执行后的效果(effect prediction),并利用合成数据生成技术进行训练。通过这种方式,模型能够生成与人类理解相符的世界动态知识,并支持动作链的创建,从而为规划提供必要的支持。【详细内容请查看摘要】
链接: https://arxiv.org/abs/2409.12278
作者: Kaige Xie,Ian Yang,John Gunerli,Mark Riedl
关键词-EN: actions affect environments, Large Language Models, affect environments, intelligent agents, functioning of intelligent
类目: Computation and Language (cs.CL)
备注:
Abstract:World models, which encapsulate the dynamics of how actions affect environments, are foundational to the functioning of intelligent agents. In this work, we explore the potential of Large Language Models (LLMs) to operate as world models. Although LLMs are not inherently designed to model real-world dynamics, we show that they can be induced to perform two critical world model functions: determining the applicability of an action based on a given world state, and predicting the resulting world state upon action execution. This is achieved by fine-tuning two separate LLMs, one for precondition prediction and another for effect prediction, while leveraging synthetic data generation techniques. Through human-participant studies, we validate that the precondition and effect knowledge generated by our models aligns with human understanding of world dynamics. We also analyze the extent to which the world model trained on our synthetic data results in an inferred state space that supports the creation of action chains, a necessary property for planning.
摘要:世界模型,即封装了动作如何影响环境动态的模型,是智能体功能的基础。在本研究中,我们探讨了大语言模型 (LLM) 作为世界模型的潜力。尽管 LLM 并非天生设计用于建模现实世界动态,但我们展示了它们可以通过引导执行两个关键的世界模型功能:基于给定世界状态确定动作的适用性,以及预测动作执行后的世界状态。这一目标通过微调两个独立的 LLM 实现——一个用于前提条件预测,另一个用于效果预测——同时利用合成数据生成技术。通过人类参与者研究,我们验证了由我们的模型生成的前提条件和效果知识与人类对世界动态的理解相一致。我们还分析了基于我们合成数据训练的世界模型所推导出的状态空间在多大程度上支持动作链的创建,这是规划的必要属性。
[NLP-51] MQA-KEAL: Multi-hop Question Answering under Knowledge Editing for Arabic Language
该论文试图解决大型语言模型(LLMs)在阿拉伯语环境下知识编辑和多跳问答(MQA)的问题。解决方案的关键在于提出了MQA-KEAL方法,该方法通过将知识编辑存储为外部记忆中的结构化知识单元,并利用任务分解技术将复杂问题分解为子问题,然后迭代查询外部记忆和目标LLM以生成最终答案。此外,论文还贡献了MQUAKE-AR基准和新的MQA-AEVAL基准,用于严格评估阿拉伯语环境下MQA在知识编辑中的性能。【详细内容请查看摘要】
链接: https://arxiv.org/abs/2409.12257
作者: Muhammad Asif Ali,Nawal Daftardar,Mutayyaba Waheed,Jianbin Qin,Di Wang
关键词-EN: Large Language Models, numerous application domains, Large Language, demonstrated significant capabilities, application domains
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) have demonstrated significant capabilities across numerous application domains. A key challenge is to keep these models updated with latest available information, which limits the true potential of these models for the end-applications. Although, there have been numerous attempts for LLMs Knowledge Editing (KE), i.e., to edit the LLMs prior knowledge and in turn test it via Multi-hop Question Answering (MQA), yet so far these studies are primarily focused on English language. To bridge this gap, in this paper we propose: Multi-hop Questioning Answering under Knowledge Editing for Arabic Language (MQA-KEAL). MQA-KEAL stores knowledge edits as structured knowledge units in the external memory. In order to solve multi-hop question, it first uses task-decomposition to decompose the question into smaller sub-problems. Later for each sub-problem, it iteratively queries the external memory and/or target LLM in order to generate the final response. In addition, we also contribute MQUAKE-AR (Arabic translation of English benchmark MQUAKE), as well as a new benchmark MQA-AEVAL for rigorous performance evaluation of MQA under KE for Arabic language. Experimentation evaluation reveals MQA-KEAL outperforms the baseline models by a significant margin.
摘要:大语言模型 (LLMs) 在众多应用领域展示了显著的能力。一个关键挑战是如何使这些模型保持最新可用信息的更新,这限制了这些模型在终端应用中的真正潜力。尽管已有许多尝试用于大语言模型的知识编辑 (KE),即编辑大语言模型的先验知识并通过多跳问答 (MQA) 进行测试,但迄今为止这些研究主要集中在英语语言上。为了填补这一空白,本文提出:针对阿拉伯语言的知识编辑下的多跳问答 (MQA-KEAL)。MQA-KEAL 将知识编辑存储为外部存储器中的结构化知识单元。为了解决多跳问题,它首先使用任务分解将问题分解为较小的子问题。随后,对于每个子问题,它迭代查询外部存储器和/或目标大语言模型以生成最终响应。此外,我们还贡献了 MQUAKE-AR(英语基准 MQUAKE 的阿拉伯语翻译),以及一个新的基准 MQA-AEVAL,用于严格评估阿拉伯语言在知识编辑下的多跳问答性能。实验评估显示,MQA-KEAL 显著优于基线模型。
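The query loop described above, which decomposes the question and then consults the external edit memory before falling back to the model for each sub-problem, can be sketched as below. Several things here are assumptions for illustration: the decomposition is taken as given, the edit memory is a plain dictionary keyed by sub-question rather than structured knowledge units, the base model is a lookup table, and all entities are fictional placeholders.

```python
def answer_multi_hop(sub_questions, edit_memory, base_model):
    """Answer a chain of sub-questions, preferring edited facts over the base model."""
    entity = None
    for template in sub_questions:
        # Each sub-question may refer to the answer of the previous hop via {prev}.
        question = template.format(prev=entity)
        if question in edit_memory:
            entity = edit_memory[question]   # an edited fact overrides the model
        else:
            entity = base_model(question)    # otherwise fall back to the model
    return entity

# Stand-in for the target LLM's (possibly outdated) knowledge; entities are fictional.
stale_knowledge = {
    "Who is the author of Novel-X?": "Author-A",
    "What is the country of citizenship of Author-A?": "Country-1",
    "What is the country of citizenship of Author-B?": "Country-2",
}

# A single knowledge edit held in external memory.
edit_memory = {"Who is the author of Novel-X?": "Author-B"}

hops = [
    "Who is the author of Novel-X?",
    "What is the country of citizenship of {prev}?",
]

without_edit = answer_multi_hop(hops, {}, stale_knowledge.get)
with_edit = answer_multi_hop(hops, edit_memory, stale_knowledge.get)
```

The example shows why multi-hop questions are a demanding test of knowledge editing: changing the first-hop fact must propagate to the second hop, so the final answer changes even though the second fact itself was never edited.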
[NLP-52] ARTICLE: Annotator Reliability Through In-Context Learning
该论文试图解决在自然语言处理(NLP)中,确保训练和评估数据标注者质量的问题。由于情感分析和攻击性言论检测等任务具有主观性,传统的质量评估方法难以区分标注者之间的分歧是由于工作质量差还是观点差异。论文提出的解决方案是ARTICLE,这是一个基于上下文学习(ICL)的框架,通过自我一致性来估计标注质量。该框架的关键在于利用大型语言模型(LLMs)来评估标注的一致性,从而识别出可靠的标注者,提高数据质量。【详细内容请查看摘要】
链接: https://arxiv.org/abs/2409.12218
作者: Sujan Dutta,Deepak Pandita,Tharindu Cyril Weerasooriya,Marcos Zampieri,Christopher M. Homan,Ashiqur R. KhudaBukhsh
关键词-EN: training and evaluation, key piece, piece of machine, learning in NLP, NLP
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Ensuring annotator quality in training and evaluation data is a key piece of machine learning in NLP. Tasks such as sentiment analysis and offensive speech detection are intrinsically subjective, creating a challenging scenario for traditional quality assessment approaches because it is hard to distinguish disagreement due to poor work from that due to differences of opinions between sincere annotators. With the goal of increasing diverse perspectives in annotation while ensuring consistency, we propose ARTICLE, an in-context learning (ICL) framework to estimate annotation quality through self-consistency. We evaluate this framework on two offensive speech datasets using multiple LLMs and compare its performance with traditional methods. Our findings indicate that ARTICLE can be used as a robust method for identifying reliable annotators, hence improving data quality.
摘要:在自然语言处理 (NLP) 的机器学习中,确保训练和评估数据中的标注者质量是一个关键环节。诸如情感分析和冒犯性言论检测等任务本质上是主观的,这为传统的质量评估方法带来了挑战,因为很难区分由于标注者工作质量差而导致的分歧与由于真诚标注者之间的意见差异而导致的分歧。我们的目标是增加标注中的多样性视角,同时确保一致性,为此我们提出了 ARTICLE,这是一个通过自一致性估计标注质量的上下文学习 (ICL) 框架。我们在两个冒犯性言论数据集上使用多个大语言模型 (LLM) 评估了该框架,并将其性能与传统方法进行了比较。我们的研究结果表明,ARTICLE 可以作为一种稳健的方法来识别可靠的标注者,从而提高数据质量。
[NLP-53] WaveletGPT: Wavelets Meet Large Language Models
该论文试图解决的问题是如何在预训练大型语言模型(LLMs)时更高效地利用数据的多尺度结构。解决方案的关键在于将传统的信号处理技术——小波变换(wavelets)融入到LLMs的预训练过程中,通过在Transformer解码器块中引入不同时间分辨率的中间嵌入(intermediate embeddings),从而在不增加额外参数的情况下,显著提升预训练速度和性能。这种方法不仅加速了预训练过程,还通过优化内部结构而非单纯扩大模型规模来提升模型性能。【详细内容请查看摘要】
链接: https://arxiv.org/abs/2409.12924
作者: Prateek Verma
关键词-EN: Large Language Models, Large Language, artificial intelligence advancements, intelligence advancements impacting, Language Models
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 16 pages, 4 figures
Abstract:Large Language Models (LLMs) have ushered in a new wave of artificial intelligence advancements impacting every scientific field and discipline. They are trained on a simple objective: to predict the next token given the previous context. We live in a world where most of the data around us, e.g., text, audio, and music, has a multi-scale structure associated with it. This paper infuses LLMs with traditional signal processing ideas, namely wavelets, during pre-training to take advantage of the structure. Without adding any extra parameters to a GPT-style LLM architecture, we achieve the same pre-training performance almost twice as fast in text, raw audio, and symbolic music. This is achieved by imposing a structure on intermediate embeddings. When trained for the same number of training steps, we achieve significant gains in performance, which is comparable to pre-training a larger neural architecture. Our architecture allows every next token prediction access to intermediate embeddings at different temporal resolutions in every Transformer decoder block. This work will hopefully pave the way for incorporating multi-rate signal processing ideas into traditional LLM pre-training. Further, we showcase pushing model performance by improving internal structure instead of just going after scale.
摘要:大语言模型 (LLMs) 引领了人工智能领域的新一轮进步,影响着每一个科学领域和学科。它们基于一个简单的目标进行训练:根据之前的上下文预测下一个 Token。我们生活在一个数据大多具有多尺度结构的世界中,例如文本、音频和音乐。本文在预训练阶段将传统信号处理思想,即小波 (wavelets),融入到大语言模型中,以利用这种结构。在不向 GPT 风格的大语言模型架构添加任何额外参数的情况下,我们在文本、原始音频和符号音乐上以几乎快一倍的速度达到了相同的预训练性能。这是通过在中间嵌入上施加结构实现的。在相同训练步数下,我们实现了显著的性能提升,这相当于预训练一个更大的神经网络架构。我们的架构允许每个 Transformer 解码器块中的下一个 Token 预测访问不同时间分辨率下的中间嵌入。这项工作有望为将多速率信号处理思想融入传统大语言模型预训练铺平道路。此外,我们展示了通过改进内部结构而非仅仅追求规模来提升模型性能的方法。
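One parameter-free way to give intermediate embeddings "different temporal resolutions" is a causal running mean whose window length depends on the embedding coordinate. The sketch below illustrates that idea only. The abstract does not spell out the exact operation, so the running-mean choice (similar to a Haar-style approximation), the function name, and the kernel sizes are assumptions rather than the paper's confirmed design.

```python
def causal_multiscale(embeddings, kernel_sizes):
    """Replace coordinate j of each token embedding with its causal running mean
    over the last kernel_sizes[j] tokens (current token included)."""
    out = []
    for t in range(len(embeddings)):
        row = []
        for j, k in enumerate(kernel_sizes):
            start = max(0, t - k + 1)  # never look at future tokens
            window = [embeddings[s][j] for s in range(start, t + 1)]
            row.append(sum(window) / len(window))
        out.append(row)
    return out

# Four tokens with two-dimensional toy embeddings (invented values).
tokens = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0], [7.0, 7.0]]

# Coordinate 0 keeps full resolution (window 1); coordinate 1 is smoothed over 2 tokens.
smoothed = causal_multiscale(tokens, kernel_sizes=[1, 2])
```

Because the operation is a fixed average over past positions, it adds no trainable parameters and preserves the causal masking needed for next-token prediction, which is consistent with the properties the abstract emphasizes.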
[NLP-54] Robust Audiovisual Speech Recognition Models with Mixture-of-Experts
该论文试图解决在“自然场景”视频中进行鲁棒的视听语音识别的问题。解决方案的关键在于引入了一种名为EVA的模型,该模型利用了混合专家(mixture-of-experts)机制来有效整合视觉信息,并通过轻量级投影将视觉信息映射到语音空间。此外,EVA基于一个预训练的鲁棒语音识别模型构建,确保了其在不同视频场景中的泛化能力。通过这种方式,EVA在多个基准测试中实现了最先进的结果,证明了其在多样视频领域中的广泛适用性。【详细内容请查看摘要】
链接: https://arxiv.org/abs/2409.12370
作者: Yihan Wu,Yifan Peng,Yichen Lu,Xuankai Chang,Ruihua Song,Shinji Watanabe
关键词-EN: providing additional contextual, additional contextual information, speech recognition, audiovisual speech recognition, speech recognition accuracy
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
备注: 6 pages, 2 figures, accepted by IEEE Spoken Language Technology Workshop 2024
Abstract:Visual signals can enhance audiovisual speech recognition accuracy by providing additional contextual information. Given the complexity of visual signals, an audiovisual speech recognition model requires robust generalization capabilities across diverse video scenarios, presenting a significant challenge. In this paper, we introduce EVA, leveraging the mixture-of-Experts for audioVisual ASR to perform robust speech recognition for "in-the-wild" videos. Specifically, we first encode visual information into visual tokens sequence and map them into speech space by a lightweight projection. Then, we build EVA upon a robust pretrained speech recognition model, ensuring its generalization ability. Moreover, to incorporate visual information effectively, we inject visual information into the ASR model through a mixture-of-experts module. Experiments show our model achieves state-of-the-art results on three benchmarks, which demonstrates the generalization ability of EVA across diverse video domains.
摘要:视觉信号通过提供额外的上下文信息,可以增强视听语音识别的准确性。鉴于视觉信号的复杂性,视听语音识别模型需要在多样化的视频场景中具备强大的泛化能力,这是一个显著的挑战。本文中,我们介绍了 EVA,利用音频视觉 ASR 的专家混合模型 (mixture-of-Experts) 来实现对“自然场景”视频的鲁棒语音识别。具体来说,我们首先将视觉信息编码为视觉 Token 序列,并通过轻量级投影将其映射到语音空间。然后,我们在一个鲁棒的预训练语音识别模型基础上构建 EVA,以确保其泛化能力。此外,为了有效整合视觉信息,我们通过专家混合模块将视觉信息注入到 ASR 模型中。实验表明,我们的模型在三个基准测试中达到了最先进的结果,这证明了 EVA 在多样化视频领域中的泛化能力。
人工智能
[AI-0] Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner
链接: https://arxiv.org/abs/2409.12963
作者: Yuzhang Shang,Bingxin Xu,Weitai Kang,Mu Cai,Yuheng Li,Zehao Wen,Zhen Dong,Kurt Keutzer,Yong Jae Lee,Yan Yan
关键词-EN: Large Language Models, Advancements in Large, Language Models, Large Language, integrating video modalities
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Advancements in Large Language Models (LLMs) inspire various strategies for integrating video modalities. A key approach is Video-LLMs, which incorporate an optimizable interface linking sophisticated video encoders to LLMs. However, due to computation and data limitations, these Video-LLMs are typically pre-trained to process only short videos, limiting their broader application for understanding longer video content. Additionally, fine-tuning Video-LLMs to handle longer videos is cost-prohibitive. Consequently, it becomes essential to explore the interpolation of Video-LLMs under a completely training-free setting. In this paper, we first identify the primary challenges in interpolating Video-LLMs: (1) the video encoder and modality alignment projector are fixed, preventing the integration of additional frames into Video-LLMs, and (2) the LLM backbone is limited in its content length capabilities, which complicates the processing of an increased number of video tokens. To address these challenges, we propose a specific INTerPolation method for Video-LLMs (INTP-Video-LLMs). We introduce an alternative video token rearrangement technique that circumvents limitations imposed by the fixed video encoder and alignment projector. Furthermore, we introduce a training-free LLM context window extension method to enable Video-LLMs to understand a correspondingly increased number of visual tokens.
[AI-1] MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines
链接: https://arxiv.org/abs/2409.12959
作者: Dongzhi Jiang,Renrui Zhang,Ziyu Guo,Yanmin Wu,Jiayi Lei,Pengshuo Qiu,Pan Lu,Zehui Chen,Guanglu Song,Peng Gao,Yu Liu,Chunyuan Li,Hongsheng Li
关键词-EN: Large Language Models, Large Multimodal Models, Large Language, Language Models, multimodal search
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Project Page: this https URL
Abstract:The advent of Large Language Models (LLMs) has paved the way for AI search engines, e.g., SearchGPT, showcasing a new paradigm in human-internet interaction. However, most current AI search engines are limited to text-only settings, neglecting the multimodal user queries and the text-image interleaved nature of website information. Recently, Large Multimodal Models (LMMs) have made impressive strides. Yet, whether they can function as AI search engines remains under-explored, leaving the potential of LMMs in multimodal search an open question. To this end, we first design a delicate pipeline, MMSearch-Engine, to empower any LMMs with multimodal search capabilities. On top of this, we introduce MMSearch, a comprehensive evaluation benchmark to assess the multimodal search performance of LMMs. The curated dataset contains 300 manually collected instances spanning 14 subfields, which involves no overlap with the current LMMs’ training data, ensuring the correct answer can only be obtained within searching. By using MMSearch-Engine, the LMMs are evaluated by performing three individual tasks (requery, rerank, and summarization), and one challenging end-to-end task with a complete searching process. We conduct extensive experiments on closed-source and open-source LMMs. Among all tested models, GPT-4o with MMSearch-Engine achieves the best results, which surpasses the commercial product, Perplexity Pro, in the end-to-end task, demonstrating the effectiveness of our proposed pipeline. We further present error analysis to unveil current LMMs still struggle to fully grasp the multimodal search tasks, and conduct ablation study to indicate the potential of scaling test-time computation for AI search engine. We hope MMSearch may provide unique insights to guide the future development of multimodal AI search engine. Project Page: this https URL
[AI-2] MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions
Link: https://arxiv.org/abs/2409.12958
Authors: Abdullatif Köksal, Marion Thaler, Ayyoob Imani, Ahmet Üstün, Anna Korhonen, Hinrich Schütze
Keywords: tuning enhances large, Instruction tuning enhances, enhances large language, Instruction tuning, instruction tuning datasets
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:Instruction tuning enhances large language models (LLMs) by aligning them with human preferences across diverse tasks. Traditional approaches to create instruction tuning datasets face serious challenges for low-resource languages due to their dependence on data annotation. This work introduces a novel method, Multilingual Reverse Instructions (MURI), which generates high-quality instruction tuning datasets for low-resource languages without requiring human annotators or pre-existing multilingual models. Utilizing reverse instructions and a translation pipeline, MURI produces instruction-output pairs from existing human-written texts in low-resource languages. This method ensures cultural relevance and diversity by sourcing texts from different native domains and applying filters to eliminate inappropriate content. Our dataset, MURI-IT, includes more than 2 million instruction-output pairs across 200 languages. Evaluation by native speakers and fine-tuning experiments with mT5 models demonstrate the approach’s effectiveness for both NLU and open-ended generation. We publicly release datasets and models at this https URL.
[AI-3] JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images
Link: https://arxiv.org/abs/2409.12953
Authors: Zhecan Wang, Junzhang Liu, Chia-Wei Tang, Hani Alomari, Anushka Sivakumar, Rui Sun, Wenhao Li, Md. Atabuzzaman, Hammad Ayyubi, Haoxuan You, Alvi Ishmam, Kai-Wei Chang, Shih-Fu Chang, Chris Thomas
Keywords: benchmarks largely consist, usual contexts, Existing vision-language understanding, largely consist, vision-language understanding benchmarks
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract:Existing vision-language understanding benchmarks largely consist of images of objects in their usual contexts. As a consequence, recent multimodal large language models can perform well with only a shallow visual understanding by relying on background language biases. Thus, strong performance on these benchmarks does not necessarily correlate with strong visual understanding. In this paper, we release JourneyBench, a comprehensive human-annotated benchmark of generated images designed to assess the model’s fine-grained multimodal reasoning abilities across five tasks: complementary multimodal chain of thought, multi-image VQA, imaginary image captioning, VQA with hallucination triggers, and fine-grained retrieval with sample-specific distractors. Unlike existing benchmarks, JourneyBench explicitly requires fine-grained multimodal reasoning in unusual imaginary scenarios where language bias and holistic image gist are insufficient. We benchmark state-of-the-art models on JourneyBench and analyze performance along a number of fine-grained dimensions. Results across all five tasks show that JourneyBench is exceptionally challenging for even the best models, indicating that models’ visual reasoning abilities are not as strong as they first appear. We discuss the implications of our findings and propose avenues for further research.
[AI-4] Re-Introducing LayerNorm: Geometric Meaning, Irreversibility and a Comparative Study with RMSNorm
Link: https://arxiv.org/abs/2409.12951
Authors: Akshat Gupta, Atahan Ozdemir, Gopala Anumanchipalli
Keywords: uniform vector, Layer normalization, vector, transformer architecture, LayerNorm
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract:Layer normalization is a pivotal step in the transformer architecture. This paper delves into the less explored geometric implications of this process, examining how LayerNorm influences the norm and orientation of hidden vectors in the representation space. We show that the definition of LayerNorm is innately linked to the uniform vector, defined as $\boldsymbol{1} = [1, 1, 1, 1, \cdots, 1]^T \in \mathbb{R}^d$. We then show that the standardization step in LayerNorm can be understood in three simple steps: (i) remove the component of a vector along the uniform vector, (ii) normalize the remaining vector, and (iii) scale the resultant vector by $\sqrt{d}$, where $d$ is the dimensionality of the representation space. We also introduce the property of “irreversibility” for LayerNorm, where we show that the information lost during the normalization process cannot be recovered. In other words, unlike batch normalization, LayerNorm cannot learn an identity transform. While we present possible arguments for removing the component along the uniform vector, the choice of removing this component seems arbitrary and not well motivated by the original authors. To evaluate the usefulness of this step, we compare the hidden representations of LayerNorm-based LLMs with models trained using RMSNorm and show that all LLMs naturally align representations orthogonal to the uniform vector, presenting the first mechanistic evidence that removing the component along the uniform vector in LayerNorm is a redundant step. Our findings support the use of RMSNorm over LayerNorm as it is not only more computationally efficient with comparable downstream performance, but also learns a similar distribution of hidden representations that operate orthogonal to the uniform vector.
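The three-step reading of the standardization step can be checked numerically. The sketch below is not the authors' code: the function names are made up for illustration, and it assumes the plain standardization without the epsilon term or the learned scale and bias. It compares the textbook form (subtract the mean, divide by the population standard deviation) with the geometric form described in the abstract.

```python
import math

def standardize(x):
    """Textbook form: subtract the mean, divide by the population standard deviation."""
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d
    return [(v - mean) / math.sqrt(var) for v in x]

def standardize_geometric(x):
    """Geometric form (hypothetical helper name), following steps (i)-(iii)."""
    d = len(x)
    # (i) Remove the component along the uniform vector 1 = [1, ..., 1]^T.
    #     The projection of x onto 1 is ((x . 1) / (1 . 1)) * 1 = mean(x) * 1.
    coeff = sum(x) / d
    residual = [v - coeff for v in x]
    # (ii) Normalize the remaining vector to unit length.
    norm = math.sqrt(sum(r * r for r in residual))
    unit = [r / norm for r in residual]
    # (iii) Scale the result by sqrt(d).
    return [math.sqrt(d) * u for u in unit]

x = [1.0, 2.0, 4.0, 7.0]
a = standardize(x)
b = standardize_geometric(x)
print(max(abs(p - q) for p, q in zip(a, b)))  # difference is at floating-point noise level
```

The two forms agree because the norm of the residual equals $\sqrt{d}$ times the population standard deviation, so dividing by the norm and multiplying by $\sqrt{d}$ is the same as dividing by the standard deviation. The output always sums to zero, i.e. it is orthogonal to the uniform vector.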
[AI-5] MaskMol: Knowledge-guided Molecular Image Pre-Training Framework for Activity Cliffs
Link: https://arxiv.org/abs/2409.12926
Authors: Zhixiang Cheng, Hongxin Xiang, Pengsen Ma, Li Zeng, Xin Jin, Xixi Yang, Jianxin Lin, Yang Deng, Bosheng Song, Xinxin Feng, Changhui Deng, Xiangxiang Zeng
Keywords: show significant differences, refer to pairs, pairs of molecules, structurally similar, similar but show
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 33 pages, 5 figures
Abstract:Activity cliffs, which refer to pairs of molecules that are structurally similar but show significant differences in their potency, can lead to model representation collapse and make it challenging for the model to distinguish them. Our research indicates that as molecular similarity increases, graph-based methods struggle to capture these nuances, whereas image-based approaches effectively retain the distinctions. Thus, we developed MaskMol, a knowledge-guided molecular image self-supervised learning framework. MaskMol accurately learns the representation of molecular images by considering multiple levels of molecular knowledge, such as atoms, bonds, and substructures. By utilizing pixel masking tasks, MaskMol extracts fine-grained information from molecular images, overcoming the limitations of existing deep learning models in identifying subtle structural changes. Experimental results demonstrate MaskMol’s high accuracy and transferability in activity cliff estimation and compound potency prediction across 20 different macromolecular targets, outperforming 25 state-of-the-art deep learning and machine learning approaches. Visualization analyses reveal MaskMol’s high biological interpretability in identifying activity cliff-relevant molecular substructures. Notably, through MaskMol, we identified candidate EP4 inhibitors that could be used to treat tumors. This study not only raises awareness about activity cliffs but also introduces a novel method for molecular image representation learning and virtual screening, advancing drug discovery and providing new insights into structure-activity relationships (SAR).
[AI-6] AI Thinking: A framework for rethinking artificial intelligence in practice
Link: https://arxiv.org/abs/2409.12922
Authors: Denis Newman-Griffis
Keywords: Artificial intelligence, intelligence is transforming, Thinking, disciplines, Artificial
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 30 pages, 2 figures
Abstract:Artificial intelligence is transforming the way we work with information across disciplines and practical contexts. A growing range of disciplines are now involved in studying, developing, and assessing the use of AI in practice, but these disciplines often employ conflicting understandings of what AI is and what is involved in its use. New, interdisciplinary approaches are needed to bridge competing conceptualisations of AI in practice and help shape the future of AI use. I propose a novel conceptual framework called AI Thinking, which models key decisions and considerations involved in AI use across disciplinary perspectives. The AI Thinking model addresses five practice-based competencies involved in applying AI in context: motivating AI use in information processes, formulating AI methods, assessing available tools and technologies, selecting appropriate data, and situating AI in the sociotechnical contexts it is used in. A hypothetical case study is provided to illustrate the application of AI Thinking in practice. This article situates AI Thinking in broader cross-disciplinary discourses of AI, including its connections to ongoing discussions around AI literacy and AI-driven innovation. AI Thinking can help to bridge divides between academic disciplines and diverse contexts of AI use, and to reshape the future of AI in practice.
[AI-7] Swine Diet Design using Multi-objective Regionalized Bayesian Optimization
Link: https://arxiv.org/abs/2409.12919
Authors: Gabriel D. Uribe-Guerra, Danny A. Múnera-Ramírez, Julián D. Arias-Londoño
Keywords: develop cost-effective formulations, balancing minimum nutritional, multi-objective Bayesian optimization, minimum nutritional content, Bayesian optimization
Subjects: Artificial Intelligence (cs.AI)
Comments: 21 pages, 7 figures
Abstract:The design of food diets in the context of animal nutrition is a complex problem that aims to develop cost-effective formulations while balancing minimum nutritional content. Traditional approaches based on theoretical models of metabolic responses and concentrations of digestible energy in raw materials face limitations in incorporating zootechnical or environmental variables affecting the performance of animals and including multiple objectives aligned with sustainable development policies. Recently, multi-objective Bayesian optimization has been proposed as a promising heuristic alternative able to deal with the combination of multiple sources of information, multiple and diverse objectives, and with an intrinsic capacity to deal with uncertainty in the measurements that could be related to variability in the nutritional content of raw materials. However, Bayesian optimization encounters difficulties in high-dimensional search spaces, leading to exploration predominantly at the boundaries. This work analyses a strategy to split the search space into regions that provide local candidates termed multi-objective regionalized Bayesian optimization as an alternative to improve the quality of the Pareto set and Pareto front approximation provided by BO in the context of swine diet design. Results indicate that this regionalized approach produces more diverse non-dominated solutions compared to the standard multi-objective Bayesian optimization. Besides, the regionalized strategy was four times more effective in finding solutions that outperform those identified by a stochastic programming approach referenced in the literature. Experiments using batches of query candidate solutions per iteration show that the optimization process can also be accelerated without compromising the quality of the Pareto set approximation during the initial, most critical phase of optimization.
[AI-8] Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization
Link: https://arxiv.org/abs/2409.12903
Authors: Mohammad Samragh, Iman Mirzadeh, Keivan Alizadeh Vahid, Fartash Faghri, Minsik Cho, Moin Nabi, Devang Naik, Mehrdad Farajtabar
Keywords: language models, large language models, begins with randomly, models, model
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:The pre-training phase of language models often begins with randomly initialized parameters. With the current trends in scaling models, training their large number of parameters can be extremely slow and costly. In contrast, small language models are less expensive to train, but they often cannot achieve the accuracy of large models. In this paper, we explore an intriguing idea to connect these two different regimes: Can we develop a method to initialize large language models using smaller pre-trained models? Will such initialization bring any benefits in terms of training time and final accuracy? In this paper, we introduce HyperCloning, a method that can expand the parameters of a pre-trained language model to those of a larger model with increased hidden dimensions. Our method ensures that the larger model retains the functionality of the smaller model. As a result, the larger model already inherits the predictive power and accuracy of the smaller model before the training starts. We demonstrate that training such an initialized model results in significant savings in terms of GPU hours required for pre-training large language models.
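The abstract does not spell out how parameters are expanded, so the following is only a toy illustration of what "the larger model retains the functionality of the smaller model" can mean for a single linear layer. The block-replication rule and the names `matvec` and `clone_linear` are assumptions made for this sketch, not the paper's confirmed method.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def clone_linear(W):
    """Expand an (out x in) weight matrix to (2*out x 2*in).

    Assumed rule for illustration: tile W into a 2x2 grid of blocks, each W / 2.
    When fed a duplicated input [x; x], the expanded layer outputs [Wx; Wx].
    """
    half = [[w / 2.0 for w in row] for row in W]
    top = [row + row for row in half]      # [W/2, W/2]
    return top + [list(r) for r in top]    # stacked twice

W = [[1.0, -2.0], [0.5, 3.0]]
x = [2.0, 1.0]

small_out = matvec(W, x)                   # output of the small layer
big_out = matvec(clone_linear(W), x + x)   # expanded layer on the duplicated input
print(small_out, big_out)
```

Each half of `big_out` reproduces `small_out`, so the wider layer starts from the smaller layer's function rather than from random weights, which is the property the abstract attributes to HyperCloning.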
[AI-9] Recognition of Harmful Phytoplankton from Microscopic Images using Deep Learning
Link: https://arxiv.org/abs/2409.12900
Authors: Aymane Khaldi, Rohaifa Khaldi
Keywords: preserving aquatic ecosystems, ensuring environmental protection, Monitoring plankton distribution, plankton distribution, aquatic ecosystems
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures
Abstract:Monitoring plankton distribution, particularly harmful phytoplankton, is vital for preserving aquatic ecosystems, regulating the global climate, and ensuring environmental protection. Traditional methods for monitoring are often time-consuming, expensive, error-prone, and unsuitable for large-scale applications, highlighting the need for accurate and efficient automated systems. In this study, we evaluate several state-of-the-art CNN models, including ResNet, ResNeXt, DenseNet, and EfficientNet, using three transfer learning approaches: linear probing, fine-tuning, and a combined approach, to classify eleven harmful phytoplankton genera from microscopic images. The best performance was achieved by ResNet-50 using the fine-tuning approach, with an accuracy of 96.97%. The results also revealed that the models struggled to differentiate between four harmful phytoplankton types with similar morphological features.
[AI-10] Can VLMs Play Action Role-Playing Games? Take Black Myth Wukong as a Study Case
Link: https://arxiv.org/abs/2409.12889
Authors: Peng Chen, Pi Bu, Jun Song, Yuan Gao, Bo Zheng
Keywords: large language model, Recently, LLM, action, large language
Subjects: Artificial Intelligence (cs.AI)
Abstract:Recently, large language model (LLM)-based agents have made significant advances across various fields. One of the most popular research areas involves applying these agents to video games. Traditionally, these methods have relied on game APIs to access in-game environmental and action data. However, this approach is limited by the availability of APIs and does not reflect how humans play games. With the advent of vision language models (VLMs), agents now have enhanced visual understanding capabilities, enabling them to interact with games using only visual inputs. Despite these advances, current approaches still face challenges in action-oriented tasks, particularly in action role-playing games (ARPGs), where reinforcement learning methods are prevalent but suffer from poor generalization and require extensive training. To address these limitations, we select an ARPG, “Black Myth: Wukong”, as a research platform to explore the capability boundaries of existing VLMs in scenarios requiring visual-only input and complex action output. We define 12 tasks within the game, with 75% focusing on combat, and incorporate several state-of-the-art VLMs into this benchmark. Additionally, we will release a human operation dataset containing recorded gameplay videos and operation logs, including mouse and keyboard actions. Moreover, we propose a novel VARP (Vision Action Role-Playing) agent framework, consisting of an action planning system and a visual trajectory system. Our framework demonstrates the ability to perform basic tasks and succeed in 90% of easy and medium-level combat scenarios. This research aims to provide new insights and directions for applying multimodal agents in complex action game environments. The code and datasets will be made available at this https URL.
[AI-11] Improving Prototypical Parts Abstraction for Case-Based Reasoning Explanations Designed for the Kidney Stone Type Recognition
Link: https://arxiv.org/abs/2409.12883
Authors: Daniel Flores-Araiza, Francisco Lopez-Tiro, Clément Larose, Salvador Hinojosa, Andres Mendez-Vazquez, Miguel Gonzalez-Mendoza, Gilberto Ochoa-Ruiz, Christian Daul
Keywords: kidney stone types, calculi extraction process, diminishing infection risks, major medical advance, tedious renal calculi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Paper submitted to Artificial Intelligence in Medicine (AIIM), Elsevier
Abstract:The in-vivo identification of kidney stone types during a ureteroscopy would be a major medical advance in urology, as it could reduce the time of the tedious renal calculi extraction process while diminishing infection risks. Furthermore, such an automated procedure would make it possible to prescribe anti-recurrence treatments immediately. Nowadays, only a few experienced urologists are able to recognize the kidney stone types in the images of the videos displayed on a screen during the endoscopy. Thus, several deep learning (DL) models have recently been proposed to automatically recognize the kidney stone types using ureteroscopic images. However, these DL models are of a black-box nature, which limits their applicability in clinical settings. This contribution proposes a case-based reasoning DL model which uses prototypical parts (PPs) and generates local and global descriptors. The PPs encode for each class (i.e., kidney stone type) visual feature information (hue, saturation, intensity and textures) similar to that used by biologists. The PPs are optimally generated due to a new loss function used during the model training. Moreover, the local and global descriptors of PPs make it possible to explain the decisions (“what” information, “where in the images”) in a way that is understandable for biologists and urologists. The proposed DL model has been tested on a database including images of the six most widespread kidney stone types. The overall average classification accuracy was 90.37. When comparing these results with those of the eight other state-of-the-art DL models for kidney stone recognition, it can be seen that the valuable gain in explainability was not reached at the expense of accuracy, which was even slightly increased with respect to that (88.2) of the best method in the literature. These promising and interpretable results also encourage urologists to put their trust in AI-based solutions.
[AI-12] Enhancing E-commerce Product Title Translation with Retrieval-Augmented Generation and Large Language Models CIKM
Link: https://arxiv.org/abs/2409.12880
Authors: Bryan Zhang, Taichi Nakatani, Stephan Walter
Keywords: stores enable multilingual, E-commerce stores enable, product title translation, title translation, accurate product title
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 pages. In Proceedings of ACM CIKM Workshop on Data-Centric AI (CIKM DCAI 2024)
Abstract:E-commerce stores enable multilingual product discovery, which requires accurate product title translation. Multilingual large language models (LLMs) have shown promising capacity to perform machine translation tasks, and they can also enhance and translate product titles cross-lingually in one step. However, product title translation often requires more than just language conversion because titles are short, lack context, and contain specialized terminology. This study proposes a retrieval-augmented generation (RAG) approach that leverages existing bilingual product information in e-commerce by retrieving similar bilingual examples and incorporating them as few-shot prompts to enhance LLM-based product title translation. Experimental results show that our proposed RAG approach improves product title translation quality, with chrF score gains of up to 15.3% for language pairs where the LLM has limited proficiency.
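The retrieve-then-prompt idea can be sketched in a few lines. Everything concrete here is an assumption for illustration: the tiny in-memory title store, the English-Spanish pair, the prompt wording, and the use of `difflib.SequenceMatcher` as the similarity measure (the abstract does not say which retriever or prompt template the authors use).

```python
from difflib import SequenceMatcher

# Hypothetical store of existing bilingual product titles (source, target).
BILINGUAL_TITLES = [
    ("wireless bluetooth headphones", "auriculares bluetooth inalámbricos"),
    ("stainless steel water bottle", "botella de agua de acero inoxidable"),
    ("bluetooth speaker portable", "altavoz bluetooth portátil"),
]

def retrieve(query, store, k=2):
    """Return the k stored pairs whose source title is most similar to the query."""
    ranked = sorted(
        store,
        key=lambda pair: SequenceMatcher(None, query, pair[0]).ratio(),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query, examples):
    """Assemble a few-shot translation prompt from the retrieved examples."""
    parts = ["Translate the product title from English to Spanish."]
    for src, tgt in examples:
        parts.append(f"English: {src}\nSpanish: {tgt}")
    parts.append(f"English: {query}\nSpanish:")
    return "\n\n".join(parts)

query = "wireless bluetooth speaker"
examples = retrieve(query, BILINGUAL_TITLES)
prompt = build_prompt(query, examples)
print(prompt)  # this string would then be sent to the LLM
```

A production system would more likely retrieve with an embedding index or a search engine over the catalog. String similarity is used here only to keep the sketch self-contained.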
[AI-13] KnowFormer: Revisiting Transformers for Knowledge Graph Reasoning ICML2024
Link: https://arxiv.org/abs/2409.12865
Authors: Junnan Liu, Qianren Mao, Weifeng Jiang, Jianxin Li
Keywords: garnered considerable attention, Knowledge graph reasoning, plays a vital, vital role, garnered considerable
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted by ICML2024
Abstract:Knowledge graph reasoning plays a vital role in various applications and has garnered considerable attention. Recently, path-based methods have achieved impressive performance. However, they may face limitations stemming from constraints in message-passing neural networks, such as missing paths and information over-squashing. In this paper, we revisit the application of transformers for knowledge graph reasoning to address the constraints faced by path-based methods and propose a novel method KnowFormer. KnowFormer utilizes a transformer architecture to perform reasoning on knowledge graphs from the message-passing perspective, rather than reasoning by textual information like previous pretrained language model based methods. Specifically, we define the attention computation based on the query prototype of knowledge graph reasoning, facilitating convenient construction and efficient optimization. To incorporate structural information into the self-attention mechanism, we introduce structure-aware modules to calculate query, key, and value respectively. Additionally, we present an efficient attention computation method for better scalability. Experimental results demonstrate the superior performance of KnowFormer compared to prominent baseline methods on both transductive and inductive benchmarks.
[AI-14] How the (Tensor-) Brain uses Embeddings and Embodiment to Encode Senses and Decode Symbols
Link: https://arxiv.org/abs/2409.12846
Authors: Volker Tresp, Hang Li
Keywords: tensor brain, representation layer, tensor brain model, brain, layer
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
Abstract:The tensor brain has been introduced as a computational model for perception and memory. We provide an overview of the tensor brain model, including recent developments. The tensor brain has two major layers: the representation layer and the index layer. The representation layer is a model for the subsymbolic global workspace from consciousness research. The state of the representation layer is the cognitive brain state. The index layer contains symbols for concepts, time instances, and predicates. In a bottom-up operation, the cognitive brain state is encoded by the index layer as symbolic labels. In a top-down operation, symbols are decoded and written to the representation layer. This feeds to earlier processing layers as embodiment. The top-down operation became the basis for semantic memory. The embedding vector of a concept forms the connection weights between its index and the representation layer. The embedding is the signature or “DNA” of a concept, which is decoded by the brain when its index is activated. It integrates all that is known about a concept from different experiences, modalities, and symbolic decodings. Although being computational, it has been suggested that the tensor brain might be related to the actual operation of the brain. The sequential nature of symbol generation might have been a prerequisite to the generation of natural language. We describe an attention mechanism and discuss multitasking by multiplexing. We emphasize the inherent multimodality of the tensor brain. Finally, we discuss embedded and symbolic reasoning.
[AI-15] Vision Language Models Can Parse Floor Plan Maps
Link: https://arxiv.org/abs/2409.12842
Authors: David DeFazio, Hrudayangam Mehta, Jeremy Blackburn, Shiqi Zhang
Keywords: Vision language models, visual question answering, Vision language, image captioning, language models
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Abstract:Vision language models (VLMs) can simultaneously reason about images and texts to tackle many tasks, from visual question answering to image captioning. This paper focuses on map parsing, a novel task that is unexplored within the VLM context and particularly useful to mobile robots. Map parsing requires understanding not only the labels but also the geometric configurations of a map, i.e., what areas are like and how they are connected. To evaluate the performance of VLMs on map parsing, we prompt VLMs with floorplan maps to generate task plans for complex indoor navigation. Our results demonstrate the remarkable capability of VLMs in map parsing, with a success rate of 0.96 in tasks requiring a sequence of nine navigation actions, e.g., approaching and going through doors. Other than intuitive observations, e.g., VLMs do better in smaller maps and simpler navigation tasks, there was a very interesting observation that their performance drops in large open areas. We provide practical suggestions to address such challenges as validated by our experimental results. Webpage: this https URL
[AI-16] FoodPuzzle: Developing Large Language Model Agents as Flavor Scientists
Link: https://arxiv.org/abs/2409.12832
Authors: Tenghao Huang, Donghee Lee, John Sweeney, Jiatong Shi, Emily Steliotes, Matthew Lange, Jonathan May, Muhao Chen
Keywords: industry is increasingly, increasingly challenged, rapid innovation, innovation and precise, flavor profile creation
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract:Flavor development in the food industry is increasingly challenged by the need for rapid innovation and precise flavor profile creation. Traditional flavor research methods typically rely on iterative, subjective testing, which lacks the efficiency and scalability required for modern demands. This paper presents three contributions to address the challenges. Firstly, we define a new problem domain for scientific agents in flavor science, conceptualized as the generation of hypotheses for flavor profile sourcing and understanding. To facilitate research in this area, we introduce FoodPuzzle, a challenging benchmark consisting of 978 food items and 1,766 flavor molecule profiles. We propose a novel Scientific Agent approach, integrating in-context learning and retrieval augmented techniques to generate grounded hypotheses in the domain of food science. Experimental results indicate that our model significantly surpasses traditional methods in flavor profile prediction tasks, demonstrating its potential to transform flavor development practices.
[AI-17] Towards Interactive and Learnable Cooperative Driving Automation: a Large Language Model-Driven Decision-Making Framework
Link: https://arxiv.org/abs/2409.12812
Authors: Shiyu Fang, Jiaqi Liu, Mingyu Ding, Yiming Cui, Chen Lv
Keywords: Connected Autonomous Vehicles, Connected Autonomous, open road testing, Cooperative driving, Autonomous Vehicles
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Abstract:At present, Connected Autonomous Vehicles (CAVs) have begun open-road testing around the world, but their safety and efficiency performance in complex scenarios is still not satisfactory. Cooperative driving leverages the connectivity ability of CAVs to achieve synergies greater than the sum of their parts, making it a promising approach to improving CAV performance in complex scenarios. However, the lack of interaction and continuous learning ability limits current cooperative driving to single-scenario applications and specific Cooperative Driving Automation (CDA). To address these challenges, this paper proposes CoDrivingLLM, an interactive and learnable LLM-driven cooperative driving framework, to achieve all-scenario and all-CDA. First, since Large Language Models (LLMs) are not adept at handling mathematical calculations, an environment module is introduced to update vehicle positions based on semantic decisions, thus avoiding potential errors from direct LLM control of vehicle positions. Second, based on the four levels of CDA defined by the SAE J3216 standard, we propose a Chain-of-Thought (CoT) based reasoning module that includes state perception, intent sharing, negotiation, and decision-making, enhancing the stability of LLMs in multi-step reasoning tasks. Centralized conflict resolution is then managed through a conflict coordinator in the reasoning process. Finally, by introducing a memory module and employing retrieval-augmented generation, CAVs are endowed with the ability to learn from their past experiences. We validate the proposed CoDrivingLLM through ablation experiments on the negotiation module, reasoning with different shots experience, and comparison with other cooperative driving methods.
[AI-18] Don't be Fooled: The Misinformation Effect of Explanations in Human-AI Collaboration
Link: https://arxiv.org/abs/2409.12809
Authors: Philipp Spitzer, Joshua Holstein, Katelyn Morrison, Kenneth Holstein, Gerhard Satzger, Niklas Kühl
Keywords: black-box artificial intelligence, artificial intelligence, systems without insight, increasingly use black-box, black-box artificial
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Abstract:Across various applications, humans increasingly use black-box artificial intelligence (AI) systems without insight into these systems’ reasoning. To counter this opacity, explainable AI (XAI) methods promise enhanced transparency and interpretability. While recent studies have explored how XAI affects human-AI collaboration, few have examined the potential pitfalls caused by incorrect explanations. The implications for humans can be far-reaching but have not been explored extensively. To investigate this, we ran a study (n=160) on AI-assisted decision-making in which humans were supported by XAI. Our findings reveal a misinformation effect when incorrect explanations accompany correct AI advice with implications post-collaboration. This effect causes humans to infer flawed reasoning strategies, hindering task execution and demonstrating impaired procedural knowledge. Additionally, incorrect explanations compromise human-AI team-performance during collaboration. With our work, we contribute to HCI by providing empirical evidence for the negative consequences of incorrect explanations on humans post-collaboration and outlining guidelines for designers of AI.
[AI-19] Exploring the Lands Between: A Method for Finding Differences between AI-Decisions and Human Ratings through Generated Samples
Link: https://arxiv.org/abs/2409.12801
Authors: Lukas Mecke, Daniel Buschek, Uwe Gruenefeld, Florian Alt
Keywords: Artificial Intelligence, made by Artificial, everyday lives, authentication via biometric, Intelligence
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Abstract:Many important decisions in our everyday lives, such as authentication via biometric models, are made by Artificial Intelligence (AI) systems. These can be in poor alignment with human expectations, and testing them on clear-cut existing data may not be enough to uncover those cases. We propose a method to find samples in the latent space of a generative model, designed to be challenging for a decision-making model with regard to matching human expectations. By presenting those samples to both the decision-making model and human raters, we can identify areas where its decisions align with human intuition and where they contradict it. We apply this method to a face recognition model and collect a dataset of 11,200 human ratings from 100 participants. We discuss findings from our dataset and how our approach can be used to explore the performance of AI models in different contexts and for different user groups.
[AI-20] Assessing the Zero-Shot Capabilities of LLMs for Action Evaluation in RL
链接: https://arxiv.org/abs/2409.12798
作者: Eduardo Pignatelli,Johan Ferret,Tim Rocktäschel,Edward Grefenstette,Davide Paglieri,Samuel Coward,Laura Toni
关键词-EN: challenge in Reinforcement, Reinforcement Learning, Large Language Models, temporal credit assignment, credit assignment problem
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 9 pages
点击查看摘要
Abstract:The temporal credit assignment problem is a central challenge in Reinforcement Learning (RL), concerned with attributing the appropriate influence to each action in a trajectory for its ability to achieve a goal. However, when feedback is delayed and sparse, the learning signal is poor, and action evaluation becomes harder. Canonical solutions, such as reward shaping and options, require extensive domain knowledge and manual intervention, limiting their scalability and applicability. In this work, we lay the foundations for Credit Assignment with Language Models (CALM), a novel approach that leverages Large Language Models (LLMs) to automate credit assignment via reward shaping and options discovery. CALM uses LLMs to decompose a task into elementary subgoals and assess the achievement of these subgoals in state-action transitions. Every time an option terminates, a subgoal is achieved, and CALM provides an auxiliary reward. This additional reward signal can enhance the learning process when the task reward is sparse and delayed without the need for human-designed rewards. We provide a preliminary evaluation of CALM using a dataset of human-annotated demonstrations from MiniHack, suggesting that LLMs can be effective in assigning credit in zero-shot settings, without examples or LLM fine-tuning. Our preliminary results indicate that the knowledge of LLMs is a promising prior for credit assignment in RL, facilitating the transfer of human knowledge into value functions.
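The auxiliary-reward idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `shaped_rewards` and `keyword_judge` names, the text-based transitions, and the fixed bonus of 0.1 are all assumptions made for illustration, and the keyword check merely stands in for the LLM judgement the abstract describes.

```python
from typing import Callable, List, Tuple

# (state, action, next_state) as hypothetical text observations.
Transition = Tuple[str, str, str]


def shaped_rewards(
    trajectory: List[Transition],
    task_rewards: List[float],
    subgoal_achieved: Callable[[Transition], bool],
    bonus: float = 0.1,
) -> List[float]:
    """Add a fixed auxiliary bonus to every transition flagged as completing a subgoal."""
    if len(trajectory) != len(task_rewards):
        raise ValueError("trajectory and task_rewards must have the same length")
    return [
        reward + (bonus if subgoal_achieved(transition) else 0.0)
        for transition, reward in zip(trajectory, task_rewards)
    ]


def keyword_judge(transition: Transition) -> bool:
    """Stand-in for an LLM call: flag transitions whose next state shows a subgoal."""
    _, _, next_state = transition
    return "key in inventory" in next_state or "door is open" in next_state


# A toy sparse-reward episode: only the final step carries task reward.
trajectory = [
    ("agent next to key", "pickup", "key in inventory"),
    ("key in inventory", "move east", "agent next to door"),
    ("agent next to door", "apply key", "door is open"),
    ("door is open", "move east", "agent at goal"),
]
task_rewards = [0.0, 0.0, 0.0, 1.0]

print(shaped_rewards(trajectory, task_rewards, keyword_judge))
```

In this toy episode the two subgoal transitions receive the bonus, so the learner sees a signal before the delayed task reward arrives.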
[AI-21] Investigation on domain adaptation of additive manufacturing monitoring systems to enhance digital twin reusability
链接: https://arxiv.org/abs/2409.12785
作者: Jiarui Xie,Zhuo Yang,Chun-Chun Hu,Haw-Ching Yang,Yan Lu,Yaoyao Fiona Zhao
关键词-EN: Powder bed fusion, metal additive manufacturing, emerging metal additive, enables rapid fabrication, Powder bed
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 pages, 7 figures, 3 tables. IEEE CASE 2024
点击查看摘要
Abstract:Powder bed fusion (PBF) is an emerging metal additive manufacturing (AM) technology that enables rapid fabrication of complex geometries. However, defects such as pores and balling may occur and lead to structural unconformities, thus compromising the mechanical performance of the part. This has become a critical challenge for quality assurance as the nature of some defects is stochastic during the process and invisible from the exterior. To address this issue, digital twin (DT) using machine learning (ML)-based modeling can be deployed for AM process monitoring and control. Melt pool is one of the most commonly observed physical phenomena for process monitoring, usually by high-speed cameras. Once labeled and preprocessed, the melt pool images are used to train ML-based models for DT applications such as process anomaly detection and print quality evaluation. Nonetheless, the reusability of DTs is restricted due to the wide variability of AM settings, including AM machines and monitoring instruments. The performance of the ML models trained using the dataset collected from one setting is usually compromised when applied to other settings. This paper proposes a knowledge transfer pipeline between different AM settings to enhance the reusability of AM DTs. The source and target datasets are collected from the National Institute of Standards and Technology and National Cheng Kung University with different cameras, materials, AM machines, and process parameters. The proposed pipeline consists of four steps: data preprocessing, data augmentation, domain alignment, and decision alignment. Compared with the model trained only using the source dataset, this pipeline increased the melt pool anomaly detection accuracy by 31% without any labeled training data from the target dataset.
[AI-22] Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering
链接: https://arxiv.org/abs/2409.12784
作者: Youngsun Lim,Hojun Choi,Hyunjung Shim
关键词-EN: existing studies overlook, TTI, image hallucination, impressive success, studies overlook
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 20 pages
点击查看摘要
Abstract:Despite the impressive success of text-to-image (TTI) generation models, existing studies overlook the issue of whether these models accurately convey factual information. In this paper, we focus on the problem of image hallucination, where images created by generation models fail to faithfully depict factual content. To address this, we introduce I-HallA (Image Hallucination evaluation with Question Answering), a novel automated evaluation metric that measures the factuality of generated images through visual question answering (VQA). We also introduce I-HallA v1.0, a curated benchmark dataset for this purpose. As part of this process, we develop a pipeline that generates high-quality question-answer pairs using multiple GPT-4 Omni-based agents, with human judgments to ensure accuracy. Our evaluation protocols measure image hallucination by testing if images from existing text-to-image models can correctly respond to these questions. The I-HallA v1.0 dataset comprises 1.2K diverse image-text pairs across nine categories with 1,000 rigorously curated questions covering various compositional challenges. We evaluate five text-to-image models using I-HallA and reveal that these state-of-the-art models often fail to accurately convey factual information. Moreover, we validate the reliability of our metric by demonstrating a strong Spearman correlation (rho=0.95) with human judgments. We believe our benchmark dataset and metric can serve as a foundation for developing factually accurate text-to-image generation models.
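For readers unfamiliar with the Spearman correlation used above to validate the metric, the following is a small self-contained sketch of how such a rank correlation is computed. The scores are made-up toy numbers, not data from the paper, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = shared_rank
        pos = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (norm_x * norm_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho is the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy example: automatic metric scores vs. human ratings for five images.
metric_scores = [0.91, 0.40, 0.75, 0.62, 0.20]
human_ratings = [5, 2, 3, 4, 1]
print(spearman(metric_scores, human_ratings))
```

Because only ranks matter, the statistic rewards a metric for ordering images the way humans do, regardless of the scale of its raw scores.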
[AI-23] GaRField: Reinforced Gaussian Radiance Fields for Large-Scale 3D Scene Reconstruction
链接: https://arxiv.org/abs/2409.12774
作者: Hanyue Zhang,Zhiliu Yang,Xinhe Zuo,Yuxin Tong,Ying Long,Chen Liu
关键词-EN: accuracy challenges faced, large-scale scene reconstruction, scene reconstruction based, paper proposes, aims to address
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:
点击查看摘要
Abstract:This paper proposes a novel framework for large-scale scene reconstruction based on 3D Gaussian splatting (3DGS) and aims to address the scalability and accuracy challenges faced by existing methods. For tackling the scalability issue, we split the large scene into multiple cells, and the candidate point-cloud and camera views of each cell are correlated through a visibility-based camera selection and a progressive point-cloud extension. To reinforce the rendering quality, three highlighted improvements are made in comparison with vanilla 3DGS, which are a strategy of the ray-Gaussian intersection and the novel Gaussians density control for learning efficiency, an appearance decoupling module based on ConvKAN network to solve uneven lighting conditions in large-scale scenes, and a refined final loss with the color loss, the depth distortion loss, and the normal consistency loss. Finally, the seamless stitching procedure is executed to merge the individual Gaussian radiance field for novel view synthesis across different cells. Evaluation on the Mill19, Urban3D, and MatrixCity datasets shows that our method consistently generates higher-fidelity rendering results than state-of-the-art methods of large-scale scene reconstruction. We further validate the generalizability of the proposed approach by rendering on self-collected video clips recorded by a commercial drone.
[AI-24] The Robustness of Spiking Neural Networks in Communication and its Application towards Network Efficiency in Federated Learning
链接: https://arxiv.org/abs/2409.12769
作者: Manh V. Nguyen,Liang Zhao,Bobin Deng,William Severa,Honghui Xu,Shaoen Wu
关键词-EN: Spiking Neural Networks, Artificial Neural Networks, conventional Artificial Neural, Neural Networks, Spiking Neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: This paper has been accepted for publication at the 43rd IEEE International Performance Computing and Communications Conference (IPCCC 2024)
点击查看摘要
Abstract:Spiking Neural Networks (SNNs) have recently gained significant interest in on-chip learning in embedded devices and emerged as an energy-efficient alternative to conventional Artificial Neural Networks (ANNs). However, to extend SNNs to a Federated Learning (FL) setting involving collaborative model training, the communication between the local devices and the remote server remains the bottleneck, which is often restricted and costly. In this paper, we first explore the inherent robustness of SNNs under noisy communication in FL. Building upon this foundation, we propose a novel Federated Learning with Top-K Sparsification (FLTS) algorithm to reduce the bandwidth usage for FL training. We discover that the proposed scheme with SNNs allows more bandwidth savings compared to ANNs without impacting the model’s accuracy. Additionally, the number of parameters to be communicated can be reduced to as low as 6 percent of the size of the original model. We further improve the communication efficiency by enabling dynamic parameter compression during model training. Extensive experiment results demonstrate that our proposed algorithms significantly outperform the baselines in terms of communication cost and model accuracy and are promising for practical network-efficient FL with SNNs.
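Top-K sparsification, the compression step named in this abstract, can be sketched generically as follows. This is an illustration of the general technique rather than the paper's FLTS algorithm: the function names, the flat list representation of a model update, and the `keep_fraction` parameter are assumptions.

```python
from typing import Dict, List


def top_k_sparsify(update: List[float], keep_fraction: float) -> Dict[int, float]:
    """Keep only the largest-magnitude entries of an update as {index: value}."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    k = max(1, round(len(update) * keep_fraction))
    largest = sorted(range(len(update)), key=lambda i: abs(update[i]), reverse=True)[:k]
    return {i: update[i] for i in sorted(largest)}


def densify(sparse: Dict[int, float], size: int) -> List[float]:
    """Server side: rebuild a dense update, treating dropped entries as zero."""
    dense = [0.0] * size
    for index, value in sparse.items():
        dense[index] = value
    return dense


# Toy client update with ten parameters; only 30% of them are transmitted.
update = [0.02, -0.9, 0.1, 0.0, 0.45, -0.03, 0.3, -0.01, 0.05, 0.6]
sparse = top_k_sparsify(update, keep_fraction=0.3)
print(sparse)
print(densify(sparse, len(update)))
```

Sending index-value pairs instead of the dense vector is what saves bandwidth; the abstract reports that with SNNs the transmitted share can go as low as 6 percent of the model size.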
[AI-25] Enhancing Synthetic Training Data for Speech Commands: From ASR-Based Filtering to Domain Adaptation in SSL Latent Space
链接: https://arxiv.org/abs/2409.12745
作者: Sebastião Quintas,Isabelle Ferrané,Thomas Pellegrini
关键词-EN: gaining increasing popularity, automatic speech recognition, augmentation is gaining, gaining increasing, increasing popularity
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:
点击查看摘要
Abstract:The use of synthetic speech as data augmentation is gaining increasing popularity in fields such as automatic speech recognition and speech classification tasks. Despite novel text-to-speech systems with voice cloning capabilities, which allow the use of a larger number of voices based on short audio segments, it is known that these systems tend to hallucinate and oftentimes produce bad data that will most likely have a negative impact on the downstream task. In the present work, we conduct a set of experiments around zero-shot learning with synthetic speech data for the specific task of speech commands classification. Our results on the Google Speech Commands dataset show that a simple ASR-based filtering method can have a big impact on the quality of the generated data, translating to better performance. Furthermore, despite the good quality of the generated speech data, we also show that synthetic and real speech can still be easily distinguished when using self-supervised (WavLM) features, an aspect further explored with a CycleGAN to bridge the gap between the two types of speech material.
[AI-26] Fine Tuning Large Language Models for Medicine: The Role and Importance of Direct Parameter Optimization
链接: https://arxiv.org/abs/2409.12741
作者: Thomas Savage,Stephen Ma,Abdessalem Boukil,Vishwesh Patel,Ekanath Rangan,Ivan Rodriguez,Jonathan H Chen
关键词-EN: Large Language Model, Direct Parameter Optimization, Supervised Fine Tuning, fine tuning, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Large Language Model (LLM) fine tuning is underutilized in the field of medicine. Two of the most common methods of fine tuning are Supervised Fine Tuning (SFT) and Direct Parameter Optimization (DPO), but there is little guidance informing users when to use either technique. In this investigation, we compare the performance of SFT and DPO for five common natural language tasks in medicine: Classification with text data, Classification with numeric data, Clinical Reasoning, Summarization, and Clinical Triage. We find that SFT alone is sufficient for Classification with text data, whereas DPO improves performance for the more complex tasks of Clinical Reasoning, Summarization and Clinical Triage. Our results establish the role and importance of DPO fine tuning within medicine, and consequently call attention to current software gaps that prevent widespread deployment of this technique.
[AI-27] HLLM: Enhancing Sequential Recommendations via Hierarchical Large Language Models for Item and User Modeling
链接: https://arxiv.org/abs/2409.12740
作者: Junyi Chen,Lu Chi,Bingyue Peng,Zehuan Yuan
关键词-EN: achieved remarkable success, Hierarchical Large Language, Large Language Models, Large Language, prompting several studies
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have achieved remarkable success in various fields, prompting several studies to explore their potential in recommendation systems. However, these attempts have so far resulted in only modest improvements over traditional recommendation models. Moreover, three critical questions remain under-explored: firstly, the real value of LLMs’ pre-trained weights, often considered to encapsulate world knowledge; secondly, the necessity of fine-tuning for recommendation tasks; lastly, whether LLMs can exhibit the same scalability benefits in recommendation systems as they do in other domains. In this paper, we propose a novel Hierarchical Large Language Model (HLLM) architecture designed to enhance sequential recommendation systems. Our approach employs a two-tier model: the first Item LLM extracts rich content features from the detailed text description of the item, while the second User LLM utilizes these features to predict users’ future interests based on their interaction history. Extensive experiments demonstrate that our method effectively leverages the pre-trained capabilities of open-source LLMs, and further fine-tuning leads to significant performance boosts. Additionally, HLLM achieves excellent scalability, with the largest configuration utilizing 7B parameters for both item feature extraction and user interest modeling. Moreover, HLLM offers excellent training and serving efficiency, making it practical in real-world applications. Evaluations on two large-scale datasets, PixelRec and Amazon Reviews, show that HLLM achieves state-of-the-art results, outperforming traditional ID-based models by a wide margin. In online A/B testing, HLLM showcases notable gains, validating its practical impact in real-world recommendation scenarios. Codes are available at this https URL.
[AI-28] MEXMA: Token-level objectives improve sentence representations
链接: https://arxiv.org/abs/2409.12737
作者: João Maria Janeiro,Benjamin Piwowarski,Patrick Gallinari,Loïc Barrault
关键词-EN: sentence representation, sentence, pre-trained cross-lingual sentence, sentence encoders approaches, cross-lingual sentence encoders
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 11 pages, 12 figures
点击查看摘要
Abstract:Current pre-trained cross-lingual sentence encoders approaches use sentence-level objectives only. This can lead to loss of information, especially for tokens, which then degrades the sentence representation. We propose MEXMA, a novel approach that integrates both sentence-level and token-level objectives. The sentence representation in one language is used to predict masked tokens in another language, with both the sentence representation and all tokens directly updating the encoder. We show that adding token-level objectives greatly improves the sentence representation quality across several tasks. Our approach outperforms current pre-trained cross-lingual sentence encoders on bi-text mining as well as several downstream tasks. We also analyse the information encoded in our tokens, and how the sentence representation is built from them.
[AI-29] When SparseMoE Meets Noisy Interactions: An Ensemble View on Denoising Recommendation
链接: https://arxiv.org/abs/2409.12730
作者: Weipu Chen,Zhuangzhuang He,Fei Liu
关键词-EN: Learning user preferences, implicit feedback, user preferences, core challenges, Learning user
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Learning user preferences from implicit feedback is one of the core challenges in recommendation. The difficulty lies in the potential noise within implicit feedback. Therefore, various denoising recommendation methods have been proposed recently. However, most of them overly rely on the hyperparameter configurations, inevitably leading to inadequacies in model adaptability and generalization performance. In this study, we propose a novel Adaptive Ensemble Learning (AEL) for denoising recommendation, which employs a sparse gating network as a brain, selecting suitable experts to synthesize appropriate denoising capacities for different data samples. To address the ensemble learning shortcoming of model complexity and ensure sub-recommender diversity, we also propose a novel method that stacks components to create sub-recommenders instead of directly constructing them. Extensive experiments across various datasets demonstrate that AEL outperforms other methods on a variety of popular metrics, even in the presence of substantial and dynamic noise. Our code is available at this https URL.
[AI-30] Cloudy with a Chance of Anomalies: Dynamic Graph Neural Network for Early Detection of Cloud Services User Anomalies
链接: https://arxiv.org/abs/2409.12726
作者: Revital Marbel,Yanir Cohen,Ran Dubin,Amit Dvir,Chen Hajaj
关键词-EN: sustaining organizational growth, cloud services, Cloud Services Graph-based, Services Graph-based Anomaly, Graph Neural Network
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:
点击查看摘要
Abstract:Ensuring the security of cloud environments is imperative for sustaining organizational growth and operational efficiency. As the ubiquity of cloud services continues to rise, the inevitability of cyber threats underscores the importance of preemptive detection. This paper introduces a pioneering time-based embedding approach for Cloud Services Graph-based Anomaly Detection (CS-GAD), utilizing a Graph Neural Network (GNN) to discern anomalous user behavior during interactions with cloud services. Our method employs a dynamic tripartite graph representation to encapsulate the evolving interactions among cloud services, users, and their activities over time. Leveraging GNN models in each time frame, our approach generates a graph embedding wherein each user is assigned a score based on their historical activity, facilitating the identification of unusual behavior. Results demonstrate a notable reduction in false positive rates (2-9%) compared to prevailing methods, coupled with a commendable true positive rate (100%). The contributions of this work encompass early detection capabilities, a low false positive rate, an innovative tripartite graph representation incorporating action types, the introduction of a new cloud services dataset featuring various user attacks, and an open-source implementation for community collaboration in advancing cloud service security.
[AI-31] FAST GDRNPP: Improving the Speed of State-of-the-Art 6D Object Pose Estimation
链接: https://arxiv.org/abs/2409.12720
作者: Thomas Pöllabauer,Ashwin Pramod,Volker Knauthe,Michael Wahl
关键词-EN: chosen coordinate system, estimation involves determining, pose estimation involves, object pose estimation, coordinate system
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:6D object pose estimation involves determining the three-dimensional translation and rotation of an object within a scene and relative to a chosen coordinate system. This problem is of particular interest for many practical applications in industrial tasks such as quality control, bin picking, and robotic manipulation, where both speed and accuracy are critical for real-world deployment. Current models, both classical and deep-learning-based, often struggle with the trade-off between accuracy and latency. Our research focuses on enhancing the speed of a prominent state-of-the-art deep learning model, GDRNPP, while keeping its high accuracy. We employ several techniques to reduce the model size and improve inference time. These techniques include using smaller and quicker backbones, pruning unnecessary parameters, and distillation to transfer knowledge from a large, high-performing model to a smaller, more efficient student model. Our findings demonstrate that the proposed configuration maintains accuracy comparable to the state-of-the-art while significantly improving inference time. This advancement could lead to more efficient and practical applications in various industrial scenarios, thereby enhancing the overall applicability of 6D Object Pose Estimation models in real-world settings.
[AI-32] Optical Flow Matters: an Empirical Comparative Study on Fusing Monocular Extracted Modalities for Better Steering
链接: https://arxiv.org/abs/2409.12716
作者: Fouad Makiyeh,Mark Bastourous,Anass Bairouk,Wei Xiao,Mirjana Maras,Tsun-Hsuan Wangb,Marc Blanchon,Ramin Hasani,Patrick Chareyre,Daniela Rus
关键词-EN: accurate decision-making processes, key challenge, Neutral Circuit Policy, Variational Auto Encoder, Circuit Policy
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Autonomous vehicle navigation is a key challenge in artificial intelligence, requiring robust and accurate decision-making processes. This research introduces a new end-to-end method that exploits multimodal information from a single monocular camera to improve the steering predictions for self-driving cars. Unlike conventional models that require several sensors which can be costly and complex or rely exclusively on RGB images that may not be robust enough under different conditions, our model significantly improves vehicle steering prediction performance from a single visual sensor. By focusing on the fusion of RGB imagery with depth completion information or optical flow data, we propose a comprehensive framework that integrates these modalities through both early and hybrid fusion techniques. We use three distinct neural network models to implement our approach: Convolution Neural Network - Neural Circuit Policy (CNN-NCP), Variational Auto Encoder - Long Short-Term Memory (VAE-LSTM), and Neural Circuit Policy architecture VAE-NCP. By incorporating optical flow into the decision-making process, our method significantly advances autonomous navigation. Empirical results from our comparative study using Boston driving data show that our model, which integrates image and motion information, is robust and reliable. It outperforms state-of-the-art approaches that do not use optical flow, reducing the steering estimation error by 31%. This demonstrates the potential of optical flow data, combined with advanced neural network architectures (a CNN-based structure for fusing data and a Recurrence-based network for inferring a command from latent space), to enhance the performance of autonomous vehicle steering estimation.
[AI-33] Connecting Ideas in Lower-Resource Scenarios: NLP for National Varieties Creoles and Other Low-resource Scenarios COLING2025
链接: https://arxiv.org/abs/2409.12683
作者: Aditya Joshi,Diptesh Kanojia,Heather Lent,Hour Kaing,Haiyue Song
关键词-EN: large language models, national or social, language models struggle, excellent results, results on benchmarks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Selected as a full-day tutorial at COLING 2025
点击查看摘要
Abstract:Despite excellent results on benchmarks over a small subset of languages, large language models struggle to process text from languages situated in ‘lower-resource’ scenarios such as dialects/sociolects (national or social varieties of a language), Creoles (languages arising from linguistic contact between multiple languages) and other low-resource languages. This introductory tutorial will identify common challenges, approaches, and themes in natural language processing (NLP) research for confronting and overcoming the obstacles inherent to data-poor contexts. By connecting past ideas to the present field, this tutorial aims to ignite collaboration and cross-pollination between researchers working in these scenarios. Our notion of ‘lower-resource’ broadly denotes the outstanding lack of data required for model training - and may be applied to scenarios apart from the three covered in the tutorial.
[AI-34] Retrieval-Augmented Test Generation: How Far Are We?
链接: https://arxiv.org/abs/2409.12682
作者: Jiho Shin,Reem Aleithan,Hadi Hemmati,Song Wang
关键词-EN: Retrieval Augmented Generation, Retrieval Augmented, software engineering tasks, shown notable advancements, Augmented Generation
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: 18 pages + reference
点击查看摘要
Abstract:Retrieval Augmented Generation (RAG) has shown notable advancements in software engineering tasks. Despite its potential, RAG’s application in unit test generation remains under-explored. To bridge this gap, we take the initiative to investigate the efficacy of RAG-based LLMs in test generation. As RAGs can leverage various knowledge sources to enhance their performance, we also explore the impact of different sources of RAGs’ knowledge bases on unit test generation to provide insights into their practical benefits and limitations. Specifically, we examine RAG built upon three types of domain knowledge: 1) API documentation, 2) GitHub issues, and 3) StackOverflow QAs. Each source offers essential knowledge for creating tests from different perspectives, i.e., API documentation provides official API usage guidelines, GitHub issues offer resolutions of issues related to the APIs from the library developers, and StackOverflow QAs present community-driven solutions and best practices. For our experiment, we focus on five widely used and typical Python-based machine learning (ML) projects, i.e., TensorFlow, PyTorch, Scikit-learn, Google JAX, and XGBoost to build, train, and deploy complex neural networks efficiently. We conducted experiments using the top 10% most widely used APIs across these projects, involving a total of 188 APIs. We investigate the effectiveness of four state-of-the-art LLMs (open and closed-sourced), i.e., GPT-3.5-Turbo, GPT-4o, Mistral MoE 8x22B, and Llama 3.1 405B. Additionally, we compare three prompting strategies in generating unit test cases for the experimental APIs, i.e., zero-shot, a Basic RAG, and an API-level RAG on the three external sources. Finally, we compare the cost of different sources of knowledge used for the RAG.
[AI-35] (Un)certainty of (Un)fairness: Preference-Based Selection of Certainly Fair Decision-Makers ECAI2024
链接: https://arxiv.org/abs/2409.12677
作者: Manh Khoi Duong,Stefan Conrad
关键词-EN: real-world applications, including machine learning, bias in decision-making, Fairness metrics, traditional fairness metrics
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted in 27TH EUROPEAN CONFERENCE ON ARTIFICIAL INTELLIGENCE (ECAI 2024)
点击查看摘要
Abstract:Fairness metrics are used to assess discrimination and bias in decision-making processes across various domains, including machine learning models and human decision-makers in real-world applications. This involves calculating the disparities between probabilistic outcomes among social groups, such as acceptance rates between male and female applicants. However, traditional fairness metrics do not account for the uncertainty in these processes and lack comparability when two decision-makers exhibit the same disparity. Using Bayesian statistics, we quantify the uncertainty of the disparity to enhance discrimination assessments. We represent each decision-maker, whether a machine learning model or a human, by its disparity and the corresponding uncertainty in that disparity. We define preferences over decision-makers and use brute-force search to choose the optimal decision-maker according to a utility function that ranks decision-makers based on these preferences. The decision-maker with the highest utility score can be interpreted as the one for whom we are most certain that it is fair.
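To make concrete why two decision-makers with the same observed disparity can differ in certainty, here is a minimal Monte Carlo sketch. It assumes a uniform Beta(1, 1) prior on each group's acceptance rate and a signed difference of rates as the disparity; the abstract does not specify the paper's actual model, so these choices and the `disparity_posterior` name are illustrative assumptions only.

```python
import random
import statistics
from typing import Tuple


def disparity_posterior(
    accepted_a: int,
    total_a: int,
    accepted_b: int,
    total_b: int,
    draws: int = 20000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return posterior mean and standard deviation of rate_a - rate_b.

    Assumes independent Beta(1, 1) priors, so each posterior is
    Beta(1 + accepted, 1 + rejected).
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(draws):
        rate_a = rng.betavariate(1 + accepted_a, 1 + total_a - accepted_a)
        rate_b = rng.betavariate(1 + accepted_b, 1 + total_b - accepted_b)
        samples.append(rate_a - rate_b)
    return statistics.fmean(samples), statistics.stdev(samples)


# Both decision-makers show acceptance rates of 60% vs. 40%,
# but the second one is backed by 100 times more decisions.
mean_small, std_small = disparity_posterior(6, 10, 4, 10)
mean_large, std_large = disparity_posterior(600, 1000, 400, 1000)
print(f"few decisions:  mean={mean_small:.3f}, std={std_small:.3f}")
print(f"many decisions: mean={mean_large:.3f}, std={std_large:.3f}")
```

The posterior spread shrinks as evidence accumulates, which is the kind of uncertainty a point-estimate fairness metric cannot express.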
[AI-36] Enhancing Construction Site Safety: A Lightweight Convolutional Network for Effective Helmet Detection
链接: https://arxiv.org/abs/2409.12669
作者: Mujadded Al Rabbani Alif
关键词-EN: personal protective equipment, preventing workplace injuries, protective equipment, plays a critical, workplace injuries
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In the realm of construction safety, the detection of personal protective equipment, such as helmets, plays a critical role in preventing workplace injuries. This paper details the development and evaluation of convolutional neural networks (CNNs) designed for the accurate classification of helmet presence on construction sites. Initially, a simple CNN model comprising one convolutional block and one fully connected layer was developed, yielding modest results. To enhance its performance, the model was progressively refined, first by extending the architecture to include an additional convolutional block and a fully connected layer. Subsequently, batch normalization and dropout techniques were integrated, aiming to mitigate overfitting and improve the model’s generalization capabilities. The performance of these models is methodically analyzed, revealing a peak F1-score of 84%, precision of 82%, and recall of 86% with the most advanced configuration of the first study phase. Despite these improvements, the accuracy remained suboptimal, thus setting the stage for further architectural and operational enhancements. This work lays a foundational framework for ongoing adjustments and optimization in automated helmet detection technology, with future enhancements expected to address the limitations identified during these initial experiments.
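The reported figures are mutually consistent, which a quick check of the standard F1 definition (the harmonic mean of precision and recall) confirms. The helper name below is hypothetical; the formula is the standard one.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision 82% and recall 86%, as reported in the abstract.
f1 = f1_score(0.82, 0.86)
print(f"F1 = {f1:.4f}")
```

With these inputs the result rounds to 84%, matching the peak F1-score stated above.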
[AI-37] Deep generative models as an adversarial attack strategy for tabular machine learning ICMLC
链接: https://arxiv.org/abs/2409.12642
作者: Salijona Dyrmishi,Mihaela Cătălina Stoian,Eleonora Giunchiglia,Maxime Cordy
关键词-EN: Deep Generative Models, Deep Generative, Generative Models, machine learning, found application
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at ICMLC 2024 (International Conference on Machine Learning and Cybernetics)
点击查看摘要
Abstract:Deep Generative Models (DGMs) have found application in computer vision for generating adversarial examples to test the robustness of machine learning (ML) systems. Extending these adversarial techniques to tabular ML presents unique challenges due to the distinct nature of tabular data and the necessity to preserve domain constraints in adversarial examples. In this paper, we adapt four popular tabular DGMs into adversarial DGMs (AdvDGMs) and evaluate their effectiveness in generating realistic adversarial examples that conform to domain constraints.
[AI-38] Exploring bat song syllable representations in self-supervised audio encoders
Link: https://arxiv.org/abs/2409.12634
Authors: Marianne de Heer Kloots, Mirjam Knörnschild
Keywords: human-generated sounds distinguish, species’ vocalization types, trained on human-generated, human-generated sounds, sounds distinguish
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Presented at VIHAR-2024; see this https URL
Abstract:How well can deep learning models trained on human-generated sounds distinguish between another species’ vocalization types? We analyze the encoding of bat song syllables in several self-supervised audio encoders, and find that models pre-trained on human speech generate the most distinctive representations of different syllable types. These findings form first steps towards the application of cross-species transfer learning in bat bioacoustics, as well as an improved understanding of out-of-distribution signal processing in audio encoder models.
[AI-39] Counterfactual Explanations for Clustering Models
Link: https://arxiv.org/abs/2409.12632
Authors: Aurora Spagnol, Kacper Sokol, Pietro Barbiero, Marc Langheinrich, Martin Gjoreski
Keywords: lack technical expertise, complex optimisation processes, technical expertise, rely on complex, complex optimisation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Abstract:Clustering algorithms rely on complex optimisation processes that may be difficult to comprehend, especially for individuals who lack technical expertise. While many explainable artificial intelligence techniques exist for supervised machine learning, unsupervised learning – and clustering in particular – has been largely neglected. To complicate matters further, the notion of a “true” cluster is inherently challenging to define. These facets of unsupervised learning and its explainability make it difficult to foster trust in such methods and curtail their adoption. To address these challenges, we propose a new, model-agnostic technique for explaining clustering algorithms with counterfactual statements. Our approach relies on a novel soft-scoring method that captures the spatial information utilised by clustering models. It builds upon a state-of-the-art Bayesian counterfactual generator for supervised learning to deliver high-quality explanations. We evaluate its performance on five datasets and two clustering algorithms, and demonstrate that introducing soft scores to guide counterfactual search significantly improves the results.
[AI-40] CamelEval: Advancing Culturally Aligned Arabic Language Models and Benchmarks
Link: https://arxiv.org/abs/2409.12623
Authors: Zhaozhi Qian, Faroq Altam, Muhammad Saleh Saeed Alqurishi, Riad Souissi
Keywords: artificial intelligence systems, modern artificial intelligence, Large Language Models, Large Language, intelligence systems
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract:Large Language Models (LLMs) are the cornerstones of modern artificial intelligence systems. This paper introduces Juhaina, an Arabic-English bilingual LLM specifically designed to align with the values and preferences of Arabic speakers. Juhaina inherently supports advanced functionalities such as instruction following, open-ended question answering, information provisioning, and text processing. Our model contains 9.24 billion parameters and is trained on a context window of up to 8,192 tokens. This paper details the creation process of Juhaina and provides an extensive empirical evaluation. Furthermore, we identify the limitations of the widely adopted Open Arabic LLM Leaderboard (OALL) and propose a new evaluation benchmark, CamelEval. Our findings demonstrate that Juhaina surpasses existing LLMs of comparable sizes, such as the Llama and Gemma families, in generating helpful responses in Arabic, providing factually accurate information about the region, and understanding nuanced cultural aspects. We aspire for Juhaina to democratize cutting-edge AI technologies, serving over 400 million Arabic speakers by offering LLMs that not only communicate in their language but also comprehend their culture. We publicly release all models on Huggingface at this https URL.
[AI-41] Iteration of Thought: Leveraging Inner Dialogue for Autonomous Large Language Model Reasoning
Link: https://arxiv.org/abs/2409.12618
Authors: Santosh Kumar Radha, Yasamin Nouri Jelyani, Ara Ghukasyan, Oktay Goktas
Keywords: large language models, advanced language processing, language processing power, Iterative human engagement, language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Abstract:Iterative human engagement is a common and effective means of leveraging the advanced language processing power of large language models (LLMs). Using well-structured prompts in a conversational manner, human users can effectively influence an LLM to develop more thoughtful and accurate responses. Motivated by this insight, we propose the Iteration of Thought (IoT) framework for enhancing LLM responses by generating “thought”-provoking prompts vis-à-vis an input query and the current iteration of an LLM’s response. Unlike static or semi-static approaches, e.g. Chain of Thought (CoT) or Tree of Thoughts (ToT), IoT adapts its reasoning path dynamically, based on evolving context, and without generating alternate explorative thoughts which are ultimately discarded. The three components of the IoT framework are (1) an Inner Dialogue Agent (IDA) responsible for generating instructive, context-specific prompts; (2) an LLM Agent (LLMA) that processes these prompts to refine its responses; and (3) an iterative prompting loop that implements a conversation between the former two components. We introduce two variants of our framework: Autonomous Iteration of Thought (AIoT), where an LLM decides when to stop iterating, and Guided Iteration of Thought (GIoT), which always forces a fixed number of iterations. We investigate the performance of IoT across various datasets, spanning complex reasoning tasks from the GPQA dataset, explorative problem-solving in Game of 24, puzzle solving in Mini Crosswords, and multi-hop question answering from the HotpotQA dataset. Our results show that IoT represents a viable paradigm for autonomous response refinement in LLMs, showcasing significant improvements over CoT and thereby enabling more adaptive and efficient reasoning systems that minimize human intervention.
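The three-component loop described in this abstract can be pictured with a small sketch. This is not the authors' implementation: the callables `ida`, `llma`, and `should_stop`, their signatures, and the choice to produce an unguided first answer before the loop are all assumptions made for illustration. Passing `should_stop` mimics the AIoT variant (the model decides when to stop); omitting it mimics GIoT (a fixed number of iterations).

```python
from typing import Callable, Optional

def iteration_of_thought(
    query: str,
    ida: Callable[[str, str], str],    # hypothetical Inner Dialogue Agent: (query, response) -> prompt
    llma: Callable[[str, str], str],   # hypothetical LLM Agent: (query, prompt) -> refined response
    max_iters: int = 3,
    should_stop: Optional[Callable[[str, str], bool]] = None,
) -> str:
    """Refine a response by alternating between the two agents."""
    # Assumption: start from an unguided answer (empty guidance prompt).
    response = llma(query, "")
    for _ in range(max_iters):
        # AIoT-style early exit; with should_stop=None this is the GIoT-style fixed loop.
        if should_stop is not None and should_stop(query, response):
            break
        prompt = ida(query, response)    # context-specific guidance from the IDA
        response = llma(query, prompt)   # the LLMA refines its answer
    return response

# Stub agents standing in for real LLM calls.
calls = []

def stub_ida(query: str, response: str) -> str:
    return "check step " + str(len(calls))

def stub_llma(query: str, prompt: str) -> str:
    calls.append(prompt)
    return f"answer v{len(calls)}"

final = iteration_of_thought("What is 2 + 2?", stub_ida, stub_llma, max_iters=2)
```

In a real system both stubs would wrap model calls, and `should_stop` would typically be another model query asking whether the current answer is final.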
[AI-42] Enhancing Agricultural Environment Perception via Active Vision and Zero-Shot Learning
Link: https://arxiv.org/abs/2409.12602
Authors: Michele Carlo La Greca, Mirko Usuelli, Matteo Matteucci
Keywords: faces unprecedented challenges, fundamental for human, human sustenance, faces unprecedented, unprecedented challenges
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Abstract:Agriculture, fundamental for human sustenance, faces unprecedented challenges. The need for efficient, human-cooperative, and sustainable farming methods has never been greater. The core contributions of this work involve leveraging Active Vision (AV) techniques and Zero-Shot Learning (ZSL) to improve the robot’s ability to perceive and interact with the agricultural environment in the context of fruit harvesting. The AV Pipeline implemented within ROS 2 integrates Next-Best View (NBV) Planning for 3D environment reconstruction through a dynamic 3D Occupancy Map. Our system allows the robotic arm to dynamically plan and move to the most informative viewpoints and explore the environment, updating the 3D reconstruction using semantic information produced through ZSL models. Simulation and real-world experimental results demonstrate our system’s effectiveness in complex visibility conditions, outperforming traditional and static predefined planning methods. The ZSL segmentation models employed, such as YOLO World + EfficientViT SAM, exhibit high-speed performance and accurate segmentation, allowing flexibility when dealing with semantic information in unknown agricultural contexts without requiring any fine-tuning process.
[AI-43] Model calibration using a parallel differential evolution algorithm in computational neuroscience: simulation of stretch induced nerve deficit
Link: https://arxiv.org/abs/2409.12567
Authors: Antonio LaTorre, Man Ting Kwong, Julián A. García-Grajales, Riyi Shi, Antoine Jérusalem, José-María Peña
Keywords: spinal cord injuries, young adults worldwide, cord injuries, adults worldwide, brain and spinal
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Abstract:Neuronal damage, in the form of both brain and spinal cord injuries, is one of the major causes of disability and death in young adults worldwide. One way to assess the direct damage occurring after a mechanical insult is the simulation of the neuronal cells’ functional deficits following the mechanical event. In this study, we use a coupled mechanical electrophysiological model with several free parameters that are required to be calibrated against experimental results. The calibration is carried out by means of an evolutionary algorithm (differential evolution, DE) that needs to evaluate each configuration of parameters on six different damage cases, each of them taking several minutes to compute. To minimise the simulation time of the parameter tuning for the DE, the stretch of one unique fixed-diameter axon with a simplified triggering process is used to speed up the calculations. The model is then leveraged for the parameter optimization of the more realistic bundle of independent axons, an impractical configuration to run on a single-processor computer. To this end, we have developed a parallel implementation based on OpenMP that runs on a multi-processor system, taking advantage of all the available computational power. The parallel DE algorithm obtains good results, outperforming the best effort achieved by published manual calibration, in a fraction of the time. While not being able to fully capture the experimental results, the resulting nerve model provides a complex averaging framework for nerve damage simulation able to simulate gradual axonal functional alteration in a bundle.
[AI-44] PersonaFlow: Boosting Research Ideation with LLM-Simulated Expert Personas
Link: https://arxiv.org/abs/2409.12538
Authors: Yiren Liu, Pranav Sharma, Mehul Jitendra Oswal, Haijun Xia, Yun Huang
Keywords: Large Language Model, requires discussions, discussions and feedback, Developing novel interdisciplinary, support research ideation
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Abstract:Developing novel interdisciplinary research ideas often requires discussions and feedback from experts across different domains. However, obtaining timely inputs is challenging due to the scarce availability of domain experts. Recent advances in Large Language Model (LLM) research have suggested the feasibility of utilizing LLM-simulated expert personas to support research ideation. In this study, we introduce PersonaFlow, an LLM-based system using persona simulation to support the ideation stage of interdisciplinary scientific discovery. Our findings indicate that using multiple personas during ideation significantly enhances user-perceived quality of outcomes (e.g., relevance of critiques, creativity of research questions) without increasing cognitive load. We also found that users’ persona customization interactions significantly improved their sense of control and recall of generated ideas. Based on the findings, we highlight ethical concerns, including potential over-reliance and cognitive biases, and suggest design implications for leveraging LLM-simulated expert personas to support research ideation when human expertise is inaccessible.
[AI-45] Should RAG Chatbots Forget Unimportant Conversations? Exploring Importance and Forgetting with Psychological Insights
Link: https://arxiv.org/abs/2409.12524
Authors: Ryuichi Sumida, Koji Inoue, Tatsuya Kawahara
Keywords: degrades retrieval accuracy, increasing memory load, progress degrades retrieval, Retrieval-Augmented Generation, conversations progress degrades
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract:While Retrieval-Augmented Generation (RAG) has shown promise in enhancing long-term conversations, the increasing memory load as conversations progress degrades retrieval accuracy. Drawing on psychological insights, we propose LUFY, a simple yet effective method that focuses on emotionally arousing memories and retains less than 10% of the conversation. In the user experiment, participants interacted with three types of RAG chatbots, each for 2 hours over 4 sessions, marking the most extensive assessment of a chatbot’s long-term capabilities to date – more than four times longer than any existing benchmark. The results demonstrate that prioritizing arousing memories while forgetting the majority of the conversation significantly enhances user experience. This study pushes the frontier of long-term conversations and highlights the importance of forgetting unimportant parts of conversations. Code and Dataset: this https URL
[AI-46] Hi-SLAM: Scaling-up Semantics in SLAM with a Hierarchically Categorical Gaussian Splatting
Link: https://arxiv.org/abs/2409.12518
Authors: Boying Li, Zhixi Cai, Yuan-Fang Li, Ian Reid, Hamid Rezatofighi
Keywords: Gaussian Splatting SLAM, enables accurate global, Gaussian Splatting, Splatting SLAM method, explicit semantic label
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 6 pages, 4 figures
Abstract:We propose Hi-SLAM, a semantic 3D Gaussian Splatting SLAM method featuring a novel hierarchical categorical representation, which enables accurate global 3D semantic mapping, scaling-up capability, and explicit semantic label prediction in the 3D world. The parameter usage in semantic SLAM systems increases significantly with the growing complexity of the environment, making it particularly challenging and costly for scene understanding. To address this problem, we introduce a novel hierarchical representation that encodes semantic information in a compact form into 3D Gaussian Splatting, leveraging the capabilities of large language models (LLMs). We further introduce a novel semantic loss designed to optimize hierarchical semantic information through both inter-level and cross-level optimization. Furthermore, we enhance the whole SLAM system, resulting in improved tracking and mapping performance. Our Hi-SLAM outperforms existing dense SLAM methods in both mapping and tracking accuracy, while achieving a 2x operation speed-up. Additionally, it exhibits competitive performance in rendering semantic segmentation in small synthetic scenes, with significantly reduced storage and training time requirements. Rendering FPS impressively reaches 2,000 with semantic information and 3,000 without it. Most notably, it showcases the capability of handling the complex real-world scene with more than 500 semantic classes, highlighting its valuable scaling-up capability.
[AI-47] Scaling FP8 training to trillion-token LLMs
Link: https://arxiv.org/abs/2409.12517
Authors: Maxim Fishman, Brian Chmiel, Ron Banner, Daniel Soudry
Keywords: large language models, trillion tokens, large language, increase over previous, previous limits
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:We train, for the first time, large language models using FP8 precision on datasets up to 2 trillion tokens – a 20-fold increase over previous limits. Through these extended training runs, we uncover critical instabilities in FP8 training that were not observable in earlier works with shorter durations. We trace these instabilities to outlier amplification by the SwiGLU activation function. Interestingly, we show, both analytically and empirically, that this amplification happens only over prolonged training periods, and link it to a SwiGLU weight alignment process. To address this newly identified issue, we introduce Smooth-SwiGLU, a novel modification that ensures stable FP8 training without altering function behavior. We also demonstrate, for the first time, FP8 quantization of both Adam optimizer moments. Combining these innovations, we successfully train a 7B parameter model using FP8 precision on 256 Intel Gaudi2 accelerators, achieving on-par results with the BF16 baseline while delivering up to a ~34% throughput improvement.
[AI-48] LLMR: Knowledge Distillation with a Large Language Model-Induced Reward COLING2024
Link: https://arxiv.org/abs/2409.12500
Authors: Dongheng Li, Yongchang Hao, Lili Mou
Keywords: demonstrated remarkable performance, natural language processing, Large language models, increasingly popular, popular and demonstrated
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by LERC COLING 2024
Abstract:Large language models have become increasingly popular and demonstrated remarkable performance in various natural language processing (NLP) tasks. However, these models are typically computationally expensive and difficult to deploy in resource-constrained environments. In this paper, we propose LLMR, a novel knowledge distillation (KD) method based on a reward function induced from large language models. We conducted experiments on multiple datasets in the dialogue generation and summarization tasks. Empirical results demonstrate that our LLMR approach consistently outperforms traditional KD methods in different tasks and datasets.
[AI-49] CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs
Link: https://arxiv.org/abs/2409.12490
Authors: Junlin Lv, Yuan Feng, Xike Xie, Xin Jia, Qirong Peng, Guiming Xie
Keywords: Large language models, achieved notable success, Large language, quadratic computation complexity, language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:Large language models have achieved notable success across various domains, yet efficient inference is still limited by the quadratic computation complexity of the attention mechanism. The inference consists of prefilling and decoding phases. Although several attempts have been made to accelerate decoding, the inefficiency of the prefilling phase, especially for long-context tasks, remains a challenge. In this paper, we observe a locality in query criticality during the prefilling phase of long-context processing: adjacent query tokens tend to focus on similar subsets of the past Key-Value (KV) cache. Based on this observation, we propose CritiPrefill, a criticality-based segment-wise prefilling method. This method partitions the input sequence’s queries and KV cache into segments and blocks, utilizing a segment-wise algorithm to estimate the query criticality. By pruning non-critical computations between query segments and cache blocks in the self-attention mechanism, the prefilling process can be significantly accelerated. Extensive evaluations on multiple long-context datasets show up to 2.7x speedup on Llama3-8B and 3.0x speedup on Yi-9B for 128K context length on a single A100 GPU, with minimal quality degradation.
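To make the segment-and-block idea in this abstract concrete, here is a toy sketch of selecting which KV cache blocks each query segment attends to. It is not the paper's algorithm: scoring a (segment, block) pair by the dot product of mean vectors, the `budget` parameter, and ignoring causal masking are all simplifying assumptions made for illustration.

```python
from typing import List

Vector = List[float]

def chunk(items: List[Vector], size: int) -> List[List[Vector]]:
    """Split a sequence of vectors into consecutive groups of `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def mean_vec(vectors: List[Vector]) -> Vector:
    """Element-wise mean, used here as a cheap representative of a group."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def select_blocks(queries: List[Vector], keys: List[Vector],
                  seg_size: int, block_size: int, budget: int) -> List[List[int]]:
    """For each query segment, return the indices of the `budget` key blocks
    with the highest estimated criticality (assumed: mean-vector dot product)."""
    block_reps = [mean_vec(block) for block in chunk(keys, block_size)]
    plan = []
    for segment in chunk(queries, seg_size):
        q_rep = mean_vec(segment)
        scores = [dot(q_rep, rep) for rep in block_reps]
        top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:budget]
        plan.append(sorted(top))  # attention would then be computed only on these blocks
    return plan

toy_queries = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
toy_keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
plan = select_blocks(toy_queries, toy_keys, seg_size=2, block_size=2, budget=1)
```

The saving comes from computing one score per (segment, block) pair instead of one per (query, key) pair, then skipping the blocks outside the budget.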
[AI-50] Learning Multi-Manifold Embedding for Out-Of-Distribution Detection ECCV2024
Link: https://arxiv.org/abs/2409.12479
Authors: Jeng-Lin Li, Ming-Ching Chang, Wei-Chao Chen
Keywords: OOD, real-world applications, crucial for trustworthy, Detecting, OOD samples
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: European Conference on Computer Vision ECCV 2024 BEW Workshop Best Paper
Abstract:Detecting out-of-distribution (OOD) samples is crucial for trustworthy AI in real-world applications. Leveraging recent advances in representation learning and latent embeddings, various scoring algorithms estimate distributions beyond the training data. However, a single embedding space falls short in characterizing in-distribution data and defending against diverse OOD conditions. This paper introduces a novel Multi-Manifold Embedding Learning (MMEL) framework, optimizing hypersphere and hyperbolic spaces jointly for enhanced OOD detection. MMEL generates representative embeddings and employs a prototype-aware scoring function to differentiate OOD samples. It operates with very few OOD samples and requires no model retraining. Experiments on six open datasets demonstrate MMEL’s significant reduction in FPR while maintaining a high AUC compared to state-of-the-art distance-based OOD detection methods. We analyze the effects of learning multiple manifolds and visualize OOD score distributions across datasets. Notably, enrolling ten OOD samples without retraining achieves comparable FPR and AUC to modern outlier exposure methods using 80 million outlier samples for model training.
[AI-51] ViolinDiff: Enhancing Expressive Violin Synthesis with Pitch Bend Conditioning
Link: https://arxiv.org/abs/2409.12477
Authors: Daewoong Kim, Hao-Wen Dong, Dasaem Jeong
Keywords: fundamental frequency, plays a critical, critical role, natural contour, music audio synthesis
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Abstract:Modeling the natural contour of fundamental frequency (F0) plays a critical role in music audio synthesis. However, transcribing and managing multiple F0 contours in polyphonic music is challenging, and explicit F0 contour modeling has not yet been explored for polyphonic instrumental synthesis. In this paper, we present ViolinDiff, a two-stage diffusion-based synthesis framework. For a given violin MIDI file, the first stage estimates the F0 contour as pitch bend information, and the second stage generates a mel spectrogram incorporating these expressive details. The quantitative metrics and listening test results show that the proposed model generates more realistic violin sounds than the model without explicit pitch bend modeling. Audio samples are available online: this http URL.
[AI-52] TEAM: Temporal Adversarial Examples Attack Model against Network Intrusion Detection System Applied to RNN
Link: https://arxiv.org/abs/2409.12472
Authors: Ziyi Liu, Dengpan Ye, Long Tang, Yunming Zhang, Jiacheng Deng
Keywords: intrusion detection systems, network intrusion detection, neural networks play, neural networks, time steps
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Abstract:With the development of artificial intelligence, neural networks play a key role in network intrusion detection systems (NIDS). Despite their tremendous advantages, neural networks are susceptible to adversarial attacks. To improve the reliability of NIDS, much research has been conducted and plenty of solutions have been proposed. However, the existing solutions rarely consider adversarial attacks against recurrent neural networks (RNN) with time steps, which would greatly affect the application of NIDS in the real world. Therefore, we first propose a novel RNN adversarial attack model based on feature reconstruction called Temporal adversarial Examples Attack Model (TEAM), which applies to time series data and reveals the potential connection between adversarial examples and time steps in RNN. That is, the past adversarial examples within the same time steps can trigger further attacks on current or future original examples. Moreover, TEAM leverages Time Dilation (TD) to effectively mitigate the temporal effect among adversarial examples within the same time steps. Experimental results show that in most attack categories, TEAM increases the misjudgment rate of NIDS on both black and white boxes, making the misjudgment rate reach more than 96.68%. Meanwhile, the maximum increase in the misjudgment rate of the NIDS for subsequent original samples exceeds 95.57%.
[AI-53] Arena 4.0: A Comprehensive ROS2 Development and Benchmarking Platform for Human-centric Navigation Using Generative-Model-based Environment Generation
Link: https://arxiv.org/abs/2409.12471
Authors: Volodymyr Shcherbyna, Linh Kästner, Diego Diaz, Huu Giang Nguyen, Maximilian Ho-Kyoung Schreff, Tim Lenz, Jonas Kreutz, Ahmed Martban, Huajian Zeng, Harold Soh
Keywords: paper introduces Arena, introduces Arena, Arena, paper introduces, previous work
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 7 pages, 7 figures
Abstract:Building on the foundations of our previous work, this paper introduces Arena 4.0, a significant advancement over Arena 3.0, Arena-Bench, Arena 1.0, and Arena 2.0. Arena 4.0 offers three key novel contributions: (1) a generative-model-based world and scenario generation approach that utilizes large language models (LLMs) and diffusion models to dynamically generate complex, human-centric environments from text prompts or 2D floorplans, useful for the development and benchmarking of social navigation strategies; (2) a comprehensive 3D model database, extendable with additional 3D assets that are semantically linked and annotated for dynamic spawning and arrangement within 3D worlds; and (3) a complete migration to ROS 2, enabling compatibility with modern hardware and enhanced functionalities for improved navigation, usability, and easier deployment on real robots. We evaluated the platform’s performance through a comprehensive user study, demonstrating significant improvements in usability and efficiency compared to previous versions. Arena 4.0 is openly available at this https URL.
[AI-54] Familiarity-aware Evidence Compression for Retrieval Augmented Generation
Link: https://arxiv.org/abs/2409.12468
Authors: Dongwon Jung, Qin Liu, Tenghao Huang, Ben Zhou, Muhao Chen
Keywords: Retrieval Augmented Generation, Augmented Generation, Retrieval Augmented, improves large language, large language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract:Retrieval Augmented Generation (RAG) improves large language models (LMs) by incorporating non-parametric knowledge through evidence retrieval from external sources. However, it often struggles to filter out inconsistent and irrelevant information that can distract the LM from its tasks. While compressing the retrieved evidence with a compression model aims to address this issue, the compressed evidence may still be unfamiliar to the target model used for the downstream task, which may then fail to utilize the evidence effectively. We propose FaviComp (Familiarity-aware Evidence Compression), a novel training-free evidence compression technique that makes retrieved evidence more familiar to the target model, while seamlessly integrating parametric knowledge from the model. Specifically, FaviComp proactively lowers the perplexity of the compressed evidence with regard to the target model by combining token probabilities from both the compression model and the target model to generate context that is more familiar to the target model. This approach balances the integration of parametric and non-parametric knowledge, which is especially helpful in complex tasks where the retrieved evidence set may not contain all the necessary information. Experimental results demonstrate that FaviComp consistently outperforms existing baselines in multiple open-domain QA datasets, achieving high compression rates and showcasing the effective integration of both parametric and non-parametric knowledge.
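The abstract describes combining token probabilities from the compression model and the target model during decoding. A minimal sketch of one such step follows. The weighted sum of log-probabilities, the weight `alpha`, and the toy distributions are assumptions for illustration; the abstract does not state the exact combination rule.

```python
import math
from typing import Dict

def combine_next_token(comp_logprobs: Dict[str, float],
                       target_logprobs: Dict[str, float],
                       alpha: float = 0.5) -> str:
    """Pick the next token from a weighted mix of two models' log-probabilities.

    alpha = 0 follows the compression model only; alpha = 1 follows the
    target model only. The weighting scheme is an illustrative assumption.
    """
    shared = comp_logprobs.keys() & target_logprobs.keys()
    combined = {
        token: (1 - alpha) * comp_logprobs[token] + alpha * target_logprobs[token]
        for token in shared
    }
    return max(combined, key=combined.get)

# Toy next-token distributions over a two-word vocabulary (made-up numbers).
comp = {"a": math.log(0.6), "b": math.log(0.4)}      # compression model prefers "a"
target = {"a": math.log(0.1), "b": math.log(0.9)}    # target model finds "b" more familiar

choice_mixed = combine_next_token(comp, target, alpha=0.5)
choice_comp_only = combine_next_token(comp, target, alpha=0.0)
```

With equal weights the token that the target model finds familiar wins here, which is the intended effect of lowering the compressed evidence's perplexity under the target model.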
[AI-55] SurgPLAN++: Universal Surgical Phase Localization Network for Online and Offline Inference
Link: https://arxiv.org/abs/2409.12467
Authors: Zhen Chen, Xingjian Luo, Jinlin Wu, Long Bai, Zhen Lei, Hongliang Ren, Sebastien Ourselin, Hongbin Liu
Keywords: Surgical phase recognition, phase recognition, Surgical phase, phase, Surgical
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:Surgical phase recognition is critical for assisting surgeons in understanding surgical videos. Existing studies focused more on online surgical phase recognition, by leveraging preceding frames to predict the current frame. Despite great progress, they formulated the task as a series of frame-wise classifications, which resulted in a lack of global context of the entire procedure and incoherent predictions. Moreover, besides online analysis, accurate offline surgical phase recognition is also in significant clinical need for retrospective analysis, and existing online algorithms do not fully analyze the entire video, thereby limiting accuracy in offline analysis. To overcome these challenges and enhance both online and offline inference capabilities, we propose a universal Surgical Phase Localization Network, named SurgPLAN++, with the principle of temporal detection. To ensure a global understanding of the surgical procedure, we devise a phase localization strategy for SurgPLAN++ to predict phase segments across the entire video through phase proposals. For online analysis, to generate high-quality phase proposals, SurgPLAN++ incorporates a data augmentation strategy to extend the streaming video into a pseudo-complete video through mirroring, center-duplication, and down-sampling. For offline analysis, SurgPLAN++ capitalizes on its global phase prediction framework to continuously refine preceding predictions during each online inference step, thereby significantly improving the accuracy of phase recognition. We perform extensive experiments to validate the effectiveness, and our SurgPLAN++ achieves remarkable performance in both online and offline modes, outperforming state-of-the-art methods. The source code is available at this https URL.
[AI-56] FoME: A Foundation Model for EEG using Adaptive Temporal-Lateral Attention Scaling
Link: https://arxiv.org/abs/2409.12454
Authors: Enze Shi, Kui Zhao, Qilong Yuan, Jiaqi Wang, Huawen Hu, Sigang Yu, Shu Zhang
Keywords: record brain activity, limited labeled datasets, signal heterogeneity, vital tool, tool to measure
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Abstract:Electroencephalography (EEG) is a vital tool to measure and record brain activity in neuroscience and clinical applications, yet its potential is constrained by signal heterogeneity, low signal-to-noise ratios, and limited labeled datasets. In this paper, we propose FoME (Foundation Model for EEG), a novel approach using adaptive temporal-lateral attention scaling to address the above-mentioned challenges. FoME is pre-trained on a diverse 1.7TB dataset of scalp and intracranial EEG recordings, comprising 745M parameters trained for 1,096k steps. Our model introduces two key innovations: a time-frequency fusion embedding technique and an adaptive time-lateral attention scaling (ATLAS) mechanism. These components synergistically capture complex temporal and spectral EEG dynamics, enabling FoME to adapt to varying patterns across diverse data streams and facilitate robust multi-channel modeling. Evaluations across four downstream tasks demonstrate FoME’s superior performance in classification and forecasting applications, consistently achieving state-of-the-art results. To conclude, FoME establishes a new paradigm for EEG analysis, offering a versatile foundation that advances brain-computer interfaces, clinical diagnostics, and cognitive research across neuroscience and related fields. Our code will be available at this https URL.
[AI-57] Domain Generalization for Endoscopic Image Segmentation by Disentangling Style-Content Information and SuperPixel Consistency
Link: https://arxiv.org/abs/2409.12450
Authors: Mansoor Ali Teevno, Rafael Martinez-Garcia-Pena, Gilberto Ochoa-Ruiz, Sharib Ali
Keywords: stratify individuals based, Frequent monitoring, cancer precursors, developing gastrointestinal, stratify individuals
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract:Frequent monitoring is necessary to stratify individuals based on their likelihood of developing gastrointestinal (GI) cancer precursors. In clinical practice, white-light imaging (WLI) and complementary modalities such as narrow-band imaging (NBI) and fluorescence imaging are used to assess risk areas. However, conventional deep learning (DL) models show degraded performance due to the domain gap when a model is trained on one modality and tested on a different one. In our earlier approach, we used a superpixel-based method referred to as “SUPRA” to effectively learn domain-invariant information using color and space distances to generate groups of pixels. One of the main limitations of this earlier work is that the aggregation does not exploit structural information, making it suboptimal for segmentation tasks, especially for polyps and heterogeneous color distributions. Therefore, in this work, we propose an approach for style-content disentanglement using instance normalization and instance selective whitening (ISW) for improved domain generalization when combined with SUPRA. We evaluate our approach on two datasets: EndoUDA Barrett’s Esophagus and EndoUDA polyps, and compare its performance with three state-of-the-art (SOTA) methods. Our findings demonstrate a notable enhancement in performance compared to both baseline and SOTA methods across the target domain data. Specifically, our approach exhibited improvements of 14%, 10%, 8%, and 18% over the baseline and three SOTA methods on the polyp dataset. Additionally, it surpassed the second-best method (EndoUDA) on the Barrett’s Esophagus dataset by nearly 2%.
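The abstract names instance normalization as one ingredient of the style-content disentanglement. As a rough illustration only (not the authors' implementation, and without the ISW step), the sketch below applies plain instance normalization to a single image's feature map in pure Python. The function name `instance_norm` and the toy inputs are assumptions made for this example. Two inputs that differ only by a per-channel brightness and contrast change, a crude stand-in for a modality "style", end up nearly identical after normalization.

```python
import math


def instance_norm(feature_map, eps=1e-5):
    """Normalize each channel of ONE image to zero mean and unit variance.

    feature_map: list of channels, each a list of rows of floats (C x H x W).
    No learned affine parameters are applied in this sketch.
    """
    normalized = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        normalized.append([[(v - mean) * scale for v in row] for row in channel])
    return normalized


# Hypothetical toy data: same spatial pattern, different brightness/contrast.
image_a = [[[1.0, 2.0], [3.0, 4.0]]]
image_b = [[[10.0, 30.0], [50.0, 70.0]]]  # equals 20 * image_a - 10

norm_a = instance_norm(image_a)
norm_b = instance_norm(image_b)
print(norm_a[0])
print(norm_b[0])
```

Because the per-channel mean and variance are removed per image, what remains is mostly the spatial pattern, which is the intuition behind using it for domain generalization.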
[AI-58] Prompts Are Programs Too! Understanding How Developers Build Software Containing Prompts
Link: https://arxiv.org/abs/2409.12447
Authors: Jenny T. Liang, Melissa Lin, Nikitha Rao, Brad A. Myers
Keywords: users repeatedly write, generative pre-trained models, prompt, prompt programming, achieve a task
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract:The introduction of generative pre-trained models, like GPT-4, has introduced a phenomenon known as prompt engineering, whereby model users repeatedly write and revise prompts while trying to achieve a task. Using these AI models for intelligent features in software applications requires using APIs that are controlled through developer-written prompts. These prompts have powered AI experiences in popular software products, potentially reaching millions of users. Despite the growing impact of prompt-powered software, little is known about its development process and its relationship to programming. In this work, we argue that some forms of prompts are programs, and that the development of prompts is a distinct phenomenon in programming. We refer to this phenomenon as prompt programming. To this end, we develop an understanding of prompt programming using Straussian grounded theory through interviews with 20 developers engaged in prompt development across a variety of contexts, models, domains, and prompt complexities. Through this study, we contribute 14 observations about prompt programming. For example, rather than building mental models of code, prompt programmers develop mental models of the FM’s behavior on the prompt and its unique qualities by interacting with the model. While prior research has shown that experts have well-formed mental models, we find that prompt programmers who have developed dozens of prompts, each with many iterations, still struggle to develop reliable mental models. This contributes to a rapid and unsystematic development process. Taken together, our observations indicate that prompt programming is significantly different from traditional software development, motivating the creation of tools to support prompt programming. Our findings have implications for software engineering practitioners, educators, and researchers.
[AI-59] Neural Networks Generalize on Low Complexity Data
Link: https://arxiv.org/abs/2409.12446
Authors: Sourav Chatterjee, Timothy Sudijono
Keywords: ReLU activation generalize, low complexity data, suitably defined, feedforward neural, simple programming language
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: Comments welcome. 27 pages
Abstract:We show that feedforward neural networks with ReLU activation generalize on low complexity data, suitably defined. Given i.i.d. data generated from a simple programming language, the minimum description length (MDL) feedforward neural network which interpolates the data generalizes with high probability. We define this simple programming language, along with a notion of description length of such networks. We provide several examples on basic computational tasks, such as checking primality of a natural number, and more. For primality testing, our theorem shows the following. Suppose that we draw an i.i.d. sample of $\Theta(N^{\delta} \ln N)$ numbers uniformly at random from $1$ to $N$, where $\delta \in (0,1)$. For each number $x_i$, let $y_i = 1$ if $x_i$ is a prime and $0$ if it is not. Then with high probability, the MDL network fitted to this data accurately answers whether a newly drawn number between $1$ and $N$ is a prime or not, with test error $\leq O(N^{-\delta})$. Note that the network is not designed to detect primes; minimum description learning discovers a network which does so.
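To get a feel for the scaling in the primality result, the following sketch simply evaluates the two expressions from the abstract, N^δ ln N for the sample size and N^(-δ) for the test error, at a concrete N and δ. The function name is hypothetical, and the constants hidden inside Θ(·) and O(·) are ignored, so the numbers indicate orders of magnitude only, not the theorem's actual guarantees.

```python
import math


def mdl_primality_scaling(n, delta):
    """Return (N^delta * ln N, N^-delta), ignoring hidden constants."""
    if n < 2 or not 0.0 < delta < 1.0:
        raise ValueError("need N >= 2 and 0 < delta < 1")
    sample_scale = n ** delta * math.log(n)
    error_scale = n ** (-delta)
    return sample_scale, error_scale


# Illustrative choice (an assumption, not from the paper): N = 10^6, delta = 0.5.
samples, error = mdl_primality_scaling(10**6, 0.5)
print(f"sample size scale: {samples:.0f} out of {10**6} numbers")
print(f"test error scale:  {error:.0e}")
```

With these values the sample is on the order of ten thousand numbers, a small fraction of the million candidates, while the error scale is about one in a thousand.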
[AI-60] A Lightweight and Real-Time Binaural Speech Enhancement Model with Spatial Cues Preservation
Link: https://arxiv.org/abs/2409.12444
Authors: Jingyuan Wang, Jie Zhang, Shihao Chen, Miao Sun
Keywords: noisy signals received, Binaural speech enhancement, spatial cues, spatial cues preservation, aims to jointly
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Abstract:Binaural speech enhancement (BSE) aims to jointly improve the speech quality and intelligibility of noisy signals received by hearing devices and preserve the spatial cues of the target for natural listening. Existing methods often suffer from the compromise between noise reduction (NR) capacity and spatial cues preservation (SCP) accuracy and a high computational demand in complex acoustic scenes. In this work, we present a learning-based lightweight binaural complex convolutional network (LBCCN), which excels in NR by filtering low-frequency bands and keeping the rest. Additionally, our approach explicitly incorporates the estimation of interchannel relative acoustic transfer function to ensure the spatial cues fidelity and speech clarity. Results show that the proposed LBCCN can achieve a comparable NR performance to state-of-the-art methods under various noise conditions, but with a much lower computational cost and a better SCP. The reproducible code and audio examples are available at this https URL.
[AI-61] Incremental and Data-Efficient Concept Formation to Support Masked Word Prediction
Link: https://arxiv.org/abs/2409.12440
Authors: Xin Lian, Nishant Baglodi, Christopher J. MacLellan
Keywords: efficient language model, language model learning, supports masked word, paper introduces, efficient language
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by the Eleventh Annual Conference on Advances in Cognitive Systems
Abstract:This paper introduces Cobweb4L, a novel approach for efficient language model learning that supports masked word prediction. The approach builds on Cobweb, an incremental system that learns a hierarchy of probabilistic concepts. Each concept stores the frequencies of words that appear in instances tagged with that concept label. The system utilizes an attribute value representation to encode words and their surrounding context into instances. Cobweb4L uses the information theoretic variant of category utility and a new performance mechanism that leverages multiple concepts to generate predictions. We demonstrate that with these extensions it significantly outperforms prior Cobweb performance mechanisms that use only a single node to generate predictions. Further, we demonstrate that Cobweb4L learns rapidly and achieves performance comparable to and even superior to Word2Vec. Next, we show that Cobweb4L and Word2Vec outperform BERT in the same task with less training data. Finally, we discuss future work to make our conclusions more robust and inclusive.
[AI-62] FlexiTex: Enhancing Texture Generation with Visual Guidance
Link: https://arxiv.org/abs/2409.12431
Authors: DaDong Jiang, Xianghui Yang, Zibo Zhao, Sheng Zhang, Jiaao Yu, Zeqiang Lai, Shaoxiong Yang, Chunchao Guo, Xiaobo Zhou, Zhihui Ke
Keywords: Recent texture generation, powerful generative prior, methods achieve impressive, texture generation methods, generation methods achieve
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL
Abstract:Recent texture generation methods achieve impressive results due to the powerful generative prior they leverage from large-scale text-to-image diffusion models. However, abstract textual prompts are limited in providing global textural or shape information, which results in the texture generation methods producing blurry or inconsistent patterns. To tackle this, we present FlexiTex, embedding rich information via visual guidance to generate a high-quality texture. The core of FlexiTex is the Visual Guidance Enhancement module, which incorporates more specific information from visual guidance to reduce ambiguity in the text prompt and preserve high-frequency details. To further enhance the visual guidance, we introduce a Direction-Aware Adaptation module that automatically designs direction prompts based on different camera poses, avoiding the Janus problem and maintaining semantically global consistency. Benefiting from the visual guidance, FlexiTex produces quantitatively and qualitatively sound results, demonstrating its potential to advance texture generation for real-world applications.
[AI-63] Is it Still Fair? A Comparative Evaluation of Fairness Algorithms through the Lens of Covariate Drift
Link: https://arxiv.org/abs/2409.12428
Authors: Oscar Blessed Deho, Michael Bewong, Selasi Kwashie, Jiuyong Li, Jixue Liu, Lin Liu, Srecko Joksimovic
Keywords: data distributional drift, data distributional, distributional drift, applications have grown, grown exponentially
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Abstract:Over the last few decades, machine learning (ML) applications have grown exponentially, yielding several benefits to society. However, these benefits are tempered with concerns of discriminatory behaviours exhibited by ML models. In this regard, fairness in machine learning has emerged as a priority research area. Consequently, several fairness metrics and algorithms have been developed to mitigate against discriminatory behaviours that ML models may possess. Yet still, very little attention has been paid to the problem of naturally occurring changes in data patterns (aka data distributional drift), and its impact on fairness algorithms and metrics. In this work, we study this problem comprehensively by analyzing 4 fairness-unaware baseline algorithms and 7 fairness-aware algorithms, carefully curated to cover the breadth of its typology, across 5 datasets including public and proprietary data, and evaluated them using 3 predictive performance and 10 fairness metrics. In doing so, we show that (1) data distributional drift is not a trivial occurrence, and in several cases can lead to serious deterioration of fairness in so-called fair models; (2) contrary to some existing literature, the size and direction of data distributional drift is not correlated to the resulting size and direction of unfairness; and (3) choice of, and training of fairness algorithms is impacted by the effect of data distributional drift which is largely ignored in the literature. Emanating from our findings, we synthesize several policy implications of data distributional drift on fairness algorithms that can be very relevant to stakeholders and practitioners.
[AI-64] LMT-Net: Lane Model Transformer Network for Automated HD Mapping from Sparse Vehicle Observations ITSC2024
Link: https://arxiv.org/abs/2409.12409
Authors: Michael Mink, Thomas Monninger, Steffen Staab
Keywords: High Definition, complete lane model, autonomous driving, range and occlusions, lane model
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Accepted for 2024 IEEE International Conference on Intelligent Transportation Systems (ITSC 2024)
Abstract:In autonomous driving, High Definition (HD) maps provide a complete lane model that is not limited by sensor range and occlusions. However, the generation and upkeep of HD maps involves periodic data collection and human annotations, limiting scalability. To address this, we investigate automating the lane model generation and the use of sparse vehicle observations instead of dense sensor measurements. For our approach, a pre-processing step generates polylines by aligning and aggregating observed lane boundaries. Aligned driven traces are used as starting points for predicting lane pairs defined by the left and right boundary points. We propose Lane Model Transformer Network (LMT-Net), an encoder-decoder neural network architecture that performs polyline encoding and predicts lane pairs and their connectivity. A lane graph is formed by using predicted lane pairs as nodes and predicted lane connectivity as edges. We evaluate the performance of LMT-Net on an internal dataset that consists of multiple vehicle observations as well as human annotations as Ground Truth (GT). The evaluation shows promising results and demonstrates superior performance compared to the implemented baseline on both highway and non-highway Operational Design Domain (ODD).
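The abstract describes the output structure as a lane graph whose nodes are predicted lane pairs (a left and a right boundary point) and whose edges are predicted connectivity. A minimal sketch of such a container is shown below. The class and method names (`LanePair`, `LaneGraph`, `add_pair`, `connect`, `successors`) and the coordinates are hypothetical, chosen only to make the data structure concrete; the paper's actual representation may differ.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LanePair:
    """One lane cross-section: left and right boundary points as (x, y)."""
    left: tuple
    right: tuple


@dataclass
class LaneGraph:
    """Directed graph: lane pairs are nodes, connectivity entries are edges."""
    nodes: list = field(default_factory=list)
    edges: set = field(default_factory=set)

    def add_pair(self, pair):
        """Store a lane pair and return its node index."""
        self.nodes.append(pair)
        return len(self.nodes) - 1

    def connect(self, src, dst):
        """Add a directed edge meaning 'traffic flows from src to dst'."""
        if not (0 <= src < len(self.nodes) and 0 <= dst < len(self.nodes)):
            raise IndexError("edge endpoint is not a known lane pair")
        self.edges.add((src, dst))

    def successors(self, idx):
        """Indices of lane pairs directly reachable from idx."""
        return sorted(dst for src, dst in self.edges if src == idx)


# Hypothetical example: one lane that forks into two.
graph = LaneGraph()
start = graph.add_pair(LanePair(left=(0.0, 1.75), right=(0.0, -1.75)))
keep = graph.add_pair(LanePair(left=(10.0, 1.75), right=(10.0, -1.75)))
exit_ = graph.add_pair(LanePair(left=(10.0, -1.75), right=(10.0, -5.25)))
graph.connect(start, keep)
graph.connect(start, exit_)
print(graph.successors(start))
```

Keeping geometry on the nodes and topology on the edges mirrors the split in the abstract between predicting lane pairs and predicting their connectivity.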
[AI-65] On the Effectiveness of LLMs for Manual Test Verifications
Link: https://arxiv.org/abs/2409.12405
Authors: Myron David Lucena Campos Peixoto, Davy de Medeiros Baia, Nathalia Nascimento, Paulo Alencar, Baldoino Fonseca, Márcio Ribeiro
Keywords: detecting issues missed, Large Language Models, verifications, vital for detecting, detecting issues
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 9 pages
Abstract:Background: Manual testing is vital for detecting issues missed by automated tests, but specifying accurate verifications is challenging. Aims: This study aims to explore the use of Large Language Models (LLMs) to produce verifications for manual tests. Method: We conducted two independent and complementary exploratory studies. The first study involved using 2 closed-source and 6 open-source LLMs to generate verifications for manual test steps and evaluate their similarity to original verifications. The second study involved recruiting software testing professionals to assess their perception and agreement with the generated verifications compared to the original ones. Results: The open-source models Mistral-7B and Phi-3-mini-4k demonstrated effectiveness and consistency comparable to closed-source models like Gemini-1.5-flash and GPT-3.5-turbo in generating manual test verifications. However, the agreement level among professional testers was slightly above 40%, indicating both promise and room for improvement. While some LLM-generated verifications were considered better than the originals, there were also concerns about AI hallucinations, where verifications significantly deviated from expectations. Conclusion: We contributed by generating a dataset of 37,040 test verifications using 8 different LLMs. Although the models show potential, the relatively modest 40% agreement level highlights the need for further refinement. Enhancing the accuracy, relevance, and clarity of the generated verifications is crucial to ensure greater reliability in real-world testing scenarios.
[AI-66] Preference Alignment Improves Language Model-Based TTS
Link: https://arxiv.org/abs/2409.12403
Authors: Jinchuan Tian, Chunlei Zhang, Jiatong Shi, Hao Zhang, Jianwei Yu, Shinji Watanabe, Dong Yu
Keywords: based systems offer, systems offer competitive, offer competitive performance, Recent advancements, preference alignment algorithms
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract:Recent advancements in text-to-speech (TTS) have shown that language model (LM)-based systems offer competitive performance to their counterparts. Further optimization can be achieved through preference alignment algorithms, which adjust LMs to align with the preferences of reward models, enhancing the desirability of the generated content. This study presents a thorough empirical evaluation of how preference alignment algorithms, particularly Direct Preference Optimization (DPO), enhance LM-based TTS. With a 1.15B parameter LM-based TTS model, we demonstrate that preference alignment consistently improves intelligibility, speaker similarity, and proxy subjective evaluation scores, with the latter two metrics surpassing even human speech in certain evaluations. We also show preference alignment is applicable to low-resource scenarios and effectively generalized to out-of-domain applications.
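For readers unfamiliar with Direct Preference Optimization, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. It shows the general published DPO objective, not this paper's training code; the function name, argument names, and the default `beta` value are assumptions for illustration. In a TTS setting the "chosen" and "rejected" sequences would be token sequences for two candidate utterances of the same text.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full sequence under
    the trainable policy or the frozen reference model.
    """
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: loss is log(2).
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Policy already favors the chosen sample relative to the reference: lower loss.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

The loss falls as the policy raises the chosen sequence's likelihood relative to the reference faster than the rejected one's, which is how the preference signal enters without a separate reinforcement learning loop.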
[AI-67] Learning to Coordinate without Communication under Incomplete Information AAAI2025
Link: https://arxiv.org/abs/2409.12397
Authors: Shenghui Chen, Shufang Zhu, Giuseppe De Giacomo, Ufuk Topcu
Keywords: Achieving seamless coordination, Achieving seamless, artificial intelligence, crucial challenge, challenge in artificial
Subjects: Artificial Intelligence (cs.AI)
Comments: This paper is currently under review at AAAI 2025
Abstract:Achieving seamless coordination in cooperative games is a crucial challenge in artificial intelligence, particularly when players operate under incomplete information. A common strategy to mitigate this information asymmetry involves leveraging explicit communication. However, direct communication is not always feasible due to factors such as transmission loss. We explore how effective coordination can be achieved without verbal communication, relying solely on observing each other’s actions. We demonstrate how an autonomous agent can learn to cooperate by interpreting its partner’s actions, which are used to hint at its intents. Our approach involves developing an agent strategy by constructing deterministic finite automata for each possible action and integrating them into a non-Markovian finite-state transducer. This transducer represents a non-deterministic strategy for the agent that suggests actions to assist its partner during gameplay. Experimental results in a testbed called Gnomes at Night show that the learned no-communication coordination strategy achieves significantly higher success rates and requires fewer steps to complete the game compared to uncoordinated scenarios, performing almost as well as an oracle baseline with direct communication.
[AI-68] ARTAI: An Evaluation Platform to Assess Societal Risk of Recommender Algorithms RECSYS2024
Link: https://arxiv.org/abs/2409.12396
Authors: Qin Ruan, Jin Xu, Ruihai Dong, Arjumand Younus, Tai Tan Mai, Barry O’Sullivan, Susan Leavy
Keywords: Societal risk emanating, Societal risk, Societal, Abstract, recommender algorithms disseminate
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 3 pages, 1 figure, accepted at FAccTRec 2024 Workshop, RecSys 2024
Abstract:Societal risk emanating from how recommender algorithms disseminate content online is now well documented. Emergent regulation aims to mitigate this risk through ethical audits and enabling new research on the social impact of algorithms. However, there is currently a need for tools and methods that enable such evaluation. This paper presents ARTAI, an evaluation environment that enables large-scale assessments of recommender algorithms to identify harmful patterns in how content is distributed online and enables the implementation of new regulatory requirements for increased transparency in recommender systems.
[AI-69] ITPatch: An Invisible and Triggered Physical Adversarial Patch against Traffic Sign Recognition
Link: https://arxiv.org/abs/2409.12394
Authors: Shuai Yuan, Hongwei Li, Xingshuo Han, Guowen Xu, Wenbo Jiang, Tao Ni, Qingchuan Zhao, Yuguang Fang
Keywords: Physical adversarial patches, real world, key adversarial attack, existing adversarial patches, adversarial patches
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract:Physical adversarial patches have emerged as a key adversarial attack to cause misclassification of traffic sign recognition (TSR) systems in the real world. However, existing adversarial patches have poor stealthiness and attack all vehicles indiscriminately once deployed. In this paper, we introduce an invisible and triggered physical adversarial patch (ITPatch) with a novel attack vector, i.e., fluorescent ink, to advance the state-of-the-art. ITPatch applies carefully designed fluorescent perturbations to a target sign; an attacker can later trigger a fluorescent effect using invisible ultraviolet light, causing the TSR system to misclassify the sign and potentially resulting in traffic accidents. We conducted a comprehensive evaluation to investigate the effectiveness of ITPatch, which shows a success rate of 98.31% in low-light conditions. Furthermore, our attack successfully bypasses five popular defenses and achieves a success rate of 96.72%.
[AI-70] Channel-Aware Domain-Adaptive Generative Adversarial Network for Robust Speech Recognition ICASSP2025
Link: https://arxiv.org/abs/2409.12386
Authors: Chien-Chun Wang, Li-Wei Chen, Cheng-Kang Chou, Hung-Shin Lee, Berlin Chen, Hsin-Min Wang
Keywords: demonstrate impressive performance, systems demonstrate impressive, unseen recording environments, automatic speech recognition, channel mismatch stemming
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Submitted to ICASSP 2025
Abstract:While pre-trained automatic speech recognition (ASR) systems demonstrate impressive performance on matched domains, their performance often degrades when confronted with channel mismatch stemming from unseen recording environments and conditions. To mitigate this issue, we propose a novel channel-aware data simulation method for robust ASR training. Our method harnesses the synergistic power of channel-extractive techniques and generative adversarial networks (GANs). We first train a channel encoder capable of extracting embeddings from arbitrary audio. On top of this, channel embeddings are extracted using a minimal amount of target-domain data and used to guide a GAN-based speech synthesizer. This synthesizer generates speech that faithfully preserves the phonetic content of the input while mimicking the channel characteristics of the target domain. We evaluate our method on the challenging Hakka Across Taiwan (HAT) and Taiwanese Across Taiwan (TAT) corpora, achieving relative character error rate (CER) reductions of 20.02% and 9.64%, respectively, compared to the baselines. These results highlight the efficacy of our channel-aware data simulation method for bridging the gap between source- and target-domain acoustics.
[AI-71] Look Through Masks: Towards Masked Face Recognition with De-Occlusion Distillation ACM-MM2020
Link: https://arxiv.org/abs/2409.12385
Authors: Chenyu Li, Shiming Ge, Daichi Zhang, Jia Li
Keywords: real-world applications today, masked face recognition, ambiguous representation, drop in accuracy, real-world applications
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by ACM MM 2020
Abstract:Many real-world applications today like video surveillance and urban governance need to address the recognition of masked faces, where content replacement by diverse masks often brings in incomplete appearance and ambiguous representation, leading to a sharp drop in accuracy. Inspired by recent progress on amodal perception, we propose to migrate the mechanism of amodal completion for the task of masked face recognition with an end-to-end de-occlusion distillation framework, which consists of two modules. The de-occlusion module applies a generative adversarial network to perform face completion, which recovers the content under the mask and eliminates appearance ambiguity. The distillation module takes a pre-trained general face recognition model as the teacher and transfers its knowledge to train a student for completed faces using massive online synthesized face pairs. Especially, the teacher knowledge is represented with structural relations among instances in multiple orders, which serves as a posterior regularization to enable the adaptation. In this way, the knowledge can be fully distilled and transferred to identify masked faces. Experiments on synthetic and realistic datasets show the efficacy of the proposed approach.
[AI-72] Privacy-Preserving Student Learning with Differentially Private Data-Free Distillation
Link: https://arxiv.org/abs/2409.12384
Authors: Bochao Liu, Jianghu Lu, Pengju Wang, Junjie Zhang, Dan Zeng, Zhenxing Qian, Shiming Ge
Keywords: Deep learning models, achieve high inference, high inference accuracy, extracting rich knowledge, Deep learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published by IEEE MMSP 2022
Abstract:Deep learning models can achieve high inference accuracy by extracting rich knowledge from massive well-annotated data, but may pose the risk of data privacy leakage in practical deployment. In this paper, we present an effective teacher-student learning approach to train privacy-preserving deep learning models via differentially private data-free distillation. The main idea is generating synthetic data to learn a student that can mimic the ability of a teacher well-trained on private data. In the approach, a generator is first pretrained in a data-free manner by incorporating the teacher as a fixed discriminator. With the generator, massive synthetic data can be generated for model training without exposing data privacy. Then, the synthetic data is fed into the teacher to generate private labels. Towards this end, we propose a label differential privacy algorithm termed selective randomized response to protect the label information. Finally, a student is trained on the synthetic data with the supervision of private labels. In this way, both data privacy and label privacy are well protected in a unified framework, leading to privacy-preserving models. Extensive experiments and analysis clearly demonstrate the effectiveness of our approach.
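The abstract does not spell out the "selective randomized response" algorithm, so the sketch below shows only the classic k-ary randomized response that label differential privacy mechanisms of this kind build on; it is not the paper's selective variant. The true label is kept with probability e^ε / (e^ε + k − 1) and otherwise replaced by a uniformly chosen different label, which satisfies ε-label-DP. Function names and the example parameters are assumptions for illustration.

```python
import math
import random


def keep_probability(epsilon, num_classes):
    """Probability of reporting the true label under k-ary randomized response."""
    return math.exp(epsilon) / (math.exp(epsilon) + num_classes - 1)


def randomized_response(label, num_classes, epsilon, rng):
    """Return a privatized copy of `label` (an int in [0, num_classes))."""
    if rng.random() < keep_probability(epsilon, num_classes):
        return label
    # Draw uniformly from the other num_classes - 1 labels.
    other = rng.randrange(num_classes - 1)
    return other if other < label else other + 1


# Hypothetical usage: privatize teacher-predicted labels for synthetic samples.
rng = random.Random(0)
teacher_labels = [3, 3, 7, 1, 0, 9]
private_labels = [randomized_response(y, 10, epsilon=2.0, rng=rng)
                  for y in teacher_labels]
print(private_labels)
print(f"keep probability at epsilon=2, k=10: {keep_probability(2.0, 10):.3f}")
```

A smaller ε pushes the keep probability toward 1/k (pure noise), while a larger ε keeps most labels intact; the student is then trained on the synthetic data with these noisy labels.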
[AI-73] Bundle Fragments into a Whole: Mining More Complete Clusters via Submodular Selection of Interesting webpages for Web Topic Detection
Link: https://arxiv.org/abs/2409.12380
Authors: Junbiao Pang, Anjing Hu, Qingming Huang
Keywords: Organizing interesting webpages, Organizing interesting, multimodal web data, hot topics, understand the trends
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 10
Abstract:Organizing interesting webpages into hot topics is one of the key steps to understand the trends of multimodal web data. A state-of-the-art solution is first to organize webpages into a large volume of multi-granularity topic candidates; hot topics are further identified by estimating their interestingness. However, these topic candidates contain a large number of fragments of hot topics due to both the inefficient feature representations and the unsupervised topic generation. This paper proposes a bundling-refining approach to mine more complete hot topics from fragments. Concretely, the bundling step organizes the fragment topics into coarse topics; next, the refining step proposes a submodular-based method to refine coarse topics in a scalable approach. The proposed unconventional method is simple yet powerful: by leveraging submodular optimization, our approach outperforms traditional ranking methods, which involve careful design and complex steps. Extensive experiments demonstrate that the proposed approach surpasses the state-of-the-art method (i.e., latent Poisson deconvolution, Pang et al. (2016)) by 20% and 10% in accuracy on two public data sets, respectively.
[AI-74] Communication-Efficient Federated Low-Rank Update Algorithm and its Connection to Implicit Regularization
Link: https://arxiv.org/abs/2409.12371
Authors: Haemin Park, Diego Klabjan
Keywords: faces significant challenges, significant challenges related, faces significant, significant challenges, challenges related
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract:Federated Learning (FL) faces significant challenges related to communication efficiency and heterogeneity. To address these issues, we explore the potential of using low-rank updates. Our theoretical analysis reveals that the client’s loss exhibits a higher-rank structure (its gradients span a higher-rank subspace of the Hessian) compared to the server’s loss. Based on this insight, we hypothesize that constraining client-side optimization to a low-rank subspace could provide an implicit regularization effect. Consequently, we propose FedLoRU, a general low-rank update framework for federated learning. Our framework enforces low-rank client-side updates and accumulates these updates to form a higher-rank model. Additionally, variants of FedLoRU can adapt to environments with statistical and model heterogeneity by employing multiple or hierarchical low-rank updates. Experimental results demonstrate that FedLoRU performs comparably to full-rank algorithms and exhibits robustness to heterogeneous and large numbers of clients.
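The claim that accumulating low-rank updates yields a higher-rank model can be checked on a toy example. The sketch below is not FedLoRU itself (there are no clients, no training, and no averaging); it only adds three hypothetical rank-1 updates to a zero 3×3 matrix and tracks the rank after each step with exact rational arithmetic. All helper names and the update vectors are assumptions chosen for illustration.

```python
from fractions import Fraction


def outer(u, v):
    """Rank-1 matrix u v^T with exact rational entries."""
    return [[Fraction(a) * Fraction(b) for b in v] for a in u]


def add(m1, m2):
    """Element-wise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(m1, m2)]


def rank(matrix):
    """Matrix rank via Gaussian elimination over the rationals."""
    m = [row[:] for row in matrix]
    rows, cols = len(m), len(m[0])
    r = 0
    for c in range(cols):
        pivot = next((i for i in range(r, rows) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, rows):
            factor = m[i][c] / m[r][c]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == rows:
            break
    return r


# Hypothetical rank-1 updates, one per communication round.
updates = [
    outer([1, 0, 0], [1, 2, 3]),
    outer([0, 1, 0], [0, 1, 1]),
    outer([0, 0, 1], [1, 0, 0]),
]

model = [[Fraction(0)] * 3 for _ in range(3)]
ranks = []
for update in updates:
    model = add(model, update)  # accumulate the low-rank update
    ranks.append(rank(model))
print(ranks)
```

Each individual update costs only two vectors to transmit, yet the accumulated matrix reaches full rank after three rounds, which is the communication-versus-expressiveness trade the abstract points to.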
[AI-75] Extracting Memorized Training Data via Decomposition
Link: https://arxiv.org/abs/2409.12367
Authors: Ellen Su, Anu Vellore, Amy Chang, Raffaele Mura, Blaine Nelson, Paul Kassianik, Amin Karbasi
Keywords: Large Language Models, Large Language, information security challenges, Language Models, challenges for developers
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Abstract:The widespread use of Large Language Models (LLMs) in society creates new information security challenges for developers, organizations, and end-users alike. LLMs are trained on large volumes of data, and their susceptibility to reveal the exact contents of the source training datasets poses security and safety risks. Although current alignment procedures restrict common risky behaviors, they do not completely prevent LLMs from leaking data. Prior work demonstrated that LLMs may be tricked into divulging training data by using out-of-distribution queries or adversarial techniques. In this paper, we demonstrate a simple, query-based decompositional method to extract news articles from two frontier LLMs. We use instruction decomposition techniques to incrementally extract fragments of training data. Out of 3723 New York Times articles, we extract at least one verbatim sentence from 73 articles, and over 20% of verbatim sentences from 6 articles. Our analysis demonstrates that this method successfully induces the LLM to generate texts that are reliable reproductions of news articles, meaning that they likely originate from the source training dataset. This method is simple, generalizable, and does not fine-tune or change the production model. If replicable at scale, this training data extraction methodology could expose new LLM security and safety vulnerabilities, including privacy risks and unauthorized data leaks. These implications require careful consideration from model development to its end-use.
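The reported numbers ("at least one verbatim sentence", "over 20% of verbatim sentences") imply a per-article measure of how many source sentences reappear exactly in model output. The abstract does not give the authors' exact procedure, so the sketch below is one plausible way to compute such a fraction. The naive regex sentence splitter, the whitespace normalization, and the function names are all assumptions.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [" ".join(p.split()) for p in parts if p.strip()]


def verbatim_fraction(article, generated):
    """Fraction of article sentences that occur verbatim in the generated text."""
    sentences = split_sentences(article)
    if not sentences:
        return 0.0
    normalized_output = " ".join(generated.split())
    hits = sum(1 for s in sentences if s in normalized_output)
    return hits / len(sentences)


# Hypothetical toy strings standing in for an article and a model response.
article = "A cat sat. The dog ran! Birds fly."
generated = "I recall: The dog ran! and more."
print(verbatim_fraction(article, generated))
```

Under this definition, an article counts toward the "at least one verbatim sentence" tally when the fraction is above zero, and toward the "over 20%" tally when it exceeds 0.2.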
[AI-76] Advancing Cucumber Disease Detection in Agriculture through Machine Vision and Drone Technology
Link: https://arxiv.org/abs/2409.12350
Authors: Syada Tasfia Rahman, Nishat Vasker, Amir Khabbab Ahammed, Mahamudul Hasan
Keywords: machine vision, technologies to propose, unique method, drone technologies, Abstract
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages and 6 figures
Abstract:This study uses machine vision and drone technologies to propose a unique method for the diagnosis of cucumber disease in agriculture. The backbone of this research is a painstakingly curated dataset of hyperspectral photographs acquired under genuine field conditions. Unlike earlier datasets, this study included a wide variety of illness types, allowing for precise early-stage detection. The model achieves an excellent 87.5% accuracy in distinguishing eight unique cucumber illnesses after considerable data augmentation. The incorporation of drone technology for high-resolution images improves disease evaluation. This development has enormous potential for improving crop management, lowering labor costs, and increasing agricultural productivity. This research, which automates disease detection, represents a significant step toward a more efficient and sustainable agricultural future.
[AI-77] Understanding Implosion in Text-to-Image Generative Models CCS2024
Link: https://arxiv.org/abs/2409.12314
Authors: Wenxin Ding, Cathy Y. Li, Shawn Shan, Ben Y. Zhao, Haitao Zheng
Keywords: Recent works show, poisoning attacks, Recent works, surprisingly vulnerable, models
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ACM CCS 2024
Abstract:Recent works show that text-to-image generative models are surprisingly vulnerable to a variety of poisoning attacks. Empirical results find that these models can be corrupted by altering associations between individual text prompts and associated visual features. Furthermore, a number of concurrent poisoning attacks can induce “model implosion,” where the model becomes unable to produce meaningful images for unpoisoned prompts. These intriguing findings highlight the absence of an intuitive framework to understand poisoning attacks on these models. In this work, we establish the first analytical framework on robustness of image generative models to poisoning attacks, by modeling and analyzing the behavior of the cross-attention mechanism in latent diffusion models. We model cross-attention training as an abstract problem of “supervised graph alignment” and formally quantify the impact of training data by the hardness of alignment, measured by an Alignment Difficulty (AD) metric. The higher the AD, the harder the alignment. We prove that AD increases with the number of individual prompts (or concepts) poisoned. As AD grows, the alignment task becomes increasingly difficult, yielding highly distorted outcomes that frequently map meaningful text prompts to undefined or meaningless visual representations. As a result, the generative model implodes and outputs random, incoherent images at large. We validate our analytical framework through extensive experiments, and we confirm and explain the unexpected (and unexplained) effect of model implosion while producing new, unforeseen insights. Our work provides a useful tool for studying poisoning attacks against diffusion models and their defenses.
[AI-78] Autoformalization of Game Descriptions using Large Language Models
Link: https://arxiv.org/abs/2409.12300
Authors: Agnieszka Mensfelt, Kostas Stathis, Vince Trencsenyi
Keywords: Game theory, life to international, international politics, applications in domains, domains ranging
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments: code: this https URL
Abstract:Game theory is a powerful framework for reasoning about strategic interactions, with applications in domains ranging from day-to-day life to international politics. However, applying formal reasoning tools in such contexts is challenging, as these scenarios are often expressed in natural language. To address this, we introduce a framework for the autoformalization of game-theoretic scenarios, which translates natural language descriptions into formal logic representations suitable for formal solvers. Our approach utilizes one-shot prompting and a solver that provides feedback on syntactic correctness to allow LLMs to refine the code. We evaluate the framework using GPT-4o and a dataset of natural language problem descriptions, achieving 98% syntactic correctness and 88% semantic correctness. These results show the potential of LLMs to bridge the gap between real-life strategic interactions and formal reasoning.
[AI-79] RAG-Modulo: Solving Sequential Tasks using Experience Critics and Language Models
Link: https://arxiv.org/abs/2409.12294
Authors: Abhinav Jain, Chris Jermaine, Vaibhav Unhelkar
Keywords: Large language models, Large language, language models, observation uncertainties, recently emerged
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 8 pages, 5 figures
Abstract: Large language models (LLMs) have recently emerged as promising tools for solving challenging robotic tasks, even in the presence of action and observation uncertainties. Recent LLM-based decision-making methods (also referred to as LLM-based agents), when paired with appropriate critics, have demonstrated potential in solving complex, long-horizon tasks with relatively few interactions. However, most existing LLM-based agents lack the ability to retain and learn from past interactions, an essential trait of learning-based robotic systems. We propose RAG-Modulo, a framework that enhances LLM-based agents with a memory of past interactions and incorporates critics to evaluate the agents’ decisions. The memory component allows the agent to automatically retrieve and incorporate relevant past experiences as in-context examples, providing context-aware feedback for more informed decision-making. Further, by updating its memory, the agent improves its performance over time, thereby exhibiting learning. Through experiments in the challenging BabyAI and AlfWorld domains, we demonstrate significant improvements in task success rates and efficiency, showing that the proposed RAG-Modulo framework outperforms state-of-the-art baselines.
[AI-80] MetaPix: A Data-Centric AI Development Platform for Efficient Management and Utilization of Unstructured Computer Vision Data
Link: https://arxiv.org/abs/2409.12289
Authors: Sai Vishwanath Venkatesh, Atra Akandeh, Madhu Lokanath
Keywords: advanced AI technologies, today world, world of advanced, critical component, data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the 22nd International Conference on Software Engineering Research Practice
Abstract: In today’s world of advanced AI technologies, data management is a critical component of any AI/ML solution. Effective data management is vital for the creation and maintenance of high-quality, diverse datasets, which significantly enhance predictive capabilities and lead to smarter business solutions. In this work, we introduce MetaPix, a data-centric AI platform offering comprehensive data management solutions specifically designed for unstructured data. MetaPix offers robust tools for data ingestion, processing, storage, versioning, governance, and discovery. The platform operates on four key concepts: DataSources, Datasets, Extensions, and Extractors. A DataSource serves as MetaPix’s top-level asset, representing a narrow-scoped source of data for a specific use. Datasets are MetaPix’s second-level objects: structured collections of data. Extractors are internal tools integrated into MetaPix’s backend processing that facilitate data processing and enhancement. Additionally, MetaPix supports Extensions, enabling integration with external third-party tools to enhance platform functionality. This paper delves into each MetaPix concept in detail, illustrating how they collectively contribute to the platform’s objectives. By providing a comprehensive solution for managing and utilizing unstructured computer vision data, MetaPix equips organizations with a powerful toolset to develop AI applications effectively.
[AI-81] GCA-SUN: A Gated Context-Aware Swin-UNet for Exemplar-Free Counting
Link: https://arxiv.org/abs/2409.12249
Authors: Yuzhe Wu, Yipeng Xu, Tianyu Xu, Jialu Zhang, Jianfeng Ren, Xudong Jiang
Keywords: Exemplar-Free Counting aims, Exemplar-Free Counting, Counting aims, Gated Context-Aware Modulation, objects
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract:Exemplar-Free Counting aims to count objects of interest without intensive annotations of objects or exemplars. To achieve this, we propose Gated Context-Aware Swin-UNet (GCA-SUN) to directly map an input image to the density map of countable objects. Specifically, a Gated Context-Aware Modulation module is designed in the encoder to suppress irrelevant objects or background through a gate mechanism and exploit the attentive support of objects of interest through a self-similarity matrix. The gate strategy is also incorporated into the bottleneck network and the decoder to highlight the features most relevant to objects of interest. By explicitly exploiting the attentive support among countable objects and eliminating irrelevant features through the gate mechanisms, the proposed GCA-SUN focuses on and counts objects of interest without relying on predefined categories or exemplars. Experimental results on the FSC-147 and CARPK datasets demonstrate that GCA-SUN outperforms state-of-the-art methods.
[AI-82] Sparks of Artificial General Intelligence(AGI) in Semiconductor Material Science: Early Explorations into the Next Frontier of Generative AI-Assisted Electron Micrograph Analysis AAAI-2024
Link: https://arxiv.org/abs/2409.12244
Authors: Sakhinana Sagar Srinivas, Geethan Sannidhi, Sreeja Gangasani, Chidaksh Ravuru, Venkataramana Runkana
Keywords: electron micrographs poses, micrographs poses significant, poses significant challenges, automated labeling due, Characterizing materials
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published at Deployable AI (DAI) Workshop at AAAI-2024
Abstract:Characterizing materials with electron micrographs poses significant challenges for automated labeling due to the complex nature of nanomaterial structures. To address this, we introduce a fully automated, end-to-end pipeline that leverages recent advances in Generative AI. It is designed for analyzing and understanding the microstructures of semiconductor materials with effectiveness comparable to that of human experts, contributing to the pursuit of Artificial General Intelligence (AGI) in nanomaterial identification. Our approach utilizes Large MultiModal Models (LMMs) such as GPT-4V, alongside text-to-image models like DALLE-3. We integrate a GPT-4 guided Visual Question Answering (VQA) method to analyze nanomaterial images, generate synthetic nanomaterial images via DALLE-3, and employ in-context learning with few-shot prompting in GPT-4V for accurate nanomaterial identification. Our method surpasses traditional techniques by enhancing the precision of nanomaterial identification and optimizing the process for high-throughput screening.
[AI-83] SemAI: Semantic Artificial Intelligence-enhanced DNA storage for Internet-of-Things
Link: https://arxiv.org/abs/2409.12213
Authors: Wenfeng Wu, Luping Xiang, Qiang Liu, Kun Yang
Keywords: propelling DNA storage, global data landscape, data landscape undergoes, cloud storage applications, contemporary cloud storage
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:In the wake of the swift evolution of technologies such as the Internet of Things (IoT), the global data landscape undergoes an exponential surge, propelling DNA storage into the spotlight as a prospective medium for contemporary cloud storage applications. This paper introduces a Semantic Artificial Intelligence-enhanced DNA storage (SemAI-DNA) paradigm, distinguishing itself from prevalent deep learning-based methodologies through two key modifications: 1) embedding a semantic extraction module at the encoding terminus, facilitating the meticulous encoding and storage of nuanced semantic information; 2) conceiving a forethoughtful multi-reads filtering model at the decoding terminus, leveraging the inherent multi-copy propensity of DNA molecules to bolster system fault tolerance, coupled with a strategically optimized decoder’s architectural framework. Numerical results demonstrate the SemAI-DNA’s efficacy, attaining 2.61 dB Peak Signal-to-Noise Ratio (PSNR) gain and 0.13 improvement in Structural Similarity Index (SSIM) over conventional deep learning-based approaches.
[AI-84] Mixture of Diverse Size Experts
Link: https://arxiv.org/abs/2409.12210
Authors: Manxi Sun, Wei Liu, Jian Luan, Pengzhi Gao, Bin Wang
Keywords: large language models, exploding computational costs, gained increasing popularity, language models, computational costs
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:The Sparsely-Activated Mixture-of-Experts (MoE) has gained increasing popularity for scaling up large language models (LLMs) without exploding computational costs. Despite its success, the current design faces a challenge where all experts have the same size, limiting the ability of tokens to choose the experts with the most appropriate size for generating the next token. In this paper, we propose the Mixture of Diverse Size Experts (MoDSE), a new MoE architecture with layers designed to have experts of different sizes. Our analysis of difficult token generation tasks shows that experts of various sizes achieve better predictions, and the routing path of the experts tends to be stable after a training period. However, having experts of diverse sizes can lead to uneven workload distribution. To tackle this limitation, we introduce an expert-pair allocation strategy to evenly distribute the workload across multiple GPUs. Comprehensive evaluations across multiple benchmarks demonstrate the effectiveness of MoDSE, as it outperforms existing MoEs by allocating the parameter budget to experts adaptively while maintaining the same total parameter size and the number of experts.
[AI-85] Nteasee: A mixed methods study of expert and general population perspectives on deploying AI for health in African countries
Link: https://arxiv.org/abs/2409.12197
Authors: Mercy Nyamewaa Asiedu, Iskandar Haykel, Awa Dieng, Kerrie Kauer, Tousif Ahmed, Florence Ofori, Charisma Chan, Stephen Pfohl, Negar Rostamzadeh, Katherine Heller
Keywords: Artificial Intelligence, improve healthcare, significantly change, change and improve, African countries
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Equal contributions
Abstract:Artificial Intelligence (AI) for health has the potential to significantly change and improve healthcare. However in most African countries, identifying culturally and contextually attuned approaches for deploying these solutions is not well understood. To bridge this gap, we conduct a qualitative study to investigate the best practices, fairness indicators, and potential biases to mitigate when deploying AI for health in African countries, as well as explore opportunities where artificial intelligence could make a positive impact in health. We used a mixed methods approach combining in-depth interviews (IDIs) and surveys. We conduct 1.5-2 hour long IDIs with 50 experts in health, policy, and AI across 17 countries, and through an inductive approach we conduct a qualitative thematic analysis on expert IDI responses. We administer a blinded 30-minute survey with case studies to 672 general population participants across 5 countries in Africa and analyze responses on quantitative scales, statistically comparing responses by country, age, gender, and level of familiarity with AI. We thematically summarize open-ended responses from surveys. Our results find generally positive attitudes, high levels of trust, accompanied by moderate levels of concern among general population participants for AI usage for health in Africa. This contrasts with expert responses, where major themes revolved around trust/mistrust, ethical concerns, and systemic barriers to integration, among others. This work presents the first-of-its-kind qualitative research study of the potential of AI for health in Africa from an algorithmic fairness angle, with perspectives from both experts and the general population. We hope that this work guides policymakers and drives home the need for further research and the inclusion of general population perspectives in decision-making around AI usage.
[AI-86] Estimating the number of reachable positions in Minishogi
Link: https://arxiv.org/abs/2409.00129
Authors: Sotaro Ishii, Tetsuro Tanaka
Keywords: Gogo Shogi, strongly solving Minishogi, reachable Minishogi positions, solving Minishogi, investigate the feasibility
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments: This article was submitted to the 53rd meeting of IPSJ (Information Processing Society of Japan) SIG Game Informatics (held on September 6, 2024) as a non-reviewed technical report, and also published in IPSJ SIG Technical Reports, Vol. 2024-GI-53, No. 2, pp. 1-6
Abstract: To investigate the feasibility of strongly solving Minishogi (Gogo Shogi), it is necessary to know the number of its reachable positions from the initial position. However, there currently remains a significant gap between the lower and upper bounds of the value, since checking the legality of a Minishogi position is difficult. In this paper, the authors estimate the number of reachable positions by generating candidate positions using uniform random sampling and measuring the proportion of those reachable by a series of legal moves from the initial position. The experimental results reveal that the number of reachable Minishogi positions is approximately $2.38 \times 10^{18}$.
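The sampling estimate described in this abstract multiplies the size of the candidate space by the fraction of sampled candidates that turn out to be reachable. A minimal sketch follows, with a toy integer space and a hypothetical `toy_is_reachable` predicate standing in for the real Minishogi position generator and legality check, neither of which is given here.

```python
import random

def estimate_reachable(space_size, is_reachable, n_samples, seed=0):
    """Estimate the count of reachable items by uniform random sampling."""
    rng = random.Random(seed)
    hits = sum(is_reachable(rng.randrange(space_size)) for _ in range(n_samples))
    # estimated count = |candidate space| * observed reachable fraction
    return space_size * hits / n_samples

def toy_is_reachable(position_id):
    # Hypothetical stand-in: exactly one candidate in four counts as reachable.
    return position_id % 4 == 0

estimate = estimate_reachable(1_000_000, toy_is_reachable, n_samples=20_000)
print(f"estimated reachable positions: {estimate:.0f}")
```

The standard error of the estimated fraction shrinks as one over the square root of `n_samples`, so the accuracy depends on the number of samples rather than on the size of the candidate space.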
[AI-87] WaveletGPT: Wavelets Meet Large Language Models
Link: https://arxiv.org/abs/2409.12924
Authors: Prateek Verma
Keywords: Large Language Models, Large Language, artificial intelligence advancements, intelligence advancements impacting, Language Models
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 16 pages, 4 figures
Abstract: Large Language Models (LLMs) have ushered in a new wave of artificial intelligence advancements impacting every scientific field and discipline. They are trained on a simple objective: to predict the next token given the previous context. We live in a world where most of the data around us, e.g., text, audio, and music, has a multi-scale structure associated with it. This paper infuses LLMs with traditional signal processing ideas, namely wavelets, during pre-training to take advantage of the structure. Without adding any extra parameters to a GPT-style LLM architecture, we achieve the same pre-training performance almost twice as fast in text, raw audio, and symbolic music. This is achieved by imposing a structure on intermediate embeddings. When trained for the same number of training steps, we achieve significant gains in performance, which is comparable to pre-training a larger neural architecture. Our architecture allows every next token prediction access to intermediate embeddings at different temporal resolutions in every Transformer decoder block. This work will hopefully pave the way for incorporating multi-rate signal processing ideas into traditional LLM pre-training. Further, we showcase pushing model performance by improving internal structure instead of just going after scale.
[AI-88] Machine-learning based high-bandwidth magnetic sensing
Link: https://arxiv.org/abs/2409.12820
Authors: Galya Haim, Stefano Martina, John Howell, Nir Bar-Gill, Filippo Caruso
Keywords: Recent years, significant growth, capabilities of advanced, specifically quantum sensing, magnetic sensing
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applied Physics (physics.app-ph); Computational Physics (physics.comp-ph)
Comments: 12 pages including supplementary, 6 figures
Abstract:Recent years have seen significant growth of quantum technologies, and specifically quantum sensing, both in terms of the capabilities of advanced platforms and their applications. One of the leading platforms in this context is nitrogen-vacancy (NV) color centers in diamond, providing versatile, high-sensitivity, and high-resolution magnetic sensing. Nevertheless, current schemes for spin resonance magnetic sensing (as applied by NV quantum sensing) suffer from tradeoffs associated with sensitivity, dynamic range, and bandwidth. Here we address this issue, and implement machine learning tools to enhance NV magnetic sensing in terms of the sensitivity/bandwidth tradeoff in large dynamic range scenarios. We experimentally demonstrate this new approach, reaching an improvement in the relevant figure of merit by a factor of up to 5. Our results promote quantum machine learning protocols for sensing applications towards more feasible and efficient quantum technologies.
[AI-89] Graph Convolutional Neural Networks as Surrogate Models for Climate Simulation
Link: https://arxiv.org/abs/2409.12815
Authors: Kevin Potter, Carianne Martinez, Reina Pradhan, Samantha Brozak, Steven Sleder, Lauren Wheeler
Keywords: nonlinear differential equations, parameterize complex interactions, large clusters, differential equations, complex interactions
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI)
Comments: 10 pages, 8 figures
Abstract: Many climate processes are characterized using large systems of nonlinear differential equations; this, along with the immense amount of data required to parameterize complex interactions, means that Earth-System Model (ESM) simulations may take weeks to run on large clusters. Uncertainty quantification may require thousands of runs, making ESM simulations impractical for preliminary assessment. Alternatives may include simplifying the processes in the model, but recent efforts have focused on using machine learning to complement these models or even act as full surrogates. We leverage machine learning, specifically fully-connected neural networks (FCNNs) and graph convolutional neural networks (GCNNs), to enable rapid simulation and uncertainty quantification in order to inform more extensive ESM simulations. Our surrogate simulated 80 years in approximately 310 seconds on a single A100 GPU, compared to weeks for the ESM model, while having mean temperature errors below $0.1^\circ\mathrm{C}$ and maximum errors below $2^\circ\mathrm{C}$.
[AI-90] Test-Time Augmentation Meets Variational Bayes
Link: https://arxiv.org/abs/2409.12587
Authors: Masanari Kimura, Howard Bondell
Keywords: Data augmentation, machine learning models, data augmentation methods, Data, TTA
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:Data augmentation is known to contribute significantly to the robustness of machine learning models. In most instances, data augmentation is utilized during the training phase. Test-Time Augmentation (TTA) is a technique that instead leverages these data augmentations during the testing phase to achieve robust predictions. More precisely, TTA averages the predictions of multiple data augmentations of an instance to produce a final prediction. Although the effectiveness of TTA has been empirically reported, it can be expected that the predictive performance achieved will depend on the set of data augmentation methods used during testing. In particular, the data augmentation methods applied should make different contributions to performance. That is, it is anticipated that there may be differing degrees of contribution in the set of data augmentation methods used for TTA, and these could have a negative impact on prediction performance. In this study, we consider a weighted version of the TTA based on the contribution of each data augmentation. Some variants of TTA can be regarded as considering the problem of determining the appropriate weighting. We demonstrate that the determination of the coefficients of this weighted TTA can be formalized in a variational Bayesian framework. We also show that optimizing the weights to maximize the marginal log-likelihood suppresses candidates of unwanted data augmentations at the test phase.
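The difference between plain and weighted TTA can be sketched in a few lines of NumPy. This is only an illustration of the weighted average itself: the function name `weighted_tta` is hypothetical, and the weights are supplied by hand rather than fitted with the variational Bayesian procedure the abstract describes.

```python
import numpy as np

def weighted_tta(probs, weights=None):
    """Combine per-augmentation predictions into a single prediction.

    probs:   array of shape (n_augmentations, n_classes)
    weights: one non-negative weight per augmentation; None means plain TTA.
    """
    probs = np.asarray(probs, dtype=float)
    n_aug = probs.shape[0]
    if weights is None:
        weights = np.full(n_aug, 1.0 / n_aug)   # uniform average = standard TTA
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()           # normalize so the weights sum to 1
    return weights @ probs

# Predictions for one instance under three augmentations (made-up numbers).
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
print(weighted_tta(probs))                      # plain TTA
print(weighted_tta(probs, [1.0, 0.0, 1.0]))     # second augmentation suppressed
```

Setting a weight to zero removes that augmentation from the final prediction, which is the kind of suppression of unwanted augmentations the abstract attributes to the optimized weights.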
[AI-91] A Multi-agent Market Model Can Explain the Impact of AI Traders in Financial Markets – A New Microfoundations of GARCH model
Link: https://arxiv.org/abs/2409.12516
Authors: Kei Nakagawa, Masanori Hirano, Kentaro Minami, Takanobu Mizuta
Keywords: raising important questions, sparked significant interest, price formation mechanisms, GARCH model, raising important
Categories: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Trading and Market Microstructure (q-fin.TR)
Comments: Accepted PRIMA2024
Abstract:The AI traders in financial markets have sparked significant interest in their effects on price formation mechanisms and market volatility, raising important questions for market stability and regulation. Despite this interest, a comprehensive model to quantitatively assess the specific impacts of AI traders remains undeveloped. This study aims to address this gap by modeling the influence of AI traders on market price formation and volatility within a multi-agent framework, leveraging the concept of microfoundations. Microfoundations involve understanding macroeconomic phenomena, such as market price formation, through the decision-making and interactions of individual economic agents. While widely acknowledged in macroeconomics, microfoundational approaches remain unexplored in empirical finance, particularly for models like the GARCH model, which captures key financial statistical properties such as volatility clustering and fat tails. This study proposes a multi-agent market model to derive the microfoundations of the GARCH model, incorporating three types of agents: noise traders, fundamental traders, and AI traders. By mathematically aggregating the micro-structure of these agents, we establish the microfoundations of the GARCH model. We validate this model through multi-agent simulations, confirming its ability to reproduce the stylized facts of financial markets. Finally, we analyze the impact of AI traders using parameters derived from these microfoundations, contributing to a deeper understanding of their role in market dynamics.
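For readers unfamiliar with the GARCH model mentioned above, the standard GARCH(1,1) variance recursion is sigma2[t] = omega + alpha * eps[t-1]**2 + beta * sigma2[t-1]. The sketch below implements only this textbook recursion; it does not reproduce the paper's multi-agent derivation, and the function name and parameter values are illustrative assumptions.

```python
def garch_variance(eps, omega, alpha, beta, sigma2_0):
    """Return conditional variances sigma2[0..len(eps)] for a GARCH(1,1) model.

    eps:      sequence of past return shocks
    sigma2_0: assumed initial conditional variance
    """
    sigma2 = [sigma2_0]
    for shock in eps:
        # A large shock raises the next variance; beta makes the rise persist.
        sigma2.append(omega + alpha * shock ** 2 + beta * sigma2[-1])
    return sigma2

# One large shock followed by calm periods: the variance decays gradually.
variances = garch_variance([3.0, 0.0, 0.0], omega=0.1, alpha=0.2, beta=0.7, sigma2_0=1.0)
print(variances)
```

The slow decay after a single large shock is the mechanism behind volatility clustering, one of the stylized facts the abstract refers to.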
[AI-92] Multichannel-to-Multichannel Target Sound Extraction Using Direction and Timestamp Clues
Link: https://arxiv.org/abs/2409.12415
Authors: Dayun Choi, Jung-Woo Choi
Keywords: target sound, target sound extraction, target, sound, multichannel
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: 5 pages, 4 figures
Abstract:We propose a multichannel-to-multichannel target sound extraction (M2M-TSE) framework for separating multichannel target signals from a multichannel mixture of sound sources. Target sound extraction (TSE) isolates a specific target signal using user-provided clues, typically focusing on single-channel extraction with class labels or temporal activation maps. However, to preserve and utilize spatial information in multichannel audio signals, it is essential to extract multichannel signals of a target sound source. Moreover, the clue for extraction can also include spatial or temporal cues like direction-of-arrival (DoA) or timestamps of source activation. To address these challenges, we present an M2M framework that extracts a multichannel sound signal based on spatio-temporal clues. We demonstrate that our transformer-based architecture can successively accomplish the M2M-TSE task for multichannel signals synthesized from audio signals of diverse classes in different room environments. Furthermore, we show that the multichannel extraction task introduces sufficient inductive bias in the DNN, allowing it to directly handle DoA clues without utilizing hand-crafted spatial features.
[AI-93] Disentangling Speakers in Multi-Talker Speech Recognition with Speaker-Aware CTC
Link: https://arxiv.org/abs/2409.12388
Authors: Jiawen Kang, Lingwei Meng, Mingyu Cui, Yuejiao Wang, Xixin Wu, Xunying Liu, Helen Meng
Keywords: faces unique challenges, Connectionist Temporal Classification, Serialized Output Training, faces unique, transcribing overlapping speech
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Abstract: Multi-talker speech recognition (MTASR) faces unique challenges in disentangling and transcribing overlapping speech. To address these challenges, this paper investigates the role of Connectionist Temporal Classification (CTC) in speaker disentanglement when incorporated with Serialized Output Training (SOT) for MTASR. Our visualization reveals that CTC guides the encoder to represent different speakers in distinct temporal regions of acoustic embeddings. Leveraging this insight, we propose a novel Speaker-Aware CTC (SACTC) training objective, based on the Bayes risk CTC framework. SACTC is a CTC variant tailored to multi-talker scenarios; it explicitly models speaker disentanglement by constraining the encoder to represent different speakers’ tokens at specific time frames. When integrated with SOT, the SOT-SACTC model consistently outperforms standard SOT-CTC across various degrees of speech overlap. Specifically, we observe relative word error rate reductions of 10% overall and 15% on low-overlap speech. This work represents an initial exploration of CTC-based enhancements for MTASR tasks, offering a new perspective on speaker disentanglement in multi-talker speech recognition.
[AI-94] Axial Attention Transformer Networks: A New Frontier in Breast Cancer Detection
Link: https://arxiv.org/abs/2409.12347
Authors: Weijie He, Runyuan Bao, Yiru Cang, Jianjun Wei, Yang Zhang, Jiacheng Hu
Keywords: breast cancer images, medical image segmentation, breast cancer, breast cancer diagnosis, Transformer-based segmentation model
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract:This paper delves into the challenges and advancements in the field of medical image segmentation, particularly focusing on breast cancer diagnosis. The authors propose a novel Transformer-based segmentation model that addresses the limitations of traditional convolutional neural networks (CNNs), such as U-Net, in accurately localizing and segmenting small lesions within breast cancer images. The model introduces an axial attention mechanism to enhance the computational efficiency and address the issue of global contextual information that is often overlooked by CNNs. Additionally, the paper discusses improvements tailored to the small dataset challenge, including the incorporation of relative position information and a gated axial attention mechanism to refine the model’s focus on relevant features. The proposed model aims to significantly improve the segmentation accuracy of breast cancer images, offering a more efficient and effective tool for computer-aided diagnosis.
[AI-95] Deep vessel segmentation with joint multi-prior encoding
Link: https://arxiv.org/abs/2409.12334
Authors: Amine Sadikine, Bogdan Badic, Enzo Ferrante, Vincent Noblet, Pascal Ballet, Dimitris Visvikis, Pierre-Henri Conze
Keywords: including pathology detection, clinical applications, including pathology, surgical planning, pathology detection
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 5 pages, 3 figures, conference
Abstract:The precise delineation of blood vessels in medical images is critical for many clinical applications, including pathology detection and surgical planning. However, fully-automated vascular segmentation is challenging because of the variability in shape, size, and topology. Manual segmentation remains the gold standard but is time-consuming, subjective, and impractical for large-scale studies. Hence, there is a need for automatic and reliable segmentation methods that can accurately detect blood vessels from medical images. The integration of shape and topological priors into vessel segmentation models has been shown to improve segmentation accuracy by offering contextual information about the shape of the blood vessels and their spatial relationships within the vascular tree. To further improve anatomical consistency, we propose a new joint prior encoding mechanism which incorporates both shape and topology in a single latent space. The effectiveness of our method is demonstrated on the publicly available 3D-IRCADb dataset. More globally, the proposed approach holds promise in overcoming the challenges associated with automatic vessel delineation and has the potential to advance the field of deep priors encoding.
[AI-96] Scale-specific auxiliary multi-task contrastive learning for deep liver vessel segmentation
Link: https://arxiv.org/abs/2409.12333
Authors: Amine Sadikine,Bogdan Badic,Jean-Pierre Tasu,Vincent Noblet,Pascal Ballet,Dimitris Visvikis,Pierre-Henri Conze
Keywords: functionally-independent Couinaud segments, Extracting hepatic vessels, Couinaud segments, Extracting hepatic, functionally-independent Couinaud
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 5 pages, 5 figures, conference
Abstract:Extracting hepatic vessels from abdominal images is of high interest for clinicians since it allows the liver to be divided into functionally-independent Couinaud segments. In this respect, automated liver blood vessel extraction is in high demand. Despite the significant growth in performance of semantic segmentation methodologies, preserving the complex multi-scale geometry of main vessels and ramifications remains a major challenge. This paper provides a new deep supervised approach for vessel segmentation, with a strong focus on representations arising from the different scales inherent to the vascular tree geometry. In particular, we propose a new clustering technique to decompose the tree into various scale levels, from tiny to large vessels. Then, we extend standard 3D UNet to multi-task learning by incorporating scale-specific auxiliary tasks and contrastive learning to encourage the discrimination between scales in the shared representation. Promising results across several evaluation metrics are reported on the public 3D-IRCADb dataset.
[AI-97] Multivariate Analysis of Gut Microbiota Composition and Prevalence of Gastric Cancer
Link: https://arxiv.org/abs/2409.12209
Authors: Aadhith Shankarnarayanan,Dheeman Gangopadhyay,Ayman Alzaatreh
Keywords: gastric cancer, gut microbiota, gastric, global surge, predictive marker
Categories: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
Abstract:The global surge in the cases of gastric cancer has prompted an investigation into the potential of gut microbiota as a predictive marker for the disease. The alterations in gut diversity are suspected to be associated with an elevated risk of gastric cancer. This paper delves into finding the correlation between gut microbiota and gastric cancer, focusing on patients who have undergone total and subtotal gastrectomy. Utilizing data mining and statistical learning methods, an analysis was conducted on 16S-RNA sequenced genes obtained from 96 participants with the aim of identifying specific genera of gut microbiota associated with gastric cancer. The study reveals several prominent bacterial genera that could potentially serve as biomarkers for assessing the risk of gastric cancer. These findings offer a pathway for early risk assessment and precautionary measures in the diagnosis of gastric cancer. The intricate mechanisms through which these gut microbiotas influence gastric cancer progression warrant further investigation. This research aims to contribute significantly to the growing understanding of the gut-cancer axis and its implications in disease prediction and prevention.
Computer Vision
[CV-0] Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner
Link: https://arxiv.org/abs/2409.12963
Authors: Yuzhang Shang,Bingxin Xu,Weitai Kang,Mu Cai,Yuheng Li,Zehao Wen,Zhen Dong,Kurt Keutzer,Yong Jae Lee,Yan Yan
Keywords: Large Language Models, Advancements in Large, Language Models, Large Language, integrating video modalities
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:Advancements in Large Language Models (LLMs) inspire various strategies for integrating video modalities. A key approach is Video-LLMs, which incorporate an optimizable interface linking sophisticated video encoders to LLMs. However, due to computation and data limitations, these Video-LLMs are typically pre-trained to process only short videos, limiting their broader application for understanding longer video content. Additionally, fine-tuning Video-LLMs to handle longer videos is cost-prohibitive. Consequently, it becomes essential to explore the interpolation of Video-LLMs under a completely training-free setting. In this paper, we first identify the primary challenges in interpolating Video-LLMs: (1) the video encoder and modality alignment projector are fixed, preventing the integration of additional frames into Video-LLMs, and (2) the LLM backbone is limited in its content length capabilities, which complicates the processing of an increased number of video tokens. To address these challenges, we propose a specific INTerPolation method for Video-LLMs (INTP-Video-LLMs). We introduce an alternative video token rearrangement technique that circumvents limitations imposed by the fixed video encoder and alignment projector. Furthermore, we introduce a training-free LLM context window extension method to enable Video-LLMs to understand a correspondingly increased number of visual tokens.
[CV-1] Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
Link: https://arxiv.org/abs/2409.12961
Authors: Zuyan Liu,Yuhao Dong,Ziwei Liu,Winston Hu,Jiwen Lu,Yongming Rao
Keywords: videos spanning hours, ranging from small, spanning hours, small icons, Visual
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Visual data comes in various forms, ranging from small icons of just a few pixels to long videos spanning hours. Existing multi-modal LLMs usually standardize these diverse visual inputs to a fixed resolution for visual encoders and yield similar numbers of tokens for LLMs. This approach is non-optimal for multimodal understanding and inefficient for processing inputs with long and short visual contents. To solve the problem, we propose Oryx, a unified multimodal architecture for the spatial-temporal understanding of images, videos, and multi-view 3D scenes. Oryx offers an on-demand solution to seamlessly and efficiently process visual inputs with arbitrary spatial sizes and temporal lengths through two core innovations: 1) a pre-trained OryxViT model that can encode images at any resolution into LLM-friendly visual representations; 2) a dynamic compressor module that supports 1x to 16x compression on visual tokens by request. These design features enable Oryx to accommodate extremely long visual contexts, such as videos, with lower resolution and high compression while maintaining high recognition precision for tasks like document understanding with native resolution and no compression. Beyond the architectural improvements, enhanced data curation and specialized training on long-context retrieval and spatial-aware data help Oryx achieve strong capabilities in image, video, and 3D multimodal understanding simultaneously. Our work is open-sourced at this https URL.
[CV-2] LVCD: Reference-based Lineart Video Colorization with Diffusion Models SIGGRAPH
Link: https://arxiv.org/abs/2409.12960
Authors: Zhitong Huang,Mohan Zhang,Jing Liao
Keywords: video diffusion framework, framework for reference-based, video diffusion model, reference-based lineart video, diffusion model
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted by ACM Transactions on Graphics and SIGGRAPH Asia 2024. Project page: this https URL
Abstract:We propose the first video diffusion framework for reference-based lineart video colorization. Unlike previous works that rely solely on image generative models to colorize lineart frame by frame, our approach leverages a large-scale pretrained video diffusion model to generate colorized animation videos. This approach leads to more temporally consistent results and is better equipped to handle large motions. Firstly, we introduce Sketch-guided ControlNet which provides additional control to finetune an image-to-video diffusion model for controllable video synthesis, enabling the generation of animation videos conditioned on lineart. We then propose Reference Attention to facilitate the transfer of colors from the reference frame to other frames containing fast and expansive motions. Finally, we present a novel scheme for sequential sampling, incorporating the Overlapped Blending Module and Prev-Reference Attention, to extend the video diffusion model beyond its original fixed-length limitation for long video colorization. Both qualitative and quantitative results demonstrate that our method significantly outperforms state-of-the-art techniques in terms of frame and video quality, as well as temporal consistency. Moreover, our method is capable of generating high-quality, long temporal-consistent animation videos with large motions, which is not achievable in previous works. Our code and model are available at this https URL.
[CV-3] MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines
Link: https://arxiv.org/abs/2409.12959
Authors: Dongzhi Jiang,Renrui Zhang,Ziyu Guo,Yanmin Wu,Jiayi Lei,Pengshuo Qiu,Pan Lu,Zehui Chen,Guanglu Song,Peng Gao,Yu Liu,Chunyuan Li,Hongsheng Li
Keywords: Large Language Models, Large Multimodal Models, Large Language, Language Models, multimodal search
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Project Page: this https URL
Abstract:The advent of Large Language Models (LLMs) has paved the way for AI search engines, e.g., SearchGPT, showcasing a new paradigm in human-internet interaction. However, most current AI search engines are limited to text-only settings, neglecting the multimodal user queries and the text-image interleaved nature of website information. Recently, Large Multimodal Models (LMMs) have made impressive strides. Yet, whether they can function as AI search engines remains under-explored, leaving the potential of LMMs in multimodal search an open question. To this end, we first design a delicate pipeline, MMSearch-Engine, to empower any LMMs with multimodal search capabilities. On top of this, we introduce MMSearch, a comprehensive evaluation benchmark to assess the multimodal search performance of LMMs. The curated dataset contains 300 manually collected instances spanning 14 subfields, which involve no overlap with the current LMMs’ training data, ensuring the correct answer can only be obtained through searching. By using MMSearch-Engine, the LMMs are evaluated by performing three individual tasks (requery, rerank, and summarization), and one challenging end-to-end task with a complete searching process. We conduct extensive experiments on closed-source and open-source LMMs. Among all tested models, GPT-4o with MMSearch-Engine achieves the best results, which surpasses the commercial product, Perplexity Pro, in the end-to-end task, demonstrating the effectiveness of our proposed pipeline. We further present an error analysis revealing that current LMMs still struggle to fully grasp the multimodal search tasks, and conduct an ablation study to indicate the potential of scaling test-time computation for AI search engines. We hope MMSearch may provide unique insights to guide the future development of multimodal AI search engines. Project Page: this https URL
[CV-4] 3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion
Link: https://arxiv.org/abs/2409.12957
Authors: Zhaoxi Chen,Jiaxiang Tang,Yuhao Dong,Ziang Cao,Fangzhou Hong,Yushi Lan,Tengfei Wang,Haozhe Xie,Tong Wu,Shunsuke Saito,Liang Pan,Dahua Lin,Ziwei Liu
Keywords: industries necessitates efficient, content creation, efficient and automated, increasing demand, industries necessitates
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Code this https URL Project Page this https URL
Abstract:The increasing demand for high-quality 3D assets across various industries necessitates efficient and automated 3D content creation. Despite recent advancements in 3D generative models, existing methods still face challenges with optimization speed, geometric fidelity, and the lack of assets for physically based rendering (PBR). In this paper, we introduce 3DTopia-XL, a scalable native 3D generative model designed to overcome these limitations. 3DTopia-XL leverages a novel primitive-based 3D representation, PrimX, which encodes detailed shape, albedo, and material field into a compact tensorial format, facilitating the modeling of high-resolution geometry with PBR assets. On top of the novel representation, we propose a generative framework based on Diffusion Transformer (DiT), which comprises 1) Primitive Patch Compression and 2) Latent Primitive Diffusion. 3DTopia-XL learns to generate high-quality 3D assets from textual or visual inputs. We conduct extensive qualitative and quantitative experiments to demonstrate that 3DTopia-XL significantly outperforms existing methods in generating high-quality 3D assets with fine-grained textures and materials, efficiently bridging the quality gap between generative models and real-world applications.
[CV-5] GStex: Per-Primitive Texturing of 2D Gaussian Splatting for Decoupled Appearance and Geometry Modeling
Link: https://arxiv.org/abs/2409.12954
Authors: Victor Rong,Jingxiang Chen,Sherwin Bahmani,Kiriakos N. Kutulakos,David B. Lindell
Keywords: Gaussian, Gaussian primitives, demonstrated excellent performance, demonstrated excellent, Gaussian splatting
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project page: this https URL
Abstract:Gaussian splatting has demonstrated excellent performance for view synthesis and scene reconstruction. The representation achieves photorealistic quality by optimizing the position, scale, color, and opacity of thousands to millions of 2D or 3D Gaussian primitives within a scene. However, since each Gaussian primitive encodes both appearance and geometry, these attributes are strongly coupled–thus, high-fidelity appearance modeling requires a large number of Gaussian primitives, even when the scene geometry is simple (e.g., for a textured planar surface). We propose to texture each 2D Gaussian primitive so that even a single Gaussian can be used to capture appearance details. By employing per-primitive texturing, our appearance representation is agnostic to the topology and complexity of the scene’s geometry. We show that our approach, GStex, yields improved visual quality over prior work in texturing Gaussian splats. Furthermore, we demonstrate that our decoupling enables improved novel view synthesis performance compared to 2D Gaussian splatting when reducing the number of Gaussian primitives, and that GStex can be used for scene appearance editing and re-texturing.
[CV-6] JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images
Link: https://arxiv.org/abs/2409.12953
Authors: Zhecan Wang,Junzhang Liu,Chia-Wei Tang,Hani Alomari,Anushka Sivakumar,Rui Sun,Wenhao Li,Md. Atabuzzaman,Hammad Ayyubi,Haoxuan You,Alvi Ishmam,Kai-Wei Chang,Shih-Fu Chang,Chris Thomas
Keywords: benchmarks largely consist, usual contexts, Existing vision-language understanding, largely consist, vision-language understanding benchmarks
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract:Existing vision-language understanding benchmarks largely consist of images of objects in their usual contexts. As a consequence, recent multimodal large language models can perform well with only a shallow visual understanding by relying on background language biases. Thus, strong performance on these benchmarks does not necessarily correlate with strong visual understanding. In this paper, we release JourneyBench, a comprehensive human-annotated benchmark of generated images designed to assess the model’s fine-grained multimodal reasoning abilities across five tasks: complementary multimodal chain of thought, multi-image VQA, imaginary image captioning, VQA with hallucination triggers, and fine-grained retrieval with sample-specific distractors. Unlike existing benchmarks, JourneyBench explicitly requires fine-grained multimodal reasoning in unusual imaginary scenarios where language bias and holistic image gist are insufficient. We benchmark state-of-the-art models on JourneyBench and analyze performance along a number of fine-grained dimensions. Results across all five tasks show that JourneyBench is exceptionally challenging for even the best models, indicating that models’ visual reasoning abilities are not as strong as they first appear. We discuss the implications of our findings and propose avenues for further research.
[CV-7] The Gaussian Discriminant Variational Autoencoder (GdVAE): A Self-Explainable Model with Counterfactual Explanations ECCV2024
Link: https://arxiv.org/abs/2409.12952
Authors: Anselm Haselhoff,Kevin Trelenberg,Fabian Küppers,Jonas Schneider
Keywords: modify image concepts, original query image, Visual counterfactual explanation, methods modify image, Visual counterfactual
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted paper at the ECCV 2024
Abstract:Visual counterfactual explanation (CF) methods modify image concepts, e.g., shape, to change a prediction to a predefined outcome while closely resembling the original query image. Unlike self-explainable models (SEMs) and heatmap techniques, they grant users the ability to examine hypothetical “what-if” scenarios. Previous CF methods either entail post-hoc training, limiting the balance between transparency and CF quality, or demand optimization during inference. To bridge the gap between transparent SEMs and CF methods, we introduce the GdVAE, a self-explainable model based on a conditional variational autoencoder (CVAE), featuring a Gaussian discriminant analysis (GDA) classifier and integrated CF explanations. Full transparency is achieved through a generative classifier that leverages class-specific prototypes for the downstream task and a closed-form solution for CFs in the latent space. The consistency of CFs is improved by regularizing the latent space with the explainer function. Extensive comparisons with existing approaches affirm the effectiveness of our method in producing high-quality CF explanations while preserving transparency. Code and models are public.
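To make the idea of a closed-form counterfactual under a GDA classifier more concrete, here is a minimal two-class sketch. It is an illustration under stated assumptions, not necessarily the formulation used in GdVAE: it assumes a shared covariance (so the log-odds are linear in the latent vector) and picks the smallest Euclidean shift that reaches a requested log-odds value. The function names and the toy class means are hypothetical.

```python
import numpy as np

def gda_log_odds_params(mu0, mu1, cov, prior1=0.5):
    """Return (w, b) such that log p(y=1|z) / p(y=0|z) = w @ z + b.

    Assumes both classes share the same covariance matrix.
    """
    prec = np.linalg.inv(cov)
    w = prec @ (mu1 - mu0)
    b = -0.5 * (mu1 @ prec @ mu1 - mu0 @ prec @ mu0) + np.log(prior1 / (1 - prior1))
    return w, b

def counterfactual(z, w, b, target_log_odds):
    """Smallest Euclidean shift of z whose log-odds equal the target."""
    gap = target_log_odds - (w @ z + b)
    return z + gap * w / (w @ w)

# Hypothetical 2D latent space with class prototypes at (-1, 0) and (1, 0).
mu0, mu1 = np.array([-1.0, 0.0]), np.array([1.0, 0.0])
w, b = gda_log_odds_params(mu0, mu1, np.eye(2))
z = np.array([-0.8, 0.3])   # starts on the class-0 side
z_cf = counterfactual(z, w, b, target_log_odds=2.0)
print(w @ z + b, w @ z_cf + b)
```

Because the shift is along `w` only, the component of `z` orthogonal to the decision direction is left unchanged, which is one simple way to keep the counterfactual close to the query.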
[CV-8] Revisiting Semi-supervised Adversarial Robustness via Noise-aware Online Robust Distillation
Link: https://arxiv.org/abs/2409.12946
Authors: Tsung-Han Wu,Hung-Ting Su,Shang-Tse Chen,Winston H. Hsu
Keywords: prominent approach, RST, robust pretrained models, robust self-training, labeling budgets
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 4 figures, 9 tables
Abstract:The robust self-training (RST) framework has emerged as a prominent approach for semi-supervised adversarial training. To explore the possibility of tackling more complicated tasks with even lower labeling budgets, unlike prior approaches that rely on robust pretrained models, we present SNORD - a simple yet effective framework that introduces contemporary semi-supervised learning techniques into the realm of adversarial training. By enhancing pseudo labels and managing noisy training data more effectively, SNORD showcases impressive, state-of-the-art performance across diverse datasets and labeling budgets, all without the need for pretrained models. Compared to full adversarial supervision, SNORD achieves a 90% relative robust accuracy under epsilon = 8/255 AutoAttack, requiring less than 0.1%, 2%, and 10% labels for CIFAR-10, CIFAR-100, and TinyImageNet-200, respectively. Additional experiments confirm the efficacy of each component and demonstrate the adaptability of integrating SNORD with existing adversarial pretraining strategies to further bolster robustness.
[CV-9] Accelerating AI and Computer Vision for Satellite Pose Estimation on the Intel Myriad X Embedded SoC
Link: https://arxiv.org/abs/2409.12939
Authors: Vasileios Leon,Panagiotis Minaidis,George Lentaris,Dimitrios Soudris
Keywords: Artificial Intelligence, deployment of Artificial, Computer Vision, Vision Processing Unit, heterogeneous Vision Processing
Categories: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at Elsevier Microprocessors and Microsystems
Abstract:The challenging deployment of Artificial Intelligence (AI) and Computer Vision (CV) algorithms at the edge pushes the community of embedded computing to examine heterogeneous System-on-Chips (SoCs). Such novel computing platforms provide increased diversity in interfaces, processors and storage; however, the efficient partitioning and mapping of AI/CV workloads remains an open issue. In this context, the current paper develops a hybrid AI/CV system on Intel’s Movidius Myriad X, which is a heterogeneous Vision Processing Unit (VPU), for initializing and tracking the satellite’s pose in space missions. The space industry is among the communities examining alternative computing platforms to comply with the tight constraints of on-board data processing, while it is also striving to adopt functionalities from the AI domain. At the algorithmic level, we rely on the ResNet-50-based UrsoNet network along with a custom classical CV pipeline. For efficient acceleration, we exploit the SoC’s neural compute engine and 16 vector processors by combining multiple parallelization and low-level optimization techniques. The proposed single-chip, robust-estimation, and real-time solution delivers a throughput of up to 5 FPS for 1-MegaPixel RGB images within a limited power envelope of 2W.
[CV-10] MaskMol: Knowledge-guided Molecular Image Pre-Training Framework for Activity Cliffs
Link: https://arxiv.org/abs/2409.12926
Authors: Zhixiang Cheng,Hongxin Xiang,Pengsen Ma,Li Zeng,Xin Jin,Xixi Yang,Jianxin Lin,Yang Deng,Bosheng Song,Xinxin Feng,Changhui Deng,Xiangxiang Zeng
Keywords: show significant differences, refer to pairs, pairs of molecules, structurally similar, similar but show
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 33 pages, 5 figures
Abstract:Activity cliffs, which refer to pairs of molecules that are structurally similar but show significant differences in their potency, can lead to model representation collapse and make it challenging for the model to distinguish them. Our research indicates that as molecular similarity increases, graph-based methods struggle to capture these nuances, whereas image-based approaches effectively retain the distinctions. Thus, we developed MaskMol, a knowledge-guided molecular image self-supervised learning framework. MaskMol accurately learns the representation of molecular images by considering multiple levels of molecular knowledge, such as atoms, bonds, and substructures. By utilizing pixel masking tasks, MaskMol extracts fine-grained information from molecular images, overcoming the limitations of existing deep learning models in identifying subtle structural changes. Experimental results demonstrate MaskMol’s high accuracy and transferability in activity cliff estimation and compound potency prediction across 20 different macromolecular targets, outperforming 25 state-of-the-art deep learning and machine learning approaches. Visualization analyses reveal MaskMol’s high biological interpretability in identifying activity cliff-relevant molecular substructures. Notably, through MaskMol, we identified candidate EP4 inhibitors that could be used to treat tumors. This study not only raises awareness about activity cliffs but also introduces a novel method for molecular image representation learning and virtual screening, advancing drug discovery and providing new insights into structure-activity relationships (SAR).
[CV-11] Recognition of Harmful Phytoplankton from Microscopic Images using Deep Learning
Link: https://arxiv.org/abs/2409.12900
Authors: Aymane Khaldi,Rohaifa Khaldi
Keywords: preserving aquatic ecosystems, ensuring environmental protection, Monitoring plankton distribution, plankton distribution, aquatic ecosystems
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures
Abstract:Monitoring plankton distribution, particularly harmful phytoplankton, is vital for preserving aquatic ecosystems, regulating the global climate, and ensuring environmental protection. Traditional methods for monitoring are often time-consuming, expensive, error-prone, and unsuitable for large-scale applications, highlighting the need for accurate and efficient automated systems. In this study, we evaluate several state-of-the-art CNN models, including ResNet, ResNeXt, DenseNet, and EfficientNet, using three transfer learning approaches: linear probing, fine-tuning, and a combined approach, to classify eleven harmful phytoplankton genera from microscopic images. The best performance was achieved by ResNet-50 using the fine-tuning approach, with an accuracy of 96.97%. The results also revealed that the models struggled to differentiate between four harmful phytoplankton types with similar morphological features.
[CV-12] 3DGS-LM: Faster Gaussian-Splatting Optimization with Levenberg-Marquardt
Link: https://arxiv.org/abs/2409.12892
Authors: Lukas Höllein,Aljaž Božič,Michael Zollhöfer,Matthias Nießner
Keywords: Gaussian Splatting, tailored Levenberg-Marquardt, ADAM optimizer, Splatting, replacing its ADAM
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: project page: this https URL , video: this https URL , code: this https URL
Abstract:We present 3DGS-LM, a new method that accelerates the reconstruction of 3D Gaussian Splatting (3DGS) by replacing its ADAM optimizer with a tailored Levenberg-Marquardt (LM). Existing methods reduce the optimization time by decreasing the number of Gaussians or by improving the implementation of the differentiable rasterizer. However, they still rely on the ADAM optimizer to fit Gaussian parameters of a scene in thousands of iterations, which can take up to an hour. To this end, we change the optimizer to LM that runs in conjunction with the 3DGS differentiable rasterizer. For efficient GPU parallelization, we propose a caching data structure for intermediate gradients that allows us to efficiently calculate Jacobian-vector products in custom CUDA kernels. In every LM iteration, we calculate update directions from multiple image subsets using these kernels and combine them in a weighted mean. Overall, our method is 30% faster than the original 3DGS while obtaining the same reconstruction quality. Our optimization is also agnostic to other methods that accelerate 3DGS, thus enabling even greater speedups compared to vanilla 3DGS.
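For readers unfamiliar with Levenberg-Marquardt, the following is a minimal dense sketch of the generic LM update on a toy curve-fitting problem. It is not the 3DGS-LM implementation: the paper describes cached gradients, Jacobian-vector products in custom CUDA kernels, and a weighted mean over image subsets, none of which appear here. The toy model, the damping schedule, and all names are assumptions chosen for illustration.

```python
import numpy as np

def lm_step(residual_fn, jacobian_fn, params, damping):
    """One LM update: solve (J^T J + damping * diag(J^T J)) delta = -J^T r."""
    r = residual_fn(params)
    J = jacobian_fn(params)
    JtJ = J.T @ J
    A = JtJ + damping * np.diag(np.diag(JtJ))
    delta = np.linalg.solve(A, -J.T @ r)
    return params + delta

def fit(residual_fn, jacobian_fn, params, damping=1e-2, iters=50):
    cost = 0.5 * np.sum(residual_fn(params) ** 2)
    for _ in range(iters):
        candidate = lm_step(residual_fn, jacobian_fn, params, damping)
        new_cost = 0.5 * np.sum(residual_fn(candidate) ** 2)
        if new_cost < cost:
            # Accept the step and trust the quadratic model more.
            params, cost, damping = candidate, new_cost, damping * 0.5
        else:
            # Reject the step and move closer to gradient descent.
            damping *= 2.0
    return params

# Toy problem: recover (a, k) in y = a * exp(k * x) from noise-free samples.
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.5 * x)
residual = lambda p: p[0] * np.exp(p[1] * x) - y
jacobian = lambda p: np.stack(
    [np.exp(p[1] * x), p[0] * x * np.exp(p[1] * x)], axis=1
)
estimate = fit(residual, jacobian, np.array([1.0, 0.0]))
print(estimate)
```

The damping term interpolates between Gauss-Newton (small damping) and scaled gradient descent (large damping), which is why LM typically needs far fewer iterations than a first-order optimizer such as ADAM on least-squares objectives.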
[CV-13] EdgeGaussians – 3D Edge Mapping via Gaussian Splatting
Link: https://arxiv.org/abs/2409.12886
Authors: Kunal Chelani,Assia Benbihi,Torsten Sattler,Fredrik Kahl
Keywords: edge, edge point cloud, computer vision, extremely useful primitives, primitives in computer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:With their meaningful geometry and their omnipresence in the 3D world, edges are extremely useful primitives in computer vision. 3D edges consist of lines and curves, and methods to reconstruct them use either multi-view images or point clouds as input. State-of-the-art image-based methods first learn a 3D edge point cloud and then fit 3D edges to it. The edge point cloud is obtained by learning a 3D neural implicit edge field from which the 3D edge points are sampled on a specific level set (0 or 1). However, such methods present two important drawbacks: i) it is not realistic to sample points on exact level sets due to float imprecision and training inaccuracies. Instead, they are sampled within a range of levels so the points do not lie accurately on the 3D edges and require further processing. ii) Such implicit representations are computationally expensive and require long training times. In this paper, we address these two limitations and propose a 3D edge mapping that is simpler, more efficient, and preserves accuracy. Our method explicitly learns the 3D edge points and their edge directions, hence bypassing the need for point sampling. It casts a 3D edge point as the center of a 3D Gaussian and the edge direction as the principal axis of the Gaussian. Such a representation has the advantage of being not only geometrically meaningful but also compatible with the efficient training optimization defined in Gaussian Splatting. Results show that the proposed method produces edges as accurate and complete as the state-of-the-art while being an order of magnitude faster. Code is released at this https URL.
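The representation described in this abstract, where a 3D edge point is the Gaussian center and the edge direction is the Gaussian's principal axis, can be read off a covariance matrix with an eigendecomposition. The snippet below is a small sketch of that reading only; the function name and the example Gaussian are hypothetical, and nothing here reflects how the authors train or parameterize their Gaussians.

```python
import numpy as np

def edge_point_and_direction(mean, cov):
    """Return the Gaussian center and the unit eigenvector of its largest eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues come back in ascending order
    direction = eigvecs[:, -1]
    return mean, direction / np.linalg.norm(direction)

# Hypothetical Gaussian elongated along the (1, 1, 0) diagonal.
R = np.array([[1, -1, 0], [1, 1, 0], [0, 0, np.sqrt(2)]]) / np.sqrt(2)
cov = R @ np.diag([1.0, 0.01, 0.01]) @ R.T
point, direction = edge_point_and_direction(np.zeros(3), cov)
print(point, direction)
```

The sign of the returned direction is arbitrary, since an eigenvector and its negation describe the same axis.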
[CV-14] Hypersphere Secure Sketch Revisited: Probabilistic Linear Regression Attack on IronMask in Multiple Usage
Link: https://arxiv.org/abs/2409.12884
Authors: Pengxu Zhu,Lei Wang
Keywords: Protection of biometric, area of focus, critical and urgent, urgent area, Protection
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Abstract:Protection of biometric templates is a critical and urgent area of focus. IronMask demonstrates outstanding recognition performance while protecting facial templates against existing known attacks. At a high level, IronMask can be conceptualized as a fuzzy commitment scheme built directly on the hypersphere. We devise an attack on IronMask targeting the security notion of renewability. Our attack, termed Probabilistic Linear Regression Attack, utilizes the linearity of the underlying error-correcting code. This attack is the first algorithm to successfully recover the original template given multiple protected templates, within acceptable time and storage requirements. We implement experiments on IronMask applied to protect ArcFace that verify the validity of our attacks. Furthermore, we carry out experiments in noisy environments and confirm that our attacks are still applicable. Finally, we put forward two strategies to mitigate this type of attack.
[CV-15] Improving Prototypical Parts Abstraction for Case-Based Reasoning Explanations Designed for the Kidney Stone Type Recognition
Link: https://arxiv.org/abs/2409.12883
Authors: Daniel Flores-Araiza,Francisco Lopez-Tiro,Clément Larose,Salvador Hinojosa,Andres Mendez-Vazquez,Miguel Gonzalez-Mendoza,Gilberto Ochoa-Ruiz,Christian Daul
Keywords: kidney stone types, calculi extraction process, diminishing infection risks, major medical advance, tedious renal calculi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Paper submitted to Artificial Intelligence in Medicine (AIIM), Elsevier
Abstract:The in-vivo identification of kidney stone types during a ureteroscopy would be a major medical advance in urology, as it could reduce the time of the tedious renal calculi extraction process while diminishing infection risks. Furthermore, such an automated procedure would make it possible to prescribe anti-recurrence treatments immediately. Nowadays, only a few experienced urologists are able to recognize the kidney stone types in the images of the videos displayed on a screen during the endoscopy. Thus, several deep learning (DL) models have recently been proposed to automatically recognize the kidney stone types using ureteroscopic images. However, these DL models are of a black-box nature, which limits their applicability in clinical settings. This contribution proposes a case-based reasoning DL model which uses prototypical parts (PPs) and generates local and global descriptors. The PPs encode for each class (i.e., kidney stone type) visual feature information (hue, saturation, intensity and textures) similar to that used by biologists. The PPs are optimally generated due to a new loss function used during the model training. Moreover, the local and global descriptors of PPs make it possible to explain the decisions (“what” information, “where in the images”) in a way that is understandable for biologists and urologists. The proposed DL model has been tested on a database including images of the six most widespread kidney stone types. The overall average classification accuracy was 90.37. When comparing these results with those of the eight other DL models of the kidney stone state-of-the-art, it can be seen that the valuable gain in explainability was not reached at the expense of accuracy, which was even slightly increased with respect to that (88.2) of the best method of the literature. These promising and interpretable results also encourage urologists to put their trust in AI-based solutions.
[CV-16] Automated Linear Disturbance Mapping via Semantic Segmentation of Sentinel-2 Imagery
Link: https://arxiv.org/abs/2409.12817
Authors: Andrew M. Nagel, Anne Webster, Christopher Henry, Christopher Storie, Ignacio San-Miguel Sanchez, Olivier Tsui, Jason Duffe, Andy Dean
Keywords: woodland caribou population, Canada northern regions, Rangifer tarandus, boreal woodland caribou, northern regions
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:In Canada’s northern regions, linear disturbances such as roads, seismic exploration lines, and pipelines pose a significant threat to the boreal woodland caribou population (Rangifer tarandus). To address the critical need for management of these disturbances, there is a strong emphasis on developing mapping approaches that accurately identify forest habitat fragmentation. The traditional approach is manually generating maps, which is time-consuming and lacks the capability for frequent updates. Instead, applying deep learning methods to multispectral satellite imagery offers a cost-effective solution for automated and regularly updated map production. Deep learning models have shown promise in extracting paved roads in urban environments when paired with high-resolution (0.5m) imagery, but their effectiveness for general linear feature extraction in forested areas from lower resolution imagery remains underexplored. This research employs a deep convolutional neural network model based on the VGGNet16 architecture for semantic segmentation of lower resolution (10m) Sentinel-2 satellite imagery, creating precise multi-class linear disturbance maps. The model is trained using ground-truth label maps sourced from the freely available Alberta Institute of Biodiversity Monitoring Human Footprint dataset, specifically targeting the Boreal and Taiga Plains ecozones in Alberta, Canada. Despite challenges in segmenting lower resolution imagery, particularly for thin linear disturbances like seismic exploration lines that can exhibit a width of 1-3 pixels in Sentinel-2 imagery, our results demonstrate the effectiveness of the VGGNet model for accurate linear disturbance retrieval. By leveraging the freely available Sentinel-2 imagery, this work advances cost-effective automated mapping techniques for identifying and monitoring linear disturbance fragmentation.
[CV-17] Autonomous Visual Fish Pen Inspections for Estimating the State of Biofouling Buildup Using ROV – Extended Abstract ICRA
Link: https://arxiv.org/abs/2409.12813
Authors: Matej Fabijanić, Nadir Kapetanović, Nikola Mišković
Keywords: maintenance task, scale or industrial, fully automated, small scale, fish cage inspections
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE ICRA Workshop on Field Robotics 2024
Abstract:The process of fish cage inspections, which is a necessary maintenance task at any fish farm, be it small scale or industrial, is a task that has the potential to be fully automated. Replacing trained divers who perform regular inspections with autonomous marine vehicles would lower the costs of manpower and remove the risks associated with humans performing underwater inspections. Achieving such a level of autonomy implies developing an image processing algorithm that is capable of estimating the state of biofouling buildup. The aim of this work is to propose a complete solution for automating the said inspection process; from developing an autonomous control algorithm for an ROV, to automatically segmenting images of fish cages, and accurately estimating the state of biofouling. The first part is achieved by modifying a commercially available ROV with an acoustic SBL positioning system and developing a closed-loop control system. The second part is realized by implementing a proposed biofouling estimation framework, which relies on AI to perform image segmentation, and by processing images using established computer vision methods to obtain a rough estimate of the distance of the ROV from the fish cage. This also involved developing a labeling tool in order to create a dataset of images for the neural network performing the semantic segmentation to be trained on. The experimental results show the viability of using an ROV fitted with an acoustic transponder for autonomous missions, and demonstrate the biofouling estimation framework’s ability to provide accurate assessments, alongside satisfactory distance estimation capabilities. In conclusion, the achieved biofouling estimation accuracy showcases clear potential for use in the aquaculture industry.
[CV-18] Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering
Link: https://arxiv.org/abs/2409.12784
Authors: Youngsun Lim, Hojun Choi, Hyunjung Shim
Keywords: existing studies overlook, TTI, image hallucination, impressive success, studies overlook
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 20 pages
Abstract:Despite the impressive success of text-to-image (TTI) generation models, existing studies overlook the issue of whether these models accurately convey factual information. In this paper, we focus on the problem of image hallucination, where images created by generation models fail to faithfully depict factual content. To address this, we introduce I-HallA (Image Hallucination evaluation with Question Answering), a novel automated evaluation metric that measures the factuality of generated images through visual question answering (VQA). We also introduce I-HallA v1.0, a curated benchmark dataset for this purpose. As part of this process, we develop a pipeline that generates high-quality question-answer pairs using multiple GPT-4 Omni-based agents, with human judgments to ensure accuracy. Our evaluation protocols measure image hallucination by testing if images from existing text-to-image models can correctly respond to these questions. The I-HallA v1.0 dataset comprises 1.2K diverse image-text pairs across nine categories with 1,000 rigorously curated questions covering various compositional challenges. We evaluate five text-to-image models using I-HallA and reveal that these state-of-the-art models often fail to accurately convey factual information. Moreover, we validate the reliability of our metric by demonstrating a strong Spearman correlation (rho=0.95) with human judgments. We believe our benchmark dataset and metric can serve as a foundation for developing factually accurate text-to-image generation models.
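The abstract above validates the metric through its Spearman correlation with human judgments. As a rough illustration of what that statistic measures, here is a minimal pure-Python sketch: it ranks both score lists (tied values share the mean of their positions) and computes the Pearson correlation of the ranks. The function names and the sample scores are hypothetical and are not taken from the paper or its code.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores: an automatic metric vs. human ratings for four models.
metric_scores = [0.2, 0.5, 0.9, 0.7]
human_scores = [1, 2, 4, 3]
print(spearman_rho(metric_scores, human_scores))
```

Because only the ordering matters, a value near 1 means the metric ranks models almost exactly as the human raters do, regardless of the scale of the raw scores.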
[CV-19] EventDance++: Language-guided Unsupervised Source-free Cross-modal Adaptation for Event-based Object Recognition
Link: https://arxiv.org/abs/2409.12778
Authors: Xu Zheng, Lin Wang
Keywords: address the challenging, accessing any labeled, labeled source image, challenging problem, labeled source
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: text overlap with arXiv:2403.14082
Abstract:In this paper, we address the challenging problem of cross-modal (image-to-events) adaptation for event-based recognition without accessing any labeled source image data. This task is arduous due to the substantial modality gap between images and events. With only a pre-trained source model available, the key challenge lies in extracting knowledge from this model and effectively transferring knowledge to the event-based domain. Inspired by the natural ability of language to convey semantics across different modalities, we propose EventDance++, a novel framework that tackles this unsupervised source-free cross-modal adaptation problem from a language-guided perspective. We introduce a language-guided reconstruction-based modality bridging (L-RMB) module, which reconstructs intensity frames from events in a self-supervised manner. Importantly, it leverages a vision-language model to provide further supervision, enriching the surrogate images and enhancing modality bridging. This enables the creation of surrogate images to extract knowledge (i.e., labels) from the source model. On top, we propose a multi-representation knowledge adaptation (MKA) module to transfer knowledge to target models, utilizing multiple event representations to capture the spatiotemporal characteristics of events fully. The L-RMB and MKA modules are jointly optimized to achieve optimal performance in bridging the modality gap. Experiments on three benchmark datasets demonstrate that EventDance++ performs on par with methods that utilize source data, validating the effectiveness of our language-guided approach in event-based recognition.
[CV-20] GaRField: Reinforced Gaussian Radiance Fields for Large-Scale 3D Scene Reconstruction
Link: https://arxiv.org/abs/2409.12774
Authors: Hanyue Zhang, Zhiliu Yang, Xinhe Zuo, Yuxin Tong, Ying Long, Chen Liu
Keywords: accuracy challenges faced, large-scale scene reconstruction, scene reconstruction based, paper proposes, aims to address
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract:This paper proposes a novel framework for large-scale scene reconstruction based on 3D Gaussian splatting (3DGS) and aims to address the scalability and accuracy challenges faced by existing methods. To tackle the scalability issue, we split the large scene into multiple cells, and the candidate point-cloud and camera views of each cell are correlated through a visibility-based camera selection and a progressive point-cloud extension. To reinforce the rendering quality, three improvements are made over vanilla 3DGS: a ray-Gaussian intersection strategy and a novel Gaussian density control for learning efficiency, an appearance decoupling module based on a ConvKAN network to handle uneven lighting conditions in large-scale scenes, and a refined final loss combining the color loss, the depth distortion loss, and the normal consistency loss. Finally, a seamless stitching procedure is executed to merge the individual Gaussian radiance fields for novel view synthesis across different cells. Evaluation on the Mill19, Urban3D, and MatrixCity datasets shows that our method consistently generates higher-fidelity rendering results than state-of-the-art methods of large-scale scene reconstruction. We further validate the generalizability of the proposed approach by rendering on self-collected video clips recorded by a commercial drone.
[CV-21] Spectral-GS: Taming 3D Gaussian Splatting with Spectral Entropy
Link: https://arxiv.org/abs/2409.12771
Authors: Letian Huang, Jie Guo, Jialin Dan, Ruoyu Fu, Shujie Wang, Yuanqi Li, Yanwen Guo
Keywords: demonstrating high fidelity, Gaussian Splatting, achieved impressive results, demonstrating high, fidelity and efficiency
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Abstract:Recently, 3D Gaussian Splatting (3D-GS) has achieved impressive results in novel view synthesis, demonstrating high fidelity and efficiency. However, it easily exhibits needle-like artifacts, especially when increasing the sampling rate. Mip-Splatting tries to remove these artifacts with a 3D smoothing filter for frequency constraints and a 2D Mip filter for approximated supersampling. Unfortunately, it tends to produce over-blurred results, and sometimes needle-like Gaussians still persist. Our spectral analysis of the covariance matrix during optimization and densification reveals that current 3D-GS lacks shape awareness, relying instead on spectral radius and view positional gradients to determine splitting. As a result, needle-like Gaussians with small positional gradients and low spectral entropy fail to split and overfit high-frequency details. Furthermore, both the filters used in 3D-GS and Mip-Splatting reduce the spectral entropy and increase the condition number during zooming in to synthesize novel view, causing view inconsistencies and more pronounced artifacts. Our Spectral-GS, based on spectral analysis, introduces 3D shape-aware splitting and 2D view-consistent filtering strategies, effectively addressing these issues, enhancing 3D-GS’s capability to represent high-frequency details without noticeable artifacts, and achieving high-quality photorealistic rendering.
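To make the terms "spectral entropy" and "condition number" in the abstract above more concrete, here is a small sketch. It assumes one common definition: the Shannon entropy of the covariance eigenvalues after normalizing them to sum to one. The paper's exact formulation may differ. It also relies on the standard 3D-GS parameterization in which the covariance eigenvalues are the squared per-axis scales, so the functions take the scales directly. Function names and example values are hypothetical.

```python
import math


def spectral_entropy(scales):
    """Shannon entropy of the normalized covariance eigenvalues.

    Assumes the covariance is R * diag(s)^2 * R^T, so its eigenvalues
    are the squared per-axis scales s_i (an assumption of this sketch).
    """
    eigenvalues = [s * s for s in scales]
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("at least one scale must be non-zero")
    probs = [ev / total for ev in eigenvalues]
    return -sum(p * math.log(p) for p in probs if p > 0)


def condition_number(scales):
    """Ratio of the largest to the smallest covariance eigenvalue."""
    eigenvalues = [s * s for s in scales]
    smallest = min(eigenvalues)
    return math.inf if smallest == 0 else max(eigenvalues) / smallest


isotropic = [1.0, 1.0, 1.0]    # ball-shaped Gaussian
needle = [10.0, 0.1, 0.1]      # elongated, needle-like Gaussian

print(spectral_entropy(isotropic), condition_number(isotropic))
print(spectral_entropy(needle), condition_number(needle))
```

Under this definition an isotropic Gaussian reaches the maximum entropy of ln 3, while a needle-like Gaussian has entropy close to zero and a very large condition number, which matches the abstract's description of needle-like Gaussians as having low spectral entropy.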
[CV-22] COCO-Occ: A Benchmark for Occluded Panoptic Segmentation and Image Understanding
Link: https://arxiv.org/abs/2409.12760
Authors: Wenbo Wei, Jun Wang, Abhir Bhalerao
Keywords: COCO images, labelling the COCO, perceived occlusion levels, image understanding, COCO dataset
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:To help address the occlusion problem in panoptic segmentation and image understanding, this paper proposes a new large-scale dataset, COCO-Occ, which is derived from the COCO dataset by manually labelling the COCO images into three perceived occlusion levels. Using COCO-Occ, we systematically assess and quantify the impact of occlusion on panoptic segmentation on samples having different levels of occlusion. Comparative experiments with SOTA panoptic models demonstrate that the presence of occlusion significantly affects performance with higher occlusion levels resulting in notably poorer performance. Additionally, we propose a straightforward yet effective method as an initial attempt to leverage the occlusion annotation using contrastive learning to render a model that learns a more robust representation capturing different severities of occlusion. Experimental results demonstrate that the proposed approach boosts the performance of the baseline model and achieves SOTA performance on the proposed COCO-Occ dataset.
[CV-23] DrivingForward: Feed-forward 3D Gaussian Splatting for Driving Scene Reconstruction from Flexible Surround-view Input
Link: https://arxiv.org/abs/2409.12753
Authors: Qijian Tian, Xin Tan, Yuan Xie, Lizhuang Ma
Keywords: Gaussian Splatting model, feed-forward Gaussian Splatting, Gaussian Splatting, reconstructs driving scenes, propose DrivingForward
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:We propose DrivingForward, a feed-forward Gaussian Splatting model that reconstructs driving scenes from flexible surround-view input. Driving scene images from vehicle-mounted cameras are typically sparse, with limited overlap, and the movement of the vehicle further complicates the acquisition of camera extrinsics. To tackle these challenges and achieve real-time reconstruction, we jointly train a pose network, a depth network, and a Gaussian network to predict the Gaussian primitives that represent the driving scenes. The pose network and depth network determine the position of the Gaussian primitives in a self-supervised manner, without using depth ground truth and camera extrinsics during training. The Gaussian network independently predicts primitive parameters from each input image, including covariance, opacity, and spherical harmonics coefficients. At the inference stage, our model can achieve feed-forward reconstruction from flexible multi-frame surround-view input. Experiments on the nuScenes dataset show that our model outperforms existing state-of-the-art feed-forward and scene-optimized reconstruction methods in terms of reconstruction.
[CV-24] PVContext: Hybrid Context Model for Point Cloud Compression
Link: https://arxiv.org/abs/2409.12724
Authors: Guoqing Zhang, Wenbo Zhao, Jian Liu, Yuanchao Bai, Junjun Jiang, Xianming Liu
Keywords: increasingly challenging due, Efficient storage, scanning technology, point cloud data, large-scale point cloud
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Abstract:Efficient storage of large-scale point cloud data has become increasingly challenging due to advancements in scanning technology. Recent deep learning techniques have revolutionized this field; however, most existing approaches rely on single-modality contexts, such as octree nodes or voxel occupancy, limiting their ability to capture information across large regions. In this paper, we propose PVContext, a hybrid context model for effective octree-based point cloud compression. PVContext comprises two components with distinct modalities: the Voxel Context, which accurately represents local geometric information using voxels, and the Point Context, which efficiently preserves global shape information from point clouds. By integrating these two contexts, we retain detailed information across large areas while controlling the context size. The combined context is then fed into a deep entropy model to accurately predict occupancy. Experimental results demonstrate that, compared to G-PCC, our method reduces the bitrate by 37.95% on SemanticKITTI LiDAR point clouds and by 48.98% and 36.36% on dense object point clouds from MPEG 8i and MVUB, respectively.
[CV-25] FAST GDRNPP: Improving the Speed of State-of-the-Art 6D Object Pose Estimation
Link: https://arxiv.org/abs/2409.12720
Authors: Thomas Pöllabauer, Ashwin Pramod, Volker Knauthe, Michael Wahl
Keywords: chosen coordinate system, estimation involves determining, pose estimation involves, object pose estimation, coordinate system
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract:6D object pose estimation involves determining the three-dimensional translation and rotation of an object within a scene and relative to a chosen coordinate system. This problem is of particular interest for many practical applications in industrial tasks such as quality control, bin picking, and robotic manipulation, where both speed and accuracy are critical for real-world deployment. Current models, both classical and deep-learning-based, often struggle with the trade-off between accuracy and latency. Our research focuses on enhancing the speed of a prominent state-of-the-art deep learning model, GDRNPP, while keeping its high accuracy. We employ several techniques to reduce the model size and improve inference time. These techniques include using smaller and quicker backbones, pruning unnecessary parameters, and distillation to transfer knowledge from a large, high-performing model to a smaller, more efficient student model. Our findings demonstrate that the proposed configuration maintains accuracy comparable to the state-of-the-art while significantly improving inference time. This advancement could lead to more efficient and practical applications in various industrial scenarios, thereby enhancing the overall applicability of 6D Object Pose Estimation models in real-world settings.
[CV-26] Optical Flow Matters: an Empirical Comparative Study on Fusing Monocular Extracted Modalities for Better Steering
Link: https://arxiv.org/abs/2409.12716
Authors: Fouad Makiyeh, Mark Bastourous, Anass Bairouk, Wei Xiao, Mirjana Maras, Tsun-Hsuan Wangb, Marc Blanchon, Ramin Hasani, Patrick Chareyre, Daniela Rus
Keywords: accurate decision-making processes, key challenge, Neural Circuit Policy, Variational Auto Encoder, Circuit Policy
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract:Autonomous vehicle navigation is a key challenge in artificial intelligence, requiring robust and accurate decision-making processes. This research introduces a new end-to-end method that exploits multimodal information from a single monocular camera to improve the steering predictions for self-driving cars. Unlike conventional models that require several sensors, which can be costly and complex, or rely exclusively on RGB images that may not be robust enough under different conditions, our model significantly improves vehicle steering prediction performance from a single visual sensor. By focusing on the fusion of RGB imagery with depth completion information or optical flow data, we propose a comprehensive framework that integrates these modalities through both early and hybrid fusion techniques. We use three distinct neural network models to implement our approach: Convolution Neural Network - Neural Circuit Policy (CNN-NCP), Variational Auto Encoder - Long Short-Term Memory (VAE-LSTM), and the Neural Circuit Policy architecture VAE-NCP. By incorporating optical flow into the decision-making process, our method significantly advances autonomous navigation. Empirical results from our comparative study using Boston driving data show that our model, which integrates image and motion information, is robust and reliable. It outperforms state-of-the-art approaches that do not use optical flow, reducing the steering estimation error by 31%. This demonstrates the potential of optical flow data, combined with advanced neural network architectures (a CNN-based structure for fusing data and a recurrence-based network for inferring a command from latent space), to enhance the performance of autonomous vehicle steering estimation.
[CV-27] Generation and Editing of Mandrill Faces: Application to Sex Editing and Assessment
Link: https://arxiv.org/abs/2409.12705
Authors: Nicolas M. Dibot, Julien P. Renoult, William Puech
Keywords: recent years, enhancing the realism, major developments, developments in recent, realism of synthetic
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract:Generative AI has seen major developments in recent years, enhancing the realism of synthetic images, also known as computer-generated images. In addition, generative AI has also made it possible to modify specific image characteristics through image editing. Previous work has developed methods based on generative adversarial networks (GAN) for generating realistic images, in particular faces, and for modifying specific features. However, this work has never been applied to specific animal species. Moreover, the assessment of the results has generally been done subjectively rather than quantitatively. In this paper, we propose an approach based on methods for generating images of faces of male or female mandrills, a non-human primate. The main novelty of the proposed method is the ability to edit their sex by identifying a sex axis in the latent space of a specific GAN. In addition, we have developed an assessment of the sex levels based on statistical features extracted from real image distributions. The experimental results we obtained from a specific database are not only realistic, but also accurate, meeting a need for future work in behavioral experiments with wild mandrills.
[CV-28] A dynamic vision sensor object recognition model based on trainable event-driven convolution and spiking attention mechanism
Link: https://arxiv.org/abs/2409.12691
Authors: Peng Zheng, Qian Zhou
Keywords: Dynamic Visual Sensors, Spiking Neural Networks, Neural Networks, Visual Sensors, Dynamic Visual
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 2 figures
Abstract:Spiking Neural Networks (SNNs) are well-suited for processing event streams from Dynamic Visual Sensors (DVSs) due to their use of sparse spike-based coding and asynchronous event-driven computation. To extract features from DVS objects, SNNs commonly use event-driven convolution with fixed kernel parameters. These filters respond strongly to features in specific orientations while disregarding others, leading to incomplete feature extraction. To improve the current event-driven convolution feature extraction capability of SNNs, we propose a DVS object recognition model that utilizes a trainable event-driven convolution and a spiking attention mechanism. The trainable event-driven convolution is proposed in this paper to update its convolution kernel through gradient descent. This method can extract local features of the event stream more efficiently than traditional event-driven convolution. Furthermore, the spiking attention mechanism is used to extract global dependence features. The classification performances of our model are better than the baseline methods on two neuromorphic datasets including MNIST-DVS and the more complex CIFAR10-DVS. Moreover, our model showed good classification ability for short event streams. It was shown that our model can improve the performance of event-driven convolutional SNNs for DVS objects.
[CV-29] Semi-Supervised Semantic Segmentation with Professional and General Training
Link: https://arxiv.org/abs/2409.12680
Authors: Yuting Hong, Hui Xiao, Huazheng Hao, Xiaojie Qiu, Baochen Yao, Chengbin Peng
Keywords: achieved remarkable progress, convolutional neural networks, remarkable progress, advancement of convolutional, convolutional neural
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 10 figures
Abstract:With the advancement of convolutional neural networks, semantic segmentation has achieved remarkable progress. The training of such networks heavily relies on image annotations, which are very expensive to obtain. Semi-supervised learning can utilize both labeled data and unlabeled data with the help of pseudo-labels. However, in many real-world scenarios where classes are imbalanced, majority classes often play a dominant role during training and the learning quality of minority classes can be undermined. To overcome this limitation, we propose a synergistic training framework, including a professional training module to enhance minority class learning and a general training module to learn more comprehensive semantic information. Based on a pixel selection strategy, they can iteratively learn from each other to reduce error accumulation and coupling. In addition, a dual contrastive learning with anchors is proposed to guarantee more distinct decision boundaries. In experiments, our framework demonstrates superior performance compared to state-of-the-art methods on benchmark datasets.
[CV-30] Enhancing Construction Site Safety: A Lightweight Convolutional Network for Effective Helmet Detection
Link: https://arxiv.org/abs/2409.12669
Authors: Mujadded Al Rabbani Alif
Keywords: personal protective equipment, preventing workplace injuries, protective equipment, plays a critical, workplace injuries
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract:In the realm of construction safety, the detection of personal protective equipment, such as helmets, plays a critical role in preventing workplace injuries. This paper details the development and evaluation of convolutional neural networks (CNNs) designed for the accurate classification of helmet presence on construction sites. Initially, a simple CNN model comprising one convolutional block and one fully connected layer was developed, yielding modest results. To enhance its performance, the model was progressively refined, first by extending the architecture to include an additional convolutional block and a fully connected layer. Subsequently, batch normalization and dropout techniques were integrated, aiming to mitigate overfitting and improve the model’s generalization capabilities. The performance of these models is methodically analyzed, revealing a peak F1-score of 84%, precision of 82%, and recall of 86% with the most advanced configuration of the first study phase. Despite these improvements, the accuracy remained suboptimal, thus setting the stage for further architectural and operational enhancements. This work lays a foundational framework for ongoing adjustments and optimization in automated helmet detection technology, with future enhancements expected to address the limitations identified during these initial experiments.
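The three figures reported in the abstract above are linked: the F1-score is the harmonic mean of precision and recall. The short sketch below checks that the reported precision (82%) and recall (86%) are consistent with the reported F1-score (84%). The function name is illustrative and not taken from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract: precision 82%, recall 86%.
print(round(f1_score(0.82, 0.86), 2))  # 0.84
```

The harmonic mean sits slightly below the arithmetic mean and is pulled toward the lower of the two values, so a model cannot reach a high F1-score by maximizing only one of precision or recall.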
[CV-31] METDrive: Multi-modal End-to-end Autonomous Driving with Temporal Guidance
Link: https://arxiv.org/abs/2409.12667
Authors: Ziang Guo, Xinhao Lin, Zakhar Yagudin, Artem Lykov, Yong Wang, Yanqiang Li, Dzmitry Tsetserukou
Keywords: shown promising advancements, recent work, shown promising, promising advancements, advancements in recent
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Abstract:Multi-modal end-to-end autonomous driving has shown promising advancements in recent work. By embedding more modalities into end-to-end networks, the system’s understanding of both static and dynamic aspects of the driving environment is enhanced, thereby improving the safety of autonomous driving. In this paper, we introduce METDrive, an end-to-end system that leverages temporal guidance from the embedded time series features of ego states, including rotation angles, steering, throttle signals, and waypoint vectors. The geometric features derived from perception sensor data and the time series features of ego state data jointly guide the waypoint prediction with the proposed temporal guidance loss function. We evaluated METDrive on the CARLA leaderboard’s Longest6 benchmark, achieving a driving score of 70%, a route completion score of 94%, and an infraction score of 0.78.
[CV-32] Manifold Sampling for Differentiable Uncertainty in Radiance Fields SIGGRAPH
Link: https://arxiv.org/abs/2409.12661
Authors: Linjie Lyu, Ayush Tewari, Marc Habermann, Shunsuke Saito, Michael Zollhöfer, Thomas Leimkühler, Christian Theobalt
Keywords: popular models, complex scenes, models for representing, representing the appearance, appearance of complex
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: SIGGRAPH Asia 2024 conference
Abstract:Radiance fields are powerful and, hence, popular models for representing the appearance of complex scenes. Yet, constructing them based on image observations gives rise to ambiguities and uncertainties. We propose a versatile approach for learning Gaussian radiance fields with explicit and fine-grained uncertainty estimates that impose only little additional cost compared to uncertainty-agnostic training. Our key observation is that uncertainties can be modeled as a low-dimensional manifold in the space of radiance field parameters that is highly amenable to Monte Carlo sampling. Importantly, our uncertainties are differentiable and, thus, allow for gradient-based optimization of subsequent captures that optimally reduce ambiguities. We demonstrate state-of-the-art performance on next-best-view planning tasks, including high-dimensional illumination planning for optimal radiance field relighting quality.
[CV-33] PoTATO: A Dataset for Analyzing Polarimetric Traces of Afloat Trash Objects ECCV24
Link: https://arxiv.org/abs/2409.12659
Authors: Luis Felipe Wolf Batista(UL),Salim Khazem,Mehran Adibi,Seth Hutchinson,Cedric Pradalier
Keywords: poses severe risks, aquatic environments poses, environments poses severe, poses severe, severe risks
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV24 TRICKY workshop, Sep 2024, Milano, Italy
Abstract:Plastic waste in aquatic environments poses severe risks to marine life and human health. Autonomous robots can be utilized to collect floating waste, but they require accurate object identification capability. While deep learning has been widely used as a powerful tool for this task, its performance is significantly limited by outdoor light conditions and water surface reflection. Light polarization, abundant in such environments yet invisible to the human eye, can be captured by modern sensors to significantly improve litter detection accuracy on water surfaces. With this goal in mind, we introduce PoTATO, a dataset containing 12,380 labeled plastic bottles and rich polarimetric information. We demonstrate under which conditions polarization can enhance object detection and, by providing raw image data, we offer an opportunity for the research community to explore novel approaches and push the boundaries of state-of-the-art object detection algorithms even further. Code and data are publicly available at this https URL PoTATO/tree/eccv2024.
[CV-34] Image inpainting for corrupted images by using the semi-super resolution GAN
Link: https://arxiv.org/abs/2409.12636
Authors: Mehrshad Momen-Tayefeh,Mehrdad Momen-Tayefeh,Amir Ali Ghafourian Ghahramani
Keywords: Generative Adversarial Network, valuable technique, technique for enhancing, Image inpainting, enhancing images
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:
Abstract:Image inpainting is a valuable technique for enhancing images that have been corrupted. The primary challenge in this research revolves around the extent of corruption in the input image that the deep learning model must restore. To address this challenge, we introduce a Generative Adversarial Network (GAN) for learning and replicating the missing pixels. Additionally, we have developed a distinct variant of the Super-Resolution GAN (SRGAN), which we refer to as the Semi-SRGAN (SSRGAN). Furthermore, we leveraged three diverse datasets to assess the robustness and accuracy of our proposed model. Our training process involves varying levels of pixel corruption to attain optimal accuracy and generate high-quality images.
[CV-35] EFA-YOLO: An Efficient Feature Attention Model for Fire and Flame Detection
Link: https://arxiv.org/abs/2409.12635
Authors: Weichao Pan,Xu Wang,Wenqing Huan
Keywords: Efficient Attention, Efficient Attention Convolution, great destructiveness, ecological environment, Efficient Attention Downsampling
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:As a natural disaster with high suddenness and great destructiveness, fire has long posed a major threat to human society and the ecological environment. In recent years, with the rapid development of smart city and Internet of Things (IoT) technologies, fire detection systems based on deep learning have gradually become a key means to cope with fire hazards. However, existing fire detection models still face many challenges in terms of detection accuracy and real-time performance in complex contexts. To address these issues, we propose two key modules: EAConv (Efficient Attention Convolution) and EADown (Efficient Attention Downsampling). The EAConv module significantly improves the feature extraction efficiency by combining an efficient attention mechanism with depth-separable convolution, while the EADown module enhances the accuracy and efficiency of feature downsampling by utilizing spatial and channel attention mechanisms in combination with pooling operations. Based on these two modules, we design an efficient and lightweight flame detection model, EFA-YOLO (Efficient Feature Attention YOLO). Experimental results show that EFA-YOLO has only 1.4M model parameters, 4.6 GFLOPs, and a CPU inference time of only 22.19 ms per image. Compared with existing mainstream models (e.g., YOLOv5, YOLOv8, YOLOv9, and YOLOv10), EFA-YOLO exhibits a significant enhancement in detection accuracy (mAP) and inference speed, with the number of model parameters reduced by 94.6% and the inference speed improved by 88 times.
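The abstract above attributes part of the model's small size to depth-separable convolution. The following is a minimal sketch of the standard parameter-count arithmetic behind that idea, not the EAConv module itself; the function names and the example layer sizes (64 input channels, 128 output channels, 3x3 kernel) are illustrative assumptions, and bias terms are ignored.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
std = standard_conv_params(64, 128, 3)
sep = depthwise_separable_params(64, 128, 3)
print(std, sep, round(sep / std, 3))
```

For these assumed sizes the separable layer needs roughly 12% of the weights of the standard layer, which is the kind of saving lightweight detectors rely on.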
[CV-36] Accurate Automatic 3D Annotation of Traffic Lights and Signs for Autonomous Driving ECCV2024
Link: https://arxiv.org/abs/2409.12620
Authors: Sándor Kunsági-Máté,Levente Pethő,Lehel Seres,Tamás Matuszka
Keywords: vehicles encounter numerous, encounter numerous intersections, traffic management objects, navigation where vehicles, traffic lights
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the 2nd Workshop on Vision-Centric Autonomous Driving (VCAD) as part of ECCV 2024
Abstract:3D detection of traffic management objects, such as traffic lights and road signs, is vital for self-driving cars, particularly for address-to-address navigation where vehicles encounter numerous intersections with these static objects. This paper introduces a novel method for automatically generating accurate and temporally consistent 3D bounding box annotations for traffic lights and signs, effective up to a range of 200 meters. These annotations are suitable for training real-time models used in self-driving cars, which need a large amount of training data. The proposed method relies only on RGB images with 2D bounding boxes of traffic management objects, which can be automatically obtained using an off-the-shelf image-space detector neural network, along with GNSS/INS data, eliminating the need for LiDAR point cloud data.
[CV-37] Enhancing Perception of Key Changes in Remote Sensing Image Change Captioning
Link: https://arxiv.org/abs/2409.12612
Authors: Cong Yang,Zuchao Li,Hongzan Jiao,Zhi Gao,Lefei Zhang
Keywords: remote sensing image, sensing image change, image change captioning, Key Change Features, making models susceptible
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recently, while significant progress has been made in remote sensing image change captioning, existing methods fail to filter out areas unrelated to actual changes, making models susceptible to irrelevant features. In this article, we propose a novel multimodal framework for remote sensing image change captioning, guided by Key Change Features and Instruction-tuned (KCFI). This framework aims to fully leverage the intrinsic knowledge of large language models through visual instructions and enhance the effectiveness and accuracy of change features using pixel-level change detection tasks. Specifically, KCFI includes a ViTs encoder for extracting bi-temporal remote sensing image features, a key feature perceiver for identifying critical change areas, a pixel-level change detection decoder to constrain key change features, and an instruction-tuned decoder based on a large language model. Moreover, to ensure that change description and change detection tasks are jointly optimized, we employ a dynamic weight-averaging strategy to balance the losses between the two tasks. We also explore various feature combinations for visual fine-tuning instructions and demonstrate that using only key change features to guide the large language model is the optimal choice. To validate the effectiveness of our approach, we compare it against several state-of-the-art change captioning methods on the LEVIR-CC dataset, achieving the best performance. Our code will be available at this https URL.
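The abstract above mentions a dynamic weight-averaging strategy for balancing the captioning and change-detection losses. The sketch below assumes the commonly used Dynamic Weight Average formulation, in which each task weight depends on how fast that task's loss fell over the previous two epochs; the paper's exact variant may differ, and the function name, the temperature value, and the sample losses are hypothetical.

```python
import math


def dwa_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Return one weight per task; the weights sum to the number of tasks.

    A task whose loss decreased more slowly (ratio closer to or above 1)
    receives a larger weight in the next epoch.
    """
    num_tasks = len(prev_losses)
    # Relative descent rate of each task: L(t-1) / L(t-2).
    ratios = [a / b for a, b in zip(prev_losses, prev_prev_losses)]
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    return [num_tasks * e / total for e in exps]


# Hypothetical losses: the captioning loss fell faster than the detection loss.
caption_w, detection_w = dwa_weights([0.5, 0.9], [1.0, 1.0])
print(caption_w, detection_w)
```

The combined objective would then be the weighted sum of the two task losses, recomputed once per epoch.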
[CV-38] CF-GO-Net: A Universal Distribution Learner via Characteristic Function Networks with Graph Optimizers
Link: https://arxiv.org/abs/2409.12610
Authors: Zeyang Yu,Shengxi Li,Danilo Mandic
Keywords: resemble real data, statistically resemble real, real data, aim to learn, generate samples
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Generative models aim to learn the distribution of datasets, such as images, so as to be able to generate samples that statistically resemble real data. However, learning the underlying probability distribution can be very challenging and intractable. To this end, we introduce an approach which employs the characteristic function (CF), a probabilistic descriptor that directly corresponds to the distribution. However, unlike the probability density function (pdf), the characteristic function not only always exists, but also provides an additional degree of freedom, hence enhances flexibility in learning distributions. This removes the critical dependence on pdf-based assumptions, which limit the applicability of traditional methods. While several works have attempted to use CF in generative modeling, they often impose strong constraints on the training process. In contrast, our approach calculates the distance between query points in the CF domain, which is an unconstrained and well defined problem. Next, to deal with the sampling strategy, which is crucial to model performance, we propose a graph neural network (GNN)-based optimizer for the sampling process, which identifies regions where the difference between CFs is most significant. In addition, our method allows the use of a pre-trained model, such as a well-trained autoencoder, and is capable of learning directly in its feature space, without modifying its parameters. This offers a flexible and robust approach to generative modeling, not only provides broader applicability and improved performance, but also equips any latent space world with the ability to become a generative model.
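The abstract above compares distributions through their characteristic functions evaluated at query points. As a minimal sketch of that idea, the code below computes the empirical characteristic function of one-dimensional samples and a mean squared difference over a fixed list of query points. The scalar samples, the fixed query points, and the function names are simplifying assumptions; the paper instead selects query points with a GNN-based optimizer and works in a learned feature space.

```python
import cmath
import math


def empirical_cf(samples, t):
    """Empirical characteristic function: the mean of exp(i * t * x) over the samples."""
    return sum(cmath.exp(1j * t * x) for x in samples) / len(samples)


def cf_distance(samples_p, samples_q, query_points):
    """Mean squared modulus of the CF difference over the query points."""
    total = 0.0
    for t in query_points:
        diff = empirical_cf(samples_p, t) - empirical_cf(samples_q, t)
        total += abs(diff) ** 2
    return total / len(query_points)


# Hypothetical toy data: two point masses, at 0 and at pi.
real = [0.0, 0.0]
fake = [math.pi, math.pi]
print(cf_distance(real, real, [1.0]), cf_distance(real, fake, [1.0]))
```

Because the characteristic function always exists and is bounded in modulus by 1, this quantity is well defined for any sample set, which is the property the abstract highlights.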
[CV-39] LARE: Latent Augmentation using Regional Embedding with Vision-Language Model
Link: https://arxiv.org/abs/2409.12597
Authors: Kosuke Sakurai,Tatsuya Ishii,Ryotaro Shimizu,Linxin Song,Masayuki Goto
Keywords: answering visual questions, diverse downstream tasks, Contrastive Language-Image Pre-training, image-related chat, recent years
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures
Abstract:In recent years, considerable research has been conducted on vision-language models that handle both image and text data; these models are being applied to diverse downstream tasks, such as “image-related chat,” “image recognition by instruction,” and “answering visual questions.” Vision-language models (VLMs), such as Contrastive Language-Image Pre-training (CLIP), are also high-performance image classifiers that are being developed into domain adaptation methods that can utilize language information to extend into unseen domains. However, because these VLMs embed images as a single point in a unified embedding space, there is room for improvement in the classification accuracy. Therefore, in this study, we proposed the Latent Augmentation using Regional Embedding (LARE), which embeds the image as a region in the unified embedding space learned by the VLM. By sampling the augmented image embeddings from within this latent region, LARE enables data augmentation to various unseen domains, not just to specific unseen domains. LARE achieves robust image classification for domains in and out using augmented image embeddings to fine-tune VLMs. We demonstrate that LARE outperforms previous fine-tuning models in terms of image classification accuracy on three benchmarks. We also demonstrate that LARE is a more robust and general model that is valid under multiple conditions, such as unseen domains, small amounts of data, and imbalanced data.
[CV-40] LLMs Can Check Their Own Results to Mitigate Hallucinations in Traffic Understanding Tasks
Link: https://arxiv.org/abs/2409.12580
Authors: Malsha Ashani Mahawatta Dona,Beatriz Cabrero-Daniel,Yinan Yu,Christian Berger
Keywords: Today Large Language, Large Language Models, Today Large, Large Language, simple text generation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICTSS 2024, 36th International Conference on Testing Software and Systems
Abstract:Today’s Large Language Models (LLMs) have showcased exemplary capabilities, ranging from simple text generation to advanced image processing. Such models are currently being explored for in-vehicle services such as supporting perception tasks in Advanced Driver Assistance Systems (ADAS) or Autonomous Driving (AD) systems, given the LLMs’ capabilities to process multi-modal data. However, LLMs often generate nonsensical or unfaithful information, known as “hallucinations”: a notable issue that needs to be mitigated. In this paper, we systematically explore the adoption of SelfCheckGPT to spot hallucinations by three state-of-the-art LLMs (GPT-4o, LLaVA, and Llama3) when analysing visual automotive data from two sources: Waymo Open Dataset, from the US, and PREPER CITY dataset, from Sweden. Our results show that GPT-4o is better at generating faithful image captions than LLaVA, whereas the former demonstrated leniency in mislabeling non-hallucinated content as hallucinations compared to the latter. Furthermore, the analysis of the performance metrics revealed that the dataset type (Waymo or PREPER CITY) did not significantly affect the quality of the captions or the effectiveness of hallucination detection. However, the models showed better performance rates over images captured during daytime, compared to during dawn, dusk or night. Overall, the results show that SelfCheckGPT and its adaptation can be used to filter hallucinations in generated traffic-related image captions for state-of-the-art LLMs.
[CV-41] StoryMaker: Towards Holistic Consistent Characters in Text-to-image Generation
Link: https://arxiv.org/abs/2409.12576
Authors: Zhengguang Zhou,Jing Li,Huaxia Li,Nemo Chen,Xu Tang
Keywords: Tuning-free personalized image, achieved significant success, Tuning-free personalized, maintaining facial consistency, multiple characters
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 5 figures
Abstract:Tuning-free personalized image generation methods have achieved significant success in maintaining facial consistency, i.e., identities, even with multiple characters. However, the lack of holistic consistency in scenes with multiple characters hampers these methods’ ability to create a cohesive narrative. In this paper, we introduce StoryMaker, a personalization solution that preserves not only facial consistency but also clothing, hairstyles, and body consistency, thus facilitating the creation of a story through a series of images. StoryMaker incorporates conditions based on face identities and cropped character images, which include clothing, hairstyles, and bodies. Specifically, we integrate the facial identity information with the cropped character images using the Positional-aware Perceiver Resampler (PPR) to obtain distinct character features. To prevent intermingling of multiple characters and the background, we separately constrain the cross-attention impact regions of different characters and the background using MSE loss with segmentation masks. Additionally, we train the generation network conditioned on poses to promote decoupling from poses. A LoRA is also employed to enhance fidelity and quality. Experiments underscore the effectiveness of our approach. StoryMaker supports numerous applications and is compatible with other societal plug-ins. Our source codes and model weights are available at this https URL.
[CV-42] InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning
Link: https://arxiv.org/abs/2409.12568
Authors: Xiaotian Han,Yiren Jian,Xuefeng Hu,Haogeng Liu,Yiqi Wang,Qihang Fan,Yuang Ai,Huaibo Huang,Ran He,Zhenheng Yang,Quanzeng You
Keywords: Large Language Models, Large Language, capabilities of Large, Language Models, crucial for enhancing
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:
Abstract:Pre-training on large-scale, high-quality datasets is crucial for enhancing the reasoning capabilities of Large Language Models (LLMs), especially in specialized domains such as mathematics. Despite the recognized importance, the Multimodal LLMs (MLLMs) field currently lacks a comprehensive open-source pre-training dataset specifically designed for mathematical reasoning. To address this gap, we introduce InfiMM-WebMath-40B, a high-quality dataset of interleaved image-text documents. It comprises 24 million web pages, 85 million associated image URLs, and 40 billion text tokens, all meticulously extracted and filtered from CommonCrawl. We provide a detailed overview of our data collection and processing pipeline. To demonstrate the robustness of InfiMM-WebMath-40B, we conducted evaluations in both text-only and multimodal settings. Our evaluations on text-only benchmarks show that, despite utilizing only 40 billion tokens, our dataset significantly enhances the performance of our 1.3B model, delivering results comparable to DeepSeekMath-1.3B, which uses 120 billion tokens for the same model size. Nevertheless, with the introduction of our multi-modal math pre-training dataset, our models set a new state-of-the-art among open-source models on multi-modal math benchmarks such as MathVerse and We-Math. We release our data at this https URL.
[CV-43] Improving Cone-Beam CT Image Quality with Knowledge Distillation-Enhanced Diffusion Model in Imbalanced Data Settings MICCAI2024
Link: https://arxiv.org/abs/2409.12539
Authors: Joonil Hwang,Sangjoon Park,NaHyeon Park,Seungryong Cho,Jin Sung Kim
Keywords: necessitating adaptive planning, pre-treatment computed tomography, encounter challenges due, images encounter challenges, computed tomography
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2024
Abstract:In radiation therapy (RT), the reliance on pre-treatment computed tomography (CT) images encounters challenges due to anatomical changes, necessitating adaptive planning. Daily cone-beam CT (CBCT) imaging, pivotal for therapy adjustment, falls short in tissue density accuracy. To address this, our innovative approach integrates diffusion models for CT image generation, offering precise control over data synthesis. Leveraging a self-training method with knowledge distillation, we maximize CBCT data during therapy, complemented by sparse paired fan-beam CTs. This strategy, incorporated into state-of-the-art diffusion-based models, surpasses conventional methods like Pix2pix and CycleGAN. A meticulously curated dataset of 2800 paired CBCT and CT scans, supplemented by 4200 CBCT scans, undergoes preprocessing and teacher model training, including the Brownian Bridge Diffusion Model (BBDM). Pseudo-label CT images are generated, resulting in a dataset combining 5600 CT images with corresponding CBCT images. Thorough evaluation using MSE, SSIM, PSNR and LPIPS demonstrates superior performance against Pix2pix and CycleGAN. Our approach shows promise in generating high-quality CT images from CBCT scans in RT.
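Two of the metrics named in the abstract above, MSE and PSNR, have simple closed forms. The sketch below computes both for images given as flat lists of pixel intensities; the function names, the assumed intensity range of [0, 1], and the toy pixel values are illustrative assumptions and are not taken from the paper's evaluation code.

```python
import math


def mse(reference, generated):
    """Mean squared error between two equally sized flat pixel lists."""
    return sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)


def psnr(reference, generated, max_value=1.0):
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    err = mse(reference, generated)
    if err == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / err)


# Hypothetical two-pixel "images" with intensities in [0, 1].
ct = [0.0, 1.0]
synthetic_ct = [0.0, 0.5]
print(mse(ct, synthetic_ct), psnr(ct, synthetic_ct))
```

Lower MSE and higher PSNR both indicate a synthetic image closer to the reference; SSIM and LPIPS capture structural and perceptual similarity and need more machinery than fits here.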
[CV-44] Deep Probability Segmentation: Are segmentation models probability estimators?
Link: https://arxiv.org/abs/2409.12535
Authors: Simone Fassio,Simone Monaco,Daniele Apiletti
Keywords: Deep learning, enabling highly accurate, learning has revolutionized, revolutionized various fields, fields by enabling
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Journal reference: IEEE AICT2024 Conference Proceedings
Abstract:Deep learning has revolutionized various fields by enabling highly accurate predictions and estimates. One important application is probabilistic prediction, where models estimate the probability of events rather than deterministic outcomes. This approach is particularly relevant, yet still unexplored, for segmentation tasks where each pixel in an image needs to be classified. Conventional models often overlook the probabilistic nature of labels, but accurate uncertainty estimation is crucial for improving the reliability and applicability of models. In this study, we applied Calibrated Probability Estimation (CaPE) to segmentation tasks to evaluate its impact on model calibration. Our results indicate that while CaPE improves calibration, its effect is less pronounced compared to classification tasks, suggesting that segmentation models can inherently provide better probability estimates. We also investigated the influence of dataset size and bin optimization on the effectiveness of calibration. Our results emphasize the expressive power of segmentation models as probability estimators and incorporate probabilistic reasoning, which is crucial for applications requiring precise uncertainty quantification.
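The abstract above is about how well predicted probabilities match observed frequencies, and it mentions the role of binning. The following is a minimal sketch of a standard binned calibration metric, the expected calibration error, applied to per-pixel foreground probabilities. It is not the CaPE method; the function name, the equal-width binning, and the toy inputs are assumptions made for illustration.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned gap between mean predicted probability and observed positive rate.

    probs  : predicted foreground probability per pixel, each in [0, 1]
    labels : ground-truth label per pixel, 0 or 1
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Equal-width bins; p == 1.0 falls into the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((p, y))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_prob - positive_rate)
    return ece


# Hypothetical pixels: predictions of 0.8 that are right 4 times out of 5.
print(expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0]))
```

A perfectly calibrated model scores 0; the number of bins changes the estimate, which is one reason bin selection matters in calibration studies.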
[CV-45] Denoising Reuse: Exploiting Inter-frame Motion Consistency for Efficient Video Latent Generation
Link: https://arxiv.org/abs/2409.12532
Authors: Chenyu Wang,Shuo Yan,Yixuan Chen,Yujiang Wang,Mingzhi Dong,Xiaochen Yang,Dongsheng Li,Robert P. Dick,Qin Lv,Fan Yang,Tun Lu,Ning Gu,Li Shang
Keywords: iterative diffusion process, computational costs due, costs due, Video generation, Video
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Video generation using diffusion-based models is constrained by high computational costs due to the frame-wise iterative diffusion process. This work presents a Diffusion Reuse MOtion (Dr. Mo) network to accelerate latent video generation. Our key discovery is that coarse-grained noises in earlier denoising steps have demonstrated high motion consistency across consecutive video frames. Following this observation, Dr. Mo propagates those coarse-grained noises onto the next frame by incorporating carefully designed, lightweight inter-frame motions, eliminating massive computational redundancy in frame-wise diffusion models. The more sensitive and fine-grained noises are still acquired via later denoising steps, which can be essential to retain visual qualities. As such, deciding which intermediate steps should switch from motion-based propagations to denoising can be a crucial problem and a key tradeoff between efficiency and quality. Dr. Mo employs a meta-network named Denoising Step Selector (DSS) to dynamically determine desirable intermediate steps across video frames. Extensive evaluations on video generation and editing tasks have shown that Dr. Mo can substantially accelerate diffusion models in video tasks with improved visual qualities.
[CV-46] Prompting Segment Anything Model with Domain-Adaptive Prototype for Generalizable Medical Image Segmentation MICCAI2024
Link: https://arxiv.org/abs/2409.12522
Authors: Zhikai Wei,Wenhui Dong,Peilin Zhou,Yuliang Gu,Zhou Zhao,Yongchao Xu
Keywords: Deep learning based, learning based methods, Deep learning, performance degradation caused, learning based
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by the 27th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2024)
Abstract:Deep learning based methods often suffer from performance degradation caused by domain shift. In recent years, many sophisticated network structures have been designed to tackle this problem. However, the advent of large models trained on massive data, with their exceptional segmentation capability, introduces a new perspective for solving medical segmentation problems. In this paper, we propose a novel Domain-Adaptive Prompt framework for fine-tuning the Segment Anything Model (termed as DAPSAM) to address single-source domain generalization (SDG) in segmenting medical images. DAPSAM not only utilizes a more generalization-friendly adapter to fine-tune the large model, but also introduces a self-learning prototype-based prompt generator to enhance the model’s generalization ability. Specifically, we first merge the important low-level features into intermediate features before feeding to each adapter, followed by an attention filter to remove redundant information. This yields more robust image embeddings. Then, we propose using a learnable memory bank to construct domain-adaptive prototypes for prompt generation, helping to achieve generalizable medical image segmentation. Extensive experimental results demonstrate that our DAPSAM achieves state-of-the-art performance on two SDG medical image segmentation tasks with different modalities. The code is available at this https URL.
[CV-47] TinyVLA: Towards Fast Data-Efficient Vision-Language-Action Models for Robotic Manipulation
Link: https://arxiv.org/abs/2409.12514
Authors: Junjie Wen,Yichen Zhu,Jinming Li,Minjie Zhu,Kun Wu,Zhiyuan Xu,Ran Cheng,Chaomin Shen,Yaxin Peng,Feifei Feng,Jian Tang
Keywords: shown remarkable potential, shown remarkable, remarkable potential, potential in visuomotor, visuomotor control
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor control and instruction comprehension through end-to-end learning processes. However, current VLA models face significant challenges: they are slow during inference and require extensive pre-training on large amounts of robotic data, making real-world deployment difficult. In this paper, we introduce a new family of compact vision-language-action models, called TinyVLA, which offers two key advantages over existing VLA models: (1) faster inference speeds, and (2) improved data efficiency, eliminating the need for a pre-training stage. Our framework incorporates two essential components to build TinyVLA: (1) initializing the policy backbone with robust, high-speed multimodal models, and (2) integrating a diffusion policy decoder during fine-tuning to enable precise robot actions. We conducted extensive evaluations of TinyVLA in both simulation and on real robots, demonstrating that our approach significantly outperforms the state-of-the-art VLA model, OpenVLA, in terms of speed and data efficiency, while delivering comparable or superior performance. Additionally, TinyVLA exhibits strong generalization capabilities across various dimensions, including language instructions, novel objects, unseen positions, changes in object appearance, background variations, and environmental shifts, often matching or exceeding the performance of OpenVLA. We believe that TinyVLA offers an interesting perspective on utilizing pre-trained multimodal models for policy learning. Our project is at this https URL.
[CV-48] Towards Low-latency Event-based Visual Recognition with Hybrid Step-wise Distillation Spiking Neural Networks
Link: https://arxiv.org/abs/2409.12507
Authors: Xian Zhong,Shengwang Hu,Wenxuan Liu,Wenxin Huang,Jianhao Ding,Zhaofei Yu,Tiejun Huang
Keywords: Spiking neural networks, high biological interpretability, garnered significant attention, Spiking neural, low power consumption
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Spiking neural networks (SNNs) have garnered significant attention for their low power consumption and high biological interpretability. Their rich spatio-temporal information processing capability and event-driven nature make them well-suited for neuromorphic datasets. However, current SNNs struggle to balance accuracy and latency in classifying these datasets. In this paper, we propose the Hybrid Step-wise Distillation (HSD) method, tailored for neuromorphic datasets, to mitigate the notable decline in performance at lower time steps. Our work disentangles the dependency between the number of event frames and the time steps of SNNs, utilizing more event frames during the training stage to improve performance, while using fewer event frames during the inference stage to reduce latency. Nevertheless, the average output of SNNs across all time steps is susceptible to individual time steps with abnormal outputs, particularly at extremely low time steps. To tackle this issue, we implement a Step-wise Knowledge Distillation (SKD) module that considers variations in the output distribution of SNNs at each time step. Empirical evidence demonstrates that our method yields competitive performance in classification tasks on neuromorphic datasets, especially at lower time steps. Our code will be available at: this https URL.
[CV-49] End-to-end Open-vocabulary Video Visual Relationship Detection using Multi-modal Prompting
Link: https://arxiv.org/abs/2409.12499
Authors: Yongqi Wang,Shuo Yang,Xinxiao Wu,Jiebo Luo
Keywords: expand video visual, video visual relationship, detecting unseen relationships, expand video, Open-vocabulary video visual
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Open-vocabulary video visual relationship detection aims to expand video visual relationship detection beyond annotated categories by detecting unseen relationships between both seen and unseen objects in videos. Existing methods usually use trajectory detectors trained on closed datasets to detect object trajectories, and then feed these trajectories into large-scale pre-trained vision-language models to achieve open-vocabulary classification. Such heavy dependence on the pre-trained trajectory detectors limits their ability to generalize to novel object categories, leading to performance degradation. To address this challenge, we propose to unify object trajectory detection and relationship classification into an end-to-end open-vocabulary framework. Under this framework, we propose a relationship-aware open-vocabulary trajectory detector. It primarily consists of a query-based Transformer decoder, where the visual encoder of CLIP is distilled for frame-wise open-vocabulary object detection, and a trajectory associator. To exploit relationship context during trajectory detection, a relationship query is embedded into the Transformer decoder, and accordingly, an auxiliary relationship loss is designed to enable the decoder to perceive the relationships between objects explicitly. Moreover, we propose an open-vocabulary relationship classifier that leverages the rich semantic knowledge of CLIP to discover novel relationships. To adapt CLIP well to relationship classification, we design a multi-modal prompting method that employs spatio-temporal visual prompting for visual representation and vision-guided language prompting for language input. Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate the effectiveness of our framework. Our framework is also applied to a more difficult cross-dataset scenario to further demonstrate its generalization ability.
[CV-50] Learning Multi-Manifold Embedding for Out-Of-Distribution Detection ECCV2024
Link: https://arxiv.org/abs/2409.12479
Authors: Jeng-Lin Li,Ming-Ching Chang,Wei-Chao Chen
Keywords: OOD, real-world applications, crucial for trustworthy, Detecting, OOD samples
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: European Conference on Computer Vision ECCV 2024 BEW Workshop Best Paper
Abstract:Detecting out-of-distribution (OOD) samples is crucial for trustworthy AI in real-world applications. Leveraging recent advances in representation learning and latent embeddings, various scoring algorithms estimate distributions beyond the training data. However, a single embedding space falls short in characterizing in-distribution data and defending against diverse OOD conditions. This paper introduces a novel Multi-Manifold Embedding Learning (MMEL) framework, optimizing hypersphere and hyperbolic spaces jointly for enhanced OOD detection. MMEL generates representative embeddings and employs a prototype-aware scoring function to differentiate OOD samples. It operates with very few OOD samples and requires no model retraining. Experiments on six open datasets demonstrate MMEL’s significant reduction in FPR while maintaining a high AUC compared to state-of-the-art distance-based OOD detection methods. We analyze the effects of learning multiple manifolds and visualize OOD score distributions across datasets. Notably, enrolling ten OOD samples without retraining achieves comparable FPR and AUC to modern outlier exposure methods using 80 million outlier samples for model training.
[CV-51] Reference Dataset and Benchmark for Reconstructing Laser Parameters from On-axis Video in Powder Bed Fusion of Bulk Stainless Steel
Link: https://arxiv.org/abs/2409.12475
Authors: Cyril Blanc,Ayyoub Ahar,Kurt De Grave
Keywords: powder bed fusion, stainless steel bulk, steel bulk material, laser dot speed, FPS video
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Dataset download: this https URL
Abstract:We present RAISE-LPBF, a large dataset on the effect of laser power and laser dot speed in powder bed fusion (LPBF) of 316L stainless steel bulk material, monitored by on-axis 20k FPS video. Both process parameters are independently sampled for each scan line from a continuous distribution, so interactions of different parameter choices can be investigated. The data can be used to derive statistical properties of LPBF, as well as to build anomaly detectors. We provide example source code for loading the data, baseline machine learning models and results, and a public benchmark to evaluate predictive models.
[CV-52] HSIGene: A Foundation Model For Hyperspectral Image Generation
Link: https://arxiv.org/abs/2409.12470
Authors: Li Pang,Datao Tang,Shuang Xu,Deyu Meng,Xiangyong Cao
Keywords: plays a vital, environmental monitoring, vital role, agriculture and environmental, HSI
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Abstract:Hyperspectral image (HSI) plays a vital role in various fields such as agriculture and environmental monitoring. However, due to the expensive acquisition cost, the number of hyperspectral images is limited, degenerating the performance of downstream tasks. Although some recent studies have attempted to employ diffusion models to synthesize HSIs, they still struggle with the scarcity of HSIs, affecting the reliability and diversity of the generated images. Some studies propose to incorporate multi-modal data to enhance spatial diversity, but the spectral fidelity cannot be ensured. In addition, existing HSI synthesis models are typically uncontrollable or only support single-condition control, limiting their ability to generate accurate and reliable HSIs. To alleviate these issues, we propose HSIGene, a novel HSI generation foundation model which is based on latent diffusion and supports multi-condition control, allowing for more precise and reliable HSI generation. To enhance the spatial diversity of the training data while preserving spectral fidelity, we propose a new data augmentation method based on spatial super-resolution, in which HSIs are upscaled first, and thus abundant training patches could be obtained by cropping the high-resolution HSIs. In addition, to improve the perceptual quality of the augmented data, we introduce a novel two-stage HSI super-resolution framework, which first applies RGB bands super-resolution and then utilizes our proposed Rectangular Guided Attention Network (RGAN) for guided HSI super-resolution. Experiments demonstrate that the proposed model is capable of generating a vast quantity of realistic HSIs for downstream tasks such as denoising and super-resolution. The code and models are available at this https URL.
[CV-53] SurgPLAN++: Universal Surgical Phase Localization Network for Online and Offline Inference
Link: https://arxiv.org/abs/2409.12467
Authors: Zhen Chen,Xingjian Luo,Jinlin Wu,Long Bai,Zhen Lei,Hongliang Ren,Sebastien Ourselin,Hongbin Liu
Keywords: Surgical phase recognition, phase recognition, Surgical phase, phase, Surgical
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:Surgical phase recognition is critical for assisting surgeons in understanding surgical videos. Existing studies focused more on online surgical phase recognition, by leveraging preceding frames to predict the current frame. Despite great progress, they formulated the task as a series of frame-wise classification, which resulted in a lack of global context of the entire procedure and incoherent predictions. Moreover, besides online analysis, accurate offline surgical phase recognition is also in significant clinical need for retrospective analysis, and existing online algorithms do not fully analyze the entire video, thereby limiting accuracy in offline analysis. To overcome these challenges and enhance both online and offline inference capabilities, we propose a universal Surgical Phase Localization Network, named SurgPLAN++, with the principle of temporal detection. To ensure a global understanding of the surgical procedure, we devise a phase localization strategy for SurgPLAN++ to predict phase segments across the entire video through phase proposals. For online analysis, to generate high-quality phase proposals, SurgPLAN++ incorporates a data augmentation strategy to extend the streaming video into a pseudo-complete video through mirroring, center-duplication, and down-sampling. For offline analysis, SurgPLAN++ capitalizes on its global phase prediction framework to continuously refine preceding predictions during each online inference step, thereby significantly improving the accuracy of phase recognition. We perform extensive experiments to validate the effectiveness, and our SurgPLAN++ achieves remarkable performance in both online and offline modes, which outperforms state-of-the-art methods. The source code is available at this https URL.
[CV-54] Bayesian-Optimized One-Step Diffusion Model with Knowledge Distillation for Real-Time 3D Human Motion Prediction
Link: https://arxiv.org/abs/2409.12456
Authors: Sibo Tian,Minghui Zheng,Xiao Liang
Keywords: close collaboration scenarios, human workers based, Human motion prediction, past motion cues, human workers
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Abstract:Human motion prediction is a cornerstone of human-robot collaboration (HRC), as robots need to infer the future movements of human workers based on past motion cues to proactively plan their motion, ensuring safety in close collaboration scenarios. The diffusion model has demonstrated remarkable performance in predicting high-quality motion samples with reasonable diversity, but suffers from a slow generative process which necessitates multiple model evaluations, hindering real-world applications. To enable real-time prediction, in this work, we propose training a one-step multi-layer perceptron-based (MLP-based) diffusion model for motion prediction using knowledge distillation and Bayesian optimization. Our method contains two steps. First, we distill a pretrained diffusion-based motion predictor, TransFusion, directly into a one-step diffusion model with the same denoiser architecture. Then, to further reduce the inference time, we remove the computationally expensive components from the original denoiser and use knowledge distillation once again to distill the obtained one-step diffusion model into an even smaller model based solely on MLPs. Bayesian optimization is used to tune the hyperparameters for training the smaller diffusion model. Extensive experimental studies are conducted on benchmark datasets, and our model can significantly improve the inference speed, achieving real-time prediction without noticeable degradation in performance.
[CV-55] Domain Generalization for Endoscopic Image Segmentation by Disentangling Style-Content Information and SuperPixel Consistency
Link: https://arxiv.org/abs/2409.12450
Authors: Mansoor Ali Teevno,Rafael Martinez-Garcia-Pena,Gilberto Ochoa-Ruiz,Sharib Ali
Keywords: stratify individuals based, Frequent monitoring, cancer precursors, developing gastrointestinal, stratify individuals
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract:Frequent monitoring is necessary to stratify individuals based on their likelihood of developing gastrointestinal (GI) cancer precursors. In clinical practice, white-light imaging (WLI) and complementary modalities such as narrow-band imaging (NBI) and fluorescence imaging are used to assess risk areas. However, conventional deep learning (DL) models show degraded performance due to the domain gap when a model is trained on one modality and tested on a different one. In our earlier approach, we used a superpixel-based method referred to as “SUPRA” to effectively learn domain-invariant information using color and space distances to generate groups of pixels. One of the main limitations of this earlier work is that the aggregation does not exploit structural information, making it suboptimal for segmentation tasks, especially for polyps and heterogeneous color distributions. Therefore, in this work, we propose an approach for style-content disentanglement using instance normalization and instance selective whitening (ISW) for improved domain generalization when combined with SUPRA. We evaluate our approach on two datasets: EndoUDA Barrett’s Esophagus and EndoUDA polyps, and compare its performance with three state-of-the-art (SOTA) methods. Our findings demonstrate a notable enhancement in performance compared to both baseline and SOTA methods across the target domain data. Specifically, our approach exhibited improvements of 14%, 10%, 8%, and 18% over the baseline and three SOTA methods on the polyp dataset. Additionally, it surpassed the second-best method (EndoUDA) on the Barrett’s Esophagus dataset by nearly 2%.
[CV-56] Infrared Small Target Detection in Satellite Videos: A New Dataset and A Novel Recurrent Feature Refinement Framework
Link: https://arxiv.org/abs/2409.12448
Authors: Xinyi Ying,Li Liu,Zaipin Lin,Yangsi Shi,Yingqian Wang,Ruojing Li,Xu Cao,Boyang Li,Shilin Zhou
Keywords: Multi-frame infrared small, high false alarms, highly complex clutters, complex clutters noises, extremely small target
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Multi-frame infrared small target (MIRST) detection in satellite videos is a long-standing, fundamental, yet challenging task, and its challenges can be summarized as follows. First, extremely small target sizes, highly complex clutter and noise, and varied satellite motions result in limited feature representation, high false alarms, and difficult motion analyses. Second, the lack of a large-scale publicly available MIRST dataset for satellite videos greatly hinders algorithm development. To address these challenges, in this paper, we first build a large-scale dataset for MIRST detection in satellite videos (namely IRSatVideo-LEO), and then develop a recurrent feature refinement (RFR) framework as the baseline method. Specifically, IRSatVideo-LEO is a semi-simulated dataset with synthesized satellite motion, target appearance, trajectory and intensity, which can provide a standard toolbox for satellite video generation and a reliable evaluation platform to facilitate algorithm development. For the baseline method, RFR is designed to be combined with existing powerful CNN-based methods for long-term temporal dependency exploitation and integrated motion compensation and MIRST detection. Specifically, a pyramid deformable alignment (PDA) module and a temporal-spatial-frequency modulation (TSFM) module are proposed to achieve effective and efficient feature alignment, propagation, aggregation and refinement. Extensive experiments have been conducted to demonstrate the effectiveness and superiority of our scheme. The comparative results show that ResUNet equipped with RFR outperforms the state-of-the-art MIRST detection methods. Dataset and code are released at this https URL.
[CV-57] FlexiTex: Enhancing Texture Generation with Visual Guidance
Link: https://arxiv.org/abs/2409.12431
Authors: DaDong Jiang,Xianghui Yang,Zibo Zhao,Sheng Zhang,Jiaao Yu,Zeqiang Lai,Shaoxiong Yang,Chunchao Guo,Xiaobo Zhou,Zhihui Ke
Keywords: Recent texture generation, powerful generative prior, methods achieve impressive, texture generation methods, generation methods achieve
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL
Abstract:Recent texture generation methods achieve impressive results due to the powerful generative prior they leverage from large-scale text-to-image diffusion models. However, abstract textual prompts are limited in providing global textural or shape information, which results in the texture generation methods producing blurry or inconsistent patterns. To tackle this, we present FlexiTex, embedding rich information via visual guidance to generate a high-quality texture. The core of FlexiTex is the Visual Guidance Enhancement module, which incorporates more specific information from visual guidance to reduce ambiguity in the text prompt and preserve high-frequency details. To further enhance the visual guidance, we introduce a Direction-Aware Adaptation module that automatically designs direction prompts based on different camera poses, avoiding the Janus problem and maintaining semantically global consistency. Benefiting from the visual guidance, FlexiTex produces quantitatively and qualitatively sound results, demonstrating its potential to advance texture generation for real-world applications.
[CV-58] Frequency-Guided Spatial Adaptation for Camouflaged Object Detection
Link: https://arxiv.org/abs/2409.12421
Authors: Shizhou Zhang,Dexuan Kong,Yinghui Xing,Yue Lu,Lingyan Ran,Guoqiang Liang,Hexu Wang,Yanning Zhang
Keywords: segment camouflaged objects, Camouflaged object detection, surrounding environment, Camouflaged object, exhibit very similar
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The paper has been accepted for publication as a regular paper in the IEEE Transactions on Multimedia
Abstract:Camouflaged object detection (COD) aims to segment camouflaged objects that exhibit patterns very similar to the surrounding environment. Recent research has shown that enhancing the feature representation via frequency information can greatly alleviate the ambiguity between foreground objects and the background. With the emergence of vision foundation models, such as InternImage and the Segment Anything Model, adapting a pretrained model to COD tasks with a lightweight adapter module has become a novel and promising research direction. Existing adapter modules mainly focus on feature adaptation in the spatial domain. In this paper, we propose a novel frequency-guided spatial adaptation method for the COD task. Specifically, we transform the input features of the adapter into the frequency domain. By grouping and interacting with frequency components located within non-overlapping circles in the spectrogram, different frequency components are dynamically enhanced or weakened, so that the intensity of image details and contour features is adaptively adjusted. At the same time, the features that are conducive to distinguishing object and background are highlighted, indirectly implying the position and shape of the camouflaged object. We conduct extensive experiments on four widely adopted benchmark datasets, and the proposed method outperforms 26 state-of-the-art methods by large margins. Code will be released.
[CV-59] Domain-stratified Training for Cross-organ and Cross-scanner Adenocarcinoma Segmentation in the COSAS 2024 Challenge
Link: https://arxiv.org/abs/2409.12418
Authors: Huang Jiayan,Ji Zheng,Kuang Jinbo,Xu Shuoyu
Keywords: Cross-Scanner Adenocarcinoma Segmentation, image segmentation algorithm, segmentation algorithm developed, Cross-Scanner Adenocarcinoma, Adenocarcinoma Segmentation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:This manuscript presents an image segmentation algorithm developed for the Cross-Organ and Cross-Scanner Adenocarcinoma Segmentation (COSAS 2024) challenge. We adopted an organ-stratified and scanner-stratified approach to train multiple Upernet-based segmentation models and subsequently ensembled the results. Despite the challenges posed by the varying tumor characteristics across different organs and the differing imaging conditions of various scanners, our method achieved a final test score of 0.7643 for Task 1 and 0.8354 for Task 2. These results demonstrate the adaptability and efficacy of our approach across diverse conditions. Our model’s ability to generalize across various datasets underscores its potential for real-world applications.
[CV-60] How to predict on-road air pollution based on street view images and machine learning: a quantitative analysis of the optimal strategy
Link: https://arxiv.org/abs/2409.12412
Authors: Hui Zhong,Di Chen,Pengqin Wang,Wenrui Wang,Shaojie Shen,Yonghong Liu,Meixin Zhu
Keywords: On-road air pollution, exhibits substantial variability, pollution exhibits substantial, air pollution exhibits, short distances due
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Abstract:On-road air pollution exhibits substantial variability over short distances due to emission sources, dilution, and physicochemical processes. Integrating mobile monitoring data with street view images (SVIs) holds promise for predicting local air pollution. However, algorithms, sampling strategies, and image quality introduce extra errors due to a lack of reliable references that quantify their effects. To bridge this gap, we employed 314 taxis to monitor NO, NO2, PM2.5 and PM10 dynamically and sampled corresponding SVIs, aiming to develop a reliable strategy. We extracted SVI features from ~382,000 streetscape images, which were collected at various angles (0°, 90°, 180°, 270°) and ranges (buffers with radii of 100m, 200m, 300m, 400m, 500m). In addition, three machine learning algorithms were tested alongside the linear land-use regression (LUR) model to explore the influence of different algorithms. Four typical image quality issues were identified and discussed. Generally, machine learning methods outperform linear LUR for estimating the four pollutants, with the ranking: random forest > XGBoost > neural network > LUR. Compared to single-angle sampling, the averaging strategy is an effective method to avoid bias from insufficient feature capture. Therefore, the optimal sampling strategy is to obtain SVIs at a 100m radius buffer and extract features using the averaging strategy. This approach achieved estimation results for each aggregation location with absolute errors almost less than 2.5 μg/m^2 or ppb. Overexposure, blur, and underexposure led to image misjudgments and incorrect identifications, causing an overestimation of road features and underestimation of human-activity features, contributing to inaccurate NO, NO2, PM2.5 and PM10 estimation.
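The averaging strategy mentioned in this abstract can be pictured with a small sketch: features extracted from the four viewing angles at one location are averaged element-wise before being passed to a regressor. The abstract does not describe the feature layout, so the function name `average_view_features` and the dictionary format (angle in degrees mapped to a feature vector) are assumptions made for illustration.

```python
from typing import Dict, List


def average_view_features(features_by_angle: Dict[int, List[float]]) -> List[float]:
    """Element-wise mean of per-angle feature vectors for one location.

    Assumption: every angle (e.g. 0, 90, 180, 270 degrees) provides a
    feature vector of the same length, such as per-class pixel ratios.
    """
    if not features_by_angle:
        raise ValueError("at least one view is required")
    vectors = list(features_by_angle.values())
    length = len(vectors[0])
    if any(len(v) != length for v in vectors):
        raise ValueError("all views must have the same number of features")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(length)]
```

For example, four views with hypothetical road and greenery ratios of `[0.2, 0.4]`, `[0.4, 0.0]`, `[0.6, 0.4]` and `[0.8, 0.0]` average to roughly `[0.5, 0.2]`, so a single unrepresentative angle has less influence on the location's features.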
[CV-61] LMT-Net: Lane Model Transformer Network for Automated HD Mapping from Sparse Vehicle Observations ITSC2024
Link: https://arxiv.org/abs/2409.12409
Authors: Michael Mink,Thomas Monninger,Steffen Staab
Keywords: High Definition, complete lane model, autonomous driving, range and occlusions, lane model
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Accepted for 2024 IEEE International Conference on Intelligent Transportation Systems (ITSC 2024)
Abstract:In autonomous driving, High Definition (HD) maps provide a complete lane model that is not limited by sensor range and occlusions. However, the generation and upkeep of HD maps involves periodic data collection and human annotations, limiting scalability. To address this, we investigate automating the lane model generation and the use of sparse vehicle observations instead of dense sensor measurements. For our approach, a pre-processing step generates polylines by aligning and aggregating observed lane boundaries. Aligned driven traces are used as starting points for predicting lane pairs defined by the left and right boundary points. We propose Lane Model Transformer Network (LMT-Net), an encoder-decoder neural network architecture that performs polyline encoding and predicts lane pairs and their connectivity. A lane graph is formed by using predicted lane pairs as nodes and predicted lane connectivity as edges. We evaluate the performance of LMT-Net on an internal dataset that consists of multiple vehicle observations as well as human annotations as Ground Truth (GT). The evaluation shows promising results and demonstrates superior performance compared to the implemented baseline on both highway and non-highway Operational Design Domain (ODD).
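The lane graph described in this abstract (predicted lane pairs as nodes, predicted connectivity as edges) can be sketched as a plain adjacency list. The data layout is an assumption: `LanePair`, `build_lane_graph`, the pairwise connectivity scores and the 0.5 threshold are all hypothetical, since the abstract does not describe LMT-Net's output format.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class LanePair:
    """One graph node: a left and a right lane boundary point."""
    left: Point
    right: Point


def build_lane_graph(pairs: List[LanePair],
                     connectivity: Dict[Tuple[int, int], float],
                     threshold: float = 0.5) -> Dict[int, List[int]]:
    """Build a directed adjacency list over lane-pair indices.

    Assumption: connectivity maps (source_index, target_index) to a
    predicted score in [0, 1]; an edge is kept when score >= threshold.
    """
    graph: Dict[int, List[int]] = {index: [] for index in range(len(pairs))}
    for (source, target), score in connectivity.items():
        if score >= threshold:
            graph[source].append(target)
    return graph
```

With three lane pairs and scores `{(0, 1): 0.9, (1, 2): 0.2}`, only the edge from node 0 to node 1 survives the threshold. Downstream consumers can then walk the adjacency list to follow lane successors.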
[CV-62] ITPatch: An Invisible and Triggered Physical Adversarial Patch against Traffic Sign Recognition
Link: https://arxiv.org/abs/2409.12394
Authors: Shuai Yuan,Hongwei Li,Xingshuo Han,Guowen Xu,Wenbo Jiang,Tao Ni,Qingchuan Zhao,Yuguang Fang
Keywords: Physical adversarial patches, real world, key adversarial attack, existing adversarial patches, adversarial patches
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract:Physical adversarial patches have emerged as a key adversarial attack to cause misclassification of traffic sign recognition (TSR) systems in the real world. However, existing adversarial patches have poor stealthiness and attack all vehicles indiscriminately once deployed. In this paper, we introduce an invisible and triggered physical adversarial patch (ITPatch) with a novel attack vector, i.e., fluorescent ink, to advance the state-of-the-art. By applying carefully designed fluorescent perturbations to a target sign, an attacker can later trigger a fluorescent effect using invisible ultraviolet light, causing the TSR system to misclassify the sign and potentially resulting in traffic accidents. We conducted a comprehensive evaluation to investigate the effectiveness of ITPatch, which shows a success rate of 98.31% in low-light conditions. Furthermore, our attack successfully bypasses five popular defenses and achieves a success rate of 96.72%.
[CV-63] A Novel Perspective for Multi-modal Multi-label Skin Lesion Classification WACV2025
Link: https://arxiv.org/abs/2409.12390
Authors: Yuan Zhang,Yutong Xie,Hu Wang,Jodie C Avery,M Louise Hull,Gustavo Carneiro
Keywords: learning-based Computer-Aided Diagnosis, deep learning-based Computer-Aided, skin diseases relies, Computer-Aided Diagnosis, analyzing multiple data
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by WACV2025
Abstract:The efficacy of deep learning-based Computer-Aided Diagnosis (CAD) methods for skin diseases relies on analyzing multiple data modalities (i.e., clinical+dermoscopic images, and patient metadata) and addressing the challenges of multi-label classification. Current approaches tend to rely on limited multi-modal techniques and treat the multi-label problem as a multiple multi-class problem, overlooking issues related to imbalanced learning and multi-label correlation. This paper introduces the innovative Skin Lesion Classifier, utilizing a Multi-modal Multi-label TransFormer-based model (SkinM2Former). For multi-modal analysis, we introduce the Tri-Modal Cross-attention Transformer (TMCT) that fuses the three image and metadata modalities at various feature levels of a transformer encoder. For multi-label classification, we introduce a multi-head attention (MHA) module to learn multi-label correlations, complemented by an optimisation that handles multi-label and imbalanced learning problems. SkinM2Former achieves a mean average accuracy of 77.27% and a mean diagnostic accuracy of 77.85% on the public Derm7pt dataset, outperforming state-of-the-art (SOTA) methods.
[CV-64] Look Through Masks: Towards Masked Face Recognition with De-Occlusion Distillation ACM-MM2020
Link: https://arxiv.org/abs/2409.12385
Authors: Chenyu Li,Shiming Ge,Daichi Zhang,Jia Li
Keywords: real-world applications today, masked face recognition, ambiguous representation, drop in accuracy, real-world applications
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by ACM MM 2020
Abstract:Many real-world applications today like video surveillance and urban governance need to address the recognition of masked faces, where content replacement by diverse masks often brings in incomplete appearance and ambiguous representation, leading to a sharp drop in accuracy. Inspired by recent progress on amodal perception, we propose to migrate the mechanism of amodal completion for the task of masked face recognition with an end-to-end de-occlusion distillation framework, which consists of two modules. The de-occlusion module applies a generative adversarial network to perform face completion, which recovers the content under the mask and eliminates appearance ambiguity. The distillation module takes a pre-trained general face recognition model as the teacher and transfers its knowledge to train a student for completed faces using massive online synthesized face pairs. In particular, the teacher knowledge is represented with structural relations among instances in multiple orders, which serves as a posterior regularization to enable the adaptation. In this way, the knowledge can be fully distilled and transferred to identify masked faces. Experiments on synthetic and realistic datasets show the efficacy of the proposed approach.
[CV-65] Privacy-Preserving Student Learning with Differentially Private Data-Free Distillation
Link: https://arxiv.org/abs/2409.12384
Authors: Bochao Liu,Jianghu Lu,Pengju Wang,Junjie Zhang,Dan Zeng,Zhenxing Qian,Shiming Ge
Keywords: Deep learning models, achieve high inference, high inference accuracy, extracting rich knowledge, Deep learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published by IEEE MMSP 2022
Abstract:Deep learning models can achieve high inference accuracy by extracting rich knowledge from massive well-annotated data, but may pose the risk of data privacy leakage in practical deployment. In this paper, we present an effective teacher-student learning approach to train privacy-preserving deep learning models via differentially private data-free distillation. The main idea is generating synthetic data to learn a student that can mimic the ability of a teacher well-trained on private data. In the approach, a generator is first pretrained in a data-free manner by incorporating the teacher as a fixed discriminator. With the generator, massive synthetic data can be generated for model training without exposing data privacy. Then, the synthetic data is fed into the teacher to generate private labels. Towards this end, we propose a label differential privacy algorithm termed selective randomized response to protect the label information. Finally, a student is trained on the synthetic data with the supervision of private labels. In this way, both data privacy and label privacy are well protected in a unified framework, leading to privacy-preserving models. Extensive experiments and analysis clearly demonstrate the effectiveness of our approach.
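As background for the label-privacy step, the sketch below shows the classic k-ary randomized response mechanism, which satisfies epsilon-label differential privacy. It is not the paper's "selective randomized response" algorithm, whose details the abstract does not give; the function names `keep_probability` and `randomized_response` are hypothetical.

```python
import math
import random


def keep_probability(epsilon: float, num_classes: int) -> float:
    """Probability of reporting the true label under k-ary randomized response."""
    return math.exp(epsilon) / (math.exp(epsilon) + num_classes - 1)


def randomized_response(label: int, num_classes: int, epsilon: float,
                        rng: random.Random) -> int:
    """Return the true label with probability e^eps / (e^eps + k - 1),
    otherwise a uniformly random different label.

    Assumption: labels are integers in [0, num_classes) and num_classes >= 2.
    """
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    if rng.random() < keep_probability(epsilon, num_classes):
        return label
    # Draw uniformly from the other k - 1 labels by skipping the true one.
    other = rng.randrange(num_classes - 1)
    return other if other < label else other + 1
```

A smaller `epsilon` gives stronger privacy but noisier labels: at `epsilon = 0` the true label is kept with probability `1 / num_classes`, which makes the output uniform over all labels. In a teacher-student setup of the kind described, the teacher's predicted label would pass through such a mechanism before supervising the student.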
[CV-66] Enhancing 3D Robotic Vision Robustness by Minimizing Adversarial Mutual Information through a Curriculum Training Approach
Link: https://arxiv.org/abs/2409.12379
Authors: Nastaran Darabi,Dinithi Jayasuriya,Devashri Naik,Theja Tulabandhula,Amit Ranjan Trivedi
Keywords: attacks exploit vulnerabilities, carefully crafted perturbations, model decision boundaries, Adversarial attacks exploit, boundaries through small
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Robotics (cs.RO)
Abstract:Adversarial attacks exploit vulnerabilities in a model’s decision boundaries through small, carefully crafted perturbations that lead to significant mispredictions. In 3D vision, the high dimensionality and sparsity of data greatly expand the attack surface, making 3D vision particularly vulnerable for safety-critical robotics. To enhance 3D vision’s adversarial robustness, we propose a training objective that simultaneously minimizes prediction loss and mutual information (MI) under adversarial perturbations to contain the upper bound of misprediction errors. This approach simplifies handling adversarial examples compared to conventional methods, which require explicit searching and training on adversarial samples. However, minimizing prediction loss conflicts with minimizing MI, leading to reduced robustness and catastrophic forgetting. To address this, we integrate curriculum advisors in the training setup that gradually introduce adversarial objectives to balance training and prevent models from being overwhelmed by difficult cases early in the process. The advisors also enhance robustness by encouraging training on diverse MI examples through entropy regularizers. We evaluated our method on ModelNet40 and KITTI using PointNet, DGCNN, SECOND, and PointTransformers, achieving 2-5% accuracy gains on ModelNet40 and a 5-10% mAP improvement in object detection. Our code is publicly available at this https URL.
[CV-67] Advancing Cucumber Disease Detection in Agriculture through Machine Vision and Drone Technology
Link: https://arxiv.org/abs/2409.12350
Authors: Syada Tasfia Rahman,Nishat Vasker,Amir Khabbab Ahammed,Mahamudul Hasan
Keywords: machine vision, technologies to propose, unique method, drone technologies, Abstract
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages and 6 figures
Abstract:This study uses machine vision and drone technologies to propose a unique method for the diagnosis of cucumber disease in agriculture. The backbone of this research is a painstakingly curated dataset of hyperspectral photographs acquired under genuine field conditions. Unlike earlier datasets, this study included a wide variety of illness types, allowing for precise early-stage detection. The model achieves an excellent 87.5% accuracy in distinguishing eight unique cucumber illnesses after considerable data augmentation. The incorporation of drone technology for high-resolution images improves disease evaluation. This development has enormous potential for improving crop management, lowering labor costs, and increasing agricultural productivity. This research, which automates disease detection, represents a significant step toward a more efficient and sustainable agricultural future.
[CV-68] ReFu: Recursive Fusion for Exemplar-Free 3D Class-Incremental Learning
Link: https://arxiv.org/abs/2409.12326
Authors: Yi Yang,Lei Zhong,Huiping Zhuang
Keywords: Recursive Fusion model, classes while retaining, point clouds, clouds and meshes, integrate point clouds
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:We introduce a novel Recursive Fusion model, dubbed ReFu, designed to integrate point clouds and meshes for exemplar-free 3D Class-Incremental Learning, where the model learns new 3D classes while retaining knowledge of previously learned ones. Unlike existing methods that either rely on storing historical data to mitigate forgetting or focus on single data modalities, ReFu eliminates the need for exemplar storage while utilizing the complementary strengths of both point clouds and meshes. To achieve this, we introduce a recursive method which continuously accumulates knowledge by updating the regularized auto-correlation matrix. Furthermore, we propose a fusion module, featuring a Pointcloud-guided Mesh Attention Layer that learns correlations between the two modalities. This mechanism effectively integrates point cloud and mesh features, leading to more robust and stable continual learning. Experiments across various datasets demonstrate that our proposed framework outperforms existing methods in 3D class-incremental learning. Project Page: this https URL
[CV-69] Depth Estimation Based on 3D Gaussian Splatting Siamese Defocus
Link: https://arxiv.org/abs/2409.12323
Authors: Jinchang Zhang,Ningning Xu,Hao Zhang,Guoyu Lu
Keywords: Depth estimation, fundamental task, Depth, stereo depth estimation, monocular depth estimation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Depth estimation is a fundamental task in 3D geometry. While stereo depth estimation can be achieved through triangulation methods, it is not as straightforward for monocular methods, which require the integration of global and local information. The Depth from Defocus (DFD) method utilizes camera lens models and parameters to recover depth information from blurred images and has been proven to perform well. However, these methods rely on All-In-Focus (AIF) images for depth estimation, which is nearly impossible to obtain in real-world applications. To address this issue, we propose a self-supervised framework based on 3D Gaussian splatting and Siamese networks. By learning the blur levels at different focal distances of the same scene in the focal stack, the framework predicts the defocus map and Circle of Confusion (CoC) from a single defocused image, using the defocus map as input to DepthNet for monocular depth estimation. The 3D Gaussian splatting model renders defocused images using the predicted CoC, and the differences between these and the real defocused images provide additional supervision signals for the Siamese Defocus self-supervised network. This framework has been validated on both artificially synthesized and real blurred datasets. Subsequent quantitative and visualization experiments demonstrate that our proposed framework is highly effective as a DFD method.
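To make the link between depth and the Circle of Confusion (CoC) concrete, the following is a minimal sketch of the standard thin-lens CoC formula. The function name and the parameter values are assumptions for illustration; the abstract does not state which lens model or parameters the authors use.

```python
def circle_of_confusion(depth_m, focus_dist_m, focal_len_m, f_number):
    """Thin-lens CoC diameter (in metres, on the sensor) for a point at depth_m.

    Uses c = A * f * |d - s| / (d * (s - f)), where A = f / N is the aperture
    diameter, f the focal length, s the focus distance and d the object depth.
    """
    if depth_m <= 0 or focus_dist_m <= focal_len_m:
        raise ValueError("depth must be positive and focus distance must exceed focal length")
    aperture = focal_len_m / f_number
    return (aperture * focal_len_m * abs(depth_m - focus_dist_m)
            / (depth_m * (focus_dist_m - focal_len_m)))


# Hypothetical 50 mm f/2 lens focused at 2 m.
in_focus = circle_of_confusion(2.0, 2.0, 0.05, 2.0)
behind_focus = circle_of_confusion(4.0, 2.0, 0.05, 2.0)
```

A point at the focus distance has zero blur, and blur grows as the point moves away from the focal plane. Because the formula uses an absolute value, a single CoC value is compatible with one depth in front of the focal plane and one behind it.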
[CV-70] Large Language Models Are Strong Audio-Visual Speech Recognition Learners
Link: https://arxiv.org/abs/2409.12319
Authors: Umberto Cappellazzo, Minsu Kim, Honglie Chen, Pingchuan Ma, Stavros Petridis, Daniele Falavigna, Alessio Brutti, Maja Pantic
Keywords: Multimodal large language, formidable multimodal understanding, multimodal understanding capabilities, large language models, Multimodal large
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: The code will be made available at this link: this https URL
Abstract:Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities. For example, in the audio and speech domains, an LLM can be equipped with (automatic) speech recognition (ASR) abilities by just concatenating the audio tokens, computed with an audio encoder, and the text tokens to achieve state-of-the-art results. On the contrary, tasks like visual and audio-visual speech recognition (VSR/AVSR), which also exploit noise-invariant lip movement information, have received little or no attention. To bridge this gap, we propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities. It leverages pre-trained audio and video encoders to produce modality-specific tokens which, together with the text tokens, are processed by a pre-trained LLM (e.g., Llama3.1-8B) to yield the resulting response in an auto-regressive fashion. Llama-AVSR requires a small number of trainable parameters as only modality-specific projectors and LoRA modules are trained whereas the multi-modal encoders and LLM are kept frozen. We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.81% and 0.77%, respectively. To bolster our results, we investigate the key factors that underpin the effectiveness of Llama-AVSR: the choice of the pre-trained encoders and LLM, the efficient integration of LoRA modules, and the optimal performance-efficiency trade-off obtained via modality-aware compression rates.
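The results above are reported as word error rate (WER). For readers unfamiliar with the metric, here is a minimal sketch of the usual definition: the word-level edit distance between hypothesis and reference, divided by the number of reference words. The function name and the whitespace tokenization are assumptions; benchmark evaluations typically also apply text normalization that is not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("a b c d", "a c d")` is 0.25, since one deletion is needed against four reference words.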
[CV-71] A large-scale study of performance and equity of commercial remote identity verification technologies across demographics
Link: https://arxiv.org/abs/2409.12318
Authors: Kaniz Fatima, Michael Schuckers, Gerardo Cruz-Ortiz, Daqing Hou, Sandip Purnapatra, Tiffany Andrews, Ambuj Neupane, Brandeis Marshall, Stephanie Schuckers
Keywords: transactions move online, move online, types of transactions, transactions move, RIdV solutions
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:As more types of transactions move online, there is an increasing need to verify someone’s identity remotely. Remote identity verification (RIdV) technologies have emerged to fill this need. RIdV solutions typically use a smart device to validate an identity document like a driver’s license by comparing a face selfie to the face photo on the document. Recent research has been focused on ensuring that biometric systems work fairly across demographic groups. This study assesses five commercial RIdV solutions for equity across age, gender, race/ethnicity, and skin tone across 3,991 test subjects. This paper employs statistical methods to discern whether the RIdV result across demographic groups is statistically distinguishable. Two of the RIdV solutions were equitable across all demographics, while two RIdV solutions had at least one demographic that was inequitable. For example, the results for one technology had a false negative rate of 10.5% +/- 4.5% and its performance for each demographic category was within the error bounds, and, hence, were equitable. The other technologies saw either poor overall performance or inequitable performance. For one of these, participants of the race Black/African American (B/AA) as well as those with darker skin tones (Monk scale 7/8/9/10) experienced higher false rejections. Finally, one technology demonstrated more favorable but inequitable performance for the Asian American and Pacific Islander (AAPI) demographic. This study confirms that it is necessary to evaluate products across demographic groups to fully understand the performance of remote identity verification technologies.
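The abstract describes a technology as equitable when each demographic group's false negative rate falls within the error bounds around the overall rate (10.5% +/- 4.5% in its example). A minimal sketch of that check follows. The function names and the group counts are hypothetical, and the abstract does not describe the study's actual statistical procedure, so this only illustrates the stated criterion.

```python
def false_negative_rate(false_rejections, genuine_attempts):
    """Fraction of genuine attempts that were wrongly rejected."""
    if genuine_attempts <= 0:
        raise ValueError("genuine_attempts must be positive")
    return false_rejections / genuine_attempts


def groups_within_bounds(group_counts, overall_rate, margin):
    """Map each group to True if its FNR lies within overall_rate +/- margin."""
    return {
        group: abs(false_negative_rate(rejected, total) - overall_rate) <= margin
        for group, (rejected, total) in group_counts.items()
    }


# Made-up counts for illustration only; these are not data from the study.
counts = {"group_a": (9, 100), "group_b": (16, 100)}
flags = groups_within_bounds(counts, overall_rate=0.105, margin=0.045)
```

With these made-up counts, `group_a` (9% FNR) stays inside the bounds while `group_b` (16% FNR) falls outside them.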
[CV-72] Understanding Implosion in Text-to-Image Generative Models CCS2024
Link: https://arxiv.org/abs/2409.12314
Authors: Wenxin Ding, Cathy Y. Li, Shawn Shan, Ben Y. Zhao, Haitao Zheng
Keywords: Recent works show, poisoning attacks, Recent works, surprisingly vulnerable, models
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ACM CCS 2024
Abstract:Recent works show that text-to-image generative models are surprisingly vulnerable to a variety of poisoning attacks. Empirical results find that these models can be corrupted by altering associations between individual text prompts and associated visual features. Furthermore, a number of concurrent poisoning attacks can induce “model implosion,” where the model becomes unable to produce meaningful images for unpoisoned prompts. These intriguing findings highlight the absence of an intuitive framework to understand poisoning attacks on these models. In this work, we establish the first analytical framework on robustness of image generative models to poisoning attacks, by modeling and analyzing the behavior of the cross-attention mechanism in latent diffusion models. We model cross-attention training as an abstract problem of “supervised graph alignment” and formally quantify the impact of training data by the hardness of alignment, measured by an Alignment Difficulty (AD) metric. The higher the AD, the harder the alignment. We prove that AD increases with the number of individual prompts (or concepts) poisoned. As AD grows, the alignment task becomes increasingly difficult, yielding highly distorted outcomes that frequently map meaningful text prompts to undefined or meaningless visual representations. As a result, the generative model implodes and outputs random, incoherent images at large. We validate our analytical framework through extensive experiments, and we confirm and explain the unexpected (and unexplained) effect of model implosion while producing new, unforeseen insights. Our work provides a useful tool for studying poisoning attacks against diffusion models and their defenses.
[CV-73] Measuring Sound Symbolism in Audio-visual Models
Link: https://arxiv.org/abs/2409.12306
Authors: Wei-Cheng Tseng, Yi-Jen Shih, David Harwath, Raymond Mooney
Keywords: gained substantial attention, substantial attention recently, demonstrated superior performance, gained substantial, substantial attention
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: SLT 2024
Abstract:Audio-visual pre-trained models have gained substantial attention recently and demonstrated superior performance on various audio-visual tasks. This study investigates whether pre-trained audio-visual models demonstrate non-arbitrary associations between sounds and visual representations – known as sound symbolism – which is also observed in humans. We developed a specialized dataset with synthesized images and audio samples and assessed these models using a non-parametric approach in a zero-shot setting. Our findings reveal a significant correlation between the models’ outputs and established patterns of sound symbolism, particularly in models trained on speech data. These results suggest that such models can capture sound-meaning connections akin to human language processing, providing insights into both cognitive architectures and machine learning strategies.
[CV-74] Self-Supervised Pre-training Tasks for an fMRI Time-series Transformer in Autism Detection
Link: https://arxiv.org/abs/2409.12304
Authors: Yinchi Zhou, Peiyu Duan, Yuexi Du, Nicha C. Dvornek
Keywords: Autism Spectrum Disorder, Autism Spectrum, Spectrum Disorder, degrees of impairment, treatment challenging
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Autism Spectrum Disorder (ASD) is a neurodevelopmental condition that encompasses a wide variety of symptoms and degrees of impairment, which makes the diagnosis and treatment challenging. Functional magnetic resonance imaging (fMRI) has been extensively used to study brain activity in ASD, and machine learning methods have been applied to analyze resting state fMRI (rs-fMRI) data. However, fewer studies have explored the recent transformer-based models on rs-fMRI data. Given the superiority of transformer models in capturing long-range dependencies in sequence data, we have developed a transformer-based self-supervised framework that directly analyzes time-series fMRI data without computing functional connectivity. To address over-fitting in small datasets and enhance the model performance, we propose self-supervised pre-training tasks to reconstruct the randomly masked fMRI time-series data, investigating the effects of various masking strategies. We then finetune the model for the ASD classification task and evaluate it using two public datasets and five-fold cross-validation with different amounts of training data. The experiments show that randomly masking entire ROIs gives better model performance than randomly masking time points in the pre-training step, resulting in an average improvement of 10.8% for AUC and 9.3% for subject accuracy compared with the transformer model trained from scratch across different levels of training data availability. Our code is available on GitHub.
[CV-75] WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild
Link: https://arxiv.org/abs/2409.12259
Authors: Rolandos Alexandros Potamias, Jinglei Zhang, Jiankang Deng, Stefanos Zafeiriou
Keywords: garnered significant attention, significant attention due, pose estimation methods, hand pose estimation, virtual reality
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page this https URL
Abstract:In recent years, 3D hand pose estimation methods have garnered significant attention due to their extensive applications in human-computer interaction, virtual reality, and robotics. In contrast, there has been a notable gap in hand detection pipelines, posing significant challenges in constructing effective real-world multi-hand reconstruction systems. In this work, we present a data-driven pipeline for efficient multi-hand reconstruction in the wild. The proposed pipeline is composed of two components: a real-time fully convolutional hand localization and a high-fidelity transformer-based 3D hand reconstruction model. To tackle the limitations of previous methods and build a robust and stable detection network, we introduce a large-scale dataset with more than 2M in-the-wild hand images with diverse lighting, illumination, and occlusion conditions. Our approach outperforms previous methods in both efficiency and accuracy on popular 2D and 3D benchmarks. Finally, we showcase the effectiveness of our pipeline to achieve smooth 3D hand tracking from monocular videos, without utilizing any temporal components. Code, models, and dataset are available at this https URL.
[CV-76] GCA-SUN: A Gated Context-Aware Swin-UNet for Exemplar-Free Counting
Link: https://arxiv.org/abs/2409.12249
Authors: Yuzhe Wu, Yipeng Xu, Tianyu Xu, Jialu Zhang, Jianfeng Ren, Xudong Jiang
Keywords: Exemplar-Free Counting aims, Exemplar-Free Counting, Counting aims, Gated Context-Aware Modulation, objects
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract:Exemplar-Free Counting aims to count objects of interest without intensive annotations of objects or exemplars. To achieve this, we propose Gated Context-Aware Swin-UNet (GCA-SUN) to directly map an input image to the density map of countable objects. Specifically, a Gated Context-Aware Modulation module is designed in the encoder to suppress irrelevant objects or background through a gate mechanism and exploit the attentive support of objects of interest through a self-similarity matrix. The gate strategy is also incorporated into the bottleneck network and the decoder to highlight the features most relevant to objects of interest. By explicitly exploiting the attentive support among countable objects and eliminating irrelevant features through the gate mechanisms, the proposed GCA-SUN focuses on and counts objects of interest without relying on predefined categories or exemplars. Experimental results on the FSC-147 and CARPK datasets demonstrate that GCA-SUN outperforms state-of-the-art methods.
[CV-77] Sparks of Artificial General Intelligence(AGI) in Semiconductor Material Science: Early Explorations into the Next Frontier of Generative AI-Assisted Electron Micrograph Analysis AAAI-2024
Link: https://arxiv.org/abs/2409.12244
Authors: Sakhinana Sagar Srinivas, Geethan Sannidhi, Sreeja Gangasani, Chidaksh Ravuru, Venkataramana Runkana
Keywords: electron micrographs poses, micrographs poses significant, poses significant challenges, automated labeling due, Characterizing materials
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published at Deployable AI (DAI) Workshop at AAAI-2024
Abstract:Characterizing materials with electron micrographs poses significant challenges for automated labeling due to the complex nature of nanomaterial structures. To address this, we introduce a fully automated, end-to-end pipeline that leverages recent advances in Generative AI. It is designed for analyzing and understanding the microstructures of semiconductor materials with effectiveness comparable to that of human experts, contributing to the pursuit of Artificial General Intelligence (AGI) in nanomaterial identification. Our approach utilizes Large MultiModal Models (LMMs) such as GPT-4V, alongside text-to-image models like DALLE-3. We integrate a GPT-4 guided Visual Question Answering (VQA) method to analyze nanomaterial images, generate synthetic nanomaterial images via DALLE-3, and employ in-context learning with few-shot prompting in GPT-4V for accurate nanomaterial identification. Our method surpasses traditional techniques by enhancing the precision of nanomaterial identification and optimizing the process for high-throughput screening.
[CV-78] ScaleFlow++: Robust and Accurate Estimation of 3D Motion from Video
Link: https://arxiv.org/abs/2409.12202
Authors: Han Ling, Yinghui Sun, Quansen Sun, Yuhui Zheng
Keywords: Perceiving and understanding, autonomous driving, optical flow, core technology, technology in fields
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: substantial text overlap with arXiv:2407.09797
Abstract:Perceiving and understanding 3D motion is a core technology in fields such as autonomous driving, robots, and motion prediction. This paper proposes a 3D motion perception method called ScaleFlow++ that is easy to generalize. With just a pair of RGB images, ScaleFlow++ can robustly estimate optical flow and motion-in-depth (MID). Most existing methods directly regress MID from two RGB frames or optical flow, resulting in inaccurate and unstable results. Our key insight is cross-scale matching, which extracts deep motion clues by matching objects in pairs of images at different scales. Unlike previous methods, ScaleFlow++ integrates optical flow and MID estimation into a unified architecture, estimating optical flow and MID end-to-end based on feature matching. Moreover, we also proposed modules such as global initialization network, global iterative optimizer, and hybrid training pipeline to integrate global motion information, reduce the number of iterations, and prevent overfitting during training. On KITTI, ScaleFlow++ achieved the best monocular scene flow estimation performance, reducing SF-all from 6.21 to 5.79. The evaluation of MID even surpasses RGBD-based methods. In addition, ScaleFlow++ has achieved stunning zero-shot generalization performance in both rigid and nonrigid scenes. Code is available at this https URL.
[CV-79] Gradient-Driven 3D Segmentation and Affordance Transfer in Gaussian Splatting Using 2D Masks ICRA2025
Link: https://arxiv.org/abs/2409.11681
Authors: Joji Joseph, Bharadwaj Amrutur, Shalabh Bhatnagar
Keywords: scene representation technique, capturing fine details, Splatting has emerged, Gaussian Splatting, scene representation
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Preprint, Under review for ICRA 2025
Abstract:3D Gaussian Splatting has emerged as a powerful 3D scene representation technique, capturing fine details with high efficiency. In this paper, we introduce a novel voting-based method that extends 2D segmentation models to 3D Gaussian splats. Our approach leverages masked gradients, where gradients are filtered by input 2D masks, and these gradients are used as votes to achieve accurate segmentation. As a byproduct, we discovered that inference-time gradients can also be used to prune Gaussians, resulting in up to 21% compression. Additionally, we explore few-shot affordance transfer, allowing annotations from 2D images to be effectively transferred onto 3D Gaussian splats. The robust yet straightforward mathematical formulation underlying this approach makes it a highly effective tool for numerous downstream applications, such as augmented reality (AR), object editing, and robotics. The project code and additional resources are available at this https URL.
[CV-80] Inability of spatial transformations of CNN feature maps to support invariant recognition
Link: https://arxiv.org/abs/2004.14716
Authors: Ylva Jansson, Maksim Maydanskiy, Lukas Finnveden, Tony Lindeberg
Keywords: CNN feature maps, deep learning architectures, object appearance caused, CNN feature, feature maps
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 22 pages, 3 figures
Abstract:A large number of deep learning architectures use spatial transformations of CNN feature maps or filters to better deal with variability in object appearance caused by natural image transformations. In this paper, we prove that spatial transformations of CNN feature maps cannot align the feature maps of a transformed image to match those of its original, for general affine transformations, unless the extracted features are themselves invariant. Our proof is based on elementary analysis for both the single- and multi-layer network case. The results imply that methods based on spatial transformations of CNN feature maps or filters cannot replace image alignment of the input and cannot enable invariant recognition for general affine transformations, specifically not for scaling transformations or shear transformations. For rotations and reflections, spatially transforming feature maps or filters can enable invariance but only for networks with learnt or hardcoded rotation- or reflection-invariant features.
[CV-81] Exploring the ability of CNNs to generalise to previously unseen scales over wide scale ranges
Link: https://arxiv.org/abs/2004.01536
Authors: Ylva Jansson, Tony Lindeberg
Keywords: world visual tasks, real world visual, scale channel networks, handle large scale, large scale variations
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 14 pages, 6 figures, 3 tables
Abstract:The ability to handle large scale variations is crucial for many real world visual tasks. A straightforward approach for handling scale in a deep network is to process an image at several scales simultaneously in a set of scale channels. Scale invariance can then, in principle, be achieved by using weight sharing between the scale channels together with max or average pooling over the outputs from the scale channels. The ability of such scale channel networks to generalise to scales not present in the training set over significant scale ranges has, however, not previously been explored. We, therefore, present a theoretical analysis of invariance and covariance properties of scale channel networks and perform an experimental evaluation of the ability of different types of scale channel networks to generalise to previously unseen scales. We identify limitations of previous approaches and propose a new type of foveated scale channel architecture, where the scale channels process increasingly larger parts of the image with decreasing resolution. Our proposed FovMax and FovAvg networks perform almost identically over a scale range of 8, also when training on single scale training data, and do also give improvements in the small sample regime.
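The scale channel idea described above (shared weights across rescaled copies of the input, followed by max or average pooling over the channel outputs) can be illustrated with a toy 1D sketch. Everything here is an assumption made for illustration: the nearest-neighbour `rescale`, the hand-picked kernel, and the peak-correlation feature stand in for a real image pyramid and a learned network, and the foveated FovMax/FovAvg design from the paper is not reproduced.

```python
def rescale(signal, factor):
    """Nearest-neighbour resampling of a 1D signal by `factor`."""
    n = max(1, round(len(signal) * factor))
    last = len(signal) - 1
    return [signal[min(last, int(i / factor))] for i in range(n)]


def shared_feature(signal, kernel):
    """Peak correlation with one kernel; the same kernel is used in every channel."""
    k = len(kernel)
    return max(
        (sum(signal[i + j] * kernel[j] for j in range(k))
         for i in range(len(signal) - k + 1)),
        default=float("-inf"),  # signal shorter than the kernel
    )


def scale_channel_response(signal, kernel, scales, pool=max):
    """Apply the shared feature to each rescaled copy, then pool over channels."""
    return pool(shared_feature(rescale(signal, s), kernel) for s in scales)


kernel = [-1, 1, 1, -1]                   # responds most strongly to a bump of width 2
narrow = [0] * 5 + [1, 1] + [0] * 5       # bump of width 2
wide = [0] * 4 + [1, 1, 1, 1] + [0] * 4   # same shape at twice the size
scales = [0.5, 1, 2]
```

On this toy input the single-scale feature responds differently to the two bumps, while the max-pooled response over the scale channels is the same for both, which is the invariance property the abstract refers to.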
[CV-82] The problems with using STNs to align CNN feature maps
Link: https://arxiv.org/abs/2001.05858
Authors: Lukas Finnveden, Ylva Jansson, Tony Lindeberg
Keywords: Spatial transformer networks, learn invariance, CNN feature maps, transform CNN feature, Spatial transformer
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to Northern Lights Deep Learning Workshop 2020, Tromsø, 2 pages, 3 figures
Abstract:Spatial transformer networks (STNs) were designed to enable CNNs to learn invariance to image transformations. STNs were originally proposed to transform CNN feature maps as well as input images. This enables the use of more complex features when predicting transformation parameters. However, since STNs perform a purely spatial transformation, they do not, in the general case, have the ability to align the feature maps of a transformed image and its original. We present a theoretical argument for this and investigate the practical implications, showing that this inability is coupled with decreased classification accuracy. We advocate taking advantage of more complex features in deeper layers by instead sharing parameters between the classification and the localisation network.
[CV-83] Provably scale-covariant continuous hierarchical networks based on scale-normalized differential expressions coupled in cascade
Link: https://arxiv.org/abs/1905.13555
Authors: Tony Lindeberg
Keywords: provably scale covariant, constructing hierarchical networks, theory for constructing, constructing hierarchical, article presents
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 29 pages, 16 figures, 3 tables. arXiv admin note: substantial text overlap with arXiv:1903.00289
Abstract:This article presents a theory for constructing hierarchical networks in such a way that the networks are guaranteed to be provably scale covariant. We first present a general sufficiency argument for obtaining scale covariance, which holds for a wide class of networks defined from linear and non-linear differential expressions expressed in terms of scale-normalized scale-space derivatives. Then, we present a more detailed development of one example of such a network constructed from a combination of mathematically derived models of receptive fields and biologically inspired computations. Based on a functional model of complex cells in terms of an oriented quasi quadrature combination of first- and second-order directional Gaussian derivatives, we couple such primitive computations in cascade over combinatorial expansions over image orientations. Scale-space properties of the computational primitives are analysed and we give explicit proofs of how the resulting representation allows for scale and rotation covariance. A prototype application to texture analysis is developed and it is demonstrated that a simplified mean-reduced representation of the resulting QuasiQuadNet leads to promising experimental results on three texture datasets.
[CV-84] Provably scale-covariant networks from oriented quasi quadrature measures in cascade
Link: https://arxiv.org/abs/1903.00289
Authors: Tony Lindeberg
Keywords: hierarchical networks based, mathematically derived models, biologically inspired computations, article presents, presents a continuous
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 12 pages, 3 figures, 1 table
Abstract:This article presents a continuous model for hierarchical networks based on a combination of mathematically derived models of receptive fields and biologically inspired computations. Based on a functional model of complex cells in terms of an oriented quasi quadrature combination of first- and second-order directional Gaussian derivatives, we couple such primitive computations in cascade over combinatorial expansions over image orientations. Scale-space properties of the computational primitives are analysed and it is shown that the resulting representation allows for provable scale and rotation covariance. A prototype application to texture analysis is developed and it is demonstrated that a simplified mean-reduced representation of the resulting QuasiQuadNet leads to promising experimental results on three texture datasets.
[CV-85] Deep Learning-Based Detection of Referable Diabetic Retinopathy and Macular Edema Using Ultra-Widefield Fundus Imaging
Link: https://arxiv.org/abs/2409.12854
Authors: Philippe Zhang, Pierre-Henri Conze, Mathieu Lamard, Gwenolé Quellec, Mostafa El Habib Daho
Keywords: diabetic macular edema, Diabetic retinopathy, diabetic macular, vision loss, macular edema
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract:Diabetic retinopathy and diabetic macular edema are significant complications of diabetes that can lead to vision loss. Early detection through ultra-widefield fundus imaging enhances patient outcomes but presents challenges in image quality and analysis scale. This paper introduces deep learning solutions for automated UWF image analysis within the framework of the MICCAI 2024 UWF4DR challenge. We detail methods and results across three tasks: image quality assessment, detection of referable DR, and identification of DME. Employing advanced convolutional neural network architectures such as EfficientNet and ResNet, along with preprocessing and augmentation strategies, our models demonstrate robust performance in these tasks. Results indicate that deep learning can significantly aid in the automated analysis of UWF images, potentially improving the efficiency and accuracy of DR and DME detection in clinical settings.
[CV-86] Multi-Source and Multi-Sequence Myocardial Pathology Segmentation Using a Cascading Refinement CNN
Link: https://arxiv.org/abs/2409.12792
Authors: Franz Thaler, Darko Stern, Gernot Plank, Martin Urschler
Keywords: prevalent cardiovascular diseases, myocardial tissue, morbidity worldwide, mortality and morbidity, Myocardial infarction
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract:Myocardial infarction (MI) is one of the most prevalent cardiovascular diseases and consequently, a major cause for mortality and morbidity worldwide. Accurate assessment of myocardial tissue viability for post-MI patients is critical for diagnosis and treatment planning, e.g. allowing surgical revascularization, or to determine the risk of adverse cardiovascular events in the future. Fine-grained analysis of the myocardium and its surrounding anatomical structures can be performed by combining the information obtained from complementary medical imaging techniques. In this work, we use late gadolinium enhanced (LGE) magnetic resonance (MR), T2-weighted (T2) MR and balanced steady-state free precession (bSSFP) cine MR in order to semantically segment the left and right ventricle, healthy and scarred myocardial tissue, as well as edema. To this end, we propose the Multi-Sequence Cascading Refinement CNN (MS-CaRe-CNN), a 2-stage CNN cascade that receives multi-sequence data and generates predictions of the anatomical structures of interest without considering tissue viability at Stage 1. The prediction of Stage 1 is then further refined in Stage 2, where the model additionally distinguishes myocardial tissue based on viability, i.e. healthy, scarred and edema regions. Our proposed method is set up as a 5-fold ensemble and semantically segments scar tissue achieving 62.31% DSC and 82.65% precision, as well as 63.78% DSC and 87.69% precision for the combined scar and edema region. These promising results for such small and challenging structures confirm that MS-CaRe-CNN is well-suited to generate semantic segmentations to assess the viability of myocardial tissue, enabling downstream tasks like personalized therapy planning.
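The segmentation results above are reported as Dice similarity coefficient (DSC) and precision. As a reminder of how these two metrics relate, here is a minimal sketch that computes both from flattened binary masks. The function name and the conventions for empty masks are assumptions; challenge evaluations may handle edge cases differently.

```python
def dice_and_precision(pred, target):
    """Return (DSC, precision) for two equally long binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    # DSC = 2TP / (2TP + FP + FN); defined here as 1.0 when both masks are empty.
    denom = 2 * tp + fp + fn
    dice = 2 * tp / denom if denom else 1.0
    # Precision = TP / (TP + FP); defined here as 0.0 when nothing is predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return dice, precision
```

Precision ignores false negatives while DSC penalizes them, which is why a model that misses part of a small structure can show a noticeably higher precision than DSC.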
[CV-87] TEAM PILOT – Learned Feasible Extendable Set of Dynamic MRI Acquisition Trajectories
Link: https://arxiv.org/abs/2409.12777
Authors: Tamir Shor, Chaim Baskin, Alex Bronstein
Keywords: Magnetic Resonance Imaging, Dynamic Magnetic Resonance, Resonance Imaging, Magnetic Resonance, dynamic MRI faces
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Abstract:Dynamic Magnetic Resonance Imaging (MRI) is a crucial non-invasive method used to capture the movement of internal organs and tissues, making it a key tool for medical diagnosis. However, dynamic MRI faces a major challenge: long acquisition times needed to achieve high spatial and temporal resolution. This leads to higher costs, patient discomfort, motion artifacts, and lower image quality. Compressed Sensing (CS) addresses this problem by acquiring a reduced amount of MR data in the Fourier domain, based on a chosen sampling pattern, and reconstructing the full image from this partial data. While various deep learning methods have been developed to optimize these sampling patterns and improve reconstruction, they often struggle with slow optimization and inference times or are limited to specific temporal dimensions used during training. In this work, we introduce a novel deep-compressed sensing approach that uses 3D window attention and flexible, temporally extendable acquisition trajectories. Our method significantly reduces both training and inference times compared to existing approaches, while also adapting to different temporal dimensions during inference without requiring additional training. Tests with real data show that our approach outperforms current state-of-the-art techniques. The code for reproducing all experiments will be made available upon acceptance of the paper.
[CV-88] Multi-Scale Feature Prediction with Auxiliary-Info for Neural Image Compression
Link: https://arxiv.org/abs/2409.12719
Authors: Chajin Shin, Sangjin Lee, Sangyoun Lee
Keywords: auxiliary coarse network, auxiliary coarse, auxiliary, latent vector, coarse network
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Abstract:Recently, significant improvements in rate-distortion performance of image compression have been achieved with deep-learning techniques. A key factor in this success is the use of additional bits to predict an approximation of the latent vector, which is the output of the encoder, through another neural network. Then, only the difference between the prediction and the latent vector is coded into the bitstream, along with its estimated probability distribution. We introduce a new predictive structure consisting of the auxiliary coarse network and the main network, inspired by neural video compression. The auxiliary coarse network encodes the auxiliary information and predicts the approximation of the original image as multi-scale features. The main network encodes the residual between the predicted feature from the auxiliary coarse network and the feature of the original image. To further leverage our new structure, we propose Auxiliary info-guided Feature Prediction (AFP) module that uses global correlation to predict more accurate predicted features. Moreover, we present Context Junction module that refines the auxiliary feature from AFP module and produces the residuals between the refined features and the original image features. Finally, we introduce Auxiliary info-guided Parameter Estimation (APE) module, which predicts the approximation of the latent vector and estimates the probability distribution of these residuals. We demonstrate the effectiveness of the proposed modules by various ablation studies. Under extensive experiments, our model outperforms other neural image compression models and achieves a 19.49% higher rate-distortion performance than VVC on Tecnick dataset.
[CV-89] PMR-Net: Parallel Multi-Resolution Encoder-Decoder Network Framework for Medical Image Segmentation
Link: https://arxiv.org/abs/2409.12678
Authors: Xiaogang Du, Dongxin Gu, Tao Lei, Yipeng Jiao, Yibin Zou
Keywords: parallel multi-resolution, parallel multi-resolution encoder, global context, recent years, varying sizes
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Abstract:In recent years, encoder-decoder networks have focused on expanding receptive fields and incorporating multi-scale context to capture global features for objects of varying sizes. However, as networks deepen, they often discard fine spatial details, impairing precise object localization. Additionally, conventional decoders’ use of interpolation for upsampling leads to a loss of global context, diminishing edge segmentation accuracy. To address the above problems, we propose a novel parallel multi-resolution encoder-decoder network, PMR-Net for short. First, we design a parallel multi-resolution encoder and a multi-resolution context encoder. The parallel multi-resolution encoder can extract and fuse multi-scale fine-grained local features in parallel for input images with different resolutions. The multi-resolution context encoder fuses the global context semantic features of different receptive fields from different encoder branches to effectively maintain the integrity of global information. Second, we design a parallel multi-resolution decoder symmetrical to the structure of the parallel multi-resolution encoder. The decoder can continuously supplement the global context features of low-resolution branches to the feature maps of high-resolution branches, and effectively solve the problem of global context feature loss caused by the upsampling operation in the decoding process. Extensive experimental results demonstrate that our proposed PMR-Net can achieve more accurate segmentation results than state-of-the-art methods on five publicly available datasets. Moreover, PMR-Net is also a flexible network framework, which can meet the requirements of different scenarios by adjusting the number of network layers and the number of parallel encoder-decoder branches.
[CV-90] MambaClinix: Hierarchical Gated Convolution and Mamba-Based U-Net for Enhanced 3D Medical Image Segmentation
Link: https://arxiv.org/abs/2409.12533
Authors: Chenyuan Bian, Nan Xia, Xia Yang, Feifei Wang, Fengjiao Wang, Bin Wei, Qian Dong
Keywords: Deep learning, medical image segmentation, medical image, image segmentation, Deep
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 5 figures
Abstract:Deep learning, particularly convolutional neural networks (CNNs) and Transformers, has significantly advanced 3D medical image segmentation. While CNNs are highly effective at capturing local features, their limited receptive fields may hinder performance in complex clinical scenarios. In contrast, Transformers excel at modeling long-range dependencies but are computationally intensive, making them expensive to train and deploy. Recently, the Mamba architecture, based on the State Space Model (SSM), has been proposed to efficiently model long-range dependencies while maintaining linear computational complexity. However, its application in medical image segmentation reveals shortcomings, particularly in capturing critical local features essential for accurate delineation of clinical regions. In this study, we propose MambaClinix, a novel U-shaped architecture for medical image segmentation that integrates a hierarchical gated convolutional network(HGCN) with Mamba in an adaptive stage-wise framework. This design significantly enhances computational efficiency and high-order spatial interactions, enabling the model to effectively capture both proximal and distal relationships in medical images. Specifically, our HGCN is designed to mimic the attention mechanism of Transformers by a purely convolutional structure, facilitating high-order spatial interactions in feature maps while avoiding the computational complexity typically associated with Transformer-based methods. Additionally, we introduce a region-specific Tversky loss, which emphasizes specific pixel regions to improve auto-segmentation performance, thereby optimizing the model’s decision-making process. Experimental results on five benchmark datasets demonstrate that the proposed MambaClinix achieves high segmentation accuracy while maintaining low model complexity.
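The [CV-90] abstract mentions a region-specific Tversky loss but does not give its form. As background only, the sketch below implements the standard Tversky loss for a soft binary mask, 1 - TP / (TP + alpha * FP + beta * FN). The region-specific weighting used in the paper is not reproduced here, and the function name, the default `alpha`/`beta` values, and the toy inputs are assumptions.

```python
def tversky_loss(probs, target, alpha=0.3, beta=0.7, eps=1e-7):
    """Standard Tversky loss for flat soft predictions in [0, 1] and a binary target.

    alpha weights false positives, beta weights false negatives.
    With alpha == beta == 0.5 the Tversky index equals the Dice coefficient.
    """
    tp = sum(p * t for p, t in zip(probs, target))
    fp = sum(p * (1 - t) for p, t in zip(probs, target))
    fn = sum((1 - p) * t for p, t in zip(probs, target))
    index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - index


# Hypothetical example: one hit, one false positive, one false negative.
probs = [1.0, 1.0, 0.0, 0.0]
target = [1, 0, 0, 1]
dice_like = tversky_loss(probs, target, alpha=0.5, beta=0.5)
fn_heavy = tversky_loss(probs, target, alpha=0.3, beta=0.7)
print(dice_like, fn_heavy)
```

Choosing `beta > alpha` penalizes missed foreground pixels more than spurious ones, which is a common reason to prefer a Tversky loss over a Dice loss when the target structures are small.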
[CV-91] MambaRecon: MRI Reconstruction with Structured State Space Models
Link: https://arxiv.org/abs/2409.12401
Authors: Yilmaz Korkmaz, Vishal M. Patel
Keywords: Magnetic Resonance Imaging, Magnetic Resonance, important medical imaging, medical imaging modalities, Resonance Imaging
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Abstract:Magnetic Resonance Imaging (MRI) is one of the most important medical imaging modalities as it provides superior resolution of soft tissues, albeit with a notable limitation in scanning speed. The advent of deep learning has catalyzed the development of cutting-edge methods for the expedited reconstruction of MRI scans, utilizing convolutional neural networks and, more recently, vision transformers. Recently proposed structured state space models (e.g., Mamba) have gained some traction due to their efficiency and low computational requirements compared to transformer models. We propose an innovative MRI reconstruction framework that employs structured state space models at its core, aimed at amplifying both long-range contextual sensitivity and reconstruction efficacy. Comprehensive experiments on public brain MRI datasets show that our model sets new benchmarks beating state-of-the-art reconstruction baselines. Code will be available (this https URL).
[CV-92] I2I-Galip: Unsupervised Medical Image Translation Using Generative Adversarial CLIP
Link: https://arxiv.org/abs/2409.12399
Authors: Yilmaz Korkmaz, Vishal M. Patel
Keywords: challenging task due, absence of paired, complicates learning, learning the complex, distinct distributions
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Abstract:Unpaired image-to-image translation is a challenging task due to the absence of paired examples, which complicates learning the complex mappings between the distinct distributions of the source and target domains. One of the most commonly used approaches for this task is CycleGAN, which requires the training of a new pair of generator-discriminator networks for each domain pair. In this paper, we propose a new image-to-image translation framework named Image-to-Image-Generative-Adversarial-CLIP (I2I-Galip) where we utilize a pre-trained multi-modal foundation model (i.e., CLIP) to mitigate the need for separate generator-discriminator pairs for each source-target mapping while achieving better and more efficient multi-domain translation. By utilizing the massive knowledge gathered during pre-training a foundation model, our approach makes use of a single lightweight generator network with ~13M parameters for the multi-domain image translation task. Comprehensive experiments on translation performance in public MRI and CT datasets show the superior performance of the proposed framework over the existing approaches. Code will be available (this https URL).
[CV-93] Fundus image enhancement through direct diffusion bridges
Link: https://arxiv.org/abs/2409.12377
Authors: Sehui Kim, Hyungjin Chung, Se Hie Park, Eui-Sang Chung, Kayoung Yi, Jong Chul Ye
Keywords: direct diffusion bridges, including haze, enhancement method based, based on direct, wide range
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published at IEEE JBHI. 12 pages, 10 figures. Code and Data: this https URL
Abstract:We propose FD3, a fundus image enhancement method based on direct diffusion bridges, which can cope with a wide range of complex degradations, including haze, blur, noise, and shadow. We first propose a synthetic forward model through a human feedback loop with board-certified ophthalmologists for maximal quality improvement of low-quality in-vivo images. Using the proposed forward model, we train a robust and flexible diffusion-based image enhancement network that is highly effective as a stand-alone method, unlike previous diffusion model-based approaches which act only as a refiner on top of pre-trained models. Through extensive experiments, we show that FD3 establishes superior quality not only on synthetic degradations but also on in vivo studies with low-quality fundus photos taken from patients with cataracts or small pupils. To promote further research in this area, we open-source all our code and data used for this research at this https URL
[CV-94] Robust Audiovisual Speech Recognition Models with Mixture-of-Experts
Link: https://arxiv.org/abs/2409.12370
Authors: Yihan Wu, Yifan Peng, Yichen Lu, Xuankai Chang, Ruihua Song, Shinji Watanabe
Keywords: providing additional contextual, additional contextual information, speech recognition, audiovisual speech recognition, speech recognition accuracy
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Comments: 6 pages, 2 figures, accepted by IEEE Spoken Language Technology Workshop 2024
Abstract:Visual signals can enhance audiovisual speech recognition accuracy by providing additional contextual information. Given the complexity of visual signals, an audiovisual speech recognition model requires robust generalization capabilities across diverse video scenarios, presenting a significant challenge. In this paper, we introduce EVA, leveraging the mixture-of-Experts for audioVisual ASR to perform robust speech recognition for "in-the-wild" videos. Specifically, we first encode visual information into a sequence of visual tokens and map them into speech space by a lightweight projection. Then, we build EVA upon a robust pretrained speech recognition model, ensuring its generalization ability. Moreover, to incorporate visual information effectively, we inject visual information into the ASR model through a mixture-of-experts module. Experiments show our model achieves state-of-the-art results on three benchmarks, which demonstrates the generalization ability of EVA across diverse video domains.
[CV-95] Axial Attention Transformer Networks: A New Frontier in Breast Cancer Detection
Link: https://arxiv.org/abs/2409.12347
Authors: Weijie He, Runyuan Bao, Yiru Cang, Jianjun Wei, Yang Zhang, Jiacheng Hu
Keywords: breast cancer images, medical image segmentation, breast cancer, breast cancer diagnosis, Transformer-based segmentation model
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract:This paper delves into the challenges and advancements in the field of medical image segmentation, particularly focusing on breast cancer diagnosis. The authors propose a novel Transformer-based segmentation model that addresses the limitations of traditional convolutional neural networks (CNNs), such as U-Net, in accurately localizing and segmenting small lesions within breast cancer images. The model introduces an axial attention mechanism to enhance the computational efficiency and address the issue of global contextual information that is often overlooked by CNNs. Additionally, the paper discusses improvements tailored to the small dataset challenge, including the incorporation of relative position information and a gated axial attention mechanism to refine the model’s focus on relevant features. The proposed model aims to significantly improve the segmentation accuracy of breast cancer images, offering a more efficient and effective tool for computer-aided diagnosis.
[CV-96] Deep vessel segmentation with joint multi-prior encoding
Link: https://arxiv.org/abs/2409.12334
Authors: Amine Sadikine, Bogdan Badic, Enzo Ferrante, Vincent Noblet, Pascal Ballet, Dimitris Visvikis, Pierre-Henri Conze
Keywords: including pathology detection, clinical applications, including pathology, surgical planning, pathology detection
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 5 pages, 3 figures, conference
Abstract:The precise delineation of blood vessels in medical images is critical for many clinical applications, including pathology detection and surgical planning. However, fully-automated vascular segmentation is challenging because of the variability in shape, size, and topology. Manual segmentation remains the gold standard but is time-consuming, subjective, and impractical for large-scale studies. Hence, there is a need for automatic and reliable segmentation methods that can accurately detect blood vessels from medical images. The integration of shape and topological priors into vessel segmentation models has been shown to improve segmentation accuracy by offering contextual information about the shape of the blood vessels and their spatial relationships within the vascular tree. To further improve anatomical consistency, we propose a new joint prior encoding mechanism which incorporates both shape and topology in a single latent space. The effectiveness of our method is demonstrated on the publicly available 3D-IRCADb dataset. More globally, the proposed approach holds promise in overcoming the challenges associated with automatic vessel delineation and has the potential to advance the field of deep priors encoding.
[CV-97] Scale-specific auxiliary multi-task contrastive learning for deep liver vessel segmentation
Link: https://arxiv.org/abs/2409.12333
Authors: Amine Sadikine, Bogdan Badic, Jean-Pierre Tasu, Vincent Noblet, Pascal Ballet, Dimitris Visvikis, Pierre-Henri Conze
Keywords: functionally-independent Couinaud segments, Extracting hepatic vessels, Couinaud segments, Extracting hepatic, functionally-independent Couinaud
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 5 pages, 5 figures, conference
Abstract:Extracting hepatic vessels from abdominal images is of high interest for clinicians since it allows the liver to be divided into functionally-independent Couinaud segments. In this respect, automated liver blood vessel extraction is widely called for. Despite the significant growth in performance of semantic segmentation methodologies, preserving the complex multi-scale geometry of main vessels and ramifications remains a major challenge. This paper provides a new deep supervised approach for vessel segmentation, with a strong focus on representations arising from the different scales inherent to the vascular tree geometry. In particular, we propose a new clustering technique to decompose the tree into various scale levels, from tiny to large vessels. Then, we extend standard 3D UNet to multi-task learning by incorporating scale-specific auxiliary tasks and contrastive learning to encourage the discrimination between scales in the shared representation. Promising results, depicted in several evaluation metrics, are revealed on the public 3D-IRCADb dataset.
[CV-98] Unsupervised Feature Orthogonalization for Learning Distortion-Invariant Representations BMVC2024
Link: https://arxiv.org/abs/2409.12276
Authors: Sebastian Doerrich, Francesco Di Salvo, Christian Ledig
Keywords: Vision Transformer, integrates unsupervised feature, unsupervised feature orthogonalization, study introduces unORANIC, Transformer to capture
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at RROW@BMVC 2024 (Workshop on Robust Recognition in the Open World at the British Machine Vision Conference)
Abstract:This study introduces unORANIC+, a novel method that integrates unsupervised feature orthogonalization with the ability of a Vision Transformer to capture both local and global relationships for improved robustness and generalizability. The streamlined architecture of unORANIC+ effectively separates anatomical and image-specific attributes, resulting in robust and unbiased latent representations that allow the model to demonstrate excellent performance across various medical image analysis tasks and diverse datasets. Extensive experimentation demonstrates unORANIC+'s reconstruction proficiency, corruption resilience, as well as capability to revise existing image distortions. Additionally, the model exhibits notable aptitude in downstream tasks such as disease classification and corruption detection. We confirm its adaptability to diverse datasets of varying image sources and sample sizes which positions the method as a promising algorithm for advanced medical image analysis, particularly in resource-constrained environments lacking large, tailored datasets. The source code is available at this https URL .
Machine Learning
[LG-0] Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner
Link: https://arxiv.org/abs/2409.12963
Authors: Yuzhang Shang, Bingxin Xu, Weitai Kang, Mu Cai, Yuheng Li, Zehao Wen, Zhen Dong, Kurt Keutzer, Yong Jae Lee, Yan Yan
Keywords: Large Language Models, Advancements in Large, Language Models, Large Language, integrating video modalities
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:Advancements in Large Language Models (LLMs) inspire various strategies for integrating video modalities. A key approach is Video-LLMs, which incorporate an optimizable interface linking sophisticated video encoders to LLMs. However, due to computation and data limitations, these Video-LLMs are typically pre-trained to process only short videos, limiting their broader application for understanding longer video content. Additionally, fine-tuning Video-LLMs to handle longer videos is cost-prohibitive. Consequently, it becomes essential to explore the interpolation of Video-LLMs under a completely training-free setting. In this paper, we first identify the primary challenges in interpolating Video-LLMs: (1) the video encoder and modality alignment projector are fixed, preventing the integration of additional frames into Video-LLMs, and (2) the LLM backbone is limited in its content length capabilities, which complicates the processing of an increased number of video tokens. To address these challenges, we propose a specific INTerPolation method for Video-LLMs (INTP-Video-LLMs). We introduce an alternative video token rearrangement technique that circumvents limitations imposed by the fixed video encoder and alignment projector. Furthermore, we introduce a training-free LLM context window extension method to enable Video-LLMs to understand a correspondingly increased number of visual tokens.
[LG-1] MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions
Link: https://arxiv.org/abs/2409.12958
Authors: Abdullatif Köksal, Marion Thaler, Ayyoob Imani, Ahmet Üstün, Anna Korhonen, Hinrich Schütze
Keywords: tuning enhances large, Instruction tuning enhances, enhances large language, Instruction tuning, instruction tuning datasets
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:Instruction tuning enhances large language models (LLMs) by aligning them with human preferences across diverse tasks. Traditional approaches to create instruction tuning datasets face serious challenges for low-resource languages due to their dependence on data annotation. This work introduces a novel method, Multilingual Reverse Instructions (MURI), which generates high-quality instruction tuning datasets for low-resource languages without requiring human annotators or pre-existing multilingual models. Utilizing reverse instructions and a translation pipeline, MURI produces instruction-output pairs from existing human-written texts in low-resource languages. This method ensures cultural relevance and diversity by sourcing texts from different native domains and applying filters to eliminate inappropriate content. Our dataset, MURI-IT, includes more than 2 million instruction-output pairs across 200 languages. Evaluation by native speakers and fine-tuning experiments with mT5 models demonstrate the approach’s effectiveness for both NLU and open-ended generation. We publicly release datasets and models at this https URL.
[LG-2] The Gaussian Discriminant Variational Autoencoder (GdVAE): A Self-Explainable Model with Counterfactual Explanations ECCV2024
Link: https://arxiv.org/abs/2409.12952
Authors: Anselm Haselhoff, Kevin Trelenberg, Fabian Küppers, Jonas Schneider
Keywords: modify image concepts, original query image, Visual counterfactual explanation, methods modify image, Visual counterfactual
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted paper at the ECCV 2024
Abstract:Visual counterfactual explanation (CF) methods modify image concepts, e.g., shape, to change a prediction to a predefined outcome while closely resembling the original query image. Unlike self-explainable models (SEMs) and heatmap techniques, they grant users the ability to examine hypothetical “what-if” scenarios. Previous CF methods either entail post-hoc training, limiting the balance between transparency and CF quality, or demand optimization during inference. To bridge the gap between transparent SEMs and CF methods, we introduce the GdVAE, a self-explainable model based on a conditional variational autoencoder (CVAE), featuring a Gaussian discriminant analysis (GDA) classifier and integrated CF explanations. Full transparency is achieved through a generative classifier that leverages class-specific prototypes for the downstream task and a closed-form solution for CFs in the latent space. The consistency of CFs is improved by regularizing the latent space with the explainer function. Extensive comparisons with existing approaches affirm the effectiveness of our method in producing high-quality CF explanations while preserving transparency. Code and models are public.
[LG-3] Re-Introducing LayerNorm: Geometric Meaning Irreversibility and a Comparative Study with RMSNorm
Link: https://arxiv.org/abs/2409.12951
Authors: Akshat Gupta, Atahan Ozdemir, Gopala Anumanchipalli
Keywords: uniform vector, Layer normalization, vector, transformer architecture, LayerNorm
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract:Layer normalization is a pivotal step in the transformer architecture. This paper delves into the less explored geometric implications of this process, examining how LayerNorm influences the norm and orientation of hidden vectors in the representation space. We show that the definition of LayerNorm is innately linked to the uniform vector, defined as $\boldsymbol{1} = [1, 1, 1, 1, \cdots, 1]^T \in \mathbb{R}^d$. We then show that the standardization step in LayerNorm can be understood in three simple steps: (i) remove the component of a vector along the uniform vector, (ii) normalize the remaining vector, and (iii) scale the resultant vector by $\sqrt{d}$, where $d$ is the dimensionality of the representation space. We also introduce the property of “irreversibility” for LayerNorm, where we show that the information lost during the normalization process cannot be recovered. In other words, unlike batch normalization, LayerNorm cannot learn an identity transform. While we present possible arguments for removing the component along the uniform vector, the choice of removing this component seems arbitrary and not well motivated by the original authors. To evaluate the usefulness of this step, we compare the hidden representations of LayerNorm-based LLMs with models trained using RMSNorm and show that all LLMs naturally align representations orthogonal to the uniform vector, presenting the first mechanistic evidence that removing the component along the uniform vector in LayerNorm is a redundant step. Our findings support the use of RMSNorm over LayerNorm as it is not only more computationally efficient with comparable downstream performance, but also learns a similar distribution of hidden representations that operate orthogonal to the uniform vector.
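The three-step reading of LayerNorm in the [LG-3] abstract can be checked numerically. The sketch below is not the authors' code: it compares the usual standardization (subtract the mean, divide by the population standard deviation) with the geometric recipe, and it assumes no epsilon term and no learnable gain or bias. All function names and the test vector are hypothetical.

```python
import math


def layernorm_standardize(x):
    """Textbook standardization: (x - mean) / std, with population std and no epsilon."""
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d
    return [(v - mean) / math.sqrt(var) for v in x]


def layernorm_geometric(x):
    """The same result via the three geometric steps from the abstract."""
    d = len(x)
    # (i) remove the component along the uniform vector 1 = [1, ..., 1]^T
    coeff = sum(x) / d  # (x . 1) / (1 . 1)
    perp = [v - coeff for v in x]
    # (ii) normalize the remaining vector to unit length
    norm = math.sqrt(sum(v * v for v in perp))
    unit = [v / norm for v in perp]
    # (iii) scale by sqrt(d)
    return [math.sqrt(d) * v for v in unit]


def rmsnorm(x):
    """RMSNorm without a learnable gain: divide by the root mean square."""
    d = len(x)
    rms = math.sqrt(sum(v * v for v in x) / d)
    return [v / rms for v in x]


x = [2.0, -1.0, 0.5, 4.0]
a = layernorm_standardize(x)
b = layernorm_geometric(x)
# For a vector already orthogonal to the uniform vector, RMSNorm and LayerNorm agree.
centered = [v - sum(x) / len(x) for v in x]
c = rmsnorm(centered)
print(a)
print(b)
print(c)
```

Because step (i) discards the component along the uniform vector and step (ii) discards the length, inputs such as `x` and `x` shifted by a constant map to the same output, which is one way to see the "irreversibility" the abstract refers to.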
[LG-4] Unrolled denoising networks provably learn optimal Bayesian inference
Link: https://arxiv.org/abs/2409.12947
Authors: Aayush Karan, Kulin Shah, Sitan Chen, Yonina C. Eldar
Keywords: Bayesian inference centers, Bayesian inference, estimators for inverse, inverse problems, Bayes AMP
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
Comments: 32 pages
Abstract:Much of Bayesian inference centers around the design of estimators for inverse problems which are optimal assuming the data comes from a known prior. But what do these optimality guarantees mean if the prior is unknown? In recent years, algorithm unrolling has emerged as deep learning’s answer to this age-old question: design a neural network whose layers can in principle simulate iterations of inference algorithms and train on data generated by the unknown prior. Despite its empirical success, however, it has remained unclear whether this method can provably recover the performance of its optimal, prior-aware counterparts. In this work, we prove the first rigorous learning guarantees for neural networks based on unrolling approximate message passing (AMP). For compressed sensing, we prove that when trained on data drawn from a product prior, the layers of the network approximately converge to the same denoisers used in Bayes AMP. We also provide extensive numerical experiments for compressed sensing and rank-one matrix estimation demonstrating the advantages of our unrolled architecture - in addition to being able to obliviously adapt to general priors, it exhibits improvements over Bayes AMP in more general settings of low dimensions, non-Gaussian designs, and non-product priors.
[LG-5] Revisiting Semi-supervised Adversarial Robustness via Noise-aware Online Robust Distillation
Link: https://arxiv.org/abs/2409.12946
Authors: Tsung-Han Wu, Hung-Ting Su, Shang-Tse Chen, Winston H. Hsu
Keywords: prominent approach, RST, robust pretrained models, robust self-training, labeling budgets
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 4 figures, 9 tables
Abstract:The robust self-training (RST) framework has emerged as a prominent approach for semi-supervised adversarial training. To explore the possibility of tackling more complicated tasks with even lower labeling budgets, unlike prior approaches that rely on robust pretrained models, we present SNORD - a simple yet effective framework that introduces contemporary semi-supervised learning techniques into the realm of adversarial training. By enhancing pseudo labels and managing noisy training data more effectively, SNORD showcases impressive, state-of-the-art performance across diverse datasets and labeling budgets, all without the need for pretrained models. Compared to full adversarial supervision, SNORD achieves a 90% relative robust accuracy under epsilon = 8/255 AutoAttack, requiring less than 0.1%, 2%, and 10% labels for CIFAR-10, CIFAR-100, and TinyImageNet-200, respectively. Additional experiments confirm the efficacy of each component and demonstrate the adaptability of integrating SNORD with existing adversarial pretraining strategies to further bolster robustness.
[LG-6] Training Language Models to Self-Correct via Reinforcement Learning
Link: https://arxiv.org/abs/2409.12917
Authors: Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, Aleksandra Faust
Keywords: highly desirable capability, large language models, highly desirable, desirable capability, capability of large
Categories: Machine Learning (cs.LG)
Abstract:Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Existing approaches for training self-correction either require multiple models or rely on a more capable model or other forms of supervision. To this end, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM’s self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are insufficient for instilling self-correction behavior. In particular, we observe that training via SFT either suffers from a distribution mismatch between the training data and the model’s own responses or implicitly prefers only a certain mode of correction behavior that is often not effective at test time. SCoRe addresses these challenges by training under the model’s own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction strategy that is effective at test time as opposed to simply fitting high-reward responses for a given prompt. This regularization prescribes running a first phase of RL on a base model to generate a policy initialization that is less susceptible to collapse and then using a reward bonus to amplify self-correction during training. When applied to Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models’ self-correction by 15.6% and 9.1% respectively on the MATH and HumanEval benchmarks.
[LG-7] Unveiling and Manipulating Concepts in Time Series Foundation Models
链接: https://arxiv.org/abs/2409.12915
作者: Michał Wiliński,Mononito Goswami,Nina Żukowska,Willa Potosnak,Artur Dubrawski
关键词-EN: Time series foundation, Time series, range of applications, series foundation models, powerful tools
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Time series foundation models promise to be powerful tools for a wide range of applications. However, little is known about the concepts that these models learn and how we can manipulate them in the latent space. Our study bridges these gaps by identifying concepts learned by these models, localizing them to specific parts of the model, and steering model predictions along these conceptual directions, using synthetic time series data. Our results show that MOMENT, a state-of-the-art foundation model, can discern distinct time series patterns, and that this ability peaks in the middle layers of the network. Moreover, we show that model outputs can be steered using insights from its activations (e.g., by introducing periodic trends to initially constant signals through intervention during inference). Our findings underscore the importance of synthetic data in studying and steering time series foundation models and of intervening throughout the whole model (using steering matrices), instead of in a single layer.
[LG-8] Defending against Reverse Preference Attacks is Difficult
链接: https://arxiv.org/abs/2409.12914
作者: Domenic Rosati,Giles Edkins,Harsh Raj,David Atanasov,Subhabrata Majumdar,Janarthanan Rajendran,Frank Rudzicz,Hassan Sajjad
关键词-EN: aligning Large Language, Large Language Models, Large Language, ensuring safe behaviour, aligning Large
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:
点击查看摘要
Abstract:While there has been progress towards aligning Large Language Models (LLMs) with human values and ensuring safe behaviour at inference time, safety-aligned LLMs are known to be vulnerable to training-time attacks such as supervised fine-tuning (SFT) on harmful datasets. In this paper, we ask if LLMs are vulnerable to adversarial reinforcement learning. Motivated by this goal, we propose Reverse Preference Attacks (RPA), a class of attacks to make LLMs learn harmful behavior using adversarial reward during reinforcement learning from human feedback (RLHF). RPAs expose a critical safety gap of safety-aligned LLMs in RL settings: they easily explore the harmful text generation policies to optimize adversarial reward. To protect against RPAs, we explore a host of mitigation strategies. Leveraging Constrained Markov-Decision Processes, we adapt a number of mechanisms to defend against harmful fine-tuning attacks into the RL setting. Our experiments show that "online" defenses that are based on the idea of minimizing the negative log likelihood of refusals -- with the defender having control of the loss function -- can effectively protect LLMs against RPAs. However, trying to defend model weights using "offline" defenses that operate under the assumption that the defender has no control over the loss function are less effective in the face of RPAs. These findings show that attacks done using RL can be used to successfully undo safety alignment in open-weight LLMs and use them for malicious purposes.
[LG-9] Universal approximation theorem for neural networks with inputs from a topological vector space
链接: https://arxiv.org/abs/2409.12913
作者: Vugar Ismailov
关键词-EN: feedforward neural networks, topological vector space, study feedforward neural, feedforward neural, neural networks
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
*备注: 10 pages
点击查看摘要
Abstract:We study feedforward neural networks with inputs from a topological vector space (TVS-FNNs). Unlike traditional feedforward neural networks, TVS-FNNs can process a broader range of inputs, including sequences, matrices, functions and more. We prove a universal approximation theorem for TVS-FNNs, which demonstrates their capacity to approximate any continuous function defined on this expanded input space.
[LG-10] Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization
链接: https://arxiv.org/abs/2409.12903
作者: Mohammad Samragh,Iman Mirzadeh,Keivan Alizadeh Vahid,Fartash Faghri,Minsik Cho,Moin Nabi,Devang Naik,Mehrdad Farajtabar
关键词-EN: language models, large language models, begins with randomly, models, model
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The pre-training phase of language models often begins with randomly initialized parameters. With the current trends in scaling models, training their large number of parameters can be extremely slow and costly. In contrast, small language models are less expensive to train, but they often cannot achieve the accuracy of large models. In this paper, we explore an intriguing idea to connect these two different regimes: Can we develop a method to initialize large language models using smaller pre-trained models? Will such initialization bring any benefits in terms of training time and final accuracy? In this paper, we introduce HyperCloning, a method that can expand the parameters of a pre-trained language model to those of a larger model with increased hidden dimensions. Our method ensures that the larger model retains the functionality of the smaller model. As a result, the larger model already inherits the predictive power and accuracy of the smaller model before the training starts. We demonstrate that training such an initialized model results in significant savings in terms of GPU hours required for pre-training large language models.
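The idea of a larger model that "retains the functionality of the smaller model" can be illustrated on a single linear layer. The sketch below is not taken from the paper: `expand_linear` is a hypothetical helper, and tiling `W / 2` into four blocks is just one simple way to obtain a function-preserving expansion, assuming the input to the wider layer is the original input duplicated.

```python
def matvec(W, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def expand_linear(W):
    """Hypothetical helper: double both dimensions of W, preserving its function.

    Every block of the expanded matrix equals W / 2, so feeding the
    duplicated input [x; x] produces the duplicated output [Wx; Wx].
    """
    half = [[w / 2.0 for w in row] for row in W]
    top = [row + row for row in half]          # [W/2, W/2]
    return top + [list(row) for row in top]    # stacked twice


W_small = [[1.0, -2.0], [0.5, 4.0]]
x = [3.0, 1.0]

y_small = matvec(W_small, x)       # output of the small layer
W_big = expand_linear(W_small)     # 4 x 4 expanded weights
y_big = matvec(W_big, x + x)       # input duplicated to match the new width
```

Under these assumptions `y_big` is simply `y_small` repeated twice, so the wider layer starts training from the small layer's predictions rather than from a random initialization.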
[LG-11] Fast End-to-End Generation of Belief Space Paths for Minimum Sensing Navigation
链接: https://arxiv.org/abs/2409.12902
作者: Lukas Taus,Vrushabh Zinage,Takashi Tanaka,Richard Tsai
关键词-EN: Gaussian belief space, Gaussian belief, belief space, motion planning, problem
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We revisit the problem of motion planning in the Gaussian belief space. Motivated by the fact that most existing sampling-based planners suffer from high computational costs due to the high-dimensional nature of the problem, we propose an approach that leverages a deep learning model to predict optimal path candidates directly from the problem description. Our proposed approach consists of three steps. First, we prepare a training dataset comprising a large number of input-output pairs: the input image encodes the problem to be solved (e.g., start states, goal states, and obstacle locations), whereas the output image encodes the solution (i.e., the ground truth of the shortest path). Any existing planner can be used to generate this training dataset. Next, we leverage the U-Net architecture to learn the dependencies between the input and output data. Finally, a trained U-Net model is applied to a new problem encoded as an input image. From the U-Net’s output image, which is interpreted as a distribution of paths, an optimal path candidate is reconstructed. The proposed method significantly reduces computation time compared to the sampling-based baseline algorithm.
[LG-12] On the Hardness of Decentralized Multi-Agent Policy Evaluation under Byzantine Attacks
链接: https://arxiv.org/abs/2409.12882
作者: Hairi,Minghong Fang,Zifan Zhang,Alvaro Velasquez,Jia Liu
关键词-EN: multi-agent reinforcement learning, multi-agent policy evaluation, fully-decentralized multi-agent policy, policy evaluation problem, Byzantine faulty model
类目: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: To appear in Proceedings of the 22nd International Symposium on Modeling and Optimization in Mobile, Ad hoc, and Wireless Networks (WiOpt 2024)
点击查看摘要
Abstract:In this paper, we study a fully-decentralized multi-agent policy evaluation problem, which is an important sub-problem in cooperative multi-agent reinforcement learning, in the presence of up to $f$ faulty agents. In particular, we focus on the so-called Byzantine faulty model with model poisoning setting. In general, policy evaluation is to evaluate the value function of any given policy. In a cooperative multi-agent system, the system-wide rewards are usually modeled as the uniform average of rewards from all agents. We investigate the multi-agent policy evaluation problem in the presence of Byzantine agents, particularly in the setting of heterogeneous local rewards. Ideally, the goal of the agents is to evaluate the accumulated system-wide rewards, which are the uniform average of rewards of the normal agents for a given policy. It means that all agents agree upon common values (the consensus part) and furthermore, the consensus values are the value functions (the convergence part). However, we prove that this goal is not achievable. Instead, we consider a relaxed version of the problem, where the goal of the agents is to evaluate accumulated system-wide reward, which is an appropriately weighted average reward of the normal agents. We further prove that there is no correct algorithm that can guarantee that the total number of positive weights exceeds $|\mathcal{N}|-f$, where $|\mathcal{N}|$ is the number of normal agents. Towards the end, we propose a Byzantine-tolerant decentralized temporal difference algorithm that can guarantee asymptotic consensus under scalar function approximation. We then empirically test the effectiveness of the proposed algorithm.
[LG-13] Enhancing E-commerce Product Title Translation with Retrieval-Augmented Generation and Large Language Models CIKM
链接: https://arxiv.org/abs/2409.12880
作者: Bryan Zhang,Taichi Nakatani,Stephan Walter
关键词-EN: stores enable multilingual, E-commerce stores enable, product title translation, title translation, accurate product title
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 6 Pages,In Proceedings of ACM CIKM Workshop on Data-Centric AI (CIKM DCAI 2024)
点击查看摘要
Abstract:E-commerce stores enable multilingual product discovery, which requires accurate product title translation. Multilingual large language models (LLMs) have shown promising capacity to perform machine translation tasks, and they can also enhance and translate product titles cross-lingually in one step. However, product title translation often requires more than just language conversion because titles are short, lack context, and contain specialized terminology. This study proposes a retrieval-augmented generation (RAG) approach that leverages existing bilingual product information in e-commerce by retrieving similar bilingual examples and incorporating them as few-shot prompts to enhance LLM-based product title translation. Experiment results show that our proposed RAG approach improves product title translation quality with chrF score gains of up to 15.3% for language pairs where the LLM has limited proficiency.
[LG-14] Impact of ML Optimization Tactics on Greener Pre-Trained ML Models
链接: https://arxiv.org/abs/2409.12878
作者: Alexandra González Álvarez,Joel Castaño,Xavier Franch,Silverio Martínez-Fernández
关键词-EN: English understanding, energy consumption
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:
点击查看摘要
Abstract:Background: Given the fast-paced nature of today’s technology, which has surpassed human performance in tasks like image classification, visual reasoning, and English understanding, assessing the impact of Machine Learning (ML) on energy consumption is crucial. Traditionally, ML projects have prioritized accuracy over energy, creating a gap in energy consumption during model inference. Aims: This study aims to (i) analyze image classification datasets and pre-trained models, (ii) improve inference efficiency by comparing optimized and non-optimized models, and (iii) assess the economic impact of the optimizations. Method: We conduct a controlled experiment to evaluate the impact of various PyTorch optimization techniques (dynamic quantization, torch.compile, local pruning, and global pruning) to 42 Hugging Face models for image classification. The metrics examined include GPU utilization, power and energy consumption, accuracy, time, computational complexity, and economic costs. The models are repeatedly evaluated to quantify the effects of these software engineering tactics. Results: Dynamic quantization demonstrates significant reductions in inference time and energy consumption, making it highly suitable for large-scale systems. Additionally, torch.compile balances accuracy and energy. In contrast, local pruning shows no positive impact on performance, and global pruning’s longer optimization times significantly impact costs. Conclusions: This study highlights the role of software engineering tactics in achieving greener ML models, offering guidelines for practitioners to make informed decisions on optimization methods that align with sustainability goals. 
[LG-15] A Margin-Maximizing Fine-Grained Ensemble Method
链接: https://arxiv.org/abs/2409.12849
作者: Jinghui Yuan,Hao Chen,Renwei Luo,Feiping Nie
关键词-EN: achieved remarkable success, resource-constrained environments, machine learning, achieved remarkable, remarkable success
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Ensemble learning has achieved remarkable success in machine learning, but its reliance on numerous base learners limits its application in resource-constrained environments. This paper introduces an innovative “Margin-Maximizing Fine-Grained Ensemble Method” that achieves performance surpassing large-scale ensembles by meticulously optimizing a small number of learners and enhancing generalization capability. We propose a novel learnable confidence matrix, quantifying each classifier’s confidence for each category, precisely capturing category-specific advantages of individual learners. Furthermore, we design a margin-based loss function, constructing a smooth and partially convex objective using the logsumexp technique. This approach improves optimization, eases convergence, and enables adaptive confidence allocation. Finally, we prove that the loss function is Lipschitz continuous, based on which we develop an efficient gradient optimization algorithm that simultaneously maximizes margins and dynamically adjusts learner weights. Extensive experiments demonstrate that our method outperforms traditional random forests using only one-tenth of the base learners and other state-of-the-art ensemble methods.
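The abstract mentions building a smooth margin objective with the logsumexp technique but does not give the loss itself. As a minimal sketch of the general idea only, the example below replaces the hard maximum over competing class scores with logsumexp; `smooth_margin_loss` and its exact form are assumptions for illustration, not the authors' loss.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def hard_margin_loss(scores, true_class):
    """Negative margin: best competing score minus the true-class score."""
    others = [s for k, s in enumerate(scores) if k != true_class]
    return max(others) - scores[true_class]


def smooth_margin_loss(scores, true_class):
    """Illustrative smooth surrogate: logsumexp replaces the hard max."""
    others = [s for k, s in enumerate(scores) if k != true_class]
    return logsumexp(others) - scores[true_class]


scores = [2.0, 0.5, -1.0]   # hypothetical ensemble scores for three classes
hard = hard_margin_loss(scores, true_class=0)
smooth = smooth_margin_loss(scores, true_class=0)
```

Because logsumexp over n values always lies between the maximum and the maximum plus log(n), the smooth loss upper-bounds the hard one by at most log(n) while being differentiable everywhere.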
[LG-16] How the (Tensor-) Brain uses Embeddings and Embodiment to Encode Senses and Decode Symbols
链接: https://arxiv.org/abs/2409.12846
作者: Volker Tresp,Hang Li
关键词-EN: tensor brain, representation layer, tensor brain model, brain, layer
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
*备注:
点击查看摘要
Abstract:The tensor brain has been introduced as a computational model for perception and memory. We provide an overview of the tensor brain model, including recent developments. The tensor brain has two major layers: the representation layer and the index layer. The representation layer is a model for the subsymbolic global workspace from consciousness research. The state of the representation layer is the cognitive brain state. The index layer contains symbols for concepts, time instances, and predicates. In a bottom-up operation, the cognitive brain state is encoded by the index layer as symbolic labels. In a top-down operation, symbols are decoded and written to the representation layer. This feeds to earlier processing layers as embodiment. The top-down operation became the basis for semantic memory. The embedding vector of a concept forms the connection weights between its index and the representation layer. The embedding is the signature or "DNA" of a concept, which is decoded by the brain when its index is activated. It integrates all that is known about a concept from different experiences, modalities, and symbolic decodings. Although being computational, it has been suggested that the tensor brain might be related to the actual operation of the brain. The sequential nature of symbol generation might have been a prerequisite to the generation of natural language. We describe an attention mechanism and discuss multitasking by multiplexing. We emphasize the inherent multimodality of the tensor brain. Finally, we discuss embedded and symbolic reasoning.
[LG-17] Hierarchical Gradient-Based Genetic Sampling for Accurate Prediction of Biological Oscillations
链接: https://arxiv.org/abs/2409.12816
作者: Heng Rao,Yu Gu,Jason Zipeng Zhang,Ge Yu,Yang Cao,Minghan Chen
关键词-EN: signaling processes crucial, living organisms, signaling processes, processes crucial, proper functioning
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Biological oscillations are periodic changes in various signaling processes crucial for the proper functioning of living organisms. These oscillations are modeled by ordinary differential equations, with coefficient variations leading to diverse periodic behaviors, typically measured by oscillatory frequencies. This paper explores sampling techniques for neural networks to model the relationship between system coefficients and oscillatory frequency. However, the scarcity of oscillations in the vast coefficient space results in many samples exhibiting non-periodic behaviors, and small coefficient changes near oscillation boundaries can significantly alter oscillatory properties. This leads to non-oscillatory bias and boundary sensitivity, making accurate predictions difficult. While existing importance and uncertainty sampling approaches partially mitigate these challenges, they either fail to resolve the sensitivity problem or result in redundant sampling. To address these limitations, we propose the Hierarchical Gradient-based Genetic Sampling (HGGS) framework, which improves the accuracy of neural network predictions for biological oscillations. The first layer, Gradient-based Filtering, extracts sensitive oscillation boundaries and removes redundant non-oscillatory samples, creating a balanced coarse dataset. The second layer, Multigrid Genetic Sampling, utilizes residual information to refine these boundaries and explore new high-residual regions, increasing data diversity for model training. Experimental results demonstrate that HGGS outperforms seven comparative sampling methods across four biological systems, highlighting its effectiveness in enhancing sampling and prediction accuracy.
[LG-18] Assessing the Zero-Shot Capabilities of LLMs for Action Evaluation in RL
链接: https://arxiv.org/abs/2409.12798
作者: Eduardo Pignatelli,Johan Ferret,Tim Rockäschel,Edward Grefenstette,Davide Paglieri,Samuel Coward,Laura Toni
关键词-EN: challenge in Reinforcement, Reinforcement Learning, Large Language Models, temporal credit assignment, credit assignment problem
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 9 pages
点击查看摘要
Abstract:The temporal credit assignment problem is a central challenge in Reinforcement Learning (RL), concerned with attributing the appropriate influence to each action in a trajectory for its ability to achieve a goal. However, when feedback is delayed and sparse, the learning signal is poor, and action evaluation becomes harder. Canonical solutions, such as reward shaping and options, require extensive domain knowledge and manual intervention, limiting their scalability and applicability. In this work, we lay the foundations for Credit Assignment with Language Models (CALM), a novel approach that leverages Large Language Models (LLMs) to automate credit assignment via reward shaping and options discovery. CALM uses LLMs to decompose a task into elementary subgoals and assess the achievement of these subgoals in state-action transitions. Every time an option terminates, a subgoal is achieved, and CALM provides an auxiliary reward. This additional reward signal can enhance the learning process when the task reward is sparse and delayed without the need for human-designed rewards. We provide a preliminary evaluation of CALM using a dataset of human-annotated demonstrations from MiniHack, suggesting that LLMs can be effective in assigning credit in zero-shot settings, without examples or LLM fine-tuning. Our preliminary results indicate that the knowledge of LLMs is a promising prior for credit assignment in RL, facilitating the transfer of human knowledge into value functions.
[LG-19] Efficient Identification of Direct Causal Parents via Invariance and Minimum Error Testing
链接: https://arxiv.org/abs/2409.12797
作者: Minh Nguyen,Mert R. Sabuncu
关键词-EN: Invariant causal prediction, exploiting distribution shifts, Invariant causal, invariance testing, distribution shifts
类目: Machine Learning (cs.LG)
*备注: Accepted at TMLR
点击查看摘要
Abstract:Invariant causal prediction (ICP) is a popular technique for finding causal parents (direct causes) of a target via exploiting distribution shifts and invariance testing (Peters et al., 2016). However, since ICP needs to run an exponential number of tests and fails to identify parents when distribution shifts only affect a few variables, applying ICP to practical large-scale problems is challenging. We propose MMSE-ICP and fastICP, two approaches which employ an error inequality to address the identifiability problem of ICP. The inequality states that the minimum prediction error of the predictor using causal parents is the smallest among all predictors which do not use descendants. fastICP is an efficient approximation tailored for large problems as it exploits the inequality and a heuristic to run fewer tests. MMSE-ICP and fastICP not only outperform competitive baselines in many simulations but also achieve state-of-the-art results on a large-scale real data benchmark.
[LG-20] Optimal or Greedy Decision Trees? Revisiting their Objectives, Tuning, and Performance
链接: https://arxiv.org/abs/2409.12788
作者: Jacobus G. M. van der Linden,Daniël Vos,Mathijs M. de Weerdt,Sicco Verwer,Emir Demirović
关键词-EN: optimal decision tree, information metric, traditionally trained, heuristics that locally, impurity or information
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Decision trees are traditionally trained using greedy heuristics that locally optimize an impurity or information metric. Recently there has been a surge of interest in optimal decision tree (ODT) methods that globally optimize accuracy directly. We identify two relatively unexplored aspects of ODTs: the objective function used in training trees and tuning techniques. Additionally, the value of optimal methods is not well understood yet, as the literature provides conflicting results, with some demonstrating superior out-of-sample performance of ODTs over greedy approaches, while others show the exact opposite. In this paper, we address these three questions: what objective to optimize in ODTs; how to tune ODTs; and how do optimal and greedy methods compare? Our experimental evaluation examines 13 objective functions, including four novel objectives resulting from our analysis, seven tuning methods, and six claims from the literature on optimal and greedy methods on 165 real and synthetic data sets. Through our analysis, both conceptually and experimentally, we discover new non-concave objectives, highlight the importance of proper tuning, support and refute several claims from the literature, and provide clear recommendations for researchers and practitioners on the usage of greedy and optimal methods, and code for future comparisons.
[LG-21] Investigation on domain adaptation of additive manufacturing monitoring systems to enhance digital twin reusability
链接: https://arxiv.org/abs/2409.12785
作者: Jiarui Xie,Zhuo Yang,Chun-Chun Hu,Haw-Ching Yang,Yan Lu,Yaoyao Fiona Zhao
关键词-EN: Powder bed fusion, metal additive manufacturing, emerging metal additive, enables rapid fabrication, Powder bed
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 pages, 7 figures, 3 tables. IEEE CASE 2024
点击查看摘要
Abstract:Powder bed fusion (PBF) is an emerging metal additive manufacturing (AM) technology that enables rapid fabrication of complex geometries. However, defects such as pores and balling may occur and lead to structural unconformities, thus compromising the mechanical performance of the part. This has become a critical challenge for quality assurance as the nature of some defects is stochastic during the process and invisible from the exterior. To address this issue, digital twin (DT) using machine learning (ML)-based modeling can be deployed for AM process monitoring and control. Melt pool is one of the most commonly observed physical phenomena for process monitoring, usually by high-speed cameras. Once labeled and preprocessed, the melt pool images are used to train ML-based models for DT applications such as process anomaly detection and print quality evaluation. Nonetheless, the reusability of DTs is restricted due to the wide variability of AM settings, including AM machines and monitoring instruments. The performance of the ML models trained using the dataset collected from one setting is usually compromised when applied to other settings. This paper proposes a knowledge transfer pipeline between different AM settings to enhance the reusability of AM DTs. The source and target datasets are collected from the National Institute of Standards and Technology and National Cheng Kung University with different cameras, materials, AM machines, and process parameters. The proposed pipeline consists of four steps: data preprocessing, data augmentation, domain alignment, and decision alignment. Compared with the model trained only using the source dataset, this pipeline increased the melt pool anomaly detection accuracy by 31% without any labeled training data from the target dataset.
[LG-22] The Robustness of Spiking Neural Networks in Communication and its Application towards Network Efficiency in Federated Learning
链接: https://arxiv.org/abs/2409.12769
作者: Manh V. Nguyen,Liang Zhao,Bobin Deng,William Severa,Honghui Xu,Shaoen Wu
关键词-EN: Spiking Neural Networks, Artificial Neural Networks, conventional Artificial Neural, Neural Networks, Spiking Neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: This paper has been accepted for publication at the 43rd IEEE International Performance Computing and Communications Conference (IPCCC 2024)
点击查看摘要
Abstract:Spiking Neural Networks (SNNs) have recently gained significant interest in on-chip learning in embedded devices and emerged as an energy-efficient alternative to conventional Artificial Neural Networks (ANNs). However, to extend SNNs to a Federated Learning (FL) setting involving collaborative model training, the communication between the local devices and the remote server remains the bottleneck, which is often restricted and costly. In this paper, we first explore the inherent robustness of SNNs under noisy communication in FL. Building upon this foundation, we propose a novel Federated Learning with Top-K Sparsification (FLTS) algorithm to reduce the bandwidth usage for FL training. We discover that the proposed scheme with SNNs allows more bandwidth savings compared to ANNs without impacting the model’s accuracy. Additionally, the number of parameters to be communicated can be reduced to as low as 6 percent of the size of the original model. We further improve the communication efficiency by enabling dynamic parameter compression during model training. Extensive experiment results demonstrate that our proposed algorithms significantly outperform the baselines in terms of communication cost and model accuracy and are promising for practical network-efficient FL with SNNs.
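Top-K sparsification, named in the abstract above, transmits only the largest-magnitude entries of a model update. The sketch below shows the generic technique in plain Python; `top_k_sparsify` and `densify` are hypothetical helper names, and the paper's FLTS algorithm may differ in detail (for example in how `k` is chosen or changed during training).

```python
def top_k_sparsify(update, k):
    """Keep the k largest-magnitude entries as (index, value) pairs."""
    order = sorted(range(len(update)), key=lambda i: abs(update[i]), reverse=True)
    kept = sorted(order[:k])                 # restore index order for transmission
    return [(i, update[i]) for i in kept]


def densify(pairs, size):
    """Rebuild a dense vector on the receiver; dropped entries become zero."""
    dense = [0.0] * size
    for i, value in pairs:
        dense[i] = value
    return dense


update = [0.02, -1.5, 0.3, 0.0, 2.1, -0.07]    # hypothetical local model update
sparse = top_k_sparsify(update, k=2)           # what a client would send
restored = densify(sparse, len(update))        # what the server reconstructs
```

Here only two of six values are communicated, along with their indices, which is the source of the bandwidth saving.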
[LG-23] Enhancing Synthetic Training Data for Speech Commands: From ASR-Based Filtering to Domain Adaptation in SSL Latent Space
链接: https://arxiv.org/abs/2409.12745
作者: Sebastião Quintas,Isabelle Ferrané,Thomas Pellegrini
关键词-EN: gaining increasing popularity, automatic speech recognition, augmentation is gaining, gaining increasing, increasing popularity
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:
点击查看摘要
Abstract:The use of synthetic speech as data augmentation is gaining increasing popularity in fields such as automatic speech recognition and speech classification tasks. Despite novel text-to-speech systems with voice cloning capabilities that allow the usage of a larger number of voices based on short audio segments, it is known that these systems tend to hallucinate and oftentimes produce bad data that will most likely have a negative impact on the downstream task. In the present work, we conduct a set of experiments around zero-shot learning with synthetic speech data for the specific task of speech commands classification. Our results on the Google Speech Commands dataset show that a simple ASR-based filtering method can have a big impact on the quality of the generated data, translating to a better performance. Furthermore, despite the good quality of the generated speech data, we also show that synthetic and real speech can still be easily distinguishable when using self-supervised (WavLM) features, an aspect further explored with a CycleGAN to bridge the gap between the two types of speech material.
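An ASR-based filter of the kind described can be sketched as follows: a synthetic clip is kept only if a recognizer's transcript matches the command it was meant to contain. Everything here is illustrative. The exact-match criterion, the `filter_synthetic` helper, and the dictionary standing in for a real ASR model are all assumptions, since the abstract does not specify the filtering rule.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def filter_synthetic(samples, transcribe):
    """Keep (audio, label) pairs whose transcript equals the intended label.

    `transcribe` is any callable mapping audio to text; a real ASR model
    would be plugged in here.
    """
    return [
        (audio, label)
        for audio, label in samples
        if normalize(transcribe(audio)) == normalize(label)
    ]


# Stand-in for an ASR system: maps hypothetical clip ids to transcripts.
fake_transcripts = {"clip_a": "Yes", "clip_b": "no", "clip_c": "yes please"}

samples = [("clip_a", "yes"), ("clip_b", "yes"), ("clip_c", "yes")]
kept = filter_synthetic(samples, fake_transcripts.get)
```

With this stand-in, only `clip_a` survives: `clip_b` was recognized as a different word and `clip_c` contains extra hallucinated content.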
[LG-24] SeqRisk: Transformer-augmented latent variable model for improved survival prediction with longitudinal data
Link: https://arxiv.org/abs/2409.12709
Authors: Mine Öğretir,Miika Koskinen,Juha Sinisalo,Risto Renkonen,Harri Lähdesmäki
Keywords: survival analysis, long time, time been based, based on survival, patient outcomes
Categories: Machine Learning (cs.LG)
Abstract:In healthcare, risk assessment of different patient outcomes has for a long time been based on survival analysis, i.e., modeling time-to-event associations. However, conventional approaches rely on data from a single time-point, making them suboptimal for fully leveraging longitudinal patient history and capturing temporal regularities. Focusing on clinical real-world data and acknowledging its challenges, we utilize latent variable models to effectively handle irregular, noisy, and sparsely observed longitudinal data. We propose SeqRisk, a method that combines variational autoencoder (VAE) or longitudinal VAE (LVAE) with a transformer encoder and Cox proportional hazards module for risk prediction. SeqRisk captures long-range interactions, improves patient trajectory representations, enhances predictive accuracy and generalizability, as well as provides partial explainability for sample population characteristics in attempts to identify high-risk patients. We demonstrate that SeqRisk performs competitively compared to existing approaches on both simulated and real-world datasets.
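For readers unfamiliar with the Cox proportional hazards component, the sketch below computes the standard negative log partial likelihood from per-patient risk scores. It assumes no tied event times and is not taken from the SeqRisk code; in a model like the one described, the risk scores would come from the encoder, whereas here they are plain numbers.

```python
import math


def cox_neg_log_partial_likelihood(risk_scores, times, events):
    """Negative log partial likelihood of the Cox model (no tie handling).

    risk_scores: predicted log-risk per patient
    times:       observed time per patient
    events:      1 if the event was observed, 0 if censored
    """
    total = 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if not e_i:
            continue  # censored patients only appear in risk sets
        # Risk set: everyone still under observation at time t_i.
        log_denom = math.log(
            sum(math.exp(r) for r, t in zip(risk_scores, times) if t >= t_i)
        )
        total -= risk_scores[i] - log_denom
    return total


# Three patients with events at times 1, 2, 3. Scoring the earliest
# event as highest risk gives a lower loss than scoring all equally.
uninformed = cox_neg_log_partial_likelihood([0.0, 0.0, 0.0], [1, 2, 3], [1, 1, 1])
informed = cox_neg_log_partial_likelihood([2.0, 1.0, 0.0], [1, 2, 3], [1, 1, 1])
```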
[LG-25] Generation and Editing of Mandrill Faces: Application to Sex Editing and Assessment
Link: https://arxiv.org/abs/2409.12705
Authors: Nicolas M. Dibot,Julien P. Renoult,William Puech
Keywords: recent years, enhancing the realism, major developments, developments in recent, realism of synthetic
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract:Generative AI has seen major developments in recent years, enhancing the realism of synthetic images, also known as computer-generated images. In addition, generative AI has also made it possible to modify specific image characteristics through image editing. Previous work has developed methods based on generative adversarial networks (GAN) for generating realistic images, in particular faces, and also for modifying specific features. However, this work has never been applied to specific animal species. Moreover, the assessment of the results has been generally done subjectively, rather than quantitatively. In this paper, we propose an approach based on methods for generating images of faces of male or female mandrills, a non-human primate. The main novelty of the proposed method is the ability to edit their sex by identifying a sex axis in the latent space of a specific GAN. In addition, we have developed an assessment of the sex levels based on statistical features extracted from real image distributions. The experimental results we obtained from a specific database are not only realistic, but also accurate, meeting a need for future work in behavioral experiments with wild mandrills.
[LG-26] PromSec: Prompt Optimization for Secure Generation of Functional Source Code with Large Language Models (LLMs) CCS2024
Link: https://arxiv.org/abs/2409.12699
Authors: Mahmoud Nazzal,Issa Khalil,Abdallah Khreishah,NhatHai Phan
Keywords: high-quality source code, large language models, generating high-quality source, code, high-quality source
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments: 15 pages, 19 figures, CCS 2024
Abstract:The capability of generating high-quality source code using large language models (LLMs) reduces software development time and costs. However, they often introduce security vulnerabilities due to training on insecure open-source data. This highlights the need for ensuring secure and functional code generation. This paper introduces PromSec, an algorithm for prompt optimization for secure and functioning code generation using LLMs. In PromSec, we combine 1) code vulnerability clearing using a generative adversarial graph neural network, dubbed as gGAN, to fix and reduce security vulnerabilities in generated codes and 2) code generation using an LLM into an interactive loop, such that the outcome of the gGAN drives the LLM with enhanced prompts to generate secure codes while preserving their functionality. Introducing a new contrastive learning approach in gGAN, we formulate code-clearing and generation as a dual-objective optimization problem, enabling PromSec to notably reduce the number of LLM inferences. PromSec offers a cost-effective and practical solution for generating secure, functional code. Extensive experiments conducted on Python and Java code datasets confirm that PromSec effectively enhances code security while upholding its intended functionality. Our experiments show that while a state-of-the-art approach fails to address all code vulnerabilities, PromSec effectively resolves them. Moreover, PromSec achieves more than an order-of-magnitude reduction in operation time, number of LLM queries, and security analysis costs. Furthermore, prompts optimized with PromSec for a certain LLM are transferable to other LLMs across programming languages and generalizable to unseen vulnerabilities in training. This study is a step in enhancing the trustworthiness of LLMs for secure and functional code generation, supporting their integration into real-world software development.
[LG-27] (Un)certainty of (Un)fairness: Preference-Based Selection of Certainly Fair Decision-Makers ECAI2024
Link: https://arxiv.org/abs/2409.12677
Authors: Manh Khoi Duong,Stefan Conrad
Keywords: real-world applications, including machine learning, bias in decision-making, Fairness metrics, traditional fairness metrics
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted in 27TH EUROPEAN CONFERENCE ON ARTIFICIAL INTELLIGENCE (ECAI 2024)
Abstract:Fairness metrics are used to assess discrimination and bias in decision-making processes across various domains, including machine learning models and human decision-makers in real-world applications. This involves calculating the disparities between probabilistic outcomes among social groups, such as acceptance rates between male and female applicants. However, traditional fairness metrics do not account for the uncertainty in these processes and lack of comparability when two decision-makers exhibit the same disparity. Using Bayesian statistics, we quantify the uncertainty of the disparity to enhance discrimination assessments. We represent each decision-maker, whether a machine learning model or a human, by its disparity and the corresponding uncertainty in that disparity. We define preferences over decision-makers and utilize brute-force to choose the optimal decision-maker according to a utility function that ranks decision-makers based on these preferences. The decision-maker with the highest utility score can be interpreted as the one for whom we are most certain that it is fair.
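To make the point about uncertainty concrete, the sketch below treats each group's acceptance rate as a Beta posterior and reports both the disparity and its variance. The Beta-Binomial model with a uniform prior is an assumption chosen for illustration. The abstract does not spell out the paper's exact model or utility function, and the function names here are hypothetical.

```python
def beta_posterior(accepted, total, prior_a=1.0, prior_b=1.0):
    """Mean and variance of a Beta posterior over an acceptance rate."""
    a = prior_a + accepted
    b = prior_b + (total - accepted)
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var


def disparity_with_uncertainty(group_a, group_b):
    """Difference in posterior mean acceptance rates and its variance.

    Each group is an (accepted, total) pair; groups are treated as independent,
    so the variance of the difference is the sum of the two variances.
    """
    mean_a, var_a = beta_posterior(*group_a)
    mean_b, var_b = beta_posterior(*group_b)
    return mean_a - mean_b, var_a + var_b


# Two decision-makers with the same observed rates (60% vs 40%),
# but very different amounts of evidence behind them.
small = disparity_with_uncertainty((6, 10), (4, 10))
large = disparity_with_uncertainty((600, 1000), (400, 1000))
```

The second decision-maker has a far smaller variance, which is the kind of distinction a plain disparity metric cannot express.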
[LG-28] Deep generative models as an adversarial attack strategy for tabular machine learning ICMLC2024
Link: https://arxiv.org/abs/2409.12642
Authors: Salijona Dyrmishi,Mihaela Cătălina Stoian,Eleonora Giunchiglia,Maxime Cordy
Keywords: Deep Generative Models, Deep Generative, Generative Models, machine learning, found application
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at ICMLC 2024 (International Conference on Machine Learning and Cybernetics)
Abstract:Deep Generative Models (DGMs) have found application in computer vision for generating adversarial examples to test the robustness of machine learning (ML) systems. Extending these adversarial techniques to tabular ML presents unique challenges due to the distinct nature of tabular data and the necessity to preserve domain constraints in adversarial examples. In this paper, we adapt four popular tabular DGMs into adversarial DGMs (AdvDGMs) and evaluate their effectiveness in generating realistic adversarial examples that conform to domain constraints.
[LG-29] Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries
Link: https://arxiv.org/abs/2409.12640
Authors: Kiran Vodrahalli,Santiago Ontanon,Nilesh Tripuraneni,Kelvin Xu,Sanil Jain,Rakesh Shivanna,Jeffrey Hui,Nishanth Dikkala,Mehran Kazemi,Bahare Fatemi,Rohan Anil,Ethan Dyer,Siamak Shakeri,Roopali Vij,Harsh Mehta,Vinay Ramasesh,Quoc Le,Ed Chi,Yifeng Lu,Orhan Firat,Angeliki Lazaridou,Jean-Baptiste Lespiau,Nithya Attaluri,Kate Olszewska
Keywords: introduce Michelangelo, unleaked long-context reasoning, automatically score, easy to automatically, large language models
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract:We introduce Michelangelo: a minimal, synthetic, and unleaked long-context reasoning evaluation for large language models which is also easy to automatically score. This evaluation is derived via a novel, unifying framework for evaluations over arbitrarily long contexts which measure the model’s ability to do more than retrieve a single piece of information from its context. The central idea of the Latent Structure Queries framework (LSQ) is to construct tasks which require a model to “chisel away” the irrelevant information in the context, revealing a latent structure in the context. To verify a model’s understanding of this latent structure, we query the model for details of the structure. Using LSQ, we produce three diagnostic long-context evaluations across code and natural-language domains intended to provide a stronger signal of long-context language model capabilities. We perform evaluations on several state-of-the-art models and demonstrate both that a) the proposed evaluations are high-signal and b) that there is significant room for improvement in synthesizing long-context information.
[LG-30] Image inpainting for corrupted images by using the semi-super resolution GAN
Link: https://arxiv.org/abs/2409.12636
Authors: Mehrshad Momen-Tayefeh,Mehrdad Momen-Tayefeh,Amir Ali Ghafourian Ghahramani
Keywords: Generative Adversarial Network, valuable technique, technique for enhancing, Image inpainting, enhancing images
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Abstract:Image inpainting is a valuable technique for enhancing images that have been corrupted. The primary challenge in this research revolves around the extent of corruption in the input image that the deep learning model must restore. To address this challenge, we introduce a Generative Adversarial Network (GAN) for learning and replicating the missing pixels. Additionally, we have developed a distinct variant of the Super-Resolution GAN (SRGAN), which we refer to as the Semi-SRGAN (SSRGAN). Furthermore, we leveraged three diverse datasets to assess the robustness and accuracy of our proposed model. Our training process involves varying levels of pixel corruption to attain optimal accuracy and generate high-quality images.
[LG-31] Exploring bat song syllable representations in self-supervised audio encoders
Link: https://arxiv.org/abs/2409.12634
Authors: Marianne de Heer Kloots,Mirjam Knörnschild
Keywords: human-generated sounds distinguish, species’ vocalization types, trained on human-generated, human-generated sounds, sounds distinguish
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Presented at VIHAR-2024; see this https URL
Abstract:How well can deep learning models trained on human-generated sounds distinguish between another species’ vocalization types? We analyze the encoding of bat song syllables in several self-supervised audio encoders, and find that models pre-trained on human speech generate the most distinctive representations of different syllable types. These findings form first steps towards the application of cross-species transfer learning in bat bioacoustics, as well as an improved understanding of out-of-distribution signal processing in audio encoder models.
[LG-32] Counterfactual Explanations for Clustering Models
Link: https://arxiv.org/abs/2409.12632
Authors: Aurora Spagnol,Kacper Sokol,Pietro Barbiero,Marc Langheinrich,Martin Gjoreski
Keywords: lack technical expertise, complex optimisation processes, technical expertise, rely on complex, complex optimisation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Abstract:Clustering algorithms rely on complex optimisation processes that may be difficult to comprehend, especially for individuals who lack technical expertise. While many explainable artificial intelligence techniques exist for supervised machine learning, unsupervised learning – and clustering in particular – has been largely neglected. To complicate matters further, the notion of a “true” cluster is inherently challenging to define. These facets of unsupervised learning and its explainability make it difficult to foster trust in such methods and curtail their adoption. To address these challenges, we propose a new, model-agnostic technique for explaining clustering algorithms with counterfactual statements. Our approach relies on a novel soft-scoring method that captures the spatial information utilised by clustering models. It builds upon a state-of-the-art Bayesian counterfactual generator for supervised learning to deliver high-quality explanations. We evaluate its performance on five datasets and two clustering algorithms, and demonstrate that introducing soft scores to guide counterfactual search significantly improves the results.
[LG-33] Green Federated Learning: A new era of Green Aware AI
Link: https://arxiv.org/abs/2409.12626
Authors: Dipanwita Thakur,Antonella Guzzo,Giancarlo Fortino
Keywords: large-scale wireless networks, wireless networks, growing exponentially, alongside the size, large-scale wireless
Categories: Machine Learning (cs.LG)
Abstract:The development of AI applications, especially in large-scale wireless networks, is growing exponentially, alongside the size and complexity of the architectures used. Particularly, machine learning is acknowledged as one of today’s most energy-intensive computational applications, posing a significant challenge to the environmental sustainability of next-generation intelligent systems. Achieving environmental sustainability entails ensuring that every AI algorithm is designed with sustainability in mind, integrating green considerations from the architectural phase onwards. Recently, Federated Learning (FL), with its distributed nature, presents new opportunities to address this need. Hence, it’s imperative to elucidate the potential and challenges stemming from recent FL advancements and their implications for sustainability. Moreover, it’s crucial to furnish researchers, stakeholders, and interested parties with a roadmap to navigate and understand existing efforts and gaps in green-aware AI algorithms. This survey primarily aims to achieve this objective by identifying and analyzing over a hundred FL works, assessing their contributions to green-aware artificial intelligence for sustainable environments, with a specific focus on IoT research. It delves into current issues in green federated learning from an energy-efficient standpoint, discussing potential challenges and future prospects for green IoT application research.
[LG-34] Iteration of Thought: Leveraging Inner Dialogue for Autonomous Large Language Model Reasoning
Link: https://arxiv.org/abs/2409.12618
Authors: Santosh Kumar Radha,Yasamin Nouri Jelyani,Ara Ghukasyan,Oktay Goktas
Keywords: large language models, advanced language processing, language processing power, Iterative human engagement, language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Abstract:Iterative human engagement is a common and effective means of leveraging the advanced language processing power of large language models (LLMs). Using well-structured prompts in a conversational manner, human users can effectively influence an LLM to develop more thoughtful and accurate responses. Motivated by this insight, we propose the Iteration of Thought (IoT) framework for enhancing LLM responses by generating “thought”-provoking prompts vis-à-vis an input query and the current iteration of an LLM’s response. Unlike static or semi-static approaches, e.g. Chain of Thought (CoT) or Tree of Thoughts (ToT), IoT adapts its reasoning path dynamically, based on evolving context, and without generating alternate explorative thoughts which are ultimately discarded. The three components of the IoT framework are (1) an Inner Dialogue Agent (IDA) responsible for generating instructive, context-specific prompts; (2) an LLM Agent (LLMA) that processes these prompts to refine its responses; and (3) an iterative prompting loop that implements a conversation between the former two components. We introduce two variants of our framework: Autonomous Iteration of Thought (AIoT), where an LLM decides when to stop iterating, and Guided Iteration of Thought (GIoT), which always forces a fixed number of iterations. We investigate the performance of IoT across various datasets, spanning complex reasoning tasks from the GPQA dataset, explorative problem-solving in Game of 24, puzzle solving in Mini Crosswords, and multi-hop question answering from the HotpotQA dataset. Our results show that IoT represents a viable paradigm for autonomous response refinement in LLMs, showcasing significant improvements over CoT and thereby enabling more adaptive and efficient reasoning systems that minimize human intervention.
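The three components listed in the abstract (an Inner Dialogue Agent, an LLM Agent, and the loop between them) suggest a control flow along the following lines. This is only a sketch of that loop: `inner_dialogue_agent`, `llm_agent`, and `should_stop` are hypothetical callables standing in for model calls, and the toy agents below exist purely to make the example runnable.

```python
def iteration_of_thought(query, inner_dialogue_agent, llm_agent,
                         max_iters=5, should_stop=None):
    """Refine a response by alternating prompt generation and answering.

    With should_stop=None the loop always runs max_iters times (guided style);
    with a should_stop callable it may end early (autonomous style).
    """
    response = llm_agent(query)
    for _ in range(max_iters):
        if should_stop is not None and should_stop(query, response):
            break
        # The inner dialogue agent sees the query and the current answer
        # and produces the next context-specific prompt.
        prompt = inner_dialogue_agent(query, response)
        response = llm_agent(prompt)
    return response


# Toy stand-ins: each "refinement" just appends a '+' to the answer.
def toy_ida(query, response):
    return "Refine: " + response


def toy_llm(prompt):
    return prompt.replace("Refine: ", "") + "+"


guided = iteration_of_thought("q", toy_ida, toy_llm, max_iters=3)
autonomous = iteration_of_thought(
    "q", toy_ida, toy_llm, max_iters=3,
    should_stop=lambda q, r: len(r) >= 3,
)
```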
[LG-35] CF-GO-Net: A Universal Distribution Learner via Characteristic Function Networks with Graph Optimizers
Link: https://arxiv.org/abs/2409.12610
Authors: Zeyang Yu,Shengxi Li,Danilo Mandic
Keywords: resemble real data, statistically resemble real, real data, aim to learn, generate samples
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Abstract:Generative models aim to learn the distribution of datasets, such as images, so as to be able to generate samples that statistically resemble real data. However, learning the underlying probability distribution can be very challenging and intractable. To this end, we introduce an approach which employs the characteristic function (CF), a probabilistic descriptor that directly corresponds to the distribution. However, unlike the probability density function (pdf), the characteristic function not only always exists, but also provides an additional degree of freedom, and hence enhances flexibility in learning distributions. This removes the critical dependence on pdf-based assumptions, which limit the applicability of traditional methods. While several works have attempted to use CF in generative modeling, they often impose strong constraints on the training process. In contrast, our approach calculates the distance between query points in the CF domain, which is an unconstrained and well defined problem. Next, to deal with the sampling strategy, which is crucial to model performance, we propose a graph neural network (GNN)-based optimizer for the sampling process, which identifies regions where the difference between CFs is most significant. In addition, our method allows the use of a pre-trained model, such as a well-trained autoencoder, and is capable of learning directly in its feature space, without modifying its parameters. This offers a flexible and robust approach to generative modeling that not only provides broader applicability and improved performance, but also equips any latent space world with the ability to become a generative model.
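A characteristic function can be estimated directly from samples, which is what makes a CF-domain distance computable without any density assumption. The sketch below does this for one-dimensional samples and averages the squared gap over a hand-picked set of query points. The paper's GNN-based selection of query points is not reproduced here; the fixed `query_points` list and the function names are assumptions for illustration.

```python
import cmath


def empirical_cf(samples, t):
    """Empirical characteristic function: the mean of exp(i * t * x)."""
    return sum(cmath.exp(1j * t * x) for x in samples) / len(samples)


def cf_distance(samples_p, samples_q, query_points):
    """Mean squared gap between two empirical CFs at the given query points."""
    gaps = [
        abs(empirical_cf(samples_p, t) - empirical_cf(samples_q, t)) ** 2
        for t in query_points
    ]
    return sum(gaps) / len(gaps)


real = [0.0, 0.5, 1.0, 1.5]
shifted = [x + 2.0 for x in real]
query_points = [0.25, 0.5, 1.0]

same = cf_distance(real, real, query_points)
different = cf_distance(real, shifted, query_points)
```

In a generative model, `samples_q` would be generator outputs and the distance would serve as the training loss.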
[LG-36] Hybrid Ensemble Deep Graph Temporal Clustering for Spatiotemporal Data
Link: https://arxiv.org/abs/2409.12590
Authors: Francis Ndikum Nji,Omar Faruque,Mostafa Cham,Janeja Vandana,Jianwu Wang
Keywords: Classifying subsets based, Classifying subsets, inherent spatial, spatiotemporal data, multivariate spatiotemporal data
Categories: Machine Learning (cs.LG)
Comments: 10 pages
Abstract:Classifying subsets based on spatial and temporal features is crucial to the analysis of spatiotemporal data given the inherent spatial and temporal variability. Since no single clustering algorithm ensures optimal results, researchers have increasingly explored the effectiveness of ensemble approaches. Ensemble clustering has attracted much attention due to increased diversity, better generalization, and overall improved clustering performance. While ensemble clustering may yield promising results on simple datasets, it has not been fully explored on complex multivariate spatiotemporal data. For our contribution to this field, we propose a novel hybrid ensemble deep graph temporal clustering (HEDGTC) method for multivariate spatiotemporal data. HEDGTC integrates homogeneous and heterogeneous ensemble methods and adopts a dual consensus approach to address noise and misclassification from traditional clustering. It further applies a graph attention autoencoder network to improve clustering performance and stability. When evaluated on three real-world multivariate spatiotemporal datasets, HEDGTC outperforms state-of-the-art ensemble clustering models by showing improved performance and stability with consistent results. This indicates that HEDGTC can effectively capture implicit temporal patterns in complex spatiotemporal data.
[LG-37] Deep Transfer Hashing for Adaptive Learning on Federated Streaming Data ECML2024
Link: https://arxiv.org/abs/2409.12575
Authors: Manuel Röder,Frank-Michael Schleif
Keywords: extended abstract explores, evolving data streams, deep transfer hashing, emphasizing resource-efficient client, resource-efficient client training
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Presented at ECML2024: 8th Intl. Worksh. and Tutorial on Interactive Adaptive Learning, Sep. 9th, 2024, Vilnius, Lithuania
Abstract:This extended abstract explores the integration of federated learning with deep transfer hashing for distributed prediction tasks, emphasizing resource-efficient client training from evolving data streams. Federated learning allows multiple clients to collaboratively train a shared model while maintaining data privacy - by incorporating deep transfer hashing, high-dimensional data can be converted into compact hash codes, reducing data transmission size and network loads. The proposed framework utilizes transfer learning, pre-training deep neural networks on a central server, and fine-tuning on clients to enhance model accuracy and adaptability. A selective hash code sharing mechanism using a privacy-preserving global memory bank further supports client fine-tuning. This approach addresses challenges in previous research by improving computational efficiency and scalability. Practical applications include Car2X event predictions, where a shared model is collectively trained to recognize traffic patterns, aiding in tasks such as traffic density assessment and accident detection. The research aims to develop a robust framework that combines federated learning, deep transfer hashing and transfer learning for efficient and secure downstream task execution.
[LG-38] Scaling FP8 training to trillion-token LLMs
Link: https://arxiv.org/abs/2409.12517
Authors: Maxim Fishman,Brian Chmiel,Ron Banner,Daniel Soudry
Keywords: large language models, trillion tokens, large language, increase over previous, previous limits
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:We train, for the first time, large language models using FP8 precision on datasets up to 2 trillion tokens – a 20-fold increase over previous limits. Through these extended training runs, we uncover critical instabilities in FP8 training that were not observable in earlier works with shorter durations. We trace these instabilities to outlier amplification by the SwiGLU activation function. Interestingly, we show, both analytically and empirically, that this amplification happens only over prolonged training periods, and link it to a SwiGLU weight alignment process. To address this newly identified issue, we introduce Smooth-SwiGLU, a novel modification that ensures stable FP8 training without altering function behavior. We also demonstrate, for the first time, FP8 quantization of both Adam optimizer moments. Combining these innovations, we successfully train a 7B parameter model using FP8 precision on 256 Intel Gaudi2 accelerators, achieving on-par results with the BF16 baseline while delivering up to a ~34% throughput improvement.
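To see why weight alignment in SwiGLU can amplify outliers, it helps to look at a single SwiGLU neuron, which multiplies a Swish-gated projection by a linear projection of the same input. When the two weight vectors point the same way, the output grows roughly quadratically with the input. The sketch below shows only this standard SwiGLU behavior on toy values. Smooth-SwiGLU itself is not reproduced, since the abstract does not define it.

```python
import math


def swish(z):
    """Swish (SiLU) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def swiglu_neuron(x, w_gate, w_linear):
    """One SwiGLU output unit: swish(w_gate . x) * (w_linear . x)."""
    gate = sum(a * b for a, b in zip(x, w_gate))
    linear = sum(a * b for a, b in zip(x, w_linear))
    return swish(gate) * linear


x = [10.0, 0.0]
# Aligned weights: both projections see the same large value, so the
# product is close to 10 * 10 = 100.
aligned = swiglu_neuron(x, w_gate=[1.0, 0.0], w_linear=[1.0, 0.0])
# Orthogonal weights: the linear branch sees nothing, so the output is 0.
orthogonal = swiglu_neuron(x, w_gate=[1.0, 0.0], w_linear=[0.0, 1.0])
```

Values of this size are easy to represent in BF16 but can exceed what a narrow FP8 format handles well, which is consistent with the instability the abstract reports.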
[LG-39] ConvexECG: Lightweight and Explainable Neural Networks for Personalized Continuous Cardiac Monitoring
Link: https://arxiv.org/abs/2409.12493
Authors: Rayan Ansari,John Cao,Sabyasachi Bandyopadhyay,Sanjiv M. Narayan,Albert J. Rogers,Mert Pilanci
Keywords: reconstructing six-lead electrocardiograms, continuous cardiac monitoring, six-lead electrocardiograms, aimed at advancing, resource-efficient method
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP); Optimization and Control (math.OC)
Abstract:We present ConvexECG, an explainable and resource-efficient method for reconstructing six-lead electrocardiograms (ECG) from single-lead data, aimed at advancing personalized and continuous cardiac monitoring. ConvexECG leverages a convex reformulation of a two-layer ReLU neural network, enabling the potential for efficient training and deployment in resource constrained environments, while also having deterministic and explainable behavior. Using data from 25 patients, we demonstrate that ConvexECG achieves accuracy comparable to larger neural networks while significantly reducing computational overhead, highlighting its potential for real-time, low-resource monitoring applications.
[LG-40] CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs
Link: https://arxiv.org/abs/2409.12490
Authors: Junlin Lv,Yuan Feng,Xike Xie,Xin Jia,Qirong Peng,Guiming Xie
Keywords: Large language models, achieved notable success, Large language, quadratic computation complexity, language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:Large language models have achieved notable success across various domains, yet efficient inference is still limited by the quadratic computation complexity of the attention mechanism. The inference consists of prefilling and decoding phases. Although several attempts have been made to accelerate decoding, the inefficiency of the prefilling phase, especially for long-context tasks, remains a challenge. In this paper, we observe a locality in query criticality during the prefilling phase of long-context processing: adjacent query tokens tend to focus on similar subsets of the past Key-Value (KV) cache. Based on this observation, we propose CritiPrefill, a criticality-based segment-wise prefilling method. This method partitions the input sequence’s queries and KV cache into segments and blocks, utilizing a segment-wise algorithm to estimate the query criticality. By pruning non-critical computations between query segments and cache blocks in the self-attention mechanism, the prefilling process can be significantly accelerated. Extensive evaluations on multiple long-context datasets show up to 2.7x speedup on Llama3-8B and 3.0x speedup on Yi-9B for 128K context length on a single A100 GPU, with minimal quality degradation.
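The segment-and-block idea can be sketched as follows: group queries into segments and keys into blocks, score each (segment, block) pair cheaply, and keep only the top-scoring blocks per segment for full attention. The scoring rule used here (dot product of mean vectors) is an assumption chosen for simplicity, not the estimator from the paper, and causal masking is ignored. All function names are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def chunk(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def critical_blocks(queries, keys, segment_size, block_size, keep):
    """For each query segment, return the indices of the `keep` key blocks
    with the highest estimated criticality."""
    seg_reps = [mean_vector(s) for s in chunk(queries, segment_size)]
    blk_reps = [mean_vector(b) for b in chunk(keys, block_size)]
    selected = []
    for q in seg_reps:
        scores = [sum(a * b for a, b in zip(q, k)) for k in blk_reps]
        top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        selected.append(sorted(top[:keep]))
    return selected


# Four 2-d queries and keys; each query segment matches one key block.
queries = [[1, 0], [1, 0], [0, 1], [0, 1]]
keys = [[1, 0], [1, 0], [0, 1], [0, 1]]
plan = critical_blocks(queries, keys, segment_size=2, block_size=2, keep=1)
```

Attention would then be computed only between each query segment and its selected blocks, which is where the prefilling speedup would come from.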
[LG-41] Learning Multi-Manifold Embedding for Out-Of-Distribution Detection ECCV2024
Link: https://arxiv.org/abs/2409.12479
Authors: Jeng-Lin Li,Ming-Ching Chang,Wei-Chao Chen
Keywords: OOD, real-world applications, crucial for trustworthy, Detecting, OOD samples
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: European Conference on Computer Vision ECCV 2024 BEW Workshop Best Paper
Abstract:Detecting out-of-distribution (OOD) samples is crucial for trustworthy AI in real-world applications. Leveraging recent advances in representation learning and latent embeddings, various scoring algorithms estimate distributions beyond the training data. However, a single embedding space falls short in characterizing in-distribution data and defending against diverse OOD conditions. This paper introduces a novel Multi-Manifold Embedding Learning (MMEL) framework, optimizing hypersphere and hyperbolic spaces jointly for enhanced OOD detection. MMEL generates representative embeddings and employs a prototype-aware scoring function to differentiate OOD samples. It operates with very few OOD samples and requires no model retraining. Experiments on six open datasets demonstrate MMEL’s significant reduction in FPR while maintaining a high AUC compared to state-of-the-art distance-based OOD detection methods. We analyze the effects of learning multiple manifolds and visualize OOD score distributions across datasets. Notably, enrolling ten OOD samples without retraining achieves comparable FPR and AUC to modern outlier exposure methods using 80 million outlier samples for model training.
[LG-42] ViolinDiff: Enhancing Expressive Violin Synthesis with Pitch Bend Conditioning
Link: https://arxiv.org/abs/2409.12477
Authors: Daewoong Kim,Hao-Wen Dong,Dasaem Jeong
Keywords: fundamental frequency, plays a critical, critical role, natural contour, music audio synthesis
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Abstract:Modeling the natural contour of fundamental frequency (F0) plays a critical role in music audio synthesis. However, transcribing and managing multiple F0 contours in polyphonic music is challenging, and explicit F0 contour modeling has not yet been explored for polyphonic instrumental synthesis. In this paper, we present ViolinDiff, a two-stage diffusion-based synthesis framework. For a given violin MIDI file, the first stage estimates the F0 contour as pitch bend information, and the second stage generates a mel spectrogram incorporating these expressive details. The quantitative metrics and listening test results show that the proposed model generates more realistic violin sounds than the model without explicit pitch bend modeling. Audio samples are available online: this http URL.
[LG-43] Familiarity-aware Evidence Compression for Retrieval Augmented Generation
Link: https://arxiv.org/abs/2409.12468
Authors: Dongwon Jung,Qin Liu,Tenghao Huang,Ben Zhou,Muhao Chen
Keywords: Retrieval Augmented Generation, Augmented Generation, Retrieval Augmented, improves large language, large language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract:Retrieval Augmented Generation (RAG) improves large language models (LMs) by incorporating non-parametric knowledge through evidence retrieval from external sources. However, it often struggles to filter out inconsistent and irrelevant information that can distract the LM from its tasks. While compressing the retrieved evidence with a compression model aims to address this issue, the compressed evidence may still be unfamiliar to the target model used for the downstream task, which may then fail to utilize the evidence effectively. We propose FaviComp (Familiarity-aware Evidence Compression), a novel training-free evidence compression technique that makes retrieved evidence more familiar to the target model, while seamlessly integrating parametric knowledge from the model. Specifically, FaviComp proactively lowers the perplexity of the compressed evidence with regard to the target model by combining token probabilities from both the compression model and the target model to generate context that is more familiar to the target model. This approach balances the integration of parametric and non-parametric knowledge, which is especially helpful in complex tasks where the retrieved evidence set may not contain all the necessary information. Experimental results demonstrate that FaviComp consistently outperforms existing baselines in multiple open-domain QA datasets, achieving high compression rates and showcasing the effective integration of both parametric and non-parametric knowledge.
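The phrase "combining token probabilities from both the compression model and the target model" can be illustrated with a single decoding step. The sketch below picks the next token by a weighted sum of log-probabilities from the two models. The weight `alpha`, the log-linear combination rule, and the dictionary representation of the distributions are all assumptions for illustration. The abstract does not give the exact formula.

```python
import math


def combine_next_token(comp_probs, target_probs, alpha=0.5):
    """Pick the next token by mixing two models' log-probabilities.

    comp_probs / target_probs: dicts mapping token -> probability (> 0).
    alpha: weight given to the target model (0 = compression model only).
    """
    vocab = comp_probs.keys() & target_probs.keys()
    scores = {
        tok: (1 - alpha) * math.log(comp_probs[tok])
        + alpha * math.log(target_probs[tok])
        for tok in vocab
    }
    return max(scores, key=scores.get)


# The compression model slightly prefers a wording the target model finds
# unlikely; mixing in the target model switches to the familiar wording.
comp = {"Lutetia": 0.50, "Paris": 0.45, "x": 0.05}
target = {"Lutetia": 0.05, "Paris": 0.60, "x": 0.35}

compression_only = combine_next_token(comp, target, alpha=0.0)
mixed = combine_next_token(comp, target, alpha=0.5)
```

Repeating this step token by token would produce compressed evidence with lower perplexity under the target model, which is the effect the abstract describes.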
[LG-44] SurgPLAN++: Universal Surgical Phase Localization Network for Online and Offline Inference
Link: https://arxiv.org/abs/2409.12467
Authors: Zhen Chen,Xingjian Luo,Jinlin Wu,Long Bai,Zhen Lei,Hongliang Ren,Sebastien Ourselin,Hongbin Liu
Keywords: Surgical phase recognition, phase recognition, Surgical phase, phase, Surgical
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:Surgical phase recognition is critical for assisting surgeons in understanding surgical videos. Existing studies focused more on online surgical phase recognition, by leveraging preceding frames to predict the current frame. Despite great progress, they formulated the task as a series of frame-wise classification, which resulted in a lack of global context of the entire procedure and incoherent predictions. Moreover, besides online analysis, accurate offline surgical phase recognition is also in significant clinical need for retrospective analysis, and existing online algorithms do not fully analyze the entire video, thereby limiting accuracy in offline analysis. To overcome these challenges and enhance both online and offline inference capabilities, we propose a universal Surgical Phase Localization Network, named SurgPLAN++, with the principle of temporal detection. To ensure a global understanding of the surgical procedure, we devise a phase localization strategy for SurgPLAN++ to predict phase segments across the entire video through phase proposals. For online analysis, to generate high-quality phase proposals, SurgPLAN++ incorporates a data augmentation strategy to extend the streaming video into a pseudo-complete video through mirroring, center-duplication, and down-sampling. For offline analysis, SurgPLAN++ capitalizes on its global phase prediction framework to continuously refine preceding predictions during each online inference step, thereby significantly improving the accuracy of phase recognition. We perform extensive experiments to validate the effectiveness, and our SurgPLAN++ achieves remarkable performance in both online and offline modes, which outperforms state-of-the-art methods. The source code is available at this https URL.
[LG-45] FoME: A Foundation Model for EEG using Adaptive Temporal-Lateral Attention Scaling
Link: https://arxiv.org/abs/2409.12454
Authors: Enze Shi,Kui Zhao,Qilong Yuan,Jiaqi Wang,Huawen Hu,Sigang Yu,Shu Zhang
Keywords: record brain activity, limited labeled datasets, signal heterogeneity, vital tool, tool to measure
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Abstract:Electroencephalography (EEG) is a vital tool to measure and record brain activity in neuroscience and clinical applications, yet its potential is constrained by signal heterogeneity, low signal-to-noise ratios, and limited labeled datasets. In this paper, we propose FoME (Foundation Model for EEG), a novel approach using adaptive temporal-lateral attention scaling to address the above-mentioned challenges. FoME, which comprises 745M parameters, is pre-trained for 1,096k steps on a diverse 1.7TB dataset of scalp and intracranial EEG recordings. Our model introduces two key innovations: a time-frequency fusion embedding technique and an adaptive time-lateral attention scaling (ATLAS) mechanism. These components synergistically capture complex temporal and spectral EEG dynamics, enabling FoME to adapt to varying patterns across diverse data streams and facilitate robust multi-channel modeling. Evaluations across four downstream tasks demonstrate FoME’s superior performance in classification and forecasting applications, consistently achieving state-of-the-art results. To conclude, FoME establishes a new paradigm for EEG analysis, offering a versatile foundation that advances brain-computer interfaces, clinical diagnostics, and cognitive research across neuroscience and related fields. Our code will be available at this https URL.
[LG-46] Neural Networks Generalize on Low Complexity Data
Link: https://arxiv.org/abs/2409.12446
Authors: Sourav Chatterjee,Timothy Sudijono
Keywords: ReLU activation generalize, low complexity data, suitably defined, feedforward neural, simple programming language
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: Comments welcome. 27 pages
Abstract:We show that feedforward neural networks with ReLU activation generalize on low complexity data, suitably defined. Given i.i.d. data generated from a simple programming language, the minimum description length (MDL) feedforward neural network which interpolates the data generalizes with high probability. We define this simple programming language, along with a notion of description length of such networks. We provide several examples on basic computational tasks, such as checking primality of a natural number, and more. For primality testing, our theorem shows the following. Suppose that we draw an i.i.d. sample of $\Theta(N^{\delta} \ln N)$ numbers uniformly at random from $1$ to $N$, where $\delta \in (0,1)$. For each number $x_i$, let $y_i = 1$ if $x_i$ is a prime and $0$ if it is not. Then with high probability, the MDL network fitted to this data accurately answers whether a newly drawn number between $1$ and $N$ is a prime or not, with test error $\leq O(N^{-\delta})$. Note that the network is not designed to detect primes; minimum description learning discovers a network which does so.
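To make the sampling setup in the primality example concrete, the sketch below draws a labelled sample of the stated size. It covers only the data-generation step, not the MDL network itself; the constant `c` hidden by the $\Theta(\cdot)$ notation, the seed, and the function names are assumptions made for illustration.

```python
import math
import random


def is_prime(x):
    """Trial-division primality check (fine for small N)."""
    if x < 2:
        return False
    for d in range(2, math.isqrt(x) + 1):
        if x % d == 0:
            return False
    return True


def draw_labelled_sample(N, delta, c=1.0, seed=0):
    """Draw ceil(c * N^delta * ln N) numbers uniformly from 1..N with prime labels.

    The constant c is hypothetical: the Theta(.) in the abstract hides it.
    """
    rng = random.Random(seed)
    m = math.ceil(c * N**delta * math.log(N))
    xs = [rng.randint(1, N) for _ in range(m)]
    return [(x, int(is_prime(x))) for x in xs]


sample = draw_labelled_sample(N=10_000, delta=0.5)
print(len(sample), "training pairs out of", 10_000, "possible inputs")
```

For N = 10,000 and delta = 0.5 this is under a thousand labelled numbers, a small fraction of the input range, which is what makes the stated test-error bound notable.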
[LG-47] Enhancing Logical Reasoning in Large Language Models through Graph-based Synthetic Data
Link: https://arxiv.org/abs/2409.12437
Authors: Jiaming Zhou,Abbas Ghaddar,Ge Zhang,Liheng Ma,Yaochen Hu,Soumyasundar Pal,Mark Coates,Bin Wang,Yingxue Zhang,Jianye Hao
Keywords: Large Language Models, long reasoning chains, complex logical reasoning, involve long reasoning, strategies for Large
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract:Despite recent advances in training and prompting strategies for Large Language Models (LLMs), these models continue to face challenges with complex logical reasoning tasks that involve long reasoning chains. In this work, we explore the potential and limitations of using graph-based synthetic reasoning data as training signals to enhance LLMs’ reasoning capabilities. Our extensive experiments, conducted on two established natural language reasoning tasks – inductive reasoning and spatial reasoning – demonstrate that supervised fine-tuning (SFT) with synthetic graph-based reasoning data effectively enhances LLMs’ reasoning performance without compromising their effectiveness on other standard evaluation benchmarks.
[LG-48] Is it Still Fair? A Comparative Evaluation of Fairness Algorithms through the Lens of Covariate Drift
Link: https://arxiv.org/abs/2409.12428
Authors: Oscar Blessed Deho,Michael Bewong,Selasi Kwashie,Jiuyong Li,Jixue Liu,Lin Liu,Srecko Joksimovic
Keywords: data distributional drift, data distributional, distributional drift, applications have grown, grown exponentially
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Abstract:Over the last few decades, machine learning (ML) applications have grown exponentially, yielding several benefits to society. However, these benefits are tempered with concerns of discriminatory behaviours exhibited by ML models. In this regard, fairness in machine learning has emerged as a priority research area. Consequently, several fairness metrics and algorithms have been developed to mitigate against discriminatory behaviours that ML models may possess. Yet still, very little attention has been paid to the problem of naturally occurring changes in data patterns (a.k.a. data distributional drift), and its impact on fairness algorithms and metrics. In this work, we study this problem comprehensively by analyzing 4 fairness-unaware baseline algorithms and 7 fairness-aware algorithms, carefully curated to cover the breadth of its typology, across 5 datasets including public and proprietary data, and evaluating them using 3 predictive performance and 10 fairness metrics. In doing so, we show that (1) data distributional drift is not a trivial occurrence, and in several cases can lead to serious deterioration of fairness in so-called fair models; (2) contrary to some existing literature, the size and direction of data distributional drift is not correlated to the resulting size and direction of unfairness; and (3) choice of, and training of fairness algorithms is impacted by the effect of data distributional drift which is largely ignored in the literature. Emanating from our findings, we synthesize several policy implications of data distributional drift on fairness algorithms that can be very relevant to stakeholders and practitioners.
[LG-49] Sustainable Visions: Unsupervised Machine Learning Insights on Global Development Goals
Link: https://arxiv.org/abs/2409.12427
Authors: Alberto García-Rodríguez,Matias Núñez,Miguel Robles Pérez,Tzipe Govezensky,Rafael A. Barrio,Carlos Gershenson,Kimmo K. Kaski,Julia Tagüeña
Keywords: address global challenges, Sustainable Development outlines, United Nations, global challenges, address global
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
Abstract:The United Nations 2030 Agenda for Sustainable Development outlines 17 goals to address global challenges. However, progress has been slower than expected and, consequently, there is a need to investigate the reasons behind this fact. In this study, we used a novel data-driven methodology to analyze data from 107 countries (2000 - 2022) using unsupervised machine learning techniques. Our analysis reveals strong positive and negative correlations between certain SDGs. The findings show that progress toward the SDGs is heavily influenced by geographical, cultural and socioeconomic factors, with no country on track to achieve all goals by 2030. This highlights the need for a region specific, systemic approach to sustainable development that acknowledges the complex interdependencies of the goals and the diverse capacities of nations. Our approach provides a robust framework for developing efficient and data-informed strategies, to promote cooperative and targeted initiatives for sustainable progress.
[LG-50] Zero-to-Strong Generalization: Eliciting Strong Capabilities of Large Language Models Iteratively without Gold Labels
Link: https://arxiv.org/abs/2409.12425
Authors: Chaoqun Liu,Qin Chao,Wenxuan Zhang,Xiaobao Wu,Boyang Li,Anh Tuan Luu,Lidong Bing
Keywords: Large Language Models, Large Language, demonstrated remarkable performance, Language Models, demonstrated remarkable
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 15 pages
Abstract:Large Language Models (LLMs) have demonstrated remarkable performance through supervised fine-tuning or in-context learning using gold labels. However, this paradigm is limited by the availability of gold labels, while in certain scenarios, LLMs may need to perform tasks that are too complex for humans to provide such labels. To tackle this challenge, this study explores whether solely utilizing unlabeled data can elicit strong model capabilities. We propose a new paradigm termed zero-to-strong generalization. We iteratively prompt LLMs to annotate unlabeled data and retain high-quality labels by filtering. Surprisingly, we observe that this iterative process gradually unlocks LLMs’ potential on downstream tasks. Our experiments on extensive classification and reasoning tasks confirm the effectiveness of our proposed framework. Our analysis indicates that this paradigm is effective for both in-context learning and fine-tuning, and for various model sizes.
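The annotate-then-filter loop described above can be sketched roughly as follows. This is a toy illustration, not the authors' procedure: the abstract does not say how labels are filtered, so a confidence threshold is assumed here, and `annotate` is a hypothetical stand-in for prompting an LLM with the demonstrations kept so far.

```python
def zero_to_strong(unlabeled, annotate, rounds=3, threshold=0.8):
    """Iteratively self-label data, keeping only confident labels as demonstrations.

    annotate(x, demos) -> (label, confidence) is a hypothetical callable that
    would wrap an LLM prompted with the current demonstrations.
    """
    demos = []
    for _ in range(rounds):
        kept = []
        for x in unlabeled:
            label, confidence = annotate(x, demos)
            if confidence >= threshold:  # assumed filtering rule
                kept.append((x, label))
        demos = kept  # the next round is prompted with the filtered labels
    return demos


def toy_annotate(x, demos):
    """Toy annotator: confident on easy inputs, or once it has any demonstrations."""
    label = "even" if x % 2 == 0 else "odd"
    confidence = 0.9 if (x < 4 or demos) else 0.5
    return label, confidence


data = list(range(8))
print(len(zero_to_strong(data, toy_annotate, rounds=1)))
print(len(zero_to_strong(data, toy_annotate, rounds=2)))
```

In the toy run, the first round keeps only the inputs the annotator is already confident about, and the second round, prompted with those, covers the rest. That mirrors the "gradual unlocking" the abstract reports, though only in caricature.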
[LG-51] How to predict on-road air pollution based on street view images and machine learning: a quantitative analysis of the optimal strategy
Link: https://arxiv.org/abs/2409.12412
Authors: Hui Zhong,Di Chen,Pengqin Wang,Wenrui Wang,Shaojie Shen,Yonghong Liu,Meixin Zhu
Keywords: On-road air pollution, exhibits substantial variability, pollution exhibits substantial, air pollution exhibits, short distances due
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Abstract:On-road air pollution exhibits substantial variability over short distances due to emission sources, dilution, and physicochemical processes. Integrating mobile monitoring data with street view images (SVIs) holds promise for predicting local air pollution. However, algorithms, sampling strategies, and image quality introduce extra errors due to a lack of reliable references that quantify their effects. To bridge this gap, we employed 314 taxis to monitor NO, NO2, PM2.5 and PM10 dynamically and sampled corresponding SVIs, aiming to develop a reliable strategy. We extracted SVI features from ~382,000 streetscape images, which were collected at various angles (0°, 90°, 180°, 270°) and ranges (buffers with radii of 100m, 200m, 300m, 400m, 500m). We also experimented with three machine learning algorithms alongside the linear land-use regression (LUR) model to explore the influences of different algorithms. Four typical image quality issues were identified and discussed. Generally, machine learning methods outperform linear LUR for estimating the four pollutants, with the ranking: random forest > XGBoost > neural network > LUR. Compared to single-angle sampling, the averaging strategy is an effective method to avoid bias of insufficient feature capture. Therefore, the optimal sampling strategy is to obtain SVIs at a 100m radius buffer and extract features using the averaging strategy. This approach achieved estimation results for each aggregation location with absolute errors almost always below 2.5 $\mu$g/m$^2$ or ppb. Overexposure, blur, and underexposure led to image misjudgments and incorrect identifications, causing an overestimation of road features and underestimation of human-activity features, contributing to inaccurate NO, NO2, PM2.5 and PM10 estimation.
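The "averaging strategy" can be read as averaging the per-image features over the four viewing angles at a location before feeding them to a regressor. The sketch below shows that reading only. The feature names and values are hypothetical, and the abstract does not specify how the features are extracted or aggregated beyond "averaging".

```python
def average_features(features_by_angle):
    """Average each named feature over all viewing angles at one location."""
    angles = sorted(features_by_angle)
    names = features_by_angle[angles[0]].keys()
    return {
        name: sum(features_by_angle[a][name] for a in angles) / len(angles)
        for name in names
    }


# Hypothetical pixel-share features for one location at the four angles.
location = {
    0: {"road": 0.5, "sky": 0.25},
    90: {"road": 0.25, "sky": 0.5},
    180: {"road": 0.5, "sky": 0.25},
    270: {"road": 0.25, "sky": 0.0},
}

print(average_features(location))
```

A single-angle sample (say, only 270°) would report no sky at all here, which is the kind of insufficient feature capture that averaging is meant to smooth out.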
[LG-52] LMT-Net: Lane Model Transformer Network for Automated HD Mapping from Sparse Vehicle Observations ITSC2024
Link: https://arxiv.org/abs/2409.12409
Authors: Michael Mink,Thomas Monninger,Steffen Staab
Keywords: High Definition, complete lane model, autonomous driving, range and occlusions, lane model
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Accepted for 2024 IEEE International Conference on Intelligent Transportation Systems (ITSC 2024)
Abstract:In autonomous driving, High Definition (HD) maps provide a complete lane model that is not limited by sensor range and occlusions. However, the generation and upkeep of HD maps involves periodic data collection and human annotations, limiting scalability. To address this, we investigate automating the lane model generation and the use of sparse vehicle observations instead of dense sensor measurements. For our approach, a pre-processing step generates polylines by aligning and aggregating observed lane boundaries. Aligned driven traces are used as starting points for predicting lane pairs defined by the left and right boundary points. We propose Lane Model Transformer Network (LMT-Net), an encoder-decoder neural network architecture that performs polyline encoding and predicts lane pairs and their connectivity. A lane graph is formed by using predicted lane pairs as nodes and predicted lane connectivity as edges. We evaluate the performance of LMT-Net on an internal dataset that consists of multiple vehicle observations as well as human annotations as Ground Truth (GT). The evaluation shows promising results and demonstrates superior performance compared to the implemented baseline on both highway and non-highway Operational Design Domain (ODD).
[LG-53] Shape-informed surrogate models based on signed distance function domain encoding
Link: https://arxiv.org/abs/2409.12400
Authors: Linying Zhang,Stefano Pagani,Jun Zhang,Francesco Regazzoni
Keywords: partial differential equations, build surrogate models, parameterized partial differential, differential equations, capable of taking
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Abstract:We propose a non-intrusive method to build surrogate models that approximate the solution of parameterized partial differential equations (PDEs), capable of taking into account the dependence of the solution on the shape of the computational domain. Our approach is based on the combination of two neural networks (NNs). The first NN, conditioned on a latent code, provides an implicit representation of geometry variability through signed distance functions. This automated shape encoding technique generates compact, low-dimensional representations of geometries within a latent space, without requiring the explicit construction of an encoder. The second NN reconstructs the output physical fields independently for each spatial point, thus avoiding the computational burden typically associated with high-dimensional discretizations like computational meshes. Furthermore, we show that accuracy in geometrical characterization can be further enhanced by employing Fourier feature mapping as input feature of the NN. The meshless nature of the proposed method, combined with the dimensionality reduction achieved through automatic feature extraction in latent space, makes it highly flexible and computationally efficient. This strategy eliminates the need for manual intervention in extracting geometric parameters, and can even be applied in cases where geometries undergo changes in their topology. Numerical tests in the field of fluid dynamics and solid mechanics demonstrate the effectiveness of the proposed method in accurately predicting the solution of PDEs in domains of arbitrary shape. Remarkably, the results show that it achieves accuracy comparable to the best-case scenarios where an explicit parametrization of the computational domain is available.
[LG-54] Selecting a classification performance measure: matching the measure to the problem
Link: https://arxiv.org/abs/2409.12391
Authors: David J. Hand,Peter Christen,Sumayya Ziyad
Keywords: including medical diagnosis, financial decision making, classes objects belong, online commerce, belong is ubiquitous
Categories: Machine Learning (cs.LG)
Abstract:The problem of identifying to which of a given set of classes objects belong is ubiquitous, occurring in many research domains and application areas, including medical diagnosis, financial decision making, online commerce, and national security. But such assignments are rarely completely perfect, and classification errors occur. This means it is necessary to compare classification methods and algorithms to decide which is “best” for any particular problem. However, just as there are many different classification methods, so there are many different ways of measuring their performance. It is thus vital to choose a measure of performance which matches the aims of the research or application. This paper is a contribution to the growing literature on the relative merits of different performance measures. Its particular focus is the critical importance of matching the properties of the measure to the aims for which the classification is being made.
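A small worked example shows why the choice of measure matters: two classifiers can be ranked in opposite orders by accuracy and by recall. The confusion-matrix counts below are invented for illustration and do not come from the paper.

```python
def measures(tp, fp, fn, tn):
    """Compute common performance measures from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical rare-disease screen: 10 positives among 1000 cases.
always_negative = measures(tp=0, fp=0, fn=10, tn=990)
cautious_model = measures(tp=8, fp=40, fn=2, tn=950)

for name, m in [("always negative", always_negative), ("cautious", cautious_model)]:
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Accuracy favours the classifier that never flags anyone, while recall favours the one that finds most of the true cases. Which of the two is "best" depends on the aim of the application, which is the point the abstract makes.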
[LG-55] On the Regret of Coded Caching with Adversarial Requests
Link: https://arxiv.org/abs/2409.12387
Authors: Anupam Nayak,Kota Srinivas Reddy,Nikhil Karamchandani
Keywords: online learning framework, requests arrive sequentially, cache contents based, coded caching problem, well-known coded caching
Categories: Information Theory (cs.IT); Machine Learning (cs.LG)
Abstract:We study the well-known coded caching problem in an online learning framework, wherein requests arrive sequentially, and an online policy can update the cache contents based on the history of requests seen thus far. We introduce a caching policy based on the Follow-The-Perturbed-Leader principle and show that for any time horizon $T$ and any request sequence, it achieves a sub-linear regret of $\mathcal{O}(\sqrt{T})$ with respect to an oracle that knows the request sequence beforehand. Our study marks the first examination of adversarial regret in the coded caching setup. Furthermore, we also address the issue of switching cost by establishing an upper bound on the expected number of cache updates made by our algorithm under unrestricted switching and also provide an upper bound on the regret under restricted switching when cache updates can only happen in a pre-specified subset of timeslots. Finally, we validate our theoretical insights with numerical results using a real-world dataset.
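For readers unfamiliar with Follow-The-Perturbed-Leader (FTPL), the sketch below applies the principle to a plain, uncoded single cache: keep the files with the largest noise-perturbed request counts. It is a simplification for intuition and not the coded caching policy analysed in the paper. The fresh Gaussian noise per round, the scale `eta`, and the function names are assumptions.

```python
import random


def ftpl_cache(counts, cache_size, eta, rng):
    """Pick the cache_size files with the largest perturbed cumulative counts."""
    perturbed = {f: c + eta * rng.gauss(0.0, 1.0) for f, c in counts.items()}
    ranked = sorted(perturbed, key=perturbed.get, reverse=True)
    return set(ranked[:cache_size])


def run_ftpl(requests, num_files, cache_size, eta=1.0, seed=0):
    """Serve requests one by one; return the number of cache hits."""
    rng = random.Random(seed)
    counts = {f: 0 for f in range(num_files)}
    hits = 0
    for r in requests:
        cache = ftpl_cache(counts, cache_size, eta, rng)  # chosen before seeing r
        hits += r in cache
        counts[r] += 1  # update the history only after serving the request
    return hits


requests = [2] * 10
print(run_ftpl(requests, num_files=3, cache_size=1, eta=0.0))
print(run_ftpl(requests, num_files=3, cache_size=1, eta=1.0))
```

With `eta=0` the policy reduces to deterministic Follow-The-Leader, which an adversary can exploit. The random perturbation is what makes sub-linear regret against arbitrary request sequences possible in FTPL-style analyses.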
[LG-56] Look Through Masks: Towards Masked Face Recognition with De-Occlusion Distillation ACM-MM2020
Link: https://arxiv.org/abs/2409.12385
Authors: Chenyu Li,Shiming Ge,Daichi Zhang,Jia Li
Keywords: real-world applications today, masked face recognition, ambiguous representation, drop in accuracy, real-world applications
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by ACM MM 2020
Abstract:Many real-world applications today like video surveillance and urban governance need to address the recognition of masked faces, where content replacement by diverse masks often brings in incomplete appearance and ambiguous representation, leading to a sharp drop in accuracy. Inspired by recent progress on amodal perception, we propose to migrate the mechanism of amodal completion for the task of masked face recognition with an end-to-end de-occlusion distillation framework, which consists of two modules. The de-occlusion module applies a generative adversarial network to perform face completion, which recovers the content under the mask and eliminates appearance ambiguity. The distillation module takes a pre-trained general face recognition model as the teacher and transfers its knowledge to train a student for completed faces using massive online synthesized face pairs. In particular, the teacher knowledge is represented with structural relations among instances in multiple orders, which serves as a posterior regularization to enable the adaptation. In this way, the knowledge can be fully distilled and transferred to identify masked faces. Experiments on synthetic and realistic datasets show the efficacy of the proposed approach.
[LG-57] Privacy-Preserving Student Learning with Differentially Private Data-Free Distillation
Link: https://arxiv.org/abs/2409.12384
Authors: Bochao Liu,Jianghu Lu,Pengju Wang,Junjie Zhang,Dan Zeng,Zhenxing Qian,Shiming Ge
Keywords: Deep learning models, achieve high inference, high inference accuracy, extracting rich knowledge, Deep learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published by IEEE MMSP 2022
Abstract:Deep learning models can achieve high inference accuracy by extracting rich knowledge from massive well-annotated data, but may pose the risk of data privacy leakage in practical deployment. In this paper, we present an effective teacher-student learning approach to train privacy-preserving deep learning models via differentially private data-free distillation. The main idea is generating synthetic data to learn a student that can mimic the ability of a teacher well-trained on private data. In the approach, a generator is first pretrained in a data-free manner by incorporating the teacher as a fixed discriminator. With the generator, massive synthetic data can be generated for model training without exposing data privacy. Then, the synthetic data is fed into the teacher to generate private labels. Towards this end, we propose a label differential privacy algorithm termed selective randomized response to protect the label information. Finally, a student is trained on the synthetic data with the supervision of private labels. In this way, both data privacy and label privacy are well protected in a unified framework, leading to privacy-preserving models. Extensive experiments and analysis clearly demonstrate the effectiveness of our approach.
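The abstract does not spell out how "selective randomized response" works, so the sketch below shows only the standard k-ary randomized response mechanism that label differential privacy methods commonly build on. It is offered as background, not as the authors' algorithm; `epsilon` is the privacy budget and the function names are illustrative.

```python
import math
import random


def keep_probability(num_classes, epsilon):
    """Probability of reporting the true label under k-ary randomized response."""
    return math.exp(epsilon) / (math.exp(epsilon) + num_classes - 1)


def randomized_response(label, num_classes, epsilon, rng):
    """Return the true label with probability keep_probability, else a uniformly
    random different label. This satisfies epsilon-label differential privacy."""
    if rng.random() < keep_probability(num_classes, epsilon):
        return label
    others = [c for c in range(num_classes) if c != label]
    return rng.choice(others)


rng = random.Random(0)
teacher_labels = [3, 3, 7, 1, 0, 9]  # hypothetical teacher predictions
private_labels = [randomized_response(y, 10, 1.0, rng) for y in teacher_labels]
print(private_labels)
```

The ratio between the probability of reporting the true label and that of reporting any particular other label is exactly e^epsilon, so a smaller `epsilon` gives noisier labels and stronger privacy.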
[LG-58] Prediction of Brent crude oil price based on LSTM model under the background of low-carbon transition
Link: https://arxiv.org/abs/2409.12376
Authors: Yuwen Zhao,Baojun Hu,Sizhe Wang
Keywords: crude oil price, crude oil, crude oil market, Brent crude oil, important strategic resource
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract:In the field of global energy and environment, crude oil is an important strategic resource, and its price fluctuation has a far-reaching impact on the global economy, financial market and the process of low-carbon development. In recent years, with the gradual promotion of green energy transformation and low-carbon development in various countries, the dynamics of the crude oil market have become more complicated and changeable. The price of crude oil is not only influenced by traditional factors such as supply and demand, geopolitical conflict and production technology, but also faces the challenges of energy policy transformation, carbon emission control and new energy technology development. These diverse driving factors make the prediction of crude oil price not only very important in economic decision-making and energy planning, but also a key issue in financial markets. In this paper, the spot price data of European Brent crude oil provided by the U.S. Energy Information Administration are selected, and a deep learning model with three layers of LSTM units is constructed to predict the crude oil price in the next few days. The results show that the LSTM model performs well in capturing the overall price trend, although there is some deviation during the period of sharp price fluctuation. The research in this paper not only verifies the applicability of the LSTM model in energy market forecasting, but also provides data support for policy makers and investors when facing the uncertainty of crude oil price.
[LG-59] Communication-Efficient Federated Low-Rank Update Algorithm and its Connection to Implicit Regularization
Link: https://arxiv.org/abs/2409.12371
Authors: Haemin Park,Diego Klabjan
Keywords: faces significant challenges, significant challenges related, faces significant, significant challenges, challenges related
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract:Federated Learning (FL) faces significant challenges related to communication efficiency and heterogeneity. To address these issues, we explore the potential of using low-rank updates. Our theoretical analysis reveals that client’s loss exhibits a higher rank structure (gradients span higher rank subspace of Hessian) compared to the server’s loss. Based on this insight, we hypothesize that constraining client-side optimization to a low-rank subspace could provide an implicit regularization effect. Consequently, we propose FedLoRU, a general low-rank update framework for federated learning. Our framework enforces low-rank client-side updates and accumulates these updates to form a higher-rank model. Additionally, variants of FedLoRU can adapt to environments with statistical and model heterogeneity by employing multiple or hierarchical low-rank updates. Experimental results demonstrate that FedLoRU performs comparably to full-rank algorithms and exhibits robustness to heterogeneous and large numbers of clients.
[LG-60] Extracting Memorized Training Data via Decomposition
Link: https://arxiv.org/abs/2409.12367
Authors: Ellen Su,Anu Vellore,Amy Chang,Raffaele Mura,Blaine Nelson,Paul Kassianik,Amin Karbasi
Keywords: Large Language Models, Large Language, information security challenges, Language Models, challenges for developers
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Abstract:The widespread use of Large Language Models (LLMs) in society creates new information security challenges for developers, organizations, and end-users alike. LLMs are trained on large volumes of data, and their susceptibility to reveal the exact contents of the source training datasets poses security and safety risks. Although current alignment procedures restrict common risky behaviors, they do not completely prevent LLMs from leaking data. Prior work demonstrated that LLMs may be tricked into divulging training data by using out-of-distribution queries or adversarial techniques. In this paper, we demonstrate a simple, query-based decompositional method to extract news articles from two frontier LLMs. We use instruction decomposition techniques to incrementally extract fragments of training data. Out of 3723 New York Times articles, we extract at least one verbatim sentence from 73 articles, and over 20% of verbatim sentences from 6 articles. Our analysis demonstrates that this method successfully induces the LLM to generate texts that are reliable reproductions of news articles, meaning that they likely originate from the source training dataset. This method is simple, generalizable, and does not fine-tune or change the production model. If replicable at scale, this training data extraction methodology could expose new LLM security and safety vulnerabilities, including privacy risks and unauthorized data leaks. These implications require careful consideration from model development to its end-use.
[LG-61] Bridging the Gap Between Approximation and Learning via Optimal Approximation by ReLU MLPs of Maximal Regularity
Link: https://arxiv.org/abs/2409.12335
Authors: Ruiyang Hong,Anastasis Kratsios
Keywords: seemingly opposing perspectives, seemingly opposing, opposing perspectives, perspectives of approximation, Machine Learning
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Functional Analysis (math.FA); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Comments: 16 pages main body, 40 pages proofs, 7 figures, 1 table
Abstract:The foundations of deep learning are supported by the seemingly opposing perspectives of approximation or learning theory. The former advocates for large/expressive models that need not generalize, while the latter considers classes that generalize but may be too small/constrained to be universal approximators. Motivated by real-world deep learning implementations that are both expressive and statistically reliable, we ask: “Is there a class of neural networks that is both large enough to be universal but structured enough to generalize?” This paper constructively provides a positive answer to this question by identifying a highly structured class of ReLU multilayer perceptrons (MLPs), which are optimal function approximators and are statistically well-behaved. We show that any $L$-Lipschitz function from $[0,1]^d$ to $[-n,n]$ can be approximated to a uniform $Ld/(2n)$ error on $[0,1]^d$ with a sparsely connected $L$-Lipschitz ReLU MLP of width $\mathcal{O}(dn^d)$, depth $\mathcal{O}(\log(d))$, with $\mathcal{O}(dn^d)$ nonzero parameters, and whose weights and biases take values in $\{0,\pm 1/2\}$ except in the first and last layers which instead have magnitude at most $n$. Unlike previously known “large” classes of universal ReLU MLPs, the empirical Rademacher complexity of our class remains bounded even when its depth and width become arbitrarily large. Further, our class of MLPs achieves a near-optimal sample complexity of $\mathcal{O}(\log(N)/\sqrt{N})$ when given $N$ i.i.d. normalized sub-Gaussian training samples. We achieve this by avoiding the standard approach to constructing optimal ReLU approximators, which sacrifices regularity by relying on small spikes. Instead, we introduce a new construction that perfectly fits together linear pieces using Kuhn triangulations and avoids these small spikes.
MSC classes: 68T07, 41A44, 26A16
Cite as: arXiv:2409.12335 [cs.LG] (or arXiv:2409.12335v1 [cs.LG] for this version), https://doi.org/10.48550/arXiv.2409.12335
[LG-62] SplitVAEs: Decentralized scenario generation from siloed data for stochastic optimization problems
Link: https://arxiv.org/abs/2409.12328
Authors: H M Mohaimanul Islam, Huynh Q. N. Vo, Paritosh Ramanan
Keywords: Stochastic optimization problems, encapsulate complex spatiotemporal, Stochastic optimization, multi-stakeholder networked systems, large-scale multi-stakeholder networked
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Methodology (stat.ME)
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Abstract:Stochastic optimization problems in large-scale multi-stakeholder networked systems (e.g., power grids and supply chains) rely on data-driven scenarios to encapsulate complex spatiotemporal interdependencies. However, centralized aggregation of stakeholder data is challenging due to the existence of data silos resulting from computational and logistical bottlenecks. In this paper, we present SplitVAEs, a decentralized scenario generation framework that leverages variational autoencoders to generate high-quality scenarios without moving stakeholder data. With the help of experiments on distributed memory systems, we demonstrate the broad applicability of SplitVAEs in a variety of domain areas that are dominated by a large number of stakeholders. Our experiments indicate that SplitVAEs can learn spatial and temporal interdependencies in large-scale networks to generate scenarios that match the joint historical distribution of stakeholder data in a decentralized manner. Our experiments show that SplitVAEs deliver robust performance compared to centralized, state-of-the-art benchmark methods while significantly reducing data transmission costs, leading to a scalable, privacy-enhancing alternative to scenario generation.
[LG-63] Understanding Implosion in Text-to-Image Generative Models CCS2024
Link: https://arxiv.org/abs/2409.12314
Authors: Wenxin Ding, Cathy Y. Li, Shawn Shan, Ben Y. Zhao, Haitao Zheng
Keywords: Recent works show, poisoning attacks, Recent works, surprisingly vulnerable, models
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ACM CCS 2024
Abstract:Recent works show that text-to-image generative models are surprisingly vulnerable to a variety of poisoning attacks. Empirical results find that these models can be corrupted by altering associations between individual text prompts and associated visual features. Furthermore, a number of concurrent poisoning attacks can induce “model implosion,” where the model becomes unable to produce meaningful images for unpoisoned prompts. These intriguing findings highlight the absence of an intuitive framework to understand poisoning attacks on these models. In this work, we establish the first analytical framework on robustness of image generative models to poisoning attacks, by modeling and analyzing the behavior of the cross-attention mechanism in latent diffusion models. We model cross-attention training as an abstract problem of “supervised graph alignment” and formally quantify the impact of training data by the hardness of alignment, measured by an Alignment Difficulty (AD) metric. The higher the AD, the harder the alignment. We prove that AD increases with the number of individual prompts (or concepts) poisoned. As AD grows, the alignment task becomes increasingly difficult, yielding highly distorted outcomes that frequently map meaningful text prompts to undefined or meaningless visual representations. As a result, the generative model implodes and outputs random, incoherent images at large. We validate our analytical framework through extensive experiments, and we confirm and explain the unexpected (and unexplained) effect of model implosion while producing new, unforeseen insights. Our work provides a useful tool for studying poisoning attacks against diffusion models and their defenses.
[LG-64] JKO for Landau: a variational particle method for homogeneous Landau equation
Link: https://arxiv.org/abs/2409.12296
Authors: Yan Huang, Li Wang
Keywords: Landau metric, Landau equation, JKO scheme, Landau, gradient flow viewpoint
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Abstract:Inspired by the gradient flow viewpoint of the Landau equation and corresponding dynamic formulation of the Landau metric in [arXiv:2007.08591], we develop a novel implicit particle method for the Landau equation in the framework of the JKO scheme. We first reformulate the Landau metric in a computationally friendly form, and then translate it into the Lagrangian viewpoint using the flow map. A key observation is that, while the flow map evolves according to a rather complicated integral equation, the unknown component is merely a score function of the corresponding density plus an additional term in the null space of the collision kernel. This insight guides us in approximating the flow map with a neural network and simplifies the training. Additionally, the objective function is in a double summation form, making it highly suitable for stochastic methods. Consequently, we design a tailored version of stochastic gradient descent that maintains particle interactions and reduces the computational complexity. Compared to other deterministic particle methods, the proposed method enjoys exact entropy dissipation and unconditional stability, therefore making it suitable for large-scale plasma simulations over extended time periods.
[LG-65] SANE: Strategic Autonomous Non-Smooth Exploration for Multiple Optima Discovery in Multi-modal and Non-differentiable Black-box Functions
Link: https://arxiv.org/abs/2409.12295
Authors: Arpan Biswas, Rama Vasudevan, Rohit Pant, Ichiro Takeuchi, Hiroshi Funakubo, Yongtao Liu
Keywords: multimodal parameter spaces, structure image spaces, molecular embedding spaces, material structure image, material discovery bring
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 25 pages, 7 figures in main text, 2 figures in Supp Mat
Abstract: Both computational and experimental material discovery bring forth the challenge of exploring multidimensional and multimodal parameter spaces, such as phase diagrams of Hamiltonians with multiple interactions, composition spaces of combinatorial libraries, material structure image spaces, and molecular embedding spaces. Often these systems are black-box and time-consuming to evaluate, which has resulted in strong interest in active learning methods such as Bayesian optimization (BO). However, these systems are often noisy, which makes the black-box function severely multi-modal and non-differentiable, so that a vanilla BO can get overly focused near a single or faux optimum, deviating from the broader goal of scientific discovery. To address these limitations, here we developed Strategic Autonomous Non-Smooth Exploration (SANE) to facilitate an intelligent Bayesian optimized navigation with a proposed cost-driven probabilistic acquisition function to find multiple global and local optimal regions, avoiding the tendency to become trapped in a single optimum. To distinguish between a true and false optimal region due to noisy experimental measurements, a human (domain) knowledge driven dynamic surrogate gate is integrated with SANE. We applied gate-SANE to pre-acquired piezoresponse spectroscopy data of a ferroelectric combinatorial library with high noise levels in specific regions, and to piezoresponse force microscopy (PFM) hyperspectral data. SANE demonstrated better performance than classical BO in facilitating the exploration of multiple optimal regions and thereby prioritized learning with higher coverage of scientific values in autonomous experiments. Our work showcases the potential application of this method to real-world experiments, where such combined strategic and human-intervention approaches can be critical to unlocking new discoveries in autonomous research.
[LG-66] RAG-Modulo: Solving Sequential Tasks using Experience Critics and Language Models
Link: https://arxiv.org/abs/2409.12294
Authors: Abhinav Jain, Chris Jermaine, Vaibhav Unhelkar
Keywords: Large language models, Large language, language models, observation uncertainties, recently emerged
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 8 pages, 5 figures
Abstract:Large language models (LLMs) have recently emerged as promising tools for solving challenging robotic tasks, even in the presence of action and observation uncertainties. Recent LLM-based decision-making methods (also referred to as LLM-based agents), when paired with appropriate critics, have demonstrated potential in solving complex, long-horizon tasks with relatively few interactions. However, most existing LLM-based agents lack the ability to retain and learn from past interactions - an essential trait of learning-based robotic systems. We propose RAG-Modulo, a framework that enhances LLM-based agents with a memory of past interactions and incorporates critics to evaluate the agents’ decisions. The memory component allows the agent to automatically retrieve and incorporate relevant past experiences as in-context examples, providing context-aware feedback for more informed decision-making. Further by updating its memory, the agent improves its performance over time, thereby exhibiting learning. Through experiments in the challenging BabyAI and AlfWorld domains, we demonstrate significant improvements in task success rates and efficiency, showing that the proposed RAG-Modulo framework outperforms state-of-the-art baselines.
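The memory component described in the abstract above (store past interactions, retrieve relevant ones as in-context examples) can be pictured with a toy retrieval store. This is only a sketch under stated assumptions: the class `InteractionMemory`, its record fields, and the word-overlap (Jaccard) similarity are all hypothetical stand-ins, and the abstract does not say how RAG-Modulo actually scores relevance.

```python
class InteractionMemory:
    """Toy store of past (task, action, feedback) records; all names are hypothetical."""

    def __init__(self):
        self.records = []

    def add(self, task, action, feedback):
        """Remember one past interaction."""
        self.records.append({"task": task, "action": action, "feedback": feedback})

    @staticmethod
    def _similarity(a, b):
        """Jaccard overlap of word sets (an assumed, deliberately simple measure)."""
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

    def retrieve(self, task, k=2):
        """Return the k stored records whose task text is most similar to `task`."""
        ranked = sorted(
            self.records,
            key=lambda r: self._similarity(task, r["task"]),
            reverse=True,
        )
        return ranked[:k]


memory = InteractionMemory()
memory.add("pick up the blue ball", "goto ball; pickup", "success")
memory.add("open the yellow door", "goto door; toggle", "success")
memory.add("go to the red box", "goto box", "failure: path blocked")

# The retrieved records would be formatted into the prompt as in-context examples.
examples = memory.retrieve("pick up the red ball", k=2)
```

Adding a new record after each episode is what lets such an agent improve over time without any weight updates.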
[LG-67] Provable In-Context Learning of Linear Systems and Linear Elliptic PDEs with Transformers
Link: https://arxiv.org/abs/2409.12293
Authors: Frank Cole, Yulong Lu, Riley O’Neill, Tianhao Zhang
Keywords: natural language processing, exhibit remarkable in-context, Foundation models, transformer-based foundation models, allowing pre-trained models
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Abstract:Foundation models for natural language processing, powered by the transformer architecture, exhibit remarkable in-context learning (ICL) capabilities, allowing pre-trained models to adapt to downstream tasks using few-shot prompts without updating their weights. Recently, transformer-based foundation models have also emerged as versatile tools for solving scientific problems, particularly in the realm of partial differential equations (PDEs). However, the theoretical foundations of the ICL capabilities in these scientific models remain largely unexplored. This work develops a rigorous error analysis for transformer-based ICL applied to solution operators associated with a family of linear elliptic PDEs. We first demonstrate that a linear transformer, defined by a linear self-attention layer, can provably learn in-context to invert linear systems arising from the spatial discretization of PDEs. This is achieved by deriving theoretical scaling laws for the prediction risk of the proposed linear transformers in terms of spatial discretization size, the number of training tasks, and the lengths of prompts used during training and inference. These scaling laws also enable us to establish quantitative error bounds for learning PDE solutions. Furthermore, we quantify the adaptability of the pre-trained transformer on downstream PDE tasks that experience distribution shifts in both tasks (represented by PDE coefficients) and input covariates (represented by the source term). To analyze task distribution shifts, we introduce a novel concept of task diversity and characterize the transformer’s prediction error in terms of the magnitude of task shift, assuming sufficient diversity in the pre-training tasks. We also establish sufficient conditions to ensure task diversity. Finally, we validate the ICL-capabilities of transformers through extensive numerical experiments.
[LG-68] MetaPix: A Data-Centric AI Development Platform for Efficient Management and Utilization of Unstructured Computer Vision Data
Link: https://arxiv.org/abs/2409.12289
Authors: Sai Vishwanath Venkatesh, Atra Akandeh, Madhu Lokanath
Keywords: advanced AI technologies, today world, world of advanced, critical component, data
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted @ The 22nd International Conference on Software Engineering Research Practice
Abstract: In today’s world of advanced AI technologies, data management is a critical component of any AI/ML solution. Effective data management is vital for the creation and maintenance of high-quality, diverse datasets, which significantly enhance predictive capabilities and lead to smarter business solutions. In this work, we introduce MetaPix, a Data-centric AI platform offering comprehensive data management solutions specifically designed for unstructured data. MetaPix offers robust tools for data ingestion, processing, storage, versioning, governance, and discovery. The platform operates on four key concepts: DataSources, Datasets, Extensions and Extractors. A DataSource serves as MetaPix’s top-level asset, representing a narrow-scoped source of data for a specific use. Datasets are MetaPix’s second-level objects, structured collections of data. Extractors are internal tools integrated into MetaPix’s backend processing that facilitate data processing and enhancement. Additionally, MetaPix supports extensions, enabling integration with external third-party tools to enhance platform functionality. This paper delves into each MetaPix concept in detail, illustrating how they collectively contribute to the platform’s objectives. By providing a comprehensive solution for managing and utilizing unstructured computer vision data, MetaPix equips organizations with a powerful toolset to develop AI applications effectively.
[LG-69] Mastering Chess with a Transformer Model
Link: https://arxiv.org/abs/2409.12272
Authors: Daniel Monroe, The Leela Chess Zero Team
Keywords: demonstrated impressive capabilities, difficult cognitive tasks, cognitive tasks requiring, tasks requiring complex, requiring complex reasoning
Subjects: Machine Learning (cs.LG)
Abstract:Transformer models have demonstrated impressive capabilities when trained at scale, excelling at difficult cognitive tasks requiring complex reasoning and rational decision-making. In this paper, we explore the application of transformer models to chess, focusing on the critical role of the position encoding within the attention mechanism. We show that in chess, transformers endowed with a sufficiently versatile position encoding can match existing chess-playing models at a fraction of the computational cost. Our architecture significantly outperforms AlphaZero at 8x fewer FLOPS and matches prior grandmaster-level transformer-based agents at 30x fewer FLOPS.
[LG-70] User-friendly Foundation Model Adapters for Multivariate Time Series Classification
Link: https://arxiv.org/abs/2409.12264
Authors: Vasilii Feofanov, Romain Ilbert, Malik Tiomoko, Themis Palpanas, Ievgen Redko
Keywords: requiring substantial inference, substantial inference time, highly effective, requiring substantial, substantial inference
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: The first two authors contributed equally
Abstract:Foundation models, while highly effective, are often resource-intensive, requiring substantial inference time and memory. This paper addresses the challenge of making these models more accessible with limited computational resources by exploring dimensionality reduction techniques. Our goal is to enable users to run large pre-trained foundation models on standard GPUs without sacrificing performance. We investigate classical methods such as Principal Component Analysis alongside neural network-based adapters, aiming to reduce the dimensionality of multivariate time series data while preserving key features. Our experiments show up to a 10x speedup compared to the baseline model, without performance degradation, and enable up to 4.5x more datasets to fit on a single GPU, paving the way for more user-friendly and scalable foundation models.
[LG-71] Detecting LGBTQ Instances of Cyberbullying
Link: https://arxiv.org/abs/2409.12263
Authors: Muhammad Arslan, Manuel Sandoval Madrigal, Mohammed Abuhamad, Deborah L. Hall, Yasin N. Silva
Keywords: trajectory of humanity, Social media continues, LGBTQ, media continues, Social media
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: 10 pages, 4 tables, 1 figure, 17th International Conference on Social Computing, Behavioral-Cultural Modeling, Prediction and Behavior Representation in Modeling and Simulation
Abstract:Social media continues to have an impact on the trajectory of humanity. However, its introduction has also weaponized keyboards, allowing the abusive language normally reserved for in-person bullying to jump onto the screen, i.e., cyberbullying. Cyberbullying poses a significant threat to adolescents globally, affecting the mental health and well-being of many. A group that is particularly at risk is the LGBTQ+ community, as researchers have uncovered a strong correlation between identifying as LGBTQ+ and suffering from greater online harassment. Therefore, it is critical to develop machine learning models that can accurately discern cyberbullying incidents as they happen to LGBTQ+ members. The aim of this study is to compare the efficacy of several transformer models in identifying cyberbullying targeting LGBTQ+ individuals. We seek to determine the relative merits and demerits of these existing methods in addressing complex and subtle kinds of cyberbullying by assessing their effectiveness with real social media data.
[LG-72] Efficient Data Subset Selection to Generalize Training Across Models: Transductive and Inductive Networks NEURIPS2023
Link: https://arxiv.org/abs/2409.12255
Authors: Eeshaan Jain, Tushar Nandy, Gaurav Aggarwal, Ashish Tendulkar, Rishabh Iyer, Abir De
Keywords: efficient learning predominantly, learning predominantly employ, predominantly employ discrete, employ discrete combinatorial, Existing subset selection
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Published at NeurIPS 2023
Abstract: Existing subset selection methods for efficient learning predominantly employ discrete combinatorial and model-specific approaches which lack generalizability. For an unseen architecture, one cannot use the subset chosen for a different model. To tackle this problem, we propose SubSelNet, a trainable subset selection framework that generalizes across architectures. Here, we first introduce an attention-based neural gadget that leverages the graph structure of architectures and acts as a surrogate to trained deep neural networks for quick model prediction. Then, we use these predictions to build subset samplers. This naturally provides us two variants of SubSelNet. The first variant is transductive (called Transductive-SubSelNet), which computes the subset separately for each model by solving a small optimization problem. Such an optimization is still very fast, thanks to the replacement of explicit model training by the model approximator. The second variant is inductive (called Inductive-SubSelNet), which computes the subset using a trained subset selector, without any optimization. Our experiments show that our model outperforms several methods across several real datasets.
[LG-73] Sparks of Artificial General Intelligence(AGI) in Semiconductor Material Science: Early Explorations into the Next Frontier of Generative AI-Assisted Electron Micrograph Analysis AAAI-2024
Link: https://arxiv.org/abs/2409.12244
Authors: Sakhinana Sagar Srinivas, Geethan Sannidhi, Sreeja Gangasani, Chidaksh Ravuru, Venkataramana Runkana
Keywords: electron micrographs poses, micrographs poses significant, poses significant challenges, automated labeling due, Characterizing materials
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published at Deployable AI (DAI) Workshop at AAAI-2024
Abstract:Characterizing materials with electron micrographs poses significant challenges for automated labeling due to the complex nature of nanomaterial structures. To address this, we introduce a fully automated, end-to-end pipeline that leverages recent advances in Generative AI. It is designed for analyzing and understanding the microstructures of semiconductor materials with effectiveness comparable to that of human experts, contributing to the pursuit of Artificial General Intelligence (AGI) in nanomaterial identification. Our approach utilizes Large MultiModal Models (LMMs) such as GPT-4V, alongside text-to-image models like DALLE-3. We integrate a GPT-4 guided Visual Question Answering (VQA) method to analyze nanomaterial images, generate synthetic nanomaterial images via DALLE-3, and employ in-context learning with few-shot prompting in GPT-4V for accurate nanomaterial identification. Our method surpasses traditional techniques by enhancing the precision of nanomaterial identification and optimizing the process for high-throughput screening.
[LG-74] ARTICLE: Annotator Reliability Through In-Context Learning
Link: https://arxiv.org/abs/2409.12218
Authors: Sujan Dutta, Deepak Pandita, Tharindu Cyril Weerasooriya, Marcos Zampieri, Christopher M. Homan, Ashiqur R. KhudaBukhsh
Keywords: training and evaluation, key piece, piece of machine, learning in NLP, NLP
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Ensuring annotator quality in training and evaluation data is a key piece of machine learning in NLP. Tasks such as sentiment analysis and offensive speech detection are intrinsically subjective, creating a challenging scenario for traditional quality assessment approaches because it is hard to distinguish disagreement due to poor work from that due to differences of opinions between sincere annotators. With the goal of increasing diverse perspectives in annotation while ensuring consistency, we propose ARTICLE, an in-context learning (ICL) framework to estimate annotation quality through self-consistency. We evaluate this framework on two offensive speech datasets using multiple LLMs and compare its performance with traditional methods. Our findings indicate that ARTICLE can be used as a robust method for identifying reliable annotators, hence improving data quality.
[LG-75] Effects of Common Regularization Techniques on Open-Set Recognition
Link: https://arxiv.org/abs/2409.12217
Authors: Zachary Rabin, Jim Davis, Benjamin Lewis, Matthew Scherreik
Keywords: Open-Set Recognition, Open-Set Recognition performance, Recognition performance, recent years, increasing interest
Subjects: Machine Learning (cs.LG)
Abstract:In recent years there has been increasing interest in the field of Open-Set Recognition, which allows a classification model to identify inputs as “unknown” when it encounters an object or class not in the training set. This ability to flag unknown inputs is of vital importance to many real world classification applications. As almost all modern training methods for neural networks use extensive amounts of regularization for generalization, it is therefore important to examine how regularization techniques impact the ability of a model to perform Open-Set Recognition. In this work, we examine the relationship between common regularization techniques and Open-Set Recognition performance. Our experiments are agnostic to the specific open-set detection algorithm and examine the effects across a wide range of datasets. We show empirically that regularization methods can provide significant improvements to Open-Set Recognition performance, and we provide new insights into the relationship between accuracy and Open-Set performance.
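To make "identify inputs as unknown" concrete, here is a minimal sketch of one common open-set baseline: reject a prediction when the maximum softmax probability falls below a threshold. The abstract above states that its experiments are agnostic to the detection algorithm, so this particular rule, the threshold of 0.5, and the function names are assumptions chosen for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_open_set(logits, threshold=0.5):
    """Return the predicted class index, or "unknown" if confidence is too low.

    The max-softmax rule and the default threshold are illustrative assumptions.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else "unknown"


confident = predict_open_set([4.0, 0.5, 0.2])   # one logit clearly dominates
ambiguous = predict_open_set([1.0, 0.9, 1.1])   # near-uniform, flagged as unknown
```

Under a rule like this, any regularizer that changes how peaked the logits are on unfamiliar inputs also changes which inputs get flagged, which is the kind of interaction the abstract studies.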
[LG-76] SemAI: Semantic Artificial Intelligence-enhanced DNA storage for Internet-of-Things
Link: https://arxiv.org/abs/2409.12213
Authors: Wenfeng Wu, Luping Xiang, Qiang Liu, Kun Yang
Keywords: propelling DNA storage, global data landscape, data landscape undergoes, cloud storage applications, contemporary cloud storage
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:In the wake of the swift evolution of technologies such as the Internet of Things (IoT), the global data landscape undergoes an exponential surge, propelling DNA storage into the spotlight as a prospective medium for contemporary cloud storage applications. This paper introduces a Semantic Artificial Intelligence-enhanced DNA storage (SemAI-DNA) paradigm, distinguishing itself from prevalent deep learning-based methodologies through two key modifications: 1) embedding a semantic extraction module at the encoding terminus, facilitating the meticulous encoding and storage of nuanced semantic information; 2) conceiving a forethoughtful multi-reads filtering model at the decoding terminus, leveraging the inherent multi-copy propensity of DNA molecules to bolster system fault tolerance, coupled with a strategically optimized decoder’s architectural framework. Numerical results demonstrate the SemAI-DNA’s efficacy, attaining 2.61 dB Peak Signal-to-Noise Ratio (PSNR) gain and 0.13 improvement in Structural Similarity Index (SSIM) over conventional deep learning-based approaches.
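For readers unfamiliar with the metric, the reported 2.61 dB PSNR gain can be translated into a mean-squared-error ratio using the standard definition PSNR = 10 log10(MAX^2 / MSE). The sketch below applies only that definition; the 8-bit peak value of 255 and the example MSE of 100 are assumptions, not figures from the paper.

```python
import math


def psnr(mse, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB for a given mean squared error."""
    return 10 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE shrinks when PSNR rises by gain_db decibels."""
    return 10 ** (gain_db / 10)


ratio = mse_ratio_for_gain(2.61)     # roughly 1.82
baseline_mse = 100.0                 # assumed example value
improved_mse = baseline_mse / ratio
gain = psnr(improved_mse) - psnr(baseline_mse)
```

In other words, a 2.61 dB gain corresponds to cutting the reconstruction MSE by a factor of about 1.82, independent of the peak value chosen.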
[LG-77] Mixture of Diverse Size Experts
Link: https://arxiv.org/abs/2409.12210
Authors: Manxi Sun, Wei Liu, Jian Luan, Pengzhi Gao, Bin Wang
Keywords: large language models, exploding computational costs, gained increasing popularity, language models, computational costs
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:The Sparsely-Activated Mixture-of-Experts (MoE) has gained increasing popularity for scaling up large language models (LLMs) without exploding computational costs. Despite its success, the current design faces a challenge where all experts have the same size, limiting the ability of tokens to choose the experts with the most appropriate size for generating the next token. In this paper, we propose the Mixture of Diverse Size Experts (MoDSE), a new MoE architecture with layers designed to have experts of different sizes. Our analysis of difficult token generation tasks shows that experts of various sizes achieve better predictions, and the routing path of the experts tends to be stable after a training period. However, having experts of diverse sizes can lead to uneven workload distribution. To tackle this limitation, we introduce an expert-pair allocation strategy to evenly distribute the workload across multiple GPUs. Comprehensive evaluations across multiple benchmarks demonstrate the effectiveness of MoDSE, as it outperforms existing MoEs by allocating the parameter budget to experts adaptively while maintaining the same total parameter size and the number of experts.
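The workload problem mentioned in the abstract above, and one way a pairing scheme could address it, can be sketched in a few lines. This is a guess at the general idea rather than the paper's algorithm: the rule "pair the largest remaining expert with the smallest and place each pair on one GPU" is an assumption, as are the hidden sizes, the model width, and the two-matrix feed-forward parameter count.

```python
def ffn_params(d_model, d_hidden):
    """Parameter count of a two-matrix feed-forward expert (biases ignored)."""
    return 2 * d_model * d_hidden


def pair_experts(hidden_sizes):
    """Pair the k-th smallest expert with the k-th largest (assumed pairing rule)."""
    order = sorted(range(len(hidden_sizes)), key=lambda i: hidden_sizes[i])
    return [(order[k], order[-1 - k]) for k in range(len(order) // 2)]


d_model = 1024
# Eight experts of different sizes with the same total budget as 8 x 2048.
hidden_sizes = [1024, 3072, 1536, 2560, 2048, 2048, 512, 3584]
assert sum(hidden_sizes) == 8 * 2048

pairs = pair_experts(hidden_sizes)   # one pair per GPU in this sketch
loads = [
    ffn_params(d_model, hidden_sizes[i]) + ffn_params(d_model, hidden_sizes[j])
    for i, j in pairs
]
```

With sizes chosen symmetrically around the mean, every pair carries the same parameter load, so the per-GPU memory footprint matches that of a uniform-size MoE; balancing actual token traffic would additionally depend on the router.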
[LG-78] A Simple Model to Estimate Sharing Effects in Social Networks RECSYS’24
Link: https://arxiv.org/abs/2409.12203
Authors: Olivier Jeunen
Keywords: Randomised Controlled Trials, Randomised Controlled, Controlled Trials, fields of science, gold standard
Subjects: Social and Information Networks (cs.SI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Applications (stat.AP)
Comments: Accepted at the CONSEQUENCES '24 workshop, co-located with ACM RecSys '24
Abstract:Randomised Controlled Trials (RCTs) are the gold standard for estimating treatment effects across many fields of science. Technology companies have adopted A/B-testing methods as a modern RCT counterpart, where end-users are randomly assigned various system variants and user behaviour is tracked continuously. The objective is then to estimate the causal effect that the treatment variant would have on certain metrics of interest to the business. When the outcomes for randomisation units – end-users in this case – are not statistically independent, this obfuscates identifiability of treatment effects, and harms decision-makers’ observability of the system. Social networks exemplify this, as they are designed to promote inter-user interactions. This interference by design notoriously complicates measurement of, e.g., the effects of sharing. In this work, we propose a simple Markov Decision Process (MDP)-based model describing user sharing behaviour in social networks. We derive an unbiased estimator for treatment effects under this model, and demonstrate through reproducible synthetic experiments that it outperforms existing methods by a significant margin.
[LG-79] Nteasee: A mixed methods study of expert and general population perspectives on deploying AI for health in African countries
Link: https://arxiv.org/abs/2409.12197
Authors: Mercy Nyamewaa Asiedu, Iskandar Haykel, Awa Dieng, Kerrie Kauer, Tousif Ahmed, Florence Ofori, Charisma Chan, Stephen Pfohl, Negar Rostamzadeh, Katherine Heller
Keywords: Artificial Intelligence, improve healthcare, significantly change, change and improve, African countries
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Equal contributions
Abstract:Artificial Intelligence (AI) for health has the potential to significantly change and improve healthcare. However in most African countries, identifying culturally and contextually attuned approaches for deploying these solutions is not well understood. To bridge this gap, we conduct a qualitative study to investigate the best practices, fairness indicators, and potential biases to mitigate when deploying AI for health in African countries, as well as explore opportunities where artificial intelligence could make a positive impact in health. We used a mixed methods approach combining in-depth interviews (IDIs) and surveys. We conduct 1.5-2 hour long IDIs with 50 experts in health, policy, and AI across 17 countries, and through an inductive approach we conduct a qualitative thematic analysis on expert IDI responses. We administer a blinded 30-minute survey with case studies to 672 general population participants across 5 countries in Africa and analyze responses on quantitative scales, statistically comparing responses by country, age, gender, and level of familiarity with AI. We thematically summarize open-ended responses from surveys. Our results find generally positive attitudes, high levels of trust, accompanied by moderate levels of concern among general population participants for AI usage for health in Africa. This contrasts with expert responses, where major themes revolved around trust/mistrust, ethical concerns, and systemic barriers to integration, among others. This work presents the first-of-its-kind qualitative research study of the potential of AI for health in Africa from an algorithmic fairness angle, with perspectives from both experts and the general population. We hope that this work guides policymakers and drives home the need for further research and the inclusion of general population perspectives in decision-making around AI usage.
[LG-80] Inability of spatial transformations of CNN feature maps to support invariant recognition
Link: https://arxiv.org/abs/2004.14716
Authors: Ylva Jansson, Maksim Maydanskiy, Lukas Finnveden, Tony Lindeberg
Keywords: CNN feature maps, deep learning architectures, object appearance caused, CNN feature, feature maps
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 22 pages, 3 figures
Abstract: A large number of deep learning architectures use spatial transformations of CNN feature maps or filters to better deal with variability in object appearance caused by natural image transformations. In this paper, we prove that spatial transformations of CNN feature maps cannot align the feature maps of a transformed image to match those of its original, for general affine transformations, unless the extracted features are themselves invariant. Our proof is based on elementary analysis for both the single- and multi-layer network case. The results imply that methods based on spatial transformations of CNN feature maps or filters cannot replace image alignment of the input and cannot enable invariant recognition for general affine transformations, specifically not for scaling transformations or shear transformations. For rotations and reflections, spatially transforming feature maps or filters can enable invariance but only for networks with learnt or hardcoded rotation- or reflection-invariant features.
[LG-81] Exploring the ability of CNNs to generalise to previously unseen scales over wide scale ranges
Link: https://arxiv.org/abs/2004.01536
Authors: Ylva Jansson, Tony Lindeberg
Keywords: world visual tasks, real world visual, scale channel networks, handle large scale, large scale variations
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 14 pages, 6 figures, 3 tables
Abstract:The ability to handle large scale variations is crucial for many real world visual tasks. A straightforward approach for handling scale in a deep network is to process an image at several scales simultaneously in a set of scale channels. Scale invariance can then, in principle, be achieved by using weight sharing between the scale channels together with max or average pooling over the outputs from the scale channels. The ability of such scale channel networks to generalise to scales not present in the training set over significant scale ranges has, however, not previously been explored. We, therefore, present a theoretical analysis of invariance and covariance properties of scale channel networks and perform an experimental evaluation of the ability of different types of scale channel networks to generalise to previously unseen scales. We identify limitations of previous approaches and propose a new type of foveated scale channel architecture, where the scale channels process increasingly larger parts of the image with decreasing resolution. Our proposed FovMax and FovAvg networks perform almost identically over a scale range of 8, also when training on single scale training data, and do also give improvements in the small sample regime.
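The scale-channel idea described in this abstract (one shared feature extractor applied to several rescaled copies of the input, followed by max pooling over the channel outputs) can be illustrated with a toy 1-D sketch. Everything here is an assumption made for illustration: the bump detector, the three scale factors, and the nearest-neighbour rescaling are not taken from the paper, and the sketch does not implement the foveated FovMax/FovAvg architectures.

```python
def detector(x):
    """Shared feature extractor: best match to a width-2 bump (template [-2, 1, 1, -2])."""
    responses = [x[i + 1] + x[i + 2] - 2 * (x[i] + x[i + 3])
                 for i in range(len(x) - 3)]
    # A signal shorter than the template produces no response at all.
    return max(responses) if responses else float("-inf")


def rescale(x, factor):
    """Crude nearest-neighbour rescaling for the three factors used below."""
    if factor == 0.5:
        return x[::2]                       # keep every second sample
    if factor == 1:
        return list(x)
    if factor == 2:
        return [v for v in x for _ in range(2)]  # repeat every sample twice
    raise ValueError("unsupported scale factor")


def scale_channel_max(x, factors=(0.5, 1, 2)):
    """Apply the same detector in every scale channel, then max-pool over channels."""
    return max(detector(rescale(x, f)) for f in factors)


narrow_bump = [0, 0, 1, 1, 0, 0]          # bump of width 2
wide_bump = [0, 0, 1, 1, 1, 1, 0, 0]      # same bump at twice the size

single_scale = (detector(narrow_bump), detector(wide_bump))
pooled = (scale_channel_max(narrow_bump), scale_channel_max(wide_bump))
```

In this toy case the single-scale detector responds differently to the two bump sizes, while the max-pooled scale-channel output is the same for both, because some channel always sees the bump at the detector's preferred size.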
[LG-82] The problems with using STNs to align CNN feature maps
Link: https://arxiv.org/abs/2001.05858
Authors: Lukas Finnveden, Ylva Jansson, Tony Lindeberg
Keywords: Spatial transformer networks, learn invariance, CNN feature maps, transform CNN feature, Spatial transformer
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to Northern Lights Deep Learning Workshop 2020, Tromsø, 2 pages, 3 figures
Abstract:Spatial transformer networks (STNs) were designed to enable CNNs to learn invariance to image transformations. STNs were originally proposed to transform CNN feature maps as well as input images. This enables the use of more complex features when predicting transformation parameters. However, since STNs perform a purely spatial transformation, they do not, in the general case, have the ability to align the feature maps of a transformed image and its original. We present a theoretical argument for this and investigate the practical implications, showing that this inability is coupled with decreased classification accuracy. We advocate taking advantage of more complex features in deeper layers by instead sharing parameters between the classification and the localisation network.
[LG-83] Provably scale-covariant continuous hierarchical networks based on scale-normalized differential expressions coupled in cascade
Link: https://arxiv.org/abs/1905.13555
Authors: Tony Lindeberg
Keywords: provably scale covariant, constructing hierarchical networks, theory for constructing, constructing hierarchical, article presents
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 29 pages, 16 figures, 3 tables. arXiv admin note: substantial text overlap with arXiv:1903.00289
Abstract:This article presents a theory for constructing hierarchical networks in such a way that the networks are guaranteed to be provably scale covariant. We first present a general sufficiency argument for obtaining scale covariance, which holds for a wide class of networks defined from linear and non-linear differential expressions expressed in terms of scale-normalized scale-space derivatives. Then, we present a more detailed development of one example of such a network constructed from a combination of mathematically derived models of receptive fields and biologically inspired computations. Based on a functional model of complex cells in terms of an oriented quasi quadrature combination of first- and second-order directional Gaussian derivatives, we couple such primitive computations in cascade over combinatorial expansions over image orientations. Scale-space properties of the computational primitives are analysed and we give explicit proofs of how the resulting representation allows for scale and rotation covariance. A prototype application to texture analysis is developed and it is demonstrated that a simplified mean-reduced representation of the resulting QuasiQuadNet leads to promising experimental results on three texture datasets.
[LG-84] Provably scale-covariant networks from oriented quasi quadrature measures in cascade
Link: https://arxiv.org/abs/1903.00289
Authors: Tony Lindeberg
Keywords: hierarchical networks based, mathematically derived models, biologically inspired computations, article presents, presents a continuous
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 12 pages, 3 figures, 1 table
Abstract:This article presents a continuous model for hierarchical networks based on a combination of mathematically derived models of receptive fields and biologically inspired computations. Based on a functional model of complex cells in terms of an oriented quasi quadrature combination of first- and second-order directional Gaussian derivatives, we couple such primitive computations in cascade over combinatorial expansions over image orientations. Scale-space properties of the computational primitives are analysed and it is shown that the resulting representation allows for provable scale and rotation covariance. A prototype application to texture analysis is developed and it is demonstrated that a simplified mean-reduced representation of the resulting QuasiQuadNet leads to promising experimental results on three texture datasets.
[LG-85] WaveletGPT: Wavelets Meet Large Language Models
Link: https://arxiv.org/abs/2409.12924
Authors: Prateek Verma
Keywords: Large Language Models, Large Language, artificial intelligence advancements, intelligence advancements impacting, Language Models
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 16 pages, 4 figures
Abstract:Large Language Models (LLMs) have ushered in a new wave of artificial intelligence advancements impacting every scientific field and discipline. They are trained on a simple objective: to predict the next token given the previous context. We live in a world where most of the data around us, e.g., text, audio, and music, has a multi-scale structure associated with it. This paper infuses LLMs with traditional signal processing ideas, namely wavelets, during pre-training to take advantage of the structure. Without adding any extra parameters to a GPT-style LLM architecture, we achieve the same pre-training performance almost twice as fast in text, raw audio, and symbolic music. This is achieved by imposing a structure on intermediate embeddings. When trained for the same number of training steps, we achieve significant gains in performance, which is comparable to pre-training a larger neural architecture. Our architecture allows every next token prediction access to intermediate embeddings at different temporal resolutions in every Transformer decoder block. This work will hopefully pave the way for incorporating multi-rate signal processing ideas into traditional LLM pre-training. Further, we showcase pushing model performance by improving internal structure instead of just going after scale.
[LG-86] Online Proximal ADMM for Graph Learning from Streaming Smooth Signals ICASSP2025
Link: https://arxiv.org/abs/2409.12916
Authors: Hector Chahuara, Gonzalo Mateos
Keywords: multivariate data analysis, signal processing deals, leverage graph structures, data analysis, processing deals
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 5 pages, 2 figures, submitted to ICASSP 2025
Abstract:Graph signal processing deals with algorithms and signal representations that leverage graph structures for multivariate data analysis. Often said graph topology is not readily available and may be time-varying, hence (dynamic) graph structure learning from nodal (e.g., sensor) observations becomes a critical first step. In this paper, we develop a novel algorithm for online graph learning using observation streams, assumed to be smooth on the latent graph. Unlike batch algorithms for topology identification from smooth signals, our modus operandi is to process graph signals sequentially and thus keep memory and computational costs in check. To solve the resulting smoothness-regularized, time-varying inverse problem, we develop online and lightweight iterations built upon the proximal variant of the alternating direction method of multipliers (ADMM), well known for its fast convergence in batch settings. The proximal term in the topology updates seamlessly implements a temporal-variation regularization, and we argue the online procedure exhibits sublinear static regret under some simplifying assumptions. Reproducible experiments with synthetic and real graphs demonstrate the effectiveness of our method in adapting to streaming signals and tracking slowly-varying network connectivity. The proposed approach also exhibits better tracking performance (in terms of suboptimality), when compared to state-of-the-art online graph learning baselines.
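The abstract assumes signals that are "smooth on the latent graph" without giving the criterion. A common way to quantify this, assumed here for illustration only, is the Dirichlet energy: the weighted sum of squared differences across edges, which equals the Laplacian quadratic form x^T L x. The sketch below evaluates it on a hypothetical three-node path graph; it does not reproduce the paper's online proximal ADMM iterations.

```python
def dirichlet_energy(weights, signal):
    """Sum of w_ij * (x_i - x_j)^2 over undirected edges.

    weights: dict mapping an edge (i, j) to its non-negative weight.
    signal:  list of node values indexed by node id.
    """
    return sum(w * (signal[i] - signal[j]) ** 2
               for (i, j), w in weights.items())


# Hypothetical path graph 0 - 1 - 2 with unit edge weights.
path_graph = {(0, 1): 1.0, (1, 2): 1.0}

smooth_energy = dirichlet_energy(path_graph, [1.0, 1.0, 1.0])  # constant signal
rough_energy = dirichlet_energy(path_graph, [0.0, 1.0, 0.0])   # oscillating signal
```

A low energy means neighbouring nodes carry similar values, so a graph-learning method that favours smoothness prefers edge weights under which the observed signals have small energy.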
[LG-87] Deep Learning-Based Detection of Referable Diabetic Retinopathy and Macular Edema Using Ultra-Widefield Fundus Imaging
Link: https://arxiv.org/abs/2409.12854
Authors: Philippe Zhang, Pierre-Henri Conze, Mathieu Lamard, Gwenolé Quellec, Mostafa El Habib Daho
Keywords: diabetic macular edema, Diabetic retinopathy, diabetic macular, vision loss, macular edema
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract:Diabetic retinopathy and diabetic macular edema are significant complications of diabetes that can lead to vision loss. Early detection through ultra-widefield fundus imaging enhances patient outcomes but presents challenges in image quality and analysis scale. This paper introduces deep learning solutions for automated UWF image analysis within the framework of the MICCAI 2024 UWF4DR challenge. We detail methods and results across three tasks: image quality assessment, detection of referable DR, and identification of DME. Employing advanced convolutional neural network architectures such as EfficientNet and ResNet, along with preprocessing and augmentation strategies, our models demonstrate robust performance in these tasks. Results indicate that deep learning can significantly aid in the automated analysis of UWF images, potentially improving the efficiency and accuracy of DR and DME detection in clinical settings.
[LG-88] Machine-learning based high-bandwidth magnetic sensing
Link: https://arxiv.org/abs/2409.12820
Authors: Galya Haim, Stefano Martina, John Howell, Nir Bar-Gill, Filippo Caruso
Keywords: Recent years, significant growth, capabilities of advanced, specifically quantum sensing, magnetic sensing
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applied Physics (physics.app-ph); Computational Physics (physics.comp-ph)
Comments: 12 pages including supplementary, 6 figures
Abstract:Recent years have seen significant growth of quantum technologies, and specifically quantum sensing, both in terms of the capabilities of advanced platforms and their applications. One of the leading platforms in this context is nitrogen-vacancy (NV) color centers in diamond, providing versatile, high-sensitivity, and high-resolution magnetic sensing. Nevertheless, current schemes for spin resonance magnetic sensing (as applied by NV quantum sensing) suffer from tradeoffs associated with sensitivity, dynamic range, and bandwidth. Here we address this issue, and implement machine learning tools to enhance NV magnetic sensing in terms of the sensitivity/bandwidth tradeoff in large dynamic range scenarios. We experimentally demonstrate this new approach, reaching an improvement in the relevant figure of merit by a factor of up to 5. Our results promote quantum machine learning protocols for sensing applications towards more feasible and efficient quantum technologies.
[LG-89] Robust estimation of the intrinsic dimension of data sets with quantum cognition machine learning
Link: https://arxiv.org/abs/2409.12805
Authors: Luca Candelori, Alexander G. Abanov, Jeffrey Berger, Cameron J. Hogan, Vahagn Kirakosyan, Kharen Musaelian, Ryan Samson, James E. T. Smith, Dario Villani, Martin T. Wells, Mengjia Xu
Keywords: Cognition Machine Learning, Quantum Cognition Machine, Cognition Machine, Machine Learning, Quantum Cognition
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Quantum Physics (quant-ph)
Abstract:We propose a new data representation method based on Quantum Cognition Machine Learning and apply it to manifold learning, specifically to the estimation of intrinsic dimension of data sets. The idea is to learn a representation of each data point as a quantum state, encoding both local properties of the point as well as its relation with the entire data. Inspired by ideas from quantum geometry, we then construct from the quantum states a point cloud equipped with a quantum metric. The metric exhibits a spectral gap whose location corresponds to the intrinsic dimension of the data. The proposed estimator is based on the detection of this spectral gap. When tested on synthetic manifold benchmarks, our estimates are shown to be robust with respect to the introduction of point-wise Gaussian noise. This is in contrast to current state-of-the-art estimators, which tend to attribute artificial "shadow dimensions" to noise artifacts, leading to overestimates. This is a significant advantage when dealing with real data sets, which are inevitably affected by unknown levels of noise. We show the applicability and robustness of our method on real data, by testing it on the ISOMAP face database, MNIST, and the Wisconsin Breast Cancer Dataset.
[LG-90] The Central Role of the Loss Function in Reinforcement Learning
Link: https://arxiv.org/abs/2409.12799
Authors: Kaiwen Wang, Nathan Kallus, Wen Sun
Keywords: decision making algorithms, data-driven decision making, decision making, loss functions, value-based decision making
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Abstract:This paper illustrates the central role of loss functions in data-driven decision making, providing a comprehensive survey on their influence in cost-sensitive classification (CSC) and reinforcement learning (RL). We demonstrate how different regression loss functions affect the sample efficiency and adaptivity of value-based decision making algorithms. Across multiple settings, we prove that algorithms using the binary cross-entropy loss achieve first-order bounds scaling with the optimal policy’s cost and are much more efficient than the commonly used squared loss. Moreover, we prove that distributional algorithms using the maximum likelihood loss achieve second-order bounds scaling with the policy variance and are even sharper than first-order bounds. This in particular proves the benefits of distributional RL. We hope that this paper serves as a guide analyzing decision making algorithms with varying loss functions, and can inspire the reader to seek out better loss functions to improve any decision making algorithm.
[LG-91] Multi-Source and Multi-Sequence Myocardial Pathology Segmentation Using a Cascading Refinement CNN
Link: https://arxiv.org/abs/2409.12792
Authors: Franz Thaler, Darko Stern, Gernot Plank, Martin Urschler
Keywords: prevalent cardiovascular diseases, myocardial tissue, morbidity worldwide, mortality and morbidity, Myocardial infarction
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract:Myocardial infarction (MI) is one of the most prevalent cardiovascular diseases and consequently, a major cause for mortality and morbidity worldwide. Accurate assessment of myocardial tissue viability for post-MI patients is critical for diagnosis and treatment planning, e.g. allowing surgical revascularization, or to determine the risk of adverse cardiovascular events in the future. Fine-grained analysis of the myocardium and its surrounding anatomical structures can be performed by combining the information obtained from complementary medical imaging techniques. In this work, we use late gadolinium enhanced (LGE) magnetic resonance (MR), T2-weighted (T2) MR and balanced steady-state free precession (bSSFP) cine MR in order to semantically segment the left and right ventricle, healthy and scarred myocardial tissue, as well as edema. To this end, we propose the Multi-Sequence Cascading Refinement CNN (MS-CaRe-CNN), a 2-stage CNN cascade that receives multi-sequence data and generates predictions of the anatomical structures of interest without considering tissue viability at Stage 1. The prediction of Stage 1 is then further refined in Stage 2, where the model additionally distinguishes myocardial tissue based on viability, i.e. healthy, scarred and edema regions. Our proposed method is set up as a 5-fold ensemble and semantically segments scar tissue achieving 62.31% DSC and 82.65% precision, as well as 63.78% DSC and 87.69% precision for the combined scar and edema region. These promising results for such small and challenging structures confirm that MS-CaRe-CNN is well-suited to generate semantic segmentations to assess the viability of myocardial tissue, enabling downstream tasks like personalized therapy planning.
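The DSC and precision figures quoted in this abstract are overlap metrics between a predicted segmentation mask and a reference mask. The following is a minimal sketch of how such scores are commonly computed for binary masks; the function name, the toy masks, and the convention of returning 1.0 for empty masks are illustrative assumptions and are not taken from the paper.

```python
def dice_and_precision(pred, ref):
    """Return (DSC, precision) for two binary masks given as flat lists of 0/1."""
    if len(pred) != len(ref):
        raise ValueError("masks must have the same length")
    tp = sum(1 for p, r in zip(pred, ref) if p and r)        # true positives
    fp = sum(1 for p, r in zip(pred, ref) if p and not r)    # false positives
    fn = sum(1 for p, r in zip(pred, ref) if not p and r)    # false negatives
    denom = 2 * tp + fp + fn
    # Assumed convention: two empty masks count as a perfect match.
    dsc = 2 * tp / denom if denom else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    return dsc, precision


# Toy masks: 2 true positives, 1 false positive, 1 false negative.
predicted = [1, 1, 0, 0, 1, 0]
reference = [1, 0, 0, 1, 1, 0]
dsc, precision = dice_and_precision(predicted, reference)
```

DSC penalises both false positives and false negatives, whereas precision ignores missed voxels. A method can therefore report a much higher precision than DSC, as in the abstract, when it segments conservatively.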
[LG-92] PRAGA: Prototype-aware Graph Adaptive Aggregation for Spatial Multi-modal Omics Analysis
Link: https://arxiv.org/abs/2409.12728
Authors: Xinlei Huang, Zhiqi Ma, Dian Meng, Yanran Liu, Shiwei Ruan, Qingqiang Sun, Xubin Zheng, Ziyue Qiao
Keywords: Spatial multi-modal omics, multi-modal omics, highlighted by Nature, Spatial multi-modal, multi-modal omics technology
Categories: Genomics (q-bio.GN); Machine Learning (cs.LG)
Abstract:Spatial multi-modal omics technology, highlighted by Nature Methods as an advanced biological technique in 2023, plays a critical role in resolving biological regulatory processes with spatial context. Recently, graph neural networks based on K-nearest neighbor (KNN) graphs have gained prominence in spatial multi-modal omics methods due to their ability to model semantic relations between sequencing spots. However, the fixed KNN graph fails to capture the latent semantic relations hidden by the inevitable data perturbations during the biological sequencing process, resulting in the loss of semantic information. In addition, the common lack of spot annotation and class number priors in practice further hinders the optimization of spatial multi-modal omics models. Here, we propose a novel spatial multi-modal omics resolved framework, termed PRototype-Aware Graph Adaptative Aggregation for Spatial Multi-modal Omics Analysis (PRAGA). PRAGA constructs a dynamic graph to capture latent semantic relations and comprehensively integrate spatial information and feature semantics. The learnable graph structure can also denoise perturbations by learning cross-modal knowledge. Moreover, a dynamic prototype contrastive learning is proposed based on the dynamic adaptability of Bayesian Gaussian Mixture Models to optimize the multi-modal omics representations for unknown biological priors. Quantitative and qualitative experiments on simulated and real datasets with 7 competing methods demonstrate the superior performance of PRAGA.
[LG-93] Rapid aerodynamic prediction of swept wings via physics-embedded transfer learning
Link: https://arxiv.org/abs/2409.12711
Authors: Yunjia Yang, Runze Li, Yufei Zhang, Lu Lu, Haixin Chen
Keywords: Machine learning-based models, rapidly acquire transonic, acquire transonic swept, large computational costs, Machine learning-based
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Abstract:Machine learning-based models provide a promising way to rapidly acquire transonic swept wing flow fields but suffer from large computational costs in establishing training datasets. Here, we propose a physics-embedded transfer learning framework to efficiently train the model by leveraging the idea that a three-dimensional flow field around wings can be analyzed with two-dimensional flow fields around cross-sectional airfoils. An airfoil aerodynamics prediction model is pretrained with airfoil samples. Then, an airfoil-to-wing transfer model is fine-tuned with a few wing samples to predict three-dimensional flow fields based on two-dimensional results on each spanwise cross section. Sweep theory is embedded when determining the corresponding airfoil geometry and operating conditions, and to obtain the sectional airfoil lift coefficient, which is one of the operating conditions, the low-fidelity vortex lattice method and data-driven methods are proposed and evaluated. Compared to a nontransfer model, introducing the pretrained model reduces the error by 30%, while introducing sweep theory further reduces the error by 9%. When reducing the dataset size, less than half of the wing training samples are needed to reach the same error level as the nontransfer framework, which makes establishing the model much easier.
[LG-94] Machine-learning-based multipoint optimization of fluidic injection parameters for improving nozzle performance
Link: https://arxiv.org/abs/2409.12707
Authors: Yunjia Yang, Jiazhe Li, Yufei Zhang, Haixin Chen
Keywords: overexpanded single expansion, single expansion ramp, expansion ramp nozzle, Fluidic injection, vehicle acceleration
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Abstract:Fluidic injection provides a promising solution to improve the performance of overexpanded single expansion ramp nozzle (SERN) during vehicle acceleration. However, determining the injection parameters for the best overall performance under multiple nozzle operating conditions is still a challenge. The gradient-based optimization method requires gradients of injection parameters at each design point, leading to high computational costs if traditional computational fluid dynamic (CFD) simulations are adopted. This paper uses a pretrained neural network model to replace CFD during optimization to quickly calculate the nozzle flow field at multiple design points. Considering the physical characteristics of the nozzle flow field, a prior-based prediction strategy is adopted to enhance the model’s transferability. In addition, the back-propagation algorithm of the neural network is adopted to quickly evaluate the gradients by calling the computation process only once, thereby greatly reducing the gradient computation time compared to the finite differential method. As a test case, the average nozzle thrust coefficient of a SERN at seven design points is optimized. An improvement in the thrust coefficient of 1.14% is achieved, and the time cost is greatly reduced compared with the traditional optimization methods, even when the time to establish the database for training is considered.
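The abstract contrasts gradients obtained by back-propagation through a neural surrogate with gradients from finite differences. The sketch below illustrates that contrast on a tiny hand-written surrogate with one tanh hidden layer. The weights, the input point, and the use of central differences are hypothetical choices for illustration; the paper's actual network and CFD setup are not reproduced.

```python
import math

# Hypothetical surrogate: output = sum_j v[j] * tanh(W[j] . x + b[j])
W = [[0.5, -0.3, 0.8], [0.1, 0.9, -0.4]]
b = [0.05, -0.2]
v = [1.5, -0.7]


def hidden_layer(x):
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)]


def surrogate(x):
    return sum(vj * hj for vj, hj in zip(v, hidden_layer(x)))


def grad_backprop(x):
    """One forward pass plus one backward pass yields all partial derivatives."""
    h = hidden_layer(x)
    # d(output)/d(pre-activation_j) = v_j * (1 - tanh^2)
    dz = [vj * (1.0 - hj * hj) for vj, hj in zip(v, h)]
    return [sum(dzj * row[i] for dzj, row in zip(dz, W)) for i in range(len(x))]


def grad_finite_difference(x, eps=1e-6):
    """Central differences: two extra surrogate evaluations per input parameter."""
    grads = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grads.append((surrogate(up) - surrogate(down)) / (2 * eps))
    return grads


point = [0.2, -0.1, 0.4]
g_bp = grad_backprop(point)
g_fd = grad_finite_difference(point)
max_gap = max(abs(a - c) for a, c in zip(g_bp, g_fd))
```

Both routines agree to within numerical error, but the finite-difference cost grows with the number of parameters, while the back-propagated gradient needs a single backward pass regardless of input dimension.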
[LG-95] Theoretical Analysis of Heteroscedastic Gaussian Processes with Posterior Distributions
Link: https://arxiv.org/abs/2409.12622
Authors: Yuji Ito
Keywords: heteroscedastic Gaussian processes, analyzing heteroscedastic Gaussian, Gaussian processes, heteroscedastic Gaussian, data-driven manner
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Abstract:This study introduces a novel theoretical framework for analyzing heteroscedastic Gaussian processes (HGPs) that identify unknown systems in a data-driven manner. Although HGPs effectively address the heteroscedasticity of noise in complex training datasets, calculating the exact posterior distributions of the HGPs is challenging, as these distributions are no longer multivariate normal. This study derives the exact means, variances, and cumulative distributions of the posterior distributions. Furthermore, the derived theoretical findings are applied to a chance-constrained tracking controller. After an HGP identifies an unknown disturbance in a plant system, the controller can handle chance constraints regarding the system despite the presence of the disturbance.
[LG-96] Is Tokenization Needed for Masked Particle Modelling?
Link: https://arxiv.org/abs/2409.12589
Authors: Matthew Leigh, Samuel Klein, François Charton, Tobias Golling, Lukas Heinrich, Michael Kagan, Inês Ochoa, Margarita Osadchy
Keywords: masked particle modeling, significantly enhance masked, enhance masked particle, constructing highly expressive, highly expressive representations
Categories: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG)
Abstract:In this work, we significantly enhance masked particle modeling (MPM), a self-supervised learning scheme for constructing highly expressive representations of unordered sets relevant to developing foundation models for high-energy physics. In MPM, a model is trained to recover the missing elements of a set, a learning objective that requires no labels and can be applied directly to experimental data. We achieve significant performance improvements over previous work on MPM by addressing inefficiencies in the implementation and incorporating a more powerful decoder. We compare several pre-training tasks and introduce new reconstruction methods that utilize conditional generative models without data tokenization or discretization. We show that these new methods outperform the tokenized learning objective from the original MPM on a new test bed for foundation models for jets, which includes using a wide variety of downstream tasks relevant to jet physics, such as classification, secondary vertex finding, and track identification.
[LG-97] Test-Time Augmentation Meets Variational Bayes
Link: https://arxiv.org/abs/2409.12587
Authors: Masanari Kimura, Howard Bondell
Keywords: Data augmentation, machine learning models, data augmentation methods, Data, TTA
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:Data augmentation is known to contribute significantly to the robustness of machine learning models. In most instances, data augmentation is utilized during the training phase. Test-Time Augmentation (TTA) is a technique that instead leverages these data augmentations during the testing phase to achieve robust predictions. More precisely, TTA averages the predictions of multiple data augmentations of an instance to produce a final prediction. Although the effectiveness of TTA has been empirically reported, it can be expected that the predictive performance achieved will depend on the set of data augmentation methods used during testing. In particular, the data augmentation methods applied should make different contributions to performance. That is, it is anticipated that there may be differing degrees of contribution in the set of data augmentation methods used for TTA, and these could have a negative impact on prediction performance. In this study, we consider a weighted version of the TTA based on the contribution of each data augmentation. Some variants of TTA can be regarded as considering the problem of determining the appropriate weighting. We demonstrate that the determination of the coefficients of this weighted TTA can be formalized in a variational Bayesian framework. We also show that optimizing the weights to maximize the marginal log-likelihood suppresses candidates of unwanted data augmentations at the test phase.
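The abstract describes plain TTA as an average of predictions over augmented copies of an instance, and the weighted variant as giving each augmentation its own coefficient. A minimal sketch of that weighted average follows. The prediction values and weights are made up for illustration, and the weights are set by hand here, whereas the paper determines them within a variational Bayesian framework that is not reproduced.

```python
def weighted_tta(predictions, weights):
    """Weighted average of per-augmentation class-probability vectors.

    predictions: list of probability lists, one per augmentation.
    weights:     non-negative coefficients, one per augmentation.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per augmentation")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    normalised = [w / total for w in weights]
    n_classes = len(predictions[0])
    return [sum(w * p[c] for w, p in zip(normalised, predictions))
            for c in range(n_classes)]


# Hypothetical model outputs for three augmentations of one test instance.
augmented_predictions = [[0.7, 0.3], [0.6, 0.4], [0.1, 0.9]]

plain_tta = weighted_tta(augmented_predictions, [1, 1, 1])    # uniform weights
suppressed = weighted_tta(augmented_predictions, [1, 1, 0])   # third one switched off
```

With uniform weights the third, outlying augmentation flips the predicted class; setting its weight to zero restores the majority prediction, which is the kind of suppression of unwanted augmentations the abstract refers to.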
[LG-98] Unsupervised Reward-Driven Image Segmentation in Automated Scanning Transmission Electron Microscopy Experiments
Link: https://arxiv.org/abs/2409.12462
Authors: Kamyar Barakati, Utkarsh Pratiush, Austin C. Houston, Gerd Duscher, Sergei V. Kalinin
Keywords: optimize data representation, require rapid image, rapid image segmentation, site-selective spectroscopies, atomic manipulation
Categories: Materials Science (cond-mat.mtrl-sci); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 17 pages, 6 images
Abstract:Automated experiments in scanning transmission electron microscopy (STEM) require rapid image segmentation to optimize data representation for human interpretation, decision-making, site-selective spectroscopies, and atomic manipulation. Currently, segmentation tasks are typically performed using supervised machine learning methods, which require human-labeled data and are sensitive to out-of-distribution drift effects caused by changes in resolution, sampling, or beam shape. Here, we operationalize and benchmark a recently proposed reward-driven optimization workflow for on-the-fly image analysis in STEM. This unsupervised approach is much more robust, as it does not rely on human labels and is fully explainable. The explanatory feedback can help the human to verify the decision making and potentially tune the model by selecting the position along the Pareto frontier of reward functions. We establish the timing and effectiveness of this method, demonstrating its capability for real-time performance in high-throughput and dynamic automated STEM experiments. The reward-driven approach allows the construction of explainable, robust analysis workflows and can be generalized to a broad range of image analysis tasks in electron and scanning probe microscopy and chemical imaging.
[LG-99] Axial Attention Transformer Networks: A New Frontier in Breast Cancer Detection
Link: https://arxiv.org/abs/2409.12347
Authors: Weijie He, Runyuan Bao, Yiru Cang, Jianjun Wei, Yang Zhang, Jiacheng Hu
Keywords: breast cancer images, medical image segmentation, breast cancer, breast cancer diagnosis, Transformer-based segmentation model
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract:This paper delves into the challenges and advancements in the field of medical image segmentation, particularly focusing on breast cancer diagnosis. The authors propose a novel Transformer-based segmentation model that addresses the limitations of traditional convolutional neural networks (CNNs), such as U-Net, in accurately localizing and segmenting small lesions within breast cancer images. The model introduces an axial attention mechanism to enhance the computational efficiency and address the issue of global contextual information that is often overlooked by CNNs. Additionally, the paper discusses improvements tailored to the small dataset challenge, including the incorporation of relative position information and a gated axial attention mechanism to refine the model’s focus on relevant features. The proposed model aims to significantly improve the segmentation accuracy of breast cancer images, offering a more efficient and effective tool for computer-aided diagnosis.
[LG-100] Deep vessel segmentation with joint multi-prior encoding
Link: https://arxiv.org/abs/2409.12334
Authors: Amine Sadikine, Bogdan Badic, Enzo Ferrante, Vincent Noblet, Pascal Ballet, Dimitris Visvikis, Pierre-Henri Conze
Keywords: including pathology detection, clinical applications, including pathology, surgical planning, pathology detection
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 5 pages, 3 figures, conference
Abstract:The precise delineation of blood vessels in medical images is critical for many clinical applications, including pathology detection and surgical planning. However, fully-automated vascular segmentation is challenging because of the variability in shape, size, and topology. Manual segmentation remains the gold standard but is time-consuming, subjective, and impractical for large-scale studies. Hence, there is a need for automatic and reliable segmentation methods that can accurately detect blood vessels from medical images. The integration of shape and topological priors into vessel segmentation models has been shown to improve segmentation accuracy by offering contextual information about the shape of the blood vessels and their spatial relationships within the vascular tree. To further improve anatomical consistency, we propose a new joint prior encoding mechanism which incorporates both shape and topology in a single latent space. The effectiveness of our method is demonstrated on the publicly available 3D-IRCADb dataset. More globally, the proposed approach holds promise in overcoming the challenges associated with automatic vessel delineation and has the potential to advance the field of deep priors encoding.
[LG-101] Scale-specific auxiliary multi-task contrastive learning for deep liver vessel segmentation
Link: https://arxiv.org/abs/2409.12333
Authors: Amine Sadikine, Bogdan Badic, Jean-Pierre Tasu, Vincent Noblet, Pascal Ballet, Dimitris Visvikis, Pierre-Henri Conze
Keywords: functionally-independent Couinaud segments, Extracting hepatic vessels, Couinaud segments, Extracting hepatic, functionally-independent Couinaud
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 5 pages, 5 figures, conference
Abstract: Extracting hepatic vessels from abdominal images is of high interest for clinicians since it allows the liver to be divided into functionally independent Couinaud segments. Automated liver blood vessel extraction is therefore in wide demand. Despite the significant growth in performance of semantic segmentation methodologies, preserving the complex multi-scale geometry of main vessels and ramifications remains a major challenge. This paper provides a new deep supervised approach for vessel segmentation, with a strong focus on representations arising from the different scales inherent to the vascular tree geometry. In particular, we propose a new clustering technique to decompose the tree into various scale levels, from tiny to large vessels. Then, we extend the standard 3D UNet to multi-task learning by incorporating scale-specific auxiliary tasks and contrastive learning to encourage the discrimination between scales in the shared representation. Promising results across several evaluation metrics are reported on the public 3D-IRCADb dataset.
[LG-102] Amortized Variational Inference for Deep Gaussian Processes
Link: https://arxiv.org/abs/2409.12301
Authors: Qiuxian Meng, Yongyou Zhang
Keywords: predictive uncertainty estimates, principled predictive uncertainty, Bayesian nonparametric models, Deep Gaussian processes, Gaussian processes
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Gaussian processes (GPs) are Bayesian nonparametric models for function approximation with principled predictive uncertainty estimates. Deep Gaussian processes (DGPs) are multilayer generalizations of GPs that can represent complex marginal densities as well as complex mappings. As exact inference is either computationally prohibitive or analytically intractable in GPs and extensions thereof, some existing methods resort to variational inference (VI) techniques for tractable approximations. However, the expressivity of conventional approximate GP models critically relies on independent inducing variables that might not be informative enough for some problems. In this work we introduce amortized variational inference for DGPs, which learns an inference function that maps each observation to variational parameters. The resulting method enjoys a more expressive prior conditioned on fewer input dependent inducing variables and a flexible amortized marginal posterior that is able to model more complicated functions. We show with theoretical reasoning and experimental results that our method performs similarly or better than previous approaches at less computational cost.
[LG-103] Unsupervised Feature Orthogonalization for Learning Distortion-Invariant Representations BMVC2024
Link: https://arxiv.org/abs/2409.12276
Authors: Sebastian Doerrich, Francesco Di Salvo, Christian Ledig
Keywords: Vision Transformer, integrates unsupervised feature, unsupervised feature orthogonalization, study introduces unORANIC, Transformer to capture
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at RROW@BMVC 2024 (Workshop on Robust Recognition in the Open World at the British Machine Vision Conference)
Abstract:This study introduces unORANIC+, a novel method that integrates unsupervised feature orthogonalization with the ability of a Vision Transformer to capture both local and global relationships for improved robustness and generalizability. The streamlined architecture of unORANIC+ effectively separates anatomical and image-specific attributes, resulting in robust and unbiased latent representations that allow the model to demonstrate excellent performance across various medical image analysis tasks and diverse datasets. Extensive experimentation demonstrates unORANIC+'s reconstruction proficiency, corruption resilience, as well as capability to revise existing image distortions. Additionally, the model exhibits notable aptitude in downstream tasks such as disease classification and corruption detection. We confirm its adaptability to diverse datasets of varying image sources and sample sizes which positions the method as a promising algorithm for advanced medical image analysis, particularly in resource-constrained environments lacking large, tailored datasets. The source code is available at this https URL .
[LG-104] Conformal Fields from Neural Networks
Link: https://arxiv.org/abs/2409.12222
Authors: James Halverson, Joydeep Naskar, Jiahua Tian
Keywords: restricting Lorentz-invariant ensembles, projective null cone, restricting Lorentz-invariant, Lorentz-invariant ensembles, homogeneous neural networks
Categories: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG)
Comments: 32+16 pages
Abstract:We use the embedding formalism to construct conformal fields in D dimensions, by restricting Lorentz-invariant ensembles of homogeneous neural networks in (D+2) dimensions to the projective null cone. Conformal correlators may be computed using the parameter space description of the neural network. Exact four-point correlators are computed in a number of examples, and we perform a 4D conformal block decomposition that elucidates the spectrum. In some examples the analysis is facilitated by recent approaches to Feynman integrals. Generalized free CFTs are constructed using the infinite-width Gaussian process limit of the neural network, enabling a realization of the free boson. The extension to deep networks constructs conformal fields at each subsequent layer, with recursion relations relating their conformal dimensions and four-point functions. Numerical approaches are discussed.
[LG-105] Assessing Reusability of Deep Learning-Based Monotherapy Drug Response Prediction Models Trained with Omics Data
Link: https://arxiv.org/abs/2409.12215
Authors: Jamie C. Overbeek, Alexander Partin, Thomas S. Brettin, Nicholas Chia, Oleksandr Narykov, Priyanka Vasanthakumari, Andreas Wilke, Yitan Zhu, Austin Clyde, Sara Jones, Rohan Gnanaolivu, Yuanhang Liu, Jun Jiang, Chen Wang, Carter Knutson, Andrew McNaughton, Neeraj Kumar, Gayara Demini Fernando, Souparno Ghosh, Cesar Sanchez-Villalobos, Ruibo Zhang, Ranadip Pal, M. Ryan Weil, Rick L. Stevens
Keywords: Cancer drug response, individual patient profiles, Cancer drug, DRP models, DRP
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments: 12 pages, 2 figures
Abstract: Cancer drug response prediction (DRP) models present a promising approach towards precision oncology, tailoring treatments to individual patient profiles. While deep learning (DL) methods have shown great potential in this area, models that can be successfully translated into clinical practice and shed light on the molecular mechanisms underlying treatment response will likely emerge from collaborative research efforts. This highlights the need for reusable and adaptable models that can be improved and tested by the wider scientific community. In this study, we present a scoring system for assessing the reusability of DRP models, and apply it to 17 peer-reviewed DL-based DRP models. As part of the IMPROVE (Innovative Methodologies and New Data for Predictive Oncology Model Evaluation) project, which aims to develop methods for systematic evaluation and comparison of DL models across scientific domains, we analyzed these 17 DRP models focusing on three key categories: software environment, code modularity, and data availability and preprocessing. While not the primary focus, we also attempted to reproduce key performance metrics to verify model behavior and adaptability. Our assessment of 17 DRP models reveals both strengths and shortcomings in model reusability. To promote rigorous practices and open-source sharing, we offer recommendations for developing and sharing prediction models. Following these recommendations can address many of the issues identified in this study, improving model reusability without adding significant burdens on researchers. This work offers the first comprehensive assessment of reusability and reproducibility across diverse DRP models, providing insights into current model sharing practices and promoting standards within the DRP and broader AI-enabled scientific research community.
[LG-106] Reproduction of IVFS algorithm for high-dimensional topology preservation feature selection
Link: https://arxiv.org/abs/2409.12195
Authors: Zihan Wang
Keywords: Feature selection, handling high-dimensional data, crucial technique, technique for handling, handling high-dimensional
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: arXiv admin note: substantial text overlap with arXiv:2004.01299 by other authors
Abstract:Feature selection is a crucial technique for handling high-dimensional data. In unsupervised scenarios, many popular algorithms focus on preserving the original data structure. In this paper, we reproduce the IVFS algorithm introduced in AAAI 2020, which is inspired by the random subset method and preserves data similarity by maintaining topological structure. We systematically organize the mathematical foundations of IVFS and validate its effectiveness through numerical experiments similar to those in the original paper. The results demonstrate that IVFS outperforms SPEC and MCFS on most datasets, although issues with its convergence and stability persist.
Information Retrieval
[IR-0] MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines
Link: https://arxiv.org/abs/2409.12959
Authors: Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanmin Wu, Jiayi Lei, Pengshuo Qiu, Pan Lu, Zehui Chen, Guanglu Song, Peng Gao, Yu Liu, Chunyuan Li, Hongsheng Li
Keywords: Large Language Models, Large Multimodal Models, Large Language, Language Models, multimodal search
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Project Page: this https URL
Abstract: The advent of Large Language Models (LLMs) has paved the way for AI search engines, e.g., SearchGPT, showcasing a new paradigm in human-internet interaction. However, most current AI search engines are limited to text-only settings, neglecting multimodal user queries and the text-image interleaved nature of website information. Recently, Large Multimodal Models (LMMs) have made impressive strides. Yet, whether they can function as AI search engines remains under-explored, leaving the potential of LMMs in multimodal search an open question. To this end, we first design a delicate pipeline, MMSearch-Engine, to empower any LMM with multimodal search capabilities. On top of this, we introduce MMSearch, a comprehensive evaluation benchmark to assess the multimodal search performance of LMMs. The curated dataset contains 300 manually collected instances spanning 14 subfields, which have no overlap with the current LMMs’ training data, ensuring the correct answer can only be obtained through searching. Using MMSearch-Engine, the LMMs are evaluated on three individual tasks (requery, rerank, and summarization) and one challenging end-to-end task with a complete searching process. We conduct extensive experiments on closed-source and open-source LMMs. Among all tested models, GPT-4o with MMSearch-Engine achieves the best results, surpassing the commercial product Perplexity Pro in the end-to-end task and demonstrating the effectiveness of our proposed pipeline. We further present an error analysis showing that current LMMs still struggle to fully grasp the multimodal search tasks, and conduct an ablation study indicating the potential of scaling test-time computation for AI search engines. We hope MMSearch may provide unique insights to guide the future development of multimodal AI search engines. Project Page: this https URL
[IR-1] The Relevance of Item-Co-Exposure For Exposure Bias Mitigation
Link: https://arxiv.org/abs/2409.12912
Authors: Thorsten Krause, Alina Deriyeva, Jan Heinrich Beinke, Gerrit York Bartels, Oliver Thomas
Keywords: implicit feedback recommender, feedback recommender systems, recommender systems influence, discrete choice models, choice models
Categories: Information Retrieval (cs.IR)
Comments:
Abstract: Through exposing items to users, implicit feedback recommender systems influence the logged interactions, and, ultimately, their own recommendations. This effect is called exposure bias and it can lead to issues such as filter bubbles and echo chambers. Previous research employed the multinomial logit model (MNL) with exposure information to reduce exposure bias on synthetic data. This extended abstract summarizes our previous study in which we investigated whether (i) these findings hold for human-generated choices, (ii) other discrete choice models mitigate bias better, and (iii) an item’s estimated relevance can depend on the relevances of the other items that were presented with it. We collected a data set of biased and unbiased choices in a controlled online user study and measured the effects of overexposure and competition. We found that (i) the discrete choice models effectively mitigated exposure bias on human-generated choice data, (ii) there were no significant differences in robustness among the different discrete choice models, and (iii) only multivariate discrete choice models were robust to competition between items. We conclude that discrete choice models mitigate exposure bias effectively because they consider item-co-exposure. Moreover, exposing items alongside more or less popular items can bias future recommendations significantly and item exposure must be tracked for overcoming exposure bias. We consider our work vital for understanding what exposure bias is, how it forms, and how it can be mitigated.
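As a rough illustration of why co-exposure matters in a multinomial logit (MNL) choice model, the sketch below computes choice probabilities as a softmax restricted to the items that were actually shown together. This is the textbook MNL form, not the authors' code; the function name, item ids, and relevance scores are made up for the example.

```python
import math

def mnl_choice_probabilities(relevance, exposed):
    """Multinomial logit choice probabilities over the exposed items only.

    relevance: dict mapping item id -> real-valued relevance score (hypothetical values).
    exposed: iterable of item ids that were shown together.
    """
    exposed = list(exposed)
    # Subtract the largest score before exponentiating for numerical stability.
    top = max(relevance[i] for i in exposed)
    weights = {i: math.exp(relevance[i] - top) for i in exposed}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}

# Hypothetical relevance scores: "c" is the strongest competitor.
relevance = {"a": 1.0, "b": 0.0, "c": 2.0}
alone_with_b = mnl_choice_probabilities(relevance, ["a", "b"])
alone_with_c = mnl_choice_probabilities(relevance, ["a", "c"])
```

Under these assumed scores, item "a" is chosen more often when shown next to the weaker item "b" than next to the stronger item "c", even though its own relevance is unchanged. A model that ignores which items were co-exposed would read that difference as a change in relevance.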
[IR-2] HLLM: Enhancing Sequential Recommendations via Hierarchical Large Language Models for Item and User Modeling
Link: https://arxiv.org/abs/2409.12740
Authors: Junyi Chen, Lu Chi, Bingyue Peng, Zehuan Yuan
Keywords: achieved remarkable success, Hierarchical Large Language, Large Language Models, Large Language, prompting several studies
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) have achieved remarkable success in various fields, prompting several studies to explore their potential in recommendation systems. However, these attempts have so far resulted in only modest improvements over traditional recommendation models. Moreover, three critical questions remain under-explored: firstly, the real value of LLMs’ pre-trained weights, often considered to encapsulate world knowledge; secondly, the necessity of fine-tuning for recommendation tasks; lastly, whether LLMs can exhibit the same scalability benefits in recommendation systems as they do in other domains. In this paper, we propose a novel Hierarchical Large Language Model (HLLM) architecture designed to enhance sequential recommendation systems. Our approach employs a two-tier model: the first Item LLM extracts rich content features from the detailed text description of the item, while the second User LLM utilizes these features to predict users’ future interests based on their interaction history. Extensive experiments demonstrate that our method effectively leverages the pre-trained capabilities of open-source LLMs, and further fine-tuning leads to significant performance boosts. Additionally, HLLM achieves excellent scalability, with the largest configuration utilizing 7B parameters for both item feature extraction and user interest modeling. Moreover, HLLM offers excellent training and serving efficiency, making it practical in real-world applications. Evaluations on two large-scale datasets, PixelRec and Amazon Reviews, show that HLLM achieves state-of-the-art results, outperforming traditional ID-based models by a wide margin. In online A/B testing, HLLM showcases notable gains, validating its practical impact in real-world recommendation scenarios. Codes are available at this https URL.
[IR-3] When SparseMoE Meets Noisy Interactions: An Ensemble View on Denoising Recommendation
Link: https://arxiv.org/abs/2409.12730
Authors: Weipu Chen, Zhuangzhuang He, Fei Liu
Keywords: Learning user preferences, implicit feedback, user preferences, core challenges, Learning user
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract: Learning user preferences from implicit feedback is one of the core challenges in recommendation. The difficulty lies in the potential noise within implicit feedback. Therefore, various denoising recommendation methods have been proposed recently. However, most of them overly rely on the hyperparameter configurations, inevitably leading to inadequacies in model adaptability and generalization performance. In this study, we propose a novel Adaptive Ensemble Learning (AEL) method for denoising recommendation, which employs a sparse gating network as a brain, selecting suitable experts to synthesize appropriate denoising capacities for different data samples. To address the ensemble learning shortcoming of model complexity and ensure sub-recommender diversity, we also propose a novel method that stacks components to create sub-recommenders instead of directly constructing them. Extensive experiments across various datasets demonstrate that AEL outperforms other methods on a range of popular metrics, even in the presence of substantial and dynamic noise. Our code is available at this https URL.
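The "sparse gating network" mentioned above can be pictured with a generic top-k mixture-of-experts gate: only the k experts with the highest gate scores receive non-zero weight for a given sample. The sketch below is a minimal, framework-free version of that general idea, not the AEL implementation; the function names, gate logits, and expert outputs are hypothetical.

```python
import math

def sparse_gate(gate_logits, k):
    """Keep the k largest gate logits, softmax over them, and give every other expert weight 0."""
    top_k = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top_k)
    exp_scores = {i: math.exp(gate_logits[i] - peak) for i in top_k}
    total = sum(exp_scores.values())
    return [exp_scores[i] / total if i in exp_scores else 0.0 for i in range(len(gate_logits))]

def mixture_output(expert_outputs, gate_logits, k=2):
    """Weighted sum of expert predictions using the sparse gate weights."""
    weights = sparse_gate(gate_logits, k)
    return sum(w * out for w, out in zip(weights, expert_outputs))

# Hypothetical example: four experts score one user-item pair; the gate keeps two of them.
weights = sparse_gate([2.0, 0.5, 2.0, -1.0], k=2)
score = mixture_output([0.9, 0.1, 0.3, 0.5], [2.0, 0.5, 2.0, -1.0], k=2)
```

In a trained system the gate logits would come from a learned network conditioned on the sample, so different samples would route to different experts. Here they are fixed numbers chosen for illustration.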
[IR-4] Exploring Large Language Models for Product Attribute Value Identification
Link: https://arxiv.org/abs/2409.12695
Authors: Kassem Sabeh, Mouna Kacimi, Johann Gamper, Robert Litschko, Barbara Plank
Keywords: involves automatically identifying, automatically identifying attributes, involves automatically, enabling features, automatically identifying
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:
Abstract:Product attribute value identification (PAVI) involves automatically identifying attributes and their values from product information, enabling features like product search, recommendation, and comparison. Existing methods primarily rely on fine-tuning pre-trained language models, such as BART and T5, which require extensive task-specific training data and struggle to generalize to new attributes. This paper explores large language models (LLMs), such as LLaMA and Mistral, as data-efficient and robust alternatives for PAVI. We propose various strategies: comparing one-step and two-step prompt-based approaches in zero-shot settings and utilizing parametric and non-parametric knowledge through in-context learning examples. We also introduce a dense demonstration retriever based on a pre-trained T5 model and perform instruction fine-tuning to explicitly train LLMs on task-specific instructions. Extensive experiments on two product benchmarks show that our two-step approach significantly improves performance in zero-shot settings, and instruction fine-tuning further boosts performance when using training data, demonstrating the practical benefits of using LLMs for PAVI.
[IR-5] A Deep Dive into Fairness Bias Threats and Privacy in Recommender Systems: Insights and Future Research
Link: https://arxiv.org/abs/2409.12651
Authors: Falguni Roy, Xiaofeng Ding, K.-K. R. Choo, Pan Zhou
Keywords: social media platforms, personalizing digital experiences, Recommender systems, streaming services, e-commerce sites
Categories: Information Retrieval (cs.IR); Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
Comments: 38 pages, 6 figures
Abstract:Recommender systems are essential for personalizing digital experiences on e-commerce sites, streaming services, and social media platforms. While these systems are necessary for modern digital interactions, they face fairness, bias, threats, and privacy challenges. Bias in recommender systems can result in unfair treatment of specific users and item groups, and fairness concerns demand that recommendations be equitable for all users and items. These systems are also vulnerable to various threats that compromise reliability and security. Furthermore, privacy issues arise from the extensive use of personal data, making it crucial to have robust protection mechanisms to safeguard user information. This study explores fairness, bias, threats, and privacy in recommender systems. It examines how algorithmic decisions can unintentionally reinforce biases or marginalize specific user and item groups, emphasizing the need for fair recommendation strategies. The study also looks at the range of threats in the form of attacks that can undermine system integrity and discusses advanced privacy-preserving techniques. By addressing these critical areas, the study highlights current limitations and suggests future research directions to improve recommender systems’ robustness, fairness, and privacy. Ultimately, this research aims to help develop more trustworthy and ethical recommender systems that better serve diverse user populations.
[IR-6] Multi-View Adaptive Contrastive Learning for Information Retrieval Based Fault Localization
Link: https://arxiv.org/abs/2409.12519
Authors: Chunying Zhou, Xiaoyuan Xie, Gong Chen, Peng He, Bing Li
Keywords: source code files, source code, Retrieval Fault Localization, fault localization, information retrieval-based techniques
Categories: Software Engineering (cs.SE); Information Retrieval (cs.IR)
Comments:
Abstract: Most studies have focused on information retrieval-based techniques for fault localization, which build representations for bug reports and source code files and match their semantic vectors through similarity measurement. However, such approaches often ignore some useful information that might help improve localization performance, such as 1) the interaction relationship between bug reports and source code files; 2) the similarity relationship between bug reports; and 3) the co-citation relationship between source code files. In this paper, we propose a novel approach named Multi-View Adaptive Contrastive Learning for Information Retrieval Fault Localization (MACL-IRFL) to learn the above-mentioned relationships for software fault localization. Specifically, we first generate data augmentations from the report-code interaction view, the report-report similarity view and the code-code co-citation view separately, and adopt a graph neural network to aggregate the information of bug reports or source code files from the three views in the embedding process. Moreover, we perform contrastive learning across these views. Our contrastive learning task forces the bug report representations to encode information shared by the report-report and report-code views, and the source code file representations to encode information shared by the code-code and report-code views, thereby alleviating the noise from auxiliary information. Finally, to evaluate the performance of our approach, we conduct extensive experiments on five open-source Java projects. The results show that our model can improve over the best baseline by up to 28.93%, 25.57% and 20.35% on Accuracy@1, MAP and MRR, respectively.
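For readers unfamiliar with the three reported metrics, the sketch below computes Accuracy@k, MRR, and MAP using their standard information-retrieval definitions. The bug-report ids and file names are invented for the example, and the paper's own evaluation scripts may differ in details such as tie handling.

```python
def accuracy_at_k(rankings, relevant, k=1):
    """Fraction of queries with at least one relevant item among the top k results."""
    hits = sum(1 for q, ranked in rankings.items() if any(d in relevant[q] for d in ranked[:k]))
    return hits / len(rankings)

def mean_reciprocal_rank(rankings, relevant):
    """Average of 1 / (rank of the first relevant item); a query with no hit contributes 0."""
    total = 0.0
    for q, ranked in rankings.items():
        for pos, d in enumerate(ranked, start=1):
            if d in relevant[q]:
                total += 1.0 / pos
                break
    return total / len(rankings)

def mean_average_precision(rankings, relevant):
    """Mean over queries of the average precision at each relevant item's rank."""
    total = 0.0
    for q, ranked in rankings.items():
        found, precision_sum = 0, 0.0
        for pos, d in enumerate(ranked, start=1):
            if d in relevant[q]:
                found += 1
                precision_sum += found / pos
        total += precision_sum / len(relevant[q]) if relevant[q] else 0.0
    return total / len(rankings)

# Hypothetical data: ranked source files per bug report, and the files actually fixed.
rankings = {"bug1": ["A.java", "B.java", "C.java"], "bug2": ["C.java", "A.java", "B.java"]}
relevant = {"bug1": {"A.java", "C.java"}, "bug2": {"B.java"}}
```

On this toy input, Accuracy@1 is 1/2, MRR is 2/3, and MAP is 7/12.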
[IR-7] Familiarity-aware Evidence Compression for Retrieval Augmented Generation
Link: https://arxiv.org/abs/2409.12468
Authors: Dongwon Jung, Qin Liu, Tenghao Huang, Ben Zhou, Muhao Chen
Keywords: Retrieval Augmented Generation, Augmented Generation, Retrieval Augmented, improves large language, large language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract: Retrieval Augmented Generation (RAG) improves large language models (LMs) by incorporating non-parametric knowledge through evidence retrieval from external sources. However, it often struggles to filter out inconsistent and irrelevant information that can distract the LM from its tasks. While compressing the retrieved evidence with a compression model aims to address this issue, the compressed evidence may still be unfamiliar to the target model used for the downstream task, which may then fail to utilize the evidence effectively. We propose FaviComp (Familiarity-aware Evidence Compression), a novel training-free evidence compression technique that makes retrieved evidence more familiar to the target model, while seamlessly integrating parametric knowledge from the model. Specifically, FaviComp proactively lowers the perplexity of the compressed evidence with regard to the target model by combining token probabilities from both the compression model and the target model to generate context that is more familiar to the target model. This approach balances the integration of parametric and non-parametric knowledge, which is especially helpful in complex tasks where the retrieved evidence set may not contain all the necessary information. Experimental results demonstrate that FaviComp consistently outperforms existing baselines in multiple open-domain QA datasets, achieving high compression rates and showcasing the effective integration of both parametric and non-parametric knowledge.
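The abstract does not spell out how the two models' token probabilities are combined, so the following is only a sketch of one plausible reading: at each decoding step, mix the log-probabilities of the compression model and the target model with a weight and pick the highest-scoring token. The weighted-sum rule, the `alpha` parameter, the function name, and the toy vocabulary are all assumptions for illustration, not details taken from the paper.

```python
import math

def combine_next_token(comp_logprobs, target_logprobs, alpha=0.5):
    """Pick the next token from a weighted mix of two models' log-probabilities.

    alpha weights the compression model and (1 - alpha) weights the target model.
    Both dicts map token -> log-probability over the same hypothetical vocabulary.
    """
    shared = comp_logprobs.keys() & target_logprobs.keys()
    scores = {t: alpha * comp_logprobs[t] + (1 - alpha) * target_logprobs[t] for t in shared}
    return max(scores, key=scores.get), scores

# Hypothetical step: the compressor prefers a rare wording the target model finds unlikely.
comp = {"Lutetia": math.log(0.6), "Paris": math.log(0.4)}
target = {"Lutetia": math.log(0.05), "Paris": math.log(0.95)}

mixed_choice, _ = combine_next_token(comp, target, alpha=0.5)
compressor_only_choice, _ = combine_next_token(comp, target, alpha=1.0)
```

With these made-up numbers, the compressor alone would emit "Lutetia", while the mixed score selects "Paris", the wording the target model assigns higher probability to. That is the sense in which such a combination lowers the perplexity of the compressed text under the target model.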
[IR-8] Bundle Fragments into a Whole: Mining More Complete Clusters via Submodular Selection of Interesting webpages for Web Topic Detection
Link: https://arxiv.org/abs/2409.12380
Authors: Junbiao Pang, Anjing Hu, Qingming Huang
Keywords: Organizing interesting webpages, Organizing interesting, multimodal web data, hot topics, understand the trends
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 10
Abstract: Organizing interesting webpages into hot topics is one of the key steps to understand the trends of multimodal web data. A state-of-the-art solution first organizes webpages into a large volume of multi-granularity topic candidates; hot topics are then identified by estimating their interestingness. However, these topic candidates contain a large number of fragments of hot topics due to both the inefficient feature representations and the unsupervised topic generation. This paper proposes a bundling-refining approach to mine more complete hot topics from fragments. Concretely, the bundling step organizes the fragment topics into coarse topics; next, the refining step proposes a submodular-based method to refine coarse topics in a scalable approach. The proposed method is simple yet powerful: by leveraging submodular optimization, it outperforms traditional ranking methods that involve careful design and complex steps. Extensive experiments demonstrate that the proposed approach surpasses the state-of-the-art method (i.e., latent Poisson deconvolution, Pang et al. (2016)) by 20% and 10% in accuracy on two public data sets, respectively.
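The abstract does not give the paper's submodular objective, so the sketch below uses set coverage as a stand-in to show what greedy submodular selection looks like in general: repeatedly add the candidate with the largest marginal gain. For monotone submodular objectives under a cardinality budget, this greedy rule is known to reach at least a (1 - 1/e) fraction of the optimum. The function name, topic ids, and webpage ids are hypothetical.

```python
def greedy_coverage(candidates, budget):
    """Greedily pick up to `budget` candidate sets, each time taking the largest marginal coverage gain."""
    chosen, covered = [], set()
    remaining = dict(candidates)
    for _ in range(budget):
        best = max(remaining, key=lambda name: len(remaining[name] - covered), default=None)
        # Stop early if nothing is left or no candidate adds new elements.
        if best is None or not (remaining[best] - covered):
            break
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered

# Hypothetical fragment topics, each a set of webpage ids.
fragments = {
    "t1": {"p1", "p2", "p3", "p6"},
    "t2": {"p3", "p4", "p5"},
    "t3": {"p1", "p2"},
    "t4": {"p5"},
}
chosen, covered = greedy_coverage(fragments, budget=2)
```

Here the second pick is "t2" rather than "t3": "t3" is larger than "t4" but adds nothing new once "t1" is selected. This diminishing-returns behaviour is what distinguishes submodular selection from ranking candidates by a fixed score.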
[IR-9] A Simple Model to Estimate Sharing Effects in Social Networks RECSYS’24
Link: https://arxiv.org/abs/2409.12203
Authors: Olivier Jeunen
Keywords: Randomised Controlled Trials, Randomised Controlled, Controlled Trials, fields of science, gold standard
Categories: Social and Information Networks (cs.SI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Applications (stat.AP)
Comments: Accepted at the CONSEQUENCES '24 workshop, co-located with ACM RecSys '24
Abstract: Randomised Controlled Trials (RCTs) are the gold standard for estimating treatment effects across many fields of science. Technology companies have adopted A/B-testing methods as a modern RCT counterpart, where end-users are randomly assigned various system variants and user behaviour is tracked continuously. The objective is then to estimate the causal effect that the treatment variant would have on certain metrics of interest to the business. When the outcomes for randomisation units – end-users in this case – are not statistically independent, this obfuscates identifiability of treatment effects, and harms decision-makers’ observability of the system. Social networks exemplify this, as they are designed to promote inter-user interactions. This interference by design notoriously complicates measurement of, e.g., the effects of sharing. In this work, we propose a simple Markov Decision Process (MDP)-based model describing user sharing behaviour in social networks. We derive an unbiased estimator for treatment effects under this model, and demonstrate through reproducible synthetic experiments that it outperforms existing methods by a significant margin.
Attachments
Click to download the full list of today's papers