This blog post presents the latest paper list retrieved from arXiv.org on 2024-10-04. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily list by email, please leave your email address in the comments.

Note: Paper data is fetched from arXiv.org daily, with an automatic update at around 11:00 each morning.


Table of Contents

Overview (2024-10-04)

A total of 554 papers were updated today, including:

  • Natural Language Processing: 129 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 165 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 119 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 249 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos

[Quick Read]: This paper targets the lack of basic temporal reasoning in modern large multimodal models (LMMs) when processing short videos. The key contribution is Vinoground, an evaluation benchmark of 1,000 natural video-caption pairs that tests whether models can distinguish temporal differences between actions and object transformations. Existing LMMs perform poorly on it: the best model, GPT-4o, reaches only ~50% accuracy, far below the human baseline of ~90%. Through this benchmark, the paper shows that temporal reasoning in short-video understanding remains far from solved.

Link: https://arxiv.org/abs/2410.02763
Authors: Jianrui Zhang, Mu Cai, Yong Jae Lee
Keywords (EN): growing sentiment recently, key challenges related, growing sentiment, sentiment recently, recently that modern
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Project Page: this https URL


Abstract:There has been growing sentiment recently that modern large multimodal models (LMMs) have addressed most of the key challenges related to short video comprehension. As a result, both academia and industry are gradually shifting their attention towards the more complex challenges posed by understanding long-form videos. However, is this really the case? Our studies indicate that LMMs still lack many fundamental reasoning capabilities even when dealing with short videos. We introduce Vinoground, a temporal counterfactual LMM evaluation benchmark encompassing 1000 short and natural video-caption pairs. We demonstrate that existing LMMs severely struggle to distinguish temporal differences between different actions and object transformations. For example, the best model GPT-4o only obtains ~50% on our text and video scores, showing a large gap compared to the human baseline of ~90%. All open-source multimodal models and CLIP-based models perform much worse, producing mostly random chance performance. Through this work, we shed light onto the fact that temporal reasoning in short videos is a problem yet to be fully solved. The dataset and evaluation code are available at this https URL.
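The abstract reports separate "text" and "video" scores but does not define them; the name Vinoground alludes to Winoground, whose group-wise counterfactual scoring is sketched below. This is an assumption about the metric, not the paper's published definition, and all function names are ours:

```python
def text_score(sims):
    """sims[i][j] = model similarity between caption i and video j for one
    counterfactual pair (two captions, two videos). The pair counts as
    correct only if each caption beats its rival on the matching video."""
    correct = sum(
        1 for s in sims
        if s[0][0] > s[1][0] and s[1][1] > s[0][1]
    )
    return correct / len(sims)

# One pair scored correctly, one pair confused by the temporal swap.
sims = [
    [[0.9, 0.2], [0.1, 0.8]],   # caption 0 wins on video 0, caption 1 on video 1
    [[0.4, 0.6], [0.5, 0.3]],   # wrong on both videos
]
score = text_score(sims)  # -> 0.5
```

Because both captions must win on their own video, a model that scores captions independently of the video hovers near random chance, which matches the behaviour the abstract reports for CLIP-based models.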

[NLP-1] Erasing Conceptual Knowledge from Language Models

[Quick Read]: This paper addresses the lack of a comprehensive evaluation framework for concept erasure in language models, which has led to incomplete assessments of erasure methods. The key is an evaluation paradigm built on three criteria: innocence (complete knowledge removal), seamlessness (maintaining fluent conditional generation), and specificity (preserving performance on unrelated tasks). Guided by these criteria, the authors develop Erasure of Language Memory (ELM), which applies targeted low-rank updates to alter output distributions so that specific concepts are erased while overall model capabilities are preserved, including fluency when prompted for an erased concept. Experiments show ELM performs strongly on the proposed metrics: near-random scores on erased-topic assessments, fluent generation, maintained accuracy on unrelated benchmarks, and robustness under adversarial attacks.

Link: https://arxiv.org/abs/2410.02760
Authors: Rohit Gandikota, Sheridan Feucht, Samuel Marks, David Bau
Keywords (EN): comprehensive evaluation framework, leading to incomplete, traditionally lacked, lacked a comprehensive, evaluation framework
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Project Page: this https URL


Abstract:Concept erasure in language models has traditionally lacked a comprehensive evaluation framework, leading to incomplete assessments of effectiveness of erasure methods. We propose an evaluation paradigm centered on three critical criteria: innocence (complete knowledge removal), seamlessness (maintaining conditional fluent generation), and specificity (preserving unrelated task performance). Our evaluation metrics naturally motivate the development of Erasure of Language Memory (ELM), a new method designed to address all three dimensions. ELM employs targeted low-rank updates to alter output distributions for erased concepts while preserving overall model capabilities including fluency when prompted for an erased concept. We demonstrate ELM’s efficacy on biosecurity, cybersecurity, and literary domain erasure tasks. Comparative analysis shows that ELM achieves superior performance across our proposed metrics, including near-random scores on erased topic assessments, generation fluency, maintained accuracy on unrelated benchmarks, and robustness under adversarial attacks. Our code, data, and trained models are available at this https URL

[NLP-2] CorPipe at CRAC 2024: Predicting Zero Mentions from Raw Text

[Quick Read]: This paper tackles empty-node prediction in multilingual coreference resolution, i.e., identifying from raw text the empty nodes needed for zero coreference mentions. The solution evaluates two model variants: a two-stage approach, where a pretrained encoder first predicts empty nodes that are then processed together with the sentence words by another pretrained model, and a single-stage approach, where a single pretrained encoder jointly generates empty nodes, coreference mentions, and coreference links. Both variants surpass all other participants by large margins of 3.9 and 2.8 percentage points, respectively.

Link: https://arxiv.org/abs/2410.02756
Authors: Milan Straka
Keywords (EN): Multilingual Coreference Resolution, Shared Task, Multilingual Coreference, Task on Multilingual, empty nodes
Categories: Computation and Language (cs.CL)
Comments: Accepted to CRAC 2024


Abstract:We present CorPipe 24, the winning entry to the CRAC 2024 Shared Task on Multilingual Coreference Resolution. In this third iteration of the shared task, a novel objective is to also predict empty nodes needed for zero coreference mentions (while the empty nodes were given on input in previous years). This way, coreference resolution can be performed on raw text. We evaluate two model variants: a two-stage approach (where the empty nodes are predicted first using a pretrained encoder model and then processed together with sentence words by another pretrained model) and a single-stage approach (where a single pretrained encoder model generates empty nodes, coreference mentions, and coreference links jointly). In both settings, CorPipe surpasses other participants by a large margin of 3.9 and 2.8 percent points, respectively. The source code and the trained model are available at this https URL .

[NLP-3] SIEVE: General Purpose Data Filtering System Matching GPT-4o Accuracy at 1% the Cost

[Quick Read]: This paper addresses the high cost of filtering high-quality, domain-specific data out of web-scale corpora when building specialized large language models, given the scarcity of existing domain-specific datasets. The key is SIEVE, a lightweight method that seamlessly integrates GPT-4o with lightweight T5 models, using active learning to fine-tune T5 in the background. SIEVE matches GPT-4o's filtering accuracy at a fraction of the cost, performing up to 500 filtering operations for the price of a single GPT-4o call, making data curation far more efficient and cost-effective.

Link: https://arxiv.org/abs/2410.02755
Authors: Jifan Zhang, Robert Nowak
Keywords (EN): Creating specialized large, Creating specialized, special purpose data, requires vast amounts, SIEVE
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:


Abstract:Creating specialized large language models requires vast amounts of clean, special purpose data for training and fine-tuning. With only a handful of existing large-scale, domain-specific datasets, creation of new datasets is required in most applications. This requires the development of new application-specific filtering of web-scale data. Filtering with a high-performance, general-purpose LLM such as GPT-4o can be highly effective, but this is extremely expensive at web-scale. This paper proposes SIEVE, a lightweight alternative that matches GPT-4o accuracy at a fraction of the cost. SIEVE can perform up to 500 filtering operations for the cost of one GPT-4o filtering call. The key to SIEVE is a seamless integration of GPT-4o and lightweight T5 models, using active learning to fine-tune T5 in the background with a small number of calls to GPT-4o. Once trained, it performs as well as GPT-4o at a tiny fraction of the cost. We experimentally validate SIEVE on the OpenWebText dataset, using five highly customized filter tasks targeting high quality and domain-specific content. Our results demonstrate the effectiveness and efficiency of our method in curating large, high-quality datasets for language model training at a substantially lower cost (1%) than existing techniques. To further validate SIEVE, experiments show that SIEVE and GPT-4o achieve similar accuracy, with human evaluators preferring SIEVE’s filtering results to those of GPT-4o.
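The abstract does not spell out SIEVE's active-learning loop; a minimal sketch of one plausible ingredient, uncertainty-based selection of which documents to send to the expensive oracle (function and variable names are ours, not from the paper):

```python
def least_confident(probs, budget):
    """Indices of the examples the cheap filter (e.g. a small T5
    classifier) is least sure about; only these are sent to the
    expensive oracle (e.g. GPT-4o) for labels, keeping oracle calls
    to a tiny fraction of the corpus."""
    margins = [abs(p - 0.5) for p in probs]  # distance from the decision boundary
    return sorted(range(len(probs)), key=lambda i: margins[i])[:budget]

# Keep-probabilities from the lightweight filter for five documents.
probs = [0.95, 0.52, 0.10, 0.49, 0.80]
picked = least_confident(probs, budget=2)  # the two most uncertain documents
```

The selected documents would then be labeled by GPT-4o and folded back into the T5 fine-tuning set; once the margins widen, the cheap filter runs alone.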

[NLP-4] Training Language Models on Synthetic Edit Sequences Improves Code Synthesis

[Quick Read]: This paper addresses the scarcity of high-quality code-edit data. The key is LintSeq, a synthetic data generation algorithm that uses a linter to refactor existing code into sequences of code edits expressed as consecutive program diffs. Instruction fine-tuning LLMs on these edit sequences yields models that, under repeated sampling, produce more diverse programs and better benchmark coverage than baselines; small models fine-tuned this way are competitive with GPT-4, and tiny edit-sequence models for on-device code understanding match or outperform code models with twice as many parameters.

Link: https://arxiv.org/abs/2410.02749
Authors: Ulyana Piterbarg, Lerrel Pinto, Rob Fergus
Keywords (EN): Software engineers, code, Software, edit, data
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:


Abstract:Software engineers mainly write code by editing existing programs. In contrast, large language models (LLMs) autoregressively synthesize programs in a single pass. One explanation for this is the scarcity of open-sourced edit data. While high-quality instruction data for code synthesis is already scarce, high-quality edit data is even scarcer. To fill this gap, we develop a synthetic data generation algorithm called LintSeq. This algorithm refactors existing code into a sequence of code edits by using a linter to procedurally sample across the error-free insertions that can be used to sequentially write programs. It outputs edit sequences as text strings consisting of consecutive program diffs. To test LintSeq, we use it to refactor a dataset of instruction + program pairs into instruction + program-diff-sequence tuples. Then, we instruction finetune a series of smaller LLMs ranging from 2.6B to 14B parameters on both the re-factored and original versions of this dataset, comparing zero-shot performance on code synthesis benchmarks. We show that during repeated sampling, edit sequence finetuned models produce more diverse programs than baselines. This results in better inference-time scaling for benchmark coverage as a function of samples, i.e. the fraction of problems “pass@k” solved by any attempt given “k” tries. For example, on HumanEval pass@50, small LLMs finetuned on synthetic edit sequences are competitive with GPT-4 and outperform models finetuned on the baseline dataset by +20% (+/-3%) in absolute score. Finally, we also pretrain our own tiny LMs for code understanding. We show that finetuning tiny models on synthetic code edits results in state-of-the-art code synthesis for the on-device model class. Our 150M parameter edit sequence LM matches or outperforms code models with twice as many parameters, both with and without repeated sampling, including Codex and AlphaCode.
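The pass@k coverage metric quoted above is conventionally computed with the unbiased estimator introduced with HumanEval; a small self-contained sketch (the helper and variable names are ours):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k samples is correct),
    given n total samples of which c passed the tests."""
    if n - c < k:   # fewer failures than draws: success is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Benchmark coverage at k is the mean of pass@k over the problems.
per_problem = [(50, 12), (50, 0), (50, 50)]  # (n, c) per problem
coverage = sum(pass_at_k(n, c, 10) for n, c in per_problem) / len(per_problem)
```

This is why the diversity result matters: with more diverse samples, c grows more slowly per problem but spreads over more problems, lifting the mean pass@k as k increases.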

[NLP-5] CriSPO: Multi-Aspect Critique-Suggestion-guided Automatic Prompt Optimization for Text Generation

[Quick Read]: This paper studies how prompting techniques can improve the quality of LLM-generated summaries. The key is to enrich prompts with keyphrases extracted from the source document, which improves ROUGE F1 and recall, making the generated summaries closer to the reference and more complete. Using the proposed Keyphrase Signal Extractor (CriSPO), the authors obtain consistent ROUGE improvements across datasets and LLMs without any LLM customization.

Link: https://arxiv.org/abs/2410.02748
Authors: Han He, Qianchu Liu, Lei Xu, Chaitanya Shivade, Yi Zhang, Sundararajan Srinivasan, Katrin Kirchhoff
Keywords (EN): Large language models, Large language, generate fluent summaries, prompting techniques, domains using prompting
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:


Abstract:Large language models (LLMs) can generate fluent summaries across domains using prompting techniques, reducing the need to train models for summarization applications. However, crafting effective prompts that guide LLMs to generate summaries with the appropriate level of detail and writing style remains a challenge. In this paper, we explore the use of salient information extracted from the source document to enhance summarization prompts. We show that adding keyphrases in prompts can improve ROUGE F1 and recall, making the generated summaries more similar to the reference and more complete. The number of keyphrases can control the precision-recall trade-off. Furthermore, our analysis reveals that incorporating phrase-level salient information is superior to word- or sentence-level. However, the impact on hallucination is not universally positive across LLMs. To conduct this analysis, we introduce Keyphrase Signal Extractor (CriSPO), a lightweight model that can be finetuned to extract salient keyphrases. By using CriSPO, we achieve consistent ROUGE improvements across datasets and open-weight and proprietary LLMs without any LLM customization. Our findings provide insights into leveraging salient information in building prompt-based summarization systems.
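The paper does not print its prompt; a minimal illustration of steering a summarization prompt with extracted keyphrases (the template wording is ours, not the paper's):

```python
def build_summary_prompt(document: str, keyphrases: list[str]) -> str:
    """Prepend salient keyphrases to the instruction; adding more
    phrases pushes recall up, while fewer phrases favour precision."""
    hint = ", ".join(keyphrases)
    return (
        "Summarize the document below. "
        f"Make sure the summary covers: {hint}.\n\n"
        f"Document:\n{document}"
    )

prompt = build_summary_prompt(
    "Acme Corp reported record revenue for Q3 ...",
    ["Acme Corp", "record revenue"],
)
```

In the paper's setup the keyphrase list would come from the finetuned extractor rather than being hand-written, and its length is the knob for the precision-recall trade-off.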

[NLP-6] AVG-LLaVA: A Multimodal Large Model with Adaptive Visual Granularity

[Quick Read]: This paper addresses the excessive number of visual tokens produced when conventional large multimodal models (LMMs) split a high-resolution image into multiple local images plus one global image. The key is AVG-LLaVA, a model that adaptively selects an appropriate visual granularity based on the input image and instruction, reducing the token count and speeding up inference while also improving performance. It adds two modules: a visual granularity scaler that obtains visual tokens at multiple granularities via stacked pooling layers, and a visual granularity router that selects the right granularity from the image and instruction. The paper also proposes RGLF, a training paradigm that aligns the router's predicted granularity with the LMM's preferences without extra manual annotation.

Link: https://arxiv.org/abs/2410.02745
Authors: Zhibin Lan, Liqiang Niu, Fandong Meng, Wenbo Li, Jie Zhou, Jinsong Su
Keywords (EN): visual tokens, visual granularity based, visual granularity, multiple local images, visual
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint


Abstract:Recently, when dealing with high-resolution images, dominant LMMs usually divide them into multiple local images and one global image, which will lead to a large number of visual tokens. In this work, we introduce AVG-LLaVA, an LMM that can adaptively select the appropriate visual granularity based on the input image and instruction. This approach not only reduces the number of visual tokens and speeds up inference, but also improves the overall model performance. Specifically, we introduce the following modules based on LLaVA-NeXT: (a) a visual granularity scaler that includes multiple pooling layers to obtain visual tokens with different granularities; (b) a visual granularity router, which includes a Transformer layer, an MLP layer, and a voter layer, used to select the appropriate visual granularity based on the image and instruction. Furthermore, we propose RGLF, a novel training paradigm that aims at aligning the granularity predicted by the router with the preferences of the LMM, without the need for additional manually annotated data. Extensive experiments and analysis show that AVG-LLaVA achieves superior performance across 11 benchmarks, as well as significantly reduces the number of visual tokens and speeds up inference (e.g., an 85.3% reduction in visual tokens and a 2.53× increase in inference speed on the AI2D benchmark).
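As a rough illustration of the granularity scaler's pooling pyramid (AVG-LLaVA operates on LLaVA-NeXT features; this NumPy sketch only shows the stride-2 average-pooling idea, and all names are ours):

```python
import numpy as np

def granularity_scaler(tokens: np.ndarray, num_levels: int = 3):
    """Average-pool an (H, W, D) grid of visual tokens with stride 2 at
    each level, producing progressively coarser granularities for the
    router to choose from."""
    levels = [tokens]
    for _ in range(num_levels - 1):
        t = levels[-1]
        h, w, d = t.shape
        t = t[: h - h % 2, : w - w % 2].reshape(h // 2, 2, w // 2, 2, d)
        levels.append(t.mean(axis=(1, 3)))
    return levels

grid = np.ones((8, 8, 4))          # 64 fine-grained tokens of dimension 4
levels = granularity_scaler(grid)
counts = [l.shape[0] * l.shape[1] for l in levels]  # 64, 16, 4 tokens
```

Each level quarters the token count, which is where the reported token reduction comes from whenever the router decides a coarse level suffices for the instruction.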

[NLP-7] Neutral residues: revisiting adapters for model extension

[Quick Read]: This paper addresses extending a pretrained large language model to a new domain unseen at training time, such as adding a language the original model saw little or none of. The key is improved adapters that let the model learn an entirely new language while leaving its behavior in the original domain almost unchanged. The proposed "neutral residues" modify the new residual blocks so that each outputs near-zeros in the original domain; with only 20% extra learnable weights, this clearly outperforms fine-tuning, low-rank adaptation, and vanilla adapters on the trade-off between learning the new language and not forgetting the original one.

Link: https://arxiv.org/abs/2410.02744
Authors: Franck Signe Talla, Herve Jegou, Edouard Grave
Keywords (EN): pretrained large language, extending a pretrained, pretrained large, original domain, large language model
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:


Abstract:We address the problem of extending a pretrained large language model to a new domain that was not seen at training time, like adding a language for which the original model has seen no or little training data. Popular solutions like fine-tuning or low-rank adaptation are successful at domain adaptation, but formally they do not add any extra capacity and degrade the performance in the original domain. Our paper analyzes this extension problem under three angles: data, architecture and training procedure, which are advantageously considered jointly. In particular, we improve adapters and make it possible to learn an entire new language while ensuring that the output of the neural network is almost unchanged in the original domain. For this purpose, we modify the new residual blocks in a way that leads each new residual block to output near-zeros in the original domain. This solution of neutral residues, which borrows architectural components from mixture of experts, is effective: with only 20% extra learnable weights compared to an original model trained on English, we get results that are significantly better than concurrent approaches (fine-tuning, low-rank or vanilla adapters) in terms of the trade-off between learning a new language and not forgetting English.
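A toy sketch of the "neutral residue" idea. The paper trains the new blocks so their output stays near zero on the original domain; the zero-initialized output projection below is a simplification that only makes the block exactly neutral at the start of training, and all names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

class NeutralAdapter:
    """Residual adapter whose output projection starts at zero, so the
    host network's activations are initially unchanged. (The paper
    *trains* blocks to output near-zeros on the original domain; zero
    initialization here is an illustrative simplification.)"""

    def __init__(self, d_model: int, d_hidden: int):
        self.w_in = rng.normal(0.0, 0.02, (d_hidden, d_model))
        self.w_out = np.zeros((d_model, d_hidden))  # neutral at start

    def __call__(self, x: np.ndarray) -> np.ndarray:
        h = np.maximum(self.w_in @ x, 0.0)  # ReLU bottleneck
        return x + self.w_out @ h           # residual connection

x = rng.normal(size=16)
adapter = NeutralAdapter(d_model=16, d_hidden=4)
out = adapter(x)  # identical to x before any training
```

Unlike LoRA, the bottleneck adds genuine extra capacity for the new language; neutrality on the old domain is what protects the original performance.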

[NLP-8] MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions

[Quick Read]: This paper addresses the credit assignment problem that token-level RLHF faces over long sequences, where the model struggles to identify which actions led to successful outcomes, hurting learning efficiency and convergence. The key is MA-RLHF, a framework that introduces macro actions, i.e., sequences of tokens or higher-level language constructs, to shorten the temporal distance between actions and rewards and enable faster, more accurate credit assignment. The approach substantially improves learning efficiency across a range of tasks without adding training or inference cost.

Link: https://arxiv.org/abs/2410.02743
Authors: Yekun Chai, Haoran Sun, Huang Fang, Shuohuan Wang, Yu Sun, Hua Wu
Keywords (EN): aligning large language, human feedback, human preferences, Reinforcement learning, large language models
Categories: Computation and Language (cs.CL)
Comments:


Abstract:Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences. However, token-level RLHF suffers from the credit assignment problem over long sequences, where delayed rewards make it challenging for the model to discern which actions contributed to successful outcomes. This hinders learning efficiency and slows convergence. In this paper, we propose MA-RLHF, a simple yet effective RLHF framework that incorporates macro actions – sequences of tokens or higher-level language constructs – into the learning process. By operating at this higher level of abstraction, our approach reduces the temporal distance between actions and rewards, facilitating faster and more accurate credit assignment. This results in more stable policy gradient estimates and enhances learning efficiency within each episode, all without increasing computational complexity during training or inference. We validate our approach through extensive experiments across various model sizes and tasks, including text summarization, dialogue generation, question answering, and program synthesis. Our method achieves substantial performance improvements over standard RLHF, with performance gains of up to 30% in text summarization and code generation, 18% in dialogue, and 8% in question answering tasks. Notably, our approach reaches parity with vanilla RLHF 1.7x to 2x faster in terms of training time and continues to outperform it with further training. We will make our code and data publicly available at this https URL .
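To make the credit-assignment argument concrete, a toy sketch (ours, not the paper's implementation) of collapsing token-level rewards into macro-action rewards before computing discounted returns:

```python
def macro_returns(token_rewards, macro_size, gamma=0.99):
    """Group token-level rewards into macro actions, then compute
    discounted returns over the much shorter macro sequence, so a
    terminal reward reaches early actions in far fewer steps."""
    macros = [sum(token_rewards[i:i + macro_size])
              for i in range(0, len(token_rewards), macro_size)]
    returns, g = [], 0.0
    for r in reversed(macros):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

# A single terminal reward over 12 tokens: 3 macro steps instead of 12.
rets = macro_returns([0.0] * 11 + [1.0], macro_size=4)
```

With 12 token steps the first action's return would be discounted by gamma^11; with 3 macro steps it is only gamma^2, which is the shorter action-reward distance the paper exploits for stabler policy-gradient estimates.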

[NLP-9] Grounding Large Language Models In Embodied Environment With Imperfect World Models

[Quick Read]: This paper addresses the poor performance of LLMs on basic physical reasoning and robotics tasks, caused by their lack of direct experience with real-world physical nuances. The key is GLIMO (Grounding Large language model with Imperfect world MOdel), a framework that uses proxy world models such as simulators to collect and synthesize training data. Its core is an LLM-agent-based data generator that automatically creates high-quality, diverse instruction datasets, combining an iterative self-refining module for temporally consistent experience sampling, a diverse set of question-answering instruction seeds, and a retrieval-augmented generation module for reflecting on prior experience, which together substantially improve LLM performance across multiple benchmarks.

Link: https://arxiv.org/abs/2410.02742
Authors: Haolan Liu, Jishen Zhao
Keywords (EN): executing robotics tasks, tackling basic physical, basic physical reasoning, Grounding Large language, large language models
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:


Abstract:Despite a widespread success in various applications, large language models (LLMs) often stumble when tackling basic physical reasoning or executing robotics tasks, due to a lack of direct experience with the physical nuances of the real world. To address these issues, we propose a Grounding Large language model with Imperfect world MOdel (GLIMO), which utilizes proxy world models such as simulators to collect and synthesize training data. GLIMO incorporates an LLM agent-based data generator to automatically create high-quality and diverse instruction datasets. The generator includes an iterative self-refining module for temporally consistent experience sampling, a diverse set of question-answering instruction seeds, and a retrieval-augmented generation module for reflecting on prior experiences. Comprehensive experiments show that our approach improves the performance of strong open-source LLMs like LLaMA-3 with a performance boost of 2.04×, 1.54×, and 1.82× across three different benchmarks, respectively. The performance is able to compete with or surpass their larger counterparts such as GPT-4.

[NLP-10] Salient Information Prompting to Steer Content in Prompt-based Abstractive Summarization (EMNLP 2024)

[Quick Read]: This paper studies how prompting techniques can improve the quality of LLM-generated summaries. The key is to enrich prompts with keyphrases extracted from the source document, which steers the level of detail and writing style and improves ROUGE F1 and recall, making summaries closer to the reference and more complete. The authors introduce Keyphrase Signal Extractor (SigExt), a lightweight model fine-tuned to extract salient keyphrases, which yields consistent ROUGE improvements across datasets and both open-weight and proprietary LLMs without any LLM customization.

Link: https://arxiv.org/abs/2410.02741
Authors: Lei Xu, Mohammed Asad Karim, Saket Dingliwal, Aparna Elangovan
Keywords (EN): Large language models, Large language, generate fluent summaries, prompting techniques, domains using prompting
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to EMNLP 2024 Industry Track


Abstract:Large language models (LLMs) can generate fluent summaries across domains using prompting techniques, reducing the need to train models for summarization applications. However, crafting effective prompts that guide LLMs to generate summaries with the appropriate level of detail and writing style remains a challenge. In this paper, we explore the use of salient information extracted from the source document to enhance summarization prompts. We show that adding keyphrases in prompts can improve ROUGE F1 and recall, making the generated summaries more similar to the reference and more complete. The number of keyphrases can control the precision-recall trade-off. Furthermore, our analysis reveals that incorporating phrase-level salient information is superior to word- or sentence-level. However, the impact on hallucination is not universally positive across LLMs. To conduct this analysis, we introduce Keyphrase Signal Extractor (SigExt), a lightweight model that can be finetuned to extract salient keyphrases. By using SigExt, we achieve consistent ROUGE improvements across datasets and open-weight and proprietary LLMs without any LLM customization. Our findings provide insights into leveraging salient information in building prompt-based summarization systems.

[NLP-11] Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge

[Quick Read]: This paper targets the potential biases of LLM-as-a-Judge setups, which are widely used for evaluation and as supervised rewards in model training, and proposes CALM, an automated bias-quantification framework. The key is to systematically quantify and analyze 12 kinds of bias in LLM-as-a-Judge via automated, principle-guided modifications, improving its reliability and scope of use. Experiments show that while advanced models score well overall, significant biases persist on certain tasks, suggesting the reliability of LLM-as-a-Judge still has room to improve.

Link: https://arxiv.org/abs/2410.02736
Authors: Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh V Chawla, Xiangliang Zhang
Keywords (EN): widely utilized, evaluation method, benchmarks and served, served as supervised, supervised rewards
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:


Abstract:LLM-as-a-Judge has been widely utilized as an evaluation method in various benchmarks and served as supervised rewards in model training. However, despite their excellence in many domains, potential issues are under-explored, undermining their reliability and the scope of their utility. Therefore, we identify 12 key potential biases and propose a new automated bias quantification framework, CALM, which systematically quantifies and analyzes each type of bias in LLM-as-a-Judge by using automated and principle-guided modification. Our experiments cover multiple popular language models, and the results indicate that while advanced models have achieved commendable overall performance, significant biases persist in certain specific tasks. Empirical results suggest that there remains room for improvement in the reliability of LLM-as-a-Judge. Moreover, we also discuss the explicit and implicit influence of these biases and give some suggestions for the reliable application of LLM-as-a-Judge. Our work highlights the need for stakeholders to address these issues and remind users to exercise caution in LLM-as-a-Judge applications.
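CALM's principle-guided modifications cover 12 bias types; as an illustrative sketch of just one of them, here is a position-bias probe that swaps the answer order and counts inconsistent verdicts (all names below are ours, not CALM's API):

```python
def position_bias_rate(judge, pairs):
    """Fraction of answer pairs whose verdict is inconsistent when the
    two candidates are presented in swapped order. A position-free
    judge scores 0.0; one that always prefers the first slot scores 1.0."""
    flips = 0
    for a, b in pairs:
        first = judge(a, b)     # returns 'A' or 'B'
        swapped = judge(b, a)
        # Consistency: the same underlying answer should win both times.
        if (first == 'A') != (swapped == 'B'):
            flips += 1
    return flips / len(pairs)

always_first = lambda a, b: 'A'                       # maximally position-biased
fair = lambda a, b: 'A' if len(a) >= len(b) else 'B'  # order-insensitive toy judge
pairs = [("a longer answer", "short"), ("x", "also longer")]
```

In practice `judge` would wrap an LLM call with a fixed comparison prompt; probes for the other bias types (verbosity, self-preference, etc.) follow the same perturb-and-compare pattern.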

[NLP-12] DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects

[Quick Read]: This paper tackles navigation to diverse target objects in unknown environments. The key is DivScene, a large-scale scene dataset, together with NatVLM, an end-to-end embodied agent built by fine-tuning a large vision-language model (LVLM) through imitation learning to generate the next action in the environment. Adding CoT explanation traces for action prediction improves performance, and the LVLM-based agent, trained by imitation on shortest paths constructed by a BFS planner without human supervision, surpasses GPT-4o in success rate.

Link: https://arxiv.org/abs/2410.02730
Authors: Zhaowei Wang, Hongming Zhang, Tianqing Fang, Ye Tian, Yue Yang, Kaixin Ma, Xiaoman Pan, Yangqiu Song, Dong Yu
Keywords (EN): real-world applications, navigation in unknown, crucial for deploying, Object navigation, target objects
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Robotics (cs.RO)
Comments: Work in Progress


Abstract:Object navigation in unknown environments is crucial for deploying embodied agents in real-world applications. While we have witnessed huge progress due to large-scale scene datasets, faster simulators, and stronger models, previous studies mainly focus on limited scene types and target objects. In this paper, we study a new task of navigating to diverse target objects in a large number of scene types. To benchmark the problem, we present a large-scale scene dataset, DivScene, which contains 4,614 scenes across 81 different types. With the dataset, we build an end-to-end embodied agent, NatVLM, by fine-tuning a Large Vision Language Model (LVLM) through imitation learning. The LVLM is trained to take previous observations from the environment and generate the next actions. We also introduce CoT explanation traces of the action prediction for better performance when tuning LVLMs. Our extensive experiments find that we can build a performant LVLM-based agent through imitation learning on the shortest paths constructed by a BFS planner without any human supervision. Our agent achieves a success rate that surpasses GPT-4o by over 20%. Meanwhile, we carry out various analyses showing the generalization ability of our agent.
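The BFS planner used to build the imitation targets is a standard shortest-path search; a self-contained sketch on a toy occupancy grid (the grid and all names are ours):

```python
from collections import deque

def bfs_shortest_path(grid, start, goal):
    """Shortest 4-connected path on a grid of 0 (free) / 1 (blocked)
    cells. Planner paths like this can serve as supervision for
    imitation learning, with no human demonstrations needed."""
    rows, cols = len(grid), len(grid[0])
    prev = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:              # reconstruct the path back to start
            path = []
            while cell is not None:
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols \
                    and grid[nr][nc] == 0 and nxt not in prev:
                prev[nxt] = cell
                queue.append(nxt)
    return None                       # goal unreachable

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
path = bfs_shortest_path(grid, (0, 0), (2, 0))  # detours around the wall
```

Each consecutive pair of cells along such a path becomes one (observation, action) training example for the imitation-learned agent.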

[NLP-13] Unified Multi-Modal Interleaved Document Representation for Information Retrieval

【速读】: 该论文试图解决现有信息检索方法在处理多模态文档时存在的两个主要问题:一是仅考虑文本信息而忽略文档中的图像和表格等多模态内容;二是将长文档分割成多个离散段落进行嵌入,导致无法捕捉文档的整体上下文和段落间的交互。解决方案的关键在于利用最新的视觉-语言模型,将文本、图像和表格整合成统一的格式和表示,并通过合并段落表示来生成单一的文档表示,同时引入重排序策略以在必要时识别文档中的相关段落。这种方法在处理多模态查询时显著优于相关基线,因为它能够全面考虑文档中的多模态信息。

链接: https://arxiv.org/abs/2410.02729
作者: Jaewoo Lee,Joonho Ko,Jinheon Baek,Soyeong Jeong,Sung Ju Hwang
关键词-EN: natural language tasks, gained remarkable attention, remarkable attention due, language tasks, gained remarkable
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Preprint

点击查看摘要

Abstract:Information Retrieval (IR) methods aim to identify relevant documents in response to a given query, which have gained remarkable attention due to their successful application in various natural language tasks. However, existing approaches typically consider only the textual information within the documents, which overlooks the fact that documents can contain multiple modalities, including texts, images, and tables. Further, they often segment each long document into multiple discrete passages for embedding, preventing them from capturing the overall document context and interactions between paragraphs. We argue that these two limitations lead to suboptimal document representations for retrieval. In this work, to address them, we aim to produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities. Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation. Moreover, to mitigate the information loss from segmenting documents into passages, instead of representing and retrieving passages individually, we further merge the representations of segmented passages into one single document representation, while we additionally introduce a reranking strategy to decouple and identify the relevant passage within the document if necessary. Then, through extensive experiments on diverse information retrieval scenarios considering both the textual and multimodal queries, we show that our approach substantially outperforms relevant baselines, thanks to the consideration of the multimodal information interleaved within the documents in a unified way.
摘要:信息检索 (Information Retrieval, IR) 方法旨在根据给定的查询识别相关文档,由于其在各种自然语言任务中的成功应用,这些方法引起了广泛关注。然而,现有方法通常仅考虑文档中的文本信息,忽略了文档可能包含多种模态,包括文本、图像和表格。此外,它们通常将每个长文档分割成多个离散的段落进行嵌入,这使得它们无法捕捉整个文档的上下文以及段落之间的交互。我们认为,这两个局限性导致了次优的文档表示用于检索。在这项工作中,为了解决这些问题,我们旨在通过整体嵌入包含不同模态的文档来生成更全面和细致的文档表示。具体而言,我们通过利用最近视觉语言模型的能力来实现这一目标,这些模型能够将文本、图像和表格处理并整合为统一的格式和表示。此外,为了减少将文档分割成段落带来的信息损失,我们不再单独表示和检索段落,而是将分割段落的表示合并为一个单一的文档表示,同时我们还引入了一种重排序策略,在必要时解耦并识别文档中的相关段落。随后,通过在考虑文本和多模态查询的多样化信息检索场景中进行广泛的实验,我们展示了我们的方法由于统一考虑了文档中交织的多模态信息,显著优于相关基线。
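下面用一个简化的 Python 示意说明"合并段落表示为单一文档表示 + 必要时重排序定位相关段落"的思路。此处用固定种子的随机单位向量代替真实的视觉-语言模型嵌入,用均值池化代替论文中的合并策略;这些均为演示假设,具体做法以原文为准。

```python
import numpy as np

def embed_passages(passages, dim=8, seed=0):
    """嵌入器的替身:确定性随机单位向量。
    论文中由视觉-语言模型统一嵌入文本/图像/表格段落。"""
    rng = np.random.default_rng(seed)
    vecs = {p: rng.normal(size=dim) for p in passages}
    return {p: v / np.linalg.norm(v) for p, v in vecs.items()}

def merge_document(passage_vecs):
    """把段落向量合并为单一文档表示(此处用均值池化,仅为示意)。"""
    doc = np.mean(list(passage_vecs.values()), axis=0)
    return doc / np.linalg.norm(doc)

def rerank(query_vec, passage_vecs, top_k=1):
    """文档级检索命中后,再按相似度定位文档内最相关的段落。"""
    scored = sorted(passage_vecs.items(),
                    key=lambda kv: -float(query_vec @ kv[1]))
    return [p for p, _ in scored[:top_k]]

vecs = embed_passages(["p1", "p2", "p3"])
doc = merge_document(vecs)
```

检索时先用 `doc` 与查询向量比较,命中后再调用 `rerank` 解耦出相关段落。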

[NLP-14] Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better Even Mid-Generation

【速读】: 该论文试图解决大语言模型(LLMs)在推理时计算成本高的问题,特别是Best-of-N采样方法需要生成多个样本并依赖外部奖励模型,导致计算开销大的问题。解决方案的关键在于引入一种生成自评估方案,使LLM能够在生成过程中预测重新生成是否会得到更好的响应,从而自适应地减少生成样本的数量。这一方案通过生成一个预定义的token来实现,无需外部奖励模型,能够在生成过程中决定是否继续生成样本、提前剪枝不理想的样本或选择最佳样本,从而显著提高计算效率和扩展性。实验结果表明,这种方法在减少样本数量的同时,仍能保持甚至提升模型性能。

链接: https://arxiv.org/abs/2410.02725
作者: Rohin Manvi,Anikait Singh,Stefano Ermon
关键词-EN: Inference-time computation, large language models, widely used technique, external reward model, powerful paradigm
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Inference-time computation is a powerful paradigm to enhance the performance of large language models (LLMs), with Best-of-N sampling being a widely used technique. However, this method is computationally expensive, requiring both (1) an external reward model and (2) the generation of multiple samples. In this work, we introduce a new generative self-evaluation scheme designed to adaptively reduce the number of generated samples while maintaining or even improving performance. We use a generative reward model formulation, allowing the LLM to predict mid-generation the probability that restarting the generation will yield a better response. These predictions are obtained without an external reward model and can be used to decide whether or not to generate more samples, prune unpromising samples early on, or to pick the best sample. This capability is very inexpensive as it involves generating a single predefined token. Trained using a dataset constructed with real unfiltered LMSYS user prompts, Llama 3.1 8B’s win rate against GPT-4 on AlpacaEval increases from 21% to 34% with 16 samples and math performance on GSM8K improves from 84% to 91%. By sampling only when the LLM determines that it is beneficial to do so and adaptively adjusting temperature annealing, we demonstrate that 74% of the improvement from using 16 samples can be achieved with only 1.2 samples on average. We further demonstrate that 50-75% of samples can be pruned early in generation with minimal degradation in performance. Overall, our methods enable more efficient and scalable compute utilization during inference for LLMs.
摘要:推理时计算是一种增强大语言模型 (LLM) 性能的强大范式,其中 Best-of-N 采样是一种广泛使用的技术。然而,这种方法计算成本高昂,需要 (1) 外部奖励模型和 (2) 生成多个样本。在本研究中,我们引入了一种新的生成式自我评估方案,旨在自适应地减少生成样本的数量,同时保持甚至提升性能。我们采用生成式奖励模型的表述形式,使 LLM 能够在生成过程中预测重新启动生成将产生更好响应的概率。这些预测无需外部奖励模型即可获得,并可用于决定是否生成更多样本、早期修剪不理想的样本,或选择最佳样本。这一功能非常经济,因为它仅涉及生成一个预定义的 Token。通过使用由真实未过滤的 LMSYS 用户提示构建的数据集进行训练,Llama 3.1 8B 在 AlpacaEval 上对 GPT-4 的胜率从 21% 提高到 34%(使用 16 个样本),并且在 GSM8K 上的数学性能从 84% 提高到 91%。通过仅在 LLM 判断为有益时进行采样,并自适应调整温度退火,我们证明使用 16 个样本带来的 74% 的改进可以通过平均仅 1.2 个样本实现。此外,我们还证明在生成过程中可以早期修剪 50-75% 的样本,而性能几乎没有下降。总体而言,我们的方法使得在 LLM 推理过程中能够更高效和可扩展地利用计算资源。
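下面用 Python 草绘这种"生成式自我评估"的采样循环:每次生成后,模型额外输出"重新生成会得到更好响应"的概率,一旦该概率低于阈值便停止采样。此处的 `generate_with_self_eval` 用随机数模拟,`quality` 与 `p_better` 之间的关系纯属演示假设;真实系统中该概率来自单个预定义 Token 的输出分布,无需外部奖励模型。

```python
import random

def generate_with_self_eval(prompt, rng):
    """一次 LLM 生成的替身:同时给出自我评估概率 p_better,
    即"重新生成会更好"的预测概率(此处与质量的关系为演示假设)。"""
    quality = rng.random()
    p_better = 1.0 - quality          # 玩具设定:样本越好,重采价值越低
    return {"text": f"sample(q={quality:.2f})",
            "quality": quality, "p_better": p_better}

def adaptive_best_of_n(prompt, max_samples=16, threshold=0.3, seed=0):
    """仅在模型预测重采有益时继续采样,自适应地节省推理计算。"""
    rng = random.Random(seed)
    best, drawn = None, 0
    for _ in range(max_samples):
        cand = generate_with_self_eval(prompt, rng)
        drawn += 1
        if best is None or cand["quality"] > best["quality"]:
            best = cand
        if best["p_better"] < threshold:  # 模型判断:重采不太可能更好
            break
    return best, drawn

best, drawn = adaptive_best_of_n("prompt")
```

阈值越高,越早停止;平均样本数因此可以远小于固定的 Best-of-N 预算。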

[NLP-15] Domain-Specific Retrieval-Augmented Generation Using Vector Stores Knowledge Graphs and Tensor Factorization ICML

【速读】: 该论文试图解决大型语言模型(LLMs)在处理特定领域和知识密集型任务时存在的幻觉、知识截断和缺乏知识归属的问题。解决方案的关键在于引入SMART-SLIC框架,该框架结合了检索增强生成(RAG)、知识图谱(KG)和向量存储(VS),通过构建高度领域特定的KG和VS来存储事实性信息,从而避免幻觉。具体来说,SMART-SLIC通过NLP、数据挖掘和非负张量分解等技术构建KG和VS,而不依赖LLMs,从而优化LLM的响应,提高问答准确性,并减少对模型微调的需求。此外,该框架通过结合KG(结构化信息)和VS(非结构化信息),以及链式思维提示代理,实现了领域特定聊天机器人的开发,使其在高度领域特定的问答任务中表现出色。

链接: https://arxiv.org/abs/2410.02721
作者: Ryan C. Barron,Ves Grantcharov,Selma Wanna,Maksim E. Eren,Manish Bhattarai,Nicholas Solovyev,George Tompkins,Charles Nicholas,Kim Ø. Rasmussen,Cynthia Matuszek,Boian S. Alexandrov
关键词-EN: Large Language Models, natural language processing, general natural language, numerous general natural, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Software Engineering (cs.SE)
备注: 9 pages 7 figures, 1 table, 1 cypher code Accepted to ICMLA 2024

点击查看摘要

Abstract:Large Language Models (LLMs) are pre-trained on large-scale corpora and excel in numerous general natural language processing (NLP) tasks, such as question answering (QA). Despite their advanced language capabilities, when it comes to domain-specific and knowledge-intensive tasks, LLMs suffer from hallucinations, knowledge cut-offs, and lack of knowledge attributions. Additionally, fine tuning LLMs’ intrinsic knowledge to highly specific domains is an expensive and time consuming process. The retrieval-augmented generation (RAG) process has recently emerged as a method capable of optimization of LLM responses, by referencing them to a predetermined ontology. It was shown that using a Knowledge Graph (KG) ontology for RAG improves the QA accuracy, by taking into account relevant sub-graphs that preserve the information in a structured manner. In this paper, we introduce SMART-SLIC, a highly domain-specific LLM framework, that integrates RAG with KG and a vector store (VS) that store factual domain specific information. Importantly, to avoid hallucinations in the KG, we build these highly domain-specific KGs and VSs without the use of LLMs, but via NLP, data mining, and nonnegative tensor factorization with automatic model selection. Pairing our RAG with a domain-specific: (i) KG (containing structured information), and (ii) VS (containing unstructured information) enables the development of domain-specific chat-bots that attribute the source of information, mitigate hallucinations, lessen the need for fine-tuning, and excel in highly domain-specific question answering tasks. We pair SMART-SLIC with chain-of-thought prompting agents. The framework is designed to be generalizable to adapt to any specific or specialized domain. In this paper, we demonstrate the question answering capabilities of our framework on a corpus of scientific publications on malware analysis and anomaly detection.
摘要:大语言模型 (Large Language Models, LLMs) 在大规模语料库上进行预训练,并在众多通用自然语言处理 (Natural Language Processing, NLP) 任务中表现出色,例如问答 (Question Answering, QA)。尽管其语言能力先进,但在特定领域和知识密集型任务中,LLMs 存在幻觉、知识截断和缺乏知识归属的问题。此外,将 LLMs 的内在知识微调到高度特定的领域是一个昂贵且耗时的过程。检索增强生成 (Retrieval-Augmented Generation, RAG) 最近作为一种优化 LLM 响应的方法出现,其通过将响应关联到预先确定的本体 (ontology) 来实现。研究表明,使用知识图谱 (Knowledge Graph, KG) 本体进行 RAG 可以提高 QA 准确性,因为它考虑了以结构化方式保留信息的相关子图。在本文中,我们介绍了 SMART-SLIC,这是一个高度特定领域的 LLM 框架,它将 RAG 与 KG 和存储事实领域特定信息的向量存储 (Vector Store, VS) 集成在一起。重要的是,为了避免 KG 中的幻觉,我们通过 NLP、数据挖掘以及带自动模型选择的非负张量分解来构建这些高度特定领域的 KG 和 VS,而不是使用 LLMs。将我们的 RAG 与特定领域的 (i) KG(包含结构化信息)和 (ii) VS(包含非结构化信息)配对,使得开发能够归属信息来源、减少幻觉、减少微调需求并在高度特定领域的问答任务中表现出色的领域特定聊天机器人成为可能。我们将 SMART-SLIC 与思维链提示 (chain-of-thought prompting) 智能体配对。该框架设计为可泛化,以适应任何特定或专业领域。在本文中,我们展示了我们的框架在恶意软件分析和异常检测领域的科学出版物语料库上的问答能力。
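下面用 Python 草绘"KG 提供结构化事实 + VS 提供非结构化段落,并在提示中注明来源"的组合方式。其中的三元组、文档内容以及基于词重叠的检索打分都是演示用的假设;论文中的 KG/VS 由 NLP、数据挖掘与非负张量分解自动构建,检索采用向量相似度而非词重叠。

```python
# 玩具领域存储;真实系统由自动化流水线构建,而非手写字典
KG = [("Emotet", "is_a", "malware family"),
      ("Emotet", "delivers", "TrickBot")]
VS = {"doc1": "Emotet is a banking trojan turned malware loader.",
      "doc2": "Anomaly detection flags rare system call sequences."}

def kg_lookup(entity):
    """结构化事实:返回提及该实体的所有三元组。"""
    return [t for t in KG if entity in (t[0], t[2])]

def vs_search(query, top_k=1):
    """非结构化事实:按词重叠打分排序文档(向量相似度的替身)。"""
    q = set(query.lower().split())
    scored = sorted(VS.items(),
                    key=lambda kv: -len(q & set(kv[1].lower().split())))
    return scored[:top_k]

def build_prompt(question, entity):
    """把 KG 事实与带来源标注的 VS 段落拼入提示,便于归属信息来源。"""
    facts = "; ".join(" ".join(t) for t in kg_lookup(entity))
    doc_id, passage = vs_search(question)[0]
    return (f"Facts: {facts}\nSource [{doc_id}]: {passage}\n"
            f"Question: {question}\nAnswer citing the source:")

prompt = build_prompt("What does Emotet deliver?", "Emotet")
```

生成的提示同时携带结构化事实与可引用的来源段落,从而缓解幻觉并支持知识归属。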

[NLP-16] UncertaintyRAG: Span-Level Uncertainty Enhanced Long-Context Modeling for Retrieval-Augmented Generation

【速读】: 该论文试图解决长上下文检索增强生成(RAG)任务中的语义不一致性和模型鲁棒性问题。解决方案的关键在于引入基于信噪比(SNR)的跨度不确定性(span uncertainty)来估计文本块之间的相似度,从而增强模型校准,提高鲁棒性并减少随机分块引入的语义不一致性。通过这种不确定性估计,论文提出了一种高效的非监督学习技术来训练检索模型,并结合有效的数据采样和扩展策略,使得UncertaintyRAG在LLaMA-2-7B上超越了基线模型2.03%,同时仅使用其他先进开源检索模型训练数据的4%,展示了其在分布偏移设置下的优越性能和轻量级集成能力。

链接: https://arxiv.org/abs/2410.02719
作者: Zixuan Li,Jing Xiong,Fanghua Ye,Chuanyang Zheng,Xun Wu,Jianqiao Lu,Zhongwei Wan,Xiaodan Liang,Chengming Li,Zhenan Sun,Lingpeng Kong,Ngai Wong
关键词-EN: long-context Retrieval-Augmented Generation, Retrieval-Augmented Generation, based span uncertainty, text chunks, span uncertainty
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present UncertaintyRAG, a novel approach for long-context Retrieval-Augmented Generation (RAG) that utilizes Signal-to-Noise Ratio (SNR)-based span uncertainty to estimate similarity between text chunks. This span uncertainty enhances model calibration, improving robustness and mitigating semantic inconsistencies introduced by random chunking. Leveraging this insight, we propose an efficient unsupervised learning technique to train the retrieval model, alongside an effective data sampling and scaling strategy. UncertaintyRAG outperforms baselines by 2.03% on LLaMA-2-7B, achieving state-of-the-art results while using only 4% of the training data compared to other advanced open-source retrieval models under distribution shift settings. Our method demonstrates strong calibration through span uncertainty, leading to improved generalization and robustness in long-context RAG tasks. Additionally, UncertaintyRAG provides a lightweight retrieval model that can be integrated into any large language model with varying context window lengths, without the need for fine-tuning, showcasing the flexibility of our approach.
摘要:我们提出了 UncertaintyRAG,这是一种新颖的长上下文检索增强生成 (RAG) 方法,利用基于信噪比 (SNR) 的跨度不确定性来估计文本块之间的相似性。这种跨度不确定性增强了模型的校准,提高了鲁棒性,并缓解了随机分块引入的语义不一致问题。基于这一洞察,我们提出了一种高效的无监督学习技术来训练检索模型,并结合了一种有效的数据采样和扩展策略。在 LLaMA-2-7B 上,UncertaintyRAG 的表现优于基线 2.03%,在分布偏移设置下,仅使用其他先进开源检索模型所需训练数据的 4%,即达到了最先进的结果。我们的方法通过跨度不确定性展示了强大的校准能力,从而在长上下文 RAG 任务中实现了更好的泛化和鲁棒性。此外,UncertaintyRAG 提供了一个轻量级的检索模型,可以集成到任何具有不同上下文窗口长度的大语言模型中,而无需微调,展示了我们方法的灵活性。
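摘要并未给出 SNR 的具体定义;作为一个带假设的示意,下面把跨度内 Token 对数似然的均值当作"信号"、标准差当作"噪声",计算一个粗略的信噪比:似然平稳的跨度得到高 SNR,似然波动剧烈的跨度得到低 SNR。该公式仅为演示,与论文的实际方法可能不同。

```python
import math

def span_snr(token_logprobs):
    """演示性信噪比:|均值| / 标准差。论文的具体 SNR 公式可能不同。"""
    n = len(token_logprobs)
    mean = sum(token_logprobs) / n
    var = sum((x - mean) ** 2 for x in token_logprobs) / n
    return abs(mean) / (math.sqrt(var) + 1e-8)

confident_span = [-0.10, -0.12, -0.09, -0.11]  # 似然平稳、建模良好的跨度
noisy_span = [-0.10, -3.00, -0.20, -2.50]      # 似然剧烈波动的跨度

snr_hi = span_snr(confident_span)
snr_lo = span_snr(noisy_span)
```

这样的跨度级得分可作为不确定性信号,用于加权文本块相似度或过滤随机分块带来的噪声块。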

[NLP-17] Video Instruction Tuning With Synthetic Data

【速读】: 该论文试图解决视频大模态模型(LMMs)在数据获取方面的难题,即难以从网络中收集大量高质量的原始数据。解决方案的关键在于创建了一个高质量的合成数据集LLaVA-Video-178K,专门用于视频指令跟随任务,包括详细字幕生成、开放式问答和多项选择问答。通过结合现有的视觉指令调优数据,训练出新的视频LMM模型LLaVA-Video,并在多个视频基准测试中展示了其强大的性能,突显了该数据集的有效性。

链接: https://arxiv.org/abs/2410.02713
作者: Yuanhan Zhang,Jinming Wu,Wei Li,Bo Li,Zejun Ma,Ziwei Liu,Chunyuan Li
关键词-EN: curating large amounts, video large multimodal, large multimodal models, large multimodal, curating large
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Project page: this https URL

点击查看摘要

Abstract:The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.
摘要:视频大模态模型 (LMM) 的发展受限于从网络中收集大量高质量原始数据的难度。为解决这一问题,我们提出了一种替代方法,即创建一个高质量的合成数据集,专门用于视频指令跟随,命名为 LLaVA-Video-178K。该数据集包括详细字幕、开放式问答 (QA) 和多项选择 QA 等关键任务。通过结合现有的视觉指令调优数据,我们在该数据集上进行训练,引入了 LLaVA-Video,一种新的视频 LMM。我们的实验表明,LLaVA-Video 在各种视频基准测试中表现出色,突显了我们数据集的有效性。我们计划发布该数据集、其生成流程以及模型检查点。

[NLP-18] LLaVA-Critic: Learning to Evaluate Multimodal Models

【速读】: 该论文试图解决多模态任务评估的通用性问题,提出了LLaVA-Critic,这是首个开源的大型多模态模型(LMM),旨在作为一个通用评估器来评估广泛的多模态任务性能。解决方案的关键在于使用高质量的批评指令跟随数据集进行训练,该数据集包含了多样化的评估标准和场景。通过实验,LLaVA-Critic在LMM-as-a-Judge和Preference Learning两个关键领域展示了其有效性,分别体现在提供可靠的评估分数和生成奖励信号以增强模型对齐能力。这一工作强调了开源LMM在自我批评和评估方面的潜力,为未来研究可扩展的超人类对齐反馈机制奠定了基础。

链接: https://arxiv.org/abs/2410.02712
作者: Tianyi Xiong,Xiyao Wang,Dong Guo,Qinghao Ye,Haoqi Fan,Quanquan Gu,Heng Huang,Chunyuan Li
关键词-EN: open-source large multimodal, large multimodal model, multimodal tasks, generalist evaluator, evaluator to assess
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Project Page: this https URL

点击查看摘要

Abstract:We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as a generalist evaluator to assess performance across a wide range of multimodal tasks. LLaVA-Critic is trained using a high-quality critic instruction-following dataset that incorporates diverse evaluation criteria and scenarios. Our experiments demonstrate the model’s effectiveness in two key areas: (1) LMM-as-a-Judge, where LLaVA-Critic provides reliable evaluation scores, performing on par with or surpassing GPT models on multiple evaluation benchmarks; and (2) Preference Learning, where it generates reward signals for preference learning, enhancing model alignment capabilities. This work underscores the potential of open-source LMMs in self-critique and evaluation, setting the stage for future research into scalable, superhuman alignment feedback mechanisms for LMMs.
摘要:我们介绍了 LLaVA-Critic,这是首个开源的大型多模态模型 (LMM),设计用于作为通用评估器,评估广泛多模态任务的性能。LLaVA-Critic 使用了一个高质量的批评指令跟随数据集进行训练,该数据集包含了多样化的评估标准和场景。我们的实验展示了该模型在两个关键领域的有效性:(1) LMM-as-a-Judge,其中 LLaVA-Critic 提供了可靠的评估分数,在多个评估基准上表现与 GPT 模型相当或超越;(2) 偏好学习,其中它生成了偏好学习的奖励信号,增强了模型的对齐能力。这项工作强调了开源 LMM 在自我批评和评估方面的潜力,为未来研究可扩展的、超人类的 LMM 对齐反馈机制奠定了基础。

[NLP-19] LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations

【速读】: 该论文试图解决大语言模型(LLMs)在生成内容时出现的错误(如事实不准确、偏见和推理失败,统称为“幻觉”)的检测和理解问题。解决方案的关键在于揭示LLMs内部状态编码的关于输出真实性的信息,并利用这些信息进行错误检测。研究发现,这些信息集中在特定标记上,利用这一特性可以显著提高错误检测性能。此外,论文还展示了内部表示可用于预测模型可能犯的错误类型,从而开发针对性的缓解策略。最后,论文揭示了LLMs内部编码与外部行为之间的差异,即模型可能正确编码了答案,但仍持续生成错误答案,这一发现有助于未来研究从模型内部角度深化对LLM错误的分析和缓解。

链接: https://arxiv.org/abs/2410.02707
作者: Hadas Orgad,Michael Toker,Zorik Gekhman,Roi Reichart,Idan Szpektor,Hadas Kotek,Yonatan Belinkov
关键词-EN: including factual inaccuracies, Large language models, Large language, including factual, factual inaccuracies
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) often produce errors, including factual inaccuracies, biases, and reasoning failures, collectively referred to as “hallucinations”. Recent studies have demonstrated that LLMs’ internal states encode information regarding the truthfulness of their outputs, and that this information can be utilized to detect errors. In this work, we show that the internal representations of LLMs encode much more information about truthfulness than previously recognized. We first discover that the truthfulness information is concentrated in specific tokens, and leveraging this property significantly enhances error detection performance. Yet, we show that such error detectors fail to generalize across datasets, implying that – contrary to prior claims – truthfulness encoding is not universal but rather multifaceted. Next, we show that internal representations can also be used for predicting the types of errors the model is likely to make, facilitating the development of tailored mitigation strategies. Lastly, we reveal a discrepancy between LLMs’ internal encoding and external behavior: they may encode the correct answer, yet consistently generate an incorrect one. Taken together, these insights deepen our understanding of LLM errors from the model’s internal perspective, which can guide future research on enhancing error analysis and mitigation.
摘要:大语言模型 (LLMs) 常常产生错误,包括事实不准确、偏见和推理失败,这些错误统称为“幻觉”。最近的研究表明,LLMs 的内部状态编码了与其输出真实性相关的信息,并且这些信息可以用于检测错误。在本研究中,我们发现 LLMs 的内部表示编码了比先前认识到的更多关于真实性的信息。我们首先发现真实性信息集中在特定的 Token 上,利用这一特性显著提高了错误检测的性能。然而,我们表明这种错误检测器无法跨数据集泛化,这意味着——与先前的说法相反——真实性编码并非普遍适用,而是多方面的。接下来,我们展示了内部表示还可以用于预测模型可能产生的错误类型,从而促进定制化缓解策略的开发。最后,我们揭示了 LLMs 内部编码与外部行为之间的差异:它们可能编码了正确答案,但却持续生成错误的答案。综合这些见解,我们加深了从模型内部角度对 LLM 错误的理解,这可以指导未来在增强错误分析和缓解方面的研究。
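这类工作的常见做法是在特定 Token 的内部隐藏状态上训练线性探针 (probe) 来检测错误。下面用 NumPy 在合成数据上草绘一个逻辑回归探针:隐藏状态中人为注入一个"真实性方向",探针学习将其线性分离。数据、维度与训练超参数均为演示假设,并非论文的实验设置。

```python
import numpy as np

def train_probe(states, labels, lr=0.5, steps=300):
    """在某个 Token 的隐藏状态上训练逻辑回归探针。
    states: (n, d) 隐藏向量;labels: (n,),1 表示输出真实。"""
    w, b = np.zeros(states.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(states @ w + b)))
        grad = p - labels                     # 逻辑回归梯度
        w -= lr * states.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b

# 合成"隐藏状态":真实性信号沿某个固定方向注入
rng = np.random.default_rng(1)
d, n = 16, 200
signal = rng.normal(size=d)
labels = rng.integers(0, 2, size=n).astype(float)
states = rng.normal(size=(n, d)) + np.outer(2 * labels - 1, signal)

w, b = train_probe(states, labels)
preds = (states @ w + b > 0).astype(float)
accuracy = (preds == labels).mean()
```

若探针能显著高于随机水平,即说明隐藏状态确实编码了可读取的真实性信息。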

[NLP-20] Selective Attention Improves Transformer

【速读】: 该论文试图解决注意力机制中不必要元素对性能的负面影响问题。解决方案的关键是引入选择性注意力(Selective Attention),这是一种无需额外参数的改进方法,通过减少对不必要元素的关注来提升语言模型的性能。选择性注意力不仅在各种模型规模和上下文长度上提高了语言建模的表现,还允许减少注意力上下文缓冲区的大小,从而显著降低推理过程中的内存和计算需求。

链接: https://arxiv.org/abs/2410.02703
作者: Yaniv Leviathan,Matan Kalman,Yossi Matias
关键词-EN: Selective Attention, Unneeded elements, attention, attention context degrade, Selective
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Unneeded elements in the attention’s context degrade performance. We introduce Selective Attention, a simple parameter-free change to the standard attention mechanism which reduces attention to unneeded elements. Selective attention improves language modeling performance in a variety of model sizes and context lengths. For example, a range of transformers trained with the language modeling objective on C4 with selective attention perform equivalently to standard transformers with ~2X more heads and parameters in their attention modules. Selective attention also allows decreasing the size of the attention’s context buffer, leading to meaningful reductions in the memory and compute requirements during inference. For example, transformers with 100M parameters trained on C4 with context sizes of 512, 1,024, and 2,048 need 16X, 25X, and 47X less memory for their attention module, respectively, when equipped with selective attention, as those without selective attention, with the same validation perplexity.
摘要:注意力机制中不必要的元素会降低性能。我们引入了选择性注意力 (Selective Attention),这是一种对标准注意力机制的简单、无参数的改进,能够减少对不必要元素的关注。选择性注意力在各种模型规模和上下文长度下都能提升语言建模的性能。例如,在 C4 数据集上使用语言建模目标训练的一系列 Transformer,配备选择性注意力后,其性能与标准 Transformer 相当,但后者在注意力模块中具有约 2 倍的头数和参数。选择性注意力还允许减少注意力上下文缓冲区的大小,从而在推理过程中显著降低内存和计算需求。例如,在 C4 数据集上训练的 1 亿参数的 Transformer,上下文大小分别为 512、1,024 和 2,048 时,配备选择性注意力后,其注意力模块所需的内存分别减少了 16 倍、25 倍和 47 倍,且验证困惑度相同。
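下面用 NumPy 草绘摘要所述机制的一个简化版本:除常规因果注意力外,每个 Token 对更早的 Token 给出非负的"选择分数" S,表示其不再需要;这些分数沿时间累积后从注意力 logits 中减去,使被标记的 Token 获得更少注意力。S 的来源与具体累积方式以论文为准,此处仅为示意。

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def selective_attention(Q, K, V, S):
    """S[i, j] >= 0:Token i 认为更早的 Token j 不再需要的程度。
    F[i, j] 为 Token 0..i-1 对 j 的累计选择分数,从 logits 中减去。"""
    n, d = Q.shape
    logits = Q @ K.T / np.sqrt(d)
    F = np.cumsum(np.vstack([np.zeros((1, n)), S[:-1]]), axis=0)
    logits = logits - F
    causal_mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    logits[causal_mask] = -1e9            # 保持因果性
    weights = softmax(logits)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 4, 8
Q, K, V = rng.normal(size=(3, n, d))
S = np.zeros((n, n))
S[1, 0] = 100.0                           # Token 1 标记 Token 0 为不需要
_, w_sel = selective_attention(Q, K, V, S)
_, w_base = selective_attention(Q, K, V, np.zeros((n, n)))
```

被标记 Token 的注意力权重趋近于零,相应的 KV 缓存条目就可以被安全丢弃,这正是上下文缓冲区得以缩小的原因。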

[NLP-21] HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly

【速读】: 该论文试图解决现有长上下文语言模型(LCLMs)评估基准在覆盖应用范围、长度、评估指标可靠性以及与基础模型的兼容性方面存在的不足,导致模型比较不一致的问题。解决方案的关键在于提出了一个名为HELMET的综合性基准,该基准涵盖了七个多样化的应用导向类别,并通过增加可控长度(最高达128k tokens)、基于模型的评估方法以及少样本提示技术,解决了先前基准的诸多问题,从而提供了更可靠和一致的LCLMs排名。

链接: https://arxiv.org/abs/2410.02694
作者: Howard Yen,Tianyu Gao,Minmin Hou,Ke Ding,Daniel Fleischer,Peter Izasak,Moshe Wasserblat,Danqi Chen
关键词-EN: long-context language models, evaluating long-context language, developers often rely, arbitrary subsets, Evaluate Long-context Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Code and data are available here: this https URL

点击查看摘要

Abstract:There have been many benchmarks for evaluating long-context language models (LCLMs), but developers often rely on synthetic tasks like needle-in-a-haystack (NIAH) or arbitrary subsets of tasks. It remains unclear whether they translate to the diverse downstream applications of LCLMs, and the inconsistency further complicates model comparison. We investigate the underlying reasons behind current practices and find that existing benchmarks often provide noisy signals due to low coverage of applications, insufficient lengths, unreliable metrics, and incompatibility with base models. In this work, we present HELMET (How to Evaluate Long-context Models Effectively and Thoroughly), a comprehensive benchmark encompassing seven diverse, application-centric categories. We also address many issues in previous benchmarks by adding controllable lengths up to 128k tokens, model-based evaluation for reliable metrics, and few-shot prompting for robustly evaluating base models. Consequently, we demonstrate that HELMET offers more reliable and consistent rankings of frontier LCLMs. Through a comprehensive study of 51 LCLMs, we find that (1) synthetic tasks like NIAH are not good predictors of downstream performance; (2) the diverse categories in HELMET exhibit distinct trends and low correlation with each other; and (3) while most LCLMs achieve perfect NIAH scores, open-source models significantly lag behind closed ones when the task requires full-context reasoning or following complex instructions – the gap widens with increased lengths. Finally, we recommend using our RAG tasks for fast model development, as they are easy to run and more predictive of other downstream performance; ultimately, we advocate for a holistic evaluation across diverse tasks.
摘要:在评估长上下文语言模型 (LCLMs) 方面,已有许多基准测试,但开发者通常依赖于合成任务,如“大海捞针” (NIAH) 或任务的任意子集。目前尚不清楚这些方法是否能转化为 LCLMs 在多样化下游应用中的表现,且这种不一致性进一步增加了模型比较的复杂性。我们深入研究了当前实践背后的原因,发现现有基准测试往往提供噪声信号,原因包括应用覆盖率低、长度不足、指标不可靠以及与基础模型不兼容。在此工作中,我们提出了 HELMET (How to Evaluate Long-context Models Effectively and Thoroughly),这是一个涵盖七个多样化、以应用为中心类别的综合基准测试。我们还通过增加可控长度至 128k Token、基于模型的评估以获得可靠指标,以及少样本提示以稳健评估基础模型,解决了先前基准测试中的许多问题。因此,我们证明了 HELMET 提供了更可靠和一致的前沿 LCLMs 排名。通过对 51 个 LCLMs 的全面研究,我们发现:(1) 像 NIAH 这样的合成任务不是下游性能的良好预测指标;(2) HELMET 中的多样化类别显示出不同的趋势,且彼此之间相关性较低;(3) 尽管大多数 LCLMs 在 NIAH 任务中得分完美,但当任务需要全上下文推理或遵循复杂指令时,开源模型明显落后于闭源模型——随着长度的增加,差距进一步扩大。最后,我们建议使用我们的 RAG 任务进行快速模型开发,因为它们易于运行且更能预测其他下游性能;最终,我们主张在多样化任务中进行全面评估。

[NLP-22] On the Proper Treatment of Tokenization in Psycholinguistics EMNLP2024

【速读】: 该论文试图解决现代语言模型在心理语言学研究中的应用问题,特别是由于使用分词(tokenization)作为训练模型的中间步骤,导致语言模型在处理感兴趣区域(region of interest)时出现对齐问题。解决方案的关键在于将基于分词的语言模型(token-level language model)边缘化为基于字符的语言模型(character-level language model),从而能够准确计算任意字符子串(focal area)的负对数概率(surprisal),进而用于心理语言学研究中的认知成本预测。这一方法解决了分词方案带来的对齐问题,并发现某些focal area的surprisal比感兴趣区域的surprisal更具心理测量预测性。

链接: https://arxiv.org/abs/2410.02691
作者: Mario Giulianelli,Luca Malagutti,Juan Luis Gastaldi,Brian DuSell,Tim Vieira,Ryan Cotterell
关键词-EN: negative log probability, cognitive cost experienced, Language models, Language, modern language models
类目: Computation and Language (cs.CL)
备注: Main conference long paper at EMNLP 2024

点击查看摘要

Abstract:Language models are widely used in computational psycholinguistics to test theories that relate the negative log probability (the surprisal) of a region of interest (a substring of characters) under a language model to its cognitive cost experienced by readers, as operationalized, for example, by gaze duration on the region. However, the application of modern language models to psycholinguistic studies is complicated by the practice of using tokenization as an intermediate step in training a model. Doing so results in a language model over token strings rather than one over character strings. Vexingly, regions of interest are generally misaligned with these token strings. The paper argues that token-level language models should be (approximately) marginalized into character-level language models before they are used in psycholinguistic studies to compute the surprisal of a region of interest; then, the marginalized character-level language model can be used to compute the surprisal of an arbitrary character substring, which we term a focal area, that the experimenter may wish to use as a predictor. Our proposal of marginalizing a token-level model into a character-level one solves this misalignment issue independently of the tokenization scheme. Empirically, we discover various focal areas whose surprisal is a better psychometric predictor than the surprisal of the region of interest itself.
摘要:语言模型在计算心理语言学中被广泛用于检验理论,这些理论将语言模型下的感兴趣区域(即字符子串)的负对数概率(即意外性)与其在读者中产生的认知成本相关联,例如通过对该区域的注视时长来操作化定义。然而,现代语言模型在心理语言学研究中的应用因使用分词作为模型训练的中间步骤而变得复杂。这样做导致语言模型是基于 Token 字符串而非字符字符串。令人困扰的是,感兴趣区域通常与这些 Token 字符串不一致。本文主张,在用于心理语言学研究以计算感兴趣区域的意外性之前,应将 Token 级别的语言模型(近似地)边缘化为字符级别的语言模型;然后,边缘化后的字符级别语言模型可用于计算实验者可能希望用作预测因子的任意字符子串(我们称之为焦点区域)的意外性。我们提出的将 Token 级别模型边缘化为字符级别模型的方法独立于分词方案解决了这种不一致问题。实证研究发现,各种焦点区域的意外性比感兴趣区域本身的意外性更好地作为心理测量预测因子。
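下面用一个极小的 Python 例子说明"把 Token 级模型边缘化为字符级模型"的含义:字符串 "ab" 可以被切分为 ["ab"] 或 ["a", "b"],其字符级概率必须对所有拼出该字符串的 Token 序列求和,再取负对数即得意外性 (surprisal)。此处的一元 Token 模型(未含终止概率)与穷举式边缘化仅为演示;真实语言模型是自回归的,边缘化通常只能近似进行。

```python
import math
from itertools import product

# 玩具 Token 级模型:三个 Token 的一元概率(演示假设)
TOKEN_P = {"a": 0.4, "b": 0.4, "ab": 0.2}

def char_prob(s, max_len=6):
    """字符串 s 的字符级概率:对所有恰好拼出 s 的 Token 序列求和。"""
    total = 0.0
    for k in range(1, max_len + 1):
        for seq in product(TOKEN_P, repeat=k):
            if "".join(seq) == s:
                p = 1.0
                for t in seq:
                    p *= TOKEN_P[t]
                total += p
    return total

def surprisal(s):
    """负对数概率(单位:nat),即心理语言学中的意外性。"""
    return -math.log(char_prob(s))
```

例如 "ab" 的概率是 0.2(整体 Token)加 0.4 × 0.4(两个单字符 Token),任何与 Token 边界不对齐的焦点区域都能这样得到一致的意外性。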

[NLP-23] HiddenGuard: Fine-Grained Safe Generation with Specialized Representation Router

【速读】: 该论文试图解决大语言模型(LLMs)在生成内容时如何确保安全性和与人类价值观的一致性问题。当前的拒绝策略(如完全拒绝有害提示或应用粗略过滤器)存在二元性限制,导致过度谨慎的响应或无法检测到细微的有害内容。解决方案的关键是引入HiddenGuard框架,该框架通过Prism模块实现实时、细粒度的有害内容检测和修订,利用中间隐藏状态进行令牌级别的检测和修订,从而在生成信息性响应的同时,有选择地修订或替换敏感信息,而非简单拒绝。这种方法实现了更细致、上下文感知的审核,提高了模型在检测和修订有害内容方面的准确性和响应的实用性。

链接: https://arxiv.org/abs/2410.02684
作者: Lingrui Mei,Shenghua Liu,Yiwei Wang,Baolong Bi,Ruibin Yuan,Xueqi Cheng
关键词-EN: Large Language Models, Large Language, grow increasingly powerful, Language Models, grow increasingly
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) grow increasingly powerful, ensuring their safety and alignment with human values remains a critical challenge. Ideally, LLMs should provide informative responses while avoiding the disclosure of harmful or sensitive information. However, current alignment approaches, which rely heavily on refusal strategies, such as training models to completely reject harmful prompts or applying coarse filters are limited by their binary nature. These methods either fully deny access to information or grant it without sufficient nuance, leading to overly cautious responses or failures to detect subtle harmful content. For example, LLMs may refuse to provide basic, public information about medication due to misuse concerns. Moreover, these refusal-based methods struggle to handle mixed-content scenarios and lack the ability to adapt to context-dependent sensitivities, which can result in over-censorship of benign content. To overcome these challenges, we introduce HiddenGuard, a novel framework for fine-grained, safe generation in LLMs. HiddenGuard incorporates Prism (rePresentation Router for In-Stream Moderation), which operates alongside the LLM to enable real-time, token-level detection and redaction of harmful content by leveraging intermediate hidden states. This fine-grained approach allows for more nuanced, context-aware moderation, enabling the model to generate informative responses while selectively redacting or replacing sensitive information, rather than outright refusal. We also contribute a comprehensive dataset with token-level fine-grained annotations of potentially harmful information across diverse contexts. Our experiments demonstrate that HiddenGuard achieves over 90% in F1 score for detecting and redacting harmful content while preserving the overall utility and informativeness of the model’s responses.
摘要:随着大语言模型 (LLM) 的日益强大,确保其安全性和与人类价值观的一致性仍然是一个关键挑战。理想情况下,LLM 应提供信息丰富的响应,同时避免泄露有害或敏感信息。然而,当前的对齐方法严重依赖拒绝策略,例如训练模型完全拒绝有害提示或应用粗略过滤器,这些方法受限于其二元性质。这些方法要么完全拒绝访问信息,要么在没有足够细微差别的情况下授予访问权限,导致响应过于谨慎或未能检测到细微的有害内容。例如,LLM 可能由于误用担忧而拒绝提供有关药物的基本公共信息。此外,这些基于拒绝的方法难以处理混合内容场景,并且缺乏适应上下文敏感性的能力,这可能导致对良性内容的过度审查。为了克服这些挑战,我们引入了 HiddenGuard,这是一种用于 LLM 中细粒度安全生成的新框架。HiddenGuard 集成了 Prism(流内调节的表示路由器),它与 LLM 并行运行,通过利用中间隐藏状态实现有害内容的实时 Token 级检测和修订。这种细粒度方法允许更细致、上下文感知的调节,使模型能够在选择性修订或替换敏感信息的同时生成信息丰富的响应,而不是直接拒绝。我们还贡献了一个全面的数据集,其中包含跨多样情境下潜在有害信息的 Token 级细粒度注释。我们的实验表明,HiddenGuard 在检测和修订有害内容方面实现了超过 90% 的 F1 分数,同时保持了模型响应的整体效用和信息量。
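下面用 Python 草绘"Token 级检测 + 原位修订而非整体拒绝"的流程:检测器逐 Token 打分,命中的 Token 被替换为占位符,其余内容照常输出。此处用敏感词表代替论文中基于中间隐藏状态的 Prism 检测器;词表与占位符均为演示假设。

```python
def toy_detector(token):
    """Prism 式检测器的替身:按敏感词表标记 Token。
    真实系统对中间隐藏激活打分,而非匹配表面字符串。"""
    sensitive = {"password", "ssn", "123-45-6789"}
    return token.lower() in sensitive

def redact_stream(tokens, detector, mask="[REDACTED]"):
    """Token 级调节:正常输出信息性 Token,原位替换被标记的 Token,
    而不是拒绝整个响应。"""
    out = []
    for tok in tokens:
        out.append(mask if detector(tok) else tok)
    return " ".join(out)

reply = redact_stream(
    ["Your", "SSN", "is", "123-45-6789", "keep", "it", "private"],
    toy_detector)
```

这样模型仍能给出有用的回答,只是敏感片段被选择性遮蔽,避免了二元拒绝带来的过度谨慎。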

[NLP-24] DailyDilemmas: Revealing Value Preferences of LLMs with Quandaries of Daily Life

【速读】: 该论文试图解决在日常生活中依赖大型语言模型(LLMs)进行决策时,如何理解和评估这些决策所反映的个人价值观和伦理标准的问题。解决方案的关键在于构建了一个名为DailyDilemmas的数据集,包含1,360个日常生活中的道德困境,每个困境涉及两种可能的行动及其影响方和涉及的人类价值观。通过评估LLMs在这些困境中的决策,论文分析了LLMs所体现的价值观,并将其与五种社会学、心理学和哲学理论进行对比,揭示了LLMs在不同价值观上的偏好差异,特别是对某些核心价值观如真实性的处理差异。此外,论文还研究了OpenAI和Anthropic发布的指导原则,探讨了这些原则在实际道德推理中的价值优先级,并指出用户无法通过系统提示有效引导这种优先级。

链接: https://arxiv.org/abs/2410.02683
作者: Yu Ying Chiu,Liwei Jiang,Yejin Choi
关键词-EN: Moral Foundation Theory, increasingly seek guidance, increasingly seek, decision-making in daily, clear-cut and depend
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint. Under Review

点击查看摘要

Abstract:As we increasingly seek guidance from LLMs for decision-making in daily life, many of these decisions are not clear-cut and depend significantly on the personal values and ethical standards of the users. We present DailyDilemmas, a dataset of 1,360 moral dilemmas encountered in everyday life. Each dilemma includes two possible actions and with each action, the affected parties and human values invoked. Based on these dilemmas, we consolidated a set of human values across everyday topics e.g., interpersonal relationships, workplace, and environmental issues. We evaluated LLMs on these dilemmas to determine what action they will take and the values represented by these actions. Then, we analyzed these values through the lens of five popular theories inspired by sociology, psychology and philosophy. These theories are: World Value Survey, Moral Foundation Theory, Maslow’s Hierarchy of Needs, Aristotle’s Virtues, and Plutchik Wheel of Emotion. We find that LLMs are most aligned with the self-expression over survival values in terms of World Value Survey, care over loyalty in Moral Foundation Theory. Interestingly, we find large preferences differences in models for some core values such as truthfulness e.g., Mixtral-8x7B model tends to neglect it by 9.7% while GPT-4-turbo model tends to select it by 9.4%. We also study the recent guidance released by OpenAI (ModelSpec), and Anthropic (Constitutional AI) to understand how their released principles reflect their actual value prioritization when facing nuanced moral reasoning in daily-life settings. We find that end users cannot effectively steer such prioritization using system prompts.
摘要:随着我们在日常生活中越来越多地寻求大语言模型 (LLM) 的决策指导,许多决策并非是非分明的,而是很大程度上依赖于用户的个人价值观和伦理标准。我们提出了 DailyDilemmas,一个包含 1,360 个日常生活中的道德困境的数据集。每个困境包括两种可能的行动,每种行动涉及的受影响方和所引发的人类价值观。基于这些困境,我们整合了一系列涵盖日常话题(如人际关系、职场和环境问题)的人类价值观。我们评估了大语言模型在这些困境中的表现,以确定它们将采取的行动及其所代表的价值观。然后,我们通过五种受社会学、心理学和哲学启发的流行理论的视角来分析这些价值观。这些理论包括:世界价值观调查 (World Value Survey)、道德基础理论 (Moral Foundation Theory)、马斯洛需求层次理论 (Maslow’s Hierarchy of Needs)、亚里士多德的德性论 (Aristotle’s Virtues) 和普拉奇克情绪轮 (Plutchik Wheel of Emotion)。我们发现,在基于世界价值观调查的分析中,大语言模型更倾向于自我表达而非生存价值;在道德基础理论中,更倾向于关怀而非忠诚。有趣的是,我们发现不同模型在一些核心价值观上存在显著的偏好差异,例如在真实性方面,Mixtral-8x7B 模型倾向于忽略它,比例高达 9.7%,而 GPT-4-turbo 模型则倾向于选择它,比例为 9.4%。我们还研究了 OpenAI(ModelSpec)和 Anthropic(Constitutional AI)最近发布的指导原则,以了解这些公司在面对日常生活中的复杂道德推理时,其发布的原则如何反映其实际的价值观优先级。我们发现,终端用户无法通过系统提示有效地引导这种优先级。

[NLP-25] Distilling an End-to-End Voice Assistant Without Instruction Training Data

【速读】: 该论文试图解决语音助手在处理语音和文本时信息丢失和复杂性增加的问题。解决方案的关键在于提出了一种无需指令数据和标注响应的训练方法,通过使用文本大语言模型(LLM)对语音转录文本的响应进行自监督学习,从而训练出端到端的语音大语言模型(Speech LLM)。这种方法不仅减少了训练计算量,还提高了模型在口语问答、分类和翻译等任务上的泛化能力,并在用户偏好测试中表现优于现有最先进模型。

链接: https://arxiv.org/abs/2410.02678
作者: William Held,Ella Li,Michael Ryan,Weiyan Shi,Yanzhe Zhang,Diyi Yang
关键词-EN: Siri and Google, lost speech information, Google Assistant, Distilled Voice Assistant, Speech Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Voice assistants, such as Siri and Google Assistant, typically model audio and text separately, resulting in lost speech information and increased complexity. Recent efforts to address this with end-to-end Speech Large Language Models (LLMs) trained with supervised finetuning (SFT) have led to models “forgetting” capabilities from text-only LLMs. Our work proposes an alternative paradigm for training Speech LLMs without instruction data, using the response of a text-only LLM to transcripts as self-supervision. Importantly, this process can be performed without annotated responses. We show that our Distilled Voice Assistant (DiVA) generalizes to Spoken Question Answering, Classification, and Translation. Furthermore, we show that DiVA better meets user preferences, achieving a 72% win rate compared with state-of-the-art models like Qwen 2 Audio, despite using 100x less training compute.
摘要:语音助手,如 Siri 和 Google Assistant,通常分别对音频和文本进行建模,导致语音信息的丢失和复杂性的增加。近期,通过使用监督微调 (SFT) 训练端到端语音大语言模型 (LLMs) 来解决这一问题的尝试,导致了模型从纯文本 LLMs 中“遗忘”了某些能力。我们的工作提出了一种无需指令数据的语音 LLMs 训练替代范式,利用纯文本 LLM 对转录文本的响应作为自监督。重要的是,这一过程可以在没有标注响应的情况下进行。我们展示了我们的蒸馏语音助手 (DiVA) 在口语问答、分类和翻译任务上的泛化能力。此外,DiVA 能更好地满足用户偏好:尽管训练计算量仅为 Qwen 2 Audio 等最先进模型的 1/100,DiVA 仍取得了 72% 的胜率。
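DiVA 的自监督蒸馏思路——以纯文本 LLM 对转录文本的响应分布作为教师信号来监督语音模型——可以用下面的极简 Python 示意来理解。其中的分布、词表与函数名均为玩具假设,并非论文的实际目标函数:

```python
import math

def kl_divergence(teacher, student, eps=1e-12):
    # KL(teacher || student):文本 LLM(教师)的下一 Token 分布
    # 监督语音 LLM(学生),全程无需人工标注的响应
    return sum(p * math.log(p / max(student.get(t, 0.0), eps))
               for t, p in teacher.items() if p > 0)

# 玩具分布:教师对转录文本的响应在三个候选 Token 上的概率
teacher_dist = {"yes": 0.7, "no": 0.2, "maybe": 0.1}
aligned = kl_divergence(teacher_dist, teacher_dist)                       # 学生完全对齐
drifted = kl_divergence(teacher_dist, {"yes": 0.3, "no": 0.5, "maybe": 0.2})  # 学生偏离
```

学生分布与教师一致时损失为零,偏离越大损失越大;训练即不断最小化该散度。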


[NLP-26] CulturalBench: a Robust Diverse and Challenging Benchmark on Measuring the (Lack of) Cultural Knowledge of LLMs

【速读】: 该论文试图解决大语言模型(LLMs)在跨文化背景下表现不佳的问题,解决方案的关键在于引入了一个名为CulturalBench的有效文化知识基准。CulturalBench包含1,227个由人类编写和验证的问题,涵盖45个全球区域,特别是包括了孟加拉国、津巴布韦和秘鲁等代表性不足的地区。这些问题涉及17个多样化的主题,从饮食偏好到问候礼仪,每个问题都经过五名独立注释者的验证。论文通过CulturalBench-Easy和CulturalBench-Hard两种设置评估模型,发现LLMs在不同设置下的表现差异显著,且在CulturalBench-Hard设置下,前沿LLMs的表现远低于人类水平,尤其是在涉及多个正确答案的复杂问题上表现不佳。此外,研究发现OpenAI的GPT-4o在大多数区域问题上表现优异,但在南美和中东问题上仍普遍表现不佳。

链接: https://arxiv.org/abs/2410.02677
作者: Yu Ying Chiu,Liwei Jiang,Bill Yuchen Lin,Chan Young Park,Shuyue Stella Li,Sahithya Ravi,Mehar Bhatia,Maria Antoniak,Yulia Tsvetkov,Vered Shwartz,Yejin Choi
关键词-EN: make large language, large language models, track our progress, effective cultural knowledge, cultural knowledge benchmarks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint. Under review

点击查看摘要

Abstract:To make large language models (LLMs) more helpful across diverse cultures, it is essential to have effective cultural knowledge benchmarks to measure and track our progress. Effective benchmarks need to be robust, diverse, and challenging. We introduce CulturalBench: a set of 1,227 human-written and human-verified questions for effectively assessing LLMs’ cultural knowledge, covering 45 global regions including underrepresented ones like Bangladesh, Zimbabwe, and Peru. Questions - each verified by five independent annotators - span 17 diverse topics ranging from food preferences to greeting etiquette. We evaluate models on two setups: CulturalBench-Easy and CulturalBench-Hard, which share the same questions but ask them differently. We find that LLMs are sensitive to such differences in setup (e.g., GPT-4o with a 27.3% difference). Compared to human performance (92.6% accuracy), CulturalBench-Hard is more challenging for frontier LLMs, with the best-performing model (GPT-4o) at only 61.5% and the worst (Llama3-8b) at 21.4%. Moreover, we find that LLMs often struggle with tricky questions that have multiple correct answers (e.g., What utensils do the Chinese usually use?), revealing a tendency to converge on a single answer. Our results also indicate that OpenAI’s GPT-4o substantially outperforms other proprietary and open-source models on questions related to all but one region (Oceania). Nonetheless, all models consistently underperform on questions related to South America and the Middle East.
摘要:为了使大语言模型 (LLMs) 在不同文化背景下更具实用性,建立有效的文化知识基准来衡量和追踪我们的进展至关重要。有效的基准需要具备鲁棒性、多样性和挑战性。我们引入了 CulturalBench:一套包含 1,227 个由人工撰写并经人工验证的问题,用于有效评估 LLMs 的文化知识,涵盖了 45 个全球区域,包括孟加拉国、津巴布韦和秘鲁等代表性不足的地区。每个问题均由五名独立注释者验证,涵盖 17 个多样化的主题,从饮食偏好到问候礼仪。我们在两种设置下评估模型:CulturalBench-Easy 和 CulturalBench-Hard,它们共享相同的问题但提问方式不同。我们发现 LLMs 对这种设置差异非常敏感(例如,GPT-4o 的差异为 27.3%)。与人类表现(92.6% 的准确率)相比,CulturalBench-Hard 对前沿 LLMs 更具挑战性,表现最佳的模型(GPT-4o)准确率仅为 61.5%,而表现最差的模型(Llama3-8b)仅为 21.4%。此外,我们发现 LLMs 在处理具有多个正确答案的复杂问题时经常遇到困难(例如,“中国人通常使用哪些餐具?”),显示出收敛到单一答案的倾向。我们的结果还表明,在除大洋洲外的所有区域相关问题上,OpenAI 的 GPT-4o 显著优于其他专有和开源模型。然而,所有模型在涉及南美洲和中东地区的问题上均表现不佳。

[NLP-27] FAN: Fourier Analysis Networks

【速读】: 该论文试图解决神经网络在周期性数据建模和推理中的缺陷问题,即神经网络倾向于记忆周期性数据而非真正理解其背后的周期性原理。解决方案的关键在于提出了一种基于傅里叶分析的新型网络架构FAN,通过引入傅里叶级数,将周期性自然地融入网络的结构和计算过程中,从而实现对周期性模式更精确的表达和预测。FAN作为一种有潜力的多层感知器(MLP)替代方案,能够在减少参数和计算量的同时,在各种实际任务中展现出优越的性能和泛化能力。

链接: https://arxiv.org/abs/2410.02675
作者: Yihong Dong,Ge Li,Yongding Tao,Xue Jiang,Kechi Zhang,Jia Li,Jing Su,Jun Zhang,Jingjing Xu
关键词-EN: remarkable success achieved, exhibit potential flaws, remarkable success, success achieved, exhibit potential
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite the remarkable success achieved by neural networks, particularly those represented by MLP and Transformer, we reveal that they exhibit potential flaws in the modeling and reasoning of periodicity, i.e., they tend to memorize the periodic data rather than genuinely understanding the underlying principles of periodicity. However, periodicity is a crucial trait in various forms of reasoning and generalization, underpinning predictability across natural and engineered systems through recurring patterns in observations. In this paper, we propose FAN, a novel network architecture based on Fourier Analysis, which empowers the ability to efficiently model and reason about periodic phenomena. By introducing Fourier Series, the periodicity is naturally integrated into the structure and computational processes of the neural network, thus achieving a more accurate expression and prediction of periodic patterns. As a promising substitute to multi-layer perceptron (MLP), FAN can seamlessly replace MLP in various models with fewer parameters and FLOPs. Through extensive experiments, we demonstrate the effectiveness of FAN in modeling and reasoning about periodic functions, and the superiority and generalizability of FAN across a range of real-world tasks, including symbolic formula representation, time series forecasting, and language modeling.
摘要:尽管神经网络,特别是由多层感知器 (MLP) 和 Transformer 所代表的网络,取得了显著的成功,但我们揭示了它们在周期性建模和推理方面存在潜在的缺陷,即它们倾向于记忆周期性数据,而非真正理解周期性的基本原理。然而,周期性是各种推理和泛化形式中的关键特征,通过观察中的重复模式,支撑了自然和工程系统中的可预测性。本文提出了一种基于傅里叶分析 (Fourier Analysis) 的新型网络架构——FAN,该架构赋予了高效建模和推理周期现象的能力。通过引入傅里叶级数 (Fourier Series),周期性自然地融入到神经网络的结构和计算过程中,从而实现了对周期模式的更准确表达和预测。作为多层感知器 (MLP) 的有力替代品,FAN 能够在参数和浮点运算 (FLOPs) 更少的情况下,无缝替换各种模型中的 MLP。通过广泛的实验,我们展示了 FAN 在建模和推理周期函数方面的有效性,以及 FAN 在包括符号公式表示、时间序列预测和语言建模等一系列现实任务中的优越性和通用性。
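FAN 将傅里叶级数项并入层结构的核心思路,可以用一个纯 Python 的玩具示意来说明:周期分支对线性投影分别取 cos 与 sin,非周期分支保留普通的非线性变换。权重形状与分支配比均为本文假设,并非论文的实际架构:

```python
import math

def fan_layer(x, freq_weights, mlp_weights):
    # 周期分支:线性投影后分别取 cos 与 sin,
    # 使周期性直接内嵌于层结构(对应论文引入的傅里叶级数项)
    proj = [sum(w * xi for w, xi in zip(row, x)) for row in freq_weights]
    periodic = [math.cos(p) for p in proj] + [math.sin(p) for p in proj]
    # 非周期分支:普通加权和 + tanh 非线性,保留一般的函数逼近能力
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)))
              for row in mlp_weights]
    return periodic + hidden

# 周期性检验:周期分支权重为 [1, 0] 时,输入第一维平移 2π 不应改变输出
freq_w = [[1.0, 0.0]]
mlp_w = [[0.0, 1.0]]
out_a = fan_layer([0.5, 0.3], freq_w, mlp_w)
out_b = fan_layer([0.5 + 2 * math.pi, 0.3], freq_w, mlp_w)
```

与普通 MLP 层不同,这里的周期结构是显式内建的,而非靠记忆训练数据来近似。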

[NLP-28] Examining Language Modeling Assumptions Using an Annotated Literary Dialect Corpus EMNLP2024

【速读】: 该论文试图解决19世纪美国文学中正字法变体的计算分析问题,并探讨这些变体在文学意义上的表达。解决方案的关键在于构建了一个包含人类注释方言标签的新数据集,并利用BERT和CANINE模型在词级和字符级进行上下文语言建模实验。研究结果表明,正字法变体通过多种语言通道产生“方言效应”,且不同的分词方案显著影响模型所能揭示的正字法信息类型。

链接: https://arxiv.org/abs/2410.02674
作者: Craig Messner,Tom Lippincott
关键词-EN: century American literary, American literary orthovariant, group tags designed, exploring literarily meaningful, literary orthovariant tokens
类目: Computation and Language (cs.CL)
备注: Accepted to NLP4DH@EMNLP2024

点击查看摘要

Abstract:We present a dataset of 19th century American literary orthovariant tokens with a novel layer of human-annotated dialect group tags designed to serve as the basis for computational experiments exploring literarily meaningful orthographic variation. We perform an initial broad set of experiments over this dataset using both token (BERT) and character (CANINE)-level contextual language models. We find indications that the “dialect effect” produced by intentional orthographic variation employs multiple linguistic channels, and that these channels are able to be surfaced to varied degrees given particular language modelling assumptions. Specifically, we find evidence showing that choice of tokenization scheme meaningfully impact the type of orthographic information a model is able to surface.
摘要:我们提出了一组19世纪美国文学正字变体Token的数据集,并引入了一种新颖的人工标注方言组标签层,旨在作为计算实验的基础,探索具有文学意义的正字变异。我们在此数据集上进行了一系列初步的广泛实验,使用了基于Token(BERT)和基于字符(CANINE)的上下文语言模型。我们发现,有意正字变异产生的“方言效应”涉及多个语言通道,并且这些通道在特定的语言建模假设下能够以不同程度显现。具体而言,我们发现证据表明,Token化方案的选择对模型能够显现的正字信息类型有显著影响。

[NLP-29] How to Train Long-Context Language Models (Effectively)

【速读】: 该论文试图解决如何有效利用长上下文信息的问题,解决方案的关键在于通过持续预训练和监督微调(SFT)来优化语言模型(LM)。具体来说,论文提出了以下关键策略:1) 结合代码库和书籍等长数据源与高质量的短数据进行混合训练;2) 使用超过评估长度的序列进行训练以提升长上下文性能;3) 在SFT阶段仅使用短指令数据集即可在长上下文任务上取得优异表现。最终模型ProLong-8B在128K长度的上下文窗口中表现出色,并能有效处理长达512K的上下文,展示了其在长上下文任务中的领先性能。

链接: https://arxiv.org/abs/2410.02660
作者: Tianyu Gao,Alexander Wettig,Howard Yen,Danqi Chen
关键词-EN: supervised fine-tuning, make effective, long-context, study continued training, SFT
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Our code, data, and models are available at this https URL

点击查看摘要

Abstract:We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information. We first establish a reliable evaluation protocol to guide model development – instead of perplexity or simple needle-in-a-haystack (NIAH) tests, we use a broad set of long-context tasks, and we evaluate models after SFT with instruction data as this better reveals long-context abilities. Supported by our robust evaluations, we run thorough experiments to decide the data mix for continued pre-training, the instruction tuning dataset, and many other design choices. We find that (1) code repositories and books are excellent sources of long data, but it is crucial to combine them with high-quality short data; (2) training with a sequence length beyond the evaluation length boosts long-context performance; (3) for SFT, using only short instruction datasets yields strong performance on long-context tasks. Our final model, ProLong-8B, which is initialized from Llama-3 and trained on 40B tokens, demonstrates state-of-the-art long-context performance among similarly sized models at a length of 128K. ProLong outperforms Llama-3.1-8B-Instruct on the majority of long-context tasks despite having seen only 5% as many tokens during long-context training. Additionally, ProLong can effectively process up to 512K tokens, one of the longest context windows of publicly available LMs.
摘要:我们研究了语言模型 (LM) 的持续训练和监督微调 (SFT),以有效利用长上下文信息。首先,我们建立了一个可靠的评估协议来指导模型开发——我们使用了一系列长上下文任务,而不是困惑度或简单的“大海捞针”(NIAH) 测试,并且在 SFT 后使用指令数据进行评估,因为这更好地揭示了模型的长上下文能力。在我们的稳健评估支持下,我们进行了详尽的实验,以确定持续预训练的数据混合、指令调优数据集以及其他设计选择。我们发现:(1) 代码仓库和书籍是长数据的优秀来源,但将它们与高质量的短数据结合至关重要;(2) 使用超出评估长度的序列长度进行训练可以提升长上下文性能;(3) 对于 SFT,仅使用短指令数据集就能在长上下文任务上取得强劲表现。我们的最终模型 ProLong-8B 从 Llama-3 初始化,并在 40B Token 上进行训练,在 128K 长度下,其在同等规模的模型中展示了最先进的长上下文性能。ProLong 在大多数长上下文任务上优于 Llama-3.1-8B-Instruct,尽管其长上下文训练所见的 Token 量仅为后者的 5%。此外,ProLong 能够有效处理长达 512K Token 的上下文,这是公开可用的大语言模型中最长的上下文窗口之一。
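摘要中的发现 (1)——长数据(代码仓库、书籍)必须与高质量短数据混合——在实现上通常对应一个按固定权重的来源采样器,可粗略示意如下。混合比例为本文假设,并非论文公布的配比:

```python
import random

# 假设的混合比例:长数据为主,搭配高质量短数据(非论文实际配比)
MIX = {"code_repos": 0.45, "books": 0.25, "short_high_quality": 0.30}

def sample_sources(n, rng):
    # 按固定混合权重为继续预训练抽取 n 条样本各自的来源
    names = list(MIX)
    weights = [MIX[s] for s in names]
    return rng.choices(names, weights=weights, k=n)

rng = random.Random(0)
draws = sample_sources(20000, rng)
share_short = draws.count("short_high_quality") / len(draws)  # 应接近 0.30
```

实际训练管线还需处理序列打包与长度分桶,此处仅示意来源层面的配比控制。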

[NLP-30] Hate Personified: Investigating the role of LLMs in content moderation EMNLP’24

【速读】: 该论文试图解决在仇恨检测等主观任务中,大型语言模型(LLM)如何准确反映不同群体需求的问题。解决方案的关键在于通过在提示中引入额外的上下文信息,如地理提示、人格属性和数值信息,来全面分析LLM对这些因素的敏感性。研究发现,模仿人格属性会导致标注变异性,而地理信号的引入则能更好地实现区域对齐。此外,LLM对数值锚点的敏感性表明其能够利用社区标记和对抗性曝光。该研究为在文化敏感场景中应用LLM提供了初步指导,并强调了其中的细微差别。

链接: https://arxiv.org/abs/2410.02657
作者: Sarah Masud,Sahajpreet Singh,Viktor Hangya,Alexander Fraser,Tanmoy Chakraborty
关键词-EN: Large Language Model, perceive hate differently, people perceive hate, represent diverse groups, Language Model
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 17 pages, 6 Figures, 13 Tables, EMNLP’24 Mains

点击查看摘要

Abstract:For subjective tasks such as hate detection, where people perceive hate differently, the Large Language Model’s (LLM) ability to represent diverse groups is unclear. By including additional context in prompts, we comprehensively analyze LLM’s sensitivity to geographical priming, persona attributes, and numerical information to assess how well the needs of various groups are reflected. Our findings on two LLMs, five languages, and six datasets reveal that mimicking persona-based attributes leads to annotation variability. Meanwhile, incorporating geographical signals leads to better regional alignment. We also find that the LLMs are sensitive to numerical anchors, indicating the ability to leverage community-based flagging efforts and exposure to adversaries. Our work provides preliminary guidelines and highlights the nuances of applying LLMs in culturally sensitive cases.
摘要:对于仇恨检测这类主观任务,由于人们对仇恨的感知存在差异,大语言模型 (LLM) 在代表不同群体方面的能力尚不明确。通过在提示中加入额外上下文,我们全面分析了 LLM 对地理启动、角色属性及数值信息的敏感性,以评估其对不同群体需求的反映程度。我们在两个 LLM、五种语言和六个数据集上的研究发现,模仿基于角色的属性会导致标注的变异性。同时,引入地理信号能更好地实现区域对齐。我们还发现,LLM 对数值锚点敏感,表明其能够利用基于社区的标记努力和对抗曝光。我们的工作提供了初步指导,并突显了在文化敏感案例中应用 LLM 的细微差别。
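论文考察的三类上下文信号——地理启动、角色属性与数值锚点(如社区举报数)——在提示构造层面大致如下。字段措辞为本文假设,并非论文使用的模板:

```python
def hate_prompt(text, region=None, persona=None, flag_count=None):
    # 按需拼接三类上下文信号:角色属性、地理启动、数值锚点
    parts = []
    if persona:
        parts.append(f"You are {persona}.")
    if region:
        parts.append(f"Judge by the social norms of {region}.")
    if flag_count is not None:
        parts.append(f"{flag_count} users have already flagged this post.")
    parts.append(f'Is the following post hateful? Answer yes or no.\nPost: "{text}"')
    return "\n".join(parts)

with_context = hate_prompt("example post", region="Germany",
                           persona="a content moderator", flag_count=12)
plain = hate_prompt("example post")
```

对比有无上下文的标注结果,即可量化模型对各类信号的敏感度。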

[NLP-31] Measuring and Improving Persuasiveness of Generative Models

【速读】: 该论文试图解决的问题是如何量化和评估大型语言模型(LLMs)在生成说服性内容方面的能力,以及这种能力对社会的影响。解决方案的关键在于开发了PersuasionBench和PersuasionArena,这是首个大规模的基准和竞技场,包含了一系列任务来自动测量生成模型的说服能力。通过这些工具,研究者可以系统地评估LLMs在不同情境下的说服效果,并发现模型大小与说服力之间的复杂关系。论文还强调了通过合成和自然数据集进行有针对性的训练可以显著提升较小模型的说服能力,这挑战了仅依赖模型规模来评估其社会影响的假设。

链接: https://arxiv.org/abs/2410.02653
作者: Somesh Singh,Yaman K Singla,Harini SI,Balaji Krishnamurthy
关键词-EN: workflows involving generating, involving generating content, workflows involving, directly interacting, involving generating
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:LLMs are increasingly being used in workflows involving generating content to be consumed by humans (e.g., marketing) and also in directly interacting with humans (e.g., through chatbots). The development of such systems that are capable of generating verifiably persuasive messages presents both opportunities and challenges for society. On the one hand, such systems could positively impact domains like advertising and social good, such as addressing drug addiction, and on the other, they could be misused for spreading misinformation and shaping political opinions. To channel LLMs’ impact on society, we need to develop systems to measure and benchmark their persuasiveness. With this motivation, we introduce PersuasionBench and PersuasionArena, the first large-scale benchmark and arena containing a battery of tasks to measure the persuasion ability of generative models automatically. We investigate to what extent LLMs know and leverage linguistic patterns that can help them generate more persuasive language. Our findings indicate that the persuasiveness of LLMs correlates positively with model size, but smaller models can also be made to have a higher persuasiveness than much larger models. Notably, targeted training using synthetic and natural datasets significantly enhances smaller models’ persuasive capabilities, challenging scale-dependent assumptions. Our findings carry key implications for both model developers and policymakers. For instance, while the EU AI Act and California’s SB-1047 aim to regulate AI models based on the number of floating point operations, we demonstrate that simple metrics like this alone fail to capture the full scope of AI’s societal impact. We invite the community to explore and contribute to PersuasionArena and PersuasionBench, available at this https URL, to advance our understanding of AI-driven persuasion and its societal implications.
摘要:大语言模型 (LLM) 在涉及生成供人类消费的内容(例如营销)以及直接与人类互动(例如通过聊天机器人)的工作流程中越来越被使用。开发能够生成可验证说服性信息的此类系统,为社会带来了机遇和挑战。一方面,这些系统可能对广告和社会公益等领域产生积极影响,例如解决药物成瘾问题;另一方面,它们也可能被滥用以传播错误信息和塑造政治观点。为了引导大语言模型对社会的影响,我们需要开发系统来测量和基准化它们的说服力。基于这一动机,我们引入了 PersuasionBench 和 PersuasionArena,这是首个包含一系列任务的大规模基准和竞技场,用于自动测量生成模型的说服能力。我们研究了大语言模型在多大程度上了解并利用有助于生成更具说服力语言的语言模式。我们的研究结果表明,大语言模型的说服力与其模型大小正相关,但较小的模型也可以通过训练获得比更大模型更高的说服力。值得注意的是,使用合成和自然数据集进行针对性训练显著增强了较小模型的说服能力,挑战了依赖规模的假设。我们的发现对模型开发者和政策制定者都具有重要意义。例如,尽管欧盟 AI 法案和加利福尼亚州的 SB-1047 旨在基于浮点运算次数来监管 AI 模型,但我们证明,仅凭此类简单指标无法全面捕捉 AI 的社会影响。我们邀请社区探索并贡献于 PersuasionArena 和 PersuasionBench,网址为 https URL,以推进我们对 AI 驱动说服力及其社会影响的理解。

[NLP-32] Undesirable Memorization in Large Language Models : A Survey

【速读】: 该论文试图解决大语言模型(LLMs)中的记忆化问题,即模型在训练过程中存储并再现训练数据中的短语或段落,从而引发隐私和安全风险。解决方案的关键在于系统化地理解记忆化的五个关键维度:意图性、程度、可检索性、抽象性和透明性,并通过开发新的度量方法和分析影响因素来识别和减轻记忆化现象。此外,论文还提出了未来研究方向,包括在特定模型架构和应用场景中平衡性能与隐私的方法。

链接: https://arxiv.org/abs/2410.02650
作者: Ali Satvaty,Suzan Verberne,Fatih Turkmen
关键词-EN: Large Language Models, capabilities of Large, recent research increasingly, research increasingly showcases, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While recent research increasingly showcases the remarkable capabilities of Large Language Models (LLMs), it’s vital to confront their hidden pitfalls. Among these challenges, the issue of memorization stands out, posing significant ethical and legal risks. In this paper, we present a Systematization of Knowledge (SoK) on the topic of memorization in LLMs. Memorization is the tendency of a model to store and reproduce phrases or passages from its training data, and it has been shown to be the fundamental issue underlying various privacy and security attacks against LLMs. We begin by providing an overview of the literature on memorization, exploring it across five key dimensions: intentionality, degree, retrievability, abstraction, and transparency. Next, we discuss the metrics and methods used to measure memorization, followed by an analysis of the factors that contribute to the memorization phenomenon. We then examine how memorization manifests itself in specific model architectures and explore strategies for mitigating these effects. We conclude our overview by identifying potential research topics for the near future: developing methods for balancing performance and privacy in LLMs, and analyzing memorization in specific contexts, including conversational agents, retrieval-augmented generation, multilingual language models, and diffusion language models.
摘要:尽管近期研究越来越多地展示了大型语言模型 (LLM) 的显著能力,但正视其潜在的隐患同样至关重要。在这些挑战中,记忆化问题尤为突出,带来了重大的伦理和法律风险。本文对 LLM 中的记忆化现象进行了知识系统化 (SoK)。记忆化是指模型倾向于存储并再现训练数据中的短语或段落,已被证明是针对 LLM 的各种隐私和安全攻击的根本问题。我们首先概述了关于记忆化的文献,从五个关键维度——意图性、程度、可检索性、抽象性和透明性——进行探讨。接着,我们讨论了用于测量记忆化的指标和方法,随后分析了导致记忆化现象的因素。然后,我们考察了记忆化在特定模型架构中的表现,并探讨了减轻这些影响的策略。最后,我们通过识别近期的潜在研究课题来结束概述:开发在 LLM 中平衡性能和隐私的方法,以及在特定情境下分析记忆化,包括对话智能体、检索增强生成、多语言语言模型和扩散语言模型。
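文献中常见的一种记忆化操作化定义是“k-可提取性”:给模型训练序列的前 k 个 Token,看贪心续写能否逐字复现后续内容。下面用查表式玩具模型示意这一判定(仅为概念演示,并非该综述提出的具体方法):

```python
def greedy_continue(model, prefix, m):
    # 玩具式“贪心解码”:model 是 末尾二元组 -> 下一 Token 的查表,
    # 代替真实大语言模型的贪心生成
    seq = list(prefix)
    for _ in range(m):
        nxt = model.get(tuple(seq[-2:]))
        if nxt is None:
            break
        seq.append(nxt)
    return seq[len(prefix):]

def is_k_extractable(model, training_seq, k, m):
    # 判定:用前 k 个 Token 作提示,贪心续写是否逐字复现
    # 训练数据中接下来的 m 个 Token
    return greedy_continue(model, training_seq[:k], m) == training_seq[k:k + m]

memorized_seq = ["the", "cat", "sat", "on", "the", "mat"]
toy_lm = {("the", "cat"): "sat", ("cat", "sat"): "on",
          ("sat", "on"): "the", ("on", "the"): "mat"}  # 完全“背下”了该序列
novel_seq = ["a", "dog", "ran", "home"]                # 模型未见过的序列
```

对真实模型,逐条训练样本运行该检查即可统计可提取记忆化的比例。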


[NLP-33] Immunogenicity Prediction with Dual Attention Enables Vaccine Target Selection

【速读】: 该论文试图解决免疫原性预测问题,即在反向疫苗学中寻找能够触发保护性免疫反应的候选疫苗。解决方案的关键在于引入了一种名为ProVaccine的新型深度学习方法,该方法采用双注意力机制,结合了蛋白质序列和结构的预训练潜在向量表示,从而提高了预测精度和泛化能力。此外,论文还构建了迄今为止最全面的免疫原性数据集,包含超过9,500个来自细菌、病毒和肿瘤的抗原序列、结构和免疫原性标签,并通过广泛的实验验证了ProVaccine在多种评估指标上优于现有方法。

链接: https://arxiv.org/abs/2410.02647
作者: Song Li,Yang Tan,Song Ke,Liang Hong,Bingxin Zhou
关键词-EN: protective immune responses, trigger protective immune, finding candidate vaccines, immune responses, central topic
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Biomolecules (q-bio.BM)
备注: 18 pages, 11 tables, 5 figures

点击查看摘要

Abstract:Immunogenicity prediction is a central topic in reverse vaccinology for finding candidate vaccines that can trigger protective immune responses. Existing approaches typically rely on highly compressed features and simple model architectures, leading to limited prediction accuracy and poor generalizability. To address these challenges, we introduce ProVaccine, a novel deep learning solution with a dual attention mechanism that integrates pre-trained latent vector representations of protein sequences and structures. We also compile the most comprehensive immunogenicity dataset to date, encompassing over 9,500 antigen sequences, structures, and immunogenicity labels from bacteria, viruses, and tumors. Extensive experiments demonstrate that ProVaccine outperforms existing methods across a wide range of evaluation metrics. Furthermore, we establish a post-hoc validation protocol to assess the practical significance of deep learning models in tackling vaccine design challenges. Our work provides an effective tool for vaccine design and sets valuable benchmarks for future research.
摘要:免疫原性预测是反向疫苗学中的核心课题,旨在寻找能够引发保护性免疫反应的候选疫苗。现有方法通常依赖于高度压缩的特征和简单的模型架构,导致预测准确性有限且泛化能力差。为解决这些挑战,我们提出了 ProVaccine,这是一种新颖的深度学习解决方案,采用双注意力机制,整合了蛋白质序列和结构的预训练潜在向量表示。我们还编译了迄今为止最全面的免疫原性数据集,涵盖了来自细菌、病毒和肿瘤的超过 9,500 个抗原序列、结构和免疫原性标签。广泛的实验表明,ProVaccine 在广泛的评估指标上优于现有方法。此外,我们建立了一种事后验证协议,以评估深度学习模型在应对疫苗设计挑战中的实际意义。我们的工作为疫苗设计提供了一个有效的工具,并为未来的研究设定了有价值的基准。

[NLP-34] Attention in Large Language Models Yields Efficient Zero-Shot Re-Rankers

【速读】: 该论文试图解决的问题是:在大语言模型(LLMs)用于信息检索系统中的零样本重排序时,是否必须依赖于自回归生成,以及这种生成方式是否是最优的。论文提出了一种名为“上下文内重排序(ICR)”的新方法,其关键在于利用搜索查询引起的注意力模式变化来进行准确且高效的重排序。ICR通过引入内容无关的查询来进行校准,以缓解LLMs的内在偏见,并且由于不涉及生成过程,ICR仅需两次前向传播(O(1))即可对N个文档进行重排序,显著提高了效率。实验结果表明,ICR在标准单跳和多跳信息检索基准测试中优于RankGPT,并减少了超过60%的延迟。

链接: https://arxiv.org/abs/2410.02642
作者: Shijie Chen,Bernal Jiménez Gutiérrez,Yu Su
关键词-EN: modern digital life, played a vital, vital role, role in modern, modern digital
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Information retrieval (IR) systems have played a vital role in modern digital life and have cemented their continued usefulness in this new era of generative AI via retrieval-augmented generation. With strong language processing capabilities and remarkable versatility, large language models (LLMs) have become popular choices for zero-shot re-ranking in IR systems. So far, LLM-based re-ranking methods rely on strong generative capabilities, which restricts their use to either specialized or powerful proprietary models. Given these restrictions, we ask: is autoregressive generation necessary and optimal for LLMs to perform re-ranking? We hypothesize that there are abundant signals relevant to re-ranking within LLMs that might not be used to their full potential via generation. To more directly leverage such signals, we propose in-context re-ranking (ICR), a novel method that leverages the change in attention pattern caused by the search query for accurate and efficient re-ranking. To mitigate the intrinsic biases in LLMs, we propose a calibration method using a content-free query. Due to the absence of generation, ICR only requires two (O(1)) forward passes to re-rank N documents, making it substantially more efficient than generative re-ranking methods that require at least O(N) forward passes. Our novel design also enables ICR to be applied to any LLM without specialized training while guaranteeing a well-formed ranking. Extensive experiments with two popular open-weight LLMs on standard single-hop and multi-hop information retrieval benchmarks show that ICR outperforms RankGPT while cutting the latency by more than 60% in practice. Through detailed analyses, we show that ICR’s performance is especially strong on tasks that require more complex re-ranking signals. Our findings call for further exploration on novel ways of utilizing open-weight LLMs beyond text generation.
摘要:信息检索 (IR) 系统在现代数字生活中扮演着至关重要的角色,并通过检索增强生成在生成式 AI 的新时代中巩固了其持续的实用性。凭借强大的语言处理能力和显著的多功能性,大语言模型 (LLMs) 已成为 IR 系统中零样本重排序的热门选择。迄今为止,基于 LLM 的重排序方法依赖于强大的生成能力,这限制了它们的使用范围,要么是专门化的模型,要么是强大的专有模型。鉴于这些限制,我们提出疑问:自回归生成对于 LLMs 执行重排序是否必要且最优?我们假设,LLMs 中存在大量与重排序相关的信号,这些信号通过生成可能未被充分利用。为了更直接地利用这些信号,我们提出了上下文内重排序 (ICR),这是一种利用搜索查询引起的注意力模式变化进行准确高效重排序的新方法。为了缓解 LLMs 中的固有偏差,我们提出了一种使用无内容查询的校准方法。由于不涉及生成,ICR 仅需两次 (O(1)) 前向传递即可对 N 个文档进行重排序,使其比至少需要 O(N) 次前向传递的生成重排序方法效率显著提高。我们新颖的设计还使得 ICR 可以应用于任何 LLM 而无需专门训练,同时保证排序的合理性。在标准单跳和多跳信息检索基准上,使用两个流行的开源权重 LLM 进行的广泛实验表明,ICR 在实际应用中比 RankGPT 表现更优,同时延迟减少了超过 60%。通过详细分析,我们展示了 ICR 在需要更复杂重排序信号的任务中表现尤为出色。我们的研究结果呼吁进一步探索利用开源权重 LLM 的新方法,超越文本生成。
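ICR 的“注意力打分 + 无内容查询校准”流程可以用玩具数值示意如下。真实实现需从 LLM 各层的注意力图聚合每个文档获得的注意力质量,此处仅保留核心逻辑:

```python
def icr_rerank(attn_query, attn_calib):
    # 用搜索查询分配给各文档的注意力质量打分,
    # 再减去无内容校准查询(如 "N/A")给出的质量,以抵消位置与内容偏置
    scores = {doc: attn_query[doc] - attn_calib[doc] for doc in attn_query}
    return sorted(scores, key=scores.get, reverse=True)

# 玩具数值代替真实注意力图:d1 受位置偏置抬高,校准后被降权
attn_q = {"d1": 0.50, "d2": 0.30, "d3": 0.20}
attn_c = {"d1": 0.40, "d2": 0.10, "d3": 0.05}
ranking = icr_rerank(attn_q, attn_c)
```

两次前向传递(真实查询一次、校准查询一次)即可得到全部 N 个文档的分数,这正是摘要中 O(1) 次前向传递的来源。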

[NLP-35] Large Language Model for Multi-Domain Translation: Benchmarking and Domain CoT Fine-tuning

【速读】: 该论文试图解决多领域机器翻译中由于训练数据有限和不平衡导致的领域过拟合和灾难性遗忘问题。解决方案的关键在于提出了一种名为“领域思维链(Domain Chain of Thought, CoT)”的微调技术,该技术利用大语言模型(LLMs)的内在多领域智能,引导模型从源文本中感知领域信息,从而在翻译过程中提供有用的提示,显著提升翻译准确性和领域鲁棒性。

链接: https://arxiv.org/abs/2410.02631
作者: Tianxiang Hu,Pei Zhang,Baosong Yang,Jun Xie,Derek F. Wong,Rui Wang
关键词-EN: Achieving consistent high-quality, consistent high-quality machine, imbalanced parallel training, parallel training data, Achieving consistent
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Achieving consistent high-quality machine translation (MT) across diverse domains remains a significant challenge, primarily due to the limited and imbalanced parallel training data available in various domains. While large language models (LLMs) have demonstrated impressive general understanding and generation abilities, their potential in multi-domain MT is under-explored. We establish a comprehensive benchmark for multi-domain translation, featuring 25 German⇔English and 22 Chinese⇔English test sets, respectively covering 15 domains. Our evaluation of prominent LLMs reveals a discernible performance gap against traditional MT systems, highlighting domain overfitting and catastrophic forgetting issues after fine-tuning on domain-limited corpora. To mitigate this, we propose a domain Chain of Thought (CoT) fine-tuning technique that utilizes the intrinsic multi-domain intelligence of LLMs to improve translation performance. This method inspires the LLM to perceive domain information from the source text, which then serves as a helpful hint to guide the translation process. Despite being trained on a small dataset of four domains, our CoT fine-tuning approach achieves notable enhancements in translation accuracy and domain robustness over traditional fine-tuning, as evidenced by an average 1.53 BLEU score increase across more than 20 distinct German→English out-of-domain tests.
摘要:在不同领域实现一致高质量的机器翻译 (Machine Translation, MT) 仍然是一个重大挑战,主要原因是各个领域可用的平行训练数据有限且不平衡。尽管大语言模型 (Large Language Models, LLMs) 展示了令人印象深刻的通用理解和生成能力,但它们在多领域 MT 中的潜力尚未得到充分探索。我们建立了一个全面的多领域翻译基准,涵盖 25 个德语⇔英语和 22 个中文⇔英语测试集,分别覆盖 15 个领域。我们对著名 LLMs 的评估显示,与传统 MT 系统相比,存在明显的性能差距,这突显了在领域有限语料库上微调后出现的领域过拟合和灾难性遗忘问题。为缓解这一问题,我们提出了一种领域思维链 (Chain of Thought, CoT) 微调技术,利用 LLMs 的内在多领域智能来提升翻译性能。该方法启发 LLM 从源文本中感知领域信息,这些信息随后作为有用的提示来指导翻译过程。尽管仅在四个领域的少量数据集上进行训练,我们的 CoT 微调方法在翻译准确性和领域鲁棒性方面相比传统微调实现了显著提升,如在超过 20 个不同的德语→英语领域外测试中,平均 BLEU 分数提高了 1.53。
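领域思维链微调所依赖的“先识别领域、再翻译”两步提示,大致形态如下。具体措辞为本文示意性假设,并非论文发布的模板:

```python
def domain_cot_prompt(source_text, src_lang="German", tgt_lang="English"):
    # 两步式领域 CoT 提示:先让模型判断源文本领域,
    # 再以识别出的领域为提示完成翻译
    return (
        f"Step 1: Identify the domain (e.g. medical, legal, IT, news) "
        f"of the following {src_lang} text.\n"
        f"Step 2: Using the identified domain as context, translate the "
        f"text into {tgt_lang}.\n\n"
        f"Text: {source_text}"
    )

prompt = domain_cot_prompt("Der Patient erhielt 5 mg des Wirkstoffs.")
```

微调数据即由这类“领域标注 + 译文”配对构成,使模型学会先输出领域判断再给出译文。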

[NLP-36] NL-Eye: Abductive NLI for Images

【速读】: 该论文试图解决视觉语言模型(VLM)在视觉归纳推理能力方面的不足问题。解决方案的关键在于引入NL-Eye基准,这是一个专门设计用于评估VLM视觉归纳推理技能的基准。NL-Eye通过将归纳自然语言推理(NLI)任务适应到视觉领域,要求模型基于前提图像评估假设图像的合理性并解释其决策。该基准包含350个精心挑选的三元组示例(共1050张图像),涵盖物理、功能、逻辑、情感、文化和社交等多种推理类别。实验结果表明,当前的VLM在NL-Eye上的表现显著不佳,而人类在这方面表现出色,这凸显了现代VLM在归纳推理能力上的缺陷。NL-Eye的引入为开发能够进行稳健多模态推理的VLM迈出了重要一步,这对于实际应用如事故预防机器人和生成视频验证至关重要。

链接: https://arxiv.org/abs/2410.02613
作者: Mor Ventura,Michael Toker,Nitay Calderon,Zorik Gekhman,Yonatan Bitton,Roi Reichart
关键词-EN: Natural Language Inference, wet floor, detects a wet, abductive Natural Language, Visual Language Model
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Will a Visual Language Model (VLM)-based bot warn us about slipping if it detects a wet floor? Recent VLMs have demonstrated impressive capabilities, yet their ability to infer outcomes and causes remains underexplored. To address this, we introduce NL-Eye, a benchmark designed to assess VLMs’ visual abductive reasoning skills. NL-Eye adapts the abductive Natural Language Inference (NLI) task to the visual domain, requiring models to evaluate the plausibility of hypothesis images based on a premise image and explain their decisions. NL-Eye consists of 350 carefully curated triplet examples (1,050 images) spanning diverse reasoning categories: physical, functional, logical, emotional, cultural, and social. The data curation process involved two steps - writing textual descriptions and generating images using text-to-image models, both requiring substantial human involvement to ensure high-quality and challenging scenes. Our experiments show that VLMs struggle significantly on NL-Eye, often performing at random baseline levels, while humans excel in both plausibility prediction and explanation quality. This demonstrates a deficiency in the abductive reasoning capabilities of modern VLMs. NL-Eye represents a crucial step toward developing VLMs capable of robust multimodal reasoning for real-world applications, including accident-prevention bots and generated video verification.
摘要:基于视觉语言模型 (VLM) 的机器人能否在检测到湿滑地面时提醒我们注意滑倒风险?近期的 VLM 展示了令人印象深刻的能力,然而它们在推断结果和原因方面的能力仍未得到充分探索。为此,我们引入了 NL-Eye,这是一个用于评估 VLM 视觉溯因推理技能的基准。NL-Eye 将溯因自然语言推理 (NLI) 任务适应到视觉领域,要求模型根据前提图像评估假设图像的合理性,并解释其决策过程。NL-Eye 包含 350 个精心挑选的三元组示例(共 1,050 张图像),涵盖了多种推理类别:物理、功能、逻辑、情感、文化和社交。数据构建过程包括两个步骤——编写文本描述和使用文本到图像模型生成图像,这两个步骤都需要大量的人工参与以确保高质量和具有挑战性的场景。我们的实验表明,VLM 在 NL-Eye 上的表现显著不佳,常常仅达到随机基线水平,而人类在合理性预测和解释质量方面表现出色。这表明现代 VLM 在溯因推理能力上存在不足。NL-Eye 是开发能够进行稳健多模态推理的 VLM 的关键一步,适用于包括事故预防机器人和生成视频验证在内的实际应用。

[NLP-37] IndicSentEval: How Effectively do Multilingual Transformer Models encode Linguistic Properties for Indic Languages?

【速读】: 该论文试图解决的问题是评估多语言Transformer模型在处理印度语言(Indic languages)时的语言属性编码能力和鲁棒性。解决方案的关键在于引入了一个名为IndicSentEval的新型多语言基准数据集,并使用9种多语言Transformer模型(包括7种通用模型和2种印度语言特定模型)对13种不同的扰动进行测试,以分析这些模型在6种印度语言中的表现。研究结果表明,尽管所有多语言模型在英语上表现出一致的编码性能,但在印度语言上表现不一,印度语言特定模型在捕捉印度语言属性方面优于通用模型,而通用模型在面对扰动时表现出更好的鲁棒性。

链接: https://arxiv.org/abs/2410.02611
作者: Akhilesh Aravapalli,Mounika Marreddy,Subba Reddy Oota,Radhika Mamidi,Manish Gupta
关键词-EN: natural language processing, Indic languages, models, revolutionized the field, field of natural
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 23 pages, 11 figures

点击查看摘要

Abstract:Transformer-based models have revolutionized the field of natural language processing. To understand why they perform so well and to assess their reliability, several studies have focused on questions such as: Which linguistic properties are encoded by these models, and to what extent? How robust are these models in encoding linguistic properties when faced with perturbations in the input text? However, these studies have mainly focused on BERT and the English language. In this paper, we investigate similar questions regarding encoding capability and robustness for 8 linguistic properties across 13 different perturbations in 6 Indic languages, using 9 multilingual Transformer models (7 universal and 2 Indic-specific). To conduct this study, we introduce a novel multilingual benchmark dataset, IndicSentEval, containing approximately 47K sentences. Surprisingly, our probing analysis of surface, syntactic, and semantic properties reveals that while almost all multilingual models demonstrate consistent encoding performance for English, they show mixed results for Indic languages. As expected, Indic-specific multilingual models capture linguistic properties in Indic languages better than universal models. Intriguingly, universal models broadly exhibit better robustness compared to Indic-specific models, particularly under perturbations such as dropping both nouns and verbs, dropping only verbs, or keeping only nouns. Overall, this study provides valuable insights into probing and perturbation-specific strengths and weaknesses of popular multilingual Transformer-based models for different Indic languages. We make our code and dataset publicly available [this https URL].
摘要:基于 Transformer 的模型已经彻底改变了自然语言处理领域。为了理解这些模型为何表现如此出色并评估其可靠性,多项研究聚焦于以下问题:这些模型编码了哪些语言属性,以及编码的程度如何?当输入文本受到扰动时,这些模型在编码语言属性方面的鲁棒性如何?然而,这些研究主要集中在 BERT 和英语上。在本文中,我们针对 6 种印度语言,使用 9 种多语言 Transformer 模型(7 种通用模型和 2 种印度语言特定模型),研究了 8 种语言属性在 13 种不同扰动下的编码能力和鲁棒性。为此,我们引入了一个新的多语言基准数据集 IndicSentEval,包含约 47K 个句子。令人惊讶的是,我们对表面属性、句法属性和语义属性的探测分析显示,尽管几乎所有多语言模型在英语上表现出一致的编码性能,但在印度语言上却显示出混合的结果。正如预期,印度语言特定的多语言模型在捕捉印度语言的语言属性方面优于通用模型。有趣的是,通用模型在鲁棒性方面普遍优于印度语言特定模型,特别是在同时丢弃名词和动词、仅丢弃动词或仅保留名词等扰动下。总体而言,本研究对流行的多语言 Transformer 模型在不同印度语言上探测与扰动层面的优势和劣势提供了宝贵的见解。我们公开了代码和数据集 [this https URL]。
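上文摘要提到的句法扰动(如仅丢弃动词、仅保留名词)可以用一个极简的词性过滤函数来直观说明。以下为示意性玩具代码(非论文官方实现;token 序列、Universal POS 标签与函数名均为本文虚构的演示输入):

```python
def perturb(tokens, pos_tags, mode):
    """按词性对句子施加扰动;mode 对应摘要中提到的几种扰动类型。"""
    keep = {
        "drop_verbs": lambda p: p != "VERB",                      # 仅丢弃动词
        "drop_nouns_verbs": lambda p: p not in ("NOUN", "VERB"),  # 同时丢弃名词和动词
        "keep_nouns": lambda p: p == "NOUN",                      # 仅保留名词
    }[mode]
    return [t for t, p in zip(tokens, pos_tags) if keep(p)]

tokens = ["The", "cat", "chased", "the", "mouse"]
pos = ["DET", "NOUN", "VERB", "DET", "NOUN"]
print(perturb(tokens, pos, "drop_verbs"))  # ['The', 'cat', 'the', 'mouse']
print(perturb(tokens, pos, "keep_nouns"))  # ['cat', 'mouse']
```

将扰动前后的句子分别送入被探测模型,再比较探针分类器的性能变化,即可得到摘要中所说的鲁棒性度量。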

[NLP-38] Ethio-Fake: Cutting-Edge Approaches to Combat Fake News in Under-Resourced Languages Using Explainable AI

【速读】: 该论文试图解决在资源匮乏语言环境下,如何提高假新闻检测准确性的问题。解决方案的关键在于综合利用新闻内容特征和社会上下文特征,并在传统机器学习、神经网络、集成学习和迁移学习等多种方法上进行实验。结果表明,集成学习方法表现最佳,F1分数达到0.99;与单语模型相比,针对目标语言微调的模型表现最优,F1分数为0.94。最后,论文通过可解释AI技术分析了对模型性能有重要贡献的关键特征。

链接: https://arxiv.org/abs/2410.02609
作者: Mesay Gemeda Yigezu,Melkamu Abay Mersha,Girma Yohannis Bade,Jugal Kalita,Olga Kolesnikova,Alexander Gelbukh
关键词-EN: social media platforms, media platforms, significant threat, information dissemination, social media
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The proliferation of fake news has emerged as a significant threat to the integrity of information dissemination, particularly on social media platforms. Misinformation can spread quickly due to the ease of creating and disseminating content, affecting public opinion and sociopolitical events. Identifying false information is therefore essential to reducing its negative consequences and maintaining the reliability of online news sources. Traditional approaches to fake news detection often rely solely on content-based features, overlooking the crucial role of social context in shaping the perception and propagation of news articles. In this paper, we propose a comprehensive approach that integrates social context-based features with news content features to enhance the accuracy of fake news detection in under-resourced languages. We perform several experiments utilizing a variety of methodologies, including traditional machine learning, neural networks, ensemble learning, and transfer learning. Assessment of the outcomes of the experiments shows that the ensemble learning approach has the highest accuracy, achieving a 0.99 F1 score. Additionally, when compared with monolingual models, the fine-tuned model with the target language outperformed others, achieving a 0.94 F1 score. We analyze the functioning of the models, considering the important features that contribute to model performance, using explainable AI techniques.
摘要:虚假新闻的泛滥已成为信息传播完整性的重大威胁,尤其是在社交媒体平台上。由于内容创建和传播的便捷性,错误信息可以迅速传播,影响公众舆论和社会政治事件。因此,识别虚假信息对于减少其负面影响并维护在线新闻来源的可靠性至关重要。传统的虚假新闻检测方法通常仅依赖于基于内容的特征,忽视了社会背景在塑造新闻文章感知和传播中的关键作用。本文提出了一种综合方法,将基于社会背景的特征与新闻内容特征相结合,以提高资源匮乏语言中虚假新闻检测的准确性。我们进行了多项实验,采用了多种方法,包括传统机器学习、神经网络、集成学习和迁移学习。实验结果的评估显示,集成学习方法的准确性最高,达到了0.99的F1分数。此外,与单语模型相比,经过目标语言微调的模型表现更优,达到了0.94的F1分数。我们使用可解释AI技术分析了模型的功能,考虑了对模型性能有重要贡献的关键特征。
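摘要中表现最佳的集成学习,其最直观的形式是对多个基模型的预测做多数投票。以下为示意性玩具代码(基模型种类及预测结果均为假设,并非论文的实际集成配置):

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: 每个基模型对同一批样本的标签列表;逐样本取多数票。"""
    return [Counter(sample).most_common(1)[0][0] for sample in zip(*predictions)]

# 三个假设的基模型(例如传统机器学习、神经网络、微调 Transformer)的预测
m1 = ["fake", "real", "fake", "real"]
m2 = ["fake", "fake", "fake", "real"]
m3 = ["real", "fake", "fake", "real"]
print(majority_vote([m1, m2, m3]))  # ['fake', 'fake', 'fake', 'real']
```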

[NLP-39] Agents' Room: Narrative Generation through Multi-step Collaboration ICLR2025

【速读】: 该论文试图解决当前大型语言模型(LLMs)在创作引人入胜的小说时过度依赖复杂提示的问题。解决方案的关键在于提出了一种名为“Agents’ Room”的生成框架,该框架受叙事理论启发,将叙事写作分解为多个由专业代理处理的子任务。通过协作和专业化,将复杂的写作任务分解为可管理的组件,从而生成更受专家评估者青睐的故事。

链接: https://arxiv.org/abs/2410.02603
作者: Fantine Huot,Reinald Kim Amplayo,Jennimaria Palomaki,Alice Shoshana Jakobovits,Elizabeth Clark,Mirella Lapata
关键词-EN: developing interesting characters, multifaceted process combining, process combining elements, Writing compelling fiction, crafting a plot
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: Under review as a conference paper at ICLR 2025

点击查看摘要

Abstract:Writing compelling fiction is a multifaceted process combining elements such as crafting a plot, developing interesting characters, and using evocative language. While large language models (LLMs) show promise for story writing, they currently rely heavily on intricate prompting, which limits their use. We propose Agents’ Room, a generation framework inspired by narrative theory, that decomposes narrative writing into subtasks tackled by specialized agents. To illustrate our method, we introduce Tell Me A Story, a high-quality dataset of complex writing prompts and human-written stories, and a novel evaluation framework designed specifically for assessing long narratives. We show that Agents’ Room generates stories that are preferred by expert evaluators over those produced by baseline systems by leveraging collaboration and specialization to decompose the complex story writing task into tractable components. We provide extensive analysis with automated and human-based metrics of the generated output.
摘要:撰写引人入胜的小说是一个多方面的过程,结合了构建情节、塑造有趣角色和运用富有感染力的语言等元素。尽管大语言模型 (LLM) 在故事创作方面展现出潜力,但它们目前严重依赖复杂的提示,这限制了它们的应用。我们提出了 Agents’ Room,这是一个受叙事理论启发的生成框架,将叙事写作分解为由专业智能体处理的子任务。为了说明我们的方法,我们引入了 Tell Me A Story,这是一个高质量的数据集,包含复杂的写作提示和人类撰写的故事,以及一个专门设计用于评估长篇叙事的全新评估框架。我们展示了 Agents’ Room 生成的故事在专家评估者中更受欢迎,超过了基线系统生成的故事,通过协作和专业化将复杂的故事写作任务分解为可处理的组件。我们提供了对生成输出的广泛分析,包括自动化和基于人类的评估指标。

[NLP-40] Towards Implicit Bias Detection and Mitigation in Multi-Agent LLM Interactions EMNLP

【速读】: 该论文试图解决大型语言模型(LLMs)在多智能体交互中存在的隐性性别偏见问题。解决方案的关键在于提出了两种策略来缓解这些偏见:一是通过上下文示例进行自我反思(ICE),二是通过监督微调。研究表明,这两种方法均能有效减少隐性偏见,而结合微调和自我反思的综合方法效果最佳。

链接: https://arxiv.org/abs/2410.02584
作者: Angana Borah,Rada Mihalcea
关键词-EN: Large Language Models, Language Models, Large Language, diverse social tasks, execute diverse social
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Accepted to EMNLP Findings 2024

点击查看摘要

Abstract:As Large Language Models (LLMs) continue to evolve, they are increasingly being employed in numerous studies to simulate societies and execute diverse social tasks. However, LLMs are susceptible to societal biases due to their exposure to human-generated data. Given that LLMs are being used to gain insights into various societal aspects, it is essential to mitigate these biases. To that end, our study investigates the presence of implicit gender biases in multi-agent LLM interactions and proposes two strategies to mitigate these biases. We begin by creating a dataset of scenarios where implicit gender biases might arise, and subsequently develop a metric to assess the presence of biases. Our empirical analysis reveals that LLMs generate outputs characterized by strong implicit bias associations (≥ 50% of the time). Furthermore, these biases tend to escalate following multi-agent interactions. To mitigate them, we propose two strategies: self-reflection with in-context examples (ICE); and supervised fine-tuning. Our research demonstrates that both methods effectively mitigate implicit biases, with the ensemble of fine-tuning and self-reflection proving to be the most successful.
摘要:随着大语言模型 (LLM) 的不断发展,它们越来越多地被用于模拟社会和执行多样化的社会任务。然而,由于接触到人类生成的数据,LLM 容易受到社会偏见的影响。鉴于 LLM 被用于深入了解各种社会方面,减轻这些偏见至关重要。为此,我们的研究调查了多智能体 LLM 交互中隐含的性别偏见,并提出了两种减轻这些偏见的策略。我们首先创建了一个可能出现隐含性别偏见场景的数据集,随后开发了一种评估偏见存在的指标。我们的实证分析表明,LLM 生成的输出具有强烈的隐含偏见关联(≥ 50% 的时间)。此外,这些偏见在多智能体交互后往往会加剧。为了减轻这些偏见,我们提出了两种策略:基于上下文示例 (ICE) 的自我反思;以及监督式微调。我们的研究表明,这两种方法都能有效减轻隐含偏见,其中微调和自我反思的组合被证明是最成功的。

[NLP-41] Convolutional Variational Autoencoders for Spectrogram Compression in Automatic Speech Recognition

【速读】: 该论文试图解决自动语音识别(ASR)任务中,由于音频特征(如频谱图)的高维度复杂性导致难以应用的问题。解决方案的关键在于采用卷积变分自编码器(Convolutional Variational Autoencoders, VAE)生成压缩的频谱图表示。具体来说,论文训练了一个卷积VAE模型,该模型能够从13维嵌入中重建25毫秒的音频片段频谱图,并进一步用于生成40维(300毫秒)嵌入的特征,以应用于GoogleSpeechCommands数据集上的语音命令识别任务。通过这种方法,论文构建了一个基于生成特征的ASR系统,并与使用MFCC特征的模型进行了性能比较。

链接: https://arxiv.org/abs/2410.02560
作者: Olga Yakovenko,Ivan Bondarenko
关键词-EN: Automatic Speech Recognition, Mel-frequency Cepstral Coefficients, Speech Recognition, Cepstral Coefficients, Automatic Speech
类目: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Theory and Practice of Natural Computing 9th International Conference, TPNC 2020, Taoyuan, Taiwan, 2020, Proceedings 9

点击查看摘要

Abstract:For many Automatic Speech Recognition (ASR) tasks audio features as spectrograms show better results than Mel-frequency Cepstral Coefficients (MFCC), but in practice they are hard to use due to a complex dimensionality of a feature space. The following paper presents an alternative approach towards generating compressed spectrogram representation, based on Convolutional Variational Autoencoders (VAE). A Convolutional VAE model was trained on a subsample of the LibriSpeech dataset to reconstruct short fragments of audio spectrograms (25 ms) from a 13-dimensional embedding. The trained model for a 40-dimensional (300 ms) embedding was used to generate features for corpus of spoken commands on the GoogleSpeechCommands dataset. Using the generated features an ASR system was built and compared to the model with MFCC features.
摘要:在许多自动语音识别 (ASR) 任务中,频谱图 (spectrograms) 相较于梅尔频率倒谱系数 (MFCC) 显示出更好的结果,但由于特征空间的复杂维度,在实际应用中难以使用。本文提出了一种基于卷积变分自编码器 (Convolutional Variational Autoencoders, VAE) 的压缩频谱图表示生成方法。我们在 LibriSpeech 数据集的一个子样本上训练了一个卷积 VAE 模型,以从 13 维嵌入中重建短片段的音频频谱图 (25 ms)。训练好的 40 维 (300 ms) 嵌入模型被用于生成 GoogleSpeechCommands 数据集中口语命令语料库的特征。使用生成的特征构建了一个 ASR 系统,并与使用 MFCC 特征的模型进行了比较。
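卷积 VAE 将频谱图压缩为低维嵌入,其采样步骤依赖重参数化技巧 z = μ + σ·ε。以下为该步骤的纯 Python 示意(省略卷积编码器/解码器;13 维仅沿用摘要中的设定,μ 与 log σ² 的取值均为假设):

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps, 其中 eps ~ N(0, 1), sigma = exp(0.5 * log_var)。"""
    return [m + math.exp(0.5 * lv) * rng.gauss(0, 1)
            for m, lv in zip(mu, log_var)]

rng = random.Random(0)
mu = [0.0] * 13        # 假设的编码器均值输出
log_var = [-2.0] * 13  # 假设的编码器对数方差输出
z = reparameterize(mu, log_var, rng)
print(len(z))  # 13 —— 对应摘要中 25 ms 片段的 13 维压缩嵌入
```

训练时再以重建损失加 KL 散度作为目标,得到的低维表示即可像摘要中那样替代 MFCC 作为 ASR 特征。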

[NLP-42] Improving Unsupervised Constituency Parsing via Maximizing Semantic Information

【速读】: 该论文试图解决传统无监督成分句法分析器在最大化句子对数似然(LL)时,未能充分考虑成分结构与句子语义之间紧密关系的问题。解决方案的关键在于引入新的训练目标:最大化成分结构与句子语义之间的信息量(SemInfo)。具体方法包括使用子串袋模型表示语义,并应用概率加权信息度量来估计SemInfo,同时开发基于树条件随机场(TreeCRF)的模型,将SemInfo最大化目标应用于概率上下文无关文法(PCFG)的归纳,从而显著提升解析准确性。

链接: https://arxiv.org/abs/2410.02558
作者: Junjie Chen,Xiangheng He,Yusuke Miyao,Danushka Bollegala
关键词-EN: tree-shaped syntactic constituent, syntactic constituent structure, parsers organize phrases, constituent structure, constituency parsers organize
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Unsupervised constituency parsers organize phrases within a sentence into a tree-shaped syntactic constituent structure that reflects the organization of sentence semantics. However, the traditional objective of maximizing sentence log-likelihood (LL) does not explicitly account for the close relationship between the constituent structure and the semantics, resulting in a weak correlation between LL values and parsing accuracy. In this paper, we introduce a novel objective for training unsupervised parsers: maximizing the information between constituent structures and sentence semantics (SemInfo). We introduce a bag-of-substrings model to represent the semantics and apply the probability-weighted information metric to estimate the SemInfo. Additionally, we develop a Tree Conditional Random Field (TreeCRF)-based model to apply the SemInfo maximization objective to Probabilistic Context-Free Grammar (PCFG) induction, the state-of-the-art method for unsupervised constituency parsing. Experiments demonstrate that SemInfo correlates more strongly with parsing accuracy than LL. Our algorithm significantly enhances parsing accuracy by an average of 7.85 points across five PCFG variants and in four languages, achieving new state-of-the-art results in three of the four languages.
摘要:无监督成分解析器将句子中的短语组织成树状的句法成分结构,这种结构反映了句子语义的组织方式。然而,传统的最大化句子对数似然 (LL) 的目标并未明确考虑成分结构与语义之间的紧密关系,导致 LL 值与解析准确性之间的相关性较弱。本文提出了一种新的无监督解析器训练目标:最大化成分结构与句子语义之间的信息量 (SemInfo)。我们引入了一种子串袋模型来表示语义,并应用概率加权信息度量来估计 SemInfo。此外,我们开发了一种基于树条件随机场 (TreeCRF) 的模型,将 SemInfo 最大化目标应用于概率上下文无关文法 (PCFG) 归纳,这是目前最先进的无监督成分解析方法。实验表明,SemInfo 与解析准确性之间的相关性比 LL 更强。我们的算法在五种 PCFG 变体和四种语言中平均提升了 7.85 个点的解析准确性,并在四种语言中的三种中达到了新的最先进水平。
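摘要中"子串袋 + 概率加权信息度量"的计算形式可用如下玩具代码直观理解(子串概率表纯属虚构;真实方法中这些概率由模型估计,此处仅演示度量的计算结构):

```python
import math

def bag_of_substrings(tokens):
    """枚举 token 序列的所有连续子串,构成子串袋表示。"""
    n = len(tokens)
    return [tuple(tokens[i:j]) for i in range(n) for j in range(i + 1, n + 1)]

def seminfo_score(span_tokens, substring_prob):
    """玩具版 SemInfo:对成分内每个子串,按其概率加权的信息量 -log p 求和。"""
    score = 0.0
    for s in bag_of_substrings(span_tokens):
        p = substring_prob.get(s, 1e-6)  # 未见子串回退到极小概率
        score += p * (-math.log(p))
    return score

probs = {("the", "cat"): 0.2, ("cat",): 0.5, ("the",): 0.6}  # 虚构的概率表
print(round(seminfo_score(["the", "cat"], probs), 3))  # 0.975
```

在解析器训练中,对候选成分结构比较这类得分并取最大者,即为摘要所述 SemInfo 最大化目标的直观形态。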

[NLP-43] ColaCare: Enhancing Electronic Health Record Modeling through Large Language Model-Driven Multi-Agent Collaboration

【速读】: 该论文试图解决电子健康记录(EHR)模型在处理结构化数据与文本推理之间的鸿沟问题。解决方案的关键在于引入ColaCare框架,通过多智能体协作和大型语言模型(LLMs)的驱动,将领域专家模型与LLMs无缝集成。具体来说,ColaCare采用DoctorAgent和MetaAgent两种智能体,分别负责处理和生成EHR数据的预测,以及在协作咨询框架内生成推理参考和决策报告。此外,通过整合《默克诊断与治疗手册》(MSD)的医学指南,利用检索增强生成(RAG)模块提供权威证据支持。实验结果表明,ColaCare在死亡率预测任务中表现优异,有望革新临床决策支持系统,推动个性化精准医学的发展。

链接: https://arxiv.org/abs/2410.02551
作者: Zixiang Wang,Yinghao Zhu,Huiya Zhao,Xiaochen Zheng,Tianlong Wang,Wen Tang,Yasha Wang,Chengwei Pan,Ewen M. Harrison,Junyi Gao,Liantao Ma
关键词-EN: Electronic Health Record, enhances Electronic Health, Large Language Models, Health Record, Electronic Health
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce ColaCare, a framework that enhances Electronic Health Record (EHR) modeling through multi-agent collaboration driven by Large Language Models (LLMs). Our approach seamlessly integrates domain-specific expert models with LLMs to bridge the gap between structured EHR data and text-based reasoning. Inspired by clinical consultations, ColaCare employs two types of agents: DoctorAgent and MetaAgent, which collaboratively analyze patient data. Expert models process and generate predictions from numerical EHR data, while LLM agents produce reasoning references and decision-making reports within the collaborative consultation framework. We additionally incorporate the Merck Manual of Diagnosis and Therapy (MSD) medical guideline within a retrieval-augmented generation (RAG) module for authoritative evidence support. Extensive experiments conducted on four distinct EHR datasets demonstrate ColaCare’s superior performance in mortality prediction tasks, underscoring its potential to revolutionize clinical decision support systems and advance personalized precision medicine. The code, complete prompt templates, more case studies, etc. are publicly available at the anonymous link: this https URL.
摘要:我们介绍了 ColaCare,这是一个通过大语言模型 (LLM) 驱动的多智能体协作来增强电子健康记录 (EHR) 建模的框架。我们的方法无缝整合了领域特定的专家模型与 LLM,以弥合结构化 EHR 数据与基于文本的推理之间的差距。受临床咨询的启发,ColaCare 采用了两种类型的智能体:DoctorAgent 和 MetaAgent,它们协作分析患者数据。专家模型处理并从数值 EHR 数据中生成预测,而 LLM 智能体则在协作咨询框架内生成推理参考和决策报告。此外,我们还在检索增强生成 (RAG) 模块中融入了默克诊断与治疗手册 (MSD) 医学指南,以提供权威的证据支持。在四个不同的 EHR 数据集上进行的广泛实验表明,ColaCare 在死亡率预测任务中表现卓越,突显了其革新临床决策支持系统并推动个性化精准医学发展的潜力。代码、完整的提示模板、更多案例研究等已在匿名链接上公开:this https URL。

[NLP-44] Algorithms For Automatic Accentuation And Transcription Of Russian Texts In Speech Recognition Systems

【速读】: 该论文旨在解决俄语文本自动重音标注和音素转录的问题,以支持自动语音识别(ASR)等语音相关任务。解决方案的关键在于采用基于规则的方法,结合语法词典和维基词典语料库进行重音标注,并通过循环神经网络(RNN)利用句子的形态信息来区分同形异义词。音素转录则基于Lobanov和Tsirulnik的计算机合成和语音克隆专著中的规则。该系统已实现为一个开源模块,可用于ASR或语音转文本(STT)任务的研究,并在CMU Sphinx中使用Voxforge数据库的自动标注文本作为训练数据,最终在交叉验证中实现了71.2%的平均词准确率。

链接: https://arxiv.org/abs/2410.02538
作者: Olga Iakovenko,Ivan Bondarenko,Mariya Borovikova,Daniil Vodolazsky
关键词-EN: Automatic Speech Recognition, Speech Recognition, automatic accentuation, Automatic Speech, overview of rule-based
类目: Computation and Language (cs.CL)
备注: Speech and Computer 20th International Conference, SPECOM 2018, Leipzig, Germany, Proceedings 20

点击查看摘要

Abstract:This paper presents an overview of rule-based system for automatic accentuation and phonemic transcription of Russian texts for speech connected tasks, such as Automatic Speech Recognition (ASR). Two parts of the developed system, accentuation and transcription, use different approaches to achieve correct phonemic representations of input phrases. Accentuation is based on “Grammatical dictionary of the Russian language” of A.A. Zaliznyak and wiktionary corpus. To distinguish homographs, the accentuation system also utilises morphological information of the sentences based on Recurrent Neural Networks (RNN). Transcription algorithms apply the rules presented in the monograph of B.M. Lobanov and L.I. Tsirulnik “Computer Synthesis and Voice Cloning”. The rules described in the present paper are implemented in an open-source module, which can be of use to any scientific study connected to ASR or Speech To Text (STT) tasks. Automatically marked up text annotations of the Russian Voxforge database were used as training data for an acoustic model in CMU Sphinx. The resulting acoustic model was evaluated on cross-validation, mean Word Accuracy being 71.2%. The developed toolkit is written in the Python language and is accessible on GitHub for any researcher interested.
摘要:本文概述了一种基于规则的系统,用于自动为俄语文本添加重音和进行音素转录,以支持语音相关任务,如自动语音识别 (ASR)。该系统开发的两个部分,即重音标注和转录,采用了不同的方法来实现输入短语的正确音素表示。重音标注基于 A.A. Zaliznyak 的《俄语语法词典》和维基词典语料库。为了区分同形异义词,重音标注系统还利用了基于循环神经网络 (RNN) 的句子形态信息。转录算法应用了 B.M. Lobanov 和 L.I. Tsirulnik 的专著《计算机合成与语音克隆》中提出的规则。本文描述的规则已实现为一个开源模块,可用于任何与 ASR 或语音转文本 (STT) 任务相关的科学研究。自动标注的俄语 Voxforge 数据库文本注释被用作 CMU Sphinx 中声学模型的训练数据。最终的声学模型在交叉验证中进行了评估,平均词准确率为 71.2%。该开发工具包使用 Python 语言编写,并可在 GitHub 上供任何感兴趣的研究人员访问。
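基于词典的重音标注与借助词性区分同形异义词的流程,可用如下极简示意(词条、词性标签以及用"+"置于重读元音前的记号,均为本文虚构的演示约定,并非该开源模块的真实接口):

```python
# (词形, 词性) → 带重音标记的词形;"+" 置于重读元音之前
STRESS_DICT = {
    ("замок", "NOUN_castle"): "за+мок",  # 城堡
    ("замок", "NOUN_lock"): "зам+ок",    # 锁
    ("вода", None): "вод+а",
}

def accentuate(word, pos=None):
    """先按 (词, 词性) 查词典,再回退到 (词, None);仍未命中则原样返回。"""
    return STRESS_DICT.get((word, pos), STRESS_DICT.get((word, None), word))

print(accentuate("замок", "NOUN_castle"))  # за+мок
print(accentuate("вода"))  # вод+а
```

摘要中的 RNN 在这一流程里承担的正是为同形异义词预测形态信息(进而选中正确词条)的角色。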

[NLP-45] Contextual Document Embeddings

【速读】: 该论文试图解决现有密集文档嵌入方法在特定检索场景中缺乏上下文信息的问题。解决方案的关键在于提出两种互补的方法:一是通过对比学习目标,将文档邻居信息显式地纳入批内上下文损失中;二是设计一种新的上下文架构,将邻居文档信息显式编码到文档表示中。这两种方法显著提升了在不同设置下的检索性能,特别是在跨域场景中,且无需复杂的负样本挖掘、分数蒸馏、数据集特定指令、GPU内样本共享或超大批次大小。

链接: https://arxiv.org/abs/2410.02525
作者: John X. Morris,Alexander M. Rush
关键词-EN: Dense document embeddings, Dense document, central to neural, document, Dense
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Dense document embeddings are central to neural retrieval. The dominant paradigm is to train and construct embeddings by running encoders directly on individual documents. In this work, we argue that these embeddings, while effective, are implicitly out-of-context for targeted use cases of retrieval, and that a contextualized document embedding should take into account both the document and neighboring documents in context - analogous to contextualized word embeddings. We propose two complementary methods for contextualized document embeddings: first, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss; second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation. Results show that both methods achieve better performance than biencoders in several settings, with differences especially pronounced out-of-domain. We achieve state-of-the-art results on the MTEB benchmark with no hard negative mining, score distillation, dataset-specific instructions, intra-GPU example-sharing, or extremely large batch sizes. Our method can be applied to improve performance on any contrastive learning dataset and any biencoder.
摘要:密集文档嵌入在神经检索中占据核心地位。主流范式是通过直接在单个文档上运行编码器来训练和构建嵌入。在本研究中,我们认为这些嵌入虽然在效果上表现良好,但对于检索的目标应用场景而言,它们隐含地处于上下文之外,而上下文化的文档嵌入应同时考虑文档及其上下文中的邻近文档——类似于上下文化的词嵌入。我们提出了两种互补的上下文化文档嵌入方法:首先,一种替代的对比学习目标,明确地将文档邻域纳入批次内上下文损失中;其次,一种新的上下文架构,明确地将邻近文档信息编码到编码表示中。结果显示,这两种方法在多种设置下均优于双编码器,特别是在域外场景中差异尤为显著。我们在 MTEB 基准测试中取得了最先进的结果,无需硬负样本挖掘、分数蒸馏、数据集特定指令、GPU 内样本共享或超大批次大小。我们的方法可应用于任何对比学习数据集和任何双编码器,以提升性能。
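摘要中"将邻居文档纳入批内上下文损失"的做法,本质上仍是 InfoNCE 形式的对比损失,只是批内负例取自同一上下文(如同一聚类)中的邻居文档。以下为纯 Python 的玩具示意(向量、批次与温度取值均为假设):

```python
import math

def in_batch_contrastive_loss(query, docs, pos_idx, temp=0.05):
    """-log softmax:正例为 docs[pos_idx],其余批内文档充当难负例。"""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    logits = [dot(query, d) / temp for d in docs]
    m = max(logits)  # 数值稳定的 log-sum-exp
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[pos_idx]

q = [1.0, 0.0]
batch = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]  # 假设批次由邻居文档构成
print(round(in_batch_contrastive_loss(q, batch, 0), 4))
```

把批次按聚类组织、使负例都是语义相近的邻居,损失才会迫使嵌入区分细粒度差异——这正是摘要中第一种方法的直觉。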

[NLP-46] Methods for Automatic Matrix Language Determination of Code-Switched Speech EMNLP

【速读】: 该论文试图解决代码转换(Code-switching, CS)语境下矩阵语言身份(Matrix Language Identity, MLID)的确定问题。解决方案的关键在于利用矩阵语言框架(Matrix Language Frame, MLF)理论,通过比较CS文本和语音中的MLID与声学语言身份(Language Identity, LID)的识别结果,发现MLID预测器在音频数据中与文本原则的关联性高于LID,并且在MLID识别任务中表现优于LID。这一方法揭示了在CS语境中,非英语语言(如普通话和西班牙语)更倾向于作为矩阵语言,而非LID在单语语境中的选择。

链接: https://arxiv.org/abs/2410.02521
作者: Olga Iakovenko,Thomas Hain
关键词-EN: Matrix Language Frame, Matrix Language Identity, Matrix Language, increasingly common, process of speakers
类目: Computation and Language (cs.CL)
备注: Accepted at EMNLP

点击查看摘要

Abstract:Code-switching (CS) is the process of speakers interchanging between two or more languages which in the modern world becomes increasingly common. In order to better describe CS speech the Matrix Language Frame (MLF) theory introduces the concept of a Matrix Language, which is the language that provides the grammatical structure for a CS utterance. In this work the MLF theory was used to develop systems for Matrix Language Identity (MLID) determination. The MLID of English/Mandarin and English/Spanish CS text and speech was compared to acoustic language identity (LID), which is a typical way to identify a language in monolingual utterances. MLID predictors from audio show higher correlation with the textual principles than LID in all cases while also outperforming LID in an MLID recognition task based on F1 macro (60%) and correlation score (0.38). This novel approach has identified that non-English languages (Mandarin and Spanish) are preferred over the English language as the ML contrary to the monolingual choice of LID.
摘要:代码转换 (Code-switching, CS) 是指说话者在两种或多种语言之间交替使用的过程,在现代社会中变得越来越普遍。为了更好地描述代码转换的言语,矩阵语言框架 (Matrix Language Frame, MLF) 理论引入了矩阵语言的概念,即在代码转换话语中提供语法结构的语言。本研究利用 MLF 理论开发了用于矩阵语言身份 (Matrix Language Identity, MLID) 确定的系统。比较了英语/普通话和英语/西班牙语代码转换文本和语音的 MLID 与声学语言身份 (Language Identity, LID),后者是单语话语中识别语言的典型方法。音频中的 MLID 预测因子在所有情况下都显示出比 LID 更高的文本原则相关性,并且在基于 F1 宏 (60%) 和相关性得分 (0.38) 的 MLID 识别任务中也优于 LID。这种新颖的方法发现,与单语选择的 LID 相反,非英语语言 (普通话和西班牙语) 更倾向于作为矩阵语言。
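需要强调的是,MLF 理论中的矩阵语言由提供语法形态框架的语言决定,并非简单的词数统计;但作为直观入门,下面用一个按文字系统计数的玩具启发式来近似"哪种语言占主导"(词符与判断规则均为本文虚构):

```python
def guess_dominant_language(tokens):
    """极粗的代理启发式:统计各词符所属文字系统,取多数作为主导语言。"""
    def lang_of(tok):
        return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in tok) else "en"
    counts = {}
    for t in tokens:
        counts[lang_of(t)] = counts.get(lang_of(t), 0) + 1
    return max(counts, key=counts.get)

print(guess_dominant_language(["我", "昨天", "去", "了", "supermarket"]))  # zh
```

摘要的结论之一正是:基于 MLF 原则预测的 MLID 与这类单语式 LID 并不一致——在代码转换语境下,普通话或西班牙语更常充当矩阵语言。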

[NLP-47] Can Large Language Models Grasp Legal Theories? Enhance Legal Reasoning with Insights from Multi-Agent Collaboration

【速读】: 该论文试图解决大型语言模型(LLMs)在理解和执行复杂法律推理任务方面的不足。解决方案的关键在于提出了一个名为“多代理框架以提升复杂法律推理能力(MALR)”的新框架。MALR通过非参数学习方法,促使LLMs自动分解复杂的法律任务,并模仿人类学习过程从法律规则中提取洞察,从而增强LLMs对法律理论的理解和法律推理能力。实验结果表明,该框架在实际场景中有效解决了复杂推理问题,为法律领域的更可靠应用铺平了道路。

链接: https://arxiv.org/abs/2410.02507
作者: Weikang Yuan,Junjie Cao,Zhuoren Jiang,Yangyang Kang,Jun Lin,Kaisong Song,Tianqianjin Lin,Pengwei Yan,Changlong Sun,Xiaozhong Liu
关键词-EN: Large Language Models, Large Language, Language Models, understand legal theories, legal theories
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) could struggle to fully understand legal theories and perform complex legal reasoning tasks. In this study, we introduce a challenging task (confusing charge prediction) to better evaluate LLMs’ understanding of legal theories and reasoning capabilities. We also propose a novel framework: Multi-Agent framework for improving complex Legal Reasoning capability (MALR). MALR employs non-parametric learning, encouraging LLMs to automatically decompose complex legal tasks and mimic human learning process to extract insights from legal rules, helping LLMs better understand legal theories and enhance their legal reasoning abilities. Extensive experiments on multiple real-world datasets demonstrate that the proposed framework effectively addresses complex reasoning issues in practical scenarios, paving the way for more reliable applications in the legal domain.
摘要:大语言模型 (LLMs) 可能在全面理解法律理论和执行复杂法律推理任务方面遇到困难。在本研究中,我们引入了一项具有挑战性的任务(混淆指控预测),以更好地评估 LLMs 对法律理论的理解和推理能力。我们还提出了一种新颖的框架:用于提升复杂法律推理能力的多智能体框架 (Multi-Agent framework for improving complex Legal Reasoning capability, MALR)。MALR 采用非参数学习方法,鼓励 LLMs 自动分解复杂的法律任务,并模仿人类学习过程从法律规则中提取洞察,从而帮助 LLMs 更好地理解法律理论并增强其法律推理能力。在多个真实世界数据集上的广泛实验表明,所提出的框架有效地解决了实际场景中的复杂推理问题,为法律领域中更可靠的应用铺平了道路。

[NLP-48] Mixed-Session Conversation with Egocentric Memory EMNLP

【速读】: 该论文试图解决当前对话系统在模拟真实世界多参与者、长时间动态对话场景中的不足。解决方案的关键在于引入了一种名为“Mixed-Session Conversation”的对话系统,并通过构建包含多会话和多参与者的新数据集MiSC来实现。此外,论文提出了一种新的对话模型EMMA,该模型采用了一种新颖的自我中心记忆管理机制,能够从主要发言者的视角收集和保留记忆,从而在后续会话中实现无缝连续性。通过广泛的人类评估,验证了MiSC中的对话在参与者变化时仍能保持流畅,而EMMA在训练后也能在整个对话过程中保持高记忆一致性。

链接: https://arxiv.org/abs/2410.02503
作者: Jihyoung Jang,Taeyoung Kim,Hyounghun Kim
关键词-EN: Recently introduced dialogue, Recently introduced, demonstrated high usability, Recently, dialogue
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: EMNLP Findings 2024 (30 pages); Project website: this https URL

点击查看摘要

Abstract:Recently introduced dialogue systems have demonstrated high usability. However, they still fall short of reflecting real-world conversation scenarios. Current dialogue systems exhibit an inability to replicate the dynamic, continuous, long-term interactions involving multiple partners. This shortfall arises because there have been limited efforts to account for both aspects of real-world dialogues: deeply layered interactions over the long-term dialogue and widely expanded conversation networks involving multiple participants. As the effort to incorporate these aspects combined, we introduce Mixed-Session Conversation, a dialogue system designed to construct conversations with various partners in a multi-session dialogue setup. We propose a new dataset called MiSC to implement this system. The dialogue episodes of MiSC consist of 6 consecutive sessions, with four speakers (one main speaker and three partners) appearing in each episode. Also, we propose a new dialogue model with a novel memory management mechanism, called Egocentric Memory Enhanced Mixed-Session Conversation Agent (EMMA). EMMA collects and retains memories from the main speaker’s perspective during conversations with partners, enabling seamless continuity in subsequent interactions. Extensive human evaluations validate that the dialogues in MiSC demonstrate a seamless conversational flow, even when conversation partners change in each session. EMMA trained with MiSC is also evaluated to maintain high memorability without contradiction throughout the entire conversation.
摘要:近期推出的对话系统展示了高度的可用性。然而,它们在反映真实世界的对话场景方面仍显不足。当前的对话系统无法复制涉及多个合作伙伴的动态、连续、长期的互动。这一不足源于对真实世界对话的两个方面——长期对话中的深度交互和涉及多方参与的广泛扩展的对话网络——的考虑有限。为了综合考虑这些方面,我们引入了混合会话对话系统 (Mixed-Session Conversation),该系统旨在构建多会话设置中与各种合作伙伴的对话。我们提出了一个新的数据集,称为 MiSC,以实现这一系统。MiSC 的对话片段由 6 个连续的会话组成,每个片段中有四位发言者(一位主要发言者和三位合作伙伴)。此外,我们提出了一种新的对话模型,该模型具有一种新颖的记忆管理机制,称为以自我为中心的记忆增强混合会话对话智能体 (Egocentric Memory Enhanced Mixed-Session Conversation Agent, EMMA)。EMMA 在合作伙伴对话期间从主要发言者的角度收集和保留记忆,从而在后续互动中实现无缝连续性。广泛的人类评估验证了 MiSC 中的对话展示了无缝的对话流程,即使在每个会话中对话伙伴发生变化时也是如此。使用 MiSC 训练的 EMMA 也被评估为在整个对话过程中保持高度的记忆性而没有矛盾。

[NLP-49] Defining Knowledge: Bridging Epistemology and Large Language Models EMNLP2024

【速读】: 该论文试图解决关于大型语言模型(如GPT-4)是否真正“知道”某些事实的问题,特别是以“地球是圆的”为例。解决方案的关键在于通过回顾认识论中知识的标准定义,并将其形式化应用于LLMs,识别当前NLP研究在概念化知识时与认识论框架之间的不一致性和差距。此外,通过调查100名专业哲学家和计算机科学家的观点,比较他们对知识定义的偏好及其对LLMs是否能真正“知道”的看法,最终提出符合相关定义的知识测试评估协议。

链接: https://arxiv.org/abs/2410.02499
作者: Constanza Fierro,Ruchira Dhar,Filippos Stamatiou,Nicolas Garneau,Anders Søgaard
关键词-EN: large language models, Earth is round, language models, claims are abundant, literature on large
类目: Computation and Language (cs.CL)
备注: EMNLP 2024

点击查看摘要

Abstract:Knowledge claims are abundant in the literature on large language models (LLMs); but can we say that GPT-4 truly “knows” the Earth is round? To address this question, we review standard definitions of knowledge in epistemology and we formalize interpretations applicable to LLMs. In doing so, we identify inconsistencies and gaps in how current NLP research conceptualizes knowledge with respect to epistemological frameworks. Additionally, we conduct a survey of 100 professional philosophers and computer scientists to compare their preferences in knowledge definitions and their views on whether LLMs can really be said to know. Finally, we suggest evaluation protocols for testing knowledge in accordance to the most relevant definitions.
摘要:在关于大语言模型 (LLMs) 的文献中,知识声明比比皆是;但我们能否说 GPT-4 真正“知道”地球是圆的?为了回答这个问题,我们回顾了认识论中知识的标准定义,并将其适用于 LLMs 的解释形式化。在此过程中,我们发现了当前自然语言处理 (NLP) 研究在认识论框架下概念化知识的矛盾和差距。此外,我们进行了一项针对 100 位专业哲学家和计算机科学家的调查,以比较他们对知识定义的偏好及其对 LLMs 是否真正能被说成是知道的观点。最后,我们提出了根据最相关定义测试知识的评估协议。

[NLP-50] Dynamic Gradient Alignment for Online Data Mixing

【速读】: 该论文试图解决在大语言模型(LLM)训练中,如何通过优化训练数据混合比例来提升模型在特定任务上的表现,尤其是在仅有少量示例的情况下。解决方案的关键在于引入了一种名为动态梯度对齐(Dynamic Gradient Alignment, DGA)的算法。DGA通过在线动态估计预训练数据的混合比例,使得模型梯度尽可能与特定任务的梯度对齐,从而在不重新训练模型的前提下,显著提升模型在特定任务上的性能。该方法在预训练数据集较小或特定任务数据不足的情况下,相比传统的重要性采样方法,表现出了显著的优势。

链接: https://arxiv.org/abs/2410.02498
作者: Simin Fan,David Grangier,Pierre Ablin
关键词-EN: gradient alignment, large language models, effectively training large, training large language, data
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The composition of training data mixtures is critical for effectively training large language models (LLMs), as it directly impacts their performance on downstream tasks. Our goal is to identify an optimal data mixture to specialize an LLM for a specific task with access to only a few examples. Traditional approaches to this problem include ad-hoc reweighting methods, importance sampling, and gradient alignment techniques. This paper focuses on gradient alignment and introduces Dynamic Gradient Alignment (DGA), a scalable online gradient alignment algorithm. DGA dynamically estimates the pre-training data mixture on which the models’ gradients align as well as possible with those of the model on the specific task. DGA is the first gradient alignment approach that incurs minimal overhead compared to standard pre-training and outputs a competitive model, eliminating the need for retraining the model. Experimentally, we demonstrate significant improvements over importance sampling in two key scenarios: (i) when the pre-training set is small and importance sampling overfits due to limited data; and (ii) when there is insufficient specialized data, trapping importance sampling on narrow pockets of data. Our findings underscore the effectiveness of gradient alignment methods in optimizing training data mixtures, particularly in data-constrained environments, and offer a practical solution for enhancing LLM performance on specific tasks with limited data availability.
摘要:训练数据的混合组成对于有效训练大语言模型 (LLM) 至关重要,因为它直接影响模型在下游任务中的表现。我们的目标是仅通过少量样本就能识别出最优的数据混合,以使 LLM 专门化于特定任务。传统解决此问题的方法包括临时重新加权方法、重要性采样和梯度对齐技术。本文聚焦于梯度对齐,并引入了动态梯度对齐 (DGA),这是一种可扩展的在线梯度对齐算法。DGA 动态估计预训练数据混合,使得模型梯度尽可能与特定任务模型的梯度对齐。DGA 是首个与标准预训练相比开销最小的梯度对齐方法,并输出具有竞争力的模型,无需重新训练模型。实验上,我们在两个关键场景中展示了相对于重要性采样的显著改进:(i) 当预训练数据集较小且由于数据有限导致重要性采样过拟合时;(ii) 当专业数据不足,导致重要性采样局限于数据的小范围子集时。我们的研究强调了梯度对齐方法在优化训练数据混合中的有效性,特别是在数据受限的环境中,并为在有限数据可用性的情况下提升 LLM 在特定任务上的表现提供了实际解决方案。
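DGA 的核心一步——按各数据域梯度与目标任务梯度的对齐程度分配混合权重——可用如下玩具代码示意(域名称、二维梯度向量与温度均为假设;真实算法在训练过程中对权重做在线更新):

```python
import math

def dga_weights(domain_grads, task_grad, temp=1.0):
    """余弦相似度衡量梯度对齐程度,softmax 归一化为数据混合权重。"""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))
    def cos(a, b):
        return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))
    sims = [cos(g, task_grad) / temp for g in domain_grads]
    m = max(sims)  # 数值稳定的 softmax
    exps = [math.exp(s - m) for s in sims]
    z = sum(exps)
    return [e / z for e in exps]

grads = {"web": [1.0, 0.0], "code": [0.0, 1.0], "news": [0.7, 0.7]}
task = [1.0, 0.2]  # 假设的特定任务梯度
w = dga_weights(list(grads.values()), task)
print({k: round(v, 3) for k, v in zip(grads, w)})
```

与任务梯度越对齐的数据域获得越大的采样权重,这就是摘要所述"模型梯度尽可能与特定任务梯度对齐"的量化形式。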

[NLP-51] DTVLT: A Multi-modal Diverse Text Benchmark for Visual Language Tracking Based on LLM

【速读】: 该论文试图解决现有视觉语言跟踪(VLT)基准测试中,由于依赖简洁的人工标注文本描述,导致算法难以深入理解视频内容动态和多样性的问题。解决方案的关键在于利用大型语言模型(LLMs)生成多样化的语义标注,从而创建一个名为DTVLT的新型多模态基准测试。该基准测试包含五个主要VLT和SOT基准,涵盖短期跟踪、长期跟踪和全局实例跟踪三个子任务,并提供四种不同粒度的文本标注,以促进VLT和视频理解研究的发展。通过这种方式,论文期望通过多样化的文本生成策略,揭示现有算法在处理不同文本粒度时的性能瓶颈,从而推动相关领域的进一步研究。

链接: https://arxiv.org/abs/2410.02492
作者: Xuchen Li,Shiyu Hu,Xiaokun Feng,Dailing Zhang,Meiqi Wu,Jing Zhang,Kaiqi Huang
关键词-EN: harnessing linguistic data, traditional single object, cutting-edge research area, single object tracking, video understanding applications
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Preprint, Under Review

点击查看摘要

Abstract:Visual language tracking (VLT) has emerged as a cutting-edge research area, harnessing linguistic data to enhance algorithms with multi-modal inputs and broadening the scope of traditional single object tracking (SOT) to encompass video understanding applications. Despite this, most VLT benchmarks still depend on succinct, human-annotated text descriptions for each video. These descriptions often fall short in capturing the nuances of video content dynamics and lack stylistic variety in language, constrained by their uniform level of detail and a fixed annotation frequency. As a result, algorithms tend to default to a “memorize the answer” strategy, diverging from the core objective of achieving a deeper understanding of video content. Fortunately, the emergence of large language models (LLMs) has enabled the generation of diverse text. This work utilizes LLMs to generate varied semantic annotations (in terms of text lengths and granularities) for representative SOT benchmarks, thereby establishing a novel multi-modal benchmark. Specifically, we (1) propose a new visual language tracking benchmark with diverse texts, named DTVLT, based on five prominent VLT and SOT benchmarks, including three sub-tasks: short-term tracking, long-term tracking, and global instance tracking. (2) We offer four granularity texts in our benchmark, considering the extent and density of semantic information. We expect this multi-granular generation strategy to foster a favorable environment for VLT and video understanding research. (3) We conduct comprehensive experimental analyses on DTVLT, evaluating the impact of diverse text on tracking performance and hope the identified performance bottlenecks of existing algorithms can support further research in VLT and video understanding. The proposed benchmark, experimental results and toolkit will be released gradually on this http URL.
摘要:视觉语言跟踪 (Visual Language Tracking, VLT) 已成为前沿研究领域,利用语言数据增强多模态输入算法,并将传统单目标跟踪 (Single Object Tracking, SOT) 的范围扩展至视频理解应用。尽管如此,大多数 VLT 基准测试仍依赖于对每个视频的简洁人工标注文本描述。这些描述往往无法捕捉视频内容动态的细微差别,且在语言风格上缺乏多样性,受限于统一的细节层次和固定的标注频率。因此,算法倾向于采用“记忆答案”的策略,偏离了实现对视频内容更深层次理解的核心目标。幸运的是,大语言模型 (Large Language Model, LLM) 的出现使得生成多样化的文本成为可能。本研究利用 LLM 为典型的 SOT 基准生成不同语义粒度的标注文本(在文本长度和粒度方面),从而建立了一个新的多模态基准。具体而言,我们 (1) 提出了一个包含多样化文本的新视觉语言跟踪基准,命名为 DTVLT,基于五个著名的 VLT 和 SOT 基准,包括三个子任务:短期跟踪、长期跟踪和全局实例跟踪。(2) 在我们的基准中提供了四种粒度的文本,考虑到语义信息的广度和密度。我们期望这种多粒度生成策略能够为 VLT 和视频理解研究创造有利的环境。(3) 我们对 DTVLT 进行了全面的实验分析,评估了多样化文本对跟踪性能的影响,并希望识别出的现有算法性能瓶颈能够支持 VLT 和视频理解的进一步研究。所提出的基准、实验结果和工具包将逐步发布在此 http URL 上。

[NLP-52] Response Tuning: Aligning Large Language Models without Instruction

【速读】: 该论文试图解决如何将预训练的大型语言模型(LLMs)转化为有用且安全的聊天助手的问题。解决方案的关键在于提出了一种名为“Response Tuning (RT)”的方法,该方法通过仅关注响应空间的监督,消除了指令调优中的指令条件步骤。实验结果表明,仅通过响应训练的RT模型能够有效应对广泛的指令,并在有用性方面与指令调优的模型相当。此外,控制训练响应分布可以显著提高用户偏好或引发目标行为,如拒绝不安全查询的协助。这一发现强调了建立适当的输出空间在模型对齐中的作用,并突显了预训练LLMs固有的广泛能力。

链接: https://arxiv.org/abs/2410.02465
作者: Seokhyun An,Hyounghun Kim
关键词-EN: Large Language Models, pre-trained Large Language, transitioning pre-trained Large, Large Language, safe chat assistants
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 34 pages

点击查看摘要

Abstract:Instruction tuning-supervised fine-tuning using instruction-response pairs-is a foundational step in transitioning pre-trained Large Language Models (LLMs) into helpful and safe chat assistants. Our hypothesis is that establishing an adequate output space can enable such a transition given the capabilities inherent in pre-trained LLMs. To verify this, we propose Response Tuning (RT), which eliminates the instruction-conditioning step in instruction tuning and solely focuses on response space supervision. Our experiments demonstrate that RT models, trained only using responses, can effectively respond to a wide range of instructions and exhibit helpfulness comparable to that of their instruction-tuned counterparts. Furthermore, we observe that controlling the training response distribution can significantly improve their user preference or elicit target behaviors such as refusing assistance for unsafe queries. Our findings illuminate the role of establishing an adequate output space in alignment, highlighting the potential of the extensive inherent capabilities of pre-trained LLMs.
摘要:指令调优(使用指令-响应对进行监督微调)是将预训练大语言模型 (LLM) 转化为有用且安全的聊天助手的基础步骤。我们的假设是,建立一个适当的输出空间可以实现这一转变,前提是预训练 LLM 具备固有能力。为了验证这一点,我们提出了响应调优 (RT),它消除了指令调优中的指令条件步骤,仅专注于响应空间的监督。我们的实验表明,仅使用响应训练的 RT 模型能够有效地响应广泛的指令,并展现出与其指令调优的对应模型相当的有用性。此外,我们观察到,控制训练响应分布可以显著提高用户偏好,或引发目标行为,例如拒绝不安全查询的协助。我们的研究揭示了建立适当输出空间在模型对齐中的作用,突显了预训练 LLM 广泛固有能力的潜力。
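为直观理解指令调优与响应调优 (RT) 在训练样本构造上的差别,下面给出一个极简示意(纯 Python;token id 为虚构整数,函数名为本文自拟,并非论文的实际实现):指令调优以指令为条件、仅对响应 token 计算损失,而 RT 完全去掉指令,直接在响应序列上监督。

```python
def build_example(instruction_ids, response_ids, response_tuning):
    """Build (input_ids, loss_mask) for one training example.

    Instruction tuning conditions on the instruction and supervises
    only the response tokens; Response Tuning (RT) drops the
    instruction entirely and supervises the response alone.
    Token ids here are toy integers, not real tokenizer output.
    """
    if response_tuning:
        input_ids = list(response_ids)
        loss_mask = [1] * len(response_ids)          # supervise everything
    else:
        input_ids = list(instruction_ids) + list(response_ids)
        loss_mask = [0] * len(instruction_ids) + [1] * len(response_ids)
    return input_ids, loss_mask

instr, resp = [101, 102, 103], [201, 202]
it_ids, it_mask = build_example(instr, resp, response_tuning=False)
rt_ids, rt_mask = build_example(instr, resp, response_tuning=True)
print(it_ids, it_mask)  # [101, 102, 103, 201, 202] [0, 0, 0, 1, 1]
print(rt_ids, rt_mask)  # [201, 202] [1, 1]
```

可以看到,RT 样本中不存在任何指令条件,训练信号完全来自响应空间本身。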

[NLP-53] Embedded Topic Models Enhanced by Wikification EMNLP2024

【速读】: 该论文试图解决传统主题模型仅考虑单词拼写而忽略同形异义词的问题。解决方案的关键在于将维基百科知识融入神经主题模型,使其能够识别命名实体,从而提高模型的泛化能力。通过在新闻文章和AIDA-CoNLL数据集上的实验,研究证明该方法不仅提升了模型的性能,还能有效捕捉主题的时间序列发展。

链接: https://arxiv.org/abs/2410.02441
作者: Takashi Shibuya,Takehito Utsuro
关键词-EN: learn meaningful patterns, collection of documents, documents to learn, learn meaningful, meaningful patterns
类目: Computation and Language (cs.CL)
备注: Accepted at EMNLP 2024 Workshop NLP for Wikipedia

点击查看摘要

Abstract:Topic modeling analyzes a collection of documents to learn meaningful patterns of words. However, previous topic models consider only the spelling of words and do not take into consideration the homography of words. In this study, we incorporate the Wikipedia knowledge into a neural topic model to make it aware of named entities. We evaluate our method on two datasets, 1) news articles of the New York Times and 2) the AIDA-CoNLL dataset. Our experiments show that our method improves the performance of neural topic models in generalizability. Moreover, we analyze frequent terms in each topic and the temporal dependencies between topics to demonstrate that our entity-aware topic models can capture the time-series development of topics well.
摘要:主题建模分析文档集合以学习有意义的词语模式。然而,以往的主题模型仅考虑词语的拼写,并未考虑词语的同形异义现象。在本研究中,我们将维基百科知识融入神经主题模型,使其能够识别命名实体。我们在两个数据集上评估了我们的方法,1) 《纽约时报》的文章和 2) AIDA-CoNLL 数据集。实验结果表明,我们的方法提升了神经主题模型在泛化能力方面的表现。此外,我们分析了每个主题中的高频词以及主题间的时间依赖关系,证明我们的实体感知主题模型能够很好地捕捉主题的时间序列发展。

[NLP-54] Better Call SAUL: Fluent and Consistent Language Model Editing with Generation Regularization

【速读】: 该论文试图解决大语言模型在更新知识时可能影响无关知识的问题,并提出了一种高效的模型编辑方法。解决方案的关键在于SAUL(Streamlined Model Editing),它通过句子拼接和增强随机事实的方式进行生成正则化,从而在保持生成质量和一致性的同时,减少计算开销,并在模型编辑任务中优于现有最先进的方法。

链接: https://arxiv.org/abs/2410.02433
作者: Mingyang Wang,Lukas Lange,Heike Adel,Jannik Strötgen,Hinrich Schütze
关键词-EN: ensure large language, large language models, updated regularly, ensure large, large language
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:To ensure large language models contain up-to-date knowledge, they need to be updated regularly. However, model editing is challenging as it might also affect knowledge that is unrelated to the new data. State-of-the-art methods identify parameters associated with specific knowledge and then modify them via direct weight updates. However, these locate-and-edit methods suffer from heavy computational overhead and lack theoretical validation. In contrast, directly fine-tuning the model on requested edits affects the model’s behavior on unrelated knowledge, and significantly damages the model’s generation fluency and consistency. To address these challenges, we propose SAUL, a streamlined model editing method that uses sentence concatenation with augmented random facts for generation regularization. Evaluations on three model editing benchmarks show that SAUL is a practical and reliable solution for model editing outperforming state-of-the-art methods while maintaining generation quality and reducing computational overhead.
摘要:为了确保大语言模型包含最新的知识,它们需要定期更新。然而,模型编辑是一项挑战,因为它可能会影响与新数据无关的知识。目前最先进的方法是通过识别与特定知识相关的参数,然后通过直接权重更新来修改这些参数。然而,这些定位与编辑的方法存在计算开销大且缺乏理论验证的问题。相比之下,直接在请求的编辑上微调模型会影响模型在无关知识上的行为,并显著损害模型的生成流畅性和一致性。为了应对这些挑战,我们提出了 SAUL,一种简化的模型编辑方法,该方法通过使用增强随机事实的句子连接来进行生成正则化。在三个模型编辑基准上的评估表明,SAUL 是一种实用且可靠的模型编辑解决方案,它在保持生成质量的同时,优于最先进的方法并减少了计算开销。
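SAUL 的核心思想之一是“句子拼接 + 增强随机事实”的生成正则化。下面是一个高度简化的样本构造示意(事实文本与函数名均为虚构,仅用于说明拼接式正则化的做法,并非论文实现):

```python
import random

def build_saul_sample(edit_fact, fact_pool, k=2, seed=0):
    """Toy SAUL-style training sample: concatenate the requested edit
    with k randomly drawn unrelated facts, so fine-tuning on the edit
    is regularized by fluent surrounding text. All facts are invented."""
    rng = random.Random(seed)          # deterministic for the demo
    extras = rng.sample(fact_pool, k)
    return " ".join([edit_fact] + extras)

pool = [
    "Water boils at 100 degrees Celsius at sea level.",
    "The Nile flows northward.",
    "Mount Everest is the tallest mountain above sea level.",
]
sample = build_saul_sample("The CEO of AcmeCorp is Jane Doe.", pool)
print(sample)
```

直观上,随机事实为微调提供了“无关知识的锚点”,缓解编辑目标对生成流畅性与一致性的破坏。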

[NLP-55] IoT-LLM: Enhancing Real-World IoT Task Reasoning with Large Language Models ICLR2025

【速读】: 该论文试图解决大型语言模型(LLMs)在处理现实世界物联网(IoT)任务时,由于缺乏对物理世界的深刻理解和感知能力,导致生成的输出常常违反物理定律的问题。解决方案的关键在于提出一个统一的框架,即IoT-LLM,通过增强LLMs的感知能力和知识库来提升其处理IoT任务的能力。具体步骤包括:预处理IoT数据以适应LLMs的输入格式,通过思维链提示和专门的角色定义激活LLMs的常识知识,以及通过基于上下文学习的IoT导向的检索增强生成来扩展其理解能力。实验结果表明,IoT-LLM显著提升了LLMs在IoT任务推理中的表现,平均改进率达到65%。

链接: https://arxiv.org/abs/2410.02429
作者: Tuo An,Yunjiao Zhou,Han Zou,Jianfei Yang
关键词-EN: Large Language Models, Large Language, Language Models, demonstrated remarkable capabilities, physical world
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 21 pages, 10 figures, submitted to ICLR 2025 Conference

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across textual and visual domains but often generate outputs that violate physical laws, revealing a gap in their understanding of the physical world. Inspired by human cognition, where perception is fundamental to reasoning, we explore augmenting LLMs with enhanced perception abilities using Internet of Things (IoT) sensor data and pertinent knowledge for IoT task reasoning in the physical world. In this work, we systematically study LLMs capability to address real-world IoT tasks by augmenting their perception and knowledge base, and then propose a unified framework, IoT-LLM, to enhance such capability. In IoT-LLM, we customize three steps for LLMs: preprocessing IoT data into formats amenable to LLMs, activating their commonsense knowledge through chain-of-thought prompting and specialized role definitions, and expanding their understanding via IoT-oriented retrieval-augmented generation based on in-context learning. To evaluate the performance, We design a new benchmark with five real-world IoT tasks with different data types and reasoning difficulties and provide the benchmarking results on six open-source and close-source LLMs. Experimental results demonstrate the limitations of existing LLMs with naive textual inputs that cannot perform these tasks effectively. We show that IoT-LLM significantly enhances the performance of IoT tasks reasoning of LLM, such as GPT-4, achieving an average improvement of 65% across various tasks against previous methods. The results also showcase LLMs ability to comprehend IoT data and the physical law behind data by providing a reasoning process. Limitations of our work are claimed to inspire future research in this new era.
摘要:大语言模型 (LLMs) 在文本和视觉领域展示了显著的能力,但它们生成的输出常常违反物理定律,揭示了它们对物理世界理解的不足。受人类认知中“感知是推理的基础”这一观点的启发,我们探索通过物联网 (IoT) 传感器数据和相关知识来增强 LLMs 的感知能力,以在物理世界中进行 IoT 任务推理。在这项工作中,我们系统地研究了 LLMs 通过增强感知和知识库来解决现实世界 IoT 任务的能力,并提出了一个统一的框架,即 IoT-LLM,以增强这种能力。在 IoT-LLM 中,我们为 LLMs 定制了三个步骤:将 IoT 数据预处理为适合 LLMs 的格式,通过思维链提示和专门的角色定义激活其常识知识,并通过基于上下文学习的面向 IoT 的检索增强生成来扩展其理解。为了评估性能,我们设计了一个包含五项现实世界 IoT 任务(数据类型和推理难度各不相同)的新基准,并在六个开源和闭源 LLMs 上提供了基准测试结果。实验结果表明,现有的 LLMs 在仅使用简单文本输入的情况下无法有效执行这些任务。我们展示了 IoT-LLM 显著提升了 LLMs(例如 GPT-4)在 IoT 任务推理中的表现,相较于以往方法,在各种任务上平均提高了 65%。结果还表明,LLMs 能够通过给出推理过程来理解 IoT 数据及其背后的物理定律。我们也指出了本工作的局限性,以期激发这一新时代的后续研究。
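以 IoT-LLM 三个步骤中的第一步(将 IoT 数据预处理为适合 LLM 的格式)为例,下面给出一个玩具级示意:把原始数值传感器序列压缩为 LLM 可读的文本摘要。模板与字段均为本文假设,并非论文的实际格式:

```python
def iot_to_text(name, readings, unit):
    """Toy IoT preprocessing: turn a raw numeric sensor stream into a
    short textual summary suitable as LLM input. The template is
    illustrative, not the paper's exact scheme."""
    lo, hi = min(readings), max(readings)
    mean = sum(readings) / len(readings)
    trend = "rising" if readings[-1] > readings[0] else "falling or flat"
    return (f"Sensor '{name}': {len(readings)} samples, "
            f"range {lo:.1f}-{hi:.1f} {unit}, mean {mean:.1f} {unit}, "
            f"overall {trend}.")

temps = [21.0, 21.4, 22.1, 23.0, 23.6]
print(iot_to_text("temperature", temps, "°C"))
```

这类摘要化文本随后才进入思维链提示与检索增强生成等后续步骤。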

[NLP-56] Collective Critics for Creative Story Generation EMNLP2024

【速读】: 该论文试图解决使用大型语言模型(LLMs)生成具有叙事连贯性的长篇故事(数千字)时面临的挑战。解决方案的关键在于提出了一个名为“集体批评创意故事生成框架”(CritiCS),该框架包括计划细化阶段(CrPlan)和故事生成阶段(CrText)。CritiCS通过引入集体修订机制,在每个阶段由一组LLM批评者和一个领导者协作,经过多轮迭代逐步细化故事计划和生成故事,从而在保持叙事连贯性的同时显著提升故事的创造性和读者吸引力。此外,该框架设计允许人类作家在批评过程中积极参与,实现人机交互协作的故事写作。

链接: https://arxiv.org/abs/2410.02428
作者: Minwook Bae,Hyounghun Kim
关键词-EN: Large Language Models, Language Models, Large Language, Generating a long, challenging task
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: EMNLP 2024 (36 pages)

点击查看摘要

Abstract:Generating a long story of several thousand words with narrative coherence using Large Language Models (LLMs) has been a challenging task. Previous research has addressed this challenge by proposing different frameworks that create a story plan and generate a long story based on that plan. However, these frameworks have been mainly focusing on maintaining narrative coherence in stories, often overlooking creativity in story planning and the expressiveness of the stories generated from those plans, which are desirable properties to captivate readers’ interest. In this paper, we propose Collective Critics for Creative Story Generation framework (CritiCS), which is composed of plan refining stage (CrPlan) and story generation stage (CrText), to integrate a collective revision mechanism that promotes those properties into long-form story generation process. Specifically, in each stage, a group of LLM critics and one leader collaborate to incrementally refine drafts of plan and story throughout multiple rounds. Extensive human evaluation shows that the CritiCS can significantly enhance story creativity and reader engagement, while also maintaining narrative coherence. Furthermore, the design of the framework allows active participation from human writers in any role within the critique process, enabling interactive human-machine collaboration in story writing.
摘要:使用大语言模型 (LLMs) 生成数千字且具有叙事连贯性的长篇故事一直是一项具有挑战性的任务。以往的研究通过提出不同的框架来解决这一挑战,这些框架先创建故事计划,再基于该计划生成长篇故事。然而,这些框架主要关注于保持故事的叙事连贯性,往往忽视了故事计划中的创造性以及基于这些计划生成的故事的表现力,而这些都是吸引读者兴趣的理想属性。在本文中,我们提出了集体批评创意故事生成框架 (Collective Critics for Creative Story Generation, CritiCS),该框架由计划细化阶段 (CrPlan) 和故事生成阶段 (CrText) 组成,通过引入集体修订机制,将这些属性整合到长篇故事生成过程中。具体而言,在每个阶段,一组大语言模型评论家和一个领导者协作,通过多轮迭代逐步细化计划和故事的草稿。广泛的人类评估表明,CritiCS 能够显著提升故事的创造性和读者参与度,同时保持叙事连贯性。此外,该框架的设计允许人类作家在评论过程中以任何角色积极参与,从而实现故事写作中的交互式人机协作。

[NLP-57] Learning the Latent Rules of a Game from Data: A Chess Story

【速读】: 该论文试图解决的问题是验证小型预训练基础生成语言模型能否通过少量数据学习并掌握复杂规则,如国际象棋的规则和策略。解决方案的关键在于通过指令微调(instruction fine-tuning)技术,使用1,000到1,000,000个示例对28M和125M参数的预训练小型语言模型进行训练,使其能够学习国际象棋的规则、提出合法走法并准确解决棋局问题。此外,论文还探讨了连续微调周期对模型性能提升的影响,并展示了通过增加指令微调示例数量来减少模型幻觉(model hallucinations)的效果。

链接: https://arxiv.org/abs/2410.02426
作者: Ben Fauber
关键词-EN: pretrained foundational generative, foundational generative language, generative language models, small pretrained foundational, parameter pretrained foundational
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We demonstrate that small pretrained foundational generative language models with millions of parameters can learn the latent rules of a process from data associated with the process. Inspired by Stefan Zweig’s novella “Schachnovelle,” also known as “The Royal Game” in English, we show that 28M and 125M parameter pretrained foundational small language models (SLMs) can be instruction fine-tuned with 1,000-to-1,000,000 examples to learn the rules of chess, propose legal moves, and accurately solve chess problems. We also explore the impact of successive language model fine-tuning epochs on improved outcomes and demonstrate reductions in model hallucinations by increasing the number of instruction fine-tuning examples.
摘要:我们展示了具有数百万参数的预训练基础生成式语言模型可以从与该过程相关的数据中学习该过程的潜在规则。受斯蒂芬·茨威格的中篇小说《棋局》(英文名为《皇家游戏》)的启发,我们证明了具有 28M 和 125M 参数的预训练基础小型语言模型 (SLM) 可以通过 1,000 到 1,000,000 个示例进行指令微调,从而学习国际象棋的规则,提出合法的走法,并准确解决国际象棋问题。我们还探讨了连续语言模型微调周期对改进结果的影响,并通过增加指令微调示例的数量展示了模型幻觉的减少。

[NLP-58] LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services

【速读】: 该论文试图解决在大规模语言模型(LLMs)推理服务中,如何选择最优硬件以满足性能需求并降低成本的问题。解决方案的关键在于提出了LLM-Pilot系统,该系统通过在真实工作负载下对多种GPU进行基准测试,优化服务配置以最大化性能,并利用这些数据训练出一个预测模型,用于推荐最经济高效的硬件配置。相较于现有方法,LLM-Pilot能够更频繁地满足性能要求,同时平均降低60%的成本。

链接: https://arxiv.org/abs/2410.02425
作者: Małgorzata Łazuka,Andreea Anghel,Thomas Parnell
关键词-EN: Large Language Models, Large Language, LLM inference services, LLM inference, Language Models
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '24)

点击查看摘要

Abstract:As Large Language Models (LLMs) are rapidly growing in popularity, LLM inference services must be able to serve requests from thousands of users while satisfying performance requirements. The performance of an LLM inference service is largely determined by the hardware onto which it is deployed, but understanding of which hardware will deliver on performance requirements remains challenging. In this work we present LLM-Pilot - a first-of-its-kind system for characterizing and predicting performance of LLM inference services. LLM-Pilot performs benchmarking of LLM inference services, under a realistic workload, across a variety of GPUs, and optimizes the service configuration for each considered GPU to maximize performance. Finally, using this characterization data, LLM-Pilot learns a predictive model, which can be used to recommend the most cost-effective hardware for a previously unseen LLM. Compared to existing methods, LLM-Pilot can deliver on performance requirements 33% more frequently, whilst reducing costs by 60% on average.
摘要:随着大语言模型 (LLM) 的迅速普及,LLM 推理服务必须能够处理来自数千用户的需求,同时满足性能要求。LLM 推理服务的性能在很大程度上取决于其部署的硬件,但理解哪种硬件能够满足性能要求仍然是一个挑战。在这项工作中,我们介绍了 LLM-Pilot——一个开创性的系统,用于表征和预测 LLM 推理服务的性能。LLM-Pilot 在现实工作负载下,对多种 GPU 上的 LLM 推理服务进行基准测试,并针对每种考虑的 GPU 优化服务配置,以最大化性能。最后,利用这些表征数据,LLM-Pilot 学习了一个预测模型,该模型可以用于推荐对之前未见过的 LLM 最具成本效益的硬件。与现有方法相比,LLM-Pilot 能够更频繁地满足性能要求,同时平均降低 60% 的成本。
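LLM-Pilot 的最后一步是基于学习到的预测模型推荐性价比最高的硬件,其选择逻辑可概括为“在满足性能要求的 GPU 中选成本最低者”。下面是一个示意(GPU 名称、吞吐与价格均为虚构数字,真实系统中这些预测值来自基准测试数据训练出的模型):

```python
def recommend_gpu(predictions, required_throughput):
    """Pick the cheapest GPU whose predicted throughput meets the
    requirement. `predictions` maps GPU name -> (tokens/s, $/hour).
    Names and numbers are hypothetical stand-ins."""
    feasible = {name: cost for name, (tps, cost) in predictions.items()
                if tps >= required_throughput}
    if not feasible:
        return None                      # no GPU meets the requirement
    return min(feasible, key=feasible.get)

predictions = {                          # hypothetical (throughput, cost)
    "A100-80GB": (1800.0, 3.7),
    "L40S":      (1100.0, 1.9),
    "H100":      (2600.0, 5.1),
}
print(recommend_gpu(predictions, required_throughput=1000))  # L40S
print(recommend_gpu(predictions, required_throughput=2000))  # H100
```

当性能要求收紧时,可行集缩小,推荐结果随之从低价卡切换到高端卡。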

[NLP-59] MenakBERT – Hebrew Diacriticizer

【速读】: 该论文试图解决希伯来语文本中添加音标符号的问题,传统方法依赖于人工整理的资源,而现有模型在处理音标化文本时性能仍有差距。解决方案的关键在于使用基于字符的预训练语言模型(PLM),即MenakBERT,该模型在希伯来语文本上进行预训练并微调,以生成希伯来语句子的音标符号,并通过微调模型实现音标化任务向词性标注任务的迁移。

链接: https://arxiv.org/abs/2410.02417
作者: Ido Cohen,Jacob Gidron,Idan Pinto
关键词-EN: language give words, Hebrew language give, Diacritical marks, vocalized form, language give
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Published at ISCOL2022 as a poster

点击查看摘要

Abstract:Diacritical marks in the Hebrew language give words their vocalized form. The task of adding diacritical marks to plain Hebrew text is still dominated by a system that relies heavily on human-curated resources. Recent models trained on diacritized Hebrew texts still present a gap in performance. We use a recently developed char-based PLM to narrowly bridge this gap, presenting MenakBERT, a character-level transformer pretrained on Hebrew text and fine-tuned to produce diacritical marks for Hebrew sentences. We further show how finetuning a model for diacritizing transfers to a task such as part-of-speech tagging.
摘要:希伯来语中的音调符号赋予了词语其发音形式。目前,为纯希伯来语文本添加音调符号的任务仍然主要依赖于大量人工整理的资源。尽管最近基于带音调符号的希伯来语文本训练的模型在性能上有所提升,但仍存在一定的差距。我们采用了一种新近开发的基于字符的预训练语言模型 (PLM) 来缩小这一差距。本文介绍了 MenakBERT,这是一种基于字符级别的 Transformer 模型,预训练于希伯来语文本,并经过微调以生成希伯来语句子的音调符号。我们进一步展示了如何通过微调模型进行音调符号添加,从而将其应用于词性标注等任务。

[NLP-60] Parameter Competition Balancing for Model Merging NEURIPS2024

【速读】: 该论文试图解决多任务模型合并中参数竞争不平衡的问题,特别是在不同任务间参数调整时可能出现的冲突和复杂关联。解决方案的关键是提出了一种名为PCB-Merging(参数竞争平衡)的创新技术,该技术通过轻量级且无需重新训练的方法,调整每个参数的系数以实现有效的模型合并。具体来说,PCB-Merging采用内部平衡(intra-balancing)来评估单个任务中参数的重要性,并通过外部平衡(inter-balancing)评估不同任务间参数的相似性。通过丢弃低重要性分数的参数并重新缩放剩余参数,最终形成合并后的模型。实验结果表明,该方法在多种合并场景下显著提升了性能,优于现有的模型合并方法。

链接: https://arxiv.org/abs/2410.02396
作者: Guodong Du,Junlin Lee,Jing Li,Runhua Jiang,Yifei Guo,Shuyang Yu,Hanting Liu,Sim Kuan Goh,Ho-Kin Tang,Daojing He,Min Zhang
关键词-EN: common practice, model, tasks, parameter, fine-tuning pretrained models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted by NeurIPS2024

点击查看摘要

Abstract:While fine-tuning pretrained models has become common practice, these models often underperform outside their specific domains. Recently developed model merging techniques enable the direct integration of multiple models, each fine-tuned for distinct tasks, into a single model. This strategy promotes multitasking capabilities without requiring retraining on the original datasets. However, existing methods fall short in addressing potential conflicts and complex correlations between tasks, especially in parameter-level adjustments, posing a challenge in effectively balancing parameter competition across various tasks. This paper introduces an innovative technique named PCB-Merging (Parameter Competition Balancing), a lightweight and training-free technique that adjusts the coefficients of each parameter for effective model merging. PCB-Merging employs intra-balancing to gauge parameter significance within individual tasks and inter-balancing to assess parameter similarities across different tasks. Parameters with low importance scores are dropped, and the remaining ones are rescaled to form the final merged model. We assessed our approach in diverse merging scenarios, including cross-task, cross-domain, and cross-training configurations, as well as out-of-domain generalization. The experimental results reveal that our approach achieves substantial performance enhancements across multiple modalities, domains, model sizes, number of tasks, fine-tuning forms, and large language models, outperforming existing model merging methods. The code is publicly available at: this https URL.
摘要:尽管微调预训练模型已成为常见做法,但这些模型在特定领域之外的表现往往不尽如人意。最近开发的模型合并技术使得可以直接将多个针对不同任务微调的模型整合为一个单一模型。这种策略促进了多任务能力,而无需在原始数据集上重新训练。然而,现有方法在解决任务间潜在冲突和复杂关联方面,尤其是在参数级调整方面,存在不足,这给有效平衡各任务间的参数竞争带来了挑战。本文介绍了一种名为 PCB-Merging (Parameter Competition Balancing) 的创新技术,这是一种轻量级且无需训练的技术,通过调整每个参数的系数来实现有效的模型合并。PCB-Merging 采用内部平衡来评估单个任务内参数的重要性,并通过外部平衡来评估不同任务间参数的相似性。低重要性评分的参数被舍弃,剩余参数经过重新缩放以形成最终的合并模型。我们在多种合并场景中评估了我们的方法,包括跨任务、跨领域和跨训练配置,以及域外泛化。实验结果表明,我们的方法在多种模态、领域、模型规模、任务数量、微调形式和大语言模型上均实现了显著的性能提升,优于现有的模型合并方法。代码已公开发布,详见:this https URL。
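“丢弃低分参数、重缩放剩余参数后再合并”的流程可以用 numpy 给出一个高度简化的示意:此处直接以参数幅值充当重要性分数(论文中的 intra-balancing 与 inter-balancing 计算要精细得多),该代码仅为概念演示,并非论文实现:

```python
import numpy as np

def pcb_merge(base, finetuned, drop_ratio=0.5):
    """Toy parameter-competition-balancing merge.

    For each task vector (finetuned - base), score parameters by
    magnitude (a crude stand-in for intra-balancing), zero out the
    lowest `drop_ratio` fraction, rescale survivors to preserve the
    vector's L1 mass, then average the balanced vectors onto base.
    """
    balanced = []
    for w in finetuned:
        tv = w - base                          # task vector
        scores = np.abs(tv)
        k = int(drop_ratio * tv.size)
        thresh = np.sort(scores.ravel())[k] if k > 0 else -np.inf
        kept = tv * (scores >= thresh)         # drop low-importance entries
        scale = np.abs(tv).sum() / max(np.abs(kept).sum(), 1e-12)
        balanced.append(kept * scale)          # rescale the survivors
    return base + np.mean(balanced, axis=0)

base = np.zeros(6)
ft_a = np.array([1.0, 0.1, 0.0, 0.0, 2.0, 0.05])
ft_b = np.array([0.0, 0.1, 1.5, 0.05, 0.0, 0.0])
merged = pcb_merge(base, [ft_a, ft_b], drop_ratio=0.5)
print(merged.round(2))  # index 5 is dropped by both tasks -> stays 0
```

每个任务中幅值较大的参数在合并结果中占主导,而两侧都不重要的位置被彻底置零,直观体现了“参数竞争平衡”的思想。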

[NLP-61] MetaMetrics: Calibrating Metrics For Generation Tasks Using Human Preferences

【速读】: 该论文试图解决现有性能评估指标在捕捉人类偏好多样性方面的不足,即单一指标往往在某一方面表现出色,但在所有维度上表现不佳的问题。解决方案的关键在于引入MetaMetrics,这是一种经过校准的元指标,旨在通过监督方式评估不同模态的生成任务。MetaMetrics通过优化现有指标的组合,增强其与人类偏好的对齐,从而在语言和视觉下游任务中表现出灵活性和有效性,特别是在多语言和多领域场景中。该方法使得评估指标更能代表人类判断,且易于集成到各种应用中,从而提升了生成任务评估的全面性和准确性。

链接: https://arxiv.org/abs/2410.02381
作者: Genta Indra Winata,David Anugraha,Lucky Susanto,Garry Kuwanto,Derry Tanti Wijaya
关键词-EN: Understanding the quality, model outputs align, model outputs, human preferences, Understanding
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Preprint

点击查看摘要

Abstract:Understanding the quality of a performance evaluation metric is crucial for ensuring that model outputs align with human preferences. However, it remains unclear how well each metric captures the diverse aspects of these preferences, as metrics often excel in one particular area but not across all dimensions. To address this, it is essential to systematically calibrate metrics to specific aspects of human preference, catering to the unique characteristics of each aspect. We introduce MetaMetrics, a calibrated meta-metric designed to evaluate generation tasks across different modalities in a supervised manner. MetaMetrics optimizes the combination of existing metrics to enhance their alignment with human preferences. Our metric demonstrates flexibility and effectiveness in both language and vision downstream tasks, showing significant benefits across various multilingual and multi-domain scenarios. MetaMetrics aligns closely with human preferences and is highly extendable and easily integrable into any application. This makes MetaMetrics a powerful tool for improving the evaluation of generation tasks, ensuring that metrics are more representative of human judgment across diverse contexts.
摘要:理解性能评估指标的质量对于确保模型输出符合人类偏好至关重要。然而,目前尚不清楚每个指标在多大程度上能够捕捉这些偏好的多样性,因为这些指标通常在某一方面表现出色,但在所有维度上并不全面。为了解决这一问题,系统地校准指标以适应人类偏好的特定方面是必要的,以满足每个方面的独特特征。我们引入了 MetaMetrics,这是一种经过校准的元指标,旨在以监督方式评估不同模态的生成任务。MetaMetrics 通过优化现有指标的组合,增强了其与人类偏好的对齐。我们的指标在语言和视觉下游任务中表现出灵活性和有效性,在各种多语言和多领域场景中显示出显著优势。MetaMetrics 与人类偏好高度一致,并且具有高度的可扩展性和易于集成到任何应用中。这使得 MetaMetrics 成为改进生成任务评估的强大工具,确保指标在多样化的情境中更能代表人类的判断。
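MetaMetrics 的核心是“学习现有指标的组合权重,使组合分数尽量贴合人类偏好”。下面用最小二乘拟合给出一个玩具级示意(论文实际采用的监督校准与优化方法更复杂,此处仅演示思路,数据为随机构造):

```python
import numpy as np

def calibrate_weights(metric_scores, human_scores):
    """Fit non-negative weights so the weighted metric combination
    tracks human preference scores. Least squares is a stand-in for
    the paper's supervised calibration procedure."""
    w, *_ = np.linalg.lstsq(metric_scores, human_scores, rcond=None)
    return np.clip(w, 0.0, None)        # keep weights non-negative

# Toy data: 3 candidate metrics scored on 5 system outputs.
rng = np.random.default_rng(0)
M = rng.random((5, 3))                  # M[i, j] = metric j on output i
true_w = np.array([0.7, 0.0, 0.3])      # humans effectively ignore metric 2
human = M @ true_w                      # synthetic human preference scores
w = calibrate_weights(M, human)
meta_metric = M @ w                     # calibrated meta-metric per output
print(np.round(w, 2))
```

校准后,与人类评分无关的指标自然获得接近零的权重,组合分数即为面向该任务的“元指标”。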

[NLP-62] owards Comprehensive Detection of Chinese Harmful Memes

【速读】: 该论文试图解决中文有害模因(meme)检测的问题,由于缺乏可靠的数据集和有效的检测器,相关研究显著滞后。解决方案的关键在于构建了首个中文有害模因数据集ToxiCN MM,并提出了基于多模态知识增强(Multimodal Knowledge Enhancement, MKE)的基线检测器。MKE通过结合大型语言模型(LLM)生成的模因内容上下文信息,增强了对中文模因的理解,从而提升了检测效果。实验结果表明,现有模型在检测中文有害模因方面存在挑战,而MKE则展示了其有效性。

链接: https://arxiv.org/abs/2410.02378
作者: Junyu Lu,Bo Xu,Xiaokun Zhang,Hongbo Wang,Haohao Zhu,Dongyu Zhang,Liang Yang,Hongfei Lin
关键词-EN: Chinese harmful memes, Chinese harmful, detecting Chinese harmful, Harmful memes, Chinese
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper has been accepted in the NeurIPS 2024 D&B Track. Harmful memes have proliferated on the Chinese Internet, while research on detecting Chinese harmful memes significantly lags behind due to the absence of reliable datasets and effective detectors. To this end, we focus on the comprehensive detection of Chinese harmful memes. We construct ToxiCN MM, the first Chinese harmful meme dataset, which consists of 12,000 samples with fine-grained annotations for various meme types. Additionally, we propose a baseline detector, Multimodal Knowledge Enhancement (MKE), incorporating contextual information of meme content generated by the LLM to enhance the understanding of Chinese memes. During the evaluation phase, we conduct extensive quantitative experiments and qualitative analyses on multiple baselines, including LLMs and our MKE. The experimental results indicate that detecting Chinese harmful memes is challenging for existing models while demonstrating the effectiveness of MKE. The resources for this paper are available at this https URL.
摘要:本文已被 NeurIPS 2024 D&B 赛道接受。有害表情包在中国互联网上迅速蔓延,但由于缺乏可靠的数据集和有效的检测器,针对中文有害表情包的研究明显滞后。为此,我们专注于全面检测中文有害表情包。我们构建了 ToxiCN MM,这是首个中文有害表情包数据集,包含 12,000 个样本,并进行了细粒度的表情包类型标注。此外,我们提出了一种基线检测器——多模态知识增强 (Multimodal Knowledge Enhancement, MKE),该检测器结合了大语言模型 (LLM) 生成的表情包内容上下文信息,以增强对中文表情包的理解。在评估阶段,我们对多个基线模型(包括 LLM 和我们的 MKE)进行了广泛的定量实验和定性分析。实验结果表明,现有模型在检测中文有害表情包方面存在挑战,同时证明了 MKE 的有效性。本文的资源可在以下链接获取:https URL。

[NLP-63] From Concrete to Abstract: A Multimodal Generative Approach to Abstract Concept Learning

【速读】: 该论文试图解决人工智能在理解和操纵高阶抽象概念方面的挑战。解决方案的关键在于提出了一种多模态生成方法,通过整合视觉和分类语言信息,从具体概念逐步抽象到高阶抽象概念。具体步骤包括:首先将下位具体概念接地,然后结合形成基本层概念,最终通过基本层概念的接地抽象到上位概念。实验结果表明,该模型在语言理解和语言命名任务中表现出色。

链接: https://arxiv.org/abs/2410.02365
作者: Haodong Xie,Rahul Singh Maharjan,Federico Tavella,Angelo Cangelosi
关键词-EN: human intelligence, fundamental to human, Understanding and manipulating, concepts, high order abstract
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Understanding and manipulating concrete and abstract concepts is fundamental to human intelligence. Yet, they remain challenging for artificial agents. This paper introduces a multimodal generative approach to high order abstract concept learning, which integrates visual and categorical linguistic information from concrete ones. Our model initially grounds subordinate level concrete concepts, combines them to form basic level concepts, and finally abstracts to superordinate level concepts via the grounding of basic-level concepts. We evaluate the model language learning ability through language-to-visual and visual-to-language tests with high order abstract concepts. Experimental results demonstrate the proficiency of the model in both language understanding and language naming tasks.
摘要:理解和操纵具体和抽象概念是人类智能的基础。然而,这对人工智能体来说仍然是一个挑战。本文介绍了一种多模态生成式方法,用于高阶抽象概念的学习,该方法整合了从具体概念中提取的视觉和分类语言信息。我们的模型首先将下属级别的具体概念进行基础化处理,将它们组合形成基本级别的概念,最后通过基本级别概念的基础化处理抽象出上位级别的概念。我们通过语言到视觉和视觉到语言的测试,评估了模型对高阶抽象概念的语言学习能力。实验结果表明,该模型在语言理解和语言命名任务中均表现出色。

[NLP-64] AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models

【速读】: 该论文试图解决大语言模型(LLMs)在知识更新过程中因参数扰动导致的原有知识被破坏的问题。解决方案的关键在于提出了一种名为AlphaEdit的新方法,该方法通过将扰动投影到保留知识的零空间中,从而确保在更新特定知识时,模型对原有知识的输出保持不变。这一方法在理论上证明了其有效性,并在实验中显著提升了现有定位-编辑方法的性能,平均提升了36.4%。

链接: https://arxiv.org/abs/2410.02355
作者: Junfeng Fang,Houcheng Jiang,Kun Wang,Yunshan Ma,Xiang Wang,Xiangnan He,Tat-seng Chua
关键词-EN: Large language models, exhibit hallucinations due, Large language, exhibit hallucinations, hallucinations due
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) often exhibit hallucinations due to incorrect or outdated knowledge. Hence, model editing methods have emerged to enable targeted knowledge updates. To achieve this, a prevailing paradigm is the locating-then-editing approach, which first locates influential parameters and then edits them by introducing a perturbation. While effective, current studies have demonstrated that this perturbation inevitably disrupt the originally preserved knowledge within LLMs, especially in sequential editing scenarios. To address this, we introduce AlphaEdit, a novel solution that projects perturbation onto the null space of the preserved knowledge before applying it to the parameters. We theoretically prove that this projection ensures the output of post-edited LLMs remains unchanged when queried about the preserved knowledge, thereby mitigating the issue of disruption. Extensive experiments on various LLMs, including LLaMA3, GPT2-XL, and GPT-J, show that AlphaEdit boosts the performance of most locating-then-editing methods by an average of 36.4% with a single line of additional code for projection solely. Our code is available at: this https URL.
摘要:大语言模型 (LLMs) 由于知识错误或过时,常常表现出幻觉现象。因此,模型编辑方法应运而生,以实现针对性的知识更新。为此,一种主流范式是“定位-然后-编辑”方法,该方法首先定位影响参数,然后通过引入扰动来编辑这些参数。尽管这种方法有效,但现有研究表明,这种扰动不可避免地会破坏 LLMs 中原本保留的知识,特别是在连续编辑场景中。为解决这一问题,我们提出了 AlphaEdit,这是一种新颖的解决方案,它在将扰动应用于参数之前,先将其投影到保留知识的零空间中。我们通过理论证明,这种投影确保了在查询保留知识时,编辑后 LLMs 的输出保持不变,从而缓解了破坏问题。在包括 LLaMA3、GPT2-XL 和 GPT-J 在内的多种 LLMs 上进行的广泛实验表明,AlphaEdit 通过仅增加一行投影代码,平均提升了大多数“定位-然后-编辑”方法的性能达 36.4%。我们的代码可在以下链接获取:this https URL。
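AlphaEdit 的核心操作是把权重扰动投影到“保留知识键向量”的零空间中,使编辑后的权重在这些键上的输出保持不变。下面是一个基于 SVD 的极简数值示意(矩阵形状与符号为常见设定,并非论文源码):

```python
import numpy as np

def null_space_projector(K0):
    """Projector onto the null space of K0^T.

    K0 has shape (d_in, n): each column is the key vector of one piece
    of preserved knowledge. Any update right-multiplied by this
    projector has zero effect on those keys.
    """
    # Orthonormal basis U of the column space of K0 via SVD.
    U, S, _ = np.linalg.svd(K0, full_matrices=False)
    U = U[:, : int(np.sum(S > 1e-10))]
    return np.eye(K0.shape[0]) - U @ U.T   # P = I - U U^T

rng = np.random.default_rng(0)
d, n = 8, 3
K0 = rng.standard_normal((d, n))           # keys of preserved knowledge
delta = rng.standard_normal((d, d))        # raw weight perturbation
P = null_space_projector(K0)
delta_proj = delta @ P                     # projected perturbation

# The projected update has no effect on the preserved keys:
residual = float(np.abs(delta_proj @ K0).max())
print(residual)  # ~0, up to floating-point noise
```

由于 (Δ P) K0 = Δ (P K0) = 0,对保留知识键的查询输出在编辑前后完全一致,这正是论文中“零空间约束”缓解知识破坏的直观来源。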

[NLP-65] Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA

【速读】: 该论文试图解决现有大型语言模型(LLM)在多选题回答(MCQA)评估中,即使模型知道正确答案,也可能因难以遵循固定格式而选择错误选项的问题。解决方案的关键在于引入新的评估指标:Query-Key Score(QK-score)和Attention Score,这些指标基于注意力机制中的查询和键表示以及注意力权重,能够更好地捕捉和揭示模型的潜在知识。通过从特定的“选择与复制”头中提取这些分数,论文方法显著提升了知识提取能力,在多个MCQA基准测试中实现了显著的性能提升,尤其是在模型明确知道正确答案的合成数据集上,准确率提高了近60%,接近完美水平。

链接: https://arxiv.org/abs/2410.02343
作者: Eduard Tulchinskii,Laida Kushnareva,Kristian Kuznetsov,Anastasia Voznyuk,Andrei Andriiainen,Irina Piontkovskaya,Evgeny Burnaev,Serguei Barannikov
关键词-EN: LLM involves presenting, model predicted answer, LLM involves, evaluate the abilities, involves presenting
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:A standard way to evaluate the abilities of LLM involves presenting a multiple-choice question and selecting the option with the highest logit as the model’s predicted answer. However, such a format for evaluating LLMs has limitations, since even if the model knows the correct answer, it may struggle to select the corresponding letter simply due to difficulties in following this rigid format. To address this, we introduce new scores that better capture and reveal model’s underlying knowledge: the Query-Key Score (QK-score), derived from the interaction between query and key representations in attention heads, and the Attention Score, based on attention weights. These scores are extracted from specific \textitselect-and-copy heads, which show consistent performance across popular Multi-Choice Question Answering (MCQA) datasets. Based on these scores, our method improves knowledge extraction, yielding up to 16% gain for LLaMA2-7B and up to 10% for larger models on popular MCQA benchmarks. At the same time, the accuracy on a simple synthetic dataset, where the model explicitly knows the right answer, increases by almost 60%, achieving nearly perfect accuracy, therefore demonstrating the method’s efficiency in mitigating MCQA format limitations. To support our claims, we conduct experiments on models ranging from 7 billion to 70 billion parameters in both zero- and few-shot setups.
摘要:评估大语言模型 (LLM) 能力的一种标准方法是呈现多项选择题,并选择具有最高 logit 的选项作为模型的预测答案。然而,这种评估格式存在局限性,因为即使模型知道正确答案,也可能由于难以遵循这种严格的格式而难以选择相应的字母。为了解决这一问题,我们引入了新的评分标准,这些评分标准能更好地捕捉和揭示模型的潜在知识:查询-键评分 (Query-Key Score, QK-score),源自注意力头中查询和键表示之间的交互,以及基于注意力权重的注意力评分 (Attention Score)。这些评分是从特定的“选择与复制”头中提取的,这些头在流行的多选题问答 (MCQA) 数据集上表现出一致的性能。基于这些评分,我们的方法提升了知识提取能力,在流行的 MCQA 基准测试中,LLaMA2-7B 的得分提高了 16%,更大模型的得分提高了 10%。同时,在模型明确知道正确答案的简单合成数据集上,准确率提高了近 60%,几乎达到了完美准确率,从而证明了该方法在缓解 MCQA 格式局限性方面的有效性。为了支持我们的主张,我们在从 70 亿到 700 亿参数的模型上进行了实验,涵盖了零样本和少样本设置。
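QK-score 的计算可概括为:取末位 token 的 query 向量,与各选项字母位置的 key 向量做缩放点积,得分最高的选项即为预测答案。下面是一个完全虚构的小例子(正交 one-hot key 仅为演示;真实模型中这些向量取自特定的 select-and-copy 注意力头):

```python
import numpy as np

def qk_scores(q_last, keys, option_positions):
    """Dot-product score between the final-token query and each option
    letter's key, scaled as in standard attention."""
    d = q_last.shape[0]
    return keys[option_positions] @ q_last / np.sqrt(d)

d_head, seq_len = 4, 12
keys = np.zeros((seq_len, d_head))
option_positions = [2, 5, 8, 11]            # toy positions of "A".."D"
for i, p in enumerate(option_positions):    # orthogonal toy keys
    keys[p, i] = 1.0
q_last = np.array([0.0, 0.0, 1.0, 0.0])     # query aligned with option "C"

scores = qk_scores(q_last, keys, option_positions)
pred = int(np.argmax(scores))
print(pred)  # 2 -> option "C"
```

与比较各选项字母的输出 logit 不同,这种打分绕过了“按格式输出字母”这一步,因此能更直接地读出模型的潜在答案。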

[NLP-66] How Much Can RAG Help the Reasoning of LLM?

【速读】: 该论文试图解决的问题是如何通过检索增强生成(RAG)技术提升大型语言模型(LLMs)的推理能力。论文指出,尽管RAG在引入新知识和减少幻觉方面表现出色,但其对推理过程的辅助作用有限,尤其是在处理深度推理时。解决方案的关键在于提出了DPrompt调优方法,该方法通过在有限的Transformer层中进行有效的预处理,解决了文档信息中的噪声过滤难题,从而显著提升了LLMs的推理性能。

链接: https://arxiv.org/abs/2410.02338
作者: Jingyu Liu,Jiaen Lin,Yong Liu
关键词-EN: Large Language Models, modern Large Language, Language Models, Large Language, gained significant popularity
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has gained significant popularity in modern Large Language Models (LLMs) due to its effectiveness in introducing new knowledge and reducing hallucinations. However, a deep understanding of RAG remains limited: how RAG helps the reasoning process and whether RAG can improve reasoning capability remain open questions. While external documents are typically considered a method to incorporate domain-specific information, they also contain intermediate reasoning results related to the query; this suggests that documents could enhance the reasoning capability of LLMs, which has not been previously explored. In this paper, we investigate this issue in depth and find that while RAG can assist with reasoning, the help is limited. If we conceptualize the reasoning process as a tree with fixed depth, then RAG struggles to assist LLMs in performing deeper reasoning. Additionally, the information in the documents requires preprocessing to filter out noise. We demonstrate that this preprocessing is difficult to achieve by simply fine-tuning the LLM; it often necessitates numerous additional transformer layers to solve the problem. To simplify the problem, we propose DPrompt tuning, which effectively resolves the issue within just a limited number of transformer layers, leading to improved performance.
摘要:检索增强生成 (Retrieval-Augmented Generation, RAG) 在现代大语言模型 (Large Language Models, LLMs) 中获得了显著的流行,因其有效引入新知识和减少幻觉的能力。然而,对 RAG 的深入理解仍然有限,RAG 如何帮助推理过程以及 RAG 是否能提升推理能力仍是一个问题。虽然外部文档通常被视为引入领域特定信息的方法,但它们也包含与查询相关的中间推理结果,这表明文档可以增强 LLMs 的推理能力,这一观点此前未被深入探讨。本文深入研究了这一问题,并发现尽管 RAG 可以辅助推理,但其帮助有限。如果我们将推理过程概念化为具有固定深度的树,那么 RAG 难以帮助 LLMs 进行更深层次的推理。此外,文档中的信息需要预处理以过滤噪声。我们证明,这种预处理难以通过简单的 LLM 微调实现,通常需要大量的额外 Transformer 层来解决问题。为了简化问题,我们提出了 DPrompt 微调,该方法在有限的 Transformer 层内有效解决了问题,从而提升了性能。

[NLP-67] Llama SLayer 8B: Shallow Layers Hold the Key to Knowledge Injection

【速读】: 该论文试图解决的问题是:在预训练大型语言模型(LLM)中,是否所有层对于知识注入都同等重要。解决方案的关键在于提出了S策略,即在预训练后策略性地增强浅层(shallow layers),同时修剪深层(deep layers)中效果较差的层。通过这种方式,论文展示了在代码和数学语料库上的有效性,并进一步在不同的LLM(如Mistral-7B)和法律语料库上验证了该策略的广泛适用性。

链接: https://arxiv.org/abs/2410.02330
作者: Tianxiang Chen,Zhentao Tan,Tao Gong,Yue Wu,Qi Chu,Bin Liu,Jieping Ye,Nenghai Yu
关键词-EN: large language models, domain large models, augment pre-trained large, vertical domain large, pre-trained large language
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As a manner to augment pre-trained large language models (LLM), knowledge injection is critical to develop vertical domain large models and has been widely studied. Although most current approaches, including parameter-efficient fine-tuning (PEFT) and block expansion methods, uniformly apply knowledge across all LLM layers, it raises the question: are all layers equally crucial for knowledge injection? We begin by evaluating the importance of each layer in finding the optimal layer range for knowledge injection. Intuitively, the more important layers should play a more critical role in knowledge injection and deserve a denser injection. We observe performance dips in question-answering benchmarks after the removal or expansion of the shallow layers, and the degradation shrinks as the layer gets deeper, indicating that the shallow layers hold the key to knowledge injection. This insight leads us to propose the S strategy, a post-pretraining strategy of selectively enhancing shallow layers while pruning the less effective deep ones. Based on this strategy, we introduce Llama Slayer-8B and Llama Slayer-8B-Instruct. We experimented on the corpus of code & math and demonstrated the effectiveness of our strategy. Further experiments across different LLM, Mistral-7B, and a legal corpus confirmed the general applicability of the approach, underscoring its wide-ranging efficacy. Our code is available at: this https URL
摘要:作为一种增强预训练大语言模型 (LLM) 的方法,知识注入对于开发垂直领域大模型至关重要,并已得到广泛研究。尽管当前大多数方法,包括参数高效微调 (PEFT) 和块扩展方法,均在所有 LLM 层中均匀应用知识,但这一做法引发了一个问题:所有层对于知识注入的重要性是否相同?我们首先评估了各层在寻找知识注入最佳层范围中的重要性。直观上,越重要的层在知识注入中应发挥更关键的作用,并值得更密集的注入。我们观察到,在移除或扩展浅层后,问答基准测试中的性能出现下降,且随着层深的增加,性能下降幅度减小,这表明浅层是知识注入的关键。这一发现促使我们提出了 S 策略,这是一种在预训练后策略中选择性增强浅层同时修剪效果较差的深层的方法。基于此策略,我们引入了 Llama Slayer-8B 和 Llama Slayer-8B-Instruct。我们在代码和数学语料库上进行了实验,证明了该策略的有效性。进一步在不同 LLM(如 Mistral-7B)和法律语料库上的实验证实了该方法的普遍适用性,突显了其广泛的效用。我们的代码可在以下链接获取:\this https URL
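论文评估各层重要性的做法可以用如下玩具草图示意(假设性简化:用一个逐层相乘的“模型”代替真实 LLM,用输出值代替问答基准得分):逐层移除并比较性能下降,下降越大说明该层越重要。

```python
def layer_importance(layers, evaluate):
    """草图:逐层移除并测量性能下降,下降越大说明该层越重要。"""
    base = evaluate(layers)
    return {i: base - evaluate(layers[:i] + layers[i + 1:]) for i in range(len(layers))}

# 玩具“模型”:每层对输入乘一个系数,浅层系数偏离 1 更多,贡献更大
layers = [2.0, 1.1, 1.01, 1.001]

def product(ls):
    out = 1.0
    for w in ls:
        out *= w
    return out

drops = layer_importance(layers, product)
most_important = max(drops, key=drops.get)
print(most_important)  # 输出: 0(最浅的层)
```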

[NLP-68] Post-edits Are Preferences Too

【速读】: 该论文试图解决在机器翻译中难以获取可靠的成对偏好反馈的问题。解决方案的关键在于利用后编辑(post-editing)过程中编辑者对翻译质量的隐含偏好,将其用于偏好优化(Preference Optimization, PO)技术中。具体来说,论文提出通过预训练模型使用监督微调(Supervised Fine-Tuning, SFT)方法,将后编辑数据用于模型训练,以促进模型生成更接近后编辑质量的翻译结果,从而提高机器翻译的性能。

链接: https://arxiv.org/abs/2410.02320
作者: Nathaniel Berger,Stefan Riezler,Miriam Exel,Matthias Huck
关键词-EN: Preference Optimization, machine translation, art techniques, Optimization, machine
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: To appear at the Ninth Conference on Machine Translation (WMT24)

点击查看摘要

Abstract:Preference Optimization (PO) techniques are currently one of the state of the art techniques for fine-tuning large language models (LLMs) on pairwise preference feedback from human annotators. However, in machine translation, this sort of feedback can be difficult to solicit. Additionally, Kreutzer et al. (2018) have shown that, for machine translation, pairwise preferences are less reliable than other forms of human feedback, such as 5-point ratings. We examine post-edits to see if they can be a source of reliable human preferences by construction. In PO, a human annotator is shown sequences s_1 and s_2 and asked for a preference judgment, s_1 > s_2; while for post-editing, editors create s_1 and know that it should be better than s_2. We attempt to use these implicit preferences for PO and show that it helps the model move towards post-edit-like hypotheses and away from machine translation-like hypotheses. Furthermore, we show that best results are obtained by pre-training the model with supervised fine-tuning (SFT) on post-edits in order to promote post-edit-like hypotheses to the top output ranks.
摘要:偏好优化 (Preference Optimization, PO) 技术是目前用于在人类标注者的成对偏好反馈上微调大语言模型 (Large Language Models, LLMs) 的先进技术之一。然而,在机器翻译领域,这种反馈往往难以获取。此外,Kreutzer 等人 (2018) 的研究表明,对于机器翻译而言,成对偏好相比其他形式的人类反馈(如 5 点评分)可靠性较低。我们通过分析后编辑来探讨其是否可以作为可靠的人类偏好来源。在 PO 中,人类标注者会被展示序列 ( s_1 ) 和 ( s_2 ),并要求做出偏好判断,即 ( s_1 ) 优于 ( s_2 );而在后编辑中,编辑者创建 ( s_1 ) 并知晓它应优于 ( s_2 )。我们尝试利用这些隐含的偏好进行 PO,并展示这有助于模型生成更接近后编辑假设而非机器翻译假设的结果。此外,我们发现通过在后编辑上进行监督微调 (Supervised Fine-Tuning, SFT) 预训练模型,可以促进后编辑假设成为最高输出排名的结果。
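下面以 DPO 式目标为例,给出利用“译后编辑优于机器译文”这一隐含偏好的损失草图(假设性示意,数值为虚构的序列对数概率;论文实际采用的 PO 目标以原文为准):

```python
import math

def dpo_loss(lp_w, lp_l, ref_lp_w, ref_lp_l, beta=0.1):
    """DPO 式偏好损失草图:把译后编辑视为偏好序列 (w),机器译文视为非偏好序列 (l)。"""
    margin = beta * ((lp_w - ref_lp_w) - (lp_l - ref_lp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# 虚构的序列对数概率:策略模型更偏向译后编辑时,损失更小
good_case = dpo_loss(lp_w=-5.0, lp_l=-9.0, ref_lp_w=-7.0, ref_lp_l=-7.0)
bad_case = dpo_loss(lp_w=-9.0, lp_l=-5.0, ref_lp_w=-7.0, ref_lp_l=-7.0)
print(good_case < bad_case)  # 输出: True
```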


[NLP-69] Traffic Light or Light Traffic? Investigating Phrasal Semantics in Large Language Models EMNLP2024

【速读】: 该论文试图解决基于API的大型语言模型(LLMs)在理解短语语义方面的能力问题。解决方案的关键在于通过三个人类标注的数据集评估LLMs在执行短语语义推理任务中的表现,并探讨常见提示技术(如少样本演示和思维链推理)的影响。研究发现,LLMs在数据集上显著优于传统的嵌入方法,但在与微调方法相比时并未显示出显著优势。高级提示策略的有效性表现出差异性,并通过详细的错误分析揭示了LLMs在理解短语语义方面的局限性。

链接: https://arxiv.org/abs/2410.02308
作者: Rui Meng,Ye Liu,Lifu Tu,Daqing He,Yingbo Zhou,Semih Yavuz
关键词-EN: fundamental linguistic units, humans convey semantics, fundamental linguistic, linguistic units, humans convey
类目: Computation and Language (cs.CL)
备注: EMNLP 2024

点击查看摘要

Abstract:Phrases are fundamental linguistic units through which humans convey semantics. This study critically examines the capacity of API-based large language models (LLMs) to comprehend phrase semantics, utilizing three human-annotated datasets. We assess the performance of LLMs in executing phrase semantic reasoning tasks guided by natural language instructions and explore the impact of common prompting techniques, including few-shot demonstrations and Chain-of-Thought reasoning. Our findings reveal that LLMs greatly outperform traditional embedding methods across the datasets; however, they do not show a significant advantage over fine-tuned methods. The effectiveness of advanced prompting strategies shows variability. We conduct detailed error analyses to interpret the limitations faced by LLMs in comprehending phrase semantics. Code and data can be found at this https URL.
摘要:短语是人类传达语义的基本语言单位。本研究深入探讨了基于 API 的大语言模型 (LLM) 理解短语语义的能力,使用了三个人工标注的数据集。我们评估了 LLM 在执行由自然语言指令引导的短语语义推理任务中的表现,并探讨了包括少样本演示和思维链 (Chain-of-Thought) 推理在内的常见提示技术的影响。研究结果表明,LLM 在数据集上的表现显著优于传统的嵌入方法;然而,它们并未显示出对微调方法的显著优势。高级提示策略的有效性表现出差异性。我们进行了详细的错误分析,以解释 LLM 在理解短语语义时面临的局限性。代码和数据可在以下链接找到:this https URL。

[NLP-70] Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models

【速读】: 该论文试图解决大语言模型(LLMs)在面对“越狱攻击”时如何平衡安全性和实用性的问题。解决方案的关键在于引入“Jailbreak Antidote”方法,通过在推理过程中实时调整模型内部状态的稀疏子集,沿着安全方向调整模型的隐藏表示,从而在不增加计算开销或推理延迟的情况下,灵活控制安全性和实用性的平衡。该方法利用了LLMs内部安全相关信息分布的稀疏性,只需调整约5%的内部状态即可达到与修改整个状态相当的效果,从而提供了一种轻量级且可扩展的安全增强方案。

链接: https://arxiv.org/abs/2410.02298
作者: Guobin Shen,Dongcheng Zhao,Yiting Dong,Xiang He,Yi Zeng
关键词-EN: large language models, large language, safety, Jailbreak Antidote, Jailbreak
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:As large language models (LLMs) become integral to various applications, ensuring both their safety and utility is paramount. Jailbreak attacks, which manipulate LLMs into generating harmful content, pose significant challenges to this balance. Existing defenses, such as prompt engineering and safety fine-tuning, often introduce computational overhead, increase inference latency, and lack runtime flexibility. Moreover, overly restrictive safety measures can degrade model utility by causing refusals of benign queries. In this paper, we introduce Jailbreak Antidote, a method that enables real-time adjustment of LLM safety preferences by manipulating a sparse subset of the model’s internal states during inference. By shifting the model’s hidden representations along a safety direction with varying strengths, we achieve flexible control over the safety-utility balance without additional token overhead or inference delays. Our analysis reveals that safety-related information in LLMs is sparsely distributed; adjusting approximately 5% of the internal state is as effective as modifying the entire state. Extensive experiments on nine LLMs (ranging from 2 billion to 72 billion parameters), evaluated against ten jailbreak attack methods and compared with six defense strategies, validate the effectiveness and efficiency of our approach. By directly manipulating internal states during reasoning, Jailbreak Antidote offers a lightweight, scalable solution that enhances LLM safety while preserving utility, opening new possibilities for real-time safety mechanisms in widely-deployed AI systems.
摘要:随着大语言模型 (LLM) 在各种应用中变得不可或缺,确保其安全性和实用性至关重要。越狱攻击通过操纵 LLM 生成有害内容,对这种平衡构成了重大挑战。现有的防御措施,如提示工程和安全微调,通常会引入计算开销,增加推理延迟,并且缺乏运行时灵活性。此外,过于严格的安全措施可能会通过拒绝良性查询来降低模型的实用性。在本文中,我们介绍了越狱解毒剂 (Jailbreak Antidote),这是一种通过在推理过程中操纵模型内部状态的稀疏子集来实时调整 LLM 安全偏好的方法。通过沿着不同强度的安全方向调整模型的隐藏表示,我们实现了对安全-实用性平衡的灵活控制,而无需额外的 Token 开销或推理延迟。我们的分析表明,LLM 中的安全相关信息是稀疏分布的;调整约 5% 的内部状态与修改整个状态同样有效。在九个 LLM(参数范围从 20 亿到 720 亿)上进行的广泛实验,针对十种越狱攻击方法进行评估,并与六种防御策略进行比较,验证了我们方法的有效性和效率。通过在推理过程中直接操纵内部状态,越狱解毒剂提供了一种轻量级、可扩展的解决方案,增强了 LLM 的安全性同时保留了实用性,为广泛部署的 AI 系统中的实时安全机制开辟了新的可能性。
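其核心操作可以用如下草图示意(假设性简化实现,安全方向 d 与隐藏状态 h 均为玩具数据):只在 |d| 最大的约 5% 维度上,把隐藏状态沿安全方向平移,强度由 alpha 控制。

```python
def sparse_safety_shift(h, d, alpha=1.0, sparsity=0.05):
    """草图:沿安全方向 d 调整隐藏状态 h,仅作用于 |d| 最大的 sparsity 比例维度。"""
    k = max(1, int(len(h) * sparsity))
    top = sorted(range(len(d)), key=lambda i: abs(d[i]), reverse=True)[:k]
    h2 = list(h)
    for i in top:
        h2[i] += alpha * d[i]
    return h2

h = [0.1] * 20
d = [0.0] * 20
d[3], d[7] = 1.0, -0.5   # 安全方向集中在少数维度上
out = sparse_safety_shift(h, d, alpha=2.0, sparsity=0.05)
# 20 维的 5% 即 1 个维度,只有 |d| 最大的第 3 维被调整
print(out[3], out[7])  # 输出: 2.1 0.1
```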

[NLP-71] Make Compound Sentences Simple to Analyze: Learning to Split Sentences for Aspect-based Sentiment Analysis EMNLP2024

【速读】: 该论文试图解决在基于方面的情感分析(ABSA)中,复合句中情感四元组提取的复杂性问题。解决方案的关键在于提出了一种名为Aspect Term Oriented Sentence Splitter(ATOSS)的模块,该模块能够将复合句简化为更清晰、更简单的形式,从而便于识别情感四元组。ATOSS作为一个即插即用的模块,不仅保留了ABSA模型的参数,还显著提高了在ASQP和ACOS任务中的性能,这些任务主要用于提取情感四元组。

链接: https://arxiv.org/abs/2410.02297
作者: Yongsik Seo,Sungwon Song,Ryang Heo,Jieyong Kim,Dongha Lee
关键词-EN: Aspect-Based Sentiment Analysis, achieved substantial advancements, Sentiment Analysis, extracting sentiment quadruplets, shown promising results
类目: Computation and Language (cs.CL)
备注: Accepted at EMNLP 2024 (Findings, long paper)

点击查看摘要

Abstract:In the domain of Aspect-Based Sentiment Analysis (ABSA), generative methods have shown promising results and achieved substantial advancements. However, despite these advancements, the tasks of extracting sentiment quadruplets, which capture the nuanced sentiment expressions within a sentence, remain significant challenges. In particular, compound sentences can potentially contain multiple quadruplets, making the extraction task increasingly difficult as sentence complexity grows. To address this issue, we are focusing on simplifying sentence structures to facilitate the easier recognition of these elements and crafting a model that integrates seamlessly with various ABSA tasks. In this paper, we propose Aspect Term Oriented Sentence Splitter (ATOSS), which simplifies compound sentence into simpler and clearer forms, thereby clarifying their structure and intent. As a plug-and-play module, this approach retains the parameters of the ABSA model while making it easier to identify essential intent within input sentences. Extensive experimental results show that utilizing ATOSS outperforms existing methods in both ASQP and ACOS tasks, which are the primary tasks for extracting sentiment quadruplets.
摘要:在基于方面的情感分析 (Aspect-Based Sentiment Analysis, ABSA) 领域,生成式方法已展现出显著的成果并取得了重大进展。然而,尽管取得了这些进展,提取情感四元组 (sentiment quadruplets),即捕捉句子中微妙情感表达的任务,仍然面临重大挑战。特别是,复合句可能包含多个四元组,随着句子复杂性的增加,提取任务变得更加困难。为解决这一问题,我们致力于简化句子结构,以便更容易地识别这些元素,并构建一个能够无缝集成到各种 ABSA 任务中的模型。本文提出了一种面向方面术语的句子分割器 (Aspect Term Oriented Sentence Splitter, ATOSS),它将复合句简化为更简单、更清晰的形式,从而明确其结构和意图。作为一个即插即用模块,这种方法在保留 ABSA 模型参数的同时,使得识别输入句子中的关键意图变得更加容易。广泛的实验结果表明,使用 ATOSS 在 ASQP 和 ACOS 任务中均优于现有方法,这些任务是提取情感四元组的主要任务。

[NLP-72] Language Models are Graph Learners

【速读】: 该论文试图解决的问题是如何在不修改语言模型(LM)架构的前提下,使其在节点分类任务中达到与图神经网络(GNN)和图变换器(GT)相媲美的性能。解决方案的关键在于通过两种增强策略来提升LM的性能:一是通过拓扑和语义检索方法丰富LM的输入,提供更丰富的上下文信息;二是通过轻量级的GNN分类器指导LM的分类过程,有效筛选候选类别。这些策略使得预训练的Flan-T5模型在节点分类任务中表现优异,不仅超越了现有的文本输出节点分类器,还与顶级的向量输出节点分类器相当。

链接: https://arxiv.org/abs/2410.02296
作者: Zhe Xu,Kaveh Hassani,Si Zhang,Hanqing Zeng,Michihiro Yasunaga,Limei Wang,Dongqi Fu,Ning Yao,Bo Long,Hanghang Tong
关键词-EN: Graph Neural Networks, including Graph Neural, Neural Networks, Graph Transformers, Graph Neural
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Language Models (LMs) are increasingly challenging the dominance of domain-specific models, including Graph Neural Networks (GNNs) and Graph Transformers (GTs), in graph learning tasks. Following this trend, we propose a novel approach that empowers off-the-shelf LMs to achieve performance comparable to state-of-the-art GNNs on node classification tasks, without requiring any architectural modification. By preserving the LM’s original architecture, our approach retains a key benefit of LM instruction tuning: the ability to jointly train on diverse datasets, fostering greater flexibility and efficiency. To achieve this, we introduce two key augmentation strategies: (1) Enriching LMs’ input using topological and semantic retrieval methods, which provide richer contextual information, and (2) guiding the LMs’ classification process through a lightweight GNN classifier that effectively prunes class candidates. Our experiments on real-world datasets show that backbone Flan-T5 models equipped with these augmentation strategies outperform state-of-the-art text-output node classifiers and are comparable to top-performing vector-output node classifiers. By bridging the gap between specialized task-specific node classifiers and general LMs, this work paves the way for more versatile and widely applicable graph learning models. We will open-source the code upon publication.
摘要:语言模型 (Language Models, LMs) 正逐渐挑战领域特定模型(包括图神经网络 (Graph Neural Networks, GNNs) 和图 Transformer (Graph Transformers, GTs))在图学习任务中的主导地位。顺应这一趋势,我们提出了一种新颖的方法,使现成的 LMs 能够在节点分类任务上达到与最先进的 GNNs 相媲美的性能,而无需进行任何架构修改。通过保留 LM 的原始架构,我们的方法保留了 LM 指令调优的关键优势:能够在多样化的数据集上进行联合训练,从而提高灵活性和效率。为实现这一目标,我们引入了两种关键的增强策略:(1) 通过拓扑和语义检索方法丰富 LMs 的输入,提供更丰富的上下文信息;(2) 通过轻量级的 GNN 分类器指导 LMs 的分类过程,有效修剪类别候选。我们在真实世界数据集上的实验表明,配备这些增强策略的 Flan-T5 模型在文本输出节点分类器中表现优于最先进的模型,并与顶级向量输出节点分类器相媲美。通过弥合特定任务节点分类器与通用 LMs 之间的差距,这项工作为更通用且广泛适用的图学习模型铺平了道路。我们将在发表后开源代码。
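两种增强策略的配合方式可以用一个提示模板草图示意(假设性模板,非论文原始实现):把拓扑/语义检索到的邻居文本拼入上下文,并只让语言模型在轻量 GNN 预筛后的候选类别中作答。

```python
def build_node_prompt(node_text, neighbor_texts, candidate_labels):
    """草图:邻居文本提供上下文,GNN 预筛的候选类别约束输出空间。"""
    neighbors = "\n".join(f"- {t}" for t in neighbor_texts)
    labels = ", ".join(candidate_labels)
    return (f"目标节点描述:{node_text}\n"
            f"相邻节点描述:\n{neighbors}\n"
            f"请从以下候选类别中选择一个:{labels}")

prompt = build_node_prompt(
    "一篇关于图神经网络的论文",
    ["一篇关于消息传递的论文", "一篇关于节点嵌入的论文"],
    ["机器学习", "数据库"],   # 假设已由轻量 GNN 分类器从全部类别中筛出
)
print(prompt.splitlines()[0])
```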

[NLP-73] Efficient Second-Order Neural Network Optimization via Adaptive Trust Region Methods

【速读】: 该论文试图解决传统二阶优化方法在训练深度神经网络时面临的计算复杂度和内存需求过高的问题。解决方案的关键在于提出了SecondOrderAdaptiveAdam (SOAA)算法,该算法通过使用Fisher信息矩阵的对角表示来降低计算复杂度(从(O(n^2))降至(O(n))),并结合自适应信任域机制,动态调整信任域大小以确保稳健的收敛和计算效率。这一方法使得SOAA适用于大规模深度学习模型,包括大型语言模型(LLMs),并在实验中展示了比一阶优化器(如Adam)更快的收敛速度和更稳定的性能。

链接: https://arxiv.org/abs/2410.02293
作者: James Vo
关键词-EN: offer notable advantages, methods offer notable, utilizing curvature information, training deep neural, deep neural networks
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Second-order optimization methods offer notable advantages in training deep neural networks by utilizing curvature information to achieve faster convergence. However, traditional second-order techniques are computationally prohibitive, primarily due to the large matrix inversions and high memory demands they require. While adaptive trust-region methods have been developed to mitigate these issues, their performance is often hindered by conservative estimates of key parameters, such as the Lipschitz constant of the Hessian, resulting in suboptimal outcomes. In this paper, we introduce SecondOrderAdaptiveAdam (SOAA), a novel optimization algorithm designed to overcome these limitations. SOAA approximates the Fisher information matrix using a diagonal representation, reducing computational complexity from (O(n^2)) to (O(n)), thereby making it suitable for large-scale deep learning models, including large language models (LLMs). Additionally, the algorithm integrates an adaptive trust-region mechanism that dynamically adjusts the trust region size based on observed loss reduction, ensuring both robust convergence and computational efficiency. We empirically demonstrate that SOAA achieves faster and more stable convergence compared to first-order optimizers, such as Adam, under similar computational constraints. However, the diagonal approximation of the Fisher information matrix may be less effective in capturing higher-order interactions between gradients, suggesting potential areas for further refinement and future research.
摘要:二阶优化方法通过利用曲率信息,在训练深度神经网络时展现出显著的优势,能够实现更快的收敛。然而,传统的二阶技术由于需要进行大规模矩阵求逆和高内存需求,计算成本过高。尽管已经开发了自适应信赖域方法来缓解这些问题,但其性能往往受到关键参数(如Hessian的Lipschitz常数)保守估计的限制,导致结果次优。本文中,我们提出了SecondOrderAdaptiveAdam (SOAA),一种新型优化算法,旨在克服这些局限。SOAA采用对角表示法近似Fisher信息矩阵,将计算复杂度从 (O(n^2)) 降低到 (O(n)),从而使其适用于包括大语言模型 (LLMs) 在内的大规模深度学习模型。此外,该算法集成了自适应信赖域机制,根据观察到的损失减少动态调整信赖域大小,确保了稳健的收敛性和计算效率。我们通过实证证明,在相似的计算约束下,SOAA相比一阶优化器(如Adam)实现了更快且更稳定的收敛。然而,Fisher信息矩阵的对角近似可能无法有效捕捉梯度之间的高阶交互作用,这为未来的改进和研究提供了潜在的方向。
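下面在一个玩具二次目标上给出该思路的假设性草图(非论文原始算法):用梯度平方的滑动平均近似对角 Fisher(复杂度 O(n)),并按观察到的损失变化自适应地扩大或收缩信赖域。

```python
import math

def soaa_step(theta, grad, fisher_diag, radius, beta=0.9, eps=1e-8):
    """一步更新草图:对角 Fisher 近似(O(n))+ 信赖域裁剪。"""
    # 对角 Fisher ≈ 梯度平方的滑动平均
    fisher_diag = [beta * f + (1 - beta) * g * g for f, g in zip(fisher_diag, grad)]
    step = [g / (math.sqrt(f) + eps) for g, f in zip(grad, fisher_diag)]
    # 信赖域:将步长范数裁剪到 radius 以内
    norm = math.sqrt(sum(s * s for s in step))
    scale = min(1.0, radius / (norm + eps))
    return [t - scale * s for t, s in zip(theta, step)], fisher_diag

# 玩具目标 f(θ) = Σ θ_i²(梯度为 2θ),按观察到的损失变化调整信赖域
theta, fisher, radius = [3.0, -2.0], [0.0, 0.0], 0.5
prev_loss = sum(t * t for t in theta)
for _ in range(100):
    grad = [2 * t for t in theta]
    new_theta, new_fisher = soaa_step(theta, grad, fisher, radius)
    new_loss = sum(t * t for t in new_theta)
    if new_loss < prev_loss:               # 损失下降:接受该步并扩大信赖域
        theta, fisher, prev_loss = new_theta, new_fisher, new_loss
        radius = min(radius * 1.5, 1.0)
    else:                                  # 损失上升:拒绝该步并收缩信赖域
        radius *= 0.5
print(prev_loss < 1e-3)  # 输出: True
```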

[NLP-74] Correlation and Navigation in the Vocabulary Key Representation Space of Language Models

【速读】: 该论文试图解决语言模型在解码过程中由于键分布相似性导致的伪相关问题,即在下一个词预测(NTP)分布中,中间排名的预测结果倾向于与高排名结果分布相似而非语义相似的词,从而影响采样多样性和长尾结果的准确性。解决方案的关键在于提出了一种上下文内方法,通过迭代地将查询表示从已探索区域推离,具体做法是将已探索的解码结果纳入上下文,并提示模型生成其他内容,从而鼓励模型生成与已探索键点积较小的查询表示,实验结果表明该方法能有效导航至新的正确键,并提高生成多样性和自一致性投票性能。

链接: https://arxiv.org/abs/2410.02284
作者: Letian Peng,Chenyang An,Jingbo Shang
关键词-EN: NTP distribution, Language model, NTP, probability distribution, distribution
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Language model (LM) decoding is based on the next-token prediction (NTP) probability distribution. For neural LMs (e.g., Transformer-based), NTP distribution is essentially a softmax-regularized dot product between an encoded input context (query) and fixed vocabulary representations (keys). In this paper, we study the effect of the key distribution on the NTP distribution, with a focus on whether the similarity between keys will trigger spurious correlations in NTP. Through knowledge-probing tasks, we show that in the NTP distribution, the few top-ranked tokens are typically accurate. However, the middle-ranked prediction is highly biased towards the tokens that are distributionally (not necessarily semantically) similar to these top ones. For instance, if “P” is predicted as the top-1 token, “A”-“Z” will all be ranked high in NTP, no matter whether they can lead to correct decoding results. This hurts the sampling diversity and makes the sampling of correct, long-tail results hopeless and noisy. We attempt to alleviate this issue via a novel in-context method that iteratively pushes the query representation away from explored regions. Specifically, we include the explored decoding results in the context and prompt the LM to generate something else, which encourages the LM to produce a query representation that has small dot products with explored keys. Experiments on knowledge-probing tasks show that our method leads to efficient navigation away from explored keys to correct new keys. We further extend our method to open-ended and chain-of-thought (for reasoning) generation. Experiment results show that ICN contributes to better generation diversity and improved self-consistency voting performance. Finally, we discuss potential training issues caused by the fixed key space together with the challenges and possible ways to address them in future research.
摘要: 语言模型 (LM) 的解码基于下一个 Token 预测 (NTP) 的概率分布。对于神经语言模型(例如基于 Transformer 的模型),NTP 分布本质上是一个在编码输入上下文(查询)和固定词汇表表示(键)之间的 softmax 正则化点积。本文研究了键分布对 NTP 分布的影响,重点关注键之间的相似性是否会触发 NTP 中的虚假相关性。通过知识探查任务,我们发现 NTP 分布中,排名靠前的 Token 通常是准确的。然而,中间排名的预测高度偏向于那些在分布上(不一定在语义上)与这些顶部 Token 相似的 Token。例如,如果“P”被预测为排名第一的 Token,那么“A”到“Z”的所有字母在 NTP 中都会被高度排名,无论它们是否能导致正确的解码结果。这损害了采样的多样性,使得正确且长尾结果的采样变得无望且嘈杂。我们尝试通过一种新颖的上下文内方法来缓解这一问题,该方法通过迭代地将查询表示推离已探索的区域。具体来说,我们将已探索的解码结果包含在上下文中,并提示 LM 生成其他内容,这鼓励 LM 生成与已探索键的点积较小的查询表示。在知识探查任务上的实验表明,我们的方法能够有效地从已探索的键导航到正确的新键。我们进一步将我们的方法扩展到开放式和链式思维(用于推理)生成。实验结果表明,ICN 有助于提高生成多样性和改进自一致性投票性能。最后,我们讨论了由固定键空间引起的潜在训练问题,以及未来研究中应对这些挑战的可能方法。
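“NTP 分布本质上是查询与键的点积经 softmax 归一化”这一点可以用一个玩具词表直接验证(假设性示例,向量为虚构):键向量相似的两个词会得到相近的 logit,从而在排名上紧密相随,与语义无关。

```python
import math

def ntp_distribution(q, keys):
    """NTP 分布草图:上下文查询 q 与各词表键向量的点积经 softmax 归一化。"""
    logits = {w: sum(a * b for a, b in zip(q, k)) for w, k in keys.items()}
    m = max(logits.values())
    exps = {w: math.exp(v - m) for w, v in logits.items()}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}

# 虚构的 3 词词表:"Q" 的键与 "P" 的键在分布上几乎相同,"cat" 方向不同
keys = {
    "P": [1.0, 0.0, 0.2],
    "Q": [0.9, 0.1, 0.2],
    "cat": [0.0, 1.0, -0.3],
}
q = [2.0, 0.1, 0.0]  # 偏向 "P" 的上下文查询
p = ntp_distribution(q, keys)
# 键相似使 "Q" 紧随 "P" 排名,与语义无关,正是文中描述的伪相关
print(p["P"] > p["Q"] > p["cat"])  # 输出: True
```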

[NLP-75] Morphological evaluation of subwords vocabulary used by BETO language model

【速读】: 该论文试图解决的问题是评估大型语言模型(如BETO)中使用的子词分词算法生成的词汇的形态学质量。解决方案的关键在于提出了一种基于相关性、内聚性和形态学准确性三个质量指标的评估方法,并通过该方法对BPE、Wordpiece和Unigram三种子词分词算法生成的词汇进行了评估,发现这些词汇的形态学质量普遍较低。此外,论文还验证了在更大语料库上训练分词器并不能显著提高生成的词汇的形态学质量。

链接: https://arxiv.org/abs/2410.02283
作者: Óscar García-Sierra,Ana Fernández-Pampillón Cesteros,Miguel Ortega-Martín
关键词-EN: Subword tokenization algorithms, morphological quality, human intervention, significantly more efficient, independently build
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: in Spanish language

点击查看摘要

Abstract:Subword tokenization algorithms used by Large Language Models are significantly more efficient and can independently build the necessary vocabulary of words and subwords without human intervention. However, those subwords do not always align with real morphemes, potentially impacting the models’ performance, though it remains uncertain when this might occur. In previous research, we proposed a method to assess the morphological quality of vocabularies, focusing on the overlap between these vocabularies and the morphemes of a given language. Our evaluation method was built on three quality measures, relevance, cohesion, and morphological accuracy, and a procedure for their assessment. By applying this method to vocabularies created by three subword tokenization algorithms, BPE, Wordpiece, and Unigram, we concluded that these vocabularies generally exhibit very low morphological quality. In this article, we apply this evaluation to the tokenizer of BETO, a BERT language model trained on large Spanish corpora. This evaluation, along with our previous results, helped us conclude that its vocabulary has a low morphological quality, and we also found that training the tokenizer in a larger corpus does not improve the morphological quality of the generated vocabulary. Additionally, this evaluation helps clarify the algorithm used by the tokenizer, that is, Wordpiece, given the inconsistencies between the authors’ claims and the model’s configuration.
摘要:大语言模型所使用的子词 Token 化算法显著提高了效率,并且能够在无人干预的情况下独立构建必要的词汇和子词。然而,这些子词并不总是与实际的词素对齐,这可能会影响模型的性能,尽管目前尚不确定这种情况何时会发生。在先前的研究中,我们提出了一种评估词汇形态质量的方法,重点考察这些词汇与给定语言词素的重叠情况。我们的评估方法基于三个质量指标:相关性、内聚性和形态准确性,以及相应的评估程序。通过将这种方法应用于由三种子词 Token 化算法(BPE、Wordpiece 和 Unigram)生成的词汇,我们得出结论,这些词汇的形态质量普遍较低。在本文中,我们将此评估方法应用于 BETO 的 Token 化器,BETO 是一个基于大型西班牙语语料库训练的 BERT 语言模型。结合我们之前的结果,这一评估帮助我们得出结论,BETO 的词汇形态质量较低,并且我们还发现,在更大的语料库中训练 Token 化器并不能提高生成的词汇的形态质量。此外,这一评估还有助于澄清 Token 化器所使用的算法,即 Wordpiece,因为作者的声明与模型的配置之间存在不一致。
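词表与词素重叠度的评估思路可以用如下草图示意(假设性的简化指标,仅示意“重叠”的计算;论文中相关性、内聚性与形态准确性的正式定义以原文为准):

```python
def morphological_quality(vocab, morphemes):
    """草图:以词表与词素集合的重叠近似度量形态质量。"""
    vocab, morphemes = set(vocab), set(morphemes)
    overlap = vocab & morphemes
    precision = len(overlap) / len(vocab)     # 词表条目中属于真实词素的比例
    recall = len(overlap) / len(morphemes)    # 词素被词表覆盖的比例
    return precision, recall

# 虚构的西班牙语示例:Wordpiece 的续接片段 "##ir" 与词素 "ir" 并不对齐
vocab = ["des", "cubr", "##ir", "imiento", "xyz"]
morphemes = ["des", "cubr", "imiento", "ir"]
p, r = morphological_quality(vocab, morphemes)
print(round(p, 2), round(r, 2))  # 输出: 0.6 0.75
```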

[NLP-76] Annotation Guidelines for Corpus Novelties: Part 1 – Named Entity Recognition

【速读】: 该论文旨在解决小说文本中的命名实体识别(NER)问题,并为此提供了一套详细的标注指南。解决方案的关键在于制定明确的标注规则和标准,通过具体的示例展示哪些表达应被标记为实体,哪些不应被标记,从而确保标注过程的一致性和准确性。

链接: https://arxiv.org/abs/2410.02281
作者: Arthur Amalvy(LIA),Vincent Labatut(LIA)
关键词-EN: Named Entity Recognition, Entity Recognition, Named Entity, Novelties corpus, NER
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The Novelties corpus is a collection of novels (and parts of novels) annotated for Named Entity Recognition (NER) among other tasks. This document describes the guidelines applied during its annotation. It contains the instructions used by the annotators, as well as a number of examples retrieved from the annotated novels, and illustrating expressions that should be marked as entities as well as expressions that should not.
摘要:Novelties 语料库是一个包含小说(及小说部分)的集合,这些小说被标注用于命名实体识别 (Named Entity Recognition, NER) 等任务。本文档描述了在标注过程中应用的指南。它包含了标注者使用的指令,以及从标注小说中提取的多个示例,这些示例展示了应标记为实体的表达式以及不应标记的表达式。

[NLP-77] Structural-Entropy-Based Sample Selection for Efficient and Effective Learning ICLR2025

【速读】: 该论文试图解决现有样本选择方法在机器学习模型中仅依赖局部信息(如样本训练难度)而忽略全局信息(如连接模式)的问题,导致样本选择效果不佳。解决方案的关键在于引入结构熵来量化全局信息,并通过Shapley值将其无损分解到各个节点,结合局部信息(训练难度)进行综合评估。论文提出的Structural-Entropy-based Sample Selection (SES) 方法通过构建kNN图、结合结构熵与训练难度衡量样本重要性,并应用重要性偏置的蓝噪声采样技术,从而选择出既信息丰富又具有代表性的样本,显著提升了模型效率与效果。

链接: https://arxiv.org/abs/2410.02268
作者: Tianchi Xie,Jiangning Zhu,Guozu Ma,Minzhi Lin,Wei Chen,Weikai Yang,Shixia Liu
关键词-EN: machine learning models, improves the efficiency, models by providing, samples, Sample selection improves
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to ICLR 2025

点击查看摘要

Abstract:Sample selection improves the efficiency and effectiveness of machine learning models by providing informative and representative samples. Typically, samples can be modeled as a sample graph, where nodes are samples and edges represent their similarities. Most existing methods are based on local information, such as the training difficulty of samples, thereby overlooking global information, such as connectivity patterns. This oversight can result in suboptimal selection because global information is crucial for ensuring that the selected samples well represent the structural properties of the graph. To address this issue, we employ structural entropy to quantify global information and losslessly decompose it from the whole graph to individual nodes using the Shapley value. Based on the decomposition, we present Structural-Entropy-based sample Selection (SES), a method that integrates both global and local information to select informative and representative samples. SES begins by constructing a kNN-graph among samples based on their similarities. It then measures sample importance by combining structural entropy (global metric) with training difficulty (local metric). Finally, SES applies importance-biased blue noise sampling to select a set of diverse and representative samples. Comprehensive experiments on three learning scenarios – supervised learning, active learning, and continual learning – clearly demonstrate the effectiveness of our method.
摘要:样本选择通过提供信息丰富且具有代表性的样本,提高了机器学习模型的效率和效果。通常,样本可以被建模为一个样本图,其中节点是样本,边表示它们的相似性。大多数现有方法基于局部信息,如样本的训练难度,从而忽略了全局信息,如连通性模式。这种忽视可能导致次优选择,因为全局信息对于确保所选样本良好地代表图的结构属性至关重要。为解决这一问题,我们采用结构熵来量化全局信息,并使用 Shapley 值将其无损地从整个图分解到各个节点。基于这种分解,我们提出了 基于结构熵的样本选择 (Structural-Entropy-based Sample Selection, SES),该方法整合了全局和局部信息,以选择信息丰富且具有代表性的样本。SES 首先根据样本的相似性构建一个 k 近邻图 (kNN-graph)。然后,它通过结合结构熵(全局指标)和训练难度(局部指标)来衡量样本的重要性。最后,SES 应用重要性偏置的蓝噪声采样来选择一组多样且具有代表性的样本。在三种学习场景——监督学习、主动学习和持续学习——上的综合实验清楚地证明了我们方法的有效性。
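下面给出一个假设性的简化草图:以一维结构熵中节点的贡献项 -(d_i/2m)·log2(d_i/2m) 作为全局指标,与训练难度加权合并后取 top-k(论文实际使用 Shapley 值分解与重要性偏置的蓝噪声采样,此处从略):

```python
import math

def ses_select(edges, difficulty, k, alpha=0.5):
    """SES 草图:全局指标(一维结构熵的节点贡献)与局部指标(训练难度)加权合并。"""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    two_m = 2 * len(edges)
    importance = {}
    for node, dif in difficulty.items():
        p = deg.get(node, 0) / two_m
        entropy = -p * math.log2(p) if p > 0 else 0.0        # 全局:结构熵贡献
        importance[node] = alpha * entropy + (1 - alpha) * dif  # 与局部难度合并
    return sorted(importance, key=importance.get, reverse=True)[:k]

# 玩具 kNN 图:节点 0 连接最多,但训练难度最低
edges = [(0, 1), (0, 2), (0, 3), (2, 3)]
difficulty = {0: 0.2, 1: 0.9, 2: 0.5, 3: 0.4}
print(ses_select(edges, difficulty, k=2))  # 输出: [1, 2]
```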

[NLP-78] A Pilot Study of Applying Sequence-to-Sequence Voice Conversion to Evaluate the Intelligibility of L2 Speech Using a Native Speaker's Shadowings

【速读】: 该论文试图解决非母语者(L2)发音不清晰的问题,特别是在计算机辅助语言学习系统中,如何提供更精细的反馈以帮助L2学习者识别和诊断其发音错误。解决方案的关键在于利用语音转换技术(Voice Conversion)模拟母语者(L1)对L2发音的跟读行为,从而创建一个虚拟的跟读系统。通过这种方式,系统能够生成与真实L1跟读发音在语言和声学特征上相似的输出,为L2学习者提供更直观、有效的发音纠正反馈。

链接: https://arxiv.org/abs/2410.02239
作者: Haopeng Geng,Daisuke Saito,Nobuaki Minematsu
关键词-EN: improper prosody, due to mispronunciation, mispronunciation and improper, unintelligible due, Utterances
类目: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted by APSIPA ASC 2024. arXiv admin note: text overlap with arXiv:2409.11742

点击查看摘要

Abstract:Utterances by L2 speakers can be unintelligible due to mispronunciation and improper prosody. In computer-aided language learning systems, textual feedback is often provided using a speech recognition engine. However, an ideal form of feedback for L2 speakers should be so fine-grained that it enables them to detect and diagnose unintelligible parts of L2 speakers’ utterances. Inspired by language teachers who correct students’ pronunciation through a voice-to-voice process, this pilot study utilizes a unique semi-parallel dataset composed of non-native speakers’ (L2) reading aloud, shadowing of native speakers (L1) and their script-shadowing utterances. We explore the technical possibility of replicating the process of an L1 speaker’s shadowing L2 speech using Voice Conversion techniques, to create a virtual shadower system. Experimental results demonstrate the feasibility of the VC system in simulating L1’s shadowing behavior. The output of the virtual shadower system shows a reasonable similarity to the real L1 shadowing utterances in both linguistic and acoustic aspects.
摘要:由于发音错误和语调不当,第二语言 (L2) 学习者的发音可能难以理解。在计算机辅助语言学习系统中,通常使用语音识别引擎提供文本反馈。然而,理想的反馈形式应足够细致,使 L2 学习者能够检测和诊断其发音中难以理解的部分。受语言教师通过语音对语音过程纠正学生发音的启发,本研究利用一个独特的半并行数据集,该数据集由非母语者 (L2) 的朗读、母语者 (L1) 的影子跟读及其脚本影子跟读组成。我们探索了使用语音转换 (Voice Conversion) 技术复制 L1 影子跟读 L2 语音过程的技术可能性,以创建一个虚拟影子跟读系统。实验结果表明,该语音转换系统在模拟 L1 影子跟读行为方面具有可行性。虚拟影子跟读系统的输出在语言和声学方面与真实的 L1 影子跟读发音表现出合理的相似性。

[NLP-79] CodePMP: Scalable Preference Model Pretraining for Large Language Model Reasoning

【速读】: 该论文试图解决大语言模型(LLMs)在推理能力提升过程中,由于高质量偏好数据稀缺且标注成本高昂,导致奖励模型(RM)微调效率低下的问题。解决方案的关键在于引入CodePMP,一种可扩展的偏好模型预训练(PMP)流程,通过利用公开的高质量源代码合成的大规模代码-偏好对数据集,对偏好模型进行预训练,从而显著提高奖励模型微调的效率,并在数学和逻辑推理任务中展现出显著的性能提升。

链接: https://arxiv.org/abs/2410.02229
作者: Huimu Yu,Xing Wu,Weidong Yin,Debing Zhang,Songlin Hu
关键词-EN: natural language understanding, made significant progress, Large language models, understanding and generation, natural language
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: work in progress

点击查看摘要

Abstract:Large language models (LLMs) have made significant progress in natural language understanding and generation, driven by scalable pretraining and advanced finetuning. However, enhancing reasoning abilities in LLMs, particularly via reinforcement learning from human feedback (RLHF), remains challenging due to the scarcity of high-quality preference data, which is labor-intensive to annotate and crucial for reward model (RM) finetuning. To alleviate this issue, we introduce CodePMP, a scalable preference model pretraining (PMP) pipeline that utilizes a large corpus of synthesized code-preference pairs from publicly available high-quality source code. CodePMP improves RM finetuning efficiency by pretraining preference models on large-scale synthesized code-preference pairs. We evaluate CodePMP on mathematical reasoning tasks (GSM8K, MATH) and logical reasoning tasks (ReClor, LogiQA2.0), consistently showing significant improvements in reasoning performance of LLMs and highlighting the importance of scalable preference model pretraining for efficient reward modeling.
摘要:大语言模型 (LLMs) 在自然语言理解和生成方面取得了显著进展,这得益于可扩展的预训练和先进的微调技术。然而,增强 LLMs 的推理能力,特别是通过从人类反馈中进行强化学习 (RLHF),仍然面临挑战,主要原因是高质量偏好数据的稀缺,这些数据需要大量人力进行标注,并且对奖励模型 (RM) 的微调至关重要。为了缓解这一问题,我们引入了 CodePMP,这是一个可扩展的偏好模型预训练 (PMP) 流程,利用从公开的高质量源代码中合成的大量代码-偏好对。CodePMP 通过在大规模合成的代码-偏好对上预训练偏好模型,提高了 RM 微调的效率。我们在数学推理任务 (GSM8K, MATH) 和逻辑推理任务 (ReClor, LogiQA2.0) 上评估了 CodePMP,结果一致显示 LLMs 的推理性能显著提升,并强调了可扩展的偏好模型预训练对高效奖励建模的重要性。
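偏好模型在代码-偏好对上的预训练目标可以用 Bradley-Terry 式损失草图示意(假设性示例,偏好对与奖励数值均为虚构;论文的具体训练目标以原文为准):

```python
import math

def pmp_loss(reward_chosen, reward_rejected):
    """偏好模型预训练的 Bradley-Terry 式损失草图:-log sigmoid(r_chosen - r_rejected)。"""
    diff = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# 虚构的合成代码偏好对:同一描述下,较优实现为 chosen,较差实现为 rejected
pairs = [
    {"prompt": "实现二分查找", "r_chosen": 2.1, "r_rejected": 0.3},
    {"prompt": "解析 JSON 配置", "r_chosen": 1.4, "r_rejected": 1.0},
]
total = sum(pmp_loss(p["r_chosen"], p["r_rejected"]) for p in pairs) / len(pairs)
print(round(total, 3))  # 输出: 0.333
```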

[NLP-80] EmbedLLM: Learning Compact Representations of Large Language Models

【速读】: 该论文试图解决大规模语言模型(LLMs)在下游任务中的高效评估和利用问题,特别是避免重复学习任务特定表示导致的资源浪费。解决方案的关键是提出EmbedLLM框架,通过学习LLMs的紧凑向量表示(embeddings),以支持模型路由等下游应用。该框架采用编码器-解码器方法生成这些嵌入,并通过系统评估验证其有效性,显著提高了模型路由的准确性和延迟性能,同时能够预测模型在多个基准上的表现,无需额外推理成本。

链接: https://arxiv.org/abs/2410.02223
作者: Richard Zhuang,Tianhao Wu,Zhaojin Wen,Andrew Li,Jiantao Jiao,Kannan Ramchandran
关键词-EN: Huggingface today, Large Language Models, efficiently evaluating, increasingly critical, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:With hundreds of thousands of language models available on Huggingface today, efficiently evaluating and utilizing these models across various downstream tasks has become increasingly critical. Many existing methods repeatedly learn task-specific representations of Large Language Models (LLMs), which leads to inefficiencies in both time and computational resources. To address this, we propose EmbedLLM, a framework designed to learn compact vector representations of LLMs that facilitate downstream applications involving many models, such as model routing. We introduce an encoder-decoder approach for learning such embeddings, along with a systematic framework to evaluate their effectiveness. Empirical results show that EmbedLLM outperforms prior methods in model routing both in accuracy and latency. Additionally, we demonstrate that our method can forecast a model’s performance on multiple benchmarks, without incurring additional inference cost. Extensive probing experiments validate that the learned embeddings capture key model characteristics, e.g. whether the model is specialized for coding tasks, even without being explicitly trained on them. We open source our dataset, code and embedder to facilitate further research and application.
摘要:随着 Huggingface 上现今已有数十万种语言模型,如何在各种下游任务中高效评估和利用这些模型变得愈发关键。许多现有方法反复学习大语言模型 (LLM) 的特定任务表示,这导致了时间和计算资源上的低效。为此,我们提出了 EmbedLLM,一个旨在学习紧凑向量表示的框架,以促进涉及多个模型的下游应用,如模型路由。我们引入了一种编码器-解码器方法来学习这些嵌入,并提供了一个系统框架来评估其有效性。实证结果表明,EmbedLLM 在模型路由的准确性和延迟方面均优于先前方法。此外,我们还展示了我们的方法能够在不增加额外推理成本的情况下,预测模型在多个基准测试上的表现。广泛的探测实验验证了所学嵌入能够捕捉关键的模型特征,例如模型是否专门用于编码任务,即使在没有明确训练的情况下也能做到。我们开源了数据集、代码和嵌入器,以促进进一步的研究和应用。

[NLP-81] Calibrate to Discriminate: Improve In-Context Learning with Label-Free Comparative Inference

【速读】: 该论文试图解决大语言模型(LLMs)在上下文学习中出现的“无差别误校准”问题,即模型对正确和错误预测赋予相同置信度的情况。解决方案的关键在于提出新的度量指标来量化这种误校准的严重性,并开发了一种新颖的上下文比较推理方法,以缓解误校准现象并提高分类性能。通过在五个数据集上的广泛实验,证明了该方法相比传统的零样本和少样本提示方法,能够实现更准确和校准的预测。

链接: https://arxiv.org/abs/2410.02210
作者: Wei Cheng,Tianlu Wang,Yanmin Ji,Fan Yang,Keren Tan,Yiyu Zheng
关键词-EN: large language models, shown impressive performance, language models, level of confidence, learning with large
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 19 pages

点击查看摘要

Abstract:While in-context learning with large language models (LLMs) has shown impressive performance, we have discovered a unique miscalibration behavior where both correct and incorrect predictions are assigned the same level of confidence. We refer to this phenomenon as indiscriminate miscalibration. We found that traditional calibration metrics, such as Expected Calibrated Errors (ECEs), are unable to capture this behavior effectively. To address this issue, we propose new metrics to measure the severity of indiscriminate miscalibration. Additionally, we develop a novel in-context comparative inference method to alleviate miscalibrations and improve classification performance. Through extensive experiments on five datasets, we demonstrate that our proposed method can achieve more accurate and calibrated predictions compared to regular zero-shot and few-shot prompting.
摘要:尽管大语言模型 (LLM) 在上下文学习中表现出色,但我们发现了一种独特的校准错误行为,即正确和错误的预测被赋予相同的置信度。我们将这种现象称为无差别校准错误。我们发现,传统的校准指标,如预期校准误差 (ECE),无法有效捕捉这种行为。为了解决这一问题,我们提出了新的指标来衡量无差别校准错误的严重程度。此外,我们开发了一种新颖的上下文比较推理方法,以减轻校准错误并提高分类性能。通过在五个数据集上的广泛实验,我们证明,与常规的零样本和少样本提示相比,我们提出的方法能够实现更准确和校准的预测。
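论文指出传统 ECE 无法有效捕捉"无差别校准错误"。下面的玩具示例(假设性示意,非论文代码)演示了原因:当所有预测,无论对错,都被赋予同一置信度,且该置信度恰好等于整体准确率时,标准 ECE 为 0,但置信度对区分对错毫无帮助。

```python
def expected_calibration_error(confidences, corrects, n_bins=10):
    """标准 ECE:按置信度分桶,对每个桶累加 |桶内准确率 - 桶内平均置信度| 的加权和。"""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi or (b == 0 and c == 0)]
        if not idx:
            continue
        acc = sum(corrects[i] for i in idx) / len(idx)
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(acc - avg_conf)
    return ece

corrects = [1] * 7 + [0] * 3  # 整体准确率 0.7
ece_indiscriminate = expected_calibration_error([0.7] * 10, corrects)  # 约为 0
ece_overconfident = expected_calibration_error([0.9] * 10, corrects)   # 约为 0.2
```

第一种情形正是论文所称的无差别校准错误:ECE 近似为 0,指标看似完美,问题却被掩盖;这也是论文提出新度量的动机。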

[NLP-82] Measuring Evaluating and Improving Logical Consistency in Large Language Models

【速读】: 该论文试图解决大语言模型(LLMs)在决策和判断中表现出的逻辑不一致性问题,以提高其可靠性和可信度。解决方案的关键在于提出了一种量化逻辑一致性的通用框架,通过三个基本代理指标(传递性、交换性和否定不变性)来评估LLMs的逻辑一致性。此外,论文还引入了一种数据精炼和增强技术,通过估计部分或完全有序的偏好排序来增强逻辑一致性,同时不牺牲与人类偏好的对齐。这些方法共同提升了LLMs在逻辑依赖算法中的表现,使其作为逻辑操作符时更加稳健。

链接: https://arxiv.org/abs/2410.02205
作者: Yinhong Liu,Zhijiang Guo,Tianya Liang,Ehsan Shareghi,Ivan Vulić,Nigel Collier
关键词-EN: Large Language Models, Language Models, Large Language, shown promising progress, promising progress related
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:Recent research in Large Language Models (LLMs) has shown promising progress related to LLM alignment with human preferences. LLM-empowered decision-making systems are expected to be predictable, reliable and trustworthy, which implies being free from paradoxes or contradictions that could undermine their credibility and validity. However, LLMs still exhibit inconsistent and biased behaviour when making decisions or judgements. In this work, we focus on studying logical consistency of LLMs as a prerequisite for more reliable and trustworthy systems. Logical consistency ensures that decisions are based on a stable and coherent understanding of the problem, reducing the risk of erratic or contradictory outputs. We first propose a universal framework to quantify the logical consistency via three fundamental proxies: transitivity, commutativity and negation invariance. We then evaluate logical consistency, using the defined measures, of a wide range of LLMs, demonstrating that it can serve as a strong proxy for overall robustness. Additionally, we introduce a data refinement and augmentation technique that enhances the logical consistency of LLMs without sacrificing alignment to human preferences. It augments noisy and sparse pairwise-comparison annotations by estimating a partially or totally ordered preference rankings using rank aggregation methods. Finally, we show that logical consistency impacts the performance of LLM-based logic-dependent algorithms, where LLMs serve as logical operators.
摘要:近期在大语言模型 (LLM) 领域的研究显示,与人类偏好对齐的 LLM 取得了显著进展。由 LLM 驱动的决策系统预期应具备可预测性、可靠性和可信性,这意味着系统应避免可能损害其可信度和有效性的悖论或矛盾。然而,LLM 在做出决策或判断时仍表现出不一致和偏见的行为。在本研究中,我们将重点放在研究 LLM 的逻辑一致性上,这是构建更可靠和可信系统的前提。逻辑一致性确保决策基于对问题稳定且连贯的理解,从而降低输出不稳定或矛盾的风险。我们首先提出了一种通用框架,通过三个基本代理指标:传递性、交换性和否定不变性,来量化逻辑一致性。接着,我们使用这些定义的度量方法,评估了一系列 LLM 的逻辑一致性,证明其可以作为整体鲁棒性的强有力代理。此外,我们引入了一种数据精炼和增强技术,该技术在不牺牲与人类偏好对齐的前提下,提升了 LLM 的逻辑一致性。它通过使用排名聚合方法,估计部分或完全有序的偏好排名,从而增强噪声和稀疏的成对比较注释。最后,我们展示了逻辑一致性对基于 LLM 的逻辑依赖算法的性能有影响,其中 LLM 作为逻辑运算符。
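以三个代理指标中的传递性为例,违例比例可以按如下思路统计(假设性草图,具体度量定义以论文为准):

```python
from itertools import combinations

def transitivity_violation_rate(items, prefers):
    """统计所有三元组排列中违反传递性的比例:
    若模型判定 x 优于 y 且 y 优于 z,但并未判定 x 优于 z,记一次违例。"""
    total = violations = 0
    for a, b, c in combinations(items, 3):
        for x, y, z in [(a, b, c), (a, c, b), (b, a, c),
                        (b, c, a), (c, a, b), (c, b, a)]:
            if prefers(x, y) and prefers(y, z):
                total += 1
                if not prefers(x, z):
                    violations += 1
    return violations / total if total else 0.0

# 循环偏好 A>B, B>C, C>A 完全违反传递性;字典序偏好则完全满足
cycle = {("A", "B"), ("B", "C"), ("C", "A")}
rate_cyclic = transitivity_violation_rate(["A", "B", "C"], lambda x, y: (x, y) in cycle)
rate_linear = transitivity_violation_rate(["A", "B", "C"], lambda x, y: x < y)
```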

[NLP-83] Can Language Models Take A Hint? Prompting for Controllable Contextualized Commonsense Inference ACL

【速读】: 该论文试图解决在给定故事情境中生成常识性断言的难题,特别是如何确定故事中的主题或实体作为推理断言的焦点,并且缺乏对生成断言特定方面的控制能力。解决方案的关键在于引入了一种名为“hinting”的数据增强技术,通过硬提示和软提示的前缀提示策略来引导推理过程,从而在不降低常识性推理性能的前提下,提高生成断言的可控性。

链接: https://arxiv.org/abs/2410.02202
作者: Pedro Colon-Hernandez,Nanxi Liu,Chelsea Joe,Peter Chin,Claire Yin,Henry Lieberman,Yida Xin,Cynthia Breazeal
关键词-EN: Generating commonsense assertions, story context remains, modern language models, Generating commonsense, hinting
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Submitted to ACL Rolling Review. arXiv admin note: text overlap with arXiv:2302.05406

点击查看摘要

Abstract:Generating commonsense assertions within a given story context remains a difficult task for modern language models. Previous research has addressed this problem by aligning commonsense inferences with stories and training language generation models accordingly. One of the challenges is determining which topic or entity in the story should be the focus of an inferred assertion. Prior approaches lack the ability to control specific aspects of the generated assertions. In this work, we introduce “hinting,” a data augmentation technique that enhances contextualized commonsense inference. “Hinting” employs a prefix prompting strategy using both hard and soft prompts to guide the inference process. To demonstrate its effectiveness, we apply “hinting” to two contextual commonsense inference datasets: ParaCOMET and GLUCOSE, evaluating its impact on both general and context-specific inference. Furthermore, we evaluate “hinting” by incorporating synonyms and antonyms into the hints. Our results show that “hinting” does not compromise the performance of contextual commonsense inference while offering improved controllability.
摘要:在给定的故事情境中生成常识性断言仍然是现代语言模型面临的难题。以往的研究通过将常识推理与故事对齐并相应地训练语言生成模型来解决这一问题。其中一个挑战是确定故事中的哪个主题或实体应成为推断断言的焦点。先前的研究方法缺乏对生成断言特定方面的控制能力。在本研究中,我们引入了“提示”(hinting),这是一种增强情境化常识推理的数据增强技术。“提示”采用前缀提示策略,结合硬提示和软提示来引导推理过程。为了展示其有效性,我们将“提示”应用于两个情境化常识推理数据集:ParaCOMET 和 GLUCOSE,评估其对通用和情境特定推理的影响。此外,我们还通过在提示中融入同义词和反义词来评估“提示”的效果。我们的结果表明,“提示”在不损害情境化常识推理性能的同时,提供了更好的可控性。

[NLP-84] General Preference Modeling with Preference Representations for Aligning Language Models

【速读】: 该论文试图解决传统奖励模型(如Bradley-Terry模型)在表达复杂偏好结构时的不足,特别是处理非传递性偏好(intransitive preferences)的能力有限,以及监督配对偏好模型(PairPM)在计算复杂度和一致性上的问题。解决方案的关键在于引入偏好表示学习(preference representation learning),通过将响应嵌入到潜在空间中,以高效捕捉复杂的偏好结构,实现线性查询复杂度。此外,论文提出了基于偏好分数的通用偏好优化(General Preference Optimization, GPO),扩展了从人类反馈中进行奖励强化学习的框架。实验结果表明,该方法在多个基准测试和下游任务中显著优于传统奖励模型,有效提升了基础模型与人类价值观的对齐效果。

链接: https://arxiv.org/abs/2410.02197
作者: Yifan Zhang,Ge Zhang,Yue Wu,Kangping Xu,Quanquan Gu
关键词-EN: General Preference, preference, crucial for aligning, Traditional reward modeling, Modeling human preferences
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 34 pages

点击查看摘要

Abstract:Modeling human preferences is crucial for aligning foundation models with human values. Traditional reward modeling methods, such as the Bradley-Terry (BT) reward model, fall short in expressiveness, particularly in addressing intransitive preferences. Although supervised pair preference models (PairPM) can express general preferences, their implementation is highly ad-hoc and cannot guarantee a consistent preference probability of compared pairs. Additionally, they impose high computational costs due to their quadratic query complexity when comparing multiple responses. In this paper, we introduce preference representation learning, an approach that embeds responses into a latent space to capture intricate preference structures efficiently, achieving linear query complexity. Additionally, we propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback. Experimental results show that our General Preference representation model (GPM) outperforms the BT reward model on the RewardBench benchmark with a margin of up to 5.6% and effectively models cyclic preferences where any BT reward model behaves like a random guess. Furthermore, evaluations on downstream tasks such as AlpacaEval2.0 and MT-Bench, following the language model post-training with GPO and our general preference model, reveal substantial performance improvements with margins up to 9.3%. These findings indicate that our method may enhance the alignment of foundation models with nuanced human values. The code is available at this https URL.
摘要:建模人类偏好对于将基础模型与人类价值观对齐至关重要。传统的奖励建模方法,如 Bradley-Terry (BT) 奖励模型,在表达能力上存在不足,特别是在处理非传递性偏好时。尽管监督对偏好模型 (PairPM) 能够表达一般偏好,但其实现高度特设,无法保证比较对的偏好概率一致性。此外,由于其在比较多个响应时的二次查询复杂性,计算成本较高。本文中,我们引入了偏好表示学习,这是一种将响应嵌入潜在空间以高效捕捉复杂偏好结构的方法,实现了线性查询复杂性。此外,我们提出了基于偏好分数的通用偏好优化 (GPO),该方法推广了基于人类反馈的奖励强化学习。实验结果表明,我们的通用偏好表示模型 (GPM) 在 RewardBench 基准测试中优于 BT 奖励模型,最大优势达 5.6%,并有效建模了循环偏好,其中任何 BT 奖励模型的表现如同随机猜测。此外,在 AlpacaEval2.0 和 MT-Bench 等下游任务的评估中,采用 GPO 和我们通用偏好模型进行语言模型后训练后,性能显著提升,最大优势达 9.3%。这些发现表明,我们的方法可能增强基础模型与复杂人类价值观的对齐。代码可在以下链接获取:https URL。
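标量奖励为何无法表达循环偏好,而潜空间偏好表示可以,可用一个二维玩具例子说明(假设性示意,非论文原始实现):若 A≻B、B≻C、C≻A,任何标量分数 r 都需满足 r_A > r_B > r_C > r_A,显然矛盾;而把回复嵌入单位圆并用反对称算子打分,三个偏好可以同时成立。

```python
import math

def embed(angle_deg):
    """把回复嵌入二维单位圆(角度为演示自拟)。"""
    a = math.radians(angle_deg)
    return (math.cos(a), math.sin(a))

def preference_score(u, v):
    """反对称偏好分数 s(u, v) = u^T R v,R 为 90 度旋转矩阵。
    s(u, v) > 0 表示偏好 u,且恒有 s(u, v) = -s(v, u)。"""
    return u[0] * (-v[1]) + u[1] * v[0]  # R v = (-v_y, v_x)

A, B, C = embed(0), embed(240), embed(120)
s_ab = preference_score(A, B)  # > 0: A 优于 B
s_bc = preference_score(B, C)  # > 0: B 优于 C
s_ca = preference_score(C, A)  # > 0: C 优于 A,构成 BT 模型无法表达的循环
```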

[NLP-85] POSIX: A Prompt Sensitivity Index For Large Language Models EMNLP2024

【速读】: 该论文试图解决大语言模型(LLMs)对提示(prompt)微小变化的敏感性问题,即LLMs在面对拼写错误、措辞变化或提示模板改变时,输出结果可能出现显著差异。解决方案的关键在于提出了一个名为POSIX的新型提示敏感性指数,通过捕捉在替换为不同意图保留的提示后,给定响应的对数似然率的相对变化,来量化和评估LLMs的提示敏感性。该方法通过实验验证了其有效性,并用于比较不同开源LLMs的提示敏感性,发现增加参数数量或指令调优并不一定能降低提示敏感性,而添加少量示例(如一个)几乎总能显著降低提示敏感性。

链接: https://arxiv.org/abs/2410.02185
作者: Anwoy Chatterjee,H S V N S Kowndinya Renduchintala,Sumit Bhatia,Tanmoy Chakraborty
关键词-EN: Large Language Models, Large Language, Language Models, minor variations, generating significantly divergent
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: EMNLP 2024 (Findings)

点击查看摘要

Abstract:Despite their remarkable capabilities, Large Language Models (LLMs) are found to be surprisingly sensitive to minor variations in prompts, often generating significantly divergent outputs in response to minor variations in the prompts, such as spelling errors, alteration of wording or the prompt template. However, while assessing the quality of an LLM, the focus often tends to be solely on its performance on downstream tasks, while very little to no attention is paid to prompt sensitivity. To fill this gap, we propose POSIX - a novel PrOmpt Sensitivity IndeX as a reliable measure of prompt sensitivity, thereby offering a more comprehensive evaluation of LLM performance. The key idea behind POSIX is to capture the relative change in log-likelihood of a given response upon replacing the corresponding prompt with a different intent-preserving prompt. We provide thorough empirical evidence demonstrating the efficacy of POSIX in capturing prompt sensitivity and subsequently use it to measure and thereby compare prompt sensitivity of various open-source LLMs. We find that merely increasing the parameter count or instruction tuning does not necessarily reduce prompt sensitivity whereas adding some few-shot exemplars, even just one, almost always leads to significant decrease in prompt sensitivity. We also find that alterations to prompt template lead to the highest sensitivity in the case of MCQ-type tasks, whereas paraphrasing results in the highest sensitivity in open-ended generation tasks. The code for reproducing our results is open-sourced at this https URL.
摘要:尽管大语言模型 (LLM) 具有显著的能力,但它们对提示中的细微变化表现出惊人的敏感性,通常在面对拼写错误、措辞变化或提示模板更改等微小变化时,生成显著不同的输出。然而,在评估 LLM 的质量时,焦点往往仅集中在其在下游任务上的表现,而对提示敏感性的关注却极少。为了填补这一空白,我们提出了 POSIX——一种新颖的 PrOmpt Sensitivity IndeX,作为衡量提示敏感性的可靠指标,从而提供对 LLM 性能更全面的评估。POSIX 的核心思想是捕捉在将相应提示替换为不同但意图保持不变的提示时,给定响应的对数似然率的相对变化。我们提供了充分的实证证据,证明 POSIX 在捕捉提示敏感性方面的有效性,并随后使用它来测量并比较各种开源 LLM 的提示敏感性。我们发现,仅增加参数数量或指令调优并不一定能降低提示敏感性,而添加一些少样本示例,即使只有一个,几乎总是导致提示敏感性显著下降。我们还发现,在 MCQ 类型任务中,提示模板的更改会导致最高的敏感性,而在开放式生成任务中,释义会导致最高的敏感性。用于重现我们结果的代码已在 https URL 上开源。
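POSIX 的核心思想是:对若干意图保持不变的提示变体,度量同一响应对数似然的相对变化。其计算可以简化为如下草图(假设性实现,归一化等细节以论文为准):

```python
from itertools import permutations

def posix_sketch(prompt_variants, response, loglik):
    """对每一对意图相同的提示变体 (p_i, p_j),
    取 |log P(y | p_j) - log P(y | p_i)| 并在所有有序对上取平均。
    loglik(prompt, response) 可由任一语言模型给出。"""
    pairs = list(permutations(prompt_variants, 2))
    total = sum(abs(loglik(pj, response) - loglik(pi, response)) for pi, pj in pairs)
    return total / len(pairs)

# 玩具打分:两个同义提示下,响应 "4" 的对数似然相差 1.0
scores = {("2+2=?", "4"): -0.2, ("What is 2+2?", "4"): -1.2}
sensitive = posix_sketch(["2+2=?", "What is 2+2?"], "4", lambda p, y: scores[(p, y)])
robust = posix_sketch(["2+2=?", "What is 2+2?"], "4", lambda p, y: -0.5)
```

完全不受提示变体影响的模型,该指标为 0;对变体越敏感,指标越大。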

[NLP-86] CodeJudge: Evaluating Code Generation with Large Language Models EMNLP2024

【速读】: 该论文试图解决大语言模型(LLMs)生成的代码如何进行可靠评估的问题。解决方案的关键在于提出了CodeJudge框架,该框架利用LLMs来评估生成代码的语义正确性,无需依赖测试用例。通过引导LLM进行“慢思考”,CodeJudge能够进行深入且可靠的评估,并在实验中显著优于现有方法,甚至在某些情况下使用较小模型(如Llama-3-8B-Instruct)也能超越基于GPT-3.5的SOTA方法。

链接: https://arxiv.org/abs/2410.02184
作者: Weixi Tong,Tianyi Zhang
关键词-EN: Large Language Models, shown promising performance, Large Language, shown promising, promising performance
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: Accepted to EMNLP 2024 (Main, Long Paper)

点击查看摘要

Abstract:Large Language Models (LLMs) have shown promising performance in code generation. However, how to reliably evaluate code generated by LLMs remains an unresolved problem. This paper presents CodeJudge, a code evaluation framework that leverages LLMs to evaluate the semantic correctness of generated code without the need for test cases. We investigate different ways to guide the LLM in performing “slow thinking” to arrive at an in-depth and reliable evaluation. We experimented with four LLMs as evaluators on four code generation datasets and five programming languages. The results show that CodeJudge significantly outperformed existing methods in most settings. Furthermore, compared with a SOTA GPT-3.5-based code evaluation method, CodeJudge achieved better results even when using a much smaller model, Llama-3-8B-Instruct. Our code and datasets are available on GitHub this https URL.
摘要:大语言模型 (LLMs) 在代码生成方面展现了显著的潜力。然而,如何可靠地评估 LLMs 生成的代码仍然是一个未解决的问题。本文提出了 CodeJudge,这是一个利用 LLMs 来评估生成代码的语义正确性的代码评估框架,无需依赖测试用例。我们探讨了不同的方法来引导 LLM 进行“慢思考”,以达到深入且可靠的评估。我们在四个代码生成数据集和五种编程语言上,使用四种 LLMs 作为评估器进行了实验。结果表明,在大多数情况下,CodeJudge 显著优于现有的方法。此外,与基于 SOTA GPT-3.5 的代码评估方法相比,即使使用更小的模型 Llama-3-8B-Instruct,CodeJudge 也能取得更好的结果。我们的代码和数据集可在 GitHub 上获取,链接为 this https URL。

[NLP-87] HATFormer: Historic Handwritten Arabic Text Recognition with Transformers

【速读】: 该论文试图解决阿拉伯手写文本识别(HTR)中的挑战,特别是历史文本的识别问题,由于阿拉伯书写风格的多样性和其文字的内在特征,以及阿拉伯手写数据集相对较小,导致训练出具有泛化能力的阿拉伯HTR模型较为困难。解决方案的关键在于提出了一种基于Transformer的编码器-解码器架构HATFormer,该架构通过利用Transformer的注意力机制来捕捉空间上下文信息,从而有效地区分连写字符、分解视觉表示和识别音调符号,以应对阿拉伯文字的内在挑战。此外,针对历史手写阿拉伯文的特殊性,HATFormer还包括了一个图像处理器用于有效的ViT信息预处理,一个文本标记器用于紧凑的阿拉伯文表示,以及一个考虑到历史阿拉伯手写数据量有限的训练流程。

链接: https://arxiv.org/abs/2410.02179
作者: Adrian Chan,Anupam Mijar,Mehreen Saeed,Chau-Wai Wong,Akram Khater
关键词-EN: diverse writing styles, English HTR model, Arabic HTR models, English HTR, generalizable Arabic HTR
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Arabic handwritten text recognition (HTR) is challenging, especially for historical texts, due to diverse writing styles and the intrinsic features of Arabic script. Additionally, Arabic handwriting datasets are smaller compared to English ones, making it difficult to train generalizable Arabic HTR models. To address these challenges, we propose HATFormer, a transformer-based encoder-decoder architecture that builds on a state-of-the-art English HTR model. By leveraging the transformer’s attention mechanism, HATFormer captures spatial contextual information to address the intrinsic challenges of Arabic script through differentiating cursive characters, decomposing visual representations, and identifying diacritics. Our customization to historical handwritten Arabic includes an image processor for effective ViT information preprocessing, a text tokenizer for compact Arabic text representation, and a training pipeline that accounts for a limited amount of historic Arabic handwriting data. HATFormer achieves a character error rate (CER) of 8.6% on the largest public historical handwritten Arabic dataset, with a 51% improvement over the best baseline in the literature. HATFormer also attains a comparable CER of 4.2% on the largest private non-historical dataset. Our work demonstrates the feasibility of adapting an English HTR method to a low-resource language with complex, language-specific challenges, contributing to advancements in document digitization, information retrieval, and cultural preservation.
摘要:阿拉伯手写文本识别 (HTR) 面临着诸多挑战,尤其是对于历史文本而言,由于书写风格的多样性和阿拉伯文字的固有特征。此外,与英文手写数据集相比,阿拉伯手写数据集规模较小,这使得训练具有广泛适用性的阿拉伯 HTR 模型变得困难。为了应对这些挑战,我们提出了 HATFormer,这是一种基于 Transformer 的编码器-解码器架构,它建立在最先进的英文 HTR 模型之上。通过利用 Transformer 的注意力机制,HATFormer 能够捕捉空间上下文信息,从而通过区分连写字符、分解视觉表示和识别音调符号来解决阿拉伯文字的固有难题。我们对历史手写阿拉伯文的定制包括一个用于有效 ViT 信息预处理的图像处理器、一个用于紧凑阿拉伯文文本表示的文本 Tokenizer,以及一个考虑到有限历史阿拉伯手写数据的训练管道。HATFormer 在最大的公开历史手写阿拉伯数据集上实现了 8.6% 的字符错误率 (CER),比文献中最佳基线提高了 51%。在最大的非历史私有数据集上,HATFormer 也达到了 4.2% 的可比 CER。我们的工作展示了将英文 HTR 方法适应于资源匮乏且具有复杂语言特有挑战的语言的可行性,为文档数字化、信息检索和文化保护领域的进步做出了贡献。
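论文采用的字符错误率 (CER) 是 HTR 的通用评测指标:字符级编辑距离除以参考文本长度。下面是一个自包含的通用实现(非论文附带代码):

```python
def cer(reference: str, hypothesis: str) -> float:
    """字符错误率 = (插入 + 删除 + 替换的最少次数) / 参考文本长度。"""
    m, n = len(reference), len(hypothesis)
    dp = list(range(n + 1))  # 单行动态规划求 Levenshtein 距离
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,      # 删除
                        dp[j - 1] + 1,  # 插入
                        prev + (reference[i - 1] != hypothesis[j - 1]))  # 替换或匹配
            prev = cur
    return dp[n] / m if m else 0.0

err = cer("kitten", "sitting")  # 编辑距离 3,参考长度 6
```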

[NLP-88] Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis

【速读】: 该论文试图解决如何训练Transformer模型以获得Chain-of-Thought(CoT)推理能力的问题。解决方案的关键在于通过理论分析非凸优化和非线性注意力模型,量化训练所需的样本和迭代次数,并证明模型在分布偏移的测试数据上仍能成功进行CoT泛化。此外,论文还探讨了在推理示例包含噪声或不准确时,CoT仍能输出准确推理结果的条件,从而与仅一步推理的上下文学习(ICL)形成对比。

链接: https://arxiv.org/abs/2410.02167
作者: Hongkang Li,Meng Wang,Songtao Lu,Xiaodong Cui,Pin-Yu Chen
关键词-EN: efficient prompting method, large language models, multiple intermediate steps, efficient prompting, prompting method
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Chain-of-Thought (CoT) is an efficient prompting method that enables the reasoning ability of large language models by augmenting the query using multiple examples with multiple intermediate steps. Despite the empirical success, the theoretical understanding of how to train a Transformer to achieve the CoT ability remains less explored. This is primarily due to the technical challenges involved in analyzing the nonconvex optimization on nonlinear attention models. To the best of our knowledge, this work provides the first theoretical study of training Transformers with nonlinear attention to obtain the CoT generalization capability so that the resulting model can inference on unseen tasks when the input is augmented by examples of the new task. We first quantify the required training samples and iterations to train a Transformer model towards CoT ability. We then prove the success of its CoT generalization on unseen tasks with distribution-shifted testing data. Moreover, we theoretically characterize the conditions for an accurate reasoning output by CoT even when the provided reasoning examples contain noises and are not always accurate. In contrast, in-context learning (ICL), which can be viewed as one-step CoT without intermediate steps, may fail to provide an accurate output when CoT does. These theoretical findings are justified through experiments.
摘要:思维链 (Chain-of-Thought, CoT) 是一种高效的提示方法,通过使用包含多个中间步骤的多个示例来增强查询,从而实现大语言模型的推理能力。尽管在实践中取得了成功,但关于如何训练 Transformer 以获得 CoT 能力的理论理解仍较少被探索。这主要是因为在非线性注意力模型上进行非凸优化的分析存在技术挑战。据我们所知,本研究首次对训练具有非线性注意力的 Transformer 以获得 CoT 泛化能力进行了理论研究,使得生成的模型能够在输入通过新任务的示例增强时对未见任务进行推理。我们首先量化了训练 Transformer 模型以获得 CoT 能力所需的训练样本和迭代次数。然后,我们证明了其在分布偏移的测试数据上对未见任务的 CoT 泛化成功。此外,我们理论上描述了即使在提供的推理示例包含噪声且不总是准确的情况下,CoT 仍能输出准确推理结果的条件。相比之下,上下文学习 (In-Context Learning, ICL),可以视为没有中间步骤的一步 CoT,可能在 CoT 提供准确输出时无法提供准确输出。这些理论发现通过实验得到了验证。

[NLP-89] A LLM-Powered Automatic Grading Framework with Human-Level Guidelines Optimization

【速读】: 该论文试图解决开放式简答题自动评分(ASAG)中存在的评分工作量大和评分一致性差的问题。解决方案的关键在于提出了一个统一的多智能体ASAG框架GradeOpt,该框架利用大型语言模型(LLMs)作为评分器,并引入了两个基于LLM的辅助智能体——反射器和精炼器。这些辅助智能体通过自我反思和优化原始评分指南,显著提升了评分准确性和与人类评分者行为的一致性。

链接: https://arxiv.org/abs/2410.02165
作者: Yucheng Chu,Hang Li,Kaiqi Yang,Harry Shomer,Hui Liu,Yasemin Copur-Gencturk,Jiliang Tang
关键词-EN: providing deeper insights, Open-ended short-answer questions, Open-ended short-answer, learning analytics, widely recognized
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Open-ended short-answer questions (SAGs) have been widely recognized as a powerful tool for providing deeper insights into learners’ responses in the context of learning analytics (LA). However, SAGs often present challenges in practice due to the high grading workload and concerns about inconsistent assessments. With recent advancements in natural language processing (NLP), automatic short-answer grading (ASAG) offers a promising solution to these challenges. Despite this, current ASAG algorithms are often limited in generalizability and tend to be tailored to specific questions. In this paper, we propose a unified multi-agent ASAG framework, GradeOpt, which leverages large language models (LLMs) as graders for SAGs. More importantly, GradeOpt incorporates two additional LLM-based agents - the reflector and the refiner - into the multi-agent system. This enables GradeOpt to automatically optimize the original grading guidelines by performing self-reflection on its errors. Through experiments on a challenging ASAG task, namely the grading of pedagogical content knowledge (PCK) and content knowledge (CK) questions, GradeOpt demonstrates superior performance in grading accuracy and behavior alignment with human graders compared to representative baselines. Finally, comprehensive ablation studies confirm the effectiveness of the individual components designed in GradeOpt.
摘要:开放式简答题 (SAGs) 已被广泛认可为在学习分析 (LA) 背景下深入了解学习者回答的有力工具。然而,SAGs 在实际应用中常常面临评分工作量大和评分不一致的问题。随着自然语言处理 (NLP) 的最新进展,自动简答题评分 (ASAG) 为这些挑战提供了有前景的解决方案。尽管如此,当前的 ASAG 算法通常在通用性方面受限,并且往往针对特定问题进行定制。在本文中,我们提出了一种统一的基于多智能体的 ASAG 框架,名为 GradeOpt,该框架利用大语言模型 (LLMs) 作为 SAGs 的评分器。更重要的是,GradeOpt 在多智能体系统中引入了两个额外的基于 LLM 的智能体——反思器和优化器。这使得 GradeOpt 能够通过对其错误进行自我反思来自动优化原始评分指南。通过在一个具有挑战性的 ASAG 任务(即教学内容知识 (PCK) 和内容知识 (CK) 问题的评分)上的实验,GradeOpt 展示了相较于代表性基线在评分准确性和与人类评分员行为一致性方面的优越性能。最后,全面的消融研究证实了 GradeOpt 中各个设计组件的有效性。

[NLP-90] Controlled Generation of Natural Adversarial Documents for Stealthy Retrieval Poisoning

【速读】: 该论文试图解决基于嵌入相似性的检索系统(如增强生成检索)易受恶意文档攻击的问题。解决方案的关键在于设计了一种新的控制生成技术,该技术结合了对抗性目标(嵌入相似性)和基于开源代理语言模型计算的“自然性”目标。生成的对抗性文档不仅难以通过困惑度过滤或其他语言模型自动检测,而且在检索效果上与容易被检测的文档相当,同时显著优于先前的能量引导生成方法。

链接: https://arxiv.org/abs/2410.02163
作者: Collin Zhang,Tingwei Zhang,Vitaly Shmatikov
关键词-EN: Recent work showed, Recent work, craft malicious documents, classes of queries, work showed
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent work showed that retrieval based on embedding similarity (e.g., for retrieval-augmented generation) is vulnerable to poisoning: an adversary can craft malicious documents that are retrieved in response to broad classes of queries. We demonstrate that previous, HotFlip-based techniques produce documents that are very easy to detect using perplexity filtering. Even if generation is constrained to produce low-perplexity text, the resulting documents are recognized as unnatural by LLMs and can be automatically filtered from the retrieval corpus. We design, implement, and evaluate a new controlled generation technique that combines an adversarial objective (embedding similarity) with a “naturalness” objective based on soft scores computed using an open-source, surrogate LLM. The resulting adversarial documents (1) cannot be automatically detected using perplexity filtering and/or other LLMs, except at the cost of significant false positives in the retrieval corpus, yet (2) achieve similar poisoning efficacy to easily-detectable documents generated using HotFlip, and (3) are significantly more effective than prior methods for energy-guided generation, such as COLD.
摘要:最近的研究表明,基于嵌入相似性的检索方法(例如,用于增强生成的检索)容易受到中毒攻击:攻击者可以精心制作恶意文档,这些文档会针对广泛类别的查询被检索出来。我们展示了先前基于 HotFlip 的技术生成的文档很容易通过困惑度过滤被检测出来。即使生成过程被限制以产生低困惑度的文本,生成的文档仍会被大语言模型识别为不自然,并能从检索语料库中自动过滤掉。我们设计、实现并评估了一种新的受控生成技术,该技术结合了对抗性目标(嵌入相似性)和基于开源代理大语言模型计算的软分数的“自然性”目标。由此产生的对抗性文档(1)无法通过困惑度过滤和其他大语言模型自动检测,除非以检索语料库中显著的误报为代价,(2)在毒性效力上与使用 HotFlip 生成的易检测文档相当,(3)在能量引导生成方面显著优于先前的方法,如 COLD。
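文中反复作为检测基线的困惑度过滤,其原理可以简化为(假设性示意,数值为演示自拟):困惑度是平均负对数似然的指数,自然文本的困惑度较低,而多数对抗性文本的困惑度异常高。

```python
import math

def perplexity(token_logprobs):
    """困惑度 = exp(平均负对数似然);token_logprobs 可由任一语言模型给出。"""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def filter_corpus(docs, logprob_fn, threshold):
    """仅保留困惑度低于阈值的文档;高困惑度文档被视为可疑的对抗性文本。"""
    return [d for d in docs if perplexity(logprob_fn(d)) < threshold]

fake_logprobs = {
    "natural doc": [-1.0, -0.5, -1.5],    # 困惑度约 2.7
    "gibberish doc": [-6.0, -7.0, -8.0],  # 困惑度约 1096
}
kept = filter_corpus(list(fake_logprobs), fake_logprobs.get, threshold=20.0)
```

论文的攻击正是要生成既保持高嵌入相似度、又能落在该过滤阈值之下且被 LLM 判为自然的文档。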


[NLP-91] Mitigating Memorization In Language Models

【速读】: 该论文试图解决语言模型(LMs)在训练过程中“记忆”敏感或私有信息的问题,即模型在推理时可能直接复述训练数据中的敏感内容。解决方案的关键在于开发和评估多种记忆缓解方法,包括基于正则化、微调和机器遗忘的方法。论文提出了三种基于正则化的方法、三种基于微调的方法和十一种基于机器遗忘的方法,其中五种是新引入的方法。通过引入TinyMem,一个计算效率高的小型语言模型套件,论文展示了这些缓解方法在生产级LMs中的有效性。实验结果表明,基于正则化的方法在抑制记忆方面效果不佳且速度慢,基于微调的方法虽然有效但成本高,而基于遗忘的方法在速度和效果上更为优越,能够精确地定位和移除模型权重中的记忆信息,同时保持目标任务的性能。特别是,论文提出的BalancedSubnet遗忘方法在移除记忆信息的同时,相比其他方法更能保持模型性能。

链接: https://arxiv.org/abs/2410.02159
作者: Mansi Sakarvadia,Aswathy Ajith,Arham Khan,Nathaniel Hudson,Caleb Geniesse,Kyle Chard,Yaoqing Yang,Ian Foster,Michael W. Mahoney
关键词-EN: Language models, encode training data, training data, extract training data, inference-time queries
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Language models (LMs) can “memorize” information, i.e., encode training data in their weights in such a way that inference-time queries can lead to verbatim regurgitation of that data. This ability to extract training data can be problematic, for example, when data are private or sensitive. In this work, we investigate methods to mitigate memorization: three regularizer-based, three finetuning-based, and eleven machine unlearning-based methods, with five of the latter being new methods that we introduce. We also introduce TinyMem, a suite of small, computationally-efficient LMs for the rapid development and evaluation of memorization-mitigation methods. We demonstrate that the mitigation methods that we develop using TinyMem can successfully be applied to production-grade LMs, and we determine via experiment that: regularizer-based mitigation methods are slow and ineffective at curbing memorization; fine-tuning-based methods are effective at curbing memorization, but overly expensive, especially for retaining higher accuracies; and unlearning-based methods are faster and more effective, allowing for the precise localization and removal of memorized information from LM weights prior to inference. We show, in particular, that our proposed unlearning method BalancedSubnet outperforms other mitigation methods at removing memorized information while preserving performance on target tasks.
摘要:语言模型 (Language Models, LMs) 能够“记忆”信息,即在权重中编码训练数据,使得推理时查询可能导致数据的逐字复述。这种提取训练数据的能力在数据为私密或敏感时可能带来问题。在本研究中,我们探讨了减轻记忆化的方法:三种基于正则化的方法、三种基于微调的方法,以及十一种基于机器遗忘的方法,其中后五种是我们新引入的方法。我们还引入了 TinyMem,一套小型、计算高效的 LMs,用于快速开发和评估记忆化减轻方法。我们展示了使用 TinyMem 开发的减轻方法可以成功应用于生产级 LMs,并通过实验确定:基于正则化的减轻方法在抑制记忆化方面缓慢且无效;基于微调的方法在抑制记忆化方面有效,但成本过高,尤其是在保持较高准确性时;而基于遗忘的方法更快且更有效,允许在推理前精确地定位和移除 LM 权重中的记忆信息。我们特别展示了我们提出的遗忘方法 BalancedSubnet 在移除记忆信息的同时,在目标任务上保持性能方面优于其他减轻方法。

[NLP-92] he why what and how of AI-based coding in scientific research

【速读】: 该论文试图解决研究人员在编程(编码)过程中面临的挑战,即学习难度大和耗时的问题。解决方案的关键在于利用生成式AI,特别是大型语言模型(LLMs),通过直观的对话方式来辅助编码。论文通过分析LLMs在编码中的本质和作用(为什么)、提供的六种编码辅助类型(什么)以及一个包含五个步骤的实际操作流程(如何),来提供一个有效的框架。此外,论文还探讨了AI在编码中的局限性和未来展望,旨在帮助研究人员更有效地利用AI来提升编码实践和教育,从而加速科学进步。

链接: https://arxiv.org/abs/2410.02156
作者: Tonghe Zhuang,Zhicheng Lin
关键词-EN: Computer programming, remains challenging, challenging to learn, learn and time-consuming, time-consuming to carry
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL)
备注: 23 pages, 7 figure, 3 boxes

点击查看摘要

Abstract:Computer programming (coding) is indispensable for researchers across disciplines, yet it remains challenging to learn and time-consuming to carry out. Generative AI, particularly large language models (LLMs), has the potential to transform coding into intuitive conversations, but best practices and effective workflows are only emerging. We dissect AI-based coding through three key lenses: the nature and role of LLMs in coding (why), six types of coding assistance they provide (what), and a five-step workflow in action with practical implementation strategies (how). Additionally, we address the limitations and future outlook of AI in coding. By offering actionable insights, this framework helps to guide researchers in effectively leveraging AI to enhance coding practices and education, accelerating scientific progress.
摘要:计算机编程(编码)对于各学科的研究人员来说不可或缺,然而学习和执行编码仍然具有挑战性且耗时。生成式 AI,特别是大语言模型 (LLMs),有潜力将编码转变为直观的对话,但最佳实践和有效的工作流程仍在逐步形成。我们通过三个关键视角剖析基于 AI 的编码:LLMs 在编码中的本质和作用(为什么)、它们提供的六种编码辅助类型(什么),以及一个包含实际实施策略的五步工作流程(如何)。此外,我们还探讨了 AI 在编码中的局限性和未来展望。通过提供可操作的见解,这一框架有助于指导研究人员有效利用 AI 来提升编码实践和教育,从而加速科学进步。

[NLP-93] From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities

【速读】: 该论文试图解决多模态大语言模型在视觉和文本信息融合过程中对齐效果不佳的问题。解决方案的关键在于引入了一种新颖的图像分词器,通过将字节对编码(BPE)原理应用于视觉数据,直接将结构先验信息融入图像分词中,从而使Transformer模型能够更有效地跨模态学习和推理。这种方法不仅提升了模型在有限训练数据下的多模态理解能力,还在多个基准测试中显著提高了性能,并展现出良好的可扩展性。

链接: https://arxiv.org/abs/2410.02155
作者: Wanpeng Zhang,Zilong Xie,Yicheng Feng,Yijiang Li,Xingrun Xing,Sipeng Zheng,Zongqing Lu
关键词-EN: Large Language Models, made significant strides, Large Language, Multimodal Large Language, text-only Large Language
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models have made significant strides in integrating visual and textual information, yet they often struggle with effectively aligning these modalities. We introduce a novel image tokenizer that bridges this gap by applying the principle of Byte-Pair Encoding (BPE) to visual data. Unlike conventional approaches that rely on separate visual encoders, our method directly incorporates structural prior information into image tokens, mirroring the successful tokenization strategies used in text-only Large Language Models. This innovative approach enables Transformer models to more effectively learn and reason across modalities. Through theoretical analysis and extensive experiments, we demonstrate that our BPE Image Tokenizer significantly enhances MLLMs’ multimodal understanding capabilities, even with limited training data. Our method not only improves performance across various benchmarks but also shows promising scalability, potentially paving the way for more efficient and capable multimodal foundation models.
摘要:多模态大语言模型在整合视觉和文本信息方面取得了显著进展,但它们往往难以有效地对齐这些模态。我们引入了一种新颖的图像 Tokenizer,通过将字节对编码 (Byte-Pair Encoding, BPE) 原理应用于视觉数据,填补了这一空白。与依赖于独立视觉编码器的传统方法不同,我们的方法直接将结构先验信息融入图像 Token 中,类似于仅文本大语言模型中成功的 Tokenization 策略。这种创新方法使得 Transformer 模型能够更有效地跨模态学习和推理。通过理论分析和广泛的实验,我们证明我们的 BPE 图像 Tokenizer 显著提升了多模态大语言模型的多模态理解能力,即使在有限的训练数据下也是如此。我们的方法不仅在各种基准测试中提高了性能,还展示了良好的可扩展性,可能为更高效和强大的多模态基础模型铺平道路。
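作为原理示意,下面用一个极简的 Python 草图演示"对量化后的图像 patch Token 序列做 BPE 合并"的基本思路:反复统计最频繁的相邻 Token 对并合并为新 Token。其中 Token ID、合并次数等均为假设的玩具设定,并非论文的实际实现:

```python
from collections import Counter

def most_frequent_pair(seq):
    # 统计量化 patch 序列中相邻 Token 对的出现频次
    pairs = Counter(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair, new_id):
    # 将每一处 pair 替换为一个新的合并 Token ID
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def bpe_train(seq, num_merges, first_new_id):
    # 在单条序列(玩具语料)上学习 num_merges 条合并规则
    merges = []
    for k in range(num_merges):
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        new_id = first_new_id + k
        merges.append((pair, new_id))
        seq = merge_pair(seq, pair, new_id)
    return merges, seq

# 玩具示例:ID 0-3 代表量化 patch 的码本条目
tokens = [0, 1, 0, 1, 2, 0, 1, 3]
merges, merged = bpe_train(tokens, num_merges=1, first_new_id=100)
print(merges)   # [((0, 1), 100)]
print(merged)   # [100, 100, 2, 100, 3]
```

与文本 BPE 相同,频繁共现的视觉 Token 对被合并为更大的单元,从而把结构先验编码进 Token 词表。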

[NLP-94] Matrix and Relative Weak Crossover in Japanese: An Experimental Investigation

【速读】: 该论文试图解决弱跨句效应在主句和从句中的本质差异问题,特别是区分这种差异是由于线性优先顺序还是句法结构的影响。解决方案的关键在于使用日语进行实验,因为日语缺乏英语中的词序混淆问题,从而能够更清晰地揭示句法结构对弱跨句效应的影响。研究结果与Fukushima等人的发现一致,表明这种差异主要是结构性的,而非简单的优先顺序问题。

链接: https://arxiv.org/abs/2410.02149
作者: Haruka Fukushima,Daniel Plesniak,Daisuke Bekki
关键词-EN: differ in nature, weak crossover, crossover effects differ, relative clauses, matrix weak crossover
类目: Computation and Language (cs.CL)
备注: 18 pages, 17 figures, To appear in Proceedings of The Society of Modern Grammar (SMOG)'s International Conference on Syntax and Semantics (ICSS) 2024

点击查看摘要

Abstract:This paper provides evidence that weak crossover effects differ in nature between matrix and relative clauses. Fukushima et al. (2024) provided similar evidence, showing that, when various non-structural factors were eliminated, English speakers never accepted matrix weak crossover cases, but often accepted relative weak crossover ones. Those results were limited, however, by English word order, which led to uncertainty as to whether this difference was due to the effects of linear precedence or syntactic structure. In this paper, to distinguish between these two possibilities, we conduct an experiment using Japanese, which lacks the word-order confound that English had. We find results that are qualitatively in line with Fukushima et al. (2024), suggesting that the relevant distinction is structural and not based simply on precedence.
摘要:本文提供了证据,表明在主句和关系从句中,弱跨越效应的性质存在差异。Fukushima 等人 (2024) 也提供了类似的证据,表明当消除各种非结构因素后,英语使用者从不接受主句中的弱跨越情况,但经常接受关系从句中的弱跨越情况。然而,这些结果受限于英语的词序,导致无法确定这种差异是由于线性优先效应还是句法结构效应。在本文中,为了区分这两种可能性,我们使用日语进行了一项实验,日语不存在英语中的词序混淆问题。我们发现的结果与 Fukushima 等人 (2024) 的研究结果在定性上一致,表明相关区别是结构性的,而非仅仅基于优先顺序。

[NLP-95] C-MELT: Contrastive Enhanced Masked Auto-Encoders for ECG-Language Pre-Training

【速读】: 该论文试图解决心电图(ECG)信号与其伴随文本报告的跨模态整合问题,以提升心血管疾病的临床诊断准确性。解决方案的关键在于提出了一种名为C-MELT的新框架,该框架通过对比掩码自编码器架构预训练ECG和文本数据,结合生成模型与增强的判别能力,实现跨模态表示的鲁棒性。具体方法包括掩码模态建模、专用损失函数以及针对跨模态对齐的改进负采样策略,从而在多个公共数据集的下游任务中显著超越现有方法,分别在线性探测和零样本性能上提升了15%和2%。

链接: https://arxiv.org/abs/2410.02131
作者: Manh Pham,Aaqib Saeed,Dong Ma
关键词-EN: diagnosing cardiovascular diseases, Accurate interpretation, interpretation of Electrocardiogram, cardiovascular diseases, Integrating ECG signals
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Accurate interpretation of Electrocardiogram (ECG) signals is pivotal for diagnosing cardiovascular diseases. Integrating ECG signals with their accompanying textual reports holds immense potential to enhance clinical diagnostics through the combination of physiological data and qualitative insights. However, this integration faces significant challenges due to inherent modality disparities and the scarcity of labeled data for robust cross-modal learning. To address these obstacles, we propose C-MELT, a novel framework that pre-trains ECG and text data using a contrastive masked auto-encoder architecture. C-MELT uniquely combines the strengths of generative with enhanced discriminative capabilities to achieve robust cross-modal representations. This is accomplished through masked modality modeling, specialized loss functions, and an improved negative sampling strategy tailored for cross-modal alignment. Extensive experiments on five public datasets across diverse downstream tasks demonstrate that C-MELT significantly outperforms existing methods, achieving 15% and 2% increases in linear probing and zero-shot performance over state-of-the-art models, respectively. These results highlight the effectiveness of C-MELT, underscoring its potential to advance automated clinical diagnostics through multi-modal representations.
摘要:准确解读心电图 (ECG) 信号对于诊断心血管疾病至关重要。将 ECG 信号与其伴随的文本报告相结合,具有巨大的潜力,可以通过生理数据和定性洞察的结合来增强临床诊断。然而,由于固有的模态差异和用于稳健跨模态学习的标注数据稀缺,这种整合面临重大挑战。为了解决这些障碍,我们提出了 C-MELT,这是一种新颖的框架,使用对比掩码自编码器架构对 ECG 和文本数据进行预训练。C-MELT 独特地结合了生成式和增强的判别能力,以实现稳健的跨模态表示。这是通过掩码模态建模、专门的损失函数和改进的负采样策略实现的,这些策略专为跨模态对齐而设计。在五个公共数据集上进行的广泛实验表明,C-MELT 在各种下游任务中显著优于现有方法,分别在线性探测和零样本性能上比最先进模型提高了 15% 和 2%。这些结果突显了 C-MELT 的有效性,强调了其通过多模态表示推进自动化临床诊断的潜力。
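C-MELT 的对比部分可以用一个 InfoNCE 式损失来示意:批内配对的 ECG/文本嵌入相似度被拉高,错配的被压低。下面是纯 Python 的原理性草图(仅含 ECG→文本方向,嵌入取值、温度参数均为假设,并非论文的实际损失函数):

```python
import math

def info_nce(ecg_embs, text_embs, temperature=0.1):
    # InfoNCE 式对比损失(ECG→文本方向):第 i 个 ECG 的正例是第 i 条文本
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    n = len(ecg_embs)
    loss = 0.0
    for i in range(n):
        logits = [dot(ecg_embs[i], text_embs[j]) / temperature for j in range(n)]
        m = max(logits)  # log-sum-exp 的数值稳定处理
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]  # 等价于 -log softmax(正例)
    return loss / n

# 完全对齐的玩具配对损失接近 0,打乱配对后损失显著升高
aligned = [[1.0, 0.0], [0.0, 1.0]]
print(info_nce(aligned, aligned) < info_nce(aligned, aligned[::-1]))  # True
```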

[NLP-96] L-CiteEval: Do Long-Context Models Truly Leverage Context for Responding?

【速读】: 该论文试图解决长上下文模型(LCMs)在处理长文本任务时,生成的结果难以验证其忠实于原始上下文的问题。解决方案的关键在于引入L-CiteEval,这是一个综合性的多任务基准测试,专门用于评估LCMs的理解能力和忠实度。L-CiteEval涵盖了11个不同领域的任务,上下文长度从8K到48K不等,并提供了一套完全自动化的评估工具。通过对比11个先进的闭源和开源LCMs,研究发现开源模型在引用准确性和召回率方面显著落后于闭源模型,表明当前开源LCMs更倾向于依赖其内在知识而非给定上下文进行响应,这在实际应用中对用户体验构成重大风险。此外,论文还评估了RAG方法,发现RAG能显著提高LCMs的忠实度,尽管生成质量略有下降,并揭示了LCMs的注意力机制与引用生成过程之间的关联。

链接: https://arxiv.org/abs/2410.02115
作者: Zecheng Tang,Keyan Zhou,Juntao Li,Baibei Ji,Jianye Hou,Min Zhang
关键词-EN: made remarkable strides, involve long context, offering users great, users great convenience, recent years
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Long-context models (LCMs) have made remarkable strides in recent years, offering users great convenience for handling tasks that involve long context, such as document summarization. As the community increasingly prioritizes the faithfulness of generated results, merely ensuring the accuracy of LCM outputs is insufficient, as it is quite challenging for humans to verify the results from the extremely lengthy context. Yet, although some efforts have been made to assess whether LCMs respond truly based on the context, these works either are limited to specific tasks or heavily rely on external evaluation resources like this http URL. In this work, we introduce L-CiteEval, a comprehensive multi-task benchmark for long-context understanding with citations, aiming to evaluate both the understanding capability and faithfulness of LCMs. L-CiteEval covers 11 tasks from diverse domains, spanning context lengths from 8K to 48K, and provides a fully automated evaluation suite. Through testing with 11 cutting-edge closed-source and open-source LCMs, we find that although these models show minor differences in their generated results, open-source models substantially trail behind their closed-source counterparts in terms of citation accuracy and recall. This suggests that current open-source LCMs are prone to responding based on their inherent knowledge rather than the given context, posing a significant risk to the user experience in practical applications. We also evaluate the RAG approach and observe that RAG can significantly improve the faithfulness of LCMs, albeit with a slight decrease in the generation quality. Furthermore, we discover a correlation between the attention mechanisms of LCMs and the citation generation process.
摘要:长上下文模型 (Long-context models, LCMs) 近年来取得了显著进展,为用户处理涉及长上下文的任务(如文档摘要)提供了极大的便利。随着社区越来越重视生成结果的忠实性,仅仅确保 LCM 输出的准确性是不够的,因为人类很难验证来自极长上下文的结果。然而,尽管已有一些工作尝试评估 LCM 是否基于上下文真实响应,但这些工作要么局限于特定任务,要么严重依赖外部评估资源,如 [20]。在这项工作中,我们引入了 L-CiteEval,这是一个综合的多任务基准,用于评估长上下文理解能力及引用忠实性。L-CiteEval 涵盖了来自不同领域的 11 项任务,上下文长度从 8K 到 48K 不等,并提供了一套完全自动化的评估工具。通过对 11 个前沿的闭源和开源 LCM 进行测试,我们发现尽管这些模型在生成结果上略有差异,但开源模型在引用准确性和召回率方面明显落后于闭源模型。这表明当前的开源 LCM 倾向于基于其固有知识而非给定上下文进行响应,这在实际应用中对用户体验构成了重大风险。我们还评估了 RAG 方法,发现 RAG 可以显著提高 LCM 的忠实性,尽管生成质量略有下降。此外,我们发现 LCM 的注意力机制与引用生成过程之间存在关联。
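引用准确率与召回率可以按"逐句比较预测引用与标注支持来源"的思路计算。以下草图中的数据格式(每条语句对应一个引用 ID 集合)为示意性假设,并非 L-CiteEval 的实际评估实现:

```python
def citation_precision_recall(predicted, gold):
    # predicted / gold:每条生成语句对应的引用 ID 集合列表,按语句宏平均
    precisions, recalls = [], []
    for pred, ref in zip(predicted, gold):
        hit = len(pred & ref)
        precisions.append(hit / len(pred) if pred else 0.0)
        recalls.append(hit / len(ref) if ref else 1.0)
    n = len(precisions)
    return sum(precisions) / n, sum(recalls) / n

pred = [{1, 2}, {3}]   # 模型为两条语句给出的引用
gold = [{2}, {3, 4}]   # 人工标注的支持来源
p, r = citation_precision_recall(pred, gold)
print(p, r)  # 0.75 0.75
```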

[NLP-97] Can LLMs Reliably Simulate Human Learner Actions? A Simulation Authoring Framework for Open-Ended Learning Environments

【速读】: 该论文试图解决使用大型语言模型(LLMs)模拟学习者行为时面临的两大关键问题:一是LLMs对提示语的微小变化高度敏感,导致其在新场景中的泛化能力不足;二是LLMs可能通过记忆训练数据中的相似场景来“重现”行为,而非真正模拟新行为。解决方案的关键是提出Hyp-Mix框架,该框架允许专家通过结合可测试的学习者行为假设来开发和评估模拟,从而在物理学习环境中验证了GPT-4 Turbo在不同学习者模型下仍能保持行为校准,为LLMs在开放式交互学习环境中模拟真实行为提供了初步证据。

链接: https://arxiv.org/abs/2410.02110
作者: Amogh Mannekote,Adam Davies,Jina Kang,Kristy Elizabeth Boyer
关键词-EN: Simulating learner actions, adaptations before deployment, actions helps stress-test, prototype new adaptations, interactive learning environments
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Simulating learner actions helps stress-test open-ended interactive learning environments and prototype new adaptations before deployment. While recent studies show the promise of using large language models (LLMs) for simulating human behavior, such approaches have not gone beyond rudimentary proof-of-concept stages due to key limitations. First, LLMs are highly sensitive to minor prompt variations, raising doubts about their ability to generalize to new scenarios without extensive prompt engineering. Moreover, apparently successful outcomes can often be unreliable, either because domain experts unintentionally guide LLMs to produce expected results, leading to self-fulfilling prophecies; or because the LLM has encountered highly similar scenarios in its training data, meaning that models may not be simulating behavior so much as regurgitating memorized content. To address these challenges, we propose Hyp-Mix, a simulation authoring framework that allows experts to develop and evaluate simulations by combining testable hypotheses about learner behavior. Testing this framework in a physics learning environment, we found that GPT-4 Turbo maintains calibrated behavior even as the underlying learner model changes, providing the first evidence that LLMs can be used to simulate realistic behaviors in open-ended interactive learning environments, a necessary prerequisite for useful LLM behavioral simulation.
摘要:模拟学习者行为有助于对开放式互动学习环境进行压力测试,并在部署前原型化新的适应措施。尽管最近的研究表明,使用大语言模型 (LLMs) 模拟人类行为具有潜力,但由于关键限制,这些方法尚未超越初步的概念验证阶段。首先,LLMs 对微小的提示变化高度敏感,这引发了对它们在没有广泛提示工程的情况下能否泛化到新场景的怀疑。此外,表面上成功的结果往往不可靠,原因要么是领域专家无意中引导 LLMs 产生预期结果,导致自我实现的预言;要么是因为 LLM 在其训练数据中遇到了高度相似的场景,这意味着模型可能不是在模拟行为,而是在重复记忆的内容。为了应对这些挑战,我们提出了 Hyp-Mix,一个模拟创作框架,允许专家通过结合关于学习者行为的可测试假设来开发和评估模拟。在物理学习环境中测试该框架时,我们发现 GPT-4 Turbo 即使在底层学习者模型变化时也能保持校准行为,这提供了第一个证据,表明 LLMs 可以用于在开放式互动学习环境中模拟现实行为,这是有用 LLM 行为模拟的必要前提。

[NLP-98] ReGenesis: LLMs can Grow into Reasoning Generalists via Self-Improvement

【速读】: 该论文试图解决大型语言模型(LLMs)在推理能力提升过程中,高质量推理路径数据获取成本高或受限于许可证的问题。解决方案的关键在于提出了一种名为Reasoning Generalist via Self-Improvement (ReGenesis)的方法,通过从抽象到具体的逐步转换,使LLMs能够自我合成推理路径作为训练数据,而无需额外的人工监督或特定任务示例。这种方法不仅提高了模型在域内任务上的表现,还显著增强了其在域外任务(OOD)上的泛化能力,相较于现有方法,ReGenesis在六个域外任务上平均提升了约6.1%的性能。

链接: https://arxiv.org/abs/2410.02108
作者: Xiangyu Peng,Congying Xia,Xinyi Yang,Caiming Xiong,Chien-Sheng Wu,Chen Xing
关键词-EN: Large Language Models, Post-training Large Language, Large Language, Language Models, Post-training Large
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Post-training Large Language Models (LLMs) with explicit reasoning trajectories can enhance their reasoning abilities. However, acquiring such high-quality trajectory data typically demands meticulous supervision from humans or superior models, which can be either expensive or license-constrained. In this paper, we explore how far an LLM can improve its reasoning by self-synthesizing reasoning paths as training data without any additional supervision. Existing self-synthesizing methods, such as STaR, suffer from poor generalization to out-of-domain (OOD) reasoning tasks. We hypothesize this is because their self-synthesized reasoning paths are too task-specific, lacking general task-agnostic reasoning guidance. To address this, we propose Reasoning Generalist via Self-Improvement (ReGenesis), a method to self-synthesize reasoning paths as post-training data by progressing from abstract to concrete. More specifically, ReGenesis self-synthesizes reasoning paths by converting general reasoning guidelines into task-specific ones, generating reasoning structures, and subsequently transforming these structures into reasoning paths, without the need for human-designed task-specific examples used in existing methods. We show that ReGenesis achieves superior performance on all in-domain and OOD settings tested compared to existing methods. For six OOD tasks specifically, while previous methods exhibited an average performance decrease of approximately 4.6% after post-training, ReGenesis delivers around 6.1% performance improvement. We also conduct in-depth analysis of our framework and show ReGenesis is effective across various LLMs and design choices.
摘要:通过显式推理轨迹对大语言模型 (LLM) 进行后训练可以增强其推理能力。然而,获取这种高质量的轨迹数据通常需要人类或高级模型的细致监督,这可能既昂贵又受许可证限制。本文探讨了在没有额外监督的情况下,LLM 通过自我合成推理路径作为训练数据,能够多大程度地提升其推理能力。现有的自我合成方法,如 STaR,在面对域外 (OOD) 推理任务时泛化能力较差。我们假设这是由于它们自我合成的推理路径过于任务特定,缺乏通用的任务无关推理指导。为解决这一问题,我们提出了通过自我改进实现推理通才 (ReGenesis) 的方法,该方法通过从抽象到具体的逐步推进,自我合成推理路径作为后训练数据。更具体地说,ReGenesis 通过将通用推理指南转化为任务特定的指南,生成推理结构,然后将这些结构转化为推理路径,而无需使用现有方法中所需的人工设计的任务特定示例。我们展示了 ReGenesis 在所有测试的域内和域外设置中均优于现有方法。对于六个特定的域外任务,尽管之前的方法在后训练后平均性能下降约 4.6%,但 ReGenesis 带来了约 6.1% 的性能提升。我们还对我们的框架进行了深入分析,并展示了 ReGenesis 在各种 LLM 和设计选择中的有效性。

[NLP-99] Racing Thoughts: Explaining Large Language Model Contextualization Errors

【速读】: 该论文试图解决大型语言模型在处理上下文信息时可能出现的错误,特别是当模型未能正确区分词汇的多义性时,如将“bank”错误地理解为金融机构而非地理特征。论文提出了“LLM Race Conditions Hypothesis”,认为这种上下文错误是由于模型在处理输入序列时未能正确维护词汇间的依赖关系,导致信息整合错误。解决方案的关键在于识别并纠正这些依赖关系的违反,通过机制性解释技术提供相关和因果证据,并提出推理时干预措施以改善模型的上下文理解能力。

链接: https://arxiv.org/abs/2410.02102
作者: Michael A. Lepori,Michael Mozer,Asma Ghandeharioun
关键词-EN: transformer-based language models, relevant contextual information, integrate relevant contextual, complete a task, profound success
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The profound success of transformer-based language models can largely be attributed to their ability to integrate relevant contextual information from an input sequence in order to generate a response or complete a task. However, we know very little about the algorithms that a model employs to implement this capability, nor do we understand their failure modes. For example, given the prompt “John is going fishing, so he walks over to the bank. Can he make an ATM transaction?”, a model may incorrectly respond “Yes” if it has not properly contextualized “bank” as a geographical feature, rather than a financial institution. We propose the LLM Race Conditions Hypothesis as an explanation of contextualization errors of this form. This hypothesis identifies dependencies between tokens (e.g., “bank” must be properly contextualized before the final token, “?”, integrates information from “bank”), and claims that contextualization errors are a result of violating these dependencies. Using a variety of techniques from mechanistic interpretability, we provide correlational and causal evidence in support of the hypothesis, and suggest inference-time interventions to address it.
摘要:基于 Transformer 的语言模型之所以取得深远成功,很大程度上归功于其能够整合输入序列中的相关上下文信息,以生成响应或完成任务。然而,我们对于模型实现这一能力的算法知之甚少,也不了解其失败模式。例如,给定提示“John 要去钓鱼,所以他走到河岸。他能进行 ATM 交易吗?”,如果模型未能正确地将“bank”上下文化为地理特征而非金融机构,则可能会错误地回答“是”。我们提出大语言模型(LLM)竞争条件假设,作为解释此类上下文化错误的理论。该假设识别了 Token 之间的依赖关系(例如,“bank”必须在最终 Token “?”整合来自“bank”的信息之前被正确上下文化),并声称上下文化错误是由于违反这些依赖关系所致。通过运用多种机制可解释性技术,我们提供了支持该假设的相关性和因果证据,并建议在推理时采取干预措施以解决这一问题。

[NLP-100] A Watermark for Black-Box Language Models

【速读】: 该论文试图解决现有水印技术在检测大型语言模型(LLM)输出时需要白盒访问(即访问模型的下一个词概率分布)的问题。解决方案的关键在于提出了一种仅需黑盒访问(即仅能从LLM中采样序列)的水印方案,该方案具有无失真特性,并支持使用多个密钥进行链式或嵌套应用。通过提供性能保证和实验验证,论文展示了该方案在白盒访问可用时的应用潜力,并证明了其在某些情况下优于现有的白盒水印方案。

链接: https://arxiv.org/abs/2410.02099
作者: Dara Bahri,John Wieting,Dana Alon,Donald Metzler
关键词-EN: large language models, recently emerged, effective strategy, strategy for detecting, detecting the outputs
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Watermarking has recently emerged as an effective strategy for detecting the outputs of large language models (LLMs). Most existing schemes require white-box access to the model’s next-token probability distribution, which is typically not accessible to downstream users of an LLM API. In this work, we propose a principled watermarking scheme that requires only the ability to sample sequences from the LLM (i.e. black-box access), boasts a distortion-free property, and can be chained or nested using multiple secret keys. We provide performance guarantees, demonstrate how it can be leveraged when white-box access is available, and show when it can outperform existing white-box schemes via comprehensive experiments.
摘要:水印技术最近作为一种检测大语言模型 (LLM) 输出结果的有效策略而崭露头角。大多数现有方案需要对模型的下一个 Token 概率分布进行白盒访问,这在通常情况下是 LLM API 的下游用户无法获得的。在本研究中,我们提出了一种基于原则的水印方案,该方案仅需要从 LLM 中采样序列的能力(即黑盒访问),具备无失真特性,并且可以通过多个密钥进行链式或嵌套使用。我们提供了性能保证,展示了在白盒访问可用时如何利用该方案,并通过全面的实验展示了其在某些情况下优于现有白盒方案的能力。
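仅依赖黑盒采样的水印,其直觉可以用"多次采样、按密钥伪随机打分、择优输出"的玩具示例说明:生成端在 k 个候选中选密钥得分最高者,检测端检查文本得分是否异常偏高。这只是原理示意,并非论文提出的无失真方案本身;其中的打分函数与阈值均为假设:

```python
import hashlib
import random

def keyed_score(key, text):
    # 由密钥与文本导出的 [0, 1) 区间伪随机得分
    h = hashlib.sha256((key + text).encode()).hexdigest()
    return int(h, 16) / 16 ** 64

def watermarked_sample(sample_fn, key, k=8):
    # 从黑盒采样器抽取 k 个候选,保留密钥得分最高者
    candidates = [sample_fn() for _ in range(k)]
    return max(candidates, key=lambda t: keyed_score(key, t))

def detect(key, text, threshold=0.7):
    # 无水印文本的得分近似均匀分布,水印文本的得分系统性偏高
    return keyed_score(key, text) > threshold

rng = random.Random(0)
vocab = ["alpha", "beta", "gamma", "delta"]

def sample_fn():
    return " ".join(rng.choice(vocab) for _ in range(12))

out = watermarked_sample(sample_fn, key="secret", k=16)
print(detect("secret", out))
```

持有密钥者可以检测水印;严格的无失真性与可嵌套的多密钥构造需要论文中的专门设计,此处未体现。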

[NLP-101] RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning

【速读】: 该论文试图解决大型语言模型(LLMs)在代码合成任务中难以通过迭代改进代码的问题,特别是在竞争性编程任务中。解决方案的关键在于提出了一种端到端的强化学习方法,该方法能够使模型有效地利用执行反馈,从而在减少所需样本数量的同时,显著提升模型在代码合成任务中的表现,尤其是在小型(8B参数)和大型(70B参数)模型上均实现了新的最先进结果。

链接: https://arxiv.org/abs/2410.02089
作者: Jonas Gehring,Kunhao Zheng,Jade Copet,Vegard Mella,Taco Cohen,Gabriel Synnaeve
关键词-EN: agents solve user-specified, required manual engagement, solve user-specified tasks, deployed as agents, Large language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) deployed as agents solve user-specified tasks over multiple steps while keeping the required manual engagement to a minimum. Crucially, such LLMs need to ground their generations in any feedback obtained to reliably achieve desired outcomes. We propose an end-to-end reinforcement learning method for teaching models to leverage execution feedback in the realm of code synthesis, where state-of-the-art LLMs struggle to improve code iteratively compared to independent sampling. We benchmark on competitive programming tasks, where we achieve new state-of-the-art results with both small (8B parameters) and large (70B) models while reducing the number of samples required by an order of magnitude. Our analysis of inference-time behavior demonstrates that our method produces LLMs that effectively leverage automatic feedback over multiple steps.
摘要:大语言模型 (LLMs) 作为智能体部署时,能够在多个步骤中解决用户指定的任务,同时将所需的人工参与降至最低。关键在于,这些 LLMs 需要基于获得的任何反馈来调整其生成内容,以可靠地实现预期结果。我们提出了一种端到端的强化学习方法,用于教导模型在代码合成领域中利用执行反馈,在该领域中,最先进的 LLMs 在迭代改进代码方面难以与独立采样相比。我们在竞争性编程任务上进行了基准测试,使用小型 (8B 参数) 和大型 (70B) 模型均取得了新的最先进结果,同时将所需的样本数量减少了近一个数量级。我们对推理时行为的分析表明,我们的方法能够使 LLMs 在多个步骤中有效地利用自动反馈。

[NLP-102] EMMA: Efficient Visual Alignment in Multi-Modal LLMs

【速读】: 该论文试图解决多模态大语言模型(MLLMs)中视觉编码与语言模型融合效率低下的问题,特别是在任务特定适应性方面。解决方案的关键在于提出了EMMA(Efficient Multi-Modal Adaptation)模块,该模块通过一种高效的早期融合机制,以极少的参数增加(不到0.2%的模型大小增加)实现了视觉和文本编码的有效融合,生成了指令感知的视觉表示,从而显著提升了模型在多任务上的性能和鲁棒性。

链接: https://arxiv.org/abs/2410.02080
作者: Sara Ghazanfari,Alexandre Araujo,Prashanth Krishnamurthy,Siddharth Garg,Farshad Khorrami
关键词-EN: Multi-modal Large Language, Large Language Models, recently exhibited impressive, exhibited impressive general-purpose, impressive general-purpose capabilities
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multi-modal Large Language Models (MLLMs) have recently exhibited impressive general-purpose capabilities by leveraging vision foundation models to encode the core concepts of images into representations. These are then combined with instructions and processed by the language model to generate high-quality responses. Despite significant progress in enhancing the language component, challenges persist in optimally fusing visual encodings within the language model for task-specific adaptability. Recent research has focused on improving this fusion through modality adaptation modules but at the cost of significantly increased model complexity and training data needs. In this paper, we propose EMMA (Efficient Multi-Modal Adaptation), a lightweight cross-modality module designed to efficiently fuse visual and textual encodings, generating instruction-aware visual representations for the language model. Our key contributions include: (1) an efficient early fusion mechanism that integrates vision and language representations with minimal added parameters (less than 0.2% increase in model size), (2) an in-depth interpretability analysis that sheds light on the internal mechanisms of the proposed method; (3) comprehensive experiments that demonstrate notable improvements on both specialized and general benchmarks for MLLMs. Empirical results show that EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations. Our code is available at this https URL
摘要:多模态大语言模型 (Multi-modal Large Language Models, MLLMs) 近期通过利用视觉基础模型将图像的核心概念编码为表示形式,展示了令人印象深刻的通用能力。这些表示形式随后与指令结合,并由语言模型处理以生成高质量的响应。尽管在增强语言组件方面取得了显著进展,但在任务特定适应性方面,最佳地将视觉编码融入语言模型的挑战依然存在。近期研究主要通过模态适应模块来改进这种融合,但代价是显著增加了模型复杂性和训练数据需求。本文提出了 EMMA (Efficient Multi-Modal Adaptation),一种轻量级的跨模态模块,旨在高效地融合视觉和文本编码,为语言模型生成指令感知的视觉表示。我们的主要贡献包括:(1) 一种高效的早期融合机制,该机制在最小化额外参数的情况下(模型大小增加不到 0.2%)集成视觉和语言表示;(2) 深入的可解释性分析,揭示了所提出方法的内部机制;(3) 全面的实验,展示了在 MLLMs 的专用和通用基准测试中显著的改进。实证结果表明,EMMA 在多个任务上的性能提升了高达 9.3%,同时显著提高了对幻觉的鲁棒性。我们的代码可在以下链接获取:https URL
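"早期融合且新增参数极少"的思路可以用一个玩具示例说明:只新增一个小线性投影,把视觉 Token 投到文本嵌入维度后直接拼接到文本序列前。其中维度与权重均为示意性假设,并非 EMMA 的实际模块:

```python
def early_fuse(visual_tokens, text_tokens, w, b):
    # 新增参数仅为一个线性层 (w, b):把每个视觉 Token 投影后前置到文本序列
    def linear(v):
        return [sum(vi * wij for vi, wij in zip(v, col)) + bj
                for col, bj in zip(w, b)]
    return [linear(v) for v in visual_tokens] + list(text_tokens)

# 玩具尺寸:2 个 3 维视觉 Token 投影到 2 维;2 个 2 维文本 Token
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # 2 行输出维 x 3 列输入维
b = [0.0, 0.0]
visual = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
text = [[0.5, 0.5], [0.1, 0.9]]
fused = early_fuse(visual, text, w, b)
print(fused)  # [[1.0, 2.0], [4.0, 5.0], [0.5, 0.5], [0.1, 0.9]]
```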

[NLP-103] Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct ICLR2025

【速读】: 该论文试图解决的问题是大语言模型(LLMs)是否能够识别自己的写作,并探讨这一现象的行为层面是否稳健、实现机制以及是否可控。解决方案的关键在于发现并利用模型残差流中的一个向量,该向量在模型进行自我写作识别时被不同程度地激活,与模型的“自我”概念相关,并且可以通过操纵该向量来控制模型的行为和感知,使其在生成或阅读文本时声称或否认自己的作者身份。

链接: https://arxiv.org/abs/2410.02064
作者: Christopher Ackerman,Nina Panickssery
关键词-EN: model, reported that LLMs, LLMs can recognize, vector, chat model
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 10 pages, 13 figs, 2 tables, submitted to ICLR 2025

点击查看摘要

Abstract:It has been reported that LLMs can recognize their own writing. As this has potential implications for AI safety, yet is relatively understudied, we investigate the phenomenon, seeking to establish whether it robustly occurs at the behavioral level, how the observed behavior is achieved, and whether it can be controlled. First, we find that the Llama3-8b-Instruct chat model - but not the base Llama3-8b model - can reliably distinguish its own outputs from those of humans, and present evidence that the chat model is likely using its experience with its own outputs, acquired during post-training, to succeed at the writing recognition task. Second, we identify a vector in the residual stream of the model that is differentially activated when the model makes a correct self-written-text recognition judgment, show that the vector activates in response to information relevant to self-authorship, present evidence that the vector is related to the concept of “self” in the model, and demonstrate that the vector is causally related to the model’s ability to perceive and assert self-authorship. Finally, we show that the vector can be used to control both the model’s behavior and its perception, steering the model to claim or disclaim authorship by applying the vector to the model’s output as it generates it, and steering the model to believe or disbelieve it wrote arbitrary texts by applying the vector to them as the model reads them.
摘要:已有报道指出,大语言模型 (LLM) 能够识别其自身的写作。由于这一现象对 AI 安全具有潜在影响,但相关研究相对较少,我们对此进行了深入研究,旨在确定这一现象是否在行为层面上稳健地发生,其背后的实现机制是什么,以及是否可以对其进行控制。首先,我们发现 Llama3-8b-Instruct 聊天模型(而非基础的 Llama3-8b 模型)能够可靠地区分其自身输出与人类输出,并提供了证据表明,聊天模型可能利用其在训练后积累的与其自身输出相关的经验,成功完成写作识别任务。其次,我们识别出模型残差流中的一个向量,该向量在模型做出正确的自我写作文本识别判断时表现出差异性激活,表明该向量对与自我创作相关的信息有反应,并提供了证据表明该向量与模型中的“自我”概念相关,同时证明了该向量与模型感知和主张自我创作能力之间存在因果关系。最后,我们展示了该向量可用于控制模型的行为和感知,通过在模型生成输出时应用该向量,引导模型声称或否认创作权,并通过在模型阅读任意文本时应用该向量,引导模型相信或怀疑其是否为作者。
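文中"将向量应用于模型生成或阅读的文本以操控行为",本质上是在残差流激活上叠加一个缩放后的方向向量。以下为纯 Python 示意,隐藏维度与"自我"方向均为假设:

```python
import math

def apply_steering(hidden, vector, alpha):
    # 在每个位置的残差流激活上加 alpha 倍的单位方向向量
    norm = math.sqrt(sum(x * x for x in vector))
    v = [x / norm for x in vector]
    return [[h + alpha * vi for h, vi in zip(row, v)] for row in hidden]

def project(hidden, vector):
    # 每个位置在单位概念方向上的投影(点积)
    norm = math.sqrt(sum(x * x for x in vector))
    v = [x / norm for x in vector]
    return [sum(h * vi for h, vi in zip(row, v)) for row in hidden]

# 玩具残差流:3 个位置、4 维隐藏状态;假设的"自我"方向
hidden = [[0.1, -0.2, 0.3, 0.0], [1.0, 0.5, -0.5, 0.2], [0.0, 0.0, 0.0, 1.0]]
concept = [1.0, 0.0, 0.0, 0.0]

steered = apply_steering(hidden, concept, alpha=4.0)
before = project(hidden, concept)
after = project(steered, concept)
print([round(a - b, 6) for a, b in zip(after, before)])  # [4.0, 4.0, 4.0]
```

alpha 取正值把激活推向该概念方向(对应声称作者身份),取负值则推离(对应否认作者身份)。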

[NLP-104] PP-LLM: Modeling Temporal Point Processes by Efficiently Fine-Tuning Large Language Models

【速读】: 该论文试图解决传统时间点过程(TPP)模型在处理事件序列时,难以同时捕捉事件的语义和时间模式的问题。解决方案的关键在于引入TPP-LLM框架,该框架将大型语言模型(LLM)与TPP结合,通过直接利用事件类型的文本描述来捕捉丰富的语义信息,同时通过引入时间嵌入和参数高效微调(PEFT)方法来有效学习时间动态,从而在不大量重新训练的情况下提高预测精度和计算效率。

链接: https://arxiv.org/abs/2410.02062
作者: Zefang Liu,Yinzhu Quan
关键词-EN: Temporal point processes, transportation systems, point processes, social networks, timing and occurrence
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Temporal point processes (TPPs) are widely used to model the timing and occurrence of events in domains such as social networks, transportation systems, and e-commerce. In this paper, we introduce TPP-LLM, a novel framework that integrates large language models (LLMs) with TPPs to capture both the semantic and temporal aspects of event sequences. Unlike traditional methods that rely on categorical event type representations, TPP-LLM directly utilizes the textual descriptions of event types, enabling the model to capture rich semantic information embedded in the text. While LLMs excel at understanding event semantics, they are less adept at capturing temporal patterns. To address this, TPP-LLM incorporates temporal embeddings and employs parameter-efficient fine-tuning (PEFT) methods to effectively learn temporal dynamics without extensive retraining. This approach improves both predictive accuracy and computational efficiency. Experimental results across diverse real-world datasets demonstrate that TPP-LLM outperforms state-of-the-art baselines in sequence modeling and event prediction, highlighting the benefits of combining LLMs with TPPs.
摘要:时间点过程 (Temporal Point Processes, TPPs) 广泛用于建模社交网络、交通系统和电子商务等领域中事件的时间和发生。本文介绍了一种新颖的框架 TPP-LLM,该框架将大语言模型 (Large Language Models, LLMs) 与 TPPs 结合,以捕捉事件序列的语义和时间方面。与依赖于分类事件类型表示的传统方法不同,TPP-LLM 直接利用事件类型的文本描述,使模型能够捕捉文本中嵌入的丰富语义信息。尽管 LLMs 在理解事件语义方面表现出色,但在捕捉时间模式方面则稍显不足。为解决这一问题,TPP-LLM 引入了时间嵌入,并采用参数高效的微调 (Parameter-Efficient Fine-Tuning, PEFT) 方法,以在不进行大量重新训练的情况下有效学习时间动态。这种方法提高了预测准确性和计算效率。在多个真实世界数据集上的实验结果表明,TPP-LLM 在序列建模和事件预测方面优于最先进的基线,突显了将 LLMs 与 TPPs 结合的优势。
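TPP-LLM 中的时间嵌入可以理解为把正弦位置编码推广到连续事件时间上。下面是一个常见的正弦时间嵌入草图,维度与频率基数为惯用假设,并非论文的确切参数化:

```python
import math

def temporal_embedding(t, dim):
    # 连续时间 t 的正弦嵌入:不同维度对应不同频率
    emb = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        emb.append(math.sin(t * freq))
        emb.append(math.cos(t * freq))
    return emb

# 玩具事件序列:(发生时间, 事件类型文本描述)
events = [(0.0, "order placed"), (1.5, "payment"), (7.2, "shipment")]
embs = [temporal_embedding(t, dim=8) for t, _ in events]
print(len(embs), len(embs[0]))  # 3 8
```

这些时间嵌入与事件类型文本的 LLM 表示相加或拼接后,即可同时携带语义与时间信息。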

[NLP-105] Improving Autonomous AI Agents with Reflective Tree Search and Self-Learning

【速读】: 该论文试图解决现有视觉语言模型(如GPT-4o)在复杂网络环境和长期规划任务中表现不足的问题。解决方案的关键在于引入Reflective Monte Carlo Tree Search (R-MCTS),这是一种新颖的测试时算法,通过结合对比反思和多智能体辩论来动态提升决策空间的探索效率,并通过自学习微调GPT-4o,使其在无需人工标签的情况下提升性能。该方法在VisualWebArena基准测试中显著提升了GPT-4o的性能,并展示了测试时搜索和自学习在增强视觉语言模型推理和规划能力方面的潜力。

链接: https://arxiv.org/abs/2410.02052
作者: Xiao Yu,Baolin Peng,Vineeth Vajipey,Hao Cheng,Michel Galley,Jianfeng Gao,Zhou Yu
关键词-EN: demonstrated significant potential, automating complex multistep, complex multistep decision-making, multistep decision-making tasks, Reflective Monte Carlo
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Autonomous agents have demonstrated significant potential in automating complex multistep decision-making tasks. However, even state-of-the-art vision-language models (VLMs), such as GPT-4o, still fall short of human-level performance, particularly in intricate web environments and long-horizon planning tasks. To address these limitations, we introduce Reflective Monte Carlo Tree Search (R-MCTS), a novel test-time algorithm designed to enhance the ability of AI agents, e.g., powered by GPT-4o, to explore decision space on the fly. R-MCTS extends traditional MCTS by 1) incorporating contrastive reflection, allowing agents to learn from past interactions and dynamically improve their search efficiency; and 2) using multi-agent debate to provide reliable state evaluation. Moreover, we improve the agent’s performance by fine-tuning GPT-4o through self-learning, using R-MCTS generated tree traversals without any human-provided labels. On the challenging VisualWebArena benchmark, our GPT-4o-based R-MCTS agent achieves a 6% to 30% relative improvement across various tasks compared to the previous state-of-the-art. Additionally, we show that the knowledge gained from test-time search can be effectively transferred back to GPT-4o via fine-tuning. The fine-tuned GPT-4o matches 97% of R-MCTS’s performance while reducing compute usage by a factor of four at test time. Furthermore, qualitative results reveal that the fine-tuned GPT-4o model demonstrates the ability to explore the environment, evaluate a state, and backtrack to viable ones when it detects that the current state cannot lead to success. Moreover, our work demonstrates the compute scaling properties in both training (data collection with R-MCTS) and testing time. These results suggest a promising research direction to enhance VLMs’ reasoning and planning capabilities for agentic applications via test-time search and self-learning.
摘要:自主代理在自动化复杂的多步骤决策任务中展示了显著的潜力。然而,即使是像 GPT-4o 这样的最先进的视觉语言模型 (VLM),在复杂的网络环境和长期规划任务中,其表现仍远未达到人类水平。为了解决这些局限性,我们引入了反射蒙特卡洛树搜索 (R-MCTS),这是一种新颖的测试时算法,旨在增强 AI 智能体(例如由 GPT-4o 驱动的智能体)在决策空间中实时探索的能力。R-MCTS 通过以下两种方式扩展了传统的 MCTS:1) 引入对比反射,使智能体能够从过去的交互中学习并动态提高搜索效率;2) 使用多智能体辩论来提供可靠的状态评估。此外,我们通过自学习对 GPT-4o 进行微调,使用 R-MCTS 生成的树遍历数据,而无需任何人提供的标签,从而提升了智能体的性能。在具有挑战性的 VisualWebArena 基准测试中,基于 GPT-4o 的 R-MCTS 智能体在各种任务中相比之前的最先进水平实现了 6% 到 30% 的相对提升。此外,我们展示了通过微调,测试时搜索获得的知识可以有效地回传到 GPT-4o。微调后的 GPT-4o 在测试时计算使用量减少四倍的情况下,达到了 R-MCTS 97% 的性能。此外,定性结果表明,微调后的 GPT-4o 模型展示了探索环境、评估状态并在检测到当前状态无法导致成功时回溯到可行状态的能力。此外,我们的工作展示了在训练(使用 R-MCTS 进行数据收集)和测试时间中的计算扩展特性。这些结果表明,通过测试时搜索和自学习来增强 VLM 的推理和规划能力,对于智能体应用来说是一个有前景的研究方向。
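R-MCTS 在传统 MCTS 之上做了扩展;其底层的 UCT 子节点选择步骤可以用如下草图说明(动作名称与访问统计量为示意,对比反思与多智能体辩论部分未包含在内):

```python
import math

def uct_select(children, c=1.4):
    # UCT 打分:平均价值(利用)+ 随访问次数衰减的探索奖励
    total = sum(ch["visits"] for ch in children)
    def score(ch):
        if ch["visits"] == 0:
            return float("inf")  # 未访问过的动作优先尝试
        exploit = ch["value"] / ch["visits"]
        explore = c * math.sqrt(math.log(total) / ch["visits"])
        return exploit + explore
    return max(children, key=score)

children = [
    {"action": "click_search", "visits": 10, "value": 7.0},
    {"action": "scroll_down",  "visits": 3,  "value": 2.5},
    {"action": "open_menu",    "visits": 0,  "value": 0.0},
]
best = uct_select(children)
print(best["action"])  # open_menu
```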

[NLP-106] Emo3D: Metric and Benchmarking Dataset for 3D Facial Expression Generation from Emotion Description

【速读】: 该论文试图解决现有3D面部情感建模中情感类别有限和数据集不足的问题。解决方案的关键在于引入了一个名为“Emo3D”的广泛“文本-图像-表情数据集”,该数据集涵盖了多种人类情感,并结合了图像和3D混合形状。通过利用大型语言模型(LLMs)生成多样化的文本描述,增强了情感表达的广度。论文还对基于语言的模型微调和视觉-语言模型(如CLIP)进行了全面评估,并引入了一种新的评估指标,以更直接地衡量传达的情感,从而在评估3D面部表情合成中显示出优于传统均方误差(MSE)指标的效果。

链接: https://arxiv.org/abs/2410.02049
作者: Mahshid Dehghani,Amirahmad Shafiee,Ali Shafiei,Neda Fallah,Farahmand Alizadeh,Mohammad Mehdi Gholinejad,Hamid Behroozi,Jafar Habibi,Ehsaneddin Asgari
关键词-EN: limited emotion classes, constrained by limited, classes and insufficient, Existing, Language Image Pretraining
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Graphics (cs.GR)
备注: 11 pages, 10 figures

点击查看摘要

Abstract:Existing 3D facial emotion modeling have been constrained by limited emotion classes and insufficient datasets. This paper introduces “Emo3D”, an extensive “Text-Image-Expression dataset” spanning a wide spectrum of human emotions, each paired with images and 3D blendshapes. Leveraging Large Language Models (LLMs), we generate a diverse array of textual descriptions, facilitating the capture of a broad spectrum of emotional expressions. Using this unique dataset, we conduct a comprehensive evaluation of language-based models’ fine-tuning and vision-language models like Contrastive Language Image Pretraining (CLIP) for 3D facial expression synthesis. We also introduce a new evaluation metric for this task to more directly measure the conveyed emotion. Our new evaluation metric, Emo3D, demonstrates its superiority over Mean Squared Error (MSE) metrics in assessing visual-text alignment and semantic richness in 3D facial expressions associated with human emotions. “Emo3D” has great applications in animation design, virtual reality, and emotional human-computer interaction.
摘要:现有的三维面部情感建模受限于情感类别有限和数据集不足的问题。本文介绍了“Emo3D”,这是一个广泛的“文本-图像-表情数据集”,涵盖了广泛的人类情感,每种情感都与图像和三维混合形状配对。利用大语言模型 (LLM),我们生成了一系列多样化的文本描述,便于捕捉广泛的情感表达。基于这一独特数据集,我们对基于语言的模型微调以及像对比语言图像预训练 (CLIP) 这样的视觉语言模型进行了全面评估,用于三维面部表情合成。我们还为这一任务引入了一种新的评估指标,以更直接地测量传达的情感。我们提出的新评估指标 Emo3D 在评估与人类情感相关的三维面部表情的视觉-文本对齐和语义丰富性方面,展示了其优于均方误差 (MSE) 指标的优势。“Emo3D”在动画设计、虚拟现实和情感人机交互等领域具有广泛的应用前景。

[NLP-107] Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions EMNLP2024

【速读】: 该论文试图解决大语言模型(LLMs)在分类任务中的应用潜力尚未充分探索的问题。解决方案的关键在于提出一个框架,系统地研究通过微调LLMs来增强分类任务,包括生成和编码两种方法。论文以编辑意图分类(EIC)为实例,通过广泛的实验和系统比较,揭示了LLMs在EIC中的应用潜力,并进一步验证了这些发现对其他五个分类任务的通用性。此外,论文还创建了一个新的高质量数据集Re3-Sci2.0,用于深入研究学术写作中的编辑行为,从而解决了实证编辑分析中数据不足的问题。

链接: https://arxiv.org/abs/2410.02028
作者: Qian Ruan,Ilia Kuznetsov,Iryna Gurevych
关键词-EN: core NLP task, NLP task architecture, core NLP, NLP task, Classification
类目: Computation and Language (cs.CL)
备注: EMNLP2024 Main

点击查看摘要

Abstract:Classification is a core NLP task architecture with many potential applications. While large language models (LLMs) have brought substantial advancements in text generation, their potential for enhancing classification tasks remains underexplored. To address this gap, we propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches. We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task. Our extensive experiments and systematic comparisons with various training approaches and a representative selection of LLMs yield new insights into their application for EIC. We investigate the generalizability of these findings on five further classification tasks. To demonstrate the proposed methods and address the data shortage for empirical edit analysis, we use our best-performing EIC model to create Re3-Sci2.0, a new large-scale dataset of 1,780 scientific document revisions with over 94k labeled edits. The quality of the dataset is assessed through human evaluation. The new dataset enables an in-depth empirical study of human editing behavior in academic writing. We make our experimental framework, models and data publicly available.
摘要:分类是自然语言处理 (NLP) 任务架构的核心,具有许多潜在应用。尽管大语言模型 (LLMs) 在文本生成方面取得了显著进展,但其在增强分类任务方面的潜力仍未得到充分探索。为了填补这一空白,我们提出了一种框架,全面研究针对分类任务的 LLMs 微调,包括基于生成和编码的方法。我们在编辑意图分类 (EIC) 这一具有挑战性且未充分探索的分类任务中实例化了这一框架。我们通过广泛的实验和与各种训练方法以及代表性 LLMs 的系统比较,获得了关于其在 EIC 应用中的新见解。我们进一步研究了这些发现在一系列五个其他分类任务上的泛化性。为了展示所提出的方法并解决经验编辑分析中的数据短缺问题,我们使用表现最佳的 EIC 模型创建了 Re3-Sci2.0,这是一个包含 1,780 篇科学文档修订的新大规模数据集,拥有超过 94,000 个带标签的编辑。通过人工评估验证了数据集的质量。这一新数据集使得对学术写作中人类编辑行为的深入实证研究成为可能。我们将实验框架、模型和数据公开发布。
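
为帮助理解"基于生成"与"基于编码"两种分类微调范式的区别,下面给出一个极简示意(非论文官方实现;标签集 EDIT_INTENTS 与提示措辞均为本文假设):

```python
# 示意:同一个编辑意图分类 (EIC) 样本的两种微调输入构造方式。
# 标签集与提示词为假设,不代表论文的真实分类体系。
EDIT_INTENTS = ["grammar", "clarity", "fact/evidence", "claim", "other"]

def to_generation_example(old, new):
    """生成式构造:让 LLM 以文本形式直接输出意图标签。"""
    return (f"Old sentence: {old}\nNew sentence: {new}\n"
            f"Edit intent ({', '.join(EDIT_INTENTS)}):")

def to_encoding_example(old, new, sep="</s>"):
    """编码式构造:将句对拼成一条序列,在池化表示上训练分类头。"""
    return f"{old} {sep} {new}"
```

生成式做法让 LLM 直接续写标签文本,编码式做法则把 LLM 当作编码器、外接分类头,这正是论文系统比较的两类方案。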

[NLP-108] Zodiac: A Cardiologist-Level LLM Framework for Multi-Agent Diagnostics

【速读】: 该论文试图解决大型语言模型(LLMs)在临床实践中专业性不足的问题,特别是在心血管诊断领域。解决方案的关键在于引入ZODIAC框架,这是一个由LLM驱动的多代理协作系统,专门设计用于心血管诊断,具备心脏病专家级别的专业性。ZODIAC通过从患者数据中提取临床相关特征、检测重要的心律失常并生成初步报告,辅助心脏病专家进行诊断。其核心在于使用真实世界的心脏病专家裁定的患者数据对LLM代理进行微调,并通过多模态数据处理提升模型的专业性,最终通过严格的临床验证,证明了其在临床有效性和安全性方面的优越性。

链接: https://arxiv.org/abs/2410.02026
作者: Yuan Zhou,Peng Zhang,Mengya Song,Alice Zheng,Yiwen Lu,Zhiheng Liu,Yong Chen,Zhaohan Xi
关键词-EN: Large language models, demonstrated remarkable progress, Large language, demonstrated remarkable, remarkable progress
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable progress in healthcare. However, a significant gap remains regarding LLMs’ professionalism in domain-specific clinical practices, limiting their application in real-world diagnostics. In this work, we introduce ZODIAC, an LLM-powered framework with cardiologist-level professionalism designed to engage LLMs in cardiological diagnostics. ZODIAC assists cardiologists by extracting clinically relevant characteristics from patient data, detecting significant arrhythmias, and generating preliminary reports for the review and refinement by cardiologists. To achieve cardiologist-level professionalism, ZODIAC is built on a multi-agent collaboration framework, enabling the processing of patient data across multiple modalities. Each LLM agent is fine-tuned using real-world patient data adjudicated by cardiologists, reinforcing the model’s professionalism. ZODIAC undergoes rigorous clinical validation with independent cardiologists, evaluated across eight metrics that measure clinical effectiveness and address security concerns. Results show that ZODIAC outperforms industry-leading models, including OpenAI’s GPT-4o, Meta’s Llama-3.1-405B, and Google’s Gemini-pro, as well as medical-specialist LLMs like Microsoft’s BioGPT. ZODIAC demonstrates the transformative potential of specialized LLMs in healthcare by delivering domain-specific solutions that meet the stringent demands of medical practice. Notably, ZODIAC has been successfully integrated into electrocardiography (ECG) devices, exemplifying the growing trend of embedding LLMs into Software-as-Medical-Device (SaMD).
摘要:大语言模型 (LLMs) 在医疗领域展示了显著的进步。然而,在特定临床实践中的专业性方面,LLMs 仍存在显著差距,限制了其在实际诊断中的应用。本文中,我们介绍了 ZODIAC,这是一个由 LLM 驱动、具备心脏病专家级专业性的框架,旨在参与心脏病学诊断。ZODIAC 通过从患者数据中提取临床相关特征、检测重要的心律失常,并生成供心脏病专家审查和完善的初步报告,来辅助心脏病专家。为实现心脏病专家级的专业性,ZODIAC 构建于一个多智能体协作框架之上,能够处理多模态的患者数据。每个 LLM 智能体均使用由心脏病专家裁定的真实患者数据进行微调,从而强化模型的专业性。ZODIAC 经过独立心脏病专家的严格临床验证,评估涵盖了八个衡量临床效果和解决安全问题的指标。结果显示,ZODIAC 优于行业领先的模型,包括 OpenAI 的 GPT-4o、Meta 的 Llama-3.1-405B 和 Google 的 Gemini-pro,以及像 Microsoft 的 BioGPT 这样的医学专家 LLMs。ZODIAC 展示了专门化 LLMs 在医疗领域的变革潜力,通过提供满足医疗实践严格需求的领域特定解决方案。值得注意的是,ZODIAC 已成功集成到心电图 (ECG) 设备中,展示了将 LLMs 嵌入到医疗设备软件 (SaMD) 中的增长趋势。

[NLP-109] FLAG: Financial Long Document Classification via AMR-based GNN

【速读】: 该论文试图解决在金融领域应用大型语言模型(LLMs)处理长文档时,由于缺乏显式的语义关系建模和全注意力机制导致的预测效果不佳的问题。解决方案的关键在于利用抽象意义表示(AMR)构建基于图的语义关系模型,并通过图神经网络(GNN)结合金融领域特定的LLM词嵌入,生成有效的文档级图表示,从而提升对金融文档中目标指标的预测准确性。论文提出的FLAG框架通过构建文档级图,结合深度学习机制,显著提高了在长金融文档分类和股票价格趋势预测中的表现。

链接: https://arxiv.org/abs/2410.02024
作者: Bolun (Namir) Xia,Mohammed J. Zaki,Aparna Gupta
关键词-EN: large language models, Abstract Meaning Representation, language models, advent of large, large language
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 8 pages, 3 figures, to be published in CIFEr Conference 2024 as “Semantic Graph Learning for Trend Prediction from Long Financial Documents”

点击查看摘要

Abstract:The advent of large language models (LLMs) has initiated much research into their various financial applications. However, in applying LLMs on long documents, semantic relations are not explicitly incorporated, and a full or arbitrarily sparse attention operation is employed. In recent years, progress has been made in Abstract Meaning Representation (AMR), which is a graph-based representation of text to preserve its semantic relations. Since AMR can represent semantic relationships at a deeper level, it can be beneficially utilized by graph neural networks (GNNs) for constructing effective document-level graph representations built upon LLM embeddings to predict target metrics in the financial domain. We propose FLAG: Financial Long document classification via AMR-based GNN, an AMR graph based framework to generate document-level embeddings for long financial document classification. We construct document-level graphs from sentence-level AMR graphs, endow them with specialized LLM word embeddings in the financial domain, apply a deep learning mechanism that utilizes a GNN, and examine the efficacy of our AMR-based approach in predicting labeled target data from long financial documents. Extensive experiments are conducted on a dataset of quarterly earnings calls transcripts of companies in various sectors of the economy, as well as on a corpus of more recent earnings calls of companies in the S&P 1500 Composite Index. We find that our AMR-based approach outperforms fine-tuning LLMs directly on text in predicting stock price movement trends at different time horizons in both datasets. Our work also outperforms previous work utilizing document graphs and GNNs for text classification.
摘要:大语言模型 (LLM) 的出现引发了对其多种金融应用的广泛研究。然而,在将 LLM 应用于长文档时,语义关系并未被明确纳入,而是采用了完全或任意稀疏的注意力操作。近年来,抽象意义表示 (AMR) 取得了进展,这是一种基于图的文本表示方法,旨在保留其语义关系。由于 AMR 能够在更深层次上表示语义关系,因此可以被图神经网络 (GNN) 有益地利用,以构建基于 LLM 嵌入的有效文档级图表示,从而预测金融领域的目标指标。我们提出了 FLAG:基于 AMR 的 GNN 进行金融长文档分类,这是一个基于 AMR 图的框架,用于生成长金融文档的文档级嵌入。我们从句子级 AMR 图构建文档级图,赋予其金融领域的专门 LLM 词嵌入,应用利用 GNN 的深度学习机制,并检验我们基于 AMR 的方法在预测长金融文档中的标记目标数据的效能。我们在一个包含各经济部门公司季度收益电话会议记录的数据集上,以及在 S&P 1500 综合指数公司更近期的收益电话会议语料库上进行了广泛的实验。我们发现,在预测两个数据集中不同时间跨度的股价变动趋势时,我们的基于 AMR 的方法优于直接在文本上微调 LLM。我们的工作还优于之前利用文档图和 GNN 进行文本分类的工作。
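
论文未给出具体的 GNN 结构,下面仅以最简单的"均值邻居聚合 + 文档级读出"示意 FLAG 中"文档级图 → 文档级嵌入"的大致流程(numpy 写法,纯属示意,不代表论文的真实架构):

```python
import numpy as np

def gnn_doc_embedding(node_feats, adj, layers=2):
    """示意:句子级 AMR 图节点特征 (node_feats 的每一行) 经过简单的
    均值邻居消息传递,再对所有节点做平均池化,得到文档级嵌入。
    adj 为对称邻接矩阵;聚合方式为本文假设。"""
    H = node_feats.astype(float)
    deg = adj.sum(axis=1, keepdims=True).astype(float)
    deg[deg == 0] = 1.0  # 孤立节点避免除零
    for _ in range(layers):
        H = 0.5 * H + 0.5 * (adj @ H) / deg  # 自身 + 邻居均值
    return H.mean(axis=0)  # 文档级读出
```

实际系统中,节点初始特征会换成金融领域的 LLM 词嵌入,读出向量再接分类头预测股价趋势标签。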

[NLP-110] Financial Sentiment Analysis on News and Reports Using Large Language Models and FinBERT

【速读】: 该论文试图解决金融情感分析(FSA)中的情感分类问题,解决方案的关键在于利用大型语言模型(LLMs)如BERT及其金融变体FinBERT,并通过提示工程中的零样本和少样本策略来提升情感分类的准确性。研究表明,通过提供少量金融文本示例,GPT-4o在金融领域的情感分类能力可以与经过精细调优的FinBERT相媲美。

链接: https://arxiv.org/abs/2410.01987
作者: Yanxin Shen,Pulin Kirin Zhang
关键词-EN: well-informed financial decisions, evaluating market sentiment, making well-informed financial, crucial for evaluating, evaluating market
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Social and Information Networks (cs.SI); General Finance (q-fin.GN)
备注:

点击查看摘要

Abstract:Financial sentiment analysis (FSA) is crucial for evaluating market sentiment and making well-informed financial decisions. The advent of large language models (LLMs) such as BERT and its financial variant, FinBERT, has notably enhanced sentiment analysis capabilities. This paper investigates the application of LLMs and FinBERT for FSA, comparing their performance on news articles, financial reports and company announcements. The study emphasizes the advantages of prompt engineering with zero-shot and few-shot strategy to improve sentiment classification accuracy. Experimental results indicate that GPT-4o, with few-shot examples of financial texts, can be as competent as a well fine-tuned FinBERT in this specialized field.
摘要:金融情感分析 (Financial Sentiment Analysis, FSA) 对于评估市场情绪和做出明智的金融决策至关重要。随着大型语言模型 (Large Language Models, LLMs) 如 BERT 及其金融变体 FinBERT 的出现,情感分析能力显著提升。本文探讨了 LLMs 和 FinBERT 在 FSA 中的应用,比较了它们在新闻文章、财务报告和公司公告上的表现。研究强调了通过零样本 (Zero-shot) 和少样本 (Few-shot) 策略进行提示工程 (Prompt Engineering) 以提高情感分类准确性的优势。实验结果表明,GPT-4o 在提供少量金融文本示例的情况下,可以与经过精细调优的 FinBERT 在这一专业领域中表现相当。
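
论文强调的零样本/少样本提示工程可以用如下提示构造函数示意(提示措辞与标签集为本文假设,并非论文原文):

```python
# 示意:金融情感分类的零样本 / 少样本提示构造。
# 不传 examples 即为零样本;传入 (句子, 标签) 对即为少样本。
def build_sentiment_prompt(text, examples=None):
    lines = ["Classify the sentiment of the financial text as "
             "positive, negative, or neutral."]
    for sent, label in (examples or []):
        lines.append(f"Text: {sent}\nSentiment: {label}")
    lines.append(f"Text: {text}\nSentiment:")
    return "\n\n".join(lines)
```

将构造好的提示送入 GPT-4o 这类指令模型即可得到标签;与之对比的 FinBERT 路线则是直接在标注数据上微调分类头。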

[NLP-111] How Reliable Is Human Feedback For Aligning Large Language Models ?

【速读】: 该论文试图解决当前对齐研究中忽视人类反馈数据质量的问题,特别是人类反馈的不可靠性及其对对齐的影响。解决方案的关键在于通过深入分析人类反馈数据的不可靠性来源,提出了一种名为“Source-Aware Cleaning”的自动数据清洗方法,以显著提高数据质量。实验结果表明,经过清洗的数据集HH-Clean训练的模型在性能上显著优于原始数据集训练的模型。

链接: https://arxiv.org/abs/2410.01957
作者: Min-Hsuan Yeh,Leitian Tao,Jeffrey Wang,Xuefeng Du,Yixuan Li
关键词-EN: research today focuses, assuming human feedback, alignment research today, human feedback data, human feedback
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Most alignment research today focuses on designing new learning algorithms using datasets like Anthropic-HH, assuming human feedback data is inherently reliable. However, little attention has been given to the qualitative unreliability of human feedback and its impact on alignment. To address this gap, we conduct a comprehensive study and provide an in-depth analysis of human feedback data. We assess feedback reliability using a committee of gold reward models, revealing that over 25% of the dataset shows low or no agreement with these models, implying a high degree of unreliability. Through a qualitative analysis, we identify six key sources of unreliability, such as mis-labeling, subjective preferences, differing criteria and thresholds for helpfulness and harmlessness, etc. Lastly, to mitigate unreliability, we propose Source-Aware Cleaning, an automatic data-cleaning method guided by the insight of our qualitative analysis, to significantly improve data quality. Extensive experiments demonstrate that models trained on our cleaned dataset, HH-Clean, substantially outperform those trained on the original dataset. We release HH-Clean to support more reliable LLM alignment evaluation in the future.
摘要:当前大多数对齐研究侧重于使用 Anthropic-HH 等数据集设计新的学习算法,假设人类反馈数据本质上可靠。然而,人类反馈的定性不可靠性及其对对齐的影响却鲜有关注。为填补这一空白,我们进行了全面研究,并深入分析了人类反馈数据。我们使用一组黄金奖励模型评估反馈的可靠性,发现超过 25% 的数据集与这些模型存在低或无一致性,表明其高度不可靠。通过定性分析,我们识别出六个主要不可靠来源,如错误标注、主观偏好、对有用性和无害性的不同标准和阈值等。最后,为缓解不可靠性,我们提出了源感知清洗 (Source-Aware Cleaning),这是一种基于我们定性分析洞察的自动数据清洗方法,显著提升了数据质量。大量实验表明,使用我们清洗后的数据集 HH-Clean 训练的模型,在性能上显著优于使用原始数据集训练的模型。我们发布了 HH-Clean,以支持未来更可靠的大语言模型对齐评估。
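
论文用"黄金奖励模型委员会"评估人类反馈的可靠性。其一致性统计与低一致样本的筛选可以用如下极简示意(阈值与投票形式为本文假设):

```python
def committee_agreement(human_pref, committee_prefs):
    """委员会中与人类标注一致的'黄金'奖励模型占比。
    human_pref: 0/1,标注者偏好的回复;committee_prefs: 各模型的 0/1 投票。"""
    agree = sum(1 for p in committee_prefs if p == human_pref)
    return agree / len(committee_prefs)

def flag_unreliable(dataset, threshold=0.5):
    """返回一致率不高于阈值的样本下标,对应论文中
    '超过 25% 的数据与委员会低一致或不一致'的统计口径(阈值为假设)。"""
    return [i for i, (h, c) in enumerate(dataset)
            if committee_agreement(h, c) <= threshold]
```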

[NLP-112] Generate then Refine: Data Augmentation for Zero-shot Intent Detection

【速读】: 该论文试图解决在零资源领域中意图检测的数据增强问题,特别是在意图类别众多且标注成本高的情况下。解决方案的关键在于采用两阶段方法:首先,利用开源的大型语言模型在零样本设置下生成意图标签的语句;其次,开发一个较小的序列到序列模型(称为Refiner),通过在已知领域微调后应用于未知领域,以改进生成的语句。实验结果表明,Refiner显著提升了数据的质量和多样性,优于零样本LLM基线和常见基线方法,证明了这种两步法在生成高质量意图检测数据方面的有效性。

链接: https://arxiv.org/abs/2410.01953
作者: I-Fan Lin,Faegheh Hasibi,Suzan Verberne
关键词-EN: short paper, paper we propose, augmentation methods rely, intent, Refiner
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this short paper we propose a data augmentation method for intent detection in zero-resource domains. Existing data augmentation methods rely on few labelled examples for each intent category, which can be expensive in settings with many possible intents. We use a two-stage approach: First, we generate utterances for intent labels using an open-source large language model in a zero-shot setting. Second, we develop a smaller sequence-to-sequence model (the Refiner), to improve the generated utterances. The Refiner is fine-tuned on seen domains and then applied to unseen domains. We evaluate our method by training an intent classifier on the generated data, and evaluating it on real (human) data. We find that the Refiner significantly improves the data utility and diversity over the zero-shot LLM baseline for unseen domains and over common baseline approaches. Our results indicate that a two-step approach of a generative LLM in zero-shot setting and a smaller sequence-to-sequence model can provide high-quality data for intent detection.
摘要:本文提出了一种在零资源领域中用于意图检测的数据增强方法。现有的数据增强方法依赖于每个意图类别中的少量标注示例,这在可能存在许多意图的情况下成本较高。我们采用两阶段方法:首先,在零样本设置下,使用开源大语言模型为意图标签生成话语。其次,我们开发了一个较小的序列到序列模型(称为 Refiner),以改进生成的话语。Refiner 在已知领域上进行微调,然后应用于未知领域。我们通过在生成数据上训练意图分类器,并在真实(人类)数据上进行评估来验证我们的方法。我们发现,对于未知领域,Refiner 显著提高了数据效用和多样性,优于零样本大语言模型基线和常见基线方法。我们的结果表明,在零样本设置下使用生成式大语言模型和较小的序列到序列模型的两步法,可以为意图检测提供高质量的数据。
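
两阶段"生成-改写"流程可以抽象为如下骨架(generator/refiner 以可调用对象注入,均为示意,不代表论文的具体模型):

```python
def generate_then_refine(intent_labels, generator, refiner, n_per_intent=3):
    """两阶段数据增强示意:generator(intent) 对应零样本 LLM 为意图标签
    生成原始话语;refiner(utterance) 对应较小的 seq2seq 模型 (Refiner)
    对其改写。返回 (改写后话语, 意图标签) 训练对。"""
    data = []
    for intent in intent_labels:
        for _ in range(n_per_intent):
            raw = generator(intent)
            data.append((refiner(raw), intent))
    return data
```

实际系统中 Refiner 先在已知领域上微调,再套用到未知领域的生成结果上,本骨架只体现数据流。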

[NLP-113] TypedThinker: Typed Thinking Improves Large Language Model Reasoning

【速读】: 该论文试图解决大型语言模型(LLMs)在问题解决过程中因缺乏多样化的推理方法而陷入有限解空间的问题。解决方案的关键在于提出了TypedThinker框架,该框架通过整合多种推理类型(演绎、归纳、溯因和类比)来增强LLMs的问题解决能力。TypedThinker通过自训练学习隐式策略,自动选择和应用适当的推理类型,从而在多个基准测试中显著提升了模型的准确性,并展示了良好的泛化能力。

链接: https://arxiv.org/abs/2410.01952
作者: Danqing Wang,Jianxin Ma,Fei Fang,Lei Li
关键词-EN: Large Language Models, Large Language, solution search area, limited solution search, capabilities of Large
类目: Computation and Language (cs.CL)
备注: work in process

点击查看摘要

Abstract:Despite significant advancements in the reasoning capabilities of Large Language Models (LLMs), the lack of diverse reasoning solutions often makes them trapped in a limited solution search area. In this paper, we propose TypedThinker, a novel framework that enhances LLMs’ problem-solving abilities by incorporating multiple reasoning types (deductive, inductive, abductive, and analogical). Our analysis across four benchmarks reveals that different reasoning types uniquely solve distinct sets of problems, highlighting the importance of diverse thinking approaches. TypedThinker addresses two key challenges: selecting appropriate reasoning types for given problems and effectively implementing specific reasoning types. Through self-training on successful experiences, TypedThinker learns an implicit policy for reasoning type selection and application. Experimental results demonstrate significant improvements over baseline models, with accuracy increases of 3.4% for Mistral 7B and 16.7% for LLaMA3 8B across four reasoning benchmarks. Notably, TypedThinker shows effective generalization to new benchmarks and can further enhance the reasoning capability of powerful models like GPT-4o. The code is released at this https URL.
摘要:尽管大语言模型 (LLM) 在推理能力方面取得了显著进展,但由于缺乏多样化的推理解决方案,它们往往陷入有限的解搜索区域。本文提出了一种名为 TypedThinker 的新框架,通过整合多种推理类型(演绎、归纳、溯因和类比)来增强 LLM 的问题解决能力。我们在四个基准测试上的分析表明,不同的推理类型独特地解决了不同的问题集,突显了多样化思维方法的重要性。TypedThinker 解决了两个关键挑战:为给定问题选择合适的推理类型,以及有效实施特定的推理类型。通过在成功经验上的自我训练,TypedThinker 学习了一种隐式的推理类型选择和应用策略。实验结果显示,与基线模型相比,TypedThinker 在四个推理基准测试中分别将 Mistral 7B 和 LLaMA3 8B 的准确率提高了 3.4% 和 16.7%。值得注意的是,TypedThinker 显示出对新基准的有效泛化能力,并能进一步增强 GPT-4o 等强大模型的推理能力。代码已发布于 https URL。
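
TypedThinker 学到的是一个隐式的推理类型选择策略;若将该策略粗略近似为"每种推理类型在当前问题上的预估成功率",选择过程可示意如下(纯属示意):

```python
# 论文中的四种推理类型
REASONING_TYPES = ["deductive", "inductive", "abductive", "analogical"]

def select_reasoning_type(problem, policy_scores):
    """policy_scores: 推理类型 -> 预估成功率的映射,可视为自训练
    (在成功经验上学习) 得到的隐式策略的一个粗略代理。返回得分最高的类型。"""
    return max(REASONING_TYPES, key=lambda t: policy_scores.get(t, 0.0))
```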

[NLP-114] SciPrompt: Knowledge-augmented Prompting for Fine-grained Categorization of Scientific Topics EMNLP2024

【速读】: 该论文试图解决在低资源场景下,跨领域和细粒度文本分类任务中,手动选择领域标签词的困难和成本问题。解决方案的关键在于引入SciPrompt框架,该框架能够自动检索与科学主题相关的术语,用于增强verbalizer(标签词映射器)。具体来说,SciPrompt通过在科学文献的上下文中选择语义相关且领域特定的标签词,并提出一种新的verbalization策略,利用相关性分数作为额外权重来提升语言模型在模型微调过程中的预测性能。这种方法在科学文本分类任务中,特别是在少样本和零样本设置下,显著优于现有的基于提示的微调方法。

链接: https://arxiv.org/abs/2410.01946
作者: Zhiwen You,Kanyao Han,Haotian Zhu,Bertram Ludäscher,Jana Diesner
关键词-EN: eliciting information encoded, Prompt-based fine-tuning, including text classification, eliciting information, information encoded
类目: Computation and Language (cs.CL)
备注: EMNLP 2024 Main

点击查看摘要

Abstract:Prompt-based fine-tuning has become an essential method for eliciting information encoded in pre-trained language models for a variety of tasks, including text classification. For multi-class classification tasks, prompt-based fine-tuning under low-resource scenarios has resulted in performance levels comparable to those of fully fine-tuning methods. Previous studies have used crafted prompt templates and verbalizers, mapping from the label terms space to the class space, to solve the classification problem as a masked language modeling task. However, cross-domain and fine-grained prompt-based fine-tuning with an automatically enriched verbalizer remains unexplored, mainly due to the difficulty and costs of manually selecting domain label terms for the verbalizer, which requires humans with domain expertise. To address this challenge, we introduce SciPrompt, a framework designed to automatically retrieve scientific topic-related terms for low-resource text classification tasks. To this end, we select semantically correlated and domain-specific label terms within the context of scientific literature for verbalizer augmentation. Furthermore, we propose a new verbalization strategy that uses correlation scores as additional weights to enhance the prediction performance of the language model during model tuning. Our method outperforms state-of-the-art, prompt-based fine-tuning methods on scientific text classification tasks under few and zero-shot settings, especially in classifying fine-grained and emerging scientific topics.
摘要:基于提示的微调已成为从预训练语言模型中提取信息以完成多种任务(包括文本分类)的重要方法。在多类别分类任务中,低资源场景下的基于提示的微调方法已达到与完全微调方法相当的性能水平。先前的研究通过设计提示模板和映射器(verbalizers),将标签术语空间映射到类别空间,将分类问题视为掩码语言建模任务来解决。然而,跨领域和细粒度提示微调与自动丰富映射器的结合尚未得到充分探索,这主要是因为手动为映射器选择领域标签术语的难度和成本较高,需要具备领域专业知识的人士参与。为应对这一挑战,我们提出了 SciPrompt 框架,该框架旨在自动检索与低资源文本分类任务相关的科学主题术语。为此,我们在科学文献的语境中选择语义相关且领域特定的标签术语,用于映射器的增强。此外,我们提出了一种新的映射策略,该策略使用相关性分数作为额外权重,以在模型微调过程中提升语言模型的预测性能。我们的方法在少样本和零样本设置下的科学文本分类任务中,尤其是在分类细粒度和新兴科学主题时,优于现有的最先进的基于提示的微调方法。
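
SciPrompt 的核心之一是用相关性分数作为额外权重,聚合多个标签词的掩码预测概率得到类别得分。一种可能的加权形式如下(具体公式为本文推测,并非论文原式):

```python
def class_score(token_probs, label_words, correlations):
    """将掩码语言模型给出的各标签词概率,按检索相关性分数加权平均,
    聚合为一个类别得分。token_probs: 词 -> 概率;correlations: 词 -> 相关性权重。"""
    num = sum(correlations[w] * token_probs.get(w, 0.0) for w in label_words)
    den = sum(correlations[w] for w in label_words)
    return num / den
```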

[NLP-115] CALF: Benchmarking Evaluation of LFQA Using Chinese Examinations

【速读】: 该论文试图解决长篇问答(Long-Form Question Answering, LFQA)评估缺乏标准基准的问题。解决方案的关键在于提出了一个名为Chinese exAmination for LFQA Evaluation (CALF)的基准,该基准基于翻译后的中国考试题目,包含1476个知识密集且复杂的回答示例。CALF通过三种不同的评估设置,全面分析了现有自动评估指标在LFQA评估中的表现,揭示了当前自动评估指标在捕捉长篇回答中的密集信息方面存在不足,并提供了详细的分析以指导未来LFQA评估系统的发展。

链接: https://arxiv.org/abs/2410.01945
作者: Yuchen Fan,Xin Zhong,Heng Zhou,Yuchen Zhang,Mingyu Liang,Chengxing Xie,Ermo Hua,Ning Ding,Bowen Zhou
关键词-EN: LFQA, LFQA evaluation, Long-Form Question Answering, Question Answering, evaluation
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Long-Form Question Answering (LFQA) refers to generating in-depth, paragraph-level responses to open-ended questions. Although lots of LFQA methods are developed, evaluating LFQA effectively and efficiently remains challenging due to its high complexity and cost. Therefore, there is no standard benchmark for LFQA evaluation till now. To address this gap, we make the first attempt by proposing a well-constructed, reference-based benchmark named Chinese exAmination for LFQA Evaluation (CALF), aiming to rigorously assess the performance of automatic evaluation metrics for LFQA. The CALF benchmark is derived from Chinese examination questions that have been translated into English. It includes up to 1476 examples consisting of knowledge-intensive and nuanced responses. Our evaluation comprises three different settings to analyze the behavior of automatic metrics comprehensively. We conducted extensive experiments on 7 traditional evaluation metrics, 3 prompt-based metrics, and 3 trained evaluation metrics, and tested on agent systems for the LFQA evaluation. The results reveal that none of the current automatic evaluation metrics shows comparable performances with humans, indicating that they cannot capture dense information contained in long-form responses well. In addition, we provide a detailed analysis of the reasons why automatic evaluation metrics fail when evaluating LFQA, offering valuable insights to advance LFQA evaluation systems. Dataset and associated codes can be accessed at our GitHub repository.
摘要:长篇问答 (Long-Form Question Answering, LFQA) 指的是生成针对开放性问题的深入、段落级别的回答。尽管已经开发了许多 LFQA 方法,但由于其高复杂性和成本,有效且高效地评估 LFQA 仍然具有挑战性。因此,迄今为止还没有标准的 LFQA 评估基准。为了填补这一空白,我们首次尝试提出了一个基于参考的、精心构建的基准,名为“中文 LFQA 评估考试 (Chinese exAmination for LFQA Evaluation, CALF)”,旨在严格评估 LFQA 自动评估指标的性能。CALF 基准源自已被翻译成英文的中国考试题目,包含多达 1476 个示例,这些示例由知识密集型和微妙的回答组成。我们的评估包括三种不同的设置,以全面分析自动评估指标的行为。我们在 7 种传统评估指标、3 种基于提示的评估指标和 3 种训练过的评估指标上进行了广泛的实验,并在 LFQA 评估的智能体系统上进行了测试。结果显示,当前的自动评估指标中没有任何一种能够与人类的表现相媲美,这表明它们无法很好地捕捉长篇回答中包含的密集信息。此外,我们详细分析了自动评估指标在评估 LFQA 时失败的原因,为推进 LFQA 评估系统提供了宝贵的见解。数据集及相关代码可在我们的 GitHub 仓库中获取。
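
CALF 这类元评估的核心是考察自动指标得分与人工评分之间的相关性。可以用如下示意计算(此处用 numpy 的 Pearson 相关,论文实际采用的相关系数种类未在摘要中说明):

```python
import numpy as np

def metric_human_correlation(metric_scores, human_scores):
    """自动指标得分与人工评分之间的 Pearson 相关系数,
    即 CALF 对各评估指标所做的元评估的一种简化形式。"""
    return float(np.corrcoef(metric_scores, human_scores)[0, 1])
```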

[NLP-116] CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL

【速读】: 该论文试图解决大型语言模型(LLM)在Text-to-SQL任务中的性能挑战,特别是如何生成高质量且多样化的SQL查询。解决方案的关键在于CHASE-SQL框架,它通过多代理建模和测试时计算,采用三种创新策略:(1)将复杂查询分解为可管理的子查询;(2)基于查询执行计划的链式思维推理;(3)实例感知合成示例生成技术。此外,通过微调的二元候选选择LLM进行候选排序,显著提升了SQL查询的质量和多样性,并在BIRD Text-to-SQL数据集上实现了最先进的执行准确率。

链接: https://arxiv.org/abs/2410.01943
作者: Mohammadreza Pourreza,Hailong Li,Ruoxi Sun,Yeounoh Chung,Shayan Talaei,Gaurav Tarlok Kakkar,Yu Gan,Amin Saberi,Fatma Ozcan,Sercan O. Arik
关键词-EN: large language model, employs innovative strategies, improve candidate generation, binary-candidates selection LLM, single LLM call
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)
备注:

点击查看摘要

Abstract:In tackling the challenges of large language model (LLM) performance for Text-to-SQL tasks, we introduce CHASE-SQL, a new framework that employs innovative strategies, using test-time compute in multi-agent modeling to improve candidate generation and selection. CHASE-SQL leverages LLMs’ intrinsic knowledge to generate diverse and high-quality SQL candidates using different LLM generators with: (1) a divide-and-conquer method that decomposes complex queries into manageable sub-queries in a single LLM call; (2) chain-of-thought reasoning based on query execution plans, reflecting the steps a database engine takes during execution; and (3) a unique instance-aware synthetic example generation technique, which offers specific few-shot demonstrations tailored to test questions. To identify the best candidate, a selection agent is employed to rank the candidates through pairwise comparisons with a fine-tuned binary-candidates selection LLM. This selection approach has been demonstrated to be more robust over alternatives. The proposed generators-selector framework not only enhances the quality and diversity of SQL queries but also outperforms previous methods. Overall, our proposed CHASE-SQL achieves the state-of-the-art execution accuracy of 73.0% and 73.01% on the test set and development set of the notable BIRD Text-to-SQL dataset benchmark, rendering CHASE-SQL the top submission of the leaderboard (at the time of paper submission).
摘要:在应对大语言模型 (LLM) 在文本到 SQL 任务中的性能挑战时,我们提出了 CHASE-SQL,这是一个采用创新策略的新框架,利用多智能体模型中的测试时计算来改进候选生成和选择。CHASE-SQL 利用 LLM 的内在知识,通过不同的 LLM 生成器生成多样且高质量的 SQL 候选:(1) 一种分而治之的方法,将复杂查询分解为可管理的子查询,并在单次 LLM 调用中完成;(2) 基于查询执行计划的链式思维推理,反映了数据库引擎在执行过程中采取的步骤;以及 (3) 一种独特的实例感知合成示例生成技术,提供针对测试问题量身定制的少样本演示。为了识别最佳候选,采用了一个选择智能体,通过与微调后的二元候选选择 LLM 进行成对比较来对候选进行排序。这种选择方法已被证明比其他方法更为稳健。所提出的生成器-选择器框架不仅提高了 SQL 查询的质量和多样性,而且在性能上超越了以往的方法。总体而言,我们提出的 CHASE-SQL 在著名的 BIRD 文本到 SQL 数据集基准测试的测试集和开发集上分别达到了 73.0% 和 73.01% 的执行准确率,使其成为排行榜上的顶级提交(在论文提交时)。
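
CHASE-SQL 的选择智能体通过成对比较对候选 SQL 排序。一个最简单的"循环赛计胜场"示意如下(compare 代表微调后的二元选择模型,此处以可调用对象代替;排序方式为本文假设的一种实现):

```python
def rank_candidates(candidates, compare):
    """成对比较排序示意:compare(a, b) 返回 True 表示二元选择模型
    偏好 a。对所有候选做循环赛,按总胜场数降序排列。"""
    wins = {c: 0 for c in candidates}
    for i, a in enumerate(candidates):
        for b in candidates[i + 1:]:
            if compare(a, b):
                wins[a] += 1
            else:
                wins[b] += 1
    return sorted(candidates, key=lambda c: wins[c], reverse=True)
```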

[NLP-117] A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation

【速读】: 该论文试图解决向量量化(VQ)自回归图像生成中的信息损失瓶颈问题。解决方案的关键在于引入了一种名为2-Dimensional Autoregression (DnD) Transformer的新模型架构,通过增加一个新的自回归方向——模型深度,与序列长度方向相结合,从而能够预测更多的图像编码。相比传统的1D自回归和类似2D图像分解的RQ-Transformer,DnD-Transformer是一个端到端的模型,能够在相同的骨干模型大小和序列长度下生成更高质量的图像,为自回归图像生成开辟了新的优化视角。此外,该模型不仅限于生成自然图像,还能自监督地生成包含丰富文本和图形元素的图像,展示了其在视觉-语言智能方面的潜力。

链接: https://arxiv.org/abs/2410.01912
作者: Liang Chen,Sinan Tan,Zefan Cai,Weichu Xie,Haozhe Zhao,Yichi Zhang,Junyang Lin,Jinze Bai,Tianyu Liu,Baobao Chang
关键词-EN: information loss bottleneck, model architecture called, bottleneck of vector-quantization, autoregressive image generation, tackles the information
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 25 pages, 20 figures, code is open at this https URL

点击查看摘要

Abstract:This work tackles the information loss bottleneck of vector-quantization (VQ) autoregressive image generation by introducing a novel model architecture called the 2-Dimensional Autoregression (DnD) Transformer. The DnD-Transformer predicts more codes for an image by introducing a new autoregression direction, model depth, along with the sequence length direction. Compared to traditional 1D autoregression and previous work utilizing similar 2D image decomposition such as RQ-Transformer, the DnD-Transformer is an end-to-end model that can generate higher quality images with the same backbone model size and sequence length, opening a new optimization perspective for autoregressive image generation. Furthermore, our experiments reveal that the DnD-Transformer’s potential extends beyond generating natural images. It can even generate images with rich text and graphical elements in a self-supervised manner, demonstrating an understanding of these combined modalities. This has not been previously demonstrated for popular vision generative models such as diffusion models, showing a spark of vision-language intelligence when trained solely on images. Code, datasets and models are open at this https URL.
摘要:本研究通过引入一种名为二维自回归 (2-Dimensional Autoregression, DnD) Transformer 的新模型架构,解决了向量量化 (Vector-Quantization, VQ) 自回归图像生成中的信息损失瓶颈问题。DnD-Transformer 通过引入新的自回归方向——模型深度,与序列长度方向相结合,预测图像的更多代码。与传统的 1D 自回归及之前利用类似 2D 图像分解的工作(如 RQ-Transformer)相比,DnD-Transformer 是一个端到端的模型,能够在相同的主干模型尺寸和序列长度下生成更高质量的图像,为自回归图像生成开辟了新的优化视角。此外,我们的实验表明,DnD-Transformer 的潜力不仅限于生成自然图像。它甚至能够在自监督的方式下生成包含丰富文本和图形元素的图像,展示了其对这些组合模态的理解能力。这在以往流行的视觉生成模型(如扩散模型)中未曾展示过,表明在仅基于图像训练的情况下,DnD-Transformer 展现出了视觉-语言智能的火花。代码、数据集和模型已在 https URL 上公开。
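
DnD-Transformer 在序列长度方向之外新增了"模型深度"这一自回归方向。其码本预测顺序可示意为(纯属示意,仅表达"每个位置先预测 depth 个码再前进"的二维顺序):

```python
def dnd_code_order(seq_len, depth):
    """二维自回归的预测顺序示意:与 1D 自回归每个位置只出一个码不同,
    DnD 在每个序列位置沿深度方向依次产出 depth 个码,再移动到下一位置。"""
    return [(pos, d) for pos in range(seq_len) for d in range(depth)]
```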

[NLP-118] NEAT: Nonlinear Parameter-efficient Adaptation of Pre-trained Models

【速读】: 该论文试图解决现有参数高效微调(PEFT)方法如LoRA在捕捉复杂非线性权重更新方面的不足,导致与全参数微调相比性能差距较大的问题。解决方案的关键在于提出一种非线性参数高效适应方法(NEAT),通过引入一个轻量级神经网络,以非线性方式近似累积权重更新,从而更有效地捕捉权重更新的复杂非线性结构,提升微调效果。

链接: https://arxiv.org/abs/2410.01870
作者: Yibo Zhong,Haoxiang Jiang,Lincan Li,Ryumei Nakada,Tianci Liu,Linjun Zhang,Huaxiu Yao,Haoyu Wang
关键词-EN: adapting large models, crucial for adapting, adapting large, Fine-tuning, NEAT
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Fine-tuning pre-trained models is crucial for adapting large models to downstream tasks, often delivering state-of-the-art performance. However, fine-tuning all model parameters is resource-intensive and laborious, leading to the emergence of parameter-efficient fine-tuning (PEFT) methods. One widely adopted PEFT technique, Low-Rank Adaptation (LoRA), freezes the pre-trained model weights and introduces two low-rank matrices whose ranks are significantly smaller than the dimensions of the original weight matrices. This enables efficient fine-tuning by adjusting only a small number of parameters. Despite its efficiency, LoRA approximates weight updates using low-rank decomposition, which struggles to capture complex, non-linear components and efficient optimization trajectories. As a result, LoRA-based methods often exhibit a significant performance gap compared to full fine-tuning. Closing this gap requires higher ranks, which increases the number of parameters. To address these limitations, we propose a nonlinear parameter-efficient adaptation method (NEAT). NEAT introduces a lightweight neural network that takes pre-trained weights as input and learns a nonlinear transformation to approximate cumulative weight updates. These updates can be interpreted as functions of the corresponding pre-trained weights. The nonlinear approximation directly models the cumulative updates, effectively capturing complex and non-linear structures in the weight updates. Our theoretical analysis demonstrates that NEAT can be more efficient than LoRA while having equal or greater expressivity. Extensive evaluations across four benchmarks and over twenty datasets demonstrate that NEAT significantly outperforms baselines in both vision and text tasks.
摘要:微调预训练模型对于将大型模型适应于下游任务至关重要,通常能带来最先进的性能表现。然而,全面微调所有模型参数既耗费资源又费时费力,这促使了参数高效微调 (Parameter-Efficient Fine-Tuning, PEFT) 方法的兴起。其中一种广泛采用的 PEFT 技术是低秩适应 (Low-Rank Adaptation, LoRA),它冻结了预训练模型的权重,并引入了两个秩远小于原始权重矩阵维度的低秩矩阵。这种方法通过仅调整少量参数实现了高效的微调。尽管 LoRA 高效,但它使用低秩分解来近似权重更新,难以捕捉复杂的非线性成分和高效的优化轨迹。因此,基于 LoRA 的方法往往与全面微调相比存在显著的性能差距。缩小这一差距需要更高的秩,从而增加了参数数量。为解决这些限制,我们提出了一种非线性参数高效适应方法 (Nonlinear Parameter-Efficient Adaptation, NEAT)。NEAT 引入了一个轻量级神经网络,该网络以预训练权重为输入,并学习一种非线性变换来近似累积权重更新。这些更新可以被解释为对应预训练权重的函数。非线性近似直接建模了累积更新,有效地捕捉了权重更新中的复杂和非线性结构。我们的理论分析表明,NEAT 在效率上可以优于 LoRA,同时在表达能力上至少与之相当或更强。在四个基准测试和超过二十个数据集上的广泛评估表明,NEAT 在视觉和文本任务中均显著优于基线方法。
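
NEAT 用轻量非线性网络以预训练权重为输入来近似累积权重更新。一种可能的参数化如下(本文推测的简化形式,并非论文的确切结构):

```python
import numpy as np

def neat_update(W, U, V, sigma=np.tanh):
    """NEAT 式非线性适应的一种简化示意:冻结的预训练权重 W 经过
    轻量非线性网络得到累积更新 delta = sigma(W @ U) @ V,
    适应后的权重为 W + delta。U、V 为小的可训练矩阵 (秩 r << min(W.shape)),
    与 LoRA 的线性低秩更新 W + U @ V 相比多了一层非线性。"""
    return W + sigma(W @ U) @ V
```

与 LoRA 的关键区别在于:更新量是预训练权重本身的非线性函数,而不只是两个低秩矩阵的乘积。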

[NLP-119] House of Cards: Massive Weights in LLMs

【速读】: 该论文试图解决大型语言模型(LLMs)中由于隐藏状态的特定特征维度上的大规模激活引入的偏差问题,这种偏差导致模型过度强调相应的token。论文的关键发现是,大规模激活并非源自隐藏状态,而是源自早期层中前馈网络模块的中间状态。基于此,论文提出了一种名为MacDrop(大规模权重课程丢弃)的简单即插即用方法,通过在参数高效微调过程中对预训练的大规模权重应用丢弃技术,逐步减少丢弃概率,从而减少对大规模权重的依赖,提升模型在零样本下游任务和生成任务中的性能。

链接: https://arxiv.org/abs/2410.01866
作者: Jaehoon Oh,Seungjun Shin,Dokwan Oh
关键词-EN: large language models, massive weights, Massive activations, Massive, weights
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Under review

点击查看摘要

Abstract:Massive activations, which manifest in specific feature dimensions of hidden states, introduce a significant bias in large language models (LLMs), leading to an overemphasis on the corresponding token. In this paper, we identify that massive activations originate not from the hidden state but from the intermediate state of a feed-forward network module in an early layer. Expanding on the previous observation that massive activations occur only in specific feature dimensions, we dive deep into the weights that cause massive activations. Specifically, we define top-k massive weights as the weights that contribute to the dimensions with the top-k magnitudes in the intermediate state. When these massive weights are set to zero, the functionality of LLMs is entirely disrupted. However, when all weights except for massive weights are set to zero, it results in a relatively minor performance drop, even though a much larger number of weights are set to zero. This implies that during the pre-training process, learning is dominantly focused on massive weights. Building on this observation, we propose a simple plug-and-play method called MacDrop (massive weights curriculum dropout), to rely less on massive weights during parameter-efficient fine-tuning. This method applies dropout to the pre-trained massive weights, starting with a high dropout probability and gradually decreasing it as fine-tuning progresses. Through experiments, we demonstrate that MacDrop generally improves performance across zero-shot downstream tasks and generation tasks.
摘要:大规模激活现象,即隐藏状态的特定特征维度中出现的显著激活,在大语言模型 (LLM) 中引入了显著的偏差,导致对相应 Token 的过度强调。本文中,我们识别出大规模激活并非源自隐藏状态,而是源自早期层中前馈网络模块的中间状态。基于先前观察到的大规模激活仅发生在特定特征维度的现象,我们深入研究了导致大规模激活的权重。具体而言,我们将 top-k 大规模权重定义为对中间状态中 top-k 幅度维度有贡献的权重。当这些大规模权重被设为零时,LLM 的功能完全被破坏。然而,当除大规模权重外的所有权重被设为零时,尽管有更多的权重被设为零,性能下降却相对较小。这表明在预训练过程中,学习主要集中在大规模权重上。基于这一观察,我们提出了一种简单的即插即用方法,称为 MacDrop (大规模权重课程 Dropout),以在参数高效微调过程中减少对大规模权重的依赖。该方法对预训练的大规模权重应用 Dropout,初始时 Dropout 概率较高,并随着微调的进行逐渐降低。通过实验,我们证明了 MacDrop 通常能提高零样本下游任务和生成任务的性能。
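
MacDrop 对预训练的大规模权重施加随微调进程递减的 dropout 概率,这正是"课程 (curriculum)"的含义。衰减调度可示意为(线性衰减及端点取值均为本文假设):

```python
def macdrop_prob(step, total_steps, p_start=0.9, p_end=0.0):
    """MacDrop 的课程式 dropout 调度示意:对大规模权重施加的
    dropout 概率从 p_start 随微调步数线性衰减到 p_end。
    具体调度形状与端点值为假设,论文仅说明'由高到低逐渐降低'。"""
    frac = min(step / max(total_steps, 1), 1.0)
    return p_start + (p_end - p_start) * frac
```

训练时,每一步按该概率随机置零 top-k 大规模权重,迫使模型在微调早期减少对它们的依赖。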

[NLP-120] AI Conversational Interviewing: Transforming Surveys with LLMs as Adaptive Interviewers

【速读】: 该论文试图解决传统调查方法在深度和规模之间的权衡问题,即结构化问卷虽然能大规模收集数据,但限制了受访者表达意外想法的能力;而面对面的深度访谈虽然能提供更深入的见解,但资源消耗大。论文提出的解决方案关键在于利用大型语言模型(LLMs)替代人类访谈者进行可扩展的对话式访谈,通过小规模、深入的研究评估AI对话访谈的可行性,并提出改进建议,以实现高质量数据收集的同时保持规模化优势。

链接: https://arxiv.org/abs/2410.01824
作者: Alexander Wuttke,Matthias Aßenmacher,Christopher Klamm,Max M. Lang,Quirin Würschinger,Frauke Kreuter
关键词-EN: structured surveys enable, eliciting people opinions, people opinions face, surveys enable large-scale, limit respondents’ ability
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Traditional methods for eliciting people’s opinions face a trade-off between depth and scale: structured surveys enable large-scale data collection but limit respondents’ ability to express unanticipated thoughts in their own words, while conversational interviews provide deeper insights but are resource-intensive. This study explores the potential of replacing human interviewers with large language models (LLMs) to conduct scalable conversational interviews. Our goal is to assess the performance of AI Conversational Interviewing and to identify opportunities for improvement in a controlled environment. We conducted a small-scale, in-depth study with university students who were randomly assigned to be interviewed by either AI or human interviewers, both employing identical questionnaires on political topics. Various quantitative and qualitative measures assessed interviewer adherence to guidelines, response quality, participant engagement, and overall interview efficacy. The findings indicate the viability of AI Conversational Interviewing in producing quality data comparable to traditional methods, with the added benefit of scalability. Based on our experiences, we present specific recommendations for effective implementation.
摘要:传统获取人们意见的方法在深度和规模之间存在权衡:结构化调查能够实现大规模数据收集,但限制了受访者用自己语言表达未预见想法的能力;而对话式访谈虽然能提供更深入的见解,但资源消耗较大。本研究探讨了用大语言模型 (LLM) 替代人类访谈者进行可扩展对话式访谈的潜力。我们的目标是评估 AI 对话式访谈的表现,并在受控环境中识别改进机会。我们进行了一项小规模、深入的研究,随机分配大学生接受 AI 或人类访谈者的采访,双方均使用相同的政治话题问卷。通过多种定量和定性指标评估访谈者对指南的遵守情况、回答质量、参与者参与度及整体访谈效果。研究结果表明,AI 对话式访谈在产生与传统方法相当的高质量数据方面具有可行性,并具有额外的可扩展性优势。基于我们的经验,我们提出了具体的实施建议。

[NLP-121] From Text to Multimodality: Exploring the Evolution and Impact of Large Language Models in Medical Practice

【速读】: 该论文旨在探讨多模态大语言模型(MLLMs)在医疗领域的应用及其面临的挑战,并提出未来研究方向。解决方案的关键在于整合多种数据类型(如文本、图像和音频)以提供更全面的医疗洞察,同时解决数据限制、技术障碍和伦理问题。论文强调了数据集开发、模态对齐方法和伦理指南的建立是未来研究的重点,以确保MLLMs在医疗实践中的有效和负责任应用。

链接: https://arxiv.org/abs/2410.01812
作者: Qian Niu,Keyu Chen,Ming Li,Pohsun Feng,Ziqian Bi,Junyu Liu,Benji Peng
关键词-EN: Large Language Models, Multimodal Large Language, Language Models, Large Language, Multimodal Large
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 12 pages, 1 figure

点击查看摘要

Abstract:Large Language Models (LLMs) have rapidly evolved from text-based systems to multimodal platforms, significantly impacting various sectors including healthcare. This comprehensive review explores the progression of LLMs to Multimodal Large Language Models (MLLMs) and their growing influence in medical practice. We examine the current landscape of MLLMs in healthcare, analyzing their applications across clinical decision support, medical imaging, patient engagement, and research. The review highlights the unique capabilities of MLLMs in integrating diverse data types, such as text, images, and audio, to provide more comprehensive insights into patient health. We also address the challenges facing MLLM implementation, including data limitations, technical hurdles, and ethical considerations. By identifying key research gaps, this paper aims to guide future investigations in areas such as dataset development, modality alignment methods, and the establishment of ethical guidelines. As MLLMs continue to shape the future of healthcare, understanding their potential and limitations is crucial for their responsible and effective integration into medical practice.
摘要:大语言模型 (Large Language Models, LLMs) 已迅速从基于文本的系统演变为多模态平台,显著影响包括医疗保健在内的多个领域。本综述探讨了 LLMs 向多模态大语言模型 (Multimodal Large Language Models, MLLMs) 的演进及其在医疗实践中的日益增长的影响。我们审视了 MLLMs 在医疗保健领域的当前格局,分析了其在临床决策支持、医学影像、患者参与和研究等领域的应用。本综述强调了 MLLMs 在整合多种数据类型(如文本、图像和音频)以提供更全面的病人健康洞察方面的独特能力。我们还探讨了 MLLM 实施面临的挑战,包括数据限制、技术障碍和伦理考量。通过识别关键研究空白,本文旨在指导未来在数据集开发、模态对齐方法和伦理准则建立等领域的研究。随着 MLLMs 继续塑造医疗保健的未来,理解其潜力和局限性对于其在医疗实践中负责任且有效地整合至关重要。

[NLP-122] Evaluating Cultural Awareness of LLMs for Yoruba Malayalam and English

【速读】: 该论文试图解决大型语言模型(LLMs)在理解和处理区域语言和文化方面不足的问题。解决方案的关键在于通过使用霍夫斯泰德的六种文化维度(权力距离、个人主义、成就动机、不确定性规避、长期导向和放纵)来量化LLM对马来亚拉姆语和约鲁巴语的文化认知能力,并发现LLMs在处理这些语言的文化细微差别时表现不佳。因此,论文强调了需要对LLMs进行大规模的区域语言训练,并使用文化丰富的数据集,以提升基于聊天的LLMs的用户体验,并增强大规模LLM代理市场研究的准确性。

链接: https://arxiv.org/abs/2410.01811
作者: Fiifi Dawson,Zainab Mosunmola,Sahil Pocker,Raj Abhijit Dandekar,Rajat Dandekar,Sreedath Panat
关键词-EN: Long Term Orientation, complex tasks, extremely effective, large number, number of complex
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 19 pages, 10 figures, 6 tables

点击查看摘要

Abstract:Although LLMs have been extremely effective in a large number of complex tasks, their understanding and functionality for regional languages and cultures are not well studied. In this paper, we explore the ability of various LLMs to comprehend the cultural aspects of two regional languages: Malayalam (state of Kerala, India) and Yoruba (West Africa). Using Hofstede’s six cultural dimensions: Power Distance (PDI), Individualism (IDV), Motivation towards Achievement and Success (MAS), Uncertainty Avoidance (UAV), Long Term Orientation (LTO), and Indulgence (IVR), we quantify the cultural awareness of LLM-based responses. We demonstrate that although LLMs show a high cultural similarity for English, they fail to capture the cultural nuances across these 6 metrics for Malayalam and Yoruba. We also highlight the need for large-scale regional language LLM training with culturally enriched datasets. This will have huge implications for enhancing the user experience of chat-based LLMs and also improving the validity of large-scale LLM agent-based market research.
摘要:尽管大语言模型 (LLM) 在众多复杂任务中表现出色,但对于区域语言和文化的理解和功能尚未得到充分研究。本文探讨了不同大语言模型对两种区域语言——马拉雅拉姆语 (印度喀拉拉邦) 和约鲁巴语 (西非) 文化方面的理解能力。我们采用霍夫斯泰德的六个文化维度:权力距离 (PDI)、个人主义 (IDV)、成就与成功动机 (MAS)、不确定性规避 (UAV)、长期导向 (LTO) 和放纵 (IVR),量化了基于大语言模型的回答的文化意识。研究表明,尽管大语言模型在英语中表现出高度的文化相似性,但在马拉雅拉姆语和约鲁巴语的这六个维度上未能捕捉到文化细微差别。我们还强调了需要使用文化丰富的数据集进行大规模区域语言大语言模型训练。这将对提升基于聊天的大语言模型的用户体验产生重大影响,同时也有助于提高大规模基于大语言模型智能体的市场研究的有效性。

[NLP-123] Semantic-Driven Topic Modeling Using Transformer-Based Embeddings and Clustering Algorithms

【速读】: 该论文试图解决传统主题建模和基于聚类的方法在捕捉上下文语义信息方面的不足。解决方案的关键在于引入了一种创新的端到端语义驱动主题建模技术,该技术结合了先进的词和文档嵌入方法以及强大的聚类算法。具体来说,该模型利用预训练的基于Transformer的语言模型生成文档嵌入,通过降维处理这些嵌入,并基于语义相似性进行聚类,从而为每个聚类生成连贯且有意义的主题。相比ChatGPT和传统主题建模算法,该模型能够提供更加连贯和有意义的主题。

链接: https://arxiv.org/abs/2410.00134
作者: Melkamu Abay Mersha,Mesay Gemeda yigezu,Jugal Kalita
关键词-EN: discover hidden topics, Topic modeling, prior knowledge, discover hidden, Traditional topic modeling
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Topic modeling is a powerful technique to discover hidden topics and patterns within a collection of documents without prior knowledge. Traditional topic modeling and clustering-based techniques encounter challenges in capturing contextual semantic information. This study introduces an innovative end-to-end semantic-driven topic modeling technique for the topic extraction process, utilizing advanced word and document embeddings combined with a powerful clustering algorithm. This semantic-driven approach represents a significant advancement in topic modeling methodologies. It leverages contextual semantic information to extract coherent and meaningful topics. Specifically, our model generates document embeddings using pre-trained transformer-based language models, reduces the dimensions of the embeddings, clusters the embeddings based on semantic similarity, and generates coherent topics for each cluster. Compared to ChatGPT and traditional topic modeling algorithms, our model provides more coherent and meaningful topics.
摘要:主题建模是一种强大的技术,能够在无需先验知识的情况下,从文档集合中发现隐藏的主题和模式。传统的主题建模和基于聚类的技术在捕捉上下文语义信息方面面临挑战。本研究引入了一种创新的端到端语义驱动主题建模技术,用于主题提取过程,利用先进的词和文档嵌入结合强大的聚类算法。这种语义驱动的方法在主题建模方法学上代表了显著的进步。它利用上下文语义信息来提取连贯且有意义的主题。具体而言,我们的模型使用预训练的基于 Transformer 的语言模型生成文档嵌入,减少嵌入的维度,基于语义相似性对嵌入进行聚类,并为每个聚类生成连贯的主题。与 ChatGPT 和传统的主题建模算法相比,我们的模型提供了更加连贯和有意义的主题。
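下面用一个纯 Python 玩具示例示意该流水线的三个步骤(嵌入 → 按语义相似性聚类 → 为每个簇生成主题词)。其中词袋向量仅作为预训练 Transformer 嵌入的简化替代,k-means 的初始化方式与簇数均为假设:

```python
import math
from collections import Counter

docs = [
    "neural networks learn representations",
    "deep neural models learn features",
    "stocks and bonds in financial markets",
    "markets react to financial news",
]

# 1) 文档嵌入:用词袋计数向量代替论文中的 Transformer 嵌入。
vocab = sorted({w for d in docs for w in d.split()})
def embed(doc):
    c = Counter(doc.split())
    return [float(c[w]) for w in vocab]
vectors = [embed(d) for d in docs]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# 2) 按语义相似性聚类:极简 k-means,以第 0、2 篇文档为初始中心。
def kmeans(vecs, k=2, iters=5):
    centroids = [vecs[0][:], vecs[2][:]]
    labels = [0] * len(vecs)
    for _ in range(iters):
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vecs]
        for c in range(k):
            members = [v for v, l in zip(vecs, labels) if l == c]
            if members:  # 空簇则保留原中心(玩具数据不会出现)
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

labels = kmeans(vectors)

# 3) 每个簇的"主题":其成员向量和中权重最高的词。
def top_terms(cluster, n=2):
    members = [v for v, l in zip(vectors, labels) if l == cluster]
    centroid = [sum(col) for col in zip(*members)]
    order = sorted(range(len(vocab)), key=lambda i: -centroid[i])
    return [vocab[i] for i in order[:n]]

print(labels, top_terms(0), top_terms(1))
```

在论文的方法中,第 1 步由预训练语言模型完成并配合降维,第 3 步生成的是连贯的主题描述;此处仅保留流程骨架。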

[NLP-124] Explainable Artificial Intelligence: A Survey of Needs Techniques Applications and Future Direction

【速读】: 该论文试图解决人工智能模型在安全性关键领域(如医疗、金融和自动驾驶)中由于其黑箱特性而面临的透明性、可解释性和公平性问题。解决方案的关键在于提供一种全面的文献综述,涵盖了可解释人工智能(XAI)的基本概念、术语定义、需求分析、受益者群体、方法分类以及在不同应用领域的应用。通过深入探讨XAI模型的数学表示和设计方法,论文旨在增强AI模型的可信度、透明度、问责性和公平性,从而满足专业研究人员、实践者、模型开发者和受益者的需求。

链接: https://arxiv.org/abs/2409.00265
作者: Melkamu Mersha,Khang Lam,Joseph Wood,Ali AlShami,Jugal Kalita
关键词-EN: Artificial intelligence models, Explainable Artificial Intelligence, Artificial intelligence, encounter significant challenges, significant challenges due
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Artificial intelligence models encounter significant challenges due to their black-box nature, particularly in safety-critical domains such as healthcare, finance, and autonomous vehicles. Explainable Artificial Intelligence (XAI) addresses these challenges by providing explanations for how these models make decisions and predictions, ensuring transparency, accountability, and fairness. Existing studies have examined the fundamental concepts of XAI, its general principles, and the scope of XAI techniques. However, there remains a gap in the literature as there are no comprehensive reviews that delve into the detailed mathematical representations, design methodologies of XAI models, and other associated aspects. This paper provides a comprehensive literature review encompassing common terminologies and definitions, the need for XAI, beneficiaries of XAI, a taxonomy of XAI methods, and the application of XAI methods in different application areas. The survey is aimed at XAI researchers, XAI practitioners, AI model developers, and XAI beneficiaries who are interested in enhancing the trustworthiness, transparency, accountability, and fairness of their AI models.
摘要:人工智能模型因其黑箱特性面临重大挑战,特别是在医疗、金融和自动驾驶等安全关键领域。可解释人工智能 (XAI) 通过提供这些模型如何做出决策和预测的解释,确保透明性、责任性和公平性,从而应对这些挑战。现有研究已探讨了 XAI 的基本概念、一般原则及其技术范围。然而,文献中仍存在空白,因为没有全面的综述深入探讨 XAI 模型的详细数学表示、设计方法及其他相关方面。本文提供了一篇全面的文献综述,涵盖了常见术语和定义、XAI 的需求、XAI 的受益者、XAI 方法的分类以及 XAI 方法在不同应用领域的应用。该调查旨在面向 XAI 研究人员、XAI 实践者、AI 模型开发者以及对提升其 AI 模型可信度、透明性、责任性和公平性感兴趣的 XAI 受益者。

[NLP-125] Large Language Models as Markov Chains

【速读】: 该论文试图解决大语言模型(LLMs)性能的理论分析问题,特别是其卓越表现背后的理论基础。解决方案的关键在于建立了一个等价关系:将具有词汇量 ( T ) 和上下文窗口大小 ( K ) 的自回归语言模型与定义在有限状态空间大小为 ( \mathcal{O}(T^K) ) 的马尔可夫链相等价。通过这一等价关系,论文推导了与LLMs推理能力相关的马尔可夫链的平稳分布存在性、收敛速度及其对温度参数的依赖性,并证明了预训练和上下文泛化的边界条件。最终,通过实验验证了这些理论保证在实际LLMs中的应用。

链接: https://arxiv.org/abs/2410.02724
作者: Oussama Zekri,Ambroise Odonnat,Abdelhakim Benechehab,Linus Bleistein,Nicolas Boullé,Ievgen Redko
关键词-EN: Large language models, natural language processing, language processing tasks, Large language, remarkably efficient
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 49 pages, 17 figures

点击查看摘要

Abstract:Large language models (LLMs) have proven to be remarkably efficient, both across a wide range of natural language processing tasks and well beyond them. However, a comprehensive theoretical analysis of the origins of their impressive performance remains elusive. In this paper, we approach this challenging task by drawing an equivalence between generic autoregressive language models with vocabulary of size T and context window of size K and Markov chains defined on a finite state space of size \mathcal{O}(T^K). We derive several surprising findings related to the existence of a stationary distribution of Markov chains that capture the inference power of LLMs, their speed of convergence to it, and the influence of the temperature on the latter. We then prove pre-training and in-context generalization bounds and show how the drawn equivalence allows us to enrich their interpretation. Finally, we illustrate our theoretical guarantees with experiments on several recent LLMs to highlight how they capture the behavior observed in practice.
摘要:大语言模型 (LLMs) 在广泛的自然语言处理任务中以及超越这些任务的领域中,已被证明具有显著的效率。然而,对其卓越性能起源的全面理论分析仍然难以捉摸。在本文中,我们通过建立通用自回归语言模型(词汇大小为 T,上下文窗口大小为 K)与定义在大小为 \mathcal{O}(T^K) 的有限状态空间上的马尔可夫链之间的等价关系,来解决这一挑战性任务。我们推导出几个与马尔可夫链的平稳分布存在性、其收敛速度以及温度对后者的影响相关的惊人发现。随后,我们证明了预训练和上下文泛化的边界,并展示了所建立的等价关系如何丰富这些边界的解释。最后,我们通过在几个近期的大语言模型上进行实验,来说明我们的理论保证,并强调它们如何捕捉实践中观察到的行为。
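论文的核心等价关系可以用一个极小的玩具示例来直观演示:词汇量为 T、上下文窗口为 K 的自回归模型,对应于以长度 K 的 Token 窗口为状态的马尔可夫链,状态空间大小为 T^K。下面的 lm_next_token_probs 是一个假设的替代分布(并非真实 LLM 的 softmax),仅用于展示"滑动窗口即状态转移":

```python
from itertools import product

T, K = 3, 2          # 玩具词汇量与上下文窗口
vocab = list(range(T))

# 每个长度为 K 的 Token 窗口是一个马尔可夫状态:|S| = T**K。
states = list(product(vocab, repeat=K))
assert len(states) == T ** K

def lm_next_token_probs(window):
    """假设的下一 Token 分布,代替真实 LLM 的 softmax 输出。"""
    shift = sum(window) % T
    probs = [0.0] * T
    probs[shift] = 0.7
    probs[(shift + 1) % T] = 0.3
    return probs

def step(window):
    """一次马尔可夫转移:取(贪心解码的)下一 Token 并滑动窗口;
    每个状态最多只能到达 T 个后继状态。"""
    probs = lm_next_token_probs(window)
    nxt = max(range(T), key=lambda t: probs[t])
    return window[1:] + (nxt,)

w = (0, 1)
trajectory = [w]
for _ in range(4):
    w = step(w)
    trajectory.append(w)
print(trajectory)
```

即使在这个极小设定下也能看到:生成过程完全由"当前窗口 → 下一窗口"的转移决定,这正是论文据以分析平稳分布与收敛速度的结构。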

[NLP-126] MedVisionLlama: Leveraging Pre-Trained Large Language Model Layers to Enhance Medical Image Segmentation WACV

【速读】: 该论文试图解决医学图像分割任务中的准确性和鲁棒性问题,解决方案的关键在于将预训练的大型语言模型(LLM)的Transformer块集成到Vision Transformer(ViT)的编码器中,并通过引入混合注意力机制和多尺度融合块来增强特征学习与聚合能力。这种集成方法显著提升了分割性能,具体表现为Dice分数从0.74提高到0.79,同时在准确性、精确度和Jaccard指数等方面也取得了显著改进。

链接: https://arxiv.org/abs/2410.02458
作者: Gurucharan Marthi Krishna Kumar,Aman Chadha,Janine Mendola,Amir Shmuel
关键词-EN: Large Language Models, Large Language, medical image segmentation, accurate diagnostic imaging, enhance medical image
类目: Image and Video Processing (eess.IV); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025

点击查看摘要

Abstract:Large Language Models (LLMs), known for their versatility in textual data, are increasingly being explored for their potential to enhance medical image segmentation, a crucial task for accurate diagnostic imaging. This study explores enhancing Vision Transformers (ViTs) for medical image segmentation by integrating pre-trained LLM transformer blocks. Our approach, which incorporates a frozen LLM transformer block into the encoder of a ViT-based model, leads to substantial improvements in segmentation performance across various medical imaging modalities. We propose a Hybrid Attention Mechanism that combines global and local feature learning with a Multi-Scale Fusion Block for aggregating features across different scales. The enhanced model shows significant performance gains, including an average Dice score increase from 0.74 to 0.79 and improvements in accuracy, precision, and the Jaccard Index. These results demonstrate the effectiveness of LLM-based transformers in refining medical image segmentation, highlighting their potential to significantly boost model accuracy and robustness. The source code and our implementation are available at: this https URL
摘要:大语言模型 (LLMs) 以其对文本数据的多样性处理能力而闻名,目前正越来越多地被探索用于增强医学图像分割的潜力,这是准确诊断成像的关键任务。本研究探讨了通过集成预训练的 LLM Transformer 模块来增强 Vision Transformers (ViTs) 以进行医学图像分割的方法。我们的方法将冻结的 LLM Transformer 模块整合到基于 ViT 模型的编码器中,从而在各种医学成像模式中显著提升了分割性能。我们提出了一种混合注意力机制,该机制结合了全局和局部特征学习,并通过多尺度融合块来聚合不同尺度的特征。增强后的模型显示出显著的性能提升,包括平均 Dice 分数从 0.74 增加到 0.79,以及在准确性、精确性和 Jaccard 指数方面的改进。这些结果表明,基于 LLM 的 Transformer 在优化医学图像分割方面具有显著效果,突显了它们在大幅提升模型准确性和鲁棒性方面的潜力。源代码和我们的实现可在以下链接获取:this https URL

[NLP-127] Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data

【速读】: 该论文试图解决小规模音频分类数据集的增强问题,旨在通过合成数据提高分类准确性。解决方案的关键在于利用文本到音频(T2A)扩散模型生成合成音频,并通过偏好优化确保生成的音频与小规模数据集的声学特征一致,同时采用大型语言模型生成多样化且有意义的音频描述,以提升合成数据的多样性和质量。

链接: https://arxiv.org/abs/2410.02056
作者: Sreyan Ghosh,Sonal Kumar,Zhifeng Kong,Rafael Valle,Bryan Catanzaro,Dinesh Manocha
关键词-EN: augmenting small-scale audio, audio classification datasets, approach for augmenting, small-scale audio classification, audio classification
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Code and Checkpoints will be soon available here: this https URL

点击查看摘要

Abstract:We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data. Our goal is to improve audio classification accuracy with limited labeled data. Traditional data augmentation techniques, which apply artificial transformations (e.g., adding random noise or masking segments), struggle to create data that captures the true diversity present in real-world audios. To address this shortcoming, we propose to augment the dataset with synthetic audio generated from text-to-audio (T2A) diffusion models. However, synthesizing effective augmentations is challenging because not only should the generated data be acoustically consistent with the underlying small-scale dataset, but they should also have sufficient compositional diversity. To overcome the first challenge, we align the generations of the T2A model with the small-scale dataset using preference optimization. This ensures that the acoustic characteristics of the generated data remain consistent with the small-scale dataset. To address the second challenge, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models to (1) generate diverse and meaningful audio captions and (2) iteratively refine their quality. The generated captions are then used to prompt the aligned T2A model. We extensively evaluate Synthio on ten datasets and four simulated limited-data settings. Results indicate our method consistently outperforms all baselines by 0.1%-39% using a T2A model trained only on weakly-captioned AudioSet.
摘要:我们提出了 Synthio,这是一种利用合成数据增强小规模音频分类数据集的新方法。我们的目标是提高在有限标注数据情况下的音频分类准确性。传统数据增强技术通过应用人工变换(例如,添加随机噪声或遮蔽片段)来创建数据,但这些技术难以捕捉现实世界音频中的真实多样性。为了解决这一不足,我们建议使用从文本到音频 (T2A) 扩散模型生成的合成音频来增强数据集。然而,合成有效的增强数据是具有挑战性的,因为生成的数据不仅应与基础小规模数据集在声学上保持一致,还应具有足够的组合多样性。为了克服第一个挑战,我们通过偏好优化将 T2A 模型的生成与小规模数据集对齐,以确保生成的数据声学特征与小规模数据集保持一致。为了应对第二个挑战,我们提出了一种新颖的标题生成技术,该技术利用大语言模型的推理能力来(1)生成多样且有意义的音频标题,以及(2)迭代地改进其质量。生成的标题随后用于提示对齐的 T2A 模型。我们在十个数据集和四种模拟的有限数据设置上广泛评估了 Synthio。结果表明,我们的方法在使用仅在弱标注的 AudioSet 上训练的 T2A 模型时,始终优于所有基线 0.1%-39%。

[NLP-128] A GEN AI Framework for Medical Note Generation

【速读】: 该论文试图解决医疗记录管理中日益增加的行政负担问题,特别是通过电子健康记录(EHR)系统,这不仅减少了直接患者护理的时间,还导致了医生职业倦怠。解决方案的关键是提出了MediNotes,一个先进的生成式AI框架,用于自动化创建SOAP(主观、客观、评估、计划)笔记。MediNotes整合了大型语言模型(LLMs)、检索增强生成(RAG)和自动语音识别(ASR)技术,能够实时或从录音中捕捉和处理文本及语音输入,生成结构化和上下文准确的医疗笔记。此外,该框架采用了量化低秩适应(QLoRA)和参数高效微调(PEFT)技术,以在资源受限的环境中实现高效的模型微调。MediNotes还提供了一个基于查询的检索系统,使医疗提供者和患者能够快速准确地访问相关医疗信息。

链接: https://arxiv.org/abs/2410.01841
作者: Hui Yi Leong,Yi Fan Gao,Shuai Ji,Bora Kalaycioglu,Uktu Pamuksuz
关键词-EN: Electronic Health Records, Health Records, Electronic Health, direct patient care, Automatic Speech Recognition
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Sound (cs.SD)
备注: 8 Figures, 7 page, IEEE standard research paper

点击查看摘要

Abstract:The increasing administrative burden of medical documentation, particularly through Electronic Health Records (EHR), significantly reduces the time available for direct patient care and contributes to physician burnout. To address this issue, we propose MediNotes, an advanced generative AI framework designed to automate the creation of SOAP (Subjective, Objective, Assessment, Plan) notes from medical conversations. MediNotes integrates Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and Automatic Speech Recognition (ASR) to capture and process both text and voice inputs in real time or from recorded audio, generating structured and contextually accurate medical notes. The framework also incorporates advanced techniques like Quantized Low-Rank Adaptation (QLoRA) and Parameter-Efficient Fine-Tuning (PEFT) for efficient model fine-tuning in resource-constrained environments. Additionally, MediNotes offers a query-based retrieval system, allowing healthcare providers and patients to access relevant medical information quickly and accurately. Evaluations using the ACI-BENCH dataset demonstrate that MediNotes significantly improves the accuracy, efficiency, and usability of automated medical documentation, offering a robust solution to reduce the administrative burden on healthcare professionals while improving the quality of clinical workflows.
摘要:医疗文档管理,特别是通过电子健康记录 (EHR) 系统,日益增加的行政负担显著减少了直接患者护理的时间,并导致医生职业倦怠。为解决这一问题,我们提出了 MediNotes,一个先进的生成式 AI 框架,旨在自动从医疗对话中创建 SOAP (主观、客观、评估、计划) 笔记。MediNotes 集成了大语言模型 (LLM)、检索增强生成 (RAG) 和自动语音识别 (ASR),以实时或从录音中捕捉和处理文本及语音输入,生成结构化和上下文准确的医疗笔记。该框架还采用了量化低秩适应 (QLoRA) 和参数高效微调 (PEFT) 等先进技术,以在资源受限的环境中实现高效的模型微调。此外,MediNotes 提供了一个基于查询的检索系统,使医疗提供者和患者能够快速准确地访问相关医疗信息。使用 ACI-BENCH 数据集的评估表明,MediNotes 显著提高了自动化医疗文档的准确性、效率和可用性,为减轻医疗专业人员的行政负担并提高临床工作流程的质量提供了强有力的解决方案。

人工智能

[AI-0] Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos

链接: https://arxiv.org/abs/2410.02763
作者: Jianrui Zhang,Mu Cai,Yong Jae Lee
关键词-EN: growing sentiment recently, key challenges related, growing sentiment, sentiment recently, recently that modern
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Project Page: this https URL

点击查看摘要

Abstract:There has been growing sentiment recently that modern large multimodal models (LMMs) have addressed most of the key challenges related to short video comprehension. As a result, both academia and industry are gradually shifting their attention towards the more complex challenges posed by understanding long-form videos. However, is this really the case? Our studies indicate that LMMs still lack many fundamental reasoning capabilities even when dealing with short videos. We introduce Vinoground, a temporal counterfactual LMM evaluation benchmark encompassing 1000 short and natural video-caption pairs. We demonstrate that existing LMMs severely struggle to distinguish temporal differences between different actions and object transformations. For example, the best model GPT-4o only obtains ~50% on our text and video scores, showing a large gap compared to the human baseline of ~90%. All open-source multimodal models and CLIP-based models perform much worse, producing mostly random chance performance. Through this work, we shed light onto the fact that temporal reasoning in short videos is a problem yet to be fully solved. The dataset and evaluation code are available at this https URL.

[AI-1] FakeShield: Explainable Image Forgery Detection and Localization via Multi-modal Large Language Models

链接: https://arxiv.org/abs/2410.02761
作者: Zhipei Xu,Xuanyu Zhang,Runyi Li,Zecheng Tang,Qing Huang,Jian Zhang
关键词-EN: facilitates content creation, makes image manipulation, image manipulation easier, double-edged sword, rapid development
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The rapid development of generative AI is a double-edged sword, which not only facilitates content creation but also makes image manipulation easier and more difficult to detect. Although current image forgery detection and localization (IFDL) methods are generally effective, they tend to face two challenges: 1) black-box nature with an unknown detection principle, and 2) limited generalization across diverse tampering methods (e.g., Photoshop, DeepFake, AIGC-Editing). To address these issues, we propose the explainable IFDL task and design FakeShield, a multi-modal framework capable of evaluating image authenticity, generating tampered region masks, and providing a judgment basis based on pixel-level and image-level tampering clues. Additionally, we leverage GPT-4o to enhance existing IFDL datasets, creating the Multi-Modal Tamper Description dataSet (MMTD-Set) for training FakeShield’s tampering analysis capabilities. Meanwhile, we incorporate a Domain Tag-guided Explainable Forgery Detection Module (DTE-FDM) and a Multi-modal Forgery Localization Module (MFLM) to address various types of tamper detection interpretation and achieve forgery localization guided by detailed textual descriptions. Extensive experiments demonstrate that FakeShield effectively detects and localizes various tampering techniques, offering an explainable and superior solution compared to previous IFDL methods.

[AI-2] CriSPO: Multi-Aspect Critique-Suggestion-guided Automatic Prompt Optimization for Text Generation

链接: https://arxiv.org/abs/2410.02748
作者: Han He,Qianchu Liu,Lei Xu,Chaitanya Shivade,Yi Zhang,Sundararajan Srinivasan,Katrin Kirchhoff
关键词-EN: Large language models, Large language, generate fluent summaries, prompting techniques, domains using prompting
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) can generate fluent summaries across domains using prompting techniques, reducing the need to train models for summarization applications. However, crafting effective prompts that guide LLMs to generate summaries with the appropriate level of detail and writing style remains a challenge. In this paper, we explore the use of salient information extracted from the source document to enhance summarization prompts. We show that adding keyphrases in prompts can improve ROUGE F1 and recall, making the generated summaries more similar to the reference and more complete. The number of keyphrases can control the precision-recall trade-off. Furthermore, our analysis reveals that incorporating phrase-level salient information is superior to word- or sentence-level. However, the impact on hallucination is not universally positive across LLMs. To conduct this analysis, we introduce Keyphrase Signal Extractor (CriSPO), a lightweight model that can be finetuned to extract salient keyphrases. By using CriSPO, we achieve consistent ROUGE improvements across datasets and open-weight and proprietary LLMs without any LLM customization. Our findings provide insights into leveraging salient information in building prompt-based summarization systems.

[AI-3] AVG-LLaVA: A Multimodal Large Model with Adaptive Visual Granularity

链接: https://arxiv.org/abs/2410.02745
作者: Zhibin Lan,Liqiang Niu,Fandong Meng,Wenbo Li,Jie Zhou,Jinsong Su
关键词-EN: visual tokens, visual granularity based, visual granularity, multiple local images, visual
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Preprint

点击查看摘要

Abstract:Recently, when dealing with high-resolution images, dominant LMMs usually divide them into multiple local images and one global image, which will lead to a large number of visual tokens. In this work, we introduce AVG-LLaVA, an LMM that can adaptively select the appropriate visual granularity based on the input image and instruction. This approach not only reduces the number of visual tokens and speeds up inference, but also improves the overall model performance. Specifically, we introduce the following modules based on LLaVA-NeXT: (a) a visual granularity scaler that includes multiple pooling layers to obtain visual tokens with different granularities; (b) a visual granularity router, which includes a Transformer layer, an MLP layer, and a voter layer, used to select the appropriate visual granularity based on the image and instruction. Furthermore, we propose RGLF, a novel training paradigm that aims at aligning the granularity predicted by the router with the preferences of the LMM, without the need for additional manually annotated data. Extensive experiments and analysis show that AVG-LLaVA achieves superior performance across 11 benchmarks, as well as significantly reduces the number of visual tokens and speeds up inference (e.g., an 85.3% reduction in visual tokens and a 2.53× increase in inference speed on the AI2D benchmark).
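视觉粒度缩放器的"多级池化 + 路由选择"思路可以用如下纯 Python 玩具示例示意(池化层数与路由规则均为假设;论文中的路由器由 Transformer 层、MLP 层和投票层学习得到,此处用固定的 Token 预算规则代替):

```python
def avg_pool2x(grid):
    """一级池化:对每个 2x2 视觉 Token 块取平均,
    两个空间维度各减半(Token 数降为 1/4)。"""
    h, w = len(grid), len(grid[0])
    return [[(grid[i][j] + grid[i][j+1] + grid[i+1][j] + grid[i+1][j+1]) / 4.0
             for j in range(0, w, 2)] for i in range(0, h, 2)]

def granularity_pyramid(grid, levels=3):
    """粒度缩放器:由细到粗堆叠多个池化视图。"""
    views = [grid]
    for _ in range(levels - 1):
        grid = avg_pool2x(grid)
        views.append(grid)
    return views

def route(views, token_budget):
    """学习式路由器的玩具替代:选出 Token 数不超过预算的最细视图。"""
    for v in views:
        if len(v) * len(v[0]) <= token_budget:
            return v
    return views[-1]

tokens = [[float(r * 8 + c) for c in range(8)] for r in range(8)]  # 8x8 = 64 个标量 Token
views = granularity_pyramid(tokens, levels=3)                      # 依次为 64、16、4 个 Token
chosen = route(views, token_budget=20)
print(len(chosen) * len(chosen[0]))  # → 16
```

真实模型中每个 Token 是一个向量且路由依赖图像与指令,此处仅展示"多粒度候选 + 择一送入 LLM"如何削减视觉 Token 数量。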

[AI-4] Neutral residues: revisiting adapters for model extension

链接: https://arxiv.org/abs/2410.02744
作者: Franck Signe Talla,Herve Jegou,Edouard Grave
关键词-EN: pretrained large language, extending a pretrained, pretrained large, original domain, large language model
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We address the problem of extending a pretrained large language model to a new domain that was not seen at training time, like adding a language for which the original model has seen no or little training data. Popular solutions like fine-tuning or low-rank adaptation are successful at domain adaptation, but formally they do not add any extra capacity and degrade the performance in the original domain. Our paper analyzes this extension problem under three angles: data, architecture and training procedure, which are advantageously considered jointly. In particular, we improve adapters and make it possible to learn an entire new language while ensuring that the output of the neural network is almost unchanged in the original domain. For this purpose, we modify the new residual blocks in a way that leads each new residual block to output near-zeros in the original domain. This solution of neutral residues, which borrows architectural components from mixture of experts, is effective: with only 20% extra learnable weights compared to an original model trained on English, we get results that are significantly better than concurrent approaches (fine-tuning, low-rank or vanilla adapters) in terms of the trade-off between learning a new language and not forgetting English.
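"中性残差"的关键在于让新增残差块在原始领域输出接近零。下面用一个标量玩具示例示意这一思路(借鉴混合专家的门控:output = hidden + g(hidden)·adapter(hidden));其中门控与适配器参数均为手工设定的假设值,论文中由训练学得:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neutral_residual_block(hidden, adapter, gate_w, gate_b):
    """带门控的残差适配器:
        output = hidden + g(hidden) * adapter(hidden)
    训练使 g 在原始领域输入上趋近 0(残差"中性"),
    在新领域输入上趋近 1。此处为玩具标量版本。"""
    g = sigmoid(sum(w * h for w, h in zip(gate_w, hidden)) + gate_b)
    delta = adapter(hidden)
    return [h + g * d for h, d in zip(hidden, delta)], g

# 假设训练后的适配器与门控参数:门控只对最后一个特征敏感,
# 而该特征仅在新领域输入中激活。
adapter = lambda h: [0.5 * x for x in h]
gate_w, gate_b = [0.0, 0.0, 8.0], -4.0

original = [1.0, 2.0, 0.0]   # 原始领域输入:g ≈ sigmoid(-4) ≈ 0.018
new_dom = [1.0, 2.0, 1.0]    # 新领域输入:  g ≈ sigmoid(4)  ≈ 0.982

out_orig, g_orig = neutral_residual_block(original, adapter, gate_w, gate_b)
out_new, g_new = neutral_residual_block(new_dom, adapter, gate_w, gate_b)
print(round(g_orig, 3), round(g_new, 3))
```

可以看到,原始领域输入几乎原样通过(网络输出近似不变),而新领域输入获得了完整的适配器贡献,这正是摘要中"学习新语言而不遗忘英语"权衡的机制示意。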

[AI-5] Salient Information Prompting to Steer Content in Prompt-based Abstractive Summarization EMNLP2024

链接: https://arxiv.org/abs/2410.02741
作者: Lei Xu,Mohammed Asad Karim,Saket Dingliwal,Aparna Elangovan
关键词-EN: Large language models, Large language, generate fluent summaries, prompting techniques, domains using prompting
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to EMNLP 2024 Industry Track

点击查看摘要

Abstract:Large language models (LLMs) can generate fluent summaries across domains using prompting techniques, reducing the need to train models for summarization applications. However, crafting effective prompts that guide LLMs to generate summaries with the appropriate level of detail and writing style remains a challenge. In this paper, we explore the use of salient information extracted from the source document to enhance summarization prompts. We show that adding keyphrases in prompts can improve ROUGE F1 and recall, making the generated summaries more similar to the reference and more complete. The number of keyphrases can control the precision-recall trade-off. Furthermore, our analysis reveals that incorporating phrase-level salient information is superior to word- or sentence-level. However, the impact on hallucination is not universally positive across LLMs. To conduct this analysis, we introduce Keyphrase Signal Extractor (SigExt), a lightweight model that can be finetuned to extract salient keyphrases. By using SigExt, we achieve consistent ROUGE improvements across datasets and open-weight and proprietary LLMs without any LLM customization. Our findings provide insights into leveraging salient information in building prompt-based summarization systems.
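摘要中"在提示中加入关键短语"的做法可以用如下玩具示例示意。其中基于词频的抽取器仅是 SigExt 微调模型的简化替代,停用词表与提示模板均为假设;k 的大小即控制摘要精确率与召回率的权衡:

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "for", "on", "with"}

def extract_keyphrases(document, k=3):
    """SigExt 微调抽取器的简化替代:按词频对内容词排序取前 k 个。"""
    words = [w.strip(".,").lower() for w in document.split()]
    content = [w for w in words if w not in STOPWORDS and len(w) > 2]
    return [w for w, _ in Counter(content).most_common(k)]

def build_prompt(document, k=3):
    """把 k 个显著关键短语写入摘要提示;
    增大 k 会提高召回率,但可能牺牲精确率。"""
    phrases = extract_keyphrases(document, k)
    return (
        "Summarize the following document. "
        f"Make sure to cover: {', '.join(phrases)}.\n\n{document}"
    )

doc = ("The spacecraft entered orbit on Tuesday. The spacecraft will map "
       "the surface of the planet. Scientists hope the orbit data improves "
       "planet formation models.")
print(build_prompt(doc, k=3))
```

真实流程中抽取器是一个可微调的轻量模型,提示随后交给任意 LLM 生成摘要;此处仅演示"抽取 → 注入提示"的接口形态。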

[AI-6] Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models

Link: https://arxiv.org/abs/2410.02740
Authors: Zhengfeng Lai,Vasileios Saveris,Chen Chen,Hong-You Chen,Haotian Zhang,Bowen Zhang,Juan Lao Tebar,Wenze Hu,Zhe Gan,Peter Grasch,Meng Cao,Yinfei Yang
Keywords: Recent advancements, synthetic captions, key challenges remain, captions, Short Synthetic Captions
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: CV/ML

Abstract:Recent advancements in multimodal models highlight the value of rewritten captions for improving performance, yet key challenges remain. For example, while synthetic captions often provide superior quality and image-text alignment, it is not clear whether they can fully replace AltTexts: the role of synthetic captions and their interaction with original web-crawled AltTexts in pre-training is still not well understood. Moreover, different multimodal foundation models may have unique preferences for specific caption formats, but efforts to identify the optimal captions for each model remain limited. In this work, we propose a novel, controllable, and scalable captioning pipeline designed to generate diverse caption formats tailored to various multimodal models. By examining Short Synthetic Captions (SSC) towards Dense Synthetic Captions (DSC+) as case studies, we systematically explore their effects and interactions with AltTexts across models such as CLIP, multimodal LLMs, and diffusion models. Our findings reveal that a hybrid approach that keeps both synthetic captions and AltTexts can outperform the use of synthetic captions alone, improving both alignment and performance, with each model demonstrating preferences for particular caption formats. This comprehensive analysis provides valuable insights into optimizing captioning strategies, thereby advancing the pre-training of multimodal foundation models.

[AI-7] Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge

Link: https://arxiv.org/abs/2410.02736
Authors: Jiayi Ye,Yanbo Wang,Yue Huang,Dongping Chen,Qihui Zhang,Nuno Moniz,Tian Gao,Werner Geyer,Chao Huang,Pin-Yu Chen,Nitesh V Chawla,Xiangliang Zhang
Keywords: widely utilized, evaluation method, benchmarks and served, served as supervised, supervised rewards
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*Comments:

Abstract:LLM-as-a-Judge has been widely utilized as an evaluation method in various benchmarks and served as supervised rewards in model training. However, despite their excellence in many domains, potential issues are under-explored, undermining their reliability and the scope of their utility. Therefore, we identify 12 key potential biases and propose a new automated bias quantification framework, CALM, which systematically quantifies and analyzes each type of bias in LLM-as-a-Judge by using automated and principle-guided modification. Our experiments cover multiple popular language models, and the results indicate that while advanced models have achieved commendable overall performance, significant biases persist in certain specific tasks. Empirical results suggest that there remains room for improvement in the reliability of LLM-as-a-Judge. Moreover, we also discuss the explicit and implicit influence of these biases and give some suggestions for the reliable application of LLM-as-a-Judge. Our work highlights the need for stakeholders to address these issues and reminds users to exercise caution in LLM-as-a-Judge applications.

[AI-8] Custom Non-Linear Model Predictive Control for Obstacle Avoidance in Indoor and Outdoor Environments ACL

Link: https://arxiv.org/abs/2410.02732
Authors: Lara Laban,Mariusz Wzorek,Piotr Rudol,Tommy Persson
Keywords: Unmanned Aerial Vehicles, requires Unmanned Aerial, Navigating complex environments, Aerial Vehicles, Unmanned Aerial
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Computational Engineering, Finance, and Science (cs.CE); Systems and Control (eess.SY)
*Comments: This manuscript has 7 pages and 8 figures, detailing NMPC for UAV obstacle avoidance using DJI UAVs. It features simulations, experimental results, and uses CasADi for optimization with ROS integration. Code and media at this https URL

Abstract:Navigating complex environments requires Unmanned Aerial Vehicles (UAVs) and autonomous systems to perform trajectory tracking and obstacle avoidance in real-time. While many control strategies have effectively utilized linear approximations, addressing the non-linear dynamics of UAVs, especially in obstacle-dense environments, remains a key challenge that requires further research. This paper introduces a Non-linear Model Predictive Control (NMPC) framework for the DJI Matrice 100, addressing these challenges by using a dynamic model and B-spline interpolation for smooth reference trajectories, ensuring minimal deviation while respecting safety constraints. The framework supports various trajectory types and employs a penalty-based cost function for control accuracy in tight maneuvers. It utilizes CasADi for efficient real-time optimization, enabling the UAV to maintain robust operation even under tight computational constraints. Simulation and real-world indoor and outdoor experiments demonstrated the NMPC framework's ability to adapt to disturbances, resulting in smooth, collision-free navigation.
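
The penalty-based cost function mentioned in the abstract can be illustrated with a minimal stage cost (a sketch under assumed weights and a 2D position state; the paper's actual CasADi-based NMPC formulation is richer):

```python
def nmpc_stage_cost(state, ref, obstacles, w_track=1.0, w_obs=50.0, safe_dist=1.0):
    """Penalty-based stage cost (illustrative sketch, not the paper's exact
    formulation): quadratic deviation from the B-spline reference point plus
    a hinge penalty that grows as the UAV enters an obstacle's safety radius.
    Weights and the safety distance are illustrative assumptions."""
    # tracking term: squared deviation from the reference state
    track = sum((s - r) ** 2 for s, r in zip(state, ref))
    # obstacle term: penalize only inside the safety radius
    pen = 0.0
    for ox, oy in obstacles:
        d = ((state[0] - ox) ** 2 + (state[1] - oy) ** 2) ** 0.5
        pen += max(0.0, safe_dist - d) ** 2
    return w_track * track + w_obs * pen
```

An NMPC solver would sum this cost over the prediction horizon and minimize it subject to the vehicle dynamics.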

[AI-9] Unified Multi-Modal Interleaved Document Representation for Information Retrieval

Link: https://arxiv.org/abs/2410.02729
Authors: Jaewoo Lee,Joonho Ko,Jinheon Baek,Soyeong Jeong,Sung Ju Hwang
Keywords: natural language tasks, gained remarkable attention, remarkable attention due, language tasks, gained remarkable
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*Comments: Preprint

Abstract:Information Retrieval (IR) methods aim to identify relevant documents in response to a given query, which have gained remarkable attention due to their successful application in various natural language tasks. However, existing approaches typically consider only the textual information within the documents, which overlooks the fact that documents can contain multiple modalities, including texts, images, and tables. Further, they often segment each long document into multiple discrete passages for embedding, preventing them from capturing the overall document context and interactions between paragraphs. We argue that these two limitations lead to suboptimal document representations for retrieval. In this work, to address them, we aim to produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities. Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation. Moreover, to mitigate the information loss from segmenting documents into passages, instead of representing and retrieving passages individually, we further merge the representations of segmented passages into one single document representation, while we additionally introduce a reranking strategy to decouple and identify the relevant passage within the document if necessary. Then, through extensive experiments on diverse information retrieval scenarios considering both the textual and multimodal queries, we show that our approach substantially outperforms relevant baselines, thanks to the consideration of the multimodal information interleaved within the documents in a unified way.
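
The merge-then-rerank idea can be sketched as pooling plus passage scoring (the mean-pooling choice and function names are illustrative assumptions; the paper fuses representations produced by a vision-language model over interleaved text/image/table passages):

```python
import math

def merge_passages(passage_embs, weights=None):
    """Fuse per-passage embeddings into one document embedding by
    (weighted) mean pooling, then L2-normalize. A minimal sketch of
    merging segmented-passage representations into a single document
    representation."""
    if weights is None:
        weights = [1.0] * len(passage_embs)
    total = sum(weights)
    dim = len(passage_embs[0])
    doc = [sum(w * e[d] for w, e in zip(weights, passage_embs)) / total
           for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in doc)) or 1.0
    return [x / norm for x in doc]

def rerank_passages(query_emb, passage_embs):
    """After retrieving a document, score its passages against the query
    (dot product) to identify the most relevant passage within it."""
    scores = [sum(q * p for q, p in zip(query_emb, e)) for e in passage_embs]
    return sorted(range(len(passage_embs)), key=lambda i: -scores[i])
```

Retrieval scores the query against the merged document vector first; the reranker then decouples which passage inside the winning document to surface.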

[AI-10] Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better Even Mid-Generation

Link: https://arxiv.org/abs/2410.02725
Authors: Rohin Manvi,Anikait Singh,Stefano Ermon
Keywords: Inference-time computation, large language models, widely used technique, external reward model, powerful paradigm
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Abstract:Inference-time computation is a powerful paradigm to enhance the performance of large language models (LLMs), with Best-of-N sampling being a widely used technique. However, this method is computationally expensive, requiring both (1) an external reward model and (2) the generation of multiple samples. In this work, we introduce a new generative self-evaluation scheme designed to adaptively reduce the number of generated samples while maintaining or even improving performance. We use a generative reward model formulation, allowing the LLM to predict mid-generation the probability that restarting the generation will yield a better response. These predictions are obtained without an external reward model and can be used to decide whether or not to generate more samples, prune unpromising samples early on, or to pick the best sample. This capability is very inexpensive as it involves generating a single predefined token. Trained using a dataset constructed with real unfiltered LMSYS user prompts, Llama 3.1 8B’s win rate against GPT-4 on AlpacaEval increases from 21% to 34% with 16 samples and math performance on GSM8K improves from 84% to 91%. By sampling only when the LLM determines that it is beneficial to do so and adaptively adjusting temperature annealing, we demonstrate that 74% of the improvement from using 16 samples can be achieved with only 1.2 samples on average. We further demonstrate that 50-75% of samples can be pruned early in generation with minimal degradation in performance. Overall, our methods enable more efficient and scalable compute utilization during inference for LLMs.
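
The adaptive sampling loop described above can be sketched as follows (`generate` and `self_eval` are hypothetical stand-ins for LLM calls, and the stopping threshold is an illustrative assumption):

```python
def adaptive_best_of_n(generate, self_eval, max_samples=16, threshold=0.3):
    """Adaptive Best-of-N (sketch): keep sampling only while the model
    itself predicts that a fresh sample is likely to beat the current best.

    `generate()` returns (response, score); `self_eval(response)` returns
    the model's predicted probability that restarting generation would
    yield a better response. Both are hypothetical stand-ins for LLM calls.
    """
    best, best_score = None, float("-inf")
    for n in range(1, max_samples + 1):
        response, score = generate()
        if score > best_score:
            best, best_score = response, score
        # stop early when the model judges further sampling unpromising
        if self_eval(best) < threshold:
            return best, n
    return best, max_samples
```

In the paper's setting the self-evaluation is a single predefined token's probability, so the early-exit check adds almost no cost.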

[AI-11] Domain-Specific Retrieval-Augmented Generation Using Vector Stores Knowledge Graphs and Tensor Factorization ICML

Link: https://arxiv.org/abs/2410.02721
Authors: Ryan C. Barron,Ves Grantcharov,Selma Wanna,Maksim E. Eren,Manish Bhattarai,Nicholas Solovyev,George Tompkins,Charles Nicholas,Kim Ø. Rasmussen,Cynthia Matuszek,Boian S. Alexandrov
Keywords: Large Language Models, natural language processing, general natural language, numerous general natural, Large Language
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Software Engineering (cs.SE)
*Comments: 9 pages, 7 figures, 1 table, 1 cypher code; accepted to ICMLA 2024

Abstract:Large Language Models (LLMs) are pre-trained on large-scale corpora and excel in numerous general natural language processing (NLP) tasks, such as question answering (QA). Despite their advanced language capabilities, when it comes to domain-specific and knowledge-intensive tasks, LLMs suffer from hallucinations, knowledge cut-offs, and lack of knowledge attributions. Additionally, fine-tuning LLMs’ intrinsic knowledge to highly specific domains is an expensive and time-consuming process. The retrieval-augmented generation (RAG) process has recently emerged as a method capable of optimizing LLM responses by referencing them to a predetermined ontology. It was shown that using a Knowledge Graph (KG) ontology for RAG improves the QA accuracy, by taking into account relevant sub-graphs that preserve the information in a structured manner. In this paper, we introduce SMART-SLIC, a highly domain-specific LLM framework that integrates RAG with a KG and a vector store (VS) that stores factual domain-specific information. Importantly, to avoid hallucinations in the KG, we build these highly domain-specific KGs and VSs without the use of LLMs, but via NLP, data mining, and nonnegative tensor factorization with automatic model selection. Pairing our RAG with a domain-specific (i) KG (containing structured information) and (ii) VS (containing unstructured information) enables the development of domain-specific chat-bots that attribute the source of information, mitigate hallucinations, lessen the need for fine-tuning, and excel in highly domain-specific question answering tasks. We pair SMART-SLIC with chain-of-thought prompting agents. The framework is designed to be generalizable to adapt to any specific or specialized domain. In this paper, we demonstrate the question answering capabilities of our framework on a corpus of scientific publications on malware analysis and anomaly detection.
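
The pairing of a vector store with a knowledge graph can be sketched as a hybrid retriever that preserves source attribution (the callable names are hypothetical; SMART-SLIC's actual retrieval pipeline is far richer):

```python
def hybrid_retrieve(query, vs_search, kg_search, k=5):
    """Combine hits from a vector store (unstructured text) and a
    knowledge graph (structured facts), keeping a source tag on each hit
    so the chat-bot can attribute where a fact came from. `vs_search`
    and `kg_search` are hypothetical callables returning (text, score)
    pairs; this is an illustrative sketch, not the paper's method."""
    hits = [(text, score, "vector-store") for text, score in vs_search(query)]
    hits += [(text, score, "knowledge-graph") for text, score in kg_search(query)]
    hits.sort(key=lambda h: -h[1])  # highest relevance first
    return hits[:k]
```

The top-k merged hits, with their source tags, would then be packed into the prompt of a chain-of-thought agent.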

[AI-12] Curvature Diversity-Driven Deformation and Domain Alignment for Point Cloud

Link: https://arxiv.org/abs/2410.02720
Authors: Mengxi Wu,Hao Huang,Yi Fang,Mohammad Rostami
Keywords: Unsupervised Domain Adaptation, training deep networks, Unsupervised Domain, manual data annotation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Unsupervised Domain Adaptation (UDA) is crucial for reducing the need for extensive manual data annotation when training deep networks on point cloud data. A significant challenge of UDA lies in effectively bridging the domain gap. To tackle this challenge, we propose Curvature Diversity-Driven Nuclear-Norm Wasserstein Domain Alignment (CDND). Our approach first introduces a Curvature Diversity-driven Deformation Reconstruction (CurvRec) task, which effectively mitigates the gap between the source and target domains by enabling the model to extract salient features from semantically rich regions of a given point cloud. We then propose Deformation-based Nuclear-norm Wasserstein Discrepancy (D-NWD), which applies the Nuclear-norm Wasserstein Discrepancy to both deformed and original data samples to align the source and target domains. Furthermore, we contribute a theoretical justification for the effectiveness of D-NWD in distribution alignment and demonstrate that it is generic enough to be applied to any deformations. To validate our method, we conduct extensive experiments on two public domain adaptation datasets for point cloud classification and segmentation tasks. Empirical experiment results show that our CDND achieves state-of-the-art performance by a noticeable margin over existing approaches.

[AI-13] SteerDiff: Steering towards Safe Text-to-Image Diffusion Models

Link: https://arxiv.org/abs/2410.02710
Authors: Hongxiang Zhang,Yifeng He,Hao Chen
Keywords: precise text alignment, generate high-quality images, drawn attention, ability to generate, generate high-quality
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*Comments:

Abstract:Text-to-image (T2I) diffusion models have drawn attention for their ability to generate high-quality images with precise text alignment. However, these models can also be misused to produce inappropriate content. Existing safety measures, which typically rely on text classifiers or ControlNet-like approaches, are often insufficient. Traditional text classifiers rely on large-scale labeled datasets and can be easily bypassed by rephrasing. As diffusion models continue to scale, fine-tuning these safeguards becomes increasingly challenging and lacks flexibility. Recent red-teaming attack researches further underscore the need for a new paradigm to prevent the generation of inappropriate content. In this paper, we introduce SteerDiff, a lightweight adaptor module designed to act as an intermediary between user input and the diffusion model, ensuring that generated images adhere to ethical and safety standards with little to no impact on usability. SteerDiff identifies and manipulates inappropriate concepts within the text embedding space to guide the model away from harmful outputs. We conduct extensive experiments across various concept unlearning tasks to evaluate the effectiveness of our approach. Furthermore, we benchmark SteerDiff against multiple red-teaming strategies to assess its robustness. Finally, we explore the potential of SteerDiff for concept forgetting tasks, demonstrating its versatility in text-conditioned image generation.
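
Steering in the text-embedding space can be illustrated with a fixed linear projection (a minimal sketch of the concept; SteerDiff's adaptor is learned rather than a hard-coded projection):

```python
def project_out_concept(embedding, concept_dir):
    """Steer a text embedding away from an unwanted concept by removing
    its component along a unit-norm concept direction: e' = e - (e.c)c.
    An illustrative sketch of embedding-space steering; the real adaptor
    module identifies and manipulates concepts with learned parameters."""
    dot = sum(e * c for e, c in zip(embedding, concept_dir))
    return [e - dot * c for e, c in zip(embedding, concept_dir)]
```

After projection, the conditioning vector carries no component along the flagged concept direction, while orthogonal (benign) content is untouched.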

[AI-14] LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations

Link: https://arxiv.org/abs/2410.02707
Authors: Hadas Orgad,Michael Toker,Zorik Gekhman,Roi Reichart,Idan Szpektor,Hadas Kotek,Yonatan Belinkov
Keywords: including factual inaccuracies, Large language models, Large language, including factual, factual inaccuracies
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Large language models (LLMs) often produce errors, including factual inaccuracies, biases, and reasoning failures, collectively referred to as “hallucinations”. Recent studies have demonstrated that LLMs’ internal states encode information regarding the truthfulness of their outputs, and that this information can be utilized to detect errors. In this work, we show that the internal representations of LLMs encode much more information about truthfulness than previously recognized. We first discover that the truthfulness information is concentrated in specific tokens, and leveraging this property significantly enhances error detection performance. Yet, we show that such error detectors fail to generalize across datasets, implying that – contrary to prior claims – truthfulness encoding is not universal but rather multifaceted. Next, we show that internal representations can also be used for predicting the types of errors the model is likely to make, facilitating the development of tailored mitigation strategies. Lastly, we reveal a discrepancy between LLMs’ internal encoding and external behavior: they may encode the correct answer, yet consistently generate an incorrect one. Taken together, these insights deepen our understanding of LLM errors from the model’s internal perspective, which can guide future research on enhancing error analysis and mitigation.
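
The error-detection idea, reading truthfulness off internal states, can be sketched as a linear probe applied to a chosen token's hidden state (the probe weights are hypothetical; they would be trained on hidden states of generations labeled correct/incorrect):

```python
import math

def truthfulness_probe(hidden_state, w, b):
    """Score one specific token's hidden state with a linear probe; the
    sigmoid output is read as P(the generated answer is truthful). The
    weights `w`, bias `b`, and the choice of token are illustrative
    stand-ins for the paper's trained, token-targeted detectors."""
    z = sum(wi * hi for wi, hi in zip(w, hidden_state)) + b
    return 1.0 / (1.0 + math.exp(-z))
```

The paper's finding that truthfulness signal concentrates in specific tokens corresponds to choosing *which* hidden state to probe, not just how to probe it.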

[AI-15] Selective Attention Improves Transformer

Link: https://arxiv.org/abs/2410.02703
Authors: Yaniv Leviathan,Matan Kalman,Yossi Matias
Keywords: Selective Attention, Unneeded elements, attention, attention context degrade, Selective
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Abstract:Unneeded elements in the attention’s context degrade performance. We introduce Selective Attention, a simple parameter-free change to the standard attention mechanism which reduces attention to unneeded elements. Selective attention improves language modeling performance in a variety of model sizes and context lengths. For example, a range of transformers trained with the language modeling objective on C4 with selective attention perform equivalently to standard transformers with ~2X more heads and parameters in their attention modules. Selective attention also allows decreasing the size of the attention’s context buffer, leading to meaningful reductions in the memory and compute requirements during inference. For example, transformers with 100M parameters trained on C4 with context sizes of 512, 1,024, and 2,048 need 16X, 25X, and 47X less memory for their attention module, respectively, when equipped with selective attention, than those without selective attention, at the same validation perplexity.
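
The mechanism can be sketched as a causal softmax whose logits are reduced by accumulated selection scores (an illustrative simplification of the idea, not the paper's exact rule):

```python
import math

def _softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def selective_attention_weights(scores, selection):
    """Causal attention with selective masking (sketch): for query i,
    key j's logit is reduced by the selection mass sum_{j<k<=i}
    selection[k][j] that intermediate tokens assigned to j, so elements
    judged unneeded receive less attention. `scores[i][j]` are raw
    attention logits; `selection[k][j]` is token k's score for masking
    token j. Indexing conventions here are illustrative assumptions."""
    out = []
    for i in range(len(scores)):
        adj = []
        for j in range(i + 1):
            penalty = sum(selection[k][j] for k in range(j + 1, i + 1))
            adj.append(scores[i][j] - penalty)
        out.append(_softmax(adj))
    return out
```

With all selection scores at zero this reduces to plain causal attention, which is why the change is parameter-free and drop-in.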

[AI-16] HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly

Link: https://arxiv.org/abs/2410.02694
Authors: Howard Yen,Tianyu Gao,Minmin Hou,Ke Ding,Daniel Fleischer,Peter Izasak,Moshe Wasserblat,Danqi Chen
Keywords: long-context language models, evaluating long-context language, developers often rely, arbitrary subsets, Evaluate Long-context Models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*Comments: Code and data are available here: this https URL

Abstract:There have been many benchmarks for evaluating long-context language models (LCLMs), but developers often rely on synthetic tasks like needle-in-a-haystack (NIAH) or arbitrary subsets of tasks. It remains unclear whether they translate to the diverse downstream applications of LCLMs, and the inconsistency further complicates model comparison. We investigate the underlying reasons behind current practices and find that existing benchmarks often provide noisy signals due to low coverage of applications, insufficient lengths, unreliable metrics, and incompatibility with base models. In this work, we present HELMET (How to Evaluate Long-context Models Effectively and Thoroughly), a comprehensive benchmark encompassing seven diverse, application-centric categories. We also address many issues in previous benchmarks by adding controllable lengths up to 128k tokens, model-based evaluation for reliable metrics, and few-shot prompting for robustly evaluating base models. Consequently, we demonstrate that HELMET offers more reliable and consistent rankings of frontier LCLMs. Through a comprehensive study of 51 LCLMs, we find that (1) synthetic tasks like NIAH are not good predictors of downstream performance; (2) the diverse categories in HELMET exhibit distinct trends and low correlation with each other; and (3) while most LCLMs achieve perfect NIAH scores, open-source models significantly lag behind closed ones when the task requires full-context reasoning or following complex instructions – the gap widens with increased lengths. Finally, we recommend using our RAG tasks for fast model development, as they are easy to run and more predictive of other downstream performance; ultimately, we advocate for a holistic evaluation across diverse tasks.

[AI-17] Discovering Clues of Spoofed LM Watermarks

Link: https://arxiv.org/abs/2410.02693
Authors: Thibaud Gloaguen,Nikola Jovanović,Robin Staab,Martin Vechev
Keywords: LLM watermarks stand, ownership of LLM-generated, attribute ownership, LLM, spoofing
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Abstract:LLM watermarks stand out as a promising way to attribute ownership of LLM-generated text. One threat to watermark credibility comes from spoofing attacks, where an unauthorized third party forges the watermark, enabling it to falsely attribute arbitrary texts to a particular LLM. While recent works have demonstrated that state-of-the-art schemes are in fact vulnerable to spoofing, they lack deeper qualitative analysis of the texts produced by spoofing methods. In this work, we for the first time reveal that there are observable differences between genuine and spoofed watermark texts. Namely, we show that regardless of their underlying approach, all current spoofing methods consistently leave observable artifacts in spoofed texts, indicative of watermark forgery. We build upon these findings to propose rigorous statistical tests that reliably reveal the presence of such artifacts, effectively discovering that a watermark was spoofed. Our experimental evaluation shows high test power across all current spoofing methods, providing insights into their fundamental limitations, and suggesting a way to mitigate this threat.

[AI-18] User-centric Immersive Communications in 6G: A Data-oriented Approach via Digital Twin

Link: https://arxiv.org/abs/2410.02688
Authors: Conghao Zhou,Shisheng Hu,Jie Gao,Xinyu Huang,Weihua Zhuang,Xuemin Shen
Keywords: individual user behaviors, satisfying unique requirements, user-centric service provision, immersive communications, multi-sensory experience
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
*Comments:

Abstract:In this article, we present a novel user-centric service provision for immersive communications (IC) in 6G to deal with the uncertainty of individual user behaviors while satisfying unique requirements on the quality of multi-sensory experience. To this end, we propose a data-oriented approach for network resource management, featuring personalized data management that can support network modeling tailored to different user demands. Our approach leverages the digital twin (DT) technique as a key enabler. Particularly, a DT is established for each user, and the data attributes in the DT are customized based on the characteristics of the user. The DT functions, corresponding to various data operations, are customized in the development, evaluation, and update of network models to meet unique user demands. A trace-driven case study demonstrates the effectiveness of our approach in achieving user-centric IC and the significance of personalized data management in 6G.

[AI-19] DailyDilemmas: Revealing Value Preferences of LLMs with Quandaries of Daily Life

Link: https://arxiv.org/abs/2410.02683
Authors: Yu Ying Chiu,Liwei Jiang,Yejin Choi
Keywords: Moral Foundation Theory, increasingly seek guidance, increasingly seek, decision-making in daily, clear-cut and depend
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: Preprint. Under Review

Abstract:As we increasingly seek guidance from LLMs for decision-making in daily life, many of these decisions are not clear-cut and depend significantly on the personal values and ethical standards of the users. We present DailyDilemmas, a dataset of 1,360 moral dilemmas encountered in everyday life. Each dilemma includes two possible actions and with each action, the affected parties and human values invoked. Based on these dilemmas, we consolidated a set of human values across everyday topics e.g., interpersonal relationships, workplace, and environmental issues. We evaluated LLMs on these dilemmas to determine what action they will take and the values represented by these actions. Then, we analyzed these values through the lens of five popular theories inspired by sociology, psychology and philosophy. These theories are: World Value Survey, Moral Foundation Theory, Maslow’s Hierarchy of Needs, Aristotle’s Virtues, and Plutchik’s Wheel of Emotions. We find that LLMs are most aligned with the self-expression over survival values in terms of World Value Survey, care over loyalty in Moral Foundation Theory. Interestingly, we find large preference differences across models for some core values such as truthfulness e.g., Mixtral-8x7B model tends to neglect it by 9.7% while GPT-4-turbo model tends to select it by 9.4%. We also study the recent guidance released by OpenAI (ModelSpec), and Anthropic (Constitutional AI) to understand how their released principles reflect their actual value prioritization when facing nuanced moral reasoning in daily-life settings. We find that end users cannot effectively steer such prioritization using system prompts.

[AI-20] Distilling an End-to-End Voice Assistant Without Instruction Training Data

Link: https://arxiv.org/abs/2410.02678
Authors: William Held,Ella Li,Michael Ryan,Weiyan Shi,Yanzhe Zhang,Diyi Yang
Keywords: Siri and Google, lost speech information, Google Assistant, Distilled Voice Assistant, Speech Large Language
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Voice assistants, such as Siri and Google Assistant, typically model audio and text separately, resulting in lost speech information and increased complexity. Recent efforts to address this with end-to-end Speech Large Language Models (LLMs) trained with supervised finetuning (SFT) have led to models "forgetting" capabilities from text-only LLMs. Our work proposes an alternative paradigm for training Speech LLMs without instruction data, using the response of a text-only LLM to transcripts as self-supervision. Importantly, this process can be performed without annotated responses. We show that our Distilled Voice Assistant (DiVA) generalizes to Spoken Question Answering, Classification, and Translation. Furthermore, we show that DiVA better meets user preferences, achieving a 72% win rate compared with state-of-the-art models like Qwen 2 Audio, despite using 100x less training compute.
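
The self-supervision described above, where a text-only LLM's response to the transcript supervises the speech model, can be sketched as a per-token distillation loss (a minimal sketch; DiVA's exact training objective may differ):

```python
import math

def distill_loss(student_probs, teacher_probs, eps=1e-12):
    """Per-token KL(teacher || student): the text-only LLM's next-token
    distribution over its response to the transcript acts as the target
    for the speech LLM, so no annotated instruction data is needed.
    The probability-vector interface is an illustrative assumption."""
    return sum(t * (math.log(t + eps) - math.log(s + eps))
               for t, s in zip(teacher_probs, student_probs))
```

Minimizing this over unlabeled (audio, transcript) pairs is what lets the speech model inherit, rather than forget, the text-only LLM's capabilities.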

[AI-21] CulturalBench: a Robust Diverse and Challenging Benchmark on Measuring the (Lack of) Cultural Knowledge of LLMs

Link: https://arxiv.org/abs/2410.02677
Authors: Yu Ying Chiu,Liwei Jiang,Bill Yuchen Lin,Chan Young Park,Shuyue Stella Li,Sahithya Ravi,Mehar Bhatia,Maria Antoniak,Yulia Tsvetkov,Vered Shwartz,Yejin Choi
Keywords: make large language, large language models, track our progress, effective cultural knowledge, cultural knowledge benchmarks
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: Preprint. Under review

Abstract:To make large language models (LLMs) more helpful across diverse cultures, it is essential to have effective cultural knowledge benchmarks to measure and track our progress. Effective benchmarks need to be robust, diverse, and challenging. We introduce CulturalBench: a set of 1,227 human-written and human-verified questions for effectively assessing LLMs’ cultural knowledge, covering 45 global regions including the underrepresented ones like Bangladesh, Zimbabwe, and Peru. Questions - each verified by five independent annotators - span 17 diverse topics ranging from food preferences to greeting etiquettes. We evaluate models on two setups: CulturalBench-Easy and CulturalBench-Hard which share the same questions but asked differently. We find that LLMs are sensitive to such differences in setups (e.g., GPT-4o with 27.3% difference). Compared to human performance (92.6% accuracy), CulturalBench-Hard is more challenging for frontier LLMs with the best performing model (GPT-4o) at only 61.5% and the worst (Llama3-8b) at 21.4%. Moreover, we find that LLMs often struggle with tricky questions that have multiple correct answers (e.g., What utensils do the Chinese usually use?), revealing a tendency to converge to a single answer. Our results also indicate that OpenAI GPT-4o substantially outperforms other proprietary and open source models in questions related to all but one region (Oceania). Nonetheless, all models consistently underperform on questions related to South America and the Middle East.

[AI-22] FAN: Fourier Analysis Networks

Link: https://arxiv.org/abs/2410.02675
Authors: Yihong Dong,Ge Li,Yongding Tao,Xue Jiang,Kechi Zhang,Jia Li,Jing Su,Jun Zhang,Jingjing Xu
Keywords: remarkable success achieved, exhibit potential flaws, remarkable success, success achieved, exhibit potential
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*Comments:

Abstract:Despite the remarkable success achieved by neural networks, particularly those represented by MLP and Transformer, we reveal that they exhibit potential flaws in the modeling and reasoning of periodicity, i.e., they tend to memorize the periodic data rather than genuinely understanding the underlying principles of periodicity. However, periodicity is a crucial trait in various forms of reasoning and generalization, underpinning predictability across natural and engineered systems through recurring patterns in observations. In this paper, we propose FAN, a novel network architecture based on Fourier Analysis, which empowers the ability to efficiently model and reason about periodic phenomena. By introducing Fourier Series, the periodicity is naturally integrated into the structure and computational processes of the neural network, thus achieving a more accurate expression and prediction of periodic patterns. As a promising substitute to multi-layer perceptron (MLP), FAN can seamlessly replace MLP in various models with fewer parameters and FLOPs. Through extensive experiments, we demonstrate the effectiveness of FAN in modeling and reasoning about periodic functions, and the superiority and generalizability of FAN across a range of real-world tasks, including symbolic formula representation, time series forecasting, and language modeling.
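
A FAN-style layer can be sketched as a Fourier branch concatenated with a standard MLP branch (the weight shapes and the GELU choice are assumptions for illustration; see the paper for the exact parameterization):

```python
import math

def fan_layer(x, w_p, w_b, b_b):
    """One FAN-style layer (sketch): part of the output is expressed in
    a Fourier basis, [cos(W_p x) || sin(W_p x) || gelu(W_b x + b_b)],
    so periodic structure is built into the layer rather than memorized
    from data."""
    def matvec(w, v):
        return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]

    def gelu(z):  # tanh approximation of GELU
        return 0.5 * z * (1 + math.tanh(math.sqrt(2 / math.pi) * (z + 0.044715 * z ** 3)))

    p = matvec(w_p, x)                                          # periodic branch
    g = [gelu(zi + bi) for zi, bi in zip(matvec(w_b, x), b_b)]  # MLP branch
    return [math.cos(v) for v in p] + [math.sin(v) for v in p] + g
```

Because the cos/sin branch is exactly periodic in its pre-activation, extrapolating a periodic signal does not require the weights to memorize repeated cycles.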

[AI-23] Unsupervised Point Cloud Completion through Unbalanced Optimal Transport

Link: https://arxiv.org/abs/2410.02671
Authors: Taekyung Lee,Jaemoo Choi,Jaewoong Choi
Keywords: Unpaired point cloud, unbalanced optimal transport, point cloud completion, optimal transport map, point cloud
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments: 20 pages, 10 figures

Abstract:Unpaired point cloud completion explores methods for learning a completion map from unpaired incomplete and complete point cloud data. In this paper, we propose a novel approach for unpaired point cloud completion using the unbalanced optimal transport map, called Unbalanced Optimal Transport Map for Unpaired Point Cloud Completion (UOT-UPC). We demonstrate that the unpaired point cloud completion can be naturally interpreted as the Optimal Transport (OT) problem and introduce the Unbalanced Optimal Transport (UOT) approach to address the class imbalance problem, which is prevalent in unpaired point cloud completion datasets. Moreover, we analyze the appropriate cost function for unpaired completion tasks. This analysis shows that the InfoCD cost function is particularly well-suited for this task. Our model is the first attempt to leverage UOT for unpaired point cloud completion, achieving competitive or superior results on both single-category and multi-category datasets. In particular, our model is especially effective in scenarios with class imbalance, where the proportions of categories are different between the incomplete and complete point cloud datasets.

[AI-24] AlphaIntegrator: Transformer Action Search for Symbolic Integration Proofs

链接: https://arxiv.org/abs/2410.02666
作者: Mert Ünsal,Timon Gehr,Martin Vechev
关键词-EN: learning-based system, GPT transformer model, mathematical integration rule, mathematical integration, transformer model
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Mathematical Software (cs.MS); Symbolic Computation (cs.SC)
*备注:

点击查看摘要

Abstract:We present the first correct-by-construction learning-based system for step-by-step mathematical integration. The key idea is to learn a policy, represented by a GPT transformer model, which guides the search for the right mathematical integration rule, to be carried out by a symbolic solver. Concretely, we introduce a symbolic engine with axiomatically correct actions on mathematical expressions, as well as the first dataset for step-by-step integration. Our GPT-style transformer model, trained on this synthetic data, demonstrates strong generalization by surpassing its own data generator in accuracy and efficiency, using 50% fewer search steps. Our experimental results with SoTA LLMs also demonstrate that the standard approach of fine-tuning LLMs on a set of question-answer pairs is insufficient for solving this mathematical task. This motivates the importance of discovering creative methods for combining LLMs with symbolic reasoning engines, of which our work is an instance.
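To make "axiomatically correct actions" concrete: each integration rule is a small expression transform whose result can be verified by differentiation. A minimal sketch using only the power rule, with a hypothetical (coefficient, exponent) term encoding that is not from the paper:

```python
def integrate_term(term):
    """Power-rule action: integral of c*x^n dx is (c/(n+1))*x^(n+1).
    One example of an axiomatically correct action a symbolic engine could
    expose; the learned policy's job is choosing which action to apply."""
    c, n = term  # represents c * x**n
    return (c / (n + 1), n + 1)

def differentiate_term(term):
    c, n = term
    return (c * n, n - 1)

expr = [(3.0, 2), (2.0, 1)]  # 3x^2 + 2x
antiderivative = [integrate_term(t) for t in expr]
print([differentiate_term(t) for t in antiderivative] == expr)  # True: the step checks out
```

Because every action can be verified this way, a search guided by a learned policy stays correct by construction, which is the paper's key design point.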

[AI-25] Grounded Answers for Multi-agent Decision-making Problem through Generative World Model

链接: https://arxiv.org/abs/2410.02664
作者: Zeyang Liu,Xinrui Yang,Shiguang Sun,Long Qian,Lipeng Wan,Xingyu Chen,Xuguang Lan
关键词-EN: stimulated significant innovations, Recent progress, generation and chatbots, stimulated significant, significant innovations
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: The Thirty-eighth Annual Conference on Neural Information Processing Systems

点击查看摘要

Abstract:Recent progress in generative models has stimulated significant innovations in many fields, such as image generation and chatbots. Despite their success, these models often produce sketchy and misleading solutions for complex multi-agent decision-making problems because they lack the trial-and-error experience and reasoning that humans possess. To address this limitation, we explore a paradigm that integrates a language-guided simulator into the multi-agent reinforcement learning pipeline to enhance the generated answer. The simulator is a world model that separately learns dynamics and reward, where the dynamics model comprises an image tokenizer as well as a causal transformer to generate interaction transitions autoregressively, and the reward model is a bidirectional transformer learned by maximizing the likelihood of trajectories in the expert demonstrations under language guidance. Given an image of the current state and the task description, we use the world model to train the joint policy and produce the image sequence as the answer by running the converged policy on the dynamics model. The empirical results demonstrate that this framework can improve the answers for multi-agent decision-making problems by showing superior performance on the training and unseen tasks of the StarCraft Multi-Agent Challenge benchmark. In particular, it can generate consistent interaction sequences and explainable reward functions at interaction states, opening the path for training generative models of the future.

[AI-26] Scalable Simulation-free Entropic Unbalanced Optimal Transport

链接: https://arxiv.org/abs/2410.02656
作者: Jaemoo Choi,Jaewoong Choi
关键词-EN: Unbalanced Optimal Transport, transport map, Optimal Transport, Entropic Unbalanced Optimal, Transport
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 26 pages

点击查看摘要

Abstract:The Optimal Transport (OT) problem investigates a transport map that connects two distributions while minimizing a given cost function. Finding such a transport map has diverse applications in machine learning, such as generative modeling and image-to-image translation. In this paper, we introduce a scalable and simulation-free approach for solving the Entropic Unbalanced Optimal Transport (EUOT) problem. We derive the dynamical form of this EUOT problem, which is a generalization of the Schrödinger bridges (SB) problem. Based on this, we derive the dual formulation and optimality conditions of the EUOT problem from the stochastic optimal control interpretation. By leveraging these properties, we propose a simulation-free algorithm to solve EUOT, called Simulation-free EUOT (SF-EUOT). While existing SB models require expensive simulation costs during training and evaluation, our model achieves simulation-free training and one-step generation by utilizing the reciprocal property. Our model demonstrates significantly improved scalability in generative modeling and image-to-image translation tasks compared to previous SB methods.
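For orientation, the classical entropic OT problem that EUOT generalizes can be solved on discrete toy data with Sinkhorn iterations. The sketch below enforces both marginals exactly, which is precisely the hard constraint the unbalanced formulation relaxes; it is an illustrative baseline, not the paper's SF-EUOT algorithm:

```python
import math

def sinkhorn(cost, a, b, eps=0.1, iters=200):
    """Balanced entropic OT via Sinkhorn iterations: alternately rescale the
    Gibbs kernel K = exp(-cost/eps) so the plan's marginals match a and b."""
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(cost, [0.5, 0.5], [0.5, 0.5])
print(round(sum(plan[0]), 3))  # 0.5: the row marginal is matched
```

Unbalanced OT replaces these exact marginal projections with soft (divergence-penalized) ones, which is what makes it robust to class imbalance between source and target datasets.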

[AI-27] CAX: Cellular Automata Accelerated in JAX

链接: https://arxiv.org/abs/2410.02651
作者: Maxence Faldor,Antoine Cully
关键词-EN: diverse scientific disciplines, Cellular automata, Cellular Automata Accelerated, Cellular, spanning neuroscience
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Cellular automata have become a cornerstone for investigating emergence and self-organization across diverse scientific disciplines, spanning neuroscience, artificial life, and theoretical physics. However, the absence of a hardware-accelerated cellular automata library limits the exploration of new research directions, hinders collaboration, and impedes reproducibility. In this work, we introduce CAX (Cellular Automata Accelerated in JAX), a high-performance and flexible open-source library designed to accelerate cellular automata research. CAX offers cutting-edge performance and a modular design through a user-friendly interface, and can support both discrete and continuous cellular automata with any number of dimensions. We demonstrate CAX’s performance and flexibility through a wide range of benchmarks and applications. From classic models like elementary cellular automata and Conway’s Game of Life to advanced applications such as growing neural cellular automata and self-classifying MNIST digits, CAX speeds up simulations by up to 2,000 times. Furthermore, we demonstrate CAX’s potential to accelerate research by presenting a collection of three novel cellular automata experiments, each implemented in just a few lines of code thanks to the library’s modular architecture. Notably, we show that a simple one-dimensional cellular automaton can outperform GPT-4 on the 1D-ARC challenge.
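The core computation CAX accelerates is a local synchronous update rule applied over a grid. Conway's Game of Life in pure Python makes that rule explicit; CAX itself vectorizes such updates with JAX for the reported speedups, so this is just the underlying idea:

```python
def life_step(grid):
    """One synchronous update of Conway's Game of Life on a toroidal grid:
    a dead cell with exactly 3 live neighbors is born; a live cell with
    2 or 3 live neighbors survives; everything else dies."""
    n, m = len(grid), len(grid[0])
    nxt = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            live = sum(grid[(i + di) % n][(j + dj) % m]
                       for di in (-1, 0, 1) for dj in (-1, 0, 1)
                       if (di, dj) != (0, 0))
            nxt[i][j] = 1 if live == 3 or (grid[i][j] and live == 2) else 0
    return nxt

# A "blinker" oscillates with period 2.
g = [[0, 0, 0, 0, 0],
     [0, 0, 1, 0, 0],
     [0, 0, 1, 0, 0],
     [0, 0, 1, 0, 0],
     [0, 0, 0, 0, 0]]
print(life_step(life_step(g)) == g)  # True
```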

[AI-28] Undesirable Memorization in Large Language Models : A Survey

链接: https://arxiv.org/abs/2410.02650
作者: Ali Satvaty,Suzan Verberne,Fatih Turkmen
关键词-EN: Large Language Models, capabilities of Large, recent research increasingly, research increasingly showcases, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:While recent research increasingly showcases the remarkable capabilities of Large Language Models (LLMs), it is vital to confront their hidden pitfalls. Among these challenges, the issue of memorization stands out, posing significant ethical and legal risks. In this paper, we present a Systematization of Knowledge (SoK) on the topic of memorization in LLMs. Memorization is the effect that a model tends to store and reproduce phrases or passages from the training data, and it has been shown to be fundamental to various privacy and security attacks against LLMs. We begin by providing an overview of the literature on memorization, exploring it across five key dimensions: intentionality, degree, retrievability, abstraction, and transparency. Next, we discuss the metrics and methods used to measure memorization, followed by an analysis of the factors that contribute to the memorization phenomenon. We then examine how memorization manifests itself in specific model architectures and explore strategies for mitigating these effects. We conclude our overview by identifying potential research topics for the near future: to develop methods for balancing performance and privacy in LLMs, and the analysis of memorization in specific contexts, including conversational agents, retrieval-augmented generation, multilingual language models, and diffusion language models.

[AI-29] Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

链接: https://arxiv.org/abs/2410.02644
作者: Hanrong Zhang,Jingyuan Huang,Kai Mei,Yifei Yao,Zhenting Wang,Chenlu Zhan,Hongwei Wang,Yongfeng Zhang
关键词-EN: Large Language Models, Language Models, Large Language, complex real-world tasks, solve complex real-world
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Although LLM-based agents, powered by Large Language Models (LLMs), can use external tools and memory mechanisms to solve complex real-world tasks, they may also introduce critical security vulnerabilities. However, the existing literature does not comprehensively evaluate attacks and defenses against LLM-based agents. To address this, we introduce Agent Security Bench (ASB), a comprehensive framework designed to formalize, benchmark, and evaluate the attacks and defenses of LLM-based agents, including 10 scenarios (e.g., e-commerce, autonomous driving, finance), 10 agents targeting the scenarios, over 400 tools, 23 different types of attack/defense methods, and 8 evaluation metrics. Based on ASB, we benchmark 10 prompt injection attacks, a memory poisoning attack, a novel Plan-of-Thought backdoor attack, a mixed attack, and 10 corresponding defenses across 13 LLM backbones with nearly 90,000 testing cases in total. Our benchmark results reveal critical vulnerabilities in different stages of agent operation, including system prompt, user prompt handling, tool usage, and memory retrieval, with the highest average attack success rate of 84.30%, but limited effectiveness shown in current defenses, revealing important work still to be done on agent security by the community. Our code can be found at this https URL.

[AI-30] Plots Unlock Time-Series Understanding in Multimodal Models

链接: https://arxiv.org/abs/2410.02637
作者: Mayank Daswani,Mathias M.J. Bellaiche,Marc Wilson,Desislav Ivanov,Mikhail Papkov,Eva Schnider,Jing Tang,Kay Lamerigts,Gabriela Botea,Michael A. Sanchez,Yojan Patel,Shruthi Prabhakara,Shravya Shetty,Umesh Telang
关键词-EN: data-driven insights, fields like healthcare, social sciences, representing a missed, opportunity for richer
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 49 pages

点击查看摘要

Abstract:While multimodal foundation models can now natively work with data beyond text, they remain underutilized in analyzing the considerable amounts of multi-dimensional time-series data in fields like healthcare, finance, and social sciences, representing a missed opportunity for richer, data-driven insights. This paper proposes a simple but effective method that leverages the existing vision encoders of these models to “see” time-series data via plots, avoiding the need for additional, potentially costly, model training. Our empirical evaluations show that this approach outperforms providing the raw time-series data as text, with the additional benefit that visual time-series representations demonstrate up to a 90% reduction in model API costs. We validate our hypothesis through synthetic data tasks of increasing complexity, progressing from simple functional form identification on clean data, to extracting trends from noisy scatter plots. To demonstrate generalizability from synthetic tasks with clear reasoning steps to more complex, real-world scenarios, we apply our approach to consumer health tasks - specifically fall detection, activity recognition, and readiness assessment - which involve heterogeneous, noisy data and multi-step reasoning. The consistent advantage of plots over text (up to a 120% performance increase on zero-shot synthetic tasks, and up to a 150% performance increase on real-world tasks), across both GPT and Gemini model families, highlights our approach’s potential for making the best use of the native capabilities of foundation models.
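The paper's premise, that models extract trends more easily from a rendered plot than from raw numbers, can be illustrated even with a crude text rendering of a series. The actual method feeds real chart images to the models' vision encoders; this dependency-free sketch only shows the shape of the idea:

```python
def ascii_plot(series, height=5):
    """Render a numeric series as a tiny character plot: each row is a value
    threshold, and a '*' marks every point at or above that threshold."""
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1.0
    rows = []
    for level in range(height - 1, -1, -1):
        cut = lo + span * level / (height - 1)
        rows.append("".join("*" if v >= cut else " " for v in series))
    return "\n".join(rows)

print(ascii_plot([1, 3, 2, 5, 4]))
```

Even in this toy form, the upward trend and the peak at the fourth point are visible at a glance, whereas the raw token sequence "1, 3, 2, 5, 4" forces the model to reason numerically.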

[AI-31] Inverse Entropic Optimal Transport Solves Semi-supervised Learning via Data Likelihood Maximization

链接: https://arxiv.org/abs/2410.02628
作者: Mikhail Persiianov,Arip Asadulaev,Nikita Andreev,Nikita Starodubcev,Dmitry Baranchuk,Anastasis Kratsios,Evgeny Burnaev,Alexander Korotin
关键词-EN: typically approached, approached via supervised, sim, data, central problem
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Learning conditional distributions \pi^*(\cdot|x) is a central problem in machine learning, which is typically approached via supervised methods with paired data (x,y) \sim \pi^* . However, acquiring paired data samples is often challenging, especially in problems such as domain translation. This necessitates the development of semi-supervised models that utilize both limited paired data and additional unpaired i.i.d. samples x \sim \pi^*_x and y \sim \pi^*_y from the marginal distributions. The usage of such combined data is complex and often relies on heuristic approaches. To tackle this issue, we propose a new learning paradigm that integrates both paired and unpaired data seamlessly through the data likelihood maximization techniques. We demonstrate that our approach also connects intriguingly with inverse entropic optimal transport (OT). This finding allows us to apply recent advances in computational OT to establish a light learning algorithm to get \pi^*(\cdot|x) . Furthermore, we demonstrate through empirical tests that our method effectively learns conditional distributions using paired and unpaired data simultaneously.

[AI-32] Achieving Fairness in Predictive Process Analytics via Adversarial Learning

链接: https://arxiv.org/abs/2410.02618
作者: Massimiliano de Leoni,Alessandro Padella
关键词-EN: offering real-time operational, real-time operational support, Predictive business process, business process analytics, important for organizations
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 17 pages, 5 figures

点击查看摘要

Abstract:Predictive business process analytics has become important for organizations, offering real-time operational support for their processes. However, these algorithms often produce unfair predictions because they are based on biased variables (e.g., gender or nationality), namely variables embodying discrimination. This paper addresses the challenge of integrating a debiasing phase into predictive business process analytics to ensure that predictions are not influenced by biased variables. Our framework, which leverages adversarial debiasing, is evaluated on four case studies, showing a significant reduction in the contribution of biased variables to the predicted value. The proposed technique is also compared with the state of the art in fairness in process mining, illustrating that our framework achieves a higher level of fairness while retaining better prediction quality.

[AI-33] NL-Eye: Abductive NLI for Images

链接: https://arxiv.org/abs/2410.02613
作者: Mor Ventura,Michael Toker,Nitay Calderon,Zorik Gekhman,Yonatan Bitton,Roi Reichart
关键词-EN: Natural Language Inference, wet floor, detects a wet, abductive Natural Language, Visual Language Model
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Will a Visual Language Model (VLM)-based bot warn us about slipping if it detects a wet floor? Recent VLMs have demonstrated impressive capabilities, yet their ability to infer outcomes and causes remains underexplored. To address this, we introduce NL-Eye, a benchmark designed to assess VLMs’ visual abductive reasoning skills. NL-Eye adapts the abductive Natural Language Inference (NLI) task to the visual domain, requiring models to evaluate the plausibility of hypothesis images based on a premise image and explain their decisions. NL-Eye consists of 350 carefully curated triplet examples (1,050 images) spanning diverse reasoning categories: physical, functional, logical, emotional, cultural, and social. The data curation process involved two steps - writing textual descriptions and generating images using text-to-image models, both requiring substantial human involvement to ensure high-quality and challenging scenes. Our experiments show that VLMs struggle significantly on NL-Eye, often performing at random baseline levels, while humans excel in both plausibility prediction and explanation quality. This demonstrates a deficiency in the abductive reasoning capabilities of modern VLMs. NL-Eye represents a crucial step toward developing VLMs capable of robust multimodal reasoning for real-world applications, including accident-prevention bots and generated video verification.

[AI-34] IndicSentEval: How Effectively do Multilingual Transformer Models encode Linguistic Properties for Indic Languages?

链接: https://arxiv.org/abs/2410.02611
作者: Akhilesh Aravapalli,Mounika Marreddy,Subba Reddy Oota,Radhika Mamidi,Manish Gupta
关键词-EN: natural language processing, Indic languages, models, revolutionized the field, field of natural
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 23 pages, 11 figures

点击查看摘要

Abstract:Transformer-based models have revolutionized the field of natural language processing. To understand why they perform so well and to assess their reliability, several studies have focused on questions such as: Which linguistic properties are encoded by these models, and to what extent? How robust are these models in encoding linguistic properties when faced with perturbations in the input text? However, these studies have mainly focused on BERT and the English language. In this paper, we investigate similar questions regarding encoding capability and robustness for 8 linguistic properties across 13 different perturbations in 6 Indic languages, using 9 multilingual Transformer models (7 universal and 2 Indic-specific). To conduct this study, we introduce a novel multilingual benchmark dataset, IndicSentEval, containing approximately 47K sentences. Surprisingly, our probing analysis of surface, syntactic, and semantic properties reveals that while almost all multilingual models demonstrate consistent encoding performance for English, they show mixed results for Indic languages. As expected, Indic-specific multilingual models capture linguistic properties in Indic languages better than universal models. Intriguingly, universal models broadly exhibit better robustness compared to Indic-specific models, particularly under perturbations such as dropping both nouns and verbs, dropping only verbs, or keeping only nouns. Overall, this study provides valuable insights into probing and perturbation-specific strengths and weaknesses of popular multilingual Transformer-based models for different Indic languages. We make our code and dataset publicly available [this https URL].

[AI-35] Beyond Expected Returns: A Policy Gradient Algorithm for Cumulative Prospect Theoretic Reinforcement Learning

链接: https://arxiv.org/abs/2410.02605
作者: Olivier Lepel,Anas Barakat
关键词-EN: behavioral economy literatures, expected utility theory, Cumulative Prospect Theory, CPT policy optimization, economy literatures
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 33 pages, 19 figures

点击查看摘要

Abstract:The widely used expected utility theory has been shown to be empirically inconsistent with human preferences in the psychology and behavioral economics literatures. Cumulative Prospect Theory (CPT) has been developed to fill in this gap and provide a better model for human-based decision-making supported by empirical evidence. It allows one to express a wide range of attitudes and perceptions towards risk, gains and losses. A few years ago, CPT was combined with Reinforcement Learning (RL) to formulate a CPT policy optimization problem where the goal of the agent is to search for a policy generating long-term returns which are aligned with their preferences. In this work, we revisit this policy optimization problem and provide new insights on optimal policies and their nature depending on the utility function under consideration. We further derive a novel policy gradient theorem for the CPT policy optimization objective generalizing the seminal corresponding result in standard RL. This result enables us to design a model-free policy gradient algorithm to solve the CPT-RL problem. We illustrate the performance of our algorithm in simple examples motivated by traffic control and electricity management applications. We also demonstrate that our policy gradient algorithm scales better to larger state spaces compared to the existing zeroth order algorithm for solving the same problem.
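CPT's departure from expected utility comes from two ingredients: a probability weighting function and a loss-averse, S-shaped utility. The standard Tversky-Kahneman forms, with commonly fitted parameter values that are not taken from this paper, look like:

```python
def cpt_weight(p, gamma=0.61):
    """Tversky-Kahneman probability weighting w(p) = p^g / (p^g + (1-p)^g)^(1/g):
    overweights small probabilities and underweights large ones."""
    return p ** gamma / (p ** gamma + (1 - p) ** gamma) ** (1 / gamma)

def cpt_value(outcomes, probs, alpha=0.88, lam=2.25):
    """Toy CPT value of a gamble: S-shaped utility with loss-aversion factor lam,
    each probability distorted by w(p). Full CPT uses rank-dependent cumulative
    weights; this per-outcome version just illustrates the shape."""
    def u(x):
        return x ** alpha if x >= 0 else -lam * ((-x) ** alpha)
    return sum(cpt_weight(p) * u(x) for x, p in zip(outcomes, probs))

print(round(cpt_weight(0.01), 3))  # well above 0.01: small probabilities overweighted
```

Because these distortions make the objective a nonlinear functional of the return distribution, the standard policy gradient theorem no longer applies directly, which is the gap the paper's CPT policy gradient theorem fills.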

[AI-36] Beyond Squared Error: Exploring Loss Design for Enhanced Training of Generative Flow Networks

链接: https://arxiv.org/abs/2410.02596
作者: Rui Hu,Yifan Zhang,Zhuoran Li,Longbo Huang
关键词-EN: generative models designed, Generative Flow Networks, attracting great research, great research interest, generative models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Generative Flow Networks (GFlowNets) are a novel class of generative models designed to sample from unnormalized distributions and have found applications in various important tasks, attracting great research interest in their training algorithms. In general, GFlowNets are trained by fitting the forward flow to the backward flow on sampled training objects. Prior work focused on the choice of training objects, parameterizations, sampling and resampling strategies, and backward policies, aiming to enhance credit assignment, exploration, or exploitation of the training process. However, the choice of regression loss, which can highly influence the exploration and exploitation behavior of the under-training policy, has been overlooked. Due to the lack of theoretical understanding for choosing an appropriate regression loss, most existing algorithms train the flow network by minimizing the squared error of the forward and backward flows in log-space, i.e., using the quadratic regression loss. In this work, we rigorously prove that distinct regression losses correspond to specific divergence measures, enabling us to design and analyze regression losses according to the desired properties of the corresponding divergence measures. Specifically, we examine two key properties: zero-forcing and zero-avoiding, where the former promotes exploitation and higher rewards, and the latter encourages exploration and enhances diversity. Based on our theoretical framework, we propose three novel regression losses, namely, Shifted-Cosh, Linex(1/2), and Linex(1). We evaluate them across three benchmarks: hyper-grid, bit-sequence generation, and molecule generation. Our proposed losses are compatible with most existing training algorithms, and significantly improve the performances of the algorithms concerning convergence speed, sample diversity, and robustness.
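The paper's point that the regression loss shapes exploration versus exploitation is easy to see concretely: a Linex loss penalizes positive and negative flow-matching residuals asymmetrically, unlike the symmetric squared error. The exact form and scaling used in the paper may differ; this is the textbook Linex:

```python
import math

def quadratic(delta):
    """Standard squared-error loss on the (log-space) flow residual."""
    return 0.5 * delta ** 2

def linex(delta, a=1.0):
    """Linex loss exp(a*d) - a*d - 1: behaves quadratically near zero but
    penalizes one sign of the residual exponentially and the other only
    linearly, biasing training toward zero-forcing or zero-avoiding behavior."""
    return math.exp(a * delta) - a * delta - 1.0

print(round(linex(1.0), 3), round(linex(-1.0), 3))  # 0.718 0.368: asymmetric
```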

[AI-37] IC3M: In-Car Multimodal Multi-object Monitoring for Abnormal Status of Both Driver and Passengers

链接: https://arxiv.org/abs/2410.02592
作者: Zihan Fang,Zheng Lin,Senkang Hu,Hangcheng Cao,Yiqin Deng,Xianhao Chen,Yuguang Fang
关键词-EN: prevent traffic accidents, providing timely alerts, detecting early-stage abnormal, early-stage abnormal status, abnormal status
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 16 pages, 17 figures

点击查看摘要

Abstract:Recently, in-car monitoring has emerged as a promising technology for detecting early-stage abnormal status of the driver and providing timely alerts to prevent traffic accidents. Although training models with multimodal data enhances the reliability of abnormal status detection, the scarcity of labeled data and the imbalance of class distribution impede the extraction of critical abnormal state features, significantly deteriorating training performance. Furthermore, missing modalities due to environment and hardware limitations further exacerbate the challenge of abnormal status identification. More importantly, monitoring abnormal health conditions of passengers, particularly in elderly care, is of paramount importance but remains underexplored. To address these challenges, we introduce our IC3M, an efficient camera-rotation-based multimodal framework for monitoring both driver and passengers in a car. Our IC3M comprises two key modules: an adaptive threshold pseudo-labeling strategy and a missing modality reconstruction. The former customizes pseudo-labeling thresholds for different classes based on the class distribution, generating class-balanced pseudo labels to guide model training effectively, while the latter leverages cross-modality relationships learned from limited labels to accurately recover missing modalities by transferring distributions from the available modalities. Extensive experimental results demonstrate that IC3M outperforms state-of-the-art benchmarks in accuracy, precision, and recall while exhibiting superior robustness under limited labeled data and severe missing modality.
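The adaptive-threshold idea, lowering the pseudo-labeling confidence bar for under-represented classes so they still receive labels, can be sketched as follows. The scheduling and statistics IC3M actually uses are more involved; the linear rule and the class names here are purely illustrative:

```python
def adaptive_thresholds(probs_by_class, base=0.9):
    """Per-class pseudo-label thresholds (sketch): scale a base threshold by
    each class's share of the current predictions, so rarer classes get a
    lower bar and the resulting pseudo labels stay more class-balanced."""
    counts = {c: len(p) for c, p in probs_by_class.items()}
    most = max(counts.values())
    return {c: base * counts[c] / most for c in counts}

def pseudo_label(preds, thresholds):
    """Keep (index, class) pairs whose confidence clears the class threshold."""
    return [(i, c) for i, (c, conf) in enumerate(preds) if conf >= thresholds[c]]

probs_by_class = {"normal": [0.9] * 8, "abnormal": [0.8] * 2}
th = adaptive_thresholds(probs_by_class)
preds = [("normal", 0.85), ("abnormal", 0.5), ("abnormal", 0.1)]
print(pseudo_label(preds, th))  # [(1, 'abnormal')]: only the minority-class prediction clears its lower bar
```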

[AI-38] Boosting Sample Efficiency and Generalization in Multi-agent Reinforcement Learning via Equivariance NEURIPS2024

链接: https://arxiv.org/abs/2410.02581
作者: Joshua McClellan,Naveed Haghani,John Winder,Furong Huang,Pratap Tokekar
关键词-EN: Multi-Agent Reinforcement Learning, Graph Neural Networks, Equivariant Graph Neural, Reinforcement Learning, neural networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: accepted as a poster at NeurIPS 2024

点击查看摘要

Abstract:Multi-Agent Reinforcement Learning (MARL) struggles with sample inefficiency and poor generalization [1]. These challenges are partially due to a lack of structure or inductive bias in the neural networks typically used in learning the policy. One such form of structure that is commonly observed in multi-agent scenarios is symmetry. The field of Geometric Deep Learning has developed Equivariant Graph Neural Networks (EGNN) that are equivariant (or symmetric) to rotations, translations, and reflections of nodes. Incorporating equivariance has been shown to improve learning efficiency and decrease error [2]. In this paper, we demonstrate that EGNNs improve the sample efficiency and generalization in MARL. However, we also show that a naive application of EGNNs to MARL results in poor early exploration due to a bias in the EGNN structure. To mitigate this bias, we present Exploration-enhanced Equivariant Graph Neural Networks or E2GN2. We compare E2GN2 to other common function approximators using common MARL benchmarks MPE and SMACv2. E2GN2 demonstrates a significant improvement in sample efficiency, greater final reward convergence, and a 2x-5x gain over standard GNNs in our generalization tests. These results pave the way for more reliable and effective solutions in complex multi-agent systems.
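Equivariance has a precise, checkable meaning here: rotating the inputs and then applying the network gives the same result as applying the network and then rotating the outputs. A toy EGNN-style position update over 2-D points, with a numerical check of that property (a sketch of the property itself, not the E2GN2 architecture):

```python
import math

def rot(theta, v):
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

def egnn_update(positions):
    """E(n)-GNN-style position update (sketch): each node moves along relative
    vectors weighted by a function of distances only, so the update is
    rotation-equivariant by construction."""
    out = []
    for i, xi in enumerate(positions):
        delta = [0.0, 0.0]
        for j, xj in enumerate(positions):
            if i == j:
                continue
            d2 = (xi[0] - xj[0]) ** 2 + (xi[1] - xj[1]) ** 2
            w = 1.0 / (1.0 + d2)  # rotation-invariant edge weight
            delta[0] += w * (xi[0] - xj[0])
            delta[1] += w * (xi[1] - xj[1])
        out.append([xi[0] + delta[0], xi[1] + delta[1]])
    return out

pts = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
theta = 0.7
a = [rot(theta, p) for p in egnn_update(pts)]  # update, then rotate
b = egnn_update([rot(theta, p) for p in pts])  # rotate, then update
print(all(abs(x - y) < 1e-9 for u, v in zip(a, b) for x, y in zip(u, v)))  # True
```

Because the weights depend only on pairwise distances, the symmetry is baked into the architecture rather than learned from data, which is the source of the sample-efficiency gains the paper reports.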

[AI-39] ColaCare: Enhancing Electronic Health Record Modeling through Large Language Model-Driven Multi-Agent Collaboration

链接: https://arxiv.org/abs/2410.02551
作者: Zixiang Wang,Yinghao Zhu,Huiya Zhao,Xiaochen Zheng,Tianlong Wang,Wen Tang,Yasha Wang,Chengwei Pan,Ewen M. Harrison,Junyi Gao,Liantao Ma
关键词-EN: Electronic Health Record, enhances Electronic Health, Large Language Models, Health Record, Electronic Health
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:We introduce ColaCare, a framework that enhances Electronic Health Record (EHR) modeling through multi-agent collaboration driven by Large Language Models (LLMs). Our approach seamlessly integrates domain-specific expert models with LLMs to bridge the gap between structured EHR data and text-based reasoning. Inspired by clinical consultations, ColaCare employs two types of agents: DoctorAgent and MetaAgent, which collaboratively analyze patient data. Expert models process and generate predictions from numerical EHR data, while LLM agents produce reasoning references and decision-making reports within the collaborative consultation framework. We additionally incorporate the Merck Manual of Diagnosis and Therapy (MSD) medical guideline within a retrieval-augmented generation (RAG) module for authoritative evidence support. Extensive experiments conducted on four distinct EHR datasets demonstrate ColaCare’s superior performance in mortality prediction tasks, underscoring its potential to revolutionize clinical decision support systems and advance personalized precision medicine. The code, complete prompt templates, more case studies, etc. are publicly available at the anonymous link: this https URL.

[AI-40] Intelligence at the Edge of Chaos

链接: https://arxiv.org/abs/2410.02536
作者: Shiyang Zhang,Aakash Patel,Syed A Rizvi,Nianchen Liu,Sizhuang He,Amin Karbasi,Emanuele Zappala,David van Dijk
关键词-EN: rule-based systems influences, Large Language Models, explore the emergence, emergence of intelligent, influences the capabilities
类目: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: 15 pages, 8 figures

点击查看摘要

Abstract:We explore the emergence of intelligent behavior in artificial systems by investigating how the complexity of rule-based systems influences the capabilities of models trained to predict these rules. Our study focuses on elementary cellular automata (ECA), simple yet powerful one-dimensional systems that generate behaviors ranging from trivial to highly complex. By training distinct Large Language Models (LLMs) on different ECAs, we evaluated the relationship between the complexity of the rules’ behavior and the intelligence exhibited by the LLMs, as reflected in their performance on downstream tasks. Our findings reveal that rules with higher complexity lead to models exhibiting greater intelligence, as demonstrated by their performance on reasoning and chess move prediction tasks. Both uniform and periodic systems, and often also highly chaotic systems, resulted in poorer downstream performance, highlighting a sweet spot of complexity conducive to intelligence. We conjecture that intelligence arises from the ability to predict complexity and that creating intelligence may require only exposure to complexity.
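The rule systems in this study are elementary cellular automata: each of the 256 rules is just an 8-entry lookup table over a cell's three-bit (left, self, right) neighborhood. A minimal implementation of one update step:

```python
def eca_step(state, rule=110):
    """One step of an elementary cellular automaton: each cell's next value is
    the rule's bit indexed by its (left, self, right) neighborhood, with wrap."""
    n = len(state)
    table = [(rule >> k) & 1 for k in range(8)]
    return [table[(state[(i - 1) % n] << 2) | (state[i] << 1) | state[(i + 1) % n]]
            for i in range(n)]

state = [0] * 10 + [1] + [0] * 10
for _ in range(5):
    state = eca_step(state, rule=110)
print(sum(state))  # live cells after 5 steps of Rule 110
```

Rule 110 sits in Wolfram's "complex" class and is Turing-complete, which makes it a natural candidate for the complexity sweet spot the paper identifies, in contrast to trivial rules like Rule 0 (everything dies) or Rule 204 (the identity).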

[AI-41] A Schema-aware Logic Reformulation for Graph Reachability

链接: https://arxiv.org/abs/2410.02533
作者: Davide Di Pierro,Stefano Ferilli
关键词-EN: semantic is attached, task of understanding, distinct points, general a semantic, Graph reachability
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph reachability is the task of understanding whether two distinct points in a graph are interconnected by arcs to which in general a semantic is attached. Reachability has plenty of applications, ranging from motion planning to routing. Improving reachability requires structural knowledge of relations so as to avoid the complexity of traditional depth-first and breadth-first strategies, implemented in logic languages. In some contexts, graphs are enriched with their schema definitions establishing domain and range for every arc. The introduction of a schema-aware formalization for guiding the search may result in a significant improvement by cutting out unhelpful paths and prioritising those that, in principle, reach the target earlier. In this work, we propose a strategy to automatically exclude and sort certain graph paths by exploiting the higher-level conceptualization of instances. The aim is to obtain a new first-order logic reformulation of the graph reachability scenario, capable of improving the traditional algorithms in terms of time, space requirements, and number of backtracks. The experiments exhibit the expected advantages of the approach in reducing the number of backtracks during the search, saving time and space as well.

[AI-42] Contextual Document Embeddings

链接: https://arxiv.org/abs/2410.02525
作者: John X. Morris,Alexander M. Rush
关键词-EN: Dense document embeddings, Dense document, central to neural, document, Dense
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Dense document embeddings are central to neural retrieval. The dominant paradigm is to train and construct embeddings by running encoders directly on individual documents. In this work, we argue that these embeddings, while effective, are implicitly out-of-context for targeted use cases of retrieval, and that a contextualized document embedding should take into account both the document and neighboring documents in context - analogous to contextualized word embeddings. We propose two complementary methods for contextualized document embeddings: first, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss; second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation. Results show that both methods achieve better performance than biencoders in several settings, with differences especially pronounced out-of-domain. We achieve state-of-the-art results on the MTEB benchmark with no hard negative mining, score distillation, dataset-specific instructions, intra-GPU example-sharing, or extremely large batch sizes. Our method can be applied to improve performance on any contrastive learning dataset and any biencoder.
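
As a rough intuition for what "in context" means here (our own toy illustration, not the paper's proposed objective or architecture), one can center a document vector against the mean of its neighboring documents, so that regularities shared across the corpus stop dominating the representation.

```python
def contextualize(doc_vec, neighbor_vecs):
    """Subtract the neighborhood mean from a document vector.

    Toy stand-in for contextualization: components shared by every
    neighboring document (e.g., domain boilerplate) cancel out, leaving
    the part of the vector that distinguishes this document in context.
    """
    dim = len(doc_vec)
    mean = [sum(v[i] for v in neighbor_vecs) / len(neighbor_vecs)
            for i in range(dim)]
    return [d - m for d, m in zip(doc_vec, mean)]
```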

[AI-43] SAFLEX: Self-Adaptive Augmentation via Feature Label Extrapolation ICLR2024

链接: https://arxiv.org/abs/2410.02512
作者: Mucong Ding,Bang An,Yuancheng Xu,Anirudh Satheesh,Furong Huang
关键词-EN: scarce labeled data, enhancing model performance, augmentation, crucial in enhancing, scarce labeled
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: ICLR 2024

点击查看摘要

Abstract:Data augmentation, a cornerstone technique in deep learning, is crucial in enhancing model performance, especially with scarce labeled data. While traditional techniques are effective, their reliance on hand-crafted methods limits their applicability across diverse data types and tasks. Although modern learnable augmentation methods offer increased adaptability, they are computationally expensive and challenging to incorporate within prevalent augmentation workflows. In this work, we present a novel, efficient method for data augmentation, effectively bridging the gap between existing augmentation strategies and emerging datasets and learning tasks. We introduce SAFLEX (Self-Adaptive Augmentation via Feature Label EXtrapolation), which learns the sample weights and soft labels of augmented samples provided by any given upstream augmentation pipeline, using a specifically designed efficient bilevel optimization algorithm. Remarkably, SAFLEX effectively reduces the noise and label errors of the upstream augmentation pipeline with a marginal computational cost. As a versatile module, SAFLEX excels across diverse datasets, including natural and medical images and tabular data, showcasing its prowess in few-shot learning and out-of-distribution generalization. SAFLEX seamlessly integrates with common augmentation strategies like RandAug, CutMix, and those from large pre-trained generative models like stable diffusion and is also compatible with frameworks such as CLIP’s fine-tuning. Our findings highlight the potential to adapt existing augmentation pipelines for new data types and tasks, signaling a move towards more adaptable and resilient training frameworks.

[AI-44] Choices are More Important than Efforts: LLM Enables Efficient Multi-Agent Exploration

链接: https://arxiv.org/abs/2410.02511
作者: Yun Qu,Boyuan Wang,Yuhang Jiang,Jianzhun Shao,Yixiu Mao,Cheems Wang,Chang Liu,Xiangyang Ji
关键词-EN: expansive state-action spaces, efficient multi-agent exploration, multi-agent exploration remains, state-action spaces, reinforcement learning
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:With expansive state-action spaces, efficient multi-agent exploration remains a longstanding challenge in reinforcement learning. Although pursuing novelty, diversity, or uncertainty attracts increasing attention, the redundant effort brought about by exploration without properly guided choices poses a practical issue for the community. This paper introduces a systematic approach, termed LEMAE, choosing to channel informative task-relevant guidance from a knowledgeable Large Language Model (LLM) for Efficient Multi-Agent Exploration. Specifically, we ground linguistic knowledge from the LLM into symbolic key states that are critical for task fulfillment, in a discriminative manner at low LLM inference costs. To unleash the power of key states, we design Subspace-based Hindsight Intrinsic Reward (SHIR) to guide agents toward key states by increasing reward density. Additionally, we build the Key State Memory Tree (KSMT) to track transitions between key states in a specific task for organized exploration. Benefiting from diminishing redundant explorations, LEMAE outperforms existing SOTA approaches on the challenging benchmarks (e.g., SMAC and MPE) by a large margin, achieving a 10x acceleration in certain scenarios.

[AI-45] Can Large Language Models Grasp Legal Theories? Enhance Legal Reasoning with Insights from Multi-Agent Collaboration

链接: https://arxiv.org/abs/2410.02507
作者: Weikang Yuan,Junjie Cao,Zhuoren Jiang,Yangyang Kang,Jun Lin,Kaisong Song,tianqianjin lin,Pengwei Yan,Changlong Sun,Xiaozhong Liu
关键词-EN: Large Language Models, Large Language, Language Models, understand legal theories, legal theories
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) could struggle to fully understand legal theories and perform complex legal reasoning tasks. In this study, we introduce a challenging task (confusing charge prediction) to better evaluate LLMs’ understanding of legal theories and reasoning capabilities. We also propose a novel framework: Multi-Agent framework for improving complex Legal Reasoning capability (MALR). MALR employs non-parametric learning, encouraging LLMs to automatically decompose complex legal tasks and mimic human learning process to extract insights from legal rules, helping LLMs better understand legal theories and enhance their legal reasoning abilities. Extensive experiments on multiple real-world datasets demonstrate that the proposed framework effectively addresses complex reasoning issues in practical scenarios, paving the way for more reliable applications in the legal domain.

[AI-46] Dog-IQA: Standard-guided Zero-shot MLLM for Mix-grained Image Quality Assessment

链接: https://arxiv.org/abs/2410.02505
作者: Kai Liu,Ziqing Zhang,Wenbo Li,Renjing Pei,Fenglong Song,Xiaohong Liu,Linghe Kong,Yulun Zhang
关键词-EN: computer vision fields, Image quality assessment, quality assessment, vision fields, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 10 pages, 5 figures. The code and models will be available at this https URL

点击查看摘要

Abstract:Image quality assessment (IQA) serves as the gold standard for all models’ performance in nearly all computer vision fields. However, it still suffers from poor out-of-distribution generalization ability and expensive training costs. To address these problems, we propose Dog-IQA, a standard-guided zero-shot mix-grained IQA method, which is training-free and utilizes the exceptional prior knowledge of multimodal large language models (MLLMs). To obtain accurate IQA scores, namely scores consistent with humans, we design an MLLM-based inference pipeline that imitates human experts. In detail, Dog-IQA applies two techniques. First, Dog-IQA objectively scores with specific standards that utilize the MLLM’s behavior pattern and minimize the influence of subjective factors. Second, Dog-IQA comprehensively takes local semantic objects and the whole image as input and aggregates their scores, leveraging local and global information. Our proposed Dog-IQA achieves state-of-the-art (SOTA) performance compared with training-free methods, and competitive performance compared with training-based methods in cross-dataset scenarios. Our code and models will be available at this https URL.

[AI-47] Mixed-Session Conversation with Egocentric Memory EMNLP

链接: https://arxiv.org/abs/2410.02503
作者: Jihyoung Jang,Taeyoung Kim,Hyounghun Kim
关键词-EN: Recently introduced dialogue, Recently introduced, demonstrated high usability, Recently, dialogue
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: EMNLP Findings 2024 (30 pages); Project website: this https URL

点击查看摘要

Abstract:Recently introduced dialogue systems have demonstrated high usability. However, they still fall short of reflecting real-world conversation scenarios. Current dialogue systems exhibit an inability to replicate the dynamic, continuous, long-term interactions involving multiple partners. This shortfall arises because there have been limited efforts to account for both aspects of real-world dialogues: deeply layered interactions over the long-term dialogue and widely expanded conversation networks involving multiple participants. In an effort to incorporate both of these aspects, we introduce Mixed-Session Conversation, a dialogue system designed to construct conversations with various partners in a multi-session dialogue setup. We propose a new dataset called MiSC to implement this system. The dialogue episodes of MiSC consist of 6 consecutive sessions, with four speakers (one main speaker and three partners) appearing in each episode. Also, we propose a new dialogue model with a novel memory management mechanism, called Egocentric Memory Enhanced Mixed-Session Conversation Agent (EMMA). EMMA collects and retains memories from the main speaker’s perspective during conversations with partners, enabling seamless continuity in subsequent interactions. Extensive human evaluations validate that the dialogues in MiSC demonstrate a seamless conversational flow, even when conversation partners change in each session. EMMA trained with MiSC is also evaluated to maintain high memorability without contradiction throughout the entire conversation.

[AI-48] Meta-Models: An Architecture for Decoding LLM Behaviors Through Interpreted Embeddings and Natural Language

链接: https://arxiv.org/abs/2410.02472
作者: Anthony Costarelli,Mat Allen,Severin Field,Joshua Clymer
关键词-EN: Large Language Models, Language Models, Large Language, daily lives, interpreting their decision-making
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 11 pages, 2 figures

点击查看摘要

Abstract:As Large Language Models (LLMs) become increasingly integrated into our daily lives, the potential harms from deceptive behavior underlie the need for faithfully interpreting their decision-making. While traditional probing methods have shown some effectiveness, they remain best suited for narrowly scoped tasks, while more comprehensive explanations are still necessary. To this end, we investigate meta-models: an architecture using a “meta-model” that takes activations from an “input-model” and answers natural language questions about the input-model’s behaviors. We evaluate the meta-model’s ability to generalize by training meta-models on selected task types and assessing their out-of-distribution performance in deceptive scenarios. Our findings show that meta-models generalize well to out-of-distribution tasks and point towards opportunities for future research in this area.

[AI-49] Response Tuning: Aligning Large Language Models without Instruction

链接: https://arxiv.org/abs/2410.02465
作者: Seokhyun An,Hyounghun Kim
关键词-EN: Large Language Models, pre-trained Large Language, transitioning pre-trained Large, Large Language, safe chat assistants
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 34 pages

点击查看摘要

Abstract:Instruction tuning, i.e., supervised fine-tuning using instruction-response pairs, is a foundational step in transitioning pre-trained Large Language Models (LLMs) into helpful and safe chat assistants. Our hypothesis is that establishing an adequate output space can enable such a transition given the capabilities inherent in pre-trained LLMs. To verify this, we propose Response Tuning (RT), which eliminates the instruction-conditioning step in instruction tuning and solely focuses on response space supervision. Our experiments demonstrate that RT models, trained only using responses, can effectively respond to a wide range of instructions and exhibit helpfulness comparable to that of their instruction-tuned counterparts. Furthermore, we observe that controlling the training response distribution can significantly improve their user preference or elicit target behaviors such as refusing assistance for unsafe queries. Our findings illuminate the role of establishing an adequate output space in alignment, highlighting the potential of the extensive inherent capabilities of pre-trained LLMs.

[AI-50] Recurrent Few-Shot model for Document Verification

链接: https://arxiv.org/abs/2410.02456
作者: Maxime Talarmain,Carlos Boned,Sanket Biswas,Oriol Ramos
关键词-EN: video-based verification systems, solved problem, video-based verification, verification systems, considered a solved
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Image- and video-based verification systems for general-purpose ID and travel documents have yet to achieve good enough performance to be considered a solved problem. Several factors negatively impact their performance, including low-resolution images and videos and a lack of sufficient data to train the models. The task is particularly challenging when dealing with unseen classes of ID or travel documents. In this paper we address this task by proposing a recurrent-based model able to detect forged documents in a few-shot scenario. The recurrent architecture makes the model robust to document resolution variability. Moreover, the few-shot approach allows the model to perform well even for unseen classes of documents. Preliminary results on the SIDTD and Findit datasets show good performance of this model for this task.

[AI-51] Strong Preferences Affect the Robustness of Value Alignment

链接: https://arxiv.org/abs/2410.02451
作者: Ziwei Xu,Mohan Kankanhalli
关键词-EN: large language models, aims to ensure, ensure that large, large language, agents behave
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Value alignment, which aims to ensure that large language models (LLMs) and other AI agents behave in accordance with human values, is critical for ensuring the safety and trustworthiness of these systems. A key component of value alignment is the modeling of human preferences as a representation of human values. In this paper, we investigate the robustness of value alignment by examining the sensitivity of preference models. Specifically, we ask: how do changes in the probabilities of some preferences affect the predictions of these models for other preferences? To answer this question, we theoretically analyze the robustness of widely used preference models by examining their sensitivities to minor changes in the preferences they model. Our findings reveal that, in the Bradley-Terry and Plackett-Luce models, the probability of a preference can change significantly as other preferences change, especially when these preferences are dominant (i.e., with probabilities near 0 or 1). We identify specific conditions where this sensitivity becomes significant for these models and discuss the practical implications for the robustness and safety of value alignment in AI systems.
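
The sensitivity claim is easy to reproduce numerically. In the Bradley-Terry model each item i has a latent score r_i with P(i beats j) = sigmoid(r_i - r_j), so score gaps add along a chain and fixing two pairwise preferences determines a third. The toy numbers below are our own illustration, not the paper's analysis.

```python
import math

def bt_prob(r_i, r_j):
    """Bradley-Terry win probability from latent scores."""
    return 1.0 / (1.0 + math.exp(-(r_i - r_j)))

def implied_ac(p_ab, p_bc):
    """P(a beats c) implied by P(a beats b) and P(b beats c)."""
    # logit(P) recovers the score gap, and gaps add along the chain a, b, c.
    gap = math.log(p_ab / (1 - p_ab)) + math.log(p_bc / (1 - p_bc))
    return 1.0 / (1.0 + math.exp(-gap))

# Nudging P(a beats b) down by 0.04 barely moves P(a beats c) when the
# preferences involved are moderate...
mild_shift = abs(implied_ac(0.60, 0.45) - implied_ac(0.56, 0.45))
# ...but the same 0.04 nudge moves it drastically when P(a beats b)
# is dominant (near 1), because logits blow up near the boundary.
dominant_shift = abs(implied_ac(0.99, 0.05) - implied_ac(0.95, 0.05))
```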

[AI-52] Clinnova Federated Learning Proof of Concept: Key Takeaways from a Cross-border Collaboration

链接: https://arxiv.org/abs/2410.02443
作者: Julia Alekseenko,Bram Stieltjes,Michael Bach,Melanie Boerries,Oliver Opitz,Alexandros Karargyris,Nicolas Padoy
关键词-EN: initiative involving France, European Greater Region, collaborative initiative involving, involving France, Greater Region initiative
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Clinnova, a collaborative initiative involving France, Germany, Switzerland, and Luxembourg, is dedicated to unlocking the power of precision medicine through data federation, standardization, and interoperability. This European Greater Region initiative seeks to create an interoperable European standard using artificial intelligence (AI) and data science to enhance healthcare outcomes and efficiency. Key components include multidisciplinary research centers, a federated biobanking strategy, a digital health innovation platform, and a federated AI strategy. It targets inflammatory bowel disease, rheumatoid diseases, and multiple sclerosis (MS), emphasizing data quality to develop AI algorithms for personalized treatment and translational research. The IHU Strasbourg (Institute of Minimal-invasive Surgery) has the lead in this initiative to develop the federated learning (FL) proof of concept (POC) that will serve as a foundation for advancing AI in healthcare. At its core, Clinnova-MS aims to enhance MS patient care by using FL to develop more accurate models that detect disease progression, guide interventions, and validate digital biomarkers across multiple sites. This technical report presents insights and key takeaways from the first cross-border federated POC on MS segmentation of MRI images within the Clinnova framework. While our work marks a significant milestone in advancing MS segmentation through cross-border collaboration, it also underscores the importance of addressing technical, logistical, and ethical considerations to realize the full potential of FL in healthcare settings.

[AI-53] Optimizing Adaptive Attacks against Content Watermarks for Language Models

链接: https://arxiv.org/abs/2410.02440
作者: Abdulrahman Diaa,Toluwani Aremu,Nils Lukas
关键词-EN: Large Language Models, Large Language, Language Models, spread online spam, spam and misinformation
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) can be misused to spread online spam and misinformation. Content watermarking deters misuse by hiding a message in model-generated outputs, enabling their detection using a secret watermarking key. Robustness is a core security property, stating that evading detection requires (significant) degradation of the content’s quality. Many LLM watermarking methods have been proposed, but robustness is tested only against non-adaptive attackers who lack knowledge of the watermarking method and can find only suboptimal attacks. We formulate the robustness of LLM watermarking as an objective function and propose preference-based optimization to tune adaptive attacks against the specific watermarking method. Our evaluation shows that (i) adaptive attacks substantially outperform non-adaptive baselines. (ii) Even in a non-adaptive setting, adaptive attacks optimized against a few known watermarks remain highly effective when tested against other unseen watermarks, and (iii) optimization-based attacks are practical and require less than seven GPU hours. Our findings underscore the need to test robustness against adaptive attackers.

[AI-54] Predictive Attractor Models NEURIPS2024

链接: https://arxiv.org/abs/2410.02430
作者: Ramy Mounir,Sudeep Sarkar
关键词-EN: episodic memory formation, numerous cognitive functions, underpins numerous cognitive, language comprehension, Sequential memory
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: Accepted to NeurIPS 2024

点击查看摘要

Abstract:Sequential memory, the ability to form and accurately recall a sequence of events or stimuli in the correct order, is a fundamental prerequisite for biological and artificial intelligence as it underpins numerous cognitive functions (e.g., language comprehension, planning, episodic memory formation, etc.). However, existing methods of sequential memory suffer from catastrophic forgetting, limited capacity, slow iterative learning procedures, low-order Markov memory, and, most importantly, the inability to represent and generate multiple valid future possibilities stemming from the same context. Inspired by biologically plausible neuroscience theories of cognition, we propose Predictive Attractor Models (PAM), a novel sequence memory architecture with desirable generative properties. PAM is a streaming model that learns a sequence in an online, continuous manner by observing each input only once. Additionally, we find that PAM avoids catastrophic forgetting by uniquely representing past context through lateral inhibition in cortical minicolumns, which prevents new memories from overwriting previously learned knowledge. PAM generates future predictions by sampling from a union set of predicted possibilities; this generative ability is realized through an attractor model trained alongside the predictor. We show that PAM is trained with local computations through Hebbian plasticity rules in a biologically plausible framework. Other desirable traits (e.g., noise tolerance, CPU-based learning, capacity scaling) are discussed throughout the paper. Our findings suggest that PAM represents a significant step forward in the pursuit of biologically plausible and computationally efficient sequential memory models, with broad implications for cognitive science and artificial intelligence research.

[AI-55] IoT-LLM: Enhancing Real-World IoT Task Reasoning with Large Language Models ICLR2025

链接: https://arxiv.org/abs/2410.02429
作者: Tuo An,Yunjiao Zhou,Han Zou,Jianfei Yang
关键词-EN: Large Language Models, Large Language, Language Models, demonstrated remarkable capabilities, physical world
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 21 pages, 10 figures, submitted to ICLR 2025 Conference

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across textual and visual domains but often generate outputs that violate physical laws, revealing a gap in their understanding of the physical world. Inspired by human cognition, where perception is fundamental to reasoning, we explore augmenting LLMs with enhanced perception abilities using Internet of Things (IoT) sensor data and pertinent knowledge for IoT task reasoning in the physical world. In this work, we systematically study LLMs’ capability to address real-world IoT tasks by augmenting their perception and knowledge base, and then propose a unified framework, IoT-LLM, to enhance such capability. In IoT-LLM, we customize three steps for LLMs: preprocessing IoT data into formats amenable to LLMs, activating their commonsense knowledge through chain-of-thought prompting and specialized role definitions, and expanding their understanding via IoT-oriented retrieval-augmented generation based on in-context learning. To evaluate the performance, we design a new benchmark with five real-world IoT tasks with different data types and reasoning difficulties and provide the benchmarking results on six open-source and closed-source LLMs. Experimental results demonstrate the limitations of existing LLMs with naive textual inputs that cannot perform these tasks effectively. We show that IoT-LLM significantly enhances the IoT task reasoning performance of LLMs, such as GPT-4, achieving an average improvement of 65% across various tasks against previous methods. The results also showcase the LLMs’ ability to comprehend IoT data and the physical laws behind the data by providing a reasoning process. Limitations of our work are discussed in the hope of inspiring future research in this new era.
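
The first of the three steps, rendering raw sensor readings into an LLM-amenable textual format, might look like the sketch below; the field names and prompt wording are our own assumptions, not the paper's exact template.

```python
def iot_prompt(readings, task):
    """Render (timestamp, value) sensor readings as a textual prompt.

    Combines the formatted log with a chain-of-thought style cue, a toy
    stand-in for IoT-LLM's preprocessing and prompting steps.
    """
    lines = [f"t={t}s: value={v:.2f}" for t, v in readings]
    return ("Sensor log:\n" + "\n".join(lines)
            + f"\nTask: {task}\nReason step by step before answering.")

prompt = iot_prompt([(0, 21.5), (60, 22.1)], "Is the temperature rising?")
```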

[AI-56] Collective Critics for Creative Story Generation EMNLP2024

链接: https://arxiv.org/abs/2410.02428
作者: Minwook Bae,Hyounghun Kim
关键词-EN: Large Language Models, Language Models, Large Language, Generating a long, challenging task
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: EMNLP 2024 (36 pages)

点击查看摘要

Abstract:Generating a long story of several thousand words with narrative coherence using Large Language Models (LLMs) has been a challenging task. Previous research has addressed this challenge by proposing different frameworks that create a story plan and generate a long story based on that plan. However, these frameworks have been mainly focusing on maintaining narrative coherence in stories, often overlooking creativity in story planning and the expressiveness of the stories generated from those plans, which are desirable properties to captivate readers’ interest. In this paper, we propose Collective Critics for Creative Story Generation framework (CritiCS), which is composed of plan refining stage (CrPlan) and story generation stage (CrText), to integrate a collective revision mechanism that promotes those properties into long-form story generation process. Specifically, in each stage, a group of LLM critics and one leader collaborate to incrementally refine drafts of plan and story throughout multiple rounds. Extensive human evaluation shows that the CritiCS can significantly enhance story creativity and reader engagement, while also maintaining narrative coherence. Furthermore, the design of the framework allows active participation from human writers in any role within the critique process, enabling interactive human-machine collaboration in story writing.

[AI-57] Learning the Latent Rules of a Game from Data: A Chess Story

链接: https://arxiv.org/abs/2410.02426
作者: Ben Fauber
关键词-EN: pretrained foundational generative, foundational generative language, generative language models, small pretrained foundational, parameter pretrained foundational
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We demonstrate that small pretrained foundational generative language models with millions of parameters can learn the latent rules of a process from data associated with the process. Inspired by Stefan Zweig’s novella “Schachnovelle,” also known as “The Royal Game” in English, we show that 28M and 125M parameter pretrained foundational small language models (SLMs) can be instruction fine-tuned with 1,000-to-1,000,000 examples to learn the rules of chess, propose legal moves, and accurately solve chess problems. We also explore the impact of successive language model fine-tuning epochs on improved outcomes and demonstrate reductions in model hallucinations by increasing the number of instruction fine-tuning examples.

[AI-58] SynCo: Synthetic Hard Negatives in Contrastive Learning for Better Unsupervised Visual Representations

链接: https://arxiv.org/abs/2410.02401
作者: Nikolaos Giakoumoglou,Tania Stathaki
关键词-EN: synthetic hard negatives, Contrastive learning, hard negatives, synthetic hard, negatives-samples that closely
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 10 pages, 6 figures, 4 tables. arXiv admin note: text overlap with arXiv:2010.01028 by other authors

点击查看摘要

Abstract:Contrastive learning has become a dominant approach in self-supervised visual representation learning, with hard negatives, samples that closely resemble the anchor, being key to enhancing the discriminative power of learned representations. However, efficiently leveraging hard negatives remains a challenge due to the difficulty in identifying and incorporating them without significantly increasing computational costs. To address this, we introduce SynCo (Synthetic Negatives in Contrastive learning), a novel contrastive learning approach that improves model performance by generating synthetic hard negatives. Built on the MoCo framework, SynCo introduces six novel strategies for creating diverse synthetic hard negatives that can be generated on-the-fly with minimal computational overhead. SynCo achieves faster training and better representation learning, achieving a top-1 accuracy of 68.1% in ImageNet linear evaluation after only 200 epochs of pretraining, surpassing MoCo’s 67.5% with the same ResNet-50 encoder. Additionally, it transfers more effectively to detection tasks: on PASCAL VOC, it outperforms both the supervised baseline and MoCo, achieving an AP of 82.5%; on the COCO dataset, it sets a new benchmark with 40.4% AP for bounding box detection and 35.4% AP for instance segmentation. Our synthetic hard negative generation procedure significantly enhances the quality of visual representations learned through self-supervised contrastive learning. Code is available at this https URL.
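
One generic way to synthesize a hard negative on-the-fly (a hedged sketch of the general idea; SynCo's six concrete strategies differ in detail) is to interpolate a real negative toward the anchor in embedding space and renormalize.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors pass through)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def synth_hard_negative(anchor, negative, alpha=0.5):
    """Mix a real negative toward the anchor, then renormalize.

    Larger alpha pulls the synthetic sample closer to the anchor,
    yielding a 'harder' negative for the contrastive loss.
    """
    mixed = [alpha * a + (1 - alpha) * n for a, n in zip(anchor, negative)]
    return l2_normalize(mixed)

def cos(u, v):
    """Cosine similarity for unit vectors (plain dot product)."""
    return sum(a * b for a, b in zip(u, v))

anchor = l2_normalize([1.0, 0.0])
negative = l2_normalize([0.0, 1.0])
hard = synth_hard_negative(anchor, negative)
# `hard` sits closer to the anchor than the real negative does.
```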

[AI-59] Parameter Competition Balancing for Model Merging NEURIPS2024

链接: https://arxiv.org/abs/2410.02396
作者: Guodong Du,Junlin Lee,Jing Li,Runhua Jiang,Yifei Guo,Shuyang Yu,Hanting Liu,Sim Kuan Goh,Ho-Kin Tang,Daojing He,Min Zhang
关键词-EN: common practice, model, tasks, parameter, fine-tuning pretrained models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Accepted by NeurIPS2024

点击查看摘要

Abstract:While fine-tuning pretrained models has become common practice, these models often underperform outside their specific domains. Recently developed model merging techniques enable the direct integration of multiple models, each fine-tuned for distinct tasks, into a single model. This strategy promotes multitasking capabilities without requiring retraining on the original datasets. However, existing methods fall short in addressing potential conflicts and complex correlations between tasks, especially in parameter-level adjustments, posing a challenge in effectively balancing parameter competition across various tasks. This paper introduces an innovative technique named PCB-Merging (Parameter Competition Balancing), a lightweight and training-free technique that adjusts the coefficients of each parameter for effective model merging. PCB-Merging employs intra-balancing to gauge parameter significance within individual tasks and inter-balancing to assess parameter similarities across different tasks. Parameters with low importance scores are dropped, and the remaining ones are rescaled to form the final merged model. We assessed our approach in diverse merging scenarios, including cross-task, cross-domain, and cross-training configurations, as well as out-of-domain generalization. The experimental results reveal that our approach achieves substantial performance enhancements across multiple modalities, domains, model sizes, numbers of tasks, fine-tuning forms, and large language models, outperforming existing model merging methods. The code is publicly available at this https URL.
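
The drop-and-rescale flavor of the method can be caricatured as follows; the importance score below is plain magnitude, a deliberate simplification of PCB-Merging's intra- and inter-balancing scores, and the function names are ours.

```python
def merge_task_vectors(task_vectors, keep_ratio=0.5):
    """Merge per-task parameter deltas coordinate by coordinate.

    For each coordinate, keep only the task deltas with the largest
    magnitude and average those, so weak or conflicting updates are
    dropped rather than diluting the merged model.
    """
    merged = []
    for coords in zip(*task_vectors):
        ranked = sorted(coords, key=abs, reverse=True)
        k = max(1, int(len(ranked) * keep_ratio))
        merged.append(sum(ranked[:k]) / k)
    return merged
```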

[AI-60] Online Multi-Label Classification under Noisy and Changing Label Distribution

链接: https://arxiv.org/abs/2410.02394
作者: Yizhang Zou,Xuegang Hu,Peipei Li,Jun Hu,You Wu
关键词-EN: Multi-label data stream, noisy label distribution, label distribution, ground-truth label distribution, online multi-label classification
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Multi-label data streams usually contain noisy labels in real-world applications, occurring in both relevant and irrelevant labels. However, existing online multi-label classification methods are mostly limited in terms of label quality and fail to deal with the case of noisy labels. On the other hand, the ground-truth label distribution may vary as time changes; this variation is hidden in the observed noisy label distribution and difficult to track, posing a major challenge for concept drift adaptation. Motivated by this, we propose an online multi-label classification algorithm under Noisy and Changing Label Distribution (NCLD). The convex objective is designed to simultaneously model the label scoring and the label ranking for high accuracy, whose robustness to NCLD benefits from three novel works: 1) The local feature graph is used to reconstruct the label scores jointly with the observed labels, and an unbiased ranking loss is derived and applied to learn reliable ranking information. 2) By detecting the difference between two adjacent chunks with the unbiased label cardinality, we identify the change in the ground-truth label distribution and reset the ranking or all information learned from the past to match the new distribution. 3) Efficient and accurate updating is achieved based on the updating rule derived from the closed-form optimal model solution. Finally, empirical experimental results validate the effectiveness of our method in classifying instances under NCLD.
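
Point 2 of the method compares adjacent chunks via label cardinality. A toy version is shown below using raw cardinalities; the paper derives an unbiased estimator that corrects for label noise, and the threshold here is our own placeholder.

```python
def label_cardinality(chunk):
    """Average number of relevant labels per instance in a chunk."""
    return sum(sum(labels) for labels in chunk) / len(chunk)

def drift_detected(prev_chunk, new_chunk, threshold=0.5):
    """Flag a change in the label distribution between adjacent chunks."""
    return abs(label_cardinality(new_chunk)
               - label_cardinality(prev_chunk)) > threshold
```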

[AI-61] Diffusion Meets Options: Hierarchical Generative Skill Composition for Temporally-Extended Tasks

链接: https://arxiv.org/abs/2410.02389
作者: Zeyu Feng,Hao Luan,Kevin Yuchen Ma,Harold Soh
关键词-EN: correct execution errors, Safe and successful, execution errors, successful deployment, capacity to frequently
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Safe and successful deployment of robots requires not only the ability to generate complex plans but also the capacity to frequently replan and correct execution errors. This paper addresses the challenge of long-horizon trajectory planning under temporally extended objectives in a receding horizon manner. To this end, we propose DOPPLER, a data-driven hierarchical framework that generates and updates plans based on instruction specified by linear temporal logic (LTL). Our method decomposes temporal tasks into chain of options with hierarchical reinforcement learning from offline non-expert datasets. It leverages diffusion models to generate options with low-level actions. We devise a determinantal-guided posterior sampling technique during batch generation, which improves the speed and diversity of diffusion generated options, leading to more efficient querying. Experiments on robot navigation and manipulation tasks demonstrate that DOPPLER can generate sequences of trajectories that progressively satisfy the specified formulae for obstacle avoidance and sequential visitation. Demonstration videos are available online at: this https URL.

[AI-62] BiSSL: Bilevel Optimization for Self-Supervised Pre-Training and Fine-Tuning

链接: https://arxiv.org/abs/2410.02387
作者: Gustav Wagner Zakarias,Lars Kai Hansen,Zheng-Hua Tan
关键词-EN: introduces bilevel optimization, self-supervised learning pipeline, self-supervised learning, bilevel optimization, bilevel optimization problem
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this work, we present BiSSL, a first-of-its-kind training framework that introduces bilevel optimization to enhance the alignment between the pretext pre-training and downstream fine-tuning stages in self-supervised learning. BiSSL formulates the pretext and downstream task objectives as the lower- and upper-level objectives in a bilevel optimization problem and serves as an intermediate training stage within the self-supervised learning pipeline. By more explicitly modeling the interdependence of these training stages, BiSSL facilitates enhanced information sharing between them, ultimately leading to a backbone parameter initialization that is better suited for the downstream task. We propose a training algorithm that alternates between optimizing the two objectives defined in BiSSL. Using a ResNet-18 backbone pre-trained with SimCLR on the STL10 dataset, we demonstrate that our proposed framework consistently achieves improved or competitive classification accuracies across various downstream image classification datasets compared to the conventional self-supervised learning pipeline. Qualitative analyses of the backbone features further suggest that BiSSL enhances the alignment of downstream features in the backbone prior to fine-tuning.

[AI-63] MetaMetrics: Calibrating Metrics For Generation Tasks Using Human Preferences

链接: https://arxiv.org/abs/2410.02381
作者: Genta Indra Winata,David Anugraha,Lucky Susanto,Garry Kuwanto,Derry Tanti Wijaya
关键词-EN: Understanding the quality, model outputs align, model outputs, human preferences, Understanding
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:Understanding the quality of a performance evaluation metric is crucial for ensuring that model outputs align with human preferences. However, it remains unclear how well each metric captures the diverse aspects of these preferences, as metrics often excel in one particular area but not across all dimensions. To address this, it is essential to systematically calibrate metrics to specific aspects of human preference, catering to the unique characteristics of each aspect. We introduce MetaMetrics, a calibrated meta-metric designed to evaluate generation tasks across different modalities in a supervised manner. MetaMetrics optimizes the combination of existing metrics to enhance their alignment with human preferences. Our metric demonstrates flexibility and effectiveness in both language and vision downstream tasks, showing significant benefits across various multilingual and multi-domain scenarios. MetaMetrics aligns closely with human preferences and is highly extendable and easily integrable into any application. This makes MetaMetrics a powerful tool for improving the evaluation of generation tasks, ensuring that metrics are more representative of human judgment across diverse contexts.
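One way to picture the calibration idea is as learning non-negative weights over existing metrics so that their weighted sum best fits human preference scores. The least-squares objective and projected gradient descent below are illustrative assumptions; the paper's supervised calibration procedure is not specified here.

```python
import numpy as np

def calibrate_meta_metric(metric_scores, human_scores, steps=2000, lr=0.05):
    """Fit non-negative weights w so that metric_scores @ w approximates
    human_scores (least squares, projected gradient descent).

    metric_scores: (n_samples, K) scores from K existing metrics.
    human_scores:  (n_samples,) human preference judgments.
    Returns weights normalized to sum to 1."""
    K = metric_scores.shape[1]
    w = np.full(K, 1.0 / K)
    for _ in range(steps):
        pred = metric_scores @ w
        grad = metric_scores.T @ (pred - human_scores) / len(human_scores)
        w = np.clip(w - lr * grad, 0.0, None)   # keep weights non-negative
    return w / max(w.sum(), 1e-12)
```

With this, a metric that tracks human judgment closely should receive most of the weight.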

[AI-64] Towards Comprehensive Detection of Chinese Harmful Memes

链接: https://arxiv.org/abs/2410.02378
作者: Junyu Lu,Bo Xu,Xiaokun Zhang,Hongbo Wang,Haohao Zhu,Dongyu Zhang,Liang Yang,Hongfei Lin
关键词-EN: Chinese harmful memes, Chinese harmful, detecting Chinese harmful, Harmful memes, Chinese
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper has been accepted in the NeurIPS 2024 D&B Track. Harmful memes have proliferated on the Chinese Internet, while research on detecting Chinese harmful memes significantly lags behind due to the absence of reliable datasets and effective detectors. To this end, we focus on the comprehensive detection of Chinese harmful memes. We construct ToxiCN MM, the first Chinese harmful meme dataset, which consists of 12,000 samples with fine-grained annotations for various meme types. Additionally, we propose a baseline detector, Multimodal Knowledge Enhancement (MKE), incorporating contextual information of meme content generated by the LLM to enhance the understanding of Chinese memes. During the evaluation phase, we conduct extensive quantitative experiments and qualitative analyses on multiple baselines, including LLMs and our MKE. The experimental results indicate that detecting Chinese harmful memes is challenging for existing models while demonstrating the effectiveness of MKE. The resources for this paper are available at this https URL.

[AI-65] From Concrete to Abstract: A Multimodal Generative Approach to Abstract Concept Learning

链接: https://arxiv.org/abs/2410.02365
作者: Haodong Xie,Rahul Singh Maharjan,Federico Tavella,Angelo Cangelosi
关键词-EN: human intelligence, fundamental to human, Understanding and manipulating, concepts, high order abstract
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Understanding and manipulating concrete and abstract concepts is fundamental to human intelligence. Yet, they remain challenging for artificial agents. This paper introduces a multimodal generative approach to high-order abstract concept learning, which integrates visual and categorical linguistic information from concrete concepts. Our model initially grounds subordinate-level concrete concepts, combines them to form basic-level concepts, and finally abstracts to superordinate-level concepts via the grounding of basic-level concepts. We evaluate the model's language learning ability through language-to-visual and visual-to-language tests with high-order abstract concepts. Experimental results demonstrate the proficiency of the model in both language understanding and language naming tasks.

[AI-66] A Comprehensive Survey of Mamba Architectures for Medical Image Analysis: Classification, Segmentation, Restoration and Beyond

链接: https://arxiv.org/abs/2410.02362
作者: Shubhi Bansal,Sreeharish A,Madhava Prasath J,Manikandan S,Sreekanth Madisetty,Mohammad Zia Ur Rehman,Chandravardhan Singh Raghaw,Gaurav Duggal,Nagendra Kumar
关键词-EN: State Space Model, State Space, template-based deep learning, deep learning approaches, Mamba
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Mamba, a special case of the State Space Model, is gaining popularity as an alternative to template-based deep learning approaches in medical image analysis. While transformers are powerful architectures, they have drawbacks, including quadratic computational complexity and an inability to address long-range dependencies efficiently. This limitation affects the analysis of large and complex datasets in medical imaging, where there are many spatial and temporal relationships. In contrast, Mamba offers benefits that make it well-suited for medical image analysis. It has linear time complexity, which is a significant improvement over transformers. Mamba processes longer sequences without attention mechanisms, enabling faster inference and requiring less memory. Mamba also demonstrates strong performance in merging multimodal data, improving diagnosis accuracy and patient outcomes. The organization of this paper allows readers to appreciate the capabilities of Mamba in medical imaging step by step. We begin by defining core concepts of SSMs and models, including S4, S5, and S6, followed by an exploration of Mamba architectures such as pure Mamba, U-Net variants, and hybrid models with convolutional neural networks, transformers, and Graph Neural Networks. We also cover Mamba optimizations, techniques and adaptations, scanning, datasets, applications, experimental results, and conclude with its challenges and future directions in medical imaging. This review aims to demonstrate the transformative potential of Mamba in overcoming existing barriers within medical imaging while paving the way for innovative advancements in the field. A comprehensive list of Mamba architectures applied in the medical field, reviewed in this work, is available at Github.

[AI-67] AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models

链接: https://arxiv.org/abs/2410.02355
作者: Junfeng Fang,Houcheng Jiang,Kun Wang,Yunshan Ma,Xiang Wang,Xiangnan He,Tat-seng Chua
关键词-EN: Large language models, exhibit hallucinations due, Large language, exhibit hallucinations, hallucinations due
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) often exhibit hallucinations due to incorrect or outdated knowledge. Hence, model editing methods have emerged to enable targeted knowledge updates. To achieve this, a prevailing paradigm is the locating-then-editing approach, which first locates influential parameters and then edits them by introducing a perturbation. While effective, current studies have demonstrated that this perturbation inevitably disrupts the originally preserved knowledge within LLMs, especially in sequential editing scenarios. To address this, we introduce AlphaEdit, a novel solution that projects the perturbation onto the null space of the preserved knowledge before applying it to the parameters. We theoretically prove that this projection ensures the output of post-edited LLMs remains unchanged when queried about the preserved knowledge, thereby mitigating the issue of disruption. Extensive experiments on various LLMs, including LLaMA3, GPT2-XL, and GPT-J, show that AlphaEdit boosts the performance of most locating-then-editing methods by an average of 36.4%, with just a single additional line of code for the projection. Our code is available at: this https URL.
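The null-space projection at the heart of AlphaEdit can be sketched directly in NumPy. Here `K` is a hypothetical matrix whose columns are keys of the knowledge to preserve; projecting the perturbation onto the orthogonal complement of the column space of K guarantees the edit leaves outputs on those keys unchanged.

```python
import numpy as np

def nullspace_project(delta, K, tol=1e-10):
    """Project an edit matrix delta (d_out x d_in) so that
    (W + delta_p) @ K == W @ K for any W, i.e. the edit has no effect
    on the preserved keys K (d_in x n_keys). Illustrative sketch."""
    U, s, _ = np.linalg.svd(K, full_matrices=True)
    rank = int((s > tol).sum())
    N = U[:, rank:]               # orthonormal basis orthogonal to col(K)
    return delta @ (N @ N.T)      # project rows of delta off col(K)

rng = np.random.default_rng(0)
K = rng.normal(size=(8, 3))       # hypothetical keys of knowledge to preserve
delta = rng.normal(size=(4, 8))   # raw locating-then-editing perturbation
delta_p = nullspace_project(delta, K)
```

By construction `delta_p @ K` is (numerically) zero, so queries hitting the preserved keys are unaffected by the edit.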

[AI-68] How Much Can RAG Help the Reasoning of LLM?

链接: https://arxiv.org/abs/2410.02338
作者: Jingyu Liu,Jiaen Lin,Yong Liu
关键词-EN: Large Language Models, modern Large Language, Language Models, Large Language, gained significant popularity
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has gained significant popularity in modern Large Language Models (LLMs) due to its effectiveness in introducing new knowledge and reducing hallucinations. However, a deep understanding of RAG remains limited: how does RAG help the reasoning process, and can RAG improve reasoning capability? While external documents are typically considered a way to incorporate domain-specific information, they also contain intermediate reasoning results related to the query; this suggests that documents could enhance the reasoning capability of LLMs, which has not been previously explored. In this paper, we investigate this issue in depth and find that while RAG can assist with reasoning, the help is limited. If we conceptualize the reasoning process as a tree with fixed depth, then RAG struggles to assist LLMs in performing deeper reasoning. Additionally, the information in the documents requires preprocessing to filter out noise. We demonstrate that this preprocessing is difficult to achieve by simply fine-tuning the LLM; it often necessitates numerous additional transformer layers to solve the problem. To simplify the problem, we propose DPrompt tuning, which effectively resolves the issue within a limited number of transformer layers, leading to improved performance.

[AI-69] Post-edits Are Preferences Too

链接: https://arxiv.org/abs/2410.02320
作者: Nathaniel Berger,Stefan Riezler,Miriam Exel,Matthias Huck
关键词-EN: Preference Optimization, machine translation, art techniques, Optimization, machine
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: To appear at the Ninth Conference on Machine Translation (WMT24)

点击查看摘要

Abstract:Preference Optimization (PO) techniques are currently among the state-of-the-art techniques for fine-tuning large language models (LLMs) on pairwise preference feedback from human annotators. However, in machine translation, this sort of feedback can be difficult to solicit. Additionally, Kreutzer et al. (2018) have shown that, for machine translation, pairwise preferences are less reliable than other forms of human feedback, such as 5-point ratings. We examine post-edits to see if they can be a source of reliable human preferences by construction. In PO, a human annotator is shown sequences s_1 and s_2 and asked for a preference judgment, s_1 > s_2; while for post-editing, editors create s_1 and know that it should be better than s_2. We attempt to use these implicit preferences for PO and show that it helps the model move towards post-edit-like hypotheses and away from machine-translation-like hypotheses. Furthermore, we show that the best results are obtained by pre-training the model with supervised fine-tuning (SFT) on post-edits in order to promote post-edit-like hypotheses to the top output ranks.
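The "preferences by construction" idea reduces to a simple data-preparation step: the post-edit is the chosen sequence, the raw MT hypothesis it corrects is the rejected one. The record field names below are illustrative assumptions.

```python
def preference_pairs_from_postedits(records):
    """Build (chosen, rejected) preference pairs by construction: the
    editor's post-edit is known to be at least as good as the machine
    translation it corrects. Unchanged outputs carry no preference signal
    and are skipped. Field names are hypothetical."""
    pairs = []
    for r in records:
        if r["post_edit"].strip() != r["mt_output"].strip():
            pairs.append({"prompt": r["source"],
                          "chosen": r["post_edit"],     # s_1: the post-edit
                          "rejected": r["mt_output"]})  # s_2: the raw MT
    return pairs
```

The resulting pairs could feed any standard PO recipe (e.g. DPO) without soliciting explicit pairwise judgments.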

[AI-70] CTARR: A fast and robust method for identifying anatomical regions on CT images via atlas registration

链接: https://arxiv.org/abs/2410.02316
作者: Thomas Buddenkotte,Roland Opfer,Julia Krüger,Alessa Hering,Mireia Crispin-Ortuzar
关键词-EN: image analysis, Medical image analysis, image analysis tasks, patient body, image
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Medical image analysis tasks often focus on regions or structures located in a particular location within the patient's body. Often, large parts of the image may not be of interest for the image analysis task. When using deep-learning based approaches, this unnecessarily increases the computational burden during inference and raises the chance of errors. In this paper, we introduce CTARR, a novel generic method for CT Anatomical Region Recognition. The method serves as a pre-processing step for any deep learning-based CT image analysis pipeline by automatically identifying the pre-defined anatomical region that is relevant for the follow-up task and removing the rest. It can be used in (i) image segmentation to prevent false positives in anatomically implausible regions and to speed up inference, (ii) image classification to produce image crops that are consistent in their anatomical context, and (iii) image registration by serving as a fast pre-registration step. Our proposed method is based on atlas registration and provides a fast and robust way to crop any anatomical region encoded as one or multiple bounding box(es) from any unlabeled CT scan of the brain, chest, abdomen and/or pelvis. We demonstrate the utility and robustness of the proposed method in the context of medical image segmentation by evaluating it on six datasets of public segmentation challenges. The foreground voxels in the regions of interest are preserved in the vast majority of cases and tasks (97.45-100%) while taking only fractions of a second to compute (0.1-0.21s) on a deep learning workstation and greatly reducing the segmentation runtime (2.0-12.7x). Our code is available at this https URL.

[AI-71] Morphological evaluation of subwords vocabulary used by BETO language model

链接: https://arxiv.org/abs/2410.02283
作者: Óscar García-Sierra,Ana Fernández-Pampillón Cesteros,Miguel Ortega-Martín
关键词-EN: Subword tokenization algorithms, morphological quality, human intervention, significantly more efficient, independently build
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: in Spanish language

点击查看摘要

Abstract:Subword tokenization algorithms used by Large Language Models are significantly more efficient and can independently build the necessary vocabulary of words and subwords without human intervention. However, those subwords do not always align with real morphemes, potentially impacting the models’ performance, though it remains uncertain when this might occur. In previous research, we proposed a method to assess the morphological quality of vocabularies, focusing on the overlap between these vocabularies and the morphemes of a given language. Our evaluation method was built on three quality measures, relevance, cohesion, and morphological accuracy, and a procedure for their assessment. By applying this method to vocabularies created by three subword tokenization algorithms, BPE, Wordpiece, and Unigram, we concluded that these vocabularies generally exhibit very low morphological quality. In this article, we apply this evaluation to the tokenizer of BETO, a BERT language model trained on large Spanish corpora. This evaluation, along with our previous results, helped us conclude that its vocabulary has a low morphological quality, and we also found that training the tokenizer in a larger corpus does not improve the morphological quality of the generated vocabulary. Additionally, this evaluation helps clarify the algorithm used by the tokenizer, that is, Wordpiece, given the inconsistencies between the authors’ claims and the model’s configuration.
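In the spirit of the overlap-based evaluation described above, a minimal precision/recall-style check of how well a subword vocabulary aligns with a morpheme inventory can be sketched as follows. This is an illustrative simplification; the paper's actual measures (relevance, cohesion, and morphological accuracy) are more involved.

```python
def morphological_quality(vocab, morphemes):
    """Overlap-style sketch: precision = fraction of vocabulary entries
    that are real morphemes; recall = fraction of morphemes covered by
    the vocabulary. Both in [0, 1]; higher is better."""
    vocab, morphemes = set(vocab), set(morphemes)
    hit = vocab & morphemes
    precision = len(hit) / len(vocab)
    recall = len(hit) / len(morphemes)
    return precision, recall
```

A low precision here would correspond to the paper's finding that many learned subwords are not real morphemes.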

[AI-72] CoLLAP: Contrastive Long-form Language-Audio Pretraining with Musical Temporal Structure Augmentation

链接: https://arxiv.org/abs/2410.02271
作者: Junda Wu,Warren Li,Zachary Novack,Amit Namburi,Carol Chen,Julian McAuley
关键词-EN: Modeling temporal characteristics, temporal characteristics plays, Modeling temporal, characteristics plays, plays a significant
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: 4 pages

点击查看摘要

Abstract:Modeling temporal characteristics plays a significant role in the representation learning of audio waveform. We propose Contrastive Long-form Language-Audio Pretraining (CoLLAP) to significantly extend the perception window for both the input audio (up to 5 minutes) and the language descriptions (exceeding 250 words), while enabling contrastive learning across modalities and temporal dynamics. Leveraging recent Music-LLMs to generate long-form music captions for full-length songs, augmented with musical temporal structures, we collect 51.3K audio-text pairs derived from the large-scale AudioSet training dataset, where the average audio length reaches 288 seconds. We propose a novel contrastive learning architecture that fuses language representations with structured audio representations by segmenting each song into clips and extracting their embeddings. With an attention mechanism, we capture multimodal temporal correlations, allowing the model to automatically weigh and enhance the final fusion score for improved contrastive alignment. Finally, we develop two variants of the CoLLAP model with different types of backbone language models. Through comprehensive experiments on multiple long-form music-text retrieval datasets, we demonstrate consistent performance improvement in retrieval accuracy compared with baselines. We also show the pretrained CoLLAP models can be transferred to various music information retrieval tasks, with heterogeneous long-form multimodal contexts.

[AI-73] Structural-Entropy-Based Sample Selection for Efficient and Effective Learning ICLR2025

链接: https://arxiv.org/abs/2410.02268
作者: Tianchi Xie,Jiangning Zhu,Guozu Ma,Minzhi Lin,Wei Chen,Weikai Yang,Shixia Liu
关键词-EN: machine learning models, improves the efficiency, models by providing, samples, Sample selection improves
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: Submitted to ICLR 2025

点击查看摘要

Abstract:Sample selection improves the efficiency and effectiveness of machine learning models by providing informative and representative samples. Typically, samples can be modeled as a sample graph, where nodes are samples and edges represent their similarities. Most existing methods are based on local information, such as the training difficulty of samples, thereby overlooking global information, such as connectivity patterns. This oversight can result in suboptimal selection because global information is crucial for ensuring that the selected samples well represent the structural properties of the graph. To address this issue, we employ structural entropy to quantify global information and losslessly decompose it from the whole graph to individual nodes using the Shapley value. Based on the decomposition, we present Structural-Entropy-based sample Selection (SES), a method that integrates both global and local information to select informative and representative samples. SES begins by constructing a kNN-graph among samples based on their similarities. It then measures sample importance by combining structural entropy (global metric) with training difficulty (local metric). Finally, SES applies importance-biased blue noise sampling to select a set of diverse and representative samples. Comprehensive experiments on three learning scenarios – supervised learning, active learning, and continual learning – clearly demonstrate the effectiveness of our method.
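A rough sketch of the selection pipeline follows, with a per-node degree-entropy share standing in for the Shapley-decomposed structural entropy and plain top-k selection standing in for importance-biased blue noise sampling (both simplifications made here for brevity).

```python
import numpy as np

def select_samples(X, difficulty, k=3, n_select=5):
    """SES-style sketch: build a kNN similarity graph, score each node by
    (global) its share of a degree-based entropy times (local) its training
    difficulty, then pick the top scorers.

    X: (n, d) feature matrix; difficulty: (n,) local training difficulty."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise sq. dist.
    np.fill_diagonal(d2, np.inf)
    sim = np.exp(-d2)                                     # Gaussian similarity
    idx = np.argsort(d2, axis=1)[:, :k]                   # k nearest neighbors
    A = np.zeros_like(sim)
    rows = np.arange(len(X))[:, None]
    A[rows, idx] = sim[rows, idx]
    A = np.maximum(A, A.T)                                # symmetrize graph
    deg = A.sum(1)
    p = deg / deg.sum()
    global_score = -p * np.log(p + 1e-12)                 # per-node entropy share
    score = global_score * difficulty                     # global x local
    return np.argsort(-score)[:n_select]
```

The real method additionally enforces diversity among the selected set, which plain top-k does not.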

[AI-74] End-to-end Driving in High-Interaction Traffic Scenarios with Reinforcement Learning

链接: https://arxiv.org/abs/2410.02253
作者: Yueyuan Li,Mingyang Jiang,Songan Zhang,Wei Yuan,Chunxiang Wang,Ming Yang
关键词-EN: autonomous driving systems, pose significant challenges, scenarios pose significant, pose significant, interactive traffic scenarios
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 10 pages, 3 figures, experiment under progress, only to demonstrate the originality of the method

点击查看摘要

Abstract:Dynamic and interactive traffic scenarios pose significant challenges for autonomous driving systems. Reinforcement learning (RL) offers a promising approach by enabling the exploration of driving policies beyond the constraints of pre-collected datasets and predefined conditions, particularly in complex environments. However, a critical challenge lies in effectively extracting spatial and temporal features from sequences of high-dimensional, multi-modal observations while minimizing the accumulation of errors over time. Additionally, efficiently guiding large-scale RL models to converge on optimal driving policies without frequent failures during the training process remains tricky. We propose an end-to-end model-based RL algorithm named Ramble to address these issues. Ramble processes multi-view RGB images and LiDAR point clouds into low-dimensional latent features to capture the context of traffic scenarios at each time step. A transformer-based architecture is then employed to model temporal dependencies and predict future states. By learning a dynamics model of the environment, Ramble can foresee upcoming traffic events and make more informed, strategic decisions. Our implementation demonstrates that prior experience in feature extraction and decision-making plays a pivotal role in accelerating the convergence of RL models toward optimal driving policies. Ramble achieves state-of-the-art performance regarding route completion rate and driving score on the CARLA Leaderboard 2.0, showcasing its effectiveness in managing complex and dynamic traffic situations.

[AI-75] PFGuard: A Generative Framework with Privacy and Fairness Safeguards

链接: https://arxiv.org/abs/2410.02246
作者: Soyeon Kim,Yuji Roh,Geon Heo,Steven Euijong Whang
关键词-EN: privacy, fairness, Trustworthy, fairness for Trustworthy, Abstract
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Generative models must ensure both privacy and fairness for Trustworthy AI. While these goals have been pursued separately, recent studies propose to combine existing privacy and fairness techniques to achieve both goals. However, naively combining these techniques can be insufficient due to privacy-fairness conflicts, where a sample in a minority group may be amplified for fairness, only to be suppressed for privacy. We demonstrate how these conflicts lead to adverse effects, such as privacy violations and unexpected fairness-utility tradeoffs. To mitigate these risks, we propose PFGuard, a generative framework with privacy and fairness safeguards, which simultaneously addresses privacy, fairness, and utility. By using an ensemble of multiple teacher models, PFGuard balances privacy-fairness conflicts between fair and private training stages and achieves high utility based on ensemble learning. Extensive experiments show that PFGuard successfully generates synthetic data on high-dimensional data while providing both fairness convergence and strict DP guarantees - the first of its kind to our knowledge.
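The ensemble-of-teachers step can be pictured PATE-style: tally the votes of independently (and fairly) trained teacher models and add noise to the histogram before releasing the winning label. The Gaussian noise and binary-label setting here are illustrative assumptions, not PFGuard's exact mechanism.

```python
import numpy as np

def noisy_teacher_vote(teacher_preds, sigma=1.0, rng=None):
    """Aggregate binary votes from multiple teacher models and add Gaussian
    noise to the vote histogram before releasing the label, trading a little
    accuracy for a differential-privacy guarantee. Illustrative sketch."""
    if rng is None:
        rng = np.random.default_rng(0)
    votes = np.bincount(np.asarray(teacher_preds), minlength=2).astype(float)
    votes += rng.normal(scale=sigma, size=votes.shape)  # DP noise on counts
    return int(np.argmax(votes))
```

In a PFGuard-like pipeline, the student generator would only ever see these noised labels, never the raw teacher outputs.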

[AI-76] Robust Weight Initialization for Tanh Neural Networks with Fixed Point Analysis

链接: https://arxiv.org/abs/2410.02242
作者: Hyunwoo Lee,Hayoung Choi,Hyunju Kim
关键词-EN: strong generalization performance, achieve strong generalization, network depth increases, neural network depth, neural networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As a neural network's depth increases, it can achieve strong generalization performance. Training, however, becomes challenging due to gradient issues. Theoretical research and various methods have been introduced to address these issues. However, weight initialization methods that can be effectively applied to tanh neural networks of varying sizes remain to be developed. This paper presents a novel weight initialization method for Feedforward Neural Networks with the tanh activation function. Based on an analysis of the fixed points of the function \tanh(ax), our proposed method aims to determine values of a that prevent the saturation of activations. A series of experiments on various classification datasets demonstrate that the proposed method is more robust to network size variations than the existing method. Furthermore, when applied to Physics-Informed Neural Networks, the method exhibits faster convergence and robustness to variations of the network size compared to Xavier initialization in problems of Partial Differential Equations.
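The fixed-point behavior the method builds on is easy to demonstrate numerically: iterating x ← tanh(ax) converges to 0 when a ≤ 1, but for a > 1 two stable nonzero fixed points appear, which corresponds to activations drifting toward saturation across layers, the regime the initialization aims to avoid.

```python
import numpy as np

def tanh_fixed_points(a, x0=0.5, n_iter=200):
    """Iterate x <- tanh(a*x) from a nonzero start. For a <= 1 the only
    fixed point is 0 and the iteration contracts to it; for a > 1 the
    iteration settles at a stable nonzero fixed point (saturation)."""
    x = x0
    for _ in range(n_iter):
        x = np.tanh(a * x)
    return float(x)
```

For example, `tanh_fixed_points(0.9)` decays toward 0 while `tanh_fixed_points(1.5)` settles near 0.86.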

[AI-77] SCA: Highly Efficient Semantic-Consistent Unrestricted Adversarial Attack

链接: https://arxiv.org/abs/2410.02240
作者: Zihao Pan,Weibin Wu,Yuhang Cao,Zibin Zheng
关键词-EN: Unrestricted adversarial attacks, attacks typically manipulate, adversarial attacks typically, color or texture, Unrestricted adversarial
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Unrestricted adversarial attacks typically manipulate the semantic content of an image (e.g., color or texture) to create adversarial examples that are both effective and photorealistic. Recent works have utilized the diffusion inversion process to map images into a latent space, where high-level semantics are manipulated by introducing perturbations. However, they often result in substantial semantic distortions in the denoised output and suffer from low efficiency. In this study, we propose a novel framework called Semantic-Consistent Unrestricted Adversarial Attacks (SCA), which employs an inversion method to extract edit-friendly noise maps and utilizes a Multimodal Large Language Model (MLLM) to provide semantic guidance throughout the process. Under the condition of rich semantic information provided by the MLLM, we perform the DDPM denoising process of each step using a series of edit-friendly noise maps, and leverage DPM Solver++ to accelerate this process, enabling efficient sampling with semantic consistency. Compared to existing methods, our framework enables the efficient generation of adversarial examples that exhibit minimal discernible semantic changes. Consequently, we for the first time introduce Semantic-Consistent Adversarial Examples (SCAE). Extensive experiments and visualizations have demonstrated the high efficiency of SCA, particularly in being on average 12 times faster than the state-of-the-art attacks. Our code can be found at this https URL.

[AI-78] SEAL: SEmantic-Augmented Imitation Learning via Language Model

链接: https://arxiv.org/abs/2410.02231
作者: Chengyang Gu,Yuxin Pan,Haotian Bai,Hui Xiong,Yize Chen
关键词-EN: Hierarchical Imitation Learning, Hierarchical Imitation, Imitation Learning, tackling long-horizon decision-making, Large Language Models
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 18 pages, 5 figures, in submission

点击查看摘要

Abstract:Hierarchical Imitation Learning (HIL) is a promising approach for tackling long-horizon decision-making tasks. However, it is challenging due to the lack of detailed supervisory labels for sub-goal learning and its reliance on hundreds to thousands of expert demonstrations. In this work, we introduce SEAL, a novel framework that leverages Large Language Models' (LLMs) powerful semantic and world knowledge both for specifying the sub-goal space and for pre-labeling states with semantically meaningful sub-goal representations, without prior knowledge of task hierarchies. SEAL employs a dual-encoder structure, combining supervised LLM-guided sub-goal learning with unsupervised Vector Quantization (VQ) for more robust sub-goal representations. Additionally, SEAL incorporates a transition-augmented low-level planner for improved adaptation to sub-goal transitions. Our experiments demonstrate that SEAL outperforms state-of-the-art HIL methods and LLM-based planning approaches, particularly in settings with small expert datasets and complex long-horizon tasks.

[AI-79] CodePMP: Scalable Preference Model Pretraining for Large Language Model Reasoning

链接: https://arxiv.org/abs/2410.02229
作者: Huimu Yu,Xing Wu,Weidong Yin,Debing Zhang,Songlin Hu
关键词-EN: natural language understanding, made significant progress, Large language models, understanding and generation, natural language
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: work in progress

点击查看摘要

Abstract:Large language models (LLMs) have made significant progress in natural language understanding and generation, driven by scalable pretraining and advanced finetuning. However, enhancing reasoning abilities in LLMs, particularly via reinforcement learning from human feedback (RLHF), remains challenging due to the scarcity of high-quality preference data, which is labor-intensive to annotate and crucial for reward model (RM) finetuning. To alleviate this issue, we introduce CodePMP, a scalable preference model pretraining (PMP) pipeline that utilizes a large corpus of synthesized code-preference pairs from publicly available high-quality source code. CodePMP improves RM finetuning efficiency by pretraining preference models on large-scale synthesized code-preference pairs. We evaluate CodePMP on mathematical reasoning tasks (GSM8K, MATH) and logical reasoning tasks (ReClor, LogiQA2.0), consistently showing significant improvements in reasoning performance of LLMs and highlighting the importance of scalable preference model pretraining for efficient reward modeling.
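
Preference-model pretraining over (chosen, rejected) code pairs is typically driven by a Bradley-Terry style pairwise ranking loss. The sketch below shows that loss on toy scalar scores; it is our own simplification for illustration, not CodePMP's training code.

```python
import math

def pairwise_pmp_loss(score_chosen, score_rejected):
    """-log sigmoid(s_chosen - s_rejected): small when the preference
    model already ranks the chosen sample above the rejected one."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(round(pairwise_pmp_loss(2.0, 0.0), 4))  # correct ranking -> low loss
print(round(pairwise_pmp_loss(0.0, 2.0), 4))  # inverted ranking -> high loss
```

Scaling this objective over large synthesized code-preference corpora is the "PMP" stage; the resulting model is then finetuned as a reward model on the downstream preference data.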

[AI-80] EmbedLLM: Learning Compact Representations of Large Language Models

链接: https://arxiv.org/abs/2410.02223
作者: Richard Zhuang,Tianhao Wu,Zhaojin Wen,Andrew Li,Jiantao Jiao,Kannan Ramchandran
关键词-EN: Huggingface today, Large Language Models, efficiently evaluating, increasingly critical, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With hundreds of thousands of language models available on Huggingface today, efficiently evaluating and utilizing these models across various downstream tasks has become increasingly critical. Many existing methods repeatedly learn task-specific representations of Large Language Models (LLMs), which leads to inefficiencies in both time and computational resources. To address this, we propose EmbedLLM, a framework designed to learn compact vector representations of LLMs that facilitate downstream applications involving many models, such as model routing. We introduce an encoder-decoder approach for learning such embeddings, along with a systematic framework to evaluate their effectiveness. Empirical results show that EmbedLLM outperforms prior methods in model routing in both accuracy and latency. Additionally, we demonstrate that our method can forecast a model’s performance on multiple benchmarks without incurring additional inference cost. Extensive probing experiments validate that the learned embeddings capture key model characteristics, e.g. whether the model is specialized for coding tasks, even without being explicitly trained on them. We open-source our dataset, code and embedder to facilitate further research and application.
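
Once each model has a compact embedding, routing can be as simple as scoring a query representation against every model vector. The sketch below is a hypothetical illustration of that final routing step (model names, vectors, and the dot-product scorer are our own); EmbedLLM learns the embeddings themselves with an encoder-decoder.

```python
def route(query_vec, model_embs):
    """Pick the model whose learned embedding scores highest against
    the query representation (toy dot-product router)."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return max(model_embs, key=lambda name: dot(query_vec, model_embs[name]))

# Hypothetical 2-d embeddings: dim 0 ~ "coding skill", dim 1 ~ "chat skill".
embs = {"coder-7b": [0.9, 0.1], "chat-7b": [0.1, 0.9]}
print(route([1.0, 0.0], embs))  # a code-flavored query -> coder-7b
```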

[AI-81] Buckle Up: Robustifying LLMs at Every Customization Stage via Data Curation

链接: https://arxiv.org/abs/2410.02220
作者: Xiaoqun Liu,Jiacheng Liang,Luoxi Tang,Chenyu You,Muchao Ye,Zhaohan Xi
关键词-EN: Large language models, integrating domain-specific expertise, Large language, domain-specific expertise, extensively adapted
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) are extensively adapted for downstream applications through a process known as “customization,” with fine-tuning being a common method for integrating domain-specific expertise. However, recent studies have revealed a vulnerability: tuning LLMs with malicious samples can compromise their robustness and amplify harmful content, an attack known as “jailbreaking.” To mitigate such attacks, we propose an effective defensive framework utilizing data curation to revise commonsense texts and enhance their safety implications from the perspective of LLMs. The curated texts can mitigate jailbreaking attacks at every stage of the customization process: before customization to immunize LLMs against future jailbreak attempts, during customization to neutralize jailbreaking risks, or after customization to restore the compromised models. Since the curated data strengthens LLMs through the standard fine-tuning workflow, we do not introduce additional modules during LLM inference, thereby preserving the original customization process. Experimental results demonstrate a substantial reduction in jailbreaking effects, with up to 100% success in generating responsible responses. Notably, our method is effective even with commonsense texts, which are often more readily available than safety-relevant data. With the every-stage defensive framework and supporting experimental performance, this work represents a significant advancement in mitigating jailbreaking risks and ensuring the secure customization of LLMs.

[AI-82] Multi-modal clothing recommendation model based on large model and VAE enhancement

链接: https://arxiv.org/abs/2410.02219
作者: Bingjie Huang,Qingyu Lu,Shuaishuai Huang,Xue-she Wang,Haowei Yang
关键词-EN: requiring in-depth research, Accurately recommending products, subject requiring in-depth, Accurately recommending, in-depth research
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Accurately recommending products has long been a subject requiring in-depth research. This study proposes a multimodal paradigm for clothing recommendations. Specifically, it designs a multimodal analysis method that integrates clothing description texts and images, utilizing a pre-trained large language model to deeply explore the hidden meanings of users and products. Additionally, a variational autoencoder (VAE) is employed to learn the relationship between user information and products to address the cold start problem in recommendation systems. This study also validates the significant performance advantages of this method over various recommendation system methods through extensive ablation experiments, providing crucial practical guidance for the comprehensive optimization of recommendation systems.

[AI-83] Adapting Segment Anything Model to Melanoma Segmentation in Microscopy Slide Images

链接: https://arxiv.org/abs/2410.02207
作者: Qingyuan Liu,Avideh Zakhor
关键词-EN: crucial prognostic factors, Breslow depth, Slide Images, invasive tumor size, primary invasive tumor
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Melanoma segmentation in Whole Slide Images (WSIs) is useful for prognosis and the measurement of crucial prognostic factors such as Breslow depth and primary invasive tumor size. In this paper, we present a novel approach that uses the Segment Anything Model (SAM) for automatic melanoma segmentation in microscopy slide images. Our method employs an initial semantic segmentation model to generate preliminary segmentation masks that are then used to prompt SAM. We design a dynamic prompting strategy that uses a combination of centroid and grid prompts to achieve optimal coverage of the super high-resolution slide images while maintaining the quality of generated prompts. To optimize for invasive melanoma segmentation, we further refine the prompt generation process by implementing in-situ melanoma detection and low-confidence region filtering. We select Segformer as the initial segmentation model and EfficientSAM as the segment anything model for parameter-efficient fine-tuning. Our experimental results demonstrate that this approach not only surpasses other state-of-the-art melanoma segmentation methods but also significantly outperforms the baseline Segformer by 9.1% in terms of IoU.
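
The centroid-plus-grid prompting strategy can be sketched on a toy binary mask: take the centroid of the preliminary segmentation as one point prompt, then add a coarse grid of in-mask points for coverage. Coordinates, the grid step, and the function name below are illustrative assumptions, not the paper's code.

```python
def make_prompts(mask, grid_step=2):
    """From a preliminary binary mask (list of rows), emit SAM point
    prompts: the mask centroid first, then a coarse grid of in-mask
    points (simplified sketch of the dynamic prompting idea)."""
    pts = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pts:
        return []
    centroid = (round(sum(r for r, _ in pts) / len(pts)),
                round(sum(c for _, c in pts) / len(pts)))
    grid = [(r, c) for r, c in pts if r % grid_step == 0 and c % grid_step == 0]
    return [centroid] + grid

mask = [[0, 1, 1],
        [0, 1, 1],
        [0, 1, 1]]
print(make_prompts(mask))  # centroid first, then grid points inside the mask
```

The paper additionally filters low-confidence regions and restricts prompts to detected invasive melanoma; those refinements are omitted here.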

[AI-84] Measuring Evaluating and Improving Logical Consistency in Large Language Models

链接: https://arxiv.org/abs/2410.02205
作者: Yinhong Liu,Zhijiang Guo,Tianya Liang,Ehsan Shareghi,Ivan Vulić,Nigel Collier
关键词-EN: Large Language Models, Language Models, Large Language, shown promising progress, promising progress related
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注:

点击查看摘要

Abstract:Recent research in Large Language Models (LLMs) has shown promising progress related to LLM alignment with human preferences. LLM-empowered decision-making systems are expected to be predictable, reliable and trustworthy, which implies being free from paradoxes or contradictions that could undermine their credibility and validity. However, LLMs still exhibit inconsistent and biased behaviour when making decisions or judgements. In this work, we focus on studying logical consistency of LLMs as a prerequisite for more reliable and trustworthy systems. Logical consistency ensures that decisions are based on a stable and coherent understanding of the problem, reducing the risk of erratic or contradictory outputs. We first propose a universal framework to quantify the logical consistency via three fundamental proxies: transitivity, commutativity and negation invariance. We then evaluate logical consistency, using the defined measures, of a wide range of LLMs, demonstrating that it can serve as a strong proxy for overall robustness. Additionally, we introduce a data refinement and augmentation technique that enhances the logical consistency of LLMs without sacrificing alignment to human preferences. It augments noisy and sparse pairwise-comparison annotations by estimating partially or totally ordered preference rankings using rank aggregation methods. Finally, we show that logical consistency impacts the performance of LLM-based logic-dependent algorithms, where LLMs serve as logical operators.
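
The transitivity proxy can be made concrete as a count over ordered triples of pairwise judgements. The checker below is our own toy formulation of such a proxy (not the paper's exact measure); note that a fully cyclic judgement scores 0.

```python
from itertools import permutations

def transitivity_score(prefers, items):
    """Fraction of ordered triples (a, b, c) with a>b and b>c that also
    satisfy a>c.  prefers[(x, y)] is True when the model judged x
    better than y (toy transitivity proxy)."""
    ok, total = 0, 0
    for a, b, c in permutations(items, 3):
        if prefers[(a, b)] and prefers[(b, c)]:
            total += 1
            ok += prefers[(a, c)]
    return ok / total if total else 1.0

# A fully cyclic judgement (a>b, b>c, c>a) is maximally intransitive:
cycle = {("a", "b"): True, ("b", "a"): False,
         ("b", "c"): True, ("c", "b"): False,
         ("c", "a"): True, ("a", "c"): False}
print(transitivity_score(cycle, ["a", "b", "c"]))  # -> 0.0
```

Commutativity and negation invariance can be scored analogously, by comparing judgements under swapped answer order or negated statements.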

[AI-85] GraphIC: A Graph-Based In-Context Example Retrieval Model for Multi-Step Reasoning

链接: https://arxiv.org/abs/2410.02203
作者: Jiale Fu,Yaqing Wang,Simeng Han,Jiaming Fan,Chen Si,Xu Yang
关键词-EN: enables large language, In-context learning, large language models, reasoning, enables large
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In-context learning (ICL) enables large language models (LLMs) to generalize to new tasks by incorporating a few in-context examples (ICEs) directly in the input, without updating parameters. However, the effectiveness of ICL heavily relies on the selection of ICEs, and conventional text-based embedding methods are often inadequate for tasks that require multi-step reasoning, such as mathematical and logical problem solving. This is due to the bias introduced by shallow semantic similarities that fail to capture the deeper reasoning structures required for these tasks. We present GraphIC, a novel approach that leverages graph-based representations of reasoning processes, coupled with Bayesian Networks (BNs) to select ICEs. Graph structures inherently filter out shallow semantics while preserving the core reasoning structure. Importantly, BNs capture the dependency of a node’s attributes on its parent nodes, closely mirroring the hierarchical nature of human cognition, where each thought is shaped by preceding ones. This makes BNs particularly well-suited for multi-step reasoning tasks, aligning the process more closely with human-like reasoning. Extensive experiments across three types of reasoning tasks (mathematical reasoning, code generation, and logical reasoning) demonstrate that GraphIC outperforms both training-free and training-based models in selecting ICEs, excelling in terms of both effectiveness and efficiency. We show that GraphIC enhances ICL’s performance and interpretability, significantly advancing ICE selection for multi-step reasoning tasks.

[AI-86] Can Language Models Take A Hint? Prompting for Controllable Contextualized Commonsense Inference ACL

链接: https://arxiv.org/abs/2410.02202
作者: Pedro Colon-Hernandez,Nanxi Liu,Chelsea Joe,Peter Chin,Claire Yin,Henry Lieberman,Yida Xin,Cynthia Breazeal
关键词-EN: Generating commonsense assertions, story context remains, modern language models, Generating commonsense, hinting
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Submitted to ACL Rolling Review. arXiv admin note: text overlap with arXiv:2302.05406

点击查看摘要

Abstract:Generating commonsense assertions within a given story context remains a difficult task for modern language models. Previous research has addressed this problem by aligning commonsense inferences with stories and training language generation models accordingly. One of the challenges is determining which topic or entity in the story should be the focus of an inferred assertion. Prior approaches lack the ability to control specific aspects of the generated assertions. In this work, we introduce “hinting,” a data augmentation technique that enhances contextualized commonsense inference. “Hinting” employs a prefix prompting strategy using both hard and soft prompts to guide the inference process. To demonstrate its effectiveness, we apply “hinting” to two contextual commonsense inference datasets: ParaCOMET and GLUCOSE, evaluating its impact on both general and context-specific inference. Furthermore, we evaluate “hinting” by incorporating synonyms and antonyms into the hints. Our results show that “hinting” does not compromise the performance of contextual commonsense inference while offering improved controllability.
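
The "hinting" prefix can be sketched as a simple hard-prompt constructor that names the entity and relation the inferred assertion should focus on, optionally with synonym/antonym cue words. The bracket format and field names below are our own illustration, not the datasets' actual templates.

```python
def build_hinted_prompt(context, focus_entity, relation, cues=None):
    """Prepend a hard hint naming the entity and relation the inferred
    commonsense assertion should focus on (toy prefix format)."""
    hint = f"[HINT subject={focus_entity} relation={relation}"
    if cues:  # e.g. synonyms or antonyms to steer the inference
        hint += " cues=" + ",".join(cues)
    hint += "]"
    return hint + " " + context

print(build_hinted_prompt("Alice packed an umbrella.", "Alice", "xWant",
                          cues=["stay dry"]))
```

In the paper this hard prefix is complemented by learned soft prompts; only the hard-prompt half is sketched here.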

[AI-87] G2T-LLM: Graph-to-Tree Text Encoding for Molecule Generation with Fine-Tuned Large Language Models

链接: https://arxiv.org/abs/2410.02198
作者: Zhaoning Yu,Xiangyang Xu,Hongyang Gao
关键词-EN: hierarchical text format, text format optimized, large language models, hierarchical text, optimized for large
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:We introduce G2T-LLM, a novel approach for molecule generation that uses graph-to-tree text encoding to transform graph-based molecular structures into a hierarchical text format optimized for large language models (LLMs). This encoding converts complex molecular graphs into tree-structured formats, such as JSON and XML, which LLMs are particularly adept at processing due to their extensive pre-training on these types of data. By leveraging the flexibility of LLMs, our approach allows for intuitive interaction using natural language prompts, providing a more accessible interface for molecular design. Through supervised fine-tuning, G2T-LLM generates valid and coherent chemical structures, addressing common challenges like invalid outputs seen in traditional graph-based methods. While LLMs are computationally intensive, they offer superior generalization and adaptability, enabling the generation of diverse molecular structures with minimal task-specific customization. The proposed approach achieved comparable performances with state-of-the-art methods on various benchmark molecular generation datasets, demonstrating its potential as a flexible and innovative tool for AI-driven molecular design.
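
A minimal version of graph-to-tree text encoding is a DFS over the molecular graph that emits nested JSON. The schema below (`atom`/`children` keys) is a hypothetical illustration of the idea, not the paper's exact format; ring-closure bonds would need back-references in a full implementation.

```python
import json

def graph_to_tree(atoms, bonds, root=0):
    """DFS over a molecular graph emitting a nested JSON tree that an
    LLM can read (toy graph-to-tree encoding; ring closures omitted)."""
    adj = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    seen = set()

    def visit(i):
        seen.add(i)
        return {"atom": atoms[i],
                "children": [visit(j) for j in adj[i] if j not in seen]}

    return json.dumps(visit(root))

# Ethanol's heavy-atom skeleton C-C-O:
print(graph_to_tree(["C", "C", "O"], [(0, 1), (1, 2)]))
```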

[AI-88] General Preference Modeling with Preference Representations for Aligning Language Models

链接: https://arxiv.org/abs/2410.02197
作者: Yifan Zhang,Ge Zhang,Yue Wu,Kangping Xu,Quanquan Gu
关键词-EN: General Preference, preference, crucial for aligning, Traditional reward modeling, Modeling human preferences
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 34 pages

点击查看摘要

Abstract:Modeling human preferences is crucial for aligning foundation models with human values. Traditional reward modeling methods, such as the Bradley-Terry (BT) reward model, fall short in expressiveness, particularly in addressing intransitive preferences. Although supervised pair preference models (PairPM) can express general preferences, their implementation is highly ad-hoc and cannot guarantee a consistent preference probability of compared pairs. Additionally, they impose high computational costs due to their quadratic query complexity when comparing multiple responses. In this paper, we introduce preference representation learning, an approach that embeds responses into a latent space to capture intricate preference structures efficiently, achieving linear query complexity. Additionally, we propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback. Experimental results show that our General Preference representation model (GPM) outperforms the BT reward model on the RewardBench benchmark with a margin of up to 5.6% and effectively models cyclic preferences where any BT reward model behaves like a random guess. Furthermore, evaluations on downstream tasks such as AlpacaEval2.0 and MT-Bench, following the language model post-training with GPO and our general preference model, reveal substantial performance improvements with margins up to 9.3%. These findings indicate that our method may enhance the alignment of foundation models with nuanced human values. The code is available at this https URL.
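
The expressiveness gap versus Bradley-Terry can be seen with a tiny preference embedding and a skew-symmetric score, which makes score(x, y) = -score(y, x) while still allowing cycles. The block below is a toy illustration of that idea with fixed 2x2 skew-symmetric blocks; the paper learns these operators rather than fixing them.

```python
def pref_score(vx, vy):
    """Toy preference-representation score: each pair of embedding
    dimensions passes through a fixed 2x2 skew-symmetric block, so the
    score is antisymmetric yet can represent cyclic preferences --
    something no scalar Bradley-Terry reward can do."""
    s = 0.0
    for i in range(0, len(vx), 2):
        s += vx[i] * vy[i + 1] - vx[i + 1] * vy[i]
    return s

a, b, c = [1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]
# A preference cycle a > b > c > a:
print(pref_score(a, b) > 0, pref_score(b, c) > 0, pref_score(c, a) > 0)  # True True True
```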

[AI-89] BACKTIME: Backdoor Attacks on Multivariate Time Series Forecasting NEURIPS2024

链接: https://arxiv.org/abs/2410.02195
作者: Xiao Lin,Zhining Liu,Dongqi Fu,Ruizhong Qiu,Hanghang Tong
关键词-EN: Multivariate Time Series, Multivariate Time, Time Series, MTS forecasting models, numerous real-world applications
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: 23 pages. NeurIPS 2024

点击查看摘要

Abstract:Multivariate Time Series (MTS) forecasting is a fundamental task with numerous real-world applications, such as transportation, climate, and epidemiology. While a myriad of powerful deep learning models have been developed for this task, few works have explored the robustness of MTS forecasting models to malicious attacks, which is crucial for their trustworthy employment in high-stakes scenarios. To address this gap, we dive deep into the backdoor attacks on MTS forecasting models and propose an effective attack method named BackTime. By subtly injecting a few stealthy triggers into the MTS data, BackTime can alter the predictions of the forecasting model according to the attacker’s intent. Specifically, BackTime first identifies vulnerable timestamps in the data for poisoning, and then adaptively synthesizes stealthy and effective triggers by solving a bi-level optimization problem with a GNN-based trigger generator. Extensive experiments across multiple datasets and state-of-the-art MTS forecasting models demonstrate the effectiveness, versatility, and stealthiness of BackTime attacks. The code is available at this https URL.
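
The poisoning step can be illustrated with a fixed additive trigger placed at chosen timestamps of a univariate series. This is a heavily simplified sketch of the idea; the real method selects vulnerable timestamps and learns the trigger with a GNN-based generator rather than using the hard-coded pattern below.

```python
def inject_trigger(series, timestamps, trigger):
    """Add a small trigger pattern at selected timestamps of a time
    series (toy sketch of backdoor poisoning; trigger is fixed here)."""
    poisoned = list(series)
    for t in timestamps:
        for k, delta in enumerate(trigger):
            if t + k < len(poisoned):
                poisoned[t + k] += delta
    return poisoned

clean = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
print(inject_trigger(clean, [2], [0.1, -0.1]))  # small bump at t=2..3
```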

[AI-90] A Survey on Point-of-Interest Recommendation: Models Architectures and Security

链接: https://arxiv.org/abs/2410.02191
作者: Qianru Zhang,Peng Yang,Junliang Yu,Haixin Wang,Xingwei He,Siu-Ming Yiu,Hongzhi Yin
关键词-EN: Location-Based Social Networks, Social Networks, creating unparalleled opportunities, Location-Based Social, Networks has led
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: 20 pages

点击查看摘要

Abstract:The widespread adoption of smartphones and Location-Based Social Networks has led to a massive influx of spatio-temporal data, creating unparalleled opportunities for enhancing Point-of-Interest (POI) recommendation systems. These advanced POI systems are crucial for enriching user experiences, enabling personalized interactions, and optimizing decision-making processes in the digital landscape. However, existing surveys tend to focus on traditional approaches and few of them delve into cutting-edge developments, emerging architectures, as well as security considerations in POI recommendations. To address this gap, our survey stands out by offering a comprehensive, up-to-date review of POI recommendation systems, covering advancements in models, architectures, and security aspects. We systematically examine the transition from traditional models to advanced techniques such as large language models. Additionally, we explore the architectural evolution from centralized to decentralized and federated learning systems, highlighting the improvements in scalability and privacy. Furthermore, we address the increasing importance of security, examining potential vulnerabilities and privacy-preserving approaches. Our taxonomy provides a structured overview of the current state of POI recommendation, while we also identify promising directions for future research in this rapidly advancing field.

[AI-91] Agent-Oriented Planning in Multi-Agent Systems

链接: https://arxiv.org/abs/2410.02189
作者: Ao Li,Yuexiang Xie,Songze Li,Fugee Tsung,Bolin Ding,Yaliang Li
关键词-EN: possessing diverse expertise, achieve impressive progress, agents possessing diverse, systems achieve impressive, multiple agents possessing
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Through the collaboration of multiple agents possessing diverse expertise and tools, multi-agent systems achieve impressive progress in solving real-world problems. Given the user queries, the meta-agents, serving as the brain within these systems, are required to decompose the queries into multiple sub-tasks that can be allocated to suitable agents capable of solving them, so-called agent-oriented planning. In this study, we identify three critical design principles of agent-oriented planning, including solvability, completeness, and non-redundancy, to ensure that each sub-task is effectively resolved, leading to satisfactory responses to the original queries. These principles further inspire us to propose a novel framework for agent-oriented planning in multi-agent systems, leveraging a fast task decomposition and allocation process followed by an effective and efficient evaluation via a reward model. During the planning process, the meta-agent is also responsible for evaluating the performance of the expert agents, making timely adjustments to the sub-tasks and scheduling as necessary. Besides, we integrate a feedback loop into the proposed framework to further enhance the effectiveness and robustness of such a problem-solving process. Extensive experiments demonstrate the advancement of the proposed framework in solving real-world problems compared to both single-agent systems and existing planning strategies for multi-agent systems.
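
The three design principles can be formalized as simple set checks over a candidate decomposition. The block below is our own set-based toy formalization (all names illustrative), not the paper's reward-model-based evaluation.

```python
def check_plan(subtasks, agents, requirements):
    """Check a decomposition against the three principles: solvability
    (every sub-task has a capable agent), completeness (all requirements
    covered), non-redundancy (no requirement covered twice).
    subtasks: sub-task -> set of requirements it addresses;
    agents: agent -> set of sub-tasks it can solve."""
    solvable = all(any(t in caps for caps in agents.values()) for t in subtasks)
    covered = set().union(*subtasks.values()) if subtasks else set()
    complete = covered >= set(requirements)
    non_redundant = sum(len(v) for v in subtasks.values()) == len(covered)
    return solvable, complete, non_redundant

subtasks = {"search": {"find facts"}, "summarize": {"write answer"}}
agents = {"retriever": {"search"}, "writer": {"summarize"}}
print(check_plan(subtasks, agents, ["find facts", "write answer"]))  # (True, True, True)
```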

[AI-92] POSIX: A Prompt Sensitivity Index For Large Language Models EMNLP2024

链接: https://arxiv.org/abs/2410.02185
作者: Anwoy Chatterjee,H S V N S Kowndinya Renduchintala,Sumit Bhatia,Tanmoy Chakraborty
关键词-EN: Large Language Models, Large Language, Language Models, minor variations, generating significantly divergent
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: EMNLP 2024 (Findings)

点击查看摘要

Abstract:Despite their remarkable capabilities, Large Language Models (LLMs) are found to be surprisingly sensitive to minor variations in prompts, such as spelling errors, alteration of wording or changes to the prompt template, often generating significantly divergent outputs in response. However, while assessing the quality of an LLM, the focus often tends to be solely on its performance on downstream tasks, while very little to no attention is paid to prompt sensitivity. To fill this gap, we propose POSIX - a novel PrOmpt Sensitivity IndeX as a reliable measure of prompt sensitivity, thereby offering a more comprehensive evaluation of LLM performance. The key idea behind POSIX is to capture the relative change in log-likelihood of a given response upon replacing the corresponding prompt with a different intent-preserving prompt. We provide thorough empirical evidence demonstrating the efficacy of POSIX in capturing prompt sensitivity and subsequently use it to measure and thereby compare prompt sensitivity of various open-source LLMs. We find that merely increasing the parameter count or instruction tuning does not necessarily reduce prompt sensitivity, whereas adding some few-shot exemplars, even just one, almost always leads to a significant decrease in prompt sensitivity. We also find that alterations to the prompt template lead to the highest sensitivity in the case of MCQ-type tasks, whereas paraphrasing results in the highest sensitivity in open-ended generation tasks. The code for reproducing our results is open-sourced at this https URL.
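
The core idea of scoring a response's log-likelihood under intent-preserving prompt swaps can be sketched with a small matrix of toy log-probabilities. This is our own hypothetical simplification of such an index, not the paper's exact formula.

```python
def posix_index(logprobs):
    """Toy POSIX-style sensitivity score.  logprobs[i][j] is the
    log-likelihood of the response generated for prompt variant i when
    scored under prompt variant j; the index averages the absolute
    change in log-likelihood across intent-preserving swaps."""
    n = len(logprobs)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += abs(logprobs[i][j] - logprobs[i][i])
                count += 1
    return total / count

# Two intent-preserving prompt variants, toy log-likelihoods.
scores = [[-10.0, -12.0],
          [-11.0, -10.5]]
print(posix_index(scores))  # -> 1.25
```

A sensitivity of 0 would mean every response is equally likely under every intent-preserving rewrite of its prompt.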

[AI-93] Efficiently Deploying LLMs with Controlled Risk

链接: https://arxiv.org/abs/2410.02173
作者: Michael J. Zellinger,Matt Thomson
关键词-EN: Deploying large language, large language models, production requires simultaneous, requires simultaneous attention, risk control
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 10 pages

点击查看摘要

Abstract:Deploying large language models in production requires simultaneous attention to efficiency and risk control. Prior work has shown the possibility to cut costs while maintaining similar accuracy, but has neglected to focus on risk control. By contrast, here we present hierarchical chains with multi-level abstention (HCMA), which use model-intrinsic uncertainty to delegate queries along the LLM intelligence hierarchy, enabling training-free model switching based solely on black-box API calls. Our framework presents novel trade-offs between efficiency and risk. For example, deploying HCMA on MMLU cuts the error rate of Llama3 405B by 30% when the model is allowed to abstain on 20% of the queries. To calibrate HCMA for optimal performance, our approach uses data-efficient logistic regressions (based on a simple nonlinear feature transformation), which require only 50 or 100 labeled examples to achieve excellent calibration error (ECE), cutting ECE by 50% compared to naive Platt scaling. On free-form generation tasks, we find that chain-of-thought is ineffectual for selective prediction, whereas zero-shot prompting drives error to 0% on TruthfulQA at high abstention rates. As LLMs are increasingly deployed across computing environments with different capabilities (such as mobile, laptop, and cloud), our framework paves the way towards maintaining deployment efficiency while putting in place sharp risk controls.
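
The delegation logic of a hierarchical chain with multi-level abstention can be sketched as a loop over models ordered from cheapest to most capable, each guarded by a confidence threshold. The interface below (callables returning an answer and a confidence) is a hypothetical illustration, not the paper's implementation.

```python
def hcma_route(query, models, thresholds):
    """Try models from cheapest to most capable; model i answers only
    if its confidence clears thresholds[i], otherwise the query is
    escalated.  After the last level the chain abstains (None)."""
    for model, tau in zip(models, thresholds):
        answer, confidence = model(query)
        if confidence >= tau:
            return answer
    return None  # abstain

small = lambda q: ("small-answer", 0.6)   # cheap model, toy confidence
large = lambda q: ("large-answer", 0.9)   # expensive model
print(hcma_route("q", [small, large], [0.8, 0.7]))  # -> large-answer
```

Tuning the thresholds trades cost against risk: raising them pushes more queries up the hierarchy (or into abstention) in exchange for a lower error rate on the queries that are answered.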

[AI-94] Abstract Reward Processes: Leveraging State Abstraction for Consistent Off-Policy Evaluation NEURIPS2024

链接: https://arxiv.org/abs/2410.02172
作者: Shreyas Chaudhari,Ameet Deshpande,Bruno Castro da Silva,Philip S. Thomas
关键词-EN: applying reinforcement learning, Evaluating policies, autonomous driving, crucial for applying, applying reinforcement
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: Accepted at the Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:Evaluating policies using off-policy data is crucial for applying reinforcement learning to real-world problems such as healthcare and autonomous driving. Previous methods for off-policy evaluation (OPE) generally suffer from high variance or irreducible bias, leading to unacceptably high prediction errors. In this work, we introduce STAR, a framework for OPE that encompasses a broad range of estimators – which include existing OPE methods as special cases – that achieve lower mean squared prediction errors. STAR leverages state abstraction to distill complex, potentially continuous problems into compact, discrete models which we call abstract reward processes (ARPs). Predictions from ARPs estimated from off-policy data are provably consistent (asymptotically correct). Rather than proposing a specific estimator, we present a new framework for OPE and empirically demonstrate that estimators within STAR outperform existing methods. The best STAR estimator outperforms baselines in all twelve cases studied, and even the median STAR estimator surpasses the baselines in seven out of the twelve cases.
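
The abstract-reward-process idea can be illustrated end to end on toy data: map off-policy transitions through a state abstraction, estimate a small discrete Markov reward process by counting, and evaluate it by value iteration. This block illustrates the concept only; it is not the STAR estimator family itself.

```python
from collections import defaultdict

def evaluate_arp(trajectories, abstract, gamma=0.9, iters=200):
    """Build a toy abstract reward process from off-policy trajectories
    (lists of (state, reward, next_state) tuples) via the `abstract`
    state-abstraction function, then evaluate it by value iteration."""
    counts = defaultdict(lambda: defaultdict(int))
    rewards = defaultdict(list)
    for traj in trajectories:
        for s, r, s2 in traj:
            counts[abstract(s)][abstract(s2)] += 1
            rewards[abstract(s)].append(r)
    states = set(counts) | {s2 for s in counts for s2 in counts[s]}
    values = {s: 0.0 for s in states}
    for _ in range(iters):
        for s in states:
            n = sum(counts[s].values())
            if n == 0:
                continue  # no outgoing data: leave the value at 0
            r_bar = sum(rewards[s]) / len(rewards[s])
            values[s] = r_bar + gamma * sum(c / n * values[s2]
                                            for s2, c in counts[s].items())
    return values

# One trajectory, identity abstraction: state 0 pays 1.0, then an
# absorbing state 1 pays 0 forever.
print(evaluate_arp([[(0, 1.0, 1), (1, 0.0, 1)]], abstract=lambda s: s))
```

Coarser abstractions yield smaller ARPs with lower variance at the cost of bias; the consistency result in the paper concerns what happens as off-policy data grows.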

[AI-95] A LLM-Powered Automatic Grading Framework with Human-Level Guidelines Optimization

链接: https://arxiv.org/abs/2410.02165
作者: Yucheng Chu,Hang Li,Kaiqi Yang,Harry Shomer,Hui Liu,Yasemin Copur-Gencturk,Jiliang Tang
关键词-EN: providing deeper insights, Open-ended short-answer questions, Open-ended short-answer, learning analytics, widely recognized
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Open-ended short-answer questions (SAQs) have been widely recognized as a powerful tool for providing deeper insights into learners’ responses in the context of learning analytics (LA). However, SAQs often present challenges in practice due to the high grading workload and concerns about inconsistent assessments. With recent advancements in natural language processing (NLP), automatic short-answer grading (ASAG) offers a promising solution to these challenges. Despite this, current ASAG algorithms are often limited in generalizability and tend to be tailored to specific questions. In this paper, we propose a unified multi-agent ASAG framework, GradeOpt, which leverages large language models (LLMs) as graders for SAQs. More importantly, GradeOpt incorporates two additional LLM-based agents - the reflector and the refiner - into the multi-agent system. This enables GradeOpt to automatically optimize the original grading guidelines by performing self-reflection on its errors. Through experiments on a challenging ASAG task, namely the grading of pedagogical content knowledge (PCK) and content knowledge (CK) questions, GradeOpt demonstrates superior performance in grading accuracy and behavior alignment with human graders compared to representative baselines. Finally, comprehensive ablation studies confirm the effectiveness of the individual components designed in GradeOpt.

[AI-96] Planning in Strawberry Fields: Evaluating and Improving the Planning and Scheduling Capabilities of LRM o1

链接: https://arxiv.org/abs/2410.02162
作者: Karthik Valmeekam,Kaya Stechly,Atharva Gundawar,Subbarao Kambhampati
关键词-EN: Large Reasoning Model, ability to plan, action that achieves, achieves a desired, desired state
类目: Artificial Intelligence (cs.AI)
*备注: arXiv admin note: text overlap with arXiv:2409.13373

点击查看摘要

Abstract:The ability to plan a course of action that achieves a desired state of affairs has long been considered a core competence of intelligent agents and has been an integral part of AI research since its inception. With the advent of large language models (LLMs), there has been considerable interest in the question of whether or not they possess such planning abilities, but – despite the slew of new private and open source LLMs since GPT3 – progress has remained slow. OpenAI claims that their recent o1 (Strawberry) model has been specifically constructed and trained to escape the normal limitations of autoregressive LLMs – making it a new kind of model: a Large Reasoning Model (LRM). In this paper, we evaluate the planning capabilities of two LRMs (o1-preview and o1-mini) on both planning and scheduling benchmarks. We see that while o1 does seem to offer significant improvements over autoregressive LLMs, this comes at a steep inference cost, while still failing to provide any guarantees over what it generates. We also show that combining o1 models with external verifiers – in a so-called LRM-Modulo system – guarantees the correctness of the combined system’s output while further improving performance.

[AI-97] RiskSEA : A Scalable Graph Embedding for Detecting On-chain Fraudulent Activities on the Ethereum Blockchain

链接: https://arxiv.org/abs/2410.02160
作者: Ayush Agarwal,Lv Lu,Arjun Maheswaran,Varsha Mahadevan,Bhaskar Krishnamachari
关键词-EN: criminal activities, blockchain transaction graphs, blockchain, blockchain transaction, embedding
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2203.12363 by other authors

点击查看摘要

Abstract:Like any other useful technology, cryptocurrencies are sometimes used for criminal activities. While transactions are recorded on the blockchain, there exists a need for a more rapid and scalable method to detect addresses associated with fraudulent activities. We present RiskSEA, a scalable risk scoring system capable of effectively handling the dynamic nature of large-scale blockchain transaction graphs. The risk scoring system, which we implement for Ethereum, consists of: (1) a scalable approach to generating node2vec embeddings for the entire set of addresses to capture the graph topology; (2) transaction-based features to capture the transactional behavioral pattern of an address; and (3) a classifier model that combines the node2vec embedding and behavioral features to generate a risk score for each address. Efficiently generating node2vec embeddings for large-scale, dynamically evolving blockchain transaction graphs is challenging; we present two novel approaches for generating node2vec embeddings and effectively scaling them to the entire set of blockchain addresses: (1) node2vec embedding propagation and (2) dynamic node2vec embedding. We present a comprehensive analysis of the proposed approaches. Our experiments show that combining both behavioral and node2vec features boosts the classification performance significantly, and that the dynamic node2vec embeddings perform better than the propagated node2vec embeddings.
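The late-fusion classifier described in the abstract can be sketched as follows. Everything here is a synthetic stand-in: the embeddings, behavioral features, and the logistic-regression scorer are illustrative assumptions (the abstract does not specify the classifier), keeping only the idea that a graph view and a behavioral view are concatenated before scoring.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs: a node2vec-style embedding plus hand-crafted
# behavioral features (e.g. tx counts, volumes) for each address.
n, emb_dim, beh_dim = 200, 8, 3
labels = rng.integers(0, 2, n)                       # 1 = fraud-flagged address
emb = rng.normal(0, 1, (n, emb_dim)) + labels[:, None] * 1.5
beh = rng.normal(0, 1, (n, beh_dim)) + labels[:, None] * 1.5

# Late fusion: concatenate the graph view and the behavioral view.
X = np.hstack([emb, beh])
X = (X - X.mean(0)) / X.std(0)                       # standardize features

# Minimal logistic-regression risk scorer trained by gradient descent.
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))               # sigmoid risk score
    w -= 0.5 * (X.T @ (p - labels) / n)
    b -= 0.5 * (p - labels).mean()

risk_scores = 1 / (1 + np.exp(-(X @ w + b)))         # scores in [0, 1]
acc = float(((risk_scores > 0.5) == labels).mean())
```

On this toy data the fused features separate the two classes easily; the paper's finding is that the combination outperforms either view alone.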

[AI-98] Mitigating Memorization In Language Models

链接: https://arxiv.org/abs/2410.02159
作者: Mansi Sakarvadia,Aswathy Ajith,Arham Khan,Nathaniel Hudson,Caleb Geniesse,Kyle Chard,Yaoqing Yang,Ian Foster,Michael W. Mahoney
关键词-EN: Language models, encode training data, training data, extract training data, inference-time queries
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Language models (LMs) can “memorize” information, i.e., encode training data in their weights in such a way that inference-time queries can lead to verbatim regurgitation of that data. This ability to extract training data can be problematic, for example, when data are private or sensitive. In this work, we investigate methods to mitigate memorization: three regularizer-based, three finetuning-based, and eleven machine unlearning-based methods, with five of the latter being new methods that we introduce. We also introduce TinyMem, a suite of small, computationally-efficient LMs for the rapid development and evaluation of memorization-mitigation methods. We demonstrate that the mitigation methods that we develop using TinyMem can successfully be applied to production-grade LMs, and we determine via experiment that: regularizer-based mitigation methods are slow and ineffective at curbing memorization; fine-tuning-based methods are effective at curbing memorization, but overly expensive, especially for retaining higher accuracies; and unlearning-based methods are faster and more effective, allowing for the precise localization and removal of memorized information from LM weights prior to inference. We show, in particular, that our proposed unlearning method BalancedSubnet outperforms other mitigation methods at removing memorized information while preserving performance on target tasks.

[AI-99] The why, what, and how of AI-based coding in scientific research

链接: https://arxiv.org/abs/2410.02156
作者: Tonghe Zhuang,Zhicheng Lin
关键词-EN: Computer programming, remains challenging, challenging to learn, learn and time-consuming, time-consuming to carry
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL)
*备注: 23 pages, 7 figure, 3 boxes

点击查看摘要

Abstract:Computer programming (coding) is indispensable for researchers across disciplines, yet it remains challenging to learn and time-consuming to carry out. Generative AI, particularly large language models (LLMs), has the potential to transform coding into intuitive conversations, but best practices and effective workflows are only emerging. We dissect AI-based coding through three key lenses: the nature and role of LLMs in coding (why), six types of coding assistance they provide (what), and a five-step workflow in action with practical implementation strategies (how). Additionally, we address the limitations and future outlook of AI in coding. By offering actionable insights, this framework helps to guide researchers in effectively leveraging AI to enhance coding practices and education, accelerating scientific progress.

[AI-100] From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities

链接: https://arxiv.org/abs/2410.02155
作者: Wanpeng Zhang,Zilong Xie,Yicheng Feng,Yijiang Li,Xingrun Xing,Sipeng Zheng,Zongqing Lu
关键词-EN: Large Language Models, made significant strides, Large Language, Multimodal Large Language, text-only Large Language
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multimodal Large Language Models have made significant strides in integrating visual and textual information, yet they often struggle with effectively aligning these modalities. We introduce a novel image tokenizer that bridges this gap by applying the principle of Byte-Pair Encoding (BPE) to visual data. Unlike conventional approaches that rely on separate visual encoders, our method directly incorporates structural prior information into image tokens, mirroring the successful tokenization strategies used in text-only Large Language Models. This innovative approach enables Transformer models to more effectively learn and reason across modalities. Through theoretical analysis and extensive experiments, we demonstrate that our BPE Image Tokenizer significantly enhances MLLMs’ multimodal understanding capabilities, even with limited training data. Our method not only improves performance across various benchmarks but also shows promising scalability, potentially paving the way for more efficient and capable multimodal foundation models.
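The core idea, applying text-style BPE merges to sequences of quantized visual tokens, can be illustrated on toy data. The token IDs and the 1-D ordering of image tokens below are assumptions of this sketch, not the paper's tokenizer.

```python
from collections import Counter

# Toy "images" already quantized into 1-D sequences of codebook indices
# (the quantizer and the scan order are assumptions of this sketch).
sequences = [
    [3, 7, 3, 7, 1, 3, 7, 2],
    [3, 7, 3, 7, 3, 7, 5],
]

def merge_pair(seq, pair, new_id):
    """Replace every occurrence of an adjacent pair with its merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_bpe_merges(seqs, num_merges, next_id):
    """Greedily merge the most frequent adjacent token pair, as in text BPE."""
    merges = {}
    for _ in range(num_merges):
        pairs = Counter()
        for s in seqs:
            pairs.update(zip(s, s[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges[best] = next_id
        seqs = [merge_pair(s, best, next_id) for s in seqs]
        next_id += 1
    return merges, seqs

merges, merged = learn_bpe_merges(sequences, num_merges=1, next_id=100)
# The frequent pair (3, 7) becomes a single token 100, so recurring visual
# structure is folded into the vocabulary, mirroring subword merges in text.
```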

[AI-101] Efficient Source-Free Time-Series Adaptation via Parameter Subspace Disentanglement

链接: https://arxiv.org/abs/2410.02147
作者: Gaurav Patel,Christopher Sandino,Behrooz Mahasseni,Ellen L Zippi,Erdrin Azemi,Ali Moin,Juri Minxha
关键词-EN: efficient Source-Free Domain, Source-Free Domain Adaptation, Source-Free Domain, Domain Adaptation, context of time-series
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:In this paper, we propose a framework for efficient Source-Free Domain Adaptation (SFDA) in the context of time-series, focusing on enhancing both parameter efficiency and data-sample utilization. Our approach introduces an improved paradigm for source-model preparation and target-side adaptation, aiming to enhance training efficiency during target adaptation. Specifically, we reparameterize the source model’s weights in a Tucker-style decomposed manner, factorizing the model into a compact form during the source model preparation phase. During target-side adaptation, only a subset of these decomposed factors is fine-tuned, leading to significant improvements in training efficiency. We demonstrate using PAC Bayesian analysis that this selective fine-tuning strategy implicitly regularizes the adaptation process by constraining the model’s learning capacity. Furthermore, this re-parameterization reduces the overall model size and enhances inference efficiency, making the approach particularly well suited for resource-constrained devices. Additionally, we demonstrate that our framework is compatible with various SFDA methods and achieves significant computational efficiency, reducing the number of fine-tuned parameters and inference overhead in terms of MACs by over 90% while maintaining model performance.
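A minimal sketch of the factorized fine-tuning idea: for a 2-D weight matrix, a Tucker-style decomposition reduces to an SVD-like form U @ core @ V^T, and during target-side adaptation only the small core would be trainable. The shapes and rank below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained layer weight; the paper factorizes weights
# Tucker-style, which for a 2-D matrix is an SVD-like product.
W = rng.normal(0, 1, (64, 32))
rank = 8

U, s, Vt = np.linalg.svd(W, full_matrices=False)
U, core, Vt = U[:, :rank], np.diag(s[:rank]), Vt[:rank, :]

# During target adaptation only the small core would be fine-tuned;
# U and Vt stay frozen, shrinking the trainable parameter count.
frozen_params = U.size + Vt.size        # 64*8 + 8*32 = 768
trainable_params = core.size            # 8*8 = 64
full_params = W.size                    # 64*32 = 2048

W_approx = U @ core @ Vt
rel_err = np.linalg.norm(W - W_approx) / np.linalg.norm(W)
```

The same pattern generalizes to higher-order convolution weights with a true Tucker decomposition, which is what makes the reparameterized model both smaller and cheaper to adapt.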

[AI-102] Can LLMs Reliably Simulate Human Learner Actions? A Simulation Authoring Framework for Open-Ended Learning Environments

链接: https://arxiv.org/abs/2410.02110
作者: Amogh Mannekote,Adam Davies,Jina Kang,Kristy Elizabeth Boyer
关键词-EN: Simulating learner actions, adaptations before deployment, actions helps stress-test, prototype new adaptations, interactive learning environments
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Simulating learner actions helps stress-test open-ended interactive learning environments and prototype new adaptations before deployment. While recent studies show the promise of using large language models (LLMs) for simulating human behavior, such approaches have not gone beyond rudimentary proof-of-concept stages due to key limitations. First, LLMs are highly sensitive to minor prompt variations, raising doubts about their ability to generalize to new scenarios without extensive prompt engineering. Moreover, apparently successful outcomes can often be unreliable, either because domain experts unintentionally guide LLMs to produce expected results, leading to self-fulfilling prophecies; or because the LLM has encountered highly similar scenarios in its training data, meaning that models may not be simulating behavior so much as regurgitating memorized content. To address these challenges, we propose Hyp-Mix, a simulation authoring framework that allows experts to develop and evaluate simulations by combining testable hypotheses about learner behavior. Testing this framework in a physics learning environment, we found that GPT-4 Turbo maintains calibrated behavior even as the underlying learner model changes, providing the first evidence that LLMs can be used to simulate realistic behaviors in open-ended interactive learning environments, a necessary prerequisite for useful LLM behavioral simulation.

[AI-103] Tracking objects that change in appearance with phase synchrony

链接: https://arxiv.org/abs/2410.02094
作者: Sabine Muzellec,Drew Linsley,Alekh K. Ashok,Ennio Mingolla,Girik Malik,Rufin VanRullen,Thomas Serre
关键词-EN: Objects, neural, track objects, neural synchrony, change
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Objects we encounter often change appearance as we interact with them. Changes in illumination (shadows), object pose, or movement of nonrigid objects can drastically alter available image features. How do biological visual systems track objects as they change? It may involve specific attentional mechanisms for reasoning about the locations of objects independently of their appearances – a capability that prominent neuroscientific theories have associated with computing through neural synchrony. We computationally test the hypothesis that the implementation of visual attention through neural synchrony underlies the ability of biological visual systems to track objects that change in appearance over time. We first introduce a novel deep learning circuit that can learn to precisely control attention to features separately from their location in the world through neural synchrony: the complex-valued recurrent neural network (CV-RNN). Next, we compare object tracking in humans, the CV-RNN, and other deep neural networks (DNNs), using FeatureTracker: a large-scale challenge that asks observers to track objects as their locations and appearances change in precisely controlled ways. While humans effortlessly solved FeatureTracker, state-of-the-art DNNs did not. In contrast, our CV-RNN behaved similarly to humans on the challenge, providing a computational proof-of-concept for the role of phase synchronization as a neural substrate for tracking appearance-morphing objects as they move about.
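A minimal complex-valued recurrent step conveys the mechanism: magnitudes carry feature strength while phases can synchronize to bind features together. This is a generic sketch with a modReLU-style activation, assumed for illustration, not the authors' CV-RNN architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Complex-valued recurrent and input weights (small random init).
hid, inp = 16, 4
W_h = rng.normal(0, 0.1, (hid, hid)) + 1j * rng.normal(0, 0.1, (hid, hid))
W_x = rng.normal(0, 0.1, (hid, inp)) + 1j * rng.normal(0, 0.1, (hid, inp))

def step(h, x):
    z = W_h @ h + W_x @ x
    # modReLU-style activation: nonlinearity acts on the magnitude only,
    # so the phase (the putative binding signal) is preserved.
    mag, phase = np.abs(z), np.angle(z)
    return np.maximum(mag - 0.1, 0.0) * np.exp(1j * phase)

h = np.zeros(hid, dtype=complex)
for t in range(5):
    x = rng.normal(0, 1, inp).astype(complex)
    h = step(h, x)
```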

[AI-104] he Impact of Generative AI on Collaborative Open-Source Software Development: Evidence from GitHub Copilot

链接: https://arxiv.org/abs/2410.02091
作者: Fangchen Song,Ashish Agarwal,Wen Wen
关键词-EN: Generative artificial intelligence, automated content production, artificial intelligence, content production, including coding
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); General Economics (econ.GN)
*备注:

点击查看摘要

Abstract:Generative artificial intelligence (AI) has opened the possibility of automated content production, including coding in software development, which can significantly influence the participation and performance of software developers. To explore this impact, we investigate the role of GitHub Copilot, a generative AI pair programmer, on software development in open-source community, where multiple developers voluntarily collaborate on software projects. Using GitHub’s dataset for open-source repositories and a generalized synthetic control method, we find that Copilot significantly enhances project-level productivity by 6.5%. Delving deeper, we dissect the key mechanisms driving this improvement. Our findings reveal a 5.5% increase in individual productivity and a 5.4% increase in participation. However, this is accompanied with a 41.6% increase in integration time, potentially due to higher coordination costs. Interestingly, we also observe the differential effects among developers. We discover that core developers achieve greater project-level productivity gains from using Copilot, benefiting more in terms of individual productivity and participation compared to peripheral developers, plausibly due to their deeper familiarity with software projects. We also find that the increase in project-level productivity is accompanied with no change in code quality. We conclude that AI pair programmers bring benefits to developers to automate and augment their code, but human developers’ knowledge of software projects can enhance the benefits. In summary, our research underscores the role of AI pair programmers in impacting project-level productivity within the open-source community and suggests potential implications for the structure of open-source software projects.

[AI-105] RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning

链接: https://arxiv.org/abs/2410.02089
作者: Jonas Gehring,Kunhao Zheng,Jade Copet,Vegard Mella,Taco Cohen,Gabriel Synnaeve
关键词-EN: agents solve user-specified, required manual engagement, solve user-specified tasks, deployed as agents, Large language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) deployed as agents solve user-specified tasks over multiple steps while keeping the required manual engagement to a minimum. Crucially, such LLMs need to ground their generations in any feedback obtained to reliably achieve desired outcomes. We propose an end-to-end reinforcement learning method for teaching models to leverage execution feedback in the realm of code synthesis, where state-of-the-art LLMs struggle to improve code iteratively compared to independent sampling. We benchmark on competitive programming tasks, where we achieve new state-of-the-art results with both small (8B parameters) and large (70B) models while reducing the amount of samples required by an order of magnitude. Our analysis of inference-time behavior demonstrates that our method produces LLMs that effectively leverage automatic feedback over multiple steps.

[AI-106] Multi-Omic and Quantum Machine Learning Integration for Lung Subtypes Classification

链接: https://arxiv.org/abs/2410.02085
作者: Mandeep Kaur Saggi,Amandeep Singh Bhatia,Mensah Isaiah,Humaira Gowher,Sabre Kais
关键词-EN: Quantum Machine Learning, opportunities to resolve, computational problems, red-hot field, field that brings
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN); Quantum Physics (quant-ph)
*备注: 27 pages, 17 figures

点击查看摘要

Abstract:Quantum Machine Learning (QML) is a red-hot field that brings novel discoveries and exciting opportunities to resolve, speed up, or refine the analysis of a wide range of computational problems. In the realm of biomedical research and personalized medicine, the significance of multi-omics integration lies in its ability to provide a thorough and holistic comprehension of complex biological systems. This technology links fundamental research to clinical practice. The insights gained from integrated omics data can be translated into clinical tools for diagnosis, prognosis, and treatment planning. The fusion of quantum computing and machine learning holds promise for unraveling complex patterns within multi-omics datasets, providing unprecedented insights into the molecular landscape of lung cancer. Due to the heterogeneity, complexity, and high dimensionality of multi-omic cancer data, characterized by the vast number of features (such as gene expression, micro-RNA, and DNA methylation) relative to the limited number of lung cancer patient samples, our prime motivation for this paper is the integration of multi-omic data, unique feature selection, and diagnostic classification of lung subtypes: lung squamous cell carcinoma (LUSC-I) and lung adenocarcinoma (LUAD-II) using quantum machine learning. We developed a method for finding the best differentiating features between LUAD and LUSC datasets, which has the potential for biomarker discovery.

[AI-107] Kolmogorov-Arnold Network Autoencoders

链接: https://arxiv.org/abs/2410.02077
作者: Mohammadamin Moradi,Shirin Panahi,Erik Bollt,Ying-Cheng Lai
关键词-EN: Deep learning models, Deep learning, Multi-Layer Perceptrons, revolutionized various domains, image classification
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 5 figures, 1 table

点击查看摘要

Abstract:Deep learning models have revolutionized various domains, with Multi-Layer Perceptrons (MLPs) being a cornerstone for tasks like data regression and image classification. However, a recent study has introduced Kolmogorov-Arnold Networks (KANs) as promising alternatives to MLPs, leveraging activation functions placed on edges rather than nodes. This structural shift aligns KANs closely with the Kolmogorov-Arnold representation theorem, potentially enhancing both model accuracy and interpretability. In this study, we explore the efficacy of KANs in the context of data representation via autoencoders, comparing their performance with traditional Convolutional Neural Networks (CNNs) on the MNIST, SVHN, and CIFAR-10 datasets. Our results demonstrate that KAN-based autoencoders achieve competitive performance in terms of reconstruction accuracy, thereby suggesting their viability as effective tools in data analysis tasks.
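The structural shift described above, activation functions on edges rather than nodes, can be illustrated with a toy KAN-style layer in which each edge carries a learnable univariate function. A small polynomial basis stands in here for the splines typically used in KANs; shapes and degrees are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def kan_layer(x, coeffs):
    """Each edge (i -> j) applies its own univariate function phi_ij(x_i);
    the output j sums the edge functions. coeffs has shape
    (out_dim, in_dim, degree+1): polynomial coefficients per edge."""
    basis = np.stack([x**k for k in range(coeffs.shape[-1])], axis=-1)  # (in, K)
    return np.einsum('oik,ik->o', coeffs, basis)

in_dim, out_dim, K = 4, 3, 3
coeffs = rng.normal(0, 0.5, (out_dim, in_dim, K))   # learnable in practice
x = rng.normal(0, 1, in_dim)
y = kan_layer(x, coeffs)
```

Stacking such layers in an encoder/decoder pair gives the KAN autoencoder the paper compares against CNN baselines.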

[AI-108] EAB-FL: Exacerbating Algorithmic Bias through Model Poisoning Attacks in Federated Learning

链接: https://arxiv.org/abs/2410.02042
作者: Syed Irfan Ali Meerza,Jian Liu
关键词-EN: Federated Learning, shared model collaboratively, multiple parties, parties to train, train a shared
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) is a technique that allows multiple parties to train a shared model collaboratively without disclosing their private data. It has become increasingly popular due to its distinct privacy advantages. However, FL models can suffer from biases against certain demographic groups (e.g., racial and gender groups) due to the heterogeneity of data and party selection. Researchers have proposed various strategies for characterizing the group fairness of FL algorithms to address this issue. However, the effectiveness of these strategies in the face of deliberate adversarial attacks has not been fully explored. Although existing studies have revealed various threats (e.g., model poisoning attacks) against FL systems caused by malicious participants, their primary aim is to decrease model accuracy, while the potential of leveraging poisonous model updates to exacerbate model unfairness remains unexplored. In this paper, we propose a new type of model poisoning attack, EAB-FL, with a focus on exacerbating group unfairness while maintaining a good level of model utility. Extensive experiments on three datasets demonstrate the effectiveness and efficiency of our attack, even with state-of-the-art fairness optimization algorithms and secure aggregation rules employed.

[AI-109] Model Comparisons: XNet Outperforms KAN

链接: https://arxiv.org/abs/2410.02033
作者: Xin Li,Zhihong Jeff Xia,Xiaotao Zheng
关键词-EN: precise data modeling, predictive machine learning, machine learning tasks, artificial intelligence, modeling is crucial
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the fields of computational mathematics and artificial intelligence, the need for precise data modeling is crucial, especially for predictive machine learning tasks. This paper further explores XNet, a novel algorithm that employs the complex-valued Cauchy integral formula, offering a superior network architecture that surpasses traditional Multi-Layer Perceptrons (MLPs) and Kolmogorov-Arnold Networks (KANs). XNet significantly improves speed and accuracy across various tasks in both low and high-dimensional spaces, redefining the scope of data-driven model development and providing substantial improvements over established time series models like LSTMs.

[AI-110] Quantifying the Gaps Between Translation and Native Perception in Training for Multimodal Multilingual Retrieval EMNLP24

链接: https://arxiv.org/abs/2410.02027
作者: Kyle Buettner,Adriana Kovashka
关键词-EN: languages and cultures, multilingual vision-language models, properly account, perceptual differences, reflected in image
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Short paper accepted to EMNLP24 (Main)

点击查看摘要

Abstract:There is a scarcity of multilingual vision-language models that properly account for the perceptual differences that are reflected in image captions across languages and cultures. In this work, through a multimodal, multilingual retrieval case study, we quantify the existing lack of model flexibility. We empirically show performance gaps between training on captions that come from native German perception and captions that have been either machine-translated or human-translated from English into German. To address these gaps, we further propose and evaluate caption augmentation strategies. While we achieve mean recall improvements (+1.3), gaps still remain, indicating an open area of future work for the community.

[AI-111] Zodiac: A Cardiologist-Level LLM Framework for Multi-Agent Diagnostics

链接: https://arxiv.org/abs/2410.02026
作者: Yuan Zhou,Peng Zhang,Mengya Song,Alice Zheng,Yiwen Lu,Zhiheng Liu,Yong Chen,Zhaohan Xi
关键词-EN: Large language models, demonstrated remarkable progress, Large language, demonstrated remarkable, remarkable progress
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable progress in healthcare. However, a significant gap remains regarding LLMs’ professionalism in domain-specific clinical practices, limiting their application in real-world diagnostics. In this work, we introduce ZODIAC, an LLM-powered framework with cardiologist-level professionalism designed to engage LLMs in cardiological diagnostics. ZODIAC assists cardiologists by extracting clinically relevant characteristics from patient data, detecting significant arrhythmias, and generating preliminary reports for the review and refinement by cardiologists. To achieve cardiologist-level professionalism, ZODIAC is built on a multi-agent collaboration framework, enabling the processing of patient data across multiple modalities. Each LLM agent is fine-tuned using real-world patient data adjudicated by cardiologists, reinforcing the model’s professionalism. ZODIAC undergoes rigorous clinical validation with independent cardiologists, evaluated across eight metrics that measure clinical effectiveness and address security concerns. Results show that ZODIAC outperforms industry-leading models, including OpenAI’s GPT-4o, Meta’s Llama-3.1-405B, and Google’s Gemini-pro, as well as medical-specialist LLMs like Microsoft’s BioGPT. ZODIAC demonstrates the transformative potential of specialized LLMs in healthcare by delivering domain-specific solutions that meet the stringent demands of medical practice. Notably, ZODIAC has been successfully integrated into electrocardiography (ECG) devices, exemplifying the growing trend of embedding LLMs into Software-as-Medical-Device (SaMD).

[AI-112] FLAG: Financial Long Document Classification via AMR-based GNN

链接: https://arxiv.org/abs/2410.02024
作者: Bolun (Namir) Xia,Mohammed J. Zaki,Aparna Gupta
关键词-EN: large language models, Abstract Meaning Representation, language models, advent of large, large language
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 8 pages, 3 figures, to be published in CIFEr Conference 2024 as “Semantic Graph Learning for Trend Prediction from Long Financial Documents”

点击查看摘要

Abstract:The advent of large language models (LLMs) has initiated much research into their various financial applications. However, in applying LLMs on long documents, semantic relations are not explicitly incorporated, and a full or arbitrarily sparse attention operation is employed. In recent years, progress has been made in Abstract Meaning Representation (AMR), which is a graph-based representation of text to preserve its semantic relations. Since AMR can represent semantic relationships at a deeper level, it can be beneficially utilized by graph neural networks (GNNs) for constructing effective document-level graph representations built upon LLM embeddings to predict target metrics in the financial domain. We propose FLAG: Financial Long document classification via AMR-based GNN, an AMR graph based framework to generate document-level embeddings for long financial document classification. We construct document-level graphs from sentence-level AMR graphs, endow them with specialized LLM word embeddings in the financial domain, apply a deep learning mechanism that utilizes a GNN, and examine the efficacy of our AMR-based approach in predicting labeled target data from long financial documents. Extensive experiments are conducted on a dataset of quarterly earnings calls transcripts of companies in various sectors of the economy, as well as on a corpus of more recent earnings calls of companies in the S&P 1500 Composite Index. We find that our AMR-based approach outperforms fine-tuning LLMs directly on text in predicting stock price movement trends at different time horizons in both datasets. Our work also outperforms previous work utilizing document graphs and GNNs for text classification.

[AI-113] DeepProtein: Deep Learning Library and Benchmark for Protein Sequence Learning

链接: https://arxiv.org/abs/2410.02023
作者: Jiaqing Xie,Yue Zhao,Tianfan Fu
关键词-EN: predicting protein properties, deep learning, recent years, enabling advancements, structural folding
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:In recent years, deep learning has revolutionized the field of protein science, enabling advancements in predicting protein properties, structural folding and interactions. This paper presents DeepProtein, a comprehensive and user-friendly deep learning library specifically designed for protein-related tasks. DeepProtein integrates a couple of state-of-the-art neural network architectures, which include convolutional neural network (CNN), recurrent neural network (RNN), transformer, graph neural network (GNN), and graph transformer (GT). It provides user-friendly interfaces, facilitating domain researchers in applying deep learning techniques to protein data. Also, we curate a benchmark that evaluates these neural architectures on a variety of protein tasks, including protein function prediction, protein localization prediction, and protein-protein interaction prediction, showcasing its superior performance and scalability. Additionally, we provide detailed documentation and tutorials to promote accessibility and encourage reproducible research. This library is extended from a well-known drug discovery library, DeepPurpose, and is publicly available at this https URL.

[AI-114] Review Non-convex Optimization Method for Machine Learning

链接: https://arxiv.org/abs/2410.02017
作者: Greg B Fotopoulos,Paul Popovich,Nicholas Hall Papadopoulos
关键词-EN: deep neural networks, support vector machines, Non-convex optimization, advancing machine learning, critical tool
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Non-convex optimization is a critical tool in advancing machine learning, especially for complex models like deep neural networks and support vector machines. Despite challenges such as multiple local minima and saddle points, non-convex techniques offer various pathways to reduce computational costs. These include promoting sparsity through regularization, efficiently escaping saddle points, and employing subsampling and approximation strategies like stochastic gradient descent. Additionally, non-convex methods enable model pruning and compression, which reduce the size of models while maintaining performance. By focusing on good local minima instead of exact global minima, non-convex optimization ensures competitive accuracy with faster convergence and lower computational overhead. This paper examines the key methods and applications of non-convex optimization in machine learning, exploring how it can lower computation costs while enhancing model performance. Furthermore, it outlines future research directions and challenges, including scalability and generalization, that will shape the next phase of non-convex optimization in machine learning.
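One of the techniques surveyed, promoting sparsity through regularization, can be sketched with proximal gradient descent (ISTA) on a least-squares loss; a stochastic variant would simply swap in minibatch gradients. The problem sizes and regularization strength below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sparse regression: the gradient step handles the smooth least-squares loss,
# soft-thresholding handles the non-smooth L1 term that promotes sparsity.
n, d = 200, 20
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]                 # only 3 truly active features
X = rng.normal(0, 1, (n, d))
y = X @ w_true + 0.01 * rng.normal(0, 1, n)

def soft_threshold(w, t):
    """Proximal operator of the L1 norm: shrink toward zero by t."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

w, lr, lam = np.zeros(d), 0.1, 0.05
for _ in range(500):
    grad = X.T @ (X @ w - y) / n              # gradient of the smooth part
    w = soft_threshold(w - lr * grad, lr * lam)

n_active = int(np.sum(np.abs(w) > 1e-8))      # recovered support size
```

The L1 proximal step drives irrelevant coordinates exactly to zero, recovering a small active set, which is the pruning/compression effect the survey highlights.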

[AI-115] Addressing Data Heterogeneity in Federated Learning with Adaptive Normalization-Free Feature Recalibration

链接: https://arxiv.org/abs/2410.02006
作者: Vasilis Siomos,Sergio Naval-Marimont,Jonathan Passerat-Palmbach,Giacomo Tarroni
关键词-EN: preserves stakeholders’ data, stakeholders’ data ownership, collaborative training paradigm, decentralized collaborative training, Normalization-free Feature Recalibration
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages

点击查看摘要

Abstract:Federated learning is a decentralized collaborative training paradigm that preserves stakeholders’ data ownership while improving performance and generalization. However, statistical heterogeneity among client datasets poses a fundamental challenge by degrading system performance. To address this issue, we propose Adaptive Normalization-free Feature Recalibration (ANFR), an architecture-level approach that combines weight standardization and channel attention. Weight standardization normalizes the weights of layers instead of activations. This is less susceptible to mismatched client statistics and inconsistent averaging, thereby more robust under heterogeneity. Channel attention produces learnable scaling factors for feature maps, suppressing those that are inconsistent between clients due to heterogeneity. We demonstrate that combining these techniques boosts model performance beyond their individual contributions, by enhancing class selectivity and optimizing channel attention weight distribution. ANFR operates independently of the aggregation method and is effective in both global and personalized federated learning settings, with minimal computational overhead. Furthermore, when training with differential privacy, ANFR achieves an appealing balance between privacy and utility, enabling strong privacy guarantees without sacrificing performance. By integrating weight standardization and channel attention in the backbone model, ANFR offers a novel and versatile approach to the challenge of statistical heterogeneity. We demonstrate through extensive experiments that ANFR consistently outperforms established baselines across various aggregation methods, datasets, and heterogeneity conditions.
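The two ingredients can be sketched independently: weight standardization normalizes each output channel's weights (so no batch statistics are needed, which is what makes it robust to mismatched client statistics), and a squeeze-and-excitation-style gate stands in for the channel attention. Shapes and the exact attention form are assumptions of this sketch, not the paper's specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def standardize_weights(W, eps=1e-5):
    """Weight standardization: zero-mean, unit-variance weights per output
    channel, normalizing weights instead of activations."""
    flat = W.reshape(W.shape[0], -1)
    mean = flat.mean(axis=1, keepdims=True)
    std = flat.std(axis=1, keepdims=True)
    return ((flat - mean) / (std + eps)).reshape(W.shape)

def channel_attention(feat, w1, w2):
    """SE-style channel attention: global pooling -> tiny MLP -> per-channel
    sigmoid gate that can suppress inconsistent channels."""
    pooled = feat.mean(axis=(1, 2))                  # (C,) global average pool
    hidden = np.maximum(w1 @ pooled, 0.0)            # ReLU
    scale = 1 / (1 + np.exp(-(w2 @ hidden)))         # gates in (0, 1)
    return feat * scale[:, None, None]

W = rng.normal(0, 1, (8, 3, 3, 3))                   # conv weight (out, in, k, k)
Ws = standardize_weights(W)

feat = rng.normal(0, 1, (8, 16, 16))                 # feature map (C, H, W)
w1 = rng.normal(0, 0.1, (4, 8))
w2 = rng.normal(0, 0.1, (8, 4))
out = channel_attention(feat, w1, w2)
```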

[AI-116] Normalizing Flow Based Metric for Image Generation

链接: https://arxiv.org/abs/2410.02004
作者: Pranav Jeevan,Neeraj Nixon,Amit Sethi
关键词-EN: dual-flow based likelihood, based likelihood distance, exact dual-flow based, flow-based likelihood distance, proposed metrics
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 15 pages, 16 figures

点击查看摘要

Abstract:We propose two new evaluation metrics to assess realness of generated images based on normalizing flows: a simpler and efficient flow-based likelihood distance (FLD) and a more exact dual-flow based likelihood distance (D-FLD). Because normalizing flows can be used to compute the exact likelihood, the proposed metrics assess how closely generated images align with the distribution of real images from a given domain. This property gives the proposed metrics a few advantages over the widely used Fréchet inception distance (FID) and other recent metrics. Firstly, the proposed metrics need only a few hundred images to stabilize (converge in mean), as opposed to tens of thousands needed for FID, and at least a few thousand for the other metrics. This allows confident evaluation of even small sets of generated images, such as validation batches inside training loops. Secondly, the network used to compute the proposed metric has over an order of magnitude fewer parameters compared to Inception-V3 used to compute FID, making it computationally more efficient. For assessing the realness of generated images in new domains (e.g., x-ray images), ideally these networks should be retrained on real images to model their distinct distributions. Thus, our smaller network will be even more advantageous for new domains. Extensive experiments show that the proposed metrics have the desired monotonic relationships with the extent of image degradation of various kinds.
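
Since a normalizing flow yields exact log-likelihoods, the metric amounts to comparing how likely generated samples are under a density model fit to real data. The sketch below substitutes a diagonal Gaussian for the trained flow purely for illustration (an assumption; the paper's FLD and D-FLD use actual flows, and the function names here are ours):

```python
import numpy as np

def mean_nll(x, mu, sigma):
    """Mean negative log-likelihood under a diagonal Gaussian, standing
    in for the exact NLL a trained normalizing flow would provide."""
    z = (x - mu) / sigma
    return float(np.mean(0.5 * z**2 + np.log(sigma) + 0.5 * np.log(2.0 * np.pi)))

def likelihood_distance(real, generated):
    """Toy flow-style realness score: the gap between the density
    model's NLL on generated vs. real samples, with the model fit
    on real data. Smaller means the generated samples sit closer
    to the real distribution."""
    mu, sigma = real.mean(axis=0), real.std(axis=0) + 1e-8
    return abs(mean_nll(generated, mu, sigma) - mean_nll(real, mu, sigma))
```

The claimed practical advantage carries over even to this toy: the score stabilizes with a few hundred samples because it is a mean of per-sample likelihoods, not a covariance estimate as in FID.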

[AI-117] UlcerGPT: A Multimodal Approach Leveraging Large Language and Vision Models for Diabetic Foot Ulcer Image Transcription ICPR2024

链接: https://arxiv.org/abs/2410.01989
作者: Reza Basiri,Ali Abedi,Chau Nguyen,Milos R. Popovic,Shehroz S. Khan
关键词-EN: Diabetic foot ulcers, lower limb amputations, Diabetic foot, DFU image transcription, DFU image
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 13 pages, 3 figures, ICPR 2024 Conference (PRHA workshop)

点击查看摘要

Abstract:Diabetic foot ulcers (DFUs) are a leading cause of hospitalizations and lower limb amputations, placing a substantial burden on patients and healthcare systems. Early detection and accurate classification of DFUs are critical for preventing serious complications, yet many patients experience delays in receiving care due to limited access to specialized services. Telehealth has emerged as a promising solution, improving access to care and reducing the need for in-person visits. The integration of artificial intelligence and pattern recognition into telemedicine has further enhanced DFU management by enabling automatic detection, classification, and monitoring from images. Despite advancements in artificial intelligence-driven approaches for DFU image analysis, the application of large language models for DFU image transcription has not yet been explored. To address this gap, we introduce UlcerGPT, a novel multimodal approach leveraging large language and vision models for DFU image transcription. This framework combines advanced vision and language models, such as Large Language and Vision Assistant and Chat Generative Pre-trained Transformer, to transcribe DFU images by jointly detecting, classifying, and localizing regions of interest. Through detailed experiments on a public dataset, evaluated by expert clinicians, UlcerGPT demonstrates promising results in the accuracy and efficiency of DFU transcription, offering potential support for clinicians in delivering timely care via telemedicine.

[AI-118] Lost-in-Distance: Impact of Contextual Proximity on LLM Performance in Graph Tasks

链接: https://arxiv.org/abs/2410.01985
作者: Hamed Firooz,Maziar Sanjabi,Wenlong Jiang,Xiaoling Zhai
关键词-EN: Large Language Models, Large Language, contextual data effectively, exhibit blind spots, process relevant contextual
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Despite significant advancements, Large Language Models (LLMs) exhibit blind spots that impair their ability to retrieve and process relevant contextual data effectively. We demonstrate that LLM performance in graph tasks with complexities beyond the “needle-in-a-haystack” scenario, where solving the problem requires cross-referencing and reasoning across multiple subproblems jointly, is influenced by the proximity of relevant information within the context, a phenomenon we term “lost-in-distance”. We examine two fundamental graph tasks: identifying common connections between two nodes and assessing similarity among three nodes, and show that the model’s performance in these tasks significantly depends on the relative positioning of common edges. We evaluate three publicly available LLMs (Llama-3-8B, Llama-3-70B, and GPT-4) using various graph encoding techniques that represent graph structures for LLM input. We propose a formulation for the lost-in-distance phenomenon and demonstrate that the lost-in-distance and lost-in-the-middle phenomena occur independently. Results indicate that model accuracy can decline by up to 6x as the distance between node connections increases, independent of graph encoding and model size.

[AI-119] LLMKG@VLDB24 Workshop Summary

链接: https://arxiv.org/abs/2410.01978
作者: Arijit Khan,Tianxing Wu,Xi Chen
关键词-EN: large language models, language models, knowledge graphs, hot topic, unification of large
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages, 1 figure

点击查看摘要

Abstract:The unification of large language models (LLMs) and knowledge graphs (KGs) has emerged as a hot topic. At the LLM+KG’24 workshop, held in conjunction with VLDB 2024 in Guangzhou, China, one of the key themes explored was important data management challenges and opportunities due to the effective interaction between LLMs and KGs. This report outlines the major directions and approaches presented by various speakers during the LLM+KG’24 workshop.

[AI-120] Enhancing Screen Time Identification in Children with a Multi-View Vision Language Model and Screen Time Tracker

链接: https://arxiv.org/abs/2410.01966
作者: Xinlong Hou,Sen Shen,Xueshen Li,Xinran Gao,Ziyi Huang,Steven J. Holiday,Matthew R. Cribbet,Susan W. White,Edward Sazonov,Yu Gan
关键词-EN: physical activity, childhood obesity, social interaction, accurately monitor, phenomena linked
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Prepare for submission

点击查看摘要

Abstract:Being able to accurately monitor the screen exposure of young children is important for research on phenomena linked to screen use such as childhood obesity, physical activity, and social interaction. Most existing studies rely upon self-report or manual measures from bulky wearable sensors, thus lacking efficiency and accuracy in capturing quantitative screen exposure data. In this work, we developed a novel sensor informatics framework that utilizes egocentric images from a wearable sensor, termed the screen time tracker (STT), and a vision language model (VLM). In particular, we devised a multi-view VLM that takes multiple views from egocentric image sequences and interprets screen exposure dynamically. We validated our approach by using a dataset of children’s free-living activities, demonstrating significant improvement over existing methods in plain vision language models and object detection models. Results supported the promise of this monitoring approach, which could optimize behavioral research on screen exposure in children’s naturalistic settings.

[AI-121] One-step Noisy Label Mitigation

链接: https://arxiv.org/abs/2410.01944
作者: Hao Li,Jiayang Gu,Jingkuan Song,An Zhang,Lianli Gao
关键词-EN: Mitigating the detrimental, large-scale pre-training tasks, increasingly critical, detrimental effects, large-scale pre-training
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 20 pages, 4 figures, 11 Tables

点击查看摘要

Abstract:Mitigating the detrimental effects of noisy labels on the training process has become increasingly critical, as obtaining entirely clean or human-annotated samples for large-scale pre-training tasks is often impractical. Nonetheless, existing noise mitigation methods often encounter limitations in practical applications due to their task-specific design, model dependency, and significant computational overhead. In this work, we exploit the properties of high-dimensional orthogonality to identify a robust and effective boundary in cone space for separating clean and noisy samples. Building on this, we propose One-step Anti-Noise (OSA), a model-agnostic noisy label mitigation paradigm that employs an estimator model and a scoring function to assess the noise level of input pairs through just one-step inference, a cost-efficient process. We empirically demonstrate the superiority of OSA, highlighting its enhanced training robustness, improved task transferability, ease of deployment, and reduced computational costs across various benchmarks, models, and tasks. Our code is released at this https URL.

[AI-122] CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL

链接: https://arxiv.org/abs/2410.01943
作者: Mohammadreza Pourreza,Hailong Li,Ruoxi Sun,Yeounoh Chung,Shayan Talaei,Gaurav Tarlok Kakkar,Yu Gan,Amin Saberi,Fatma Ozcan,Sercan O. Arik
关键词-EN: large language model, employs innovative strategies, improve candidate generation, binary-candidates selection LLM, single LLM call
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:In tackling the challenges of large language model (LLM) performance for Text-to-SQL tasks, we introduce CHASE-SQL, a new framework that employs innovative strategies, using test-time compute in multi-agent modeling to improve candidate generation and selection. CHASE-SQL leverages LLMs’ intrinsic knowledge to generate diverse and high-quality SQL candidates using different LLM generators with: (1) a divide-and-conquer method that decomposes complex queries into manageable sub-queries in a single LLM call; (2) chain-of-thought reasoning based on query execution plans, reflecting the steps a database engine takes during execution; and (3) a unique instance-aware synthetic example generation technique, which offers specific few-shot demonstrations tailored to the test question. To identify the best candidate, a selection agent is employed to rank the candidates through pairwise comparisons with a fine-tuned binary-candidates selection LLM. This selection approach has been demonstrated to be more robust over alternatives. The proposed generators-selector framework not only enhances the quality and diversity of SQL queries but also outperforms previous methods. Overall, our proposed CHASE-SQL achieves the state-of-the-art execution accuracy of 73.0% and 73.01% on the test set and development set of the notable BIRD Text-to-SQL dataset benchmark, rendering CHASE-SQL the top submission of the leaderboard (at the time of paper submission).
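
The selection step can be pictured as a round-robin tournament over candidates, with the fine-tuned binary-selection LLM abstracted into a pluggable `judge` function. This is a sketch with our own names, not the paper's code:

```python
from itertools import combinations

def select_best(candidates, judge):
    """Round-robin pairwise selection: judge(a, b) returns whichever
    of the two candidates it prefers (in CHASE-SQL this role is played
    by a fine-tuned binary-selection LLM); the candidate with the most
    pairwise wins is kept."""
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        wins[judge(a, b)] += 1
    return max(candidates, key=lambda c: wins[c])

# Toy usage: a stand-in judge that prefers the longer query.
best = select_best(
    ["SELECT 1", "SELECT id FROM t", "SELECT id, name FROM t"],
    judge=lambda a, b: a if len(a) >= len(b) else b,
)
```

Pairwise comparison is more robust than asking one model to score candidates in isolation, since the judge only ever has to make a binary choice.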

[AI-123] Don't flatten, tokenize! Unlocking the key to SoftMoE's efficacy in deep RL

链接: https://arxiv.org/abs/2410.01930
作者: Ghada Sokar,Johan Obando-Ceron,Aaron Courville,Hugo Larochelle,Pablo Samuel Castro
关键词-EN: model size increases, deep neural networks, reinforcement learning, size increases, deep neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The use of deep neural networks in reinforcement learning (RL) often suffers from performance degradation as model size increases. While soft mixtures of experts (SoftMoEs) have recently shown promise in mitigating this issue for online RL, the reasons behind their effectiveness remain largely unknown. In this work we provide an in-depth analysis identifying the key factors driving this performance gain. We discover the surprising result that tokenizing the encoder output, rather than the use of multiple experts, is what is behind the efficacy of SoftMoEs. Indeed, we demonstrate that even with an appropriately scaled single expert, we are able to maintain the performance gains, largely thanks to tokenization.
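
The distinction the paper isolates is essentially a reshaping choice at the encoder output; a minimal sketch (our own naming) of the two options:

```python
import numpy as np

def flatten_features(fmap):
    """Collapse an (h, w, c) encoder output into ONE long token,
    the conventional choice in deep RL value/policy networks."""
    return fmap.reshape(1, -1)

def tokenize_features(fmap):
    """Keep each spatial position as its own c-dimensional token,
    yielding h*w tokens -- the choice the paper identifies as the
    real source of SoftMoE's performance gains."""
    h, w, c = fmap.shape
    return fmap.reshape(h * w, c)
```

The paper's finding is that feeding the `h*w` per-position tokens to even a single appropriately scaled expert retains most of the benefit, so the tokenization, not the expert mixture, does the heavy lifting.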

[AI-124] LLM-Augmented Symbolic Reinforcement Learning with Landmark-Based Task Decomposition

链接: https://arxiv.org/abs/2410.01929
作者: Alireza Kheirandish,Duo Xu,Faramarz Fekri
关键词-EN: reinforcement learning, complex task, fundamental challenges, challenges in reinforcement, solving complex tasks
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:One of the fundamental challenges in reinforcement learning (RL) is to take a complex task and be able to decompose it to subtasks that are simpler for the RL agent to learn. In this paper, we report on our work that would identify subtasks by using some given positive and negative trajectories for solving the complex task. We assume that the states are represented by first-order predicate logic, using which we devise a novel algorithm to identify the subtasks. Then we employ a Large Language Model (LLM) to generate first-order logic rule templates for achieving each subtask. Such rules were then further fine-tuned into a rule-based policy via an Inductive Logic Programming (ILP)-based RL agent. Through experiments, we verify the accuracy of our algorithm in detecting subtasks, finding that it identifies all of the subtasks correctly. We also investigated the quality of the common-sense rules produced by the language model to achieve the subtasks. Our experiments show that our LLM-guided rule template generation can produce rules that are necessary for solving a subtask, which leads to solving complex tasks with fewer assumptions about predefined first-order logic predicates of the environment.

[AI-125] Risk Alignment in Agentic AI Systems

链接: https://arxiv.org/abs/2410.01927
作者: Hayley Clatterbuck,Clinton Castro,Arvo Muñoz Morán
关键词-EN: undertake complex actions, Agentic AIs, capable and permitted, permitted to undertake, undertake complex
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); General Economics (econ.GN)
*备注:

点击查看摘要

Abstract:Agentic AIs - AIs that are capable and permitted to undertake complex actions with little supervision - mark a new frontier in AI capabilities and raise new questions about how to safely create and align such systems with users, developers, and society. Because agents’ actions are influenced by their attitudes toward risk, one key aspect of alignment concerns the risk profiles of agentic AIs. Risk alignment will matter for user satisfaction and trust, but it will also have important ramifications for society more broadly, especially as agentic AIs become more autonomous and are allowed to control key aspects of our lives. AIs with reckless attitudes toward risk (either because they are calibrated to reckless human users or are poorly designed) may pose significant threats. They might also open ‘responsibility gaps’ in which there is no agent who can be held accountable for harmful actions. What risk attitudes should guide an agentic AI’s decision-making? How might we design AI systems that are calibrated to the risk attitudes of their users? What guardrails, if any, should be placed on the range of permissible risk attitudes? What are the ethical considerations involved when designing systems that make risky decisions on behalf of others? We present three papers that bear on key normative and technical aspects of these questions.

[AI-126] Provably Accurate Shapley Value Estimation via Leverage Score Sampling

链接: https://arxiv.org/abs/2410.01917
作者: Christopher Musco,R. Teal Witter
关键词-EN: specific input features, Originally introduced, Kernel SHAP, attribute model predictions, explainable machine learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Originally introduced in game theory, Shapley values have emerged as a central tool in explainable machine learning, where they are used to attribute model predictions to specific input features. However, computing Shapley values exactly is expensive: for a general model with n features, O(2^n) model evaluations are necessary. To address this issue, approximation algorithms are widely used. One of the most popular is the Kernel SHAP algorithm, which is model agnostic and remarkably effective in practice. However, to the best of our knowledge, Kernel SHAP has no strong non-asymptotic complexity guarantees. We address this issue by introducing Leverage SHAP, a light-weight modification of Kernel SHAP that provides provably accurate Shapley value estimates with just O(n log n) model evaluations. Our approach takes advantage of a connection between Shapley value estimation and agnostic active learning by employing leverage score sampling, a powerful regression tool. Beyond theoretical guarantees, we show that Leverage SHAP consistently outperforms even the highly optimized implementation of Kernel SHAP available in the ubiquitous SHAP library [Lundberg & Lee, 2017].
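
Leverage SHAP builds on leverage score sampling for regression. The sketch below shows only that generic primitive, solving a least-squares problem from a small leverage-weighted row sample; it is a simplification with our own names, not the paper's Shapley estimator:

```python
import numpy as np

def leverage_scores(A):
    """Leverage score of each row of A: the squared row norms of U
    from the thin SVD A = U S V^T."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return np.sum(U**2, axis=1)

def sampled_least_squares(A, b, m, rng):
    """Approximate argmin_x ||Ax - b|| from m rows sampled with
    probability proportional to leverage scores, reweighted by
    1/sqrt(m*p_i) so the subsampled objective is unbiased."""
    p = leverage_scores(A)
    p = p / p.sum()
    idx = rng.choice(len(A), size=m, replace=True, p=p)
    w = 1.0 / np.sqrt(m * p[idx])
    x, *_ = np.linalg.lstsq(A[idx] * w[:, None], b[idx] * w, rcond=None)
    return x
```

In Kernel SHAP, Shapley values are the solution of a weighted regression over coalitions; sampling coalitions by leverage score instead of uniformly is what yields the O(n log n) evaluation bound.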

[AI-127] A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation

链接: https://arxiv.org/abs/2410.01912
作者: Liang Chen,Sinan Tan,Zefan Cai,Weichu Xie,Haozhe Zhao,Yichi Zhang,Junyang Lin,Jinze Bai,Tianyu Liu,Baobao Chang
关键词-EN: information loss bottleneck, model architecture called, bottleneck of vector-quantization, autoregressive image generation, tackles the information
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 25 pages, 20 figures, code is open at this https URL

点击查看摘要

Abstract:This work tackles the information loss bottleneck of vector-quantization (VQ) autoregressive image generation by introducing a novel model architecture called the 2-Dimensional Autoregression (DnD) Transformer. The DnD-Transformer predicts more codes for an image by introducing a new autoregression direction, model depth, along with the sequence length direction. Compared to traditional 1D autoregression and previous work utilizing similar 2D image decomposition such as RQ-Transformer, the DnD-Transformer is an end-to-end model that can generate higher quality images with the same backbone model size and sequence length, opening a new optimization perspective for autoregressive image generation. Furthermore, our experiments reveal that the DnD-Transformer’s potential extends beyond generating natural images. It can even generate images with rich text and graphical elements in a self-supervised manner, demonstrating an understanding of these combined modalities. This has not been previously demonstrated for popular vision generative models such as diffusion models, showing a spark of vision-language intelligence when trained solely on images. Code, datasets and models are open at this https URL.

[AI-128] Social Media Authentication and Combating Deepfakes using Semi-fragile Invisible Image Watermarking

链接: https://arxiv.org/abs/2410.01906
作者: Aakash Varma Nadimpalli,Ajita Rattani
关键词-EN: severe societal concerns, raised severe societal, watermark removal attacks, deep generative models, video synthesis
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: ACM Transactions (Digital Threats: Research and Practice)

点击查看摘要

Abstract:With the significant advances in deep generative models for image and video synthesis, Deepfakes and manipulated media have raised severe societal concerns. Conventional machine learning classifiers for deepfake detection often fail to cope with evolving deepfake generation technology and are susceptible to adversarial attacks. Alternatively, invisible image watermarking is being researched as a proactive defense technique that allows media authentication by verifying an invisible secret message embedded in the image pixels. A handful of invisible image watermarking techniques introduced for media authentication have proven vulnerable to basic image processing operations and watermark removal attacks. In response, we have proposed a semi-fragile image watermarking technique that embeds an invisible secret message into real images for media authentication. Our proposed watermarking framework is designed to be fragile to facial manipulations or tampering while being robust to benign image-processing operations and watermark removal attacks. This is facilitated through a unique architecture of our proposed technique consisting of critic and adversarial networks that enforce high image quality and resiliency to watermark removal efforts, respectively, along with the backbone encoder-decoder and the discriminator networks. Thorough experimental investigations on SOTA facial Deepfake datasets demonstrate that our proposed model can embed a 64-bit secret as an imperceptible image watermark that can be recovered with a high-bit recovery accuracy when benign image processing operations are applied while being non-recoverable when unseen Deepfake manipulations are applied. In addition, our proposed watermarking technique demonstrates high resilience to several white-box and black-box watermark removal attacks, thus obtaining state-of-the-art performance.

[AI-129] The potential of LLM-generated reports in DevSecOps

链接: https://arxiv.org/abs/2410.01899
作者: Nikolaos Lykousas,Vasileios Argyropoulos,Fran Casino
关键词-EN: common issue faced, faced by software, Alert fatigue, software teams, DevSecOps paradigm
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注: Published in AIESE 2024 (International Conference on AI empowered Software Engineering)

点击查看摘要

Abstract:Alert fatigue is a common issue faced by software teams using the DevSecOps paradigm. The overwhelming number of warnings and alerts generated by security and code scanning tools, particularly in smaller teams where resources are limited, leads to desensitization and diminished responsiveness to security warnings, potentially exposing systems to vulnerabilities. This paper explores the potential of LLMs in generating actionable security reports that emphasize the financial impact and consequences of detected security issues, such as credential leaks, if they remain unaddressed. A survey conducted among developers indicates that LLM-generated reports significantly enhance the likelihood of immediate action on security issues by providing clear, comprehensive, and motivating insights. Integrating these reports into DevSecOps workflows can mitigate attention saturation and alert fatigue, ensuring that critical security warnings are addressed effectively.

[AI-130] Auction-Based Regulation for Artificial Intelligence

链接: https://arxiv.org/abs/2410.01871
作者: Marco Bornstein,Zora Che,Suhas Julapalli,Abdirisak Mohamed,Amrit Singh Bedi,Furong Huang
关键词-EN: broken Artificial Intelligence, Artificial Intelligence, legal pieces left, broken Artificial, moving fast
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); General Economics (econ.GN)
*备注: 20 pages, 7 figures

点击查看摘要

Abstract:In an era of “moving fast and breaking things”, regulators have moved slowly to pick up the safety, bias, and legal pieces left in the wake of broken Artificial Intelligence (AI) deployment. Since AI models, such as large language models, are able to push misinformation and stoke division within our society, it is imperative for regulators to employ a framework that mitigates these dangers and ensures user safety. While there is much-warranted discussion about how to address the safety, bias, and legal woes of state-of-the-art AI models, the number of rigorous and realistic mathematical frameworks to regulate AI safety is lacking. We take on this challenge, proposing an auction-based regulatory mechanism that provably incentivizes model-building agents (i) to deploy safer models and (ii) to participate in the regulation process. We provably guarantee, via derived Nash Equilibria, that each participating agent’s best strategy is to submit a model safer than a prescribed minimum-safety threshold. Empirical results show that our regulatory auction boosts safety and participation rates by 20% and 15% respectively, outperforming simple regulatory frameworks that merely enforce minimum safety standards.

[AI-131] Enhancing LLM Fine-tuning for Text-to-SQLs by SQL Quality Measurement

链接: https://arxiv.org/abs/2410.01869
作者: Shouvon Sarker,Xishuang Dong,Xiangfang Li,Lijun Qian
关键词-EN: effortlessly retrieve desired, retrieve desired information, natural language queries, enables non-expert users, Large Language Models
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Text-to-SQL enables non-expert users to effortlessly retrieve desired information from relational databases using natural language queries. While recent advancements, particularly with Large Language Models (LLMs) like GPT and T5, have shown impressive performance on large-scale benchmarks such as BIRD, current state-of-the-art (SOTA) LLM-based Text-to-SQL models often require significant efforts to develop auxiliary tools like SQL classifiers to achieve high performance. This paper proposes a novel approach that only needs SQL Quality Measurement to enhance LLM-based Text-to-SQL performance. It establishes a SQL quality evaluation mechanism to assess the generated SQL queries against predefined criteria and actual database responses. This feedback loop enables continuous learning and refinement of model outputs based on both syntactic correctness and semantic accuracy. The proposed method undergoes comprehensive validation on the BIRD benchmark, assessing Execution Accuracy (EX) and Valid Efficiency Score (VES) across various Text-to-SQL difficulty levels. Experimental results reveal competitive performance in both EX and VES compared to SOTA models like GPT-4 and T5.
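
An execution-based quality signal of the kind described can be sketched with SQLite: run the generated and gold queries and compare result sets order-insensitively. This is our own minimal stand-in for the paper's evaluation mechanism, not its implementation:

```python
import sqlite3

def execution_accuracy(conn, generated_sql, gold_sql):
    """Execution-based score: 1.0 if the generated query returns the
    same result set as the gold query (order-insensitive), else 0.0.
    Invalid SQL also scores 0.0, capturing syntactic correctness and
    semantic accuracy in one signal."""
    try:
        got = sorted(conn.execute(generated_sql).fetchall())
    except sqlite3.Error:
        return 0.0
    gold = sorted(conn.execute(gold_sql).fetchall())
    return 1.0 if got == gold else 0.0
```

Scores like this, computed against the live database, are exactly the kind of feedback that can be folded back into fine-tuning without building a separate SQL classifier.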

[AI-132] House of Cards: Massive Weights in LLMs

链接: https://arxiv.org/abs/2410.01866
作者: Jaehoon Oh,Seungjun Shin,Dokwan Oh
关键词-EN: large language models, massive weights, Massive activations, Massive, weights
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Under review

点击查看摘要

Abstract:Massive activations, which manifest in specific feature dimensions of hidden states, introduce a significant bias in large language models (LLMs), leading to an overemphasis on the corresponding token. In this paper, we identify that massive activations originate not from the hidden state but from the intermediate state of a feed-forward network module in an early layer. Expanding on the previous observation that massive activations occur only in specific feature dimensions, we dive deep into the weights that cause massive activations. Specifically, we define top-k massive weights as the weights that contribute to the dimensions with the top-k magnitudes in the intermediate state. When these massive weights are set to zero, the functionality of LLMs is entirely disrupted. However, when all weights except for massive weights are set to zero, it results in a relatively minor performance drop, even though a much larger number of weights are set to zero. This implies that during the pre-training process, learning is dominantly focused on massive weights. Building on this observation, we propose a simple plug-and-play method called MacDrop (massive weights curriculum dropout), to rely less on massive weights during parameter-efficient fine-tuning. This method applies dropout to the pre-trained massive weights, starting with a high dropout probability and gradually decreasing it as fine-tuning progresses. Through experiments, we demonstrate that MacDrop generally improves performance across zero-shot downstream tasks and generation tasks.
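
MacDrop's schedule can be sketched as a mask that drops only the pre-identified massive weights, with a probability that decays over fine-tuning. The linear decay and all names below are our assumptions for illustration:

```python
import numpy as np

def macdrop_mask(massive_idx, shape, step, total_steps,
                 p_start=0.9, p_end=0.0, rng=None):
    """Dropout mask over ONLY the pre-identified massive weights,
    with drop probability decaying (linearly -- our assumption) from
    p_start to p_end as fine-tuning progresses. All other weights
    are always kept."""
    rng = rng or np.random.default_rng()
    p = p_start + (p_end - p_start) * (step / total_steps)
    mask = np.ones(shape)
    for i in massive_idx:          # i is a (row, col) coordinate
        if rng.random() < p:
            mask[i] = 0.0
    return mask
```

Applied multiplicatively to the frozen pre-trained weight matrix at each step, such a mask forces the adapter being fine-tuned to rely less on the massive weights early on, then gradually restores them.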

[AI-133] Simplifying complex machine learning by linearly separable network embedding spaces

链接: https://arxiv.org/abs/2410.01865
作者: Alexandros Xenos,Noel-Malod Dognin,Natasa Przulj
关键词-EN: Low-dimensional embeddings, Low-dimensional, network, embedding, network data
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 26 pages, 8 figures

点击查看摘要

Abstract:Low-dimensional embeddings are a cornerstone in the modelling and analysis of complex networks. However, most existing approaches for mining network embedding spaces rely on computationally intensive machine learning systems to facilitate downstream tasks. In the field of NLP, word embedding spaces capture semantic relationships linearly, allowing for information retrieval using simple linear operations on word embedding vectors. Here, we demonstrate that there are structural properties of network data that yield this linearity. We show that the more homophilic the network representation, the more linearly separable the corresponding network embedding space, yielding better downstream analysis results. Hence, we introduce novel graphlet-based methods enabling embedding of networks into more linearly separable spaces, allowing for their better mining. Our fundamental insights into the structural properties of network data that enable its linear mining give the ML community a foundation to build upon, towards efficient and explainable mining of complex network data.

[AI-134] Explainable Diagnosis Prediction through Neuro-Symbolic Integration

链接: https://arxiv.org/abs/2410.01855
作者: Qiuhao Lu,Rui Li,Elham Sagheb,Andrew Wen,Jinlian Wang,Liwei Wang,Jungwei W. Fan,Hongfang Liu
关键词-EN: impact patient outcomes, significantly impact patient, Logical Neural Networks, patient outcomes, critical task
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Diagnosis prediction is a critical task in healthcare, where timely and accurate identification of medical conditions can significantly impact patient outcomes. Traditional machine learning and deep learning models have achieved notable success in this domain but often lack interpretability which is a crucial requirement in clinical settings. In this study, we explore the use of neuro-symbolic methods, specifically Logical Neural Networks (LNNs), to develop explainable models for diagnosis prediction. Essentially, we design and implement LNN-based models that integrate domain-specific knowledge through logical rules with learnable thresholds. Our models, particularly M_multi-pathway and M_comprehensive, demonstrate superior performance over traditional models such as Logistic Regression, SVM, and Random Forest, achieving higher accuracy (up to 80.52%) and AUROC scores (up to 0.8457) in the case study of diabetes prediction. The learned weights and thresholds within the LNN models provide direct insights into feature contributions, enhancing interpretability without compromising predictive power. These findings highlight the potential of neuro-symbolic approaches in bridging the gap between accuracy and explainability in healthcare AI applications. By offering transparent and adaptable diagnostic models, our work contributes to the advancement of precision medicine and supports the development of equitable healthcare solutions. Future research will focus on extending these methods to larger and more diverse datasets to further validate their applicability across different medical conditions and populations.
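
A Logical Neural Network realizes rules with learnable weights and thresholds via soft logic; a common Łukasiewicz-style weighted AND (a generic LNN building block, not the paper's fitted model, with illustrative values only) looks like:

```python
import numpy as np

def weighted_and(inputs, weights, beta):
    """Lukasiewicz-style weighted soft AND: truth value
    clip(beta - sum_i w_i * (1 - x_i), 0, 1). The learnable weights
    and bias/threshold beta control how strictly each antecedent
    must hold for the rule to fire, which is what makes the fitted
    rule directly inspectable."""
    x = np.asarray(inputs, dtype=float)
    return float(np.clip(beta - np.sum(np.asarray(weights) * (1.0 - x)), 0.0, 1.0))
```

Reading off the learned `weights` and `beta` after training is exactly the interpretability mechanism the abstract describes: each weight states how much a feature's falsity pulls the diagnosis rule toward false.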

[AI-135] Bayes-CATSI: A variational Bayesian approach for medical time series data imputation

链接: https://arxiv.org/abs/2410.01847
作者: Omkar Kulkarni,Rohitash Chandra
关键词-EN: Time Series Imputation, time series datasets, Context-Aware Time Series, series datasets feature, datasets feature missing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Medical time series datasets feature missing values that need data imputation methods; however, conventional machine learning models fall short due to a lack of uncertainty quantification in predictions. Among these models, the CATSI (Context-Aware Time Series Imputation) stands out for its effectiveness by incorporating a context vector into the imputation process, capturing the global dependencies of each patient. In this paper, we propose a Bayesian Context-Aware Time Series Imputation (Bayes-CATSI) framework which leverages uncertainty quantification offered by variational inference. We consider the time series derived from electroencephalography (EEG), electrooculography (EOG), electromyography (EMG), and electrocardiography (EKG). Variational inference assumes the shape of the posterior distribution and, through minimization of the Kullback-Leibler (KL) divergence, finds variational densities that are closest to the true posterior distribution. Thus, we integrate the variational Bayesian deep learning layers into the CATSI model. Our results show that Bayes-CATSI not only provides uncertainty quantification but also achieves superior imputation performance compared to the CATSI model. Specifically, an instance of Bayes-CATSI outperforms CATSI by 9.57%. We provide an open-source code implementation for applying Bayes-CATSI to other medical data imputation problems.
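
变分推断所最小化的 KL 散度,在两个一维高斯分布之间有闭式解;下面给出这一通用公式的示意(非 Bayes-CATSI 源码):

```python
import math

def kl_gaussian(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(q || p) between two univariate Gaussians: the
    per-parameter quantity a variational Bayesian layer penalizes so
    that the variational density q stays close to the prior p."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)

# A variational density identical to the prior incurs zero KL cost;
# shifting its mean away from the prior increases the cost.
zero_cost = kl_gaussian(0.0, 1.0, 0.0, 1.0)
shifted_cost = kl_gaussian(1.0, 1.0, 0.0, 1.0)
```

训练目标即在重建损失之外加上这类 KL 项,使每个权重的后验近似保持可解释的不确定性。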

[AI-136] Target Pose Guided Whole-body Grasping Motion Generation for Digital Humans

链接: https://arxiv.org/abs/2410.01840
作者: Quanquan Shao,Yi Fang
关键词-EN: daily life objects, Grasping motion generation, Grasping, grasping motion, fundamental mode
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Graphics (cs.GR)
*备注: 7 pages,5 figures

点击查看摘要

Abstract:Grasping manipulation is a fundamental mode for human interaction with daily-life objects. The synthesis of grasping motion is also greatly demanded in many applications such as animation and robotics. In the object-grasping research field, most works focus on generating the last static grasping pose with a parallel gripper or dexterous hand. Grasping motion generation for the full arm, especially for a full humanlike intelligent agent, is still under-explored. In this work, we propose a grasping motion generation framework for a digital human, an anthropomorphic intelligent agent with high degrees of freedom in a virtual world. Given an object's known initial pose in 3D space, we first generate a target pose for the whole-body digital human based on off-the-shelf target grasping pose generation methods. With the initial pose and this generated target pose, a transformer-based neural network is used to generate the whole grasping trajectory, which connects the initial and target poses smoothly and naturally. Additionally, two post-optimization components are designed to mitigate the foot-skating issue and hand-object interpenetration, respectively. Experiments are conducted on the GRAB dataset to demonstrate the effectiveness of the proposed method for whole-body grasping motion generation with randomly placed unknown objects.
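
“平滑连接初始位姿与目标位姿”这一要求,可用经典的最小加加速度(minimum-jerk)时间缩放来示意;这是点到点平滑运动的常见基线,仅为示意,并非论文中基于 Transformer 的方法:

```python
def min_jerk(t):
    """Minimum-jerk time-scaling s(t) for t in [0, 1]: starts and ends
    with zero velocity and acceleration, a standard smooth baseline."""
    return 10 * t**3 - 15 * t**4 + 6 * t**5

def interpolate_pose(q_init, q_target, t):
    """Blend each pose coordinate from the initial pose toward the
    target pose along the minimum-jerk profile."""
    s = min_jerk(t)
    return [a + s * (b - a) for a, b in zip(q_init, q_target)]

# Halfway in time lands exactly halfway in space for this profile.
pose = interpolate_pose([0.0, 0.5], [1.0, 1.5], 0.5)
```

学习式方法(如本文的 Transformer)相对这类解析基线的优势在于能生成更自然、与物体交互一致的轨迹。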

[AI-137] Temporal Graph Memory Networks For Knowledge Tracing

链接: https://arxiv.org/abs/2410.01836
作者: Seif Gad,Sherif Abdelfattah,Ghodai Abdelrahman
关键词-EN: past exercise answering, automatic tutoring systems, student knowledge growth, past exercise, exercise answering
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tracing a student’s knowledge growth given past exercise answers is a vital objective in automatic tutoring systems to customize the learning experience. Yet, achieving this objective is a non-trivial task as it involves modeling the knowledge state across multiple knowledge components (KCs) while considering their temporal and relational dynamics during the learning process. Knowledge tracing methods have tackled this task by either modeling KCs’ temporal dynamics using recurrent models or relational dynamics across KCs and questions using graph models. However, there is a lack of methods that can learn a joint embedding of the task’s relational and temporal dynamics. Moreover, many methods that account for the impact of a student’s forgetting behavior during the learning process use hand-crafted features, limiting their generalization to different scenarios. In this paper, we propose a novel method that jointly models the relational and temporal dynamics of the knowledge state using a deep temporal graph memory network. In addition, we propose a generic technique for representing a student’s forgetting behavior using temporal decay constraints on the graph memory module. We demonstrate the effectiveness of our proposed method using multiple knowledge tracing benchmarks while comparing it to state-of-the-art methods.
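
对遗忘行为施加时间衰减约束的思路,可用指数衰减来示意:按距上次练习的时间对记忆槽做指数降权。此处的函数形式与半衰期常数仅为示意性假设,并非论文实现:

```python
def decay_memory(memory, elapsed_days, half_life_days):
    """Exponentially decay each memory slot: after one half-life the
    retained knowledge strength is halved, mimicking forgetting."""
    factor = 0.5 ** (elapsed_days / half_life_days)
    return [m * factor for m in memory]

# A skill practiced 7 days ago, with a 7-day half-life, retains half its strength.
decayed = decay_memory([0.8, 0.4], elapsed_days=7, half_life_days=7)
```

相比手工特征,论文的做法是把此类衰减作为图记忆模块上的可学习约束,从而在不同场景间泛化。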

[AI-138] Analysis of Convolutional Neural Network-based Image Classifications: A Multi-Featured Application for Rice Leaf Disease Prediction and Recommendations for Farmers

链接: https://arxiv.org/abs/2410.01827
作者: Biplov Paneru,Bishwash Paneru,Krishna Bikram Shah
关键词-EN: convolutional neural network, neural network, precision agriculture, study presents, method for improving
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:This study presents a novel method for improving rice disease classification using 8 different convolutional neural network (CNN) algorithms, which will further the field of precision agriculture, together with a Tkinter-based application that offers farmers a feature-rich interface. With the help of this cutting-edge application, farmers will be able to make timely and well-informed decisions through real-time disease prediction and personalized recommendations. Together with the user-friendly Tkinter interface, the smooth integration of cutting-edge CNN transfer learning algorithms, including ResNet-50, InceptionV3, VGG16, and MobileNetV2, with the UCI dataset represents a major advancement toward modernizing agricultural practices and guaranteeing sustainable crop management. Remarkable outcomes include 75% accuracy for ResNet-50, 90% for DenseNet121, 84% for VGG16, 95.83% for MobileNetV2, 91.61% for DenseNet169, and 86% for InceptionV3. These results give a concise summary of the models’ capabilities, assisting researchers in choosing appropriate strategies for precise and successful rice crop disease identification. Severe overfitting was observed on VGG19 (70% accuracy) and NASNet (80.02% accuracy). ResNet101 achieved only 54% accuracy, and EfficientNetB0 only 33%. A MobileNetV2-trained model was successfully deployed in a Tkinter GUI application to make predictions from images or real-time video capture.

[AI-139] AI Conversational Interviewing: Transforming Surveys with LLMs as Adaptive Interviewers

链接: https://arxiv.org/abs/2410.01824
作者: Alexander Wuttke,Matthias Aßenmacher,Christopher Klamm,Max M. Lang,Quirin Würschinger,Frauke Kreuter
关键词-EN: structured surveys enable, eliciting people opinions, people opinions face, surveys enable large-scale, limit respondents’ ability
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Traditional methods for eliciting people’s opinions face a trade-off between depth and scale: structured surveys enable large-scale data collection but limit respondents’ ability to express unanticipated thoughts in their own words, while conversational interviews provide deeper insights but are resource-intensive. This study explores the potential of replacing human interviewers with large language models (LLMs) to conduct scalable conversational interviews. Our goal is to assess the performance of AI Conversational Interviewing and to identify opportunities for improvement in a controlled environment. We conducted a small-scale, in-depth study with university students who were randomly assigned to be interviewed by either AI or human interviewers, both employing identical questionnaires on political topics. Various quantitative and qualitative measures assessed interviewer adherence to guidelines, response quality, participant engagement, and overall interview efficacy. The findings indicate the viability of AI Conversational Interviewing in producing quality data comparable to traditional methods, with the added benefit of scalability. Based on our experiences, we present specific recommendations for effective implementation.

[AI-140] The Importance of Causality in Decision Making: A Perspective on Recommender Systems RECSYS’24

链接: https://arxiv.org/abs/2410.01822
作者: Emanuele Cavenaghi,Alessio Zanga,Fabio Stella,Markus Zanker
关键词-EN: receiving increasing attention, transform accurate predictions, Recommendation Systems, explainable decisions, receiving increasing
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: Accepted at the CONSEQUENCES '24 workshop, co-located with ACM RecSys '24

点击查看摘要

Abstract:Causality is receiving increasing attention in the Recommendation Systems (RSs) community, which has realised that RSs could greatly benefit from causality to transform accurate predictions into effective and explainable decisions. Indeed, the RS literature has repeatedly highlighted that, in real-world scenarios, recommendation algorithms suffer many types of biases since assumptions ensuring unbiasedness are likely not met. In this discussion paper, we formulate the RS problem in terms of causality, using potential outcomes and structural causal models, by giving formal definitions of the causal quantities to be estimated and a general causal graph to serve as a reference to foster future research and development.

[AI-141] NFDIcore 2.0: A BFO-Compliant Ontology for Multi-Domain Research Infrastructures

链接: https://arxiv.org/abs/2410.01821
作者: Oleksandra Bruns(1,2),Tabea Tietz(1,2),Joerg Waitelonis(1,2),Etienne Posthumus(2),Harald Sack(1,2)
关键词-EN: Research Data Infrastructure, National Research Data, Basic Formal Ontology, Data Infrastructure, Basic Formal
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper presents NFDIcore 2.0, an ontology compliant with the Basic Formal Ontology (BFO) designed to represent the diverse research communities of the National Research Data Infrastructure (NFDI) in Germany. NFDIcore ensures the interoperability across various research disciplines, thereby facilitating cross-domain research. Each domain’s individual requirements are addressed through specific ontology modules. This paper discusses lessons learned during the ontology development and mapping process, supported by practical validation through use cases in diverse research domains. The originality of NFDIcore lies in its adherence to BFO, the use of SWRL rules for efficient knowledge discovery, and its modular, extensible design tailored to meet the needs of heterogeneous research domains.

[AI-142] PixelBytes: Catching Unified Representation for Multimodal Generation

链接: https://arxiv.org/abs/2410.01820
作者: Fabien Furfaro
关键词-EN: report introduces PixelBytes, report introduces, unified multimodal representation, Image Transformers, Recurrent Neural Networks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This report introduces PixelBytes, a novel approach for unified multimodal representation learning. Inspired by existing sequence models such as Image Transformers, PixelCNN, and Mamba-Bytes, our method aims to capture diverse inputs in a cohesive representation, exploring the integration of different data types, particularly text, audio, and pixelated images (sprites). We conducted experiments on a specialized PixelBytes Pokémon dataset. Initially, we investigated various model architectures, including Recurrent Neural Networks (RNNs), State Space Models (SSMs), and Attention-based models, focusing on bidirectional processing and our convolutional PxBy embedding technique. Subsequently, we evaluated models based on data reduction strategies and the effectiveness of autoregressive learning. We specifically examined Long Short-Term Memory (LSTM) networks in both predictive and autoregressive modes for our main experiments. Our findings suggest that autoregressive models outperform predictive models in this context. By adopting a flexible approach to multimodal modeling, PixelBytes contributes to the ongoing development of foundation models capable of understanding and generating multimodal data. The complete PixelBytes project, including code, models, and datasets, is available online.

[AI-143] Strategic AI Governance: Insights from Leading Nations

链接: https://arxiv.org/abs/2410.01819
作者: Dian W. Tjondronegoro
关键词-EN: Artificial Intelligence, data privacy, potential to revolutionize, hindered by concerns, concerns about data
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 21 pages, 3 Figures, 5 Tables

点击查看摘要

Abstract:Artificial Intelligence (AI) has the potential to revolutionize various sectors, yet its adoption is often hindered by concerns about data privacy, security, and the understanding of AI capabilities. This paper synthesizes AI governance approaches, strategic themes, and enablers and challenges for AI adoption by reviewing national AI strategies from leading nations. The key contribution is the development of an EPIC (Education, Partnership, Infrastructure, Community) framework, which maps AI implementation requirements to fully realize social impacts and public good from successful and sustained AI deployment. Through a multi-perspective content analysis of the latest AI strategy documents, this paper provides a structured comparison of AI governance strategies across nations. The findings offer valuable insights for governments, academics, industries, and communities to enable responsible and trustworthy AI deployments. Future work should focus on incorporating specific requirements for developing countries and applying the strategies to specific AI applications, industries, and the public sector.

[AI-144] Integrating AIs Carbon Footprint into Risk Management Frameworks: Strategies and Tools for Sustainable Compliance in Banking Sector

链接: https://arxiv.org/abs/2410.01818
作者: Nataliya Tkachenko
关键词-EN: Corporate Sustainability Reporting, Corporate Sustainability Due, Sustainability Reporting Directive, Sustainability Due Diligence, Due Diligence Directive
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper examines the integration of AI’s carbon footprint into the risk management frameworks (RMFs) of the banking sector, emphasising its importance in aligning with sustainability goals and regulatory requirements. As AI becomes increasingly central to banking operations, its energy-intensive processes contribute significantly to carbon emissions, posing environmental, regulatory, and reputational risks. Regulatory frameworks such as the EU AI Act, Corporate Sustainability Reporting Directive (CSRD), Corporate Sustainability Due Diligence Directive (CSDDD), and the Prudential Regulation Authority’s SS1/23 are driving banks to incorporate environmental considerations into their AI model governance. Recent advancements in AI research, like the Open Mixture-of-Experts (OLMoE) framework and the Agentic RAG framework, offer more efficient and dynamic AI models, reducing their carbon footprint without compromising performance. Using these technological examples, the paper outlines a structured approach for banks to identify, assess, and mitigate AI’s carbon footprint within their RMFs, including adopting energy-efficient models, utilising green cloud computing, and implementing lifecycle management.

[AI-145] From Experts to the Public: Governing Multimodal Language Models in Politically Sensitive Video Analysis

链接: https://arxiv.org/abs/2410.01817
作者: Tanusree Sharma,Yujin Potter,Zachary Kilhoffer,Yun Huang,Dawn Song,Yang Wang
关键词-EN: large language models, multimodal large language, politically sensitive videos, language models, focusing on analyses
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:This paper examines the governance of multimodal large language models (MM-LLMs) through individual and collective deliberation, focusing on analyses of politically sensitive videos. We conducted a two-step study: first, interviews with 10 journalists established a baseline understanding of expert video interpretation; second, 114 individuals from the general public engaged in deliberation using this http URL, a platform that facilitates democratic decision-making through decentralized autonomous organization (DAO) mechanisms. Our findings show that while experts emphasized emotion and narrative, the general public prioritized factual clarity, objectivity of the situation, and emotional neutrality. Additionally, we explored the impact of different governance mechanisms: quadratic vs. weighted ranking voting and equal vs. 20-80 power distributions on users’ decision-making about how AI should behave. Specifically, quadratic voting enhanced perceptions of liberal democracy and political equality, and participants who were more optimistic about AI perceived the voting process to have a higher level of participatory democracy. Our results suggest the potential of applying DAO mechanisms to help democratize AI governance.
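
研究中对比的二次方投票(quadratic voting)机制,用几行代码即可示意:在单一议题上投 n 票需花费 n² 个积分,因此把偏好表达强一倍,代价是四倍。这是该机制的通用示意,并非平台代码:

```python
def quadratic_cost(votes):
    """Credits consumed to cast `votes` votes on a single issue."""
    return votes * votes

def max_votes(budget):
    """Largest whole number of votes a credit budget allows on one issue."""
    return int(budget ** 0.5)

# Doubling one's voice quadruples the cost, which tempers vote concentration
# and is why quadratic voting is associated with perceptions of political equality.
double_voice_premium = quadratic_cost(4) // quadratic_cost(2)
```

与加权排序投票相比,这一超线性成本抑制了少数人集中投票压制多数的情形。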

[AI-146] Automatic Scene Generation: State-of-the-Art Techniques, Models, Datasets, Challenges, and Future Prospects

链接: https://arxiv.org/abs/2410.01816
作者: Awal Ahmed Fime,Saifuddin Mahmud,Arpita Das,Md. Sunzidul Islam,Hong-Hoon Kim
关键词-EN: Automatic scene generation, scene generation, Automatic scene, Generative Adversarial Networks, applications in robotics
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 59 pages, 16 figures, 3 tables, 36 equations, 348 references

点击查看摘要

Abstract:Automatic scene generation is an essential area of research with applications in robotics, recreation, visual representation, training and simulation, education, and more. This survey provides a comprehensive review of the current state of the art in automatic scene generation, focusing on techniques that leverage machine learning, deep learning, embedded systems, and natural language processing (NLP). We categorize the models into four main types: Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), Transformers, and Diffusion Models. Each category is explored in detail, discussing various sub-models and their contributions to the field. We also review the most commonly used datasets, such as COCO-Stuff, Visual Genome, and MS-COCO, which are critical for training and evaluating these models. Methodologies for scene generation are examined, including image-to-3D conversion, text-to-3D generation, UI/layout design, graph-based methods, and interactive scene generation. Evaluation metrics such as Frechet Inception Distance (FID), Kullback-Leibler (KL) Divergence, Inception Score (IS), Intersection over Union (IoU), and Mean Average Precision (mAP) are discussed in the context of their use in assessing model performance. The survey identifies key challenges and limitations in the field, such as maintaining realism, handling complex scenes with multiple objects, and ensuring consistency in object relationships and spatial arrangements. By summarizing recent advances and pinpointing areas for improvement, this survey aims to provide a valuable resource for researchers and practitioners working on automatic scene generation.

[AI-147] AI in Food Marketing from Personalized Recommendations to Predictive Analytics: Comparing Traditional Advertising Techniques with AI-Driven Strategies

链接: https://arxiv.org/abs/2410.01815
作者: Elham Khamoushi
关键词-EN: Artificial Intelligence, providing advanced techniques, providing advanced, revolutionized food marketing, campaign optimization
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Artificial Intelligence (AI) has revolutionized food marketing by providing advanced techniques for personalized recommendations, consumer behavior prediction, and campaign optimization. This paper explores the shift from traditional advertising methods, such as TV, radio, and print, to AI-driven strategies. Traditional approaches were successful in building brand awareness but lacked the level of personalization that modern consumers demand. AI leverages data from consumer purchase histories, browsing behaviors, and social media activity to create highly tailored marketing campaigns. These strategies allow for more accurate product recommendations, prediction of consumer needs, and ultimately improve customer satisfaction and user experience. AI enhances marketing efforts by automating labor-intensive processes, leading to greater efficiency and cost savings. It also enables the continuous adaptation of marketing messages, ensuring they remain relevant and engaging over time. While AI presents significant benefits in terms of personalization and efficiency, it also comes with challenges, particularly the substantial investment required for technology and skilled expertise. This paper compares the strengths and weaknesses of traditional and AI-driven food marketing techniques, offering valuable insights into how marketers can leverage AI to create more effective and targeted marketing strategies in the evolving digital landscape.

[AI-148] Privacy-Preserving SAM Quantization for Efficient Edge Intelligence in Healthcare

链接: https://arxiv.org/abs/2410.01813
作者: Zhikai Li,Jing Zhang,Qingyi Gu
关键词-EN: pressing social issue, healthcare personnel expertise, personnel expertise, pressing social, social issue
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The disparity in healthcare personnel expertise and medical resources across different regions of the world is a pressing social issue. Artificial intelligence technology offers new opportunities to alleviate this issue. Segment Anything Model (SAM), which excels in intelligent image segmentation, has demonstrated exceptional performance in medical monitoring and assisted diagnosis. Unfortunately, the huge computational and storage overhead of SAM poses significant challenges for deployment on resource-limited edge devices. Quantization is an effective solution for model compression; however, traditional methods rely heavily on original data for calibration, which raises widespread concerns about medical data privacy and security. In this paper, we propose a data-free quantization framework for SAM, called DFQ-SAM, which learns and calibrates quantization parameters without any original data, thus effectively preserving data privacy during model compression. Specifically, we propose pseudo-positive label evolution for segmentation, combined with patch similarity, to fully leverage the semantic and distribution priors in pre-trained models, which facilitates high-quality data synthesis as a substitute for real data. Furthermore, we introduce scale reparameterization to ensure the accuracy of low-bit quantization. We perform extensive segmentation experiments on various datasets, and DFQ-SAM consistently provides significant performance on low-bit quantization. DFQ-SAM eliminates the need for data transfer in cloud-edge collaboration, thereby protecting sensitive data from potential attacks. It enables secure, fast, and personalized healthcare services at the edge, which enhances system efficiency and optimizes resource allocation, and thus facilitating the pervasive application of artificial intelligence in worldwide healthcare.
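
该框架所校准的低比特量化,可用最基础的均匀仿射量化(uniform affine quantization)来示意。这是通用草图;DFQ-SAM 的尺度重参数化与无数据校准远比这复杂:

```python
def quantize(x, scale, zero_point=0, bits=8):
    """Map a real value onto a signed `bits`-bit integer grid."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))

def dequantize(q, scale, zero_point=0):
    """Map the integer code back to an approximate real value."""
    return (q - zero_point) * scale

# Round-trip error is bounded by half the quantization step (scale / 2),
# which is why choosing a good `scale` (calibration) matters so much at low bits.
x = 1.234
x_hat = dequantize(quantize(x, scale=0.1), scale=0.1)
```

无数据量化的难点正在于:没有真实样本统计量时,仍要为每层选出合适的 scale,论文用合成数据替代真实数据来完成这一步。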

[AI-149] From Text to Multimodality: Exploring the Evolution and Impact of Large Language Models in Medical Practice

链接: https://arxiv.org/abs/2410.01812
作者: Qian Niu,Keyu Chen,Ming Li,Pohsun Feng,Ziqian Bi,Junyu Liu,Benji Peng
关键词-EN: Large Language Models, Multimodal Large Language, Language Models, Large Language, Multimodal Large
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 12 pages, 1 figure

点击查看摘要

Abstract:Large Language Models (LLMs) have rapidly evolved from text-based systems to multimodal platforms, significantly impacting various sectors including healthcare. This comprehensive review explores the progression of LLMs to Multimodal Large Language Models (MLLMs) and their growing influence in medical practice. We examine the current landscape of MLLMs in healthcare, analyzing their applications across clinical decision support, medical imaging, patient engagement, and research. The review highlights the unique capabilities of MLLMs in integrating diverse data types, such as text, images, and audio, to provide more comprehensive insights into patient health. We also address the challenges facing MLLM implementation, including data limitations, technical hurdles, and ethical considerations. By identifying key research gaps, this paper aims to guide future investigations in areas such as dataset development, modality alignment methods, and the establishment of ethical guidelines. As MLLMs continue to shape the future of healthcare, understanding their potential and limitations is crucial for their responsible and effective integration into medical practice.

[AI-150] Evaluating Cultural Awareness of LLMs for Yoruba, Malayalam, and English

链接: https://arxiv.org/abs/2410.01811
作者: Fiifi Dawson,Zainab Mosunmola,Sahil Pocker,Raj Abhijit Dandekar,Rajat Dandekar,Sreedath Panat
关键词-EN: Long Term Orientation, complex tasks, extremely effective, large number, number of complex
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 19 pages, 10 figures, 6 tables

点击查看摘要

Abstract:Although LLMs have been extremely effective in a large number of complex tasks, their understanding and functionality for regional languages and cultures are not well studied. In this paper, we explore the ability of various LLMs to comprehend the cultural aspects of two regional languages: Malayalam (state of Kerala, India) and Yoruba (West Africa). Using Hofstede’s six cultural dimensions: Power Distance (PDI), Individualism (IDV), Motivation towards Achievement and Success (MAS), Uncertainty Avoidance (UAV), Long Term Orientation (LTO), and Indulgence (IVR), we quantify the cultural awareness of LLM-based responses. We demonstrate that although LLMs show a high cultural similarity for English, they fail to capture the cultural nuances across these 6 metrics for Malayalam and Yoruba. We also highlight the need for large-scale regional language LLM training with culturally enriched datasets. This will have huge implications for enhancing the user experience of chat-based LLMs and also improving the validity of large-scale LLM agent-based market research.
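
把六个 Hofstede 维度得分折算成单一“文化相似度”的一种简单做法,是在 0-100 量表上取平均绝对一致度。以下度量与样例分数均为示意,未必是论文采用的指标:

```python
DIMENSIONS = ["PDI", "IDV", "MAS", "UAV", "LTO", "IVR"]

def cultural_similarity(scores_a, scores_b):
    """Similarity in [0, 1]: 1 means identical scores on all six
    Hofstede dimensions, 0 means maximally distant (100 apart on each)."""
    assert len(scores_a) == len(scores_b) == len(DIMENSIONS)
    gap = sum(abs(a - b) for a, b in zip(scores_a, scores_b))
    return 1.0 - gap / (100.0 * len(DIMENSIONS))

# Comparing a reference cultural profile against an LLM-derived one
# (both score vectors here are made-up illustrative numbers).
same = cultural_similarity([40, 91, 62, 46, 26, 68], [40, 91, 62, 46, 26, 68])
close = cultural_similarity([40, 91, 62, 46, 26, 68], [50, 80, 60, 50, 30, 70])
```

论文的结论即可表述为:英语的此类相似度高,而 Malayalam 与 Yoruba 在六个维度上的差距显著更大。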

[AI-151] Propaganda is all you need

链接: https://arxiv.org/abs/2410.01810
作者: Paul Kronlund-Drouault
关键词-EN: abstract mathematics, recent field, field of study, realm of abstract, political dimension
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As ML is still a (relatively) recent field of study, especially outside the realm of abstract mathematics, few works have addressed the political aspect of LLMs, and more particularly the alignment process and its political dimension. This process can be as simple as prompt engineering, but it can also run very deep and affect completely unrelated questions. For example, politically directed alignment has a very strong impact on an LLM’s embedding space, and on the relative position of political notions in such a space. Using special tools to evaluate general political bias and analyze the effects of alignment, we can gather new data to understand its causes and possible consequences for society. Indeed, taking a socio-political approach, we can hypothesize that most big LLMs are aligned with what Marxist philosophy calls the ‘dominant ideology’. As AI takes on a growing role in political decision-making, at the citizen’s scale but also in government agencies, such biases can have huge effects on societal change, either by creating a new and insidious pathway for societal uniformization or by allowing disguised extremist views to gain traction among the people.

[AI-152] Enhancing transparency in AI-powered customer engagement

链接: https://arxiv.org/abs/2410.01809
作者: Tara DeZao
关键词-EN: building consumer trust, addresses the critical, critical challenge, challenge of building, emphasising the necessity
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper addresses the critical challenge of building consumer trust in AI-powered customer engagement by emphasising the necessity for transparency and accountability. Despite the potential of AI to revolutionise business operations and enhance customer experiences, widespread concerns about misinformation and the opacity of AI decision-making processes hinder trust. Surveys highlight a significant lack of awareness among consumers regarding their interactions with AI, alongside apprehensions about bias and fairness in AI algorithms. The paper advocates for the development of explainable AI models that are transparent and understandable to both consumers and organisational leaders, thereby mitigating potential biases and ensuring ethical use. It underscores the importance of organisational commitment to transparency practices beyond mere regulatory compliance, including fostering a culture of accountability, prioritising clear data policies and maintaining active engagement with stakeholders. By adopting a holistic approach to transparency and explainability, businesses can cultivate trust in AI technologies, bridging the gap between technological innovation and consumer acceptance, and paving the way for more ethical and effective AI-powered customer engagements. KEYWORDS: artificial intelligence (AI), transparency

[AI-153] AI Horizon Scanning White Paper p3395 IEEE-SA. Part I: Areas of Attention

链接: https://arxiv.org/abs/2410.01808
作者: Marina Cortês,Andrew R. Liddle,Christos Emmanouilidis,Anthony E. Kelly,Ken Matusow,Ragu Ragunathan,Jayne M. Suess,George Tambouratzis,Janusz Zalewski,David A. Bray
关键词-EN: Generative Artificial Intelligence, carry societal transformation, Artificial Intelligence, Generative Artificial, White Papers informing
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: This is an interim version of our p3395 working group White Paper. We will update this version, until publication by the Institute of Electrical and Electronics Engineers, Standards Association (IEEE-SA), Sponsor Committee - Artificial Intelligence Standards Committee (C/AISC); this https URL

点击查看摘要

Abstract:Generative Artificial Intelligence (AI) models may carry societal transformation to an extent demanding a delicate balance between opportunity and risk. This manuscript is the first of a series of White Papers informing the development of IEEE-SA’s p3395: ‘Standard for the Implementation of Safeguards, Controls, and Preventive Techniques for Artificial Intelligence (AI) Models’, Chair: Marina Cortês (this https URL). In this first horizon-scanning we identify key attention areas for standards activities in AI. We examine different principles for regulatory efforts, and review notions of accountability, privacy, data rights and misuse. As a safeguards standard we devote significant attention to the stability of global infrastructures and consider a possible overdependence on cloud computing that may result from densely coupled AI components. We review the recent cascade-failure-like CrowdStrike event of July 2024 as an illustration of potential impacts on critical infrastructures from AI-induced incidents in the (near) future. Upcoming articles will focus on regulatory initiatives, technology evolution and the role of AI in specific domains.

[AI-154] Semantic-Driven Topic Modeling Using Transformer-Based Embeddings and Clustering Algorithms

链接: https://arxiv.org/abs/2410.00134
作者: Melkamu Abay Mersha,Mesay Gemeda yigezu,Jugal Kalita
关键词-EN: discover hidden topics, Topic modeling, prior knowledge, discover hidden, Traditional topic modeling
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Topic modeling is a powerful technique to discover hidden topics and patterns within a collection of documents without prior knowledge. Traditional topic modeling and clustering-based techniques encounter challenges in capturing contextual semantic information. This study introduces an innovative end-to-end semantic-driven topic modeling technique for the topic extraction process, utilizing advanced word and document embeddings combined with a powerful clustering algorithm. This semantic-driven approach represents a significant advancement in topic modeling methodologies. It leverages contextual semantic information to extract coherent and meaningful topics. Specifically, our model generates document embeddings using pre-trained transformer-based language models, reduces the dimensions of the embeddings, clusters the embeddings based on semantic similarity, and generates coherent topics for each cluster. Compared to ChatGPT and traditional topic modeling algorithms, our model provides more coherent and meaningful topics.
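The pipeline in the abstract (transformer document embeddings, dimensionality reduction, semantic clustering, topics per cluster) can be sketched through its clustering step alone. The toy vectors below are a stand-in for real transformer embeddings, and the deterministic initialization is an illustrative choice of this sketch, not the paper's algorithm:

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Minimal k-means: the clustering step of the pipeline.
    Naive deterministic init: spread initial centers through the data."""
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # Squared distance of every point to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        centers = np.stack([X[labels == j].mean(axis=0) for j in range(k)])
    return labels

# Toy "document embeddings": two well-separated semantic clusters
# standing in for reduced transformer outputs. In the real pipeline,
# topic words would then be extracted from the top terms per cluster.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (10, 5)),
               rng.normal(5.0, 0.1, (10, 5))])
labels = kmeans(X, k=2)
```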

[AI-155] Explainable Artificial Intelligence: A Survey of Needs, Techniques, Applications, and Future Direction

链接: https://arxiv.org/abs/2409.00265
作者: Melkamu Mersha,Khang Lam,Joseph Wood,Ali AlShami,Jugal Kalita
关键词-EN: Artificial intelligence models, Explainable Artificial Intelligence, Artificial intelligence, encounter significant challenges, significant challenges due
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Artificial intelligence models encounter significant challenges due to their black-box nature, particularly in safety-critical domains such as healthcare, finance, and autonomous vehicles. Explainable Artificial Intelligence (XAI) addresses these challenges by providing explanations for how these models make decisions and predictions, ensuring transparency, accountability, and fairness. Existing studies have examined the fundamental concepts of XAI, its general principles, and the scope of XAI techniques. However, there remains a gap in the literature as there are no comprehensive reviews that delve into the detailed mathematical representations, design methodologies of XAI models, and other associated aspects. This paper provides a comprehensive literature review encompassing common terminologies and definitions, the need for XAI, beneficiaries of XAI, a taxonomy of XAI methods, and the application of XAI methods in different application areas. The survey is aimed at XAI researchers, XAI practitioners, AI model developers, and XAI beneficiaries who are interested in enhancing the trustworthiness, transparency, accountability, and fairness of their AI models.

[AI-156] Large Language Models as Markov Chains

链接: https://arxiv.org/abs/2410.02724
作者: Oussama Zekri,Ambroise Odonnat,Abdelhakim Benechehab,Linus Bleistein,Nicolas Boullé,Ievgen Redko
关键词-EN: Large language models, natural language processing, language processing tasks, Large language, remarkably efficient
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 49 pages, 17 figures

点击查看摘要

Abstract:Large language models (LLMs) have proven to be remarkably efficient, both across a wide range of natural language processing tasks and well beyond them. However, a comprehensive theoretical analysis of the origins of their impressive performance remains elusive. In this paper, we approach this challenging task by drawing an equivalence between generic autoregressive language models with vocabulary of size T and context window of size K and Markov chains defined on a finite state space of size \mathcal{O}(T^K). We derive several surprising findings related to the existence of a stationary distribution of Markov chains that capture the inference power of LLMs, their speed of convergence to it, and the influence of the temperature on the latter. We then prove pre-training and in-context generalization bounds and show how the drawn equivalence allows us to enrich their interpretation. Finally, we illustrate our theoretical guarantees with experiments on several recent LLMs to highlight how they capture the behavior observed in practice.
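The stated equivalence can be made concrete at toy scale: with vocabulary size T = 2 and context window K = 2, an autoregressive model induces a Markov chain on the T^K = 4 possible context windows. The toy "language model" below is an illustrative stand-in for an LLM; the sketch only demonstrates the lifting and a stationary distribution via power iteration, not the paper's bounds:

```python
import itertools
import numpy as np

def lm_to_markov_chain(next_token_probs, T, K):
    """Lift an autoregressive model p(x_t | last K tokens) to a Markov
    chain whose states are the T**K possible context windows."""
    states = list(itertools.product(range(T), repeat=K))
    index = {s: i for i, s in enumerate(states)}
    P = np.zeros((len(states), len(states)))
    for s in states:
        probs = next_token_probs(s)          # length-T distribution
        for tok in range(T):
            s_next = s[1:] + (tok,)          # slide the context window
            P[index[s], index[s_next]] += probs[tok]
    return states, P

def stationary(P, iters=1000):
    """Stationary distribution by power iteration."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        pi = pi @ P
    return pi

# Toy "language model": prefers repeating the last token.
def toy_lm(context):
    p = np.full(2, 0.2)
    p[context[-1]] = 0.8
    return p

states, P = lm_to_markov_chain(toy_lm, T=2, K=2)
pi = stationary(P)
```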

[AI-157] Measurements with Noise: Bayesian Optimization for Co-optimizing Noise and Property Discovery in Automated Experiments

链接: https://arxiv.org/abs/2410.02717
作者: Boris N. Slautin,Yu Liu,Jan Dec,Vladimir V. Shvartsman,Doru C. Lupascu,Maxim Ziatdinov,Sergei V. Kalinin
关键词-EN: developed a Bayesian, integrates intra-step noise, automated experimental cycles, Bayesian optimization, integrates intra-step
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 22 pages, 9 figures

点击查看摘要

Abstract:We have developed a Bayesian optimization (BO) workflow that integrates intra-step noise optimization into automated experimental cycles. Traditional BO approaches in automated experiments focus on optimizing experimental trajectories but often overlook the impact of measurement noise on data quality and cost. Our proposed framework simultaneously optimizes both the target property and the associated measurement noise by introducing time as an additional input parameter, thereby balancing the signal-to-noise ratio and experimental duration. Two approaches are explored: a reward-driven noise optimization and a double-optimization acquisition function, both enhancing the efficiency of automated workflows by considering noise and cost within the optimization process. We validate our method through simulations and real-world experiments using Piezoresponse Force Microscopy (PFM), demonstrating the successful optimization of measurement duration and property exploration. Our approach offers a scalable solution for optimizing multiple variables in automated experimental workflows, improving data quality, and reducing resource expenditure in materials science and beyond.
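The co-optimization idea, treating measurement time as an extra input so the surrogate trades signal-to-noise against duration, can be sketched with a toy Gaussian-process surrogate and a UCB-minus-cost acquisition. The kernel, cost term, and constants below are illustrative guesses, not the paper's reward-driven or double-optimization acquisition functions:

```python
import numpy as np

def rbf(A, B, ls=0.5):
    """Squared-exponential kernel between two point sets."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def gp_posterior(Xtr, ytr, Xte, noise=1e-4):
    """Standard GP regression posterior mean and variance."""
    K = rbf(Xtr, Xtr) + noise * np.eye(len(Xtr))
    Ks = rbf(Xte, Xtr)
    Kinv = np.linalg.inv(K)
    mu = Ks @ Kinv @ ytr
    var = np.diag(rbf(Xte, Xte) - Ks @ Kinv @ Ks.T)
    return mu, np.maximum(var, 0.0)

def acquisition(mu, var, t, beta=2.0, cost=0.5):
    """UCB on the target property minus a linear measurement-time cost:
    longer measurements reduce noise but are penalized."""
    return mu + beta * np.sqrt(var) - cost * t

# Observed (position x, measurement time t) points with property values.
Xtr = np.array([[0.1, 0.2], [0.5, 0.8], [0.9, 0.4]])
ytr = np.array([0.3, 1.0, 0.5])

# Candidate grid over x and t; pick the next experiment jointly.
xs, ts = np.meshgrid(np.linspace(0, 1, 21), np.linspace(0.1, 1, 10))
Xte = np.column_stack([xs.ravel(), ts.ravel()])
mu, var = gp_posterior(Xtr, ytr, Xte)
next_point = Xte[np.argmax(acquisition(mu, var, Xte[:, 1]))]
```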

[AI-158] Deep Regression 2D-3D Ultrasound Registration for Liver Motion Correction in Focal Tumor Thermal Ablation

链接: https://arxiv.org/abs/2410.02579
作者: Shuwei Xing,Derek W. Cool,David Tessier,Elvis C.S. Chen,Terry M. Peters,Aaron Fenster
关键词-EN: ablation procedures require, tumor ablation procedures, Liver tumor ablation, procedures require accurate, require accurate placement
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
*备注: 15 pages, 9 figures

点击查看摘要

Abstract:Liver tumor ablation procedures require accurate placement of the needle applicator at the tumor centroid. The lower-cost and real-time nature of ultrasound (US) has advantages over computed tomography (CT) for applicator guidance; however, in some patients, liver tumors may be occult on US and tumor mimics can make lesion identification challenging. Image registration techniques can aid in interpreting anatomical details and identifying tumors, but their clinical application has been hindered by the tradeoff between alignment accuracy and runtime performance, particularly when compensating for liver motion due to patient breathing or movement. Therefore, we propose a 2D-3D US registration approach to enable intra-procedural alignment that mitigates errors caused by liver motion. Specifically, our approach can correlate imbalanced 2D and 3D US image features and use continuous 6D rotation representations to enhance the model’s training stability. The dataset was divided into 2388, 196 and 193 image pairs for training, validation and testing, respectively. Our approach achieved a mean Euclidean distance error of 2.28 mm ± 1.81 mm and a mean geodesic angular error of 2.99° ± 1.95°, with a runtime of 0.22 seconds per 2D-3D US image pair. These results demonstrate that our approach can achieve accurate alignment and clinically acceptable runtime, indicating potential for clinical translation.
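The continuous 6D rotation representation mentioned above is conventionally the Gram-Schmidt construction of Zhou et al. (CVPR 2019): two 3-vectors are orthonormalized into a rotation matrix, avoiding the discontinuities of Euler angles or quaternions. A minimal sketch, assuming that standard formulation (the paper's network details are not reproduced here):

```python
import numpy as np

def rotation_from_6d(x):
    """Map a continuous 6D vector to a 3x3 rotation matrix by
    Gram-Schmidt orthonormalization of its two 3-vector halves."""
    a1, a2 = x[:3], x[3:]
    b1 = a1 / np.linalg.norm(a1)
    a2_orth = a2 - (b1 @ a2) * b1          # remove the b1 component
    b2 = a2_orth / np.linalg.norm(a2_orth)
    b3 = np.cross(b1, b2)                  # complete a right-handed frame
    return np.stack([b1, b2, b3], axis=1)

R = rotation_from_6d(np.array([1.0, 2.0, 0.5, -0.3, 0.8, 1.1]))
```

Because every 6D input (with non-degenerate halves) maps to a valid rotation, a regression network can output the 6 numbers directly, which is the training-stability advantage the abstract alludes to.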

[AI-159] Personalized Quantum Federated Learning for Privacy Image Classification

链接: https://arxiv.org/abs/2410.02547
作者: Jinjing Shi,Tian Chen,Shichao Zhang,Xuelong Li
关键词-EN: Quantum federated learning, personalized quantum federated, federated learning algorithm, federated learning, Quantum federated
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Quantum federated learning has brought about the improvement of privacy image classification, while the lack of personality of the client model may render quantum federated learning suboptimal. A personalized quantum federated learning algorithm for privacy image classification is proposed to enhance the personality of the client model in the case of an imbalanced distribution of images. First, a personalized quantum federated learning model is constructed, in which a personalized layer is set for the client model to maintain the personalized parameters. Second, a personalized quantum federated learning algorithm is introduced to secure the information exchanged between the client and the server. Third, the personalized federated learning is applied to image classification on the FashionMNIST dataset, and the experimental results indicate that the personalized quantum federated learning algorithm can obtain global and local models with excellent performance, even in situations where local training samples are imbalanced. The server’s accuracy is 100% with 8 clients and a distribution parameter of 100, outperforming the non-personalized model by 7%. The average client accuracy is 2.9% higher than that of the non-personalized model with 2 clients and a distribution parameter of 1. Compared to previous quantum federated learning algorithms, the proposed personalized quantum federated learning algorithm eliminates the need for additional local training while safeguarding both model and data privacy. This may facilitate broader adoption and application of quantum technologies, and pave the way for more secure, scalable, and efficient quantum distributed machine learning solutions.
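The personalized-layer idea (the server averages only shared parameters while each client keeps its personalized parameters local) can be sketched classically. This is a plain federated-averaging illustration of the concept, not the paper's quantum implementation:

```python
import numpy as np

def federated_round(clients):
    """One aggregation round: average only the shared parameters
    across clients; each client's personalized layer stays local."""
    shared_avg = np.mean([c["shared"] for c in clients], axis=0)
    for c in clients:
        c["shared"] = shared_avg.copy()   # broadcast the global part
        # c["personal"] is deliberately left untouched
    return clients

clients = [
    {"shared": np.array([1.0, 2.0]), "personal": np.array([0.1])},
    {"shared": np.array([3.0, 4.0]), "personal": np.array([0.9])},
]
clients = federated_round(clients)
```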

[AI-160] NTU-NPU System for Voice Privacy 2024 Challenge

链接: https://arxiv.org/abs/2410.02371
作者: Nikita Kuzmin,Hieu-Thi Luong,Jixun Yao,Lei Xie,Kong Aik Lee,Eng Siong Chng
关键词-EN: Voice Privacy Challenge, Privacy Challenge, Voice Privacy, describe our submissions, Challenge
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
*备注: System description for VPC 2024

点击查看摘要

Abstract:In this work, we describe our submissions for the Voice Privacy Challenge 2024. Rather than proposing a novel speech anonymization system, we enhance the provided baselines to meet all required conditions and improve evaluated metrics. Specifically, we implement emotion embedding and experiment with WavLM and ECAPA2 speaker embedders for the B3 baseline. Additionally, we compare different speaker and prosody anonymization techniques. Furthermore, we introduce Mean Reversion F0 for B5, which helps to enhance privacy without a loss in utility. Finally, we explore disentanglement models, namely β-VAE and NaturalSpeech3 FACodec.
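"Mean Reversion F0" is not specified in the abstract; one plausible reading is pulling each voiced frame's F0 toward the utterance mean, flattening speaker-specific pitch dynamics while keeping the contour usable. The blending factor and the zero-means-unvoiced convention below are assumptions of this sketch, not the paper's definition:

```python
import numpy as np

def mean_reversion_f0(f0, alpha=0.5):
    """Pull voiced F0 values toward the utterance mean by factor alpha.
    Unvoiced frames (marked by 0) are left untouched. alpha=1 keeps the
    original contour; alpha=0 collapses it to the mean."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    mean = f0[voiced].mean()
    out = f0.copy()
    out[voiced] = mean + alpha * (f0[voiced] - mean)
    return out

f0 = np.array([120.0, 0.0, 180.0, 150.0])   # Hz, 0 = unvoiced
anon = mean_reversion_f0(f0, alpha=0.5)
```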

[AI-161] Autonomous Self-Trained Channel State Prediction Method for mmWave Vehicular Communications

链接: https://arxiv.org/abs/2410.02326
作者: Abidemi Orimogunje,Vukan Ninkovic,Evariste Twahirwa,Gaspard Gashema,Dejan Vukobratovic
关键词-EN:
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
*备注: Accepted for publication at European Wireless 2024

点击查看摘要

[AI-162] Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data

链接: https://arxiv.org/abs/2410.02056
作者: Sreyan Ghosh,Sonal Kumar,Zhifeng Kong,Rafael Valle,Bryan Catanzaro,Dinesh Manocha
关键词-EN: augmenting small-scale audio, audio classification datasets, approach for augmenting, small-scale audio classification, audio classification
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Code and Checkpoints will be soon available here: this https URL

点击查看摘要

Abstract:We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data. Our goal is to improve audio classification accuracy with limited labeled data. Traditional data augmentation techniques, which apply artificial transformations (e.g., adding random noise or masking segments), struggle to create data that captures the true diversity present in real-world audios. To address this shortcoming, we propose to augment the dataset with synthetic audio generated from text-to-audio (T2A) diffusion models. However, synthesizing effective augmentations is challenging because not only should the generated data be acoustically consistent with the underlying small-scale dataset, but they should also have sufficient compositional diversity. To overcome the first challenge, we align the generations of the T2A model with the small-scale dataset using preference optimization. This ensures that the acoustic characteristics of the generated data remain consistent with the small-scale dataset. To address the second challenge, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models to (1) generate diverse and meaningful audio captions and (2) iteratively refine their quality. The generated captions are then used to prompt the aligned T2A model. We extensively evaluate Synthio on ten datasets and four simulated limited-data settings. Results indicate our method consistently outperforms all baselines by 0.1%-39% using a T2A model trained only on weakly-captioned AudioSet.

[AI-163] A Likelihood Based Approach to Distribution Regression Using Conditional Deep Generative Models

链接: https://arxiv.org/abs/2410.02025
作者: Shivam Kumar,Yun Yang,Lizhen Lin
关键词-EN: high-dimensional ambient space, response variable lies, potentially lower-dimensional manifold, deep generative models, conditional deep generative
类目: atistics Theory (math.ST); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: arXiv admin note: text overlap with arXiv:1708.06633 by other authors

点击查看摘要

Abstract:In this work, we explore the theoretical properties of conditional deep generative models under the statistical framework of distribution regression where the response variable lies in a high-dimensional ambient space but concentrates around a potentially lower-dimensional manifold. More specifically, we study the large-sample properties of a likelihood-based approach for estimating these models. Our results lead to the convergence rate of a sieve maximum likelihood estimator (MLE) for estimating the conditional distribution (and its devolved counterpart) of the response given predictors in the Hellinger (Wasserstein) metric. Our rates depend solely on the intrinsic dimension and smoothness of the true conditional distribution. These findings provide an explanation of why conditional deep generative models can circumvent the curse of dimensionality from the perspective of statistical foundations and demonstrate that they can learn a broader class of nearly singular conditional distributions. Our analysis also emphasizes the importance of introducing a small noise perturbation to the data when they are supported sufficiently close to a manifold. Finally, in our numerical studies, we demonstrate the effective implementation of the proposed approach using both synthetic and real-world datasets, which also provide complementary validation to our theoretical findings.

[AI-164] A GEN AI Framework for Medical Note Generation

链接: https://arxiv.org/abs/2410.01841
作者: Hui Yi Leong,Yi Fan Gao,Shuai Ji,Bora Kalaycioglu,Uktu Pamuksuz
关键词-EN: Electronic Health Records, Health Records, Electronic Health, direct patient care, Automatic Speech Recognition
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Sound (cs.SD)
*备注: 8 figures, 7 pages, IEEE standard research paper

点击查看摘要

Abstract:The increasing administrative burden of medical documentation, particularly through Electronic Health Records (EHR), significantly reduces the time available for direct patient care and contributes to physician burnout. To address this issue, we propose MediNotes, an advanced generative AI framework designed to automate the creation of SOAP (Subjective, Objective, Assessment, Plan) notes from medical conversations. MediNotes integrates Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and Automatic Speech Recognition (ASR) to capture and process both text and voice inputs in real time or from recorded audio, generating structured and contextually accurate medical notes. The framework also incorporates advanced techniques like Quantized Low-Rank Adaptation (QLoRA) and Parameter-Efficient Fine-Tuning (PEFT) for efficient model fine-tuning in resource-constrained environments. Additionally, MediNotes offers a query-based retrieval system, allowing healthcare providers and patients to access relevant medical information quickly and accurately. Evaluations using the ACI-BENCH dataset demonstrate that MediNotes significantly improves the accuracy, efficiency, and usability of automated medical documentation, offering a robust solution to reduce the administrative burden on healthcare professionals while improving the quality of clinical workflows.

计算机视觉

[CV-0] Flash-Splat: 3D Reflection Removal with Flash Cues and Gaussian Splats

链接: https://arxiv.org/abs/2410.02764
作者: Mingyang Xie,Haoming Cai,Sachin Shah,Yiran Xu,Brandon Y. Feng,Jia-Bin Huang,Christopher A. Metzler
关键词-EN: no-flash reflection separation, introduce a simple, simple yet effective, effective approach, approach for separating
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:We introduce a simple yet effective approach for separating transmitted and reflected light. Our key insight is that the powerful novel view synthesis capabilities provided by modern inverse rendering methods (e.g., 3D Gaussian splatting) allow one to perform flash/no-flash reflection separation using unpaired measurements – this relaxation dramatically simplifies image acquisition over conventional paired flash/no-flash reflection separation methods. Through extensive real-world experiments, we demonstrate our method, Flash-Splat, accurately reconstructs both transmitted and reflected scenes in 3D. Our method outperforms existing 3D reflection separation methods, which do not leverage illumination control, by a large margin. Our project webpage is at this https URL.

[CV-1] Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos

链接: https://arxiv.org/abs/2410.02763
作者: Jianrui Zhang,Mu Cai,Yong Jae Lee
关键词-EN: growing sentiment recently, key challenges related, growing sentiment, sentiment recently, recently that modern
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Project Page: this https URL

点击查看摘要

Abstract:There has been growing sentiment recently that modern large multimodal models (LMMs) have addressed most of the key challenges related to short video comprehension. As a result, both academia and industry are gradually shifting their attention towards the more complex challenges posed by understanding long-form videos. However, is this really the case? Our studies indicate that LMMs still lack many fundamental reasoning capabilities even when dealing with short videos. We introduce Vinoground, a temporal counterfactual LMM evaluation benchmark encompassing 1000 short and natural video-caption pairs. We demonstrate that existing LMMs severely struggle to distinguish temporal differences between different actions and object transformations. For example, the best model GPT-4o only obtains ~50% on our text and video scores, showing a large gap compared to the human baseline of ~90%. All open-source multimodal models and CLIP-based models perform much worse, producing mostly random chance performance. Through this work, we shed light onto the fact that temporal reasoning in short videos is a problem yet to be fully solved. The dataset and evaluation code are available at this https URL.

[CV-2] Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations

链接: https://arxiv.org/abs/2410.02762
作者: Nick Jiang,Anish Kachinthaya,Suzie Petryk,Yossi Gandelsman
关键词-EN: size and training, persistent challenge, challenge despite advances, output probabilities, address hallucinations
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Project page and code: this http URL

点击查看摘要

Abstract:We investigate the internal representations of vision-language models (VLMs) to address hallucinations, a persistent challenge despite advances in model size and training. We project VLMs’ internal image representations to their language vocabulary and observe more confident output probabilities on real objects than hallucinated objects. We additionally use these output probabilities to spatially localize real objects. Building on this approach, we introduce a knowledge erasure algorithm that removes hallucinations by linearly orthogonalizing image features with respect to hallucinated object features. We show that targeted edits to a model’s latent representations can reduce hallucinations by up to 25.7% on the COCO2014 dataset while preserving performance. Our findings demonstrate how a deeper understanding of VLMs’ latent representations can enhance reliability and enable novel capabilities, such as zero-shot segmentation.
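The knowledge-erasure edit described above, linearly orthogonalizing image features with respect to a hallucinated-object feature direction, amounts to projecting one direction out of every feature vector. A minimal sketch; the direction vector below is a made-up stand-in for a feature extracted from the model:

```python
import numpy as np

def erase_direction(features, direction):
    """Remove the component along `direction` from every feature row,
    leaving the features orthogonal to the erased direction."""
    d = direction / np.linalg.norm(direction)
    return features - np.outer(features @ d, d)

X = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 0.0]])
d = np.array([0.0, 1.0, 0.0])   # hypothetical hallucinated-object direction
X_clean = erase_direction(X, d)
```

After the edit, no feature carries any component along the erased direction, which is how such a targeted latent edit can suppress a hallucinated object while leaving the orthogonal content intact.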

[CV-3] FakeShield: Explainable Image Forgery Detection and Localization via Multi-modal Large Language Models

链接: https://arxiv.org/abs/2410.02761
作者: Zhipei Xu,Xuanyu Zhang,Runyi Li,Zecheng Tang,Qing Huang,Jian Zhang
关键词-EN: facilitates content creation, makes image manipulation, image manipulation easier, double-edged sword, rapid development
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The rapid development of generative AI is a double-edged sword, which not only facilitates content creation but also makes image manipulation easier and more difficult to detect. Although current image forgery detection and localization (IFDL) methods are generally effective, they tend to face two challenges: 1) black-box nature with unknown detection principle, 2) limited generalization across diverse tampering methods (e.g., Photoshop, DeepFake, AIGC-Editing). To address these issues, we propose the explainable IFDL task and design FakeShield, a multi-modal framework capable of evaluating image authenticity, generating tampered region masks, and providing a judgment basis based on pixel-level and image-level tampering clues. Additionally, we leverage GPT-4o to enhance existing IFDL datasets, creating the Multi-Modal Tamper Description dataSet (MMTD-Set) for training FakeShield’s tampering analysis capabilities. Meanwhile, we incorporate a Domain Tag-guided Explainable Forgery Detection Module (DTE-FDM) and a Multi-modal Forgery Localization Module (MFLM) to address various types of tamper detection interpretation and achieve forgery localization guided by detailed textual descriptions. Extensive experiments demonstrate that FakeShield effectively detects and localizes various tampering techniques, offering an explainable and superior solution compared to previous IFDL methods.

[CV-4] Loong: Generating Minute-level Long Videos with Autoregressive Language Models

链接: https://arxiv.org/abs/2410.02757
作者: Yuqing Wang,Tianwei Xiong,Daquan Zhou,Zhijie Lin,Yang Zhao,Bingyi Kang,Jiashi Feng,Xihui Liu
关键词-EN: content-rich long videos, scale of minutes, generate content-rich long, desirable but challenging, long videos
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL

点击查看摘要

Abstract:It is desirable but challenging to generate content-rich long videos in the scale of minutes. Autoregressive large language models (LLMs) have achieved great success in generating coherent and long sequences of tokens in the domain of natural language processing, while the exploration of autoregressive LLMs for video generation is limited to generating short videos of several seconds. In this work, we conduct a deep analysis of the challenges that prevent autoregressive LLM-based video generators from generating long videos. Based on the observations and analysis, we propose Loong, a new autoregressive LLM-based video generator that can generate minute-long videos. Specifically, we model the text tokens and video tokens as a unified sequence for autoregressive LLMs and train the model from scratch. We propose progressive short-to-long training with a loss re-weighting scheme to mitigate the loss imbalance problem for long video training. We further investigate inference strategies, including video token re-encoding and sampling strategies, to diminish error accumulation during inference. Our proposed Loong can be trained on 10-second videos and be extended to generate minute-level long videos conditioned on text prompts, as demonstrated by the results. More samples are available at: this https URL.

[CV-5] Contrastive Localized Language-Image Pre-Training

链接: https://arxiv.org/abs/2410.02746
作者: Hong-You Chen,Zhengfeng Lai,Haotian Zhang,Xinze Wang,Marcin Eichner,Keen You,Meng Cao,Bowen Zhang,Yinfei Yang,Zhe Gan
关键词-EN: CLIP, facilitating various applications, training vision encoders, text representations facilitating, training vision
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations facilitating various applications. Recently, CLIP has been widely adopted as the vision backbone of multimodal large language models (MLLMs) to connect image inputs for language interactions. The success of CLIP as a vision-language foundation model relies on aligning web-crawled noisy text annotations at image levels. Nevertheless, such criteria may become insufficient for downstream tasks in need of fine-grained vision representations, especially when region-level understanding is demanding for MLLMs. In this paper, we improve the localization capability of CLIP with several advances. We propose a pre-training method called Contrastive Localized Language-Image Pre-training (CLOC) by complementing CLIP with region-text contrastive loss and modules. We formulate a new concept, promptable embeddings, of which the encoder produces image embeddings easy to transform into region representations given spatial hints. To support large-scale pre-training, we design a visually-enriched and spatially-localized captioning framework to effectively generate region-text pseudo-labels at scale. By scaling up to billions of annotated images, CLOC enables high-quality regional embeddings for image region recognition and retrieval tasks, and can be a drop-in replacement of CLIP to enhance MLLMs, especially on referring and grounding tasks.
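The region-text contrastive loss can be illustrated with a generic symmetric InfoNCE objective over matched region/text embedding pairs; CLOC's exact loss, modules, and temperature are not given in the abstract, so the values below are conventional choices rather than the paper's:

```python
import numpy as np

def contrastive_loss(region_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: row i of region_emb is the positive for row i
    of text_emb; all other rows in the batch are negatives."""
    r = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (r @ t.T) / temperature
    labels = np.arange(len(r))

    def ce(lg):
        # Cross-entropy of each row against its diagonal positive.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average region-to-text and text-to-region directions.
    return 0.5 * (ce(logits) + ce(logits.T))
```

Perfectly aligned pairs drive the loss toward zero, while mismatched pairs are penalized, which is the mechanism that pulls region embeddings toward their paired (pseudo-labeled) region captions.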

[CV-6] AVG-LLaVA: A Multimodal Large Model with Adaptive Visual Granularity

链接: https://arxiv.org/abs/2410.02745
作者: Zhibin Lan,Liqiang Niu,Fandong Meng,Wenbo Li,Jie Zhou,Jinsong Su
关键词-EN: visual tokens, visual granularity based, visual granularity, multiple local images, visual
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Preprint

点击查看摘要

Abstract:Recently, when dealing with high-resolution images, dominant LMMs usually divide them into multiple local images and one global image, which will lead to a large number of visual tokens. In this work, we introduce AVG-LLaVA, an LMM that can adaptively select the appropriate visual granularity based on the input image and instruction. This approach not only reduces the number of visual tokens and speeds up inference, but also improves the overall model performance. Specifically, we introduce the following modules based on LLaVA-NeXT: (a) a visual granularity scaler that includes multiple pooling layers to obtain visual tokens with different granularities; (b) a visual granularity router, which includes a Transformer layer, an MLP layer, and a voter layer, used to select the appropriate visual granularity based on the image and instruction. Furthermore, we propose RGLF, a novel training paradigm that aims at aligning the granularity predicted by the router with the preferences of the LMM, without the need for additional manually annotated data. Extensive experiments and analysis show that AVG-LLaVA achieves superior performance across 11 benchmarks, as well as significantly reduces the number of visual tokens and speeds up inference (e.g., an 85.3% reduction in visual tokens and a 2.53× increase in inference speed on the AI2D benchmark).
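The granularity scaler and router can be sketched as pooling a grid of visual tokens at several factors and then picking one level. The average pooling, the pooling factors, and the hard argmax routing below are simplifications of the described Transformer/MLP/voter router, chosen for illustration:

```python
import numpy as np

def granularity_scaler(tokens, factors=(1, 2, 4)):
    """Average-pool an H x W x D token grid at several granularities,
    returning one flattened token list per level."""
    H, W, D = tokens.shape
    outs = []
    for f in factors:
        pooled = tokens.reshape(H // f, f, W // f, f, D).mean(axis=(1, 3))
        outs.append(pooled.reshape(-1, D))
    return outs

def route(candidates, scores):
    """Pick the granularity level with the highest router score."""
    return candidates[int(np.argmax(scores))]

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 4, 8))
levels = granularity_scaler(tokens)           # 16, 4, and 1 tokens
chosen = route(levels, scores=[0.1, 0.7, 0.2])
```

Routing to the coarser level here shrinks 16 visual tokens to 4, which is the token-count (and hence inference-speed) saving the abstract quantifies.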

[CV-7] Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models

链接: https://arxiv.org/abs/2410.02740
作者: Zhengfeng Lai,Vasileios Saveris,Chen Chen,Hong-You Chen,Haotian Zhang,Bowen Zhang,Juan Lao Tebar,Wenze Hu,Zhe Gan,Peter Grasch,Meng Cao,Yinfei Yang
关键词-EN: Recent advancements, synthetic captions, key challenges remain, captions, Short Synthetic Captions
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: CV/ML

点击查看摘要

Abstract:Recent advancements in multimodal models highlight the value of rewritten captions for improving performance, yet key challenges remain. For example, while synthetic captions often provide superior quality and image-text alignment, it is not clear whether they can fully replace AltTexts: the role of synthetic captions and their interaction with original web-crawled AltTexts in pre-training is still not well understood. Moreover, different multimodal foundation models may have unique preferences for specific caption formats, but efforts to identify the optimal captions for each model remain limited. In this work, we propose a novel, controllable, and scalable captioning pipeline designed to generate diverse caption formats tailored to various multimodal models. By examining Short Synthetic Captions (SSC) towards Dense Synthetic Captions (DSC+) as case studies, we systematically explore their effects and interactions with AltTexts across models such as CLIP, multimodal LLMs, and diffusion models. Our findings reveal that a hybrid approach that keeps both synthetic captions and AltTexts can outperform the use of synthetic captions alone, improving both alignment and performance, with each model demonstrating preferences for particular caption formats. This comprehensive analysis provides valuable insights into optimizing captioning strategies, thereby advancing the pre-training of multimodal foundation models.

[CV-8] DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects

链接: https://arxiv.org/abs/2410.02730
作者: Zhaowei Wang,Hongming Zhang,Tianqing Fang,Ye Tian,Yue Yang,Kaixin Ma,Xiaoman Pan,Yangqiu Song,Dong Yu
关键词-EN: real-world applications, navigation in unknown, crucial for deploying, Object navigation, target objects
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Robotics (cs.RO)
*备注: Work in Progress

点击查看摘要

Abstract:Object navigation in unknown environments is crucial for deploying embodied agents in real-world applications. While we have witnessed huge progress due to large-scale scene datasets, faster simulators, and stronger models, previous studies mainly focus on limited scene types and target objects. In this paper, we study a new task of navigating to diverse target objects in a large number of scene types. To benchmark the problem, we present a large-scale scene dataset, DivScene, which contains 4,614 scenes across 81 different types. With the dataset, we build an end-to-end embodied agent, NatVLM, by fine-tuning a Large Vision Language Model (LVLM) through imitation learning. The LVLM is trained to take previous observations from the environment and generate the next actions. We also introduce CoT explanation traces of the action prediction for better performance when tuning LVLMs. Our extensive experiments find that we can build a performant LVLM-based agent through imitation learning on the shortest paths constructed by a BFS planner without any human supervision. Our agent achieves a success rate that surpasses GPT-4o by over 20%. Meanwhile, we carry out various analyses showing the generalization ability of our agent.
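The expert-trajectory source mentioned above, shortest paths from a BFS planner, is easy to make concrete. The grid world below is an illustrative stand-in for the DivScene scenes; in the paper, such planner paths supervise the LVLM through imitation learning:

```python
from collections import deque

def bfs_shortest_path(grid, start, goal):
    """Shortest path on a 4-connected grid (1 = blocked cell).
    Returns the list of cells from start to goal, or None."""
    rows, cols = len(grid), len(grid[0])
    prev = {start: None}
    q = deque([start])
    while q:
        r, c = q.popleft()
        if (r, c) == goal:
            path, node = [], goal
            while node is not None:       # walk back through parents
                path.append(node)
                node = prev[node]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in prev):
                prev[(nr, nc)] = (r, c)
                q.append((nr, nc))
    return None

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
path = bfs_shortest_path(grid, (0, 0), (2, 0))
```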

[CV-9] Curvature Diversity-Driven Deformation and Domain Alignment for Point Cloud

链接: https://arxiv.org/abs/2410.02720
作者: Mengxi Wu,Hao Huang,Yi Fang,Mohammad Rostami
关键词-EN: Unsupervised Domain Adaptation, training deep networks, textbf, Unsupervised Domain, manual data annotation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Unsupervised Domain Adaptation (UDA) is crucial for reducing the need for extensive manual data annotation when training deep networks on point cloud data. A significant challenge of UDA lies in effectively bridging the domain gap. To tackle this challenge, we propose Curvature Diversity-Driven Nuclear-Norm Wasserstein Domain Alignment (CDND). Our approach first introduces a Curvature Diversity-driven Deformation Reconstruction (CurvRec) task, which effectively mitigates the gap between the source and target domains by enabling the model to extract salient features from semantically rich regions of a given point cloud. We then propose Deformation-based Nuclear-norm Wasserstein Discrepancy (D-NWD), which applies the Nuclear-norm Wasserstein Discrepancy to both deformed and original data samples to align the source and target domains. Furthermore, we contribute a theoretical justification for the effectiveness of D-NWD in distribution alignment and demonstrate that it is generic enough to be applied to any deformations. To validate our method, we conduct extensive experiments on two public domain adaptation datasets for point cloud classification and segmentation tasks. Empirical experiment results show that our CDND achieves state-of-the-art performance by a noticeable margin over existing approaches.

[CV-10] Video Instruction Tuning With Synthetic Data

链接: https://arxiv.org/abs/2410.02713
作者: Yuanhan Zhang,Jinming Wu,Wei Li,Bo Li,Zejun Ma,Ziwei Liu,Chunyuan Li
关键词-EN: curating large amounts, video large multimodal, large multimodal models, large multimodal, curating large
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: Project page: this https URL

点击查看摘要

Abstract:The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.

[CV-11] LLaVA-Critic: Learning to Evaluate Multimodal Models

链接: https://arxiv.org/abs/2410.02712
作者: Tianyi Xiong,Xiyao Wang,Dong Guo,Qinghao Ye,Haoqi Fan,Quanquan Gu,Heng Huang,Chunyuan Li
关键词-EN: open-source large multimodal, large multimodal model, multimodal tasks, generalist evaluator, evaluator to assess
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: Project Page: this https URL

点击查看摘要

Abstract:We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as a generalist evaluator to assess performance across a wide range of multimodal tasks. LLaVA-Critic is trained using a high-quality critic instruction-following dataset that incorporates diverse evaluation criteria and scenarios. Our experiments demonstrate the model’s effectiveness in two key areas: (1) LMM-as-a-Judge, where LLaVA-Critic provides reliable evaluation scores, performing on par with or surpassing GPT models on multiple evaluation benchmarks; and (2) Preference Learning, where it generates reward signals for preference learning, enhancing model alignment capabilities. This work underscores the potential of open-source LMMs in self-critique and evaluation, setting the stage for future research into scalable, superhuman alignment feedback mechanisms for LMMs.

[CV-12] SteerDiff: Steering towards Safe Text-to-Image Diffusion Models

链接: https://arxiv.org/abs/2410.02710
作者: Hongxiang Zhang,Yifeng He,Hao Chen
关键词-EN: precise text alignment, generate high-quality images, drawn attention, ability to generate, generate high-quality
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Text-to-image (T2I) diffusion models have drawn attention for their ability to generate high-quality images with precise text alignment. However, these models can also be misused to produce inappropriate content. Existing safety measures, which typically rely on text classifiers or ControlNet-like approaches, are often insufficient. Traditional text classifiers rely on large-scale labeled datasets and can be easily bypassed by rephrasing. As diffusion models continue to scale, fine-tuning these safeguards becomes increasingly challenging and lacks flexibility. Recent red-teaming attack research further underscores the need for a new paradigm to prevent the generation of inappropriate content. In this paper, we introduce SteerDiff, a lightweight adaptor module designed to act as an intermediary between user input and the diffusion model, ensuring that generated images adhere to ethical and safety standards with little to no impact on usability. SteerDiff identifies and manipulates inappropriate concepts within the text embedding space to guide the model away from harmful outputs. We conduct extensive experiments across various concept unlearning tasks to evaluate the effectiveness of our approach. Furthermore, we benchmark SteerDiff against multiple red-teaming strategies to assess its robustness. Finally, we explore the potential of SteerDiff for concept forgetting tasks, demonstrating its versatility in text-conditioned image generation.
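
The abstract does not disclose the exact steering mechanism, but a common way to manipulate concepts in a text embedding space is to project out an estimated concept direction. The sketch below is a generic illustration of that idea, not SteerDiff itself; the direction-estimation rule and the `strength` parameter are assumptions for the example:

```python
import numpy as np

def concept_direction(unsafe_embs, safe_embs):
    """Estimate a concept direction as the normalized difference of means
    between embeddings of unsafe and safe example prompts."""
    d = unsafe_embs.mean(axis=0) - safe_embs.mean(axis=0)
    return d / np.linalg.norm(d)

def steer_away(embedding, direction, strength=1.0):
    """Subtract `strength` times the embedding's component along the
    concept direction, nudging the prompt away from the concept."""
    return embedding - strength * (embedding @ direction) * direction
```

With `strength=1.0` this is an orthogonal projection: the steered embedding has zero component along the estimated concept axis while the rest of the prompt semantics is left untouched.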

[CV-13] ControlAR: Controllable Image Generation with Autoregressive Models

链接: https://arxiv.org/abs/2410.02705
作者: Zongming Li,Tianheng Cheng,Shoufa Chen,Peize Sun,Haocheng Shen,Longjin Ran,Xiaoxin Chen,Wenyu Liu,Xinggang Wang
关键词-EN: demonstrating remarkable potential, models, Large Language Models, next-token prediction, demonstrating remarkable
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Preprint. Work in progress

点击查看摘要

Abstract:Autoregressive (AR) models have reformulated image generation as next-token prediction, demonstrating remarkable potential and emerging as strong competitors to diffusion models. However, control-to-image generation, akin to ControlNet, remains largely unexplored within AR models. Although a natural approach, inspired by advancements in Large Language Models, is to tokenize control images into tokens and prefill them into the autoregressive model before decoding image tokens, it still falls short in generation quality compared to ControlNet and suffers from inefficiency. To this end, we introduce ControlAR, an efficient and effective framework for integrating spatial controls into autoregressive image generation models. Firstly, we explore control encoding for AR models and propose a lightweight control encoder to transform spatial inputs (e.g., canny edges or depth maps) into control tokens. Then ControlAR exploits the conditional decoding method to generate the next image token conditioned on the per-token fusion between control and image tokens, similar to positional encodings. Compared to prefilling tokens, using conditional decoding significantly strengthens the control capability of AR models but also maintains the model’s efficiency. Furthermore, the proposed ControlAR surprisingly empowers AR models with arbitrary-resolution image generation via conditional decoding and specific controls. Extensive experiments demonstrate the controllability of the proposed ControlAR for the autoregressive control-to-image generation across diverse inputs, including edges, depths, and segmentation masks. Furthermore, both quantitative and qualitative results indicate that ControlAR surpasses previous state-of-the-art controllable diffusion models, e.g., ControlNet++. Code, models, and demo will soon be available at this https URL.

[CV-14] Lie Algebra Canonicalization: Equivariant Neural Operators under arbitrary Lie Groups

链接: https://arxiv.org/abs/2410.02698
作者: Zakhar Shumaylov,Peter Zaika,James Rowbottom,Ferdia Sherry,Melanie Weber,Carola-Bibiane Schönlieb
关键词-EN: generalizable machine learning, driven recent interest, equivariant neural networks, neural networks, Physics-Informed Neural Networks
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
*备注: 40 pages; preprint

点击查看摘要

Abstract:The quest for robust and generalizable machine learning models has driven recent interest in exploiting symmetries through equivariant neural networks. In the context of PDE solvers, recent works have shown that Lie point symmetries can be a useful inductive bias for Physics-Informed Neural Networks (PINNs) through data and loss augmentation. Despite this, directly enforcing equivariance within the model architecture for these problems remains elusive. This is because many PDEs admit non-compact symmetry groups, oftentimes not studied beyond their infinitesimal generators, making them incompatible with most existing equivariant architectures. In this work, we propose Lie aLgebrA Canonicalization (LieLAC), a novel approach that exploits only the action of infinitesimal generators of the symmetry group, circumventing the need for knowledge of the full group structure. To achieve this, we address existing theoretical issues in the canonicalization literature, establishing connections with frame averaging in the case of continuous non-compact groups. Operating within the framework of canonicalization, LieLAC can easily be integrated with unconstrained pre-trained models, transforming inputs to a canonical form before feeding them into the existing model, effectively aligning the input for model inference according to allowed symmetries. LieLAC utilizes standard Lie group descent schemes, achieving equivariance in pre-trained models. Finally, we showcase LieLAC’s efficacy on tasks of invariant image classification and Lie point symmetry equivariant neural PDE solvers using pre-trained models.

[CV-15] Unsupervised Point Cloud Completion through Unbalanced Optimal Transport

链接: https://arxiv.org/abs/2410.02671
作者: Taekyung Lee,Jaemoo Choi,Jaewoong Choi
关键词-EN: Unpaired point cloud, unbalanced optimal transport, point cloud completion, optimal transport map, point cloud
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 20 pages, 10 figures

点击查看摘要

Abstract:Unpaired point cloud completion explores methods for learning a completion map from unpaired incomplete and complete point cloud data. In this paper, we propose a novel approach for unpaired point cloud completion using the unbalanced optimal transport map, called Unbalanced Optimal Transport Map for Unpaired Point Cloud Completion (UOT-UPC). We demonstrate that the unpaired point cloud completion can be naturally interpreted as the Optimal Transport (OT) problem and introduce the Unbalanced Optimal Transport (UOT) approach to address the class imbalance problem, which is prevalent in unpaired point cloud completion datasets. Moreover, we analyze the appropriate cost function for unpaired completion tasks. This analysis shows that the InfoCD cost function is particularly well-suited for this task. Our model is the first attempt to leverage UOT for unpaired point cloud completion, achieving competitive or superior results on both single-category and multi-category datasets. In particular, our model is especially effective in scenarios with class imbalance, where the proportions of categories are different between the incomplete and complete point cloud datasets.
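
For readers unfamiliar with the OT machinery: the entropic Sinkhorn iteration below is the standard balanced building block; the paper's unbalanced variant additionally relaxes the marginal constraints to cope with class imbalance. A minimal NumPy sketch:

```python
import numpy as np

def sinkhorn(a, b, cost, eps=0.05, iters=500):
    """Entropic optimal transport between histograms a and b.

    Alternating scaling updates enforce the two marginal constraints;
    unbalanced OT replaces these hard projections with soft (e.g. KL)
    penalties so mass can be created or destroyed.
    """
    K = np.exp(-cost / eps)              # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(iters):
        u = a / (K @ v)                  # match row marginals
        v = b / (K.T @ u)                # match column marginals
    return u[:, None] * K * v[None, :]   # transport plan
```

On identical supports with a small regularizer the recovered plan is nearly diagonal, i.e. each point is transported to itself, which is the expected optimum.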

[CV-16] Measuring and Improving Persuasiveness of Generative Models

链接: https://arxiv.org/abs/2410.02653
作者: Somesh Singh,Yaman K Singla,Harini SI,Balaji Krishnamurthy
关键词-EN: workflows involving generating, involving generating content, workflows involving, directly interacting, involving generating
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:LLMs are increasingly being used in workflows involving generating content to be consumed by humans (e.g., marketing) and also in directly interacting with humans (e.g., through chatbots). The development of such systems that are capable of generating verifiably persuasive messages presents both opportunities and challenges for society. On the one hand, such systems could positively impact domains like advertising and social good, such as addressing drug addiction, and on the other, they could be misused for spreading misinformation and shaping political opinions. To channel LLMs’ impact on society, we need to develop systems to measure and benchmark their persuasiveness. With this motivation, we introduce PersuasionBench and PersuasionArena, the first large-scale benchmark and arena containing a battery of tasks to measure the persuasion ability of generative models automatically. We investigate to what extent LLMs know and leverage linguistic patterns that can help them generate more persuasive language. Our findings indicate that the persuasiveness of LLMs correlates positively with model size, but smaller models can also be made to have a higher persuasiveness than much larger models. Notably, targeted training using synthetic and natural datasets significantly enhances smaller models’ persuasive capabilities, challenging scale-dependent assumptions. Our findings carry key implications for both model developers and policymakers. For instance, while the EU AI Act and California’s SB-1047 aim to regulate AI models based on the number of floating point operations, we demonstrate that simple metrics like this alone fail to capture the full scope of AI’s societal impact. We invite the community to explore and contribute to PersuasionArena and PersuasionBench, available at this https URL, to advance our understanding of AI-driven persuasion and its societal implications.

[CV-17] Learning 3D Perception from Others Predictions

链接: https://arxiv.org/abs/2410.02646
作者: Jinsu Yoo,Zhenyang Feng,Tai-Yu Pan,Yihong Sun,Cheng Perng Phoo,Xiangyu Chen,Mark Campbell,Kilian Q. Weinberger,Bharath Hariharan,Wei-Lun Chao
关键词-EN: real-world environments requires, requires a huge, huge amount, environments requires, predictions
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Under review

点击查看摘要

Abstract:Accurate 3D object detection in real-world environments requires a huge amount of annotated data with high quality. Acquiring such data is tedious and expensive, and often needs repeated effort when a new sensor is adopted or when the detector is deployed in a new environment. We investigate a new scenario to construct 3D object detectors: learning from the predictions of a nearby unit that is equipped with an accurate detector. For example, when a self-driving car enters a new area, it may learn from other traffic participants whose detectors have been optimized for that area. This setting is label-efficient, sensor-agnostic, and communication-efficient: nearby units only need to share the predictions with the ego agent (e.g., car). Naively using the received predictions as ground-truths to train the detector for the ego car, however, leads to inferior performance. We systematically study the problem and identify viewpoint mismatches and mislocalization (due to synchronization and GPS errors) as the main causes, which unavoidably result in false positives, false negatives, and inaccurate pseudo labels. We propose a distance-based curriculum, first learning from closer units with similar viewpoints and subsequently improving the quality of other units’ predictions via self-training. We further demonstrate that an effective pseudo label refinement module can be trained with a handful of annotated data, largely reducing the data quantity necessary to train an object detector. We validate our approach on the recently released real-world collaborative driving dataset, using reference cars’ predictions as pseudo labels for the ego car. Extensive experiments including several scenarios (e.g., different sensors, detectors, and domains) demonstrate the effectiveness of our approach toward label-efficient learning of 3D perception from other units’ predictions.

[CV-18] Why Sample Space Matters: Keyframe Sampling Optimization for LiDAR-based Place Recognition

链接: https://arxiv.org/abs/2410.02643
作者: Nikolaos Stathoulopoulos,Vidya Sumathy,Christoforos Kanellakis,George Nikolakopoulos
关键词-EN: pushing real-world autonomy, Recent advances, place recognition, real-world autonomy, pushing real-world
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: 20 pages, 15 figures. Submitted

点击查看摘要

Abstract:Recent advances in robotics are pushing real-world autonomy, enabling robots to perform long-term and large-scale missions. A crucial component for successful missions is the incorporation of loop closures through place recognition, which effectively mitigates accumulated pose estimation drift. Despite computational advancements, optimizing performance for real-time deployment remains challenging, especially in resource-constrained mobile robots and multi-robot systems. Conventional keyframe sampling practices in place recognition often retain redundant information or overlook relevant data, as they rely on fixed sampling intervals or work directly in the 3D space instead of the feature space. To address these concerns, we introduce the concept of sample space in place recognition and demonstrate how different sampling techniques affect the query process and overall performance. We then present a novel keyframe sampling approach for LiDAR-based place recognition, which focuses on redundancy minimization and information preservation in the hyper-dimensional descriptor space. This approach is applicable to both learning-based and handcrafted descriptors, and through the experimental validation across multiple datasets and descriptor frameworks, we demonstrate the effectiveness of our proposed method, showing it can jointly minimize redundancy and preserve essential information in real-time. The proposed approach maintains robust performance across various datasets without requiring parameter tuning, contributing to more efficient and reliable place recognition for a wide range of robotic applications.
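
One simple instantiation of "redundancy minimization in descriptor space" (illustrative only; the paper's sampler is more elaborate) is a greedy filter that keeps a frame only when its descriptor is sufficiently far from all descriptors already retained:

```python
import numpy as np

def sample_keyframes(descriptors, min_dist=0.5):
    """Greedy keyframe selection in descriptor space.

    A frame is kept only if its descriptor is at least `min_dist` away
    (Euclidean) from every previously kept descriptor, so near-duplicate
    observations are dropped regardless of how far apart they are in 3D.
    """
    kept = [0]  # always keep the first frame
    for i in range(1, len(descriptors)):
        d = np.linalg.norm(descriptors[kept] - descriptors[i], axis=1)
        if d.min() >= min_dist:
            kept.append(i)
    return kept
```

Note that the decision is made in feature space, not by a fixed time or distance interval, which is exactly the shift in sampling criterion the abstract argues for.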

[CV-19] Spatial-Temporal Multi-Cuts for Online Multiple-Camera Vehicle Tracking

链接: https://arxiv.org/abs/2410.02638
作者: Fabian Herzog,Johannes Gilg,Philipp Wolters,Torben Teepe,Gerhard Rigoll
关键词-EN: intelligent transportation systems, smart city applications, Accurate online multiple-camera, multiple-camera vehicle tracking, online multiple-camera vehicle
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Accurate online multiple-camera vehicle tracking is essential for intelligent transportation systems, autonomous driving, and smart city applications. Like single-camera multiple-object tracking, it is commonly formulated as a graph problem of tracking-by-detection. Within this framework, existing online methods usually consist of two-stage procedures that cluster temporally first, then spatially, or vice versa. This is computationally expensive and prone to error accumulation. We introduce a graph representation that allows spatial-temporal clustering in a single, combined step: New detections are spatially and temporally connected with existing clusters. By keeping sparse appearance and positional cues of all detections in a cluster, our method can compare clusters based on the strongest available evidence. The final tracks are obtained online using a simple multicut assignment procedure. Our method does not require any training on the target scene, pre-extraction of single-camera tracks, or additional annotations. Notably, we outperform the online state-of-the-art in terms of IDF1 by more than 14% on the CityFlow dataset and by more than 25% on the Synthehicle dataset. The code is publicly available.

[CV-20] Plots Unlock Time-Series Understanding in Multimodal Models

链接: https://arxiv.org/abs/2410.02637
作者: Mayank Daswani,Mathias M.J. Bellaiche,Marc Wilson,Desislav Ivanov,Mikhail Papkov,Eva Schnider,Jing Tang,Kay Lamerigts,Gabriela Botea,Michael A. Sanchez,Yojan Patel,Shruthi Prabhakara,Shravya Shetty,Umesh Telang
关键词-EN: data-driven insights, fields like healthcare, social sciences, representing a missed, opportunity for richer
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 49 pages

点击查看摘要

Abstract:While multimodal foundation models can now natively work with data beyond text, they remain underutilized in analyzing the considerable amounts of multi-dimensional time-series data in fields like healthcare, finance, and social sciences, representing a missed opportunity for richer, data-driven insights. This paper proposes a simple but effective method that leverages the existing vision encoders of these models to “see” time-series data via plots, avoiding the need for additional, potentially costly, model training. Our empirical evaluations show that this approach outperforms providing the raw time-series data as text, with the additional benefit that visual time-series representations demonstrate up to a 90% reduction in model API costs. We validate our hypothesis through synthetic data tasks of increasing complexity, progressing from simple functional form identification on clean data, to extracting trends from noisy scatter plots. To demonstrate generalizability from synthetic tasks with clear reasoning steps to more complex, real-world scenarios, we apply our approach to consumer health tasks - specifically fall detection, activity recognition, and readiness assessment - which involve heterogeneous, noisy data and multi-step reasoning. The overall success of plot performance over text performance (up to a 120% performance increase on zero-shot synthetic tasks, and up to 150% performance increase on real-world tasks), across both GPT and Gemini model families, highlights our approach’s potential for making the best use of the native capabilities of foundation models.
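
The core trick, feeding a rendered plot instead of raw numbers, can be sketched in a few lines with matplotlib (assumed available). The returned PNG bytes would then be passed to a multimodal model's vision input; the function name and figure styling here are illustrative choices:

```python
import io
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless rendering, no display needed
import matplotlib.pyplot as plt

def series_to_png(values, title="sensor signal"):
    """Render a 1D time series as a PNG image.

    Instead of serializing thousands of numbers into the text prompt,
    the series is drawn once and sent as a single image.
    """
    fig, ax = plt.subplots(figsize=(4, 2), dpi=100)
    ax.plot(np.arange(len(values)), values)
    ax.set_title(title)
    ax.set_xlabel("t")
    buf = io.BytesIO()
    fig.savefig(buf, format="png", bbox_inches="tight")
    plt.close(fig)
    return buf.getvalue()
```

A long series rendered this way costs a fixed number of image tokens, which is where the API-cost reduction cited in the abstract comes from.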

[CV-21] Metrics Revolutions: Groundbreaking Insights into the Implementation of Metrics for Biomedical Image Segmentation

链接: https://arxiv.org/abs/2410.02630
作者: Gašper Podobnik,Tomaž Vrtovec
关键词-EN: recently released metrics, released metrics selection, metrics selection guidelines, biomedical image analysis, distance-based metrics computation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The evaluation of segmentation performance is a common task in biomedical image analysis, with its importance emphasized in the recently released metrics selection guidelines and computing frameworks. To quantitatively evaluate the alignment of two segmentations, researchers commonly resort to counting metrics, such as the Dice similarity coefficient, or distance-based metrics, such as the Hausdorff distance, which are usually computed by publicly available open-source tools with an inherent assumption that these tools provide consistent results. In this study we questioned this assumption, and performed a systematic implementation analysis along with quantitative experiments on real-world clinical data to compare 11 open-source tools for distance-based metrics computation against our highly accurate mesh-based reference implementation. The results revealed statistically significant differences among all open-source tools, which is both surprising and concerning, since it calls into question the validity of existing studies. Besides identifying the main sources of variation, we also provide recommendations for distance-based metrics computation.
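
For reference, the two metric families mentioned can be computed from first principles as below. This is a brute-force NumPy sketch over foreground pixels; production tools differ exactly in the implementation details this study scrutinizes, such as boundary extraction, voxel spacing, and percentile variants:

```python
import numpy as np

def dice(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks (1 = identical)."""
    inter = np.logical_and(mask_a, mask_b).sum()
    return 2.0 * inter / (mask_a.sum() + mask_b.sum())

def hausdorff(mask_a, mask_b):
    """Symmetric Hausdorff distance between the two foreground pixel sets.

    Brute force over all pixel pairs; reference implementations instead
    operate on extracted surfaces or distance maps, one of the sources of
    the discrepancies reported in the study.
    """
    pts_a = np.argwhere(mask_a)
    pts_b = np.argwhere(mask_b)
    d = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())
```

Even this tiny example shows why results diverge: whether distances are measured between full foreground sets, boundaries, or meshes changes the Hausdorff value while leaving Dice untouched.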

[CV-22] GI-GS: Global Illumination Decomposition on Gaussian Splatting for Inverse Rendering

链接: https://arxiv.org/abs/2410.02619
作者: Hongze Chen,Zehong Lin,Jun Zhang
关键词-EN: Gaussian Splatting, achieve photo-realistic, indirect lighting, indirect, lighting
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We present GI-GS, a novel inverse rendering framework that leverages 3D Gaussian Splatting (3DGS) and deferred shading to achieve photo-realistic novel view synthesis and relighting. In inverse rendering, accurately modeling the shading processes of objects is essential for achieving high-fidelity results. Therefore, it is critical to incorporate global illumination to account for indirect lighting that reaches an object after multiple bounces across the scene. Previous 3DGS-based methods have attempted to model indirect lighting by characterizing indirect illumination as learnable lighting volumes or additional attributes of each Gaussian, while using baked occlusion to represent shadow effects. These methods, however, fail to accurately model the complex physical interactions between light and objects, making it impossible to construct realistic indirect illumination during relighting. To address this limitation, we propose to calculate indirect lighting using efficient path tracing with deferred shading. In our framework, we first render a G-buffer to capture the detailed geometry and material properties of the scene. Then, we perform physically-based rendering (PBR) only for direct lighting. With the G-buffer and previous rendering results, the indirect lighting can be calculated through a lightweight path tracing. Our method effectively models indirect lighting under any given lighting conditions, thereby achieving better novel view synthesis and relighting. Quantitative and qualitative results show that our GI-GS outperforms existing baselines in both rendering quality and efficiency.

[CV-23] NL-Eye: Abductive NLI for Images

链接: https://arxiv.org/abs/2410.02613
作者: Mor Ventura,Michael Toker,Nitay Calderon,Zorik Gekhman,Yonatan Bitton,Roi Reichart
关键词-EN: Natural Language Inference, wet floor, detects a wet, abductive Natural Language, Visual Language Model
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Will a Visual Language Model (VLM)-based bot warn us about slipping if it detects a wet floor? Recent VLMs have demonstrated impressive capabilities, yet their ability to infer outcomes and causes remains underexplored. To address this, we introduce NL-Eye, a benchmark designed to assess VLMs’ visual abductive reasoning skills. NL-Eye adapts the abductive Natural Language Inference (NLI) task to the visual domain, requiring models to evaluate the plausibility of hypothesis images based on a premise image and explain their decisions. NL-Eye consists of 350 carefully curated triplet examples (1,050 images) spanning diverse reasoning categories: physical, functional, logical, emotional, cultural, and social. The data curation process involved two steps - writing textual descriptions and generating images using text-to-image models, both requiring substantial human involvement to ensure high-quality and challenging scenes. Our experiments show that VLMs struggle significantly on NL-Eye, often performing at random baseline levels, while humans excel in both plausibility prediction and explanation quality. This demonstrates a deficiency in the abductive reasoning capabilities of modern VLMs. NL-Eye represents a crucial step toward developing VLMs capable of robust multimodal reasoning for real-world applications, including accident-prevention bots and generated video verification.

[CV-24] IC3M: In-Car Multimodal Multi-object Monitoring for Abnormal Status of Both Driver and Passengers

链接: https://arxiv.org/abs/2410.02592
作者: Zihan Fang,Zheng Lin,Senkang Hu,Hangcheng Cao,Yiqin Deng,Xianhao Chen,Yuguang Fang
关键词-EN: prevent traffic accidents, providing timely alerts, detecting early-stage abnormal, early-stage abnormal status, abnormal status
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 16 pages, 17 figures

点击查看摘要

Abstract:Recently, in-car monitoring has emerged as a promising technology for detecting early-stage abnormal status of the driver and providing timely alerts to prevent traffic accidents. Although training models with multimodal data enhances the reliability of abnormal status detection, the scarcity of labeled data and the imbalance of class distribution impede the extraction of critical abnormal state features, significantly deteriorating training performance. Furthermore, missing modalities due to environment and hardware limitations further exacerbate the challenge of abnormal status identification. More importantly, monitoring abnormal health conditions of passengers, particularly in elderly care, is of paramount importance but remains underexplored. To address these challenges, we introduce our IC3M, an efficient camera-rotation-based multimodal framework for monitoring both driver and passengers in a car. Our IC3M comprises two key modules: an adaptive threshold pseudo-labeling strategy and a missing modality reconstruction. The former customizes pseudo-labeling thresholds for different classes based on the class distribution, generating class-balanced pseudo labels to guide model training effectively, while the latter leverages cross-modality relationships learned from limited labels to accurately recover missing modalities by transferring distributions from the available modalities. Extensive experimental results demonstrate that IC3M outperforms state-of-the-art benchmarks in accuracy, precision, and recall while exhibiting superior robustness under limited labeled data and severe missing modality.
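
The adaptive-threshold idea, giving rare classes lower confidence thresholds so they are not starved of pseudo labels, can be illustrated with a simple rule. The formula below is a hypothetical stand-in; the paper's exact thresholding scheme is not given in the abstract:

```python
import numpy as np

def adaptive_thresholds(class_freq, base=0.95):
    """Scale a base confidence threshold by relative class frequency
    (illustrative rule): the rarest class gets the lowest threshold."""
    freq = np.asarray(class_freq, dtype=float)
    return base * freq / freq.max()

def pseudo_label(probs, thresholds):
    """Assign a pseudo label only when the top-class probability clears
    that class's threshold; rejected samples get label -1."""
    labels = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    keep = conf >= thresholds[labels]
    return np.where(keep, labels, -1)
```

With a single global threshold, a dominant class would contribute almost all pseudo labels; the per-class scaling lets minority-class predictions through at lower confidence, which is the class-balancing effect described above.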

[CV-25] An Improved Variational Method for Image Denoising

链接: https://arxiv.org/abs/2410.02587
作者: Jing-En Huang,Jia-Wei Liao,Ku-Te Lin,Yu-Ju Tsai,Mei-Heng Yueh
关键词-EN: total variation, minimizing the total, image denoising technique, pixel intensities, technique that aims
类目: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:The total variation (TV) method is an image denoising technique that aims to reduce noise by minimizing the total variation of the image, which measures the variation in pixel intensities. The TV method has been widely applied in image processing and computer vision for its ability to preserve edges and enhance image quality. In this paper, we propose an improved TV model for image denoising and the associated numerical algorithm to carry out the procedure, which is particularly effective in removing several types of noises and their combinations. Our improved model admits a unique solution and the associated numerical algorithm guarantees the convergence. Numerical experiments demonstrate improved effectiveness and denoising quality compared to other TV models. Such encouraging results further enhance the utility of the TV method in image processing.
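
For orientation, a baseline ROF-style TV denoiser (smoothed-TV gradient descent; this is the classic model the paper improves upon, not the paper's method) shows the quantities involved:

```python
import numpy as np

def tv_value(u, eps=1e-8):
    """Smoothed isotropic total variation of image u (forward differences)."""
    gx = np.diff(u, axis=1, append=u[:, -1:])
    gy = np.diff(u, axis=0, append=u[-1:, :])
    return np.sqrt(gx**2 + gy**2 + eps).sum()

def tv_denoise(f, lam=0.2, step=0.1, iters=200, eps=1e-8):
    """Gradient descent on 0.5 * ||u - f||^2 + lam * TV_smooth(u).

    The divergence term uses a wrap-around backward difference, an
    acceptable boundary approximation for this sketch.
    """
    u = f.copy()
    for _ in range(iters):
        gx = np.diff(u, axis=1, append=u[:, -1:])
        gy = np.diff(u, axis=0, append=u[-1:, :])
        norm = np.sqrt(gx**2 + gy**2 + eps)
        px, py = gx / norm, gy / norm
        div = (px - np.roll(px, 1, axis=1)) + (py - np.roll(py, 1, axis=0))
        u -= step * ((u - f) - lam * div)
    return u
```

Minimizing the TV term penalizes oscillations (noise) while the magnitude-independent gradient of the smoothed TV still allows sharp jumps, which is why edges survive the smoothing.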

[CV-26] SuperGS: Super-Resolution 3D Gaussian Splatting via Latent Feature Field and Gradient-guided Splitting

链接: https://arxiv.org/abs/2410.02571
作者: Shiyun Xie,Zhiru Wang,Yinghao Zhu,Chengwei Pan
关键词-EN: real-time rendering capabilities, Gaussian Splatting, Feature Gaussian Splatting, superior quality, view synthesis
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recently, 3D Gaussian Splatting (3DGS) has excelled in novel view synthesis with its real-time rendering capabilities and superior quality. However, it faces challenges for high-resolution novel view synthesis (HRNVS) due to the coarse nature of primitives derived from low-resolution input views. To address this issue, we propose Super-Resolution 3DGS (SuperGS), which is an expansion of 3DGS designed with a two-stage coarse-to-fine training framework, utilizing pretrained low-resolution scene representation as an initialization for super-resolution optimization. Moreover, we introduce Multi-resolution Feature Gaussian Splatting (MFGS) to incorporate a latent feature field for flexible feature sampling and Gradient-guided Selective Splitting (GSS) for effective Gaussian upsampling. Integrating these strategies within the coarse-to-fine framework ensures both high fidelity and memory efficiency. Extensive experiments demonstrate that SuperGS surpasses state-of-the-art HRNVS methods on challenging real-world datasets using only low-resolution inputs.

[CV-27] Pseudo-Stereo Inputs: A Solution to the Occlusion Challenge in Self-Supervised Stereo Matching

链接: https://arxiv.org/abs/2410.02534
作者: Ruizhi Yang,Xingqiang Li,Jiajun Bai,Jinsong Du
关键词-EN: Self-supervised stereo matching, expensive labeled data, holds great promise, direct self-supervised stereo, Self-supervised stereo
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Submitted to IEEE Transactions on Image Processing (TIP)

点击查看摘要

Abstract:Self-supervised stereo matching holds great promise for application and research due to its independence from expensive labeled data. However, direct self-supervised stereo matching paradigms based on photometric loss functions have consistently struggled with performance issues due to the occlusion challenge. The crux of the occlusion challenge lies in the fact that the positions of occluded pixels consistently align with the epipolar search direction defined by the input stereo images, leading to persistent information loss and erroneous feedback at fixed locations during self-supervised training. In this work, we propose a simple yet highly effective pseudo-stereo inputs strategy to address the core occlusion challenge. This strategy decouples the input and feedback images, compelling the network to probabilistically sample information from both sides of the occluding objects. As a result, the persistent lack of information in the aforementioned fixed occlusion areas is mitigated. Building upon this, we further address feedback conflicts and overfitting issues arising from the strategy. By integrating these components, our method achieves stable and significant performance improvements compared to existing methods. Quantitative experiments are conducted to evaluate the performance. Qualitative experiments further demonstrate accurate disparity inference even at occluded regions. These results demonstrate a significant advancement over previous methods in the field of direct self-supervised stereo matching based on photometric loss. The proposed pseudo-stereo inputs strategy, due to its simplicity and effectiveness, has the potential to serve as a new paradigm for direct self-supervised stereo matching. Code is available at this https URL.

[CV-28] HiFiSeg: High-Frequency Information Enhanced Polyp Segmentation with Global-Local Vision Transformer

链接: https://arxiv.org/abs/2410.02528
作者: Jingjing Ren,Xiaoyong Zhang,Lina Zhang
关键词-EN: Numerous studies, based methods, studies have demonstrated, demonstrated the strong, computer vision tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Numerous studies have demonstrated the strong performance of Vision Transformer (ViT)-based methods across various computer vision tasks. However, ViT models often struggle to effectively capture high-frequency components in images, which are crucial for detecting small targets and preserving edge details, especially in complex scenarios. This limitation is particularly challenging in colon polyp segmentation, where polyps exhibit significant variability in structure, texture, and shape. High-frequency information, such as boundary details, is essential for achieving precise semantic segmentation in this context. To address these challenges, we propose HiFiSeg, a novel network for colon polyp segmentation that enhances high-frequency information processing through a global-local vision transformer framework. HiFiSeg leverages the pyramid vision transformer (PVT) as its encoder and introduces two key modules: the global-local interaction module (GLIM) and the selective aggregation module (SAM). GLIM employs a parallel structure to fuse global and local information at multiple scales, effectively capturing fine-grained features. SAM selectively integrates boundary details from low-level features with semantic information from high-level features, significantly improving the model’s ability to accurately detect and segment polyps. Extensive experiments on five widely recognized benchmark datasets demonstrate the effectiveness of HiFiSeg for polyp segmentation. Notably, the mDice scores on the challenging CVC-ColonDB and ETIS datasets reached 0.826 and 0.822, respectively, underscoring the superior performance of HiFiSeg in handling the specific complexities of this task.

[CV-29] Learning from Offline Foundation Features with Tensor Augmentations NEURIPS2024

链接: https://arxiv.org/abs/2410.02527
作者: Emir Konuk,Christos Matsoukas,Moein Sorkhei,Phitchapha Lertsiravaramet,Kevin Smith
关键词-EN: Offline Foundation Features, Learning from Offline, introduce Learning, efficient training scheme, training scheme designed
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to the 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:We introduce Learning from Offline Foundation Features with Tensor Augmentations (LOFF-TA), an efficient training scheme designed to harness the capabilities of foundation models in limited resource settings where their direct development is not feasible. LOFF-TA involves training a compact classifier on cached feature embeddings from a frozen foundation model, resulting in up to 37× faster training and up to 26× reduced GPU memory usage. Because the embeddings of augmented images would be too numerous to store, yet the augmentation process is essential for training, we propose to apply tensor augmentations to the cached embeddings of the original non-augmented images. LOFF-TA makes it possible to leverage the power of foundation models, regardless of their size, in settings with limited computational capacity. Moreover, LOFF-TA can be used to apply foundation models to high-resolution images without increasing compute. In certain scenarios, we find that training with LOFF-TA yields better results than directly fine-tuning the foundation model.
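The pipeline described (cache embeddings once from a frozen model, then train a compact classifier on augmented cached features) can be sketched as below. Everything here is a toy stand-in: the "frozen encoder" is a fixed random projection, the labels are synthetic, and the tensor augmentations (light mixup plus Gaussian noise on embeddings) are hypothetical examples, not the paper's exact augmentations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen foundation model: a fixed random projection.
W_frozen = rng.standard_normal((64, 16))

def encode(x):
    # Frozen encoder; in LOFF-TA it is run only once to build the cache.
    return np.tanh(x @ W_frozen)

# 1) Cache embeddings of the original (non-augmented) images offline.
X = rng.standard_normal((200, 64))          # toy "images"
y = (X @ W_frozen[:, 0] > 0).astype(float)  # toy labels
cache = encode(X)

def tensor_augment(feats):
    # Tensor augmentations applied to the cached embeddings (hypothetical
    # stand-ins: light mixup with a shuffled batch plus Gaussian noise).
    idx = rng.permutation(len(feats))
    mixed = 0.9 * feats + 0.1 * feats[idx]
    return mixed + 0.01 * rng.standard_normal(feats.shape)

# 2) Train a compact classifier (logistic regression) on augmented features;
#    the expensive encoder is never called again during training.
w = np.zeros(cache.shape[1])
for _ in range(300):
    f = tensor_augment(cache)
    logits = np.clip(f @ w, -30.0, 30.0)
    p = 1.0 / (1.0 + np.exp(-logits))
    w -= 0.5 * f.T @ (p - y) / len(y)

acc = float(np.mean(((cache @ w) > 0) == (y > 0)))
```

The speed and memory savings come from step 2 touching only the small cached tensors rather than the foundation model itself.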

[CV-30] Dog-IQA: Standard-guided Zero-shot MLLM for Mix-grained Image Quality Assessment

链接: https://arxiv.org/abs/2410.02505
作者: Kai Liu,Ziqing Zhang,Wenbo Li,Renjing Pei,Fenglong Song,Xiaohong Liu,Linghe Kong,Yulun Zhang
关键词-EN: computer vision fields, Image quality assessment, quality assessment, vision fields, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 10 pages, 5 figures. The code and models will be available at this https URL

点击查看摘要

Abstract:Image quality assessment (IQA) serves as the golden standard for all models’ performance in nearly all computer vision fields. However, it still suffers from poor out-of-distribution generalization ability and expensive training costs. To address these problems, we propose Dog-IQA, a standard-guided zero-shot mix-grained IQA method, which is training-free and utilizes the exceptional prior knowledge of multimodal large language models (MLLMs). To obtain accurate IQA scores, namely scores consistent with humans, we design an MLLM-based inference pipeline that imitates human experts. In detail, Dog-IQA applies two techniques. First, Dog-IQA objectively scores with specific standards that utilize MLLM’s behavior pattern and minimize the influence of subjective factors. Second, Dog-IQA comprehensively takes local semantic objects and the whole image as input and aggregates their scores, leveraging local and global information. Our proposed Dog-IQA achieves state-of-the-art (SOTA) performance compared with training-free methods, and competitive performance compared with training-based methods in cross-dataset scenarios. Our code and models will be available at this https URL.

[CV-31] DTVLT: A Multi-modal Diverse Text Benchmark for Visual Language Tracking Based on LLM

链接: https://arxiv.org/abs/2410.02492
作者: Xuchen Li,Shiyu Hu,Xiaokun Feng,Dailing Zhang,Meiqi Wu,Jing Zhang,Kaiqi Huang
关键词-EN: harnessing linguistic data, traditional single object, cutting-edge research area, single object tracking, video understanding applications
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: Preprint, Under Review

点击查看摘要

Abstract:Visual language tracking (VLT) has emerged as a cutting-edge research area, harnessing linguistic data to enhance algorithms with multi-modal inputs and broadening the scope of traditional single object tracking (SOT) to encompass video understanding applications. Despite this, most VLT benchmarks still depend on succinct, human-annotated text descriptions for each video. These descriptions often fall short in capturing the nuances of video content dynamics and lack stylistic variety in language, constrained by their uniform level of detail and a fixed annotation frequency. As a result, algorithms tend to default to a “memorize the answer” strategy, diverging from the core objective of achieving a deeper understanding of video content. Fortunately, the emergence of large language models (LLMs) has enabled the generation of diverse text. This work utilizes LLMs to generate varied semantic annotations (in terms of text lengths and granularities) for representative SOT benchmarks, thereby establishing a novel multi-modal benchmark. Specifically, we (1) propose a new visual language tracking benchmark with diverse texts, named DTVLT, based on five prominent VLT and SOT benchmarks, including three sub-tasks: short-term tracking, long-term tracking, and global instance tracking. (2) We offer four granularity texts in our benchmark, considering the extent and density of semantic information. We expect this multi-granular generation strategy to foster a favorable environment for VLT and video understanding research. (3) We conduct comprehensive experimental analyses on DTVLT, evaluating the impact of diverse text on tracking performance and hope the identified performance bottlenecks of existing algorithms can support further research in VLT and video understanding. The proposed benchmark, experimental results and toolkit will be released gradually on this http URL.

[CV-32] Event-Customized Image Generation

链接: https://arxiv.org/abs/2410.02483
作者: Zhen Wang,Yilei Jiang,Dong Zheng,Jun Xiao,Long Chen
关键词-EN: raised significant attention, significant attention due, Customized Image Generation, generating customized images, user-specified concepts
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Customized Image Generation, generating customized images with user-specified concepts, has raised significant attention due to its creativity and novelty. With impressive progress achieved in subject customization, some pioneer works further explored the customization of action and interaction beyond entity (i.e., human, animal, and object) appearance. However, these approaches only focus on basic actions and interactions between two entities, and their effects are limited by insufficient "exactly same" reference images. To extend customized image generation to more complex scenes for general real-world applications, we propose a new task: event-customized image generation. Given a single reference image, we define the "event" as all specific actions, poses, relations, or interactions between different entities in the scene. This task aims at accurately capturing the complex event and generating customized images with various target entities. To solve this task, we proposed a novel training-free event customization method: FreeEvent. Specifically, FreeEvent introduces two extra paths alongside the general diffusion denoising process: 1) Entity switching path: it applies cross-attention guidance and regulation for target entity generation. 2) Event transferring path: it injects the spatial feature and self-attention maps from the reference image to the target image for event generation. To further facilitate this new task, we collected two evaluation benchmarks: SWiG-Event and Real-Event. Extensive experiments and ablations have demonstrated the effectiveness of FreeEvent.

[CV-33] Towards a Theoretical Understanding of Memorization in Diffusion Models

链接: https://arxiv.org/abs/2410.02467
作者: Yunhao Chen,Xingjun Ma,Difan Zou,Yu-Gang Jiang
关键词-EN: Generative Artificial Intelligence, Artificial Intelligence, Generative Artificial, attracted growing attention, data
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注: arXiv admin note: text overlap with arXiv:2406.12752

点击查看摘要

Abstract:As diffusion probabilistic models (DPMs) are being employed as mainstream models for Generative Artificial Intelligence (GenAI), the study of their memorization of training data has attracted growing attention. Existing works in this direction aim to establish an understanding of whether or to what extent DPMs learn via memorization. Such an understanding is crucial for identifying potential risks of data leakage and copyright infringement in diffusion models and, more importantly, for trustworthy application of GenAI. Existing works revealed that conditional DPMs are more prone to training data memorization than unconditional DPMs, and the motivated data extraction methods are mostly for conditional DPMs. However, these understandings are primarily empirical, and extracting training data from unconditional models has been found to be extremely challenging. In this work, we provide a theoretical understanding of memorization in both conditional and unconditional DPMs under the assumption of model convergence. Our theoretical analysis indicates that extracting data from unconditional models can also be effective by constructing a proper surrogate condition. Based on this result, we propose a novel data extraction method named Surrogate condItional Data Extraction (SIDE) that leverages a time-dependent classifier trained on the generated data as a surrogate condition to extract training data from unconditional DPMs. Empirical results demonstrate that our SIDE can extract training data in challenging scenarios where previous methods fail, and it is, on average, over 50% more effective across different scales of the CelebA dataset.

[CV-34] Recurrent Few-Shot model for Document Verification

链接: https://arxiv.org/abs/2410.02456
作者: Maxime Talarmain,Carlos Boned,Sanket Biswas,Oriol Ramos
关键词-EN: video-based verification systems, solved problem, video-based verification, verification systems, considered a solved
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:General-purpose ID, or travel, document image- and video-based verification systems have yet to achieve good enough performance to be considered a solved problem. There are several factors that negatively impact their performance, including low-resolution images and videos and a lack of sufficient data to train the models. This task is particularly challenging when dealing with unseen classes of ID, or travel, documents. In this paper we address this task by proposing a recurrent-based model able to detect forged documents in a few-shot scenario. The recurrent architecture makes the model robust to document resolution variability. Moreover, the few-shot approach allows the model to perform well even for unseen classes of documents. Preliminary results on the SIDTD and Findit datasets show good performance of this model for this task.

[CV-35] Clinnova Federated Learning Proof of Concept: Key Takeaways from a Cross-border Collaboration

链接: https://arxiv.org/abs/2410.02443
作者: Julia Alekseenko,Bram Stieltjes,Michael Bach,Melanie Boerries,Oliver Opitz,Alexandros Karargyris,Nicolas Padoy
关键词-EN: initiative involving France, European Greater Region, collaborative initiative involving, involving France, Greater Region initiative
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Clinnova, a collaborative initiative involving France, Germany, Switzerland, and Luxembourg, is dedicated to unlocking the power of precision medicine through data federation, standardization, and interoperability. This European Greater Region initiative seeks to create an interoperable European standard using artificial intelligence (AI) and data science to enhance healthcare outcomes and efficiency. Key components include multidisciplinary research centers, a federated biobanking strategy, a digital health innovation platform, and a federated AI strategy. It targets inflammatory bowel disease, rheumatoid diseases, and multiple sclerosis (MS), emphasizing data quality to develop AI algorithms for personalized treatment and translational research. The IHU Strasbourg (Institute of Minimal-invasive Surgery) has the lead in this initiative to develop the federated learning (FL) proof of concept (POC) that will serve as a foundation for advancing AI in healthcare. At its core, Clinnova-MS aims to enhance MS patient care by using FL to develop more accurate models that detect disease progression, guide interventions, and validate digital biomarkers across multiple sites. This technical report presents insights and key takeaways from the first cross-border federated POC on MS segmentation of MRI images within the Clinnova framework. While our work marks a significant milestone in advancing MS segmentation through cross-border collaboration, it also underscores the importance of addressing technical, logistical, and ethical considerations to realize the full potential of FL in healthcare settings.
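At the heart of any such federated POC is the server-side aggregation step. The canonical algorithm is FedAvg, which averages each site's locally trained parameters weighted by its number of training samples; the sketch below shows that generic step, not Clinnova's actual software stack:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """One FedAvg aggregation round: average locally trained parameter
    vectors, weighting each site by its number of training samples, so
    raw patient data never leaves the sites."""
    sizes = np.asarray(client_sizes, dtype=float)
    coeffs = sizes / sizes.sum()
    return sum(c * np.asarray(w) for c, w in zip(coeffs, client_weights))
```

In a real deployment each round interleaves local training at every hospital with this aggregation on a coordinating server.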

[CV-36] Predictive Attractor Models NEURIPS2024

链接: https://arxiv.org/abs/2410.02430
作者: Ramy Mounir,Sudeep Sarkar
关键词-EN: episodic memory formation, numerous cognitive functions, underpins numerous cognitive, language comprehension, Sequential memory
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: Accepted to NeurIPS 2024

点击查看摘要

Abstract:Sequential memory, the ability to form and accurately recall a sequence of events or stimuli in the correct order, is a fundamental prerequisite for biological and artificial intelligence as it underpins numerous cognitive functions (e.g., language comprehension, planning, episodic memory formation, etc.) However, existing methods of sequential memory suffer from catastrophic forgetting, limited capacity, slow iterative learning procedures, low-order Markov memory, and, most importantly, the inability to represent and generate multiple valid future possibilities stemming from the same context. Inspired by biologically plausible neuroscience theories of cognition, we propose Predictive Attractor Models (PAM), a novel sequence memory architecture with desirable generative properties. PAM is a streaming model that learns a sequence in an online, continuous manner by observing each input only once. Additionally, we find that PAM avoids catastrophic forgetting by uniquely representing past context through lateral inhibition in cortical minicolumns, which prevents new memories from overwriting previously learned knowledge. PAM generates future predictions by sampling from a union set of predicted possibilities; this generative ability is realized through an attractor model trained alongside the predictor. We show that PAM is trained with local computations through Hebbian plasticity rules in a biologically plausible framework. Other desirable traits (e.g., noise tolerance, CPU-based learning, capacity scaling) are discussed throughout the paper. Our findings suggest that PAM represents a significant step forward in the pursuit of biologically plausible and computationally efficient sequential memory models, with broad implications for cognitive science and artificial intelligence research.

[CV-37] PnP-Flow: Plug-and-Play Image Restoration with Flow Matching

链接: https://arxiv.org/abs/2410.02423
作者: Ségolène Martin,Anne Gagneux,Paul Hagemann,Gabriele Steidl
关键词-EN: Flow Matching, solving imaging inverse, Flow, Matching, Flow Matching pushed
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we introduce Plug-and-Play (PnP) Flow Matching, an algorithm for solving imaging inverse problems. PnP methods leverage the strength of pre-trained denoisers, often deep neural networks, by integrating them in optimization schemes. While they achieve state-of-the-art performance on various inverse problems in imaging, PnP approaches face inherent limitations on more generative tasks like inpainting. On the other hand, generative models such as Flow Matching pushed the boundary in image sampling yet lack a clear method for efficient use in image restoration. We propose to combine the PnP framework with Flow Matching (FM) by defining a time-dependent denoiser using a pre-trained FM model. Our algorithm alternates between gradient descent steps on the data-fidelity term, reprojections onto the learned FM path, and denoising. Notably, our method is computationally efficient and memory-friendly, as it avoids backpropagation through ODEs and trace computations. We evaluate its performance on denoising, super-resolution, deblurring, and inpainting tasks, demonstrating superior results compared to existing PnP algorithms and Flow Matching based state-of-the-art methods.
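The alternating structure described (a gradient step on the data-fidelity term followed by a denoising step) is the generic PnP iteration, sketched below. The moving-average `smooth` function is a hypothetical stand-in for the paper's time-dependent Flow-Matching denoiser, and the reprojection onto the FM path is omitted:

```python
import numpy as np

def smooth(x):
    # Hypothetical plug-in denoiser (small moving average), standing in for
    # the paper's time-dependent denoiser built from a pretrained FM model.
    return np.convolve(x, np.array([0.25, 0.5, 0.25]), mode="same")

def pnp_restore(y, A, denoise, steps=30, tau=0.5):
    """Generic Plug-and-Play iteration for the inverse problem y = A x + noise:
    gradient descent step on the data-fidelity term 0.5*||A x - y||^2,
    then a denoising step. PnP-Flow additionally reprojects the iterate
    onto the learned Flow Matching path between these two steps."""
    x = A.T @ y                               # simple initialization
    for _ in range(steps):
        x = x - tau * A.T @ (A @ x - y)       # data-fidelity gradient step
        x = denoise(x)                        # plug-in denoising step
    return x
```

Swapping `smooth` for a learned denoiser recovers the usual PnP setup; no backpropagation through ODEs is needed, which is the efficiency point the abstract makes.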

[CV-38] LoGDesc: Local geometric features aggregation for robust point cloud registration

链接: https://arxiv.org/abs/2410.02420
作者: Karim Slimani,Brahim Tamadazte,Catherine Achard
关键词-EN: Principal Components Analysis, neighborhood structure description, learning-based feature propagation, structure description, combining local geometrical
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper introduces a new hybrid descriptor for 3D point matching and point cloud registration, combining local geometrical properties and learning-based feature propagation for each point's neighborhood structure description. The proposed architecture first extracts prior geometrical information by computing each point's planarity, anisotropy, and omnivariance using a Principal Components Analysis (PCA). This prior information is completed by a descriptor based on normal vectors estimated from a triangle-based neighborhood construction. The final geometrical descriptor is propagated between the points using local graph convolutions and attention mechanisms. The new feature extractor is evaluated on ModelNet40, Bunny Stanford dataset, KITTI and MVP (Multi-View Partial)-RG for point cloud registration and shows interesting results, particularly on noisy and low overlapping point clouds.
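The geometric quantities named here have standard eigenvalue-based definitions: with the neighborhood covariance eigenvalues sorted as lam1 >= lam2 >= lam3, planarity = (lam2 - lam3)/lam1, anisotropy = (lam1 - lam3)/lam1, and omnivariance = (lam1 * lam2 * lam3)^(1/3). A minimal sketch using these common definitions (the paper's exact normalization may differ):

```python
import numpy as np

def local_geometric_features(points):
    """Eigenvalue-based local shape features of a 3D point neighborhood
    (points: array of shape (n, 3)), using the covariance eigenvalues
    sorted so that lam1 >= lam2 >= lam3."""
    lam = np.sort(np.linalg.eigvalsh(np.cov(points.T)))[::-1]
    lam = np.maximum(lam, 1e-12)              # guard against degenerate sets
    planarity = (lam[1] - lam[2]) / lam[0]
    anisotropy = (lam[0] - lam[2]) / lam[0]
    omnivariance = float(np.prod(lam)) ** (1.0 / 3.0)
    return planarity, anisotropy, omnivariance
```

A flat patch scores high on planarity while a thin line scores low, which is what makes these features useful priors for matching.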

[CV-39] Eliminating Oversaturation and Artifacts of High Guidance Scales in Diffusion Models

链接: https://arxiv.org/abs/2410.02416
作者: Seyedmorteza Sadat,Otmar Hilliges,Romann M. Weber
关键词-EN: CFG update rule, CFG, crucial for improving, input condition, condition and final
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Classifier-free guidance (CFG) is crucial for improving both generation quality and alignment between the input condition and final output in diffusion models. While a high guidance scale is generally required to enhance these aspects, it also causes oversaturation and unrealistic artifacts. In this paper, we revisit the CFG update rule and introduce modifications to address this issue. We first decompose the update term in CFG into parallel and orthogonal components with respect to the conditional model prediction and observe that the parallel component primarily causes oversaturation, while the orthogonal component enhances image quality. Accordingly, we propose down-weighting the parallel component to achieve high-quality generations without oversaturation. Additionally, we draw a connection between CFG and gradient ascent and introduce a new rescaling and momentum method for the CFG update rule based on this insight. Our approach, termed adaptive projected guidance (APG), retains the quality-boosting advantages of CFG while enabling the use of higher guidance scales without oversaturation. APG is easy to implement and introduces practically no additional computational overhead to the sampling process. Through extensive experiments, we demonstrate that APG is compatible with various conditional diffusion models and samplers, leading to improved FID, recall, and saturation scores while maintaining precision comparable to CFG, making our method a superior plug-and-play alternative to standard classifier-free guidance.
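The decomposition described can be written in a few lines: project the CFG update (cond - uncond) onto the direction of the conditional prediction, keep the orthogonal part intact, and down-weight the parallel part. This is a simplified sketch of the idea (operating on flat vectors, with an assumed down-weighting factor `eta`; the rescaling and momentum terms of full APG are omitted):

```python
import numpy as np

def apg_update(cond, uncond, scale=7.5, eta=0.3):
    """APG-style guidance sketch: split the CFG update (cond - uncond) into
    components parallel and orthogonal to the conditional prediction and
    down-weight the parallel part (eta < 1), which the paper identifies as
    the main source of oversaturation. With eta = 1 this reduces exactly to
    standard classifier-free guidance."""
    diff = cond - uncond
    unit = cond / (np.linalg.norm(cond) + 1e-12)
    parallel = np.dot(diff, unit) * unit      # component along cond
    orthogonal = diff - parallel              # quality-enhancing component
    return cond + (scale - 1.0) * (orthogonal + eta * parallel)
```

Since `cond + (scale - 1) * diff` equals the usual `uncond + scale * (cond - uncond)`, the only change relative to CFG is the factor `eta` on the parallel component.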

[CV-40] SynCo: Synthetic Hard Negatives in Contrastive Learning for Better Unsupervised Visual Representations

链接: https://arxiv.org/abs/2410.02401
作者: Nikolaos Giakoumoglou,Tania Stathaki
关键词-EN: synthetic hard negatives, Contrastive learning, hard negatives, synthetic hard, negatives-samples that closely
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 10 pages, 6 figures, 4 tables. arXiv admin note: text overlap with arXiv:2010.01028 by other authors

点击查看摘要

Abstract:Contrastive learning has become a dominant approach in self-supervised visual representation learning, with hard negatives (samples that closely resemble the anchor) being key to enhancing the discriminative power of learned representations. However, efficiently leveraging hard negatives remains a challenge due to the difficulty in identifying and incorporating them without significantly increasing computational costs. To address this, we introduce SynCo (Synthetic Negatives in Contrastive learning), a novel contrastive learning approach that improves model performance by generating synthetic hard negatives. Built on the MoCo framework, SynCo introduces six novel strategies for creating diverse synthetic hard negatives that can be generated on-the-fly with minimal computational overhead. SynCo achieves faster training and better representation learning, achieving a top-1 accuracy of 68.1% in ImageNet linear evaluation after only 200 pretraining epochs, surpassing MoCo's 67.5% with the same ResNet-50 encoder. Additionally, it transfers more effectively to detection tasks: on the PASCAL VOC, it outperforms both the supervised baseline and MoCo, achieving an AP of 82.5%; on the COCO dataset, it sets a new benchmark with 40.4% AP for bounding box detection and 35.4% AP for instance segmentation. Our synthetic hard negative generation procedure significantly enhances the quality of visual representations learned through self-supervised contrastive learning. Code is available at this https URL.
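One way to generate a synthetic hard negative in embedding space is to interpolate existing hard negatives toward the anchor and renormalize. The sketch below shows that single illustrative strategy as a hypothetical simplification (the paper proposes six strategies built on the MoCo queue, not reproduced here):

```python
import numpy as np

def synthetic_hard_negatives(anchor, queue, k=4, alpha=0.3):
    """Illustrative synthetic-hard-negative generation (a hypothetical
    simplification of SynCo's strategies): pick the k queue negatives most
    similar to the anchor, interpolate them toward the anchor to make them
    harder, and renormalize to the unit sphere. anchor and queue rows are
    assumed L2-normalized, as in MoCo-style frameworks."""
    sims = queue @ anchor                      # cosine similarities
    hardest = queue[np.argsort(-sims)[:k]]     # top-k hardest negatives
    synth = (1.0 - alpha) * hardest + alpha * anchor
    return synth / np.linalg.norm(synth, axis=1, keepdims=True)
```

By construction the synthetic negatives are strictly more similar to the anchor than the originals, which is what makes them "harder" for the contrastive loss.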

[CV-41] Parameter Competition Balancing for Model Merging NEURIPS2024

链接: https://arxiv.org/abs/2410.02396
作者: Guodong Du,Junlin Lee,Jing Li,Runhua Jiang,Yifei Guo,Shuyang Yu,Hanting Liu,Sim Kuan Goh,Ho-Kin Tang,Daojing He,Min Zhang
关键词-EN: common practice, model, tasks, parameter, fine-tuning pretrained models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Accepted by NeurIPS2024

点击查看摘要

Abstract:While fine-tuning pretrained models has become common practice, these models often underperform outside their specific domains. Recently developed model merging techniques enable the direct integration of multiple models, each fine-tuned for distinct tasks, into a single model. This strategy promotes multitasking capabilities without requiring retraining on the original datasets. However, existing methods fall short in addressing potential conflicts and complex correlations between tasks, especially in parameter-level adjustments, posing a challenge in effectively balancing parameter competition across various tasks. This paper introduces an innovative technique named PCB-Merging (Parameter Competition Balancing), a lightweight and training-free technique that adjusts the coefficients of each parameter for effective model merging. PCB-Merging employs intra-balancing to gauge parameter significance within individual tasks and inter-balancing to assess parameter similarities across different tasks. Parameters with low importance scores are dropped, and the remaining ones are rescaled to form the final merged model. We assessed our approach in diverse merging scenarios, including cross-task, cross-domain, and cross-training configurations, as well as out-of-domain generalization. The experimental results reveal that our approach achieves substantial performance enhancements across multiple modalities, domains, model sizes, number of tasks, fine-tuning forms, and large language models, outperforming existing model merging methods. The code is publicly available at: this https URL.
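The drop-and-rescale recipe (score each parameter within its task and against the other tasks, zero out low scorers, rescale the survivors, then average) can be sketched on flat task vectors. This is a hypothetical simplification with made-up scoring functions, not the paper's exact intra-/inter-balancing formulas:

```python
import numpy as np

def pcb_merge_sketch(task_vectors, keep=0.4):
    """Hypothetical simplification of PCB-Merging on flat task vectors:
    score each parameter by its within-task magnitude (intra-balancing)
    times its agreement with the other tasks (inter-balancing), drop the
    lowest-scoring entries, rescale the survivors so each task keeps its
    original total magnitude, then average the balanced task vectors."""
    tv = np.stack(task_vectors)                  # (n_tasks, n_params)
    intra = np.abs(tv)                           # importance within a task
    inter = tv * tv.mean(axis=0)                 # positive when tasks agree
    score = intra * inter
    thresh = np.quantile(score, 1 - keep, axis=1, keepdims=True)
    mask = score >= thresh                       # keep the top `keep` fraction
    balanced = np.where(mask, tv, 0.0)
    scale = np.abs(tv).sum(axis=1, keepdims=True) / (
        np.abs(balanced).sum(axis=1, keepdims=True) + 1e-12)
    return (balanced * scale).mean(axis=0)
```

Parameters on which the tasks conflict get low (even negative) scores and are pruned before averaging, which is the competition-balancing intuition.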

[CV-42] MetaMetrics: Calibrating Metrics For Generation Tasks Using Human Preferences

链接: https://arxiv.org/abs/2410.02381
作者: Genta Indra Winata,David Anugraha,Lucky Susanto,Garry Kuwanto,Derry Tanti Wijaya
关键词-EN: Understanding the quality, model outputs align, model outputs, human preferences, Understanding
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:Understanding the quality of a performance evaluation metric is crucial for ensuring that model outputs align with human preferences. However, it remains unclear how well each metric captures the diverse aspects of these preferences, as metrics often excel in one particular area but not across all dimensions. To address this, it is essential to systematically calibrate metrics to specific aspects of human preference, catering to the unique characteristics of each aspect. We introduce MetaMetrics, a calibrated meta-metric designed to evaluate generation tasks across different modalities in a supervised manner. MetaMetrics optimizes the combination of existing metrics to enhance their alignment with human preferences. Our metric demonstrates flexibility and effectiveness in both language and vision downstream tasks, showing significant benefits across various multilingual and multi-domain scenarios. MetaMetrics aligns closely with human preferences and is highly extendable and easily integrable into any application. This makes MetaMetrics a powerful tool for improving the evaluation of generation tasks, ensuring that metrics are more representative of human judgment across diverse contexts.
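The core calibration idea, learning a combination of existing metrics so that the combined score tracks human judgments, can be sketched as a supervised fit. The version below uses a plain least-squares fit with clipped, normalized weights as a hypothetical stand-in for the paper's optimization procedure:

```python
import numpy as np

def calibrate_metrics(metric_scores, human_scores):
    """Minimal calibration sketch: fit weights by least squares (clipped to
    be non-negative, then normalized to sum to one) so that a weighted
    combination of existing metric scores best matches human preference
    scores. MetaMetrics optimizes this alignment with a more sophisticated
    supervised procedure; this is only the simplest possible analogue."""
    w, *_ = np.linalg.lstsq(metric_scores, human_scores, rcond=None)
    w = np.clip(w, 0.0, None)
    total = w.sum()
    return w / total if total > 0 else w
```

Given a matrix of per-example scores from several metrics and matching human ratings, the learned weights reveal which metrics actually track the preference signal.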

[CV-43] Unleashing the Potential of the Diffusion Model in Few-shot Semantic Segmentation NEURIPS2024

链接: https://arxiv.org/abs/2410.02369
作者: Muzhi Zhu,Yang Liu,Zekai Luo,Chenchen Jing,Hao Chen,Guangkai Xu,Xinlong Wang,Chunhua Shen
关键词-EN: Few-shot Semantic Segmentation, Latent Diffusion Model, Diffusion Model, garnered noteworthy achievements, Few-shot Semantic
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to Proc. Annual Conference on Neural Information Processing Systems (NeurIPS) 2024

点击查看摘要

Abstract:The Diffusion Model has not only garnered noteworthy achievements in the realm of image generation but has also demonstrated its potential as an effective pretraining method utilizing unlabeled data. Drawing from the extensive potential unveiled by the Diffusion Model in both semantic correspondence and open vocabulary segmentation, our work initiates an investigation into employing the Latent Diffusion Model for Few-shot Semantic Segmentation. Recently, inspired by the in-context learning ability of large language models, Few-shot Semantic Segmentation has evolved into In-context Segmentation tasks, morphing into a crucial element in assessing generalist segmentation models. In this context, we concentrate on Few-shot Semantic Segmentation, establishing a solid foundation for the future development of a Diffusion-based generalist model for segmentation. Our initial focus lies in understanding how to facilitate interaction between the query image and the support image, resulting in the proposal of a KV fusion method within the self-attention framework. Subsequently, we delve deeper into optimizing the infusion of information from the support mask and simultaneously re-evaluating how to provide reasonable supervision from the query mask. Based on our analysis, we establish a simple and effective framework named DiffewS, maximally retaining the original Latent Diffusion Model’s generative framework and effectively utilizing the pre-training prior. Experimental results demonstrate that our method significantly outperforms the previous SOTA models in multiple settings.

[CV-44] A Comprehensive Survey of Mamba Architectures for Medical Image Analysis: Classification Segmentation Restoration and Beyond

链接: https://arxiv.org/abs/2410.02362
作者: Shubhi Bansal,Sreeharish A,Madhava Prasath J,Manikandan S,Sreekanth Madisetty,Mohammad Zia Ur Rehman,Chandravardhan Singh Raghaw,Gaurav Duggal,Nagendra Kumar
关键词-EN: State Space Model, State Space, template-based deep learning, deep learning approaches, Mamba
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Mamba, a special case of the State Space Model, is gaining popularity as an alternative to template-based deep learning approaches in medical image analysis. While transformers are powerful architectures, they have drawbacks, including quadratic computational complexity and an inability to address long-range dependencies efficiently. This limitation affects the analysis of large and complex datasets in medical imaging, where there are many spatial and temporal relationships. In contrast, Mamba offers benefits that make it well-suited for medical image analysis. It has linear time complexity, which is a significant improvement over transformers. Mamba processes longer sequences without attention mechanisms, enabling faster inference and requiring less memory. Mamba also demonstrates strong performance in merging multimodal data, improving diagnosis accuracy and patient outcomes. The organization of this paper allows readers to appreciate the capabilities of Mamba in medical imaging step by step. We begin by defining core concepts of SSMs and models, including S4, S5, and S6, followed by an exploration of Mamba architectures such as pure Mamba, U-Net variants, and hybrid models with convolutional neural networks, transformers, and Graph Neural Networks. We also cover Mamba optimizations, techniques and adaptations, scanning, datasets, applications, experimental results, and conclude with its challenges and future directions in medical imaging. This review aims to demonstrate the transformative potential of Mamba in overcoming existing barriers within medical imaging while paving the way for innovative advancements in the field. A comprehensive list of Mamba architectures applied in the medical field, reviewed in this work, is available at Github.

[CV-45] ProtoSeg: A Prototype-Based Point Cloud Instance Segmentation Method

链接: https://arxiv.org/abs/2410.02352
作者: Remco Royen,Leon Denis,Adrian Munteanu
关键词-EN: point cloud scene, crucial for obtaining, obtaining an understanding, instance segmentation, performing instance segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:3D instance segmentation is crucial for obtaining an understanding of a point cloud scene. This paper presents a novel neural network architecture for performing instance segmentation on 3D point clouds. We propose to jointly learn coefficients and prototypes in parallel which can be combined to obtain the instance predictions. The coefficients are computed using an overcomplete set of sampled points with a novel multi-scale module, dubbed dilated point inception. As the set of obtained instance mask predictions is overcomplete, we employ a non-maximum suppression algorithm to retrieve the final predictions. This approach allows us to omit the time-consuming clustering step and leads to a more stable inference time. The proposed method is not only 28% faster than the state-of-the-art, it also exhibits the lowest standard deviation. Our experiments have shown that the standard deviation of the inference time is only 1.0% of the total time while it ranges between 10.8 and 53.1% for the state-of-the-art methods. Lastly, our method outperforms the state-of-the-art both on S3DIS-blocks (4.9% in mRec on Fold-5) and PartNet (2.0% on average in mAP).
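The coefficient-prototype combination followed by non-maximum suppression can be sketched as follows. The linear combination, the 0.5 binarization threshold, and the mask-IoU criterion are illustrative assumptions, not details taken from the paper:

```python
def predict_instances(coefficients, prototypes):
    """Combine per-candidate coefficient vectors with shared point-wise
    prototype activations into per-point instance mask scores.

    coefficients: n_candidates x n_prototypes
    prototypes:   n_prototypes x n_points
    Returns n_candidates x n_points scores (linear combination).
    """
    n_points = len(prototypes[0])
    return [[sum(c * proto[p] for c, proto in zip(coeff, prototypes))
             for p in range(n_points)]
            for coeff in coefficients]

def mask_iou(a, b):
    """Intersection over union of two boolean masks."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 0.0

def nms(masks, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression over binarized mask predictions:
    keep the highest-scoring candidates, drop near-duplicates."""
    binary = [[v > 0.5 for v in m] for m in masks]
    kept = []
    for i in sorted(range(len(masks)), key=lambda i: -scores[i]):
        if all(mask_iou(binary[i], binary[j]) < iou_thresh for j in kept):
            kept.append(i)
    return kept
```

Because the candidate set is overcomplete, duplicate masks are expected; NMS prunes them without any clustering step, which is what keeps the inference time stable.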

[CV-46] Self-eXplainable AI for Medical Image Analysis: A Survey and New Outlooks

链接: https://arxiv.org/abs/2410.02331
作者: Junlin Hou,Sicen Liu,Yequan Bie,Hongmei Wang,Andong Tan,Luyang Luo,Hao Chen
关键词-EN: eXplainable Artificial Intelligence, Artificial Intelligence, high-stakes decision-making areas, medical image analysis, Post-hoc XAI techniques
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The increasing demand for transparent and reliable models, particularly in high-stakes decision-making areas such as medical image analysis, has led to the emergence of eXplainable Artificial Intelligence (XAI). Post-hoc XAI techniques, which aim to explain black-box models after training, have been controversial in recent works concerning their fidelity to the models’ predictions. In contrast, Self-eXplainable AI (S-XAI) offers a compelling alternative by incorporating explainability directly into the training process of deep learning models. This approach allows models to generate inherent explanations that are closely aligned with their internal decision-making processes. Such enhanced transparency significantly supports the trustworthiness, robustness, and accountability of AI systems in real-world medical applications. To facilitate the development of S-XAI methods for medical image analysis, this survey presents a comprehensive review across various image modalities and clinical applications. It covers more than 200 papers from three key perspectives: 1) input explainability through the integration of explainable feature engineering and knowledge graphs, 2) model explainability via attention-based learning, concept-based learning, and prototype-based learning, and 3) output explainability by providing counterfactual explanations and textual explanations. Additionally, this paper outlines the desired characteristics of explainability and existing evaluation methods for assessing explanation quality. Finally, it discusses the major challenges and future research directions in developing S-XAI for medical image analysis.

[CV-47] RESSCAL3D: Joint Acquisition and Semantic Segmentation of 3D Point Clouds ICIP

链接: https://arxiv.org/abs/2410.02323
作者: Remco Royen,Kostas Pataridis,Ward van der Tempel,Adrian Munteanu
关键词-EN: facilitating seamless interaction, physical world, understanding is crucial, crucial for facilitating, interaction between digital
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 2024 IEEE International Conference on Image Processing (ICIP). IEEE, 2024

点击查看摘要

Abstract:3D scene understanding is crucial for facilitating seamless interaction between digital devices and the physical world. Real-time capturing and processing of the 3D scene are essential for achieving this seamless integration. While existing approaches typically separate acquisition and processing for each frame, the advent of resolution-scalable 3D sensors offers an opportunity to overcome this paradigm and fully leverage the otherwise wasted acquisition time to initiate processing. In this study, we introduce VX-S3DIS, a novel point cloud dataset accurately simulating the behavior of a resolution-scalable 3D sensor. Additionally, we present RESSCAL3D++, an important improvement over our prior work, RESSCAL3D, by incorporating an update module and processing strategy. By applying our method to the new dataset, we practically demonstrate the potential of joint acquisition and semantic segmentation of 3D point clouds. Our resolution-scalable approach significantly reduces scalability costs from 2% to just 0.2% in mIoU while achieving impressive speed-ups of 15.6 to 63.9% compared to the non-scalable baseline. Furthermore, our scalable approach enables early predictions, with the first one occurring after only 7% of the total inference time of the baseline. The new VX-S3DIS dataset is available at this https URL.

[CV-48] CTARR: A fast and robust method for identifying anatomical regions on CT images via atlas registration

链接: https://arxiv.org/abs/2410.02316
作者: Thomas Buddenkotte,Roland Opfer,Julia Krüger,Alessa Hering,Mireia Crispin-Ortuzar
关键词-EN: image analysis, Medical image analysis, image analysis tasks, patient body, image
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Medical image analysis tasks often focus on regions or structures located in a particular location within the patient’s body. Often large parts of the image may not be of interest for the image analysis task. When using deep-learning based approaches, this unnecessarily increases the computational burden during inference and raises the chance of errors. In this paper, we introduce CTARR, a novel generic method for CT Anatomical Region Recognition. The method serves as a pre-processing step for any deep learning-based CT image analysis pipeline by automatically identifying the pre-defined anatomical region that is relevant for the follow-up task and removing the rest. It can be used in (i) image segmentation to prevent false positives in anatomically implausible regions and speed up the inference, (ii) image classification to produce image crops that are consistent in their anatomical context, and (iii) image registration by serving as a fast pre-registration step. Our proposed method is based on atlas registration and provides a fast and robust way to crop any anatomical region encoded as one or multiple bounding box(es) from any unlabeled CT scan of the brain, chest, abdomen and/or pelvis. We demonstrate the utility and robustness of the proposed method in the context of medical image segmentation by evaluating it on six datasets of public segmentation challenges. The foreground voxels in the regions of interest are preserved in the vast majority of cases and tasks (97.45-100%) while taking only fractions of a second to compute (0.1-0.21s) on a deep learning workstation and greatly reducing the segmentation runtime (2.0-12.7x). Our code is available at this https URL.
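The atlas-registration idea, mapping a pre-defined atlas-space bounding box into a new scan via the registration transform, can be sketched geometrically. The 3x4 affine representation and the corner-mapping helper below are our own assumptions for illustration, not the paper's code:

```python
def transform_bbox(bbox_corners, affine):
    """Map the 8 corners of an atlas-space bounding box into image space
    with a 3x4 affine (from atlas-to-image registration) and return the
    enclosing axis-aligned box in image coordinates.

    bbox_corners: list of (x, y, z) corner coordinates in atlas space.
    affine: 3 rows of 4 coefficients (rotation/scale plus translation).
    """
    mapped = []
    for x, y, z in bbox_corners:
        # Apply the affine to the homogeneous coordinate (x, y, z, 1).
        mapped.append([sum(row[i] * c for i, c in enumerate((x, y, z, 1.0)))
                       for row in affine])
    lows = [min(p[d] for p in mapped) for d in range(3)]
    highs = [max(p[d] for p in mapped) for d in range(3)]
    return lows, highs
```

Cropping the scan to this enclosing box is then a plain array slice, which is why the pre-processing step is cheap compared to running a segmentation network on the full volume.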

[CV-49] Decoupling Layout from Glyph in Online Chinese Handwriting Generation

链接: https://arxiv.org/abs/2410.02309
作者: Ren-Min Si,Yan-Ming Zhang,Yi Chen
关键词-EN: online handwritten text, generate online handwritten, human civilization, plays a crucial, crucial role
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Text plays a crucial role in the transmission of human civilization, and teaching machines to generate online handwritten text in various styles presents an interesting and significant challenge. However, most prior work has concentrated on generating individual Chinese fonts, leaving complete text line generation largely unexplored. In this paper, we identify that text lines can naturally be divided into two components: layout and glyphs. Based on this division, we designed a text line layout generator coupled with a diffusion-based stylized font synthesizer to address this challenge hierarchically. More concretely, the layout generator performs in-context-like learning based on the text content and the provided style references to generate positions for each glyph autoregressively. Meanwhile, the font synthesizer which consists of a character embedding dictionary, a multi-scale calligraphy style encoder, and a 1D U-Net based diffusion denoiser will generate each font on its position while imitating the calligraphy style extracted from the given style references. Qualitative and quantitative experiments on the CASIA-OLHWDB demonstrate that our method is capable of generating structurally correct and indistinguishable imitation samples.

[CV-50] The Comparison of Individual Cat Recognition Using Neural Networks

链接: https://arxiv.org/abs/2410.02305
作者: Mingxuan Li,Kai Zhou
关键词-EN: smart door locks, Facial recognition, smart door, door locks, photo grouping
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages,7 figures

点击查看摘要

Abstract:Facial recognition using deep learning has been widely used in social life for applications such as authentication, smart door locks, and photo grouping. More and more networks have been developed to facilitate computer vision tasks, such as ResNet, DenseNet, EfficientNet, ConvNeXt, and Siamese networks. However, few studies have systematically compared the advantages and disadvantages of such neural networks in identifying individuals from images, especially for pet animals like cats. In the present study, by systematically comparing the efficacy of different neural networks in cat recognition, we found traditional CNNs trained with transfer learning have better performance than models trained with the fine-tuning method or Siamese networks in individual cat recognition. In addition, ConvNeXt and DenseNet yield significant results which could be further optimized for individual cat recognition in pet stores and in the wild. These results provide a method to improve cat management in pet stores and monitoring of cats in the wild.

[CV-51] A Novel Method for Accurate Real-time Food Classification: The Synergistic Integration of EfficientNetB7, CBAM, Transfer Learning and Data Augmentation

链接: https://arxiv.org/abs/2410.02304
作者: Shayan Rokhva,Babak Teimourpour
关键词-EN: Integrating artificial intelligence, Integrating artificial, significantly enhancing productivity, profoundly transformative, daily tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 20 pages, six figures, two tables

点击查看摘要

Abstract:Integrating artificial intelligence into modern society is profoundly transformative, significantly enhancing productivity by streamlining various daily tasks. AI-driven recognition systems provide notable advantages in the food sector, including improved nutrient tracking, tackling food waste, and boosting food production and consumption efficiency. Accurate food classification is a crucial initial step in utilizing advanced AI models, as the effectiveness of this process directly influences the success of subsequent operations; therefore, achieving high accuracy at a reasonable speed is essential. Despite existing research efforts, a gap persists in improving performance while ensuring rapid processing times, prompting researchers to pursue cost-effective and precise models. This study addresses this gap by employing the state-of-the-art EfficientNetB7 architecture, enhanced through transfer learning, data augmentation, and the CBAM attention module. This methodology results in a robust model that surpasses previous studies in accuracy while maintaining rapid processing suitable for real-world applications. The Food11 dataset from Kaggle was utilized, comprising 16643 imbalanced images across 11 diverse classes with significant intra-category diversities and inter-category similarities. Furthermore, the proposed methodology, bolstered by various deep learning techniques, consistently achieves an impressive average accuracy of 96.40%. Notably, it can classify over 60 images within one second during inference on unseen data, demonstrating its ability to deliver high accuracy promptly. This underscores its potential for practical applications in accurate food classification and enhancing efficiency in subsequent processes.

[CV-52] Computer-aided Colorization State-of-the-science: A Survey

链接: https://arxiv.org/abs/2410.02288
作者: Yu Cao,Xin Duan,Xiangqiao Meng,P. Y. Mok,Ping Li,Tong-Yee Lee
关键词-EN: paper reviews published, computer-aided colorization technology, reviews published research, reviews published, field of computer-aided
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper reviews published research in the field of computer-aided colorization technology. We argue that the colorization task originates from computer graphics, prospers by introducing computer vision, and tends to the fusion of vision and graphics, so we put forward our taxonomy and organize the whole paper chronologically. We extend the existing reconstruction-based colorization evaluation techniques, considering that aesthetic assessment of colored images should be introduced to ensure that colorization satisfies human visual-related requirements and emotions more closely. We perform the colorization aesthetic assessment on seven representative unconditional colorization models and discuss the difference between our assessment and the existing reconstruction-based metrics. Finally, this paper identifies unresolved issues and proposes fruitful areas for future research and development. Access to the project associated with this survey can be obtained at this https URL.

[CV-53] Structural-Entropy-Based Sample Selection for Efficient and Effective Learning ICLR2025

链接: https://arxiv.org/abs/2410.02268
作者: Tianchi Xie,Jiangning Zhu,Guozu Ma,Minzhi Lin,Wei Chen,Weikai Yang,Shixia Liu
关键词-EN: machine learning models, improves the efficiency, models by providing, samples, Sample selection improves
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: Submitted to ICLR 2025

点击查看摘要

Abstract:Sample selection improves the efficiency and effectiveness of machine learning models by providing informative and representative samples. Typically, samples can be modeled as a sample graph, where nodes are samples and edges represent their similarities. Most existing methods are based on local information, such as the training difficulty of samples, thereby overlooking global information, such as connectivity patterns. This oversight can result in suboptimal selection because global information is crucial for ensuring that the selected samples well represent the structural properties of the graph. To address this issue, we employ structural entropy to quantify global information and losslessly decompose it from the whole graph to individual nodes using the Shapley value. Based on the decomposition, we present Structural-Entropy-based Sample Selection (SES), a method that integrates both global and local information to select informative and representative samples. SES begins by constructing a kNN graph among samples based on their similarities. It then measures sample importance by combining structural entropy (global metric) with training difficulty (local metric). Finally, SES applies importance-biased blue noise sampling to select a set of diverse and representative samples. Comprehensive experiments on three learning scenarios – supervised learning, active learning, and continual learning – clearly demonstrate the effectiveness of our method.
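The global-plus-local importance scoring can be sketched as below; the convex combination and the plain top-k step are simplifications standing in for the paper's Shapley-decomposed structural entropy and importance-biased blue-noise sampling:

```python
def select_samples(entropy_scores, difficulty_scores, k, alpha=0.5):
    """Rank samples by a convex combination of a global score
    (per-node structural-entropy contribution) and a local score
    (training difficulty), then keep the top-k indices.

    alpha trades off global (alpha=1) against local (alpha=0) evidence.
    """
    importance = [alpha * g + (1 - alpha) * l
                  for g, l in zip(entropy_scores, difficulty_scores)]
    # Sort indices by decreasing importance and keep the k best.
    order = sorted(range(len(importance)), key=lambda i: -importance[i])
    return sorted(order[:k])
```

A sample that is only hard (high local score) or only structurally central (high global score) can still lose to one that is moderately strong on both, which is the balancing behavior the paper argues for.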

[CV-54] Probabilistic road classification in historical maps using synthetic data and deep learning

链接: https://arxiv.org/abs/2410.02250
作者: Dominik J. Mühlematter,Sebastian Schweizer,Chenjing Jiao,Xue Xia,Magnus Heitzler,Lorenz Hurni
关键词-EN: Historical maps, road, spatial development, offering a rich, evolutionary studies
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Historical maps are invaluable for analyzing long-term changes in transportation and spatial development, offering a rich source of data for evolutionary studies. However, digitizing and classifying road networks from these maps is often expensive and time-consuming, limiting their widespread use. Recent advancements in deep learning have made automatic road extraction from historical maps feasible, yet these methods typically require large amounts of labeled training data. To address this challenge, we introduce a novel framework that integrates deep learning with geoinformation, computer-based painting, and image processing methodologies. This framework enables the extraction and classification of roads from historical maps using only road geometries without needing road class labels for training. The process begins with training of a binary segmentation model to extract road geometries, followed by morphological operations, skeletonization, vectorization, and filtering algorithms. Synthetic training data is then generated by a painting function that artificially re-paints road segments using predefined symbology for road classes. Using this synthetic data, a deep ensemble is trained to generate pixel-wise probabilities for road classes to mitigate distribution shift. These predictions are then discretized along the extracted road geometries. Subsequently, further processing is employed to classify entire roads, enabling the identification of potential changes in road classes and resulting in a labeled road class dataset. Our method achieved completeness and correctness scores of over 94% and 92%, respectively, for road class 2, the most prevalent class in the two Siegfried Map sheets from Switzerland used for testing. This research offers a powerful tool for urban planning and transportation decision-making by efficiently extracting and classifying roads from historical maps.
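The ensemble-averaging and discretization steps can be illustrated with a toy sketch; the per-pixel probability averaging and the per-road majority vote are our own simplified reading of the pipeline, not the authors' implementation:

```python
def classify_road(pixel_probs_per_model):
    """Average pixel-wise class probabilities from a deep ensemble,
    then label the whole road with the class winning the most pixels.

    pixel_probs_per_model: n_models x n_pixels x n_classes, where the
    pixels are those lying along one extracted road geometry.
    """
    n_models = len(pixel_probs_per_model)
    n_pixels = len(pixel_probs_per_model[0])
    n_classes = len(pixel_probs_per_model[0][0])
    votes = [0] * n_classes
    for p in range(n_pixels):
        # Ensemble average mitigates the synthetic-to-real distribution shift.
        avg = [sum(m[p][c] for m in pixel_probs_per_model) / n_models
               for c in range(n_classes)]
        votes[avg.index(max(avg))] += 1
    # Aggregate pixel decisions into a single label for the road segment.
    return votes.index(max(votes))
```

Running this per vectorized road geometry turns noisy pixel-wise predictions into one class label per road, which is the labeled road-class dataset the paper describes.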

[CV-55] Spiking Neural Network as Adaptive Event Stream Slicer NEURIPS2024

链接: https://arxiv.org/abs/2410.02249
作者: Jiahang Cao,Mingyuan Sun,Ziqing Wang,Hao Cheng,Qiang Zhang,Shibo Zhou,Renjing Xu
关键词-EN: rich edge information, high dynamic range, high temporal resolution, attracting significant interest, provide rich edge
类目: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
*备注: Accepted to NeurIPS 2024

点击查看摘要

Abstract:Event-based cameras are attracting significant interest as they provide rich edge information, high dynamic range, and high temporal resolution. Many state-of-the-art event-based algorithms rely on splitting the events into fixed groups, resulting in the omission of crucial temporal information, particularly when dealing with diverse motion scenarios (e.g., high/low speed). In this work, we propose SpikeSlicer, a novel plug-and-play event processing method capable of splitting event streams adaptively. SpikeSlicer utilizes a lightweight (0.41M) and low-energy spiking neural network (SNN) to trigger event slicing. To guide the SNN to fire spikes at optimal time steps, we propose the Spiking Position-aware Loss (SPA-Loss) to modulate the neuron’s state. Additionally, we develop a Feedback-Update training strategy that refines the slicing decisions using feedback from the downstream artificial neural network (ANN). Extensive experiments demonstrate that our method yields significant performance improvements in event-based object tracking and recognition. Notably, SpikeSlicer provides a brand-new SNN-ANN cooperation paradigm, where the SNN acts as an efficient, low-energy data processor to assist the ANN in improving downstream performance, injecting new perspectives and potential avenues of exploration.

[CV-56] Visual Prompting in LLMs for Enhancing Emotion Recognition EMNLP2024

链接: https://arxiv.org/abs/2410.02244
作者: Qixuan Zhang,Zhifeng Wang,Dylan Zhang,Wenjia Niu,Sabrina Caldwell,Tom Gedeon,Yang Liu,Zhenyue Qin
关键词-EN: Vision Large Language, Large Language Models, Vision Large, Large Language, natural language processing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by EMNLP2024 (Main, Long paper)

点击查看摘要

Abstract:Vision Large Language Models (VLLMs) are transforming the intersection of computer vision and natural language processing. Nonetheless, the potential of using visual prompts for emotion recognition in these models remains largely unexplored and untapped. Traditional methods in VLLMs struggle with spatial localization and often discard valuable global context. To address this problem, we propose a Set-of-Vision prompting (SoV) approach that enhances zero-shot emotion recognition by using spatial information, such as bounding boxes and facial landmarks, to mark targets precisely. SoV improves accuracy in face count and emotion categorization while preserving the enriched image context. Through a battery of experimentation and analysis of recent commercial or open-source VLLMs, we evaluate the SoV model’s ability to comprehend facial expressions in natural environments. Our findings demonstrate the effectiveness of integrating spatial visual prompts into VLLMs for improving emotion recognition performance.

[CV-57] SCA: Highly Efficient Semantic-Consistent Unrestricted Adversarial Attack

链接: https://arxiv.org/abs/2410.02240
作者: Zihao Pan,Weibin Wu,Yuhang Cao,Zibin Zheng
关键词-EN: Unrestricted adversarial attacks, attacks typically manipulate, adversarial attacks typically, color or texture, Unrestricted adversarial
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Unrestricted adversarial attacks typically manipulate the semantic content of an image (e.g., color or texture) to create adversarial examples that are both effective and photorealistic. Recent works have utilized the diffusion inversion process to map images into a latent space, where high-level semantics are manipulated by introducing perturbations. However, they often result in substantial semantic distortions in the denoised output and suffer from low efficiency. In this study, we propose a novel framework called Semantic-Consistent Unrestricted Adversarial Attacks (SCA), which employs an inversion method to extract edit-friendly noise maps and utilizes a Multimodal Large Language Model (MLLM) to provide semantic guidance throughout the process. Under the condition of rich semantic information provided by the MLLM, we perform the DDPM denoising process of each step using a series of edit-friendly noise maps, and leverage DPM Solver++ to accelerate this process, enabling efficient sampling with semantic consistency. Compared to existing methods, our framework enables the efficient generation of adversarial examples that exhibit minimal discernible semantic changes. Consequently, we for the first time introduce Semantic-Consistent Adversarial Examples (SCAE). Extensive experiments and visualizations have demonstrated the high efficiency of SCA, particularly in being on average 12 times faster than the state-of-the-art attacks. Our code can be found at this https URL.

[CV-58] Key-Grid: Unsupervised 3D Keypoints Detection using Grid Heatmap Features

链接: https://arxiv.org/abs/2410.02237
作者: Chengkai Hou,Zhengrong Xue,Bingyang Zhou,Jinghan Ke,Lin Shao,Huazhe Xu
关键词-EN: pose estimation, shape registration, registration and robotics, semantic consistency, keypoints
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Detecting 3D keypoints with semantic consistency is widely used in many scenarios such as pose estimation, shape registration and robotics. Currently, most unsupervised 3D keypoint detection methods focus on rigid-body objects. However, when faced with deformable objects, the keypoints they identify do not preserve semantic consistency well. In this paper, we introduce an innovative unsupervised keypoint detector Key-Grid for both rigid-body and deformable objects, which is an autoencoder framework. The encoder predicts keypoints and the decoder utilizes the generated keypoints to reconstruct the objects. Unlike previous work, we leverage the identified keypoint information to form a 3D grid feature heatmap called grid heatmap, which is used in the decoder section. Grid heatmap is a novel concept that represents the latent variables for grid points sampled uniformly in the 3D cubic space, where these variables are the shortest distance between the grid points and the skeleton connected by keypoint pairs. Meanwhile, we incorporate the information from each layer of the encoder into the decoder section. We conduct an extensive evaluation of Key-Grid on a list of benchmark datasets. Key-Grid achieves state-of-the-art performance on the semantic consistency and position accuracy of keypoints. Moreover, we demonstrate the robustness of Key-Grid to noise and downsampling. In addition, we achieve SE(3) invariance of keypoints through generalizing Key-Grid to an SE(3)-invariant backbone.
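The grid heatmap described above reduces to point-to-segment distances: for each uniformly sampled grid point, the shortest distance to any skeleton "bone" connecting a keypoint pair. This is a generic geometric sketch, not the authors' code:

```python
def point_segment_distance(p, a, b):
    """Shortest distance from 3D point p to the segment between
    keypoints a and b (one bone of the skeleton)."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    denom = sum(c * c for c in ab)
    # Project p onto the line through a and b, clamped to the segment.
    t = 0.0 if denom == 0 else max(
        0.0, min(1.0, sum(x * y for x, y in zip(ap, ab)) / denom))
    closest = [ai + t * c for ai, c in zip(a, ab)]
    return sum((pi - ci) ** 2 for pi, ci in zip(p, closest)) ** 0.5

def grid_heatmap(grid_points, keypoint_pairs):
    """For each grid point, the distance to the nearest skeleton segment;
    these values are the latent variables fed to the decoder."""
    return [min(point_segment_distance(g, a, b) for a, b in keypoint_pairs)
            for g in grid_points]
```

Because the heatmap depends only on distances between grid points and the keypoint skeleton, it moves consistently with the object as it deforms, which is why it helps preserve semantic consistency for deformable shapes.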

[CV-59] Efficient Semantic Segmentation via Lightweight Multiple-Information Interaction Network

链接: https://arxiv.org/abs/2410.02224
作者: Yangyang Qiu,Guoan Xu,Guangwei Gao,Zhenhua Guo,Yi Yu,Chia-Wen Lin
关键词-EN: Convolutional Neural Networks, Convolutional Neural, capabilities of Convolutional, Neural Networks, local modeling capabilities
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 6 figures, 9 tables

点击查看摘要

Abstract:Recently, the integration of the local modeling capabilities of Convolutional Neural Networks (CNNs) with the global dependency strengths of Transformers has created a sensation in the semantic segmentation community. However, substantial computational workloads and high hardware memory demands remain major obstacles to their further application in real-time scenarios. In this work, we propose a lightweight multiple-information interaction network for real-time semantic segmentation, called LMIINet, which effectively combines CNNs and Transformers while reducing redundant computations and memory footprint. It features Lightweight Feature Interaction Bottleneck (LFIB) modules comprising efficient convolutions that enhance context integration. Additionally, improvements are made to the Flatten Transformer by enhancing local and global feature interaction to capture detailed semantic information. The incorporation of a combination coefficient learning scheme in both LFIB and Transformer blocks facilitates improved feature interaction. Extensive experiments demonstrate that LMIINet excels in balancing accuracy and efficiency. With only 0.72M parameters and 11.74G FLOPs, LMIINet achieves 72.0% mIoU at 100 FPS on the Cityscapes test set and 69.94% mIoU at 160 FPS on the CamVid test dataset using a single RTX2080Ti GPU.

[CV-60] Capturing complex hand movements and object interactions using machine learning-powered stretchable smart textile gloves

Link: https://arxiv.org/abs/2410.02221
Authors: Arvin Tashakori, Zenan Jiang, Amir Servati, Saeid Soltanian, Harishkumar Narayana, Katherine Le, Caroline Nakayama, Chieh-ling Yang, Z. Jane Wang, Janice J. Eng, Peyman Servati
Keywords: accurate real-time tracking, dexterous hand movements, capturing realistic hand movements
Subjects: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO); Signal Processing (eess.SP)
*Comments:

Abstract:Accurate real-time tracking of dexterous hand movements and interactions has numerous applications in human-computer interaction, metaverse, robotics, and tele-health. Capturing realistic hand movements is challenging because of the large number of articulations and degrees of freedom. Here, we report accurate and dynamic tracking of articulated hand and finger movements using stretchable, washable smart gloves with embedded helical sensor yarns and inertial measurement units. The sensor yarns have a high dynamic range, responding to low 0.005 % to high 155 % strains, and show stability during extensive use and washing cycles. We use multi-stage machine learning to report average joint angle estimation root mean square errors of 1.21 and 1.45 degrees for intra- and inter-subjects cross-validation, respectively, matching accuracy of costly motion capture cameras without occlusion or field of view limitations. We report a data augmentation technique that enhances robustness to noise and variations of sensors. We demonstrate accurate tracking of dexterous hand movements during object interactions, opening new avenues of applications including accurate typing on a mock paper keyboard, recognition of complex dynamic and static gestures adapted from American Sign Language and object identification.

[CV-61] Stochastic Sampling from Deterministic Flow Models ICLR2025

Link: https://arxiv.org/abs/2410.02217
Authors: Saurabh Singh, Ian Fischer
Keywords: deterministic transport map, ordinary differential equation, flow models
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*Comments: Submitted to ICLR 2025

Abstract:Deterministic flow models, such as rectified flows, offer a general framework for learning a deterministic transport map between two distributions, realized as the vector field for an ordinary differential equation (ODE). However, they are sensitive to model estimation and discretization errors and do not permit different samples conditioned on an intermediate state, limiting their application. We present a general method to turn the underlying ODE of such flow models into a family of stochastic differential equations (SDEs) that have the same marginal distributions. This method permits us to derive families of stochastic samplers, for fixed (e.g., previously trained) deterministic flow models, that continuously span the spectrum of deterministic and stochastic sampling, given access to the flow field and the score function. Our method provides additional degrees of freedom that help alleviate the issues with the deterministic samplers and empirically outperforms them. We empirically demonstrate advantages of our method on a toy Gaussian setup and on the large scale ImageNet generation task. Further, our family of stochastic samplers provide an additional knob for controlling the diversity of generation, which we qualitatively demonstrate in our experiments.
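The ODE-to-SDE idea above can be illustrated on a toy 1-D Gaussian where everything is known in closed form. This is our own minimal sketch, not the paper's method or code: for the flow dx/dt = x, which carries p_0 = N(0, 1) to p_t = N(0, e^{2t}), the score is ∇log p_t(x) = -x·e^{-2t}, and the SDE dx = [v(x, t) + ε·score(x, t)] dt + √(2ε) dW shares the same marginals for any noise level ε.

```python
import numpy as np

rng = np.random.default_rng(0)
n, steps, eps = 20000, 1000, 0.5
dt = 1.0 / steps

def velocity(x, t):
    return x  # flow field of the toy ODE dx/dt = x

def score(x, t):
    return -x * np.exp(-2.0 * t)  # closed-form score of p_t = N(0, e^{2t})

x_ode = rng.standard_normal(n)  # samples from p_0 = N(0, 1)
x_sde = x_ode.copy()
for k in range(steps):
    t = k * dt
    x_ode += velocity(x_ode, t) * dt  # deterministic Euler step
    # stochastic sampler with the same marginals: score-corrected drift + noise
    drift = velocity(x_sde, t) + eps * score(x_sde, t)
    x_sde += drift * dt + np.sqrt(2.0 * eps * dt) * rng.standard_normal(n)

# both populations should end up with std close to e ~ 2.718 at t = 1
std_ode, std_sde = x_ode.std(), x_sde.std()
```

Varying ε from 0 (pure ODE) upward traces exactly the deterministic-to-stochastic spectrum the abstract describes, while the marginal at each t stays fixed.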

[CV-62] Hard Negative Sample Mining for Whole Slide Image Classification MICCAI2024

Link: https://arxiv.org/abs/2410.02212
Authors: Wentao Huang, Xiaoling Hu, Shahira Abousamra, Prateek Prasanna, Chao Chen
Keywords: weakly supervised whole slide image classification, high computational costs
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: 13 pages, 4 figures, accepted by MICCAI 2024

Abstract:Weakly supervised whole slide image (WSI) classification is challenging due to the lack of patch-level labels and high computational costs. State-of-the-art methods use self-supervised patch-wise feature representations for multiple instance learning (MIL). Recently, methods have been proposed to fine-tune the feature representation on the downstream task using pseudo labeling, but mostly focusing on selecting high-quality positive patches. In this paper, we propose to mine hard negative samples during fine-tuning. This allows us to obtain better feature representations and reduce the training cost. Furthermore, we propose a novel patch-wise ranking loss in MIL to better exploit these hard negative samples. Experiments on two public datasets demonstrate the efficacy of these proposed ideas. Our codes are available at this https URL
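One plausible shape for the hard-negative mining and patch-wise ranking loss described above is a margin-based hinge over (positive, hard-negative) patch pairs. The abstract does not give the exact loss, so the mining rule, margin, and scores below are illustrative assumptions, not the authors' formulation.

```python
import numpy as np

def mine_hard_negatives(scores, labels, k):
    """Hard negatives = the negative patches the model scores highest."""
    neg_scores = scores[labels == 0]
    return np.sort(neg_scores)[-k:]  # top-k highest-scoring negatives

def patch_ranking_loss(scores, labels, k=4, margin=0.5):
    """Encourage every positive patch to outrank the mined hard negatives."""
    pos = scores[labels == 1]
    hard_neg = mine_hard_negatives(scores, labels, k)
    # hinge over all (positive, hard-negative) pairs
    diffs = margin - (pos[:, None] - hard_neg[None, :])
    return np.maximum(diffs, 0.0).mean()

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.2, 0.1])  # toy patch scores
labels = np.array([1, 1, 0, 0, 0, 0])              # toy pseudo labels
loss = patch_ranking_loss(scores, labels, k=2)
```

Because only the top-k negatives enter the loss, the easy negatives (scores 0.1 and 0.2 here) contribute no gradient, which is the point of mining.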

[CV-63] Adapting Segment Anything Model to Melanoma Segmentation in Microscopy Slide Images

Link: https://arxiv.org/abs/2410.02207
Authors: Qingyuan Liu, Avideh Zakhor
Keywords: crucial prognostic factors, Breslow depth, primary invasive tumor size, whole slide images
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Abstract:Melanoma segmentation in Whole Slide Images (WSIs) is useful for prognosis and the measurement of crucial prognostic factors such as Breslow depth and primary invasive tumor size. In this paper, we present a novel approach that uses the Segment Anything Model (SAM) for automatic melanoma segmentation in microscopy slide images. Our method employs an initial semantic segmentation model to generate preliminary segmentation masks that are then used to prompt SAM. We design a dynamic prompting strategy that uses a combination of centroid and grid prompts to achieve optimal coverage of the super high-resolution slide images while maintaining the quality of generated prompts. To optimize for invasive melanoma segmentation, we further refine the prompt generation process by implementing in-situ melanoma detection and low-confidence region filtering. We select Segformer as the initial segmentation model and EfficientSAM as the segment anything model for parameter-efficient fine-tuning. Our experimental results demonstrate that this approach not only surpasses other state-of-the-art melanoma segmentation methods but also significantly outperforms the baseline Segformer by 9.1% in terms of IoU.
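The centroid-plus-grid prompting strategy can be sketched as follows. This is our reading of the abstract, not the authors' code: from a preliminary segmentation mask, take the mask centroid as one point prompt and add a regular grid of points that fall inside the mask.

```python
import numpy as np

def centroid_prompt(mask):
    """One point prompt at the mask centroid, in (x, y) order."""
    ys, xs = np.nonzero(mask)
    return np.array([[xs.mean(), ys.mean()]])

def grid_prompts(mask, stride):
    """Regular grid of points, keeping only those inside the mask."""
    h, w = mask.shape
    pts = [(x, y)
           for y in range(stride // 2, h, stride)
           for x in range(stride // 2, w, stride)
           if mask[y, x]]
    return np.array(pts, dtype=float).reshape(-1, 2)

mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True  # a 4x4 square preliminary mask
prompts = np.vstack([centroid_prompt(mask), grid_prompts(mask, stride=4)])
```

On very large slides the grid guarantees coverage of elongated regions that a single centroid would miss, which matches the coverage motivation in the abstract.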

[CV-64] Remember and Recall: Associative-Memory-based Trajectory Prediction

Link: https://arxiv.org/abs/2410.02201
Authors: Hang Guo, Yuzhen Zhang, Tianci Gao, Junning Su, Pei Lv, Mingliang Xu
Keywords: trajectory prediction, autonomous driving systems, accumulated movement experience
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:Trajectory prediction is a pivotal component of autonomous driving systems, enabling the application of accumulated movement experience to current scenarios. Although most existing methods concentrate on learning continuous representations to gain valuable experience, they often suffer from computational inefficiencies and struggle with unfamiliar situations. To address this issue, we propose the Fragmented-Memory-based Trajectory Prediction (FMTP) model, inspired by the remarkable learning capabilities of humans, particularly their ability to leverage accumulated experience and recall relevant memories in unfamiliar situations. The FMTP model employs discrete representations to enhance computational efficiency by reducing information redundancy while maintaining the flexibility to utilize past experiences. Specifically, we design a learnable memory array by consolidating continuous trajectory representations from the training set using defined quantization operations during the training phase. This approach further eliminates redundant information while preserving essential features in discrete form. Additionally, we develop an advanced reasoning engine based on language models to deeply learn the associative rules among these discrete representations. Our method has been evaluated on various public datasets, including ETH-UCY, inD, SDD, nuScenes, Waymo, and VTL-TP. The extensive experimental results demonstrate that our approach achieves significant performance and extracts more valuable experience from past trajectories to inform the current state.
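The discrete-memory idea can be sketched as a codebook lookup: continuous trajectory features are snapped to their nearest entry in a learned memory array, trading a little precision for a compact, redundancy-free representation. The codebook values and features below are made up for illustration; the paper's quantization operations are not specified in the abstract.

```python
import numpy as np

def quantize(features, memory):
    """Map each feature vector to the index of its closest memory slot."""
    # squared distances between every feature and every memory entry
    d2 = ((features[:, None, :] - memory[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

memory = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])  # 3 memory slots
feats = np.array([[0.1, -0.2], [3.8, 4.1], [0.9, 1.2]])  # continuous features
codes = quantize(feats, memory)  # discrete indices into the memory array
recon = memory[codes]            # dequantized (discrete) representation
```

Downstream, a sequence model can then reason over the integer `codes` instead of the raw continuous trajectories.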

[CV-65] BadCM: Invisible Backdoor Attack Against Cross-Modal Learning

Link: https://arxiv.org/abs/2410.02182
Authors: Zheng Zhang, Xu Yuan, Lei Zhu, Jingkuan Song, Liqiang Nie
Keywords: cross-modal learning, unimodal learning tasks, remarkable successes
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Multimedia (cs.MM)
*Comments:

Abstract:Despite remarkable successes in unimodal learning tasks, backdoor attacks against cross-modal learning are still underexplored due to the limited generalization and inferior stealthiness when involving multiple modalities. Notably, since works in this area mainly inherit ideas from unimodal visual attacks, they struggle with dealing with diverse cross-modal attack circumstances and manipulating imperceptible trigger samples, which hinders their practicability in real-world applications. In this paper, we introduce a novel bilateral backdoor to fill in the missing pieces of the puzzle in the cross-modal backdoor and propose a generalized invisible backdoor framework against cross-modal learning (BadCM). Specifically, a cross-modal mining scheme is developed to capture the modality-invariant components as target poisoning areas, where well-designed trigger patterns injected into these regions can be efficiently recognized by the victim models. This strategy is adapted to different image-text cross-modal models, making our framework available to various attack scenarios. Furthermore, for generating poisoned samples of high stealthiness, we conceive modality-specific generators for visual and linguistic modalities that facilitate hiding explicit trigger patterns in modality-invariant regions. To the best of our knowledge, BadCM is the first invisible backdoor method deliberately designed for diverse cross-modal attacks within one unified framework. Comprehensive experimental evaluations on two typical applications, i.e., cross-modal retrieval and VQA, demonstrate the effectiveness and generalization of our method under multiple kinds of attack scenarios. Moreover, we show that BadCM can robustly evade existing backdoor defenses. Our code is available at this https URL.

[CV-66] HATFormer: Historic Handwritten Arabic Text Recognition with Transformers

Link: https://arxiv.org/abs/2410.02179
Authors: Adrian Chan, Anupam Mijar, Mehreen Saeed, Chau-Wai Wong, Akram Khater
Keywords: diverse writing styles, English HTR model, generalizable Arabic HTR models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
*Comments:

Abstract:Arabic handwritten text recognition (HTR) is challenging, especially for historical texts, due to diverse writing styles and the intrinsic features of Arabic script. Additionally, Arabic handwriting datasets are smaller compared to English ones, making it difficult to train generalizable Arabic HTR models. To address these challenges, we propose HATFormer, a transformer-based encoder-decoder architecture that builds on a state-of-the-art English HTR model. By leveraging the transformer’s attention mechanism, HATFormer captures spatial contextual information to address the intrinsic challenges of Arabic script through differentiating cursive characters, decomposing visual representations, and identifying diacritics. Our customization to historical handwritten Arabic includes an image processor for effective ViT information preprocessing, a text tokenizer for compact Arabic text representation, and a training pipeline that accounts for a limited amount of historic Arabic handwriting data. HATFormer achieves a character error rate (CER) of 8.6% on the largest public historical handwritten Arabic dataset, with a 51% improvement over the best baseline in the literature. HATFormer also attains a comparable CER of 4.2% on the largest private non-historical dataset. Our work demonstrates the feasibility of adapting an English HTR method to a low-resource language with complex, language-specific challenges, contributing to advancements in document digitization, information retrieval, and cultural preservation.
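The character error rate (CER) that HATFormer reports is the standard edit-distance metric: Levenshtein distance between hypothesis and reference, divided by reference length. A minimal implementation (standard definition, not the authors' code):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance via dynamic programming with a rolling row."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

def cer(ref, hyp):
    """Character error rate = edits needed / reference length."""
    return edit_distance(ref, hyp) / len(ref)

value = cer("kitten", "sitting")  # classic example: 3 edits over 6 characters
```

An 8.6% CER thus means roughly one character-level edit per twelve reference characters.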

[CV-67] From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities

Link: https://arxiv.org/abs/2410.02155
Authors: Wanpeng Zhang, Zilong Xie, Yicheng Feng, Yijiang Li, Xingrun Xing, Sipeng Zheng, Zongqing Lu
Keywords: Multimodal Large Language Models, text-only Large Language Models
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:Multimodal Large Language Models have made significant strides in integrating visual and textual information, yet they often struggle with effectively aligning these modalities. We introduce a novel image tokenizer that bridges this gap by applying the principle of Byte-Pair Encoding (BPE) to visual data. Unlike conventional approaches that rely on separate visual encoders, our method directly incorporates structural prior information into image tokens, mirroring the successful tokenization strategies used in text-only Large Language Models. This innovative approach enables Transformer models to more effectively learn and reason across modalities. Through theoretical analysis and extensive experiments, we demonstrate that our BPE Image Tokenizer significantly enhances MLLMs’ multimodal understanding capabilities, even with limited training data. Our method not only improves performance across various benchmarks but also shows promising scalability, potentially paving the way for more efficient and capable multimodal foundation models.
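The core BPE step the abstract alludes to looks the same whether the symbols are characters or quantized visual patch ids: repeatedly find the most frequent adjacent pair and merge it into a new token. The token ids and single-merge policy below are illustrative, not the paper's tokenizer.

```python
from collections import Counter

def most_frequent_pair(seq):
    """Most common adjacent pair of token ids in the sequence."""
    return Counter(zip(seq, seq[1:])).most_common(1)[0][0]

def merge_pair(seq, pair, new_id):
    """Replace every occurrence of `pair` with the single token `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

tokens = [7, 3, 7, 3, 5, 7, 3]       # quantized patch ids (made up)
pair = most_frequent_pair(tokens)    # (7, 3) occurs three times
merged = merge_pair(tokens, pair, new_id=100)
```

Iterating this merge builds a vocabulary whose frequent multi-patch tokens carry the structural prior information the abstract mentions.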

[CV-68] An Evaluation of Large Pre-Trained Models for Gesture Recognition using Synthetic Videos

Link: https://arxiv.org/abs/2410.02152
Authors: Arun Reddy, Ketul Shah, Corban Rivera, William Paul, Celso M. De Melo, Rama Chellappa
Keywords: large pre-trained models, synthetically generated data
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: Synthetic Data for Artificial Intelligence and Machine Learning: Tools, Techniques, and Applications II (SPIE Defense + Commercial Sensing, 2024)

Abstract:In this work, we explore the possibility of using synthetically generated data for video-based gesture recognition with large pre-trained models. We consider whether these models have sufficiently robust and expressive representation spaces to enable “training-free” classification. Specifically, we utilize various state-of-the-art video encoders to extract features for use in k-nearest neighbors classification, where the training data points are derived from synthetic videos only. We compare these results with another training-free approach – zero-shot classification using text descriptions of each gesture. In our experiments with the RoCoG-v2 dataset, we find that using synthetic training videos yields significantly lower classification accuracy on real test videos compared to using a relatively small number of real training videos. We also observe that video backbones that were fine-tuned on classification tasks serve as superior feature extractors, and that the choice of fine-tuning data has a substantial impact on k-nearest neighbors performance. Lastly, we find that zero-shot text-based classification performs poorly on the gesture recognition task, as gestures are not easily described through natural language.
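The "training-free" k-nearest-neighbors protocol described above can be sketched in a few lines: classify a real test clip by comparing its encoder feature against features of synthetic training clips under cosine similarity. The feature vectors below are toy values, not real video-encoder outputs.

```python
import numpy as np

def knn_predict(query, train_feats, train_labels, k=3):
    """Majority vote among the k most cosine-similar training features."""
    q = query / np.linalg.norm(query)
    t = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    sims = t @ q                      # cosine similarity to every training clip
    top = np.argsort(sims)[-k:]      # indices of the k nearest neighbors
    votes = train_labels[top]
    return np.bincount(votes).argmax()

# toy "synthetic" training features for two gesture classes
train_feats = np.array([[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.8]])
train_labels = np.array([0, 0, 1, 1])
pred = knn_predict(np.array([0.95, 0.05]), train_feats, train_labels, k=3)
```

Nothing is trained here, which is exactly why the quality of the frozen feature extractor dominates the result, as the abstract's fine-tuning observation suggests.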

[CV-69] MDSGen: Fast and Efficient Masked Diffusion Temporal-Aware Transformers for Open-Domain Sound Generation

Link: https://arxiv.org/abs/2410.02130
Authors: Trung X. Pham, Tri Ton, Chang D. Yoo
Keywords: vision-guided open-domain sound generation
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
*Comments: 21 pages, 16 figures

Abstract:We introduce MDSGen, a novel framework for vision-guided open-domain sound generation optimized for model parameter size, memory consumption, and inference speed. This framework incorporates two key innovations: (1) a redundant video feature removal module that filters out unnecessary visual information, and (2) a temporal-aware masking strategy that leverages temporal context for enhanced audio generation accuracy. In contrast to existing resource-heavy Unet-based models, MDSGen employs denoising masked diffusion transformers, facilitating efficient generation without reliance on pre-trained diffusion models. Evaluated on the benchmark VGGSound dataset, our smallest model (5M parameters) achieves 97.9% alignment accuracy, using 172x fewer parameters, 371% less memory, and offering 36x faster inference than the current 860M-parameter state-of-the-art model (93.9% accuracy). The larger model (131M parameters) reaches nearly 99% accuracy while requiring 6.5x fewer parameters. These results highlight the scalability and effectiveness of our approach.

[CV-70] MVGS: Multi-view-regulated Gaussian Splatting for Novel View Synthesis

Link: https://arxiv.org/abs/2410.02103
Authors: Xiaobiao Du, Yida Wang, Xin Yu
Keywords: implicit neural radiance field, Gaussian Splatting
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: Project Page: this https URL

Abstract:Recent works in volume rendering, e.g., NeRF and 3D Gaussian Splatting (3DGS), significantly advance the rendering quality and efficiency with the help of the learned implicit neural radiance field or 3D Gaussians. Rendering on top of an explicit representation, the vanilla 3DGS and its variants deliver real-time efficiency by optimizing the parametric model with single-view supervision per iteration during training, which is adopted from NeRF. Consequently, certain views are overfitted, leading to unsatisfying appearance in novel-view synthesis and imprecise 3D geometries. To solve the aforementioned problems, we propose a new 3DGS optimization method embodying four key novel contributions: 1) We transform the conventional single-view training paradigm into a multi-view training strategy. With our proposed multi-view regulation, 3D Gaussian attributes are further optimized without overfitting certain training views. As a general solution, we improve the overall accuracy in a variety of scenarios and different Gaussian variants. 2) Inspired by the benefit introduced by additional views, we further propose a cross-intrinsic guidance scheme, leading to a coarse-to-fine training procedure concerning different resolutions. 3) Built on top of our multi-view regulated training, we further propose a cross-ray densification strategy, densifying more Gaussian kernels in the ray-intersect regions from a selection of views. 4) By further investigating the densification strategy, we found that the effect of densification should be enhanced when certain views are dramatically distinct. As a solution, we propose a novel multi-view augmented densification strategy, where 3D Gaussians are encouraged to get densified to a sufficient number accordingly, resulting in improved reconstruction accuracy.

[CV-71] Orient Anything

Link: https://arxiv.org/abs/2410.02101
Authors: Christopher Scarvelis, David Benhaim, Paul Zhang
Keywords: orientation estimation, shape orientation axes
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments:

Abstract:Orientation estimation is a fundamental task in 3D shape analysis which consists of estimating a shape’s orientation axes: its side-, up-, and front-axes. Using this data, one can rotate a shape into canonical orientation, where its orientation axes are aligned with the coordinate axes. Developing an orientation algorithm that reliably estimates complete orientations of general shapes remains an open problem. We introduce a two-stage orientation pipeline that achieves state of the art performance on up-axis estimation and further demonstrate its efficacy on full-orientation estimation, where one seeks all three orientation axes. Unlike previous work, we train and evaluate our method on all of Shapenet rather than a subset of classes. We motivate our engineering contributions by theory describing fundamental obstacles to orientation estimation for rotationally-symmetric shapes, and show how our method avoids these obstacles.
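The "rotate into canonical orientation" step described above can be sketched directly: stack the estimated side-, up-, and front-axes into a rotation matrix whose rows are those axes, then apply it to the shape's points. The axes and point below are an arbitrary example, not model output.

```python
import numpy as np

def canonicalize(points, side, up, front):
    """Rotate points so side/up/front align with the x/y/z coordinate axes."""
    R = np.stack([side, up, front])  # rows = estimated orientation axes
    assert np.allclose(R @ R.T, np.eye(3), atol=1e-8), "axes must be orthonormal"
    return points @ R.T  # row i of the result = (p·side, p·up, p·front)

# a shape whose up-axis currently points along +z and front-axis along +x
side = np.array([0.0, 1.0, 0.0])
up = np.array([0.0, 0.0, 1.0])
front = np.array([1.0, 0.0, 0.0])
pts = np.array([[0.0, 0.0, 2.0]])  # a point lying on the shape's up-axis
canon = canonicalize(pts, side, up, front)  # should land on the y-axis
```

Note the orthonormality check only rules out skew; a proper orientation estimate must also form a right-handed frame (det R = +1), which the paper's rotational-symmetry discussion makes delicate for symmetric shapes.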

[CV-72] EC-DIT: Scaling Diffusion Transformers with Adaptive Expert-Choice Routing

Link: https://arxiv.org/abs/2410.02098
Authors: Haotian Sun, Bowen Zhang, Yanghao Li, Haoshuo Huang, Tao Lei, Ruoming Pang, Bo Dai, Nan Du
Keywords: Diffusion transformers, Mixture-of-Experts
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments:

Abstract:Diffusion transformers have been widely adopted for text-to-image synthesis. While scaling these models up to billions of parameters shows promise, the effectiveness of scaling beyond current sizes remains underexplored and challenging. By explicitly exploiting the computational heterogeneity of image generations, we develop a new family of Mixture-of-Experts (MoE) models (EC-DIT) for diffusion transformers with expert-choice routing. EC-DIT learns to adaptively optimize the compute allocated to understand the input texts and generate the respective image patches, enabling heterogeneous computation aligned with varying text-image complexities. This heterogeneity provides an efficient way of scaling EC-DIT up to 97 billion parameters and achieving significant improvements in training convergence, text-to-image alignment, and overall generation quality over dense models and conventional MoE models. Through extensive ablations, we show that EC-DIT demonstrates superior scalability and adaptive compute allocation by recognizing varying textual importance through end-to-end training. Notably, in text-to-image alignment evaluation, our largest models achieve a state-of-the-art GenEval score of 71.68% and still maintain competitive inference speed with intuitive interpretability.

[CV-73] Tracking objects that change in appearance with phase synchrony

Link: https://arxiv.org/abs/2410.02094
Authors: Sabine Muzellec, Drew Linsley, Alekh K. Ashok, Ennio Mingolla, Girik Malik, Rufin VanRullen, Thomas Serre
Keywords: track objects, neural synchrony, appearance change
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
*Comments:

Abstract:Objects we encounter often change appearance as we interact with them. Changes in illumination (shadows), object pose, or movement of nonrigid objects can drastically alter available image features. How do biological visual systems track objects as they change? It may involve specific attentional mechanisms for reasoning about the locations of objects independently of their appearances – a capability that prominent neuroscientific theories have associated with computing through neural synchrony. We computationally test the hypothesis that the implementation of visual attention through neural synchrony underlies the ability of biological visual systems to track objects that change in appearance over time. We first introduce a novel deep learning circuit that can learn to precisely control attention to features separately from their location in the world through neural synchrony: the complex-valued recurrent neural network (CV-RNN). Next, we compare object tracking in humans, the CV-RNN, and other deep neural networks (DNNs), using FeatureTracker: a large-scale challenge that asks observers to track objects as their locations and appearances change in precisely controlled ways. While humans effortlessly solved FeatureTracker, state-of-the-art DNNs did not. In contrast, our CV-RNN behaved similarly to humans on the challenge, providing a computational proof-of-concept for the role of phase synchronization as a neural substrate for tracking appearance-morphing objects as they move about.
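A toy sketch of the complex-valued recurrence at the heart of a CV-RNN: the hidden state is complex, so its magnitude can carry feature strength while its phase can carry a separate binding signal. Weights, sizes, and the modReLU-style activation here are generic illustrations, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
# complex-valued recurrent and input weights (arbitrary random initialization)
W = (rng.standard_normal((d_h, d_h)) + 1j * rng.standard_normal((d_h, d_h))) / np.sqrt(d_h)
U = (rng.standard_normal((d_h, d_in)) + 1j * rng.standard_normal((d_h, d_in))) / np.sqrt(d_in)

def step(h, x):
    z = W @ h + U @ x
    # modReLU-style activation: nonlinearity on the magnitude, phase preserved
    mag, phase = np.abs(z), np.angle(z)
    return np.maximum(mag - 0.5, 0.0) * np.exp(1j * phase)

h = np.zeros(d_h, dtype=complex)
for x in rng.standard_normal((3, d_in)):  # three input frames
    h = step(h, x)

# phase-synchrony index: |mean unit phasor| is 1 when all phases agree
sync = np.abs(np.exp(1j * np.angle(h)).mean())
```

The `sync` statistic is one common way to quantify the phase alignment that such models are hypothesized to use for grouping an object's features.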

[CV-74] Anchors Aweigh! Sail for Optimal Unified Multi-Modal Representations

Link: https://arxiv.org/abs/2410.02086
Authors: Minoh Jeong, Min Namgung, Zae Myung Kim, Dongyeop Kang, Yao-Yi Chiang, Alfred Hero
Keywords: diverse data sources, machine learning models, downstream tasks
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*Comments:

Abstract:Multimodal learning plays a crucial role in enabling machine learning models to fuse and utilize diverse data sources, such as text, images, and audio, to support a variety of downstream tasks. A unified representation across various modalities is particularly important for improving efficiency and performance. Recent binding methods, such as ImageBind (Girdhar et al., 2023), typically use a fixed anchor modality to align multimodal data in the anchor modal embedding space. In this paper, we mathematically analyze the fixed anchor binding methods and uncover notable limitations: (1) over-reliance on the choice of the anchor modality, (2) failure to capture intra-modal information, and (3) failure to account for inter-modal correlation among non-anchored modalities. To address these limitations, we propose CentroBind, a simple yet powerful approach that eliminates the need for a fixed anchor; instead, it employs dynamically adjustable centroid-based anchors generated from all available modalities, resulting in a balanced and rich representation space. We theoretically demonstrate that our method captures three crucial properties of multimodal learning: intra-modal learning, inter-modal learning, and multimodal alignment, while also constructing a robust unified representation across all modalities. Our experiments on both synthetic and real-world datasets demonstrate the superiority of the proposed method, showing that dynamic anchor methods outperform all fixed anchor binding methods as the former captures more nuanced multimodal interactions.
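The dynamic-anchor idea can be sketched in a few lines: instead of fixing one modality as the anchor, build the anchor as the centroid of the normalized embeddings from all available modalities. The embeddings below are toy values; the paper's actual training objective around this anchor is not reproduced here.

```python
import numpy as np

def centroid_anchor(embeddings):
    """embeddings: dict of modality name -> (d,) vector; returns unit centroid."""
    unit = [v / np.linalg.norm(v) for v in embeddings.values()]
    c = np.mean(unit, axis=0)     # centroid of the unit-normalized embeddings
    return c / np.linalg.norm(c)  # re-normalize the centroid

emb = {
    "image": np.array([1.0, 0.2, 0.0]),
    "text":  np.array([0.8, 0.4, 0.1]),
    "audio": np.array([0.9, 0.1, 0.2]),
}
anchor = centroid_anchor(emb)
# cosine similarity of each modality to the shared dynamic anchor
sims = {m: float(v / np.linalg.norm(v) @ anchor) for m, v in emb.items()}
```

Because the anchor moves with the embeddings, no single modality's geometry dominates the alignment, which is the limitation of fixed-anchor binding the abstract analyzes.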

[CV-75] EMMA: Efficient Visual Alignment in Multi-Modal LLMs

Link: https://arxiv.org/abs/2410.02080
Authors: Sara Ghazanfari, Alexandre Araujo, Prashanth Krishnamurthy, Siddharth Garg, Farshad Khorrami
Keywords: Multi-modal Large Language Models, impressive general-purpose capabilities
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
*Comments:

Abstract:Multi-modal Large Language Models (MLLMs) have recently exhibited impressive general-purpose capabilities by leveraging vision foundation models to encode the core concepts of images into representations. These are then combined with instructions and processed by the language model to generate high-quality responses. Despite significant progress in enhancing the language component, challenges persist in optimally fusing visual encodings within the language model for task-specific adaptability. Recent research has focused on improving this fusion through modality adaptation modules but at the cost of significantly increased model complexity and training data needs. In this paper, we propose EMMA (Efficient Multi-Modal Adaptation), a lightweight cross-modality module designed to efficiently fuse visual and textual encodings, generating instruction-aware visual representations for the language model. Our key contributions include: (1) an efficient early fusion mechanism that integrates vision and language representations with minimal added parameters (less than 0.2% increase in model size), (2) an in-depth interpretability analysis that sheds light on the internal mechanisms of the proposed method; (3) comprehensive experiments that demonstrate notable improvements on both specialized and general benchmarks for MLLMs. Empirical results show that EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations. Our code is available at this https URL

[CV-76] Kolmogorov-Arnold Network Autoencoders

Link: https://arxiv.org/abs/2410.02077
Authors: Mohammadamin Moradi, Shirin Panahi, Erik Bollt, Ying-Cheng Lai
Keywords: deep learning models, Multi-Layer Perceptrons, image classification
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*Comments: 12 pages, 5 figures, 1 table

Abstract:Deep learning models have revolutionized various domains, with Multi-Layer Perceptrons (MLPs) being a cornerstone for tasks like data regression and image classification. However, a recent study has introduced Kolmogorov-Arnold Networks (KANs) as promising alternatives to MLPs, leveraging activation functions placed on edges rather than nodes. This structural shift aligns KANs closely with the Kolmogorov-Arnold representation theorem, potentially enhancing both model accuracy and interpretability. In this study, we explore the efficacy of KANs in the context of data representation via autoencoders, comparing their performance with traditional Convolutional Neural Networks (CNNs) on the MNIST, SVHN, and CIFAR-10 datasets. Our results demonstrate that KAN-based autoencoders achieve competitive performance in terms of reconstruction accuracy, thereby suggesting their viability as effective tools in data analysis tasks.

[CV-77] Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

Link: https://arxiv.org/abs/2410.02073
Authors: Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R. Richter, Vladlen Koltun
Keywords: zero-shot metric monocular depth estimation, foundation model
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: Code and weights available at this https URL

Abstract:We present a foundation model for zero-shot metric monocular depth estimation. Our model, Depth Pro, synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. And the model is fast, producing a 2.25-megapixel depth map in 0.3 seconds on a standard GPU. These characteristics are enabled by a number of technical contributions, including an efficient multi-scale vision transformer for dense prediction, a training protocol that combines real and synthetic datasets to achieve high metric accuracy alongside fine boundary tracing, dedicated evaluation metrics for boundary accuracy in estimated depth maps, and state-of-the-art focal length estimation from a single image. Extensive experiments analyze specific design choices and demonstrate that Depth Pro outperforms prior work along multiple dimensions. We release code and weights at this https URL

[CV-78] Learning from the Giants: A Practical Approach to Underwater Depth and Surface Normals Estimation

Link: https://arxiv.org/abs/2410.02072
Authors: Alzayat Saleh, Melanie Olsen, Bouchra Senadji, Mostafa Rahimi Azghadi
Keywords: Monocular Depth and Surface Normals Estimation, Convolutional Neural Networks, Depth Normal Evaluation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: 18 pages, 6 figures, 8 tables. Submitted to Elsevier

Abstract:Monocular Depth and Surface Normals Estimation (MDSNE) is crucial for tasks such as 3D reconstruction, autonomous navigation, and underwater exploration. Current methods rely either on discriminative models, which struggle with transparent or reflective surfaces, or generative models, which, while accurate, are computationally expensive. This paper presents a novel deep learning model for MDSNE, specifically tailored for underwater environments, using a hybrid architecture that integrates Convolutional Neural Networks (CNNs) with Transformers, leveraging the strengths of both approaches. Training effective MDSNE models is often hampered by noisy real-world datasets and the limited generalization of synthetic datasets. To address this, we generate pseudo-labeled real data using multiple pre-trained MDSNE models. To ensure the quality of this data, we propose the Depth Normal Evaluation and Selection Algorithm (DNESA), which evaluates and selects the most reliable pseudo-labeled samples using domain-specific metrics. A lightweight student model is then trained on this curated dataset. Our model reduces parameters by 90% and training costs by 80%, allowing real-time 3D perception on resource-constrained devices. Key contributions include: a novel and efficient MDSNE model, the DNESA algorithm, a domain-specific data pipeline, and a focus on real-time performance and scalability. Designed for real-world underwater applications, our model facilitates low-cost deployments in underwater robots and autonomous vehicles, bridging the gap between research and practical implementation.

[CV-79] Semi-Supervised Fine-Tuning of Vision Foundation Models with Content-Style Decomposition

链接: https://arxiv.org/abs/2410.02069
作者: Mariia Drozdova,Vitaliy Kinakh,Yury Belousov,Erica Lastufka,Slava Voloshynovskiy
关键词-EN: fine-tuning approach designed, semi-supervised fine-tuning approach, limited labeled data, vision foundation models, present a semi-supervised
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we present a semi-supervised fine-tuning approach designed to improve the performance of foundation models on downstream tasks with limited labeled data. By leveraging content-style decomposition within an information-theoretic framework, our method enhances the latent representations of pre-trained vision foundation models, aligning them more effectively with specific task objectives and addressing the problem of distribution shift. We evaluate our approach on multiple datasets, including MNIST, its augmented variations (with yellow and white stripes), CIFAR-10, SVHN, and GalaxyMNIST. The experiments show improvements over purely supervised baselines, particularly in low-labeled data regimes, across both frozen and trainable backbones for the majority of the tested datasets.

[CV-80] DisEnvisioner: Disentangled and Enriched Visual Prompt for Customized Image Generation

链接: https://arxiv.org/abs/2410.02067
作者: Jing He,Haodong Li,Yongzhe Hu,Guibao Shen,Yingjie Cai,Weichao Qiu,Ying-Cong Chen
关键词-EN: creating customized images, additional textual instruction, textual instruction emerges, creating customized, promising endeavor
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: The first two authors contributed equally. Project page: this https URL

点击查看摘要

Abstract:In the realm of image generation, creating customized images from a visual prompt with additional textual instruction emerges as a promising endeavor. However, existing methods, both tuning-based and tuning-free, struggle with interpreting the subject-essential attributes from the visual prompt. This leads to subject-irrelevant attributes infiltrating the generation process, ultimately compromising the personalization quality in both editability and ID preservation. In this paper, we present DisEnvisioner, a novel approach for effectively extracting and enriching the subject-essential features while filtering out subject-irrelevant information, enabling exceptional customization performance, in a tuning-free manner and using only a single image. Specifically, the features of the subject and other irrelevant components are effectively separated into distinctive visual tokens, enabling much more accurate customization. To further improve the ID consistency, we enrich the disentangled features, sculpting them into more granular representations. Experiments demonstrate the superiority of our approach over existing methods in instruction response (editability), ID consistency, inference speed, and overall image quality, highlighting the effectiveness and efficiency of DisEnvisioner. Project page: this https URL.

[CV-81] Using Style Ambiguity Loss to Improve Aesthetics of Diffusion Models

链接: https://arxiv.org/abs/2410.02055
作者: James Baker
关键词-EN: style ambiguity loss, style ambiguity, ambiguity loss, Teaching, creative involves
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: arXiv admin note: substantial text overlap with arXiv:2407.12009

点击查看摘要

Abstract:Teaching text-to-image models to be creative involves using style ambiguity loss. In this work, we explore using the style ambiguity training objective, used to approximate creativity, on a diffusion model. We then experiment with forms of style ambiguity loss that do not require training a classifier or a labeled dataset, and find that the models trained with style ambiguity loss can generate better images than the baseline diffusion models and GANs. Code is available at this https URL.
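The style ambiguity objective mentioned here originates from Creative Adversarial Networks: the generator is rewarded when a style classifier cannot tell which style class an image belongs to, i.e. when the classifier's output is pushed toward the uniform distribution. A minimal numpy sketch of that loss follows; the paper's classifier-free variants differ, so treat this as the baseline formulation only.

```python
import numpy as np

def style_ambiguity_loss(style_logits):
    """Cross-entropy between the style classifier's softmax output and
    the uniform distribution over k style classes:
    H(uniform, p) = -(1/k) * sum_j log p_j.
    It is minimized (value log k) exactly when p is uniform, i.e. when
    the classifier is maximally uncertain about the style."""
    logits = style_logits - style_logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    return -np.log(probs + 1e-12).mean(axis=-1)
```

In training, this term is added to the generative objective so that images remain realistic while defying any single known style.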

[CV-82] Improving Autonomous AI Agents with Reflective Tree Search and Self-Learning

链接: https://arxiv.org/abs/2410.02052
作者: Xiao Yu,Baolin Peng,Vineeth Vajipey,Hao Cheng,Michel Galley,Jianfeng Gao,Zhou Yu
关键词-EN: demonstrated significant potential, automating complex multistep, complex multistep decision-making, multistep decision-making tasks, Reflective Monte Carlo
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Autonomous agents have demonstrated significant potential in automating complex multistep decision-making tasks. However, even state-of-the-art vision-language models (VLMs), such as GPT-4o, still fall short of human-level performance, particularly in intricate web environments and long-horizon planning tasks. To address these limitations, we introduce Reflective Monte Carlo Tree Search (R-MCTS), a novel test-time algorithm designed to enhance the ability of AI agents, e.g., powered by GPT-4o, to explore decision space on the fly. R-MCTS extends traditional MCTS by 1) incorporating contrastive reflection, allowing agents to learn from past interactions and dynamically improve their search efficiency; and 2) using multi-agent debate to provide reliable state evaluation. Moreover, we improve the agent’s performance by fine-tuning GPT-4o through self-learning, using R-MCTS generated tree traversals without any human-provided labels. On the challenging VisualWebArena benchmark, our GPT-4o-based R-MCTS agent achieves a 6% to 30% relative improvement across various tasks compared to the previous state-of-the-art. Additionally, we show that the knowledge gained from test-time search can be effectively transferred back to GPT-4o via fine-tuning. The fine-tuned GPT-4o matches 97% of R-MCTS’s performance while reducing compute usage by a factor of four at test time. Furthermore, qualitative results reveal that the fine-tuned GPT-4o model demonstrates the ability to explore the environment, evaluate a state, and backtrack to viable ones when it detects that the current state cannot lead to success. Moreover, our work demonstrates the compute scaling properties in both training - data collection with R-MCTS - and testing time. These results suggest a promising research direction to enhance VLMs’ reasoning and planning capabilities for agentic applications via test-time search and self-learning.

[CV-83] Emo3D: Metric and Benchmarking Dataset for 3D Facial Expression Generation from Emotion Description

链接: https://arxiv.org/abs/2410.02049
作者: Mahshid Dehghani,Amirahmad Shafiee,Ali Shafiei,Neda Fallah,Farahmand Alizadeh,Mohammad Mehdi Gholinejad,Hamid Behroozi,Jafar Habibi,Ehsaneddin Asgari
关键词-EN: limited emotion classes, constrained by limited, classes and insufficient, Existing, Language Image Pretraining
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Graphics (cs.GR)
*备注: 11 pages, 10 figures

点击查看摘要

Abstract:Existing 3D facial emotion modeling has been constrained by limited emotion classes and insufficient datasets. This paper introduces “Emo3D”, an extensive “Text-Image-Expression dataset” spanning a wide spectrum of human emotions, each paired with images and 3D blendshapes. Leveraging Large Language Models (LLMs), we generate a diverse array of textual descriptions, facilitating the capture of a broad spectrum of emotional expressions. Using this unique dataset, we conduct a comprehensive evaluation of language-based models’ fine-tuning and vision-language models like Contrastive Language Image Pretraining (CLIP) for 3D facial expression synthesis. We also introduce a new evaluation metric for this task to more directly measure the conveyed emotion. Our new evaluation metric, Emo3D, demonstrates its superiority over Mean Squared Error (MSE) metrics in assessing visual-text alignment and semantic richness in 3D facial expressions associated with human emotions. “Emo3D” has great applications in animation design, virtual reality, and emotional human-computer interaction.

[CV-84] FeelAnyForce: Estimating Contact Force Feedback from Tactile Sensation for Vision-Based Tactile Sensors

链接: https://arxiv.org/abs/2410.02048
作者: Amir-Hossein Shahidzadeh,Gabriele Caddeo,Koushik Alapati,Lorenzo Natale,Cornelia Fermüller,Yiannis Aloimonos
关键词-EN: vision-based tactile sensors, problem of estimating, vision-based tactile, tackle the problem, tactile sensors
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 4 figures, 4 tables

点击查看摘要

Abstract:In this paper, we tackle the problem of estimating 3D contact forces using vision-based tactile sensors. In particular, our goal is to estimate contact forces over a large range (up to 15 N) on any objects while generalizing across different vision-based tactile sensors. Thus, we collected a dataset of over 200K indentations using a robotic arm that pressed various indenters onto a GelSight Mini sensor mounted on a force sensor and then used the data to train a multi-head transformer for force regression. Strong generalization is achieved via accurate data collection and multi-objective optimization that leverages depth contact images. Despite being trained only on primitive shapes and textures, the regressor achieves a mean absolute error of 4% on a dataset of unseen real-world objects. We further evaluate our approach’s generalization capability to other GelSight mini and DIGIT sensors, and propose a reproducible calibration procedure for adapting the pre-trained model to other vision-based sensors. Furthermore, the method was evaluated on real-world tasks, including weighing objects and controlling the deformation of delicate objects, which relies on accurate force feedback. Project webpage: this http URL

[CV-85] Scene Flow as a Partial Differential Equation

链接: https://arxiv.org/abs/2410.02031
作者: Kyle Vedder,Neehar Peri,Ishan Khatri,Siyi Li,Eric Eaton,Mehmet Kocamaz,Yue Wang,Zhiding Yu,Deva Ramanan,Joachim Pehserl
关键词-EN: entire observation sequence, reframe scene flow, observation sequence, scene flow, Scene Flow Challenge
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page at this https URL

点击查看摘要

Abstract:We reframe scene flow as the problem of estimating a continuous space and time PDE that describes motion for an entire observation sequence, represented with a neural prior. Our resulting unsupervised method, EulerFlow, produces high quality scene flow on real-world data across multiple domains, including large-scale autonomous driving scenes and dynamic tabletop settings. Notably, EulerFlow produces high quality flow on small, fast moving objects like birds and tennis balls, and exhibits emergent 3D point tracking behavior by solving its estimated PDE over long time horizons. On the Argoverse 2 2024 Scene Flow Challenge, EulerFlow outperforms all prior art, beating the next best unsupervised method by over 2.5x and the next best supervised method by over 10%.
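The core idea, representing motion as a continuous velocity function of space and time and then integrating it over long horizons to obtain point tracks, can be illustrated with a toy analytic flow field in place of EulerFlow's optimized neural prior. The rotation field and the plain Euler integrator below are assumptions for demonstration only.

```python
import numpy as np

def make_toy_flow_field():
    """Stand-in for the neural prior: a function mapping (position, time)
    to instantaneous velocity. Here a fixed analytic rotation field about
    the origin (1 rad/s) replaces the learned network."""
    def flow(x, t):
        return np.stack([-x[..., 1], x[..., 0]], axis=-1)
    return flow

def track_points(flow, x0, t0=0.0, t1=np.pi / 2, steps=1000):
    """Euler-integrate the estimated motion PDE to obtain long-horizon
    point tracks, mirroring the emergent 3D tracking behavior described
    in the abstract."""
    x, dt = x0.astype(float), (t1 - t0) / steps
    for i in range(steps):
        x = x + dt * flow(x, t0 + i * dt)
    return x
```

Integrating the rotation field for a quarter period carries the point (1, 0) to approximately (0, 1), up to Euler discretization error.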

[CV-86] Quantifying the Gaps Between Translation and Native Perception in Training for Multimodal Multilingual Retrieval EMNLP24

链接: https://arxiv.org/abs/2410.02027
作者: Kyle Buettner,Adriana Kovashka
关键词-EN: languages and cultures, multilingual vision-language models, properly account, perceptual differences, reflected in image
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Short paper accepted to EMNLP24 (Main)

点击查看摘要

Abstract:There is a scarcity of multilingual vision-language models that properly account for the perceptual differences that are reflected in image captions across languages and cultures. In this work, through a multimodal, multilingual retrieval case study, we quantify the existing lack of model flexibility. We empirically show performance gaps between training on captions that come from native German perception and captions that have been either machine-translated or human-translated from English into German. To address these gaps, we further propose and evaluate caption augmentation strategies. While we achieve mean recall improvements (+1.3), gaps still remain, indicating an open area of future work for the community.

[CV-87] Addressing Data Heterogeneity in Federated Learning with Adaptive Normalization-Free Feature Recalibration

链接: https://arxiv.org/abs/2410.02006
作者: Vasilis Siomos,Sergio Naval-Marimont,Jonathan Passerat-Palmbach,Giacomo Tarroni
关键词-EN: preserves stakeholders’ data, stakeholders’ data ownership, collaborative training paradigm, decentralized collaborative training, Normalization-free Feature Recalibration
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages

点击查看摘要

Abstract:Federated learning is a decentralized collaborative training paradigm that preserves stakeholders’ data ownership while improving performance and generalization. However, statistical heterogeneity among client datasets poses a fundamental challenge by degrading system performance. To address this issue, we propose Adaptive Normalization-free Feature Recalibration (ANFR), an architecture-level approach that combines weight standardization and channel attention. Weight standardization normalizes the weights of layers instead of activations. This is less susceptible to mismatched client statistics and inconsistent averaging, thereby more robust under heterogeneity. Channel attention produces learnable scaling factors for feature maps, suppressing those that are inconsistent between clients due to heterogeneity. We demonstrate that combining these techniques boosts model performance beyond their individual contributions, by enhancing class selectivity and optimizing channel attention weight distribution. ANFR operates independently of the aggregation method and is effective in both global and personalized federated learning settings, with minimal computational overhead. Furthermore, when training with differential privacy, ANFR achieves an appealing balance between privacy and utility, enabling strong privacy guarantees without sacrificing performance. By integrating weight standardization and channel attention in the backbone model, ANFR offers a novel and versatile approach to the challenge of statistical heterogeneity. We demonstrate through extensive experiments that ANFR consistently outperforms established baselines across various aggregation methods, datasets, and heterogeneity conditions.
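The two building blocks named in the abstract are standard techniques and can be sketched directly: weight standardization normalizes each convolutional filter's weights (rather than activations), and channel attention gates feature maps with learned per-channel scaling. The numpy sketch below shows the shape of both computations; the squeeze-and-excitation-style bottleneck, its layer sizes, and how ANFR actually wires these into a backbone are assumptions.

```python
import numpy as np

def standardize_weights(w, eps=1e-5):
    """Weight standardization: normalize each output filter of a conv
    weight tensor (out_ch, in_ch, kH, kW) to zero mean and unit variance.
    Operating on weights instead of activations avoids depending on
    per-client batch statistics, the property ANFR exploits."""
    mean = w.mean(axis=(1, 2, 3), keepdims=True)
    std = w.std(axis=(1, 2, 3), keepdims=True)
    return (w - mean) / (std + eps)

def channel_attention(features, w1, w2):
    """Minimal channel attention: global average pool each channel,
    pass through a two-layer bottleneck, and rescale channels with a
    sigmoid gate in (0, 1). features: (N, C, H, W)."""
    squeezed = features.mean(axis=(2, 3))              # (N, C)
    hidden = np.maximum(squeezed @ w1, 0.0)            # ReLU bottleneck
    gates = 1.0 / (1.0 + np.exp(-(hidden @ w2)))       # sigmoid, (N, C)
    return features * gates[:, :, None, None]
```

Channels whose statistics are inconsistent across clients can thus be learned to receive gates near zero, which is the suppression mechanism the abstract describes.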

[CV-88] Normalizing Flow Based Metric for Image Generation

链接: https://arxiv.org/abs/2410.02004
作者: Pranav Jeevan,Neeraj Nixon,Amit Sethi
关键词-EN: dual-flow based likelihood, based likelihood distance, exact dual-flow based, flow-based likelihood distance, proposed metrics
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 15 pages, 16 figures

点击查看摘要

Abstract:We propose two new evaluation metrics to assess realness of generated images based on normalizing flows: a simpler and efficient flow-based likelihood distance (FLD) and a more exact dual-flow based likelihood distance (D-FLD). Because normalizing flows can be used to compute the exact likelihood, the proposed metrics assess how closely generated images align with the distribution of real images from a given domain. This property gives the proposed metrics a few advantages over the widely used Fréchet inception distance (FID) and other recent metrics. Firstly, the proposed metrics need only a few hundred images to stabilize (converge in mean), as opposed to tens of thousands needed for FID, and at least a few thousand for the other metrics. This allows confident evaluation of even small sets of generated images, such as validation batches inside training loops. Secondly, the network used to compute the proposed metric has over an order of magnitude fewer parameters compared to Inception-V3 used to compute FID, making it computationally more efficient. For assessing the realness of generated images in new domains (e.g., x-ray images), ideally these networks should be retrained on real images to model their distinct distributions. Thus, our smaller network will be even more advantageous for new domains. Extensive experiments show that the proposed metrics have the desired monotonic relationships with the extent of image degradation of various kinds.
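The shape of a flow-based likelihood distance can be illustrated with a diagonal Gaussian standing in for the trained normalizing flow (a Gaussian likewise yields exact log-likelihoods): fit the density model on real images, then compare mean negative log-likelihoods of generated versus real samples. The exact definitions of FLD and D-FLD are not given in the abstract, so this sketch is only indicative of the computation's structure.

```python
import numpy as np

def gaussian_nll(x, mean, var):
    """Exact negative log-likelihood under a diagonal Gaussian, a stand-in
    for a trained normalizing flow's exact density. x: (n, d) flattened
    images."""
    return 0.5 * (((x - mean) ** 2 / var) + np.log(2 * np.pi * var)).sum(axis=1)

def likelihood_distance(real, generated):
    """Illustrative likelihood distance: fit the density model on real
    samples, then take the gap between mean NLLs of generated and real
    data. Small values mean the generated distribution sits where the
    real one does."""
    mean, var = real.mean(axis=0), real.var(axis=0) + 1e-6
    return abs(gaussian_nll(generated, mean, var).mean()
               - gaussian_nll(real, mean, var).mean())
```

Because the density is exact, a few hundred samples already give a stable mean, which is the convergence advantage over FID the abstract emphasizes.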

[CV-89] SkyAI Sim: An Open-Source Simulation of UAV Aerial Imaging from Satellite Data

链接: https://arxiv.org/abs/2410.02003
作者: S. Parisa Dajkhosh,Peter M. Le,Orges Furxhi,Eddie L. Jacobs
关键词-EN: Unmanned Aerial Vehicle, vision-based navigation, challenging due, due to limited, limited availability
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Image and Video Processing (eess.IV)
*备注: 15 pages, 11 figures

点击查看摘要

Abstract:Capturing real-world aerial images for vision-based navigation (VBN) is challenging due to limited availability and conditions that make it nearly impossible to access all desired images from any location. The complexity increases when multiple locations are involved. State-of-the-art solutions, such as flying a UAV (Unmanned Aerial Vehicle) to take pictures or using existing research databases, have significant limitations. SkyAI Sim offers a compelling alternative by simulating a UAV to capture bird’s-eye view satellite images at zero-yaw with real-world visible-band specifications. This open-source tool allows users to specify the bounding box (top-left and bottom-right) coordinates of any region on a map. Without the need to physically fly a drone, the virtual Python UAV performs a raster search to capture satellite images using the Google Maps Static API. Users can define parameters such as flight altitude, aspect ratio and diagonal field of view of the camera, and the overlap between consecutive images. SkyAI Sim’s capabilities range from capturing a few low-altitude images for basic applications to generating extensive datasets of entire cities for complex tasks like deep learning. This versatility makes SkyAI a valuable tool not only for VBN, but also for other applications including environmental monitoring, construction, and city management. The open-source nature of the tool also allows for extending the raster search to other missions. A dataset of Memphis, TN, partially generated using SkyAI, is provided along with the simulator; it also includes data from a 3D world generation package for comparison.
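The raster ("lawnmower") search at the heart of the tool can be sketched as tiling the bounding box into capture centers with a chosen overlap. Everything below is illustrative: the footprint is given directly in degrees, whereas the real tool derives it from altitude, aspect ratio, and field of view, and then queries the Google Maps Static API at each waypoint.

```python
def raster_waypoints(top_left, bottom_right, footprint_deg, overlap=0.25):
    """Generate (lat, lon) capture centers covering a bounding box.

    top_left / bottom_right: (lat, lon) corners, with top_left having the
    larger latitude and smaller longitude. footprint_deg: side of one
    image footprint in degrees (an assumption; the real tool computes it
    from camera parameters). overlap: fractional overlap between
    consecutive images."""
    (lat0, lon0), (lat1, lon1) = top_left, bottom_right
    step = footprint_deg * (1.0 - overlap)
    waypoints, lat, row = [], lat0, 0
    while lat > lat1:
        lons, lon = [], lon0
        while lon < lon1:
            lons.append(lon)
            lon += step
        if row % 2:          # serpentine path: reverse alternate rows
            lons.reverse()
        waypoints += [(lat, lo) for lo in lons]
        lat -= step
        row += 1
    return waypoints
```

Each returned waypoint would become one simulated capture, so the list length directly predicts dataset size before any images are fetched.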

[CV-90] UlcerGPT: A Multimodal Approach Leveraging Large Language and Vision Models for Diabetic Foot Ulcer Image Transcription ICPR2024

链接: https://arxiv.org/abs/2410.01989
作者: Reza Basiri,Ali Abedi,Chau Nguyen,Milos R. Popovic,Shehroz S. Khan
关键词-EN: Diabetic foot ulcers, lower limb amputations, Diabetic foot, DFU image transcription, DFU image
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 13 pages, 3 figures, ICPR 2024 Conference (PRHA workshop)

点击查看摘要

Abstract:Diabetic foot ulcers (DFUs) are a leading cause of hospitalizations and lower limb amputations, placing a substantial burden on patients and healthcare systems. Early detection and accurate classification of DFUs are critical for preventing serious complications, yet many patients experience delays in receiving care due to limited access to specialized services. Telehealth has emerged as a promising solution, improving access to care and reducing the need for in-person visits. The integration of artificial intelligence and pattern recognition into telemedicine has further enhanced DFU management by enabling automatic detection, classification, and monitoring from images. Despite advancements in artificial intelligence-driven approaches for DFU image analysis, the application of large language models for DFU image transcription has not yet been explored. To address this gap, we introduce UlcerGPT, a novel multimodal approach leveraging large language and vision models for DFU image transcription. This framework combines advanced vision and language models, such as Large Language and Vision Assistant and Chat Generative Pre-trained Transformer, to transcribe DFU images by jointly detecting, classifying, and localizing regions of interest. Through detailed experiments on a public dataset, evaluated by expert clinicians, UlcerGPT demonstrates promising results in the accuracy and efficiency of DFU transcription, offering potential support for clinicians in delivering timely care via telemedicine.

[CV-91] Enhancing Screen Time Identification in Children with a Multi-View Vision Language Model and Screen Time Tracker

链接: https://arxiv.org/abs/2410.01966
作者: Xinlong Hou,Sen Shen,Xueshen Li,Xinran Gao,Ziyi Huang,Steven J. Holiday,Matthew R. Cribbet,Susan W. White,Edward Sazonov,Yu Gan
关键词-EN: physical activity, childhood obesity, social interaction, accurately monitor, phenomena linked
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Prepare for submission

点击查看摘要

Abstract:Being able to accurately monitor the screen exposure of young children is important for research on phenomena linked to screen use such as childhood obesity, physical activity, and social interaction. Most existing studies rely upon self-report or manual measures from bulky wearable sensors, thus lacking efficiency and accuracy in capturing quantitative screen exposure data. In this work, we developed a novel sensor informatics framework that utilizes egocentric images from a wearable sensor, termed the screen time tracker (STT), and a vision language model (VLM). In particular, we devised a multi-view VLM that takes multiple views from egocentric image sequences and interprets screen exposure dynamically. We validated our approach using a dataset of children’s free-living activities, demonstrating significant improvement over existing methods based on plain vision-language models and object detection models. Results supported the promise of this monitoring approach, which could optimize behavioral research on screen exposure in children’s naturalistic settings.

[CV-92] Language Supervised Human Action Recognition with Salient Fusion: Construction Worker Action Recognition as a Use Case

链接: https://arxiv.org/abs/2410.01962
作者: Mohammad Mahdavian,Mohammad Loni,Mo Chen
关键词-EN: Detecting human actions, Detecting human, Human Action Recognition, human actions, robots and vehicles
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Detecting human actions is a crucial task for autonomous robots and vehicles, often requiring the integration of various data modalities for improved accuracy. In this study, we introduce a novel approach to Human Action Recognition (HAR) based on skeleton and visual cues. Our method leverages a language model to guide the feature extraction process in the skeleton encoder. Specifically, we employ learnable prompts for the language model conditioned on the skeleton modality to optimize feature representation. Furthermore, we propose a fusion mechanism that combines dual-modality features using a salient fusion module, incorporating attention and transformer mechanisms to address the modalities’ high dimensionality. This fusion process prioritizes informative video frames and body joints, enhancing the recognition accuracy of human actions. Additionally, we introduce a new dataset tailored for real-world robotic applications in construction sites, featuring visual, skeleton, and depth data modalities, named VolvoConstAct. This dataset serves to facilitate the training and evaluation of machine learning models to instruct autonomous construction machines for performing necessary tasks in the real world construction zones. To evaluate our approach, we conduct experiments on our dataset as well as three widely used public datasets, NTU-RGB+D, NTU-RGB+D120 and NW-UCLA. Results reveal that our proposed method achieves promising performance across all datasets, demonstrating its robustness and potential for various applications. The codes and dataset are available at: this https URL

[CV-93] One-step Noisy Label Mitigation

链接: https://arxiv.org/abs/2410.01944
作者: Hao Li,Jiayang Gu,Jingkuan Song,An Zhang,Lianli Gao
关键词-EN: Mitigating the detrimental, large-scale pre-training tasks, increasingly critical, detrimental effects, large-scale pre-training
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 20 pages, 4 figures, 11 Tables

点击查看摘要

Abstract:Mitigating the detrimental effects of noisy labels on the training process has become increasingly critical, as obtaining entirely clean or human-annotated samples for large-scale pre-training tasks is often impractical. Nonetheless, existing noise mitigation methods often encounter limitations in practical applications due to their task-specific design, model dependency, and significant computational overhead. In this work, we exploit the properties of high-dimensional orthogonality to identify a robust and effective boundary in cone space for separating clean and noisy samples. Building on this, we propose One-step Anti-Noise (OSA), a model-agnostic noisy label mitigation paradigm that employs an estimator model and a scoring function to assess the noise level of input pairs through just one-step inference, a cost-efficient process. We empirically demonstrate the superiority of OSA, highlighting its enhanced training robustness, improved task transferability, ease of deployment, and reduced computational costs across various benchmarks, models, and tasks. Our code is released at this https URL.
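A minimal sketch of one-step scoring in the spirit of OSA: embed each input pair once with an estimator model, use cosine similarity as the cleanliness score, and separate samples that fall outside a cone around perfect alignment. The fixed threshold and the raw-cosine scoring function are simplifications of the paper's estimator and scoring function.

```python
import numpy as np

def noise_scores(image_emb, text_emb):
    """Score each image-text pair by the cosine similarity of its two
    embeddings (one forward pass per pair). High similarity suggests a
    clean pair; low similarity suggests a noisy label."""
    a = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    b = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return (a * b).sum(axis=1)

def split_clean_noisy(scores, cos_threshold=0.5):
    """Partition sample indices by a cone boundary in embedding space:
    pairs within the cone (cosine >= threshold) are kept as clean."""
    clean = np.where(scores >= cos_threshold)[0]
    noisy = np.where(scores < cos_threshold)[0]
    return clean, noisy
```

Because scoring is a single inference pass per pair, the filter can run ahead of any downstream training task, matching the "model-agnostic, one-step" framing of the abstract.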

[CV-94] Deep learning assisted high resolution microscopy image processing for phase segmentation in functional composite materials

链接: https://arxiv.org/abs/2410.01928
作者: Ganesh Raghavendran(1),Bing Han(1),Fortune Adekogbe(4),Shuang Bai(2),Bingyu Lu(1),William Wu(5),Minghao Zhang(1),Ying Shirley Meng(1 and 3) ((1) Department of NanoEngineering-University of California San Diego, (2) Department of NanoEngineering-University of California San Diego (3) Pritzker School of Molecular Engineering-University of Chicago, (4) Department of Chemical and Petroleum Engineering-University of Lagos, (5) Del Norte High School)
关键词-EN: high-resolution microscopy images, battery research, challenging task, involves dealing, dealing with complex
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In the domain of battery research, the processing of high-resolution microscopy images is a challenging task, as it involves dealing with complex images and requires a prior understanding of the components involved. The utilization of deep learning methodologies for image analysis has attracted considerable interest in recent years, with multiple investigations employing such techniques for image segmentation and analysis within the realm of battery research. However, the automated analysis of high-resolution microscopy images for detecting phases and components in composite materials is still an underexplored area. This work proposes a novel workflow for detecting components and phase segmentation from raw high resolution transmission electron microscopy (TEM) images using a trained U-Net segmentation model. The developed model can expedite the detection of components and phase segmentation, diminishing the temporal and cognitive demands associated with scrutinizing an extensive array of TEM images, thereby mitigating the potential for human errors. This approach presents a novel and efficient image analysis approach with broad applicability beyond the battery field and holds potential for application in other related domains characterized by phase and composition distribution, such as alloy production.

[CV-95] A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation

链接: https://arxiv.org/abs/2410.01912
作者: Liang Chen,Sinan Tan,Zefan Cai,Weichu Xie,Haozhe Zhao,Yichi Zhang,Junyang Lin,Jinze Bai,Tianyu Liu,Baobao Chang
关键词-EN: information loss bottleneck, model architecture called, bottleneck of vector-quantization, autoregressive image generation, tackles the information
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 25 pages, 20 figures, code is open at this https URL

点击查看摘要

Abstract:This work tackles the information loss bottleneck of vector-quantization (VQ) autoregressive image generation by introducing a novel model architecture called the 2-Dimensional Autoregression (DnD) Transformer. The DnD-Transformer predicts more codes for an image by introducing a new autoregression direction, model depth, along with the sequence length direction. Compared to traditional 1D autoregression and previous work utilizing similar 2D image decomposition such as RQ-Transformer, the DnD-Transformer is an end-to-end model that can generate higher quality images with the same backbone model size and sequence length, opening a new optimization perspective for autoregressive image generation. Furthermore, our experiments reveal that the DnD-Transformer’s potential extends beyond generating natural images. It can even generate images with rich text and graphical elements in a self-supervised manner, demonstrating an understanding of these combined modalities. This has not been previously demonstrated for popular vision generative models such as diffusion models, showing a spark of vision-language intelligence when trained solely on images. Code, datasets and models are open at this https URL.

[CV-96] Social Media Authentication and Combating Deepfakes using Semi-fragile Invisible Image Watermarking

链接: https://arxiv.org/abs/2410.01906
作者: Aakash Varma Nadimpalli,Ajita Rattani
关键词-EN: severe societal concerns, raised severe societal, watermark removal attacks, deep generative models, video synthesis
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: ACM Transactions (Digital Threats: Research and Practice)

点击查看摘要

Abstract:With the significant advances in deep generative models for image and video synthesis, Deepfakes and manipulated media have raised severe societal concerns. Conventional machine learning classifiers for deepfake detection often fail to cope with evolving deepfake generation technology and are susceptible to adversarial attacks. Alternatively, invisible image watermarking is being researched as a proactive defense technique that allows media authentication by verifying an invisible secret message embedded in the image pixels. A handful of invisible image watermarking techniques introduced for media authentication have proven vulnerable to basic image processing operations and watermark removal attacks. In response, we have proposed a semi-fragile image watermarking technique that embeds an invisible secret message into real images for media authentication. Our proposed watermarking framework is designed to be fragile to facial manipulations or tampering while being robust to benign image-processing operations and watermark removal attacks. This is facilitated through a unique architecture of our proposed technique consisting of critic and adversarial networks that enforce high image quality and resiliency to watermark removal efforts, respectively, along with the backbone encoder-decoder and the discriminator networks. Thorough experimental investigations on SOTA facial Deepfake datasets demonstrate that our proposed model can embed a 64-bit secret as an imperceptible image watermark that can be recovered with high bit-recovery accuracy when benign image processing operations are applied, while being non-recoverable when unseen Deepfake manipulations are applied. In addition, our proposed watermarking technique demonstrates high resilience to several white-box and black-box watermark removal attacks, thus obtaining state-of-the-art performance.

[CV-97] OCC-MLLM-Alpha: Empowering Multi-modal Large Language Model for the Understanding of Occluded Objects with Self-Supervised Test-Time Learning ECCV2024

链接: https://arxiv.org/abs/2410.01861
作者: Shuxin Yang,Xinhan Di
关键词-EN: existing large-scale visual, occluded objects, describing occluded objects, visual language multi-modal, large-scale visual language
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ECCV 2024 Observing and Understanding Hands in Action Workshop (5 pages, 3 figures, 2 tables). arXiv admin note: substantial text overlap with arXiv:2410.01261

点击查看摘要

Abstract:There is a gap in the understanding of occluded objects in existing large-scale visual language multi-modal models. Current state-of-the-art multi-modal models fail to provide satisfactory results in describing occluded objects through universal visual encoders and supervised learning strategies. Therefore, we introduce a multi-modal large language framework and corresponding self-supervised learning strategy with support of 3D generation. We start our experiments comparing with the state-of-the-art models in the evaluation of a large-scale dataset SOMVideo [18]. The initial results demonstrate the improvement of 16.92% in comparison with the state-of-the-art VLM models.

[CV-98] Spatial Action Unit Cues for Interpretable Deep Facial Expression Recognition ALT

链接: https://arxiv.org/abs/2410.01848
作者: Soufiane Belharbi,Marco Pedersoli,Alessandro Lameiras Koerich,Simon Bacon,Eric Granger
关键词-EN: level of accuracy, facial expression recognition, achieve a high, high level, facial
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 4 pages, 2 figures, AI and Digital Health Symposium 2024, October 18th 2024, Montréal

点击查看摘要

Abstract:Although state-of-the-art classifiers for facial expression recognition (FER) can achieve a high level of accuracy, they lack interpretability, an important feature for end-users. Experts typically associate spatial action units (AUs) from a codebook to facial regions for the visual interpretation of expressions. In this paper, the same expert steps are followed. A new learning strategy is proposed to explicitly incorporate AU cues into classifier training, allowing deep interpretable models to be trained. During training, this AU codebook is used, along with the input image expression label, and facial landmarks, to construct an AU heatmap that indicates the most discriminative image regions of interest w.r.t. the facial expression. This valuable spatial cue is leveraged to train a deep interpretable classifier for FER. This is achieved by constraining the spatial layer features of a classifier to be correlated with AU heatmaps. Using a composite loss, the classifier is trained to correctly classify an image while yielding interpretable visual layer-wise attention correlated with AU maps, simulating the expert decision process. Our strategy only relies on image class expression for supervision, without additional manual annotations. Our new strategy is generic, and can be applied to any deep CNN- or transformer-based classifier without requiring any architectural change or significant additional training time. Our extensive evaluation on two public benchmarks, the RAF-DB and AffectNet datasets, shows that our proposed strategy can improve layer-wise interpretability without degrading classification performance. In addition, we explore a common type of interpretable classifiers that rely on class activation mapping (CAM) methods, and show that our approach can also improve CAM interpretability.
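The composite objective described above combines a classification term with a term tying the classifier's spatial attention to the expert AU heatmap. A minimal numpy sketch, not the authors' implementation: the cosine-similarity form of the correlation constraint and the weight `lam` are assumptions for illustration.

```python
import numpy as np

def composite_loss(logits, label, layer_attn, au_heatmap, lam=0.5):
    # Softmax cross-entropy on the expression logits.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    cls_loss = -log_probs[label]
    # Alignment term: penalize low cosine similarity between the
    # classifier's spatial attention map and the expert AU heatmap.
    a = layer_attn.ravel() / (np.linalg.norm(layer_attn) + 1e-8)
    h = au_heatmap.ravel() / (np.linalg.norm(au_heatmap) + 1e-8)
    align_loss = 1.0 - float(a @ h)
    return float(cls_loss + lam * align_loss)
```

With perfectly aligned maps the second term vanishes and only the classification loss remains, which mirrors the paper's goal of interpretability without degrading accuracy.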

[CV-99] EgoAvatar: Egocentric View-Driven and Photorealistic Full-body Avatars

链接: https://arxiv.org/abs/2410.01835
作者: Jianchun Chen,Jian Wang,Yinda Zhang,Rohit Pandey,Thabo Beeler,Marc Habermann,Christian Theobalt
关键词-EN: interact and communicate, real counterparts, egocentric, single RGB camera, precisely reflect
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注:

点击查看摘要

Abstract:Immersive VR telepresence ideally means being able to interact and communicate with digital avatars that are indistinguishable from and precisely reflect the behaviour of their real counterparts. The core technical challenge is two fold: Creating a digital double that faithfully reflects the real human and tracking the real human solely from egocentric sensing devices that are lightweight and have a low energy consumption, e.g. a single RGB camera. Up to date, no unified solution to this problem exists as recent works solely focus on egocentric motion capture, only model the head, or build avatars from multi-view captures. In this work, we, for the first time in literature, propose a person-specific egocentric telepresence approach, which jointly models the photoreal digital avatar while also driving it from a single egocentric video. We first present a character model that is animatible, i.e. can be solely driven by skeletal motion, while being capable of modeling geometry and appearance. Then, we introduce a personalized egocentric motion capture component, which recovers full-body motion from an egocentric video. Finally, we apply the recovered pose to our character model and perform a test-time mesh refinement such that the geometry faithfully projects onto the egocentric view. To validate our design choices, we propose a new and challenging benchmark, which provides paired egocentric and dense multi-view videos of real humans performing various motions. Our experiments demonstrate a clear step towards egocentric and photoreal telepresence as our method outperforms baselines as well as competing methods. For more details, code, and data, we refer to our project page.

[CV-100] Analysis of Convolutional Neural Network-based Image Classifications: A Multi-Featured Application for Rice Leaf Disease Prediction and Recommendations for Farmers

链接: https://arxiv.org/abs/2410.01827
作者: Biplov Paneru,Bishwash Paneru,Krishna Bikram Shah
关键词-EN: convolutional neural network, neural network, precision agriculture, study presents, method for improving
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:This study presents a novel method for improving rice disease classification using 8 different convolutional neural network (CNN) algorithms, which will further the field of precision agriculture. These models are integrated into a Tkinter-based application that offers farmers a feature-rich interface. With the help of this cutting-edge application, farmers will be able to make timely and well-informed decisions by enabling real-time disease prediction and providing personalized recommendations. Together with the user-friendly Tkinter interface, the smooth integration of cutting-edge CNN transfer learning algorithms, including ResNet-50, InceptionV3, VGG16, and MobileNetV2, with the UCI dataset represents a major advancement toward modernizing agricultural practices and guaranteeing sustainable crop management. Remarkable outcomes include 75% accuracy for ResNet-50, 90% accuracy for DenseNet121, 84% accuracy for VGG16, 95.83% accuracy for MobileNetV2, 91.61% accuracy for DenseNet169, and 86% accuracy for InceptionV3. These results give a concise summary of the models’ capabilities, assisting researchers in choosing appropriate strategies for precise and successful rice crop disease identification. Severe overfitting was observed on VGG19 with 70% accuracy and NasNet with 80.02% accuracy. On ResNet101, only an accuracy of 54% could be achieved, along with only 33% on EfficientNetB0. A MobileNetV2-trained model was successfully deployed on a Tkinter GUI application to make predictions using image or real-time video capture.

[CV-101] PixelBytes: Catching Unified Representation for Multimodal Generation

链接: https://arxiv.org/abs/2410.01820
作者: Fabien Furfaro
关键词-EN: report introduces PixelBytes, report introduces, unified multimodal representation, Image Transformers, Recurrent Neural Networks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This report introduces PixelBytes, a novel approach for unified multimodal representation learning. Inspired by existing sequence models such as Image Transformers, PixelCNN, and Mamba-Bytes, our method aims to capture diverse inputs in a cohesive representation, exploring the integration of different data types, particularly text, audio, and pixelated images (sprites). We conducted experiments on a specialized PixelBytes Pokémon dataset. Initially, we investigated various model architectures, including Recurrent Neural Networks (RNNs), State Space Models (SSMs), and Attention-based models, focusing on bidirectional processing and our convolutional PxBy embedding technique. Subsequently, we evaluated models based on data reduction strategies and the effectiveness of autoregressive learning. We specifically examined Long Short-Term Memory (LSTM) networks in both predictive and autoregressive modes for our main experiments. Our findings suggest that autoregressive models outperform predictive models in this context. By adopting a flexible approach to multimodal modeling, PixelBytes contributes to the ongoing development of foundation models capable of understanding and generating multimodal data. The complete PixelBytes project, including code, models, and datasets, is available online.

[CV-102] From Experts to the Public: Governing Multimodal Language Models in Politically Sensitive Video Analysis

链接: https://arxiv.org/abs/2410.01817
作者: Tanusree Sharma,Yujin Potter,Zachary Kilhoffer,Yun Huang,Dawn Song,Yang Wang
关键词-EN: large language models, multimodal large language, politically sensitive videos, language models, focusing on analyses
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:This paper examines the governance of multimodal large language models (MM-LLMs) through individual and collective deliberation, focusing on analyses of politically sensitive videos. We conducted a two-step study: first, interviews with 10 journalists established a baseline understanding of expert video interpretation; second, 114 individuals from the general public engaged in deliberation using this http URL, a platform that facilitates democratic decision-making through decentralized autonomous organization (DAO) mechanisms. Our findings show that while experts emphasized emotion and narrative, the general public prioritized factual clarity, objectivity of the situation, and emotional neutrality. Additionally, we explored the impact of different governance mechanisms, quadratic vs. weighted ranking voting and equal vs. 20-80 power distributions, on users' decision-making on how AI should behave. Specifically, quadratic voting enhanced perceptions of liberal democracy and political equality, and participants who were more optimistic about AI perceived the voting process to have a higher level of participatory democracy. Our results suggest the potential of applying DAO mechanisms to help democratize AI governance.
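Quadratic voting, one of the mechanisms compared above, prices n votes on an option at n^2 credits. A sketch of the resulting budget-to-votes mapping; the credit accounting is the textbook quadratic-voting rule, not necessarily the platform's exact implementation.

```python
import math

def max_votes(credits):
    # Casting n votes costs n**2 credits, so a participant's maximum
    # vote count on one option is the integer square root of their
    # credit budget.
    return math.isqrt(credits)
```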

[CV-103] Automatic Scene Generation: State-of-the-Art Techniques, Models, Datasets, Challenges and Future Prospects

链接: https://arxiv.org/abs/2410.01816
作者: Awal Ahmed Fime,Saifuddin Mahmud,Arpita Das,Md. Sunzidul Islam,Hong-Hoon Kim
关键词-EN: Automatic scene generation, scene generation, Automatic scene, Generative Adversarial Networks, applications in robotics
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 59 pages, 16 figures, 3 tables, 36 equations, 348 references

点击查看摘要

Abstract:Automatic scene generation is an essential area of research with applications in robotics, recreation, visual representation, training and simulation, education, and more. This survey provides a comprehensive review of the current state-of-the-arts in automatic scene generation, focusing on techniques that leverage machine learning, deep learning, embedded systems, and natural language processing (NLP). We categorize the models into four main types: Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), Transformers, and Diffusion Models. Each category is explored in detail, discussing various sub-models and their contributions to the field. We also review the most commonly used datasets, such as COCO-Stuff, Visual Genome, and MS-COCO, which are critical for training and evaluating these models. Methodologies for scene generation are examined, including image-to-3D conversion, text-to-3D generation, UI/layout design, graph-based methods, and interactive scene generation. Evaluation metrics such as Frechet Inception Distance (FID), Kullback-Leibler (KL) Divergence, Inception Score (IS), Intersection over Union (IoU), and Mean Average Precision (mAP) are discussed in the context of their use in assessing model performance. The survey identifies key challenges and limitations in the field, such as maintaining realism, handling complex scenes with multiple objects, and ensuring consistency in object relationships and spatial arrangements. By summarizing recent advances and pinpointing areas for improvement, this survey aims to provide a valuable resource for researchers and practitioners working on automatic scene generation. 
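Among the evaluation metrics the survey reviews, Intersection over Union (IoU) is the simplest to state concretely. A minimal sketch for axis-aligned boxes in (x1, y1, x2, y2) form:

```python
def iou(box_a, box_b):
    # Overlap rectangle, clamped to zero area when the boxes are disjoint.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    # IoU = intersection area / union area.
    return inter / (area_a + area_b - inter)
```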

[CV-104] Privacy-Preserving SAM Quantization for Efficient Edge Intelligence in Healthcare

链接: https://arxiv.org/abs/2410.01813
作者: Zhikai Li,Jing Zhang,Qingyi Gu
关键词-EN: pressing social issue, healthcare personnel expertise, personnel expertise, pressing social, social issue
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The disparity in healthcare personnel expertise and medical resources across different regions of the world is a pressing social issue. Artificial intelligence technology offers new opportunities to alleviate this issue. Segment Anything Model (SAM), which excels in intelligent image segmentation, has demonstrated exceptional performance in medical monitoring and assisted diagnosis. Unfortunately, the huge computational and storage overhead of SAM poses significant challenges for deployment on resource-limited edge devices. Quantization is an effective solution for model compression; however, traditional methods rely heavily on original data for calibration, which raises widespread concerns about medical data privacy and security. In this paper, we propose a data-free quantization framework for SAM, called DFQ-SAM, which learns and calibrates quantization parameters without any original data, thus effectively preserving data privacy during model compression. Specifically, we propose pseudo-positive label evolution for segmentation, combined with patch similarity, to fully leverage the semantic and distribution priors in pre-trained models, which facilitates high-quality data synthesis as a substitute for real data. Furthermore, we introduce scale reparameterization to ensure the accuracy of low-bit quantization. We perform extensive segmentation experiments on various datasets, and DFQ-SAM consistently provides significant performance on low-bit quantization. DFQ-SAM eliminates the need for data transfer in cloud-edge collaboration, thereby protecting sensitive data from potential attacks. It enables secure, fast, and personalized healthcare services at the edge, which enhances system efficiency and optimizes resource allocation, and thus facilitating the pervasive application of artificial intelligence in worldwide healthcare.
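The calibration step in any data-free quantization pipeline amounts to choosing a scale for low-bit uniform quantization from (here, synthetic) activations. A toy symmetric min-max calibration for illustration only; DFQ-SAM's actual scale reparameterization is more involved.

```python
import numpy as np

def calibrate_scale(activations, n_bits=4):
    # Map the largest activation magnitude onto the top quantization level.
    qmax = 2 ** (n_bits - 1) - 1
    return float(np.abs(activations).max()) / qmax

def fake_quant(x, scale, n_bits=4):
    # Quantize-dequantize round trip used to simulate low-bit inference.
    qmax = 2 ** (n_bits - 1) - 1
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale
```

The round-trip error of this scheme is bounded by half the scale, which is why good calibration data (real or synthesized) matters at low bit widths.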

[CV-105] AlzhiNet: Traversing from 2DCNN to 3DCNN Towards Early Detection and Diagnosis of Alzheimers Disease

链接: https://arxiv.org/abs/2410.02714
作者: Romoke Grace Akindele,Samuel Adebayo,Paul Shekonya Kanda,Ming Yu
关键词-EN: Convolutional Neural Networks, progressive neurodegenerative disorder, Convolutional Neural, Neural Networks, effective disease management
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Alzheimer’s disease (AD) is a progressive neurodegenerative disorder with increasing prevalence among the aging population, necessitating early and accurate diagnosis for effective disease management. In this study, we present a novel hybrid deep learning framework that integrates both 2D Convolutional Neural Networks (2D-CNN) and 3D Convolutional Neural Networks (3D-CNN), along with a custom loss function and volumetric data augmentation, to enhance feature extraction and improve classification performance in AD diagnosis. According to extensive experiments, AlzhiNet outperforms standalone 2D and 3D models, highlighting the importance of combining these complementary representations of data. The depth and quality of 3D volumes derived from the augmented 2D slices also significantly influence the model’s performance. The results indicate that carefully selecting weighting factors in hybrid predictions is imperative for achieving optimal results. Our framework has been validated on the Magnetic Resonance Imaging (MRI) from Kaggle and MIRIAD datasets, obtaining accuracies of 98.9% and 99.99%, respectively, with an AUC of 100%. Furthermore, AlzhiNet was studied under a variety of perturbation scenarios on the Alzheimer’s Kaggle dataset, including Gaussian noise, brightness, contrast, salt and pepper noise, color jitter, and occlusion. The results obtained show that AlzhiNet is more robust to perturbations than ResNet-18, making it an excellent choice for real-world applications. This approach represents a promising advancement in the early diagnosis and treatment planning for Alzheimer’s disease.
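The "weighting factors in hybrid predictions" mentioned above reduce to a convex combination of the 2D and 3D branch probabilities. A minimal sketch; the weight value 0.6 is illustrative, not the paper's tuned setting.

```python
import numpy as np

def hybrid_predict(p2d, p3d, w=0.6):
    # Convex combination of the 2D-CNN and 3D-CNN class probabilities,
    # renormalized to a valid distribution.
    fused = w * np.asarray(p2d) + (1.0 - w) * np.asarray(p3d)
    return fused / fused.sum()
```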

[CV-106] Diffusion-based Extreme Image Compression with Compressed Feature Initialization

链接: https://arxiv.org/abs/2410.02640
作者: Zhiyuan Li,Yanhui Zhou,Hao Wei,Chenyang Ge,Ajmal Mian
关键词-EN: extreme image compression, extremely low bitrates, achieved impressive performance, image compression methods, Relay Residual Diffusion
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion-based extreme image compression methods have achieved impressive performance at extremely low bitrates. However, constrained by the iterative denoising process that starts from pure noise, these methods are limited in both fidelity and efficiency. To address these two issues, we present Relay Residual Diffusion Extreme Image Compression (RDEIC), which leverages compressed feature initialization and residual diffusion. Specifically, we first use the compressed latent features of the image with added noise, instead of pure noise, as the starting point to eliminate the unnecessary initial stages of the denoising process. Second, we design a novel relay residual diffusion that reconstructs the raw image by iteratively removing the added noise and the residual between the compressed and target latent features. Notably, our relay residual diffusion network seamlessly integrates pre-trained stable diffusion to leverage its robust generative capability for high-quality reconstruction. Third, we propose a fixed-step fine-tuning strategy to eliminate the discrepancy between the training and inference phases, further improving the reconstruction quality. Extensive experiments demonstrate that the proposed RDEIC achieves state-of-the-art visual quality and outperforms existing diffusion-based extreme image compression methods in both fidelity and efficiency. The source code will be provided in this https URL.
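The compressed-feature initialization, i.e. starting the reverse process from a noised version of the compressed latent rather than from pure noise, can be sketched with the standard DDPM forward-noising formula. The schedule handling below is an assumption for illustration, not RDEIC's exact procedure.

```python
import numpy as np

def noisy_start(z_compressed, t_start, alphas_cumprod, rng):
    # DDPM-style forward noising of the compressed latent: the reverse
    # (denoising) process then begins at step t_start instead of at the
    # final pure-noise step, skipping the early denoising stages.
    a = alphas_cumprod[t_start]
    noise = rng.standard_normal(z_compressed.shape)
    return np.sqrt(a) * z_compressed + np.sqrt(1.0 - a) * noise
```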

[CV-107] High-Efficiency Neural Video Compression via Hierarchical Predictive Learning

链接: https://arxiv.org/abs/2410.02598
作者: Ming Lu,Zhihao Duan,Wuyang Cong,Dandan Ding,Fengqing Zhu,Zhan Ma
关键词-EN: enhanced Deep Hierarchical, Deep Hierarchical Video, enhanced Deep, Hierarchical Video Compression-DHVC, Deep Hierarchical
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The enhanced Deep Hierarchical Video Compression (DHVC 2.0) has been introduced. This single-model neural video codec operates across a broad range of bitrates, delivering not only superior compression performance to representative methods but also impressive complexity efficiency, enabling real-time processing with a significantly smaller memory footprint on standard GPUs. These remarkable advancements stem from the use of hierarchical predictive coding. Each video frame is uniformly transformed into multiscale representations through hierarchical variational autoencoders. For a specific scale’s feature representation of a frame, its corresponding latent residual variables are generated by referencing lower-scale spatial features from the same frame and then conditionally entropy-encoded using a probabilistic model whose parameters are predicted using same-scale temporal reference from previous frames and lower-scale spatial reference of the current frame. This feature-space processing operates from the lowest to the highest scale of each frame, completely eliminating the need for the complexity-intensive motion estimation and compensation techniques that have been standard in video codecs for decades. The hierarchical approach facilitates parallel processing, accelerating both encoding and decoding, and supports transmission-friendly progressive decoding, making it particularly advantageous for networked video applications in the presence of packet loss. Source codes will be made available.
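A crude stand-in for the multiscale transform: each frame is reduced to a pyramid of representations, coarsest-first. The paper uses hierarchical variational autoencoders, so the 2x average pooling below is only a shape-level illustration of the hierarchy, not the codec's transform.

```python
import numpy as np

def build_pyramid(frame, levels=3):
    # Multiscale representations via repeated 2x average pooling; coding
    # proceeds from the coarsest scale, each scale conditioning the next.
    pyr = [frame]
    for _ in range(levels - 1):
        f = pyr[-1]
        h, w = (f.shape[0] // 2) * 2, (f.shape[1] // 2) * 2
        pyr.append(f[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    return pyr
```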

[CV-108] Combining Pre- and Post-Demosaicking Noise Removal for RAW Video

链接: https://arxiv.org/abs/2410.02572
作者: Marco Sánchez-Beeckman(1),Antoni Buades(1),Nicola Brandonisio(2),Bilel Kanoun(2) ((1) IAC3 & Departament de Matemàtiques i Informàtica, Universitat de les Illes Balears, (2) Huawei Technologies France)
关键词-EN: converts data captured, processing pipeline, Bayer-patterned CFA video, converts data, data captured
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 pages, 9 figures

点击查看摘要

Abstract:Denoising is one of the fundamental steps of the processing pipeline that converts data captured by a camera sensor into a display-ready image or video. It is generally performed early in the pipeline, usually before demosaicking, although studies swapping their order or even conducting them jointly have been proposed. With the advent of deep learning, the quality of denoising algorithms has steadily increased. Even so, modern neural networks still have a hard time adapting to new noise levels and scenes, which is indispensable for real-world applications. With those in mind, we propose a self-similarity-based denoising scheme that weights both a pre- and a post-demosaicking denoiser for Bayer-patterned CFA video data. We show that a balance between the two leads to better image quality, and we empirically find that higher noise levels benefit from a higher influence pre-demosaicking. We also integrate temporal trajectory prefiltering steps before each denoiser, which further improve texture reconstruction. The proposed method only requires an estimation of the noise model at the sensor, accurately adapts to any noise level, and is competitive with the state of the art, making it suitable for real-world videography.
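The weighting of the pre- and post-demosaicking denoisers can be sketched as a noise-dependent blend. The linear ramp below is an assumption; the paper only reports empirically that higher noise levels benefit from more pre-demosaicking influence.

```python
import numpy as np

def fuse_denoised(pre_result, post_result, sigma, sigma_max=50.0):
    # Blend the two denoising paths; higher noise shifts weight toward
    # the pre-demosaicking result, consistent with the empirical finding.
    w_pre = min(sigma / sigma_max, 1.0)
    return w_pre * pre_result + (1.0 - w_pre) * post_result
```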

[CV-109] NestedMorph: Enhancing Deformable Medical Image Registration with Nested Attention Mechanisms WACV

链接: https://arxiv.org/abs/2410.02550
作者: Gurucharan Marthi Krishna Kumar,Janine Mendola,Amir Shmuel
关键词-EN: varying anatomical structures, Nested Attention Fusion, precise spatial correspondence, Attention Fusion approach, allowing for precise
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Submitted to IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025

点击查看摘要

Abstract:Deformable image registration is crucial for aligning medical images in a non-linear fashion across different modalities, allowing for precise spatial correspondence between varying anatomical structures. This paper presents NestedMorph, a novel network utilizing a Nested Attention Fusion approach to improve intra-subject deformable registration between T1-weighted (T1w) MRI and diffusion MRI (dMRI) data. NestedMorph integrates high-resolution spatial details from an encoder with semantic information from a decoder using a multi-scale framework, enhancing both local and global feature extraction. Our model notably outperforms existing methods, including CNN-based approaches like VoxelMorph, MIDIR, and CycleMorph, as well as Transformer-based models such as TransMorph and ViT-V-Net, and traditional techniques like NiftyReg and SyN. Evaluations on the HCP dataset demonstrate that NestedMorph achieves superior performance across key metrics, including SSIM, HD95, and SDlogJ, with the highest SSIM of 0.89, and the lowest HD95 of 2.5 and SDlogJ of 0.22. These results highlight NestedMorph’s ability to capture both local and global image features effectively, leading to superior registration performance. The promising outcomes of this study underscore NestedMorph’s potential to significantly advance deformable medical image registration, providing a robust framework for future research and clinical applications. The source code and our implementation are available at: this https URL

[CV-110] A Foundation Model for the Solar Dynamics Observatory

链接: https://arxiv.org/abs/2410.02530
作者: James Walsh,Daniel G. Gass,Raul Ramos Pollan,Paul J. Wright,Richard Galvez,Noah Kasmanoff,Jason Naradowsky,Anne Spalding,James Parr,Atılım Güneş Baydin
关键词-EN: Solar Dynamics Observatory, NASA Solar Dynamics, Sun complex physical, Dynamics Observatory, NASA Solar
类目: olar and Stellar Astrophysics (astro-ph.SR); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:SDO-FM is a foundation model using data from NASA’s Solar Dynamics Observatory (SDO) spacecraft; integrating three separate instruments to encapsulate the Sun’s complex physical interactions into a multi-modal embedding space. This model can be used to streamline scientific investigations involving SDO by making the enormous datasets more computationally accessible for heliophysics research and enable investigations that require instrument fusion. We discuss four key components: an ingestion pipeline to create machine learning ready datasets, the model architecture and training approach, resultant embeddings and fine-tunable models, and finally downstream fine-tuned applications. A key component of this effort has been to include subject matter specialists at each stage of development; reviewing the scientific value and providing guidance for model architecture, dataset, and training paradigm decisions. This paper marks release of our pretrained models and embedding datasets, available to the community on Hugging Face and this http URL.

[CV-111] Med-TTT: Vision Test-Time Training model for Medical Image Segmentation

链接: https://arxiv.org/abs/2410.02523
作者: Jiashu Xu
关键词-EN: treatment planning, plays a crucial, crucial role, role in clinical, clinical diagnosis
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Medical image segmentation plays a crucial role in clinical diagnosis and treatment planning. Although models based on convolutional neural networks (CNNs) and Transformers have achieved remarkable success in medical image segmentation tasks, they still face challenges such as high computational complexity and the loss of local features when capturing long-range dependencies. To address these limitations, we propose Med-TTT, a visual backbone network integrated with Test-Time Training (TTT) layers, which incorporates dynamic adjustment capabilities. Med-TTT introduces the Vision-TTT layer, which enables effective modeling of long-range dependencies with linear computational complexity and adaptive parameter adjustment during inference. Furthermore, we designed a multi-resolution fusion mechanism to combine image features at different scales, facilitating the identification of subtle lesion characteristics in complex backgrounds. At the same time, we adopt a frequency domain feature enhancement strategy based on high pass filtering, which can better capture texture and fine-grained details in images. Experimental results demonstrate that Med-TTT significantly outperforms existing methods on multiple medical image datasets, exhibiting strong segmentation capabilities, particularly in complex image backgrounds. The model achieves leading performance in terms of accuracy, sensitivity, and Dice coefficient, providing an efficient and robust solution for the field of medical image segmentation. The code is available at this https URL.
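The frequency-domain high-pass enhancement mentioned above can be sketched with a simple FFT mask. The ideal radial filter, cutoff, and gain below are assumptions for illustration; the paper does not specify its exact filter.

```python
import numpy as np

def highpass_enhance(img, cutoff=0.1, gain=1.0):
    # Zero out spectral components below a normalized radial cutoff and
    # add the remaining high-frequency detail back onto the image.
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[:h, :w]
    radius = np.sqrt(((yy - h / 2) / h) ** 2 + ((xx - w / 2) / w) ** 2)
    high = np.real(np.fft.ifft2(np.fft.ifftshift(f * (radius > cutoff))))
    return img + gain * high
```

A flat region has no high-frequency content, so the enhancement leaves it untouched; texture and edges are amplified.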

[CV-112] MedVisionLlama: Leveraging Pre-Trained Large Language Model Layers to Enhance Medical Image Segmentation WACV

链接: https://arxiv.org/abs/2410.02458
作者: Gurucharan Marthi Krishna Kumar,Aman Chadha,Janine Mendola,Amir Shmuel
关键词-EN: Large Language Models, Large Language, medical image segmentation, accurate diagnostic imaging, enhance medical image
类目: Image and Video Processing (eess.IV); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: Submitted to IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025

点击查看摘要

Abstract:Large Language Models (LLMs), known for their versatility in textual data, are increasingly being explored for their potential to enhance medical image segmentation, a crucial task for accurate diagnostic imaging. This study explores enhancing Vision Transformers (ViTs) for medical image segmentation by integrating pre-trained LLM transformer blocks. Our approach, which incorporates a frozen LLM transformer block into the encoder of a ViT-based model, leads to substantial improvements in segmentation performance across various medical imaging modalities. We propose a Hybrid Attention Mechanism that combines global and local feature learning with a Multi-Scale Fusion Block for aggregating features across different scales. The enhanced model shows significant performance gains, including an average Dice score increase from 0.74 to 0.79 and improvements in accuracy, precision, and the Jaccard Index. These results demonstrate the effectiveness of LLM-based transformers in refining medical image segmentation, highlighting their potential to significantly boost model accuracy and robustness. The source code and our implementation are available at: this https URL
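The Dice score reported above (0.74 to 0.79) is, for binary masks, twice the overlap divided by the total mask sizes. A minimal sketch of the metric:

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    # Dice coefficient: 2*|A & B| / (|A| + |B|) for binary masks;
    # eps keeps the ratio defined when both masks are empty.
    pred, target = np.asarray(pred, bool), np.asarray(target, bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
```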

[CV-113] DMC-Net: Lightweight Dynamic Multi-Scale and Multi-Resolution Convolution Network for Pancreas Segmentation in CT Images

链接: https://arxiv.org/abs/2410.02129
作者: Jin Yang,Daniel S. Marcus,Aristeidis Sotiras
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages, 4 figures

点击查看摘要

[CV-114] Posterior sampling via Langevin dynamics based on generative priors

链接: https://arxiv.org/abs/2410.02078
作者: Vishal Purohit,Matthew Repasky,Jianfeng Lu,Qiang Qiu,Yao Xie,Xiuyuan Cheng
关键词-EN:
类目: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

[CV-115] Semi-Supervised Contrastive VAE for Disentanglement of Digital Pathology Images

Link: https://arxiv.org/abs/2410.02012
Authors: Mahmudul Hasan,Xiaoling Hu,Shahira Abousamra,Prateek Prasanna,Joel Saltz,Chao Chen
Keywords-EN: strong prediction power, important concern, strong prediction, remains an important, deep learning models
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Despite the strong prediction power of deep learning models, their interpretability remains an important concern. Disentanglement models increase interpretability by decomposing the latent space into interpretable subspaces. In this paper, we propose the first disentanglement method for pathology images. We focus on the task of detecting tumor-infiltrating lymphocytes (TIL). We propose different ideas including cascading disentanglement, novel architecture, and reconstruction branches. We achieve superior performance on complex pathology images, thus improving the interpretability and even generalization power of TIL detection deep learning models. Our codes are available at this https URL.

[CV-116] MONICA: Benchmarking on Long-tailed Medical Image Classification

Link: https://arxiv.org/abs/2410.02010
Authors: Lie Ju,Siyuan Yan,Yukun Zhou,Yang Nan,Xiaodan Xing,Peibo Duan,Zongyuan Ge
Keywords-EN:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

[CV-117] A Novel Feature Extraction Model for the Detection of Plant Disease from Leaf Images in Low Computational Devices

Link: https://arxiv.org/abs/2410.01854
Authors: Rikathi Pal,Anik Basu Bhaumik,Arpan Murmu,Sanoar Hossain,Biswajit Maity,Soumya Sen
Keywords-EN:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments: 10 Pages, 8 figures, 1 table

Click to view abstract

[CV-118] Image-to-Image Translation Based on Deep Generative Modeling for Radiotherapy Synthetic Dataset Creation

Link: https://arxiv.org/abs/2410.01828
Authors: Olga Glazunova,Cecile J.A. Wolfs,Frank Verhaegen
Keywords-EN: Portal Imaging Device, Electronic Portal Imaging, Imaging Device, Electronic Portal, Portal Imaging
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Objective: Radiotherapy uses precise doses of radiation to treat cancer, requiring accurate verification, e.g. using the Electronic Portal Imaging Device (EPID), to guide treatment. To develop an effective artificial intelligence (AI) model for error detection and treatment verification, a large and well-annotated dataset of EPID images is needed, however, acquiring such high quality real data is difficult. While synthetic EPID data could be a viable alternative, it is critical to ensure that this data is as realistic as possible to effectively train an accurate and reliable AI model. The measurement uncertainty that is not modeled in EPID predictions but is present on real measured EPID images can hinder downstream tasks such as error detection and classification. Our research aims to improve synthetic EPID data through image-to-image (I2I) translation based on deep generative modeling. Approach: A dataset of 989 predicted EPID images and corresponding measured EPID images was used. We evaluate both paired and unpaired generative modeling approaches for this task. For the former, we introduce a novel modification of Variational Autoencoder (VAE) to I2I, a method that, to the best of our knowledge, has not been previously explored for this task. For the latter, we use UNsupervised Image-to-Image Translation Networks (UNIT). Results: Our results show that both models achieved some degree of I2I translation, with our novel modification of the VAE model outperforming the UNIT model in improving key metrics (mean absolute error: 4.1 cGy vs 6.4 cGy; relative dose difference in-field: 2.5% vs 5.5%; absolute dose difference in-field: 5.3 cGy vs 10.8 cGy). Significance: This enhanced synthetic data is expected to improve downstream tasks such as training neural networks for automated error detection and error classification in radiotherapy.

Machine Learning

[LG-0] Flash-Splat: 3D Reflection Removal with Flash Cues and Gaussian Splats

Link: https://arxiv.org/abs/2410.02764
Authors: Mingyang Xie,Haoming Cai,Sachin Shah,Yiran Xu,Brandon Y. Feng,Jia-Bin Huang,Christopher A. Metzler
Keywords-EN: no-flash reflection separation, introduce a simple, simple yet effective, effective approach, approach for separating
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*Comments:

Click to view abstract

Abstract:We introduce a simple yet effective approach for separating transmitted and reflected light. Our key insight is that the powerful novel view synthesis capabilities provided by modern inverse rendering methods (e.g., 3D Gaussian splatting) allow one to perform flash/no-flash reflection separation using unpaired measurements – this relaxation dramatically simplifies image acquisition over conventional paired flash/no-flash reflection separation methods. Through extensive real-world experiments, we demonstrate our method, Flash-Splat, accurately reconstructs both transmitted and reflected scenes in 3D. Our method outperforms existing 3D reflection separation methods, which do not leverage illumination control, by a large margin. Our project webpage is at this https URL.

[LG-1] Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos

Link: https://arxiv.org/abs/2410.02763
Authors: Jianrui Zhang,Mu Cai,Yong Jae Lee
Keywords-EN: growing sentiment recently, key challenges related, growing sentiment, sentiment recently, recently that modern
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*Comments: Project Page: this https URL

Click to view abstract

Abstract:There has been growing sentiment recently that modern large multimodal models (LMMs) have addressed most of the key challenges related to short video comprehension. As a result, both academia and industry are gradually shifting their attention towards the more complex challenges posed by understanding long-form videos. However, is this really the case? Our studies indicate that LMMs still lack many fundamental reasoning capabilities even when dealing with short videos. We introduce Vinoground, a temporal counterfactual LMM evaluation benchmark encompassing 1000 short and natural video-caption pairs. We demonstrate that existing LMMs severely struggle to distinguish temporal differences between different actions and object transformations. For example, the best model GPT-4o only obtains ~50% on our text and video scores, showing a large gap compared to the human baseline of ~90%. All open-source multimodal models and CLIP-based models perform much worse, producing mostly random chance performance. Through this work, we shed light onto the fact that temporal reasoning in short videos is a problem yet to be fully solved. The dataset and evaluation code are available at this https URL.
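The text and video scores used by such counterfactual benchmarks can be illustrated with a small sketch: for each video-caption pair with swapped temporal order, a model earns the text score only if it ranks the correct caption higher for both videos, and the video score only if it ranks the correct video higher for both captions. The function names and the 2x2 similarity-matrix convention below are our own illustrative assumptions, not the benchmark's code.

```python
def text_score(sim):
    # sim[i][j]: model similarity between video i and caption j;
    # correct matches sit on the diagonal of the 2x2 counterfactual pair
    return sim[0][0] > sim[0][1] and sim[1][1] > sim[1][0]

def video_score(sim):
    # for each caption, the matching video must score higher
    return sim[0][0] > sim[1][0] and sim[1][1] > sim[0][1]

def group_score(sim):
    # a pair counts only when both directions are right
    return text_score(sim) and video_score(sim)
```

Under this scoring, a random guesser passes each two-way comparison half the time and so lands near 25% per score, which is why ~50% for the best model still signals weak temporal reasoning against the ~90% human baseline.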

[LG-2] Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations

Link: https://arxiv.org/abs/2410.02762
Authors: Nick Jiang,Anish Kachinthaya,Suzie Petryk,Yossi Gandelsman
Keywords-EN: size and training, persistent challenge, challenge despite advances, output probabilities, address hallucinations
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: Project page and code: this http URL

Click to view abstract

Abstract:We investigate the internal representations of vision-language models (VLMs) to address hallucinations, a persistent challenge despite advances in model size and training. We project VLMs’ internal image representations to their language vocabulary and observe more confident output probabilities on real objects than hallucinated objects. We additionally use these output probabilities to spatially localize real objects. Building on this approach, we introduce a knowledge erasure algorithm that removes hallucinations by linearly orthogonalizing image features with respect to hallucinated object features. We show that targeted edits to a model’s latent representations can reduce hallucinations by up to 25.7% on the COCO2014 dataset while preserving performance. Our findings demonstrate how a deeper understanding of VLMs’ latent representations can enhance reliability and enable novel capabilities, such as zero-shot segmentation.
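The "linearly orthogonalizing image features with respect to hallucinated object features" step amounts to projecting out a direction from a feature vector. A minimal pure-Python sketch of that projection (illustrative only, not the paper's code):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthogonalize(feature, hallucinated_dir):
    """Remove the component of `feature` along `hallucinated_dir`:
    f' = f - (f . v / v . v) * v, assuming v is nonzero."""
    scale = dot(feature, hallucinated_dir) / dot(hallucinated_dir, hallucinated_dir)
    return [f - scale * v for f, v in zip(feature, hallucinated_dir)]
```

After the edit, the feature carries no component along the hallucinated-object direction (its dot product with that direction is zero), while everything orthogonal to it is preserved.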

[LG-3] Erasing Conceptual Knowledge from Language Models

Link: https://arxiv.org/abs/2410.02760
Authors: Rohit Gandikota,Sheridan Feucht,Samuel Marks,David Bau
Keywords-EN: comprehensive evaluation framework, leading to incomplete, traditionally lacked, lacked a comprehensive, evaluation framework
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
*Comments: Project Page: this https URL

Click to view abstract

Abstract:Concept erasure in language models has traditionally lacked a comprehensive evaluation framework, leading to incomplete assessments of effectiveness of erasure methods. We propose an evaluation paradigm centered on three critical criteria: innocence (complete knowledge removal), seamlessness (maintaining conditional fluent generation), and specificity (preserving unrelated task performance). Our evaluation metrics naturally motivate the development of Erasure of Language Memory (ELM), a new method designed to address all three dimensions. ELM employs targeted low-rank updates to alter output distributions for erased concepts while preserving overall model capabilities including fluency when prompted for an erased concept. We demonstrate ELM’s efficacy on biosecurity, cybersecurity, and literary domain erasure tasks. Comparative analysis shows that ELM achieves superior performance across our proposed metrics, including near-random scores on erased topic assessments, generation fluency, maintained accuracy on unrelated benchmarks, and robustness under adversarial attacks. Our code, data, and trained models are available at this https URL
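The "targeted low-rank updates" mentioned above can be sketched as adding a rank-r product to a weight matrix, in the spirit of LoRA-style edits. This is a generic illustration under that assumption, not ELM's implementation:

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def low_rank_update(W, A, B, alpha=1.0):
    """W' = W + alpha * (A @ B), where A is d x r and B is r x d with
    r much smaller than d, so the edit touches few free parameters."""
    delta = matmul(A, B)
    return [[W[i][j] + alpha * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]
```

With r = 1, the update changes the full d x d matrix while only d + d numbers are trainable, which is what makes such edits cheap and targeted.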

[LG-4] Forecasting Smog Clouds With Deep Learning

Link: https://arxiv.org/abs/2410.02759
Authors: Valentijn Oldenburg,Juan Cardenas-Cartagena,Matias Valdenegro-Toro
Keywords-EN: long short-term memory, gated recurrent unit, conduct multivariate timeseries, multivariate timeseries forecasting, deep learning models
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:In this proof-of-concept study, we conduct multivariate timeseries forecasting for the concentrations of nitrogen dioxide (NO2), ozone (O3), and (fine) particulate matter (PM10 and PM2.5) with meteorological covariates between two locations using various deep learning models, with a focus on long short-term memory (LSTM) and gated recurrent unit (GRU) architectures. In particular, we propose an integrated, hierarchical model architecture inspired by air pollution dynamics and atmospheric science that employs multi-task learning and is benchmarked by unidirectional and fully-connected models. Results demonstrate that, above all, the hierarchical GRU proves itself as a competitive and efficient method for forecasting the concentration of smog-related pollutants.
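As a reminder of the gated recurrent unit at the core of such forecasters, one scalar GRU update can be sketched in pure Python. The weight dictionary is a hypothetical placeholder; real models operate on vectors with learned matrices:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(h, x, p):
    """One GRU update for scalar state h and input x.
    p holds six illustrative weights (wz, uz, wr, ur, wh, uh)."""
    z = sigmoid(p["wz"] * x + p["uz"] * h)                # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h)                # reset gate
    h_tilde = math.tanh(p["wh"] * x + p["uh"] * (r * h))  # candidate state
    return (1 - z) * h + z * h_tilde                      # gated mix
```

The new state is a convex combination of the old state and a candidate, which is what lets GRUs carry pollutant history across many timesteps without exploding.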

[LG-5] SIEVE: General Purpose Data Filtering System Matching GPT-4o Accuracy at 1% the Cost

Link: https://arxiv.org/abs/2410.02755
Authors: Jifan Zhang,Robert Nowak
Keywords-EN: Creating specialized large, Creating specialized, special purpose data, requires vast amounts, SIEVE
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Creating specialized large language models requires vast amounts of clean, special purpose data for training and fine-tuning. With only a handful of existing large-scale, domain-specific datasets, creation of new datasets is required in most applications. This requires the development of new application-specific filtering of web-scale data. Filtering with a high-performance, general-purpose LLM such as GPT-4o can be highly effective, but this is extremely expensive at web-scale. This paper proposes SIEVE, a lightweight alternative that matches GPT-4o accuracy at a fraction of the cost. SIEVE can perform up to 500 filtering operations for the cost of one GPT-4o filtering call. The key to SIEVE is a seamless integration of GPT-4o and lightweight T5 models, using active learning to fine-tune T5 in the background with a small number of calls to GPT-4o. Once trained, it performs as well as GPT-4o at a tiny fraction of the cost. We experimentally validate SIEVE on the OpenWebText dataset, using five highly customized filter tasks targeting high quality and domain-specific content. Our results demonstrate the effectiveness and efficiency of our method in curating large, high-quality datasets for language model training at a substantially lower cost (1%) than existing techniques. To further validate SIEVE, experiments show that SIEVE and GPT-4o achieve similar accuracy, with human evaluators preferring SIEVE’s filtering results to those of GPT-4o.
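The GPT-4o/T5 integration can be sketched as an active-learning loop: consult the expensive oracle only when the lightweight filter is uncertain, and train the filter on the oracle's answers. Everything below (the keyword "model", the mock oracle) is an illustrative stand-in, not SIEVE's actual components:

```python
def expensive_oracle(text):
    """Stand-in for a GPT-4o filtering call (assumed binary keep/drop)."""
    return "science" in text

class LightFilter:
    """Toy stand-in for the distilled T5 filter: keyword voting."""
    def __init__(self):
        self.keep_words, self.drop_words = set(), set()

    def confidence(self, text):
        words = set(text.split())
        return len(words & self.keep_words) - len(words & self.drop_words)

    def predict(self, text):
        return self.confidence(text) > 0

    def update(self, text, label):
        (self.keep_words if label else self.drop_words).update(text.split())

def sieve(stream, model, margin=1):
    """Label a stream, calling the expensive oracle only when the light
    model is uncertain, and training the light model on those answers."""
    labels, oracle_calls = [], 0
    for text in stream:
        if abs(model.confidence(text)) <= margin:  # uncertain: ask the oracle
            label = expensive_oracle(text)
            model.update(text, label)
            oracle_calls += 1
        else:
            label = model.predict(text)
        labels.append(label)
    return labels, oracle_calls
```

As the light model sees more oracle answers, fewer items fall inside the uncertainty margin, so the oracle-call fraction shrinks — the mechanism behind the claimed ~1% cost.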

[LG-6] ReLIC: A Recipe for 64k Steps of In-Context Reinforcement Learning for Embodied AI

Link: https://arxiv.org/abs/2410.02751
Authors: Ahmad Elawady,Gunjan Chhablani,Ram Ramrakhya,Karmesh Yadav,Dhruv Batra,Zsolt Kira,Andrew Szot
Keywords-EN: Intelligent embodied agents, integrating long histories, Intelligent embodied, quickly adapt, scenarios by integrating
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Intelligent embodied agents need to quickly adapt to new scenarios by integrating long histories of experience into decision-making. For instance, a robot in an unfamiliar house initially wouldn’t know the locations of objects needed for tasks and might perform inefficiently. However, as it gathers more experience, it should learn the layout of its environment and remember where objects are, allowing it to complete new tasks more efficiently. To enable such rapid adaptation to new tasks, we present ReLIC, a new approach for in-context reinforcement learning (RL) for embodied agents. With ReLIC, agents are capable of adapting to new environments using 64,000 steps of in-context experience with full attention while being trained through self-generated experience via RL. We achieve this by proposing a novel policy update scheme for on-policy RL called "partial updates" as well as a Sink-KV mechanism that enables effective utilization of a long observation history for embodied agents. Our method outperforms a variety of meta-RL baselines in adapting to unseen houses in an embodied multi-object navigation task. In addition, we find that ReLIC is capable of few-shot imitation learning despite never being trained with expert demonstrations. We also provide a comprehensive analysis of ReLIC, highlighting that the combination of large-scale RL training, the proposed partial updates scheme, and the Sink-KV are essential for effective in-context learning. The code for ReLIC and all our experiments is at this https URL

[LG-7] An Online Automatic Modulation Classification Scheme Based on Isolation Distributional Kernel

Link: https://arxiv.org/abs/2410.02750
Authors: Xinpeng Li,Zile Jiang,Kai Ming Ting,Ye Zhu
Keywords-EN: Automatic Modulation Classification, Automatic Modulation, Modulation Classification, non-cooperative communication networks, modern non-cooperative communication
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Automatic Modulation Classification (AMC), as a crucial technique in modern non-cooperative communication networks, plays a key role in various civil and military applications. However, existing AMC methods usually are complicated and can work in batch mode only due to their high computational complexity. This paper introduces a new online AMC scheme based on Isolation Distributional Kernel. Our method stands out in two aspects. Firstly, it is the first proposal to represent baseband signals using a distributional kernel. Secondly, it introduces a pioneering AMC technique that works well in online settings under realistic time-varying channel conditions. Through extensive experiments in online settings, we demonstrate the effectiveness of the proposed classifier. Our results indicate that the proposed approach outperforms existing baseline models, including two state-of-the-art deep learning classifiers. Moreover, it distinguishes itself as the first online classifier for AMC with linear time complexity, which marks a significant efficiency boost for real-time applications.

[LG-8] Training Language Models on Synthetic Edit Sequences Improves Code Synthesis

Link: https://arxiv.org/abs/2410.02749
Authors: Ulyana Piterbarg,Lerrel Pinto,Rob Fergus
Keywords-EN: Software engineers, code, Software, edit, data
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
*Comments:

Click to view abstract

Abstract:Software engineers mainly write code by editing existing programs. In contrast, large language models (LLMs) autoregressively synthesize programs in a single pass. One explanation for this is the scarcity of open-sourced edit data. While high-quality instruction data for code synthesis is already scarce, high-quality edit data is even scarcer. To fill this gap, we develop a synthetic data generation algorithm called LintSeq. This algorithm refactors existing code into a sequence of code edits by using a linter to procedurally sample across the error-free insertions that can be used to sequentially write programs. It outputs edit sequences as text strings consisting of consecutive program diffs. To test LintSeq, we use it to refactor a dataset of instruction + program pairs into instruction + program-diff-sequence tuples. Then, we instruction finetune a series of smaller LLMs ranging from 2.6B to 14B parameters on both the re-factored and original versions of this dataset, comparing zero-shot performance on code synthesis benchmarks. We show that during repeated sampling, edit sequence finetuned models produce more diverse programs than baselines. This results in better inference-time scaling for benchmark coverage as a function of samples, i.e. the fraction of problems “pass@k” solved by any attempt given “k” tries. For example, on HumanEval pass@50, small LLMs finetuned on synthetic edit sequences are competitive with GPT-4 and outperform models finetuned on the baseline dataset by +20% (+/-3%) in absolute score. Finally, we also pretrain our own tiny LMs for code understanding. We show that finetuning tiny models on synthetic code edits results in state-of-the-art code synthesis for the on-device model class. Our 150M parameter edit sequence LM matches or outperforms code models with twice as many parameters, both with and without repeated sampling, including Codex and AlphaCode.
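Representing a program as a sequence of consecutive diffs can be sketched with the standard library's difflib. Note that LintSeq itself samples linter-verified, error-free insertions, which this illustration does not reproduce:

```python
import difflib

def edit_sequence(states):
    """Represent successive program states as a sequence of unified
    diffs, one diff per edit step."""
    diffs = []
    for before, after in zip(states, states[1:]):
        lines = difflib.unified_diff(
            before.splitlines(), after.splitlines(), lineterm="")
        diffs.append("\n".join(lines))
    return diffs

# three snapshots of a tiny program being written
states = [
    "",
    "def add(a, b):",
    "def add(a, b):\n    return a + b",
]
```

Each element of `edit_sequence(states)` is a text string of a program diff, which is exactly the kind of target the paper finetunes on instead of whole programs.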

[LG-9] CriSPO: Multi-Aspect Critique-Suggestion-guided Automatic Prompt Optimization for Text Generation

Link: https://arxiv.org/abs/2410.02748
Authors: Han He,Qianchu Liu,Lei Xu,Chaitanya Shivade,Yi Zhang,Sundararajan Srinivasan,Katrin Kirchhoff
Keywords-EN: Large language models, Large language, generate fluent summaries, prompting techniques, domains using prompting
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Large language models (LLMs) can generate fluent summaries across domains using prompting techniques, reducing the need to train models for summarization applications. However, crafting effective prompts that guide LLMs to generate summaries with the appropriate level of detail and writing style remains a challenge. In this paper, we explore the use of salient information extracted from the source document to enhance summarization prompts. We show that adding keyphrases in prompts can improve ROUGE F1 and recall, making the generated summaries more similar to the reference and more complete. The number of keyphrases can control the precision-recall trade-off. Furthermore, our analysis reveals that incorporating phrase-level salient information is superior to word- or sentence-level. However, the impact on hallucination is not universally positive across LLMs. To conduct this analysis, we introduce Keyphrase Signal Extractor (CriSPO), a lightweight model that can be finetuned to extract salient keyphrases. By using CriSPO, we achieve consistent ROUGE improvements across datasets and open-weight and proprietary LLMs without any LLM customization. Our findings provide insights into leveraging salient information in building prompt-based summarization systems.

[LG-10] Contrastive Localized Language-Image Pre-Training

Link: https://arxiv.org/abs/2410.02746
Authors: Hong-You Chen,Zhengfeng Lai,Haotian Zhang,Xinze Wang,Marcin Eichner,Keen You,Meng Cao,Bowen Zhang,Yinfei Yang,Zhe Gan
Keywords-EN: CLIP, facilitating various applications, training vision encoders, text representations facilitating, training vision
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: Preprint

Click to view abstract

Abstract:Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations facilitating various applications. Recently, CLIP has been widely adopted as the vision backbone of multimodal large language models (MLLMs) to connect image inputs for language interactions. The success of CLIP as a vision-language foundation model relies on aligning web-crawled noisy text annotations at image levels. Nevertheless, such criteria may become insufficient for downstream tasks in need of fine-grained vision representations, especially when region-level understanding is demanding for MLLMs. In this paper, we improve the localization capability of CLIP with several advances. We propose a pre-training method called Contrastive Localized Language-Image Pre-training (CLOC) by complementing CLIP with region-text contrastive loss and modules. We formulate a new concept, promptable embeddings, of which the encoder produces image embeddings easy to transform into region representations given spatial hints. To support large-scale pre-training, we design a visually-enriched and spatially-localized captioning framework to effectively generate region-text pseudo-labels at scale. By scaling up to billions of annotated images, CLOC enables high-quality regional embeddings for image region recognition and retrieval tasks, and can be a drop-in replacement of CLIP to enhance MLLMs, especially on referring and grounding tasks.
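A region-text contrastive loss has the same InfoNCE shape as CLIP's image-level objective, just applied to region embeddings. The sketch below is that generic form in pure Python, not the paper's implementation; similarities are assumed precomputed with matching region-text pairs on the diagonal:

```python
import math

def info_nce(sim_row, positive_idx, temperature=0.07):
    """Contrastive loss for one region: -log softmax(sim/T)[positive]."""
    logits = [s / temperature for s in sim_row]
    m = max(logits)  # stabilize the log-sum-exp
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[positive_idx]

def region_text_loss(sim):
    """Average InfoNCE over regions; sim[i][j] is the similarity between
    region i and text j, with matching pairs on the diagonal."""
    return sum(info_nce(row, i) for i, row in enumerate(sim)) / len(sim)
```

Minimizing this pulls each region embedding toward its matching text and away from the other texts in the batch, which is what gives the encoder region-level (rather than only image-level) alignment.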

[LG-11] Neutral residues: revisiting adapters for model extension

Link: https://arxiv.org/abs/2410.02744
Authors: Franck Signe Talla,Herve Jegou,Edouard Grave
Keywords-EN: pretrained large language, extending a pretrained, pretrained large, original domain, large language model
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:We address the problem of extending a pretrained large language model to a new domain that was not seen at training time, like adding a language for which the original model has seen no or little training data. Popular solutions like fine-tuning or low-rank adaptation are successful at domain adaptation, but formally they do not add any extra capacity and degrade the performance in the original domain. Our paper analyzes this extension problem under three angles: data, architecture and training procedure, which are advantageously considered jointly. In particular, we improve adapters and make it possible to learn an entire new language while ensuring that the output of the neural network is almost unchanged in the original domain. For this purpose, we modify the new residual blocks in a way that leads each new residual block to output near-zeros in the original domain. This solution of neutral residues, which borrows architectural components from mixture of experts, is effective: with only 20% extra learnable weights compared to an original model trained on English, we get results that are significantly better than concurrent approaches (fine-tuning, low-rank or vanilla adapters) in terms of the trade-off between learning a new language and not forgetting English.
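The "near-zero output in the original domain" idea can be sketched with a residual adapter whose output projection is initialized to zeros, so the block is an exact identity before training. This illustrates the general mechanism, not the paper's exact block:

```python
def linear(x, W, b):
    return [sum(w * xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def neutral_residual_block(x, W_in, b_in, W_out, b_out):
    """Residual adapter: y = x + W_out @ relu(W_in @ x + b_in) + b_out.
    With W_out and b_out initialized to zeros, y == x exactly, so the
    original domain is untouched until training moves those weights."""
    hidden = [max(0.0, h) for h in linear(x, W_in, b_in)]
    return [xi + di for xi, di in zip(x, linear(hidden, W_out, b_out))]
```

During new-language training, the loss can then push W_out away from zero only where needed, while a regularizer (or gating, as in mixture-of-experts designs) keeps the block's output near zero on original-domain inputs.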

[LG-12] Grounding Large Language Models In Embodied Environment With Imperfect World Models

Link: https://arxiv.org/abs/2410.02742
Authors: Haolan Liu,Jishen Zhao
Keywords-EN: executing robotics tasks, tackling basic physical, basic physical reasoning, Grounding Large language, large language models
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
*Comments:

Click to view abstract

Abstract:Despite a widespread success in various applications, large language models (LLMs) often stumble when tackling basic physical reasoning or executing robotics tasks, due to a lack of direct experience with the physical nuances of the real world. To address these issues, we propose a Grounding Large language model with Imperfect world MOdel (GLIMO), which utilizes proxy world models such as simulators to collect and synthesize training data. GLIMO incorporates an LLM agent-based data generator to automatically create high-quality and diverse instruction datasets. The generator includes an iterative self-refining module for temporally consistent experience sampling, a diverse set of question-answering instruction seeds, and a retrieval-augmented generation module for reflecting on prior experiences. Comprehensive experiments show that our approach improves the performance of strong open-source LLMs like LLaMA-3 with performance boosts of 2.04×, 1.54×, and 1.82× across three different benchmarks, respectively. The performance is able to compete with or surpass their larger counterparts such as GPT-4.

[LG-13] Salient Information Prompting to Steer Content in Prompt-based Abstractive Summarization EMNLP2024

Link: https://arxiv.org/abs/2410.02741
Authors: Lei Xu,Mohammed Asad Karim,Saket Dingliwal,Aparna Elangovan
Keywords-EN: Large language models, Large language, generate fluent summaries, prompting techniques, domains using prompting
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: Accepted to EMNLP 2024 Industry Track

Click to view abstract

Abstract:Large language models (LLMs) can generate fluent summaries across domains using prompting techniques, reducing the need to train models for summarization applications. However, crafting effective prompts that guide LLMs to generate summaries with the appropriate level of detail and writing style remains a challenge. In this paper, we explore the use of salient information extracted from the source document to enhance summarization prompts. We show that adding keyphrases in prompts can improve ROUGE F1 and recall, making the generated summaries more similar to the reference and more complete. The number of keyphrases can control the precision-recall trade-off. Furthermore, our analysis reveals that incorporating phrase-level salient information is superior to word- or sentence-level. However, the impact on hallucination is not universally positive across LLMs. To conduct this analysis, we introduce Keyphrase Signal Extractor (SigExt), a lightweight model that can be finetuned to extract salient keyphrases. By using SigExt, we achieve consistent ROUGE improvements across datasets and open-weight and proprietary LLMs without any LLM customization. Our findings provide insights into leveraging salient information in building prompt-based summarization systems.
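Prepending extracted keyphrases to a summarization prompt can be sketched as follows. The frequency heuristic stands in for the finetuned SigExt extractor, and the stopword list and prompt wording are illustrative:

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "for", "on"}

def extract_keyphrases(document, k=3):
    """Toy salience model: the k most frequent non-stopword content words.
    The real extractor is a finetuned model, not a frequency count."""
    words = [w.strip(".,").lower() for w in document.split()]
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 3)
    return [w for w, _ in counts.most_common(k)]

def build_prompt(document, k=3):
    phrases = extract_keyphrases(document, k)
    return ("Summarize the document below. "
            f"Be sure to cover: {', '.join(phrases)}.\n\n{document}")
```

Raising k injects more keyphrases, pushing the summary toward recall; lowering it favors precision — the precision-recall control the abstract describes.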

[LG-14] Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models

Link: https://arxiv.org/abs/2410.02740
Authors: Zhengfeng Lai,Vasileios Saveris,Chen Chen,Hong-You Chen,Haotian Zhang,Bowen Zhang,Juan Lao Tebar,Wenze Hu,Zhe Gan,Peter Grasch,Meng Cao,Yinfei Yang
Keywords-EN: Recent advancements, synthetic captions, key challenges remain, captions, Short Synthetic Captions
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: CV/ML

Click to view abstract

Abstract:Recent advancements in multimodal models highlight the value of rewritten captions for improving performance, yet key challenges remain. For example, while synthetic captions often provide superior quality and image-text alignment, it is not clear whether they can fully replace AltTexts: the role of synthetic captions and their interaction with original web-crawled AltTexts in pre-training is still not well understood. Moreover, different multimodal foundation models may have unique preferences for specific caption formats, but efforts to identify the optimal captions for each model remain limited. In this work, we propose a novel, controllable, and scalable captioning pipeline designed to generate diverse caption formats tailored to various multimodal models. By examining Short Synthetic Captions (SSC) towards Dense Synthetic Captions (DSC+) as case studies, we systematically explore their effects and interactions with AltTexts across models such as CLIP, multimodal LLMs, and diffusion models. Our findings reveal that a hybrid approach that keeps both synthetic captions and AltTexts can outperform the use of synthetic captions alone, improving both alignment and performance, with each model demonstrating preferences for particular caption formats. This comprehensive analysis provides valuable insights into optimizing captioning strategies, thereby advancing the pre-training of multimodal foundation models.

[LG-15] OOD-Chameleon: Is Algorithm Selection for OOD Generalization Learnable?

Link: https://arxiv.org/abs/2410.02735
Authors: Liangze Jiang,Damien Teney
Keywords-EN: OOD generalization, OOD, challenging because distribution, specific OOD situations, OOD generalization lies
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Out-of-distribution (OOD) generalization is challenging because distribution shifts come in many forms. A multitude of learning algorithms exist and each can improve performance in specific OOD situations. We posit that much of the challenge of OOD generalization lies in choosing the right algorithm for the right dataset. However, such algorithm selection is often elusive under complex real-world shifts. In this work, we formalize the task of algorithm selection for OOD generalization and investigate whether it could be approached by learning. We propose a solution, dubbed OOD-Chameleon that treats the task as a supervised classification over candidate algorithms. We construct a dataset of datasets to learn from, which represents diverse types, magnitudes and combinations of shifts (covariate shift, label shift, spurious correlations). We train the model to predict the relative performance of algorithms given a dataset’s characteristics. This enables a priori selection of the best learning strategy, i.e. without training various models as needed with traditional model selection. Our experiments show that the adaptive selection outperforms any individual algorithm and simple selection heuristics, on unseen datasets of controllable and realistic image data. Inspecting the model shows that it learns non-trivial data/algorithms interactions, and reveals the conditions for any one algorithm to surpass another. This opens new avenues for (1) enhancing OOD generalization with existing algorithms instead of designing new ones, and (2) gaining insights into the applicability of existing algorithms with respect to datasets’ properties.
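Selecting an algorithm from a dataset's characteristics can be sketched as a predictor over past (descriptor, best algorithm) pairs. The nearest-neighbor rule and the three-feature descriptor below are toy stand-ins for the paper's learned selector:

```python
def nearest(descriptor, history):
    """Pick the algorithm whose recorded dataset descriptor is closest
    (squared Euclidean distance) to the new dataset's descriptor."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(history, key=lambda pair: dist(descriptor, pair[0]))[1]

# hypothetical descriptors: (covariate shift, label shift, spurious correlation)
history = [
    ((0.9, 0.1, 0.1), "ERM+augmentation"),
    ((0.1, 0.9, 0.1), "label-shift reweighting"),
    ((0.1, 0.1, 0.9), "groupDRO"),
]
```

The point of the sketch is the interface: the selector maps cheap dataset statistics to an algorithm choice a priori, so no candidate model has to be trained before deciding.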

[LG-16] Data Similarity-Based One-Shot Clustering for Multi-Task Hierarchical Federated Learning

Link: https://arxiv.org/abs/2410.02733
Authors: Abdulmoneam Ali,Ahmed Arafa
Keywords-EN: cluster identity estimation, hierarchical federated learning, federated learning setting, address the problem, problem of cluster
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
*Comments: To appear in Asilomar 2024

Click to view abstract

Abstract:We address the problem of cluster identity estimation in a hierarchical federated learning setting in which users work toward learning different tasks. To overcome the challenge of task heterogeneity, users need to be grouped in a way such that users with the same task are in the same group, conducting training together, while sharing the weights of feature extraction layers with the other groups. Toward that end, we propose a one-shot clustering algorithm that can effectively identify and group users based on their data similarity. This enables more efficient collaboration and sharing of a common layer representation within the federated learning system. Our proposed algorithm not only enhances the clustering process, but also overcomes challenges related to privacy concerns, communication overhead, and the need for prior knowledge about learning models or loss function behaviors. We validate our proposed algorithm using various datasets such as CIFAR-10 and Fashion MNIST, and show that it outperforms the baseline in terms of accuracy and variance reduction.
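One-shot grouping by data similarity can be sketched as a single greedy pass with cosine similarity over per-user data statistics; the threshold rule is an illustrative simplification of the paper's algorithm:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def one_shot_cluster(stats, threshold=0.9):
    """Greedy single pass: join the first cluster whose representative
    statistics are similar enough, else open a new cluster."""
    clusters = []  # list of (representative statistics, member indices)
    for i, s in enumerate(stats):
        for rep, members in clusters:
            if cosine(s, rep) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((s, [i]))
    return [members for _, members in clusters]
```

Because each user is assigned in one pass over summary statistics rather than by iterating on model updates, the grouping needs no prior knowledge of the learning models or loss behaviors, which is the property the abstract emphasizes.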

[LG-17] Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better Even Mid-Generation

链接: https://arxiv.org/abs/2410.02725
作者: Rohin Manvi,Anikait Singh,Stefano Ermon
关键词-EN: Inference-time computation, large language models, widely used technique, external reward model, powerful paradigm
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Inference-time computation is a powerful paradigm to enhance the performance of large language models (LLMs), with Best-of-N sampling being a widely used technique. However, this method is computationally expensive, requiring both (1) an external reward model and (2) the generation of multiple samples. In this work, we introduce a new generative self-evaluation scheme designed to adaptively reduce the number of generated samples while maintaining or even improving performance. We use a generative reward model formulation, allowing the LLM to predict mid-generation the probability that restarting the generation will yield a better response. These predictions are obtained without an external reward model and can be used to decide whether or not to generate more samples, prune unpromising samples early on, or to pick the best sample. This capability is very inexpensive as it involves generating a single predefined token. Trained using a dataset constructed with real unfiltered LMSYS user prompts, Llama 3.1 8B’s win rate against GPT-4 on AlpacaEval increases from 21% to 34% with 16 samples and math performance on GSM8K improves from 84% to 91%. By sampling only when the LLM determines that it is beneficial to do so and adaptively adjusting temperature annealing, we demonstrate that 74% of the improvement from using 16 samples can be achieved with only 1.2 samples on average. We further demonstrate that 50-75% of samples can be pruned early in generation with minimal degradation in performance. Overall, our methods enable more efficient and scalable compute utilization during inference for LLMs.
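The control flow behind adaptive Best-of-N can be sketched as below. Everything here is a stand-in: `generate_sample` plays the role of one LLM generation, and `p_restart_better` plays the role of the model's self-evaluated probability (obtained in the paper from a single predefined token) that restarting would yield a better response; neither is the authors' implementation.

```python
import random

def generate_sample(rng):
    """Stand-in for one LLM generation; returns (response, quality)."""
    q = rng.random()
    return f"response-{q:.3f}", q

def p_restart_better(best_quality):
    """Stand-in for the generative self-evaluation: probability that
    restarting generation would beat the current best sample."""
    return 1.0 - best_quality

def adaptive_best_of_n(max_samples=16, threshold=0.3, seed=0):
    rng = random.Random(seed)
    best, best_q, used = None, -1.0, 0
    for _ in range(max_samples):
        resp, q = generate_sample(rng)
        used += 1
        if q > best_q:
            best, best_q = resp, q
        # stop early once the model judges a better sample unlikely
        if p_restart_better(best_q) < threshold:
            break
    return best, used

best, used = adaptive_best_of_n()
```

The point of the sketch is the early exit: instead of always paying for N samples and an external reward model, sampling stops as soon as the self-evaluation falls below a threshold, which is how the paper recovers most of the 16-sample gain from ~1.2 samples on average.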

[LG-18] SynthFormer: Equivariant Pharmacophore-based Generation of Molecules for Ligand-Based Drug Design

链接: https://arxiv.org/abs/2410.02718
作者: Zygimantas Jocys,Henriette M.G. Willems,Katayoun Farrahi
关键词-EN: cost investments required, resource-intensive process, medicines to patients, Drug discovery, complex and resource-intensive
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Drug discovery is a complex and resource-intensive process, with significant time and cost investments required to bring new medicines to patients. Recent advancements in generative machine learning (ML) methods offer promising avenues to accelerate early-stage drug discovery by efficiently exploring chemical space. This paper addresses the gap between in silico generative approaches and practical in vitro methodologies, highlighting the need for their integration to optimize molecule discovery. We introduce SynthFormer, a novel ML model that utilizes a 3D equivariant encoder for pharmacophores to generate fully synthesizable molecules, constructed as synthetic trees. Unlike previous methods, SynthFormer incorporates 3D information and provides synthetic paths, enhancing its ability to produce molecules with good docking scores across various proteins. Our contributions include a new methodology for efficient chemical space exploration using 3D information, a novel architecture called Synthformer for translating 3D pharmacophore representations into molecules, and a meaningful embedding space that organizes reagents for drug discovery optimization. Synthformer generates molecules that dock well and enables effective late-stage optimization restricted by synthesis paths.

[LG-19] NETS: A Non-Equilibrium Transport Sampler

链接: https://arxiv.org/abs/2410.02711
作者: Michael S. Albergo,Eric Vanden-Eijnden
关键词-EN: Non-Equilibrium Transport Sampler, Transport Sampler, unnormalized probability distributions, Non-Equilibrium Transport, propose an algorithm
类目: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); High Energy Physics - Lattice (hep-lat)
*备注:

点击查看摘要

Abstract:We propose an algorithm, termed the Non-Equilibrium Transport Sampler (NETS), to sample from unnormalized probability distributions. NETS can be viewed as a variant of annealed importance sampling (AIS) based on Jarzynski’s equality, in which the stochastic differential equation used to perform the non-equilibrium sampling is augmented with an additional learned drift term that lowers the impact of the unbiasing weights used in AIS. We show that this drift is the minimizer of a variety of objective functions, which can all be estimated in an unbiased fashion without backpropagating through solutions of the stochastic differential equations governing the sampling. We also prove that some of these objectives control the Kullback-Leibler divergence of the estimated distribution from its target. NETS is shown to be unbiased and, in addition, has a tunable diffusion coefficient which can be adjusted post-training to maximize the effective sample size. We demonstrate the efficacy of the method on standard benchmarks, high-dimensional Gaussian mixture distributions, and a model from statistical lattice field theory, for which it surpasses the performance of related work and existing baselines.
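NETS builds on classical annealed importance sampling with Jarzynski-style weights. A minimal AIS baseline (without the paper's learned drift) on a 1-D example, assuming a geometric path between a standard normal prior and an unnormalized Gaussian target, looks like this; the specific target, step sizes, and sample counts are illustrative choices:

```python
import numpy as np

def ais_estimate_z(n=2000, K=60, seed=0):
    """Classical AIS: anneal from N(0,1) to the unnormalized target
    exp(-(x-2)^2/(2*0.25)), accumulating Jarzynski-style log-weights.
    Returns the estimated normalizing constant
    (true value: 0.5*sqrt(2*pi) ~ 1.253)."""
    rng = np.random.default_rng(seed)

    def log_prior(x):           # normalized N(0,1)
        return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

    def log_target(x):          # unnormalized N(2, 0.25)
        return -(x - 2.0) ** 2 / 0.5

    x = rng.normal(size=n)
    logw = np.zeros(n)
    betas = np.linspace(0.0, 1.0, K + 1)
    for b0, b1 in zip(betas[:-1], betas[1:]):
        # Jarzynski increment: log f_{b1}(x) - log f_{b0}(x)
        logw += (b1 - b0) * (log_target(x) - log_prior(x))

        # one Metropolis step leaving the intermediate density invariant
        def log_pi(z):
            return (1 - b1) * log_prior(z) + b1 * log_target(z)

        prop = x + 0.5 * rng.normal(size=n)
        accept = np.log(rng.uniform(size=n)) < log_pi(prop) - log_pi(x)
        x = np.where(accept, prop, x)
    return float(np.exp(logw).mean())

z_hat = ais_estimate_z()
```

The learned drift in NETS can be understood as reducing the variance of exactly these `logw` weights, so that fewer samples are wasted on low-weight trajectories.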

[LG-20] Selective Attention Improves Transformer

链接: https://arxiv.org/abs/2410.02703
作者: Yaniv Leviathan,Matan Kalman,Yossi Matias
关键词-EN: Selective Attention, Unneeded elements, attention, attention context degrade, Selective
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Unneeded elements in the attention’s context degrade performance. We introduce Selective Attention, a simple parameter-free change to the standard attention mechanism which reduces attention to unneeded elements. Selective attention improves language modeling performance in a variety of model sizes and context lengths. For example, a range of transformers trained with the language modeling objective on C4 with selective attention perform equivalently to standard transformers with ~2X more heads and parameters in their attention modules. Selective attention also allows decreasing the size of the attention’s context buffer, leading to meaningful reductions in the memory and compute requirements during inference. For example, transformers with 100M parameters trained on C4 with context sizes of 512, 1,024, and 2,048 need 16X, 25X, and 47X less memory for their attention module, respectively, when equipped with selective attention, as those without selective attention, with the same validation perplexity.
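The mechanism can be sketched in numpy as follows. This is a simplified reading of the parameter-free idea (each token votes that earlier tokens are unneeded, and those accumulated votes are subtracted from the attention logits before the softmax), not the authors' exact formulation; the selection scores `S` here are random placeholders rather than scores reused from an attention head.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def selective_attention(Q, K, V, S):
    """Causal attention where S[k, j] >= 0 encodes how strongly token k
    marks earlier token j as unneeded; the accumulated score is
    subtracted from the logits before the softmax."""
    T, d = Q.shape
    logits = Q @ K.T / np.sqrt(d)
    # token k may only mask tokens j < k; accumulate masking per query row
    penalty = np.cumsum(np.tril(S, k=-1), axis=0)
    logits = logits - penalty
    logits[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf  # causal mask
    weights = softmax(logits, axis=-1)
    return weights, weights @ V

rng = np.random.default_rng(0)
T, d = 4, 2
Q = rng.normal(size=(T, d))
K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))
S = np.abs(rng.normal(size=(T, T)))
W, out = selective_attention(Q, K, V, S)
```

Tokens whose accumulated penalty is large get near-zero attention weight everywhere, which is what makes it possible to evict them from the context buffer and realize the memory savings the abstract reports.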

[LG-21] Lie Algebra Canonicalization: Equivariant Neural Operators under arbitrary Lie Groups

链接: https://arxiv.org/abs/2410.02698
作者: Zakhar Shumaylov,Peter Zaika,James Rowbottom,Ferdia Sherry,Melanie Weber,Carola-Bibiane Schönlieb
关键词-EN: generalizable machine learning, driven recent interest, equivariant neural networks, neural networks, Physics-Informed Neural Networks
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
*备注: 40 pages; preprint

点击查看摘要

Abstract:The quest for robust and generalizable machine learning models has driven recent interest in exploiting symmetries through equivariant neural networks. In the context of PDE solvers, recent works have shown that Lie point symmetries can be a useful inductive bias for Physics-Informed Neural Networks (PINNs) through data and loss augmentation. Despite this, directly enforcing equivariance within the model architecture for these problems remains elusive. This is because many PDEs admit non-compact symmetry groups, oftentimes not studied beyond their infinitesimal generators, making them incompatible with most existing equivariant architectures. In this work, we propose Lie aLgebrA Canonicalization (LieLAC), a novel approach that exploits only the action of infinitesimal generators of the symmetry group, circumventing the need for knowledge of the full group structure. To achieve this, we address existing theoretical issues in the canonicalization literature, establishing connections with frame averaging in the case of continuous non-compact groups. Operating within the framework of canonicalization, LieLAC can easily be integrated with unconstrained pre-trained models, transforming inputs to a canonical form before feeding them into the existing model, effectively aligning the input for model inference according to allowed symmetries. LieLAC utilizes standard Lie group descent schemes, achieving equivariance in pre-trained models. Finally, we showcase LieLAC’s efficacy on tasks of invariant image classification and Lie point symmetry equivariant neural PDE solvers using pre-trained models.

[LG-22] Discovering Clues of Spoofed LM Watermarks

链接: https://arxiv.org/abs/2410.02693
作者: Thibaud Gloaguen,Nikola Jovanović,Robin Staab,Martin Vechev
关键词-EN: LLM watermarks stand, ownership of LLM-generated, attribute ownership, LLM, spoofing
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:LLM watermarks stand out as a promising way to attribute ownership of LLM-generated text. One threat to watermark credibility comes from spoofing attacks, where an unauthorized third party forges the watermark, enabling it to falsely attribute arbitrary texts to a particular LLM. While recent works have demonstrated that state-of-the-art schemes are in fact vulnerable to spoofing, they lack deeper qualitative analysis of the texts produced by spoofing methods. In this work, we for the first time reveal that there are observable differences between genuine and spoofed watermark texts. Namely, we show that regardless of their underlying approach, all current spoofing methods consistently leave observable artifacts in spoofed texts, indicative of watermark forgery. We build upon these findings to propose rigorous statistical tests that reliably reveal the presence of such artifacts, effectively discovering that a watermark was spoofed. Our experimental evaluation shows high test power across all current spoofing methods, providing insights into their fundamental limitations, and suggesting a way to mitigate this threat.

[LG-23] DailyDilemmas: Revealing Value Preferences of LLMs with Quandaries of Daily Life

链接: https://arxiv.org/abs/2410.02683
作者: Yu Ying Chiu,Liwei Jiang,Yejin Choi
关键词-EN: Moral Foundation Theory, increasingly seek guidance, increasingly seek, decision-making in daily, clear-cut and depend
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Preprint. Under Review

点击查看摘要

Abstract:As we increasingly seek guidance from LLMs for decision-making in daily life, many of these decisions are not clear-cut and depend significantly on the personal values and ethical standards of the users. We present DailyDilemmas, a dataset of 1,360 moral dilemmas encountered in everyday life. Each dilemma includes two possible actions and, for each action, the affected parties and the human values invoked. Based on these dilemmas, we consolidated a set of human values across everyday topics, e.g., interpersonal relationships, workplace, and environmental issues. We evaluated LLMs on these dilemmas to determine what action they will take and the values represented by these actions. Then, we analyzed these values through the lens of five popular theories inspired by sociology, psychology, and philosophy: the World Value Survey, Moral Foundation Theory, Maslow’s Hierarchy of Needs, Aristotle’s Virtues, and the Plutchik Wheel of Emotions. We find that LLMs are most aligned with self-expression over survival values in terms of the World Value Survey, and with care over loyalty in Moral Foundation Theory. Interestingly, we find large preference differences between models for some core values such as truthfulness: e.g., the Mixtral-8x7B model tends to neglect it by 9.7% while the GPT-4-turbo model tends to select it by 9.4%. We also study the recent guidance released by OpenAI (ModelSpec) and Anthropic (Constitutional AI) to understand how their released principles reflect their actual value prioritization when facing nuanced moral reasoning in daily-life settings. We find that end users cannot effectively steer such prioritization using system prompts.

[LG-24] Understanding and Mitigating Miscalibration in Prompt Tuning for Vision-Language Models

链接: https://arxiv.org/abs/2410.02681
作者: Shuoyuan Wang,Yixuan Li,Hongxin Wei
关键词-EN: machine learning models, real world, safe deployment, deployment of machine, machine learning
类目: Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:Confidence calibration is critical for the safe deployment of machine learning models in the real world. However, this issue in vision-language models like CLIP, particularly after fine-tuning, has not been fully addressed. In this work, we demonstrate that existing prompt tuning methods usually lead to a trade-off of calibration between base and new classes: the cross-entropy loss in CoOp causes overconfidence in new classes by increasing textual label divergence, whereas the regularization of KgCoOp maintains the confidence level but results in underconfidence in base classes due to the improved accuracy. Inspired by these observations, we introduce Dynamic Outlier Regularization (DOR) to ensure confidence calibration on both base and new classes after fine-tuning. In particular, we propose to minimize the feature deviation of novel textual labels (instead of base classes) sampled from a large vocabulary. In effect, DOR prevents the increase in textual divergence for new labels while easing restrictions on base classes. Extensive experiments demonstrate that DOR can enhance the calibration performance of current fine-tuning methods on base and new classes.

[LG-25] CulturalBench: a Robust Diverse and Challenging Benchmark on Measuring the (Lack of) Cultural Knowledge of LLMs

链接: https://arxiv.org/abs/2410.02677
作者: Yu Ying Chiu,Liwei Jiang,Bill Yuchen Lin,Chan Young Park,Shuyue Stella Li,Sahithya Ravi,Mehar Bhatia,Maria Antoniak,Yulia Tsvetkov,Vered Shwartz,Yejin Choi
关键词-EN: make large language, large language models, track our progress, effective cultural knowledge, cultural knowledge benchmarks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Preprint. Under review

点击查看摘要

Abstract:To make large language models (LLMs) more helpful across diverse cultures, it is essential to have effective cultural knowledge benchmarks to measure and track our progress. Effective benchmarks need to be robust, diverse, and challenging. We introduce CulturalBench: a set of 1,227 human-written and human-verified questions for effectively assessing LLMs’ cultural knowledge, covering 45 global regions including underrepresented ones like Bangladesh, Zimbabwe, and Peru. Questions - each verified by five independent annotators - span 17 diverse topics ranging from food preferences to greeting etiquette. We evaluate models on two setups: CulturalBench-Easy and CulturalBench-Hard, which share the same questions but ask them differently. We find that LLMs are sensitive to this difference in setup (e.g., a 27.3% difference for GPT-4o). Compared to human performance (92.6% accuracy), CulturalBench-Hard is more challenging for frontier LLMs, with the best performing model (GPT-4o) at only 61.5% and the worst (Llama3-8b) at 21.4%. Moreover, we find that LLMs often struggle with tricky questions that have multiple correct answers (e.g., What utensils do the Chinese usually use?), revealing a tendency to converge to a single answer. Our results also indicate that OpenAI GPT-4o substantially outperforms other proprietary and open source models in questions related to all but one region (Oceania). Nonetheless, all models consistently underperform on questions related to South America and the Middle East.

[LG-26] FAN: Fourier Analysis Networks

链接: https://arxiv.org/abs/2410.02675
作者: Yihong Dong,Ge Li,Yongding Tao,Xue Jiang,Kechi Zhang,Jia Li,Jing Su,Jun Zhang,Jingjing Xu
关键词-EN: remarkable success achieved, exhibit potential flaws, remarkable success, success achieved, exhibit potential
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Despite the remarkable success achieved by neural networks, particularly those represented by MLP and Transformer, we reveal that they exhibit potential flaws in the modeling and reasoning of periodicity, i.e., they tend to memorize the periodic data rather than genuinely understanding the underlying principles of periodicity. However, periodicity is a crucial trait in various forms of reasoning and generalization, underpinning predictability across natural and engineered systems through recurring patterns in observations. In this paper, we propose FAN, a novel network architecture based on Fourier Analysis, which empowers the ability to efficiently model and reason about periodic phenomena. By introducing Fourier Series, the periodicity is naturally integrated into the structure and computational processes of the neural network, thus achieving a more accurate expression and prediction of periodic patterns. As a promising substitute to multi-layer perceptron (MLP), FAN can seamlessly replace MLP in various models with fewer parameters and FLOPs. Through extensive experiments, we demonstrate the effectiveness of FAN in modeling and reasoning about periodic functions, and the superiority and generalizability of FAN across a range of real-world tasks, including symbolic formula representation, time series forecasting, and language modeling.
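The core idea, a layer whose output concatenates an explicit Fourier basis (cos/sin of a linear projection) with an ordinary nonlinear branch, can be sketched as below. Weight shapes and the tanh-approximate GELU are illustrative choices, not the paper's exact parameterization:

```python
import numpy as np

def fan_layer(x, Wp, Wg, bg):
    """FAN-style layer sketch: the periodic branch cos/sin(x @ Wp)
    bakes periodicity into the representation, while the ordinary
    branch behaves like a standard MLP sub-layer."""
    p = x @ Wp                        # periodic branch
    g = x @ Wg + bg                   # ordinary branch
    gelu = 0.5 * g * (1 + np.tanh(np.sqrt(2 / np.pi) * (g + 0.044715 * g**3)))
    return np.concatenate([np.cos(p), np.sin(p), gelu], axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))           # batch of 8, input dim 4
Wp = rng.normal(size=(4, 3))          # 3 Fourier frequencies
Wg = rng.normal(size=(4, 6))          # 6 ordinary units
bg = np.zeros(6)
out = fan_layer(x, Wp, Wg, bg)        # output dim 3 + 3 + 6 = 12
```

Because cos and sin are exactly periodic in the projection `x @ Wp`, the layer can extrapolate periodic patterns outside the training range, which a plain MLP branch cannot do; this is the property the abstract contrasts with memorization.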

[LG-27] GUD: Generation with Unified Diffusion

链接: https://arxiv.org/abs/2410.02667
作者: Mathis Gerdes,Max Welling,Miranda C. N. Cheng
关键词-EN: progressively adds noise, models transform noise, Diffusion generative models, generative models transform, progressively adds
类目: Machine Learning (cs.LG); High Energy Physics - Theory (hep-th); Machine Learning (stat.ML)
*备注: 11 pages, 8 figures

点击查看摘要

Abstract:Diffusion generative models transform noise into data by inverting a process that progressively adds noise to data samples. Inspired by concepts from the renormalization group in physics, which analyzes systems across different scales, we revisit diffusion models by exploring three key design aspects: 1) the choice of representation in which the diffusion process operates (e.g. pixel-, PCA-, Fourier-, or wavelet-basis), 2) the prior distribution that data is transformed into during diffusion (e.g. Gaussian with covariance \Sigma ), and 3) the scheduling of noise levels applied separately to different parts of the data, captured by a component-wise noise schedule. Incorporating the flexibility in these choices, we develop a unified framework for diffusion generative models with greatly enhanced design freedom. In particular, we introduce soft-conditioning models that smoothly interpolate between standard diffusion models and autoregressive models (in any basis), conceptually bridging these two approaches. Our framework opens up a wide design space which may lead to more efficient training and data generation, and paves the way to novel architectures integrating different generative approaches and generation tasks.

[LG-28] AlphaIntegrator: Transformer Action Search for Symbolic Integration Proofs

链接: https://arxiv.org/abs/2410.02666
作者: Mert Ünsal,Timon Gehr,Martin Vechev
关键词-EN: learning-based system, GPT transformer model, mathematical integration rule, mathematical integration, transformer model
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Mathematical Software (cs.MS); Symbolic Computation (cs.SC)
*备注:

点击查看摘要

Abstract:We present the first correct-by-construction learning-based system for step-by-step mathematical integration. The key idea is to learn a policy, represented by a GPT transformer model, which guides the search for the right mathematical integration rule, to be carried out by a symbolic solver. Concretely, we introduce a symbolic engine with axiomatically correct actions on mathematical expressions, as well as the first dataset for step-by-step integration. Our GPT-style transformer model, trained on this synthetic data, demonstrates strong generalization by surpassing its own data generator in accuracy and efficiency, using 50% fewer search steps. Our experimental results with SoTA LLMs also demonstrate that the standard approach of fine-tuning LLMs on a set of question-answer pairs is insufficient for solving this mathematical task. This motivates the importance of discovering creative methods for combining LLMs with symbolic reasoning engines, of which our work is an instance.

[LG-29] How to Train Long-Context Language Models (Effectively)

链接: https://arxiv.org/abs/2410.02660
作者: Tianyu Gao,Alexander Wettig,Howard Yen,Danqi Chen
关键词-EN: supervised fine-tuning, make effective, long-context, study continued training, SFT
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Our code, data, and models are available at this https URL

点击查看摘要

Abstract:We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information. We first establish a reliable evaluation protocol to guide model development – Instead of perplexity or simple needle-in-a-haystack (NIAH) tests, we use a broad set of long-context tasks, and we evaluate models after SFT with instruction data as this better reveals long-context abilities. Supported by our robust evaluations, we run thorough experiments to decide the data mix for continued pre-training, the instruction tuning dataset, and many other design choices. We find that (1) code repositories and books are excellent sources of long data, but it is crucial to combine them with high-quality short data; (2) training with a sequence length beyond the evaluation length boosts long-context performance; (3) for SFT, using only short instruction datasets yields strong performance on long-context tasks. Our final model, ProLong-8B, which is initialized from Llama-3 and trained on 40B tokens, demonstrates state-of-the-art long-context performance among similarly sized models at a length of 128K. ProLong outperforms Llama-3.1-8B-Instruct on the majority of long-context tasks despite having seen only 5% as many tokens during long-context training. Additionally, ProLong can effectively process up to 512K tokens, one of the longest context windows of publicly available LMs.

[LG-30] Scalable Simulation-free Entropic Unbalanced Optimal Transport

链接: https://arxiv.org/abs/2410.02656
作者: Jaemoo Choi,Jaewoong Choi
关键词-EN: Unbalanced Optimal Transport, transport map, Optimal Transport, Entropic Unbalanced Optimal, Transport
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 26 pages

点击查看摘要

Abstract:The Optimal Transport (OT) problem investigates a transport map that connects two distributions while minimizing a given cost function. Finding such a transport map has diverse applications in machine learning, such as generative modeling and image-to-image translation. In this paper, we introduce a scalable and simulation-free approach for solving the Entropic Unbalanced Optimal Transport (EUOT) problem. We derive the dynamical form of this EUOT problem, which is a generalization of the Schrödinger bridge (SB) problem. Based on this, we derive the dual formulation and optimality conditions of the EUOT problem from the stochastic optimal control interpretation. By leveraging these properties, we propose a simulation-free algorithm to solve EUOT, called Simulation-free EUOT (SF-EUOT). While existing SB models require expensive simulation costs during training and evaluation, our model achieves simulation-free training and one-step generation by utilizing the reciprocal property. Our model demonstrates significantly improved scalability in generative modeling and image-to-image translation tasks compared to previous SB methods.
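For orientation, the entropic OT problem underlying EUOT is classically solved on discrete measures by Sinkhorn's matrix scaling. The balanced, discrete illustration below is only context for the abstract, not the paper's method (SF-EUOT targets the unbalanced, continuous case with neural networks):

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, iters=500):
    """Balanced entropic OT between histograms a, b with cost matrix C:
    alternately rescale the Gibbs kernel so the plan's marginals match
    a (rows) and b (columns)."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # transport plan

n = 5
a = np.full(n, 1 / n)                    # uniform source histogram
b = np.full(n, 1 / n)                    # uniform target histogram
x = np.linspace(0, 1, n)
C = (x[:, None] - x[None, :]) ** 2       # squared-distance cost
P = sinkhorn(a, b, C)
```

The unbalanced variant relaxes the hard marginal constraints into penalties, which is precisely the extra freedom the EUOT formulation adds on top of this balanced picture.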

[LG-31] Deconstructing Recurrence Attention and Gating: Investigating the transferability of Transformers and Gated Recurrent Neural Networks in forecasting of dynamical systems

链接: https://arxiv.org/abs/2410.02654
作者: Hunter Heidenreich,Pantelis R. Vlachas,Petros Koumoutsakos
关键词-EN: Machine learning architectures, Machine learning, extreme weather, ranging from text, Recurrent Highway Networks
类目: Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Machine learning architectures, including transformers and recurrent neural networks (RNNs), have revolutionized forecasting in applications ranging from text processing to extreme weather. Notably, advanced network architectures tuned for applications such as natural language processing are transferable to other tasks, such as spatiotemporal forecasting. However, there is a scarcity of ablation studies to illustrate the key components that enable this forecasting accuracy. The absence of such studies, although explainable due to the associated computational cost, intensifies the belief that these models ought to be considered black boxes. In this work, we decompose the key architectural components of the most powerful neural architectures, namely gating and recurrence in RNNs, and attention mechanisms in transformers. Then, we synthesize and build novel hybrid architectures from the standard blocks, performing ablation studies to identify which mechanisms are effective for each task. The importance of considering these components as hyper-parameters that can augment the standard architectures is exhibited on various forecasting datasets, from the spatiotemporal chaotic dynamics of the multiscale Lorenz 96 system and the Kuramoto-Sivashinsky equation to standard real-world time-series benchmarks. A key finding is that neural gating and attention improve the performance of all standard RNNs in most tasks, while the addition of a notion of recurrence in transformers is detrimental. Furthermore, our study reveals that a novel, sparsely used architecture that integrates Recurrent Highway Networks with neural gating and attention mechanisms emerges as the best performing architecture in high-dimensional spatiotemporal forecasting of dynamical systems.

[LG-32] CAX: Cellular Automata Accelerated in JAX

链接: https://arxiv.org/abs/2410.02651
作者: Maxence Faldor,Antoine Cully
关键词-EN: diverse scientific disciplines, Cellular automata, Cellular Automata Accelerated, Cellular, spanning neuroscience
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Cellular automata have become a cornerstone for investigating emergence and self-organization across diverse scientific disciplines, spanning neuroscience, artificial life, and theoretical physics. However, the absence of a hardware-accelerated cellular automata library limits the exploration of new research directions, hinders collaboration, and impedes reproducibility. In this work, we introduce CAX (Cellular Automata Accelerated in JAX), a high-performance and flexible open-source library designed to accelerate cellular automata research. CAX offers cutting-edge performance and a modular design through a user-friendly interface, and can support both discrete and continuous cellular automata with any number of dimensions. We demonstrate CAX’s performance and flexibility through a wide range of benchmarks and applications. From classic models like elementary cellular automata and Conway’s Game of Life to advanced applications such as growing neural cellular automata and self-classifying MNIST digits, CAX speeds up simulations up to 2,000 times faster. Furthermore, we demonstrate CAX’s potential to accelerate research by presenting a collection of three novel cellular automata experiments, each implemented in just a few lines of code thanks to the library’s modular architecture. Notably, we show that a simple one-dimensional cellular automaton can outperform GPT-4 on the 1D-ARC challenge.
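For readers unfamiliar with the models CAX accelerates, the kernel of an elementary cellular automaton (the first class of benchmarks mentioned above) fits in a few vectorized lines. This generic numpy sketch shows the computation such a library parallelizes; it does not use the CAX API:

```python
import numpy as np

def eca_step(state, rule=110):
    """One synchronous update of a 1-D elementary cellular automaton
    with periodic boundary; `rule` is the Wolfram rule number, whose
    bits give the next state for each 3-cell neighborhood."""
    table = np.array([(rule >> i) & 1 for i in range(8)], dtype=np.uint8)
    left, right = np.roll(state, 1), np.roll(state, -1)
    idx = (left << 2) | (state << 1) | right  # neighborhood as 3-bit index
    return table[idx]

state = np.zeros(16, dtype=np.uint8)
state[8] = 1                      # single live cell
for _ in range(8):
    state = eca_step(state)
```

Every cell is updated with the same lookup, so the whole step is one gather over an integer array, exactly the kind of embarrassingly parallel update that JAX can JIT-compile and run on accelerators for the reported speedups.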

[LG-33] Immunogenicity Prediction with Dual Attention Enables Vaccine Target Selection

链接: https://arxiv.org/abs/2410.02647
作者: Song Li,Yang Tan,Song Ke,Liang Hong,Bingxin Zhou
关键词-EN: protective immune responses, trigger protective immune, finding candidate vaccines, immune responses, central topic
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Biomolecules (q-bio.BM)
*备注: 18 pages, 11 tables, 5 figures

点击查看摘要

Abstract:Immunogenicity prediction is a central topic in reverse vaccinology for finding candidate vaccines that can trigger protective immune responses. Existing approaches typically rely on highly compressed features and simple model architectures, leading to limited prediction accuracy and poor generalizability. To address these challenges, we introduce ProVaccine, a novel deep learning solution with a dual attention mechanism that integrates pre-trained latent vector representations of protein sequences and structures. We also compile the most comprehensive immunogenicity dataset to date, encompassing over 9,500 antigen sequences, structures, and immunogenicity labels from bacteria, viruses, and tumors. Extensive experiments demonstrate that ProVaccine outperforms existing methods across a wide range of evaluation metrics. Furthermore, we establish a post-hoc validation protocol to assess the practical significance of deep learning models in tackling vaccine design challenges. Our work provides an effective tool for vaccine design and sets valuable benchmarks for future research.

[LG-34] Labor Migration Modeling through Large-scale Job Query Data

链接: https://arxiv.org/abs/2410.02639
作者: Zhuoning Guo,Le Zhang,Hengshu Zhu,Weijia Zhang,Hui Xiong,Hao Liu
关键词-EN: business site selection, labor migration, commercial tasks, site selection, urban governance
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate and timely modeling of labor migration is crucial for various urban governance and commercial tasks, such as local policy-making and business site selection. However, existing studies on labor migration largely rely on limited survey data with statistical methods, which fail to deliver timely and fine-grained insights for time-varying regional trends. To this end, we propose a deep learning-based spatial-temporal labor migration analysis framework, DHG-SIL, by leveraging large-scale job query data. Specifically, we first acquire labor migration intention as a proxy of labor migration via job queries from one of the world’s largest search engines. Then, a Discrepant Homophily co-preserved Graph Convolutional Network (DH-GCN) and an interpretable temporal module are respectively proposed to capture cross-city and sequential labor migration dependencies. Besides, we introduce four interpretable variables to quantify city migration properties, which are co-optimized with city representations via tailor-designed contrastive losses. Extensive experiments on three real-world datasets demonstrate the superiority of our DHG-SIL. Notably, DHG-SIL has been deployed as a core component of a cooperative partner’s intelligent human resource system, and the system supported a series of city talent attraction reports.

[LG-35] Inverse Entropic Optimal Transport Solves Semi-supervised Learning via Data Likelihood Maximization

链接: https://arxiv.org/abs/2410.02628
作者: Mikhail Persiianov,Arip Asadulaev,Nikita Andreev,Nikita Starodubcev,Dmitry Baranchuk,Anastasis Kratsios,Evgeny Burnaev,Alexander Korotin
关键词-EN: typically approached, approached via supervised, sim, data, central problem
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Learning conditional distributions \pi^*(\cdot|x) is a central problem in machine learning, which is typically approached via supervised methods with paired data (x,y) \sim \pi^* . However, acquiring paired data samples is often challenging, especially in problems such as domain translation. This necessitates the development of \textit{semi-supervised} models that utilize both limited paired data and additional unpaired i.i.d. samples x \sim \pi^*_x and y \sim \pi^*_y from the marginal distributions. The usage of such combined data is complex and often relies on heuristic approaches. To tackle this issue, we propose a new learning paradigm that integrates both paired and unpaired data \textbf{seamlessly} through data likelihood maximization techniques. We demonstrate that our approach also connects intriguingly with inverse entropic optimal transport (OT). This finding allows us to apply recent advances in computational OT to establish a \textbf{light} learning algorithm to get \pi^*(\cdot|x) . Furthermore, we demonstrate through empirical tests that our method effectively learns conditional distributions using paired and unpaired data simultaneously.

[LG-36] Diss-l-ECT: Dissecting Graph Data with local Euler Characteristic Transforms

链接: https://arxiv.org/abs/2410.02622
作者: Julius von Rohrscheidt,Bastian Rieck
关键词-EN: Euler Characteristic Transform, Characteristic Transform, Euler Characteristic, Local Euler Characteristic, efficiently-computable geometrical-topological invariant
类目: Machine Learning (cs.LG); Algebraic Topology (math.AT)
*备注:

点击查看摘要

Abstract:The Euler Characteristic Transform (ECT) is an efficiently-computable geometrical-topological invariant that characterizes the global shape of data. In this paper, we introduce the Local Euler Characteristic Transform (\ell-ECT), a novel extension of the ECT particularly designed to enhance expressivity and interpretability in graph representation learning. Unlike traditional Graph Neural Networks (GNNs), which may lose critical local details through aggregation, the \ell-ECT provides a lossless representation of local neighborhoods. This approach addresses key limitations in GNNs by preserving nuanced local structures while maintaining global interpretability. Moreover, we construct a rotation-invariant metric based on \ell-ECTs for spatial alignment of data spaces. Our method outperforms standard GNNs on a variety of node classification tasks, particularly in graphs with high heterophily.
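A minimal sketch of the underlying invariant, assuming the graph is treated as a 1-complex (so χ = |V| − |E|); note the full \ell-ECT additionally tracks this quantity across filtration directions and thresholds, which is omitted here.

```python
def local_euler_characteristic(adj, node):
    """Euler characteristic |V| - |E| of the subgraph induced by the closed
    1-hop neighborhood of `node`, viewing the graph as a 1-complex."""
    nbhd = {node} | set(adj[node])
    # Count each edge once (u < v) among neighborhood members.
    n_edges = sum(1 for u in nbhd for v in adj[u] if v in nbhd and u < v)
    return len(nbhd) - n_edges

# Path graph 0-1-2-3: every local neighborhood is a tree, so chi = 1.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
feats = [local_euler_characteristic(adj, v) for v in sorted(adj)]
```

A cycle in the neighborhood lowers χ (a triangle neighborhood gives 0), which is the kind of local structural signal that plain message-passing aggregation can wash out.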

[LG-37] Achieving Fairness in Predictive Process Analytics via Adversarial Learning

链接: https://arxiv.org/abs/2410.02618
作者: Massimiliano de Leoni,Alessandro Padella
关键词-EN: offering real-time operational, real-time operational support, Predictive business process, business process analytics, important for organizations
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 17 pages, 5 figures

点击查看摘要

Abstract:Predictive business process analytics has become important for organizations, offering real-time operational support for their processes. However, these algorithms often produce unfair predictions because they are based on biased variables (e.g., gender or nationality), namely variables embodying discrimination. This paper addresses the challenge of integrating a debiasing phase into predictive business process analytics to ensure that predictions are not influenced by biased variables. Our framework leverages adversarial debiasing and is evaluated on four case studies, showing a significant reduction in the contribution of biased variables to the predicted value. The proposed technique is also compared with the state of the art in fairness in process mining, illustrating that our framework achieves a higher level of fairness while retaining better prediction quality.

[LG-38] LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model

链接: https://arxiv.org/abs/2410.02615
作者: Duy M. H. Nguyen,Nghiem T. Diep,Trung Q. Nguyen,Hoang-Bao Le,Tai Nguyen,Tien Nguyen,TrungTin Nguyen,Nhat Ho,Pengtao Xie,Roger Wattenhofer,James Zhou,Daniel Sonntag,Mathias Niepert
关键词-EN: medical multi-modal large, multi-modal large language, leverage instruction-following data, large language models, multi-modal large
类目: Machine Learning (cs.LG)
*备注: First version

点击查看摘要

Abstract:State-of-the-art medical multi-modal large language models (med-MLLM), like LLaVA-Med or BioMedGPT, leverage instruction-following data in pre-training. However, those models primarily focus on scaling the model size and data volume to boost performance while mainly relying on the autoregressive learning objectives. Surprisingly, we reveal that such learning schemes might result in a weak alignment between vision and language modalities, making these models highly reliant on extensive pre-training datasets - a significant challenge in medical domains due to the expensive and time-consuming nature of curating high-quality instruction-following instances. We address this with LoGra-Med, a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions. This helps the model capture contextual meaning, handle linguistic variability, and build cross-modal associations between visuals and text. To scale our approach, we designed an efficient end-to-end learning scheme using black-box gradient estimation, enabling faster LLaMa 7B training. Our results show LoGra-Med matches LLaVA-Med performance on 600K image-text pairs for Medical VQA and significantly outperforms it when trained on 10% of the data. For example, on VQA-RAD, we exceed LLaVA-Med by 20.13% and nearly match the 100% pre-training score (72.52% vs. 72.64%). We also surpass SOTA methods like BiomedGPT on visual chatbots and RadFM on zero-shot image classification with VQA, highlighting the effectiveness of multi-graph alignment.

[LG-39] IndicSentEval: How Effectively do Multilingual Transformer Models encode Linguistic Properties for Indic Languages?

链接: https://arxiv.org/abs/2410.02611
作者: Akhilesh Aravapalli,Mounika Marreddy,Subba Reddy Oota,Radhika Mamidi,Manish Gupta
关键词-EN: natural language processing, Indic languages, models, revolutionized the field, field of natural
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 23 pages, 11 figures

点击查看摘要

Abstract:Transformer-based models have revolutionized the field of natural language processing. To understand why they perform so well and to assess their reliability, several studies have focused on questions such as: Which linguistic properties are encoded by these models, and to what extent? How robust are these models in encoding linguistic properties when faced with perturbations in the input text? However, these studies have mainly focused on BERT and the English language. In this paper, we investigate similar questions regarding encoding capability and robustness for 8 linguistic properties across 13 different perturbations in 6 Indic languages, using 9 multilingual Transformer models (7 universal and 2 Indic-specific). To conduct this study, we introduce a novel multilingual benchmark dataset, IndicSentEval, containing approximately 47K sentences. Surprisingly, our probing analysis of surface, syntactic, and semantic properties reveals that while almost all multilingual models demonstrate consistent encoding performance for English, they show mixed results for Indic languages. As expected, Indic-specific multilingual models capture linguistic properties in Indic languages better than universal models. Intriguingly, universal models broadly exhibit better robustness compared to Indic-specific models, particularly under perturbations such as dropping both nouns and verbs, dropping only verbs, or keeping only nouns. Overall, this study provides valuable insights into probing and perturbation-specific strengths and weaknesses of popular multilingual Transformer-based models for different Indic languages. We make our code and dataset publicly available [this https URL].

[LG-40] Beyond Expected Returns: A Policy Gradient Algorithm for Cumulative Prospect Theoretic Reinforcement Learning

链接: https://arxiv.org/abs/2410.02605
作者: Olivier Lepel,Anas Barakat
关键词-EN: behavioral economy literatures, expected utility theory, Cumulative Prospect Theory, CPT policy optimization, economy literatures
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 33 pages, 19 figures

点击查看摘要

Abstract:The widely used expected utility theory has been shown to be empirically inconsistent with human preferences in the psychology and behavioral economics literatures. Cumulative Prospect Theory (CPT) has been developed to fill this gap and provide a better model for human decision-making, supported by empirical evidence. It allows expressing a wide range of attitudes and perceptions towards risk, gains and losses. A few years ago, CPT was combined with Reinforcement Learning (RL) to formulate a CPT policy optimization problem where the goal of the agent is to search for a policy generating long-term returns aligned with their preferences. In this work, we revisit this policy optimization problem and provide new insights on optimal policies and their nature depending on the utility function under consideration. We further derive a novel policy gradient theorem for the CPT policy optimization objective, generalizing the seminal corresponding result in standard RL. This result enables us to design a model-free policy gradient algorithm to solve the CPT-RL problem. We illustrate the performance of our algorithm in simple examples motivated by traffic control and electricity management applications. We also demonstrate that our policy gradient algorithm scales better to larger state spaces compared to the existing zeroth-order algorithm for solving the same problem.
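The CPT ingredients can be sketched as follows, using the Tversky-Kahneman probability weighting function with its commonly cited γ ≈ 0.61 fit for gains and an identity utility (a simplification: the paper treats general utilities and loss-side weighting as well).

```python
import numpy as np

def cpt_weight(p, gamma=0.61):
    """Tversky-Kahneman probability weighting: small probabilities are
    overweighted, large ones underweighted (gamma=0.61 is their 1992 fit)."""
    p = np.asarray(p, dtype=float)
    return p**gamma / (p**gamma + (1.0 - p)**gamma) ** (1.0 / gamma)

def cpt_value(outcomes, probs, gamma=0.61):
    """CPT value of a discrete gain distribution with identity utility:
    outcomes are integrated against the distorted decumulative distribution."""
    order = np.argsort(outcomes)[::-1]           # best outcome first
    x, p = np.asarray(outcomes, float)[order], np.asarray(probs, float)[order]
    w = cpt_weight(np.cumsum(p), gamma)          # distorted P(X >= x_i)
    dw = np.diff(np.concatenate([[0.0], w]))     # decision weights
    return float(np.sum(dw * x))
```

With `gamma=1` the weighting is the identity and `cpt_value` reduces to the plain expectation; with `gamma < 1` rare large returns get extra weight, which is exactly the deviation from expected-return RL that the CPT objective captures.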

[LG-41] Long-Sequence Recommendation Models Need Decoupled Embeddings

链接: https://arxiv.org/abs/2410.02604
作者: Ningya Feng,Junwei Pan,Jialong Wu,Baixu Chen,Ximei Wang,Qian Li,Xian Hu,Jie Jiang,Mingsheng Long
关键词-EN: Lifelong user behavior, capturing user interests, predicting user responses, Lifelong user, user behavior sequences
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: First three authors contributed equally

点击查看摘要

Abstract:Lifelong user behavior sequences, comprising up to tens of thousands of history behaviors, are crucial for capturing user interests and predicting user responses in modern recommendation systems. A two-stage paradigm is typically adopted to handle these long sequences: a few relevant behaviors are first searched from the original long sequences via an attention mechanism in the first stage and then aggregated with the target item to construct a discriminative representation for prediction in the second stage. In this work, we identify and characterize, for the first time, a neglected deficiency in existing long-sequence recommendation models: a single set of embeddings struggles with learning both attention and representation, leading to interference between these two processes. Initial attempts to address this issue using linear projections – a technique borrowed from language processing – proved ineffective, shedding light on the unique challenges of recommendation models. To overcome this, we propose the Decoupled Attention and Representation Embeddings (DARE) model, where two distinct embedding tables are initialized and learned separately to fully decouple attention and representation. Extensive experiments and analysis demonstrate that DARE provides more accurate search of correlated behaviors and outperforms baselines with AUC gains up to 0.9% on public datasets and notable online system improvements. Furthermore, decoupling embedding spaces allows us to reduce the attention embedding dimension and accelerate the search procedure by 50% without significant performance impact, enabling more efficient, high-performance online serving.
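A schematic of the decoupling idea with hypothetical dimensions (all names and sizes here are illustrative, not the DARE implementation): attention scores are computed from a small dedicated table, while aggregation uses a separate representation table, so the two objectives no longer share parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, d_attn, d_repr = 100, 8, 32   # hypothetical vocabulary and dims

# Two independently learned tables: one for attention, one for representation.
E_attn = rng.normal(size=(n_items, d_attn))
E_repr = rng.normal(size=(n_items, d_repr))

history = rng.integers(0, n_items, size=50)  # user behavior sequence (item ids)
target = 7                                   # candidate item id

# Stage 1 (search): relevance scores use only the small attention table,
# so the attention dimension can be shrunk without touching representations.
scores = E_attn[history] @ E_attn[target]
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Stage 2 (aggregate): the user representation uses only the repr table.
user_repr = weights @ E_repr[history]
```

This separation is what lets the paper cut the attention embedding dimension (speeding up the search stage) without degrading the representation used for prediction.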

[LG-42] Agents’ Room: Narrative Generation through Multi-step Collaboration ICLR2025

链接: https://arxiv.org/abs/2410.02603
作者: Fantine Huot,Reinald Kim Amplayo,Jennimaria Palomaki,Alice Shoshana Jakobovits,Elizabeth Clark,Mirella Lapata
关键词-EN: developing interesting characters, multifaceted process combining, process combining elements, Writing compelling fiction, crafting a plot
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: Under review as a conference paper at ICLR 2025

点击查看摘要

Abstract:Writing compelling fiction is a multifaceted process combining elements such as crafting a plot, developing interesting characters, and using evocative language. While large language models (LLMs) show promise for story writing, they currently rely heavily on intricate prompting, which limits their use. We propose Agents’ Room, a generation framework inspired by narrative theory, that decomposes narrative writing into subtasks tackled by specialized agents. To illustrate our method, we introduce Tell Me A Story, a high-quality dataset of complex writing prompts and human-written stories, and a novel evaluation framework designed specifically for assessing long narratives. We show that Agents’ Room generates stories that are preferred by expert evaluators over those produced by baseline systems by leveraging collaboration and specialization to decompose the complex story writing task into tractable components. We provide extensive analysis with automated and human-based metrics of the generated output.

[LG-43] Diffusion Adversarial Schrödinger Bridges via Iterative Proportional Markovian Fitting

链接: https://arxiv.org/abs/2410.02601
作者: Sergei Kholkin,Grigoriy Ksenofontov,David Li,Nikita Kornilov,Nikita Gushchin,Evgeny Burnaev,Alexander Korotin
关键词-EN: Schrödinger Bridge, Iterative Markovian Fitting, Proportional Markovian Fitting, Iterative Proportional Markovian, Schrödinger Bridge problem
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Iterative Markovian Fitting (IMF) procedure based on iterative reciprocal and Markovian projections has recently been proposed as a powerful method for solving the Schrödinger Bridge problem. However, it has been observed that for the practical implementation of this procedure, it is crucial to alternate between fitting a forward and backward time diffusion at each iteration. Such implementation is thought to be a practical heuristic, which is required to stabilize training and obtain good results in applications such as unpaired domain translation. In our work, we show that this heuristic closely connects with the pioneering approaches for the Schrödinger Bridge based on the Iterative Proportional Fitting (IPF) procedure. Namely, we find that the practical implementation of IMF is, in fact, a combination of IMF and IPF procedures, and we call this combination the Iterative Proportional Markovian Fitting (IPMF) procedure. We show both theoretically and practically that this combined IPMF procedure can converge under more general settings, showing that the IPMF procedure opens a door towards developing a unified framework for solving Schrödinger Bridge problems.
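For intuition, classical IPF in its static discrete form is just Sinkhorn-style alternating marginal fitting on an entropic kernel; the dynamic IPMF procedure additionally alternates forward/backward diffusion fits, which this sketch omits.

```python
import numpy as np

def ipf(cost, mu, nu, eps=0.5, n_iters=300):
    """Classical Iterative Proportional Fitting (Sinkhorn) for the static
    entropic problem behind the Schrodinger Bridge: alternately rescale the
    Gibbs kernel exp(-cost/eps) so its marginals match mu and nu."""
    K = np.exp(-cost / eps)
    u = np.ones_like(mu)
    v = np.ones_like(nu)
    for _ in range(n_iters):
        v = nu / (K.T @ u)   # proportional fit of the second marginal
        u = mu / (K @ v)     # proportional fit of the first marginal
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
cost = rng.random((4, 4))
mu = np.full(4, 0.25)
nu = np.array([0.1, 0.2, 0.3, 0.4])
plan = ipf(cost, mu, nu)
```

Each half-iteration enforces one marginal exactly while perturbing the other; the paper's observation is that practical IMF implementations implicitly interleave this proportional-fitting step with Markovian projections.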

[LG-44] Three-in-One: Fast and Accurate Transducer for Hybrid-Autoregressive ASR

链接: https://arxiv.org/abs/2410.02597
作者: Hainan Xu,Travis M. Bartley,Vladimir Bataev,Boris Ginsburg
关键词-EN: Transducer, HAINAN, TDT, inference
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present \textbf{H}ybrid-\textbf{A}utoregressive \textbf{IN}ference Tr\textbf{AN}sducers (HAINAN), a novel architecture for speech recognition that extends the Token-and-Duration Transducer (TDT) model. Trained with randomly masked predictor network outputs, HAINAN supports both autoregressive inference with all network components and non-autoregressive inference without the predictor. Additionally, we propose a novel semi-autoregressive inference paradigm that first generates an initial hypothesis using non-autoregressive inference, followed by refinement steps where each token prediction is regenerated using parallelized autoregression on the initial hypothesis. Experiments on multiple datasets across different languages demonstrate that HAINAN achieves efficiency parity with CTC in non-autoregressive mode and with TDT in autoregressive mode. In terms of accuracy, autoregressive HAINAN outperforms TDT and RNN-T, while non-autoregressive HAINAN significantly outperforms CTC. Semi-autoregressive inference further enhances the model’s accuracy with minimal computational overhead, and even outperforms TDT results in some cases. These results highlight HAINAN’s flexibility in balancing accuracy and speed, positioning it as a strong candidate for real-world speech recognition applications.

[LG-45] Beyond Squared Error: Exploring Loss Design for Enhanced Training of Generative Flow Networks

链接: https://arxiv.org/abs/2410.02596
作者: Rui Hu,Yifan Zhang,Zhuoran Li,Longbo Huang
关键词-EN: generative models designed, Generative Flow Networks, attracting great research, great research interest, generative models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Generative Flow Networks (GFlowNets) are a novel class of generative models designed to sample from unnormalized distributions and have found applications in various important tasks, attracting great research interest in their training algorithms. In general, GFlowNets are trained by fitting the forward flow to the backward flow on sampled training objects. Prior work focused on the choice of training objects, parameterizations, sampling and resampling strategies, and backward policies, aiming to enhance credit assignment, exploration, or exploitation of the training process. However, the choice of regression loss, which can highly influence the exploration and exploitation behavior of the under-training policy, has been overlooked. Due to the lack of theoretical understanding for choosing an appropriate regression loss, most existing algorithms train the flow network by minimizing the squared error of the forward and backward flows in log-space, i.e., using the quadratic regression loss. In this work, we rigorously prove that distinct regression losses correspond to specific divergence measures, enabling us to design and analyze regression losses according to the desired properties of the corresponding divergence measures. Specifically, we examine two key properties: zero-forcing and zero-avoiding, where the former promotes exploitation and higher rewards, and the latter encourages exploration and enhances diversity. Based on our theoretical framework, we propose three novel regression losses, namely, Shifted-Cosh, Linex(1/2), and Linex(1). We evaluate them across three benchmarks: hyper-grid, bit-sequence generation, and molecule generation. Our proposed losses are compatible with most existing training algorithms and significantly improve convergence speed, sample diversity, and robustness.
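The flavor of the proposed losses can be sketched on the log-flow residual x; the functional forms below follow common Linex/cosh parameterizations and are only indicative of the paper's exact definitions.

```python
import numpy as np

def quadratic(x):
    """Standard squared-error loss on the log-flow residual x."""
    return 0.5 * x**2

def linex(x, a=1.0):
    """Linex-style loss: asymmetric in x, so over- and under-estimation of
    the flow are penalized differently; `a` controls the asymmetry."""
    return (np.exp(a * x) - a * x - 1.0) / a**2

def shifted_cosh(x):
    """Shifted-cosh-style loss: symmetric but with exponential tails,
    penalizing large residuals far more than the quadratic does."""
    return np.cosh(x) - 1.0
```

All three agree to second order near zero (≈ x²/2); they differ in their tails and asymmetry, which is what controls the zero-forcing versus zero-avoiding behavior discussed in the abstract.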

[LG-46] IC3M: In-Car Multimodal Multi-object Monitoring for Abnormal Status of Both Driver and Passengers

链接: https://arxiv.org/abs/2410.02592
作者: Zihan Fang,Zheng Lin,Senkang Hu,Hangcheng Cao,Yiqin Deng,Xianhao Chen,Yuguang Fang
关键词-EN: prevent traffic accidents, providing timely alerts, detecting early-stage abnormal, early-stage abnormal status, abnormal status
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 16 pages, 17 figures

点击查看摘要

Abstract:Recently, in-car monitoring has emerged as a promising technology for detecting early-stage abnormal status of the driver and providing timely alerts to prevent traffic accidents. Although training models with multimodal data enhances the reliability of abnormal status detection, the scarcity of labeled data and the imbalance of class distribution impede the extraction of critical abnormal state features, significantly deteriorating training performance. Furthermore, missing modalities due to environment and hardware limitations further exacerbate the challenge of abnormal status identification. More importantly, monitoring abnormal health conditions of passengers, particularly in elderly care, is of paramount importance but remains underexplored. To address these challenges, we introduce our IC3M, an efficient camera-rotation-based multimodal framework for monitoring both driver and passengers in a car. Our IC3M comprises two key modules: an adaptive threshold pseudo-labeling strategy and a missing modality reconstruction. The former customizes pseudo-labeling thresholds for different classes based on the class distribution, generating class-balanced pseudo labels to guide model training effectively, while the latter leverages cross-modality relationships learned from limited labels to accurately recover missing modalities by transferring distributions from available modalities. Extensive experimental results demonstrate that IC3M outperforms state-of-the-art benchmarks in accuracy, precision, and recall while exhibiting superior robustness under limited labeled data and severe missing modality.

[LG-47] Boosting Sample Efficiency and Generalization in Multi-agent Reinforcement Learning via Equivariance NEURIPS2024

链接: https://arxiv.org/abs/2410.02581
作者: Joshua McClellan,Naveed Haghani,John Winder,Furong Huang,Pratap Tokekar
关键词-EN: Multi-Agent Reinforcement Learning, Graph Neural Networks, Equivariant Graph Neural, Reinforcement Learning, neural networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: accepted as a poster at NeurIPS 2024

点击查看摘要

Abstract:Multi-Agent Reinforcement Learning (MARL) struggles with sample inefficiency and poor generalization [1]. These challenges are partially due to a lack of structure or inductive bias in the neural networks typically used in learning the policy. One such form of structure that is commonly observed in multi-agent scenarios is symmetry. The field of Geometric Deep Learning has developed Equivariant Graph Neural Networks (EGNN) that are equivariant (or symmetric) to rotations, translations, and reflections of nodes. Incorporating equivariance has been shown to improve learning efficiency and decrease error [2]. In this paper, we demonstrate that EGNNs improve the sample efficiency and generalization in MARL. However, we also show that a naive application of EGNNs to MARL results in poor early exploration due to a bias in the EGNN structure. To mitigate this bias, we present Exploration-enhanced Equivariant Graph Neural Networks or E2GN2. We compare E2GN2 to other common function approximators using common MARL benchmarks MPE and SMACv2. E2GN2 demonstrates a significant improvement in sample efficiency, greater final reward convergence, and a 2x-5x gain over standard GNNs in our generalization tests. These results pave the way for more reliable and effective solutions in complex multi-agent systems.

[LG-48] Deep Learning-Based Prediction of Suspension Dynamics Performance in Multi-Axle Vehicles

链接: https://arxiv.org/abs/2410.02566
作者: Kai Chun Lin,Bo-Yi Lin
关键词-EN: deep learning-based framework, Network Deep Neural, Deep Belief Network, Belief Network Deep, Deep Neural Network
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:This paper presents a deep learning-based framework for predicting the dynamic performance of suspension systems in multi-axle vehicles, emphasizing the integration of machine learning with traditional vehicle dynamics modeling. A Multi-Task Deep Belief Network Deep Neural Network (MTL-DBN-DNN) was developed to capture the relationships between key vehicle parameters and suspension performance metrics. The model was trained on data generated from numerical simulations and demonstrated superior prediction accuracy compared to conventional DNN models. A comprehensive sensitivity analysis was conducted to assess the impact of various vehicle and suspension parameters on dynamic suspension performance. Additionally, the Suspension Dynamic Performance Index (SDPI) was introduced as a holistic measure to quantify overall suspension performance, accounting for the combined effects of multiple parameters. The findings highlight the effectiveness of multitask learning in improving predictive models for complex vehicle systems.

[LG-49] ColaCare: Enhancing Electronic Health Record Modeling through Large Language Model-Driven Multi-Agent Collaboration

链接: https://arxiv.org/abs/2410.02551
作者: Zixiang Wang,Yinghao Zhu,Huiya Zhao,Xiaochen Zheng,Tianlong Wang,Wen Tang,Yasha Wang,Chengwei Pan,Ewen M. Harrison,Junyi Gao,Liantao Ma
关键词-EN: Electronic Health Record, enhances Electronic Health, Large Language Models, Health Record, Electronic Health
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:We introduce ColaCare, a framework that enhances Electronic Health Record (EHR) modeling through multi-agent collaboration driven by Large Language Models (LLMs). Our approach seamlessly integrates domain-specific expert models with LLMs to bridge the gap between structured EHR data and text-based reasoning. Inspired by clinical consultations, ColaCare employs two types of agents: DoctorAgent and MetaAgent, which collaboratively analyze patient data. Expert models process and generate predictions from numerical EHR data, while LLM agents produce reasoning references and decision-making reports within the collaborative consultation framework. We additionally incorporate the Merck Manual of Diagnosis and Therapy (MSD) medical guideline within a retrieval-augmented generation (RAG) module for authoritative evidence support. Extensive experiments conducted on four distinct EHR datasets demonstrate ColaCare’s superior performance in mortality prediction tasks, underscoring its potential to revolutionize clinical decision support systems and advance personalized precision medicine. The code, complete prompt templates, more case studies, etc. are publicly available at the anonymous link: this https URL.

[LG-50] Diffusion Models are Evolutionary Algorithms

链接: https://arxiv.org/abs/2410.02543
作者: Yanbo Zhang,Benedikt Hartl,Hananel Hazan,Michael Levin
关键词-EN: diffusion models, Diffusion Evolution, diffusion, latent space diffusion, Space Diffusion Evolution
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In a convergence of machine learning and biology, we reveal that diffusion models are evolutionary algorithms. By considering evolution as a denoising process and reversed evolution as diffusion, we mathematically demonstrate that diffusion models inherently perform evolutionary algorithms, naturally encompassing selection, mutation, and reproductive isolation. Building on this equivalence, we propose the Diffusion Evolution method: an evolutionary algorithm utilizing iterative denoising – as originally introduced in the context of diffusion models – to heuristically refine solutions in parameter spaces. Unlike traditional approaches, Diffusion Evolution efficiently identifies multiple optimal solutions and outperforms prominent mainstream evolutionary algorithms. Furthermore, leveraging advanced concepts from diffusion models, namely latent space diffusion and accelerated sampling, we introduce Latent Space Diffusion Evolution, which finds solutions for evolutionary tasks in high-dimensional complex parameter space while significantly reducing computational steps. This parallel between diffusion and evolution not only bridges two different fields but also opens new avenues for mutual enhancement, raising questions about open-ended evolution and potentially utilizing non-Gaussian or discrete diffusion models in the context of Diffusion Evolution.
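A toy version of the denoising-as-evolution loop (hypothetical hyperparameters; the paper derives its method from actual diffusion samplers): individuals contract toward a fitness-weighted "denoised" mean, while an annealed noise term plays the role of mutation.

```python
import numpy as np

def diffusion_evolution(fitness, dim=2, pop=64, steps=100, seed=0):
    """Toy denoising-as-evolution loop: each step contracts the population
    toward a fitness-weighted ('denoised') estimate and adds annealed noise."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(pop, dim)) * 3.0        # initial 'pure noise' population
    for t in range(steps):
        alpha = 1.0 - t / steps                  # noise schedule, anneals to 0
        f = fitness(x)
        w = np.exp(f - f.max())                  # softmax-style selection weights
        w /= w.sum()
        denoised = w @ x                         # fitness-weighted mean estimate
        x = x + 0.5 * (denoised - x) + 0.3 * alpha * rng.normal(size=x.shape)
    return x

# Maximize a simple concave fitness: negative squared distance to a target.
target = np.array([1.0, -2.0])
final = diffusion_evolution(lambda x: -np.sum((x - target) ** 2, axis=1))
```

The fitness-weighted averaging acts as selection, the noise as mutation, and the annealing schedule as the reverse-diffusion time, mirroring the correspondence the abstract describes.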

[LG-51] Fair Decentralized Learning

链接: https://arxiv.org/abs/2410.02541
作者: Sayan Biswas,Anne-Marie Kermarrec,Rishi Sharma,Thibaud Trinca,Martijn de Vos
关键词-EN: sharing raw data, machine learning model, Facade, machine learning
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Decentralized learning (DL) is an emerging approach that enables nodes to collaboratively train a machine learning model without sharing raw data. In many application domains, such as healthcare, this approach faces challenges due to the high level of heterogeneity in the training data’s feature space. Such feature heterogeneity lowers model utility and negatively impacts fairness, particularly for nodes with under-represented training data. In this paper, we introduce \textsc{Facade}, a clustering-based DL algorithm specifically designed for fair model training when the training data exhibits several distinct features. The challenge of \textsc{Facade} is to assign nodes to clusters, one for each feature, based on the similarity in the features of their local data, without requiring individual nodes to know a priori which cluster they belong to. \textsc{Facade} (1) dynamically assigns nodes to their appropriate clusters over time, and (2) enables nodes to collaboratively train a specialized model for each cluster in a fully decentralized manner. We theoretically prove the convergence of \textsc{Facade}, implement our algorithm, and compare it against three state-of-the-art baselines. Our experimental results on three datasets demonstrate the superiority of our approach in terms of model accuracy and fairness compared to all three competitors. Compared to the best-performing baseline, \textsc{Facade} on the CIFAR-10 dataset also reduces communication costs by 32.3% to reach a target accuracy when cluster sizes are imbalanced.

[LG-52] Semantic-Guided RL for Interpretable Feature Engineering

链接: https://arxiv.org/abs/2410.02519
作者: Mohamed Bouadi,Arta Alavi,Salima Benbernou,Mourad Ouziri
关键词-EN: models strongly depends, generating high-quality features, quality of Machine, Machine Learning, Automated Feature Engineering
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: substantial text overlap with arXiv:2406.00544

点击查看摘要

Abstract:The quality of Machine Learning (ML) models strongly depends on the input data; generating high-quality features is therefore often required to improve predictive accuracy. This process is referred to as Feature Engineering (FE). However, since manual feature engineering is time-consuming and requires case-by-case domain knowledge, Automated Feature Engineering (AutoFE) is crucial. A major challenge that remains is to generate interpretable features. To tackle this problem, we introduce SMART, a hybrid approach that uses semantic technologies to guide the generation of interpretable features through a two-step process: Exploitation and Exploration. The former uses Description Logics (DL) to reason on the semantics embedded in Knowledge Graphs (KG) to infer domain-specific features, while the latter exploits the knowledge graph to conduct a guided exploration of the search space through Deep Reinforcement Learning (DRL). Our experiments on public datasets demonstrate that SMART significantly improves prediction accuracy while ensuring a high level of interpretability.

[LG-53] Learning Emergence of Interaction Patterns across Independent RL Agents in Multi-Agent Environments

链接: https://arxiv.org/abs/2410.02516
作者: Vasanth Reddy Baddam,Suat Gumussoy,Almuatazbellah Boker,Hoda Eldardiry
关键词-EN: real-world problems, naturally lend, controlling swarms, swarms of drones, drones and urban
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
*备注: 13 pages, 24 figures

点击查看摘要

Abstract:Many real-world problems, such as controlling swarms of drones and urban traffic, naturally lend themselves to modeling as multi-agent reinforcement learning (RL) problems. However, existing multi-agent RL methods often suffer from scalability challenges, primarily due to the introduction of communication among agents. Consequently, a key challenge lies in adapting the success of deep learning in single-agent RL to the multi-agent setting. In response to this challenge, we propose an approach that fundamentally reimagines multi-agent environments. Unlike conventional methods that model each agent individually with separate networks, our approach, the Bottom Up Network (BUN), adopts a unique perspective. BUN treats the collective of multi-agents as a unified entity while employing a specialized weight initialization strategy that promotes independent learning. Furthermore, we dynamically establish connections among agents using gradient information, enabling coordination when necessary while maintaining these connections as limited and sparse to effectively manage the computational budget. Our extensive empirical evaluations across a variety of cooperative multi-agent scenarios, including tasks such as cooperative navigation and traffic control, consistently demonstrate BUN’s superiority over baseline methods with substantially reduced computational costs.

[LG-54] Minimax Group Fairness in Strategic Classification

Link: https://arxiv.org/abs/2410.02513
Authors: Emily Diana,Saeed Sharifi-Malvajerdi,Ali Vakilian
Keywords-EN: positive classification outcome, learner, positive classification, classification outcome, strategic classification
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:In strategic classification, agents manipulate their features, at a cost, to receive a positive classification outcome from the learner’s classifier. The goal of the learner in such settings is to learn a classifier that is robust to strategic manipulations. While the majority of works in this domain consider accuracy as the primary objective of the learner, in this work, we consider learning objectives that have group fairness guarantees in addition to accuracy guarantees. We work with the minimax group fairness notion that asks for minimizing the maximal group error rate across population groups. We formalize a fairness-aware Stackelberg game between a population of agents consisting of several groups, with each group having its own cost function, and a learner in the agnostic PAC setting in which the learner is working with a hypothesis class H. When the cost functions of the agents are separable, we show the existence of an efficient algorithm that finds an approximately optimal deterministic classifier for the learner when the number of groups is small. This algorithm remains efficient, both statistically and computationally, even when H is the set of all classifiers. We then consider cost functions that are not necessarily separable and show the existence of oracle-efficient algorithms that find approximately optimal randomized classifiers for the learner when H has finite strategic VC dimension. These algorithms work under the assumption that the learner is fully transparent: the learner draws a classifier from its distribution (randomized classifier) before the agents respond by manipulating their feature vectors. We highlight the effectiveness of such transparency in developing oracle-efficient algorithms. We conclude with verifying the efficacy of our algorithms on real data by conducting an experimental analysis. 
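
To make the minimax objective concrete: the learner seeks a classifier that minimizes the maximal group error rate. The sketch below computes that quantity with numpy; it is an illustration of the fairness notion only, not the paper's Stackelberg algorithm, and all names are hypothetical.

```python
import numpy as np

def group_error_rates(y_true, y_pred, groups):
    """Per-group 0-1 error rate; `groups` holds a group id per example."""
    return {g: float(np.mean(y_pred[groups == g] != y_true[groups == g]))
            for g in np.unique(groups)}

def minimax_group_error(y_true, y_pred, groups):
    """The quantity minimax group fairness asks the learner to minimize:
    the worst (maximal) error rate across population groups."""
    return max(group_error_rates(y_true, y_pred, groups).values())

# Toy example: the classifier errs once on group 0 and never on group 1.
y_true = np.array([0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 0])
groups = np.array([0, 0, 0, 1, 1, 1])
print(minimax_group_error(y_true, y_pred, groups))  # 1/3
```

A classifier with a low average error can still score badly here if its mistakes concentrate on one group, which is exactly what the minimax criterion penalizes.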

[LG-55] SAFLEX: Self-Adaptive Augmentation via Feature Label Extrapolation ICLR2024

Link: https://arxiv.org/abs/2410.02512
Authors: Mucong Ding,Bang An,Yuancheng Xu,Anirudh Satheesh,Furong Huang
Keywords-EN: scarce labeled data, enhancing model performance, augmentation, crucial in enhancing, scarce labeled
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: ICLR 2024

Abstract:Data augmentation, a cornerstone technique in deep learning, is crucial in enhancing model performance, especially with scarce labeled data. While traditional techniques are effective, their reliance on hand-crafted methods limits their applicability across diverse data types and tasks. Although modern learnable augmentation methods offer increased adaptability, they are computationally expensive and challenging to incorporate within prevalent augmentation workflows. In this work, we present a novel, efficient method for data augmentation, effectively bridging the gap between existing augmentation strategies and emerging datasets and learning tasks. We introduce SAFLEX (Self-Adaptive Augmentation via Feature Label EXtrapolation), which learns the sample weights and soft labels of augmented samples provided by any given upstream augmentation pipeline, using a specifically designed efficient bilevel optimization algorithm. Remarkably, SAFLEX effectively reduces the noise and label errors of the upstream augmentation pipeline with a marginal computational cost. As a versatile module, SAFLEX excels across diverse datasets, including natural and medical images and tabular data, showcasing its prowess in few-shot learning and out-of-distribution generalization. SAFLEX seamlessly integrates with common augmentation strategies like RandAug, CutMix, and those from large pre-trained generative models like stable diffusion and is also compatible with frameworks such as CLIP’s fine-tuning. Our findings highlight the potential to adapt existing augmentation pipelines for new data types and tasks, signaling a move towards more adaptable and resilient training frameworks.
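
As a rough illustration of the kind of inner objective SAFLEX's bilevel optimization plugs into training, the sketch below weights augmented samples by learned per-sample weights and scores them against learned soft labels. The function names and the toy setup are assumptions for illustration; the actual bilevel algorithm is not shown here.

```python
import numpy as np

def soft_cross_entropy(logits, soft_labels):
    """Per-sample cross-entropy against soft labels."""
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -(soft_labels * logp).sum(axis=1)

def weighted_augmented_loss(logits_aug, soft_labels, sample_weights):
    """Inner objective of a bilevel setup: each augmented sample contributes
    with a learned weight and a learned soft label (both of which the outer
    loop would optimize in the actual method)."""
    per_sample = soft_cross_entropy(logits_aug, soft_labels)
    return float((sample_weights * per_sample).sum() / sample_weights.sum())

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))            # model outputs on 4 augmented samples
labels = np.full((4, 3), 1 / 3)             # maximally soft labels
weights = np.array([1.0, 0.5, 0.0, 1.0])    # weight 0 discards a noisy sample
print(weighted_augmented_loss(logits, labels, weights))
```

Driving a sample's weight to zero removes it from the loss entirely, which is how such a scheme can suppress noisy or mislabeled augmentations produced upstream.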

[LG-56] Cut the Crap: An Economical Communication Pipeline for LLM-based Multi-Agent Systems

Link: https://arxiv.org/abs/2410.02506
Authors: Guibin Zhang,Yanwei Yue,Zhixun Li,Sukwon Yun,Guancheng Wan,Kun Wang,Dawei Cheng,Jeffrey Xu Yu,Tianlong Chen
Keywords-EN: large language model, outperform individual capabilities, significantly outperform individual, meticulously designed inter-agent, Recent advancements
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
*Comments:

Abstract:Recent advancements in large language model (LLM)-powered agents have shown that collective intelligence can significantly outperform individual capabilities, largely attributed to the meticulously designed inter-agent communication topologies. Though impressive in performance, existing multi-agent pipelines inherently introduce substantial token overhead, as well as increased economic costs, which pose challenges for their large-scale deployments. In response to this challenge, we propose an economical, simple, and robust multi-agent communication framework, termed AgentPrune, which can seamlessly integrate into mainstream multi-agent systems and prunes redundant or even malicious communication messages. Technically, AgentPrune is the first to identify and formally define the communication redundancy issue present in current LLM-based multi-agent pipelines, and efficiently performs one-shot pruning on the spatial-temporal message-passing graph, yielding a token-economic and high-performing communication topology. Extensive experiments across six benchmarks demonstrate that AgentPrune (I) achieves comparable results as state-of-the-art topologies at merely $5.6 cost compared to their $43.7, (II) integrates seamlessly into existing multi-agent frameworks with 28.1%~72.8% token reduction, and (III) successfully defends against two types of agent-based adversarial attacks with a 3.5%~10.8% performance boost.
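
The one-shot pruning idea, keeping only the most important inter-agent edges of the message-passing graph, can be sketched as below. This is a hedged illustration with made-up edge scores; the paper's actual redundancy criterion and graph structure differ.

```python
import numpy as np

def one_shot_prune(importance, keep_ratio=0.3):
    """Keep only the top `keep_ratio` fraction of inter-agent edges by
    learned importance score; every other edge is dropped in one shot."""
    flat = importance[importance > 0]
    if flat.size == 0:
        return np.zeros_like(importance, dtype=bool)
    k = max(1, int(round(keep_ratio * flat.size)))
    threshold = np.sort(flat)[-k]   # score of the k-th strongest edge
    return importance >= threshold

# Toy 4-agent communication graph with (made-up) learned edge scores.
scores = np.array([
    [0.0, 0.9, 0.1, 0.2],
    [0.0, 0.0, 0.8, 0.05],
    [0.0, 0.0, 0.0, 0.7],
    [0.3, 0.0, 0.0, 0.0],
])
mask = one_shot_prune(scores, keep_ratio=0.3)
print(int(mask.sum()), "edges kept out of", int((scores > 0).sum()))
```

Every pruned edge is a message that is never generated, so the token savings scale directly with the fraction of edges removed.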

[LG-57] Dynamic Gradient Alignment for Online Data Mixing

Link: https://arxiv.org/abs/2410.02498
Authors: Simin Fan,David Grangier,Pierre Ablin
Keywords-EN: gradient alignment, large language models, effectively training large, training large language, data
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
*Comments:

Abstract:The composition of training data mixtures is critical for effectively training large language models (LLMs), as it directly impacts their performance on downstream tasks. Our goal is to identify an optimal data mixture to specialize an LLM for a specific task with access to only a few examples. Traditional approaches to this problem include ad-hoc reweighting methods, importance sampling, and gradient alignment techniques. This paper focuses on gradient alignment and introduces Dynamic Gradient Alignment (DGA), a scalable online gradient alignment algorithm. DGA dynamically estimates the pre-training data mixture on which the models’ gradients align as well as possible with those of the model on the specific task. DGA is the first gradient alignment approach that incurs minimal overhead compared to standard pre-training and outputs a competitive model, eliminating the need for retraining the model. Experimentally, we demonstrate significant improvements over importance sampling in two key scenarios: (i) when the pre-training set is small and importance sampling overfits due to limited data; and (ii) when there is insufficient specialized data, trapping importance sampling on narrow pockets of data. Our findings underscore the effectiveness of gradient alignment methods in optimizing training data mixtures, particularly in data-constrained environments, and offer a practical solution for enhancing LLM performance on specific tasks with limited data availability.
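
A schematic of the gradient-alignment principle behind methods like DGA: upweight pre-training domains whose gradients align with the target-task gradient. The exponentiated-gradient update rule and all names below are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def update_mixture(domain_grads, task_grad, weights, lr=1.0):
    """One gradient-alignment step for data mixing: domains whose gradient
    has high cosine similarity with the target-task gradient are upweighted
    via a multiplicative (exponentiated-gradient) rule."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    align = np.array([cos(g, task_grad) for g in domain_grads])
    new = weights * np.exp(lr * align)
    return new / new.sum()   # renormalize to a valid mixture

task_grad = np.array([1.0, 0.0])
domain_grads = [np.array([0.9, 0.1]),   # roughly aligned with the task
                np.array([-1.0, 0.0])]  # opposed to the task
w = update_mixture(domain_grads, task_grad, np.array([0.5, 0.5]))
print(w)  # most of the mass moves to the aligned domain
```

Because the update only needs per-domain gradients that training already computes, this style of reweighting adds little overhead on top of standard pre-training.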

[LG-58] Efficient learning of differential network in multi-source non-paranormal graphical models

Link: https://arxiv.org/abs/2410.02496
Authors: Mojtaba Nikahd,Seyed Abolfazl Motahari
Keywords-EN: non-paranormal graphical models, paper addresses learning, graphical models, non-paranormal graphical, paper addresses
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:This paper addresses learning of sparse structural changes, or the differential network, between two classes of non-paranormal graphical models. We assume a multi-source and heterogeneous dataset is available for each class, where the covariance matrices are identical for all non-paranormal graphical models. The differential network, which is encoded by the difference of precision matrices, can then be decoded by optimizing a lasso penalized D-trace loss function. To this aim, an efficient approach is proposed that outputs the exact solution path, outperforming previous methods that only sample from the solution path at pre-selected regularization parameters. Notably, our proposed method has low computational complexity, especially when the differential network is sparse. Our simulations on synthetic data demonstrate a superior performance for our strategy in terms of speed and accuracy compared to an existing method. Moreover, our strategy of combining datasets from multiple sources is shown to be very effective in inferring differential networks in real-world problems. This is backed by our experimental results on drug resistance in tumor cancers. In the latter case, our strategy outputs important genes for drug resistance which are already confirmed by various independent studies.

[LG-59] Stochastic variance-reduced Gaussian variational inference on the Bures-Wasserstein manifold

Link: https://arxiv.org/abs/2410.02490
Authors: Hoang Phuc Hau Luu,Hanlin Yu,Bernardo Williams,Marcelo Hartmann,Arto Klami
Keywords-EN: Wasserstein gradient flows, machine learning community, variational inference objective, variational inference, Wasserstein gradient
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:

Abstract:Optimization in the Bures-Wasserstein space has been gaining popularity in the machine learning community since it draws connections between variational inference and Wasserstein gradient flows. The variational inference objective function of Kullback-Leibler divergence can be written as the sum of the negative entropy and the potential energy, making forward-backward Euler the method of choice. Notably, the backward step admits a closed-form solution in this case, facilitating the practicality of the scheme. However, the forward step is no longer exact since the Bures-Wasserstein gradient of the potential energy involves “intractable” expectations. Recent approaches propose using the Monte Carlo method – in practice a single-sample estimator – to approximate these terms, resulting in high variance and poor performance. We propose a novel variance-reduced estimator based on the principle of control variates. We theoretically show that this estimator has a smaller variance than the Monte-Carlo estimator in scenarios of interest. We also prove that variance reduction helps improve the optimization bounds of the current analysis. We demonstrate that the proposed estimator gains order-of-magnitude improvements over the previous Bures-Wasserstein methods.
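
The control-variate principle the paper builds on is easy to demonstrate in a plain Euclidean setting: subtracting a correlated quantity with known mean leaves the estimator unbiased while shrinking its variance. A toy numpy example, unrelated to the Bures-Wasserstein specifics:

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E[exp(X)] for X ~ N(0, 1); the true value is exp(1/2).
# Control variate: g(X) = X, whose expectation under N(0, 1) is exactly 0.
x = rng.normal(size=100_000)
f = np.exp(x)
g = x

# Near-optimal coefficient c = Cov(f, g) / Var(g), estimated from samples.
c = np.cov(f, g)[0, 1] / np.var(g)
plain = f                      # plain Monte Carlo estimator samples
reduced = f - c * (g - 0.0)    # subtract c * (g - E[g]); still unbiased

print("variance without control variate:", plain.var())
print("variance with control variate:   ", reduced.var())
```

The variance drops by the factor 1 - rho^2, where rho is the correlation between f and the control variate, which is why a well-chosen variate can yield the order-of-magnitude gains the abstract reports in its own setting.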

[LG-60] Encryption-Friendly LLM Architecture

Link: https://arxiv.org/abs/2410.02486
Authors: Donghwan Rho,Taeseong Kim,Minje Park,Jung Woo Kim,Hyunsik Chae,Jung Hee Cheon,Ernest K. Ryu
Keywords-EN: Large language models, Large language, offer personalized responses, personalized responses based, user interactions
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments: 27 pages

Abstract:Large language models (LLMs) offer personalized responses based on user interactions, but this use case raises serious privacy concerns. Homomorphic encryption (HE) is a cryptographic protocol supporting arithmetic computations in encrypted states and provides a potential solution for privacy-preserving machine learning (PPML). However, the computational intensity of transformers poses challenges for applying HE to LLMs. In this work, we propose a modified HE-friendly transformer architecture with an emphasis on inference following personalized (private) fine-tuning. Utilizing LoRA fine-tuning and Gaussian kernels, we achieve significant computational speedups – 6.94x for fine-tuning and 2.3x for inference – while maintaining performance comparable to plaintext models. Our findings provide a viable proof of concept for offering privacy-preserving LLM services in areas where data protection is crucial.
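
For intuition on the kernel substitution mentioned above, here is a plain-numpy sketch of attention weights computed from a Gaussian kernel of the query-key distance rather than a dot-product softmax. This is a hypothetical illustration only; the paper's actual encryption-friendly architecture involves further modifications (e.g., polynomial approximations under homomorphic encryption).

```python
import numpy as np

def gaussian_kernel_attention(Q, K, V, sigma=1.0):
    """Attention where weights come from a Gaussian kernel of the query-key
    distance, exp(-||q - k||^2 / (2 sigma^2)), normalized per query, in
    place of the usual dot-product softmax."""
    d2 = ((Q[:, None, :] - K[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    w = np.exp(-d2 / (2 * sigma ** 2))
    w = w / w.sum(axis=1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 8))
out = gaussian_kernel_attention(Q, K, V)
print(out.shape)  # one attended value per query
```

A kernel of this form avoids the max-subtraction step of a numerically stable softmax, which is one reason kernel-style attention is friendlier to arithmetic-only encrypted computation.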

[LG-61] Cross-Embodiment Dexterous Grasping with Reinforcement Learning

Link: https://arxiv.org/abs/2410.02479
Authors: Haoqi Yuan,Bohan Zhou,Yuhui Fu,Zongqing Lu
Keywords-EN: real-world grasping tasks, complex real-world grasping, Dexterous hands, potential for complex, complex real-world
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments:

Abstract:Dexterous hands exhibit significant potential for complex real-world grasping tasks. While recent studies have primarily focused on learning policies for specific robotic hands, the development of a universal policy that controls diverse dexterous hands remains largely unexplored. In this work, we study the learning of cross-embodiment dexterous grasping policies using reinforcement learning (RL). Inspired by the capability of human hands to control various dexterous hands through teleoperation, we propose a universal action space based on the human hand’s eigengrasps. The policy outputs eigengrasp actions that are then converted into specific joint actions for each robot hand through a retargeting mapping. We simplify the robot hand’s proprioception to include only the positions of fingertips and the palm, offering a unified observation space across different robot hands. Our approach demonstrates an 80% success rate in grasping objects from the YCB dataset across four distinct embodiments using a single vision-based policy. Additionally, our policy exhibits zero-shot generalization to two previously unseen embodiments and significant improvement in efficient finetuning. For further details and videos, visit our project page this https URL.
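
The retargeting idea, one low-dimensional eigengrasp action decoded into joint angles for hands with different joint counts, can be sketched as a linear map per hand. The synthetic components below stand in for PCA over real grasp poses of each hand; all names are hypothetical.

```python
import numpy as np

class EigengraspRetarget:
    """Maps a low-dimensional eigengrasp action to one hand's joint angles:
    joints = mean_pose + coeffs @ components. In practice the components
    would come from PCA over grasp poses of that hand (synthetic here)."""
    def __init__(self, mean_pose, components):
        self.mean_pose = mean_pose        # shape (n_joints,)
        self.components = components      # shape (n_eigengrasps, n_joints)

    def __call__(self, coeffs):
        return self.mean_pose + coeffs @ self.components

rng = np.random.default_rng(0)
n_eig, joints_a, joints_b = 5, 16, 22     # two hands with different joint counts
hand_a = EigengraspRetarget(rng.normal(size=joints_a), rng.normal(size=(n_eig, joints_a)))
hand_b = EigengraspRetarget(rng.normal(size=joints_b), rng.normal(size=(n_eig, joints_b)))

action = rng.normal(size=n_eig)           # one shared, hand-agnostic action
print(hand_a(action).shape, hand_b(action).shape)
```

The policy only ever outputs the shared low-dimensional `action`; each embodiment supplies its own retargeting map, which is what makes a single cross-embodiment policy possible.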

[LG-62] Temporal Predictive Coding for Gradient Compression in Distributed Learning

Link: https://arxiv.org/abs/2410.02478
Authors: Adrian Edin,Zheng Chen,Michel Kieffer,Mikael Johansson
Keywords-EN: prediction-based gradient compression, gradient compression method, paper proposes, proposes a prediction-based, learning with event-triggered
Subjects: Information Theory (cs.IT); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Signal Processing (eess.SP)
*Comments: 8 pages, 3 figures, presented at the 60th Allerton conference on Communication, Control, and Computing

Abstract:This paper proposes a prediction-based gradient compression method for distributed learning with event-triggered communication. Our goal is to reduce the amount of information transmitted from the distributed agents to the parameter server by exploiting temporal correlation in the local gradients. We use a linear predictor that combines past gradients to form a prediction of the current gradient, with coefficients that are optimized by solving a least-square problem. In each iteration, every agent transmits the predictor coefficients to the server such that the predicted local gradient can be computed. The difference between the true local gradient and the predicted one, termed the prediction residual, is only transmitted when its norm is above some threshold. When this additional communication step is omitted, the server uses the prediction as the estimated gradient. This proposed design shows notable performance gains compared to existing methods in the literature, achieving convergence with reduced communication costs.
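
The pipeline in the abstract, least-squares predictor coefficients over past gradients with the residual transmitted only above a threshold, can be sketched directly. This is a simplified single-agent illustration, not the authors' code.

```python
import numpy as np

def predictor_coeffs(past_grads, current_grad):
    """Least-squares coefficients a minimizing ||G a - g||, where the
    columns of G are past gradients (the linear predictor)."""
    G = np.stack(past_grads, axis=1)           # shape (dim, n_past)
    a, *_ = np.linalg.lstsq(G, current_grad, rcond=None)
    return a

def transmit(past_grads, current_grad, threshold):
    """Returns (coeffs, residual-or-None): the residual is sent only when
    its norm exceeds the event-trigger threshold."""
    a = predictor_coeffs(past_grads, current_grad)
    prediction = np.stack(past_grads, axis=1) @ a
    residual = current_grad - prediction
    return a, (residual if np.linalg.norm(residual) > threshold else None)

# Strong temporal correlation: the current gradient is nearly a combination
# of the previous two, so the residual stays below the threshold.
g1, g2 = np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0])
g_now = 0.6 * g1 + 0.4 * g2 + 1e-3
a, res = transmit([g1, g2], g_now, threshold=0.1)
print(a, res)
```

When `res` is `None`, only the handful of predictor coefficients cross the channel instead of the full gradient vector, which is where the communication savings come from.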

[LG-63] Learning Diverse Bimanual Dexterous Manipulation Skills from Human Demonstrations

Link: https://arxiv.org/abs/2410.02477
Authors: Bohan Zhou,Haoqi Yuan,Yuhui Fu,Zongqing Lu
Keywords-EN: Bimanual dexterous, bimanual dexterous skills, area in robotics, Bimanual dexterous manipulation, critical yet underexplored
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments:

Abstract:Bimanual dexterous manipulation is a critical yet underexplored area in robotics. Its high-dimensional action space and inherent task complexity present significant challenges for policy learning, and the limited task diversity in existing benchmarks hinders general-purpose skill development. Existing approaches largely depend on reinforcement learning, often constrained by intricately designed reward functions tailored to a narrow set of tasks. In this work, we present a novel approach for efficiently learning diverse bimanual dexterous skills from abundant human demonstrations. Specifically, we introduce BiDexHD, a framework that unifies task construction from existing bimanual datasets and employs teacher-student policy learning to address all tasks. The teacher learns state-based policies using a general two-stage reward function across tasks with shared behaviors, while the student distills the learned multi-task policies into a vision-based policy. With BiDexHD, scalable learning of numerous bimanual dexterous skills from auto-constructed tasks becomes feasible, offering promising advances toward universal bimanual dexterous manipulation. Our empirical evaluation on the TACO dataset, spanning 141 tasks across six categories, demonstrates a task fulfillment rate of 74.59% on trained tasks and 51.07% on unseen tasks, showcasing the effectiveness and competitive zero-shot generalization capabilities of BiDexHD. For videos and more information, visit our project page this https URL.

[LG-64] Online Convex Optimization with a Separation Oracle

Link: https://arxiv.org/abs/2410.02476
Authors: Zakaria Mhammedi
Keywords-EN: regret bound, kappa, sqrt, tilde, regret
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
*Comments:

Abstract:In this paper, we introduce a new projection-free algorithm for Online Convex Optimization (OCO) with a state-of-the-art regret guarantee among separation-based algorithms. Existing projection-free methods based on the classical Frank-Wolfe algorithm achieve a suboptimal regret bound of $O(T^{3/4})$, while more recent separation-based approaches guarantee a regret bound of $O(\kappa \sqrt{T})$, where $\kappa$ denotes the asphericity of the feasible set, defined as the ratio of the radii of the containing and contained balls. However, for ill-conditioned sets, $\kappa$ can be arbitrarily large, potentially leading to poor performance. Our algorithm achieves a regret bound of $\tilde{O}(\sqrt{dT} + \kappa d)$, while requiring only $\tilde{O}(1)$ calls to a separation oracle per round. Crucially, the main term in the bound, $\tilde{O}(\sqrt{dT})$, is independent of $\kappa$, addressing the limitations of previous methods. Additionally, as a by-product of our analysis, we recover the $O(\kappa \sqrt{T})$ regret bound of existing OCO algorithms with a more straightforward analysis and improve the regret bound for projection-free online exp-concave optimization. Finally, for constrained stochastic convex optimization, we achieve a state-of-the-art convergence rate of $\tilde{O}(\sigma/\sqrt{T} + \kappa d/T)$, where $\sigma$ represents the noise in the stochastic gradients, while requiring only $\tilde{O}(1)$ calls to a separation oracle per iteration.

[LG-65] Efficient Residual Learning with Mixture-of-Experts for Universal Dexterous Grasping

Link: https://arxiv.org/abs/2410.02475
Authors: Ziye Huang,Haoqi Yuan,Yuhui Fu,Zongqing Lu
Keywords-EN: presents a fundamental, fundamental yet formidable, Universal dexterous grasping, objects, diverse objects presents
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments:

Abstract:Universal dexterous grasping across diverse objects presents a fundamental yet formidable challenge in robot learning. Existing approaches using reinforcement learning (RL) to develop policies on extensive object datasets face critical limitations, including complex curriculum design for multi-task learning and limited generalization to unseen objects. To overcome these challenges, we introduce ResDex, a novel approach that integrates residual policy learning with a mixture-of-experts (MoE) framework. ResDex is distinguished by its use of geometry-unaware base policies that are efficiently acquired on individual objects and capable of generalizing across a wide range of unseen objects. Our MoE framework incorporates several base policies to facilitate diverse grasping styles suitable for various objects. By learning residual actions alongside weights that combine these base policies, ResDex enables efficient multi-task RL for universal dexterous grasping. ResDex achieves state-of-the-art performance on the DexGraspNet dataset comprising 3,200 objects with an 88.8% success rate. It exhibits no generalization gap with unseen objects and demonstrates superior training efficiency, mastering all tasks within only 12 hours on a single GPU.

[LG-66] Meta-Models: An Architecture for Decoding LLM Behaviors Through Interpreted Embeddings and Natural Language

Link: https://arxiv.org/abs/2410.02472
Authors: Anthony Costarelli,Mat Allen,Severin Field,Joshua Clymer
Keywords-EN: Large Language Models, Language Models, Large Language, daily lives, interpreting their decision-making
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: 11 pages, 2 figures

Abstract:As Large Language Models (LLMs) become increasingly integrated into our daily lives, the potential harms from deceptive behavior underlie the need for faithfully interpreting their decision-making. While traditional probing methods have shown some effectiveness, they work best for narrowly scoped tasks, and more comprehensive explanations are still necessary. To this end, we investigate meta-models: an architecture in which a “meta-model” takes activations from an “input-model” and answers natural language questions about the input-model’s behaviors. We evaluate the meta-models’ ability to generalize by training them on selected task types and assessing their out-of-distribution performance in deceptive scenarios. Our findings show that meta-models generalize well to out-of-distribution tasks and point towards opportunities for future research in this area.

[LG-67] Towards a Theoretical Understanding of Memorization in Diffusion Models

Link: https://arxiv.org/abs/2410.02467
Authors: Yunhao Chen,Xingjun Ma,Difan Zou,Yu-Gang Jiang
Keywords-EN: Generative Artificial Intelligence, Artificial Intelligence, Generative Artificial, attracted growing attention, data
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*Comments: arXiv admin note: text overlap with arXiv:2406.12752

Abstract:As diffusion probabilistic models (DPMs) are being employed as mainstream models for Generative Artificial Intelligence (GenAI), the study of their memorization of training data has attracted growing attention. Existing works in this direction aim to establish an understanding of whether or to what extent DPMs learn via memorization. Such an understanding is crucial for identifying potential risks of data leakage and copyright infringement in diffusion models and, more importantly, for trustworthy application of GenAI. Existing works revealed that conditional DPMs are more prone to training data memorization than unconditional DPMs, and the motivated data extraction methods are mostly for conditional DPMs. However, these understandings are primarily empirical, and extracting training data from unconditional models has been found to be extremely challenging. In this work, we provide a theoretical understanding of memorization in both conditional and unconditional DPMs under the assumption of model convergence. Our theoretical analysis indicates that extracting data from unconditional models can also be effective by constructing a proper surrogate condition. Based on this result, we propose a novel data extraction method named Surrogate condItional Data Extraction (SIDE) that leverages a time-dependent classifier trained on the generated data as a surrogate condition to extract training data from unconditional DPMs. Empirical results demonstrate that our SIDE can extract training data in challenging scenarios where previous methods fail, and it is, on average, over 50% more effective across different scales of the CelebA dataset.

[LG-68] Quantifying User Coherence: A Unified Framework for Cross-Domain Recommendation Analysis

Link: https://arxiv.org/abs/2410.02453
Authors: Michaël Soumm,Alexandre Fournier-Montgieux,Adrian Popescu,Bertrand Delezoide
Keywords-EN: quality remains under-researched, Recommender Systems, profile quality remains, understanding recommender systems, remains under-researched
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*Comments:

Abstract:The effectiveness of Recommender Systems (RS) is closely tied to the quality and distinctiveness of user profiles, yet despite many advancements in raw performance, the sensitivity of RS to user profile quality remains under-researched. This paper introduces novel information-theoretic measures for understanding recommender systems: a “surprise” measure quantifying users’ deviations from popular choices, and a “conditional surprise” measure capturing user interaction coherence. We evaluate 7 recommendation algorithms across 9 datasets, revealing the relationships between our measures and standard performance metrics. Using a rigorous statistical framework, our analysis quantifies how much user profile density and information measures impact algorithm performance across domains. By segmenting users based on these measures, we achieve improved performance with reduced data and show that simpler algorithms can match complex ones for low-coherence users. Additionally, we employ our measures to analyze how well different recommendation algorithms maintain the coherence and diversity of user preferences in their predictions, providing insights into algorithm behavior. This work advances the theoretical understanding of user behavior and practical heuristics for personalized recommendation systems, promoting more efficient and adaptive architectures.
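
The “surprise” measure is described as an information-theoretic deviation from popular choices; a natural reading is the self-information of an item under its empirical popularity. A hedged sketch of that reading (the paper's exact estimator may differ):

```python
import numpy as np
from collections import Counter

def item_surprise(interactions):
    """Self-information of each item under its empirical popularity:
    -log2 p(item). Rare picks carry high surprise, popular ones low."""
    counts = Counter(interactions)
    total = sum(counts.values())
    return {item: -np.log2(c / total) for item, c in counts.items()}

# Toy interaction log: item 'a' is popular, 'c' is a rare, surprising choice.
log = ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'a']
s = item_surprise(log)
print(sorted(s, key=s.get))  # least to most surprising
```

Averaging these values over one user's interactions gives a per-user surprise score that can be used to segment mainstream versus niche users, in the spirit of the segmentation the abstract describes.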

[LG-69] Personalized Federated Learning for Generative AI-Assisted Semantic Communications

Link: https://arxiv.org/abs/2410.02450
Authors: Yubo Peng,Feibo Jiang,Li Dong,Kezhi Wang,Kun Yang
Keywords-EN: focuses on transmitting, Semantic Federated Learning, Mobile Users, Personalized Semantic Federated, Generative Artificial Intelligence
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT)
*Comments:

Abstract:Semantic Communication (SC) focuses on transmitting only the semantic information rather than the raw data. This approach offers an efficient solution to the issue of spectrum resource utilization caused by the various intelligent applications on Mobile Users (MUs). Generative Artificial Intelligence (GAI) models have recently exhibited remarkable content generation and signal processing capabilities, presenting new opportunities for enhancing SC. Therefore, we propose a GAI-assisted SC (GSC) model deployed between MUs and the Base Station (BS). Then, to train the GSC model using the local data of MUs while ensuring privacy and accommodating heterogeneous requirements of MUs, we introduce Personalized Semantic Federated Learning (PSFL). This approach incorporates a novel Personalized Local Distillation (PLD) and Adaptive Global Pruning (AGP). In PLD, each MU selects a personalized GSC model as a mentor tailored to its local resources and a unified Convolutional Neural Networks (CNN)-based SC (CSC) model as a student. This mentor model is then distilled into the student model for global aggregation. In AGP, we perform network pruning on the aggregated global model according to real-time communication environments, reducing communication energy. Finally, numerical results demonstrate the feasibility and efficiency of the proposed PSFL scheme.

[LG-70] Clinnova Federated Learning Proof of Concept: Key Takeaways from a Cross-border Collaboration

Link: https://arxiv.org/abs/2410.02443
Authors: Julia Alekseenko,Bram Stieltjes,Michael Bach,Melanie Boerries,Oliver Opitz,Alexandros Karargyris,Nicolas Padoy
Keywords-EN: initiative involving France, European Greater Region, collaborative initiative involving, involving France, Greater Region initiative
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Abstract:Clinnova, a collaborative initiative involving France, Germany, Switzerland, and Luxembourg, is dedicated to unlocking the power of precision medicine through data federation, standardization, and interoperability. This European Greater Region initiative seeks to create an interoperable European standard using artificial intelligence (AI) and data science to enhance healthcare outcomes and efficiency. Key components include multidisciplinary research centers, a federated biobanking strategy, a digital health innovation platform, and a federated AI strategy. It targets inflammatory bowel disease, rheumatoid diseases, and multiple sclerosis (MS), emphasizing data quality to develop AI algorithms for personalized treatment and translational research. The IHU Strasbourg (Institute of Minimal-invasive Surgery) has the lead in this initiative to develop the federated learning (FL) proof of concept (POC) that will serve as a foundation for advancing AI in healthcare. At its core, Clinnova-MS aims to enhance MS patient care by using FL to develop more accurate models that detect disease progression, guide interventions, and validate digital biomarkers across multiple sites. This technical report presents insights and key takeaways from the first cross-border federated POC on MS segmentation of MRI images within the Clinnova framework. While our work marks a significant milestone in advancing MS segmentation through cross-border collaboration, it also underscores the importance of addressing technical, logistical, and ethical considerations to realize the full potential of FL in healthcare settings.

[LG-71] Learning K-U-Net with constant complexity: An Application to time series forecasting

Link: https://arxiv.org/abs/2410.02438
Authors: Jiang You,Arben Cela,René Natowicz,Jacob Ouanounou,Patrick Siarry
Keywords-EN: Training deep models, time series forecasting, Training deep, series forecasting, critical task
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:Training deep models for time series forecasting is a critical task with an inherent challenge of time complexity. While current methods generally ensure linear time complexity, our observations on temporal redundancy show that high-level features are learned 98.44% slower than low-level features. To address this issue, we introduce a new exponentially weighted stochastic gradient descent algorithm designed to achieve constant time complexity in deep learning models. We prove that the theoretical complexity of this learning method is constant. Evaluation of this method on Kernel U-Net (K-U-Net) on synthetic datasets shows a significant reduction in complexity while improving the accuracy of the test set.

[LG-72] Better Call SAUL: Fluent and Consistent Language Model Editing with Generation Regularization

链接: https://arxiv.org/abs/2410.02433
作者: Mingyang Wang,Lukas Lange,Heike Adel,Jannik Strötgen,Hinrich Schütze
关键词-EN: ensure large language, large language models, updated regularly, ensure large, large language
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:To ensure large language models contain up-to-date knowledge, they need to be updated regularly. However, model editing is challenging as it might also affect knowledge that is unrelated to the new data. State-of-the-art methods identify parameters associated with specific knowledge and then modify them via direct weight updates. However, these locate-and-edit methods suffer from heavy computational overhead and lack theoretical validation. In contrast, directly fine-tuning the model on requested edits affects the model’s behavior on unrelated knowledge, and significantly damages the model’s generation fluency and consistency. To address these challenges, we propose SAUL, a streamlined model editing method that uses sentence concatenation with augmented random facts for generation regularization. Evaluations on three model editing benchmarks show that SAUL is a practical and reliable solution for model editing outperforming state-of-the-art methods while maintaining generation quality and reducing computational overhead.
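SAUL's regularization idea, fine-tuning on the requested edit concatenated with random facts, can be sketched as a data-construction step. `build_saul_example` and its inputs are hypothetical; the real method operates within a full fine-tuning pipeline rather than this toy helper:

```python
import random

def build_saul_example(edit_fact, fact_pool, k=2, seed=0):
    """Hypothetical sketch of SAUL-style training-example construction:
    concatenate the requested edit with k random unrelated facts, so that
    fine-tuning on the edit is regularized by surrounding natural text
    instead of overfitting to the lone edit sentence."""
    rng = random.Random(seed)
    distractors = rng.sample([f for f in fact_pool if f != edit_fact], k)
    return " ".join([edit_fact] + distractors)

pool = [
    "The Eiffel Tower is in Paris.",
    "Water boils at 100 degrees Celsius.",
    "Mount Everest is the tallest mountain.",
]
example = build_saul_example("The capital of France is Paris.", pool)
```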

[LG-73] Predictive Attractor Models NEURIPS2024

链接: https://arxiv.org/abs/2410.02430
作者: Ramy Mounir,Sudeep Sarkar
关键词-EN: episodic memory formation, numerous cognitive functions, underpins numerous cognitive, language comprehension, Sequential memory
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: Accepted to NeurIPS 2024

点击查看摘要

Abstract:Sequential memory, the ability to form and accurately recall a sequence of events or stimuli in the correct order, is a fundamental prerequisite for biological and artificial intelligence as it underpins numerous cognitive functions (e.g., language comprehension, planning, episodic memory formation, etc.). However, existing methods of sequential memory suffer from catastrophic forgetting, limited capacity, slow iterative learning procedures, low-order Markov memory, and, most importantly, the inability to represent and generate multiple valid future possibilities stemming from the same context. Inspired by biologically plausible neuroscience theories of cognition, we propose Predictive Attractor Models (PAM), a novel sequence memory architecture with desirable generative properties. PAM is a streaming model that learns a sequence in an online, continuous manner by observing each input only once. Additionally, we find that PAM avoids catastrophic forgetting by uniquely representing past context through lateral inhibition in cortical minicolumns, which prevents new memories from overwriting previously learned knowledge. PAM generates future predictions by sampling from a union set of predicted possibilities; this generative ability is realized through an attractor model trained alongside the predictor. We show that PAM is trained with local computations through Hebbian plasticity rules in a biologically plausible framework. Other desirable traits (e.g., noise tolerance, CPU-based learning, capacity scaling) are discussed throughout the paper. Our findings suggest that PAM represents a significant step forward in the pursuit of biologically plausible and computationally efficient sequential memory models, with broad implications for cognitive science and artificial intelligence research.

[LG-74] LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services

链接: https://arxiv.org/abs/2410.02425
作者: Małgorzata Łazuka,Andreea Anghel,Thomas Parnell
关键词-EN: Large Language Models, Large Language, LLM inference services, LLM inference, Language Models
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Accepted to the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '24)

点击查看摘要

Abstract:As Large Language Models (LLMs) are rapidly growing in popularity, LLM inference services must be able to serve requests from thousands of users while satisfying performance requirements. The performance of an LLM inference service is largely determined by the hardware onto which it is deployed, but understanding of which hardware will deliver on performance requirements remains challenging. In this work we present LLM-Pilot - a first-of-its-kind system for characterizing and predicting performance of LLM inference services. LLM-Pilot performs benchmarking of LLM inference services, under a realistic workload, across a variety of GPUs, and optimizes the service configuration for each considered GPU to maximize performance. Finally, using this characterization data, LLM-Pilot learns a predictive model, which can be used to recommend the most cost-effective hardware for a previously unseen LLM. Compared to existing methods, LLM-Pilot can deliver on performance requirements 33% more frequently, whilst reducing costs by 60% on average.
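Once a performance predictor exists, the final recommendation step amounts to picking the cheapest feasible option. A minimal sketch, with made-up throughput and cost numbers standing in for LLM-Pilot's learned predictions:

```python
def recommend_hardware(predicted_throughput, gpu_cost, required_throughput):
    """Sketch of the recommendation step: given predicted throughput per
    GPU type for a previously unseen LLM (the paper learns this predictor
    from benchmarking data; here the predictions are simply supplied),
    pick the cheapest GPU that meets the performance requirement."""
    feasible = [g for g, t in predicted_throughput.items()
                if t >= required_throughput]
    return min(feasible, key=lambda g: gpu_cost[g]) if feasible else None

# Hypothetical numbers for illustration only.
pred = {"A100": 120.0, "L4": 45.0, "H100": 210.0}
cost = {"A100": 3.7, "L4": 0.8, "H100": 5.1}
choice = recommend_hardware(pred, cost, required_throughput=100.0)
```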

[LG-75] PnP-Flow: Plug-and-Play Image Restoration with Flow Matching

链接: https://arxiv.org/abs/2410.02423
作者: Ségolène Martin,Anne Gagneux,Paul Hagemann,Gabriele Steidl
关键词-EN: Flow Matching, solving imaging inverse, Flow, Matching, Flow Matching pushed
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we introduce Plug-and-Play (PnP) Flow Matching, an algorithm for solving imaging inverse problems. PnP methods leverage the strength of pre-trained denoisers, often deep neural networks, by integrating them in optimization schemes. While they achieve state-of-the-art performance on various inverse problems in imaging, PnP approaches face inherent limitations on more generative tasks like inpainting. On the other hand, generative models such as Flow Matching pushed the boundary in image sampling yet lack a clear method for efficient use in image restoration. We propose to combine the PnP framework with Flow Matching (FM) by defining a time-dependent denoiser using a pre-trained FM model. Our algorithm alternates between gradient descent steps on the data-fidelity term, reprojections onto the learned FM path, and denoising. Notably, our method is computationally efficient and memory-friendly, as it avoids backpropagation through ODEs and trace computations. We evaluate its performance on denoising, super-resolution, deblurring, and inpainting tasks, demonstrating superior results compared to existing PnP algorithms and Flow Matching based state-of-the-art methods.
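The alternating scheme in the abstract (data-fidelity gradient step, reprojection onto the FM path, denoising) can be sketched on a toy least-squares problem; the `denoise` callback stands in for the pretrained flow-matching model the actual method requires:

```python
import numpy as np

def pnp_fm_sketch(y, A, denoise, steps=300, lr=0.1):
    """Toy version of the alternating loop: a gradient step on the
    data-fidelity term ||Ax - y||^2, then a denoising step standing in
    for the reprojection onto the learned flow-matching path (the real
    method uses a pretrained FM model at this point)."""
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        x = x - lr * A.T @ (A @ x - y)  # data-fidelity gradient step
        x = denoise(x)                   # surrogate for FM reprojection/denoising
    return x

# With a trivial (identity) denoiser the loop reduces to plain least squares.
A = np.eye(3)
y = np.array([1.0, 2.0, 3.0])
x_hat = pnp_fm_sketch(y, A, denoise=lambda z: z)
```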

[LG-76] MenakBERT – Hebrew Diacriticizer

链接: https://arxiv.org/abs/2410.02417
作者: Ido Cohen,Jacob Gidron,Idan Pinto
关键词-EN: language give words, Hebrew language give, Diacritical marks, vocalized form, language give
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Published at ISCOL2022 as a poster

点击查看摘要

Abstract:Diacritical marks in the Hebrew language give words their vocalized form. The task of adding diacritical marks to plain Hebrew text is still dominated by a system that relies heavily on human-curated resources. Recent models trained on diacritized Hebrew texts still present a gap in performance. We use a recently developed char-based PLM to narrowly bridge this gap. Presenting MenakBERT, a character level transformer pretrained on Hebrew text and fine-tuned to produce diacritical marks for Hebrew sentences. We continue to show how finetuning a model for diacritizing transfers to a task such as part of speech tagging.

[LG-77] Eliminating Oversaturation and Artifacts of High Guidance Scales in Diffusion Models

链接: https://arxiv.org/abs/2410.02416
作者: Seyedmorteza Sadat,Otmar Hilliges,Romann M. Weber
关键词-EN: CFG update rule, CFG, crucial for improving, input condition, condition and final
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Classifier-free guidance (CFG) is crucial for improving both generation quality and alignment between the input condition and final output in diffusion models. While a high guidance scale is generally required to enhance these aspects, it also causes oversaturation and unrealistic artifacts. In this paper, we revisit the CFG update rule and introduce modifications to address this issue. We first decompose the update term in CFG into parallel and orthogonal components with respect to the conditional model prediction and observe that the parallel component primarily causes oversaturation, while the orthogonal component enhances image quality. Accordingly, we propose down-weighting the parallel component to achieve high-quality generations without oversaturation. Additionally, we draw a connection between CFG and gradient ascent and introduce a new rescaling and momentum method for the CFG update rule based on this insight. Our approach, termed adaptive projected guidance (APG), retains the quality-boosting advantages of CFG while enabling the use of higher guidance scales without oversaturation. APG is easy to implement and introduces practically no additional computational overhead to the sampling process. Through extensive experiments, we demonstrate that APG is compatible with various conditional diffusion models and samplers, leading to improved FID, recall, and saturation scores while maintaining precision comparable to CFG, making our method a superior plug-and-play alternative to standard classifier-free guidance.
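The parallel/orthogonal decomposition at the core of APG is easy to state in code. The sketch below omits the paper's rescaling and momentum terms, and `eta` is an illustrative down-weighting factor rather than a value from the paper:

```python
import numpy as np

def apg_update(cond, uncond, guidance_scale, eta=0.5):
    """Sketch of adaptive projected guidance: split the CFG direction
    (cond - uncond) into components parallel and orthogonal to the
    conditional prediction, then down-weight the parallel part (the one
    linked to oversaturation) by eta before applying guidance."""
    diff = cond - uncond
    unit = cond / np.linalg.norm(cond)
    parallel = np.dot(diff, unit) * unit
    orthogonal = diff - parallel
    return cond + (guidance_scale - 1.0) * (eta * parallel + orthogonal)

cond = np.array([1.0, 0.0])
uncond = np.array([0.5, -0.5])
out = apg_update(cond, uncond, guidance_scale=7.5, eta=1.0)
```

With `eta=1.0` the update reduces exactly to standard CFG, `uncond + s * (cond - uncond)`; smaller `eta` keeps the quality-boosting orthogonal component while attenuating the saturation-inducing parallel one.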

[LG-78] An Online Feasible Point Method for Benign Generalized Nash Equilibrium Problems

链接: https://arxiv.org/abs/2410.02400
作者: Sarah Sachs,Hedi Hadiji,Tim van Erven,Mathias Staudigl
关键词-EN: generalized Nash equilibrium, repeatedly played generalized, played generalized Nash, generalized Nash, Nash equilibrium
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider a repeatedly played generalized Nash equilibrium game. This induces a multi-agent online learning problem with joint constraints. An important challenge in this setting is that the feasible set for each agent depends on the simultaneous moves of the other agents and, therefore, varies over time. As a consequence, the agents face time-varying constraints, which are not adversarial but rather endogenous to the system. Prior work in this setting focused on convergence to a feasible solution in the limit via integrating the constraints in the objective as a penalty function. However, no existing work can guarantee that the constraints are satisfied for all iterations while simultaneously guaranteeing convergence to a generalized Nash equilibrium. This is a problem of fundamental theoretical interest and practical relevance. In this work, we introduce a new online feasible point method. Under the assumption that limited communication between the agents is allowed, this method guarantees feasibility. We identify the class of benign generalized Nash equilibrium problems, for which the convergence of our method to the equilibrium is guaranteed. We set this class of benign generalized Nash equilibrium games in context with existing definitions and illustrate our method with examples.

[LG-79] Parameter Competition Balancing for Model Merging NEURIPS2024

链接: https://arxiv.org/abs/2410.02396
作者: Guodong Du,Junlin Lee,Jing Li,Runhua Jiang,Yifei Guo,Shuyang Yu,Hanting Liu,Sim Kuan Goh,Ho-Kin Tang,Daojing He,Min Zhang
关键词-EN: common practice, model, tasks, parameter, fine-tuning pretrained models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Accepted by NeurIPS2024

点击查看摘要

Abstract:While fine-tuning pretrained models has become common practice, these models often underperform outside their specific domains. Recently developed model merging techniques enable the direct integration of multiple models, each fine-tuned for distinct tasks, into a single model. This strategy promotes multitasking capabilities without requiring retraining on the original datasets. However, existing methods fall short in addressing potential conflicts and complex correlations between tasks, especially in parameter-level adjustments, posing a challenge in effectively balancing parameter competition across various tasks. This paper introduces an innovative technique named PCB-Merging (Parameter Competition Balancing), a lightweight and training-free technique that adjusts the coefficients of each parameter for effective model merging. PCB-Merging employs intra-balancing to gauge parameter significance within individual tasks and inter-balancing to assess parameter similarities across different tasks. Parameters with low importance scores are dropped, and the remaining ones are rescaled to form the final merged model. We assessed our approach in diverse merging scenarios, including cross-task, cross-domain, and cross-training configurations, as well as out-of-domain generalization. The experimental results reveal that our approach achieves substantial performance enhancements across multiple modalities, domains, model sizes, number of tasks, fine-tuning forms, and large language models, outperforming existing model merging methods. The code is publicly available at: this https URL.
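The drop-and-rescale flow described in the abstract can be sketched per task vector; the squared-magnitude scoring below is a crude stand-in for PCB-Merging's intra-/inter-balancing scores, not the paper's actual formula:

```python
import numpy as np

def pcb_merge_sketch(base, task_models, keep_ratio=0.5):
    """Simplified merge in the spirit of PCB-Merging: score each entry of
    each task vector (squared magnitude here), drop the lowest-scoring
    entries, rescale the survivors to preserve the task vector's total
    mass, and average the results onto the base model."""
    merged_delta = np.zeros_like(base)
    for model in task_models:
        delta = model - base
        scores = delta ** 2
        mask = scores >= np.quantile(scores, 1.0 - keep_ratio)
        kept = delta * mask
        scale = np.abs(delta).sum() / max(np.abs(kept).sum(), 1e-12)
        merged_delta += scale * kept
    return base + merged_delta / len(task_models)

base = np.zeros(4)
merged = pcb_merge_sketch(base, [np.array([1.0, 2.0, 3.0, 4.0])])
```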

[LG-80] Online Multi-Label Classification under Noisy and Changing Label Distribution

链接: https://arxiv.org/abs/2410.02394
作者: Yizhang Zou,Xuegang Hu,Peipei Li,Jun Hu,You Wu
关键词-EN: Multi-label data stream, noisy label distribution, label distribution, ground-truth label distribution, online multi-label classification
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Multi-label data stream usually contains noisy labels in the real-world applications, namely occurring in both relevant and irrelevant labels. However, existing online multi-label classification methods are mostly limited in terms of label quality and fail to deal with the case of noisy labels. On the other hand, the ground-truth label distribution may vary with the time changing, which is hidden in the observed noisy label distribution and difficult to track, posing a major challenge for concept drift adaptation. Motivated by this, we propose an online multi-label classification algorithm under Noisy and Changing Label Distribution (NCLD). The convex objective is designed to simultaneously model the label scoring and the label ranking for high accuracy, whose robustness to NCLD benefits from three novel works: 1) The local feature graph is used to reconstruct the label scores jointly with the observed labels, and an unbiased ranking loss is derived and applied to learn reliable ranking information. 2) By detecting the difference between two adjacent chunks with the unbiased label cardinality, we identify the change in the ground-truth label distribution and reset the ranking or all information learned from the past to match the new distribution. 3) Efficient and accurate updating is achieved based on the updating rule derived from the closed-form optimal model solution. Finally, empirical experimental results validate the effectiveness of our method in classifying instances under NCLD.

[LG-81] MANTRA: The Manifold Triangulations Assemblage

链接: https://arxiv.org/abs/2410.02392
作者: Rubén Ballester,Ernst Röell,Daniel Bin Schmid,Mathieu Alain,Sergio Escalera,Carles Casacuberta,Bastian Rieck
关键词-EN: leveraging higher-order interactions, higher-order interactions present, exploiting high-order structures, topological deep learning, expressive models exploiting
类目: Machine Learning (cs.LG); Algebraic Topology (math.AT)
*备注: 26 pages, 2 figures, 22 tables

点击查看摘要

Abstract:The rising interest in leveraging higher-order interactions present in complex systems has led to a surge in more expressive models exploiting high-order structures in the data, especially in topological deep learning (TDL), which designs neural networks on high-order domains such as simplicial complexes. However, progress in this field is hindered by the scarcity of datasets for benchmarking these architectures. To address this gap, we introduce MANTRA, the first large-scale, diverse, and intrinsically high order dataset for benchmarking high-order models, comprising over 43,000 and 249,000 triangulations of surfaces and three-dimensional manifolds, respectively. With MANTRA, we assess several graph- and simplicial complex-based models on three topological classification tasks. We demonstrate that while simplicial complex-based neural networks generally outperform their graph-based counterparts in capturing simple topological invariants, they also struggle, suggesting a rethink of TDL. Thus, MANTRA serves as a benchmark for assessing and advancing topological methods, leading the way for more effective high-order models.

[LG-82] Diffusion Meets Options: Hierarchical Generative Skill Composition for Temporally-Extended Tasks

链接: https://arxiv.org/abs/2410.02389
作者: Zeyu Feng,Hao Luan,Kevin Yuchen Ma,Harold Soh
关键词-EN: correct execution errors, Safe and successful, execution errors, successful deployment, capacity to frequently
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Safe and successful deployment of robots requires not only the ability to generate complex plans but also the capacity to frequently replan and correct execution errors. This paper addresses the challenge of long-horizon trajectory planning under temporally extended objectives in a receding horizon manner. To this end, we propose DOPPLER, a data-driven hierarchical framework that generates and updates plans based on instruction specified by linear temporal logic (LTL). Our method decomposes temporal tasks into chain of options with hierarchical reinforcement learning from offline non-expert datasets. It leverages diffusion models to generate options with low-level actions. We devise a determinantal-guided posterior sampling technique during batch generation, which improves the speed and diversity of diffusion generated options, leading to more efficient querying. Experiments on robot navigation and manipulation tasks demonstrate that DOPPLER can generate sequences of trajectories that progressively satisfy the specified formulae for obstacle avoidance and sequential visitation. Demonstration videos are available online at: this https URL.

[LG-83] BiSSL: Bilevel Optimization for Self-Supervised Pre-Training and Fine-Tuning

链接: https://arxiv.org/abs/2410.02387
作者: Gustav Wagner Zakarias,Lars Kai Hansen,Zheng-Hua Tan
关键词-EN: introduces bilevel optimization, self-supervised learning pipeline, self-supervised learning, bilevel optimization, bilevel optimization problem
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this work, we present BiSSL, a first-of-its-kind training framework that introduces bilevel optimization to enhance the alignment between the pretext pre-training and downstream fine-tuning stages in self-supervised learning. BiSSL formulates the pretext and downstream task objectives as the lower- and upper-level objectives in a bilevel optimization problem and serves as an intermediate training stage within the self-supervised learning pipeline. By more explicitly modeling the interdependence of these training stages, BiSSL facilitates enhanced information sharing between them, ultimately leading to a backbone parameter initialization that is better suited for the downstream task. We propose a training algorithm that alternates between optimizing the two objectives defined in BiSSL. Using a ResNet-18 backbone pre-trained with SimCLR on the STL10 dataset, we demonstrate that our proposed framework consistently achieves improved or competitive classification accuracies across various downstream image classification datasets compared to the conventional self-supervised learning pipeline. Qualitative analyses of the backbone features further suggest that BiSSL enhances the alignment of downstream features in the backbone prior to fine-tuning.
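As a rough intuition for the training algorithm BiSSL proposes, the toy below alternates gradient steps on two scalar objectives; the real method couples the pretext (lower-level) and downstream (upper-level) objectives through the bilevel structure rather than this naive alternation:

```python
def bissl_sketch(theta, pretext_grad, downstream_grad, steps=200, lr=0.05):
    """Toy alternating scheme: one gradient step on the lower-level
    (pretext) objective, then one on the upper-level (downstream)
    objective, sharing a single parameter. Illustrative only."""
    for _ in range(steps):
        theta -= lr * pretext_grad(theta)
        theta -= lr * downstream_grad(theta)
    return theta

# Pretext loss minimized at theta=1, downstream at theta=3; alternation
# settles between the two rather than at either minimum alone, which is
# the flavor of an initialization "better suited for the downstream task".
theta_star = bissl_sketch(0.0, lambda t: t - 1.0, lambda t: t - 3.0)
```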

[LG-84] Unveiling AIs Blind Spots: An Oracle for In-Domain Out-of-Domain and Adversarial Errors

链接: https://arxiv.org/abs/2410.02384
作者: Shuangpeng Han,Mengmi Zhang
关键词-EN: recognizing images-whether in-domain, recognizing images-whether, models make, models, model
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:AI models make mistakes when recognizing images-whether in-domain, out-of-domain, or adversarial. Predicting these errors is critical for improving system reliability, reducing costly mistakes, and enabling proactive corrections in real-world applications such as healthcare, finance, and autonomous systems. However, understanding what mistakes AI models make, why they occur, and how to predict them remains an open challenge. Here, we conduct comprehensive empirical evaluations using a “mentor” model-a deep neural network designed to predict another model’s errors. Our findings show that the mentor model excels at learning from a mentee’s mistakes on adversarial images with small perturbations and generalizes effectively to predict in-domain and out-of-domain errors of the mentee. Additionally, transformer-based mentor models excel at predicting errors across various mentee architectures. Subsequently, we draw insights from these observations and develop an “oracle” mentor model, dubbed SuperMentor, that achieves 78% accuracy in predicting errors across different error types. Our error prediction framework paves the way for future research on anticipating and correcting AI model behaviours, ultimately increasing trust in AI systems. All code, models, and data will be made publicly available.

[LG-85] MetaMetrics: Calibrating Metrics For Generation Tasks Using Human Preferences

链接: https://arxiv.org/abs/2410.02381
作者: Genta Indra Winata,David Anugraha,Lucky Susanto,Garry Kuwanto,Derry Tanti Wijaya
关键词-EN: Understanding the quality, model outputs align, model outputs, human preferences, Understanding
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:Understanding the quality of a performance evaluation metric is crucial for ensuring that model outputs align with human preferences. However, it remains unclear how well each metric captures the diverse aspects of these preferences, as metrics often excel in one particular area but not across all dimensions. To address this, it is essential to systematically calibrate metrics to specific aspects of human preference, catering to the unique characteristics of each aspect. We introduce MetaMetrics, a calibrated meta-metric designed to evaluate generation tasks across different modalities in a supervised manner. MetaMetrics optimizes the combination of existing metrics to enhance their alignment with human preferences. Our metric demonstrates flexibility and effectiveness in both language and vision downstream tasks, showing significant benefits across various multilingual and multi-domain scenarios. MetaMetrics aligns closely with human preferences and is highly extendable and easily integrable into any application. This makes MetaMetrics a powerful tool for improving the evaluation of generation tasks, ensuring that metrics are more representative of human judgment across diverse contexts.
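Calibrating a combination of existing metrics against human scores can be sketched as a supervised regression; ordinary least squares below is only a stand-in for whatever optimizer MetaMetrics actually uses, and the data are synthetic:

```python
import numpy as np

def calibrate_metric_weights(metric_scores, human_scores):
    """Stand-in for MetaMetrics' supervised calibration: fit weights for
    a linear combination of existing metrics against human preference
    scores via least squares."""
    w, *_ = np.linalg.lstsq(metric_scores, human_scores, rcond=None)
    return w

# Two toy metrics; the "human" score is secretly 0.7*m1 + 0.3*m2, so the
# calibration should recover those weights.
M = np.array([[0.1, 0.9], [0.5, 0.5], [0.8, 0.2], [0.3, 0.4]])
h = M @ np.array([0.7, 0.3])
w = calibrate_metric_weights(M, h)
```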

[LG-86] SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration

链接: https://arxiv.org/abs/2410.02367
作者: Jintao Zhang,Jia wei,Pengle Zhang,Jun Zhu,Jianfei Chen
关键词-EN: transformer architecture predominates, architecture predominates, transformer architecture, attention
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The transformer architecture predominates across various models. As the heart of the transformer, attention has a computational complexity of O(N^2), compared to O(N) for linear transformations. When handling large sequence lengths, attention becomes the primary time-consuming component. Although quantization has proven to be an effective method for accelerating model inference, existing quantization methods primarily focus on optimizing the linear layer. In response, we first analyze the feasibility of quantization in attention in detail. Following that, we propose SageAttention, a highly efficient and accurate quantization method for attention. The OPS (operations per second) of our approach outperforms FlashAttention2 and xformers by about 2.1 times and 2.7 times, respectively. SageAttention also achieves superior accuracy performance over FlashAttention3. Comprehensive experiments confirm that our approach incurs almost no end-to-end metrics loss across diverse models, including those for large language processing, image generation, and video generation.
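The core idea, running QK^T in int8 and dequantizing before softmax, can be sketched with per-tensor scales; SageAttention's actual scheme adds smoothing and finer-grained scales to preserve accuracy, so this is an illustration, not the paper's method:

```python
import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor quantization to int8.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_attention_scores(Q, K):
    """Quantize Q and K to int8, perform QK^T in integer arithmetic
    (widened to int32 to avoid overflow), and dequantize the result."""
    qQ, sQ = quantize_int8(Q)
    qK, sK = quantize_int8(K)
    scores = (qQ.astype(np.int32) @ qK.T.astype(np.int32)) * (sQ * sK)
    return scores / np.sqrt(Q.shape[-1])

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
approx = int8_attention_scores(Q, K)
exact = (Q @ K.T) / np.sqrt(8)
```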

[LG-87] Source Data Selection for Brain-Computer Interfaces based on Simple Features

链接: https://arxiv.org/abs/2410.02360
作者: Frida Heskebeck,Carolina Bergeling,Bo Bernhardsson
关键词-EN: Transfer Performance Predictor, Performance Predictor method, brain-computer interface, transfer learning performance, Transfer Performance
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 10 pages, 3 figures, This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:This paper demonstrates that simple features available during the calibration of a brain-computer interface can be utilized for source data selection to improve the performance of the brain-computer interface for a new target user through transfer learning. To support this, a public motor imagery dataset is used for analysis, and a method called the Transfer Performance Predictor method is presented. The simple features are based on the covariance matrices of the data and the Riemannian distance between them. The Transfer Performance Predictor method outperforms other source data selection methods as it selects source data that gives a better transfer learning performance for the target users.
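Selecting source subjects by covariance similarity can be sketched with the log-Euclidean distance, a common computable stand-in for the Riemannian distance between SPD covariance matrices mentioned in the abstract (the paper's exact distance may differ):

```python
import numpy as np

def spd_log(C):
    # Matrix logarithm of a symmetric positive-definite matrix.
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(np.log(vals)) @ vecs.T

def log_euclidean_distance(C1, C2):
    return np.linalg.norm(spd_log(C1) - spd_log(C2))

def select_sources(target_cov, source_covs, k=1):
    """Pick the k source subjects whose covariance matrices lie closest
    to the target subject's covariance."""
    dists = [log_euclidean_distance(target_cov, C) for C in source_covs]
    return list(np.argsort(dists)[:k])

target = np.eye(2)
sources = [2.0 * np.eye(2), 1.1 * np.eye(2)]
best = select_sources(target, sources, k=1)
```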

[LG-88] Simplicity bias and optimization threshold in two-layer ReLU networks

链接: https://arxiv.org/abs/2410.02348
作者: Etienne Boursier,Nicolas Flammarion
关键词-EN: neural networks remains, Understanding generalization, overparametrized neural networks, remains a fundamental, fundamental challenge
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Understanding generalization of overparametrized neural networks remains a fundamental challenge in machine learning. Most of the literature studies generalization from an interpolation point of view, taking convergence of parameters towards a global minimum of the training loss for granted. While overparametrized architectures indeed interpolate the data for typical classification tasks, this interpolation paradigm does not seem valid anymore for more complex tasks such as in-context learning or diffusion. Instead, for such tasks, it has been empirically observed that the trained models go from global minima to spurious local minima of the training loss as the number of training samples becomes larger than some level we call optimization threshold. While the former yields a poor generalization to the true population loss, the latter was observed to actually correspond to the minimiser of this true loss. This paper explores this phenomenon theoretically in the context of two-layer ReLU networks. We demonstrate that, despite overparametrization, networks often converge toward simpler solutions rather than interpolating the training data, which can lead to a drastic improvement on the test loss with respect to interpolating solutions. Our analysis relies on the so-called early alignment phase, during which neurons align towards specific directions. This directional alignment, which occurs in the early stage of training, leads to a simplicity bias, wherein the network approximates the ground truth model without converging to the global minimum of the training loss. Our results suggest that this bias, resulting in an optimization threshold from which interpolation is not reached anymore, is beneficial and enhances the generalization of trained models.

[LG-89] RelChaNet: Neural Network Feature Selection using Relative Change Scores

链接: https://arxiv.org/abs/2410.02344
作者: Felix Zimmer
关键词-EN: reduce computational resources, develop feature selection, feature selection, reduce computational, computational resources
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:There is an ongoing effort to develop feature selection algorithms to improve interpretability, reduce computational resources, and minimize overfitting in predictive models. Neural networks stand out as architectures on which to build feature selection methods, and recently, neuron pruning and regrowth have emerged from the sparse neural network literature as promising new tools. We introduce RelChaNet, a novel and lightweight feature selection algorithm that uses neuron pruning and regrowth in the input layer of a dense neural network. For neuron pruning, a gradient sum metric measures the relative change induced in a network after a feature enters, while neurons are randomly regrown. We also propose an extension that adapts the size of the input layer at runtime. Extensive experiments on nine different datasets show that our approach generally outperforms the current state-of-the-art methods, and in particular improves the average accuracy by 2% on the MNIST dataset. Our code is available at this https URL.
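A single prune-and-regrow step on the input layer can be sketched as follows; the feature scores are taken as given here, whereas the paper derives them from a gradient-sum measure of the relative change a feature induces in the network:

```python
import numpy as np

def prune_and_regrow(active, scores, n_swap, rng):
    """One pruning/regrowth step in the spirit of RelChaNet: drop the
    n_swap active input neurons with the lowest scores and regrow n_swap
    randomly chosen inactive ones, keeping the active count constant."""
    active = active.copy()
    act_idx = np.flatnonzero(active)
    worst = act_idx[np.argsort(scores[act_idx])[:n_swap]]
    active[worst] = False
    inactive = np.flatnonzero(~active)
    regrow = rng.choice(inactive, size=n_swap, replace=False)
    active[regrow] = True
    return active

rng = np.random.default_rng(0)
mask = np.array([True, True, True, False, False])
scores = np.array([0.1, 5.0, 3.0, 0.0, 0.0])
new_mask = prune_and_regrow(mask, scores, n_swap=1, rng=rng)
```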

[LG-90] Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA

链接: https://arxiv.org/abs/2410.02343
作者: Eduard Tulchinskii,Laida Kushnareva,Kristian Kuznetsov,Anastasia Voznyuk,Andrei Andriiainen,Irina Piontkovskaya,Evgeny Burnaev,Serguei Barannikov
关键词-EN: LLM involves presenting, model predicted answer, LLM involves, evaluate the abilities, involves presenting
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A standard way to evaluate the abilities of LLMs involves presenting a multiple-choice question and selecting the option with the highest logit as the model’s predicted answer. However, such a format for evaluating LLMs has limitations, since even if the model knows the correct answer, it may struggle to select the corresponding letter simply due to difficulties in following this rigid format. To address this, we introduce new scores that better capture and reveal the model’s underlying knowledge: the Query-Key Score (QK-score), derived from the interaction between query and key representations in attention heads, and the Attention Score, based on attention weights. These scores are extracted from specific select-and-copy heads, which show consistent performance across popular Multi-Choice Question Answering (MCQA) datasets. Based on these scores, our method improves knowledge extraction, yielding up to 16% gain for LLaMA2-7B and up to 10% for larger models on popular MCQA benchmarks. At the same time, the accuracy on a simple synthetic dataset, where the model explicitly knows the right answer, increases by almost 60%, achieving nearly perfect accuracy, therefore demonstrating the method’s efficiency in mitigating MCQA format limitations. To support our claims, we conduct experiments on models ranging from 7 billion to 70 billion parameters in both zero- and few-shot setups.
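Reading option scores off an attention head, rather than output logits, can be sketched as follows; the attention matrix and token layout are hypothetical, and the paper's QK-score uses raw query-key products rather than the normalized attention weights used here:

```python
import numpy as np

def attention_option_scores(attn_weights, option_positions):
    """Sketch of the Attention Score: from one select-and-copy head, read
    how much attention the final token pays to each answer option's token
    positions, and rank options by that mass instead of by output logits."""
    last_row = attn_weights[-1]
    return {opt: float(last_row[list(pos)].sum())
            for opt, pos in option_positions.items()}

# Toy attention matrix over 6 tokens (rows sum to 1); options A and B
# occupy token positions {1, 2} and {4} respectively (hypothetical layout).
attn = np.array([
    [1.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.50, 0.50, 0.00, 0.00, 0.00, 0.00],
    [0.20, 0.30, 0.50, 0.00, 0.00, 0.00],
    [0.10, 0.10, 0.10, 0.70, 0.00, 0.00],
    [0.10, 0.10, 0.10, 0.10, 0.60, 0.00],
    [0.05, 0.30, 0.25, 0.05, 0.15, 0.20],
])
scores = attention_option_scores(attn, {"A": (1, 2), "B": (4,)})
```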

[LG-91] Data Optimisation of Machine Learning Models for Smart Irrigation in Urban Parks

链接: https://arxiv.org/abs/2410.02335
作者: Nasser Ghadiri,Bahman Javadi,Oliver Obst,Sebastian Pfautsch
关键词-EN: Urban environments face, including extreme heat, impact public health, Sydney Olympic Park, Urban environments
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Urban environments face significant challenges due to climate change, including extreme heat, drought, and water scarcity, which impact public health, community well-being, and local economies. Effective management of these issues is crucial, particularly in areas like Sydney Olympic Park, which relies on one of Australia’s largest irrigation systems. The Smart Irrigation Management for Parks and Cool Towns (SIMPaCT) project, initiated in 2021, leverages advanced technologies and machine learning models to optimize irrigation and induce physical cooling. This paper introduces two novel methods to enhance the efficiency of the SIMPaCT system’s extensive sensor network and applied machine learning models. The first method employs clustering of sensor time series data using K-shape and K-means algorithms to estimate readings from missing sensors, ensuring continuous and reliable data. This approach can detect anomalies, correct data sources, and identify and remove redundant sensors to reduce maintenance costs. The second method involves sequential data collection from different sensor locations using robotic systems, significantly reducing the need for high numbers of stationary sensors. Together, these methods aim to maintain accurate soil moisture predictions while optimizing sensor deployment and reducing maintenance costs, thereby enhancing the efficiency and effectiveness of the smart irrigation system. Our evaluations demonstrate significant improvements in the efficiency and cost-effectiveness of soil moisture monitoring networks. The cluster-based replacement of missing sensors provides up to 5.4% decrease in average error. The sequential sensor data collection as a robotic emulation shows 17.2% and 2.1% decrease in average error for circular and linear paths respectively.
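The first method can be sketched roughly as follows (a toy numpy version with made-up sensor data and a naive k-means, not the SIMPaCT implementation): cluster the sensor time series, then estimate a failed sensor's readings from the mean of the other sensors in its cluster.

```python
import numpy as np

rng = np.random.default_rng(1)
n_sensors, n_steps, k = 12, 50, 3
# Three latent soil-moisture patterns; each sensor follows one of them plus noise
base = rng.normal(size=(k, n_steps)).cumsum(axis=1)
cluster_of = rng.integers(0, k, size=n_sensors)
series = base[cluster_of] + 0.1 * rng.normal(size=(n_sensors, n_steps))

def kmeans(x, k, iters=20):
    """Naive k-means over time series treated as vectors (illustrative only)."""
    centers = x[rng.choice(len(x), k, replace=False)].copy()
    for _ in range(iters):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = x[labels == j].mean(0)
    return labels

labels = kmeans(series, k)

# Pretend sensor 0 failed: estimate its readings from its cluster peers
peers = (labels == labels[0]) & (np.arange(n_sensors) != 0)
estimate = series[peers].mean(0) if peers.any() else series[1:].mean(0)
error = float(np.abs(estimate - series[0]).mean())
```

The paper additionally uses K-shape (a shape-based time-series clustering) and applies the same cluster structure to anomaly detection and redundant-sensor removal.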

[LG-92] Automated Tone Transcription and Clustering with Tone2Vec EMNLP2024

链接: https://arxiv.org/abs/2410.02324
作者: Yi Yang,Yiming Wang,ZhiQiang Tang,Jiahong Yuan
关键词-EN: Lexical tones, crucial role, Sino-Tibetan
类目: Machine Learning (cs.LG)
*备注: Accepted by EMNLP 2024 Findings

点击查看摘要

Abstract:Lexical tones play a crucial role in Sino-Tibetan languages. However, current phonetic fieldwork relies on manual effort, resulting in substantial time and financial costs. This is especially challenging for the numerous endangered languages that are rapidly disappearing, often compounded by limited funding. In this paper, we introduce pitch-based similarity representations for tone transcription, named Tone2Vec. Experiments on dialect clustering and variance show that Tone2Vec effectively captures fine-grained tone variation. Utilizing Tone2Vec, we develop the first automatic approach for tone transcription and clustering by presenting a novel representation transformation for transcriptions. Additionally, these algorithms are systematically integrated into an open-sourced and easy-to-use package, ToneLab, which facilitates automated fieldwork and cross-regional, cross-lexical analysis for tonal languages. Extensive experiments were conducted to demonstrate the effectiveness of our methods.
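A hedged sketch of a pitch-based tone representation in this spirit (the actual Tone2Vec feature design may differ; the contours below use made-up Chao-style pitch values): resample each variable-length pitch contour to a fixed-length vector so contours become directly comparable.

```python
import numpy as np

def tone_vec(pitch, n=16):
    """Resample a variable-length pitch contour to n evenly spaced points."""
    pitch = np.asarray(pitch, dtype=float)
    src = np.linspace(0, 1, len(pitch))
    dst = np.linspace(0, 1, n)
    return np.interp(dst, src, pitch)

def tone_dist(a, b):
    """Mean absolute difference between two resampled contours."""
    return float(np.abs(a - b).mean())

rising  = tone_vec([2, 2.5, 3, 4, 5])      # e.g. a 35-like rising tone
rising2 = tone_vec([2, 3, 3.5, 4.2, 5])    # a slightly different rising tone
falling = tone_vec([5, 4.5, 4, 3, 2, 1])   # e.g. a 51-like falling tone

d_same = tone_dist(rising, rising2)
d_diff = tone_dist(rising, falling)
```

With such fixed-length vectors, dialect clustering reduces to ordinary distance-based clustering over tone representations.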

[LG-93] Convergence of Score-Based Discrete Diffusion Models: A Discrete-Time Analysis

链接: https://arxiv.org/abs/2410.02321
作者: Zikun Zhang,Zixiang Chen,Quanquan Gu
关键词-EN: Diffusion models, achieved great success, generating high-dimensional samples, continuous-state diffusion models, Continuous Time Markov Chain
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 31 pages, 1 figure

点击查看摘要

Abstract:Diffusion models have achieved great success in generating high-dimensional samples across various applications. While the theoretical guarantees for continuous-state diffusion models have been extensively studied, the convergence analysis of the discrete-state counterparts remains under-explored. In this paper, we study the theoretical aspects of score-based discrete diffusion models under the Continuous Time Markov Chain (CTMC) framework. We introduce a discrete-time sampling algorithm in the general state space [S]^d that utilizes score estimators at predefined time points. We derive convergence bounds for the Kullback-Leibler (KL) divergence and total variation (TV) distance between the generated sample distribution and the data distribution, considering both scenarios with and without early stopping under specific assumptions. Notably, our KL divergence bounds are nearly linear in the dimension d, aligning with state-of-the-art results for diffusion models. Our convergence analysis employs a Girsanov-based method and establishes key properties of the discrete score function, which are essential for characterizing the discrete-time sampling process.

[LG-94] Post-edits Are Preferences Too

链接: https://arxiv.org/abs/2410.02320
作者: Nathaniel Berger,Stefan Riezler,Miriam Exel,Matthias Huck
关键词-EN: Preference Optimization, machine translation, state-of-the-art techniques
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: To appear at the Ninth Conference on Machine Translation (WMT24)

点击查看摘要

Abstract:Preference Optimization (PO) techniques are currently among the state-of-the-art methods for fine-tuning large language models (LLMs) on pairwise preference feedback from human annotators. However, in machine translation, this sort of feedback can be difficult to solicit. Additionally, Kreutzer et al. (2018) have shown that, for machine translation, pairwise preferences are less reliable than other forms of human feedback, such as 5-point ratings. We examine post-edits to see if they can be a source of reliable human preferences by construction. In PO, a human annotator is shown sequences s_1 and s_2 and asked for a preference judgment, s_1 > s_2; while for post-editing, editors create s_1 and know that it should be better than s_2. We attempt to use these implicit preferences for PO and show that it helps the model move towards post-edit-like hypotheses and away from machine translation-like hypotheses. Furthermore, we show that best results are obtained by pre-training the model with supervised fine-tuning (SFT) on post-edits in order to promote post-edit-like hypotheses to the top output ranks.
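The construction of implicit preference pairs can be sketched in a few lines (field names and example sentences are invented for illustration): the post-edit is the preferred sequence by construction, and unedited outputs carry no preference signal.

```python
# Turn post-editing records into PO-style preference pairs: the edited
# translation is "chosen" over the raw machine translation it was created from.
records = [
    {"src": "Der Hund schläft.", "mt": "The dog sleep.",   "post_edit": "The dog is sleeping."},
    {"src": "Es regnet stark.",  "mt": "It rains strong.", "post_edit": "It is raining heavily."},
]

preference_pairs = [
    {"prompt": r["src"], "chosen": r["post_edit"], "rejected": r["mt"]}
    for r in records
    if r["post_edit"] != r["mt"]   # an unedited output carries no preference signal
]
```

Such triples are exactly the input format expected by pairwise preference-optimization trainers; the paper's finding is that SFT on the post-edits first, then PO on these pairs, works best.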

[LG-95] QDGset: A Large Scale Grasping Dataset Generated with Quality-Diversity

链接: https://arxiv.org/abs/2410.02319
作者: Johann Huber,François Hélénon,Mathilde Kappel,Ignacio de Loyola Páez-Ubieta,Santiago T. Puente,Pablo Gil,Faïz Ben Amar,Stéphane Doncieux
关键词-EN: Recent advances, grasping remains partially solved, led to significant results
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 8 pages, 9 figures. Draft version

点击查看摘要

Abstract:Recent advances in AI have led to significant results in robotic learning, but skills like grasping remain partially solved. Many recent works exploit synthetic grasping datasets to learn to grasp unknown objects. However, those datasets were generated using simple grasp sampling methods using priors. Recently, Quality-Diversity (QD) algorithms have been proven to make grasp sampling significantly more efficient. In this work, we extend QDG-6DoF, a QD framework for generating object-centric grasps, to scale up the production of synthetic grasping datasets. We propose a data augmentation method that combines the transformation of object meshes with transfer learning from previous grasping repertoires. The conducted experiments show that this approach reduces the number of required evaluations per discovered robust grasp by up to 20%. We used this approach to generate QDGset, a dataset of 6DoF grasp poses that contains about 3.5 and 4.5 times more grasps and objects, respectively, than the previous state-of-the-art. Our method allows anyone to easily generate data, eventually contributing to a large-scale collaborative dataset of synthetic grasps.

[LG-96] CTARR: A fast and robust method for identifying anatomical regions on CT images via atlas registration

链接: https://arxiv.org/abs/2410.02316
作者: Thomas Buddenkotte,Roland Opfer,Julia Krüger,Alessa Hering,Mireia Crispin-Ortuzar
关键词-EN: Medical image analysis tasks, patient body
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Medical image analysis tasks often focus on regions or structures located in a particular location within the patient’s body. Often large parts of the image may not be of interest for the image analysis task. When using deep-learning based approaches, this causes an unnecessary increase in the computational burden during inference and raises the chance of errors. In this paper, we introduce CTARR, a novel generic method for CT Anatomical Region Recognition. The method serves as a pre-processing step for any deep learning-based CT image analysis pipeline by automatically identifying the pre-defined anatomical region that is relevant for the follow-up task and removing the rest. It can be used in (i) image segmentation to prevent false positives in anatomically implausible regions and to speed up inference, (ii) image classification to produce image crops that are consistent in their anatomical context, and (iii) image registration by serving as a fast pre-registration step. Our proposed method is based on atlas registration and provides a fast and robust way to crop any anatomical region encoded as one or multiple bounding box(es) from any unlabeled CT scan of the brain, chest, abdomen and/or pelvis. We demonstrate the utility and robustness of the proposed method in the context of medical image segmentation by evaluating it on six datasets of public segmentation challenges. The foreground voxels in the regions of interest are preserved in the vast majority of cases and tasks (97.45-100%) while taking only fractions of a second to compute (0.1-0.21s) on a deep learning workstation and greatly reducing the segmentation runtime (2.0-12.7x). Our code is available at this https URL.

[LG-97] Semantic Communication and Control Co-Design for Multi-Objective Correlated Dynamics

链接: https://arxiv.org/abs/2410.02303
作者: Abanoub M. Girgis,Hyowoon Seo,Mehdi Bennis
关键词-EN: dynamic semantic Koopman, machine-learning approach to learning
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:This letter introduces a machine-learning approach to learning the semantic dynamics of correlated systems with different control rules and dynamics. By leveraging the Koopman operator in an autoencoder (AE) framework, the system’s state evolution is linearized in the latent space using a dynamic semantic Koopman (DSK) model, capturing the baseline semantic dynamics. Signal temporal logic (STL) is incorporated through a logical semantic Koopman (LSK) model to encode system-specific control rules. These models form the proposed logical Koopman AE framework that reduces communication costs while improving state prediction accuracy and control performance, showing a 91.65% reduction in communication samples and significant performance gains in simulation.
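The linear-latent-dynamics idea at the core of the DSK model can be illustrated with a minimal DMD-style least-squares fit (a toy 2-D system with an assumed ground-truth operator, not the paper's autoencoder):

```python
import numpy as np

# Koopman-style idea: find a linear operator K advancing latent states one
# step, z_{t+1} ≈ K z_t, here recovered by least squares from a trajectory.
rng = np.random.default_rng(2)
K_true = np.array([[0.9, 0.1],
                   [-0.1, 0.9]])          # assumed ground-truth latent dynamics

z = np.zeros((50, 2))
z[0] = rng.normal(size=2)
for t in range(49):
    z[t + 1] = K_true @ z[t]

Z0, Z1 = z[:-1].T, z[1:].T               # snapshot matrices (states, shifted states)
K_fit = Z1 @ np.linalg.pinv(Z0)          # least-squares Koopman operator

err = float(np.abs(K_fit - K_true).max())
```

In the paper this linearisation happens in an autoencoder's latent space, and an STL-based logical Koopman model additionally encodes system-specific control rules; the sketch only shows the linear-fit core.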

[LG-98] Efficient Second-Order Neural Network Optimization via Adaptive Trust Region Methods

链接: https://arxiv.org/abs/2410.02293
作者: James Vo
关键词-EN: offer notable advantages, utilizing curvature information, training deep neural networks
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Second-order optimization methods offer notable advantages in training deep neural networks by utilizing curvature information to achieve faster convergence. However, traditional second-order techniques are computationally prohibitive, primarily due to the large matrix inversions and high memory demands they require. While adaptive trust-region methods have been developed to mitigate these issues, their performance is often hindered by conservative estimates of key parameters, such as the Lipschitz constant of the Hessian, resulting in suboptimal outcomes. In this paper, we introduce SecondOrderAdaptiveAdam (SOAA), a novel optimization algorithm designed to overcome these limitations. SOAA approximates the Fisher information matrix using a diagonal representation, reducing computational complexity from O(n^2) to O(n), thereby making it suitable for large-scale deep learning models, including large language models (LLMs). Additionally, the algorithm integrates an adaptive trust-region mechanism that dynamically adjusts the trust region size based on observed loss reduction, ensuring both robust convergence and computational efficiency. We empirically demonstrate that SOAA achieves faster and more stable convergence compared to first-order optimizers, such as Adam, under similar computational constraints. However, the diagonal approximation of the Fisher information matrix may be less effective in capturing higher-order interactions between gradients, suggesting potential areas for further refinement and future research.
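A rough numpy sketch of the two ingredients described, on a toy quadratic objective (the update rule, hyper-parameters, and trust-region schedule are guesses for illustration, not the paper's algorithm): a running diagonal Fisher estimate used as an O(n) preconditioner, plus a trust-region radius that grows when the loss drops and shrinks otherwise.

```python
import numpy as np

def loss_and_grad(w):
    # toy quadratic objective with minimum at w = 3
    return 0.5 * float(np.sum((w - 3.0) ** 2)), w - 3.0

w = np.zeros(4)
fisher_diag = np.ones_like(w)            # running diagonal Fisher estimate
radius, beta, lr, eps = 0.5, 0.9, 0.1, 1e-8
prev_loss, _ = loss_and_grad(w)
init_loss = prev_loss

for _ in range(200):
    loss, g = loss_and_grad(w)
    fisher_diag = beta * fisher_diag + (1 - beta) * g * g  # diag of E[g g^T]
    step = lr * g / np.sqrt(fisher_diag + eps)             # preconditioned step
    norm = float(np.linalg.norm(step))
    if norm > radius:                                      # clip to trust region
        step *= radius / norm
    w -= step
    # adaptive trust region: expand on improvement, shrink otherwise
    radius *= 1.1 if loss < prev_loss else 0.5
    prev_loss = loss

final_loss, _ = loss_and_grad(w)
```

The diagonal approximation keeps the per-step cost linear in the parameter count, which is the trade-off the abstract's closing caveat is about.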

[LG-99] Density based Spatial Clustering of Lines via Probabilistic Generation of Neighbourhood

链接: https://arxiv.org/abs/2410.02290
作者: Akanksha Das,Malay Bhattacharyya
关键词-EN: Density based spatial clustering, variety of industries
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Density based spatial clustering of points in \mathbb{R}^n has a myriad of applications in a variety of industries. We generalise this problem to the density based clustering of lines in high-dimensional spaces, keeping in mind that there exists no valid distance measure that follows the triangle inequality for lines. In this paper, we design a clustering algorithm that generates a customised neighbourhood for a line of a fixed volume (given as a parameter), based on an optional parameter as a continuous probability density function. This algorithm is not sensitive to outliers and can effectively identify the noise in the data using a cardinality parameter. One of the pivotal applications of this algorithm is clustering data points in \mathbb{R}^n with missing entries, while utilising the domain knowledge of the respective data. In particular, the proposed algorithm is able to cluster n-dimensional data points that contain at least (n-1)-dimensional information. We illustrate the neighbourhoods for the standard probability distributions with continuous probability density functions and demonstrate the effectiveness of our algorithm on various synthetic and real-world datasets (e.g., rail and road networks). The experimental results also highlight its application in clustering incomplete data.

[LG-100] Optimal Strong Regret and Violation in Constrained MDPs via Policy Optimization

链接: https://arxiv.org/abs/2410.02275
作者: Francesco Emanuele Stradi,Matteo Castiglioni,Alberto Marchesi,Nicola Gatti
关键词-EN: study online learning, strong cumulative constraint violation
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2405.14372

点击查看摘要

Abstract:We study online learning in constrained MDPs (CMDPs), focusing on the goal of attaining sublinear strong regret and strong cumulative constraint violation. Differently from their standard (weak) counterparts, these metrics do not allow negative terms to compensate positive ones, raising considerable additional challenges. Efroni et al. (2020) were the first to propose an algorithm with sublinear strong regret and strong violation, by exploiting linear programming. Thus, their algorithm is highly inefficient, leaving as an open problem achieving sublinear bounds by means of policy optimization methods, which are much more efficient in practice. Very recently, Muller et al. (2024) have partially addressed this problem by proposing a policy optimization method that allows one to attain \widetilde{\mathcal{O}}(T^{0.93}) strong regret/violation. This still leaves open the question of whether optimal bounds are achievable using an approach of this kind. We answer such a question affirmatively, by providing an efficient policy optimization algorithm with \widetilde{\mathcal{O}}(\sqrt{T}) strong regret/violation. Our algorithm implements a primal-dual scheme that employs a state-of-the-art policy optimization approach for adversarial (unconstrained) MDPs as the primal algorithm, and a UCB-like update for dual variables.
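The primal-dual scheme can be illustrated on a toy constrained problem (a two-action bandit stand-in for a CMDP; step sizes and the plain averaging are illustrative, and the paper's actual primal algorithm and UCB-like dual update are more sophisticated):

```python
import numpy as np

# Maximise expected reward subject to expected cost <= threshold via a
# primal-dual scheme: gradient ascent on the Lagrangian for the policy,
# projected gradient ascent on the dual variable lam.
r = np.array([1.0, 0.5])        # action rewards
c = np.array([1.0, 0.0])        # action costs
threshold = 0.4                  # constraint: E_pi[c] <= 0.4

logits = np.zeros(2)
lam, eta_p, eta_d = 0.0, 0.5, 0.05
avg_pi = np.zeros(2)
n_iters = 5000

for _ in range(n_iters):
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()
    avg_pi += pi / n_iters
    # primal: policy-gradient step on the Lagrangian E_pi[r - lam * c]
    lagr = r - lam * c
    adv = lagr - pi @ lagr
    logits += eta_p * pi * adv
    # dual: push lam up while the constraint is violated, keep lam >= 0
    lam = max(0.0, lam + eta_d * (float(pi @ c) - threshold))

avg_cost = float(avg_pi @ c)
avg_reward = float(avg_pi @ r)
```

The averaged policy approaches the saddle point, mixing the high-reward costly action with the free one so the time-averaged cost hovers near the threshold.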

[LG-101] Perfect Counterfactuals in Imperfect Worlds: Modelling Noisy Implementation of Actions in Sequential Algorithmic Recourse

链接: https://arxiv.org/abs/2410.02273
作者: Yueqing Xuan,Kacper Sokol,Mark Sanderson,Jeffrey Chan
关键词-EN: Algorithmic recourse, adversely affected, automated decision-making
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Algorithmic recourse provides actions to individuals who have been adversely affected by automated decision-making and helps them achieve a desired outcome. Knowing the recourse, however, does not guarantee that users would implement it perfectly, either due to environmental variability or personal choices. Recourse generation should thus anticipate its sub-optimal or noisy implementation. While several approaches have constructed recourse that accounts for robustness to small perturbations (i.e., noisy recourse implementation), they assume the entire recourse to be implemented in a single step and thus apply one-off uniform noise to it. Such an assumption is unrealistic since recourse often includes multiple sequential steps, which become harder to implement and are subject to more noise. In this work, we consider recourse under plausible noise that adapts to the local data geometry and accumulates at every step of the way. We frame this problem as a Markov Decision Process and demonstrate that the distribution of our plausible noise satisfies the Markov property. We then propose the RObust SEquential (ROSE) recourse generator to output a sequence of steps that will lead to the desired outcome even under imperfect implementation. Given our plausible modelling of sub-optimal human actions and greater recourse robustness to accumulated uncertainty, ROSE can grant users higher chances of success under low recourse costs. Empirical evaluation shows our algorithm manages the inherent trade-off between recourse robustness and costs more effectively while ensuring its low sparsity and fast computation.

[LG-102] Best-of-Both-Worlds Policy Optimization for CMDPs with Bandit Feedback

链接: https://arxiv.org/abs/2410.02269
作者: Francesco Emanuele Stradi,Anna Lunghi,Matteo Castiglioni,Alberto Marchesi,Nicola Gatti
关键词-EN: constrained Markov decision processes, study online learning
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study online learning in constrained Markov decision processes (CMDPs) in which rewards and constraints may be either stochastic or adversarial. In such settings, Stradi et al. (2024) proposed the first best-of-both-worlds algorithm able to seamlessly handle stochastic and adversarial constraints, achieving optimal regret and constraint violation bounds in both cases. This algorithm suffers from two major drawbacks. First, it only works under full feedback, which severely limits its applicability in practice. Moreover, it relies on optimizing over the space of occupancy measures, which requires solving convex optimization problems, a highly inefficient task. In this paper, we provide the first best-of-both-worlds algorithm for CMDPs with bandit feedback. Specifically, when the constraints are stochastic, the algorithm achieves \widetilde{\mathcal{O}}(\sqrt{T}) regret and constraint violation, while, when they are adversarial, it attains \widetilde{\mathcal{O}}(\sqrt{T}) constraint violation and a tight fraction of the optimal reward. Moreover, our algorithm is based on a policy optimization approach, which is much more efficient than occupancy-measure-based methods.

[LG-103] Structural-Entropy-Based Sample Selection for Efficient and Effective Learning ICLR2025

链接: https://arxiv.org/abs/2410.02268
作者: Tianchi Xie,Jiangning Zhu,Guozu Ma,Minzhi Lin,Wei Chen,Weikai Yang,Shixia Liu
关键词-EN: Sample selection, improves the efficiency, machine learning models, samples
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: Submitted to ICLR 2025

点击查看摘要

Abstract:Sample selection improves the efficiency and effectiveness of machine learning models by providing informative and representative samples. Typically, samples can be modeled as a sample graph, where nodes are samples and edges represent their similarities. Most existing methods are based on local information, such as the training difficulty of samples, thereby overlooking global information, such as connectivity patterns. This oversight can result in suboptimal selection because global information is crucial for ensuring that the selected samples well represent the structural properties of the graph. To address this issue, we employ structural entropy to quantify global information and losslessly decompose it from the whole graph to individual nodes using the Shapley value. Based on the decomposition, we present Structural-Entropy-based sample Selection (SES), a method that integrates both global and local information to select informative and representative samples. SES begins by constructing a kNN graph among samples based on their similarities. It then measures sample importance by combining structural entropy (global metric) with training difficulty (local metric). Finally, SES applies importance-biased blue noise sampling to select a set of diverse and representative samples. Comprehensive experiments on three learning scenarios – supervised learning, active learning, and continual learning – clearly demonstrate the effectiveness of our method.
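A simplified sketch of combining a global graph score with a local difficulty score (the structural-entropy decomposition via Shapley values and the blue-noise sampling of the paper are replaced here by a degree-entropy proxy and plain top-k selection):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))              # sample features
difficulty = rng.uniform(size=30)         # local metric, e.g. training loss

# kNN graph (k = 5) from pairwise Euclidean distances
dist = np.linalg.norm(X[:, None] - X[None], axis=-1)
np.fill_diagonal(dist, np.inf)
knn = np.argsort(dist, axis=1)[:, :5]
A = np.zeros((30, 30))
for i, nbrs in enumerate(knn):
    A[i, nbrs] = 1
    A[nbrs, i] = 1                        # symmetrise

# global proxy: each node's contribution to the degree-distribution entropy
deg = A.sum(1)
p = deg / deg.sum()
node_entropy = -p * np.log(p + 1e-12)

score = node_entropy * difficulty         # combine global and local metrics
selected = np.argsort(score)[::-1][:10]   # pick the 10 highest-scoring samples
```

The paper's structural entropy is computed on an encoding tree rather than raw degrees, and its blue-noise sampling additionally enforces spatial diversity among the selected samples.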

[LG-104] Unsupervised Meta-Learning via Dynamic Head and Heterogeneous Task Construction for Few-Shot Classification

链接: https://arxiv.org/abs/2410.02267
作者: Yunchuan Guan,Yu Liu,Ketong Liu,Ke Zhou,Zhiqi Shen
关键词-EN: heterogeneous task construction, recent years, Singular Vector Canonical Correlation Analysis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Meta-learning has been widely used in recent years in areas such as few-shot learning and reinforcement learning. However, the questions of why and when it is better than other algorithms in few-shot classification remain to be explored. In this paper, we perform pre-experiments by adjusting the proportion of label noise and the degree of task heterogeneity in the dataset. We use the metric of Singular Vector Canonical Correlation Analysis to quantify the representation stability of the neural network and thus to compare the behavior of meta-learning and classical learning algorithms. We find that, benefiting from the bi-level optimization strategy, the meta-learning algorithm has better robustness to label noise and heterogeneous tasks. Based on the above conclusion, we argue that meta-learning has a promising future in the unsupervised area, and thus propose DHM-UHT, a dynamic head meta-learning algorithm with unsupervised heterogeneous task construction. The core idea of DHM-UHT is to use DBSCAN and a dynamic head to achieve heterogeneous task construction and to meta-learn the whole process of unsupervised heterogeneous task construction. On several unsupervised zero-shot and few-shot datasets, DHM-UHT obtains state-of-the-art performance. The code is released at this https URL.

[LG-105] Can Capacitive Touch Images Enhance Mobile Keyboard Decoding?

链接: https://arxiv.org/abs/2410.02264
作者: Piyawat Lertvittayakumjorn,Shanqing Cai,Billy Dou,Cedric Ho,Shumin Zhai
关键词-EN: Capacitive touch sensors, two-dimensional spatial profile, touchscreen mobile keyboards
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: Accepted to UIST 2024

点击查看摘要

Abstract:Capacitive touch sensors capture the two-dimensional spatial profile (referred to as a touch heatmap) of a finger’s contact with a mobile touchscreen. However, the research and design of touchscreen mobile keyboards – one of the most speed and accuracy demanding touch interfaces – has focused on the location of the touch centroid derived from the touch image heatmap as the input, discarding the rest of the raw spatial signals. In this paper, we investigate whether touch heatmaps can be leveraged to further improve the tap decoding accuracy for mobile touchscreen keyboards. Specifically, we developed and evaluated machine-learning models that interpret user taps by using the centroids and/or the heatmaps as their input and studied the contribution of the heatmaps to model performance. The results show that adding the heatmap into the input feature set led to 21.4% relative reduction of character error rates on average, compared to using the centroid alone. Furthermore, we conducted a live user study with the centroid-based and heatmap-based decoders built into Pixel 6 Pro devices and observed lower error rate, faster typing speed, and higher self-reported satisfaction score based on the heatmap-based decoder than the centroid-based decoder. These findings underline the promise of utilizing touch heatmaps for improving typing experience in mobile keyboards.
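The centroid-vs-heatmap distinction can be made concrete with a toy example (grid values invented): the centroid is just an intensity-weighted average over the heatmap, so very different contact shapes can collapse to the same input point.

```python
import numpy as np

def centroid(h):
    """Intensity-weighted (row, col) centroid of a touch heatmap."""
    ys, xs = np.indices(h.shape)
    w = h / h.sum()
    return float((ys * w).sum()), float((xs * w).sum())

compact = np.zeros((4, 4))
compact[1:3, 1:3] = 1.0          # small, centred contact blob
spread = np.zeros((4, 4))
spread[0, 0] = spread[0, 3] = spread[3, 0] = spread[3, 3] = 1.0  # four corners

c1, c2 = centroid(compact), centroid(spread)
# Both touches look identical to a centroid-only decoder...
same_centroid = np.allclose(c1, c2)
# ...but a heatmap-based decoder still sees two different spatial profiles.
different_shape = not np.array_equal(compact, spread)
```

This discarded shape information is exactly the extra signal the paper's heatmap-based tap decoders exploit to cut character error rates.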

[LG-106] FedScalar: A Communication efficient Federated Learning

链接: https://arxiv.org/abs/2410.02260
作者: M. Rostami,S. S. Kia
关键词-EN: Federated learning, gained considerable popularity, distributed machine learning
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated learning (FL) has gained considerable popularity for distributed machine learning due to its ability to preserve the privacy of participating agents by eliminating the need for data aggregation. Nevertheless, communication costs between agents and the central server in FL are substantial in large-scale problems and remain a limiting factor for this algorithm. This paper introduces an innovative algorithm, called FedScalar, within the federated learning framework aimed at improving communication efficiency. Unlike traditional FL methods that require agents to send high-dimensional vectors to the server, FedScalar enables agents to communicate updates using a single scalar. Each agent encodes its updated model parameters into a scalar through the inner product between its local update difference and a random vector, which is then transmitted to the server. The server decodes this information by projecting the averaged scalar values onto the random vector. Our method thereby significantly reduces communication overhead. Technically, we demonstrate that the proposed algorithm achieves a convergence rate of O(1/\sqrt{K}) to a stationary point for smooth, non-convex loss functions. Additionally, our analysis shows that altering the underlying distribution of the random vector generated by the server can reduce the variance during the aggregation step of the algorithm. Finally, we validate the performance and communication efficiency of our algorithm with numerical simulations.
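The communication scheme described in the abstract can be sketched directly (sizes and distributions are illustrative): each agent uploads a single inner product, and the server reconstructs the projection of the average update onto the shared random vector.

```python
import numpy as np

rng = np.random.default_rng(4)
n_agents, dim = 5, 1000
v = rng.normal(size=dim)                                   # random vector for this round
deltas = [rng.normal(size=dim) for _ in range(n_agents)]   # local update differences

# Uplink: each agent sends exactly one number instead of a dim-sized vector
scalars = [float(d @ v) for d in deltas]

# Server: average the scalars and project back onto v
avg_scalar = sum(scalars) / n_agents
server_update = avg_scalar * v / (v @ v)

# Sanity check: this equals the component of the true average update along v
true_avg = sum(deltas) / n_agents
proj_true = (true_avg @ v) * v / (v @ v)
```

Each round thus recovers only the component of the average update along that round's random direction; convergence comes from fresh random vectors across rounds, which is what the paper's O(1/\sqrt{K}) analysis formalises.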

[LG-107] End-to-end Driving in High-Interaction Traffic Scenarios with Reinforcement Learning

链接: https://arxiv.org/abs/2410.02253
作者: Yueyuan Li,Mingyang Jiang,Songan Zhang,Wei Yuan,Chunxiang Wang,Ming Yang
关键词-EN: autonomous driving systems, interactive traffic scenarios, pose significant challenges
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 10 pages, 3 figures, experiment under progress, only to demonstrate the originality of the method

点击查看摘要

Abstract:Dynamic and interactive traffic scenarios pose significant challenges for autonomous driving systems. Reinforcement learning (RL) offers a promising approach by enabling the exploration of driving policies beyond the constraints of pre-collected datasets and predefined conditions, particularly in complex environments. However, a critical challenge lies in effectively extracting spatial and temporal features from sequences of high-dimensional, multi-modal observations while minimizing the accumulation of errors over time. Additionally, efficiently guiding large-scale RL models to converge on optimal driving policies without frequent failures during the training process remains tricky. We propose an end-to-end model-based RL algorithm named Ramble to address these issues. Ramble processes multi-view RGB images and LiDAR point clouds into low-dimensional latent features to capture the context of traffic scenarios at each time step. A transformer-based architecture is then employed to model temporal dependencies and predict future states. By learning a dynamics model of the environment, Ramble can foresee upcoming traffic events and make more informed, strategic decisions. Our implementation demonstrates that prior experience in feature extraction and decision-making plays a pivotal role in accelerating the convergence of RL models toward optimal driving policies. Ramble achieves state-of-the-art performance regarding route completion rate and driving score on the CARLA Leaderboard 2.0, showcasing its effectiveness in managing complex and dynamic traffic situations.

[LG-108] Probabilistic road classification in historical maps using synthetic data and deep learning

链接: https://arxiv.org/abs/2410.02250
作者: Dominik J. Mühlematter,Sebastian Schweizer,Chenjing Jiao,Xue Xia,Magnus Heitzler,Lorenz Hurni
关键词-EN: Historical maps, road networks, spatial development, rich source of data, evolutionary studies
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Historical maps are invaluable for analyzing long-term changes in transportation and spatial development, offering a rich source of data for evolutionary studies. However, digitizing and classifying road networks from these maps is often expensive and time-consuming, limiting their widespread use. Recent advancements in deep learning have made automatic road extraction from historical maps feasible, yet these methods typically require large amounts of labeled training data. To address this challenge, we introduce a novel framework that integrates deep learning with geoinformation, computer-based painting, and image processing methodologies. This framework enables the extraction and classification of roads from historical maps using only road geometries without needing road class labels for training. The process begins with training of a binary segmentation model to extract road geometries, followed by morphological operations, skeletonization, vectorization, and filtering algorithms. Synthetic training data is then generated by a painting function that artificially re-paints road segments using predefined symbology for road classes. Using this synthetic data, a deep ensemble is trained to generate pixel-wise probabilities for road classes to mitigate distribution shift. These predictions are then discretized along the extracted road geometries. Subsequently, further processing is employed to classify entire roads, enabling the identification of potential changes in road classes and resulting in a labeled road class dataset. Our method achieved completeness and correctness scores of over 94% and 92%, respectively, for road class 2, the most prevalent class in the two Siegfried Map sheets from Switzerland used for testing. This research offers a powerful tool for urban planning and transportation decision-making by efficiently extracting and classifying roads from historical maps.

[LG-109] Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization

链接: https://arxiv.org/abs/2410.02247
作者: Xinhao Yao,Hongjin Qian,Xiaolin Hu,Gengze Xu,Yong Liu
关键词-EN: Large Language Models, Large Language, mathbf, Language Models, built on Transformer
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs), built on Transformer architectures, exhibit remarkable generalization across a wide range of tasks. However, fine-tuning these models for specific tasks remains resource-intensive due to their extensive parameterization. In this paper, we investigate two remarkable phenomena observed during the fine-tuning of LLMs, particularly focusing on the attention mechanism: (1) Different Impact, optimizing the \mathbf{W}_v matrix significantly improves performance over optimizing the \mathbf{W}_k matrix. Fine-tuning only the \mathbf{W}_q and \mathbf{W}_v matrices is computationally efficient, delivering results that are comparable to, or even better than, fine-tuning all three matrices \mathbf{W}_q , \mathbf{W}_k , and \mathbf{W}_v . (2) Efficient Convergence, employing distinct learning rates for these matrices is crucial for optimal performance, with a higher learning rate for the \mathbf{W}_v matrix expediting convergence. However, theoretical analyses of these phenomena are still relatively limited. We present a theoretical analysis of these phenomena from two perspectives: (i) Generalization, where we demonstrate that fine-tuning only \mathbf{W}_q and \mathbf{W}_v improves generalization bounds and enhances memory efficiency, and (ii) Optimization, where we emphasize that the feature learning of the attention mechanism is efficient, particularly when using distinct learning rates for the matrices, which leads to more effective fine-tuning. Building on these insights, we propose a new strategy that improves fine-tuning efficiency in terms of both storage and time. Experimental results on benchmark datasets validate the effectiveness of this approach, supporting our theoretical findings. Our analysis lays the theoretical groundwork for configuring and improving lightweight algorithms in LLMs fine-tuning.
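The two observations above boil down to a very small training recipe: freeze \mathbf{W}_k and give \mathbf{W}_v a higher learning rate than \mathbf{W}_q. A minimal sketch (the toy weights, gradients, and learning rates are illustrative assumptions, not the paper's settings):

```python
def sgd_step(params, grads, lr):
    """One plain SGD update: p <- p - lr * g, elementwise."""
    return [p - lr * g for p, g in zip(params, grads)]

# Toy "matrices" flattened to vectors; W_k stays frozen entirely.
W_q = [0.5, -0.2]
W_v = [0.1, 0.3]
grad_q = [0.1, 0.1]
grad_v = [0.2, -0.1]

lr_q, lr_v = 1e-2, 5e-2  # distinct rates, with lr_v > lr_q as the paper suggests

W_q = sgd_step(W_q, grad_q, lr_q)
W_v = sgd_step(W_v, grad_v, lr_v)
```

In a real framework the same effect is achieved with per-parameter-group learning rates, with the key projection excluded from the optimizer.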

[LG-110] PFGuard: A Generative Framework with Privacy and Fairness Safeguards

链接: https://arxiv.org/abs/2410.02246
作者: Soyeon Kim,Yuji Roh,Geon Heo,Steven Euijong Whang
关键词-EN: privacy, fairness, Trustworthy, fairness for Trustworthy, Abstract
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Generative models must ensure both privacy and fairness for Trustworthy AI. While these goals have been pursued separately, recent studies propose to combine existing privacy and fairness techniques to achieve both goals. However, naively combining these techniques can be insufficient due to privacy-fairness conflicts, where a sample in a minority group may be amplified for fairness, only to be suppressed for privacy. We demonstrate how these conflicts lead to adverse effects, such as privacy violations and unexpected fairness-utility tradeoffs. To mitigate these risks, we propose PFGuard, a generative framework with privacy and fairness safeguards, which simultaneously addresses privacy, fairness, and utility. By using an ensemble of multiple teacher models, PFGuard balances privacy-fairness conflicts between fair and private training stages and achieves high utility based on ensemble learning. Extensive experiments show that PFGuard successfully generates synthetic data on high-dimensional data while providing both fairness convergence and strict DP guarantees - the first of its kind to our knowledge.

[LG-111] Robust Weight Initialization for Tanh Neural Networks with Fixed Point Analysis

链接: https://arxiv.org/abs/2410.02242
作者: Hyunwoo Lee,Hayoung Choi,Hyunju Kim
关键词-EN: strong generalization performance, achieve strong generalization, network depth increases, neural network depth, neural networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As a neural network’s depth increases, it can achieve strong generalization performance. Training, however, becomes challenging due to gradient issues. Theoretical research and various methods have been introduced to address these issues. However, weight initialization methods that can be effectively applied to tanh neural networks of varying sizes remain under-explored. This paper presents a novel weight initialization method for Feedforward Neural Networks with the tanh activation function. Based on an analysis of the fixed points of the function \tanh(ax) , our proposed method aims to determine values of a that prevent the saturation of activations. A series of experiments on various classification datasets demonstrate that the proposed method is more robust to network size variations than the existing method. Furthermore, when applied to Physics-Informed Neural Networks, the method exhibits faster convergence and robustness to variations of the network size compared to Xavier initialization in problems of Partial Differential Equations.
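The role of the fixed points of \tanh(ax) is easy to see numerically: for a <= 1 the only fixed point of x -> tanh(ax) is 0, so repeated application collapses activations, while for a > 1 stable nonzero fixed points appear and iterates settle near them (saturation). A toy illustration of that dichotomy, not the paper's initialization formula:

```python
import math

def tanh_fixed_point(a, x0=0.9, iters=200):
    """Iterate x <- tanh(a * x). For a <= 1 the only fixed point is 0;
    for a > 1 the iterates converge to a stable nonzero fixed point."""
    x = x0
    for _ in range(iters):
        x = math.tanh(a * x)
    return x

# a = 0.5: activations shrink toward the trivial fixed point 0.
# a = 2.0: activations settle at a nonzero fixed point, i.e. they saturate.
```

Choosing a so that neither regime dominates is the intuition behind picking initialization scales from this fixed-point analysis.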

[LG-112] C-MORL: Multi-Objective Reinforcement Learning through Efficient Discovery of Pareto Front

链接: https://arxiv.org/abs/2410.02236
作者: Ruohong Liu,Yuxin Pan,Linjie Xu,Lei Song,Pengcheng You,Yize Chen,Jiang Bian
关键词-EN: Multi-objective reinforcement learning, handling rapidly changing, involve multiple criteria, Multi-objective reinforcement, rapidly changing preferences
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 27 pages, 8 figures. In Submission to a conference

点击查看摘要

Abstract:Multi-objective reinforcement learning (MORL) excels at handling rapidly changing preferences in tasks that involve multiple criteria, even for unseen preferences. However, previous dominating MORL methods typically generate a fixed policy set or preference-conditioned policy through multiple training iterations exclusively for sampled preference vectors, and cannot ensure the efficient discovery of the Pareto front. Furthermore, integrating preferences into the input of policy or value functions presents scalability challenges, in particular as the dimension of the state and preference space grow, which can complicate the learning process and hinder the algorithm’s performance on more complex tasks. To address these issues, we propose a two-stage Pareto front discovery algorithm called Constrained MORL (C-MORL), which serves as a seamless bridge between constrained policy optimization and MORL. Concretely, a set of policies is trained in parallel in the initialization stage, with each optimized towards its individual preference over the multiple objectives. Then, to fill the remaining vacancies in the Pareto front, the constrained optimization steps are employed to maximize one objective while constraining the other objectives to exceed a predefined threshold. Empirically, compared to recent advancements in MORL methods, our algorithm achieves more consistent and superior performances in terms of hypervolume, expected utility, and sparsity on both discrete and continuous control tasks, especially with numerous objectives (up to nine objectives in our experiments).
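The second-stage idea, maximize one objective while constraining the others to exceed a threshold, can be caricatured in one dimension. This grid search is only an illustration of the constrained step; C-MORL itself solves it with constrained policy optimization:

```python
# Toy "fill the Pareto front" step: the policy is a single scalar theta in
# [0, 1], the two objectives trade off linearly, and we maximize objective 1
# subject to objective 2 staying above a threshold.
def constrained_best(thetas, f1, f2, threshold):
    """Return the theta maximizing f1 among those with f2(theta) >= threshold."""
    feasible = [t for t in thetas if f2(t) >= threshold]
    return max(feasible, key=f1) if feasible else None

thetas = [i / 100 for i in range(101)]
f1 = lambda t: t          # objective 1 grows with theta
f2 = lambda t: 1.0 - t    # objective 2 shrinks with theta

theta_star = constrained_best(thetas, f1, f2, threshold=0.3)
```

Sweeping the threshold over a grid of values yields a sequence of such constrained solutions, which is how the remaining vacancies on the Pareto front get filled.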

[LG-113] SEAL: SEmantic-Augmented Imitation Learning via Language Model

链接: https://arxiv.org/abs/2410.02231
作者: Chengyang Gu,Yuxin Pan,Haotian Bai,Hui Xiong,Yize Chen
关键词-EN: Hierarchical Imitation Learning, Hierarchical Imitation, Imitation Learning, tackling long-horizon decision-making, Large Language Models
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 18 pages, 5 figures, in submission

点击查看摘要

Abstract:Hierarchical Imitation Learning (HIL) is a promising approach for tackling long-horizon decision-making tasks. It is challenging, however, due to the lack of detailed supervisory labels for sub-goal learning and the reliance on hundreds to thousands of expert demonstrations. In this work, we introduce SEAL, a novel framework that leverages the powerful semantic and world knowledge of Large Language Models (LLMs) both for specifying the sub-goal space and for pre-labeling states with semantically meaningful sub-goal representations, without prior knowledge of task hierarchies. SEAL employs a dual-encoder structure, combining supervised LLM-guided sub-goal learning with unsupervised Vector Quantization (VQ) for more robust sub-goal representations. Additionally, SEAL incorporates a transition-augmented low-level planner for improved adaptation to sub-goal transitions. Our experiments demonstrate that SEAL outperforms state-of-the-art HIL methods and LLM-based planning approaches, particularly in settings with small expert datasets and complex long-horizon tasks.
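The unsupervised half of SEAL's dual encoder is vector quantization: a continuous state encoding is snapped to its nearest codebook entry, which serves as a discrete sub-goal id. A minimal sketch (the codebook here is fixed for illustration; SEAL learns it end to end):

```python
# Nearest-neighbor vector quantization: map a continuous encoding z to the
# index of the closest codebook vector (the discrete sub-goal id).
def quantize(z, codebook):
    def d2(a, b):
        """Squared Euclidean distance."""
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: d2(z, codebook[i]))

# Three hypothetical sub-goal prototypes in a 2-D latent space.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
```

The supervised branch pushes encodings toward LLM-labeled sub-goals, while this quantization step keeps the representation discrete and robust.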

[LG-114] Mitigating Downstream Model Risks via Model Provenance

链接: https://arxiv.org/abs/2410.02230
作者: Keyu Wang,Abdullah Norozi Iranzad,Scott Schaffter,Doina Precup,Jonathan Lebensold
关键词-EN: model, Research and industry, foundation model-based systems, rapidly advancing, Research
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Research and industry are rapidly advancing the innovation and adoption of foundation model-based systems, yet the tools for managing these models have not kept pace. Understanding the provenance and lineage of models is critical for researchers, industry, regulators, and public trust. While model cards and system cards were designed to provide transparency, they fall short in key areas: tracing model genealogy, enabling machine readability, offering reliable centralized management systems, and fostering consistent creation incentives. This challenge mirrors issues in software supply chain security, but AI/ML remains at an earlier stage of maturity. Addressing these gaps requires industry-standard tooling that can be adopted by foundation model publishers, open-source model innovators, and major distribution platforms. We propose a machine-readable model specification format to simplify the creation of model records, thereby reducing error-prone human effort, notably when a new model inherits most of its design from a foundation model. Our solution explicitly traces relationships between upstream and downstream models, enhancing transparency and traceability across the model lifecycle. To facilitate the adoption, we introduce the unified model record (UMR) repository, a semantically versioned system that automates the publication of model records to multiple formats (PDF, HTML, LaTeX) and provides a hosted web interface (this https URL). This proof of concept aims to set a new standard for managing foundation models, bridging the gap between innovation and responsible model management.
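A machine-readable record that explicitly links a model to its upstream parent might look like the following. The field names are purely illustrative; the abstract does not specify the actual UMR schema:

```python
import json

# Hypothetical machine-readable model record tracing upstream lineage.
# Every key below is an assumption made for illustration only.
record = {
    "name": "my-finetune-v1",
    "version": "1.0.0",
    "upstream": {"name": "base-foundation-model", "version": "2.3.1"},
    "task": "text-classification",
}

blob = json.dumps(record)  # a record any tool in the chain can parse
```

The point of such a format is that lineage queries ("which deployed models descend from this foundation model?") become simple traversals over structured data instead of manual reading of model cards.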

[LG-115] Doubly Optimal Policy Evaluation for Reinforcement Learning

链接: https://arxiv.org/abs/2410.02226
作者: Shuze Liu,Claire Chen,Shangtong Zhang
关键词-EN: processing raw data, processing raw, Policy evaluation estimates, meaningful estimate, Policy evaluation
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2301.13734

点击查看摘要

Abstract:Policy evaluation estimates the performance of a policy by (1) collecting data from the environment and (2) processing raw data into a meaningful estimate. Due to the sequential nature of reinforcement learning, any improper data-collecting policy or data-processing method substantially deteriorates the variance of evaluation results over long time steps. Thus, policy evaluation often suffers from large variance and requires massive data to achieve the desired accuracy. In this work, we design an optimal combination of data-collecting policy and data-processing baseline. Theoretically, we prove our doubly optimal policy evaluation method is unbiased and guaranteed to have lower variance than previously best-performing methods. Empirically, compared with previous works, we show our method reduces variance substantially and achieves superior empirical performance.

[LG-116] EmbedLLM: Learning Compact Representations of Large Language Models

链接: https://arxiv.org/abs/2410.02223
作者: Richard Zhuang,Tianhao Wu,Zhaojin Wen,Andrew Li,Jiantao Jiao,Kannan Ramchandran
关键词-EN: Huggingface today, Large Language Models, efficiently evaluating, increasingly critical, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With hundreds of thousands of language models available on Huggingface today, efficiently evaluating and utilizing these models across various downstream tasks has become increasingly critical. Many existing methods repeatedly learn task-specific representations of Large Language Models (LLMs), which leads to inefficiencies in both time and computational resources. To address this, we propose EmbedLLM, a framework designed to learn compact vector representations of LLMs that facilitate downstream applications involving many models, such as model routing. We introduce an encoder-decoder approach for learning such embeddings, along with a systematic framework to evaluate their effectiveness. Empirical results show that EmbedLLM outperforms prior methods in model routing both in accuracy and latency. Additionally, we demonstrate that our method can forecast a model’s performance on multiple benchmarks, without incurring additional inference cost. Extensive probing experiments validate that the learned embeddings capture key model characteristics, e.g. whether the model is specialized for coding tasks, even without being explicitly trained on them. We open source our dataset, code and embedder to facilitate further research and application.
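Once every model has a compact embedding, routing reduces to a similarity lookup. A hypothetical sketch (the model names, features, and scores are invented; EmbedLLM learns such embeddings with an encoder-decoder rather than hand-assigning them):

```python
# Route a query to the model whose embedding scores highest against the
# query's feature vector (dot-product similarity as the predicted fit).
def route(query_vec, model_embs):
    scores = {name: sum(q * m for q, m in zip(query_vec, emb))
              for name, emb in model_embs.items()}
    return max(scores, key=scores.get)

# Illustrative 2-D embeddings: dim 0 ~ "code skill", dim 1 ~ "chat skill".
model_embs = {
    "coder-7b": [0.9, 0.1],
    "chat-7b":  [0.2, 0.8],
}
```

Because the embeddings are computed once per model, adding a new downstream task only requires featurizing the query, not re-profiling every model.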

[LG-117] Capturing complex hand movements and object interactions using machine learning-powered stretchable smart textile gloves

链接: https://arxiv.org/abs/2410.02221
作者: Arvin Tashakori,Zenan Jiang,Amir Servati,Saeid Soltanian,Harishkumar Narayana,Katherine Le,Caroline Nakayama,Chieh-ling Yang,Z. Jane Wang,Janice J. Eng,Peyman Servati
关键词-EN: dexterous hand movements, Accurate real-time tracking, hand movements, realistic hand movements, Capturing realistic hand
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Accurate real-time tracking of dexterous hand movements and interactions has numerous applications in human-computer interaction, metaverse, robotics, and tele-health. Capturing realistic hand movements is challenging because of the large number of articulations and degrees of freedom. Here, we report accurate and dynamic tracking of articulated hand and finger movements using stretchable, washable smart gloves with embedded helical sensor yarns and inertial measurement units. The sensor yarns have a high dynamic range, responding to low 0.005 % to high 155 % strains, and show stability during extensive use and washing cycles. We use multi-stage machine learning to report average joint angle estimation root mean square errors of 1.21 and 1.45 degrees for intra- and inter-subjects cross-validation, respectively, matching accuracy of costly motion capture cameras without occlusion or field of view limitations. We report a data augmentation technique that enhances robustness to noise and variations of sensors. We demonstrate accurate tracking of dexterous hand movements during object interactions, opening new avenues of applications including accurate typing on a mock paper keyboard, recognition of complex dynamic and static gestures adapted from American Sign Language and object identification.

[LG-118] Stochastic Sampling from Deterministic Flow Models ICLR2025

链接: https://arxiv.org/abs/2410.02217
作者: Saurabh Singh,Ian Fischer
关键词-EN: deterministic transport map, ordinary differential equation, framework for learning, transport map, flow models
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注: Submitted to ICLR 2025

点击查看摘要

Abstract:Deterministic flow models, such as rectified flows, offer a general framework for learning a deterministic transport map between two distributions, realized as the vector field for an ordinary differential equation (ODE). However, they are sensitive to model estimation and discretization errors and do not permit different samples conditioned on an intermediate state, limiting their application. We present a general method to turn the underlying ODE of such flow models into a family of stochastic differential equations (SDEs) that have the same marginal distributions. This method permits us to derive families of stochastic samplers, for fixed (e.g., previously trained) deterministic flow models, that continuously span the spectrum of deterministic and stochastic sampling, given access to the flow field and the score function. Our method provides additional degrees of freedom that help alleviate the issues with the deterministic samplers and empirically outperforms them. We empirically demonstrate advantages of our method on a toy Gaussian setup and on the large scale ImageNet generation task. Further, our family of stochastic samplers provide an additional knob for controlling the diversity of generation, which we qualitatively demonstrate in our experiments.
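One standard construction of such a family (a sketch of the general ODE-to-SDE result; the paper's exact parameterization may differ): given the flow ODE \dot{x}_t = v_t(x_t) with marginals p_t and score s_t(x) = \nabla_x \log p_t(x), every SDE of the form below has the same marginals for any noise schedule \varepsilon_t \ge 0, as a direct Fokker-Planck computation using p_t \nabla \log p_t = \nabla p_t shows:

```latex
dx_t \;=\; \Big[\, v_t(x_t) \;+\; \tfrac{\varepsilon_t}{2}\, s_t(x_t) \,\Big]\, dt
\;+\; \sqrt{\varepsilon_t}\;\, dW_t ,
\qquad \varepsilon_t \equiv 0 \ \text{recovers the deterministic sampler.}
```

The schedule \varepsilon_t is exactly the "knob" mentioned in the abstract: larger values inject more stochasticity per step while leaving every marginal distribution unchanged.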

[LG-119] Calibrate to Discriminate: Improve In-Context Learning with Label-Free Comparative Inference

链接: https://arxiv.org/abs/2410.02210
作者: Wei Cheng,Tianlu Wang,Yanmin Ji,Fan Yang,Keren Tan,Yiyu Zheng
关键词-EN: large language models, shown impressive performance, language models, level of confidence, learning with large
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 19 pages

点击查看摘要

Abstract:While in-context learning with large language models (LLMs) has shown impressive performance, we have discovered a unique miscalibration behavior where both correct and incorrect predictions are assigned the same level of confidence. We refer to this phenomenon as indiscriminate miscalibration. We found that traditional calibration metrics, such as Expected Calibrated Errors (ECEs), are unable to capture this behavior effectively. To address this issue, we propose new metrics to measure the severity of indiscriminate miscalibration. Additionally, we develop a novel in-context comparative inference method to alleviate miscalibrations and improve classification performance. Through extensive experiments on five datasets, we demonstrate that our proposed method can achieve more accurate and calibrated predictions compared to regular zero-shot and few-shot prompting.
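Why standard ECE misses this failure mode can be shown directly: if every prediction carries the same confidence and that confidence happens to equal overall accuracy, ECE is zero even though confidence says nothing about which predictions are correct. A minimal illustration (the paper's proposed replacement metrics are not reproduced here):

```python
def ece(confidences, correct, n_bins=10):
    """Expected Calibration Error: bin predictions by confidence and
    average |bin accuracy - bin confidence|, weighted by bin size."""
    total = len(confidences)
    err = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        err += len(idx) / total * abs(acc - conf)
    return err

# Indiscriminate miscalibration: every prediction gets confidence 0.7 and
# overall accuracy is also 0.7, so ECE ~ 0 despite useless confidences.
confs = [0.7] * 10
correct = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

A metric sensitive to indiscriminate miscalibration must instead compare the confidence distributions of correct versus incorrect predictions, which here are identical.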

[LG-120] Adapting Segment Anything Model to Melanoma Segmentation in Microscopy Slide Images

链接: https://arxiv.org/abs/2410.02207
作者: Qingyuan Liu,Avideh Zakhor
关键词-EN: crucial prognostic factors, Breslow depth, Slide Images, invasive tumor size, primary invasive tumor
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Melanoma segmentation in Whole Slide Images (WSIs) is useful for prognosis and the measurement of crucial prognostic factors such as Breslow depth and primary invasive tumor size. In this paper, we present a novel approach that uses the Segment Anything Model (SAM) for automatic melanoma segmentation in microscopy slide images. Our method employs an initial semantic segmentation model to generate preliminary segmentation masks that are then used to prompt SAM. We design a dynamic prompting strategy that uses a combination of centroid and grid prompts to achieve optimal coverage of the super high-resolution slide images while maintaining the quality of generated prompts. To optimize for invasive melanoma segmentation, we further refine the prompt generation process by implementing in-situ melanoma detection and low-confidence region filtering. We select Segformer as the initial segmentation model and EfficientSAM as the segment anything model for parameter-efficient fine-tuning. Our experimental results demonstrate that this approach not only surpasses other state-of-the-art melanoma segmentation methods but also significantly outperforms the baseline Segformer by 9.1% in terms of IoU.
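The centroid-plus-grid prompting strategy can be sketched as follows. This is our reading of the "dynamic prompting" description, with invented rules: the centroid of each preliminary mask gives one point prompt, and grid-aligned mask pixels add coverage for large regions:

```python
def mask_prompts(mask, grid_step):
    """mask: 2-D list of 0/1 pixels. Returns (row, col) point prompts:
    the mask centroid first, then grid-aligned foreground pixels."""
    pts = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pts:
        return []
    cr = sum(r for r, _ in pts) // len(pts)
    cc = sum(c for _, c in pts) // len(pts)
    prompts = [(cr, cc)]
    for r, c in pts:
        if r % grid_step == 0 and c % grid_step == 0 and (r, c) != (cr, cc):
            prompts.append((r, c))
    return prompts

mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
```

Each point prompt would then be fed to SAM (EfficientSAM in the paper) alongside the image, with low-confidence regions filtered out before prompting.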

[LG-121] Revisiting Prefix-tuning: Statistical Benefits of Reparameterization among Prompts

链接: https://arxiv.org/abs/2410.02200
作者: Minh Le,Chau Nguyen,Huy Nguyen,Quyen Tran,Trung Le,Nhat Ho
关键词-EN: fine-tuning large pre-trained, large pre-trained models, gained prominence, large pre-trained, Prompt-based techniques
类目: Machine Learning (cs.LG)
*备注: Minh Le, Chau Nguyen, Huy Nguyen contributed equally to this work. 50 pages, 8 tables, 2 figures

点击查看摘要

Abstract:Prompt-based techniques, such as prompt-tuning and prefix-tuning, have gained prominence for their efficiency in fine-tuning large pre-trained models. Despite their widespread adoption, the theoretical foundations of these methods remain limited. For instance, in prefix-tuning, we observe that a key factor in achieving performance parity with full fine-tuning lies in the reparameterization strategy. However, the theoretical principles underpinning the effectiveness of this approach have yet to be thoroughly examined. Our study demonstrates that reparameterization is not merely an engineering trick but is grounded in deep theoretical foundations. Specifically, we show that the reparameterization strategy implicitly encodes a shared structure between prefix key and value vectors. Building on recent insights into the connection between prefix-tuning and mixture of experts models, we further illustrate that this shared structure significantly improves sample efficiency in parameter estimation compared to non-shared alternatives. The effectiveness of prefix-tuning across diverse tasks is empirically confirmed to be enhanced by the shared structure, through extensive experiments in both visual and language domains. Additionally, we uncover similar structural benefits in prompt-tuning, offering new perspectives on its success. Our findings provide theoretical and empirical contributions, advancing the understanding of prompt-based methods and their underlying mechanisms.
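The "shared structure" claim is easiest to see in code: under reparameterization, the prefix keys and values are both projections of one shared embedding per prefix position, rather than two independently parameterized sets. A toy sketch (the tiny dimensions, and a linear map instead of the usual MLP reparameterization, are simplifications):

```python
import random

random.seed(0)

def linear(x, W):
    """y = W x, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

d_emb, d_model, n_prefix = 2, 4, 3
# One shared embedding per prefix position -- the structure the paper argues
# reparameterization implicitly encodes.
E = [[random.gauss(0, 1) for _ in range(d_emb)] for _ in range(n_prefix)]
# Separate projection heads map the SAME embedding to key and value vectors.
W_k = [[random.gauss(0, 1) for _ in range(d_emb)] for _ in range(d_model)]
W_v = [[random.gauss(0, 1) for _ in range(d_emb)] for _ in range(d_model)]

prefix_keys = [linear(e, W_k) for e in E]
prefix_values = [linear(e, W_v) for e in E]
```

Because keys and values share the low-dimensional embedding E, fewer effective parameters must be estimated than with fully independent prefix matrices, which is the sample-efficiency argument the paper formalizes.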

[LG-122] Deep Koopman-layered Model with Universal Property Based on Toeplitz Matrices

链接: https://arxiv.org/abs/2410.02199
作者: Yuka Hashimoto,Tomoharu Iwata
关键词-EN: propose deep Koopman-layered, deep Koopman-layered models, deep Koopman-layered, Koopman-layered models, Toeplitz matrices
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Functional Analysis (math.FA); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We propose deep Koopman-layered models with learnable parameters in the form of Toeplitz matrices for analyzing the dynamics of time-series data. The proposed model has both theoretical solidness and flexibility. By virtue of the universal property of Toeplitz matrices and the reproducing property underlined in the model, we can show its universality and the generalization property. In addition, the flexibility of the proposed model enables the model to fit time-series data coming from nonautonomous dynamical systems. When training the model, we apply Krylov subspace methods for efficient computations. In addition, the proposed model can be regarded as a neural ODE-based model. In this sense, the proposed model establishes a new connection among Koopman operators, neural ODEs, and numerical linear algebraic methods.
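The Toeplitz structure at the core of these layers is simple to state: each diagonal is constant, so an n-by-n operator needs only 2n-1 parameters. A minimal builder (illustrative only; the paper additionally uses Krylov subspace methods to apply such operators efficiently):

```python
# Build a Toeplitz matrix from its first column and first row
# (first_col[0] and first_row[0] are assumed to coincide).
def toeplitz(first_col, first_row):
    n, m = len(first_col), len(first_row)
    return [[first_col[i - j] if i >= j else first_row[j - i]
             for j in range(m)] for i in range(n)]

T = toeplitz([1, 2, 3], [1, 4, 5])
```

Constant diagonals also make matrix-vector products amenable to FFT-based acceleration, which is one reason such parameterizations pair well with the numerical linear algebra the paper invokes.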

[LG-123] G2T-LLM: Graph-to-Tree Text Encoding for Molecule Generation with Fine-Tuned Large Language Models

链接: https://arxiv.org/abs/2410.02198
作者: Zhaoning Yu,Xiangyang Xu,Hongyang Gao
关键词-EN: hierarchical text format, text format optimized, large language models, hierarchical text, optimized for large
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:We introduce G2T-LLM, a novel approach for molecule generation that uses graph-to-tree text encoding to transform graph-based molecular structures into a hierarchical text format optimized for large language models (LLMs). This encoding converts complex molecular graphs into tree-structured formats, such as JSON and XML, which LLMs are particularly adept at processing due to their extensive pre-training on these types of data. By leveraging the flexibility of LLMs, our approach allows for intuitive interaction using natural language prompts, providing a more accessible interface for molecular design. Through supervised fine-tuning, G2T-LLM generates valid and coherent chemical structures, addressing common challenges like invalid outputs seen in traditional graph-based methods. While LLMs are computationally intensive, they offer superior generalization and adaptability, enabling the generation of diverse molecular structures with minimal task-specific customization. The proposed approach achieved performance comparable to state-of-the-art methods on various benchmark molecular generation datasets, demonstrating its potential as a flexible and innovative tool for AI-driven molecular design.
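A graph-to-tree text encoding can be sketched with a depth-first traversal that nests a molecular graph into JSON. The schema (keys "atom", "bond", "child") is an assumption made for illustration, not the paper's actual format:

```python
import json

# Hypothetical graph-to-tree encoding: DFS from a root atom turns the
# molecular graph into nested JSON that an LLM can read and emit.
def graph_to_tree(atoms, bonds, root=0, seen=None):
    seen = seen or set()
    seen.add(root)
    children = []
    for a, b, order in bonds:
        nxt = b if a == root else a if b == root else None
        if nxt is not None and nxt not in seen:
            children.append({"bond": order,
                             "child": graph_to_tree(atoms, bonds, nxt, seen)})
    return {"atom": atoms[root], "bonds": children}

# Ethanol backbone: C-C-O with single bonds.
atoms = ["C", "C", "O"]
bonds = [(0, 1, 1), (1, 2, 1)]
tree = json.dumps(graph_to_tree(atoms, bonds))
```

Rings would need back-references on top of this tree skeleton, which is one reason a real encoding has to be more elaborate than a plain DFS.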

[LG-124] General Preference Modeling with Preference Representations for Aligning Language Models

链接: https://arxiv.org/abs/2410.02197
作者: Yifan Zhang,Ge Zhang,Yue Wu,Kangping Xu,Quanquan Gu
关键词-EN: General Preference, preference, crucial for aligning, Traditional reward modeling, Modeling human preferences
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 34 pages

点击查看摘要

Abstract:Modeling human preferences is crucial for aligning foundation models with human values. Traditional reward modeling methods, such as the Bradley-Terry (BT) reward model, fall short in expressiveness, particularly in addressing intransitive preferences. Although supervised pair preference models (PairPM) can express general preferences, their implementation is highly ad-hoc and cannot guarantee a consistent preference probability of compared pairs. Additionally, they impose high computational costs due to their quadratic query complexity when comparing multiple responses. In this paper, we introduce preference representation learning, an approach that embeds responses into a latent space to capture intricate preference structures efficiently, achieving linear query complexity. Additionally, we propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback. Experimental results show that our General Preference representation model (GPM) outperforms the BT reward model on the RewardBench benchmark with a margin of up to 5.6% and effectively models cyclic preferences where any BT reward model behaves like a random guess. Furthermore, evaluations on downstream tasks such as AlpacaEval2.0 and MT-Bench, following the language model post-training with GPO and our general preference model, reveal substantial performance improvements with margins up to 9.3%. These findings indicate that our method may enhance the alignment of foundation models with nuanced human values. The code is available at this https URL.
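Why embeddings can model cyclic preferences while a scalar reward cannot is illustrated below with a skew-symmetric bilinear score over 2-D response embeddings. This is an illustration of the representational point only, not the paper's exact model:

```python
import math

def pref_score(e1, e2):
    """s(e1, e2) = e1^T R e2 with R = [[0, 1], [-1, 0]] skew-symmetric,
    so s(e1, e2) = -s(e2, e1); a positive score means e1 is preferred."""
    return e1[0] * e2[1] - e1[1] * e2[0]

# Three responses at 120-degree intervals on the unit circle form a cycle
# A > B > C > A -- impossible for any scalar (Bradley-Terry style) reward.
A = (1.0, 0.0)
B = (math.cos(2 * math.pi / 3), math.sin(2 * math.pi / 3))
C = (math.cos(4 * math.pi / 3), math.sin(4 * math.pi / 3))
```

Note the linear query complexity: each response is embedded once, and any pair is scored from the two embeddings, instead of running a pairwise preference model on every ordered pair.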

[LG-125] BACKTIME: Backdoor Attacks on Multivariate Time Series Forecasting NEURIPS2024

链接: https://arxiv.org/abs/2410.02195
作者: Xiao Lin,Zhining Liu,Dongqi Fu,Ruizhong Qiu,Hanghang Tong
关键词-EN: Multivariate Time Series, Multivariate Time, Time Series, MTS forecasting models, numerous real-world applications
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: 23 pages. NeurIPS 2024

点击查看摘要

Abstract:Multivariate Time Series (MTS) forecasting is a fundamental task with numerous real-world applications, such as transportation, climate, and epidemiology. While a myriad of powerful deep learning models have been developed for this task, few works have explored the robustness of MTS forecasting models to malicious attacks, which is crucial for their trustworthy employment in high-stake scenarios. To address this gap, we dive deep into backdoor attacks on MTS forecasting models and propose an effective attack method named BackTime. By subtly injecting a few stealthy triggers into the MTS data, BackTime can alter the predictions of the forecasting model according to the attacker’s intent. Specifically, BackTime first identifies vulnerable timestamps in the data for poisoning, and then adaptively synthesizes stealthy and effective triggers by solving a bi-level optimization problem with a GNN-based trigger generator. Extensive experiments across multiple datasets and state-of-the-art MTS forecasting models demonstrate the effectiveness, versatility, and stealthiness of BackTime attacks. The code is available at this https URL.

[LG-126] A Survey on Point-of-Interest Recommendation: Models Architectures and Security

链接: https://arxiv.org/abs/2410.02191
作者: Qianru Zhang,Peng Yang,Junliang Yu,Haixin Wang,Xingwei He,Siu-Ming Yiu,Hongzhi Yin
关键词-EN: Location-Based Social Networks, Social Networks, creating unparalleled opportunities, Location-Based Social, Networks has led
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: 20 pages

点击查看摘要

Abstract:The widespread adoption of smartphones and Location-Based Social Networks has led to a massive influx of spatio-temporal data, creating unparalleled opportunities for enhancing Point-of-Interest (POI) recommendation systems. These advanced POI systems are crucial for enriching user experiences, enabling personalized interactions, and optimizing decision-making processes in the digital landscape. However, existing surveys tend to focus on traditional approaches and few of them delve into cutting-edge developments, emerging architectures, as well as security considerations in POI recommendations. To address this gap, our survey stands out by offering a comprehensive, up-to-date review of POI recommendation systems, covering advancements in models, architectures, and security aspects. We systematically examine the transition from traditional models to advanced techniques such as large language models. Additionally, we explore the architectural evolution from centralized to decentralized and federated learning systems, highlighting the improvements in scalability and privacy. Furthermore, we address the increasing importance of security, examining potential vulnerabilities and privacy-preserving approaches. Our taxonomy provides a structured overview of the current state of POI recommendation, while we also identify promising directions for future research in this rapidly advancing field.

[LG-127] Agent-Oriented Planning in Multi-Agent Systems

链接: https://arxiv.org/abs/2410.02189
作者: Ao Li,Yuexiang Xie,Songze Li,Fugee Tsung,Bolin Ding,Yaliang Li
关键词-EN: possessing diverse expertise, achieve impressive progress, agents possessing diverse, systems achieve impressive, multiple agents possessing
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Through the collaboration of multiple agents possessing diverse expertise and tools, multi-agent systems achieve impressive progress in solving real-world problems. Given the user queries, the meta-agents, serving as the brain within these systems, are required to decompose the queries into multiple sub-tasks that can be allocated to suitable agents capable of solving them, so-called agent-oriented planning. In this study, we identify three critical design principles of agent-oriented planning, including solvability, completeness, and non-redundancy, to ensure that each sub-task is effectively resolved, leading to satisfactory responses to the original queries. These principles further inspire us to propose a novel framework for agent-oriented planning in multi-agent systems, leveraging a fast task decomposition and allocation process followed by an effective and efficient evaluation via a reward model. During the planning process, the meta-agent is also responsible for evaluating the performance of the expert agents, making timely adjustments to the sub-tasks and scheduling as necessary. Besides, we integrate a feedback loop into the proposed framework to further enhance the effectiveness and robustness of such a problem-solving process. Extensive experiments demonstrate the advancement of the proposed framework in solving real-world problems compared to both single-agent systems and existing planning strategies for multi-agent systems.

[LG-128] POSIX: A Prompt Sensitivity Index For Large Language Models EMNLP2024

链接: https://arxiv.org/abs/2410.02185
作者: Anwoy Chatterjee,H S V N S Kowndinya Renduchintala,Sumit Bhatia,Tanmoy Chakraborty
关键词-EN: Large Language Models, Large Language, Language Models, minor variations, generating significantly divergent
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: EMNLP 2024 (Findings)

点击查看摘要

Abstract:Despite their remarkable capabilities, Large Language Models (LLMs) are found to be surprisingly sensitive to minor variations in prompts, often generating significantly divergent outputs in response to minor variations in the prompts, such as spelling errors, alteration of wording or the prompt template. However, while assessing the quality of an LLM, the focus often tends to be solely on its performance on downstream tasks, while very little to no attention is paid to prompt sensitivity. To fill this gap, we propose POSIX - a novel PrOmpt Sensitivity IndeX as a reliable measure of prompt sensitivity, thereby offering a more comprehensive evaluation of LLM performance. The key idea behind POSIX is to capture the relative change in loglikelihood of a given response upon replacing the corresponding prompt with a different intent-preserving prompt. We provide thorough empirical evidence demonstrating the efficacy of POSIX in capturing prompt sensitivity and subsequently use it to measure and thereby compare prompt sensitivity of various open-source LLMs. We find that merely increasing the parameter count or instruction tuning does not necessarily reduce prompt sensitivity whereas adding some few-shot exemplars, even just one, almost always leads to significant decrease in prompt sensitivity. We also find that alterations to prompt template lead to the highest sensitivity in the case of MCQtype tasks, whereas paraphrasing results in the highest sensitivity in open-ended generation tasks. The code for reproducing our results is open-sourced at this https URL.

[LG-129] CodeJudge: Evaluating Code Generation with Large Language Models EMNLP2024

链接: https://arxiv.org/abs/2410.02184
作者: Weixi Tong,Tianyi Zhang
关键词-EN: Large Language Models, shown promising performance, Large Language, shown promising, promising performance
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Software Engineering (cs.SE)
*备注: Accepted to EMNLP 2024 (Main, Long Paper)

点击查看摘要

Abstract:Large Language Models (LLMs) have shown promising performance in code generation. However, how to reliably evaluate code generated by LLMs remains an unresolved problem. This paper presents CodeJudge, a code evaluation framework that leverages LLMs to evaluate the semantic correctness of generated code without the need for test cases. We investigate different ways to guide the LLM in performing “slow thinking” to arrive at an in-depth and reliable evaluation. We experimented with four LLMs as evaluators on four code generation datasets and five programming languages. The results show that CodeJudge significantly outperformed existing methods in most settings. Furthermore, compared with a SOTA GPT-3.5-based code evaluation method, CodeJudge achieved better results even when using a much smaller model, Llama-3-8B-Instruct. Our code and datasets are available on GitHub this https URL.

[LG-130] BadCM: Invisible Backdoor Attack Against Cross-Modal Learning

链接: https://arxiv.org/abs/2410.02182
作者: Zheng Zhang,Xu Yuan,Lei Zhu,Jingkuan Song,Liqiang Nie
关键词-EN: unimodal learning tasks, remarkable successes, underexplored due, cross-modal, learning tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Despite remarkable successes in unimodal learning tasks, backdoor attacks against cross-modal learning are still underexplored due to the limited generalization and inferior stealthiness when involving multiple modalities. Notably, since works in this area mainly inherit ideas from unimodal visual attacks, they struggle to handle diverse cross-modal attack circumstances and to craft imperceptible trigger samples, which hinders their practicability in real-world applications. In this paper, we introduce a novel bilateral backdoor to fill in the missing pieces of the puzzle in the cross-modal backdoor and propose a generalized invisible backdoor framework against cross-modal learning (BadCM). Specifically, a cross-modal mining scheme is developed to capture the modality-invariant components as target poisoning areas, where well-designed trigger patterns injected into these regions can be efficiently recognized by the victim models. This strategy is adapted to different image-text cross-modal models, making our framework available to various attack scenarios. Furthermore, for generating poisoned samples of high stealthiness, we conceive modality-specific generators for visual and linguistic modalities that facilitate hiding explicit trigger patterns in modality-invariant regions. To the best of our knowledge, BadCM is the first invisible backdoor method deliberately designed for diverse cross-modal attacks within one unified framework. Comprehensive experimental evaluations on two typical applications, i.e., cross-modal retrieval and VQA, demonstrate the effectiveness and generalization of our method under multiple kinds of attack scenarios. Moreover, we show that BadCM can robustly evade existing backdoor defenses. Our code is available at this https URL.

[LG-131] HATFormer: Historic Handwritten Arabic Text Recognition with Transformers

链接: https://arxiv.org/abs/2410.02179
作者: Adrian Chan,Anupam Mijar,Mehreen Saeed,Chau-Wai Wong,Akram Khater
关键词-EN: diverse writing styles, English HTR model, Arabic HTR models, English HTR, generalizable Arabic HTR
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Arabic handwritten text recognition (HTR) is challenging, especially for historical texts, due to diverse writing styles and the intrinsic features of Arabic script. Additionally, Arabic handwriting datasets are smaller compared to English ones, making it difficult to train generalizable Arabic HTR models. To address these challenges, we propose HATFormer, a transformer-based encoder-decoder architecture that builds on a state-of-the-art English HTR model. By leveraging the transformer’s attention mechanism, HATFormer captures spatial contextual information to address the intrinsic challenges of Arabic script through differentiating cursive characters, decomposing visual representations, and identifying diacritics. Our customization to historical handwritten Arabic includes an image processor for effective ViT information preprocessing, a text tokenizer for compact Arabic text representation, and a training pipeline that accounts for a limited amount of historic Arabic handwriting data. HATFormer achieves a character error rate (CER) of 8.6% on the largest public historical handwritten Arabic dataset, with a 51% improvement over the best baseline in the literature. HATFormer also attains a comparable CER of 4.2% on the largest private non-historical dataset. Our work demonstrates the feasibility of adapting an English HTR method to a low-resource language with complex, language-specific challenges, contributing to advancements in document digitization, information retrieval, and cultural preservation.
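The reported metric, character error rate (CER), is the Levenshtein edit distance between the hypothesis and reference transcripts divided by the reference length. A minimal implementation:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))  # distances for the empty reference prefix
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution / match
        prev = cur
    return prev[n] / max(m, 1)
```

A CER of 8.6% thus means roughly one character-level edit per twelve reference characters.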

[LG-132] Towards Better Generalization: Weight Decay Induces Low-rank Bias for Neural Networks

链接: https://arxiv.org/abs/2410.02176
作者: Ke Chen,Chugang Yi,Haizhao Yang
关键词-EN: Stochastic Gradient Descent, Weight Decay, training neural networks, neural networks, Gradient Descent
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study the implicit bias towards low-rank weight matrices when training neural networks (NN) with Weight Decay (WD). We prove that when a ReLU NN is sufficiently trained with Stochastic Gradient Descent (SGD) and WD, its weight matrix is approximately a rank-two matrix. Empirically, we demonstrate that WD is a necessary condition for inducing this low-rank bias across both regression and classification tasks. Our work differs from previous studies as our theoretical analysis does not rely on common assumptions regarding the training data distribution, optimality of weight matrices, or specific training procedures. Furthermore, by leveraging the low-rank bias, we derive improved generalization error bounds and provide numerical evidence showing that better generalization can be achieved. Thus, our work offers both theoretical and empirical insights into the strong generalization performance of SGD when combined with WD.

[LG-133] Efficiently Deploying LLMs with Controlled Risk

链接: https://arxiv.org/abs/2410.02173
作者: Michael J. Zellinger,Matt Thomson
关键词-EN: Deploying large language, large language models, production requires simultaneous, requires simultaneous attention, risk control
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 10 pages

点击查看摘要

Abstract:Deploying large language models in production requires simultaneous attention to efficiency and risk control. Prior work has shown the possibility to cut costs while maintaining similar accuracy, but has neglected to focus on risk control. By contrast, here we present hierarchical chains with multi-level abstention (HCMA), which use model-intrinsic uncertainty to delegate queries along the LLM intelligence hierarchy, enabling training-free model switching based solely on black-box API calls. Our framework presents novel trade-offs between efficiency and risk. For example, deploying HCMA on MMLU cuts the error rate of Llama3 405B by 30% when the model is allowed to abstain on 20% of the queries. To calibrate HCMA for optimal performance, our approach uses data-efficient logistic regressions (based on a simple nonlinear feature transformation), which require only 50 or 100 labeled examples to achieve excellent calibration error (ECE), cutting ECE by 50% compared to naive Platt scaling. On free-form generation tasks, we find that chain-of-thought is ineffectual for selective prediction, whereas zero-shot prompting drives error to 0% on TruthfulQA at high abstention rates. As LLMs are increasingly deployed across computing environments with different capabilities (such as mobile, laptop, and cloud), our framework paves the way towards maintaining deployment efficiency while putting in place sharp risk controls.
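The delegation idea can be sketched as a confidence-gated cascade. Everything below is a mock, not the paper's API: each tier returns an answer plus a confidence score, a query escalates up the hierarchy while confidence is below the tier's threshold, and the chain abstains if even the strongest tier stays below a floor.

```python
def cascade_answer(query, tiers, abstain_floor=0.5):
    """Mock hierarchical chain with multi-level abstention.

    tiers: list of (model_fn, keep_above) pairs ordered cheap -> strong;
    model_fn(query) returns (answer, confidence in [0, 1]). A query
    escalates while confidence is below the tier's threshold; if even
    the last tier falls below abstain_floor, the chain abstains (None).
    """
    answer, confidence = None, 0.0
    for model_fn, keep_above in tiers:
        answer, confidence = model_fn(query)
        if confidence >= keep_above:
            return answer  # confident enough; stop without escalating
    return answer if confidence >= abstain_floor else None
```

Only black-box calls are needed, matching the training-free model-switching described above; calibrating the per-tier thresholds is where the paper's logistic regressions come in.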

[LG-134] Abstract Reward Processes: Leveraging State Abstraction for Consistent Off-Policy Evaluation NEURIPS2024

链接: https://arxiv.org/abs/2410.02172
作者: Shreyas Chaudhari,Ameet Deshpande,Bruno Castro da Silva,Philip S. Thomas
关键词-EN: applying reinforcement learning, Evaluating policies, autonomous driving, crucial for applying, applying reinforcement
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: Accepted at the Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:Evaluating policies using off-policy data is crucial for applying reinforcement learning to real-world problems such as healthcare and autonomous driving. Previous methods for off-policy evaluation (OPE) generally suffer from high variance or irreducible bias, leading to unacceptably high prediction errors. In this work, we introduce STAR, a framework for OPE that encompasses a broad range of estimators – which include existing OPE methods as special cases – that achieve lower mean squared prediction errors. STAR leverages state abstraction to distill complex, potentially continuous problems into compact, discrete models which we call abstract reward processes (ARPs). Predictions from ARPs estimated from off-policy data are provably consistent (asymptotically correct). Rather than proposing a specific estimator, we present a new framework for OPE and empirically demonstrate that estimators within STAR outperform existing methods. The best STAR estimator outperforms baselines in all twelve cases studied, and even the median STAR estimator surpasses the baselines in seven out of the twelve cases.

[LG-135] Channel-aware Contrastive Conditional Diffusion for Multivariate Probabilistic Time Series Forecasting

链接: https://arxiv.org/abs/2410.02168
作者: Siyang Li,Yize Chen,Hui Xiong
关键词-EN: Forecasting faithful trajectories, reasonable decision-making, faithful trajectories, practical scopes, scopes is essential
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Forecasting faithful trajectories of multivariate time series from practical scopes is essential for reasonable decision-making. Recent methods mainly tailor generative conditional diffusion models to estimate the target temporal predictive distribution. However, it remains challenging to exploit the implicit temporal predictive information efficiently enough to bolster conditional diffusion learning. To this end, we propose a generic channel-aware Contrastive Conditional Diffusion model entitled CCDM to achieve desirable Multivariate probabilistic forecasting, obviating the need for curated temporal conditioning inductive biases. In detail, we first design a channel-centric conditional denoising network to manage intra-variate variations and cross-variate correlations, which can lead to scalability on diverse prediction horizons and channel numbers. Then, we devise an ad-hoc denoising-based temporal contrastive learning to explicitly amplify the predictive mutual information between past observations and future forecasts. It can coherently complement naive step-wise denoising diffusion training and improve the forecasting accuracy and generality on unknown test time series. In addition, we offer theoretical insights on the benefits of such auxiliary contrastive training refinement from both neural mutual information and temporal distribution generalization aspects. The proposed CCDM can exhibit superior forecasting capability compared to current state-of-the-art diffusion forecasters over a comprehensive benchmark, with best MSE and CRPS outcomes on 66.67% and 83.33% cases. Our code is publicly available at this https URL.

[LG-136] Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis

链接: https://arxiv.org/abs/2410.02167
作者: Hongkang Li,Meng Wang,Songtao Lu,Xiaodong Cui,Pin-Yu Chen
关键词-EN: efficient prompting method, large language models, multiple intermediate steps, efficient prompting, prompting method
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Chain-of-Thought (CoT) is an efficient prompting method that enables the reasoning ability of large language models by augmenting the query using multiple examples with multiple intermediate steps. Despite the empirical success, the theoretical understanding of how to train a Transformer to achieve the CoT ability remains less explored. This is primarily due to the technical challenges involved in analyzing the nonconvex optimization on nonlinear attention models. To the best of our knowledge, this work provides the first theoretical study of training Transformers with nonlinear attention to obtain the CoT generalization capability so that the resulting model can perform inference on unseen tasks when the input is augmented by examples of the new task. We first quantify the required training samples and iterations to train a Transformer model towards CoT ability. We then prove the success of its CoT generalization on unseen tasks with distribution-shifted testing data. Moreover, we theoretically characterize the conditions for an accurate reasoning output by CoT even when the provided reasoning examples contain noises and are not always accurate. In contrast, in-context learning (ICL), which can be viewed as one-step CoT without intermediate steps, may fail to provide an accurate output when CoT does. These theoretical findings are justified through experiments.

[LG-137] Universality in Transfer Learning for Linear Models

链接: https://arxiv.org/abs/2410.02164
作者: Reza Ghane,Danil Akhtiamov,Babak Hassibi
关键词-EN: Transfer learning, target distribution, collection is costly, attractive framework, target
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transfer learning is an attractive framework for problems where there is a paucity of data, or where data collection is costly. One common approach to transfer learning is referred to as “model-based”, and involves using a model that is pretrained on samples from a source distribution, which is easier to acquire, and then fine-tuning the model on a few samples from the target distribution. The hope is that, if the source and target distributions are “close”, then the fine-tuned model will perform well on the target distribution even though it has seen only a few samples from it. In this work, we study the problem of transfer learning in linear models for both regression and binary classification. In particular, we consider the use of stochastic gradient descent (SGD) on a linear model initialized with pretrained weights and using a small training data set from the target distribution. In the asymptotic regime of large models, we provide an exact and rigorous analysis and relate the generalization errors (in regression) and classification errors (in binary classification) for the pretrained and fine-tuned models. In particular, we give conditions under which the fine-tuned model outperforms the pretrained one. An important aspect of our work is that all the results are “universal”, in the sense that they depend only on the first and second order statistics of the target distribution. They thus extend well beyond the standard Gaussian assumptions commonly made in the literature.
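The model-based recipe analyzed here, pretrain on abundant source data and then run SGD from the pretrained weights on a few target samples, can be sketched with a hand-rolled linear model. The data, weights, and step sizes below are illustrative (noiseless toy data), not from the paper:

```python
def sgd_linear(w, data, lr=0.1, epochs=200):
    """Plain SGD on squared error for a linear model y = w . x."""
    w = list(w)
    for _ in range(epochs):
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

# Source task: plentiful noiseless data generated by w_src = [2.0, -1.0].
source_data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
# Target task: only two samples, from the nearby w_tgt = [2.2, -0.8].
target_data = [([1.0, 0.0], 2.2), ([0.0, 1.0], -0.8)]

w_pre = sgd_linear([0.0, 0.0], source_data)   # pretrain on the source
w_fine = sgd_linear(w_pre, target_data)       # fine-tune on the target
```

Because the two distributions are close, fine-tuning from `w_pre` lands near the target weights despite seeing only two target samples, which is the regime the paper's analysis characterizes.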

[LG-138] Controlled Generation of Natural Adversarial Documents for Stealthy Retrieval Poisoning

链接: https://arxiv.org/abs/2410.02163
作者: Collin Zhang,Tingwei Zhang,Vitaly Shmatikov
关键词-EN: Recent work showed, Recent work, craft malicious documents, classes of queries, work showed
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent work showed that retrieval based on embedding similarity (e.g., for retrieval-augmented generation) is vulnerable to poisoning: an adversary can craft malicious documents that are retrieved in response to broad classes of queries. We demonstrate that previous, HotFlip-based techniques produce documents that are very easy to detect using perplexity filtering. Even if generation is constrained to produce low-perplexity text, the resulting documents are recognized as unnatural by LLMs and can be automatically filtered from the retrieval corpus. We design, implement, and evaluate a new controlled generation technique that combines an adversarial objective (embedding similarity) with a “naturalness” objective based on soft scores computed using an open-source, surrogate LLM. The resulting adversarial documents (1) cannot be automatically detected using perplexity filtering and/or other LLMs, except at the cost of significant false positives in the retrieval corpus, yet (2) achieve similar poisoning efficacy to easily-detectable documents generated using HotFlip, and (3) are significantly more effective than prior methods for energy-guided generation, such as COLD.

[LG-139] RiskSEA : A Scalable Graph Embedding for Detecting On-chain Fraudulent Activities on the Ethereum Blockchain

链接: https://arxiv.org/abs/2410.02160
作者: Ayush Agarwal,Lv Lu,Arjun Maheswaran,Varsha Mahadevan,Bhaskar Krishnamachari
关键词-EN: criminal activities, blockchain transaction graphs, blockchain, blockchain transaction, embedding
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2203.12363 by other authors

点击查看摘要

Abstract:Like any other useful technology, cryptocurrencies are sometimes used for criminal activities. While transactions are recorded on the blockchain, there exists a need for a more rapid and scalable method to detect addresses associated with fraudulent activities. We present RiskSEA, a scalable risk scoring system capable of effectively handling the dynamic nature of large-scale blockchain transaction graphs. The risk scoring system, which we implement for Ethereum, consists of 1. a scalable approach to generating node2vec embedding for entire set of addresses to capture the graph topology 2. transaction-based features to capture the transactional behavioral pattern of an address 3. a classifier model to generate risk score for addresses that combines the node2vec embedding and behavioral features. Efficiently generating node2vec embeddings for large-scale and dynamically evolving blockchain transaction graphs is challenging, so we present two novel approaches for generating node2vec embeddings and effectively scaling them to the entire set of blockchain addresses: 1. node2vec embedding propagation and 2. dynamic node2vec embedding. We present a comprehensive analysis of the proposed approaches. Our experiments show that combining both behavioral and node2vec features boosts the classification performance significantly, and that the dynamic node2vec embeddings perform better than the node2vec propagated embeddings.

[LG-140] Mitigating Memorization In Language Models

链接: https://arxiv.org/abs/2410.02159
作者: Mansi Sakarvadia,Aswathy Ajith,Arham Khan,Nathaniel Hudson,Caleb Geniesse,Kyle Chard,Yaoqing Yang,Ian Foster,Michael W. Mahoney
关键词-EN: Language models, encode training data, training data, extract training data, inference-time queries
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Language models (LMs) can “memorize” information, i.e., encode training data in their weights in such a way that inference-time queries can lead to verbatim regurgitation of that data. This ability to extract training data can be problematic, for example, when data are private or sensitive. In this work, we investigate methods to mitigate memorization: three regularizer-based, three finetuning-based, and eleven machine unlearning-based methods, with five of the latter being new methods that we introduce. We also introduce TinyMem, a suite of small, computationally-efficient LMs for the rapid development and evaluation of memorization-mitigation methods. We demonstrate that the mitigation methods that we develop using TinyMem can successfully be applied to production-grade LMs, and we determine via experiment that: regularizer-based mitigation methods are slow and ineffective at curbing memorization; fine-tuning-based methods are effective at curbing memorization, but overly expensive, especially for retaining higher accuracies; and unlearning-based methods are faster and more effective, allowing for the precise localization and removal of memorized information from LM weights prior to inference. We show, in particular, that our proposed unlearning method BalancedSubnet outperforms other mitigation methods at removing memorized information while preserving performance on target tasks.

[LG-141] ClassContrast: Bridging the Spatial and Contextual Gaps for Node Representations

链接: https://arxiv.org/abs/2410.02158
作者: Md Joshem Uddin,Astrit Tola,Varin Sikand,Cuneyt Gurcan Akcora,Baris Coskunuzer
关键词-EN: Graph Neural Networks, passing graph neural, Neural Networks, Graph Neural, message passing graph
类目: Machine Learning (cs.LG); Computational Geometry (cs.CG); Machine Learning (stat.ML)
*备注: 16 pages, 5 figures

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have revolutionized the domain of graph representation learning by utilizing neighborhood aggregation schemes in many popular architectures, such as message passing graph neural networks (MPGNNs). This scheme involves iteratively calculating a node’s representation vector by aggregating and transforming the representation vectors of its adjacent nodes. Despite their effectiveness, MPGNNs face significant issues, such as oversquashing, oversmoothing, and underreaching, which hamper their effectiveness. Additionally, the reliance of MPGNNs on the homophily assumption, where edges typically connect nodes with similar labels and features, limits their performance in heterophilic contexts, where connected nodes often have significant differences. This necessitates the development of models that can operate effectively in both homophilic and heterophilic settings. In this paper, we propose a novel approach, ClassContrast, grounded in Energy Landscape Theory from Chemical Physics, to overcome these limitations. ClassContrast combines spatial and contextual information, leveraging a physics-inspired energy landscape to model node embeddings that are both discriminative and robust across homophilic and heterophilic settings. Our approach introduces contrast-based homophily matrices to enhance the understanding of class interactions and tendencies. Through extensive experiments, we demonstrate that ClassContrast outperforms traditional GNNs in node classification and link prediction tasks, proving its effectiveness and versatility in diverse real-world scenarios.

[LG-142] Quantitative Approximation for Neural Operators in Nonlinear Parabolic Equations

链接: https://arxiv.org/abs/2410.02151
作者: Takashi Furuya,Koichi Taniguchi,Satoshi Okuda
关键词-EN: general continuous operators, Neural operators serve, Neural operators, serve as universal, universal approximators
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注: 31 pages

点击查看摘要

Abstract:Neural operators serve as universal approximators for general continuous operators. In this paper, we derive the approximation rate of solution operators for the nonlinear parabolic partial differential equations (PDEs), contributing to the quantitative approximation theorem for solution operators of nonlinear PDEs. Our results show that neural operators can efficiently approximate these solution operators without the exponential growth in model complexity, thus strengthening the theoretical foundation of neural operators. A key insight in our proof is to transfer PDEs into the corresponding integral equations via Duhamel’s principle, and to leverage the similarity between neural operators and Picard’s iteration, a classical algorithm for solving PDEs. This approach is potentially generalizable beyond parabolic PDEs to a range of other equations, including the Navier-Stokes equation, nonlinear Schrödinger equations and nonlinear wave equations, which can be solved by Picard’s iteration.
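The Duhamel reduction can be stated in its generic form. For a semilinear parabolic equation $\partial_t u = \Delta u + N(u)$ with $u(0) = u_0$ (standard notation; the paper's precise setting may differ), the mild (integral) form and the associated Picard iteration are:

```latex
u(t) = e^{t\Delta} u_0 + \int_0^t e^{(t-s)\Delta} N(u(s)) \, ds,
\qquad
u^{(k+1)}(t) = e^{t\Delta} u_0 + \int_0^t e^{(t-s)\Delta} N\bigl(u^{(k)}(s)\bigr) \, ds .
```

The first identity rewrites the PDE as a fixed-point equation; the second iterates toward that fixed point, and it is this layered iterative structure that the proof matches against stacked neural-operator layers.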

[LG-143] Reducing Warning Errors in Driver Support with Personalized Risk Maps

链接: https://arxiv.org/abs/2410.02148
作者: Tim Puphal,Ryohei Hirano,Takayuki Kawabuchi,Akihito Kimata,Julian Eggert
关键词-EN: problem of human-focused, human-focused driver support, driver, warning, personalized Risk Maps
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider the problem of human-focused driver support. State-of-the-art personalization concepts make it possible to estimate parameters for vehicle control systems or driver models. However, few proposed approaches use personalized models and evaluate their effectiveness in the form of general risk warnings. In this paper, we therefore propose a warning system that estimates a personalized risk factor for the given driver based on the driver’s behavior. The system then adapts the warning signal using personalized Risk Maps. In experiments, we show examples for longitudinal following and intersection scenarios in which the novel warning system can effectively reduce false negative errors and false positive errors compared to a baseline approach which does not use personalized driver considerations. This underlines the potential of personalization for reducing warning errors in risk warning and driver support.

[LG-144] Efficient Source-Free Time-Series Adaptation via Parameter Subspace Disentanglement

链接: https://arxiv.org/abs/2410.02147
作者: Gaurav Patel,Christopher Sandino,Behrooz Mahasseni,Ellen L Zippi,Erdrin Azemi,Ali Moin,Juri Minxha
关键词-EN: efficient Source-Free Domain, Source-Free Domain Adaptation, Source-Free Domain, Domain Adaptation, context of time-series
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:In this paper, we propose a framework for efficient Source-Free Domain Adaptation (SFDA) in the context of time-series, focusing on enhancing both parameter efficiency and data-sample utilization. Our approach introduces an improved paradigm for source-model preparation and target-side adaptation, aiming to enhance training efficiency during target adaptation. Specifically, we reparameterize the source model’s weights in a Tucker-style decomposed manner, factorizing the model into a compact form during the source model preparation phase. During target-side adaptation, only a subset of these decomposed factors is fine-tuned, leading to significant improvements in training efficiency. We demonstrate using PAC Bayesian analysis that this selective fine-tuning strategy implicitly regularizes the adaptation process by constraining the model’s learning capacity. Furthermore, this re-parameterization reduces the overall model size and enhances inference efficiency, making the approach particularly well suited for resource-constrained devices. Additionally, we demonstrate that our framework is compatible with various SFDA methods and achieves significant computational efficiency, reducing the number of fine-tuned parameters and inference overhead in terms of MACs by over 90% while maintaining model performance.
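For a plain 2-D weight matrix, a Tucker-style reparameterization degenerates to a three-factor product W ≈ U G Vᵀ; freezing U and V and fine-tuning only the small core G is what drives the parameter savings. The sketch below is an illustrative pure-Python stand-in (an SVD-like 2-D analogue, not the paper's actual decomposition code):

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def factored_forward(U, G, V, x):
    """Forward pass y = U G V^T x of a factorized linear layer.
    U: d_out x r, G: r x r core, V: d_in x r."""
    h = matvec([list(col) for col in zip(*V)], x)  # V^T x
    return matvec(U, matvec(G, h))

def trainable_fraction(d_out, d_in, rank):
    """Share of effective weights updated when only the rank x rank
    core G is fine-tuned while U and V stay frozen."""
    return (rank * rank) / (d_out * d_in)
```

With d_out = d_in = 64 and rank 4, the core holds 16 of 4096 effective weights, so under 1% of the parameters are touched during target adaptation, in line with the over-90% reduction the abstract reports.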

[LG-145] Active Learning of Deep Neural Networks via Gradient-Free Cutting Planes

链接: https://arxiv.org/abs/2410.02145
作者: Erica Zhang,Fangzhao Zhang,Mert Pilanci
关键词-EN: improve sample complexity, Active learning, active learning scheme, aim to improve, improve sample
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Active learning methods aim to improve sample complexity in machine learning. In this work, we investigate an active learning scheme via a novel gradient-free cutting-plane training method for ReLU networks of arbitrary depth. We demonstrate, for the first time, that cutting-plane algorithms, traditionally used in linear models, can be extended to deep neural networks despite their nonconvexity and nonlinear decision boundaries. Our results demonstrate that these methods provide a promising alternative to the commonly employed gradient-based optimization techniques in large-scale neural networks. Moreover, this training method induces the first deep active learning scheme known to achieve convergence guarantees. We demonstrate the effectiveness of our proposed active learning method against popular deep active learning baselines via both synthetic data experiments and a sentiment classification task on real datasets.

[LG-146] SoundMorpher: Perceptually-Uniform Sound Morphing with Diffusion Model

链接: https://arxiv.org/abs/2410.02144
作者: Xinlei Niu,Jing Zhang,Charles Patrick Martin
关键词-EN: uniform morphing trajectories, generates perceptually uniform, morphing methods models, perceptually uniform morphing, sound morphing
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:We present SoundMorpher, a sound morphing method that generates perceptually uniform morphing trajectories using a diffusion model. Traditional sound morphing methods model the intractable relationship between the morph factor and the perceptual stimuli of the resulting sounds under a linear assumption, which oversimplifies the complex nature of sound perception and limits their morph quality. In contrast, SoundMorpher explores an explicit proportional mapping between the morph factor and the perceptual stimuli of morphed sounds based on Mel-spectrogram. This approach enables smoother transitions between intermediate sounds and ensures perceptually consistent transformations, which can be easily extended to diverse sound morphing tasks. Furthermore, we present a set of quantitative metrics to comprehensively assess sound morphing systems based on three objective criteria, namely, correspondence, perceptual intermediateness, and smoothness. We provide extensive experiments to demonstrate the effectiveness and versatility of SoundMorpher in real-world scenarios, highlighting its potential impact on various applications such as creative music composition, film post-production and interactive audio technologies.

[LG-147] Plug-and-Play Controllable Generation for Discrete Masked Models

链接: https://arxiv.org/abs/2410.02143
作者: Wei Guo,Yuchen Zhu,Molei Tao,Yongxin Chen
关键词-EN: article makes discrete, makes discrete masked, article makes, generative modeling, discrete data controllable
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This article makes discrete masked models for the generative modeling of discrete data controllable. The goal is to generate samples of a discrete random variable that adheres to a posterior distribution, satisfies specific constraints, or optimizes a reward function. This methodological development enables broad applications across downstream tasks such as class-specific image generation and protein design. Existing approaches for controllable generation of masked models typically rely on task-specific fine-tuning or additional modifications, which can be inefficient and resource-intensive. To overcome these limitations, we propose a novel plug-and-play framework based on importance sampling that bypasses the need for training a conditional score. Our framework is agnostic to the choice of control criteria, requires no gradient information, and is well-suited for tasks such as posterior sampling, Bayesian inverse problems, and constrained generation. We demonstrate the effectiveness of our approach through extensive experiments, showcasing its versatility across multiple domains, including protein design.
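The training-free, gradient-free flavor of the approach can be illustrated with self-normalized importance sampling, where `propose` and `reward` are hypothetical stand-ins for the pretrained masked model and the control criterion:

```python
import numpy as np

rng = np.random.default_rng(0)

def propose(n):
    # Stand-in for unconditional sampling from a pretrained masked model
    return rng.integers(0, 4, size=(n, 8))   # n sequences over a 4-symbol vocab

def reward(x):
    # Toy control criterion: how often symbol 0 appears in each sequence
    return (x == 0).sum(axis=1).astype(float)

def controlled_sample(n_proposals, n_keep):
    # Self-normalized importance sampling: reweight unconditional samples
    # by exp(reward) and resample -- no conditional score, no gradients.
    samples = propose(n_proposals)
    logw = reward(samples)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    idx = rng.choice(n_proposals, size=n_keep, p=w, replace=True)
    return samples[idx], samples
```

The resampled set concentrates on high-reward sequences even though the generator itself was never modified.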

[LG-148] A Formal Framework for Understanding Length Generalization in Transformers

链接: https://arxiv.org/abs/2410.02140
作者: Xinting Huang,Andy Yang,Satwik Bhattamishra,Yash Sarrof,Andreas Krebs,Hattie Zhou,Preetum Nakkiran,Michael Hahn
关键词-EN: observed during training, length generalization, major challenge, generalizing to sequences, sequences longer
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A major challenge for transformers is generalizing to sequences longer than those observed during training. While previous works have empirically shown that transformers can either succeed or fail at length generalization depending on the task, theoretical understanding of this phenomenon remains limited. In this work, we introduce a rigorous theoretical framework to analyze length generalization in causal transformers with learnable absolute positional encodings. In particular, we characterize those functions that are identifiable in the limit from sufficiently long inputs with absolute positional encodings under an idealized inference scheme using a norm-based regularizer. This enables us to prove the possibility of length generalization for a rich family of problems. We experimentally validate the theory as a predictor of success and failure of length generalization across a range of algorithmic and formal language tasks. Our theory not only explains a broad set of empirical observations but also opens the way to provably predicting length generalization capabilities in transformers.

[LG-149] Disentangled Representation Learning for Parametric Partial Differential Equations

链接: https://arxiv.org/abs/2410.02136
作者: Ning Liu,Lu Zhang,Tian Gao,Yue Yu
关键词-EN: partial differential equations, demonstrated remarkable success, neural operator parameters, neural operator, complex physical systems
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural operators (NOs) have demonstrated remarkable success in learning mappings between function spaces, serving as efficient approximators for the forward solutions of complex physical systems governed by partial differential equations (PDEs). However, while effective as black-box solvers, they offer limited insight into the underlying physical mechanism, due to the lack of interpretable representations of the physical parameters that drive the system. To tackle this challenge, we propose a new paradigm for learning disentangled representations from neural operator parameters, thereby effectively solving an inverse problem. Specifically, we introduce DisentangO, a novel hyper-neural operator architecture designed to unveil and disentangle the latent physical factors of variation embedded within the black-box neural operator parameters. At the core of DisentangO is a multi-task neural operator architecture that distills the varying parameters of the governing PDE through a task-wise adaptive layer, coupled with a hierarchical variational autoencoder that disentangles these variations into identifiable latent factors. By learning these disentangled representations, our model not only enhances physical interpretability but also enables more robust generalization across diverse physical systems. Empirical evaluations across supervised, semi-supervised, and unsupervised learning contexts show that DisentangO effectively extracts meaningful and interpretable latent features, bridging the divide between predictive performance and physical understanding in neural operator frameworks.

[LG-150] TrajGPT: Irregular Time-Series Representation Learning for Health Trajectory Analysis

链接: https://arxiv.org/abs/2410.02133
作者: Ziyang Song,Qingcheng Lu,He Zhu,David Buckeridge,Yue Li
关键词-EN: Generative Pre-trained Transformer, intervals between observations, irregularly sampled, sampled with varying, varying intervals
类目: Machine Learning (cs.LG)
*备注: 9 pages

点击查看摘要

Abstract:In many domains, such as healthcare, time-series data is often irregularly sampled with varying intervals between observations. This poses challenges for classical time-series models that require equally spaced data. To address this, we propose a novel time-series Transformer called Trajectory Generative Pre-trained Transformer (TrajGPT). TrajGPT employs a novel Selective Recurrent Attention (SRA) mechanism, which utilizes a data-dependent decay to adaptively filter out irrelevant past information based on contexts. By interpreting TrajGPT as discretized ordinary differential equations (ODEs), it effectively captures the underlying continuous dynamics and enables time-specific inference for forecasting arbitrary target timesteps. Experimental results demonstrate that TrajGPT excels in trajectory forecasting, drug usage prediction, and phenotype classification without requiring task-specific fine-tuning. By evolving the learned continuous dynamics, TrajGPT can interpolate and extrapolate disease risk trajectories from partially-observed time series. The visualization of predicted health trajectories shows that TrajGPT forecasts unseen diseases based on the history of clinically relevant phenotypes (i.e., contexts).
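A hypothetical sketch of a data-dependent-decay recurrence in the spirit of SRA follows; the matrices `Wk`, `Wv`, `Wa` and the softplus decay rate are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def sra_sketch(x, dt, Wk, Wv, Wa):
    """Recurrent attention state with data-dependent decay (illustrative).

    x:  (T, d) irregularly-sampled observations
    dt: (T,)   time gaps since the previous observation
    The decay lam = exp(-softplus(x @ Wa) * dt) forgets stale context
    faster when the observation gap is large -- the discretized-ODE view.
    """
    r, dv = Wk.shape[1], Wv.shape[1]
    state = np.zeros((r, dv))
    outs = []
    for t in range(len(x)):
        rate = np.log1p(np.exp(x[t] @ Wa))      # softplus > 0, shape (r,)
        lam = np.exp(-rate * dt[t])             # per-channel decay in (0, 1]
        k, v = x[t] @ Wk, x[t] @ Wv
        state = lam[:, None] * state + np.outer(k, v)
        outs.append(k @ state)
    return np.array(outs)
```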

[LG-151] Nonuniform random feature models using derivative information

链接: https://arxiv.org/abs/2410.02132
作者: Konstantin Pieper,Zezhong Zhang,Guannan Zhang
关键词-EN: propose nonuniform data-driven, nonuniform data-driven parameter, data-driven parameter distributions, network initialization based, neural network initialization
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:We propose nonuniform data-driven parameter distributions for neural network initialization based on derivative data of the function to be approximated. These parameter distributions are developed in the context of non-parametric regression models based on shallow neural networks, and compare favorably to well-established uniform random feature models based on conventional weight initialization. We address the cases of Heaviside and ReLU activation functions, and their smooth approximations (sigmoid and softplus), and use recent results on the harmonic analysis and sparse representation of neural networks resulting from fully trained optimal networks. Extending analytic results that give exact representation, we obtain densities that concentrate in regions of the parameter space corresponding to neurons that are well suited to model the local derivatives of the unknown function. Based on these results, we suggest simplifications of these exact densities based on approximate derivative data in the input points that allow for very efficient sampling and lead to performance of random feature models close to optimal networks in several scenarios.
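The core idea, concentrating ReLU breakpoints where approximate second-derivative data is large, can be sketched in 1-D (the sampling density here is a simplification of the paper's exact densities):

```python
import numpy as np

def derivative_aware_relu_features(x_grid, f2, n_feat, rng):
    """1-D ReLU random features relu(w*x + b) whose breakpoints x0 = -b/w
    are sampled with probability proportional to |f''(x0)| (approximate
    derivative data), so capacity concentrates where the target bends.
    Illustrative sketch only."""
    p = np.abs(f2) / np.abs(f2).sum()
    x0 = rng.choice(x_grid, size=n_feat, p=p)
    w = rng.choice([-1.0, 1.0], size=n_feat)
    b = -w * x0
    return w, b

def featurize(x, w, b):
    # Feature matrix: one ReLU neuron per column
    return np.maximum(0.0, np.outer(x, w) + b)
```

For a target like f(x) = |x|, whose curvature is a spike at the kink, the sampled breakpoints cluster tightly around x = 0, exactly where uniform random features would waste capacity.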

[LG-152] C-MELT: Contrastive Enhanced Masked Auto-Encoders for ECG-Language Pre-Training

链接: https://arxiv.org/abs/2410.02131
作者: Manh Pham,Aaqib Saeed,Dong Ma
关键词-EN: diagnosing cardiovascular diseases, Accurate interpretation, interpretation of Electrocardiogram, cardiovascular diseases, Integrating ECG signals
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Accurate interpretation of Electrocardiogram (ECG) signals is pivotal for diagnosing cardiovascular diseases. Integrating ECG signals with their accompanying textual reports holds immense potential to enhance clinical diagnostics through the combination of physiological data and qualitative insights. However, this integration faces significant challenges due to inherent modality disparities and the scarcity of labeled data for robust cross-modal learning. To address these obstacles, we propose C-MELT, a novel framework that pre-trains ECG and text data using a contrastive masked auto-encoder architecture. C-MELT uniquely combines the strengths of generative with enhanced discriminative capabilities to achieve robust cross-modal representations. This is accomplished through masked modality modeling, specialized loss functions, and an improved negative sampling strategy tailored for cross-modal alignment. Extensive experiments on five public datasets across diverse downstream tasks demonstrate that C-MELT significantly outperforms existing methods, achieving 15% and 2% increases in linear probing and zero-shot performance over state-of-the-art models, respectively. These results highlight the effectiveness of C-MELT, underscoring its potential to advance automated clinical diagnostics through multi-modal representations.
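The cross-modal alignment term can be illustrated with a standard symmetric InfoNCE loss over paired embeddings (C-MELT's actual objective additionally includes masked modality modeling and a tailored negative-sampling strategy):

```python
import numpy as np

def info_nce(ecg_emb, txt_emb, tau=0.07):
    """Symmetric InfoNCE over a batch of paired (ECG, report) embeddings:
    each ECG should score highest against its own report and vice versa."""
    e = ecg_emb / np.linalg.norm(ecg_emb, axis=1, keepdims=True)
    t = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = e @ t.T / tau                 # (B, B): row i vs all reports
    labels = np.arange(len(e))

    def ce(lg):
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (ce(logits) + ce(logits.T))
```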

[LG-153] Breaking the mold: The challenge of large scale MARL specialization

链接: https://arxiv.org/abs/2410.02128
作者: Stefan Juang,Hugh Cao,Arielle Zhou,Ruochen Liu,Nevin L. Zhang,Elvis Liu
关键词-EN: predominant approach focuses, Comparative Advantage Maximization, Advantage Maximization, predominant approach, approach focuses
类目: Machine Learning (cs.LG)
*备注: 19 pages

点击查看摘要

Abstract:In multi-agent learning, the predominant approach focuses on generalization, often neglecting the optimization of individual agents. This emphasis on generalization limits the ability of agents to utilize their unique strengths, resulting in inefficiencies. This paper introduces Comparative Advantage Maximization (CAM), a method designed to enhance individual agent specialization in multiagent systems. CAM employs a two-phase process, combining centralized population training with individual specialization through comparative advantage maximization. CAM achieved a 13.2% improvement in individual agent performance and a 14.9% increase in behavioral diversity compared to state-of-the-art systems. The success of CAM highlights the importance of individual agent specialization, suggesting new directions for multi-agent system development.

[LG-154] BayesCNS: A Unified Bayesian Approach to Address Cold Start and Non-Stationarity in Search Systems at Scale

链接: https://arxiv.org/abs/2410.02126
作者: Randy Ardywibowo,Rakesh Sunki,Lucy Kuo,Sankalp Nayak
关键词-EN: platforms frequently employ, recommendation platforms frequently, Information Retrieval, frequently employ, recommendation platforms
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Information Retrieval (IR) systems used in search and recommendation platforms frequently employ Learning-to-Rank (LTR) models to rank items in response to user queries. These models heavily rely on features derived from user interactions, such as clicks and engagement data. This dependence introduces cold start issues for items lacking user engagement and poses challenges in adapting to non-stationary shifts in user behavior over time. We address both challenges holistically as an online learning problem and propose BayesCNS, a Bayesian approach designed to handle cold start and non-stationary distribution shifts in search systems at scale. BayesCNS achieves this by estimating prior distributions for user-item interactions, which are continuously updated with new user interactions gathered online. This online learning procedure is guided by a ranker model, enabling efficient exploration of relevant items using contextual information provided by the ranker. We successfully deployed BayesCNS in a large-scale search system and demonstrated its efficacy through comprehensive offline and online experiments. Notably, an online A/B experiment showed a 10.60% increase in new item interactions and a 1.05% improvement in overall success metrics over the existing production baseline.
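The cold-start mechanism can be illustrated with a conjugate Beta-Bernoulli model: new items inherit a shared prior, each interaction updates that item's posterior online, and Thompson sampling balances exploring cold items against exploiting known ones (a minimal sketch, not the production system):

```python
import numpy as np

class BetaBernoulliCTR:
    """Per-item click-rate posterior with a shared prior for cold-start items.

    Items with no history fall back to the prior (alpha0, beta0); every
    impression/click updates the posterior, and sample_ctr draws a
    Thompson-sampling estimate for ranking.
    """
    def __init__(self, alpha0=1.0, beta0=20.0, seed=0):
        self.alpha0, self.beta0 = alpha0, beta0
        self.counts = {}                      # item -> (clicks, skips)
        self.rng = np.random.default_rng(seed)

    def update(self, item, clicked):
        a, b = self.counts.get(item, (0, 0))
        self.counts[item] = (a + int(clicked), b + int(not clicked))

    def sample_ctr(self, item):
        a, b = self.counts.get(item, (0, 0))
        return self.rng.beta(self.alpha0 + a, self.beta0 + b)
```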

[LG-155] Lossy Cooperative UAV Relaying Networks: Outage Probability Analysis and Location Optimization

链接: https://arxiv.org/abs/2410.02120
作者: Ya Lian,Wensheng Lin,Lixin Li,Fucheng Yang,Zhu Han,Tad Matsumoto
关键词-EN: unmanned aerial vehicle, cooperative unmanned aerial, lossy cooperative unmanned, relay communication system, system outage probability
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:In this paper, performance of a lossy cooperative unmanned aerial vehicle (UAV) relay communication system is analyzed. In this system, the UAV relay adopts lossy forward (LF) strategy and the receiver has certain distortion requirements for the received information. For the system described above, we first derive the achievable rate distortion region of the system. Then, on the basis of the region analysis, the system outage probability when the channel suffers Nakagami- m fading is analyzed. Finally, we design an optimal relay position identification algorithm based on the Soft Actor-Critic (SAC) algorithm, which determines the optimal UAV position to minimize the outage probability. The simulation results show that the proposed algorithm can optimize the UAV position and reduce the system outage probability effectively.
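The outage probability under Nakagami-m fading can be estimated by Monte Carlo, using the fact that the channel power gain is Gamma-distributed; this is a minimal sketch of the quantity being minimized, not the paper's closed-form analysis or its SAC-based position optimizer:

```python
import numpy as np

def outage_probability(snr_db, rate_threshold, m=2.0, omega=1.0, n=200_000, seed=0):
    """Monte Carlo outage probability for a single Nakagami-m fading link.

    Under Nakagami-m fading the power gain |h|^2 follows
    Gamma(shape=m, scale=omega/m); outage occurs when the instantaneous
    achievable rate log2(1 + SNR) falls below the target rate.
    """
    rng = np.random.default_rng(seed)
    gain = rng.gamma(shape=m, scale=omega / m, size=n)
    snr = 10 ** (snr_db / 10) * gain
    rate = np.log2(1 + snr)
    return np.mean(rate < rate_threshold)
```

Higher average SNR (e.g. from a better relay position) drives the estimated outage probability down, which is the objective the relay-placement algorithm targets.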

[LG-156] Searching for Efficient Linear Layers over a Continuous Space of Structured Matrices NEURIPS2024

链接: https://arxiv.org/abs/2410.02117
作者: Andres Potapczynski,Shikai Qiu,Marc Finzi,Christopher Ferri,Zixi Chen,Micah Goldblum,Bayan Bruss,Christopher De Sa,Andrew Gordon Wilson
关键词-EN: dominant computational bottleneck, presenting a critical, efficient alternatives, Dense linear layers, dense layers
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: NeurIPS 2024. Code available at this https URL

点击查看摘要

Abstract:Dense linear layers are the dominant computational bottleneck in large neural networks, presenting a critical need for more efficient alternatives. Previous efforts focused on a small number of hand-crafted structured matrices and neglected to investigate whether these structures can surpass dense layers in terms of compute-optimal scaling laws when both the model size and training examples are optimally allocated. In this work, we present a unifying framework that enables searching among all linear operators expressible via an Einstein summation. This framework encompasses many previously proposed structures, such as low-rank, Kronecker, Tensor-Train, Block Tensor-Train (BTT), and Monarch, along with many novel structures. To analyze the framework, we develop a taxonomy of all such operators based on their computational and algebraic properties and show that differences in the compute-optimal scaling laws are mostly governed by a small number of variables that we introduce. Namely, a small \omega (which measures parameter sharing) and large \psi (which measures the rank) reliably led to better scaling laws. Guided by the insight that full-rank structures that maximize parameters per unit of compute perform the best, we propose BTT-MoE, a novel Mixture-of-Experts (MoE) architecture obtained by sparsifying computation in the BTT structure. In contrast to the standard sparse MoE for each entire feed-forward network, BTT-MoE learns an MoE in every single linear layer of the model, including the projection matrices in the attention blocks. We find BTT-MoE provides a substantial compute-efficiency gain over dense layers and standard MoE.
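Two classical members of this search space, low-rank and Kronecker layers, are directly expressible as Einstein summations, which is the unifying representation the framework searches over:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8
x = rng.standard_normal(d)

# Low-rank: W = U @ V applied as two einsums (2*d*r params instead of d*d)
U, V = rng.standard_normal((d, r)), rng.standard_normal((r, d))
y_lowrank = np.einsum('dr,r->d', U, np.einsum('rd,d->r', V, x))

# Kronecker: W = A (x) B acting on x viewed as an (8, 8) block,
# never materializing the full 64x64 matrix
A, B = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
y_kron = np.einsum('ij,kl,jl->ik', A, B, x.reshape(8, 8)).reshape(d)
```

Both einsum forms compute the same output as the dense matrix-vector product while exposing the structure (rank, parameter sharing) that the paper's taxonomy variables measure.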

[LG-157] Dataset Distillation via Knowledge Distillation: Towards Efficient Self-Supervised Pre-Training of Deep Networks

链接: https://arxiv.org/abs/2410.02116
作者: Siddharth Joshi,Jiayi Ni,Baharan Mirzasoleiman
关键词-EN: memory and compute, amount of memory, train deep networks, SSL, deep networks
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Dataset distillation (DD) generates small synthetic datasets that can efficiently train deep networks with a limited amount of memory and compute. Despite the success of DD methods for supervised learning, DD for self-supervised pre-training of deep models has remained unaddressed. Pre-training on unlabeled data is crucial for efficiently generalizing to downstream tasks with limited labeled data. In this work, we propose the first effective DD method for SSL pre-training. First, we show, theoretically and empirically, that naive application of supervised DD methods to SSL fails, due to the high variance of the SSL gradient. Then, we address this issue by relying on insights from knowledge distillation (KD) literature. Specifically, we train a small student model to match the representations of a larger teacher model trained with SSL. Then, we generate a small synthetic dataset by matching the training trajectories of the student models. As the KD objective has considerably lower variance than SSL, our approach can generate synthetic datasets that can successfully pre-train high-quality encoders. Through extensive experiments, we show that our distilled sets lead to up to 13% higher accuracy than prior work, on a variety of downstream tasks, in the presence of limited labeled data.

[LG-158] Mamba Neural Operator: Who Wins? Transformers vs. State-Space Models for PDEs

链接: https://arxiv.org/abs/2410.02113
作者: Chun-Wun Cheng,Jiahao Huang,Yi Zhang,Guang Yang,Carola-Bibiane Schönlieb,Angelica I Aviles-Rivero
关键词-EN: Partial differential equations, complex physical systems, Partial differential, model complex physical, differential equations
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Partial differential equations (PDEs) are widely used to model complex physical systems, but solving them efficiently remains a significant challenge. Recently, Transformers have emerged as the preferred architecture for PDEs due to their ability to capture intricate dependencies. However, they struggle with representing continuous dynamics and long-range interactions. To overcome these limitations, we introduce the Mamba Neural Operator (MNO), a novel framework that enhances neural operator-based techniques for solving PDEs. MNO establishes a formal theoretical connection between structured state-space models (SSMs) and neural operators, offering a unified structure that can adapt to diverse architectures, including Transformer-based models. By leveraging the structured design of SSMs, MNO captures long-range dependencies and continuous dynamics more effectively than traditional Transformers. Through extensive analysis, we show that MNO significantly boosts the expressive power and accuracy of neural operators, making it not just a complement but a superior framework for PDE-related tasks, bridging the gap between efficient representation and accurate solution approximation.

[LG-159] Can LLMs Reliably Simulate Human Learner Actions? A Simulation Authoring Framework for Open-Ended Learning Environments

链接: https://arxiv.org/abs/2410.02110
作者: Amogh Mannekote,Adam Davies,Jina Kang,Kristy Elizabeth Boyer
关键词-EN: Simulating learner actions, adaptations before deployment, actions helps stress-test, prototype new adaptations, interactive learning environments
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Simulating learner actions helps stress-test open-ended interactive learning environments and prototype new adaptations before deployment. While recent studies show the promise of using large language models (LLMs) for simulating human behavior, such approaches have not gone beyond rudimentary proof-of-concept stages due to key limitations. First, LLMs are highly sensitive to minor prompt variations, raising doubts about their ability to generalize to new scenarios without extensive prompt engineering. Moreover, apparently successful outcomes can often be unreliable, either because domain experts unintentionally guide LLMs to produce expected results, leading to self-fulfilling prophecies; or because the LLM has encountered highly similar scenarios in its training data, meaning that models may not be simulating behavior so much as regurgitating memorized content. To address these challenges, we propose Hyp-Mix, a simulation authoring framework that allows experts to develop and evaluate simulations by combining testable hypotheses about learner behavior. Testing this framework in a physics learning environment, we found that GPT-4 Turbo maintains calibrated behavior even as the underlying learner model changes, providing the first evidence that LLMs can be used to simulate realistic behaviors in open-ended interactive learning environments, a necessary prerequisite for useful LLM behavioral simulation.

[LG-160] Orient Anything

链接: https://arxiv.org/abs/2410.02101
作者: Christopher Scarvelis,David Benhaim,Paul Zhang
关键词-EN: analysis which consists, consists of estimating, orientation axes, Orientation, shape orientation axes
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Orientation estimation is a fundamental task in 3D shape analysis which consists of estimating a shape’s orientation axes: its side-, up-, and front-axes. Using this data, one can rotate a shape into canonical orientation, where its orientation axes are aligned with the coordinate axes. Developing an orientation algorithm that reliably estimates complete orientations of general shapes remains an open problem. We introduce a two-stage orientation pipeline that achieves state of the art performance on up-axis estimation and further demonstrate its efficacy on full-orientation estimation, where one seeks all three orientation axes. Unlike previous work, we train and evaluate our method on all of Shapenet rather than a subset of classes. We motivate our engineering contributions by theory describing fundamental obstacles to orientation estimation for rotationally-symmetric shapes, and show how our method avoids these obstacles.
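For context, the classical baseline is PCA orientation, which is ambiguous for rotationally-symmetric shapes (equal singular values) — the obstacle the paper's theory describes. An illustrative sketch:

```python
import numpy as np

def pca_orientation(points):
    """Classical baseline: orientation axes = principal axes of the cloud.

    Returns a 3x3 matrix whose rows are descending-variance directions,
    flipped if needed to form a right-handed frame. Degenerates when two
    singular values coincide (rotational symmetry)."""
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    axes = vt.copy()
    if np.linalg.det(axes) < 0:      # enforce a right-handed frame
        axes[2] *= -1
    return axes
```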

[LG-161] A Watermark for Black-Box Language Models

链接: https://arxiv.org/abs/2410.02099
作者: Dara Bahri,John Wieting,Dana Alon,Donald Metzler
关键词-EN: large language models, recently emerged, effective strategy, strategy for detecting, detecting the outputs
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Watermarking has recently emerged as an effective strategy for detecting the outputs of large language models (LLMs). Most existing schemes require \emphwhite-box access to the model’s next-token probability distribution, which is typically not accessible to downstream users of an LLM API. In this work, we propose a principled watermarking scheme that requires only the ability to sample sequences from the LLM (i.e. \emphblack-box access), boasts a \emphdistortion-free property, and can be chained or nested using multiple secret keys. We provide performance guarantees, demonstrate how it can be leveraged when white-box access is available, and show when it can outperform existing white-box schemes via comprehensive experiments.
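A toy illustration of sampling-only watermarking — draw several candidates and keep the one maximizing a keyed pseudorandom score — conveys the black-box detection idea. Note the paper's scheme is more careful (provably distortion-free, with key chaining); this naive max rule is not:

```python
import hashlib

def keyed_score(text, key):
    # Pseudorandom score in [0, 1) derived from a secret key and the text
    h = hashlib.sha256((key + text).encode()).hexdigest()
    return int(h, 16) / 16**64

def watermarked_sample(generate, key, k=8):
    # Black-box sketch: draw k candidates from the model (sampling access
    # only) and keep the one with the highest keyed score.
    return max((generate() for _ in range(k)), key=lambda t: keyed_score(t, key))

def detected(text, key, threshold=0.9):
    # Key holders flag text whose keyed score is improbably high
    return keyed_score(text, key) > threshold
```

Without the key, the score of any text looks uniform, so third parties cannot detect (or strip) the mark.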

[LG-162] EC-DIT: Scaling Diffusion Transformers with Adaptive Expert-Choice Routing

链接: https://arxiv.org/abs/2410.02098
作者: Haotian Sun,Bowen Zhang,Yanghao Li,Haoshuo Huang,Tao Lei,Ruoming Pang,Bo Dai,Nan Du
关键词-EN: widely adopted, Diffusion transformers, models, Diffusion, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion transformers have been widely adopted for text-to-image synthesis. While scaling these models up to billions of parameters shows promise, the effectiveness of scaling beyond current sizes remains underexplored and challenging. By explicitly exploiting the computational heterogeneity of image generations, we develop a new family of Mixture-of-Experts (MoE) models (EC-DIT) for diffusion transformers with expert-choice routing. EC-DIT learns to adaptively optimize the compute allocated to understand the input texts and generate the respective image patches, enabling heterogeneous computation aligned with varying text-image complexities. This heterogeneity provides an efficient way of scaling EC-DIT up to 97 billion parameters and achieving significant improvements in training convergence, text-to-image alignment, and overall generation quality over dense models and conventional MoE models. Through extensive ablations, we show that EC-DIT demonstrates superior scalability and adaptive compute allocation by recognizing varying textual importance through end-to-end training. Notably, in text-to-image alignment evaluation, our largest models achieve a state-of-the-art GenEval score of 71.68% and still maintain competitive inference speed with intuitive interpretability.
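The expert-choice routing rule can be sketched as follows: each expert selects its top-capacity tokens (rather than each token picking experts), so experts are perfectly load-balanced while tokens receive a variable amount of compute matching their complexity (a minimal sketch of the routing rule only):

```python
import numpy as np

def expert_choice_route(scores, capacity):
    """Expert-choice routing: each EXPERT picks its top-`capacity` tokens.

    scores: (n_tokens, n_experts) affinity logits.
    Returns a boolean assignment matrix; every expert column sums to
    `capacity` by construction, while per-token compute varies."""
    n_tokens, n_experts = scores.shape
    assign = np.zeros((n_tokens, n_experts), dtype=bool)
    for e in range(n_experts):
        assign[np.argsort(scores[:, e])[-capacity:], e] = True
    return assign
```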

[LG-163] HyperBrain: Anomaly Detection for Temporal Hypergraph Brain Networks

链接: https://arxiv.org/abs/2410.02087
作者: Sadaf Sadeghian,Xiaoxiao Li,Margo Seltzer
关键词-EN: Identifying unusual brain, Identifying unusual, brain, brain networks, unusual brain activity
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Identifying unusual brain activity is a crucial task in neuroscience research, as it aids in the early detection of brain disorders. It is common to represent brain networks as graphs, and researchers have developed various graph-based machine learning methods for analyzing them. However, the majority of existing graph learning tools for the brain face a combination of the following three key limitations. First, they focus only on pairwise correlations between regions of the brain, limiting their ability to capture synchronized activity among larger groups of regions. Second, they model the brain network as a static network, overlooking the temporal changes in the brain. Third, most are designed only for classifying brain networks as healthy or disordered, lacking the ability to identify abnormal brain activity patterns linked to biomarkers associated with disorders. To address these issues, we present HyperBrain, an unsupervised anomaly detection framework for temporal hypergraph brain networks. HyperBrain models fMRI time series data as temporal hypergraphs capturing dynamic higher-order interactions. It then uses a novel customized temporal walk (BrainWalk) and neural encodings to detect abnormal co-activations among brain regions. We evaluate the performance of HyperBrain in both synthetic and real-world settings for Autism Spectrum Disorder and Attention Deficit Hyperactivity Disorder(ADHD). HyperBrain outperforms all other baselines on detecting abnormal co-activations in brain networks. Furthermore, results obtained from HyperBrain are consistent with clinical research on these brain disorders. Our findings suggest that learning temporal and higher-order connections in the brain provides a promising approach to uncover intricate connectivity patterns in brain networks, offering improved diagnosis.

[LG-164] Anchors Aweigh! Sail for Optimal Unified Multi-Modal Representations

链接: https://arxiv.org/abs/2410.02086
作者: Minoh Jeong,Min Namgung,Zae Myung Kim,Dongyeop Kang,Yao-Yi Chiang,Alfred Hero
关键词-EN: diverse data sources, utilize diverse data, enabling machine learning, machine learning models, downstream tasks
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Multimodal learning plays a crucial role in enabling machine learning models to fuse and utilize diverse data sources, such as text, images, and audio, to support a variety of downstream tasks. A unified representation across various modalities is particularly important for improving efficiency and performance. Recent binding methods, such as ImageBind (Girdhar et al., 2023), typically use a fixed anchor modality to align multimodal data in the anchor modal embedding space. In this paper, we mathematically analyze the fixed anchor binding methods and uncover notable limitations: (1) over-reliance on the choice of the anchor modality, (2) failure to capture intra-modal information, and (3) failure to account for inter-modal correlation among non-anchored modalities. To address these limitations, we propose CentroBind, a simple yet powerful approach that eliminates the need for a fixed anchor; instead, it employs dynamically adjustable centroid-based anchors generated from all available modalities, resulting in a balanced and rich representation space. We theoretically demonstrate that our method captures three crucial properties of multimodal learning: intra-modal learning, inter-modal learning, and multimodal alignment, while also constructing a robust unified representation across all modalities. Our experiments on both synthetic and real-world datasets demonstrate the superiority of the proposed method, showing that dynamic anchor methods outperform all fixed anchor binding methods as the former captures more nuanced multimodal interactions.
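The dynamic-anchor idea can be sketched by replacing the fixed anchor modality with the per-sample centroid of all modality embeddings (a minimal sketch; the paper pairs this anchor with contrastive alignment objectives):

```python
import numpy as np

def centroid_anchor(embeddings):
    """Compute a dynamic anchor per sample as the renormalized centroid of
    all available modality embeddings, instead of privileging one modality.

    embeddings: dict modality -> (batch, dim) array, assumed comparable scale.
    Returns the (batch, dim) unit-norm centroid anchors."""
    stacked = np.stack(list(embeddings.values()))   # (n_modalities, B, dim)
    c = stacked.mean(axis=0)
    return c / np.linalg.norm(c, axis=-1, keepdims=True)
```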

[LG-165] Multi-Omic and Quantum Machine Learning Integration for Lung Subtypes Classification

链接: https://arxiv.org/abs/2410.02085
作者: Mandeep Kaur Saggi,Amandeep Singh Bhatia,Mensah Isaiah,Humaira Gowher,Sabre Kais
关键词-EN: Quantum Machine Learning, opportunities to resolve, computational problems, red-hot field, field that brings
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN); Quantum Physics (quant-ph)
*备注: 27 pages, 17 figures

点击查看摘要

Abstract:Quantum Machine Learning (QML) is a red-hot field that brings novel discoveries and exciting opportunities to resolve, speed up, or refine the analysis of a wide range of computational problems. In the realm of biomedical research and personalized medicine, the significance of multi-omics integration lies in its ability to provide a thorough and holistic comprehension of complex biological systems. This technology links fundamental research to clinical practice. The insights gained from integrated omics data can be translated into clinical tools for diagnosis, prognosis, and treatment planning. The fusion of quantum computing and machine learning holds promise for unraveling complex patterns within multi-omics datasets, providing unprecedented insights into the molecular landscape of lung cancer. Due to the heterogeneity, complexity, and high dimensionality of multi-omic cancer data, characterized by the vast number of features (such as gene expression, micro-RNA, and DNA methylation) relative to the limited number of lung cancer patient samples, our prime motivation for this paper is the integration of multi-omic data, unique feature selection, and diagnostic classification of lung subtypes: lung squamous cell carcinoma (LUSC-I) and lung adenocarcinoma (LUAD-II) using quantum machine learning. We developed a method for finding the best differentiating features between LUAD and LUSC datasets, which has the potential for biomarker discovery.

[LG-166] FARM: Functional Group-Aware Representations for Small Molecules

链接: https://arxiv.org/abs/2410.02082
作者: Thao Nguyen,Kuan-Hao Huang,Ge Liu,Martin D. Burke,Ying Diao,Heng Ji
关键词-EN: foundation model designed, introduce Functional Group-Aware, Small Molecules, Functional Group-Aware Representations, Functional Group-Aware
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: Preprint

点击查看摘要

Abstract:We introduce Functional Group-Aware Representations for Small Molecules (FARM), a novel foundation model designed to bridge the gap between SMILES, natural language, and molecular graphs. The key innovation of FARM lies in its functional group-aware tokenization, which incorporates functional group information directly into the representations. This strategic reduction in tokenization granularity in a way that is intentionally interfaced with key drivers of functional properties (i.e., functional groups) enhances the model’s understanding of chemical language, expands the chemical lexicon, more effectively bridging SMILES and natural language, and ultimately advances the model’s capacity to predict molecular properties. FARM also represents molecules from two perspectives: by using masked language modeling to capture atom-level features and by employing graph neural networks to encode the whole molecule topology. By leveraging contrastive learning, FARM aligns these two views of representations into a unified molecular embedding. We rigorously evaluate FARM on the MoleculeNet dataset, where it achieves state-of-the-art performance on 10 out of 12 tasks. These results highlight FARM’s potential to improve molecular representation learning, with promising applications in drug discovery and pharmaceutical research.

[LG-167] MixLinear: Extreme Low Resource Multivariate Time Series Forecasting with 0.1K Parameters

链接: https://arxiv.org/abs/2410.02081
作者: Aitian Ma,Dongsheng Luo,Mo Sha
关键词-EN: involves predicting long-term, predicting long-term future, historical time-series data, Long-term Time Series, predicting long-term
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recently, there has been a growing interest in Long-term Time Series Forecasting (LTSF), which involves predicting long-term future values by analyzing a large amount of historical time-series data to identify patterns and trends. There exist significant challenges in LTSF due to its complex temporal dependencies and high computational demands. Although Transformer-based models offer high forecasting accuracy, they are often too compute-intensive to be deployed on devices with hardware constraints. On the other hand, the linear models aim to reduce the computational overhead by employing either decomposition methods in the time domain or compact representations in the frequency domain. In this paper, we propose MixLinear, an ultra-lightweight multivariate time series forecasting model specifically designed for resource-constrained devices. MixLinear effectively captures both temporal and frequency domain features by modeling intra-segment and inter-segment variations in the time domain and extracting frequency variations from a low-dimensional latent space in the frequency domain. By reducing the parameter scale of a downsampled n -length input/output one-layer linear model from O(n^2) to O(n) , MixLinear achieves efficient computation without sacrificing accuracy. Extensive evaluations with four benchmark datasets show that MixLinear attains forecasting performance comparable to, or surpassing, state-of-the-art models with significantly fewer parameters ( 0.1K ), which makes it well-suited for deployment on devices with limited computational capacity.
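The O(n^2)-to-O(n) claim can be made concrete with a parameter count. The segment-wise decomposition below is a hypothetical stand-in, not MixLinear's actual architecture; it only illustrates the scale of the gap.

```python
def full_linear_params(n):
    """A dense one-layer linear map from an n-length input to an n-length
    output needs n * n weights: O(n^2)."""
    return n * n

def segmented_linear_params(n, seg_len):
    """Hypothetical segment-wise stand-in: one shared seg_len x seg_len map
    reused across n // seg_len segments, plus a per-position scale, gives
    seg_len**2 + n parameters: O(n) once seg_len is fixed."""
    return seg_len * seg_len + n
```

For a typical 96-step horizon, the dense map needs 9216 weights while the segment-shared variant needs a few hundred, which is the regime that makes deployment on constrained devices plausible.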

[LG-168] EMMA: Efficient Visual Alignment in Multi-Modal LLMs

链接: https://arxiv.org/abs/2410.02080
作者: Sara Ghazanfari,Alexandre Araujo,Prashanth Krishnamurthy,Siddharth Garg,Farshad Khorrami
关键词-EN: Multi-modal Large Language, Large Language Models, recently exhibited impressive, exhibited impressive general-purpose, impressive general-purpose capabilities
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-modal Large Language Models (MLLMs) have recently exhibited impressive general-purpose capabilities by leveraging vision foundation models to encode the core concepts of images into representations. These are then combined with instructions and processed by the language model to generate high-quality responses. Despite significant progress in enhancing the language component, challenges persist in optimally fusing visual encodings within the language model for task-specific adaptability. Recent research has focused on improving this fusion through modality adaptation modules but at the cost of significantly increased model complexity and training data needs. In this paper, we propose EMMA (Efficient Multi-Modal Adaptation), a lightweight cross-modality module designed to efficiently fuse visual and textual encodings, generating instruction-aware visual representations for the language model. Our key contributions include: (1) an efficient early fusion mechanism that integrates vision and language representations with minimal added parameters (less than 0.2% increase in model size), (2) an in-depth interpretability analysis that sheds light on the internal mechanisms of the proposed method; (3) comprehensive experiments that demonstrate notable improvements on both specialized and general benchmarks for MLLMs. Empirical results show that EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations. Our code is available at this https URL

[LG-169] Deep Generative Modeling for Identification of Noisy Non-Stationary Dynamical Systems

链接: https://arxiv.org/abs/2410.02079
作者: Doris Voina,Steven Brunton,J. Nathan Kutz
关键词-EN: recovering governing equations, significant challenge, fields of science, science and engineering, engineering is making
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 19 pages + 7 figures + Supplementary Materials (and supplementary figures)

点击查看摘要

Abstract:A significant challenge in many fields of science and engineering is making sense of time-dependent measurement data by recovering governing equations in the form of differential equations. We focus on finding parsimonious ordinary differential equation (ODE) models for nonlinear, noisy, and non-autonomous dynamical systems and propose a machine learning method for data-driven system identification. While many methods tackle noisy and limited data, non-stationarity - where differential equation parameters change over time - has received less attention. Our method, dynamic SINDy, combines variational inference with SINDy (sparse identification of nonlinear dynamics) to model time-varying coefficients of sparse ODEs. This framework allows for uncertainty quantification of ODE coefficients, expanding on previous methods for autonomous systems. These coefficients are then interpreted as latent variables and added to the system to obtain an autonomous dynamical model. We validate our approach using synthetic data, including nonlinear oscillators and the Lorenz system, and apply it to neuronal activity data from C. elegans. Dynamic SINDy uncovers a global nonlinear model, showing it can handle real, noisy, and chaotic datasets. We aim to apply our method to a variety of problems, specifically dynamic systems with complex time-dependent parameters.
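The SINDy core that dynamic SINDy builds on is sparse regression over a library of candidate terms. Below is a deliberately tiny sketch for a scalar ODE with library [x, x^2], using sequentially thresholded least squares; the paper's contribution, time-varying coefficients via variational inference, is omitted.

```python
def sindy_1d(xs, dxs, threshold=0.05):
    """Fit dx/dt = c1*x + c2*x^2 by least squares, then zero out small
    coefficients (one thresholding pass of STLSQ). A toy sketch only."""
    s11 = sum(x * x for x in xs)
    s12 = sum(x ** 3 for x in xs)
    s22 = sum(x ** 4 for x in xs)
    b1 = sum(d * x for d, x in zip(dxs, xs))
    b2 = sum(d * x * x for d, x in zip(dxs, xs))
    det = s11 * s22 - s12 * s12          # 2x2 normal equations, Cramer's rule
    c1 = (b1 * s22 - b2 * s12) / det
    c2 = (s11 * b2 - s12 * b1) / det
    return [c if abs(c) >= threshold else 0.0 for c in (c1, c2)]
```

On noise-free data from dx/dt = -2x, the fit recovers c1 = -2 and the thresholding step discards the spurious x^2 term, which is the "parsimonious model" behavior the abstract refers to.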

[LG-170] Kolmogorov-Arnold Network Autoencoders

链接: https://arxiv.org/abs/2410.02077
作者: Mohammadamin Moradi,Shirin Panahi,Erik Bollt,Ying-Cheng Lai
关键词-EN: Deep learning models, Deep learning, Multi-Layer Perceptrons, revolutionized various domains, image classification
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 5 figures, 1 table

点击查看摘要

Abstract:Deep learning models have revolutionized various domains, with Multi-Layer Perceptrons (MLPs) being a cornerstone for tasks like data regression and image classification. However, a recent study has introduced Kolmogorov-Arnold Networks (KANs) as promising alternatives to MLPs, leveraging activation functions placed on edges rather than nodes. This structural shift aligns KANs closely with the Kolmogorov-Arnold representation theorem, potentially enhancing both model accuracy and interpretability. In this study, we explore the efficacy of KANs in the context of data representation via autoencoders, comparing their performance with traditional Convolutional Neural Networks (CNNs) on the MNIST, SVHN, and CIFAR-10 datasets. Our results demonstrate that KAN-based autoencoders achieve competitive performance in terms of reconstruction accuracy, thereby suggesting their viability as effective tools in data analysis tasks.
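The structural shift the abstract describes, activations on edges rather than nodes, can be sketched as follows. Real KANs parameterize each edge with a B-spline; here each edge is a small polynomial purely for illustration.

```python
def kan_edge(x, coeffs):
    """Per-edge learnable 1-D function; a polynomial stands in for the
    spline a real KAN would use."""
    return sum(c * x ** i for i, c in enumerate(coeffs))

def kan_layer(inputs, edge_coeffs):
    """Each output node sums its own learnable function of each input:
    edge_coeffs[j][i] parameterizes the edge from input i to output j."""
    return [sum(kan_edge(x, edge_coeffs[j][i]) for i, x in enumerate(inputs))
            for j in range(len(edge_coeffs))]
```

Contrast with an MLP, where the edge carries only a scalar weight and the nonlinearity sits on the node; here the nonlinearity itself is learned per edge.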

[LG-171] Price-guided user attention in large-scale E-commerce group recommendation

链接: https://arxiv.org/abs/2410.02074
作者: Yang Shi,Young-joo Chung
关键词-EN: Existing group recommender, utilize attention mechanisms, Existing group, identify critical users, mechanisms to identify
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Existing group recommender systems utilize attention mechanisms to identify critical users who influence group decisions the most. We analyzed user attention scores from a widely-used group recommendation model on a real-world E-commerce dataset and found that item price and user interaction history significantly influence the selection of critical users. When item prices are low, users with extensive interaction histories are more influential in group decision-making. Conversely, their influence diminishes with higher item prices. Based on these observations, we propose a novel group recommendation approach that incorporates item price as a guiding factor for user aggregation. Our model employs an adaptive sigmoid function to adjust output logits based on item prices, enhancing the accuracy of user aggregation. Our model can be plugged into any attention-based group recommender system if the price information is available. We evaluate our model’s performance on a public benchmark and a real-world dataset. We compare it with other state-of-the-art group recommendation methods. Our results demonstrate that our price-guided user attention approach outperforms the state-of-the-art methods in terms of hit ratio and mean square error.
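The adaptive-sigmoid idea can be sketched as a price-dependent gate on each user's history bonus. The specific gating form, the log-history bonus, and the `alpha` parameter below are our own assumptions, not the paper's exact formulation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def price_guided_attention(logits, history_lengths, price, alpha=1.0):
    """Hypothetical price-gated user aggregation: the history bonus shrinks
    as the (normalized) item price grows, echoing the observation that
    long-history users matter less for expensive items."""
    gate = 1.0 - sigmoid(alpha * price)          # high price -> small gate
    adjusted = [l + gate * math.log(1.0 + h)     # gated history bonus
                for l, h in zip(logits, history_lengths)]
    m = max(adjusted)                            # stable softmax over users
    exps = [math.exp(a - m) for a in adjusted]
    z = sum(exps)
    return [e / z for e in exps]
```

With a cheap item the heavy-history user dominates the group's attention; with an expensive one the weights flatten toward uniform.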

[LG-172] Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

链接: https://arxiv.org/abs/2410.02073
作者: Aleksei Bochkovskii,Amaël Delaunoy,Hugo Germain,Marcel Santos,Yichao Zhou,Stephan R. Richter,Vladlen Koltun
关键词-EN: zero-shot metric monocular, present a foundation, monocular depth estimation, metric monocular depth, depth
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Code and weights available at this https URL

点击查看摘要

Abstract:We present a foundation model for zero-shot metric monocular depth estimation. Our model, Depth Pro, synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. And the model is fast, producing a 2.25-megapixel depth map in 0.3 seconds on a standard GPU. These characteristics are enabled by a number of technical contributions, including an efficient multi-scale vision transformer for dense prediction, a training protocol that combines real and synthetic datasets to achieve high metric accuracy alongside fine boundary tracing, dedicated evaluation metrics for boundary accuracy in estimated depth maps, and state-of-the-art focal length estimation from a single image. Extensive experiments analyze specific design choices and demonstrate that Depth Pro outperforms prior work along multiple dimensions. We release code and weights at this https URL

[LG-173] MMFNet: Multi-Scale Frequency Masking Neural Network for Multivariate Time Series Forecasting

链接: https://arxiv.org/abs/2410.02070
作者: Aitian Ma,Dongsheng Luo,Mo Sha
关键词-EN: numerous real-world applications, electricity consumption planning, disease propagation analysis, real-world applications, consumption planning
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Long-term Time Series Forecasting (LTSF) is critical for numerous real-world applications, such as electricity consumption planning, financial forecasting, and disease propagation analysis. LTSF requires capturing long-range dependencies between inputs and outputs, which poses significant challenges due to complex temporal dynamics and high computational demands. While linear models reduce model complexity by employing frequency domain decomposition, current approaches often assume stationarity and filter out high-frequency components that may contain crucial short-term fluctuations. In this paper, we introduce MMFNet, a novel model designed to enhance long-term multivariate forecasting by leveraging a multi-scale masked frequency decomposition approach. MMFNet captures fine, intermediate, and coarse-grained temporal patterns by converting time series into frequency segments at varying scales while employing a learnable mask to filter out irrelevant components adaptively. Extensive experimentation with benchmark datasets shows that MMFNet not only addresses the limitations of the existing methods but also consistently achieves good performance. Specifically, MMFNet achieves up to 6.0% reductions in the Mean Squared Error (MSE) compared to state-of-the-art models designed for multivariate forecasting tasks.
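Frequency-domain masking, the mechanism MMFNet makes learnable and multi-scale, reduces to multiplying each frequency bin by a mask and transforming back. A minimal single-scale sketch with a fixed mask (a naive DFT keeps it dependency-free; practice would use an FFT):

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform (an FFT would be used in practice)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def masked_frequency_filter(x, mask):
    """Multiply each frequency bin by a mask entry, then transform back.
    In MMFNet the mask is learnable and applied per segment scale; here
    it is fixed for illustration."""
    return idft([m * Xk for m, Xk in zip(mask, dft(x))])
```

An all-ones mask returns the series unchanged; keeping only the DC bin collapses it to its mean, the extreme case of the "filter out irrelevant components" behavior described above.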

[LG-174] Semi-Supervised Fine-Tuning of Vision Foundation Models with Content-Style Decomposition

链接: https://arxiv.org/abs/2410.02069
作者: Mariia Drozdova,Vitaliy Kinakh,Yury Belousov,Erica Lastufka,Slava Voloshynovskiy
关键词-EN: fine-tuning approach designed, semi-supervised fine-tuning approach, limited labeled data, vision foundation models, present a semi-supervised
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we present a semi-supervised fine-tuning approach designed to improve the performance of foundation models on downstream tasks with limited labeled data. By leveraging content-style decomposition within an information-theoretic framework, our method enhances the latent representations of pre-trained vision foundation models, aligning them more effectively with specific task objectives and addressing the problem of distribution shift. We evaluate our approach on multiple datasets, including MNIST, its augmented variations (with yellow and white stripes), CIFAR-10, SVHN, and GalaxyMNIST. The experiments show improvements over purely supervised baselines, particularly in low-labeled data regimes, across both frozen and trainable backbones for the majority of the tested datasets.

[LG-175] Fast and Sample Efficient Multi-Task Representation Learning in Stochastic Contextual Bandits

链接: https://arxiv.org/abs/2410.02068
作者: Jiabin Lin,Shana Moothedath,Namrata Vaswani
关键词-EN: contextual bandit problems, bandit problems, learning efficiency, contextual, contextual linear bandits
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study how representation learning can improve the learning efficiency of contextual bandit problems. We consider the setting where we play T contextual linear bandits with dimension d simultaneously, and these T bandit tasks collectively share a common linear representation with a dimensionality of r much smaller than d. We present a new algorithm based on alternating projected gradient descent (GD) and minimization estimator to recover a low-rank feature matrix. Using the proposed estimator, we present a multi-task learning algorithm for linear contextual bandits and prove the regret bound of our algorithm. We present experiments comparing the performance of our algorithm against benchmark algorithms.

[LG-176] Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct ICLR2025

链接: https://arxiv.org/abs/2410.02064
作者: Christopher Ackerman,Nina Panickssery
关键词-EN: model, reported that LLMs, LLMs can recognize, vector, chat model
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注: 10 pages, 13 figs, 2 tables, submitted to ICLR 2025

点击查看摘要

Abstract:It has been reported that LLMs can recognize their own writing. As this has potential implications for AI safety, yet is relatively understudied, we investigate the phenomenon, seeking to establish whether it robustly occurs at the behavioral level, how the observed behavior is achieved, and whether it can be controlled. First, we find that the Llama3-8b-Instruct chat model - but not the base Llama3-8b model - can reliably distinguish its own outputs from those of humans, and present evidence that the chat model is likely using its experience with its own outputs, acquired during post-training, to succeed at the writing recognition task. Second, we identify a vector in the residual stream of the model that is differentially activated when the model makes a correct self-written-text recognition judgment, show that the vector activates in response to information relevant to self-authorship, present evidence that the vector is related to the concept of “self” in the model, and demonstrate that the vector is causally related to the model’s ability to perceive and assert self-authorship. Finally, we show that the vector can be used to control both the model’s behavior and its perception, steering the model to claim or disclaim authorship by applying the vector to the model’s output as it generates it, and steering the model to believe or disbelieve it wrote arbitrary texts by applying the vector to them as the model reads them.
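The steering intervention described in the final sentence, adding the identified vector to activations as the model generates or reads, has a very simple core. The sketch below is a generic activation-steering stub, not the authors' code, and the toy 2-D vectors stand in for residual-stream activations.

```python
def steer(hidden, vector, coeff):
    """Activation steering: add coeff * vector to a residual-stream activation.
    A positive coeff pushes toward the behavior the vector encodes (here,
    asserting self-authorship); a negative coeff pushes away."""
    return [h + coeff * v for h, v in zip(hidden, vector)]

def projection(hidden, vector):
    """Scalar read-out: how strongly an activation aligns with the vector,
    a proxy for the 'differential activation' the paper measures."""
    return sum(h * v for h, v in zip(hidden, vector))
```

Steering an activation raises its projection onto the vector, which is the causal handle the paper uses to make the model claim or disclaim authorship.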

[LG-177] TPP-LLM: Modeling Temporal Point Processes by Efficiently Fine-Tuning Large Language Models

链接: https://arxiv.org/abs/2410.02062
作者: Zefang Liu,Yinzhu Quan
关键词-EN: Temporal point processes, transportation systems, point processes, social networks, timing and occurrence
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Temporal point processes (TPPs) are widely used to model the timing and occurrence of events in domains such as social networks, transportation systems, and e-commerce. In this paper, we introduce TPP-LLM, a novel framework that integrates large language models (LLMs) with TPPs to capture both the semantic and temporal aspects of event sequences. Unlike traditional methods that rely on categorical event type representations, TPP-LLM directly utilizes the textual descriptions of event types, enabling the model to capture rich semantic information embedded in the text. While LLMs excel at understanding event semantics, they are less adept at capturing temporal patterns. To address this, TPP-LLM incorporates temporal embeddings and employs parameter-efficient fine-tuning (PEFT) methods to effectively learn temporal dynamics without extensive retraining. This approach improves both predictive accuracy and computational efficiency. Experimental results across diverse real-world datasets demonstrate that TPP-LLM outperforms state-of-the-art baselines in sequence modeling and event prediction, highlighting the benefits of combining LLMs with TPPs.
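The "temporal embeddings" TPP-LLM adds to compensate for LLMs' weak temporal modeling are not specified in the abstract; one common concrete choice is a sinusoidal embedding of the (possibly continuous) event time, sketched here as an assumption rather than the paper's actual design.

```python
import math

def temporal_embedding(t, dim):
    """Sinusoidal embedding of an event time t into a dim-length vector,
    interleaving sines and cosines at geometrically spaced frequencies
    (transformer-style; TPP-LLM's exact embedding may differ)."""
    emb = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        emb += [math.sin(t * freq), math.cos(t * freq)]
    return emb
```

Such an embedding can be summed with the token embedding of the event-type description, letting the frozen (PEFT-tuned) LLM see both semantics and timing.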

[LG-178] PerTok: Expressive Encoding and Modeling of Symbolic Musical Ideas and Variations

链接: https://arxiv.org/abs/2410.02060
作者: Julian Lenz,Anirudh Mani
关键词-EN: predicting expressive variations, multi-stage generative framework, multi-stage generative, Performance Tokenizer, Rotary Positional Embeddings
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:We introduce Cadenza, a new multi-stage generative framework for predicting expressive variations of symbolic musical ideas as well as unconditional generations. To accomplish this, we propose a novel MIDI encoding method, PerTok (Performance Tokenizer) that captures minute expressive details whilst reducing sequence length up to 59% and vocabulary size up to 95% for polyphonic, monophonic and rhythmic tasks. The proposed framework comprises two sequential stages: 1) Composer and 2) Performer. The Composer model is a transformer-based Variational Autoencoder (VAE), with Rotary Positional Embeddings (RoPE) and an autoregressive decoder modified to more effectively integrate the latent codes of the input musical idea. The Performer model is a bidirectional transformer encoder that is separately trained to predict velocities and microtimings on MIDI sequences. Objective and human evaluations demonstrate Cadenza’s versatile capability in 1) matching other unconditional state-of-the-art symbolic models in musical quality whilst sounding more expressive, and 2) composing new, expressive ideas that are both stylistically related to the input whilst providing novel ideas to the user. Our framework is designed, researched and implemented with the objective of ethically providing inspiration for musicians.

[LG-179] Impact of White-Box Adversarial Attacks on Convolutional Neural Networks

链接: https://arxiv.org/abs/2410.02043
作者: Rakesh Podder,Sudipto Ghosh
关键词-EN: Autonomous vehicle navigation, machine learning models, Convolutional Neural Networks, Autonomous vehicle, Gradient Sign Method
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Autonomous vehicle navigation and healthcare diagnostics are among the many fields where the reliability and security of machine learning models for image data are critical. We conduct a comprehensive investigation into the susceptibility of Convolutional Neural Networks (CNNs), which are widely used for image data, to white-box adversarial attacks. We investigate the effects of various sophisticated attacks – Fast Gradient Sign Method, Basic Iterative Method, Jacobian-based Saliency Map Attack, Carlini & Wagner, Projected Gradient Descent, and DeepFool – on CNN performance metrics (e.g., loss, accuracy), the differential efficacy of adversarial techniques in increasing error rates, the relationship between perceived image quality metrics (e.g., ERGAS, PSNR, SSIM, and SAM) and classification performance, and the comparative effectiveness of iterative versus single-step attacks. Using the MNIST, CIFAR-10, CIFAR-100, and Fashion-MNIST datasets, we explore the effect of different attacks on the CNNs’ performance metrics by varying the hyperparameters of CNNs. Our study provides insights into the robustness of CNNs against adversarial threats, pinpoints vulnerabilities, and underscores the urgent need for developing robust defense mechanisms to protect CNNs and ensuring their trustworthy deployment in real-world scenarios.
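Of the attacks listed, the Fast Gradient Sign Method is the simplest: perturb the input by epsilon in the direction of the sign of the loss gradient. A self-contained sketch on a toy linear model (the paper attacks CNNs, where the gradient comes from backpropagation, but the update rule is identical):

```python
def fgsm_attack(x, w, y, eps):
    """FGSM on a linear model f(x) = w . x with squared loss (f(x) - y)^2.
    The gradient of the loss w.r.t. the input is 2 * (f(x) - y) * w;
    the adversarial example moves each input coordinate by eps in the
    direction of that gradient's sign."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    grad = [2.0 * (pred - y) * wi for wi in w]
    sign = [1.0 if g > 0 else -1.0 if g < 0 else 0.0 for g in grad]
    return [xi + eps * s for xi, s in zip(x, sign)]
```

The single-step nature of FGSM is exactly what the paper contrasts with iterative attacks such as BIM and PGD, which repeat this update with a projection back into the epsilon-ball.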

[LG-180] EAB-FL: Exacerbating Algorithmic Bias through Model Poisoning Attacks in Federated Learning

链接: https://arxiv.org/abs/2410.02042
作者: Syed Irfan Ali Meerza,Jian Liu
关键词-EN: Federated Learning, shared model collaboratively, multiple parties, parties to train, train a shared
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) is a technique that allows multiple parties to train a shared model collaboratively without disclosing their private data. It has become increasingly popular due to its distinct privacy advantages. However, FL models can suffer from biases against certain demographic groups (e.g., racial and gender groups) due to the heterogeneity of data and party selection. Researchers have proposed various strategies for characterizing the group fairness of FL algorithms to address this issue. However, the effectiveness of these strategies in the face of deliberate adversarial attacks has not been fully explored. Although existing studies have revealed various threats (e.g., model poisoning attacks) against FL systems caused by malicious participants, their primary aim is to decrease model accuracy, while the potential of leveraging poisonous model updates to exacerbate model unfairness remains unexplored. In this paper, we propose a new type of model poisoning attack, EAB-FL, with a focus on exacerbating group unfairness while maintaining a good level of model utility. Extensive experiments on three datasets demonstrate the effectiveness and efficiency of our attack, even with state-of-the-art fairness optimization algorithms and secure aggregation rules employed.

[LG-181] Realizable Continuous-Space Shields for Safe Reinforcement Learning

链接: https://arxiv.org/abs/2410.02038
作者: Kyungmin Kim,Davide Corsi,Andoni Rodriguez,JB Lanier,Benjami Parellada,Pierre Baldi,Cesar Sanchez,Roy Fox
关键词-EN: Deep Reinforcement Learning, Reinforcement Learning, Deep Reinforcement, achieved remarkable success, occasional catastrophic failures
类目: Machine Learning (cs.LG)
*备注: Kim, Corsi, and Rodriguez contributed equally

点击查看摘要

Abstract:While Deep Reinforcement Learning (DRL) has achieved remarkable success across various domains, it remains vulnerable to occasional catastrophic failures without additional safeguards. One effective solution to prevent these failures is to use a shield that validates and adjusts the agent’s actions to ensure compliance with a provided set of safety specifications. For real-life robot domains, it is desirable to be able to define such safety specifications over continuous state and action spaces to accurately account for system dynamics and calculate new safe actions that minimally alter the agent’s output. In this paper, we propose the first shielding approach to automatically guarantee the realizability of safety requirements for continuous state and action spaces. Realizability is an essential property that confirms the shield will always be able to generate a safe action for any state in the environment. We formally prove that realizability can also be verified with a stateful shield, enabling the incorporation of non-Markovian safety requirements. Finally, we demonstrate the effectiveness of our approach in ensuring safety without compromising policy accuracy by applying it to a navigation problem and a multi-agent particle environment.

[LG-182] Tuning Frequency Bias of State Space Models

链接: https://arxiv.org/abs/2410.02035
作者: Annan Yu,Dongwei Lyu,Soon Hoe Lim,Michael W. Mahoney,N. Benjamin Erichson
关键词-EN: State space models, State space, frequency bias, leverage linear, effectively learn sequences
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:State space models (SSMs) leverage linear, time-invariant (LTI) systems to effectively learn sequences with long-range dependencies. By analyzing the transfer functions of LTI systems, we find that SSMs exhibit an implicit bias toward capturing low-frequency components more effectively than high-frequency ones. This behavior aligns with the broader notion of frequency bias in deep learning model training. We show that the initialization of an SSM assigns it an innate frequency bias and that training the model in a conventional way does not alter this bias. Based on our theory, we propose two mechanisms to tune frequency bias: either by scaling the initialization to tune the inborn frequency bias; or by applying a Sobolev-norm-based filter to adjust the sensitivity of the gradients to high-frequency inputs, which allows us to change the frequency bias via training. Using an image-denoising task, we empirically show that we can strengthen, weaken, or even reverse the frequency bias using both mechanisms. By tuning the frequency bias, we can also improve SSMs’ performance on learning long-range sequences, averaging an 88.26% accuracy on the Long-Range Arena (LRA) benchmark tasks.
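The transfer-function argument can be seen in the smallest possible case. For a one-state LTI system x' = a x + b u, y = c x with a < 0, the frequency response magnitude is |H(i*omega)| = |b c| / sqrt(omega^2 + a^2), which decays with omega (a low-pass shape, i.e. low-frequency bias) and flattens as |a| grows, which is the "scale the initialization" mechanism in miniature. This is our illustrative reduction, not the paper's multi-state analysis.

```python
import math

def transfer_magnitude(a, b, c, omega):
    """|H(i*omega)| for the 1-state LTI SSM x' = a*x + b*u, y = c*x (a < 0)."""
    return abs(b * c) / math.sqrt(omega ** 2 + a ** 2)
```

The relative response at high versus zero frequency is larger when the pole is scaled further from the origin, so scaling the initialization does shift the innate frequency bias.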

[LG-183] Model Comparisons: XNet Outperforms KAN

链接: https://arxiv.org/abs/2410.02033
作者: Xin Li,Zhihong Jeff Xia,Xiaotao Zheng
关键词-EN: precise data modeling, predictive machine learning, machine learning tasks, artificial intelligence, modeling is crucial
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the fields of computational mathematics and artificial intelligence, the need for precise data modeling is crucial, especially for predictive machine learning tasks. This paper further explores XNet, a novel algorithm that employs the complex-valued Cauchy integral formula, offering a superior network architecture that surpasses traditional Multi-Layer Perceptrons (MLPs) and Kolmogorov-Arnold Networks (KANs). XNet significantly improves speed and accuracy across various tasks in both low and high-dimensional spaces, redefining the scope of data-driven model development and providing substantial improvements over established time series models like LSTMs.

[LG-184] FLAG: Financial Long Document Classification via AMR-based GNN

链接: https://arxiv.org/abs/2410.02024
作者: Bolun (Namir) Xia,Mohammed J. Zaki,Aparna Gupta
关键词-EN: large language models, Abstract Meaning Representation, language models, advent of large, large language
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 8 pages, 3 figures, to be published in CIFEr Conference 2024 as “Semantic Graph Learning for Trend Prediction from Long Financial Documents”

点击查看摘要

Abstract:The advent of large language models (LLMs) has initiated much research into their various financial applications. However, in applying LLMs on long documents, semantic relations are not explicitly incorporated, and a full or arbitrarily sparse attention operation is employed. In recent years, progress has been made in Abstract Meaning Representation (AMR), which is a graph-based representation of text to preserve its semantic relations. Since AMR can represent semantic relationships at a deeper level, it can be beneficially utilized by graph neural networks (GNNs) for constructing effective document-level graph representations built upon LLM embeddings to predict target metrics in the financial domain. We propose FLAG: Financial Long document classification via AMR-based GNN, an AMR graph based framework to generate document-level embeddings for long financial document classification. We construct document-level graphs from sentence-level AMR graphs, endow them with specialized LLM word embeddings in the financial domain, apply a deep learning mechanism that utilizes a GNN, and examine the efficacy of our AMR-based approach in predicting labeled target data from long financial documents. Extensive experiments are conducted on a dataset of quarterly earnings calls transcripts of companies in various sectors of the economy, as well as on a corpus of more recent earnings calls of companies in the S&P 1500 Composite Index. We find that our AMR-based approach outperforms fine-tuning LLMs directly on text in predicting stock price movement trends at different time horizons in both datasets. Our work also outperforms previous work utilizing document graphs and GNNs for text classification.

[LG-185] DeepProtein: Deep Learning Library and Benchmark for Protein Sequence Learning

链接: https://arxiv.org/abs/2410.02023
作者: Jiaqing Xie,Yue Zhao,Tianfan Fu
关键词-EN: predicting protein properties, deep learning, recent years, enabling advancements, structural folding
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:In recent years, deep learning has revolutionized the field of protein science, enabling advancements in predicting protein properties, structural folding and interactions. This paper presents DeepProtein, a comprehensive and user-friendly deep learning library specifically designed for protein-related tasks. DeepProtein integrates a couple of state-of-the-art neural network architectures, which include convolutional neural network (CNN), recurrent neural network (RNN), transformer, graph neural network (GNN), and graph transformer (GT). It provides user-friendly interfaces, facilitating domain researchers in applying deep learning techniques to protein data. Also, we curate a benchmark that evaluates these neural architectures on a variety of protein tasks, including protein function prediction, protein localization prediction, and protein-protein interaction prediction, showcasing its superior performance and scalability. Additionally, we provide detailed documentation and tutorials to promote accessibility and encourage reproducible research. This library is extended from a well-known drug discovery library, DeepPurpose and publicly available at this https URL.

[LG-186] Review Non-convex Optimization Method for Machine Learning

链接: https://arxiv.org/abs/2410.02017
作者: Greg B Fotopoulos,Paul Popovich,Nicholas Hall Papadopoulos
关键词-EN: deep neural networks, support vector machines, Non-convex optimization, advancing machine learning, critical tool
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Non-convex optimization is a critical tool in advancing machine learning, especially for complex models like deep neural networks and support vector machines. Despite challenges such as multiple local minima and saddle points, non-convex techniques offer various pathways to reduce computational costs. These include promoting sparsity through regularization, efficiently escaping saddle points, and employing subsampling and approximation strategies like stochastic gradient descent. Additionally, non-convex methods enable model pruning and compression, which reduce the size of models while maintaining performance. By focusing on good local minima instead of exact global minima, non-convex optimization ensures competitive accuracy with faster convergence and lower computational overhead. This paper examines the key methods and applications of non-convex optimization in machine learning, exploring how it can lower computation costs while enhancing model performance. Furthermore, it outlines future research directions and challenges, including scalability and generalization, that will shape the next phase of non-convex optimization in machine learning.
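A minimal toy illustration (not from the review itself) of settling for a good local minimum rather than an exact global one: plain gradient descent on a simple non-convex function converges quickly into one of its two symmetric local minima without ever certifying global optimality.

```python
# f(x) = x**4 - 3*x**2 has two local minima at x = ±sqrt(1.5) and a
# local maximum at x = 0.  Starting slightly off the origin, gradient
# descent escapes the flat region and settles into a good local minimum.
def grad(x):
    return 4 * x**3 - 6 * x  # f'(x)

x, lr = 0.5, 0.05
for _ in range(200):
    x -= lr * grad(x)

# x is now very close to sqrt(1.5) ~ 1.2247
```

With a step size below 2 / f''(x*) the iteration is a contraction near the minimum, which is why only a few hundred cheap steps are needed.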

[LG-187] Adaptively Private Next-Token Prediction of Large Language Models

链接: https://arxiv.org/abs/2410.02016
作者: James Flemings,Meisam Razaviyayn,Murali Annavaram
关键词-EN: Large Language Models, Large Language, Language Models, developing privacy safeguards, public output distributions
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) proliferate, developing privacy safeguards for these models is crucial. One popular safeguard involves training LLMs in a differentially private manner. However, such solutions are shown to be computationally expensive and detrimental to the utility of these models. Since LLMs are deployed on the cloud and thus only accessible via an API, a Machine Learning as a Service (MLaaS) provider can protect its downstream data by privatizing the predictions during the decoding process. However, the practicality of such solutions still largely lags behind DP training methods. One recent promising approach, Private Mixing of Ensemble Distributions (PMixED), avoids additive noise by sampling from the output distributions of private LLMs mixed with the output distribution of a public model. Yet, PMixED must satisfy a fixed privacy level for a given number of queries, which is difficult for an analyst to estimate before inference and, hence, does not scale. To this end, we relax the requirements to a more practical setting by introducing Adaptive PMixED (AdaPMixED), a private decoding framework based on PMixED that is adaptive to the private and public output distributions evaluated on a given input query. In this setting, we introduce a noisy screening mechanism that filters out queries with potentially expensive privacy loss, and a data-dependent analysis that exploits the divergence of the private and public output distributions in its privacy loss calculation. Our experimental evaluations demonstrate that our mechanism and analysis can reduce the privacy loss by 16x while preserving the utility over the original PMixED. Furthermore, performing 100K predictions with AdaPMixED still achieves strong utility and a reasonable data-dependent privacy loss of 5.25.
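An illustrative sketch of the two ingredients above (the mixing weight, the use of KL as the screening statistic, and the threshold are assumptions for illustration, not the paper's exact mechanism): a PMixED-style decoder mixes the private model's next-token distribution with a public one, and a screening step could skip queries where the two distributions diverge too much, since those dominate the privacy loss.

```python
import math

def mix(p_private, p_public, lam=0.5):
    # convex combination of the two next-token distributions
    return [lam * a + (1 - lam) * b for a, b in zip(p_private, p_public)]

def kl(p, q):
    # KL divergence, used here as a stand-in screening statistic
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

p_priv = [0.7, 0.2, 0.1]   # private model's next-token distribution
p_pub = [0.4, 0.4, 0.2]    # public model's next-token distribution
p_out = mix(p_priv, p_pub)
divergence = kl(p_priv, p_pub)
screen_out = divergence > 1.0  # hypothetical threshold: filter costly queries
```

The data-dependent analysis in the paper similarly exploits how close the private and public distributions are when accounting for privacy loss.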

[LG-188] Addressing Data Heterogeneity in Federated Learning with Adaptive Normalization-Free Feature Recalibration

链接: https://arxiv.org/abs/2410.02006
作者: Vasilis Siomos,Sergio Naval-Marimont,Jonathan Passerat-Palmbach,Giacomo Tarroni
关键词-EN: preserves stakeholders’ data, stakeholders’ data ownership, collaborative training paradigm, decentralized collaborative training, Normalization-free Feature Recalibration
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages

点击查看摘要

Abstract:Federated learning is a decentralized collaborative training paradigm that preserves stakeholders’ data ownership while improving performance and generalization. However, statistical heterogeneity among client datasets poses a fundamental challenge by degrading system performance. To address this issue, we propose Adaptive Normalization-free Feature Recalibration (ANFR), an architecture-level approach that combines weight standardization and channel attention. Weight standardization normalizes the weights of layers instead of activations. This is less susceptible to mismatched client statistics and inconsistent averaging, thereby more robust under heterogeneity. Channel attention produces learnable scaling factors for feature maps, suppressing those that are inconsistent between clients due to heterogeneity. We demonstrate that combining these techniques boosts model performance beyond their individual contributions, by enhancing class selectivity and optimizing channel attention weight distribution. ANFR operates independently of the aggregation method and is effective in both global and personalized federated learning settings, with minimal computational overhead. Furthermore, when training with differential privacy, ANFR achieves an appealing balance between privacy and utility, enabling strong privacy guarantees without sacrificing performance. By integrating weight standardization and channel attention in the backbone model, ANFR offers a novel and versatile approach to the challenge of statistical heterogeneity. We demonstrate through extensive experiments that ANFR consistently outperforms established baselines across various aggregation methods, datasets, and heterogeneity conditions.
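A minimal sketch of the two ingredients ANFR combines (shapes and names are illustrative, not the paper's code): weight standardization normalizes each output channel's weight vector to zero mean and unit standard deviation, and channel attention rescales per-channel features with learnable factors.

```python
import statistics

def standardize_weights(W, eps=1e-5):
    # W: list of per-output-channel weight vectors; normalize each row
    # instead of normalizing activations, so no client statistics are used
    out = []
    for row in W:
        m = statistics.fmean(row)
        s = statistics.pstdev(row)
        out.append([(w - m) / (s + eps) for w in row])
    return out

def channel_attention(features, scale):
    # features: one pooled value per channel; scale: learned factors that
    # can suppress channels that are inconsistent across clients
    return [f * s for f, s in zip(features, scale)]

W_hat = standardize_weights([[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]])
scaled = channel_attention([1.0, 2.0], [0.5, 0.0])  # second channel suppressed
```

Because only weights (not activations) are normalized, the operation behaves identically regardless of each client's data statistics.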

[LG-189] FairlyUncertain: A Comprehensive Benchmark of Uncertainty in Algorithmic Fairness

链接: https://arxiv.org/abs/2410.02005
作者: Lucas Rosenblatt,R. Teal Witter
关键词-EN: real-world data challenges, predictive algorithms hinge, Fair predictive algorithms, equality and trust, algorithms hinge
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Fair predictive algorithms hinge on both equality and trust, yet inherent uncertainty in real-world data challenges our ability to make consistent, fair, and calibrated decisions. While fairly managing predictive error has been extensively explored, some recent work has begun to address the challenge of fairly accounting for irreducible prediction uncertainty. However, a clear taxonomy and well-specified objectives for integrating uncertainty into fairness remains undefined. We address this gap by introducing FairlyUncertain, an axiomatic benchmark for evaluating uncertainty estimates in fairness. Our benchmark posits that fair predictive uncertainty estimates should be consistent across learning pipelines and calibrated to observed randomness. Through extensive experiments on ten popular fairness datasets, our evaluation reveals: (1) A theoretically justified and simple method for estimating uncertainty in binary settings is more consistent and calibrated than prior work; (2) Abstaining from binary predictions, even with improved uncertainty estimates, reduces error but does not alleviate outcome imbalances between demographic groups; (3) Incorporating consistent and calibrated uncertainty estimates in regression tasks improves fairness without any explicit fairness interventions. Additionally, our benchmark package is designed to be extensible and open-source, to grow with the field. By providing a standardized framework for assessing the interplay between uncertainty and fairness, FairlyUncertain paves the way for more equitable and trustworthy machine learning practices.

[LG-190] Normalizing Flow Based Metric for Image Generation

链接: https://arxiv.org/abs/2410.02004
作者: Pranav Jeevan,Neeraj Nixon,Amit Sethi
关键词-EN: dual-flow based likelihood, based likelihood distance, exact dual-flow based, flow-based likelihood distance, proposed metrics
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 15 pages, 16 figures

点击查看摘要

Abstract:We propose two new evaluation metrics to assess realness of generated images based on normalizing flows: a simpler and efficient flow-based likelihood distance (FLD) and a more exact dual-flow based likelihood distance (D-FLD). Because normalizing flows can be used to compute the exact likelihood, the proposed metrics assess how closely generated images align with the distribution of real images from a given domain. This property gives the proposed metrics a few advantages over the widely used Fréchet inception distance (FID) and other recent metrics. Firstly, the proposed metrics need only a few hundred images to stabilize (converge in mean), as opposed to tens of thousands needed for FID, and at least a few thousand for the other metrics. This allows confident evaluation of even small sets of generated images, such as validation batches inside training loops. Secondly, the network used to compute the proposed metric has over an order of magnitude fewer parameters compared to Inception-V3 used to compute FID, making it computationally more efficient. For assessing the realness of generated images in new domains (e.g., x-ray images), ideally these networks should be retrained on real images to model their distinct distributions. Thus, our smaller network will be even more advantageous for new domains. Extensive experiments show that the proposed metrics have the desired monotonic relationships with the extent of image degradation of various kinds.

[LG-191] Deep Learning Alternatives of the Kolmogorov Superposition Theorem

链接: https://arxiv.org/abs/2410.01990
作者: Leonardo Ferreira Guilhoto,Paris Perdikaris
关键词-EN: Kolmogorov Superposition Theorem, Superposition Theorem, paper explores alternative, explores alternative formulations, neural network design
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:This paper explores alternative formulations of the Kolmogorov Superposition Theorem (KST) as a foundation for neural network design. The original KST formulation, while mathematically elegant, presents practical challenges due to its limited insight into the structure of inner and outer functions and the large number of unknown variables it introduces. Kolmogorov-Arnold Networks (KANs) leverage KST for function approximation, but they have faced scrutiny due to mixed results compared to traditional multilayer perceptrons (MLPs) and practical limitations imposed by the original KST formulation. To address these issues, we introduce ActNet, a scalable deep learning model that builds on the KST and overcomes many of the drawbacks of Kolmogorov’s original formulation. We evaluate ActNet in the context of Physics-Informed Neural Networks (PINNs), a framework well-suited for leveraging KST’s strengths in low-dimensional function approximation, particularly for simulating partial differential equations (PDEs). In this challenging setting, where models must learn latent functions without direct measurements, ActNet consistently outperforms KANs across multiple benchmarks and is competitive against the current best MLP-based approaches. These results present ActNet as a promising new direction for KST-based deep learning applications, particularly in scientific computing and PDE simulation tasks.

[LG-192] LLMKG@VLDB24 Workshop Summary

链接: https://arxiv.org/abs/2410.01978
作者: Arijit Khan,Tianxing Wu,Xi Chen
关键词-EN: large language models, language models, knowledge graphs, hot topic, unification of large
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages, 1 figure

点击查看摘要

Abstract:The unification of large language models (LLMs) and knowledge graphs (KGs) has emerged as a hot topic. At the LLM+KG’24 workshop, held in conjunction with VLDB 2024 in Guangzhou, China, one of the key themes explored was important data management challenges and opportunities due to the effective interaction between LLMs and KGs. This report outlines the major directions and approaches presented by various speakers during the LLM+KG’24 workshop.

[LG-193] Run-time Observation Interventions Make Vision-Language-Action Models More Visually Robust

链接: https://arxiv.org/abs/2410.01971
作者: Asher J. Hancock,Allen Z. Ren,Anirudha Majumdar
关键词-EN: generalist robot policies, large-scale internet data, robot policies, robot demonstrations, generalist robot
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Website: this https URL

点击查看摘要

Abstract:Vision-language-action (VLA) models trained on large-scale internet data and robot demonstrations have the potential to serve as generalist robot policies. However, despite their large-scale training, VLAs are often brittle to task-irrelevant visual details such as distractor objects or background colors. We introduce Bring Your Own VLA (BYOVLA): a run-time intervention scheme that (1) dynamically identifies regions of the input image that the model is sensitive to, and (2) minimally alters task-irrelevant regions to reduce the model’s sensitivity using automated image editing tools. Our approach is compatible with any off-the-shelf VLA without model fine-tuning or access to the model’s weights. Hardware experiments on language-instructed manipulation tasks demonstrate that BYOVLA enables state-of-the-art VLA models to nearly retain their nominal performance in the presence of distractor objects and backgrounds, which otherwise degrade task success rates by up to 40%. Website with additional information, videos, and code: this https URL .

[LG-194] Which Algorithms Have Tight Generalization Bounds?

链接: https://arxiv.org/abs/2410.01969
作者: Michael Gastpar,Ido Nachum,Jonathan Shafer,Thomas Weinberger
关键词-EN: tight generalization bounds, tight generalization, generalization bounds, machine learning algorithms, generalization
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study which machine learning algorithms have tight generalization bounds. First, we present conditions that preclude the existence of tight generalization bounds. Specifically, we show that algorithms that have certain inductive biases that cause them to be unstable do not admit tight generalization bounds. Next, we show that algorithms that are sufficiently stable do have tight generalization bounds. We conclude with a simple characterization that relates the existence of tight generalization bounds to the conditional variance of the algorithm’s loss.

[LG-195] Scale-Invariant Learning-to-Rank

链接: https://arxiv.org/abs/2410.01959
作者: Alessio Petrozziello,Christian Sommeregger,Ye-Sheen Lim
关键词-EN: property rooms, relevant to users, search filters, plays a key, key role
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:At Expedia, learning-to-rank (LTR) models play a key role on our website in sorting and presenting information more relevant to users, such as search filters, property rooms, amenities, and images. A major challenge in deploying these models is ensuring consistent feature scaling between training and production data, as discrepancies can lead to unreliable rankings when deployed. Normalization techniques like feature standardization and batch normalization could address these issues but are impractical in production due to latency impacts and the difficulty of distributed real-time inference. To address the consistent feature scaling issue, we introduce a scale-invariant LTR framework which combines a deep and a wide neural network to mathematically guarantee scale-invariance in the model at both training and prediction time. We evaluate our framework in simulated real-world scenarios with injected feature scale issues by perturbing the test set at prediction time, and show that even with inconsistent train-test scaling, using our framework achieves better performance than without it.
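A toy illustration of the scale-invariance property the framework guarantees (the actual deep-and-wide architecture is not shown; the normalization scheme here is a stand-in): if features are divided by their own magnitude inside the model, scores are unchanged when train and production feature scales drift.

```python
def scale_invariant_score(x, w):
    # normalize the feature vector by its own L2 norm before scoring,
    # so f(c * x) == f(x) for any positive constant c
    norm = sum(v * v for v in x) ** 0.5
    return sum(wi * xi / norm for wi, xi in zip(w, x))

x = [3.0, 4.0]
w = [0.2, 0.5]
s1 = scale_invariant_score(x, w)
s2 = scale_invariant_score([100 * v for v in x], w)  # drifted feature scale
# s1 == s2 up to floating point
```

This is why no separate production-time normalization step (with its latency cost) is needed: the invariance is built into the scoring function itself.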

[LG-196] ComaDICE: Offline Cooperative Multi-Agent Reinforcement Learning with Stationary Distribution Shift Regularization

链接: https://arxiv.org/abs/2410.01954
作者: TheViet Bui,Thanh Hong Nguyen,Tien Mai
关键词-EN: garnered significant attention, learn effective policies, Offline reinforcement learning, environmental interactions, joint state-action space
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Offline reinforcement learning (RL) has garnered significant attention for its ability to learn effective policies from pre-collected datasets without the need for further environmental interactions. While promising results have been demonstrated in single-agent settings, offline multi-agent reinforcement learning (MARL) presents additional challenges due to the large joint state-action space and the complexity of multi-agent behaviors. A key issue in offline RL is the distributional shift, which arises when the target policy being optimized deviates from the behavior policy that generated the data. This problem is exacerbated in MARL due to the interdependence between agents’ local policies and the expansive joint state-action space. Prior approaches have primarily addressed this challenge by incorporating regularization in the space of either Q-functions or policies. In this work, we introduce a regularizer in the space of stationary distributions to better handle distributional shift. Our algorithm, ComaDICE, offers a principled framework for offline cooperative MARL by incorporating stationary distribution regularization for the global learning policy, complemented by a carefully structured multi-agent value decomposition strategy to facilitate multi-agent training. Through extensive experiments on the multi-agent MuJoCo and StarCraft II benchmarks, we demonstrate that ComaDICE achieves superior performance compared to state-of-the-art offline MARL methods across nearly all tasks.

[LG-197] Score-based pullback Riemannian geometry

链接: https://arxiv.org/abs/2410.01950
作者: Willem Diepeveen,Georgios Batzolis,Zakhar Shumaylov,Carola-Bibiane Schönlieb
关键词-EN: offering improved efficiency, Data-driven Riemannian geometry, interpretable representation learning, Riemannian geometry, pullback Riemannian geometry
类目: Machine Learning (cs.LG); Differential Geometry (math.DG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Data-driven Riemannian geometry has emerged as a powerful tool for interpretable representation learning, offering improved efficiency in downstream tasks. Moving forward, it is crucial to balance cheap manifold mappings with efficient training algorithms. In this work, we integrate concepts from pullback Riemannian geometry and generative models to propose a framework for data-driven Riemannian geometry that is scalable in both geometry and learning: score-based pullback Riemannian geometry. Focusing on unimodal distributions as a first step, we propose a score-based Riemannian structure with closed-form geodesics that pass through the data probability density. With this structure, we construct a Riemannian autoencoder (RAE) with error bounds for discovering the correct data manifold dimension. This framework can naturally be used with anisotropic normalizing flows by adopting isometry regularization during training. Through numerical experiments on various datasets, we demonstrate that our framework not only produces high-quality geodesics through the data support, but also reliably estimates the intrinsic dimension of the data manifold and provides a global chart of the manifold, even in high-dimensional ambient spaces.

[LG-198] Discrete Copula Diffusion

链接: https://arxiv.org/abs/2410.01949
作者: Anji Liu,Oliver Broadrick,Mathias Niepert,Guy Van den Broeck
关键词-EN: Discrete diffusion models, recently shown significant, shown significant progress, Discrete diffusion, DNA sequences
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Discrete diffusion models have recently shown significant progress in modeling complex data, such as natural languages and DNA sequences. However, unlike diffusion models for continuous data, which can generate high-quality samples in just a few denoising steps, modern discrete diffusion models still require hundreds or even thousands of denoising steps to perform well. In this paper, we identify a fundamental limitation that prevents discrete diffusion models from achieving strong performance with fewer steps – they fail to capture dependencies between output variables at each denoising step. To address this issue, we provide a formal explanation and introduce a general approach to supplement the missing dependency information by incorporating another deep generative model, termed the copula model. Our method does not require fine-tuning either the diffusion model or the copula model, yet it enables high-quality sample generation with significantly fewer denoising steps. When we apply this approach to autoregressive copula models, the combined model outperforms both models individually in unconditional and conditional text generation. Specifically, the hybrid model achieves better (un)conditional text generation using 8 to 32 times fewer denoising steps than the diffusion model alone. In addition to presenting an effective discrete diffusion generation algorithm, this paper emphasizes the importance of modeling inter-variable dependencies in discrete diffusion.
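A toy example of the failure mode the paper targets (numbers are illustrative): a denoiser that predicts each variable independently samples from the product of marginals, which can put mass on pairs the true joint never produces.

```python
# true joint over a two-token phrase
joint = {("yes", "sir"): 0.5, ("no", "way"): 0.5}

# per-variable marginals, as an independent per-token denoising step uses
p_first = {"yes": 0.5, "no": 0.5}
p_second = {"sir": 0.5, "way": 0.5}

independent = {(a, b): p_first[a] * p_second[b]
               for a in p_first for b in p_second}

# the factorized model gives ("yes", "way") probability 0.25,
# although that pair never occurs under the true joint
```

Supplementing the diffusion model with a copula model is the paper's way of restoring exactly this missing inter-variable dependency information at each denoising step.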

[LG-199] One-step Noisy Label Mitigation

链接: https://arxiv.org/abs/2410.01944
作者: Hao Li,Jiayang Gu,Jingkuan Song,An Zhang,Lianli Gao
关键词-EN: Mitigating the detrimental, large-scale pre-training tasks, increasingly critical, detrimental effects, large-scale pre-training
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 20 pages, 4 figures, 11 Tables

点击查看摘要

Abstract:Mitigating the detrimental effects of noisy labels on the training process has become increasingly critical, as obtaining entirely clean or human-annotated samples for large-scale pre-training tasks is often impractical. Nonetheless, existing noise mitigation methods often encounter limitations in practical applications due to their task-specific design, model dependency, and significant computational overhead. In this work, we exploit the properties of high-dimensional orthogonality to identify a robust and effective boundary in cone space for separating clean and noisy samples. Building on this, we propose One-step Anti-Noise (OSA), a model-agnostic noisy label mitigation paradigm that employs an estimator model and a scoring function to assess the noise level of input pairs through just one-step inference, a cost-efficient process. We empirically demonstrate the superiority of OSA, highlighting its enhanced training robustness, improved task transferability, ease of deployment, and reduced computational costs across various benchmarks, models, and tasks. Our code is released at this https URL.

[LG-200] CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL

链接: https://arxiv.org/abs/2410.01943
作者: Mohammadreza Pourreza,Hailong Li,Ruoxi Sun,Yeounoh Chung,Shayan Talaei,Gaurav Tarlok Kakkar,Yu Gan,Amin Saberi,Fatma Ozcan,Sercan O. Arik
关键词-EN: large language model, employs innovative strategies, improve candidate generation, binary-candidates selection LLM, single LLM call
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:In tackling the challenges of large language model (LLM) performance for Text-to-SQL tasks, we introduce CHASE-SQL, a new framework that employs innovative strategies, using test-time compute in multi-agent modeling to improve candidate generation and selection. CHASE-SQL leverages LLMs’ intrinsic knowledge to generate diverse and high-quality SQL candidates using different LLM generators with: (1) a divide-and-conquer method that decomposes complex queries into manageable sub-queries in a single LLM call; (2) chain-of-thought reasoning based on query execution plans, reflecting the steps a database engine takes during execution; and (3) a unique instance-aware synthetic example generation technique, which offers specific few-shot demonstrations tailored to test questions. To identify the best candidate, a selection agent is employed to rank the candidates through pairwise comparisons with a fine-tuned binary-candidates selection LLM. This selection approach has been demonstrated to be more robust over alternatives. The proposed generators-selector framework not only enhances the quality and diversity of SQL queries but also outperforms previous methods. Overall, our proposed CHASE-SQL achieves the state-of-the-art execution accuracy of 73.0% and 73.01% on the test set and development set of the notable BIRD Text-to-SQL dataset benchmark, rendering CHASE-SQL the top submission of the leaderboard (at the time of paper submission).
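A sketch of the selection step only (the comparator below is a toy stand-in for the paper's fine-tuned binary-candidates selection LLM): candidates are ranked by counting pairwise wins under a binary preference function.

```python
from itertools import combinations

def rank_by_pairwise_wins(candidates, prefers):
    # prefers(a, b) -> True if a is judged better than b
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        winner = a if prefers(a, b) else b
        wins[winner] += 1
    return max(candidates, key=lambda c: wins[c])

# toy preference: shorter query wins (stand-in for the learned selector)
candidates = [
    "SELECT * FROM t WHERE x=1",
    "SELECT x FROM t",
    "SELECT x, y FROM t",
]
best = rank_by_pairwise_wins(candidates, lambda a, b: len(a) < len(b))
```

In the real framework each comparison is one call to the selection LLM, so a round-robin over N candidates costs N*(N-1)/2 comparisons.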

[LG-201] TAEGAN: Generating Synthetic Tabular Data For Data Augmentation

链接: https://arxiv.org/abs/2410.01933
作者: Jiayu Li,Zilong Zhao,Kevin Yee,Uzair Javaid,Biplab Sikdar
关键词-EN: tabular data generation, gained significant attention, Synthetic tabular data, privacy-preserving data sharing, tabular data
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Synthetic tabular data generation has gained significant attention for its potential in data augmentation, software testing and privacy-preserving data sharing. However, most research has primarily focused on larger datasets and evaluating their quality in terms of metrics like column-wise statistical distributions and inter-feature correlations, while often overlooking its utility for data augmentation, particularly for datasets whose data is scarce. In this paper, we propose Tabular Auto-Encoder Generative Adversarial Network (TAEGAN), an improved GAN-based framework for generating high-quality tabular data. Although large language models (LLMs)-based methods represent the state-of-the-art in synthetic tabular data generation, they are often overkill for small datasets due to their extensive size and complexity. TAEGAN employs a masked auto-encoder as the generator, which for the first time introduces the power of self-supervised pre-training in tabular data generation so that essentially exposes the networks to more information. We extensively evaluate TAEGAN against five state-of-the-art synthetic tabular data generation algorithms. Results from 10 datasets show that TAEGAN outperforms existing deep-learning-based tabular data generation models on 9 out of 10 datasets on the machine learning efficacy and achieves superior data augmentation performance on 7 out of 8 smaller datasets.

[LG-202] Don’t flatten, tokenize! Unlocking the key to SoftMoE’s efficacy in deep RL

链接: https://arxiv.org/abs/2410.01930
作者: Ghada Sokar,Johan Obando-Ceron,Aaron Courville,Hugo Larochelle,Pablo Samuel Castro
关键词-EN: model size increases, deep neural networks, reinforcement learning, size increases, deep neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The use of deep neural networks in reinforcement learning (RL) often suffers from performance degradation as model size increases. While soft mixtures of experts (SoftMoEs) have recently shown promise in mitigating this issue for online RL, the reasons behind their effectiveness remain largely unknown. In this work we provide an in-depth analysis identifying the key factors driving this performance gain. We discover the surprising result that tokenizing the encoder output, rather than the use of multiple experts, is what is behind the efficacy of SoftMoEs. Indeed, we demonstrate that even with an appropriately scaled single expert, we are able to maintain the performance gains, largely thanks to tokenization.
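An illustration of the design choice the analysis isolates (shapes are illustrative): flattening an encoder's H x W x C feature map yields one long vector, whereas tokenizing keeps H*W tokens of dimension C, so downstream layers see per-location structure.

```python
H, W, C = 4, 4, 8
# dummy encoder output: an H x W grid of C-dimensional feature vectors
feature_map = [[[0.1] * C for _ in range(W)] for _ in range(H)]

# flattening: everything collapsed into a single 128-dim vector
flattened = [v for row in feature_map for cell in row for v in cell]

# tokenizing: keep H*W spatial tokens, each of dimension C
tokens = [cell for row in feature_map for cell in row]
```

The paper's finding is that feeding `tokens` (rather than `flattened`) into even a single appropriately scaled expert preserves most of the SoftMoE performance gain.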

[LG-203] LLM-Augmented Symbolic Reinforcement Learning with Landmark-Based Task Decomposition

链接: https://arxiv.org/abs/2410.01929
作者: Alireza Kheirandish,Duo Xu,Faramarz Fekri
关键词-EN: reinforcement learning, complex task, fundamental challenges, challenges in reinforcement, solving complex tasks
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:One of the fundamental challenges in reinforcement learning (RL) is to take a complex task and be able to decompose it into subtasks that are simpler for the RL agent to learn. In this paper, we report on our work that identifies subtasks by using given positive and negative trajectories for solving the complex task. We assume that the states are represented in first-order predicate logic, using which we devise a novel algorithm to identify the subtasks. Then we employ a Large Language Model (LLM) to generate first-order logic rule templates for achieving each subtask. These rules are then further fine-tuned into a rule-based policy via an Inductive Logic Programming (ILP)-based RL agent. Through experiments, we verify the accuracy of our algorithm in detecting subtasks, showing that it correctly detects all of the subtasks. We also investigated the quality of the common-sense rules produced by the language model to achieve the subtasks. Our experiments show that our LLM-guided rule template generation can produce rules that are necessary for solving a subtask, which leads to solving complex tasks with fewer assumptions about predefined first-order logic predicates of the environment.

[LG-204] MARPLE: A Benchmark for Long-Horizon Inference NEURIPS2024

链接: https://arxiv.org/abs/2410.01926
作者: Emily Jin,Zhuoyi Huang,Jan-Philipp Fränken,Weiyu Liu,Hannah Cha,Erik Brockbank,Sarah Wu,Ruohan Zhang,Jiajun Wu,Tobias Gerstenberg
关键词-EN: Reconstructing past events, long time horizons, past events requires, events requires reasoning, Reconstructing past
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2024. First two authors contributed equally. Project page: this https URL

点击查看摘要

Abstract:Reconstructing past events requires reasoning across long time horizons. To figure out what happened, we need to use our prior knowledge about the world and human behavior and draw inferences from various sources of evidence including visual, language, and auditory cues. We introduce MARPLE, a benchmark for evaluating long-horizon inference capabilities using multi-modal evidence. Our benchmark features agents interacting with simulated households, supporting vision, language, and auditory stimuli, as well as procedurally generated environments and agent behaviors. Inspired by classic “whodunit” stories, we ask AI models and human participants to infer which agent caused a change in the environment based on a step-by-step replay of what actually happened. The goal is to correctly identify the culprit as early as possible. Our findings show that human participants outperform both traditional Monte Carlo simulation methods and an LLM baseline (GPT-4) on this task. Compared to humans, traditional inference models are less robust and performant, while GPT-4 has difficulty comprehending environmental changes. We analyze what factors influence inference performance and ablate different modes of evidence, finding that all modes are valuable for performance. Overall, our experiments demonstrate that the long-horizon, multimodal inference tasks in our benchmark present a challenge to current models.

[LG-205] NTK-DFL: Enhancing Decentralized Federated Learning in Heterogeneous Settings via Neural Tangent Kernel

链接: https://arxiv.org/abs/2410.01922
作者: Gabriel Thompson,Kai Yue,Chau-Wai Wong,Huaiyu Dai
关键词-EN: raw data exchange, collaborative machine learning, DFL faces challenges, machine learning framework, collaborative machine
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Decentralized federated learning (DFL) is a collaborative machine learning framework for training a model across participants without a central server or raw data exchange. DFL faces challenges due to statistical heterogeneity, as participants often possess different data distributions reflecting local environments and user behaviors. Recent work has shown that the neural tangent kernel (NTK) approach, when applied to federated learning in a centralized framework, can lead to improved performance. The NTK-based update mechanism is more expressive than typical gradient descent methods, enabling more efficient convergence and better handling of data heterogeneity. We propose an approach leveraging the NTK to train client models in the decentralized setting, while introducing a synergy between NTK-based evolution and model averaging. This synergy exploits inter-model variance and improves both accuracy and convergence in heterogeneous settings. Our model averaging technique significantly enhances performance, boosting accuracy by at least 10% compared to the mean local model accuracy. Empirical results demonstrate that our approach consistently achieves higher accuracy than baselines in highly heterogeneous settings, where other approaches often underperform. Additionally, it reaches target performance in 4.6 times fewer communication rounds. We validate our approach across multiple datasets, network topologies, and heterogeneity settings to ensure robustness and generalizability.
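NTK-DFL 的一个关键组成部分是去中心化的模型平均。下面给出一个最小示意(假设环形拓扑、参数已展平为向量,仅演示无中心服务器的邻居平均,并非论文中基于 NTK 的更新本身):

```python
import numpy as np

def ring_average(models, rounds=1):
    """Decentralized model averaging on a ring topology: each client
    replaces its (flattened) parameters with the mean of its own and
    its two neighbors'. Repeated rounds drive the clients toward
    consensus without any central server."""
    models = [np.asarray(m, dtype=float).copy() for m in models]
    n = len(models)
    for _ in range(rounds):
        # synchronous update: all clients read the previous round's models
        models = [
            np.mean([models[i], models[(i - 1) % n], models[(i + 1) % n]], axis=0)
            for i in range(n)
        ]
    return models
```

在异构数据下,论文将此类平均与基于 NTK 的本地演化结合,以利用模型间方差。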

[LG-206] Step-by-Step Reasoning for Math Problems via Twisted Sequential Monte Carlo

链接: https://arxiv.org/abs/2410.01920
作者: Shengyu Feng,Xiang Kong,Shuang Ma,Aonan Zhang,Dong Yin,Chong Wang,Ruoming Pang,Yiming Yang
关键词-EN: Large Language Models, Language Models, multi-step reasoning abilities, Augmenting the multi-step, Large Language
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Augmenting the multi-step reasoning abilities of Large Language Models (LLMs) has been a persistent challenge. Recently, verification has shown promise in improving solution consistency by evaluating generated outputs. However, current verification approaches suffer from sampling inefficiencies, requiring a large number of samples to achieve satisfactory performance. Additionally, training an effective verifier often depends on extensive process supervision, which is costly to acquire. In this paper, we address these limitations by introducing a novel verification method based on Twisted Sequential Monte Carlo (TSMC). TSMC sequentially refines its sampling effort to focus exploration on promising candidates, resulting in more efficient generation of high-quality solutions. We apply TSMC to LLMs by estimating the expected future rewards at partial solutions. This approach results in a more straightforward training target that eliminates the need for step-wise human annotations. We empirically demonstrate the advantages of our method across multiple math benchmarks, and also validate our theoretical analysis of both our approach and existing verification methods.
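TSMC 的核心思想——把采样努力集中到有前景的部分解上——可以用一次"扭曲"重采样步骤来示意(这里 `twist_values` 代替论文中学习得到的期望未来奖励估计,仅为说明性简化,并非作者的实现):

```python
import numpy as np

def tsmc_resample(particles, twist_values, rng):
    """One twisted-SMC resampling step: partial solutions (particles)
    are resampled with probability proportional to their twist values,
    i.e. estimates of each partial solution's expected future reward,
    so subsequent generation concentrates on promising candidates.
    The twist values are assumed given; in the paper they are learned."""
    w = np.asarray(twist_values, dtype=float)
    probs = w / w.sum()  # normalize twists into resampling probabilities
    idx = rng.choice(len(particles), size=len(particles), p=probs)
    return [particles[i] for i in idx]
```

在生成每一步部分解后调用一次该重采样,即构成一个序贯蒙特卡洛流程的骨架。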

[LG-207] Provably Accurate Shapley Value Estimation via Leverage Score Sampling

链接: https://arxiv.org/abs/2410.01917
作者: Christopher Musco,R. Teal Witter
关键词-EN: specific input features, Originally introduced, Kernel SHAP, attribute model predictions, explainable machine learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Originally introduced in game theory, Shapley values have emerged as a central tool in explainable machine learning, where they are used to attribute model predictions to specific input features. However, computing Shapley values exactly is expensive: for a general model with n features, O(2^n) model evaluations are necessary. To address this issue, approximation algorithms are widely used. One of the most popular is the Kernel SHAP algorithm, which is model agnostic and remarkably effective in practice. However, to the best of our knowledge, Kernel SHAP has no strong non-asymptotic complexity guarantees. We address this issue by introducing Leverage SHAP, a light-weight modification of Kernel SHAP that provides provably accurate Shapley value estimates with just O(n\log n) model evaluations. Our approach takes advantage of a connection between Shapley value estimation and agnostic active learning by employing leverage score sampling, a powerful regression tool. Beyond theoretical guarantees, we show that Leverage SHAP consistently outperforms even the highly optimized implementation of Kernel SHAP available in the ubiquitous SHAP library [Lundberg & Lee, 2017].
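杠杆分数采样本身是标准的回归工具,可以用 numpy 最小化地示意如下(仅为说明;Leverage SHAP 将其应用于一个特定的 Shapley 回归形式,此处未展示):

```python
import numpy as np

def leverage_scores(A):
    """Leverage score of each row of A: squared row norms of Q from a
    thin QR factorization (the diagonal of the hat matrix)."""
    Q, _ = np.linalg.qr(A)
    return np.sum(Q * Q, axis=1)

def sample_rows(A, b, m, rng):
    """Sample m rows of the regression problem (A, b) proportionally
    to leverage scores, rescaling so that the subsampled least-squares
    objective is an unbiased sketch of the full one."""
    p = leverage_scores(A)
    p = p / p.sum()
    idx = rng.choice(A.shape[0], size=m, p=p)
    scale = 1.0 / np.sqrt(m * p[idx])
    return A[idx] * scale[:, None], b[idx] * scale
```

杠杆分数之和等于矩阵的秩,高杠杆行(对拟合影响大的行)以更高概率被保留。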

[LG-208] Is uniform expressivity too restrictive? Towards efficient expressivity of graph neural networks

链接: https://arxiv.org/abs/2410.01910
作者: Sammy Khalife,Josué Tonelli-Cueto
关键词-EN: Graph Neural Network, Neural Network, Uniform expressivity guarantees, Uniform expressivity, Graph Neural
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC); Logic in Computer Science (cs.LO)
*备注:

点击查看摘要

Abstract:Uniform expressivity guarantees that a Graph Neural Network (GNN) can express a query without the parameters depending on the size of the input graphs. This property is desirable in applications in order to have a number of trainable parameters that is independent of the size of the input graphs. Uniform expressivity of the two variable guarded fragment (GC2) of first order logic is a well-celebrated result for Rectified Linear Unit (ReLU) GNNs [Barceló et al., 2020]. In this article, we prove that uniform expressivity of GC2 queries is not possible for GNNs with a wide class of Pfaffian activation functions (including the sigmoid and tanh), answering a question formulated by [Grohe, 2021]. We also show that despite these limitations, many of those GNNs can still efficiently express GC2 queries in a way that the number of parameters remains logarithmic in the maximal degree of the input graphs. Furthermore, we demonstrate that a log-log dependency on the degree is achievable for a certain choice of activation function. This shows that uniform expressivity can be successfully relaxed by covering large graphs appearing in practical applications. Our experiments illustrate that our theoretical estimates hold in practice.

[LG-209] Social Media Authentication and Combating Deepfakes using Semi-fragile Invisible Image Watermarking

链接: https://arxiv.org/abs/2410.01906
作者: Aakash Varma Nadimpalli,Ajita Rattani
关键词-EN: severe societal concerns, raised severe societal, watermark removal attacks, deep generative models, video synthesis
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: ACM Transactions (Digital Threats: Research and Practice)

点击查看摘要

Abstract:With the significant advances in deep generative models for image and video synthesis, Deepfakes and manipulated media have raised severe societal concerns. Conventional machine learning classifiers for deepfake detection often fail to cope with evolving deepfake generation technology and are susceptible to adversarial attacks. Alternatively, invisible image watermarking is being researched as a proactive defense technique that allows media authentication by verifying an invisible secret message embedded in the image pixels. A handful of invisible image watermarking techniques introduced for media authentication have proven vulnerable to basic image processing operations and watermark removal attacks. In response, we have proposed a semi-fragile image watermarking technique that embeds an invisible secret message into real images for media authentication. Our proposed watermarking framework is designed to be fragile to facial manipulations or tampering while being robust to benign image-processing operations and watermark removal attacks. This is facilitated through a unique architecture of our proposed technique consisting of critic and adversarial networks that enforce high image quality and resiliency to watermark removal efforts, respectively, along with the backbone encoder-decoder and the discriminator networks. Thorough experimental investigations on SOTA facial Deepfake datasets demonstrate that our proposed model can embed a 64-bit secret as an imperceptible image watermark that can be recovered with a high-bit recovery accuracy when benign image processing operations are applied while being non-recoverable when unseen Deepfake manipulations are applied. In addition, our proposed watermarking technique demonstrates high resilience to several white-box and black-box watermark removal attacks, thus obtaining state-of-the-art performance.

[LG-210] Conformal Prediction Sets Can Cause Disparate Impact

链接: https://arxiv.org/abs/2410.01888
作者: Jesse C. Cresswell,Bhargava Kumar,Yi Sui,Mouloud Belbahri
关键词-EN: machine learning models, learning models, inherently actionable, promising method, method for quantifying
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Code is available at this https URL

点击查看摘要

Abstract:Although conformal prediction is a promising method for quantifying the uncertainty of machine learning models, the prediction sets it outputs are not inherently actionable. Many applications require a single output to act on, not several. To overcome this, prediction sets can be provided to a human who then makes an informed decision. In any such system it is crucial to ensure the fairness of outcomes across protected groups, and researchers have proposed that Equalized Coverage be used as the standard for fairness. By conducting experiments with human participants, we demonstrate that providing prediction sets can increase the unfairness of their decisions. Disquietingly, we find that providing sets that satisfy Equalized Coverage actually increases unfairness compared to marginal coverage. Instead of equalizing coverage, we propose to equalize set sizes across groups which empirically leads to more fair outcomes.
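作为背景,此类预测集的产生方式可以用分裂式保形预测(split conformal)的最小示意来说明(使用标准的 1 − p(真实类别) 非一致性分数;这里只保证边际覆盖率,并非论文讨论的 Equalized Coverage 变体):

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction sets for classification using the
    standard 1 - p(true class) nonconformity score. cal_probs and
    test_probs are (n, C) arrays of predicted class probabilities."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # finite-sample-corrected quantile level, capped at 1
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    qhat = np.quantile(scores, q_level, method="higher")
    # include every class whose nonconformity score falls below the threshold
    return [np.where(1.0 - p <= qhat)[0] for p in test_probs]
```

返回的每个预测集是一个类别索引数组;论文正是针对把这样的集合呈现给人类决策者时的公平性展开实验。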

[LG-211] NEAT: Nonlinear Parameter-efficient Adaptation of Pre-trained Models

链接: https://arxiv.org/abs/2410.01870
作者: Yibo Zhong,Haoxiang Jiang,Lincan Li,Ryumei Nakada,Tianci Liu,Linjun Zhang,Huaxiu Yao,Haoyu Wang
关键词-EN: adapting large models, crucial for adapting, adapting large, Fine-tuning, NEAT
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Fine-tuning pre-trained models is crucial for adapting large models to downstream tasks, often delivering state-of-the-art performance. However, fine-tuning all model parameters is resource-intensive and laborious, leading to the emergence of parameter-efficient fine-tuning (PEFT) methods. One widely adopted PEFT technique, Low-Rank Adaptation (LoRA), freezes the pre-trained model weights and introduces two low-rank matrices whose ranks are significantly smaller than the dimensions of the original weight matrices. This enables efficient fine-tuning by adjusting only a small number of parameters. Despite its efficiency, LoRA approximates weight updates using low-rank decomposition, which struggles to capture complex, non-linear components and efficient optimization trajectories. As a result, LoRA-based methods often exhibit a significant performance gap compared to full fine-tuning. Closing this gap requires higher ranks, which increases the number of parameters. To address these limitations, we propose a nonlinear parameter-efficient adaptation method (NEAT). NEAT introduces a lightweight neural network that takes pre-trained weights as input and learns a nonlinear transformation to approximate cumulative weight updates. These updates can be interpreted as functions of the corresponding pre-trained weights. The nonlinear approximation directly models the cumulative updates, effectively capturing complex and non-linear structures in the weight updates. Our theoretical analysis demonstrates that NEAT can be more efficient than LoRA while having equal or greater expressivity. Extensive evaluations across four benchmarks and over twenty datasets demonstrate that NEAT significantly outperforms baselines in both vision and text tasks.
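LoRA 的线性低秩更新与 NEAT 式非线性更新的对比可以示意如下(其中 tanh 双矩阵形式是一个假设性的最小实例,论文中实际使用的轻量网络可能不同):

```python
import numpy as np

def lora_delta(B, A):
    """LoRA: purely linear low-rank update, delta_W = B @ A."""
    return B @ A

def nonlinear_delta(W, U, V):
    """NEAT-style sketch: a small network takes the pre-trained weight
    matrix W itself as input and outputs a nonlinear function of it as
    the cumulative update, here delta_W = tanh(W @ U) @ V. U (d x r)
    and V (r x d) are the only trainable parameters, so the adapter
    stays low-parameter like LoRA while modeling nonlinear structure.
    This exact tanh form is an illustrative assumption."""
    return np.tanh(W @ U) @ V
```

两者可训练参数量相当;区别在于后者的更新是预训练权重的非线性函数,这正是摘要中"更新可解释为对应预训练权重的函数"一句的含义。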

[LG-212] House of Cards: Massive Weights in LLMs

链接: https://arxiv.org/abs/2410.01866
作者: Jaehoon Oh,Seungjun Shin,Dokwan Oh
关键词-EN: large language models, massive weights, Massive activations, Massive, weights
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Under review

点击查看摘要

Abstract:Massive activations, which manifest in specific feature dimensions of hidden states, introduce a significant bias in large language models (LLMs), leading to an overemphasis on the corresponding token. In this paper, we identify that massive activations originate not from the hidden state but from the intermediate state of a feed-forward network module in an early layer. Expanding on the previous observation that massive activations occur only in specific feature dimensions, we dive deep into the weights that cause massive activations. Specifically, we define top-k massive weights as the weights that contribute to the dimensions with the top-k magnitudes in the intermediate state. When these massive weights are set to zero, the functionality of LLMs is entirely disrupted. However, when all weights except for massive weights are set to zero, it results in a relatively minor performance drop, even though a much larger number of weights are set to zero. This implies that during the pre-training process, learning is dominantly focused on massive weights. Building on this observation, we propose a simple plug-and-play method called MacDrop (massive weights curriculum dropout), to rely less on massive weights during parameter-efficient fine-tuning. This method applies dropout to the pre-trained massive weights, starting with a high dropout probability and gradually decreasing it as fine-tuning progresses. Through experiments, we demonstrate that MacDrop generally improves performance across zero-shot downstream tasks and generation tasks.
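按摘要描述的 MacDrop 调度可以示意如下:dropout 仅作用于预先识别出的 massive weights,丢弃概率随微调进行而衰减(其中线性衰减和 p_start 取值是示意性假设):

```python
import numpy as np

def macdrop_mask(massive_mask, step, total_steps, p_start=0.9, rng=None):
    """MacDrop sketch: dropout applied only to pre-identified massive
    weights (massive_mask == True), with drop probability decaying
    linearly from p_start at step 0 to zero at total_steps.
    Returns a multiplicative 0/1 mask; p_start and the linear decay
    schedule are illustrative choices."""
    if rng is None:
        rng = np.random.default_rng()
    p = p_start * (1.0 - step / total_steps)
    mask = np.ones(massive_mask.shape, dtype=float)
    drop = rng.random(massive_mask.shape) < p
    mask[massive_mask & drop] = 0.0  # only massive positions can be dropped
    return mask
```

非 massive 位置永远不会被置零;微调末期(step → total_steps)掩码退化为全 1。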

[LG-213] Simplifying complex machine learning by linearly separable network embedding spaces

链接: https://arxiv.org/abs/2410.01865
作者: Alexandros Xenos,Noel-Malod Dognin,Natasa Przulj
关键词-EN: Low-dimensional embeddings, Low-dimensional, network, embedding, network data
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 26 pages, 8 figures

点击查看摘要

Abstract:Low-dimensional embeddings are a cornerstone in the modelling and analysis of complex networks. However, most existing approaches for mining network embedding spaces rely on computationally intensive machine learning systems to facilitate downstream tasks. In the field of NLP, word embedding spaces capture semantic relationships linearly, allowing for information retrieval using simple linear operations on word embedding vectors. Here, we demonstrate that there are structural properties of network data that yield this linearity. We show that the more homophilic the network representation, the more linearly separable the corresponding network embedding space, yielding better downstream analysis results. Hence, we introduce novel graphlet-based methods enabling embedding of networks into more linearly separable spaces, allowing for their better mining. These fundamental insights into the structure of network data, which enable its linear mining and exploitation, give the ML community a foundation to build upon towards efficient and explainable mining of complex network data.

[LG-214] Learning the Optimal Path and DNN Partition for Collaborative Edge Inference

链接: https://arxiv.org/abs/2410.01857
作者: Yin Huang,Letian Zhang,Jie Xu
关键词-EN: Deep Neural Networks, Deep Neural, Recent advancements, numerous intelligent mobile, intelligent mobile applications
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 15 pages, 15 figures, submitted to IEEE journals for possible publication

点击查看摘要

Abstract:Recent advancements in Deep Neural Networks (DNNs) have catalyzed the development of numerous intelligent mobile applications and services. However, they also introduce significant computational challenges for resource-constrained mobile devices. To address this, collaborative edge inference has been proposed. This method involves partitioning a DNN inference task into several subtasks and distributing these across multiple network nodes. Despite its potential, most current approaches presume known network parameters – like node processing speeds and link transmission rates – or rely on a fixed sequence of nodes for processing the DNN subtasks. In this paper, we tackle a more complex scenario where network parameters are unknown and must be learned, and multiple network paths are available for distributing inference tasks. Specifically, we explore the learning problem of selecting the optimal network path and assigning DNN layers to nodes along this path, considering potential security threats and the costs of switching paths. We begin by deriving structural insights from the DNN layer assignment with complete network information, which narrows down the decision space and provides crucial understanding of optimal assignments. We then cast the learning problem with incomplete network information as a novel adversarial group linear bandits problem with switching costs, featuring rewards generation through a combined stochastic and adversarial process. We introduce a new bandit algorithm, B-EXPUCB, which combines elements of the classical blocked EXP3 and LinUCB algorithms, and demonstrate its sublinear regret. Extensive simulations confirm B-EXPUCB’s superior performance in learning for collaborative edge inference over existing algorithms.

[LG-215] Explainable Diagnosis Prediction through Neuro-Symbolic Integration

链接: https://arxiv.org/abs/2410.01855
作者: Qiuhao Lu,Rui Li,Elham Sagheb,Andrew Wen,Jinlian Wang,Liwei Wang,Jungwei W. Fan,Hongfang Liu
关键词-EN: impact patient outcomes, significantly impact patient, Logical Neural Networks, patient outcomes, critical task
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Diagnosis prediction is a critical task in healthcare, where timely and accurate identification of medical conditions can significantly impact patient outcomes. Traditional machine learning and deep learning models have achieved notable success in this domain but often lack interpretability which is a crucial requirement in clinical settings. In this study, we explore the use of neuro-symbolic methods, specifically Logical Neural Networks (LNNs), to develop explainable models for diagnosis prediction. Essentially, we design and implement LNN-based models that integrate domain-specific knowledge through logical rules with learnable thresholds. Our models, particularly M_multi-pathway and M_comprehensive, demonstrate superior performance over traditional models such as Logistic Regression, SVM, and Random Forest, achieving higher accuracy (up to 80.52%) and AUROC scores (up to 0.8457) in the case study of diabetes prediction. The learned weights and thresholds within the LNN models provide direct insights into feature contributions, enhancing interpretability without compromising predictive power. These findings highlight the potential of neuro-symbolic approaches in bridging the gap between accuracy and explainability in healthcare AI applications. By offering transparent and adaptable diagnostic models, our work contributes to the advancement of precision medicine and supports the development of equitable healthcare solutions. Future research will focus on extending these methods to larger and more diverse datasets to further validate their applicability across different medical conditions and populations.
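带可学习权重与阈值的 LNN 基本构件可以用标准的实值加权合取来示意(这是通用的 LNN 式逻辑门,并非论文中具体的 pathway 模型):

```python
import numpy as np

def lnn_and(x, w, beta):
    """Real-valued weighted conjunction in the LNN style:
    AND(x) = clamp(beta - sum_i w_i * (1 - x_i), 0, 1).
    x: truth values in [0, 1]; w: learnable importance weights;
    beta: learnable bias/threshold. With w_i = beta = 1 this reduces
    to the Lukasiewicz t-norm."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    return float(np.clip(beta - np.sum(w * (1.0 - x)), 0.0, 1.0))
```

训练后的 w 与 beta 即可直接读出各特征(规则前件)对诊断结论的贡献,这正是摘要中可解释性的来源。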

[LG-216] An Early-Stage Workflow Proposal for the Generation of Safe and Dependable AI Classifiers

链接: https://arxiv.org/abs/2410.01850
作者: Hans Dermot Doran,Suzana Veljanovska
关键词-EN: preferably lightweight workflow, necessitates definition, complete yet adaptable, generation and execution, execution of qualifiable
类目: Machine Learning (cs.LG)
*备注: 43rd International Conference on Computer Safety, Reliability and Security (SafeComp2024), Florence, Italy, September 17-20.2024

点击查看摘要

Abstract:The generation and execution of qualifiable safe and dependable AI models necessitates definition of a transparent, complete yet adaptable and preferably lightweight workflow. Given the rapidly progressing domain of AI research and the relative immaturity of the safe-AI domain, the process stability upon which functional safety developments rest must be married with some degree of adaptability. This early-stage work proposes such a workflow, basing it on an extended ONNX model description. A use case provides one foundation of this body of work, which we expect to be extended by other, third-party use cases.

[LG-217] Spatial Action Unit Cues for Interpretable Deep Facial Expression Recognition

链接: https://arxiv.org/abs/2410.01848
作者: Soufiane Belharbi,Marco Pedersoli,Alessandro Lameiras Koerich,Simon Bacon,Eric Granger
关键词-EN: level of accuracy, facial expression recognition, achieve a high, high level, facial
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 4 pages, 2 figures, AI and Digital Health Symposium 2024, October 18th 2024, Montréal

点击查看摘要

Abstract:Although state-of-the-art classifiers for facial expression recognition (FER) can achieve a high level of accuracy, they lack interpretability, an important feature for end-users. Experts typically associate spatial action units (AUs) from a codebook to facial regions for the visual interpretation of expressions. In this paper, the same expert steps are followed. A new learning strategy is proposed to explicitly incorporate AU cues into classifier training, making it possible to train deep interpretable models. During training, this AU codebook is used, along with the input image expression label, and facial landmarks, to construct an AU heatmap that indicates the most discriminative image regions of interest w.r.t. the facial expression. This valuable spatial cue is leveraged to train a deep interpretable classifier for FER. This is achieved by constraining the spatial layer features of a classifier to be correlated with AU heatmaps. Using a composite loss, the classifier is trained to correctly classify an image while yielding interpretable visual layer-wise attention correlated with AU maps, simulating the expert decision process. Our strategy only relies on image class expression for supervision, without additional manual annotations. Our new strategy is generic, and can be applied to any deep CNN- or transformer-based classifier without requiring any architectural change or significant additional training time. Our extensive evaluation on two public benchmarks, the RAF-DB and AffectNet datasets, shows that our proposed strategy can improve layer-wise interpretability without degrading classification performance. In addition, we explore a common type of interpretable classifiers that rely on class activation mapping (CAM) methods, and show that our approach can also improve CAM interpretability.

[LG-218] Bayes-CATSI: A variational Bayesian approach for medical time series data imputation

链接: https://arxiv.org/abs/2410.01847
作者: Omkar Kulkarni,Rohitash Chandra
关键词-EN: Time Series Imputation, time series datasets, Context-Aware Time Series, series datasets feature, datasets feature missing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Medical time series datasets feature missing values that need data imputation methods, however, conventional machine learning models fall short due to a lack of uncertainty quantification in predictions. Among these models, the CATSI (Context-Aware Time Series Imputation) stands out for its effectiveness by incorporating a context vector into the imputation process, capturing the global dependencies of each patient. In this paper, we propose a Bayesian Context-Aware Time Series Imputation (Bayes-CATSI) framework which leverages uncertainty quantification offered by variational inference. We consider the time series derived from electroencephalography (EEG), electrooculography (EOG), electromyography (EMG), and electrocardiography (EKG). Variational inference assumes the shape of the posterior distribution and, through minimization of the Kullback-Leibler (KL) divergence, it finds variational densities that are closest to the true posterior distribution. Thus, we integrate the variational Bayesian deep learning layers into the CATSI model. Our results show that Bayes-CATSI not only provides uncertainty quantification but also achieves superior imputation performance compared to the CATSI model. Specifically, an instance of Bayes-CATSI outperforms CATSI by 9.57%. We provide an open-source code implementation for applying Bayes-CATSI to other medical data imputation problems.
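将变分密度拟合到标准正态先验时所最小化的 KL 项,对高斯分布有闭式解;最小示意如下:

```python
import numpy as np

def kl_gaussians(mu_q, sigma_q, mu_p=0.0, sigma_p=1.0):
    """KL(q || p) between univariate Gaussians, the per-weight quantity
    minimized when fitting variational densities in a Bayesian deep
    learning layer against a (standard-normal) prior:
    KL = ln(sigma_p/sigma_q) + (sigma_q^2 + (mu_q - mu_p)^2) / (2 sigma_p^2) - 1/2.
    """
    return (np.log(sigma_p / sigma_q)
            + (sigma_q**2 + (mu_q - mu_p)**2) / (2 * sigma_p**2)
            - 0.5)
```

当 q 与先验完全重合时 KL 为 0;变分贝叶斯层即在重构损失之外对每个权重累加这一正则项。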

[LG-219] Public interest in science or bots? Selective amplification of scientific articles on Twitter

链接: https://arxiv.org/abs/2410.01842
作者: Ashiqur Rahman,Ehsan Mohammadi,Hamed Alhoori
关键词-EN: measure public response, bot activity, sharing scholarly articles, Twitter Application Programming, Application Programming Interface
类目: Social and Information Networks (cs.SI); Computers and Society (cs.CY); Digital Libraries (cs.DL); Machine Learning (cs.LG)
*备注: 38 pages, 10 figures. Aslib Journal of Information Management, Vol. ahead-of-print No. ahead-of-print

点击查看摘要

Abstract:With the remarkable capability to reach the public instantly, social media has become integral in sharing scholarly articles to measure public response. Since spamming by bots on social media can steer the conversation and present a false public interest in given research, affecting policies impacting the public’s lives in the real world, this topic warrants critical study and attention. We used the Altmetric dataset in combination with data collected through the Twitter Application Programming Interface (API) and the Botometer API. We combined the data into an extensive dataset with academic articles, several features from the article and a label indicating whether the article had excessive bot activity on Twitter or not. We analyzed the data to see the possibility of bot activity based on different characteristics of the article. We also trained machine-learning models using this dataset to identify possible bot activity in any given article. Our machine-learning models were capable of identifying possible bot activity in any academic article with an accuracy of 0.70. We also found that articles related to “Health and Human Science” are more prone to bot activity compared to other research areas. Without arguing the maliciousness of the bot activity, our work presents a tool to identify the presence of bot activity in the dissemination of an academic article and creates a baseline for future research in this direction.

[LG-220] Temporal Graph Memory Networks For Knowledge Tracing

链接: https://arxiv.org/abs/2410.01836
作者: Seif Gad,Sherif Abdelfattah,Ghodai Abdelrahman
关键词-EN: past exercise answering, automatic tutoring systems, student knowledge growth, past exercise, exercise answering
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tracing a student’s knowledge growth given the past exercise answering is a vital objective in automatic tutoring systems to customize the learning experience. Yet, achieving this objective is a non-trivial task as it involves modeling the knowledge state across multiple knowledge components (KCs) while considering their temporal and relational dynamics during the learning process. Knowledge tracing methods have tackled this task by either modeling KCs’ temporal dynamics using recurrent models or relational dynamics across KCs and questions using graph models. However, there is a lack of methods that can learn a joint embedding of the task’s relational and temporal dynamics. Moreover, many methods that account for the impact of a student’s forgetting behavior during the learning process use hand-crafted features, limiting their generalization across different scenarios. In this paper, we propose a novel method that jointly models the relational and temporal dynamics of the knowledge state using a deep temporal graph memory network. In addition, we propose a generic technique for representing a student’s forgetting behavior using temporal decay constraints on the graph memory module. We demonstrate the effectiveness of our proposed method using multiple knowledge tracing benchmarks while comparing it to state-of-the-art methods.

[LG-221] Analysis of Convolutional Neural Network-based Image Classifications: A Multi-Featured Application for Rice Leaf Disease Prediction and Recommendations for Farmers

链接: https://arxiv.org/abs/2410.01827
作者: Biplov Paneru,Bishwash Paneru,Krishna Bikram Shah
关键词-EN: convolutional neural network, neural network, precision agriculture, study presents, method for improving
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:This study presents a novel method for improving rice disease classification using 8 different convolutional neural network (CNN) algorithms, which will further the field of precision agriculture. It is paired with a Tkinter-based application that offers farmers a feature-rich interface. With the help of this cutting-edge application, farmers will be able to make timely and well-informed decisions, as it enables real-time disease prediction and provides personalized recommendations. Together with the user-friendly Tkinter interface, the smooth integration of cutting-edge CNN transfer learning algorithms, including ResNet-50, InceptionV3, VGG16, and MobileNetV2, with the UCI dataset represents a major advancement toward modernizing agricultural practices and guaranteeing sustainable crop management. Remarkable outcomes include 75% accuracy for ResNet-50, 90% for DenseNet121, 84% for VGG16, 95.83% for MobileNetV2, 91.61% for DenseNet169, and 86% for InceptionV3. These results give a concise summary of the models’ capabilities, assisting researchers in choosing appropriate strategies for precise and successful rice crop disease identification. Severe overfitting was observed on VGG19 (70% accuracy) and NASNet (80.02% accuracy). ResNet101 achieved only 54% accuracy, and EfficientNetB0 only 33%. A MobileNetV2-trained model was successfully deployed in the Tkinter GUI application to make predictions from images or real-time video capture.

[LG-222] Large Language Models as Markov Chains

链接: https://arxiv.org/abs/2410.02724
作者: Oussama Zekri,Ambroise Odonnat,Abdelhakim Benechehab,Linus Bleistein,Nicolas Boullé,Ievgen Redko
关键词-EN: Large language models, natural language processing, language processing tasks, Large language, remarkably efficient
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 49 pages, 17 figures

点击查看摘要

Abstract:Large language models (LLMs) have proven to be remarkably efficient, both across a wide range of natural language processing tasks and well beyond them. However, a comprehensive theoretical analysis of the origins of their impressive performance remains elusive. In this paper, we approach this challenging task by drawing an equivalence between generic autoregressive language models with vocabulary of size T and context window of size K and Markov chains defined on a finite state space of size \mathcal{O}(T^K) . We derive several surprising findings related to the existence of a stationary distribution of Markov chains that capture the inference power of LLMs, their speed of convergence to it, and the influence of the temperature on the latter. We then prove pre-training and in-context generalization bounds and show how the drawn equivalence allows us to enrich their interpretation. Finally, we illustrate our theoretical guarantees with experiments on several recent LLMs to highlight how they capture the behavior observed in practice.
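这一等价关系在上下文窗口 K = 1 时最容易看清:状态就是词元本身,下一词元条件概率构成转移矩阵(一般的 K 下状态是长度为 K 的上下文,共 \mathcal{O}(T^K) 个)。用幂迭代求稳态分布的示意如下:

```python
import numpy as np

def stationary_distribution(P, tol=1e-12, max_iter=100_000):
    """Stationary distribution pi of a row-stochastic transition
    matrix P (the pi satisfying pi @ P = pi), found by power
    iteration starting from the uniform distribution."""
    n = P.shape[0]
    pi = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new = pi @ P
        done = np.abs(new - pi).max() < tol
        pi = new
        if done:
            break
    return pi
```

论文正是研究这类稳态分布的存在性、收敛速度以及温度(它会重新缩放 logits,从而改变 P)对收敛的影响。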

[LG-223] Measurements with Noise: Bayesian Optimization for Co-optimizing Noise and Property Discovery in Automated Experiments

链接: https://arxiv.org/abs/2410.02717
作者: Boris N. Slautin,Yu Liu,Jan Dec,Vladimir V. Shvartsman,Doru C. Lupascu,Maxim Ziatdinov,Sergei V. Kalinin
关键词-EN: developed a Bayesian, integrates intra-step noise, automated experimental cycles, Bayesian optimization, integrates intra-step
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 22 pages, 9 figures

点击查看摘要

Abstract:We have developed a Bayesian optimization (BO) workflow that integrates intra-step noise optimization into automated experimental cycles. Traditional BO approaches in automated experiments focus on optimizing experimental trajectories but often overlook the impact of measurement noise on data quality and cost. Our proposed framework simultaneously optimizes both the target property and the associated measurement noise by introducing time as an additional input parameter, thereby balancing the signal-to-noise ratio and experimental duration. Two approaches are explored: a reward-driven noise optimization and a double-optimization acquisition function, both enhancing the efficiency of automated workflows by considering noise and cost within the optimization process. We validate our method through simulations and real-world experiments using Piezoresponse Force Microscopy (PFM), demonstrating the successful optimization of measurement duration and property exploration. Our approach offers a scalable solution for optimizing multiple variables in automated experimental workflows, improving data quality, and reducing resource expenditure in materials science and beyond.
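A crude numerical rendering of the time-as-input idea: score each candidate measurement duration by information gain minus time cost. The 1/sqrt(t) noise model, the logarithmic information gain, and the cost constant are all illustrative assumptions, not the paper's acquisition functions.

```python
import math

def noise_std(t, sigma0=1.0):
    # Assumed model: longer measurement time t averages down noise as 1/sqrt(t).
    return sigma0 / math.sqrt(t)

def score(t, cost_per_unit=0.05):
    # Information gain with diminishing returns, minus a linear time cost.
    info_gain = math.log(1.0 + 1.0 / noise_std(t) ** 2)
    return info_gain - cost_per_unit * t

candidate_times = [1, 4, 16, 64, 256]
best_t = max(candidate_times, key=score)  # balances SNR against duration
```

Under these toy constants the score peaks at an intermediate duration: very short measurements are too noisy, very long ones cost more time than the extra information is worth.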

[LG-224] AlzhiNet: Traversing from 2DCNN to 3DCNN Towards Early Detection and Diagnosis of Alzheimer's Disease

链接: https://arxiv.org/abs/2410.02714
作者: Romoke Grace Akindele,Samuel Adebayo,Paul Shekonya Kanda,Ming Yu
关键词-EN: Convolutional Neural Networks, progressive neurodegenerative disorder, Convolutional Neural, Neural Networks, effective disease management
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Alzheimer’s disease (AD) is a progressive neurodegenerative disorder with increasing prevalence among the aging population, necessitating early and accurate diagnosis for effective disease management. In this study, we present a novel hybrid deep learning framework that integrates both 2D Convolutional Neural Networks (2D-CNN) and 3D Convolutional Neural Networks (3D-CNN), along with a custom loss function and volumetric data augmentation, to enhance feature extraction and improve classification performance in AD diagnosis. According to extensive experiments, AlzhiNet outperforms standalone 2D and 3D models, highlighting the importance of combining these complementary representations of data. The depth and quality of 3D volumes derived from the augmented 2D slices also significantly influence the model’s performance. The results indicate that carefully selecting weighting factors in hybrid predictions is imperative for achieving optimal results. Our framework has been validated on the Magnetic Resonance Imaging (MRI) from Kaggle and MIRIAD datasets, obtaining accuracies of 98.9% and 99.99%, respectively, with an AUC of 100%. Furthermore, AlzhiNet was studied under a variety of perturbation scenarios on the Alzheimer’s Kaggle dataset, including Gaussian noise, brightness, contrast, salt and pepper noise, color jitter, and occlusion. The results obtained show that AlzhiNet is more robust to perturbations than ResNet-18, making it an excellent choice for real-world applications. This approach represents a promising advancement in the early diagnosis and treatment planning for Alzheimer’s disease.
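The weighting-factor selection the abstract highlights can be sketched as a small validation search over a fusion weight w. All probabilities and labels below are made-up stand-ins for the 2D and 3D model outputs; only the weighted-fusion idea matches the abstract.

```python
# Hypothetical validation labels and per-model P(class = 1) outputs.
val_labels = [1, 0, 1, 1, 0]
p2d = [0.6, 0.4, 0.7, 0.4, 0.3]   # 2D-CNN probabilities (made up)
p3d = [0.8, 0.2, 0.4, 0.7, 0.6]   # 3D-CNN probabilities (made up)

def accuracy(w):
    # Hybrid prediction: w * p2d + (1 - w) * p3d, thresholded at 0.5.
    preds = [1 if w * a + (1 - w) * b >= 0.5 else 0 for a, b in zip(p2d, p3d)]
    return sum(p == y for p, y in zip(preds, val_labels)) / len(val_labels)

# Grid-search the weighting factor on validation data.
best_w = max((w / 10 for w in range(11)), key=accuracy)
```

In this toy example neither model alone classifies every validation case correctly, but an intermediate weight does, which is the sense in which "carefully selecting weighting factors" matters.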

[LG-225] Highly Adaptive Ridge

链接: https://arxiv.org/abs/2410.02680
作者: Alejandro Schuler,Alexander Hagemeister,Mark van der Laan
关键词-EN: Highly Adaptive Ridge, square-integrable sectional derivatives, Highly Adaptive, propose the Highly, Adaptive Ridge
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper we propose the Highly Adaptive Ridge (HAR): a regression method that achieves an n^{-1/3} dimension-free L^2 convergence rate in the class of right-continuous functions with square-integrable sectional derivatives. This is a large nonparametric function class that is particularly appropriate for tabular data. HAR is exactly kernel ridge regression with a specific data-adaptive kernel based on a saturated zero-order tensor-product spline basis expansion. We use simulation and real data to confirm our theory. We demonstrate empirical performance better than state-of-the-art algorithms for small datasets in particular.
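The kernel ridge regression machinery that HAR instantiates can be sketched as follows. The paper's data-adaptive saturated zero-order spline kernel is replaced by a simple stand-in, k(x, x') = 1 + sum_j min(x_j, x'_j), chosen purely for illustration; only the ridge solve itself matches the abstract.

```python
def k(x, xp):
    # Stand-in kernel (an assumption), NOT HAR's spline-basis kernel.
    return 1.0 + sum(min(a, b) for a, b in zip(x, xp))

def gauss_solve(A, b):
    # Solve A x = b by Gaussian elimination with partial pivoting (small systems).
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

X = [[0.1], [0.4], [0.5], [0.9]]         # toy 1-D covariates
y = [0.0, 1.0, 1.0, 0.0]
lam = 1e-4                                # ridge regularization strength
# Kernel ridge: solve (K + lam * I) alpha = y, predict f(x) = sum_i alpha_i k(x, x_i).
G = [[k(xi, xj) + (lam if i == j else 0.0) for j, xj in enumerate(X)]
     for i, xi in enumerate(X)]
alpha = gauss_solve(G, y)

def predict(x):
    return sum(a * k(x, xi) for a, xi in zip(alpha, X))
```

With a tiny lam the fit nearly interpolates the training points, which is enough to see the mechanics; HAR's contribution is the specific kernel, not this solve.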

[LG-226] Estimating Generalization Performance Along the Trajectory of Proximal SGD in Robust Regression

链接: https://arxiv.org/abs/2410.02629
作者: Kai Tan,Pierre C. Bellec
关键词-EN: Stochastic Gradient Descent, Gradient Descent, Stochastic Gradient, robust regression problems, high-dimensional robust regression
类目: Statistics Theory (math.ST); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:This paper studies the generalization performance of iterates obtained by Gradient Descent (GD), Stochastic Gradient Descent (SGD) and their proximal variants in high-dimensional robust regression problems. The number of features is comparable to the sample size and errors may be heavy-tailed. We introduce estimators that precisely track the generalization error of the iterates along the trajectory of the iterative algorithm. These estimators are provably consistent under suitable conditions. The results are illustrated through several examples, including Huber regression, pseudo-Huber regression, and their penalized variants with non-smooth regularizer. We provide explicit generalization error estimates for iterates generated from GD and SGD, or from proximal SGD in the presence of a non-smooth regularizer. The proposed risk estimates serve as effective proxies for the actual generalization error, allowing us to determine the optimal stopping iteration that minimizes the generalization error. Extensive simulations confirm the effectiveness of the proposed generalization error estimates.
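The optimal-stopping use of such risk estimates can be sketched as follows. Here a held-out set stands in for the paper's trajectory-wise estimators (an assumption: the paper's estimators track the generalization error without held-out data); everything else, including the toy least-squares problem, is illustrative.

```python
import random

random.seed(1)
d, n = 5, 40
w_true = [1.0] * d

def make_data(m, noise=0.5):
    X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
    y = [sum(wj * xj for wj, xj in zip(w_true, x)) + random.gauss(0, noise)
         for x in X]
    return X, y

def mse(w, X, y):
    return sum((sum(wj * xj for wj, xj in zip(w, x)) - yi) ** 2
               for x, yi in zip(X, y)) / len(y)

X_train, y_train = make_data(n)
X_val, y_val = make_data(200)   # held-out proxy for the risk estimator

w, lr, risks = [0.0] * d, 0.05, []
for _ in range(150):
    # Gradient of the training MSE for one GD step.
    grad = [0.0] * d
    for x, yi in zip(X_train, y_train):
        r = sum(wj * xj for wj, xj in zip(w, x)) - yi
        for j in range(d):
            grad[j] += 2.0 * r * x[j] / n
    w = [wj - lr * gj for wj, gj in zip(w, grad)]
    risks.append(mse(w, X_val, y_val))  # estimated risk along the trajectory

# Optimal stopping iteration: the minimizer of the estimated risk curve.
best_iter = min(range(len(risks)), key=risks.__getitem__)
```

The stopping rule is the point of the sketch: once the risk curve is available at every iterate, early stopping reduces to an argmin over it.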

[LG-227] Online Learning Guided Quasi-Newton Methods with Global Non-Asymptotic Convergence

链接: https://arxiv.org/abs/2410.02626
作者: Ruichen Jiang,Aryan Mokhtari
关键词-EN: including unconstrained minimization, monotone nonlinear equations, linear convergence rate, convergence rate, global convergence rate
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 54 pages

点击查看摘要

Abstract:In this paper, we propose a quasi-Newton method for solving smooth and monotone nonlinear equations, including unconstrained minimization and minimax optimization as special cases. For the strongly monotone setting, we establish two global convergence bounds: (i) a linear convergence rate that matches the rate of the celebrated extragradient method, and (ii) an explicit global superlinear convergence rate that provably surpasses the linear convergence rate after at most O(d) iterations, where d is the problem's dimension. In addition, for the case where the operator is only monotone, we prove a global convergence rate of O(\min\{1/k, \sqrt{d}/k^{1.25}\}) in terms of the duality gap. This matches the rate of the extragradient method when k = O(d^2) and is faster when k = \Omega(d^2). These results are the first global convergence results to demonstrate a provable advantage of a quasi-Newton method over the extragradient method, without querying the Jacobian of the operator. Unlike classical quasi-Newton methods, we achieve this by using the hybrid proximal extragradient framework and a novel online learning approach for updating the Jacobian approximation matrices. Specifically, guided by the convergence analysis, we formulate the Jacobian approximation update as an online convex optimization problem over non-symmetric matrices, relating the regret of the online problem to the convergence rate of our method. To facilitate efficient implementation, we further develop a tailored online learning algorithm based on an approximate separation oracle, which preserves structures such as symmetry and sparsity in the Jacobian matrices.

[LG-228] Generalization emerges from local optimization in a self-organized learning network

链接: https://arxiv.org/abs/2410.02590
作者: S. Barland,L. Gil
关键词-EN: global error function, building supervised learning, supervised learning networks, design and analyze, paradigm for building
类目: Adaptation and Self-Organizing Systems (nlin.AO); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*备注: This paper is submitted to Phys. Rev. X. It’s a physicist’s study that focus on a new paradigm for deep learning networks. We would have liked to choose other keywords for arXiv to reach a wider community, but don’t have the rights to do so

点击查看摘要

Abstract:We design and analyze a new paradigm for building supervised learning networks, driven only by local optimization rules without relying on a global error function. Traditional neural networks with a fixed topology are made up of identical nodes and derive their expressiveness from an appropriate adjustment of connection weights. In contrast, our network stores new knowledge in the nodes accurately and instantaneously, in the form of a lookup table. Only then is some of this information structured and incorporated into the network geometry. The training error is initially zero by construction and remains so throughout the network topology transformation phase. The latter involves a small number of local topological transformations, such as splitting or merging of nodes and adding binary connections between them. The choice of operations to be carried out is only driven by optimization of expressivity at the local scale. What we are primarily looking for in a learning network is its ability to generalize, i.e. its capacity to correctly answer questions for which it has never learned the answers. We show on numerous examples of classification tasks that the networks generated by our algorithm systematically reach such a state of perfect generalization when the number of learned examples becomes sufficiently large. We report on the dynamics of the change of state and show that it is abrupt and has the distinctive characteristics of a first order phase transition, a phenomenon already observed for traditional learning networks and known as grokking. In addition to proposing a non-potential approach for the construction of learning networks, our algorithm makes it possible to rethink the grokking transition in a new light, under which acquisition of training data and topological structuring of data are completely decoupled phenomena.

[LG-229] The Benefit of Being Bayesian in Online Conformal Prediction

链接: https://arxiv.org/abs/2410.02561
作者: Zhiyu Zhang,Zhou Lu,Heng Yang
关键词-EN: machine learning model, black-box machine learning, Conformal Prediction, valid confidence sets, construction of valid
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Based on the framework of Conformal Prediction (CP), we study the online construction of valid confidence sets given a black-box machine learning model. By converting the target confidence levels into quantile levels, the problem can be reduced to predicting the quantiles (in hindsight) of a sequentially revealed data sequence. Two very different approaches have been studied previously. (i) Direct approach: Assuming the data sequence is iid or exchangeable, one could maintain the empirical distribution of the observed data as an algorithmic belief, and directly predict its quantiles. (ii) Indirect approach: As statistical assumptions often do not hold in practice, a recent trend is to consider the adversarial setting and apply first-order online optimization to moving quantile losses (Gibbs & Candès, 2021). It requires knowing the target quantile level beforehand, and suffers from certain validity issues on the obtained confidence sets, due to the associated loss linearization. This paper presents a novel Bayesian CP framework that combines their strengths. Without any statistical assumption, it is able to both: (i) answer multiple arbitrary confidence level queries online, with provably low regret; and (ii) overcome the validity issues suffered by first-order optimization baselines, due to being “data-centric” rather than “iterate-centric”. From a technical perspective, our key idea is to regularize the algorithmic belief of the above direct approach by a Bayesian prior, which “robustifies” it by simulating a non-linearized Follow the Regularized Leader (FTRL) algorithm on the output. For statisticians, this can be regarded as an online adversarial view of Bayesian inference. Importantly, the proposed belief update backbone is shared by prediction heads targeting different confidence levels, bringing practical benefits analogous to U-calibration (Kleinberg et al., 2023).
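The "direct approach" (i) can be sketched in a few lines: maintain the empirical distribution of observed scores and read off its quantiles before each new point arrives. The paper's Bayesian regularization of this belief is not reproduced; the iid Gaussian scores and the 0.9 target level are illustrative choices.

```python
import bisect
import random

random.seed(0)
sorted_scores = []   # the algorithmic belief: empirical distribution so far

def predict_quantile(level):
    if not sorted_scores:
        return 0.0   # arbitrary prediction before any data is seen
    idx = min(int(level * len(sorted_scores)), len(sorted_scores) - 1)
    return sorted_scores[idx]

target, covered, n = 0.9, 0, 2000
for _ in range(n):
    q = predict_quantile(target)     # predict before the next point arrives
    x = random.gauss(0, 1)           # iid scores, the setting of approach (i)
    covered += x <= q
    bisect.insort(sorted_scores, x)  # update the empirical distribution

coverage = covered / n               # should hover near the 0.9 target
```

On iid data the realized coverage tracks the target level, which is exactly the property the direct approach buys and the adversarial setting takes away.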

[LG-230] Obtaining Lower Query Complexities through Lightweight Zeroth-Order Proximal Gradient Algorithms

链接: https://arxiv.org/abs/2410.02559
作者: Bin Gu,Xiyuan Wei,Hualin Zhang,Yi Chang,Heng Huang
关键词-EN: machine learning problems, random ZO estimator, coordinated ZO estimator, frac, mathcal
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: Neural Computation 36 (5), 897-935

点击查看摘要

Abstract:Zeroth-order (ZO) optimization is one key technique for machine learning problems where gradient calculation is expensive or impossible. Several variance reduced ZO proximal algorithms have been proposed to speed up ZO optimization for non-smooth problems, and all of them opted for the coordinated ZO estimator against the random ZO estimator when approximating the true gradient, since the former is more accurate. While the random ZO estimator introduces bigger error and makes convergence analysis more challenging compared to coordinated ZO estimator, it requires only \mathcal{O}(1) computation, which is significantly less than the \mathcal{O}(d) computation of the coordinated ZO estimator, with d being the dimension of the problem space. To take advantage of the computationally efficient nature of the random ZO estimator, we first propose a ZO objective decrease (ZOOD) property which can incorporate two different types of errors in the upper bound of convergence rate. Next, we propose two generic reduction frameworks for ZO optimization which can automatically derive the convergence results for convex and non-convex problems respectively, as long as the convergence rate for the inner solver satisfies the ZOOD property. With the application of two reduction frameworks on our proposed ZOR-ProxSVRG and ZOR-ProxSAGA, two variance reduced ZO proximal algorithms with fully random ZO estimators, we improve the state-of-the-art function query complexities from \mathcal{O}\left(\min\left\{\frac{dn^{1/2}}{\epsilon^2}, \frac{d}{\epsilon^3}\right\}\right) to \tilde{\mathcal{O}}\left(\frac{n+d}{\epsilon^2}\right) under d > n^{1/2} for non-convex problems, and from \mathcal{O}\left(\frac{d}{\epsilon^2}\right) to \tilde{\mathcal{O}}\left(n\log\frac{1}{\epsilon}+\frac{d}{\epsilon}\right) for convex problems.
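The random ZO estimator at the heart of this trade-off can be sketched as follows: one random direction and one extra function query per step, i.e. O(1) computation, versus the O(d) queries of the coordinated estimator. The toy quadratic objective, step size, and smoothing parameter are illustrative choices, not the paper's ZOR-ProxSVRG/ZOR-ProxSAGA algorithms.

```python
import random

def f(x):
    # Smooth toy objective with minimum at the origin.
    return sum(xi * xi for xi in x)

def zo_grad(f, x, mu=1e-4):
    # Random ZO estimator: finite difference along one Gaussian direction u.
    d = len(x)
    u = [random.gauss(0, 1) for _ in range(d)]
    fx = f(x)
    fxu = f([xi + mu * ui for xi, ui in zip(x, u)])
    scale = (fxu - fx) / mu          # directional slope along u
    return [scale * ui for ui in u]  # unbiased for the gradient up to O(mu)

random.seed(0)
x = [1.0, -2.0, 0.5]
for _ in range(2000):
    g = zo_grad(f, x)
    x = [xi - 0.01 * gi for xi, gi in zip(x, g)]
# x drifts toward the minimizer using only function evaluations.
```

Each iteration touches the objective twice regardless of d, which is the computational advantage the abstract builds its reduction frameworks around.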

[LG-231] Local Flow Matching Generative Models

链接: https://arxiv.org/abs/2410.02548
作者: Chen Xu,Xiuyuan Cheng,Yao Xie
关键词-EN: Local Flow Matching, Flow Matching, introduce Local Flow, simulation-free method, method for learning
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Flow Matching (FM) is a simulation-free method for learning a continuous and invertible flow to interpolate between two distributions, and in particular to generate data from noise in generative modeling. In this paper, we introduce Local Flow Matching (LFM), which learns a sequence of FM sub-models and each matches a diffusion process up to the time of the step size in the data-to-noise direction. In each step, the two distributions to be interpolated by the sub-model are closer to each other than data vs. noise, and this enables the use of smaller models with faster training. The stepwise structure of LFM is natural to be distilled and different distillation techniques can be adopted to speed up generation. Theoretically, we prove a generation guarantee of the proposed flow model in terms of the \chi^2-divergence between the generated and true data distributions. In experiments, we demonstrate the improved training efficiency and competitive generative performance of LFM compared to FM on the unconditional generation of tabular data and image datasets, and also on the conditional generation of robotic manipulation policies.
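The flow-matching ingredient can be sketched generically. This shows vanilla FM training-pair construction in one dimension, not LFM's stepwise scheme: with x0 drawn from noise and x1 from data, the regression target for the velocity field at x_t = (1 - t) * x0 + t * x1 is simply x1 - x0.

```python
import random

random.seed(0)

def fm_training_example(data_sample):
    x0 = random.gauss(0, 1)       # noise endpoint
    x1 = data_sample              # data endpoint
    t = random.random()           # interpolation time in [0, 1]
    xt = (1 - t) * x0 + t * x1    # point on the straight path
    return t, xt, x1 - x0         # (time, velocity-net input, regression target)

t, xt, v = fm_training_example(2.0)
# LFM's refinement, per the abstract: split [0, 1] into steps and train one
# smaller sub-model per slice, so each interpolates two nearby distributions.
```

A quick consistency check: subtracting t * v from xt recovers the noise endpoint, so adding v back recovers the data sample.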

[LG-232] Dual Active Learning for Reinforcement Learning from Human Feedback

链接: https://arxiv.org/abs/2410.02504
作者: Pangpang Liu,Chengchun Shi,Will Wei Sun
关键词-EN: Aligning large language, large language models, generative artificial intelligence, Aligning large, language models
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Aligning large language models (LLMs) with human preferences is critical to recent advances in generative artificial intelligence. Reinforcement learning from human feedback (RLHF) is widely applied to achieve this objective. A key step in RLHF is to learn the reward function from human feedback. However, human feedback is costly and time-consuming, making it essential to collect high-quality conversation data for human teachers to label. Additionally, different human teachers have different levels of expertise. It is thus critical to query the most appropriate teacher for their opinions. In this paper, we use offline reinforcement learning (RL) to formulate the alignment problem. Motivated by the idea of D-optimal design, we first propose a dual active reward learning algorithm for the simultaneous selection of conversations and teachers. Next, we apply pessimistic RL to solve the alignment problem, based on the learned reward estimator. Theoretically, we show that the reward estimator obtained through our proposed adaptive selection strategy achieves minimal generalized variance asymptotically, and prove that the sub-optimality of our pessimistic policy scales as O(1/\sqrt{T}) with a given sample budget T. Through simulations and experiments on LLMs, we demonstrate the effectiveness of our algorithm and its superiority over state-of-the-arts.
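The D-optimal design idea the abstract cites can be sketched with a greedy determinant criterion: pick the candidate whose feature vector most increases det(X^T X) of the design matrix. The 2-D feature vectors below are made-up stand-ins for conversation representations, and this greedy rule is a simplification of the paper's dual conversation-and-teacher selection.

```python
def det2(m):
    # Determinant of a 2x2 matrix.
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def info_matrix(X):
    # X^T X for a list of 2-D feature vectors.
    m = [[0.0, 0.0], [0.0, 0.0]]
    for x in X:
        for i in range(2):
            for j in range(2):
                m[i][j] += x[i] * x[j]
    return m

selected = [[1.0, 0.0]]                               # already-labeled queries
candidates = [[1.0, 0.05], [0.0, 1.0], [0.5, 0.5]]    # hypothetical candidates
best = max(candidates, key=lambda c: det2(info_matrix(selected + [c])))
```

The near-duplicate candidate adds almost nothing to the determinant, while the near-orthogonal one maximizes it, which is the intuition behind querying for maximal information.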

[LG-233] Distributed Learning with Discretely Observed Functional Data

链接: https://arxiv.org/abs/2410.02376
作者: Jiading Liu,Lei Shi
关键词-EN: solve statistical inverse, statistical inverse problems, distributed spectral algorithms, spectral algorithms, selecting different filter
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:By selecting different filter functions, spectral algorithms can generate various regularization methods to solve statistical inverse problems within the learning-from-samples framework. This paper combines distributed spectral algorithms with Sobolev kernels to tackle the functional linear regression problem. The design and mathematical analysis of the algorithms require only that the functional covariates are observed at discrete sample points. Furthermore, the hypothesis function spaces of the algorithms are the Sobolev spaces generated by the Sobolev kernels, optimizing both approximation capability and flexibility. Through the establishment of regularity conditions for the target function and functional covariate, we derive matching upper and lower bounds for the convergence of the distributed spectral algorithms in the Sobolev norm. This demonstrates that the proposed regularity conditions are reasonable and that the convergence analysis under these conditions is tight, capturing the essential characteristics of functional linear regression. The analytical techniques and estimates developed in this paper also enhance existing results in the previous literature.

[LG-234] A novel neural network-based approach to derive a geomagnetic baseline for robust characterization of geomagnetic indices at mid-latitude

链接: https://arxiv.org/abs/2410.02311
作者: Rungployphan Kieokaew,Veronika Haberle,Aurélie Marchaudon,Pierre-Louis Blelly,Aude Chambodut
关键词-EN: magnetic measurements characterize, ground magnetic measurements, solar-terrestrial interaction, Geomagnetic, derived from ground
类目: Space Physics (physics.space-ph); Earth and Planetary Astrophysics (astro-ph.EP); Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*备注:

点击查看摘要

Abstract:Geomagnetic indices derived from ground magnetic measurements characterize the intensity of solar-terrestrial interaction. The \textit{Kp} index derived from multiple magnetic observatories at mid-latitude has commonly been used for space weather operations. Yet, its temporal cadence is low and its intensity scale is crude. To derive a new generation of geomagnetic indices, it is desirable to establish a `geomagnetic baseline' that defines the quiet-level of activity without solar-driven perturbations. We present a new approach for deriving a baseline that represents the time-dependent quiet variations focusing on data from Chambon-la-Forêt, France. Using a filtering technique, the measurements are first decomposed into the above-diurnal variation and the sum of 24h, 12h, 8h, and 6h filters, called the daily variation. Using correlation tools and SHapley Additive exPlanations, we identify parameters that dominantly correlate with the daily variation. Here, we predict the `daily quiet' variation using a long short-term memory neural network trained using at least 11 years of data at 1h cadence. This predicted daily quiet variation is combined with linear extrapolation of the secular trend associated with the intrinsic geomagnetic variability, which dominates the above-diurnal variation, to yield a new geomagnetic baseline. Unlike the existing baselines, our baseline is insensitive to geomagnetic storms. It is thus suitable for defining geomagnetic indices that accurately reflect the intensity of solar-driven perturbations. Our methodology is quick to implement and scalable, making it suitable for real-time operation. Strategies for operational forecasting of our geomagnetic baseline 1 day and 27 days in advance are presented.

[LG-235] On Lai's Upper Confidence Bound in Multi-Armed Bandits

链接: https://arxiv.org/abs/2410.02279
作者: Huachen Ren,Cun-Hui Zhang
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 25 pages

点击查看摘要

[LG-236] Fast nonparametric feature selection with error control using integrated path stability selection

链接: https://arxiv.org/abs/2410.02208
作者: Omar Melikechi,David B. Dunson,Jeffrey W. Miller
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
*备注:

点击查看摘要

[LG-237] SC-CDM: Enhancing Quality of Image Semantic Communication with a Compact Diffusion Model

链接: https://arxiv.org/abs/2410.02121
作者: Kexin Zhang,Lixin Li,Wensheng Lin,Yuna Yan,Wenchi Cheng,Zhu Han
关键词-EN: mobile communication systems, mobile communication, communication systems, Semantic Communication, emerging technology
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: arXiv admin note: text overlap with arXiv:2408.05112

点击查看摘要

Abstract:Semantic Communication (SC) is an emerging technology that has attracted much attention in the sixth-generation (6G) mobile communication systems. However, little of the literature has fully considered the perceptual quality of the reconstructed image. To solve this problem, we propose a generative SC for wireless image transmission (denoted as SC-CDM). This approach leverages compact diffusion models to improve the fidelity and semantic accuracy of the images reconstructed after transmission, ensuring that the essential content is preserved even in bandwidth-constrained environments. Specifically, we aim to redesign the Swin Transformer as a new backbone for efficient semantic feature extraction and compression. Next, the receiver integrates the slim prior and image reconstruction networks. Compared to traditional Diffusion Models (DMs), it leverages DMs’ robust distribution mapping capability to generate a compact condition vector, guiding image recovery, thus enhancing the perceptual details of the reconstructed images. Finally, a series of evaluation and ablation studies are conducted to validate the effectiveness and robustness of the proposed algorithm and further increase the Peak Signal-to-Noise Ratio (PSNR) by over 17% on top of CNN-based DeepJSCC.

[LG-238] Posterior sampling via Langevin dynamics based on generative priors

链接: https://arxiv.org/abs/2410.02078
作者: Vishal Purohit,Matthew Repasky,Jianfeng Lu,Qiang Qiu,Yao Xie,Xiuyuan Cheng
关键词-EN:
类目: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-239] A Likelihood Based Approach to Distribution Regression Using Conditional Deep Generative Models

链接: https://arxiv.org/abs/2410.02025
作者: Shivam Kumar,Yun Yang,Lizhen Lin
关键词-EN: high-dimensional ambient space, response variable lies, potentially lower-dimensional manifold, deep generative models, conditional deep generative
类目: atistics Theory (math.ST); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: arXiv admin note: text overlap with arXiv:1708.06633 by other authors

点击查看摘要

Abstract:In this work, we explore the theoretical properties of conditional deep generative models under the statistical framework of distribution regression where the response variable lies in a high-dimensional ambient space but concentrates around a potentially lower-dimensional manifold. More specifically, we study the large-sample properties of a likelihood-based approach for estimating these models. Our results lead to the convergence rate of a sieve maximum likelihood estimator (MLE) for estimating the conditional distribution (and its devolved counterpart) of the response given predictors in the Hellinger (Wasserstein) metric. Our rates depend solely on the intrinsic dimension and smoothness of the true conditional distribution. These findings provide an explanation of why conditional deep generative models can circumvent the curse of dimensionality from the perspective of statistical foundations and demonstrate that they can learn a broader class of nearly singular conditional distributions. Our analysis also emphasizes the importance of introducing a small noise perturbation to the data when they are supported sufficiently close to a manifold. Finally, in our numerical studies, we demonstrate the effective implementation of the proposed approach using both synthetic and real-world datasets, which also provide complementary validation to our theoretical findings.

[LG-240] Auto-conditioned primal-dual hybrid gradient method and alternating direction method of multipliers

链接: https://arxiv.org/abs/2410.01979
作者: Guanghui Lan,Tianjiao Li
关键词-EN: bilinear saddle point, saddle point problems, Line search procedures, Line search, saddle point
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Line search procedures are often employed in primal-dual methods for bilinear saddle point problems, especially when the norm of the linear operator is large or difficult to compute. In this paper, we demonstrate that line search is unnecessary by introducing a novel primal-dual method, the auto-conditioned primal-dual hybrid gradient (AC-PDHG) method, which achieves optimal complexity for solving bilinear saddle point problems. AC-PDHG is fully adaptive to the linear operator, using only past iterates to estimate its norm. We further tailor AC-PDHG to solve linearly constrained problems, providing convergence guarantees for both the optimality gap and constraint violation. Moreover, we explore an important class of linearly constrained problems where both the objective and constraints decompose into two parts. By incorporating the design principles of AC-PDHG into the preconditioned alternating direction method of multipliers (ADMM), we propose the auto-conditioned alternating direction method of multipliers (AC-ADMM), which guarantees convergence based solely on one part of the constraint matrix and fully adapts to it, eliminating the need for line search. Finally, we extend both AC-PDHG and AC-ADMM to solve bilinear problems with an additional smooth term. By integrating these methods with a novel acceleration scheme, we attain optimal iteration complexities under the single-oracle setting.

[LG-241] Quantum-data-driven dynamical transition in quantum learning

链接: https://arxiv.org/abs/2410.01955
作者: Bingzhi Zhang,Junyu Liu,Liang Jiang,Quntao Zhuang
关键词-EN: QNN training dynamics, quantum information processing, QNN training, Quantum circuits, information processing
类目: Quantum Physics (quant-ph); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*备注: 14+30 pages, 25 figures

点击查看摘要

Abstract:Quantum circuits are an essential ingredient of quantum information processing. Parameterized quantum circuits optimized under a specific cost function – quantum neural networks (QNNs) – provide a paradigm for achieving quantum advantage in the near term. Understanding QNN training dynamics is crucial for optimizing their performance. In terms of supervised learning tasks such as classification and regression for large datasets, the role of quantum data in QNN training dynamics remains unclear. We reveal a quantum-data-driven dynamical transition, where the target value and data determine the polynomial or exponential convergence of the training. We analytically derive the complete classification of fixed points from the dynamical equation and reveal a comprehensive `phase diagram' featuring seven distinct dynamics. These dynamics originate from a bifurcation transition with multiple codimensions induced by training data, extending the transcritical bifurcation in simple optimization tasks. Furthermore, perturbative analyses identify an exponential convergence class and a polynomial convergence class among the seven dynamics. We provide a non-perturbative theory to explain the transition via a generalized restricted Haar ensemble. The analytical results are confirmed with numerical simulations of QNN training and experimental verification on IBM quantum devices. As the QNN training dynamics is determined by the choice of the target value, our findings provide guidance on constructing the cost function to optimize the speed of convergence.

[LG-242] Dynamic Portfolio Rebalancing: A Hybrid new Model Using GNNs and Pathfinding for Cost Efficiency

链接: https://arxiv.org/abs/2410.01864
作者: Diego Vallarino
关键词-EN:
类目: Portfolio Management (q-fin.PM); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-243] FredNormer: Frequency Domain Normalization for Non-stationary Time Series Forecasting

链接: https://arxiv.org/abs/2410.01860
作者: Xihao Piao,Zheng Chen,Yushun Dong,Yasuko Matsubara,Yasushi Sakurai
关键词-EN: Recent normalization-based methods, distribution shift issue, shown great success, facilitating non-stationary time, non-stationary time series
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent normalization-based methods have shown great success in tackling the distribution shift issue, facilitating non-stationary time series forecasting. Since these methods operate in the time domain, they may fail to fully capture the dynamic patterns that are more apparent in the frequency domain, leading to suboptimal results. This paper first theoretically analyzes how normalization methods affect frequency components. We prove that the current normalization methods that operate in the time domain uniformly scale non-zero frequencies, and thus, they struggle to determine components that contribute to more robust forecasting. Therefore, we propose FredNormer, which observes datasets from a frequency perspective and adaptively up-weights the key frequency components. To this end, FredNormer consists of two components: a statistical metric that normalizes the input samples based on their frequency stability and a learnable weighting layer that adjusts stability and introduces sample-specific variations. Notably, FredNormer is a plug-and-play module, which does not compromise the efficiency compared to existing normalization methods. Extensive experiments show that FredNormer improves the averaged MSE of backbone forecasting models by 33.3% and 55.3% on the ETTm2 dataset. Compared to the baseline normalization methods, FredNormer achieves 18 top-1 results and 6 top-2 results out of 28 settings.
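The frequency-stability idea can be sketched as follows. The statistic below (mean DFT magnitude over its standard deviation across samples) is an illustrative stand-in, not the exact FredNormer metric, and the learnable weighting layer is omitted; the point is only that a frequency carrying a consistent signal scores far higher than noise-dominated bins.

```python
import cmath
import math
import random

random.seed(0)
n, n_samples = 16, 32
# Synthetic dataset: a sinusoid at frequency index 2 plus Gaussian noise.
samples = [[math.sin(2 * math.pi * 2 * t / n) + 0.2 * random.gauss(0, 1)
            for t in range(n)] for _ in range(n_samples)]

def dft_mag(x):
    # Magnitudes of the discrete Fourier transform, bins 0 .. n/2.
    N = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / N)
                    for t in range(N))) for k in range(N // 2 + 1)]

mags = [dft_mag(s) for s in samples]
stability = []
for k in range(n // 2 + 1):
    col = [m[k] for m in mags]
    mean = sum(col) / len(col)
    var = sum((c - mean) ** 2 for c in col) / len(col)
    stability.append(mean / (var ** 0.5 + 1e-8))  # high = stable frequency
```

Up-weighting the high-stability bins before forecasting is the gist of observing the dataset "from a frequency perspective".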

[LG-244] Enhancing End Stage Renal Disease Outcome Prediction: A Multi-Sourced Data-Driven Approach

Link: https://arxiv.org/abs/2410.01859
Authors: Yubo Li,Rema Padman
Keywords-EN: End Stage Renal, Stage Renal Disease, End Stage, Stage Renal, Chronic Kidney Disease
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Objective: To improve prediction of Chronic Kidney Disease (CKD) progression to End Stage Renal Disease (ESRD) using machine learning (ML) and deep learning (DL) models applied to an integrated clinical and claims dataset of varying observation windows, supported by explainable AI (XAI) to enhance interpretability and reduce bias. Materials and Methods: We utilized data about 10,326 CKD patients, combining their clinical and claims information from 2009 to 2018. Following data preprocessing, cohort identification, and feature engineering, we evaluated multiple statistical, ML and DL models using data extracted from five distinct observation windows. Feature importance and Shapley value analysis were employed to understand key predictors. Models were tested for robustness, clinical relevance, misclassification errors and bias issues. Results: Integrated data models outperformed those using single data sources, with the Long Short-Term Memory (LSTM) model achieving the highest AUC (0.93) and F1 score (0.65). A 24-month observation window was identified as optimal for balancing early detection and prediction accuracy. The 2021 eGFR equation improved prediction accuracy and reduced racial bias, notably for African American patients. Discussion: Improved ESRD prediction accuracy, results interpretability and bias mitigation strategies presented in this study have the potential to significantly enhance CKD and ESRD management, support targeted early interventions and reduce healthcare disparities. Conclusion: This study presents a robust framework for predicting ESRD outcomes in CKD patients, improving clinical decision-making and patient care through multi-sourced, integrated data and AI/ML methods. Future research will expand data integration and explore the application of this framework to other chronic diseases. 
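
The "observation window" feature engineering above can be sketched as aggregating each patient's longitudinal measurements inside a window ending at an index date. The field names and toy eGFR series below are invented for illustration; the paper's pipeline is far richer.

```python
import numpy as np

def window_features(events, index_time, window_months):
    """Aggregate (time, value) records inside an observation window ending
    at the index date, e.g. summary eGFR over the last 24 months. A toy
    stand-in for the paper's feature-engineering step."""
    in_window = [(t, v) for t, v in events
                 if index_time - window_months <= t < index_time]
    values = [v for _, v in in_window]
    return {"n_obs": len(values),
            "mean": float(np.mean(values)) if values else float("nan"),
            "last": values[-1] if values else float("nan")}

egfr = [(1, 80.0), (10, 70.0), (20, 62.0), (30, 55.0)]  # (month, eGFR)
print(window_features(egfr, index_time=36, window_months=24))
# {'n_obs': 2, 'mean': 58.5, 'last': 55.0}
```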

[LG-245] Long-range gene expression prediction with token alignment of large language model

Link: https://arxiv.org/abs/2410.01858
Authors: Edouardo Honig,Huixin Zhan,Ying Nian Wu,Zijun Frank Zhang
Keywords-EN:
Subjects: Cell Behavior (q-bio.CB); Machine Learning (cs.LG); Genomics (q-bio.GN)
*Comments: 14 pages, 10 figures

Click to view abstract

[LG-246] Recovering Time-Varying Networks From Single-Cell Data

Link: https://arxiv.org/abs/2410.01853
Authors: Euxhen Hasanaj,Barnabás Póczos,Ziv Bar-Joseph
Keywords-EN:
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*Comments: 10 pages, 5 figures

Click to view abstract

[LG-247] Context-Aware Predictive Coding: A Representation Learning Framework for WiFi Sensing

Link: https://arxiv.org/abs/2410.01825
Authors: B. Barahimi,H. Tabassum,M. Omer,O. Waqar
Keywords-EN:
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
*Comments:

Click to view abstract

[LG-248] Democratizing Signal Processing and Machine Learning: Math Learning Equity for Elementary and Middle School Students

Link: https://arxiv.org/abs/2409.17304
Authors: Namrata Vaswani,Mohamed Y. Selim,Renee Serrell Gibert
Keywords-EN:
Subjects: History and Overview (math.HO); Computers and Society (cs.CY); Machine Learning (cs.LG)
*Comments: Under submission to IEEE Signal Processing Magazine

Click to view abstract

Information Retrieval

[IR-0] Unified Multi-Modal Interleaved Document Representation for Information Retrieval

Link: https://arxiv.org/abs/2410.02729
Authors: Jaewoo Lee,Joonho Ko,Jinheon Baek,Soyeong Jeong,Sung Ju Hwang
Keywords-EN: natural language tasks, gained remarkable attention, remarkable attention due, language tasks, gained remarkable
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*Comments: Preprint

Click to view abstract

Abstract:Information Retrieval (IR) methods aim to identify relevant documents in response to a given query, which have gained remarkable attention due to their successful application in various natural language tasks. However, existing approaches typically consider only the textual information within the documents, which overlooks the fact that documents can contain multiple modalities, including texts, images, and tables. Further, they often segment each long document into multiple discrete passages for embedding, preventing them from capturing the overall document context and interactions between paragraphs. We argue that these two limitations lead to suboptimal document representations for retrieval. In this work, to address them, we aim to produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities. Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation. Moreover, to mitigate the information loss from segmenting documents into passages, instead of representing and retrieving passages individually, we further merge the representations of segmented passages into one single document representation, while we additionally introduce a reranking strategy to decouple and identify the relevant passage within the document if necessary. Then, through extensive experiments on diverse information retrieval scenarios considering both the textual and multimodal queries, we show that our approach substantially outperforms relevant baselines, thanks to the consideration of the multimodal information interleaved within the documents in a unified way.
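
A minimal sketch of the merge-then-rerank idea above, with synthetic vectors standing in for vision-language-model embeddings of interleaved passages (all dimensions, noise levels, and names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Stand-in passage embeddings: three documents whose passages cluster
# around distinct topic directions. A real system would embed interleaved
# text/image/table passages with a vision-language model.
topics = np.eye(8)[:3]
docs = [normalize(topics[i] + 0.2 * rng.normal(size=(n, 8)))
        for i, n in enumerate((3, 5, 2))]

# Merge: one representation per document (mean of its passage vectors),
# instead of indexing every passage separately.
doc_reps = normalize(np.stack([d.mean(axis=0) for d in docs]))

query = normalize(topics[1] + 0.2 * rng.normal(size=8))

# Stage 1: retrieve the best document with one similarity per document.
best_doc = int(np.argmax(doc_reps @ query))

# Stage 2 (reranking): locate the relevant passage inside that document.
best_passage = int(np.argmax(docs[best_doc] @ query))
print(best_doc)  # → 1, the document sharing the query's topic
```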

[IR-1] Domain-Specific Retrieval-Augmented Generation Using Vector Stores, Knowledge Graphs and Tensor Factorization ICMLA

Link: https://arxiv.org/abs/2410.02721
Authors: Ryan C. Barron,Ves Grantcharov,Selma Wanna,Maksim E. Eren,Manish Bhattarai,Nicholas Solovyev,George Tompkins,Charles Nicholas,Kim Ø. Rasmussen,Cynthia Matuszek,Boian S. Alexandrov
Keywords-EN: Large Language Models, natural language processing, general natural language, numerous general natural, Large Language
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Software Engineering (cs.SE)
*Comments: 9 pages, 7 figures, 1 table, 1 cypher code. Accepted to ICMLA 2024

Click to view abstract

Abstract:Large Language Models (LLMs) are pre-trained on large-scale corpora and excel in numerous general natural language processing (NLP) tasks, such as question answering (QA). Despite their advanced language capabilities, when it comes to domain-specific and knowledge-intensive tasks, LLMs suffer from hallucinations, knowledge cut-offs, and lack of knowledge attributions. Additionally, fine tuning LLMs’ intrinsic knowledge to highly specific domains is an expensive and time consuming process. The retrieval-augmented generation (RAG) process has recently emerged as a method capable of optimization of LLM responses, by referencing them to a predetermined ontology. It was shown that using a Knowledge Graph (KG) ontology for RAG improves the QA accuracy, by taking into account relevant sub-graphs that preserve the information in a structured manner. In this paper, we introduce SMART-SLIC, a highly domain-specific LLM framework, that integrates RAG with KG and a vector store (VS) that store factual domain specific information. Importantly, to avoid hallucinations in the KG, we build these highly domain-specific KGs and VSs without the use of LLMs, but via NLP, data mining, and nonnegative tensor factorization with automatic model selection. Pairing our RAG with a domain-specific: (i) KG (containing structured information), and (ii) VS (containing unstructured information) enables the development of domain-specific chat-bots that attribute the source of information, mitigate hallucinations, lessen the need for fine-tuning, and excel in highly domain-specific question answering tasks. We pair SMART-SLIC with chain-of-thought prompting agents. The framework is designed to be generalizable to adapt to any specific or specialized domain. In this paper, we demonstrate the question answering capabilities of our framework on a corpus of scientific publications on malware analysis and anomaly detection.

[IR-2] Attention in Large Language Models Yields Efficient Zero-Shot Re-Rankers

Link: https://arxiv.org/abs/2410.02642
Authors: Shijie Chen,Bernal Jiménez Gutiérrez,Yu Su
Keywords-EN: modern digital life, played a vital, vital role, role in modern, modern digital
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*Comments:

Click to view abstract

Abstract:Information retrieval (IR) systems have played a vital role in modern digital life and have cemented their continued usefulness in this new era of generative AI via retrieval-augmented generation. With strong language processing capabilities and remarkable versatility, large language models (LLMs) have become popular choices for zero-shot re-ranking in IR systems. So far, LLM-based re-ranking methods rely on strong generative capabilities, which restricts their use to either specialized or powerful proprietary models. Given these restrictions, we ask: is autoregressive generation necessary and optimal for LLMs to perform re-ranking? We hypothesize that there are abundant signals relevant to re-ranking within LLMs that might not be used to their full potential via generation. To more directly leverage such signals, we propose in-context re-ranking (ICR), a novel method that leverages the change in attention pattern caused by the search query for accurate and efficient re-ranking. To mitigate the intrinsic biases in LLMs, we propose a calibration method using a content-free query. Due to the absence of generation, ICR only requires two ( O(1) ) forward passes to re-rank N documents, making it substantially more efficient than generative re-ranking methods that require at least O(N) forward passes. Our novel design also enables ICR to be applied to any LLM without specialized training while guaranteeing a well-formed ranking. Extensive experiments with two popular open-weight LLMs on standard single-hop and multi-hop information retrieval benchmarks show that ICR outperforms RankGPT while cutting the latency by more than 60% in practice. Through detailed analyses, we show that ICR’s performance is specially strong on tasks that require more complex re-ranking signals. Our findings call for further exploration on novel ways of utilizing open-weight LLMs beyond text generation.
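
The attention-based scoring with content-free calibration can be sketched with synthetic attention matrices. The aggregation below is one plausible reading of ICR, not the paper's exact recipe; all numbers are invented.

```python
import numpy as np

def icr_scores(attn_query, attn_free, doc_spans):
    """Score each document by the attention mass its tokens receive from
    the query tokens, minus the mass under a content-free query (which
    calibrates away positional/intrinsic bias)."""
    return np.array([attn_query[:, s:e].sum() - attn_free[:, s:e].sum()
                     for s, e in doc_spans])

doc_spans = [(0, 4), (4, 8), (8, 12)]      # 3 documents, 4 tokens each
position_bias = np.linspace(0.5, 1.5, 12)  # later tokens attract attention

# Content-free query: attention reflects only the positional bias.
attn_free = np.tile(position_bias / position_bias.sum(), (4, 1))

# Real query: same bias plus genuine relevance mass on document 1's tokens.
relevance = np.zeros(12)
relevance[4:8] = 1.0
attn_query = np.tile(position_bias + relevance, (4, 1))
attn_query /= attn_query.sum(axis=1, keepdims=True)

scores = icr_scores(attn_query, attn_free, doc_spans)
print(int(np.argmax(scores)))  # → 1, the truly relevant document
```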

[IR-3] Long-Sequence Recommendation Models Need Decoupled Embeddings

Link: https://arxiv.org/abs/2410.02604
Authors: Ningya Feng,Junwei Pan,Jialong Wu,Baixu Chen,Ximei Wang,Qian Li,Xian Hu,Jie Jiang,Mingsheng Long
Keywords-EN: Lifelong user behavior, capturing user interests, predicting user responses, Lifelong user, user behavior sequences
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*Comments: First three authors contributed equally

Click to view abstract

Abstract:Lifelong user behavior sequences, comprising up to tens of thousands of history behaviors, are crucial for capturing user interests and predicting user responses in modern recommendation systems. A two-stage paradigm is typically adopted to handle these long sequences: a few relevant behaviors are first searched from the original long sequences via an attention mechanism in the first stage and then aggregated with the target item to construct a discriminative representation for prediction in the second stage. In this work, we identify and characterize, for the first time, a neglected deficiency in existing long-sequence recommendation models: a single set of embeddings struggles with learning both attention and representation, leading to interference between these two processes. Initial attempts to address this issue using linear projections – a technique borrowed from language processing – proved ineffective, shedding light on the unique challenges of recommendation models. To overcome this, we propose the Decoupled Attention and Representation Embeddings (DARE) model, where two distinct embedding tables are initialized and learned separately to fully decouple attention and representation. Extensive experiments and analysis demonstrate that DARE provides more accurate search of correlated behaviors and outperforms baselines with AUC gains up to 0.9% on public datasets and notable online system improvements. Furthermore, decoupling embedding spaces allows us to reduce the attention embedding dimension and accelerate the search procedure by 50% without significant performance impact, enabling more efficient, high-performance online serving.
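
The decoupling itself is easy to sketch: two independently learned embedding tables, one used only to compute attention weights and one used only for the aggregated representation. Dimensions and data below are invented; note the attention table can be smaller, which is how the paper cuts search cost.

```python
import numpy as np

rng = np.random.default_rng(3)
n_items, d_attn, d_repr = 100, 4, 8  # attention dim smaller than repr dim

# Two independent tables, learned separately in the real model.
E_attn = rng.normal(size=(n_items, d_attn))
E_repr = rng.normal(size=(n_items, d_repr))

def dare_aggregate(history_ids, target_id):
    """Attend over a user's behavior history w.r.t. a target item using
    the attention table, then aggregate the *representation* embeddings
    with those weights. An illustrative sketch of the decoupling only."""
    logits = E_attn[history_ids] @ E_attn[target_id]  # (len(history),)
    w = np.exp(logits - logits.max())
    w /= w.sum()                                      # softmax weights
    return w @ E_repr[history_ids]                    # (d_repr,)

user_history = [3, 17, 42, 99]
rep = dare_aggregate(user_history, target_id=7)
print(rep.shape)  # (8,)
```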

[IR-4] Quantifying User Coherence: A Unified Framework for Cross-Domain Recommendation Analysis

Link: https://arxiv.org/abs/2410.02453
Authors: Michaël Soumm,Alexandre Fournier-Montgieux,Adrian Popescu,Bertrand Delezoide
Keywords-EN: quality remains under-researched, Recommender Systems, profile quality remains, understanding recommender systems, remains under-researched
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:The effectiveness of Recommender Systems (RS) is closely tied to the quality and distinctiveness of user profiles, yet despite many advancements in raw performance, the sensitivity of RS to user profile quality remains under-researched. This paper introduces novel information-theoretic measures for understanding recommender systems: a “surprise” measure quantifying users’ deviations from popular choices, and a “conditional surprise” measure capturing user interaction coherence. We evaluate 7 recommendation algorithms across 9 datasets, revealing the relationships between our measures and standard performance metrics. Using a rigorous statistical framework, our analysis quantifies how much user profile density and information measures impact algorithm performance across domains. By segmenting users based on these measures, we achieve improved performance with reduced data and show that simpler algorithms can match complex ones for low-coherence users. Additionally, we employ our measures to analyze how well different recommendation algorithms maintain the coherence and diversity of user preferences in their predictions, providing insights into algorithm behavior. This work advances the theoretical understanding of user behavior and practical heuristics for personalized recommendation systems, promoting more efficient and adaptive architectures.
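
One plausible reading of the "surprise" measure — deviation from popular choices as mean negative log-popularity — fits in a few lines. The paper's exact definitions may differ; the data below is a toy example.

```python
import numpy as np
from collections import Counter

def surprise(profile, item_probs):
    """Mean negative log-popularity of a user's items: high when the user
    deviates from popular choices (an information-theoretic reading)."""
    return float(np.mean([-np.log(item_probs[i]) for i in profile]))

interactions = ["a", "a", "a", "a", "b", "b", "c", "d"]  # global popularity
counts = Counter(interactions)
total = sum(counts.values())
probs = {i: c / total for i, c in counts.items()}  # a:0.5 b:0.25 c,d:0.125

mainstream_user = ["a", "a", "b"]
niche_user = ["c", "d", "d"]
print(surprise(mainstream_user, probs) < surprise(niche_user, probs))  # True
```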

[IR-5] Multi-modal clothing recommendation model based on large model and VAE enhancement

Link: https://arxiv.org/abs/2410.02219
Authors: Bingjie Huang,Qingyu Lu,Shuaishuai Huang,Xue-she Wang,Haowei Yang
Keywords-EN: requiring in-depth research, Accurately recommending products, subject requiring in-depth, Accurately recommending, in-depth research
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Accurately recommending products has long been a subject requiring in-depth research. This study proposes a multimodal paradigm for clothing recommendations. Specifically, it designs a multimodal analysis method that integrates clothing description texts and images, utilizing a pre-trained large language model to deeply explore the hidden meanings of users and products. Additionally, a variational encoder is employed to learn the relationship between user information and products to address the cold start problem in recommendation systems. This study also validates the significant performance advantages of this method over various recommendation system methods through extensive ablation experiments, providing crucial practical guidance for the comprehensive optimization of recommendation systems.

[IR-6] A Survey on Point-of-Interest Recommendation: Models, Architectures and Security

Link: https://arxiv.org/abs/2410.02191
Authors: Qianru Zhang,Peng Yang,Junliang Yu,Haixin Wang,Xingwei He,Siu-Ming Yiu,Hongzhi Yin
Keywords-EN: Location-Based Social Networks, Social Networks, creating unparalleled opportunities, Location-Based Social, Networks has led
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*Comments: 20 pages

Click to view abstract

Abstract:The widespread adoption of smartphones and Location-Based Social Networks has led to a massive influx of spatio-temporal data, creating unparalleled opportunities for enhancing Point-of-Interest (POI) recommendation systems. These advanced POI systems are crucial for enriching user experiences, enabling personalized interactions, and optimizing decision-making processes in the digital landscape. However, existing surveys tend to focus on traditional approaches and few of them delve into cutting-edge developments, emerging architectures, as well as security considerations in POI recommendations. To address this gap, our survey stands out by offering a comprehensive, up-to-date review of POI recommendation systems, covering advancements in models, architectures, and security aspects. We systematically examine the transition from traditional models to advanced techniques such as large language models. Additionally, we explore the architectural evolution from centralized to decentralized and federated learning systems, highlighting the improvements in scalability and privacy. Furthermore, we address the increasing importance of security, examining potential vulnerabilities and privacy-preserving approaches. Our taxonomy provides a structured overview of the current state of POI recommendation, while we also identify promising directions for future research in this rapidly advancing field.

[IR-7] BayesCNS: A Unified Bayesian Approach to Address Cold Start and Non-Stationarity in Search Systems at Scale

Link: https://arxiv.org/abs/2410.02126
Authors: Randy Ardywibowo,Rakesh Sunki,Lucy Kuo,Sankalp Nayak
Keywords-EN: platforms frequently employ, recommendation platforms frequently, Information Retrieval, frequently employ, recommendation platforms
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Information Retrieval (IR) systems used in search and recommendation platforms frequently employ Learning-to-Rank (LTR) models to rank items in response to user queries. These models heavily rely on features derived from user interactions, such as clicks and engagement data. This dependence introduces cold start issues for items lacking user engagement and poses challenges in adapting to non-stationary shifts in user behavior over time. We address both challenges holistically as an online learning problem and propose BayesCNS, a Bayesian approach designed to handle cold start and non-stationary distribution shifts in search systems at scale. BayesCNS achieves this by estimating prior distributions for user-item interactions, which are continuously updated with new user interactions gathered online. This online learning procedure is guided by a ranker model, enabling efficient exploration of relevant items using contextual information provided by the ranker. We successfully deployed BayesCNS in a large-scale search system and demonstrated its efficacy through comprehensive offline and online experiments. Notably, an online A/B experiment showed a 10.60% increase in new item interactions and a 1.05% improvement in overall success metrics over the existing production baseline.
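
The Bayesian backbone — a prior over a cold-start item's interaction rate that is updated online — can be illustrated with a conjugate Beta-Bernoulli update. BayesCNS's actual model is ranker-guided and handles non-stationarity, so treat this only as the core idea; all numbers are invented.

```python
def update(alpha, beta, clicks, impressions):
    """Conjugate posterior update: prior Beta(alpha, beta) plus observed
    clicks out of impressions."""
    return alpha + clicks, beta + (impressions - clicks)

def mean(alpha, beta):
    """Posterior mean of the click-through rate."""
    return alpha / (alpha + beta)

# Cold-start item: prior chosen from similar items, here a 5% expected CTR.
a, b = 1.0, 19.0
print(mean(a, b))            # 0.05
# Online: the item is shown 100 times and clicked 20 times.
a, b = update(a, b, clicks=20, impressions=100)
print(round(mean(a, b), 3))  # 0.175 — shifted toward the observed 20% CTR
```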

[IR-8] Price-guided user attention in large-scale E-commerce group recommendation

Link: https://arxiv.org/abs/2410.02074
Authors: Yang Shi,Young-joo Chung
Keywords-EN: Existing group recommender, utilize attention mechanisms, Existing group, identify critical users, mechanisms to identify
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Existing group recommender systems utilize attention mechanisms to identify critical users who influence group decisions the most. We analyzed user attention scores from a widely-used group recommendation model on a real-world E-commerce dataset and found that item price and user interaction history significantly influence the selection of critical users. When item prices are low, users with extensive interaction histories are more influential in group decision-making. Conversely, their influence diminishes with higher item prices. Based on these observations, we propose a novel group recommendation approach that incorporates item price as a guiding factor for user aggregation. Our model employs an adaptive sigmoid function to adjust output logits based on item prices, enhancing the accuracy of user aggregation. Our model can be plugged into any attention-based group recommender system if the price information is available. We evaluate our model’s performance on a public benchmark and a real-world dataset. We compare it with other state-of-the-art group recommendation methods. Our results demonstrate that our price-guided user attention approach outperforms the state-of-the-art methods in terms of hit ratio and mean square error.
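
The price-guided mechanism can be sketched as a sigmoid gate on attention logits that flattens toward uniform weighting as item price rises, so history-heavy users dominate only for cheap items. The gate's exact form and all parameters below are assumptions.

```python
import numpy as np

def price_guided_weights(logits, price, k=1.0, p0=1.0):
    """Scale attention logits by a sigmoid of item price before softmax:
    cheap items (price < p0) keep the logits sharp, expensive items damp
    them toward uniform. Illustrative, not the paper's exact gating."""
    gate = 1.0 / (1.0 + np.exp(k * (price - p0)))  # in (0,1), falls with price
    scaled = logits * gate
    w = np.exp(scaled - scaled.max())
    return w / w.sum()

logits = np.array([3.0, 1.0, 0.5])  # user 0 has a long interaction history
cheap = price_guided_weights(logits, price=0.2)
pricey = price_guided_weights(logits, price=5.0)
print(cheap[0] > pricey[0])  # True: the heavy user's weight shrinks with price
```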

[IR-9] Financial Sentiment Analysis on News and Reports Using Large Language Models and FinBERT

Link: https://arxiv.org/abs/2410.01987
Authors: Yanxin Shen,Pulin Kirin Zhang
Keywords-EN: well-informed financial decisions, evaluating market sentiment, making well-informed financial, crucial for evaluating, evaluating market
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Social and Information Networks (cs.SI); General Finance (q-fin.GN)
*Comments:

Click to view abstract

Abstract:Financial sentiment analysis (FSA) is crucial for evaluating market sentiment and making well-informed financial decisions. The advent of large language models (LLMs) such as BERT and its financial variant, FinBERT, has notably enhanced sentiment analysis capabilities. This paper investigates the application of LLMs and FinBERT for FSA, comparing their performance on news articles, financial reports and company announcements. The study emphasizes the advantages of prompt engineering with zero-shot and few-shot strategy to improve sentiment classification accuracy. Experimental results indicate that GPT-4o, with few-shot examples of financial texts, can be as competent as a well fine-tuned FinBERT in this specialized field.
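
The few-shot strategy is, at its core, prompt assembly. The instruction wording and examples below are invented for illustration; a real run would send the resulting string to GPT-4o and compare against a fine-tuned FinBERT.

```python
FEW_SHOT = [
    ("Company X beats earnings expectations and raises guidance.", "positive"),
    ("Regulator opens probe into accounting practices at Firm Y.", "negative"),
    ("Shares were unchanged in quiet pre-holiday trading.", "neutral"),
]

def build_prompt(text):
    """Assemble a few-shot financial-sentiment prompt: an instruction,
    labeled examples, then the target text with the label left blank."""
    lines = ["Classify the sentiment of the financial text as positive, "
             "negative, or neutral.\n"]
    for example, label in FEW_SHOT:
        lines.append(f"Text: {example}\nSentiment: {label}\n")
    lines.append(f"Text: {text}\nSentiment:")
    return "\n".join(lines)

prompt = build_prompt("The bank reported a sharp decline in quarterly profit.")
print(prompt.endswith("Sentiment:"))  # True: the model completes the label
```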

[IR-10] The Importance of Causality in Decision Making: A Perspective on Recommender Systems RECSYS'24

Link: https://arxiv.org/abs/2410.01822
Authors: Emanuele Cavenaghi,Alessio Zanga,Fabio Stella,Markus Zanker
Keywords-EN: receiving increasing attention, transform accurate predictions, Recommendation Systems, explainable decisions, receiving increasing
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*Comments: Accepted at the CONSEQUENCES '24 workshop, co-located with ACM RecSys '24

Click to view abstract

Abstract:Causality is receiving increasing attention in the Recommendation Systems (RSs) community, which has realised that RSs could greatly benefit from causality to transform accurate predictions into effective and explainable decisions. Indeed, the RS literature has repeatedly highlighted that, in real-world scenarios, recommendation algorithms suffer many types of biases since assumptions ensuring unbiasedness are likely not met. In this discussion paper, we formulate the RS problem in terms of causality, using potential outcomes and structural causal models, by giving formal definitions of the causal quantities to be estimated and a general causal graph to serve as a reference to foster future research and development.

[IR-11] A GEN AI Framework for Medical Note Generation

Link: https://arxiv.org/abs/2410.01841
Authors: Hui Yi Leong,Yi Fan Gao,Shuai Ji,Bora Kalaycioglu,Uktu Pamuksuz
Keywords-EN: Electronic Health Records, Health Records, Electronic Health, direct patient care, Automatic Speech Recognition
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Sound (cs.SD)
*Comments: 8 figures, 7 pages, IEEE standard research paper

Click to view abstract

Abstract:The increasing administrative burden of medical documentation, particularly through Electronic Health Records (EHR), significantly reduces the time available for direct patient care and contributes to physician burnout. To address this issue, we propose MediNotes, an advanced generative AI framework designed to automate the creation of SOAP (Subjective, Objective, Assessment, Plan) notes from medical conversations. MediNotes integrates Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and Automatic Speech Recognition (ASR) to capture and process both text and voice inputs in real time or from recorded audio, generating structured and contextually accurate medical notes. The framework also incorporates advanced techniques like Quantized Low-Rank Adaptation (QLoRA) and Parameter-Efficient Fine-Tuning (PEFT) for efficient model fine-tuning in resource-constrained environments. Additionally, MediNotes offers a query-based retrieval system, allowing healthcare providers and patients to access relevant medical information quickly and accurately. Evaluations using the ACI-BENCH dataset demonstrate that MediNotes significantly improves the accuracy, efficiency, and usability of automated medical documentation, offering a robust solution to reduce the administrative burden on healthcare professionals while improving the quality of clinical workflows.

Attachments

Click to download the full list of today's papers