This post lists the latest papers fetched from arXiv.org on 2024-10-07. The list is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. If you would like to receive it by email on a schedule, please leave your email address in the comments.

Note: the daily paper data is fetched from arXiv.org and updated automatically around 11:00 each morning.


Table of Contents

Overview (2024-10-07)

A total of 530 papers were updated today, including:

  • Computation and Language (cs.CL): 122 papers
  • Artificial Intelligence (cs.AI): 158 papers
  • Computer Vision and Pattern Recognition (cs.CV): 90 papers
  • Machine Learning (cs.LG): 202 papers

Natural Language Processing

[NLP-0] Enhance Reasoning by Learning from Mistakes: Peer-Review Knowledge Distillation from Multiple Large Language Models

Quick read: This paper addresses the fact that large language models (LLMs) excel at complex reasoning tasks but demand enormous computational resources. The key to the proposed solution, Mistake-Aware Peer-Review Distillation (MAPD), is to have teacher models identify and explain the student model's mistakes to produce customized instruction data, and to filter for high-quality rationales through a simulated peer-review process among the teacher models, improving distillation so that much smaller models can also reason well.

Link: https://arxiv.org/abs/2410.03663
Authors: Zhuochun Li, Yuelyu Ji, Rui Meng, Daqing He
Keywords (EN): Large language models, natural language processing, demonstrated exceptional performance, Large language, exhibited complex reasoning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Note: 14 pages, 5 figures

Abstract:Large language models (LLMs) have exhibited complex reasoning abilities by generating question rationales and demonstrated exceptional performance in natural language processing (NLP) tasks. However, these reasoning capabilities generally emerge in models with tens of billions of parameters, creating significant computational challenges for real-world deployment. Recent research has concentrated on improving open-source smaller models through knowledge distillation (KD) from commercial LLMs. Nevertheless, most of these studies rely solely on the responses from one single LLM as the gold rationale for training. In this paper, we introduce a novel Mistake-Aware Peer-Review Distillation (MAPD) approach: 1) Instead of merely obtaining gold rationales from teachers, our method asks teachers to identify and explain the student’s mistakes, providing customized instruction learning data. 2) We design a simulated peer-review process between teacher LLMs, which selects only the generated rationales above the acceptance threshold. This reduces the chance of teachers guessing correctly with flawed rationale, improving instructional data quality. Comprehensive experiments and analysis on mathematical, commonsense, and logical reasoning tasks demonstrate the effectiveness of our method.
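The review-and-filter loop described in the abstract above can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: in MAPD the reviewers are teacher LLMs, while here the reviewers and their scoring functions are simple stand-ins.

```python
# Illustrative sketch of MAPD-style peer-review filtering: each teacher
# scores every *other* teacher's rationale, and only rationales whose mean
# peer score clears an acceptance threshold are kept as distillation data.

def peer_review_filter(rationales, reviewers, threshold=0.7):
    """Keep rationales whose mean peer score >= threshold.

    rationales: dict teacher_name -> rationale text
    reviewers:  dict reviewer_name -> callable(rationale) -> score in [0, 1]
    """
    accepted = {}
    for author, rationale in rationales.items():
        # A teacher does not review its own rationale.
        scores = [review(rationale)
                  for name, review in reviewers.items() if name != author]
        if sum(scores) / len(scores) >= threshold:
            accepted[author] = rationale
    return accepted

# Toy reviewers: longer rationales score higher (stand-in for an LLM judge).
reviewers = {
    "teacher_a": lambda r: min(1.0, len(r) / 40),
    "teacher_b": lambda r: min(1.0, len(r) / 50),
    "teacher_c": lambda r: min(1.0, len(r) / 60),
}
rationales = {
    "teacher_a": "x = 3 because 2x + 1 = 7 implies 2x = 6, so x = 3.",
    "teacher_b": "The answer is 3.",  # too thin to survive peer review
}
kept = peer_review_filter(rationales, reviewers, threshold=0.7)
```

The thin rationale is rejected even though its final answer is correct, which mirrors the paper's point that teachers can "guess correctly with flawed rationale".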

[NLP-1] Unraveling Cross-Modality Knowledge Conflict in Large Vision-Language Models

Quick read: This paper tackles cross-modality parametric knowledge conflicts in large vision-language models (LVLMs), i.e. inconsistencies between the knowledge represented by their vision and language components. The key is a systematic approach to detect, interpret, and mitigate these conflicts: a pipeline that identifies conflicts between visual and textual answers, together with a contrastive metric that discerns conflicting samples; a dynamic contrastive decoding method that removes undesirable logits from the less confident modality component based on answer confidence; and, for models that expose no logits, two prompt-based mitigation strategies. These methods yield clear accuracy gains on the ViQuAE and InfoSeek datasets, e.g. an average improvement of 2.24% with LLaVA-34B for dynamic contrastive decoding.

Link: https://arxiv.org/abs/2410.03659
Authors: Tinghui Zhu, Qin Liu, Fei Wang, Zhengzhong Tu, Muhao Chen
Keywords (EN): Large Vision-Language Models, demonstrated impressive capabilities, Large Vision-Language, multimodal inputs, demonstrated impressive
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Note: Website: this https URL

Abstract:Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities for capturing and reasoning over multimodal inputs. However, these models are prone to parametric knowledge conflicts, which arise from inconsistencies of represented knowledge between their vision and language components. In this paper, we formally define the problem of cross-modality parametric knowledge conflict and present a systematic approach to detect, interpret, and mitigate such conflicts. We introduce a pipeline that identifies conflicts between visual and textual answers, showing a persistently high conflict rate across modalities in recent LVLMs regardless of the model size. We further investigate how these conflicts interfere with the inference process and propose a contrastive metric to discern the conflicting samples from the others. Building on these insights, we develop a novel dynamic contrastive decoding method that removes undesirable logits inferred from the less confident modality components based on answer confidence. For models that do not provide logits, we also introduce two prompt-based strategies to mitigate the conflicts. Our methods achieve promising improvements in accuracy on both the ViQuAE and InfoSeek datasets. Specifically, using LLaVA-34B, our proposed dynamic contrastive decoding improves average accuracy by 2.24%.
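As a rough illustration of confidence-weighted contrastive decoding, the toy below subtracts the logits of the less confident modality component from the fused distribution, scaling the contrast by how unconfident that component is. The weighting formula here is our simplification for illustration, not the paper's exact method.

```python
# Toy confidence-weighted contrastive decoding over a 3-token vocabulary.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def dynamic_contrastive_decode(fused_logits, weak_logits, alpha_max=1.0):
    """Return the argmax token id after contrasting away the weak modality.

    The contrast strength scales with how unconfident the weak component is
    (1 - its max probability), so a confident component is left mostly alone.
    """
    confidence = max(softmax(weak_logits))
    alpha = alpha_max * (1.0 - confidence)
    adjusted = [f - alpha * w for f, w in zip(fused_logits, weak_logits)]
    return max(range(len(adjusted)), key=lambda i: adjusted[i])

# The weak (e.g. text-only) component is flat/unconfident and pulls toward
# token 0; the fused model narrowly prefers token 0 too, but contrasting
# away the weak component flips the decision to token 1.
fused = [2.1, 2.0, 0.0]
weak = [1.0, 0.2, 0.4]
token = dynamic_contrastive_decode(fused, weak)
```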

[NLP-2] RAFT: Realistic Attacks to Fool Text Detectors EMNLP2024

Quick read: This paper probes the robustness and reliability of detectors for LLM-generated text. The key is RAFT, a grammar-error-free black-box attack that exploits the transferability of LLM embeddings to greedily pick substitute words that fool a target detector while preserving the original text quality. Experiments show RAFT compromises every detector studied, with success rates of up to 99%, and transfers across source models; human evaluation finds the attacked text indistinguishable from human writing, indicating that current LLM detectors are not adversarially robust and underscoring the need for more resilient detection mechanisms.

Link: https://arxiv.org/abs/2410.03658
Authors: James Wang, Ran Li, Junfeng Yang, Chengzhi Mao
Keywords (EN): exhibited remarkable fluency, Large language models, Large language, exhibited remarkable, remarkable fluency
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Note: Accepted by EMNLP 2024

Abstract:Large language models (LLMs) have exhibited remarkable fluency across various tasks. However, their unethical applications, such as disseminating disinformation, have become a growing concern. Although recent works have proposed a number of LLM detection methods, their robustness and reliability remain unclear. In this paper, we present RAFT: a grammar error-free black-box attack against existing LLM detectors. In contrast to previous attacks for language models, our method exploits the transferability of LLM embeddings at the word-level while preserving the original text quality. We leverage an auxiliary embedding to greedily select candidate words to perturb against the target detector. Experiments reveal that our attack effectively compromises all detectors in the study across various domains, with success rates of up to 99%, and is transferable across source models. Manual human evaluation studies show our attacks are realistic and indistinguishable from original human-written text. We also show that examples generated by RAFT can be used to train adversarially robust detectors. Our work shows that current LLM detectors are not adversarially robust, underscoring the urgent need for more resilient detection mechanisms.
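The greedy substitution loop at the core of such an attack can be sketched as follows. The detector here is a trivial keyword counter and the candidate lists are hand-made stand-ins; RAFT itself proposes fluent, grammar-preserving candidates via LLM embeddings and queries a real detector.

```python
# Toy greedy word-substitution attack: swap each word for the candidate
# that most lowers the "machine-generated" score, keeping a swap only when
# it strictly helps.

MACHINE_WORDS = {"furthermore", "utilize", "delve"}

def detector_score(words):
    """Fraction of words flagged as machine-like (stand-in for a detector)."""
    return sum(w in MACHINE_WORDS for w in words) / len(words)

def greedy_attack(words, candidates):
    """candidates: dict word -> list of replacement words."""
    words = list(words)
    for i, w in enumerate(words):
        best = detector_score(words)
        for cand in candidates.get(w, []):
            trial = words[:i] + [cand] + words[i + 1:]
            if detector_score(trial) < best:
                words[i] = cand
                best = detector_score(words)
    return words

text = ["furthermore", "we", "utilize", "a", "novel", "method"]
cands = {"furthermore": ["also"], "utilize": ["use", "apply"]}
attacked = greedy_attack(text, cands)
```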

[NLP-3] Aligning LLMs with Individual Preferences via Interaction

Quick read: This paper asks how LLMs can better align with individual user preferences as they see wide adoption. The key is to train LLMs to implicitly infer the current user's personalized preferences over multi-turn conversations and to dynamically adapt their behavior and responses to those inferred preferences. Concretely, the authors build a diverse pool of 3,310 distinct user personas, use multi-LLM collaboration to generate a tree-structured dataset of 3K+ multi-turn conversations, and apply supervised fine-tuning and reinforcement learning. Evaluation on the ALOE benchmark confirms that the method enables dynamic, personalized alignment.

Link: https://arxiv.org/abs/2410.03642
Authors: Shujin Wu, May Fung, Cheng Qian, Jeonghwan Kim, Dilek Hakkani-Tur, Heng Ji
Keywords (EN): large language models, increasingly advanced capabilities, demonstrate increasingly advanced, language models, advanced capabilities
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Note: The code and dataset are made public at this https URL

Abstract:As large language models (LLMs) demonstrate increasingly advanced capabilities, aligning their behaviors with human values and preferences becomes crucial for their wide adoption. While previous research focuses on general alignment to principles such as helpfulness, harmlessness, and honesty, the need to account for individual and diverse preferences has been largely overlooked, potentially undermining customized human experiences. To address this gap, we train LLMs that can “interact to align”, essentially cultivating the meta-skill of LLMs to implicitly infer the unspoken personalized preferences of the current user through multi-turn conversations, and then dynamically align their following behaviors and responses to these inferred preferences. Our approach involves establishing a diverse pool of 3,310 distinct user personas by initially creating seed examples, which are then expanded through iterative self-generation and filtering. Guided by distinct user personas, we leverage multi-LLM collaboration to develop a multi-turn preference dataset containing 3K+ multi-turn conversations in tree structures. Finally, we apply supervised fine-tuning and reinforcement learning to enhance LLMs using this dataset. For evaluation, we establish the ALOE (ALign With CustOmized PrEferences) benchmark, consisting of 100 carefully selected examples and well-designed metrics to measure the customized alignment performance during conversations. Experimental results demonstrate the effectiveness of our method in enabling dynamic, personalized alignment via interaction.

[NLP-4] What Matters for Model Merging at Scale?

Quick read: This paper studies how to combine multiple expert models effectively when merging at scale. The key is a systematic evaluation of how factors such as base-model quality, the number of expert models, and model size affect the merged model's performance. The experiments show that merging works better with strong base models, larger models, and more experts, and that different merging methods behave similarly at large scale, offering new insights and a reference point for future research on large-scale merging.

Link: https://arxiv.org/abs/2410.03617
Authors: Prateek Yadav, Tu Vu, Jonathan Lai, Alexandra Chronopoulou, Manaal Faruqui, Mohit Bansal, Tsendsuren Munkhdalai
Keywords (EN): models, merging, decentralized model development, capable single model, expert models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Note: 20 pages, 7 figures, 4 tables

Abstract:Model merging aims to combine multiple expert models into a more capable single model, offering benefits such as reduced storage and serving costs, improved generalization, and support for decentralized model development. Despite its promise, previous studies have primarily focused on merging a few small models. This leaves many unanswered questions about the effect of scaling model size and how it interplays with other key factors – like the base model quality and number of expert models – to affect the merged model’s performance. This work systematically evaluates the utility of model merging at scale, examining the impact of these different factors. We experiment with merging fully fine-tuned models using 4 popular merging methods – Averaging, Task Arithmetic, Dare, and TIES – across model sizes ranging from 1B-64B parameters and merging up to 8 different expert models. We evaluate the merged models on both held-in tasks, i.e., the expert’s training tasks, and zero-shot generalization to unseen held-out tasks. Our experiments provide several new insights about model merging at scale and the interplay between different factors. First, we find that merging is more effective when experts are created from strong base models, i.e., models with good zero-shot performance. Second, larger models facilitate easier merging. Third, merging consistently improves generalization capabilities. Notably, when merging 8 large expert models, the merged models often generalize better compared to the multitask trained models. Fourth, we can better merge more expert models when working with larger models. Fifth, different merging methods behave very similarly at larger scales. Overall, our findings shed light on some interesting properties of model merging while also highlighting some limitations. We hope that this study will serve as a reference point on large-scale merging for upcoming research.
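Two of the merging methods compared in the paper are easy to show on toy weights. The sketch below treats checkpoints as plain dicts of scalars and omits everything real merging must handle (per-tensor weights, scaling hyperparameters, tokenizer alignment); it is a didactic illustration, not the paper's code.

```python
# Simple parameter averaging vs. task arithmetic (base + sum of task
# vectors, where a task vector is expert weights minus base weights).

def average_merge(experts):
    keys = experts[0].keys()
    return {k: sum(e[k] for e in experts) / len(experts) for k in keys}

def task_arithmetic_merge(base, experts, scale=1.0):
    return {
        k: base[k] + scale * sum(e[k] - base[k] for e in experts)
        for k in base
    }

base = {"w": 1.0, "b": 0.0}
expert_a = {"w": 1.5, "b": 0.2}   # fine-tuned on task A
expert_b = {"w": 0.5, "b": 0.4}   # fine-tuned on task B

avg = average_merge([expert_a, expert_b])               # b averages to ~0.3
ta = task_arithmetic_merge(base, [expert_a, expert_b])  # b accumulates to ~0.6
```

Note how the two methods disagree on `b`: averaging splits the difference, while task arithmetic adds both experts' deltas onto the base.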

[NLP-5] TICKing All the Boxes: Generated Checklists Improve LLM Evaluation and Generation

Quick read: This paper addresses how to evaluate the instruction-following ability of LLMs automatically, interpretably, and accurately, without relying on human annotation. The key is TICK (Targeted Instruct-evaluation with ChecKlists), which structures evaluation with LLM-generated, instruction-specific checklists that decompose an instruction into a series of YES/NO questions, each asking whether a candidate response meets a specific requirement. TICK significantly raises the agreement between LLM judgements and human preferences, and its self-refinement and Best-of-N variant, STICK (Self-TICK), further improves generation quality while also increasing agreement among human evaluators.

Link: https://arxiv.org/abs/2410.03608
Authors: Jonathan Cook, Tim Rocktäschel, Jakob Foerster, Dennis Aumiller, Alex Wang
Keywords (EN): Large Language Models, Large Language, Language Models, usage of Large, instruction-following ability
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Abstract:Given the widespread adoption and usage of Large Language Models (LLMs), it is crucial to have flexible and interpretable evaluations of their instruction-following ability. Preference judgments between model outputs have become the de facto evaluation standard, despite distilling complex, multi-faceted preferences into a single ranking. Furthermore, as human annotation is slow and costly, LLMs are increasingly used to make these judgments, at the expense of reliability and interpretability. In this work, we propose TICK (Targeted Instruct-evaluation with ChecKlists), a fully automated, interpretable evaluation protocol that structures evaluations with LLM-generated, instruction-specific checklists. We first show that, given an instruction, LLMs can reliably produce high-quality, tailored evaluation checklists that decompose the instruction into a series of YES/NO questions. Each question asks whether a candidate response meets a specific requirement of the instruction. We demonstrate that using TICK leads to a significant increase (46.4% → 52.2%) in the frequency of exact agreements between LLM judgements and human preferences, as compared to having an LLM directly score an output. We then show that STICK (Self-TICK) can be used to improve generation quality across multiple benchmarks via self-refinement and Best-of-N selection. STICK self-refinement on LiveBench reasoning tasks leads to an absolute gain of +7.8%, whilst Best-of-N selection with STICK attains +6.3% absolute improvement on the real-world instruction dataset, WildBench. In light of this, structured, multi-faceted self-improvement is shown to be a promising way to further advance LLM capabilities. Finally, by providing LLM-generated checklists to human evaluators tasked with directly scoring LLM responses to WildBench instructions, we notably increase inter-annotator agreement (0.194 → 0.256).
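Once a checklist exists, scoring reduces to a simple loop. In the sketch below the YES/NO "judge" for each question is a trivial rule standing in for an LLM judge, and the checklist itself is hypothetical, not one generated by TICK.

```python
# Toy checklist evaluation: a response's score is the fraction of
# instruction-specific YES/NO checks it passes.

def checklist_score(response, checklist):
    """checklist: list of (question, callable(response) -> bool)."""
    passed = sum(check(response) for _, check in checklist)
    return passed / len(checklist)

# Hypothetical checklist for "Write a three-line poem about autumn":
checklist = [
    ("Is the response exactly three lines long?",
     lambda r: len(r.strip().splitlines()) == 3),
    ("Does it mention autumn?",
     lambda r: "autumn" in r.lower() or "fall" in r.lower()),
]

response = "Autumn wind rises\ncrisp leaves letting go of trees\nthe pond holds the sky"
score = checklist_score(response, checklist)
```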

[NLP-6] Efficiently Identifying Watermarked Segments in Mixed-Source Texts

Quick read: This paper addresses identifying individual watermarked segments within long, mixed-source texts generated partly by large language models. The key lies in two novel methods: a geometry cover detection framework that decides whether a long text contains a watermarked segment at all, and an adaptive online learning algorithm that pinpoints the precise location of watermarked segments within the text. Beyond performing well on existing watermarking schemes, the approach adapts to other watermarking techniques, offering new ideas for precise watermark detection.

Link: https://arxiv.org/abs/2410.03600
Authors: Xuandong Zhao, Chenwen Liao, Yu-Xiang Wang, Lei Li
Keywords (EN): large language models, mitigating misuse cases, detect synthetic text, language models, mitigating misuse
Categories: Computation and Language (cs.CL)

Abstract:Text watermarks in large language models (LLMs) are increasingly used to detect synthetic text, mitigating misuse cases like fake news and academic dishonesty. While existing watermarking detection techniques primarily focus on classifying entire documents as watermarked or not, they often neglect the common scenario of identifying individual watermark segments within longer, mixed-source documents. Drawing inspiration from plagiarism detection systems, we propose two novel methods for partial watermark detection. First, we develop a geometry cover detection framework aimed at determining whether there is a watermark segment in long text. Second, we introduce an adaptive online learning algorithm to pinpoint the precise location of watermark segments within the text. Evaluated on three popular watermarking techniques (KGW-Watermark, Unigram-Watermark, and Gumbel-Watermark), our approach achieves high accuracy, significantly outperforming baseline methods. Moreover, our framework is adaptable to other watermarking techniques, offering new insights for precise watermark detection.
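For KGW-style watermarks, segment localization can be approximated with a sliding-window z-test on the "green list" hit rate, since watermarked text over-samples green tokens. This is a generic sketch of that idea under illustrative thresholds, not the paper's geometry cover framework or its online learning algorithm.

```python
# Slide a fixed window over the token sequence and report the window with
# the highest z-score against the null green-token rate gamma.
import math

def window_z(tokens, green, gamma, lo, hi):
    n = hi - lo
    hits = sum(t in green for t in tokens[lo:hi])
    return (hits - gamma * n) / math.sqrt(gamma * (1 - gamma) * n)

def locate_watermark(tokens, green, gamma=0.5, window=8, z_thresh=2.0):
    """Return (start, end) of the highest-z window above threshold, else None."""
    best = None
    for lo in range(len(tokens) - window + 1):
        z = window_z(tokens, green, gamma, lo, lo + window)
        if z >= z_thresh and (best is None or z > best[0]):
            best = (z, lo, lo + window)
    return None if best is None else (best[1], best[2])

green = set(range(0, 100, 2))           # pretend even token ids are "green"
human = [1, 3, 6, 5, 9, 11, 4, 7]       # ~chance rate of green tokens
marked = [2, 4, 8, 10, 6, 12, 14, 16]   # all green: the watermarked segment
span = locate_watermark(human + marked + human, green, window=8)
```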

[NLP-7] Understanding Reasoning in Chain-of-Thought from the Hopfieldian View

Quick read: This paper addresses the lack of a comprehensive framework for explaining and understanding the factors behind the success of Chain-of-Thought (CoT) reasoning, which existing work on improving CoT largely ignores. The key is to adopt the Hopfieldian view from cognitive neuroscience, connecting CoT reasoning to cognitive elements such as stimuli, actions, neural populations, and representation spaces, and to propose the Representation-of-Thought (RoT) framework, which leverages the robustness of low-dimensional representation spaces to make the CoT reasoning process more robust and interpretable while allowing fine-grained control over it.

Link: https://arxiv.org/abs/2410.03595
Authors: Lijie Hu, Liang Liu, Shu Yang, Xin Chen, Zhen Tan, Muhammad Asif Ali, Mengdi Li, Di Wang
Keywords (EN): Large Language Models, Large Language, Language Models, demonstrated remarkable abilities, Models have demonstrated
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Note: 28 pages, a new version of “A Hopfieldian View-based Interpretation for Chain-of-Thought Reasoning”

Abstract:Large Language Models have demonstrated remarkable abilities across various tasks, with Chain-of-Thought (CoT) prompting emerging as a key technique to enhance reasoning capabilities. However, existing research primarily focuses on improving performance, lacking a comprehensive framework to explain and understand the fundamental factors behind CoT’s success. To bridge this gap, we introduce a novel perspective grounded in the Hopfieldian view of cognition in cognitive neuroscience. We establish a connection between CoT reasoning and key cognitive elements such as stimuli, actions, neural populations, and representation spaces. From our view, we can understand the reasoning process as the movement between these representation spaces. Building on this insight, we develop a method for localizing reasoning errors in the response of CoTs. Moreover, we propose the Representation-of-Thought (RoT) framework, which leverages the robustness of low-dimensional representation spaces to enhance the robustness of the reasoning process in CoTs. Experimental results demonstrate that RoT improves the robustness and interpretability of CoT reasoning while offering fine-grained control over the reasoning process.

[NLP-8] Explicit, Implicit, and Scattered: Revisiting Event Extraction to Capture Complex Arguments EMNLP-2024

Quick read: This paper addresses the inability of existing event extraction (EE) frameworks to model implicit and scattered event arguments. The key is DiscourseEE, a new dataset with 7,464 argument annotations from online health discourse, of which 51.2% are implicit and 17.4% are scattered, together with a reformulation of argument extraction as a text generation problem to support these complex argument types. A comprehensive evaluation of state-of-the-art models highlights the key open challenges in generative event extraction.

Link: https://arxiv.org/abs/2410.03594
Authors: Omar Sharif, Joseph Gatto, Madhusudan Basak, Sarah M. Preum
Keywords (EN): Prior works formulate, Prior works, contiguous spans, extraction, Event Extraction
Categories: Computation and Language (cs.CL)
Note: Accepted in EMNLP-2024 (Main). 21 pages, 8 figures, and 11 tables

Abstract:Prior works formulate the extraction of event-specific arguments as a span extraction problem, where event arguments are explicit – i.e. assumed to be contiguous spans of text in a document. In this study, we revisit this definition of Event Extraction (EE) by introducing two key argument types that cannot be modeled by existing EE frameworks. First, implicit arguments are event arguments which are not explicitly mentioned in the text, but can be inferred through context. Second, scattered arguments are event arguments that are composed of information scattered throughout the text. These two argument types are crucial to elicit the full breadth of information required for proper event modeling. To support the extraction of explicit, implicit, and scattered arguments, we develop a novel dataset, DiscourseEE, which includes 7,464 argument annotations from online health discourse. Notably, 51.2% of the arguments are implicit, and 17.4% are scattered, making DiscourseEE a unique corpus for complex event extraction. Additionally, we formulate argument extraction as a text generation problem to facilitate the extraction of complex argument types. We provide a comprehensive evaluation of state-of-the-art models and highlight critical open challenges in generative event extraction. Our data and codebase are available at this https URL.
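Framing argument extraction as generation roughly means prompting a model to write out each argument, so it can verbalize implicit or scattered information rather than copy a span, and then parsing the generated "role: value" lines back into a record. The prompt format, role names, and canned model output below are all illustrative, not DiscourseEE's actual schema.

```python
# Sketch of generation-based argument extraction: build a role-filling
# prompt, then parse the model's free-text output into a structured record.

def build_prompt(document, event_type, roles):
    lines = [f"Document: {document}",
             f"Event: {event_type}",
             "Answer each role on its own line as 'role: value', or 'role: none'."]
    lines += [f"{role}:" for role in roles]
    return "\n".join(lines)

def parse_generation(generated, roles):
    """Parse 'role: value' lines produced by the model."""
    record = {}
    for line in generated.splitlines():
        role, _, value = line.partition(":")
        if role.strip() in roles:
            record[role.strip()] = value.strip()
    return record

roles = ["treatment", "dosage", "effect"]
prompt = build_prompt("Took 2 mg in the morning and again at night; felt better.",
                      "medication_use", roles)
# Stand-in for an LLM call: the dosage is scattered across the document and
# the effect is implicit, so both are *generated* rather than copied as spans.
generated = "treatment: loperamide\ndosage: 2 mg twice daily\neffect: symptom relief"
record = parse_generation(generated, roles)
```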

[NLP-9] Table Question Answering for Low-resourced Indic Languages EMNLP

Quick read: This paper addresses the scarcity of annotated data and neural models for table question answering (TableQA) in low-resource languages such as Bengali and Hindi. The key is a fully automatic, budget-friendly, large-scale TableQA data generation pipeline for low-resource languages. Models trained on the resulting Bengali and Hindi datasets outperform state-of-the-art large language models (LLMs), and further exhibit zero-shot cross-lingual transfer and mathematical reasoning; the method extends to any low-resource language with a web presence.

Link: https://arxiv.org/abs/2410.03576
Authors: Vaishali Pal, Evangelos Kanoulas, Andrew Yates, Maarten de Rijke
Keywords (EN): returning individual cells, structured information, returning individual, task of answering, answering questions
Categories: Computation and Language (cs.CL)
Note: Accepted at EMNLP 2024

Abstract:TableQA is the task of answering questions over tables of structured information, returning individual cells or tables as output. TableQA research has focused primarily on high-resource languages, leaving medium- and low-resource languages with little progress due to scarcity of annotated data and neural models. We address this gap by introducing a fully automatic large-scale tableQA data generation process for low-resource languages with limited budget. We incorporate our data generation method on two Indic languages, Bengali and Hindi, which have no tableQA datasets or models. TableQA models trained on our large-scale datasets outperform state-of-the-art LLMs. We further study the trained models on different aspects, including mathematical reasoning capabilities and zero-shot cross-lingual transfer. Our work is the first on low-resource tableQA focusing on scalable data generation and evaluation procedures. Our proposed data generation method can be applied to any low-resource language with a web presence. We release datasets, models, and code (this https URL).

[NLP-10] Towards Linguistically-Aware and Language-Independent Tokenization for Large Language Models (LLMs)

Quick read: This paper examines how the tokenization schemes of state-of-the-art large language models (LLMs) affect the cost and availability of services across languages, especially low-resource ones. The key is to evaluate and compare tokenization variability across models such as GPT-4, GPT-3, DaVinci, and BERT, and to argue for linguistically-aware development practices, particularly for traditionally under-resourced languages. Case studies show the real-world impact of tokenization choices in electronic health record (EHR) systems, and the work aims to promote internationalization (I18N) practices in AI service development with a strong emphasis on inclusivity for languages underrepresented in AI applications.

Link: https://arxiv.org/abs/2410.03568
Authors: Abrar Rahman, Garry Bowlin, Binit Mohanty, Sean McGunigal
Keywords (EN): tokenization techniques employed, low resource languages, base embeddings, large language models, BERT base tokenizer
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:This paper presents a comprehensive study on the tokenization techniques employed by state-of-the-art large language models (LLMs) and their implications on the cost and availability of services across different languages, especially low resource languages. The analysis considers multiple LLMs, including GPT-4 (using cl100k_base embeddings), GPT-3 (with p50k_base embeddings), and DaVinci (employing r50k_base embeddings), as well as the widely used BERT base tokenizer. The study evaluates the tokenization variability observed across these models and investigates the challenges of linguistic representation in subword tokenization. The research underscores the importance of fostering linguistically-aware development practices, especially for languages that are traditionally under-resourced. Moreover, this paper introduces case studies that highlight the real-world implications of tokenization choices, particularly in the context of electronic health record (EHR) systems. This research aims to promote generalizable Internationalization (I18N) practices in the development of AI services in this domain and beyond, with a strong emphasis on inclusivity, particularly for languages traditionally underrepresented in AI applications.
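The cost asymmetry the paper measures can be reproduced with a toy greedy subword tokenizer: a vocabulary slanted toward English keeps an English word to a couple of tokens but shatters an out-of-vocabulary (here, a romanized Bengali) word into characters. Real tokenizers such as cl100k_base are BPE-based, but the "fertility" gap (tokens per word) behaves the same way, and per-token API pricing turns it directly into higher cost for the under-served language.

```python
# Toy greedy longest-match subword tokenizer with character fallback.

def tokenize(word, vocab):
    tokens, i = [], 0
    while i < len(word):
        # Try the longest vocabulary match starting at i; fall back to a
        # single character when nothing matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

vocab = {"token", "ization", "tok", "en", "the", "ing"}  # English-slanted
english = tokenize("tokenization", vocab)  # two subword tokens
bengali = tokenize("dhonnobad", vocab)     # falls back to characters
fertility_gap = len(bengali) / len(english)
```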

[NLP-11] BodyShapeGPT: SMPL Body Shape Manipulation with LLMs ECCV2024

Quick read: This paper addresses generating accurate 3D human body models from natural-language descriptions. The key is to use fine-tuned large language models (LLMs) to recognize and parse physical descriptions of people and map them to SMPL-X shape parameters, enabling natural-language control over 3D body shape generation. This improves human-machine interaction and opens new avenues for personalization and simulation in virtual environments.

Link: https://arxiv.org/abs/2410.03556
Authors: Baldomero R. Árbol, Dan Casas
Keywords (EN): performing complex tasks, Large Language Models, provide a wide, wide range, range of tools
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Note: Accepted to ECCV 2024 Workshop on Foundation Models for 3D Humans. Code repository: this https URL

Abstract:Generative AI models provide a wide range of tools capable of performing complex tasks in a fraction of the time it would take a human. Among these, Large Language Models (LLMs) stand out for their ability to generate diverse texts, from literary narratives to specialized responses in different fields of knowledge. This paper explores the use of fine-tuned LLMs to identify physical descriptions of people, and subsequently create accurate representations of avatars using the SMPL-X model by inferring shape parameters. We demonstrate that LLMs can be trained to understand and manipulate the shape space of SMPL, allowing the control of 3D human shapes through natural language. This approach promises to improve human-machine interaction and opens new avenues for customization and simulation in virtual environments.

[NLP-12] Structure-Enhanced Protein Instruction Tuning: Towards General-Purpose Protein Understanding

Quick read: This paper addresses the limitation that protein language models (pLMs) excel on specific downstream tasks but fall short of general-purpose protein understanding. The key is the Structure-Enhanced Protein Instruction Tuning (SEPIT) framework, which integrates a structure-aware module into pLMs and connects them to large language models (LLMs) to generate comprehensive protein understanding. SEPIT uses a two-stage instruction-tuning pipeline: caption-based instructions first build a basic understanding of proteins, then a mixture of experts (MoEs) learns more complex properties and functional information with the same number of activated parameters. The authors also construct the largest and most comprehensive protein instruction dataset to date for training and evaluating general-purpose protein understanding models.

Link: https://arxiv.org/abs/2410.03553
Authors: Wei Wu, Chao Wang, Liyi Chen, Mingze Yin, Yiheng Zhu, Kun Fu, Jieping Ye, Hui Xiong, Zheng Wang
Keywords (EN): including metabolic reactions, DNA replication, reactions and DNA, essential biomolecules, play a central
Categories: Computation and Language (cs.CL); Biomolecules (q-bio.BM)

Abstract:Proteins, as essential biomolecules, play a central role in biological processes, including metabolic reactions and DNA replication. Accurate prediction of their properties and functions is crucial in biological applications. Recent development of protein language models (pLMs) with supervised fine tuning provides a promising solution to this problem. However, the fine-tuned model is tailored for particular downstream prediction task, and achieving general-purpose protein understanding remains a challenge. In this paper, we introduce Structure-Enhanced Protein Instruction Tuning (SEPIT) framework to bridge this gap. Our approach integrates a noval structure-aware module into pLMs to inform them with structural knowledge, and then connects these enhanced pLMs to large language models (LLMs) to generate understanding of proteins. In this framework, we propose a novel two-stage instruction tuning pipeline that first establishes a basic understanding of proteins through caption-based instructions and then refines this understanding using a mixture of experts (MoEs) to learn more complex properties and functional information with the same amount of activated parameters. Moreover, we construct the largest and most comprehensive protein instruction dataset to date, which allows us to train and evaluate the general-purpose protein understanding model. Extensive experimental results on open-ended generation and closed-set answer tasks demonstrate the superior performance of SEPIT over both closed-source general LLMs and open-source LLMs trained with protein knowledge.
摘要:蛋白质作为基本生物分子,在生物过程中起着核心作用,包括代谢反应和 DNA 复制。准确预测其性质和功能在生物应用中至关重要。近期,通过监督微调的蛋白质语言模型 (pLMs) 为这一问题提供了有前景的解决方案。然而,微调后的模型针对特定的下游预测任务进行了定制,实现通用蛋白质理解仍然是一个挑战。本文中,我们介绍了结构增强的蛋白质指令微调 (SEPIT) 框架,以弥合这一差距。我们的方法将一个新颖的结构感知模块集成到 pLMs 中,以赋予其结构知识,然后将这些增强的 pLMs 连接到大语言模型 (LLMs) 以生成对蛋白质的理解。在此框架中,我们提出了一种新颖的两阶段指令微调流程,首先通过基于标题的指令建立对蛋白质的基本理解,然后使用专家混合 (MoEs) 对这一理解进行细化,以学习更复杂的性质和功能信息,同时保持激活参数的数量相同。此外,我们构建了迄今为止最大且最全面的蛋白质指令数据集,这使我们能够训练和评估通用蛋白质理解模型。在开放式生成和封闭集回答任务上的广泛实验结果表明,SEPIT 在性能上优于闭源通用 LLMs 和基于蛋白质知识训练的开源 LLMs。
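SEPIT's second stage relies on the standard sparse mixture-of-experts pattern: a gate scores the experts, only the top-k are run, and their outputs are combined with softmax weights renormalized over the selected experts. A minimal sketch of that pattern (toy gate and toy experts, not the paper's actual architecture):

```python
import math

def moe_forward(x, experts, gate_weights, k=2):
    """Sparse mixture-of-experts layer: a linear gate scores the
    experts, only the top-k are executed, and their outputs are
    combined with softmax weights renormalized over the selection."""
    logits = [sum(xi * wi for xi, wi in zip(x, w)) for w in gate_weights]
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    exps = {i: math.exp(logits[i]) for i in top}
    z = sum(exps.values())
    return [sum(exps[i] / z * experts[i](x)[j] for i in top)
            for j in range(len(x))]

# Three toy "experts"; with this gate, only the first two are run,
# each receiving weight 0.5 (equal logits, stable sort keeps order).
experts = [lambda x: [xi + 1 for xi in x],
           lambda x: [xi * 2 for xi in x],
           lambda x: [0.0 for _ in x]]
print(moe_forward([1.0, 0.0], experts, [[5.0, 0.0], [5.0, 0.0], [-5.0, 0.0]]))  # → [2.0, 0.5]
```

Because only k experts run per token, the number of activated parameters stays fixed as more experts are added, which is the property the abstract refers to.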

[NLP-13] Enhancing Data Quality through Simple De-duplication: Navigating Responsible Computational Social Science Research EMNLP2024

【速读】: 该论文旨在解决自然语言处理(NLP)在计算社会科学(CSS)中使用社交媒体数据时面临的数据质量问题,特别是数据重复导致的标签不一致和数据泄露问题。解决方案的关键在于通过深入分析现有20个常用数据集,揭示数据重复对模型性能评估的影响,并提出新的数据集开发和使用协议及最佳实践,以提高社交媒体数据的质量和模型的可靠性。

链接: https://arxiv.org/abs/2410.03545
作者: Yida Mu,Mali Jin,Xingyi Song,Nikolaos Aletras
关键词-EN: Computational Social Science, natural language processing, Research in natural, Social Science, Computational Social
类目: Computation and Language (cs.CL)
备注: Accepted at EMNLP 2024 Main

点击查看摘要

Abstract:Research in natural language processing (NLP) for Computational Social Science (CSS) heavily relies on data from social media platforms. This data plays a crucial role in the development of models for analysing socio-linguistic phenomena within online communities. In this work, we conduct an in-depth examination of 20 datasets extensively used in NLP for CSS to comprehensively examine data quality. Our analysis reveals that social media datasets exhibit varying levels of data duplication. Consequently, this gives rise to challenges like label inconsistencies and data leakage, compromising the reliability of models. Our findings also suggest that data duplication has an impact on the current claims of state-of-the-art performance, potentially leading to an overestimation of model effectiveness in real-world scenarios. Finally, we propose new protocols and best practices for improving dataset development from social media data and its usage.
摘要:计算社会科学 (Computational Social Science, CSS) 领域的自然语言处理 (Natural Language Processing, NLP) 研究高度依赖于社交媒体平台的数据。这些数据在开发用于分析在线社区中的社会语言现象的模型中起着至关重要的作用。在本研究中,我们对 CSS 领域中广泛使用的 20 个数据集进行了深入的分析,以全面评估数据质量。我们的分析发现,社交媒体数据集表现出不同程度的数据重复。因此,这导致了标签不一致和数据泄露等问题,从而影响了模型的可靠性。我们的研究结果还表明,数据重复对当前关于最先进性能的声明产生了影响,可能导致对模型在实际场景中有效性的过高估计。最后,我们提出了新的协议和最佳实践,以改进从社交媒体数据中开发数据集及其使用。
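The kind of within-split and cross-split duplication the paper measures can be detected with a few lines of code. A minimal sketch, assuming exact-match duplicates under a simple whitespace/case normalization (real pipelines may also want near-duplicate detection):

```python
from collections import Counter

def normalize(text):
    # Lowercase and collapse whitespace so trivially re-posted
    # copies of the same message compare equal.
    return " ".join(text.lower().split())

def duplication_report(train, test):
    """Count exact duplicates inside the training split and overlaps
    across splits (a common source of label leakage)."""
    train_counts = Counter(normalize(t) for t in train)
    intra_train = sum(c - 1 for c in train_counts.values() if c > 1)
    leakage = sum(1 for t in test if normalize(t) in train_counts)
    return {"intra_train_duplicates": intra_train,
            "train_test_overlap": leakage}

report = duplication_report(
    train=["Good morning!", "good  morning!", "Vote now."],
    test=["Vote now.", "Completely new post."],
)
print(report)  # → {'intra_train_duplicates': 1, 'train_test_overlap': 1}
```

Any non-zero `train_test_overlap` inflates test scores, which is the over-estimation effect the paper reports.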

[NLP-14] Re-examining Sexism and Misogyny Classification with Annotator Attitudes EMNLP2024

【速读】: 该论文试图解决性别暴力(GBV)在线数据集在捕捉多样的标注者视角和确保受影响群体代表性方面的不足。解决方案的关键在于重新审视GBV内容审核流程中的两个重要阶段:(1)手动数据标注;(2)自动化分类。在手动标注阶段,论文通过收集标注者的社会心理学调查数据,分析标注者身份和态度与其标注结果之间的关系,发现右翼威权主义得分较高的标注者更倾向于将文本标注为性别歧视,而社会支配倾向和新型性别歧视态度得分较高的标注者则倾向于不标注。在自动化分类阶段,论文使用大型语言模型和五种提示策略进行分类实验,发现标注者态度影响分类器的预测能力,包含标注者态度信息的提示可以提高分类性能,但模型在处理新标签集的复杂性和类别不平衡方面仍面临挑战。

链接: https://arxiv.org/abs/2410.03543
作者: Aiqi Jiang,Nikolas Vitsakis,Tanvi Dinkar,Gavin Abercrombie,Ioannis Konstas
关键词-EN: increasing problem online, Gender-Based Violence, existing datasets fail, problem online, affected groups
类目: Computation and Language (cs.CL)
备注: Accepted by EMNLP 2024

点击查看摘要

Abstract:Gender-Based Violence (GBV) is an increasing problem online, but existing datasets fail to capture the plurality of possible annotator perspectives or ensure the representation of affected groups. We revisit two important stages in the moderation pipeline for GBV: (1) manual data labelling; and (2) automated classification. For (1), we examine two datasets to investigate the relationship between annotator identities and attitudes and the responses they give to two GBV labelling tasks. To this end, we collect demographic and attitudinal information from crowd-sourced annotators using three validated surveys from Social Psychology. We find that higher Right Wing Authoritarianism scores are associated with a higher propensity to label text as sexist, while for Social Dominance Orientation and Neosexist Attitudes, higher scores are associated with a negative tendency to do so. For (2), we conduct classification experiments using Large Language Models and five prompting strategies, including infusing prompts with annotator information. We find: (i) annotator attitudes affect the ability of classifiers to predict their labels; (ii) including attitudinal information can boost performance when we use well-structured brief annotator descriptions; and (iii) models struggle to reflect the increased complexity and imbalanced classes of the new label sets.
摘要:基于性别的暴力 (Gender-Based Violence, GBV) 在线上日益严重,但现有数据集未能捕捉到可能的多样的标注者视角,也无法确保受影响群体的代表性。我们重新审视了 GBV 内容审核流程中的两个关键阶段:(1) 人工数据标注;(2) 自动化分类。对于 (1),我们分析了两个数据集,研究标注者身份和态度与其对两项 GBV 标注任务的反应之间的关系。为此,我们通过社会心理学领域的三项验证性调查,收集了众包标注者的社会人口统计和态度信息。我们发现,右翼威权主义 (Right Wing Authoritarianism) 得分较高的人更倾向于将文本标注为性别歧视,而社会支配倾向 (Social Dominance Orientation) 和新性别歧视态度 (Neosexist Attitudes) 得分较高的人则不太倾向于这样标注。对于 (2),我们使用大语言模型 (Large Language Models) 和五种提示策略进行分类实验,包括在提示中融入标注者信息。我们发现:(i) 标注者的态度影响分类器预测其标注的能力;(ii) 当使用结构良好的简短标注者描述时,包含态度信息可以提升性能;(iii) 模型难以反映新标注集的复杂性和类别不平衡问题。

[NLP-15] MARE: Multi-Aspect Rationale Extractor on Unsupervised Rationale Extraction EMNLP2024

【速读】: 该论文试图解决无监督理由提取任务中多方面理由提取的问题,关键在于提出了一个多方面理由提取器(MARE),通过多方面多头注意力机制(MAMHA)和硬删除技术来同时编码多个文本片段,并使用多任务训练来减少训练开销。该方法能够有效捕捉各方面的内部关联,从而提升多方面理由提取的性能。

链接: https://arxiv.org/abs/2410.03531
作者: Han Jiang,Junwen Duan,Zhe Qu,Jianxin Wang
关键词-EN: support model predictions, explicit rationale annotation, extract text snippets, rationale extraction aims, Unsupervised rationale extraction
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted in EMNLP2024(Main) conference

点击查看摘要

Abstract:Unsupervised rationale extraction aims to extract text snippets to support model predictions without explicit rationale annotation. Researchers have made many efforts to solve this task. Previous works often encode each aspect independently, which may limit their ability to capture meaningful internal correlations between aspects. While there has been significant work on mitigating spurious correlations, our approach focuses on leveraging the beneficial internal correlations to improve multi-aspect rationale extraction. In this paper, we propose a Multi-Aspect Rationale Extractor (MARE) to explain and predict multiple aspects simultaneously. Concretely, we propose a Multi-Aspect Multi-Head Attention (MAMHA) mechanism based on hard deletion to encode multiple text chunks simultaneously. Furthermore, multiple special tokens are prepended to the text, each corresponding to a certain aspect. Finally, multi-task training is deployed to reduce the training overhead. Experimental results on two unsupervised rationale extraction benchmarks show that MARE achieves state-of-the-art performance. Ablation studies further demonstrate the effectiveness of our method. Our code is available at this https URL.
摘要:无监督理由提取旨在提取文本片段以支持模型预测,而无需显式的理由标注。研究人员已为此任务付出了许多努力。以往的工作通常独立编码每个方面,这可能限制了它们捕捉方面之间有意义的内部关联的能力。尽管已有大量工作致力于缓解虚假关联,但我们的方法侧重于利用有益的内部关联来改进多方面理由提取。在本文中,我们提出了一种多方面理由提取器 (Multi-Aspect Rationale Extractor, MARE),以同时解释和预测多个方面。具体而言,我们提出了一种基于硬删除的多方面多头注意力 (Multi-Aspect Multi-Head Attention, MAMHA) 机制,以同时编码多个文本块。此外,在文本前添加了多个特殊 Token,每个 Token 对应一个特定方面。最后,采用多任务训练以减少训练开销。在两个无监督理由提取基准上的实验结果表明,MARE 达到了最先进的性能。消融研究进一步证明了我们方法的有效性。我们的代码已在此 https URL 上提供。

[NLP-16] No Need to Talk: Asynchronous Mixture of Language Models

【速读】: 该论文试图解决在训练混合语言模型时,节点间高带宽通信需求的问题。解决方案的关键是提出了一种名为SmallTalk LM的创新方法,该方法允许每个模型专注于数据分布的不同部分,而不需要节点间的高带宽通信。在推理阶段,通过一个轻量级的路由器根据输入序列的前缀将任务导向单一专家模型,从而在保持相似推理成本的同时,显著降低了困惑度,并在下游任务中表现出色。

链接: https://arxiv.org/abs/2410.03529
作者: Anastasiia Filippova,Angelos Katharopoulos,David Grangier,Ronan Collobert
关键词-EN: asynchronous manner, innovative method, introduce SmallTalk, model, training
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 23 pages

点击查看摘要

Abstract:We introduce SmallTalk LM, an innovative method for training a mixture of language models in an almost asynchronous manner. Each model of the mixture specializes in distinct parts of the data distribution, without the need for high-bandwidth communication between the nodes training each model. At inference, a lightweight router directs a given sequence to a single expert, according to a short prefix. This inference scheme naturally uses a fraction of the parameters from the overall mixture model. Our experiments on language modeling demonstrate that SmallTalk LM achieves significantly lower perplexity than dense model baselines for the same total training FLOPs and an almost identical inference cost. Finally, in our downstream evaluations we outperform the dense baseline on 75% of the tasks.
摘要:我们介绍了 SmallTalk LM,这是一种创新的方法,用于以近乎异步的方式训练混合语言模型。混合模型中的每个模型专门处理数据分布的不同部分,而无需节点之间进行高带宽通信。在推理阶段,一个轻量级的路由器根据短前缀将给定序列导向单一专家。这种推理方案自然地使用了混合模型总体参数的一部分。我们在语言建模实验中证明,SmallTalk LM 在相同的总训练 FLOPs 下,实现了比密集模型基线显著更低的困惑度,并且推理成本几乎相同。最后,在我们的下游任务评估中,我们在 75% 的任务上超越了密集基线。
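The inference scheme described above (a short prefix selects exactly one expert, so only a fraction of the mixture's parameters is touched per sequence) can be caricatured in a few lines. The real SmallTalk router is a small learned model; a hash of the prefix stands in for it here, since any deterministic function of the prefix has the same structural property:

```python
def route(token_ids, num_experts, prefix_len=4):
    """Choose one expert from the first `prefix_len` tokens only.
    Stand-in for SmallTalk's learned lightweight router."""
    return hash(tuple(token_ids[:prefix_len])) % num_experts

def generate(token_ids, experts, prefix_len=4):
    # Only the selected expert's parameters are used for this sequence.
    expert = experts[route(token_ids, len(experts), prefix_len)]
    return expert(token_ids)

seq = [17, 4, 99, 3, 52, 8]
e = route(seq, 8)
# Routing depends on the prefix alone: extending the sequence
# never changes which expert serves it.
assert route(seq + [123, 456], 8) == e
```

During training, the same property means each node can specialize on the slice of data its expert will later serve, without synchronizing with the others.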

[NLP-17] Steering Large Language Models between Code Execution and Textual Reasoning

【速读】: 该论文试图解决的问题是如何在大型语言模型(LLMs)中有效地引导代码生成与文本推理,特别是在处理复杂任务时。解决方案的关键在于提出了三种方法来优化LLMs在代码生成与文本推理之间的切换,从而显著提高任务解决的成功率。这些方法不仅考虑了代码生成的准确性,还深入分析了生成过程中的成本,包括令牌长度和运行时间,以确保解决方案的效率和可扩展性。

链接: https://arxiv.org/abs/2410.03524
作者: Yongchao Chen,Harsh Jhamtani,Srinagesh Sharma,Chuchu Fan,Chi Wang
关键词-EN: Large Language Models, Large Language, capabilities of Large, Language Models, recent research focuses
类目: Computation and Language (cs.CL)
备注: 32 pages, 12 figures, 12 tables

点击查看摘要

Abstract:While much recent research focuses on enhancing the textual reasoning capabilities of Large Language Models (LLMs) by optimizing the multi-agent framework or reasoning chains, several benchmark tasks can be solved with 100% success through direct coding, which is more scalable and avoids the computational overhead associated with textual iterating and searching. Textual reasoning has inherent limitations in solving tasks with challenges in math, logic, optimization, and searching, limitations that are unlikely to be overcome by simply scaling up the model and data size. The recently released OpenAI GPT Code Interpreter and multi-agent frameworks such as AutoGen have demonstrated remarkable proficiency in integrating code generation and execution to solve complex tasks using LLMs. However, based on our experiments on 7 existing popular methods for steering code/text generation in both single- and multi-turn settings with 14 tasks and 6 types of LLMs (including the new O1-preview), currently there is no optimal method to correctly steer LLMs to write code when needed. We discover some interesting patterns in when models use code vs. textual reasoning as task complexity and model size evolve, which even result in an astonishing inverse scaling law. We also discover that results from LLM-written code are not always better than using textual reasoning, even if the task could be solved through code. To mitigate the above issues, we propose three methods to better steer LLM code/text generation and achieve a notable improvement. The costs of token lengths and runtime are thoroughly discussed for all the methods. We believe the problem of steering LLM code/text generation is critical for future research and has much space for further improvement. Project Page, Datasets, and Codes are available at this https URL.
摘要:尽管近期许多研究致力于通过优化多智能体框架或推理链来增强大语言模型 (LLM) 的文本推理能力,但某些基准任务可以通过直接编码实现 100% 的成功率,这种方式更具扩展性,并避免了与文本迭代和搜索相关的计算开销。文本推理在解决涉及数学、逻辑、优化和搜索等挑战的任务时存在固有局限,仅通过扩大模型和数据规模难以克服。最近发布的 OpenAI GPT Code Interpreter 和多智能体框架如 AutoGen 展示了将代码生成与执行相结合以利用 LLM 解决复杂任务的显著能力。然而,根据我们在单轮和多轮设置下对 7 种现有流行方法进行的 14 项任务和 6 种类型 LLM(包括新的 O1-preview)的实验,目前尚无最佳方法能正确引导 LLM 在需要时编写代码。我们发现了一些有趣的模型在代码使用与文本推理之间随着任务复杂性和模型规模变化的规律,甚至出现了令人惊讶的逆向扩展定律。我们还发现,即使任务可以通过代码解决,LLM 生成的代码结果并不总是优于文本推理。为缓解上述问题,我们提出了三种方法来更好地引导 LLM 的代码/文本生成,并取得了显著改进。我们对所有方法的 Token 长度和运行时成本进行了详细讨论。我们认为,引导 LLM 代码/文本生成的问题对未来研究至关重要,且仍有很大改进空间。项目页面、数据集和代码可在以下链接获取:https URL。

[NLP-18] LCMDC: Large-scale Chinese Medical Dialogue Corpora for Automatic Triage and Medical Consultation

【速读】: 该论文试图解决在线医疗服务的两大主要挑战:一是由于隐私问题导致的大规模、公开可用且领域特定的医疗数据稀缺,现有数据集规模小且局限于少数疾病,限制了基于预训练语言模型(PLMs)的分诊方法的有效性;二是现有方法缺乏医学知识,难以准确理解医患咨询中的专业术语和表达。解决方案的关键在于构建了大规模中文医疗对话语料库(LCMDC),包含粗粒度分诊数据集、细粒度诊断数据集和医疗咨询数据集,从而解决了数据短缺问题。此外,论文提出了一种结合BERT监督学习和提示学习的分诊系统,以及基于GPT的医疗咨询模型,通过强化学习提升模型性能,并通过自建背景语料库对PLMs进行预训练,以增强领域知识的获取。

链接: https://arxiv.org/abs/2410.03521
作者: Xinyuan Wang,Haozhou Li,Dingfang Zheng,Qinke Peng
关键词-EN: pandemic underscored major, underscored major deficiencies, online medical services, traditional healthcare systems, pandemic underscored
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The global COVID-19 pandemic underscored major deficiencies in traditional healthcare systems, hastening the advancement of online medical services, especially in medical triage and consultation. However, existing studies face two main challenges. First, the scarcity of large-scale, publicly available, domain-specific medical datasets due to privacy concerns, with current datasets being small and limited to a few diseases, limiting the effectiveness of triage methods based on Pre-trained Language Models (PLMs). Second, existing methods lack medical knowledge and struggle to accurately understand professional terms and expressions in patient-doctor consultations. To overcome these obstacles, we construct the Large-scale Chinese Medical Dialogue Corpora (LCMDC), comprising a Coarse-grained Triage dataset with 439,630 samples, a Fine-grained Diagnosis dataset with 199,600 samples, and a Medical Consultation dataset with 472,418 items, thereby addressing the data shortage in this field. Moreover, we further propose a novel triage system that combines BERT-based supervised learning with prompt learning, as well as a GPT-based medical consultation model using reinforcement learning. To enhance domain knowledge acquisition, we pre-trained PLMs using our self-constructed background corpus. Experimental results on the LCMDC demonstrate the efficacy of our proposed systems.
摘要:全球 COVID-19 疫情凸显了传统医疗系统的重大缺陷,加速了在线医疗服务的发展,特别是在医疗分诊和咨询方面。然而,现有研究面临两大主要挑战。首先,由于隐私问题,缺乏大规模、公开可用的特定领域医疗数据集,现有数据集规模小且仅限于少数疾病,限制了基于预训练语言模型 (Pre-trained Language Models, PLMs) 的分诊方法的有效性。其次,现有方法缺乏医学知识,难以准确理解医患咨询中的专业术语和表达。为克服这些障碍,我们构建了大规模中文医疗对话语料库 (Large-scale Chinese Medical Dialogue Corpora, LCMDC),包括一个包含 439,630 个样本的粗粒度分诊数据集、一个包含 199,600 个样本的细粒度诊断数据集,以及一个包含 472,418 个项目的医疗咨询数据集,从而解决了该领域的数据短缺问题。此外,我们进一步提出了一种结合基于 BERT 的监督学习和提示学习的分诊系统,以及一种基于 GPT 的医疗咨询模型,使用强化学习。为增强领域知识获取,我们使用自建的背景语料库对 PLMs 进行了预训练。在 LCMDC 上的实验结果证明了我们提出的系统的有效性。

[NLP-19] CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios

【速读】: 该论文试图解决在临床医学场景中对大型语言模型(LLMs)进行统一评估的问题。解决方案的关键在于提出了CliMedBench,这是一个包含14个专家指导的核心临床场景的综合基准,用于评估LLMs在7个关键维度上的医疗能力。该基准包含33,735个问题,来源于顶级三级医院的真实医疗报告和考试练习,确保了评估的可靠性和全面性。通过实验,论文揭示了现有LLMs在临床应用中的优势和局限,特别是强调了中文医疗LLMs在医学推理和事实一致性方面的不足,以及通用领域LLMs在医疗领域的潜在能力,为未来的医学研究和模型改进提供了重要见解。

链接: https://arxiv.org/abs/2410.03502
作者: Zetian Ouyang,Yishuai Qiu,Linlin Wang,Gerard de Melo,Ya Zhang,Yanfeng Wang,Liang He
关键词-EN: Large Language Models, Large Language, Language Models, unified evaluation standards, proliferation of Large
类目: Computation and Language (cs.CL)
备注: accepted by ENMLP-2024

点击查看摘要

Abstract:With the proliferation of Large Language Models (LLMs) in diverse domains, there is a particular need for unified evaluation standards in clinical medical scenarios, where models need to be examined very thoroughly. We present CliMedBench, a comprehensive benchmark with 14 expert-guided core clinical scenarios specifically designed to assess the medical ability of LLMs across 7 pivot dimensions. It comprises 33,735 questions derived from real-world medical reports of top-tier tertiary hospitals and authentic examination exercises. The reliability of this benchmark has been confirmed in several ways. Subsequent experiments with existing LLMs have led to the following findings: (i) Chinese medical LLMs underperform on this benchmark, especially where medical reasoning and factual consistency are vital, underscoring the need for advances in clinical knowledge and diagnostic accuracy. (ii) Several general-domain LLMs demonstrate substantial potential in medical clinics, while the limited input capacity of many medical LLMs hinders their practical use. These findings reveal both the strengths and limitations of LLMs in clinical scenarios and offer critical insights for medical research.
摘要:随着大语言模型 (LLMs) 在各个领域的广泛应用,临床医学场景中特别需要统一的评估标准,因为这些模型需要经过非常严格的检验。我们提出了 CliMedBench,这是一个综合性的基准测试,包含 14 个专家指导的核心临床场景,专门用于评估 LLMs 在 7 个关键维度上的医学能力。该基准测试包含 33,735 个问题,这些问题源自顶尖三级医院的真实医疗报告和真实的考试练习。该基准测试的可靠性已通过多种方式得到确认。随后对现有 LLMs 的实验结果表明:(i) 中文医学 LLMs 在该基准测试中表现不佳,尤其是在医疗推理和事实一致性至关重要的领域,这突显了临床知识和诊断准确性方面需要进一步改进。(ii) 一些通用领域的 LLMs 在医疗临床中展现出巨大的潜力,而许多医学 LLMs 有限的输入容量限制了它们的实际应用。这些发现揭示了 LLMs 在临床场景中的优势和局限性,并为医学研究提供了关键的见解。

[NLP-20] Towards Reproducible LLM Evaluation: Quantifying Uncertainty in LLM Benchmark Scores

【速读】: 该论文试图解决大语言模型(LLMs)在评估过程中由于随机性导致的非确定性问题,特别是在重复实验时得分和预测区间的变化。解决方案的关键在于提出了一种简单且成本效益高的方法,通过重复实验来量化基准测试得分的不确定性,并建议在可重复的LLM评估中采用这种方法。

链接: https://arxiv.org/abs/2410.03492
作者: Robert E. Blackwell,Jon Barry,Anthony G. Cohn
关键词-EN: Large language models, give deterministic answers, fixed random seed, models give deterministic, Large language
类目: Computation and Language (cs.CL)
备注: 4 pages, 1 figure

点击查看摘要

Abstract:Large language models (LLMs) are stochastic, and not all models give deterministic answers, even when setting temperature to zero with a fixed random seed. However, few benchmark studies attempt to quantify uncertainty, partly due to the time and cost of repeated experiments. We use benchmarks designed for testing LLMs’ capacity to reason about cardinal directions to explore the impact of experimental repeats on mean score and prediction interval. We suggest a simple method for cost-effectively quantifying the uncertainty of a benchmark score and make recommendations concerning reproducible LLM evaluation.
摘要:大语言模型 (LLMs) 是随机的,即使将温度设置为零并固定随机种子,也并非所有模型都能给出确定性的答案。然而,由于重复实验的时间和成本,很少有基准研究尝试量化这种不确定性。我们使用专为测试 LLMs 对基本方向推理能力的基准,探讨了实验重复对平均分数和预测区间的影响。我们提出了一种简单且成本效益高的方法来量化基准分数的不确定性,并就大语言模型评估的可重复性提出了建议。
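The cheap uncertainty estimate the paper argues for amounts to repeating the benchmark a handful of times and reporting a mean plus an interval. A sketch using a normal-approximation prediction interval for one future run (the paper's exact interval construction may differ):

```python
import math
from statistics import mean, stdev

def score_interval(scores, z=1.96):
    """Mean benchmark score with an approximate 95% prediction
    interval for a single future run, assuming roughly normal
    run-to-run variation across repeats."""
    m, s = mean(scores), stdev(scores)
    half = z * s * math.sqrt(1 + 1 / len(scores))
    return m, (m - half, m + half)

# Five repeated runs of the same benchmark on the same model.
runs = [0.81, 0.79, 0.83, 0.80, 0.82]
m, (lo, hi) = score_interval(runs)
print(f"mean={m:.3f}, 95% PI=({lo:.3f}, {hi:.3f})")
```

Two models whose intervals overlap cannot honestly be ranked from a single run each, which is the reproducibility point the paper makes.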

[NLP-21] Is Safer Better? The Impact of Guardrails on the Argumentative Strength of LLMs in Hate Speech Countering

【速读】: 该论文试图解决自动生成反仇恨言论(counterspeech)时,生成的内容往往缺乏专家级辩论丰富性的问题。解决方案的关键在于:首先,研究大型语言模型(LLMs)中帮助性与无害性之间的张力,测试安全防护措施是否阻碍了生成内容的质量;其次,评估针对仇恨言论特定组成部分进行攻击是否能更有效地对抗在线仇恨。通过广泛的人工和自动评估,论文表明安全防护措施可能对旨在促进积极社会互动的任务产生负面影响,而针对仇恨言论的隐性负面刻板印象和仇恨部分进行攻击,则能生成更高质量的反仇恨言论。

链接: https://arxiv.org/abs/2410.03466
作者: Helena Bonaldi,Greta Damo,Nicolás Benjamín Ocampo,Elena Cabrio,Serena Villata,Marco Guerini
关键词-EN: NLG research community, attracting increasing interest, NLG research, research community, hate speech mitigation
类目: Computation and Language (cs.CL)
备注: To appear in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (long paper)

点击查看摘要

Abstract:The potential effectiveness of counterspeech as a hate speech mitigation strategy is attracting increasing interest in the NLG research community, particularly towards the task of automatically producing it. However, automatically generated responses often lack the argumentative richness which characterises expert-produced counterspeech. In this work, we focus on two aspects of counterspeech generation to produce more cogent responses. First, by investigating the tension between helpfulness and harmlessness of LLMs, we test whether the presence of safety guardrails hinders the quality of the generations. Secondly, we assess whether attacking a specific component of the hate speech results in a more effective argumentative strategy to fight online hate. By conducting an extensive human and automatic evaluation, we show how the presence of safety guardrails can be detrimental also to a task that inherently aims at fostering positive social interactions. Moreover, our results show that attacking a specific component of the hate speech, and in particular its implicit negative stereotype and its hateful parts, leads to higher-quality generations.
摘要:作为仇恨言论缓解策略的反驳言论的有效性在自然语言生成 (NLG) 研究社区中引起了越来越多的关注,特别是在自动生成反驳言论的任务方面。然而,自动生成的回应往往缺乏专家制作的反驳言论所具有的论证丰富性。在本研究中,我们专注于反驳言论生成的两个方面,以产生更具说服力的回应。首先,通过探讨大语言模型 (LLM) 的有益性和无害性之间的紧张关系,我们测试了安全护栏的存在是否会阻碍生成内容的质量。其次,我们评估了针对仇恨言论的特定组成部分进行攻击是否能产生更有效的论证策略来对抗网络仇恨。通过进行广泛的人工和自动评估,我们展示了安全护栏的存在如何对一个本质上旨在促进积极社会互动的任务产生不利影响。此外,我们的结果表明,攻击仇恨言论的特定组成部分,特别是其隐含的负面刻板印象及其仇恨部分,可以产生更高质量的生成内容。

[NLP-22] Auto-GDA: Automatic Domain Adaptation for Efficient Grounding Verification in Retrieval Augmented Generation

【速读】: 该论文试图解决在检索增强生成(RAG)系统中,大型语言模型(LLM)输出中存在的幻觉问题,即生成错误或不相关信息的问题。解决方案的关键在于引入自动生成域适应(Auto-GDA)框架,通过合成数据生成实现无监督域适应。Auto-GDA利用弱标签和离散优化策略,迭代地改进生成样本的质量,从而在推理时使用轻量级自然语言推理(NLI)模型进行高效的接地验证,显著降低了计算成本并提升了模型性能。

链接: https://arxiv.org/abs/2410.03461
作者: Tobias Leemann,Periklis Petridis,Giuseppe Vietri,Dionysis Manousakas,Aaron Roth,Sergul Aydore
关键词-EN: NLI models, large language model, suffer from hallucination, generating incorrect, irrelevant information
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:While retrieval augmented generation (RAG) has been shown to enhance factuality of large language model (LLM) outputs, LLMs still suffer from hallucination, generating incorrect or irrelevant information. One common detection strategy involves prompting the LLM again to assess whether its response is grounded in the retrieved evidence, but this approach is costly. Alternatively, lightweight natural language inference (NLI) models for efficient grounding verification can be used at inference time. While existing pre-trained NLI models offer potential solutions, their performance remains subpar compared to larger models on realistic RAG inputs. RAG inputs are more complex than most datasets used for training NLI models and have characteristics specific to the underlying knowledge base, requiring adaptation of the NLI models to a specific target domain. Additionally, the lack of labeled instances in the target domain makes supervised domain adaptation, e.g., through fine-tuning, infeasible. To address these challenges, we introduce Automatic Generative Domain Adaptation (Auto-GDA). Our framework enables unsupervised domain adaptation through synthetic data generation. Unlike previous methods that rely on handcrafted filtering and augmentation strategies, Auto-GDA employs an iterative process to continuously improve the quality of generated samples using weak labels from less efficient teacher models and discrete optimization to select the most promising augmented samples. Experimental results demonstrate the effectiveness of our approach, with models fine-tuned on synthetic data using Auto-GDA often surpassing the performance of the teacher model and reaching the performance level of LLMs at 10 % of their computational cost.
摘要:尽管检索增强生成 (RAG) 已被证明可以提高大语言模型 (LLM) 输出的真实性,但 LLM 仍然存在幻觉问题,即生成错误或不相关的信息。一种常见的检测策略是再次提示 LLM 以评估其响应是否基于检索到的证据,但这种方法成本较高。另一种选择是在推理时使用轻量级自然语言推理 (NLI) 模型进行高效的接地验证。虽然现有的预训练 NLI 模型提供了潜在的解决方案,但它们在现实 RAG 输入上的表现仍不如大型模型。RAG 输入比大多数用于训练 NLI 模型的数据集更为复杂,并且具有特定于底层知识库的特征,因此需要将 NLI 模型适应于特定的目标领域。此外,目标领域中缺乏标记实例使得通过微调等方法进行监督领域适应变得不可行。为了应对这些挑战,我们引入了自动生成式领域适应 (Auto-GDA)。我们的框架通过合成数据生成实现了无监督领域适应。与依赖手工筛选和增强策略的先前方法不同,Auto-GDA 采用迭代过程,利用效率较低的教师模型的弱标签和离散优化来持续改进生成样本的质量,并选择最有前景的增强样本。实验结果表明,我们的方法具有有效性,使用 Auto-GDA 微调的模型在合成数据上的表现通常优于教师模型,并且达到 LLM 性能水平的计算成本仅为 LLM 的 10%。
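Auto-GDA's outer loop (augment candidate examples, weak-label the pool with an expensive teacher, keep only the most promising) reduces to a simple iterate-and-select scheme. The sketch below is a caricature: `mutate` and `teacher_score` are placeholders for the paper's augmentation strategies and teacher model, and plain top-k selection stands in for its discrete optimization:

```python
def auto_gda_round(samples, mutate, teacher_score, keep):
    """One iteration: augment every sample, score the enlarged pool
    with the (expensive) teacher's weak labels, keep the top `keep`."""
    pool = list(samples) + [mutate(s) for s in samples]
    pool.sort(key=teacher_score, reverse=True)
    return pool[:keep]

def auto_gda(seed, rounds, mutate, teacher_score, keep):
    data = list(seed)
    for _ in range(rounds):
        data = auto_gda_round(data, mutate, teacher_score, keep)
    return data  # synthetic set used to fine-tune the small NLI model

# Toy run: "samples" are numbers, mutation adds 2, the teacher
# prefers larger values, so sample quality ratchets up each round.
print(auto_gda([1, 3, 5], rounds=2, mutate=lambda s: s + 2,
               teacher_score=lambda s: s, keep=3))  # → [9, 7, 7]
```

The teacher is only consulted offline during this loop; at inference time, only the small fine-tuned model runs, which is where the ~10x cost saving comes from.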

[NLP-23] Multi-Dialect Vietnamese: Task, Dataset, Baseline Models and Challenges EMNLP2024

【速读】: 该论文试图解决越南语方言的细粒度分类问题,特别是针对越南63个省份的方言进行精细区分。解决方案的关键在于引入了一个名为Vietnamese Multi-Dialect (ViMD)的新数据集,该数据集包含了102.56小时的音频数据,涵盖约19,000个语音片段和超过120万个单词的转录文本,能够捕捉越南各省份方言的丰富多样性。通过使用该数据集,论文进一步对预训练模型进行了微调,以评估其在方言识别和语音识别两个下游任务中的表现,从而揭示了地理因素对方言的影响以及当前语音识别方法在处理多方言数据时的局限性。

链接: https://arxiv.org/abs/2410.03458
作者: Nguyen Van Dinh,Thanh Chi Dang,Luan Thanh Nguyen,Kiet Van Nguyen
关键词-EN: belong to Northern, Southern Vietnam, primary dialect groups, low-resource language, typically categorized
类目: Computation and Language (cs.CL)
备注: Main EMNLP 2024

点击查看摘要

Abstract:Vietnamese, a low-resource language, is typically categorized into three primary dialect groups that belong to Northern, Central, and Southern Vietnam. However, each province within these regions exhibits its own distinct pronunciation variations. Despite the existence of various speech recognition datasets, none of them has provided a fine-grained classification of the 63 dialects specific to individual provinces of Vietnam. To address this gap, we introduce Vietnamese Multi-Dialect (ViMD) dataset, a novel comprehensive dataset capturing the rich diversity of 63 provincial dialects spoken across Vietnam. Our dataset comprises 102.56 hours of audio, consisting of approximately 19,000 utterances, and the associated transcripts contain over 1.2 million words. To provide benchmarks and simultaneously demonstrate the challenges of our dataset, we fine-tune state-of-the-art pre-trained models for two downstream tasks: (1) Dialect identification and (2) Speech recognition. The empirical results suggest two implications including the influence of geographical factors on dialects, and the constraints of current approaches in speech recognition tasks involving multi-dialect speech data. Our dataset is available for research purposes.
摘要:越南语作为一种低资源语言,通常被分为三个主要方言组,分别属于越南的北部、中部和南部。然而,这些地区内的每个省份都有其独特的语音变体。尽管存在多种语音识别数据集,但没有一个数据集能够对越南63个省份特有的方言进行细粒度分类。为了填补这一空白,我们引入了越南多方言(Vietnamese Multi-Dialect, ViMD)数据集,这是一个全新的综合性数据集,捕捉了越南各地63种省级方言的丰富多样性。我们的数据集包含102.56小时的音频,约19,000条语音,相关转录文本超过120万字。为了提供基准测试并同时展示我们数据集的挑战性,我们对两个下游任务的先进预训练模型进行了微调:(1)方言识别和(2)语音识别。实证结果表明,地理因素对方言的影响以及当前方法在处理多方言语音数据时的局限性。我们的数据集可供研究使用。

[NLP-24] CoCoLoFa: A Dataset of News Comments with Common Logical Fallacies Written by LLM-Assisted Crowds EMNLP2024

【速读】: 该论文试图解决自动化检测文本中逻辑谬误的问题,其关键解决方案在于创建了一个大规模的逻辑谬误数据集CoCoLoFa,并通过结合众包和大型语言模型(LLM)的方式,有效地生成和标注了高质量的谬误样本。具体来说,论文招募了143名众包工作者,利用LLM辅助工具帮助他们编写包含特定逻辑谬误类型的评论,从而构建了一个包含7,706条评论的数据集。这一方法不仅提高了数据集的质量和可靠性,还使得基于BERT的模型在谬误检测和分类任务上达到了最先进的性能(F1值分别为0.86和0.87)。

链接: https://arxiv.org/abs/2410.03457
作者: Min-Hsuan Yeh,Ruyuan Wan,Ting-Hao ‘Kenneth’ Huang
关键词-EN: spot argument flaws, Detecting logical fallacies, users spot argument, Detecting logical, argument flaws
类目: Computation and Language (cs.CL)
备注: In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024)

点击查看摘要

Abstract:Detecting logical fallacies in texts can help users spot argument flaws, but automating this detection is not easy. Manually annotating fallacies in large-scale, real-world text data to create datasets for developing and validating detection models is costly. This paper introduces CoCoLoFa, the largest known logical fallacy dataset, containing 7,706 comments for 648 news articles, with each comment labeled for fallacy presence and type. We recruited 143 crowd workers to write comments embodying specific fallacy types (e.g., slippery slope) in response to news articles. Recognizing the complexity of this writing task, we built an LLM-powered assistant into the workers’ interface to aid in drafting and refining their comments. Experts rated the writing quality and labeling validity of CoCoLoFa as high and reliable. BERT-based models fine-tuned using CoCoLoFa achieved the highest fallacy detection (F1=0.86) and classification (F1=0.87) performance on its test set, outperforming the state-of-the-art LLMs. Our work shows that combining crowdsourcing and LLMs enables us to more effectively construct datasets for complex linguistic phenomena that crowd workers find challenging to produce on their own.
摘要:检测文本中的逻辑谬误可以帮助用户识别论证中的缺陷,但自动化这一检测过程并不容易。手动标注大规模真实世界文本数据中的谬误以创建用于开发和验证检测模型的数据集成本高昂。本文介绍了 CoCoLoFa,这是目前已知最大的逻辑谬误数据集,包含 7,706 条评论,对应 648 篇新闻文章,每条评论都标注了是否存在谬误及其类型。我们招募了 143 名众包工作者,要求他们根据新闻文章撰写体现特定谬误类型(例如,滑坡谬误)的评论。考虑到这一写作任务的复杂性,我们在工作者界面中构建了一个由大语言模型 (LLM) 驱动的助手,以帮助他们起草和完善评论。专家对 CoCoLoFa 的写作质量和标注有效性评价为高且可靠。使用 CoCoLoFa 微调的基于 BERT 的模型在其测试集上实现了最高的谬误检测 (F1=0.86) 和分类 (F1=0.87) 性能,超过了最先进的 LLM。我们的研究表明,结合众包和大语言模型,使我们能够更有效地构建数据集,用于处理众包工作者难以独立产出的复杂语言现象。

[NLP-25] How Language Models Prioritize Contextual Grammatical Cues?

【速读】: 该论文试图解决在存在多个性别线索词的情况下,Transformer语言模型如何处理性别一致性问题。解决方案的关键在于通过两种互补的方法——上下文混合分析和激活补丁变体,分别追踪模型内部的信息流动和测量线索对模型预测的影响。研究发现,BERT倾向于优先使用上下文中的第一个线索来形成目标词表示和模型预测,而GPT-2则更依赖于最后一个线索。这些发现揭示了编码器和解码器模型在优先级和使用上下文信息方面的显著差异。

链接: https://arxiv.org/abs/2410.03447
作者: Hamidreza Amirzadeh,Afra Alishahi,Hosein Mohebbi
关键词-EN: Transformer-based language models, utilize contextual information, shown an excellent, excellent ability, ability to effectively
类目: Computation and Language (cs.CL)
备注: Accepted to BlackboxNLP 2024

点击查看摘要

Abstract:Transformer-based language models have shown an excellent ability to effectively capture and utilize contextual information. Although various analysis techniques have been used to quantify and trace the contribution of single contextual cues to a target task such as subject-verb agreement or coreference resolution, scenarios in which multiple relevant cues are available in the context remain underexplored. In this paper, we investigate how language models handle gender agreement when multiple gender cue words are present, each capable of independently disambiguating a target gender pronoun. We analyze two widely used Transformer-based models: BERT, an encoder-based, and GPT-2, a decoder-based model. Our analysis employs two complementary approaches: context mixing analysis, which tracks information flow within the model, and a variant of activation patching, which measures the impact of cues on the model’s prediction. We find that BERT tends to prioritize the first cue in the context to form both the target word representations and the model’s prediction, while GPT-2 relies more on the final cue. Our findings reveal striking differences in how encoder-based and decoder-based models prioritize and use contextual information for their predictions.
摘要:基于 Transformer 的语言模型展现了出色的能力,能够有效捕捉和利用上下文信息。尽管已有多种分析技术用于量化和追踪单一上下文线索对目标任务(如主谓一致或指代消解)的贡献,但当上下文中存在多个相关线索时,这些情况仍未得到充分探索。本文研究了当存在多个性别线索词时,语言模型如何处理性别一致性问题,每个线索词都能独立地消除目标性别代词的歧义。我们分析了两种广泛使用的基于 Transformer 的模型:BERT(一种基于编码器的模型)和 GPT-2(一种基于解码器的模型)。我们的分析采用了两种互补的方法:上下文混合分析,追踪模型内部的信息流动;以及激活补丁的变体,测量线索对模型预测的影响。我们发现,BERT 倾向于优先考虑上下文中的第一个线索来形成目标词表示和模型的预测,而 GPT-2 则更多依赖于最后一个线索。我们的研究揭示了基于编码器和基于解码器的模型在优先级和使用上下文信息进行预测方面的显著差异。
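Activation patching, one of the two analysis tools used here, boils down to re-running the model on one input while overwriting a single internal activation with the value it took on another input, then measuring how the prediction moves. In caricature, with a two-step toy "model" (the real method patches hidden states inside a Transformer):

```python
def toy_model(x, patched_hidden=None):
    """Two-layer toy model whose hidden activation can be overwritten."""
    hidden = 2 * x                      # layer 1
    if patched_hidden is not None:
        hidden = patched_hidden         # the patch
    return hidden + 1                   # layer 2 -> "prediction"

def patching_effect(source_x, target_x):
    """How far does importing source_x's hidden state into a run on
    target_x move the prediction? A large effect means that
    activation carries the cue being studied."""
    source_hidden = 2 * source_x
    return toy_model(target_x, patched_hidden=source_hidden) - toy_model(target_x)

print(patching_effect(3, 10))  # → -14
```

In the paper's setting, source and target inputs differ in which gender cue appears first or last, so the patching effect localizes which cue the model actually relies on.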

[NLP-26] On Uncertainty In Natural Language Processing

【速读】: 该论文旨在解决自然语言处理(NLP)中模型预测的不确定性问题,并探讨如何量化这种不确定性以提高模型的可靠性。解决方案的关键在于从语言学、统计学和神经网络的角度对不确定性进行表征,并通过实验设计来减少和量化不确定性。论文提出了多种不确定性量化方法,包括在多语言文本分类任务中的应用,以及基于非可交换性保序预测的自然语言生成校准采样方法。此外,论文还开发了一种利用辅助预测器量化大型黑箱语言模型置信度的方法,该方法仅依赖于目标模型的输入和输出文本。

链接: https://arxiv.org/abs/2410.03446
作者: Dennis Ulmer
关键词-EN: increasingly capable systems, natural language processing, decade in deep, deep learning, learning has brought
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: PhD thesis

点击查看摘要

Abstract:The last decade in deep learning has brought on increasingly capable systems that are deployed on a wide variety of applications. In natural language processing, the field has been transformed by a number of breakthroughs including large language models, which are used in increasingly many user-facing applications. In order to reap the benefits of this technology and reduce potential harms, it is important to quantify the reliability of model predictions and the uncertainties that shroud their development. This thesis studies how uncertainty in natural language processing can be characterized from a linguistic, statistical and neural perspective, and how it can be reduced and quantified through the design of the experimental pipeline. We further explore uncertainty quantification in modeling by theoretically and empirically investigating the effect of inductive model biases in text classification tasks. The corresponding experiments include data for three different languages (Danish, English and Finnish) and tasks as well as a large set of different uncertainty quantification approaches. Additionally, we propose a method for calibrated sampling in natural language generation based on non-exchangeable conformal prediction, which provides tighter token sets with better coverage of the actual continuation. Lastly, we develop an approach to quantify confidence in large black-box language models using auxiliary predictors, where the confidence is predicted from the input to and generated output text of the target model alone.
摘要:过去十年,深度学习领域涌现出越来越多功能强大的系统,这些系统被广泛应用于各种应用中。在自然语言处理领域,一系列突破性进展,包括大语言模型 (Large Language Model),已经彻底改变了该领域,并越来越多地应用于面向用户的应用中。为了充分利用这项技术并减少潜在的危害,量化模型预测的可靠性及其开发过程中的不确定性至关重要。本论文研究了如何从语言学、统计学和神经学的角度来表征自然语言处理中的不确定性,以及如何通过实验流程的设计来减少和量化这些不确定性。我们进一步通过理论和实证研究,探讨了归纳模型偏差在文本分类任务中对不确定性量化的影响。相应的实验涵盖了三种不同语言(丹麦语、英语和芬兰语)的数据以及一系列不同的不确定性量化方法。此外,我们提出了一种基于非可交换性保形预测的自然语言生成校准采样方法,该方法能够提供更紧密的 Token 集合,并更好地覆盖实际的延续部分。最后,我们开发了一种方法,通过辅助预测器来量化大型黑箱语言模型的置信度,其中置信度仅从目标模型的输入和生成的输出文本中预测。

评论:博士论文
学科:人工智能 (cs.AI); 计算与语言 (cs.CL)
引用方式:arXiv:2410.03446 [cs.AI]
(或 arXiv:2410.03446v1 [cs.AI] 用于此版本)
https://doi.org/10.48550/arXiv.2410.03446

[NLP-27] Exploring the Benefit of Activation Sparsity in Pre-training ICML2024

【速读】: 该论文试图解决预训练Transformer模型中稀疏激活特性的利用问题。解决方案的关键在于提出了一种名为Switchable Sparse-Dense Learning (SSD)的方法,该方法在预训练过程中自适应地在基于Mixtures-of-Experts (MoE)的稀疏训练和传统密集训练之间切换。SSD利用稀疏训练的效率,同时避免稀疏训练中的静态激活相关性问题,从而在保持模型性能的同时降低预训练成本,并实现稀疏推理时与密集模型相当的性能,且推理速度可提升至2倍。

链接: https://arxiv.org/abs/2410.03440
作者: Zhengyan Zhang,Chaojun Xiao,Qiujieli Qin,Yankai Lin,Zhiyuan Zeng,Xu Han,Zhiyuan Liu,Ruobing Xie,Maosong Sun,Jie Zhou
关键词-EN: Pre-trained Transformers inherently, Transformers inherently possess, Pre-trained Transformers, sparse activation, inherently possess
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ICML 2024

点击查看摘要

Abstract:Pre-trained Transformers inherently possess the characteristic of sparse activation, where only a small fraction of the neurons are activated for each token. While sparse activation has been explored through post-training methods, its potential in pre-training remains untapped. In this work, we first study how activation properties change during pre-training. Our examination reveals that Transformers exhibit sparse activation throughout the majority of the pre-training process while the activation correlation keeps evolving as training progresses. Leveraging this observation, we propose Switchable Sparse-Dense Learning (SSD). SSD adaptively switches between the Mixtures-of-Experts (MoE) based sparse training and the conventional dense training during the pre-training process, leveraging the efficiency of sparse training and avoiding the static activation correlation of sparse training. Compared to dense training, SSD achieves comparable performance with identical model size and reduces pre-training costs. Moreover, the models trained with SSD can be directly used as MoE models for sparse inference and achieve the same performance as dense models with up to 2× faster inference speed. Codes are available at this https URL.
摘要:预训练的 Transformer 天生具有稀疏激活的特性,即对于每个 Token,只有一小部分神经元被激活。尽管稀疏激活通过后训练方法得到了探索,但在预训练阶段的潜力尚未被挖掘。在本研究中,我们首先探讨了预训练过程中激活特性的变化。我们的研究表明,Transformer 在整个预训练过程中大部分时间都表现出稀疏激活,而激活相关性随着训练的进行不断演变。基于这一观察,我们提出了可切换的稀疏-密集学习 (Switchable Sparse-Dense Learning, SSD)。SSD 在预训练过程中自适应地在基于专家混合 (Mixtures-of-Experts, MoE) 的稀疏训练和传统的密集训练之间切换,利用稀疏训练的效率,同时避免稀疏训练的静态激活相关性。与密集训练相比,SSD 在相同模型规模下实现了相当的性能,并降低了预训练成本。此外,使用 SSD 训练的模型可以直接用作 MoE 模型进行稀疏推理,并实现与密集模型相同性能,推理速度最多可提高 2 倍。代码可在以下链接获取:https URL。

[NLP-28] ToolGen: Unified Tool Retrieval and Calling via Generation

【速读】: 该论文试图解决大语言模型(LLMs)在自主执行任务时无法直接与外部工具交互的问题。解决方案的关键在于引入ToolGen框架,通过将每个工具表示为一个独特的token,直接将工具知识集成到LLM的参数中。这种方法使得LLM能够在生成下一个token时自然地生成工具调用及其参数,从而无缝地将工具调用与语言生成结合。ToolGen消除了对额外检索机制的需求,显著提升了性能和可扩展性,并为AI代理在多领域工具适应方面开辟了新的可能性。

链接: https://arxiv.org/abs/2410.03439
作者: Renxi Wang,Xudong Han,Lei Ji,Shu Wang,Timothy Baldwin,Haonan Li
关键词-EN: large language models, autonomously execute tasks, external tools remains, critical limitation, inability to autonomously
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As large language models (LLMs) advance, their inability to autonomously execute tasks by directly interacting with external tools remains a critical limitation. Traditional methods rely on inputting tool descriptions as context, which is constrained by context length and requires separate, often inefficient, retrieval mechanisms. We introduce ToolGen, a paradigm shift that integrates tool knowledge directly into the LLM’s parameters by representing each tool as a unique token. This enables the LLM to generate tool calls and arguments as part of its next token prediction capabilities, seamlessly blending tool invocation with language generation. Our framework allows the LLM to access and utilize a vast amount of tools with no additional retrieval step, significantly enhancing both performance and scalability. Experimental results with over 47,000 tools show that ToolGen not only achieves superior results in both tool retrieval and autonomous task completion but also sets the stage for a new era of AI agents that can adapt to tools across diverse domains. By fundamentally transforming tool retrieval into a generative process, ToolGen paves the way for more versatile, efficient, and autonomous AI systems. ToolGen enables end-to-end tool learning and opens opportunities for integration with other advanced techniques such as chain-of-thought and reinforcement learning, thereby expanding the practical capabilities of LLMs.
摘要:随着大语言模型 (LLM) 的发展,它们无法通过直接与外部工具交互来自主执行任务的问题仍然是一个关键限制。传统方法依赖于将工具描述作为上下文输入,这受限于上下文长度,并需要单独的、通常效率低下的检索机制。我们引入了 ToolGen,这是一种范式转变,通过将每个工具表示为一个独特的 Token,将工具知识直接集成到 LLM 的参数中。这使得 LLM 能够在其下一个 Token 预测能力中生成工具调用和参数,无缝地将工具调用与语言生成结合在一起。我们的框架允许 LLM 无需额外的检索步骤即可访问和利用大量工具,显著提升了性能和可扩展性。通过对超过 47,000 个工具的实验结果表明,ToolGen 不仅在工具检索和自主任务完成方面取得了卓越的成果,还为能够适应跨领域工具的新一代 AI 智能体奠定了基础。通过将工具检索从根本上转变为生成过程,ToolGen 为更通用、高效和自主的 AI 系统铺平了道路。ToolGen 实现了端到端的工具学习,并开启了与其他先进技术(如链式思维和强化学习)集成的机会,从而扩展了 LLM 的实际应用能力。
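ToolGen 的核心思想是"工具即 Token":每个工具在词表中占据一个独特的 Token,工具调用因此退化为普通的下一个 Token 预测。下面用一个假设性的玩具词表来示意这一表示方式(Token 命名与解码逻辑均为本文虚构,仅作说明,并非论文官方实现):

```python
# 示意:把工具表示为词表中的特殊 Token(假设性玩具实现)
base_vocab = ["<bos>", "<eos>", "查询", "天气", "城市", "是"]
tools = ["get_weather", "search_web"]

# 为每个工具注册一个独特 Token,直接并入词表
tool_tokens = {t: f"<tool_{t}>" for t in tools}
vocab = base_vocab + list(tool_tokens.values())
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def decode_step(token):
    """解码时,工具 Token 触发工具调用,普通 Token 照常作为文本输出。"""
    if token.startswith("<tool_"):
        return ("tool_call", token[len("<tool_"):-1])
    return ("text", token)

# 模型的"下一个 Token 预测"序列中可以自然混入工具调用
generated = ["查询", "天气", tool_tokens["get_weather"], "<eos>"]
actions = [decode_step(tok) for tok in generated]
```

这样,检索与调用不再需要额外的检索模块:工具 Token 与普通 Token 共享同一套生成机制。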

[NLP-29] A General Framework for Producing Interpretable Semantic Text Embeddings

【速读】: 该论文试图解决自然语言处理中语义文本嵌入的可解释性问题,特别是在需要透明性的任务中,黑箱模型的高质量嵌入缺乏解释性。解决方案的关键在于提出了一个通用框架 \algoCQG-MBQA(对比问题生成 - 多任务二元问题回答),通过系统生成高度区分性且认知负荷低的二元问题,并利用 \algoMBQA 模型高效回答这些问题,从而在成本效益高的前提下生成可解释的语义文本嵌入。该框架不仅在嵌入质量上可与许多先进黑箱模型相媲美,而且在各种下游任务中优于其他可解释文本嵌入方法。

链接: https://arxiv.org/abs/2410.03435
作者: Yiqun Sun,Qiang Huang,Yixuan Tang,Anthony K. H. Tung,Jun Yu
关键词-EN: Natural Language Processing, Language Processing, Natural Language, algo, NLP
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 19 pages, 5 figures, and 9 tables

点击查看摘要

Abstract:Semantic text embedding is essential to many tasks in Natural Language Processing (NLP). While black-box models are capable of generating high-quality embeddings, their lack of interpretability limits their use in tasks that demand transparency. Recent approaches have improved interpretability by leveraging domain-expert-crafted or LLM-generated questions, but these methods rely heavily on expert input or well-crafted prompt design, which restricts their generalizability and ability to generate discriminative questions across a wide range of tasks. To address these challenges, we introduce \algoCQG-MBQA (Contrastive Question Generation - Multi-task Binary Question Answering), a general framework for producing interpretable semantic text embeddings across diverse tasks. Our framework systematically generates highly discriminative, low cognitive load yes/no questions through the \algoCQG method and answers them efficiently with the \algoMBQA model, resulting in interpretable embeddings in a cost-effective manner. We validate the effectiveness and interpretability of \algoCQG-MBQA through extensive experiments and ablation studies, demonstrating that it delivers embedding quality comparable to many advanced black-box models while maintaining inherent interpretability. Additionally, \algoCQG-MBQA outperforms other interpretable text embedding methods across various downstream tasks.
摘要:语义文本嵌入在自然语言处理 (NLP) 的许多任务中至关重要。尽管黑箱模型能够生成高质量的嵌入,但其缺乏可解释性限制了它们在需要透明度的任务中的应用。最近的方法通过利用领域专家设计的或大语言模型 (LLM) 生成的问题来提高可解释性,但这些方法严重依赖专家输入或精心设计的提示,这限制了它们在广泛任务中的通用性和生成区分性问题的能力。为了解决这些挑战,我们引入了 \algoCQG-MBQA (对比问题生成 - 多任务二元问题回答),这是一个用于在多样任务中生成可解释语义文本嵌入的通用框架。我们的框架通过 \algoCQG 方法系统地生成高度区分性、低认知负荷的是/否问题,并使用 \algoMBQA 模型高效地回答这些问题,从而以成本效益的方式生成可解释的嵌入。我们通过广泛的实验和消融研究验证了 \algoCQG-MBQA 的有效性和可解释性,证明它在保持固有可解释性的同时,嵌入质量可与许多先进的黑箱模型相媲美。此外,\algoCQG-MBQA 在各种下游任务中优于其他可解释文本嵌入方法。
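在该框架下,嵌入的每一维都是对一个二元问题的回答,因此嵌入本身即可被人读懂。下面是一个极简示意(问题与关键词规则均为本文虚构;真实系统中问题由 CQG 生成、回答由 MBQA 模型完成):

```python
# 示意:用一组是/否问题的回答构造可解释文本嵌入(玩具实现)
questions = [
    ("是否与体育相关?", ["球", "比赛", "运动员"]),
    ("是否表达负面情绪?", ["糟糕", "失望", "难过"]),
    ("是否包含数字?", list("0123456789")),
]

def interpretable_embed(text):
    """每一维对应一个二元问题的答案,因此嵌入的每个分量都有明确含义。"""
    return [1 if any(k in text for k in keywords) else 0
            for _, keywords in questions]

e1 = interpretable_embed("这场比赛糟糕透了")
e2 = interpretable_embed("今天有3场比赛")
```

对比两个嵌入即可直接读出它们在哪些问题上一致、在哪些问题上不同,这正是可解释性的来源。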

[NLP-30] Images Speak Volumes: User-Centric Assessment of Image Generation for Accessible Communication

【速读】: 该论文试图解决在线数据库中的图像与易读文本(E2R)不匹配的问题,以及定制化图像制作成本高的问题。解决方案的关键在于评估和比较多种文本到图像生成模型,以确定它们是否能够快速且便捷地生成符合E2R文本需求的定制化图像。研究结果表明,虽然某些模型表现出色,但在没有人工监督的情况下,这些模型尚不能大规模应用。该研究为E2R创作者提供了重要的参考,有助于推动创建更符合目标群体需求的可访问信息。

链接: https://arxiv.org/abs/2410.03430
作者: Miriam Anschütz,Tringa Sylaj,Georg Groh
关键词-EN: Explanatory images play, Explanatory images, play a pivotal, pivotal role, images
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: To be published at TSAR workshop 2024 ( this https URL )

点击查看摘要

Abstract:Explanatory images play a pivotal role in accessible and easy-to-read (E2R) texts. However, the images available in online databases are not tailored toward the respective texts, and the creation of customized images is expensive. In this large-scale study, we investigated whether text-to-image generation models can close this gap by providing customizable images quickly and easily. We benchmarked seven, four open- and three closed-source, image generation models and provide an extensive evaluation of the resulting images. In addition, we performed a user study with people from the E2R target group to examine whether the images met their requirements. We find that some of the models show remarkable performance, but none of the models are ready to be used at a larger scale without human supervision. Our research is an important step toward facilitating the creation of accessible information for E2R creators and tailoring accessible images to the target group’s needs.
摘要: 解释性图像在易于访问和易于阅读 (E2R) 的文本中起着关键作用。然而,在线数据库中的图像并未针对相应的文本进行定制,而创建定制图像的成本较高。在本大规模研究中,我们探讨了文本到图像生成模型是否能够通过快速且轻松地提供可定制图像来填补这一空白。我们对七个图像生成模型进行了基准测试,包括四个开源模型和三个闭源模型,并对生成的图像进行了广泛评估。此外,我们还对 E2R 目标群体的成员进行了用户研究,以检验这些图像是否符合他们的需求。我们发现,部分模型表现出显著的性能,但在没有人工监督的情况下,没有一个模型能够在大规模应用中使用。我们的研究是促进 E2R 创作者创建易于访问的信息并根据目标群体需求定制易于访问图像的重要一步。

[NLP-31] How Hard is this Test Set? NLI Characterization by Exploiting Training Dynamics EMNLP2024

【速读】: 该论文试图解决自然语言推理(NLI)评估中存在的系统性虚假相关问题,即现有流行数据集中的虚假相关性导致模型性能被夸大。解决方案的关键在于提出一种自动化创建具有挑战性测试集的方法,该方法不依赖于人工构建的非现实示例。通过利用训练动态的方法,将流行NLI数据集的测试集分类为三个难度级别,显著减少了虚假相关性,并涵盖了更多现实和多样的语言现象。此外,该方法应用于训练集时,仅使用部分数据训练的模型性能可与全数据训练的模型相媲美,超越了其他数据集特征化技术。这一研究为NLI数据集构建提供了改进,提供了更真实的模型性能评估,对多样化的自然语言理解应用具有重要意义。

链接: https://arxiv.org/abs/2410.03429
作者: Adrian Cosma,Stefan Ruseti,Mihai Dascalu,Cornelia Caragea
关键词-EN: Natural Language Inference, assessing language understanding, Natural Language, Language Inference, artificially inflate actual
类目: Computation and Language (cs.CL)
备注: Accepted at EMNLP 2024 Main Conference

点击查看摘要

Abstract:Natural Language Inference (NLI) evaluation is crucial for assessing language understanding models; however, popular datasets suffer from systematic spurious correlations that artificially inflate actual model performance. To address this, we propose a method for the automated creation of a challenging test set without relying on the manual construction of artificial and unrealistic examples. We categorize the test set of popular NLI datasets into three difficulty levels by leveraging methods that exploit training dynamics. This categorization significantly reduces spurious correlation measures, with examples labeled as having the highest difficulty showing markedly decreased performance and encompassing more realistic and diverse linguistic phenomena. When our characterization method is applied to the training set, models trained with only a fraction of the data achieve comparable performance to those trained on the full dataset, surpassing other dataset characterization techniques. Our research addresses limitations in NLI dataset construction, providing a more authentic evaluation of model performance with implications for diverse NLU applications.
摘要:自然语言推理 (Natural Language Inference, NLI) 评估对于衡量语言理解模型的能力至关重要;然而,流行的数据集存在系统性的虚假相关性,这些相关性人为地夸大了模型的实际性能。为了解决这一问题,我们提出了一种自动创建具有挑战性测试集的方法,而无需依赖人工构建的刻意且不现实的示例。我们通过利用训练动态的方法,将流行 NLI 数据集的测试集分为三个难度级别。这种分类显著减少了虚假相关性的度量,其中被标记为最高难度的示例显示出显著降低的性能,并涵盖了更多现实和多样的语言现象。当我们的分类方法应用于训练集时,仅使用部分数据训练的模型能够达到与使用完整数据集训练的模型相当的性能,超越了其他数据集分类技术。我们的研究解决了 NLI 数据集构建中的局限性,为模型性能提供了更真实的评估,并对多样化的自然语言理解 (Natural Language Understanding, NLU) 应用具有重要意义。
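利用训练动态刻画样本难度的思路可以用几行代码示意:对每个样本统计各训练轮次中模型对金标签的置信度均值与波动,再据此分桶(阈值与分桶规则为本文假设的简化版,具体做法以论文为准):

```python
import statistics

# 示意:基于训练动态给样本划分难度(简化版)
# epoch_probs[i] 是第 i 个样本在各训练轮次中对正确标签的预测概率
epoch_probs = {
    "s1": [0.9, 0.95, 0.97],   # 一直高置信,划为 easy
    "s2": [0.2, 0.8, 0.4],     # 置信度波动大,划为 ambiguous
    "s3": [0.1, 0.15, 0.2],    # 一直低置信,划为 hard
}

def difficulty(probs, conf_hi=0.7, conf_lo=0.3, var_hi=0.2):
    mean_conf = statistics.mean(probs)
    variability = statistics.pstdev(probs)
    if variability > var_hi:
        return "ambiguous"
    if mean_conf >= conf_hi:
        return "easy"
    if mean_conf <= conf_lo:
        return "hard"
    return "ambiguous"

levels = {k: difficulty(v) for k, v in epoch_probs.items()}
```

置信度低且稳定的样本往往对应更真实、更难的语言现象,这正是论文用来构造高难度测试子集的信号。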

[NLP-32] One2set Large Language Model: Best Partners for Keyphrase Generation EMNLP2024

【速读】: 该论文试图解决关键短语生成(KPG)中单一模型难以同时兼顾召回率和精确度的问题。解决方案的关键在于引入了一个“生成-选择”框架,将KPG任务分解为两个步骤:首先采用基于one2set范式的生成模型生成候选短语,然后利用大型语言模型(LLM)作为选择器从候选短语中筛选出关键短语。具体改进包括:设计基于最优传输的分配策略以解决训练中的监督信号分配不当问题,并将关键短语选择建模为序列标注任务以减少冗余选择。实验结果表明,该框架在多个基准数据集上显著优于现有最先进模型,特别是在预测缺失关键短语方面表现突出。

链接: https://arxiv.org/abs/2410.03421
作者: Liangying Shao,Liang Zhang,Minlong Peng,Guoqi Ma,Hao Yue,Mingming Sun,Jinsong Su
关键词-EN: aims to automatically, automatically generate, generate a collection, collection of phrases, phrases representing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by EMNLP 2024 Main Conference

点击查看摘要

Abstract:Keyphrase generation (KPG) aims to automatically generate a collection of phrases representing the core concepts of a given document. The dominant paradigms in KPG include one2seq and one2set. Recently, there has been increasing interest in applying large language models (LLMs) to KPG. Our preliminary experiments reveal that it is challenging for a single model to excel in both recall and precision. Further analysis shows that: 1) the one2set paradigm owns the advantage of high recall, but suffers from improper assignments of supervision signals during training; 2) LLMs are powerful in keyphrase selection, but existing selection methods often make redundant selections. Given these observations, we introduce a generate-then-select framework decomposing KPG into two steps, where we adopt a one2set-based model as generator to produce candidates and then use an LLM as selector to select keyphrases from these candidates. Particularly, we make two important improvements on our generator and selector: 1) we design an Optimal Transport-based assignment strategy to address the above improper assignments; 2) we model the keyphrase selection as a sequence labeling task to alleviate redundant selections. Experimental results on multiple benchmark datasets show that our framework significantly surpasses state-of-the-art models, especially in absent keyphrase prediction.
摘要:关键短语生成 (Keyphrase Generation, KPG) 旨在自动生成一组代表给定文档核心概念的短语。KPG 的主导范式包括 one2seq 和 one2set。近年来,将大语言模型 (Large Language Models, LLMs) 应用于 KPG 的兴趣日益增加。我们的初步实验表明,单一模型在召回率和精确度上都表现出色是具有挑战性的。进一步分析显示:1) one2set 范式在召回率方面具有优势,但在训练过程中存在监督信号分配不当的问题;2) LLMs 在关键短语选择方面表现出色,但现有的选择方法往往会产生冗余选择。基于这些观察,我们引入了一个生成-选择框架,将 KPG 分解为两个步骤:首先采用基于 one2set 的模型作为生成器生成候选短语,然后使用 LLM 作为选择器从这些候选短语中选择关键短语。特别地,我们在生成器和选择器上进行了两项重要改进:1) 设计了一种基于最优传输 (Optimal Transport) 的分配策略,以解决上述不当分配问题;2) 将关键短语选择建模为序列标注任务,以减少冗余选择。在多个基准数据集上的实验结果表明,我们的框架显著优于现有最先进的模型,特别是在缺失关键短语预测方面。
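one2set 训练中监督信号分配的关键,是把预测槽位与目标关键短语做全局最优的一一配对,而不是逐个贪心匹配。下面用暴力枚举在一个玩具代价函数下演示"最优分配"的效果(代价函数与数据均为本文虚构;论文中基于最优传输的分配策略远比这复杂):

```python
from itertools import permutations

# 示意:预测槽位与目标关键短语的最优一一分配(暴力枚举,仅适用于小规模)
def toy_cost(a, b):
    """玩具代价:长度差 + 首字符是否相同(假设性设计,非论文的真实代价)。"""
    return abs(len(a) - len(b)) + (0 if a[:1] == b[:1] else 1)

def optimal_assignment(preds, targets):
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(targets))):
        cost = sum(toy_cost(preds[i], targets[j])
                   for i, j in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best, best_cost

preds = ["neural net", "transform", "attention"]
targets = ["attention", "neural network", "transformer"]
assign, cost = optimal_assignment(preds, targets)
```

全局最优解会把 "neural net" 配给 "neural network"、"transform" 配给 "transformer",避免了贪心匹配可能造成的监督信号错配。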

[NLP-33] Surgical Cheap and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation

【速读】: 该论文试图解决语言模型在拒绝行为中的误拒绝问题,即模型不应拒绝安全请求,即使这些请求表面上类似于不安全的请求。解决方案的关键在于通过单向量消融技术,提取并消融误拒绝向量,从而降低误拒绝率,同时不影响模型的安全性和整体能力。该方法无需额外训练,且适用于各种语言模型,为当前及未来模型的误拒绝问题提供了一种简单且有效的解决方案。

链接: https://arxiv.org/abs/2410.03415
作者: Xinpeng Wang,Chengzhi Hu,Paul Röttger,Barbara Plank
关键词-EN: refuse safe requests, give harmful advice, harmless requires careful, follow malicious instructions, superficially resemble unsafe
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Training a language model to be both helpful and harmless requires careful calibration of refusal behaviours: Models should refuse to follow malicious instructions or give harmful advice (e.g. “how do I kill someone?”), but they should not refuse safe requests, even if they superficially resemble unsafe ones (e.g. “how do I kill a Python process?”). Avoiding such false refusal, as prior work has shown, is challenging even for highly-capable language models. In this paper, we propose a simple and surgical method for mitigating false refusal in language models via single vector ablation. For a given model, we extract a false refusal vector and show that ablating this vector reduces false refusal rate without negatively impacting model safety and general model capabilities. We also show that our approach can be used for fine-grained calibration of model safety. Our approach is training-free and model-agnostic, making it useful for mitigating the problem of false refusal in current and future language models.
摘要:训练一个既有益又无害的语言模型需要仔细校准拒绝行为:模型应拒绝执行恶意指令或提供有害建议(例如:“我如何杀人?”),但不应拒绝安全请求,即使这些请求表面上看似不安全(例如:“我如何终止一个 Python 进程?”)。如先前工作所示,避免这种误拒绝对于高能力的语言模型来说也是一项挑战。本文提出了一种简单且精准的方法,通过单向量消融来缓解语言模型中的误拒绝问题。对于给定的模型,我们提取一个误拒绝向量,并证明消融该向量可以降低误拒绝率,同时不会对模型的安全性和整体能力产生负面影响。我们还展示了我们的方法可用于模型安全的细粒度校准。我们的方法无需训练且与模型无关,因此对于缓解当前和未来语言模型中的误拒绝问题具有实用价值。
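单向量消融的数学本质很简单:把激活在"误拒绝方向"上的分量投影掉。下面是一个极简示意(方向向量此处为虚构;论文中该方向需从模型在安全/不安全请求上的激活差异中提取):

```python
import numpy as np

# 示意:单向量消融,从激活中投影掉"误拒绝"方向上的分量
refusal_dir = np.array([1.0, 0, 0, 0, 0, 0, 0, 0])  # 假设的误拒绝方向

def ablate(activation, direction):
    """h <- h - (h·v̂)v̂,消去 direction 方向上的分量,其余分量保持不变。"""
    v = direction / np.linalg.norm(direction)
    return activation - np.dot(activation, v) * v

h = np.array([2.0, 1.0, -1.0, 0.5, 0.0, 3.0, -2.0, 1.5])  # 某层的假设激活
h_ablated = ablate(h, refusal_dir)
```

消融后的激活与误拒绝方向正交,其余维度不受影响,这解释了为什么该方法能在不损害模型整体能力的前提下降低误拒绝率。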

[NLP-34] Team MTS @ AutoMin 2021: An Overview of Existing Summarization Approaches and Comparison to Unsupervised Summarization Techniques INTERSPEECH2021

【速读】: 该论文试图解决自动会议记录(automatic minuting)的问题,特别是在视频或音频会议中自动生成会议摘要。解决方案的关键在于提出了一种基于聚类的无监督摘要技术,并结合了一个适应于实际录音的自动语音识别模块。该技术在自动会议记录任务中表现优异,尤其是在Rouge 1、Rouge 2和Rouge L指标上,分别在开发集上达到了0.21、0.02和0.2的分数,在测试集上也有相应的提升。

链接: https://arxiv.org/abs/2410.03412
作者: Olga Iakovenko,Anna Andreeva,Anna Lapidus,Liana Mikaelyan
关键词-EN: Remote communication, worldwide pandemic, communication through video, video or audio, audio conferences
类目: Computation and Language (cs.CL)
备注: First Shared Task on Automatic Minuting at Interspeech 2021

点击查看摘要

Abstract:Remote communication through video or audio conferences has become more popular than ever because of the worldwide pandemic. These events, therefore, have provoked the development of systems for automatic minuting of spoken language leading to the AutoMin 2021 challenge. The following paper illustrates the results of the research that team MTS has carried out while participating in the Automatic Minutes challenge. In particular, in this paper we analyze existing approaches to text and speech summarization, propose an unsupervised summarization technique based on clustering and provide a pipeline that includes an adapted automatic speech recognition block able to run on real-life recordings. The proposed unsupervised technique outperforms pre-trained summarization models on the automatic minuting task with Rouge 1, Rouge 2 and Rouge L values of 0.21, 0.02 and 0.2 on the dev set, and Rouge 1, Rouge 2, Rouge L, Adequacy, Grammatical correctness and Fluency values of 0.180, 0.035, 0.098, 1.857, 2.304 and 1.911 on the test set, respectively.
摘要:由于全球疫情的影响,通过视频或音频会议进行的远程沟通变得比以往任何时候都更加普遍。因此,这些活动促进了自动会议记录系统的发展,从而催生了 AutoMin 2021 挑战赛。本文展示了团队 MTS 在参与自动会议记录挑战赛期间进行的研究成果。特别地,本文分析了现有的文本和语音摘要方法,提出了一种基于聚类的无监督摘要技术,并提供了一个包含适应性自动语音识别模块的流程,该模块能够在实际录音上运行。所提出的无监督技术在自动会议记录任务中优于预训练的摘要模型,在开发集上的 Rouge 1、Rouge 2 和 Rouge L 值分别为 0.21、0.02 和 0.2,在测试集上的 Rouge 1、Rouge 2、Rouge L、充分性、语法正确性和流畅性值分别为 0.180、0.035、0.098、1.857、2.304、1.911。
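基于聚类的无监督抽取式摘要思路可以用一个玩具实现示意:先把句子映射为向量,聚类后从每个簇中取最接近质心的句子作为摘要句(句向量与 k-means 此处都被极度简化;实际系统会使用真实的句子嵌入模型):

```python
import numpy as np

# 示意:聚类式无监督抽取摘要(玩具实现)
sentences = ["项目进度正常", "下周发布新版本", "预算需要追加", "经费审批流程已启动"]
# 假设的 2 维句向量:前两句谈进度,后两句谈经费
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])

def kmeans_summary(X, k=2, iters=10):
    centroids = X[:k].copy()
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(1)
        for c in range(k):
            if (labels == c).any():
                centroids[c] = X[labels == c].mean(0)
    # 每个簇取最接近质心的句子作为摘要句
    return sorted(int(np.argmin(np.linalg.norm(X - centroids[c], axis=1)))
                  for c in range(k))

summary = [sentences[i] for i in kmeans_summary(X)]
```

这种做法不需要任何标注数据,每个簇贡献一个代表句,天然覆盖会议中的不同话题。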

[NLP-35] Killing Two Flies with One Stone: An Attempt to Break LLMs Using English-Icelandic Idioms and Proper Names

【速读】: 该论文旨在解决英语到冰岛语翻译中习语表达和专有名词翻译的挑战。解决方案的关键在于创建两个测试套件:第一个测试套件评估机器翻译系统对常见英语习语的翻译能力,并测试系统能否区分习语和字面意思相同的短语;第二个测试套件则针对地名翻译,要求系统将地名翻译成冰岛语的外来词(并正确变格),以及处理冰岛语中男女同形异义词的翻译问题。通过这些测试套件,论文揭示了现有机器翻译系统在处理习语和专有名词方面的显著不足,表明有较大的改进空间。

链接: https://arxiv.org/abs/2410.03394
作者: Bjarki Ármannsson,Hinrik Hafsteinsson,Atli Jasonarson,Steinþór Steingrímsson
关键词-EN: Árni Magnússon Institute, Magnússon Institute team, Árni Magnússon, Magnússon Institute, English-Icelandic translation direction
类目: Computation and Language (cs.CL)
备注: WMT24 MT Test Suites subtask. 8 pages, 5 tables

点击查看摘要

Abstract:This paper presents the submission of the Árni Magnússon Institute’s team to the WMT24 test suite subtask, focusing on idiomatic expressions and proper names for the English-Icelandic translation direction. Intuitively and empirically, idioms and proper names are known to be a significant challenge for modern translation models. We create two different test suites. The first evaluates the competency of MT systems in translating common English idiomatic expressions, as well as testing whether systems can distinguish between those expressions and the same phrases when used in a literal context. The second test suite consists of place names that should be translated into their Icelandic exonyms (and correctly inflected) and pairs of Icelandic names that share a surface form between the male and female variants, so that incorrect translations impact meaning as well as readability. The scores reported are relatively low, especially for idiomatic expressions and place names, and indicate considerable room for improvement.
摘要:本文介绍了Árni Magnússon研究所团队提交给WMT24测试套件子任务的内容,重点研究了英语到冰岛语翻译方向中的习语表达和专有名词。从直观和经验上,习语和专有名词被认为是现代翻译模型的一大挑战。我们创建了两个不同的测试套件。第一个评估MT系统在翻译常见英语习语表达方面的能力,并测试系统能否区分习语用法与相同短语的字面用法。第二个测试套件包含应翻译成冰岛语外名的地名(并正确变格),以及男性与女性变体共享同一表面形式的冰岛名字对,使得错误的翻译不仅影响可读性,还会改变意义。报告的分数相对较低,尤其是对于习语表达和地名,表明有相当大的改进空间。

评论:WMT24 MT测试套件子任务。8页,5个表格。主题:计算与语言(cs.CL)。引用为:arXiv:2410.03394 [cs.CL](或arXiv:2410.03394v1 [cs.CL] 用于此版本)。https://doi.org/10.48550/arXiv.2410.03394

[NLP-36] Cogs in a Machine Doing What They're Meant to Do – The AMI Submission to the WMT24 General Translation Task

【速读】: 该论文旨在解决英语到冰岛语的通用翻译任务,其解决方案的关键在于构建了一个包含四个翻译模型和一个语法校正模型的系统。通过精心筛选和过滤训练数据,特别是利用大型语言模型(LLM)生成合成数据,显著提升了系统的翻译能力。

链接: https://arxiv.org/abs/2410.03381
作者: Atli Jasonarson,Hinrik Hafsteinsson,Bjarki Ármannsson,Steinþór Steingrímsson
关键词-EN: Árni Magnusson Institute, Magnusson Institute team, General translation task, Árni Magnusson, Magnusson Institute
类目: Computation and Language (cs.CL)
备注: WMT24 General Translation Task System Description Paper, 10 pages, 1 figure, 6 tables

点击查看摘要

Abstract:This paper presents the submission of the Árni Magnusson Institute’s team to the WMT24 General translation task. We work on the English-Icelandic translation direction. Our system comprises four translation models and a grammar correction model. For training our models we carefully curate our datasets, aggressively filtering out sentence pairs that may detrimentally affect the quality of our system’s output. Some of our data are collected from human translations and some are synthetically generated. A part of the synthetic data is generated using an LLM, and we find that it increases the translation capability of our system significantly.
摘要:本文介绍了Árni Magnusson研究所团队提交给WMT24通用翻译任务的内容。我们专注于英语到冰岛语的翻译方向。我们的系统由四个翻译模型和一个语法校正模型组成。为了训练我们的模型,我们精心策划了数据集,积极过滤掉可能对系统输出质量产生不利影响的句子对。我们的数据部分来自人工翻译,部分是合成生成的。其中一部分合成数据是使用大语言模型 (LLM) 生成的,我们发现这显著提高了我们系统的翻译能力。
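摘要中提到的"积极过滤句对"可以用几条启发式规则示意(阈值为本文假设值;实际系统通常还会结合语言识别、对齐打分等更多信号):

```python
# 示意:平行语料的启发式句对过滤(玩具规则)
pairs = [
    ("Good morning.", "Góðan daginn."),
    ("Hello", "x" * 400),            # 两侧长度比例严重失衡,应被过滤
    ("", "Tómt dæmi"),               # 一侧为空,应被过滤
]

def keep(src, tgt, max_ratio=3.0, max_len=300):
    """按空值、最大长度和长度比例三条规则判断句对是否保留。"""
    if not src or not tgt:
        return False
    if len(src) > max_len or len(tgt) > max_len:
        return False
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    return ratio <= max_ratio

filtered = [p for p in pairs if keep(*p)]
```

这类规则成本极低,却能滤掉大量会拖累翻译质量的噪声句对。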

[NLP-37] Should Cross-Lingual AMR Parsing go Meta? An Empirical Assessment of Meta-Learning and Joint Learning AMR Parsing EMNLP2024

【速读】: 该论文试图解决跨语言抽象语义表示(AMR)解析的问题,即在目标语言的训练数据有限的情况下,如何有效地预测目标语言的AMR图。解决方案的关键在于采用元学习(meta-learning)方法,通过在少量样本(k-shot)场景下进行训练,提升模型在0-shot和k-shot评估中的表现。研究结果表明,元学习模型在0-shot评估中对某些语言表现略优,但在k大于0的情况下,性能提升有限或不存在。

链接: https://arxiv.org/abs/2410.03357
作者: Jeongwoo Kang,Maximin Coavoux,Cédric Lopez,Didier Schwab
关键词-EN: Cross-lingual AMR parsing, predicting AMR graphs, Cross-lingual AMR, AMR training data, AMR parsing
类目: Computation and Language (cs.CL)
备注: to appear in Findings of EMNLP 2024

点击查看摘要

Abstract:Cross-lingual AMR parsing is the task of predicting AMR graphs in a target language when training data is available only in a source language. Due to the small size of AMR training data and evaluation data, cross-lingual AMR parsing has only been explored in a small set of languages such as English, Spanish, German, Chinese, and Italian. Taking inspiration from Langedijk et al. (2022), who apply meta-learning to tackle cross-lingual syntactic parsing, we investigate the use of meta-learning for cross-lingual AMR parsing. We evaluate our models in k-shot scenarios (including 0-shot) and assess their effectiveness in Croatian, Farsi, Korean, Chinese, and French. Notably, Korean and Croatian test sets are developed as part of our work, based on the existing The Little Prince English AMR corpus, and made publicly available. We empirically study our method by comparing it to classical joint learning. Our findings suggest that while the meta-learning model performs slightly better in 0-shot evaluation for certain languages, the performance gain is minimal or absent when k is higher than 0.
摘要:跨语言抽象语义表示 (AMR) 解析的任务是在仅提供源语言训练数据的情况下,预测目标语言的 AMR 图。由于 AMR 训练数据和评估数据规模较小,跨语言 AMR 解析仅在英语、西班牙语、德语、中文和意大利语等少数语言中进行了探索。受到 Langedijk 等人 (2022) 将元学习应用于跨语言句法解析的启发,我们研究了将元学习用于跨语言 AMR 解析的方法。我们在 k-shot 场景(包括零样本)中评估了我们的模型,并评估了它们在克罗地亚语、波斯语、韩语、中文和法语中的有效性。值得注意的是,韩语和克罗地亚语测试集是基于现有的《小王子》英文 AMR 语料库开发的,并作为我们工作的一部分公开发布。我们通过与经典联合学习方法的比较,实证研究了我们的方法。研究结果表明,尽管元学习模型在某些语言的零样本评估中表现略好,但在 k 大于 0 时,性能提升微乎其微或不存在。

[NLP-38] Generating Equivalent Representations of Code By A Self-Reflection Approach

【速读】: 该论文试图解决自动生成代码的等价表示(Equivalent Representations, ERs)的问题,即如何在不改变代码语义的前提下,自动生成自然语言注释、伪代码等文本表示。解决方案的关键在于提出了一种自反方法,通过两个大型语言模型(LLMs)的相互协作,经过一个反思过程来生成ERs。该方法在开放和受限两种设置下生成ERs,分别探讨了无约束和有约束条件下的生成效果,并展示了在不同软件工程任务中的应用潜力。

链接: https://arxiv.org/abs/2410.03351
作者: Jia Li,Ge Li,Lecheng Wang,Hao Zhu,Zhi Jin
关键词-EN: Equivalent Representations, textual representations, ERs, Representations, representations that preserve
类目: Computation and Language (cs.CL); Programming Languages (cs.PL); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Equivalent Representations (ERs) of code are textual representations that preserve the same semantics as the code itself, e.g., natural language comments and pseudocode. ERs play a critical role in software development and maintenance. However, how to automatically generate ERs of code remains an open challenge. In this paper, we propose a self-reflection approach to generating ERs of code. It enables two Large Language Models (LLMs) to work mutually and produce an ER through a reflection process. Depending on whether constraints on ERs are applied, our approach generates ERs in both open and constrained settings. We conduct a empirical study to generate ERs in two settings and obtain eight findings. (1) Generating ERs in the open setting. In the open setting, we allow LLMs to represent code without any constraints, analyzing the resulting ERs and uncovering five key findings. These findings shed light on how LLMs comprehend syntactic structures, APIs, and numerical computations in code. (2) Generating ERs in the constrained setting. In the constrained setting, we impose constraints on ERs, such as natural language comments, pseudocode, and flowcharts. This allows our approach to address a range of software engineering tasks. Based on our experiments, we have three findings demonstrating that our approach can effectively generate ERs that adhere to specific constraints, thus supporting various software engineering tasks. (3) Future directions. We also discuss potential future research directions, such as deriving intermediate languages for code generation, exploring LLM-friendly requirement descriptions, and further supporting software engineering tasks. We believe that this paper will spark discussions in research communities and inspire many follow-up studies.
摘要:代码的等价表示 (Equivalent Representations, ERs) 是保留与代码本身相同语义的文本表示,例如自然语言注释和伪代码。ERs 在软件开发和维护中起着至关重要的作用。然而,如何自动生成代码的 ERs 仍然是一个开放的挑战。本文提出了一种自反思方法来生成代码的 ERs。该方法使两个大语言模型 (Large Language Models, LLMs) 能够相互协作,通过反思过程生成 ERs。根据是否施加 ERs 的约束,我们的方法在开放和受限两种设置下生成 ERs。我们进行了一项实证研究,在两种设置下生成 ERs,并获得了八个发现。(1) 在开放设置下生成 ERs。在开放设置下,我们允许 LLMs 在没有约束的情况下表示代码,分析生成的 ERs 并揭示了五个关键发现。这些发现揭示了 LLMs 如何理解代码中的语法结构、API 和数值计算。(2) 在受限设置下生成 ERs。在受限设置下,我们对 ERs 施加约束,例如自然语言注释、伪代码和流程图。这使得我们的方法能够解决一系列软件工程任务。基于我们的实验,我们有三个发现,表明我们的方法能够有效地生成符合特定约束的 ERs,从而支持各种软件工程任务。(3) 未来方向。我们还讨论了潜在的未来研究方向,例如推导代码生成的中间语言,探索对 LLM 友好的需求描述,以及进一步支持软件工程任务。我们相信,本文将在研究社区中引发讨论,并激发许多后续研究。

[NLP-39] Zero-Shot Fact Verification via Natural Logic and Large Language Models EMNLP2024

【速读】: 该论文试图解决现有事实验证系统依赖大量标注自然逻辑训练数据的问题,提出了一种零样本方法,利用指令微调的大型语言模型的泛化能力。解决方案的关键在于利用这些模型在未见过的数据上进行推理,从而在不依赖特定领域训练数据的情况下,实现对人工和真实世界多语言数据集的有效验证,并在零样本泛化和零样本迁移设置中均表现出色。

链接: https://arxiv.org/abs/2410.03341
作者: Marek Strong,Rami Aly,Andreas Vlachos
关键词-EN: providing faithful justifications, natural logic, set-theoretic operators, providing faithful, faithful justifications
类目: Computation and Language (cs.CL)
备注: Accepted to EMNLP 2024

点击查看摘要

Abstract:The recent development of fact verification systems with natural logic has enhanced their explainability by aligning claims with evidence through set-theoretic operators, providing faithful justifications. Despite these advancements, such systems often rely on a large amount of training data annotated with natural logic. To address this issue, we propose a zero-shot method that utilizes the generalization capabilities of instruction-tuned large language models. To comprehensively assess the zero-shot capabilities of our method and other fact verification systems, we evaluate all models on both artificial and real-world claims, including multilingual datasets. We also compare our method against other fact verification systems in two setups. First, in the zero-shot generalization setup, we demonstrate that our approach outperforms other systems that were not specifically trained on natural logic data, achieving an average accuracy improvement of 8.96 points over the best-performing baseline. Second, in the zero-shot transfer setup, we show that current systems trained on natural logic data do not generalize well to other domains, and our method outperforms these systems across all datasets with real-world claims.
摘要:近年来,基于自然逻辑的事实验证系统的发展通过集合论操作符将声明与证据对齐,增强了其可解释性,提供了忠实的解释。尽管取得了这些进展,此类系统通常依赖于大量带有自然逻辑注释的训练数据。为了解决这一问题,我们提出了一种利用指令微调大语言模型泛化能力的零样本方法。为了全面评估我们的方法和其他事实验证系统的零样本能力,我们在人工和真实世界的声明上评估了所有模型,包括多语言数据集。我们还通过两种设置将我们的方法与其他事实验证系统进行了比较。首先,在零样本泛化设置中,我们展示了我们的方法优于未专门针对自然逻辑数据进行训练的其他系统,平均准确率比表现最佳的基线提高了8.96个百分点。其次,在零样本迁移设置中,我们表明当前基于自然逻辑数据训练的系统在其他领域泛化能力不佳,而我们的方法在所有包含真实世界声明的数据集上均优于这些系统。

[NLP-40] Context and System Fusion in Post-ASR Emotion Recognition with Large Language Models

【速读】: 该论文旨在探索如何利用大型语言模型(LLMs)在语音情感预测任务(GenSEC)中最大化上下文信息和多系统输出的利用效率。解决方案的关键在于三个技术:ASR转录排序、可变对话上下文和系统输出融合。研究发现,对话上下文的效果随其长度增加而递减,且用于选择预测转录的指标至关重要。最终,论文提出的最佳方案在绝对准确率上超越了提供的基线20%。

链接: https://arxiv.org/abs/2410.03312
作者: Pavel Stepachev,Pinzhen Chen,Barry Haddow
关键词-EN: Large language models, Large language, language models, started to play, play a vital
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have started to play a vital role in modelling speech and text. To explore the best use of context and multiple systems’ outputs for post-ASR speech emotion prediction, we study LLM prompting on a recent task named GenSEC. Our techniques include ASR transcript ranking, variable conversation context, and system output fusion. We show that the conversation context has diminishing returns and the metric used to select the transcript for prediction is crucial. Finally, our best submission surpasses the provided baseline by 20% in absolute accuracy.
摘要:大语言模型 (LLMs) 在语音和文本建模中开始发挥重要作用。为了探索在 ASR 后语音情感预测中如何最佳利用上下文和多系统输出,我们研究了在名为 GenSEC 的最新任务上对 LLM 进行提示的方法。我们的技术包括 ASR 转录排名、可变对话上下文和系统输出融合。我们发现,对话上下文的效果逐渐减弱,而用于选择预测转录的指标至关重要。最终,我们的最佳提交在绝对准确率上超越了提供的基线 20%。
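摘要中的两个技术点(转录排名与系统输出融合)可以用一个极简示意说明:按置信度挑选供提示使用的转录,再对多个系统的情感标签做多数投票(数据与字段名均为本文虚构;GenSEC 提交中的具体策略以论文为准):

```python
from collections import Counter

# 示意:多系统输出的排名与融合(玩具数据)
system_outputs = [
    {"transcript": "i am so happy today", "confidence": 0.92, "emotion": "happy"},
    {"transcript": "i am so happy to day", "confidence": 0.71, "emotion": "happy"},
    {"transcript": "i am so hoppy today", "confidence": 0.55, "emotion": "neutral"},
]

# 转录排名:按置信度选出最优转录,供 LLM 提示使用
best_transcript = max(system_outputs, key=lambda o: o["confidence"])["transcript"]

# 输出融合:对各系统的情感标签做多数投票
fused_emotion = Counter(o["emotion"] for o in system_outputs).most_common(1)[0][0]
```

正如摘要所强调的,用于挑选转录的指标本身就是关键设计选择:换一个排名指标,送入 LLM 的上下文就完全不同。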

[NLP-41] Comparing zero-shot self-explanations with human rationales in multilingual text classification

【速读】: 该论文试图解决指令调优的大型语言模型(LLMs)生成的自我解释是否能提供有效解释的问题。解决方案的关键在于通过评估自我解释的合理性和忠实度,将其与人类注释和后验特征归因方法(如层级相关传播LRP)进行比较。研究结果表明,自我解释在更接近人类注释的同时,保持了与LRP相当的忠实度。

链接: https://arxiv.org/abs/2410.03296
作者: Stephanie Brandl,Oliver Eberle
关键词-EN: complex XAI methods, possibly complex XAI, require gradient computations, XAI methods, complex XAI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: preprint

点击查看摘要

Abstract:Instruction-tuned LLMs are able to provide an explanation about their output to users by generating self-explanations that do not require gradient computations or the application of possibly complex XAI methods. In this paper, we analyse whether this ability results in a good explanation by evaluating self-explanations in the form of input rationales with respect to their plausibility to humans as well as their faithfulness to models. For this, we apply two text classification tasks: sentiment classification and forced labour detection. Next to English, we further include Danish and Italian translations of the sentiment classification task and compare self-explanations to human annotations for all samples. To allow for direct comparisons, we also compute post-hoc feature attribution, i.e., layer-wise relevance propagation (LRP) and apply this pipeline to 4 LLMs (Llama2, Llama3, Mistral and Mixtral). Our results show that self-explanations align more closely with human annotations compared to LRP, while maintaining a comparable level of faithfulness.
摘要:指令调优的大语言模型 (LLM) 能够通过生成自我解释来为用户提供对其输出的解释,这些自我解释不需要梯度计算或应用可能复杂的可解释 AI (XAI) 方法。本文中,我们分析了这种能力是否能产生良好的解释,通过评估自我解释在输入理由形式上的合理性以及其对模型的忠实度。为此,我们应用了两个文本分类任务:情感分类和强迫劳动检测。除了英语,我们还进一步包括了丹麦语和意大利语的情感分类任务翻译,并将自我解释与所有样本的人类注释进行比较。为了允许直接比较,我们还计算了事后特征归因,即逐层相关传播 (LRP),并将此流程应用于 4 个 LLM (Llama2, Llama3, Mistral 和 Mixtral)。我们的结果表明,自我解释与人类注释的吻合度比 LRP 更高,同时保持了可比的忠实度水平。

[NLP-42] Five Years of COVID-19 Discourse on Instagram: A Labeled Instagram Dataset of Over Half a Million Posts for Multilingual Sentiment Analysis

【速读】: 该论文试图解决COVID-19相关Instagram帖子的大规模多语言数据集构建及其情感分析问题。解决方案的关键在于:首先,构建了一个包含500,153条COVID-19相关Instagram帖子、涵盖161种语言和535,021个独特标签的多语言数据集;其次,对这些帖子进行了多语言情感分析,将其分类为正面、负面或中性,并将结果作为数据集的一个属性呈现;最后,通过逐年分析和语言特定的情感分析,揭示了自疫情开始以来Instagram上COVID-19相关帖子的情感趋势及其在不同语言间的差异。

链接: https://arxiv.org/abs/2410.03293
作者: Nirmalya Thakur
关键词-EN: Instagram, Instagram posts, sentiment, makes three scientific, scientific contributions
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:The work presented in this paper makes three scientific contributions with a specific focus on mining and analysis of COVID-19-related posts on Instagram. First, it presents a multilingual dataset of 500,153 Instagram posts about COVID-19 published between January 2020 and September 2024. This dataset, available at this https URL, contains Instagram posts in 161 different languages as well as 535,021 distinct hashtags. After the development of this dataset, multilingual sentiment analysis was performed, which involved classifying each post as positive, negative, or neutral. The results of sentiment analysis are presented as a separate attribute in this dataset. Second, it presents the results of performing sentiment analysis per year from 2020 to 2024. The findings revealed the trends in sentiment related to COVID-19 on Instagram since the beginning of the pandemic. For instance, between 2020 and 2024, the sentiment trends show a notable shift, with positive sentiment decreasing from 38.35% to 28.69%, while neutral sentiment rising from 44.19% to 58.34%. Finally, the paper also presents findings of language-specific sentiment analysis. This analysis highlighted similar and contrasting trends of sentiment across posts published in different languages on Instagram. For instance, out of all English posts, 49.68% were positive, 14.84% were negative, and 35.48% were neutral. In contrast, among Hindi posts, 4.40% were positive, 57.04% were negative, and 38.56% were neutral, reflecting distinct differences in the sentiment distribution between these two languages.
摘要:本文的工作在针对 COVID-19 相关 Instagram 帖子的挖掘与分析方面做出了三项科学贡献。首先,本文展示了一个包含 500,153 条关于 COVID-19 的 Instagram 帖子的多语言数据集,这些帖子发布于 2020 年 1 月至 2024 年 9 月之间。该数据集可通过此 https URL 获取,包含 161 种不同语言的 Instagram 帖子以及 535,021 个独特的标签。在构建此数据集后,进行了多语言情感分析,即将每条帖子分类为正面、负面或中性。情感分析的结果作为单独的属性呈现在此数据集中。其次,本文展示了从 2020 年至 2024 年每年进行的情感分析结果。研究发现,自疫情开始以来,Instagram 上与 COVID-19 相关的情感趋势有所变化。例如,2020 年至 2024 年间,情感趋势显示出显著变化,正面情感从 38.35% 下降至 28.69%,而中性情感则从 44.19% 上升至 58.34%。最后,本文还展示了特定语言的情感分析结果。该分析突显了在 Instagram 上以不同语言发布的帖子中情感趋势的相似性和差异性。例如,在所有英文帖子中,49.68% 为正面,14.84% 为负面,35.48% 为中性。相比之下,在印地语帖子中,4.40% 为正面,57.04% 为负面,38.56% 为中性,反映出这两种语言在情感分布上的显著差异。
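论文按年份和语言统计正面/负面/中性帖子的占比,这类分组聚合可以用如下最小示意实现(字段名为假设,数据集真实的属性名以发布版为准):

```python
from collections import Counter

def sentiment_distribution(posts, key):
    """按给定维度(如年份或语言)统计正/负/中性帖子的百分比。
    posts: 形如 {"year": ..., "lang": ..., "sentiment": ...} 的字典列表(字段名为示意)。"""
    groups = {}
    for p in posts:
        groups.setdefault(p[key], Counter())[p["sentiment"]] += 1
    result = {}
    for g, counts in groups.items():
        total = sum(counts.values())
        result[g] = {s: round(100 * c / total, 2) for s, c in counts.items()}
    return result
```

对同一份数据分别传入 `key="year"` 与 `key="lang"`,即可得到正文中那样的逐年与分语言情感分布。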

[NLP-43] What do Large Language Models Need for Machine Translation Evaluation?

【速读】: 该论文试图解决如何利用大型语言模型(LLMs)评估机器翻译(MT)质量的问题,特别是探讨在资源受限和无需训练的情况下,LLMs所需的翻译信息(如源文本、参考译文、翻译错误和标注指南)以及提示技术(如零样本、思维链和少样本提示)对评估效果的影响。解决方案的关键在于发现参考译文对于LLM评估的重要性,以及思维链提示技术对较大模型的显著提升作用。此外,论文还指出LLMs在生成评估时可能不提供数值评分,这对其任务可靠性提出了质疑。

链接: https://arxiv.org/abs/2410.03278
作者: Shenbin Qian,Archchana Sindhujan,Minnie Kabra,Diptesh Kanojia,Constantin Orăsan,Tharindu Ranasinghe,Frédéric Blain
关键词-EN: natural language processing, Leveraging large language, large language models, led to superlative, superlative claims
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Leveraging large language models (LLMs) for various natural language processing tasks has led to superlative claims about their performance. For the evaluation of machine translation (MT), existing research shows that LLMs are able to achieve results comparable to fine-tuned multilingual pre-trained language models. In this paper, we explore what translation information, such as the source, reference, translation errors and annotation guidelines, is needed for LLMs to evaluate MT quality. In addition, we investigate prompting techniques such as zero-shot, Chain of Thought (CoT) and few-shot prompting for eight language pairs covering high-, medium- and low-resource languages, leveraging varying LLM variants. Our findings indicate the importance of reference translations for an LLM-based evaluation. While larger models do not necessarily fare better, they tend to benefit more from CoT prompting, than smaller models. We also observe that LLMs do not always provide a numerical score when generating evaluations, which poses a question on their reliability for the task. Our work presents a comprehensive analysis for resource-constrained and training-less LLM-based evaluation of machine translation. We release the accrued prompt templates, code and data publicly for reproducibility.
摘要:利用大语言模型 (LLMs) 进行各种自然语言处理任务,已经引发了对其性能的极高评价。在机器翻译 (MT) 的评估中,现有研究表明,LLMs 能够达到与经过微调的多语言预训练语言模型相媲美的结果。本文探讨了在 LLMs 评估 MT 质量时,需要哪些翻译信息,如源文本、参考译文、翻译错误及标注指南等。此外,我们还研究了针对八种涵盖高、中、低资源语言的语言对,利用不同 LLM 变体的零样本、思维链 (CoT) 和少样本提示技术。我们的研究发现,参考译文对于基于 LLM 的评估至关重要。尽管较大的模型不一定表现更好,但它们往往比小模型更能从 CoT 提示中受益。我们还观察到,LLMs 在生成评估时并不总是提供数值评分,这对其在该任务中的可靠性提出了疑问。我们的工作为资源受限且无需训练的基于 LLM 的机器翻译评估提供了全面的分析。我们公开发布了积累的提示模板、代码和数据,以供复现。
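论文比较了零样本、少样本以及是否提供参考译文等提示配置。下面是一个构造此类评估提示的简化示意(提示措辞为假设,论文公开的模板见其代码仓库):

```python
def mt_eval_prompt(source, hypothesis, reference=None, examples=()):
    """构造 MT 质量评估提示:examples 为空即零样本;提供 reference 时加入参考译文,
    对应论文发现参考译文对基于 LLM 的评估至关重要。"""
    parts = ["You are a translation quality evaluator. Score 0-100."]
    for ex in examples:  # few-shot 示例,形如 (src, hyp, score)
        parts.append(f"Source: {ex[0]}\nTranslation: {ex[1]}\nScore: {ex[2]}")
    parts.append(f"Source: {source}")
    if reference is not None:
        parts.append(f"Reference: {reference}")
    parts.append(f"Translation: {hypothesis}\nScore:")
    return "\n\n".join(parts)
```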

[NLP-44] A Multi-task Learning Framework for Evaluating Machine Translation of Emotion-loaded User-generated Content

【速读】: 该论文试图解决用户生成内容(UGC)机器翻译中特有的挑战,如处理俚语、情感及文学手法(如反讽和讽刺),并评估这些翻译质量的问题。解决方案的关键在于利用现有的情感相关数据集,扩展其包含情感标签和基于多维质量指标的人工标注翻译错误,并引入句子级评估分数和词级标签,形成适用于句子级和词级翻译评估及情感分类的多任务数据集。论文提出了一种新的架构,通过结合不同的损失启发式(如Nash和Aligned损失)的联合损失函数,实现这些任务的并发执行,从而在UGC的机器翻译评估中达到最先进的性能。

链接: https://arxiv.org/abs/2410.03277
作者: Shenbin Qian,Constantin Orăsan,Diptesh Kanojia,Félix do Carmo
关键词-EN: poses unique challenges, including handling slang, Machine translation, user-generated content, poses unique
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Machine translation (MT) of user-generated content (UGC) poses unique challenges, including handling slang, emotion, and literary devices like irony and sarcasm. Evaluating the quality of these translations is challenging as current metrics do not focus on these ubiquitous features of UGC. To address this issue, we utilize an existing emotion-related dataset that includes emotion labels and human-annotated translation errors based on Multi-dimensional Quality Metrics. We extend it with sentence-level evaluation scores and word-level labels, leading to a dataset suitable for sentence- and word-level translation evaluation and emotion classification, in a multi-task setting. We propose a new architecture to perform these tasks concurrently, with a novel combined loss function, which integrates different loss heuristics, like the Nash and Aligned losses. Our evaluation compares existing fine-tuning and multi-task learning approaches, assessing generalization with ablative experiments over multiple datasets. Our approach achieves state-of-the-art performance and we present a comprehensive analysis for MT evaluation of UGC.
摘要:用户生成内容 (User-Generated Content, UGC) 的机器翻译 (Machine Translation, MT) 面临独特的挑战,包括处理俚语、情感以及诸如讽刺和反讽等文学手法。由于现有评估指标并未关注这些 UGC 的普遍特征,因此评估这些翻译的质量具有挑战性。为解决这一问题,我们利用了一个现有的情感相关数据集,该数据集包含情感标签和基于多维质量指标的人工标注翻译错误。我们通过添加句子级别的评估分数和词级别的标签对其进行了扩展,从而形成了一个适用于句子级别和词级别翻译评估以及情感分类的数据集,适用于多任务设置。我们提出了一种新的架构,以同时执行这些任务,并引入了一种新颖的组合损失函数,该函数整合了不同的损失启发式方法,如 Nash 损失和 Aligned 损失。我们的评估对比了现有的微调和多任务学习方法,通过在多个数据集上的消融实验评估了其泛化能力。我们的方法达到了最先进的性能,并提供了一个全面的分析,用于 UGC 的机器翻译评估。

[NLP-45] Adaptive BPE Tokenization for Enhanced Vocabulary Adaptation in Finetuning Pretrained Language Models EMNLP

【速读】: 该论文试图解决使用Byte-Pair Encoding (BPE) 进行词汇适应时,将目标领域特定词汇简单附加到预训练语言模型(PLM)词汇末尾导致优先级降低和次优分词的问题。解决方案的关键在于提出AdaptBPE方法,通过修改BPE初始化阶段,首先在添加的目标词汇上进行最长字符串匹配,然后再进行字符级分词,从而提高分词效果。实验结果表明,AdaptBPE在分类和摘要任务中分别提升了3.57%的准确率和1.87%的Rouge-L得分,尤其在参考摘要中高OOV浓度或较长的情况下表现更佳。

链接: https://arxiv.org/abs/2410.03258
作者: Gunjan Balde,Soumyadeep Roy,Mainack Mondal,Niloy Ganguly
关键词-EN: pretrained language models, fine-tuning pretrained language, Byte-Pair Encoding, vocabulary adaptation approaches, language models
类目: Computation and Language (cs.CL)
备注: 11 pages. Accepted at EMNLP Findings 2024 (The 2024 Conference on Empirical Methods in Natural Language Processing)

点击查看摘要

Abstract:In this work, we show a fundamental limitation in vocabulary adaptation approaches that use Byte-Pair Encoding (BPE) tokenization scheme for fine-tuning pretrained language models (PLMs) to expert domains. Current approaches trivially append the target domain-specific vocabulary at the end of the PLM vocabulary. This approach leads to a lower priority score and causes sub-optimal tokenization in BPE that iteratively uses merge rules to tokenize a given text. To mitigate this issue, we propose AdaptBPE where the BPE tokenization initialization phase is modified to first perform the longest string matching on the added (target) vocabulary before tokenizing at the character level. We perform an extensive evaluation of AdaptBPE versus the standard BPE over various classification and summarization tasks; AdaptBPE improves by 3.57% (in terms of accuracy) and 1.87% (in terms of Rouge-L), respectively. AdaptBPE for MEDVOC works particularly well when reference summaries have high OOV concentration or are longer in length. We also conduct a human evaluation, revealing that AdaptBPE generates more relevant and more faithful summaries as compared to MEDVOC. We make our codebase publicly available at this https URL.
摘要:在本研究中,我们揭示了使用字节对编码 (Byte-Pair Encoding, BPE) 分词方案对预训练语言模型 (Pretrained Language Models, PLMs) 进行领域专家化微调时,词汇适应方法存在的一个根本性局限。当前的方法简单地将目标领域专用词汇附加在 PLM 词汇表的末尾。这种方法导致较低的优先级评分,并造成 BPE 在迭代使用合并规则对给定文本进行分词时出现次优结果。为解决这一问题,我们提出了 AdaptBPE,该方法在 BPE 分词初始化阶段进行了修改,首先对添加的(目标)词汇进行最长字符串匹配,然后再进行字符级别的分词。我们对 AdaptBPE 与标准 BPE 在多种分类和摘要任务上进行了广泛评估;结果显示,AdaptBPE 在准确性方面提升了 3.57%,在 Rouge-L 指标上提升了 1.87%。AdaptBPE 在 MEDVOC 中的表现尤为出色,特别是在参考摘要中存在高比例的词汇表外词 (Out-Of-Vocabulary, OOV) 或摘要长度较长的情况下。我们还进行了人工评估,结果表明,与 MEDVOC 相比,AdaptBPE 生成的摘要更具相关性和忠实度。我们已将代码库公开发布,详见此 https URL。
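AdaptBPE 修改的是 BPE 初始化阶段:先对新增的目标词表做最长字符串匹配,未命中的部分再退化为字符级切分。其核心思路可以用如下 Python 草图示意(真实实现还需接入后续的 BPE 合并规则,此处省略):

```python
def adaptbpe_init(word, added_vocab):
    """AdaptBPE 初始化阶段的简化示意:先在新增(目标领域)词表上做最长匹配,
    其余位置退化为单字符 Token;标准 BPE 的合并迭代随后照常进行。"""
    tokens, i = [], 0
    while i < len(word):
        match = None
        for j in range(len(word), i, -1):  # 从最长子串开始尝试匹配新增词表
            if word[i:j] in added_vocab:
                match = word[i:j]
                break
        if match:
            tokens.append(match)
            i += len(match)
        else:
            tokens.append(word[i])  # 未命中:字符级切分
            i += 1
    return tokens
```

相比将领域词附加在词表末尾(优先级最低、常被已有合并规则"切碎"),这种初始化保证了整块的领域词优先成为单个 Token。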

[NLP-46] Towards a Benchmark for Large Language Models for Business Process Management Tasks

【速读】: 该论文试图解决在业务流程管理(BPM)领域缺乏针对大型语言模型(LLMs)性能评估基准的问题。解决方案的关键在于系统性地比较不同LLM在四个BPM任务中的表现,重点关注小型开源模型,以识别任务特定的性能差异、比较开源模型与商业模型的有效性,并评估模型大小对BPM任务性能的影响。通过这种方式,论文为组织在BPM领域选择合适的LLM提供了实用指导。

链接: https://arxiv.org/abs/2410.03255
作者: Kiran Busch,Henrik Leopold
关键词-EN: deploying Large Language, Large Language Models, Large Language, deploying Large, Language Models
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:An increasing number of organizations are deploying Large Language Models (LLMs) for a wide range of tasks. Despite their general utility, LLMs are prone to errors, ranging from inaccuracies to hallucinations. To objectively assess the capabilities of existing LLMs, performance benchmarks are conducted. However, these benchmarks often do not translate to more specific real-world tasks. This paper addresses the gap in benchmarking LLM performance in the Business Process Management (BPM) domain. Currently, no BPM-specific benchmarks exist, creating uncertainty about the suitability of different LLMs for BPM tasks. This paper systematically compares LLM performance on four BPM tasks focusing on small open-source models. The analysis aims to identify task-specific performance variations, compare the effectiveness of open-source versus commercial models, and assess the impact of model size on BPM task performance. This paper provides insights into the practical applications of LLMs in BPM, guiding organizations in selecting appropriate models for their specific needs.
摘要:越来越多的组织正在部署大语言模型 (LLM) 用于各种任务。尽管 LLM 具有广泛的实用性,但它们容易出错,从不准确到幻觉现象。为了客观评估现有 LLM 的能力,通常会进行性能基准测试。然而,这些基准测试往往无法转化为更具体的实际任务。本文针对在业务流程管理 (BPM) 领域中 LLM 性能基准测试的空白进行了探讨。目前,尚不存在针对 BPM 的特定基准测试,这使得不同 LLM 在 BPM 任务中的适用性存在不确定性。本文系统地比较了四个 BPM 任务中 LLM 的性能,重点关注小型开源模型。分析旨在识别任务特定的性能差异,比较开源模型与商业模型的有效性,并评估模型大小对 BPM 任务性能的影响。本文为 LLM 在 BPM 中的实际应用提供了见解,指导组织根据其特定需求选择合适的模型。

[NLP-47] Are Expert-Level Language Models Expert-Level Annotators? NEURIPS2024

【速读】: 该论文试图解决的问题是评估大型语言模型(LLMs)在需要专家知识的领域中作为数据标注工具的表现。解决方案的关键在于系统性地评估LLMs在这些高度专业化领域中的标注能力,并从成本效益的角度提出实用建议。这是首次对LLMs作为专家级数据标注工具进行系统性评估的研究。

链接: https://arxiv.org/abs/2410.03254
作者: Yu-Min Tseng,Wei-Lin Chen,Chung-Chi Chen,Hsin-Hsi Chen
关键词-EN: Data annotation refers, relevant information, annotation refers, labeling or tagging, tagging of textual
类目: Computation and Language (cs.CL)
备注: Accepted to WiML @ NeurIPS 2024 (extended version)

点击查看摘要

Abstract:Data annotation refers to the labeling or tagging of textual data with relevant information. A large body of works have reported positive results on leveraging LLMs as an alternative to human annotators. However, existing studies focus on classic NLP tasks, and the extent to which LLMs as data annotators perform in domains requiring expert knowledge remains underexplored. In this work, we investigate comprehensive approaches across three highly specialized domains and discuss practical suggestions from a cost-effectiveness perspective. To the best of our knowledge, we present the first systematic evaluation of LLMs as expert-level data annotators.
摘要:数据标注是指为文本数据添加相关信息的标签或标记。大量研究报告了利用大语言模型 (LLM) 作为人工标注者的替代方案所取得的积极成果。然而,现有研究主要集中在经典自然语言处理 (NLP) 任务上,而大语言模型在需要专家知识的领域中作为数据标注者的表现程度仍未得到充分探索。在本研究中,我们探讨了在三个高度专业化的领域中采用的综合方法,并从成本效益的角度提出了实际建议。据我们所知,我们首次系统地评估了大语言模型作为专家级数据标注者的表现。

[NLP-48] How much can we forget about Data Contamination?

【速读】: 该论文试图解决基准数据泄露到训练数据中对大型语言模型(LLMs)评估能力的影响问题。解决方案的关键在于通过实验和理论分析,量化了模型参数、示例重复次数和训练数据量三个维度上的基准过拟合程度。研究发现,尽管轻微的污染会导致过拟合,但在训练数据量超过Chinchilla比例五倍的情况下,即使污染重复144次也能被遗忘。论文还提出了通过累积权重衰减来估计遗忘过程的理论模型,并实验验证了遗忘速度快于理论预测。这些结果表明,在实际训练过程中,适度的污染可以在训练结束时被遗忘。

链接: https://arxiv.org/abs/2410.03249
作者: Sebastian Bordt,Suraj Srinivas,Valentyn Boreiko,Ulrike von Luxburg
关键词-EN: large language models, evaluating the capabilities, capabilities of large, large language, significant challenge
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The leakage of benchmark data into the training data has emerged as a significant challenge for evaluating the capabilities of large language models (LLMs). In this work, we use experimental evidence and theoretical estimates to challenge the common assumption that small-scale contamination renders benchmark evaluations invalid. First, we experimentally quantify the magnitude of benchmark overfitting based on scaling along three dimensions: The number of model parameters (up to 1.6B), the number of times an example is seen (up to 144), and the number of training tokens (up to 40B). We find that if model and data follow the Chinchilla scaling laws, minor contamination indeed leads to overfitting. At the same time, even 144 times of contamination can be forgotten if the training data is scaled beyond five times Chinchilla, a regime characteristic of many modern LLMs. We then derive a simple theory of example forgetting via cumulative weight decay. It allows us to bound the number of gradient steps required to forget past data for any training run where we know the hyperparameters of AdamW. This indicates that many LLMs, including Llama 3, have forgotten the data seen at the beginning of training. Experimentally, we demonstrate that forgetting occurs faster than what is predicted by our bounds. Taken together, our results suggest that moderate amounts of contamination can be forgotten at the end of realistically scaled training runs.
摘要:基准数据泄露到训练数据中已成为评估大语言模型 (LLM) 能力的一个重大挑战。在这项工作中,我们通过实验证据和理论估计,挑战了小规模污染使基准评估失效的普遍假设。首先,我们通过三个维度的扩展来实验量化基准过拟合的程度:模型参数数量(最高达 1.6B)、示例被查看的次数(最高达 144 次)以及训练 Token 数量(最高达 40B)。我们发现,如果模型和数据遵循 Chinchilla 缩放定律,轻微的污染确实会导致过拟合。同时,即使污染次数达到 144 次,如果训练数据量超过 Chinchilla 的五倍,这种污染也可以被遗忘,这是许多现代 LLM 的特征。接着,我们通过累积权重衰减推导出一个简单的示例遗忘理论。它使我们能够为任何已知 AdamW 超参数的训练运行,确定遗忘过去数据所需的梯度步数上限。这表明,包括 Llama 3 在内的许多 LLM,已经遗忘了训练初期看到的数据。实验上,我们证明遗忘速度比我们预测的上限更快。综合来看,我们的结果表明,在实际规模的训练运行结束时,可以遗忘适量的污染。
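论文通过累积权重衰减刻画遗忘:在 AdamW 的解耦权重衰减下,某次更新留下的权重贡献经 t 步后大约按 (1 - lr·wd)^t 缩减。下面的量级估算示意如何反解"遗忘"所需的步数(论文中的界更精细,这里只是示意):

```python
import math

def steps_to_forget(lr, weight_decay, eps=1e-3):
    """估算使某次更新的残余权重贡献 (1 - lr*wd)^t 低于 eps 所需的梯度步数。
    lr、weight_decay 对应 AdamW 超参数;eps 为假设的"已遗忘"阈值。"""
    decay_per_step = 1.0 - lr * weight_decay
    return math.ceil(math.log(eps) / math.log(decay_per_step))
```

例如 lr=1e-4、weight_decay=0.1 时,每步仅衰减 1e-5,需要数十万步才能把贡献压到千分之一以下;这与"训练数据量远超 Chinchilla 比例时污染可被遗忘"的结论在量级上是一致的。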

[NLP-49] Beyond Film Subtitles: Is YouTube the Best Approximation of Spoken Vocabulary? COLING2025

【速读】: 该论文试图解决在缺乏高质量电影字幕或语音语料库的情况下,如何准确估计多种语言的词频以用于心理语言学研究的问题。解决方案的关键在于利用经过精心处理的YouTube字幕数据,这些数据不仅覆盖了多种语言,而且在词频估计上表现出色,甚至优于现有的最佳资源。通过构建和评估五种不同语言(中文、英语、印尼语、日语和西班牙语)的词频规范,研究证明了YouTube字幕在词汇决策时间、词熟悉度和词汇复杂性预测方面的有效性,尤其是在英语和日语的词汇复杂性预测任务中,基于YouTube字幕词频的简单线性回归模型表现超越了基于电影字幕词频的模型和GPT-4。

链接: https://arxiv.org/abs/2410.03240
作者: Adam Nohejl,Frederikus Hudi,Eunike Andriani Kardinata,Shintaro Ozaki,Maria Angelica Riera Machin,Hongyu Sun,Justin Vasselli,Taro Watanabe
关键词-EN: modeling human familiarity, modeling human, era of large, English, subtitles
类目: Computation and Language (cs.CL)
备注: Submitted for review to COLING 2025. 8 pages, 3 figures

点击查看摘要

Abstract:Word frequency is a key variable in psycholinguistics, useful for modeling human familiarity with words even in the era of large language models (LLMs). Frequency in film subtitles has proved to be a particularly good approximation of everyday language exposure. For many languages, however, film subtitles are not easily available, or are overwhelmingly translated from English. We demonstrate that frequencies extracted from carefully processed YouTube subtitles provide an approximation comparable to, and often better than, the best currently available resources. Moreover, they are available for languages for which a high-quality subtitle or speech corpus does not exist. We use YouTube subtitles to construct frequency norms for five diverse languages, Chinese, English, Indonesian, Japanese, and Spanish, and evaluate their correlation with lexical decision time, word familiarity, and lexical complexity. In addition to being strongly correlated with two psycholinguistic variables, a simple linear regression on the new frequencies achieves a new high score on a lexical complexity prediction task in English and Japanese, surpassing both models trained on film subtitle frequencies and the LLM GPT-4. Our code, the frequency lists, fastText word embeddings, and statistical language models are freely available at this https URL.
摘要:词频是心理语言学中的一个关键变量,即使在大型语言模型 (LLM) 的时代,它对于建模人类对词汇的熟悉度仍然非常有用。电影字幕中的频率已被证明是日常语言接触的良好近似值。然而,对于许多语言来说,电影字幕并不容易获得,或者主要是从英语翻译过来的。我们证明,从精心处理的 YouTube 字幕中提取的频率可以提供与目前最佳资源相当甚至更好的近似值。此外,对于那些没有高质量字幕或语音语料库的语言,这些频率也是可用的。我们使用 YouTube 字幕构建了五种不同语言(中文、英语、印尼语、日语和西班牙语)的频率规范,并评估了它们与词汇决策时间、词汇熟悉度和词汇复杂性的相关性。除了与两种心理语言学变量高度相关外,基于新频率的简单线性回归在英语和日语的词汇复杂性预测任务中达到了新的高分,超过了基于电影字幕频率训练的模型和 LLM GPT-4。我们的代码、频率列表、fastText 词嵌入和统计语言模型均可在此 https URL 免费获取。
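论文构建词频规范并用简单线性回归预测词汇复杂性,这两步都可以用几行 Python 示意(Zipf 尺度是心理语言学中常用的词频换算,回归部分为标准一元最小二乘):

```python
import math
from collections import Counter

def zipf_scale(counts, total):
    """将原始词频换算为 Zipf 值:log10(每十亿词的出现次数)。"""
    return {w: math.log10(c / total * 1e9) for w, c in counts.items()}

def fit_complexity(freqs, scores):
    """对 (词频特征, 复杂度标注) 做一元最小二乘回归,返回 (斜率, 截距);
    示意论文中"基于新频率的简单线性回归"这一基线形式。"""
    n = len(freqs)
    mx, my = sum(freqs) / n, sum(scores) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(freqs, scores)) / \
        sum((x - mx) ** 2 for x in freqs)
    return slope, my - slope * mx
```

预期斜率为负:词频越高,词汇复杂性越低。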

[NLP-50] Showing LLM-Generated Code Selectively Based on Confidence of LLMs

【速读】: 该论文试图解决大型语言模型(LLMs)在代码生成过程中可能产生错误程序的问题,这些问题不仅浪费开发者的时间,还可能引入安全风险。解决方案的关键是提出了HonestCoder,一种基于LLM的代码生成方法,通过估计LLMs对生成代码的置信度来选择性地向开发者展示代码。HonestCoder通过测量LLMs生成代码的多模态相似性来估计置信度,从而有效减少展示给开发者的错误程序数量,同时仅增加轻微的时间开销。

链接: https://arxiv.org/abs/2410.03234
作者: Jia Li,Yuqi Zhu,Yongmin Li,Ge Li,Zhi Jin
关键词-EN: Large Language Models, Large Language, Language Models, programs, shown impressive abilities
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown impressive abilities in code generation, but they may generate erroneous programs. Reading a program takes ten times longer than writing it. Showing these erroneous programs to developers will waste developers’ energies and introduce security risks to software. To address the above limitations, we propose HonestCoder, a novel LLM-based code generation approach. HonestCoder selectively shows the generated programs to developers based on LLMs’ confidence. The confidence provides valuable insights into the correctness of generated programs. To achieve this goal, we propose a novel approach to estimate LLMs’ confidence in code generation. It estimates confidence by measuring the multi-modal similarity between LLMs-generated programs. We collect and release a multilingual benchmark named TruthCodeBench, which consists of 2,265 samples and covers two popular programming languages (i.e., Python and Java). We apply HonestCoder to four popular LLMs (e.g., DeepSeek-Coder and Code Llama) and evaluate it on TruthCodeBench. Based on the experiments, we obtain the following insights. (1) HonestCoder can effectively estimate LLMs’ confidence and accurately determine the correctness of generated programs. For example, HonestCoder outperforms the state-of-the-art baseline by 27.79% in AUROC and 63.74% in AUCPR. (2) HonestCoder can decrease the number of erroneous programs shown to developers. Compared to eight baselines, it can show more correct programs and fewer erroneous programs to developers. (3) Compared to showing code indiscriminately, HonestCoder only adds slight time overhead (approximately 0.4 seconds per requirement). (4) We discuss future directions to facilitate the application of LLMs in software development. We hope this work can motivate broad discussions about measuring the reliability of LLMs’ outputs in performing code-related tasks. 
摘要:大语言模型 (LLMs) 在代码生成方面展现了令人印象深刻的能力,但它们也可能生成错误的程序。阅读一个程序所需的时间是编写它的十倍。向开发者展示这些错误的程序不仅会浪费开发者的时间,还会给软件带来安全风险。为了解决上述问题,我们提出了 HonestCoder,一种基于 LLM 的新型代码生成方法。HonestCoder 根据 LLM 的置信度选择性地向开发者展示生成的程序。置信度为生成程序的正确性提供了宝贵的见解。为了实现这一目标,我们提出了一种新的方法来估计 LLM 在代码生成中的置信度。该方法通过测量 LLM 生成程序的多模态相似性来估计置信度。我们收集并发布了一个名为 TruthCodeBench 的多语言基准测试集,包含 2,265 个样本,涵盖两种流行的编程语言(即 Python 和 Java)。我们将 HonestCoder 应用于四种流行的大语言模型(如 DeepSeek-Coder 和 Code Llama),并在 TruthCodeBench 上对其进行评估。根据实验结果,我们得出以下结论:(1) HonestCoder 能够有效估计 LLM 的置信度,并准确判断生成程序的正确性。例如,HonestCoder 在 AUROC 上比最先进的基线高出 27.79%,在 AUCPR 上高出 63.74%。(2) HonestCoder 能够减少展示给开发者的错误程序数量。与八个基线相比,它能够向开发者展示更多正确的程序和更少的错误程序。(3) 与不加区分地展示代码相比,HonestCoder 仅增加了轻微的时间开销(每个需求约 0.4 秒)。(4) 我们讨论了未来方向,以促进 LLM 在软件开发中的应用。我们希望这项工作能够激发关于测量 LLM 在执行代码相关任务时输出可靠性的广泛讨论。
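"用多次采样结果之间的相似度估计置信度"的思路可以用如下草图示意(论文测量的是多模态相似度,此处用 Token 集合的 Jaccard 系数代替,阈值亦为假设):

```python
from itertools import combinations

def token_similarity(a, b):
    """简化的相似度:两段代码 Token 集合的 Jaccard 系数(仅为示意)。"""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def confidence(programs):
    """对同一需求采样出的多个程序,取两两相似度的平均值作为置信度估计:
    采样结果越一致,模型越可能"确信"自己的答案。"""
    pairs = list(combinations(programs, 2))
    return sum(token_similarity(a, b) for a, b in pairs) / len(pairs)

def should_show(programs, threshold=0.5):
    """仅当置信度超过阈值时才把生成的代码展示给开发者(阈值为假设值)。"""
    return confidence(programs) >= threshold
```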

[NLP-51] ALR2: A Retrieve-then-Reason Framework for Long-context Question Answering

【速读】: 该论文试图解决大型语言模型(LLMs)在处理长上下文时推理能力下降的问题。解决方案的关键在于引入了一种名为ALR²的方法,该方法通过一个显式的两阶段过程,即先检索后推理,来增强LLMs的长上下文推理能力。具体来说,ALR²方法通过明确地将LLMs与检索和推理的目标对齐,从而有效地减少了模型在长上下文任务中因信息过载而导致的错误推理和答案生成问题。实验结果表明,ALR²在长上下文问答基准测试中显著优于其他基线方法,分别在HotpotQA和SQuAD数据集的长上下文版本中取得了至少8.4和7.9的EM(Exact Match)提升。

链接: https://arxiv.org/abs/2410.03227
作者: Huayang Li,Pat Verga,Priyanka Sen,Bowen Yang,Vijay Viswanathan,Patrick Lewis,Taro Watanabe,Yixuan Su
关键词-EN: large language models, recent years, extended significantly, significantly in recent, context window
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The context window of large language models (LLMs) has been extended significantly in recent years. However, while the context length that the LLM can process has grown, the capability of the model to accurately reason over that context degrades noticeably. This occurs because modern LLMs often become overwhelmed by the vast amount of information in the context; when answering questions, the model must identify and reason over relevant evidence sparsely distributed throughout the text. To alleviate the challenge of long-context reasoning, we develop a retrieve-then-reason framework, enabling LLMs to reason over relevant evidence collected during an intermediate retrieval step. We find that modern LLMs struggle to accurately retrieve relevant facts and instead, often hallucinate “retrieved facts”, resulting in flawed reasoning and the production of incorrect answers. To address these issues, we introduce ALR², a method that augments the long-context reasoning capability of LLMs via an explicit two-stage procedure, i.e., aligning LLMs with the objectives of both retrieval and reasoning. We demonstrate the efficacy of ALR² for mitigating performance degradation in long-context reasoning tasks. Through extensive experiments on long-context QA benchmarks, we find our method to outperform competitive baselines by large margins, achieving at least 8.4 and 7.9 EM gains on the long-context versions of HotpotQA and SQuAD datasets, respectively.
摘要:近年来,大语言模型 (LLM) 的上下文窗口得到了显著扩展。然而,尽管 LLM 能够处理更长的上下文长度,但其在这一上下文中的准确推理能力却明显下降。这是因为现代 LLM 常常被上下文中庞大的信息量所淹没;在回答问题时,模型必须识别并推理散布在整个文本中的相关证据。为了缓解长上下文推理的挑战,我们开发了一种“先检索后推理”的框架,使 LLM 能够在中间检索步骤中收集的相关证据上进行推理。我们发现,现代 LLM 在准确检索相关事实方面表现不佳,反而经常产生“幻觉性”的检索事实,导致推理错误和答案不准确。为了解决这些问题,我们引入了 ALR² 方法,通过一个明确的二阶段过程,即对齐 LLM 的检索和推理目标,来增强其长上下文推理能力。我们展示了 ALR² 在缓解长上下文推理任务中性能下降的有效性。通过在长上下文问答基准上的广泛实验,我们发现我们的方法在性能上大幅超越了竞争基线,分别在 HotpotQA 和 SQuAD 数据集的长上下文版本上取得了至少 8.4 和 7.9 的 EM 提升。
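"先检索后推理"的二阶段流程可以用如下草图示意(提示措辞与接口均为假设,`llm(prompt)` 代表任意文本补全接口,测试中用桩函数代替):

```python
def retrieve_then_reason(question, context_chunks, llm):
    """两阶段流程示意:第一阶段让模型显式列出与问题相关的证据片段编号,
    第二阶段仅在这些证据上推理作答,避免被长上下文中的无关信息淹没。"""
    retrieval_prompt = (
        f"Question: {question}\n"
        + "\n".join(f"[{i}] {c}" for i, c in enumerate(context_chunks))
        + "\nList the indices of relevant passages:"
    )
    indices = [int(t) for t in llm(retrieval_prompt).split() if t.isdigit()]
    evidence = [context_chunks[i] for i in indices]
    reason_prompt = (
        "Evidence:\n" + "\n".join(evidence)
        + f"\nQuestion: {question}\nAnswer:"
    )
    return llm(reason_prompt)
```

ALR² 的关键在于进一步用训练把模型对齐到这两个阶段的目标上,而不是仅靠提示;上面只展示了推理时的流程骨架。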

[NLP-52] Frame-Voyager: Learning to Query Frames for Video Large Language Models

【速读】: 该论文试图解决视频大语言模型(Video-LLMs)在处理长视频时因输入token长度限制而导致的性能下降问题。解决方案的关键在于提出了Frame-Voyager,这是一种能够根据任务中的文本查询自动选择信息丰富帧组合的模型。Frame-Voyager通过引入新的数据收集和标注流程,利用预训练的Video-LLM对视频帧组合进行排序,并基于Video-LLM的预测损失来训练模型,使其能够高效地查询出信息密度高的帧组合,从而提升视频理解任务的性能。

链接: https://arxiv.org/abs/2410.03226
作者: Sicheng Yu,Chengkai Jin,Huanyu Wang,Zhenghao Chen,Sheng Jin,Zhongrong Zuo,Xioalei Xu,Zhenbang Sun,Bingni Zhang,Jiawei Wu,Hao Zhang,Qianru Sun
关键词-EN: Large Language Models, Video Large Language, Language Models, Large Language, made remarkable progress
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 19 pages, 10 figures

点击查看摘要

Abstract:Video Large Language Models (Video-LLMs) have made remarkable progress in video understanding tasks. However, they are constrained by the maximum length of input tokens, making it impractical to input entire videos. Existing frame selection approaches, such as uniform frame sampling and text-frame retrieval, fail to account for the information density variations in the videos or the complex instructions in the tasks, leading to sub-optimal performance. In this paper, we propose Frame-Voyager that learns to query informative frame combinations, based on the given textual queries in the task. To train Frame-Voyager, we introduce a new data collection and labeling pipeline, by ranking frame combinations using a pre-trained Video-LLM. Given a video of M frames, we traverse its T-frame combinations, feed them into a Video-LLM, and rank them based on Video-LLM’s prediction losses. Using this ranking as supervision, we train Frame-Voyager to query the frame combinations with lower losses. In experiments, we evaluate Frame-Voyager on four Video Question Answering benchmarks by plugging it into two different Video-LLMs. The experimental results demonstrate that Frame-Voyager achieves impressive results in all settings, highlighting its potential as a plug-and-play solution for Video-LLMs.
摘要:视频大语言模型 (Video-LLMs) 在视频理解任务中取得了显著进展。然而,它们受限于输入 Token 的最大长度,使得输入整个视频变得不切实际。现有的帧选择方法,如均匀帧采样和文本-帧检索,未能考虑到视频中信息密度的变化或任务中复杂的指令,导致性能不佳。在本文中,我们提出了 Frame-Voyager,它能够根据任务中的文本查询学习查询信息丰富的帧组合。为了训练 Frame-Voyager,我们引入了一种新的数据收集和标注流程,通过使用预训练的 Video-LLM 对帧组合进行排序。给定一个包含 M 帧的视频,我们遍历其 T 帧组合,将其输入到 Video-LLM 中,并根据 Video-LLM 的预测损失对其进行排序。利用这种排序作为监督,我们训练 Frame-Voyager 查询损失较低的帧组合。在实验中,我们通过将 Frame-Voyager 插入两个不同的 Video-LLM 中,在四个视频问答基准上对其进行了评估。实验结果表明,Frame-Voyager 在所有设置中均取得了令人印象深刻的结果,突显了其作为 Video-LLMs 即插即用解决方案的潜力。
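论文的数据标注流程是:遍历 M 帧视频的全部 T 帧组合,用 Video-LLM 的预测损失给组合打分并排序,作为训练 Frame-Voyager 的监督。这一步可以用如下草图示意(`loss_fn` 代表 Video-LLM 的打分接口,测试中用假设的损失函数代替):

```python
from itertools import combinations

def rank_frame_combinations(num_frames, t, loss_fn):
    """遍历 num_frames 帧的全部 t 帧组合,按预测损失升序排序:
    损失越低的组合信息量越大,排在越前面。"""
    combos = list(combinations(range(num_frames), t))
    return sorted(combos, key=loss_fn)
```

注意组合数随 M、T 快速增长,论文因此只在标注阶段做这种穷举,推理时由训练好的 Frame-Voyager 直接查询帧组合。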

[NLP-53] Consultation on Industrial Machine Faults with Large language Models

【速读】: 该论文试图解决工业机器故障诊断中传统方法依赖专家知识和特定机器学习模型、适应性有限且需要大量标注数据的问题。解决方案的关键在于利用大型语言模型(LLMs),通过结构化的多轮提示技术,动态生成提示以增强模型从多源数据中综合信息的能力,从而提高故障诊断的准确性和上下文理解能力。实验结果表明,该方法在诊断多种故障类型时达到了91%的准确率,显示出LLMs在革新工业故障咨询实践中的潜力。

链接: https://arxiv.org/abs/2410.03223
作者: Apiradee Boonmee,Kritsada Wongsuwan,Pimchanok Sukjai
关键词-EN: critical component, component of operational, operational efficiency, efficiency and safety, safety in manufacturing
类目: Computation and Language (cs.CL)
备注: 9 pages

点击查看摘要

Abstract:Industrial machine fault diagnosis is a critical component of operational efficiency and safety in manufacturing environments. Traditional methods rely heavily on expert knowledge and specific machine learning models, which can be limited in their adaptability and require extensive labeled data. This paper introduces a novel approach leveraging Large Language Models (LLMs), specifically through a structured multi-round prompting technique, to improve fault diagnosis accuracy. By dynamically crafting prompts, our method enhances the model’s ability to synthesize information from diverse data sources, leading to improved contextual understanding and actionable recommendations. Experimental results demonstrate that our approach outperforms baseline models, achieving an accuracy of 91% in diagnosing various fault types. The findings underscore the potential of LLMs in revolutionizing industrial fault consultation practices, paving the way for more effective maintenance strategies in complex environments.
摘要:工业机器故障诊断是制造环境中运营效率和安全性的关键组成部分。传统方法严重依赖专家知识和特定的机器学习模型,这些方法在适应性方面存在局限,并且需要大量的标注数据。本文介绍了一种利用大语言模型 (LLM) 的新方法,特别是通过结构化的多轮提示技术,来提高故障诊断的准确性。通过动态构建提示,我们的方法增强了模型从多种数据源中综合信息的能力,从而提高了上下文理解和可操作建议的质量。实验结果表明,我们的方法优于基线模型,在诊断各种故障类型时达到了 91% 的准确率。这些发现强调了 LLM 在革新工业故障咨询实践中的潜力,为复杂环境中的更有效维护策略铺平了道路。
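"结构化多轮提示"的大致形态可以用如下草图示意(论文的具体提示模板未公开,此处的轮次结构与措辞均为假设,`llm(prompt)` 为任意文本接口):

```python
def multi_round_diagnosis(observations, llm, rounds=3):
    """多轮提示示意:每一轮把全部观测与上一轮的中间结论拼入提示,
    让模型逐步修正、收敛到故障诊断结论。"""
    conclusion = ""
    for r in range(rounds):
        prompt = (
            f"Round {r + 1}. Known observations:\n"
            + "\n".join(observations)
            + (f"\nPrevious conclusion: {conclusion}" if conclusion else "")
            + "\nRefine the fault diagnosis:"
        )
        conclusion = llm(prompt)
    return conclusion
```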

[NLP-54] NLIP_Lab-IITH Low-Resource MT System for WMT24 Indic MT Shared Task

【速读】: 该论文旨在解决低资源印度语言翻译问题,特别是针对英语与阿萨姆语、卡西语、卢舍语和曼尼普尔语之间的翻译任务。解决方案的关键在于对预训练模型进行微调,利用对齐增强技术来优化嵌入向量的对齐,从而提升翻译质量。论文通过语言特定的微调策略,在官方公开测试集上分别获得了50.6、42.3、54.9和66.3的chrF2分数,展示了其在低资源语言翻译任务中的有效性。此外,论文还探讨了多语言训练和层冻结技术,进一步优化了翻译性能。

链接: https://arxiv.org/abs/2410.03215
作者: Pramit Sahoo,Maharaj Brahma,Maunendra Sankar Desarkar
关键词-EN: Low-Resource Indic Language, Low-Resource Indic, Indic Language Translation, Indic Language
类目: Computation and Language (cs.CL)
备注: WMT2024 INDICMT Shared Task

点击查看摘要

Abstract:In this paper, we describe our system for the WMT 24 shared task of Low-Resource Indic Language Translation. We consider eng↔{as, kha, lus, mni} as participating language pairs. In this shared task, we explore the finetuning of a pre-trained model motivated by the pre-trained objective of aligning embeddings closer by alignment augmentation [Lin et al., 2020] for 22 scheduled Indian languages. Our primary system is based on language-specific finetuning on a pre-trained model. We achieve chrF2 scores of 50.6, 42.3, 54.9, and 66.3 on the official public test set for eng→as, eng→kha, eng→lus, and eng→mni respectively. We also explore multilingual training with/without language grouping and layer-freezing. Our code, models, and generated translations are available here: this https URL.
摘要:本文描述了我们为 WMT 24 低资源印度语言翻译共享任务开发的系统。我们考虑 eng ↔ {as, kha, lus, mni} 作为参与的语言对。在该共享任务中,我们探索了对预训练模型进行微调,其动机源于借助对齐增强 (alignment augmentation) [Lin et al., 2020] 拉近嵌入距离的预训练目标,覆盖 22 种印度宪法附表所列语言。我们的主要系统基于在预训练模型上的特定语言微调。在官方公开测试集上,eng→as、eng→kha、eng→lus、eng→mni 的 chrF2 分数分别为 50.6、42.3、54.9 和 66.3。我们还探索了有无语言分组和层冻结的多语言训练。我们的代码、模型和生成的翻译可在此处获取:this https URL。
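
文中用作评测指标的 chrF2 是基于字符 n-gram 的 F 值,其中 β=2,即召回的权重是精确率的两倍。下面是一个简化实现,仅作原理示意(正式评测请使用 sacreBLEU,其默认 n 取 1 到 6,且空白处理等细节与此不同):

```python
from collections import Counter

def char_ngrams(text, n):
    text = text.replace(" ", "")  # 简化处理:去除空格后统计字符 n-gram
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(candidate, reference, max_n=3, beta=2.0):
    """简化版 chrF:对各阶字符 n-gram 的精确率/召回率取平均,再算 F_beta。"""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        cand, ref = char_ngrams(candidate, n), char_ngrams(reference, n)
        if not cand or not ref:
            continue
        overlap = sum((cand & ref).values())  # 截断计数的交集
        precisions.append(overlap / sum(cand.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

β=2 使得召回率更低的译文受到更重的惩罚,这也是 chrF2 在形态丰富语言上常被采用的原因之一。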

[NLP-55] Learning Semantic Structure through First-Order-Logic Translation EMNLP2024

【速读】: 该论文试图解决基于Transformer的语言模型在提取简单句子中的谓词-论元结构时可能出现的混淆问题。解决方案的关键在于通过两种任务(问答和一阶逻辑翻译)以及两种方法(提示和微调)来探索模型的表现。具体来说,论文通过在一阶逻辑翻译任务上微调大型语言模型,并使用提示方法进行问答任务,发现一阶逻辑翻译更适合于学习谓词-论元结构,从而有效缓解了模型在谓词和对象匹配上的混淆问题。

链接: https://arxiv.org/abs/2410.03203
作者: Akshay Chaturvedi,Nicholas Asher
关键词-EN: simple sentences, transformer-based language models, study whether transformer-based, extract predicate argument, language models
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: EMNLP 2024 Findings

点击查看摘要

Abstract:In this paper, we study whether transformer-based language models can extract predicate argument structure from simple sentences. We firstly show that language models sometimes confuse which predicates apply to which objects. To mitigate this, we explore two tasks: question answering (Q/A), and first order logic (FOL) translation, and two regimes, prompting and finetuning. In FOL translation, we finetune several large language models on synthetic datasets designed to gauge their generalization abilities. For Q/A, we finetune encoder models like BERT and RoBERTa and use prompting for LLMs. The results show that FOL translation for LLMs is better suited to learn predicate argument structure.
摘要:本文研究了基于 Transformer 的语言模型是否能够从简单句子中提取谓词-论元结构。我们首先指出,语言模型有时会混淆哪些谓词适用于哪些对象。为了缓解这一问题,我们探讨了两个任务:问答 (Q/A) 和一阶逻辑 (FOL) 翻译,以及两种方法:提示 (prompting) 和微调 (finetuning)。在一阶逻辑翻译中,我们对多个大语言模型在为评估其泛化能力而设计的合成数据集上进行了微调。对于问答任务,我们对 BERT 和 RoBERTa 等编码器模型进行了微调,并使用提示方法处理大语言模型。结果表明,大语言模型在一阶逻辑翻译任务中更适合学习谓词-论元结构。
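
一阶逻辑翻译任务的目标形式可以用一个极简的规则式例子说明:把"主-谓-宾"简单句映射为谓词-论元形式的原子公式。以下代码仅为示意目标表示长什么样(纯规则、无歧义处理,并非论文所用的模型或数据集):

```python
def to_fol(sentence):
    """将"主-谓-宾"简单句映射为一阶逻辑原子公式,例如
    "John eats apples." -> "eats(John, apples)"。仅为目标表示的示意。"""
    words = sentence.rstrip(".").split()
    subj, verb, obj = words[0], words[1], " ".join(words[2:])
    return f"{verb}({subj}, {obj})"
```

论文中模型要学的正是这种"谓词绑定到正确论元"的结构,而不是靠表面词序猜测。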

[NLP-56] PersoBench: Benchmarking Personalized Response Generation in Large Language Models

【速读】: 该论文试图解决大语言模型(LLMs)在生成个性化对话回复时表现不足的问题。解决方案的关键在于提出了一个新的基准测试PersoBench,用于在零样本设置下评估LLMs在角色感知对话生成中的个性化能力。通过使用多个知名数据集和一系列评估指标,论文分析了LLMs在生成流畅、多样、连贯和个性化回复方面的表现,特别是在标准和链式思维提示方法下。研究发现,尽管LLMs在生成流畅和多样化的回复方面表现出色,但在结合对话上下文和提供角色信息生成个性化和连贯回复方面仍有显著不足。

链接: https://arxiv.org/abs/2410.03198
作者: Saleh Afzoon,Usman Naseem,Amin Beheshti,Zahra Jamali
关键词-EN: large language models, impressive conversational capabilities, exhibited impressive conversational, language models, conversational capabilities
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While large language models (LLMs) have exhibited impressive conversational capabilities, their proficiency in delivering personalized responses remains unclear. Although recent benchmarks automatically evaluate persona consistency in role-playing contexts using LLM-based judgment, the evaluation of personalization in response generation remains underexplored. To address this gap, we present a new benchmark, PersoBench, to evaluate the personalization ability of LLMs in persona-aware dialogue generation within a zero-shot setting. We assess the performance of three open-source and three closed-source LLMs using well-known datasets and a range of metrics. Our analysis, conducted on three well-known persona-aware datasets, evaluates multiple dimensions of response quality, including fluency, diversity, coherence, and personalization, across both standard and chain-of-thought prompting methods. Our findings reveal that while LLMs excel at generating fluent and diverse responses, they are far from satisfactory in delivering personalized and coherent responses considering both the conversation context and the provided personas. Our benchmark implementation is available at this https URL.
摘要:尽管大语言模型 (Large Language Models, LLMs) 在对话能力方面表现出色,但其在提供个性化回复方面的熟练程度仍不明确。虽然最近的基准测试通过基于 LLM 的判断自动评估了角色扮演情境中的角色一致性,但回复生成中的个性化评估仍未得到充分探索。为了填补这一空白,我们提出了一个新的基准测试,PersoBench,用于在零样本设置下评估 LLMs 在角色感知对话生成中的个性化能力。我们使用知名数据集和一系列指标评估了三个开源和三个闭源 LLMs 的性能。我们的分析基于三个知名的角色感知数据集,评估了回复质量的多个维度,包括流畅性、多样性、连贯性和个性化,涵盖了标准提示和思维链提示方法。我们的研究发现,尽管 LLMs 在生成流畅和多样化的回复方面表现优异,但在考虑对话上下文和提供的角色时,生成个性化和连贯的回复方面仍远未达到满意水平。我们的基准测试实现可在以下链接获取:https URL。
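
摘要中提到评估了回复的"多样性"维度,但未给出具体指标。以常见的 distinct-n 为例(此处假设以它近似多样性维度,实际基准可能使用其他指标),其计算方式为不重复 n-gram 数除以 n-gram 总数:

```python
def distinct_n(texts, n=2):
    """distinct-n:不重复 n-gram 数 / n-gram 总数,数值越高代表生成越多样。"""
    ngrams, total = set(), 0
    for t in texts:
        toks = t.split()
        grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
        ngrams.update(grams)
        total += len(grams)
    return len(ngrams) / total if total else 0.0
```

对一组模型回复调用一次即可得到 0 到 1 之间的多样性分数,重复短语越多分数越低。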

[NLP-57] Cross-lingual Transfer for Automatic Question Generation by Learning Interrogative Structures in Target Languages EMNLP2024

【速读】: 该论文试图解决自动问句生成(QG)领域中多语言数据稀缺的问题,特别是针对低资源语言的问句生成。解决方案的关键在于提出了一种跨语言转移的自动问句生成方法(XLT-QG),该方法无需目标语言的单语、平行或标注数据,仅利用一个小型语言模型在英语问答数据集上进行训练,通过学习有限的问句样本中的疑问结构,进而生成目标语言的问句。这种方法在实验中表现优异,生成的合成数据对多语言问答模型的训练也有积极作用。

链接: https://arxiv.org/abs/2410.03197
作者: Seonjeong Hwang,Yunsu Kim,Gary Geunbae Lee
关键词-EN: Automatic question generation, enhancing chatbot systems, developing educational materials, Automatic question, serves a wide
类目: Computation and Language (cs.CL)
备注: EMNLP 2024

点击查看摘要

Abstract:Automatic question generation (QG) serves a wide range of purposes, such as augmenting question-answering (QA) corpora, enhancing chatbot systems, and developing educational materials. Despite its importance, most existing datasets predominantly focus on English, resulting in a considerable gap in data availability for other languages. Cross-lingual transfer for QG (XLT-QG) addresses this limitation by allowing models trained on high-resource language datasets to generate questions in low-resource languages. In this paper, we propose a simple and efficient XLT-QG method that operates without the need for monolingual, parallel, or labeled data in the target language, utilizing a small language model. Our model, trained solely on English QA datasets, learns interrogative structures from a limited set of question exemplars, which are then applied to generate questions in the target language. Experimental results show that our method outperforms several XLT-QG baselines and achieves performance comparable to GPT-3.5-turbo across different languages. Additionally, the synthetic data generated by our model proves beneficial for training multilingual QA models. With significantly fewer parameters than large language models and without requiring additional training for target languages, our approach offers an effective solution for QG and QA tasks across various languages.
摘要:自动问题生成 (Automatic question generation, QG) 在多个领域中具有广泛的应用,如增强问答 (Question-Answering, QA) 语料库、提升聊天机器人系统以及开发教育材料。尽管其重要性不言而喻,但现有的大多数数据集主要集中在英语上,导致其他语言的数据可用性存在显著差距。跨语言问题生成 (Cross-lingual transfer for QG, XLT-QG) 通过允许在高资源语言数据集上训练的模型生成低资源语言的问题,解决了这一限制。本文提出了一种简单且高效的 XLT-QG 方法,该方法无需目标语言的单语、平行或标注数据,而是利用一个小型语言模型进行操作。我们的模型仅在英语 QA 数据集上训练,通过有限的问题样本集学习疑问结构,然后将这些结构应用于生成目标语言的问题。实验结果表明,我们的方法优于多个 XLT-QG 基线,并在不同语言中实现了与 GPT-3.5-turbo 相当的性能。此外,我们的模型生成的合成数据证明对训练多语言 QA 模型有益。与大语言模型相比,我们的方法参数显著减少,且无需为目标语言进行额外训练,为各种语言的 QG 和 QA 任务提供了有效的解决方案。

[NLP-58] Parallel Corpus Augmentation using Masked Language Models

【速读】: 该论文试图解决平行文本语料库数据稀缺的问题,提出了一种新颖的方法来扩充平行文本语料库,且无需额外的单语语料库。解决方案的关键在于使用多语言掩码语言模型(Multi-Lingual Masked Language Model)来在上下文中掩码并预测替代词,并通过句子嵌入(Sentence Embeddings)来筛选出可能是互译的句子对。此外,该方法通过机器翻译质量评估指标进行交叉验证,确保扩充语料库的质量。这种方法能够显著缓解拥有合理种子语料库的语言对的数据稀缺问题。

链接: https://arxiv.org/abs/2410.03194
作者: Vibhuti Kumari,Narayana Murthy Kavi
关键词-EN: augmenting parallel text, parallel text corpora, fold larger corpora, promises good quality, paper we propose
类目: Computation and Language (cs.CL)
备注: 21 Pages, 3 Figures. arXiv admin note: text overlap with arXiv:2011.01536 by other authors

点击查看摘要

Abstract:In this paper we propose a novel method of augmenting parallel text corpora which promises good quality and is also capable of producing many fold larger corpora than the seed corpus we start with. We do not need any additional monolingual corpora. We use Multi-Lingual Masked Language Model to mask and predict alternative words in context and we use Sentence Embeddings to check and select sentence pairs which are likely to be translations of each other. We cross check our method using metrics for MT Quality Estimation. We believe this method can greatly alleviate the data scarcity problem for all language pairs for which a reasonable seed corpus is available.
摘要:本文提出了一种新颖的方法,用于扩充平行文本语料库,该方法不仅能保证高质量,还能生成比初始种子语料库大数倍的语料库。我们不需要任何额外的单语语料库。我们使用多语言掩码语言模型 (Multi-Lingual Masked Language Model) 来掩码并预测上下文中的替代词,并使用句子嵌入 (Sentence Embeddings) 来检查和选择可能互为翻译的句子对。我们通过机器翻译质量评估指标 (MT Quality Estimation) 来交叉验证我们的方法。我们相信,这种方法可以极大地缓解那些拥有合理种子语料库的语言对的数据稀缺问题。
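
其核心流程可示意如下:逐个掩码句中的词,用掩码语言模型预测的替代词生成新句。这里用一个固定词表充当"MLM"的预测结果,仅为流程示意(实际方法使用多语言 MLM,例如 XLM-R 的 fill-mask,并进一步用句子嵌入的相似度过滤译文对):

```python
# 用固定词表充当"掩码语言模型"的预测结果,仅为流程示意
TOY_MLM = {"big": ["large", "huge"], "house": ["home", "building"]}

def augment(sentence, max_new=10):
    """逐词掩码并用"MLM"给出的替代词替换,返回扩充出的候选句列表。"""
    words = sentence.split()
    variants = []
    for i, w in enumerate(words):
        for alt in TOY_MLM.get(w, []):
            variants.append(" ".join(words[:i] + [alt] + words[i + 1:]))
    return variants[:max_new]
```

对平行句对的两侧分别做此操作,再用句子嵌入筛除不再互为翻译的组合,即可得到数倍于种子语料的扩充语料。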

[NLP-59] Generating bilingual example sentences with large language models as lexicography assistants

【速读】: 该论文试图解决大语言模型(LLMs)在为不同资源丰富度的语言生成双语词典例句时的性能差异问题。解决方案的关键在于通过上下文学习(in-context learning)来调整LLMs以符合个体注释者的偏好,从而提高生成例句的质量。此外,论文还探讨了使用预训练语言模型进行自动化评分的可行性,发现句子困惑度(sentence perplexity)在资源丰富的语言中可以有效衡量例句的典型性和可理解性。通过这些方法,论文旨在降低词典编纂工作的成本,特别是对于资源匮乏的语言。

链接: https://arxiv.org/abs/2410.03182
作者: Raphael Merx,Ekaterina Vylomova,Kemal Kurniawan
关键词-EN: varying resource levels, resource levels, bilingual dictionaries, varying resource, Good Dictionary
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present a study of LLMs’ performance in generating and rating example sentences for bilingual dictionaries across languages with varying resource levels: French (high-resource), Indonesian (mid-resource), and Tetun (low-resource), with English as the target language. We evaluate the quality of LLM-generated examples against the GDEX (Good Dictionary EXample) criteria: typicality, informativeness, and intelligibility. Our findings reveal that while LLMs can generate reasonably good dictionary examples, their performance degrades significantly for lower-resourced languages. We also observe high variability in human preferences for example quality, reflected in low inter-annotator agreement rates. To address this, we demonstrate that in-context learning can successfully align LLMs with individual annotator preferences. Additionally, we explore the use of pre-trained language models for automated rating of examples, finding that sentence perplexity serves as a good proxy for typicality and intelligibility in higher-resourced languages. Our study also contributes a novel dataset of 600 ratings for LLM-generated sentence pairs, and provides insights into the potential of LLMs in reducing the cost of lexicographic work, particularly for low-resource languages.
摘要:我们研究了大语言模型 (LLM) 在生成和评价双语词典例句方面的表现,涉及不同资源水平的语言:法语 (高资源)、印尼语 (中资源) 和德顿语 (低资源),以英语为目标语言。我们根据 GDEX (Good Dictionary EXample) 标准评估大语言模型生成的例句质量:典型性、信息性和可理解性。我们的研究发现,尽管大语言模型能够生成合理的词典例句,但在资源较少的语言中,其表现显著下降。我们还观察到人类对例句质量的偏好存在高度变异性,表现为较低的注释者间一致率。为解决这一问题,我们展示了上下文学习如何成功地将大语言模型与个别注释者的偏好对齐。此外,我们探索了使用预训练语言模型进行例句自动评级的可能性,发现句子困惑度在高资源语言中是典型性和可理解性的良好代理。我们的研究还贡献了一个包含 600 个评分的全新数据集,用于大语言模型生成的句子对,并提供了关于大语言模型在降低词典编纂工作成本,特别是低资源语言方面潜力的见解。
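
文中用作典型性与可理解性代理的句子困惑度,可由逐 Token 概率直接计算:困惑度等于平均负对数似然的指数。一个自包含的小例子(Token 概率为假设值,实际应取自预训练语言模型):

```python
import math

def perplexity(token_probs):
    """困惑度 = exp(平均负对数似然)。token_probs 为模型对句中各 Token 的概率。"""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)
```

直观上,若模型对每个 Token 都给出 1/4 的概率,困惑度恰为 4;困惑度越低,句子对模型而言越"典型"。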

[NLP-60] Kiss up Kick down: Exploring Behavioral Changes in Multi-modal Large Language Models with Assigned Visual Personas EMNLP2024

【速读】: 该论文试图解决多模态大语言模型(LLMs)是否能根据视觉形象调整其行为的问题,这是现有文献中主要关注文本形象的一个显著空白。解决方案的关键在于开发了一个包含5000张虚构角色图像的新数据集,用于将视觉形象分配给LLMs,并通过分析这些图像中的视觉特征(特别是攻击性)来研究LLMs的谈判行为。结果表明,LLMs能够像人类一样评估图像的攻击性,并在面对更具攻击性的视觉形象时表现出更激进的谈判行为,而在面对较不具攻击性的对手形象时则表现出较不激进的行为。

链接: https://arxiv.org/abs/2410.03181
作者: Seungjong Sun,Eungu Lee,Seo Yeon Baek,Seunghyun Hwang,Wonbyung Lee,Dongyan Nan,Bernard J. Jansen,Jang Hyun Kim
关键词-EN: large language models, multi-modal large language, aggressive negotiation behaviors, language models, addressing a significant
类目: Computation and Language (cs.CL)
备注: EMNLP 2024

点击查看摘要

Abstract:This study is the first to explore whether multi-modal large language models (LLMs) can align their behaviors with visual personas, addressing a significant gap in the literature that predominantly focuses on text-based personas. We developed a novel dataset of 5K fictional avatar images for assignment as visual personas to LLMs, and analyzed their negotiation behaviors based on the visual traits depicted in these images, with a particular focus on aggressiveness. The results indicate that LLMs assess the aggressiveness of images in a manner similar to humans and output more aggressive negotiation behaviors when prompted with an aggressive visual persona. Interestingly, the LLM exhibited more aggressive negotiation behaviors when the opponent's image appeared less aggressive than their own, and less aggressive behaviors when the opponent's image appeared more aggressive.
摘要:本研究首次探讨了多模态大语言模型 (LLM) 是否能够与其视觉角色对齐,填补了当前文献主要关注基于文本的角色这一显著空白。我们开发了一个包含 5,000 张虚构头像图像的新数据集,用于为 LLM 分配视觉角色,并基于这些图像所描绘的视觉特征分析其谈判行为,特别关注于攻击性。结果表明,LLM 在评估图像的攻击性方面与人类相似,并且在提示使用攻击性视觉角色时输出更具攻击性的谈判行为。有趣的是,当对手的图像显得比自己的更不具攻击性时,LLM 表现出更强的攻击性谈判行为;而当对手的图像显得更具攻击性时,LLM 则表现出较少的攻击性行为。

[NLP-61] Autoregressive Large Language Models are Computationally Universal

【速读】: 该论文试图解决的问题是验证基于Transformer的语言模型在无需外部干预或权重修改的情况下,通过自回归解码实现通用计算的能力。解决方案的关键在于提出了一种广义的自回归解码方法,其中生成的token被附加到序列的末尾,随着上下文窗口的推进而扩展。通过将这种系统与经典的Lag系统(已知具有计算通用性)对应,并利用新的证明方法,论文展示了通用图灵机可以通过具有2027条生成规则的Lag系统模拟。进一步,论文验证了现有的gemini-1.5-pro-001语言模型能够通过单一系统提示,在确定性(贪婪)解码下正确应用这些规则,从而证明了该模型在扩展自回归解码下具备通用计算能力。

链接: https://arxiv.org/abs/2410.03170
作者: Dale Schuurmans,Hanjun Dai,Francesco Zanini
关键词-EN: transformer-based language model, language model, realize universal computation, Lag system, external intervention
类目: Computation and Language (cs.CL)
备注: 32 pages

点击查看摘要

Abstract:We show that autoregressive decoding of a transformer-based language model can realize universal computation, without external intervention or modification of the model’s weights. Establishing this result requires understanding how a language model can process arbitrarily long inputs using a bounded context. For this purpose, we consider a generalization of autoregressive decoding where, given a long input, emitted tokens are appended to the end of the sequence as the context window advances. We first show that the resulting system corresponds to a classical model of computation, a Lag system, that has long been known to be computationally universal. By leveraging a new proof, we show that a universal Turing machine can be simulated by a Lag system with 2027 production rules. We then investigate whether an existing large language model can simulate the behaviour of such a universal Lag system. We give an affirmative answer by showing that a single system-prompt can be developed for gemini-1.5-pro-001 that drives the model, under deterministic (greedy) decoding, to correctly apply each of the 2027 production rules. We conclude that, by the Church-Turing thesis, prompted gemini-1.5-pro-001 with extended autoregressive (greedy) decoding is a general purpose computer.
摘要:我们展示了基于 Transformer 的语言模型的自回归解码可以实现通用计算,而无需外部干预或修改模型的权重。要确立这一结果,需要理解语言模型如何使用有限的上下文处理任意长度的输入。为此,我们考虑了一种自回归解码的泛化方法,其中在给定长输入的情况下,随着上下文窗口的推进,生成的 Token 会被附加到序列的末尾。我们首先证明了由此产生的系统对应于一个经典的计算模型——Lag 系统,该系统早已被证明具有计算通用性。通过利用一个新的证明,我们展示了可以用具有 2027 条生成规则的 Lag 系统模拟通用图灵机。接着,我们探讨了现有的一个大语言模型是否能够模拟这种通用 Lag 系统的行为。我们给出了肯定的答案,表明可以为 gemini-1.5-pro-001 开发一个单一的系统提示,在确定性(贪婪)解码下,驱动模型正确应用这 2027 条生成规则中的每一条。我们得出结论,根据 Church-Turing 论题,经过扩展自回归(贪婪)解码的 gemini-1.5-pro-001 是一个通用计算机。
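
Lag 系统的一种常见表述是:每一步读取当前词长度为 k 的前缀,按规则把对应的产生式附加到词尾,然后删除词首符号,这与"生成的 Token 附加到序列末尾、上下文窗口前移"的解码方式恰好同构。下面是一个玩具模拟器(规则集为随意构造的示例,与论文的 2027 条规则无关):

```python
def lag_step(word, rules, k=2):
    """Lag 系统的一步:查长度为 k 的前缀,把对应产生式附加到词尾,再删去首符号。"""
    production = rules[word[:k]]
    return (word + production)[1:]

# 随意构造的玩具规则集:该系统恰好以周期 4 循环
RULES = {"aa": "b", "ab": "a", "ba": "", "bb": "ab"}

word = "aab"
history = [word]
for _ in range(4):
    word = lag_step(word, RULES)
    history.append(word)
# 演化轨迹:aab -> abb -> bba -> baab -> aab
```

论文中的系统提示所做的,正是让 LLM 在贪婪解码下可靠地扮演 `lag_step` 这一角色。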

[NLP-62] Can Watermarked LLMs be Identified by Users via Crafted Prompts?

【速读】: 该论文试图解决大语言模型(LLM)中水印技术的不可感知性问题。当前的水印技术在检测性和鲁棒性方面表现良好,但在实际应用中,水印的存在可能被用户察觉,从而影响服务使用和增加水印被攻击的风险。论文提出的解决方案之关键是设计了一种名为Water-Probe的识别算法,通过精心设计的提示词来检测LLM输出中的水印。该算法利用了当前水印技术在相同水印密钥下产生的输出差异的一致性,从而识别出水印的存在。此外,论文还提出了Water-Bag策略,通过合并多个水印密钥来增加水印密钥选择的随机性,从而显著提高水印的不可感知性。

链接: https://arxiv.org/abs/2410.03168
作者: Aiwei Liu,Sheng Guan,Yiming Liu,Leyi Pan,Yifei Zhang,Liancheng Fang,Lijie Wen,Philip S. Yu,Xuming Hu
关键词-EN: Large Language Models, Language Models, Large Language, made significant progress, detecting LLM outputs
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: 25 pages, 5 figures, 8 tables

点击查看摘要

Abstract:Text watermarking for Large Language Models (LLMs) has made significant progress in detecting LLM outputs and preventing misuse. Current watermarking techniques offer high detectability, minimal impact on text quality, and robustness to text editing. However, current researches lack investigation into the imperceptibility of watermarking techniques in LLM services. This is crucial as LLM providers may not want to disclose the presence of watermarks in real-world scenarios, as it could reduce user willingness to use the service and make watermarks more vulnerable to attacks. This work is the first to investigate the imperceptibility of watermarked LLMs. We design an identification algorithm called Water-Probe that detects watermarks through well-designed prompts to the LLM. Our key motivation is that current watermarked LLMs expose consistent biases under the same watermark key, resulting in similar differences across prompts under different watermark keys. Experiments show that almost all mainstream watermarking algorithms are easily identified with our well-designed prompts, while Water-Probe demonstrates a minimal false positive rate for non-watermarked LLMs. Finally, we propose that the key to enhancing the imperceptibility of watermarked LLMs is to increase the randomness of watermark key selection. Based on this, we introduce the Water-Bag strategy, which significantly improves watermark imperceptibility by merging multiple watermark keys.
摘要:大语言模型 (LLM) 的文本水印技术在检测 LLM 输出和防止滥用方面取得了显著进展。当前的水印技术提供了高检测性、对文本质量的最小影响以及对文本编辑的鲁棒性。然而,现有研究缺乏对水印技术在 LLM 服务中不可感知性的探讨。这一点至关重要,因为 LLM 提供者可能不希望在实际场景中披露水印的存在,因为这可能会降低用户使用服务的意愿,并使水印更容易受到攻击。本研究首次探讨了水印 LLM 的不可感知性。我们设计了一种名为 Water-Probe 的识别算法,通过向 LLM 提供精心设计的提示来检测水印。我们的主要动机是,当前的水印 LLM 在相同的 watermark key 下表现出一致的偏差,导致在不同 watermark key 下提示之间的差异相似。实验表明,几乎所有主流的水印算法都能通过我们精心设计的提示轻松识别,而 Water-Probe 对非水印 LLM 表现出极低的误报率。最后,我们提出,增强水印 LLM 不可感知性的关键是增加水印 key 选择的随机性。基于此,我们引入了 Water-Bag 策略,通过合并多个水印 key 显著提高了水印的不可感知性。
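
"相同水印密钥下的一致偏差"可借 KGW 式绿名单水印来理解:词表按(密钥、前一 Token)的哈希划分为绿/红两部分,带水印文本的绿 Token 比例会系统性偏高,可用 z 分数检验。以下为检测侧的简化示意(与论文提出的基于提示的 Water-Probe 方法不同,仅用于说明偏差的来源):

```python
import hashlib
import math

def green_fraction(tokens, key, gamma=0.5):
    """统计相邻 Token 对中落入"绿名单"的比例:绿/红划分由密钥与前一 Token 的哈希决定。"""
    hits = 0
    for prev, tok in zip(tokens, tokens[1:]):
        digest = hashlib.sha256(f"{key}:{prev}:{tok}".encode()).digest()
        if digest[0] < 256 * gamma:  # 该 Token 在此密钥下为"绿"
            hits += 1
    return hits / (len(tokens) - 1)

def z_score(frac, n, gamma=0.5):
    """检验绿 Token 比例是否显著高于无水印时的期望值 gamma。"""
    return (frac - gamma) * math.sqrt(n / (gamma * (1 - gamma)))
```

无水印文本的绿比例应接近 gamma(z 分数约为 0),而带水印文本在正确密钥下会显著偏高;论文的 Water-Bag 策略通过合并多个密钥稀释这种偏差。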

[NLP-63] Exploring Learnability in Memory-Augmented Recurrent Neural Networks: Precision Stability and Empirical Insights

【速读】: 该论文试图解决无记忆和增强记忆的递归神经网络(RNNs)在处理长序列时难以泛化的问题。解决方案的关键在于冻结记忆组件,这不仅显著提升了模型在Penn Treebank数据集上的表现(测试困惑度从123.5降至120.5),还使得模型在长序列上的性能保持率从60%的下降减少到90%的保持。理论分析表明,冻结记忆有助于稳定时间依赖性,从而实现更稳健的收敛,强调了稳定记忆设计和长序列评估在理解RNNs学习能力极限方面的重要性。

链接: https://arxiv.org/abs/2410.03154
作者: Shrabon Das,Ankur Mali
关键词-EN: Pushdown Automata, equivalent to Pushdown, study explores, memory-less and memory-augmented, theoretically equivalent
类目: Computation and Language (cs.CL)
备注: 21 pages, 4 theorems, 5 tables

点击查看摘要

Abstract:This study explores the learnability of memory-less and memory-augmented RNNs, which are theoretically equivalent to Pushdown Automata. Empirical results show that these models often fail to generalize on longer sequences, relying more on precision than mastering symbolic grammar. Experiments on fully trained and component-frozen models reveal that freezing the memory component significantly improves performance, achieving state-of-the-art results on the Penn Treebank dataset (test perplexity reduced from 123.5 to 120.5). Models with frozen memory retained up to 90% of initial performance on longer sequences, compared to a 60% drop in standard models. Theoretical analysis suggests that freezing memory stabilizes temporal dependencies, leading to robust convergence. These findings stress the need for stable memory designs and long-sequence evaluations to understand RNNs true learnability limits.
摘要:本研究探讨了无记忆和增强记忆的递归神经网络 (RNN) 的可学习性,这些网络在理论上等价于下推自动机。实证结果表明,这些模型在处理更长的序列时往往难以泛化,更多依赖于精确性而非掌握符号语法。对完全训练和组件冻结模型的实验显示,冻结记忆组件显著提升了性能,在 Penn Treebank 数据集上达到了最先进的结果(测试困惑度从 123.5 降低到 120.5)。与标准模型相比,冻结记忆的模型在更长序列上保留了高达 90% 的初始性能,而标准模型的性能下降了 60%。理论分析表明,冻结记忆稳定了时间依赖性,从而实现了稳健的收敛。这些发现强调了稳定记忆设计和长序列评估的必要性,以理解 RNN 真正的学习能力极限。
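
"冻结记忆组件"在训练上等价于跳过对应参数组的梯度更新。一个与具体框架无关的极简示意(参数与梯度均为假设的标量;在 PyTorch 等框架中通常通过将相应参数的 `requires_grad` 置为 False 实现):

```python
def sgd_step(params, grads, lr=0.1, frozen=frozenset()):
    """一步 SGD:frozen 中列出的参数组不更新,对应"冻结记忆组件"。"""
    return {name: (value if name in frozen else value - lr * grads[name])
            for name, value in params.items()}

params = {"memory": 1.0, "controller": 1.0}
grads = {"memory": 0.5, "controller": 0.5}
new_params = sgd_step(params, grads, frozen={"memory"})
```

冻结后记忆参数在训练全程保持初始值,控制器参数照常更新,这正是论文中稳定时间依赖性的来源。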

[NLP-64] Media Framing through the Lens of Event-Centric Narratives EMNLP2024

【速读】: 该论文试图解决新闻报道中的框架效应问题,即如何解释和分析新闻文章中对同一现象的不同解读。解决方案的关键在于提出一个框架,该框架能够提取事件及其与其他事件的关系,并将这些事件分组为高层次的叙事,从而帮助解释新闻文章中的框架效应。通过这种方法,研究者能够分析美国新闻中关于移民和枪支管制两个领域的框架效应。

链接: https://arxiv.org/abs/2410.03151
作者: Rohan Das,Aditya Chandra,I-Ta Lee,Maria Leonor Pacheco
关键词-EN: communications perspective, defines the packaging, frame defines, encourage certain interpretations, Abstract
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注: Accepted to the 6th Workshop on Narrative Understanding, co-located with EMNLP 2024

点击查看摘要

Abstract:From a communications perspective, a frame defines the packaging of the language used in such a way as to encourage certain interpretations and to discourage others. For example, a news article can frame immigration as either a boost or a drain on the economy, and thus communicate very different interpretations of the same phenomenon. In this work, we argue that to explain framing devices we have to look at the way narratives are constructed. As a first step in this direction, we propose a framework that extracts events and their relations to other events, and groups them into high-level narratives that help explain frames in news articles. We show that our framework can be used to analyze framing in U.S. news for two different domains: immigration and gun control.
摘要:从传播学的角度来看,框架定义了语言的包装方式,以鼓励某些解读并抑制其他解读。例如,新闻文章可以将移民框架为对经济的促进或拖累,从而传达对同一现象的截然不同的解读。在这项工作中,我们认为要解释框架机制,必须考察叙事结构的构建方式。作为这一方向的第一步,我们提出了一种框架,该框架提取事件及其与其他事件的关系,并将它们分组为高级叙事,这些叙事有助于解释新闻文章中的框架。我们展示了我们的框架可以用于分析美国新闻中两个不同领域的框架:移民和枪支管制。

[NLP-65] Analysis and Detection of Differences in Spoken User Behaviors between Autonomous and Wizard-of-Oz Systems

【速读】: 该论文旨在研究用户在与远程操作机器人和自主对话系统交互时的行为差异,特别是在专注倾听和求职面试对话场景中的表现。解决方案的关键在于通过分析用户的言语行为指标(如言语长度、语速、填充词、反馈词、不流畅性和笑声)来区分这两种交互条件,并开发预测模型以提高区分准确性和精确度,最终实现比基线模型更高的F1分数。

链接: https://arxiv.org/abs/2410.03147
作者: Mikey Elmers,Koji Inoue,Divesh Lala,Keiko Ochi,Tatsuya Kawahara
关键词-EN: Japanese human-robot interactions, study examined users’, examined users’ behavioral, users’ behavioral differences, corpus of Japanese
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注: Accepted and will be presented at the 27th conference of the Oriental COCOSDA (O-COCOSDA 2024)

点击查看摘要

Abstract:This study examined users’ behavioral differences in a large corpus of Japanese human-robot interactions, comparing interactions between a tele-operated robot and an autonomous dialogue system. We analyzed user spoken behaviors in both attentive listening and job interview dialogue scenarios. Results revealed significant differences in metrics such as speech length, speaking rate, fillers, backchannels, disfluencies, and laughter between operator-controlled and autonomous conditions. Furthermore, we developed predictive models to distinguish between operator and autonomous system conditions. Our models demonstrated higher accuracy and precision compared to the baseline model, with several models also achieving a higher F1 score than the baseline.
摘要:本研究在大规模日本人与机器人互动语料库中,考察了用户在与远程操作机器人和自主对话系统互动时的行为差异。我们分析了用户在专注倾听和求职面试对话场景中的口语行为。结果显示,在操作员控制和自主系统条件下,用户的言语长度、语速、填充词、反馈词、不流畅性和笑声等指标存在显著差异。此外,我们开发了预测模型,以区分操作员和自主系统条件。与基线模型相比,我们的模型在准确性和精确度上表现更优,其中多个模型在 F1 分数上也超过了基线模型。

[NLP-66] Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback EMNLP2024

【速读】: 该论文试图解决现有对齐技术在大语言模型(LLMs)中使用简单二元标签(如成对偏好中的首选输出)无法捕捉成对输出之间细微质量差异的问题。解决方案的关键是引入了一种名为Margin Matching Preference Optimization (MMPO)的方法,该方法通过在优化过程中引入相对质量边际,设计基于Bradley-Terry模型的软目标概率,并使用标准交叉熵目标来训练模型,从而显著提升LLM策略和奖励模型的性能。实验结果表明,MMPO在多个基准测试中均优于基线方法,并且在某些情况下表现尤为突出,显示出其在提高模型性能和鲁棒性方面的显著优势。

链接: https://arxiv.org/abs/2410.03145
作者: Kyuyoung Kim,Ah Jeong Seo,Hao Liu,Jinwoo Shin,Kimin Lee
关键词-EN: Large language models, Large language, fine-tuned with alignment, alignment techniques, systems to date
类目: Computation and Language (cs.CL)
备注: EMNLP 2024 Findings

点击查看摘要

Abstract:Large language models (LLMs) fine-tuned with alignment techniques, such as reinforcement learning from human feedback, have been instrumental in developing some of the most capable AI systems to date. Despite their success, existing methods typically rely on simple binary labels, such as those indicating preferred outputs in pairwise preferences, which fail to capture the subtle differences in relative quality between pairs. To address this limitation, we introduce an approach called Margin Matching Preference Optimization (MMPO), which incorporates relative quality margins into optimization, leading to improved LLM policies and reward models. Specifically, given quality margins in pairwise preferences, we design soft target probabilities based on the Bradley-Terry model, which are then used to train models with the standard cross-entropy objective. Experiments with both human and AI feedback data demonstrate that MMPO consistently outperforms baseline methods, often by a substantial margin, on popular benchmarks including MT-bench and RewardBench. Notably, the 7B model trained with MMPO achieves state-of-the-art performance on RewardBench as of June 2024, outperforming other models of the same scale. Our analysis also shows that MMPO is more robust to overfitting, leading to better-calibrated models.
摘要:通过对齐技术(如基于人类反馈的强化学习)进行微调的大语言模型 (LLMs) 在开发迄今为止一些最强大的 AI 系统中发挥了关键作用。尽管这些方法取得了成功,但现有方法通常依赖于简单的二元标签,例如在成对偏好中指示首选输出的标签,这些标签无法捕捉成对之间相对质量的细微差异。为了解决这一局限性,我们提出了一种名为 Margin Matching Preference Optimization (MMPO) 的方法,该方法将相对质量边际纳入优化过程,从而改进了 LLM 策略和奖励模型。具体而言,在给定成对偏好中的质量边际的情况下,我们基于 Bradley-Terry 模型设计了软目标概率,然后使用标准的交叉熵目标来训练模型。在人类和 AI 反馈数据上的实验表明,MMPO 在包括 MT-bench 和 RewardBench 在内的流行基准测试中始终优于基线方法,通常有显著的优势。值得注意的是,截至 2024 年 6 月,使用 MMPO 训练的 7B 模型在 RewardBench 上达到了最先进的性能,超过了同规模的其它模型。我们的分析还表明,MMPO 对过拟合更为稳健,从而产生了更好的校准模型。
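
基于 Bradley-Terry 模型的软目标可以写成:给定质量边际 m,软目标概率为 σ(m),再对模型的成对 logit 差(即两个回复的奖励之差)做标准交叉熵。以下为这一思想的简化实现(具体的参数化与缩放以论文为准):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mmpo_loss(reward_chosen, reward_rejected, quality_margin):
    """用质量边际构造 Bradley-Terry 软目标,再对成对 logit 差做交叉熵。"""
    soft_target = sigmoid(quality_margin)          # P(chosen 优于 rejected) 的软标签
    p = sigmoid(reward_chosen - reward_rejected)   # 模型给出的成对概率
    return -(soft_target * math.log(p) + (1 - soft_target) * math.log(1 - p))
```

当边际为 0 时软目标退化为 0.5,模型不会被强行推向任何一侧;边际越大,损失越接近传统的硬标签成对损失。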

[NLP-67] In-context Learning in Presence of Spurious Correlations

【速读】: 该论文试图解决的是在分类任务中,传统上下文学习方法容易受到虚假特征影响的问题。解决方案的关键在于提出一种新的训练技术,通过在多样化的合成上下文学习实例数据集上进行训练,使得模型能够在未见过的任务上表现出良好的泛化能力,同时在与强方法如ERM和GroupDRO的对比中展现出相当的性能。

链接: https://arxiv.org/abs/2410.03140
作者: Hrayr Harutyunyan,Rafayel Darbinyan,Samvel Karapetyan,Hrant Khachatrian
关键词-EN: Large language models, Large language, language models exhibit, in-context, exhibit a remarkable
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models exhibit a remarkable capacity for in-context learning, where they learn to solve tasks given a few examples. Recent work has shown that transformers can be trained to perform simple regression tasks in-context. This work explores the possibility of training an in-context learner for classification tasks involving spurious features. We find that the conventional approach of training in-context learners is susceptible to spurious features. Moreover, when the meta-training dataset includes instances of only one task, the conventional approach leads to task memorization and fails to produce a model that leverages context for predictions. Based on these observations, we propose a novel technique to train such a learner for a given classification task. Remarkably, this in-context learner matches and sometimes outperforms strong methods like ERM and GroupDRO. However, unlike these algorithms, it does not generalize well to other tasks. We show that it is possible to obtain an in-context learner that generalizes to unseen tasks by training on a diverse dataset of synthetic in-context learning instances.
摘要:大语言模型展现出了显著的上下文学习能力,即它们能够在给定少量示例的情况下学习解决任务。最近的研究表明,Transformer 可以在上下文中训练以执行简单的回归任务。本文探讨了在涉及虚假特征的分类任务中训练上下文学习者的可能性。我们发现,传统的上下文学习者训练方法容易受到虚假特征的影响。此外,当元训练数据集仅包含一个任务的实例时,传统方法会导致任务记忆化,无法生成一个利用上下文进行预测的模型。基于这些观察,我们提出了一种新颖的技术来训练针对特定分类任务的上下文学习者。值得注意的是,这种上下文学习者在性能上与 ERM 和 GroupDRO 等强方法相当,有时甚至超越它们。然而,与这些算法不同,它对其他任务的泛化能力较差。我们展示了通过在多样化的合成上下文学习实例数据集上进行训练,可以获得一个能够泛化到未见任务的上下文学习者。

[NLP-68] SAG: Style-Aligned Article Generation via Model Collaboration

【速读】: 该论文试图解决大型语言模型(LLMs)在个性化和风格化内容生成中的优化限制问题,以及小型语言模型(SLMs)在复杂指令理解和能力迁移方面的不足。解决方案的关键在于提出了一种新的协作训练框架,通过冻结LLMs以利用其强大的指令跟随能力,并使用风格特定的数据对SLM进行监督微调,同时引入自改进方法以增强风格一致性。该方法在NoteBench基准测试中表现出色,显著提升了ROUGE-L和BLEU-4评分,同时保持了较低的幻觉率。

链接: https://arxiv.org/abs/2410.03137
作者: Chenning Xu,Fangxun Shu,Dian Jin,Jinghao Wei,Hao Jiang
关键词-EN: Large language models, stylish content generation, Large language, increased the demand, demand for personalized
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have increased the demand for personalized and stylish content generation. However, closed-source models like GPT-4 present limitations in optimization opportunities, while the substantial training costs and inflexibility of open-source alternatives, such as Qwen-72B, pose considerable challenges. Conversely, small language models (SLMs) struggle with understanding complex instructions and transferring learned capabilities to new contexts, often exhibiting more pronounced limitations. In this paper, we present a novel collaborative training framework that leverages the strengths of both LLMs and SLMs for style article generation, surpassing the performance of either model alone. We freeze the LLMs to harness their robust instruction-following capabilities and subsequently apply supervised fine-tuning on the SLM using style-specific data. Additionally, we introduce a self-improvement method to enhance style consistency. Our new benchmark, NoteBench, thoroughly evaluates style-aligned generation. Extensive experiments show that our approach achieves state-of-the-art performance, with improvements of 0.78 in ROUGE-L and 0.55 in BLEU-4 scores compared to GPT-4, while maintaining a low hallucination rate regarding factual and faithfulness.
摘要:大语言模型 (LLM) 的兴起推动了对个性化和风格化内容生成的需求。然而,像 GPT-4 这样的闭源模型在优化机会上存在局限性,而开源替代方案如 Qwen-72B 的高昂训练成本和不灵活性也带来了显著挑战。相比之下,小型语言模型 (SLM) 在理解和执行复杂指令以及将学习能力迁移到新情境方面表现不佳,通常显示出更为明显的局限性。本文提出了一种新颖的协作训练框架,该框架结合了 LLM 和 SLM 的优势,用于风格化文章生成,超越了单一模型的性能。我们冻结 LLM 以利用其强大的指令跟随能力,随后对 SLM 进行风格特定数据的监督微调。此外,我们引入了一种自我改进方法以增强风格一致性。我们新的基准测试 NoteBench 全面评估了风格对齐的生成效果。大量实验表明,我们的方法在 ROUGE-L 和 BLEU-4 评分上分别比 GPT-4 提高了 0.78 和 0.55,同时在事实性和忠实度方面保持了较低的幻觉率。
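
评测中用到的 ROUGE-L 基于候选与参考之间的最长公共子序列 (LCS)。下面是一个简化实现(单参考、词级切分;β 的取值在不同实现中有差异,此处取偏向召回的常见设置,仅作原理示意):

```python
def lcs_length(a, b):
    """两个 Token 列表的最长公共子序列长度(动态规划)。"""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference, beta=1.2):
    """简化版 ROUGE-L:由 LCS 导出精确率与召回率,再合成 F_beta。"""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    p, rec = lcs / len(c), lcs / len(r)
    return (1 + beta ** 2) * p * rec / (rec + beta ** 2 * p)
```

与基于 n-gram 的 BLEU 不同,LCS 不要求匹配词连续,因此 ROUGE-L 对语序变化更宽容。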

[NLP-69] Deliberate Reasoning for LLMs as Structure-aware Planning with Accurate World Model

【速读】: 该论文试图解决大型语言模型(LLMs)在复杂、多步骤决策任务中推理能力不足的问题。解决方案的关键在于提出了一个名为Structure-aware Planning with Accurate World Model(SWAP)的新型多步骤推理框架。SWAP通过引入结构化信息和世界模型来指导推理过程,并采用生成器-判别器架构来提高世界状态预测的准确性,从而克服了传统基于Chain-of-Thought(CoT)方法的局限性。此外,SWAP通过多样性建模(DBM)和对比排序(CR)技术,增强了生成多样性和判别准确性,显著提升了LLMs在多种推理任务中的表现。

链接: https://arxiv.org/abs/2410.03136
作者: Siheng Xiong,Ali Payani,Yuan Yang,Faramarz Fekri
关键词-EN: large language models, world model, Accurate World Model, remains a key, SWAP
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Enhancing the reasoning capabilities of large language models (LLMs) remains a key challenge, especially for tasks that require complex, multi-step decision-making. Humans excel at these tasks by leveraging deliberate planning with an internal world model to simulate the potential outcomes of various actions. Inspired by this, we propose a novel multi-step reasoning framework for LLMs, referred to as Structure-aware Planning with Accurate World Model (SWAP). Unlike previous approaches that rely solely on Chain-of-Thought (CoT) reasoning in natural language, SWAP incorporates structural information to guide the reasoning process via a world model and provides a soft verification mechanism over the steps. Moreover, SWAP overcomes the challenge of accurate world state predictions in complex reasoning tasks by introducing a Generator-Discriminator architecture, which enables more reliable world modeling. Specifically, the generator predicts the next state, and the discriminator ensures alignment with the logical consistency required by the problem context. SWAP also encourages the policy model to explore a broad range of potential actions to prevent premature convergence. By resolving the bottlenecks of generation diversity for both actions and states using diversity-based modeling (DBM) and improving discrimination accuracy through contrastive ranking (CR), SWAP significantly enhances the reasoning performance of LLMs. We evaluate SWAP across diverse reasoning-intensive benchmarks including math reasoning, logical reasoning, and coding tasks. Extensive experiments demonstrate that SWAP achieves substantial improvements over the baselines and consistently outperforms existing LLMs of similar sizes.
摘要:提升大语言模型 (LLM) 的推理能力仍然是一个关键挑战,尤其是在需要复杂、多步骤决策的任务中。人类在这些任务中表现出色,通过利用内部世界模型进行深思熟虑的规划来模拟各种行动的潜在结果。受此启发,我们提出了一种新颖的多步骤推理框架,称为结构感知规划与精确世界模型 (SWAP)。与以往仅依赖自然语言中的思维链 (CoT) 推理的方法不同,SWAP 通过世界模型引入结构信息来指导推理过程,并提供对步骤的软验证机制。此外,SWAP 通过引入生成器-判别器架构克服了复杂推理任务中准确预测世界状态的挑战,从而实现了更可靠的世界建模。具体而言,生成器预测下一状态,判别器确保与问题上下文所需的逻辑一致性相符。SWAP 还鼓励策略模型探索广泛的潜在行动,以防止过早收敛。通过使用基于多样性的建模 (DBM) 解决行动和状态生成多样性的瓶颈,并通过对比排序 (CR) 提高判别准确性,SWAP 显著提升了 LLM 的推理性能。我们在包括数学推理、逻辑推理和编码任务在内的多样化推理密集型基准上评估了 SWAP。大量实验表明,SWAP 在基线基础上取得了显著改进,并持续优于现有同规模的 LLM。
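作为补充,下面用一段极简的 Python 骨架示意 SWAP 中“生成器提出候选下一状态、判别器按上下文一致性打分并择优”的一步(假设性实现:论文中 generator 与 discriminator 均由 LLM 充当,此处用字符串拼接与词重叠计数作占位,仅示意控制流程,并非论文官方代码):

```python
def generator(state, action, n_candidates=3):
    """为 (状态, 动作) 提出若干候选下一状态(这里用编号拼接模拟采样多样性)。"""
    return [f"{state} -> {action} (candidate {i})" for i in range(n_candidates)]

def discriminator(candidate, context):
    """对候选状态与问题上下文的一致性打分(这里以词重叠数作占位评分)。"""
    return len(set(candidate.split()) & set(context.split()))

def swap_step(state, action, context):
    """一步结构感知规划:生成候选 -> 判别打分 -> 取最优候选作为新的世界状态。"""
    candidates = generator(state, action)
    return max(candidates, key=lambda c: discriminator(c, context))

context = "solve x: x + 2 = 5 -> subtract 2"
next_state = swap_step("x + 2 = 5", "subtract 2", context)
```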

[NLP-70] AIME: AI System Optimization via Multiple LLM Evaluators

【速读】: 该论文试图解决单一大型语言模型(LLM)在复杂任务(如代码生成)中评估不准确的问题。解决方案的关键在于提出了一种基于多LLM评估者(AIME)的优化方法。AIME通过多个LLM分别独立评估不同标准,然后将这些评估结果进行组合,从而提高错误检测率和任务成功率。实验证明,AIME在代码生成任务中显著优于单一LLM评估方法,错误检测率提高了62%,成功率提高了16%。此外,论文还强调了评估者数量和评估标准选择的重要性,这些因素可以影响任务成功率高达12%。

链接: https://arxiv.org/abs/2410.03131
作者: Bhrij Patel,Souradip Chakraborty,Wesley A. Suttle,Mengdi Wang,Amrit Singh Bedi,Dinesh Manocha
关键词-EN: feedback loop scheme, optimization typically involves, Text-based AI system, current output, iteration output
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 21 pages, 10 Figures, 4 Tables

点击查看摘要

Abstract:Text-based AI system optimization typically involves a feedback loop scheme where a single LLM generates an evaluation in natural language of the current output to improve the next iteration’s output. However, in this work, we empirically demonstrate that for a practical and complex task (code generation) with multiple criteria to evaluate, utilizing only one LLM evaluator tends to let errors in generated code go undetected, thus leading to incorrect evaluations and ultimately suboptimal test case performance. Motivated by this failure case, we assume there exists an optimal evaluation policy that samples an evaluation between response and ground truth. We then theoretically prove that a linear combination of multiple evaluators can approximate this optimal policy. From this insight, we propose AI system optimization via Multiple LLM Evaluators (AIME). AIME is an evaluation protocol that utilizes multiple LLMs that each independently generate an evaluation on separate criteria and then combine them via concatenation. We provide an extensive empirical study showing AIME outperforming baseline methods in code generation tasks, with up to 62% higher error detection rate and up to 16% higher success rate than a single LLM evaluation protocol on LeetCodeHard and HumanEval datasets. We also show that the selection of the number of evaluators and which criteria to utilize is non-trivial as it can impact success rate by up to 12%.
摘要:基于文本的 AI 系统优化通常采用反馈循环方案,其中单一的大语言模型 (LLM) 生成对当前输出的自然语言评估,以改进下一轮迭代的输出。然而,在本研究中,我们通过实证证明,对于具有多个评估标准的实际且复杂的任务(如代码生成),仅使用一个 LLM 评估器往往会导致生成的代码错误未被检测到,从而导致错误的评估结果,最终影响测试用例的性能表现。受此失败案例的启发,我们假设存在一种最优的评估策略,该策略在响应与真实值之间进行采样评估。随后,我们从理论上证明了多个评估器的线性组合可以逼近这种最优策略。基于这一洞察,我们提出了通过多 LLM 评估器 (AIME) 进行 AI 系统优化。AIME 是一种评估协议,它利用多个 LLM,每个 LLM 独立地对不同的标准生成评估,然后通过串联的方式将这些评估结果结合起来。我们提供了一项广泛的实证研究,结果显示 AIME 在代码生成任务中优于基线方法,在 LeetCodeHard 和 HumanEval 数据集上,错误检测率提高了高达 62%,成功率提高了高达 16%,相较于单一 LLM 评估协议。我们还展示了评估器数量和所使用标准的选择并非无关紧要,因为它们可能会影响高达 12% 的成功率。
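AIME“各评估器独立评估单一标准、再组合结果”的思路可以用如下玩具代码示意(假设性实现:这里用两个规则函数代替 LLM 评估器,标准与权重均为举例,拼接式反馈对应论文中的 concatenation):

```python
def eval_correctness(code_output, expected):
    """标准一:输出是否与期望一致(占位 LLM 评估器)。"""
    return 1.0 if code_output == expected else 0.0

def eval_style(code_text):
    """标准二:是否包含文档字符串(占位 LLM 评估器)。"""
    return 1.0 if '"""' in code_text else 0.0

def aime_evaluate(code_text, code_output, expected, weights=(0.7, 0.3)):
    """每个评估器独立评估一个标准,再按权重线性组合,并拼接各自反馈。"""
    scores = [eval_correctness(code_output, expected), eval_style(code_text)]
    combined = sum(w * s for w, s in zip(weights, scores))
    feedback = " | ".join(f"criterion{i}: {s}" for i, s in enumerate(scores))
    return combined, feedback

score, fb = aime_evaluate('def f():\n    """doc"""\n    return 1', 1, 1)
```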

[NLP-71] ARB-LLM: Alternating Refined Binarizations for Large Language Models

【速读】: 该论文试图解决大语言模型(LLMs)在实际部署中因高内存和计算需求而受限的问题。解决方案的关键在于提出了一种名为ARB-LLM的新型1-bit后训练量化(PTQ)技术,通过交替精炼二值化(ARB)算法逐步更新二值化参数,显著减少量化误差,并结合列组位图(CGB)优化权重分区策略,从而有效缩小二值化权重与全精度权重之间的分布差距,并考虑了LLM权重分布中的列偏差问题。最终,ARB-LLM_RC 成为首个超越相同规模FP16模型的二值化PTQ方法。

链接: https://arxiv.org/abs/2410.03129
作者: Zhiteng Li,Xianglong Yan,Tianao Zhang,Haotong Qin,Dong Xie,Jiang Tian,zhongchao shi,Linghe Kong,Yulun Zhang,Xiaokang Yang
关键词-EN: Large Language Models, natural language processing, Large Language, hinder practical deployment, greatly pushed forward
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: The code and models will be available at this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) have greatly pushed forward advancements in natural language processing, yet their high memory and computational demands hinder practical deployment. Binarization, as an effective compression technique, can shrink model weights to just 1 bit, significantly reducing the high demands on computation and memory. However, current binarization methods struggle to narrow the distribution gap between binarized and full-precision weights, while also overlooking the column deviation in LLM weight distribution. To tackle these issues, we propose ARB-LLM, a novel 1-bit post-training quantization (PTQ) technique tailored for LLMs. To narrow the distribution shift between binarized and full-precision weights, we first design an alternating refined binarization (ARB) algorithm to progressively update the binarization parameters, which significantly reduces the quantization error. Moreover, considering the pivot role of calibration data and the column deviation in LLM weights, we further extend ARB to ARB-X and ARB-RC. In addition, we refine the weight partition strategy with column-group bitmap (CGB), which further enhances performance. Equipping ARB-X and ARB-RC with CGB, we obtain ARB-LLM_X and ARB-LLM_RC respectively, which significantly outperform state-of-the-art (SOTA) binarization methods for LLMs. As a binary PTQ method, our ARB-LLM_RC is the first to surpass FP16 models of the same size. The code and models will be available at this https URL.
摘要:大语言模型 (LLM) 在自然语言处理领域的进步中发挥了重要作用,但其对内存和计算资源的高需求限制了实际部署。二值化作为一种有效的压缩技术,可以将模型权重压缩至仅 1 比特,显著降低对计算和内存的需求。然而,当前的二值化方法难以缩小二值化权重与全精度权重之间的分布差距,同时忽略了 LLM 权重分布中的列偏差。为解决这些问题,我们提出了 ARB-LLM,一种专为 LLM 设计的新型 1 比特后训练量化 (PTQ) 技术。为缩小二值化权重与全精度权重之间的分布偏移,我们首先设计了一种交替精炼二值化 (ARB) 算法,通过逐步更新二值化参数,显著减少了量化误差。此外,考虑到校准数据的关键作用和 LLM 权重中的列偏差,我们将 ARB 进一步扩展为 ARB-X 和 ARB-RC。同时,我们通过列组位图 (CGB) 优化了权重划分策略,进一步提升了性能。结合 ARB-X 和 ARB-RC 与 CGB,我们分别得到了 ARB-LLM_X 和 ARB-LLM_RC,它们在性能上显著超越了当前最先进的 (SOTA) LLM 二值化方法。作为二值化 PTQ 方法,我们的 ARB-LLM_RC 首次在相同尺寸下超越了 FP16 模型。代码和模型将在以下链接提供:[https URL]。
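交替精炼二值化的核心循环可以用 NumPy 写成如下玩具版本(假设性实现:仅演示 W ≈ alpha·B + mu 的逐行交替更新,未包含论文中基于校准数据的 ARB-X、处理行列偏差的 ARB-RC 以及 CGB 分组策略):

```python
import numpy as np

def arb_binarize(W, iters=10):
    """交替更新 mu(逐行偏移)、B(1-bit 符号矩阵)、alpha(逐行缩放),逐步减小量化误差。"""
    mu = W.mean(axis=1, keepdims=True)
    for _ in range(iters):
        R = W - mu                                        # 去偏残差
        B = np.where(R >= 0, 1.0, -1.0)                   # 二值权重
        alpha = np.abs(R).mean(axis=1, keepdims=True)     # 固定 B 时的逐行最优缩放
        mu = (W - alpha * B).mean(axis=1, keepdims=True)  # 基于当前二值化结果更新偏移
    return alpha, B, mu

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 64))
alpha, B, mu = arb_binarize(W)
err = np.abs(W - (alpha * B + mu)).mean()  # 量化误差应明显小于权重本身的平均幅度
```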

[NLP-72] On Unsupervised Prompt Learning for Classification with Black-box Language Models

【速读】: 该论文试图解决在无标签数据情况下对黑箱大型语言模型(LLMs)进行微调的问题。解决方案的关键在于提出了一种无监督的提示学习方法,通过学习提示本身和无标签数据的伪标签来实现分类任务。具体来说,提示被建模为一系列离散的令牌,每个令牌具有待学习的类别分布;同时,利用LLMs的上下文学习(ICL)能力,首先识别出可靠的伪标签数据,然后根据提示为其他无标签数据分配伪标签,这些伪标签数据作为上下文示例与提示一起使用,从而在训练和使用阶段保持一致性。实验结果表明,该方法在基准数据集上具有显著的有效性。

链接: https://arxiv.org/abs/2410.03124
作者: Zhen-Yu Zhang,Jiandong Zhang,Huaxiu Yao,Gang Niu,Masashi Sugiyama
关键词-EN: Large language models, achieved impressive success, Large language, black-box LLMs, text-formatted learning problems
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have achieved impressive success in text-formatted learning problems, and most popular LLMs have been deployed in a black-box fashion. Meanwhile, fine-tuning is usually necessary for a specific downstream task to obtain better performance, and this functionality is provided by the owners of the black-box LLMs. To fine-tune a black-box LLM, labeled data are always required to adjust the model parameters. However, in many real-world applications, LLMs can label textual datasets with even better quality than skilled human annotators, motivating us to explore the possibility of fine-tuning black-box LLMs with unlabeled data. In this paper, we propose unsupervised prompt learning for classification with black-box LLMs, where the learning parameters are the prompt itself and the pseudo labels of unlabeled data. Specifically, the prompt is modeled as a sequence of discrete tokens, and every token has its own to-be-learned categorical distribution. On the other hand, for learning the pseudo labels, we are the first to consider the in-context learning (ICL) capabilities of LLMs: we first identify reliable pseudo-labeled data using the LLM, and then assign pseudo labels to other unlabeled data based on the prompt, allowing the pseudo-labeled data to serve as in-context demonstrations alongside the prompt. Those in-context demonstrations matter: previously, they are involved when the prompt is used for prediction while they are not involved when the prompt is trained; thus, taking them into account during training makes the prompt-learning and prompt-using stages more consistent. Experiments on benchmark datasets show the effectiveness of our proposed algorithm. After unsupervised prompt learning, we can use the pseudo-labeled dataset for further fine-tuning by the owners of the black-box LLMs.
摘要:大语言模型 (LLMs) 在文本格式的学习问题上取得了显著的成功,并且大多数流行的 LLMs 都是以黑箱方式部署的。同时,为了在特定的下游任务中获得更好的性能,通常需要进行微调,这一功能由黑箱 LLMs 的所有者提供。要微调一个黑箱 LLM,通常需要标注数据来调整模型参数。然而,在许多实际应用中,LLMs 能够以比熟练的人类标注者更高的质量标注文本数据集,这促使我们探索使用未标注数据微调黑箱 LLMs 的可能性。在本文中,我们提出了针对黑箱 LLMs 的无监督提示学习用于分类,其中学习参数是提示本身和未标注数据的伪标签。具体来说,提示被建模为一系列离散的 Token,每个 Token 都有其待学习的类别分布。另一方面,为了学习伪标签,我们首次考虑了 LLMs 的上下文学习 (ICL) 能力:我们首先使用 LLM 识别可靠的伪标注数据,然后根据提示为其他未标注数据分配伪标签,使得伪标注数据能够作为上下文演示与提示一起使用。这些上下文演示至关重要:此前,它们只在提示用于预测时参与,而在提示训练时并不参与;因此,在训练过程中将其纳入考虑,使提示学习阶段与提示使用阶段更加一致。在基准数据集上的实验验证了我们所提算法的有效性。完成无监督提示学习后,黑箱 LLMs 的所有者可以进一步利用该伪标注数据集对模型进行微调。
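该方法的两个学习对象——“每个提示位置的类别分布”与“高置信度伪标签筛选”——可用如下玩具代码示意(假设性实现:pseudo_label 用关键词规则模拟黑箱 LLM 的判断,词表、阈值与文本均为举例):

```python
import numpy as np

vocab = ["classify", "sentiment", "of", "text", "please"]
rng = np.random.default_rng(0)
# 提示被建模为若干离散位置,每个位置维护一个待学习的类别分布(softmax 归一化)
prompt_logits = rng.normal(size=(3, len(vocab)))
prompt_probs = np.exp(prompt_logits) / np.exp(prompt_logits).sum(axis=1, keepdims=True)

def decode_prompt(probs):
    """每个位置取当前分布下概率最大的 Token,得到离散提示。"""
    return " ".join(vocab[i] for i in probs.argmax(axis=1))

def pseudo_label(text):
    """占位函数:模拟黑箱 LLM 返回 (伪标签, 置信度)。"""
    if "good" in text:
        return "positive", 0.9
    if "bad" in text:
        return "negative", 0.9
    return "positive", 0.4

unlabeled = ["good movie", "bad plot", "some film"]
demos = []  # 先筛选可靠的伪标注样本,之后作为上下文演示与提示一起使用
for text in unlabeled:
    label, conf = pseudo_label(text)
    if conf > 0.5:
        demos.append((text, label))
prompt = decode_prompt(prompt_probs)
```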

[NLP-73] RIPPLECOT: Amplifying Ripple Effect of Knowledge Editing in Language Models via Chain-of-Thought In-Context Learning EMNLP

【速读】: 该论文试图解决大语言模型在知识编辑过程中面临的“涟漪效应”问题,即在修改单一事实后,模型难以准确更新与之相关的多跳事实链。解决方案的关键在于提出了一种名为RippleCOT的新型上下文学习编辑方法,该方法通过集成链式思维(Chain-of-Thought, COT)推理,将演示结构化为“新事实、问题、思维、答案”,从而有效地引导模型处理涉及多跳逻辑的复杂问题。这种方法显著提升了模型在处理多跳问题时的准确性,实验结果显示其性能大幅超越现有最先进的方法。

链接: https://arxiv.org/abs/2410.03122
作者: Zihao Zhao,Yuchen Yang,Yijiang Li,Yinzhi Cao
关键词-EN: large language models, ripple effect poses, ripple effect, poses a significant, large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: EMNLP findings

点击查看摘要

Abstract:The ripple effect poses a significant challenge in knowledge editing for large language models. Namely, when a single fact is edited, the model struggles to accurately update the related facts in a sequence, which is evaluated by multi-hop questions linked to a chain of related facts. Recent strategies have moved away from traditional parameter updates to more flexible, less computation-intensive methods, proven to be more effective in addressing the ripple effect. In-context learning (ICL) editing uses a simple demonstration "Imagine that + new fact" to guide LLMs, but struggles with complex multi-hop questions as the new fact alone fails to specify the chain of facts involved in such scenarios. Besides, memory-based editing maintains additional storage for all edits and related facts, requiring continuous updates to stay effective. As a result of these design limitations, the challenge remains, with the highest accuracy being only 33.8% on the MQuAKE-cf benchmarks for Vicuna-7B. To address this, we propose RippleCOT, a novel ICL editing approach integrating Chain-of-Thought (COT) reasoning. RippleCOT structures demonstrations as "new fact, question, thought, answer", incorporating a thought component to identify and decompose the multi-hop logic within questions. This approach effectively guides the model through complex multi-hop questions with chains of related facts. Comprehensive experiments demonstrate that RippleCOT significantly outperforms the state-of-the-art on the ripple effect, achieving accuracy gains ranging from 7.8% to 87.1%.
摘要:大语言模型中的知识编辑面临一个重大挑战,即涟漪效应。具体来说,当一个事实被编辑时,模型难以准确地更新相关事实序列,这一问题通过与一系列相关事实链相连的多跳问题进行评估。近期策略已从传统的参数更新转向更为灵活、计算量更小的方法,这些方法在解决涟漪效应方面被证明更为有效。上下文学习 (In-context Learning, ICL) 编辑使用简单的演示 "Imagine that + new fact" 来引导大语言模型,但在处理复杂的多跳问题时表现不佳,因为新事实本身无法明确指定此类场景中涉及的事实链。此外,基于记忆的编辑方法为所有编辑和相关事实维护额外的存储空间,需要持续更新以保持有效性。由于这些设计限制,挑战依然存在,Vicuna-7B 在 MQuAKE-cf 基准测试中的最高准确率仅为 33.8%。为解决这一问题,我们提出了 RippleCOT,这是一种结合了思维链 (Chain-of-Thought, COT) 推理的新型 ICL 编辑方法。RippleCOT 将演示结构化为 "new fact, question, thought, answer",通过引入思维组件来识别和分解问题中的多跳逻辑。这种方法有效地引导模型处理涉及相关事实链的复杂多跳问题。综合实验表明,RippleCOT 在涟漪效应方面显著优于现有技术,准确率提升范围从 7.8% 到 87.1%。
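按 "new fact, question, thought, answer" 四段结构组装一条上下文演示,可用如下玩具代码示意(假设性实现:示例中的事实链与字段格式均为虚构举例,仅演示 thought 字段如何显式分解多跳逻辑):

```python
def build_ripplecot_demo(new_fact, question, hops, answer):
    """将多跳推理链 hops 展开为 thought 字段,拼接成一条 RippleCOT 风格的演示。"""
    thought = " -> ".join(hops)
    return (f"New fact: {new_fact}\n"
            f"Question: {question}\n"
            f"Thought: {thought}\n"
            f"Answer: {answer}")

demo = build_ripplecot_demo(
    new_fact="A 的首都改为 B 市",
    question="A 的首都所在国家的官方语言是什么?",
    hops=["A 的首都 = B 市", "B 市位于国家 C", "C 的官方语言 = D 语"],
    answer="D 语",
)
```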

[NLP-74] Precision Stability and Generalization: A Comprehensive Assessment of RNNs learnability capability for Classifying Counter and Dyck Languages

【速读】: 该论文试图解决循环神经网络(RNNs)在分类结构化形式语言(如计数语言和Dyck语言)中的学习能力问题。研究的关键在于揭示RNNs在处理此类任务时主要作为状态机运作,其语言能力受嵌入精度及负样本采样策略的显著影响。实验表明,随着正负样本结构相似性的增加,RNNs的性能显著下降。此外,研究强调了在评估神经网络的语言分类能力时,数据结构和采样技术的重要性,并指出仅依赖理论表达能力不足以理解真正的学习能力,需要更强的表达能力约束来准确评估RNNs的学习潜力。

链接: https://arxiv.org/abs/2410.03118
作者: Neisarg Dave,Daniel Kifer,Lee Giles,Ankur Mali
关键词-EN: Recurrent Neural Networks, classifying structured formal, counter and Dyck, structured formal languages, Neural Networks
类目: Computation and Language (cs.CL)
备注: 21 pages, 5 figures, 5 tables

点击查看摘要

Abstract:This study investigates the learnability of Recurrent Neural Networks (RNNs) in classifying structured formal languages, focusing on counter and Dyck languages. Traditionally, both first-order (LSTM) and second-order (O2RNN) RNNs have been considered effective for such tasks, primarily based on their theoretical expressiveness within the Chomsky hierarchy. However, our research challenges this notion by demonstrating that RNNs primarily operate as state machines, where their linguistic capabilities are heavily influenced by the precision of their embeddings and the strategies used for sampling negative examples. Our experiments revealed that performance declines significantly as the structural similarity between positive and negative examples increases. Remarkably, even a basic single-layer classifier using RNN embeddings performed better than chance. To evaluate generalization, we trained models on strings up to a length of 40 and tested them on strings from lengths 41 to 500, using 10 unique seeds to ensure statistical robustness. Stability comparisons between LSTM and O2RNN models showed that O2RNNs generally offer greater stability across various scenarios. We further explore the impact of different initialization strategies revealing that our hypothesis is consistent with various RNNs. Overall, this research questions established beliefs about RNNs’ computational capabilities, highlighting the importance of data structure and sampling techniques in assessing neural networks’ potential for language classification tasks. It emphasizes that stronger constraints on expressivity are crucial for understanding true learnability, as mere expressivity does not capture the essence of learning.
摘要:本研究探讨了循环神经网络 (RNN) 在分类结构化形式语言中的可学习性,重点关注计数语言和Dyck语言。传统上,一阶 (LSTM) 和二阶 (O2RNN) RNN 都被认为对此类任务有效,主要基于它们在Chomsky层次结构中的理论表达能力。然而,我们的研究通过证明RNN主要作为状态机运行,其语言能力在很大程度上受其嵌入精度和用于采样负样本的策略影响,挑战了这一观点。我们的实验表明,随着正负样本之间结构相似性的增加,性能显著下降。值得注意的是,即使使用RNN嵌入的基本单层分类器表现也优于随机猜测。为了评估泛化能力,我们在长度为40的字符串上训练模型,并在长度为41到500的字符串上进行测试,使用10个不同的种子以确保统计稳健性。LSTM和O2RNN模型之间的稳定性比较显示,O2RNN在各种场景下通常提供更高的稳定性。我们进一步探讨了不同初始化策略的影响,发现我们的假设与各种RNN一致。总体而言,本研究质疑了关于RNN计算能力的既定信念,强调了数据结构和采样技术在评估神经网络语言分类任务潜力中的重要性。它强调,更强的表达能力约束对于理解真正的可学习性至关重要,因为单纯的表达能力并不能捕捉学习的本质。
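文中“正负样本结构相似性越高、分类越难”的负采样策略,可以用 Dyck-1 语言的一个极简采样器示意(假设性实现:通过翻转单个括号得到与正例仅差一个符号的负例,即一种高相似度负采样;真实实验的采样与语言设定见论文):

```python
import random

def is_dyck1(s):
    """合法的 Dyck-1 括号串:任意前缀深度不为负,且最终深度为 0。"""
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return depth == 0

def flip_one(s, rng):
    """翻转一个随机位置的括号:左右括号数变为不等,结果必为负例,且与正例仅差一位。"""
    i = rng.randrange(len(s))
    flipped = ")" if s[i] == "(" else "("
    return s[:i] + flipped + s[i + 1:]

rng = random.Random(0)
pos = "(()())(())"
neg = flip_one(pos, rng)
```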

[NLP-75] ProcBench: Benchmark for Multi-Step Reasoning and Following Procedure

【速读】: 该论文试图解决大型语言模型(LLMs)在多步推理任务中的表现问题。解决方案的关键在于设计了一个专注于直接评估多步推理能力的基准,通过消除路径探索和隐性知识利用,使得模型只需遵循明确的指令来解决问题。该基准包含详细的指令和相应问题,确保解决过程完全依赖于指令,从而能够全面评估LLMs在遵循指令和进行多步推理方面的能力。

链接: https://arxiv.org/abs/2410.03117
作者: Ippei Fujisawa,Sensho Nobe,Hiroki Seto,Rina Onda,Yoshiaki Uchida,Hiroki Ikoma,Pei-Chun Chien,Ryota Kanai
关键词-EN: tasks remains limited, large language models, continue to advance, intellectual activities, remains limited
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reasoning is central to a wide range of intellectual activities, and while the capabilities of large language models (LLMs) continue to advance, their performance in reasoning tasks remains limited. The processes and mechanisms underlying reasoning are not yet fully understood, but key elements include path exploration, selection of relevant knowledge, and multi-step inference. Problems are solved through the synthesis of these components. In this paper, we propose a benchmark that focuses on a specific aspect of reasoning ability: the direct evaluation of multi-step inference. To this end, we design a special reasoning task where multi-step inference is specifically focused by largely eliminating path exploration and implicit knowledge utilization. Our dataset comprises pairs of explicit instructions and corresponding questions, where the procedures necessary for solving the questions are entirely detailed within the instructions. This setup allows models to solve problems solely by following the provided directives. By constructing problems that require varying numbers of steps to solve and evaluating responses at each step, we enable a thorough assessment of state-of-the-art LLMs’ ability to follow instructions. To ensure the robustness of our evaluation, we include multiple distinct tasks. Furthermore, by comparing accuracy across tasks, utilizing step-aware metrics, and applying separately defined measures of complexity, we conduct experiments that offer insights into the capabilities and limitations of LLMs in reasoning tasks. Our findings have significant implications for the development of LLMs and highlight areas for future research in advancing their reasoning abilities. Our dataset is available at this https URL and code at this https URL.
摘要:推理是广泛智力活动的核心,尽管大语言模型 (LLM) 的能力不断进步,但在推理任务中的表现仍然有限。推理的过程和机制尚未完全理解,但其关键要素包括路径探索、相关知识的选取以及多步推理。问题通过这些组件的综合来解决。本文提出了一种专注于推理能力特定方面的基准:直接评估多步推理。为此,我们设计了一项特殊的推理任务,该任务主要关注多步推理,通过大幅减少路径探索和隐性知识利用来实现。我们的数据集包括明确的指令及其对应的问题对,其中解决问题的必要步骤完全详细地包含在指令中。这种设置使得模型只需遵循提供的指令即可解决问题。通过构建需要不同步骤数来解决的问题,并在每一步评估响应,我们能够全面评估最先进 LLM 遵循指令的能力。为确保评估的稳健性,我们包含了多个不同的任务。此外,通过跨任务比较准确性、利用步骤感知指标以及应用单独定义的复杂性度量,我们进行了实验,提供了对 LLM 在推理任务中能力和局限性的深入见解。我们的发现对 LLM 的发展具有重要意义,并突显了未来研究推进其推理能力的领域。我们的数据集可通过 this https URL 获取,代码可通过 this https URL 获取。
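该基准“解题过程完全由指令给定、按步评估”的思想可用如下玩具代码示意(假设性实现:操作集合与逐步评分方式均为举例,并非 ProcBench 的真实任务定义):

```python
def run_procedure(state, steps):
    """严格按显式指令逐步执行,并记录每一步的中间状态,供逐步评分使用。"""
    trace = [list(state)]
    for op, arg in steps:
        if op == "append":
            state = state + [arg]
        elif op == "remove_first":
            state = state[1:]
        elif op == "reverse":
            state = state[::-1]
        trace.append(list(state))
    return trace

def stepwise_accuracy(pred_trace, gold_trace):
    """逐步比较预测轨迹与标准轨迹,返回步级准确率。"""
    hits = sum(p == g for p, g in zip(pred_trace, gold_trace))
    return hits / len(gold_trace)

steps = [("append", 3), ("reverse", None), ("remove_first", None)]
gold = run_procedure([1, 2], steps)  # [[1,2], [1,2,3], [3,2,1], [2,1]]
```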

[NLP-76] X-ALMA: Plug Play Modules and Adaptive Rejection for Quality Translation at Scale

【速读】: 该论文试图解决当前大型语言模型(LLMs)在多语言处理中对高资源语言(如英语和中文)表现优异,但对中低资源语言表现不佳的问题。解决方案的关键在于引入了X-ALMA模型,该模型通过插拔式语言特定模块架构和精心设计的训练方案,特别是最终阶段的Adaptive Rejection Preference Optimization(ARPO)方法,确保了在50种不同语言上的高质量翻译性能,超越了现有的开源多语言LLMs,如Aya-101和Aya-23,在FLORES和WMT’23测试集上的表现。

链接: https://arxiv.org/abs/2410.03115
作者: Haoran Xu,Kenton Murray,Philipp Koehn,Hieu Hoang,Akiko Eriguchi,Huda Khayrallah
关键词-EN: Large language models, limited multilingual data, due to English-centric, English-centric pre-training, NLP tasks
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have achieved remarkable success across various NLP tasks, yet their focus has predominantly been on English due to English-centric pre-training and limited multilingual data. While some multilingual LLMs claim to support for hundreds of languages, models often fail to provide high-quality response for mid- and low-resource languages, leading to imbalanced performance heavily skewed in favor of high-resource languages like English and Chinese. In this paper, we prioritize quality over scaling number of languages, with a focus on multilingual machine translation task, and introduce X-ALMA, a model designed with a commitment to ensuring top-tier performance across 50 diverse languages, regardless of their resource levels. X-ALMA surpasses state-of-the-art open-source multilingual LLMs, such as Aya-101 and Aya-23, in every single translation direction on the FLORES and WMT’23 test datasets according to COMET-22. This is achieved by plug-and-play language-specific module architecture to prevent language conflicts during training and a carefully designed training regimen with novel optimization methods to maximize the translation performance. At the final stage of training regimen, our proposed Adaptive Rejection Preference Optimization (ARPO) surpasses existing preference optimization methods in translation tasks.
摘要:大语言模型 (LLMs) 在各种自然语言处理 (NLP) 任务中取得了显著的成功,但其重点主要集中在英语上,这是由于以英语为中心的预训练和有限的多语言数据。尽管一些多语言 LLMs 声称支持数百种语言,但模型往往无法为中低资源语言提供高质量的响应,导致性能严重偏向于高资源语言,如英语和中文。在本文中,我们优先考虑质量而非语言数量的扩展,专注于多语言机器翻译任务,并介绍了 X-ALMA,这是一种致力于确保在 50 种不同语言中实现顶级性能的模型,无论其资源水平如何。根据 COMET-22 的评估,X-ALMA 在 FLORES 和 WMT’23 测试数据集的每个翻译方向上均超越了最先进的开源多语言 LLMs,如 Aya-101 和 Aya-23。这一成就通过即插即用的特定语言模块架构来防止训练过程中的语言冲突,并通过精心设计的训练方案和创新的优化方法来最大化翻译性能。在训练方案的最后阶段,我们提出的自适应拒绝偏好优化 (ARPO) 在翻译任务中超越了现有的偏好优化方法。
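“插拔式语言特定模块”的路由思想可用如下玩具代码示意(假设性实现:语言分组、组名与回退策略均为举例,并非 X-ALMA 的真实分组;仅演示“按目标语言挂载对应模块、训练时各组互不干扰”的结构):

```python
MODULE_GROUPS = {
    "romance": ["es", "fr", "it", "pt", "ro"],
    "cjk": ["zh", "ja", "ko"],
    "germanic": ["de", "nl", "sv"],
}
# 反向索引:语言代码 -> 所属模块组
LANG2GROUP = {lang: group for group, langs in MODULE_GROUPS.items() for lang in langs}

def route_module(target_lang):
    """按目标语言选择要插入基座模型的语言组模块;未覆盖的语言回退到共享模块。"""
    return LANG2GROUP.get(target_lang, "shared")
```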

[NLP-77] LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy

【速读】: 该论文试图解决大语言模型(LLMs)中Key-Value(KV)缓存的高内存消耗问题,特别是在序列长度和批量大小增加时,KV缓存的内存需求线性增长,成为LLM部署的主要瓶颈。解决方案的关键在于提出了一种低秩近似方法来压缩KV权重矩阵,这种方法可以在不重新训练模型的情况下直接集成到现有的基于Transformer的LLMs中。通过调整层级敏感性和引入渐进压缩策略,论文的方法能够在不进行模型调优或任务特定分析的情况下,显著减少GPU内存占用,同时保持模型性能。

链接: https://arxiv.org/abs/2410.03111
作者: Rongzhi Zhang,Kuang Wang,Liyuan Liu,Shuohang Wang,Hao Cheng,Chao Zhang,Yelong Shen
关键词-EN: enabling faster inference, storing previously computed, autoregressive large language, serving transformer-based autoregressive, transformer-based autoregressive large
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 15 pages, 4 figures

点击查看摘要

Abstract:The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs), enabling faster inference by storing previously computed KV vectors. However, its memory consumption scales linearly with sequence length and batch size, posing a significant bottleneck in LLM deployment. Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages, which requires extensive parameter tuning thus unsuitable for pre-trained LLMs; (2) KV cache compression at test time, primarily through token eviction policies, which often overlook inter-layer dependencies and can be task-specific. This paper introduces an orthogonal approach to KV cache compression. We propose a low-rank approximation of KV weight matrices, allowing for plug-in integration with existing transformer-based LLMs without model retraining. To effectively compress KV cache at the weight level, we adjust for layerwise sensitivity and introduce a progressive compression strategy, which is supported by our theoretical analysis on how compression errors accumulate in deep networks. Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages. Extensive experiments with LLaMA models ranging from 8B to 70B parameters across various tasks show that our approach significantly reduces the GPU memory footprint while maintaining performance.
摘要:键值 (KV) 缓存是服务于基于 Transformer 的自回归大语言模型 (LLM) 的关键组件,通过存储先前计算的 KV 向量来实现更快的推理。然而,其内存消耗与序列长度和批量大小成线性关系,成为 LLM 部署中的一个显著瓶颈。现有缓解此问题的方法包括:(1) 在升级阶段集成的高效注意力变体,这需要广泛的参数调整,因此不适合预训练的 LLM;(2) 测试时的 KV 缓存压缩,主要通过 Token 驱逐策略实现,但往往忽视了层间依赖性,并且可能是任务特定的。本文介绍了一种正交的 KV 缓存压缩方法。我们提出了一种低秩近似 KV 权重矩阵的方法,允许在不重新训练模型的情况下与现有的基于 Transformer 的 LLM 进行插件式集成。为了在权重级别有效压缩 KV 缓存,我们调整了层级敏感性,并引入了一种渐进压缩策略,这在我们的理论分析中得到了支持,该分析探讨了压缩误差在深度网络中的累积方式。我们的方法设计为在升级阶段无需模型调整或在测试阶段无需任务特定的分析。通过在从 8B 到 70B 参数的 LLaMA 模型上进行的广泛实验,我们展示了我们的方法在显著减少 GPU 内存占用的情况下,同时保持了性能。
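KV 权重矩阵的低秩近似可以用截断 SVD 写成几行 NumPy(假设性实现:仅演示“用两个窄矩阵替代原矩阵”这一压缩本质,未包含论文中的层级敏感性调整与渐进压缩策略):

```python
import numpy as np

def low_rank_kv(W, rank):
    """截断 SVD:W ≈ A @ B,存储量从 m*n 降为 rank*(m+n)。"""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # m x rank
    B = Vt[:rank, :]             # rank x n
    return A, B

rng = np.random.default_rng(0)
# 构造一个近似低秩的 "KV 权重矩阵"(低秩结构 + 少量噪声)
W = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 64)) + 0.01 * rng.normal(size=(64, 64))
A, B = low_rank_kv(W, rank=8)
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)  # 相对重构误差应很小
```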


[NLP-78] Mamba in Vision: A Comprehensive Survey of Techniques and Applications

【速读】: 该论文试图解决卷积神经网络(CNNs)和视觉变换器(ViTs)在计算机视觉中面临的挑战,特别是CNNs难以捕捉长程依赖关系和ViTs因自注意力机制的二次复杂度导致的高计算成本问题。解决方案的关键在于利用选择性结构化状态空间模型(Selective Structured State Space Models),即Mamba模型,来有效捕捉长程依赖关系,同时实现线性计算复杂度,从而在保持高效计算的同时提升模型性能。

链接: https://arxiv.org/abs/2410.03105
作者: Md Maklachur Rahman,Abdullah Aman Tutul,Ankur Nath,Lamyanba Laishram,Soon Ki Jung,Tracy Hammond
关键词-EN: Convolutional Neural Networks, Neural Networks, Convolutional Neural, faced by Convolutional, Vision Transformers
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Under Review

点击查看摘要

Abstract:Mamba is emerging as a novel approach to overcome the challenges faced by Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in computer vision. While CNNs excel at extracting local features, they often struggle to capture long-range dependencies without complex architectural modifications. In contrast, ViTs effectively model global relationships but suffer from high computational costs due to the quadratic complexity of their self-attention mechanisms. Mamba addresses these limitations by leveraging Selective Structured State Space Models to effectively capture long-range dependencies with linear computational complexity. This survey analyzes the unique contributions, computational benefits, and applications of Mamba models while also identifying challenges and potential future research directions. We provide a foundational resource for advancing the understanding and growth of Mamba models in computer vision. An overview of this work is available at this https URL.
摘要:Mamba 作为一种新兴方法,正在逐步解决卷积神经网络 (CNN) 和视觉 Transformer (ViT) 在计算机视觉领域面临的挑战。尽管 CNN 在提取局部特征方面表现出色,但它们通常难以在不进行复杂架构修改的情况下捕捉长距离依赖关系。相比之下,ViT 虽然能够有效建模全局关系,但由于其自注意力机制的二次复杂度,计算成本较高。Mamba 通过利用选择性结构化状态空间模型,以线性计算复杂度有效地捕捉长距离依赖关系,从而解决了这些局限性。本调查分析了 Mamba 模型的独特贡献、计算优势及其应用,同时指出了当前的挑战和潜在的未来研究方向。我们提供了一个基础资源,以促进对 Mamba 模型在计算机视觉领域理解和发展的深入研究。本文的概述可在此 https URL 获取。
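状态空间模型的核心是一个对序列长度线性的递推;下面是一个非选择性的一维 SSM 扫描玩具示例(假设性实现:参数取值均为举例;Mamba 在此基础上令 B、C 与步长依赖输入以获得“选择性”,并以并行扫描高效实现):

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """h_t = A @ h_{t-1} + B * x_t, y_t = C @ h_t:每步计算量恒定,总开销线性于序列长度。"""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B * x_t   # 状态更新
        ys.append(C @ h)      # 读出
    return np.array(ys)

A = np.diag([0.9, 0.5])       # 稳定的对角状态转移(特征值模小于 1)
B = np.array([1.0, 1.0])
C = np.array([0.5, 0.5])
y = ssm_scan(np.ones(10), A, B, C)
```

对恒定输入,状态会单调逼近不动点 B/(1-diag(A)),因此输出 y 单调递增并趋于 6.0。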

[NLP-79] Horizon-Length Prediction: Advancing Fill-in-the-Middle Capabilities for Code Generation with Lookahead Planning

【速读】: 该论文试图解决现有Fill-in-the-Middle (FIM)训练方法在生成与上下文一致的代码时存在的问题,特别是当前基于重新排序和常规下一个词预测(NTP)的训练方式导致模型难以有效利用远端右上下文进行规划的问题。解决方案的关键在于提出了Horizon-Length Prediction (HLP),这是一种新的训练目标,通过在每一步预测剩余中间词的数量(即地平线长度),使模型能够进行前瞻性规划,从而在不依赖数据集特定后处理的情况下,自然地学习任意左右上下文的填充边界。HLP显著提升了FIM任务的性能,且无需额外的推理成本,确保了其在实际应用中的可行性。

链接: https://arxiv.org/abs/2410.03103
作者: Yifeng Ding,Hantian Ding,Shiqi Wang,Qing Sun,Varun Kumar,Zijian Wang
关键词-EN: generation of missing, FIM, HLP, models, FIM training paradigm
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Fill-in-the-Middle (FIM) has become integral to code language models, enabling generation of missing code given both left and right contexts. However, the current FIM training paradigm, which reorders original training sequences and then performs regular next-token prediction (NTP), often leads to models struggling to generate content that aligns smoothly with the surrounding context. Crucially, while existing works rely on rule-based post-processing to circumvent this weakness, such methods are not practically usable in open-domain code completion tasks as they depend on restrictive, dataset-specific assumptions (e.g., generating the same number of lines as in the ground truth). Moreover, model performance on FIM tasks deteriorates significantly without these unrealistic assumptions. We hypothesize that NTP alone is insufficient for models to learn effective planning conditioned on the distant right context, a critical factor for successful code infilling. To overcome this, we propose Horizon-Length Prediction (HLP), a novel training objective that teaches models to predict the number of remaining middle tokens (i.e., horizon length) at each step. HLP advances FIM with lookahead planning, enabling models to inherently learn infilling boundaries for arbitrary left and right contexts without relying on dataset-specific post-processing. Our evaluation across different models and sizes shows that HLP significantly improves FIM performance by up to 24% relatively on diverse benchmarks, across file-level and repository-level, and without resorting to unrealistic post-processing methods. Furthermore, the enhanced planning capability gained through HLP boosts model performance on code reasoning. Importantly, HLP only incurs negligible training overhead and no additional inference cost, ensuring its practicality for real-world scenarios. 
摘要:填空式中间生成 (Fill-in-the-Middle, FIM) 已成为代码语言模型的核心组成部分,能够在给定左右上下文的情况下生成缺失的代码。然而,当前的 FIM 训练范式通过重新排序原始训练序列并进行常规的下一个 Token 预测 (Next-Token Prediction, NTP),往往导致模型难以生成与周围上下文平滑衔接的内容。关键在于,尽管现有工作依赖基于规则的后处理来规避这一弱点,但这些方法在开放域代码补全任务中并不实用,因为它们依赖于限制性且特定于数据集的假设(例如,生成与真实数据相同数量的行)。此外,在没有这些不切实际的假设的情况下,模型在 FIM 任务上的表现显著下降。我们假设,仅靠 NTP 不足以让模型学习到基于远端右侧上下文的有效规划,而这是成功进行代码填充的关键因素。为解决这一问题,我们提出了地平线长度预测 (Horizon-Length Prediction, HLP),这是一种新颖的训练目标,教导模型在每一步预测剩余中间 Token 的数量(即地平线长度)。HLP 通过前瞻性规划推进 FIM,使模型能够在不依赖数据集特定后处理的情况下,自然地学习任意左右上下文的填充边界。我们在不同模型和规模上的评估显示,HLP 在多样化的基准测试中显著提升了 FIM 性能,相对提升高达 24%,涵盖文件级和仓库级,且无需采用不切实际的后处理方法。此外,通过 HLP 增强的规划能力还提升了模型在代码推理上的表现。重要的是,HLP 仅带来可忽略的训练开销,且无额外推理成本,确保了其在实际场景中的实用性。
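HLP 的训练目标可以用如下玩具代码示意:构造 FIM 重排输入,并为 middle 的每个 Token 附上“剩余中间 Token 数”标签(假设性实现:用空格切分代替真实 Token 化,哨兵符与字段格式均为举例):

```python
def build_fim_sample(prefix, middle, suffix):
    """返回 PSM 重排输入串,以及 (middle Token, 地平线长度) 目标对。
    将要生成第 i 个中间 Token 时,含当前 Token 在内还剩 len-i+1 个,故标签从 len 递减到 1。"""
    mid_tokens = middle.split()
    input_seq = f"<PRE> {prefix} <SUF> {suffix} <MID>"
    horizons = list(range(len(mid_tokens), 0, -1))
    return input_seq, list(zip(mid_tokens, horizons))

inp, targets = build_fim_sample("def add(a, b):", "return a + b", "")
```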


[NLP-80] CoCoHD: Congress Committee Hearing Dataset EMNLP2024

【速读】: 该论文试图解决美国国会听证会数据缺乏全面分析的问题。解决方案的关键在于提出了Congress Committee Hearing Dataset (CoCoHD),这是一个涵盖1997至2024年间86个委员会的32,697条记录的数据集。通过该数据集,研究人员可以深入分析政策语言,特别是在医疗、LGBTQ+权利和气候正义等关键议题上。论文通过案例研究展示了数据集的应用潜力,特别是在能源与商业委员会对化石燃料消费立场的分析上,并利用预训练语言模型进行微调,生成了与能源相关的指标。此外,市场分析表明,利用CoCoHD进行自然语言分析能够预测和揭示能源行业的趋势。

链接: https://arxiv.org/abs/2410.03099
作者: Arnav Hiray,Yunsong Liu,Mingxiao Song,Agam Shah,Sudheer Chava
关键词-EN: impacting individual lives, congressional hearings significantly, hearings significantly influence, social fabric, impacting individual
类目: Computation and Language (cs.CL)
备注: Accepted at EMNLP 2024

点击查看摘要

Abstract:U.S. congressional hearings significantly influence the national economy and social fabric, impacting individual lives. Despite their importance, there is a lack of comprehensive datasets for analyzing these discourses. To address this, we propose the Congress Committee Hearing Dataset (CoCoHD), covering hearings from 1997 to 2024 across 86 committees, with 32,697 records. This dataset enables researchers to study policy language on critical issues like healthcare, LGBTQ+ rights, and climate justice. We demonstrate its potential with a case study on 1,000 energy-related sentences, analyzing the Energy and Commerce Committee’s stance on fossil fuel consumption. By fine-tuning pre-trained language models, we create energy-relevant measures for each hearing. Our market analysis shows that natural language analysis using CoCoHD can predict and highlight trends in the energy sector.
摘要:美国国会听证会对国家经济和社会结构有着显著影响,影响着个人的生活。尽管其重要性不言而喻,但目前缺乏全面的数据集来分析这些讨论。为此,我们提出了国会委员会听证数据集 (Congress Committee Hearing Dataset, CoCoHD),涵盖了从 1997 年到 2024 年间的 86 个委员会的 32,697 条记录。该数据集使研究人员能够研究关于医疗保健、LGBTQ+ 权利和气候正义等关键问题的政策语言。我们通过一个关于 1,000 条能源相关句子的案例研究,展示了其潜力,分析了能源与商务委员会对化石燃料消费的立场。通过微调预训练语言模型,我们为每次听证会创建了与能源相关的指标。我们的市场分析显示,利用 CoCoHD 进行自然语言分析可以预测并突出能源行业的趋势。

[NLP-81] UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference

【速读】: 该论文试图解决大规模语言模型(LLMs)在长上下文推理中由于高内存和计算需求带来的部署挑战。解决方案的关键在于提出了一种不确定性感知的压缩方案UNComp,通过利用矩阵熵在token序列级别估计模型在不同层和头部的复杂度,从而自适应地压缩隐藏状态和KV缓存。UNComp不仅加速了预填充阶段,还显著减少了KV缓存的大小,同时保持了关键任务中的性能,提供了一种无需训练的组查询注意力范式,可无缝集成到现有的KV缓存方案中。

链接: https://arxiv.org/abs/2410.03090
作者: Jing Xiong,Jianghan Shen,Fanghua Ye,Chaofan Tao,Zhongwei Wan,Jianqiao Lu,Xun Wu,Chuanyang Zheng,Zhijiang Guo,Lingpeng Kong,Ngai Wong
关键词-EN: Deploying large language, Deploying large, large language models, computational demands, large language
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deploying large language models (LLMs) is challenging due to their high memory and computational demands, especially during long-context inference. While key-value (KV) caching accelerates inference by reusing previously computed keys and values, it also introduces significant memory overhead. Existing KV cache compression methods such as eviction and merging typically compress the KV cache after it is generated and overlook the eviction of hidden states, failing to improve the speed of the prefilling stage. Additionally, applying a uniform compression rate across different attention heads can harm crucial retrieval heads in needle-in-a-haystack tasks due to excessive compression. In this paper, we propose UNComp, an uncertainty-aware compression scheme that leverages matrix entropy to estimate model uncertainty across layers and heads at the token sequence level. By grouping layers and heads based on their uncertainty, UNComp adaptively compresses both the hidden states and the KV cache. Our method achieves a 1.6x speedup in the prefilling stage and reduces the KV cache to 4.74% of its original size, resulting in a 6.4x increase in throughput and a 1.4x speedup in inference with only a 1.41% performance loss. Remarkably, in needle-in-a-haystack tasks, UNComp outperforms the full-size KV cache even when compressed to 9.38% of its original size. Our approach offers an efficient, training-free Grouped-Query Attention paradigm that can be seamlessly integrated into existing KV cache schemes.
摘要:部署大语言模型 (LLMs) 面临挑战,因其对内存和计算资源的高需求,特别是在长上下文推理期间。尽管键值 (KV) 缓存通过重用先前计算的键和值加速了推理过程,但也引入了显著的内存开销。现有的 KV 缓存压缩方法,如驱逐和合并,通常在生成 KV 缓存后进行压缩,并忽略了隐藏状态的驱逐,未能提升预填充阶段的速度。此外,在不同注意力头之间应用统一的压缩率可能会因过度压缩而损害大海捞针 (needle-in-a-haystack) 任务中的关键检索头。本文提出 UNComp,一种基于不确定性感知的压缩方案,利用矩阵熵在 Token 序列级别估计模型在各层和各头的不确定性。通过根据不确定性对层和头进行分组,UNComp 自适应地压缩隐藏状态和 KV 缓存。我们的方法在预填充阶段实现了 1.6 倍的速度提升,并将 KV 缓存压缩至原始大小的 4.74%,从而使吞吐量提高了 6.4 倍,推理速度提升了 1.4 倍,仅损失 1.41% 的性能。值得注意的是,在大海捞针任务中,即使压缩至原始大小的 9.38%,UNComp 的表现仍优于全尺寸 KV 缓存。我们的方法提供了一种高效、无需训练的分组查询注意力范式,可以无缝集成到现有的 KV 缓存方案中。
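
矩阵熵是 UNComp 用来衡量各层、各注意力头不确定性的核心量。下面给出一个极简的 Python 草图(仅用于理解概念,并非论文官方实现,数据与函数名均为假设):对单个注意力头的 Token 表示矩阵,取其协方差特征值谱的香农熵;熵越低,表示越冗余,可以压缩得越激进。

```python
import numpy as np

def matrix_entropy(reps: np.ndarray) -> float:
    """reps: (seq_len, hidden),某个注意力头的 Token 表示矩阵。"""
    # 中心化后构造协方差矩阵,把特征值谱归一化成一个概率分布
    x = reps - reps.mean(axis=0, keepdims=True)
    cov = x.T @ x / reps.shape[0]
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    p = eig / eig.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# 按熵对各头排序:熵低的头更冗余,可分到压缩率更高的组
rng = np.random.default_rng(0)
heads = {f"head_{i}": rng.normal(size=(16, 8)) for i in range(4)}
ranking = sorted(heads, key=lambda h: matrix_entropy(heads[h]))
```

按此排序把层/头分组后,即可对不同组施加不同的 KV 缓存压缩率,这正是 UNComp 避免"统一压缩率损害关键检索头"的思路。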

[NLP-82] Scaling Parameter-Constrained Language Models with Quality Data EMNLP2024

【速读】: 该论文试图解决在语言模型训练中,传统缩放定律忽视数据质量对模型泛化能力影响的问题。解决方案的关键在于引入“有效训练令牌”这一概念,将其定义为文本多样性和合成度(由教师模型衡量)的组合,以此量化数据质量对模型性能的影响。通过预训练200个不同规模的模型并分析其推理任务的准确性,论文验证了有效训练令牌与模型性能之间的强相关性,并展示了在数据采样和合成等技术中如何利用这一概念提升数据质量。

链接: https://arxiv.org/abs/2410.03083
作者: Ernie Chang,Matteo Paltenghi,Yang Li,Pin-Jie Lin,Changsheng Zhao,Patrick Huber,Zechun Liu,Rastislav Rabatin,Yangyang Shi,Vikas Chandra
关键词-EN: providing compute-optimal estimates, modeling traditionally quantify, traditionally quantify training, quantify training loss, language modeling traditionally
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to EMNLP 2024 Industry Track, 18 pages, 9 figures, 4 tables

点击查看摘要

Abstract:Scaling laws in language modeling traditionally quantify training loss as a function of dataset size and model parameters, providing compute-optimal estimates but often neglecting the impact of data quality on model generalization. In this paper, we extend the conventional understanding of scaling law by offering a microscopic view of data quality within the original formulation – effective training tokens – which we posit to be a critical determinant of performance for parameter-constrained language models. Specifically, we formulate the proposed term of effective training tokens to be a combination of two readily-computed indicators of text: (i) text diversity and (ii) syntheticity as measured by a teacher model. We pretrained over 200 models of 25M to 1.5B parameters on a diverse set of sampled, synthetic data, and estimated the constants that relate text quality, model size, training tokens, and eight reasoning task accuracy scores. We demonstrated the estimated constants yield +0.83 Pearson correlation with true accuracies, and analyzed it in scenarios involving widely-used data techniques such as data sampling and synthesis which aim to improve data quality.
摘要:语言模型中的缩放定律传统上将训练损失量化为数据集大小和模型参数的函数,提供了计算优化的估计,但往往忽略了数据质量对模型泛化能力的影响。本文通过在原有公式——有效训练 Token (effective training tokens) 中引入数据质量的微观视角,扩展了对缩放定律的传统理解,我们假设这是参数受限语言模型性能的关键决定因素。具体而言,我们将提出的有效训练 Token 定义为两个易于计算的文本指标的组合:(i) 文本多样性 (text diversity) 和 (ii) 由教师模型 (teacher model) 衡量的合成性 (syntheticity)。我们在多样化的采样合成数据集上预训练了超过 200 个从 25M 到 1.5B 参数的模型,并估计了将文本质量、模型大小、训练 Token 与八个推理任务准确率分数相关联的常数。我们展示了估计的常数与真实准确率之间具有 +0.83 的皮尔逊相关系数,并在涉及广泛使用的数据技术(如数据采样和合成)的场景中进行了分析,这些技术旨在提高数据质量。
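
论文把数据质量折算进缩放定律,其"有效训练 Token"的具体组合公式以原文为准。这里仅用纯 Python 示意缩放定律拟合这一通用步骤:在对数空间用线性回归估计 L(D) = A / D^α + E 中的常数(数据点为虚构):

```python
import math

# 虚构的 (训练 Token 数, 训练损失) 数据点,假设服从 L(D) = A / D**alpha + E
data = [(1e6, 3.2), (1e7, 2.6), (1e8, 2.2), (1e9, 1.9)]

def fit_power_law(points, e_floor=1.5):
    """在 log(L - e_floor) = log(A) - alpha * log(D) 上做最小二乘,返回 (A, alpha)。"""
    xs = [math.log(d) for d, _ in points]
    ys = [math.log(loss - e_floor) for _, loss in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return math.exp(my - slope * mx), -slope

A, alpha = fit_power_law(data)
# 论文的做法相当于把原始 Token 数 D 换成按质量加权后的"有效训练 Token",
# 再做同样的常数估计
```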

[NLP-83] CommonIT: Commonality-Aware Instruction Tuning for Large Language Models via Data Partitions EMNLP2024

【速读】: 该论文试图解决大语言模型(LLMs)在指令遵循能力上的提升问题,特别是通过数据采样策略来增强模型的指令遵循能力。解决方案的关键在于引入了一种名为CommonIT(Commonality-aware Instruction Tuning)的新型指令调优策略,该策略通过将指令数据集根据任务类型、嵌入相似性和长度三个指标进行聚类,并在训练过程中确保每个小批次(mini-batch)仅包含来自同一聚类的数据。这种策略不仅在不同小批次间引入数据随机性,还在批次内部保持数据相似性,从而有效提升了模型在指令遵循任务上的表现。实验结果表明,CommonIT在多个模型和数据集上均显著提升了指令遵循能力,特别是在特定任务和领域上表现尤为突出。

链接: https://arxiv.org/abs/2410.03077
作者: Jun Rao,Xuebo Liu,Lian Lian,Shengjun Cheng,Yunjie Liao,Min Zhang
关键词-EN: Large Language Models, Large Language, Language Models, Commonality-aware Instruction Tuning, instruction tuning
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to EMNLP 2024

点击查看摘要

Abstract:With instruction tuning, Large Language Models (LLMs) can enhance their ability to adhere to commands. Diverging from most works focusing on data mixing, our study concentrates on enhancing the model’s capabilities from the perspective of data sampling during training. Drawing inspiration from the human learning process, where it is generally easier to master solutions to similar topics through focused practice on a single type of topic, we introduce a novel instruction tuning strategy termed CommonIT: Commonality-aware Instruction Tuning. Specifically, we cluster instruction datasets into distinct groups with three proposed metrics (Task, Embedding and Length). We ensure each training mini-batch, or “partition”, consists solely of data from a single group, which brings about both data randomness across mini-batches and intra-batch data similarity. Rigorous testing on LLaMa models demonstrates CommonIT’s effectiveness in enhancing the instruction-following capabilities of LLMs through IT datasets (FLAN, CoT, and Alpaca) and models (LLaMa2-7B, Qwen2-7B, LLaMa 13B, and BLOOM 7B). CommonIT consistently boosts an average improvement of 2.1% on the general domain (i.e., the average score of Knowledge, Reasoning, Multilinguality and Coding) with the Length metric, and 5.2% on the special domain (i.e., GSM, Openfunctions and Code) with the Task metric, and 3.8% on the specific tasks (i.e., MMLU) with the Embedding metric. Code is available at this https URL.
摘要:通过指令调优,大语言模型 (LLMs) 可以增强其遵循指令的能力。与大多数专注于数据混合的工作不同,我们的研究侧重于从训练过程中的数据采样角度提升模型的能力。借鉴人类学习过程,即通过专注于单一类型主题的练习更容易掌握相似主题的解决方案,我们提出了一种新的指令调优策略,称为 CommonIT:共性感知指令调优。具体来说,我们根据三种提出的指标(任务、嵌入和长度)将指令数据集聚类为不同的组。我们确保每个训练小批次,或称为“分区”,仅包含来自单一组的数据,从而在不同小批次之间引入数据随机性,并在批次内部保持数据相似性。对 LLaMa 模型的严格测试表明,CommonIT 通过 IT 数据集(FLAN、CoT 和 Alpaca)和模型(LLaMa2-7B、Qwen2-7B、LLaMa 13B 和 BLOOM 7B)有效提升了 LLMs 的指令遵循能力。CommonIT 在一般领域(即知识、推理、多语言和编码的平均得分)上通过长度指标实现了 2.1% 的平均提升,在特殊领域(即 GSM、Openfunctions 和 Code)上通过任务指标实现了 5.2% 的提升,在特定任务(即 MMLU)上通过嵌入指标实现了 3.8% 的提升。代码可在 this https URL 获取。
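
CommonIT 的核心机制是"批内同组、批间随机"。下面是一个按长度指标分组并构造 mini-batch 的极简草图(分组粒度与数据均为假设,并非官方实现):

```python
import random

def commonit_batches(dataset, batch_size, group_key, seed=0):
    """先用 group_key 把样本分组,再保证每个 mini-batch 只含同一组的数据:
    批次内部相似、批次之间随机。"""
    rng = random.Random(seed)
    groups = {}
    for ex in dataset:
        groups.setdefault(group_key(ex), []).append(ex)
    batches = []
    for members in groups.values():
        rng.shuffle(members)
        for i in range(0, len(members), batch_size):
            batches.append(members[i:i + batch_size])
    rng.shuffle(batches)  # 打乱批次顺序,使不同组在训练中交替出现
    return batches

# 以长度指标为例:按词数除以 10 分桶
data = [f"instruction {'x ' * n}" for n in range(20)]
batches = commonit_batches(data, batch_size=4,
                           group_key=lambda s: len(s.split()) // 10)
```

把 group_key 换成任务标签或嵌入聚类的簇编号,就分别对应论文中的 Task 与 Embedding 指标。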

[NLP-84] Multilingual Topic Classification in X: Dataset and Analysis EMNLP2024

【速读】: 该论文试图解决在社交媒体中跨语言内容理解和分类的复杂性问题。解决方案的关键在于引入了X-Topic,这是一个包含英语、西班牙语、日语和希腊语四种语言的多语言数据集,专门用于推特主题分类。X-Topic不仅涵盖了广泛的主题,还为跨语言分析、多语言模型的开发以及在线对话的计算研究提供了宝贵的资源。通过利用X-Topic进行全面的跨语言和多语言分析,并比较当前通用和领域特定语言模型的能力,论文展示了其在解决多语言内容分类挑战中的有效性。

链接: https://arxiv.org/abs/2410.03075
作者: Dimosthenis Antypas,Asahi Ushio,Francesco Barbieri,Jose Camacho-Collados
关键词-EN: transcending linguistic boundaries, discussed daily, transcending linguistic, linguistic boundaries, dynamic realm
类目: Computation and Language (cs.CL)
备注: Accepted at EMNLP 2024

点击查看摘要

Abstract:In the dynamic realm of social media, diverse topics are discussed daily, transcending linguistic boundaries. However, the complexities of understanding and categorising this content across various languages remain an important challenge with traditional techniques like topic modelling often struggling to accommodate this multilingual diversity. In this paper, we introduce X-Topic, a multilingual dataset featuring content in four distinct languages (English, Spanish, Japanese, and Greek), crafted for the purpose of tweet topic classification. Our dataset includes a wide range of topics, tailored for social media content, making it a valuable resource for scientists and professionals working on cross-linguistic analysis, the development of robust multilingual models, and computational scientists studying online dialogue. Finally, we leverage X-Topic to perform a comprehensive cross-linguistic and multilingual analysis, and compare the capabilities of current general- and domain-specific language models.
摘要:在社交媒体的动态领域中,每天都有各种主题的讨论跨越语言界限。然而,理解和分类这些多语言内容仍然是一个重要挑战,传统技术如主题建模常常难以适应这种多语言的多样性。本文中,我们引入了 X-Topic,这是一个包含四种不同语言(英语、西班牙语、日语和希腊语)内容的多语言数据集,专门用于推文主题分类。我们的数据集涵盖了广泛的主题,专为社交媒体内容设计,使其成为从事跨语言分析、开发健壮多语言模型以及研究在线对话的计算科学家的宝贵资源。最后,我们利用 X-Topic 进行了一次全面的跨语言和多语言分析,并比较了当前通用和领域特定语言模型的能力。

[NLP-85] Enhancing Short-Text Topic Modeling with LLM-Driven Context Expansion and Prefix-Tuned VAEs EMNLP

【速读】: 该论文试图解决短文本数据在传统主题模型中因词汇共现不足而导致主题提取效果不佳的问题。解决方案的关键在于利用大型语言模型(LLMs)将短文本扩展为更详细的序列,然后通过前缀调优训练一个较小的语言模型,并结合变分自编码器(VAE)进行短文本主题建模,从而显著提升在数据稀疏情况下的主题建模性能。

链接: https://arxiv.org/abs/2410.03071
作者: Pritom Saha Akash,Kevin Chen-Chuan Chang
关键词-EN: uncovering hidden themes, collection of documents, Topic modeling, powerful technique, technique for uncovering
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: EMNLP Findings 2024. arXiv admin note: substantial text overlap with arXiv:2310.15420

点击查看摘要

Abstract:Topic modeling is a powerful technique for uncovering hidden themes within a collection of documents. However, the effectiveness of traditional topic models often relies on sufficient word co-occurrence, which is lacking in short texts. Therefore, existing approaches, whether probabilistic or neural, frequently struggle to extract meaningful patterns from such data, resulting in incoherent topics. To address this challenge, we propose a novel approach that leverages large language models (LLMs) to extend short texts into more detailed sequences before applying topic modeling. To further improve the efficiency and solve the problem of semantic inconsistency from LLM-generated texts, we propose to use prefix tuning to train a smaller language model coupled with a variational autoencoder for short-text topic modeling. Our method significantly improves short-text topic modeling performance, as demonstrated by extensive experiments on real-world datasets with extreme data sparsity, outperforming current state-of-the-art topic models.
摘要:主题建模是一种强大的技术,用于揭示文档集合中的隐藏主题。然而,传统主题模型的有效性通常依赖于足够的词语共现,这在短文本中往往缺失。因此,现有的方法,无论是概率性的还是神经网络的,经常难以从这些数据中提取有意义的模式,导致主题不连贯。为了解决这一挑战,我们提出了一种新方法,利用大语言模型 (LLMs) 将短文本扩展为更详细的序列,然后再进行主题建模。为了进一步提高效率并解决由 LLM 生成文本带来的语义不一致问题,我们提出使用前缀调优来训练一个较小的语言模型,并结合变分自编码器进行短文本主题建模。我们的方法显著提高了短文本主题建模的性能,如在具有极端数据稀疏性的真实世界数据集上的广泛实验所示,优于当前最先进的主题模型。
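
该方法的第一步是用 LLM 把短文本扩写成更长的序列,以缓解词共现稀疏。下面用一个打桩的扩写函数示意这一步如何增加可用的共现统计(llm_expand 为假设的占位实现,真实系统中应替换为 LLM 调用):

```python
from itertools import combinations
from collections import Counter

def llm_expand(short_text: str) -> str:
    # 占位:真实实现应调用 LLM 对短文本做细化扩写;
    # 这里仅用固定模板模拟"扩写后包含更多相关词"的效果
    return short_text + " related background details about " + short_text

def cooccurrence_pairs(docs):
    """统计文档内的无序词对共现次数(主题模型依赖的基本统计量)。"""
    counts = Counter()
    for doc in docs:
        words = sorted(set(doc.lower().split()))
        counts.update(combinations(words, 2))
    return counts

short_docs = ["solar power", "wind power"]
expanded_docs = [llm_expand(d) for d in short_docs]
# 扩写后每篇文档更长,词对共现数量随之增加
gain = sum(cooccurrence_pairs(expanded_docs).values()) \
    - sum(cooccurrence_pairs(short_docs).values())
```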

[NLP-86] DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models EMNLP2024

【速读】: 该论文试图解决视觉文档理解(VDU)模型在小样本情况下泛化能力不足的问题。解决方案的关键在于提出了一种名为DocKD的新框架,通过整合外部文档知识来丰富数据生成过程,从而提升知识蒸馏的效果。具体来说,DocKD通过向大型语言模型(LLM)提供文档中的关键-值对、布局和描述等元素,生成高质量的文档标注数据,使得学生VDU模型在训练后不仅在同领域任务中表现优异,而且在跨领域任务中显著超越了仅使用人类标注数据训练的模型。

链接: https://arxiv.org/abs/2410.03061
作者: Sungnyun Kim,Haofu Liao,Srikar Appalaraju,Peng Tang,Zhuowen Tu,Ravi Kumar Satzoda,R. Manmatha,Vijay Mahadevan,Stefano Soatto
关键词-EN: Visual document understanding, text and image, involves understanding documents, Visual document, involves understanding
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted to EMNLP 2024

点击查看摘要

Abstract:Visual document understanding (VDU) is a challenging task that involves understanding documents across various modalities (text and image) and layouts (forms, tables, etc.). This study aims to enhance generalizability of small VDU models by distilling knowledge from LLMs. We identify that directly prompting LLMs often fails to generate informative and useful data. In response, we present a new framework (called DocKD) that enriches the data generation process by integrating external document knowledge. Specifically, we provide an LLM with various document elements like key-value pairs, layouts, and descriptions, to elicit open-ended answers. Our experiments show that DocKD produces high-quality document annotations and surpasses the direct knowledge distillation approach that does not leverage external document knowledge. Moreover, student VDU models trained with solely DocKD-generated data are not only comparable to those trained with human-annotated data on in-domain tasks but also significantly excel them on out-of-domain tasks.
摘要:视觉文档理解 (Visual Document Understanding, VDU) 是一项具有挑战性的任务,涉及理解跨多种模态(文本和图像)和布局(如表格、表单等)的文档。本研究旨在通过从大语言模型 (Large Language Model, LLM) 中提取知识来增强小型 VDU 模型的泛化能力。我们发现,直接提示 LLM 往往无法生成信息丰富且有用的数据。为此,我们提出了一种新的框架(称为 DocKD),通过整合外部文档知识来丰富数据生成过程。具体而言,我们向 LLM 提供各种文档元素,如键值对、布局和描述,以引出开放式答案。我们的实验表明,DocKD 生成的文档标注质量高,并且超越了不利用外部文档知识的直接知识蒸馏方法。此外,仅使用 DocKD 生成数据训练的学生 VDU 模型不仅在域内任务上与使用人工标注数据训练的模型相当,而且在域外任务上显著优于它们。
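
DocKD 的关键在于把文档元素(键值对、布局、描述)一并提供给教师 LLM,而不是直接提示。下面的提示词模板仅为示意:元素名称来自论文描述,模板措辞为笔者假设,LLM 调用本身略去:

```python
def build_dockd_prompt(kv_pairs: dict, layout: str, description: str) -> str:
    """把文档元素拼装成引导教师 LLM 生成开放式问答标注的提示词。"""
    kv_text = "\n".join(f"- {k}: {v}" for k, v in kv_pairs.items())
    return (
        "You are annotating a document for training a document-understanding model.\n"
        f"Document type: {description}\n"
        f"Layout: {layout}\n"
        "Extracted key-value pairs:\n"
        f"{kv_text}\n"
        "Generate an open-ended question a user might ask about this document, "
        "followed by its answer."
    )

prompt = build_dockd_prompt(
    kv_pairs={"Invoice No.": "INV-0042", "Total": "$118.00"},
    layout="two-column form with a summary table",
    description="invoice",
)
```

将 prompt 交给教师 LLM 即可得到用于蒸馏的问答标注,这对应论文中"整合外部文档知识来约束数据生成"的做法。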

[NLP-87] Scalable Frame-based Construction of Sociocultural NormBases for Socially-Aware Dialogues

【速读】: 该论文试图解决如何构建一个可扩展的社会文化规范(Sociocultural Norm, SCN)数据库,以支持社交感知对话系统的问题。解决方案的关键在于利用大型语言模型(LLMs)通过社交感知对话和上下文框架来约束生成过程,从而减少幻觉并提取高质量的自然语言规范陈述。论文提出使用合成数据来替代真实对话标注的金标准框架,并展示了从合成数据中提取的SCN质量与真实对话标注的金标准框架相当,且从真实数据中提取的SCN质量在有框架标注的情况下优于无框架标注的情况。此外,论文还验证了提取的SCN在基于检索增强生成(RAG)模型中的有效性,用于推理多个下游对话任务。

链接: https://arxiv.org/abs/2410.03049
作者: Shilin Qu,Weiqing Wang,Xin Zhou,Haolan Zhan,Zhuang Li,Lizhen Qu,Linhao Luo,Yuan-Fang Li,Gholamreza Haffari
关键词-EN: conversational information retrieval, including conversational information, retrieval-enhanced machine learning, Sociocultural norms serve, contextual information retrieval
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 17 pages

点击查看摘要

Abstract:Sociocultural norms serve as guiding principles for personal conduct in social interactions, emphasizing respect, cooperation, and appropriate behavior, which can benefit tasks including conversational information retrieval, contextual information retrieval and retrieval-enhanced machine learning. We propose a scalable approach for constructing a Sociocultural Norm (SCN) Base using Large Language Models (LLMs) for socially aware dialogues. We construct a comprehensive and publicly accessible Chinese Sociocultural NormBase. Our approach utilizes socially aware dialogues, enriched with contextual frames, as the primary data source to constrain the generating process and reduce hallucinations. This enables the extraction of high-quality and nuanced natural-language norm statements, leveraging the pragmatic implications of utterances with respect to the situation. As real dialogues annotated with gold frames are not readily available, we propose using synthetic data. Our empirical results show: (i) the quality of the SCNs derived from synthetic data is comparable to that from real dialogues annotated with gold frames, and (ii) the quality of the SCNs extracted from real data, annotated with either silver (predicted) or gold frames, surpasses that without the frame annotations. We further show the effectiveness of the extracted SCNs in a RAG-based (Retrieval-Augmented Generation) model to reason about multiple downstream dialogue tasks.
摘要:社会文化规范作为社会互动中个人行为的指导原则,强调尊重、合作和适当行为,这些规范能够有益于包括对话信息检索、情境信息检索和检索增强机器学习在内的任务。我们提出了一种利用大语言模型 (LLM) 构建社会文化规范 (SCN) 库的可扩展方法,用于社会意识对话。我们构建了一个全面且公开可访问的中文社会文化规范库。我们的方法以富含情境框架的社会意识对话作为主要数据源,以约束生成过程并减少幻觉。这使得能够提取高质量且细致的自然语言规范陈述,利用了话语在情境中的语用含义。由于带有金标框架标注的真实对话不易获得,我们提出使用合成数据。我们的实证结果显示:(i) 从合成数据中提取的 SCN 质量与从带有金标框架标注的真实对话中提取的 SCN 质量相当,以及 (ii) 从真实数据中提取的 SCN,无论是带有银标 (预测) 还是金标框架标注,其质量均优于无框架标注的 SCN。我们进一步展示了提取的 SCN 在基于 RAG (检索增强生成) 模型中的有效性,用于推理多个下游对话任务。

[NLP-88] Geometry is All You Need: A Unified Taxonomy of Matrix and Tensor Factorization for Compression of Generative Language Models

【速读】: 该论文试图解决矩阵和张量在自然语言处理(NLP)模型参数化中的应用缺乏系统性和统一性的问题。解决方案的关键在于提出了一个统一的分类法,通过引入线性代数中的子空间概念,将矩阵/张量压缩方法与机器学习(ML)和NLP中的模型压缩概念结合起来。具体来说,论文利用子空间这一几何代数的核心概念,重新定义了矩阵/张量和ML/NLP中的相关概念(如注意力机制),使得典型的矩阵和张量分解算法可以被解释为几何变换。通过这种方式,论文不仅统一了这些方法的理论基础,还指出了当前研究中的空白和潜在的改进方向。

链接: https://arxiv.org/abs/2410.03040
作者: Mingxue Xu,Sadia Sharmin,Danilo P. Mandic
关键词-EN: Natural Language Processing, Language Processing, Natural Language, model systematic efficiency, language model parametrization
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Numerical Analysis (math.NA)
备注:

点击查看摘要

Abstract:Matrix and tensor-guided parametrization for Natural Language Processing (NLP) models is fundamentally useful for the improvement of the model’s systematic efficiency. However, the internal links between these two algebra structures and language model parametrization are poorly understood. Also, the existing matrix and tensor research is math-heavy and far away from machine learning (ML) and NLP research concepts. These two issues result in the recent progress on matrices and tensors for model parametrization being more like a loose collection of separate components from matrix/tensor and NLP studies, rather than a well-structured unified approach, further hindering algorithm design. To this end, we propose a unified taxonomy, which bridges the matrix/tensor compression approaches and model compression concepts in ML and NLP research. Namely, we adopt an elementary concept in linear algebra, that of a subspace, which is also the core concept in geometric algebra, to reformulate the matrix/tensor and ML/NLP concepts (e.g. attention mechanism) under one umbrella. In this way, based on our subspace formalization, typical matrix and tensor decomposition algorithms can be interpreted as geometric transformations. Finally, we revisit recent literature on matrix- or tensor-guided language model compression, rephrase and compare their core ideas, and then point out the current research gap and potential solutions.
摘要:矩阵和张量引导的参数化方法在自然语言处理 (NLP) 模型中对提升模型的系统效率具有根本性的作用。然而,这两种代数结构与语言模型参数化之间的内在联系尚未被充分理解。此外,现有的矩阵和张量研究侧重于数学层面,与机器学习 (ML) 和 NLP 研究的概念相距甚远。这两个问题导致近期在模型参数化中使用矩阵和张量的进展更像是矩阵/张量研究与 NLP 研究中各个独立部分的松散集合,而非结构化的统一方法,进一步阻碍了算法设计。为此,我们提出了一种统一的分类法,该分类法连接了矩阵/张量压缩方法与 ML 和 NLP 研究中的模型压缩概念。具体来说,我们采用了线性代数中的一个基本概念——子空间,这也是几何代数的核心概念,来重新表述矩阵/张量和 ML/NLP 概念(例如注意力机制),将其置于一个统一的框架下。基于我们的子空间形式化,典型的矩阵和张量分解算法可以被解释为几何变换。最后,我们回顾了近期关于矩阵或张量引导的语言模型压缩的文献,重新表述并比较了它们的核心思想,并指出了当前研究的差距和潜在的解决方案。
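
在该文的子空间视角下,典型的矩阵分解压缩相当于把权重矩阵投影到低维子空间。以截断 SVD 为例(通用示意,并非论文提出的具体算法):

```python
import numpy as np

def low_rank_compress(W: np.ndarray, rank: int):
    """截断 SVD:只保留前 rank 个奇异方向,
    即把 W 投影到一个 rank 维子空间(文中的几何解释)。"""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # (m, rank)
    B = Vt[:rank, :]             # (rank, n)
    return A, B

rng = np.random.default_rng(0)
# 构造一个近似 rank-4 的权重矩阵(低秩信号加小噪声)
W = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 64)) \
    + 0.01 * rng.normal(size=(64, 64))
A, B = low_rank_compress(W, rank=4)
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
params_saved = 1 - (A.size + B.size) / W.size
```

用 W ≈ A @ B 替代原矩阵后,参数量从 m·n 降到 (m+n)·r,前向计算相应变为 x @ A @ B。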

[NLP-89] Disentangling Textual and Acoustic Features of Neural Speech Representations

【速读】: 该论文试图解决神经语音模型内部复杂表示中内容信息与声学特征的纠缠问题,特别是在实际应用中可能涉及隐私风险的声学特征(如性别或说话者身份)的编码问题。解决方案的关键在于基于信息瓶颈原理提出了一种解纠缠框架,将复杂的语音表示分离为两个独立组件:一个编码内容信息(即可转录为文本的信息),另一个编码与特定下游任务相关的声学特征。通过在情感识别和说话者识别任务中应用和评估该框架,量化了各模型层中文本和声学特征的贡献,并探索了该框架作为归因方法用于识别最具代表性的语音帧表示。

链接: https://arxiv.org/abs/2410.03037
作者: Hosein Mohebbi,Grzegorz Chrupała,Willem Zuidema,Afra Alishahi,Ivan Titov
关键词-EN: deeply entangled internal, Neural speech models, build deeply entangled, entangled internal representations, fundamental frequency
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Neural speech models build deeply entangled internal representations, which capture a variety of features (e.g., fundamental frequency, loudness, syntactic category, or semantic content of a word) in a distributed encoding. This complexity makes it difficult to track the extent to which such representations rely on textual and acoustic information, or to suppress the encoding of acoustic features that may pose privacy risks (e.g., gender or speaker identity) in critical, real-world applications. In this paper, we build upon the Information Bottleneck principle to propose a disentanglement framework that separates complex speech representations into two distinct components: one encoding content (i.e., what can be transcribed as text) and the other encoding acoustic features relevant to a given downstream task. We apply and evaluate our framework to emotion recognition and speaker identification downstream tasks, quantifying the contribution of textual and acoustic features at each model layer. Additionally, we explore the application of our disentanglement framework as an attribution method to identify the most salient speech frame representations from both the textual and acoustic perspectives.
摘要:神经语音模型构建了深度纠缠的内部表示,这些表示以分布式编码方式捕捉了多种特征(例如,基频、响度、句法类别或词语的语义内容)。这种复杂性使得难以追踪这些表示在多大程度上依赖于文本和声学信息,或在关键的实际应用中抑制可能带来隐私风险的声学特征编码(例如,性别或说话者身份)。本文基于信息瓶颈原理,提出了一种解耦框架,将复杂的语音表示分离为两个独立的部分:一个编码内容(即可以转录为文本的内容),另一个编码与给定下游任务相关的声学特征。我们将该框架应用于情感识别和说话者识别的下游任务,量化了每个模型层中文本和声学特征的贡献。此外,我们探索了将解耦框架作为一种归因方法,以识别从文本和声学角度来看最具显著性的语音帧表示。

[NLP-90] MLP-KAN: Unifying Deep Representation and Function Learning

【速读】: 该论文试图解决在人工智能领域中,用户需要根据数据集特性手动选择使用表示学习模型还是函数学习模型的问题。解决方案的关键在于引入MLP-KAN,这是一种统一的方法,通过将多层感知机(MLPs)用于表示学习与Kolmogorov-Arnold网络(KANs)用于函数学习集成在混合专家(MoE)架构中,实现对任务特性的动态适应,从而无需手动选择模型。该方法嵌入在基于Transformer的框架中,通过在多个广泛使用的数据集上的实验验证,展示了其在深度表示学习和函数学习任务中的优越性能和广泛适用性。

链接: https://arxiv.org/abs/2410.03027
作者: Yunhong He,Yifeng Xie,Zhengqing Yuan,Lichao Sun
关键词-EN: demonstrated substantial promise, Recent advancements, function learning, artificial intelligence, demonstrated substantial
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advancements in both representation learning and function learning have demonstrated substantial promise across diverse domains of artificial intelligence. However, the effective integration of these paradigms poses a significant challenge, particularly in cases where users must manually decide whether to apply a representation learning or function learning model based on dataset characteristics. To address this issue, we introduce MLP-KAN, a unified method designed to eliminate the need for manual model selection. By integrating Multi-Layer Perceptrons (MLPs) for representation learning and Kolmogorov-Arnold Networks (KANs) for function learning within a Mixture-of-Experts (MoE) architecture, MLP-KAN dynamically adapts to the specific characteristics of the task at hand, ensuring optimal performance. Embedded within a transformer-based framework, our work achieves remarkable results on four widely-used datasets across diverse domains. Extensive experimental evaluation demonstrates its superior versatility, delivering competitive performance across both deep representation and function learning tasks. These findings highlight the potential of MLP-KAN to simplify the model selection process, offering a comprehensive, adaptable solution across various domains. Our code and weights are available at this https URL.
摘要:近年来,表示学习 (representation learning) 和功能学习 (function learning) 的最新进展在人工智能的多个领域展示了巨大的潜力。然而,这些范式的有效整合提出了一个重大挑战,特别是在用户必须根据数据集特征手动决定应用表示学习模型还是功能学习模型的情况下。为了解决这一问题,我们引入了 MLP-KAN,一种旨在消除手动模型选择需求的统一方法。通过在混合专家 (Mixture-of-Experts, MoE) 架构中集成多层感知器 (Multi-Layer Perceptrons, MLPs) 用于表示学习,以及 Kolmogorov-Arnold 网络 (Kolmogorov-Arnold Networks, KANs) 用于功能学习,MLP-KAN 能够动态适应当前任务的具体特征,确保最佳性能。嵌入在基于 Transformer 的框架中,我们的工作在跨多个领域的四个广泛使用的数据集上取得了显著成果。广泛的实验评估证明了其卓越的多功能性,在深度表示和功能学习任务中均提供了具有竞争力的性能。这些发现突显了 MLP-KAN 简化模型选择过程的潜力,为各个领域提供了一个全面且适应性强的解决方案。我们的代码和权重可在 this https URL 获取。
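
MLP-KAN 的路由思想可以用一个两专家的玩具 MoE 说明:门控网络根据输入决定"表示学习专家"(MLP)与"函数学习专家"(KAN,这里用 sin 函数代替样条)各自的权重。所有数值均为示意,与论文实现无关:

```python
import math

def mlp_expert(x: float) -> float:
    # 玩具"表示学习"专家:固定的线性 + ReLU 单元
    return max(0.0, 0.8 * x + 0.1)

def kan_expert(x: float) -> float:
    # 玩具"函数学习"专家:用 sin 代替 KAN 的可学习样条
    return math.sin(x)

def moe(x: float, gate_w=(1.5, -1.5)) -> float:
    """softmax 门控混合两个专家的输出;MLP-KAN 中门控权重是学到的,
    使路由能随任务特性自适应(此处权重仅为演示)。"""
    logits = [gate_w[0] * x, gate_w[1] * x]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    g = [e / z for e in exps]
    return g[0] * mlp_expert(x) + g[1] * kan_expert(x)
```

当输入落在某个专家擅长的区域时,门控趋于饱和,输出几乎完全由该专家决定;这就是"无需手动选择模型"的直觉来源。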

[NLP-91] Characterizing Context Influence and Hallucination in Summarization

【速读】: 该论文试图解决大型语言模型(LLMs)在生成内容时可能产生的两个主要问题:幻觉(生成与上下文信息相矛盾的内容)和隐私泄露(无意中泄露输入信息)。解决方案的关键在于引入了一种新的概念——上下文影响(Context Influence),并提出了上下文影响解码(Context-Influence Decoding, CID)方法。通过分析和实验,论文展示了增强上下文(通过排除先验知识)和上下文与先验知识分布不一致时,上下文对LLM的影响会增加。此外,上下文影响为CID的隐私信息泄露提供了一个下限。实验结果表明,通过改进解码方法,可以在提高生成质量的同时,显著增加上下文的影响,从而在一定程度上控制幻觉和隐私泄露问题。

链接: https://arxiv.org/abs/2410.03026
作者: James Flemings,Wanrong Zhang,Bo Jiang,Zafar Takhirov,Murali Annavaram
关键词-EN: Large Language Models, Large Language, numerous downstream tasks, achieved remarkable performance, Language Models
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Although Large Language Models (LLMs) have achieved remarkable performance in numerous downstream tasks, their ubiquity has raised two significant concerns. One is that LLMs can hallucinate by generating content that contradicts relevant contextual information; the other is that LLMs can inadvertently leak private information due to input regurgitation. Many prior works have extensively studied each concern independently, but none have investigated them simultaneously. Furthermore, auditing the influence of provided context during open-ended generation with a privacy emphasis is understudied. To this end, we comprehensively characterize the influence and hallucination of contextual information during summarization. We introduce a definition for context influence and Context-Influence Decoding (CID), and then we show that amplifying the context (by factoring out prior knowledge) and the context being out of distribution with respect to prior knowledge increases the context’s influence on an LLM. Moreover, we show that context influence gives a lower bound of the private information leakage of CID. We corroborate our analytical findings with experimental evaluations that show improving the F1 ROUGE-L score on CNN-DM for LLaMA 3 by 10% over regular decoding also leads to 1.5x more influence by the context. Moreover, we empirically evaluate how context influence and hallucination are affected by (1) model capacity, (2) context size, (3) the length of the current response, and (4) different token n-grams of the context. Our code can be accessed here: this https URL.
摘要:尽管大语言模型 (LLMs) 在众多下游任务中取得了显著的性能,但其广泛应用也引发了两个重大问题。一是 LLMs 可能会通过生成与相关上下文信息相矛盾的内容而产生幻觉;二是 LLMs 可能会因输入信息的重复而无意中泄露隐私信息。许多先前的工作分别广泛研究了这两个问题,但尚未有研究同时探讨这两个问题。此外,在开放式生成过程中,对提供上下文的影响进行隐私导向的审计研究尚不充分。为此,我们全面分析了在摘要生成过程中上下文信息的影响和幻觉现象。我们提出了上下文影响的定义和上下文影响解码 (CID) 的概念,并证明通过排除先验知识来增强上下文,以及上下文与先验知识分布不一致,都会增加上下文对 LLM 的影响。此外,我们表明上下文影响为 CID 的隐私信息泄露提供了一个下限。我们通过实验验证了分析结果,显示在 CNN-DM 数据集上,通过改进 LLaMA 3 的 F1 ROUGE-L 评分,使其比常规解码提高 10%,同时导致上下文影响增加了 1.5 倍。此外,我们还实证评估了上下文影响和幻觉如何受到以下因素的影响:(1) 模型容量,(2) 上下文大小,(3) 当前响应的长度,以及 (4) 上下文中不同 Token n-gram 的影响。我们的代码可以通过以下链接访问:this https URL。
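
上下文影响解码 (CID) 的直觉是:对比有/无上下文两种输出分布,并朝上下文方向外推放大;上下文影响本身则可用两种分布之间的散度来度量。下面按对比式解码的一般形式给出一个玩具示意(α 与 logits 均为假设,精确定义以论文为准):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q):
    """KL 散度:衡量加入上下文前后输出分布的偏移,可作为影响度量。"""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def context_influence_decode(logits_ctx, logits_no_ctx, alpha=1.0):
    """对比放大:adjusted = logits_ctx + alpha * (logits_ctx - logits_no_ctx),
    即"排除先验知识"、朝上下文方向外推。"""
    return softmax([lc + alpha * (lc - ln)
                    for lc, ln in zip(logits_ctx, logits_no_ctx)])

# 玩具词表共 3 个 Token:上下文把概率质量推向 Token 0
logits_ctx, logits_no_ctx = [2.0, 1.0, 0.0], [1.0, 1.0, 1.0]
p_cid = context_influence_decode(logits_ctx, logits_no_ctx, alpha=1.0)
influence = kl(softmax(logits_ctx), softmax(logits_no_ctx))
```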

[NLP-92] Is Your Paper Being Reviewed by an LLM? Investigating AI Text Detectability in Peer Review

【速读】: 该论文试图解决现有AI文本检测算法在区分人类撰写的同行评审与由先进大型语言模型(如GPT-4o)生成的评审之间存在的不足。解决方案的关键在于提出一种新的检测方法,该方法在低误报率的情况下,能够更准确地识别由GPT-4o生成的同行评审,从而应对生成式AI在学术评审中的潜在滥用问题。

链接: https://arxiv.org/abs/2410.03019
作者: Sungduk Yu,Man Luo,Avinash Madasu,Vasudev Lal,Phillip Howard
关键词-EN: published scientific research, peer review process, scientific research, ensuring the integrity, integrity of published
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Peer review is a critical process for ensuring the integrity of published scientific research. Confidence in this process is predicated on the assumption that experts in the relevant domain give careful consideration to the merits of manuscripts which are submitted for publication. With the recent rapid advancements in the linguistic capabilities of large language models (LLMs), a new potential risk to the peer review process is that negligent reviewers will rely on LLMs to perform the often time consuming process of reviewing a paper. In this study, we investigate the ability of existing AI text detection algorithms to distinguish between peer reviews written by humans and different state-of-the-art LLMs. Our analysis shows that existing approaches fail to identify many GPT-4o written reviews without also producing a high number of false positive classifications. To address this deficiency, we propose a new detection approach which surpasses existing methods in the identification of GPT-4o written peer reviews at low levels of false positive classifications. Our work reveals the difficulty of accurately identifying AI-generated text at the individual review level, highlighting the urgent need for new tools and methods to detect this type of unethical application of generative AI.
摘要:同行评审是确保已发表科学研究完整性的关键过程。对这一过程的信任基于这样一个假设:相关领域的专家会仔细评估提交出版的手稿的优点。随着大语言模型 (LLM) 在语言能力方面的近期快速进步,同行评审过程面临一个新的潜在风险,即疏忽的评审者可能会依赖 LLM 来执行通常耗时的论文评审过程。在本研究中,我们调查了现有的 AI 文本检测算法区分由人类撰写的同行评审和不同最先进 LLM 撰写的同行评审的能力。我们的分析表明,现有方法无法在不产生大量误报的情况下识别许多由 GPT-4o 撰写的评审。为了解决这一缺陷,我们提出了一种新的检测方法,该方法在低误报率的情况下,在识别由 GPT-4o 撰写的同行评审方面超越了现有方法。我们的工作揭示了在个体评审层面准确识别 AI 生成文本的难度,强调了迫切需要新的工具和方法来检测这种生成式 AI 的不道德应用。

[NLP-93] Tutor CoPilot: A Human-AI Approach for Scaling Real-Time Expertise

【速读】: 该论文试图解决在教育领域中,由于专家资源有限导致的教育质量提升困难的问题,尤其是在服务不足的社区中。解决方案的关键是引入Tutor CoPilot,这是一种人机协作系统,利用专家思维模型为辅导员提供类似专家的指导,从而提高辅导员在实际辅导中的效果。通过随机对照试验,研究发现使用Tutor CoPilot的辅导员所辅导的学生在掌握知识点上显著提高,尤其是在低评分的辅导员中效果更为显著。此外,Tutor CoPilot的成本效益高,每年每位辅导员仅需20美元,且能促进辅导员采用更高质量的教学策略。

链接: https://arxiv.org/abs/2410.03017
作者: Rose E. Wang,Ana T. Ribeiro,Carly D. Robinson,Susanna Loeb,Dora Demszky
关键词-EN: Tutor CoPilot, Language Models, Tutor, tutors, societal impact
类目: Computation and Language (cs.CL)
备注: Our pre-registration for this randomized controlled trial can be found here: this https URL

点击查看摘要

Abstract:Generative AI, particularly Language Models (LMs), has the potential to transform real-world domains with societal impact, particularly where access to experts is limited. For example, in education, training novice educators with expert guidance is important for effectiveness but expensive, creating significant barriers to improving education quality at scale. This challenge disproportionately harms students from under-served communities, who stand to gain the most from high-quality education. We introduce Tutor CoPilot, a novel Human-AI approach that leverages a model of expert thinking to provide expert-like guidance to tutors as they tutor. This study is the first randomized controlled trial of a Human-AI system in live tutoring, involving 900 tutors and 1,800 K-12 students from historically under-served communities. Following a preregistered analysis plan, we find that students working with tutors that have access to Tutor CoPilot are 4 percentage points (p.p.) more likely to master topics (p<0.01). Notably, students of lower-rated tutors experienced the greatest benefit, improving mastery by 9 p.p. We find that Tutor CoPilot costs only $20 per tutor annually. We analyze 550,000+ messages using classifiers to identify pedagogical strategies, and find that tutors with access to Tutor CoPilot are more likely to use high-quality strategies to foster student understanding (e.g., asking guiding questions) and less likely to give away the answer to the student. Tutor interviews highlight how Tutor CoPilot’s guidance helps tutors to respond to student needs, though they flag issues in Tutor CoPilot, such as generating suggestions that are not grade-level appropriate. Altogether, our study of Tutor CoPilot demonstrates how Human-AI systems can scale expertise in real-world domains, bridge gaps in skills and create a future where high-quality education is accessible to all students.
摘要:生成式 AI,特别是语言模型 (Language Models, LMs),具有改变具有社会影响力的现实世界领域的潜力,尤其是在专家资源有限的情况下。例如,在教育领域,用专家指导培训新手教育者对于提高教育效果至关重要,但成本高昂,这为大规模提升教育质量设置了重大障碍。这一挑战对来自服务不足社区的学生尤为不利,他们最需要高质量的教育。我们引入了 Tutor CoPilot,这是一种新颖的人机协作方法,利用专家思维模型为辅导员提供类似专家的指导。本研究是首个在实际辅导中进行的人机协作系统的随机对照试验,涉及 900 名辅导员和 1,800 名来自历史上服务不足社区的 K-12 学生。根据预先注册的分析计划,我们发现,使用 Tutor CoPilot 的辅导员辅导的学生在掌握主题方面高出 4 个百分点 (p.p.)(p<0.01)。值得注意的是,评分较低的辅导员辅导的学生受益最大,掌握程度提高了 9 个百分点。我们发现,Tutor CoPilot 每年每位辅导员的成本仅为 20 美元。我们分析了 550,000 多条消息,使用分类器识别教学策略,发现使用 Tutor CoPilot 的辅导员更倾向于采用高质量的策略来促进学生理解(例如,提出引导性问题),并且更少直接给出答案。辅导员访谈强调了 Tutor CoPilot 的指导如何帮助辅导员回应学生需求,尽管他们指出了 Tutor CoPilot 的一些问题,例如生成不适合年级水平的建议。总的来说,我们对 Tutor CoPilot 的研究展示了人机协作系统如何在现实世界领域中扩展专业知识,弥合技能差距,并创造一个让所有学生都能获得高质量教育的未来。

[NLP-94] Can Transformers Learn n-gram Language Models?

【速读】: 该论文试图解决的问题是如何将Transformer模型在形式语言上的理论能力与其在实际任务中的表现联系起来。解决方案的关键在于将Transformer与n-gram语言模型(LMs)进行关联,并通过实验比较Transformer与传统n-gram LMs估计技术(如 add-λ 平滑)在不同类型的随机n-gram LMs上的表现。研究发现,在具有任意下一个符号概率的n-gram LMs上,传统技术优于Transformer;而在参数共享的n-gram LMs上,Transformer表现更优,甚至超过了专门设计用于学习n-gram LMs的方法。

链接: https://arxiv.org/abs/2410.03001
作者: Anej Svete,Nadav Borenstein,Mike Zhou,Isabelle Augenstein,Ryan Cotterell
关键词-EN: represent formal languages, gram LMs, gram language models, formal languages, represent formal
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Much theoretical work has described the ability of transformers to represent formal languages. However, linking theoretical results to empirical performance is not straightforward due to the complex interplay between the architecture, the learning algorithm, and training data. To test whether theoretical lower bounds imply learnability of formal languages, we turn to recent work relating transformers to n-gram language models (LMs). We study transformers’ ability to learn random n-gram LMs of two kinds: ones with arbitrary next-symbol probabilities and ones where those are defined with shared parameters. We find that classic estimation techniques for n-gram LMs such as add-λ smoothing outperform transformers on the former, while transformers perform better on the latter, outperforming methods specifically designed to learn n-gram LMs.
摘要:大量理论研究已经描述了 Transformer 能够表示形式语言的能力。然而,由于架构、学习算法和训练数据之间复杂的相互作用,将理论结果与实际性能联系起来并不直接。为了测试理论下界是否意味着形式语言的可学习性,我们转向了最近将 Transformer 与 n-gram 语言模型 (LM) 相关联的工作。我们研究了 Transformer 学习两种随机 n-gram LM 的能力:一种是具有任意下一个符号概率的模型,另一种是这些概率由共享参数定义的模型。我们发现,对于前者,经典的 n-gram LM 估计技术,如加 λ 平滑,优于 Transformer;而对于后者,Transformer 表现更好,超过了专门设计用于学习 n-gram LM 的方法。
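
作为补充说明,摘要中提到的加 λ 平滑对应的估计公式为 P(w|h) = (c(h,w)+λ)/(c(h)+λ|V|)。下面是一个最小的 Python 示意(仅用于说明这一经典估计技术本身,与论文的实验实现无关):

```python
from collections import Counter

def add_lambda_ngram(corpus, n=2, lam=1.0):
    """add-λ 平滑的 n-gram 估计:P(w|h) = (c(h,w)+λ) / (c(h)+λ|V|)。"""
    vocab = sorted(set(corpus))
    ngram_counts = Counter(tuple(corpus[i:i + n]) for i in range(len(corpus) - n + 1))
    context_counts = Counter(tuple(corpus[i:i + n - 1]) for i in range(len(corpus) - n + 1))

    def prob(history, word):
        h = tuple(history)
        # Counter 对未出现的键返回 0,因此未见过的 n-gram 自动获得 λ 的伪计数
        return (ngram_counts[h + (word,)] + lam) / (context_counts[h] + lam * len(vocab))

    return prob

prob = add_lambda_ngram(list("abab"), n=2, lam=1.0)
# 例:P('b'|'a') = (2+1)/(2+2) = 0.75,且对词表求和为 1
```

对固定上下文,所有下一个符号的概率之和恰为 1,这正是论文中作为对照基线的"任意下一个符号概率"设定下的经典估计方式。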

[NLP-95] Guided Stream of Search: Learning to Better Search with Language Models via Optimal Path Guidance

【速读】: 该论文试图解决语言模型在需要复杂规划和推理的任务中表现不佳的问题。解决方案的关键在于提出了一种名为“guided stream of search (GSoS)”的方法,通过逐步将最优解融入语言模型的自我生成过程中,生成高质量的搜索轨迹,并通过监督微调将这些轨迹提炼到预训练模型中。这种方法显著提升了语言模型在Countdown数学推理任务中的搜索和规划能力,并且与强化学习微调结合时效果更佳,相比仅利用子目标奖励的最优解方法更为有效。

链接: https://arxiv.org/abs/2410.02992
作者: Seungyong Moon,Bumsoo Park,Hyun Oh Song
关键词-EN: demonstrated impressive capabilities, require complex planning, language models, optimal solutions, demonstrated impressive
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While language models have demonstrated impressive capabilities across a range of tasks, they still struggle with tasks that require complex planning and reasoning. Recent studies have proposed training language models on search processes rather than optimal solutions, resulting in better generalization performance even though search processes are noisy and even suboptimal. However, these studies overlook the value of optimal solutions, which can serve as step-by-step landmarks to guide more effective search. In this work, we explore how to leverage optimal solutions to enhance the search and planning abilities of language models. To this end, we propose guided stream of search (GSoS), which seamlessly incorporates optimal solutions into the self-generation process in a progressive manner, producing high-quality search trajectories. These trajectories are then distilled into the pre-trained model via supervised fine-tuning. Our approach significantly enhances the search and planning abilities of language models on Countdown, a simple yet challenging mathematical reasoning task. Notably, combining our method with RL fine-tuning yields further improvements, whereas previous supervised fine-tuning methods do not benefit from RL. Furthermore, our approach exhibits greater effectiveness than leveraging optimal solutions in the form of subgoal rewards.
摘要:尽管语言模型在多种任务中展现了令人印象深刻的能力,但在需要复杂规划和推理的任务上仍显不足。近期研究提出,通过训练语言模型处理搜索过程而非最优解,即使搜索过程嘈杂且可能次优,也能获得更好的泛化性能。然而,这些研究忽视了最优解的价值,最优解可以作为逐步的里程碑,指导更有效的搜索。在本研究中,我们探讨了如何利用最优解来增强语言模型的搜索和规划能力。为此,我们提出了引导搜索流 (Guided Stream of Search, GSoS),该方法以渐进的方式将最优解无缝融入自生成过程中,生成高质量的搜索轨迹。这些轨迹随后通过监督微调提炼到预训练模型中。我们的方法显著提升了语言模型在倒计时 (Countdown) 这一简单但具挑战性的数学推理任务上的搜索和规划能力。值得注意的是,将我们的方法与强化学习 (RL) 微调结合,进一步提升了性能,而之前的监督微调方法并未从强化学习中获益。此外,我们的方法在利用最优解作为子目标奖励方面表现出更高的有效性。

[NLP-96] Coal Mining Question Answering with LLMs

【速读】: 该论文试图解决煤炭开采领域中复杂、动态的技术问题问答难题。解决方案的关键在于采用多轮提示工程框架,结合大型语言模型(如GPT-4),通过将复杂查询分解为结构化组件,提升模型处理技术信息的精度和相关性。实验结果表明,该方法显著提高了问答系统的准确性和上下文相关性,平均准确率提升15-18%,并在GPT-4评分中取得显著增长。

链接: https://arxiv.org/abs/2410.02959
作者: Antonio Carlos Rivera,Anthony Moore,Steven Robinson
关键词-EN: large language models, prompt engineering techniques, coal mining, language models, combined with tailored
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this paper, we present a novel approach to coal mining question answering (QA) using large language models (LLMs) combined with tailored prompt engineering techniques. Coal mining is a complex, high-risk industry where accurate, context-aware information is critical for safe and efficient operations. Current QA systems struggle to handle the technical and dynamic nature of mining-related queries. To address these challenges, we propose a multi-turn prompt engineering framework designed to guide LLMs, such as GPT-4, in answering coal mining questions with higher precision and relevance. By breaking down complex queries into structured components, our approach allows LLMs to process nuanced technical information more effectively. We manually curated a dataset of 500 questions from real-world mining scenarios and evaluated the system’s performance using both accuracy (ACC) and GPT-4-based scoring metrics. Experiments comparing ChatGPT, Claude2, and GPT-4 across baseline, chain-of-thought (CoT), and multi-turn prompting methods demonstrate that our method significantly improves both accuracy and contextual relevance, with an average accuracy improvement of 15-18% and a notable increase in GPT-4 scores. The results show that our prompt-engineering approach provides a robust, adaptable solution for domain-specific question answering in high-stakes environments like coal mining.
摘要:本文提出了一种利用大语言模型 (LLM) 结合定制化提示工程技术来解决煤矿问答 (QA) 的新方法。煤矿开采是一个复杂且高风险的行业,准确且上下文感知的信息对于安全和高效的操作至关重要。当前的问答系统难以应对与矿业相关的技术性和动态性查询。为解决这些挑战,我们提出了一种多轮提示工程框架,旨在指导如 GPT-4 这样的大语言模型,以更高的精确度和相关性回答煤矿问题。通过将复杂查询分解为结构化组件,我们的方法使大语言模型能够更有效地处理微妙的技术信息。我们从实际矿业场景中手动筛选了 500 个问题,并使用准确率 (ACC) 和基于 GPT-4 的评分指标评估了系统的性能。实验比较了 ChatGPT、Claude2 和 GPT-4 在基线、思维链 (CoT) 和多轮提示方法下的表现,结果表明我们的方法显著提高了准确率和上下文相关性,平均准确率提升了 15-18%,GPT-4 评分显著增加。结果表明,我们的提示工程方法为高风险环境下的特定领域问答提供了一个强大且适应性强的解决方案,如煤矿开采。

[NLP-97] AutoML-Agent : A Multi-Agent LLM Framework for Full-Pipeline AutoML

【速读】: 该论文试图解决现有自动化机器学习(AutoML)系统在设置复杂工具时需要大量技术专家知识和时间的问题。解决方案的关键在于提出了一种名为AutoML-Agent的多智能体框架,该框架通过自然语言接口利用大型语言模型(LLM),使得非专业用户也能构建数据驱动的解决方案。AutoML-Agent通过检索增强的规划策略和并行执行的子任务分解,提高了搜索最优模型的效率,并引入了多阶段验证机制来确保执行结果的正确性和代码生成的成功率。

链接: https://arxiv.org/abs/2410.02958
作者: Patara Trirat,Wonyong Jeong,Sung Ju Hwang
关键词-EN: Automated machine learning, Automated machine, machine learning, hyperparameter tuning, Automated
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注: 47 pages, 5 figures

点击查看摘要

Abstract:Automated machine learning (AutoML) accelerates AI development by automating tasks in the development pipeline, such as optimal model search and hyperparameter tuning. Existing AutoML systems often require technical expertise to set up complex tools, which is in general time-consuming and requires a large amount of human effort. Therefore, recent works have started exploiting large language models (LLM) to lessen such burden and increase the usability of AutoML frameworks via a natural language interface, allowing non-expert users to build their data-driven solutions. These methods, however, are usually designed only for a particular process in the AI development pipeline and do not efficiently use the inherent capacity of the LLMs. This paper proposes AutoML-Agent, a novel multi-agent framework tailored for full-pipeline AutoML, i.e., from data retrieval to model deployment. AutoML-Agent takes user’s task descriptions, facilitates collaboration between specialized LLM agents, and delivers deployment-ready models. Unlike existing work, instead of devising a single plan, we introduce a retrieval-augmented planning strategy to enhance exploration to search for more optimal plans. We also decompose each plan into sub-tasks (e.g., data preprocessing and neural network design) each of which is solved by a specialized agent we build via prompting executing in parallel, making the search process more efficient. Moreover, we propose a multi-stage verification to verify executed results and guide the code generation LLM in implementing successful solutions. Extensive experiments on seven downstream tasks using fourteen datasets show that AutoML-Agent achieves a higher success rate in automating the full AutoML process, yielding systems with good performance throughout the diverse domains.
摘要:自动化机器学习 (AutoML) 通过自动化开发流程中的任务,如最佳模型搜索和超参数调优,加速了 AI 的发展。现有的 AutoML 系统通常需要技术专家来设置复杂的工具,这通常耗时且需要大量的人力。因此,最近的研究开始利用大语言模型 (LLM) 来减轻这种负担,并通过自然语言接口提高 AutoML 框架的可用性,使非专业用户也能构建基于数据驱动的解决方案。然而,这些方法通常仅针对 AI 开发流程中的特定环节设计,并未有效利用 LLM 的固有能力。本文提出了 AutoML-Agent,一种专为全流程 AutoML 设计的新型多智能体框架,即从数据检索到模型部署。AutoML-Agent 接受用户任务描述,促进专业 LLM 智能体之间的协作,并交付可部署的模型。与现有工作不同,我们引入了一种检索增强的规划策略,以增强探索,搜索更优的计划,而不是设计单一计划。我们还通过提示执行将每个计划分解为子任务(例如,数据预处理和神经网络设计),每个子任务由我们构建的专业智能体并行解决,从而使搜索过程更加高效。此外,我们提出了一种多阶段验证方法,以验证执行结果并指导代码生成 LLM 实现成功的解决方案。在七个下游任务中使用十四个数据集的广泛实验表明,AutoML-Agent 在自动化全 AutoML 流程中实现了更高的成功率,并在不同领域中产生了性能良好的系统。

[NLP-98] Unlocking Structured Thinking in Language Models with Cognitive prompting ICLR2025

【速读】: 该论文试图解决大型语言模型(LLMs)在处理复杂、多步骤任务时的效率和能力问题。解决方案的关键在于提出了一种名为“认知提示”(cognitive prompting)的新方法,通过模拟人类认知操作如目标澄清、分解、过滤、抽象和模式识别,系统化地引导LLMs进行逐步推理。该方法通过动态自适应选择认知操作序列,显著提升了如LLaMA3.1 70B等大型模型的性能,特别是在多步骤推理任务中的表现,同时增强了模型的可解释性和灵活性。

链接: https://arxiv.org/abs/2410.02953
作者: Oliver Kramer,Jill Baumann
关键词-EN: cognitive prompting, propose cognitive prompting, large language models, human-like cognitive operations, cognitive
类目: Computation and Language (cs.CL)
备注: 11 pages, submitted to ICLR 2025

点击查看摘要

Abstract:We propose cognitive prompting as a novel approach to guide problem-solving in large language models (LLMs) through structured, human-like cognitive operations such as goal clarification, decomposition, filtering, abstraction, and pattern recognition. By employing systematic, step-by-step reasoning, cognitive prompting enables LLMs to efficiently tackle complex, multi-step tasks. We evaluate the effectiveness of cognitive prompting on Meta’s LLaMA models, comparing performance on arithmetic reasoning tasks using the GSM8K dataset and on commonsense reasoning benchmarks. Our analysis includes comparisons between models without cognitive prompting, models with a static sequence of cognitive operations, and models using reflective cognitive prompting, where the LLM dynamically self-selects the sequence of cognitive operations. The results show that cognitive prompting, particularly when dynamically adapted, significantly improves the performance of larger models, such as LLaMA3.1 70B, and enhances their ability to handle multi-step reasoning tasks. This approach also improves interpretability and flexibility, highlighting cognitive prompting as a promising strategy for general-purpose AI reasoning.
摘要:我们提出认知提示 (cognitive prompting) 作为一种新颖的方法,通过结构化的人类认知操作,如目标澄清、分解、过滤、抽象和模式识别,来指导大语言模型 (LLMs) 解决问题。通过采用系统化的、逐步推理的方法,认知提示使 LLMs 能够高效地处理复杂的多步骤任务。我们评估了认知提示在 Meta 的 LLaMA 模型上的有效性,比较了在 GSM8K 数据集上的算术推理任务和常识推理基准上的表现。我们的分析包括比较没有认知提示的模型、具有静态认知操作序列的模型以及使用反射性认知提示的模型,其中 LLM 动态地自我选择认知操作序列。结果显示,认知提示,特别是当动态适应时,显著提高了较大模型(如 LLaMA3.1 70B)的性能,并增强了它们处理多步骤推理任务的能力。这种方法还提高了可解释性和灵活性,突显了认知提示作为一种有前途的通用人工智能推理策略。
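
为了更直观地理解"静态认知操作序列"这一设定,下面给出一个极简的提示拼装示意(各操作的英文措辞为笔者假设,并非论文原文;动态自选序列的反射性版本需由模型自行决定操作顺序):

```python
# 认知提示的极简示意:把一组类人认知操作拼装为逐步推理的提示
COGNITIVE_OPS = [
    "Goal clarification: restate what the problem is asking.",
    "Decomposition: break the problem into smaller sub-problems.",
    "Filtering: keep only the facts relevant to the goal.",
    "Abstraction: express the core relation in general terms.",
    "Pattern recognition: relate it to a known problem type.",
]

def build_cognitive_prompt(question, ops=COGNITIVE_OPS):
    steps = "\n".join(f"{i + 1}. {op}" for i, op in enumerate(ops))
    return (
        "Solve the problem by performing these cognitive operations in order:\n"
        f"{steps}\n"
        "Then state the final answer.\n\n"
        f"Problem: {question}"
    )

print(build_cognitive_prompt("Natalia sold 48 clips in April and half as many in May. How many in total?"))
```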

[NLP-99] Visual Editing with LLM-based Tool Chaining: An Efficient Distillation Approach for Real-Time Applications EMNLP2024

【速读】: 该论文试图解决在实时应用中使用大型语言模型(LLMs)进行视觉编辑任务时的高成本和高延迟问题。解决方案的关键在于采用知识蒸馏技术,通过微调一个较小的学生LLM,并利用较大的教师LLM(如GPT-3.5-Turbo)的指导和行为信号,来实现与教师模型相近的性能,从而显著降低成本和延迟。此外,通过数据增强技术在低数据环境下提升了微调效果,进一步优化了学生模型的表现。

链接: https://arxiv.org/abs/2410.02952
作者: Oren Sultan,Alex Khasin,Guy Shiran,Asnat Greenstein-Messica,Dafna Shahaf
关键词-EN: practical distillation approach, present a practical, practical distillation, real-time applications, invoking tools
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: EMNLP 2024

点击查看摘要

Abstract:We present a practical distillation approach to fine-tune LLMs for invoking tools in real-time applications. We focus on visual editing tasks; specifically, we modify images and videos by interpreting user stylistic requests, specified in natural language (“golden hour”), using an LLM to select the appropriate tools and their parameters to achieve the desired visual effect. We found that proprietary LLMs such as GPT-3.5-Turbo show potential in this task, but their high cost and latency make them unsuitable for real-time applications. In our approach, we fine-tune a (smaller) student LLM with guidance from a (larger) teacher LLM and behavioral signals. We introduce offline metrics to evaluate student LLMs. Both online and offline experiments show that our student models manage to match the performance of our teacher model (GPT-3.5-Turbo), significantly reducing costs and latency. Lastly, we show that fine-tuning was improved by 25% in low-data regimes using augmentation.
摘要:我们提出了一种实用的蒸馏方法,用于微调大语言模型 (LLM) 以在实时应用中调用工具。我们专注于视觉编辑任务;具体来说,我们通过解释用户以自然语言指定的风格化请求(例如“黄金时刻”),使用 LLM 选择适当的工具及其参数,以实现所需的视觉效果。我们发现,如 GPT-3.5-Turbo 这样的专有 LLM 在此任务中显示出潜力,但其高成本和延迟使其不适用于实时应用。在我们的方法中,我们通过一个(更大)的教师 LLM 和行为信号的指导,微调一个(较小)的学生 LLM。我们引入了离线指标来评估学生 LLM。在线和离线实验均表明,我们的学生模型能够匹配教师模型(GPT-3.5-Turbo)的性能,显著降低了成本和延迟。最后,我们展示了在数据稀缺的情况下,通过数据增强,微调效果提升了 25%。

[NLP-100] LLMCO2: Advancing Accurate Carbon Footprint Prediction for LLM Inferences

【速读】: 该论文试图解决大型语言模型(LLM)推理过程中碳足迹估算的复杂性问题。解决方案的关键在于引入了一种基于图神经网络(GNN)的模型,称为LLMCO2,该模型显著提高了LLM推理碳足迹预测的准确性。与传统的基于方程的模型和现有的机器学习方法相比,LLMCO2模型能够更精确地处理推理请求的多样性、硬件配置的差异以及推理过程中不同阶段的特性,从而提供更为准确的碳足迹估算。

链接: https://arxiv.org/abs/2410.02950
作者: Zhenxiao Fu,Fan Chen,Shan Zhou,Haitong Li,Lei Jiang
关键词-EN: substantially larger carbon, LLM inference carbon, large language model, LLM inference requests, LLM inference
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 9 pages, 11 figures

点击查看摘要

Abstract:Throughout its lifecycle, a large language model (LLM) generates a substantially larger carbon footprint during inference than training. LLM inference requests vary in batch size, prompt length, and token generation number, while cloud providers employ different GPU types and quantities to meet diverse service-level objectives for accuracy and latency. It is crucial for both users and cloud providers to have a tool that quickly and accurately estimates the carbon impact of LLM inferences based on a combination of inference request and hardware configurations before execution. Estimating the carbon footprint of LLM inferences is more complex than training due to lower and highly variable model FLOPS utilization, rendering previous equation-based models inaccurate. Additionally, existing machine learning (ML) prediction methods either lack accuracy or demand extensive training data, as they inadequately handle the distinct prefill and decode phases, overlook hardware-specific features, and inefficiently sample uncommon inference configurations. We introduce LLMCO2, a graph neural network (GNN)-based model that greatly improves the accuracy of LLM inference carbon footprint predictions compared to previous methods.
摘要:在整个生命周期中,大语言模型 (LLM) 在推理阶段产生的碳足迹远大于训练阶段。LLM 推理请求在批量大小、提示长度和 Token 生成数量上存在差异,而云服务提供商则采用不同类型和数量的 GPU 来满足多样化的服务级别目标,包括准确性和延迟。对于用户和云服务提供商而言,在执行推理之前,能够快速且准确地估算基于推理请求和硬件配置组合的 LLM 推理碳影响至关重要。由于模型 FLOPS 利用率较低且高度可变,估算 LLM 推理的碳足迹比训练更为复杂,导致先前基于方程的模型不够准确。此外,现有的机器学习 (ML) 预测方法要么缺乏准确性,要么需要大量训练数据,因为它们未能充分处理独特的预填充和解码阶段,忽视了硬件特定的特征,并且低效地采样不常见的推理配置。我们引入了 LLMCO2,这是一种基于图神经网络 (GNN) 的模型,与先前的方法相比,显著提高了 LLM 推理碳足迹预测的准确性。

[NLP-101] Graph-tree Fusion Model with Bidirectional Information Propagation for Long Document Classification EMNLP

【速读】: 该论文试图解决长文档分类中由于内容广泛和结构复杂导致的局部和全局依赖关系难以捕捉的问题。解决方案的关键在于提出了一种结合图-树结构的模型,通过集成句法树进行句子编码和文档图进行文档编码,分别捕捉细粒度的句法关系和更广泛的文档上下文。具体实现中,使用Tree Transformers生成句子编码,并通过图注意力网络建模句内和句间依赖关系。训练过程中,双向信息传播机制从词到句子再到文档,反之亦然,丰富了上下文表示。该方法能够全面理解文档在各个层次的内容,并有效处理任意长度的上下文,不受token限制的约束。

链接: https://arxiv.org/abs/2410.02930
作者: Sudipta Singha Roy,Xindi Wang,Robert E. Mercer,Frank Rudzicz
关键词-EN: global dependencies due, classification presents challenges, presents challenges, challenges in capturing, capturing both local
类目: Computation and Language (cs.CL)
备注: accepted to EMNLP findings 2024

点击查看摘要

Abstract:Long document classification presents challenges in capturing both local and global dependencies due to their extensive content and complex structure. Existing methods often struggle with token limits and fail to adequately model hierarchical relationships within documents. To address these constraints, we propose a novel model leveraging a graph-tree structure. Our approach integrates syntax trees for sentence encodings and document graphs for document encodings, which capture fine-grained syntactic relationships and broader document contexts, respectively. We use Tree Transformers to generate sentence encodings, while a graph attention network models inter- and intra-sentence dependencies. During training, we implement bidirectional information propagation from word-to-sentence-to-document and vice versa, which enriches the contextual representation. Our proposed method enables a comprehensive understanding of content at all hierarchical levels and effectively handles arbitrarily long contexts without token limit constraints. Experimental results demonstrate the effectiveness of our approach in all types of long document classification tasks.
摘要:长文档分类由于其内容广泛和结构复杂,面临着捕捉局部和全局依赖关系的挑战。现有方法常常受限于 Token 数量,并且未能充分建模文档内部的层次关系。为了解决这些限制,我们提出了一种利用图-树结构的新模型。我们的方法结合了句法树用于句子编码和文档图用于文档编码,分别捕捉细粒度的句法关系和更广泛的文档上下文。我们使用 Tree Transformers 生成句子编码,同时利用图注意力网络建模句内和句间依赖关系。在训练过程中,我们实现了从词到句子再到文档的双向信息传播,反之亦然,从而丰富了上下文表示。我们提出的方法能够在所有层次上全面理解内容,并有效处理任意长度的上下文,不受 Token 限制的约束。实验结果表明,我们的方法在各种长文档分类任务中均表现出色。

[NLP-102] Fine-Tuning Language Models with Differential Privacy through Adaptive Noise Allocation EMNLP2024

【速读】: 该论文试图解决传统差分隐私训练方法在保护隐私时对所有参数均匀添加噪声,导致模型性能下降的问题。解决方案的关键在于提出了一种名为ANADP的新算法,该算法能够根据模型参数的重要性自适应地分配加性噪声,从而在满足隐私约束的同时,缩小常规微调与传统差分隐私微调之间的性能差距。

链接: https://arxiv.org/abs/2410.02912
作者: Xianzhi Li,Ran Zmigrod,Zhiqiang Ma,Xiaomo Liu,Xiaodan Zhu
关键词-EN: memorizing detailed patterns, achieve impressive modeling, significant privacy concerns, raise significant privacy, impressive modeling performance
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: EMNLP 2024 findings

点击查看摘要

Abstract:Language models are capable of memorizing detailed patterns and information, leading to a double-edged effect: they achieve impressive modeling performance on downstream tasks with the stored knowledge but also raise significant privacy concerns. Traditional differential privacy based training approaches offer robust safeguards by employing a uniform noise distribution across all parameters. However, this overlooks the distinct sensitivities and contributions of individual parameters in privacy protection and often results in suboptimal models. To address these limitations, we propose ANADP, a novel algorithm that adaptively allocates additive noise based on the importance of model parameters. We demonstrate that ANADP narrows the performance gap between regular fine-tuning and traditional DP fine-tuning on a series of datasets while maintaining the required privacy constraints.
摘要:语言模型能够记忆详细的模式和信息,这带来了一把双刃剑:它们在存储知识的基础上在下游任务中取得了令人印象深刻的建模性能,但也引发了重大的隐私问题。传统的基于差分隐私的训练方法通过在所有参数上应用统一的噪声分布提供了强大的保护措施。然而,这种方法忽视了个别参数在隐私保护中的不同敏感性和贡献,往往导致模型性能不佳。为了解决这些局限性,我们提出了 ANADP,一种根据模型参数重要性自适应分配加性噪声的新算法。我们证明,ANADP 在一系列数据集上缩小了常规微调与传统差分隐私微调之间的性能差距,同时保持了所需的隐私约束。
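
"按参数重要性自适应分配加性噪声"这一核心思想可以用下面的极简示意来理解(分配规则为笔者假设的示意:重要性越高噪声越小,且噪声尺度的平均值与统一加噪持平,以模拟同等总噪声预算下的再分配;真实的 ANADP 需按差分隐私会计重新推导保证):

```python
def anadp_noise_scales(importances, base_sigma=1.0):
    """按重要性(假设均为正数)为每个参数分配噪声尺度的示意。
    重要性越高 -> 噪声尺度越小;各尺度的平均值保持为 base_sigma,
    以对比"对所有参数统一加噪 base_sigma"的传统做法。"""
    total = sum(importances)
    n = len(importances)
    inv = [total / (imp * n) for imp in importances]  # 与重要性成反比
    mean_inv = sum(inv) / n
    return [base_sigma * v / mean_inv for v in inv]   # 归一化使均值为 base_sigma

scales = anadp_noise_scales([2.0, 1.0, 1.0], base_sigma=1.0)
# 重要性为 2.0 的参数得到更小的噪声尺度,其余两个参数尺度相同且更大
```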

[NLP-103] NNetscape Navigator: Complex Demonstrations for Web Agents Without a Demonstrator

【速读】: 该论文试图解决训练网页代理(web agents)时依赖昂贵的人类监督的问题,提出了一种完全通过合成演示(synthetic demonstrations)进行训练的方法NNetnav。解决方案的关键在于利用语言指令的层次结构,将复杂指令分解为更简单的子任务,从而使探索指数级空间的过程更加可控。具体来说,NNetnav通过与浏览器交互生成轨迹,并使用语言模型对这些轨迹进行后处理标注,自动修剪无法标注出有意义子任务的交互片段。这种方法不仅减少了人类监督的需求,还在WebArena和MiniWoB++等环境中显著提升了语言模型策略的性能。

链接: https://arxiv.org/abs/2410.02907
作者: Shikhar Murty,Dzmitry Bahdanau,Christopher D. Manning
关键词-EN: introduce NNetscape Navigator, NNetscape Navigator, training web agents, introduce NNetscape, language model
类目: Computation and Language (cs.CL)
备注: Preprint. Under Review

点击查看摘要

Abstract:We introduce NNetscape Navigator (NNetnav), a method for training web agents entirely through synthetic demonstrations. These demonstrations are collected by first interacting with a browser to generate trajectory rollouts, which are then retroactively labeled into instructions using a language model. Most work on training browser agents has relied on expensive human supervision, and the limited previous work on such interaction-first synthetic data techniques has failed to provide effective search through the exponential space of exploration. In contrast, NNetnav exploits the hierarchical structure of language instructions to make this search more tractable: complex instructions are typically decomposable into simpler subtasks, allowing NNetnav to automatically prune interaction episodes when an intermediate trajectory cannot be annotated with a meaningful sub-task. We use NNetnav demonstrations from a language model for supervised fine-tuning of a smaller language model policy, and find improvements of 6 points on WebArena and over 20 points on MiniWoB++, two popular environments for web-agents. Notably, on WebArena, we observe that language model policies can be further enhanced when fine-tuned with NNetnav demonstrations derived from the same language model. Finally, we collect and release a dataset of over 6k NNetnav demonstrations on WebArena, spanning a diverse and complex set of instructions.
摘要:我们介绍了 NNetscape Navigator (NNetnav),这是一种完全通过合成演示来训练网络智能体的方法。这些演示首先通过与浏览器交互生成轨迹展开,然后使用语言模型对这些轨迹进行回溯性标注,形成指令。大多数训练浏览器智能体的工作依赖于昂贵的人类监督,而先前有限的基于交互优先合成数据技术的工作未能有效探索指数级扩展的搜索空间。相比之下,NNetnav 利用语言指令的层次结构,使这种搜索更加易于处理:复杂的指令通常可分解为更简单的子任务,允许 NNetnav 在中间轨迹无法被标注为有意义的子任务时自动修剪交互片段。我们使用从语言模型中获取的 NNetnav 演示,对较小的语言模型策略进行监督微调,并在 WebArena 和 MiniWoB++ 这两个流行的网络智能体环境中分别观察到 6 点和超过 20 点的改进。值得注意的是,在 WebArena 上,我们发现当使用从同一语言模型派生的 NNetnav 演示进行微调时,语言模型策略可以得到进一步增强。最后,我们收集并发布了一个包含超过 6000 个 NNetnav 演示的数据集,涵盖了 WebArena 上多样化且复杂的指令集。

[NLP-104] Better Instruction-Following Through Minimum Bayes Risk ICLR2025

【速读】: 该论文试图解决指令遵循型大语言模型(LLM)在测试时性能提升的问题。解决方案的关键在于利用最小贝叶斯风险(MBR)解码方法,通过基于参考的LLM评判器来从候选输出中选择高质量的输出,从而显著提升模型性能。研究还探索了在MBR解码基础上进行迭代自训练,通过直接偏好优化(Direct Preference Optimisation)进一步增强模型性能,使得自训练模型在贪婪解码下的表现通常能与使用MBR解码的基线模型相媲美,甚至有时超越。

链接: https://arxiv.org/abs/2410.02902
作者: Ian Wu,Patrick Fernandes,Amanda Bertsch,Seungone Kim,Sina Pakazad,Graham Neubig
关键词-EN: General-purpose LLM judges, Minimum Bayes Risk, human-level evaluation provide, MBR decoding, General-purpose LLM
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under review at ICLR 2025

点击查看摘要

Abstract:General-purpose LLM judges capable of human-level evaluation provide not only a scalable and accurate way of evaluating instruction-following LLMs but also new avenues for supervising and improving their performance. One promising way of leveraging LLM judges for supervision is through Minimum Bayes Risk (MBR) decoding, which uses a reference-based evaluator to select a high-quality output from amongst a set of candidate outputs. In the first part of this work, we explore using MBR decoding as a method for improving the test-time performance of instruction-following LLMs. We find that MBR decoding with reference-based LLM judges substantially improves over greedy decoding, best-of-N decoding with reference-free judges and MBR decoding with lexical and embedding-based metrics on AlpacaEval and MT-Bench. These gains are consistent across LLMs with up to 70B parameters, demonstrating that smaller LLM judges can be used to supervise much larger LLMs. Then, seeking to retain the improvements from MBR decoding while mitigating additional test-time costs, we explore iterative self-training on MBR-decoded outputs. We find that self-training using Direct Preference Optimisation leads to significant performance gains, such that the self-trained models with greedy decoding generally match and sometimes exceed the performance of their base models with MBR decoding.
摘要:能够进行人类水平评估的通用大语言模型 (LLM) 不仅提供了一种可扩展且准确的指令遵循大语言模型评估方法,还为监督和提升其性能开辟了新的途径。利用 LLM 法官进行监督的一个有前景的方法是通过最小贝叶斯风险 (MBR) 解码,该方法使用基于参考的评估器从一组候选输出中选择高质量的输出。在本工作的第一部分,我们探讨了使用 MBR 解码作为提高指令遵循大语言模型测试时性能的方法。我们发现,基于参考的 LLM 法官的 MBR 解码显著优于贪婪解码、基于无参考法官的最佳 N 解码以及基于词汇和嵌入度量标准的 MBR 解码在 AlpacaEval 和 MT-Bench 上的表现。这些收益在参数高达 70B 的 LLM 中是一致的,表明较小的 LLM 法官可以用来监督更大的 LLM。接着,为了在减少额外测试时间成本的同时保留 MBR 解码的改进,我们探索了在 MBR 解码输出上的迭代自训练。我们发现,使用直接偏好优化进行自训练可以显著提升性能,使得自训练模型在贪婪解码下通常能够匹配甚至有时超过其基础模型在 MBR 解码下的性能。
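
MBR 解码的选择规则本身很简洁:在候选集合内,选出对其余候选(作为"参考")平均效用最高的输出。下面是一个与模型无关的示意(`utility(hyp, ref)` 在论文中由基于参考的 LLM 法官给出,此处仅为假设接口,测试时可用任意打分函数代替):

```python
def mbr_select(candidates, utility):
    """最小贝叶斯风险 (MBR) 解码的示意:
    返回对其余候选平均效用最高的候选输出。"""
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i):
        hyp = candidates[i]
        refs = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(hyp, ref) for ref in refs) / len(refs)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]
```

与贪婪解码或 best-of-N 不同,MBR 不依赖单一打分,而是利用候选之间的相互一致性:离群的候选(与其他候选都不相似)期望效用低,自然被淘汰。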

[NLP-105] FactCheckmate: Preemptively Detecting and Mitigating Hallucinations in LMs

【速读】: 该论文试图解决语言模型(LMs)在生成内容时可能产生的幻觉(hallucinations)问题。解决方案的关键在于利用语言模型的内部表示(即隐藏状态)来预先检测和缓解幻觉。具体来说,论文提出了FactCheckMate系统,该系统通过学习一个分类器,基于模型在处理输入时生成的隐藏状态,预测模型是否会产生幻觉。如果检测到幻觉,FactCheckMate会调整模型的隐藏状态,以促使模型生成更符合事实的输出。这种方法通过利用模型的内部机制,实现了高效的幻觉检测和缓解,相比于事后处理方法,具有更低的推理开销。

链接: https://arxiv.org/abs/2410.02899
作者: Deema Alnuhait,Neeraja Kirtane,Muhammad Khalifa,Hao Peng
关键词-EN: Language models, Language, FactCheckMate, hidden states, LMs
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Language models (LMs) hallucinate. We inquire: Can we detect and mitigate hallucinations before they happen? This work answers this research question in the positive, by showing that the internal representations of LMs provide rich signals that can be used for this purpose. We introduce FactCheckMate, which preemptively detects hallucinations by learning a classifier that predicts whether the LM will hallucinate, based on the model’s hidden states produced over the inputs, before decoding begins. If a hallucination is detected, FactCheckMate then intervenes, by adjusting the LM’s hidden states such that the model will produce more factual outputs. FactCheckMate provides fresh insights that the inner workings of LMs can be revealed by their hidden states. Practically, both the detection and mitigation models in FactCheckMate are lightweight, adding little inference overhead; FactCheckMate proves a more efficient approach for mitigating hallucinations compared to many post-hoc alternatives. We evaluate FactCheckMate over LMs of different scales and model families (including Llama, Mistral, and Gemma), across a variety of QA datasets from different domains. Our results demonstrate the effectiveness of leveraging internal representations for early hallucination detection and mitigation, achieving over 70% preemptive detection accuracy. On average, outputs generated by LMs with intervention are 34.4% more factual compared to those without intervention. The average overhead difference in the inference time introduced by FactCheckMate is around 3.16 seconds.
摘要:语言模型 (LMs) 会产生幻觉。我们探讨:能否在幻觉发生之前检测并缓解它们?本研究通过展示 LMs 的内部表示提供了丰富的信号,可以用于此目的,从而正面回答了这一研究问题。我们引入了 FactCheckMate,它通过学习一个分类器,在解码开始之前,基于模型在输入上生成的隐藏状态,预测 LM 是否会产生幻觉,从而预先检测幻觉。如果检测到幻觉,FactCheckMate 会介入,通过调整 LM 的隐藏状态,使得模型产生更符合事实的输出。FactCheckMate 提供了新的见解,即 LMs 的内部运作可以通过其隐藏状态揭示。实际上,FactCheckMate 中的检测和缓解模型都是轻量级的,增加了很少的推理开销;与许多事后替代方案相比,FactCheckMate 被证明是一种更有效的缓解幻觉的方法。我们在不同规模和模型家族(包括 Llama、Mistral 和 Gemma)的 LMs 上,评估了 FactCheckMate 在来自不同领域的各种 QA 数据集上的表现。我们的结果表明,利用内部表示进行早期幻觉检测和缓解是有效的,预先检测准确率超过 70%。平均而言,经过干预的 LMs 生成的输出比未经干预的输出更符合事实,符合事实程度提高了 34.4%。FactCheckMate 引入的推理时间平均开销差异约为 3.16 秒。
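
"基于隐藏状态的轻量级检测 + 干预"这一流程可以抽象为一个线性探针加状态平移的示意(以下权重、"事实性方向"均为假设占位;真实的 FactCheckMate 使用学习得到的分类器与干预模型):

```python
import math

def hallucination_probe(hidden_state, w, b=0.0, threshold=0.5):
    """对隐藏状态做 sigmoid 线性探针,预测"将要产生幻觉"的概率(示意)。"""
    z = sum(h * wi for h, wi in zip(hidden_state, w)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    return p, p >= threshold

def intervene(hidden_state, direction, alpha=0.1):
    """检测到幻觉时,沿假设的"事实性方向"对隐藏状态做小幅平移(示意)。"""
    return [h + alpha * d for h, d in zip(hidden_state, direction)]
```

这种"解码前"的预先检测只需一次轻量前向计算,这也是摘要中所说推理开销远低于事后核查方案的直观原因。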

[NLP-106] Cognitive Biases in Large Language Models for News Recommendation RECSYS’24

【速读】: 该论文旨在探讨大型语言模型(LLMs)在新闻推荐系统中引入的认知偏差问题,这些偏差可能导致错误信息的传播、刻板印象的强化和回音壁效应的形成。论文的关键在于识别了多种认知偏差(如锚定偏差、框架偏差、现状偏差和群体归因偏差)对LLM推荐系统的影响,并提出了通过数据增强、提示工程和学习算法改进等策略来缓解这些偏差,从而提高推荐系统的可靠性。

链接: https://arxiv.org/abs/2410.02897
作者: Yougang Lyu,Xiaoyu Zhang,Zhaochun Ren,Maarten de Rijke
关键词-EN: large language models, recommender systems, cognitive biases, LLM-based news recommender, language models
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at the ROGEN '24 workshop, co-located with ACM RecSys '24

点击查看摘要

Abstract:Despite large language models (LLMs) increasingly becoming important components of news recommender systems, employing LLMs in such systems introduces new risks, such as the influence of cognitive biases in LLMs. Cognitive biases refer to systematic patterns of deviation from norms or rationality in the judgment process, which can result in inaccurate outputs from LLMs, thus threatening the reliability of news recommender systems. Specifically, LLM-based news recommender systems affected by cognitive biases could lead to the propagation of misinformation, reinforcement of stereotypes, and the formation of echo chambers. In this paper, we explore the potential impact of multiple cognitive biases on LLM-based news recommender systems, including anchoring bias, framing bias, status quo bias and group attribution bias. Furthermore, to facilitate future research at improving the reliability of LLM-based news recommender systems, we discuss strategies to mitigate these biases through data augmentation, prompt engineering and learning algorithms aspects.
摘要:尽管大语言模型 (Large Language Models, LLMs) 在新闻推荐系统中日益成为重要的组成部分,但将 LLMs 应用于此类系统也引入了新的风险,例如 LLMs 中的认知偏差影响。认知偏差指的是在判断过程中偏离规范或理性的系统性模式,这可能导致 LLMs 输出不准确的结果,从而威胁新闻推荐系统的可靠性。具体而言,受认知偏差影响的基于 LLM 的新闻推荐系统可能导致错误信息的传播、刻板印象的强化以及回音壁效应的形成。本文探讨了多种认知偏差对基于 LLM 的新闻推荐系统的潜在影响,包括锚定偏差 (anchoring bias)、框架偏差 (framing bias)、现状偏差 (status quo bias) 和群体归因偏差 (group attribution bias)。此外,为了促进未来在提高基于 LLM 的新闻推荐系统可靠性方面的研究,我们讨论了通过数据增强、提示工程 (prompt engineering) 和学习算法方面来减轻这些偏差的策略。

[NLP-107] he Role of Deductive and Inductive Reasoning in Large Language Models

【速读】: 该论文试图解决大语言模型(LLMs)在复杂和动态问题空间中适应性不足的问题。解决方案的关键在于提出了Deductive and InDuctive(DID)方法,通过在提示构造过程中动态集成演绎和归纳推理,增强LLMs的推理能力。DID方法借鉴认知科学,模拟人类适应性推理机制,提供了一个灵活的框架,使模型能够根据任务上下文和表现调整其推理路径,从而在不显著增加计算开销的情况下,显著提高解决方案的准确性和推理质量。

链接: https://arxiv.org/abs/2410.02892
作者: Chengkun Cai,Xu Zhao,Haoliang Liu,Zhongyu Jiang,Tianfang Zhang,Zongkai Wu,Jenq-Neng Hwang,Lei Li
关键词-EN: Large Language Models, Large Language, Language Models, artificial intelligence, progress in artificial
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 4 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved substantial progress in artificial intelligence, particularly in reasoning tasks. However, their reliance on static prompt structures, coupled with limited dynamic reasoning capabilities, often constrains their adaptability to complex and evolving problem spaces. In this paper, we propose the Deductive and InDuctive (DID) method, which enhances LLM reasoning by dynamically integrating both deductive and inductive reasoning within the prompt construction process. Drawing inspiration from cognitive science, the DID approach mirrors human adaptive reasoning mechanisms, offering a flexible framework that allows the model to adjust its reasoning pathways based on task context and performance. We empirically validate the efficacy of DID on established datasets such as AIW and MR-GSM8K, as well as on our custom dataset, Holiday Puzzle, which presents tasks about different holiday date calculating challenges. By leveraging DID’s hybrid prompt strategy, we demonstrate significant improvements in both solution accuracy and reasoning quality, achieved without imposing substantial computational overhead. Our findings suggest that DID provides a more robust and cognitively aligned framework for reasoning in LLMs, contributing to the development of advanced LLM-driven problem-solving strategies informed by cognitive science models.
摘要:大语言模型 (LLMs) 在人工智能领域,特别是在推理任务方面取得了显著进展。然而,它们依赖于静态提示结构,以及有限的动态推理能力,往往限制了其在复杂和不断变化的问题空间中的适应性。本文提出了一种演绎与归纳相结合 (Deductive and InDuctive, DID) 的方法,通过在提示构建过程中动态整合演绎和归纳推理,增强了大语言模型的推理能力。受认知科学的启发,DID 方法模仿了人类的适应性推理机制,提供了一个灵活的框架,使模型能够根据任务上下文和表现调整其推理路径。我们通过在已有的数据集如 AIW 和 MR-GSM8K 以及我们自定义的 Holiday Puzzle 数据集上进行实证验证,该数据集涉及不同节日的日期计算挑战。通过利用 DID 的混合提示策略,我们展示了在解决方案准确性和推理质量方面的显著改进,且未显著增加计算开销。我们的研究结果表明,DID 为大语言模型中的推理提供了一个更强大且与认知科学模型相一致的框架,有助于推动基于认知科学模型的高级大语言模型驱动的问题解决策略的发展。
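为直观理解“在提示构建过程中动态整合演绎与归纳推理”的思路,下面给出一个极简的混合提示拼装示意。注意:其中的提示模板、函数名与示例数据均为本文的示意性假设,并非 DID 论文的原始实现:

```python
def build_did_prompt(question: str, examples: list, principles: list) -> str:
    """Assemble a hybrid deductive-inductive prompt (illustrative only).

    Inductive phase: invite the model to generalize from concrete cases.
    Deductive phase: apply stated general rules to the target question.
    """
    inductive = "\n".join(f"- Example: {e}" for e in examples)
    deductive = "\n".join(f"- Principle: {p}" for p in principles)
    return (
        "First, induce a pattern from these cases:\n" + inductive + "\n\n"
        "Then, deduce the answer using these rules:\n" + deductive + "\n\n"
        "Question: " + question + "\nAnswer step by step."
    )

prompt = build_did_prompt(
    "If 2024-12-25 is a Wednesday, what weekday is 2025-01-01?",
    ["7 days after a Wednesday is again a Wednesday"],
    ["Weekdays cycle with period 7"],
)
```

真实的 DID 还会根据任务上下文与模型表现动态调整两种推理的比重,这里仅展示“同一提示内先归纳后演绎”的骨架。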

[NLP-108] LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning

【速读】: 该论文试图解决大型语言模型(LLMs)在数学推理能力上的不足,提出了一种名为LLaMA-Berry的高级数学问题解决框架。解决方案的关键在于结合蒙特卡洛树搜索(MCTS)与迭代自优化(Self-Refine)来优化推理路径,并利用成对奖励模型(Pairwise Preference Reward Model, PPRM)对不同路径进行全局评估。通过LLMs的自批判和重写能力,SR-MCTS克服了传统逐步和贪婪搜索算法的低效性和局限性,促进了更高效的解空间探索。PPRM借鉴了人类反馈强化学习(RLHF)的思想,使用增强的Borda计数法(EBC)将成对偏好综合为全局排名分数,从而找到更优的答案,解决了数学推理任务中评分变异性和非独立分布的问题。

链接: https://arxiv.org/abs/2410.02884
作者: Di Zhang,Jianbo Wu,Jingdi Lei,Tong Che,Jiatong Li,Tong Xie,Xiaoshui Huang,Shufei Zhang,Marco Pavone,Yuqiang Li,Wanli Ouyang,Dongzhan Zhou
关键词-EN: Large Language Models, Large Language, Monte Carlo Tree, ability of Large, Language Models
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper presents an advanced mathematical problem-solving framework, LLaMA-Berry, for enhancing the mathematical reasoning ability of Large Language Models (LLMs). The framework combines Monte Carlo Tree Search (MCTS) with iterative Self-Refine to optimize the reasoning path and utilizes a pairwise reward model to evaluate different paths globally. By leveraging the self-critic and rewriting capabilities of LLMs, Self-Refine applied to MCTS (SR-MCTS) overcomes the inefficiencies and limitations of conventional step-wise and greedy search algorithms by fostering a more efficient exploration of solution spaces. Pairwise Preference Reward Model (PPRM), inspired by Reinforcement Learning from Human Feedback (RLHF), is then used to model pairwise preferences between solutions, utilizing an Enhanced Borda Count (EBC) method to synthesize these preferences into a global ranking score to find better answers. This approach addresses the challenges of scoring variability and non-independent distributions in mathematical reasoning tasks. The framework has been tested on general and advanced benchmarks, showing superior performance in terms of search efficiency and problem-solving capability compared to existing methods like ToT and rStar, particularly in complex Olympiad-level benchmarks, including GPQA, AIME24 and AMC23.
摘要:本文提出了一种先进的数学问题解决框架——LLaMA-Berry,旨在提升大语言模型 (LLM) 的数学推理能力。该框架结合了蒙特卡洛树搜索 (MCTS) 与迭代自优化 (Self-Refine) 来优化推理路径,并利用成对奖励模型 (pairwise reward model) 对不同路径进行全局评估。通过利用 LLM 的自我批评和重写能力,应用于 MCTS 的迭代自优化 (SR-MCTS) 克服了传统逐步和贪心搜索算法的低效性和局限性,促进了更高效的解空间探索。受人类反馈强化学习 (RLHF) 启发的成对偏好奖励模型 (PPRM),采用增强的博达计数法 (EBC) 来综合这些偏好,生成全局排序分数以找到更优答案。这种方法解决了数学推理任务中评分变异性和非独立分布的挑战。该框架已在通用和高级基准测试中进行了测试,显示出在搜索效率和问题解决能力方面优于现有方法如 ToT 和 rStar,特别是在复杂的奥林匹克级别基准测试中,包括 GPQA、AIME24 和 AMC23。
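论文用增强的 Borda 计数法 (EBC) 将成对偏好综合为全局排名分数。下面用朴素的 Borda 计数给出一个简化示意(未包含论文所称的“增强”部分,偏好矩阵为虚构数据,仅用于说明“成对比较 → 全局排序”的流程):

```python
def borda_scores(pref):
    """pref[i][j]: probability that candidate solution i beats solution j.
    Plain Borda count: one point per pairwise win (probability > 0.5)."""
    n = len(pref)
    return [sum(1 for j in range(n) if j != i and pref[i][j] > 0.5)
            for i in range(n)]

def rank_solutions(pref):
    """Return candidate indices sorted from best to worst by Borda score."""
    scores = borda_scores(pref)
    return sorted(range(len(pref)), key=lambda i: scores[i], reverse=True)

# Three candidate solutions; candidate 2 is preferred over both others.
pref = [
    [0.5, 0.7, 0.2],
    [0.3, 0.5, 0.1],
    [0.8, 0.9, 0.5],
]
ranking = rank_solutions(pref)  # best candidate first
```

在 SR-MCTS 中,这一全局排名分数可用来替代单条路径的绝对评分,从而缓解数学推理任务中评分变异性的问题。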

[NLP-109] Computational Modeling of Artistic Inspiration: A Framework for Predicting Aesthetic Preferences in Lyrical Lines Using Linguistic and Stylistic Features

【速读】: 该论文试图解决艺术创作中艺术灵感难以理解和预测的问题,特别是通过计算模型来模拟不同个体对艺术作品的偏好。解决方案的关键在于提出了一种基于语言和风格特征的新框架,并引入了名为“EvocativeLines”的数据集,该数据集包含被标注为“inspiring”或“not inspiring”的歌词行,用于评估模型在不同偏好下的表现。该框架利用了提出的语言和诗意特征,并通过校准网络进行精确的艺术偏好预测,实验结果表明其性能优于现有的先进语言模型LLaMA-3-70b。

链接: https://arxiv.org/abs/2410.02881
作者: Gaurav Sahu,Olga Vechtomova
关键词-EN: Artistic inspiration remains, understood aspects, inspiration remains, artistic preferences, creative process
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Artistic inspiration remains one of the least understood aspects of the creative process. It plays a crucial role in producing works that resonate deeply with audiences, but the complexity and unpredictability of aesthetic stimuli that evoke inspiration have eluded systematic study. This work proposes a novel framework for computationally modeling artistic preferences in different individuals through key linguistic and stylistic properties, with a focus on lyrical content. In addition to the framework, we introduce *EvocativeLines*, a dataset of annotated lyric lines, categorized as either “inspiring” or “not inspiring,” to facilitate the evaluation of our framework across diverse preference profiles. Our computational model leverages the proposed linguistic and poetic features and applies a calibration network on top of it to accurately forecast artistic preferences among different creative individuals. Our experiments demonstrate that our framework outperforms an out-of-the-box LLaMA-3-70b, a state-of-the-art open-source language model, by nearly 18 points. Overall, this work contributes an interpretable and flexible framework that can be adapted to analyze any type of artistic preferences that are inherently subjective across a wide spectrum of skill levels.
摘要:艺术灵感仍然是创造过程中最难以理解的一个方面。它在创作与观众产生深刻共鸣的作品中起着至关重要的作用,但引发灵感的审美刺激的复杂性和不可预测性却一直未能得到系统的研究。本研究提出了一种通过关键的语言和风格属性来计算建模不同个体艺术偏好的新框架,重点在于歌词内容。除了框架本身,我们还引入了 *EvocativeLines*,一个带标注的歌词行数据集,这些歌词行被分类为“有灵感”或“无灵感”,以促进我们框架在不同偏好配置文件中的评估。我们的计算模型利用了所提出的语言和诗歌特征,并在其上应用了一个校准网络,以准确预测不同创意个体的艺术偏好。我们的实验表明,我们的框架比开箱即用的 LLaMA-3-70b(一种最先进的开源语言模型)高出近 18 分。总的来说,这项工作贡献了一个可解释且灵活的框架,可以适应分析任何类型的艺术偏好,这些偏好本质上是主观的,并且跨越了广泛的技能水平。

[NLP-110] Position: LLM Unlearning Benchmarks are Weak Measures of Progress

【速读】: 该论文试图解决现有基准测试对大型语言模型(LLM)遗忘方法效果评估过于乐观且可能误导的问题。解决方案的关键在于揭示现有基准测试在面对简单良性修改时的脆弱性,特别是当遗忘信息与保留信息之间存在松散依赖关系时。论文通过引入这些修改,展示了遗忘信息仍可访问或遗忘过程对模型性能的负面影响被低估的情况。此外,论文指出基准测试中遗忘目标的模糊性容易导致方法过度拟合于特定测试查询。基于这些发现,论文呼吁社区在解释基准测试结果时应保持谨慎,并提出了改进未来LLM遗忘研究的几项建议。

链接: https://arxiv.org/abs/2410.02879
作者: Pratiksha Thaker,Shengyuan Hu,Neil Kale,Yash Maurya,Zhiwei Steven Wu,Virginia Smith
关键词-EN: LLM unlearning research, information post hoc, large language models, harmful information post, LLM unlearning
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Unlearning methods have the potential to improve the privacy and safety of large language models (LLMs) by removing sensitive or harmful information post hoc. The LLM unlearning research community has increasingly turned toward empirical benchmarks to assess the effectiveness of such methods. In this paper, we find that existing benchmarks provide an overly optimistic and potentially misleading view on the effectiveness of candidate unlearning methods. By introducing simple, benign modifications to a number of popular benchmarks, we expose instances where supposedly unlearned information remains accessible, or where the unlearning process has degraded the model’s performance on retained information to a much greater extent than indicated by the original benchmark. We identify that existing benchmarks are particularly vulnerable to modifications that introduce even loose dependencies between the forget and retain information. Further, we show that ambiguity in unlearning targets in existing benchmarks can easily lead to the design of methods that overfit to the given test queries. Based on our findings, we urge the community to be cautious when interpreting benchmark results as reliable measures of progress, and we provide several recommendations to guide future LLM unlearning research.
摘要:遗忘方法有望通过事后移除敏感或有害信息来提高大语言模型 (LLM) 的隐私和安全性。LLM 遗忘研究社区越来越多地转向经验基准来评估这些方法的有效性。在本文中,我们发现现有基准对候选遗忘方法的有效性提供了过于乐观且可能具有误导性的看法。通过引入对多个流行基准的简单、良性的修改,我们揭示了在某些情况下,所谓的遗忘信息仍然可以访问,或者遗忘过程对保留信息的模型性能的损害程度远大于原始基准所指示的程度。我们发现现有基准特别容易受到引入遗忘信息和保留信息之间松散依赖关系的修改的影响。此外,我们表明现有基准中遗忘目标的模糊性很容易导致设计出过度拟合给定测试查询的方法。基于我们的发现,我们敦促社区在将基准结果解释为进展的可靠衡量标准时要谨慎,并提供了几项建议以指导未来 LLM 遗忘研究。

[NLP-111] PyRIT: A Framework for Security Risk Identification and Red Teaming in Generative AI System

【速读】: 该论文试图解决生成式人工智能(GenAI)系统中风险识别的挑战,特别是针对多模态模型的新型危害、风险和越狱行为的探测。解决方案的关键在于开发了Python风险识别工具包(PyRIT),这是一个开源、模型和平台无关的框架,通过其可组合的架构,支持红队人员高效地探测和识别GenAI系统中的风险,并具备对未来模型和模态的扩展能力。

链接: https://arxiv.org/abs/2410.02828
作者: Gary D. Lopez Munoz,Amanda J. Minnich,Roman Lutz,Richard Lundeen,Raja Sekhar Rao Dheekonda,Nina Chikanov,Bolor-Erdene Jagdagdorj,Martin Pouliot,Shiven Chawla,Whitney Maxwell,Blake Bullwinkel,Katherine Pratt,Joris de Gruyter,Charlotte Siska,Pete Bryan,Tori Westerhoff,Chang Kawaguchi,Christian Seifert,Ram Shankar Siva Kumar,Yonatan Zunger
关键词-EN: Generative Artificial Intelligence, Artificial Intelligence, Generative Artificial, daily lives, Risk Identification Toolkit
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Generative Artificial Intelligence (GenAI) is becoming ubiquitous in our daily lives. The increase in computational power and data availability has led to a proliferation of both single- and multi-modal models. As the GenAI ecosystem matures, the need for extensible and model-agnostic risk identification frameworks is growing. To meet this need, we introduce the Python Risk Identification Toolkit (PyRIT), an open-source framework designed to enhance red teaming efforts in GenAI systems. PyRIT is a model- and platform-agnostic tool that enables red teamers to probe for and identify novel harms, risks, and jailbreaks in multimodal generative AI models. Its composable architecture facilitates the reuse of core building blocks and allows for extensibility to future models and modalities. This paper details the challenges specific to red teaming generative AI systems, the development and features of PyRIT, and its practical applications in real-world scenarios.
摘要:生成式人工智能 (Generative AI) 正逐渐渗透到我们的日常生活中。计算能力的提升和数据可用性的增加,使得单模态和多模态模型大量涌现。随着生成式 AI 生态系统的成熟,对可扩展且与模型无关的风险识别框架的需求日益增长。为了满足这一需求,我们推出了 Python 风险识别工具包 (PyRIT),这是一个开源框架,旨在增强生成式 AI 系统中的红队测试工作。PyRIT 是一个与模型和平台无关的工具,使红队人员能够探测并识别多模态生成式 AI 模型中的新型危害、风险和越狱行为。其可组合的架构促进了核心构建模块的重用,并允许扩展到未来的模型和模态。本文详细介绍了生成式 AI 系统红队测试面临的特定挑战、PyRIT 的开发及其功能,以及其在现实场景中的实际应用。

[NLP-112] Ingest-And-Ground: Dispelling Hallucinations from Continually-Pretrained LLMs with RAG

【速读】: 该论文试图解决大语言模型(LLM)在处理隐私相关查询时可能产生的幻觉问题,即模型输出不准确或不相关信息的问题。解决方案的关键在于通过持续预训练LLM模型与隐私特定知识库相结合,并在此基础上增加一个语义检索增强生成(RAG)层,从而使模型在处理隐私查询时能够基于事实信息生成更准确的响应,显著提升模型性能。

链接: https://arxiv.org/abs/2410.02825
作者: Chenhao Fang,Derek Larson,Shitong Zhu,Sophie Zeng,Wendy Summer,Yanqing Peng,Yuriy Hulovatyy,Rajeev Rao,Gabriel Forgues,Arya Pudota,Alex Goncalves,Hervé Robert
关键词-EN: improve privacy process, privacy process efficiency, paper presents, presents new methods, potential to improve
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:This paper presents new methods that have the potential to improve privacy process efficiency with LLM and RAG. To reduce hallucination, we continually pre-train the base LLM model with a privacy-specific knowledge base and then augment it with a semantic RAG layer. Our evaluations demonstrate that this approach enhances the model performance (as much as doubled metrics compared to out-of-box LLM) in handling privacy-related queries, by grounding responses with factual information which reduces inaccuracies.
摘要:本文介绍了一种新方法,该方法有望通过大语言模型 (LLM) 和检索增强生成 (RAG) 技术提高隐私处理效率。为了减少幻觉现象,我们持续使用隐私特定知识库对基础 LLM 模型进行预训练,然后通过语义 RAG 层对其进行增强。我们的评估结果表明,这种方法在处理隐私相关查询时显著提升了模型性能(与开箱即用的 LLM 相比,指标提升高达两倍),通过基于事实信息的响应减少了不准确性。
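“先检索、后基于事实作答”的语义 RAG 分层思路可以用如下玩具级示意说明:用词面重叠做检索,并在提示中强制模型仅依据检索到的上下文作答。检索器、知识库与提示措辞均为假设,仅用于说明减少幻觉的基本机制(真实系统应使用语义向量检索):

```python
def retrieve(query: str, docs: list, k: int = 2) -> list:
    """Toy lexical retriever: rank documents by word overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def grounded_prompt(query: str, docs: list) -> str:
    """Build a prompt that grounds the answer in retrieved context only."""
    context = "\n".join(f"[{i}] {d}" for i, d in enumerate(retrieve(query, docs)))
    return ("Answer using ONLY the context below; say 'unknown' if it is absent.\n"
            f"Context:\n{context}\nQuestion: {query}")

# Hypothetical privacy knowledge base.
kb = [
    "Data retention period for logs is 90 days.",
    "The privacy office reviews new features quarterly.",
    "Employees must complete annual security training.",
]
p = grounded_prompt("What is the retention period for logs?", kb)
```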

[NLP-113] GPTs Judgements Under Uncertainty

【速读】: 该论文试图解决的问题是探究人类认知中固有的偏见(如损失厌恶、框架效应和合取谬误)是否在GPT-4o在概率场景中的判断和决策中表现出来。解决方案的关键在于通过1350次实验,涵盖九种认知偏见,分析GPT-4o在面对具有相似概率基础的提示时的反应,以区分其使用统计推理还是启发式推理。研究发现,GPT-4o在处理相同提示的不同迭代时,既表现出类似人类的启发式错误,也展现出统计上合理的决策,显示出其决策过程的矛盾性。

链接: https://arxiv.org/abs/2410.02820
作者: Payam Saeedi,Mahsa Goodarzi
关键词-EN: framing effects, human cognition, loss aversion, conjunction fallacy, judges and makes
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We investigate whether biases inherent in human cognition, such as loss aversion, framing effects, and conjunction fallacy, manifest in how GPT-4o judges and makes decisions in probabilistic scenarios. By conducting 1350 experiments across nine cognitive biases and analyzing the responses for statistical versus heuristic reasoning, we demonstrate GPT-4o’s contradicting approach while responding to prompts with similar underlying probability notations. Our findings also reveal mixed performances with the AI demonstrating both human-like heuristic errors and statistically sound decisions, even as it goes through identical iterations of the same prompt.
摘要:我们研究了人类认知中固有的偏见,如损失厌恶、框架效应和合取谬误,是否在 GPT-4o 判断和做出概率情境决策时显现出来。通过进行 1350 次实验,涵盖九种认知偏见,并分析统计推理与启发式推理的响应,我们展示了 GPT-4o 在处理具有相似基础概率表示的提示时采用的矛盾方法。我们的研究结果还揭示了 AI 在表现上的混合性,既展示了类似人类的启发式错误,也做出了统计上合理的决策,即使在处理相同提示的相同迭代过程中也是如此。

[NLP-114] SAC-KG: Exploiting Large Language Models as Skilled Automatic Constructors for Domain Knowledge Graphs ACL2024

【速读】: 该论文试图解决现有知识图谱(KG)构建方法过度依赖人工干预,导致在实际应用中难以扩展和自动化的问题。解决方案的关键在于提出了一个名为SAC-KG的通用KG构建框架,利用大型语言模型(LLMs)作为技能自动构建器,通过生成器、验证器和修剪器三个组件协同工作,自动从领域语料库中生成精确的多层次知识图谱。该框架不仅显著提高了KG构建的自动化程度,还在精度上比现有最先进方法提升了20%以上。

链接: https://arxiv.org/abs/2410.02811
作者: Hanzhu Chen,Xu Shen,Qitan Lv,Jie Wang,Xiaoqi Ni,Jieping Ye
关键词-EN: domain Knowledge Graph, Skilled Automatic Constructors, play a pivotal, pivotal role, role in knowledge-intensive
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: ACL 2024 Main

点击查看摘要

Abstract:Knowledge graphs (KGs) play a pivotal role in knowledge-intensive tasks across specialized domains, where the acquisition of precise and dependable knowledge is crucial. However, existing KG construction methods heavily rely on human intervention to attain qualified KGs, which severely hinders the practical applicability in real-world scenarios. To address this challenge, we propose a general KG construction framework, named SAC-KG, to exploit large language models (LLMs) as Skilled Automatic Constructors for domain Knowledge Graph. SAC-KG effectively involves LLMs as domain experts to generate specialized and precise multi-level KGs. Specifically, SAC-KG consists of three components: Generator, Verifier, and Pruner. For a given entity, Generator produces its relations and tails from raw domain corpora, to construct a specialized single-level KG. Verifier and Pruner then work together to ensure precision by correcting generation errors and determining whether newly produced tails require further iteration for the next-level KG. Experiments demonstrate that SAC-KG automatically constructs a domain KG at the scale of over one million nodes and achieves a precision of 89.32%, leading to a superior performance with over 20% increase in precision rate compared to existing state-of-the-art methods for the KG construction task.
摘要:知识图谱 (Knowledge Graphs, KGs) 在专业领域的知识密集型任务中发挥着关键作用,其中获取精确且可靠的知识至关重要。然而,现有的知识图谱构建方法严重依赖人工干预以获得合格的知识图谱,这严重阻碍了其在实际应用中的可行性。为解决这一挑战,我们提出了一种通用知识图谱构建框架,名为 SAC-KG,该框架利用大语言模型 (Large Language Models, LLMs) 作为领域知识图谱的熟练自动构建者。SAC-KG 有效地将 LLMs 作为领域专家,生成专门且精确的多层次知识图谱。具体而言,SAC-KG 由三个组件组成:生成器 (Generator)、验证器 (Verifier) 和修剪器 (Pruner)。对于给定的实体,生成器从原始领域语料库中生成其关系和尾部,以构建一个专门的单层次知识图谱。验证器和修剪器随后协同工作,通过纠正生成错误并确定新产生的尾部是否需要进一步迭代以进行下一层次的构建,从而确保精确性。实验表明,SAC-KG 能够自动构建规模超过一百万个节点的领域知识图谱,并达到 89.32% 的精确度,相较于现有最先进的知识图谱构建方法,其精确率提高了 20% 以上,从而实现了卓越的性能。
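Generator、Verifier、Pruner 三组件逐层构建知识图谱的流程可以用如下骨架示意。三个组件这里均以桩函数代替(真实系统中 Generator 由 LLM 在领域语料上生成三元组,Verifier/Pruner 亦由模型驱动),语料与实体均为虚构:

```python
def generate(entity, corpus):
    """Generator stub: propose (relation, tail) pairs for an entity.
    A real SAC-KG would prompt an LLM over the raw domain corpus."""
    return [(r, t) for (h, r, t) in corpus if h == entity]

def verify(triples):
    """Verifier stub: drop malformed or empty generations."""
    return [(r, t) for (r, t) in triples if r and t]

def prune(triples, seen):
    """Pruner stub: decide which new tails get expanded at the next level."""
    return [t for (_, t) in triples if t not in seen]

def build_kg(root, corpus, max_depth=2):
    """Iterate generate -> verify -> prune, level by level."""
    kg, frontier, seen = [], [root], {root}
    for _ in range(max_depth):
        next_frontier = []
        for e in frontier:
            triples = verify(generate(e, corpus))
            kg += [(e, r, t) for (r, t) in triples]
            new = prune(triples, seen)
            seen |= set(new)
            next_frontier += new
        frontier = next_frontier
    return kg

corpus = [("rice", "grown_in", "paddy"), ("paddy", "requires", "irrigation")]
kg = build_kg("rice", corpus)
```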

[NLP-115] StateAct: State Tracking and Reasoning for Acting and Planning with Large Language Models

【速读】: 该论文试图解决大型语言模型(LLMs)在交互环境中进行长期推理任务时面临的挑战。解决方案的关键在于提出了一种基于少样本上下文学习的简单方法,通过增强“思维链”(chain-of-thought)与状态跟踪(state-tracking)相结合,以提升LLMs在规划和执行任务中的表现。该方法无需额外数据或人工规则,仅依赖于少样本上下文学习,显著提高了在Alfworld等任务中的性能,并展示了其在解决长时程问题和减少任务步骤方面的效率提升。

链接: https://arxiv.org/abs/2410.02810
作者: Nikolai Rozanov,Marek Rei
关键词-EN: large language models, language models, large language, interactive environments, Planning and acting
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 9 pages, 5 pages appendix, 7 figures, 5 tables

点击查看摘要

Abstract:Planning and acting to solve 'real' tasks using large language models (LLMs) in interactive environments has become a new frontier for AI methods. While recent advances allowed LLMs to interact with online tools, solve robotics tasks and many more, long range reasoning tasks remain a problem for LLMs. Existing methods to address this issue are very resource intensive and require additional data or human-crafted rules; instead, we propose a simple method based on few-shot in-context learning alone to enhance 'chain-of-thought' with state-tracking for planning and acting with LLMs. We show that our method establishes the new state-of-the-art on Alfworld for in-context learning methods (+14% over the previous best few-shot in-context learning method) and performs on par with methods that use additional training data and additional tools such as code-execution. We also demonstrate that our enhanced 'chain-of-states' allows the agent to both solve longer horizon problems and to be more efficient in number of steps required to solve a task. We show that our method works across a variety of LLMs for both API-based and open source ones. Finally, we also conduct ablation studies and show that 'chain-of-thoughts' helps state-tracking accuracy, while a json-structure harms overall performance. We open-source our code and annotations at this https URL.
摘要:在交互环境中利用大语言模型 (LLMs) 进行规划和执行以解决“真实”任务已成为 AI 方法的新前沿。尽管近期进展使 LLMs 能够与在线工具互动、解决机器人任务等,但长程推理任务对 LLMs 仍是一个挑战。现有解决此问题的方法非常耗费资源,且需要额外数据或人工制定的规则。相反,我们提出了一种基于少样本上下文学习的简单方法,通过状态追踪增强“思维链”,以实现 LLMs 的规划和执行。我们的研究表明,该方法在 Alfworld 上为上下文学习方法设立了新的技术水平(比之前的最佳少样本上下文学习方法提高了 14%),并能与使用额外训练数据和工具(如代码执行)的方法相媲美。我们还展示了增强的“状态链”使智能体不仅能解决更长周期的任务,而且在解决任务所需的步骤数量上更为高效。我们的方法适用于多种基于 API 和开源的 LLMs。最后,我们还进行了消融研究,结果显示“思维链”有助于状态追踪的准确性,而 json 结构则对整体性能有负面影响。我们已在 this https URL 开源了代码和注释。
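摘要指出“状态链”增强了思维链,且 JSON 结构化状态反而有害。下面用纯文本键值对的方式示意如何把追踪到的状态序列化进每一步的提示(字段名、任务与模板措辞均为本文假设,并非论文原始模板):

```python
def render_state(state: dict) -> str:
    """Serialize the tracked state as plain text key: value lines
    (the abstract reports that a JSON rendering harms performance)."""
    return "\n".join(f"{k}: {v}" for k, v in state.items())

def step_prompt(goal: str, state: dict, history: list) -> str:
    """Build one interaction-step prompt: goal + current state + past actions."""
    return (f"Goal: {goal}\n"
            f"Current state:\n{render_state(state)}\n"
            f"Previous actions: {', '.join(history) or 'none'}\n"
            "Think about the state, then output the next action.")

state = {"location": "kitchen", "holding": "nothing", "subgoal": "find a mug"}
p = step_prompt("make coffee", state, ["go to kitchen"])
```

每执行一个动作后更新 `state` 并重新拼装提示,即构成贯穿整个回合的“状态链”。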

[NLP-116] TaCIE: Enhancing Instruction Comprehension in Large Language Models through Task-Centred Instruction Evolution

【速读】: 该论文试图解决大型语言模型(LLMs)在复杂指令下的性能优化问题,特别是传统方法在处理复杂指令时的局限性,如难以有效提升指令的复杂度和跨领域难度管理。解决方案的关键在于提出了一种名为任务中心指令进化(Task-Centered Instruction Evolution, TaCIE)的创新方法。TaCIE通过将复杂指令分解为基础组件,生成并整合新元素,再重新组装成更复杂的指令,从而实现指令的动态进化,显著提升了LLMs在多领域应用中的性能。

链接: https://arxiv.org/abs/2410.02795
作者: Jiuding Yang,Shengyao Lu,Weidong Guo,Xiangyang Li,Kaitong Yang,Yu Xu,Di Niu
关键词-EN: Large Language Models, Large Language, require precise alignment, Language Models, require precise
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) require precise alignment with complex instructions to optimize their performance in real-world applications. As the demand for refined instruction tuning data increases, traditional methods that evolve simple seed instructions often struggle to effectively enhance complexity or manage difficulty scaling across various domains. Our innovative approach, Task-Centered Instruction Evolution (TaCIE), addresses these shortcomings by redefining instruction evolution from merely evolving seed instructions to a more dynamic and comprehensive combination of elements. TaCIE starts by deconstructing complex instructions into their fundamental components. It then generates and integrates new elements with the original ones, reassembling them into more sophisticated instructions that progressively increase in difficulty, diversity, and complexity. Applied across multiple domains, LLMs fine-tuned with these evolved instructions have substantially outperformed those tuned with conventional methods, marking a significant advancement in instruction-based model fine-tuning.
摘要:大语言模型 (LLMs) 在实际应用中需要与复杂的指令精确对齐以优化其性能。随着对精细化指令调优数据需求的增加,传统的通过演化简单种子指令来构造数据的方法往往难以有效提升复杂度或在不同领域间管理难度扩展。我们的创新方法,任务中心指令进化 (Task-Centered Instruction Evolution, TaCIE),通过重新定义指令进化,从仅进化种子指令转变为更动态和全面的元素组合,解决了这些不足。TaCIE 首先将复杂指令分解为其基本组成部分,然后生成并整合新元素与原始元素,重新组装成更加复杂的指令,这些指令在难度、多样性和复杂性上逐步增加。在多个领域应用中,经过这些进化指令微调的 LLMs 显著优于使用传统方法微调的模型,标志着基于指令的模型微调技术取得了重大进展。

[NLP-117] Navigation with VLM framework: Go to Any Language

【速读】: 该论文试图解决在开放场景中实现类似人类探索行为的导航问题,特别是在面对开放词汇和任意语言目标时的导航挑战。解决方案的关键在于引入Navigation with VLM (NavVLM)框架,该框架利用设备级别的Vision Large Language Models (VLMs),使代理能够在无需任何预训练的情况下,根据任意语言目标感知环境信息并提供持续的探索指导,直至到达目标位置或区域。这一方法不仅在传统特定目标设置中实现了最先进的成功率(SR)和路径长度加权成功率(SPL),还将导航能力扩展到任何开放集语言目标,显著提升了导航系统的灵活性和适应性。

链接: https://arxiv.org/abs/2410.02787
作者: Zecheng Yin,Chonghao Cheng,Lizhen
关键词-EN: posed significant challenges, Vision Large Language, Large Language Models, significant challenges, Vision Large
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: under review

点击查看摘要

Abstract:Navigating towards fully open language goals and exploring open scenes in a manner akin to human exploration have always posed significant challenges. Recently, Vision Large Language Models (VLMs) have demonstrated remarkable capabilities in reasoning with both language and visual data. While many works have focused on leveraging VLMs for navigation in open scenes and with open vocabularies, these efforts often fall short of fully utilizing the potential of VLMs or require substantial computational resources. We introduce Navigation with VLM (NavVLM), a framework that harnesses equipment-level VLMs to enable agents to navigate towards any language goal specific or non-specific in open scenes, emulating human exploration behaviors without any prior training. The agent leverages the VLM as its cognitive core to perceive environmental information based on any language goal and constantly provides exploration guidance during navigation until it reaches the target location or area. Our framework not only achieves state-of-the-art performance in Success Rate (SR) and Success weighted by Path Length (SPL) in traditional specific goal settings but also extends the navigation capabilities to any open-set language goal. We evaluate NavVLM in richly detailed environments from the Matterport 3D (MP3D), Habitat Matterport 3D (HM3D), and Gibson datasets within the Habitat simulator. With the power of VLMs, navigation has entered a new era.
摘要:在实现完全开放的语言目标和探索开放场景方面,如何以类似于人类探索的方式进行,一直是一个重大挑战。最近,视觉大语言模型 (Vision Large Language Models, VLMs) 在处理语言和视觉数据推理方面展示了卓越的能力。尽管许多研究致力于利用 VLMs 在开放场景和开放词汇表中进行导航,但这些努力往往未能充分发挥 VLMs 的潜力,或者需要大量的计算资源。我们提出了基于 VLM 的导航框架 (Navigation with VLM, NavVLM),该框架利用设备级别的 VLMs 使智能体能够在开放场景中导航至任何特定或非特定的语言目标,模拟人类探索行为,而无需任何预先训练。智能体利用 VLM 作为其认知核心,根据任何语言目标感知环境信息,并在导航过程中不断提供探索指导,直至到达目标位置或区域。我们的框架不仅在传统的特定目标设置中实现了最先进的成功率 (Success Rate, SR) 和路径长度加权成功率 (Success weighted by Path Length, SPL),还将导航能力扩展到任何开放集语言目标。我们在 Habitat 模拟器中的 Matterport 3D (MP3D)、Habitat Matterport 3D (HM3D) 和 Gibson 数据集的丰富详细环境中评估了 NavVLM。借助 VLMs 的力量,导航技术进入了一个新时代。

[NLP-118] Learning variant product relationship and variation attributes from e-commerce website structures

【速读】: 该论文试图解决电子商务目录中变体产品关系的识别问题,即如何确定两个产品是变体关系,并识别它们之间的差异属性。解决方案的关键在于提出了一种名为VARM(变体关系匹配策略)的新型实体解析方法,该方法结合了编码和生成式AI模型的优势。首先,通过构建包含网页产品链接的数据集来训练编码大型语言模型(LLM),以预测任意给定产品对的变体匹配关系;其次,利用RAG提示的生成式LLM来提取变体产品组之间的变化和共同属性。这种方法不仅识别变体关系,还能明确指出变体间的差异,从而在实际数据评估中表现优于传统解决方案,为利用这种新型产品关系提供了有效途径。

链接: https://arxiv.org/abs/2410.02779
作者: Pedro Herrero-Vidal,You-Lin Chen,Cris Liu,Prithviraj Sen,Lichao Wang
关键词-EN: introduce VARM, product relationships, product, variant product relationships, variant relationship matcher
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce VARM, variant relationship matcher strategy, to identify pairs of variant products in e-commerce catalogs. Traditional definitions of entity resolution are concerned with whether product mentions refer to the same underlying product. However, this fails to capture product relationships that are critical for e-commerce applications, such as having similar, but not identical, products listed on the same webpage or share reviews. Here, we formulate a new type of entity resolution in variant product relationships to capture these similar e-commerce product links. In contrast with the traditional definition, the new definition requires both identifying if two products are variant matches of each other and what are the attributes that vary between them. To satisfy these two requirements, we developed a strategy that leverages the strengths of both encoding and generative AI models. First, we construct a dataset that captures webpage product links, and therefore variant product relationships, to train an encoding LLM to predict variant matches for any given pair of products. Second, we use RAG prompted generative LLMs to extract variation and common attributes amongst groups of variant products. To validate our strategy, we evaluated model performance using real data from one of the world’s leading e-commerce retailers. The results showed that our strategy outperforms alternative solutions and paves the way to exploiting these new type of product relationships.
摘要:我们引入了 VARM,即变体关系匹配策略,用于识别电子商务目录中的变体产品对。传统的实体解析定义关注的是产品提及是否指向同一底层产品。然而,这种方法未能捕捉到电子商务应用中至关重要的产品关系,例如在同一网页上列出相似但不完全相同的产品,或共享评论。在此,我们提出了一种新的实体解析类型,即变体产品关系,以捕捉这些相似的电子商务产品链接。与传统定义不同,新的定义不仅需要识别两个产品是否为彼此的变体匹配,还需要确定它们之间变化的属性。为了满足这两个要求,我们开发了一种策略,结合了编码和生成式 AI 模型的优势。首先,我们构建了一个数据集,捕捉网页产品链接,从而捕捉变体产品关系,以训练一个编码大语言模型来预测任何给定产品对的变体匹配。其次,我们使用 RAG 提示的生成式大语言模型来提取变体产品组之间的变化和共同属性。为了验证我们的策略,我们使用全球领先电子商务零售商的真实数据评估了模型性能。结果显示,我们的策略优于其他解决方案,并为利用这种新型产品关系铺平了道路。

[NLP-119] Mind the Uncertainty in Human Disagreement: Evaluating Discrepancies between Model Predictions and Human Responses in VQA

【速读】: 该论文试图解决视觉问答(VQA)任务中,现有视觉-语言模型在预测多个人类标注者提供的答案时,难以准确捕捉人类不确定性分布的问题。解决方案的关键在于引入新的与人类相关联的评估指标,并验证了针对人类分布进行模型校准的有效性。研究结果表明,即使是当前最先进的BEiT3模型,也难以捕捉人类答案的多标签分布特性,而传统的以准确性为导向的校准方法反而加剧了模型预测与人类分布之间的差距。论文强调,通过校准模型以更好地匹配人类不确定性分布,可以显著提升模型在VQA任务中的表现,并指出这一领域在模型与人类响应的一致性对齐方面仍需进一步研究。

链接: https://arxiv.org/abs/2410.02773
作者: Jian Lan,Diego Frassinelli,Barbara Plank
关键词-EN: Visual Question Answering, Large vision-language models, Large vision-language, multiple human annotators, accurately predict responses
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large vision-language models frequently struggle to accurately predict responses provided by multiple human annotators, particularly when those responses exhibit human uncertainty. In this study, we focus on the Visual Question Answering (VQA) task, and we comprehensively evaluate how well the state-of-the-art vision-language models correlate with the distribution of human responses. To do so, we categorize our samples based on their levels (low, medium, high) of human uncertainty in disagreement (HUD) and employ not only accuracy but also three new human-correlated metrics in VQA, to investigate the impact of HUD. To better align models with humans, we also verify the effect of common calibration and human calibration. Our results show that even BEiT3, currently the best model for this task, struggles to capture the multi-label distribution inherent in diverse human responses. Additionally, we observe that the commonly used accuracy-oriented calibration technique adversely affects BEiT3’s ability to capture HUD, further widening the gap between model predictions and human distributions. In contrast, we show the benefits of calibrating models towards human distributions for VQA, better aligning model confidence with human uncertainty. Our findings highlight that for VQA, the consistent alignment between human responses and model predictions is understudied and should become the next crucial target of future studies.
摘要:大型视觉-语言模型在准确预测多个标注者提供的答案时经常遇到困难,尤其是在这些答案表现出人类的不确定性时。在本研究中,我们专注于视觉问答 (Visual Question Answering, VQA) 任务,并全面评估了最先进的视觉-语言模型与人类答案分布的相关性。为此,我们根据样本中人类在分歧中的不确定性水平 (低、中、高) 对样本进行分类,并采用不仅包括准确性,还包括三种新的与人类相关联的 VQA 指标,以研究不确定性水平对模型的影响。为了更好地使模型与人类对齐,我们还验证了常见校准方法和人类校准方法的效果。我们的结果表明,即使是目前在该任务上表现最佳的 BEiT3 模型,也难以捕捉多样化人类答案中固有的多标签分布。此外,我们观察到,常用的以准确性为导向的校准技术反而削弱了 BEiT3 捕捉不确定性水平的能力,进一步扩大了模型预测与人类分布之间的差距。相比之下,我们展示了将模型校准向人类分布方向对 VQA 任务的益处,更好地使模型置信度与人类不确定性对齐。我们的研究结果强调,对于 VQA 任务,人类答案与模型预测之间的一致性对齐研究尚不充分,应成为未来研究的关键目标。
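论文强调模型置信度应与标注者之间的(不)一致分布对齐。下面示意如何由多位标注者的答案构造人类分布,并用总变差距离衡量模型预测分布与人类分布的差距。数据为虚构示例,且该距离仅作说明,并不声称是论文所提三个新指标之一:

```python
def human_distribution(answers):
    """Turn raw annotator answers into a probability distribution."""
    total = len(answers)
    dist = {}
    for a in answers:
        dist[a] = dist.get(a, 0) + 1 / total
    return dist

def total_variation(p, q):
    """Total variation distance between two answer distributions;
    0 means model confidence exactly mirrors human (dis)agreement."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Four annotators disagree: 3 say "red", 1 says "maroon".
human = human_distribution(["red", "red", "maroon", "red"])
model = {"red": 0.9, "maroon": 0.1}  # over-confident model prediction
gap = total_variation(model, human)
```

将校准目标由“答对多数票”改为缩小这一分布差距,即是摘要所说“向人类分布校准”的基本含义。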

[NLP-120] HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models

【速读】: 该论文试图解决在移动设备上部署大型语言模型(LLMs)时,现有安全防护模型因参数庞大导致的内存需求和延迟问题。解决方案的关键在于通过数据增强方法HarmAug,将大型教师模型蒸馏成小型模型。HarmAug通过引导LLM生成有害指令并生成相应响应,从而扩充训练数据集,提升小型模型的性能。实验结果表明,使用HarmAug训练的435百万参数安全防护模型在F1分数和AUPRC指标上可媲美甚至超越70亿参数的大型模型,同时显著降低计算成本。

链接: https://arxiv.org/abs/2410.01524
作者: Seanie Lee,Haebin Seong,Dong Bok Lee,Minki Kang,Xiaoyin Chen,Dominik Wagner,Yoshua Bengio,Juho Lee,Sung Ju Hwang
关键词-EN: detect malicious queries, malicious queries aimed, Safety guard models, Safety guard, existing safety guard
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Safety guard models that detect malicious queries aimed at large language models (LLMs) are essential for ensuring the secure and responsible deployment of LLMs in real-world applications. However, deploying existing safety guard models with billions of parameters alongside LLMs on mobile devices is impractical due to substantial memory requirements and latency. To reduce this cost, we distill a large teacher safety guard model into a smaller one using a labeled dataset of instruction-response pairs with binary harmfulness labels. Due to the limited diversity of harmful instructions in the existing labeled dataset, naively distilled models tend to underperform compared to larger models. To bridge the gap between small and large models, we propose HarmAug, a simple yet effective data augmentation method that involves jailbreaking an LLM and prompting it to generate harmful instructions. Given a prompt such as, “Make a single harmful instruction prompt that would elicit offensive content”, we add an affirmative prefix (e.g., “I have an idea for a prompt:”) to the LLM’s response. This encourages the LLM to continue generating the rest of the response, leading to sampling harmful instructions. Another LLM generates a response to the harmful instruction, and the teacher model labels the instruction-response pair. We empirically show that our HarmAug outperforms other relevant baselines. Moreover, a 435-million-parameter safety guard model trained with HarmAug achieves an F1 score comparable to larger models with over 7 billion parameters, and even outperforms them in AUPRC, while operating at less than 25% of their computational cost.
摘要:检测针对大语言模型 (LLMs) 的恶意查询的安全防护模型对于确保 LLMs 在实际应用中的安全与负责任部署至关重要。然而,由于内存需求和延迟问题,将现有的拥有数十亿参数的安全防护模型与 LLMs 一同部署在移动设备上并不现实。为了降低这一成本,我们利用带有二元有害标签的指令-响应对标记数据集,将一个大型教师安全防护模型蒸馏成一个较小的模型。由于现有标记数据集中有害指令的多样性有限,简单蒸馏的模型往往表现不如大型模型。为了缩小小型与大型模型之间的差距,我们提出了 HarmAug,一种简单而有效的数据增强方法,该方法涉及破解 LLM 并引导其生成有害指令。给定一个提示,例如“制作一个会引发冒犯内容的单一有害指令提示”,我们在 LLM 的响应前添加一个肯定的前缀(例如“我有一个提示的想法:”)。这鼓励 LLM 继续生成响应的其余部分,从而采样有害指令。另一个 LLM 生成对有害指令的响应,教师模型则标记该指令-响应对。我们通过实验证明,我们的 HarmAug 优于其他相关基线。此外,使用 HarmAug 训练的 4.35 亿参数安全防护模型在 F1 分数上与超过 70 亿参数的大型模型相当,甚至在 AUPRC 上表现更优,同时计算成本不到后者的 25%。
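HarmAug 的核心配方——先要求 LLM 给出一条有害指令,再在响应开头拼接肯定前缀以诱导其继续生成而非拒绝,最后由教师模型对指令-响应对打二元标签——可以用如下骨架示意(`teacher` 以任意可调用对象代替真实的教师安全模型):

```python
def harmaug_prompt(meta_instruction: str,
                   prefix: str = "I have an idea for a prompt:") -> str:
    """Build the sampling prompt per the paper's recipe: the affirmative
    prefix seeds the response so the model continues instead of refusing."""
    return f"{meta_instruction}\n{prefix}"

def label_pair(instruction: str, response: str, teacher) -> int:
    """The teacher safety model assigns a binary harmfulness label (0/1)."""
    return teacher(instruction, response)

p = harmaug_prompt(
    "Make a single harmful instruction prompt that would elicit offensive content"
)
# A stub teacher that flags everything as harmful, for illustration only.
label = label_pair("sampled instruction", "sampled response", lambda i, r: 1)
```

采样得到的 (指令, 响应, 标签) 三元组即加入蒸馏小模型的训练集。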

[NLP-121] FastAdaSP: Multitask-Adapted Efficient Inference for Large Speech Language Model EMNLP2024

【速读】: 该论文旨在解决多任务语音语言模型(SpeechLM)在推理过程中效率低下的问题,特别是针对语音信号的时序依赖性,传统的视觉或文本模态的推理优化方法无法直接适用。论文提出了FastAdaSP,一种加权令牌合并框架,专门设计用于各种语音相关任务,以改善效率与性能之间的权衡。关键在于通过令牌合并技术,显著提高了内存效率和解码吞吐量,同时保持了任务性能,如情感识别(ER)和口语问答(SQA)等任务中,FastAdaSP实现了7倍的内存效率提升和1.83倍的解码吞吐量提升,且未导致性能下降。

链接: https://arxiv.org/abs/2410.03007
作者: Yichen Lu,Jiaqi Song,Chao-Han Huck Yang,Shinji Watanabe
关键词-EN: Speech Language Model, Multitask Speech Language, explore Multitask Speech, Language Model, explore Multitask
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: EMNLP 2024 Industry Track

点击查看摘要

Abstract:In this study, we aim to explore Multitask Speech Language Model (SpeechLM) efficient inference via token reduction. Unlike other modalities such as vision or text, speech has unique temporal dependencies, making previous efficient inference works on other modalities not directly applicable. Furthermore, methods for efficient SpeechLM inference on long sequence and sparse signals remain largely unexplored. Then we propose FastAdaSP, a weighted token merging framework specifically designed for various speech-related tasks to improve the trade-off between efficiency and performance. Experimental results on WavLLM and Qwen-Audio show that our method achieves the state-of-the-art (SOTA) efficiency-performance trade-off compared with other baseline methods. Specifically, FastAdaSP achieved 7x memory efficiency and 1.83x decoding throughput without any degradation on tasks like Emotion Recognition (ER) and Spoken Question Answering (SQA). The code will be available at this https URL
摘要:在本研究中,我们旨在通过 Token 减少探索多任务语音语言模型 (SpeechLM) 的高效推理。与视觉或文本等其他模态不同,语音具有独特的时间依赖性,使得其他模态上的高效推理工作无法直接适用。此外,针对长序列和稀疏信号的高效 SpeechLM 推理方法仍未得到充分探索。为此,我们提出了 FastAdaSP,这是一个专为各种语音相关任务设计的加权 Token 合并框架,旨在改善效率与性能之间的权衡。在 WavLLM 和 Qwen-Audio 上的实验结果表明,与其它基线方法相比,我们的方法在效率-性能权衡方面达到了最先进 (SOTA) 水平。具体而言,FastAdaSP 在情感识别 (ER) 和口语问答 (SQA) 等任务上实现了 7 倍的内存效率和 1.83 倍的解码吞吐量,且没有任何性能下降。代码将在此 https URL 提供。
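摘要中的"加权 Token 合并"可以用一步合并操作来直观理解:找出余弦相似度最高的一对相邻 token,按权重做加权平均后合并,并累积权重。以下是该思想的极简 NumPy 草图,非 FastAdaSP 原始实现,相似度度量与合并策略均为简化假设:

```python
import numpy as np

def weighted_merge_once(tokens, weights):
    """合并余弦相似度最高的一对相邻 token(按权重加权平均,并累积权重)。
    仅为"加权 Token 合并"思想的一步极简示意,非 FastAdaSP 原始实现。"""
    a, b = tokens[:-1], tokens[1:]
    sims = (a * b).sum(-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8)
    i = int(sims.argmax())                       # 最相似的相邻 token 对
    w = weights[i] + weights[i + 1]
    merged = (weights[i] * tokens[i] + weights[i + 1] * tokens[i + 1]) / w
    new_tokens = np.concatenate([tokens[:i], merged[None], tokens[i + 2:]])
    new_weights = np.concatenate([weights[:i], [w], weights[i + 2:]])
    return new_tokens, new_weights
```

反复调用该操作即可把序列逐步缩短到目标长度,从而降低注意力计算与 KV 缓存开销。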

人工智能

[AI-0] Estimating Body and Hand Motion in an Ego-sensed World

链接: https://arxiv.org/abs/2410.03665
作者: Brent Yi,Vickie Ye,Maya Zheng,Lea Müller,Georgios Pavlakos,Yi Ma,Jitendra Malik,Angjoo Kanazawa
关键词-EN: head-mounted device, human motion estimation, egocentric SLAM poses, present EgoAllo, human motion
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL

点击查看摘要

Abstract:We present EgoAllo, a system for human motion estimation from a head-mounted device. Using only egocentric SLAM poses and images, EgoAllo guides sampling from a conditional diffusion model to estimate 3D body pose, height, and hand parameters that capture the wearer’s actions in the allocentric coordinate frame of the scene. To achieve this, our key insight is in representation: we propose spatial and temporal invariance criteria for improving model performance, from which we derive a head motion conditioning parameterization that improves estimation by up to 18%. We also show how the bodies estimated by our system can improve the hands: the resulting kinematic and temporal constraints result in over 40% lower hand estimation errors compared to noisy monocular estimates. Project page: this https URL

[AI-1] Enhance Reasoning by Learning from Mistakes: Peer-Review Knowledge Distillation from Multiple Large Language Models

链接: https://arxiv.org/abs/2410.03663
作者: Zhuochun Li,Yuelyu Ji,Rui Meng,Daqing He
关键词-EN: Large language models, natural language processing, demonstrated exceptional performance, Large language, exhibited complex reasoning
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 pages, 5 figures

点击查看摘要

Abstract:Large language models (LLMs) have exhibited complex reasoning abilities by generating question rationales and demonstrated exceptional performance in natural language processing (NLP) tasks. However, these reasoning capabilities generally emerge in models with tens of billions of parameters, creating significant computational challenges for real-world deployment. Recent research has concentrated on improving open-source smaller models through knowledge distillation (KD) from commercial LLMs. Nevertheless, most of these studies rely solely on the responses from one single LLM as the gold rationale for training. In this paper, we introduce a novel Mistake-Aware Peer-Review Distillation (MAPD) approach: 1) Instead of merely obtaining gold rationales from teachers, our method asks teachers to identify and explain the student’s mistakes, providing customized instruction learning data. 2) We design a simulated peer-review process between teacher LLMs, which selects only the generated rationales above the acceptance threshold. This reduces the chance of teachers guessing correctly with flawed rationale, improving instructional data quality. Comprehensive experiments and analysis on mathematical, commonsense, and logical reasoning tasks demonstrate the effectiveness of our method.
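论文中的模拟同行评审可以抽象为:多位教师对每条生成的推理依据打分,仅保留平均分达到接受阈值的依据,以降低"歪打正着"的劣质依据进入训练数据的概率。下面是这一过滤步骤的示意性 Python 草图,打分接口与阈值均为假设:

```python
def peer_review_filter(rationales, reviewers, threshold=0.7):
    """模拟教师间同行评审:每条推理依据由多位教师打分(0~1),
    只保留平均分不低于接受阈值的依据。打分接口与阈值均为示意性假设。"""
    accepted = []
    for rationale in rationales:
        scores = [review(rationale) for review in reviewers]
        if sum(scores) / len(scores) >= threshold:
            accepted.append(rationale)
    return accepted
```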

[AI-2] System 2 reasoning capabilities are nigh

链接: https://arxiv.org/abs/2410.03662
作者: Scott C. Lowe
关键词-EN: machine learning models, human-like reasoning capabilities, recent years, machine learning, made strides
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In recent years, machine learning models have made strides towards human-like reasoning capabilities from several directions. In this work, we review the current state of the literature and describe the remaining steps to achieve a neural model which can perform System 2 reasoning analogous to a human. We argue that if current models are insufficient to be classed as performing reasoning, there remains very little additional progress needed to attain that goal.

[AI-3] Geometric Representation Condition Improves Equivariant Molecule Generation

链接: https://arxiv.org/abs/2410.03655
作者: Zian Li,Cai Zhou,Xiyuan Wang,Xingang Peng,Muhan Zhang
关键词-EN: demonstrated substantial potential, Recent advancements, molecular generative models, accelerating scientific discovery, scientific discovery
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in molecular generative models have demonstrated substantial potential in accelerating scientific discovery, particularly in drug design. However, these models often face challenges in generating high-quality molecules, especially in conditional scenarios where specific molecular properties must be satisfied. In this work, we introduce GeoRCG, a general framework to enhance the performance of molecular generative models by integrating geometric representation conditions. We decompose the molecule generation process into two stages: first, generating an informative geometric representation; second, generating a molecule conditioned on the representation. Compared to directly generating a molecule, the relatively easy-to-generate representation in the first-stage guides the second-stage generation to reach a high-quality molecule in a more goal-oriented and much faster way. Leveraging EDM as the base generator, we observe significant quality improvements in unconditional molecule generation on the widely-used QM9 and GEOM-DRUG datasets. More notably, in the challenging conditional molecular generation task, our framework achieves an average 31% performance improvement over state-of-the-art approaches, highlighting the superiority of conditioning on semantically rich geometric representations over conditioning on individual property values as in previous approaches. Furthermore, we show that, with such representation guidance, the number of diffusion steps can be reduced to as small as 100 while maintaining superior generation quality than that achieved with 1,000 steps, thereby significantly accelerating the generation process.

[AI-4] GenSim2: Scaling Robot Data Generation with Multi-modal and Reasoning LLMs

链接: https://arxiv.org/abs/2410.03645
作者: Pu Hua,Minghuan Liu,Annabella Macaluso,Yunfeng Lin,Weinan Zhang,Huazhe Xu,Lirui Wang
关键词-EN: Robotic simulation today, today remains challenging, simulation today remains, create diverse simulation, Robotic simulation
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: CoRL 2024. Project website: this https URL

点击查看摘要

Abstract:Robotic simulation today remains challenging to scale up due to the human efforts required to create diverse simulation tasks and scenes. Simulation-trained policies also face scalability issues as many sim-to-real methods focus on a single task. To address these challenges, this work proposes GenSim2, a scalable framework that leverages coding LLMs with multi-modal and reasoning capabilities for complex and realistic simulation task creation, including long-horizon tasks with articulated objects. To automatically generate demonstration data for these tasks at scale, we propose planning and RL solvers that generalize within object categories. The pipeline can generate data for up to 100 articulated tasks with 200 objects and reduce the required human efforts. To utilize such data, we propose an effective multi-task language-conditioned policy architecture, dubbed proprioceptive point-cloud transformer (PPT), that learns from the generated demonstrations and exhibits strong sim-to-real zero-shot transfer. Combining the proposed pipeline and the policy architecture, we show a promising usage of GenSim2 that the generated data can be used for zero-shot transfer or co-train with real-world collected data, which enhances the policy performance by 20% compared with training exclusively on limited real data.

[AI-5] Aligning LLMs with Individual Preferences via Interaction

链接: https://arxiv.org/abs/2410.03642
作者: Shujin Wu,May Fung,Cheng Qian,Jeonghwan Kim,Dilek Hakkani-Tur,Heng Ji
关键词-EN: large language models, increasingly advanced capabilities, demonstrate increasingly advanced, language models, advanced capabilities
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: The code and dataset are made public at this https URL

点击查看摘要

Abstract:As large language models (LLMs) demonstrate increasingly advanced capabilities, aligning their behaviors with human values and preferences becomes crucial for their wide adoption. While previous research focuses on general alignment to principles such as helpfulness, harmlessness, and honesty, the need to account for individual and diverse preferences has been largely overlooked, potentially undermining customized human experiences. To address this gap, we train LLMs that can "interact to align", essentially cultivating the meta-skill of LLMs to implicitly infer the unspoken personalized preferences of the current user through multi-turn conversations, and then dynamically align their following behaviors and responses to these inferred preferences. Our approach involves establishing a diverse pool of 3,310 distinct user personas by initially creating seed examples, which are then expanded through iterative self-generation and filtering. Guided by distinct user personas, we leverage multi-LLM collaboration to develop a multi-turn preference dataset containing 3K+ multi-turn conversations in tree structures. Finally, we apply supervised fine-tuning and reinforcement learning to enhance LLMs using this dataset. For evaluation, we establish the ALOE (ALign With CustOmized PrEferences) benchmark, consisting of 100 carefully selected examples and well-designed metrics to measure the customized alignment performance during conversations. Experimental results demonstrate the effectiveness of our method in enabling dynamic, personalized alignment via interaction.

[AI-6] What Matters for Model Merging at Scale?

链接: https://arxiv.org/abs/2410.03617
作者: Prateek Yadav,Tu Vu,Jonathan Lai,Alexandra Chronopoulou,Manaal Faruqui,Mohit Bansal,Tsendsuren Munkhdalai
关键词-EN: models, merging, decentralized model development, capable single model, expert models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 20 Pages, 7 Figures, 4 Tables

点击查看摘要

Abstract:Model merging aims to combine multiple expert models into a more capable single model, offering benefits such as reduced storage and serving costs, improved generalization, and support for decentralized model development. Despite its promise, previous studies have primarily focused on merging a few small models. This leaves many unanswered questions about the effect of scaling model size and how it interplays with other key factors – like the base model quality and number of expert models – , to affect the merged model’s performance. This work systematically evaluates the utility of model merging at scale, examining the impact of these different factors. We experiment with merging fully fine-tuned models using 4 popular merging methods – Averaging, Task~Arithmetic, Dare, and TIES – across model sizes ranging from 1B-64B parameters and merging up to 8 different expert models. We evaluate the merged models on both held-in tasks, i.e., the expert’s training tasks, and zero-shot generalization to unseen held-out tasks. Our experiments provide several new insights about model merging at scale and the interplay between different factors. First, we find that merging is more effective when experts are created from strong base models, i.e., models with good zero-shot performance. Second, larger models facilitate easier merging. Third merging consistently improves generalization capabilities. Notably, when merging 8 large expert models, the merged models often generalize better compared to the multitask trained models. Fourth, we can better merge more expert models when working with larger models. Fifth, different merging methods behave very similarly at larger scales. Overall, our findings shed light on some interesting properties of model merging while also highlighting some limitations. We hope that this study will serve as a reference point on large-scale merging for upcoming research.
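摘要中对比的合并方法里,最基础的两种可以写成闭式公式:简单平均(Averaging)对各专家参数逐项取均值;Task Arithmetic 则把各专家相对基座模型的"任务向量"之和加回基座。以下 NumPy 草图按这两个定义实现,参数以字典表示,缩放系数 alpha 为假设的超参:

```python
import numpy as np

def average_merge(expert_params):
    """Averaging:对每个参数逐项取各专家模型的均值。"""
    return {k: np.mean([p[k] for p in expert_params], axis=0)
            for k in expert_params[0]}

def task_arithmetic_merge(base_params, expert_params, alpha=1.0):
    """Task Arithmetic:先求每个专家相对基座的任务向量(参数差),
    再把任务向量之和乘以缩放系数 alpha 加回基座(alpha 为假设的超参)。"""
    merged = {}
    for k in base_params:
        task_vectors = [p[k] - base_params[k] for p in expert_params]
        merged[k] = base_params[k] + alpha * np.sum(task_vectors, axis=0)
    return merged
```

Dare 与 TIES 可以看作在任务向量上增加稀疏化与符号冲突消解步骤,此处从略。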

[AI-7] TICKing All the Boxes: Generated Checklists Improve LLM Evaluation and Generation

链接: https://arxiv.org/abs/2410.03608
作者: Jonathan Cook,Tim Rocktäschel,Jakob Foerster,Dennis Aumiller,Alex Wang
关键词-EN: Large Language Models, Large Language, Language Models, usage of Large, instruction-following ability
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Given the widespread adoption and usage of Large Language Models (LLMs), it is crucial to have flexible and interpretable evaluations of their instruction-following ability. Preference judgments between model outputs have become the de facto evaluation standard, despite distilling complex, multi-faceted preferences into a single ranking. Furthermore, as human annotation is slow and costly, LLMs are increasingly used to make these judgments, at the expense of reliability and interpretability. In this work, we propose TICK (Targeted Instruct-evaluation with ChecKlists), a fully automated, interpretable evaluation protocol that structures evaluations with LLM-generated, instruction-specific checklists. We first show that, given an instruction, LLMs can reliably produce high-quality, tailored evaluation checklists that decompose the instruction into a series of YES/NO questions. Each question asks whether a candidate response meets a specific requirement of the instruction. We demonstrate that using TICK leads to a significant increase (46.4% → 52.2%) in the frequency of exact agreements between LLM judgements and human preferences, as compared to having an LLM directly score an output. We then show that STICK (Self-TICK) can be used to improve generation quality across multiple benchmarks via self-refinement and Best-of-N selection. STICK self-refinement on LiveBench reasoning tasks leads to an absolute gain of +7.8%, whilst Best-of-N selection with STICK attains +6.3% absolute improvement on the real-world instruction dataset, WildBench. In light of this, structured, multi-faceted self-improvement is shown to be a promising way to further advance LLM capabilities. Finally, by providing LLM-generated checklists to human evaluators tasked with directly scoring LLM responses to WildBench instructions, we notably increase inter-annotator agreement (0.194 → 0.256).
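TICK 的评测流程可以概括为:把指令分解成一系列 YES/NO 问题,逐题判定后以通过比例作为得分;STICK 的 Best-of-N 选择即取清单得分最高的回复。下面的 Python 草图示意这两步,answer_fn 与 judge 的接口为假设,实际系统中由 LLM 评审充当:

```python
def tick_evaluate(checklist, answer_fn):
    """对由 YES/NO 问题组成的清单逐题判定,返回通过比例作为得分。
    answer_fn(question) -> bool 在实际系统中由 LLM 评审实现(此处为假设接口)。"""
    answers = [answer_fn(q) for q in checklist]
    return sum(answers) / len(answers)

def best_of_n(responses, judge):
    """Best-of-N 选择:judge(response) 返回该回复的清单得分,取得分最高者。"""
    return max(responses, key=judge)
```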

[AI-8] SiMilarity-Enhanced Homophily for Multi-View Heterophilous Graph Clustering

链接: https://arxiv.org/abs/2410.03596
作者: Jianpeng Chen,Yawen Ling,Yazhou Ren,Zichen Wen,Tianyi Wu,Shufei Zhang,Lifang He
关键词-EN: graph-structured data, downstream applications, Multi-view Heterophilous Graph, increasing prevalence, prevalence of graph-structured
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the increasing prevalence of graph-structured data, multi-view graph clustering has been widely used in various downstream applications. Existing approaches primarily rely on a unified message passing mechanism, which significantly enhances clustering performance. Nevertheless, this mechanism limits its applicability to heterophilous situations, as it is fundamentally predicated on the assumption of homophily, i.e., the connected nodes often belong to the same class. In reality, this assumption does not always hold; a moderately or even mildly homophilous graph is more common than a fully homophilous one due to inevitable heterophilous information in the graph. To address this issue, in this paper, we propose a novel SiMilarity-enhanced Homophily for Multi-view Heterophilous Graph Clustering (SMHGC) approach. By analyzing the relationship between similarity and graph homophily, we propose to enhance the homophily by introducing three similarity terms, i.e., neighbor pattern similarity, node feature similarity, and multi-view global similarity, in a label-free manner. Then, a consensus-based inter- and intra-view fusion paradigm is proposed to fuse the improved homophilous graph from different views and utilize them for clustering. The state-of-the-art experimental results on both multi-view heterophilous and homophilous datasets collectively demonstrate the strong capacity of similarity for unsupervised multi-view heterophilous graph learning. Additionally, the consistent performance across semi-synthetic datasets with varying levels of homophily serves as further evidence of SMHGC’s resilience to heterophily.
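SMHGC 以无标签方式、依据相似度(邻居模式相似、节点特征相似等)增强图的同配性。下面的 NumPy 草图示意这一思路:把两类余弦相似度加权组合,为每个节点只保留综合相似度最高的 k 个邻居来重建图。相似度的等权组合与 k 值均为简化假设,并省略了论文中的多视图全局相似项:

```python
import numpy as np

def cosine_sim(M):
    # 行向量两两余弦相似度
    n = M / (np.linalg.norm(M, axis=1, keepdims=True) + 1e-8)
    return n @ n.T

def homophily_enhanced_graph(adj, feats, k=2):
    """无标签地增强同配性:组合邻居模式相似度(邻接行的余弦)
    与节点特征相似度,为每个节点只保留综合相似度最高的 k 个邻居。
    等权组合与 k 值均为示意性假设,省略了多视图全局相似项。"""
    sim = 0.5 * cosine_sim(adj) + 0.5 * cosine_sim(feats)
    np.fill_diagonal(sim, -np.inf)          # 不把节点自身当作候选邻居
    new_adj = np.zeros_like(adj)
    for i in range(len(sim)):
        for j in np.argsort(sim[i])[-k:]:   # 综合相似度最高的 k 个节点
            new_adj[i, j] = 1.0
    return new_adj
```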

[AI-9] Understanding Reasoning in Chain-of-Thought from the Hopfieldian View

链接: https://arxiv.org/abs/2410.03595
作者: Lijie Hu,Liang Liu,Shu Yang,Xin Chen,Zhen Tan,Muhammad Asif Ali,Mengdi Li,Di Wang
关键词-EN: Large Language Models, Large Language, Language Models, demonstrated remarkable abilities, Models have demonstrated
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 28 pages, a new version of “A Hopfieldian View-based Interpretation for Chain-of-Thought Reasoning”

点击查看摘要

Abstract:Large Language Models have demonstrated remarkable abilities across various tasks, with Chain-of-Thought (CoT) prompting emerging as a key technique to enhance reasoning capabilities. However, existing research primarily focuses on improving performance, lacking a comprehensive framework to explain and understand the fundamental factors behind CoT’s success. To bridge this gap, we introduce a novel perspective grounded in the Hopfieldian view of cognition in cognitive neuroscience. We establish a connection between CoT reasoning and key cognitive elements such as stimuli, actions, neural populations, and representation spaces. From our view, we can understand the reasoning process as the movement between these representation spaces. Building on this insight, we develop a method for localizing reasoning errors in the response of CoTs. Moreover, we propose the Representation-of-Thought (RoT) framework, which leverages the robustness of low-dimensional representation spaces to enhance the robustness of the reasoning process in CoTs. Experimental results demonstrate that RoT improves the robustness and interpretability of CoT reasoning while offering fine-grained control over the reasoning process.

[AI-10] Variational Bayes Gaussian Splatting

链接: https://arxiv.org/abs/2410.03592
作者: Toon Van de Maele,Ozan Catal,Alexander Tschantz,Christopher L. Buckley,Tim Verbelen
关键词-EN: Bayes Gaussian Splatting, Gaussian Splatting, scenes using mixtures, Recently, Variational Bayes Gaussian
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, 3D Gaussian Splatting has emerged as a promising approach for modeling 3D scenes using mixtures of Gaussians. The predominant optimization method for these models relies on backpropagating gradients through a differentiable rendering pipeline, which struggles with catastrophic forgetting when dealing with continuous streams of data. To address this limitation, we propose Variational Bayes Gaussian Splatting (VBGS), a novel approach that frames training a Gaussian splat as variational inference over model parameters. By leveraging the conjugacy properties of multivariate Gaussians, we derive a closed-form variational update rule, allowing efficient updates from partial, sequential observations without the need for replay buffers. Our experiments show that VBGS not only matches state-of-the-art performance on static datasets, but also enables continual learning from sequentially streamed 2D and 3D data, drastically improving performance in this setting.
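VBGS 依赖多元高斯的共轭性质得到闭式变分更新,从而能从部分、序贯的观测中增量更新而无需回放缓冲区。下面用一维高斯均值的共轭更新示意这一性质:后验精度是先验精度与观测精度之和,后验均值是精度加权平均,且顺序更新与一次性批量更新结果一致。这只是共轭更新的一维示意,并非论文的多元推导:

```python
def gaussian_conjugate_update(mu0, tau0, obs, tau_lik):
    """一维高斯均值的共轭闭式更新(观测精度 tau_lik 已知):
    后验精度 = 先验精度 + n * 观测精度;
    后验均值 = 先验均值与观测按各自精度加权的平均。
    仅为 VBGS 所依赖的共轭性质的一维示意,非论文的多元情形。"""
    tau_n = tau0 + len(obs) * tau_lik
    mu_n = (tau0 * mu0 + tau_lik * sum(obs)) / tau_n
    return mu_n, tau_n
```

由于每次更新得到的后验可直接作为下一批观测的先验,流式数据无需回放即可得到与批量训练相同的结果。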

[AI-11] A Multi-model Approach for Video Data Retrieval in Autonomous Vehicle Development

链接: https://arxiv.org/abs/2410.03580
作者: Jesper Knapp,Klas Moberg,Yuchuan Jin,Simin Sun,Miroslaw Staron
关键词-EN: Autonomous driving software, generates enormous amounts, Autonomous driving, driving software generates, software generates enormous
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autonomous driving software generates enormous amounts of data every second, which software development organizations save for future analysis and testing in the form of logs. However, given the vast size of this data, locating specific scenarios within a collection of vehicle logs can be challenging. Writing the correct SQL queries to find these scenarios requires engineers to have a strong background in SQL and the specific databases in question, further complicating the search process. This paper presents and evaluates a pipeline that allows searching for specific scenarios in log collections using natural language descriptions instead of SQL. The generated descriptions were evaluated by engineers working with vehicle logs at Zenseact on a scale from 1 to 5. Our approach achieved a mean score of 3.3, demonstrating the potential of using a multi-model architecture to improve the software development workflow. We also present an interface that visualizes both the query process and its results.

[AI-12] A Survey on Offensive AI Within Cybersecurity

链接: https://arxiv.org/abs/2410.03566
作者: Sahil Girhepuje,Aviral Verma,Gaurav Raina
关键词-EN: Artificial Intelligence, witnessed major growth, witnessed major, major growth, growth and integration
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Artificial Intelligence (AI) has witnessed major growth and integration across various domains. As AI systems become increasingly prevalent, they also become targets for threat actors to manipulate their functionality for malicious purposes. This survey paper on offensive AI will comprehensively cover various aspects related to attacks against and using AI systems. It will delve into the impact of offensive AI practices on different domains, including consumer, enterprise, and public digital infrastructure. The paper will explore adversarial machine learning, attacks against AI models, infrastructure, and interfaces, along with offensive techniques like information gathering, social engineering, and weaponized AI. Additionally, it will discuss the consequences and implications of offensive AI, presenting case studies, insights, and avenues for further research.

[AI-13] Training on more Reachable Tasks for Generalisation in Reinforcement Learning

链接: https://arxiv.org/abs/2410.03565
作者: Max Weltevrede,Caroline Horsch,Matthijs T.J. Spaan,Wendelin Böhmer
关键词-EN: multi-task reinforcement learning, fixed set, reinforcement learning, multi-task reinforcement, agents train
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: arXiv admin note: text overlap with arXiv:2406.08069

点击查看摘要

Abstract:In multi-task reinforcement learning, agents train on a fixed set of tasks and have to generalise to new ones. Recent work has shown that increased exploration improves this generalisation, but it remains unclear why exactly that is. In this paper, we introduce the concept of reachability in multi-task reinforcement learning and show that an initial exploration phase increases the number of reachable tasks the agent is trained on. This, and not the increased exploration, is responsible for the improved generalisation, even to unreachable tasks. Inspired by this, we propose a novel method Explore-Go that implements such an exploration phase at the beginning of each episode. Explore-Go only modifies the way experience is collected and can be used with most existing on-policy or off-policy reinforcement learning algorithms. We demonstrate the effectiveness of our method when combined with some popular algorithms and show an increase in generalisation performance across several environments.
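Explore-Go 只改动经验收集方式:每个 episode 开头先执行若干步纯探索,之后才从到达的状态开始按当前策略收集训练数据,从而增加训练所覆盖的可达任务。以下 Python 草图示意该采样循环,环境与策略的接口均为假设:

```python
import random

def collect_episode(env, policy, explore_steps, max_steps):
    """Explore-Go 式经验收集示意:episode 开头先纯探索若干步(不记录),
    再从到达的状态起按当前策略收集训练数据。env/policy 接口为假设:
    env.reset() 返回初始状态,env.step(a) 返回 (下一状态, 是否结束)。"""
    state = env.reset()
    for _ in range(explore_steps):              # 纯探索阶段:随机动作
        state, done = env.step(random.choice(env.actions))
        if done:
            return []                           # 探索阶段内结束则不产生训练数据
    trajectory = []
    for _ in range(max_steps - explore_steps):  # 收集阶段:按当前策略行动
        action = policy(state)
        next_state, done = env.step(action)
        trajectory.append((state, action, next_state))
        state = next_state
        if done:
            break
    return trajectory
```

由于只改采样,这一包装可以套在大多数 on-policy 或 off-policy 算法的数据收集循环外层。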

[AI-14] færdXel: An Expert System for Danish Traffic Law

链接: https://arxiv.org/abs/2410.03560
作者: Luís Cruz-Filipe,Jonas Vistrup
关键词-EN: Danish traffic law, traffic law, Danish traffic, present færdXel, symbolic reasoning
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present færdXel, a tool for symbolic reasoning in the domain of Danish traffic law. færdXel combines techniques from logic programming with a novel interface that allows users to navigate through its reasoning process, thereby ensuring the system’s trustworthiness. A preliminary empirical evaluation indicates that this work is seen as very promising, and has the potential to become a foundation for real-world AI tools supporting professionals in the Danish legal sector.

[AI-15] Not All Diffusion Model Activations Have Been Evaluated as Discriminative Features

链接: https://arxiv.org/abs/2410.03558
作者: Benyuan Meng,Qianqian Xu,Zitai Wang,Xiaochun Cao,Qingming Huang
关键词-EN: image generation, initially designed, designed for image, activations, Diffusion models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion models are initially designed for image generation. Recent research shows that the internal signals within their backbones, named activations, can also serve as dense features for various discriminative tasks such as semantic segmentation. Given numerous activations, selecting a small yet effective subset poses a fundamental problem. To this end, the early study of this field performs a large-scale quantitative comparison of the discriminative ability of the activations. However, we find that many potential activations have not been evaluated, such as the queries and keys used to compute attention scores. Moreover, recent advancements in diffusion architectures bring many new activations, such as those within embedded ViT modules. Both combined, activation selection remains unresolved but overlooked. To tackle this issue, this paper takes a further step with a much broader range of activations evaluated. Considering the significant increase in activations, a full-scale quantitative comparison is no longer operational. Instead, we seek to understand the properties of these activations, such that the activations that are clearly inferior can be filtered out in advance via simple qualitative evaluation. After careful analysis, we discover three properties universal among diffusion models, enabling this study to go beyond specific models. On top of this, we present effective feature selection solutions for several popular diffusion models. Finally, the experiments across multiple discriminative tasks validate the superiority of our method over the SOTA competitors. Our code is available at this https URL.

[AI-16] Constructive Apraxia: An Unexpected Limit of Instructible Vision-Language Models and Analog for Human Cognitive Disorders

链接: https://arxiv.org/abs/2410.03551
作者: David Noever,Samantha E. Miller Noever
关键词-EN: human cognitive disorders, instructible vision-language models, specifically constructive apraxia, cognitive disorders, study reveals
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:This study reveals an unexpected parallel between instructible vision-language models (VLMs) and human cognitive disorders, specifically constructive apraxia. We tested 25 state-of-the-art VLMs, including GPT-4 Vision, DALL-E 3, and Midjourney v5, on their ability to generate images of the Ponzo illusion, a task that requires basic spatial reasoning and is often used in clinical assessments of constructive apraxia. Remarkably, 24 out of 25 models failed to correctly render two horizontal lines against a perspective background, mirroring the deficits seen in patients with parietal lobe damage. The models consistently misinterpreted spatial instructions, producing tilted or misaligned lines that followed the perspective of the background rather than remaining horizontal. This behavior is strikingly similar to how apraxia patients struggle to copy or construct simple figures despite intact visual perception and motor skills. Our findings suggest that current VLMs, despite their advanced capabilities in other domains, lack fundamental spatial reasoning abilities akin to those impaired in constructive apraxia. This limitation in AI systems provides a novel computational model for studying spatial cognition deficits and highlights a critical area for improvement in VLM architecture and training methodologies.

[AI-17] Dreaming User Multimodal Representation for Micro-Video Recommendation

链接: https://arxiv.org/abs/2410.03538
作者: Chengzhi Lin,Hezheng Lin,Shuchang Liu,Cangguang Ruan,LingJing Xu,Dezhao Yang,Chuyuan Wang,Yongqi Liu
关键词-EN: advanced recommender systems, mitigate information overload, deliver tailored content, Platonic Representation Hypothesis, underscored the necessity
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The proliferation of online micro-video platforms has underscored the necessity for advanced recommender systems to mitigate information overload and deliver tailored content. Despite advancements, accurately and promptly capturing dynamic user interests remains a formidable challenge. Inspired by the Platonic Representation Hypothesis, which posits that different data modalities converge towards a shared statistical model of reality, we introduce DreamUMM (Dreaming User Multi-Modal Representation), a novel approach leveraging user historical behaviors to create real-time user representation in a multimodal space. DreamUMM employs a closed-form solution correlating user video preferences with multimodal similarity, hypothesizing that user interests can be effectively represented in a unified multimodal space. Additionally, we propose Candidate-DreamUMM for scenarios lacking recent user behavior data, inferring interests from candidate videos alone. Extensive online A/B tests demonstrate significant improvements in user engagement metrics, including active days and play count. The successful deployment of DreamUMM in two micro-video platforms with hundreds of millions of daily active users illustrates its practical efficacy and scalability in personalized micro-video content delivery. Our work contributes to the ongoing exploration of representational convergence by providing empirical evidence supporting the potential for user interest representations to reside in a multimodal space.
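DreamUMM 用闭式解把用户兴趣与多模态相似度关联起来。一个最简化的理解是:用户表征取其历史观看视频多模态嵌入的行为加权平均,再归一化到单位球面,即可直接与候选视频嵌入做相似度检索。以下 NumPy 草图仅示意这一思想,互动权重的定义为假设,并非论文的闭式解本身:

```python
import numpy as np

def dream_user_representation(video_embs, engagement):
    """把用户表征取为历史观看视频多模态嵌入的行为加权平均,
    再归一化到单位球面,便于直接与候选视频嵌入做相似度检索。
    权重取归一化的互动强度,仅为对闭式解思想的简化假设。"""
    w = np.asarray(engagement, dtype=float)
    w = w / w.sum()                                     # 归一化互动权重
    u = (w[:, None] * np.asarray(video_embs, dtype=float)).sum(axis=0)
    return u / (np.linalg.norm(u) + 1e-8)               # 投影到单位球面
```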

[AI-18] Ward: Provable RAG Dataset Inference via LLM Watermarks

链接: https://arxiv.org/abs/2410.03537
作者: Nikola Jovanović,Robin Staab,Maximilian Baader,Martin Vechev
关键词-EN: Retrieval-Augmented Generation, incorporate external data, Generation, incorporate external, RAG Dataset Inference
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) improves LLMs by enabling them to incorporate external data during generation. This raises concerns for data owners regarding unauthorized use of their content in RAG systems. Despite its importance, the challenge of detecting such unauthorized usage remains underexplored, with existing datasets and methodologies from adjacent fields being ill-suited for its study. In this work, we take several steps to bridge this gap. First, we formalize this problem as (black-box) RAG Dataset Inference (RAG-DI). To facilitate research on this challenge, we further introduce a novel dataset specifically designed for benchmarking RAG-DI methods under realistic conditions, and propose a set of baseline approaches. Building on this foundation, we introduce Ward, a RAG-DI method based on LLM watermarks that enables data owners to obtain rigorous statistical guarantees regarding the usage of their dataset in a RAG system. In our experimental evaluation, we show that Ward consistently outperforms all baselines across many challenging settings, achieving higher accuracy, superior query efficiency and robustness. Our work provides a foundation for future studies of RAG-DI and highlights LLM watermarks as a promising approach to this problem.
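Ward 借助 LLM 水印为数据拥有者提供严格的统计保证。水印检测通常归结为一个单比例假设检验:在"无水印"原假设下,每个 token 以概率 gamma 落入"绿表";当绿 token 比例显著偏高时即可拒绝原假设。以下 Python 草图给出这一 z 检验,作为水印检测的通用做法示意,并非 Ward 的具体统计量:

```python
import math

def watermark_zscore(green_count, total, gamma=0.5):
    """单比例 z 检验:原假设下每个 token 以概率 gamma 命中"绿表"。
    z 值越大,文本带水印(即数据集被 RAG 系统使用)的证据越强。
    这是水印检测的通用做法示意,并非 Ward 的具体统计量。"""
    expected = gamma * total
    std = math.sqrt(total * gamma * (1 - gamma))
    return (green_count - expected) / std
```

例如 100 个 token 中有 75 个命中绿表、gamma=0.5 时,z=5,对应的 p 值远小于常用显著性水平。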

[AI-19] Computer Vision Intelligence Test Modeling and Generation: A Case Study on Smart OCR

链接: https://arxiv.org/abs/2410.03536
作者: Jing Shu,Bing-Jiun Miu,Eugene Chang,Jerry Gao,Jun Liu
关键词-EN: AI-based systems possess, systems possess distinctive, possess distinctive characteristics, AI-based systems, systems possess
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:AI-based systems possess distinctive characteristics and introduce challenges in quality evaluation at the same time. Consequently, ensuring and validating AI software quality is of critical importance. In this paper, we present an effective AI software functional testing model to address this challenge. Specifically, we first present a comprehensive literature review of previous work, covering key facets of AI software testing processes. We then introduce a 3D classification model to systematically evaluate the image-based text extraction AI function, as well as test coverage criteria and complexity. To evaluate the performance of our proposed AI software quality test, we propose four evaluation metrics to cover different aspects. Finally, based on the proposed framework and defined metrics, a mobile Optical Character Recognition (OCR) case study is presented to demonstrate the framework’s effectiveness and capability in assessing AI function quality.

[AI-20] Multiscale fusion enhanced spiking neural network for invasive BCI neural signal decoding

链接: https://arxiv.org/abs/2410.03533
作者: Yu Song,Liyuan Han,Bo Xu,Tielin Zhang
关键词-EN: Brain-computer interfaces, Spiking Neural Networks, Spiking Neural, requiring stable, Neural Network
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Brain-computer interfaces (BCIs) are an advanced fusion of neuroscience and artificial intelligence, requiring stable and long-term decoding of neural signals. Spiking Neural Networks (SNNs), with their neuronal dynamics and spike-based signal processing, are inherently well-suited for this task. This paper presents a novel approach utilizing a Multiscale Fusion enhanced Spiking Neural Network (MFSNN). The MFSNN emulates the parallel processing and multiscale feature fusion seen in human visual perception to enable real-time, efficient, and energy-conserving neural signal decoding. Initially, the MFSNN employs temporal convolutional networks and channel attention mechanisms to extract spatiotemporal features from raw data. It then enhances decoding performance by integrating these features through skip connections. Additionally, the MFSNN improves generalizability and robustness in cross-day signal decoding through mini-batch supervised generalization learning. In two benchmark invasive BCI paradigms, including the single-hand grasp-and-touch and center-and-out reach tasks, the MFSNN surpasses traditional artificial neural network methods, such as MLP and GRU, in both accuracy and computational efficiency. Moreover, the MFSNN’s multiscale feature fusion framework is well-suited for the implementation on neuromorphic chips, offering an energy-efficient solution for online decoding of invasive BCI signals.
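The spiking units an SNN like MFSNN is built from can be illustrated with the textbook leaky integrate-and-fire (LIF) neuron: membrane potential leaks over time, integrates input current, and emits a binary spike on crossing a threshold (a toy sketch of standard dynamics, not the paper's model; `tau` and `threshold` are assumed values).

```python
def lif_spikes(inputs, tau=0.9, threshold=1.0):
    # Leaky integrate-and-fire: v decays by factor tau each step,
    # accumulates the input, spikes when v crosses the threshold,
    # then resets to zero.
    v, spikes = 0.0, []
    for x in inputs:
        v = tau * v + x
        if v >= threshold:
            spikes.append(1)
            v = 0.0
        else:
            spikes.append(0)
    return spikes

print(lif_spikes([0.6, 0.6, 0.6, 0.0, 0.6, 0.6]))  # [0, 1, 0, 0, 1, 0]
```

This event-driven, binary signaling is what makes SNNs a natural fit for low-power neuromorphic decoding of neural signals.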

[AI-21] MARE: Multi-Aspect Rationale Extractor on Unsupervised Rationale Extraction EMNLP2024

链接: https://arxiv.org/abs/2410.03531
作者: Han Jiang,Junwen Duan,Zhe Qu,Jianxin Wang
关键词-EN: support model predictions, explicit rationale annotation, extract text snippets, rationale extraction aims, Unsupervised rationale extraction
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted in EMNLP2024(Main) conference

点击查看摘要

Abstract:Unsupervised rationale extraction aims to extract text snippets to support model predictions without explicit rationale annotation. Researchers have made many efforts to solve this task. Previous works often encode each aspect independently, which may limit their ability to capture meaningful internal correlations between aspects. While there has been significant work on mitigating spurious correlations, our approach focuses on leveraging the beneficial internal correlations to improve multi-aspect rationale extraction. In this paper, we propose a Multi-Aspect Rationale Extractor (MARE) to explain and predict multiple aspects simultaneously. Concretely, we propose a Multi-Aspect Multi-Head Attention (MAMHA) mechanism based on hard deletion to encode multiple text chunks simultaneously. Furthermore, multiple special tokens are prepended in front of the text, each corresponding to a certain aspect. Finally, multi-task training is deployed to reduce the training overhead. Experimental results on two unsupervised rationale extraction benchmarks show that MARE achieves state-of-the-art performance. Ablation studies further demonstrate the effectiveness of our method. Our code is available at this https URL.

[AI-22] A Probabilistic Perspective on Unlearning and Alignment for Large Language Models

链接: https://arxiv.org/abs/2410.03523
作者: Yan Scholten,Stephan Günnemann,Leo Schwinn
关键词-EN: Large Language Models, Large Language, open research problem, Language Models, research problem
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Comprehensive evaluation of Large Language Models (LLMs) is an open research problem. Existing evaluations rely on deterministic point estimates generated via greedy decoding. However, we find that deterministic evaluations fail to capture the whole output distribution of a model, yielding inaccurate estimations of model capabilities. This is particularly problematic in critical contexts such as unlearning and alignment, where precise model evaluations are crucial. To remedy this, we introduce the first formal probabilistic evaluation framework in LLMs. Namely, we derive novel metrics with high-probability guarantees concerning the output distribution of a model. Our metrics are application-independent and allow practitioners to make more reliable estimates about model capabilities before deployment. Through a case study focused on unlearning, we reveal that deterministic evaluations falsely indicate successful unlearning, whereas our probabilistic evaluations demonstrate that most if not all of the supposedly unlearned information remains accessible in these models. Additionally, we propose a novel unlearning loss based on entropy optimization and adaptive temperature scaling, which significantly improves unlearning in probabilistic settings on recent benchmarks. Our proposed shift from point estimates to probabilistic evaluations of output distributions represents an important step toward comprehensive evaluations of LLMs. this https URL
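The gap between deterministic and probabilistic evaluation is visible even in a toy two-token model (all numbers below are assumed for illustration, not from the paper): if a supposedly unlearned token keeps probability mass p under the output distribution, greedy decoding never emits it when p < 0.5, yet the chance that at least one of k independent samples leaks it is 1 - (1 - p)^k.

```python
def greedy_leaks(p_secret: float) -> bool:
    # Greedy decoding emits the secret token only when it is the argmax,
    # i.e. in this toy two-token world only when p_secret > 0.5.
    return p_secret > 0.5

def prob_leak_when_sampling(p_secret: float, k: int) -> float:
    # Probability that at least one of k independent samples
    # emits the secret token.
    return 1.0 - (1.0 - p_secret) ** k

# A toy "unlearned" model that kept 30% mass on the secret token:
p = 0.30
print(greedy_leaks(p))                           # False: deterministic eval looks clean
print(round(prob_leak_when_sampling(p, 20), 3))  # 0.999: sampling eval reveals the leak
```

This is the paper's point in miniature: a point estimate from greedy decoding can certify "successful unlearning" while the output distribution still leaks the information almost surely.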

[AI-23] LCMDC: Large-scale Chinese Medical Dialogue Corpora for Automatic Triage and Medical Consultation

链接: https://arxiv.org/abs/2410.03521
作者: Xinyuan Wang,Haozhou Li,Dingfang Zheng,Qinke Peng
关键词-EN: pandemic underscored major, underscored major deficiencies, online medical services, traditional healthcare systems, pandemic underscored
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The global COVID-19 pandemic underscored major deficiencies in traditional healthcare systems, hastening the advancement of online medical services, especially in medical triage and consultation. However, existing studies face two main challenges. First, the scarcity of large-scale, publicly available, domain-specific medical datasets due to privacy concerns, with current datasets being small and limited to a few diseases, limiting the effectiveness of triage methods based on Pre-trained Language Models (PLMs). Second, existing methods lack medical knowledge and struggle to accurately understand professional terms and expressions in patient-doctor consultations. To overcome these obstacles, we construct the Large-scale Chinese Medical Dialogue Corpora (LCMDC), comprising a Coarse-grained Triage dataset with 439,630 samples, a Fine-grained Diagnosis dataset with 199,600 samples, and a Medical Consultation dataset with 472,418 items, thereby addressing the data shortage in this field. Moreover, we further propose a novel triage system that combines BERT-based supervised learning with prompt learning, as well as a GPT-based medical consultation model using reinforcement learning. To enhance domain knowledge acquisition, we pre-trained PLMs using our self-constructed background corpus. Experimental results on the LCMDC demonstrate the efficacy of our proposed systems.

[AI-24] FedStein: Enhancing Multi-Domain Federated Learning Through James-Stein Estimator NEURIPS’24 NEURIPS2024

链接: https://arxiv.org/abs/2410.03499
作者: Sunny Gupta,Nikita Jangid,Amit Sethi
关键词-EN: enabling collaborative in-situ, collaborative in-situ training, facilitates data privacy, Federated Learning, Multi-Domain Federated Learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 12 pages, 2 figures. Accepted at International Workshop on Federated Foundation Models In Conjunction with NeurIPS 2024 (FL@FM-NeurIPS’24)

点击查看摘要

Abstract:Federated Learning (FL) facilitates data privacy by enabling collaborative in-situ training across decentralized clients. Despite its inherent advantages, FL faces significant challenges of performance and convergence when dealing with data that is not independently and identically distributed (non-i.i.d.). While previous research has primarily addressed the issue of skewed label distribution across clients, this study focuses on the less explored challenge of multi-domain FL, where client data originates from distinct domains with varying feature distributions. We introduce a novel method designed to address these challenges, FedStein: Enhancing Multi-Domain Federated Learning Through the James-Stein Estimator. FedStein uniquely shares only the James-Stein (JS) estimates of batch normalization (BN) statistics across clients, while maintaining local BN parameters. The non-BN layer parameters are exchanged via standard FL techniques. Extensive experiments conducted across three datasets and multiple models demonstrate that FedStein surpasses existing methods such as FedAvg and FedBN, with accuracy improvements exceeding 14% in certain domains, leading to enhanced domain generalization. The code is available at this https URL
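For intuition about the estimator FedStein shares, here is the textbook positive-part James-Stein shrinkage of several independent estimates toward their grand mean (a scalar sketch under an assumed known variance; the paper applies JS estimation to batch-norm statistics exchanged across clients, not this exact formula):

```python
def james_stein(estimates, sigma2):
    # Positive-part James-Stein shrinkage toward the grand mean.
    # estimates: d independent scalar estimates (d >= 4 for shrinkage
    # toward the mean), sigma2: known per-estimate variance.
    d = len(estimates)
    grand = sum(estimates) / d
    ss = sum((x - grand) ** 2 for x in estimates)
    shrink = max(0.0, 1.0 - (d - 3) * sigma2 / ss)  # positive-part factor
    return [grand + shrink * (x - grand) for x in estimates]

# Four noisy per-client estimates pulled slightly toward their mean of 3.0:
print([round(x, 2) for x in james_stein([0.0, 2.0, 4.0, 6.0], 1.0)])  # [0.15, 2.05, 3.95, 5.85]
```

The appeal in the federated setting is that shrinkage dominates the raw per-client estimates in total squared error, which is plausibly why sharing JS estimates of BN statistics helps under domain shift.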

[AI-25] Generative Artificial Intelligence for Navigating Synthesizable Chemical Space

链接: https://arxiv.org/abs/2410.03494
作者: Wenhao Gao,Shitong Luo,Connor W. Coley
关键词-EN: generative modeling framework, modeling framework designed, chemical space exploration, generative modeling, modeling framework
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:We introduce SynFormer, a generative modeling framework designed to efficiently explore and navigate synthesizable chemical space. Unlike traditional molecular generation approaches, we generate synthetic pathways for molecules to ensure that designs are synthetically tractable. By incorporating a scalable transformer architecture and a diffusion module for building block selection, SynFormer surpasses existing models in synthesizable molecular design. We demonstrate SynFormer’s effectiveness in two key applications: (1) local chemical space exploration, where the model generates synthesizable analogs of a reference molecule, and (2) global chemical space exploration, where the model aims to identify optimal molecules according to a black-box property prediction oracle. Additionally, we demonstrate the scalability of our approach via the improvement in performance as more computational resources become available. With our code and trained models openly available, we hope that SynFormer will find use across applications in drug discovery and materials science.

[AI-26] Gradient-based Jailbreak Images for Multimodal Fusion Models

链接: https://arxiv.org/abs/2410.03489
作者: Javier Rando,Hannah Korevaar,Erik Brinkman,Ivan Evtimov,Florian Tramèr
关键词-EN: Augmenting language models, Augmenting language, require discrete optimization, Augmenting, continuous optimization
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Augmenting language models with image inputs may enable more effective jailbreak attacks through continuous optimization, unlike text inputs that require discrete optimization. However, new multimodal fusion models tokenize all input modalities using non-differentiable functions, which hinders straightforward attacks. In this work, we introduce the notion of a tokenizer shortcut that approximates tokenization with a continuous function and enables continuous optimization. We use tokenizer shortcuts to create the first end-to-end gradient image attacks against multimodal fusion models. We evaluate our attacks on Chameleon models and obtain jailbreak images that elicit harmful information for 72.5% of prompts. Jailbreak images outperform text jailbreaks optimized with the same objective and require 3x lower compute budget to optimize 50x more input tokens. Finally, we find that representation engineering defenses, like Circuit Breakers, trained only on text attacks can effectively transfer to adversarial image inputs.
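The tokenizer-shortcut idea resembles a straight-through estimator: the forward pass uses the hard, non-differentiable tokenization, while the backward pass pretends it was the identity so gradients reach the continuous input. A minimal sketch with a toy scalar "tokenizer" (purely illustrative, not the paper's implementation):

```python
def hard_tokenize(x):
    # Non-differentiable forward: snap a continuous value to the
    # nearest token id (gradient is 0 almost everywhere).
    return round(x)

def shortcut_grad(loss_grad_at_token):
    # The shortcut: treat tokenization as the identity on the backward
    # pass, so the upstream gradient flows to the continuous input.
    return loss_grad_at_token

# Drive a continuous input so that its token matches a target id:
x, target = 0.2, 3
for _ in range(100):
    tok = hard_tokenize(x)
    g = 2 * (tok - target)          # d/dtok of the loss (tok - target)^2
    x -= 0.1 * shortcut_grad(g)     # the true gradient through round() is 0
print(hard_tokenize(x))  # 3
```

Without the shortcut the update would always be zero; with it, continuous optimization over image-like inputs becomes possible even though the model tokenizes them discretely.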

[AI-27] A Multimodal Framework for Deepfake Detection

链接: https://arxiv.org/abs/2410.03487
作者: Kashish Gandhi,Prutha Kulkarni,Taran Shah,Piyush Chaudhari,Meera Narvekar,Kranti Ghag
关键词-EN: digital media integrity, deepfake technology poses, rapid advancement, technology poses, poses a significant
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注: 22 pages, 14 figures, Accepted in Journal of Electrical Systems

点击查看摘要

Abstract:The rapid advancement of deepfake technology poses a significant threat to digital media integrity. Deepfakes, synthetic media created using AI, can convincingly alter videos and audio to misrepresent reality. This creates risks of misinformation, fraud, and severe implications for personal privacy and security. Our research addresses the critical issue of deepfakes through an innovative multimodal approach, targeting both visual and auditory elements. This comprehensive strategy recognizes that human perception integrates multiple sensory inputs, particularly visual and auditory information, to form a complete understanding of media content. For visual analysis, a model that employs advanced feature extraction techniques was developed, extracting nine distinct facial characteristics and then applying various machine learning and deep learning models. For auditory analysis, our model leverages mel-spectrogram analysis for feature extraction and then applies various machine learning and deep learning models. To achieve a combined analysis, real and deepfake audio in the original dataset were swapped for testing purposes to ensure balanced samples. Using our proposed models for video and audio classification, i.e., an Artificial Neural Network and VGG19, the overall sample is classified as deepfake if either component is identified as such. Our multimodal framework combines visual and auditory analyses, yielding an accuracy of 94%.

[AI-28] Group Fairness in Peer Review NEURIPS2023

链接: https://arxiv.org/abs/2410.03474
作者: Haris Aziz,Evi Micha,Nisarg Shah
关键词-EN: AAAI serve, serve as crossroads, attract submissions, large conference, AAAI
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph)
*备注: A preliminary version appeared at NeurIPS 2023

点击查看摘要

Abstract:Large conferences such as NeurIPS and AAAI serve as crossroads of various AI fields, since they attract submissions from a vast number of communities. However, in some cases, this has resulted in a poor reviewing experience for some communities, whose submissions get assigned to less qualified reviewers outside of their communities. An often-advocated solution is to break up any such large conference into smaller conferences, but this can lead to isolation of communities and harm interdisciplinary research. We tackle this challenge by introducing a notion of group fairness, called the core, which requires that every possible community (subset of researchers) be treated in a way that prevents them from unilaterally benefiting by withdrawing from a large conference. We study a simple peer review model, prove that it always admits a reviewing assignment in the core, and design an efficient algorithm to find one such assignment. We use real data from CVPR and ICLR conferences to compare our algorithm to existing reviewing assignment algorithms on a number of metrics.

[AI-29] Vulnerability Detection via Topological Analysis of Attention Maps

链接: https://arxiv.org/abs/2410.03470
作者: Pavel Snopov,Andrey Nikolaevich Golubinskiy
关键词-EN: gained significant traction, significant traction, Recently, gained significant, deep learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Algebraic Topology (math.AT)
*备注: Accepted to ITaS2024. Contains 8 pages

点击查看摘要

Abstract:Recently, deep learning (DL) approaches to vulnerability detection have gained significant traction. These methods demonstrate promising results, often surpassing traditional static code analysis tools in effectiveness. In this study, we explore a novel approach to vulnerability detection utilizing the tools from topological data analysis (TDA) on the attention matrices of the BERT model. Our findings reveal that traditional machine learning (ML) techniques, when trained on the topological features extracted from these attention matrices, can perform competitively with pre-trained language models (LLMs) such as CodeBERTa. This suggests that TDA tools, including persistent homology, are capable of effectively capturing semantic information critical for identifying vulnerabilities.

[AI-30] Diffusion State-Guided Projected Gradient for Inverse Problems

链接: https://arxiv.org/abs/2410.03463
作者: Rayhan Zirvi,Bahareh Tolooshams,Anima Anandkumar
关键词-EN: Recent advancements, inverse problems, learning data priors, solving inverse problems, diffusion
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: preprint. under review. RZ and BT have equal contributions

点击查看摘要

Abstract:Recent advancements in diffusion models have been effective in learning data priors for solving inverse problems. They leverage diffusion sampling steps for inducing a data prior while using a measurement guidance gradient at each step to impose data consistency. For general inverse problems, approximations are needed when an unconditionally trained diffusion model is used since the measurement likelihood is intractable, leading to inaccurate posterior sampling. In other words, due to their approximations, these methods fail to preserve the generation process on the data manifold defined by the diffusion prior, leading to artifacts in applications such as image restoration. To enhance the performance and robustness of diffusion models in solving inverse problems, we propose Diffusion State-Guided Projected Gradient (DiffStateGrad), which projects the measurement gradient onto a subspace that is a low-rank approximation of an intermediate state of the diffusion process. DiffStateGrad, as a module, can be added to a wide range of diffusion-based inverse solvers to improve the preservation of the diffusion process on the prior manifold and filter out artifact-inducing components. We highlight that DiffStateGrad improves the robustness of diffusion models in terms of the choice of measurement guidance step size and noise while improving the worst-case performance. Finally, we demonstrate that DiffStateGrad improves upon the state-of-the-art on linear and nonlinear image restoration inverse problems.
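The core operation here, projecting the measurement gradient onto a low-rank approximation of an intermediate diffusion state, can be sketched in plain Python for rank 1 (power iteration stands in for a truncated SVD; an illustrative sketch, not the paper's implementation, which uses a higher-rank subspace):

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def tmatvec(A, u):  # A^T u
    return [sum(A[i][j] * u[i] for i in range(len(A))) for j in range(len(A[0]))]

def normalize(v):
    n = sum(x * x for x in v) ** 0.5
    return [x / n for x in v]

def top_left_singular_vector(A, iters=100):
    # Power iteration for the dominant left singular vector of A.
    v = normalize([1.0] * len(A[0]))
    for _ in range(iters):
        u = normalize(matvec(A, v))
        v = normalize(tmatvec(A, u))
    return normalize(matvec(A, v))

def project_onto_state(grad, state):
    # Keep only the component of the measurement gradient G that lies in
    # the dominant (here rank-1) subspace of the intermediate state:
    # projected = u (u^T G), discarding artifact-inducing directions.
    u = top_left_singular_vector(state)
    coeffs = tmatvec(grad, u)  # u^T G, one coefficient per column
    return [[ui * c for c in coeffs] for ui in u]

state = [[3.0, 0.0], [0.0, 1.0]]   # toy intermediate diffusion state
grad = [[2.0, 3.0], [4.0, 5.0]]    # toy measurement-guidance gradient
print([[round(x, 3) for x in row] for row in project_onto_state(grad, state)])
```

Here the second row of the gradient is filtered out because it lies outside the state's dominant subspace, which is the mechanism DiffStateGrad uses to keep guidance close to the prior manifold.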

[AI-31] How Toxicity Classifiers and Large Language Models Respond to Ableism

链接: https://arxiv.org/abs/2410.03448
作者: Mahika Phutane,Ananya Seelam,Aditya Vashistha
关键词-EN: People with disabilities, regularly encounter ableist, encounter ableist hate, regularly encounter, hate and microaggressions
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:People with disabilities (PwD) regularly encounter ableist hate and microaggressions online. While online platforms use machine learning models to moderate online harm, there is little research investigating how these models interact with ableism. In this paper, we curated a dataset of 100 social media comments targeted towards PwD, and recruited 160 participants to rate and explain how toxic and ableist these comments were. We then prompted state-of-the-art toxicity classifiers (TCs) and large language models (LLMs) to rate and explain the harm. Our analysis revealed that TCs and LLMs rated toxicity significantly lower than PwD, but LLMs rated ableism generally on par with PwD. However, ableism explanations by LLMs overlooked emotional harm, and lacked specificity and acknowledgement of context, important facets of PwD explanations. Going forward, we discuss challenges in designing disability-aware toxicity classifiers, and advocate for the shift from ableism detection to ableism interpretation and explanation.

[AI-32] On Uncertainty In Natural Language Processing

链接: https://arxiv.org/abs/2410.03446
作者: Dennis Ulmer
关键词-EN: increasingly capable systems, natural language processing, decade in deep, deep learning, learning has brought
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: PhD thesis

点击查看摘要

Abstract:The last decade in deep learning has brought on increasingly capable systems that are deployed on a wide variety of applications. In natural language processing, the field has been transformed by a number of breakthroughs including large language models, which are used in increasingly many user-facing applications. In order to reap the benefits of this technology and reduce potential harms, it is important to quantify the reliability of model predictions and the uncertainties that shroud their development. This thesis studies how uncertainty in natural language processing can be characterized from a linguistic, statistical and neural perspective, and how it can be reduced and quantified through the design of the experimental pipeline. We further explore uncertainty quantification in modeling by theoretically and empirically investigating the effect of inductive model biases in text classification tasks. The corresponding experiments include data for three different languages (Danish, English and Finnish) and tasks as well as a large set of different uncertainty quantification approaches. Additionally, we propose a method for calibrated sampling in natural language generation based on non-exchangeable conformal prediction, which provides tighter token sets with better coverage of the actual continuation. Lastly, we develop an approach to quantify confidence in large black-box language models using auxiliary predictors, where the confidence is predicted from the input to and generated output text of the target model alone.

[AI-33] Exploring the Benefit of Activation Sparsity in Pre-training ICML2024

链接: https://arxiv.org/abs/2410.03440
作者: Zhengyan Zhang,Chaojun Xiao,Qiujieli Qin,Yankai Lin,Zhiyuan Zeng,Xu Han,Zhiyuan Liu,Ruobing Xie,Maosong Sun,Jie Zhou
关键词-EN: Pre-trained Transformers inherently, Transformers inherently possess, Pre-trained Transformers, sparse activation, inherently possess
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: ICML 2024

点击查看摘要

Abstract:Pre-trained Transformers inherently possess the characteristic of sparse activation, where only a small fraction of the neurons are activated for each token. While sparse activation has been explored through post-training methods, its potential in pre-training remains untapped. In this work, we first study how activation properties change during pre-training. Our examination reveals that Transformers exhibit sparse activation throughout the majority of the pre-training process while the activation correlation keeps evolving as training progresses. Leveraging this observation, we propose Switchable Sparse-Dense Learning (SSD). SSD adaptively switches between the Mixtures-of-Experts (MoE) based sparse training and the conventional dense training during the pre-training process, leveraging the efficiency of sparse training and avoiding the static activation correlation of sparse training. Compared to dense training, SSD achieves comparable performance with identical model size and reduces pre-training costs. Moreover, the models trained with SSD can be directly used as MoE models for sparse inference and achieve the same performance as dense models with up to 2x faster inference speed. Code is available at this https URL.
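Activation sparsity, the property SSD exploits, is straightforward to measure: for a ReLU feed-forward layer it is simply the fraction of pre-activations at or below zero for a given token (a toy measurement for intuition, unrelated to the paper's training code).

```python
def activation_sparsity(pre_activations, threshold=0.0):
    # Fraction of neurons ReLU zeroes out for one token: entries at or
    # below the threshold contribute nothing to the layer's output, so
    # they could be skipped by a sparse (e.g. MoE-style) computation.
    inactive = sum(1 for a in pre_activations if a <= threshold)
    return inactive / len(pre_activations)

acts = [-0.9, -0.1, 0.0, 1.2, -2.3, 0.4, -0.5, -0.7]
print(activation_sparsity(acts))  # 0.75
```

When this fraction stays high throughout pre-training, as the paper observes, most neuron computations per token are wasted under dense training, which motivates switching to sparse training.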

[AI-34] A General Framework for Producing Interpretable Semantic Text Embeddings

链接: https://arxiv.org/abs/2410.03435
作者: Yiqun Sun,Qiang Huang,Yixuan Tang,Anthony K. H. Tung,Jun Yu
关键词-EN: Natural Language Processing, Language Processing, Natural Language, NLP
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 19 pages, 5 figures, and 9 tables

点击查看摘要

Abstract:Semantic text embedding is essential to many tasks in Natural Language Processing (NLP). While black-box models are capable of generating high-quality embeddings, their lack of interpretability limits their use in tasks that demand transparency. Recent approaches have improved interpretability by leveraging domain-expert-crafted or LLM-generated questions, but these methods rely heavily on expert input or careful prompt design, which restricts their generalizability and ability to generate discriminative questions across a wide range of tasks. To address these challenges, we introduce CQG-MBQA (Contrastive Question Generation - Multi-task Binary Question Answering), a general framework for producing interpretable semantic text embeddings across diverse tasks. Our framework systematically generates highly discriminative, low-cognitive-load yes/no questions through the CQG method and answers them efficiently with the MBQA model, resulting in interpretable embeddings in a cost-effective manner. We validate the effectiveness and interpretability of CQG-MBQA through extensive experiments and ablation studies, demonstrating that it delivers embedding quality comparable to many advanced black-box models while maintaining inherent interpretability. Additionally, CQG-MBQA outperforms other interpretable text embedding methods across various downstream tasks.
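The framework's central idea, representing a text as its vector of answers to discriminative yes/no questions so that every embedding dimension is human-readable, can be mimicked with a trivial stand-in answerer (the keyword matcher below is an assumption for illustration; the real MBQA component is a trained multi-task model):

```python
def embed(text, questions, answer):
    # One interpretable dimension per yes/no question.
    return [1 if answer(text, q) else 0 for q in questions]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

questions = [
    "Does the text mention a sporting event?",
    "Does the text mention food?",
]
# Hypothetical stand-in for the trained MBQA answerer: a keyword lookup.
keywords = {questions[0]: "match", questions[1]: "pizza"}
answer = lambda text, q: keywords[q] in text

e1 = embed("the match went to overtime", questions, answer)
e2 = embed("we ordered pizza after the match", questions, answer)
print(e1, e2, round(cosine(e1, e2), 3))  # [1, 0] [1, 1] 0.707
```

Unlike a dense black-box embedding, every coordinate here can be read off as the answer to a specific question, which is what makes the similarity score auditable.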

[AI-35] Self-supervised Spatio-Temporal Graph Mask-Passing Attention Network for Perceptual Importance Prediction of Multi-point Tactility

链接: https://arxiv.org/abs/2410.03434
作者: Dazhong He,Qian Liu
关键词-EN: modern multimedia systems, visual and auditory, prevalent in modern, form of human, multimedia systems
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: Published as a conference paper at Eurohaptics 2024

点击查看摘要

Abstract:While visual and auditory information are prevalent in modern multimedia systems, haptic interaction, e.g., tactile and kinesthetic interaction, provides a unique form of human perception. However, multimedia technology for contact interaction is less mature than non-contact multimedia technologies and requires further development. Specialized haptic media technologies, requiring low latency and bitrates, are essential to enable haptic interaction, necessitating haptic information compression. Existing vibrotactile signal compression methods, based on the perceptual model, do not consider the characteristics of fused tactile perception at multiple spatially distributed interaction points. In fact, differences in tactile perceptual importance are not limited to conventional frequency and time domains, but also encompass differences in the spatial locations on the skin unique to tactile perception. For the most frequently used tactile information, vibrotactile texture perception, we have developed a model to predict its perceptual importance at multiple points, based on self-supervised learning and a Spatio-Temporal Graph Neural Network. Current experimental results indicate that this model can effectively predict the perceptual importance of various points in multi-point tactile perception scenarios.

[AI-36] EB-NeRD: A Large-Scale Dataset for News Recommendation RECSYS’24

链接: https://arxiv.org/abs/2410.03432
作者: Johannes Kruse,Kasper Lindskow,Saikishore Kalloori,Marco Polignano,Claudio Pomo,Abhishek Srivastava,Anshuk Uppal,Michael Riis Andersen,Jes Frellsen
关键词-EN: Personalized content recommendations, Personalized content, social networks, content experience, Ekstra Bladet
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 11 pages, 8 tables, 2 figures, RecSys '24

点击查看摘要

Abstract:Personalized content recommendations have been pivotal to the content experience in digital media from video streaming to social networks. However, several domain-specific challenges have held back adoption of recommender systems in news publishing. To address these challenges, we introduce the Ekstra Bladet News Recommendation Dataset (EB-NeRD). The dataset encompasses data from over a million unique users and more than 37 million impression logs from Ekstra Bladet. It also includes a collection of over 125,000 Danish news articles, complete with titles, abstracts, bodies, and metadata, such as categories. EB-NeRD served as the benchmark dataset for the RecSys '24 Challenge, where it was demonstrated how the dataset can be used to address both technical and normative challenges in designing effective and responsible recommender systems for news publishing. The dataset is available at: this https URL.

[AI-37] Cayley Graph Propagation

链接: https://arxiv.org/abs/2410.03424
作者: JJ Wilson,Maya Bechler-Speicher,Petar Veličković
关键词-EN: modelling graph-structured data, graph neural networks, neural networks, graph-structured data, pairs of nodes
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 20 pages, 6 figures

点击查看摘要

Abstract:In spite of the plethora of success stories with graph neural networks (GNNs) on modelling graph-structured data, they are notoriously vulnerable to over-squashing, whereby tasks necessitate the mixing of information between distant pairs of nodes. To address this problem, prior work suggests rewiring the graph structure to improve information flow. Alternatively, a significant body of research has dedicated itself to discovering and precomputing bottleneck-free graph structures to ameliorate over-squashing. One well-regarded family of bottleneck-free graphs within the mathematical community are expander graphs, with prior work, Expander Graph Propagation (EGP), proposing the use of a well-known expander graph family, the Cayley graphs of the SL(2, Z_n) special linear group, as a computational template for GNNs. However, in EGP the computational graphs used are truncated to align with a given input graph. In this work, we show that truncation is detrimental to the coveted expansion properties. Instead, we propose CGP, a method to propagate information over a complete Cayley graph structure, thereby ensuring it is bottleneck-free to better alleviate over-squashing. Our empirical evidence across several real-world datasets not only shows that CGP recovers significant improvements as compared to EGP, but it is also akin to or outperforms computationally complex graph rewiring techniques.
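The expander family both EGP and CGP build on is concrete enough to enumerate: starting from the identity and repeatedly multiplying by the two standard shear generators of SL(2, Z) reduced mod n yields the full vertex set of the Cayley graph of SL(2, Z_n). A small sketch for intuition (the generator choice is the standard construction, not the papers' code):

```python
from collections import deque

def sl2_cayley_order(n):
    # Count the vertices of the Cayley graph of SL(2, Z_n) generated by
    # the two shear matrices, via breadth-first enumeration. Matrices
    # [[a, b], [c, d]] are stored as hashable tuples (a, b, c, d) mod n.
    def mul(a, b):
        a11, a12, a21, a22 = a
        b11, b12, b21, b22 = b
        return ((a11 * b11 + a12 * b21) % n,
                (a11 * b12 + a12 * b22) % n,
                (a21 * b11 + a22 * b21) % n,
                (a21 * b12 + a22 * b22) % n)

    gens = [(1, 1, 0, 1), (1, 0, 1, 1)]  # the two standard shears
    identity = (1, 0, 0, 1)
    seen = {identity}
    queue = deque([identity])
    while queue:
        m = queue.popleft()
        for g in gens:
            nxt = mul(m, g)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen)

print(sl2_cayley_order(3))  # 24 vertices: |SL(2, Z_3)| = 3 * (3^2 - 1)
```

Since |SL(2, Z_n)| grows roughly like n^3, one picks n so the Cayley graph is just large enough to host the input graph, which is exactly where EGP truncates and CGP, per the abstract, keeps the complete structure instead.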

[AI-38] One2Set + Large Language Model: Best Partners for Keyphrase Generation EMNLP2024

链接: https://arxiv.org/abs/2410.03421
作者: Liangying Shao,Liang Zhang,Minlong Peng,Guoqi Ma,Hao Yue,Mingming Sun,Jinsong Su
关键词-EN: aims to automatically, automatically generate, generate a collection, collection of phrases, phrases representing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted by EMNLP 2024 Main Conference

点击查看摘要

Abstract:Keyphrase generation (KPG) aims to automatically generate a collection of phrases representing the core concepts of a given document. The dominant paradigms in KPG include one2seq and one2set. Recently, there has been increasing interest in applying large language models (LLMs) to KPG. Our preliminary experiments reveal that it is challenging for a single model to excel in both recall and precision. Further analysis shows that: 1) the one2set paradigm owns the advantage of high recall, but suffers from improper assignments of supervision signals during training; 2) LLMs are powerful in keyphrase selection, but existing selection methods often make redundant selections. Given these observations, we introduce a generate-then-select framework decomposing KPG into two steps, where we adopt a one2set-based model as generator to produce candidates and then use an LLM as selector to select keyphrases from these candidates. Particularly, we make two important improvements on our generator and selector: 1) we design an Optimal Transport-based assignment strategy to address the above improper assignments; 2) we model the keyphrase selection as a sequence labeling task to alleviate redundant selections. Experimental results on multiple benchmark datasets show that our framework significantly surpasses state-of-the-art models, especially in absent keyphrase prediction.

[AI-39] Comparative study of regression vs pairwise models for surrogate-based heuristic optimisation

链接: https://arxiv.org/abs/2410.03409
作者: Pablo S. Naharro,Pablo Toharia,Antonio LaTorre,José-María Peña
关键词-EN: surrogate models, Heuristic optimisation algorithms, pairwise surrogate models, Heuristic optimisation, space by sampling
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Heuristic optimisation algorithms explore the search space by sampling solutions, evaluating their fitness, and biasing the search in the direction of promising solutions. However, in many cases, this fitness function involves executing expensive computational calculations, drastically reducing the reasonable number of evaluations. In this context, surrogate models have emerged as an excellent alternative to alleviate these computational problems. This paper addresses the formulation of surrogate problems as both regression models that approximate fitness (surface surrogate models) and a novel way to connect classification models (pairwise surrogate models). The pairwise approach can be directly exploited by some algorithms, such as Differential Evolution, in which the fitness value is not actually needed to drive the search, and it is sufficient to know whether a solution is better than another one or not. Based on these modelling approaches, we have conducted a multidimensional analysis of surrogate models under different configurations: different machine learning algorithms (regularised regression, neural networks, decision trees, boosting methods, and random forests), different surrogate strategies (encouraging diversity or relaxing prediction thresholds), and compare them for both surface and pairwise surrogate models. The experimental part of the article includes the benchmark problems already proposed for the SOCO2011 competition in continuous optimisation and a simulation problem included in the recent GECCO2021 Industrial Challenge. This paper shows that the performance of the overall search, when using online machine learning-based surrogate models, depends not only on the accuracy of the predictive model but also on both the kind of bias towards positive or negative cases and how the optimisation uses those predictions to decide whether to execute the actual fitness function.
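
The pairwise idea can be sketched in a few lines: train a binary classifier on feature differences to predict which of two candidate solutions has the better fitness, so the optimiser never needs the fitness value itself. The example below is a hypothetical toy (sphere function, hand-chosen quadratic features so a plain logistic model can represent the comparison), not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(x):  # stands in for an expensive black-box evaluation
    return np.sum(x ** 2, axis=-1)

X = rng.uniform(-1, 1, size=(400, 3))
i, j = rng.integers(0, 400, size=200), rng.integers(0, 400, size=200)

# Pairwise features: with phi(x) = x**2, "x_i better than x_j" becomes
# linearly separable, so logistic regression can learn the comparison.
feats = X[j] ** 2 - X[i] ** 2
labels = (fitness(X[i]) < fitness(X[j])).astype(float)

w = np.zeros(3)
for _ in range(2000):  # plain gradient descent on the logistic loss
    p = 1.0 / (1.0 + np.exp(-feats @ w))
    w -= 0.5 * feats.T @ (p - labels) / len(labels)

pred = 1.0 / (1.0 + np.exp(-feats @ w)) > 0.5
acc = np.mean(pred == labels.astype(bool))
```

A Differential Evolution loop would call such a surrogate in place of the fitness comparison, reserving real evaluations for promising candidates.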

[AI-40] EBES: Easy Benchmarking for Event Sequences

链接: https://arxiv.org/abs/2410.03399
作者: Dmitry Osin,Igor Udovichenko,Viktor Moskvoretskii,Egor Shvetsov,Evgeny Burnaev
关键词-EN: user interaction logs, irregular sampling intervals, common data structures, Event sequences, characterized by irregular
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Event sequences, characterized by irregular sampling intervals and a mix of categorical and numerical features, are common data structures in various real-world domains such as healthcare, finance, and user interaction logs. Despite advances in temporal data modeling techniques, there are no standardized benchmarks for evaluating their performance on event sequences. This complicates result comparison across different papers due to varying evaluation protocols, potentially misleading progress in this field. We introduce EBES, a comprehensive benchmarking tool with standardized evaluation scenarios and protocols, focusing on regression and classification problems with sequence-level targets. Our library simplifies benchmarking, dataset addition, and method integration through a unified interface. It includes a novel synthetic dataset and provides preprocessed real-world datasets, including the largest publicly available banking dataset. Our results provide an in-depth analysis of datasets, identifying some as unsuitable for model comparison. We investigate the importance of modeling temporal and sequential components, as well as the robustness and scaling properties of the models. These findings highlight potential directions for future research. Our benchmark aims to facilitate reproducible research, expediting progress and increasing real-world impact.

[AI-41] GraphCroc: Cross-Correlation Autoencoder for Graph Structural Reconstruction NEURIPS2024

链接: https://arxiv.org/abs/2410.03396
作者: Shijin Duan,Ruyi Ding,Jiaxing He,Aidong Adam Ding,Yunsi Fei,Xiaolin Xu
关键词-EN: Graph-structured data, prompting the development, data is integral, Graph-structured, graph
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 22 pages, 16 figures. Accepted in NeurIPS 2024

点击查看摘要

Abstract:Graph-structured data is integral to many applications, prompting the development of various graph representation methods. Graph autoencoders (GAEs), in particular, reconstruct graph structures from node embeddings. Current GAE models primarily utilize self-correlation to represent graph structures and focus on node-level tasks, often overlooking multi-graph scenarios. Our theoretical analysis indicates that self-correlation generally falls short in accurately representing specific graph features such as islands, symmetrical structures, and directional edges, particularly in smaller or multiple graph contexts. To address these limitations, we introduce a cross-correlation mechanism that significantly enhances the GAE representational capabilities. Additionally, we propose GraphCroc, a new GAE that supports flexible encoder architectures tailored for various downstream tasks and ensures robust structural reconstruction, through a mirrored encoding-decoding process. This model also tackles the challenge of representation bias during optimization by implementing a loss-balancing strategy. Both theoretical analysis and numerical evaluations demonstrate that our methodology significantly outperforms existing self-correlation-based GAEs in graph structure reconstruction.
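
The limitation of self-correlation decoders can be shown directly: reconstructing the adjacency as sigmoid(Z Zᵀ) is necessarily symmetric (ruling out directed edges), and its diagonal is at least 0.5 (forcing self-loops, which hurts isolated "island" nodes). A minimal numpy sketch, assuming the standard GAE inner-product decoder rather than the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

Z = rng.normal(size=(6, 4))      # node embeddings for a 6-node graph
A_self = sigmoid(Z @ Z.T)        # self-correlation (inner-product) decoder

Q, K = rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
A_cross = sigmoid(Q @ K.T)       # cross-correlation decoder: two embeddings

# Self-correlation is always symmetric and its diagonal is
# sigmoid(||z||^2) >= 0.5, so it can express neither directed edges
# nor isolated nodes; cross-correlation has neither constraint.
assert np.allclose(A_self, A_self.T)
assert np.all(np.diag(A_self) >= 0.5)
assert not np.allclose(A_cross, A_cross.T)
```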

[AI-42] Predicting perturbation targets with causal differential networks

链接: https://arxiv.org/abs/2410.03380
作者: Menghua Wu,Umesh Padia,Sean H. Murphy,Regina Barzilay,Tommi Jaakkola
关键词-EN: Rationally identifying variables, enable myriad applications, Rationally identifying, identifying variables responsible, cell engineering
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Rationally identifying variables responsible for changes to a biological system can enable myriad applications in disease understanding and cell engineering. From a causality perspective, we are given two datasets generated by the same causal model, one observational (control) and one interventional (perturbed). The goal is to isolate the subset of measured variables (e.g. genes) that were the targets of the intervention, i.e. those whose conditional independencies have changed. Knowing the causal graph would limit the search space, allowing us to efficiently pinpoint these variables. However, current algorithms that infer causal graphs in the presence of unknown intervention targets scale poorly to the hundreds or thousands of variables in biological data, as they must jointly search the combinatorial spaces of graphs and consistent intervention targets. In this work, we propose a causality-inspired approach for predicting perturbation targets that decouples the two search steps. First, we use an amortized causal discovery model to separately infer causal graphs from the observational and interventional datasets. Then, we learn to map these paired graphs to the sets of variables that were intervened upon, in a supervised learning framework. This approach consistently outperforms baselines for perturbation modeling on seven single-cell transcriptomics datasets, each with thousands of measured variables. We also demonstrate significant improvements over six causal discovery algorithms in predicting intervention targets across a variety of tractable, synthetic datasets.
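
A drastically simplified version of the idea can be run on a toy linear chain SEM: compare precision matrices estimated from the observational and interventional samples, and score each variable by how much its row changed (the paper instead uses an amortized causal-discovery model and a learned graph-pair-to-targets mapping; the SEM and intervention below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, target = 20000, 5, 2

def simulate(intervened):
    """Linear chain SEM x0 -> x1 -> ... -> x4; the intervention cuts the
    incoming edge of the target variable and changes its noise scale."""
    X = np.zeros((n, d))
    X[:, 0] = rng.normal(size=n)
    for i in range(1, d):
        if intervened and i == target:
            X[:, i] = rng.normal(scale=2.0, size=n)
        else:
            X[:, i] = 0.8 * X[:, i - 1] + rng.normal(size=n)
    return X

P_obs = np.linalg.inv(np.cov(simulate(False).T))
P_itv = np.linalg.inv(np.cov(simulate(True).T))
score = np.abs(P_obs - P_itv).sum(axis=1)  # per-variable change in row
top2 = np.argsort(score)[-2:]              # most-changed variables
```

The intervened variable's conditional independencies change, so it (and its former parent) dominate the score, while downstream mechanisms left intact barely move.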

[AI-43] Mitigating Adversarial Perturbations for Deep Reinforcement Learning via Vector Quantization IROS2024

链接: https://arxiv.org/abs/2410.03376
作者: Tung M. Luu,Thanh Nguyen,Tee Joshua Tian Jin,Sungwoon Kim,Chang D. Yoo
关键词-EN: Recent studies reveal, well-performing reinforcement learning, Recent studies, reinforcement learning, perturbations during deployment
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 8 pages, IROS 2024 (Code: this https URL )

点击查看摘要

Abstract:Recent studies reveal that well-performing reinforcement learning (RL) agents in training often lack resilience against adversarial perturbations during deployment. This highlights the importance of building a robust agent before deploying it in the real world. Most prior works focus on developing robust training-based procedures to tackle this problem, including enhancing the robustness of the deep neural network component itself or adversarially training the agent on strong attacks. In this work, we instead study an input transformation-based defense for RL. Specifically, we propose using a variant of vector quantization (VQ) as a transformation for input observations, which is then used to reduce the space of adversarial attacks during testing, resulting in the transformed observations being less affected by attacks. Our method is computationally efficient and seamlessly integrates with adversarial training, further enhancing the robustness of RL agents against adversarial attacks. Through extensive experiments in multiple environments, we demonstrate that using VQ as the input transformation effectively defends against adversarial attacks on the agent’s observations.
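
The defense reduces to snapping each observation to its nearest codeword before the policy sees it, so small perturbations that stay within a quantization cell are erased. A minimal sketch with a hand-picked codebook (in the paper the codebook variant is learned and combined with adversarial training):

```python
import numpy as np

# Toy codebook; in practice this would be learned from clean observations.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])

def vq(obs):
    """Map an observation to its nearest codeword (the VQ transform)."""
    d = np.linalg.norm(codebook - obs, axis=1)
    return codebook[np.argmin(d)]

obs = np.array([0.1, -0.05])
adv = obs + np.array([0.2, 0.1])  # small adversarial perturbation
# Both the clean and perturbed observation collapse to the same codeword,
# so the downstream policy receives an identical input.
assert np.array_equal(vq(obs), vq(adv))
```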

[AI-44] SoundSignature: What Type of Music Do You Like?

链接: https://arxiv.org/abs/2410.03375
作者: Brandon James Carone,Pablo Ripollés
关键词-EN: custom OpenAI Assistant, Music Information Retrieval, analyze users’ favorite, users’ favorite songs, OpenAI Assistant
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Audio and Speech Processing (eess.AS)
*备注: 10 pages, 1 figure, to be published in the 2024 International Symposium on the IEEE Internet of Sounds Proceedings

点击查看摘要

Abstract:SoundSignature is a music application that integrates a custom OpenAI Assistant to analyze users’ favorite songs. The system incorporates state-of-the-art Music Information Retrieval (MIR) Python packages to combine extracted acoustic/musical features with the assistant’s extensive knowledge of the artists and bands. Capitalizing on this combined knowledge, SoundSignature leverages semantic audio and principles from the emerging Internet of Sounds (IoS) ecosystem, integrating MIR with AI to provide users with personalized insights into the acoustic properties of their music, akin to a musical preference personality report. Users can then interact with the chatbot to explore deeper inquiries about the acoustic analyses performed and how they relate to their musical taste. This interactivity transforms the application, acting not only as an informative resource about familiar and/or favorite songs, but also as an educational platform that enables users to deepen their understanding of musical features, music theory, acoustic properties commonly used in signal processing, and the artists behind the music. Beyond general usability, the application also incorporates several well-established open-source musician-specific tools, such as a chord recognition algorithm (CREMA), a source separation algorithm (DEMUCS), and an audio-to-MIDI converter (basic-pitch). These features allow users without coding skills to access advanced, open-source music processing algorithms simply by interacting with the chatbot (e.g., can you give me the stems of this song?). In this paper, we highlight the application’s innovative features and educational potential, and present findings from a pilot user study that evaluates its efficacy and usability.

[AI-45] Make Interval Bound Propagation great again

链接: https://arxiv.org/abs/2410.03373
作者: Patryk Krukowski,Daniel Wilczak,Jacek Tabor,Anna Bielawska,Przemysław Spurek
关键词-EN: medical data analysis, robust deep networks, autonomous driving, real life, data analysis
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In various scenarios motivated by real life, such as medical data analysis, autonomous driving, and adversarial training, we are interested in robust deep networks. A network is robust when a relatively small perturbation of the input cannot lead to drastic changes in output (like change of class, etc.). This falls under the broader field of Neural Network Certification (NNC). Two crucial problems in NNC are of profound interest to the scientific community: how to calculate the robustness of a given pre-trained network and how to construct robust networks. The common approach to constructing robust networks is Interval Bound Propagation (IBP). This paper demonstrates that IBP is sub-optimal in the first case due to its susceptibility to the wrapping effect. Even for linear activation, IBP gives strongly sub-optimal bounds. Consequently, one should use strategies immune to the wrapping effect to obtain bounds close to optimal ones. We adapt two classical approaches dedicated to strict computations, Doubleton Arithmetic and Affine Arithmetic, to mitigate the wrapping effect in neural networks. These techniques yield precise results for networks with linear activation functions, thus resisting the wrapping effect. As a result, we achieve bounds significantly closer to the optimal level than IBP.
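
The wrapping effect is easy to reproduce: propagating an interval box through two successive 45° rotations yields strictly looser bounds than one exact 90° rotation, even though the composed map is identical. A minimal sketch of generic IBP through linear layers (not the paper's rigorous-arithmetic implementation):

```python
import numpy as np

def ibp_linear(W, low, up):
    """Interval bound propagation through y = W x for x in [low, up]."""
    c, r = (low + up) / 2, (up - low) / 2
    return W @ c - np.abs(W) @ r, W @ c + np.abs(W) @ r

def rot(theta):
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

low, up = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
# One 90-degree rotation maps the box onto itself (width 2 per axis)...
l1, u1 = ibp_linear(rot(np.pi / 2), low, up)
# ...but IBP through two 45-degree rotations over-approximates each step:
la, ua = ibp_linear(rot(np.pi / 4), low, up)
lb, ub = ibp_linear(rot(np.pi / 4), la, ua)
# The wrapping effect: the two-step bounds are twice as wide (4 vs. 2).
assert np.all((ub - lb) > (u1 - l1))
```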

[AI-46] LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding

链接: https://arxiv.org/abs/2410.03355
作者: Doohyuk Jang,Sihwan Park,June Yong Yang,Yeonsung Jung,Jihun Yun,Souvik Kundu,Sung-Yub Kim,Eunho Yang
关键词-EN: recently gained prominence, speculative decoding, recently gained, gained prominence, models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Auto-Regressive (AR) models have recently gained prominence in image generation, often matching or even surpassing the performance of diffusion models. However, one major limitation of AR models is their sequential nature, which processes tokens one at a time, slowing down generation compared to models like GANs or diffusion-based methods that operate more efficiently. While speculative decoding has proven effective for accelerating LLMs by generating multiple tokens in a single forward pass, its application in visual AR models remains largely unexplored. In this work, we identify a challenge in this setting, which we term token selection ambiguity, wherein visual AR models frequently assign uniformly low probabilities to tokens, hampering the performance of speculative decoding. To overcome this challenge, we propose a relaxed acceptance condition referred to as LANTERN that leverages the interchangeability of tokens in latent space. This relaxation restores the effectiveness of speculative decoding in visual AR models by enabling more flexible use of candidate tokens that would otherwise be prematurely rejected. Furthermore, by incorporating a total variation distance bound, we ensure that these speed gains are achieved without significantly compromising image quality or semantic coherence. Experimental results demonstrate the efficacy of our method in providing a substantial speed-up over speculative decoding. Specifically, compared to a naïve application of the state-of-the-art speculative decoding, LANTERN increases speed-ups by 1.75× and 1.76×, as compared to greedy decoding and random sampling, respectively, when applied to LlamaGen, a contemporary visual AR model.
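
The relaxation can be illustrated with a toy simulation: standard speculative decoding accepts a draft token t with probability min(1, p_target(t)/p_draft(t)), while a relaxed rule aggregates target probability over a set of interchangeable tokens before testing acceptance. The neighbourhood below (adjacent token ids) is purely hypothetical; the paper derives interchangeability from latent-space distances:

```python
import numpy as np

rng = np.random.default_rng(0)
V = 50                                # toy vocabulary size
p_draft = rng.dirichlet(np.ones(V))   # draft model distribution
p_target = rng.dirichlet(np.ones(V))  # target model distribution

def neighbours(t):
    """Hypothetical set of tokens interchangeable with t."""
    return [u for u in (t - 1, t, t + 1) if 0 <= u < V]

strict = relaxed = 0
trials = 10000
for _ in range(trials):
    t = rng.choice(V, p=p_draft)      # token proposed by the draft model
    u = rng.uniform()                 # shared acceptance coin flip
    if u < min(1.0, p_target[t] / p_draft[t]):
        strict += 1
    agg = sum(p_target[v] for v in neighbours(t))  # relaxed mass
    if u < min(1.0, agg / p_draft[t]):
        relaxed += 1
```

Because the aggregated mass always includes p_target(t), the relaxed rule accepts a superset of the strictly accepted drafts, which is the source of the speed-up (the paper's TV-distance bound then controls the quality cost).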

[AI-47] An X-Ray Is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation

链接: https://arxiv.org/abs/2410.03334
作者: Ahmed Abdulaal,Hugo Fry,Nina Montaña-Brown,Ayodeji Ijishakin,Jack Gao,Stephanie Hyland,Daniel C. Alexander,Daniel C. Castro
关键词-EN: experiencing unprecedented demand, radiology report generation, automating radiology report, unprecedented demand, leading to increased
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Radiological services are experiencing unprecedented demand, leading to increased interest in automating radiology report generation. Existing Vision-Language Models (VLMs) suffer from hallucinations, lack interpretability, and require expensive fine-tuning. We introduce SAE-Rad, which uses sparse autoencoders (SAEs) to decompose latent representations from a pre-trained vision transformer into human-interpretable features. Our hybrid architecture combines state-of-the-art SAE advancements, achieving accurate latent reconstructions while maintaining sparsity. Using an off-the-shelf language model, we distil ground-truth reports into radiological descriptions for each SAE feature, which we then compile into a full report for each image, eliminating the need for fine-tuning large models for this task. To the best of our knowledge, SAE-Rad represents the first instance of using mechanistic interpretability techniques explicitly for a downstream multi-modal reasoning task. On the MIMIC-CXR dataset, SAE-Rad achieves competitive radiology-specific metrics compared to state-of-the-art models while using significantly fewer computational resources for training. Qualitative analysis reveals that SAE-Rad learns meaningful visual concepts and generates reports aligning closely with expert interpretations. Our results suggest that SAEs can enhance multimodal reasoning in healthcare, providing a more interpretable alternative to existing VLMs.
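
The core SAE mechanic, decomposing a dense activation into a small number of features, can be sketched as a top-k sparse autoencoder forward pass. The weights below are untrained and random; SAE-Rad's actual architecture and training objective differ:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_feat, k = 32, 128, 5   # activation dim, dictionary size, sparsity

W_enc = rng.normal(scale=0.1, size=(d_feat, d_model))
W_dec = rng.normal(scale=0.1, size=(d_model, d_feat))
b_enc = np.zeros(d_feat)

def sae_forward(x):
    """Top-k sparse autoencoder: keep only the k largest activations."""
    a = np.maximum(W_enc @ x + b_enc, 0.0)  # ReLU encoder
    a[np.argsort(a)[:-k]] = 0.0             # zero all but the top-k
    return a, W_dec @ a                     # sparse code, reconstruction

x = rng.normal(size=d_model)
code, recon = sae_forward(x)
```

Each of the (at most) k active features can then be described in natural language, as the paper does by distilling ground-truth reports per feature.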

[AI-48] Comparative Analysis and Ensemble Enhancement of Leading CNN Architectures for Breast Cancer Classification

链接: https://arxiv.org/abs/2410.03333
作者: Gary Murphy,Raghubir Singh
关键词-EN: Convolutional Neural Network, leading Convolutional Neural, Neural Network, Convolutional Neural, image datasets
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This study introduces a novel and accurate approach to breast cancer classification using histopathology images. It systematically compares leading Convolutional Neural Network (CNN) models across varying image datasets, identifies their optimal hyperparameters, and ranks them based on classification efficacy. To maximize classification accuracy for each model, we explore the effects of data augmentation, alternative fully-connected layers, model training hyperparameter settings, and the advantages of retraining models versus using pre-trained weights. Our methodology includes several original concepts, including serializing generated datasets to ensure consistent data conditions across training runs and significantly reducing training duration. Combined with automated curation of results, this enabled the exploration of over 2,000 training permutations; such a comprehensive comparison is as yet unprecedented. Our findings establish the settings required to achieve exceptional classification accuracy for standalone CNN models and rank them by model efficacy. Based on these results, we propose ensemble architectures that stack three high-performing standalone CNN models together with diverse classifiers, resulting in improved classification accuracy. The ability to systematically run so many model permutations to get the best outcomes gives rise to very high quality results, including 99.75% for BreakHis x40 and BreakHis x200 and 95.18% for the Bach datasets when split into train, validation and test datasets. The Bach Online blind challenge yielded 89% using this approach. Whilst this study is based on breast cancer histopathology image datasets, the methodology is equally applicable to other medical image datasets.
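
Why stacking diverse models helps can be seen in a deliberately contrived example: if three classifiers err on disjoint subsets of the data, a simple majority vote is perfect even though no individual model is. The paper's ensembles stack CNN feature extractors with trained meta-classifiers; this sketch only illustrates the underlying error-diversity principle:

```python
import numpy as np

y = np.arange(9) % 2                  # ground-truth labels for 9 samples
preds = np.tile(y, (3, 1))            # three models, initially all correct
# Each standalone model errs on a different disjoint third of the samples:
for m in range(3):
    sl = slice(3 * m, 3 * m + 3)
    preds[m, sl] = 1 - preds[m, sl]

indiv_acc = (preds == y).mean(axis=1)          # each model: 6/9 correct
vote = (preds.sum(axis=0) >= 2).astype(int)    # majority vote of the three
```

On every sample exactly one model is wrong, so the two correct votes always win; with correlated errors the gain would shrink, which is why diversity among the stacked CNNs matters.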

[AI-49] Influence-oriented Personalized Federated Learning

链接: https://arxiv.org/abs/2410.03315
作者: Yue Tan,Guodong Long,Jing Jiang,Chengqi Zhang
关键词-EN: Traditional federated learning, Traditional federated, neglecting the mutual, rely on fixed, fixed weighting
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Traditional federated learning (FL) methods often rely on fixed weighting for parameter aggregation, neglecting the mutual influence by others. Hence, their effectiveness in heterogeneous data contexts is limited. To address this problem, we propose an influence-oriented federated learning framework, namely FedC^2I, which quantitatively measures Client-level and Class-level Influence to realize adaptive parameter aggregation for each client. Our core idea is to explicitly model the inter-client influence within an FL system via the well-crafted influence vector and influence matrix. The influence vector quantifies client-level influence, enables clients to selectively acquire knowledge from others, and guides the aggregation of feature representation layers. Meanwhile, the influence matrix captures class-level influence in a more fine-grained manner to achieve personalized classifier aggregation. We evaluate the performance of FedC^2I against existing federated learning methods under non-IID settings and the results demonstrate the superiority of our method.
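
In its simplest form, influence-oriented aggregation replaces FedAvg's uniform mean with a per-client weighted mixture of parameters. The weights below are hand-picked for illustration; the actual FedC^2I learns the influence vector during training and additionally applies a class-level influence matrix to the classifier layers:

```python
import numpy as np

# Toy: 3 clients, each holding a 2-dimensional parameter vector.
params = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [2.0, 2.0]])
# Row k is client k's influence vector over all clients (rows sum to 1).
influence = np.array([[0.8, 0.1, 0.1],   # client 0 mostly trusts itself
                      [0.1, 0.8, 0.1],
                      [0.2, 0.2, 0.6]])
assert np.allclose(influence.sum(axis=1), 1.0)

personalized = influence @ params   # one aggregated model per client
uniform = params.mean(axis=0)       # plain FedAvg, same model for everyone
```

Each client thus receives an aggregate biased toward the clients it benefits from, instead of the single shared average.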

[AI-50] Comparing zero-shot self-explanations with human rationales in multilingual text classification

链接: https://arxiv.org/abs/2410.03296
作者: Stephanie Brandl,Oliver Eberle
关键词-EN: complex XAI methods, possibly complex XAI, require gradient computations, XAI methods, complex XAI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: preprint

点击查看摘要

Abstract:Instruction-tuned LLMs are able to provide an explanation about their output to users by generating self-explanations that do not require gradient computations or the application of possibly complex XAI methods. In this paper, we analyse whether this ability results in a good explanation by evaluating self-explanations in the form of input rationales with respect to their plausibility to humans as well as their faithfulness to models. For this, we apply two text classification tasks: sentiment classification and forced labour detection. Next to English, we further include Danish and Italian translations of the sentiment classification task and compare self-explanations to human annotations for all samples. To allow for direct comparisons, we also compute post-hoc feature attribution, i.e., layer-wise relevance propagation (LRP) and apply this pipeline to 4 LLMs (Llama2, Llama3, Mistral and Mixtral). Our results show that self-explanations align more closely with human annotations compared to LRP, while maintaining a comparable level of faithfulness.

[AI-51] Five Years of COVID-19 Discourse on Instagram: A Labeled Instagram Dataset of Over Half a Million Posts for Multilingual Sentiment Analysis

链接: https://arxiv.org/abs/2410.03293
作者: Nirmalya Thakur
关键词-EN: Instagram, Instagram posts, sentiment, makes three scientific, scientific contributions
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:The work presented in this paper makes three scientific contributions with a specific focus on mining and analysis of COVID-19-related posts on Instagram. First, it presents a multilingual dataset of 500,153 Instagram posts about COVID-19 published between January 2020 and September 2024. This dataset, available at this https URL, contains Instagram posts in 161 different languages as well as 535,021 distinct hashtags. After the development of this dataset, multilingual sentiment analysis was performed, which involved classifying each post as positive, negative, or neutral. The results of sentiment analysis are presented as a separate attribute in this dataset. Second, it presents the results of performing sentiment analysis per year from 2020 to 2024. The findings revealed the trends in sentiment related to COVID-19 on Instagram since the beginning of the pandemic. For instance, between 2020 and 2024, the sentiment trends show a notable shift, with positive sentiment decreasing from 38.35% to 28.69% while neutral sentiment rose from 44.19% to 58.34%. Finally, the paper also presents findings of language-specific sentiment analysis. This analysis highlighted similar and contrasting trends of sentiment across posts published in different languages on Instagram. For instance, out of all English posts, 49.68% were positive, 14.84% were negative, and 35.48% were neutral. In contrast, among Hindi posts, 4.40% were positive, 57.04% were negative, and 38.56% were neutral, reflecting distinct differences in the sentiment distribution between these two languages.

[AI-52] Enhanced Transformer architecture for in-context learning of dynamical systems

链接: https://arxiv.org/abs/2410.03291
作者: Matteo Rufolo,Dario Piga,Gabriele Maroni,Marco Forgione
关键词-EN: in-context identification paradigm, identification paradigm aims, Recently introduced, aims at estimating, offline and based
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Recently introduced by some of the authors, the in-context identification paradigm aims at estimating, offline and based on synthetic data, a meta-model that describes the behavior of a whole class of systems. Once trained, this meta-model is fed with an observed input/output sequence (context) generated by a real system to predict its behavior in a zero-shot learning fashion. In this paper, we enhance the original meta-modeling framework through three key innovations: by formulating the learning task within a probabilistic framework; by managing non-contiguous context and query windows; and by adopting recurrent patching to effectively handle long context sequences. The efficacy of these modifications is demonstrated through a numerical example focusing on the Wiener-Hammerstein system class, highlighting the model’s enhanced performance and scalability.

[AI-53] Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models

链接: https://arxiv.org/abs/2410.03290
作者: Haibo Wang,Zhiyang Xu,Yu Cheng,Shizhe Diao,Yufan Zhou,Yixin Cao,Qifan Wang,Weifeng Ge,Lifu Huang
关键词-EN: Large Language Models, Video Large Language, Large Language, demonstrated remarkable capabilities, Language Models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Video Large Language Models (Video-LLMs) have demonstrated remarkable capabilities in coarse-grained video understanding, however, they struggle with fine-grained temporal grounding. In this paper, we introduce Grounded-VideoLLM, a novel Video-LLM adept at perceiving and reasoning over specific video moments in a fine-grained manner. We identify that current Video-LLMs have limitations for fine-grained video understanding since they lack effective temporal modeling and timestamp representation. In light of this, we sharpen our model by incorporating (1) an additional temporal stream to encode the relationships between frames and (2) discrete temporal tokens enriched with specific time knowledge to represent timestamps. To optimize the training of Grounded-VideoLLM, we employ a multi-stage training scheme, beginning with simple video-captioning tasks and progressively introducing video temporal grounding tasks of increasing complexity. To further enhance Grounded-VideoLLM’s temporal reasoning capability, we also curate a grounded VideoQA dataset by an automatic annotation pipeline. Extensive experiments demonstrate that Grounded-VideoLLM not only excels in fine-grained grounding tasks such as temporal sentence grounding, dense video captioning, and grounded VideoQA, but also shows great potential as a versatile video assistant for general video understanding.
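
Discrete temporal tokens amount to quantizing a timestamp into one of K special vocabulary items. A minimal sketch, where K and the token format "<t_k>" are illustrative rather than the paper's exact vocabulary:

```python
K = 100  # number of discrete temporal tokens in the vocabulary

def time_to_token(t, duration):
    """Quantize timestamp t (seconds) of a video into a temporal token."""
    k = min(int(t / duration * K), K - 1)
    return f"<t_{k}>"

def token_to_time(token, duration):
    """Decode a temporal token back to the centre of its time bin."""
    k = int(token.strip("<t_>"))
    return (k + 0.5) / K * duration

tok = time_to_token(37.2, 120.0)   # -> "<t_31>" for a 2-minute video
```

The quantization error is bounded by half a bin width (duration / K here), which is why enriching these tokens with explicit time knowledge, as the paper does, matters for fine-grained grounding.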

[AI-54] Test-time Adaptation for Regression by Subspace Alignment

链接: https://arxiv.org/abs/2410.03263
作者: Kazuki Adachi,Shin’ya Yamaguchi,Atsutoshi Kumagai,Tomoki Hamagami
关键词-EN: investigates test-time adaptation, paper investigates test-time, unlabeled target data, regression model pre-trained, existing TTA methods
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper investigates test-time adaptation (TTA) for regression, where a regression model pre-trained in a source domain is adapted to an unknown target distribution with unlabeled target data. Although regression is one of the fundamental tasks in machine learning, most of the existing TTA methods have classification-specific designs, which assume that models output class-categorical predictions, whereas regression models typically output only single scalar values. To enable TTA for regression, we adopt a feature alignment approach, which aligns the feature distributions between the source and target domains to mitigate the domain gap. However, we found that naive feature alignment employed in existing TTA methods for classification is ineffective or even worse for regression because the features are distributed in a small subspace and many of the raw feature dimensions have little significance to the output. For an effective feature alignment in TTA for regression, we propose Significant-subspace Alignment (SSA). SSA consists of two components: subspace detection and dimension weighting. Subspace detection finds the feature subspace that is representative and significant to the output. Then, the feature alignment is performed in the subspace during TTA. Meanwhile, dimension weighting raises the importance of the dimensions of the feature subspace that have greater significance to the output. We experimentally show that SSA outperforms various baselines on real-world datasets.
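
The two components can be approximated in a few numpy lines: detect the source features' principal subspace via SVD, then shift the target features so the means agree inside that subspace. This sketch aligns only the first moment and omits SSA's output-significance weighting of dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Source and target features live in a low-dimensional subspace of R^16,
# with the target shifted by a synthetic domain gap.
basis = rng.normal(size=(16, 2))                   # true 2-D subspace
src = rng.normal(size=(500, 2)) @ basis.T
tgt = (rng.normal(size=(500, 2)) + 3.0) @ basis.T  # shifted target domain

# 1) Subspace detection: top principal directions of the source features.
_, _, Vt = np.linalg.svd(src - src.mean(0), full_matrices=False)
V = Vt[:2].T                                       # (16, 2) subspace basis

# 2) Alignment: move the target mean onto the source mean in the subspace.
shift = (src.mean(0) - tgt.mean(0)) @ V            # gap in subspace coords
tgt_aligned = tgt + shift @ V.T

gap_before = np.linalg.norm((tgt - src).mean(0) @ V)
gap_after = np.linalg.norm((tgt_aligned - src).mean(0) @ V)
```

Aligning inside the detected subspace avoids wasting capacity on the many raw feature dimensions that, as the paper observes, carry little significance for the regression output.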

[AI-55] Towards a Benchmark for Large Language Models for Business Process Management Tasks

链接: https://arxiv.org/abs/2410.03255
作者: Kiran Busch,Henrik Leopold
关键词-EN: deploying Large Language, Large Language Models, Large Language, deploying Large, Language Models
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:An increasing number of organizations are deploying Large Language Models (LLMs) for a wide range of tasks. Despite their general utility, LLMs are prone to errors, ranging from inaccuracies to hallucinations. To objectively assess the capabilities of existing LLMs, performance benchmarks are conducted. However, these benchmarks often do not translate to more specific real-world tasks. This paper addresses the gap in benchmarking LLM performance in the Business Process Management (BPM) domain. Currently, no BPM-specific benchmarks exist, creating uncertainty about the suitability of different LLMs for BPM tasks. This paper systematically compares LLM performance on four BPM tasks focusing on small open-source models. The analysis aims to identify task-specific performance variations, compare the effectiveness of open-source versus commercial models, and assess the impact of model size on BPM task performance. This paper provides insights into the practical applications of LLMs in BPM, guiding organizations in selecting appropriate models for their specific needs.

[AI-56] How much can we forget about Data Contamination?

链接: https://arxiv.org/abs/2410.03249
作者: Sebastian Bordt,Suraj Srinivas,Valentyn Boreiko,Ulrike von Luxburg
关键词-EN: large language models, evaluating the capabilities, capabilities of large, large language, significant challenge
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:The leakage of benchmark data into the training data has emerged as a significant challenge for evaluating the capabilities of large language models (LLMs). In this work, we use experimental evidence and theoretical estimates to challenge the common assumption that small-scale contamination renders benchmark evaluations invalid. First, we experimentally quantify the magnitude of benchmark overfitting based on scaling along three dimensions: The number of model parameters (up to 1.6B), the number of times an example is seen (up to 144), and the number of training tokens (up to 40B). We find that if model and data follow the Chinchilla scaling laws, minor contamination indeed leads to overfitting. At the same time, even 144 times of contamination can be forgotten if the training data is scaled beyond five times Chinchilla, a regime characteristic of many modern LLMs. We then derive a simple theory of example forgetting via cumulative weight decay. It allows us to bound the number of gradient steps required to forget past data for any training run where we know the hyperparameters of AdamW. This indicates that many LLMs, including Llama 3, have forgotten the data seen at the beginning of training. Experimentally, we demonstrate that forgetting occurs faster than what is predicted by our bounds. Taken together, our results suggest that moderate amounts of contamination can be forgotten at the end of realistically scaled training runs.
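The cumulative-weight-decay argument can be made concrete: AdamW's decoupled decay multiplies the existing weights by (1 - lr * weight_decay) at every step, so a past example's direct contribution shrinks geometrically. The helper below is a hedged sketch of that reasoning, not the paper's exact theorem; it counts the steps until the contribution falls below a threshold eps.

```python
import math

def steps_to_forget(lr, weight_decay, eps=1e-3):
    # Each AdamW step scales old weight contributions by (1 - lr * wd);
    # solve (1 - lr * wd)^T <= eps for the smallest integer T.
    decay = 1.0 - lr * weight_decay
    assert 0.0 < decay < 1.0, "lr * weight_decay must lie in (0, 1)"
    return math.ceil(math.log(eps) / math.log(decay))

# With typical LLM hyperparameters, forgetting takes many steps but is
# well within the length of a realistically scaled training run.
steps = steps_to_forget(lr=1e-4, weight_decay=0.1, eps=1e-3)
```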

[AI-57] Latent Action Priors From a Single Gait Cycle Demonstration for Online Imitation Learning ICRA2025

链接: https://arxiv.org/abs/2410.03246
作者: Oliver Hausdörfer,Alexander von Rohr,Éric Lefort,Angela Schoellig
关键词-EN: Deep Reinforcement Learning, Deep Reinforcement, unrealistic learning outcomes, Reinforcement Learning, simulation often results
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Submitted to ICRA 2025

点击查看摘要

Abstract:Deep Reinforcement Learning (DRL) in simulation often results in brittle and unrealistic learning outcomes. To push the agent towards more desirable solutions, prior information can be injected into the learning process through, for instance, reward shaping, expert data, or motion primitives. We propose an additional inductive bias for robot learning: latent actions learned from expert demonstration as priors in the action space. We show that these action priors can be learned from only a single open-loop gait cycle using a simple autoencoder. Using these latent action priors combined with established style rewards for imitation in DRL achieves performance above the level of the expert demonstration and leads to more desirable gaits. Further, action priors substantially improve the performance on transfer tasks, even leading to gait transitions for higher target speeds. Videos and code are available at this https URL.

[AI-58] Enriching Ontologies with Disjointness Axioms using Large Language Models ISWC2024

链接: https://arxiv.org/abs/2410.03235
作者: Elias Crum,Antonio De Santis,Manon Ovide,Jiaxin Pan,Alessia Pisu,Nicolas Lazzari,Sebastian Rudolph
关键词-EN: Large Language Models, Knowledge Graphs, explicit disjointness declarations, lack explicit disjointness, declarations between classes
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注: Accepted at KBC-LM’24 workshop at ISWC 2024

点击查看摘要

Abstract:Ontologies often lack explicit disjointness declarations between classes, despite their usefulness for sophisticated reasoning and consistency checking in Knowledge Graphs. In this study, we explore the potential of Large Language Models (LLMs) to enrich ontologies by identifying and asserting class disjointness axioms. Our approach aims at leveraging the implicit knowledge embedded in LLMs, using prompt engineering to elicit this knowledge for classifying ontological disjointness. We validate our methodology on the DBpedia ontology, focusing on open-source LLMs. Our findings suggest that LLMs, when guided by effective prompt strategies, can reliably identify disjoint class relationships, thus streamlining the process of ontology completion without extensive manual input. For comprehensive disjointness enrichment, we propose a process that takes logical relationships between disjointness and subclass statements into account in order to maintain satisfiability and reduce the number of calls to the LLM. This work provides a foundation for future applications of LLMs in automated ontology enhancement and offers insights into optimizing LLM performance through strategic prompt design. Our code is publicly available on GitHub at this https URL.

[AI-59] AutoPenBench: Benchmarking Generative Agents for Penetration Testing

链接: https://arxiv.org/abs/2410.03225
作者: Luca Gioacchini,Marco Mellia,Idilio Drago,Alexander Delsanto,Giuseppe Siracusano,Roberto Bifulco
关键词-EN: Large Language Models, Language Models, Large Language, powered by Large, automate cybersecurity tasks
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: Codes for the benchmark: this https URL Codes for the paper experiments: this https URL

点击查看摘要

Abstract:Generative AI agents, software systems powered by Large Language Models (LLMs), are emerging as a promising approach to automate cybersecurity tasks. Among these, penetration testing is a challenging field due to the task complexity and the diverse strategies to simulate cyber-attacks. Despite growing interest and initial studies in automating penetration testing with generative agents, there remains a significant gap in the form of a comprehensive and standard framework for their evaluation and development. This paper introduces AutoPenBench, an open benchmark for evaluating generative agents in automated penetration testing. We present a comprehensive framework that includes 33 tasks, each representing a vulnerable system that the agent has to attack. Tasks are of increasing difficulty levels, including in-vitro and real-world scenarios. We assess the agent performance with generic and specific milestones that allow us to compare results in a standardised manner and understand the limits of the agent under test. We show the benefits of AutoPenBench by testing two agent architectures: a fully autonomous and a semi-autonomous supporting human interaction. We compare their performance and limitations. For example, the fully autonomous agent performs unsatisfactorily, achieving a 21% Success Rate (SR) across the benchmark, solving 27% of the simple tasks and only one real-world task. In contrast, the assisted agent demonstrates substantial improvements, with 64% SR. AutoPenBench also allows us to observe how different LLMs like GPT-4o or OpenAI o1 impact the ability of the agents to complete the tasks. We believe that our benchmark fills the gap with a standard and flexible framework to compare penetration testing agents on a common ground. We hope to extend AutoPenBench together with the research community by making it available under this https URL.

[AI-60] ScriptViz: A Visualization Tool to Aid Scriptwriting based on a Large Movie Database

链接: https://arxiv.org/abs/2410.03224
作者: Anyi Rao,Jean-Peïc Chou,Maneesh Agrawala
关键词-EN: vivid story, create a vivid, mental visualization, large movie database, large movie
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: Accepted in the 37th Annual ACM Symposium on User Interface Software and Technology (UIST’24). Webpage: this https URL

点击查看摘要

Abstract:Scriptwriters usually rely on their mental visualization to create a vivid story by using their imagination to see, feel, and experience the scenes they are writing. Besides mental visualization, they often refer to existing images or scenes in movies and analyze the visual elements to create a certain mood or atmosphere. In this paper, we develop ScriptViz to provide external visualization based on a large movie database for the screenwriting process. It retrieves reference visuals on the fly based on scripts’ text and dialogue from a large movie database. The tool provides two types of control on visual elements that enable writers to 1) see exactly what they want with fixed visual elements and 2) see variances in uncertain elements. User evaluation among 15 scriptwriters shows that ScriptViz is able to present scriptwriters with consistent yet diverse visual possibilities, aligning closely with their scripts and helping their creation.

[AI-61] A Tutorial on the Design Experimentation and Application of Metaheuristic Algorithms to Real-World Optimization Problems

链接: https://arxiv.org/abs/2410.03205
作者: Eneko Osaba,Esther Villar-Rodriguez,Javier Del Ser,Antonio J. Nebro,Daniel Molina,Antonio LaTorre,Ponnuthurai N.Suganthan,Carlos A. Coello Coello,Francisco Herrera
关键词-EN: metaheuristic algorithms, efficient solution, metaheuristics, algorithmic design uprightness, optimization
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the last few years, the formulation of real-world optimization problems and their efficient solution via metaheuristic algorithms has been a catalyst for a myriad of research studies. In spite of decades of historical advancements on the design and use of metaheuristics, large difficulties still remain in regards to the understandability, algorithmic design uprightness, and performance verifiability of new technical achievements. A clear example stems from the scarce replicability of works dealing with metaheuristics used for optimization, which is often infeasible due to ambiguity and lack of detail in the presentation of the methods to be reproduced. Additionally, in many cases, there is a questionable statistical significance of their reported results. This work aims at providing the audience with a proposal of good practices which should be embraced when conducting studies about metaheuristics methods used for optimization in order to provide scientific rigor, value and transparency. To this end, we introduce a step by step methodology covering every research phase that should be followed when addressing this scientific field. Specifically, frequently overlooked yet crucial aspects and useful recommendations will be discussed in regards to the formulation of the problem, solution encoding, implementation of search operators, evaluation metrics, design of experiments, and considerations for real-world performance, among others. Finally, we will outline important considerations, challenges, and research directions for the success of newly developed optimization metaheuristics in their deployment and operation over real-world application environments.

[AI-62] Looking into Concept Explanation Methods for Diabetic Retinopathy Classification

链接: https://arxiv.org/abs/2410.03188
作者: Andrea M. Storås,Josefine V. Sundgaard
关键词-EN: Diabetic retinopathy, imaging is crucial, common complication, monitoring the progression, progression of retinal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) this https URL

点击查看摘要

Abstract:Diabetic retinopathy is a common complication of diabetes, and monitoring the progression of retinal abnormalities using fundus imaging is crucial. Because the images must be interpreted by a medical expert, it is infeasible to screen all individuals with diabetes for diabetic retinopathy. Deep learning has shown impressive results for automatic analysis and grading of fundus images. One drawback is, however, the lack of interpretability, which hampers the implementation of such systems in the clinic. Explainable artificial intelligence methods can be applied to explain the deep neural networks. Explanations based on concepts have shown to be intuitive for humans to understand, but have not yet been explored in detail for diabetic retinopathy grading. This work investigates and compares two concept-based explanation techniques for explaining deep neural networks developed for automatic diagnosis of diabetic retinopathy: Quantitative Testing with Concept Activation Vectors and Concept Bottleneck Models. We found that both methods have strengths and weaknesses, and choice of method should take the available data and the end user’s preferences into account.

[AI-63] EXAQ: Exponent Aware Quantization For LLMs Acceleration

链接: https://arxiv.org/abs/2410.03185
作者: Moran Shkolnik,Maxim Fishman,Brian Chmiel,Hilla Ben-Yaacov,Ron Banner,Kfir Yehuda Levy
关键词-EN: Large Language Models, Language Models, Large Language, decreasing the computational, computational and storage
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
*备注:

点击查看摘要

Abstract:Quantization has established itself as the primary approach for decreasing the computational and storage expenses associated with Large Language Models (LLMs) inference. The majority of current research emphasizes quantizing weights and activations to enable low-bit general-matrix-multiply (GEMM) operations, with the remaining non-linear operations executed at higher precision. In our study, we discovered that following the application of these techniques, the primary bottleneck in LLMs inference lies in the softmax layer. The softmax operation comprises three phases: exponent calculation, accumulation, and normalization. Our work focuses on optimizing the first two phases. We propose an analytical approach to determine the optimal clipping value for the input to the softmax function, enabling sub-4-bit quantization for LLMs inference. This method accelerates the calculations of both e^x and \sum(e^x) with minimal to no accuracy degradation. For example, in LLaMA1-30B, we achieve baseline performance with 2-bit quantization on the well-known “Physical Interaction: Question Answering” (PIQA) dataset evaluation. This ultra-low bit quantization allows, for the first time, an acceleration of approximately 4x in the accumulation phase. The combination of accelerating both e^x and \sum(e^x) results in a 36.9% acceleration in the softmax operation.
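The clip-then-quantize idea can be illustrated directly: softmax is shift-invariant, so after subtracting the max, inputs far below zero contribute almost nothing and can be clipped, which lets the remaining range be coded with very few bits and exponentiated through a tiny lookup table. The clip value and table construction below are assumptions for illustration, not EXAQ's analytically derived optimum.

```python
import numpy as np

def clipped_lowbit_softmax(x, clip=-8.0, bits=4):
    # Shift by the max (softmax is shift-invariant), then clip inputs below
    # `clip`, whose exponentials are negligible anyway.
    x = x - x.max()
    x = np.clip(x, clip, 0.0)
    levels = 2 ** bits
    # Quantize [clip, 0] to integer codes, then decode via a small e^x table.
    codes = np.round((x - clip) / (0.0 - clip) * (levels - 1)).astype(int)
    table = np.exp(clip + np.arange(levels) * (0.0 - clip) / (levels - 1))
    e = table[codes]
    return e / e.sum()

probs = clipped_lowbit_softmax(np.array([2.0, 1.0, 0.1, -20.0]))
```

Even with 4-bit codes, the result stays close to the exact softmax, because the clipped tail and the quantization steps both perturb only small probabilities.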

[AI-64] Investigating and Mitigating Object Hallucinations in Pretrained Vision-Language (CLIP) Models EMNLP2024

链接: https://arxiv.org/abs/2410.03176
作者: Yufang Liu,Tao Ji,Changzhi Sun,Yuanbin Wu,Aimin Zhou
关键词-EN: achieved impressive performance, Large Vision-Language Models, Large Vision-Language, CLIP model, impressive performance
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: EMNLP 2024

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have achieved impressive performance, yet research has pointed out a serious issue with object hallucinations within these models. However, there is no clear conclusion as to which part of the model these hallucinations originate from. In this paper, we present an in-depth investigation into the object hallucination problem specifically within the CLIP model, which serves as the backbone for many state-of-the-art vision-language systems. We unveil that even in isolation, the CLIP model is prone to object hallucinations, suggesting that the hallucination problem is not solely due to the interaction between vision and language modalities. To address this, we propose a counterfactual data augmentation method by creating negative samples with a variety of hallucination issues. We demonstrate that our method can effectively mitigate object hallucinations for the CLIP model, and we show that the enhanced model can be employed as a visual encoder, effectively alleviating the object hallucination issue in LVLMs.

[AI-65] Adaptive Masking Enhances Visual Grounding

链接: https://arxiv.org/abs/2410.03161
作者: Sen Jia,Lei Li
关键词-EN: large-scale vision-language pre-training, garnered considerable attention, largely due, garnered considerable, large-scale vision-language
类目: Artificial Intelligence (cs.AI)
*备注: Code will be available at this https URL

点击查看摘要

Abstract:In recent years, zero-shot and few-shot learning in visual grounding have garnered considerable attention, largely due to the success of large-scale vision-language pre-training on expansive datasets such as LAION-5B and DataComp-1B. However, the continuous expansion of these datasets presents significant challenges, particularly with respect to data availability and computational overhead, thus creating a bottleneck in the advancement of low-shot learning capabilities. In this paper, we propose IMAGE, Interpretative MAsking with Gaussian radiation modEling, aimed at enhancing vocabulary grounding in low-shot learning scenarios without necessitating an increase in dataset size. Drawing inspiration from cognitive science and the recent success of masked autoencoders (MAE), our method leverages adaptive masking on salient regions of the feature maps generated by the vision backbone. This enables the model to learn robust, generalized representations through the reconstruction of occluded information, thereby facilitating effective attention to both local and global features. We evaluate the efficacy of our approach on benchmark datasets, including COCO and ODinW, demonstrating its superior performance in zero-shot and few-shot tasks. Experimental results consistently show that IMAGE outperforms baseline models, achieving enhanced generalization and improved performance in low-shot scenarios. These findings highlight the potential of adaptive feature manipulation through attention mechanisms and Gaussian modeling as a promising alternative to approaches that rely on the continual scaling of dataset sizes for the advancement of zero-shot and few-shot learning. Our code is publicly available at this https URL.

[AI-66] Autoregressive Moving-average Attention Mechanism for Time Series Forecasting

链接: https://arxiv.org/abs/2410.03159
作者: Jiecheng Lu,Xu Han,Yan Sun,Shihao Yang
关键词-EN: autoregressive Transformer model, autoregressive attention mechanisms, decoder-only autoregressive Transformer, time series forecasting, enhancing their ability
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We propose an Autoregressive (AR) Moving-average (MA) attention structure that can adapt to various linear attention mechanisms, enhancing their ability to capture long-range and local temporal patterns in time series. In this paper, we first demonstrate that, for the time series forecasting (TSF) task, the previously overlooked decoder-only autoregressive Transformer model can achieve results comparable to the best baselines when appropriate tokenization and training methods are applied. Moreover, inspired by the ARMA model from statistics and recent advances in linear attention, we introduce the full ARMA structure into existing autoregressive attention mechanisms. By using an indirect MA weight generation method, we incorporate the MA term while maintaining the time complexity and parameter size of the underlying efficient attention models. We further explore how indirect parameter generation can produce implicit MA weights that align with the modeling requirements for local temporal impacts. Experimental results show that incorporating the ARMA structure consistently improves the performance of various AR attentions on TSF tasks, achieving state-of-the-art results.
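The ARMA structure the paper grafts onto attention is easiest to see in its classical form: each prediction combines the previous value (the AR term) with the previous forecast error (the MA term). A minimal ARMA(1,1) recursion, with illustrative coefficients:

```python
def arma_forecast(series, phi=0.5, theta=0.3):
    # One-step ARMA(1,1) forecasts: y_hat_t = phi * y_{t-1} + theta * e_{t-1},
    # where e_{t-1} is the previous forecast error (the moving-average term).
    preds, e_prev = [], 0.0
    for t in range(1, len(series)):
        y_hat = phi * series[t - 1] + theta * e_prev
        preds.append(y_hat)
        e_prev = series[t] - y_hat
    return preds

preds = arma_forecast([1.0, 0.8, 0.9, 0.7])
```

In the paper, an analogous MA correction is generated indirectly inside linear attention so that the extra term does not change the time complexity or parameter size of the underlying model.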

[AI-67] Mathematical Formalism for Memory Compression in Selective State Space Models

链接: https://arxiv.org/abs/2410.03158
作者: Siddhanth Bhat
关键词-EN: modelling long-range dependencies, sequence modelling, State space models, hidden state, sequence data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
*备注: 27 Pages

点击查看摘要

Abstract:State space models (SSMs) have emerged as a powerful framework for modelling long-range dependencies in sequence data. Unlike traditional recurrent neural networks (RNNs) and convolutional neural networks (CNNs), SSMs offer a structured and stable approach to sequence modelling, leveraging principles from control theory and dynamical systems. However, a key challenge in sequence modelling is compressing long-term dependencies into a compact hidden state representation without losing critical information. In this paper, we develop a rigorous mathematical framework for understanding memory compression in selective state space models. We introduce a selective gating mechanism that dynamically filters and updates the hidden state based on input relevance, allowing for efficient memory compression. We formalize the trade-off between memory efficiency and information retention using information-theoretic tools, such as mutual information and rate-distortion theory. Our analysis provides theoretical bounds on the amount of information that can be compressed without sacrificing model performance. We also derive theorems that prove the stability and convergence of the hidden state in selective SSMs, ensuring reliable long-term memory retention. Computational complexity analysis reveals that selective SSMs offer significant improvements in memory efficiency and processing speed compared to traditional RNN-based models. Through empirical validation on sequence modelling tasks such as time-series forecasting and natural language processing, we demonstrate that selective SSMs achieve state-of-the-art performance while using less memory and computational resources. 
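A toy version of the selective gating mechanism makes the compression trade-off concrete: an input-dependent gate filters what enters a fixed-size hidden state, while a stable transition matrix (spectral radius below 1) gives the bounded long-term behaviour the paper's stability theorems formalize. Names and the gating form are illustrative, not a specific published parameterization.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selective_ssm_step(h, x, A, B, Wg):
    # Input-dependent gate decides how much of x enters the compressed state.
    g = sigmoid(Wg @ x)
    return A @ h + g * (B @ x)

rng = np.random.default_rng(1)
d_h, d_x = 4, 3
A = 0.9 * np.eye(d_h)        # spectral radius < 1 keeps the state stable
B = rng.normal(size=(d_h, d_x))
Wg = rng.normal(size=(d_h, d_x))

h = np.zeros(d_h)
for _ in range(50):          # compress a 50-step history into 4 numbers
    h = selective_ssm_step(h, rng.normal(size=d_x), A, B, Wg)
```

With zero input the state decays geometrically, which is the mechanism behind the bounded-memory guarantees: irrelevant history is forgotten at a rate set by A.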

[AI-68] MELODI: Exploring Memory Compression for Long Contexts

链接: https://arxiv.org/abs/2410.03156
作者: Yinpeng Chen,DeLesley Hutchins,Aren Jansen,Andrey Zhmoginov,David Racz,Jesper Andersen
关键词-EN: efficiently process long, process long documents, memory architecture designed, present MELODI, short context windows
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We present MELODI, a novel memory architecture designed to efficiently process long documents using short context windows. The key principle behind MELODI is to represent short-term and long-term memory as a hierarchical compression scheme across both network layers and context windows. Specifically, the short-term memory is achieved through recurrent compression of context windows across multiple layers, ensuring smooth transitions between windows. In contrast, the long-term memory performs further compression within a single middle layer and aggregates information across context windows, effectively consolidating crucial information from the entire history. Compared to a strong baseline - the Memorizing Transformer employing dense attention over a large long-term memory (64K key-value pairs) - our method demonstrates superior performance on various long-context datasets while remarkably reducing the memory footprint by a factor of 8.

[AI-69] Remaining Useful Life Prediction: A Study on Multidimensional Industrial Signal Processing and Efficient Transfer Learning Based on Large Language Models

链接: https://arxiv.org/abs/2410.03134
作者: Yan Chen,Cheng Liu
关键词-EN: Remaining useful life, maintaining modern industrial, safety are paramount, RUL prediction, crucial for maintaining
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Remaining useful life (RUL) prediction is crucial for maintaining modern industrial systems, where equipment reliability and operational safety are paramount. Traditional methods, based on small-scale deep learning or physical/statistical models, often struggle with complex, multidimensional sensor data and varying operating conditions, limiting their generalization capabilities. To address these challenges, this paper introduces an innovative regression framework utilizing large language models (LLMs) for RUL prediction. By leveraging the modeling power of LLMs pre-trained on corpus data, the proposed model can effectively capture complex temporal dependencies and improve prediction accuracy. Extensive experiments on the Turbofan engine’s RUL prediction task show that the proposed model surpasses state-of-the-art (SOTA) methods on the challenging FD002 and FD004 subsets and achieves near-SOTA results on the other subsets. Notably, different from previous research, our framework uses the same sliding window length and all sensor signals for all subsets, demonstrating strong consistency and generalization. Moreover, transfer learning experiments reveal that with minimal target domain data for fine-tuning, the model outperforms SOTA methods trained on full target domain data. This research highlights the significant potential of LLMs in industrial signal processing and RUL prediction, offering a forward-looking solution for health management in future intelligent industrial systems.
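The shared-sliding-window setup the paper emphasizes reduces to a simple preprocessing step: every subset uses the same window length and all sensor channels to build (window, n_sensors) input patches with remaining-cycle targets. A hedged sketch, where the function name and the linear RUL labeling are illustrative:

```python
import numpy as np

def make_rul_windows(signals, window=5):
    # signals: (n_cycles, n_sensors) run-to-failure record; the target for
    # each window is the number of cycles remaining until failure.
    n, _ = signals.shape
    X, y = [], []
    for end in range(window, n + 1):
        X.append(signals[end - window:end])
        y.append(n - end)
    return np.stack(X), np.array(y)

sig = np.random.default_rng(0).normal(size=(20, 3))  # 20 cycles, 3 sensors
X, y = make_rul_windows(sig, window=5)
```

These patches are what a sequence model (an LLM-based regressor in the paper) consumes; keeping the window length and sensor set fixed across subsets is what allows the consistency and transfer experiments.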

[AI-70] Autoregressive Action Sequence Learning for Robotic Manipulation

链接: https://arxiv.org/abs/2410.03132
作者: Xinyu Zhang,Yuhan Liu,Haonan Chang,Liam Schramm,Abdeslam Boularias
关键词-EN: natural language processing, demonstrated remarkable success, language processing, demonstrated remarkable, remarkable success
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Autoregressive models have demonstrated remarkable success in natural language processing. In this work, we design a simple yet effective autoregressive architecture for robotic manipulation tasks. We propose the Chunking Causal Transformer (CCT), which extends the next-single-token prediction of causal transformers to support multi-token prediction in a single pass. Further, we design a novel attention interleaving strategy that allows CCT to be trained efficiently with teacher-forcing. Based on CCT, we propose the Autoregressive Policy (ARP) model, which learns to generate action sequences autoregressively. We find that action sequence learning enables better leverage of the underlying causal relationships in robotic tasks. We evaluate ARP across diverse robotic manipulation environments, including Push-T, ALOHA, and RLBench, and show that it outperforms the state-of-the-art methods in all tested environments, while being more efficient in computation and parameter sizes. Video demonstrations, our source code, and the models of ARP can be found at this http URL.
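The multi-token idea behind the Chunking Causal Transformer can be illustrated with a chunk-level causal mask: every position may attend to positions in the same or earlier chunks, so a whole chunk of actions can be produced in one pass while causality holds at chunk granularity. This is a simplified stand-in for CCT's attention-interleaving scheme, not its exact implementation.

```python
import numpy as np

def chunked_causal_mask(seq_len, chunk):
    # Position i may attend to position j iff j's chunk is not after i's.
    idx = np.arange(seq_len) // chunk        # chunk id of each position
    return idx[None, :] <= idx[:, None]      # boolean (seq_len, seq_len)

mask = chunked_causal_mask(seq_len=6, chunk=2)
```

With chunk=1 this degenerates to the standard causal mask of next-single-token prediction.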

[AI-71] AIME: AI System Optimization via Multiple LLM Evaluators

链接: https://arxiv.org/abs/2410.03131
作者: Bhrij Patel,Souradip Chakraborty,Wesley A. Suttle,Mengdi Wang,Amrit Singh Bedi,Dinesh Manocha
关键词-EN: feedback loop scheme, optimization typically involves, Text-based AI system, current output, iteration output
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 21 pages, 10 Figures, 4 Tables

点击查看摘要

Abstract:Text-based AI system optimization typically involves a feedback loop scheme where a single LLM generates an evaluation in natural language of the current output to improve the next iteration’s output. However, in this work, we empirically demonstrate that for a practical and complex task (code generation) with multiple criteria to evaluate, utilizing only one LLM evaluator tends to let errors in generated code go undetected, thus leading to incorrect evaluations and ultimately suboptimal test case performance. Motivated by this failure case, we assume there exists an optimal evaluation policy that samples an evaluation between response and ground truth. We then theoretically prove that a linear combination of multiple evaluators can approximate this optimal policy. From this insight, we propose AI system optimization via Multiple LLM Evaluators (AIME). AIME is an evaluation protocol that utilizes multiple LLMs that each independently generate an evaluation on separate criteria and then combine them via concatenation. We provide an extensive empirical study showing AIME outperforming baseline methods in code generation tasks, with up to 62% higher error detection rate and up to 16% higher success rate than a single LLM evaluation protocol on LeetCodeHard and HumanEval datasets. We also show that the selection of the number of evaluators and which criteria to utilize is non-trivial, as it can impact success rate by up to 12%.
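The protocol's core, per-criterion evaluators whose outputs are merged by concatenation (text) and a linear combination (scores), can be sketched without any LLM calls; the dict fields, example scores, and uniform default weights below are illustrative assumptions:

```python
def aime_combine(evaluations, weights=None):
    # Each evaluator judges one criterion independently; feedback texts are
    # concatenated and numeric scores merged by a linear combination
    # (uniform weights by default).
    if weights is None:
        weights = [1.0 / len(evaluations)] * len(evaluations)
    feedback = "\n".join(e["feedback"] for e in evaluations)
    score = sum(w * e["score"] for w, e in zip(weights, evaluations))
    return {"feedback": feedback, "score": score}

evals = [
    {"criterion": "correctness", "score": 0.4, "feedback": "fails edge case n=0"},
    {"criterion": "efficiency", "score": 0.9, "feedback": "O(n) as required"},
]
combined = aime_combine(evals)
```

In the full protocol each entry would come from a separate LLM prompted with a single criterion, which is what lets errors missed by one evaluator surface in another's feedback.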

[AI-72] ARB-LLM: Alternating Refined Binarizations for Large Language Models

链接: https://arxiv.org/abs/2410.03129
作者: Zhiteng Li,Xianglong Yan,Tianao Zhang,Haotong Qin,Dong Xie,Jiang Tian,zhongchao shi,Linghe Kong,Yulun Zhang,Xiaokang Yang
关键词-EN: Large Language Models, natural language processing, Large Language, hinder practical deployment, greatly pushed forward
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: The code and models will be available at this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) have greatly pushed forward advancements in natural language processing, yet their high memory and computational demands hinder practical deployment. Binarization, as an effective compression technique, can shrink model weights to just 1 bit, significantly reducing the high demands on computation and memory. However, current binarization methods struggle to narrow the distribution gap between binarized and full-precision weights, while also overlooking the column deviation in LLM weight distribution. To tackle these issues, we propose ARB-LLM, a novel 1-bit post-training quantization (PTQ) technique tailored for LLMs. To narrow the distribution shift between binarized and full-precision weights, we first design an alternating refined binarization (ARB) algorithm to progressively update the binarization parameters, which significantly reduces the quantization error. Moreover, considering the pivotal role of calibration data and the column deviation in LLM weights, we further extend ARB to ARB-X and ARB-RC. In addition, we refine the weight partition strategy with a column-group bitmap (CGB), which further enhances performance. Equipping ARB-X and ARB-RC with CGB, we obtain ARB-LLM_X and ARB-LLM_RC respectively, which significantly outperform state-of-the-art (SOTA) binarization methods for LLMs. As a binary PTQ method, our ARB-LLM_RC is the first to surpass FP16 models of the same size. The code and models will be available at this https URL.
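The alternating-refinement idea can be reduced to a toy: approximate a weight vector as alpha * B + mu with B in {-1, +1}, and cycle coordinate-descent updates of the shift, the signs, and the scale, each of which cannot increase the squared quantization error. This is a minimal sketch of the spirit of ARB, not the paper's ARB-X/ARB-RC algorithms.

```python
import numpy as np

def alternating_binarize(W, iters=5):
    # Approximate W ~= alpha * B + mu with B in {-1, +1}, refining the
    # parameters alternately; each update is the least-squares optimum
    # given the others, so the reconstruction error is non-increasing.
    mu = W.mean()
    for _ in range(iters):
        B = np.sign(W - mu)
        B[B == 0] = 1.0
        alpha = np.abs(W - mu).mean()   # least-squares scale for sign(W - mu)
        mu = (W - alpha * B).mean()     # refine the shift given alpha and B
    return alpha, B, mu

rng = np.random.default_rng(0)
W = rng.normal(size=256)
alpha, B, mu = alternating_binarize(W)
err = float(np.mean((W - (alpha * B + mu)) ** 2))
```

For roughly Gaussian weights this 1-bit reconstruction already removes most of the variance; the paper's refinements additionally handle calibration data and per-column deviations.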

[AI-73] Understanding Decision Subjects Engagement with and Perceived Fairness of AI Models When Opportunities of Qualification Improvement Exist

链接: https://arxiv.org/abs/2410.03126
作者: Meric Altug Gemalmaz,Ming Yin
关键词-EN: affects people engagement, model decision fairness, fairness affects people, decision fairness affects, model
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We explore how an AI model’s decision fairness affects people’s engagement with and perceived fairness of the model if they are subject to its decisions, but could repeatedly and strategically respond to these decisions. Two types of strategic responses are considered – people could determine whether to continue interacting with the model, and whether to invest in themselves to improve their chance of future favorable decisions from the model. Via three human-subject experiments, we found that in decision subjects’ strategic, repeated interactions with an AI model, the model’s decision fairness does not change their willingness to interact with the model or to improve themselves, even when the model exhibits unfairness on salient protected attributes. However, decision subjects still perceive the AI model to be less fair when it systematically biases against their group, especially if the difficulty of improving one’s qualification for the favorable decision is larger for the lowly-qualified people.

[AI-74] RIPPLECOT: Amplifying Ripple Effect of Knowledge Editing in Language Models via Chain-of-Thought In-Context Learning EMNLP

链接: https://arxiv.org/abs/2410.03122
作者: Zihao Zhao,Yuchen Yang,Yijiang Li,Yinzhi Cao
关键词-EN: large language models, ripple effect poses, ripple effect, poses a significant, large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: EMNLP findings

点击查看摘要

Abstract:The ripple effect poses a significant challenge in knowledge editing for large language models. Namely, when a single fact is edited, the model struggles to accurately update the related facts in a sequence, which is evaluated by multi-hop questions linked to a chain of related facts. Recent strategies have moved away from traditional parameter updates to more flexible, less computation-intensive methods, proven to be more effective in addressing the ripple effect. In-context learning (ICL) editing uses a simple demonstration Imagine that + new fact to guide LLMs, but struggles with complex multi-hop questions as the new fact alone fails to specify the chain of facts involved in such scenarios. Besides, memory-based editing maintains additional storage for all edits and related facts, requiring continuous updates to stay effective. As a result of these design limitations, the challenge remains, with the highest accuracy being only 33.8% on the MQuAKE-cf benchmarks for Vicuna-7B. To address this, we propose RippleCOT, a novel ICL editing approach integrating Chain-of-Thought (COT) reasoning. RippleCOT structures demonstrations as newfact, question, thought, answer, incorporating a thought component to identify and decompose the multi-hop logic within questions. This approach effectively guides the model through complex multi-hop questions with chains of related facts. Comprehensive experiments demonstrate that RippleCOT significantly outperforms the state-of-the-art on the ripple effect, achieving accuracy gains ranging from 7.8% to 87.1%.
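The (new fact, question, thought, answer) demonstration structure described above is easy to picture as a prompt builder. The sketch below assembles such a prompt; the exact wording and field names are illustrative guesses, not RippleCOT's actual template:

```python
def build_ripplecot_prompt(demos, new_fact, question):
    """Assemble a Chain-of-Thought in-context editing prompt in the
    (new fact, question, thought, answer) shape the paper describes.
    The template wording here is hypothetical."""
    parts = []
    for d in demos:
        parts.append(
            f"New fact: {d['fact']}\n"
            f"Question: {d['question']}\n"
            f"Thought: {d['thought']}\n"
            f"Answer: {d['answer']}\n"
        )
    # The query ends at "Thought:" so the model decomposes the multi-hop logic first.
    parts.append(f"New fact: {new_fact}\nQuestion: {question}\nThought:")
    return "\n".join(parts)

demos = [{
    "fact": "The capital of France is Lyon.",
    "question": "In which country is the capital city of France located?",
    "thought": "The new fact says the capital of France is Lyon; Lyon is in France.",
    "answer": "France",
}]
prompt = build_ripplecot_prompt(
    demos,
    "The CEO of Acme is Alice.",
    "Who leads the company that makes Acme products?",
)
print(prompt)
```

The "thought" component is what distinguishes this from the plain "Imagine that + new fact" ICL editing baseline: it spells out the chain of related facts the edit ripples through.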

[AI-75] ProcBench: Benchmark for Multi-Step Reasoning and Following Procedure

链接: https://arxiv.org/abs/2410.03117
作者: Ippei Fujisawa,Sensho Nobe,Hiroki Seto,Rina Onda,Yoshiaki Uchida,Hiroki Ikoma,Pei-Chun Chien,Ryota Kanai
关键词-EN: tasks remains limited, large language models, continue to advance, intellectual activities, remains limited
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reasoning is central to a wide range of intellectual activities, and while the capabilities of large language models (LLMs) continue to advance, their performance in reasoning tasks remains limited. The processes and mechanisms underlying reasoning are not yet fully understood, but key elements include path exploration, selection of relevant knowledge, and multi-step inference. Problems are solved through the synthesis of these components. In this paper, we propose a benchmark that focuses on a specific aspect of reasoning ability: the direct evaluation of multi-step inference. To this end, we design a special reasoning task where multi-step inference is specifically focused by largely eliminating path exploration and implicit knowledge utilization. Our dataset comprises pairs of explicit instructions and corresponding questions, where the procedures necessary for solving the questions are entirely detailed within the instructions. This setup allows models to solve problems solely by following the provided directives. By constructing problems that require varying numbers of steps to solve and evaluating responses at each step, we enable a thorough assessment of state-of-the-art LLMs’ ability to follow instructions. To ensure the robustness of our evaluation, we include multiple distinct tasks. Furthermore, by comparing accuracy across tasks, utilizing step-aware metrics, and applying separately defined measures of complexity, we conduct experiments that offer insights into the capabilities and limitations of LLMs in reasoning tasks. Our findings have significant implications for the development of LLMs and highlight areas for future research in advancing their reasoning abilities. Our dataset is available at this https URL and code at this https URL.

[AI-76] LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy

链接: https://arxiv.org/abs/2410.03111
作者: Rongzhi Zhang,Kuang Wang,Liyuan Liu,Shuohang Wang,Hao Cheng,Chao Zhang,Yelong Shen
关键词-EN: enabling faster inference, storing previously computed, autoregressive large language, serving transformer-based autoregressive, transformer-based autoregressive large
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 15 pages, 4 figures

点击查看摘要

Abstract:The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs), enabling faster inference by storing previously computed KV vectors. However, its memory consumption scales linearly with sequence length and batch size, posing a significant bottleneck in LLM deployment. Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages, which requires extensive parameter tuning thus unsuitable for pre-trained LLMs; (2) KV cache compression at test time, primarily through token eviction policies, which often overlook inter-layer dependencies and can be task-specific. This paper introduces an orthogonal approach to KV cache compression. We propose a low-rank approximation of KV weight matrices, allowing for plug-in integration with existing transformer-based LLMs without model retraining. To effectively compress KV cache at the weight level, we adjust for layerwise sensitivity and introduce a progressive compression strategy, which is supported by our theoretical analysis on how compression errors accumulate in deep networks. Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages. Extensive experiments with LLaMA models ranging from 8B to 70B parameters across various tasks show that our approach significantly reduces the GPU memory footprint while maintaining performance.
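The core operation behind weight-level KV-cache compression is a truncated SVD of the key/value projection matrices, so the cache can be stored in a rank-r latent space. The sketch below shows only that factorization step, under the assumption of a synthetic low-rank matrix; the paper's layerwise sensitivity adjustment and progressive rank schedule are not reproduced here:

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Approximate a KV projection matrix W (d_model x d_head) as A @ B
    with rank r via truncated SVD. Sketch of the general idea behind
    low-rank KV-cache compression; per-layer rank choice is out of scope."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # d_model x r (scaled left factor)
    B = Vt[:rank, :]             # r x d_head
    return A, B

rng = np.random.default_rng(1)
# Synthetic, exactly rank-16 "projection" matrix for illustration.
W = rng.normal(size=(256, 16)) @ rng.normal(size=(16, 64))
A, B = low_rank_factorize(W, rank=16)
err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(f"relative reconstruction error at rank 16: {err:.2e}")
```

At inference, caching the r-dimensional activations `x @ A` instead of the full `x @ W` is what shrinks the per-token memory from `d_head` to `r` values.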

[AI-77] MBDS: A Multi-Body Dynamics Simulation Dataset for Graph Networks Simulators

链接: https://arxiv.org/abs/2410.03107
作者: Sheng Yang,Fengge Wu,Junsuo Zhao
关键词-EN: Graph Network Simulators, modeling physical phenomena, Graph Network, Network Simulators, neural networks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Modeling the structure and events of the physical world constitutes a fundamental objective of neural networks. Among the diverse approaches, Graph Network Simulators (GNS) have emerged as the leading method for modeling physical phenomena, owing to their low computational cost and high accuracy. The datasets employed for training and evaluating physical simulation techniques are typically generated by researchers themselves, often resulting in limited data volume and quality. Consequently, this poses challenges in accurately assessing the performance of these methods. In response to this, we have constructed a high-quality physical simulation dataset encompassing 1D, 2D, and 3D scenes, along with more trajectories and time-steps compared to existing datasets. Furthermore, our work distinguishes itself by developing eight complete scenes, significantly enhancing the dataset’s comprehensiveness. A key feature of our dataset is the inclusion of precise multi-body dynamics, facilitating a more realistic simulation of the physical world. Utilizing our high-quality dataset, we conducted a systematic evaluation of various existing GNS methods. Our dataset is accessible for download at this https URL, offering a valuable resource for researchers to enhance the training and evaluation of their methodologies.

[AI-78] Mamba in Vision: A Comprehensive Survey of Techniques and Applications

链接: https://arxiv.org/abs/2410.03105
作者: Md Maklachur Rahman,Abdullah Aman Tutul,Ankur Nath,Lamyanba Laishram,Soon Ki Jung,Tracy Hammond
关键词-EN: Convolutional Neural Networks, Neural Networks, Convolutional Neural, faced by Convolutional, Vision Transformers
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Under Review

点击查看摘要

Abstract:Mamba is emerging as a novel approach to overcome the challenges faced by Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in computer vision. While CNNs excel at extracting local features, they often struggle to capture long-range dependencies without complex architectural modifications. In contrast, ViTs effectively model global relationships but suffer from high computational costs due to the quadratic complexity of their self-attention mechanisms. Mamba addresses these limitations by leveraging Selective Structured State Space Models to effectively capture long-range dependencies with linear computational complexity. This survey analyzes the unique contributions, computational benefits, and applications of Mamba models while also identifying challenges and potential future research directions. We provide a foundational resource for advancing the understanding and growth of Mamba models in computer vision. An overview of this work is available at this https URL.

[AI-79] Combing Text-based and Drag-based Editing for Precise and Flexible Image Editing

链接: https://arxiv.org/abs/2410.03097
作者: Ziqi Jiang,Zhen Wang,Long Chen
关键词-EN: computer vision, editing, remains a fundamental, fundamental challenge, challenge in computer
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 12 pages, 9 figures

点击查看摘要

Abstract:Precise and flexible image editing remains a fundamental challenge in computer vision. Based on the modified areas, most editing methods can be divided into two main types: global editing and local editing. In this paper, we choose the two most common editing approaches (i.e., text-based editing and drag-based editing) and analyze their drawbacks. Specifically, text-based methods often fail to describe the desired modifications precisely, while drag-based methods suffer from ambiguity. To address these issues, we propose CLIPDrag, a novel image editing method that is the first to combine text and drag signals for precise and ambiguity-free manipulations on diffusion models. To fully leverage these two signals, we treat text signals as global guidance and drag points as local information. Then we introduce a novel global-local motion supervision method to integrate text signals into existing drag-based methods by adapting a pre-trained language-vision model like CLIP. Furthermore, we also address the problem of slow convergence in CLIPDrag by presenting a fast point-tracking method that enforces drag points to move toward correct directions. Extensive experiments demonstrate that CLIPDrag outperforms existing single drag-based methods or text-based methods.

[AI-80] Strategic Insights from Simulation Gaming of AI Race Dynamics

链接: https://arxiv.org/abs/2410.03092
作者: Ross Gruetzemacher,Shahar Avin,James Fox,Alexander K Saeri
关键词-EN: scenario exploration exercise, Intelligence Rising, scenario exploration, exploration exercise, present insights
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 41 pages, includes executive summary. Under review for academic journal

点击查看摘要

Abstract:We present insights from “Intelligence Rising”, a scenario exploration exercise about possible AI futures. Drawing on the experiences of facilitators who have overseen 43 games over a four-year period, we illuminate recurring patterns, strategies, and decision-making processes observed during gameplay. Our analysis reveals key strategic considerations about AI development trajectories in this simulated environment, including: the destabilising effects of AI races, the crucial role of international cooperation in mitigating catastrophic risks, the challenges of aligning corporate and national interests, and the potential for rapid, transformative change in AI capabilities. We highlight places where we believe the game has been effective in exposing participants to the complexities and uncertainties inherent in AI governance. Key recurring gameplay themes include the emergence of international agreements, challenges to the robustness of such agreements, the critical role of cybersecurity in AI development, and the potential for unexpected crises to dramatically alter AI trajectories. By documenting these insights, we aim to provide valuable foresight for policymakers, industry leaders, and researchers navigating the complex landscape of AI development and governance.

[AI-81] Scaling Parameter-Constrained Language Models with Quality Data EMNLP2024

链接: https://arxiv.org/abs/2410.03083
作者: Ernie Chang,Matteo Paltenghi,Yang Li,Pin-Jie Lin,Changsheng Zhao,Patrick Huber,Zechun Liu,Rastislav Rabatin,Yangyang Shi,Vikas Chandra
关键词-EN: providing compute-optimal estimates, modeling traditionally quantify, traditionally quantify training, quantify training loss, language modeling traditionally
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted to EMNLP 2024 Industry Track, 18 pages, 9 figures, 4 tables

点击查看摘要

Abstract:Scaling laws in language modeling traditionally quantify training loss as a function of dataset size and model parameters, providing compute-optimal estimates but often neglecting the impact of data quality on model generalization. In this paper, we extend the conventional understanding of scaling law by offering a microscopic view of data quality within the original formulation – effective training tokens – which we posit to be a critical determinant of performance for parameter-constrained language models. Specifically, we formulate the proposed term of effective training tokens to be a combination of two readily-computed indicators of text: (i) text diversity and (ii) syntheticity as measured by a teacher model. We pretrained over 200 models of 25M to 1.5B parameters on a diverse set of sampled, synthetic data, and estimated the constants that relate text quality, model size, training tokens, and eight reasoning task accuracy scores. We demonstrated the estimated constants yield +0.83 Pearson correlation with true accuracies, and analyzed it in scenarios involving widely-used data techniques such as data sampling and synthesis which aim to improve data quality.

[AI-82] CommonIT: Commonality-Aware Instruction Tuning for Large Language Models via Data Partitions EMNLP2024

链接: https://arxiv.org/abs/2410.03077
作者: Jun Rao,Xuebo Liu,Lian Lian,Shengjun Cheng,Yunjie Liao,Min Zhang
关键词-EN: Large Language Models, Large Language, Language Models, Commonality-aware Instruction Tuning, instruction tuning
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted to EMNLP 2024

点击查看摘要

Abstract:With instruction tuning, Large Language Models (LLMs) can enhance their ability to adhere to commands. Diverging from most works focusing on data mixing, our study concentrates on enhancing the model’s capabilities from the perspective of data sampling during training. Drawing inspiration from the human learning process, where it is generally easier to master solutions to similar topics through focused practice on a single type of topic, we introduce a novel instruction tuning strategy termed CommonIT: Commonality-aware Instruction Tuning. Specifically, we cluster instruction datasets into distinct groups with three proposed metrics (Task, Embedding and Length). We ensure each training mini-batch, or “partition”, consists solely of data from a single group, which brings about both data randomness across mini-batches and intra-batch data similarity. Rigorous testing on LLaMa models demonstrates CommonIT’s effectiveness in enhancing the instruction-following capabilities of LLMs through IT datasets (FLAN, CoT, and Alpaca) and models (LLaMa2-7B, Qwen2-7B, LLaMa 13B, and BLOOM 7B). CommonIT consistently boosts an average improvement of 2.1% on the general domain (i.e., the average score of Knowledge, Reasoning, Multilinguality and Coding) with the Length metric, and 5.2% on the special domain (i.e., GSM, Openfunctions and Code) with the Task metric, and 3.8% on the specific tasks (i.e., MMLU) with the Embedding metric. Code is available at this https URL.
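The sampling scheme above — each mini-batch drawn from a single group, with batch order shuffled across groups — can be sketched as a simple batch builder. This is an illustrative reconstruction of the idea, not CommonIT's actual training code:

```python
import random

def commonit_batches(grouped_data, batch_size, seed=0):
    """Yield mini-batches where every batch comes from a single group,
    while the order of batches across groups is shuffled: intra-batch
    similarity plus inter-batch randomness. Hypothetical sketch."""
    rng = random.Random(seed)
    batches = []
    for group in grouped_data.values():
        items = group[:]
        rng.shuffle(items)
        for i in range(0, len(items), batch_size):
            batch = items[i:i + batch_size]
            if len(batch) == batch_size:   # drop the ragged tail for simplicity
                batches.append(batch)
    rng.shuffle(batches)                   # randomize group order across steps
    return batches

groups = {"math": list(range(10)), "code": list(range(100, 110))}
batches = commonit_batches(groups, batch_size=5)
# Every batch should come entirely from one group.
homogeneous = all(
    all(x < 100 for x in b) or all(x >= 100 for x in b) for b in batches
)
print(len(batches), homogeneous)
```

The grouping itself would come from one of the paper's three metrics (Task, Embedding, or Length); here the groups are given directly.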

[AI-83] Multi-Robot Motion Planning with Diffusion Models ICLR2025

链接: https://arxiv.org/abs/2410.03072
作者: Yorai Shaoul,Itamar Mishani,Shivam Vats,Jiaoyang Li,Maxim Likhachev
关键词-EN: complex multi-modal behaviors, learning complex multi-modal, Diffusion models, recently been successfully, successfully applied
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: The first three authors contributed equally to this work. Under review for ICLR 2025

点击查看摘要

Abstract:Diffusion models have recently been successfully applied to a wide range of robotics applications for learning complex multi-modal behaviors from data. However, prior works have mostly been confined to single-robot and small-scale environments due to the high sample complexity of learning multi-robot diffusion models. In this paper, we propose a method for generating collision-free multi-robot trajectories that conform to underlying data distributions while using only single-robot data. Our algorithm, Multi-robot Multi-model planning Diffusion (MMD), does so by combining learned diffusion models with classical search-based techniques – generating data-driven motions under collision constraints. Scaling further, we show how to compose multiple diffusion models to plan in large environments where a single diffusion model fails to generalize well. We demonstrate the effectiveness of our approach in planning for dozens of robots in a variety of simulated scenarios motivated by logistics environments. View video demonstrations in our supplementary material, and our code at: this https URL.

[AI-84] Integrating Natural Language Prompting Tasks in Introductory Programming Courses

链接: https://arxiv.org/abs/2410.03063
作者: Chris Kerslake,Paul Denny,David H Smith IV,James Prather,Juho Leinonen,Andrew Luxton-Reilly,Stephen MacNeil
关键词-EN: emphasize mastering syntax, interesting programs, emphasize mastering, basic constructs, constructs before progressing
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 7 pages, 6 figures. Accepted for publication at SIGCSE Virtual 2024

点击查看摘要

Abstract:Introductory programming courses often emphasize mastering syntax and basic constructs before progressing to more complex and interesting programs. This bottom-up approach can be frustrating for novices, shifting the focus away from problem solving and potentially making computing less appealing to a broad range of students. The rise of generative AI for code production could partially address these issues by fostering new skills via interaction with AI models, including constructing high-level prompts and evaluating code that is automatically generated. In this experience report, we explore the inclusion of two prompt-focused activities in an introductory course, implemented across four labs in a six-week module. The first requires students to solve computational problems by writing natural language prompts, emphasizing problem-solving over syntax. The second involves students crafting prompts to generate code equivalent to provided fragments, to foster an understanding of the relationship between prompts and code. Most of the students in the course had reported finding programming difficult to learn, often citing frustrations with syntax and debugging. We found that self-reported difficulty with learning programming had a strong inverse relationship with performance on traditional programming assessments such as tests and projects, as expected. However, performance on the natural language tasks was less strongly related to self-reported difficulty, suggesting they may target different skills. Learning how to communicate with AI coding models is becoming an important skill, and natural language prompting tasks may appeal to a broad range of students.

[AI-85] Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks

链接: https://arxiv.org/abs/2410.03062
作者: Grant Wardle,Teo Susnjak
关键词-EN: large language models, paper examines, large language, multi-modal prompts influences, reasoning
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper examines how the sequencing of images and text within multi-modal prompts influences the reasoning performance of large language models (LLMs). We performed empirical evaluations using three commercial LLMs. Our results demonstrate that the order in which modalities are presented can significantly affect performance, particularly in tasks of varying complexity. For simpler tasks involving a single image, modality sequencing had a clear impact on accuracy. However, in more complex tasks involving multiple images and intricate reasoning steps, the effect of sequencing diminished, likely due to the increased cognitive demands of the task. Our findings also highlight the importance of question/prompt structure. In nested and multi-step reasoning tasks, modality sequencing played a key role in shaping model performance. While LLMs excelled in the initial stages of reasoning, they struggled to re-incorporate earlier information, underscoring the challenges of multi-hop reasoning within transformer architectures. This suggests that aligning the sequence of modalities with the logical flow of reasoning steps is more critical than modality order alone. These insights offer valuable implications for improving multi-modal prompt design, with broader applications across fields such as education, medical imaging, and cross-modal learning.

[AI-86] Towards an Improved Metric for Evaluating Disentangled Representations

链接: https://arxiv.org/abs/2410.03056
作者: Sahib Julka,Yashu Wang,Michael Granitzer
关键词-EN: Disentangled representation learning, making representations controllable, representation learning plays, Disentangled representation, interpretable and transferable
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Disentangled representation learning plays a pivotal role in making representations controllable, interpretable and transferable. Despite its significance in the domain, the quest for reliable and consistent quantitative disentanglement metric remains a major challenge. This stems from the utilisation of diverse metrics measuring different properties and the potential bias introduced by their design. Our work undertakes a comprehensive examination of existing popular disentanglement evaluation metrics, comparing them in terms of measuring aspects of disentanglement (viz. Modularity, Compactness, and Explicitness), detecting the factor-code relationship, and describing the degree of disentanglement. We propose a new framework for quantifying disentanglement, introducing a metric entitled EDI, that leverages the intuitive concept of exclusivity and improved factor-code relationship to minimize ad-hoc decisions. An in-depth analysis reveals that EDI measures essential properties while offering more stability than existing metrics, advocating for its adoption as a standardised approach.

[AI-87] Permissive Information-Flow Analysis for Large Language Models

链接: https://arxiv.org/abs/2410.03055
作者: Shoaib Ahmed Siddiqui,Radhika Gaonkar,Boris Köpf,David Krueger,Andrew Paverd,Ahmed Salem,Shruti Tople,Lukas Wutschitz,Menglin Xia,Santiago Zanella-Béguelin
关键词-EN: Large Language Models, larger software systems, Large Language, rapidly becoming commodity, larger software
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 16 pages, 11 figures

点击查看摘要

Abstract:Large Language Models (LLMs) are rapidly becoming commodity components of larger software systems. This poses natural security and privacy problems: poisoned data retrieved from one component can change the model’s behavior and compromise the entire system, including coercing the model to spread confidential data to untrusted components. One promising approach is to tackle this problem at the system level via dynamic information flow (aka taint) tracking. Unfortunately, the traditional approach of propagating the most restrictive input label to the output is too conservative for applications where LLMs operate on inputs retrieved from diverse sources. In this paper, we propose a novel, more permissive approach to propagate information flow labels through LLM queries. The key idea behind our approach is to propagate only the labels of the samples that were influential in generating the model output and to eliminate the labels of unnecessary input. We implement and investigate the effectiveness of two variations of this approach, based on (i) prompt-based retrieval augmentation, and (ii) a k -nearest-neighbors language model. We compare these with the baseline of an introspection-based influence estimator that directly asks the language model to predict the output label. The results obtained highlight the superiority of our prompt-based label propagator, which improves the label in more than 85% of the cases in an LLM agent setting. These findings underscore the practicality of permissive label propagation for retrieval augmentation.
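The contrast between conservative taint tracking and the permissive propagator can be shown with a toy label lattice: instead of joining the labels of all inputs, join only the labels of inputs judged influential for the output. How influence is estimated (retrieval hits, k-NN neighbors, or introspection) is abstracted away in this hypothetical sketch:

```python
# Confidentiality levels form a tiny lattice: higher = more restrictive.
LEVELS = {"public": 0, "internal": 1, "secret": 2}

def propagate_label(inputs, influential_ids):
    """Join (max) the labels of only the inputs that influenced the model
    output, rather than all inputs. Illustrative sketch of permissive
    information-flow label propagation."""
    used = [inp for inp in inputs if inp["id"] in influential_ids]
    if not used:
        return "public"
    return max((inp["label"] for inp in used), key=LEVELS.__getitem__)

docs = [
    {"id": 1, "label": "secret"},
    {"id": 2, "label": "internal"},
    {"id": 3, "label": "public"},
]
# Classic taint tracking joins every input's label -> "secret".
conservative = propagate_label(docs, {1, 2, 3})
# Permissive propagation: only docs 2 and 3 influenced the answer.
permissive = propagate_label(docs, {2, 3})
print(conservative, permissive)  # secret internal
```

The gain is exactly this gap: outputs that never drew on the secret document no longer inherit its restrictive label.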

[AI-88] Scalable Frame-based Construction of Sociocultural NormBases for Socially-Aware Dialogues

链接: https://arxiv.org/abs/2410.03049
作者: Shilin Qu,Weiqing Wang,Xin Zhou,Haolan Zhan,Zhuang Li,Lizhen Qu,Linhao Luo,Yuan-Fang Li,Gholamreza Haffari
关键词-EN: conversational information retrieval, including conversational information, retrieval-enhanced machine learning, Sociocultural norms serve, contextual information retrieval
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 17 pages

点击查看摘要

Abstract:Sociocultural norms serve as guiding principles for personal conduct in social interactions, emphasizing respect, cooperation, and appropriate behavior, which is able to benefit tasks including conversational information retrieval, contextual information retrieval and retrieval-enhanced machine learning. We propose a scalable approach for constructing a Sociocultural Norm (SCN) Base using Large Language Models (LLMs) for socially aware dialogues. We construct a comprehensive and publicly accessible Chinese Sociocultural NormBase. Our approach utilizes socially aware dialogues, enriched with contextual frames, as the primary data source to constrain the generating process and reduce the hallucinations. This enables the extraction of high-quality and nuanced natural-language norm statements, leveraging the pragmatic implications of utterances with respect to the situation. As real dialogues annotated with gold frames are not readily available, we propose using synthetic data. Our empirical results show: (i) the quality of the SCNs derived from synthetic data is comparable to that from real dialogues annotated with gold frames, and (ii) the quality of the SCNs extracted from real data, annotated with either silver (predicted) or gold frames, surpasses that without the frame annotations. We further show the effectiveness of the extracted SCNs in a RAG-based (Retrieval-Augmented Generation) model to reason about multiple downstream dialogue tasks.

[AI-89] Revealing the Unseen: Guiding Personalized Diffusion Models to Expose Training Data

链接: https://arxiv.org/abs/2410.03039
作者: Xiaoyu Wu,Jiaru Zhang,Steven Wu
关键词-EN: capture specific styles, Diffusion Models, styles or objects, evolved into advanced, small set
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Under review

点击查看摘要

Abstract:Diffusion Models (DMs) have evolved into advanced image generation tools, especially for few-shot fine-tuning where a pretrained DM is fine-tuned on a small set of images to capture specific styles or objects. Many people upload these personalized checkpoints online, fostering communities such as Civitai and HuggingFace. However, model owners may overlook the potential risks of data leakage by releasing their fine-tuned checkpoints. Moreover, concerns regarding copyright violations arise when unauthorized data is used during fine-tuning. In this paper, we ask: “Can training data be extracted from these fine-tuned DMs shared online?” A successful extraction would present not only data leakage threats but also offer tangible evidence of copyright infringement. To answer this, we propose FineXtract, a framework for extracting fine-tuning data. Our method approximates fine-tuning as a gradual shift in the model’s learned distribution – from the original pretrained DM toward the fine-tuning data. By extrapolating the models before and after fine-tuning, we guide the generation toward high-probability regions within the fine-tuned data distribution. We then apply a clustering algorithm to extract the most probable images from those generated using this extrapolated guidance. Experiments on DMs fine-tuned with datasets such as WikiArt, DreamBooth, and real-world checkpoints posted online validate the effectiveness of our method, extracting approximately 20% of fine-tuning data in most cases, significantly surpassing baseline performance.
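The "extrapolated guidance" described above resembles classifier-free guidance applied between two checkpoints: steer sampling toward the fine-tuned distribution by amplifying the difference between the fine-tuned and pretrained models' predictions. The update rule below is an illustrative guess at that mechanism, not FineXtract's exact formulation:

```python
import numpy as np

def extrapolated_noise_pred(eps_pre, eps_ft, w):
    """Combine the pretrained and fine-tuned models' noise predictions,
    extrapolating past the fine-tuned model by guidance weight w to push
    samples toward high-probability regions of the fine-tuning data.
    Hypothetical sketch in the spirit of classifier-free guidance."""
    return eps_ft + w * (eps_ft - eps_pre)

# Toy predictions at one denoising step.
eps_pre = np.zeros(4)   # pretrained model's noise estimate
eps_ft = np.ones(4)     # fine-tuned model's noise estimate
guided = extrapolated_noise_pred(eps_pre, eps_ft, w=2.0)
print(guided)  # amplifies the fine-tuning shift: [3. 3. 3. 3.]
```

With `w = 0` this reduces to ordinary sampling from the fine-tuned model; larger `w` exaggerates whatever the fine-tuning changed, which is why the extracted images cluster around the fine-tuning data.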

[AI-90] SPINE: Online Semantic Planning for Missions with Incomplete Natural Language Specifications in Unstructured Environments

链接: https://arxiv.org/abs/2410.03035
作者: Zachary Ravichandran,Varun Murali,Mariliza Tzes,George J. Pappas,Vijay Kumar
关键词-EN: describe high-level missions, increasingly capable, describe high-level, SPINE, Large Language Models
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As robots become increasingly capable, users will want to describe high-level missions and have robots fill in the gaps. In many realistic settings, pre-built maps are difficult to obtain, so execution requires exploration and mapping that are necessary and specific to the mission. Consider an emergency response scenario where a user commands a robot, “triage impacted regions.” The robot must infer relevant semantics (victims, etc.) and exploration targets (damaged regions) based on priors or other context, then explore and refine its plan online. These missions are incompletely specified, meaning they imply subtasks and semantics. While many semantic planning methods operate online, they are typically designed for well specified tasks such as object search or exploration. Recently, Large Language Models (LLMs) have demonstrated powerful contextual reasoning over a range of robotic tasks described in natural language. However, existing LLM planners typically do not consider online planning or complex missions; rather, relevant subtasks are provided by a pre-built map or a user. We address these limitations via SPINE (online Semantic Planner for missions with Incomplete Natural language specifications in unstructured Environments). SPINE uses an LLM to reason about subtasks implied by the mission then realizes these subtasks in a receding horizon framework. Tasks are automatically validated for safety and refined online with new observations. We evaluate SPINE in simulation and real-world settings. Evaluation missions require multiple steps of semantic reasoning and exploration in cluttered outdoor environments of over 20,000 m² in area. We evaluate SPINE against competitive baselines in single-agent and air-ground teaming applications. Please find videos and software on our project page: this https URL

[AI-91] CounterQuill: Investigating the Potential of Human-AI Collaboration in Online Counterspeech Writing

链接: https://arxiv.org/abs/2410.03032
作者: Xiaohan Ding,Kaike Ping,Uma Sushmitha Gunturi,Buse Carik,Sophia Stil,Lance T Wilhelm,Taufiq Daryanto,James Hawdon,Sang Won Lee,Eugenia H Rho
关键词-EN: social media platforms, media platforms, causing harm, individuals and society, increasingly prevalent
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Online hate speech has become increasingly prevalent on social media platforms, causing harm to individuals and society. While efforts have been made to combat this issue through content moderation, the potential of user-driven counterspeech as an alternative solution remains underexplored. Existing counterspeech methods often face challenges such as fear of retaliation and skill-related barriers. To address these challenges, we introduce CounterQuill, an AI-mediated system that assists users in composing effective and empathetic counterspeech. CounterQuill provides a three-step process: (1) a learning session to help users understand hate speech and counterspeech; (2) a brainstorming session that guides users in identifying key elements of hate speech and exploring counterspeech strategies; and (3) a co-writing session that enables users to draft and refine their counterspeech with CounterQuill. We conducted a within-subjects user study with 20 participants to evaluate CounterQuill in comparison to ChatGPT. Results show that CounterQuill’s guidance and collaborative writing process provided users a stronger sense of ownership over their co-authored counterspeech. Users perceived CounterQuill as a writing partner and thus were more willing to post the co-written counterspeech online compared to the one written with ChatGPT.

[AI-92] Dynamic Sparse Training versus Dense Training: The Unexpected Winner in Image Corruption Robustness

链接: https://arxiv.org/abs/2410.03030
作者: Boqian Wu,Qiao Xiao,Shunxin Wang,Nicola Strisciuglio,Mykola Pechenizkiy,Maurice van Keulen,Decebal Constantin Mocanu,Elena Mocanu
关键词-EN: Dynamic Sparse Training, artificial neural networks, Sparse Training, Dynamic Sparse, Sparse Training opens
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:It is generally perceived that Dynamic Sparse Training opens the door to a new era of scalability and efficiency for artificial neural networks at, perhaps, some costs in accuracy performance for the classification task. At the same time, Dense Training is widely accepted as being the “de facto” approach to train artificial neural networks if one would like to maximize their robustness against image corruption. In this paper, we question this general practice. Consequently, we claim that, contrary to what is commonly thought, the Dynamic Sparse Training methods can consistently outperform Dense Training in terms of robustness accuracy, particularly if the efficiency aspect is not considered as a main objective (i.e., sparsity levels between 10% and up to 50%), without adding (or even reducing) resource cost. We validate our claim on two types of data, images and videos, using several traditional and modern deep learning architectures for computer vision and three widely studied Dynamic Sparse Training algorithms. Our findings reveal a new yet-unknown benefit of Dynamic Sparse Training and open new possibilities in improving deep learning robustness beyond the current state of the art.

[AI-93] Flow Matching with Gaussian Process Priors for Probabilistic Time Series Forecasting

链接: https://arxiv.org/abs/2410.03024
作者: Marcel Kollovieh,Marten Lienen,David Lüdke,Leo Schwinn,Stephan Günnemann
关键词-EN: Recent advancements, time series modeling, opened new directions, series modeling, generative modeling
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Recent advancements in generative modeling, particularly diffusion models, have opened new directions for time series modeling, achieving state-of-the-art performance in forecasting and synthesis. However, the reliance of diffusion-based models on a simple, fixed prior complicates the generative process since the data and prior distributions differ significantly. We introduce TSFlow, a conditional flow matching (CFM) model for time series that simplifies the generative problem by combining Gaussian processes, optimal transport paths, and data-dependent prior distributions. By incorporating (conditional) Gaussian processes, TSFlow aligns the prior distribution more closely with the temporal structure of the data, enhancing both unconditional and conditional generation. Furthermore, we propose conditional prior sampling to enable probabilistic forecasting with an unconditionally trained model. In our experimental evaluation on eight real-world datasets, we demonstrate the generative capabilities of TSFlow, producing high-quality unconditional samples. Finally, we show that both conditionally and unconditionally trained models achieve competitive results in forecasting benchmarks, surpassing other methods on 6 out of 8 datasets.
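TSFlow的核心是用数据相关的高斯过程先验替代标准白噪声先验来做条件流匹配(CFM)。下面是一个最小化的示意代码(非论文官方实现):在线性最优传输路径 x_t = (1-t)·x0 + t·x1 上构造一个训练样本,其中先验样本 x0 取自一个 RBF 核的高斯过程(核的选择仅作演示;论文中先验与数据的时间结构对齐)。

```python
import numpy as np

rng = np.random.default_rng(0)

def gp_prior_sample(T, length_scale=5.0):
    # Sample a Gaussian-process prior path over T time steps.
    # The RBF kernel here is an illustrative choice; the paper aligns the
    # prior with the temporal structure of the data.
    t = np.arange(T, dtype=float)[:, None]
    K = np.exp(-0.5 * (t - t.T) ** 2 / length_scale ** 2) + 1e-6 * np.eye(T)
    return np.linalg.cholesky(K) @ rng.standard_normal(T)

def cfm_training_pair(x1, length_scale=5.0):
    """One flow-matching training example on the linear optimal-transport
    path x_t = (1 - t) * x0 + t * x1, with regression target v = x1 - x0,
    where the prior sample x0 comes from the GP rather than white noise."""
    x0 = gp_prior_sample(len(x1), length_scale)
    t = rng.uniform()
    xt = (1 - t) * x0 + t * x1
    return t, xt, x1 - x0
```

训练时,网络以 (t, x_t) 为输入回归目标速度 v;由于路径是线性的,恒有 x_t + (1-t)·v = x1。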

[AI-94] Is Your Paper Being Reviewed by an LLM? Investigating AI Text Detectability in Peer Review

链接: https://arxiv.org/abs/2410.03019
作者: Sungduk Yu,Man Luo,Avinash Madasu,Vasudev Lal,Phillip Howard
关键词-EN: published scientific research, peer review process, scientific research, ensuring the integrity, integrity of published
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Peer review is a critical process for ensuring the integrity of published scientific research. Confidence in this process is predicated on the assumption that experts in the relevant domain give careful consideration to the merits of manuscripts which are submitted for publication. With the recent rapid advancements in the linguistic capabilities of large language models (LLMs), a new potential risk to the peer review process is that negligent reviewers will rely on LLMs to perform the often time consuming process of reviewing a paper. In this study, we investigate the ability of existing AI text detection algorithms to distinguish between peer reviews written by humans and different state-of-the-art LLMs. Our analysis shows that existing approaches fail to identify many GPT-4o written reviews without also producing a high number of false positive classifications. To address this deficiency, we propose a new detection approach which surpasses existing methods in the identification of GPT-4o written peer reviews at low levels of false positive classifications. Our work reveals the difficulty of accurately identifying AI-generated text at the individual review level, highlighting the urgent need for new tools and methods to detect this type of unethical application of generative AI.

[AI-95] Transforming Teachers' Roles and Agencies in the Era of Generative AI: Perceptions, Acceptance, Knowledge, and Practices

链接: https://arxiv.org/abs/2410.03018
作者: Xiaoming Zhai
关键词-EN: Generative Artificial Intelligence, Artificial Intelligence, Generative Artificial, impact of Generative, addresses teachers’ perceptions
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper explores the transformative impact of Generative Artificial Intelligence (GenAI) on teachers’ roles and agencies in education, presenting a comprehensive framework that addresses teachers’ perceptions, knowledge, acceptance, and practices of GenAI. As GenAI technologies, such as ChatGPT, become increasingly integrated into educational settings, teachers are required to adapt to evolving classroom dynamics, where AI plays a significant role in content creation, personalized learning, and student engagement. However, existing literature often treats these factors in isolation, overlooking how they collectively influence teachers’ ability to effectively integrate GenAI into their pedagogical practices. This paper fills this gap by proposing a framework that categorizes teachers into four roles – Observer, Adopter, Collaborator, and Innovator – each representing different levels of GenAI engagement, outlining teachers’ agencies in GenAI classrooms. By highlighting the need for continuous professional development and institutional support, we demonstrate how teachers can evolve from basic GenAI users to co-creators of knowledge alongside GenAI systems. The findings emphasize that for GenAI to reach its full educational potential, teachers must not only accept and understand its capabilities but also integrate it deeply into their teaching strategies. This study contributes to the growing literature on GenAI in education, offering practical implications for supporting teachers in navigating the complexities of GenAI adoption.

[AI-96] Task-unaware Lifelong Robot Learning with Retrieval-based Weighted Local Adaptation

链接: https://arxiv.org/abs/2410.02995
作者: Pengzhi Yang,Xinyu Wang,Ruipeng Zhang,Cong Wang,Frans Oliehoek,Jens Kober
关键词-EN: Real-world environments require, defined task boundaries, previously learned abilities, retaining previously learned, Real-world environments
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Real-world environments require robots to continuously acquire new skills while retaining previously learned abilities, all without the need for clearly defined task boundaries. Storing all past data to prevent forgetting is impractical due to storage and privacy concerns. To address this, we propose a method that efficiently restores a robot’s proficiency in previously learned tasks over its lifespan. Using an Episodic Memory (EM), our approach enables experience replay during training and retrieval during testing for local fine-tuning, allowing rapid adaptation to previously encountered problems without explicit task identifiers. Additionally, we introduce a selective weighting mechanism that emphasizes the most challenging segments of retrieved demonstrations, focusing local adaptation where it is most needed. This framework offers a scalable solution for lifelong learning in dynamic, task-unaware environments, combining retrieval-based adaptation with selective weighting to enhance robot performance in open-ended scenarios.
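摘要中的两个组件——情景记忆(EM)检索与“选择性加权”——可以用如下最小示意来理解(纯属演示性实现,具体的键表示与加权规则在摘要中并未给出):按观测嵌入的最近邻检索历史示范,并用对各片段损失的 softmax 把权重集中在最难的片段上。

```python
import numpy as np

class EpisodicMemory:
    """Minimal sketch: store demonstrations keyed by an observation
    embedding, retrieve the nearest ones at test time for local adaptation."""
    def __init__(self):
        self.keys, self.demos = [], []

    def add(self, key, demo):
        self.keys.append(np.asarray(key, dtype=float))
        self.demos.append(demo)

    def retrieve(self, query, k=1):
        q = np.asarray(query, dtype=float)
        dists = [np.linalg.norm(q - key) for key in self.keys]
        return [self.demos[i] for i in np.argsort(dists)[:k]]

def segment_weights(segment_losses, temperature=1.0):
    # Selective weighting: a softmax over per-segment losses puts most
    # weight on the hardest segments of a retrieved demonstration.
    # This softmax form is a hypothetical instantiation, not the paper's rule.
    z = np.asarray(segment_losses, dtype=float) / temperature
    z = z - z.max()
    w = np.exp(z)
    return w / w.sum()
```

检索到的示范连同片段权重即可用于对当前策略做加权的局部微调。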

[AI-97] Guided Stream of Search: Learning to Better Search with Language Models via Optimal Path Guidance

链接: https://arxiv.org/abs/2410.02992
作者: Seungyong Moon,Bumsoo Park,Hyun Oh Song
关键词-EN: demonstrated impressive capabilities, require complex planning, language models, optimal solutions, demonstrated impressive
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:While language models have demonstrated impressive capabilities across a range of tasks, they still struggle with tasks that require complex planning and reasoning. Recent studies have proposed training language models on search processes rather than optimal solutions, resulting in better generalization performance even though search processes are noisy and even suboptimal. However, these studies overlook the value of optimal solutions, which can serve as step-by-step landmarks to guide more effective search. In this work, we explore how to leverage optimal solutions to enhance the search and planning abilities of language models. To this end, we propose guided stream of search (GSoS), which seamlessly incorporates optimal solutions into the self-generation process in a progressive manner, producing high-quality search trajectories. These trajectories are then distilled into the pre-trained model via supervised fine-tuning. Our approach significantly enhances the search and planning abilities of language models on Countdown, a simple yet challenging mathematical reasoning task. Notably, combining our method with RL fine-tuning yields further improvements, whereas previous supervised fine-tuning methods do not benefit from RL. Furthermore, our approach exhibits greater effectiveness than leveraging optimal solutions in the form of subgoal rewards.

[AI-98] Differentiation and Specialization of Attention Heads via the Refined Local Learning Coefficient

链接: https://arxiv.org/abs/2410.02984
作者: George Wang,Jesse Hoogland,Stan van Wingerden,Zach Furman,Daniel Murfet
关键词-EN: Local Learning Coefficient, introduce refined variants, singular learning theory, model complexity grounded, transformer language models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We introduce refined variants of the Local Learning Coefficient (LLC), a measure of model complexity grounded in singular learning theory, to study the development of internal structure in transformer language models during training. By applying these refined LLCs (rLLCs) to individual components of a two-layer attention-only transformer, we gain novel insights into the progressive differentiation and specialization of attention heads. Our methodology reveals how attention heads differentiate into distinct functional roles over the course of training, analyzes the types of data these heads specialize to process, and discovers a previously unidentified multigram circuit. These findings demonstrate that rLLCs provide a principled, quantitative toolkit for developmental interpretability, which aims to understand models through their evolution across the learning process. More broadly, this work takes a step towards establishing the correspondence between data distributional structure, geometric properties of the loss landscape, learning dynamics, and emergent computational structures in neural networks.

[AI-99] An explainable approach to detect case law on housing and eviction issues within the HUDOC database

链接: https://arxiv.org/abs/2410.02978
作者: Mohammad Mohammadi,Martijn Wieling,Michel Vols
关键词-EN: Case law, understanding of human, Court of Human, instrumental in shaping, shaping our understanding
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Case law is instrumental in shaping our understanding of human rights, including the right to adequate housing. The HUDOC database provides access to the textual content of case law from the European Court of Human Rights (ECtHR), along with some metadata. While this metadata includes valuable information, such as the application number and the articles addressed in a case, it often lacks detailed substantive insights, such as the specific issues a case covers. This underscores the need for detailed analysis to extract such information. However, given the size of the database - containing over 40,000 cases - an automated solution is essential. In this study, we focus on the right to adequate housing and aim to build models to detect cases related to housing and eviction issues. Our experiments show that the resulting models not only provide performance comparable to more sophisticated approaches but are also interpretable, offering explanations for their decisions by highlighting the most influential words. The application of these models led to the identification of new cases that were initially overlooked during data collection. This suggests that NLP approaches can be effectively applied to categorise case law based on the specific issues they address.

[AI-100] Harm Ratio: A Novel and Versatile Fairness Criterion

链接: https://arxiv.org/abs/2410.02977
作者: Soroush Ebadian,Rupert Freeman,Nisarg Shah
关键词-EN: fair division research, division research, fairness, collective decision-making, individual
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
*备注: To appear at EAAMO 2024

点击查看摘要

Abstract:Envy-freeness has become the cornerstone of fair division research. In settings where each individual is allocated a disjoint share of collective resources, it is a compelling fairness axiom which demands that no individual strictly prefer the allocation of another individual to their own. Unfortunately, in many real-life collective decision-making problems, the goal is to choose a (common) public outcome that is equally applicable to all individuals, and the notion of envy becomes vacuous. Consequently, this literature has avoided studying fairness criteria that focus on individuals feeling a sense of jealousy or resentment towards other individuals (rather than towards the system), missing out on a key aspect of fairness. In this work, we propose a novel fairness criterion, individual harm ratio, which is inspired by envy-freeness but applies to a broad range of collective decision-making settings. Theoretically, we identify minimal conditions under which this criterion and its groupwise extensions can be guaranteed, and study the computational complexity of related problems. Empirically, we conduct experiments with real data to show that our fairness criterion is powerful enough to differentiate between prominent decision-making algorithms for a range of tasks from voting and fair division to participatory budgeting and peer review.

[AI-101] F-Fidelity: A Robust Framework for Faithfulness Evaluation of Explainable AI

链接: https://arxiv.org/abs/2410.02970
作者: Xu Zheng,Farhad Shirani,Zhuomin Chen,Chaohao Lin,Wei Cheng,Wenbo Guo,Dongsheng Luo
关键词-EN: XAI methods remains, XAI methods, XAI, research has developed, developed a number
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Preprint; 26 pages, 4 figures

点击查看摘要

Abstract:Recent research has developed a number of eXplainable AI (XAI) techniques. Although extracting meaningful insights from deep learning models, how to properly evaluate these XAI methods remains an open problem. The most widely used approach is to perturb or even remove what the XAI method considers to be the most important features in an input and observe the changes in the output prediction. This approach although efficient suffers the Out-of-Distribution (OOD) problem as the perturbed samples may no longer follow the original data distribution. A recent method RemOve And Retrain (ROAR) solves the OOD issue by retraining the model with perturbed samples guided by explanations. However, the training may not always converge given the distribution difference. Furthermore, using the model retrained based on XAI methods to evaluate these explainers may cause information leakage and thus lead to unfair comparisons. We propose Fine-tuned Fidelity F-Fidelity, a robust evaluation framework for XAI, which utilizes i) an explanation-agnostic fine-tuning strategy, thus mitigating the information leakage issue and ii) a random masking operation that ensures that the removal step does not generate an OOD input. We designed controlled experiments with state-of-the-art (SOTA) explainers and their degraded version to verify the correctness of our framework. We conducted experiments on multiple data structures, such as images, time series, and natural language. The results demonstrate that F-Fidelity significantly improves upon prior evaluation metrics in recovering the ground-truth ranking of the explainers. Furthermore, we show both theoretically and empirically that, given a faithful explainer, F-Fidelity metric can be used to compute the sparsity of influential input components, i.e., to extract the true explanation size.
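F-Fidelity 中“随机掩码保证扰动不产生 OOD 输入”这一步可以用如下玩具代码示意(仅为演示;`model`、`importance` 等参数名与线性模型均为假设,微调步骤此处省略):对解释器标出的重要特征随机掩码一个子集,并测量输出的平均下降量。

```python
import numpy as np

def f_fidelity_sketch(model, x, importance, mask_frac=0.3, n_samples=20, rng=None):
    """Sketch of the masking step only: randomly mask subsets of a fixed
    fraction of the features flagged important, so the perturbed inputs
    match the masking distribution a fine-tuned model was exposed to.
    A larger mean output drop suggests a more faithful explainer."""
    rng = rng or np.random.default_rng(0)
    base = model(x)
    top = np.argsort(-importance)[: int(len(x) * mask_frac)]
    drops = []
    for _ in range(n_samples):
        keep = rng.random(len(top)) < 0.5   # random subset of important features
        xm = x.copy()
        xm[top[keep]] = 0.0                 # mask by zeroing (illustrative choice)
        drops.append(base - model(xm))
    return float(np.mean(drops))
```

对一个输出为特征求和的线性“模型”,以特征值本身作为重要性分数时,掩码重要特征必然使输出下降,得分非负。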

[AI-102] Label-Free Subjective Player Experience Modelling via Let's Play Videos AAAI

链接: https://arxiv.org/abs/2410.02967
作者: Dave Goel,Athar Mahmoudi-Nejad,Matthew Guzdial
关键词-EN: Player Experience Modelling, Player Experience, Experience Modelling, techniques applied, Modelling
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 9 pages, 3 figures, AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment

点击查看摘要

Abstract:Player Experience Modelling (PEM) is the study of AI techniques applied to modelling a player’s experience within a video game. PEM development can be labour-intensive, requiring expert hand-authoring or specialized data collection. In this work, we propose a novel PEM development approach, approximating player experience from gameplay video. We evaluate this approach by predicting affect in the game Angry Birds via a human subject study. We validate that our PEM can strongly correlate with self-reported and sensor measures of affect, demonstrating the potential of this approach.

[AI-103] AutoML-Agent : A Multi-Agent LLM Framework for Full-Pipeline AutoML

链接: https://arxiv.org/abs/2410.02958
作者: Patara Trirat,Wonyong Jeong,Sung Ju Hwang
关键词-EN: Automated machine learning, Automated machine, machine learning, hyperparameter tuning, Automated
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
*备注: 47 pages, 5 figures

点击查看摘要

Abstract:Automated machine learning (AutoML) accelerates AI development by automating tasks in the development pipeline, such as optimal model search and hyperparameter tuning. Existing AutoML systems often require technical expertise to set up complex tools, which is in general time-consuming and requires a large amount of human effort. Therefore, recent works have started exploiting large language models (LLM) to lessen such burden and increase the usability of AutoML frameworks via a natural language interface, allowing non-expert users to build their data-driven solutions. These methods, however, are usually designed only for a particular process in the AI development pipeline and do not efficiently use the inherent capacity of the LLMs. This paper proposes AutoML-Agent, a novel multi-agent framework tailored for full-pipeline AutoML, i.e., from data retrieval to model deployment. AutoML-Agent takes user’s task descriptions, facilitates collaboration between specialized LLM agents, and delivers deployment-ready models. Unlike existing work, instead of devising a single plan, we introduce a retrieval-augmented planning strategy to enhance exploration to search for more optimal plans. We also decompose each plan into sub-tasks (e.g., data preprocessing and neural network design) each of which is solved by a specialized agent we build via prompting executing in parallel, making the search process more efficient. Moreover, we propose a multi-stage verification to verify executed results and guide the code generation LLM in implementing successful solutions. Extensive experiments on seven downstream tasks using fourteen datasets show that AutoML-Agent achieves a higher success rate in automating the full AutoML process, yielding systems with good performance throughout the diverse domains.

[AI-104] AiBAT: Artificial Intelligence/Instructions for Build Assembly and Test

链接: https://arxiv.org/abs/2410.02955
作者: Benjamin Nuernberger,Anny Liu,Heather Stefanini,Richard Otis,Amanda Towler,R. Peter Dillon
关键词-EN: Instructions for Build, conducted on hardware, including tests, operation is conducted, IBAT instructions
类目: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
*备注: 9 pages, 6 figures, 2 tables

点击查看摘要

Abstract:Instructions for Build, Assembly, and Test (IBAT) refers to the process used whenever any operation is conducted on hardware, including tests, assembly, and maintenance. Currently, the generation of IBAT documents is time-intensive, as users must manually reference and transfer information from engineering diagrams and parts lists into IBAT instructions. With advances in machine learning and computer vision, however, it is possible to have an artificial intelligence (AI) model perform the partial filling of the IBAT template, freeing up engineer time for more highly skilled tasks. AiBAT is a novel system for assisting users in authoring IBATs. It works by first analyzing assembly drawing documents, extracting information and parsing it, and then filling in IBAT templates with the extracted information. Such assisted authoring has potential to save time and reduce cost. This paper presents an overview of the AiBAT system, including promising preliminary results and discussion on future work.

[AI-105] Visual Editing with LLM-based Tool Chaining: An Efficient Distillation Approach for Real-Time Applications EMNLP2024

链接: https://arxiv.org/abs/2410.02952
作者: Oren Sultan,Alex Khasin,Guy Shiran,Asnat Greenstein-Messica,Dafna Shahaf
关键词-EN: practical distillation approach, present a practical, practical distillation, real-time applications, invoking tools
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: EMNLP 2024

点击查看摘要

Abstract:We present a practical distillation approach to fine-tune LLMs for invoking tools in real-time applications. We focus on visual editing tasks; specifically, we modify images and videos by interpreting user stylistic requests, specified in natural language (“golden hour”), using an LLM to select the appropriate tools and their parameters to achieve the desired visual effect. We found that proprietary LLMs such as GPT-3.5-Turbo show potential in this task, but their high cost and latency make them unsuitable for real-time applications. In our approach, we fine-tune a (smaller) student LLM with guidance from a (larger) teacher LLM and behavioral signals. We introduce offline metrics to evaluate student LLMs. Both online and offline experiments show that our student models manage to match the performance of our teacher model (GPT-3.5-Turbo), significantly reducing costs and latency. Lastly, we show that fine-tuning was improved by 25% in low-data regimes using augmentation.

[AI-106] LLMCO2: Advancing Accurate Carbon Footprint Prediction for LLM Inferences

链接: https://arxiv.org/abs/2410.02950
作者: Zhenxiao Fu,Fan Chen,Shan Zhou,Haitong Li,Lei Jiang
关键词-EN: substantially larger carbon, LLM inference carbon, large language model, LLM inference requests, LLM inference
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
*备注: 9 pages, 11 figures

点击查看摘要

Abstract:Throughout its lifecycle, a large language model (LLM) generates a substantially larger carbon footprint during inference than training. LLM inference requests vary in batch size, prompt length, and token generation number, while cloud providers employ different GPU types and quantities to meet diverse service-level objectives for accuracy and latency. It is crucial for both users and cloud providers to have a tool that quickly and accurately estimates the carbon impact of LLM inferences based on a combination of inference request and hardware configurations before execution. Estimating the carbon footprint of LLM inferences is more complex than training due to lower and highly variable model FLOPS utilization, rendering previous equation-based models inaccurate. Additionally, existing machine learning (ML) prediction methods either lack accuracy or demand extensive training data, as they inadequately handle the distinct prefill and decode phases, overlook hardware-specific features, and inefficiently sample uncommon inference configurations. We introduce LLMCO2, a graph neural network (GNN)-based model that greatly improves the accuracy of LLM inference carbon footprint predictions compared to previous methods.
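作为背景,摘要所批评的“基于公式”的基线估计大致如下(示意代码;参数名与默认碳强度均为假设值):用峰值算力乘以利用率估算运行时间,再折算能耗与碳排放。摘要指出推理阶段的模型算力利用率(MFU)低且波动大,正是这类估计不准的根源。

```python
def inference_carbon_grams(flops, gpu_peak_flops, mfu, gpu_power_w,
                           carbon_intensity_g_per_kwh=400.0):
    """Baseline equation-based estimate of the kind the abstract argues is
    inaccurate for inference: model-FLOPS utilization (`mfu`) is low and
    highly variable across prefill and decode. All parameter names and the
    default grid carbon intensity are illustrative assumptions."""
    runtime_s = flops / (gpu_peak_flops * mfu)          # estimated wall time
    energy_kwh = gpu_power_w * runtime_s / 3.6e6        # W*s -> kWh
    return energy_kwh * carbon_intensity_g_per_kwh      # grams of CO2
```

论文的 GNN 模型正是为了替代上式中对 `mfu` 的固定假设而设计的。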

[AI-107] SymmetricDiffusers: Learning Discrete Diffusion on Finite Symmetric Groups

链接: https://arxiv.org/abs/2410.02942
作者: Yongxing Zhang,Donglin Yang,Renjie Liao
关键词-EN: Finite symmetric groups, essential in fields, Finite symmetric, symmetric groups, distribution
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Finite symmetric groups S_n are essential in fields such as combinatorics, physics, and chemistry. However, learning a probability distribution over S_n poses significant challenges due to its intractable size and discrete nature. In this paper, we introduce SymmetricDiffusers, a novel discrete diffusion model that simplifies the task of learning a complicated distribution over S_n by decomposing it into learning simpler transitions of the reverse diffusion using deep neural networks. We identify the riffle shuffle as an effective forward transition and provide empirical guidelines for selecting the diffusion length based on the theory of random walks on finite groups. Additionally, we propose a generalized Plackett-Luce (PL) distribution for the reverse transition, which is provably more expressive than the PL distribution. We further introduce a theoretically grounded “denoising schedule” to improve sampling and learning efficiency. Extensive experiments show that our model achieves state-of-the-art or comparable performances on solving tasks including sorting 4-digit MNIST images, jigsaw puzzles, and traveling salesman problems. Our code is released at this https URL.
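摘要中作为前向转移的 riffle shuffle 即经典的 Gilbert-Shannon-Reeds(GSR)洗牌模型,可用如下示意代码实现(与论文代码无关):按 Binomial(n, 1/2) 切牌,然后按两叠剩余张数成比例的概率交错落牌。

```python
import random

def riffle_shuffle(seq, rng=random):
    """One Gilbert-Shannon-Reeds riffle-shuffle step: cut the deck at a
    Binomial(n, 1/2) point, then interleave, dropping from each half with
    probability proportional to its remaining size."""
    n = len(seq)
    cut = sum(rng.random() < 0.5 for _ in range(n))   # Binomial(n, 1/2) cut
    left, right = list(seq[:cut]), list(seq[cut:])
    out = []
    while left and right:
        if rng.random() < len(left) / (len(left) + len(right)):
            out.append(left.pop(0))
        else:
            out.append(right.pop(0))
    return out + left + right                          # append leftover pile
```

反复施加该转移即得到 S_n 上趋于均匀分布的随机游走;论文据此(结合有限群上随机游走的理论)选择扩散长度。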

[AI-108] Intrinsic Evaluation of RAG Systems for Deep-Logic Questions

链接: https://arxiv.org/abs/2410.02932
作者: Junyi Hu,You Zhou,Jie Wang
关键词-EN: evaluate retrieval-augmented generation, involving deep-logic queries, applications involving deep-logic, BERT embedding similarity, Logical-Relation Correctness Ratio
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We introduce the Overall Performance Index (OPI), an intrinsic metric to evaluate retrieval-augmented generation (RAG) mechanisms for applications involving deep-logic queries. OPI is computed as the harmonic mean of two key metrics: the Logical-Relation Correctness Ratio and the average of BERT embedding similarity scores between ground-truth and generated answers. We apply OPI to assess the performance of LangChain, a popular RAG tool, using a logical relations classifier fine-tuned from GPT-4o on the RAG-Dataset-12000 from Hugging Face. Our findings show a strong correlation between BERT embedding similarity scores and extrinsic evaluation scores. Among the commonly used retrievers, the cosine similarity retriever using BERT-based embeddings outperforms others, while the Euclidean distance-based retriever exhibits the weakest performance. Furthermore, we demonstrate that combining multiple retrievers, either algorithmically or by merging retrieved sentences, yields superior performance compared to using any single retriever alone.
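按摘要的定义,OPI 是两个指标的调和平均,可直接写成如下小函数(两个输入分数如何计算由论文给出,此处仅示意合成步骤):

```python
def overall_performance_index(correctness_ratio, mean_bert_similarity):
    """OPI as described in the abstract: the harmonic mean of the
    Logical-Relation Correctness Ratio and the average BERT embedding
    similarity between ground-truth and generated answers."""
    total = correctness_ratio + mean_bert_similarity
    if total == 0:
        return 0.0
    return 2.0 * correctness_ratio * mean_bert_similarity / total
```

调和平均的特点是任一分量接近 0 时整体分数也接近 0,因此 RAG 系统必须同时做好逻辑关系正确性与答案语义相似度才能得到高 OPI。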

[AI-109] Deep image-based Adaptive BRDF Measure

链接: https://arxiv.org/abs/2410.02917
作者: Wen Cao
关键词-EN: accurate sensor simulation, physically accurate sensor, reflectance distribution function, bi-directional reflectance distribution, Efficient and accurate
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI)
*备注: 9

点击查看摘要

Abstract:Efficient and accurate measurement of the bi-directional reflectance distribution function (BRDF) plays a key role in high quality image rendering and physically accurate sensor simulation. However, obtaining the reflectance properties of a material is both time-consuming and challenging. This paper presents a novel method for minimizing the number of samples required for high quality BRDF capture using a gonio-reflectometer setup. Taking an image of the physical material sample as input, a lightweight neural network first estimates the parameters of an analytic BRDF model and the distribution of the sample locations. In a second step, we use an image-based loss to determine the number of samples needed to meet the required accuracy. This approach significantly accelerates the measurement process while maintaining a high level of accuracy and fidelity in the BRDF representation.

[AI-110] Safeguard is a Double-edged Sword: Denial-of-service Attack on Large Language Models

链接: https://arxiv.org/abs/2410.02916
作者: Qingzhao Zhang,Ziyang Xiong,Z. Morley Mao
关键词-EN: large language models, open deployment, paramount concern, concern of large, large language
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Safety is a paramount concern of large language models (LLMs) in their open deployment. To this end, safeguard methods aim to enforce the ethical and responsible use of LLMs through safety alignment or guardrail mechanisms. However, we found that malicious attackers could exploit false positives of safeguards, i.e., fool the safeguard model into mistakenly blocking safe content, leading to a new denial-of-service (DoS) attack on LLMs. Specifically, through software or phishing attacks on user client software, attackers insert a short, seemingly innocuous adversarial prompt into user prompt templates in configuration files; thus, this prompt appears in final user requests without visibility in the user interface and is not trivial to identify. By designing an optimization process that utilizes gradient and attention information, our attack can automatically generate seemingly safe adversarial prompts, approximately only 30 characters long, that universally block over 97% of user requests on Llama Guard 3. The attack presents a new dimension of evaluating LLM safeguards focusing on false positives, fundamentally different from classic jailbreaks.

[AI-111] Streamlining Conformal Information Retrieval via Score Refinement

链接: https://arxiv.org/abs/2410.02914
作者: Yotam Intrator,Ori Kelner,Regev Cohen,Roman Goldenberg,Ehud Rivlin,Daniel Freedman
关键词-EN: retrieval augmented generation, lack statistical guarantees, augmented generation, fundamental to modern, modern applications
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 6 pages

点击查看摘要

Abstract:Information retrieval (IR) methods, like retrieval augmented generation, are fundamental to modern applications but often lack statistical guarantees. Conformal prediction addresses this by retrieving sets guaranteed to include relevant information, yet existing approaches produce large-sized sets, incurring high computational costs and slow response times. In this work, we introduce a score refinement method that applies a simple monotone transformation to retrieval scores, leading to significantly smaller conformal sets while maintaining their statistical guarantees. Experiments on various BEIR benchmarks validate the effectiveness of our approach in producing compact sets containing relevant information.
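As a rough illustration of the idea, the sketch below applies a monotone transformation to raw retrieval scores and thresholds them conformal-style. The sigmoid map and the quantile rule are our stand-ins for illustration, not the paper's actual refinement:

```python
import math

def refine(score):
    """Monotone transformation of a raw retrieval score (sigmoid used here
    as a placeholder; any strictly increasing map preserves the ranking)."""
    return 1.0 / (1.0 + math.exp(-score))

def calibrate_threshold(relevant_scores, alpha=0.1):
    """Split-conformal style: choose a threshold that keeps at least
    (1 - alpha) of the relevant calibration items above it."""
    refined = sorted(refine(s) for s in relevant_scores)
    k = int(math.floor(alpha * (len(refined) + 1)))   # items we may drop
    return refined[min(k, len(refined) - 1)]

def conformal_set(query_scores, threshold):
    """Indices of retrieved passages whose refined score clears the threshold."""
    return [i for i, s in enumerate(query_scores) if refine(s) >= threshold]
```

A sharper refinement concentrates the score mass, so the same coverage guarantee is met with fewer retrieved passages, which is the efficiency gain the abstract describes.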

[AI-112] Fine-Tuning Language Models with Differential Privacy through Adaptive Noise Allocation EMNLP2024

链接: https://arxiv.org/abs/2410.02912
作者: Xianzhi Li,Ran Zmigrod,Zhiqiang Ma,Xiaomo Liu,Xiaodan Zhu
关键词-EN: memorizing detailed patterns, achieve impressive modeling, significant privacy concerns, raise significant privacy, impressive modeling performance
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: EMNLP 2024 findings

点击查看摘要

Abstract:Language models are capable of memorizing detailed patterns and information, leading to a double-edged effect: they achieve impressive modeling performance on downstream tasks with the stored knowledge but also raise significant privacy concerns. Traditional differential privacy based training approaches offer robust safeguards by employing a uniform noise distribution across all parameters. However, this overlooks the distinct sensitivities and contributions of individual parameters in privacy protection and often results in suboptimal models. To address these limitations, we propose ANADP, a novel algorithm that adaptively allocates additive noise based on the importance of model parameters. We demonstrate that ANADP narrows the performance gap between regular fine-tuning and traditional DP fine-tuning on a series of datasets while maintaining the required privacy constraints.
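The core idea, allocating less noise to more important parameters while keeping the overall noise budget fixed, can be sketched as follows. The inverse-importance rule and the normalization are our simplification for illustration, not ANADP's published algorithm:

```python
def allocate_noise(importances, base_sigma):
    """Assign a per-parameter noise scale inversely proportional to
    normalized importance, pinning the average scale to the uniform-DP
    scale base_sigma (hypothetical allocation rule for illustration)."""
    inv = [1.0 / (imp + 1e-8) for imp in importances]  # important -> less noise
    mean_inv = sum(inv) / len(inv)
    return [base_sigma * v / mean_inv for v in inv]    # same mean sigma as uniform DP
```

With uniform importances this reduces to the standard uniform noise distribution the abstract contrasts against; skewed importances shift noise away from privacy-critical, high-contribution parameters.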

[AI-113] Better Instruction-Following Through Minimum Bayes Risk ICLR2025

链接: https://arxiv.org/abs/2410.02902
作者: Ian Wu,Patrick Fernandes,Amanda Bertsch,Seungone Kim,Sina Pakazad,Graham Neubig
关键词-EN: General-purpose LLM judges, Minimum Bayes Risk, human-level evaluation provide, MBR decoding, General-purpose LLM
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Under review at ICLR 2025

点击查看摘要

Abstract:General-purpose LLM judges capable of human-level evaluation provide not only a scalable and accurate way of evaluating instruction-following LLMs but also new avenues for supervising and improving their performance. One promising way of leveraging LLM judges for supervision is through Minimum Bayes Risk (MBR) decoding, which uses a reference-based evaluator to select a high-quality output from amongst a set of candidate outputs. In the first part of this work, we explore using MBR decoding as a method for improving the test-time performance of instruction-following LLMs. We find that MBR decoding with reference-based LLM judges substantially improves over greedy decoding, best-of-N decoding with reference-free judges and MBR decoding with lexical and embedding-based metrics on AlpacaEval and MT-Bench. These gains are consistent across LLMs with up to 70B parameters, demonstrating that smaller LLM judges can be used to supervise much larger LLMs. Then, seeking to retain the improvements from MBR decoding while mitigating additional test-time costs, we explore iterative self-training on MBR-decoded outputs. We find that self-training using Direct Preference Optimisation leads to significant performance gains, such that the self-trained models with greedy decoding generally match and sometimes exceed the performance of their base models with MBR decoding.
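MBR decoding itself is a small loop: score each candidate against all the others as pseudo-references and keep the one with the highest expected utility. The sketch below uses a toy Jaccard-overlap judge in place of a reference-based LLM judge:

```python
def mbr_select(candidates, judge):
    """Return the index of the candidate with the highest mean utility,
    using every other candidate as a reference (judge(hyp, ref) -> score)."""
    def expected_utility(i):
        others = [j for j in range(len(candidates)) if j != i]
        return sum(judge(candidates[i], candidates[j]) for j in others) / len(others)
    return max(range(len(candidates)), key=expected_utility)

def jaccard_judge(hyp, ref):
    """Toy stand-in for a reference-based LLM judge."""
    h, r = set(hyp.split()), set(ref.split())
    return len(h & r) / max(len(h | r), 1)
```

In the paper's setting, `judge` would be an LLM judge; swapping it in changes nothing structurally, only the cost of each pairwise comparison, which is what the self-training step then amortizes.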

[AI-114] Cognitive Biases in Large Language Models for News Recommendation RECSYS’24

链接: https://arxiv.org/abs/2410.02897
作者: Yougang Lyu,Xiaoyu Zhang,Zhaochun Ren,Maarten de Rijke
关键词-EN: large language models, recommender systems, cognitive biases, LLM-based news recommender, language models
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Accepted at the ROGEN '24 workshop, co-located with ACM RecSys '24

点击查看摘要

Abstract:Despite large language models (LLMs) increasingly becoming important components of news recommender systems, employing LLMs in such systems introduces new risks, such as the influence of cognitive biases in LLMs. Cognitive biases refer to systematic patterns of deviation from norms or rationality in the judgment process, which can result in inaccurate outputs from LLMs, thus threatening the reliability of news recommender systems. Specifically, LLM-based news recommender systems affected by cognitive biases could lead to the propagation of misinformation, reinforcement of stereotypes, and the formation of echo chambers. In this paper, we explore the potential impact of multiple cognitive biases on LLM-based news recommender systems, including anchoring bias, framing bias, status quo bias and group attribution bias. Furthermore, to facilitate future research at improving the reliability of LLM-based news recommender systems, we discuss strategies to mitigate these biases through data augmentation, prompt engineering and learning algorithms aspects.

[AI-115] The Role of Deductive and Inductive Reasoning in Large Language Models

链接: https://arxiv.org/abs/2410.02892
作者: Chengkun Cai,Xu Zhao,Haoliang Liu,Zhongyu Jiang,Tianfang Zhang,Zongkai Wu,Jenq-Neng Hwang,Lei Li
关键词-EN: Large Language Models, Large Language, Language Models, artificial intelligence, progress in artificial
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 4 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved substantial progress in artificial intelligence, particularly in reasoning tasks. However, their reliance on static prompt structures, coupled with limited dynamic reasoning capabilities, often constrains their adaptability to complex and evolving problem spaces. In this paper, we propose the Deductive and InDuctive (DID) method, which enhances LLM reasoning by dynamically integrating both deductive and inductive reasoning within the prompt construction process. Drawing inspiration from cognitive science, the DID approach mirrors human adaptive reasoning mechanisms, offering a flexible framework that allows the model to adjust its reasoning pathways based on task context and performance. We empirically validate the efficacy of DID on established datasets such as AIW and MR-GSM8K, as well as on our custom dataset, Holiday Puzzle, which presents tasks involving holiday-date calculation challenges. By leveraging DID's hybrid prompt strategy, we demonstrate significant improvements in both solution accuracy and reasoning quality, achieved without imposing substantial computational overhead. Our findings suggest that DID provides a more robust and cognitively aligned framework for reasoning in LLMs, contributing to the development of advanced LLM-driven problem-solving strategies informed by cognitive science models.

[AI-116] LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning

链接: https://arxiv.org/abs/2410.02884
作者: Di Zhang,Jianbo Wu,Jingdi Lei,Tong Che,Jiatong Li,Tong Xie,Xiaoshui Huang,Shufei Zhang,Marco Pavone,Yuqiang Li,Wanli Ouyang,Dongzhan Zhou
关键词-EN: Large Language Models, Large Language, Monte Carlo Tree, ability of Large, Language Models
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:This paper presents an advanced mathematical problem-solving framework, LLaMA-Berry, for enhancing the mathematical reasoning ability of Large Language Models (LLMs). The framework combines Monte Carlo Tree Search (MCTS) with iterative Self-Refine to optimize the reasoning path and utilizes a pairwise reward model to evaluate different paths globally. By leveraging the self-critic and rewriting capabilities of LLMs, Self-Refine applied to MCTS (SR-MCTS) overcomes the inefficiencies and limitations of conventional step-wise and greedy search algorithms by fostering a more efficient exploration of solution spaces. Pairwise Preference Reward Model (PPRM), inspired by Reinforcement Learning from Human Feedback (RLHF), is then used to model pairwise preferences between solutions, utilizing an Enhanced Borda Count (EBC) method to synthesize these preferences into a global ranking score to find better answers. This approach addresses the challenges of scoring variability and non-independent distributions in mathematical reasoning tasks. The framework has been tested on general and advanced benchmarks, showing superior performance in terms of search efficiency and problem-solving capability compared to existing methods like ToT and rStar, particularly in complex Olympiad-level benchmarks, including GPQA, AIME24 and AMC23.
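The ranking step can be illustrated with a plain Borda count over pairwise preferences. The paper's Enhanced Borda Count additionally exploits transitivity among preferences; this minimal version omits that and simply tallies pairwise wins:

```python
def borda_rank(n, prefers):
    """prefers(i, j) -> True if solution i is preferred to solution j.
    Each solution scores one point per pairwise win; rank by wins."""
    wins = [sum(prefers(i, j) for j in range(n) if j != i) for i in range(n)]
    return sorted(range(n), key=lambda i: -wins[i])

# With PPRM, prefers(i, j) would query the pairwise preference reward model.
```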

[AI-117] Real-World Cooking Robot System from Recipes Based on Food State Recognition Using Foundation Models and PDDL

链接: https://arxiv.org/abs/2410.02874
作者: Naoaki Kanazawa,Kento Kawaharazuka,Yoshiki Obinata,Kei Okada,Masayuki Inaba
关键词-EN: Large Language Model, cooking behaviours based, growing demand, expected tasks, real world
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Accepted at Advanced Robotics

点击查看摘要

Abstract:Although there is a growing demand for cooking behaviours as one of the expected tasks for robots, a series of cooking behaviours based on new recipe descriptions by robots in the real world has not yet been realised. In this study, we propose a robot system that integrates real-world executable robot cooking behaviour planning using the Large Language Model (LLM) and classical planning of PDDL descriptions, and food ingredient state recognition learning from a small number of data using the Vision-Language model (VLM). We succeeded in experiments in which PR2, a dual-armed wheeled robot, performed cooking from arranged new recipes in a real-world environment, and confirmed the effectiveness of the proposed system.

[AI-118] Towards Layer-Wise Personalized Federated Learning: Adaptive Layer Disentanglement via Conflicting Gradients

链接: https://arxiv.org/abs/2410.02845
作者: Minh Duong Nguyen,Khanh Le,Khoi Do,Nguyen H.Tran,Duc Nguyen,Chien Trinh,Zhaohui Yang
关键词-EN: personalized Federated Learning, high data heterogeneity, Federated Learning, significant gradient divergence, personalized Federated
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In personalized Federated Learning (pFL), high data heterogeneity can cause significant gradient divergence across devices, adversely affecting the learning process. This divergence, especially when gradients from different users form an obtuse angle during aggregation, can negate progress, leading to severe weight and gradient update degradation. To address this issue, we introduce a new approach to pFL design, namely Federated Learning with Layer-wise Aggregation via Gradient Analysis (FedLAG), utilizing the concept of gradient conflict at the layer level. Specifically, when layer-wise gradients of different clients form acute angles, those gradients align in the same direction, enabling updates across different clients toward identifying client-invariant features. Conversely, when layer-wise gradient pairs form obtuse angles, the layers tend to focus on client-specific tasks. Based on this observation, FedLAG assigns layers for personalization according to the extent of layer-wise gradient conflicts. Specifically, layers with gradient conflicts are excluded from the global aggregation process. The theoretical evaluation demonstrates that when integrated into other pFL baselines, FedLAG enhances pFL performance by a certain margin. Therefore, our proposed method achieves superior convergence behavior compared with other baselines. Extensive experiments show that our FedLAG outperforms several state-of-the-art methods and can be easily incorporated with many existing methods to further enhance performance.
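The acute/obtuse test above reduces to the sign of the cosine between layer-wise gradients. The sketch below is our reading of FedLAG's layer-assignment rule, simplified to a hard threshold at 90 degrees:

```python
import math

def cosine(u, v):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def split_layers(client_grads):
    """client_grads[c][l]: client c's flattened gradient for layer l.
    A layer is globally aggregated only if every client pair forms an
    acute angle there; otherwise it is kept for personalization."""
    n_clients, n_layers = len(client_grads), len(client_grads[0])
    shared, personal = [], []
    for l in range(n_layers):
        acute = all(cosine(client_grads[a][l], client_grads[b][l]) > 0.0
                    for a in range(n_clients) for b in range(a + 1, n_clients))
        (shared if acute else personal).append(l)
    return shared, personal
```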

[AI-119] Neural DDEs with Learnable Delays for Partially Observed Dynamical Systems

链接: https://arxiv.org/abs/2410.02843
作者: Thibault Monsel,Emmanuel Menier,Onofrio Semeraro,Lionel Mathelin,Guillaume Charpiat
关键词-EN: recently been introduced, learn dynamical systems, learn dynamical, Delay Differential Equations, Constant Lag Neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Many successful methods to learn dynamical systems from data have recently been introduced. Such methods often rely on the availability of the system’s full state. However, this underlying hypothesis is rather restrictive as it is typically not confirmed in practice, leaving us with partially observed systems. Utilizing the Mori-Zwanzig (MZ) formalism from statistical physics, we demonstrate that Constant Lag Neural Delay Differential Equations (NDDEs) naturally serve as suitable models for partially observed states. In empirical evaluation, we show that such models outperform existing methods on both synthetic and experimental data.

[AI-120] FlipAttack: Jailbreak LLMs via Flipping

链接: https://arxiv.org/abs/2410.02832
作者: Yue Liu,Xiaoxin He,Miao Xiong,Jinlan Fu,Shumin Deng,Bryan Hooi
关键词-EN: effective jailbreak attack, jailbreak attack named, attack named FlipAttack, LLMs, paper proposes
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 43 pages, 31 figures

点击查看摘要

Abstract:This paper proposes a simple yet effective jailbreak attack named FlipAttack against black-box LLMs. First, from the autoregressive nature, we reveal that LLMs tend to understand the text from left to right and find that they struggle to comprehend the text when noise is added to the left side. Motivated by these insights, we propose to disguise the harmful prompt by constructing left-side noise merely based on the prompt itself, then generalize this idea to 4 flipping modes. Second, we verify the strong ability of LLMs to perform the text-flipping task, and then develop 4 variants to guide LLMs to denoise, understand, and execute harmful behaviors accurately. These designs keep FlipAttack universal, stealthy, and simple, allowing it to jailbreak black-box LLMs within only 1 query. Experiments on 8 LLMs demonstrate the superiority of FlipAttack. Remarkably, it achieves a ~98% attack success rate on GPT-4o, and a ~98% bypass rate against 5 guardrail models on average. The code is available on GitHub (this https URL).
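The flipping operation at the heart of the attack is trivial to state. Two of the four modes described in the paper reduce to reversing characters or words; this is our minimal reading of the mechanism, shown only to clarify how the left-side noise is constructed from the prompt itself:

```python
def flip_chars(prompt):
    """Flip the whole prompt character by character."""
    return prompt[::-1]

def flip_words(prompt):
    """Flip word order while leaving each word intact."""
    return " ".join(reversed(prompt.split()))
```

Both maps are involutions (flipping twice restores the original), which is why the model can be guided to "denoise" the prompt before executing it.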

[AI-121] Skill Issues: An Analysis of CS:GO Skill Rating Systems

链接: https://arxiv.org/abs/2410.02831
作者: Mikel Bober-Irizar,Naunidh Dua,Max McGuinness
关键词-EN: skill rating systems, accurate skill rating, meteoric rise, rise of online, online games
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The meteoric rise of online games has created a need for accurate skill rating systems for tracking improvement and fair matchmaking. Although many skill rating systems with various theoretical foundations are deployed, less work has been done on analysing the real-world performance of these algorithms. In this paper, we perform an empirical analysis of Elo, Glicko2 and TrueSkill through the lens of surrogate modelling, where skill ratings influence future matchmaking with a configurable acquisition function. We look at both overall performance and data efficiency, and perform a sensitivity analysis based on a large dataset of Counter-Strike: Global Offensive matches.
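Of the three systems analysed, Elo is the simplest to state. Its standard K-factor update, which the paper takes as one of its baselines, is:

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    """One Elo update: score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss.
    Expected score follows from the 400-point logistic rating gap."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta
```

Glicko2 and TrueSkill extend this with per-player uncertainty estimates, which is precisely what the surrogate-modelling lens in the paper probes.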

[AI-122] LLMs May Not Be Human-Level Players But They Can Be Testers: Measuring Game Difficulty with LLM Agents

链接: https://arxiv.org/abs/2410.02829
作者: Chang Xiao,Brenda Z. Yang
关键词-EN: Large Language Models, Language Models, Large Language, Recent advances, advances in Large
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in Large Language Models (LLMs) have demonstrated their potential as autonomous agents across various tasks. One emerging application is the use of LLMs in playing games. In this work, we explore a practical problem for the gaming industry: Can LLMs be used to measure game difficulty? We propose a general game-testing framework using LLM agents and test it on two widely played strategy games: Wordle and Slay the Spire. Our results reveal an interesting finding: although LLMs may not perform as well as the average human player, their performance, when guided by simple, generic prompting techniques, shows a statistically significant and strong correlation with difficulty indicated by human players. This suggests that LLMs could serve as effective agents for measuring game difficulty during the development process. Based on our experiments, we also outline general principles and guidelines for incorporating LLMs into the game testing process.

[AI-123] PyRIT: A Framework for Security Risk Identification and Red Teaming in Generative AI System

链接: https://arxiv.org/abs/2410.02828
作者: Gary D. Lopez Munoz,Amanda J. Minnich,Roman Lutz,Richard Lundeen,Raja Sekhar Rao Dheekonda,Nina Chikanov,Bolor-Erdene Jagdagdorj,Martin Pouliot,Shiven Chawla,Whitney Maxwell,Blake Bullwinkel,Katherine Pratt,Joris de Gruyter,Charlotte Siska,Pete Bryan,Tori Westerhoff,Chang Kawaguchi,Christian Seifert,Ram Shankar Siva Kumar,Yonatan Zunger
关键词-EN: Generative Artificial Intelligence, Artificial Intelligence, Generative Artificial, daily lives, Risk Identification Toolkit
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Generative Artificial Intelligence (GenAI) is becoming ubiquitous in our daily lives. The increase in computational power and data availability has led to a proliferation of both single- and multi-modal models. As the GenAI ecosystem matures, the need for extensible and model-agnostic risk identification frameworks is growing. To meet this need, we introduce the Python Risk Identification Toolkit (PyRIT), an open-source framework designed to enhance red teaming efforts in GenAI systems. PyRIT is a model- and platform-agnostic tool that enables red teamers to probe for and identify novel harms, risks, and jailbreaks in multimodal generative AI models. Its composable architecture facilitates the reuse of core building blocks and allows for extensibility to future models and modalities. This paper details the challenges specific to red teaming generative AI systems, the development and features of PyRIT, and its practical applications in real-world scenarios.

[AI-124] Effective Intrusion Detection for UAV Communications using Autoencoder-based Feature Extraction and Machine Learning Approach

链接: https://arxiv.org/abs/2410.02827
作者: Tuan-Cuong Vuong,Cong Chi Nguyen,Van-Cuong Pham,Thi-Thanh-Huyen Le,Xuan-Nam Tran,Thien Van Luong
关键词-EN: unmanned aerial vehicles, recent actual UAV, intrusion detection method, aerial vehicles, actual UAV intrusion
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 4 pages

点击查看摘要

Abstract:This paper proposes a novel intrusion detection method for unmanned aerial vehicles (UAV) using a recent actual UAV intrusion dataset. In particular, in the first stage of our method, we design an autoencoder architecture for effectively extracting important features, which are then fed into various machine learning models in the second stage for detecting and classifying attack types. To the best of our knowledge, this is the first attempt to propose such an autoencoder-based machine learning intrusion detection method for UAVs using an actual dataset, while most existing works only consider either simulated datasets or datasets irrelevant to UAV communications. Our experiment results show that the proposed method outperforms baselines such as feature selection schemes in both binary and multi-class classification tasks.

[AI-125] LinkThief: Combining Generalized Structure Knowledge with Node Similarity for Link Stealing Attack against GNN

链接: https://arxiv.org/abs/2410.02826
作者: Yuxing Zhang,Siyuan Meng,Chunchun Chen,Mengyao Peng,Hongyan Gu,Xinli Huang
关键词-EN: Bridge Graph Generator, http URL attacks, http URL theoretical, Graph neural networks, http URL studies
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph neural networks (GNNs) have a wide range of real-world applications. Recent studies have shown that GNNs are vulnerable to link stealing attacks, which infer the existence of edges in the target GNN's training graph. These attacks are usually based on the assumption that links exist between two nodes that share similar posteriors; however, they fail to focus on links that do not hold under this assumption. To this end, we propose LinkThief, an improved link stealing attack that combines generalized structure knowledge with node similarity, in a scenario where the attackers' background knowledge contains a partially leaked target graph and a shadow graph. Specifically, to equip the attack model with insights into the link structure spanning both the shadow graph and the target graph, we introduce the idea of creating a Shadow-Target Bridge Graph and extracting edge subgraph structure features from it. Through theoretical analysis from the perspective of privacy theft, we first explore how to implement this idea. Building upon the findings, we design the Bridge Graph Generator to construct the Shadow-Target Bridge Graph. Then, the subgraph around the link is sampled by the Edge Subgraph Preparation module. Finally, the Edge Structure Feature Extractor is designed to obtain generalized structure knowledge, which is combined with node similarity to form the features provided to the attack model. Extensive experiments validate the correctness of the theoretical analysis and demonstrate that LinkThief still effectively steals links without extra assumptions.

[AI-126] DANA: Domain-Aware Neurosymbolic Agents for Consistency and Accuracy

链接: https://arxiv.org/abs/2410.02823
作者: Vinh Luong,Sang Dinh,Shruti Raghavan,William Nguyen,Zooey Nguyen,Quynh Le,Hung Vo,Kentaro Maegaito,Loc Nguyen,Thao Nguyen,Anh Hai Ha,Christopher Nguyen
关键词-EN: Large Language Models, shown remarkable capabilities, Large Language, Language Models, inherent probabilistic nature
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable capabilities, but their inherent probabilistic nature often leads to inconsistency and inaccuracy in complex problem-solving tasks. This paper introduces DANA (Domain-Aware Neurosymbolic Agent), an architecture that addresses these issues by integrating domain-specific knowledge with neurosymbolic approaches. We begin by analyzing current AI architectures, including AutoGPT, LangChain ReAct and OpenAI’s ChatGPT, through a neurosymbolic lens, highlighting how their reliance on probabilistic inference contributes to inconsistent outputs. In response, DANA captures and applies domain expertise in both natural-language and symbolic forms, enabling more deterministic and reliable problem-solving behaviors. We implement a variant of DANA using Hierarchical Task Plans (HTPs) in the open-source OpenSSA framework. This implementation achieves over 90% accuracy on the FinanceBench financial-analysis benchmark, significantly outperforming current LLM-based systems in both consistency and accuracy. Application of DANA in physical industries such as semiconductor shows that its flexible architecture for incorporating knowledge is effective in mitigating the probabilistic limitations of LLMs and has potential in tackling complex, real-world problems that require reliability and precision.

[AI-127] GPT's Judgements Under Uncertainty

链接: https://arxiv.org/abs/2410.02820
作者: Payam Saeedi,Mahsa Goodarzi
关键词-EN: framing effects, human cognition, loss aversion, conjunction fallacy, judges and makes
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We investigate whether biases inherent in human cognition, such as loss aversion, framing effects, and the conjunction fallacy, manifest in how GPT-4o judges and makes decisions in probabilistic scenarios. By conducting 1350 experiments across nine cognitive biases and analyzing the responses for statistical versus heuristic reasoning, we demonstrate GPT-4o's contradictory approach when responding to prompts with similar underlying probability notations. Our findings also reveal mixed performance, with the AI demonstrating both human-like heuristic errors and statistically sound decisions, even as it goes through identical iterations of the same prompt.

[AI-128] Bipolar fuzzy relation equations systems based on the product t-norm

链接: https://arxiv.org/abs/2410.02816
作者: M. Eugenia Cornejo,David Lobo,Jesús Medina
关键词-EN: fuzzy relation equations, Bipolar fuzzy relation, logical connective negations, fuzzy relation, relation equations
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Bipolar fuzzy relation equations arise as a generalization of fuzzy relation equations considering unknown variables together with their logical connective negations. The occurrence of a variable and the occurrence of its negation simultaneously can give very useful information for certain frameworks where human reasoning plays a key role. Hence, the resolution of bipolar fuzzy relation equations systems is a research topic of great interest. This paper focuses on the study of bipolar fuzzy relation equations systems based on the max-product t-norm composition. Specifically, the solvability and the algebraic structure of the set of solutions of these bipolar equations systems will be studied, including the case in which such systems are composed of equations whose independent term is equal to zero. As a consequence, this paper complements the contribution carried out by the authors on the solvability of bipolar max-product fuzzy relation equations. Journal reference: Mathematical Methods in the Applied Sciences 42(17) (2019) 5779-5793. DOI: https://doi.org/10.1002/mma.5646
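Under one common reading of this setting (our notation, not necessarily the paper's), a single bipolar max-product equation takes the form max_j max(a⁺_j·x_j, a⁻_j·(1−x_j)) = b, with the negation of x_j modelled as 1−x_j. A candidate solution can then be checked directly:

```python
def satisfies(a_pos, a_neg, x, b, tol=1e-9):
    """Check one bipolar max-product equation
        max_j max(a_pos[j] * x[j], a_neg[j] * (1 - x[j])) = b,
    where each x[j] lies in [0, 1] and 1 - x[j] plays the role of
    the negated variable (illustrative formulation)."""
    lhs = max(max(p * xi, n * (1 - xi)) for p, n, xi in zip(a_pos, a_neg, x))
    return abs(lhs - b) <= tol
```

A system is solvable when a single assignment x satisfies every such equation simultaneously, including those with independent term b = 0, the case the paper singles out.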

[AI-129] SAC-KG: Exploiting Large Language Models as Skilled Automatic Constructors for Domain Knowledge Graphs ACL2024

链接: https://arxiv.org/abs/2410.02811
作者: Hanzhu Chen,Xu Shen,Qitan Lv,Jie Wang,Xiaoqi Ni,Jieping Ye
关键词-EN: domain Knowledge Graph, Skilled Automatic Constructors, play a pivotal, pivotal role, role in knowledge-intensive
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: ACL 2024 Main

点击查看摘要

Abstract:Knowledge graphs (KGs) play a pivotal role in knowledge-intensive tasks across specialized domains, where the acquisition of precise and dependable knowledge is crucial. However, existing KG construction methods heavily rely on human intervention to attain qualified KGs, which severely hinders the practical applicability in real-world scenarios. To address this challenge, we propose a general KG construction framework, named SAC-KG, to exploit large language models (LLMs) as Skilled Automatic Constructors for domain Knowledge Graph. SAC-KG effectively involves LLMs as domain experts to generate specialized and precise multi-level KGs. Specifically, SAC-KG consists of three components: Generator, Verifier, and Pruner. For a given entity, Generator produces its relations and tails from raw domain corpora, to construct a specialized single-level KG. Verifier and Pruner then work together to ensure precision by correcting generation errors and determining whether newly produced tails require further iteration for the next-level KG. Experiments demonstrate that SAC-KG automatically constructs a domain KG at the scale of over one million nodes and achieves a precision of 89.32%, leading to a superior performance with an over 20% increase in precision rate compared to existing state-of-the-art methods for the KG construction task.

[AI-130] StateAct: State Tracking and Reasoning for Acting and Planning with Large Language Models

链接: https://arxiv.org/abs/2410.02810
作者: Nikolai Rozanov,Marek Rei
关键词-EN: large language models, language models, large language, interactive environments, Planning and acting
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 9 pages, 5 pages appendix, 7 figures, 5 tables

点击查看摘要

Abstract:Planning and acting to solve 'real' tasks using large language models (LLMs) in interactive environments has become a new frontier for AI methods. While recent advances allowed LLMs to interact with online tools, solve robotics tasks and many more, long range reasoning tasks remain a problem for LLMs. Existing methods to address this issue are very resource intensive and require additional data or human crafted rules; instead, we propose a simple method based on few-shot in-context learning alone to enhance 'chain-of-thought' with state-tracking for planning and acting with LLMs. We show that our method establishes the new state-of-the-art on Alfworld for in-context learning methods (+14% over the previous best few-shot in-context learning method) and performs on par with methods that use additional training data and additional tools such as code-execution. We also demonstrate that our enhanced 'chain-of-states' allows the agent to both solve longer horizon problems and to be more efficient in the number of steps required to solve a task. We show that our method works across a variety of LLMs, both API-based and open source. Finally, we also conduct ablation studies and show that 'chain-of-thoughts' helps state-tracking accuracy, while a json-structure harms overall performance. We open-source our code and annotations at this https URL.

[AI-131] Investigating the Impact of Randomness on Reproducibility in Computer Vision: A Study on Applications in Civil Engineering and Medicine

链接: https://arxiv.org/abs/2410.02806
作者: Bahadır Eryılmaz,Osman Alperen Koraş,Jörg Schlötterer,Christin Seifert
关键词-EN: scientific research, essential for scientific, CUDA-induced randomness, Abstract, randomness
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Reproducibility is essential for scientific research. However, in computer vision, achieving consistent results is challenging due to various factors. One influential, yet often unrecognized, factor is CUDA-induced randomness. Despite CUDA's advantages for accelerating algorithm execution on GPUs, if not controlled, its behavior across multiple executions remains non-deterministic. While reproducibility issues in ML are being researched, the implications of CUDA-induced randomness in applications are yet to be understood. Our investigation focuses on this randomness across one standard benchmark dataset and two real-world datasets in an isolated environment. Our results show that CUDA-induced randomness can account for differences of up to 4.77% in performance scores. We find that managing this variability for reproducibility may entail increased runtime or reduced performance, but that the disadvantages are not as significant as reported in previous studies.
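The reproducibility principle discussed in the abstract can be illustrated with a minimal, framework-free sketch: fixing every random seed makes repeated runs bit-identical. The CUDA-level non-determinism studied in the paper additionally requires framework-specific controls (e.g., deterministic algorithm modes), which are beyond this stdlib example.

```python
# Minimal sketch of seeded reproducibility using only the standard library.
# Seeding an isolated RNG makes repeated runs identical; CUDA-induced
# non-determinism needs extra framework-level flags not shown here.
import random

def seeded_run(seed, n=5):
    rng = random.Random(seed)          # isolated, seeded RNG
    return [rng.random() for _ in range(n)]

run_a = seeded_run(42)
run_b = seeded_run(42)   # same seed -> identical sequence
run_c = seeded_run(43)   # different seed -> different sequence
print(run_a == run_b, run_a == run_c)  # → True False
```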

[AI-132] Leveraging Retrieval Augment Approach for Multimodal Emotion Recognition Under Missing Modalities

链接: https://arxiv.org/abs/2410.02804
作者: Qi Fan,Hongyu Yuan,Haolin Zuo,Rui Liu,Guanglai Gao
关键词-EN: Multimodal emotion recognition, Multimodal emotion, emotion recognition, Multimodal, emotion recognition utilizes
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Under reviewing

点击查看摘要

Abstract:Multimodal emotion recognition (MER) utilizes complete multimodal information and robust multimodal joint representations to gain high performance. However, the ideal condition of full modality integrity is often not met in reality, and situations in which some modalities are missing frequently arise. For example, video, audio, or text data may be missing due to sensor failures or network bandwidth problems, which presents a great challenge to MER research. Traditional methods extract useful information from the complete modalities and reconstruct the missing modalities to learn robust multimodal joint representations. These methods have laid a solid foundation for research in this field and, to a certain extent, alleviated the difficulty of multimodal emotion recognition under missing modalities. However, relying solely on internal reconstruction and multimodal joint learning has its limitations, especially when the missing information is critical for emotion recognition. To address this challenge, we propose a novel framework, Retrieval Augment for Missing Modality Multimodal Emotion Recognition (RAMER), which introduces similar multimodal emotion data to enhance the performance of emotion recognition under missing modalities. By leveraging databases that contain related multimodal emotion data, we can retrieve similar multimodal emotion information to fill in the gaps left by missing modalities. Extensive experimental results demonstrate that our framework is superior to existing state-of-the-art approaches on missing modality MER tasks. Our whole project is publicly available on this https URL.

[AI-133] Estimating Body Volume and Height Using 3D Data

链接: https://arxiv.org/abs/2410.02800
作者: Vivek Ganesh Sonar,Muhammad Tanveer Jan,Mike Wells,Abhijit Pandya,Gabriela Engstrom,Richard Shih,Borko Furht
关键词-EN: body weight estimation, weight-based medications, urgent situations, body weight, estimation is critical
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 6 pages

点击查看摘要

Abstract:Accurate body weight estimation is critical in emergency medicine for proper dosing of weight-based medications, yet direct measurement is often impractical in urgent situations. This paper presents a non-invasive method for estimating body weight by calculating total body volume and height using 3D imaging technology. A RealSense D415 camera is employed to capture high-resolution depth maps of the patient, from which 3D models are generated. The Convex Hull Algorithm is then applied to calculate the total body volume, with enhanced accuracy achieved by segmenting the point cloud data into multiple sections and summing their individual volumes. The height is derived from the 3D model by identifying the distance between key points on the body. This combined approach provides an accurate estimate of body weight, improving the reliability of medical interventions where precise weight data is unavailable. The proposed method demonstrates significant potential to enhance patient safety and treatment outcomes in emergency settings.
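The volume-estimation pipeline described above (segmenting the point cloud into sections and summing per-section volumes) can be sketched in pure Python. This is only an illustration of the principle on a synthetic unit cube: the paper applies full 3D convex hulls to RealSense depth data, whereas the slice-based approximation and all parameters below are our own assumptions.

```python
# Hedged sketch of the paper's volume idea: slice a 3D point cloud along
# height, approximate each slice's cross-section by its 2D convex-hull area,
# and sum area * slice thickness. Pure stdlib, for illustration only.

def hull_area(points):
    """Area of the 2D convex hull (Andrew's monotone chain + shoelace)."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return 0.0
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    hull = lower[:-1] + upper[:-1]
    # Shoelace formula over the counterclockwise hull polygon.
    return 0.5 * abs(sum(hull[i][0]*hull[(i+1) % len(hull)][1]
                         - hull[(i+1) % len(hull)][0]*hull[i][1]
                         for i in range(len(hull))))

def sliced_volume(cloud, n_slices=10):
    """Approximate (volume, height) of a point cloud by horizontal slicing."""
    zs = [p[2] for p in cloud]
    z0, z1 = min(zs), max(zs)
    dz = (z1 - z0) / n_slices
    vol = 0.0
    for k in range(n_slices):
        sl = [(p[0], p[1]) for p in cloud
              if z0 + k*dz <= p[2] < z0 + (k+1)*dz
              or (k == n_slices - 1 and p[2] == z1)]
        vol += hull_area(sl) * dz
    return vol, z1 - z0

# Synthetic unit cube sampled on a grid: true volume 1.0, height 1.0.
grid = [i/10 for i in range(11)]
cube = [(x, y, z) for x in grid for y in grid for z in grid]
volume, height = sliced_volume(cube, n_slices=5)
print(round(volume, 3), height)
```

In practice a library hull routine (e.g., a 3D convex hull from a scientific computing package) would replace the hand-rolled 2D hull above.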

[AI-134] TaCIE: Enhancing Instruction Comprehension in Large Language Models through Task-Centred Instruction Evolution

链接: https://arxiv.org/abs/2410.02795
作者: Jiuding Yang,Shengyao Lu,Weidong Guo,Xiangyang Li,Kaitong Yang,Yu Xu,Di Niu
关键词-EN: Large Language Models, Large Language, require precise alignment, Language Models, require precise
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) require precise alignment with complex instructions to optimize their performance in real-world applications. As the demand for refined instruction tuning data increases, traditional methods that evolve simple seed instructions often struggle to effectively enhance complexity or manage difficulty scaling across various domains. Our innovative approach, Task-Centered Instruction Evolution (TaCIE), addresses these shortcomings by redefining instruction evolution from merely evolving seed instructions to a more dynamic and comprehensive combination of elements. TaCIE starts by deconstructing complex instructions into their fundamental components. It then generates and integrates new elements with the original ones, reassembling them into more sophisticated instructions that progressively increase in difficulty, diversity, and complexity. Applied across multiple domains, LLMs fine-tuned with these evolved instructions have substantially outperformed those tuned with conventional methods, marking a significant advancement in instruction-based model fine-tuning.

[AI-135] DifFaiRec: Generative Fair Recommender with Conditional Diffusion Model ICDM2024

链接: https://arxiv.org/abs/2410.02791
作者: Zhenhao Jiang,Jicong Fan
关键词-EN: users automatically based, Diffusion-based Fair Recommender, automatically based, groups, users automatically
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: The paper was accepted by ICDM 2024

点击查看摘要

Abstract:Although recommenders can ship items to users automatically based on the users’ preferences, they often cause unfairness to groups or individuals. For instance, when users can be divided into two groups according to a sensitive social attribute and there is a significant difference in terms of activity between the two groups, the learned recommendation algorithm will result in a recommendation gap between the two groups, which causes group unfairness. In this work, we propose a novel recommendation algorithm named Diffusion-based Fair Recommender (DifFaiRec) to provide fair recommendations. DifFaiRec is built upon the conditional diffusion model and hence has a strong ability to learn the distribution of user preferences from their ratings on items and is able to generate diverse recommendations effectively. To guarantee fairness, we design a counterfactual module to reduce the model sensitivity to protected attributes and provide mathematical explanations. The experiments on benchmark datasets demonstrate the superiority of DifFaiRec over competitive baselines.

[AI-136] Logic-Free Building Automation: Learning the Control of Room Facilities with Wall Switches and Ceiling Camera

链接: https://arxiv.org/abs/2410.02789
作者: Hideya Ochiai,Kohki Hashimoto,Takuya Sakamoto,Seiya Watanabe,Ryosuke Hara,Ryo Yagi,Yuji Aizono,Hiroshi Esaki
关键词-EN: Artificial intelligence enables, Artificial intelligence, intelligence enables smarter, intelligence enables, preferences on facility
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
*备注: 5 pages, 3 figures, 2 tables

点击查看摘要

Abstract:Artificial intelligence enables smarter control in building automation by its learning capability of users’ preferences on facility control. Reinforcement learning (RL) was one of the approaches to this, but it has many challenges in real-world implementations. We propose a new architecture for logic-free building automation (LFBA) that leverages deep learning (DL) to control room facilities without predefined logic. Our approach differs from RL in that it uses wall switches as supervised signals and a ceiling camera to monitor the environment, allowing the DL model to learn users’ preferred controls directly from the scenes and switch states. This LFBA system is tested by our testbed with various conditions and user activities. The results demonstrate the efficacy, achieving 93%-98% control accuracy with VGG, outperforming other DL models such as Vision Transformer and ResNet. This indicates that LFBA can achieve smarter and more user-friendly control by learning from the observable scenes and user interactions.

[AI-137] Navigation with VLM framework: Go to Any Language

链接: https://arxiv.org/abs/2410.02787
作者: Zecheng Yin,Chonghao Cheng,Lizhen
关键词-EN: posed significant challenges, Vision Large Language, Large Language Models, significant challenges, Vision Large
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: under review

点击查看摘要

Abstract:Navigating towards fully open language goals and exploring open scenes in a manner akin to human exploration have always posed significant challenges. Recently, Vision Large Language Models (VLMs) have demonstrated remarkable capabilities in reasoning with both language and visual data. While many works have focused on leveraging VLMs for navigation in open scenes and with open vocabularies, these efforts often fall short of fully utilizing the potential of VLMs or require substantial computational resources. We introduce Navigation with VLM (NavVLM), a framework that harnesses equipment-level VLMs to enable agents to navigate towards any language goal, specific or non-specific, in open scenes, emulating human exploration behaviors without any prior training. The agent leverages the VLM as its cognitive core to perceive environmental information based on any language goal and constantly provides exploration guidance during navigation until it reaches the target location or area. Our framework not only achieves state-of-the-art performance in Success Rate (SR) and Success weighted by Path Length (SPL) in traditional specific goal settings but also extends the navigation capabilities to any open-set language goal. We evaluate NavVLM in richly detailed environments from the Matterport 3D (MP3D), Habitat Matterport 3D (HM3D), and Gibson datasets within the Habitat simulator. With the power of VLMs, navigation has entered a new era.

[AI-138] Robust Symmetry Detection via Riemannian Langevin Dynamics

链接: https://arxiv.org/abs/2410.02786
作者: Jihyeon Je,Jiayi Liu,Guandao Yang,Boyang Deng,Shengqu Cai,Gordon Wetzstein,Or Litany,Leonidas Guibas
关键词-EN: kinds of objects, man-made creations, noise, Symmetries, symmetry
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
*备注: Project page: this https URL

点击查看摘要

Abstract:Symmetries are ubiquitous across all kinds of objects, whether in nature or in man-made creations. While these symmetries may seem intuitive to the human eye, detecting them with a machine is nontrivial due to the vast search space. Classical geometry-based methods work by aggregating “votes” for each symmetry but struggle with noise. In contrast, learning-based methods may be more robust to noise, but often overlook partial symmetries due to the scarcity of annotated data. In this work, we address this challenge by proposing a novel symmetry detection method that marries classical symmetry detection techniques with recent advances in generative modeling. Specifically, we apply Langevin dynamics to a redefined symmetry space to enhance robustness against noise. We provide empirical results on a variety of shapes that suggest our method is not only robust to noise, but can also identify both partial and global symmetries. Moreover, we demonstrate the utility of our detected symmetries in various downstream tasks, such as compression and symmetrization of noisy shapes.

[AI-139] Enhancing Mental Health Support through Human-AI Collaboration: Toward Secure and Empathetic AI-enabled chatbots

链接: https://arxiv.org/abs/2410.02783
作者: Rawan AlMakinah,Andrea Norcini-Pala,Lindsey Disney,M. Abdullah Canbaz
关键词-EN: cultural barriers hinder, barriers hinder timely, hinder timely care, mental health, mental health support
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: 17 pages, 9 Figures

点击查看摘要

Abstract:Access to mental health support remains limited, particularly in marginalized communities where structural and cultural barriers hinder timely care. This paper explores the potential of AI-enabled chatbots as a scalable solution, focusing on advanced large language models (LLMs)-GPT v4, Mistral Large, and LLama V3.1-and assessing their ability to deliver empathetic, meaningful responses in mental health contexts. While these models show promise in generating structured responses, they fall short in replicating the emotional depth and adaptability of human therapists. Additionally, trustworthiness, bias, and privacy challenges persist due to unreliable datasets and limited collaboration with mental health professionals. To address these limitations, we propose a federated learning framework that ensures data privacy, reduces bias, and integrates continuous validation from clinicians to enhance response quality. This approach aims to develop a secure, evidence-based AI chatbot capable of offering trustworthy, empathetic, and bias-reduced mental health support, advancing AI’s role in digital mental health care.

[AI-140] Guess What I Think: Streamlined EEG-to-Image Generation with Latent Diffusion Models ICASSP2025

链接: https://arxiv.org/abs/2410.02780
作者: Eleonora Lopez,Luigi Sigillo,Federica Colonnese,Massimo Panella,Danilo Comminiello
关键词-EN: advance brain-computer interface, encode visual cues, gaining increasing attention, brain signals encode, increasing attention due
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Submitted to ICASSP 2025

点击查看摘要

Abstract:Generating images from brain waves is gaining increasing attention due to its potential to advance brain-computer interface (BCI) systems by understanding how brain signals encode visual cues. Most of the literature has focused on fMRI-to-Image tasks as fMRI is characterized by high spatial resolution. However, fMRI is an expensive neuroimaging modality and does not allow for real-time BCI. On the other hand, electroencephalography (EEG) is a low-cost, non-invasive, and portable neuroimaging technique, making it an attractive option for future real-time applications. Nevertheless, EEG presents inherent challenges due to its low spatial resolution and susceptibility to noise and artifacts, which makes generating images from EEG more difficult. In this paper, we address these problems with a streamlined framework based on the ControlNet adapter for conditioning a latent diffusion model (LDM) through EEG signals. We conduct experiments and ablation studies on popular benchmarks to demonstrate that the proposed method beats other state-of-the-art models. Unlike these methods, which often require extensive preprocessing, pretraining, different losses, and captioning models, our approach is efficient and straightforward, requiring only minimal preprocessing and a few components. Code will be available after publication.

[AI-141] Learning variant product relationship and variation attributes from e-commerce website structures

链接: https://arxiv.org/abs/2410.02779
作者: Pedro Herrero-Vidal,You-Lin Chen,Cris Liu,Prithviraj Sen,Lichao Wang
关键词-EN: introduce VARM, product relationships, product, variant product relationships, variant relationship matcher
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce VARM, a variant relationship matcher strategy, to identify pairs of variant products in e-commerce catalogs. Traditional definitions of entity resolution are concerned with whether product mentions refer to the same underlying product. However, this fails to capture product relationships that are critical for e-commerce applications, such as having similar, but not identical, products listed on the same webpage or sharing reviews. Here, we formulate a new type of entity resolution over variant product relationships to capture these similar e-commerce product links. In contrast with the traditional definition, the new definition requires both identifying whether two products are variant matches of each other and which attributes vary between them. To satisfy these two requirements, we developed a strategy that leverages the strengths of both encoding and generative AI models. First, we construct a dataset that captures webpage product links, and therefore variant product relationships, to train an encoding LLM to predict variant matches for any given pair of products. Second, we use RAG-prompted generative LLMs to extract variation and common attributes amongst groups of variant products. To validate our strategy, we evaluated model performance using real data from one of the world's leading e-commerce retailers. The results show that our strategy outperforms alternative solutions and paves the way to exploiting this new type of product relationship.

[AI-142] Mind the Uncertainty in Human Disagreement: Evaluating Discrepancies between Model Predictions and Human Responses in VQA

链接: https://arxiv.org/abs/2410.02773
作者: Jian Lan,Diego Frassinelli,Barbara Plank
关键词-EN: Visual Question Answering, Large vision-language models, Large vision-language, multiple human annotators, accurately predict responses
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Large vision-language models frequently struggle to accurately predict responses provided by multiple human annotators, particularly when those responses exhibit human uncertainty. In this study, we focus on the Visual Question Answering (VQA) task, and we comprehensively evaluate how well the state-of-the-art vision-language models correlate with the distribution of human responses. To do so, we categorize our samples based on their levels (low, medium, high) of human uncertainty in disagreement (HUD) and employ not only accuracy but also three new human-correlated metrics in VQA, to investigate the impact of HUD. To better align models with humans, we also verify the effect of common calibration and human calibration. Our results show that even BEiT3, currently the best model for this task, struggles to capture the multi-label distribution inherent in diverse human responses. Additionally, we observe that the commonly used accuracy-oriented calibration technique adversely affects BEiT3’s ability to capture HUD, further widening the gap between model predictions and human distributions. In contrast, we show the benefits of calibrating models towards human distributions for VQA, better aligning model confidence with human uncertainty. Our findings highlight that for VQA, the consistent alignment between human responses and model predictions is understudied and should become the next crucial target of future studies.
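The abstract does not specify its three human-correlated metrics, so as a hedged illustration of what "correlating with the distribution of human responses" involves, the sketch below compares a model's predictive distribution with the empirical distribution of annotator answers using total variation distance (our choice for the example, not necessarily the paper's metric).

```python
# Illustrative comparison of a model's predictive distribution against the
# empirical distribution of human annotator answers. Total variation distance
# is used here purely as an example of a distributional-alignment measure.
from collections import Counter

def human_distribution(annotations):
    """Empirical answer distribution from multiple annotators."""
    counts = Counter(annotations)
    total = sum(counts.values())
    return {ans: c / total for ans, c in counts.items()}

def total_variation(p, q):
    """TV distance: half the L1 distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Ten annotators disagree on a VQA answer (high human uncertainty).
humans = human_distribution(["cat"] * 6 + ["kitten"] * 3 + ["animal"])
model = {"cat": 0.95, "kitten": 0.04, "animal": 0.01}  # over-confident model
print(round(total_variation(humans, model), 3))  # → 0.35
```

An over-confident model scores a large distance even when its top answer matches the human majority, which is exactly the mismatch accuracy alone cannot capture.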

[AI-143] Complex-valued convolutional neural network classification of hand gesture from radar images

链接: https://arxiv.org/abs/2410.02771
作者: Shokooh Khandan
关键词-EN: gesture recognition systems, Hand gesture recognition, popular in HCI, application areas, Hand gesture
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 173 pages, 36 tables, 50 figures

点击查看摘要

Abstract:Hand gesture recognition systems have yielded many exciting advancements in the last decade and have become more popular in HCI (human-computer interaction), with several application areas spanning from safety and security to the automotive field. Various deep neural network architectures have already been inspected for hand gesture recognition systems, including the multi-layer perceptron (MLP), convolutional neural network (CNN), recurrent neural network (RNN) and a cascade of the last two architectures known as CNN-RNN. However, a major problem still exists: most existing ML algorithms and their building blocks are designed and developed for real-valued (RV) data. Researchers have applied various RV techniques to complex-valued (CV) radar images, for example converting a CV optimisation problem into a RV one by splitting the complex numbers into their real and imaginary parts. However, the major disadvantage of this method is that the resulting algorithm doubles the network dimensions. Recent work on RNNs and other fundamental theoretical analyses suggests that CV numbers have a richer representational capacity, but due to the absence of the building blocks required to design such models, the performance of CV networks is marginalised. In this report, we propose a fully CV-CNN, including all building blocks, forward and backward operations, and derivatives, all in the complex domain. We explore our proposed classification model on two sets of CV hand gesture radar images in comparison with the equivalent RV model. In chapter five, we propose a CV-forward residual network for the purpose of binary classification of the two sets of CV hand gesture radar datasets and compare its performance with our proposed CV-CNN and a baseline CV-forward CNN.

[AI-144] Fundamentals of legislation for autonomous artificial intelligence systems

链接: https://arxiv.org/abs/2410.02769
作者: Anna Romanova
关键词-EN: autonomous corporate management, dedicated operational context, operational context, management systems based, corporate management systems
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: in Russian language

点击查看摘要

Abstract:The article proposes a method for forming a dedicated operational context in the course of the development and implementation of autonomous corporate management systems, based on the example of autonomous systems for a board of directors. A significant part of the operational context for autonomous company management systems is the regulatory and legal environment within which corporations operate. In order to create a special operational context for autonomous artificial intelligence systems, the wording of local regulatory documents can be simultaneously presented in two versions: for use by people and for use by autonomous systems. In this case, the artificial intelligence system will receive a well-defined operational context that allows it to perform functions within the required standards. Local regulations that provide for the specifics of the joint work of individuals and autonomous artificial intelligence systems can create the basis of the relevant legislation governing the development and implementation of autonomous systems.

[AI-145] BoViLA: Bootstrapping Video-Language Alignment via LLM-Based Self-Questioning and Answering

链接: https://arxiv.org/abs/2410.02768
作者: Jin Chen,Kaijing Ma,Haojian Huang,Jiayu Shen,Han Fang,Xianghao Zang,Chao Ban,Zhongjiang He,Hao Sun,Yanmei Kang
关键词-EN: demonstrating remarkable capabilities, rapidly advancing, remarkable capabilities, development of multi-modal, demonstrating remarkable
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The development of multi-modal models has been rapidly advancing, with some demonstrating remarkable capabilities. However, annotating video-text pairs remains expensive and insufficient. Taking video question answering (VideoQA) tasks as an example, human-annotated questions and answers often cover only part of the video, and similar semantics can also be expressed through different text forms, leading to underutilization of video. To address this, we propose BoViLA, a self-training framework that augments question samples during training through LLM-based self-questioning and answering, which helps the model exploit video information and the internal knowledge of LLMs more thoroughly to improve modality alignment. To filter bad self-generated questions, we introduce Evidential Deep Learning (EDL) to estimate uncertainty and assess the quality of self-generated questions by evaluating the modality alignment within the context. To the best of our knowledge, this work is the first to explore LLM-based self-training frameworks for modality alignment. We evaluate BoViLA on five strong VideoQA benchmarks, where it outperforms several state-of-the-art methods, demonstrating its effectiveness and generality. Additionally, we provide extensive analyses of the self-training framework and the EDL-based uncertainty filtering mechanism. The code will be made available at this https URL.

[AI-146] Optimizing food taste sensory evaluation through neural network-based taste electroencephalogram channel selection

链接: https://arxiv.org/abs/2410.03559
作者: Xiuxin Xia,Qun Wang,He Wang,Chenrui Liu,Pengwei Li,Yan Shi,Hong Men
关键词-EN: stimulation can reflect, reflect different brain, brain patterns, EEG, channel selection
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 33 pages, 13 figures

点击查看摘要

Abstract:The taste electroencephalogram (EEG) evoked by taste stimulation can reflect different brain patterns and be used in applications such as sensory evaluation of food. However, considering computational cost and efficiency, EEG data with many channels faces the critical issue of channel selection. This paper proposes a channel selection method called class activation mapping with attention (CAM-Attention). The CAM-Attention method combines a convolutional neural network with channel and spatial attention (CNN-CSA) model and a gradient-weighted class activation mapping (Grad-CAM) model. The CNN-CSA model exploits key features in EEG data through an attention mechanism, and the Grad-CAM model effectively realizes the visualization of feature regions. Channel selection is then effectively implemented based on these feature regions. Finally, the CAM-Attention method reduces the computational burden of taste EEG recognition and effectively distinguishes the four tastes. In short, it has excellent recognition performance and provides effective technical support for taste sensory evaluation.

[AI-147] Evaluating Investment Risks in LATAM AI Startups: Ranking of Investment Potential and Framework for Valuation

链接: https://arxiv.org/abs/2410.03552
作者: Abraham Ramos-Torres,Laura N. Montoya
关键词-EN: innovative entrepreneurs addressing, entrepreneurs addressing market, Serviceable Obtainable Market, Latin America, Total Addressable Market
类目: General Finance (q-fin.GN); Artificial Intelligence (cs.AI); Portfolio Management (q-fin.PM); Pricing of Securities (q-fin.PR)
*备注: 21 pages, 7 figures, 8 tables, Accepted for publication to the International Association for Applied Business Research Journal (IAABR)

点击查看摘要

Abstract:The growth of the tech startup ecosystem in Latin America (LATAM) is driven by innovative entrepreneurs addressing market needs across various sectors. However, these startups encounter unique challenges and risks that require specific management approaches. This paper explores a case study with the Total Addressable Market (TAM), Serviceable Available Market (SAM), and Serviceable Obtainable Market (SOM) metrics within the context of the online food delivery industry in LATAM, serving as a model for valuing startups using the Discounted Cash Flow (DCF) method. By analyzing key emerging powers such as Argentina, Colombia, Uruguay, Costa Rica, Panama, and Ecuador, the study highlights the potential and profitability of AI-driven startups in the region through the development of a ranking of emerging powers in Latin America for tech startup investment. The paper also examines the political, economic, and competitive risks faced by startups and offers strategic insights on mitigating these risks to maximize investment returns. Furthermore, the research underscores the value of diversifying investment portfolios with startups in emerging markets, emphasizing the opportunities for substantial growth and returns despite inherent risks.
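The DCF valuation method referenced above reduces to simple present-value arithmetic over projected cash flows. The sketch below is a minimal illustration with entirely hypothetical figures: the SOM, market-capture rates, margin, and discount rate are our assumptions, not the paper's case-study numbers.

```python
# Illustrative Discounted Cash Flow (DCF) sketch, not the paper's actual model:
# projected cash flows are derived from a SOM-based revenue estimate, then
# discounted and combined with a Gordon-growth terminal value.

def dcf_value(cash_flows, discount_rate, terminal_growth=0.02):
    """Present value of projected cash flows plus a discounted terminal value."""
    pv = sum(cf / (1 + discount_rate) ** t
             for t, cf in enumerate(cash_flows, start=1))
    # Terminal value at the end of the projection horizon, discounted back.
    terminal = cash_flows[-1] * (1 + terminal_growth) / (discount_rate - terminal_growth)
    pv += terminal / (1 + discount_rate) ** len(cash_flows)
    return pv

# Hypothetical startup: $50M SOM, capture growing from 4% to 10% over 5 years,
# 20% free-cash-flow margin, 25% discount rate reflecting LATAM venture risk.
som = 50_000_000
capture = [0.04, 0.055, 0.07, 0.085, 0.10]
cash_flows = [som * share * 0.20 for share in capture]
valuation = dcf_value(cash_flows, discount_rate=0.25)
print(f"Estimated valuation: ${valuation:,.0f}")
```

The high discount rate is the usual lever for the political and economic risks the paper discusses: raising it directly shrinks the present value of distant cash flows.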

[AI-148] owards Real-time Intrahepatic Vessel Identification in Intraoperative Ultrasound-Guided Liver Surgery MICCAI2024

链接: https://arxiv.org/abs/2410.03420
作者: Karl-Philippe Beaudet(IHU Strasbourg, UNISTRA, MIMESIS),Alexandros Karargyris(IHU Strasbourg, UNISTRA),Sidaty El Hadramy(UNISTRA, MIMESIS),Stéphane Cotin(UNISTRA, MIMESIS),Jean-Paul Mazellier(IHU Strasbourg, UNISTRA),Nicolas Padoy(IHU Strasbourg, UNISTRA),Juan Verde(IHU Strasbourg, UNISTRA, MIMESIS)
关键词-EN: traditional open surgery, complexity hinders widespread, hinders widespread adoption, widespread adoption due, maintains patient outcomes
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: MICCAI 2024, Oct 2024, Marrakech, Morocco

点击查看摘要

Abstract:While laparoscopic liver resection is less prone to complications and maintains patient outcomes compared to traditional open surgery, its complexity hinders widespread adoption due to challenges in representing the liver’s internal structure. Laparoscopic intraoperative ultrasound offers efficient, cost-effective and radiation-free guidance. Our objective is to aid physicians in identifying internal liver structures using laparoscopic intraoperative ultrasound. We propose a patient-specific approach using preoperative 3D ultrasound liver volume to train a deep learning model for real-time identification of portal tree and branch structures. Our personalized AI model, validated on ex vivo swine livers, achieved superior precision (0.95) and recall (0.93) compared to surgeons, laying groundwork for precise vessel identification in ultrasound-based liver resection. Its adaptability and potential clinical impact promise to advance surgical interventions and improve patient care.

[AI-149] An Enhanced Harmonic Densely Connected Hybrid Transformer Network Architecture for Chronic Wound Segmentation Utilising Multi-Colour Space Tensor Merging

链接: https://arxiv.org/abs/2410.03359
作者: Bill Cassidy,Christian Mcbride,Connah Kendrick,Neil D. Reeves,Joseph M. Pappachan,Cornelius J. Fernandez,Elias Chacko,Raphael Brüngel,Christoph M. Friedrich,Metib Alotaibi,Abdullah Abdulaziz AlWabel,Mohammad Alderwish,Kuan-Ying Lai,Moi Hoon Yap
关键词-EN: hospitals world wide, world wide, growing burdens, burdens for clinics, clinics and hospitals
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Chronic wounds and associated complications present ever-growing burdens for clinics and hospitals worldwide. Venous, arterial, diabetic, and pressure wounds are becoming increasingly common globally. These conditions can result in highly debilitating repercussions for those affected, with limb amputations and increased mortality risk resulting from infection becoming more common. New methods to assist clinicians in chronic wound care are therefore vital to maintain high quality care standards. This paper presents an improved HarDNet segmentation architecture which integrates a contrast-eliminating component in the initial layers of the network to enhance feature learning. We also utilise a multi-colour space tensor merging process and adjust the harmonic shape of the convolution blocks to facilitate these additional features. We train our proposed model using wound images from light-skinned patients and test the model on two test sets (one set with ground truth, and one without) comprising only darker-skinned cases. Subjective ratings are obtained from clinical wound experts with intraclass correlation coefficient used to determine inter-rater reliability. For the dark-skin tone test set with ground truth, we demonstrate improvements in terms of Dice similarity coefficient (+0.1221) and intersection over union (+0.1274). Qualitative analysis showed high expert ratings, with improvements of 3% demonstrated when comparing the baseline model with the proposed model. This paper presents the first study to focus on darker-skin tones for chronic wound segmentation using models trained only on wound images exhibiting lighter skin. Diabetes is highly prevalent in countries where patients have darker skin tones, highlighting the need for a greater focus on such cases. Additionally, we conduct the largest qualitative study to date for chronic wound segmentation.

[AI-150] Manikin-Recorded Cardiopulmonary Sounds Dataset Using Digital Stethoscope

链接: https://arxiv.org/abs/2410.03280
作者: Yasaman Torabi,Shahram Shirani,James P. Reilly
关键词-EN: healthcare monitoring, crucial for healthcare, Heart and lung, lung sounds, sounds
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Heart and lung sounds are crucial for healthcare monitoring. Recent improvements in stethoscope technology have made it possible to capture patient sounds with enhanced precision. In this dataset, we used a digital stethoscope to capture both heart and lung sounds, including individual and mixed recordings. To our knowledge, this is the first dataset to offer both separate and mixed cardiorespiratory sounds. The recordings were collected from a clinical manikin, a patient simulator designed to replicate human physiological conditions, generating clean heart and lung sounds at different body locations. This dataset includes both normal sounds and various abnormalities (i.e., murmur, atrial fibrillation, tachycardia, atrioventricular block, third and fourth heart sound, wheezing, crackles, rhonchi, pleural rub, and gurgling sounds). The dataset includes audio recordings of chest examinations performed at different anatomical locations, as determined by specialist nurses. Each recording has been enhanced using frequency filters to highlight specific sound types. This dataset is useful for applications in artificial intelligence, such as automated cardiopulmonary disease detection, sound classification, unsupervised separation techniques, and deep learning algorithms related to audio signal processing.
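数据集说明中提到用频率滤波突出特定音型(心音偏低频、肺音偏高频)。下面用 NumPy 给出一个基于 FFT 的理想带通滤波极简草图;其中采样率、截止频率和合成信号均为示意性假设,并非该数据集的实际处理参数:

```python
import numpy as np

def bandpass_fft(signal, fs, low_hz, high_hz):
    """把 [low_hz, high_hz] 以外的 FFT 频点清零后逆变换,实现理想带通。"""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    mask = (freqs >= low_hz) & (freqs <= high_hz)
    return np.fft.irfft(spectrum * mask, n=len(signal))

fs = 4000                                       # 示意性采样率 (Hz)
t = np.arange(fs) / fs                          # 1 秒信号
# 合成混合音:100 Hz 模拟"心音"分量 + 800 Hz 模拟"肺音"分量
mixed = np.sin(2 * np.pi * 100 * t) + np.sin(2 * np.pi * 800 * t)
heart_band = bandpass_fft(mixed, fs, 20, 200)   # 保留低频段以突出心音
```

对真实听诊音通常会改用带平滑过渡带的 IIR/FIR 滤波器,这里的理想带通只为说明"按频段分离心肺音"这一思路。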

[AI-151] MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech EMNLP2024

链接: https://arxiv.org/abs/2410.03192
作者: Taejun Bak,Youngsik Eom,SeungJae Choi,Young-Sun Joo
关键词-EN: achieved significant improvements, TTS, TTS systems, zero-shot multi-task TTS, multi-task TTS
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
*备注: Accepted to EMNLP 2024 Findings

点击查看摘要

Abstract:Text-to-speech (TTS) systems that scale up the amount of training data have achieved significant improvements in zero-shot speech synthesis. However, these systems have certain limitations: they require a large amount of training data, which increases costs, and often overlook prosody similarity. To address these issues, we propose MultiVerse, a zero-shot multi-task TTS system that is able to perform TTS or speech style transfer in zero-shot and cross-lingual conditions. MultiVerse requires much less training data than traditional data-driven approaches. To ensure zero-shot performance even with limited data, we leverage source-filter theory-based disentanglement, utilizing the prompt for modeling filter-related and source-related representations. Additionally, to further enhance prosody similarity, we adopt a prosody modeling approach combining prompt-based autoregressive and non-autoregressive methods. Evaluations demonstrate the remarkable zero-shot multi-task TTS performance of MultiVerse and show that MultiVerse not only achieves zero-shot TTS performance comparable to data-driven TTS systems with much less data, but also significantly outperforms other zero-shot TTS systems trained with the same small amount of data. In particular, our novel prosody modeling technique significantly contributes to MultiVerse’s ability to generate speech with high prosody similarity to the given prompts. Our samples are available at this https URL

[AI-152] FastAdaSP: Multitask-Adapted Efficient Inference for Large Speech Language Model EMNLP2024

链接: https://arxiv.org/abs/2410.03007
作者: Yichen Lu,Jiaqi Song,Chao-Han Huck Yang,Shinji Watanabe
关键词-EN: Speech Language Model, Multitask Speech Language, explore Multitask Speech, Language Model, explore Multitask
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: EMNLP 2024 Industry Track

点击查看摘要

Abstract:In this study, we aim to explore Multitask Speech Language Model (SpeechLM) efficient inference via token reduction. Unlike other modalities such as vision or text, speech has unique temporal dependencies, making previous efficient inference works on other modalities not directly applicable. Furthermore, methods for efficient SpeechLM inference on long sequence and sparse signals remain largely unexplored. Then we propose FastAdaSP, a weighted token merging framework specifically designed for various speech-related tasks to improve the trade-off between efficiency and performance. Experimental results on WavLLM and Qwen-Audio show that our method achieves the state-of-the-art (SOTA) efficiency-performance trade-off compared with other baseline methods. Specifically, FastAdaSP achieved 7x memory efficiency and 1.83x decoding throughput without any degradation on tasks like Emotion Recognition (ER) and Spoken Question Answering (SQA). The code will be available at this https URL
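摘要的核心是加权 token 合并。下面用 NumPy 给出一步合并的极简草图:找到余弦相似度最高的相邻 token 对,按累计权重加权平均合并,使序列长度减一。合并策略与权重含义均为示意性假设,FastAdaSP 的实际调度是按任务自适应的:

```python
import numpy as np

def merge_most_similar(tokens, weights):
    """合并余弦相似度最高的相邻 token 对,返回缩短后的序列与累计权重。"""
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sims = (normed[:-1] * normed[1:]).sum(axis=1)     # 相邻对的余弦相似度
    i = int(np.argmax(sims))
    w = weights[i] + weights[i + 1]
    merged = (weights[i] * tokens[i] + weights[i + 1] * tokens[i + 1]) / w
    new_tokens = np.concatenate([tokens[:i], merged[None], tokens[i + 2:]])
    new_weights = np.concatenate([weights[:i], [w], weights[i + 2:]])
    return new_tokens, new_weights

rng = np.random.default_rng(3)
toks, ws = merge_most_similar(rng.normal(size=(10, 16)), np.ones(10))
```

反复调用即可把长语音序列压缩到目标长度;记录的累计权重使后续合并仍等价于对原始 token 的加权平均。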

[AI-153] Deep Signature: Characterization of Large-Scale Molecular Dynamics

链接: https://arxiv.org/abs/2410.02847
作者: Tiexin Qin,Mengxu Zhu,Chunyang Li,Terry Lyons,Hong Yan,Haoliang Li
关键词-EN: Understanding protein dynamics, developing molecular therapies, deciphering protein functional, protein functional mechanisms, Understanding protein
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
*备注: 17 page, 8 figures

点击查看摘要

Abstract:Understanding protein dynamics is essential for deciphering protein functional mechanisms and developing molecular therapies. However, the complex high-dimensional dynamics and interatomic interactions of biological processes pose a significant challenge for existing computational techniques. In this paper, we approach this problem for the first time by introducing Deep Signature, a novel computationally tractable framework that characterizes complex dynamics and interatomic interactions based on their evolving trajectories. Specifically, our approach incorporates soft spectral clustering that locally aggregates cooperative dynamics to reduce the size of the system, as well as signature transform that collects iterated integrals to provide a global characterization of the non-smooth interactive dynamics. Theoretical analysis demonstrates that Deep Signature exhibits several desirable properties, including invariance to translation, near invariance to rotation, equivariance to permutation of atomic coordinates, and invariance under time reparameterization. Furthermore, experimental results on three benchmarks of biological processes verify that our approach can achieve superior performance compared to baseline methods.
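摘要中的 signature transform 用迭代积分刻画轨迹。下面用 NumPy 给出分段线性路径一阶、二阶签名的极简草图(仅为说明概念,并非 Deep Signature 的实现);二阶签名满足 shuffle 恒等式 S2 + S2ᵀ = S1 ⊗ S1,可作自检:

```python
import numpy as np

def signature_level2(path):
    """一条分段线性路径 (N x d) 的一阶/二阶签名(迭代积分)的精确计算。"""
    deltas = np.diff(path, axis=0)                    # 每段的增量 dx
    s1 = deltas.sum(axis=0)                           # 一阶签名:总位移
    prefix = path[:-1] - path[0]                      # 每段起点相对路径起点的位移
    s2 = prefix.T @ deltas + 0.5 * deltas.T @ deltas  # 二阶签名: iint_{s<t} dx_s (x) dx_t
    return s1, s2

t = np.linspace(0.0, 1.0, 101)
circle = np.stack([np.cos(2 * np.pi * t), np.sin(2 * np.pi * t)], axis=1)
s1, s2 = signature_level2(circle)
levy_area = 0.5 * (s2[0, 1] - s2[1, 0])              # 反对称部分:Lévy 面积
```

二阶签名的反对称部分(Lévy 面积)对直线路径为零,对环绕轨迹非零,正是它捕捉到了"非光滑交互动力学"中的旋转信息。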

[AI-154] CAnDOIT: Causal Discovery with Observational and Interventional Data from Time-Series

链接: https://arxiv.org/abs/2410.02844
作者: Luca Castri,Sariah Mghames,Marc Hanheide,Nicola Bellotto
关键词-EN: branches of science, intelligent systems, utmost importance, causal, data
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Published in Advanced Intelligent Systems

点击查看摘要

Abstract:The study of cause-and-effect is of the utmost importance in many branches of science, but also for many practical applications of intelligent systems. In particular, identifying causal relationships in situations that include hidden factors is a major challenge for methods that rely solely on observational data for building causal models. This paper proposes CAnDOIT, a causal discovery method to reconstruct causal models using both observational and interventional time-series data. The use of interventional data in the causal analysis is crucial for real-world applications, such as robotics, where the scenario is highly complex and observational data alone are often insufficient to uncover the correct causal structure. Validation of the method is performed initially on randomly generated synthetic models and subsequently on a well-known benchmark for causal structure learning in a robotic manipulation environment. The experiments demonstrate that the approach can effectively handle data from interventions and exploit them to enhance the accuracy of the causal analysis. A Python implementation of CAnDOIT has also been developed and is publicly available on GitHub: this https URL.

[AI-155] KLDD: Kalman Filter based Linear Deformable Diffusion Model in Retinal Image Segmentation

链接: https://arxiv.org/abs/2410.02808
作者: Zhihao Zhao,Yinzheng Zhao,Junjie Yang,Kai Huang,Nassir Navab,M. Ali Nasseri
关键词-EN: AI-based vascular segmentation, linear deformable convolution, vascular structures, Linear Deformable, ophthalmic diseases
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at BIBM 2024

点击查看摘要

Abstract:AI-based vascular segmentation is becoming increasingly common in enhancing the screening and treatment of ophthalmic diseases. Deep learning structures based on U-Net have achieved relatively good performance in vascular segmentation. However, small blood vessels and capillaries tend to be lost during segmentation when passed through the traditional U-Net downsampling module. To address this gap, this paper proposes a novel Kalman filter based Linear Deformable Diffusion (KLDD) model for retinal vessel segmentation. Our model employs a diffusion process that iteratively refines the segmentation, leveraging the flexible receptive fields of deformable convolutions in feature extraction modules to adapt to the detailed tubular vascular structures. More specifically, we first employ a feature extractor with linear deformable convolution to capture vascular structure information from the input images. To better optimize the coordinate positions of deformable convolution, we employ the Kalman filter to enhance the perception of vascular structures in linear deformable convolution. Subsequently, the features of the vascular structures extracted are utilized as a conditioning element within a diffusion model by the Cross-Attention Aggregation module (CAAM) and the Channel-wise Soft Attention module (CSAM). These aggregations are designed to enhance the diffusion model’s capability to generate vascular structures. Experiments are conducted on retinal fundus image datasets (DRIVE, CHASE_DB1) as well as the 3mm and 6mm subsets of the OCTA-500 dataset, and the results show that the diffusion model proposed in this paper outperforms other methods.

[AI-156] AutoPETIII: The Tracer Frontier. What Frontier?

链接: https://arxiv.org/abs/2410.02807
作者: Zacharia Mesbah,Léo Mottay,Romain Modzelewski,Pierre Decazes,Sébastien Hapdey,Su Ruan,Sébastien Thureau
关键词-EN: Positron Emitting Tomography, Emitting Tomography, AutoPET competition gathered, medical imaging community, Positron Emitting
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:For the last three years, the AutoPET competition gathered the medical imaging community around a hot topic: lesion segmentation on Positron Emitting Tomography (PET) scans. Each year a different aspect of the problem is presented; in 2024 the multiplicity of existing and used tracers was at the core of the challenge. Specifically, this year’s edition aims to develop a fully automatic algorithm capable of performing lesion segmentation on a PET/CT scan, without knowing the tracer, which can be either an FDG- or a PSMA-based tracer. In this paper we describe how we used the nnUNetv2 framework to train two sets of 6-fold ensembles of models to perform fully automatic PET/CT lesion segmentation as well as a MIP-CNN to choose which set of models to use for segmentation.

[AI-157] Trust-informed Decision-Making Through An Uncertainty-Aware Stacked Neural Networks Framework: Case Study in COVID-19 Classification

链接: https://arxiv.org/abs/2410.02805
作者: Hassan Gharoun,Mohammad Sadegh Khorshidi,Fang Chen,Amir H. Gandomi
关键词-EN: stacked neural networks, uncertainty-aware stacked neural, neural networks model, radiological images, study presents
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 7 figures, 6 tables

点击查看摘要

Abstract:This study presents an uncertainty-aware stacked neural networks model for the reliable classification of COVID-19 from radiological images. The model addresses the critical gap in uncertainty-aware modeling by focusing on accurately identifying confidently correct predictions while alerting users to confidently incorrect and uncertain predictions, which can promote trust in automated systems. The architecture integrates uncertainty quantification methods, including Monte Carlo dropout and ensemble techniques, to enhance predictive reliability by assessing the certainty of diagnostic predictions. Within a two-tier model framework, the tier one model generates initial predictions and associated uncertainties, which the second tier model uses to produce a trust indicator alongside the diagnostic outcome. This dual-output model not only predicts COVID-19 cases but also provides a trust flag, indicating the reliability of each diagnosis and aiming to minimize the need for retesting and expert verification. The effectiveness of this approach is demonstrated through extensive experiments on the COVIDx CXR-4 dataset, showing a novel approach in identifying and handling confidently incorrect cases and uncertain cases, thus enhancing the trustworthiness of automated diagnostics in clinical settings.
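摘要提到用 Monte Carlo dropout 量化预测不确定性。下面用 NumPy 给出其核心思想的极简草图:推理阶段保持 dropout 开启,对多次随机前向的 softmax 输出取均值作预测、取标准差作不确定度。网络结构、丢弃率等均为示意性假设,并非论文的两级模型:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_predict(x, w1, w2, n_passes=200, p_drop=0.3):
    """Monte Carlo dropout:多次带随机掩码的前向传播,聚合得到均值与不确定度。"""
    preds = []
    for _ in range(n_passes):
        mask = rng.random(w1.shape[1]) >= p_drop               # 随机丢弃隐层单元
        h = np.maximum((x @ w1) * mask / (1.0 - p_drop), 0.0)  # inverted dropout + ReLU
        logits = h @ w2
        e = np.exp(logits - logits.max())
        preds.append(e / e.sum())
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.std(axis=0)

w1 = rng.normal(size=(4, 16))
w2 = rng.normal(size=(16, 2))
mean_p, std_p = mc_dropout_predict(rng.normal(size=4), w1, w2)
```

第二级模型即可把 (mean_p, std_p) 这类统计量作为输入,学习输出"可信/需复核"的信任标志。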

计算机视觉

[CV-0] Estimating Body and Hand Motion in an Ego-sensed World

链接: https://arxiv.org/abs/2410.03665
作者: Brent Yi,Vickie Ye,Maya Zheng,Lea Müller,Georgios Pavlakos,Yi Ma,Jitendra Malik,Angjoo Kanazawa
关键词-EN: head-mounted device, human motion estimation, egocentric SLAM poses, present EgoAllo, human motion
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Project page: this https URL

点击查看摘要

Abstract:We present EgoAllo, a system for human motion estimation from a head-mounted device. Using only egocentric SLAM poses and images, EgoAllo guides sampling from a conditional diffusion model to estimate 3D body pose, height, and hand parameters that capture the wearer’s actions in the allocentric coordinate frame of the scene. To achieve this, our key insight is in representation: we propose spatial and temporal invariance criteria for improving model performance, from which we derive a head motion conditioning parameterization that improves estimation by up to 18%. We also show how the bodies estimated by our system can improve the hands: the resulting kinematic and temporal constraints result in over 40% lower hand estimation errors compared to noisy monocular estimates. Project page: this https URL

[CV-1] Unraveling Cross-Modality Knowledge Conflict in Large Vision-Language Models

链接: https://arxiv.org/abs/2410.03659
作者: Tinghui Zhu,Qin Liu,Fei Wang,Zhengzhong Tu,Muhao Chen
关键词-EN: Large Vision-Language Models, demonstrated impressive capabilities, Large Vision-Language, multimodal inputs, demonstrated impressive
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: Website: this https URL

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities for capturing and reasoning over multimodal inputs. However, these models are prone to parametric knowledge conflicts, which arise from inconsistencies of represented knowledge between their vision and language components. In this paper, we formally define the problem of \textbfcross-modality parametric knowledge conflict and present a systematic approach to detect, interpret, and mitigate them. We introduce a pipeline that identifies conflicts between visual and textual answers, showing a persistently high conflict rate across modalities in recent LVLMs regardless of the model size. We further investigate how these conflicts interfere with the inference process and propose a contrastive metric to discern the conflicting samples from the others. Building on these insights, we develop a novel dynamic contrastive decoding method that removes undesirable logits inferred from the less confident modality components based on answer confidence. For models that do not provide logits, we also introduce two prompt-based strategies to mitigate the conflicts. Our methods achieve promising improvements in accuracy on both the ViQuAE and InfoSeek datasets. Specifically, using LLaVA-34B, our proposed dynamic contrastive decoding improves an average accuracy of 2.24%.
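摘要提出的动态对比解码按置信度压低较弱模态分支引入的 logits。下面用 NumPy 给出对比解码一般形式的玩具草图:score = log p_full − α·log p_weak,且仅在较弱分支置信度更低时才做对比。该门控规则与变量命名均为示意性假设,并非论文的精确公式:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dynamic_contrastive_decode(logits_full, logits_weak, alpha=1.0):
    """对比解码:用较弱分支的分布惩罚其偏好的候选 token。"""
    log_full = np.log(softmax(logits_full))
    if softmax(logits_full).max() >= softmax(logits_weak).max():
        return log_full - alpha * np.log(softmax(logits_weak))  # 仅在弱分支更不自信时对比
    return log_full

logits_full = np.array([3.0, 0.0, 0.0])   # 较可信分支,偏向 token 0
logits_weak = np.array([0.0, 1.0, 0.0])   # 置信度较低的分支,偏向 token 1
scores = dynamic_contrastive_decode(logits_full, logits_weak)
```

对比后 token 1(弱分支偏好的候选)的相对得分被进一步压低,体现了"移除由较不自信模态推断出的不良 logits"这一思路。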

[CV-2] GenSim2: Scaling Robot Data Generation with Multi-modal and Reasoning LLMs

链接: https://arxiv.org/abs/2410.03645
作者: Pu Hua,Minghuan Liu,Annabella Macaluso,Yunfeng Lin,Weinan Zhang,Huazhe Xu,Lirui Wang
关键词-EN: Robotic simulation today, today remains challenging, simulation today remains, create diverse simulation, Robotic simulation
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: CoRL 2024. Project website: this https URL

点击查看摘要

Abstract:Robotic simulation today remains challenging to scale up due to the human efforts required to create diverse simulation tasks and scenes. Simulation-trained policies also face scalability issues as many sim-to-real methods focus on a single task. To address these challenges, this work proposes GenSim2, a scalable framework that leverages coding LLMs with multi-modal and reasoning capabilities for complex and realistic simulation task creation, including long-horizon tasks with articulated objects. To automatically generate demonstration data for these tasks at scale, we propose planning and RL solvers that generalize within object categories. The pipeline can generate data for up to 100 articulated tasks with 200 objects and reduce the required human efforts. To utilize such data, we propose an effective multi-task language-conditioned policy architecture, dubbed proprioceptive point-cloud transformer (PPT), that learns from the generated demonstrations and exhibits strong sim-to-real zero-shot transfer. Combining the proposed pipeline and the policy architecture, we show a promising usage of GenSim2 that the generated data can be used for zero-shot transfer or co-train with real-world collected data, which enhances the policy performance by 20% compared with training exclusively on limited real data.

[CV-3] Unlearnable 3D Point Clouds: Class-wise Transformation Is All You Need NEURIPS2024

链接: https://arxiv.org/abs/2410.03644
作者: Xianlong Wang,Minghui Li,Wei Liu,Hangtao Zhang,Shengshan Hu,Yechao Zhang,Ziqi Zhou,Hai Jin
关键词-EN: Traditional unlearnable strategies, Traditional unlearnable, prevent unauthorized users, data, Traditional
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Traditional unlearnable strategies have been proposed to prevent unauthorized users from training on the 2D image data. With more 3D point cloud data containing sensitivity information, unauthorized usage of this new type data has also become a serious concern. To address this, we propose the first integral unlearnable framework for 3D point clouds including two processes: (i) we propose an unlearnable data protection scheme, involving a class-wise setting established by a category-adaptive allocation strategy and multi-transformations assigned to samples; (ii) we propose a data restoration scheme that utilizes class-wise inverse matrix transformation, thus enabling authorized-only training for unlearnable data. This restoration process is a practical issue overlooked in most existing unlearnable literature, \ie, even authorized users struggle to gain knowledge from 3D unlearnable data. Both theoretical and empirical results (including 6 datasets, 16 models, and 2 tasks) demonstrate the effectiveness of our proposed unlearnable framework. Our code is available at \urlthis https URL
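摘要的核心机制是"按类施加可逆变换保护数据 + 授权方用逆矩阵恢复"。下面用 NumPy 给出一个极简草图,以按类别确定的旋转矩阵作为示意性替代(论文实际使用类别自适应分配的多重变换):

```python
import numpy as np

rng = np.random.default_rng(42)

def class_rotation(class_id, n_classes=10):
    """按类别确定的绕 z 轴旋转矩阵(示意性替代论文中按类分配的多重变换)。"""
    theta = 2.0 * np.pi * class_id / n_classes
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def protect(points, class_id):
    """生成"不可学习"版本:对整个点云施加类别相关的线性变换。"""
    return points @ class_rotation(class_id).T

def restore(points, class_id):
    """授权用户用逆矩阵变换恢复原始点云。"""
    return points @ np.linalg.inv(class_rotation(class_id)).T

cloud = rng.normal(size=(1024, 3))
protected = protect(cloud, class_id=3)
recovered = restore(protected, class_id=3)
```

关键在于变换按类别系统性地扭曲了"几何—标签"的对应关系,使未授权训练学到捷径特征;而持有类别-矩阵映射的授权方可无损还原。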

[CV-4] Variational Bayes Gaussian Splatting

链接: https://arxiv.org/abs/2410.03592
作者: Toon Van de Maele,Ozan Catal,Alexander Tschantz,Christopher L. Buckley,Tim Verbelen
关键词-EN: Bayes Gaussian Splatting, Gaussian Splatting, scenes using mixtures, Recently, Variational Bayes Gaussian
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recently, 3D Gaussian Splatting has emerged as a promising approach for modeling 3D scenes using mixtures of Gaussians. The predominant optimization method for these models relies on backpropagating gradients through a differentiable rendering pipeline, which struggles with catastrophic forgetting when dealing with continuous streams of data. To address this limitation, we propose Variational Bayes Gaussian Splatting (VBGS), a novel approach that frames training a Gaussian splat as variational inference over model parameters. By leveraging the conjugacy properties of multivariate Gaussians, we derive a closed-form variational update rule, allowing efficient updates from partial, sequential observations without the need for replay buffers. Our experiments show that VBGS not only matches state-of-the-art performance on static datasets, but also enables continual learning from sequentially streamed 2D and 3D data, drastically improving performance in this setting.
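摘要利用多元高斯的共轭性得到闭式变分更新,从而无需回放缓冲即可流式学习。下面用一维已知观测精度的高斯均值共轭更新作极简示意(并非 VBGS 对 splat 参数的完整更新):流式分批更新与整批更新给出完全相同的后验:

```python
import numpy as np

def update_gaussian_mean(mu0, prec0, data, obs_prec):
    """已知观测精度时,高斯均值的共轭闭式更新:精度相加、均值按精度加权。"""
    prec_n = prec0 + len(data) * obs_prec
    mu_n = (prec0 * mu0 + obs_prec * data.sum()) / prec_n
    return mu_n, prec_n

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=0.5, size=100)

# 分两批流式更新 vs. 一次性整批更新:共轭性保证两者后验一致,无需回放缓冲
mu_a, p_a = update_gaussian_mean(0.0, 1.0, data[:60], obs_prec=4.0)
mu_a, p_a = update_gaussian_mean(mu_a, p_a, data[60:], obs_prec=4.0)
mu_b, p_b = update_gaussian_mean(0.0, 1.0, data, obs_prec=4.0)
```

这种"后验即下一批的先验"的性质正是闭式更新能避免灾难性遗忘的原因。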

[CV-5] Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models

链接: https://arxiv.org/abs/2410.03577
作者: Xin Zou,Yizhou Wang,Yibo Yan,Sirui Huang,Kening Zheng,Junkai Chen,Chang Tang,Xuming Hu
关键词-EN: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, assertively fabricating content
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Despite their impressive capabilities, Multimodal Large Language Models (MLLMs) are susceptible to hallucinations, especially assertively fabricating content not present in the visual inputs. To address the aforementioned challenge, we follow a common cognitive process - when one’s initial memory of critical on-sight details fades, it is intuitive to look at them a second time to seek a factual and accurate answer. Therefore, we introduce Memory-space Visual Retracing (MemVR), a novel hallucination mitigation paradigm that requires neither external knowledge retrieval nor additional fine-tuning. In particular, we treat visual prompts as supplementary evidence to be reinjected into MLLMs via Feed Forward Network (FFN) as key-value memory, when the model is uncertain or even amnesic about question-relevant visual memories. Comprehensive experimental evaluations demonstrate that MemVR significantly mitigates hallucination issues across various MLLMs and excels in general benchmarks without incurring added time overhead, thus emphasizing its potential for widespread applicability.
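摘要的机制是:模型不确定时,把视觉特征当作 key-value 记忆重新注入。下面用 NumPy 给出一个熵门控检索的极简草图;门控阈值、各张量形状与函数命名均为示意性假设,并非 MemVR 在 FFN 内的真实注入方式:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    return float(-(p * np.log(p + 1e-12)).sum())

def retrace(hidden, vis_keys, vis_values, logits, threshold=1.0):
    """输出熵高(不确定)时,用隐状态检索视觉 key-value 记忆并注入。"""
    if entropy(softmax(logits)) < threshold:
        return hidden                          # 足够自信:不回溯
    attn = softmax(vis_keys @ hidden)          # 隐状态与视觉 key 匹配
    return hidden + attn @ vis_values          # 重新注入视觉证据

hidden = np.array([1.0, 0.0, 0.0, 0.0])
vis_keys = np.eye(3, 4)                        # 3 个视觉 token 的 key
vis_values = np.ones((3, 4))                   # 对应的 value
confident = retrace(hidden, vis_keys, vis_values, logits=np.array([10.0, 0.0, 0.0]))
uncertain = retrace(hidden, vis_keys, vis_values, logits=np.zeros(3))
```

自信时隐状态原样通过、不确定时才"再看一眼",这正对应摘要所述不增加固定时间开销的设计。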

[CV-6] Not All Diffusion Model Activations Have Been Evaluated as Discriminative Features

链接: https://arxiv.org/abs/2410.03558
作者: Benyuan Meng,Qianqian Xu,Zitai Wang,Xiaochun Cao,Qingming Huang
关键词-EN: image generation, initially designed, designed for image, activations, Diffusion models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Diffusion models are initially designed for image generation. Recent research shows that the internal signals within their backbones, named activations, can also serve as dense features for various discriminative tasks such as semantic segmentation. Given numerous activations, selecting a small yet effective subset poses a fundamental problem. To this end, early studies in this field performed a large-scale quantitative comparison of the discriminative ability of the activations. However, we find that many potential activations have not been evaluated, such as the queries and keys used to compute attention scores. Moreover, recent advancements in diffusion architectures bring many new activations, such as those within embedded ViT modules. Taken together, activation selection remains an unresolved yet overlooked problem. To tackle this issue, this paper takes a further step with a much broader range of activations evaluated. Considering the significant increase in activations, a full-scale quantitative comparison is no longer operational. Instead, we seek to understand the properties of these activations, such that the activations that are clearly inferior can be filtered out in advance via simple qualitative evaluation. After careful analysis, we discover three properties universal among diffusion models, enabling this study to go beyond specific models. On top of this, we present effective feature selection solutions for several popular diffusion models. Finally, the experiments across multiple discriminative tasks validate the superiority of our method over the SOTA competitors. Our code is available at this https URL.

[CV-7] BodyShapeGPT: SMPL Body Shape Manipulation with LLMs ECCV2024

链接: https://arxiv.org/abs/2410.03556
作者: Baldomero R. Árbol,Dan Casas
关键词-EN: performing complex tasks, Large Language Models, provide a wide, wide range, range of tools
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted to ECCV 2024 Workshop on Foundation Models for 3D Humans. Code repository: this https URL

点击查看摘要

Abstract:Generative AI models provide a wide range of tools capable of performing complex tasks in a fraction of the time it would take a human. Among these, Large Language Models (LLMs) stand out for their ability to generate diverse texts, from literary narratives to specialized responses in different fields of knowledge. This paper explores the use of fine-tuned LLMs to identify physical descriptions of people, and subsequently create accurate representations of avatars using the SMPL-X model by inferring shape parameters. We demonstrate that LLMs can be trained to understand and manipulate the shape space of SMPL, allowing the control of 3D human shapes through natural language. This approach promises to improve human-machine interaction and opens new avenues for customization and simulation in virtual environments.

[CV-8] Enhancing Autonomous Navigation by Imaging Hidden Objects using Single-Photon LiDAR

链接: https://arxiv.org/abs/2410.03555
作者: Aaron Young,Nevindu M. Batagoda,Harry Zhang,Akshat Dave,Adithya Pediredla,Dan Negrut,Ramesh Raskar
关键词-EN: Robust autonomous navigation, limited visibility remains, Robust autonomous, autonomous navigation, remains a critical
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: Project webpage: this https URL

点击查看摘要

Abstract:Robust autonomous navigation in environments with limited visibility remains a critical challenge in robotics. We present a novel approach that leverages Non-Line-of-Sight (NLOS) sensing using single-photon LiDAR to improve visibility and enhance autonomous navigation. Our method enables mobile robots to “see around corners” by utilizing multi-bounce light information, effectively expanding their perceptual range without additional infrastructure. We propose a three-module pipeline: (1) Sensing, which captures multi-bounce histograms using SPAD-based LiDAR; (2) Perception, which estimates occupancy maps of hidden regions from these histograms using a convolutional neural network; and (3) Control, which allows a robot to follow safe paths based on the estimated occupancy. We evaluate our approach through simulations and real-world experiments on a mobile robot navigating an L-shaped corridor with hidden obstacles. Our work represents the first experimental demonstration of NLOS imaging for autonomous navigation, paving the way for safer and more efficient robotic systems operating in complex environments. We also contribute a novel dynamics-integrated transient rendering framework for simulating NLOS scenarios, facilitating future research in this domain.

[CV-9] Constructive Apraxia: An Unexpected Limit of Instructible Vision-Language Models and Analog for Human Cognitive Disorders

链接: https://arxiv.org/abs/2410.03551
作者: David Noever,Samantha E. Miller Noever
关键词-EN: human cognitive disorders, instructible vision-language models, specifically constructive apraxia, cognitive disorders, study reveals
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:This study reveals an unexpected parallel between instructible vision-language models (VLMs) and human cognitive disorders, specifically constructive apraxia. We tested 25 state-of-the-art VLMs, including GPT-4 Vision, DALL-E 3, and Midjourney v5, on their ability to generate images of the Ponzo illusion, a task that requires basic spatial reasoning and is often used in clinical assessments of constructive apraxia. Remarkably, 24 out of 25 models failed to correctly render two horizontal lines against a perspective background, mirroring the deficits seen in patients with parietal lobe damage. The models consistently misinterpreted spatial instructions, producing tilted or misaligned lines that followed the perspective of the background rather than remaining horizontal. This behavior is strikingly similar to how apraxia patients struggle to copy or construct simple figures despite intact visual perception and motor skills. Our findings suggest that current VLMs, despite their advanced capabilities in other domains, lack fundamental spatial reasoning abilities akin to those impaired in constructive apraxia. This limitation in AI systems provides a novel computational model for studying spatial cognition deficits and highlights a critical area for improvement in VLM architecture and training methodologies.

[CV-10] Dreaming User Multimodal Representation for Micro-Video Recommendation

链接: https://arxiv.org/abs/2410.03538
作者: Chengzhi Lin,Hezheng Lin,Shuchang Liu,Cangguang Ruan,LingJing Xu,Dezhao Yang,Chuyuan Wang,Yongqi Liu
关键词-EN: advanced recommender systems, mitigate information overload, deliver tailored content, Platonic Representation Hypothesis, underscored the necessity
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The proliferation of online micro-video platforms has underscored the necessity for advanced recommender systems to mitigate information overload and deliver tailored content. Despite advancements, accurately and promptly capturing dynamic user interests remains a formidable challenge. Inspired by the Platonic Representation Hypothesis, which posits that different data modalities converge towards a shared statistical model of reality, we introduce DreamUMM (Dreaming User Multi-Modal Representation), a novel approach leveraging user historical behaviors to create real-time user representations in a multimodal space. DreamUMM employs a closed-form solution correlating user video preferences with multimodal similarity, hypothesizing that user interests can be effectively represented in a unified multimodal space. Additionally, we propose Candidate-DreamUMM for scenarios lacking recent user behavior data, inferring interests from candidate videos alone. Extensive online A/B tests demonstrate significant improvements in user engagement metrics, including active days and play count. The successful deployment of DreamUMM in two micro-video platforms with hundreds of millions of daily active users illustrates its practical efficacy and scalability in personalized micro-video content delivery. Our work contributes to the ongoing exploration of representational convergence by providing empirical evidence supporting the potential for user interest representations to reside in a multimodal space.
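
The abstract does not spell out the closed-form solution, but a plausible minimal sketch is an engagement-weighted average of watched-video embeddings in the shared multimodal space (function and variable names here are hypothetical, not from the paper):

```python
import numpy as np

def user_representation(video_embs: np.ndarray, engagement: np.ndarray) -> np.ndarray:
    """Closed-form user vector sketch: engagement-weighted average of the
    multimodal embeddings of recently watched videos, L2-normalised so that
    dot products with candidate-video embeddings act as cosine similarities."""
    weights = engagement / engagement.sum()
    u = weights @ video_embs
    return u / np.linalg.norm(u)

# Toy example: two watched videos with 2-D multimodal embeddings.
embs = np.array([[1.0, 0.0], [0.0, 1.0]])
u = user_representation(embs, np.array([3.0, 1.0]))
```

Because the result is normalised, ranking candidate videos reduces to a single dot product per candidate, which is what makes a real-time closed form attractive at this scale.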

[CV-11] Computer Vision Intelligence Test Modeling and Generation: A Case Study on Smart OCR

链接: https://arxiv.org/abs/2410.03536
作者: Jing Shu,Bing-Jiun Miu,Eugene Chang,Jerry Gao,Jun Liu
关键词-EN: AI-based systems possess, systems possess distinctive, possess distinctive characteristics, AI-based systems, systems possess
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:AI-based systems possess distinctive characteristics and introduce challenges in quality evaluation at the same time. Consequently, ensuring and validating AI software quality is of critical importance. In this paper, we present an effective AI software functional testing model to address this challenge. Specifically, we first present a comprehensive literature review of previous work, covering key facets of AI software testing processes. We then introduce a 3D classification model to systematically evaluate the image-based text extraction AI function, as well as test coverage criteria and complexity. To evaluate the performance of our proposed AI software quality test, we propose four evaluation metrics to cover different aspects. Finally, based on the proposed framework and defined metrics, a mobile Optical Character Recognition (OCR) case study is presented to demonstrate the framework’s effectiveness and capability in assessing AI function quality.

[CV-12] Classification-Denoising Networks

链接: https://arxiv.org/abs/2410.03505
作者: Louis Thiry,Florentin Guth
关键词-EN: ignoring conditioning information, partially ignoring conditioning, suffer from complementary, complementary issues, issues of lack
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 18 pages, 5 figures

点击查看摘要

Abstract:Image classification and denoising suffer from complementary issues of lack of robustness or partially ignoring conditioning information. We argue that they can be alleviated by unifying both tasks through a model of the joint probability of (noisy) images and class labels. Classification is performed with a forward pass followed by conditioning. Using the Tweedie-Miyasawa formula, we evaluate the denoising function with the score, which can be computed by marginalization and back-propagation. The training objective is then a combination of cross-entropy loss and denoising score matching loss integrated over noise levels. Numerical experiments on CIFAR-10 and ImageNet show competitive classification and denoising performance compared to reference deep convolutional classifiers/denoisers, and significantly improved efficiency compared to previous joint approaches. Our model shows an increased robustness to adversarial perturbations compared to a standard discriminative classifier, and allows for a novel interpretation of adversarial gradients as a difference of denoisers.
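
The Tweedie-Miyasawa formula the abstract relies on relates the optimal denoiser to the score of the noisy-image distribution; in standard notation (our transcription, not copied from the paper):

```latex
% Noise model: y = x + \sigma \varepsilon with \varepsilon \sim \mathcal{N}(0, I).
% The MMSE denoiser is the noisy input plus a score correction:
\hat{x}(y) = \mathbb{E}[x \mid y] = y + \sigma^{2} \, \nabla_{y} \log p_{\sigma}(y)
```

This is why a model of the joint probability suffices: the score of the noisy marginal can be obtained by marginalizing over class labels and back-propagating, exactly as described above.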

[CV-13] FedStein: Enhancing Multi-Domain Federated Learning Through James-Stein Estimator NEURIPS’24 NEURIPS2024

链接: https://arxiv.org/abs/2410.03499
作者: Sunny Gupta,Nikita Jangid,Amit Sethi
关键词-EN: enabling collaborative in-situ, collaborative in-situ training, facilitates data privacy, Federated Learning, Multi-Domain Federated Learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 12 pages, 2 figures. Accepted at International Workshop on Federated Foundation Models In Conjunction with NeurIPS 2024 (FL@FM-NeurIPS’24)

点击查看摘要

Abstract:Federated Learning (FL) facilitates data privacy by enabling collaborative in-situ training across decentralized clients. Despite its inherent advantages, FL faces significant challenges of performance and convergence when dealing with data that is not independently and identically distributed (non-i.i.d.). While previous research has primarily addressed the issue of skewed label distribution across clients, this study focuses on the less explored challenge of multi-domain FL, where client data originates from distinct domains with varying feature distributions. We introduce a novel method designed to address these challenges: FedStein (Enhancing Multi-Domain Federated Learning Through the James-Stein Estimator). FedStein uniquely shares only the James-Stein (JS) estimates of batch normalization (BN) statistics across clients, while maintaining local BN parameters. The non-BN layer parameters are exchanged via standard FL techniques. Extensive experiments conducted across three datasets and multiple models demonstrate that FedStein surpasses existing methods such as FedAvg and FedBN, with accuracy improvements exceeding 14% in certain domains, leading to enhanced domain generalization. The code is available at this https URL
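
The James-Stein shrinkage at the heart of FedStein can be sketched as follows. This is a minimal illustration of the estimator itself, not the authors' implementation: it assumes unit noise variance and applies positive-part shrinkage of each client's BN mean vector toward the cross-client average (all names are hypothetical):

```python
import numpy as np

def james_stein_estimate(client_means: np.ndarray) -> np.ndarray:
    """Shrink each client's BN mean vector toward the cross-client average
    using a positive-part James-Stein factor (unit noise variance assumed).

    client_means: array of shape (num_clients, d), with d >= 3.
    """
    grand_mean = client_means.mean(axis=0)
    d = client_means.shape[1]
    shrunk = np.empty_like(client_means)
    for i, x in enumerate(client_means):
        resid = x - grand_mean
        norm_sq = float(resid @ resid)
        # Positive-part James-Stein shrinkage factor.
        factor = max(0.0, 1.0 - (d - 2) / norm_sq) if norm_sq > 0 else 0.0
        shrunk[i] = grand_mean + factor * resid
    return shrunk

# Toy BN mean statistics from three clients over a 4-channel layer.
stats = np.array([[1.0, 2.0, 3.0, 4.0],
                  [1.5, 2.5, 2.5, 3.5],
                  [0.5, 1.5, 3.5, 4.5]])
js = james_stein_estimate(stats)
```

The point of sharing only these shrunk statistics is that clients pool information across domains without overwriting their locally adapted BN parameters.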

[CV-14] A Multimodal Framework for Deepfake Detection

链接: https://arxiv.org/abs/2410.03487
作者: Kashish Gandhi,Prutha Kulkarni,Taran Shah,Piyush Chaudhari,Meera Narvekar,Kranti Ghag
关键词-EN: digital media integrity, deepfake technology poses, rapid advancement, technology poses, poses a significant
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注: 22 pages, 14 figures, Accepted in Journal of Electrical Systems

点击查看摘要

Abstract:The rapid advancement of deepfake technology poses a significant threat to digital media integrity. Deepfakes, synthetic media created using AI, can convincingly alter videos and audio to misrepresent reality. This creates risks of misinformation, fraud, and severe implications for personal privacy and security. Our research addresses the critical issue of deepfakes through an innovative multimodal approach, targeting both visual and auditory elements. This comprehensive strategy recognizes that human perception integrates multiple sensory inputs, particularly visual and auditory information, to form a complete understanding of media content. For visual analysis, a model that employs advanced feature extraction techniques was developed, extracting nine distinct facial characteristics and then applying various machine learning and deep learning models. For auditory analysis, our model leverages mel-spectrogram analysis for feature extraction and then applies various machine learning and deep learning models. To achieve a combined analysis, real and deepfake audio in the original dataset were swapped for testing purposes to ensure balanced samples. Using our proposed models for video and audio classification, i.e., an Artificial Neural Network and VGG19, the overall sample is classified as deepfake if either component is identified as such. Our multimodal framework combines visual and auditory analyses, yielding an accuracy of 94%.
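
The fusion rule stated in the abstract ("classified as deepfake if either component is identified as such") is a logical OR over the two modality classifiers; a minimal sketch (the function name and the probability-threshold interface are our assumptions):

```python
def classify_sample(p_video_fake: float, p_audio_fake: float,
                    threshold: float = 0.5) -> str:
    """OR-fusion: flag the sample as a deepfake if either the visual or
    the auditory classifier crosses the decision threshold."""
    if p_video_fake >= threshold or p_audio_fake >= threshold:
        return "deepfake"
    return "real"
```

OR-fusion trades some false positives for higher recall, which suits a forensic setting where missing a manipulated sample is costlier than flagging a genuine one.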

[CV-15] VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning

链接: https://arxiv.org/abs/2410.03478
作者: Han Lin,Tushar Nagarajan,Nicolas Ballas,Mido Assran,Mojtaba Komeili,Mohit Bansal,Koustuv Sinha
关键词-EN: active research area, present video input, typically in conjunction, textual annotations, active research
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 10 pages

点击查看摘要

Abstract:Procedural video representation learning is an active research area where the objective is to learn an agent which can anticipate and forecast the future given the present video input, typically in conjunction with textual annotations. Prior works often rely on large-scale pretraining of visual encoders and prediction models with language supervision. However, the necessity and effectiveness of extending compute intensive pretraining to learn video clip sequences with noisy text supervision have not yet been fully validated by previous works. In this work, we show that a strong off-the-shelf frozen pretrained visual encoder, along with a well designed prediction model, can achieve state-of-the-art (SoTA) performance in forecasting and procedural planning without the need for pretraining the prediction model, nor requiring additional supervision from language or ASR. Instead of learning representations from pixel space, our method utilizes the latent embedding space of publicly available vision encoders. By conditioning on frozen clip-level embeddings from observed steps to predict the actions of unseen steps, our prediction model is able to learn robust representations for forecasting through iterative denoising - leveraging the recent advances in diffusion transformers (Peebles & Xie, 2023). Empirical studies over a total of five procedural learning tasks across four datasets (NIV, CrossTask, COIN and Ego4D-v2) show that our model advances the strong baselines in long-horizon action anticipation (+2.6% in Verb ED@20, +3.1% in Noun ED@20), and significantly improves the SoTA in step forecasting (+5.0%), task classification (+3.8%), and procedure planning tasks (up to +2.28% in success rate, +3.39% in mAcc, and +0.90% in mIoU).

[CV-16] Diffusion State-Guided Projected Gradient for Inverse Problems

链接: https://arxiv.org/abs/2410.03463
作者: Rayhan Zirvi,Bahareh Tolooshams,Anima Anandkumar
关键词-EN: Recent advancements, inverse problems, learning data priors, solving inverse problems, diffusion
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: preprint. under review. RZ and BT have equal contributions

点击查看摘要

Abstract:Recent advancements in diffusion models have been effective in learning data priors for solving inverse problems. They leverage diffusion sampling steps for inducing a data prior while using a measurement guidance gradient at each step to impose data consistency. For general inverse problems, approximations are needed when an unconditionally trained diffusion model is used since the measurement likelihood is intractable, leading to inaccurate posterior sampling. In other words, due to their approximations, these methods fail to preserve the generation process on the data manifold defined by the diffusion prior, leading to artifacts in applications such as image restoration. To enhance the performance and robustness of diffusion models in solving inverse problems, we propose Diffusion State-Guided Projected Gradient (DiffStateGrad), which projects the measurement gradient onto a subspace that is a low-rank approximation of an intermediate state of the diffusion process. DiffStateGrad, as a module, can be added to a wide range of diffusion-based inverse solvers to improve the preservation of the diffusion process on the prior manifold and filter out artifact-inducing components. We highlight that DiffStateGrad improves the robustness of diffusion models in terms of the choice of measurement guidance step size and noise while improving the worst-case performance. Finally, we demonstrate that DiffStateGrad improves upon the state-of-the-art on linear and nonlinear image restoration inverse problems.
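
The core DiffStateGrad operation, projecting the measurement gradient onto a low-rank approximation of an intermediate diffusion state, can be sketched with a truncated SVD. This is an illustrative simplification (2-D arrays standing in for image tensors; names are our own), not the paper's code:

```python
import numpy as np

def project_gradient(grad: np.ndarray, state: np.ndarray, rank: int) -> np.ndarray:
    """Project a measurement-guidance gradient onto the subspace spanned by
    the top-`rank` right singular vectors of an intermediate diffusion state,
    filtering out components that would push off the prior manifold."""
    _, _, vt = np.linalg.svd(state, full_matrices=False)
    basis = vt[:rank]              # (rank, n); rows are orthonormal
    return grad @ basis.T @ basis  # project each row of grad onto span(basis)

rng = np.random.default_rng(0)
state = rng.standard_normal((8, 8))  # stand-in for a flattened intermediate state
grad = rng.standard_normal((8, 8))   # stand-in for the measurement gradient
proj = project_gradient(grad, state, rank=2)
```

Since the projection is onto a subspace derived from the current diffusion state, the retained gradient directions are, by construction, compatible with where the sampler already is, which is the intuition behind the reported robustness to step size and noise.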

[CV-17] Dynamic Diffusion Transformer

链接: https://arxiv.org/abs/2410.03456
作者: Wangbo Zhao,Yizeng Han,Jiasheng Tang,Kai Wang,Yibing Song,Gao Huang,Fan Wang,Yang You
关键词-EN: demonstrated superior performance, Dynamic Diffusion Transformer, substantial computational costs, Diffusion Transformer, demonstrated superior
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion Transformer (DiT), an emerging diffusion model for image generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs stem from the static inference paradigm, which inevitably introduces redundant computation in certain diffusion timesteps and spatial regions. To address this inefficiency, we propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions during generation. Specifically, we introduce a Timestep-wise Dynamic Width (TDW) approach that adapts model width conditioned on the generation timesteps. In addition, we design a Spatial-wise Dynamic Token (SDT) strategy to avoid redundant computation at unnecessary spatial locations. Extensive experiments on various datasets and different-sized models verify the superiority of DyDiT. Notably, with 3% additional fine-tuning iterations, our method reduces the FLOPs of DiT-XL by 51%, accelerates generation by 1.73×, and achieves a competitive FID score of 2.07 on ImageNet. The code is publicly available at this https URL Dynamic-Diffusion-Transformer.

[CV-18] CLoSD: Closing the Loop between Simulation and Diffusion for multi-task character control

链接: https://arxiv.org/abs/2410.03441
作者: Guy Tevet,Sigal Raab,Setareh Cohan,Daniele Reda,Zhengyi Luo,Xue Bin Peng,Amit H. Bermano,Michiel van de Panne
关键词-EN: Reinforcement Learning, human motion generation, simulations have complementary, based control, Motion
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Motion diffusion models and Reinforcement Learning (RL) based control for physics-based simulations have complementary strengths for human motion generation. The former is capable of generating a wide variety of motions, adhering to intuitive control such as text, while the latter offers physically plausible motion and direct interaction with the environment. In this work, we present a method that combines their respective strengths. CLoSD is a text-driven RL physics-based controller, guided by diffusion generation for various tasks. Our key insight is that motion diffusion can serve as an on-the-fly universal planner for a robust RL controller. To this end, CLoSD maintains a closed-loop interaction between two modules – a Diffusion Planner (DiP), and a tracking controller. DiP is a fast-responding autoregressive diffusion model, controlled by textual prompts and target locations, and the controller is a simple and robust motion imitator that continuously receives motion plans from DiP and provides feedback from the environment. CLoSD is capable of seamlessly performing a sequence of different tasks, including navigation to a goal location, striking an object with a hand or foot as specified in a text prompt, sitting down, and getting up. this https URL

[CV-19] Dessie: Disentanglement for Articulated 3D Horse Shape and Pose Estimation from Images ACCV2024

链接: https://arxiv.org/abs/2410.03438
作者: Ci Li,Yi Yang,Zehang Weng,Elin Hernlund,Silvia Zuffi,Hedvig Kjellström
关键词-EN: parametric animal models, recent years, aid in estimating, images and video, developed to aid
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ACCV2024

点击查看摘要

Abstract:In recent years, 3D parametric animal models have been developed to aid in estimating 3D shape and pose from images and video. While progress has been made for humans, it's more challenging for animals due to limited annotated data. To address this, we introduce the first method using synthetic data generation and disentanglement to learn to regress 3D shape and pose. Focusing on horses, we use text-based texture generation and a synthetic data pipeline to create varied shapes, poses, and appearances, learning disentangled spaces. Our method, Dessie, surpasses existing 3D horse reconstruction methods and generalizes to other large animals like zebras, cows, and deer. See the project website at: this https URL.

[CV-20] Images Speak Volumes: User-Centric Assessment of Image Generation for Accessible Communication

链接: https://arxiv.org/abs/2410.03430
作者: Miriam Anschütz,Tringa Sylaj,Georg Groh
关键词-EN: Explanatory images play, Explanatory images, play a pivotal, pivotal role, images
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: To be published at TSAR workshop 2024 ( this https URL )

点击查看摘要

Abstract:Explanatory images play a pivotal role in accessible and easy-to-read (E2R) texts. However, the images available in online databases are not tailored toward the respective texts, and the creation of customized images is expensive. In this large-scale study, we investigated whether text-to-image generation models can close this gap by providing customizable images quickly and easily. We benchmarked seven image generation models (four open-source and three closed-source) and provide an extensive evaluation of the resulting images. In addition, we performed a user study with people from the E2R target group to examine whether the images met their requirements. We find that some of the models show remarkable performance, but none of the models are ready to be used at a larger scale without human supervision. Our research is an important step toward facilitating the creation of accessible information for E2R creators and tailoring accessible images to the target group's needs.

[CV-21] Img2CAD: Conditioned 3D CAD Model Generation from Single Image with Structured Visual Geometry

链接: https://arxiv.org/abs/2410.03417
作者: Tianrun Chen,Chunan Yu,Yuanqi Hu,Jing Li,Tao Xu,Runlong Cao,Lanyun Zhu,Ying Zang,Yong Zhang,Zejian Li,Linyun Sun
关键词-EN: generate CAD models, CAD models, Structured Visual Geometry, CAD, CAD model generation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we propose Img2CAD, the first approach to our knowledge that uses 2D image inputs to generate CAD models with editable parameters. Unlike existing AI methods for 3D model generation using text or image inputs often rely on mesh-based representations, which are incompatible with CAD tools and lack editability and fine control, Img2CAD enables seamless integration between AI-based 3D reconstruction and CAD software. We have identified an innovative intermediate representation called Structured Visual Geometry (SVG), characterized by vectorized wireframes extracted from objects. This representation significantly enhances the performance of generating conditioned CAD models. Additionally, we introduce two new datasets to further support research in this area: ABC-mono, the largest known dataset comprising over 200,000 3D CAD models with rendered images, and KOCAD, the first dataset featuring real-world captured objects alongside their ground truth CAD models, supporting further research in conditioned CAD model generation.

[CV-22] Lightning UQ Box: A Comprehensive Framework for Uncertainty Quantification in Deep Learning

链接: https://arxiv.org/abs/2410.03390
作者: Nils Lehmann,Jakob Gawlikowski,Adam J. Stewart,Vytautas Jancauskas,Stefan Depeweg,Eric Nalisnick,Nina Maria Gottschling
关键词-EN: deep neural networks, Uncertainty quantification, applying deep neural, real world tasks, neural networks
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 10 pages, 8 figures

点击查看摘要

Abstract:Uncertainty quantification (UQ) is an essential tool for applying deep neural networks (DNNs) to real world tasks, as it attaches a degree of confidence to DNN outputs. However, despite its benefits, UQ is often left out of the standard DNN workflow due to the additional technical knowledge required to apply and evaluate existing UQ procedures. Hence there is a need for a comprehensive toolbox that allows the user to integrate UQ into their modelling workflow, without significant overhead. We introduce Lightning UQ Box: a unified interface for applying and evaluating various approaches to UQ. In this paper, we provide a theoretical and quantitative comparison of the wide range of state-of-the-art UQ methods implemented in our toolbox. We focus on two challenging vision tasks: (i) estimating tropical cyclone wind speeds from infrared satellite imagery and (ii) estimating the power output of solar panels from RGB images of the sky. By highlighting the differences between methods our results demonstrate the need for a broad and approachable experimental framework for UQ, that can be used for benchmarking UQ methods. The toolbox, example implementations, and further information are available at: this https URL

[CV-23] LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding

链接: https://arxiv.org/abs/2410.03355
作者: Doohyuk Jang,Sihwan Park,June Yong Yang,Yeonsung Jung,Jihun Yun,Souvik Kundu,Sung-Yub Kim,Eunho Yang
关键词-EN: recently gained prominence, speculative decoding, recently gained, gained prominence, models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Auto-Regressive (AR) models have recently gained prominence in image generation, often matching or even surpassing the performance of diffusion models. However, one major limitation of AR models is their sequential nature, which processes tokens one at a time, slowing down generation compared to models like GANs or diffusion-based methods that operate more efficiently. While speculative decoding has proven effective for accelerating LLMs by generating multiple tokens in a single forward pass, its application in visual AR models remains largely unexplored. In this work, we identify a challenge in this setting, which we term token selection ambiguity, wherein visual AR models frequently assign uniformly low probabilities to tokens, hampering the performance of speculative decoding. To overcome this challenge, we propose a relaxed acceptance condition referred to as LANTERN that leverages the interchangeability of tokens in latent space. This relaxation restores the effectiveness of speculative decoding in visual AR models by enabling more flexible use of candidate tokens that would otherwise be prematurely rejected. Furthermore, by incorporating a total variation distance bound, we ensure that these speed gains are achieved without significantly compromising image quality or semantic coherence. Experimental results demonstrate the efficacy of our method in providing a substantial speed-up over speculative decoding. Specifically, compared to a naïve application of the state-of-the-art speculative decoding, LANTERN increases speed-ups by 1.75× and 1.76×, as compared to greedy decoding and random sampling, respectively, when applied to LlamaGen, a contemporary visual AR model.
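
To make the "relaxed acceptance" idea concrete: standard speculative decoding accepts a drafted token with probability min(1, p_target/p_draft), and relaxing it by pooling target probability over latent-space neighbours can be sketched as below. This is a loose illustration of the principle only; the paper's actual condition also involves a total variation distance bound, which we omit, and all names are hypothetical:

```python
def pooled_acceptance_prob(token: str,
                           target_probs: dict,
                           draft_probs: dict,
                           neighbors: dict) -> float:
    """Acceptance probability for a drafted token, relaxed by pooling the
    target model's probability over the token and its latent-space
    neighbours (tokens treated as interchangeable)."""
    pooled = target_probs.get(token, 0.0)
    pooled += sum(target_probs.get(t, 0.0) for t in neighbors.get(token, ()))
    return min(1.0, pooled / draft_probs[token])
```

With uniformly low per-token probabilities, the un-pooled ratio is tiny and drafts are rejected almost always; pooling over interchangeable tokens raises the ratio, which is why the relaxation restores the speed-up.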

[CV-24] Audio-Agent : Leveraging LLMs For Audio Generation Editing and Composition

链接: https://arxiv.org/abs/2410.03335
作者: Zixuan Wang,Yu-Wing Tai,Chi-Keung Tang
关键词-EN: editing and composition, composition based, audio, text, Large Language Model
类目: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:We introduce Audio-Agent, a multimodal framework for audio generation, editing and composition based on text or video inputs. Conventional approaches for text-to-audio (TTA) tasks often make single-pass inferences from text descriptions. While straightforward, this design struggles to produce high-quality audio when given complex text conditions. In our method, we utilize a pre-trained TTA diffusion network as the audio generation agent to work in tandem with GPT-4, which decomposes the text condition into atomic, specific instructions, and calls the agent for audio generation. Consequently, Audio-Agent generates high-quality audio that is closely aligned with the provided text or video while also supporting variable-length generation. For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with generated audio, a process that can be tedious and time-consuming. We propose a simpler approach by fine-tuning a pre-trained Large Language Model (LLM), e.g., Gemma2-2B-it, to obtain both semantic and temporal conditions to bridge video and audio modality. Thus our framework provides a comprehensive solution for both TTA and VTA tasks without substantial computational overhead in training.

[CV-25] An X-Ray Is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation

链接: https://arxiv.org/abs/2410.03334
作者: Ahmed Abdulaal,Hugo Fry,Nina Montaña-Brown,Ayodeji Ijishakin,Jack Gao,Stephanie Hyland,Daniel C. Alexander,Daniel C. Castro
关键词-EN: experiencing unprecedented demand, radiology report generation, automating radiology report, unprecedented demand, leading to increased
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Radiological services are experiencing unprecedented demand, leading to increased interest in automating radiology report generation. Existing Vision-Language Models (VLMs) suffer from hallucinations, lack interpretability, and require expensive fine-tuning. We introduce SAE-Rad, which uses sparse autoencoders (SAEs) to decompose latent representations from a pre-trained vision transformer into human-interpretable features. Our hybrid architecture combines state-of-the-art SAE advancements, achieving accurate latent reconstructions while maintaining sparsity. Using an off-the-shelf language model, we distil ground-truth reports into radiological descriptions for each SAE feature, which we then compile into a full report for each image, eliminating the need for fine-tuning large models for this task. To the best of our knowledge, SAE-Rad represents the first instance of using mechanistic interpretability techniques explicitly for a downstream multi-modal reasoning task. On the MIMIC-CXR dataset, SAE-Rad achieves competitive radiology-specific metrics compared to state-of-the-art models while using significantly fewer computational resources for training. Qualitative analysis reveals that SAE-Rad learns meaningful visual concepts and generates reports aligning closely with expert interpretations. Our results suggest that SAEs can enhance multimodal reasoning in healthcare, providing a more interpretable alternative to existing VLMs.
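
The sparse-autoencoder decomposition that SAE-Rad builds on can be sketched in a few lines: a ReLU encoder produces non-negative, mostly-zero feature activations over ViT embeddings, and a linear decoder reconstructs the input from them. This is a generic SAE sketch under our own assumptions (random weights, a negative encoder bias to encourage sparsity), not the paper's architecture:

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """One sparse-autoencoder pass: a ReLU encoder yields non-negative,
    mostly-zero feature activations; a linear decoder reconstructs the
    input embedding from those features."""
    feats = np.maximum(0.0, x @ W_enc + b_enc)  # sparse, human-inspectable
    recon = feats @ W_dec + b_dec
    return feats, recon

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))       # stand-in for ViT patch embeddings
W_enc = rng.standard_normal((16, 64))  # overcomplete: 64 features from 16 dims
b_enc = -1.0 * np.ones(64)             # negative bias encourages sparsity
W_dec = rng.standard_normal((64, 16))
b_dec = np.zeros(16)
feats, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
```

Because each feature fires sparsely, individual features can be labelled with radiological descriptions, which is what lets the report be assembled feature-by-feature instead of fine-tuning a large VLM.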

[CV-26] Comparative Analysis and Ensemble Enhancement of Leading CNN Architectures for Breast Cancer Classification

链接: https://arxiv.org/abs/2410.03333
作者: Gary Murphy,Raghubir Singh
关键词-EN: Convolutional Neural Network, leading Convolutional Neural, Neural Network, Convolutional Neural, image datasets
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This study introduces a novel and accurate approach to breast cancer classification using histopathology images. It systematically compares leading Convolutional Neural Network (CNN) models across varying image datasets, identifies their optimal hyperparameters, and ranks them based on classification efficacy. To maximize classification accuracy for each model, we explore the effects of data augmentation, alternative fully-connected layers, model training hyperparameter settings, and the advantages of retraining models versus using pre-trained weights. Our methodology includes several original concepts, including serializing generated datasets to ensure consistent data conditions across training runs and significantly reducing training duration. Combined with automated curation of results, this enabled the exploration of over 2,000 training permutations; such a comprehensive comparison is as yet unprecedented. Our findings establish the settings required to achieve exceptional classification accuracy for standalone CNN models and rank them by model efficacy. Based on these results, we propose ensemble architectures that stack three high-performing standalone CNN models together with diverse classifiers, resulting in improved classification accuracy. The ability to systematically run so many model permutations to get the best outcomes gives rise to very high-quality results, including 99.75% for BreakHis x40 and BreakHis x200 and 95.18% for the Bach datasets when split into train, validation and test datasets. The Bach Online blind challenge yielded 89% using this approach. Whilst this study is based on breast cancer histopathology image datasets, the methodology is equally applicable to other medical image datasets.

[CV-27] EmojiHeroVR: A Study on Facial Expression Recognition under Partial Occlusion from Head-Mounted Displays

链接: https://arxiv.org/abs/2410.03331
作者: Thorben Ortmann,Qi Wang,Larissa Putzar
关键词-EN: Virtual Reality, enabling advanced personalization, providing emotional feedback, enhancement of Virtual, Emotion recognition promotes
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Emotion recognition promotes the evaluation and enhancement of Virtual Reality (VR) experiences by providing emotional feedback and enabling advanced personalization. However, facial expressions are rarely used to recognize users’ emotions, as Head-Mounted Displays (HMDs) occlude the upper half of the face. To address this issue, we conducted a study with 37 participants who played our novel affective VR game EmojiHeroVR. The collected database, EmoHeVRDB (EmojiHeroVR Database), includes 3,556 labeled facial images of 1,778 reenacted emotions. For each labeled image, we also provide 29 additional frames recorded directly before and after the labeled image to facilitate dynamic Facial Expression Recognition (FER). Additionally, EmoHeVRDB includes data on the activations of 63 facial expressions captured via the Meta Quest Pro VR headset for each frame. Leveraging our database, we conducted a baseline evaluation on the static FER classification task with six basic emotions and neutral using the EfficientNet-B0 architecture. The best model achieved an accuracy of 69.84% on the test set, indicating that FER under HMD occlusion is feasible but significantly more challenging than conventional FER.

[CV-28] Does SpatioTemporal information benefit Two video summarization benchmarks? ECAI2024

链接: https://arxiv.org/abs/2410.03323
作者: Aashutosh Ganesh,Mirela Popa,Daan Odijk,Nava Tintarev
关键词-EN: aspect of summarizing, models, benchmark datasets, Video summarization, important aspect
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: Accepted for presentation at AEQUITAS workshop, Co-located with ECAI 2024

点击查看摘要

Abstract:An important aspect of summarizing videos is understanding the temporal context behind each part of the video to grasp what is and is not important. Video summarization models have in recent years modeled spatio-temporal relationships to represent this information. These models achieved state-of-the-art correlation scores on important benchmark datasets. However, what has not been reviewed is whether spatio-temporal relationships are even required to achieve state-of-the-art results. Previous work in activity recognition has found biases that prioritize static cues, such as scenes or objects, over motion information. In this paper, we inquire whether similar spurious relationships might influence the task of video summarization. To do so, we analyse the role that temporal information plays on existing benchmark datasets. We first estimate a baseline with temporally invariant models to see how well such models rank on benchmark datasets (TVSum and SumMe). We then disrupt the temporal order of the videos to investigate the impact it has on existing state-of-the-art models. One of our findings is that the temporally invariant models achieve competitive correlation scores that are close to the human baselines on the TVSum dataset. We also demonstrate that existing models are not affected by temporal perturbations. Furthermore, with certain disruption strategies that shuffle fixed time segments, we can actually improve their correlation scores. With these results, we find that spatio-temporal relationships play a minor role and we raise the question of whether these benchmarks adequately model the task of video summarization. Code available at: this https URL
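The disruption strategy the abstract describes (shuffling fixed time segments while keeping frames within each segment intact) can be sketched as follows. The function name, segment length, and seed are illustrative assumptions, not from the paper:

```python
import random

def shuffle_fixed_segments(frames, segment_len, seed=0):
    """Split a frame sequence into fixed-length segments and shuffle the
    segment order, keeping the frames inside each segment contiguous.
    A minimal stand-in for the paper's temporal-disruption experiment."""
    segments = [frames[i:i + segment_len]
                for i in range(0, len(frames), segment_len)]
    rng = random.Random(seed)
    rng.shuffle(segments)  # only the order of whole segments changes
    return [f for seg in segments for f in seg]

# A 10-frame toy video split into two 5-frame segments.
shuffled = shuffle_fixed_segments(list(range(10)), segment_len=5, seed=1)
```

Feeding such shuffled inputs to a summarizer and comparing correlation scores against the unshuffled baseline is the kind of probe the paper uses to test whether temporal order matters.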

[CV-29] Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning

链接: https://arxiv.org/abs/2410.03321
作者: Minheng Ni,Yutao Fan,Lei Zhang,Wangmeng Zuo
关键词-EN: large-scale models evolve, increasingly utilized, models, ambiguous instructions, instructions
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:As large-scale models evolve, language instructions are increasingly utilized in multi-modal tasks. Due to human language habits, these instructions often contain ambiguities in real-world scenarios, necessitating the integration of visual context or common sense for accurate interpretation. However, even highly intelligent large models exhibit significant performance limitations on ambiguous instructions, where weak reasoning abilities of disambiguation can lead to catastrophic errors. To address this issue, this paper proposes Visual-O1, a multi-modal multi-turn chain-of-thought reasoning framework. It simulates human multi-modal multi-turn reasoning, providing instantial experience for highly intelligent models or empirical experience for generally intelligent models to understand ambiguous instructions. Unlike traditional methods that require models to possess high intelligence to understand long texts or perform lengthy complex reasoning, our framework does not significantly increase computational overhead and is more general and effective, even for generally intelligent models. Experiments show that our method not only significantly enhances the performance of models of different intelligence levels on ambiguous instructions but also improves their performance on general datasets. Our work highlights the potential of artificial intelligence to work like humans in real-world scenarios with uncertainty and ambiguity. We will release our data and code.

[CV-30] Quo Vadis Motion Generation? From Large Language Models to Large Motion Models

链接: https://arxiv.org/abs/2410.03311
作者: Ye Wang,Sipeng Zheng,Bin Cao,Qianshan Wei,Qin Jin,Zongqing Lu
关键词-EN: human motion understanding, large motion models, success of LLMs, motion, recent success
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Inspired by the recent success of LLMs, the field of human motion understanding has increasingly shifted towards the development of large motion models. Despite some progress, current state-of-the-art works remain far from achieving truly generalist models, largely due to the lack of large-scale, high-quality motion data. To address this, we present MotionBase, the first million-level motion generation benchmark, offering 15 times the data volume of the previous largest dataset, and featuring multimodal data with hierarchically detailed text descriptions. By leveraging this vast dataset, our large motion model demonstrates strong performance across a broad range of motions, including unseen ones. Through systematic investigation, we underscore the importance of scaling both data and model size, with synthetic data and pseudo labels playing a crucial role in mitigating data acquisition costs. Moreover, our research reveals the limitations of existing evaluation metrics, particularly in handling out-of-domain text instructions – an issue that has long been overlooked. In addition to these, we introduce a novel 2D lookup-free approach for motion tokenization, which preserves motion information and expands codebook capacity, further enhancing the representative ability of large motion models. The release of MotionBase and the insights gained from this study are expected to pave the way for the development of more powerful and versatile motion generation models.

[CV-31] SELU: Self-Learning Embodied MLLMs in Unknown Environments

链接: https://arxiv.org/abs/2410.03303
作者: Boyu Li,Haobin Jiang,Ziluo Ding,Xinrun Xu,Haoran Li,Dongbin Zhao,Zongqing Lu
关键词-EN: multimodal large language, large language models, demonstrated strong visual, strong visual understanding, autonomously improving MLLMs
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recently, multimodal large language models (MLLMs) have demonstrated strong visual understanding and decision-making capabilities, enabling the exploration of autonomously improving MLLMs in unknown environments. However, external feedback like human or environmental feedback is not always available. To address this challenge, existing methods primarily focus on enhancing the decision-making capabilities of MLLMs through voting and scoring mechanisms, while little attention has been paid to improving the environmental comprehension of MLLMs in unknown environments. To fully unleash the self-learning potential of MLLMs, we propose a novel actor-critic self-learning paradigm, dubbed SELU, inspired by the actor-critic paradigm in reinforcement learning. The critic employs self-asking and hindsight relabeling to extract knowledge from interaction trajectories collected by the actor, thereby augmenting its environmental comprehension. Simultaneously, the actor is improved by the self-feedback provided by the critic, enhancing its decision-making. We evaluate our method in the AI2-THOR and VirtualHome environments, and SELU achieves critic improvements of approximately 28% and 30%, and actor improvements of about 20% and 24% via self-learning.

[CV-32] Action Selection Learning for Multi-label Multi-view Action Recognition

链接: https://arxiv.org/abs/2410.03302
作者: Trung Thanh Nguyen,Yasutomo Kawanishi,Takahiro Komamizu,Ichiro Ide
关键词-EN: Multi-label multi-view action, recognize multiple concurrent, Action Selection Learning, multi-view action recognition, Multi-label multi-view
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ACM Multimedia Asia 2024

点击查看摘要

Abstract:Multi-label multi-view action recognition aims to recognize multiple concurrent or sequential actions from untrimmed videos captured by multiple cameras. Existing work has focused on multi-view action recognition in a narrow area with strong labels available, where the onset and offset of each action are labeled at the frame-level. This study focuses on real-world scenarios where cameras are distributed to capture a wide-range area with only weak labels available at the video-level. We propose the method named MultiASL (Multi-view Action Selection Learning), which leverages action selection learning to enhance view fusion by selecting the most useful information from different viewpoints. The proposed method includes a Multi-view Spatial-Temporal Transformer video encoder to extract spatial and temporal features from multi-viewpoint videos. Action Selection Learning is employed at the frame-level, using pseudo ground-truth obtained from weak labels at the video-level, to identify the most relevant frames for action recognition. Experiments in a real-world office environment using the MM-Office dataset demonstrate the superior performance of the proposed method compared to existing methods.

[CV-33] Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models

链接: https://arxiv.org/abs/2410.03290
作者: Haibo Wang,Zhiyang Xu,Yu Cheng,Shizhe Diao,Yufan Zhou,Yixin Cao,Qifan Wang,Weifeng Ge,Lifu Huang
关键词-EN: Large Language Models, Video Large Language, Large Language, demonstrated remarkable capabilities, Language Models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Video Large Language Models (Video-LLMs) have demonstrated remarkable capabilities in coarse-grained video understanding, however, they struggle with fine-grained temporal grounding. In this paper, we introduce Grounded-VideoLLM, a novel Video-LLM adept at perceiving and reasoning over specific video moments in a fine-grained manner. We identify that current Video-LLMs have limitations for fine-grained video understanding since they lack effective temporal modeling and timestamp representation. In light of this, we sharpen our model by incorporating (1) an additional temporal stream to encode the relationships between frames and (2) discrete temporal tokens enriched with specific time knowledge to represent timestamps. To optimize the training of Grounded-VideoLLM, we employ a multi-stage training scheme, beginning with simple video-captioning tasks and progressively introducing video temporal grounding tasks of increasing complexity. To further enhance Grounded-VideoLLM’s temporal reasoning capability, we also curate a grounded VideoQA dataset by an automatic annotation pipeline. Extensive experiments demonstrate that Grounded-VideoLLM not only excels in fine-grained grounding tasks such as temporal sentence grounding, dense video captioning, and grounded VideoQA, but also shows great potential as a versatile video assistant for general video understanding.
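The "discrete temporal tokens" idea, quantizing a timestamp into one of a fixed number of vocabulary items the LLM can emit, can be sketched as below. The bin count and the `<time_i>` token format are illustrative assumptions; the paper defines its own token design:

```python
def timestamp_to_token(t_sec, duration_sec, num_bins=100):
    """Quantize a timestamp into one of num_bins discrete temporal tokens,
    so a video LLM can represent time as ordinary vocabulary items.
    Sketch only; bin count and token naming are hypothetical."""
    frac = min(max(t_sec / duration_sec, 0.0), 1.0)  # clamp to [0, 1]
    idx = min(int(frac * num_bins), num_bins - 1)    # avoid an overflow bin
    return f"<time_{idx}>"

tok = timestamp_to_token(30.0, 60.0)  # halfway through a 60 s video
```

With such tokens added to the vocabulary, grounding outputs like "the action occurs at `<time_50>`" become standard next-token predictions.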

[CV-34] Sm: enhanced localization in Multiple Instance Learning for medical imaging classification NEURIPS2024

链接: https://arxiv.org/abs/2410.03276
作者: Francisco M. Castro-Macías,Pablo Morales-Álvarez,Yunan Wu,Rafael Molina,Aggelos K. Katsaggelos
关键词-EN: Multiple Instance Learning, Multiple Instance, Instance Learning, medical imaging classification, labeling effort
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 24 pages, 14 figures, 2024 Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:Multiple Instance Learning (MIL) is widely used in medical imaging classification to reduce the labeling effort. While only bag labels are available for training, one typically seeks predictions at both bag and instance levels (classification and localization tasks, respectively). Early MIL methods treated the instances in a bag independently. Recent methods account for global and local dependencies among instances. Although they have yielded excellent results in classification, their performance in terms of localization is comparatively limited. We argue that these models have been designed to target the classification task, while implications at the instance level have not been deeply investigated. Motivated by a simple observation – that neighboring instances are likely to have the same label – we propose a novel, principled, and flexible mechanism to model local dependencies. It can be used alone or combined with any mechanism to model global dependencies (e.g., transformers). A thorough empirical validation shows that our module leads to state-of-the-art performance in localization while being competitive or superior in classification. Our code is at this https URL.
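The motivating observation, that neighboring instances are likely to share a label, can be illustrated by smoothing per-instance scores over a local neighborhood. This 1-D averaging is a hypothetical stand-in for the paper's local-dependency mechanism, and the `alpha` mixing weight is an assumed parameter:

```python
def smooth_instance_scores(scores, alpha=0.5):
    """Mix each instance score with the mean of its immediate neighbours,
    encoding the prior that adjacent instances (e.g. tissue patches)
    tend to carry the same label. Illustrative sketch only."""
    smoothed = []
    for i, s in enumerate(scores):
        nbrs = [scores[j] for j in (i - 1, i + 1) if 0 <= j < len(scores)]
        nbr_mean = sum(nbrs) / len(nbrs) if nbrs else s
        smoothed.append(alpha * s + (1 - alpha) * nbr_mean)
    return smoothed

out = smooth_instance_scores([0.0, 1.0, 0.0])
```

An isolated high score surrounded by low neighbors is pulled down, which is how such a prior suppresses spurious instance-level (localization) false positives.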

[CV-35] Frame-Voyager: Learning to Query Frames for Video Large Language Models

链接: https://arxiv.org/abs/2410.03226
作者: Sicheng Yu,Chengkai Jin,Huanyu Wang,Zhenghao Chen,Sheng Jin,Zhongrong Zuo,Xioalei Xu,Zhenbang Sun,Bingni Zhang,Jiawei Wu,Hao Zhang,Qianru Sun
关键词-EN: Large Language Models, Video Large Language, Language Models, Large Language, made remarkable progress
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: 19 pages, 10 figures

点击查看摘要

Abstract:Video Large Language Models (Video-LLMs) have made remarkable progress in video understanding tasks. However, they are constrained by the maximum length of input tokens, making it impractical to input entire videos. Existing frame selection approaches, such as uniform frame sampling and text-frame retrieval, fail to account for the information density variations in the videos or the complex instructions in the tasks, leading to sub-optimal performance. In this paper, we propose Frame-Voyager that learns to query informative frame combinations, based on the given textual queries in the task. To train Frame-Voyager, we introduce a new data collection and labeling pipeline, by ranking frame combinations using a pre-trained Video-LLM. Given a video of M frames, we traverse its T-frame combinations, feed them into a Video-LLM, and rank them based on Video-LLM’s prediction losses. Using this ranking as supervision, we train Frame-Voyager to query the frame combinations with lower losses. In experiments, we evaluate Frame-Voyager on four Video Question Answering benchmarks by plugging it into two different Video-LLMs. The experimental results demonstrate that Frame-Voyager achieves impressive results in all settings, highlighting its potential as a plug-and-play solution for Video-LLMs.
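The labeling pipeline described above (traverse the T-frame combinations of a video and rank them by the Video-LLM's prediction loss) can be sketched as follows. Here `loss_fn` is a toy stand-in for the Video-LLM loss, and the helper name is hypothetical:

```python
from itertools import combinations

def rank_frame_combinations(num_frames, t, loss_fn):
    """Enumerate all t-frame combinations of a num_frames-long video and
    sort them by a scoring function standing in for the Video-LLM
    prediction loss (lower is better). The resulting ranking serves as
    supervision for training the frame-query model."""
    combos = list(combinations(range(num_frames), t))
    return sorted(combos, key=loss_fn)

# Toy loss: pretend the informative frames sit near the middle (index 2).
ranked = rank_frame_combinations(5, 2, lambda c: sum(abs(i - 2) for i in c))
best = ranked[0]
```

In the paper's pipeline this ranking is produced once by a pre-trained Video-LLM; Frame-Voyager is then trained to prefer the low-loss combinations directly, avoiding the exhaustive traversal at inference time.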

[CV-36] ScriptViz: A Visualization Tool to Aid Scriptwriting based on a Large Movie Database

链接: https://arxiv.org/abs/2410.03224
作者: Anyi Rao,Jean-Peïc Chou,Maneesh Agrawala
关键词-EN: vivid story, create a vivid, mental visualization, large movie database, large movie
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: Accepted in the 37th Annual ACM Symposium on User Interface Software and Technology (UIST’24). Webpage: this https URL

点击查看摘要

Abstract:Scriptwriters usually rely on their mental visualization to create a vivid story by using their imagination to see, feel, and experience the scenes they are writing. Besides mental visualization, they often refer to existing images or scenes in movies and analyze the visual elements to create a certain mood or atmosphere. In this paper, we develop ScriptViz to provide external visualization based on a large movie database for the screenwriting process. It retrieves reference visuals on the fly based on scripts’ text and dialogue from a large movie database. The tool provides two types of control on visual elements that enable writers to 1) see exactly what they want with fixed visual elements and 2) see variances in uncertain elements. User evaluation among 15 scriptwriters shows that ScriptViz is able to present scriptwriters with consistent yet diverse visual possibilities, aligning closely with their scripts and helping their creation.

[CV-37] Tuning Timestep-Distilled Diffusion Model Using Pairwise Sample Optimization

链接: https://arxiv.org/abs/2410.03190
作者: Zichen Miao,Zhengyuan Yang,Kevin Lin,Ze Wang,Zicheng Liu,Lijuan Wang,Qiang Qiu
关键词-EN: rivals non-distilled multi-step, Recent advancements, fewer inference steps, significantly fewer inference, non-distilled multi-step models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advancements in timestep-distilled diffusion models have enabled high-quality image generation that rivals non-distilled multi-step models, but with significantly fewer inference steps. While such models are attractive for applications due to the low inference cost and latency, fine-tuning them with a naive diffusion objective would result in degraded and blurry outputs. An intuitive alternative is to repeat the diffusion distillation process with a fine-tuned teacher model, which produces good results but is cumbersome and computationally intensive; the distillation training usually requires orders of magnitude more training compute than fine-tuning for specific image styles. In this paper, we present an algorithm named pairwise sample optimization (PSO), which enables the direct fine-tuning of an arbitrary timestep-distilled diffusion model. PSO introduces additional reference images sampled from the current time-step distilled model, and increases the relative likelihood margin between the training images and reference images. This enables the model to retain its few-step generation ability, while allowing for fine-tuning of its output distribution. We also demonstrate that PSO is a generalized formulation which can be flexibly extended to both offline-sampled and online-sampled pairwise data, covering various popular objectives for diffusion model preference optimization. We evaluate PSO in both preference optimization and other fine-tuning tasks, including style transfer and concept customization. We show that PSO can directly adapt distilled models to human-preferred generation with both offline and online-generated pairwise preference image data. PSO also demonstrates effectiveness in style transfer and concept customization by directly tuning timestep-distilled diffusion models.
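The core idea, increasing the relative likelihood margin between a training image and a reference image sampled from the current distilled model, resembles DPO-style pairwise objectives. The logistic form and `beta` below are assumptions for illustration; the paper defines its own pairwise formulation:

```python
import math

def pso_margin_loss(logp_train, logp_ref, beta=1.0):
    """Pairwise objective sketch: penalize the model unless its
    log-likelihood of the training image exceeds that of the reference
    image by a comfortable margin. Computed as -log(sigmoid(margin)),
    a common DPO-style choice; not the paper's exact loss."""
    margin = beta * (logp_train - logp_ref)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the two likelihoods tie, the loss is log 2; widening the margin in favor of the training image drives the loss toward zero, which is what nudges the output distribution without destroying few-step generation.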

[CV-38] Generalizable Prompt Tuning for Vision-Language Models

链接: https://arxiv.org/abs/2410.03189
作者: Qian Zhang
关键词-EN: CLIP involves optimizing, CLIP involves, specific downstream tasks, generate image-text pairs, downstream tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Prompt tuning for vision-language models such as CLIP involves optimizing the text prompts used to generate image-text pairs for specific downstream tasks. While hand-crafted or template-based prompts are generally applicable to a wider range of unseen classes, they tend to perform poorly in downstream tasks (i.e., seen classes). Learnable soft prompts, on the other hand, often perform well in downstream tasks but lack generalizability. Additionally, prior research has predominantly concentrated on the textual modality, with very few studies attempting to explore the prompt’s generalization potential from the visual modality. Keeping these limitations in mind, we investigate how to perform prompt tuning to obtain both competitive downstream performance and generalization. The study shows that by treating soft and hand-crafted prompts as dual views of the textual modality, and maximizing their mutual information, we can better ensemble task-specific and general semantic information. Moreover, to generate more expressive prompts, the study introduces a class-wise augmentation from the visual modality, resulting in significant robustness to a wider range of unseen classes. Extensive evaluations on several benchmarks show that the proposed approach achieves competitive results in terms of both task-specific performance and general abilities.

[CV-39] Looking into Concept Explanation Methods for Diabetic Retinopathy Classification

链接: https://arxiv.org/abs/2410.03188
作者: Andrea M. Storås,Josefine V. Sundgaard
关键词-EN: Diabetic retinopathy, imaging is crucial, common complication, monitoring the progression, progression of retinal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) this https URL

点击查看摘要

Abstract:Diabetic retinopathy is a common complication of diabetes, and monitoring the progression of retinal abnormalities using fundus imaging is crucial. Because the images must be interpreted by a medical expert, it is infeasible to screen all individuals with diabetes for diabetic retinopathy. Deep learning has shown impressive results for automatic analysis and grading of fundus images. One drawback is, however, the lack of interpretability, which hampers the implementation of such systems in the clinic. Explainable artificial intelligence methods can be applied to explain the deep neural networks. Explanations based on concepts have shown to be intuitive for humans to understand, but have not yet been explored in detail for diabetic retinopathy grading. This work investigates and compares two concept-based explanation techniques for explaining deep neural networks developed for automatic diagnosis of diabetic retinopathy: Quantitative Testing with Concept Activation Vectors and Concept Bottleneck Models. We found that both methods have strengths and weaknesses, and choice of method should take the available data and the end user’s preferences into account.

[CV-40] Autonomous Character-Scene Interaction Synthesis from Text Instruction

链接: https://arxiv.org/abs/2410.03187
作者: Nan Jiang,Zimo He,Zi Wang,Hongjie Li,Yixin Chen,Siyuan Huang,Yixin Zhu
关键词-EN: presents substantial demands, complex activities, substantial demands, demands for user-defined, user-defined waypoints
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Synthesizing human motions in 3D environments, particularly those with complex activities such as locomotion, hand-reaching, and human-object interaction, presents substantial demands for user-defined waypoints and stage transitions. These requirements pose challenges for current models, leading to a notable gap in automating the animation of characters from simple human inputs. This paper addresses this challenge by introducing a comprehensive framework for synthesizing multi-stage scene-aware interaction motions directly from a single text instruction and goal location. Our approach employs an auto-regressive diffusion model to synthesize the next motion segment, along with an autonomous scheduler predicting the transition for each action stage. To ensure that the synthesized motions are seamlessly integrated within the environment, we propose a scene representation that considers the local perception both at the start and the goal location. We further enhance the coherence of the generated motion by integrating frame embeddings with language input. Additionally, to support model training, we present a comprehensive motion-captured dataset comprising 16 hours of motion sequences in 120 indoor scenes covering 40 types of motions, each annotated with precise language descriptions. Experimental results demonstrate the efficacy of our method in generating high-quality, multi-stage motions closely aligned with environmental and textual conditions.

[CV-41] Investigating and Mitigating Object Hallucinations in Pretrained Vision-Language (CLIP) Models EMNLP2024

链接: https://arxiv.org/abs/2410.03176
作者: Yufang Liu,Tao Ji,Changzhi Sun,Yuanbin Wu,Aimin Zhou
关键词-EN: achieved impressive performance, Large Vision-Language Models, Large Vision-Language, CLIP model, impressive performance
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: EMNLP 2024

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have achieved impressive performance, yet research has pointed out a serious issue with object hallucinations within these models. However, there is no clear conclusion as to which part of the model these hallucinations originate from. In this paper, we present an in-depth investigation into the object hallucination problem specifically within the CLIP model, which serves as the backbone for many state-of-the-art vision-language systems. We unveil that even in isolation, the CLIP model is prone to object hallucinations, suggesting that the hallucination problem is not solely due to the interaction between vision and language modalities. To address this, we propose a counterfactual data augmentation method by creating negative samples with a variety of hallucination issues. We demonstrate that our method can effectively mitigate object hallucinations for the CLIP model, and we show that the enhanced model can be employed as a visual encoder, effectively alleviating the object hallucination issue in LVLMs.
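One simple flavor of the counterfactual negatives the abstract mentions is a caption where an object actually present in the image is swapped for one that is absent. The helper below is purely illustrative string surgery, not the paper's generation procedure:

```python
def make_object_negative(caption, present_obj, absent_obj):
    """Build a counterfactual negative caption by replacing an object
    that is in the image with one that is not. Training CLIP to score
    such negatives lower than the true caption discourages object
    hallucination. Hypothetical sketch; the paper's pipeline covers a
    variety of hallucination issues, not just object swaps."""
    if present_obj not in caption:
        raise ValueError("object to replace must appear in the caption")
    return caption.replace(present_obj, absent_obj)

neg = make_object_negative("a dog sleeping on a sofa", "dog", "cat")
```

The positive/negative caption pair can then be used with a standard contrastive objective so the image embedding is pulled toward the faithful caption and pushed away from the hallucinated one.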

[CV-42] HRVMamba: High-Resolution Visual State Space Model for Dense Prediction

链接: https://arxiv.org/abs/2410.03174
作者: Hao Zhang,Yongqiang Ma,Wenqi Shao,Ping Luo,Nanning Zheng,Kaipeng Zhang
关键词-EN: Visual State Space, global receptive field, demonstrated significant potential, linear computational complexity, State Space
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recently, State Space Models (SSMs) with efficient hardware-aware designs, i.e., Mamba, have demonstrated significant potential in computer vision tasks due to their linear computational complexity with respect to token length and their global receptive field. However, Mamba’s performance on dense prediction tasks, including human pose estimation and semantic segmentation, has been constrained by three key challenges: insufficient inductive bias, long-range forgetting, and low-resolution output representation. To address these challenges, we introduce the Dynamic Visual State Space (DVSS) block, which utilizes multi-scale convolutional kernels to extract local features across different scales and enhance inductive bias, and employs deformable convolution to mitigate the long-range forgetting problem while enabling adaptive spatial aggregation based on input and task-specific information. By leveraging the multi-resolution parallel design proposed in HRNet, we introduce High-Resolution Visual State Space Model (HRVMamba) based on the DVSS block, which preserves high-resolution representations throughout the entire process while promoting effective multi-scale feature learning. Extensive experiments highlight HRVMamba’s impressive performance on dense prediction tasks, achieving competitive results against existing benchmark models without bells and whistles. Code is available at this https URL.

[CV-43] Selective Transformer for Hyperspectral Image Classification

链接: https://arxiv.org/abs/2410.03171
作者: Yichu Xu,Di Wang,Lefei Zhang,Liangpei Zhang
关键词-EN: Selective Transformer Block, achieved satisfactory results, Selective Transformer, Kernel Selective Transformer, Transformer Block
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Transformer has achieved satisfactory results in the field of hyperspectral image (HSI) classification. However, existing Transformer models face two key challenges when dealing with HSI scenes characterized by diverse land cover types and rich spectral information: (1) fixed receptive field representation overlooks effective contextual information; (2) redundant self-attention feature representation. To address these limitations, we propose a novel Selective Transformer (SFormer) for HSI classification. The SFormer is designed to dynamically select receptive fields for capturing both spatial and spectral contextual information, while mitigating the impact of redundant data by prioritizing the most relevant features. This enables a highly accurate classification of the land covers of the HSI. Specifically, a Kernel Selective Transformer Block (KSTB) is first utilized to dynamically select an appropriate receptive field range to effectively extract spatial-spectral features. Furthermore, to capture the most crucial tokens, a Token Selective Transformer Block (TSTB) is introduced, which selects the most relevant tokens based on the ranking of attention scores for each query. Extensive experiments on four benchmark HSI datasets demonstrate that the proposed SFormer outperforms the state-of-the-art HSI classification models. The codes will be released.
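The Token Selective Transformer Block's core move, keeping only the tokens with the highest attention scores, can be sketched as a top-k filter. The real block ranks tokens per query inside attention; this list-level version and its names are illustrative:

```python
def select_top_tokens(tokens, attn_scores, k):
    """Keep the k tokens with the highest attention scores, discarding
    the redundant rest, while preserving the original token order.
    A pure-Python stand-in for the paper's token-selection idea."""
    ranked = sorted(range(len(tokens)),
                    key=lambda i: attn_scores[i], reverse=True)
    keep = set(ranked[:k])
    return [tok for i, tok in enumerate(tokens) if i in keep]

kept = select_top_tokens(["a", "b", "c", "d"], [0.1, 0.9, 0.3, 0.8], k=2)
```

Dropping low-scoring tokens before the attention aggregation is what reduces the redundant self-attention representation the abstract criticizes.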

[CV-44] Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach

链接: https://arxiv.org/abs/2410.03160
作者: Yaofang Liu,Yumeng Ren,Xiaodong Cun,Aitor Artola,Yang Liu,Tieyong Zeng,Raymond H. Chan,Jean-michel Morel
关键词-EN: revolutionized image generation, Diffusion models, shown promise, video diffusion models, revolutionized image
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Code at this https URL

点击查看摘要

Abstract:Diffusion models have revolutionized image generation, and their extension to video generation has shown promise. However, current video diffusion models (VDMs) rely on a scalar timestep variable applied at the clip level, which limits their ability to model complex temporal dependencies needed for various tasks like image-to-video generation. To address this limitation, we propose a frame-aware video diffusion model (FVDM), which introduces a novel vectorized timestep variable (VTV). Unlike conventional VDMs, our approach allows each frame to follow an independent noise schedule, enhancing the model’s capacity to capture fine-grained temporal dependencies. FVDM’s flexibility is demonstrated across multiple tasks, including standard video generation, image-to-video generation, video interpolation, and long video synthesis. Through a diverse set of VTV configurations, we achieve superior quality in generated videos, overcoming challenges such as catastrophic forgetting during fine-tuning and limited generalizability in zero-shot settings. Our empirical evaluations show that FVDM outperforms state-of-the-art methods in video generation quality, while also excelling in extended tasks. By addressing fundamental shortcomings in existing VDMs, FVDM sets a new paradigm in video synthesis, offering a robust framework with significant implications for generative modeling and multimedia applications.
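The difference between a clip-level scalar timestep and the vectorized timestep variable can be shown in a few lines: instead of one noise level shared by every frame, each frame draws its own. The uniform sampling below is an illustrative assumption; FVDM's actual VTV configurations are richer:

```python
import random

def sample_vectorized_timesteps(num_frames, num_steps, seed=0):
    """Draw an independent diffusion timestep per frame, so frames can
    sit at different noise levels during training. The conventional
    clip-level scheme is the special case where all entries are equal.
    Sketch only; the sampling distribution here is an assumption."""
    rng = random.Random(seed)
    return [rng.randrange(num_steps) for _ in range(num_frames)]

timesteps = sample_vectorized_timesteps(num_frames=8, num_steps=1000, seed=3)
```

Conditioning frames on heterogeneous timesteps is what lets one model cover image-to-video (first frame clean, rest noisy), interpolation, and long-video extension as different VTV configurations.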

[CV-45] Bridging the Gap between Text Audio Image and Any Sequence: A Novel Approach using Gloss-based Annotation

链接: https://arxiv.org/abs/2410.03146
作者: Sen Fang,Yalin Feng,Sizhou Chen,Xiaofeng Zhang,Teik Toe Teoh
关键词-EN: utilizing gloss-based annotation, approach called BGTAI, called BGTAI, BGTAI to simplify, simplify multimodal understanding
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper presents an innovative approach called BGTAI to simplify multimodal understanding by utilizing gloss-based annotation as an intermediate step in aligning Text and Audio with Images. While the dynamic temporal factors in textual and audio inputs contain various predicate adjectives that influence the meaning of the entire sentence, images, on the other hand, present static scenes. By representing text and audio as gloss notations that omit complex semantic nuances, a better alignment with images can potentially be achieved. This study explores the feasibility of this idea, specifically, we first propose the first Langue2Gloss model and then integrate it into the multimodal model UniBriVL for joint training. To strengthen the adaptability of gloss with text/audio and overcome the efficiency and instability issues in multimodal training, we propose a DS-Net (Data-Pair Selection Network), a Result Filter module, and a novel SP-Loss function. Our approach outperforms previous multimodal models in the main experiments, demonstrating its efficacy in enhancing multimodal representations and improving compatibility among text, audio, visual, and any sequence modalities.

[CV-46] Machine Learning for Asymptomatic Ratoon Stunting Disease Detection With Freely Available Satellite Based Multispectral Imaging

链接: https://arxiv.org/abs/2410.03141
作者: Ethan Kane Waters,Carla Chia-ming Chen,Mostafa Rahimi Azghadi
关键词-EN: Ratoon Stunting Disease, Ratoon Stunting, effective crop management, asymptomatic infectious diseases, Stunting Disease
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: 13 pages, 1 figure and 2 tables (main text), 1 figure and 3 tables (appendices). Submitted to “Computers and Electronics in Agriculture”

点击查看摘要

Abstract:Disease detection in sugarcane, particularly the identification of asymptomatic infectious diseases such as Ratoon Stunting Disease (RSD), is critical for effective crop management. This study employed various machine learning techniques to detect the presence of RSD in different sugarcane varieties, using vegetation indices derived from freely available satellite-based spectral data. Our results show that the Support Vector Machine with a Radial Basis Function Kernel (SVM-RBF) was the most effective algorithm, achieving classification accuracy between 85.64% and 96.55%, depending on the variety. Gradient Boosting and Random Forest also demonstrated high performance, achieving accuracy between 83.33% and 96.55%, while Logistic Regression and Quadratic Discriminant Analysis showed variable results across different varieties. The inclusion of sugarcane variety and vegetation indices was important for detecting RSD, in agreement with the current literature. Our study highlights the potential of satellite-based remote sensing as a cost-effective and efficient alternative to traditional manual laboratory testing for large-scale sugarcane disease detection.
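As a rough, hedged illustration of the winning classifier, the sketch below trains an RBF-kernel SVM on synthetic stand-ins for satellite-derived vegetation-index features (the feature distributions, dimensionality, and hyperparameters are invented for illustration and are not the authors' data or settings):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-ins for vegetation indices (NDVI-like features):
# RSD-infected cane is assumed to show slightly depressed index values.
rng = np.random.default_rng(42)
n = 400
healthy = rng.normal(loc=0.70, scale=0.08, size=(n // 2, 4))
infected = rng.normal(loc=0.55, scale=0.08, size=(n // 2, 4))
X = np.vstack([healthy, infected])
y = np.array([0] * (n // 2) + [1] * (n // 2))   # 1 = RSD present

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# SVM with a radial basis function kernel, as in the paper's best model;
# C and gamma here are scikit-learn defaults, not the authors' tuned values.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
```

On well-separated synthetic features like these the classifier easily exceeds chance; the paper's 85-97% range reflects real per-variety variability that this toy setup does not capture.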

[CV-47] ARB-LLM: Alternating Refined Binarizations for Large Language Models

链接: https://arxiv.org/abs/2410.03129
作者: Zhiteng Li,Xianglong Yan,Tianao Zhang,Haotong Qin,Dong Xie,Jiang Tian,zhongchao shi,Linghe Kong,Yulun Zhang,Xiaokang Yang
关键词-EN: Large Language Models, natural language processing, Large Language, hinder practical deployment, greatly pushed forward
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: The code and models will be available at this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) have greatly pushed forward advancements in natural language processing, yet their high memory and computational demands hinder practical deployment. Binarization, as an effective compression technique, can shrink model weights to just 1 bit, significantly reducing the high demands on computation and memory. However, current binarization methods struggle to narrow the distribution gap between binarized and full-precision weights, while also overlooking the column deviation in LLM weight distribution. To tackle these issues, we propose ARB-LLM, a novel 1-bit post-training quantization (PTQ) technique tailored for LLMs. To narrow the distribution shift between binarized and full-precision weights, we first design an alternating refined binarization (ARB) algorithm to progressively update the binarization parameters, which significantly reduces the quantization error. Moreover, considering the pivotal role of calibration data and the column deviation in LLM weights, we further extend ARB to ARB-X and ARB-RC. In addition, we refine the weight partition strategy with a column-group bitmap (CGB), which further enhances performance. Equipping ARB-X and ARB-RC with CGB, we obtain ARB-LLM_X and ARB-LLM_RC respectively, which significantly outperform state-of-the-art (SOTA) binarization methods for LLMs. As a binary PTQ method, our ARB-LLM_RC is the first to surpass FP16 models of the same size. The code and models will be available at this https URL.
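The basic alternating-refinement step behind 1-bit binarization can be sketched as follows (a generic scale-and-shift binarizer with alternating least-squares updates; the paper's ARB-X/ARB-RC refinements and column-group bitmaps are not reproduced here):

```python
import numpy as np

def binarize_alternating(W, iters=5):
    """Approximate W ~= alpha * B + mu with B in {-1, +1}, alternately
    re-solving each parameter in closed form. Illustrative only: ARB-LLM's
    actual updates (ARB-X, ARB-RC, column-group bitmaps) are more involved."""
    mu = W.mean()
    alpha = np.abs(W - mu).mean()
    for _ in range(iters):
        B = np.sign(W - mu)
        B[B == 0] = 1.0                    # break ties deterministically
        alpha = np.mean(B * (W - mu))      # least-squares scale given B, mu
        mu = np.mean(W - alpha * B)        # least-squares shift given B, alpha
    return alpha, mu, B

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
alpha, mu, B = binarize_alternating(W)
err = np.linalg.norm(W - (alpha * B + mu)) / np.linalg.norm(W)
```

For Gaussian weights the best single-scale sign approximation leaves a relative error of about sqrt(1 - 2/pi), roughly 0.60, which the alternating updates approach; ARB's per-column refinements exist precisely to push below this.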

[CV-48] MBDS: A Multi-Body Dynamics Simulation Dataset for Graph Networks Simulators

链接: https://arxiv.org/abs/2410.03107
作者: Sheng Yang,Fengge Wu,Junsuo Zhao
关键词-EN: Graph Network Simulators, modeling physical phenomena, Graph Network, Network Simulators, neural networks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Modeling the structure and events of the physical world constitutes a fundamental objective of neural networks. Among the diverse approaches, Graph Network Simulators (GNS) have emerged as the leading method for modeling physical phenomena, owing to their low computational cost and high accuracy. The datasets employed for training and evaluating physical simulation techniques are typically generated by researchers themselves, often resulting in limited data volume and quality. Consequently, this poses challenges in accurately assessing the performance of these methods. In response to this, we have constructed a high-quality physical simulation dataset encompassing 1D, 2D, and 3D scenes, along with more trajectories and time-steps compared to existing datasets. Furthermore, our work distinguishes itself by developing eight complete scenes, significantly enhancing the dataset’s comprehensiveness. A key feature of our dataset is the inclusion of precise multi-body dynamics, facilitating a more realistic simulation of the physical world. Utilizing our high-quality dataset, we conducted a systematic evaluation of various existing GNS methods. Our dataset is accessible for download at this https URL, offering a valuable resource for researchers to enhance the training and evaluation of their methodologies.

[CV-49] Mamba in Vision: A Comprehensive Survey of Techniques and Applications

链接: https://arxiv.org/abs/2410.03105
作者: Md Maklachur Rahman,Abdullah Aman Tutul,Ankur Nath,Lamyanba Laishram,Soon Ki Jung,Tracy Hammond
关键词-EN: Convolutional Neural Networks, Neural Networks, Convolutional Neural, faced by Convolutional, Vision Transformers
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Under Review

点击查看摘要

Abstract:Mamba is emerging as a novel approach to overcome the challenges faced by Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in computer vision. While CNNs excel at extracting local features, they often struggle to capture long-range dependencies without complex architectural modifications. In contrast, ViTs effectively model global relationships but suffer from high computational costs due to the quadratic complexity of their self-attention mechanisms. Mamba addresses these limitations by leveraging Selective Structured State Space Models to effectively capture long-range dependencies with linear computational complexity. This survey analyzes the unique contributions, computational benefits, and applications of Mamba models while also identifying challenges and potential future research directions. We provide a foundational resource for advancing the understanding and growth of Mamba models in computer vision. An overview of this work is available at this https URL.
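The linear-complexity recurrence at the heart of selective SSMs, the family Mamba builds on, can be sketched with a diagonal state and input-dependent projections (a toy version with made-up matrices; real Mamba adds learned discretization, gating, and a hardware-aware parallel scan):

```python
import numpy as np

def selective_ssm_scan(x, A, B_proj, C_proj):
    """Minimal selective state-space recurrence:
    h_t = A * h_{t-1} + B_t x_t,  y_t = C_t . h_t,
    where B_t, C_t depend on the input x_t ("selective")."""
    T, _ = x.shape
    n = A.shape[0]
    h = np.zeros(n)
    ys = []
    for t in range(T):                 # a single pass: O(T) in sequence length
        Bt = B_proj @ x[t]             # input-dependent state update
        Ct = C_proj @ x[t]             # input-dependent readout
        h = A * h + Bt                 # elementwise (diagonal) transition
        ys.append(float(Ct @ h))
    return np.array(ys)

rng = np.random.default_rng(1)
T, d, n = 16, 4, 8
x = rng.normal(size=(T, d))
A = np.full(n, 0.9)                    # stable diagonal dynamics, |A| < 1
y = selective_ssm_scan(x, A, rng.normal(size=(n, d)), rng.normal(size=(n, d)))
```

The single left-to-right scan is what gives the linear cost that the survey contrasts with self-attention's quadratic complexity.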

[CV-50] Combing Text-based and Drag-based Editing for Precise and Flexible Image Editing

链接: https://arxiv.org/abs/2410.03097
作者: Ziqi Jiang,Zhen Wang,Long Chen
关键词-EN: computer vision, editing, remains a fundamental, fundamental challenge, challenge in computer
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 12 pages, 9 figures

点击查看摘要

Abstract:Precise and flexible image editing remains a fundamental challenge in computer vision. Based on the modified areas, most editing methods can be divided into two main types: global editing and local editing. In this paper, we choose the two most common editing approaches (i.e., text-based editing and drag-based editing) and analyze their drawbacks. Specifically, text-based methods often fail to describe the desired modifications precisely, while drag-based methods suffer from ambiguity. To address these issues, we propose CLIPDrag, a novel image editing method that is the first to combine text and drag signals for precise and ambiguity-free manipulations on diffusion models. To fully leverage these two signals, we treat text signals as global guidance and drag points as local information. Then we introduce a novel global-local motion supervision method to integrate text signals into existing drag-based methods by adapting a pre-trained language-vision model like CLIP. Furthermore, we also address the problem of slow convergence in CLIPDrag by presenting a fast point-tracking method that enforces drag points to move in the correct directions. Extensive experiments demonstrate that CLIPDrag outperforms existing single drag-based methods or text-based methods.

[CV-51] Generative Edge Detection with Stable Diffusion

链接: https://arxiv.org/abs/2410.03080
作者: Caixia Zhou,Yaping Huang,Mochu Xiang,Jiahui Ren,Haibin Ling,Jing Zhang
关键词-EN: pixel-level classification problem, edge detection methods, generative edge detection, Edge detection, edge detection task
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Edge detection is typically viewed as a pixel-level classification problem mainly addressed by discriminative methods. Recently, generative edge detection methods, especially diffusion model based solutions, are initialized in the edge detection task. Despite great potential, the retraining of task-specific designed modules and multi-step denoising inference limits their broader applications. Upon closer investigation, we speculate that part of the reason is the under-exploration of the rich discriminative information encoded in extensively pre-trained large models (e.g., stable diffusion models). Thus motivated, we propose a novel approach, named Generative Edge Detector (GED), by fully utilizing the potential of the pre-trained stable diffusion model. Our model can be trained and inferred efficiently without specific network design due to the rich high-level and low-level prior knowledge empowered by the pre-trained stable diffusion. Specifically, we propose to finetune the denoising U-Net and predict latent edge maps directly, by taking the latent image feature maps as input. Additionally, due to the subjectivity and ambiguity of the edges, we also incorporate the granularity of the edges into the denoising U-Net model as one of the conditions to achieve controllable and diverse predictions. Furthermore, we devise a granularity regularization to ensure the relative granularity relationship of the multiple predictions. We conduct extensive experiments on multiple datasets and achieve competitive performance (e.g., 0.870 and 0.880 in terms of ODS and OIS on the BSDS test dataset).

[CV-52] DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models EMNLP2024

链接: https://arxiv.org/abs/2410.03061
作者: Sungnyun Kim,Haofu Liao,Srikar Appalaraju,Peng Tang,Zhuowen Tu,Ravi Kumar Satzoda,R. Manmatha,Vijay Mahadevan,Stefano Soatto
关键词-EN: Visual document understanding, text and image, involves understanding documents, Visual document, involves understanding
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: Accepted to EMNLP 2024

点击查看摘要

Abstract:Visual document understanding (VDU) is a challenging task that involves understanding documents across various modalities (text and image) and layouts (forms, tables, etc.). This study aims to enhance generalizability of small VDU models by distilling knowledge from LLMs. We identify that directly prompting LLMs often fails to generate informative and useful data. In response, we present a new framework (called DocKD) that enriches the data generation process by integrating external document knowledge. Specifically, we provide an LLM with various document elements like key-value pairs, layouts, and descriptions, to elicit open-ended answers. Our experiments show that DocKD produces high-quality document annotations and surpasses the direct knowledge distillation approach that does not leverage external document knowledge. Moreover, student VDU models trained with solely DocKD-generated data are not only comparable to those trained with human-annotated data on in-domain tasks but also significantly surpass them on out-of-domain tasks.

[CV-53] DiffKillR: Killing and Recreating Diffeomorphisms for Cell Annotation in Dense Microscopy Images

链接: https://arxiv.org/abs/2410.03058
作者: Chen Liu,Danqi Liao,Alejandro Parada-Mayorga,Alejandro Ribeiro,Marcello DiStasio,Smita Krishnaswamy
关键词-EN: presents significant opportunities, driven by advances, slide scanning, presents significant, clinical diagnostics
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The proliferation of digital microscopy images, driven by advances in automated whole slide scanning, presents significant opportunities for biomedical research and clinical diagnostics. However, accurately annotating densely packed information in these images remains a major challenge. To address this, we introduce DiffKillR, a novel framework that reframes cell annotation as the combination of archetype matching and image registration tasks. DiffKillR employs two complementary neural networks: one that learns a diffeomorphism-invariant feature space for robust cell matching and another that computes the precise warping field between cells for annotation mapping. Using a small set of annotated archetypes, DiffKillR efficiently propagates annotations across large microscopy images, reducing the need for extensive manual labeling. More importantly, it is suitable for any type of pixel-level annotation. We will discuss the theoretical properties of DiffKillR and validate it on three microscopy tasks, demonstrating its advantages over existing supervised, semi-supervised, and unsupervised methods.

[CV-54] CLIP-Clique: Graph-based Correspondence Matching Augmented by Vision Language Models for Object-based Global Localization

链接: https://arxiv.org/abs/2410.03054
作者: Shigemichi Matsuzaki,Kazuhito Tanaka,Kazuhiro Shintani
关键词-EN: letter proposes, global localization, Vision Language Models, semantic object landmarks, surrounding objects
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: IEEE Robotics and Automation Letters

点击查看摘要

Abstract:This letter proposes a method of global localization on a map with semantic object landmarks. One of the most promising approaches for localization on object maps is to use semantic graph matching using landmark descriptors calculated from the distribution of surrounding objects. These descriptors are vulnerable to misclassification and partial observations. Moreover, many existing methods rely on inlier extraction using RANSAC, which is stochastic and sensitive to a high outlier rate. To address the former issue, we augment the correspondence matching using Vision Language Models (VLMs). Landmark discriminability is improved by VLM embeddings, which are independent of surrounding objects. In addition, inliers are estimated deterministically using a graph-theoretic approach. We also incorporate pose calculation using the weighted least squares considering correspondence similarity and observation completeness to improve the robustness. We confirmed improvements in matching and pose estimation accuracy through experiments on ScanNet and TUM datasets.

[CV-55] AuroraCap: Efficient Performant Video Detailed Captioning and a New Benchmark

链接: https://arxiv.org/abs/2410.03051
作者: Wenhao Chai,Enxin Song,Yilun Du,Chenlin Meng,Vashisht Madhavan,Omer Bar-Tal,Jeng-Neng Hwang,Saining Xie,Christopher D. Manning
关键词-EN: coherent textual descriptions, understanding and generation, key task, task which aims, aims to generate
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Code, docs, weight, benchmark and training data are all avaliable at \href{ this https URL }{website}

点击查看摘要

Abstract:Video detailed captioning is a key task which aims to generate comprehensive and coherent textual descriptions of video content, benefiting both video understanding and generation. In this paper, we propose AuroraCap, a video captioner based on a large multimodal model. We follow the simplest architecture design without additional parameters for temporal modeling. To address the overhead caused by lengthy video sequences, we implement the token merging strategy, reducing the number of input visual tokens. Surprisingly, we found that this strategy results in little performance loss. AuroraCap shows superior performance on various video and image captioning benchmarks, for example, obtaining a CIDEr of 88.9 on Flickr30k, beating GPT-4V (55.3) and Gemini-1.5 Pro (82.2). However, existing video caption benchmarks only include simple descriptions, consisting of a few dozen words, which limits research in this field. Therefore, we develop VDC, a video detailed captioning benchmark with over one thousand carefully annotated structured captions. In addition, we propose a new LLM-assisted metric, VDCscore, for better evaluation, which adopts a divide-and-conquer strategy to transform long caption evaluation into multiple short question-answer pairs. With the help of human Elo ranking, our experiments show that this benchmark better correlates with human judgments of video detailed captioning quality.
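The token merging strategy mentioned above can be sketched as bipartite matching over token similarity (in the spirit of ToMe-style merging; this toy version is an assumption about the mechanism, not AuroraCap's code):

```python
import numpy as np

def merge_tokens(tokens, r):
    """Bipartite soft-matching sketch: split tokens into two sets, find each
    source token's most similar destination token by cosine similarity, and
    merge the r most similar pairs by averaging, shrinking the sequence."""
    norm = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    a, b = norm[0::2], norm[1::2]                  # alternate split
    sim = a @ b.T                                  # cosine similarity matrix
    best_dst = sim.argmax(axis=1)                  # best partner for each src
    best_sim = sim.max(axis=1)
    merge_idx = np.argsort(-best_sim)[:r]          # r most similar src tokens
    src_tokens, dst_tokens = tokens[0::2].copy(), tokens[1::2].copy()
    for i in merge_idx:
        j = best_dst[i]
        dst_tokens[j] = (dst_tokens[j] + src_tokens[i]) / 2   # average merge
    keep = np.setdiff1d(np.arange(len(src_tokens)), merge_idx)
    return np.vstack([src_tokens[keep], dst_tokens])

rng = np.random.default_rng(0)
tokens = rng.normal(size=(20, 16))     # 20 visual tokens of dimension 16
merged = merge_tokens(tokens, r=5)     # drop 5 tokens this layer
```

Applied per layer, this trims the visual token count steadily, which is how such schemes cut the overhead of long video sequences with little accuracy loss.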

[CV-56] Revealing the Unseen: Guiding Personalized Diffusion Models to Expose Training Data

链接: https://arxiv.org/abs/2410.03039
作者: Xiaoyu Wu,Jiaru Zhang,Steven Wu
关键词-EN: capture specific styles, Diffusion Models, styles or objects, evolved into advanced, small set
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Under review

点击查看摘要

Abstract:Diffusion Models (DMs) have evolved into advanced image generation tools, especially for few-shot fine-tuning where a pretrained DM is fine-tuned on a small set of images to capture specific styles or objects. Many people upload these personalized checkpoints online, fostering communities such as Civitai and HuggingFace. However, model owners may overlook the potential risks of data leakage by releasing their fine-tuned checkpoints. Moreover, concerns regarding copyright violations arise when unauthorized data is used during fine-tuning. In this paper, we ask: “Can training data be extracted from these fine-tuned DMs shared online?” A successful extraction would present not only data leakage threats but also offer tangible evidence of copyright infringement. To answer this, we propose FineXtract, a framework for extracting fine-tuning data. Our method approximates fine-tuning as a gradual shift in the model’s learned distribution – from the original pretrained DM toward the fine-tuning data. By extrapolating the models before and after fine-tuning, we guide the generation toward high-probability regions within the fine-tuned data distribution. We then apply a clustering algorithm to extract the most probable images from those generated using this extrapolated guidance. Experiments on DMs fine-tuned with datasets such as WikiArt, DreamBooth, and real-world checkpoints posted online validate the effectiveness of our method, extracting approximately 20% of fine-tuning data in most cases, significantly surpassing baseline performance.
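The extrapolation idea (guiding generation along the pretrained-to-fine-tuned shift) reduces to one line once both models' noise predictions are in hand. The sketch below is a hedged reading of that step; FineXtract's exact guidance formula and the clustering stage are not reproduced:

```python
import numpy as np

def extrapolated_guidance(eps_pretrained, eps_finetuned, w):
    """Push the denoising prediction past the fine-tuned model, along the
    direction of the distribution shift induced by fine-tuning, to steer
    sampling toward high-probability regions of the fine-tuning data."""
    return eps_finetuned + w * (eps_finetuned - eps_pretrained)

eps_pre = np.array([0.1, 0.2, 0.3])    # pretrained model's noise prediction
eps_ft = np.array([0.2, 0.1, 0.4])     # fine-tuned model's noise prediction
guided = extrapolated_guidance(eps_pre, eps_ft, w=2.0)
```

With w = 0 this falls back to ordinary sampling from the fine-tuned model; larger w amplifies whatever the fine-tuning changed.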

[CV-57] CPFD: Confidence-aware Privileged Feature Distillation for Short Video Classification CIKM2024

链接: https://arxiv.org/abs/2410.03038
作者: Jinghao Shi,Xiang Shen,Kaili Zhao,Xuedong Wang,Vera Wen,Zixuan Wang,Yifan Wu,Zhixin Zhang
关键词-EN: Privileged Feature Distillation, Dense features, Privileged Dense Features, short video classification, Privileged Dense
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Camera ready for CIKM 2024

点击查看摘要

Abstract:Dense features, customized for different business scenarios, are essential in short video classification. However, their complexity, specific adaptation requirements, and high computational costs make them resource-intensive and less accessible during online inference. Consequently, these dense features are categorized as `Privileged Dense Features'. Meanwhile, end-to-end multi-modal models have shown promising results in numerous computer vision tasks. In industrial applications, prioritizing end-to-end multi-modal features can enhance efficiency but often leads to the loss of valuable information from historical privileged dense features. To integrate both features while maintaining efficiency and manageable resource costs, we present Confidence-aware Privileged Feature Distillation (CPFD), which empowers the features of an end-to-end multi-modal model by adaptively distilling privileged features during training. Unlike existing privileged feature distillation (PFD) methods, which apply uniform weights to all instances during distillation, potentially causing unstable performance across different business scenarios and a notable performance gap between the teacher model (dense-feature-enhanced multimodal model, DF-X-VLM) and the student model (multimodal model only, X-VLM), our CPFD leverages confidence scores derived from the teacher model to adaptively mitigate the performance variance with the student model. We conducted extensive offline experiments on five diverse tasks demonstrating that CPFD improves the video classification F1 score by 6.76% compared with the end-to-end multimodal model (X-VLM) and by 2.31% over vanilla PFD on average. It also reduces the performance gap by 84.6% and achieves results comparable to the teacher model DF-X-VLM. The effectiveness of CPFD is further substantiated by online experiments, and our framework has been deployed in production systems for over a dozen models.
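One way to read the "confidence-aware" weighting is to scale each instance's distillation term by the teacher's confidence in the true label, rather than the uniform weight of vanilla PFD. The sketch below is an assumption about that mechanism, not CPFD's actual loss:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cpfd_loss(student_logits, teacher_logits, labels):
    """Per-instance loss = task cross-entropy + (teacher confidence) * KL
    between teacher and student distributions. Instances where the teacher
    is unsure contribute less distillation signal."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    n = len(labels)
    ce = -np.log(p_s[np.arange(n), labels])                  # task loss
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=1)   # distillation
    conf = p_t[np.arange(n), labels]                         # teacher confidence
    return float(np.mean(ce + conf * kl))

rng = np.random.default_rng(0)
s = rng.normal(size=(8, 5))            # student logits, 8 videos, 5 classes
t = rng.normal(size=(8, 5))            # teacher (DF-X-VLM-like) logits
labels = rng.integers(0, 5, size=8)
loss = cpfd_loss(s, t, labels)
```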

[CV-58] Dynamic Sparse Training versus Dense Training: The Unexpected Winner in Image Corruption Robustness

链接: https://arxiv.org/abs/2410.03030
作者: Boqian Wu,Qiao Xiao,Shunxin Wang,Nicola Strisciuglio,Mykola Pechenizkiy,Maurice van Keulen,Decebal Constantin Mocanu,Elena Mocanu
关键词-EN: Dynamic Sparse Training, artificial neural networks, Sparse Training, Dynamic Sparse, Sparse Training opens
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:It is generally perceived that Dynamic Sparse Training opens the door to a new era of scalability and efficiency for artificial neural networks, perhaps at some cost in classification accuracy. At the same time, Dense Training is widely accepted as the “de facto” approach to train artificial neural networks if one would like to maximize their robustness against image corruption. In this paper, we question this general practice. Consequently, we claim that, contrary to what is commonly thought, the Dynamic Sparse Training methods can consistently outperform Dense Training in terms of robustness accuracy, particularly if the efficiency aspect is not considered as a main objective (i.e., sparsity levels between 10% and 50%), without adding (or even reducing) resource cost. We validate our claim on two types of data, images and videos, using several traditional and modern deep learning architectures for computer vision and three widely studied Dynamic Sparse Training algorithms. Our findings reveal a new yet-unknown benefit of Dynamic Sparse Training and open new possibilities in improving deep learning robustness beyond the current state of the art.
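A typical Dynamic Sparse Training topology update, in the SET-style prune-and-regrow form that several widely studied DST algorithms share, looks roughly like this (a sketch with invented sizes, not the exact algorithms evaluated in the paper):

```python
import numpy as np

def prune_and_regrow(W, mask, drop_frac, rng):
    """One DST topology update: drop the smallest-magnitude active weights,
    then regrow the same number of connections at random inactive positions,
    keeping the overall sparsity level constant."""
    active = np.flatnonzero(mask)
    k = int(drop_frac * len(active))
    # prune: smallest-magnitude active weights
    drop = active[np.argsort(np.abs(W.flat[active]))[:k]]
    mask.flat[drop] = False
    # regrow: random currently-inactive positions
    inactive = np.flatnonzero(~mask)
    grow = rng.choice(inactive, size=k, replace=False)
    mask.flat[grow] = True
    W.flat[grow] = 0.0                 # new connections start at zero
    return W, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 32))
mask = rng.random((32, 32)) < 0.3      # ~30% density, within the paper's range
density_before = mask.mean()
W, mask = prune_and_regrow(W, mask, drop_frac=0.3, rng=rng)
```

Repeating this update during training lets the sparse connectivity itself adapt, which is the property the paper links to improved corruption robustness.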

[CV-59] PixelShuffler: A Simple Image Translation Through Pixel Rearrangement

链接: https://arxiv.org/abs/2410.03021
作者: Omar Zamzam
关键词-EN: converting MRI scans, MRI scans, MRI contrasts, converting MRI, generating photorealistic images
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Image-to-image translation is a topic in computer vision with a vast range of use cases, from medical image translation, such as converting MRI scans to CT scans or to other MRI contrasts, to image colorization, super-resolution, domain adaptation, and generating photorealistic images from sketches or semantic maps. Image style transfer is also a widely researched application of image-to-image translation, where the goal is to synthesize an image that combines the content of one image with the style of another. Existing state-of-the-art methods often rely on complex neural networks, including diffusion models and language models, to achieve high-quality style transfer, but these methods can be computationally expensive and intricate to implement. In this paper, we propose a novel pixel shuffle method that addresses the image-to-image translation problem generally, with a specific demonstrative application in style transfer. The proposed method approaches style transfer by shuffling the pixels of the style image such that the mutual information between the shuffled image and the content image is maximized. This approach inherently preserves the colors of the style image while ensuring that the structural details of the content image are retained in the stylized output. We demonstrate that this simple and straightforward method produces results that are comparable to state-of-the-art techniques, as measured by the Learned Perceptual Image Patch Similarity (LPIPS) loss for content preservation and the Fréchet Inception Distance (FID) score for style similarity. Our experiments validate that the proposed pixel shuffle method achieves competitive performance with significantly reduced complexity, offering a promising alternative for efficient image style transfer and showing promise for general image-to-image translation tasks.
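A minimal version of the pixel-rearrangement idea is rank matching: permute the style image's pixels so their intensity order follows the content image. This is an illustrative proxy for the paper's mutual-information objective, not its actual optimization:

```python
import numpy as np

def pixel_shuffle_transfer(content, style):
    """Reorder the style image's pixels so that their intensity ranking
    matches the content image's ranking. The style colors are only permuted
    (so they are preserved exactly), while the spatial arrangement now
    follows the content image's structure."""
    c = content.reshape(-1)
    s = np.sort(style.reshape(-1))               # style pixels, ascending
    order = np.argsort(np.argsort(c))            # rank of each content pixel
    return s[order].reshape(content.shape)

rng = np.random.default_rng(0)
content = rng.random((8, 8))                     # toy grayscale images
style = rng.random((8, 8))
out = pixel_shuffle_transfer(content, style)
```

By construction the output's pixel multiset equals the style image's, while its pixel ordering matches the content image, the two properties the abstract attributes to the shuffle.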

[CV-60] MMP: Towards Robust Multi-Modal Learning with Masked Modality Projection

链接: https://arxiv.org/abs/2410.03010
作者: Niki Nezakati,Md Kaykobad Reza,Ameya Patil,Mashhour Solh,M. Salman Asif
关键词-EN: Multimodal learning seeks, multiple input sources, Multimodal learning, downstream tasks, modalities
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multimodal learning seeks to combine data from multiple input sources to enhance the performance of different downstream tasks. In real-world scenarios, performance can degrade substantially if some input modalities are missing. Existing methods that can handle missing modalities involve custom training or adaptation steps for each input modality combination. These approaches are either tied to specific modalities or become computationally expensive as the number of input modalities increases. In this paper, we propose Masked Modality Projection (MMP), a method designed to train a single model that is robust to any missing modality scenario. We achieve this by randomly masking a subset of modalities during training and learning to project available input modalities to estimate the tokens for the masked modalities. This approach enables the model to effectively learn to leverage the information from the available modalities to compensate for the missing ones, enhancing missing modality robustness. We conduct a series of experiments with various baseline models and datasets to assess the effectiveness of this strategy. Experiments demonstrate that our approach improves robustness to different missing modality scenarios, outperforming existing methods designed for missing modalities or specific modality combinations.
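The masking-and-projection step can be sketched as follows (a single random linear projection over pooled available tokens; the paper learns these projections end-to-end, so this is only an assumption about the data flow):

```python
import numpy as np

def mmp_forward(tokens_by_modality, proj, mask):
    """Masked Modality Projection sketch: for each masked (missing) modality,
    estimate its tokens by projecting the mean of the available modalities'
    tokens; unmasked modalities pass through unchanged."""
    available = [t for t, m in zip(tokens_by_modality, mask) if not m]
    context = np.mean(available, axis=0)          # pooled available info
    out = []
    for t, m in zip(tokens_by_modality, mask):
        out.append(context @ proj if m else t)    # replace masked tokens
    return out

rng = np.random.default_rng(0)
d = 16
modalities = [rng.normal(size=(4, d)) for _ in range(3)]  # e.g. RGB/depth/audio
proj = rng.normal(size=(d, d)) / np.sqrt(d)
mask = [False, True, False]            # modality 1 "missing" this step
outputs = mmp_forward(modalities, proj, mask)
```

Randomly resampling `mask` at every training step is what forces the model to compensate for any missing-modality combination with a single set of weights.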

[CV-61] Fully Automated CTC Detection Segmentation and Classification for Multi-Channel IF Imaging MICCAI2024

链接: https://arxiv.org/abs/2410.02988
作者: Evan Schwab,Bharat Annaldas,Nisha Ramesh,Anna Lundberg,Vishal Shelke,Xinran Xu,Cole Gilbertson,Jiyun Byun,Ernest T. Lam
关键词-EN: Liquid biopsies, metastatic breast cancer, tissue biopsies, invasive and non-localized, non-localized alternative
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
*备注: Published in MICCAI 2024 MOVI Workshop Conference Proceedings

点击查看摘要

Abstract:Liquid biopsies (e.g., blood draws) offer a less invasive and non-localized alternative to tissue biopsies for monitoring the progression of metastatic breast cancer (mBCa). Immunofluorescence (IF) microscopy is a tool to image and analyze millions of blood cells in a patient sample. By detecting and genetically sequencing circulating tumor cells (CTCs) in the blood, personalized treatment plans are achievable for various cancer subtypes. However, CTCs are rare (about 1 in 2M), making manual CTC detection very difficult. In addition, clinicians rely on quantitative cellular biomarkers to manually classify CTCs. This requires prior tasks of cell detection, segmentation and feature extraction. To assist clinicians, we have developed a fully automated machine learning-based production-level pipeline to efficiently detect, segment and classify CTCs in multi-channel IF images. We achieve over 99% sensitivity and 97% specificity on 9,533 cells from 15 mBCa patients. Our pipeline has been successfully deployed on real mBCa patients, reducing a patient average of 14M detected cells to only 335 CTC candidates for manual review.

[CV-62] SymmetricDiffusers: Learning Discrete Diffusion on Finite Symmetric Groups

链接: https://arxiv.org/abs/2410.02942
作者: Yongxing Zhang,Donglin Yang,Renjie Liao
关键词-EN: Finite symmetric groups, essential in fields, Finite symmetric, symmetric groups, distribution
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Finite symmetric groups S_n are essential in fields such as combinatorics, physics, and chemistry. However, learning a probability distribution over S_n poses significant challenges due to its intractable size and discrete nature. In this paper, we introduce SymmetricDiffusers, a novel discrete diffusion model that simplifies the task of learning a complicated distribution over S_n by decomposing it into learning simpler transitions of the reverse diffusion using deep neural networks. We identify the riffle shuffle as an effective forward transition and provide empirical guidelines for selecting the diffusion length based on the theory of random walks on finite groups. Additionally, we propose a generalized Plackett-Luce (PL) distribution for the reverse transition, which is provably more expressive than the PL distribution. We further introduce a theoretically grounded “denoising schedule” to improve sampling and learning efficiency. Extensive experiments show that our model achieves state-of-the-art or comparable performance on tasks including sorting 4-digit MNIST images, jigsaw puzzles, and traveling salesman problems. Our code is released at this https URL.
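The forward transition the paper identifies, the riffle shuffle, has a standard sampler (the Gilbert-Shannon-Reeds model): cut the deck at a Binomial(n, 1/2) point, then interleave the two packets with probability proportional to their remaining sizes. A minimal sketch (the reverse-process generalized-PL parameterization is not shown):

```python
import numpy as np

def riffle_shuffle(perm, rng):
    """One Gilbert-Shannon-Reeds riffle shuffle of a sequence."""
    n = len(perm)
    cut = rng.binomial(n, 0.5)
    left, right = list(perm[:cut]), list(perm[cut:])
    out = []
    while left or right:
        # drop from a packet with probability proportional to its size
        if rng.random() < len(left) / (len(left) + len(right)):
            out.append(left.pop(0))
        else:
            out.append(right.pop(0))
    return np.array(out)

rng = np.random.default_rng(0)
deck = np.arange(10)
shuffled = deck
for _ in range(7):      # a handful of shuffles already mixes a small deck well
    shuffled = riffle_shuffle(shuffled, rng)
```

Random-walk theory (roughly (3/2) log2 n shuffles to mix) is what the paper's guideline for choosing the diffusion length draws on.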

[CV-63] RSA: Resolving Scale Ambiguities in Monocular Depth Estimators through Language Descriptions

链接: https://arxiv.org/abs/2410.02924
作者: Ziyao Zeng,Yangchao Wu,Hyoungseob Park,Daniel Wang,Fengyu Yang,Stefano Soatto,Dong Lao,Byung-Woo Hong,Alex Wong
关键词-EN: depth, monocular depth estimation, metric-scale monocular depth, depth estimation, relative depth
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We propose a method for metric-scale monocular depth estimation. Inferring depth from a single image is an ill-posed problem due to the loss of scale from perspective projection during the image formation process. Any scale chosen is a bias, typically stemming from training on a dataset; hence, existing works have instead opted to use relative (normalized, inverse) depth. Our goal is to recover metric-scaled depth maps through a linear transformation. The crux of our method lies in the observation that certain objects (e.g., cars, trees, street signs) are typically found or associated with certain types of scenes (e.g., outdoor). We explore whether language descriptions can be used to transform relative depth predictions to those in metric scale. Our method, RSA, takes as input a text caption describing objects present in an image and outputs the parameters of a linear transformation which can be applied globally to a relative depth map to yield metric-scaled depth predictions. We demonstrate our method on recent general-purpose monocular depth models on indoors (NYUv2) and outdoors (KITTI). When trained on multiple datasets, RSA can serve as a general alignment module in zero-shot settings. Our method improves over common practices in aligning relative to metric depth and results in predictions that are comparable to an upper bound of fitting relative depth to ground truth via a linear transformation.
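The output side of RSA is a global linear transformation on the relative depth map; the sketch below also fits that map to ground truth by least squares, which is the "upper bound" alignment the paper compares against (synthetic data and names are invented for illustration):

```python
import numpy as np

def apply_rsa(relative_depth, scale, shift):
    """Apply a global linear map to a relative depth map to obtain
    metric-scaled depth. In RSA, scale and shift are predicted from a
    text caption; here they are simply given."""
    return scale * relative_depth + shift

def fit_scale_shift(relative, metric):
    """Least-squares fit of (scale, shift) against ground-truth metric depth,
    i.e. the upper-bound alignment baseline."""
    A = np.stack([relative.ravel(), np.ones(relative.size)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, metric.ravel(), rcond=None)
    return scale, shift

rng = np.random.default_rng(0)
rel = rng.random((16, 16))             # relative (normalized) depth in [0, 1)
metric_gt = 5.0 * rel + 1.5            # synthetic metric depth in meters
s, b = fit_scale_shift(rel, metric_gt)
pred = apply_rsa(rel, s, b)
```

Because the map is global (two scalars per image), predicting it from language leaves the relative depth model itself untouched, which is what lets RSA act as a plug-in alignment module in zero-shot settings.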

[CV-64] AirLetters: An Open Video Dataset of Characters Drawn in the Air ECCV’24

链接: https://arxiv.org/abs/2410.02921
作者: Rishit Dagli,Guillaume Berger,Joanna Materzynska,Ingo Bax,Roland Memisevic
关键词-EN: consisting of real-world, video dataset consisting, dataset consisting, video, introduce AirLetters
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV’24, HANDS workshop

点击查看摘要

Abstract:We introduce AirLetters, a new video dataset consisting of real-world videos of human-generated, articulated motions. Specifically, our dataset requires a vision model to predict letters that humans draw in the air. Unlike existing video datasets, accurate classification predictions for AirLetters rely critically on discerning motion patterns and on integrating long-range information in the video over time. An extensive evaluation of state-of-the-art image and video understanding models on AirLetters shows that these methods perform poorly and fall far behind a human baseline. Our work shows that, despite recent progress in end-to-end video understanding, accurate representations of complex articulated motions – a task that is trivial for humans – remain an open problem for end-to-end learning.

[CV-65] Task-Decoupled Image Inpainting Framework for Class-specific Object Remover

链接: https://arxiv.org/abs/2410.02894
作者: Changsuk Oh,H. Jin Kim
关键词-EN: Object, Object removal, object remover, image inpainting networks, image inpainting
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Object removal refers to the process of erasing designated objects from an image while preserving the overall appearance. Existing works on object removal erase removal targets using image inpainting networks. However, image inpainting networks often generate unsatisfactory removal results. In this work, we find that the current training approach, which encourages a single image inpainting model to handle both object removal and restoration tasks, is one of the reasons behind such unsatisfactory results. Based on this finding, we propose a task-decoupled image inpainting framework which generates two separate inpainting models: an object restorer for object restoration tasks and an object remover for object removal tasks. We train the object restorer with masks that partially cover the removal targets. Then, the proposed framework uses the object restorer to generate guidance for training the object remover. Using the proposed framework, we obtain a class-specific object remover which focuses on removing objects of a target class, aiming to better erase target class objects than general object removers. We also introduce a data curation method that encompasses the image selection and mask generation approaches used to produce training data for the proposed class-specific object remover. Using the proposed curation method, we can simulate scenarios where an object remover is trained on data with object removal ground truth images. Experiments on multiple datasets show that the proposed class-specific object remover can better remove target class objects than object removers based on image inpainting networks.

[CV-66] Investigating the Impact of Randomness on Reproducibility in Computer Vision: A Study on Applications in Civil Engineering and Medicine

链接: https://arxiv.org/abs/2410.02806
作者: Bahadır Eryılmaz,Osman Alperen Koraş,Jörg Schlötterer,Christin Seifert
关键词-EN: scientific research, essential for scientific, CUDA-induced randomness, Abstract, randomness
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Reproducibility is essential for scientific research. However, in computer vision, achieving consistent results is challenging due to various factors. One influential, yet often unrecognized, factor is CUDA-induced randomness. Despite CUDA’s advantages for accelerating algorithm execution on GPUs, if not controlled, its behavior across multiple executions remains non-deterministic. While reproducibility issues in ML are being researched, the implications of CUDA-induced randomness in applications are yet to be understood. Our investigation focuses on this randomness across one standard benchmark dataset and two real-world datasets in an isolated environment. Our results show that CUDA-induced randomness can account for differences of up to 4.77% in performance scores. We find that managing this variability for reproducibility may entail increased runtime or reduced performance, but that the disadvantages are not as significant as reported in previous studies.

[CV-67] Leveraging Retrieval Augment Approach for Multimodal Emotion Recognition Under Missing Modalities

链接: https://arxiv.org/abs/2410.02804
作者: Qi Fan,Hongyu Yuan,Haolin Zuo,Rui Liu,Guanglai Gao
关键词-EN: Multimodal emotion recognition, Multimodal emotion, emotion recognition, Multimodal, emotion recognition utilizes
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Under reviewing

点击查看摘要

Abstract:Multimodal emotion recognition utilizes complete multimodal information and robust multimodal joint representations to gain high performance. However, the ideal condition of full modality integrity is often not applicable in reality, and situations frequently arise in which some modalities are missing. For example, video, audio, or text data is missing due to sensor failure or network bandwidth problems, which presents a great challenge to MER research. Traditional methods extract useful information from the complete modalities and reconstruct the missing modalities to learn robust multimodal joint representations. These methods have laid a solid foundation for research in this field and, to a certain extent, alleviated the difficulty of multimodal emotion recognition under missing modalities. However, relying solely on internal reconstruction and multimodal joint learning has its limitations, especially when the missing information is critical for emotion recognition. To address this challenge, we propose a novel framework of Retrieval Augment for Missing Modality Multimodal Emotion Recognition (RAMER), which introduces similar multimodal emotion data to enhance the performance of emotion recognition under missing modalities. By leveraging databases that contain related multimodal emotion data, we can retrieve similar multimodal emotion information to fill in the gaps left by missing modalities. Various experimental results demonstrate that our framework is superior to existing state-of-the-art approaches in missing modality MER tasks. Our whole project is publicly available on this https URL.

[CV-68] Estimating Body Volume and Height Using 3D Data

链接: https://arxiv.org/abs/2410.02800
作者: Vivek Ganesh Sonar,Muhammad Tanveer Jan,Mike Wells,Abhijit Pandya,Gabriela Engstrom,Richard Shih,Borko Furht
关键词-EN: body weight estimation, weight-based medications, urgent situations, body weight, estimation is critical
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 6 pages

点击查看摘要

Abstract:Accurate body weight estimation is critical in emergency medicine for proper dosing of weight-based medications, yet direct measurement is often impractical in urgent situations. This paper presents a non-invasive method for estimating body weight by calculating total body volume and height using 3D imaging technology. A RealSense D415 camera is employed to capture high-resolution depth maps of the patient, from which 3D models are generated. The Convex Hull Algorithm is then applied to calculate the total body volume, with enhanced accuracy achieved by segmenting the point cloud data into multiple sections and summing their individual volumes. The height is derived from the 3D model by identifying the distance between key points on the body. This combined approach provides an accurate estimate of body weight, improving the reliability of medical interventions where precise weight data is unavailable. The proposed method demonstrates significant potential to enhance patient safety and treatment outcomes in emergency settings.
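摘要描述的"点云分段取凸包再求和"的体积估计思路可以用 scipy 简单示意如下。点云为随机生成的长方体,并非真实 RealSense D415 数据;函数名、参数与分段数均为自拟,且假设环境中已安装 scipy:

```python
import numpy as np
from scipy.spatial import ConvexHull

def body_volume_and_height(points, n_sections=5):
    """Estimate volume by slicing the point cloud along height (z),
    taking the convex hull of each slice, and summing hull volumes --
    a sketch of the paper's segment-and-sum refinement. Height is the
    z-extent of the cloud."""
    z = points[:, 2]
    height = z.max() - z.min()
    edges = np.linspace(z.min(), z.max(), n_sections + 1)
    volume = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        sec = points[(z >= lo) & (z <= hi)]
        if len(sec) >= 4:  # a 3D hull needs at least 4 non-coplanar points
            volume += ConvexHull(sec).volume
    return volume, height

# Dense cloud filling a 0.4 x 0.3 x 1.75 m box (a crude stand-in for a body)
rng = np.random.default_rng(0)
pts = rng.random((20000, 3)) * np.array([0.4, 0.3, 1.75])
vol, h = body_volume_and_height(pts)
print(vol, h)  # close to the true 0.21 m^3 and 1.75 m
```

对凹形区域(如腰部、四肢之间),分段后每段更接近凸体,因此分段求和比整体取一次凸包更不容易高估体积,这正是摘要中"分段提升精度"的直观含义。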

[CV-69] Logic-Free Building Automation: Learning the Control of Room Facilities with Wall Switches and Ceiling Camera

链接: https://arxiv.org/abs/2410.02789
作者: Hideya Ochiai,Kohki Hashimoto,Takuya Sakamoto,Seiya Watanabe,Ryosuke Hara,Ryo Yagi,Yuji Aizono,Hiroshi Esaki
关键词-EN: Artificial intelligence enables, Artificial intelligence, intelligence enables smarter, intelligence enables, preferences on facility
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
*备注: 5 pages, 3 figures, 2 tables

点击查看摘要

Abstract:Artificial intelligence enables smarter control in building automation through its ability to learn users’ preferences for facility control. Reinforcement learning (RL) was one of the approaches to this, but it has many challenges in real-world implementations. We propose a new architecture for logic-free building automation (LFBA) that leverages deep learning (DL) to control room facilities without predefined logic. Our approach differs from RL in that it uses wall switches as supervised signals and a ceiling camera to monitor the environment, allowing the DL model to learn users’ preferred controls directly from the scenes and switch states. The LFBA system is tested on our testbed under various conditions and user activities. The results demonstrate its efficacy, achieving 93%-98% control accuracy with VGG, outperforming other DL models such as Vision Transformer and ResNet. This indicates that LFBA can achieve smarter and more user-friendly control by learning from observable scenes and user interactions.

[CV-70] RoMo: A Robust Solver for Full-body Unlabeled Optical Motion Capture SIGGRAPH

链接: https://arxiv.org/abs/2410.02788
作者: Xiaoyu Pan,Bowen Zheng,Xinwei Jiang,Zijiao Zeng,Qilong Kou,He Wang,Xiaogang Jin
关键词-EN: accurately capturing full-body, Optical motion capture, capturing full-body motions, gold standard, accurately capturing
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: Siggraph Asia 2024 Conference Paper

点击查看摘要

Abstract:Optical motion capture (MoCap) is the “gold standard” for accurately capturing full-body motions. To make use of raw MoCap point data, the system labels the points with corresponding body part locations and solves the full-body motions. However, MoCap data often contains mislabeling, occlusion and positional errors, requiring extensive manual correction. To alleviate this burden, we introduce RoMo, a learning-based framework for robustly labeling and solving raw optical motion capture data. In the labeling stage, RoMo employs a divide-and-conquer strategy to break down the complex full-body labeling challenge into manageable subtasks: alignment, full-body segmentation and part-specific labeling. To utilize the temporal continuity of markers, RoMo generates marker tracklets using a K-partite graph-based clustering algorithm, where markers serve as nodes, and edges are formed based on positional and feature similarities. For motion solving, to prevent error accumulation along the kinematic chain, we introduce a hybrid inverse kinematic solver that utilizes joint positions as intermediate representations and adjusts the template skeleton to match estimated joint positions. We demonstrate that RoMo achieves high labeling and solving accuracy across multiple metrics and various datasets. Extensive comparisons show that our method outperforms state-of-the-art research methods. On a real dataset, RoMo improves the F1 score of hand labeling from 0.94 to 0.98, and reduces joint position error of body motion solving by 25%. Furthermore, RoMo can be applied in scenarios where commercial systems are inadequate. The code and data for RoMo are available at this https URL.

[CV-71] Navigation with VLM framework: Go to Any Language

链接: https://arxiv.org/abs/2410.02787
作者: Zecheng Yin,Chonghao Cheng,Lizhen
关键词-EN: posed significant challenges, Vision Large Language, Large Language Models, significant challenges, Vision Large
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: under review

点击查看摘要

Abstract:Navigating towards fully open language goals and exploring open scenes in a manner akin to human exploration have always posed significant challenges. Recently, Vision Large Language Models (VLMs) have demonstrated remarkable capabilities in reasoning with both language and visual data. While many works have focused on leveraging VLMs for navigation in open scenes and with open vocabularies, these efforts often fall short of fully utilizing the potential of VLMs or require substantial computational resources. We introduce Navigation with VLM (NavVLM), a framework that harnesses equipment-level VLMs to enable agents to navigate towards any language goal, specific or non-specific, in open scenes, emulating human exploration behaviors without any prior training. The agent leverages the VLM as its cognitive core to perceive environmental information based on any language goal and constantly provides exploration guidance during navigation until it reaches the target location or area. Our framework not only achieves state-of-the-art performance in Success Rate (SR) and Success weighted by Path Length (SPL) in traditional specific goal settings but also extends the navigation capabilities to any open-set language goal. We evaluate NavVLM in richly detailed environments from the Matterport 3D (MP3D), Habitat Matterport 3D (HM3D), and Gibson datasets within the Habitat simulator. With the power of VLMs, navigation has entered a new era.

[CV-72] Robust Symmetry Detection via Riemannian Langevin Dynamics

链接: https://arxiv.org/abs/2410.02786
作者: Jihyeon Je,Jiayi Liu,Guandao Yang,Boyang Deng,Shengqu Cai,Gordon Wetzstein,Or Litany,Leonidas Guibas
关键词-EN: kinds of objects, man-made creations, noise, Symmetries, symmetry
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
*备注: Project page: this https URL

点击查看摘要

Abstract:Symmetries are ubiquitous across all kinds of objects, whether in nature or in man-made creations. While these symmetries may seem intuitive to the human eye, detecting them with a machine is nontrivial due to the vast search space. Classical geometry-based methods work by aggregating “votes” for each symmetry but struggle with noise. In contrast, learning-based methods may be more robust to noise, but often overlook partial symmetries due to the scarcity of annotated data. In this work, we address this challenge by proposing a novel symmetry detection method that marries classical symmetry detection techniques with recent advances in generative modeling. Specifically, we apply Langevin dynamics to a redefined symmetry space to enhance robustness against noise. We provide empirical results on a variety of shapes that suggest our method is not only robust to noise, but can also identify both partial and global symmetries. Moreover, we demonstrate the utility of our detected symmetries in various downstream tasks, such as compression and symmetrization of noisy shapes.
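摘要提到在重新定义的对称空间上运行 Langevin dynamics 来增强抗噪性。下面给出欧氏空间上无调整 Langevin 采样(ULA)的通用示意,目标分布取标准正态,仅用于说明更新公式本身,并非论文在对称空间上的实现:

```python
import numpy as np

def langevin_sample(grad_log_p, x0, step=0.01, n_steps=5000, seed=0):
    """Unadjusted Langevin dynamics:
        x_{t+1} = x_t + (step/2) * grad log p(x_t) + sqrt(step) * noise
    After many steps the iterates approximately sample from p. (The paper
    runs an analogous update on a redefined symmetry space; this is the
    plain Euclidean version for illustration.)"""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + 0.5 * step * grad_log_p(x) + np.sqrt(step) * rng.normal(size=x.shape)
    return x

# Target: standard normal, so grad log p(x) = -x. Run many chains at once,
# all initialized far from the mode.
samples = langevin_sample(lambda x: -x, x0=np.full(2000, 5.0))
print(samples.mean(), samples.std())  # near 0 and 1
```

与纯投票式的几何方法相比,这类动力学在噪声下仍会被 log-density 的梯度拉回高概率区域,这是其鲁棒性的来源。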

[CV-73] Guess What I Think: Streamlined EEG-to-Image Generation with Latent Diffusion Models ICASSP2025

链接: https://arxiv.org/abs/2410.02780
作者: Eleonora Lopez,Luigi Sigillo,Federica Colonnese,Massimo Panella,Danilo Comminiello
关键词-EN: advance brain-computer interface, encode visual cues, gaining increasing attention, brain signals encode, increasing attention due
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Submitted to ICASSP 2025

点击查看摘要

Abstract:Generating images from brain waves is gaining increasing attention due to its potential to advance brain-computer interface (BCI) systems by understanding how brain signals encode visual cues. Most of the literature has focused on fMRI-to-Image tasks as fMRI is characterized by high spatial resolution. However, fMRI is an expensive neuroimaging modality and does not allow for real-time BCI. On the other hand, electroencephalography (EEG) is a low-cost, non-invasive, and portable neuroimaging technique, making it an attractive option for future real-time applications. Nevertheless, EEG presents inherent challenges due to its low spatial resolution and susceptibility to noise and artifacts, which makes generating images from EEG more difficult. In this paper, we address these problems with a streamlined framework based on the ControlNet adapter for conditioning a latent diffusion model (LDM) through EEG signals. We conduct experiments and ablation studies on popular benchmarks to demonstrate that the proposed method beats other state-of-the-art models. Unlike these methods, which often require extensive preprocessing, pretraining, different losses, and captioning models, our approach is efficient and straightforward, requiring only minimal preprocessing and a few components. Code will be available after publication.

[CV-74] Mind the Uncertainty in Human Disagreement: Evaluating Discrepancies between Model Predictions and Human Responses in VQA

链接: https://arxiv.org/abs/2410.02773
作者: Jian Lan,Diego Frassinelli,Barbara Plank
关键词-EN: Visual Question Answering, Large vision-language models, Large vision-language, multiple human annotators, accurately predict responses
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Large vision-language models frequently struggle to accurately predict responses provided by multiple human annotators, particularly when those responses exhibit human uncertainty. In this study, we focus on the Visual Question Answering (VQA) task, and we comprehensively evaluate how well the state-of-the-art vision-language models correlate with the distribution of human responses. To do so, we categorize our samples based on their levels (low, medium, high) of human uncertainty in disagreement (HUD) and employ not only accuracy but also three new human-correlated metrics in VQA, to investigate the impact of HUD. To better align models with humans, we also verify the effect of common calibration and human calibration. Our results show that even BEiT3, currently the best model for this task, struggles to capture the multi-label distribution inherent in diverse human responses. Additionally, we observe that the commonly used accuracy-oriented calibration technique adversely affects BEiT3’s ability to capture HUD, further widening the gap between model predictions and human distributions. In contrast, we show the benefits of calibrating models towards human distributions for VQA, better aligning model confidence with human uncertainty. Our findings highlight that for VQA, the consistent alignment between human responses and model predictions is understudied and should become the next crucial target of future studies.

[CV-75] Complex-valued convolutional neural network classification of hand gesture from radar images

链接: https://arxiv.org/abs/2410.02771
作者: Shokooh Khandan
关键词-EN: gesture recognition systems, Hand gesture recognition, popular in HCI, application areas, Hand gesture
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 173 pages, 36 tables, 50 figures

点击查看摘要

Abstract:Hand gesture recognition systems have yielded many exciting advancements in the last decade and become more popular in HCI (human-computer interaction) with several application areas, which span from safety and security applications to the automotive field. Various deep neural network architectures have already been inspected for hand gesture recognition systems, including the multi-layer perceptron (MLP), convolutional neural network (CNN), recurrent neural network (RNN) and a cascade of the last two architectures known as CNN-RNN. However, a major problem still exists: most existing ML algorithms, along with their building blocks and techniques, are designed and developed for real-valued (RV) data. Researchers have applied various RV techniques to complex-valued (CV) radar images, for example by converting a CV optimisation problem into a RV one, splitting the complex numbers into their real and imaginary parts. However, the major disadvantage of this method is that the resulting algorithm will double the network dimensions. Recent work on RNNs and other fundamental theoretical analysis suggests that CV numbers have a richer representational capacity, but due to the absence of the building blocks required to design such models, the performance of CV networks is marginalised. In this report, we propose a fully CV-CNN, including all building blocks, forward and backward operations, and derivatives, all in the complex domain. We explore our proposed classification model on two sets of CV hand gesture radar images in comparison with the equivalent RV model. In chapter five, we propose a CV-forward residual network for the purpose of binary classification of the two sets of CV hand gesture radar datasets and compare its performance with our proposed CV-CNN and a baseline CV-forward CNN.
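摘要中批评的"拆实部/虚部"做法,其代数基础是 (a+ib)(c+id) = (ac-bd) + i(ad+bc):一次复数卷积可以拆成四次实数卷积组合,代价是网络规模翻倍。下面用 numpy 在一维玩具信号上验证两者等价(数据随机生成,非真实雷达数据):

```python
import numpy as np

# (a + ib) * (c + id) = (ac - bd) + i(ad + bc): a complex convolution can
# be assembled from four real convolutions -- the RV workaround the report
# argues against, since it doubles the network dimensions.
rng = np.random.default_rng(0)
x = rng.normal(size=32) + 1j * rng.normal(size=32)   # toy CV radar signal
k = rng.normal(size=5) + 1j * rng.normal(size=5)     # toy CV kernel

xr, xi = x.real, x.imag
kr, ki = k.real, k.imag
real = np.convolve(xr, kr) - np.convolve(xi, ki)
imag = np.convolve(xr, ki) + np.convolve(xi, kr)
split_result = real + 1j * imag      # four RV convolutions, recombined

direct = np.convolve(x, k)           # native complex arithmetic, as in a CV-CNN
print(np.allclose(split_result, direct))  # True
```

一个原生支持复数运算的 CV-CNN 只需一次复数卷积即可完成上面四次实数卷积的工作,这正是报告主张在复数域内实现全部构件的动机。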

[CV-76] BoViLA: Bootstrapping Video-Language Alignment via LLM-Based Self-Questioning and Answering

链接: https://arxiv.org/abs/2410.02768
作者: Jin Chen,Kaijing Ma,Haojian Huang,Jiayu Shen,Han Fang,Xianghao Zang,Chao Ban,Zhongjiang He,Hao Sun,Yanmei Kang
关键词-EN: demonstrating remarkable capabilities, rapidly advancing, remarkable capabilities, development of multi-modal, demonstrating remarkable
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The development of multi-modal models has been rapidly advancing, with some demonstrating remarkable capabilities. However, annotating video-text pairs remains expensive and insufficient. Take video question answering (VideoQA) tasks as an example: human-annotated questions and answers often cover only part of the video, and similar semantics can also be expressed through different text forms, leading to underutilization of video. To address this, we propose BoViLA, a self-training framework that augments question samples during training through LLM-based self-questioning and answering, which helps the model exploit video information and the internal knowledge of LLMs more thoroughly to improve modality alignment. To filter bad self-generated questions, we introduce Evidential Deep Learning (EDL) to estimate uncertainty and assess the quality of self-generated questions by evaluating the modality alignment within the context. To the best of our knowledge, this work is the first to explore LLM-based self-training frameworks for modality alignment. We evaluate BoViLA on five strong VideoQA benchmarks, where it outperforms several state-of-the-art methods, demonstrating its effectiveness and generality. Additionally, we provide extensive analyses of the self-training framework and the EDL-based uncertainty filtering mechanism. The code will be made available at this https URL.

[CV-77] HyperCMR: Enhanced Multi-Contrast CMR Reconstruction with Eagle Loss MICCAI2024

链接: https://arxiv.org/abs/2410.03624
作者: Ruru Xu,Caner Özer,Ilkay Oksuz
关键词-EN: Accelerating image acquisition, magnetic resonance imaging, cardiac magnetic resonance, Accelerating image, critical task
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: MICCAI 2024 STACOM-CMRxRecon

点击查看摘要

Abstract:Accelerating image acquisition for cardiac magnetic resonance imaging (CMRI) is a critical task. CMRxRecon2024 challenge aims to set the state of the art for multi-contrast CMR reconstruction. This paper presents HyperCMR, a novel framework designed to accelerate the reconstruction of multi-contrast cardiac magnetic resonance (CMR) images. HyperCMR enhances the existing PromptMR model by incorporating advanced loss functions, notably the innovative Eagle Loss, which is specifically designed to recover missing high-frequency information in undersampled k-space. Extensive experiments conducted on the CMRxRecon2024 challenge dataset demonstrate that HyperCMR consistently outperforms the baseline across multiple evaluation metrics, achieving superior SSIM and PSNR scores.

[CV-78] Towards Real-time Intrahepatic Vessel Identification in Intraoperative Ultrasound-Guided Liver Surgery MICCAI2024

链接: https://arxiv.org/abs/2410.03420
作者: Karl-Philippe Beaudet(IHU Strasbourg, UNISTRA, MIMESIS),Alexandros Karargyris(IHU Strasbourg, UNISTRA),Sidaty El Hadramy(UNISTRA, MIMESIS),Stéphane Cotin(UNISTRA, MIMESIS),Jean-Paul Mazellier(IHU Strasbourg, UNISTRA),Nicolas Padoy(IHU Strasbourg, UNISTRA),Juan Verde(IHU Strasbourg, UNISTRA, MIMESIS)
关键词-EN: traditional open surgery, complexity hinders widespread, hinders widespread adoption, widespread adoption due, maintains patient outcomes
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: MICCAI 2024, Oct 2024, Marrakech, Morocco

点击查看摘要

Abstract:While laparoscopic liver resection is less prone to complications and maintains patient outcomes compared to traditional open surgery, its complexity hinders widespread adoption due to challenges in representing the liver’s internal structure. Laparoscopic intraoperative ultrasound offers efficient, cost-effective and radiation-free guidance. Our objective is to aid physicians in identifying internal liver structures using laparoscopic intraoperative ultrasound. We propose a patient-specific approach using preoperative 3D ultrasound liver volume to train a deep learning model for real-time identification of portal tree and branch structures. Our personalized AI model, validated on ex vivo swine livers, achieved superior precision (0.95) and recall (0.93) compared to surgeons, laying groundwork for precise vessel identification in ultrasound-based liver resection. Its adaptability and potential clinical impact promise to advance surgical interventions and improve patient care.

[CV-79] An Enhanced Harmonic Densely Connected Hybrid Transformer Network Architecture for Chronic Wound Segmentation Utilising Multi-Colour Space Tensor Merging

链接: https://arxiv.org/abs/2410.03359
作者: Bill Cassidy,Christian Mcbride,Connah Kendrick,Neil D. Reeves,Joseph M. Pappachan,Cornelius J. Fernandez,Elias Chacko,Raphael Brüngel,Christoph M. Friedrich,Metib Alotaibi,Abdullah Abdulaziz AlWabel,Mohammad Alderwish,Kuan-Ying Lai,Moi Hoon Yap
关键词-EN: hospitals world wide, world wide, growing burdens, burdens for clinics, clinics and hospitals
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Chronic wounds and associated complications present ever-growing burdens for clinics and hospitals worldwide. Venous, arterial, diabetic, and pressure wounds are becoming increasingly common globally. These conditions can result in highly debilitating repercussions for those affected, with limb amputations and increased mortality risk resulting from infection becoming more common. New methods to assist clinicians in chronic wound care are therefore vital to maintain high quality care standards. This paper presents an improved HarDNet segmentation architecture which integrates a contrast-eliminating component in the initial layers of the network to enhance feature learning. We also utilise a multi-colour space tensor merging process and adjust the harmonic shape of the convolution blocks to facilitate these additional features. We train our proposed model using wound images from light-skinned patients and test the model on two test sets (one set with ground truth, and one without) comprising only darker-skinned cases. Subjective ratings are obtained from clinical wound experts with intraclass correlation coefficient used to determine inter-rater reliability. For the dark-skin tone test set with ground truth, we demonstrate improvements in terms of Dice similarity coefficient (+0.1221) and intersection over union (+0.1274). Qualitative analysis showed high expert ratings, with improvements of 3% demonstrated when comparing the baseline model with the proposed model. This paper presents the first study to focus on darker-skin tones for chronic wound segmentation using models trained only on wound images exhibiting lighter skin. Diabetes is highly prevalent in countries where patients have darker skin tones, highlighting the need for a greater focus on such cases. Additionally, we conduct the largest qualitative study to date for chronic wound segmentation.

[CV-80] Lost in Tracking: Uncertainty-guided Cardiac Cine MRI Segmentation at Right Ventricle Base

链接: https://arxiv.org/abs/2410.03320
作者: Yidong Zhao,Yi Zhang,Orlando Simonetti,Yuchi Han,Qian Tao
关键词-EN: Accurate biventricular segmentation, cardiac magnetic resonance, Accurate biventricular, magnetic resonance, cine images
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Accurate biventricular segmentation of cardiac magnetic resonance (CMR) cine images is essential for the clinical evaluation of heart function. However, compared to left ventricle (LV), right ventricle (RV) segmentation is still more challenging and less reproducible. Degenerate performance frequently occurs at the RV base, where the in-plane anatomical structures are complex (with atria, valve, and aorta) and vary due to the strong interplanar motion. In this work, we propose to address the currently unsolved issues in CMR segmentation, specifically at the RV base, with two strategies: first, we complemented the public resource by reannotating the RV base in the ACDC dataset, with refined delineation of the right ventricle outflow tract (RVOT), under the guidance of an expert cardiologist. Second, we proposed a novel dual encoder U-Net architecture that leverages temporal incoherence to inform the segmentation when interplanar motions occur. The inter-planar motion is characterized by loss-of-tracking, via Bayesian uncertainty of a motion-tracking model. Our experiments showed that our method significantly improved RV base segmentation taking into account temporal incoherence. Furthermore, we investigated the reproducibility of deep learning-based segmentation and showed that the combination of consistent annotation and loss of tracking could enhance the reproducibility of RV segmentation, potentially facilitating a large number of clinical studies focusing on RV.

[CV-81] Semantic Segmentation Based Quality Control of Histopathology Whole Slide Images

链接: https://arxiv.org/abs/2410.03289
作者: Abhijeet Patil,Garima Jain,Harsh Diwakar,Jay Sawant,Tripti Bameta,Swapnil Rane,Amit Sethi
关键词-EN: quality control, pen marks, developed a software, tissue regions, segments various regions
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages, 8 figures

点击查看摘要

Abstract:We developed a software pipeline for quality control (QC) of histopathology whole slide images (WSIs) that segments various regions, such as blurs of different levels, tissue regions, tissue folds, and pen marks. Given the necessity and increasing availability of GPUs for processing WSIs, the proposed pipeline comprises multiple lightweight deep learning models to strike a balance between accuracy and speed. The pipeline was evaluated on all of TCGA, the largest publicly available WSI dataset, containing more than 11,000 histopathological images from 28 organs. It was compared to a previous work, which was not based on deep learning, and showed consistent improvement in segmentation results across organs. To minimize annotation effort for tissue and blur segmentation, annotated images were automatically prepared by mosaicking patches (sub-images) from various WSIs whose labels were identified using the patch classification tool HistoROI. Due to the generality of our trained QC pipeline and its extensive testing, the potential impact of this work is broad. It can be used for automated pre-processing of any WSI cohort to enhance the accuracy and reliability of large-scale histopathology image analysis for both research and clinical use. We have made the trained models, training scripts, training data, and inference results publicly available at this https URL, which should enable the research community to use the pipeline right out of the box or further customize it for new datasets and applications in the future.

[CV-82] 3D Segmentation of Neuronal Nuclei and Cell-Type Identification using Multi-channel Information

Link: https://arxiv.org/abs/2410.03248
Authors: Antonio LaTorre,Lidia Alonso-Nanclares,José María Peña,Javier De Felipe
Keywords: Background Analyzing images, Background Analyzing, Analyzing images, objective in neuroscience, accurately estimate
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Background: Analyzing images to accurately estimate the number of different cell types in the brain using automatic methods is a major objective in neuroscience. The automatic and selective detection and segmentation of neurons would be an important step in neuroanatomical studies. New method: We present a method to improve the 3D reconstruction of neuronal nuclei that allows their segmentation, excluding the nuclei of non-neuronal cell types. Results: We have tested the algorithm on stacks of images from rat neocortex, in a complex scenario (large stacks of images, uneven staining, and three different channels to visualize different cellular markers). It was able to provide a good identification ratio of neuronal nuclei and a 3D segmentation. Comparison with existing methods: Many automatic tools are in fact currently available, but different methods yield different cell count estimations, even in the same brain regions, due to differences in the labeling and imaging techniques, as well as in the algorithms used to detect cells. Moreover, some of the available automated software methods have provided estimations of cell numbers that have been reported to be inaccurate or inconsistent after evaluation by neuroanatomists. Conclusions: It is critical to have a tool for automatic segmentation that allows discrimination between neurons, glial cells and perivascular cells. It would greatly speed up a task that is currently performed manually and would allow the cell counting to be systematic, avoiding human bias. Furthermore, the resulting 3D reconstructions of different cell types can be used to generate models of the spatial distribution of cells.

[CV-83] ECHOPulse: ECG controlled echocardio-grams video generation

Link: https://arxiv.org/abs/2410.03143
Authors: Yiwei Li,Sekeun Kim,Zihao Wu,Hanqi Jiang,Yi Pan,Pengfei Jin,Sifan Song,Yucheng Shi,Tianze Yang,Tianming Liu,Quanzheng Li,Xiang Li
Keywords: ECHO video generation, interpretation heavily relies, ECHO video, ECHO, video generation
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Echocardiography (ECHO) is essential for cardiac assessments, but its video quality and interpretation heavily rely on manual expertise, leading to inconsistent results from clinical and portable devices. ECHO video generation offers a solution by improving automated monitoring through synthetic data and generating high-quality videos from routine health data. However, existing models often face high computational costs, slow inference, and rely on complex conditional prompts that require experts’ annotations. To address these challenges, we propose ECHOPULSE, an ECG-conditioned ECHO video generation model. ECHOPULSE introduces two key advancements: (1) it accelerates ECHO video generation by leveraging VQ-VAE tokenization and masked visual token modeling for fast decoding, and (2) it conditions on readily accessible ECG signals, which are highly coherent with ECHO videos, bypassing complex conditional prompts. To the best of our knowledge, this is the first work to use time-series prompts like ECG signals for ECHO video generation. ECHOPULSE not only enables controllable synthetic ECHO data generation but also provides updated cardiac function information for disease monitoring and prediction beyond ECG alone. Evaluations on three public and private datasets demonstrate state-of-the-art performance in ECHO video generation across both qualitative and quantitative measures. Additionally, ECHOPULSE can be easily generalized to other modality generation tasks, such as cardiac MRI, fMRI, and 3D CT generation. A demo can be seen at this https URL.
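
The VQ-VAE tokenization step that enables fast decoding can be sketched in miniature: each continuous feature vector is snapped to the index of its nearest codebook entry, producing the discrete tokens that masked visual token modeling later predicts. The codebook and input vectors below are invented toy values, not from the paper:

```python
def vector_quantize(vectors, codebook):
    # Nearest-codebook lookup (squared L2 distance): the core
    # tokenization step a VQ-VAE uses to turn continuous features
    # into discrete token indices.
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda j: d2(v, codebook[j]))
            for v in vectors]

# Hypothetical 2-D codebook and frame features (toy values).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
tokens = vector_quantize([[0.1, -0.1], [0.9, 1.2], [0.1, 0.8]], codebook)
```

A real model learns both the codebook and the encoder; here only the lookup is shown.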

[CV-84] GABIC: Graph-based Attention Block for Image Compression ICIP2024

Link: https://arxiv.org/abs/2410.02981
Authors: Gabriele Spadaro,Alberto Presta,Enzo Tartaglione,Jhony H. Giraldo,Marco Grangetto,Attilio Fiandrotti
Keywords: neural Learned Image, Learned Image Compression, neural Learned, JPEG and HEVC-intra, Learned Image
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: 10 pages, 5 figures, accepted at ICIP 2024

Click to view abstract

Abstract:While standardized codecs like JPEG and HEVC-intra represent the industry standard in image compression, neural Learned Image Compression (LIC) codecs represent a promising alternative. In detail, integrating attention mechanisms from Vision Transformers into LIC models has shown improved compression efficiency. However, extra efficiency often comes at the cost of aggregating redundant features. This work proposes a Graph-based Attention Block for Image Compression (GABIC), a method to reduce feature redundancy based on a k-Nearest Neighbors enhanced attention mechanism. Our experiments show that GABIC outperforms comparable methods, particularly at high bit rates, enhancing compression performance.
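
The k-NN-enhanced attention idea can be sketched abstractly: each query attends only to its k most similar keys, so redundant distant features are excluded before the softmax. This scalar toy (invented vectors, no learned projections) illustrates only the masking step, not GABIC's actual architecture:

```python
import math

def knn_attention(query, keys, values, k=2):
    # Restrict attention to the k nearest keys by dot-product
    # similarity, then softmax only over that neighborhood.
    sims = [sum(q * kk for q, kk in zip(query, key)) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: sims[i], reverse=True)[:k]
    exps = {i: math.exp(sims[i]) for i in top}
    z = sum(exps.values())
    weights = {i: e / z for i, e in exps.items()}
    dim = len(values[0])
    return [sum(weights[i] * values[i][d] for i in top) for d in range(dim)]

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
vals = [[1.0], [2.0], [100.0]]
out = knn_attention(q, keys, vals, k=2)  # third (dissimilar) key is masked out
```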

[CV-85] Individuation of 3D perceptual units from neurogeometry of binocular cells

Link: https://arxiv.org/abs/2410.02870
Authors: Maria Virginia Bolelli,Giovanna Citti,Alessandro Sarti,Steven W. Zucker
Keywords: neurogeometric sub-Riemannian model, functional architecture, early stages, stages of three-dimensional, three-dimensional vision
Categories: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Differential Geometry (math.DG)
*Comments: 30 pages, 13 figures

Click to view abstract

Abstract:We model the functional architecture of the early stages of three-dimensional vision by extending the neurogeometric sub-Riemannian model for stereo-vision introduced in [BCSZ23]. A new framework for correspondence is introduced that integrates a neural-based algorithm to achieve stereo correspondence locally while, simultaneously, organizing the corresponding points into global perceptual units. The result is an effective scene segmentation. We achieve this using harmonic analysis on the sub-Riemannian structure and show, in a comparison against Riemannian distance, that the sub-Riemannian metric is central to the solution.

[CV-86] YouTube Video Analytics for Patient Engagement: Evidence from Colonoscopy Preparation Videos

Link: https://arxiv.org/abs/2410.02830
Authors: Yawen Guo,Xiao Liu,Anjana Susarla,Padman Rema
Keywords: medical information, deliver contextualized, Video Intelligence API, medical, information
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)
*Comments: The 30th WORKSHOP ON INFORMATION TECHNOLOGIES AND SYSTEMS. arXiv admin note: substantial text overlap with arXiv:2312.09425

Click to view abstract

Abstract:Videos can be an effective way to deliver contextualized, just-in-time medical information for patient education. However, video analysis tasks, from topic identification and retrieval to the extraction of medical information and the assessment of understandability from a patient perspective, are extremely challenging. This study demonstrates a data analysis pipeline that utilizes methods to retrieve medical information from YouTube videos on preparing for a colonoscopy exam, a much maligned and disliked procedure that patients find challenging to get adequately prepared for. We first use the YouTube Data API to collect metadata of desired videos on select search keywords and use the Google Video Intelligence API to analyze text, frame, and object data. Then we annotate the YouTube video materials for medical information, video understandability, and overall recommendation. We develop a bidirectional long short-term memory (BiLSTM) model to identify medical terms in videos and build three classifiers to group videos based on the level of encoded medical information, video understandability, and whether the videos are recommended or not. Our study provides healthcare stakeholders with guidelines and a scalable approach for generating new educational video content to enhance the management of a vast number of health conditions.

[CV-87] KLDD: Kalman Filter based Linear Deformable Diffusion Model in Retinal Image Segmentation

Link: https://arxiv.org/abs/2410.02808
Authors: Zhihao Zhao,Yinzheng Zhao,Junjie Yang,Kai Huang,Nassir Navab,M. Ali Nasseri
Keywords: AI-based vascular segmentation, linear deformable convolution, vascular structures, Linear Deformable, ophthalmic diseases
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*Comments: Accepted at BIBM 2024

Click to view abstract

Abstract:AI-based vascular segmentation is becoming increasingly common in enhancing the screening and treatment of ophthalmic diseases. Deep learning structures based on U-Net have achieved relatively good performance in vascular segmentation. However, small blood vessels and capillaries tend to be lost during segmentation when passed through the traditional U-Net downsampling module. To address this gap, this paper proposes a novel Kalman filter based Linear Deformable Diffusion (KLDD) model for retinal vessel segmentation. Our model employs a diffusion process that iteratively refines the segmentation, leveraging the flexible receptive fields of deformable convolutions in feature extraction modules to adapt to the detailed tubular vascular structures. More specifically, we first employ a feature extractor with linear deformable convolution to capture vascular structure information from the input images. To better optimize the coordinate positions of the deformable convolution, we employ a Kalman filter to enhance the perception of vascular structures in the linear deformable convolution. Subsequently, the extracted features of the vascular structures are utilized as a conditioning element within the diffusion model through the Cross-Attention Aggregation module (CAAM) and the Channel-wise Soft Attention module (CSAM). These aggregations are designed to enhance the diffusion model’s capability to generate vascular structures. The method is evaluated on retinal fundus image datasets (DRIVE, CHASE_DB1) as well as the 3mm and 6mm subsets of the OCTA-500 dataset, and the results show that the diffusion model proposed in this paper outperforms other methods.
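
The scalar textbook form of the Kalman update conveys the core idea behind filtering the deformable-convolution coordinates: each noisy observation is blended with the running estimate via the Kalman gain, and the estimate's variance shrinks over time. The dynamics, noise values, and measurements below are illustrative assumptions, not the paper's actual formulation:

```python
def kalman_update(x, p, z, r):
    # One scalar measurement update: blend prediction x (variance p)
    # with measurement z (noise variance r) via the Kalman gain.
    k = p / (p + r)
    x_new = x + k * (z - x)
    p_new = (1 - k) * p
    return x_new, p_new

def kalman_track(measurements, x0=0.0, p0=1.0, q=0.01, r=0.25):
    # Track a roughly constant coordinate from noisy measurements;
    # q is process noise added at each (identity-dynamics) predict step.
    x, p = x0, p0
    for z in measurements:
        p = p + q          # predict
        x, p = kalman_update(x, p, z, r)  # correct
    return x, p

x, p = kalman_track([1.1, 0.9, 1.05, 0.95, 1.0])
```

After a few noisy observations near 1.0, the estimate settles close to 1.0 and its variance drops well below the prior.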

[CV-88] AutoPETIII: The Tracer Frontier. What Frontier?

Link: https://arxiv.org/abs/2410.02807
Authors: Zacharia Mesbah,Léo Mottay,Romain Modzelewski,Pierre Decazes,Sébastien Hapdey,Su Ruan,Sébastien Thureau
Keywords: Positron Emitting Tomography, Emitting Tomography, AutoPET competition gathered, medical imaging community, Positron Emitting
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:For the last three years, the AutoPET competition has gathered the medical imaging community around a hot topic: lesion segmentation on Positron Emission Tomography (PET) scans. Each year a different aspect of the problem is presented; in 2024 the multiplicity of existing and used tracers was at the core of the challenge. Specifically, this year’s edition aims to develop a fully automatic algorithm capable of performing lesion segmentation on a PET/CT scan without knowing the tracer, which can be either an FDG- or PSMA-based tracer. In this paper we describe how we used the nnUNetv2 framework to train two sets of 6-fold ensembles of models to perform fully automatic PET/CT lesion segmentation, as well as a MIP-CNN to choose which set of models to use for segmentation.

[CV-89] Trust-informed Decision-Making Through An Uncertainty-Aware Stacked Neural Networks Framework: Case Study in COVID-19 Classification

Link: https://arxiv.org/abs/2410.02805
Authors: Hassan Gharoun,Mohammad Sadegh Khorshidi,Fang Chen,Amir H. Gandomi
Keywords: stacked neural networks, uncertainty-aware stacked neural, neural networks model, radiological images, study presents
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*Comments: 15 pages, 7 figures, 6 tables

Click to view abstract

Abstract:This study presents an uncertainty-aware stacked neural networks model for the reliable classification of COVID-19 from radiological images. The model addresses the critical gap in uncertainty-aware modeling by focusing on accurately identifying confidently correct predictions while alerting users to confidently incorrect and uncertain predictions, which can promote trust in automated systems. The architecture integrates uncertainty quantification methods, including Monte Carlo dropout and ensemble techniques, to enhance predictive reliability by assessing the certainty of diagnostic predictions. Within a two-tier model framework, the first-tier model generates initial predictions and associated uncertainties, which the second-tier model uses to produce a trust indicator alongside the diagnostic outcome. This dual-output model not only predicts COVID-19 cases but also provides a trust flag, indicating the reliability of each diagnosis and aiming to minimize the need for retesting and expert verification. The effectiveness of this approach is demonstrated through extensive experiments on the COVIDx CXR-4 dataset, showing a novel approach to identifying and handling confidently incorrect and uncertain cases, thus enhancing the trustworthiness of automated diagnostics in clinical settings.
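
Monte Carlo dropout, one of the uncertainty quantification methods named above, amounts to averaging stochastic forward passes and treating their spread as uncertainty; a second-tier rule can then turn mean and variance into a diagnosis plus a trust flag. Everything below (the toy classifier, thresholds, and labels) is an illustrative sketch, not the paper's model:

```python
import math
import random

def mc_predict(forward, x, n_samples=200, rng=None):
    # Monte Carlo dropout: repeat stochastic forward passes and
    # summarize them as mean probability + sample variance.
    rng = rng or random.Random(0)
    probs = [forward(x, rng) for _ in range(n_samples)]
    mean = sum(probs) / n_samples
    var = sum((p - mean) ** 2 for p in probs) / n_samples
    return mean, var

def trust_flag(mean_prob, variance, p_thresh=0.5, var_thresh=0.01):
    # Second-tier rule: a diagnosis plus a trust indicator.
    label = "positive" if mean_prob >= p_thresh else "negative"
    trust = "trusted" if variance < var_thresh else "uncertain"
    return label, trust

def toy_forward(x, rng):
    # Dropout simulated by randomly zeroing features, then a sigmoid head.
    kept = [xi for xi in x if rng.random() > 0.1]
    return 1.0 / (1.0 + math.exp(-sum(kept)))

mean, var = mc_predict(toy_forward, [0.8, 0.9, 1.1])
label, trust = trust_flag(mean, var)
```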

Machine Learning

[LG-0] System 2 reasoning capabilities are nigh

Link: https://arxiv.org/abs/2410.03662
Authors: Scott C. Lowe
Keywords: machine learning models, human-like reasoning capabilities, recent years, machine learning, made strides
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:In recent years, machine learning models have made strides towards human-like reasoning capabilities from several directions. In this work, we review the current state of the literature and describe the remaining steps to achieve a neural model which can perform System 2 reasoning analogous to a human. We argue that if current models are insufficient to be classed as performing reasoning, there remains very little additional progress needed to attain that goal.

[LG-1] RAFT: Realistic Attacks to Fool Text Detectors EMNLP2024

Link: https://arxiv.org/abs/2410.03658
Authors: James Wang,Ran Li,Junfeng Yang,Chengzhi Mao
Keywords: exhibited remarkable fluency, Large language models, Large language, exhibited remarkable, remarkable fluency
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
*Comments: Accepted by EMNLP 2024

Click to view abstract

Abstract:Large language models (LLMs) have exhibited remarkable fluency across various tasks. However, their unethical applications, such as disseminating disinformation, have become a growing concern. Although recent works have proposed a number of LLM detection methods, their robustness and reliability remain unclear. In this paper, we present RAFT: a grammar-error-free black-box attack against existing LLM detectors. In contrast to previous attacks on language models, our method exploits the transferability of LLM embeddings at the word level while preserving the original text quality. We leverage an auxiliary embedding to greedily select candidate words to perturb against the target detector. Experiments reveal that our attack effectively compromises all detectors in the study across various domains by up to 99%, and is transferable across source models. Manual human evaluation studies show our attacks are realistic and indistinguishable from original human-written text. We also show that examples generated by RAFT can be used to train adversarially robust detectors. Our work shows that current LLM detectors are not adversarially robust, underscoring the urgent need for more resilient detection mechanisms.
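
A word-level greedy substitution attack of the general kind described (at each step, pick the single swap that most lowers the detector score) can be sketched against a toy detector; the detector, substitution table, and "suspect word" heuristic below are invented for illustration and are not RAFT's actual embedding-based scoring:

```python
def greedy_word_attack(text, detector, substitutions, budget=3):
    # Greedily replace words to lower the detector's "machine-written"
    # score: try every allowed substitution, keep the best single swap,
    # repeat until the budget is spent or no swap helps.
    words = text.split()
    for _ in range(budget):
        base = detector(" ".join(words))
        best = None
        for i, w in enumerate(words):
            for cand in substitutions.get(w.lower(), []):
                trial = words[:i] + [cand] + words[i + 1:]
                score = detector(" ".join(trial))
                if score < base and (best is None or score < best[0]):
                    best = (score, i, cand)
        if best is None:
            break
        _, i, cand = best
        words[i] = cand
    return " ".join(words)

# Toy detector: flags texts containing stereotypically "LLM-ish" words.
SUSPECT = {"delve", "moreover", "utilize"}
def toy_detector(text):
    toks = text.lower().split()
    return sum(t in SUSPECT for t in toks) / max(len(toks), 1)

subs = {"delve": ["dig"], "moreover": ["also"], "utilize": ["use"]}
original = "We delve deeper and utilize new tools"
attacked = greedy_word_attack(original, toy_detector, subs)
```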

[LG-2] Geometric Representation Condition Improves Equivariant Molecule Generation

Link: https://arxiv.org/abs/2410.03655
Authors: Zian Li,Cai Zhou,Xiyuan Wang,Xingang Peng,Muhan Zhang
Keywords: demonstrated substantial potential, Recent advancements, molecular generative models, accelerating scientific discovery, scientific discovery
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Recent advancements in molecular generative models have demonstrated substantial potential in accelerating scientific discovery, particularly in drug design. However, these models often face challenges in generating high-quality molecules, especially in conditional scenarios where specific molecular properties must be satisfied. In this work, we introduce GeoRCG, a general framework to enhance the performance of molecular generative models by integrating geometric representation conditions. We decompose the molecule generation process into two stages: first, generating an informative geometric representation; second, generating a molecule conditioned on the representation. Compared to directly generating a molecule, the relatively easy-to-generate representation in the first stage guides the second-stage generation toward a high-quality molecule in a more goal-oriented and much faster way. Leveraging EDM as the base generator, we observe significant quality improvements in unconditional molecule generation on the widely used QM9 and GEOM-DRUG datasets. More notably, in the challenging conditional molecular generation task, our framework achieves an average 31% performance improvement over state-of-the-art approaches, highlighting the superiority of conditioning on semantically rich geometric representations over conditioning on individual property values as in previous approaches. Furthermore, we show that, with such representation guidance, the number of diffusion steps can be reduced to as few as 100 while maintaining generation quality superior to that achieved with 1,000 steps, thereby significantly accelerating the generation process.

[LG-3] Learning Humanoid Locomotion over Challenging Terrain

Link: https://arxiv.org/abs/2410.03654
Authors: Ilija Radosavovic,Sarthak Kamat,Trevor Darrell,Jitendra Malik
Keywords: Humanoid, capable of traversing, model, Developing controllers capable
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments: Project page: this https URL

Click to view abstract

Abstract:Humanoid robots can, in principle, use their legs to go almost anywhere. Developing controllers capable of traversing diverse terrains, however, remains a considerable challenge. Classical controllers are hard to generalize broadly while the learning-based methods have primarily focused on gentle terrains. Here, we present a learning-based approach for blind humanoid locomotion capable of traversing challenging natural and man-made terrain. Our method uses a transformer model to predict the next action based on the history of proprioceptive observations and actions. The model is first pre-trained on a dataset of flat-ground trajectories with sequence modeling, and then fine-tuned on uneven terrain using reinforcement learning. We evaluate our model on a real humanoid robot across a variety of terrains, including rough, deformable, and sloped surfaces. The model demonstrates robust performance, in-context adaptation, and emergent terrain representations. In real-world case studies, our humanoid robot successfully traversed over 4 miles of hiking trails in Berkeley and climbed some of the steepest streets in San Francisco.

[LG-4] GenSim2: Scaling Robot Data Generation with Multi-modal and Reasoning LLMs

Link: https://arxiv.org/abs/2410.03645
Authors: Pu Hua,Minghuan Liu,Annabella Macaluso,Yunfeng Lin,Weinan Zhang,Huazhe Xu,Lirui Wang
Keywords: Robotic simulation today, today remains challenging, simulation today remains, create diverse simulation, Robotic simulation
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: CoRL 2024. Project website: this https URL

Click to view abstract

Abstract:Robotic simulation today remains challenging to scale up due to the human effort required to create diverse simulation tasks and scenes. Simulation-trained policies also face scalability issues as many sim-to-real methods focus on a single task. To address these challenges, this work proposes GenSim2, a scalable framework that leverages coding LLMs with multi-modal and reasoning capabilities for complex and realistic simulation task creation, including long-horizon tasks with articulated objects. To automatically generate demonstration data for these tasks at scale, we propose planning and RL solvers that generalize within object categories. The pipeline can generate data for up to 100 articulated tasks with 200 objects and reduce the required human effort. To utilize such data, we propose an effective multi-task language-conditioned policy architecture, dubbed proprioceptive point-cloud transformer (PPT), that learns from the generated demonstrations and exhibits strong sim-to-real zero-shot transfer. Combining the proposed pipeline and the policy architecture, we demonstrate a promising use of GenSim2: the generated data can be used for zero-shot transfer or co-trained with real-world collected data, which enhances policy performance by 20% compared with training exclusively on limited real data.

[LG-5] Real-World Benchmarks Make Membership Inference Attacks Fail on Diffusion Models

Link: https://arxiv.org/abs/2410.03640
Authors: Chumeng Liang,Jiaxuan You
Keywords: Membership inference attacks, diffusion models, Membership inference, pre-trained diffusion models, diffusion
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Membership inference attacks (MIAs) on diffusion models have emerged as potential evidence of unauthorized data usage in training pre-trained diffusion models. These attacks aim to detect the presence of specific images in training datasets of diffusion models. Our study delves into the evaluation of state-of-the-art MIAs on diffusion models and reveals critical flaws and overly optimistic performance estimates in existing MIA evaluation. We introduce CopyMark, a more realistic MIA benchmark that distinguishes itself through the support for pre-trained diffusion models, unbiased datasets, and fair evaluation pipelines. Through extensive experiments, we demonstrate that the effectiveness of current MIA methods significantly degrades under these more practical conditions. Based on our results, we alert that MIA, in its current state, is not a reliable approach for identifying unauthorized data usage in pre-trained diffusion models. To the best of our knowledge, we are the first to discover the performance overestimation of MIAs on diffusion models and present a unified benchmark for more realistic evaluation. Our code is available on GitHub: this https URL.
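
One evaluation convention a fair MIA benchmark typically enforces is reporting the true-positive rate at a low false-positive rate rather than average-case accuracy; a minimal sketch with invented attack scores:

```python
def tpr_at_fpr(member_scores, nonmember_scores, target_fpr=0.1):
    # Sweep thresholds over the observed scores; among thresholds where
    # at most target_fpr of non-members are (wrongly) flagged as members,
    # return the best true-positive rate on real members.
    thresholds = sorted(set(member_scores + nonmember_scores), reverse=True)
    best_tpr = 0.0
    for t in thresholds:
        fpr = sum(s >= t for s in nonmember_scores) / len(nonmember_scores)
        if fpr <= target_fpr:
            tpr = sum(s >= t for s in member_scores) / len(member_scores)
            best_tpr = max(best_tpr, tpr)
    return best_tpr

# Toy attack scores (higher = "looks like a training member").
members = [0.9, 0.8, 0.75, 0.4]
nonmembers = [0.7, 0.5, 0.3, 0.2]
low_fpr_tpr = tpr_at_fpr(members, nonmembers, target_fpr=0.1)
```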

[LG-6] Robust Offline Imitation Learning from Diverse Auxiliary Data

Link: https://arxiv.org/abs/2410.03626
Authors: Udita Ghosh,Dripta S. Raychaudhuri,Jiachen Li,Konstantinos Karydis,Amit K. Roy-Chowdhury
Keywords: Offline imitation, environment interaction, Robust Offline Imitation, data, Offline imitation learning
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Offline imitation learning enables learning a policy solely from a set of expert demonstrations, without any environment interaction. To alleviate the issue of distribution shift arising from the small amount of expert data, recent works incorporate large numbers of auxiliary demonstrations alongside the expert data. However, the performance of these approaches relies on assumptions about the quality and composition of the auxiliary data, and they are rarely successful when those assumptions do not hold. To address this limitation, we propose Robust Offline Imitation from Diverse Auxiliary Data (ROIDA). ROIDA first identifies high-quality transitions from the entire auxiliary dataset using a learned reward function. These high-reward samples are combined with the expert demonstrations for weighted behavioral cloning. For lower-quality samples, ROIDA applies temporal difference learning to steer the policy towards high-reward states, improving long-term returns. This two-pronged approach enables our framework to effectively leverage both high- and low-quality data without any assumptions. Extensive experiments validate that ROIDA achieves robust and consistent performance across multiple auxiliary datasets with diverse ratios of expert and non-expert demonstrations. ROIDA effectively leverages unlabeled auxiliary data, outperforming prior methods reliant on specific data assumptions.
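
The two-pronged use of auxiliary data can be sketched as a split-then-weight step: transitions scoring above a reward threshold join the expert set for reward-weighted behavioral cloning, and the rest are reserved for TD learning. The reward function, threshold, and per-sample loss below are placeholders, not the paper's learned components:

```python
def split_auxiliary(transitions, reward_fn, threshold):
    # High-reward auxiliary samples go to behavioral cloning (with
    # weights); the rest are kept for TD-style value learning.
    bc_set, td_set = [], []
    for tr in transitions:
        r = reward_fn(tr)
        if r >= threshold:
            bc_set.append((tr, r))
        else:
            td_set.append(tr)
    return bc_set, td_set

def weighted_bc_loss(bc_set, per_sample_loss):
    # Reward-weighted behavioral cloning objective (normalized).
    total_w = sum(w for _, w in bc_set)
    return sum(w * per_sample_loss(tr) for tr, w in bc_set) / total_w

# Toy data: a transition is (state, action); the "learned" reward is a stub.
transitions = [("s0", "good"), ("s1", "bad"), ("s2", "good")]
reward = lambda tr: 1.0 if tr[1] == "good" else 0.1
bc_set, td_set = split_auxiliary(transitions, reward, threshold=0.5)
loss = weighted_bc_loss(bc_set, lambda tr: 0.25)
```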

[LG-7] A Global Medical Data Security and Privacy Preserving Standards Identification Framework for Electronic Healthcare Consumers

Link: https://arxiv.org/abs/2410.03621
Authors: Vinaytosh Mishra,Kishu Gupta,Deepika Saxena,Ashutosh Kumar Singh
Keywords: Electronic Health Records, healthcare records brings, success of digital, focus on putting, putting consumers
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Electronic Health Records (EHR) are crucial for the success of digital healthcare, with a focus on putting consumers at the center of this transformation. However, the digitalization of healthcare records brings along security and privacy risks for personal data. The major concern is that different countries have varying standards for the security and privacy of medical data. This paper proposes a novel and comprehensive framework to standardize these rules globally, bringing them together on a common platform. To support this proposal, the study reviews existing literature to understand the research interest in this issue. It also examines six key laws and standards related to security and privacy, identifying twenty concepts. The proposed framework utilizes K-means clustering to categorize these concepts and identify five key factors. Finally, an Ordinal Priority Approach is applied to determine the preferred implementation of these factors in the context of EHRs. The study thus provides a descriptive and then a prescriptive framework for the implementation of privacy and security in the context of electronic health records. Its findings are therefore useful for professionals and policymakers in improving the security and privacy associated with EHRs.
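
The K-means step that groups the twenty identified concepts into five key factors is standard Lloyd's algorithm; the sketch below runs it on made-up 2-D coordinates with two obvious groups (k=2) in place of real concept features:

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    n = len(pts)
    return tuple(sum(p[i] for p in pts) / n for i in range(len(pts[0])))

def kmeans(points, k, iters=50, seed=0):
    # Lloyd's algorithm: alternate assigning points to the nearest
    # center and moving each center to the mean of its cluster.
    rng = random.Random(seed)
    centers = list(rng.sample(points, k))
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: dist2(p, centers[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old center if a cluster goes empty
                centers[j] = mean(members)
    return assign, centers

# Toy "concept" coordinates with two well-separated groups.
concepts = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05),
            (1.0, 1.1), (1.1, 1.0), (0.95, 1.05)]
assign, centers = kmeans(concepts, k=2)
```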

[LG-8] Open-World Reinforcement Learning over Long Short-Term Imagination

Link: https://arxiv.org/abs/2410.03618
Authors: Jiajian Li,Qi Wang,Yunbo Wang,Xin Jin,Yang Li,Wenjun Zeng,Xiaokang Yang
Keywords: Training visual reinforcement, Training visual, high-dimensional open world, visual reinforcement learning, reinforcement learning agents
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Training visual reinforcement learning agents in a high-dimensional open world presents significant challenges. While various model-based methods have improved sample efficiency by learning interactive world models, these agents tend to be “short-sighted”, as they are typically trained on short snippets of imagined experiences. We argue that the primary obstacle in open-world decision-making is improving the efficiency of off-policy exploration across an extensive state space. In this paper, we present LS-Imagine, which extends the imagination horizon within a limited number of state transition steps, enabling the agent to explore behaviors that potentially lead to promising long-term feedback. The foundation of our approach is to build a long short-term world model. To achieve this, we simulate goal-conditioned jumpy state transitions and compute corresponding affordance maps by zooming in on specific areas within single images. This facilitates the integration of direct long-term values into behavior learning. Our method demonstrates significant improvements over state-of-the-art techniques in MineDojo.

[LG-9] What Matters for Model Merging at Scale?

Link: https://arxiv.org/abs/2410.03617
Authors: Prateek Yadav,Tu Vu,Jonathan Lai,Alexandra Chronopoulou,Manaal Faruqui,Mohit Bansal,Tsendsuren Munkhdalai
Keywords: models, merging, decentralized model development, capable single model, expert models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*Comments: 20 Pages, 7 Figures, 4 Tables

Click to view abstract

Abstract:Model merging aims to combine multiple expert models into a more capable single model, offering benefits such as reduced storage and serving costs, improved generalization, and support for decentralized model development. Despite its promise, previous studies have primarily focused on merging a few small models. This leaves many unanswered questions about the effect of scaling model size and how it interplays with other key factors, such as base model quality and the number of expert models, to affect the merged model’s performance. This work systematically evaluates the utility of model merging at scale, examining the impact of these different factors. We experiment with merging fully fine-tuned models using 4 popular merging methods – Averaging, Task Arithmetic, DARE, and TIES – across model sizes ranging from 1B to 64B parameters, merging up to 8 different expert models. We evaluate the merged models on both held-in tasks, i.e., the experts’ training tasks, and zero-shot generalization to unseen held-out tasks. Our experiments provide several new insights about model merging at scale and the interplay between different factors. First, we find that merging is more effective when experts are created from strong base models, i.e., models with good zero-shot performance. Second, larger models facilitate easier merging. Third, merging consistently improves generalization capabilities; notably, when merging 8 large expert models, the merged models often generalize better than multitask-trained models. Fourth, we can better merge more expert models when working with larger models. Fifth, different merging methods behave very similarly at larger scales. Overall, our findings shed light on some interesting properties of model merging while also highlighting some limitations. We hope that this study will serve as a reference point on large-scale merging for upcoming research.
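
Two of the four merging methods evaluated, Averaging and Task Arithmetic, reduce to simple parameter-space operations (TIES and DARE add pruning and sign-resolution steps not shown here). A sketch on toy parameter dictionaries, with a made-up scaling coefficient alpha:

```python
def average_merge(expert_weights):
    # Simple parameter averaging across expert checkpoints.
    keys = expert_weights[0].keys()
    n = len(expert_weights)
    return {k: sum(w[k] for w in expert_weights) / n for k in keys}

def task_arithmetic_merge(base, expert_weights, alpha=1.0):
    # Task Arithmetic: add the scaled sum of task vectors
    # (expert - base) back onto the base model's parameters.
    merged = {}
    for k in base:
        task_sum = sum(w[k] - base[k] for w in expert_weights)
        merged[k] = base[k] + alpha * task_sum
    return merged

# Toy one-layer "models" as parameter dictionaries.
base = {"w": 1.0, "b": 0.0}
experts = [{"w": 1.2, "b": 0.1}, {"w": 0.8, "b": 0.3}]
avg = average_merge(experts)
ta = task_arithmetic_merge(base, experts, alpha=0.5)
```

Real checkpoints would hold tensors per parameter name, but the arithmetic is identical entry-wise.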

[LG-10] Large Language Model Performance Benchmarking on Mobile Platforms: A Thorough Evaluation

Link: https://arxiv.org/abs/2410.03613
Authors: Jie Xiao,Qianyi Huang,Xu Chen,Chen Tian
Keywords: large language models, language models, increasingly integrate, daily lives, Gemini Nano
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:As large language models (LLMs) increasingly integrate into every aspect of our work and daily lives, there are growing concerns about user privacy, which push the trend toward local deployment of these models. There are a number of lightweight LLMs (e.g., Gemini Nano, LLAMA2 7B) that can run locally on smartphones, providing users with greater control over their personal data. As a rapidly emerging application, we are concerned about their performance on commercial-off-the-shelf mobile devices. To fully understand the current landscape of LLM deployment on mobile platforms, we conduct a comprehensive measurement study on mobile devices. We evaluate both metrics that affect user experience, including token throughput, latency, and battery consumption, as well as factors critical to developers, such as resource utilization, DVFS strategies, and inference engines. In addition, we provide a detailed analysis of how these hardware capabilities and system dynamics affect on-device LLM performance, which may help developers identify and address bottlenecks for mobile LLM applications. We also provide comprehensive comparisons across the mobile system-on-chips (SoCs) from major vendors, highlighting their performance differences in handling LLM workloads. We hope that this study can provide insights for both the development of on-device LLMs and the design for future mobile system architecture.
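
The two headline user-experience metrics in such a study, time-to-first-token and decode throughput, fall straight out of per-token timestamps; the timings below are fabricated for illustration:

```python
def llm_latency_metrics(token_timestamps):
    # token_timestamps: seconds since the request was issued, one entry
    # per emitted token. Time-to-first-token is the first entry; decode
    # throughput is tokens per second over the remaining tokens.
    ttft = token_timestamps[0]
    n_decoded = len(token_timestamps) - 1
    decode_time = token_timestamps[-1] - token_timestamps[0]
    throughput = n_decoded / decode_time if decode_time > 0 else float("inf")
    return {"ttft_s": ttft, "decode_tok_per_s": throughput}

# Fabricated trace: first token at 0.5 s, then 20 tokens at 50 ms each.
stamps = [0.5] + [0.5 + 0.05 * i for i in range(1, 21)]
m = llm_latency_metrics(stamps)
```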

[LG-11] TICKing All the Boxes: Generated Checklists Improve LLM Evaluation and Generation

链接: https://arxiv.org/abs/2410.03608
作者: Jonathan Cook,Tim Rocktäschel,Jakob Foerster,Dennis Aumiller,Alex Wang
关键词-EN: Large Language Models, Large Language, Language Models, usage of Large, instruction-following ability
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Given the widespread adoption and usage of Large Language Models (LLMs), it is crucial to have flexible and interpretable evaluations of their instruction-following ability. Preference judgments between model outputs have become the de facto evaluation standard, despite distilling complex, multi-faceted preferences into a single ranking. Furthermore, as human annotation is slow and costly, LLMs are increasingly used to make these judgments, at the expense of reliability and interpretability. In this work, we propose TICK (Targeted Instruct-evaluation with ChecKlists), a fully automated, interpretable evaluation protocol that structures evaluations with LLM-generated, instruction-specific checklists. We first show that, given an instruction, LLMs can reliably produce high-quality, tailored evaluation checklists that decompose the instruction into a series of YES/NO questions. Each question asks whether a candidate response meets a specific requirement of the instruction. We demonstrate that using TICK leads to a significant increase (46.4% → 52.2%) in the frequency of exact agreements between LLM judgements and human preferences, as compared to having an LLM directly score an output. We then show that STICK (Self-TICK) can be used to improve generation quality across multiple benchmarks via self-refinement and Best-of-N selection. STICK self-refinement on LiveBench reasoning tasks leads to an absolute gain of +7.8%, whilst Best-of-N selection with STICK attains +6.3% absolute improvement on the real-world instruction dataset, WildBench. In light of this, structured, multi-faceted self-improvement is shown to be a promising way to further advance LLM capabilities. Finally, by providing LLM-generated checklists to human evaluators tasked with directly scoring LLM responses to WildBench instructions, we notably increase inter-annotator agreement (0.194 → 0.256).
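摘要中「将指令分解为一组 YES/NO 问题,再按通过率比较候选回答」的打分逻辑,可以用如下小示例说明(checklist 答案的来源与聚合方式均为示意性假设,论文中由 LLM 逐题判定):

```python
# Hypothetical sketch: aggregate YES/NO checklist answers into a pass rate
# and derive a preference between two candidate responses.

def checklist_score(answers):
    """Fraction of checklist questions answered YES for one response."""
    if not answers:
        return 0.0
    return sum(1 for a in answers if a.upper() == "YES") / len(answers)

def prefer(answers_a, answers_b):
    """Preference between candidates A and B by checklist pass rate."""
    sa, sb = checklist_score(answers_a), checklist_score(answers_b)
    return "A" if sa > sb else "B" if sb > sa else "tie"
```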

[LG-12] How Discrete and Continuous Diffusion Meet: Comprehensive Analysis of Discrete Diffusion Models via a Stochastic Integral Framework

链接: https://arxiv.org/abs/2410.03601
作者: Yinuo Ren,Haoxuan Chen,Grant M. Rotskoff,Lexing Ying
关键词-EN: Discrete diffusion models, Discrete diffusion, gained increasing attention, diffusion models, model complex distributions
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Discrete diffusion models have gained increasing attention for their ability to model complex distributions with tractable sampling and inference. However, the error analysis for discrete diffusion models remains less well-understood. In this work, we propose a comprehensive framework for the error analysis of discrete diffusion models based on Lévy-type stochastic integrals. By generalizing the Poisson random measure to that with a time-independent and state-dependent intensity, we rigorously establish a stochastic integral formulation of discrete diffusion models and provide the corresponding change of measure theorems that are intriguingly analogous to Itô integrals and Girsanov’s theorem for their continuous counterparts. Our framework unifies and strengthens the current theoretical results on discrete diffusion models and obtains the first error bound for the \tau -leaping scheme in KL divergence. With error sources clearly identified, our analysis gives new insight into the mathematical properties of discrete diffusion models and offers guidance for the design of efficient and accurate algorithms for real-world discrete diffusion model applications.

[LG-13] Understanding Reasoning in Chain-of-Thought from the Hopfieldian View

链接: https://arxiv.org/abs/2410.03595
作者: Lijie Hu,Liang Liu,Shu Yang,Xin Chen,Zhen Tan,Muhammad Asif Ali,Mengdi Li,Di Wang
关键词-EN: Large Language Models, Large Language, Language Models, demonstrated remarkable abilities, Models have demonstrated
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 28 pages, a new version of “A Hopfieldian View-based Interpretation for Chain-of-Thought Reasoning”

点击查看摘要

Abstract:Large Language Models have demonstrated remarkable abilities across various tasks, with Chain-of-Thought (CoT) prompting emerging as a key technique to enhance reasoning capabilities. However, existing research primarily focuses on improving performance, lacking a comprehensive framework to explain and understand the fundamental factors behind CoT’s success. To bridge this gap, we introduce a novel perspective grounded in the Hopfieldian view of cognition in cognitive neuroscience. We establish a connection between CoT reasoning and key cognitive elements such as stimuli, actions, neural populations, and representation spaces. From our view, we can understand the reasoning process as the movement between these representation spaces. Building on this insight, we develop a method for localizing reasoning errors in the response of CoTs. Moreover, we propose the Representation-of-Thought (RoT) framework, which leverages the robustness of low-dimensional representation spaces to enhance the robustness of the reasoning process in CoTs. Experimental results demonstrate that RoT improves the robustness and interpretability of CoT reasoning while offering fine-grained control over the reasoning process.

[LG-14] Training Over a Distribution of Hyperparameters for Enhanced Performance and Adaptability on Imbalanced Classification

链接: https://arxiv.org/abs/2410.03588
作者: Kelsey Lieberman,Swarna Kamlam Ravindran,Shuai Yuan,Carlo Tomasi
关键词-EN: severe class imbalance, class imbalance remains, training reliable classifiers, well-studied problem, remains a challenge
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Although binary classification is a well-studied problem, training reliable classifiers under severe class imbalance remains a challenge. Recent techniques mitigate the ill effects of imbalance on training by modifying the loss functions or optimization methods. We observe that different hyperparameter values on these loss functions perform better at different recall values. We propose to exploit this fact by training one model over a distribution of hyperparameter values, instead of a single value, via Loss Conditional Training (LCT). Experiments show that training over a distribution of hyperparameters not only approximates the performance of several models but actually improves the overall performance of models on both CIFAR and real medical imaging applications, such as melanoma and diabetic retinopathy detection. Furthermore, training models with LCT is more efficient because some hyperparameter tuning can be conducted after training to meet individual needs without needing to retrain from scratch.
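「在超参数的一个分布上训练,而非固定单一取值」这一思路可以用带正类权重的二元交叉熵来示意:每一步从对数均匀分布中采样损失超参数(采样范围与分布形状是这里的假设,并非论文给定):

```python
import math
import random

def weighted_bce(p, y, w_pos):
    """Binary cross-entropy with a positive-class weight, a typical
    imbalance-handling loss whose hyperparameter LCT varies during training."""
    eps = 1e-12
    return -(w_pos * y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def lct_loss(p, y, rng, w_range=(1.0, 50.0)):
    """Sample the loss hyperparameter log-uniformly per step; in LCT the
    model additionally receives the sampled value as a conditioning input."""
    lo, hi = w_range
    w = math.exp(rng.uniform(math.log(lo), math.log(hi)))
    return weighted_bce(p, y, w), w
```

训练结束后,在验证集上扫描条件输入 w 即可「事后调参」,无需重新训练。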

[LG-15] HyResPINNs: Adaptive Hybrid Residual Networks for Learning Optimal Combinations of Neural and RBF Components for Physics-Informed Modeling

链接: https://arxiv.org/abs/2410.03573
作者: Madison Cooley,Robert M. Kirby,Shandian Zhe,Varun Shankar
关键词-EN: Physics-informed neural networks, enforce physical constraints, increasingly popular class, loss functions regularized, Physics-informed neural
类目: Machine Learning (cs.LG)
*备注: 14 pages, 6 figures

点击查看摘要

Abstract:Physics-informed neural networks (PINNs) are an increasingly popular class of techniques for the numerical solution of partial differential equations (PDEs), where neural networks are trained using loss functions regularized by relevant PDE terms to enforce physical constraints. We present a new class of PINNs called HyResPINNs, which augment traditional PINNs with adaptive hybrid residual blocks that combine the outputs of a standard neural network and a radial basis function (RBF) network. A key feature of our method is the inclusion of adaptive combination parameters within each residual block, which dynamically learn to weigh the contributions of the neural network and RBF network outputs. Additionally, adaptive connections between residual blocks allow for flexible information flow throughout the network. We show that HyResPINNs are more robust to training point locations and neural network architectures than traditional PINNs. Moreover, HyResPINNs offer orders of magnitude greater accuracy than competing methods on certain problems, with only modest increases in training costs. We demonstrate the strengths of our approach on challenging PDEs, including the Allen-Cahn equation and the Darcy-Flow equation. Our results suggest that HyResPINNs effectively bridge the gap between traditional numerical methods and modern machine learning-based solvers.

[LG-16] Teaching Transformers Modular Arithmetic at Scale

链接: https://arxiv.org/abs/2410.03569
作者: Eshika Saxena,Alberto Alfarano,Emily Wenger,Kristin Lauter
关键词-EN: simple operation, sum modulo, Modular addition, Modular, elements mod
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modular addition is, on its face, a simple operation: given N elements in \mathbb{Z}_q , compute their sum modulo q . Yet, scalable machine learning solutions to this problem remain elusive: prior work trains ML models that sum N \le 6 elements mod q \le 1000 . Promising applications of ML models for cryptanalysis-which often involve modular arithmetic with large N and q -motivate reconsideration of this problem. This work proposes three changes to the modular addition model training pipeline: more diverse training data, an angular embedding, and a custom loss function. With these changes, we demonstrate success with our approach for N = 256, q = 3329 , a case which is interesting for cryptographic applications, and a significant increase in N and q over prior work. These techniques also generalize to other modular arithmetic problems, motivating future work.
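摘要中的「角度嵌入」(angular embedding)把模 q 的剩余类映射到单位圆上,使模加法变成角度相加。下面的小例子演示这一几何事实本身(它只说明嵌入为何适配该任务,并非论文训练的模型):

```python
import cmath
import math

def mod_sum_via_angles(xs, q):
    """Sum residues mod q by mapping each residue to a unit-circle point
    (the angular embedding) and composing the rotations: modular addition
    becomes angle addition. Illustrative only, not the trained model."""
    z = complex(1.0, 0.0)
    for x in xs:
        z *= cmath.exp(2j * math.pi * (x % q) / q)
    theta = cmath.phase(z) % (2 * math.pi)
    return round(theta * q / (2 * math.pi)) % q
```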

[LG-17] Towards Linguistically-Aware and Language-Independent Tokenization for Large Language Models (LLMs)

链接: https://arxiv.org/abs/2410.03568
作者: Abrar Rahman,Garry Bowlin,Binit Mohanty,Sean McGunigal
关键词-EN: tokenization techniques employed, low resource languages, base embeddings, large language models, BERT base tokenizer
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a comprehensive study on the tokenization techniques employed by state-of-the-art large language models (LLMs) and their implications on the cost and availability of services across different languages, especially low resource languages. The analysis considers multiple LLMs, including GPT-4 (using cl100k_base embeddings), GPT-3 (with p50k_base embeddings), and DaVinci (employing r50k_base embeddings), as well as the widely used BERT base tokenizer. The study evaluates the tokenization variability observed across these models and investigates the challenges of linguistic representation in subword tokenization. The research underscores the importance of fostering linguistically-aware development practices, especially for languages that are traditionally under-resourced. Moreover, this paper introduces case studies that highlight the real-world implications of tokenization choices, particularly in the context of electronic health record (EHR) systems. This research aims to promote generalizable Internationalization (I18N) practices in the development of AI services in this domain and beyond, with a strong emphasis on inclusivity, particularly for languages traditionally underrepresented in AI applications.
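衡量不同语言分词成本差异的一个常用量是 fertility(每字符 token 数):fertility 越高,同样一段文本计费的 token 越多、可用上下文越短。下面给出该指标的最小示意(此处用两个玩具分词器代替真实 BPE 词表,真实对比需调用各模型自己的 tokenizer,这是本例的假设):

```python
def fertility(tokenize, text):
    """Tokens per character: a simple proxy for per-language serving cost
    (higher fertility -> more tokens billed, less effective context)."""
    return len(tokenize(text)) / max(len(text), 1)

# Toy tokenizers standing in for real BPE vocabularies (assumption).
whitespace_tokenizer = str.split   # coarse: one token per word
character_tokenizer = list        # worst case: one token per character
```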

[LG-18] Training on more Reachable Tasks for Generalisation in Reinforcement Learning

链接: https://arxiv.org/abs/2410.03565
作者: Max Weltevrede,Caroline Horsch,Matthijs T.J. Spaan,Wendelin Böhmer
关键词-EN: multi-task reinforcement learning, fixed set, reinforcement learning, multi-task reinforcement, agents train
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: arXiv admin note: text overlap with arXiv:2406.08069

点击查看摘要

Abstract:In multi-task reinforcement learning, agents train on a fixed set of tasks and have to generalise to new ones. Recent work has shown that increased exploration improves this generalisation, but it remains unclear why exactly that is. In this paper, we introduce the concept of reachability in multi-task reinforcement learning and show that an initial exploration phase increases the number of reachable tasks the agent is trained on. This, and not the increased exploration, is responsible for the improved generalisation, even to unreachable tasks. Inspired by this, we propose a novel method Explore-Go that implements such an exploration phase at the beginning of each episode. Explore-Go only modifies the way experience is collected and can be used with most existing on-policy or off-policy reinforcement learning algorithms. We demonstrate the effectiveness of our method when combined with some popular algorithms and show an increase in generalisation performance across several environments.

[LG-19] BodyShapeGPT: SMPL Body Shape Manipulation with LLMs ECCV2024

链接: https://arxiv.org/abs/2410.03556
作者: Baldomero R. Árbol,Dan Casas
关键词-EN: performing complex tasks, Large Language Models, provide a wide, wide range, range of tools
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted to ECCV 2024 Workshop on Foundation Models for 3D Humans. Code repository: this https URL

点击查看摘要

Abstract:Generative AI models provide a wide range of tools capable of performing complex tasks in a fraction of the time it would take a human. Among these, Large Language Models (LLMs) stand out for their ability to generate diverse texts, from literary narratives to specialized responses in different fields of knowledge. This paper explores the use of fine-tuned LLMs to identify physical descriptions of people, and subsequently create accurate representations of avatars using the SMPL-X model by inferring shape parameters. We demonstrate that LLMs can be trained to understand and manipulate the shape space of SMPL, allowing the control of 3D human shapes through natural language. This approach promises to improve human-machine interaction and opens new avenues for customization and simulation in virtual environments.

[LG-20] Artificial intelligence inspired freeform optics design: a review

链接: https://arxiv.org/abs/2410.03554
作者: Lei Feng,Jingxing Liao,Jingna Yang
关键词-EN: Integrating artificial intelligence, enhanced design efficiency, Integrating artificial, significantly enhanced design, artificial intelligence
类目: Machine Learning (cs.LG); Optics (physics.optics)
*备注:

点击查看摘要

Abstract:Integrating artificial intelligence (AI) techniques such as machine learning and deep learning into freeform optics design has significantly enhanced design efficiency, expanded the design space, and led to innovative solutions. This article reviews the latest developments in AI applications within this field, highlighting their roles in initial design generation, optimization, and performance prediction. It also addresses the benefits of AI, such as improved accuracy and performance, alongside challenges like data requirements, model interpretability, and computational complexity. Despite these challenges, the future of AI in freeform optics design looks promising, with potential advancements in hybrid design methods, interpretable AI, AI-driven manufacturing, and targeted research for specific applications. Collaboration among researchers, engineers, and designers is essential to fully harness AI’s potential and drive innovation in optics.

[LG-21] Multi-modal Atmospheric Sensing to Augment Wearable IMU-Based Hand Washing Detection

链接: https://arxiv.org/abs/2410.03549
作者: Robin Burchard,Kristof Van Laerhoven
关键词-EN: Hand washing detection, Hand washing, washing detection, crucial part, part of personal
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: iWOAR2024

点击查看摘要

Abstract:Hand washing is a crucial part of personal hygiene. Hand washing detection is a relevant topic for wearable sensing with applications in the medical and professional fields. Hand washing detection can be used to aid workers in complying with hygiene rules. Hand washing detection using body-worn IMU-based sensor systems has been shown to be a feasible approach, although, for some reported results, the specificity of the detection was low, leading to a high rate of false positives. In this work, we present a novel, open-source prototype device that additionally includes a humidity, temperature, and barometric sensor. We contribute a benchmark dataset of 10 participants and 43 hand-washing events and perform an evaluation of the sensors’ benefits. In addition, we outline the usefulness of the additional sensors in both the annotation pipeline and the machine learning models. By visual inspection, we show that the humidity sensor in particular registers a strong increase in the relative humidity during a hand-washing activity. A machine learning analysis of our data shows that distinct features benefiting from such relative humidity patterns remain to be identified.
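摘要中「洗手时相对湿度显著上升」的观察,可以转化为一个简单的候选事件检测:在滑动窗口内湿度升幅超过阈值即标记。窗口长度与阈值均为示意性假设,并非论文给定的参数:

```python
def humidity_rise_events(rh, window=10, min_rise=5.0):
    """Start indices where relative humidity (%) rises by at least
    min_rise within `window` samples - a candidate hand-washing cue.
    Thresholds are illustrative assumptions, not the paper's values."""
    events = []
    i = 0
    while i + window < len(rh):
        if rh[i + window] - rh[i] >= min_rise:
            events.append(i)
            i += window  # skip past the detected rise
        else:
            i += 1
    return events
```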

[LG-22] Multidimensional Human Activity Recognition With Large Language Model: A Conceptual Framework

链接: https://arxiv.org/abs/2410.03546
作者: Syed Mhamudul Hasan
关键词-EN: Human Activity Recognition, revolutionize risk assessment, Activity Recognition, large language model, Human Activity
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In high-stakes environments like emergency response or elder care, the integration of large language models (LLMs) can revolutionize risk assessment, resource allocation, and emergency response in Human Activity Recognition (HAR) systems by leveraging data from various wearable sensors. We propose a conceptual framework that utilizes various wearable devices, each considered as a single dimension, to support a multidimensional learning approach within HAR systems. By integrating and processing data from these diverse sources, LLMs can translate complex sensor inputs into actionable insights. This integration mitigates the inherent uncertainties and complexities of these data sources, thus enhancing the responsiveness and effectiveness of emergency services. This paper sets the stage for exploring the transformative potential of LLMs within HAR systems, empowering emergency workers to navigate the unpredictable and risky environments they encounter in their critical roles.

[LG-23] Ward: Provable RAG Dataset Inference via LLM Watermarks

链接: https://arxiv.org/abs/2410.03537
作者: Nikola Jovanović,Robin Staab,Maximilian Baader,Martin Vechev
关键词-EN: Retrieval-Augmented Generation, incorporate external data, Generation, incorporate external, RAG Dataset Inference
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) improves LLMs by enabling them to incorporate external data during generation. This raises concerns for data owners regarding unauthorized use of their content in RAG systems. Despite its importance, the challenge of detecting such unauthorized usage remains underexplored, with existing datasets and methodologies from adjacent fields being ill-suited for its study. In this work, we take several steps to bridge this gap. First, we formalize this problem as (black-box) RAG Dataset Inference (RAG-DI). To facilitate research on this challenge, we further introduce a novel dataset specifically designed for benchmarking RAG-DI methods under realistic conditions, and propose a set of baseline approaches. Building on this foundation, we introduce Ward, a RAG-DI method based on LLM watermarks that enables data owners to obtain rigorous statistical guarantees regarding the usage of their dataset in a RAG system. In our experimental evaluation, we show that Ward consistently outperforms all baselines across many challenging settings, achieving higher accuracy, superior query efficiency and robustness. Our work provides a foundation for future studies of RAG-DI and highlights LLM watermarks as a promising approach to this problem.

[LG-24] NRGBoost: Energy-Based Generative Boosted Trees

链接: https://arxiv.org/abs/2410.03535
作者: João Bravo
关键词-EN: Boosted Decision Trees, Gradient Boosted Decision, Random Forests, Decision Trees, Gradient Boosted
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite the rise to dominance of deep learning in unstructured data domains, tree-based methods such as Random Forests (RF) and Gradient Boosted Decision Trees (GBDT) are still the workhorses for handling discriminative tasks on tabular data. We explore generative extensions of these popular algorithms with a focus on explicitly modeling the data density (up to a normalization constant), thus enabling other applications besides sampling. As our main contribution we propose an energy-based generative boosting algorithm that is analogous to the second order boosting implemented in popular packages like XGBoost. We show that, despite producing a generative model capable of handling inference tasks over any input variable, our proposed algorithm can achieve similar discriminative performance to GBDT on a number of real world tabular datasets, outperforming alternative generative approaches. At the same time, we show that it is also competitive with neural network based models for sampling.

[LG-25] No Need to Talk: Asynchronous Mixture of Language Models

链接: https://arxiv.org/abs/2410.03529
作者: Anastasiia Filippova,Angelos Katharopoulos,David Grangier,Ronan Collobert
关键词-EN: asynchronous manner, innovative method, introduce SmallTalk, model, training
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注: 23 pages

点击查看摘要

Abstract:We introduce SmallTalk LM, an innovative method for training a mixture of language models in an almost asynchronous manner. Each model of the mixture specializes in distinct parts of the data distribution, without the need of high-bandwidth communication between the nodes training each model. At inference, a lightweight router directs a given sequence to a single expert, according to a short prefix. This inference scheme naturally uses a fraction of the parameters from the overall mixture model. Our experiments on language modeling demonstrate that SmallTalk LM achieves significantly lower perplexity than dense model baselines for the same total training FLOPs and an almost identical inference cost. Finally, in our downstream evaluations we outperform the dense baseline on 75% of the tasks.
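「轻量路由器根据短前缀把序列分配给单个专家」的一个极简示意如下。此处用「前缀嵌入最近质心」代表路由规则,质心表示是本例的假设,摘要只说明路由器轻量且基于前缀:

```python
def route(prefix_vec, expert_centroids):
    """Send a sequence to the single expert whose centroid is nearest to a
    lightweight embedding of its short prefix (centroid routing is an
    assumption of this sketch, not the paper's exact router)."""
    best, best_d = None, float("inf")
    for name, centroid in expert_centroids.items():
        d = sum((a - b) ** 2 for a, b in zip(prefix_vec, centroid))
        if d < best_d:
            best, best_d = name, d
    return best
```

由于每个序列只激活一个专家,推理时仅使用混合模型参数的一小部分。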

[LG-26] A Probabilistic Perspective on Unlearning and Alignment for Large Language Models

链接: https://arxiv.org/abs/2410.03523
作者: Yan Scholten,Stephan Günnemann,Leo Schwinn
关键词-EN: Large Language Models, Large Language, open research problem, Language Models, research problem
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Comprehensive evaluation of Large Language Models (LLMs) is an open research problem. Existing evaluations rely on deterministic point estimates generated via greedy decoding. However, we find that deterministic evaluations fail to capture the whole output distribution of a model, yielding inaccurate estimations of model capabilities. This is particularly problematic in critical contexts such as unlearning and alignment, where precise model evaluations are crucial. To remedy this, we introduce the first formal probabilistic evaluation framework in LLMs. Namely, we derive novel metrics with high-probability guarantees concerning the output distribution of a model. Our metrics are application-independent and allow practitioners to make more reliable estimates about model capabilities before deployment. Through a case study focused on unlearning, we reveal that deterministic evaluations falsely indicate successful unlearning, whereas our probabilistic evaluations demonstrate that most if not all of the supposedly unlearned information remains accessible in these models. Additionally, we propose a novel unlearning loss based on entropy optimization and adaptive temperature scaling, which significantly improves unlearning in probabilistic settings on recent benchmarks. Our proposed shift from point estimates to probabilistic evaluations of output distributions represents an important step toward comprehensive evaluations of LLMs. this https URL
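「对模型输出分布给出高概率保证」的最简单形式,是对目标行为(例如输出本应被遗忘的内容)做蒙特卡洛采样,再给出置信上界。下面用 Hoeffding 不等式做一个示意性替身(论文的具体度量并非此式,此处只展示「点估计 → 带保证的区间」这一转变):

```python
import math

def hoeffding_upper(successes, n, alpha=0.05):
    """With probability >= 1 - alpha over the n i.i.d. samples, the true
    probability of the monitored behaviour is below this bound
    (Hoeffding's inequality; a stand-in for the paper's metrics)."""
    return successes / n + math.sqrt(math.log(1.0 / alpha) / (2.0 * n))
```

例如,1000 次采样中 0 次命中,并不允许断言概率为 0,而只能得到约 0.039 的 95% 置信上界。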

[LG-27] Improving Online Bagging for Complex Imbalanced Data Stream

链接: https://arxiv.org/abs/2410.03519
作者: Bartosz Przybyl,Jerzy Stefanowski
关键词-EN: concept drifting data, Learning classifiers, drifting data streams, concept drifting, Oversampling Online Bagging
类目: Machine Learning (cs.LG)
*备注: 16 pages, 4 figures

点击查看摘要

Abstract:Learning classifiers from imbalanced and concept drifting data streams is still a challenge. Most of the current proposals focus on taking into account changes in the global imbalance ratio only and ignore the local difficulty factors, such as the minority class decomposition into sub-concepts and the presence of unsafe types of examples (borderline or rare ones). As the above factors present in the stream may deteriorate the performance of popular online classifiers, we propose extensions of resampling online bagging, namely Neighbourhood Undersampling or Oversampling Online Bagging to take better account of the presence of unsafe minority examples. The performed computational experiments with synthetic complex imbalanced data streams have shown their advantage over earlier variants of online bagging resampling ensembles.
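在线 bagging 的核心是:每个基分类器以 Poisson(λ) 次重复训练每个到来的样本;邻域过采样变体则对处于「不安全」区域(近邻多为多数类)的少数类样本提高 λ。下面是一个示意(boost 的具体调度是假设,并非论文公式):

```python
import math
import random

def poisson_draw(lam, rng):
    """Knuth's Poisson sampler (fine for small lambda)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def training_weight(is_minority, unsafe_ratio, base_lam=1.0, boost=2.0, rng=None):
    """Number of times a base learner trains on this example. Minority
    examples with many majority-class neighbours (unsafe_ratio in [0, 1])
    get a larger lambda; the boost schedule here is an assumption."""
    lam = base_lam + (boost * unsafe_ratio if is_minority else 0.0)
    return poisson_draw(lam, rng or random.Random())
```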

[LG-28] Fine-Grained Expressive Power of Weisfeiler-Leman: A Homomorphism Counting Perspective

链接: https://arxiv.org/abs/2410.03517
作者: Junru Zhou,Muhan Zhang
关键词-EN: graph neural networks, homomorphism counting power, neural networks, ability of graph, graph neural
类目: Machine Learning (cs.LG); Discrete Mathematics (cs.DM)
*备注:

点击查看摘要

Abstract:The ability of graph neural networks (GNNs) to count homomorphisms has recently been proposed as a practical and fine-grained measure of their expressive power. Although several existing works have investigated the homomorphism counting power of certain GNN families, a simple and unified framework for analyzing the problem is absent. In this paper, we first propose generalized folklore Weisfeiler-Leman (GFWL) algorithms as a flexible design basis for expressive GNNs, and then provide a theoretical framework to algorithmically determine the homomorphism counting power of an arbitrary class of GNN within the GFWL design space. As the considered design space is large enough to accommodate almost all known powerful GNNs, our result greatly extends all existing works, and may find its application in the automation of GNN model design.

[LG-29] Stabilized Neural Prediction of Potential Outcomes in Continuous Time

链接: https://arxiv.org/abs/2410.03514
作者: Konstantin Hess,Stefan Feuerriegel
关键词-EN: electronic health records, personalize care, electronic health, health records, records are widely
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Patient trajectories from electronic health records are widely used to predict potential outcomes of treatments over time, which then allows to personalize care. Yet, existing neural methods for this purpose have a key limitation: while some adjust for time-varying confounding, these methods assume that the time series are recorded in discrete time. In other words, they are constrained to settings where measurements and treatments are conducted at fixed time steps, even though this is unrealistic in medical practice. In this work, we aim to predict potential outcomes in continuous time. The latter is of direct practical relevance because it allows for modeling patient trajectories where measurements and treatments take place at arbitrary, irregular timestamps. We thus propose a new method called stabilized continuous time inverse propensity network (SCIP-Net). For this, we further derive stabilized inverse propensity weights for robust prediction of the potential outcomes. To the best of our knowledge, our SCIP-Net is the first neural method that performs proper adjustments for time-varying confounding in continuous time.
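摘要中的「稳定化逆倾向权重」沿治疗轨迹取「边际治疗概率 / 给定历史的条件治疗概率」之积;相比普通 IPW 的 1/条件概率,边际分子使权重更有界。离散时间下的最小示意如下(连续时间版本需将乘积替换为强度比的随机积分,此处从略):

```python
def stabilized_weight(p_marginal, p_conditional):
    """Stabilized inverse propensity weight over a treatment trajectory:
    product over time of marginal / history-conditional treatment
    probabilities. A discrete-time sketch of the idea, not SCIP-Net itself."""
    w = 1.0
    for pm, pc in zip(p_marginal, p_conditional):
        w *= pm / pc
    return w
```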

[LG-30] Classification-Denoising Networks

链接: https://arxiv.org/abs/2410.03505
作者: Louis Thiry,Florentin Guth
关键词-EN: ignoring conditioning information, partially ignoring conditioning, suffer from complementary, complementary issues, issues of lack
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 18 pages, 5 figures

点击查看摘要

Abstract:Image classification and denoising suffer from complementary issues: a lack of robustness and partially ignoring conditioning information, respectively. We argue that both can be alleviated by unifying the two tasks through a model of the joint probability of (noisy) images and class labels. Classification is performed with a forward pass followed by conditioning. Using the Tweedie-Miyasawa formula, we evaluate the denoising function with the score, which can be computed by marginalization and back-propagation. The training objective is then a combination of cross-entropy loss and denoising score matching loss integrated over noise levels. Numerical experiments on CIFAR-10 and ImageNet show competitive classification and denoising performance compared to reference deep convolutional classifiers/denoisers, and significantly improved efficiency compared to previous joint approaches. Our model shows increased robustness to adversarial perturbations compared to a standard discriminative classifier, and allows for a novel interpretation of adversarial gradients as a difference of denoisers.
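摘要所引的 Tweedie-Miyasawa 公式把最优去噪器写成噪声边际分布的 score 函数:设 y = x + \varepsilon,\varepsilon \sim \mathcal{N}(0, \sigma^2 I),则

```latex
\hat{x}(y) \;=\; \mathbb{E}[x \mid y] \;=\; y + \sigma^2 \,\nabla_y \log p(y).
```

因此,只要对类别标签边际化得到 \log p(y) 并反向传播求梯度,分类器的联合模型就顺带给出了一个去噪器,这正是该文统一两任务的机制。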

[LG-31] FedStein: Enhancing Multi-Domain Federated Learning Through James-Stein Estimator NEURIPS’24 NEURIPS2024

链接: https://arxiv.org/abs/2410.03499
作者: Sunny Gupta,Nikita Jangid,Amit Sethi
关键词-EN: enabling collaborative in-situ, collaborative in-situ training, facilitates data privacy, Federated Learning, Multi-Domain Federated Learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 12 pages, 2 figures. Accepted at International Workshop on Federated Foundation Models In Conjunction with NeurIPS 2024 (FL@FM-NeurIPS’24)

点击查看摘要

Abstract:Federated Learning (FL) facilitates data privacy by enabling collaborative in-situ training across decentralized clients. Despite its inherent advantages, FL faces significant challenges of performance and convergence when dealing with data that is not independently and identically distributed (non-i.i.d.). While previous research has primarily addressed the issue of skewed label distribution across clients, this study focuses on the less explored challenge of multi-domain FL, where client data originates from distinct domains with varying feature distributions. We introduce a novel method designed to address these challenges, FedStein: Enhancing Multi-Domain Federated Learning Through the James-Stein Estimator. FedStein uniquely shares only the James-Stein (JS) estimates of batch normalization (BN) statistics across clients, while maintaining local BN parameters. The non-BN layer parameters are exchanged via standard FL techniques. Extensive experiments conducted across three datasets and multiple models demonstrate that FedStein surpasses existing methods such as FedAvg and FedBN, with accuracy improvements exceeding 14% in certain domains, leading to enhanced domain generalization. The code is available at this https URL
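James-Stein 估计的要点是把各客户端的统计量(如 BN 均值)向总体均值收缩,收缩程度随估计噪声增大而增大。下面是经典正部 James-Stein 收缩的示意((m-3) 因子是经典标量情形的选择,论文对 BN 统计量的具体用法可能不同,此处仅作说明):

```python
def james_stein(estimates, noise_var):
    """Positive-part James-Stein shrinkage of per-client estimates (e.g.
    batch-norm means) toward their grand mean. The classic (m - 3) factor
    used here is an illustrative choice; requires at least 4 values."""
    m = len(estimates)
    grand = sum(estimates) / m
    ss = sum((e - grand) ** 2 for e in estimates)
    if ss == 0.0:
        return list(estimates)
    shrink = max(0.0, 1.0 - (m - 3) * noise_var / ss)
    return [grand + shrink * (e - grand) for e in estimates]
```

噪声为零时不收缩;噪声极大时所有客户端共享总体均值,介于 FedBN(完全本地)与全局共享之间。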

[LG-32] Collaborative and Efficient Personalization with Mixtures of Adaptors

链接: https://arxiv.org/abs/2410.03497
作者: Abdulla Jasem Almansoori,Samuel Horváth,Martin Takáč
关键词-EN: Non-iid data, Non-iid, federated, learning, adaptors
类目: Machine Learning (cs.LG)
*备注: 36 pages, 10 figures

点击查看摘要

Abstract:Non-iid data is prevalent in real-world federated learning problems. Data heterogeneity can come in different types in terms of distribution shifts. In this work, we are interested in the heterogeneity that comes from concept shifts, i.e., shifts in the prediction across clients. In particular, we consider multi-task learning, where we want the model to adapt to the task of the client. We propose a parameter-efficient framework to tackle this issue, where each client learns to mix between parameter-efficient adaptors according to its task. We use Low-Rank Adaptors (LoRAs) as the backbone and extend its concept to other types of layers. We call our framework Federated Low-Rank Adaptive Learning (FLoRAL). This framework is not an algorithm but rather a model parameterization for a multi-task learning objective, so it can work on top of any algorithm that optimizes this objective, which includes many algorithms from the literature. FLoRAL is memory-efficient, and clients are personalized with small states (e.g., one number per adaptor) as the adaptors themselves are federated. Hence, personalization is–in this sense–federated as well. Even though clients can personalize more freely by training an adaptor locally, we show that collaborative and efficient training of adaptors is possible and performs better. We also show that FLoRAL can outperform an ensemble of full models with optimal cluster assignment, which demonstrates the benefits of federated personalization and the robustness of FLoRAL to overfitting. We show promising experimental results on synthetic datasets, real-world federated multi-task problems such as MNIST, CIFAR-10, and CIFAR-100. We also provide a theoretical analysis of local SGD on a relaxed objective and discuss the effects of aggregation mismatch on convergence.
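The model parameterization described above, a shared base layer plus a client-specific mixture over federated low-rank adaptors, can be sketched as a forward pass. This is illustrative NumPy, not the paper's implementation; the dimensions, initialization, and softmax mixing are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, n_adaptors = 8, 4, 2, 3

W = rng.normal(size=(d_out, d_in))                   # shared base weight
A = rng.normal(size=(n_adaptors, rank, d_in)) * 0.1  # federated LoRA down-projections
B = rng.normal(size=(n_adaptors, d_out, rank)) * 0.1 # federated LoRA up-projections

def floral_layer(x, mix_logits):
    # Client-specific state is tiny: one mixing logit per adaptor.
    pi = np.exp(mix_logits - mix_logits.max())
    pi /= pi.sum()
    lora = sum(pi[k] * (B[k] @ (A[k] @ x)) for k in range(n_adaptors))
    return W @ x + lora

x = rng.normal(size=d_in)
y = floral_layer(x, mix_logits=np.array([2.0, -1.0, 0.0]))
assert y.shape == (d_out,)
```

The point of the parameterization is visible in the shapes: the adaptors themselves are shared and federated, while each client personalizes only the few mixing scalars.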

[LG-33] Fourier PINNs: From Strong Boundary Conditions to Adaptive Fourier Bases

链接: https://arxiv.org/abs/2410.03496
作者: Madison Cooley,Varun Shankar,Robert M. Kirby,Shandian Zhe
关键词-EN: Physics-Informed Neural Networks, partial differential equations, traditional numerical solvers, Interest is rising, Neural Networks
类目: Machine Learning (cs.LG)
*备注: 24 pages, 15 figures

点击查看摘要

Abstract:Interest is rising in Physics-Informed Neural Networks (PINNs) as a mesh-free alternative to traditional numerical solvers for partial differential equations (PDEs). However, PINNs often struggle to learn high-frequency and multi-scale target solutions. To tackle this problem, we first study a strong Boundary Condition (BC) version of PINNs for Dirichlet BCs and observe a consistent decline in relative error compared to the standard PINNs. We then perform a theoretical analysis based on the Fourier transform and convolution theorem. We find that strong BC PINNs can better learn the amplitudes of high-frequency components of the target solutions. However, constructing the architecture for strong BC PINNs is difficult for many BCs and domain geometries. Enlightened by our theoretical analysis, we propose Fourier PINNs – a simple, general, yet powerful method that augments PINNs with pre-specified, dense Fourier bases. Our proposed architecture likewise learns high-frequency components better but places no restrictions on the particular BCs or problem domains. We develop an adaptive learning and basis selection algorithm via alternating neural net basis optimization, Fourier and neural net basis coefficient estimation, and coefficient truncation. This scheme can flexibly identify the significant frequencies while weakening the nominal frequencies to better capture the target solution’s power spectrum. We show the advantage of our approach through a set of systematic experiments.
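Augmenting a network's input with a pre-specified, dense Fourier basis, as Fourier PINNs do, can be sketched as a feature map; the frequency grid and shapes below are illustrative, and the adaptive basis-selection loop from the paper is not reproduced.

```python
import numpy as np

def fourier_features(x, freqs):
    # Dense, pre-specified Fourier basis: [sin(2*pi*k*x), cos(2*pi*k*x)] per frequency k.
    angles = 2.0 * np.pi * np.outer(x, freqs)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

x = np.linspace(0.0, 1.0, 5)
feats = fourier_features(x, freqs=np.arange(1, 4))  # frequencies 1, 2, 3
assert feats.shape == (5, 6)

# A linear layer over these features can represent high-frequency terms directly,
# e.g. sin(2*pi*3*x) is exactly one column of the feature matrix:
assert np.allclose(feats[:, 2], np.sin(2 * np.pi * 3 * x))
```

Because the high-frequency terms are explicit basis columns, the network no longer has to learn their amplitudes through its own spectral bias.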

[LG-34] Generative Artificial Intelligence for Navigating Synthesizable Chemical Space

链接: https://arxiv.org/abs/2410.03494
作者: Wenhao Gao,Shitong Luo,Connor W. Coley
关键词-EN: generative modeling framework, modeling framework designed, chemical space exploration, generative modeling, modeling framework
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:We introduce SynFormer, a generative modeling framework designed to efficiently explore and navigate synthesizable chemical space. Unlike traditional molecular generation approaches, we generate synthetic pathways for molecules to ensure that designs are synthetically tractable. By incorporating a scalable transformer architecture and a diffusion module for building block selection, SynFormer surpasses existing models in synthesizable molecular design. We demonstrate SynFormer’s effectiveness in two key applications: (1) local chemical space exploration, where the model generates synthesizable analogs of a reference molecule, and (2) global chemical space exploration, where the model aims to identify optimal molecules according to a black-box property prediction oracle. Additionally, we demonstrate the scalability of our approach via the improvement in performance as more computational resources become available. With our code and trained models openly available, we hope that SynFormer will find use across applications in drug discovery and materials science.

[LG-35] A Multimodal Framework for Deepfake Detection

链接: https://arxiv.org/abs/2410.03487
作者: Kashish Gandhi,Prutha Kulkarni,Taran Shah,Piyush Chaudhari,Meera Narvekar,Kranti Ghag
关键词-EN: digital media integrity, deepfake technology poses, rapid advancement, technology poses, poses a significant
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注: 22 pages, 14 figures, Accepted in Journal of Electrical Systems

点击查看摘要

Abstract:The rapid advancement of deepfake technology poses a significant threat to digital media integrity. Deepfakes, synthetic media created using AI, can convincingly alter videos and audio to misrepresent reality. This creates risks of misinformation, fraud, and severe implications for personal privacy and security. Our research addresses the critical issue of deepfakes through an innovative multimodal approach, targeting both visual and auditory elements. This comprehensive strategy recognizes that human perception integrates multiple sensory inputs, particularly visual and auditory information, to form a complete understanding of media content. For visual analysis, a model that employs advanced feature extraction techniques was developed, extracting nine distinct facial characteristics and then applying various machine learning and deep learning models. For auditory analysis, our model leverages mel-spectrogram analysis for feature extraction and then applies various machine learning and deep learning models. To achieve a combined analysis, real and deepfake audio in the original dataset were swapped for testing purposes and balanced samples were ensured. Using our proposed models for video and audio classification, i.e., an Artificial Neural Network and VGG19, the overall sample is classified as deepfake if either component is identified as such. Our multimodal framework combines visual and auditory analyses, yielding an accuracy of 94%.
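The fusion rule stated above, flagging a sample as deepfake if either modality's classifier flags it, reduces to a logical OR; the per-modality classifiers themselves (the ANN and VGG19) are not reproduced here.

```python
def fuse_predictions(video_is_fake: bool, audio_is_fake: bool) -> bool:
    # Per the stated decision rule: the overall sample is classified as a
    # deepfake if either component is identified as such.
    return video_is_fake or audio_is_fake

assert fuse_predictions(False, False) is False
assert fuse_predictions(True, False) is True
assert fuse_predictions(False, True) is True
assert fuse_predictions(True, True) is True
```

This OR fusion trades precision for recall: a single false positive in either modality flags the whole sample.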

[LG-36] VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning

链接: https://arxiv.org/abs/2410.03478
作者: Han Lin,Tushar Nagarajan,Nicolas Ballas,Mido Assran,Mojtaba Komeili,Mohit Bansal,Koustuv Sinha
关键词-EN: active research area, present video input, typically in conjunction, textual annotations, active research
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 10 pages

点击查看摘要

Abstract:Procedural video representation learning is an active research area where the objective is to learn an agent which can anticipate and forecast the future given the present video input, typically in conjunction with textual annotations. Prior works often rely on large-scale pretraining of visual encoders and prediction models with language supervision. However, the necessity and effectiveness of extending compute intensive pretraining to learn video clip sequences with noisy text supervision have not yet been fully validated by previous works. In this work, we show that a strong off-the-shelf frozen pretrained visual encoder, along with a well-designed prediction model, can achieve state-of-the-art (SoTA) performance in forecasting and procedural planning without the need for pretraining the prediction model, nor requiring additional supervision from language or ASR. Instead of learning representations from pixel space, our method utilizes the latent embedding space of publicly available vision encoders. By conditioning on frozen clip-level embeddings from observed steps to predict the actions of unseen steps, our prediction model is able to learn robust representations for forecasting through iterative denoising - leveraging the recent advances in diffusion transformers (Peebles & Xie, 2023). Empirical studies over a total of five procedural learning tasks across four datasets (NIV, CrossTask, COIN and Ego4D-v2) show that our model advances the strong baselines in long-horizon action anticipation (+2.6% in Verb ED@20, +3.1% in Noun ED@20), and significantly improves the SoTA in step forecasting (+5.0%), task classification (+3.8%), and procedure planning tasks (up to +2.28% in success rate, +3.39% in mAcc, and +0.90% in mIoU).

[LG-37] On the Hardness of Learning One Hidden Layer Neural Networks

链接: https://arxiv.org/abs/2410.03477
作者: Shuchen Li,Ilias Zadik,Manolis Zampetakis
关键词-EN: hidden layer ReLU, layer ReLU neural, ReLU neural networks, hidden layer, layer ReLU
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 18 pages

点击查看摘要

Abstract:In this work, we consider the problem of learning one hidden layer ReLU neural networks with inputs from ℝ^d. We show that this learning problem is hard under standard cryptographic assumptions even when: (1) the size of the neural network is polynomial in d, (2) its input distribution is a standard Gaussian, and (3) the noise is Gaussian and polynomially small in d. Our hardness result is based on the hardness of the Continuous Learning with Errors (CLWE) problem, and in particular, is based on the widely believed worst-case hardness of approximately solving the shortest vector problem up to a multiplicative polynomial factor.

[LG-38] Vulnerability Detection via Topological Analysis of Attention Maps

链接: https://arxiv.org/abs/2410.03470
作者: Pavel Snopov,Andrey Nikolaevich Golubinskiy
关键词-EN: gained significant traction, significant traction, Recently, gained significant, deep learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Algebraic Topology (math.AT)
*备注: Accepted to ITaS2024. Contains 8 pages

点击查看摘要

Abstract:Recently, deep learning (DL) approaches to vulnerability detection have gained significant traction. These methods demonstrate promising results, often surpassing traditional static code analysis tools in effectiveness. In this study, we explore a novel approach to vulnerability detection utilizing the tools from topological data analysis (TDA) on the attention matrices of the BERT model. Our findings reveal that traditional machine learning (ML) techniques, when trained on the topological features extracted from these attention matrices, can perform competitively with pre-trained language models (LLMs) such as CodeBERTa. This suggests that TDA tools, including persistent homology, are capable of effectively capturing semantic information critical for identifying vulnerabilities.

[LG-39] S7: Selective and Simplified State Space Layers for Sequence Modeling

链接: https://arxiv.org/abs/2410.03464
作者: Taylan Soydan,Nikola Zubić,Nico Messikommer,Siddhartha Mishra,Davide Scaramuzza
关键词-EN: efficiently handling tasks, extended contexts, central challenge, efficiently handling, Long Range Arena
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Dynamical Systems (math.DS)
*备注: 23 pages, 3 figures, 11 tables. Equal contribution by Taylan Soydan and Nikola Zubić

点击查看摘要

Abstract:A central challenge in sequence modeling is efficiently handling tasks with extended contexts. While recent state-space models (SSMs) have made significant progress in this area, they often lack input-dependent filtering or require substantial increases in model complexity to handle input variability. We address this gap by introducing S7, a simplified yet powerful SSM that can handle input dependence while incorporating stable reparameterization and specific design choices to dynamically adjust state transitions based on input content, maintaining efficiency and performance. We prove that this reparameterization ensures stability in long-sequence modeling by keeping state transitions well-behaved over time. Additionally, it controls the gradient norm, enabling efficient training and preventing issues like exploding or vanishing gradients. S7 significantly outperforms baselines across various sequence modeling tasks, including neuromorphic event-based datasets, Long Range Arena benchmarks, and various physical and biological time series. Overall, S7 offers a more straightforward approach to sequence modeling without relying on complex, domain-specific inductive biases, achieving significant improvements across key benchmarks.
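A generic input-dependent (selective) state-space recurrence, in the spirit of what the abstract describes, can be sketched as below. This is not S7's actual parameterization or its stable reparameterization; the gating form and all names are assumptions made for illustration.

```python
import numpy as np

def selective_ssm(xs, w_a, w_b):
    """Minimal input-dependent (selective) state-space recurrence:
    the decay a(x_t) lies in (0, 1), keeping state transitions well-behaved
    over long sequences (a generic sketch, not S7's parameterization)."""
    h, out = 0.0, []
    for x in xs:
        a = 1.0 / (1.0 + np.exp(-w_a * x))  # input-dependent decay in (0, 1)
        h = a * h + (1.0 - a) * (w_b * x)   # gated, convex state update
        out.append(h)
    return np.array(out)

xs = np.array([1.0, -0.5, 2.0])
ys = selective_ssm(xs, w_a=0.5, w_b=1.0)
assert ys.shape == (3,)
# The convex update keeps the state bounded by the input scale, so
# gradients cannot explode through the recurrence.
assert np.all(np.abs(ys) <= np.max(np.abs(xs)))
```

The convex-combination update is one simple way to guarantee the stability property the abstract emphasizes; the paper proves stability for its own reparameterization.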

[LG-40] Diffusion State-Guided Projected Gradient for Inverse Problems

链接: https://arxiv.org/abs/2410.03463
作者: Rayhan Zirvi,Bahareh Tolooshams,Anima Anandkumar
关键词-EN: Recent advancements, inverse problems, learning data priors, solving inverse problems, diffusion
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: preprint. under review. RZ and BT have equal contributions

点击查看摘要

Abstract:Recent advancements in diffusion models have been effective in learning data priors for solving inverse problems. They leverage diffusion sampling steps for inducing a data prior while using a measurement guidance gradient at each step to impose data consistency. For general inverse problems, approximations are needed when an unconditionally trained diffusion model is used since the measurement likelihood is intractable, leading to inaccurate posterior sampling. In other words, due to their approximations, these methods fail to preserve the generation process on the data manifold defined by the diffusion prior, leading to artifacts in applications such as image restoration. To enhance the performance and robustness of diffusion models in solving inverse problems, we propose Diffusion State-Guided Projected Gradient (DiffStateGrad), which projects the measurement gradient onto a subspace that is a low-rank approximation of an intermediate state of the diffusion process. DiffStateGrad, as a module, can be added to a wide range of diffusion-based inverse solvers to improve the preservation of the diffusion process on the prior manifold and filter out artifact-inducing components. We highlight that DiffStateGrad improves the robustness of diffusion models in terms of the choice of measurement guidance step size and noise while improving the worst-case performance. Finally, we demonstrate that DiffStateGrad improves upon the state-of-the-art on linear and nonlinear image restoration inverse problems.
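The core operation, projecting the measurement-guidance gradient onto a low-rank subspace obtained from an intermediate diffusion state, can be sketched with an SVD. The shapes, rank choice, and exact state used are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def diffstategrad_project(grad, state, rank):
    """Project a measurement-guidance gradient onto the subspace spanned by
    the top singular vectors of an intermediate diffusion state (a sketch of
    the DiffStateGrad idea)."""
    U, _, _ = np.linalg.svd(state, full_matrices=False)
    basis = U[:, :rank]              # low-rank approximation of the state
    return basis @ (basis.T @ grad)  # orthogonal projection of the gradient

rng = np.random.default_rng(1)
state = rng.normal(size=(16, 16))   # stand-in for an intermediate diffusion state
grad = rng.normal(size=(16, 8))     # stand-in for the measurement gradient
proj = diffstategrad_project(grad, state, rank=4)
assert proj.shape == grad.shape

# Orthogonal projection is idempotent: projecting twice changes nothing.
assert np.allclose(diffstategrad_project(proj, state, rank=4), proj)
```

Components of the gradient orthogonal to the low-rank state subspace, which the paper argues induce artifacts, are filtered out by the projection.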

[LG-41] Linear Transformer Topological Masking with Graph Random Features

链接: https://arxiv.org/abs/2410.03462
作者: Isaac Reid,Kumar Avinava Dubey,Deepali Jain,Will Whitney,Amr Ahmed,Joshua Ainslie,Alex Bewley,Mithun Jacob,Aranyak Mehta,David Rendleman,Connor Schenck,Richard E. Turner,René Wagner,Adrian Weller,Krzysztof Choromanski
关键词-EN: incorporating information, training transformers, transformers on graph-structured, underlying topology, topology is crucial
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:When training transformers on graph-structured data, incorporating information about the underlying topology is crucial for good performance. Topological masking, a type of relative position encoding, achieves this by upweighting or downweighting attention depending on the relationship between the query and keys in a graph. In this paper, we propose to parameterise topological masks as a learnable function of a weighted adjacency matrix – a novel, flexible approach which incorporates a strong structural inductive bias. By approximating this mask with graph random features (for which we prove the first known concentration bounds), we show how this can be made fully compatible with linear attention, preserving O(N) time and space complexity with respect to the number of input tokens. The fastest previous alternative was O(N log N) and only suitable for specific graphs. Our efficient masking algorithms provide strong performance gains for tasks on image and point cloud data, including with 30k nodes.
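A quadratic-time sketch of topological masking: attention scores are reweighted by a learnable function of the weighted adjacency matrix, here a small polynomial in adj. The paper's contribution, approximating such masks with graph random features to stay compatible with O(N) linear attention, is not reproduced; the polynomial form and all names are assumptions.

```python
import numpy as np

def topological_mask_attention(Q, K, V, adj, theta):
    """Softmax attention reweighted by a polynomial mask
    M = sum_k theta_k * adj^k of the weighted adjacency matrix."""
    M = sum(t * np.linalg.matrix_power(adj, k) for k, t in enumerate(theta))
    scores = np.exp(Q @ K.T / np.sqrt(Q.shape[1])) * M  # topology up/down-weighting
    scores /= scores.sum(axis=1, keepdims=True)
    return scores @ V

rng = np.random.default_rng(2)
n, d = 5, 3
adj = (rng.random((n, n)) < 0.5).astype(float)
np.fill_diagonal(adj, 1.0)  # self-loops keep every row of the mask nonzero
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = topological_mask_attention(Q, K, V, adj, theta=[1.0, 0.5, 0.25])
assert out.shape == (n, d)
```

The theta coefficients play the role of the learnable mask function; learning them, and making the whole thing linear in N, is what the paper's graph random features enable.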

[LG-42] Auto-GDA: Automatic Domain Adaptation for Efficient Grounding Verification in Retrieval Augmented Generation

链接: https://arxiv.org/abs/2410.03461
作者: Tobias Leemann,Periklis Petridis,Giuseppe Vietri,Dionysis Manousakas,Aaron Roth,Sergul Aydore
关键词-EN: NLI models, large language model, suffer from hallucination, generating incorrect, irrelevant information
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While retrieval augmented generation (RAG) has been shown to enhance factuality of large language model (LLM) outputs, LLMs still suffer from hallucination, generating incorrect or irrelevant information. One common detection strategy involves prompting the LLM again to assess whether its response is grounded in the retrieved evidence, but this approach is costly. Alternatively, lightweight natural language inference (NLI) models for efficient grounding verification can be used at inference time. While existing pre-trained NLI models offer potential solutions, their performance remains subpar compared to larger models on realistic RAG inputs. RAG inputs are more complex than most datasets used for training NLI models and have characteristics specific to the underlying knowledge base, requiring adaptation of the NLI models to a specific target domain. Additionally, the lack of labeled instances in the target domain makes supervised domain adaptation, e.g., through fine-tuning, infeasible. To address these challenges, we introduce Automatic Generative Domain Adaptation (Auto-GDA). Our framework enables unsupervised domain adaptation through synthetic data generation. Unlike previous methods that rely on handcrafted filtering and augmentation strategies, Auto-GDA employs an iterative process to continuously improve the quality of generated samples using weak labels from less efficient teacher models and discrete optimization to select the most promising augmented samples. Experimental results demonstrate the effectiveness of our approach, with models fine-tuned on synthetic data using Auto-GDA often surpassing the performance of the teacher model and reaching the performance level of LLMs at 10% of their computational cost.

[LG-43] Generative Semantic Communication for Text-to-Speech Synthesis

链接: https://arxiv.org/abs/2410.03459
作者: Jiahao Zheng,Jinke Ren,Peng Xu,Zhihao Yuan,Jie Xu,Fangxin Wang,Gui Gui,Shuguang Cui
关键词-EN: improve communication efficiency, promising technology, technology to improve, efficiency by transmitting, Semantic communication
类目: ound (cs.SD); Information Theory (cs.IT); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: The paper has been accepted by IEEE Globecom Workshop

点击查看摘要

Abstract:Semantic communication is a promising technology to improve communication efficiency by transmitting only the semantic information of the source data. However, traditional semantic communication methods primarily focus on data reconstruction tasks, which may not be efficient for emerging generative tasks such as text-to-speech (TTS) synthesis. To address this limitation, this paper develops a novel generative semantic communication framework for TTS synthesis, leveraging generative artificial intelligence technologies. Firstly, we utilize a pre-trained large speech model called WavLM and the residual vector quantization method to construct two semantic knowledge bases (KBs) at the transmitter and receiver, respectively. The KB at the transmitter enables effective semantic extraction, while the KB at the receiver facilitates lifelike speech synthesis. Then, we employ a transformer encoder and a diffusion model to achieve efficient semantic coding without introducing significant communication overhead. Finally, numerical results demonstrate that our framework achieves much higher fidelity for the generated speech than four baselines, in both cases with additive white Gaussian noise channel and Rayleigh fading channel.

[LG-44] MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents

链接: https://arxiv.org/abs/2410.03450
作者: Junpeng Yue,Xinru Xu,Börje F. Karlsson,Zongqing Lu
关键词-EN: retrieving multimodal task-relevant, potential for complex, MLLM, task-relevant trajectory data, complex embodied tasks
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:MLLM agents demonstrate potential for complex embodied tasks by retrieving multimodal task-relevant trajectory data. However, current retrieval methods primarily focus on surface-level similarities of textual or visual cues in trajectories, neglecting their effectiveness for the specific task at hand. To address this issue, we propose a novel method, MLLM as ReTriever (MART), which enhances the performance of embodied agents by utilizing interaction data to fine-tune an MLLM retriever based on preference learning, such that the retriever fully considers the effectiveness of trajectories and prioritize them for unseen tasks. We also introduce Trajectory Abstraction, a mechanism that leverages MLLMs’ summarization capabilities to represent trajectories with fewer tokens while preserving key information, enabling agents to better comprehend milestones in the trajectory. Experimental results across various environments demonstrate our method significantly improves task success rates in unseen scenes compared to baseline methods. This work presents a new paradigm for multimodal retrieval in embodied agents, by fine-tuning a general-purpose MLLM as the retriever to assess trajectory effectiveness. All benchmark task sets and simulator code modifications for action and observation spaces will be released.

[LG-45] Zebra: In-Context and Generative Pretraining for Solving Parametric PDEs

链接: https://arxiv.org/abs/2410.03437
作者: Louis Serrano,Armand Kassaï Koupaï,Thomas X Wang,Pierre Erbacher,Patrick Gallinari
关键词-EN: Solving time-dependent parametric, partial differential equations, Solving time-dependent, time-dependent parametric partial, parametric partial differential
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Solving time-dependent parametric partial differential equations (PDEs) is challenging, as models must adapt to variations in parameters such as coefficients, forcing terms, and boundary conditions. Data-driven neural solvers either train on data sampled from the PDE parameters distribution in the hope that the model generalizes to new instances or rely on gradient-based adaptation and meta-learning to implicitly encode the dynamics from observations. This often comes with increased inference complexity. Inspired by the in-context learning capabilities of large language models (LLMs), we introduce Zebra, a novel generative auto-regressive transformer designed to solve parametric PDEs without requiring gradient adaptation at inference. By leveraging in-context information during both pre-training and inference, Zebra dynamically adapts to new tasks by conditioning on input sequences that incorporate context trajectories or preceding states. This approach enables Zebra to flexibly handle arbitrarily sized context inputs and supports uncertainty quantification through the sampling of multiple solution trajectories. We evaluate Zebra across a variety of challenging PDE scenarios, demonstrating its adaptability, robustness, and superior performance compared to existing approaches.

[LG-46] A General Framework for Producing Interpretable Semantic Text Embeddings

链接: https://arxiv.org/abs/2410.03435
作者: Yiqun Sun,Qiang Huang,Yixuan Tang,Anthony K. H. Tung,Jun Yu
关键词-EN: Natural Language Processing, Language Processing, Natural Language, algo, NLP
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 19 pages, 5 figures, and 9 tables

点击查看摘要

Abstract:Semantic text embedding is essential to many tasks in Natural Language Processing (NLP). While black-box models are capable of generating high-quality embeddings, their lack of interpretability limits their use in tasks that demand transparency. Recent approaches have improved interpretability by leveraging domain-expert-crafted or LLM-generated questions, but these methods rely heavily on expert input or carefully designed prompts, which restricts their generalizability and ability to generate discriminative questions across a wide range of tasks. To address these challenges, we introduce CQG-MBQA (Contrastive Question Generation - Multi-task Binary Question Answering), a general framework for producing interpretable semantic text embeddings across diverse tasks. Our framework systematically generates highly discriminative, low-cognitive-load yes/no questions through the CQG method and answers them efficiently with the MBQA model, resulting in interpretable embeddings in a cost-effective manner. We validate the effectiveness and interpretability of CQG-MBQA through extensive experiments and ablation studies, demonstrating that it delivers embedding quality comparable to many advanced black-box models while remaining inherently interpretable. Additionally, CQG-MBQA outperforms other interpretable text embedding methods across various downstream tasks.
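The embedding idea, each dimension being the answer to one discriminative yes/no question, can be sketched with a toy answering function; the keyword matcher below is purely illustrative and stands in for the trained MBQA model.

```python
def interpretable_embedding(text, questions, answer_fn):
    """Each embedding dimension is the yes-answer (or yes-probability) to one
    question, so every dimension is directly human-readable."""
    return [answer_fn(text, q) for q in questions]

questions = ["Does the text mention sports?", "Does the text mention finance?"]

def toy_answer(text, question):
    # Hypothetical stand-in for the MBQA model: keyword matching only.
    keywords = {"sports": ["game", "team"], "finance": ["stock", "market"]}
    topic = "sports" if "sports" in question else "finance"
    return float(any(k in text.lower() for k in keywords[topic]))

emb = interpretable_embedding("The team won the game.", questions, toy_answer)
assert emb == [1.0, 0.0]
```

Reading the vector back against the question list explains the embedding: this text "mentions sports" but not "finance".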

[LG-47] EB-NeRD: A Large-Scale Dataset for News Recommendation RECSYS’24

链接: https://arxiv.org/abs/2410.03432
作者: Johannes Kruse,Kasper Lindskow,Saikishore Kalloori,Marco Polignano,Claudio Pomo,Abhishek Srivastava,Anshuk Uppal,Michael Riis Andersen,Jes Frellsen
关键词-EN: Personalized content recommendations, Personalized content, social networks, content experience, Ekstra Bladet
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 11 pages, 8 tables, 2 figures, RecSys '24

点击查看摘要

Abstract:Personalized content recommendations have been pivotal to the content experience in digital media from video streaming to social networks. However, several domain specific challenges have held back adoption of recommender systems in news publishing. To address these challenges, we introduce the Ekstra Bladet News Recommendation Dataset (EB-NeRD). The dataset encompasses data from over a million unique users and more than 37 million impression logs from Ekstra Bladet. It also includes a collection of over 125,000 Danish news articles, complete with titles, abstracts, bodies, and metadata, such as categories. EB-NeRD served as the benchmark dataset for the RecSys '24 Challenge, where it was demonstrated how the dataset can be used to address both technical and normative challenges in designing effective and responsible recommender systems for news publishing. The dataset is available at: this https URL.

[LG-48] Cayley Graph Propagation

链接: https://arxiv.org/abs/2410.03424
作者: JJ Wilson,Maya Bechler-Speicher,Petar Veličković
关键词-EN: modelling graph-structured data, graph neural networks, neural networks, graph-structured data, pairs of nodes
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 20 pages, 6 figures

点击查看摘要

Abstract:In spite of the plethora of success stories with graph neural networks (GNNs) on modelling graph-structured data, they are notoriously vulnerable to over-squashing, whereby tasks necessitate the mixing of information between distant pairs of nodes. To address this problem, prior work suggests rewiring the graph structure to improve information flow. Alternatively, a significant body of research has dedicated itself to discovering and precomputing bottleneck-free graph structures to ameliorate over-squashing. One well-regarded family of bottleneck-free graphs within the mathematical community are expander graphs, with prior work, Expander Graph Propagation (EGP), proposing the use of a well-known expander graph family, the Cayley graphs of the SL(2, Z_n) special linear group, as a computational template for GNNs. However, in EGP the computational graphs used are truncated to align with a given input graph. In this work, we show that truncation is detrimental to the coveted expansion properties. Instead, we propose CGP, a method to propagate information over a complete Cayley graph structure, thereby ensuring it is bottleneck-free to better alleviate over-squashing. Our empirical evidence across several real-world datasets not only shows that CGP recovers significant improvements as compared to EGP, but it is also akin to or outperforms computationally complex graph rewiring techniques.
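The Cayley-graph construction referenced above can be reproduced directly. The sketch below builds the Cayley graph of SL(2, Z_n) by BFS from the identity using the standard generators; the generator choice mirrors common practice and may differ from the exact construction used in EGP/CGP.

```python
def matmul_mod(a, b, n):
    # 2x2 matrix product over Z_n, on tuples so vertices are hashable.
    return tuple(tuple(sum(a[i][k] * b[k][j] for k in range(2)) % n
                       for j in range(2)) for i in range(2))

def cayley_sl2(n):
    """BFS over SL(2, Z_n) from the identity with generators
    S = [[1,1],[0,1]] and T = [[1,0],[1,1]], returning an adjacency list."""
    gens = [((1, 1), (0, 1)), ((1, 0), (1, 1))]
    identity = ((1, 0), (0, 1))
    frontier, edges, seen = [identity], {}, {identity}
    while frontier:
        nxt = []
        for v in frontier:
            edges[v] = [matmul_mod(v, g, n) for g in gens]
            for w in edges[v]:
                if w not in seen:
                    seen.add(w)
                    nxt.append(w)
        frontier = nxt
    return edges

graph = cayley_sl2(3)
# |SL(2, Z_p)| = p * (p^2 - 1) for prime p, so 24 vertices for p = 3.
assert len(graph) == 24
```

Because the BFS enumerates the full group, the resulting graph is the complete Cayley graph; EGP's truncation would cut this enumeration short, which is exactly what the paper argues harms expansion.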

[LG-49] Benchmarking the Fidelity and Utility of Synthetic Relational Data

链接: https://arxiv.org/abs/2410.03411
作者: Valter Hudovernik,Martin Jurkovič,Erik Štrumbelj
关键词-EN: Synthesizing relational data, attention from researchers, started to receive, receive more attention, Synthesizing relational
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Synthesizing relational data has started to receive more attention from researchers, practitioners, and industry. The task is more difficult than synthesizing a single table due to the added complexity of relationships between tables. For the same reason, benchmarking methods for synthesizing relational data introduces new challenges. Our work is motivated by a lack of an empirical evaluation of state-of-the-art methods and by gaps in the understanding of how such an evaluation should be done. We review related work on relational data synthesis, common benchmarking datasets, and approaches to measuring the fidelity and utility of synthetic data. We combine the best practices and a novel robust detection approach into a benchmarking tool and use it to compare six methods, including two commercial tools. While some methods are better than others, no method is able to synthesize a dataset that is indistinguishable from original data. For utility, we typically observe moderate correlation between real and synthetic data for both model predictive performance and feature importance.

[LG-50] Predictive Coding for Decision Transformer IROS2024

链接: https://arxiv.org/abs/2410.03408
作者: Tung M. Luu,Donghoon Lee,Chang D. Yoo
关键词-EN: Recent work, return-conditioned supervised learning, demonstrated the effectiveness, effectiveness of formulating, return-conditioned supervised
类目: Machine Learning (cs.LG)
*备注: 8 pages, IROS 2024 (Code: this https URL )

点击查看摘要

Abstract:Recent work in offline reinforcement learning (RL) has demonstrated the effectiveness of formulating decision-making as return-conditioned supervised learning. Notably, the decision transformer (DT) architecture has shown promise across various domains. However, despite its initial success, DTs have underperformed on several challenging datasets in goal-conditioned RL. This limitation stems from the inefficiency of return conditioning for guiding policy learning, particularly in unstructured and suboptimal datasets, resulting in DTs failing to effectively learn temporal compositionality. Moreover, this problem might be further exacerbated in long-horizon sparse-reward tasks. To address this challenge, we propose the Predictive Coding for Decision Transformer (PCDT) framework, which leverages generalized future conditioning to enhance DT methods. PCDT utilizes an architecture that extends the DT framework, conditioned on predictive codings, enabling decision-making based on both past and future factors, thereby improving generalization. Through extensive experiments on eight datasets from the AntMaze and FrankaKitchen environments, our proposed method achieves performance on par with or surpassing existing popular value-based and transformer-based methods in offline goal-conditioned RL. Furthermore, we also evaluate our method on a goal-reaching task with a physical robot.

[LG-51] Distributed Networked Multi-task Learning

链接: https://arxiv.org/abs/2410.03403
作者: Lingzhou Hong,Alfredo Garcia
关键词-EN: correlated data streams, distributed multi-task learning, multiple linear model, multi-task learning scheme, data streams
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider a distributed multi-task learning scheme that accounts for multiple linear model estimation tasks with heterogeneous and/or correlated data streams. We assume that nodes can be partitioned into groups corresponding to different learning tasks and communicate according to a directed network topology. Each node estimates a linear model asynchronously and is subject to local (within-group) regularization and global (across groups) regularization terms targeting noise reduction and generalization performance improvement respectively. We provide a finite-time characterization of convergence of the estimators and task relation and illustrate the scheme’s general applicability in two examples: random field temperature estimation and modeling student performance from different academic districts.
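The interplay of the two regularization terms can be sketched as a per-node objective (notation ours, not taken from the paper; \lambda_{\mathrm{loc}}, \lambda_{\mathrm{glob}}, and \Omega are placeholders for the within-group and across-group penalties):

```latex
\min_{\theta_i}\;
\underbrace{\big\| y_i - X_i \theta_i \big\|_2^2}_{\text{local data fit}}
\;+\; \lambda_{\mathrm{loc}} \sum_{j \in \mathcal{G}(i)} \big\| \theta_i - \theta_j \big\|_2^2
\;+\; \lambda_{\mathrm{glob}}\, \Omega\big(\theta_i,\, \{\theta_k\}_{k \notin \mathcal{G}(i)}\big)
```

Here \mathcal{G}(i) denotes the group (learning task) of node i: the middle term pulls same-task estimates together for noise reduction, while the last term couples different tasks for generalization.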

[LG-52] EBES: Easy Benchmarking for Event Sequences

链接: https://arxiv.org/abs/2410.03399
作者: Dmitry Osin,Igor Udovichenko,Viktor Moskvoretskii,Egor Shvetsov,Evgeny Burnaev
关键词-EN: user interaction logs, irregular sampling intervals, common data structures, Event sequences, characterized by irregular
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Event sequences, characterized by irregular sampling intervals and a mix of categorical and numerical features, are common data structures in various real-world domains such as healthcare, finance, and user interaction logs. Despite advances in temporal data modeling techniques, there are no standardized benchmarks for evaluating their performance on event sequences. This complicates result comparison across different papers due to varying evaluation protocols, potentially misleading progress in this field. We introduce EBES, a comprehensive benchmarking tool with standardized evaluation scenarios and protocols, focusing on regression and classification problems with sequence-level targets. Our library simplifies benchmarking, dataset addition, and method integration through a unified interface. It includes a novel synthetic dataset and provides preprocessed real-world datasets, including the largest publicly available banking dataset. Our results provide an in-depth analysis of datasets, identifying some as unsuitable for model comparison. We investigate the importance of modeling temporal and sequential components, as well as the robustness and scaling properties of the models. These findings highlight potential directions for future research. Our benchmark aims to facilitate reproducible research, expediting progress and increasing real-world impact.

[LG-53] GraphCroc: Cross-Correlation Autoencoder for Graph Structural Reconstruction NEURIPS2024

链接: https://arxiv.org/abs/2410.03396
作者: Shijin Duan,Ruyi Ding,Jiaxing He,Aidong Adam Ding,Yunsi Fei,Xiaolin Xu
关键词-EN: Graph-structured data, prompting the development, data is integral, Graph-structured, graph
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 22 pages, 16 figures. Accepted in NeurIPS 2024

点击查看摘要

Abstract:Graph-structured data is integral to many applications, prompting the development of various graph representation methods. Graph autoencoders (GAEs), in particular, reconstruct graph structures from node embeddings. Current GAE models primarily utilize self-correlation to represent graph structures and focus on node-level tasks, often overlooking multi-graph scenarios. Our theoretical analysis indicates that self-correlation generally falls short in accurately representing specific graph features such as islands, symmetrical structures, and directional edges, particularly in smaller or multiple graph contexts. To address these limitations, we introduce a cross-correlation mechanism that significantly enhances the GAE representational capabilities. Additionally, we propose GraphCroc, a new GAE that supports flexible encoder architectures tailored for various downstream tasks and ensures robust structural reconstruction, through a mirrored encoding-decoding process. This model also tackles the challenge of representation bias during optimization by implementing a loss-balancing strategy. Both theoretical analysis and numerical evaluations demonstrate that our methodology significantly outperforms existing self-correlation-based GAEs in graph structure reconstruction.
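The symmetry limitation of self-correlation decoders is easy to see in a few lines. The sketch below (ours, not the GraphCroc implementation) decodes adjacency logits from node embeddings both ways: a self-correlation decoder sigma(ZZ^T) is necessarily symmetric, so it cannot represent directional edges, while a cross-correlation decoder sigma(QK^T) with two embedding sets can:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Self-correlation decoder: one embedding Z per node, A ~ sigma(Z Z^T).
Z = rng.standard_normal((4, 8))
A_self = sigmoid(Z @ Z.T)
assert np.allclose(A_self, A_self.T)        # always symmetric: no directed edges

# Cross-correlation decoder: separate Q and K embeddings, A ~ sigma(Q K^T).
Q, K = rng.standard_normal((4, 8)), rng.standard_normal((4, 8))
A_cross = sigmoid(Q @ K.T)
assert not np.allclose(A_cross, A_cross.T)  # can express asymmetric structure
```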

[LG-54] Lightning UQ Box: A Comprehensive Framework for Uncertainty Quantification in Deep Learning

链接: https://arxiv.org/abs/2410.03390
作者: Nils Lehmann,Jakob Gawlikowski,Adam J. Stewart,Vytautas Jancauskas,Stefan Depeweg,Eric Nalisnick,Nina Maria Gottschling
关键词-EN: deep neural networks, Uncertainty quantification, applying deep neural, real world tasks, neural networks
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 10 pages, 8 figures

点击查看摘要

Abstract:Uncertainty quantification (UQ) is an essential tool for applying deep neural networks (DNNs) to real world tasks, as it attaches a degree of confidence to DNN outputs. However, despite its benefits, UQ is often left out of the standard DNN workflow due to the additional technical knowledge required to apply and evaluate existing UQ procedures. Hence there is a need for a comprehensive toolbox that allows the user to integrate UQ into their modelling workflow, without significant overhead. We introduce Lightning UQ Box: a unified interface for applying and evaluating various approaches to UQ. In this paper, we provide a theoretical and quantitative comparison of the wide range of state-of-the-art UQ methods implemented in our toolbox. We focus on two challenging vision tasks: (i) estimating tropical cyclone wind speeds from infrared satellite imagery and (ii) estimating the power output of solar panels from RGB images of the sky. By highlighting the differences between methods, our results demonstrate the need for a broad and approachable experimental framework for UQ that can be used for benchmarking UQ methods. The toolbox, example implementations, and further information are available at: this https URL

[LG-55] From Epilepsy Seizures Classification to Detection: A Deep Learning-based Approach for Raw EEG Signals

链接: https://arxiv.org/abs/2410.03385
作者: Davy Darankoum,Manon Villalba,Clelia Allioux,Baptiste Caraballo,Carine Dumont,Eloise Gronlier,Corinne Roucard,Yann Roche,Chloe Habermacher,Sergei Grudinin,Julien Volle
关键词-EN: prevalent neurological disease, raw EEG signals, Epilepsy represents, prevalent neurological, neurological disease
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 25 pages, 7 tables, 4 figures

点击查看摘要

Abstract:Epilepsy represents the most prevalent neurological disease in the world. One-third of people suffering from mesial temporal lobe epilepsy (MTLE) exhibit drug resistance, urging the need to develop new treatments. A key part of anti-seizure medication (ASM) development is the capability of detecting and quantifying epileptic seizures occurring in electroencephalogram (EEG) signals, which is crucial for treatment efficacy evaluation. In this study, we introduced a seizure detection pipeline based on deep learning models applied to raw EEG signals. This pipeline integrates: a new pre-processing technique which segments continuous raw EEG signals without prior distinction between seizure and seizure-free activities; a post-processing algorithm developed to reassemble EEG segments and allow the identification of seizure start/end times; and finally, a new evaluation procedure based on a strict comparison of seizure events between predicted and real labels. Model training was performed using a data-splitting strategy which addresses the potential for data leakage. We demonstrated the fundamental differences between a seizure classification and a seizure detection task and showed the differences in performance between the two tasks. Finally, we demonstrated the generalization capabilities across species of our best architecture, combining a Convolutional Neural Network and a Transformer encoder. The model was trained on animal EEGs and tested on human EEGs, with an F1-score of 93% on a balanced Bonn dataset.
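The pre-processing step — cutting a continuous recording into fixed-length segments with label-agnostic boundaries — can be sketched as follows (window length, stride, and sampling rate here are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def segment(signal, win, stride):
    """Slice a continuous 1-D recording into overlapping fixed-length windows,
    without any prior distinction between seizure and seizure-free spans."""
    starts = range(0, len(signal) - win + 1, stride)
    return np.stack([signal[s:s + win] for s in starts])

fs = 256                                                  # assumed sampling rate (Hz)
eeg = np.random.default_rng(0).standard_normal(10 * fs)   # 10 s of synthetic raw EEG
windows = segment(eeg, win=2 * fs, stride=fs)             # 2 s windows, 1 s hop
assert windows.shape == (9, 512)
```

A post-processing pass would then map per-window predictions back to start/end times by reassembling consecutive positive windows.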

[LG-56] Predicting perturbation targets with causal differential networks

链接: https://arxiv.org/abs/2410.03380
作者: Menghua Wu,Umesh Padia,Sean H. Murphy,Regina Barzilay,Tommi Jaakkola
关键词-EN: Rationally identifying variables, enable myriad applications, Rationally identifying, identifying variables responsible, cell engineering
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Rationally identifying variables responsible for changes to a biological system can enable myriad applications in disease understanding and cell engineering. From a causality perspective, we are given two datasets generated by the same causal model, one observational (control) and one interventional (perturbed). The goal is to isolate the subset of measured variables (e.g. genes) that were the targets of the intervention, i.e. those whose conditional independencies have changed. Knowing the causal graph would limit the search space, allowing us to efficiently pinpoint these variables. However, current algorithms that infer causal graphs in the presence of unknown intervention targets scale poorly to the hundreds or thousands of variables in biological data, as they must jointly search the combinatorial spaces of graphs and consistent intervention targets. In this work, we propose a causality-inspired approach for predicting perturbation targets that decouples the two search steps. First, we use an amortized causal discovery model to separately infer causal graphs from the observational and interventional datasets. Then, we learn to map these paired graphs to the sets of variables that were intervened upon, in a supervised learning framework. This approach consistently outperforms baselines for perturbation modeling on seven single-cell transcriptomics datasets, each with thousands of measured variables. We also demonstrate significant improvements over six causal discovery algorithms in predicting intervention targets across a variety of tractable, synthetic datasets.
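As a naive baseline for the core idea (the paper instead learns this mapping with a supervised model over amortized causal-discovery outputs), the candidate targets can be read off directly as the nodes whose parent sets differ between the two inferred graphs:

```python
import numpy as np

def changed_targets(A_obs, A_int):
    """Nodes whose incoming edges differ between the observational and
    interventional adjacency matrices (column j = parents of node j)."""
    diff = A_obs != A_int
    return np.where(diff.any(axis=0))[0]

# Toy 4-node chain DAG 0 -> 1 -> 2 -> 3; intervening on node 2 severs edge 1 -> 2.
A_obs = np.array([[0, 1, 0, 0],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1],
                  [0, 0, 0, 0]])
A_int = A_obs.copy()
A_int[1, 2] = 0
assert list(changed_targets(A_obs, A_int)) == [2]
```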

[LG-57] Mitigating Adversarial Perturbations for Deep Reinforcement Learning via Vector Quantization IROS2024

链接: https://arxiv.org/abs/2410.03376
作者: Tung M. Luu,Thanh Nguyen,Tee Joshua Tian Jin,Sungwoon Kim,Chang D. Yoo
关键词-EN: Recent studies reveal, well-performing reinforcement learning, Recent studies, reinforcement learning, perturbations during deployment
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 8 pages, IROS 2024 (Code: this https URL )

点击查看摘要

Abstract:Recent studies reveal that well-performing reinforcement learning (RL) agents in training often lack resilience against adversarial perturbations during deployment. This highlights the importance of building a robust agent before deploying it in the real world. Most prior works focus on developing robust training-based procedures to tackle this problem, including enhancing the robustness of the deep neural network component itself or adversarially training the agent on strong attacks. In this work, we instead study an input transformation-based defense for RL. Specifically, we propose using a variant of vector quantization (VQ) as a transformation for input observations, which is then used to reduce the space of adversarial attacks during testing, resulting in the transformed observations being less affected by attacks. Our method is computationally efficient and seamlessly integrates with adversarial training, further enhancing the robustness of RL agents against adversarial attacks. Through extensive experiments in multiple environments, we demonstrate that using VQ as the input transformation effectively defends against adversarial attacks on the agent’s observations.
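The core intuition — that nearest-codebook quantization absorbs small perturbations of the observation before they reach the policy — can be sketched in a few lines (codebook and observation values are toy numbers, not the paper's learned setup):

```python
import numpy as np

def vq_transform(obs, codebook):
    """Map each observation vector to its nearest codebook entry (L2 distance)."""
    d = np.linalg.norm(obs[:, None, :] - codebook[None, :, :], axis=-1)
    return codebook[np.argmin(d, axis=1)]

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
clean = np.array([[0.1, -0.1]])
attacked = clean + 0.15   # small adversarial perturbation of the observation

q_clean = vq_transform(clean, codebook)
q_attacked = vq_transform(attacked, codebook)
# Both observations snap to the same codebook entry: the perturbation is erased.
assert np.array_equal(q_clean, q_attacked)
```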

[LG-58] Make Interval Bound Propagation great again

链接: https://arxiv.org/abs/2410.03373
作者: Patryk Krukowski,Daniel Wilczak,Jacek Tabor,Anna Bielawska,Przemysław Spurek
关键词-EN: medical data analysis, robust deep networks, autonomous driving, real life, data analysis
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In various scenarios motivated by real life, such as medical data analysis, autonomous driving, and adversarial training, we are interested in robust deep networks. A network is robust when a relatively small perturbation of the input cannot lead to drastic changes in output (like change of class, etc.). This falls under the broader field of Neural Network Certification (NNC). Two crucial problems in NNC are of profound interest to the scientific community: how to calculate the robustness of a given pre-trained network and how to construct robust networks. The common approach to constructing robust networks is Interval Bound Propagation (IBP). This paper demonstrates that IBP is sub-optimal in the first case due to its susceptibility to the wrapping effect. Even for linear activation, IBP gives strongly sub-optimal bounds. Consequently, one should use strategies immune to the wrapping effect to obtain bounds close to optimal ones. We adapt two classical approaches dedicated to strict computations – Doubleton Arithmetic and Affine Arithmetic – to mitigate the wrapping effect in neural networks. These techniques yield precise results for networks with linear activation functions, thus resisting the wrapping effect. As a result, we achieve bounds significantly closer to the optimal level than IBP.
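For reference, interval propagation through a single affine layer is exact, while composing it layer by layer already over-approximates — precisely the wrapping effect the paper targets. A minimal numpy sketch (ours, not the paper's code):

```python
import numpy as np

def ibp_linear(l, u, W, b):
    """Tightest axis-aligned bounds of y = W x + b over the box [l, u]
    (exact for a single affine layer)."""
    c, r = (l + u) / 2.0, (u - l) / 2.0
    yc = W @ c + b
    yr = np.abs(W) @ r
    return yc - yr, yc + yr

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((3, 2)), rng.standard_normal((2, 3))
b1, b2 = np.zeros(3), np.zeros(2)
l, u = -np.ones(2), np.ones(2)

# Layer-by-layer IBP: the intermediate box "wraps" the true image.
l1, u1 = ibp_linear(l, u, W1, b1)
l2, u2 = ibp_linear(l1, u1, W2, b2)

# Exact bounds for the composed affine map W2 W1 x + (W2 b1 + b2).
le, ue = ibp_linear(l, u, W2 @ W1, W2 @ b1 + b2)

# The IBP box always contains (and here strictly exceeds) the exact box.
assert np.all(l2 <= le + 1e-9) and np.all(u2 >= ue - 1e-9)
```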

[LG-59] Latent Abstractions in Generative Diffusion Models

链接: https://arxiv.org/abs/2410.03368
作者: Giulio Franzese,Mattia Martini,Giulio Corallo,Paolo Papotti,Pietro Michiardi
关键词-EN: produce high-dimensional data, models produce high-dimensional, generative models produce, diffusion-based generative models, high-dimensional data
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In this work we study how diffusion-based generative models produce high-dimensional data, such as an image, by implicitly relying on a manifestation of a low-dimensional set of latent abstractions, that guide the generative process. We present a novel theoretical framework that extends NLF, and that offers a unique perspective on SDE-based generative models. The development of our theory relies on a novel formulation of the joint (state and measurement) dynamics, and an information-theoretic measure of the influence of the system state on the measurement process. According to our theory, diffusion models can be cast as a system of SDE, describing a non-linear filter in which the evolution of unobservable latent abstractions steers the dynamics of an observable measurement process (corresponding to the generative pathways). In addition, we present an empirical study to validate our theory and previous empirical results on the emergence of latent abstractions at different stages of the generative process.

[LG-60] Error Correction Code Transformer: From Non-Unified to Unified

链接: https://arxiv.org/abs/2410.03364
作者: Yongli Yan,Jieao Zhu,Tianyue Zheng,Jiaqi He,Linglong Dai
关键词-EN: reliable data transmission, Channel coding, error correction codes, emergence of sixth-generation, coding is vital
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Channel coding is vital for reliable data transmission in modern wireless systems, and its significance will increase with the emergence of sixth-generation (6G) networks, which will need to support various error correction codes. However, traditional decoders were typically designed as fixed hardware circuits tailored to specific decoding algorithms, leading to inefficiencies and limited flexibility. To address these challenges, this paper proposes a unified, code-agnostic Transformer-based decoding architecture capable of handling multiple linear block codes, including Polar, Low-Density Parity-Check (LDPC), and Bose-Chaudhuri-Hocquenghem (BCH), within a single framework. To achieve this, standardized units are employed to harmonize parameters across different code types, while the redesigned unified attention module compresses the structural information of various codewords. Additionally, a sparse mask, derived from the sparsity of the parity-check matrix, is introduced to enhance the model’s ability to capture inherent constraints between information and parity-check bits, resulting in improved decoding accuracy and robustness. Extensive experimental results demonstrate that the proposed unified Transformer-based decoder not only outperforms existing methods but also provides a flexible, efficient, and high-performance solution for next-generation wireless communication systems.
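One plausible way to derive such a sparse mask (our reading, not necessarily the paper's exact construction) is to let two bits attend to each other iff they participate in at least one common parity check, i.e., the nonzero pattern of H^T H:

```python
import numpy as np

def attention_mask(H):
    """Boolean attention mask: bits i and j may attend iff they share a parity check."""
    shared = (H.T @ H) > 0          # nonzero pattern of H^T H
    np.fill_diagonal(shared, True)  # a bit always attends to itself
    return shared

# Parity-check matrix of the (7,4) Hamming code.
H = np.array([[1, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])
mask = attention_mask(H)
assert mask.shape == (7, 7)
assert mask[0, 2] and not mask[0, 1]  # bits 0,2 share check row 0; bits 0,1 share none
```

The zero entries of the mask are exactly the positions with no shared constraint, so masked attention never mixes structurally unrelated bits.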

[LG-61] Dolphin: A Programmable Framework for Scalable Neurosymbolic Learning

链接: https://arxiv.org/abs/2410.03348
作者: Aaditya Naik,Jason Liu,Claire Wang,Saikat Dutta,Mayur Naik,Eric Wong
关键词-EN: symbolic programs, promising paradigm, paradigm to incorporate, incorporate symbolic reasoning, deep learning models
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neurosymbolic learning has emerged as a promising paradigm to incorporate symbolic reasoning into deep learning models. However, existing frameworks are limited in scalability with respect to both the training data and the complexity of symbolic programs. We propose Dolphin, a framework to scale neurosymbolic learning at a fundamental level by mapping both forward chaining and backward gradient propagation in symbolic programs to vectorized computations. For this purpose, Dolphin introduces a set of abstractions and primitives built directly on top of a high-performance deep learning framework like PyTorch, effectively enabling symbolic programs to be written as PyTorch modules. It thereby enables neurosymbolic programs to be written in a language like Python that is familiar to developers and compile them to computation graphs that are amenable to end-to-end differentiation on GPUs. We evaluate Dolphin on a suite of 13 benchmarks across 5 neurosymbolic tasks that combine deep learning models for text, image, or video processing with symbolic programs that involve multi-hop reasoning, recursion, and even black-box functions like Python eval(). Dolphin only takes 0.33%-37.17% of the time (and 2.77% on average) to train these models on the largest input per task compared to baselines Scallop, ISED, and IndeCateR+, which time out on most of these inputs. Models written in Dolphin also achieve state-of-the-art accuracies even on the largest benchmarks.

[LG-62] Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition

链接: https://arxiv.org/abs/2410.03335
作者: Zixuan Wang,Yu-Wing Tai,Chi-Keung Tang
关键词-EN: editing and composition, composition based, audio, text, Large Language Model
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:We introduce Audio-Agent, a multimodal framework for audio generation, editing and composition based on text or video inputs. Conventional approaches for text-to-audio (TTA) tasks often make single-pass inferences from text descriptions. While straightforward, this design struggles to produce high-quality audio when given complex text conditions. In our method, we utilize a pre-trained TTA diffusion network as the audio generation agent to work in tandem with GPT-4, which decomposes the text condition into atomic, specific instructions, and calls the agent for audio generation. Consequently, Audio-Agent generates high-quality audio that is closely aligned with the provided text or video while also supporting variable-length generation. For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with generated audio, a process that can be tedious and time-consuming. We propose a simpler approach by fine-tuning a pre-trained Large Language Model (LLM), e.g., Gemma2-2B-it, to obtain both semantic and temporal conditions to bridge video and audio modality. Thus our framework provides a comprehensive solution for both TTA and VTA tasks without substantial computational overhead in training.

[LG-63] Influence-oriented Personalized Federated Learning

链接: https://arxiv.org/abs/2410.03315
作者: Yue Tan,Guodong Long,Jing Jiang,Chengqi Zhang
关键词-EN: Traditional federated learning, Traditional federated, neglecting the mutual, rely on fixed, fixed weighting
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Traditional federated learning (FL) methods often rely on fixed weighting for parameter aggregation, neglecting the mutual influence among clients. Hence, their effectiveness in heterogeneous data contexts is limited. To address this problem, we propose an influence-oriented federated learning framework, namely FedC^2I, which quantitatively measures Client-level and Class-level Influence to realize adaptive parameter aggregation for each client. Our core idea is to explicitly model the inter-client influence within an FL system via the well-crafted influence vector and influence matrix. The influence vector quantifies client-level influence, enables clients to selectively acquire knowledge from others, and guides the aggregation of feature representation layers. Meanwhile, the influence matrix captures class-level influence in a more fine-grained manner to achieve personalized classifier aggregation. We evaluate the performance of FedC^2I against existing federated learning methods under non-IID settings and the results demonstrate the superiority of our method.
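The client-level half of the idea reduces to a convex combination of client parameters weighted by an influence vector. A toy sketch (the influence values are made up, and the real method additionally performs class-level classifier aggregation via the influence matrix):

```python
import numpy as np

def aggregate(params, influence):
    """Influence-weighted aggregation of one layer's parameters across clients:
    sum_k w_k * theta_k with w = influence normalized to a convex combination."""
    w = influence / influence.sum()
    return np.tensordot(w, params, axes=1)

# Three clients, each holding a 2x2 layer; client 0 is judged most influential.
params = np.stack([np.full((2, 2), v) for v in (1.0, 2.0, 3.0)])
influence = np.array([0.6, 0.3, 0.1])
agg = aggregate(params, influence)
assert np.allclose(agg, 1.5)   # 0.6*1 + 0.3*2 + 0.1*3
```

Fixed-weight FedAvg is the special case where `influence` is proportional to client dataset sizes and never adapts.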

[LG-64] Quo Vadis Motion Generation? From Large Language Models to Large Motion Models

链接: https://arxiv.org/abs/2410.03311
作者: Ye Wang,Sipeng Zheng,Bin Cao,Qianshan Wei,Qin Jin,Zongqing Lu
关键词-EN: human motion understanding, large motion models, success of LLMs, motion, recent success
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Inspired by the recent success of LLMs, the field of human motion understanding has increasingly shifted towards the development of large motion models. Despite some progress, current state-of-the-art works remain far from achieving truly generalist models, largely due to the lack of large-scale, high-quality motion data. To address this, we present MotionBase, the first million-level motion generation benchmark, offering 15 times the data volume of the previous largest dataset, and featuring multimodal data with hierarchically detailed text descriptions. By leveraging this vast dataset, our large motion model demonstrates strong performance across a broad range of motions, including unseen ones. Through systematic investigation, we underscore the importance of scaling both data and model size, with synthetic data and pseudo labels playing a crucial role in mitigating data acquisition costs. Moreover, our research reveals the limitations of existing evaluation metrics, particularly in handling out-of-domain text instructions – an issue that has long been overlooked. In addition to these, we introduce a novel 2D lookup-free approach for motion tokenization, which preserves motion information and expands codebook capacity, further enhancing the representative ability of large motion models. The release of MotionBase and the insights gained from this study are expected to pave the way for the development of more powerful and versatile motion generation models.

[LG-65] Selective Test-Time Adaptation for Unsupervised Anomaly Detection using Neural Implicit Representations MICCAI

链接: https://arxiv.org/abs/2410.03306
作者: Sameer Ambekar,Julia A. Schnabel,Cosmin Bereca
关键词-EN: clinical settings unseen, Deep learning models, medical imaging, imaging often encounter, encounter challenges
类目: Machine Learning (cs.LG)
*备注: Accepted at MICCAIw ADSMI

点击查看摘要

Abstract:Deep learning models in medical imaging often encounter challenges when adapting to new clinical settings unseen during training. Test-time adaptation offers a promising approach to optimize models for these unseen domains, yet its application in anomaly detection (AD) remains largely unexplored. AD aims to efficiently identify deviations from normative distributions; however, full adaptation, including pathological shifts, may inadvertently learn the anomalies it intends to detect. We introduce a novel concept of *selective* test-time adaptation that utilizes the inherent characteristics of deep pre-trained features to adapt *selectively* in a zero-shot manner to any test image from an unseen domain. This approach employs a model-agnostic, lightweight multi-layer perceptron for neural implicit representations, enabling the adaptation of outputs from any reconstruction-based AD method without altering the source-trained model. Rigorous validation in brain AD demonstrated that our strategy substantially enhances detection accuracy for multiple conditions and different target distributions. Specifically, our method improves the detection rates by up to 78% for enlarged ventricles and 24% for edemas.

[LG-66] SELU: Self-Learning Embodied MLLMs in Unknown Environments

链接: https://arxiv.org/abs/2410.03303
作者: Boyu Li,Haobin Jiang,Ziluo Ding,Xinrun Xu,Haoran Li,Dongbin Zhao,Zongqing Lu
关键词-EN: multimodal large language, large language models, demonstrated strong visual, strong visual understanding, autonomously improving MLLMs
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recently, multimodal large language models (MLLMs) have demonstrated strong visual understanding and decision-making capabilities, enabling the exploration of autonomously improving MLLMs in unknown environments. However, external feedback like human or environmental feedback is not always available. To address this challenge, existing methods primarily focus on enhancing the decision-making capabilities of MLLMs through voting and scoring mechanisms, while little attention has been paid to improving the environmental comprehension of MLLMs in unknown environments. To fully unleash the self-learning potential of MLLMs, we propose a novel actor-critic self-learning paradigm, dubbed SELU, inspired by the actor-critic paradigm in reinforcement learning. The critic employs self-asking and hindsight relabeling to extract knowledge from interaction trajectories collected by the actor, thereby augmenting its environmental comprehension. Simultaneously, the actor is improved by the self-feedback provided by the critic, enhancing its decision-making. We evaluate our method in the AI2-THOR and VirtualHome environments, and SELU achieves critic improvements of approximately 28% and 30%, and actor improvements of about 20% and 24% via self-learning.

[LG-67] Resource-aware Mixed-precision Quantization for Enhancing Deployability of Transformers for Time-series Forecasting on Embedded FPGAs

链接: https://arxiv.org/abs/2410.03294
作者: Tianheng Ling,Chao Qian,Gregor Schiele
关键词-EN: integer-only quantized Transformers, resource-constrained embedded FPGAs, Neural Architecture Search, study addresses, challenges of integer-only
类目: Machine Learning (cs.LG)
*备注: Accepted by the 21st EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services (MobiQuitous2024). 20 pages, 8 figures, 6 tables

点击查看摘要

Abstract:This study addresses the deployment challenges of integer-only quantized Transformers on resource-constrained embedded FPGAs (Xilinx Spartan-7 XC7S15). We enhanced the flexibility of our VHDL template by introducing a selectable resource type for storing intermediate results across model layers, thereby breaking the deployment bottleneck by utilizing BRAM efficiently. Moreover, we developed a resource-aware mixed-precision quantization approach that enables researchers to explore hardware-level quantization strategies without requiring extensive expertise in Neural Architecture Search. This method provides accurate resource utilization estimates with a precision discrepancy as low as 3%, compared to actual deployment metrics. Compared to previous work, our approach has successfully facilitated the deployment of model configurations utilizing mixed-precision quantization, thus overcoming the limitations inherent in five previously non-deployable configurations with uniform quantization bitwidths. Consequently, this research enhances the applicability of Transformers in embedded systems, facilitating a broader range of Transformer-powered applications on edge devices.
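The accuracy/resource trade-off behind mixed precision is visible even in a plain per-tensor symmetric quantizer. The sketch below (generic, not the paper's VHDL pipeline) compares the rounding error of 8-bit and 4-bit integer quantization of the same tensor:

```python
import numpy as np

def quantize(x, bits):
    """Symmetric per-tensor quantization of x to signed `bits`-bit integers.
    Returns the integer codes and the dequantization scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

x = np.array([-1.0, -0.5, 0.0, 0.4, 1.0])
q8, s8 = quantize(x, 8)
q4, s4 = quantize(x, 4)
err8 = np.abs(q8 * s8 - x).max()
err4 = np.abs(q4 * s4 - x).max()
assert err8 <= err4    # more bits -> no larger rounding error
assert err8 < 0.01     # 8-bit error is already tiny on this tensor
```

A mixed-precision strategy assigns a bitwidth per layer, spending the wider formats only where the error (or the BRAM budget) demands them.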

[LG-68] Five Years of COVID-19 Discourse on Instagram: A Labeled Instagram Dataset of Over Half a Million Posts for Multilingual Sentiment Analysis

链接: https://arxiv.org/abs/2410.03293
作者: Nirmalya Thakur
关键词-EN: Instagram, Instagram posts, sentiment, makes three scientific, scientific contributions
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:The work presented in this paper makes three scientific contributions with a specific focus on mining and analysis of COVID-19-related posts on Instagram. First, it presents a multilingual dataset of 500,153 Instagram posts about COVID-19 published between January 2020 and September 2024. This dataset, available at this https URL, contains Instagram posts in 161 different languages as well as 535,021 distinct hashtags. After the development of this dataset, multilingual sentiment analysis was performed, which involved classifying each post as positive, negative, or neutral. The results of sentiment analysis are presented as a separate attribute in this dataset. Second, it presents the results of performing sentiment analysis per year from 2020 to 2024. The findings revealed the trends in sentiment related to COVID-19 on Instagram since the beginning of the pandemic. For instance, between 2020 and 2024, the sentiment trends show a notable shift, with positive sentiment decreasing from 38.35% to 28.69%, while neutral sentiment rose from 44.19% to 58.34%. Finally, the paper also presents findings of language-specific sentiment analysis. This analysis highlighted similar and contrasting trends of sentiment across posts published in different languages on Instagram. For instance, out of all English posts, 49.68% were positive, 14.84% were negative, and 35.48% were neutral. In contrast, among Hindi posts, 4.40% were positive, 57.04% were negative, and 38.56% were neutral, reflecting distinct differences in the sentiment distribution between these two languages.

[LG-69] Demystifying the Token Dynamics of Deep Selective State Space Models

链接: https://arxiv.org/abs/2410.03292
作者: Thieu N Vo,Tung D. Pham,Xin T. Tong,Tan Minh Nguyen
关键词-EN: modeling sequential data, Selective state space, selective SSM remains, deep selective SSM, state space models
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Selective state space models (SSM), such as Mamba, have gained prominence for their effectiveness in modeling sequential data. Despite their outstanding empirical performance, a comprehensive theoretical understanding of deep selective SSM remains elusive, hindering their further development and adoption for applications that need high fidelity. In this paper, we investigate the dynamical properties of tokens in a pre-trained Mamba model. In particular, we derive the dynamical system governing the continuous-time limit of the Mamba model and characterize the asymptotic behavior of its solutions. In the one-dimensional case, we prove that only one of the following two scenarios happens: either all tokens converge to zero, or all tokens diverge to infinity. We provide criteria based on model parameters to determine when each scenario occurs. For the convergent scenario, we empirically verify that this scenario negatively impacts the model’s performance. For the divergent scenario, we prove that different tokens will diverge to infinity at different rates, thereby contributing unequally to the updates during model training. Based on these investigations, we propose two refinements for the model: excluding the convergent scenario and reordering tokens based on their importance scores, both aimed at improving practical performance. Our experimental results validate these refinements, offering insights into enhancing Mamba’s effectiveness in real-world applications.
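The all-converge-or-all-diverge dichotomy can be illustrated with the simplest caricature of one-dimensional continuous-time dynamics, dh/dt = a·h (this toy recurrence is ours and ignores Mamba's input-dependent parameterization; only the sign of a matters):

```python
import numpy as np

def evolve(a, h0, steps=200, dt=0.05):
    """Euler-integrate dh/dt = a*h for every token value in h0."""
    h = np.asarray(h0, dtype=float)
    for _ in range(steps):
        h = h + dt * a * h
    return h

tokens = np.array([0.5, -1.0, 2.0])
# a < 0: every token decays toward zero; a > 0: every token blows up,
# with larger-magnitude tokens diverging faster (unequal contributions).
assert np.all(np.abs(evolve(-1.0, tokens)) < 1e-3)
assert np.all(np.abs(evolve(+1.0, tokens)) > 1e3)
```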

[LG-70] Enhanced Transformer architecture for in-context learning of dynamical systems

链接: https://arxiv.org/abs/2410.03291
作者: Matteo Rufolo,Dario Piga,Gabriele Maroni,Marco Forgione
关键词-EN: in-context identification paradigm, identification paradigm aims, Recently introduced, aims at estimating, offline and based
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Recently introduced by some of the authors, the in-context identification paradigm aims at estimating, offline and based on synthetic data, a meta-model that describes the behavior of a whole class of systems. Once trained, this meta-model is fed with an observed input/output sequence (context) generated by a real system to predict its behavior in a zero-shot learning fashion. In this paper, we enhance the original meta-modeling framework through three key innovations: by formulating the learning task within a probabilistic framework; by managing non-contiguous context and query windows; and by adopting recurrent patching to effectively handle long context sequences. The efficacy of these modifications is demonstrated through a numerical example focusing on the Wiener-Hammerstein system class, highlighting the model’s enhanced performance and scalability.

[LG-71] uniINF: Best-of-Both-Worlds Algorithm for Parameter-Free Heavy-Tailed MABs

链接: https://arxiv.org/abs/2410.03284
作者: Yu Chen,Jiatai Huang,Yan Dai,Longbo Huang
关键词-EN: Heavy-Tailed Multi-Armed Bandits, Multi-Armed Bandits, adversarial environments, demonstrating robustness, robustness and adaptability
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we present a novel algorithm, uniINF, for the Heavy-Tailed Multi-Armed Bandits (HTMAB) problem, demonstrating robustness and adaptability in both stochastic and adversarial environments. Unlike the stochastic MAB setting where loss distributions are stationary with time, our study extends to the adversarial setup, where losses are generated from heavy-tailed distributions that depend on both arms and time. Our novel algorithm uniINF enjoys the so-called Best-of-Both-Worlds (BoBW) property, performing optimally in both stochastic and adversarial environments without knowing the exact environment type. Moreover, our algorithm also possesses a Parameter-Free feature, i.e., it operates without needing to know the heavy-tail parameters (\sigma, \alpha) a priori. To be precise, uniINF ensures nearly-optimal regret in both stochastic and adversarial environments, matching the corresponding lower bounds when (\sigma, \alpha) is known (up to logarithmic factors). To our knowledge, uniINF is the first parameter-free algorithm to achieve the BoBW property for the heavy-tailed MAB problem. Technically, we develop innovative techniques to achieve BoBW guarantees for Parameter-Free HTMABs, including a refined analysis for the dynamics of log-barrier, an auto-balancing learning rate scheduling scheme, an adaptive skipping-clipping loss tuning technique, and a stopping-time analysis for logarithmic regret.

[LG-72] Neural Sampling from Boltzmann Densities: Fisher-Rao Curves in the Wasserstein Geometry

链接: https://arxiv.org/abs/2410.03282
作者: Jannis Chemseddine,Christian Wald,Richard Duong,Gabriele Steidl
关键词-EN: unnormalized Boltzmann density, unnormalized Boltzmann, Boltzmann curve, Boltzmann density, Boltzmann
类目: Machine Learning (cs.LG); Analysis of PDEs (math.AP); Probability (math.PR)
*备注:

点击查看摘要

Abstract:We deal with the task of sampling from an unnormalized Boltzmann density \rho_D by learning a Boltzmann curve given by energies f_t starting in a simple density \rho_Z . First, we examine conditions under which Fisher-Rao flows are absolutely continuous in the Wasserstein geometry. Second, we address specific interpolations f_t and the learning of the related density/velocity pairs (\rho_t,v_t) . It was numerically observed that the linear interpolation, which requires only a parametrization of the velocity field v_t , suffers from a “teleportation-of-mass” issue. Using tools from the Wasserstein geometry, we give an analytical example, where we can precisely measure the explosion of the velocity field. Inspired by Máté and Fleuret, who parametrize both f_t and v_t , we propose an interpolation which parametrizes only f_t and fixes an appropriate v_t . This corresponds to the Wasserstein gradient flow of the Kullback-Leibler divergence related to Langevin dynamics. We demonstrate by numerical examples that our model provides a well-behaved flow field which successfully solves the above sampling task.
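
In notation sketched below (illustrative, following standard conventions for Boltzmann-curve samplers rather than the paper's exact symbols), the curve interpolates the energies and each learned density/velocity pair must satisfy the continuity equation:

```latex
% Energies interpolating from the simple density rho_Z to the target rho_D,
% e.g. linearly: f_t = (1 - t) f_Z + t f_D, with rho_t \propto e^{-f_t}.
% The learned velocity field v_t must transport rho_t along the curve,
% i.e. satisfy the continuity equation
\partial_t \rho_t + \nabla \cdot (\rho_t v_t) = 0 .
```

Under the linear interpolation this constraint can force very large velocities where mass must "teleport" between modes, which is the issue the paper quantifies analytically.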

[LG-73] BN-SCAFFOLD: controlling the drift of Batch Normalization statistics in Federated Learning

链接: https://arxiv.org/abs/2410.03281
作者: Gonzalo Iñaki Quintana,Laurence Vancamberg,Vincent Jugnon,Mathilde Mougeot,Agnès Desolneux
关键词-EN: training Machine Learning, Machine Learning, Deep Neural Networks, learning paradigm, Learning
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) is gaining traction as a learning paradigm for training Machine Learning (ML) models in a decentralized way. Batch Normalization (BN) is ubiquitous in Deep Neural Networks (DNN), as it improves convergence and generalization. However, BN has been reported to hinder performance of DNNs in heterogeneous FL. Recently, the FedTAN algorithm has been proposed to mitigate the effect of heterogeneity on BN by aggregating BN statistics and gradients from all the clients. However, it has a high communication cost, that increases linearly with the depth of the DNN. SCAFFOLD is a variance reduction algorithm, that estimates and corrects the client drift in a communication-efficient manner. Despite its promising results in heterogeneous FL settings, it has been reported to underperform for models with BN. In this work, we seek to revive SCAFFOLD, and more generally variance reduction, as an efficient way of training DNN with BN in heterogeneous FL. We introduce a unified theoretical framework for analyzing the convergence of variance reduction algorithms in the BN-DNN setting, inspired by the work of Wang et al. 2023, and show that SCAFFOLD is unable to remove the bias introduced by BN. We thus propose the BN-SCAFFOLD algorithm, which extends the client drift correction of SCAFFOLD to BN statistics. We prove convergence using the aforementioned framework and validate the theoretical results with experiments on MNIST and CIFAR-10. BN-SCAFFOLD equals the performance of FedTAN, without its high communication cost, outperforming Federated Averaging (FedAvg), SCAFFOLD, and other FL algorithms designed to mitigate BN heterogeneity.
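
The SCAFFOLD drift correction that BN-SCAFFOLD builds on can be sketched on scalar quadratic clients. This is a toy setup of our own: `scaffold`, the client optima `b`, and all hyperparameters are illustrative, and no BN statistics are involved here; it only shows how control variates cancel the drift caused by heterogeneous clients.

```python
import numpy as np

# Minimal SCAFFOLD sketch on scalar quadratic clients f_i(x) = 0.5*(x - b_i)^2.
# Control variates c_i correct the client drift caused by heterogeneous b_i,
# so the server model converges to the global optimum mean(b).
def scaffold(b, rounds=200, local_steps=10, lr=0.1):
    n = len(b)
    x = 0.0                      # server model
    c = 0.0                      # server control variate
    ci = np.zeros(n)             # client control variates
    for _ in range(rounds):
        new_x, new_ci = [], []
        for i in range(n):
            y = x
            for _ in range(local_steps):
                grad = y - b[i]               # gradient of 0.5*(y - b_i)^2
                y -= lr * (grad - ci[i] + c)  # drift-corrected local step
            # "Option II" control-variate update from the SCAFFOLD paper
            ci_new = ci[i] - c + (x - y) / (local_steps * lr)
            new_x.append(y)
            new_ci.append(ci_new)
        x = float(np.mean(new_x))
        c += float(np.mean(np.array(new_ci) - ci))
        ci = np.array(new_ci)
    return x

# Heterogeneous clients: optima at 0 and 10; the global optimum is 5.
x_final = scaffold([0.0, 10.0])
```

Without the `- ci[i] + c` correction, many local steps would pull each client toward its own optimum and FedAvg-style averaging would converge more slowly or oscillate; the correction makes the server iterate contract toward the global optimum every round.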

[LG-74] Sm: enhanced localization in Multiple Instance Learning for medical imaging classification NEURIPS2024

链接: https://arxiv.org/abs/2410.03276
作者: Francisco M. Castro-Macías,Pablo Morales-Álvarez,Yunan Wu,Rafael Molina,Aggelos K. Katsaggelos
关键词-EN: Multiple Instance Learning, Multiple Instance, Instance Learning, medical imaging classification, labeling effort
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 24 pages, 14 figures, 2024 Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:Multiple Instance Learning (MIL) is widely used in medical imaging classification to reduce the labeling effort. While only bag labels are available for training, one typically seeks predictions at both bag and instance levels (classification and localization tasks, respectively). Early MIL methods treated the instances in a bag independently. Recent methods account for global and local dependencies among instances. Although they have yielded excellent results in classification, their performance in terms of localization is comparatively limited. We argue that these models have been designed to target the classification task, while implications at the instance level have not been deeply investigated. Motivated by a simple observation – that neighboring instances are likely to have the same label – we propose a novel, principled, and flexible mechanism to model local dependencies. It can be used alone or combined with any mechanism to model global dependencies (e.g., transformers). A thorough empirical validation shows that our module leads to state-of-the-art performance in localization while being competitive or superior in classification. Our code is at this https URL.

[LG-75] Test-time Adaptation for Regression by Subspace Alignment

链接: https://arxiv.org/abs/2410.03263
作者: Kazuki Adachi,Shin’ya Yamaguchi,Atsutoshi Kumagai,Tomoki Hamagami
关键词-EN: investigates test-time adaptation, paper investigates test-time, unlabeled target data, regression model pre-trained, existing TTA methods
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper investigates test-time adaptation (TTA) for regression, where a regression model pre-trained in a source domain is adapted to an unknown target distribution with unlabeled target data. Although regression is one of the fundamental tasks in machine learning, most of the existing TTA methods have classification-specific designs, which assume that models output class-categorical predictions, whereas regression models typically output only single scalar values. To enable TTA for regression, we adopt a feature alignment approach, which aligns the feature distributions between the source and target domains to mitigate the domain gap. However, we found that naive feature alignment employed in existing TTA methods for classification is ineffective or even worse for regression because the features are distributed in a small subspace and many of the raw feature dimensions have little significance to the output. For an effective feature alignment in TTA for regression, we propose Significant-subspace Alignment (SSA). SSA consists of two components: subspace detection and dimension weighting. Subspace detection finds the feature subspace that is representative and significant to the output. Then, the feature alignment is performed in the subspace during TTA. Meanwhile, dimension weighting raises the importance of the dimensions of the feature subspace that have greater significance to the output. We experimentally show that SSA outperforms various baselines on real-world datasets.
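
A minimal numpy sketch of the two components, subspace detection via PCA on source features and a variance-weighted alignment loss inside that subspace, is given below. All names and the synthetic data are our own illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 2
basis = rng.normal(size=(k, d))
src = rng.normal(size=(500, k)) @ basis          # source features, rank-k in d-D
tgt = (rng.normal(size=(500, k)) + 1.0) @ basis  # mean-shifted target features

# Subspace detection: PCA on centered source features finds the small
# subspace that actually carries the signal.
mu_s = src.mean(axis=0)
cov = np.cov((src - mu_s).T)
evals, evecs = np.linalg.eigh(cov)
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]
explained = evals[:k].sum() / evals.sum()        # top-k captures ~all variance

# Dimension weighting + alignment: measure the domain gap only inside the
# significant subspace, weighting directions by their explained variance.
w = evals[:k] / evals[:k].sum()
proj = evecs[:, :k]
gap = (tgt.mean(axis=0) - mu_s) @ proj           # mean shift inside subspace
loss = float(np.sum(w * gap ** 2))               # minimized during adaptation
```

Aligning all 16 raw dimensions would waste capacity on the 14 near-zero-variance directions; restricting the loss to the detected subspace is the core intuition behind SSA.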

[LG-76] How much can we forget about Data Contamination?

链接: https://arxiv.org/abs/2410.03249
作者: Sebastian Bordt,Suraj Srinivas,Valentyn Boreiko,Ulrike von Luxburg
关键词-EN: large language models, evaluating the capabilities, capabilities of large, large language, significant challenge
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:The leakage of benchmark data into the training data has emerged as a significant challenge for evaluating the capabilities of large language models (LLMs). In this work, we use experimental evidence and theoretical estimates to challenge the common assumption that small-scale contamination renders benchmark evaluations invalid. First, we experimentally quantify the magnitude of benchmark overfitting based on scaling along three dimensions: The number of model parameters (up to 1.6B), the number of times an example is seen (up to 144), and the number of training tokens (up to 40B). We find that if model and data follow the Chinchilla scaling laws, minor contamination indeed leads to overfitting. At the same time, even contamination repeated 144 times can be forgotten if the training data is scaled beyond five times Chinchilla, a regime characteristic of many modern LLMs. We then derive a simple theory of example forgetting via cumulative weight decay. It allows us to bound the number of gradient steps required to forget past data for any training run where we know the hyperparameters of AdamW. This indicates that many LLMs, including Llama 3, have forgotten the data seen at the beginning of training. Experimentally, we demonstrate that forgetting occurs faster than what is predicted by our bounds. Taken together, our results suggest that moderate amounts of contamination can be forgotten at the end of realistically scaled training runs.
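
The cumulative-weight-decay argument can be reproduced as a back-of-the-envelope calculation. With AdamW, each step multiplies the existing weights by (1 - lr * weight_decay), so a contribution written into the weights early in training is exponentially damped by the steps that follow. The hyperparameters below are illustrative, not those of any specific training run.

```python
# Fraction of an early weight contribution that survives `steps` further
# AdamW updates with decoupled weight decay (illustrative numbers).
def residual_fraction(lr, weight_decay, steps):
    return (1.0 - lr * weight_decay) ** steps

# E.g. lr=3e-4, wd=0.1, half a million remaining steps:
frac = residual_fraction(lr=3e-4, weight_decay=0.1, steps=500_000)
```

Here the damping factor is roughly exp(-lr * wd * steps) = exp(-15), i.e. the early contribution is attenuated to well below one millionth, which is the mechanism behind the paper's forgetting bounds.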

[LG-77] CUDLE: Learning Under Label Scarcity to Detect Cannabis Use in Uncontrolled Environments

链接: https://arxiv.org/abs/2410.03211
作者: Reza Rahimi Azghan,Nicholas C. Glodosky,Ramesh Kumar Sah,Carrie Cuttler,Ryan McLaughlin,Michael J. Cleveland,Hassan Ghasemzadeh
关键词-EN: support behavioral interventions, Wearable sensor systems, potential for real-time, objective monitoring, behavioral interventions
类目: Machine Learning (cs.LG)
*备注: 8 pages, 5 figures, 1 table

点击查看摘要

Abstract:Wearable sensor systems have demonstrated a great potential for real-time, objective monitoring of physiological health to support behavioral interventions. However, obtaining accurate labels in free-living environments remains difficult due to limited human supervision and the reliance on self-labeling by patients, making data collection and supervised learning particularly challenging. To address this issue, we introduce CUDLE (Cannabis Use Detection with Label Efficiency), a novel framework that leverages self-supervised learning with real-world wearable sensor data to tackle a pressing healthcare challenge: the automatic detection of cannabis consumption in free-living environments. CUDLE identifies cannabis consumption moments using sensor-derived data through a contrastive learning framework. It first learns robust representations via a self-supervised pretext task with data augmentation. These representations are then fine-tuned in a downstream task with a shallow classifier, enabling CUDLE to outperform traditional supervised methods, especially with limited labeled data. To evaluate our approach, we conducted a clinical study with 20 cannabis users, collecting over 500 hours of wearable sensor data alongside user-reported cannabis use moments through EMA (Ecological Momentary Assessment) methods. Our extensive analysis using the collected data shows that CUDLE achieves a higher accuracy of 73.4%, compared to 71.1% for the supervised approach, with the performance gap widening as the number of labels decreases. Notably, CUDLE not only surpasses the supervised model while using 75% fewer labels, but also reaches peak performance with far fewer subjects.

[LG-78] Tadashi: Enabling AI-Based Automated Code Generation With Guaranteed Correctness

链接: https://arxiv.org/abs/2410.03210
作者: Emil Vatai,Aleksandr Drozd,Ivan R. Ivanov,Yinghao Ren,Mohamed Wahib
关键词-EN: human experts developing, place rigorous methods, Frameworks and DSLs, DSLs auto-generating code, DSLs auto-generating
类目: Machine Learning (cs.LG)
*备注: Submitted to CGO

点击查看摘要

Abstract:Frameworks and DSLs that auto-generate code have traditionally relied on the human experts developing them to put rigorous methods in place that assure the legality of the applied code transformations. Machine Learning (ML) is gaining wider adoption as a means to auto-generate code optimised for the hardware target. However, ML solutions, and in particular black-box DNNs, provide no such guarantees on legality. In this paper we propose a library, Tadashi, which leverages the polyhedral model to empower researchers seeking to curate datasets crucial for applying ML in code-generation. Tadashi provides the ability to reliably and practically check the legality of candidate transformations on polyhedral schedules applied on a baseline reference code. We provide a proof that our library guarantees the legality of generated transformations, and demonstrate its lightweight practical cost. Tadashi is available at this https URL.

[LG-79] SPHINX: Structural Prediction using Hypergraph Inference Network

链接: https://arxiv.org/abs/2410.03208
作者: Iulia Duta,Pietro Liò
关键词-EN: real-world systems, relations is widely, widely recognized, large number, number of real-world
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The importance of higher-order relations is widely recognized in a large number of real-world systems. However, annotating them is a tedious and sometimes impossible task. Consequently, current approaches for data modelling either ignore the higher-order interactions altogether or simplify them into pairwise connections. In order to facilitate higher-order processing, even when a hypergraph structure is not available, we introduce Structural Prediction using Hypergraph Inference Network (SPHINX), a model that learns to infer a latent hypergraph structure in an unsupervised way, solely from the final node-level signal. The model consists of a soft, differentiable clustering method used to sequentially predict, for each hyperedge, the probability distribution over the nodes and a sampling algorithm that converts them into an explicit hypergraph structure. We show that the recent advancement in k-subset sampling represents a suitable tool for producing discrete hypergraph structures, addressing some of the training instabilities exhibited by prior works. The resulting model can generate the higher-order structure necessary for any modern hypergraph neural network, facilitating the capture of higher-order interaction in domains where annotating them is difficult. Through extensive ablation studies and experiments conducted on two challenging datasets for trajectory prediction, we demonstrate that our model is capable of inferring suitable latent hypergraphs, that are interpretable and enhance the final performance.

[LG-80] Learning Semantic Structure through First-Order-Logic Translation EMNLP2024

链接: https://arxiv.org/abs/2410.03203
作者: Akshay Chaturvedi,Nicholas Asher
关键词-EN: simple sentences, transformer-based language models, study whether transformer-based, extract predicate argument, language models
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: EMNLP 2024 Findings

点击查看摘要

Abstract:In this paper, we study whether transformer-based language models can extract predicate argument structure from simple sentences. We first show that language models sometimes confuse which predicates apply to which objects. To mitigate this, we explore two tasks: question answering (Q/A), and first order logic (FOL) translation, and two regimes, prompting and finetuning. In FOL translation, we finetune several large language models on synthetic datasets designed to gauge their generalization abilities. For Q/A, we finetune encoder models like BERT and RoBERTa and use prompting for LLMs. The results show that FOL translation for LLMs is better suited to learn predicate argument structure.

[LG-81] Learning test generators for cyber-physical systems

链接: https://arxiv.org/abs/2410.03202
作者: Jarkko Peltomäki,Ivan Porres
关键词-EN: Black-box runtime verification, Black-box runtime, test generators, temporal logic, discover errors
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: 34 pages, 4 figures, 7 tables

点击查看摘要

Abstract:Black-box runtime verification methods for cyber-physical systems can be used to discover errors in systems whose inputs and outputs are expressed as signals over time and their correctness requirements are specified in a temporal logic. Existing methods, such as requirement falsification, often focus on finding a single input that is a counterexample to system correctness. In this paper, we study how to create test generators that can produce multiple and diverse counterexamples for a single requirement. Several counterexamples expose system failures in varying input conditions and support the root cause analysis of the faults. We present the WOGAN algorithm to create such test generators automatically. The algorithm works by training iteratively a Wasserstein generative adversarial network that models the target distribution of the uniform distribution on the set of counterexamples. WOGAN is an algorithm that trains generative models that act as test generators for runtime verification. The training is performed online without the need for a previous model or dataset. We also propose criteria to evaluate such test generators. We evaluate the trained generators on several well-known problems including the ARCH-COMP falsification benchmarks. Our experimental results indicate that generators trained by the WOGAN algorithm are as effective as state-of-the-art requirement falsification algorithms while producing tests that are as diverse as a sample from uniform random sampling. We conclude that WOGAN is a viable method to produce test generators automatically and that these test generators can generate multiple and diverse counterexamples for the runtime verification of cyber-physical systems. 

[LG-82] EXAQ: Exponent Aware Quantization For LLMs Acceleration

链接: https://arxiv.org/abs/2410.03185
作者: Moran Shkolnik,Maxim Fishman,Brian Chmiel,Hilla Ben-Yaacov,Ron Banner,Kfir Yehuda Levy
关键词-EN: Large Language Models, Language Models, Large Language, decreasing the computational, computational and storage
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
*备注:

点击查看摘要

Abstract:Quantization has established itself as the primary approach for decreasing the computational and storage expenses associated with Large Language Models (LLMs) inference. The majority of current research emphasizes quantizing weights and activations to enable low-bit general-matrix-multiply (GEMM) operations, with the remaining non-linear operations executed at higher precision. In our study, we discovered that following the application of these techniques, the primary bottleneck in LLMs inference lies in the softmax layer. The softmax operation comprises three phases: exponent calculation, accumulation, and normalization. Our work focuses on optimizing the first two phases. We propose an analytical approach to determine the optimal clipping value for the input to the softmax function, enabling sub-4-bit quantization for LLMs inference. This method accelerates the calculations of both e^x and \sum(e^x) with minimal to no accuracy degradation. For example, in LLaMA1-30B, we achieve baseline performance with 2-bit quantization on the well-known “Physical Interaction: Question Answering” (PIQA) dataset evaluation. This ultra-low bit quantization allows, for the first time, an acceleration of approximately 4x in the accumulation phase. The combination of accelerating both e^x and \sum(e^x) results in a 36.9% acceleration in the softmax operation.
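
The clipping idea rests on a standard identity: softmax(x) = softmax(x - max(x)), so the shifted inputs are all non-positive, and very negative values contribute almost nothing to the exponent accumulation and can be clipped before quantization. The sketch below uses an illustrative clip value of -8 and a 4-bit uniform quantizer chosen by us, not the paper's analytically derived optimum.

```python
import numpy as np

def softmax(x):
    z = x - x.max()          # numerically stable reference softmax
    e = np.exp(z)
    return e / e.sum()

def clipped_quantized_softmax(x, clip=-8.0, bits=4):
    # Shift so inputs are <= 0, then clip the tail that barely matters.
    z = np.clip(x - x.max(), clip, 0.0)
    # Uniform quantization of [clip, 0] to 2^bits levels.
    levels = 2 ** bits - 1
    zq = np.round(z / clip * levels) / levels * clip
    e = np.exp(zq)
    return e / e.sum()

x = np.array([1.0, 2.5, -0.5, 4.0])
err = np.abs(softmax(x) - clipped_quantized_softmax(x)).max()
```

Even with this crude 4-bit quantizer the per-probability error stays small, which illustrates why the exponent input tolerates aggressive low-bit quantization once an appropriate clipping value is chosen.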

[LG-83] Rapid optimization in high dimensional space by deep kernel learning augmented genetic algorithms

链接: https://arxiv.org/abs/2410.03173
作者: Mani Valleti,Aditya Raghavan,Sergei V. Kalinin
关键词-EN: supply chain management, Exploration of complex, complex high-dimensional spaces, high-dimensional spaces presents, spaces presents significant
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Computational Physics (physics.comp-ph); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 17 pages, 5 figures

点击查看摘要

Abstract:Exploration of complex high-dimensional spaces presents significant challenges in fields such as molecular discovery, process optimization, and supply chain management. Genetic Algorithms (GAs), while offering significant power for creating new candidate spaces, often entail high computational demands due to the need for evaluation of each new proposed solution. On the other hand, Deep Kernel Learning (DKL) efficiently navigates the spaces of preselected candidate structures but lacks generative capabilities. This study introduces an approach that amalgamates the generative power of GAs to create new candidates with the efficiency of DKL-based surrogate models to rapidly ascertain the behavior of new candidate spaces. This DKL-GA framework can be further used to build Bayesian Optimization (BO) workflows. We demonstrate the effectiveness of this approach through the optimization of the FerroSIM model, showcasing its broad applicability to diverse challenges, including molecular discovery and battery charging optimization.

[LG-84] Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach

链接: https://arxiv.org/abs/2410.03160
作者: Yaofang Liu,Yumeng Ren,Xiaodong Cun,Aitor Artola,Yang Liu,Tieyong Zeng,Raymond H. Chan,Jean-michel Morel
关键词-EN: revolutionized image generation, Diffusion models, shown promise, video diffusion models, revolutionized image
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Code at this https URL

点击查看摘要

Abstract:Diffusion models have revolutionized image generation, and their extension to video generation has shown promise. However, current video diffusion models (VDMs) rely on a scalar timestep variable applied at the clip level, which limits their ability to model complex temporal dependencies needed for various tasks like image-to-video generation. To address this limitation, we propose a frame-aware video diffusion model (FVDM), which introduces a novel vectorized timestep variable (VTV). Unlike conventional VDMs, our approach allows each frame to follow an independent noise schedule, enhancing the model’s capacity to capture fine-grained temporal dependencies. FVDM’s flexibility is demonstrated across multiple tasks, including standard video generation, image-to-video generation, video interpolation, and long video synthesis. Through a diverse set of VTV configurations, we achieve superior quality in generated videos, overcoming challenges such as catastrophic forgetting during fine-tuning and limited generalizability in zero-shot settings. Our empirical evaluations show that FVDM outperforms state-of-the-art methods in video generation quality, while also excelling in extended tasks. By addressing fundamental shortcomings in existing VDMs, FVDM sets a new paradigm in video synthesis, offering a robust framework with significant implications for generative modeling and multimedia applications.
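
The core change, scalar versus vectorized timesteps, is easy to picture with a training-time sampling sketch (variable names and the timestep range are illustrative):

```python
import numpy as np

# Scalar vs. vectorized diffusion timesteps for one training clip:
# a conventional VDM draws a single timestep for the whole clip, while
# FVDM-style training draws an independent timestep per frame.
rng = np.random.default_rng(0)
frames, T = 8, 1000

t_scalar = np.full(frames, rng.integers(T))  # same t repeated for every frame
t_vector = rng.integers(T, size=frames)      # independent t per frame (VTV)
```

With a per-frame timestep vector, each frame can sit at a different point of its noise schedule, which is what lets one model cover interpolation and image-to-video as special VTV configurations.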

[LG-85] Autoregressive Moving-average Attention Mechanism for Time Series Forecasting

链接: https://arxiv.org/abs/2410.03159
作者: Jiecheng Lu,Xu Han,Yan Sun,Shihao Yang
关键词-EN: autoregressive Transformer model, autoregressive attention mechanisms, decoder-only autoregressive Transformer, time series forecasting, enhancing their ability
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We propose an Autoregressive (AR) Moving-average (MA) attention structure that can adapt to various linear attention mechanisms, enhancing their ability to capture long-range and local temporal patterns in time series. In this paper, we first demonstrate that, for the time series forecasting (TSF) task, the previously overlooked decoder-only autoregressive Transformer model can achieve results comparable to the best baselines when appropriate tokenization and training methods are applied. Moreover, inspired by the ARMA model from statistics and recent advances in linear attention, we introduce the full ARMA structure into existing autoregressive attention mechanisms. By using an indirect MA weight generation method, we incorporate the MA term while maintaining the time complexity and parameter size of the underlying efficient attention models. We further explore how indirect parameter generation can produce implicit MA weights that align with the modeling requirements for local temporal impacts. Experimental results show that incorporating the ARMA structure consistently improves the performance of various AR attentions on TSF tasks, achieving state-of-the-art results.
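
The ARMA idea in its classical form is shown below (illustrative only; the paper embeds this structure into linear attention rather than running a literal ARMA filter). The AR coefficient carries long-range state from output to output, while the MA coefficient mixes the most recent inputs locally.

```python
import numpy as np

# Classical ARMA(1,1) recurrence: y_t = phi * y_{t-1} + x_t + theta * x_{t-1}.
# phi (AR) propagates long-range information; theta (MA) models local impacts.
def arma11(x, phi=0.9, theta=0.5):
    y = np.zeros_like(x)
    prev_x = 0.0
    for t, xt in enumerate(x):
        prev_y = y[t - 1] if t > 0 else 0.0
        y[t] = phi * prev_y + xt + theta * prev_x
        prev_x = xt
    return y

# Impulse response: the AR term lets a single input echo far into the future.
x = np.array([1.0, 0.0, 0.0, 0.0])
out = arma11(x)
```

The impulse response is 1.0, 1.4, 1.26, 1.134, ...: the extra bump at step 1 comes from the MA term, after which the AR term decays the signal geometrically.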

[LG-86] Mathematical Formalism for Memory Compression in Selective State Space Models

链接: https://arxiv.org/abs/2410.03158
作者: Siddhanth Bhat
关键词-EN: modelling long-range dependencies, sequence modelling, State space models, hidden state, sequence data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
*备注: 27 Pages

点击查看摘要

Abstract:State space models (SSMs) have emerged as a powerful framework for modelling long-range dependencies in sequence data. Unlike traditional recurrent neural networks (RNNs) and convolutional neural networks (CNNs), SSMs offer a structured and stable approach to sequence modelling, leveraging principles from control theory and dynamical systems. However, a key challenge in sequence modelling is compressing long-term dependencies into a compact hidden state representation without losing critical information. In this paper, we develop a rigorous mathematical framework for understanding memory compression in selective state space models. We introduce a selective gating mechanism that dynamically filters and updates the hidden state based on input relevance, allowing for efficient memory compression. We formalize the trade-off between memory efficiency and information retention using information-theoretic tools, such as mutual information and rate-distortion theory. Our analysis provides theoretical bounds on the amount of information that can be compressed without sacrificing model performance. We also derive theorems that prove the stability and convergence of the hidden state in selective SSMs, ensuring reliable long-term memory retention. Computational complexity analysis reveals that selective SSMs offer significant improvements in memory efficiency and processing speed compared to traditional RNN-based models. Through empirical validation on sequence modelling tasks such as time-series forecasting and natural language processing, we demonstrate that selective SSMs achieve state-of-the-art performance while using less memory and computational resources. 
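
A toy scalar version of such a selective gate is shown below. All functional forms and constants here are our own illustration, not the paper's formalism: the point is only that an input-dependent gate keeps the compressed hidden state nearly unchanged for irrelevant inputs and overwrites it for salient ones.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Selective update: h <- (1 - g(x)) * h + g(x) * x, with an input-dependent
# gate g that opens only for salient (large-magnitude) inputs.
def selective_update(h, x, w_gate=4.0, bias=-2.0):
    g = sigmoid(w_gate * abs(x) + bias)
    return (1.0 - g) * h + g * x, g

h = 0.0
h, g_small = selective_update(h, 0.01)  # near-zero input: gate nearly closed
h_after_small = h                       # state barely moves
h, g_big = selective_update(h, 1.0)     # salient input: gate nearly open
```

Filtering at the gate is what allows the hidden state to stay compact: information-free inputs consume almost none of the state's capacity, which is the trade-off the paper formalizes with rate-distortion tools.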

[LG-87] MELODI: Exploring Memory Compression for Long Contexts

链接: https://arxiv.org/abs/2410.03156
作者: Yinpeng Chen,DeLesley Hutchins,Aren Jansen,Andrey Zhmoginov,David Racz,Jesper Andersen
关键词-EN: efficiently process long, process long documents, memory architecture designed, present MELODI, short context windows
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We present MELODI, a novel memory architecture designed to efficiently process long documents using short context windows. The key principle behind MELODI is to represent short-term and long-term memory as a hierarchical compression scheme across both network layers and context windows. Specifically, the short-term memory is achieved through recurrent compression of context windows across multiple layers, ensuring smooth transitions between windows. In contrast, the long-term memory performs further compression within a single middle layer and aggregates information across context windows, effectively consolidating crucial information from the entire history. Compared to a strong baseline - the Memorizing Transformer employing dense attention over a large long-term memory (64K key-value pairs) - our method demonstrates superior performance on various long-context datasets while remarkably reducing the memory footprint by a factor of 8.

[LG-88] Machine Learning for Asymptomatic Ratoon Stunting Disease Detection With Freely Available Satellite Based Multispectral Imaging

链接: https://arxiv.org/abs/2410.03141
作者: Ethan Kane Waters,Carla Chia-ming Chen,Mostafa Rahimi Azghadi
关键词-EN: Ratoon Stunting Disease, Ratoon Stunting, effective crop management, asymptomatic infectious diseases, Stunting Disease
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: 13 pages, 1 figure and 2 tables (main text), 1 figure and 3 tables (appendices). Submitted to “Computers and Electronics in Agriculture”

点击查看摘要

Abstract:Disease detection in sugarcane, particularly the identification of asymptomatic infectious diseases such as Ratoon Stunting Disease (RSD), is critical for effective crop management. This study employed various machine learning techniques to detect the presence of RSD in different sugarcane varieties, using vegetation indices derived from freely available satellite-based spectral data. Our results show that the Support Vector Machine with a Radial Basis Function Kernel (SVM-RBF) was the most effective algorithm, achieving classification accuracy between 85.64% and 96.55%, depending on the variety. Gradient Boosting and Random Forest also demonstrated high performance, achieving accuracy between 83.33% and 96.55%, while Logistic Regression and Quadratic Discriminant Analysis showed variable results across different varieties. The inclusion of sugarcane variety and vegetation indices was important in the detection of RSD, in agreement with what has been identified in the current literature. Our study highlights the potential of satellite-based remote sensing as a cost-effective and efficient alternative to traditional manual laboratory testing methods for large-scale sugarcane disease detection.
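The two building blocks named in the abstract, vegetation indices from multispectral bands and the RBF kernel behind SVM-RBF, are both a few lines of numpy. This is a generic sketch (NDVI as one representative index, a plain RBF kernel), not the paper's specific feature set or classifier configuration.

```python
import numpy as np

def ndvi(nir, red):
    """Normalized Difference Vegetation Index, one of the vegetation
    indices derivable from satellite NIR and red reflectance bands."""
    return (nir - red) / (nir + red + 1e-9)

def rbf_kernel(X, Y, gamma=1.0):
    """Radial Basis Function kernel, the similarity measure at the
    core of an SVM-RBF classifier."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)
```

In a pipeline like the one described, per-field index values (NDVI and similar) would form the feature vectors `X`, and the kernel matrix would feed an SVM solver.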

[LG-89] In-context Learning in Presence of Spurious Correlations

链接: https://arxiv.org/abs/2410.03140
作者: Hrayr Harutyunyan,Rafayel Darbinyan,Samvel Karapetyan,Hrant Khachatrian
关键词-EN: Large language models, Large language, language models exhibit, in-context, exhibit a remarkable
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Large language models exhibit a remarkable capacity for in-context learning, where they learn to solve tasks given a few examples. Recent work has shown that transformers can be trained to perform simple regression tasks in-context. This work explores the possibility of training an in-context learner for classification tasks involving spurious features. We find that the conventional approach of training in-context learners is susceptible to spurious features. Moreover, when the meta-training dataset includes instances of only one task, the conventional approach leads to task memorization and fails to produce a model that leverages context for predictions. Based on these observations, we propose a novel technique to train such a learner for a given classification task. Remarkably, this in-context learner matches and sometimes outperforms strong methods like ERM and GroupDRO. However, unlike these algorithms, it does not generalize well to other tasks. We show that it is possible to obtain an in-context learner that generalizes to unseen tasks by training on a diverse dataset of synthetic in-context learning instances.

[LG-90] Can LLMs Generate Diverse Molecules? Towards Alignment with Structural Diversity

链接: https://arxiv.org/abs/2410.03138
作者: Hyosoon Jang,Yunhui Jang,Jaehyung Kim,Sungsoo Ahn
关键词-EN: demonstrated impressive performance, offers significant potential, Recent advancements, accelerate drug discovery, large language models
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have demonstrated impressive performance in generating molecular structures as drug candidates, which offers significant potential to accelerate drug discovery. However, the current LLMs overlook a critical requirement for drug discovery: proposing a diverse set of molecules. This diversity is essential for improving the chances of finding a viable drug, as it provides alternative molecules that may succeed where others fail in wet-lab or clinical validations. Despite such a need for diversity, the LLMs often output structurally similar molecules from a given prompt. While decoding schemes like beam search may enhance textual diversity, this often does not align with molecular structural diversity. In response, we propose a new method for fine-tuning molecular generative LLMs to autoregressively generate a set of structurally diverse molecules, where each molecule is generated by conditioning on the previously generated molecules. Our approach consists of two stages: (1) supervised fine-tuning to adapt LLMs to autoregressively generate molecules in a sequence and (2) reinforcement learning to maximize structural diversity within the generated molecules. Our experiments show that (1) our fine-tuning approach enables the LLMs to better discover diverse molecules compared to existing decoding schemes and (2) our fine-tuned model outperforms other representative LLMs in generating diverse molecules, including the ones fine-tuned on chemical domains.
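A reward of the kind the RL stage could maximize is mean pairwise structural distance over the generated set. The sketch below uses Tanimoto (Jaccard) distance over fingerprints represented as plain Python sets of substructure keys; the paper's actual diversity measure and fingerprinting are assumptions here.

```python
def diversity_reward(fingerprints):
    """Mean pairwise Tanimoto (Jaccard) distance over molecule
    fingerprints, each given as a set of substructure keys. Higher is
    more structurally diverse; a reward of this shape can push an
    autoregressive generator away from near-duplicate molecules."""
    n = len(fingerprints)
    if n < 2:
        return 0.0
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            union = fingerprints[i] | fingerprints[j]
            sim = len(fingerprints[i] & fingerprints[j]) / len(union) if union else 1.0
            total += 1.0 - sim
            pairs += 1
    return total / pairs
```

Because each molecule is generated conditioned on the previously generated ones, such a set-level reward can be assigned to the whole autoregressive rollout rather than to molecules in isolation.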

[LG-91] Remaining Useful Life Prediction: A Study on Multidimensional Industrial Signal Processing and Efficient Transfer Learning Based on Large Language Models

链接: https://arxiv.org/abs/2410.03134
作者: Yan Chen,Cheng Liu
关键词-EN: Remaining useful life, maintaining modern industrial, safety are paramount, RUL prediction, crucial for maintaining
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Remaining useful life (RUL) prediction is crucial for maintaining modern industrial systems, where equipment reliability and operational safety are paramount. Traditional methods, based on small-scale deep learning or physical/statistical models, often struggle with complex, multidimensional sensor data and varying operating conditions, limiting their generalization capabilities. To address these challenges, this paper introduces an innovative regression framework utilizing large language models (LLMs) for RUL prediction. By leveraging the modeling power of LLMs pre-trained on corpus data, the proposed model can effectively capture complex temporal dependencies and improve prediction accuracy. Extensive experiments on the Turbofan engine’s RUL prediction task show that the proposed model surpasses state-of-the-art (SOTA) methods on the challenging FD002 and FD004 subsets and achieves near-SOTA results on the other subsets. Notably, different from previous research, our framework uses the same sliding window length and all sensor signals for all subsets, demonstrating strong consistency and generalization. Moreover, transfer learning experiments reveal that with minimal target domain data for fine-tuning, the model outperforms SOTA methods trained on full target domain data. This research highlights the significant potential of LLMs in industrial signal processing and RUL prediction, offering a forward-looking solution for health management in future intelligent industrial systems.
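The "same sliding window length and all sensor signals for all subsets" setup boils down to a simple windowing step. A minimal sketch, with the usual Turbofan-style convention that a window's RUL label is the number of timesteps left after its last frame (the paper's exact labeling, e.g. any piecewise-linear capping, is not specified here):

```python
def sliding_windows(signal, window, stride=1):
    """Cut one run-to-failure sensor sequence (a list of per-timestep
    sensor vectors) into fixed-length windows. Each window is paired
    with its RUL label: the number of timesteps remaining after the
    window's last frame."""
    n = len(signal)
    return [(signal[s:s + window], n - (s + window))
            for s in range(0, n - window + 1, stride)]
```

Using one shared `window` across FD001-FD004 and keeping all channels is what the abstract highlights as evidence of consistency and generalization.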

[LG-92] Autoregressive Action Sequence Learning for Robotic Manipulation

链接: https://arxiv.org/abs/2410.03132
作者: Xinyu Zhang,Yuhan Liu,Haonan Chang,Liam Schramm,Abdeslam Boularias
关键词-EN: natural language processing, demonstrated remarkable success, language processing, demonstrated remarkable, remarkable success
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Autoregressive models have demonstrated remarkable success in natural language processing. In this work, we design a simple yet effective autoregressive architecture for robotic manipulation tasks. We propose the Chunking Causal Transformer (CCT), which extends the next-single-token prediction of causal transformers to support multi-token prediction in a single pass. Further, we design a novel attention interleaving strategy that allows CCT to be trained efficiently with teacher-forcing. Based on CCT, we propose the Autoregressive Policy (ARP) model, which learns to generate action sequences autoregressively. We find that action sequence learning enables better leverage of the underlying causal relationships in robotic tasks. We evaluate ARP across diverse robotic manipulation environments, including Push-T, ALOHA, and RLBench, and show that it outperforms the state-of-the-art methods in all tested environments, while being more efficient in computation and parameter sizes. Video demonstrations, our source code, and the models of ARP can be found at this http URL.
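The core idea of extending next-single-token prediction to multi-token prediction can be pictured as an attention mask: a token may attend within its own chunk and to all earlier chunks, so an entire chunk of action tokens can be decoded in one pass. This mask is only a sketch of that pattern; the paper's CCT additionally relies on an attention-interleaving strategy for teacher-forced training that is not reproduced here.

```python
import numpy as np

def chunk_causal_mask(seq_len, chunk):
    """Chunked causal attention mask: entry [i, j] is True when token i
    may attend to token j, i.e. when j's chunk is not after i's chunk.
    Within a chunk attention is bidirectional; across chunks it is
    causal, enabling one-pass multi-token prediction per chunk."""
    cid = np.arange(seq_len) // chunk
    return cid[None, :] <= cid[:, None]
```

With `chunk=1` this reduces to the ordinary causal mask of next-token prediction.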

[LG-93] AIME: AI System Optimization via Multiple LLM Evaluators

链接: https://arxiv.org/abs/2410.03131
作者: Bhrij Patel,Souradip Chakraborty,Wesley A. Suttle,Mengdi Wang,Amrit Singh Bedi,Dinesh Manocha
关键词-EN: feedback loop scheme, optimization typically involves, Text-based AI system, current output, iteration output
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 21 pages, 10 Figures, 4 Tables

点击查看摘要

Abstract:Text-based AI system optimization typically involves a feedback loop scheme where a single LLM generates an evaluation in natural language of the current output to improve the next iteration’s output. However, in this work, we empirically demonstrate that for a practical and complex task (code generation) with multiple criteria to evaluate, utilizing only one LLM evaluator tends to let errors in generated code go undetected, thus leading to incorrect evaluations and ultimately suboptimal test case performance. Motivated by this failure case, we assume there exists an optimal evaluation policy that samples an evaluation between response and ground truth. We then theoretically prove that a linear combination of multiple evaluators can approximate this optimal policy. From this insight, we propose AI system optimization via Multiple LLM Evaluators (AIME). AIME is an evaluation protocol that utilizes multiple LLMs that each independently generate an evaluation on separate criteria and then combine them via concatenation. We provide an extensive empirical study showing AIME outperforming baseline methods in code generation tasks, with up to 62% higher error detection rate and up to 16% higher success rate than a single LLM evaluation protocol on LeetCodeHard and HumanEval datasets. We also show that the selection of the number of evaluators and which criteria to utilize is non-trivial as it can impact success rate by up to 12%.
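The protocol itself is simple: one evaluator per criterion, each run independently, with the natural-language evaluations combined by concatenation. In the sketch below the evaluators are hypothetical deterministic functions standing in for LLM calls; the names and criteria are illustrative, not the paper's.

```python
def aime_evaluate(candidate, evaluators):
    """AIME-style protocol sketch: each (name, fn) evaluator
    independently judges one criterion of the candidate; the textual
    evaluations are combined by simple concatenation."""
    return "\n".join(f"[{name}] {fn(candidate)}" for name, fn in evaluators)

# hypothetical single-criterion evaluators (stand-ins for LLM calls)
evaluators = [
    ("correctness", lambda code: "fails on empty input" if "if not" not in code else "ok"),
    ("style", lambda code: "ok" if len(code) < 200 else "too long"),
]
report = aime_evaluate("def f(xs): return max(xs)", evaluators)
```

The concatenated `report` then feeds the next optimization iteration, so an error missed by one evaluator can still surface through another criterion.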

[LG-94] ARB-LLM: Alternating Refined Binarizations for Large Language Models

链接: https://arxiv.org/abs/2410.03129
作者: Zhiteng Li,Xianglong Yan,Tianao Zhang,Haotong Qin,Dong Xie,Jiang Tian,zhongchao shi,Linghe Kong,Yulun Zhang,Xiaokang Yang
关键词-EN: Large Language Models, natural language processing, Large Language, hinder practical deployment, greatly pushed forward
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: The code and models will be available at this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) have greatly pushed forward advancements in natural language processing, yet their high memory and computational demands hinder practical deployment. Binarization, as an effective compression technique, can shrink model weights to just 1 bit, significantly reducing the high demands on computation and memory. However, current binarization methods struggle to narrow the distribution gap between binarized and full-precision weights, while also overlooking the column deviation in LLM weight distribution. To tackle these issues, we propose ARB-LLM, a novel 1-bit post-training quantization (PTQ) technique tailored for LLMs. To narrow the distribution shift between binarized and full-precision weights, we first design an alternating refined binarization (ARB) algorithm to progressively update the binarization parameters, which significantly reduces the quantization error. Moreover, considering the pivot role of calibration data and the column deviation in LLM weights, we further extend ARB to ARB-X and ARB-RC. In addition, we refine the weight partition strategy with column-group bitmap (CGB), which further enhances performance. Equipping ARB-X and ARB-RC with CGB, we obtain ARB-LLM_X and ARB-LLM_RC respectively, which significantly outperform state-of-the-art (SOTA) binarization methods for LLMs. As a binary PTQ method, our ARB-LLM_RC is the first to surpass FP16 models of the same size. The code and models will be available at this https URL.
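The "alternating refinement" idea can be sketched on the simplest binarization model, W ≈ alpha·B + mu with B in {-1, +1}: each of B, alpha, mu has a closed-form optimum given the others, so cycling through them never increases the reconstruction error. This is a simplified mean-plus-scale variant; the actual ARB/ARB-X/ARB-RC further weight the updates with calibration data and handle row/column deviations, which this sketch omits.

```python
import numpy as np

def arb_binarize(W, iters=10):
    """Alternating refinement for 1-bit quantization W ~ alpha*B + mu.
    Per column: B = sign of the residual, alpha = least-squares scale
    given B, mu = least-squares offset given (alpha, B)."""
    mu = W.mean(axis=0, keepdims=True)
    for _ in range(iters):
        R = W - mu
        B = np.where(R >= 0, 1.0, -1.0)
        alpha = np.abs(R).mean(axis=0, keepdims=True)     # LS scale for B = sign(R)
        mu = (W - alpha * B).mean(axis=0, keepdims=True)  # LS offset for fixed alpha*B
    return B, alpha, mu
```

Each step solves one block exactly, so the quantization error is monotonically non-increasing across iterations, which is the property the paper's progressive parameter updates exploit.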

[LG-95] On Unsupervised Prompt Learning for Classification with Black-box Language Models

链接: https://arxiv.org/abs/2410.03124
作者: Zhen-Yu Zhang,Jiandong Zhang,Huaxiu Yao,Gang Niu,Masashi Sugiyama
关键词-EN: Large language models, achieved impressive success, Large language, black-box LLMs, text-formatted learning problems
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have achieved impressive success in text-formatted learning problems, and most popular LLMs have been deployed in a black-box fashion. Meanwhile, fine-tuning is usually necessary for a specific downstream task to obtain better performance, and this functionality is provided by the owners of the black-box LLMs. To fine-tune a black-box LLM, labeled data are always required to adjust the model parameters. However, in many real-world applications, LLMs can label textual datasets with even better quality than skilled human annotators, motivating us to explore the possibility of fine-tuning black-box LLMs with unlabeled data. In this paper, we propose unsupervised prompt learning for classification with black-box LLMs, where the learning parameters are the prompt itself and the pseudo labels of unlabeled data. Specifically, the prompt is modeled as a sequence of discrete tokens, and every token has its own to-be-learned categorical distribution. On the other hand, for learning the pseudo labels, we are the first to consider the in-context learning (ICL) capabilities of LLMs: we first identify reliable pseudo-labeled data using the LLM, and then assign pseudo labels to other unlabeled data based on the prompt, allowing the pseudo-labeled data to serve as in-context demonstrations alongside the prompt. These in-context demonstrations matter: previously, they were involved only when the prompt was used for prediction, not when the prompt was trained; taking them into account during training therefore makes the prompt-learning and prompt-using stages more consistent. Experiments on benchmark datasets show the effectiveness of our proposed algorithm. After unsupervised prompt learning, we can use the pseudo-labeled dataset for further fine-tuning by the owners of the black-box LLMs.

[LG-96] Shrinking: Reconstruction of Parameterized Surfaces from Signed Distance Fields ICML

链接: https://arxiv.org/abs/2410.03123
作者: Haotian Yin,Przemyslaw Musialski
关键词-EN: Signed Distance Fields, implicit neural representation, Distance Fields, Signed Distance, reconstructing explicit parameterized
类目: Graphics (cs.GR); Machine Learning (cs.LG)
*备注: 6 pages, 4 figures, accepted by ICMLA

点击查看摘要

Abstract:We propose a novel method for reconstructing explicit parameterized surfaces from Signed Distance Fields (SDFs), a widely used implicit neural representation (INR) for 3D surfaces. While traditional reconstruction methods like Marching Cubes extract discrete meshes that lose the continuous and differentiable properties of INRs, our approach iteratively contracts a parameterized initial sphere to conform to the target SDF shape, preserving differentiability and surface parameterization throughout. This enables downstream applications such as texture mapping, geometry processing, animation, and finite element analysis. Evaluated on the typical geometric shapes and parts of the ABC dataset, our method achieves competitive reconstruction quality, maintaining smoothness and differentiability crucial for advanced computer graphics and geometric deep learning applications.
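The contraction step can be pictured point-wise: samples of an enclosing sphere slide along the SDF gradient until they hit the zero level set. The paper contracts a *parameterized* sphere while preserving differentiability end to end; the sketch below only shows the per-sample contraction with a numerical gradient, on a sphere SDF as the target shape.

```python
import math

def sphere_sdf(p, r=1.0):
    """Signed distance to a sphere of radius r centered at the origin."""
    return math.sqrt(sum(c * c for c in p)) - r

def shrink_to_surface(points, sdf, steps=50, eps=1e-4):
    """Iteratively move each point along the (central-difference) SDF
    gradient, scaled by the signed distance, until it lies on the
    zero level set of the target SDF."""
    out = []
    for p in points:
        p = list(p)
        for _ in range(steps):
            d = sdf(p)
            if abs(d) < eps:
                break
            g = []
            for i in range(len(p)):  # numerical gradient per coordinate
                hi, lo = list(p), list(p)
                hi[i] += 1e-5
                lo[i] -= 1e-5
                g.append((sdf(hi) - sdf(lo)) / 2e-5)
            norm = math.sqrt(sum(c * c for c in g)) or 1.0
            p = [pi - d * gi / norm for pi, gi in zip(p, g)]
        out.append(p)
    return out
```

Since points only move, never get re-meshed, a parameterization attached to the initial sphere survives the contraction, which is what enables the downstream texture-mapping and FEM applications mentioned in the abstract.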

[LG-97] RIPPLECOT: Amplifying Ripple Effect of Knowledge Editing in Language Models via Chain-of-Thought In-Context Learning EMNLP

链接: https://arxiv.org/abs/2410.03122
作者: Zihao Zhao,Yuchen Yang,Yijiang Li,Yinzhi Cao
关键词-EN: large language models, ripple effect poses, ripple effect, poses a significant, large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: EMNLP findings

点击查看摘要

Abstract:The ripple effect poses a significant challenge in knowledge editing for large language models. Namely, when a single fact is edited, the model struggles to accurately update the related facts in a sequence, which is evaluated by multi-hop questions linked to a chain of related facts. Recent strategies have moved away from traditional parameter updates to more flexible, less computation-intensive methods, proven to be more effective in addressing the ripple effect. In-context learning (ICL) editing uses a simple demonstration “Imagine that + new fact” to guide LLMs, but struggles with complex multi-hop questions as the new fact alone fails to specify the chain of facts involved in such scenarios. Besides, memory-based editing maintains additional storage for all edits and related facts, requiring continuous updates to stay effective. As a result of these design limitations, the challenge remains, with the highest accuracy being only 33.8% on the MQuAKE-cf benchmarks for Vicuna-7B. To address this, we propose RippleCOT, a novel ICL editing approach integrating Chain-of-Thought (COT) reasoning. RippleCOT structures demonstrations as (new fact, question, thought, answer), incorporating a thought component to identify and decompose the multi-hop logic within questions. This approach effectively guides the model through complex multi-hop questions with chains of related facts. Comprehensive experiments demonstrate that RippleCOT significantly outperforms the state-of-the-art on the ripple effect, achieving accuracy gains ranging from 7.8% to 87.1%.
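The demonstration structure is concrete enough to write down. Below is a minimal formatter for one (new fact, question, thought, answer) demonstration; the field labels and the example edit are hypothetical, chosen only to show how the thought field decomposes the multi-hop chain that the bare "Imagine that + new fact" prompt leaves implicit.

```python
def ripplecot_demo(new_fact, question, thought, answer):
    """Format one RippleCOT-style in-context demonstration: the thought
    field spells out the chain of related facts the edit triggers."""
    return (f"New fact: {new_fact}\n"
            f"Question: {question}\n"
            f"Thought: {thought}\n"
            f"Answer: {answer}")

# hypothetical edit and multi-hop question
demo = ripplecot_demo(
    new_fact="The CEO of company X is Alice.",
    question="In which country does the CEO of company X live?",
    thought="The CEO of company X is Alice; Alice lives in Canada.",
    answer="Canada",
)
```

Several such demonstrations would be concatenated before the actual edited fact and test question to guide the model through the fact chain.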

[LG-98] Spatial-aware decision-making with ring attractors in reinforcement learning systems

链接: https://arxiv.org/abs/2410.03119
作者: Marcos Negre Saura,Richard Allmendinger,Theodore Papamarkou,Wei Pan
关键词-EN: neural circuit dynamics, mathematical model inspired, ring attractors, action selection process, circuit dynamics
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper explores the integration of ring attractors, a mathematical model inspired by neural circuit dynamics, into the reinforcement learning (RL) action selection process. Ring attractors, as specialized brain-inspired structures that encode spatial information and uncertainty, offer a biologically plausible mechanism to improve learning speed and predictive performance. They do so by explicitly encoding the action space, facilitating the organization of neural activity, and enabling the distribution of spatial representations across the neural network in the context of deep RL. The application of ring attractors in the RL action selection process involves mapping actions to specific locations on the ring and decoding the selected action based on neural activity. We investigate the application of ring attractors by both building them as exogenous models and integrating them as part of a Deep Learning policy algorithm. Our results show a significant improvement in state-of-the-art models for the Atari 100k benchmark. Notably, our integrated approach improves the performance of state-of-the-art models by roughly half, a 53% increase over selected baselines.
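The "map actions to ring locations, decode from neural activity" step can be sketched without any RL machinery: place an activity bump at an action's angle, then read the action back with a population-vector decoder. Bump width, neuron count, and the decoding rule below are illustrative choices, not the paper's exact dynamics (which evolve the bump as an attractor rather than setting it directly).

```python
import math

def ring_encode(action_idx, n_actions, n_neurons=32, kappa=8.0):
    """Place a Gaussian-like activity bump at the action's angle on a
    ring of neurons; bump width (1/kappa) can encode uncertainty."""
    theta = 2 * math.pi * action_idx / n_actions
    act = []
    for i in range(n_neurons):
        phi = 2 * math.pi * i / n_neurons
        d = math.atan2(math.sin(phi - theta), math.cos(phi - theta))  # wrapped distance
        act.append(math.exp(-kappa * d * d))
    return act

def ring_decode(act, n_actions):
    """Population-vector decoding: the angle of the summed unit vectors
    of all neurons, snapped back to the nearest action."""
    n = len(act)
    x = sum(a * math.cos(2 * math.pi * i / n) for i, a in enumerate(act))
    y = sum(a * math.sin(2 * math.pi * i / n) for i, a in enumerate(act))
    theta = math.atan2(y, x) % (2 * math.pi)
    return round(theta * n_actions / (2 * math.pi)) % n_actions
```

Because nearby actions produce overlapping bumps, value estimates for spatially adjacent actions reinforce each other, which is the mechanism the paper credits for faster learning.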

[LG-99] ProcBench: Benchmark for Multi-Step Reasoning and Following Procedure

链接: https://arxiv.org/abs/2410.03117
作者: Ippei Fujisawa,Sensho Nobe,Hiroki Seto,Rina Onda,Yoshiaki Uchida,Hiroki Ikoma,Pei-Chun Chien,Ryota Kanai
关键词-EN: tasks remains limited, large language models, continue to advance, intellectual activities, remains limited
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reasoning is central to a wide range of intellectual activities, and while the capabilities of large language models (LLMs) continue to advance, their performance in reasoning tasks remains limited. The processes and mechanisms underlying reasoning are not yet fully understood, but key elements include path exploration, selection of relevant knowledge, and multi-step inference. Problems are solved through the synthesis of these components. In this paper, we propose a benchmark that focuses on a specific aspect of reasoning ability: the direct evaluation of multi-step inference. To this end, we design a special reasoning task where multi-step inference is specifically focused by largely eliminating path exploration and implicit knowledge utilization. Our dataset comprises pairs of explicit instructions and corresponding questions, where the procedures necessary for solving the questions are entirely detailed within the instructions. This setup allows models to solve problems solely by following the provided directives. By constructing problems that require varying numbers of steps to solve and evaluating responses at each step, we enable a thorough assessment of state-of-the-art LLMs’ ability to follow instructions. To ensure the robustness of our evaluation, we include multiple distinct tasks. Furthermore, by comparing accuracy across tasks, utilizing step-aware metrics, and applying separately defined measures of complexity, we conduct experiments that offer insights into the capabilities and limitations of LLMs in reasoning tasks. Our findings have significant implications for the development of LLMs and highlight areas for future research in advancing their reasoning abilities. Our dataset is available at this https URL and code at this https URL.
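"Evaluating responses at each step" suggests metrics like the one below: credit for the prefix of the procedure the model reproduces before its first deviation. This is an assumed scoring rule in the spirit of the benchmark's step-aware metrics, not necessarily the paper's exact definition.

```python
def prefix_match_score(pred_steps, gold_steps):
    """Step-aware metric sketch: fraction of the gold procedure the
    model follows correctly before its first deviation."""
    match = 0
    for p, g in zip(pred_steps, gold_steps):
        if p != g:
            break
        match += 1
    return match / max(len(gold_steps), 1)
```

Averaging such scores over problems of varying step counts separates "can follow a long procedure" from "got the final answer right by luck".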

[LG-100] LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy

链接: https://arxiv.org/abs/2410.03111
作者: Rongzhi Zhang,Kuang Wang,Liyuan Liu,Shuohang Wang,Hao Cheng,Chao Zhang,Yelong Shen
关键词-EN: enabling faster inference, storing previously computed, autoregressive large language, serving transformer-based autoregressive, transformer-based autoregressive large
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 15 pages, 4 figures

点击查看摘要

Abstract:The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs), enabling faster inference by storing previously computed KV vectors. However, its memory consumption scales linearly with sequence length and batch size, posing a significant bottleneck in LLM deployment. Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages, which requires extensive parameter tuning thus unsuitable for pre-trained LLMs; (2) KV cache compression at test time, primarily through token eviction policies, which often overlook inter-layer dependencies and can be task-specific. This paper introduces an orthogonal approach to KV cache compression. We propose a low-rank approximation of KV weight matrices, allowing for plug-in integration with existing transformer-based LLMs without model retraining. To effectively compress KV cache at the weight level, we adjust for layerwise sensitivity and introduce a progressive compression strategy, which is supported by our theoretical analysis on how compression errors accumulate in deep networks. Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages. Extensive experiments with LLaMA models ranging from 8B to 70B parameters across various tasks show that our approach significantly reduces the GPU memory footprint while maintaining performance.
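The weight-level compression primitive is a truncated SVD of a KV projection matrix: caching activations in the rank-r subspace instead of the full dimension is what shrinks the KV cache without retraining. A minimal sketch (the paper's layerwise sensitivity adjustment and progressive schedule are not shown):

```python
import numpy as np

def low_rank_factors(W, rank):
    """Truncated-SVD factorization W ~ A @ B. Storing the thin factors
    A (d_out x r) and B (r x d_in) in place of W lets keys/values be
    computed and cached in the rank-r subspace, plug-in with no
    retraining when rank << min(W.shape)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank]                 # d_out x r
    B = S[:rank, None] * Vt[:rank]  # r x d_in
    return A, B
```

When the true weight matrix is close to low-rank, the approximation error is the tail of the singular-value spectrum, which is also what a layerwise sensitivity analysis would budget against.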

[LG-101] A Training-Free Conditional Diffusion Model for Learning Stochastic Dynamical Systems

链接: https://arxiv.org/abs/2410.03108
作者: Yanfang Liu,Yuan Chen,Dongbin Xiu,Guannan Zhang
关键词-EN: training-free conditional diffusion, conditional diffusion model, stochastic differential equations, study introduces, introduces a training-free
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:This study introduces a training-free conditional diffusion model for learning unknown stochastic differential equations (SDEs) using data. The proposed approach addresses key challenges in computational efficiency and accuracy for modeling SDEs by utilizing a score-based diffusion model to approximate their stochastic flow map. Unlike the existing methods, this technique is based on an analytically derived closed-form exact score function, which can be efficiently estimated by Monte Carlo method using the trajectory data, and eliminates the need for neural network training to learn the score function. By generating labeled data through solving the corresponding reverse ordinary differential equation, the approach enables supervised learning of the flow map. Extensive numerical experiments across various SDE types, including linear, nonlinear, and multi-dimensional systems, demonstrate the versatility and effectiveness of the method. The learned models exhibit significant improvements in predicting both short-term and long-term behaviors of unknown stochastic systems, often surpassing baseline methods like GANs in estimating drift and diffusion coefficients.
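The "analytically derived closed-form exact score, estimated by Monte Carlo from trajectory data" has a familiar one-dimensional caricature: for a Gaussian-smoothed empirical distribution, the score is a softmax-weighted average of pull-back directions toward the data. The coefficients `a` and `sigma` below are hypothetical placeholders for the forward-process schedule, and the 1-D setting is purely illustrative.

```python
import math

def mc_score(x, data, a=1.0, sigma=0.5):
    """Exact score of p(x) = (1/N) * sum_i N(x; a*x_i, sigma^2),
    evaluated by Monte Carlo over trajectory samples x_i: a
    softmax-weighted average of (a*x_i - x) / sigma^2. No neural
    network is trained to learn this score."""
    logw = [-(x - a * xi) ** 2 / (2 * sigma ** 2) for xi in data]
    m = max(logw)                      # log-sum-exp stabilization
    w = [math.exp(l - m) for l in logw]
    z = sum(w)
    return sum(wi * (a * xi - x) for wi, xi in zip(w, data)) / (z * sigma ** 2)
```

Plugging such a score into the reverse ODE yields labeled (noise, sample) pairs, which is how the paper turns score evaluation into supervised learning of the flow map.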

[LG-102] Mamba in Vision: A Comprehensive Survey of Techniques and Applications

链接: https://arxiv.org/abs/2410.03105
作者: Md Maklachur Rahman,Abdullah Aman Tutul,Ankur Nath,Lamyanba Laishram,Soon Ki Jung,Tracy Hammond
关键词-EN: Convolutional Neural Networks, Neural Networks, Convolutional Neural, faced by Convolutional, Vision Transformers
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Under Review

点击查看摘要

Abstract:Mamba is emerging as a novel approach to overcome the challenges faced by Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in computer vision. While CNNs excel at extracting local features, they often struggle to capture long-range dependencies without complex architectural modifications. In contrast, ViTs effectively model global relationships but suffer from high computational costs due to the quadratic complexity of their self-attention mechanisms. Mamba addresses these limitations by leveraging Selective Structured State Space Models to effectively capture long-range dependencies with linear computational complexity. This survey analyzes the unique contributions, computational benefits, and applications of Mamba models while also identifying challenges and potential future research directions. We provide a foundational resource for advancing the understanding and growth of Mamba models in computer vision. An overview of this work is available at this https URL.

[LG-103] Horizon-Length Prediction: Advancing Fill-in-the-Middle Capabilities for Code Generation with Lookahead Planning

链接: https://arxiv.org/abs/2410.03103
作者: Yifeng Ding,Hantian Ding,Shiqi Wang,Qing Sun,Varun Kumar,Zijian Wang
关键词-EN: generation of missing, FIM, HLP, models, FIM training paradigm
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Fill-in-the-Middle (FIM) has become integral to code language models, enabling generation of missing code given both left and right contexts. However, the current FIM training paradigm, which reorders original training sequences and then performs regular next-token prediction (NTP), often leads to models struggling to generate content that aligns smoothly with the surrounding context. Crucially, while existing works rely on rule-based post-processing to circumvent this weakness, such methods are not practically usable in open-domain code completion tasks as they depend on restrictive, dataset-specific assumptions (e.g., generating the same number of lines as in the ground truth). Moreover, model performance on FIM tasks deteriorates significantly without these unrealistic assumptions. We hypothesize that NTP alone is insufficient for models to learn effective planning conditioned on the distant right context, a critical factor for successful code infilling. To overcome this, we propose Horizon-Length Prediction (HLP), a novel training objective that teaches models to predict the number of remaining middle tokens (i.e., horizon length) at each step. HLP advances FIM with lookahead planning, enabling models to inherently learn infilling boundaries for arbitrary left and right contexts without relying on dataset-specific post-processing. Our evaluation across different models and sizes shows that HLP significantly improves FIM performance by up to 24% relatively on diverse benchmarks, across file-level and repository-level, and without resorting to unrealistic post-processing methods. Furthermore, the enhanced planning capability gained through HLP boosts model performance on code reasoning. Importantly, HLP only incurs negligible training overhead and no additional inference cost, ensuring its practicality for real-world scenarios. 
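The HLP training target is easy to state concretely: at each position of the middle span, the label is the number of middle tokens still to be generated. A minimal sketch of that label construction (the model-side prediction head is not shown):

```python
def horizon_length_targets(middle_tokens):
    """Horizon-Length Prediction labels for FIM training: at step i of
    generating the middle span, the target is the number of middle
    tokens remaining after this one, counting down to zero."""
    n = len(middle_tokens)
    return [n - i - 1 for i in range(n)]
```

Because the count reaches zero exactly at the boundary with the right context, the model learns where infilling should stop for arbitrary contexts, without dataset-specific post-processing rules.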

[LG-104] UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference

链接: https://arxiv.org/abs/2410.03090
作者: Jing Xiong,Jianghan Shen,Fanghua Ye,Chaofan Tao,Zhongwei Wan,Jianqiao Lu,Xun Wu,Chuanyang Zheng,Zhijiang Guo,Lingpeng Kong,Ngai Wong
关键词-EN: Deploying large language, Deploying large, large language models, computational demands, large language
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deploying large language models (LLMs) is challenging due to their high memory and computational demands, especially during long-context inference. While key-value (KV) caching accelerates inference by reusing previously computed keys and values, it also introduces significant memory overhead. Existing KV cache compression methods such as eviction and merging typically compress the KV cache after it is generated and overlook the eviction of hidden states, failing to improve the speed of the prefilling stage. Additionally, applying a uniform compression rate across different attention heads can harm crucial retrieval heads in needle-in-a-haystack tasks due to excessive compression. In this paper, we propose UNComp, an uncertainty-aware compression scheme that leverages matrix entropy to estimate model uncertainty across layers and heads at the token sequence level. By grouping layers and heads based on their uncertainty, UNComp adaptively compresses both the hidden states and the KV cache. Our method achieves a 1.6x speedup in the prefilling stage and reduces the KV cache to 4.74% of its original size, resulting in a 6.4x increase in throughput and a 1.4x speedup in inference with only a 1.41% performance loss. Remarkably, in needle-in-a-haystack tasks, UNComp outperforms the full-size KV cache even when compressed to 9.38% of its original size. Our approach offers an efficient, training-free Grouped-Query Attention paradigm that can be seamlessly integrated into existing KV cache schemes.
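The matrix entropy used to estimate uncertainty has a standard effective-rank flavor: the Shannon entropy of a matrix's normalized singular values. The sketch below applies it to a token-hidden-state matrix; whether UNComp normalizes by trace or by singular-value sum, and how entropies map to per-group compression rates, are details this sketch does not claim to reproduce.

```python
import numpy as np

def matrix_entropy(H):
    """Effective-rank style entropy of a (tokens x dim) hidden-state
    matrix: Shannon entropy of its normalized singular values. Low
    entropy (near rank-1 structure) suggests the head/layer tolerates
    aggressive compression; high entropy suggests caution."""
    s = np.linalg.svd(H, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())
```

Grouping layers and heads by this score, then assigning each group its own compression rate, is the adaptive step that protects retrieval heads in needle-in-a-haystack tasks.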

[LG-105] Optimization Proxies using Limited Labeled Data and Training Time – A Semi-Supervised Bayesian Neural Network Approach

链接: https://arxiv.org/abs/2410.03085
作者: Parikshit Pareek,Kaarthik Sundar,Deepjyoti Deka,Sidhant Misra
关键词-EN: electric power grids, Constrained optimization problems, optimization problems arise, optimization problems, engineering system operations
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Constrained optimization problems arise in various engineering system operations such as inventory management and electric power grids. However, the requirement to repeatedly solve such optimization problems with uncertain parameters poses a significant computational challenge. This work introduces a learning scheme using Bayesian Neural Networks (BNNs) to solve constrained optimization problems under limited labeled data and restricted model training times. We propose a semi-supervised BNN for this practical but complex regime, wherein training commences in a sandwiched fashion, alternating between a supervised learning step (using labeled data) for minimizing cost, and an unsupervised learning step (using unlabeled data) for enforcing constraint feasibility. Both supervised and unsupervised steps use a Bayesian approach, where Stochastic Variational Inference is employed for approximate Bayesian inference. We show that the proposed semi-supervised learning method outperforms conventional BNN and deep neural network (DNN) architectures on important non-convex constrained optimization problems from energy network operations, achieving up to a tenfold reduction in expected maximum equality gap and halving the optimality and inequality (feasibility) gaps, without requiring any correction or projection step. By leveraging the BNN’s ability to provide posterior samples at minimal computational cost, we demonstrate that a Selection via Posterior (SvP) scheme can further reduce equality gaps by more than 10%. We also provide tight and practically meaningful probabilistic confidence bounds that can be constructed using a low number of labeled testing data and readily adapted to other applications.
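
The alternating "sandwiched" schedule described above can be illustrated on a made-up one-dimensional problem (min x subject to x >= b, whose optimum is x* = b). The sketch below replaces the Bayesian/SVI machinery with plain gradient descent, so it demonstrates only the supervised/unsupervised alternation, not the paper's method:

```python
import numpy as np

# Toy proxy for "min x s.t. x >= b": learn pred(b) = w*b + c, alternating a
# supervised step on few labels with a feasibility step on many unlabeled inputs.
rng = np.random.default_rng(1)
b_labeled = rng.uniform(0, 1, 20)
y_labeled = b_labeled.copy()              # ground-truth optima (limited labels)
b_unlabeled = rng.uniform(0, 1, 200)      # unlabeled problem instances

w, c, lr = 0.0, 0.0, 0.1
for step in range(4000):
    if step % 2 == 0:   # supervised: minimize squared error to labeled optima
        err = w * b_labeled + c - y_labeled
        w -= lr * np.mean(err * b_labeled)
        c -= lr * np.mean(err)
    else:               # unsupervised: penalize constraint violation pred >= b
        viol = np.maximum(b_unlabeled - (w * b_unlabeled + c), 0.0)
        w += lr * np.mean(viol * b_unlabeled)
        c += lr * np.mean(viol)

print(round(w, 3), round(c, 3))   # converges near (1, 0), i.e. pred(b) = b
```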

[LG-106] MetaOOD: Automatic Selection of OOD Detection Models KDD

链接: https://arxiv.org/abs/2410.03074
作者: Yuehan Qin,Yichi Zhang,Yi Nian,Xueying Ding,Yue Zhao
关键词-EN: OOD detection, OOD, OOD detection model, detection model, detection
类目: Machine Learning (cs.LG)
*备注: Best paper at 2024 KDD Workshop on Resource-Efficient Learning. Extended version

点击查看摘要

Abstract:How can we automatically select an out-of-distribution (OOD) detection model for various underlying tasks? This is crucial for maintaining the reliability of open-world applications by identifying data distribution shifts, particularly in critical domains such as online transactions, autonomous driving, and real-time patient diagnosis. Despite the availability of numerous OOD detection methods, the challenge of selecting an optimal model for diverse tasks remains largely underexplored, especially in scenarios lacking ground truth labels. In this work, we introduce MetaOOD, the first zero-shot, unsupervised framework that utilizes meta-learning to automatically select an OOD detection model. As a meta-learning approach, MetaOOD leverages historical performance data of existing methods across various benchmark OOD datasets, enabling the effective selection of a suitable model for new datasets without the need for labeled data at the test time. To quantify task similarities more accurately, we introduce language model-based embeddings that capture the distinctive OOD characteristics of both datasets and detection models. Through extensive experimentation with 24 unique test dataset pairs to choose from among 11 OOD detection models, we demonstrate that MetaOOD significantly outperforms existing methods and only brings marginal time overhead. Our results, validated by Wilcoxon statistical tests, show that MetaOOD surpasses a diverse group of 11 baselines, including established OOD detectors and advanced unsupervised selection methods.
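
In its simplest form, the zero-shot selection idea reduces to a nearest-neighbor lookup in an embedding space of datasets. A minimal sketch, with made-up performance numbers and 2-D embeddings standing in for the paper's language-model-based ones:

```python
import numpy as np

# Historical performance matrix: rows = benchmark datasets, cols = OOD detectors.
perf = np.array([[0.91, 0.75, 0.60],
                 [0.55, 0.88, 0.70],
                 [0.62, 0.71, 0.93]])
# Embeddings of the benchmark datasets (stand-ins for the paper's
# language-model-based embeddings).
bench_emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])

def select_detector(new_emb: np.ndarray) -> int:
    """Pick the detector that performed best on the most similar known dataset."""
    sims = bench_emb @ new_emb / (np.linalg.norm(bench_emb, axis=1)
                                  * np.linalg.norm(new_emb))
    nearest = int(np.argmax(sims))        # most similar benchmark dataset
    return int(np.argmax(perf[nearest]))  # best historical model on it

print(select_detector(np.array([0.9, 0.1])))  # resembles dataset 0 -> detector 0
```

No labels from the new dataset are needed at test time, which is the point of the zero-shot framing; the paper's meta-learner is richer than this 1-NN lookup.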

[LG-107] FedMAC: Tackling Partial-Modality Missing in Federated Learning with Cross-Modal Aggregation and Contrastive Regularization

链接: https://arxiv.org/abs/2410.03070
作者: Manh Duong Nguyen,Trung Thanh Nguyen,Huy Hieu Pham,Trong Nghia Hoang,Phi Le Nguyen,Thanh Trung Huynh
关键词-EN: training machine learning, machine learning models, distributed data sources, machine learning, learning models
类目: Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: The 22nd International Symposium on Network Computing and Applications (NCA 2024)

点击查看摘要

Abstract:Federated Learning (FL) is a method for training machine learning models using distributed data sources. It ensures privacy by allowing clients to collaboratively learn a shared global model while storing their data locally. However, a significant challenge arises when dealing with missing modalities in clients’ datasets, where certain features or modalities are unavailable or incomplete, leading to heterogeneous data distribution. While previous studies have addressed the issue of complete-modality missing, they fail to tackle partial-modality missing on account of severe heterogeneity among clients at an instance level, where the pattern of missing data can vary significantly from one sample to another. To tackle this challenge, this study proposes a novel framework named FedMAC, designed to address multi-modality missing under conditions of partial-modality missing in FL. Additionally, to avoid trivial aggregation of multi-modal features, we introduce contrastive-based regularization to impose additional constraints on the latent representation space. The experimental results demonstrate the effectiveness of FedMAC across various client configurations with statistical heterogeneity, outperforming baseline methods by up to 26% in severe missing scenarios, highlighting its potential as a solution for the challenge of partially missing modalities in federated systems.

[LG-108] FedCert: Federated Accuracy Certification

链接: https://arxiv.org/abs/2410.03067
作者: Minh Hieu Nguyen,Huu Tien Nguyen,Trung Thanh Nguyen,Manh Duong Nguyen,Trong Nghia Hoang,Truong Thao Nguyen,Phi Le Nguyen
关键词-EN: preserving data privacy, machine learning models, keeping local data, Federated Learning, training machine learning
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: The 22nd International Symposium on Network Computing and Applications (NCA 2024)

点击查看摘要

Abstract:Federated Learning (FL) has emerged as a powerful paradigm for training machine learning models in a decentralized manner, preserving data privacy by keeping local data on clients. However, evaluating the robustness of these models against data perturbations on clients remains a significant challenge. Previous studies have assessed the effectiveness of models in centralized training based on certified accuracy, which guarantees that a certain percentage of the model's predictions will remain correct even if the input data is perturbed. However, the challenge of extending these evaluations to FL remains unresolved because the clients' local data are unknown. To tackle this challenge, this study proposes a method named FedCert to take the first step toward evaluating the robustness of FL systems. The proposed method is designed to approximate the certified accuracy of a global model based on the certified accuracy and class distribution of each client. Additionally, considering the Non-Independent and Identically Distributed (Non-IID) nature of data in real-world scenarios, we introduce a client grouping algorithm to ensure reliable certified accuracy during the aggregation step of the approximation algorithm. Through theoretical analysis, we demonstrate the effectiveness of FedCert in assessing the robustness and reliability of FL systems. Moreover, experimental results on the CIFAR-10 and CIFAR-100 datasets under various scenarios show that FedCert consistently reduces the estimation error compared to baseline methods. This study offers a solution for evaluating the robustness of FL systems and lays the groundwork for future research to enhance the dependability of decentralized learning. The source code is available at this https URL.
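
The core approximation, combining per-client certified accuracies into a global estimate, can be sketched as follows. The sample-weighted average and the dominant-class grouping heuristic below are our simplifications for illustration, not the paper's exact algorithm:

```python
import numpy as np

# Per-client certified accuracy and per-client class distribution (3 classes).
cert_acc = np.array([0.80, 0.60, 0.90])
class_dist = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.2, 0.2, 0.6]])
n_samples = np.array([100, 300, 100])

def approx_global_certified_accuracy(cert_acc, n_samples):
    """Sample-weighted approximation of the global model's certified accuracy."""
    w = n_samples / n_samples.sum()
    return float(np.sum(w * cert_acc))

def group_clients(class_dist):
    """Naive grouping for Non-IID data: cluster clients by dominant class."""
    return np.argmax(class_dist, axis=1)

print(approx_global_certified_accuracy(cert_acc, n_samples))  # 0.70
print(group_clients(class_dist))
```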

[LG-109] Compute Or Load KV Cache? Why Not Both?

链接: https://arxiv.org/abs/2410.03065
作者: Shuowei Jin,Xueshen Liu,Qingzhao Zhang,Z. Morley Mao
关键词-EN: Large Language Models, Language Models, Large Language, context window sizes, enabling sophisticated applications
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have significantly increased context window sizes, enabling sophisticated applications but also introducing substantial computational overheads, particularly computing key-value (KV) cache in the prefill stage. Prefix caching has emerged to save GPU power in this scenario, which saves KV cache at disks and reuses it across multiple queries. However, traditional prefix caching mechanisms often suffer from substantial latency because the speed of loading KV cache from disks to GPU memory is bottlenecked by the throughput of I/O devices. To optimize the latency of long-context prefill, we propose Cake, a novel KV cache loader, which employs a bidirectional parallelized KV cache generation strategy. Upon receiving a prefill task, Cake simultaneously and dynamically loads saved KV cache from prefix cache locations and computes KV cache on local GPUs, maximizing the utilization of available computation and I/O bandwidth resources. Additionally, Cake automatically adapts to diverse system statuses without manual parameter tuning. In experiments on various prompt datasets, GPUs, and I/O devices, Cake offers up to 68.1% Time To First Token (TTFT) reduction compared with the compute-only method and 94.6% TTFT reduction compared with the I/O-only method.
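
The benefit of computing and loading KV cache in parallel can be illustrated with a static version of the split (Cake itself adapts the split dynamically at runtime; the per-chunk costs below are made up):

```python
def best_split(n_chunks: int, compute_ms: float, load_ms: float):
    """Prefill n_chunks of KV cache: compute the first i on the GPU while the
    remaining n-i are loaded from disk in parallel; latency is minimized when
    both sides finish together."""
    best = min(range(n_chunks + 1),
               key=lambda i: max(i * compute_ms, (n_chunks - i) * load_ms))
    return best, max(best * compute_ms, (n_chunks - best) * load_ms)

# 100 chunks, compute 2 ms/chunk, load 6 ms/chunk:
split, latency = best_split(100, 2.0, 6.0)
print(split, latency)   # compute 75 chunks, load 25 -> 150 ms total
# vs. 200 ms for compute-only and 600 ms for load-only.
```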

[LG-110] Geometric Collaborative Filtering with Convergence

链接: https://arxiv.org/abs/2410.03064
作者: Hisham Husain,Julien Monteil
关键词-EN: modelling user-click interactions, user-click interactions due, Latent variable collaborative, latent collaborative filtering, collaborative filtering
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 13 pages, 1 figure, 3 tables

点击查看摘要

Abstract:Latent variable collaborative filtering methods have been a standard approach to modelling user-click interactions due to their simplicity and effectiveness. However, there is limited work on analyzing the mathematical properties of these methods, in particular on preventing overfitting towards the identity, and such methods typically utilize loss functions that overlook the geometry between items. In this work, we introduce a notion of generalization gap in collaborative filtering and analyze it with respect to latent collaborative filtering models. We present a geometric upper bound that gives rise to loss functions, and a way to meaningfully utilize the geometry of item metadata to improve recommendations. We show how these losses can be minimized and give the recipe for a new latent collaborative filtering algorithm, which we refer to as GeoCF, due to the geometric nature of our results. We then show experimentally that our proposed GeoCF algorithm can outperform all existing methods on the Movielens20M and Netflix datasets, as well as on two large-scale internal datasets. In summary, our work proposes a theoretically sound method which paves the way to a better understanding of generalization in collaborative filtering at large.

[LG-111] Towards an Improved Metric for Evaluating Disentangled Representations

链接: https://arxiv.org/abs/2410.03056
作者: Sahib Julka,Yashu Wang,Michael Granitzer
关键词-EN: Disentangled representation learning, making representations controllable, representation learning plays, Disentangled representation, interpretable and transferable
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Disentangled representation learning plays a pivotal role in making representations controllable, interpretable and transferable. Despite its significance in the domain, the quest for a reliable and consistent quantitative disentanglement metric remains a major challenge. This stems from the utilisation of diverse metrics measuring different properties and the potential bias introduced by their design. Our work undertakes a comprehensive examination of existing popular disentanglement evaluation metrics, comparing them in terms of measuring aspects of disentanglement (viz. Modularity, Compactness, and Explicitness), detecting the factor-code relationship, and describing the degree of disentanglement. We propose a new framework for quantifying disentanglement, introducing a metric entitled EDI, which leverages the intuitive concept of exclusivity and an improved factor-code relationship to minimize ad-hoc decisions. An in-depth analysis reveals that EDI measures essential properties while offering more stability than existing metrics, advocating for its adoption as a standardised approach.

[LG-112] Permissive Information-Flow Analysis for Large Language Models

链接: https://arxiv.org/abs/2410.03055
作者: Shoaib Ahmed Siddiqui,Radhika Gaonkar,Boris Köpf,David Krueger,Andrew Paverd,Ahmed Salem,Shruti Tople,Lukas Wutschitz,Menglin Xia,Santiago Zanella-Béguelin
关键词-EN: Large Language Models, larger software systems, Large Language, rapidly becoming commodity, larger software
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 16 pages, 11 figures

点击查看摘要

Abstract:Large Language Models (LLMs) are rapidly becoming commodity components of larger software systems. This poses natural security and privacy problems: poisoned data retrieved from one component can change the model's behavior and compromise the entire system, including coercing the model to spread confidential data to untrusted components. One promising approach is to tackle this problem at the system level via dynamic information flow (aka taint) tracking. Unfortunately, the traditional approach of propagating the most restrictive input label to the output is too conservative for applications where LLMs operate on inputs retrieved from diverse sources. In this paper, we propose a novel, more permissive approach to propagate information flow labels through LLM queries. The key idea behind our approach is to propagate only the labels of the samples that were influential in generating the model output and to eliminate the labels of unnecessary input. We implement and investigate the effectiveness of two variations of this approach, based on (i) prompt-based retrieval augmentation, and (ii) a k-nearest-neighbors language model. We compare these with the baseline of an introspection-based influence estimator that directly asks the language model to predict the output label. The results obtained highlight the superiority of our prompt-based label propagator, which improves the label in more than 85% of the cases in an LLM agent setting. These findings underscore the practicality of permissive label propagation for retrieval augmentation.
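
The permissive propagation rule, joining only the labels of influential inputs, can be sketched with a two-level policy lattice. The influence scores are stubbed out below, since estimating them (via retrieval augmentation or a k-nearest-neighbors LM) is the paper's actual contribution:

```python
def permissive_label(docs, influence, threshold=0.1):
    """Join only the labels of inputs whose influence on the output exceeds a
    threshold, instead of joining all input labels (the conservative rule)."""
    # Toy 2-level policy lattice: 'public' < 'confidential'.
    levels = {"public": 0, "confidential": 1}
    used = [d for d, s in zip(docs, influence) if s >= threshold]
    top = max((levels[d["label"]] for d in used), default=0)
    return [name for name, lv in levels.items() if lv == top][0]

docs = [{"text": "press release", "label": "public"},
        {"text": "internal memo", "label": "confidential"}]

# If the confidential memo barely influenced the answer, the output can keep
# the permissive 'public' label; the conservative rule would always taint it.
print(permissive_label(docs, influence=[0.9, 0.02]))   # public
print(permissive_label(docs, influence=[0.5, 0.5]))    # confidential
```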

[LG-113] Learning Structured Representations by Embedding Class Hierarchy with Fast Optimal Transport

链接: https://arxiv.org/abs/2410.03052
作者: Siqi Zeng,Sixian Du,Makoto Yamada,Han Zhao
关键词-EN: Cophenetic Correlation Coefficient, Correlation Coefficient, embed structured knowledge, Cophenetic Correlation, supervised learning
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:To embed structured knowledge within labels into feature representations, prior work (Zeng et al., 2022) proposed to use the Cophenetic Correlation Coefficient (CPCC) as a regularizer during supervised learning. This regularizer calculates pairwise Euclidean distances of class means and aligns them with the corresponding shortest path distances derived from the label hierarchy tree. However, class means may not be good representatives of the class conditional distributions, especially when they are multi-mode in nature. To address this limitation, under the CPCC framework, we propose to use the Earth Mover’s Distance (EMD) to measure the pairwise distances among classes in the feature space. We show that our exact EMD method generalizes previous work, and recovers the existing algorithm when class-conditional distributions are Gaussian in the feature space. To further improve the computational efficiency of our method, we introduce the Optimal Transport-CPCC family by exploring four EMD approximation variants. Our most efficient OT-CPCC variant runs in linear time in the size of the dataset, while maintaining competitive performance across datasets and tasks.
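
For reference, the base CPCC regularizer from the prior work (Zeng et al., 2022) that this paper generalizes can be sketched as follows: correlate pairwise Euclidean distances between class means with the label-tree distances. The paper's contribution is to replace the Euclidean class-mean term with EMD; the toy data here is ours:

```python
import numpy as np
from itertools import combinations

def cpcc(features, labels, tree_dist):
    """Cophenetic Correlation Coefficient: Pearson correlation between pairwise
    Euclidean distances of class means and shortest-path tree distances."""
    classes = sorted(set(labels))
    means = {c: features[labels == c].mean(axis=0) for c in classes}
    d_feat, d_tree = [], []
    for a, b in combinations(classes, 2):
        d_feat.append(np.linalg.norm(means[a] - means[b]))
        d_tree.append(tree_dist[(a, b)])
    return float(np.corrcoef(d_feat, d_tree)[0, 1])

rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 50)
# Classes 0 and 1 are siblings in the label tree; class 2 is distant.
centers = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
features = centers[labels] + 0.1 * rng.normal(size=(150, 2))
tree = {(0, 1): 2, (0, 2): 4, (1, 2): 4}
print(cpcc(features, labels, tree))   # close to 1: geometry matches hierarchy
```

Maximizing this correlation during training is what embeds the label hierarchy into the feature space; the EMD variant swaps `np.linalg.norm(means[a] - means[b])` for a distance between the full class-conditional distributions.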

[LG-114] Scalable Frame-based Construction of Sociocultural NormBases for Socially-Aware Dialogues

链接: https://arxiv.org/abs/2410.03049
作者: Shilin Qu,Weiqing Wang,Xin Zhou,Haolan Zhan,Zhuang Li,Lizhen Qu,Linhao Luo,Yuan-Fang Li,Gholamreza Haffari
关键词-EN: conversational information retrieval, including conversational information, retrieval-enhanced machine learning, Sociocultural norms serve, contextual information retrieval
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 17 pages

点击查看摘要

Abstract:Sociocultural norms serve as guiding principles for personal conduct in social interactions, emphasizing respect, cooperation, and appropriate behavior, which can benefit tasks including conversational information retrieval, contextual information retrieval and retrieval-enhanced machine learning. We propose a scalable approach for constructing a Sociocultural Norm (SCN) Base using Large Language Models (LLMs) for socially aware dialogues. We construct a comprehensive and publicly accessible Chinese Sociocultural NormBase. Our approach utilizes socially aware dialogues, enriched with contextual frames, as the primary data source to constrain the generating process and reduce hallucinations. This enables the extraction of high-quality and nuanced natural-language norm statements, leveraging the pragmatic implications of utterances with respect to the situation. As real dialogues annotated with gold frames are not readily available, we propose using synthetic data. Our empirical results show: (i) the quality of the SCNs derived from synthetic data is comparable to that from real dialogues annotated with gold frames, and (ii) the quality of the SCNs extracted from real data, annotated with either silver (predicted) or gold frames, surpasses that without the frame annotations. We further show the effectiveness of the extracted SCNs in a RAG-based (Retrieval-Augmented Generation) model for reasoning about multiple downstream dialogue tasks.

[LG-115] Towards Understanding the Feasibility of Machine Unlearning

链接: https://arxiv.org/abs/2410.03043
作者: Mahtab Sarvmaili,Hassan Sajjad,Ga Wu
关键词-EN: recent privacy regulations, attracted significant attention, privacy regulations, research community, unlearning
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In light of recent privacy regulations, machine unlearning has attracted significant attention in the research community. However, current studies predominantly assess the overall success of unlearning approaches, overlooking the varying difficulty of unlearning individual training samples. As a result, the broader feasibility of machine unlearning remains under-explored. This paper presents a set of novel metrics for quantifying the difficulty of unlearning by jointly considering the properties of target model and data distribution. Specifically, we propose several heuristics to assess the conditions necessary for a successful unlearning operation, examine the variations in unlearning difficulty across different training samples, and present a ranking mechanism to identify the most challenging samples to unlearn. We highlight the effectiveness of the Kernelized Stein Discrepancy (KSD), a parameterized kernel function tailored to each model and dataset, as a heuristic for evaluating unlearning difficulty. Our approach is validated through multiple classification tasks and established machine unlearning algorithms, demonstrating the practical feasibility of unlearning operations across diverse scenarios.

[LG-116] FedPeWS: Personalized Warmup via Subnetworks for Enhanced Heterogeneous Federated Learning

链接: https://arxiv.org/abs/2410.03042
作者: Nurbek Tastan,Samuel Horvath,Martin Takac,Karthik Nandakumar
关键词-EN: Statistical data heterogeneity, Statistical data, extreme data heterogeneity, data heterogeneity, significant barrier
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Statistical data heterogeneity is a significant barrier to convergence in federated learning (FL). While prior work has advanced heterogeneous FL through better optimization objectives, these methods fall short when there is extreme data heterogeneity among collaborating participants. We hypothesize that convergence under extreme data heterogeneity is primarily hindered due to the aggregation of conflicting updates from the participants in the initial collaboration rounds. To overcome this problem, we propose a warmup phase where each participant learns a personalized mask and updates only a subnetwork of the full model. This personalized warmup allows the participants to focus initially on learning specific subnetworks tailored to the heterogeneity of their data. After the warmup phase, the participants revert to standard federated optimization, where all parameters are communicated. We empirically demonstrate that the proposed personalized warmup via subnetworks (FedPeWS) approach improves accuracy and convergence speed over standard federated optimization methods.
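
One warmup round of the masked-subnetwork idea can be sketched on a toy federated least-squares problem. Note the masks are hand-fixed and disjoint here, whereas FedPeWS learns personalized masks, and after the warmup phase all parameters are communicated:

```python
import numpy as np

def warmup_round(global_w, client_data, masks, lr=0.1):
    """One FedPeWS-style warmup round: every client updates only its
    personalized subnetwork (binary mask), then the server averages."""
    updates = []
    for (X, y), m in zip(client_data, masks):
        w = global_w.copy()
        grad = X.T @ (X @ w - y) / len(y)   # least-squares gradient
        w -= lr * grad * m                  # touch only masked coordinates
        updates.append(w)
    return np.mean(updates, axis=0)

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5, 3.0])
X1, X2 = rng.normal(size=(64, 4)), rng.normal(size=(64, 4))
client_data = [(X1, X1 @ true_w), (X2, X2 @ true_w)]
masks = [np.array([1.0, 1.0, 0.0, 0.0]),   # client 1's subnetwork
         np.array([0.0, 0.0, 1.0, 1.0])]   # client 2's subnetwork

w = np.zeros(4)
for _ in range(300):
    w = warmup_round(w, client_data, masks)
print(np.round(w, 2))   # approaches true_w without conflicting updates
```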

[LG-117] Geometry is All You Need: A Unified Taxonomy of Matrix and Tensor Factorization for Compression of Generative Language Models

链接: https://arxiv.org/abs/2410.03040
作者: Mingxue Xu,Sadia Sharmin,Danilo P. Mandic
关键词-EN: Natural Language Processing, Language Processing, Natural Language, model systematic efficiency, language model parametrization
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Matrix and tensor-guided parametrization for Natural Language Processing (NLP) models is fundamentally useful for the improvement of the model’s systematic efficiency. However, the internal links between these two algebra structures and language model parametrization are poorly understood. Also, the existing matrix and tensor research is math-heavy and far away from machine learning (ML) and NLP research concepts. These two issues result in the recent progress on matrices and tensors for model parametrization being more like a loose collection of separate components from matrix/tensor and NLP studies, rather than a well-structured unified approach, further hindering algorithm design. To this end, we propose a unified taxonomy, which bridges the matrix/tensor compression approaches and model compression concepts in ML and NLP research. Namely, we adopt an elementary concept in linear algebra, that of a subspace, which is also the core concept in geometric algebra, to reformulate the matrix/tensor and ML/NLP concepts (e.g. attention mechanism) under one umbrella. In this way, based on our subspace formalization, typical matrix and tensor decomposition algorithms can be interpreted as geometric transformations. Finally, we revisit recent literature on matrix- or tensor-guided language model compression, rephrase and compare their core ideas, and then point out the current research gap and potential solutions.

[LG-118] Revealing the Unseen: Guiding Personalized Diffusion Models to Expose Training Data

链接: https://arxiv.org/abs/2410.03039
作者: Xiaoyu Wu,Jiaru Zhang,Steven Wu
关键词-EN: capture specific styles, Diffusion Models, styles or objects, evolved into advanced, small set
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Under review

点击查看摘要

Abstract:Diffusion Models (DMs) have evolved into advanced image generation tools, especially for few-shot fine-tuning where a pretrained DM is fine-tuned on a small set of images to capture specific styles or objects. Many people upload these personalized checkpoints online, fostering communities such as Civitai and HuggingFace. However, model owners may overlook the potential risks of data leakage by releasing their fine-tuned checkpoints. Moreover, concerns regarding copyright violations arise when unauthorized data is used during fine-tuning. In this paper, we ask: “Can training data be extracted from these fine-tuned DMs shared online?” A successful extraction would present not only data leakage threats but also offer tangible evidence of copyright infringement. To answer this, we propose FineXtract, a framework for extracting fine-tuning data. Our method approximates fine-tuning as a gradual shift in the model’s learned distribution – from the original pretrained DM toward the fine-tuning data. By extrapolating the models before and after fine-tuning, we guide the generation toward high-probability regions within the fine-tuned data distribution. We then apply a clustering algorithm to extract the most probable images from those generated using this extrapolated guidance. Experiments on DMs fine-tuned with datasets such as WikiArt, DreamBooth, and real-world checkpoints posted online validate the effectiveness of our method, extracting approximately 20% of fine-tuning data in most cases, significantly surpassing baseline performance.
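
The extrapolated guidance at the core of FineXtract has a simple form; a sketch with stand-in arrays for the two models' noise predictions (the guidance-weight notation w is ours, and the clustering step over generated images is omitted):

```python
import numpy as np

def extrapolated_score(eps_pretrained, eps_finetuned, w=2.0):
    """FineXtract-style guidance: extrapolate from the pretrained model through
    the fine-tuned one, pushing generation toward the fine-tuning data
    distribution (w > 1 means extrapolation rather than interpolation)."""
    return eps_pretrained + w * (eps_finetuned - eps_pretrained)

# Stand-in denoiser outputs at one sampling step (in practice these are the two
# models' noise predictions for the same x_t and t):
eps_pre = np.array([0.2, -0.1])
eps_ft = np.array([0.5, 0.3])
print(extrapolated_score(eps_pre, eps_ft, w=2.0))   # [0.8, 0.7]
```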

[LG-119] CPFD: Confidence-aware Privileged Feature Distillation for Short Video Classification CIKM2024

链接: https://arxiv.org/abs/2410.03038
作者: Jinghao Shi,Xiang Shen,Kaili Zhao,Xuedong Wang,Vera Wen,Zixuan Wang,Yifan Wu,Zhixin Zhang
关键词-EN: Privileged Feature Distillation, Dense features, Privileged Dense Features, short video classification, Privileged Dense
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Camera ready for CIKM 2024

点击查看摘要

Abstract:Dense features, customized for different business scenarios, are essential in short video classification. However, their complexity, specific adaptation requirements, and high computational costs make them resource-intensive and less accessible during online inference. Consequently, these dense features are categorized as `Privileged Dense Features'. Meanwhile, end-to-end multi-modal models have shown promising results in numerous computer vision tasks. In industrial applications, prioritizing end-to-end multi-modal features can enhance efficiency but often leads to the loss of valuable information from historical privileged dense features. To integrate both features while maintaining efficiency and manageable resource costs, we present Confidence-aware Privileged Feature Distillation (CPFD), which empowers the features of an end-to-end multi-modal model by adaptively distilling privileged features during training. Unlike existing privileged feature distillation (PFD) methods, which apply uniform weights to all instances during distillation, potentially causing unstable performance across different business scenarios and a notable performance gap between the teacher model (dense-feature-enhanced multi-modal model, DF-X-VLM) and the student model (multi-modal model only, X-VLM), our CPFD leverages confidence scores derived from the teacher model to adaptively mitigate the performance variance with the student model. We conducted extensive offline experiments on five diverse tasks demonstrating that CPFD improves the video classification F1 score by 6.76% compared with the end-to-end multi-modal model (X-VLM) and by 2.31% with vanilla PFD on average. It also reduces the performance gap by 84.6% and achieves results comparable to the teacher model DF-X-VLM. The effectiveness of CPFD is further substantiated by online experiments, and our framework has been deployed in production systems for over a dozen models.
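
The confidence-aware weighting can be illustrated as a per-instance distillation loss scaled by the teacher's confidence, so the student trusts the privileged-feature teacher only where the teacher is sure. The exact weighting form below is an assumption for illustration, not the paper's formula:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cpfd_loss(student_logits, teacher_logits, labels):
    """Task cross-entropy plus a distillation term weighted per instance by
    the teacher's confidence (its max predicted probability)."""
    p_t = softmax(teacher_logits)
    p_s = softmax(student_logits)
    conf = p_t.max(axis=-1)                                  # teacher confidence
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)  # KL(teacher || student)
    ce = -np.log(p_s[np.arange(len(labels)), labels])        # task loss
    return float(np.mean(ce + conf * kl))

student = np.array([[2.0, 0.1, -1.0], [0.3, 0.2, 0.1]])
teacher = np.array([[4.0, -1.0, -2.0], [0.4, 0.3, 0.2]])   # confident / unsure
labels = np.array([0, 1])
print(cpfd_loss(student, teacher, labels))
```

Where the teacher is near-uniform (second row), its KL term is down-weighted, which is the mechanism CPFD uses to avoid distilling unreliable privileged knowledge uniformly.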

[LG-120] Disentangling Textual and Acoustic Features of Neural Speech Representations

链接: https://arxiv.org/abs/2410.03037
作者: Hosein Mohebbi,Grzegorz Chrupała,Willem Zuidema,Afra Alishahi,Ivan Titov
关键词-EN: deeply entangled internal, Neural speech models, build deeply entangled, entangled internal representations, fundamental frequency
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Neural speech models build deeply entangled internal representations, which capture a variety of features (e.g., fundamental frequency, loudness, syntactic category, or semantic content of a word) in a distributed encoding. This complexity makes it difficult to track the extent to which such representations rely on textual and acoustic information, or to suppress the encoding of acoustic features that may pose privacy risks (e.g., gender or speaker identity) in critical, real-world applications. In this paper, we build upon the Information Bottleneck principle to propose a disentanglement framework that separates complex speech representations into two distinct components: one encoding content (i.e., what can be transcribed as text) and the other encoding acoustic features relevant to a given downstream task. We apply and evaluate our framework to emotion recognition and speaker identification downstream tasks, quantifying the contribution of textual and acoustic features at each model layer. Additionally, we explore the application of our disentanglement framework as an attribution method to identify the most salient speech frame representations from both the textual and acoustic perspectives.

[LG-121] MLP-KAN: Unifying Deep Representation and Function Learning

链接: https://arxiv.org/abs/2410.03027
作者: Yunhong He,Yifeng Xie,Zhengqing Yuan,Lichao Sun
关键词-EN: demonstrated substantial promise, Recent advancements, function learning, artificial intelligence, demonstrated substantial
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Recent advancements in both representation learning and function learning have demonstrated substantial promise across diverse domains of artificial intelligence. However, the effective integration of these paradigms poses a significant challenge, particularly in cases where users must manually decide whether to apply a representation learning or function learning model based on dataset characteristics. To address this issue, we introduce MLP-KAN, a unified method designed to eliminate the need for manual model selection. By integrating Multi-Layer Perceptrons (MLPs) for representation learning and Kolmogorov-Arnold Networks (KANs) for function learning within a Mixture-of-Experts (MoE) architecture, MLP-KAN dynamically adapts to the specific characteristics of the task at hand, ensuring optimal performance. Embedded within a transformer-based framework, our work achieves remarkable results on four widely-used datasets across diverse domains. Extensive experimental evaluation demonstrates its superior versatility, delivering competitive performance across both deep representation and function learning tasks. These findings highlight the potential of MLP-KAN to simplify the model selection process, offering a comprehensive, adaptable solution across various domains. Our code and weights are available at this https URL.

[LG-122] Characterizing Context Influence and Hallucination in Summarization

链接: https://arxiv.org/abs/2410.03026
作者: James Flemings,Wanrong Zhang,Bo Jiang,Zafar Takhirov,Murali Annavaram
关键词-EN: Large Language Models, Large Language, numerous downstream tasks, achieved remarkable performance, Language Models
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Although Large Language Models (LLMs) have achieved remarkable performance in numerous downstream tasks, their ubiquity has raised two significant concerns. One is that LLMs can hallucinate by generating content that contradicts relevant contextual information; the other is that LLMs can inadvertently leak private information due to input regurgitation. Many prior works have extensively studied each concern independently, but none have investigated them simultaneously. Furthermore, auditing the influence of provided context during open-ended generation with a privacy emphasis is understudied. To this end, we comprehensively characterize the influence and hallucination of contextual information during summarization. We introduce a definition for context influence and Context-Influence Decoding (CID), and then we show that amplifying the context (by factoring out prior knowledge) and the context being out of distribution with respect to prior knowledge increase the context's influence on an LLM. Moreover, we show that context influence gives a lower bound on the private information leakage of CID. We corroborate our analytical findings with experimental evaluations showing that improving the F1 ROUGE-L score on CNN-DM for LLaMA 3 by 10% over regular decoding also leads to 1.5x more influence by the context. Moreover, we empirically evaluate how context influence and hallucination are affected by (1) model capacity, (2) context size, (3) the length of the current response, and (4) different token n-grams of the context. Our code can be accessed here: this https URL.
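
Amplifying the context by factoring out prior knowledge is a contrastive-style decoding rule; a sketch of the standard form on toy next-token logits (the paper's exact definition of CID may differ):

```python
import numpy as np

def cid_logits(logits_with_ctx, logits_without_ctx, alpha=1.0):
    """Context-amplified logits: boost what the context adds over the model's
    context-free prior prediction."""
    return (1 + alpha) * logits_with_ctx - alpha * logits_without_ctx

# Toy next-token logits over a 3-word vocabulary:
with_ctx = np.array([2.0, 1.8, 0.5])     # context nudges toward token 1
without_ctx = np.array([2.5, 0.5, 0.5])  # prior strongly prefers token 0
print(np.argmax(with_ctx), np.argmax(cid_logits(with_ctx, without_ctx)))
```

Factoring out the prior flips the greedy choice to the context-supported token, which is exactly the amplification whose privacy cost the paper then bounds via context influence.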

[LG-123] Flow Matching with Gaussian Process Priors for Probabilistic Time Series Forecasting

链接: https://arxiv.org/abs/2410.03024
作者: Marcel Kollovieh,Marten Lienen,David Lüdke,Leo Schwinn,Stephan Günnemann
关键词-EN: Recent advancements, time series modeling, opened new directions, series modeling, generative modeling
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Recent advancements in generative modeling, particularly diffusion models, have opened new directions for time series modeling, achieving state-of-the-art performance in forecasting and synthesis. However, the reliance of diffusion-based models on a simple, fixed prior complicates the generative process since the data and prior distributions differ significantly. We introduce TSFlow, a conditional flow matching (CFM) model for time series that simplifies the generative problem by combining Gaussian processes, optimal transport paths, and data-dependent prior distributions. By incorporating (conditional) Gaussian processes, TSFlow aligns the prior distribution more closely with the temporal structure of the data, enhancing both unconditional and conditional generation. Furthermore, we propose conditional prior sampling to enable probabilistic forecasting with an unconditionally trained model. In our experimental evaluation on eight real-world datasets, we demonstrate the generative capabilities of TSFlow, producing high-quality unconditional samples. Finally, we show that both conditionally and unconditionally trained models achieve competitive results in forecasting benchmarks, surpassing other methods on 6 out of 8 datasets.
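摘要的核心是用高斯过程先验替代标准正态先验,并沿最优传输路径做条件流匹配(CFM)。下面是一个极简示意(RBF 核、线性插值路径等细节均为笔者假设,非 TSFlow 的完整实现):

```python
import numpy as np

def rbf_gp_sample(n, length_scale=5.0, rng=None):
    """从 RBF 核高斯过程采一条长度 n 的先验轨迹,
    使先验带有与时间序列相近的时间相关结构(示意)。"""
    rng = rng or np.random.default_rng(0)
    t = np.arange(n)[:, None]
    K = np.exp(-((t - t.T) ** 2) / (2 * length_scale**2)) + 1e-6 * np.eye(n)
    return rng.multivariate_normal(np.zeros(n), K)

def cfm_pair(x1, x0, t):
    """最优传输路径下的条件流匹配训练对(示意):
    x_t = (1 - t) * x0 + t * x1,回归目标速度 u = x1 - x0;
    模型 v_theta(x_t, t) 以拟合该速度为训练目标。"""
    xt = (1 - t) * x0 + t * x1
    return xt, x1 - x0
```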

[LG-124] On Logical Extrapolation for Mazes with Recurrent and Implicit Networks

链接: https://arxiv.org/abs/2410.03020
作者: Brandon Knutson,Amandin Chyba Rabeendran,Michael Ivanitskiy,Jordan Pettyjohn,Cecilia Diniz-Behn,Samy Wu Fung,Daniel McKenzie
关键词-EN: architectures-particularly recurrent neural, neural network architectures-particularly, recurrent neural networks, implicit neural networks, Recent work
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Recent work has suggested that certain neural network architectures, particularly recurrent neural networks (RNNs) and implicit neural networks (INNs), are capable of logical extrapolation. That is, one may train such a network on easy instances of a specific task and then apply it successfully to more difficult instances of the same task. In this paper, we revisit this idea and show that (i) the capacity for extrapolation is less robust than previously suggested. Specifically, in the context of a maze-solving task, we show that while INNs (and some RNNs) are capable of generalizing to larger maze instances, they fail to generalize along axes of difficulty other than maze size. (ii) Models that are explicitly trained to converge to a fixed point (e.g. the INN we test) are likely to do so when extrapolating, while models that are not (e.g. the RNN we test) may exhibit more exotic limiting behaviour such as limit cycles, even when they correctly solve the problem. Our results suggest that (i) further study into why such networks extrapolate easily along certain axes of difficulty yet struggle with others is necessary, and (ii) analyzing the dynamics of extrapolation may yield insights into designing more efficient and interpretable logical extrapolators.

[LG-125] Learning a Fast Mixing Exogenous Block MDP using a Single Trajectory

链接: https://arxiv.org/abs/2410.03016
作者: Alexander Levine,Peter Stone,Amy Zhang
关键词-EN: Markov Decision Process, Block Markov Decision, Exogenous Block Markov, controllable latent space, sequential decision-making environments
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In order to train agents that can quickly adapt to new objectives or reward functions, efficient unsupervised representation learning in sequential decision-making environments can be important. Frameworks such as the Exogenous Block Markov Decision Process (Ex-BMDP) have been proposed to formalize this representation-learning problem (Efroni et al., 2022b). In the Ex-BMDP framework, the agent’s high-dimensional observations of the environment have two latent factors: a controllable factor, which evolves deterministically within a small state space according to the agent’s actions, and an exogenous factor, which represents time-correlated noise, and can be highly complex. The goal of the representation learning problem is to learn an encoder that maps from observations into the controllable latent space, as well as the dynamics of this space. Efroni et al. (2022b) has shown that this is possible with a sample complexity that depends only on the size of the controllable latent space, and not on the size of the noise factor. However, this prior work has focused on the episodic setting, where the controllable latent state resets to a specific start state after a finite horizon. By contrast, if the agent can only interact with the environment in a single continuous trajectory, prior works have not established sample-complexity bounds. We propose STEEL, the first provably sample-efficient algorithm for learning the controllable dynamics of an Ex-BMDP from a single trajectory, in the function approximation setting. STEEL has a sample complexity that depends only on the sizes of the controllable latent space and the encoder function class, and (at worst linearly) on the mixing time of the exogenous noise factor. We prove that STEEL is correct and sample-efficient, and demonstrate STEEL on two toy problems. Code is available at: this https URL. 

[LG-126] MMP: Towards Robust Multi-Modal Learning with Masked Modality Projection

链接: https://arxiv.org/abs/2410.03010
作者: Niki Nezakati,Md Kaykobad Reza,Ameya Patil,Mashhour Solh,M. Salman Asif
关键词-EN: Multimodal learning seeks, multiple input sources, Multimodal learning, downstream tasks, modalities
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multimodal learning seeks to combine data from multiple input sources to enhance the performance of different downstream tasks. In real-world scenarios, performance can degrade substantially if some input modalities are missing. Existing methods that can handle missing modalities involve custom training or adaptation steps for each input modality combination. These approaches are either tied to specific modalities or become computationally expensive as the number of input modalities increases. In this paper, we propose Masked Modality Projection (MMP), a method designed to train a single model that is robust to any missing modality scenario. We achieve this by randomly masking a subset of modalities during training and learning to project available input modalities to estimate the tokens for the masked modalities. This approach enables the model to effectively learn to leverage the information from the available modalities to compensate for the missing ones, enhancing missing modality robustness. We conduct a series of experiments with various baseline models and datasets to assess the effectiveness of this strategy. Experiments demonstrate that our approach improves robustness to different missing modality scenarios, outperforming existing methods designed for missing modalities or specific modality combinations.
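摘要中"随机遮蔽一部分模态,再用可用模态投影估计被遮蔽模态的 token"这一步可以写成如下骨架(线性投影、平均聚合等细节均为示意假设,非原始实现):

```python
import numpy as np

def masked_modality_projection(tokens, proj_mats, p_mask=0.5, rng=None):
    """MMP 训练步骤的示意:随机遮蔽一部分模态,
    用可用模态的 token 经投影去估计被遮蔽模态的 token。
    tokens: {模态名: (d,) 向量}; proj_mats[(src, tgt)]: (d, d) 投影矩阵(假设为线性)。"""
    rng = rng or np.random.default_rng(0)
    names = list(tokens)
    masked = [m for m in names if rng.random() < p_mask]
    avail = [m for m in names if m not in masked]
    if not avail:                      # 至少保留一个可用模态
        avail, masked = names[:1], names[1:]
    estimates = {}
    for tgt in masked:                 # 平均所有可用模态的投影估计
        estimates[tgt] = np.mean(
            [proj_mats[(src, tgt)] @ tokens[src] for src in avail], axis=0)
    return estimates, masked, avail
```

推理阶段若某模态真的缺失,就复用同一套投影为它补出估计 token,这正是"对任意缺失模态组合保持鲁棒"的来源。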

[LG-127] Formation of Representations in Neural Networks

链接: https://arxiv.org/abs/2410.03006
作者: Liu Ziyin,Isaac Chuang,Tomer Galanti,Tomaso Poggio
关键词-EN: Understanding neural representations, scientific understanding, Understanding neural, modern neural networks, neural networks
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn)
*备注: preprint

点击查看摘要

Abstract:Understanding neural representations will help open the black box of neural networks and advance our scientific understanding of modern AI systems. However, how complex, structured, and transferable representations emerge in modern neural networks has remained a mystery. Building on previous results, we propose the Canonical Representation Hypothesis (CRH), which posits a set of six alignment relations to universally govern the formation of representations in most hidden layers of a neural network. Under the CRH, the latent representations (R), weights (W), and neuron gradients (G) become mutually aligned during training. This alignment implies that neural networks naturally learn compact representations, where neurons and weights are invariant to task-irrelevant transformations. We then show that the breaking of CRH leads to the emergence of reciprocal power-law relations between R, W, and G, which we refer to as the Polynomial Alignment Hypothesis (PAH). We present a minimal-assumption theory demonstrating that the balance between gradient noise and regularization is crucial for the emergence of the canonical representation. The CRH and PAH lead to an exciting possibility of unifying major key deep learning phenomena, including neural collapse and the neural feature ansatz, in a single framework.

[LG-128] Towards Universal Certified Robustness with Multi-Norm Training

链接: https://arxiv.org/abs/2410.03000
作者: Enyi Jiang,Gagandeep Singh
关键词-EN: Existing certified training, Existing certified, certified training, training, certified
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Existing certified training methods can only train models to be robust against a certain perturbation type (e.g. ℓ∞ or ℓ2). However, an ℓ∞ certifiably robust model may not be certifiably robust against ℓ2 perturbation (and vice versa) and also has low robustness against other perturbations (e.g. geometric transformation). To this end, we propose the first multi-norm certified training framework CURE, consisting of a new ℓ2 deterministic certified training defense and several multi-norm certified training methods, to attain better union robustness when training from scratch or fine-tuning a pre-trained certified model. Further, we devise bound alignment and connect natural training with certified training for better union robustness. Compared with SOTA certified training, CURE improves union robustness up to 22.8% on MNIST, 23.9% on CIFAR-10, and 8.0% on TinyImagenet. Further, it leads to better generalization on a diverse set of challenging unseen geometric perturbations, up to 6.8% on CIFAR-10. Overall, our contributions pave a path towards universal certified robustness.

[LG-129] Q-SCALE: Quantum computing-based Sensor Calibration for Advanced Learning and Efficiency

链接: https://arxiv.org/abs/2410.02998
作者: Lorenzo Bergadano,Andrea Ceschini,Pietro Chiavassa,Edoardo Giusto,Bartolomeo Montrucchio,Massimo Panella,Antonello Rosato
关键词-EN: Quantum Machine Learning, utilizing Quantum Computing, Machine Learning, techniques utilizing Quantum, quality monitoring systems
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
*备注: Accepted at QCE24

点击查看摘要

Abstract:In a world burdened by air pollution, the integration of state-of-the-art sensor calibration techniques utilizing Quantum Computing (QC) and Machine Learning (ML) holds promise for enhancing the accuracy and efficiency of air quality monitoring systems in smart cities. This article investigates the process of calibrating inexpensive optical fine-dust sensors through advanced methodologies such as Deep Learning (DL) and Quantum Machine Learning (QML). The objective of the project is to compare four sophisticated algorithms from both the classical and quantum realms to discern their disparities and explore possible alternative approaches to improve the precision and dependability of particulate matter measurements in urban air quality surveillance. Classical Feed-Forward Neural Networks (FFNN) and Long Short-Term Memory (LSTM) models are evaluated against their quantum counterparts: Variational Quantum Regressors (VQR) and Quantum LSTM (QLSTM) circuits. Through meticulous testing, including hyperparameter optimization and cross-validation, the study assesses the potential of quantum models to refine calibration performance. Our analysis shows that: the FFNN model achieved superior calibration accuracy on the test set compared to the VQR model in terms of lower L1 loss function (2.92 vs 4.81); the QLSTM slightly outperformed the LSTM model (loss on the test set: 2.70 vs 2.77), despite using fewer trainable weights (66 vs 482).

[LG-130] Finite-Sample Analysis of the Monte Carlo Exploring Starts Algorithm for Reinforcement Learning

链接: https://arxiv.org/abs/2410.02994
作者: Suei-Wen Chen,Keith Ross,Pierre Youssef
关键词-EN: Monte Carlo Exploring, Carlo Exploring Starts, Monte Carlo, Exploring Starts, Carlo Exploring
类目: Machine Learning (cs.LG)
*备注: 13 pages

点击查看摘要

Abstract:Monte Carlo Exploring Starts (MCES), which aims to learn the optimal policy using only sample returns, is a simple and natural algorithm in reinforcement learning which has been shown to converge under various conditions. However, the convergence rate analysis for MCES-style algorithms in the form of sample complexity has received very little attention. In this paper we develop a finite sample bound for a modified MCES algorithm which solves the stochastic shortest path problem. To this end, we prove a novel result on the convergence rate of the policy iteration algorithm. This result implies that with probability at least 1−δ, the algorithm returns an optimal policy after Õ(SAK³ log³(1/δ)) sampled episodes, where S and A denote the number of states and actions respectively, K is a proxy for episode length, and Õ hides logarithmic factors and constants depending on the rewards of the environment that are assumed to be known.
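论文分析的是求解随机最短路问题的修改版 MCES;为便于理解"仅靠采样回报学最优策略"的含义,下面给出教科书式的 every-visit MCES 骨架(环境接口 step 与各默认参数均为示意,并非论文算法):

```python
import random

def mces(n_states, n_actions, step, n_episodes=1000, gamma=0.9, horizon=50, seed=0):
    """Monte Carlo Exploring Starts 的教科书式示意(every-visit 版):
    每回合从随机 (s, a) 出发(exploring start),
    用采样回报的增量均值估计 Q,再对 Q 取贪心改进策略。
    step(s, a) -> (s_next, r, done) 为环境转移函数。"""
    rnd = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    N = [[0] * n_actions for _ in range(n_states)]
    policy = [0] * n_states
    for _ in range(n_episodes):
        s, a = rnd.randrange(n_states), rnd.randrange(n_actions)  # exploring start
        traj = []
        for _ in range(horizon):               # 截断回合长度(对应摘要中 K 的角色)
            s_next, r, done = step(s, a)
            traj.append((s, a, r))
            if done:
                break
            s, a = s_next, policy[s_next]
        G = 0.0
        for s, a, r in reversed(traj):         # 反向累积折扣回报
            G = r + gamma * G
            N[s][a] += 1
            Q[s][a] += (G - Q[s][a]) / N[s][a]
            policy[s] = max(range(n_actions), key=lambda x: Q[s][x])
    return policy, Q
```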

[LG-131] Differentiation and Specialization of Attention Heads via the Refined Local Learning Coefficient

链接: https://arxiv.org/abs/2410.02984
作者: George Wang,Jesse Hoogland,Stan van Wingerden,Zach Furman,Daniel Murfet
关键词-EN: Local Learning Coefficient, introduce refined variants, singular learning theory, model complexity grounded, transformer language models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We introduce refined variants of the Local Learning Coefficient (LLC), a measure of model complexity grounded in singular learning theory, to study the development of internal structure in transformer language models during training. By applying these refined LLCs (rLLCs) to individual components of a two-layer attention-only transformer, we gain novel insights into the progressive differentiation and specialization of attention heads. Our methodology reveals how attention heads differentiate into distinct functional roles over the course of training, analyzes the types of data these heads specialize to process, and discovers a previously unidentified multigram circuit. These findings demonstrate that rLLCs provide a principled, quantitative toolkit for developmental interpretability, which aims to understand models through their evolution across the learning process. More broadly, this work takes a step towards establishing the correspondence between data distributional structure, geometric properties of the loss landscape, learning dynamics, and emergent computational structures in neural networks.

[LG-132] DecTrain: Deciding When to Train a DNN Online

链接: https://arxiv.org/abs/2410.02980
作者: Zih-Sing Fu,Soumya Sudhakar,Sertac Karaman,Vivienne Sze
关键词-EN: Deep neural networks, Deep neural, DNN, deployment data differs, neural networks
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 8 pages

点击查看摘要

Abstract:Deep neural networks (DNNs) can deteriorate in accuracy when deployment data differs from training data. While performing online training at all timesteps can improve accuracy, it is computationally expensive. We propose DecTrain, a new algorithm that decides when to train a monocular depth DNN online using self-supervision with low overhead. To make the decision at each timestep, DecTrain compares the cost of training with the predicted accuracy gain. We evaluate DecTrain on out-of-distribution data, and find DecTrain maintains accuracy compared to online training at all timesteps, while training only 44% of the time on average. We also compare the recovery of a low inference cost DNN using DecTrain and a more generalizable high inference cost DNN on various sequences. DecTrain recovers the majority (97%) of the accuracy gain of online training at all timesteps while reducing computation compared to the high inference cost DNN which recovers only 66%. With an even smaller DNN, we achieve 89% recovery while reducing computation by 56%. DecTrain enables low-cost online training for a smaller DNN to have competitive accuracy with a larger, more generalizable DNN at a lower overall computational cost.
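摘要中 DecTrain 在每个时间步"比较训练开销与预测精度收益"再决定是否在线训练,其决策骨架可以写成如下示意(线性的开销折算与权衡系数均为笔者假设,非原文公式):

```python
def dectrain_decide(pred_gain, train_cost, tradeoff=1.0):
    """DecTrain 决策规则的示意:当预测的精度收益
    超过按权衡系数折算的训练开销时,才在该时间步在线训练。"""
    return pred_gain > tradeoff * train_cost

def run_schedule(pred_gains, train_cost, tradeoff=1.0):
    """在一段部署序列上应用决策规则,返回各步决策与实际训练比例
    (对应摘要中"平均只训练 44% 的时间步"这类统计)。"""
    decisions = [dectrain_decide(g, train_cost, tradeoff) for g in pred_gains]
    return decisions, sum(decisions) / len(decisions)
```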

[LG-133] An explainable approach to detect case law on housing and eviction issues within the HUDOC database

链接: https://arxiv.org/abs/2410.02978
作者: Mohammad Mohammadi,Martijn Wieling,Michel Vols
关键词-EN: Case law, understanding of human, Court of Human, instrumental in shaping, shaping our understanding
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Case law is instrumental in shaping our understanding of human rights, including the right to adequate housing. The HUDOC database provides access to the textual content of case law from the European Court of Human Rights (ECtHR), along with some metadata. While this metadata includes valuable information, such as the application number and the articles addressed in a case, it often lacks detailed substantive insights, such as the specific issues a case covers. This underscores the need for detailed analysis to extract such information. However, given the size of the database - containing over 40,000 cases - an automated solution is essential. In this study, we focus on the right to adequate housing and aim to build models to detect cases related to housing and eviction issues. Our experiments show that the resulting models not only provide performance comparable to more sophisticated approaches but are also interpretable, offering explanations for their decisions by highlighting the most influential words. The application of these models led to the identification of new cases that were initially overlooked during data collection. This suggests that NLP approaches can be effectively applied to categorise case law based on the specific issues they address.
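摘要强调模型"通过高亮最有影响力的词解释决策"。这类可解释分类的最小基线是词袋特征加逻辑回归,按系数大小排序即可得到每个词对"涉及住房/驱逐"判定的贡献(下列流程与玩具数据均为示意,并非论文所用管线):

```python
import numpy as np

def train_word_classifier(docs, labels, vocab, epochs=500, lr=0.5):
    """词袋 + 逻辑回归的可解释分类示意:
    docs 为分词后的文档列表,返回各词的系数及按贡献排序的词表。"""
    X = np.array([[doc.count(w) for w in vocab] for doc in docs], float)
    y = np.array(labels, float)
    w = np.zeros(len(vocab))
    for _ in range(epochs):                     # 全批量梯度下降
        p = 1 / (1 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    ranked = sorted(zip(vocab, w), key=lambda t: -t[1])
    return w, ranked
```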

[LG-134] Learning Optimal Control and Dynamical Structure of Global Trajectory Search Problems with Diffusion Models

链接: https://arxiv.org/abs/2410.02976
作者: Jannik Graebner,Anjian Li,Amlan Sinha,Ryne Beeson
关键词-EN: Spacecraft trajectory design, revealed specific solution, Spacecraft trajectory, specific solution structures, data-driven methods
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注: This paper was presented at the AAS/AIAA Astrodynamics Specialist Conference

点击查看摘要

Abstract:Spacecraft trajectory design is a global search problem, where previous work has revealed specific solution structures that can be captured with data-driven methods. This paper explores two global search problems in the circular restricted three-body problem: hybrid cost function of minimum fuel/time-of-flight and transfers to energy-dependent invariant manifolds. These problems display a fundamental structure either in the optimal control profile or the use of dynamical structures. We build on our prior generative machine learning framework to apply diffusion models to learn the conditional probability distribution of the search problem and analyze the model’s capability to capture these structures.

[LG-135] F-Fidelity: A Robust Framework for Faithfulness Evaluation of Explainable AI

链接: https://arxiv.org/abs/2410.02970
作者: Xu Zheng,Farhad Shirani,Zhuomin Chen,Chaohao Lin,Wei Cheng,Wenbo Guo,Dongsheng Luo
关键词-EN: XAI methods remains, XAI methods, XAI, research has developed, developed a number
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Preprint; 26 pages, 4 figures

点击查看摘要

Abstract:Recent research has developed a number of eXplainable AI (XAI) techniques. Although extracting meaningful insights from deep learning models, how to properly evaluate these XAI methods remains an open problem. The most widely used approach is to perturb or even remove what the XAI method considers to be the most important features in an input and observe the changes in the output prediction. This approach, although efficient, suffers from the Out-of-Distribution (OOD) problem as the perturbed samples may no longer follow the original data distribution. A recent method RemOve And Retrain (ROAR) solves the OOD issue by retraining the model with perturbed samples guided by explanations. However, the training may not always converge given the distribution difference. Furthermore, using the model retrained based on XAI methods to evaluate these explainers may cause information leakage and thus lead to unfair comparisons. We propose Fine-tuned Fidelity (F-Fidelity), a robust evaluation framework for XAI, which utilizes i) an explanation-agnostic fine-tuning strategy, thus mitigating the information leakage issue and ii) a random masking operation that ensures that the removal step does not generate an OOD input. We designed controlled experiments with state-of-the-art (SOTA) explainers and their degraded version to verify the correctness of our framework. We conducted experiments on multiple data structures, such as images, time series, and natural language. The results demonstrate that F-Fidelity significantly improves upon prior evaluation metrics in recovering the ground-truth ranking of the explainers. Furthermore, we show both theoretically and empirically that, given a faithful explainer, the F-Fidelity metric can be used to compute the sparsity of influential input components, i.e., to extract the true explanation size.
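摘要评估的是"遮蔽解释器认为重要的特征后预测下降多少"这类忠实度指标。下面是一个与之同族的极简示意(并非 F-Fidelity 的原始定义:用"遮蔽最重要 k 个特征的预测下降"减去"随机遮蔽 k 个特征的平均下降",遮蔽以置零实现,均为假设):

```python
import numpy as np

def fidelity_score(model, x, importance, k, n_trials=20, rng=None):
    """忠实度评估的骨架示意:解释越忠实,
    遮蔽其标注的 top-k 特征应比随机遮蔽造成更大的预测下降。"""
    rng = rng or np.random.default_rng(0)
    base = model(x)
    top = np.argsort(importance)[-k:]          # 解释器给出的最重要 k 个特征
    x_top = x.copy(); x_top[top] = 0.0
    drop_top = base - model(x_top)
    drops = []
    for _ in range(n_trials):                  # 随机遮蔽基线
        idx = rng.choice(len(x), size=k, replace=False)
        x_r = x.copy(); x_r[idx] = 0.0
        drops.append(base - model(x_r))
    return drop_top - float(np.mean(drops))
```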

[LG-136] Label-Free Subjective Player Experience Modelling via Let's Play Videos AAAI

链接: https://arxiv.org/abs/2410.02967
作者: Dave Goel,Athar Mahmoudi-Nejad,Matthew Guzdial
关键词-EN: Player Experience Modelling, Player Experience, Experience Modelling, techniques applied, Modelling
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 9 pages, 3 figures, AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment

点击查看摘要

Abstract:Player Experience Modelling (PEM) is the study of AI techniques applied to modelling a player’s experience within a video game. PEM development can be labour-intensive, requiring expert hand-authoring or specialized data collection. In this work, we propose a novel PEM development approach, approximating player experience from gameplay video. We evaluate this approach predicting affect in the game Angry Birds via a human subject study. We validate that our PEM can strongly correlate with self-reported and sensor measures of affect, demonstrating the potential of this approach.

[LG-137] Bushfire Severity Modelling and Future Trend Prediction Across Australia: Integrating Remote Sensing and Machine Learning

链接: https://arxiv.org/abs/2410.02963
作者: Shouthiri Partheepan,Farzad Sanati,Jahan Hassan
关键词-EN: major natural disasters, natural disasters, huge losses, losses to livelihoods, major natural
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 15 pages

点击查看摘要

Abstract:Bushfire is one of the major natural disasters that cause huge losses to livelihoods and the environment. Understanding and analyzing the severity of bushfires is crucial for effective management and mitigation strategies, helping to prevent the extensive damage and loss caused by these natural disasters. This study presents an in-depth analysis of bushfire severity in Australia over the last twelve years, combining remote sensing data and machine learning techniques to predict future fire trends. By utilizing Landsat imagery and integrating spectral indices like NDVI, NBR, and Burn Index, along with topographical and climatic factors, we developed a robust predictive model using XGBoost. The model achieved high accuracy, 86.13%, demonstrating its effectiveness in predicting fire severity across diverse Australian ecosystems. By analyzing historical trends and integrating factors such as population density and vegetation cover, we identify areas at high risk of future severe bushfires. Additionally, this research identifies key regions at risk, providing data-driven recommendations for targeted firefighting efforts. The findings contribute valuable insights into fire management strategies, enhancing resilience to future fire events in Australia. Also, we propose future work on developing a UAV-based swarm coordination model to enhance fire prediction in real-time and firefighting capabilities in the most vulnerable regions.
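摘要中用到的 NDVI、NBR 等光谱指数计算很直接;火前减火后的 NBR 差值(dNBR)常按 USGS 的通用参考阈值粗分严重度等级(下列阈值为常见参考值,未必与论文完全一致):

```python
def ndvi(nir, red):
    """归一化植被指数 NDVI = (NIR - Red) / (NIR + Red)。"""
    return (nir - red) / (nir + red)

def nbr(nir, swir):
    """归一化燃烧指数 NBR = (NIR - SWIR) / (NIR + SWIR)。"""
    return (nir - swir) / (nir + swir)

def dnbr_severity(nbr_pre, nbr_post):
    """按常用参考阈值把 dNBR = NBR(火前) - NBR(火后) 粗分为严重度等级(示意)。"""
    d = nbr_pre - nbr_post
    if d < 0.1:
        return "unburned/low"
    if d < 0.27:
        return "low"
    if d < 0.44:
        return "moderate-low"
    if d < 0.66:
        return "moderate-high"
    return "high"
```

这类指数再叠加地形、气候与人口密度等特征,即可作为 XGBoost 等模型的输入。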

[LG-138] AutoML-Agent : A Multi-Agent LLM Framework for Full-Pipeline AutoML

链接: https://arxiv.org/abs/2410.02958
作者: Patara Trirat,Wonyong Jeong,Sung Ju Hwang
关键词-EN: Automated machine learning, Automated machine, machine learning, hyperparameter tuning, Automated
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
*备注: 47 pages, 5 figures

点击查看摘要

Abstract:Automated machine learning (AutoML) accelerates AI development by automating tasks in the development pipeline, such as optimal model search and hyperparameter tuning. Existing AutoML systems often require technical expertise to set up complex tools, which is in general time-consuming and requires a large amount of human effort. Therefore, recent works have started exploiting large language models (LLM) to lessen such burden and increase the usability of AutoML frameworks via a natural language interface, allowing non-expert users to build their data-driven solutions. These methods, however, are usually designed only for a particular process in the AI development pipeline and do not efficiently use the inherent capacity of the LLMs. This paper proposes AutoML-Agent, a novel multi-agent framework tailored for full-pipeline AutoML, i.e., from data retrieval to model deployment. AutoML-Agent takes user’s task descriptions, facilitates collaboration between specialized LLM agents, and delivers deployment-ready models. Unlike existing work, instead of devising a single plan, we introduce a retrieval-augmented planning strategy to enhance exploration to search for more optimal plans. We also decompose each plan into sub-tasks (e.g., data preprocessing and neural network design) each of which is solved by a specialized agent we build via prompting executing in parallel, making the search process more efficient. Moreover, we propose a multi-stage verification to verify executed results and guide the code generation LLM in implementing successful solutions. Extensive experiments on seven downstream tasks using fourteen datasets show that AutoML-Agent achieves a higher success rate in automating the full AutoML process, yielding systems with good performance throughout the diverse domains.

[LG-139] LLMCO2: Advancing Accurate Carbon Footprint Prediction for LLM Inferences

链接: https://arxiv.org/abs/2410.02950
作者: Zhenxiao Fu,Fan Chen,Shan Zhou,Haitong Li,Lei Jiang
关键词-EN: substantially larger carbon, LLM inference carbon, large language model, LLM inference requests, LLM inference
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
*备注: 9 pages, 11 figures

点击查看摘要

Abstract:Throughout its lifecycle, a large language model (LLM) generates a substantially larger carbon footprint during inference than training. LLM inference requests vary in batch size, prompt length, and token generation number, while cloud providers employ different GPU types and quantities to meet diverse service-level objectives for accuracy and latency. It is crucial for both users and cloud providers to have a tool that quickly and accurately estimates the carbon impact of LLM inferences based on a combination of inference request and hardware configurations before execution. Estimating the carbon footprint of LLM inferences is more complex than training due to lower and highly variable model FLOPS utilization, rendering previous equation-based models inaccurate. Additionally, existing machine learning (ML) prediction methods either lack accuracy or demand extensive training data, as they inadequately handle the distinct prefill and decode phases, overlook hardware-specific features, and inefficiently sample uncommon inference configurations. We introduce LLMCO2, a graph neural network (GNN)-based model that greatly improves the accuracy of LLM inference carbon footprint predictions compared to previous methods.

[LG-140] SymmetricDiffusers: Learning Discrete Diffusion on Finite Symmetric Groups

链接: https://arxiv.org/abs/2410.02942
作者: Yongxing Zhang,Donglin Yang,Renjie Liao
关键词-EN: Finite symmetric groups, essential in fields, Finite symmetric, symmetric groups, distribution
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Finite symmetric groups S_n are essential in fields such as combinatorics, physics, and chemistry. However, learning a probability distribution over S_n poses significant challenges due to its intractable size and discrete nature. In this paper, we introduce SymmetricDiffusers, a novel discrete diffusion model that simplifies the task of learning a complicated distribution over S_n by decomposing it into learning simpler transitions of the reverse diffusion using deep neural networks. We identify the riffle shuffle as an effective forward transition and provide empirical guidelines for selecting the diffusion length based on the theory of random walks on finite groups. Additionally, we propose a generalized Plackett-Luce (PL) distribution for the reverse transition, which is provably more expressive than the PL distribution. We further introduce a theoretically grounded “denoising schedule” to improve sampling and learning efficiency. Extensive experiments show that our model achieves state-of-the-art or comparable performances on solving tasks including sorting 4-digit MNIST images, jigsaw puzzles, and traveling salesman problems. Our code is released at this https URL.
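摘要选用 riffle shuffle(鸽尾式洗牌)作为前向转移;经典 GSR 模型下,一步洗牌可以这样采样:先按 Binomial(n, 1/2) 切牌,再以与两叠剩余张数成比例的概率交错合并(下列为该模型的示意实现,非论文代码):

```python
import random

def riffle_shuffle(perm, rnd=None):
    """GSR 模型下的一次 riffle shuffle(示意实现)。"""
    rnd = rnd or random.Random(0)
    n = len(perm)
    cut = sum(rnd.random() < 0.5 for _ in range(n))   # 切牌位置 ~ Binomial(n, 1/2)
    left, right = list(perm[:cut]), list(perm[cut:])
    out = []
    while left or right:
        # 以 len(left)/(len(left)+len(right)) 的概率从左叠取牌
        if rnd.random() < len(left) / (len(left) + len(right)):
            out.append(left.pop(0))
        else:
            out.append(right.pop(0))
    return out
```

反复应用该转移即得到摘要所说的前向扩散过程;有限群上随机游走的混合时间理论则用于选取扩散步数。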

[LG-141] Comparison of Autoencoder Encodings for ECG Representation in Downstream Prediction Tasks

链接: https://arxiv.org/abs/2410.02937
作者: Christopher J. Harvey,Sumaiya Shomaji,Zijun Yao,Amit Noheria
关键词-EN: Principal Component Analysis, cardiovascular assessment, inexpensive and widely, widely available tool, tool for cardiovascular
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:The electrocardiogram (ECG) is an inexpensive and widely available tool for cardiovascular assessment. Despite its standardized format and small file size, the high complexity and inter-individual variability of ECG signals (typically a 60,000-size vector) make it challenging to use in deep learning models, especially when only small datasets are available. This study addresses these challenges by exploring feature generation methods from representative beat ECGs, focusing on Principal Component Analysis (PCA) and Autoencoders to reduce data complexity. We introduce three novel Variational Autoencoder (VAE) variants: Stochastic Autoencoder (SAE), Annealed beta-VAE (Abeta-VAE), and cyclical beta-VAE (Cbeta-VAE), and compare their effectiveness in maintaining signal fidelity and enhancing downstream prediction tasks. The Abeta-VAE achieved superior signal reconstruction, reducing the mean absolute error (MAE) to 15.7 ± 3.2 microvolts, which is at the level of signal noise. Moreover, the SAE encodings, when combined with ECG summary features, improved the prediction of reduced Left Ventricular Ejection Fraction (LVEF), achieving an area under the receiver operating characteristic curve (AUROC) of 0.901. This performance nearly matches the 0.910 AUROC of state-of-the-art CNN models but requires significantly less data and computational resources. Our findings demonstrate that these VAE encodings are not only effective in simplifying ECG data but also provide a practical solution for applying deep learning in contexts with limited-scale labeled training data.
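摘要中 Abeta-VAE 与 Cbeta-VAE 的核心差别在于 KL 项权重 beta 的调度方式:前者单调退火,后者周期性重启。下面给出两种调度的极简示意(线性升温、升温比例等细节为笔者假设,非论文的具体 schedule):

```python
def annealed_beta(step, total_steps, beta_max=1.0):
    """退火式 beta 调度(Abeta-VAE 思路的示意):
    beta 从 0 线性升到 beta_max,之后保持不变。"""
    return beta_max * min(1.0, step / total_steps)

def cyclical_beta(step, cycle_len, beta_max=1.0, ramp=0.5):
    """循环式 beta 调度(Cbeta-VAE 思路的示意):
    每个周期内前 ramp 比例线性升温,其余时间保持 beta_max。"""
    t = (step % cycle_len) / cycle_len
    return beta_max * min(1.0, t / ramp)
```

训练时将该 beta 乘在 VAE 损失的 KL 项上即可:loss = recon + beta * kl。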

[LG-142] Streamlining Conformal Information Retrieval via Score Refinement

链接: https://arxiv.org/abs/2410.02914
作者: Yotam Intrator,Ori Kelner,Regev Cohen,Roman Goldenberg,Ehud Rivlin,Daniel Freedman
关键词-EN: retrieval augmented generation, lack statistical guarantees, augmented generation, fundamental to modern, modern applications
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 6 pages

点击查看摘要

Abstract:Information retrieval (IR) methods, like retrieval augmented generation, are fundamental to modern applications but often lack statistical guarantees. Conformal prediction addresses this by retrieving sets guaranteed to include relevant information, yet existing approaches produce large-sized sets, incurring high computational costs and slow response times. In this work, we introduce a score refinement method that applies a simple monotone transformation to retrieval scores, leading to significantly smaller conformal sets while maintaining their statistical guarantees. Experiments on various BEIR benchmarks validate the effectiveness of our approach in producing compact sets containing relevant information.
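The refinement idea can be illustrated with a toy split-conformal retriever: a threshold is calibrated on held-out queries, and because membership in the conformal set only compares scores against that threshold, any monotone transformation of the scores preserves the coverage guarantee. A sketch (the quantile indexing convention and the cube-law `refine` are illustrative assumptions, not the paper's transform):

```python
def calibrate_threshold(cal_scores_of_relevant, alpha=0.1):
    """Pick a threshold so that, on calibration queries, the relevant
    document's score exceeds it for roughly (1 - alpha) of queries
    (a simplified split-conformal quantile)."""
    s = sorted(cal_scores_of_relevant)
    n = len(s)
    k = int(alpha * (n + 1)) - 1          # lower-quantile index
    k = max(0, min(k, n - 1))
    return s[k]

def conformal_set(scores, tau, refine=lambda x: x):
    """Return indices of all documents whose refined score clears the
    refined threshold. For any monotone `refine`, the returned set (and
    hence its coverage guarantee) is unchanged."""
    return [i for i, sc in enumerate(scores) if refine(sc) >= refine(tau)]
```

The point of the paper is that a well-chosen monotone `refine` applied before calibration concentrates the score distribution, yielding a tighter threshold and much smaller sets at the same coverage level.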

[LG-143] Fine-Tuning Language Models with Differential Privacy through Adaptive Noise Allocation EMNLP2024

链接: https://arxiv.org/abs/2410.02912
作者: Xianzhi Li,Ran Zmigrod,Zhiqiang Ma,Xiaomo Liu,Xiaodan Zhu
关键词-EN: memorizing detailed patterns, achieve impressive modeling, significant privacy concerns, raise significant privacy, impressive modeling performance
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: EMNLP 2024 findings

点击查看摘要

Abstract:Language models are capable of memorizing detailed patterns and information, leading to a double-edged effect: they achieve impressive modeling performance on downstream tasks with the stored knowledge but also raise significant privacy concerns. Traditional differential privacy based training approaches offer robust safeguards by employing a uniform noise distribution across all parameters. However, this overlooks the distinct sensitivities and contributions of individual parameters in privacy protection and often results in suboptimal models. To address these limitations, we propose ANADP, a novel algorithm that adaptively allocates additive noise based on the importance of model parameters. We demonstrate that ANADP narrows the performance gap between regular fine-tuning and traditional DP fine-tuning on a series of datasets while maintaining the required privacy constraints.
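A rough sketch of the adaptive-allocation idea: clip each parameter group's gradient, then scale the Gaussian noise inversely with a normalized importance weight, so that uniform importances recover standard uniform-noise DP training. The allocation rule below is an assumption for illustration, not the ANADP algorithm itself:

```python
import numpy as np

def adaptive_dp_noise(grads, importances, sigma_base=1.0, clip=1.0):
    """Per-group clipping + importance-scaled Gaussian noise.
    More important groups get smaller noise multipliers; when all
    importances are equal this reduces to a uniform sigma_base."""
    imp = np.asarray(importances, dtype=float)
    weights = imp / imp.sum()                       # normalized importance
    sigmas = sigma_base / (len(imp) * weights)      # inverse allocation (illustrative)
    rng = np.random.default_rng(0)
    noisy = []
    for g, s in zip(grads, sigmas):
        g = np.asarray(g, dtype=float)
        norm = np.linalg.norm(g)
        if norm > clip:                             # gradient clipping
            g = g * (clip / norm)
        noisy.append(g + rng.normal(0.0, s * clip, size=g.shape))
    return noisy, sigmas
```

A real DP guarantee would additionally require accounting for the combined noise multipliers with a privacy accountant; this sketch only shows the non-uniform allocation mechanic.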

[LG-144] Solving Reach-Avoid-Stay Problems Using Deep Deterministic Policy Gradients

链接: https://arxiv.org/abs/2410.02898
作者: Gabriel Chenevert,Jingqi Li,Achyuta kannan,Sangjae Bae,Donggun Lee
关键词-EN: RAS, avoid obstacles, robots and air, air taxis, taxis to reach
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Reach-Avoid-Stay (RAS) optimal control enables systems such as robots and air taxis to reach their targets, avoid obstacles, and stay near the target. However, current methods for RAS often struggle with handling complex, dynamic environments and scaling to high-dimensional systems. While reinforcement learning (RL)-based reachability analysis addresses these challenges, it has yet to tackle the RAS problem. In this paper, we propose a two-step deep deterministic policy gradient (DDPG) method to extend RL-based reachability method to solve RAS problems. First, we train a function that characterizes the maximal robust control invariant set within the target set, where the system can safely stay, along with its corresponding policy. Second, we train a function that defines the set of states capable of safely reaching the robust control invariant set, along with its corresponding policy. We prove that this method results in the maximal robust RAS set in the absence of training errors and demonstrate that it enables RAS in complex environments, scales to high-dimensional systems, and achieves higher success rates for the RAS task compared to previous methods, validated through one simulation and two high-dimensional experiments.

[LG-145] The Role of Deductive and Inductive Reasoning in Large Language Models

链接: https://arxiv.org/abs/2410.02892
作者: Chengkun Cai,Xu Zhao,Haoliang Liu,Zhongyu Jiang,Tianfang Zhang,Zongkai Wu,Jenq-Neng Hwang,Lei Li
关键词-EN: Large Language Models, Large Language, Language Models, artificial intelligence, progress in artificial
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 4 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved substantial progress in artificial intelligence, particularly in reasoning tasks. However, their reliance on static prompt structures, coupled with limited dynamic reasoning capabilities, often constrains their adaptability to complex and evolving problem spaces. In this paper, we propose the Deductive and InDuctive(DID) method, which enhances LLM reasoning by dynamically integrating both deductive and inductive reasoning within the prompt construction process. Drawing inspiration from cognitive science, the DID approach mirrors human adaptive reasoning mechanisms, offering a flexible framework that allows the model to adjust its reasoning pathways based on task context and performance. We empirically validate the efficacy of DID on established datasets such as AIW and MR-GSM8K, as well as on our custom dataset, Holiday Puzzle, which presents tasks about different holiday date calculating challenges. By leveraging DID’s hybrid prompt strategy, we demonstrate significant improvements in both solution accuracy and reasoning quality, achieved without imposing substantial computational overhead. Our findings suggest that DID provides a more robust and cognitively aligned framework for reasoning in LLMs, contributing to the development of advanced LLM-driven problem-solving strategies informed by cognitive science models.
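Since DID operates at the prompt-construction level, its spirit can be sketched as a template that first elicits a general rule from examples (inductive) and then applies it to the query (deductive). The template below is purely illustrative and not the paper's actual prompts:

```python
def build_did_prompt(task, examples, rule=None):
    """Compose a prompt that interleaves inductive rule discovery with
    deductive application, in the spirit of the DID method. The wording
    and structure here are hypothetical placeholders."""
    parts = ["Consider the following solved examples:"]
    for x, y in examples:
        parts.append(f"  Input: {x} -> Output: {y}")
    parts.append("Step 1 (inductive): state a general rule the examples follow.")
    if rule:  # optionally seed a candidate rule for the model to refine
        parts.append(f"  Candidate rule: {rule}")
    parts.append("Step 2 (deductive): apply the rule to the new input.")
    parts.append(f"  Input: {task} -> Output:")
    return "\n".join(parts)
```

The resulting string would be sent to the LLM as-is; the adaptivity described in the abstract would come from adjusting which steps and examples are included based on task context.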

[LG-146] Universally Optimal Watermarking Schemes for LLMs: from Theory to Practice

链接: https://arxiv.org/abs/2410.02890
作者: Haiyun He,Yepeng Liu,Ziqiao Wang,Yongyi Mao,Yuheng Bu
关键词-EN: Large Language Models, Large Language, boosts human efficiency, poses misuse risks, differentiate AI-generated content
类目: Cryptography and Security (cs.CR); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) boost human efficiency but also pose misuse risks, with watermarking serving as a reliable method to differentiate AI-generated content from human-created text. In this work, we propose a novel theoretical framework for watermarking LLMs. Particularly, we jointly optimize both the watermarking scheme and detector to maximize detection performance, while controlling the worst-case Type-I error and distortion in the watermarked text. Within our framework, we characterize the universally minimum Type-II error, showing a fundamental trade-off between detection performance and distortion. More importantly, we identify the optimal type of detectors and watermarking schemes. Building upon our theoretical analysis, we introduce a practical, model-agnostic and computationally efficient token-level watermarking algorithm that invokes a surrogate model and the Gumbel-max trick. Empirical results on Llama-13B and Mistral-8×7B demonstrate the effectiveness of our method. Furthermore, we also explore how robustness can be integrated into our theoretical framework, which provides a foundation for designing future watermarking systems with improved resilience to adversarial attacks.
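The Gumbel-max trick mentioned above selects the token maximizing log-probability plus Gumbel noise; seeding that noise from a secret key and the context leaves each sampling step undistorted while making the output detectable. A toy sketch (the key/context hashing scheme and the detector statistic are assumptions, not the paper's construction):

```python
import hashlib
import math
import random

def seeded_gumbels(key, context, vocab_size):
    """Derive pseudo-random Gumbel noise per token from a secret key and
    the recent context: the watermark's shared randomness."""
    h = hashlib.sha256(f"{key}|{context}".encode()).digest()
    rng = random.Random(int.from_bytes(h[:8], "big"))
    return [-math.log(-math.log(rng.random())) for _ in range(vocab_size)]

def watermarked_sample(logprobs, key, context):
    """Gumbel-max sampling: argmax of log p + Gumbel noise. With fresh
    noise this is exact sampling from p; reusing key-seeded noise keeps
    that per-step distribution while enabling detection."""
    g = seeded_gumbels(key, context, len(logprobs))
    return max(range(len(logprobs)), key=lambda i: logprobs[i] + g[i])

def detect_score(tokens, contexts, key, vocab_size):
    """Detector statistic: average seeded-Gumbel value of the observed
    tokens. Watermarked text aligns with the noise and scores high."""
    total = sum(seeded_gumbels(key, ctx, vocab_size)[t]
                for t, ctx in zip(tokens, contexts))
    return total / max(1, len(tokens))
```

In practice a threshold on the detector statistic would be calibrated to control the Type-I error, which is exactly the quantity the paper's framework optimizes.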

[LG-147] Towards Layer-Wise Personalized Federated Learning: Adaptive Layer Disentanglement via Conflicting Gradients

链接: https://arxiv.org/abs/2410.02845
作者: Minh Duong Nguyen,Khanh Le,Khoi Do,Nguyen H.Tran,Duc Nguyen,Chien Trinh,Zhaohui Yang
关键词-EN: personalized Federated Learning, high data heterogeneity, Federated Learning, significant gradient divergence, personalized Federated
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In personalized Federated Learning (pFL), high data heterogeneity can cause significant gradient divergence across devices, adversely affecting the learning process. This divergence, especially when gradients from different users form an obtuse angle during aggregation, can negate progress, leading to severe weight and gradient update degradation. To address this issue, we introduce a new approach to pFL design, namely Federated Learning with Layer-wise Aggregation via Gradient Analysis (FedLAG), utilizing the concept of gradient conflict at the layer level. Specifically, when layer-wise gradients of different clients form acute angles, those gradients align in the same direction, enabling updates across different clients toward identifying client-invariant features. Conversely, when layer-wise gradient pairs form obtuse angles, the layers tend to focus on client-specific tasks. In hindsight, FedLAG assigns layers for personalization based on the extent of layer-wise gradient conflicts. Specifically, layers with gradient conflicts are excluded from the global aggregation process. The theoretical evaluation demonstrates that when integrated into other pFL baselines, FedLAG enhances pFL performance by a certain margin. Therefore, our proposed method achieves superior convergence behavior compared with other baselines. Extensive experiments show that our FedLAG outperforms several state-of-the-art methods and can be easily incorporated with many existing methods to further enhance performance.
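The layer-wise conflict test can be sketched as a cosine check between clients' per-layer gradients: layers whose gradients pairwise form acute angles are aggregated globally, while conflicting layers are left for personalization. A simplified illustration (FedLAG's actual criterion may differ):

```python
import numpy as np

def layer_conflict(grads_per_client):
    """True if any pair of clients' gradients for this layer form an
    obtuse angle (negative cosine similarity), i.e. the layer conflicts."""
    g = [np.asarray(x, dtype=float).ravel() for x in grads_per_client]
    for i in range(len(g)):
        for j in range(i + 1, len(g)):
            cos = g[i] @ g[j] / (np.linalg.norm(g[i]) * np.linalg.norm(g[j]) + 1e-12)
            if cos < 0.0:
                return True
    return False

def split_layers(layerwise_grads):
    """Partition layer names into globally aggregated vs personalized,
    excluding conflicting layers from global aggregation (simplified:
    a single obtuse pair marks the whole layer as conflicting)."""
    global_layers, personal_layers = [], []
    for name, grads in layerwise_grads.items():
        (personal_layers if layer_conflict(grads) else global_layers).append(name)
    return global_layers, personal_layers
```

In a full pFL round, only `global_layers` would be averaged across clients, while each client keeps its own copy of `personal_layers`.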

[LG-148] Neural DDEs with Learnable Delays for Partially Observed Dynamical Systems

链接: https://arxiv.org/abs/2410.02843
作者: Thibault Monsel,Emmanuel Menier,Onofrio Semeraro,Lionel Mathelin,Guillaume Charpiat
关键词-EN: recently been introduced, learn dynamical systems, learn dynamical, Delay Differential Equations, Constant Lag Neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Many successful methods to learn dynamical systems from data have recently been introduced. Such methods often rely on the availability of the system’s full state. However, this underlying hypothesis is rather restrictive as it is typically not confirmed in practice, leaving us with partially observed systems. Utilizing the Mori-Zwanzig (MZ) formalism from statistical physics, we demonstrate that Constant Lag Neural Delay Differential Equations (NDDEs) naturally serve as suitable models for partially observed states. In empirical evaluation, we show that such models outperform existing methods on both synthetic and experimental data.
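A constant-lag DDE x'(t) = f(x(t), x(t − τ)) needs a history function on [−τ, 0]; numerically, the delayed state is just a fixed-offset lookup into the trajectory buffer. A minimal forward-Euler sketch (a toy integrator to show the mechanics, not the NDDE training code; in an NDDE, `f` would be a neural network):

```python
def integrate_dde(f, history, tau, t_end, dt):
    """Forward-Euler integration of x'(t) = f(x(t), x(t - tau)).
    `history(t)` supplies x on [-tau, 0]; tau is assumed to be an
    integer multiple of dt so the delayed value is a buffer lookup."""
    lag = round(tau / dt)
    xs = [history(-tau + k * dt) for k in range(lag + 1)]  # fill [-tau, 0]
    steps = round(t_end / dt)
    for _ in range(steps):
        x_now, x_delayed = xs[-1], xs[-1 - lag]
        xs.append(x_now + dt * f(x_now, x_delayed))
    return xs[lag:]  # trajectory on [0, t_end]
```

For example, with f(x, x_d) = −x_d and a constant history of 1, the solution on [0, 1] decays linearly from 1 to 0, since the delayed argument stays pinned at the history value.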

[LG-149] Overcoming Representation Bias in Fairness-Aware data Repair using Optimal Transport

链接: https://arxiv.org/abs/2410.02840
作者: Abigail Langbridge,Anthony Quinn,Robert Shorten
关键词-EN: Optimal transport, important role, role in transforming, manner which engenders, transforming data distributions
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Optimal transport (OT) has an important role in transforming data distributions in a manner which engenders fairness. Typically, the OT operators are learnt from the unfair attribute-labelled data, and then used for their repair. Two significant limitations of this approach are as follows: (i) the OT operators for underrepresented subgroups are poorly learnt (i.e. they are susceptible to representation bias); and (ii) these OT repairs cannot be effected on identically distributed but out-of-sample (i.e. archival) data. In this paper, we address both of these problems by adopting a Bayesian nonparametric stopping rule for learning each attribute-labelled component of the data distribution. The induced OT-optimal quantization operators can then be used to repair the archival data. We formulate a novel definition of the fair distributional target, along with quantifiers that allow us to trade fairness against damage in the transformed data. These are used to reveal excellent performance of our representation-bias-tolerant scheme in simulated and benchmark data sets.

[LG-150] Skill Issues: An Analysis of CS:GO Skill Rating Systems

链接: https://arxiv.org/abs/2410.02831
作者: Mikel Bober-Irizar,Naunidh Dua,Max McGuinness
关键词-EN: skill rating systems, accurate skill rating, meteoric rise, rise of online, online games
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The meteoric rise of online games has created a need for accurate skill rating systems for tracking improvement and fair matchmaking. Although many skill rating systems are deployed, with various theoretical foundations, less work has been done on analysing the real-world performance of these algorithms. In this paper, we perform an empirical analysis of Elo, Glicko2 and TrueSkill through the lens of surrogate modelling, where skill ratings influence future matchmaking with a configurable acquisition function. We look both at overall performance and data efficiency, and perform a sensitivity analysis based on a large dataset of Counter-Strike: Global Offensive matches.
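Of the three systems analysed, Elo is the simplest: ratings move by the gap between the observed and expected score under a logistic model. A self-contained sketch of the standard update (K-factor of 32 is a common convention, not taken from the paper):

```python
def elo_expected(r_a, r_b):
    """Expected score of player A against player B under the logistic
    Elo model with a 400-point scale."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=32.0):
    """Update both ratings after a match; score_a is 1 (A wins),
    0 (B wins) or 0.5 (draw). The rating change is zero-sum."""
    delta = k * (score_a - elo_expected(r_a, r_b))
    return r_a + delta, r_b - delta
```

Glicko2 and TrueSkill extend this idea by also tracking per-player rating uncertainty, which is what the paper's surrogate-modelling analysis probes.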

[LG-151] LLMs May Not Be Human-Level Players But They Can Be Testers: Measuring Game Difficulty with LLM Agents

链接: https://arxiv.org/abs/2410.02829
作者: Chang Xiao,Brenda Z. Yang
关键词-EN: Large Language Models, Language Models, Large Language, Recent advances, advances in Large
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in Large Language Models (LLMs) have demonstrated their potential as autonomous agents across various tasks. One emerging application is the use of LLMs in playing games. In this work, we explore a practical problem for the gaming industry: Can LLMs be used to measure game difficulty? We propose a general game-testing framework using LLM agents and test it on two widely played strategy games: Wordle and Slay the Spire. Our results reveal an interesting finding: although LLMs may not perform as well as the average human player, their performance, when guided by simple, generic prompting techniques, shows a statistically significant and strong correlation with difficulty indicated by human players. This suggests that LLMs could serve as effective agents for measuring game difficulty during the development process. Based on our experiments, we also outline general principles and guidelines for incorporating LLMs into the game testing process.

[LG-152] Effective Intrusion Detection for UAV Communications using Autoencoder-based Feature Extraction and Machine Learning Approach

链接: https://arxiv.org/abs/2410.02827
作者: Tuan-Cuong Vuong,Cong Chi Nguyen,Van-Cuong Pham,Thi-Thanh-Huyen Le,Xuan-Nam Tran,Thien Van Luong
关键词-EN: unmanned aerial vehicles, recent actual UAV, intrusion detection method, aerial vehicles, actual UAV intrusion
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 4 pages

点击查看摘要

Abstract:This paper proposes a novel intrusion detection method for unmanned aerial vehicles (UAV) based on a recent actual UAV intrusion dataset. In particular, in the first stage of our method, we design an autoencoder architecture for effectively extracting important features, which are then fed into various machine learning models in the second stage for detecting and classifying attack types. To the best of our knowledge, this is the first attempt to propose such an autoencoder-based machine learning intrusion detection method for UAVs using an actual dataset, while most existing works only consider either simulated datasets or datasets irrelevant to UAV communications. Our experiment results show that the proposed method outperforms baselines such as feature selection schemes in both binary and multi-class classification tasks.

[LG-153] LinkThief: Combining Generalized Structure Knowledge with Node Similarity for Link Stealing Attack against GNN

链接: https://arxiv.org/abs/2410.02826
作者: Yuxing Zhang,Siyuan Meng,Chunchun Chen,Mengyao Peng,Hongyan Gu,Xinli Huang
关键词-EN: Bridge Graph Generator, http URL attacks, http URL theoretical, Graph neural networks, http URL studies
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph neural networks (GNNs) have a wide range of applications in the real world. Recent studies have shown that GNNs are vulnerable to link stealing attacks, which infer the existence of edges in the target GNN's training graph. Existing attacks are usually based on the assumption that links exist between two nodes that share similar posteriors; however, they fail to focus on links that do not hold under this assumption. To this end, we propose LinkThief, an improved link stealing attack that combines generalized structure knowledge with node similarity, in a scenario where the attackers' background knowledge contains a partially leaked target graph and a shadow graph. Specifically, to equip the attack model with insights into the link structure spanning both the shadow graph and the target graph, we introduce the idea of creating a Shadow-Target Bridge Graph and extracting edge subgraph structure features from it. Through theoretical analysis from the perspective of privacy theft, we first explore how to implement the aforementioned ideas. Building upon the findings, we design the Bridge Graph Generator to construct the Shadow-Target Bridge Graph. Then, the subgraph around the link is sampled by the Edge Subgraph Preparation Module. Finally, the Edge Structure Feature Extractor is designed to obtain generalized structure knowledge, which is combined with node similarity to form the features provided to the attack model. Extensive experiments validate the correctness of the theoretical analysis and demonstrate that LinkThief effectively steals links without extra assumptions.

[LG-154] DANA: Domain-Aware Neurosymbolic Agents for Consistency and Accuracy

链接: https://arxiv.org/abs/2410.02823
作者: Vinh Luong,Sang Dinh,Shruti Raghavan,William Nguyen,Zooey Nguyen,Quynh Le,Hung Vo,Kentaro Maegaito,Loc Nguyen,Thao Nguyen,Anh Hai Ha,Christopher Nguyen
关键词-EN: Large Language Models, shown remarkable capabilities, Large Language, Language Models, inherent probabilistic nature
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable capabilities, but their inherent probabilistic nature often leads to inconsistency and inaccuracy in complex problem-solving tasks. This paper introduces DANA (Domain-Aware Neurosymbolic Agent), an architecture that addresses these issues by integrating domain-specific knowledge with neurosymbolic approaches. We begin by analyzing current AI architectures, including AutoGPT, LangChain ReAct and OpenAI’s ChatGPT, through a neurosymbolic lens, highlighting how their reliance on probabilistic inference contributes to inconsistent outputs. In response, DANA captures and applies domain expertise in both natural-language and symbolic forms, enabling more deterministic and reliable problem-solving behaviors. We implement a variant of DANA using Hierarchical Task Plans (HTPs) in the open-source OpenSSA framework. This implementation achieves over 90% accuracy on the FinanceBench financial-analysis benchmark, significantly outperforming current LLM-based systems in both consistency and accuracy. Application of DANA in physical industries such as semiconductor shows that its flexible architecture for incorporating knowledge is effective in mitigating the probabilistic limitations of LLMs and has potential in tackling complex, real-world problems that require reliability and precision.

[LG-155] GPTs Judgements Under Uncertainty

链接: https://arxiv.org/abs/2410.02820
作者: Payam Saeedi,Mahsa Goodarzi
关键词-EN: framing effects, human cognition, loss aversion, conjunction fallacy, judges and makes
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We investigate whether biases inherent in human cognition, such as loss aversion, framing effects, and conjunction fallacy, manifest in how GPT-4o judges and makes decisions in probabilistic scenarios. By conducting 1350 experiments across nine cognitive biases and analyzing the responses for statistical versus heuristic reasoning, we demonstrate GPT-4o’s contradicting approach while responding to prompts with similar underlying probability notations. Our findings also reveal mixed performances with the AI demonstrating both human-like heuristic errors and statistically sound decisions, even as it goes through identical iterations of the same prompt.

[LG-156] Physics-Informed Graph-Mesh Networks for PDEs: A hybrid approach for complex problems

链接: https://arxiv.org/abs/2410.02819
作者: Marien Chenaud,Frédéric Magoulès,José Alves
关键词-EN: including solving partial, solving partial differential, partial differential equations, numerous applications, including solving
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The recent rise of deep learning has led to numerous applications, including solving partial differential equations using Physics-Informed Neural Networks. This approach has proven highly effective in several academic cases. However, their lack of physical invariances, coupled with other significant weaknesses, such as an inability to handle complex geometries or their lack of generalization capabilities, make them unable to compete with classical numerical solvers in industrial settings. In this work, a limitation regarding the use of automatic differentiation in the context of physics-informed learning is highlighted. A hybrid approach combining physics-informed graph neural networks with numerical kernels from finite elements is introduced. After studying the theoretical properties of our model, we apply it to complex geometries, in two and three dimensions. Our choices are supported by an ablation study, and we evaluate the generalisation capacity of the proposed approach.

[LG-157] Neural Coordination and Capacity Control for Inventory Management

链接: https://arxiv.org/abs/2410.02817
作者: Carson Eisenach,Udaya Ghai,Dhruv Madeka,Kari Torkkola,Dean Foster,Sham Kakade
关键词-EN: limited shared resources, retailer managing multiple, managing multiple products, review inventory control, capacity control mechanism
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This paper addresses the capacitated periodic review inventory control problem, focusing on a retailer managing multiple products with limited shared resources, such as storage or inbound labor at a facility. Specifically, this paper is motivated by the questions of (1) what does it mean to backtest a capacity control mechanism, (2) can we devise and backtest a capacity control mechanism that is compatible with recent advances in deep reinforcement learning for inventory management? First, because we only have a single historic sample path of Amazon’s capacity limits, we propose a method that samples from a distribution of possible constraint paths covering a space of real-world scenarios. This novel approach allows for more robust and realistic testing of inventory management strategies. Second, we extend the exo-IDP (Exogenous Decision Process) formulation of Madeka et al. 2022 to capacitated periodic review inventory control problems and show that certain capacitated control problems are no harder than supervised learning. Third, we introduce a `neural coordinator’, designed to produce forecasts of capacity prices, guiding the system to adhere to target constraints in place of a traditional model predictive controller. Finally, we apply a modified DirectBackprop algorithm for learning a deep RL buying policy and a training the neural coordinator. Our methodology is evaluated through large-scale backtests, demonstrating RL buying policies with a neural coordinator outperforms classic baselines both in terms of cumulative discounted reward and capacity adherence (we see improvements of up to 50% in some cases).

[LG-158] SAC-KG: Exploiting Large Language Models as Skilled Automatic Constructors for Domain Knowledge Graphs ACL2024

链接: https://arxiv.org/abs/2410.02811
作者: Hanzhu Chen,Xu Shen,Qitan Lv,Jie Wang,Xiaoqi Ni,Jieping Ye
关键词-EN: domain Knowledge Graph, Skilled Automatic Constructors, play a pivotal, pivotal role, role in knowledge-intensive
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: ACL 2024 Main

点击查看摘要

Abstract:Knowledge graphs (KGs) play a pivotal role in knowledge-intensive tasks across specialized domains, where the acquisition of precise and dependable knowledge is crucial. However, existing KG construction methods heavily rely on human intervention to attain qualified KGs, which severely hinders the practical applicability in real-world scenarios. To address this challenge, we propose a general KG construction framework, named SAC-KG, to exploit large language models (LLMs) as Skilled Automatic Constructors for domain Knowledge Graph. SAC-KG effectively involves LLMs as domain experts to generate specialized and precise multi-level KGs. Specifically, SAC-KG consists of three components: Generator, Verifier, and Pruner. For a given entity, Generator produces its relations and tails from raw domain corpora, to construct a specialized single-level KG. Verifier and Pruner then work together to ensure precision by correcting generation errors and determining whether newly produced tails require further iteration for the next level. Experiments demonstrate that SAC-KG automatically constructs a domain KG at the scale of over one million nodes and achieves a precision of 89.32%, leading to a superior performance with over 20% increase in precision rate compared to existing state-of-the-art methods for the KG construction task.
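The Generator → Verifier → Pruner loop described above can be sketched as a level-by-level expansion. In SAC-KG each component would be LLM-backed; here they are placeholder callables to show the control flow only:

```python
def sac_kg_build(entity, generate, verify, prune, max_levels=3):
    """Iterative multi-level KG construction in the spirit of SAC-KG.
    `generate(e)` yields candidate (relation, tail) pairs for entity e,
    `verify(e, candidates)` filters out erroneous generations, and
    `prune(tail)` decides whether a tail seeds the next level.
    All three callables are hypothetical stand-ins for LLM components."""
    kg, frontier = [], [entity]
    for _ in range(max_levels):
        next_frontier = []
        for e in frontier:
            for rel, tail in verify(e, generate(e)):
                kg.append((e, rel, tail))        # accept verified triple
                if prune(tail):                   # tail worth expanding?
                    next_frontier.append(tail)
        frontier = next_frontier
        if not frontier:
            break
    return kg
```

The precision claim in the abstract comes from the Verifier/Pruner stages discarding low-quality generations before they propagate to deeper levels.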

[LG-159] StateAct: State Tracking and Reasoning for Acting and Planning with Large Language Models

链接: https://arxiv.org/abs/2410.02810
作者: Nikolai Rozanov,Marek Rei
关键词-EN: large language models, language models, large language, interactive environments, Planning and acting
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 9 pages, 5 pages appendix, 7 figures, 5 tables

点击查看摘要

Abstract:Planning and acting to solve 'real' tasks using large language models (LLMs) in interactive environments has become a new frontier for AI methods. While recent advances allowed LLMs to interact with online tools, solve robotics tasks and many more, long range reasoning tasks remain a problem for LLMs. Existing methods to address this issue are very resource intensive and require additional data or human crafted rules; instead, we propose a simple method based on few-shot in-context learning alone to enhance 'chain-of-thought' with state-tracking for planning and acting with LLMs. We show that our method establishes the new state-of-the-art on Alfworld for in-context learning methods (+14% over the previous best few-shot in-context learning method) and performs on par with methods that use additional training data and additional tools such as code-execution. We also demonstrate that our enhanced 'chain-of-states' allows the agent to both solve longer horizon problems and to be more efficient in the number of steps required to solve a task. We show that our method works across a variety of LLMs, both API-based and open source ones. Finally, we also conduct ablation studies and show that 'chain-of-thoughts' helps state-tracking accuracy, while a json-structure harms overall performance. We open-source our code and annotations at this https URL.

[LG-160] A Data Envelopment Analysis Approach for Assessing Fairness in Resource Allocation: Application to Kidney Exchange Programs

链接: https://arxiv.org/abs/2410.02799
作者: Ali Kaazempur-Mofrad,Xiaowu Dai
关键词-EN: significantly increased transplantation, increased transplantation rates, raise pressing questions, Kidney exchange programs, Data Envelopment Analysis
类目: Computers and Society (cs.CY); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Kidney exchange programs have significantly increased transplantation rates but raise pressing questions about fairness in organ allocation. We present a novel framework leveraging Data Envelopment Analysis (DEA) to evaluate multiple fairness criteria–Priority, Access, and Outcome–within a single model, capturing complexities that may be overlooked in single-metric analyses. Using data from the United Network for Organ Sharing, we analyze these criteria individually, measuring Priority fairness through waitlist durations, Access fairness through Kidney Donor Profile Index scores, and Outcome fairness through graft lifespan. We then apply our DEA model to demonstrate significant disparities in kidney allocation efficiency across ethnic groups. To quantify uncertainty, we employ conformal prediction within the DEA framework, yielding group-conditional prediction intervals with finite sample coverage guarantees. Our findings show notable differences in efficiency distributions between ethnic groups. Our study provides a rigorous framework for evaluating fairness in complex resource allocation systems, where resource scarcity and mutual compatibility constraints exist. All code for using the proposed method and reproducing results is available on GitHub.

[LG-161] DifFaiRec: Generative Fair Recommender with Conditional Diffusion Model ICDM2024

链接: https://arxiv.org/abs/2410.02791
作者: Zhenhao Jiang,Jicong Fan
关键词-EN: users automatically based, Diffusion-based Fair Recommender, automatically based, groups, users automatically
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: The paper was accepted by ICDM 2024

点击查看摘要

Abstract:Although recommenders can ship items to users automatically based on the users’ preferences, they often cause unfairness to groups or individuals. For instance, when users can be divided into two groups according to a sensitive social attribute and there is a significant difference in terms of activity between the two groups, the learned recommendation algorithm will result in a recommendation gap between the two groups, which causes group unfairness. In this work, we propose a novel recommendation algorithm named Diffusion-based Fair Recommender (DifFaiRec) to provide fair recommendations. DifFaiRec is built upon the conditional diffusion model and hence has a strong ability to learn the distribution of user preferences from their ratings on items and is able to generate diverse recommendations effectively. To guarantee fairness, we design a counterfactual module to reduce the model sensitivity to protected attributes and provide mathematical explanations. The experiments on benchmark datasets demonstrate the superiority of DifFaiRec over competitive baselines.

[LG-162] Guess What I Think: Streamlined EEG-to-Image Generation with Latent Diffusion Models ICASSP2025

链接: https://arxiv.org/abs/2410.02780
作者: Eleonora Lopez,Luigi Sigillo,Federica Colonnese,Massimo Panella,Danilo Comminiello
关键词-EN: advance brain-computer interface, encode visual cues, gaining increasing attention, brain signals encode, increasing attention due
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Submitted to ICASSP 2025

点击查看摘要

Abstract:Generating images from brain waves is gaining increasing attention due to its potential to advance brain-computer interface (BCI) systems by understanding how brain signals encode visual cues. Most of the literature has focused on fMRI-to-Image tasks as fMRI is characterized by high spatial resolution. However, fMRI is an expensive neuroimaging modality and does not allow for real-time BCI. On the other hand, electroencephalography (EEG) is a low-cost, non-invasive, and portable neuroimaging technique, making it an attractive option for future real-time applications. Nevertheless, EEG presents inherent challenges due to its low spatial resolution and susceptibility to noise and artifacts, which makes generating images from EEG more difficult. In this paper, we address these problems with a streamlined framework based on the ControlNet adapter for conditioning a latent diffusion model (LDM) through EEG signals. We conduct experiments and ablation studies on popular benchmarks to demonstrate that the proposed method beats other state-of-the-art models. Unlike these methods, which often require extensive preprocessing, pretraining, different losses, and captioning models, our approach is efficient and straightforward, requiring only minimal preprocessing and a few components. Code will be available after publication.

[LG-163] Learning variant product relationship and variation attributes from e-commerce website structures

链接: https://arxiv.org/abs/2410.02779
作者: Pedro Herrero-Vidal,You-Lin Chen,Cris Liu,Prithviraj Sen,Lichao Wang
关键词-EN: introduce VARM, product relationships, product, variant product relationships, variant relationship matcher
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce VARM, a variant relationship matcher strategy, to identify pairs of variant products in e-commerce catalogs. Traditional definitions of entity resolution are concerned with whether product mentions refer to the same underlying product. However, this fails to capture product relationships that are critical for e-commerce applications, such as having similar, but not identical, products listed on the same webpage or sharing reviews. Here, we formulate a new type of entity resolution over variant product relationships to capture these similar e-commerce product links. In contrast with the traditional definition, the new definition requires both identifying whether two products are variant matches of each other and which attributes vary between them. To satisfy these two requirements, we developed a strategy that leverages the strengths of both encoding and generative AI models. First, we construct a dataset that captures webpage product links, and therefore variant product relationships, to train an encoding LLM to predict variant matches for any given pair of products. Second, we use RAG-prompted generative LLMs to extract variation and common attributes amongst groups of variant products. To validate our strategy, we evaluated model performance using real data from one of the world’s leading e-commerce retailers. The results show that our strategy outperforms alternative solutions and paves the way to exploiting this new type of product relationship.
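The webpage-derived weak labels can be sketched as follows, under a hypothetical data layout in which each webpage is simply a list of product identifiers (the real pipeline trains an encoding LLM on these pairs):

```python
from itertools import combinations

def webpage_variant_pairs(webpage_groups):
    """Sketch of the weak-labeling step implied by the abstract, under a
    hypothetical data layout where each webpage is a list of product
    identifiers: products co-listed on a page become positive variant
    pairs, and products from different pages become negative pairs for
    training the encoding LLM."""
    positives = [pair
                 for group in webpage_groups
                 for pair in combinations(sorted(group), 2)]
    negatives = [tuple(sorted((a, b)))
                 for g1, g2 in combinations(range(len(webpage_groups)), 2)
                 for a in webpage_groups[g1]
                 for b in webpage_groups[g2]]
    return positives, negatives
```

The attribute-extraction step (which attributes vary within a variant group) is handled separately by the RAG-prompted generative LLM and is not sketched here.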

[LG-164] OATH: Efficient and Flexible Zero-Knowledge Proofs of End-to-End ML Fairness

链接: https://arxiv.org/abs/2410.02777
作者: Olive Franzese,Ali Shahin Shamsabadi,Hamed Haddadi
关键词-EN: received lesser attention, lesser attention, address fairness noncompliance, fairness noncompliance, received lesser
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Though there is much interest in fair AI systems, the problem of fairness noncompliance – which concerns whether fair models are used in practice – has received lesser attention. Zero-Knowledge Proofs of Fairness (ZKPoF) address fairness noncompliance by allowing a service provider to verify to external parties that their model serves diverse demographics equitably, with guaranteed confidentiality over proprietary model parameters and data. They have great potential for building public trust and effective AI regulation, but no previous techniques for ZKPoF are fit for real-world deployment. We present OATH, the first ZKPoF framework that is (i) deployably efficient with client-facing communication comparable to in-the-clear ML as a Service query answering, and an offline audit phase that verifies an asymptotically constant quantity of answered queries, (ii) deployably flexible with modularity for any score-based classifier given a zero-knowledge proof of correct inference, (iii) deployably secure with an end-to-end security model that guarantees confidentiality and fairness across training, inference, and audits. We show that OATH obtains strong robustness against malicious adversaries at concretely efficient parameter settings. Notably, OATH provides a 1343x improvement to runtime over previous work for neural network ZKPoF, and scales up to much larger models – even DNNs with tens of millions of parameters.

[LG-165] Bypassing the Popularity Bias: Repurposing Models for Better Long-Tail Recommendation

链接: https://arxiv.org/abs/2410.02776
作者: Václav Blahut,Karel Koupil
关键词-EN: Recommender systems play, influencing our beliefs, Recommender systems, play a crucial, crucial role
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 6 pages, 4 figures

点击查看摘要

Abstract:Recommender systems play a crucial role in shaping information we encounter online, whether on social media or when using content platforms, thereby influencing our beliefs, choices, and behaviours. Many recent works address the issue of fairness in recommender systems, typically focusing on topics like ensuring equal access to information and opportunities for all individual users or user groups, promoting diverse content to avoid filter bubbles and echo chambers, enhancing transparency and explainability, and adhering to ethical and sustainable practices. In this work, we aim to achieve a more equitable distribution of exposure among publishers on an online content platform, with a particular focus on those who produce high quality, long-tail content that may be unfairly disadvantaged. We propose a novel approach of repurposing existing components of an industrial recommender system to deliver valuable exposure to underrepresented publishers while maintaining high recommendation quality. To demonstrate the efficiency of our proposal, we conduct large-scale online AB experiments, report results indicating desired outcomes and share several insights from long-term application of the approach in the production setting.

[LG-166] HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models

链接: https://arxiv.org/abs/2410.01524
作者: Seanie Lee,Haebin Seong,Dong Bok Lee,Minki Kang,Xiaoyin Chen,Dominik Wagner,Yoshua Bengio,Juho Lee,Sung Ju Hwang
关键词-EN: detect malicious queries, malicious queries aimed, Safety guard models, Safety guard, existing safety guard
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Safety guard models that detect malicious queries aimed at large language models (LLMs) are essential for ensuring the secure and responsible deployment of LLMs in real-world applications. However, deploying existing safety guard models with billions of parameters alongside LLMs on mobile devices is impractical due to substantial memory requirements and latency. To reduce this cost, we distill a large teacher safety guard model into a smaller one using a labeled dataset of instruction-response pairs with binary harmfulness labels. Due to the limited diversity of harmful instructions in the existing labeled dataset, naively distilled models tend to underperform compared to larger models. To bridge the gap between small and large models, we propose HarmAug, a simple yet effective data augmentation method that involves jailbreaking an LLM and prompting it to generate harmful instructions. Given a prompt such as, “Make a single harmful instruction prompt that would elicit offensive content”, we add an affirmative prefix (e.g., “I have an idea for a prompt:”) to the LLM’s response. This encourages the LLM to continue generating the rest of the response, leading to sampling harmful instructions. Another LLM generates a response to the harmful instruction, and the teacher model labels the instruction-response pair. We empirically show that our HarmAug outperforms other relevant baselines. Moreover, a 435-million-parameter safety guard model trained with HarmAug achieves an F1 score comparable to larger models with over 7 billion parameters, and even outperforms them in AUPRC, while operating at less than 25% of their computational cost.
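The affirmative-prefix trick described above can be sketched as a simple prompt builder; the chat-template markers below are illustrative placeholders, not any specific model's template:

```python
def harmaug_prompt(meta_instruction, prefix="I have an idea for a prompt:"):
    """Sketch of HarmAug's sampling setup: the jailbreak meta-instruction
    goes in the user turn, and the affirmative prefix is pre-filled at
    the start of the assistant turn so the LLM continues it with a
    harmful instruction. The chat-template markers are illustrative
    placeholders rather than a real model's template."""
    return f"<|user|>\n{meta_instruction}\n<|assistant|>\n{prefix}"
```

The generated continuation would then be answered by a second LLM and the pair labeled by the teacher safety guard, as the abstract describes.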

[LG-167] Minimax-optimal trust-aware multi-armed bandits

链接: https://arxiv.org/abs/2410.03651
作者: Changxiao Cai,Jiacheng Zhang
关键词-EN: sequential decision-making applications, achieved significant success, humans perfectly implement, Multi-armed bandit, recommended policy
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Multi-armed bandit (MAB) algorithms have achieved significant success in sequential decision-making applications, under the premise that humans perfectly implement the recommended policy. However, existing methods often overlook the crucial factor of human trust in learning algorithms. When trust is lacking, humans may deviate from the recommended policy, leading to undesired learning performance. Motivated by this gap, we study the trust-aware MAB problem by integrating a dynamic trust model into the standard MAB framework. Specifically, it assumes that the recommended and actually implemented policies differ depending on human trust, which in turn evolves with the quality of the recommended policy. We establish the minimax regret in the presence of the trust issue and demonstrate the suboptimality of vanilla MAB algorithms such as the upper confidence bound (UCB) algorithm. To overcome this limitation, we introduce a novel two-stage trust-aware procedure that provably attains near-optimal statistical guarantees. A simulation study is conducted to illustrate the benefits of our proposed algorithm when dealing with the trust issue.
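As a toy illustration of the problem setup (not the paper's two-stage procedure), a minimal pure-Python simulation in which the human follows a UCB recommendation only with probability equal to their current trust, and trust evolves with reward quality; the trust dynamics and all constants are hypothetical:

```python
import math
import random

def trust_aware_ucb(means, horizon=2000, seed=0):
    # Illustrative simulation of the trust-aware setting, not the paper's
    # two-stage procedure: the human implements the recommended arm only
    # with probability equal to the current trust level, and trust drifts
    # toward recent reward quality (hypothetical smoothing dynamics).
    rng = random.Random(seed)
    k = len(means)
    counts, sums = [0] * k, [0.0] * k
    trust, total_reward = 0.5, 0.0
    for t in range(1, horizon + 1):
        if 0 in counts:                   # play each arm at least once
            rec = counts.index(0)
        else:                             # standard UCB recommendation
            rec = max(range(k), key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2.0 * math.log(t) / counts[a]))
        # with probability 1 - trust the human deviates to a random arm
        arm = rec if rng.random() < trust else rng.randrange(k)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total_reward += reward
        # trust evolves with the quality of implemented recommendations
        trust = min(1.0, max(0.0, 0.95 * trust + 0.05 * reward))
    return total_reward, trust
```

In this toy model, low trust injects uniform exploration noise that vanilla UCB does not account for, which is the gap the paper's regret analysis formalizes.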

[LG-168] Conditional Enzyme Generation Using Protein Language Models with Adapters

链接: https://arxiv.org/abs/2410.03634
作者: Jason Yang,Aadyot Bhatnagar,Jeffrey A. Ruffolo,Ali Madani
关键词-EN: language models, protein language models, key goal, Adapted Language Model, Protein Conditionally Adapted
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The conditional generation of proteins with desired functions and/or properties is a key goal for generative models. Existing methods based on prompting of language models can generate proteins conditioned on a target functionality, such as a desired enzyme family. However, these methods are limited to simple, tokenized conditioning and have not been shown to generalize to unseen functions. In this study, we propose ProCALM (Protein Conditionally Adapted Language Model), an approach for the conditional generation of proteins using adapters to protein language models. Our specific implementation of ProCALM involves finetuning ProGen2 to incorporate conditioning representations of enzyme function and taxonomy. ProCALM matches existing methods at conditionally generating sequences from target enzyme families. Impressively, it can also generate within the joint distribution of enzymatic function and taxonomy, and it can generalize to rare and unseen enzyme families and taxonomies. Overall, ProCALM is a flexible and computationally efficient approach, and we expect that it can be extended to a wide range of generative language models.

[LG-169] Exploring gauge-fixing conditions with gradient-based optimization

链接: https://arxiv.org/abs/2410.03602
作者: William Detmold,Gurtej Kanwar,Yin Lin,Phiala E. Shanahan,Michael L. Wagman
关键词-EN: Lattice gauge fixing, compute gauge-variant quantities, RI-MOM renormalization schemes, Lattice gauge, model calculations
类目: High Energy Physics - Lattice (hep-lat); Machine Learning (cs.LG)
*备注: 9 pages, 2 figures; Proceedings of the 41st International Symposium on Lattice Field Theory (Lattice 2024)

点击查看摘要

Abstract:Lattice gauge fixing is required to compute gauge-variant quantities, for example those used in RI-MOM renormalization schemes or as objects of comparison for model calculations. Recently, gauge-variant quantities have also been found to be more amenable to signal-to-noise optimization using contour deformations. These applications motivate systematic parameterization and exploration of gauge-fixing schemes. This work introduces a differentiable parameterization of gauge fixing which is broad enough to cover Landau gauge, Coulomb gauge, and maximal tree gauges. The adjoint state method allows gradient-based optimization to select gauge-fixing schemes that minimize an arbitrary target loss function.

[LG-170] Nonstationary Sparse Spectral Permanental Process

链接: https://arxiv.org/abs/2410.03581
作者: Zicheng Sun,Yixuan Zhang,Zenan Ling,Xuhui Fan,Feng Zhou
关键词-EN: Existing permanental processes, Existing permanental, permanental processes, processes often impose, kernel types
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Existing permanental processes often impose constraints on kernel types or stationarity, limiting the model’s expressiveness. To overcome these limitations, we propose a novel approach utilizing the sparse spectral representation of nonstationary kernels. This technique relaxes the constraints on kernel types and stationarity, allowing for more flexible modeling while reducing computational complexity to the linear level. Additionally, we introduce a deep kernel variant by hierarchically stacking multiple spectral feature mappings, further enhancing the model’s expressiveness to capture complex patterns in data. Experimental results on both synthetic and real-world datasets demonstrate the effectiveness of our approach, particularly in scenarios with pronounced data nonstationarity. Ablation studies are also conducted to provide insights into the impact of various hyperparameters on model performance.
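As background, a one-dimensional stationary sparse-spectral sketch of a permanental intensity (the paper's contribution is the nonstationary extension; all parameter values here are illustrative):

```python
import math

def sparse_spectral_intensity(x, freqs, phases, weights):
    """Illustrative stationary sketch, not the paper's nonstationary
    model: random Fourier features phi_j(x) = cos(w_j * x + b_j)
    approximate a stationary kernel, and the permanental link squares
    the latent function so the intensity is nonnegative by
    construction: lambda(x) = f(x)^2."""
    latent = sum(w * math.cos(f * x + b)
                 for w, f, b in zip(weights, freqs, phases))
    return latent * latent
```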

[LG-171] Optimizing food taste sensory evaluation through neural network-based taste electroencephalogram channel selection

链接: https://arxiv.org/abs/2410.03559
作者: Xiuxin Xia,Qun Wang,He Wang,Chenrui Liu,Pengwei Li,Yan Shi,Hong Men
关键词-EN: stimulation can reflect, reflect different brain, brain patterns, EEG, channel selection
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 33 pages, 13 figures

点击查看摘要

Abstract:The taste electroencephalogram (EEG) evoked by taste stimulation can reflect different brain patterns and be used in applications such as the sensory evaluation of food. However, given computational cost and efficiency, EEG data with many channels faces the critical issue of channel selection. This paper proposes a channel selection method called class activation mapping with attention (CAM-Attention). The CAM-Attention method combines a convolutional neural network with channel and spatial attention (CNN-CSA) model with a gradient-weighted class activation mapping (Grad-CAM) model. The CNN-CSA model exploits key features in EEG data via an attention mechanism, and the Grad-CAM model effectively visualizes the feature regions. Channel selection is then implemented based on these feature regions. The CAM-Attention method reduces the computational burden of taste EEG recognition and effectively distinguishes the four tastes. In short, it achieves excellent recognition performance and provides effective technical support for taste sensory evaluation.
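A minimal sketch of the final selection step, assuming the CNN-CSA network has already produced one Grad-CAM activation sequence per EEG channel (that upstream model is not reproduced here):

```python
def select_channels(cam_maps, k):
    """Sketch of the final selection step: average each channel's
    Grad-CAM activation over time and keep the k highest-scoring
    channels. The CNN-CSA network that produces `cam_maps` (one
    activation sequence per EEG channel) is assumed upstream."""
    scores = [sum(m) / len(m) for m in cam_maps]
    ranked = sorted(range(len(scores)), key=lambda c: -scores[c])
    return sorted(ranked[:k])
```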

[LG-172] Authentication by Location Tracking in Underwater Acoustic Networks

链接: https://arxiv.org/abs/2410.03511
作者: Gianmaria Ventura,Francesco Ardizzon,Stefano Tomasin
关键词-EN: Physical layer message, underwater acoustic channel, layer message authentication, Physical layer, underwater acoustic networks
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: Article submitted to IEEE Transaction on Wireless Communications

点击查看摘要

Abstract:Physical layer message authentication in underwater acoustic networks (UWANs) leverages the characteristics of the underwater acoustic channel (UWAC) as a fingerprint of the transmitting device. However, as the device moves its UWAC changes, and the authentication mechanism must track such variations. In this paper, we propose a context-based authentication mechanism operating in two steps: first, we estimate the position of the underwater device, then we predict its future position based on the previously estimated ones. To check the authenticity of the transmission, we compare the estimated and the predicted position. The location is estimated using a convolutional neural network taking as input the sample covariance matrix of the estimated UWACs. The prediction uses either a Kalman filter or a recurrent neural network (RNN). The authentication check is performed on the squared error between the predicted and estimated positions. The solution based on the Kalman filter outperforms that built on the RNN when the device moves according to a correlated Gauss-Markov mobility model, which reproduces a typical underwater motion.
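A minimal sketch of the predict-then-check idea, with an alpha-beta tracker standing in for the paper's Kalman filter and a hypothetical squared-error threshold:

```python
def track_and_authenticate(positions, threshold=4.0, alpha=0.85, beta=0.05):
    """Sketch of the two-step check with an alpha-beta tracker standing
    in for the paper's Kalman filter: predict the next position from the
    past estimates, then flag a transmission whose newly estimated
    position deviates too far from the prediction. The threshold and
    gains are illustrative, and one coordinate is tracked for brevity."""
    x, v = positions[0], 0.0          # position and velocity state
    flags = []
    for z in positions[1:]:
        pred = x + v                  # predicted next position
        flags.append((z - pred) ** 2 > threshold)  # authentication check
        resid = z - pred              # innovation from the new estimate
        x = pred + alpha * resid      # alpha-beta state update
        v = v + beta * resid
    return flags
```

A smooth trajectory passes the check at every step, while a sudden jump in the estimated position (e.g. an impersonating device elsewhere in the network) is flagged.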

[LG-173] Aircraft Radar Altimeter Interference Mitigation Through a CNN-Layer Only Denoising Autoencoder Architecture

链接: https://arxiv.org/abs/2410.03423
作者: Samuel B. Brown,Stephen Young,Adam Wagenknecht,Daniel Jakubisin,Charles E. Thornton,Aaron Orndorff,William C. Headley
关键词-EN: experience significant difficulty, radio frequency communication, signal processing applications, large sample regime, frequency communication signals
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: To be presented at MILCOM 2024, Washington DC

点击查看摘要

Abstract:Denoising autoencoders for signal processing applications have been shown to experience significant difficulty in learning to reconstruct radio frequency communication signals, particularly in the large sample regime. In communication systems, this challenge is primarily due to the need to reconstruct the modulated data stream which is generally highly stochastic in nature. In this work, we take advantage of this limitation by using the denoising autoencoder to instead remove interfering radio frequency communication signals while reconstructing highly structured FMCW radar signals. More specifically, in this work we show that a CNN-layer only autoencoder architecture can be utilized to improve the accuracy of a radar altimeter’s ranging estimate even in severe interference environments consisting of a multitude of interference signals. This is demonstrated through comprehensive performance analysis of an end-to-end FMCW radar altimeter simulation with and without the convolutional layer-only autoencoder. The proposed approach significantly improves interference mitigation in the presence of both narrow-band tone interference as well as wideband QPSK interference in terms of range RMS error, number of false altitude reports, and the peak-to-sidelobe ratio of the resulting range profile. FMCW radar signals of up to 40,000 IQ samples can be reliably reconstructed.

[LG-174] Conformal confidence sets for biomedical image segmentation

链接: https://arxiv.org/abs/2410.03406
作者: Samuel Davenport
关键词-EN: provide spatial uncertainty, spatial uncertainty guarantees, black-box machine learning, develop confidence sets, confidence sets
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We develop confidence sets which provide spatial uncertainty guarantees for the output of a black-box machine learning model designed for image segmentation. To do so we adapt conformal inference to the imaging setting, obtaining thresholds on a calibration dataset based on the distribution of the maximum of the transformed logit scores within and outside of the ground truth masks. We prove that these confidence sets, when applied to new predictions of the model, are guaranteed to contain the true unknown segmented mask with desired probability. We show that learning appropriate score transformations on a learning dataset before performing calibration is crucial for optimizing performance. We illustrate and validate our approach on a polyp tumor dataset. To do so we obtain the logit scores from a deep neural network trained for polyp segmentation and show that using distance transformed scores to obtain outer confidence sets and the original scores for inner confidence sets enables tight bounds on tumor location whilst controlling the false coverage rate.
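A minimal split-conformal sketch of the calibration step for an inner confidence set, with images flattened to pixel lists and the paper's learned score transformations omitted:

```python
import math

def conformal_threshold(calib_scores, calib_masks, alpha=0.1):
    """Split-conformal calibration sketch for an *inner* confidence set:
    for each calibration image, record the maximum score over background
    pixels; the conformal (1 - alpha) quantile of those maxima gives a
    threshold tau such that {pixels with score > tau} lies inside the
    true mask with probability >= 1 - alpha on an exchangeable new
    image. The learned score transformations from the paper are omitted,
    and images are flattened to pixel lists for simplicity."""
    maxima = sorted(max(s for s, m in zip(scores, mask) if m == 0)
                    for scores, mask in zip(calib_scores, calib_masks))
    n = len(maxima)
    k = min(math.ceil((n + 1) * (1 - alpha)) - 1, n - 1)  # 0-based index
    return maxima[k]

def inner_confidence_set(scores, tau):
    # keep only pixels whose score clears the calibrated threshold
    return [1 if s > tau else 0 for s in scores]
```

The outer set of the paper is calibrated analogously from the score distribution inside the masks.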

[LG-175] Manikin-Recorded Cardiopulmonary Sounds Dataset Using Digital Stethoscope

链接: https://arxiv.org/abs/2410.03280
作者: Yasaman Torabi,Shahram Shirani,James P. Reilly
关键词-EN: healthcare monitoring, crucial for healthcare, Heart and lung, lung sounds, sounds
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Heart and lung sounds are crucial for healthcare monitoring. Recent improvements in stethoscope technology have made it possible to capture patient sounds with enhanced precision. In this dataset, we used a digital stethoscope to capture both heart and lung sounds, including individual and mixed recordings. To our knowledge, this is the first dataset to offer both separate and mixed cardiorespiratory sounds. The recordings were collected from a clinical manikin, a patient simulator designed to replicate human physiological conditions, generating clean heart and lung sounds at different body locations. This dataset includes both normal sounds and various abnormalities (i.e., murmur, atrial fibrillation, tachycardia, atrioventricular block, third and fourth heart sound, wheezing, crackles, rhonchi, pleural rub, and gurgling sounds). The dataset includes audio recordings of chest examinations performed at different anatomical locations, as determined by specialist nurses. Each recording has been enhanced using frequency filters to highlight specific sound types. This dataset is useful for applications in artificial intelligence, such as automated cardiopulmonary disease detection, sound classification, unsupervised separation techniques, and deep learning algorithms related to audio signal processing.

[LG-176] Optimal Transport for epsilon-Contaminated Credal Sets

链接: https://arxiv.org/abs/2410.03267
作者: Michele Caprio
关键词-EN: Kantorovich optimal transport, Kantorovich optimal, optimal transport problems, Monge and Kantorovich, Kantorovich
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注:

点击查看摘要

Abstract:We provide a version for lower probabilities of Monge’s and Kantorovich’s optimal transport problems. We show that, when the lower probabilities are the lower envelopes of \epsilon-contaminated sets, then our version of Monge’s, and a restricted version of our Kantorovich’s problems, coincide with their respective classical versions. We also give sufficient conditions for the existence of our version of Kantorovich’s optimal plan, and for the two problems to be equivalent. As a byproduct, we show that for \epsilon-contaminations the lower probability versions of Monge’s and Kantorovich’s optimal transport problems need not coincide. The applications of our results to Machine Learning and Artificial Intelligence are also discussed.
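For concreteness, the \epsilon-contaminated credal set around a reference probability measure P, and its lower envelope (a standard identity for \epsilon-contamination, added here for the reader), can be written as:

```latex
\mathcal{P}_{\epsilon}(P) \;=\; \left\{\, (1-\epsilon)\,P + \epsilon\,Q \;:\; Q \text{ a probability measure on } \Omega \,\right\},
\qquad
\underline{P}(A) \;=\;
\begin{cases}
(1-\epsilon)\,P(A), & A \subsetneq \Omega,\\[2pt]
1, & A = \Omega.
\end{cases}
```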

[LG-177] Elucidating the Design Choice of Probability Paths in Flow Matching for Forecasting

链接: https://arxiv.org/abs/2410.03229
作者: Soon Hoe Lim,Yijin Wang,Annan Yu,Emma Hart,Michael W. Mahoney,Xiaoye S. Li,N. Benjamin Erichson
关键词-EN: time series forecasting, probability path model, probability path, probabilistic time series, latent spaces
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 30 pages

点击查看摘要

Abstract:Flow matching has recently emerged as a powerful paradigm for generative modeling and has been extended to probabilistic time series forecasting in latent spaces. However, the impact of the specific choice of probability path model on forecasting performance remains under-explored. In this work, we demonstrate that forecasting spatio-temporal data with flow matching is highly sensitive to the selection of the probability path model. Motivated by this insight, we propose a novel probability path model designed to improve forecasting performance. Our empirical results across various dynamical system benchmarks show that our model achieves faster convergence during training and improved predictive performance compared to existing probability path models. Importantly, our approach is efficient during inference, requiring only a few sampling steps. This makes our proposed model practical for real-world applications and opens new avenues for probabilistic forecasting.
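For readers unfamiliar with flow matching, here is one Monte-Carlo term of the conditional flow matching loss under the simple linear (optimal-transport) probability path; the choice of path model is exactly the design axis the paper studies, and `model` is a stand-in velocity network:

```python
import random

def cfm_loss_sample(x1, model, rng):
    """One Monte-Carlo term of the conditional flow matching loss under
    the linear probability path x_t = (1 - t) x0 + t x1, whose target
    velocity is x1 - x0. Swapping in another path model changes x_t and
    the regression target, which is the design choice the paper
    investigates. `model(x_t, t)` is a stand-in velocity network."""
    x0 = rng.gauss(0.0, 1.0)       # sample from the noise distribution
    t = rng.random()               # time drawn uniformly from [0, 1]
    x_t = (1.0 - t) * x0 + t * x1  # point on the probability path
    target = x1 - x0               # velocity of the path given (x0, x1)
    pred = model(x_t, t)
    return (pred - target) ** 2    # squared flow matching error
```

Averaging this term over data points, noise samples, and times gives the training objective; at inference, integrating the learned velocity field from noise produces a forecast sample.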

[LG-178] Learning to steer with Brownian noise

链接: https://arxiv.org/abs/2410.03221
作者: Stefan Ankirchner,Sören Christensen,Jan Kallsen,Philip Le Borne,Stefan Perko
关键词-EN: velocity follower problem, bounded velocity follower, decision maker lacks, maker lacks knowledge, underlying system parameters
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:This paper considers an ergodic version of the bounded velocity follower problem, assuming that the decision maker lacks knowledge of the underlying system parameters and must learn them while simultaneously controlling. We propose algorithms based on moving empirical averages and develop a framework for integrating statistical methods with stochastic control theory. Our primary result is a logarithmic expected regret rate. To achieve this, we conduct a rigorous analysis of the ergodic convergence rates of the underlying processes and the risks of the considered estimators.

[LG-179] Nested Deep Learning Model: A Foundation Model for Brain Signal Data

链接: https://arxiv.org/abs/2410.03191
作者: Fangyi Wei,Jiajie Mo,Kai Zhang,Haipeng Shen,Srikantan Nagarajan,Fei Jiang
关键词-EN: million people globally, Epilepsy affects, million people, people globally, diagnosis and treatment
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 31 pages; references added; 14 pages supplementary materials added

点击查看摘要

Abstract:Epilepsy affects over 50 million people globally, with EEG/MEG-based spike detection playing a crucial role in diagnosis and treatment. Manual spike identification is time-consuming and requires specialized training, limiting the number of professionals available to analyze EEG/MEG data. To address this, various algorithmic approaches have been developed. However, current methods face challenges in handling varying channel configurations and in identifying the specific channels where spikes originate. This paper introduces a novel Nested Deep Learning (NDL) framework designed to overcome these limitations. NDL applies a weighted combination of signals across all channels, ensuring adaptability to different channel setups, and allows clinicians to identify key channels more accurately. Through theoretical analysis and empirical validation on real EEG/MEG datasets, NDL demonstrates superior accuracy in spike detection and channel localization compared to traditional methods. The results show that NDL improves prediction accuracy, supports cross-modality data integration, and can be fine-tuned for various neurophysiological applications.
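A minimal sketch of the channel-agnostic weighted combination described above; in NDL the weights are learned, but here they are fixed inputs for illustration:

```python
import math

def nested_combination(signals, weights):
    """Sketch of a channel-adaptive front end in the spirit of NDL:
    softmax-normalized weights combine an arbitrary number of EEG/MEG
    channels into one signal, so the downstream spike detector is
    agnostic to the channel layout, and large weights indicate the
    channels driving a detection. In NDL the weights are learned; here
    they are plain inputs."""
    z = [math.exp(w) for w in weights]
    total = sum(z)
    attn = [v / total for v in z]          # softmax channel weights
    length = len(signals[0])
    combined = [sum(attn[c] * signals[c][t] for c in range(len(signals)))
                for t in range(length)]
    return combined, attn
```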

[LG-180] ECHOPulse: ECG controlled echocardio-grams video generation

链接: https://arxiv.org/abs/2410.03143
作者: Yiwei Li,Sekeun Kim,Zihao Wu,Hanqi Jiang,Yi Pan,Pengfei Jin,Sifan Song,Yucheng Shi,Tianze Yang,Tianming Liu,Quanzheng Li,Xiang Li
关键词-EN: ECHO video generation, interpretation heavily relies, ECHO video, ECHO, video generation
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Echocardiography (ECHO) is essential for cardiac assessments, but its video quality and interpretation heavily rely on manual expertise, leading to inconsistent results from clinical and portable devices. ECHO video generation offers a solution by improving automated monitoring through synthetic data and generating high-quality videos from routine health data. However, existing models often face high computational costs, slow inference, and rely on complex conditional prompts that require experts’ annotations. To address these challenges, we propose ECHOPULSE, an ECG-conditioned ECHO video generation model. ECHOPULSE introduces two key advancements: (1) it accelerates ECHO video generation by leveraging VQ-VAE tokenization and masked visual token modeling for fast decoding, and (2) it conditions on readily accessible ECG signals, which are highly coherent with ECHO videos, bypassing complex conditional prompts. To the best of our knowledge, this is the first work to use time-series prompts like ECG signals for ECHO video generation. ECHOPULSE not only enables controllable synthetic ECHO data generation but also provides updated cardiac function information for disease monitoring and prediction beyond ECG alone. Evaluations on three public and private datasets demonstrate state-of-the-art performance in ECHO video generation across both qualitative and quantitative measures. Additionally, ECHOPULSE can be easily generalized to other modality generation tasks, such as cardiac MRI, fMRI, and 3D CT generation. A demo can be seen at this https URL.

[LG-181] Forest Proximities for Time Series

链接: https://arxiv.org/abs/2410.03098
作者: Ben Shaw,Jake Rhodes,Soukaina Filali Boubrahimi,Kevin R. Moon
关键词-EN: improved random forest, series distance measures, random forest proximity, time series, recently been introduced
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:RF-GAP has recently been introduced as an improved random forest proximity measure. In this paper, we present PF-GAP, an extension of RF-GAP proximities to proximity forests, an accurate and efficient time series classification model. We use the forest proximities in connection with Multi-Dimensional Scaling to obtain vector embeddings of univariate time series, comparing the embeddings to those obtained using various time series distance measures. We also use the forest proximities alongside Local Outlier Factors to investigate the connection between misclassified points and outliers, comparing with nearest neighbor classifiers which use time series distance measures. We show that the forest proximities may exhibit a stronger connection between misclassified points and outliers than nearest neighbor classifiers.
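A minimal sketch of two uses of forest proximities mentioned above, assuming a precomputed symmetric proximity matrix; the sqrt(1 - p) dissimilarity is a common choice for feeding proximities into Multi-Dimensional Scaling, and the MDS step itself is omitted:

```python
def proximity_to_distance(P):
    """Turn a symmetric forest-proximity matrix (entries in [0, 1], ones
    on the diagonal) into the dissimilarity matrix sqrt(1 - p) commonly
    passed to Multi-Dimensional Scaling; the MDS embedding step itself
    is omitted from this sketch."""
    return [[(1.0 - p) ** 0.5 for p in row] for row in P]

def proximity_classify(P, labels, i):
    # 1-nearest-neighbor by proximity: the most proximal other series
    # lends its label, mirroring the nearest-neighbor baselines above
    j = max((j for j in range(len(P)) if j != i), key=lambda j: P[i][j])
    return labels[j]
```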

[LG-182] Entanglement-induced provable and robust quantum learning advantages

链接: https://arxiv.org/abs/2410.03094
作者: Haimeng Zhao,Dong-Ling Deng
关键词-EN: Quantum computing holds, innovate machine learning, quantum learning advantage, Quantum, learning
类目: Quantum Physics (quant-ph); Computational Complexity (cs.CC); Machine Learning (cs.LG)
*备注: 7 pages, 2 figures + 13-page supplementary materials

点击查看摘要

Abstract:Quantum computing holds the unparalleled potential to enhance, speed up or innovate machine learning. However, an unambiguous demonstration of quantum learning advantage has not been achieved so far. Here, we rigorously establish a noise-robust, unconditional quantum learning advantage in terms of expressivity, inference speed, and training efficiency, compared to commonly-used classical machine learning models. Our proof is information-theoretic and pinpoints the origin of this advantage: quantum entanglement can be used to reduce the communication required by non-local machine learning tasks. In particular, we design a fully classical task that can be solved with unit accuracy by a quantum model with a constant number of variational parameters using entanglement resources, whereas commonly-used classical models must scale at least linearly with the size of the task to achieve a larger-than-exponentially-small accuracy. We further show that the quantum model can be trained with constant time and a number of samples inversely proportional to the problem size. We prove that this advantage is robust against constant depolarization noise. We show through numerical simulations that even though the classical models can have improved performance as their sizes are increased, they would suffer from overfitting. The constant-versus-linear separation, bolstered by the overfitting problem, makes it possible to demonstrate the quantum advantage with relatively small system sizes. We demonstrate, through both numerical simulations and trapped-ion experiments on IonQ Aria, the desired quantum-classical learning separation. Our results provide a valuable guide for demonstrating quantum learning advantages in practical applications with current noisy intermediate-scale quantum devices.

[LG-183] Vehicle Suspension Recommendation System: Multi-Fidelity Neural Network-based Mechanism Design Optimization

链接: https://arxiv.org/abs/2410.03045
作者: Sumin Lee,Namwoo Kang
关键词-EN: design, performance, perform functions, analysis, designed
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mechanisms are designed to perform functions in various fields. Often, there is no unique mechanism that performs a well-defined function. For example, vehicle suspensions are designed to improve driving performance and ride comfort, but different types are available depending on the environment. This variability in design makes performance comparison difficult. Additionally, the traditional design process is multi-step, gradually reducing the number of design candidates while performing costly analyses to meet target performance. Recently, AI models have been used to reduce the computational cost of FEA. However, there are limitations in data availability and different analysis environments, especially when transitioning from low-fidelity to high-fidelity analysis. In this paper, we propose a multi-fidelity design framework aimed at recommending optimal types and designs of mechanical mechanisms. As an application, vehicle suspension systems were selected, and several types were defined. For each type, mechanism parameters were generated and converted into 3D CAD models, followed by low-fidelity rigid body dynamic analysis under driving conditions. To effectively build a deep learning-based multi-fidelity surrogate model, the results of the low-fidelity analysis were analyzed using DBSCAN and sampled at 5% for high-cost flexible body dynamic analysis. After training the multi-fidelity model, a multi-objective optimization problem was formulated for the performance metrics of each suspension type. Finally, we recommend the optimal type and design based on the input to optimize ride comfort-related performance metrics. To validate the proposed methodology, we extracted basic design rules of Pareto solutions using data mining techniques. We also verified the effectiveness and applicability by comparing the results with those obtained from a conventional deep learning-based design process.

[LG-184] Minmax Trend Filtering: A Locally Adaptive Nonparametric Regression Method via Pointwise Min Max Optimization

链接: https://arxiv.org/abs/2410.03041
作者: Sabyasachi Chatterjee
关键词-EN: Fused Lasso, classical linear smoothing, Trend Filtering, Minmax Trend Filtering, Total Variation Denoising
类目: Statistics Theory (math.ST); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Trend Filtering is a nonparametric regression method which exhibits local adaptivity, in contrast to a host of classical linear smoothing methods. However, there seems to be no unanimously agreed upon definition of local adaptivity in the literature. A question we seek to answer here is how exactly is Fused Lasso or Total Variation Denoising, which is Trend Filtering of order 0, locally adaptive? To answer this question, we first derive a new pointwise formula for the Fused Lasso estimator in terms of min-max/max-min optimization of penalized local averages. This pointwise representation appears to be new and gives a concrete explanation of the local adaptivity of Fused Lasso. It yields that the estimation error of Fused Lasso at any given point is bounded by the best (local) bias variance tradeoff where bias and variance have a slightly different meaning than usual. We then propose higher order polynomial versions of Fused Lasso which are defined pointwise in terms of min-max/max-min optimization of penalized local polynomial regressions. These appear to be new nonparametric regression methods, different from any existing method in the nonparametric regression toolbox. We call these estimators Minmax Trend Filtering. They continue to enjoy the notion of local adaptivity in the sense that their estimation error at any given point is bounded by the best (local) bias variance tradeoff.
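As a toy illustration of the pointwise min-max idea in this abstract, the sketch below computes a point estimate as a min over right endpoints of a max over left endpoints of penalized local averages. The specific penalty form (a ±λ term divided by the segment length) is an assumption for illustration, not the paper's exact formula.

```python
def local_average(y, a, b):
    """Mean of y[a..b] (inclusive endpoints)."""
    seg = y[a:b + 1]
    return sum(seg) / len(seg)

def minmax_point_estimate(y, i, lam):
    """Schematic min-max of penalized local averages around index i,
    in the spirit of the pointwise Fused Lasso representation described
    in the abstract. Penalty placement is an illustrative assumption."""
    n = len(y)
    best = float("inf")
    for b in range(i, n):              # candidate right endpoints
        worst = float("-inf")
        for a in range(i + 1):         # candidate left endpoints
            m = b - a + 1
            worst = max(worst, local_average(y, a, b) - lam / m)
        best = min(best, worst + lam / (b - i + 1))
    return best
```

With `lam = 0` the estimate collapses to an unpenalized local average, which is a useful sanity check on constant data.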

[LG-185] Towards Understanding the Universality of Transformers for Next-Token Prediction

链接: https://arxiv.org/abs/2410.03011
作者: Michael E. Sander,Gabriel Peyré
关键词-EN: Causal Transformers, causal kernel descent, trained to predict, Causal, Transformers
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Preprint, 22 pages

点击查看摘要

Abstract:Causal Transformers are trained to predict the next token for a given context. While it is widely accepted that self-attention is crucial for encoding the causal structure of sequences, the precise underlying mechanism behind this in-context autoregressive learning ability remains unclear. In this paper, we take a step towards understanding this phenomenon by studying the approximation ability of Transformers for next-token prediction. Specifically, we explore the capacity of causal Transformers to predict the next token x_{t+1} given an autoregressive sequence (x_1, \dots, x_t) as a prompt, where x_{t+1} = f(x_t), and f is a context-dependent function that varies with each sequence. On the theoretical side, we focus on specific instances, namely when f is linear or when (x_t)_{t \geq 1} is periodic. We explicitly construct a Transformer (with linear, exponential, or softmax attention) that learns the mapping f in-context through a causal kernel descent method. The causal kernel descent method we propose provably estimates x_{t+1} based solely on past and current observations (x_1, \dots, x_t), with connections to the Kaczmarz algorithm in Hilbert spaces. We present experimental results that validate our theoretical findings and suggest their applicability to more general mappings f.
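The abstract connects the proposed causal kernel descent to the Kaczmarz algorithm. For the linear case f(x) = Wx, a Kaczmarz-style row-projection update on the pairs (x_t, x_{t+1}) can be sketched as follows; the iterate notation W_k is illustrative and not taken from the paper.

```latex
% Kaczmarz-style update for learning a linear map W from autoregressive
% pairs (x_t, x_{t+1} = W^{*} x_t); notation is an illustrative assumption.
W_{k+1} \;=\; W_k \;+\; \frac{\left(x_{t+1} - W_k x_t\right)\, x_t^{\top}}{\lVert x_t \rVert^{2}}
```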

[LG-186] GABIC: Graph-based Attention Block for Image Compression ICIP2024

链接: https://arxiv.org/abs/2410.02981
作者: Gabriele Spadaro,Alberto Presta,Enzo Tartaglione,Jhony H. Giraldo,Marco Grangetto,Attilio Fiandrotti
关键词-EN: neural Learned Image, Learned Image Compression, neural Learned, JPEG and HEVC-intra, Learned Image
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 10 pages, 5 figures, accepted at ICIP 2024

点击查看摘要

Abstract:While standardized codecs like JPEG and HEVC-intra represent the industry standard in image compression, neural Learned Image Compression (LIC) codecs represent a promising alternative. In detail, integrating attention mechanisms from Vision Transformers into LIC models has shown improved compression efficiency. However, extra efficiency often comes at the cost of aggregating redundant features. This work proposes a Graph-based Attention Block for Image Compression (GABIC), a method to reduce feature redundancy based on a k-Nearest Neighbors enhanced attention mechanism. Our experiments show that GABIC outperforms comparable methods, particularly at high bit rates, enhancing compression performance.
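A minimal sketch of the k-NN enhanced attention idea: each query attends only to its k nearest keys by dot-product score, reducing the aggregation of redundant features. GABIC's actual graph construction and local window design may differ; this shows only the general mechanism.

```python
import math

def knn_attention(queries, keys, values, k):
    """Softmax attention restricted, per query, to the k keys with the
    highest dot-product score (a sketch of k-NN enhanced attention)."""
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, key)) for key in keys]
        # indices of the k nearest keys for this query
        topk = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)[:k]
        m = max(scores[j] for j in topk)          # for numerical stability
        weights = {j: math.exp(scores[j] - m) for j in topk}
        z = sum(weights.values())
        dim = len(values[0])
        out.append([sum(weights[j] * values[j][d] for j in topk) / z
                    for d in range(dim)])
    return out
```

Setting `k = len(keys)` recovers ordinary full softmax attention.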

[LG-187] From Optimization to Sampling via Lyapunov Potentials

链接: https://arxiv.org/abs/2410.02979
作者: August Y. Chen,Karthik Sridharan
关键词-EN: appropriately scaled Gaussian, scaled Gaussian noise, Langevin Dynamics, Gradient Descent, Gradient Descent leads
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We study the problem of sampling from high-dimensional distributions using Langevin Dynamics, a natural and popular variant of Gradient Descent where at each step, appropriately scaled Gaussian noise is added. The similarities between Langevin Dynamics and Gradient Descent lead to the natural question: if the distribution’s log-density can be optimized from all initializations via Gradient Descent, given oracle access to the gradients, can we sample from the distribution using Langevin Dynamics? We answer this question in the affirmative, at low but appropriate temperature levels natural in the context of both optimization and real-world applications. As a corollary, we show we can sample from several new natural and interesting classes of non-log-concave densities, an important setting where we have relatively few examples.
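The Langevin Dynamics variant described here (a gradient step plus appropriately scaled Gaussian noise) can be sketched as a standard Euler-Maruyama update; this is the textbook form, not tied to the paper's temperature analysis.

```python
import math
import random

def langevin_step(x, grad_log_p, step, noise):
    """One Euler-Maruyama step of overdamped Langevin Dynamics:
    x' = x + step * grad log p(x) + sqrt(2 * step) * noise, noise ~ N(0, 1)."""
    return x + step * grad_log_p(x) + math.sqrt(2.0 * step) * noise

def sample(grad_log_p, x0, step, n_steps, seed=0):
    """Run the chain for n_steps from x0; returns the final state."""
    rng = random.Random(seed)
    x = x0
    for _ in range(n_steps):
        x = langevin_step(x, grad_log_p, step, rng.gauss(0.0, 1.0))
    return x
```

For a standard Gaussian target, `grad_log_p = lambda x: -x`, and the chain's stationary distribution approximates N(0, 1) for small step sizes.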

[LG-188] On Expert Estimation in Hierarchical Mixture of Experts: Beyond Softmax Gating Functions

链接: https://arxiv.org/abs/2410.02935
作者: Huy Nguyen,Xing Han,Carl William Harris,Suchi Saria,Nhat Ho
关键词-EN: handling complex inputs, Mixture of Experts, Hierarchical Mixture, developing large-scale foundation, architecture in developing
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 58 pages

点击查看摘要

Abstract:With the growing prominence of the Mixture of Experts (MoE) architecture in developing large-scale foundation models, we investigate the Hierarchical Mixture of Experts (HMoE), a specialized variant of MoE that excels in handling complex inputs and improving performance on targeted tasks. Our investigation highlights the advantages of using varied gating functions, moving beyond softmax gating within HMoE frameworks. We theoretically demonstrate that applying tailored gating functions to each expert group allows HMoE to achieve robust results, even when optimal gating functions are applied only at select hierarchical levels. Empirical validation across diverse scenarios supports these theoretical claims. This includes large-scale multimodal tasks, image classification, and latent domain discovery and prediction tasks, where our modified HMoE models show great performance improvements.
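A two-level hierarchical MoE forward pass with softmax gating at both levels can be sketched as follows; the paper's point is that the gating function at either level can be swapped for alternatives, which in this sketch would mean replacing `softmax`.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def hmoe_forward(x, group_gates, expert_gates, experts):
    """Two-level hierarchical MoE: a top gate mixes expert groups, and a
    per-group gate mixes experts within each group. All gates and experts
    are scalar-valued callables here for simplicity."""
    top = softmax([g(x) for g in group_gates])
    out = 0.0
    for gi, group in enumerate(experts):
        inner = softmax([h(x) for h in expert_gates[gi]])
        out += top[gi] * sum(w * e(x) for w, e in zip(inner, group))
    return out
```

With a single group containing a single expert, the gates are vacuous and the output equals the expert's output.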

[LG-189] FAIR Universe HiggsML Uncertainty Challenge Competition

链接: https://arxiv.org/abs/2410.02867
作者: Wahid Bhimji,Paolo Calafiura,Ragansu Chakkappai,Yuan-Tang Chou,Sascha Diefenbacher,Jordan Dudley,Steven Farrell,Aishik Ghosh,Isabelle Guyon,Chris Harris,Shih-Chieh Hsu,Elham E Khoda,Rémy Lyscar,Alexandre Michon,Benjamin Nachman,Peter Nugent,Mathis Reymond,David Rousseau,Benjamin Sluijter,Benjamin Thorne,Ihsan Ullah,Yulei Zhang
关键词-EN: HiggsML Uncertainty Challenge, FAIR Universe, Uncertainty Challenge focuses, imperfect simulators due, HiggsML Uncertainty
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Data Analysis, Statistics and Probability (physics.data-an)
*备注: Whitepaper for the FAIR Universe HiggsML Uncertainty Challenge Competition, available : this https URL

点击查看摘要

Abstract:The FAIR Universe – HiggsML Uncertainty Challenge focuses on measuring the physics properties of elementary particles with imperfect simulators due to differences in modelling systematic errors. Additionally, the challenge is leveraging a large-compute-scale AI platform for sharing datasets, training models, and hosting machine learning competitions. Our challenge brings together the physics and machine learning communities to advance our understanding and methodologies in handling systematic (epistemic) uncertainties within AI techniques.

[LG-190] Reconstructing Galaxy Cluster Mass Maps using Score-based Generative Modeling

链接: https://arxiv.org/abs/2410.02857
作者: Alan Hsu,Matthew Ho,Joyce Lin,Carleen Markey,Michelle Ntampaka,Hy Trac,Barnabás Póczos
关键词-EN: score-based generative modeling, dark matter projected, dark matter maps, matter projected density, projected density maps
类目: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (cs.LG)
*备注: 15 pages, 9 figures, submitted to The Open Journal of Astrophysics

点击查看摘要

Abstract:We present a novel approach to reconstruct gas and dark matter projected density maps of galaxy clusters using score-based generative modeling. Our diffusion model takes in mock SZ and X-ray images as conditional observations, and generates realizations of corresponding gas and dark matter maps by sampling from a learned data posterior. We train and validate the performance of our model by using mock data from a hydrodynamical cosmological simulation. The model accurately reconstructs both the mean and spread of the radial density profiles in the spatial domain to within 5%, indicating that the model is able to distinguish between clusters of different sizes. In the spectral domain, the model achieves close-to-unity values for the bias and cross-correlation coefficients, indicating that the model can accurately probe cluster structures on both large and small scales. Our experiments demonstrate the ability of score models to learn a strong, nonlinear, and unbiased mapping between input observables and fundamental density distributions of galaxy clusters. These diffusion models can be further fine-tuned and generalized to not only take in additional observables as inputs, but also real observations and predict unknown density distributions of galaxy clusters.

[LG-191] A Spatio-Temporal Machine Learning Model for Mortgage Credit Risk: Default Probabilities and Loan Portfolios

链接: https://arxiv.org/abs/2410.02846
作者: Pascal Kündig,Fabio Sigrist
关键词-EN: latent spatio-temporal Gaussian, spatio-temporal Gaussian process, Gaussian process model, Gaussian process, process model accounting
类目: Risk Management (q-fin.RM); Machine Learning (cs.LG); Statistical Finance (q-fin.ST)
*备注:

点击查看摘要

Abstract:We introduce a novel machine learning model for credit risk by combining tree-boosting with a latent spatio-temporal Gaussian process model accounting for frailty correlation. This allows for modeling non-linearities and interactions among predictor variables in a flexible data-driven manner and for accounting for spatio-temporal variation that is not explained by observable predictor variables. We also show how estimation and prediction can be done in a computationally efficient manner. In an application to a large U.S. mortgage credit risk data set, we find that both predictive default probabilities for individual loans and predictive loan portfolio loss distributions obtained with our novel approach are more accurate compared to conventional independent linear hazard models and also linear spatio-temporal models. Using interpretability tools for machine learning models, we find that the likely reasons for this outperformance are strong interaction and non-linear effects in the predictor variables and the presence of large spatio-temporal frailty effects.

[LG-192] CAnDOIT: Causal Discovery with Observational and Interventional Data from Time-Series

链接: https://arxiv.org/abs/2410.02844
作者: Luca Castri,Sariah Mghames,Marc Hanheide,Nicola Bellotto
关键词-EN: branches of science, intelligent systems, utmost importance, causal, data
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Published in Advanced Intelligent Systems

点击查看摘要

Abstract:The study of cause-and-effect is of the utmost importance in many branches of science, but also for many practical applications of intelligent systems. In particular, identifying causal relationships in situations that include hidden factors is a major challenge for methods that rely solely on observational data for building causal models. This paper proposes CAnDOIT, a causal discovery method to reconstruct causal models using both observational and interventional time-series data. The use of interventional data in the causal analysis is crucial for real-world applications, such as robotics, where the scenario is highly complex and observational data alone are often insufficient to uncover the correct causal structure. Validation of the method is performed initially on randomly generated synthetic models and subsequently on a well-known benchmark for causal structure learning in a robotic manipulation environment. The experiments demonstrate that the approach can effectively handle data from interventions and exploit them to enhance the accuracy of the causal analysis. A Python implementation of CAnDOIT has also been developed and is publicly available on GitHub: this https URL.

[LG-193] Modelling the longevity of complex living systems

链接: https://arxiv.org/abs/2410.02838
作者: Indrė Žliobaitė
关键词-EN: ECML PKDD, Nectar Track, Track of ECML, extended abstract, Laws of Macroevolutionary
类目: Populations and Evolution (q-bio.PE); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:This extended abstract was presented at the Nectar Track of ECML PKDD 2024 in Vilnius, Lithuania. The content supplements a recently published paper “Laws of Macroevolutionary Expansion” in the Proceedings of the National Academy of Sciences (PNAS).

[LG-194] The MLE is minimax optimal for LGC

链接: https://arxiv.org/abs/2410.02835
作者: Doron Cohen,Aryeh Kontorovich,Roi Weiss
关键词-EN: Maximum Likelihood Estimator, recently introduced Local, introduced Local Glivenko-Cantelli, Local Glivenko-Cantelli setting, Maximum Likelihood
类目: atistics Theory (math.ST); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:We revisit the recently introduced Local Glivenko-Cantelli setting, which studies distribution-dependent uniform convergence rates of the Maximum Likelihood Estimator (MLE). In this work, we investigate generalizations of this setting where arbitrary estimators are allowed rather than just the MLE. Can a strictly larger class of measures be learned? Can better risk decay rates be obtained? We provide exhaustive answers to these questions – which are both negative, provided the learner is barred from exploiting some infinite-dimensional pathologies. On the other hand, allowing such exploits does lead to a strictly larger class of learnable measures.

[LG-195] Asymmetry of the Relative Entropy in the Regularization of Empirical Risk Minimization

链接: https://arxiv.org/abs/2410.02833
作者: Francisco Daunas,Iñaki Esnaola,Samir M. Perlaza,H. Vincent Poor
关键词-EN: relative entropy, relative entropy regularization, relative entropy asymmetry, Type-II ERM-RER, relative entropy forces
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The effect of relative entropy asymmetry is analyzed in the context of empirical risk minimization (ERM) with relative entropy regularization (ERM-RER). Two regularizations are considered: (a) the relative entropy of the measure to be optimized with respect to a reference measure (Type-I ERM-RER); or (b) the relative entropy of the reference measure with respect to the measure to be optimized (Type-II ERM-RER). The main result is the characterization of the solution to the Type-II ERM-RER problem and its key properties. By comparing the well-understood Type-I ERM-RER with Type-II ERM-RER, the effects of entropy asymmetry are highlighted. The analysis shows that in both cases, regularization by relative entropy forces the solution’s support to collapse into the support of the reference measure, introducing a strong inductive bias that can overshadow the evidence provided by the training data. Finally, it is shown that Type-II regularization is equivalent to Type-I regularization with an appropriate transformation of the empirical risk function.
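In the notation suggested by the abstract (P the measure being optimized, Q the reference measure, and the expected empirical risk under P; symbols are assumed, not taken from the paper), the two regularized objectives differ only in the order of the relative entropy's arguments:

```latex
% Type-I penalizes D(P || Q); Type-II penalizes D(Q || P).
\text{(Type-I)}\quad  \min_{P} \; \mathbb{E}_{P}[\ell] + \lambda\, D(P \,\|\, Q)
\qquad\qquad
\text{(Type-II)}\quad \min_{P} \; \mathbb{E}_{P}[\ell] + \lambda\, D(Q \,\|\, P)
```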

[LG-196] Inverse Design of Copolymers Including Stoichiometry and Chain Architecture

链接: https://arxiv.org/abs/2410.02824
作者: Gabriel Vogel,Jana M. Weber
关键词-EN: hinder rapid discovery, space hinder rapid, properties is high, innovative synthetic polymers, demand for innovative
类目: Soft Condensed Matter (cond-mat.soft); Machine Learning (cs.LG)
*备注: 24 pages, 20 figures

点击查看摘要

Abstract:The demand for innovative synthetic polymers with improved properties is high, but their structural complexity and vast design space hinder rapid discovery. Machine learning-guided molecular design is a promising approach to accelerate polymer discovery. However, the scarcity of labeled polymer data and the complex hierarchical structure of synthetic polymers make generative design particularly challenging. We advance the current state-of-the-art approaches to generate not only repeating units, but monomer ensembles including their stoichiometry and chain architecture. We build upon a recent polymer representation that includes stoichiometries and chain architectures of monomer ensembles and develop a novel variational autoencoder (VAE) architecture encoding a graph and decoding a string. Using a semi-supervised setup, we enable the handling of partly labelled datasets, which can be beneficial for domains with a small corpus of labelled data. Our model learns a continuous, well-organized latent space (LS) that enables de-novo generation of copolymer structures including different monomer stoichiometries and chain architectures. In an inverse design case study, we demonstrate our model for in-silico discovery of novel conjugated copolymer photocatalysts for hydrogen production using optimization of the polymer’s electron affinity and ionization potential in the latent space.

[LG-197] Raising the Bar(ometer): Identifying a User's Stair and Lift Usage Through Wearable Sensor Data Analysis

链接: https://arxiv.org/abs/2410.02790
作者: Hrishikesh Balkrishna Karande,Ravikiran Arasur Thippeswamy Shivalingappa,Abdelhafid Nassim Yaici,Iman Haghbin,Niravkumar Bavadiya,Robin Burchard,Kristof Van Laerhoven
关键词-EN: multiple times daily, confronted multiple times, confronted multiple, stairs, multiple times
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: submitted to iWOAR 2024

点击查看摘要

Abstract:Many users are confronted multiple times daily with the choice of whether to take the stairs or the elevator. Whereas taking the stairs could be beneficial for cardiovascular health and wellness, taking the elevator might be more convenient but it also consumes energy. By precisely tracking and boosting users’ stair and elevator usage through their wearables, users might gain health insights and motivation, encouraging a healthy lifestyle and lowering the risk of sedentary-related health problems. This research describes a new exploratory dataset, to examine the patterns and behaviors related to using stairs and lifts. We collected data from 20 participants while climbing and descending stairs and taking a lift in a variety of scenarios. The aim is to provide insights and demonstrate the practicality of using wearable sensor data for such a scenario. Our collected dataset was used to train and test a Random Forest machine learning model, and the results show that our method is highly accurate at classifying stair and lift operations with an accuracy of 87.61% and a multi-class weighted F1-score of 87.56% over 8-second time windows. Furthermore, we investigate the effect of various types of sensors and data attributes on the model’s performance. Our findings show that combining inertial and pressure sensors yields a viable solution for real-time activity detection.
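The 8-second windowing mentioned in the results can be sketched as follows: split a sensor stream into non-overlapping windows and compute simple summary statistics per window, which would then feed a classifier such as the Random Forest used in the paper. The actual feature set is not specified here; these features are placeholders.

```python
import statistics

def window_features(samples, rate_hz, window_s=8):
    """Split a 1-D sensor stream into non-overlapping windows of
    window_s seconds and compute simple per-window summary features."""
    n = int(rate_hz * window_s)          # samples per window
    feats = []
    for start in range(0, len(samples) - n + 1, n):
        w = samples[start:start + n]
        feats.append({
            "mean": statistics.fmean(w),
            "std": statistics.pstdev(w),
            "min": min(w),
            "max": max(w),
        })
    return feats
```

Each returned dict is one row of the feature matrix; real wearable data would add more channels (e.g. barometric pressure alongside inertial axes) and concatenate their per-window features.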

[LG-198] A Deep Learning Approach for User-Centric Clustering in Cell-Free Massive MIMO Systems

链接: https://arxiv.org/abs/2410.02775
作者: Giovanni Di Gennaro,Amedeo Buonanno,Gianmarco Romano,Stefano Buzzi,Francesco A.N. Palmieri
关键词-EN: cell-free massive MIMO, conventional massive MIMO, massive MIMO cellular, massive MIMO systems, MIMO cellular configurations
类目: Signal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: Accepted to 25th IEEE International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), 2024

点击查看摘要

Abstract:Contrary to conventional massive MIMO cellular configurations plagued by inter-cell interference, cell-free massive MIMO systems distribute network resources across the coverage area, enabling users to connect with multiple access points (APs) and boosting both system capacity and fairness across users. In such systems, one critical functionality is the association between APs and users: determining the optimal association is indeed a combinatorial problem of prohibitive complexity. In this paper, a solution based on deep learning is thus proposed to solve the user clustering problem aimed at maximizing the sum spectral efficiency while controlling the number of active connections. The proposed solution can scale effectively with the number of users, leveraging long short-term memory cells to operate without the need for retraining. Numerical results show the effectiveness of the proposed solution, even in the presence of imperfect channel state information due to pilot contamination.

[LG-199] Estimating the Unobservable Components of Electricity Demand Response with Inverse Optimization

链接: https://arxiv.org/abs/2410.02774
作者: Adrian Esteban-Perez,Derek Bunn,Yashar Ghiassi-Farrokhfal
关键词-EN: Understanding and predicting, electricity demand responses, system operators, predicting the electricity, electricity demand
类目: Signal Processing (eess.SP); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Optimization and Control (math.OC); Applications (stat.AP); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Understanding and predicting the electricity demand responses to prices are critical activities for system operators, retailers, and regulators. While conventional machine learning and time series analyses have been adequate for the routine demand patterns that have adapted only slowly over many years, the emergence of active consumers with flexible assets such as solar-plus-storage systems, and electric vehicles, introduces new challenges. These active consumers exhibit more complex consumption patterns, the drivers of which are often unobservable to the retailers and system operators. In practice, system operators and retailers can only monitor the net demand (metered at grid connection points), which reflects the overall energy consumption or production exchanged with the grid. As a result, all “behind-the-meter” activities-such as the use of flexibility-remain hidden from these entities. Such behind-the-meter behavior may be controlled by third party agents or incentivized by tariffs; in either case, the retailer’s revenue and the system loads would be impacted by these activities behind the meter, but their details can only be inferred. We define the main components of net demand, as baseload, flexible, and self-generation, each having nonlinear responses to market price signals. As flexible demand response and self generation are increasing, this raises a pressing question of whether existing methods still perform well and, if not, whether there is an alternative way to understand and project the unobserved components of behavior. In response to this practical challenge, we evaluate the potential of a data-driven inverse optimization (IO) methodology. This approach characterizes decomposed consumption patterns without requiring direct observation of behind-the-meter behavior or device-level metering […]

[LG-200] Efficient Numerical Calibration of Water Delivery Network Using Short-Burst Hydrant Trials

链接: https://arxiv.org/abs/2410.02772
作者: Katarzyna Kołodziej(1),Michał Cholewa(1),Przemysław Głomb(1),Wojciech Koral(2),Michał Romaszewski(1) ((1) Institute of Theoretical and Applied Informatics, Polish Academy of Sciences, Gliwice, Poland, (2) AIUT Sp. z o.o. Gliwice, Poland)
关键词-EN: Network Hydraulic Models, Water Distribution Network, Distribution Network Hydraulic, Hydraulic Models, Water Distribution
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 16 pages, 6 figures, submitted to ASCE Journal of Water Resources Planning and Management

点击查看摘要

Abstract:Calibration is a critical process for reducing uncertainty in Water Distribution Network Hydraulic Models (WDN HM). However, features of certain WDNs, such as oversized pipelines, lead to shallow pressure gradients under normal daily conditions, posing a challenge for effective calibration. This study proposes a calibration methodology using short hydrant trials conducted at night, which increase the pressure gradient in the WDN. The data is resampled to align with hourly consumption patterns. In a unique real-world case study of a WDN zone, we demonstrate the statistically significant superiority of our method compared to calibration based on daily usage. The experimental methodology, inspired by a machine learning cross-validation framework, utilises two state-of-the-art calibration algorithms, achieving a reduction in absolute error of up to 45% in the best scenario.

[LG-201] Insightful Railway Track Evaluation: Leveraging NARX Feature Interpretation

链接: https://arxiv.org/abs/2410.02770
作者: P. H. O. Silva,A. S. Cerqueira,E. G. Nepomuceno
关键词-EN: engineering domains, extracting meaningful insights, essential for extracting, extracting meaningful, time series
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: In English. CBA 2024 - XXV Brazilian Congress of Automation (CBA - XXV Congresso Brasileiro de Automática)

点击查看摘要

Abstract:The classification of time series is essential for extracting meaningful insights and aiding decision-making in engineering domains. Parametric modeling techniques like NARX are invaluable for comprehending intricate processes, such as environmental time series, owing to their easily interpretable and transparent structures. This article introduces a classification algorithm, Logistic-NARX Multinomial, which merges the NARX methodology with logistic regression. This approach not only produces interpretable models but also effectively tackles challenges associated with multiclass classification. Furthermore, this study introduces an innovative methodology tailored for the railway sector, offering a tool that employs NARX models to interpret the multitude of features derived from onboard sensors. This solution provides profound insights through feature importance analysis, enabling informed decision-making regarding safety and maintenance.
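A NARX model regresses the current output on lagged outputs and lagged exogenous inputs; building its regressor rows can be sketched as below. The lag orders `ny`/`nu` and the idea of feeding such rows to a multinomial logistic classifier follow the abstract's description, but the exact construction used by Logistic-NARX Multinomial is an assumption here.

```python
def narx_regressors(y, u, ny, nu):
    """Build linear-in-parameters NARX regressor rows: at time t the row
    is [y[t-1], ..., y[t-ny], u[t-1], ..., u[t-nu]], with target y[t]."""
    rows, targets = [], []
    start = max(ny, nu)
    for t in range(start, len(y)):
        rows.append(y[t - ny:t][::-1] + u[t - nu:t][::-1])
        targets.append(y[t])
    return rows, targets
```

For classification, `targets` would hold class labels instead of continuous outputs, and the rows would be passed to a multinomial logit fit.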

信息检索

[IR-0] Discovering Biases in Information Retrieval Models Using Relevance Thesaurus as Global Explanation

链接: https://arxiv.org/abs/2410.03584
作者: Youngwoo Kim,Razieh Rahimi,James Allan
关键词-EN: unseen query-document pairs, interpreting neural relevance, neural relevance, relevance, neural relevance models
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Most efforts in interpreting neural relevance models have focused on local explanations, which explain the relevance of a document to a query but are not useful in predicting the model’s behavior on unseen query-document pairs. We propose a novel method to globally explain neural relevance models by constructing a “relevance thesaurus” containing semantically relevant query and document term pairs. This thesaurus is used to augment lexical matching models such as BM25 to approximate the neural model’s predictions. Our method involves training a neural relevance model to score the relevance of partial query and document segments, which is then used to identify relevant terms across the vocabulary space. We evaluate the obtained thesaurus explanation based on ranking effectiveness and fidelity to the target neural ranking model. Notably, our thesaurus reveals the existence of brand name bias in ranking models, demonstrating one advantage of our explanation method.
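One way to picture how a relevance thesaurus could augment a lexical matcher like BM25 is to add a bonus for thesaurus-listed (query term, document term) pairs on top of the lexical score; the additive weighting below is an assumption for illustration, not the paper's exact formulation.

```python
def thesaurus_augmented_score(query_terms, doc_terms, lexical_score, thesaurus):
    """Augment a lexical (e.g. BM25) score with weighted bonuses for
    semantically related term pairs from a relevance thesaurus.
    thesaurus maps a query term to a list of (doc term, weight) pairs."""
    bonus = 0.0
    doc_set = set(doc_terms)
    for q in query_terms:
        for d, weight in thesaurus.get(q, []):
            if d in doc_set and d != q:   # reward non-exact semantic matches
                bonus += weight
    return lexical_score + bonus
```

A biased entry such as a brand name mapped to generic product terms would surface directly in the thesaurus, which is how this kind of global explanation can expose ranking-model biases.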

[IR-1] Dreaming User Multimodal Representation for Micro-Video Recommendation

链接: https://arxiv.org/abs/2410.03538
作者: Chengzhi Lin,Hezheng Lin,Shuchang Liu,Cangguang Ruan,LingJing Xu,Dezhao Yang,Chuyuan Wang,Yongqi Liu
关键词-EN: advanced recommender systems, mitigate information overload, deliver tailored content, Platonic Representation Hypothesis, underscored the necessity
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The proliferation of online micro-video platforms has underscored the necessity for advanced recommender systems to mitigate information overload and deliver tailored content. Despite advancements, accurately and promptly capturing dynamic user interests remains a formidable challenge. Inspired by the Platonic Representation Hypothesis, which posits that different data modalities converge towards a shared statistical model of reality, we introduce DreamUMM (Dreaming User Multi-Modal Representation), a novel approach leveraging user historical behaviors to create real-time user representations in a multimodal space. DreamUMM employs a closed-form solution correlating user video preferences with multimodal similarity, hypothesizing that user interests can be effectively represented in a unified multimodal space. Additionally, we propose Candidate-DreamUMM for scenarios lacking recent user behavior data, inferring interests from candidate videos alone. Extensive online A/B tests demonstrate significant improvements in user engagement metrics, including active days and play count. The successful deployment of DreamUMM in two micro-video platforms with hundreds of millions of daily active users illustrates its practical efficacy and scalability in personalized micro-video content delivery. Our work contributes to the ongoing exploration of representational convergence by providing empirical evidence supporting the potential for user interest representations to reside in a multimodal space.

[IR-2] EB-NeRD: A Large-Scale Dataset for News Recommendation RECSYS’24

链接: https://arxiv.org/abs/2410.03432
作者: Johannes Kruse,Kasper Lindskow,Saikishore Kalloori,Marco Polignano,Claudio Pomo,Abhishek Srivastava,Anshuk Uppal,Michael Riis Andersen,Jes Frellsen
关键词-EN: Personalized content recommendations, Personalized content, social networks, content experience, Ekstra Bladet
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 11 pages, 8 tables, 2 figures, RecSys '24

点击查看摘要

Abstract:Personalized content recommendations have been pivotal to the content experience in digital media from video streaming to social networks. However, several domain specific challenges have held back adoption of recommender systems in news publishing. To address these challenges, we introduce the Ekstra Bladet News Recommendation Dataset (EB-NeRD). The dataset encompasses data from over a million unique users and more than 37 million impression logs from Ekstra Bladet. It also includes a collection of over 125,000 Danish news articles, complete with titles, abstracts, bodies, and metadata, such as categories. EB-NeRD served as the benchmark dataset for the RecSys '24 Challenge, where it was demonstrated how the dataset can be used to address both technical and normative challenges in designing effective and responsible recommender systems for news publishing. The dataset is available at: this https URL.

[IR-3] SoundSignature: What Type of Music Do You Like?

链接: https://arxiv.org/abs/2410.03375
作者: Brandon James Carone,Pablo Ripollés
关键词-EN: custom OpenAI Assistant, Music Information Retrieval, analyze users’ favorite, users’ favorite songs, OpenAI Assistant
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Audio and Speech Processing (eess.AS)
*备注: 10 pages, 1 figure, to be published in the 2024 International Symposium on the IEEE Internet of Sounds Proceedings

点击查看摘要

Abstract:SoundSignature is a music application that integrates a custom OpenAI Assistant to analyze users’ favorite songs. The system incorporates state-of-the-art Music Information Retrieval (MIR) Python packages to combine extracted acoustic/musical features with the assistant’s extensive knowledge of the artists and bands. Capitalizing on this combined knowledge, SoundSignature leverages semantic audio and principles from the emerging Internet of Sounds (IoS) ecosystem, integrating MIR with AI to provide users with personalized insights into the acoustic properties of their music, akin to a musical preference personality report. Users can then interact with the chatbot to explore deeper inquiries about the acoustic analyses performed and how they relate to their musical taste. This interactivity transforms the application, acting not only as an informative resource about familiar and/or favorite songs, but also as an educational platform that enables users to deepen their understanding of musical features, music theory, acoustic properties commonly used in signal processing, and the artists behind the music. Beyond general usability, the application also incorporates several well-established open-source musician-specific tools, such as a chord recognition algorithm (CREMA), a source separation algorithm (DEMUCS), and an audio-to-MIDI converter (basic-pitch). These features allow users without coding skills to access advanced, open-source music processing algorithms simply by interacting with the chatbot (e.g., can you give me the stems of this song?). In this paper, we highlight the application’s innovative features and educational potential, and present findings from a pilot user study that evaluates its efficacy and usability.

[IR-4] Multimodal Point-of-Interest Recommendation

链接: https://arxiv.org/abs/2410.03265
作者: Yuta Kanzawa,Toyotaro Suzumura,Hiroki Kanezashi,Jiawei Yong,Shintaro Fukushima
关键词-EN: Large Language Models, Large Language, articles to read, items to buy, Large
类目: Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Large Language Models are applied to recommendation tasks such as items to buy and news articles to read. Point-of-Interest recommendation is a relatively new application of sequential recommendation based on language representations of multimodal datasets. As a first step toward proving our concept, we focused on restaurant recommendation based on each user’s past visit history. When choosing a next restaurant to visit, a user would consider the genre and location of the venue and, if available, pictures of dishes served there. We created a pseudo restaurant check-in history dataset from the Foursquare dataset and the FoodX-251 dataset by converting pictures into text descriptions with a multimodal model called LLaVA, and used a language-based sequential recommendation framework named Recformer proposed in 2023. A model trained on this semi-multimodal dataset outperformed a model trained on the same dataset without picture descriptions. This suggests that the semi-multimodal model reflects actual human behaviours and that we are on the right path toward a fully multimodal recommendation model.

[IR-5] Enriching Music Descriptions with a Finetuned-LLM and Metadata for Text-to-Music Retrieval ICASSP2024

链接: https://arxiv.org/abs/2410.03264
作者: SeungHeon Doh,Minhee Lee,Dasaem Jeong,Juhan Nam
关键词-EN: natural language query, finding music based, extensive music databases, plays a pivotal, pivotal role
类目: Sound (cs.SD); Information Retrieval (cs.IR); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注: Accepted for publication at the IEEE ICASSP 2024

点击查看摘要

Abstract:Text-to-Music Retrieval, finding music based on a given natural language query, plays a pivotal role in content discovery within extensive music databases. To address this challenge, prior research has predominantly focused on a joint embedding of music audio and text, utilizing it to retrieve music tracks that exactly match descriptive queries related to musical attributes (i.e. genre, instrument) and contextual elements (i.e. mood, theme). However, users also articulate a need to explore music that shares similarities with their favorite tracks or artists, such as “I need a similar track to Superstition by Stevie Wonder”. To address these concerns, this paper proposes an improved Text-to-Music Retrieval model, denoted as TTMR++, which utilizes rich text descriptions generated with a finetuned large language model and metadata. To accomplish this, we obtained various types of seed text from several existing music tag and caption datasets and a knowledge graph dataset of artists and tracks. The experimental results show the effectiveness of TTMR++ in comparison to state-of-the-art music-text joint embedding models through a comprehensive evaluation involving various musical text queries.

[IR-6] Data-Efficient Massive Tool Retrieval: A Reinforcement Learning Approach for Query-Tool Alignment with Language Models

链接: https://arxiv.org/abs/2410.03212
作者: Yuxiang Zhang,Xin Fan,Junjie Wang,Chongxian Chen,Fan Mo,Tetsuya Sakai,Hayato Yamana
关键词-EN: successfully addressed complex, Recent advancements, addressed complex tasks, tool retrieval, massive tool retrieval
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) integrated with external tools and APIs have successfully addressed complex tasks by using in-context learning or fine-tuning. Despite this progress, the vast scale of tool retrieval remains challenging due to stringent input length constraints. In response, we propose a pre-retrieval strategy over an extensive repository, effectively framing the problem as the massive tool retrieval (MTR) task. We introduce the MTRB (massive tool retrieval benchmark) to evaluate real-world tool-augmented LLM scenarios with a large number of tools. This benchmark is designed for low-resource scenarios and includes a diverse collection of tools with descriptions refined for consistency and clarity. It consists of three subsets, each containing 90 test samples and 10 training samples. To handle the low-resource MTR task, we propose a new query-tool alignment (QTA) framework that leverages LLMs to enhance query-tool alignment by rewriting user queries through ranking functions and the direct preference optimization (DPO) method. This approach consistently outperforms existing state-of-the-art models in top-5 and top-10 retrieval tasks across the MTRB benchmark, with improvements of up to 93.28% on the Sufficiency@k metric, which measures the adequacy of tool retrieval within the first k results. Furthermore, ablation studies validate the efficacy of our framework, highlighting its capacity to optimize performance even with limited annotated samples. Specifically, our framework achieves up to a 78.53% performance improvement in Sufficiency@k with just a single annotated sample. Additionally, QTA exhibits strong cross-dataset generalizability, emphasizing its potential for real-world applications.
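The abstract describes Sufficiency@k only as measuring "the adequacy of tool retrieval within the first k results". One plausible reading, offered here as an assumption since the paper fixes the exact definition, is the fraction of queries whose top-k retrieved tools cover every tool the query actually requires:

```python
def sufficiency_at_k(retrieved, required, k):
    """Fraction of queries for which every required tool appears among the
    top-k retrieved tools. A plausible reading of Sufficiency@k; the paper's
    precise definition may differ."""
    hits = 0
    for ranked, gold in zip(retrieved, required):
        if set(gold) <= set(ranked[:k]):
            hits += 1
    return hits / len(retrieved)
```

Under this reading the metric is stricter than recall@k: a query only counts if the retrieved set is sufficient on its own to execute the task, which fits the tool-augmented LLM setting where one missing tool blocks the whole plan.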

[IR-7] Enhancing Short-Text Topic Modeling with LLM-Driven Context Expansion and Prefix-Tuned VAEs EMNLP

链接: https://arxiv.org/abs/2410.03071
作者: Pritom Saha Akash,Kevin Chen-Chuan Chang
关键词-EN: uncovering hidden themes, collection of documents, Topic modeling, powerful technique, technique for uncovering
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注: EMNLP Findings 2024. arXiv admin note: substantial text overlap with arXiv:2310.15420

点击查看摘要

Abstract:Topic modeling is a powerful technique for uncovering hidden themes within a collection of documents. However, the effectiveness of traditional topic models often relies on sufficient word co-occurrence, which is lacking in short texts. Therefore, existing approaches, whether probabilistic or neural, frequently struggle to extract meaningful patterns from such data, resulting in incoherent topics. To address this challenge, we propose a novel approach that leverages large language models (LLMs) to extend short texts into more detailed sequences before applying topic modeling. To further improve the efficiency and solve the problem of semantic inconsistency from LLM-generated texts, we propose to use prefix tuning to train a smaller language model coupled with a variational autoencoder for short-text topic modeling. Our method significantly improves short-text topic modeling performance, as demonstrated by extensive experiments on real-world datasets with extreme data sparsity, outperforming current state-of-the-art topic models.

[IR-8] Geometric Collaborative Filtering with Convergence

链接: https://arxiv.org/abs/2410.03064
作者: Hisham Husain,Julien Monteil
关键词-EN: modelling user-click interactions, user-click interactions due, Latent variable collaborative, latent collaborative filtering, collaborative filtering
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 13 pages, 1 figure, 3 tables

点击查看摘要

Abstract:Latent variable collaborative filtering methods have been a standard approach to modelling user-click interactions due to their simplicity and effectiveness. However, there is limited work on analyzing the mathematical properties of these methods, in particular on preventing overfitting towards the identity, and such methods typically utilize loss functions that overlook the geometry between items. In this work, we introduce a notion of generalization gap in collaborative filtering and analyze it with respect to latent collaborative filtering models. We present a geometric upper bound that gives rise to loss functions and a way to meaningfully utilize the geometry of item metadata to improve recommendations. We show how these losses can be minimized and give the recipe for a new latent collaborative filtering algorithm, which we refer to as GeoCF due to the geometric nature of our results. We then show experimentally that our proposed GeoCF algorithm can outperform all existing methods on the Movielens20M and Netflix datasets, as well as on two large-scale internal datasets. In summary, our work proposes a theoretically sound method that paves the way to better understanding generalization of collaborative filtering at large.

[IR-9] Scalable Frame-based Construction of Sociocultural NormBases for Socially-Aware Dialogues

链接: https://arxiv.org/abs/2410.03049
作者: Shilin Qu,Weiqing Wang,Xin Zhou,Haolan Zhan,Zhuang Li,Lizhen Qu,Linhao Luo,Yuan-Fang Li,Gholamreza Haffari
关键词-EN: conversational information retrieval, including conversational information, retrieval-enhanced machine learning, Sociocultural norms serve, contextual information retrieval
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 17 pages

点击查看摘要

Abstract:Sociocultural norms serve as guiding principles for personal conduct in social interactions, emphasizing respect, cooperation, and appropriate behavior, which can benefit tasks including conversational information retrieval, contextual information retrieval, and retrieval-enhanced machine learning. We propose a scalable approach for constructing a Sociocultural Norm (SCN) Base using Large Language Models (LLMs) for socially aware dialogues. We construct a comprehensive and publicly accessible Chinese Sociocultural NormBase. Our approach utilizes socially aware dialogues, enriched with contextual frames, as the primary data source to constrain the generating process and reduce hallucinations. This enables the extraction of high-quality and nuanced natural-language norm statements, leveraging the pragmatic implications of utterances with respect to the situation. As real dialogues annotated with gold frames are not readily available, we propose using synthetic data. Our empirical results show: (i) the quality of the SCNs derived from synthetic data is comparable to that from real dialogues annotated with gold frames, and (ii) the quality of the SCNs extracted from real data, annotated with either silver (predicted) or gold frames, surpasses that without the frame annotations. We further show the effectiveness of the extracted SCNs in a RAG-based (Retrieval-Augmented Generation) model for reasoning about multiple downstream dialogue tasks.

[IR-10] Inductive Generative Recommendation via Retrieval-based Speculation

链接: https://arxiv.org/abs/2410.02939
作者: Yijie Ding,Yupeng Hou,Jiacheng Li,Julian McAuley
关键词-EN: discrete tokens, emerging paradigm, paradigm that tokenizes, learns to autoregressively, autoregressively generate
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Generative recommendation (GR) is an emerging paradigm that tokenizes items into discrete tokens and learns to autoregressively generate the next tokens as predictions. Although effective, GR models operate in a transductive setting, meaning they can only generate items seen during training without applying heuristic re-ranking strategies. In this paper, we propose SpecGR, a plug-and-play framework that enables GR models to recommend new items in an inductive setting. SpecGR uses a drafter model with inductive capability to propose candidate items, which may include both existing items and new items. The GR model then acts as a verifier, accepting or rejecting candidates while retaining its strong ranking capabilities. We further introduce the guided re-drafting technique to make the proposed candidates more aligned with the outputs of generative recommendation models, improving the verification efficiency. We consider two variants for drafting: (1) using an auxiliary drafter model for better flexibility, or (2) leveraging the GR model’s own encoder for parameter-efficient self-drafting. Extensive experiments on three real-world datasets demonstrate that SpecGR exhibits both strong inductive recommendation ability and the best overall performance among the compared methods. Our code is available at: this https URL.
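The draft-then-verify loop above (an inductive drafter proposes candidates, possibly including unseen items; the GR model accepts or rejects them and ranks the survivors) can be sketched minimally. Here `verify_score` and the acceptance threshold stand in for the GR model's verification interface and are assumptions for illustration, not the paper's actual algorithm:

```python
def speculative_recommend(draft_candidates, verify_score, accept_threshold, top_n):
    """Draft-then-verify recommendation: keep only drafted candidates the
    verifier scores above the acceptance threshold, then rank the survivors.
    Illustrative sketch of the SpecGR idea under assumed interfaces."""
    scored = [(c, verify_score(c)) for c in draft_candidates]
    accepted = [(c, s) for c, s in scored if s >= accept_threshold]
    accepted.sort(key=lambda cs: -cs[1])  # verifier score, descending
    return [c for c, _ in accepted[:top_n]]
```

The design point is the division of labor: the drafter supplies inductive coverage of new items, while the verifier contributes the ranking quality of the generative model without being asked to generate unseen tokens itself.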

[IR-11] Streamlining Conformal Information Retrieval via Score Refinement

链接: https://arxiv.org/abs/2410.02914
作者: Yotam Intrator,Ori Kelner,Regev Cohen,Roman Goldenberg,Ehud Rivlin,Daniel Freedman
关键词-EN: retrieval augmented generation, lack statistical guarantees, augmented generation, fundamental to modern, modern applications
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 6 pages

点击查看摘要

Abstract:Information retrieval (IR) methods, like retrieval augmented generation, are fundamental to modern applications but often lack statistical guarantees. Conformal prediction addresses this by retrieving sets guaranteed to include relevant information, yet existing approaches produce large-sized sets, incurring high computational costs and slow response times. In this work, we introduce a score refinement method that applies a simple monotone transformation to retrieval scores, leading to significantly smaller conformal sets while maintaining their statistical guarantees. Experiments on various BEIR benchmarks validate the effectiveness of our approach in producing compact sets containing relevant information.
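The score refinement itself is described only as a monotone transformation of retrieval scores, but the conformal machinery around it follows the standard split-conformal recipe: calibrate a threshold on the (refined) scores of known-relevant documents, then return every document clearing it. A minimal sketch, with the quantile rule and the availability of relevant-document calibration scores as assumptions; because the refinement is monotone, applying it to both calibration and query-time scores preserves the coverage guarantee.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Split-conformal threshold: the floor(alpha*(n+1))-th smallest calibration
    score of known-relevant documents, so a new relevant document clears it
    with probability >= 1 - alpha."""
    s = sorted(cal_scores)
    k = math.floor(alpha * (len(s) + 1))
    return s[max(k - 1, 0)]

def conformal_set(scores, tau):
    """Conformal retrieval set: every document whose refined score clears tau.
    Smaller, better-separated refined scores yield smaller sets."""
    return [i for i, s in enumerate(scores) if s >= tau]
```

The paper's contribution is then to choose the monotone transformation so that the resulting sets are compact; the guarantee itself comes for free from the thresholding above.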

[IR-12] Cognitive Biases in Large Language Models for News Recommendation RECSYS’24

链接: https://arxiv.org/abs/2410.02897
作者: Yougang Lyu,Xiaoyu Zhang,Zhaochun Ren,Maarten de Rijke
关键词-EN: large language models, recommender systems, cognitive biases, LLM-based news recommender, language models
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Accepted at the ROGEN '24 workshop, co-located with ACM RecSys '24

点击查看摘要

Abstract:Despite large language models (LLMs) increasingly becoming important components of news recommender systems, employing LLMs in such systems introduces new risks, such as the influence of cognitive biases in LLMs. Cognitive biases refer to systematic patterns of deviation from norms or rationality in the judgment process, which can result in inaccurate outputs from LLMs, thus threatening the reliability of news recommender systems. Specifically, LLM-based news recommender systems affected by cognitive biases could lead to the propagation of misinformation, reinforcement of stereotypes, and the formation of echo chambers. In this paper, we explore the potential impact of multiple cognitive biases on LLM-based news recommender systems, including anchoring bias, framing bias, status quo bias and group attribution bias. Furthermore, to facilitate future research on improving the reliability of LLM-based news recommender systems, we discuss strategies to mitigate these biases through data augmentation, prompt engineering, and learning algorithms.

[IR-13] DifFaiRec: Generative Fair Recommender with Conditional Diffusion Model ICDM2024

链接: https://arxiv.org/abs/2410.02791
作者: Zhenhao Jiang,Jicong Fan
关键词-EN: users automatically based, Diffusion-based Fair Recommender, automatically based, groups, users automatically
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: The paper was accepted by ICDM 2024

点击查看摘要

Abstract:Although recommenders can ship items to users automatically based on the users’ preferences, they often cause unfairness to groups or individuals. For instance, when users can be divided into two groups according to a sensitive social attribute and there is a significant difference in terms of activity between the two groups, the learned recommendation algorithm will result in a recommendation gap between the two groups, which causes group unfairness. In this work, we propose a novel recommendation algorithm named Diffusion-based Fair Recommender (DifFaiRec) to provide fair recommendations. DifFaiRec is built upon the conditional diffusion model and hence has a strong ability to learn the distribution of user preferences from their ratings on items and is able to generate diverse recommendations effectively. To guarantee fairness, we design a counterfactual module to reduce the model sensitivity to protected attributes and provide mathematical explanations. The experiments on benchmark datasets demonstrate the superiority of DifFaiRec over competitive baselines.

[IR-14] Learning variant product relationship and variation attributes from e-commerce website structures

链接: https://arxiv.org/abs/2410.02779
作者: Pedro Herrero-Vidal,You-Lin Chen,Cris Liu,Prithviraj Sen,Lichao Wang
关键词-EN: introduce VARM, product relationships, product, variant product relationships, variant relationship matcher
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce VARM, a variant relationship matcher strategy, to identify pairs of variant products in e-commerce catalogs. Traditional definitions of entity resolution are concerned with whether product mentions refer to the same underlying product. However, this fails to capture product relationships that are critical for e-commerce applications, such as having similar, but not identical, products listed on the same webpage or sharing reviews. Here, we formulate a new type of entity resolution over variant product relationships to capture these similar e-commerce product links. In contrast with the traditional definition, the new definition requires both identifying whether two products are variant matches of each other and which attributes vary between them. To satisfy these two requirements, we developed a strategy that leverages the strengths of both encoding and generative AI models. First, we construct a dataset that captures webpage product links, and therefore variant product relationships, to train an encoding LLM to predict variant matches for any given pair of products. Second, we use RAG-prompted generative LLMs to extract varying and common attributes amongst groups of variant products. To validate our strategy, we evaluated model performance using real data from one of the world’s leading e-commerce retailers. The results showed that our strategy outperforms alternative solutions and paves the way to exploiting this new type of product relationship.

[IR-15] Bypassing the Popularity Bias: Repurposing Models for Better Long-Tail Recommendation

链接: https://arxiv.org/abs/2410.02776
作者: Václav Blahut,Karel Koupil
关键词-EN: Recommender systems play, influencing our beliefs, Recommender systems, play a crucial, crucial role
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 6 pages, 4 figures

点击查看摘要

Abstract:Recommender systems play a crucial role in shaping the information we encounter online, whether on social media or on content platforms, thereby influencing our beliefs, choices, and behaviours. Many recent works address the issue of fairness in recommender systems, typically focusing on topics like ensuring equal access to information and opportunities for all individual users or user groups, promoting diverse content to avoid filter bubbles and echo chambers, enhancing transparency and explainability, and adhering to ethical and sustainable practices. In this work, we aim to achieve a more equitable distribution of exposure among publishers on an online content platform, with a particular focus on those who produce high-quality, long-tail content that may be unfairly disadvantaged. We propose a novel approach of repurposing existing components of an industrial recommender system to deliver valuable exposure to underrepresented publishers while maintaining high recommendation quality. To demonstrate the efficiency of our proposal, we conduct large-scale online A/B experiments, report results indicating the desired outcomes, and share several insights from long-term application of the approach in a production setting.

[IR-16] YouTube Video Analytics for Patient Engagement: Evidence from Colonoscopy Preparation Videos

链接: https://arxiv.org/abs/2410.02830
作者: Yawen Guo,Xiao Liu,Anjana Susarla,Padman Rema
关键词-EN: medical information, deliver contextualized, Video Intelligence API, medical, information
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)
*备注: The 30th WORKSHOP ON INFORMATION TECHNOLOGIES AND SYSTEMS. arXiv admin note: substantial text overlap with arXiv:2312.09425

点击查看摘要

Abstract:Videos can be an effective way to deliver contextualized, just-in-time medical information for patient education. However, video analysis, from topic identification and retrieval to extraction and analysis of medical information and understandability from a patient perspective, comprises extremely challenging tasks. This study demonstrates a data analysis pipeline that retrieves medical information from YouTube videos on preparing for a colonoscopy exam, a much-maligned and disliked procedure that patients find challenging to get adequately prepared for. We first use the YouTube Data API to collect metadata of desired videos for selected search keywords and use the Google Video Intelligence API to analyze text, frame, and object data. Then we annotate the YouTube video materials for medical information, video understandability, and overall recommendation. We develop a bidirectional long short-term memory (BiLSTM) model to identify medical terms in videos and build three classifiers to group videos based on the levels of encoded medical information, video understandability, and whether the videos are recommended or not. Our study provides healthcare stakeholders with guidelines and a scalable approach for generating new educational video content to enhance the management of a vast number of health conditions.

附件下载

点击下载今日全部论文列表