This blog post presents the latest paper list retrieved from Arxiv.org on 2024-10-21. It is updated automatically and organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: Paper data is fetched from Arxiv.org daily and updated automatically at around 11:00 AM each morning.

Friendly reminder: If you would like to receive the daily paper digest by email, please leave your email address in the comments.

Table of Contents

Overview (2024-10-21)

A total of 476 papers were updated today, including:

  • Natural Language Processing: 99 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 125 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 80 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 168 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Are AI Detectors Good Enough? A Survey on Quality of Datasets With Machine-Generated Texts

[Quick Read]: This paper addresses the performance drop of machine-generated text detectors in real-world use, questioning whether their high benchmark scores stem from the poor quality of evaluation datasets. The key to the solution is a systematic method for assessing the quality of datasets containing AI-generated fragments, ensuring detectors remain robust and generalize across diverse texts. The paper also highlights the potential of using high-quality generated data to improve both detector training and the datasets themselves, thereby making detectors more reliable in practice.

Link: https://arxiv.org/abs/2410.14677
Authors: German Gritsai, Anastasia Voznyuk, Andrey Grabovoy, Yury Chekhovich
Keywords-EN: autoregressive Large Language, Large Language Models, Large Language, necessitating reliable machine-generated, autoregressive Large
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The rapid development of autoregressive Large Language Models (LLMs) has significantly improved the quality of generated texts, necessitating reliable machine-generated text detectors. A huge number of detectors and collections with AI fragments have emerged, and several detection methods even showed recognition quality up to 99.9% according to the target metrics in such collections. However, the quality of such detectors tends to drop dramatically in the wild, posing a question: Are detectors actually highly trustworthy or do their high benchmark scores come from the poor quality of evaluation datasets? In this paper, we emphasise the need for robust and qualitative methods for evaluating generated data to be secure against bias and low generalising ability of future model. We present a systematic review of datasets from competitions dedicated to AI-generated content detection and propose methods for evaluating the quality of datasets containing AI-generated fragments. In addition, we discuss the possibility of using high-quality generated data to achieve two goals: improving the training of detection models and improving the training datasets themselves. Our contribution aims to facilitate a better understanding of the dynamics between human and machine text, which will ultimately support the integrity of information in an increasingly automated world.

[NLP-1] SudoLM: Learning Access Control of Parametric Knowledge with Authorization Alignment

[Quick Read]: This paper addresses the one-size-fits-all nature of existing preference alignment, in which the parametric knowledge of a large language model (LLM) with non-preferred features is uniformly blocked for all users, preventing advanced users from making full use of it. The key to the solution is the SudoLM framework, which uses authorization alignment to let an LLM dynamically control access to specific parametric knowledge based on user credentials. SudoLM assigns a SUDO key to authorized users, unlocking access to all parametric knowledge while blocking unqualified users, improving utility for advanced users while preserving the model's general capability.

Link: https://arxiv.org/abs/2410.14676
Authors: Qin Liu, Fei Wang, Chaowei Xiao, Muhao Chen
Keywords-EN: Existing preference alignment, large language model, Existing preference, language model, parametric knowledge
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Existing preference alignment is a one-size-fits-all alignment mechanism, where the part of the large language model (LLM) parametric knowledge with non-preferred features is uniformly blocked to all the users. However, this part of knowledge can be useful to advanced users whose expertise qualifies them to handle these information. The one-size-fits-all alignment mechanism undermines LLM’s utility for these qualified users. To address this problem, we propose SudoLM, a framework that lets LLMs learn access control over specific parametric knowledge for users with different credentials via authorization alignment. SudoLM allows authorized users to unlock their access to all the parametric knowledge with an assigned SUDO key while blocking access to non-qualified users. Experiments on two application scenarios demonstrate that SudoLM effectively controls the user’s access to the parametric knowledge and maintains its general utility.

[NLP-2] Enhancing Large Language Models' Situated Faithfulness to External Contexts

[Quick Read]: This paper tackles large language models' (LLMs) over-reliance on external information, especially when that information is inaccurate or misleading. The key to the solution is two methods for strengthening the model's trust in its internal knowledge: Self-Guided Confidence Reasoning (SCR) and Rule-Based Confidence Reasoning (RCR). SCR has the model assess the credibility of external information relative to its internal knowledge to produce the most accurate answer, while RCR extracts explicit confidence signals from the model and determines the final answer with predefined rules. The study finds that SCR outperforms RCR for models with strong reasoning ability (e.g., GPT-4o), whereas RCR works better for smaller models (e.g., Llama-3-8B). Fine-tuning SCR with the proposed Confidence Reasoning Direct Preference Optimization (CR-DPO) method further improves performance on both seen and unseen datasets.

Link: https://arxiv.org/abs/2410.14675
Authors: Yukun Huang, Sanxing Chen, Hongyi Cai, Bhuwan Dhingra
Keywords-EN: Large Language Models, Large Language, external information, Language Models, intentionally misleading
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) are often augmented with external information as contexts, but this external information can sometimes be inaccurate or even intentionally misleading. We argue that robust LLMs should demonstrate situated faithfulness, dynamically calibrating their trust in external information based on their confidence in the internal knowledge and the external context. To benchmark this capability, we evaluate LLMs across several QA datasets, including a newly created dataset called RedditQA featuring in-the-wild incorrect contexts sourced from Reddit posts. We show that when provided with both correct and incorrect contexts, both open-source and proprietary models tend to overly rely on external information, regardless of its factual accuracy. To enhance situated faithfulness, we propose two approaches: Self-Guided Confidence Reasoning (SCR) and Rule-Based Confidence Reasoning (RCR). SCR enables models to self-access the confidence of external information relative to their own internal knowledge to produce the most accurate answer. RCR, in contrast, extracts explicit confidence signals from the LLM and determines the final answer using predefined rules. Our results show that for LLMs with strong reasoning capabilities, such as GPT-4o and GPT-4o mini, SCR outperforms RCR, achieving improvements of up to 24.2% over a direct input augmentation baseline. Conversely, for a smaller model like Llama-3-8B, RCR outperforms SCR. Fine-tuning SCR with our proposed Confidence Reasoning Direct Preference Optimization (CR-DPO) method improves performance on both seen and unseen datasets, yielding an average improvement of 8.9% on Llama-3-8B. In addition to quantitative results, we offer insights into the relative strengths of SCR and RCR. Our findings highlight promising avenues for improving situated faithfulness in LLMs. The data and code are released.
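The Rule-Based Confidence Reasoning (RCR) idea above can be illustrated with a tiny decision function. This is a hypothetical sketch, not the paper's implementation: the function name, the `margin` parameter, and the scalar confidence inputs are all assumptions made for illustration.

```python
def rule_based_confidence_reasoning(internal_answer, internal_conf,
                                    context_answer, context_conf,
                                    margin=0.1):
    """Toy RCR-style rule: trust the answer derived from external context
    only when the model's reported confidence in it clearly exceeds the
    confidence in the internally-derived answer."""
    if context_conf >= internal_conf + margin:
        return context_answer
    return internal_answer
```

A real system would extract the confidence signals from the LLM itself (e.g., verbalized probabilities), and the paper's predefined rules may differ from this threshold comparison.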

[NLP-3] NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples NEURIPS24

[Quick Read]: This paper addresses the poor performance of vision-language models (VLMs) on natural images and questions that humans answer easily, i.e., natural adversarial samples. The key to the solution is NaturalBench, a new benchmark of 10,000 human-verified visual question answering samples collected with a semi-automated pipeline. Its vision-centric design pairs each question with two images that yield different answers, preventing models from answering with commonsense priors alone. NaturalBench also annotates each sample with fine-grained skill tags to assess abilities such as attribute binding, object relationships, and advanced reasoning, exposing VLMs' biases and limitations more comprehensively.

Link: https://arxiv.org/abs/2410.14669
Authors: Baiqi Li, Zhiqiu Lin, Wenxuan Peng, Jean de Dieu Nyandwi, Daniel Jiang, Zixian Ma, Simran Khanuja, Ranjay Krishna, Graham Neubig, Deva Ramanan
Keywords-EN: made significant progress, Vision-language models, progress in recent, made significant, significant progress
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted to NeurIPS 24; We open-source our dataset at: this https URL Project page at: this https URL

Abstract:Vision-language models (VLMs) have made significant progress in recent visual-question-answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However, are these models truly effective? In this work, we show that VLMs still struggle with natural images and questions that humans can easily answer, which we term natural adversarial samples. We also find it surprisingly easy to generate these VQA samples from natural image-text corpora using off-the-shelf models like CLIP and ChatGPT. We propose a semi-automated approach to collect a new benchmark, NaturalBench, for reliably evaluating VLMs with 10,000 human-verified VQA samples. Crucially, we adopt a vision-centric design by pairing each question with two images that yield different answers, preventing blind solutions from answering without using the images. This makes NaturalBench more challenging than previous benchmarks that can be solved with commonsense priors. We evaluate 53 state-of-the-art VLMs on NaturalBench, showing that models like LLaVA-OneVision, Cambrian-1, Llama3.2-Vision, Molmo, Qwen2-VL, and even GPT-4o lag 50%-70% behind human performance (over 90%). We analyze why NaturalBench is hard from two angles: (1) Compositionality: Solving NaturalBench requires diverse visio-linguistic skills, including understanding attribute bindings, object relationships, and advanced reasoning like logic and counting. To this end, unlike prior work that uses a single tag per sample, we tag each NaturalBench sample with 1 to 8 skill tags for fine-grained evaluation. (2) Biases: NaturalBench exposes severe biases in VLMs, as models often choose the same answer regardless of the image. Lastly, we apply our benchmark curation method to diverse data sources, including long captions (over 100 words) and non-English languages like Chinese and Hindi, highlighting its potential for dynamic evaluations of VLMs.

[NLP-4] MiCEval: Unveiling Multimodal Chain of Thoughts Quality via Image Description and Reasoning Steps

[Quick Read]: This paper addresses the lack of automated methods for evaluating the quality of reasoning steps when multimodal large language models (MLLMs) use multimodal chain-of-thought (MCoT) strategies on complex reasoning tasks. The key is the Multimodal Chain-of-Thought Evaluation framework (MiCEval), which judges the correctness of a reasoning chain by evaluating the accuracy of the description and the quality of each reasoning step. Built on a fine-grained dataset, MiCEval scores each step for correctness, relevance, and informativeness; experiments show that its step-wise evaluations align more closely with human judgments than methods based on cosine similarity or fine-tuning.

Link: https://arxiv.org/abs/2410.14668
Authors: Xiongtao Zhou, Jie He, Lanyu Chen, Jingyu Li, Haojing Chen, Victor Gutierrez Basulto, Jeff Z. Pan, Hanjie Chen
Keywords-EN: large language models, popular prompting strategy, multimodal large language, complex reasoning tasks, language models
Subjects: Computation and Language (cs.CL)
Comments: 40 pages

Abstract:Multimodal Chain of Thought (MCoT) is a popular prompting strategy for improving the performance of multimodal large language models (MLLMs) across a range of complex reasoning tasks. Despite its popularity, there is a notable absence of automated methods for evaluating the quality of reasoning steps in MCoT. To address this gap, we propose Multimodal Chain-of-Thought Evaluation (MiCEval), a framework designed to assess the correctness of reasoning chains by evaluating the quality of both the description and each reasoning step. The evaluation of the description component focuses on the accuracy of the image descriptions, while the reasoning step evaluates the quality of each step as it is conditionally generated based on the preceding steps. MiCEval is built upon a fine-grained dataset with annotations that rate each step according to correctness, relevance, and informativeness. Extensive experiments on four state-of-the-art MLLMs show that step-wise evaluations using MiCEval align more closely with human judgments compared to existing methods based on cosine similarity or fine-tuning approaches. MiCEval datasets and code can be found in this https URL.

[NLP-5] DiscoGraMS: Enhancing Movie Screen-Play Summarization using Movie Character-Aware Discourse Graph

[Quick Read]: This paper addresses the distinctive challenges of movie screenplay summarization, where machine learning models struggle to capture the complex interplay of characters, dialogues, and scenes, along with long-term dependencies and latent relationships. The key is DiscoGraMS, a new resource that represents movie scripts as a movie character-aware discourse graph (CaD Graph). This graph structure better preserves the salient information in a screenplay and, through late fusion of the graph and text modalities, provides a more comprehensive and faithful representation for downstream tasks such as summarization, question answering, and salience detection.

Link: https://arxiv.org/abs/2410.14666
Authors: Maitreya Prafulla Chitale, Uday Bindal, Rajakrishnan Rajkumar, Rahul Mishra
Keywords-EN: Summarizing movie screenplays, standard document summarization, Summarizing movie, unique set, compared to standard
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Summarizing movie screenplays presents a unique set of challenges compared to standard document summarization. Screenplays are not only lengthy, but also feature a complex interplay of characters, dialogues, and scenes, with numerous direct and subtle relationships and contextual nuances that are difficult for machine learning models to accurately capture and comprehend. Recent attempts at screenplay summarization focus on fine-tuning transformer-based pre-trained models, but these models often fall short in capturing long-term dependencies and latent relationships, and frequently encounter the “lost in the middle” issue. To address these challenges, we introduce DiscoGraMS, a novel resource that represents movie scripts as a movie character-aware discourse graph (CaD Graph). This approach is well-suited for various downstream tasks, such as summarization, question-answering, and salience detection. The model aims to preserve all salient information, offering a more comprehensive and faithful representation of the screenplay’s content. We further explore a baseline method that combines the CaD Graph with the corresponding movie script through a late fusion of graph and text modalities, and we present very initial promising results.

[NLP-6] Real-time Fake News from Adversarial Feedback

[Quick Read]: This paper addresses the limitations of existing fake news detection evaluations: conventional datasets built from fact-checking websites fail to test a model's ability to reason about current world facts. The key is a new pipeline that uses natural language feedback from a retrieval-augmented generation (RAG) based detector to iteratively rewrite real-time news into more deceptive fake news that challenges large language model (LLM) detectors. Through this iterative rewriting, the paper demonstrates RAG's important role in both detecting and generating fake news, showing that retrieval augmentation is essential for handling unseen events and adversarial attacks.

Link: https://arxiv.org/abs/2410.14651
Authors: Sanxing Chen, Yukun Huang, Bhuwan Dhingra
Keywords-EN: fact-checking websites, knowledge cutoffs, show that existing, existing evaluations, based on conventional
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We show that existing evaluations for fake news detection based on conventional sources, such as claims on fact-checking websites, result in an increasing accuracy over time for LLM-based detectors – even after their knowledge cutoffs. This suggests that recent popular political claims, which form the majority of fake news on such sources, are easily classified using surface-level shallow patterns. Instead, we argue that a proper fake news detection dataset should test a model’s ability to reason factually about the current world by retrieving and reading related evidence. To this end, we develop a novel pipeline that leverages natural language feedback from a RAG-based detector to iteratively modify real-time news into deceptive fake news that challenges LLMs. Our iterative rewrite decreases the binary classification AUC by an absolute 17.5 percent for a strong RAG GPT-4o detector. Our experiments reveal the important role of RAG in both detecting and generating fake news, as retrieval-free LLM detectors are vulnerable to unseen events and adversarial attacks, while feedback from RAG detection helps discover more deceitful patterns in fake news.
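The iterative rewrite pipeline can be sketched as a simple feedback loop. The control flow below is an assumption made for illustration (`detect` stands in for the RAG-based detector returning a fakeness score plus natural-language feedback, and `rewrite` for the LLM rewriter); the paper's actual pipeline is more involved:

```python
def adversarial_rewrite_loop(article, detect, rewrite, max_rounds=5):
    """Iteratively rewrite an article using detector feedback until the
    detector no longer flags it (or the round budget is exhausted)."""
    for _ in range(max_rounds):
        score, feedback = detect(article)
        if score < 0.5:  # detector no longer classifies the article as fake
            break
        article = rewrite(article, feedback)
    return article
```

The loop structure makes the adversarial dynamic explicit: each round, the detector's own explanation is handed to the rewriter, so surviving articles are exactly the ones that evade that detector's reasoning.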

[NLP-7] Distance between Relevant Information Pieces Causes Bias in Long-Context LLMs

[Quick Read]: This paper addresses positional bias in large language models (LLMs) when processing long inputs, particularly the "lost in the middle" phenomenon, where models struggle to use relevant information located in the middle of the input. The key is LongPiBench, a benchmark specifically designed to assess positional bias involving multiple pieces of relevant information. Experiments comparing five commercial and six open-source models find that while most current models are fairly robust to the "lost in the middle" issue, they show significant biases related to the spacing between relevant information pieces. These findings underscore the importance of evaluating and mitigating positional bias to advance LLM capabilities.

Link: https://arxiv.org/abs/2410.14641
Authors: Runchu Tian, Yanghao Li, Yuepeng Fu, Siyang Deng, Qinyu Luo, Cheng Qian, Shuo Wang, Xin Cong, Zhong Zhang, Yesai Wu, Yankai Lin, Huadong Wang, Xiaojiang Liu
Keywords-EN: effectively process long, process long inputs, relevant information, relevant information pieces, large language models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: work in progress

Abstract:Positional bias in large language models (LLMs) hinders their ability to effectively process long inputs. A prominent example is the “lost in the middle” phenomenon, where LLMs struggle to utilize relevant information situated in the middle of the input. While prior research primarily focuses on single pieces of relevant information, real-world applications often involve multiple relevant information pieces. To bridge this gap, we present LongPiBench, a benchmark designed to assess positional bias involving multiple pieces of relevant information. Thorough experiments are conducted with five commercial and six open-source models. These experiments reveal that while most current models are robust against the “lost in the middle” issue, there exist significant biases related to the spacing of relevant information pieces. These findings highlight the importance of evaluating and reducing positional biases to advance LLM’s capabilities.

[NLP-8] GenEOL: Harnessing the Generative Power of LLMs for Training-Free Sentence Embeddings

[Quick Read]: This paper addresses a gap in training-free embedding methods, which use pretrained large language models (LLMs) to embed text but overlook the models' generative abilities. The key is GenEOL, a method that uses LLMs to generate diverse, meaning-preserving transformations of a sentence and aggregates the embeddings of these transformations to enhance the overall sentence embedding. GenEOL significantly outperforms existing training-free embedding methods, improving the sentence semantic textual similarity (STS) benchmark by 2.85 points on average across several LLMs, and achieves notable gains on multiple clustering, reranking, and pair classification tasks.

Link: https://arxiv.org/abs/2410.14635
Authors: Raghuveer Thirukovalluru, Bhuwan Dhingra
Keywords-EN: large language models, directly leverage pretrained, leverage pretrained large, pretrained large language, Training-free embedding methods
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Training-free embedding methods directly leverage pretrained large language models (LLMs) to embed text, bypassing the costly and complex procedure of contrastive learning. Previous training-free embedding methods have mainly focused on optimizing embedding prompts and have overlooked the benefits of utilizing the generative abilities of LLMs. We propose a novel method, GenEOL, which uses LLMs to generate diverse transformations of a sentence that preserve its meaning, and aggregates the resulting embeddings of these transformations to enhance the overall sentence embedding. GenEOL significantly outperforms the existing training-free embedding methods by an average of 2.85 points across several LLMs on the sentence semantic text similarity (STS) benchmark. Our analysis shows that GenEOL stabilizes representation quality across LLM layers and is robust to perturbations of embedding prompts. GenEOL also achieves notable gains on multiple clustering, reranking and pair-classification tasks from the MTEB benchmark.
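The aggregation idea in GenEOL can be sketched in a few lines. This is an illustrative assumption, not the authors' code: `transform` stands in for an LLM that produces meaning-preserving rewrites, `embed` for any sentence encoder, and mean-pooling for the aggregation step.

```python
import numpy as np

def geneol_style_embedding(sentence, transform, embed, n_variants=4):
    """Embed a sentence together with several meaning-preserving rewrites
    and mean-pool the resulting vectors into one sentence embedding."""
    variants = [sentence] + [transform(sentence, i) for i in range(n_variants)]
    vectors = np.stack([embed(v) for v in variants])
    return vectors.mean(axis=0)
```

Averaging over generated paraphrases is what distinguishes this approach from prompt-only methods: the embedding reflects the sentence's meaning as expressed across many surface forms, not one prompt template.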

[NLP-9] Diverging Preferences: When do Annotators Disagree and do Models Know?

[Quick Read]: This paper examines disagreement in human-annotated preference datasets and its impact on reward modeling and evaluation in large language model (LLM) development. The key is identifying and categorizing the sources of disagreement into four high-level classes spanning ten subcategories: task underspecification, response style, refusals, and annotation errors. Experiments show that standard reward modeling approaches (e.g., the Bradley-Terry model) cannot distinguish unanimous agreement from majority opinion among annotators, and that existing LLM-as-judge evaluation methods tend to pick a single "winning" response even when preferences diverge, undermining evaluation accuracy. The paper therefore develops methods to identify diverging preferences and mitigate their influence on evaluation and training, toward more pluralistically aligned LLMs.

Link: https://arxiv.org/abs/2410.14632
Authors: Michael JQ Zhang, Zhilin Wang, Jena D. Hwang, Yi Dong, Olivier Delalleau, Yejin Choi, Eunsol Choi, Xiang Ren, Valentina Pyatkin
Keywords-EN: human-labeled preference datasets, reward modeling, standard reward modeling, examine diverging preferences, preference datasets
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:We examine diverging preferences in human-labeled preference datasets. We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes – task underspecification, response style, refusals, and annotation errors. We find that the majority of disagreements are in opposition with standard reward modeling approaches, which are designed with the assumption that annotator disagreement is noise. We then explore how these findings impact two areas of LLM development: reward modeling and evaluation. In our experiments, we demonstrate how standard reward modeling methods, like the Bradley-Terry model, fail to differentiate whether a given preference judgment is the result of unanimous agreement among annotators or the majority opinion among diverging user preferences. We also find that these tendencies are also echoed by popular LLM-as-Judge evaluation methods, which consistently identify a winning response in cases of diverging preferences. These findings highlight remaining challenges in LLM evaluations, which are greatly influenced by divisive features like response style, and in developing pluralistically aligned LLMs. To address these issues, we develop methods for identifying diverging preferences to mitigate their influence on evaluation and training.
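The Bradley-Terry objective mentioned above scores a preference pair as -log σ(r_chosen − r_rejected). The snippet below is the standard formulation (not code from the paper); it makes the point concrete: once annotator votes are collapsed into a single "chosen" label, a unanimous 5-0 pair and a divisive 3-2 pair contribute exactly the same loss term, so the model never sees the disagreement.

```python
import math

def bradley_terry_nll(reward_chosen, reward_rejected):
    """Negative log-likelihood of a single preference pair under the
    Bradley-Terry model: -log(sigmoid(r_chosen - r_rejected))."""
    return -math.log(1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected))))
```

Note the loss depends only on the reward gap, never on how many annotators agreed, which is exactly the assumption (disagreement = noise) the paper challenges.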

[NLP-10] CELI: Controller-Embedded Language Model Interactions

[Quick Read]: This paper addresses the limitations of existing prompt engineering and workflow optimization techniques on complex, multi-stage tasks. The key is the Controller-Embedded Language Model Interactions (CELI) framework, which embeds control logic directly within language model (LM) prompts, enabling dynamic adaptation to evolving task requirements. CELI transfers control from the traditional programming execution environment to the LM, letting it autonomously manage computational workflows while maintaining seamless interaction with external systems and functions. The framework supports arbitrary function calls with variable arguments, bridging the gap between LMs' adaptive reasoning abilities and the structured control mechanisms of conventional software paradigms.

Link: https://arxiv.org/abs/2410.14627
Authors: Jan-Samuel Wagner, Dave DeCaprio, Abishek Chiffon Muthu Raja, Jonathan M. Holman, Lauren K. Brady, Sky C. Cheung, Hosein Barzekar, Eric Yang, Mark Anthony Martinez II, David Soong, Sriram Sridhar, Han Si, Brandon W. Higgs, Hisham Hamadeh, Scott Ogden
Keywords-EN: introduce Controller-Embedded Language, Controller-Embedded Language Model, Language Model Interactions, control logic directly, Language Model
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 26 pages, 2 figures

Abstract:We introduce Controller-Embedded Language Model Interactions (CELI), a framework that integrates control logic directly within language model (LM) prompts, facilitating complex, multi-stage task execution. CELI addresses limitations of existing prompt engineering and workflow optimization techniques by embedding control logic directly within the operational context of language models, enabling dynamic adaptation to evolving task requirements. Our framework transfers control from the traditional programming execution environment to the LMs, allowing them to autonomously manage computational workflows while maintaining seamless interaction with external systems and functions. CELI supports arbitrary function calls with variable arguments, bridging the gap between LMs’ adaptive reasoning capabilities and conventional software paradigms’ structured control mechanisms. To evaluate CELI’s versatility and effectiveness, we conducted case studies in two distinct domains: code generation (HumanEval benchmark) and multi-stage content generation (Wikipedia-style articles). The results demonstrate notable performance improvements across a range of domains. CELI achieved a 4.9 percentage point improvement over the best reported score of the baseline GPT-4 model on the HumanEval code generation benchmark. In multi-stage content generation, 94.4% of CELI-produced Wikipedia-style articles met or exceeded first draft quality when optimally configured, with 44.4% achieving high quality. These outcomes underscore CELI’s potential for optimizing AI-driven workflows across diverse computational domains.

[NLP-11] You Shall Know a Tool by the Traces it Leaves: The Predictability of Sentiment Analysis Tools

[Quick Read]: This paper addresses the inconsistency of sentiment analysis tools' classifications across corpora and languages, revealing an algorithmic bias in these tools. The key finding: classifiers built on Twitter, Wikipedia, and various news corpora (in English, German, and French) can predict which sentiment tool produced a given annotation, separating the tools with an average F1 score of 0.89 on the English corpora. The results suggest sentiment annotations should not be taken at face value, underscoring the need for more systematic NLP evaluation studies.

Link: https://arxiv.org/abs/2410.14626
Authors: Daniel Baumartz, Mevlüt Bagci, Alexander Henlein, Maxim Konca, Andy Lücking, Alexander Mehler
Keywords-EN: provide comparable results, sentiment analysis tools, sentiment analysis, provide comparable, analysis tools
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:If sentiment analysis tools were valid classifiers, one would expect them to provide comparable results for sentiment classification on different kinds of corpora and for different languages. In line with results of previous studies we show that sentiment analysis tools disagree on the same dataset. Going beyond previous studies we show that the sentiment tool used for sentiment annotation can even be predicted from its outcome, revealing an algorithmic bias of sentiment analysis. Based on Twitter, Wikipedia and different news corpora from the English, German and French languages, our classifiers separate sentiment tools with an averaged F1-score of 0.89 (for the English corpora). We therefore warn against taking sentiment annotations as face value and argue for the need of more and systematic NLP evaluation studies.

[NLP-12] DiSCo Meets LLMs: A Unified Approach for Sparse Retrieval and Contextual Distillation in Conversational Search

[Quick Read]: This paper addresses the inference-time inefficiency of large language models (LLMs) in conversational search. The key is a new distillation method that relaxes existing training objectives: instead of relying solely on representation learning, it distills similarity scores between conversations and documents. This unifies the retrieval and context modeling tasks and exploits the contrastive nature of document relevance, yielding substantial in-domain and out-of-domain retrieval gains across several conversational search datasets, with further gains from multi-teacher distillation.

Link: https://arxiv.org/abs/2410.14609
Authors: Simon Lupart, Mohammad Aliannejadi, Evangelos Kanoulas
Keywords-EN: Conversational Search, conversational context, conversational context modeling, Large Language Models, retrieving relevant documents
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:

Abstract:Conversational Search (CS) is the task of retrieving relevant documents from a corpus within a conversational context, combining retrieval with conversational context modeling. With the explosion of Large Language Models (LLMs), the CS field has seen major improvements with LLMs rewriting user queries, accounting for conversational context. However, engaging LLMs at inference time harms efficiency. Current methods address this by distilling embeddings from human-rewritten queries to learn the context modeling task. Yet, these approaches predominantly focus on context modeling, and only treat the contrastive component of the retrieval task within a distillation-independent loss term. To address these limitations, we propose a new distillation method, as a relaxation of the previous objective, unifying retrieval and context modeling. We relax the existing training objectives by distilling similarity scores between conversations and documents, rather than relying solely on representation learning. Our proposed distillation objective allows for more freedom in the representation space and leverages the contrastive nature of document relevance. Through experiments on Learned Sparse Retrieval (LSR) across 5 CS datasets, our approach demonstrates substantial improvements in both in-domain and out-of-domain retrieval performance, outperforming state-of-the-art with gains of up to 6 points in recall for out-of-domain datasets. Additionally, through the relaxation of the objective, we propose a multi-teacher distillation, using multiple LLMs as teachers, yielding additional gains, and outperforming the teachers themselves in in-domain experiments. Finally, analysis of the sparsity of the models reveals that our distillation allows for better control over the sparsity of the trained models.
摘要:对话式搜索 (Conversational Search, CS) 是指在对话环境中从语料库中检索相关文档的任务,结合了检索与对话上下文建模。随着大语言模型 (Large Language Models, LLMs) 的迅猛发展,CS 领域在 LLMs 重写用户查询并考虑对话上下文方面取得了重大进展。然而,在推理时调用 LLMs 会损害效率。当前的方法通过从人工重写的查询中蒸馏嵌入来学习上下文建模任务,从而解决这一问题。然而,这些方法主要关注上下文建模,仅在独立于蒸馏的损失项中处理检索任务的对比部分。为了解决这些限制,我们提出了一种新的蒸馏方法,作为先前目标的松弛,统一了检索和上下文建模。我们通过蒸馏对话和文档之间的相似度分数来放松现有的训练目标,而不是仅仅依赖于表示学习。我们提出的蒸馏目标允许在表示空间中具有更大的自由度,并利用了文档相关性的对比性质。通过在 5 个 CS 数据集上对学习稀疏检索 (Learned Sparse Retrieval, LSR) 进行实验,我们的方法在域内和域外检索性能方面均显示出显著改进,在域外数据集的召回率上超过了最先进的方法,提升了高达 6 个百分点。此外,通过目标的松弛,我们提出了多教师蒸馏,使用多个 LLMs 作为教师,在域内实验中取得了额外收益,并超越了教师自身的表现。最后,对模型稀疏性的分析表明,我们的蒸馏方法允许更好地控制训练模型的稀疏性。
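上文"蒸馏相似度分数而非表示本身"的松弛目标,可以用一个极简的纯 Python 草图来说明(其中的点积打分、MSE 损失形式与函数命名均为本文为便于说明所作的假设,论文实际基于学习稀疏检索模型实现):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mse(xs, ys):
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def score_distill_loss(student_conv_emb, teacher_query_emb, doc_embs):
    """相似度分数蒸馏:对齐学生(对话表示)与教师(重写查询表示)
    对同一批候选文档的打分,而不是直接对齐两个表示向量。"""
    s_student = [dot(student_conv_emb, d) for d in doc_embs]
    s_teacher = [dot(teacher_query_emb, d) for d in doc_embs]
    return mse(s_student, s_teacher)

def embedding_distill_loss(student_conv_emb, teacher_query_emb):
    """传统做法(对照用):直接在表示空间做 MSE 对齐。"""
    return mse(student_conv_emb, teacher_query_emb)
```

例如,当学生表示仅在与所有文档嵌入正交的方向上偏离教师表示时,分数蒸馏损失为 0 而嵌入对齐损失不为 0,这正体现了摘要所说的"表示空间中更大的自由度"。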

[NLP-13] Teaching Models to Balance Resisting and Accepting Persuasion

【速读】: 该论文试图解决大语言模型(LLMs)在面对对抗性对话者时易受说服的问题,并提出解决方案的关键在于平衡正向和负向说服。论文引入了一种名为“说服平衡训练”(Persuasion-Balanced Training, PBT)的方法,通过多代理递归对话树生成数据,并利用偏好优化训练模型,使其在适当情况下接受说服。PBT不仅提高了模型对错误信息的抵抗力,还增强了其在多代理辩论中的表现稳定性,确保了更强的模型能够持续提升较弱模型的表现。

链接: https://arxiv.org/abs/2410.14596
作者: Elias Stengel-Eskin,Peter Hase,Mohit Bansal
关键词-EN: Large language models, Large language, pose risks, models, PBT
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Code: this https URL

点击查看摘要

Abstract:Large language models (LLMs) are susceptible to persuasion, which can pose risks when models are faced with an adversarial interlocutor. We take a first step towards defending models against persuasion while also arguing that defense against adversarial (i.e. negative) persuasion is only half of the equation: models should also be able to accept beneficial (i.e. positive) persuasion to improve their answers. We show that optimizing models for only one side results in poor performance on the other. In order to balance positive and negative persuasion, we introduce Persuasion-Balanced Training (or PBT), which leverages multi-agent recursive dialogue trees to create data and trains models via preference optimization to accept persuasion when appropriate. PBT consistently improves resistance to misinformation and resilience to being challenged while also resulting in the best overall performance on holistic data containing both positive and negative persuasion. Crucially, we show that PBT models are better teammates in multi-agent debates. We find that without PBT, pairs of stronger and weaker models have unstable performance, with the order in which the models present their answers determining whether the team obtains the stronger or weaker model’s performance. PBT leads to better and more stable results and less order dependence, with the stronger model consistently pulling the weaker one up.
摘要:大语言模型 (LLMs) 容易受到说服的影响,当模型面对对抗性对话者时,这可能会带来风险。我们迈出了防御模型免受说服影响的第一步,同时认为防御对抗性 (即负面) 说服只是问题的一半:模型还应能够接受有益的 (即正面) 说服以改进其答案。我们表明,仅针对一方进行优化会导致在另一方表现不佳。为了平衡正面和负面说服,我们引入了说服平衡训练 (Persuasion-Balanced Training, PBT),该方法利用多智能体递归对话树生成数据,并通过偏好优化训练模型在适当情况下接受说服。PBT 持续提高了对错误信息的抵抗力,并在面对挑战时更具韧性,同时在包含正面和负面说服的综合数据上实现了最佳整体表现。关键的是,我们展示了 PBT 模型在多智能体辩论中是更好的队友。我们发现,在没有 PBT 的情况下,强模型和弱模型的组合表现不稳定,模型回答的顺序决定了团队获得的是强模型还是弱模型的表现。PBT 带来了更好且更稳定的结果,减少了顺序依赖性,强模型始终能带动弱模型提升。
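PBT 的核心数据构造可以粗略示意如下:从递归对话树的节点出发,依据说服方信息的正误决定哪条回复进入偏好优化的 chosen/rejected 位置(字段名与函数均为本文假设的示意写法,并非论文实现):

```python
def build_preference_pair(node):
    """从对话树节点构造 (chosen, rejected) 偏好对:
    - 说服方正确时,偏好"接受说服并修正答案"的回复;
    - 说服方错误时,偏好"坚持原答案"的回复。
    摘要指出只对单侧优化会导致另一侧表现变差,
    因此两类节点都需要进入训练数据。"""
    accept, resist = node["accept_reply"], node["resist_reply"]
    if node["persuader_is_correct"]:
        return accept, resist
    return resist, accept
```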

[NLP-14] Toolshed: Scale Tool-Equipped Agents with Advanced RAG-Tool Fusion and Tool Knowledge Bases

【速读】: 该论文试图解决工具增强型代理(LLMs)在处理复杂任务时,工具容量超出代理推理或模型限制的问题。解决方案的关键在于引入Toolshed知识库,这是一个工具知识库(向量数据库),用于存储增强的工具表示并优化大规模工具增强型代理的工具选择。此外,论文提出了高级RAG-Tool融合技术,通过在预检索、检索中和检索后阶段应用先进的检索增强生成(RAG)技术,无需模型微调即可提升工具选择的准确性。预检索阶段通过增强工具文档的关键信息存储在Toolshed知识库中,检索中阶段通过查询规划和转换提高检索精度,检索后阶段则通过细化检索到的工具文档和自我反思来进一步优化。通过调整代理可访问的工具总数(tool-M)和工具选择阈值(top-k),论文在多个基准数据集上实现了显著的性能提升。

链接: https://arxiv.org/abs/2410.14594
作者: Elias Lumer
关键词-EN: multi-agent code development, enabled complex tasks, Recent advancements, secure database interactions, Toolshed Knowledge Bases
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advancements in tool-equipped Agents (LLMs) have enabled complex tasks like secure database interactions and multi-agent code development. However, scaling tool capacity beyond agent reasoning or model limits remains a challenge. In this paper, we address these challenges by introducing Toolshed Knowledge Bases, a tool knowledge base (vector database) designed to store enhanced tool representations and optimize tool selection for large-scale tool-equipped Agents. Additionally, we propose Advanced RAG-Tool Fusion, a novel ensemble of tool-applied advanced retrieval-augmented generation (RAG) techniques across the pre-retrieval, intra-retrieval, and post-retrieval phases, without requiring model fine-tuning. During pre-retrieval, tool documents are enhanced with key information and stored in the Toolshed Knowledge Base. Intra-retrieval focuses on query planning and transformation to increase retrieval accuracy. Post-retrieval refines the retrieved tool documents and enables self-reflection. Furthermore, by varying both the total number of tools (tool-M) an Agent has access to and the tool selection threshold (top-k), we address trade-offs between retrieval accuracy, agent performance, and token cost. Our approach achieves 46%, 56%, and 47% absolute improvements on the ToolE single-tool, ToolE multi-tool and Seal-Tools benchmark datasets, respectively (Recall@5).
摘要:近期,配备工具的智能体(大语言模型)在处理安全数据库交互和多智能体代码开发等复杂任务方面取得了显著进展。然而,扩展工具能力超越智能体推理或模型限制仍然是一个挑战。本文通过引入工具库知识库(Toolshed Knowledge Bases),一种用于存储增强工具表示并优化大规模工具配备智能体工具选择的工具知识库(向量数据库),来解决这些挑战。此外,我们提出了高级RAG-工具融合(Advanced RAG-Tool Fusion),这是一种新颖的工具应用高级检索增强生成(RAG)技术集合,涵盖预检索、检索中和检索后阶段,无需模型微调。在预检索阶段,工具文档通过关键信息增强并存储在工具库知识库中。检索中阶段专注于查询规划和转换,以提高检索准确性。检索后阶段则对检索到的工具文档进行细化,并实现自我反思。此外,通过调整智能体可访问的工具总数(工具-M)和工具选择阈值(top-k),我们解决了检索准确性、智能体性能和Token成本之间的权衡问题。我们的方法在ToolE单工具、ToolE多工具和Seal-Tools基准数据集上分别实现了46%、56%和47%的绝对改进(Recall@5)。
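工具库知识库的 top-k 工具选择,骨架上就是对增强后的工具文档做向量检索。下面用内存字典加余弦相似度给出一个最小化草图(真实系统使用向量数据库,并配合检索中的查询规划与检索后的重排、自我反思,此处从略;数据与命名均为假设):

```python
import math

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def retrieve_tools(query_emb, toolshed, top_k=2):
    """toolshed: 工具名 -> 增强工具文档的嵌入。
    返回与查询最相似的 top-k 个工具名;top_k 即论文中
    在检索准确率、智能体性能与 Token 成本之间权衡的阈值。"""
    ranked = sorted(toolshed.items(),
                    key=lambda kv: cosine(query_emb, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]
```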

[NLP-15] Dialetto ma Quanto Dialetto? Transcribing and Evaluating Dialects on a Continuum

【速读】: 该论文试图解决自然语言处理(NLP)中对方言处理时将其视为离散类别的问题,特别是在评估变体导向的NLP时,即使同一方言内部也存在显著变异。解决方案的关键在于研究方言内部的变异,并通过实证方法测量意大利方言的语音转文本性能,发现地理上的性能差异与方言间的语言相似性有显著相关性(相关系数为-0.5)。论文通过方言学方法交叉验证结果,并解释这种性能差异是由于语音转文本模型对与标准方言更相似的方言存在偏见。此外,利用地理统计方法预测未见地点的零样本性能,发现地理信息的引入显著提高了预测性能,表明性能分布中存在地理结构。

链接: https://arxiv.org/abs/2410.14589
作者: Ryan Soh-Eun Shim,Barbara Plank
关键词-EN: increasing interest, African-American Vernacular English, performance, Indian English, Vernacular English
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:There is increasing interest in looking at dialects in NLP. However, most work to date still treats dialects as discrete categories. For instance, evaluative work in variation-oriented NLP for English often works with Indian English or African-American Vernacular English as homogeneous categories (Faisal et al., 2024; Ziems et al., 2023), yet even within one variety there is substantial variation. We examine within-dialect variation and show that performance critically varies within categories. We measure speech-to-text performance on Italian dialects, and empirically observe a geographical performance disparity. This disparity correlates substantially (-0.5) with linguistic similarity to the highest performing dialect variety. We cross-examine our results against dialectometry methods, and interpret the performance disparity to be due to a bias towards dialects that are more similar to the standard variety in the speech-to-text model examined. We additionally leverage geostatistical methods to predict zero-shot performance at unseen sites, and find the incorporation of geographical information to substantially improve prediction performance, indicating there to be geographical structure in the performance distribution.
摘要:在自然语言处理 (NLP) 领域,对方言的研究兴趣日益增加。然而,迄今为止的大多数研究仍将方言视为离散的类别。例如,针对英语变体的评估工作通常将印度英语或非裔美国人英语视为同质类别 (Faisal et al., 2024; Ziems et al., 2023),尽管在同一变体内部也存在显著的差异。我们研究了方言内部的变异,并表明在类别内部,性能存在关键性的差异。我们测量了意大利方言的语音转文本性能,并实证观察到地理性能差异。这种差异与最高性能方言变体的语言相似性显著相关 (-0.5)。我们将结果与方言计量学方法进行交叉验证,并解释这种性能差异是由于语音转文本模型对与标准方言更相似的方言存在偏好的结果。此外,我们利用地理统计方法预测未见地点的零样本性能,发现加入地理信息显著提高了预测性能,表明性能分布中存在地理结构。

[NLP-16] Do LLMs estimate uncertainty well in instruction-following?

【速读】: 该论文试图解决大语言模型(LLMs)在遵循用户指令时的不确定性估计问题,特别是在高风险应用中的可靠性问题。解决方案的关键在于引入了一种受控的评估设置,通过两个版本的基准数据集,系统地比较了不同不确定性估计方法在指令遵循任务中的表现。研究发现,现有方法在模型出现细微错误时表现不佳,而内部模型状态虽有所改进,但在复杂场景中仍显不足。这一研究为理解LLMs在指令遵循任务中的局限性和不确定性估计的潜力提供了重要见解,为构建更可信的AI代理奠定了基础。

链接: https://arxiv.org/abs/2410.14582
作者: Juyeon Heo,Miao Xiong,Christina Heinze-Deml,Jaya Narain
关键词-EN: Large language models, precisely follow user, Large language, follow user instructions, valuable personal
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) could be valuable personal AI agents across various domains, provided they can precisely follow user instructions. However, recent studies have shown significant limitations in LLMs’ instruction-following capabilities, raising concerns about their reliability in high-stakes applications. Accurately estimating LLMs’ uncertainty in adhering to instructions is critical to mitigating deployment risks. We present, to our knowledge, the first systematic evaluation of the uncertainty estimation abilities of LLMs in the context of instruction-following. Our study identifies key challenges with existing instruction-following benchmarks, where multiple factors are entangled with the uncertainty that stems from instruction-following, complicating the isolation and comparison across methods and models. To address these issues, we introduce a controlled evaluation setup with two benchmark versions of data, enabling a comprehensive comparison of uncertainty estimation methods under various conditions. Our findings show that existing uncertainty methods struggle, particularly when models make subtle errors in instruction following. While internal model states provide some improvement, they remain inadequate in more complex scenarios. The insights from our controlled evaluation setups provide a crucial understanding of LLMs’ limitations and potential for uncertainty estimation in instruction-following tasks, paving the way for more trustworthy AI agents.
摘要:大语言模型 (LLMs) 在各种领域中可能成为有价值的个人 AI 智能体,前提是它们能够准确遵循用户指令。然而,最近的研究表明,LLMs 在遵循指令的能力上存在显著局限性,这引发了对其在高风险应用中可靠性的担忧。准确估计 LLMs 在遵循指令时的不确定性对于降低部署风险至关重要。据我们所知,本文首次系统评估了 LLMs 在指令遵循情境下的不确定性估计能力。我们的研究发现,现有指令遵循基准测试存在关键挑战,其中多个因素与不确定性来源交织在一起,使得不同方法和模型之间的隔离和比较变得复杂。为解决这些问题,我们引入了一种受控的评估设置,包含两个基准数据版本,能够在各种条件下全面比较不确定性估计方法。我们的研究结果显示,现有不确定性方法在模型遵循指令时出现细微错误的情况下表现不佳。尽管内部模型状态提供了一些改进,但在更复杂的情况下仍显不足。我们受控评估设置中的见解为理解 LLMs 在指令遵循任务中的局限性和不确定性估计的潜力提供了关键认识,为构建更可信的 AI 智能体铺平了道路。

[NLP-17] Optimizing Attention with Mirror Descent: Generalized Max-Margin Token Selection

【速读】: 该论文试图解决在softmax注意力机制下,镜像下降(Mirror Descent, MD)算法的收敛性质和隐含偏置问题。解决方案的关键在于证明MD算法在应用于分类问题时,其收敛方向与带有(\ell_p)范数目标的广义硬间隔支持向量机(SVM)一致。论文通过理论分析揭示了MD算法在高度非线性和非凸问题中的收敛速率与传统梯度下降(GD)相当,并探讨了关键-查询矩阵与解码器联合优化的动态过程,确保其收敛到各自的硬间隔SVM解。此外,实证研究表明MD算法在泛化性能和最优标记选择方面优于标准GD。

链接: https://arxiv.org/abs/2410.14581
作者: Aaron Alvarado Kristanto Julistiono,Davoud Ataee Tarzanagh,Navid Azizan
关键词-EN: natural language processing, artificial intelligence, computer vision, revolutionized several domains, domains of artificial
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Attention mechanisms have revolutionized several domains of artificial intelligence, such as natural language processing and computer vision, by enabling models to selectively focus on relevant parts of the input data. While recent work has characterized the optimization dynamics of gradient descent (GD) in attention-based models and the structural properties of its preferred solutions, less is known about more general optimization algorithms such as mirror descent (MD). In this paper, we investigate the convergence properties and implicit biases of a family of MD algorithms tailored for softmax attention mechanisms, with the potential function chosen as the p-th power of the \ell_p-norm. Specifically, we show that these algorithms converge in direction to a generalized hard-margin SVM with an \ell_p-norm objective when applied to a classification problem using a softmax attention model. Notably, our theoretical results reveal that the convergence rate is comparable to that of traditional GD in simpler models, despite the highly nonlinear and nonconvex nature of the present problem. Additionally, we delve into the joint optimization dynamics of the key-query matrix and the decoder, establishing conditions under which this complex joint optimization converges to their respective hard-margin SVM solutions. Lastly, our numerical experiments on real data demonstrate that MD algorithms improve generalization over standard GD and excel in optimal token selection.
摘要: 注意力机制 (Attention mechanisms) 已经彻底改变了人工智能的多个领域,例如自然语言处理和计算机视觉,通过使模型能够有选择地关注输入数据的相关部分。尽管最近的研究已经描述了基于注意力的模型中梯度下降 (Gradient Descent, GD) 的优化动态及其首选解的结构特性,但对于更一般的优化算法,如镜像下降 (Mirror Descent, MD),了解较少。本文中,我们研究了一类针对 softmax 注意力机制定制的 MD 算法的收敛特性和隐含偏置,其中势函数选择为 \ell_p 范数的 p 次方。具体来说,我们证明了当应用于使用 softmax 注意力模型的分类问题时,这些算法在方向上收敛于具有 \ell_p 范数目标的广义硬间隔 SVM (Support Vector Machine)。值得注意的是,我们的理论结果表明,尽管当前问题具有高度非线性和非凸性,但收敛速度与传统 GD 在更简单模型中的收敛速度相当。此外,我们深入探讨了关键-查询矩阵和解码器的联合优化动态,确定了这种复杂联合优化在何种条件下会收敛到各自的硬间隔 SVM 解。最后,我们在真实数据上的数值实验表明,MD 算法在标准 GD 的基础上提高了泛化能力,并在最佳 Token 选择方面表现出色。
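摘要中的 MD 算法族以 (1/p)·||θ||_p^p 为势函数,其单步更新可以写成"映射到对偶空间、做一次梯度步、再逆映射回来"。下面给出一个与具体注意力模型无关的通用单步实现草图(仅为示意,论文分析的是这种更新在 softmax 注意力参数上的收敛方向;p = 2 时恰好退化为标准梯度下降):

```python
def lp_mirror_descent_step(theta, grad, lr, p):
    """镜像下降一步,势函数取 (1/p)·||θ||_p^p(要求 p > 1)。"""
    def sgn(x):
        return (x > 0) - (x < 0)
    # 镜像映射(势函数的梯度):z_i = sign(θ_i)·|θ_i|^(p-1)
    z = [sgn(t) * abs(t) ** (p - 1) for t in theta]
    # 在对偶空间做梯度步
    z = [zi - lr * gi for zi, gi in zip(z, grad)]
    # 逆映射回原空间:θ_i = sign(z_i)·|z_i|^(1/(p-1))
    return [sgn(zi) * abs(zi) ** (1.0 / (p - 1)) for zi in z]
```

取 p = 2 时镜像映射是恒等映射,一步更新与梯度下降完全一致;p ≠ 2 时更新方向被 \ell_p 几何"弯曲",这正是收敛到 \ell_p 范数目标的广义硬间隔 SVM 的来源。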

[NLP-18] Large Language Models Are Overparameterized Text Encoders

【速读】: 该论文试图解决大型语言模型(LLMs)在文本嵌入任务中由于模型规模过大导致的推理时间和内存需求过高的问题。解决方案的关键在于通过在监督对比训练之前对LLM的最后p%层进行剪枝,仅训练1000步,从而实现内存和推理时间的显著减少。论文提出了一种名为L³Prune的新型层剪枝策略,基于模型的初始损失提供两种最优剪枝配置:一种在大规模剪枝时性能损失极小,另一种在资源受限环境下性能下降较小但剪枝比例更高。实验结果表明,LLMs在文本嵌入任务中存在过度参数化,可以通过简单的剪枝操作大幅减少模型规模而不显著影响性能。

链接: https://arxiv.org/abs/2410.14578
作者: Thennal D K,Tim Fischer,Chris Biemann
关键词-EN: supervised contrastive training, text, text embedding, Large language models, Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages of content + 1 for limitations and ethical considerations, 14 pages in total including references and appendix, 5+1 figures

点击查看摘要

Abstract:Large language models (LLMs) demonstrate strong performance as text embedding models when finetuned with supervised contrastive training. However, their large size balloons inference time and memory requirements. In this paper, we show that by pruning the last p% layers of an LLM before supervised training for only 1000 steps, we can achieve a proportional reduction in memory and inference time. We evaluate four different state-of-the-art LLMs on text embedding tasks and find that our method can prune up to 30% of layers with negligible impact on performance and up to 80% with only a modest drop. With only three lines of code, our method is easily implemented in any pipeline for transforming LLMs to text encoders. We also propose L³Prune, a novel layer-pruning strategy based on the model’s initial loss that provides two optimal pruning configurations: a large variant with negligible performance loss and a small variant for resource-constrained settings. On average, the large variant prunes 21% of the parameters with a -0.3 performance drop, and the small variant only suffers from a -5.1 decrease while pruning 74% of the model. We consider these results strong evidence that LLMs are overparameterized for text embedding tasks, and can be easily pruned.
摘要:大语言模型 (LLMs) 在经过监督对比训练微调后,作为文本嵌入模型表现出强大的性能。然而,其庞大的规模显著增加了推理时间和内存需求。本文中,我们展示了一种方法,即在监督训练前对 LLM 的最后 p% 层进行剪枝,仅需训练 1000 步,即可实现内存和推理时间的成比例减少。我们在文本嵌入任务上评估了四种不同的最先进 LLM,发现我们的方法可以在性能几乎无损的情况下剪枝高达 30% 的层,甚至在性能略有下降的情况下剪枝高达 80%。仅需三行代码,我们的方法即可轻松集成到任何将 LLM 转换为文本编码器的流程中。此外,我们还提出了 L³Prune,一种基于模型初始损失的新颖层剪枝策略,提供了两种最优剪枝配置:一种是大变体,性能损失可忽略不计;另一种是小变体,适用于资源受限的环境。平均而言,大变体剪枝 21% 的参数,性能下降仅为 -0.3,而小变体在剪枝 74% 的模型时,性能仅下降 -5.1。我们认为这些结果有力地证明了 LLM 在文本嵌入任务中存在过度参数化,且可以轻松进行剪枝。
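"剪掉最后 p% 的层"本身只是一个切片操作,可以示意如下(取整方式为本文假设;论文在剪枝后还需做约 1000 步的监督对比训练,此处不展示。在常见的 decoder-only 实现里,这大致对应于截断模型的 decoder 层列表):

```python
def prune_last_layers(layers, prune_percent):
    """丢弃最后 prune_percent% 的层,返回保留部分。
    例如 32 层剪 30%:去掉 9 层、保留 23 层(向下取整为假设)。"""
    keep = len(layers) - int(len(layers) * prune_percent / 100)
    return layers[:keep]
```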

[NLP-19] MomentumSMoE: Integrating Momentum into Sparse Mixture of Experts NEURIPS2024

【速读】: 该论文试图解决稀疏混合专家模型(SMoE)在训练过程中不稳定和难以适应新数据分布的问题,导致模型对数据污染缺乏鲁棒性。解决方案的关键在于将动量机制引入SMoE,提出了一种名为MomentumSMoE的新型SMoE模型。通过理论证明和数值实验,论文展示了MomentumSMoE在稳定性、鲁棒性以及在多个实际任务(如ImageNet-1K对象识别和WikiText-103语言建模)中的优越性能。此外,论文还展示了MomentumSMoE框架的通用性,能够应用于多种SMoE模型,并且可以轻松集成其他基于动量的优化方法(如Adam)以进一步提升性能。

链接: https://arxiv.org/abs/2410.14574
作者: Rachel S.Y. Teo,Tan M. Nguyen
关键词-EN: unlocking unparalleled scalability, Sparse Mixture, deep learning, key to unlocking, unlocking unparalleled
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: 10 pages in the main text. Published at NeurIPS 2024. The code is available at this https URL

点击查看摘要

Abstract:Sparse Mixture of Experts (SMoE) has become the key to unlocking unparalleled scalability in deep learning. SMoE has the potential to exponentially increase parameter count while maintaining the efficiency of the model by only activating a small subset of these parameters for a given sample. However, it has been observed that SMoE suffers from unstable training and has difficulty adapting to new distributions, leading to the model’s lack of robustness to data contamination. To overcome these limitations, we first establish a connection between the dynamics of the expert representations in SMoEs and gradient descent on a multi-objective optimization problem. Leveraging our framework, we then integrate momentum into SMoE and propose a new family of SMoEs named MomentumSMoE. We theoretically prove and numerically demonstrate that MomentumSMoE is more stable and robust than SMoE. In particular, we verify the advantages of MomentumSMoE over SMoE on a variety of practical tasks including ImageNet-1K object recognition and WikiText-103 language modeling. We demonstrate the applicability of MomentumSMoE to many types of SMoE models, including those in the Sparse MoE model for vision (V-MoE) and the Generalist Language Model (GLaM). We also show that other advanced momentum-based optimization methods, such as Adam, can be easily incorporated into the MomentumSMoE framework for designing new SMoE models with even better performance, almost negligible additional computation cost, and simple implementations.
摘要:稀疏混合专家模型 (Sparse Mixture of Experts, SMoE) 已成为解锁深度学习中无与伦比的可扩展性的关键。SMoE 具有指数级增加参数数量的潜力,同时通过仅激活给定样本的一小部分参数来保持模型的效率。然而,观察发现 SMoE 在训练过程中存在不稳定性,并且难以适应新分布,导致模型对数据污染缺乏鲁棒性。为了克服这些局限性,我们首先建立了 SMoE 中专家表示动态与多目标优化问题梯度下降之间的联系。在此基础上,我们将动量引入 SMoE,并提出了一类新的 SMoE 模型,命名为 MomentumSMoE。我们通过理论证明和数值实验展示了 MomentumSMoE 比 SMoE 更加稳定和鲁棒。特别地,我们在包括 ImageNet-1K 对象识别和 WikiText-103 语言建模在内的多种实际任务中验证了 MomentumSMoE 相对于 SMoE 的优势。我们展示了 MomentumSMoE 适用于多种类型的 SMoE 模型,包括视觉稀疏 MoE 模型 (V-MoE) 和通用语言模型 (GLaM)。此外,我们还表明,其他基于动量的先进优化方法,如 Adam,可以轻松融入 MomentumSMoE 框架,用于设计性能更佳、计算成本几乎可以忽略不计且实现简单的新型 SMoE 模型。
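把残差形式的 SMoE 前向 x_{l+1} = x_l + MoE(x_l) 看作一次梯度步,引入动量就相当于用 heavy-ball 的方式跨层累积各层 MoE 的输出。下面是一个一维的说明性草图(符号约定与精确公式以论文为准,这里仅演示"动量累积"这一结构;beta = 0 时退化为普通残差 SMoE):

```python
def momentum_moe_forward(x, moe_layers, beta=0.9):
    """带动量的残差 MoE 前向草图:
    v <- beta * v + MoE(x);  x <- x + v"""
    v = [0.0] * len(x)
    for moe in moe_layers:
        out = moe(x)
        v = [beta * vi + oi for vi, oi in zip(v, out)]
        x = [xi + vi for xi, vi in zip(x, v)]
    return x
```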

[NLP-20] RAG-ConfusionQA: A Benchmark for Evaluating LLMs on Confusing Questions

【速读】: 该论文试图解决对话式AI代理在处理用户查询时遇到的困惑性问题,特别是那些包含错误假设或模糊性的问题。解决方案的关键在于提出了一种新颖的合成数据生成方法,通过从给定的文档语料库中高效地创建多样化且基于上下文的困惑性问题,以提高RAG代理对这些问题的响应质量。论文还通过实证比较评估了多个大型语言模型作为RAG代理的困惑检测和适当响应生成的能力,并贡献了一个基准数据集到公共领域。

链接: https://arxiv.org/abs/2410.14567
作者: Zhiyuan Peng,Jinming Nian,Alexandre Evfimievski,Yi Fang
关键词-EN: Retrieval Augmented Generation, Retrieval Augmented, provide verifiable document-grounded, verifiable document-grounded responses, user inquiries
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: under review

点击查看摘要

Abstract:Conversational AI agents use Retrieval Augmented Generation (RAG) to provide verifiable document-grounded responses to user inquiries. However, many natural questions do not have good answers: about 25% contain false assumptions (Yu et al., 2023), and over 50% are ambiguous (Min et al., 2020). RAG agents need high-quality data to improve their responses to confusing questions. This paper presents a novel synthetic data generation method to efficiently create a diverse set of context-grounded confusing questions from a given document corpus. We conduct an empirical comparative evaluation of several large language models as RAG agents to measure the accuracy of confusion detection and appropriate response generation. We contribute a benchmark dataset to the public domain.
摘要:对话式 AI 智能体使用检索增强生成 (Retrieval Augmented Generation, RAG) 技术,为用户查询提供基于可验证文档的响应。然而,许多自然问题并没有好的答案:约 25% 包含错误假设 (Yu et al., 2023),超过 50% 是模糊的 (Min et al., 2020)。RAG 智能体需要高质量的数据来改进对混淆问题的响应。本文提出了一种新颖的合成数据生成方法,能够从给定的文档语料库中高效地创建一系列基于上下文的混淆问题。我们进行了实证比较评估,对多个大语言模型作为 RAG 智能体的混淆检测准确性和适当响应生成能力进行了测量。我们向公众领域贡献了一个基准数据集。

[NLP-21] Tell me what I need to know: Exploring LLM-based (Personalized) Abstractive Multi-Source Meeting Summarization

【速读】: 该论文试图解决会议总结中个性化和上下文理解不足的问题。解决方案的关键在于采用三阶段大型语言模型方法,通过识别需要额外上下文的转录段落、从补充材料中推断相关细节并插入转录中,以及基于丰富后的转录生成总结,从而提升模型对会议内容的理解,增强总结的相关性和信息量。此外,引入个性化协议,提取参与者特征并定制总结,进一步提高了总结的信息量。

链接: https://arxiv.org/abs/2410.14545
作者: Frederic Kirstein,Terry Ruas,Robert Kratel,Bela Gipp
关键词-EN: existing solutions struggle, digital communication, generate personalized, crucial in digital, existing solutions
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Meeting summarization is crucial in digital communication, but existing solutions struggle with salience identification to generate personalized, workable summaries, and context understanding to fully comprehend the meetings’ content. Previous attempts to address these issues by considering related supplementary resources (e.g., presentation slides) alongside transcripts are hindered by models’ limited context sizes and handling the additional complexities of the multi-source tasks, such as identifying relevant information in additional files and seamlessly aligning it with the meeting content. This work explores multi-source meeting summarization considering supplementary materials through a three-stage large language model approach: identifying transcript passages needing additional context, inferring relevant details from supplementary materials and inserting them into the transcript, and generating a summary from this enriched transcript. Our multi-source approach enhances model understanding, increasing summary relevance by ~9% and producing more content-rich outputs. We introduce a personalization protocol that extracts participant characteristics and tailors summaries accordingly, improving informativeness by ~10%. This work further provides insights on performance-cost trade-offs across four leading model families, including edge-device capable options. Our approach can be extended to similar complex generative tasks benefitting from additional resources and personalization, such as dialogue systems and action planning.
摘要:会议总结在数字通信中至关重要,但现有解决方案在生成个性化、可操作的总结以及全面理解会议内容方面存在困难,尤其是在识别关键信息方面。以往通过考虑相关补充资源(例如演示文稿)与会议记录相结合来解决这些问题的尝试,由于模型上下文大小的限制以及处理多源任务的额外复杂性(如在附加文件中识别相关信息并与会议内容无缝对接)而受到阻碍。本研究通过三阶段大语言模型方法探索了考虑补充材料的多源会议总结:识别需要额外上下文的会议记录段落,从补充材料中推断相关细节并将其插入会议记录,然后从这一丰富的会议记录中生成总结。我们的多源方法增强了模型理解,使总结的相关性提高了约9%,并产生了内容更丰富的输出。我们引入了一种个性化协议,提取参与者特征并相应地定制总结,使信息量提高了约10%。本研究还提供了关于四个领先模型家族在性能与成本之间权衡的见解,包括适用于边缘设备的选择。我们的方法可以扩展到类似复杂的生成任务,这些任务受益于额外的资源和个性化,例如对话系统和行动规划。

[NLP-22] Do LLMs “know” internally when they follow instructions?

【速读】: 该论文试图解决大语言模型(LLMs)在遵循用户指令时经常失败的问题。解决方案的关键在于识别并利用输入嵌入空间中的一个维度,该维度与指令遵循成功率密切相关。通过调整模型在这一维度的表示,可以显著提高指令遵循的成功率,同时不降低响应质量。研究发现,这一维度与提示的措辞更为相关,而非任务或指令的固有难度,这解释了为什么LLMs有时无法遵循清晰指令以及为什么提示工程在内容基本不变的情况下仍然有效。

链接: https://arxiv.org/abs/2410.14516
作者: Juyeon Heo,Christina Heinze-Deml,Oussama Elachqar,Shirley Ren,Udhay Nallasamy,Andy Miller,Kwan Ho Ryan Chan,Jaya Narain
关键词-EN: large language models, language models, constraints and guidelines, crucial for building, large language
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Instruction-following is crucial for building AI agents with large language models (LLMs), as these models must adhere strictly to user-provided constraints and guidelines. However, LLMs often fail to follow even simple and clear instructions. To improve instruction-following behavior and prevent undesirable outputs, a deeper understanding of how LLMs’ internal states relate to these outcomes is required. Our analysis of LLM internal states reveal a dimension in the input embedding space linked to successful instruction-following. We demonstrate that modifying representations along this dimension improves instruction-following success rates compared to random changes, without compromising response quality. Further investigation reveals that this dimension is more closely related to the phrasing of prompts rather than the inherent difficulty of the task or instructions. This discovery also suggests explanations for why LLMs sometimes fail to follow clear instructions and why prompt engineering is often effective, even when the content remains largely unchanged. This work provides insight into the internal workings of LLMs’ instruction-following, paving the way for reliable LLM agents.
摘要:指令遵循对于构建基于大语言模型 (LLM) 的 AI 智能体至关重要,因为这些模型必须严格遵守用户提供的约束和指南。然而,LLM 往往无法遵循即使是简单而清晰的指令。为了改进指令遵循行为并防止不期望的输出,需要更深入地理解 LLM 内部状态与这些结果之间的关系。我们对 LLM 内部状态的分析揭示了输入嵌入空间中与成功指令遵循相关的一个维度。我们证明,沿着这一维度修改表示可以提高指令遵循的成功率,相比于随机变化,且不会影响响应质量。进一步的研究表明,这一维度与提示的措辞更密切相关,而非任务或指令的固有难度。这一发现也为为何 LLM 有时无法遵循清晰指令以及为何提示工程通常有效(即使内容基本不变)提供了解释。这项工作为 LLM 指令遵循的内部机制提供了洞察,为构建可靠的 LLM 智能体铺平了道路。
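摘要提到"沿某一维度修改表示可以提高指令遵循成功率"。一种常见的估计这类方向的做法是对成功与失败样本的内部状态取均值差(注意:这里的差分均值估计法是本文为说明所作的假设,论文如何具体确定该维度以原文为准):

```python
import math

def mean_vec(vecs):
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(len(vecs[0]))]

def if_direction(success_states, failure_states):
    """用成功/失败样本内部状态的均值差近似"指令遵循维度",并归一化。"""
    mu_s, mu_f = mean_vec(success_states), mean_vec(failure_states)
    d = [a - b for a, b in zip(mu_s, mu_f)]
    norm = math.sqrt(sum(x * x for x in d))
    return [x / norm for x in d]

def steer(state, direction, alpha=1.0):
    """沿该维度平移表示;论文报告这类干预在不损害响应质量的
    前提下提高了指令遵循成功率。"""
    return [s + alpha * d for s, d in zip(state, direction)]
```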

[NLP-23] SignAttention: On the Interpretability of Transformer Models for Sign Language Translation NEURIPS2024

【速读】: 该论文试图解决基于Transformer的手语翻译(SLT)模型的可解释性问题,特别是从视频形式的希腊手语到手语词汇和文本的翻译过程。解决方案的关键在于通过分析模型中的注意力机制,揭示模型如何处理和将视觉输入与手语词汇序列对齐。研究发现,模型关注的是帧的集群而非单个帧,并且随着手语词汇数量的增加,姿态与词汇之间的对角线对齐模式逐渐减弱。此外,模型在解码过程中,初始阶段依赖视频帧,但随着翻译的进行,逐渐转向依赖先前预测的词汇。这些发现有助于理解SLT模型的内部工作机制,为开发更透明和可靠的翻译系统奠定基础。

链接: https://arxiv.org/abs/2410.14506
作者: Pedro Alejandro Dal Bianco,Oscar Agustín Stanchi,Facundo Manuel Quiroga,Franco Ronchetti,Enzo Ferrante
关键词-EN: Greek Sign Language, Transformer-based Sign Language, Sign Language Dataset, video-based Greek Sign, Sign Language Translation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at IAI Workshop @ NeurIPS 2024

点击查看摘要

Abstract:This paper presents the first comprehensive interpretability analysis of a Transformer-based Sign Language Translation (SLT) model, focusing on the translation from video-based Greek Sign Language to glosses and text. Leveraging the Greek Sign Language Dataset, we examine the attention mechanisms within the model to understand how it processes and aligns visual input with sequential glosses. Our analysis reveals that the model pays attention to clusters of frames rather than individual ones, with a diagonal alignment pattern emerging between poses and glosses, which becomes less distinct as the number of glosses increases. We also explore the relative contributions of cross-attention and self-attention at each decoding step, finding that the model initially relies on video frames but shifts its focus to previously predicted tokens as the translation progresses. This work contributes to a deeper understanding of SLT models, paving the way for the development of more transparent and reliable translation systems essential for real-world applications.
摘要: 本文首次对基于 Transformer 的手语翻译 (Sign Language Translation, SLT) 模型进行了全面的可解释性分析,重点关注从基于视频的希腊手语到 glosses 和文本的翻译。利用希腊手语数据集,我们研究了模型中的注意力机制,以理解其如何处理和将视觉输入与顺序 glosses 对齐。我们的分析揭示,模型关注的是帧的集群而非单个帧,并且在姿态与 glosses 之间出现了对角线对齐模式,随着 glosses 数量的增加,这种模式变得不那么明显。我们还探讨了在每个解码步骤中交叉注意力 (cross-attention) 和自注意力 (self-attention) 的相对贡献,发现模型最初依赖视频帧,但随着翻译的进行,其焦点逐渐转向先前预测的 Token。这项工作有助于更深入地理解 SLT 模型,为开发更透明和可靠的翻译系统铺平了道路,这对于实际应用至关重要。

[NLP-24] Combining Entropy and Matrix Nuclear Norm for Enhanced Evaluation of Language Models

【速读】: 该论文试图解决大语言模型(LLMs)评估中传统方法在计算需求和可解释性方面的局限性问题。解决方案的关键在于引入了一种新的混合评估方法,该方法结合了基于协方差矩阵的熵和矩阵核范数(MNN)两种技术。具体步骤包括:首先对LLMs的隐藏状态进行归一化处理,然后计算这些表示的协方差矩阵和MNN,并进一步计算协方差矩阵的熵以捕捉模型输出的不确定性和冗余性。通过将这些指标整合为一个综合评分,该方法在保持评估准确性的同时提高了计算效率,并允许根据不同目标调整熵和MNN之间的权重,从而提供了一个灵活且全面的评估框架。

链接: https://arxiv.org/abs/2410.14480
作者: James Vo
关键词-EN: large language models, continue to advance, Matrix Nuclear Norm, large language, precise and efficient
类目: Computation and Language (cs.CL)
备注: The method is currently under experimentation

点击查看摘要

Abstract:As large language models (LLMs) continue to advance, the need for precise and efficient evaluation metrics becomes more pressing. Traditional approaches, while informative, often face limitations in computational demands and interpretability. In this paper, we introduce a novel hybrid evaluation method that integrates two established techniques: entropy derived from covariance matrices and the Matrix Nuclear Norm (MNN). Our method begins by normalizing hidden states from LLMs, then computes the covariance matrix and MNN from these representations. We further calculate the entropy of the covariance matrix to capture uncertainty and redundancy in the model’s outputs. By combining these metrics into a composite score, we offer a comprehensive evaluation framework that balances accuracy with computational efficiency. Additionally, our approach allows for flexibility in adjusting the weightings between entropy and MNN, tailoring the evaluation for different objectives. Through a series of experiments on various LLMs, we demonstrate the robustness and efficacy of our method, offering deeper insights into model performance. This work contributes to the ongoing development of LLM evaluation and opens avenues for future innovations in model assessment techniques.
摘要:随着大语言模型 (LLM) 的不断进步,对精确且高效的评估指标的需求变得愈发迫切。传统方法虽然提供了丰富的信息,但在计算需求和可解释性方面往往存在局限。本文中,我们提出了一种新颖的混合评估方法,该方法结合了两种成熟的技术:基于协方差矩阵的熵和矩阵核范数 (MNN)。我们的方法首先对大语言模型的隐藏状态进行归一化处理,然后计算这些表示的协方差矩阵和 MNN。进一步地,我们计算协方差矩阵的熵,以捕捉模型输出中的不确定性和冗余性。通过将这些指标结合成一个综合评分,我们提供了一个在准确性和计算效率之间取得平衡的综合评估框架。此外,我们的方法允许在熵和 MNN 之间灵活调整权重,从而根据不同的目标定制评估。通过一系列针对不同大语言模型的实验,我们展示了该方法的鲁棒性和有效性,为模型性能提供了更深入的见解。这项工作有助于大语言模型评估的持续发展,并为未来在模型评估技术方面的创新开辟了道路。
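下面用一个隐藏维度为 2 的玩具例子勾勒这一混合评估的计算步骤:去均值、求协方差矩阵、由特征值谱算熵(刻画不确定性/冗余)与核范数、再按权重 w 合成评分。为避免依赖数值库,这里用 2x2 对称矩阵特征值的闭式解;对半正定的协方差矩阵而言核范数恰等于迹。论文中 MNN 的具体计算对象与归一化细节以原文为准,此代码仅为示意:

```python
import math

def eig2x2(a, b, c):
    """对称 2x2 矩阵 [[a, b], [b, c]] 特征值的闭式解。"""
    m = (a + c) / 2
    r = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    return m + r, m - r

def hybrid_score(states, w=0.5):
    """states: 每行一个 2 维隐藏状态样本。返回 w·熵 + (1-w)·核范数。"""
    n = len(states)
    mu = [sum(s[i] for s in states) / n for i in (0, 1)]
    x = [[s[0] - mu[0], s[1] - mu[1]] for s in states]
    # 协方差矩阵的三个独立元素
    a = sum(row[0] * row[0] for row in x) / n
    c = sum(row[1] * row[1] for row in x) / n
    b = sum(row[0] * row[1] for row in x) / n
    lam = [l for l in eig2x2(a, b, c) if l > 1e-12]
    total = sum(lam)
    if total == 0:
        return 0.0
    entropy = -sum((l / total) * math.log(l / total) for l in lam)
    nuclear_norm = total  # PSD 矩阵的核范数 = 特征值之和 = 迹
    return w * entropy + (1 - w) * nuclear_norm
```

摘要所说"根据不同目标调整熵和 MNN 之间的权重"即对应这里的 w 参数。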

[NLP-25] A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference

【速读】: 该论文试图解决在大语言模型(LLMs)推理过程中,如何通过跨层共享键值(KV)缓存来提高效率的问题。解决方案的关键在于提出一个统一的框架,涵盖了多种跨层KV共享技术及其变体,并通过系统的实验评估这些配置在生成吞吐量、语言建模和下游任务性能方面的表现。研究发现,在减少KV缓存大小的情况下,大多数配置能够保持与标准Transformer相当的性能并提高吞吐量,而进一步减少缓存大小后,将所有层的查询与上层KV配对的方法能更好地维持性能,尽管这会增加训练成本和预填充延迟。

链接: https://arxiv.org/abs/2410.14442
作者: You Wu,Haoyi Wu,Kewei Tu
关键词-EN: large language models, found effective, effective in efficient, sharing key-value, language models
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recently, sharing key-value (KV) cache across layers has been found effective in efficient inference of large language models (LLMs). To systematically investigate different techniques of cross-layer KV sharing, we propose a unified framework that covers several recent methods and their novel variants. We conduct comprehensive experiments on all the configurations of the framework, evaluating their generation throughput and performance in language modeling and downstream tasks. We find that when reducing the size of the KV cache by 2x, most configurations can achieve competitive performance to and higher throughput than standard transformers, but when further reducing the size of the KV cache, pairing queries of all layers with KVs of upper layers can better maintain performance, although it also introduces additional training cost and prefilling latency. We hope that this work will help users choose the appropriate approach according to their requirements and facilitate research on the acceleration of LLM inference.
摘要:最近,跨层共享键值 (KV) 缓存在大语言模型 (LLM) 的高效推理中被发现是有效的。为了系统地研究不同的跨层 KV 共享技术,我们提出了一个统一的框架,涵盖了多种近期方法及其新颖的变体。我们对框架的所有配置进行了全面的实验,评估了它们在语言建模和下游任务中的生成吞吐量和性能。我们发现,当将 KV 缓存的大小减少 2 倍时,大多数配置可以实现与标准 Transformer 相媲美的性能和更高的吞吐量,但当进一步减少 KV 缓存的大小时,将所有层的查询与上层 KV 配对可以更好地保持性能,尽管这也会引入额外的训练成本和预填充延迟。我们希望这项工作能够帮助用户根据其需求选择合适的方法,并促进对 LLM 推理加速的研究。
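跨层 KV 共享的各种配置,本质上都是一张"层号 -> 提供 KV 的层号"的映射表。下面示意统一框架中的两类典型配置:分组共享,以及所有层的 query 都配对某一(较高)层的 KV。组内由哪一层计算 KV 等细节在不同方法中各不相同,此处取组内第一层,仅作示意:

```python
def grouped_kv_map(num_layers, group_size):
    """每 group_size 层共享一份 KV(由组内第一层计算),
    KV 缓存大小约缩小 group_size 倍。"""
    return [l - l % group_size for l in range(num_layers)]

def single_layer_kv_map(num_layers, kv_layer):
    """所有层的 query 都配对第 kv_layer 层的 KV。摘要指出这类配置
    在进一步压缩缓存时更能保持性能,但带来额外训练成本与预填充延迟。"""
    return [kv_layer] * num_layers
```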

[NLP-26] Unlearning Backdoor Attacks for LLMs with Weak-to-Strong Knowledge Distillation

【速读】: 该论文试图解决大语言模型(LLMs)在参数高效微调(PEFT)后仍易受恶意攻击的问题,特别是后门攻击。解决方案的关键在于提出了一种名为W2SDefense的新型弱到强遗忘算法,通过特征对齐知识蒸馏来防御后门攻击。具体来说,首先通过全参数微调训练一个小规模语言模型作为干净的教师模型,然后利用PEFT指导大规模中毒的学生模型遗忘后门,从而增强学生模型遗忘后门特征的能力,防止后门的激活,同时保持模型性能不受影响。

链接: https://arxiv.org/abs/2410.14425
作者: Shuai Zhao,Xiaobao Wu,Cong-Duy Nguyen,Meihuizi Jia,Yichao Feng,Luu Anh Tuan
关键词-EN: Parameter-efficient fine-tuning, bridge the gap, gap between large, PEFT, Parameter-efficient
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Parameter-efficient fine-tuning (PEFT) can bridge the gap between large language models (LLMs) and downstream tasks. However, PEFT has been proven vulnerable to malicious attacks. Research indicates that poisoned LLMs, even after PEFT, retain the capability to activate internalized backdoors when input samples contain predefined triggers. In this paper, we introduce a novel weak-to-strong unlearning algorithm to defend against backdoor attacks based on feature alignment knowledge distillation, named W2SDefense. Specifically, we first train a small-scale language model through full-parameter fine-tuning to serve as the clean teacher model. Then, this teacher model guides the large-scale poisoned student model in unlearning the backdoor, leveraging PEFT. Theoretical analysis suggests that W2SDefense has the potential to enhance the student model’s ability to unlearn backdoor features, preventing the activation of the backdoor. We conduct experiments on text classification tasks involving three state-of-the-art language models and three different backdoor attack algorithms. Our empirical results demonstrate the outstanding performance of W2SDefense in defending against backdoor attacks without compromising model performance.
摘要:参数高效微调 (Parameter-efficient fine-tuning, PEFT) 可以弥合大语言模型 (Large Language Models, LLMs) 与下游任务之间的差距。然而,PEFT 已被证明容易受到恶意攻击。研究表明,即使经过 PEFT,被污染的 LLMs 在输入样本包含预定义触发器时,仍能激活内化的后门。本文中,我们提出了一种基于特征对齐知识蒸馏的新型弱到强遗忘算法,名为 W2SDefense,以防御后门攻击。具体而言,我们首先通过全参数微调训练一个小规模语言模型,作为干净的教师模型。然后,该教师模型利用 PEFT 指导大规模被污染的学生模型进行后门遗忘。理论分析表明,W2SDefense 具有增强学生模型遗忘后门特征能力的潜力,从而防止后门的激活。我们在涉及三种最先进的语言模型和三种不同后门攻击算法的文本分类任务上进行了实验。我们的实证结果显示,W2SDefense 在防御后门攻击方面表现出色,且不损害模型性能。

[NLP-27] Fact Recall Heuristics or Pure Guesswork? Precise Interpretations of Language Models for Fact Completion

【速读】: 该论文试图解决语言模型在处理事实信息时缺乏对不同预测场景的区分问题。解决方案的关键在于提出了一个名为PrISM的模型特定配方,用于构建包含四种不同预测场景的数据集,并通过因果追踪(CT)方法分析这些场景,以实现对语言模型事实完成能力的更细致研究,从而提供对模型处理事实相关查询的更深入理解。

链接: https://arxiv.org/abs/2410.14405
作者: Denitsa Saynova,Lovisa Hagström,Moa Johansson,Richard Johansson,Marco Kuhlmann
关键词-EN: miss important distinctions, Previous interpretations, miss important, important distinctions, Astrid Lindgren
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Previous interpretations of language models (LMs) miss important distinctions in how these models process factual information. For example, given the query “Astrid Lindgren was born in” with the corresponding completion “Sweden”, no difference is made between whether the prediction was based on having the exact knowledge of the birthplace of the Swedish author or assuming that a person with a Swedish-sounding name was born in Sweden. In this paper, we investigate four different prediction scenarios for which the LM can be expected to show distinct behaviors. These scenarios correspond to different levels of model reliability and types of information being processed - some being less desirable for factual predictions. To facilitate precise interpretations of LMs for fact completion, we propose a model-specific recipe called PrISM for constructing datasets with examples of each scenario based on a set of diagnostic criteria. We apply a popular interpretability method, causal tracing (CT), to the four prediction scenarios and find that while CT produces different results for each scenario, aggregations over a set of mixed examples may only represent the results from the scenario with the strongest measured signal. In summary, we contribute tools for a more granular study of fact completion in language models and analyses that provide a more nuanced understanding of how LMs process fact-related queries.
摘要:以往对语言模型 (Language Models, LMs) 的解释忽略了这些模型在处理事实信息时的关键区别。例如,对于查询“Astrid Lindgren 出生于”并得到相应完成“瑞典”,模型无法区分预测是基于对瑞典作家出生地的确切知识,还是假设一个听起来像瑞典人的名字的人出生在瑞典。本文中,我们研究了四种不同的预测场景,这些场景中 LM 可能表现出不同的行为。这些场景对应于不同的模型可靠性和处理的信息类型——其中一些对于事实预测来说不那么理想。为了促进对 LM 事实完成的精确解释,我们提出了一种名为 PrISM 的模型特定方法,用于根据一组诊断标准构建包含每种场景示例的数据集。我们应用了一种流行的解释性方法——因果追踪 (Causal Tracing, CT),对四种预测场景进行分析,发现尽管 CT 在每种场景中产生不同的结果,但在一组混合示例上的聚合可能仅代表测量信号最强的场景的结果。总之,我们为语言模型中事实完成的更细致研究贡献了工具,并提供了更深入理解 LM 如何处理与事实相关查询的分析。

[NLP-28] SylloBio-NLI: Evaluating Large Language Models on Biomedical Syllogistic Reasoning

【速读】: 该论文试图解决在生物医学领域中自然语言推理(NLI)的逻辑推理问题,特别是三段论推理(syllogistic reasoning)的有效性和可靠性。解决方案的关键在于提出了SylloBio-NLI框架,该框架利用外部本体论(ontologies)系统地实例化多样化的三段论论证,以评估大型语言模型(LLMs)在识别有效结论和提取支持证据方面的能力。通过在人类基因组通路中实例化的28种三段论模式上进行广泛实验,研究发现零样本(zero-shot)LLMs在生物医学三段论推理中表现不佳,而少样本提示(few-shot prompting)可以显著提升性能,但模型对词汇表面变化的敏感性仍然是一个挑战,表明现有模型在生物医学NLI应用中尚未达到所需的鲁棒性和一致性。

链接: https://arxiv.org/abs/2410.14399
作者: Magdalena Wysocka,Danilo S. Carvalho,Oskar Wysocki,Marco Valentino,Andre Freitas
关键词-EN: Natural Language Inference, crucial for Natural, Language Inference, Natural Language, Large Language Models
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Syllogistic reasoning is crucial for Natural Language Inference (NLI). This capability is particularly significant in specialized domains such as biomedicine, where it can support automatic evidence interpretation and scientific discovery. This paper presents SylloBio-NLI, a novel framework that leverages external ontologies to systematically instantiate diverse syllogistic arguments for biomedical NLI. We employ SylloBio-NLI to evaluate Large Language Models (LLMs) on identifying valid conclusions and extracting supporting evidence across 28 syllogistic schemes instantiated with human genome pathways. Extensive experiments reveal that biomedical syllogistic reasoning is particularly challenging for zero-shot LLMs, which achieve an average accuracy between 70% on generalized modus ponens and 23% on disjunctive syllogism. At the same time, we found that few-shot prompting can boost the performance of different LLMs, including Gemma (+14%) and LLama-3 (+43%). However, a deeper analysis shows that both techniques exhibit high sensitivity to superficial lexical variations, highlighting a dependency between reliability, models’ architecture, and pre-training regime. Overall, our results indicate that, while in-context examples have the potential to elicit syllogistic reasoning in LLMs, existing models are still far from achieving the robustness and consistency required for safe biomedical NLI applications.
摘要:三段论推理在自然语言推理 (NLI) 中至关重要。这种能力在生物医学等专业领域尤为重要,它可以支持自动证据解释和科学发现。本文提出了 SylloBio-NLI,这是一种利用外部本体来系统地实例化多样化的生物医学 NLI 三段论论据的新框架。我们使用 SylloBio-NLI 来评估大语言模型 (LLM) 在识别有效结论和提取支持证据方面的能力,这些结论和证据基于 28 种三段论方案,这些方案通过人类基因组通路实例化。大量实验表明,生物医学三段论推理对零样本 LLM 来说尤其具有挑战性,它们在广义肯定前件 (generalized modus ponens) 上的平均准确率为 70%,而在析取三段论 (disjunctive syllogism) 上的准确率仅为 23%。同时,我们发现少样本提示可以提升不同 LLM 的性能,包括 Gemma (+14%) 和 LLama-3 (+43%)。然而,深入分析显示,这两种技术对表面词汇变化表现出高度敏感性,突显了可靠性、模型架构和预训练机制之间的依赖关系。总体而言,我们的结果表明,尽管上下文示例有可能激发 LLM 中的三段论推理,但现有模型仍远未达到安全生物医学 NLI 应用所需的稳健性和一致性。

[NLP-29] Generative AI Pragmatics and Authenticity in Second Language Learning

【速读】: 该论文试图解决生成式AI在语言学习和教学中的应用问题,特别是AI在理解和生成语言时缺乏社会意识、存在语言和文化偏见的问题。解决方案的关键在于认识到AI系统的局限性,即它们基于统计概率的数学模型,缺乏真实的生活经验和文化背景,导致在某些语言学习互动中不适用。论文强调了在将AI整合到第二语言习得和跨文化交际能力培养的教学中时,需要考虑这些局限性,并寻求方法来弥补AI在语言和文化真实性方面的不足。

链接: https://arxiv.org/abs/2410.14395
作者: Robert Godwin-Jones
关键词-EN: artificial intelligence, obvious benefits, benefits to integrating, integrating generative, language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:There are obvious benefits to integrating generative AI (artificial intelligence) into language learning and teaching. Those include using AI as a language tutor, creating learning materials, or assessing learner output. However, due to how AI systems understand human language, based on a mathematical model using statistical probability, they lack the lived experience to be able to use language with the same social awareness as humans. Additionally, there are built-in linguistic and cultural biases based on their training data which is mostly in English and predominantly from Western sources. Those facts limit AI suitability for some language learning interactions. Studies have clearly shown that systems such as ChatGPT often do not produce language that is pragmatically appropriate. The lack of linguistic and cultural authenticity has important implications for how AI is integrated into second language acquisition as well as in instruction targeting development of intercultural communication competence.
摘要:将生成式 AI (Generative AI) 融入语言学习和教学中具有显著优势。这些优势包括使用 AI 作为语言导师、创建学习材料或评估学习者输出。然而,由于 AI 系统基于使用统计概率的数学模型来理解人类语言,它们缺乏与人类相同的社交意识,无法像人类一样使用语言。此外,由于其训练数据主要为英语且大多来自西方来源,AI 系统内置了语言和文化偏见。这些事实限制了 AI 在某些语言学习互动中的适用性。研究表明,诸如 ChatGPT 等系统经常无法生成语用上适当的语言。语言和文化真实性的缺乏对 AI 在第二语言习得中的整合以及针对发展跨文化沟通能力的教学中具有重要影响。

[NLP-30] Analyzing Context Utilization of LLMs in Document-Level Translation

【速读】: 该论文试图解决大语言模型(LLM)在文档级翻译中利用上下文信息的能力问题,特别是在代词翻译方面。解决方案的关键在于强调对LLM进行上下文感知的微调,重点在于训练模型如何有效地利用与翻译任务相关的上下文部分,以提高其在文档级翻译中的可靠性和性能。

链接: https://arxiv.org/abs/2410.14391
作者: Wafaa Mohammed,Vlad Niculae
关键词-EN: Large language models, increasingly strong contenders, Large language, language models, increasingly strong
类目: Computation and Language (cs.CL)
备注: 4 pages, 2 figures, 2 tables

点击查看摘要

Abstract:Large language models (LLM) are increasingly strong contenders in machine translation. We study document-level translation, where some words cannot be translated without context from outside the sentence. We investigate the ability of prominent LLMs to utilize context by analyzing models’ robustness to perturbed and randomized document context. We find that LLMs’ improved document-translation performance is not always reflected in pronoun translation performance. We highlight the need for context-aware finetuning of LLMs with a focus on relevant parts of the context to improve their reliability for document-level translation.
摘要:大语言模型 (LLM) 在机器翻译领域正成为越来越强大的竞争者。我们研究了文档级翻译,其中某些词语在没有句子外部上下文的情况下无法翻译。我们通过分析模型对扰动和随机化文档上下文的鲁棒性,探讨了著名 LLM 利用上下文的能力。我们发现,LLM 在文档翻译性能的提升并不总是反映在代词翻译性能上。我们强调了针对上下文相关部分进行上下文感知微调的必要性,以提高 LLM 在文档级翻译中的可靠性。
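该论文通过扰动与随机化文档上下文来检验模型是否真正利用了上下文。下面是一个示意性的 Python 草图(扰动方式与接口均为本文假设,并非论文的实验代码),演示两种常见的上下文扰动:打乱句序与替换为无关内容:

```python
import random

def perturb_context(context_sentences: list[str], mode: str, seed: int = 0) -> list[str]:
    """对文档上下文做扰动,用于检验模型是否真正利用上下文。

    mode="shuffle": 打乱上下文句子顺序;
    mode="random": 用与原文无关的占位句替换全部上下文。
    """
    rng = random.Random(seed)
    if mode == "shuffle":
        perturbed = context_sentences[:]
        rng.shuffle(perturbed)
        return perturbed
    if mode == "random":
        return [f"[unrelated sentence {i}]" for i in range(len(context_sentences))]
    raise ValueError(f"unknown mode: {mode}")

if __name__ == "__main__":
    ctx = ["Mary met John.", "She gave him a book.", "He thanked her."]
    print(perturb_context(ctx, "shuffle"))
    print(perturb_context(ctx, "random"))
```

若模型在扰动后的上下文下翻译质量(尤其是代词翻译)几乎不变,则说明它并未真正利用上下文信息。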

[NLP-31] How Do Multilingual Models Remember? Investigating Multilingual Factual Recall Mechanisms

【速读】: 该论文试图解决的问题是:大型语言模型(LLMs)在多语言和跨语言情境下的事实知识召回机制是否与单一语言模型相同,以及这些机制在多语言环境中的适用性。解决方案的关键在于对两个高度多语言的LLMs进行全面分析,评估先前在单一语言模型中识别出的知识召回组件和机制在多语言环境中的适用程度,并揭示语言独立和语言依赖的召回机制。

链接: https://arxiv.org/abs/2410.14387
作者: Constanza Fierro,Negar Foroutan,Desmond Elliott,Anders Søgaard
关键词-EN: Large Language Models, retrieve vast amounts, Large Language, English monolingual models, store and retrieve
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) store and retrieve vast amounts of factual knowledge acquired during pre-training. Prior research has localized and identified mechanisms behind knowledge recall; however, it has primarily focused on English monolingual models. The question of how these processes generalize to other languages and multilingual LLMs remains unexplored. In this paper, we address this gap by conducting a comprehensive analysis of two highly multilingual LLMs. We assess the extent to which previously identified components and mechanisms of factual recall in English apply to a multilingual context. Then, we examine when language plays a role in the recall process, uncovering evidence of language-independent and language-dependent mechanisms.
摘要:大语言模型 (LLMs) 在预训练期间存储和检索大量的事实知识。先前的研究已经定位并识别了知识召回背后的机制;然而,这些研究主要集中在英语单语模型上。这些过程如何推广到其他语言和多语言 LLMs 的问题仍未得到探索。在本文中,我们通过全面分析两个高度多语言的 LLMs 来解决这一差距。我们评估了先前在英语中识别的事实召回组件和机制在多语言环境中的适用程度。然后,我们研究了语言在召回过程中何时起作用,揭示了语言无关和语言依赖机制的证据。

[NLP-32] Fine-Tuning Pre-trained Language Models for Robust Causal Representation Learning

【速读】: 该论文试图解决预训练语言模型在单一领域微调后难以泛化到其他领域(out-of-domain, OOD)数据的问题。解决方案的关键在于引入因果前门调整(causal front-door adjustment),通过分解假设和利用微调后的表示作为数据增强手段,从而在保持领域特定知识的同时,增强模型对OOD数据的泛化能力。这种方法在合成和真实世界数据集上的实验表明,相比现有方法,其泛化性能更为优越。

链接: https://arxiv.org/abs/2410.14375
作者: Jialin Yu,Yuxiang Zhou,Yulan He,Nevin L. Zhang,Ricardo Silva
关键词-EN: pre-trained language models, pre-trained language, language models, data, Abstract
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The fine-tuning of pre-trained language models (PLMs) has been shown to be effective across various domains. By using domain-specific supervised data, the general-purpose representation derived from PLMs can be transformed into a domain-specific representation. However, these methods often fail to generalize to out-of-domain (OOD) data due to their reliance on non-causal representations, often described as spurious features. Existing methods either make use of adjustments with strong assumptions about lack of hidden common causes, or mitigate the effect of spurious features using multi-domain data. In this work, we investigate how fine-tuned pre-trained language models aid generalizability from single-domain scenarios under mild assumptions, targeting more general and practical real-world scenarios. We show that a robust representation can be derived through a so-called causal front-door adjustment, based on a decomposition assumption, using fine-tuned representations as a source of data augmentation. Comprehensive experiments in both synthetic and real-world settings demonstrate the superior generalizability of the proposed method compared to existing approaches. Our work thus sheds light on the domain generalization problem by introducing links between fine-tuning and causal mechanisms into representation learning.
摘要:预训练语言模型 (Pre-trained Language Models, PLMs) 的微调在各个领域中已被证明是有效的。通过使用特定领域的监督数据,PLMs 衍生出的通用表示可以转化为特定领域的表示。然而,这些方法往往由于依赖于非因果表示(通常称为虚假特征)而无法泛化到域外 (Out-of-Domain, OOD) 数据。现有方法要么利用对隐藏共同原因缺失的强假设进行调整,要么通过多领域数据来减轻虚假特征的影响。在本研究中,我们探讨了在温和假设下,微调的预训练语言模型如何从单一领域场景中提升泛化能力,以应对更普遍和实际的现实世界场景。我们展示了通过所谓的因果前门调整,基于分解假设,利用微调表示作为数据增强的来源,可以推导出鲁棒的表示。在合成和真实世界设置中的综合实验表明,所提出的方法相比现有方法具有更优越的泛化能力。因此,我们的工作通过将微调与因果机制引入表示学习,为领域泛化问题提供了新的视角。

[NLP-33] Efficiently Computing Susceptibility to Context in Language Models

【速读】: 该论文试图解决现代语言模型对输入上下文细微变化的敏感性量化问题。解决方案的关键在于提出了Fisher susceptibility,这是一种基于Fisher信息的高效估计方法,相较于传统的Monte Carlo近似方法,其计算速度提高了70倍,且在多个查询领域的敏感性评估中表现出相当的准确性。

链接: https://arxiv.org/abs/2410.14361
作者: Tianyu Liu,Kevin Du,Mrinmaya Sachan,Ryan Cotterell
关键词-EN: Monte Carlo, Monte Carlo approximation, answering queries, strength of modern, ability to incorporate
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:One strength of modern language models is their ability to incorporate information from a user-input context when answering queries. However, they are not equally sensitive to the subtle changes to that context. To quantify this, Du et al. (2024) gives an information-theoretic metric to measure such sensitivity. Their metric, susceptibility, is defined as the degree to which contexts can influence a model’s response to a query at a distributional level. However, exactly computing susceptibility is difficult and, thus, Du et al. (2024) falls back on a Monte Carlo approximation. Due to the large number of samples required, the Monte Carlo approximation is inefficient in practice. As a faster alternative, we propose Fisher susceptibility, an efficient method to estimate the susceptibility based on Fisher information. Empirically, we validate that Fisher susceptibility is comparable to Monte Carlo estimated susceptibility across a diverse set of query domains despite its being 70× faster. Exploiting the improved efficiency, we apply Fisher susceptibility to analyze factors affecting the susceptibility of language models. We observe that larger models are as susceptible as smaller ones.
摘要:现代语言模型的一个优势在于它们能够在回答查询时整合用户输入上下文中的信息。然而,这些模型对上下文的微妙变化并不具有相同的敏感度。为了量化这一点,Du 等人 (2024) 提出了一种信息论的度量方法来衡量这种敏感度。他们的度量指标——敏感度 (susceptibility),定义为上下文在分布层面上影响模型对查询响应的程度。然而,精确计算敏感度是困难的,因此 Du 等人 (2024) 采用了蒙特卡罗近似法。由于需要大量的样本,蒙特卡罗近似法在实践中效率较低。作为更快的替代方案,我们提出了 Fisher 敏感度 (Fisher susceptibility),这是一种基于 Fisher 信息的高效方法来估计敏感度。通过实证验证,我们发现 Fisher 敏感度在各种查询领域中与蒙特卡罗估计的敏感度相当,尽管其速度提高了 70 倍。利用这种改进的效率,我们将 Fisher 敏感度应用于分析影响语言模型敏感度的因素。我们观察到,较大的模型与较小的模型具有相同的敏感度。

[NLP-34] Critical Questions Generation: Motivation and Challenges CONLL2024

【速读】: 该论文试图解决大语言模型(LLMs)在生成反驳论点时存在的知识过时和内容幻觉问题。解决方案的关键在于提出一个新的任务,即“关键问题生成”(Critical Questions Generation),通过处理论辩文本生成由该文本引发的关键问题(CQs)。这些关键问题旨在揭示论点的盲点,指出其可能缺失的信息,从而在不依赖外部知识的情况下对论点进行质疑。论文通过两种互补的方法创建了一个用于大规模实验的参考数据集:一是实例化Walton的论辩理论中定义的关键问题模板,二是利用LLMs作为关键问题生成器。研究结果表明,尽管LLMs在生成关键问题方面表现尚可,但仍有很大的改进空间。

链接: https://arxiv.org/abs/2410.14335
作者: Blanca Calvo Figueras,Rodrigo Agerri
关键词-EN: Large Language Models, Language Models, brought impressive performances, Large Language, strategies against misinformation
类目: Computation and Language (cs.CL)
备注: 14 pages, 3 figures, 7 tables, to be published in the 28th Conference on Computational Natural Language Learning (CoNLL 2024)

点击查看摘要

Abstract:The development of Large Language Models (LLMs) has brought impressive performances on mitigation strategies against misinformation, such as counterargument generation. However, LLMs are still seriously hindered by outdated knowledge and by their tendency to generate hallucinated content. In order to circumvent these issues, we propose a new task, namely, Critical Questions Generation, consisting of processing an argumentative text to generate the critical questions (CQs) raised by it. In argumentation theory CQs are tools designed to lay bare the blind spots of an argument by pointing at the information it could be missing. Thus, instead of trying to deploy LLMs to produce knowledgeable and relevant counterarguments, we use them to question arguments, without requiring any external knowledge. Research on CQs Generation using LLMs requires a reference dataset for large scale experimentation. Thus, in this work we investigate two complementary methods to create such a resource: (i) instantiating CQs templates as defined by Walton’s argumentation theory and (ii), using LLMs as CQs generators. By doing so, we contribute with a procedure to establish what is a valid CQ and conclude that, while LLMs are reasonable CQ generators, they still have a wide margin for improvement in this task.
摘要:大语言模型 (LLM) 的发展在反驳生成等对抗虚假信息的缓解策略上展现了令人瞩目的表现。然而,LLM 仍然严重受限于过时的知识和其生成幻觉内容的倾向。为了规避这些问题,我们提出了一项新任务,即关键问题生成 (Critical Questions Generation),该任务包括处理论证文本以生成由其引发的关键问题 (CQ)。在论证理论中,CQ 是旨在通过指出论证可能缺失的信息来揭示其盲点的工具。因此,我们不是尝试部署 LLM 来生成知识性和相关的反驳,而是利用它们来质疑论证,而不需要任何外部知识。使用 LLM 进行 CQ 生成的研究需要一个参考数据集来进行大规模实验。因此,在本研究中,我们探讨了两种互补的方法来创建这样的资源:(i) 实例化由 Walton 论证理论定义的 CQ 模板,以及 (ii) 使用 LLM 作为 CQ 生成器。通过这样做,我们提供了一个确定有效 CQ 的程序,并得出结论:尽管 LLM 是合理的 CQ 生成器,但它们在这一任务上仍有很大的改进空间。

[NLP-35] LoGU: Long-form Generation with Uncertainty Expressions

【速读】: 该论文旨在解决大型语言模型(LLMs)在长文本生成中产生的幻觉问题,特别是模型在不确定时未能准确表达不确定性的情况。论文提出了长文本生成中的不确定性(LoGU)任务,并识别了两个关键挑战:不确定性抑制(模型不愿表达不确定性)和不确定性错位(模型错误地传达不确定性)。解决方案的关键在于采用基于细化的数据收集框架和两阶段训练流程。该框架通过分而治之的策略,基于原子声明细化不确定性,并通过监督微调(SFT)和直接偏好优化(DPO)训练数据,以增强不确定性表达。实验结果表明,该方法显著提高了生成内容的准确性,减少了幻觉现象,并保持了响应的全面性。

链接: https://arxiv.org/abs/2410.14309
作者: Ruihan Yang,Caiqi Zhang,Zhisong Zhang,Xinting Huang,Sen Yang,Nigel Collier,Dong Yu,Deqing Yang
关键词-EN: Large Language Models, Large Language, demonstrate impressive capabilities, factually incorrect content, generating factually incorrect
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While Large Language Models (LLMs) demonstrate impressive capabilities, they still struggle with generating factually incorrect content (i.e., hallucinations). A promising approach to mitigate this issue is enabling models to express uncertainty when unsure. Previous research on uncertainty modeling has primarily focused on short-form QA, but real-world applications often require much longer responses. In this work, we introduce the task of Long-form Generation with Uncertainty (LoGU). We identify two key challenges: Uncertainty Suppression, where models hesitate to express uncertainty, and Uncertainty Misalignment, where models convey uncertainty inaccurately. To tackle these challenges, we propose a refinement-based data collection framework and a two-stage training pipeline. Our framework adopts a divide-and-conquer strategy, refining uncertainty based on atomic claims. The collected data are then used in training through supervised fine-tuning (SFT) and direct preference optimization (DPO) to enhance uncertainty expression. Extensive experiments on three long-form instruction following datasets show that our method significantly improves accuracy, reduces hallucinations, and maintains the comprehensiveness of responses.
摘要:尽管大语言模型 (LLM) 展示了令人印象深刻的能力,但它们在生成事实性错误内容(即幻觉)方面仍存在困难。一个有前景的解决方案是使模型在不确定时能够表达不确定性。先前关于不确定性建模的研究主要集中在短形式问答上,但现实应用往往需要更长的回复。在本研究中,我们引入了长形式生成不确定性 (LoGU) 的任务。我们识别了两个关键挑战:不确定性抑制,即模型犹豫表达不确定性;以及不确定性错位,即模型不准确地传达不确定性。为了应对这些挑战,我们提出了一种基于细化的数据收集框架和两阶段训练流程。我们的框架采用分而治之的策略,基于原子声明细化不确定性。收集的数据随后通过监督微调 (SFT) 和直接偏好优化 (DPO) 用于训练,以增强不确定性表达。在三个长形式指令跟随数据集上的广泛实验表明,我们的方法显著提高了准确性,减少了幻觉,并保持了回复的全面性。

[NLP-36] SwaQuAD-24: QA Benchmark Dataset in Swahili

【速读】: 该论文试图解决斯瓦希里语在自然语言处理(NLP)领域中的代表性不足问题。解决方案的关键在于创建一个高质量的斯瓦希里语问答(QA)基准数据集,该数据集借鉴了SQuAD、GLUE、KenSwQuAD和KLUE等已建立的基准,旨在捕捉斯瓦希里语的语言多样性和复杂性。数据集的设计不仅支持机器翻译和信息检索等应用,还特别关注数据隐私、偏见缓解和包容性等伦理问题,并计划未来扩展到特定领域内容、多模态整合和更广泛的众包努力,以促进东非的技术创新和低资源语言的NLP研究。

链接: https://arxiv.org/abs/2410.14289
作者: Alfred Malengo Kondoro
关键词-EN: Swahili Question Answering, Question Answering, Swahili Question, natural language processing, aimed at addressing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper proposes the creation of a Swahili Question Answering (QA) benchmark dataset, aimed at addressing the underrepresentation of Swahili in natural language processing (NLP). Drawing from established benchmarks like SQuAD, GLUE, KenSwQuAD, and KLUE, the dataset will focus on providing high-quality, annotated question-answer pairs that capture the linguistic diversity and complexity of Swahili. The dataset is designed to support a variety of applications, including machine translation, information retrieval, and social services like healthcare chatbots. Ethical considerations, such as data privacy, bias mitigation, and inclusivity, are central to the dataset development. Additionally, the paper outlines future expansion plans to include domain-specific content, multimodal integration, and broader crowdsourcing efforts. The Swahili QA dataset aims to foster technological innovation in East Africa and provide an essential resource for NLP research and applications in low-resource languages.
摘要:本文提出创建一个斯瓦希里语问答 (QA) 基准数据集,旨在解决斯瓦希里语在自然语言处理 (NLP) 中的代表性不足问题。借鉴 SQuAD、GLUE、KenSwQuAD 和 KLUE 等已建立的基准,该数据集将专注于提供高质量、标注的问答对,以捕捉斯瓦希里语的语言多样性和复杂性。该数据集设计用于支持多种应用,包括机器翻译、信息检索以及医疗聊天机器人等社会服务。数据集开发过程中,数据隐私、偏差缓解和包容性等伦理考量是核心。此外,本文还概述了未来的扩展计划,包括特定领域内容、多模态整合和更广泛的众包努力。斯瓦希里语 QA 数据集旨在促进东非的技术创新,并为低资源语言的 NLP 研究和应用提供重要资源。

[NLP-37] EcomEdit: An Automated E-commerce Knowledge Editing Framework for Enhanced Product and Purchase Intention Understanding

【速读】: 该论文试图解决在电子商务领域中,如何在不进行计算密集型微调的情况下,及时更新和修正大型语言模型(LLMs)中的产品信息和客户购买意向的问题。解决方案的关键在于提出了ECOMEDIT框架,该框架利用更强大的LLMs作为判断工具,实现自动知识冲突检测,并通过概念化增强待编辑知识的语义覆盖范围,从而提升LLMs对产品描述和购买意向的理解能力,并在下游电子商务任务中表现出更强的性能。

链接: https://arxiv.org/abs/2410.14276
作者: Ching Ming Samuel Lau,Weiqi Wang,Haochen Shi,Baixuan Xu,Jiaxin Bai,Yangqiu Song
关键词-EN: Large Language Models, Language Models, Large Language, computationally expensive fine-tuning, update factual information
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Knowledge Editing (KE) aims to correct and update factual information in Large Language Models (LLMs) to ensure accuracy and relevance without computationally expensive fine-tuning. Though it has been proven effective in several domains, limited work has focused on its application within the e-commerce sector. However, there are naturally occurring scenarios that make KE necessary in this domain, such as the timely updating of product features and trending purchase intentions by customers, which necessitate further exploration. In this paper, we pioneer the application of KE in the e-commerce domain by presenting ECOMEDIT, an automated e-commerce knowledge editing framework tailored for e-commerce-related knowledge and tasks. Our framework leverages more powerful LLMs as judges to enable automatic knowledge conflict detection and incorporates conceptualization to enhance the semantic coverage of the knowledge to be edited. Through extensive experiments, we demonstrate the effectiveness of ECOMEDIT in improving LLMs’ understanding of product descriptions and purchase intentions. We also show that LLMs, after our editing, can achieve stronger performance on downstream e-commerce tasks.
摘要:知识编辑 (Knowledge Editing, KE) 旨在纠正和更新大语言模型 (Large Language Models, LLMs) 中的事实信息,以确保准确性和相关性,而无需进行计算成本高昂的微调。尽管在多个领域已被证明有效,但关于其在电子商务领域应用的研究仍较为有限。然而,在该领域中存在一些自然发生的场景,使得知识编辑成为必要,例如及时更新产品特性和客户购买意向的流行趋势,这需要进一步探索。本文中,我们开创性地将知识编辑应用于电子商务领域,提出了 ECOMEDIT,这是一个专为电子商务相关知识和任务定制的自动化电子商务知识编辑框架。我们的框架利用更强大的大语言模型作为评判者,以实现自动知识冲突检测,并结合概念化以增强待编辑知识的语义覆盖范围。通过广泛的实验,我们展示了 ECOMEDIT 在提升大语言模型对产品描述和购买意向理解方面的有效性。我们还表明,经过我们的编辑后,大语言模型在下游电子商务任务中能够实现更强的性能。

[NLP-38] REEF: Representation Encoding Fingerprints for Large Language Models

【速读】: 该论文试图解决开源大型语言模型(LLMs)的知识产权保护问题,特别是如何识别一个可疑模型是否是基于受害模型的后续开发。解决方案的关键在于提出了一种无需训练的REEF方法,通过计算和比较可疑模型与受害模型在相同样本上的特征表示的中心核对齐相似度,来确定两者之间的关系。这种方法不仅不损害模型的通用能力,而且对顺序微调、剪枝、模型合并和排列具有鲁棒性,为第三方和模型所有者提供了一种简单有效的知识产权保护手段。

链接: https://arxiv.org/abs/2410.14273
作者: Jie Zhang,Dongrui Liu,Chen Qian,Linfeng Zhang,Yong Liu,Yu Qiao,Jing Shao
关键词-EN: open-source Large Language, Large Language Models, Large Language, training LLMs costs, LLMs costs extensive
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Protecting the intellectual property of open-source Large Language Models (LLMs) is very important, because training LLMs costs extensive computational resources and data. Therefore, model owners and third parties need to identify whether a suspect model is a subsequent development of the victim model. To this end, we propose a training-free REEF to identify the relationship between the suspect and victim models from the perspective of LLMs’ feature representations. Specifically, REEF computes and compares the centered kernel alignment similarity between the representations of a suspect model and a victim model on the same samples. This training-free REEF does not impair the model’s general capabilities and is robust to sequential fine-tuning, pruning, model merging, and permutations. In this way, REEF provides a simple and effective way for third parties and models’ owners to protect LLMs’ intellectual property together. The code is available at this https URL.
摘要:保护开源大语言模型 (LLM) 的知识产权至关重要,因为训练 LLM 需要大量的计算资源和数据。因此,模型所有者和第三方需要确定一个可疑模型是否是受害模型的后续开发。为此,我们提出了一种无需训练的 REEF,从 LLM 的特征表示角度来识别可疑模型与受害模型之间的关系。具体来说,REEF 计算并比较了在相同样本上可疑模型与受害模型的特征表示之间的中心核对齐相似度。这种无需训练的 REEF 不会损害模型的通用能力,并且对顺序微调、剪枝、模型合并和排列具有鲁棒性。通过这种方式,REEF 为第三方和模型所有者提供了一种简单而有效的方式,共同保护 LLM 的知识产权。代码可在以下链接获取:https URL。
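REEF 的核心计算是对两个模型在相同样本上的特征表示求中心核对齐 (CKA) 相似度。线性 CKA 有标准的闭式定义,下面给出一个通用的 numpy 草图(仅为该度量的常见实现方式,并非论文官方代码):

```python
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """计算两组表示 (n_samples, dim) 之间的线性中心核对齐 (CKA) 相似度。"""
    x = x - x.mean(axis=0, keepdims=True)   # 按列中心化
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = np.linalg.norm(y.T @ x, "fro") ** 2
    denominator = np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro")
    return float(numerator / denominator)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(64, 128))  # 模拟某模型在 64 个样本上的隐层表示
    rotation, _ = np.linalg.qr(rng.normal(size=(128, 128)))
    print(round(linear_cka(feats, feats), 4))             # 与自身相似度为 1.0
    print(round(linear_cka(feats, feats @ rotation), 4))  # 对正交变换不变,仍为 1.0
```

线性 CKA 对正交变换和各向同性缩放保持不变,这正是它适合比较不同模型(维度可能被旋转、缩放、排列)特征表示的原因,也与摘要中"对排列具有鲁棒性"的说法相呼应。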

[NLP-39] MoDification: Mixture of Depths Made Easy

【速读】: 该论文试图解决在长上下文应用中,现有大型语言模型(LLMs)的效率问题。解决方案的关键在于提出了一种名为MoDification的方法,该方法通过将MoD中的top-k操作符升级为threshold-p操作符,并对架构和数据进行优化,从而在不经过大量训练的情况下,实现从任何LLMs到MoD模型的转换。实验结果显示,MoDification在长上下文应用中能够显著提升模型效率,实现高达1.2倍的延迟加速和1.8倍的内存减少。

链接: https://arxiv.org/abs/2410.14268
作者: Chen Zhang,Meizhi Zhong,Qimeng Wang,Xuantao Lu,Zheyu Ye,Chengqiang Lu,Yan Gao,Yao Hu,Kehai Chen,Min Zhang,Dawei Song
关键词-EN: serving large language, large language models, trending topic, topic in serving, serving large
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 12 pages, 9 figures, 5 tables, work in progress

点击查看摘要

Abstract:Long-context efficiency has recently become a trending topic in serving large language models (LLMs). And mixture of depths (MoD) is proposed as a perfect fit to bring down both latency and memory. In this paper, however, we discover that MoD can barely transform existing LLMs without costly training over an extensive number of tokens. To enable the transformations from any LLMs to MoD ones, we showcase top-k operator in MoD should be promoted to threshold-p operator, and refinement to architecture and data should also be crafted along. All these designs form our method termed MoDification. Through a comprehensive set of experiments covering model scales from 3B to 70B, we exhibit MoDification strikes an excellent balance between efficiency and effectiveness. MoDification can achieve up to ~1.2x speedup in latency and ~1.8x reduction in memory compared to original LLMs especially in long-context applications.
摘要:长上下文效率最近已成为服务大语言模型 (LLM) 的热门话题。混合深度 (Mixture of Depths, MoD) 被提出作为一种能够降低延迟和内存的理想方案。然而,本文发现,MoD 几乎无法在不经过大量 Token 的昂贵训练的情况下转换现有 LLM。为了实现从任何 LLM 到 MoD 的转换,我们展示了在 MoD 中,top-k 操作符应提升为 threshold-p 操作符,并且架构和数据的改进也应随之进行。这些设计共同构成了我们称之为 MoDification 的方法。通过一系列涵盖从 3B 到 70B 模型规模的全面实验,我们展示了 MoDification 在效率和效果之间达到了极佳的平衡。与原始 LLM 相比,特别是在长上下文应用中,MoDification 可以实现高达约 1.2 倍的延迟加速和 1.8 倍的内存减少。
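摘要中"将 top-k 算子升级为 threshold-p 算子"的差别,可以用下面这个示意性的 numpy 草图直观对比(路由分数与阈值均为虚构示例,并非论文实现):top-k 每层固定保留 k 个 token,容量静态;threshold-p 让被选 token 数随输入动态变化。

```python
import numpy as np

def select_topk(scores: np.ndarray, k: int) -> np.ndarray:
    """top-k 算子:每层固定选出路由分数最高的 k 个 token(容量静态)。"""
    return np.sort(np.argsort(scores)[-k:])

def select_threshold_p(scores: np.ndarray, p: float) -> np.ndarray:
    """threshold-p 算子:选出路由分数超过阈值 p 的 token(容量随输入动态变化)。"""
    return np.flatnonzero(scores > p)

if __name__ == "__main__":
    scores = np.array([0.9, 0.1, 0.8, 0.3, 0.7])
    print(select_topk(scores, k=3))            # [0 2 4]
    print(select_threshold_p(scores, p=0.5))   # [0 2 4]:此阈值恰好选出同一集合
    print(select_threshold_p(scores, p=0.75))  # [0 2]:阈值提高后容量自动缩小
```

两者的关键差别在于:threshold-p 的选择只依赖每个 token 自身的分数,不需要在整个序列上排序比较,因此更适合逐 token 生成的自回归推理场景。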

[NLP-40] Good Parenting is all you need – Multi-agent ic LLM Hallucination Mitigation

【速读】: 该论文试图解决大型语言模型(LLM)在生成内容时产生幻觉的问题,即模型虚构不存在的事实。解决方案的关键在于利用高级AI模型(如Llama3-70b和GPT-4变体)作为审查代理,通过多次测试和反馈机制,这些模型能够高精度地识别并修正幻觉内容,成功率在85%到100%之间,从而显著提升生成内容的准确性和可靠性。

链接: https://arxiv.org/abs/2410.14262
作者: Edward (Ted) Kwartler,Matthew Berman,Alan Aqrawi
关键词-EN: Large Language Model, Large Language, ability of Large, Language Model, study explores
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This study explores the ability of Large Language Model (LLM) agents to detect and correct hallucinations in AI-generated content. A primary agent was tasked with creating a blog about a fictional Danish artist named Flipfloppidy, which was then reviewed by another agent for factual inaccuracies. Most LLMs hallucinated the existence of this artist. Across 4,900 test runs involving various combinations of primary and reviewing agents, advanced AI models such as Llama3-70b and GPT-4 variants demonstrated near-perfect accuracy in identifying hallucinations and successfully revised outputs in 85% to 100% of cases following feedback. These findings underscore the potential of advanced AI models to significantly enhance the accuracy and reliability of generated content, providing a promising approach to improving AI workflow orchestration.
摘要:本研究探讨了大语言模型 (LLM) 智能体在检测和纠正 AI 生成内容中的幻觉现象的能力。一个主要智能体被指派创建一篇关于一个虚构的丹麦艺术家 Flipfloppidy 的博客,随后由另一个智能体进行事实核查。大多数 LLM 都幻觉了这位艺术家的存在。在涉及各种主要和审查智能体组合的 4,900 次测试运行中,如 Llama3-70b 和 GPT-4 变体等先进 AI 模型在识别幻觉方面表现出接近完美的准确性,并在反馈后成功修订输出内容的案例占比达到 85% 至 100%。这些发现强调了先进 AI 模型在显著提高生成内容准确性和可靠性方面的潜力,为改进 AI 工作流编排提供了一种有前景的方法。
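摘要描述的"主智能体生成、审查智能体挑错、再修订"流程,可以抽象为一个简单的迭代循环。下面是一个示意性的 Python 草图(generate/review/revise 的接口与演示用桩函数均为本文假设,实际应替换为各自 LLM 的调用):

```python
def review_loop(generate, review, revise, prompt: str, max_rounds: int = 3) -> str:
    """主智能体生成 -> 审查智能体挑错 -> 主智能体修订 的迭代循环。

    generate/review/revise 为调用各自 LLM 的函数(接口为示意性假设):
    review 返回发现的问题列表,返回空列表表示未再发现幻觉。
    """
    draft = generate(prompt)
    for _ in range(max_rounds):
        issues = review(draft)
        if not issues:
            break
        draft = revise(draft, issues)
    return draft

if __name__ == "__main__":
    # 用确定性桩函数模拟:第一轮审查发现一处虚构事实,修订后通过
    def fake_generate(p):
        return "Flipfloppidy is a famous Danish artist."
    def fake_review(d):
        return ["no such artist"] if "famous" in d else []
    def fake_revise(d, issues):
        return "No record of a Danish artist named Flipfloppidy was found."
    print(review_loop(fake_generate, fake_review, fake_revise, "write a blog"))
```

max_rounds 限制了反馈轮数,避免审查与修订两个智能体在意见不一致时无限循环。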

[NLP-41] Beyond Binary: Towards Fine-Grained LLM-Generated Text Detection via Role Recognition and Involvement Measurement

【速读】: 该论文试图解决当前大语言模型(LLM)生成内容在社交媒体上广泛传播所引发的虚假信息、数据偏见和隐私侵犯等问题,特别是现有检测方法仅限于二元分类,无法应对复杂的人机协作场景。论文提出的解决方案之关键是引入两个新任务:LLM角色识别(LLM-RR)和LLM影响度量(LLM-IM),分别用于识别LLM在内容生成中的具体角色和量化LLM在内容创作中的参与程度。为此,论文设计了LLMDetect基准,包括混合新闻检测语料库(HNDC)和DetectEval评估套件,以全面评估检测器在不同上下文和多强度变化下的泛化能力和鲁棒性。实验结果表明,微调的预训练语言模型(PLM)在两个任务上表现优异,而高级LLM在自我生成内容的检测上仍面临挑战。

链接: https://arxiv.org/abs/2410.14259
作者: Zihao Cheng,Li Zhou,Feng Jiang,Benyou Wang,Haizhou Li
关键词-EN: social media platforms, LLM-generated content, data biases, large language models, media platforms
类目: Computation and Language (cs.CL)
备注: Social Media, Large Language Models, LLM-generated Text Detection, AI-assisted News Detection

点击查看摘要

Abstract:The rapid development of large language models (LLMs), like ChatGPT, has resulted in the widespread presence of LLM-generated content on social media platforms, raising concerns about misinformation, data biases, and privacy violations, which can undermine trust in online discourse. While detecting LLM-generated content is crucial for mitigating these risks, current methods often focus on binary classification, failing to address the complexities of real-world scenarios like human-AI collaboration. To move beyond binary classification and address these challenges, we propose a new paradigm for detecting LLM-generated content. This approach introduces two novel tasks: LLM Role Recognition (LLM-RR), a multi-class classification task that identifies specific roles of LLM in content generation, and LLM Influence Measurement (LLM-IM), a regression task that quantifies the extent of LLM involvement in content creation. To support these tasks, we propose LLMDetect, a benchmark designed to evaluate detectors’ performance on these new tasks. LLMDetect includes the Hybrid News Detection Corpus (HNDC) for training detectors, as well as DetectEval, a comprehensive evaluation suite that considers five distinct cross-context variations and multi-intensity variations within the same LLM role. This allows for a thorough assessment of detectors’ generalization and robustness across diverse contexts. Our empirical validation of 10 baseline detection methods demonstrates that fine-tuned PLM-based models consistently outperform others on both tasks, while advanced LLMs face challenges in accurately detecting their own generated content. Our experimental results and analysis offer insights for developing more effective detection models for LLM-generated content. This research enhances the understanding of LLM-generated content and establishes a foundation for more nuanced detection methodologies.
摘要:大语言模型 (LLM) 如 ChatGPT 的快速发展,导致了 LLM 生成内容在社交媒体平台上的广泛存在,引发了关于错误信息、数据偏见和隐私侵犯的担忧,这些都可能削弱在线对话的信任度。尽管检测 LLM 生成内容对于缓解这些风险至关重要,但当前的方法往往侧重于二元分类,未能解决现实世界中如人机协作等复杂场景的问题。为了超越二元分类并应对这些挑战,我们提出了一种新的检测 LLM 生成内容的范式。该方法引入了两个新颖的任务:LLM 角色识别 (LLM-RR),这是一个多类别分类任务,用于识别 LLM 在内容生成中的具体角色;以及 LLM 影响度量 (LLM-IM),这是一个回归任务,用于量化 LLM 在内容创作中的参与程度。为了支持这些任务,我们提出了 LLMDetect,这是一个用于评估在这些新任务上检测器性能的基准。LLMDetect 包括用于训练检测器的混合新闻检测语料库 (HNDC),以及 DetectEval,这是一个综合评估套件,考虑了五种不同的跨上下文变异和同一 LLM 角色内的多强度变异。这使得可以全面评估检测器在不同上下文中的泛化能力和鲁棒性。我们对 10 种基线检测方法的实证验证表明,微调的基于 PLM 的模型在两个任务上始终优于其他模型,而先进的 LLM 在准确检测其自身生成的内容方面面临挑战。我们的实验结果和分析为开发更有效的 LLM 生成内容检测模型提供了见解。这项研究增强了对 LLM 生成内容的理解,并为更细致的检测方法奠定了基础。
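作为补充,下面用一段极简的 Python 示意代码说明上文两个新任务的形式化思路:LLM-RR 对生成角色做多类别分类,LLM-IM 对参与程度做回归。其中 ROLES 中的角色名称均为假设示例,并非 LLMDetect 的真实标签体系:

```python
import math

# 假设的角色集合,仅作示意,并非 LLMDetect 的真实标签
ROLES = ["human-only", "llm-polished", "llm-continued", "llm-only"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def recognize_role(role_logits):
    """LLM-RR:多类别分类,识别 LLM 在内容生成中的角色。"""
    probs = softmax(role_logits)
    return ROLES[probs.index(max(probs))], probs

def measure_influence(raw_score):
    """LLM-IM:回归任务,把原始输出裁剪到 [0, 1] 的参与程度。"""
    return min(1.0, max(0.0, raw_score))

role, probs = recognize_role([0.2, 2.1, -0.5, 0.4])
assert role == "llm-polished"
assert measure_influence(1.7) == 1.0
```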

[NLP-42] Nova: An Iterative Planning and Search Approach to Enhance Novelty and Diversity of LLM Generated Ideas

【速读】: 该论文试图解决现有大型语言模型(LLMs)在生成创新性研究想法时,由于缺乏对外部知识的有效获取而导致的建议简单和重复的问题。解决方案的关键在于引入一种增强的规划和搜索方法,通过迭代过程有目的地规划外部知识的检索,逐步丰富想法生成的广度和深度,从而显著提升生成想法的新颖性和多样性。

链接: https://arxiv.org/abs/2410.14255
作者: Xiang Hu,Hongyu Fu,Jinge Wang,Yifeng Wang,Zhikun Li,Renjun Xu,Yu Lu,Yaochu Jin,Lili Pan,Zhenzhong Lan
关键词-EN: large language models, harnessing large language, Scientific innovation, generate research ideas, pivotal for humanity
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Scientific innovation is pivotal for humanity, and harnessing large language models (LLMs) to generate research ideas could transform discovery. However, existing LLMs often produce simplistic and repetitive suggestions due to their limited ability in acquiring external knowledge for innovation. To address this problem, we introduce an enhanced planning and search methodology designed to boost the creative potential of LLM-based systems. Our approach involves an iterative process to purposely plan the retrieval of external knowledge, progressively enriching the idea generation with broader and deeper insights. Validation through automated and human assessments indicates that our framework substantially elevates the quality of generated ideas, particularly in novelty and diversity. The number of unique novel ideas produced by our framework is 3.4 times higher than without it. Moreover, our method outperforms the current state-of-the-art, generating at least 2.5 times more top-rated ideas based on 170 seed papers in a Swiss Tournament evaluation.
摘要:科学创新对人类至关重要,利用大语言模型 (LLM) 生成研究思路有可能彻底改变发现过程。然而,现有的 LLM 由于获取外部知识以进行创新的能力有限,往往产生简单且重复的建议。为了解决这一问题,我们引入了一种增强的规划和搜索方法,旨在提升基于 LLM 系统的创造潜力。我们的方法涉及一个迭代过程,有目的地规划外部知识的检索,逐步丰富创意生成,提供更广泛和深入的见解。通过自动化和人工评估的验证表明,我们的框架显著提升了生成创意的质量,特别是在新颖性和多样性方面。与没有使用该框架相比,我们的框架产生的独特新颖创意数量是其 3.4 倍。此外,我们的方法在瑞士锦标赛评估中,基于 170 篇种子论文,生成的顶级创意数量至少是当前最先进方法的 2.5 倍。
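上述"迭代规划—检索—生成"的循环可以用如下示意代码勾勒。这只是若干假设下的极简草图:retrieve 用词重叠近似真实检索,generate_ideas 代替真实的 LLM 调用,并非论文的原实现:

```python
def retrieve(query, knowledge_base, k=2):
    """玩具检索:按与查询的词重叠度对文档排序。"""
    def overlap(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(knowledge_base, key=overlap, reverse=True)[:k]

def generate_ideas(seed, context):
    """代替真实 LLM 调用:每条检索片段产出一个想法。"""
    return [f"{seed} + {c}" for c in context]

def nova_loop(seed, knowledge_base, rounds=2):
    ideas, seen = [], set()
    query = seed
    for _ in range(rounds):
        context = retrieve(query, knowledge_base)   # 规划并检索外部知识
        for idea in generate_ideas(seed, context):  # 生成候选想法
            if idea not in seen:                    # 只保留新颖想法
                seen.add(idea)
                ideas.append(idea)
        query = ideas[-1] if ideas else seed        # 下一轮基于最新想法加深
    return ideas

kb = ["graph pruning for retrieval", "contrastive pretraining",
      "agent based simulation"]
out = nova_loop("efficient retrieval", kb)
assert len(out) == len(set(out))   # 已去重
assert out                         # 至少产出一个想法
```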

[NLP-43] Synthesizing Post-Training Data for LLMs through Multi-Agent Simulation

【速读】: 该论文试图解决大型语言模型(LLMs)在遵循人类指令方面的训练后调优问题。解决方案的关键在于利用多智能体模拟(MATRIX)自动生成多样化的基于文本的场景,捕捉广泛的真实世界人类需求,并通过场景驱动的指令生成器(MATRIX-Gen)合成可控且高度真实的数据。实验结果表明,该框架能有效生成通用和领域特定的数据,显著提升模型性能。

链接: https://arxiv.org/abs/2410.14251
作者: Shuo Tang,Xianghe Pang,Zexi Liu,Bohan Tang,Rui Ye,Xiaowen Dong,Yanfeng Wang,Siheng Chen
关键词-EN: enabling large language, Post-training is essential, large language models, essential for enabling, enabling large
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Post-training is essential for enabling large language models (LLMs) to follow human instructions. Inspired by the recent success of using LLMs to simulate human society, we leverage multi-agent simulation to automatically generate diverse text-based scenarios, capturing a wide range of real-world human needs. We propose MATRIX, a multi-agent simulator that creates realistic and scalable scenarios. Leveraging these outputs, we introduce a novel scenario-driven instruction generator MATRIX-Gen for controllable and highly realistic data synthesis. Extensive experiments demonstrate that our framework effectively generates both general and domain-specific data. Notably, on AlpacaEval 2 and Arena-Hard benchmarks, Llama-3-8B-Base, post-trained on datasets synthesized by MATRIX-Gen with just 20K instruction-response pairs, outperforms Meta’s Llama-3-8B-Instruct model, which was trained on over 10M pairs; see our project at this https URL.
摘要: 后训练对于使大语言模型 (LLM) 遵循人类指令至关重要。受到近期利用 LLM 模拟人类社会成功的启发,我们利用多智能体模拟来自动生成多样化的基于文本的场景,捕捉广泛的真实世界人类需求。我们提出了 MATRIX,一个创建真实且可扩展场景的多智能体模拟器。利用这些输出,我们引入了一种新颖的场景驱动指令生成器 MATRIX-Gen,用于可控且高度真实的数据合成。广泛的实验表明,我们的框架有效地生成了通用和领域特定的数据。值得注意的是,在 AlpacaEval 2 和 Arena-Hard 基准测试中,经过 MATRIX-Gen 合成的仅 20K 指令-响应对数据集后训练的 Llama-3-8B-Base 模型,表现优于 Meta 的 Llama-3-8B-Instruct 模型,后者训练数据超过 10M 对;详见我们的项目页面:[https URL]。
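"多智能体模拟产出场景、再由场景驱动指令合成"的流程,可用下面的示意代码勾勒。simulate_scenario 与 agents 中的角色、话题均为假设示例,仅说明数据合成的整体形态,并非 MATRIX/MATRIX-Gen 的真实实现:

```python
import random

def simulate_scenario(agents, rng):
    """玩具版多智能体一步交互:两个智能体相遇并产生一个需求场景。"""
    a, b = rng.sample(agents, 2)
    return f"{a['role']} asks {b['role']} about {b['topic']}"

def scenario_to_instruction(scenario):
    """场景驱动的指令生成:把模拟场景转成指令-响应对。"""
    return {"instruction": f"Help with: {scenario}",
            "response": f"[draft answer grounded in: {scenario}]"}

agents = [{"role": "patient", "topic": "dosage"},
          {"role": "pharmacist", "topic": "interactions"},
          {"role": "nurse", "topic": "scheduling"}]
rng = random.Random(0)
pairs = [scenario_to_instruction(simulate_scenario(agents, rng))
         for _ in range(3)]
assert len(pairs) == 3
assert all(p["instruction"].startswith("Help with:") for p in pairs)
```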

[NLP-44] Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models

【速读】: 该论文试图解决视频语言模型(VLM)在多选题问答(MCQA)评估中存在的选择偏差问题。解决方案的关键在于通过引入一种名为BOLD的后处理校准技术,来平衡模型在选择答案时对某些位置的偏好,从而减少选择偏差。该方法通过分解MCQA任务并应用公平性偏差指标,有效地抑制了模型对训练中观察到的任意模式或表面线索的依赖,提升了模型对视频内容和相关问题的真正理解能力,进而提高了模型的整体性能,包括准确率和F1均值分数。

链接: https://arxiv.org/abs/2410.14248
作者: Olga Loginova,Oleksandr Bezrukov,Alexey Kravets
关键词-EN: Evaluating Video Language, Video Language Models, Evaluating Video, Video Language, Language Models
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Evaluating Video Language Models (VLMs) is a challenging task. Due to its transparency, Multiple-Choice Question Answering (MCQA) is widely used to measure the performance of these models through accuracy. However, existing MCQA benchmarks fail to capture the full reasoning capabilities of VLMs due to selection bias, when models disproportionately favor certain answer options based on positional patterns observed during training. In this work, we conduct a comprehensive empirical analysis of several VLM architectures across major datasets designed to assess complex video-focused reasoning. We identify where the bias is most pronounced and demonstrate to what extent model responses reflect genuine understanding of video content and related questions, as opposed to reliance on arbitrary patterns or superficial cues, such as answer position. By decomposing the MCQA task and adapting fairness bias metrics to VLMs, we introduce a post-processing calibration technique BOLD to balance this bias. Our results show that reducing selection bias improves not only debiasing metrics but also overall model performance, including Accuracy and F1 Mean score. Our method, by suppressing “blind guessing”, offers a more cost- and time-effective approach to mitigating selection bias compared to existing techniques. This study represents the first focused investigation of selection bias in video-to-text LLM-powered models.
摘要:评估视频语言模型 (VLM) 是一项具有挑战性的任务。由于其透明性,多选题问答 (MCQA) 广泛用于通过准确性来衡量这些模型的性能。然而,现有的 MCQA 基准未能全面捕捉 VLM 的推理能力,因为模型在训练过程中观察到的位置模式会导致对某些答案选项的偏好,从而产生选择偏差。在本研究中,我们对几种 VLM 架构进行了全面的实证分析,这些架构针对评估复杂视频推理的主要数据集进行了设计。我们识别了偏差最为明显的部分,并展示了模型响应在多大程度上反映了视频内容及相关问题的真实理解,而非依赖于任意模式或表面线索,如答案位置。通过分解 MCQA 任务并调整公平性偏差指标以适应 VLM,我们引入了一种后处理校准技术 BOLD 来平衡这种偏差。我们的结果表明,减少选择偏差不仅提高了去偏差指标,还提升了整体模型性能,包括准确率和 F1 均值分数。我们的方法通过抑制“盲目猜测”,相比现有技术提供了一种更具成本效益和时间效益的减少选择偏差的方法。本研究首次针对视频到文本大语言模型驱动的模型中的选择偏差进行了集中研究。
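按位置先验进行后处理校准的基本思路,可用如下极简代码说明:先在验证集上统计模型选各个答案位置的频率,再把该先验从选项概率中除掉并重新归一化。这只是位置先验校准的一个草图,并非论文中 BOLD 的原始算法:

```python
from collections import Counter

def positional_prior(predicted_positions, n_options):
    """统计模型在整个验证集上选择各答案位置的频率。"""
    counts = Counter(predicted_positions)
    total = len(predicted_positions)
    return [counts.get(i, 0) / total for i in range(n_options)]

def calibrate(option_probs, prior, eps=1e-9):
    """除掉位置先验后重新归一化,削弱"盲目猜测"的位置偏好。"""
    adjusted = [p / (q + eps) for p, q in zip(option_probs, prior)]
    s = sum(adjusted)
    return [a / s for a in adjusted]

# 模型在验证集上过度偏好第 0 个选项:
prior = positional_prior([0, 0, 0, 1, 0, 2, 0, 3], 4)
raw = [0.4, 0.3, 0.2, 0.1]
cal = calibrate(raw, prior)
assert abs(sum(cal) - 1.0) < 1e-9
assert cal[0] < raw[0]   # 被过度选择的位置被降权
```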

[NLP-45] A Novel Method to Mitigate Demographic and Expert Bias in ICD Coding with Causal Inference

【速读】: 该论文试图解决ICD编码中的标签不平衡、人口统计学因素和专家偏见问题。解决方案的关键在于提出了一种基于因果推断(Causal Inference)的新方法DECI(Demographic and Expert biases in ICD coding through Causal Inference),通过因果关系解释模型预测的三种不同路径,并基于反事实推理来减轻人口统计学和专家偏见,从而提高ICD编码的准确性和无偏性。

链接: https://arxiv.org/abs/2410.14236
作者: Bin Zhang,Junli Wang
关键词-EN: involves assigning ICD, International Classification, coding involves assigning, ICD coding, assigning ICD codes
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:ICD (International Classification of Diseases) coding involves assigning ICD codes to a patient's visit based on their medical notes. Considering ICD coding as a multi-label text classification task, researchers have developed sophisticated methods. Despite this progress, these models often suffer from label imbalance and may develop spurious correlations with demographic factors. Additionally, while human coders assign ICD codes, the inclusion of irrelevant information from unrelated experts introduces biases. To combat these issues, we propose a novel method to mitigate Demographic and Expert biases in ICD coding through Causal Inference (DECI). We provide a novel causality-based interpretation of ICD coding in which models make predictions through three distinct pathways. Based on counterfactual reasoning, DECI mitigates demographic and expert biases. Experimental results show that DECI outperforms state-of-the-art models, offering a significant advancement in accurate and unbiased ICD coding.
摘要:ICD (International Classification of Diseases) 编码涉及根据患者的医疗记录为其就诊分配 ICD 代码。将 ICD 编码视为多标签文本分类任务,研究人员已开发出复杂的方法。尽管取得了进展,这些模型往往受到标签不平衡的影响,并可能与人口统计因素产生虚假关联。此外,虽然人类编码员分配 ICD 代码,但无关专家提供的不相关信息引入了偏见。为了应对这些问题,我们提出了一种通过因果推断 (Causal Inference) 来减轻 ICD 编码中人口统计和专家偏见的新方法 (DECI)。我们提供了一种基于因果关系的 ICD 编码解释,模型通过三条不同的路径进行预测。基于反事实推理,DECI 减轻了人口统计和专家偏见。实验结果表明,DECI 优于最先进的模型,在准确且无偏的 ICD 编码方面取得了显著进展。
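基于反事实推理的去偏思想,可以用"从联合预测中扣除仅由偏置路径单独产生的预测成分"这一极简形式示意。下面代码中的 lam、mu 权重与示例数值均为假设,并非 DECI 的原始实现:

```python
def debias_logits(full_logits, demo_only, expert_only, lam=0.5, mu=0.5):
    """反事实相减:从完整预测中扣除仅人口统计信息 / 仅无关专家文本
    单独产生的预测成分(lam、mu 为假设权重)。"""
    return [f - lam * d - mu * e
            for f, d, e in zip(full_logits, demo_only, expert_only)]

full   = [2.0, 1.0, 0.5]  # 病历 + 人口统计 + 专家文本的联合预测
demo   = [1.5, 0.0, 0.0]  # 仅人口统计信息时,模型偏向编码 0
expert = [0.0, 0.8, 0.0]  # 仅无关专家文本时,模型偏向编码 1
out = debias_logits(full, demo, expert)
assert out.index(max(out)) == 0   # 去偏后仍由病历证据主导
assert out[1] < full[1]           # 专家文本带来的偏置被削弱
```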

[NLP-46] Towards Robust Knowledge Representations in Multilingual LLMs for Equivalence and Inheritance based Consistent Reasoning

【速读】: 该论文试图解决当前大型语言模型(LLMs)在跨语言复杂推理任务中表现出的不一致性和违反继承关系的问题。解决方案的关键在于引入“组合表示”(Compositional Representations),通过将不同语言中的词汇表示为等价词汇的组合,从而减少跨语言推理时的冲突,提升模型在多语言环境下的推理一致性。实验结果表明,这种表示方法能够显著降低冲突率(最高达4.7%的减少),显示出共享LLM表示的潜在优势。

链接: https://arxiv.org/abs/2410.14235
作者: Gaurav Arora,Srujana Merugu,Shreya Jain,Vaibhav Saxena
关键词-EN: linguistic skills form, Large Language Models, human intelligence, facilitating problem-solving, problem-solving and decision-making
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reasoning and linguistic skills form the cornerstone of human intelligence, facilitating problem-solving and decision-making. Recent advances in Large Language Models (LLMs) have led to impressive linguistic capabilities and emergent reasoning behaviors, fueling widespread adoption across application domains. However, LLMs still struggle with complex reasoning tasks, highlighting their systemic limitations. In this work, we focus on evaluating whether LLMs have the requisite representations to reason using two foundational relationships: “equivalence” and “inheritance”. We introduce novel tasks and benchmarks spanning six languages and observe that current SOTA LLMs often produce conflicting answers to the same questions across languages in 17.3-57.5% of cases and violate inheritance constraints in up to 37.2% cases. To enhance consistency across languages, we propose novel “Compositional Representations” where tokens are represented as composition of equivalent tokens across languages, with resulting conflict reduction (up to -4.7%) indicating benefits of shared LLM representations.
摘要:推理和语言技能构成了人类智能的基石,促进了问题解决和决策制定。大语言模型 (LLM) 的最新进展带来了令人印象深刻的语言能力和涌现的推理行为,推动了其在各个应用领域的广泛采用。然而,LLM 在复杂推理任务上仍显吃力,突显了其系统性局限。在本研究中,我们专注于评估 LLM 是否具备使用两种基础关系——“等价性”和“继承性”进行推理的必要表征。我们引入了涵盖六种语言的新任务和基准,并观察到当前最先进的 LLM 在 17.3-57.5% 的情况下对同一问题在不同语言中给出矛盾答案,并且在高达 37.2% 的情况下违反了继承约束。为了增强跨语言的一致性,我们提出了新的“组合表征”方法,其中 Token 被表示为跨语言等价 Token 的组合,结果冲突减少 (高达 -4.7%) 表明共享 LLM 表征的益处。
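"组合表征"的核心操作——把一个 Token 表示为跨语言等价 Token 表示的组合——可以用取均值这一最简形式示意。下面的二维嵌入与词表均为假设示例,真实方法细节以论文为准:

```python
def compose_embedding(token, equivalents, embed):
    """把一个 Token 表示为其自身与跨语言等价 Token 嵌入的均值。"""
    group = [token] + equivalents.get(token, [])
    vecs = [embed[t] for t in group]
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# 假设的二维嵌入:英语 / 西班牙语 / 法语中"狗"的表示
embed = {"dog": [1.0, 0.0], "perro": [0.8, 0.2], "chien": [0.6, 0.4]}
equivalents = {"dog": ["perro", "chien"]}
shared = compose_embedding("dog", equivalents, embed)
# 三种语言共享同一个组合表示,跨语言回答更可能一致
assert abs(shared[0] - 0.8) < 1e-9 and abs(shared[1] - 0.2) < 1e-9
```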

[NLP-47] Unveiling Large Language Models Generated Texts: A Multi-Level Fine-Grained Detection Framework

【速读】: 该论文试图解决现有方法在学术环境中难以有效识别由大型语言模型(LLMs)生成的文本的问题。解决方案的关键在于提出了一个多层次细粒度检测(MFD)框架,该框架通过整合低级结构特征、高级语义特征和深度语言特征,并进行句子级别的词汇、语法和句法评估,以实现全面的文本分析。此外,通过对比学习训练文本编码器,提取高级语义特征,增强对细微差异的检测能力,并利用先进的LLM分析整个文本以提取深度语言特征,从而提高模型对复杂模式和上下文信息的捕捉能力。实验结果表明,MFD模型在公共数据集上表现优于现有方法,为机构和出版商提供了一种有效的机制来检测LLM生成的文本,确保学术诚信。

链接: https://arxiv.org/abs/2410.14231
作者: Zhen Tao,Zhiyu Li,Runyu Chen,Dinghao Xi,Wei Xu
关键词-EN: Large language models, transformed human writing, Large language, content expansion, stylistic refinement
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have transformed human writing by enhancing grammar correction, content expansion, and stylistic refinement. However, their widespread use raises concerns about authorship, originality, and ethics, even potentially threatening scholarly integrity. Existing detection methods, which mainly rely on single-feature analysis and binary classification, often fail to effectively identify LLM-generated text in academic contexts. To address these challenges, we propose a novel Multi-level Fine-grained Detection (MFD) framework that detects LLM-generated text by integrating low-level structural, high-level semantic, and deep-level linguistic features, while conducting sentence-level evaluations of lexicon, grammar, and syntax for comprehensive analysis. To improve detection of subtle differences in LLM-generated text and enhance robustness against paraphrasing, we apply two mainstream evasion techniques to rewrite the text. These variations, along with original texts, are used to train a text encoder via contrastive learning, extracting high-level semantic features of sentence to boost detection generalization. Furthermore, we leverage advanced LLM to analyze the entire text and extract deep-level linguistic features, enhancing the model’s ability to capture complex patterns and nuances while effectively incorporating contextual information. Extensive experiments on public datasets show that the MFD model outperforms existing methods, achieving an MAE of 0.1346 and an accuracy of 88.56%. Our research provides institutions and publishers with an effective mechanism to detect LLM-generated text, mitigating risks of compromised authorship. Educators and editors can use the model’s predictions to refine verification and plagiarism prevention protocols, ensuring adherence to standards.
摘要:大语言模型 (LLM) 通过增强语法校正、内容扩展和风格优化,彻底改变了人类的写作方式。然而,其广泛使用引发了关于作者身份、原创性和伦理问题的担忧,甚至可能威胁到学术诚信。现有的检测方法主要依赖于单一特征分析和二元分类,往往无法在学术环境中有效识别由 LLM 生成的文本。为应对这些挑战,我们提出了一种新颖的多层次细粒度检测 (Multi-level Fine-grained Detection, MFD) 框架,该框架通过整合低层次的结构特征、高层次的语义特征和深层次的语言特征来检测 LLM 生成的文本,同时对词汇、语法和句法进行句子级别的评估,以进行全面分析。为提高对 LLM 生成文本细微差异的检测能力并增强对改写攻击的鲁棒性,我们应用了两种主流的规避技术对文本进行重写。这些变体与原始文本一起,通过对比学习训练文本编码器,提取句子的高层次语义特征,以提升检测的泛化能力。此外,我们利用先进的大语言模型分析整个文本,提取深层次的语言特征,增强模型捕捉复杂模式和细微差别的能力,同时有效整合上下文信息。在公共数据集上的广泛实验表明,MFD 模型优于现有方法,实现了 0.1346 的均方误差 (MAE) 和 88.56% 的准确率。我们的研究为机构和出版商提供了一种有效的机制来检测 LLM 生成的文本,从而减轻作者身份被篡改的风险。教育工作者和编辑可以使用该模型的预测结果来优化验证和防抄袭协议,确保符合标准。
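多层级特征融合的判定思路可用一个加权求和的极简草图示意:三个层级(结构、语义、语言)各自给出一个"AI 生成"得分,再融合为最终判定。权重与得分均为假设示例,并非 MFD 的实际参数:

```python
def fuse_scores(struct_s, sem_s, deep_s, weights=(0.2, 0.4, 0.4)):
    """对结构、语义、语言三个层级的检测得分做加权融合(权重为假设值)。"""
    return sum(w * s for w, s in zip(weights, (struct_s, sem_s, deep_s)))

# 三个层级各自输出的"疑似 LLM 生成"概率(假设数值)
score = fuse_scores(0.3, 0.9, 0.7)
assert abs(score - 0.70) < 1e-9
assert score > 0.5   # 融合后的判定:疑似 LLM 生成
```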

[NLP-48] Few-Shot Joint Multimodal Entity-Relation Extraction via Knowledge-Enhanced Cross-modal Prompt Model ACM-MM2024

【速读】: 该论文试图解决联合多模态实体关系抽取(JMERE)任务中标注数据不足的问题。解决方案的关键在于提出了知识增强的跨模态提示模型(KECPM),通过引导大型语言模型生成补充背景知识,从而在少样本学习设置下有效弥补信息不足。该方法包括两个阶段:首先通过语义相似性动态构建提示,引导ChatGPT生成相关知识并进行自我反思以精炼知识;然后将辅助知识与原始输入融合,利用基于Transformer的模型进行对齐,以满足JMERE任务的输出格式要求。

链接: https://arxiv.org/abs/2410.14225
作者: Li Yuan,Yi Cai,Junsheng Huang
关键词-EN: Multimodal Entity-Relation Extraction, Joint Multimodal Entity-Relation, social media posts, Entity-Relation Extraction, Joint Multimodal
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: accepted by ACM MM 2024

点击查看摘要

Abstract:Joint Multimodal Entity-Relation Extraction (JMERE) is a challenging task that aims to extract entities and their relations from text-image pairs in social media posts. Existing methods for JMERE require large amounts of labeled data. However, gathering and annotating fine-grained multimodal data for JMERE poses significant challenges. Initially, we construct diverse and comprehensive multimodal few-shot datasets fitted to the original data distribution. To address the insufficient information in the few-shot setting, we introduce the Knowledge-Enhanced Cross-modal Prompt Model (KECPM) for JMERE. This method can effectively address the problem of insufficient information in the few-shot setting by guiding a large language model to generate supplementary background knowledge. Our proposed method comprises two stages: (1) a knowledge ingestion stage that dynamically formulates prompts based on semantic similarity to guide ChatGPT in generating relevant knowledge and employs self-reflection to refine the knowledge; (2) a knowledge-enhanced language model stage that merges the auxiliary knowledge with the original input and utilizes a transformer-based model to align with JMERE’s required output format. We extensively evaluate our approach on a few-shot dataset derived from the JMERE dataset, demonstrating its superiority over strong baselines in terms of both micro and macro F1 scores. Additionally, we present qualitative analyses and case studies to elucidate the effectiveness of our model.
摘要:联合多模态实体关系抽取 (Joint Multimodal Entity-Relation Extraction, JMERE) 是一项具有挑战性的任务,旨在从社交媒体帖子中的文本-图像对中提取实体及其关系。现有的 JMERE 方法需要大量标注数据。然而,为 JMERE 收集和标注细粒度的多模态数据存在显著挑战。首先,我们构建了多样且全面的多模态少样本数据集,以适应原始数据分布。为了解决少样本设置中的信息不足问题,我们引入了知识增强型跨模态提示模型 (Knowledge-Enhanced Cross-modal Prompt Model, KECPM) 用于 JMERE。该方法通过引导大语言模型生成补充背景知识,有效解决了少样本设置中的信息不足问题。我们提出的方法包括两个阶段:(1) 知识摄取阶段,基于语义相似性动态构建提示,引导 ChatGPT 生成相关知识,并利用自我反思来精炼知识;(2) 知识增强型语言模型阶段,将辅助知识与原始输入融合,并利用基于 Transformer 的模型与 JMERE 所需的输出格式对齐。我们在从 JMERE 数据集中派生的少样本数据集上广泛评估了我们的方法,结果显示其在微观和宏观 F_1 分数方面均优于强基线。此外,我们通过定性分析和案例研究阐明了我们模型的有效性。
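第一阶段"基于语义相似性动态构建提示"的骨架,可用如下示意代码勾勒。这里用 Jaccard 词重叠近似真实的相似度模型,提示模板与知识池均为假设示例,并非 KECPM 的原实现:

```python
def tokens(s):
    return {w.strip(".,").lower() for w in s.split()}

def jaccard(a, b):
    sa, sb = tokens(a), tokens(b)
    return len(sa & sb) / max(len(sa | sb), 1)

def build_prompt(post, knowledge_pool, k=1):
    """挑出与输入最相似的背景知识片段,拼接到提示前部。"""
    ranked = sorted(knowledge_pool, key=lambda s: jaccard(post, s),
                    reverse=True)
    background = " ".join(ranked[:k])
    return (f"Background: {background}\n"
            f"Post: {post}\nExtract entities and relations:")

pool = ["Steve Jobs co-founded Apple.", "Paris is the capital of France."]
prompt = build_prompt("Jobs presenting the new Apple phone", pool)
assert "Apple" in prompt.split("Post:")[0]   # 选中了相关的背景知识
```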

[NLP-49] Paths-over-Graph: Knowledge Graph Enpowered Large Language Model Reasoning

【速读】: 该论文试图解决大语言模型(LLMs)在处理复杂推理和知识密集型任务时出现的幻觉问题和知识缺乏问题。解决方案的关键在于提出了一种名为Paths-over-Graph (PoG)的新方法,通过整合知识图谱(KGs)中的知识推理路径来增强LLM的推理能力,从而提高输出结果的可解释性和忠实度。PoG通过三阶段的动态多跳路径探索,结合LLM的固有知识和KGs的事实知识,有效处理多跳和多实体问题。此外,PoG引入了高效的三步剪枝技术,利用图结构、LLM提示和预训练语言模型(如SBERT)来精简探索路径,确保所有推理路径包含高度相关的信息,从而使推理过程更加忠实和可解释。

链接: https://arxiv.org/abs/2410.14211
作者: Xingyu Tan,Xiaoyang Wang,Qing Liu,Xiwei Xu,Xin Yuan,Wenjie Zhang
关键词-EN: Large Language Models, achieved impressive results, Large Language, LLM reasoning, reasoning
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved impressive results in various tasks but struggle with hallucination problems and lack of relevant knowledge, especially in deep complex reasoning and knowledge-intensive tasks. Knowledge Graphs (KGs), which capture vast amounts of facts in a structured format, offer a reliable source of knowledge for reasoning. However, existing KG-based LLM reasoning methods face challenges like handling multi-hop reasoning, multi-entity questions, and effectively utilizing graph structures. To address these issues, we propose Paths-over-Graph (PoG), a novel method that enhances LLM reasoning by integrating knowledge reasoning paths from KGs, improving the interpretability and faithfulness of LLM outputs. PoG tackles multi-hop and multi-entity questions through a three-phase dynamic multi-hop path exploration, which combines the inherent knowledge of LLMs with factual knowledge from KGs. In order to improve the efficiency, PoG prunes irrelevant information from the graph exploration first and introduces efficient three-step pruning techniques that incorporate graph structures, LLM prompting, and a pre-trained language model (e.g., SBERT) to effectively narrow down the explored candidate paths. This ensures all reasoning paths contain highly relevant information captured from KGs, making the reasoning faithful and interpretable in problem-solving. PoG innovatively utilizes graph structure to prune the irrelevant noise and represents the first method to implement multi-entity deep path detection on KGs for LLM reasoning tasks. Comprehensive experiments on five benchmark KGQA datasets demonstrate PoG outperforms the state-of-the-art method ToG across GPT-3.5-Turbo and GPT-4, achieving an average accuracy improvement of 18.9%. Notably, PoG with GPT-3.5-Turbo surpasses ToG with GPT-4 by up to 23.9%.
摘要:大语言模型 (LLMs) 在各种任务中取得了显著成果,但在幻觉问题和缺乏相关知识方面仍存在挑战,尤其是在深度复杂推理和知识密集型任务中。知识图谱 (KGs) 以结构化格式捕捉大量事实,为推理提供了可靠的知识来源。然而,现有的基于 KG 的 LLM 推理方法在处理多跳推理、多实体问题以及有效利用图结构方面面临挑战。为解决这些问题,我们提出了 Paths-over-Graph (PoG),这是一种通过整合来自 KGs 的知识推理路径来增强 LLM 推理的新方法,从而提高了 LLM 输出的可解释性和忠实度。PoG 通过三阶段动态多跳路径探索来解决多跳和多实体问题,结合了 LLMs 的固有知识和 KGs 的事实知识。为了提高效率,PoG 首先从图探索中剪枝无关信息,并引入了包含图结构、LLM 提示和预训练语言模型 (例如 SBERT) 的高效三步剪枝技术,以有效缩小探索的候选路径。这确保了所有推理路径都包含从 KGs 捕捉到的高度相关信息,使得推理在问题解决中既忠实又可解释。PoG 创新性地利用图结构来剪枝无关噪声,并首次实现了在 KG 上进行多实体深度路径检测,用于 LLM 推理任务。在五个基准 KGQA 数据集上的综合实验表明,PoG 在 GPT-3.5-Turbo 和 GPT-4 上均优于最先进的方法 ToG,平均准确率提高了 18.9%。值得注意的是,PoG 结合 GPT-3.5-Turbo 的表现甚至超过了 ToG 结合 GPT-4 的表现,最高可达 23.9%。
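"以问题相关性对候选知识路径剪枝"这一步骤,可以用如下示意代码勾勒:用词重叠近似 SBERT 相似度,只保留与问题最相关的若干条路径。问题与路径均为假设示例,并非 PoG 的原实现:

```python
def tokens(s):
    return {w.strip(".,?").lower() for w in s.split()}

def path_score(question, path):
    """问题与知识图谱路径的相似度(以词重叠近似 SBERT)。"""
    q, p = tokens(question), tokens(" ".join(path))
    return len(q & p) / max(len(q | p), 1)

def prune_paths(question, candidate_paths, keep=2):
    """逐步剪枝:只保留与问题最相关的候选路径。"""
    return sorted(candidate_paths,
                  key=lambda p: path_score(question, p),
                  reverse=True)[:keep]

q = "Which city hosts the company founded by Steve Jobs?"
paths = [
    ["Steve Jobs", "founded", "Apple", "headquartered in", "Cupertino"],
    ["Steve Jobs", "born in", "San Francisco"],
    ["Apple", "competitor of", "Samsung"],
]
kept = prune_paths(q, paths, keep=2)
assert paths[0] in kept   # 最相关的多跳路径被保留
```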

[NLP-50] Montessori-Instruct: Generate Influential Training Data Tailored for Student Learning

【速读】: 该论文试图解决合成数据在训练大型语言模型时引入的噪声、非信息性和误导性信号的问题。解决方案的关键在于提出了一种名为Montessori-Instruct的新型数据合成框架,该框架通过利用合成训练数据对学生模型的局部数据影响来表征学生的学习偏好,并使用直接偏好优化(DPO)训练教师模型,以生成符合学生学习偏好的合成数据。实验结果表明,该方法在Alpaca Eval和MT-Bench上显著优于标准合成方法,分别提升了18.35%和46.24%,并且优于由更强教师模型GPT-4o生成的数据。

链接: https://arxiv.org/abs/2410.14208
作者: Xiaochuan Li,Zichun Yu,Chenyan Xiong
关键词-EN: inevitably introduces noisy, generative nature inevitably, nature inevitably introduces, misleading learning signals, large language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Codes and data are open-sourced at this https URL

点击查看摘要

Abstract:Synthetic data has been widely used to train large language models, but their generative nature inevitably introduces noisy, non-informative, and misleading learning signals. In this paper, we propose Montessori-Instruct, a novel data synthesis framework that tailors the data synthesis ability of the teacher language model toward the student language model’s learning process. Specifically, we utilize local data influence of synthetic training data points on students to characterize students’ learning preferences. Then, we train the teacher model with Direct Preference Optimization (DPO) to generate synthetic data tailored toward student learning preferences. Experiments with Llama3-8B-Instruct (teacher) and Llama3-8B (student) on Alpaca Eval and MT-Bench demonstrate that Montessori-Instruct significantly outperforms standard synthesis methods by 18.35% and 46.24% relatively. Our method also beats data synthesized by a stronger teacher model, GPT-4o. Further analysis confirms the benefits of teacher’s learning to generate more influential training data in the student’s improved learning, the advantages of local data influence in accurately measuring student preferences, and the robustness of Montessori-Instruct across different student models. Our code and data are open-sourced at this https URL.
摘要:合成数据已被广泛用于训练大语言模型,但其生成性质不可避免地引入了噪声、非信息性和误导性的学习信号。本文提出了一种名为 Montessori-Instruct 的新型数据合成框架,该框架将教师语言模型的数据合成能力定制化,以适应学生语言模型的学习过程。具体而言,我们利用合成训练数据点对学生的局部数据影响来表征学生的学习偏好。随后,我们通过直接偏好优化 (DPO) 训练教师模型,以生成符合学生学习偏好的合成数据。在 Alpaca Eval 和 MT-Bench 上,使用 Llama3-8B-Instruct (教师) 和 Llama3-8B (学生) 进行的实验表明,Montessori-Instruct 相对于标准合成方法分别显著提升了 18.35% 和 46.24%。我们的方法还优于由更强大的教师模型 GPT-4o 合成的数据。进一步的分析证实了教师学习生成更具影响力的训练数据对学生学习改进的益处,局部数据影响在准确衡量学生偏好方面的优势,以及 Montessori-Instruct 在不同学生模型中的鲁棒性。我们的代码和数据已在 https URL 上开源。
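"局部数据影响 → DPO 偏好对"的数据构造流程,可用如下示意代码勾勒:把影响定义为在某条合成样本上训练后,学生模型在参考集上损失的下降量,再把同一提示下影响最大与最小的补全配成(chosen, rejected)偏好对。示例数值均为假设,并非论文原实现:

```python
def local_influence(student_loss_before, student_loss_after):
    """单条合成样本的局部数据影响:训练后参考集损失下降了多少。"""
    return student_loss_before - student_loss_after

def build_dpo_pairs(candidates):
    """把每个提示下影响最大/最小的补全配成 DPO 偏好对,用于训练教师。"""
    pairs = []
    for prompt, completions in candidates.items():
        ranked = sorted(completions, key=lambda c: c["influence"],
                        reverse=True)
        if len(ranked) >= 2 and ranked[0]["influence"] > ranked[-1]["influence"]:
            pairs.append({"prompt": prompt,
                          "chosen": ranked[0]["text"],
                          "rejected": ranked[-1]["text"]})
    return pairs

cands = {"Explain DPO": [
    {"text": "good answer",  "influence": local_influence(2.0, 1.6)},
    {"text": "noisy answer", "influence": local_influence(2.0, 2.1)},
]}
pairs = build_dpo_pairs(cands)
assert pairs[0]["chosen"] == "good answer"
assert pairs[0]["rejected"] == "noisy answer"
```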

[NLP-51] MediTOD: An English Dialogue Dataset for Medical History Taking with Comprehensive Annotations EMNLP2024

【速读】: 该论文试图解决医疗任务导向对话系统中数据集不足、缺乏全面标注以及多语言限制的问题。解决方案的关键在于引入MediTOD,这是一个新的英语医生-患者对话数据集,专门用于医疗历史采集任务。通过与医生合作,设计基于问卷的标注方案,并由医疗专业人员进行高质量的全面标注,捕捉医疗槽及其属性,从而为自然语言理解、策略学习和自然语言生成等子任务提供丰富的数据支持。

链接: https://arxiv.org/abs/2410.14204
作者: Vishal Vivek Saley,Goonjan Saha,Rocktim Jyoti Das,Dinesh Raghu,Mausam
关键词-EN: guiding treatment selection, task-oriented dialogue systems, patient medical history, collecting patient medical, reducing doctor burnout
类目: Computation and Language (cs.CL)
备注: EMNLP2024 Camera Ready Version

点击查看摘要

Abstract:Medical task-oriented dialogue systems can assist doctors by collecting patient medical history, aiding in diagnosis, or guiding treatment selection, thereby reducing doctor burnout and expanding access to medical services. However, doctor-patient dialogue datasets are not readily available, primarily due to privacy regulations. Moreover, existing datasets lack comprehensive annotations involving medical slots and their different attributes, such as symptoms and their onset, progression, and severity. These comprehensive annotations are crucial for accurate diagnosis. Finally, most existing datasets are non-English, limiting their utility for the larger research community. In response, we introduce MediTOD, a new dataset of doctor-patient dialogues in English for the medical history-taking task. Collaborating with doctors, we devise a questionnaire-based labeling scheme tailored to the medical domain. Then, medical professionals create the dataset with high-quality comprehensive annotations, capturing medical slots and their attributes. We establish benchmarks in supervised and few-shot settings on MediTOD for natural language understanding, policy learning, and natural language generation subtasks, evaluating models from both TOD and biomedical domains. We make MediTOD publicly available for future research.
摘要:面向医疗任务的对话系统可以通过收集患者病史、辅助诊断或指导治疗选择,从而减轻医生的工作压力并扩大医疗服务的覆盖范围。然而,由于隐私法规的限制,医生与患者之间的对话数据集并不容易获取。此外,现有数据集缺乏涉及医疗槽位及其不同属性的全面标注,例如症状及其发作、进展和严重程度。这些全面的标注对于准确诊断至关重要。最后,大多数现有数据集为非英语,限制了其对更广泛研究社区的实用性。为此,我们引入了 MediTOD,这是一个针对病史采集任务的英语医生-患者对话新数据集。我们与医生合作,设计了一种基于问卷的标注方案,专门针对医疗领域。随后,医疗专业人员创建了具有高质量全面标注的数据集,捕捉医疗槽位及其属性。我们在 MediTOD 上建立了监督学习和少样本学习设置下的基准测试,用于自然语言理解、策略学习和自然语言生成子任务,评估了来自 TOD 和生物医学领域的模型。我们将 MediTOD 公开发布,供未来研究使用。

[NLP-52] Rationale Behind Essay Scores: Enhancing S-LLMs Multi-Trait Essay Scoring with Rationale Generated by LLMs

【速读】: 该论文试图解决现有自动作文评分(AES)系统仅依赖作文文本而忽略评分细则具体评估方面的不足,提出了一种基于理由的多特质评分(RMTS)方法。解决方案的关键在于结合提示工程的大型语言模型(LLM)与基于微调的小型大型语言模型(S-LLM),通过LLM生成的特质特定理由来指导评分模型,从而实现对作文多特质评分的精确预测。实验结果表明,RMTS在特质特定评分上显著优于现有最先进模型和传统S-LLM,通过引入细粒度的定性理由,提升了评分的特质可靠性。

链接: https://arxiv.org/abs/2410.14202
作者: SeongYeub Chu,JongWoo Kim,Bryan Wong,MunYong Yi
关键词-EN: Existing automated essay, specific aspects evaluated, Rationale-based Multiple Trait, Existing automated, Multiple Trait Scoring
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing automated essay scoring (AES) has solely relied on essay text without using explanatory rationales for the scores, thereby forgoing an opportunity to capture the specific aspects evaluated by rubric indicators in a fine-grained manner. This paper introduces Rationale-based Multiple Trait Scoring (RMTS), a novel approach for multi-trait essay scoring that integrates prompt-engineering-based large language models (LLMs) with a fine-tuning-based essay scoring model using a smaller large language model (S-LLM). RMTS uses an LLM-based trait-wise rationale generation system where a separate LLM agent generates trait-specific rationales based on rubric guidelines, which the scoring model uses to accurately predict multi-trait scores. Extensive experiments on benchmark datasets, including ASAP, ASAP++, and Feedback Prize, show that RMTS significantly outperforms state-of-the-art models and vanilla S-LLMs in trait-specific scoring. By assisting quantitative assessment with fine-grained qualitative rationales, RMTS enhances the trait-wise reliability, providing partial explanations about essays.
摘要:现有的自动作文评分 (AES) 仅依赖作文文本,而未使用评分理由来解释评分,从而错失了以细粒度方式捕捉评分标准指标所评估的具体方面的机会。本文介绍了基于理由的多特质评分 (Rationale-based Multiple Trait Scoring, RMTS),这是一种新颖的多特质作文评分方法,它将基于提示工程的大语言模型 (LLM) 与使用较小大语言模型 (S-LLM) 的微调作文评分模型相结合。RMTS 使用基于 LLM 的特质理由生成系统,其中独立的 LLM 智能体根据评分标准生成特质特定的理由,评分模型利用这些理由来准确预测多特质评分。在包括 ASAP、ASAP++ 和 Feedback Prize 在内的基准数据集上的广泛实验表明,RMTS 在特质特定评分方面显著优于最先进的模型和普通的 S-LLM。通过借助细粒度的定性理由辅助定量评估,RMTS 提高了特质评分的可靠性,为作文提供了部分解释。
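"每个特质一条理由、再与作文拼接后交给评分模型"的输入构造,可用如下示意代码说明。TRAITS 中的特质名与理由文本均为假设示例,真实理由由 LLM 智能体按评分细则生成:

```python
# 假设的评分特质集合,仅作示意
TRAITS = ["content", "organization", "conventions"]

def trait_rationale(essay, trait):
    """代替真实的 LLM 智能体:为某一特质生成一条评分理由。"""
    return f"For {trait}: the essay '{essay[:20]}...' shows relevant evidence."

def build_scoring_input(essay, traits=TRAITS):
    """把作文与每个特质的理由拼接成评分模型的输入。"""
    rationales = {t: trait_rationale(essay, t) for t in traits}
    blocks = [f"[{t.upper()}] {r}" for t, r in rationales.items()]
    return essay + "\n" + "\n".join(blocks)

inp = build_scoring_input("Recycling reduces waste because ...")
assert all(f"[{t.upper()}]" in inp for t in TRAITS)
```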

[NLP-53] Supervised Chain of Thought

【速读】: 该论文试图解决大型语言模型(LLMs)在处理需要深度计算的复杂推理任务时面临的计算深度限制问题。解决方案的关键在于引入任务特定的监督,以优化模型在提示空间中的导航能力。论文通过将解决方案搜索空间划分为提示空间和答案空间,并强调任务特定监督在准确导航提示空间中的重要性,从而提升模型在复杂推理任务中的表现。实验结果表明,任务特定监督的应用显著缩小了无监督与有监督情况下的推理性能差距。

链接: https://arxiv.org/abs/2410.14198
作者: Xiang Zhang,Dujian Ding
关键词-EN: advancing Artificial Intelligence, Large Language Models, revolutionized natural language, natural language processing, Artificial Intelligence
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have revolutionized natural language processing and hold immense potential for advancing Artificial Intelligence. However, the core architecture of most mainstream LLMs – the Transformer – has inherent limitations in computational depth, rendering them theoretically incapable of solving many reasoning tasks that demand increasingly deep computations. Chain of Thought (CoT) prompting has emerged as a technique to address these architectural limitations, as evidenced by several theoretical studies. It offers a promising approach to solving complex reasoning tasks that were previously beyond the capabilities of these models. Despite its successes, CoT and its variants (such as Tree of Thought, Graph of Thought, etc.) rely on a “one-prompt-for-all” approach, using a single prompt structure (e.g., “think step by step”) for a wide range of tasks – from counting and sorting to solving mathematical and algorithmic problems. This approach poses significant challenges for models to generate the correct reasoning steps, as the model must navigate through a vast prompt template space to find the appropriate template for each task. In this work, we build upon previous theoretical analyses of CoT to demonstrate how the one-prompt-for-all approach can negatively affect the computability of LLMs. We partition the solution search space into two: the prompt space and the answer space. Our findings show that task-specific supervision is essential for navigating the prompt space accurately and achieving optimal performance. Through experiments with state-of-the-art LLMs, we reveal a gap in reasoning performance when supervision is applied versus when it is not.
摘要:大语言模型 (LLMs) 已经彻底改变了自然语言处理领域,并为人工智能的发展带来了巨大的潜力。然而,大多数主流 LLMs 的核心架构——Transformer——在计算深度上存在固有的局限性,使其在理论上无法解决许多需要越来越深计算的推理任务。思维链 (Chain of Thought, CoT) 提示作为一种技术应运而生,以应对这些架构上的局限性,这一点在多项理论研究中得到了证实。它为解决之前这些模型无法处理的复杂推理任务提供了一种有前景的方法。尽管取得了成功,CoT 及其变体(如思维树、思维图等)依赖于一种“一提示多用”的方法,即使用单一的提示结构(例如“逐步思考”)来应对从计数和排序到解决数学和算法问题的广泛任务。这种方法对模型生成正确推理步骤提出了重大挑战,因为模型必须在庞大的提示模板空间中导航,以找到适合每个任务的模板。在本研究中,我们基于对 CoT 的先前理论分析,展示了“一提示多用”方法如何负面影响 LLMs 的计算能力。我们将解决方案搜索空间划分为两个部分:提示空间和答案空间。我们的研究结果表明,任务特定的监督对于准确导航提示空间并实现最佳性能至关重要。通过与最先进的 LLMs 进行实验,我们揭示了在应用监督与不应用监督时推理性能之间的差距。
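"提示空间按任务监督选择模板,而不是对所有任务使用同一个'逐步思考'提示"的思想,可用一个最小示例说明。PROMPT_SPACE 中的任务与模板均为假设示例,并非论文的原始设定:

```python
# 提示空间:每类任务一个模板;任务特定监督告诉我们该用哪一个,
# 而不是对所有任务都用"think step by step"。
PROMPT_SPACE = {
    "counting": "Count the items one by one, keeping a running total.",
    "sorting":  "Compare adjacent items and swap until ordered.",
    "math":     "Write the equation, then solve it symbolically step by step.",
}

def supervised_prompt(task_label, question):
    """任务特定监督:按任务标签选取提示模板;无监督时退回通用 CoT。"""
    template = PROMPT_SPACE.get(task_label, "Think step by step.")
    return f"{template}\nQuestion: {question}"

p = supervised_prompt("counting", "How many vowels are in 'banana'?")
assert p.startswith("Count the items")
fallback = supervised_prompt("unknown", "...")
assert fallback.startswith("Think step by step.")
```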

[NLP-54] Speciesism in Natural Language Processing Research

【速读】: 该论文试图解决NLP研究中是否存在对非人类动物的歧视(即物种歧视)的问题。解决方案的关键在于揭示和量化研究人员、数据集和模型中存在的物种歧视,并通过讨论如何减少这种歧视来推动NLP研究的伦理进步。具体来说,研究通过调查和实验发现,NLP研究人员普遍未意识到物种歧视问题,数据集中存在固有的物种歧视偏见,而最新的NLP模型如OpenAI GPTs也默认表现出物种歧视倾向。因此,减少物种歧视的关键在于提高研究人员的意识,改进数据集的标注,以及在模型训练中引入伦理考量。

链接: https://arxiv.org/abs/2410.14194
作者: Masashi Takeshita,Rafal Rzepka
关键词-EN: Natural Language Processing, Natural Language, Language Processing, NLP, NLP research
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This article is a preprint and has not been peer-reviewed. The postprint has been accepted for publication in AI and Ethics. Please cite the final version of the article once it is published

点击查看摘要

Abstract:Natural Language Processing (NLP) research on AI Safety and social bias in AI has focused on safety for humans and social bias against human minorities. However, some AI ethicists have argued that the moral significance of nonhuman animals has been ignored in AI research. Therefore, the purpose of this study is to investigate whether there is speciesism, i.e., discrimination against nonhuman animals, in NLP research. First, we explain why nonhuman animals are relevant in NLP research. Next, we survey the findings of existing research on speciesism in NLP researchers, data, and models and further investigate this problem in this study. The findings of this study suggest that speciesism exists within researchers, data, and models, respectively. Specifically, our survey and experiments show that (a) among NLP researchers, even those who study social bias in AI, do not recognize speciesism or speciesist bias; (b) among NLP data, speciesist bias is inherent in the data annotated in the datasets used to evaluate NLP models; (c) OpenAI GPTs, recent NLP models, exhibit speciesist bias by default. Finally, we discuss how we can reduce speciesism in NLP research.
摘要:自然语言处理 (NLP) 在 AI 安全和社会偏见方面的研究主要集中在人类的安全和社会对少数族群的偏见上。然而,一些 AI 伦理学家认为,AI 研究中忽视了非人类动物的道德意义。因此,本研究的目的在于探讨 NLP 研究中是否存在物种主义,即对非人类动物的歧视。首先,我们解释了为什么非人类动物在 NLP 研究中具有相关性。接着,我们调查了现有关于 NLP 研究者、数据和模型中物种主义的研究成果,并在本研究中进一步探讨了这一问题。研究结果表明,物种主义分别存在于研究者、数据和模型中。具体而言,我们的调查和实验显示:(a) 在 NLP 研究者中,即使是那些研究 AI 社会偏见的研究者,也不承认物种主义或物种主义偏见;(b) 在 NLP 数据中,物种主义偏见固化在用于评估 NLP 模型的数据集中;© OpenAI 的 GPT 系列,即最新的 NLP 模型,默认表现出物种主义偏见。最后,我们讨论了如何在 NLP 研究中减少物种主义。

[NLP-55] MetaAlign: Align Large Language Models with Diverse Preferences during Inference Time

【速读】: 该论文试图解决现有大型语言模型(LLMs)在实际应用中难以动态适应多样化人类偏好问题。解决方案的关键在于提出了一种名为MetaAlign的新方法,该方法能够在推理阶段动态地使LLMs与各种显式或隐式的偏好对齐,而不依赖于预定义的静态偏好嵌入。通过在精心构建的MetaAlign数据集上进行优化,实验结果验证了该方法的可行性,从而为语言模型的动态对齐提供了新的思路。

链接: https://arxiv.org/abs/2410.14184
作者: Mozhi Zhang,Pengyu Wang,Chenkun Tan,Mianqiu Huang,Dong Zhang,Yaqian Zhou,Xipeng Qiu
关键词-EN: acquire extensive knowledge, extensive text corpora, Large Language Models, Large Language, acquire extensive
类目: Computation and Language (cs.CL)
备注: 19 pages, 6 figures

点击查看摘要

Abstract:Large Language Models (LLMs) acquire extensive knowledge and remarkable abilities from extensive text corpora, making them powerful tools for various applications. To make LLMs more usable, aligning them with human preferences is essential. Existing alignment techniques, such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), typically embed predefined preferences directly within the model’s parameters. These methods, however, often result in a static alignment that can not account for the diversity of human preferences in practical applications. In response to this challenge, we propose an effective method, \textbfMetaAlign, which aims to help LLMs dynamically align with various explicit or implicit preferences specified at inference time. Experimental results show that LLMs optimized on our meticulously constructed MetaAlign Dataset can effectively align with any preferences specified at the inference stage, validating the feasibility of MetaAlign. We hope that our work can provide some insights into the alignment of language models.
摘要:大语言模型 (LLMs) 从广泛的文本语料库中获取了丰富的知识和卓越的能力,使其成为各种应用的强大工具。为了使 LLMs 更加实用,将其与人类偏好对齐是至关重要的。现有的对齐技术,如基于人类反馈的强化学习 (RLHF) 和直接偏好优化 (DPO),通常将预定义的偏好直接嵌入模型的参数中。然而,这些方法往往导致静态对齐,无法在实际应用中适应人类偏好的多样性。针对这一挑战,我们提出了一种有效的方法,MetaAlign,旨在帮助 LLMs 在推理时动态对齐各种显式或隐式的偏好。实验结果表明,在我们精心构建的 MetaAlign 数据集上优化的 LLMs 能够有效地对齐推理阶段指定的任何偏好,验证了 MetaAlign 的可行性。我们希望我们的工作能为语言模型的对齐提供一些启示。
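MetaAlign 的核心思路是在推理阶段把偏好作为输入的一部分交给模型,而不是把偏好固化进参数。下面用一个极简的 Python 示意来说明"推理时注入偏好"的输入构造方式;提示模板与措辞均为假设性的,并非论文的原始实现:

```python
def build_preference_prompt(question: str, preferences: list[str]) -> str:
    """把推理时指定的显式偏好拼入提示,模拟动态对齐的输入格式(假设性模板)。"""
    pref_lines = "\n".join(f"- {p}" for p in preferences)
    return (
        "请在回答时遵循以下偏好:\n"
        f"{pref_lines}\n\n"
        f"问题:{question}\n回答:"
    )

prompt = build_preference_prompt(
    "如何向初学者解释注意力机制?",
    ["回答尽量简洁", "避免使用数学公式"],
)
print(prompt)
```

同一个模型无需重新训练,换一组偏好即可得到不同的对齐行为,这正是"推理时对齐"相对于 RLHF/DPO 等静态对齐方式的差异所在。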

[NLP-56] LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs

【速读】: 该论文试图解决大型语言模型(LLMs)在实验室安全决策中的可靠性问题。解决方案的关键在于提出了实验室安全基准(LabSafety Bench),这是一个基于职业安全与健康管理局(OSHA)协议的全面评估框架,包含765道经专家验证的多选题,用于评估LLMs和视觉语言模型(VLMs)在实验室安全情境下的表现。通过这一基准,论文揭示了即使是最先进的模型如GPT-4o,在实验室安全相关任务中仍存在关键错误,强调了在安全关键环境中依赖LLMs的风险,并强调了开发专门基准以准确评估LLMs在实际安全应用中可信度的重要性。

链接: https://arxiv.org/abs/2410.14182
作者: Yujun Zhou,Jingdong Yang,Kehan Guo,Pin-Yu Chen,Tian Gao,Werner Geyer,Nuno Moniz,Nitesh V Chawla,Xiangliang Zhang
关键词-EN: accidents pose significant, Laboratory accidents pose, pose significant risks, life and property, underscoring the importance
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 50 pages, 19 figures

点击查看摘要

Abstract:Laboratory accidents pose significant risks to human life and property, underscoring the importance of robust safety protocols. Despite advancements in safety training, laboratory personnel may still unknowingly engage in unsafe practices. With the increasing reliance on large language models (LLMs) for guidance in various fields, including laboratory settings, there is a growing concern about their reliability in critical safety-related decision-making. Unlike trained human researchers, LLMs lack formal lab safety education, raising questions about their ability to provide safe and accurate guidance. Existing research on LLM trustworthiness primarily focuses on issues such as ethical compliance, truthfulness, and fairness but fails to fully cover safety-critical real-world applications, like lab safety. To address this gap, we propose the Laboratory Safety Benchmark (LabSafety Bench), a comprehensive evaluation framework based on a new taxonomy aligned with Occupational Safety and Health Administration (OSHA) protocols. This benchmark includes 765 multiple-choice questions verified by human experts, assessing LLMs and vision language models (VLMs) performance in lab safety contexts. Our evaluations demonstrate that while GPT-4o outperforms human participants, it is still prone to critical errors, highlighting the risks of relying on LLMs in safety-critical environments. Our findings emphasize the need for specialized benchmarks to accurately assess the trustworthiness of LLMs in real-world safety applications.
摘要:实验室事故对人类生命和财产构成重大风险,凸显了健全安全协议的重要性。尽管安全培训有所进步,实验室人员仍可能无意中采取不安全的行为。随着大语言模型 (LLM) 在包括实验室环境在内的多个领域中指导作用的增加,其在关键安全决策中的可靠性引起了广泛关注。与受过训练的人类研究人员不同,LLM 缺乏正式的实验室安全教育,这引发了对其提供安全准确指导能力的质疑。现有关于 LLM 可信度的研究主要集中在伦理合规、真实性和公平性等问题上,但未能全面涵盖实验室安全等安全关键的现实应用。为填补这一空白,我们提出了实验室安全基准 (LabSafety Bench),这是一个基于与职业安全与健康管理局 (OSHA) 协议相一致的新分类法的综合评估框架。该基准包括由人类专家验证的 765 道多选题,用于评估 LLM 和视觉语言模型 (VLM) 在实验室安全环境中的表现。我们的评估结果表明,尽管 GPT-4o 的表现优于人类参与者,但它仍容易出现重大错误,突显了在安全关键环境中依赖 LLM 的风险。我们的研究结果强调了需要专门的基准来准确评估 LLM 在现实世界安全应用中的可信度。

[NLP-57] XForecast: Evaluating Natural Language Explanations for Time Series Forecasting

【速读】: 该论文试图解决时间序列预测模型的可解释性问题,特别是如何生成易于理解的解释(Natural Language Explanations, NLEs)并评估其有效性。解决方案的关键在于引入基于模拟性的两个新性能指标,用于评估人类代理根据解释预测模型输出的能力,从而区分好的解释与差的解释,并发现解释质量主要受数值推理能力而非模型大小的影响。

链接: https://arxiv.org/abs/2410.14180
作者: Taha Aksu,Chenghao Liu,Amrita Saha,Sarah Tan,Caiming Xiong,Doyen Sahoo
关键词-EN: forecasting aids decision-making, ensure informed decisions, series forecasting aids, aids decision-making, accurate predictions
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Time series forecasting aids decision-making, especially for stakeholders who rely on accurate predictions, making it very important to understand and explain these models to ensure informed decisions. Traditional explainable AI (XAI) methods, which underline feature or temporal importance, often require expert knowledge. In contrast, natural language explanations (NLEs) are more accessible to laypeople. However, evaluating forecast NLEs is difficult due to the complex causal relationships in time series data. To address this, we introduce two new performance metrics based on simulatability, assessing how well a human surrogate can predict model forecasts using the explanations. Experiments show these metrics differentiate good from poor explanations and align with human judgments. Utilizing these metrics, we further evaluate the ability of state-of-the-art large language models (LLMs) to generate explanations for time series data, finding that numerical reasoning, rather than model size, is the main factor influencing explanation quality.
摘要:时间序列预测辅助决策,特别是对于依赖准确预测的利益相关者,理解并解释这些模型以确保明智决策至关重要。传统的可解释 AI (Explainable AI, XAI) 方法,强调特征或时间的重要性,通常需要专家知识。相比之下,自然语言解释 (Natural Language Explanations, NLEs) 对非专业人士更为友好。然而,由于时间序列数据中复杂的因果关系,评估预测 NLEs 具有挑战性。为解决这一问题,我们引入了两种基于可模拟性的新性能指标,评估人类代理使用解释预测模型预测的准确性。实验表明,这些指标能够区分好与差的解释,并与人类判断相符。利用这些指标,我们进一步评估了最先进的大语言模型 (Large Language Models, LLMs) 生成时间序列数据解释的能力,发现数值推理而非模型规模是影响解释质量的主要因素。
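基于可模拟性的指标可以粗略理解为:拿到解释之后,代理复现模型预测的误差下降了多少。下面用纯 Python 给出一个简化的数值示意;具体指标定义以论文为准,代理预测值为虚构数据:

```python
def mse(a, b):
    """均方误差,用于衡量代理预测与模型预测的偏离程度。"""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def simulatability_gain(model_forecast, surrogate_with_expl, surrogate_without_expl):
    """可模拟性增益的一个简化定义:有解释时的误差相对无解释时下降了多少(示意)。"""
    return mse(model_forecast, surrogate_without_expl) - mse(model_forecast, surrogate_with_expl)

forecast = [10.0, 12.0, 11.0]       # 模型的时间序列预测
with_expl = [10.5, 12.2, 11.1]      # 代理在读过解释后给出的预测
without_expl = [8.0, 15.0, 9.0]     # 代理在没有解释时给出的预测

gain = simulatability_gain(forecast, with_expl, without_expl)
print(gain > 0)  # 好的解释应带来正的可模拟性增益
```

按这一定义,增益越大说明解释越能帮助代理"模拟"模型的行为,从而区分好与差的解释。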

[NLP-58] MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems

【速读】: 该论文试图解决现有图表相关任务基准在捕捉真实世界多图表场景复杂性方面的不足,特别是缺乏对多跳推理能力的评估。解决方案的关键在于引入MultiChartQA基准,该基准通过四个关键领域的评估(直接问答、并行问答、比较推理和顺序推理),全面考察多模态大语言模型在处理多图表任务时的能力,从而填补了现有基准的空白,并揭示了模型在多图表理解方面的显著性能差距。

链接: https://arxiv.org/abs/2410.14179
作者: Zifeng Zhu,Mengzhao Jia,Zhihan Zhang,Lang Li,Meng Jiang
关键词-EN: Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, Large Language
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 9 figures

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated impressive abilities across various tasks, including visual question answering and chart comprehension, yet existing benchmarks for chart-related tasks fall short in capturing the complexity of real-world multi-chart scenarios. Current benchmarks primarily focus on single-chart tasks, neglecting the multi-hop reasoning required to extract and integrate information from multiple charts, which is essential in practical applications. To fill this gap, we introduce MultiChartQA, a benchmark that evaluates MLLMs’ capabilities in four key areas: direct question answering, parallel question answering, comparative reasoning, and sequential reasoning. Our evaluation of a wide range of MLLMs reveals significant performance gaps compared to humans. These results highlight the challenges in multi-chart comprehension and the potential of MultiChartQA to drive advancements in this field. Our code and data are available at this https URL
摘要:多模态大语言模型 (Multimodal Large Language Models, MLLMs) 在包括视觉问答和图表理解在内的多种任务中展示了令人印象深刻的能力,然而现有针对图表相关任务的基准测试在捕捉真实世界多图表场景的复杂性方面存在不足。当前的基准测试主要集中在单图表任务上,忽略了从多个图表中提取和整合信息所需的多跳推理,这在实际应用中至关重要。为了填补这一空白,我们引入了 MultiChartQA,这是一个评估 MLLMs 在四个关键领域能力的基准测试:直接问答、并行问答、比较推理和顺序推理。我们对多种 MLLMs 的评估结果显示,其性能与人类相比存在显著差距。这些结果突显了多图表理解中的挑战,并展示了 MultiChartQA 推动该领域进步的潜力。我们的代码和数据可在以下链接获取:[https URL]
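这类多图表基准的评测流程通常可归结为:按四类任务分组,分别统计模型答案与标准答案的匹配率。下面是一个评测框架的示意;样例结构与字段名均为假设,并非 MultiChartQA 的官方实现:

```python
from collections import defaultdict

# 假设性的评测样例:每条含任务类型、模型答案 (pred) 与标准答案 (gold)
samples = [
    {"task": "direct", "pred": "A", "gold": "A"},
    {"task": "parallel", "pred": "B", "gold": "C"},
    {"task": "comparative", "pred": "A", "gold": "A"},
    {"task": "sequential", "pred": "D", "gold": "D"},
]

def accuracy_by_task(samples):
    """按任务类型(直接问答、并行问答、比较推理、顺序推理)分别统计准确率。"""
    correct, total = defaultdict(int), defaultdict(int)
    for s in samples:
        total[s["task"]] += 1
        correct[s["task"]] += int(s["pred"] == s["gold"])
    return {t: correct[t] / total[t] for t in total}

print(accuracy_by_task(samples))
```

分任务汇报准确率,正是论文得以定位"多跳推理类任务与人类差距最大"这类结论的前提。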

[NLP-59] LLM The Genius Paradox: A Linguistic and Math Experts Struggle with Simple Word-based Counting Problems

【速读】: 该论文试图解决大语言模型(LLMs)在简单字符计数任务上的表现不足问题,并探讨其背后的原因。论文通过设计多种评估设置,验证了关于LLMs在简单计数任务上表现不佳的常见猜想(如分词、架构和训练数据)的有效性。研究发现,这些猜想并不完全准确,且LLMs在高级数学和编码推理能力向简单计数任务的迁移上也存在挑战。论文的关键解决方案在于强调“先推理后响应”的策略,认为这是帮助LLMs更准确感知任务并提高响应准确性的最稳健和高效的方法。此外,论文呼吁更多关注模型能力获取和评估,并强调在模型预训练过程中培养“先推理后响应”的意识的重要性。

链接: https://arxiv.org/abs/2410.14166
作者: Nan Xu,Xuezhe Ma
关键词-EN: humans find trivial, trivial to handle, number of character, LLMs, Interestingly
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Interestingly, LLMs yet struggle with some basic tasks that humans find trivial to handle, e.g., counting the number of character r’s in the word “strawberry”. There are several popular conjectures (e.g., tokenization, architecture and training data) regarding the reason for deficiency of LLMs in simple word-based counting problems, sharing the similar belief that such failure stems from model pretraining hence probably inevitable during deployment. In this paper, we carefully design multiple evaluation settings to investigate validity of prevalent conjectures. Meanwhile, we measure transferability of advanced mathematical and coding reasoning capabilities from specialized LLMs to simple counting tasks. Although specialized LLMs suffer from counting problems as well, we find conjectures about inherent deficiency of LLMs invalid and further seek opportunities to elicit knowledge and capabilities from LLMs that are beneficial to counting tasks. Compared with strategies such as finetuning and in-context learning that are commonly adopted to enhance performance on new or challenging tasks, we show that engaging reasoning is the most robust and efficient way to help LLMs better perceive tasks with more accurate responses. We hope our conjecture validation design could provide insights into the study of future critical failure modes of LLMs. Based on challenges in transferring advanced capabilities to much simpler tasks, we call for more attention to model capability acquisition and evaluation. We also highlight the importance of cultivating consciousness of “reasoning before responding” during model pretraining.
摘要:有趣的是,大语言模型 (LLM) 在处理一些人类认为微不足道的基本任务时仍然存在困难,例如计算单词 “strawberry” 中字母 r 的数量。关于大语言模型在简单基于单词的计数问题上表现不足的原因,存在几种流行的猜想 (例如,Tokenization、架构和训练数据),这些猜想都持有类似的观点,即这种失败源于模型的预训练,因此在部署过程中可能是不可避免的。在本文中,我们精心设计了多种评估设置,以调查这些流行猜想的有效性。同时,我们测量了从专门的大语言模型到简单计数任务的高级数学和编码推理能力的可转移性。尽管专门的大语言模型在计数问题上同样存在困难,但我们发现关于大语言模型固有缺陷的猜想是无效的,并进一步寻求从大语言模型中提取对计数任务有益的知识和能力的机会。与通常采用的微调 (finetuning) 和上下文学习 (in-context learning) 等策略相比,我们展示了通过推理参与是帮助大语言模型更好地感知任务并以更准确的方式响应的最稳健和高效的方法。我们希望我们的猜想验证设计能为未来大语言模型关键失败模式的研究提供见解。基于将高级能力转移到更简单任务的挑战,我们呼吁更多关注模型能力获取和评估。我们还强调了在模型预训练过程中培养 “先推理后响应” 意识的重要性。

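论文中用作引子的"简单计数任务"对程序而言是平凡的。下面用 Python 对照展示两种写法:直接计数,以及把"先推理后响应"显式展开为逐字符检查的过程:

```python
word = "strawberry"

# 直接计数:对程序只是一次内置调用,LLM 却常在这类任务上出错
print(word.count("r"))  # 3

# "先推理后响应"的过程化写法:先逐字符列出每个 r 的位置,再汇总
r_positions = [i for i, ch in enumerate(word) if ch == "r"]
print(r_positions)       # [2, 7, 8]
print(len(r_positions))  # 3
```

第二种写法相当于把隐式的计数展开为可检查的中间步骤,与论文强调的"通过推理参与帮助模型更准确地感知任务"在思路上一致。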

[NLP-60] Automated Genre-Aware Article Scoring and Feedback Using Large Language Models

【速读】: 该论文试图解决传统文章评分方法在评估不同类型文章质量时的局限性,特别是缺乏针对特定特征的详细评分。解决方案的关键在于结合预训练的BERT模型和大型语言模型Chat-GPT,通过深度理解文章内容和结构,提供基于特征的详细评分和改进建议。这种方法不仅提高了评分的准确性,还能生成个性化的反馈,帮助用户提升写作技能,展示了自动化评分技术在教育领域的潜在价值。

链接: https://arxiv.org/abs/2410.14165
作者: Chihang Wang,Yuxin Dong,Zhenhong Zhang,Ruotong Wang,Shuo Wang,Jiajing Chen
关键词-EN: advanced intelligent article, offers detailed feature-based, intelligent article scoring, detailed feature-based scoring, feature-based scoring tailored
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper focuses on the development of an advanced intelligent article scoring system that not only assesses the overall quality of written work but also offers detailed feature-based scoring tailored to various article genres. By integrating the pre-trained BERT model with the large language model Chat-GPT, the system gains a deep understanding of both the content and structure of the text, enabling it to provide a thorough evaluation along with targeted suggestions for improvement. Experimental results demonstrate that this system outperforms traditional scoring methods across multiple public datasets, particularly in feature-based assessments, offering a more accurate reflection of the quality of different article types. Moreover, the system generates personalized feedback to assist users in enhancing their writing skills, underscoring the potential and practical value of automated scoring technologies in educational contexts.
摘要:本文专注于开发一种先进的智能文章评分系统,该系统不仅评估书面作品的整体质量,还提供针对不同文章类型的详细特征评分。通过将预训练的 BERT 模型与大语言模型 Chat-GPT 集成,该系统能够深入理解文本的内容和结构,从而提供全面的评估以及针对性的改进建议。实验结果表明,该系统在多个公共数据集上优于传统的评分方法,特别是在特征评分方面,能够更准确地反映不同文章类型的质量。此外,该系统生成个性化的反馈,帮助用户提升写作技能,突显了自动化评分技术在教育环境中的潜力和实际价值。

[NLP-61] Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning

【速读】: 该论文试图解决自回归语言模型在复杂推理和长期规划任务中的不足,特别是难以处理困难子目标的问题。解决方案的关键在于引入离散扩散模型,并通过多粒度扩散建模(Multi-granularity Diffusion Modeling, MDM)来优先学习困难的子目标。MDM在复杂任务如倒计时、数独和布尔可满足性问题中显著优于自回归模型,无需使用搜索技术,展示了扩散模型在提升AI在复杂语言理解和问题解决任务中能力的潜力。

链接: https://arxiv.org/abs/2410.14157
作者: Jiacheng Ye,Jiahui Gao,Shansan Gong,Lin Zheng,Xin Jiang,Zhenguo Li,Lingpeng Kong
关键词-EN: long-term planning tasks, reasoning and long-term, long-term planning, Multi-granularity Diffusion Modeling, Boolean Satisfiability Problems
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Autoregressive language models, despite their impressive capabilities, struggle with complex reasoning and long-term planning tasks. We introduce discrete diffusion models as a novel solution to these challenges. Through the lens of subgoal imbalance, we demonstrate how diffusion models effectively learn difficult subgoals that elude autoregressive approaches. We propose Multi-granularity Diffusion Modeling (MDM), which prioritizes subgoals based on difficulty during learning. On complex tasks like Countdown, Sudoku, and Boolean Satisfiability Problems, MDM significantly outperforms autoregressive models without using search techniques. For instance, MDM achieves 91.5% and 100% accuracy on Countdown and Sudoku, respectively, compared to 45.8% and 20.7% for autoregressive models. Our work highlights the potential of diffusion-based approaches in advancing AI capabilities for sophisticated language understanding and problem-solving tasks.
摘要:自回归语言模型虽然在能力上令人印象深刻,但在复杂推理和长期规划任务上表现不佳。我们引入了离散扩散模型作为应对这些挑战的新方法。通过子目标不平衡的视角,我们展示了扩散模型如何有效地学习那些自回归方法难以解决的困难子目标。我们提出了多粒度扩散建模 (Multi-granularity Diffusion Modeling, MDM),该方法在学习过程中根据难度优先处理子目标。在诸如倒计时 (Countdown)、数独 (Sudoku) 和布尔可满足性问题 (Boolean Satisfiability Problems) 等复杂任务上,MDM 显著优于自回归模型,且无需使用搜索技术。例如,MDM 在倒计时和数独任务上分别达到了 91.5% 和 100% 的准确率,而自回归模型在这两项任务上的准确率仅为 45.8% 和 20.7%。我们的工作突显了基于扩散的方法在提升 AI 能力以应对复杂语言理解和问题解决任务方面的潜力。

[NLP-62] owards Faithful Natural Language Explanations: A Study Using Activation Patching in Large Language Models

【速读】: 该论文试图解决大语言模型(LLMs)生成的自然语言解释(NLEs)的忠实度问题。现有方法通过在解释或特征层面插入扰动来测量忠实度,但这些方法不够全面且设计不当。论文提出了一种基于因果中介技术的激活补丁方法,称为因果忠实度(Causal Faithfulness),通过量化解释与模型输出之间的因果归属一致性来评估忠实度。关键在于该方法考虑了模型的内部计算过程,并避免了依赖分布外样本的风险,从而提高了忠实度评估的有效性。

链接: https://arxiv.org/abs/2410.14155
作者: Wei Jie Yeo,Ranjan Satapthy,Erik Cambria
关键词-EN: persuasive Natural Language, Natural Language Explanations, Large Language Models, generating persuasive Natural, Large Language
类目: Computation and Language (cs.CL)
备注: Under review

点击查看摘要

Abstract:Large Language Models (LLMs) are capable of generating persuasive Natural Language Explanations (NLEs) to justify their answers. However, the faithfulness of these explanations should not be readily trusted at face value. Recent studies have proposed various methods to measure the faithfulness of NLEs, typically by inserting perturbations at the explanation or feature level. We argue that these approaches are neither comprehensive nor correctly designed according to the established definition of faithfulness. Moreover, we highlight the risks of grounding faithfulness findings on out-of-distribution samples. In this work, we leverage a causal mediation technique called activation patching, to measure the faithfulness of an explanation towards supporting the explained answer. Our proposed metric, Causal Faithfulness quantifies the consistency of causal attributions between explanations and the corresponding model outputs as the indicator of faithfulness. We experimented across models varying from 2B to 27B parameters and found that models that underwent alignment tuning tend to produce more faithful and plausible explanations. We find that Causal Faithfulness is a promising improvement over existing faithfulness tests by taking into account the model’s internal computations and avoiding out of distribution concerns that could otherwise undermine the validity of faithfulness assessments. We release the code at this https URL
摘要:大语言模型 (LLMs) 能够生成具有说服力的自然语言解释 (NLEs) 来为其答案提供依据。然而,这些解释的忠实度不应仅凭表面价值轻易信任。近期研究提出了多种方法来衡量 NLEs 的忠实度,通常通过在解释或特征层面插入扰动来实现。我们认为,这些方法既不全面,也不符合忠实度的既定定义。此外,我们强调了基于分布外样本进行忠实度研究的风险。在本研究中,我们利用一种名为激活补丁的因果中介技术,来衡量解释对其所支持答案的忠实度。我们提出的指标,因果忠实度 (Causal Faithfulness),通过量化解释与相应模型输出之间的因果归属一致性,作为忠实度的指示器。我们在从 2B 到 27B 参数不等的模型上进行了实验,发现经过对齐调优的模型往往能生成更忠实且合理的解释。我们发现,因果忠实度考虑了模型的内部计算,并避免了可能削弱忠实度评估有效性的分布外问题,是对现有忠实度测试的一个有前景的改进。我们已在 this https URL 发布了代码。
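激活补丁 (activation patching) 的基本操作是:把反事实输入前向传播得到的中间激活,替换进原输入的前向过程,观察输出变化,以此估计该激活对输出的因果贡献。下面用一个两层线性 toy 网络演示这一机制;它与论文面向 LLM 的具体实现无关,权重与输入均为虚构,仅作原理示意:

```python
def forward(x, w1, w2, patched_hidden=None):
    """两层线性 toy 网络;patched_hidden 不为 None 时,用它替换第一层激活,即激活补丁。"""
    hidden = [sum(wi * xi for wi, xi in zip(row, x)) for row in w1]
    if patched_hidden is not None:
        hidden = patched_hidden
    return sum(wi * hi for wi, hi in zip(w2, hidden))

w1 = [[1.0, 0.0], [0.0, 1.0]]   # 第一层:恒等映射
w2 = [1.0, -1.0]                # 第二层:线性读出

x_clean, x_corrupt = [2.0, 1.0], [0.0, 0.0]
# 先在反事实输入上前向,记录其第一层激活
h_corrupt = [sum(wi * xi for wi, xi in zip(row, x_corrupt)) for row in w1]

y_clean = forward(x_clean, w1, w2)                               # 原始输出
y_patched = forward(x_clean, w1, w2, patched_hidden=h_corrupt)   # 补丁后输出
effect = y_clean - y_patched  # 输出变化量,可解读为该层激活的因果效应
print(y_clean, y_patched, effect)
```

在论文的设定中,被量化的正是这类因果效应在"解释"与"模型输出"之间是否保持一致,以此作为忠实度的指标。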

[NLP-63] SRAP-Agent : Simulating and Optimizing Scarce Resource Allocation Policy with LLM-based Agent

【速读】: 该论文试图解决公共稀缺资源分配中的效率与公平问题,传统方法因完全信息和个体理性的理想化假设以及数据限制而存在局限。解决方案的关键在于提出了SRAP-Agent框架,通过将大型语言模型(LLMs)集成到经济模拟中,以弥合理论模型与现实动态之间的差距。具体应用在公共住房分配场景中,通过政策优化算法进行广泛的模拟实验,验证了SRAP-Agent的可行性和有效性。

链接: https://arxiv.org/abs/2410.14152
作者: Jiarui Ji,Yang Li,Hongtao Liu,Zhicheng Du,Zhewei Wei,Weiran Shen,Qi Qi,Yankai Lin
关键词-EN: scarce resource allocation, resource allocation plays, Optimizing Scarce Resource, Public scarce resource, scarce resource
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Public scarce resource allocation plays a crucial role in economics as it directly influences the efficiency and equity in society. Traditional studies including theoretical model-based, empirical study-based and simulation-based methods encounter limitations due to the idealized assumption of complete information and individual rationality, as well as constraints posed by limited available data. In this work, we propose an innovative framework, SRAP-Agent (Simulating and Optimizing Scarce Resource Allocation Policy with LLM-based Agent), which integrates Large Language Models (LLMs) into economic simulations, aiming to bridge the gap between theoretical models and real-world dynamics. Using public housing allocation scenarios as a case study, we conduct extensive policy simulation experiments to verify the feasibility and effectiveness of the SRAP-Agent and employ the Policy Optimization Algorithm with certain optimization objectives. The source code can be found in this https URL
摘要:公共稀缺资源分配在经济学中起着至关重要的作用,因为它直接影响社会效率和公平性。传统的研究方法,包括基于理论模型、实证研究和基于模拟的方法,由于完全信息和个人理性的理想化假设以及有限可用数据的限制,遇到了局限性。在本研究中,我们提出了一种创新的框架——SRAP-Agent(基于大语言模型模拟和优化稀缺资源分配政策),该框架将大语言模型 (LLMs) 整合到经济模拟中,旨在弥合理论模型与现实世界动态之间的差距。以公共住房分配场景为例,我们进行了广泛的政策模拟实验,以验证 SRAP-Agent 的可行性和有效性,并采用具有特定优化目标的政策优化算法。源代码可在以下链接中找到:https URL

[NLP-64] Utilizing Large Language Models for Event Deconstruction to Enhance Multimodal Aspect-Based Sentiment Analysis

【速读】: 该论文试图解决多实体和多情感共存场景下的多模态方面级情感分析(MABSA)的复杂性问题。解决方案的关键在于创新性地引入大型语言模型(LLMs)进行事件分解,并提出一个基于强化学习的MABSA-RL框架。该框架通过LLMs将原始文本分解为一系列事件,从而降低分析复杂度,并通过强化学习优化模型参数,实验结果表明该方法在两个基准数据集上优于现有的先进方法。

链接: https://arxiv.org/abs/2410.14150
作者: Xiaoyong Huang,Heli Sun,Qunshu Gao,Wenjie Huang,Ruichen Cao
关键词-EN: Aspect-Based Sentiment Analysis, Multimodal Aspect-Based Sentiment, making Multimodal Aspect-Based, Content continues to increase, User-Generated Content continues
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the rapid development of the internet, the richness of User-Generated Content continues to increase, making Multimodal Aspect-Based Sentiment Analysis (MABSA) a research hotspot. Existing studies have achieved certain results in MABSA, but they have not effectively addressed the analytical challenges in scenarios where multiple entities and sentiments coexist. This paper innovatively introduces Large Language Models (LLMs) for event decomposition and proposes a reinforcement learning framework for Multimodal Aspect-based Sentiment Analysis (MABSA-RL) framework. This framework decomposes the original text into a set of events using LLMs, reducing the complexity of analysis, introducing reinforcement learning to optimize model parameters. Experimental results show that MABSA-RL outperforms existing advanced methods on two benchmark datasets. This paper provides a new research perspective and method for multimodal aspect-level sentiment analysis.
摘要:随着互联网的快速发展,用户生成内容 (User-Generated Content) 的丰富性不断增加,使得多模态基于方面的情感分析 (Multimodal Aspect-Based Sentiment Analysis, MABSA) 成为研究热点。现有研究在 MABSA 方面取得了一定成果,但未能有效解决多实体和多情感共存场景下的分析挑战。本文创新性地引入大语言模型 (Large Language Models, LLMs) 进行事件分解,并提出了一种基于强化学习的多模态基于方面的情感分析 (Multimodal Aspect-based Sentiment Analysis with Reinforcement Learning, MABSA-RL) 框架。该框架利用 LLMs 将原始文本分解为一组事件,降低了分析复杂度,并引入强化学习优化模型参数。实验结果表明,MABSA-RL 在两个基准数据集上优于现有的先进方法。本文为多模态方面级情感分析提供了新的研究视角和方法。

[NLP-65] Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment

【速读】: 该论文试图解决视觉-语言大模型(VLLMs)中的模态对齐问题,特别是由于模态对齐不准确导致的幻觉和生成不安全内容的问题。解决方案的关键是提出了FiSAO(细粒度自对齐优化)方法,该方法利用模型自身的视觉编码器作为细粒度验证器,通过视觉编码器提供的token级反馈来改进视觉-语言对齐,无需额外数据。这种方法显著提升了模态对齐效果,甚至超越了依赖额外数据的传统偏好调优方法。

链接: https://arxiv.org/abs/2410.14148
作者: Chenhang Cui,An Zhang,Yiyang Zhou,Zhaorun Chen,Gelei Deng,Huaxiu Yao,Tat-Seng Chua
关键词-EN: large language models, enhancing the interaction, linguistic modalities, large language, recent advancements
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 23 pages

点击查看摘要

Abstract:The recent advancements in large language models (LLMs) and pre-trained vision models have accelerated the development of vision-language large models (VLLMs), enhancing the interaction between visual and linguistic modalities. Despite their notable success across various domains, VLLMs face challenges in modality alignment, which can lead to issues like hallucinations and unsafe content generation. Current alignment techniques often rely on coarse feedback and external datasets, limiting scalability and performance. In this paper, we propose FiSAO (Fine-Grained Self-Alignment Optimization), a novel self-alignment method that utilizes the model’s own visual encoder as a fine-grained verifier to improve vision-language alignment without the need for additional data. By leveraging token-level feedback from the vision encoder, FiSAO significantly improves vision-language alignment, even surpassing traditional preference tuning methods that require additional data. Through both theoretical analysis and experimental validation, we demonstrate that FiSAO effectively addresses the misalignment problem in VLLMs, marking the first instance of token-level rewards being applied to such models.
摘要:近年来,大语言模型 (LLM) 和预训练视觉模型的进步加速了视觉-语言大模型 (VLLM) 的发展,增强了视觉和语言模态之间的交互。尽管在各个领域取得了显著的成功,VLLM 在模态对齐方面仍面临挑战,可能导致幻觉和生成不安全内容等问题。当前的对齐技术通常依赖于粗略的反馈和外部数据集,限制了其可扩展性和性能。本文提出了一种名为 FiSAO (细粒度自对齐优化) 的新型自对齐方法,该方法利用模型自身的视觉编码器作为细粒度验证器,以改进视觉-语言对齐,而无需额外数据。通过利用视觉编码器提供的 Token 级反馈,FiSAO 显著提升了视觉-语言对齐效果,甚至超越了需要额外数据的传统偏好调优方法。通过理论分析和实验验证,我们证明了 FiSAO 有效地解决了 VLLM 中的对齐问题,标志着 Token 级奖励首次应用于此类模型。

[NLP-66] CAPE: A Chinese Dataset for Appraisal-based Emotional Generation using Large Language Models

【速读】: 该论文试图解决在大语言模型对话中生成情感适当响应的挑战,关键在于引入了一个基于认知评估理论的两阶段自动数据生成框架,创建了名为CAPE的中文情感语料库。该语料库通过考虑多样化的个人和情境因素,促进了上下文适当情感响应的生成。论文提出了情感预测和下一话语预测两项任务,并通过自动化和人工评估验证了基于该数据集训练的对话代理能够更准确地表达人类情感,从而推动了情感表达在对话代理中的进步,为更细致和有意义的人机交互铺平了道路。

链接: https://arxiv.org/abs/2410.14145
作者: June M. Liu,He Cao,Renliang Sun,Rui Wang,Yu Li,Jiaxing Zhang
关键词-EN: large language models, language models presents, significant challenge due, remain largely underexplored, Generating emotionally
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Generating emotionally appropriate responses in conversations with large language models presents a significant challenge due to the complexities of human emotions and cognitive processes, which remain largely underexplored in their critical role in social interactions. In this study, we introduce a two-stage automatic data generation framework to create CAPE, a Chinese dataset named Cognitive Appraisal theory-based Emotional corpus. This corpus facilitates the generation of dialogues with contextually appropriate emotional responses by accounting for diverse personal and situational factors. We propose two tasks utilizing this dataset: emotion prediction and next utterance prediction. Both automated and human evaluations demonstrate that agents trained on our dataset can deliver responses that are more aligned with human emotional expressions. Our study shows the potential for advancing emotional expression in conversational agents, paving the way for more nuanced and meaningful human-computer interactions.
摘要: 在大语言模型中生成情感上适当的对话响应面临着重大挑战,这是由于人类情感和认知过程的复杂性,这些因素在社交互动中的关键作用在很大程度上尚未得到充分探索。在本研究中,我们引入了一个两阶段的自动数据生成框架,用于创建基于认知评估理论的情感语料库 (CAPE),这是一个中文数据集。该语料库通过考虑多样化的个人和情境因素,促进了生成具有情境适当情感响应的对话。我们提出了两个利用该数据集的任务:情感预测和下一话语预测。自动化和人工评估均表明,基于我们数据集训练的 AI 智能体能够提供更符合人类情感表达的响应。我们的研究表明,这为提升对话 AI 智能体的情感表达能力开辟了道路,为更细致和有意义的人机交互奠定了基础。

[NLP-67] A Lightweight Multi Aspect Controlled Text Generation Solution For Large Language Models

【速读】: 该论文试图解决大型语言模型(LLMs)在缺乏高质量指令调优数据的情况下,无法在多方面可控文本生成(MCTG)任务中达到理想性能的问题。解决方案的关键在于提出了一种轻量级的MCTG数据增强管道,通过分析传统数据集中的偏差和相关性,并引入增强的控制属性和句子来解决这些问题。实验结果表明,经过数据增强后的LLMs在MCTG任务中的表现显著提升,准确率提高了20%,且各方面的相关性降低。

链接: https://arxiv.org/abs/2410.14144
作者: Chenyang Zhang,Jiayi Lin,Haibo Tong,Bingxuan Hou,Dongyu Zhang,Jialin Li,Junli Wang
关键词-EN: Large language models, show remarkable abilities, Large language, Controllable Text Generation, show remarkable
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) show remarkable abilities with instruction tuning. However, they fail to achieve ideal tasks when lacking high-quality instruction tuning data on target tasks. Multi-Aspect Controllable Text Generation (MCTG) is a representative task for this dilemma, where aspect datasets are usually biased and correlated. Existing work exploits additional model structures and strategies for solutions, limiting adaptability to LLMs. To activate MCTG ability of LLMs, we propose a lightweight MCTG pipeline based on data augmentation. We analyze bias and correlations in traditional datasets, and address these concerns with augmented control attributes and sentences. Augmented datasets are feasible for instruction tuning. In our experiments, LLMs perform better in MCTG after data augmentation, with a 20% accuracy rise and less aspect correlations.
摘要:大语言模型 (LLMs) 在指令微调方面展现出显著的能力。然而,当缺乏针对目标任务的高质量指令微调数据时,它们无法达到理想的任务效果。多方面可控文本生成 (MCTG) 是这一困境的代表性任务,其中方面数据集通常存在偏差且相互关联。现有工作通过利用额外的模型结构和策略来解决这一问题,但限制了其对 LLMs 的适应性。为了激活 LLMs 的 MCTG 能力,我们提出了一种基于数据增强的轻量级 MCTG 流程。我们分析了传统数据集中的偏差和相关性,并通过增强的控制属性和句子来解决这些问题。增强后的数据集适用于指令微调。在我们的实验中,经过数据增强后,LLMs 在 MCTG 任务中的表现有所提升,准确率提高了 20%,且方面相关性降低。
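针对方面数据集的偏差与相关性,一个直观的增强思路是对控制属性做笛卡尔积组合,使各属性取值均衡出现、互不绑定。下面是这一思路的极简示意;属性取值与指令模板均为假设,并非论文的实际增强流程:

```python
from itertools import product

sentiments = ["积极", "消极"]
topics = ["科技", "体育"]

def build_augmented_instructions(sentiments, topics):
    """对控制属性做全组合,得到属性均衡、去相关的指令微调样本(示意)。"""
    data = []
    for s, t in product(sentiments, topics):
        data.append({
            "instruction": f"请生成一段情感为{s}、主题为{t}的文本。",
            "sentiment": s,
            "topic": t,
        })
    return data

augmented = build_augmented_instructions(sentiments, topics)
print(len(augmented))  # 4:每个情感与每个主题各组合一次,属性间相关性为零
```

全组合保证任一情感取值与任一主题取值同频共现,从源头上消除了"某情感总是伴随某主题"这类数据集偏差。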

[NLP-68] Coherence-Driven Multimodal Safety Dialogue with Active Learning for Embodied Agents

【速读】: 该论文试图解决机器人如何在日常任务中准确解读视觉线索并在多样化的安全关键情境下有效响应的问题,特别是处理如地板上的尖锐物体等紧急情况。解决方案的关键在于提出了M-CoDAL多模态对话系统,该系统利用话语连贯关系增强机器人在安全关键情境中的上下文理解和沟通能力。通过引入基于聚类的主动学习机制,结合外部大型语言模型(LLM)识别信息丰富的实例,系统能够更有效地训练和提升在实际应用中的表现。研究通过新创建的多模态数据集和实际用户研究验证了该系统在提高安全情境解决能力、用户情感响应以及对话安全性方面的有效性。

链接: https://arxiv.org/abs/2410.14141
作者: Sabit Hassan,Hye-Young Chung,Xiang Zhi Tan,Malihe Alikhani
关键词-EN: accurately interpret visual, interpret visual cues, diverse safety-critical situations, safety-critical situations, daily tasks
类目: Robotics (cs.RO); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:When assisting people in daily tasks, robots need to accurately interpret visual cues and respond effectively in diverse safety-critical situations, such as sharp objects on the floor. In this context, we present M-CoDAL, a multimodal-dialogue system specifically designed for embodied agents to better understand and communicate in safety-critical situations. The system leverages discourse coherence relations to enhance its contextual understanding and communication abilities. To train this system, we introduce a novel clustering-based active learning mechanism that utilizes an external Large Language Model (LLM) to identify informative instances. Our approach is evaluated using a newly created multimodal dataset comprising 1K safety violations extracted from 2K Reddit images. These violations are annotated using a Large Multimodal Model (LMM) and verified by human annotators. Results with this dataset demonstrate that our approach improves resolution of safety situations, user sentiment, as well as safety of the conversation. Next, we deploy our dialogue system on a Hello Robot Stretch robot and conduct a within-subject user study with real-world participants. In the study, participants role-play two safety scenarios with different levels of severity with the robot and receive interventions from our model and a baseline system powered by OpenAI’s ChatGPT. The study results corroborate and extend the findings from automated evaluation, showing that our proposed system is more persuasive and competent in a real-world embodied agent setting.
摘要:在协助人们完成日常任务时,机器人需要准确解读视觉线索,并在各种安全关键情境中有效响应,例如地板上的尖锐物体。在此背景下,我们提出了 M-CoDAL,这是一种专为具身智能体设计的多模态对话系统,旨在更好地理解和应对安全关键情境。该系统利用话语连贯关系来增强其上下文理解和沟通能力。为了训练这一系统,我们引入了一种基于聚类的主动学习机制,该机制利用外部大语言模型 (LLM) 来识别信息丰富的实例。我们使用一个新创建的多模态数据集对这一方法进行了评估,该数据集包含从 2K Reddit 图像中提取的 1K 个安全违规案例。这些违规案例通过大型多模态模型 (LMM) 进行标注,并由人工标注者进行验证。结果表明,我们的方法在解决安全情境、用户情感以及对话安全性方面均有提升。接下来,我们将对话系统部署在 Hello Robot Stretch 机器人上,并进行了一项针对真实世界参与者的受试者内用户研究。在研究中,参与者与机器人进行角色扮演,模拟两种不同严重程度的安全场景,并接受由我们的模型和基于 OpenAI 的 ChatGPT 驱动的基线系统提供的干预。研究结果证实并扩展了自动化评估的发现,表明我们提出的系统在真实世界的具身智能体环境中更具说服力和能力。

[NLP-69] ViConsFormer: Constituting Meaningful Phrases of Scene Texts using Transformer-based Method in Vietnamese Text-based Visual Question Answering

【速读】: 该论文试图解决基于文本的视觉问答(Text-based VQA)任务中如何有效利用图像中的场景文本信息的问题。解决方案的关键在于引入了一种新的方法,该方法遵循语言学中对意义的定义,通过嵌入场景文本的2D边界框坐标来考虑其空间信息,从而更有效地利用越南语文本中的信息。实验结果表明,该方法在两个大规模的越南语基于文本的VQA数据集上取得了最先进的性能。

链接: https://arxiv.org/abs/2410.14132
作者: Nghia Hieu Nguyen,Tho Thanh Quan,Ngan Luu-Thuy Nguyen
关键词-EN: Text-based VQA, Vietnamese Text-based VQA, Text-based VQA datasets, scene texts, challenging task
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Text-based VQA is a challenging task that requires machines to use scene texts in given images to yield the most appropriate answer for the given question. The main challenge of text-based VQA is exploiting the meaning and information from scene texts. Recent studies tackled this challenge by considering the spatial information of scene texts in images via embedding 2D coordinates of their bounding boxes. In this study, we follow the definition of meaning from linguistics to introduce a novel method that effectively exploits the information from scene texts written in Vietnamese. Experimental results show that our proposed method obtains state-of-the-art results on two large-scale Vietnamese Text-based VQA datasets. The implementation can be found at this link.
摘要:基于文本的视觉问答 (Text-based VQA) 是一项具有挑战性的任务,要求机器利用给定图像中的场景文本,为提出的问题生成最合适的答案。基于文本的 VQA 的主要挑战在于如何利用场景文本的含义和信息。最近的研究通过考虑图像中场景文本的空间信息,即嵌入其边界框的二维坐标,来应对这一挑战。在本研究中,我们遵循语言学中对意义的定义,提出了一种新颖的方法,能够有效利用以越南语书写的场景文本中的信息。实验结果表明,我们提出的方法在两个大规模的越南语基于文本的 VQA 数据集上取得了最先进的成果。实现代码可在此链接中找到。

[NLP-70] Be My Donor. Transfer the NLP Datasets Between the Languages Using LLM

【速读】: 该论文试图解决如何利用大型语言模型(LLM)将数据集及其标注从一种语言迁移到另一种语言的问题。解决方案的关键在于使用ChatGPT3.5-turbo和Llama-3.1-8b作为核心LLM,构建一个标注迁移的管道,从而在目标语言(如俄语)中实现数据集的快速转换和标注,进而促进资源匮乏语言领域的发展,并为基于BERT的模型训练提供翻译后的数据集以建立基准。

链接: https://arxiv.org/abs/2410.14074
作者: Dmitrii Popov,Egor Terentev,Igor Buyanov
关键词-EN: annotation, Russian pairs translating, target language, LLM to transfer, Abstract
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this work, we investigated how one can use the LLM to transfer the dataset and its annotation from one language to another. This is crucial since sharing the knowledge between different languages could boost certain underresourced directions in the target language, saving lots of efforts in data annotation or quick prototyping. We experiment with English and Russian pairs translating the DEFT corpus. This corpus contains three layers of annotation dedicated to term-definition pair mining, which is a rare annotation type for Russian. We provide a pipeline for the annotation transferring using ChatGPT3.5-turbo and Llama-3.1-8b as core LLMs. In the end, we train the BERT-based models on the translated dataset to establish a baseline.
摘要:在本研究中,我们探讨了如何利用大语言模型 (LLM) 将数据集及其标注从一种语言迁移到另一种语言。这一过程至关重要,因为不同语言之间的知识共享可以促进目标语言中某些资源匮乏领域的发展,从而节省大量数据标注或快速原型设计的工作量。我们以英语和俄语对为例,翻译了 DEFT 语料库。该语料库包含三层标注,专门用于术语-定义对挖掘,这在俄语中是一种罕见的标注类型。我们提供了一个使用 ChatGPT3.5-turbo 和 Llama-3.1-8b 作为核心大语言模型的标注迁移流程。最终,我们在翻译后的数据集上训练了基于 BERT 的模型,以建立一个基准。

[NLP-71] Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers

【速读】: 该论文试图解决视觉语言模型(VLMs)中视觉标记数量过多导致的计算开销问题。解决方案的关键是提出了一种名为Visual Compact Token Registers(Victor)的方法,通过在视觉标记后添加少量可学习的寄存器标记,并利用语言模型前几层将视觉信息压缩到这些寄存器中,从而减少视觉标记的数量。这种方法显著提高了训练和推理的计算效率,同时仅引入少量新的可训练参数,对模型性能影响极小。

链接: https://arxiv.org/abs/2410.14072
作者: Yuxin Wen,Qingqing Cao,Qichen Fu,Sachin Mehta,Mahyar Najibi
关键词-EN: perform complex reasoning, Recent advancements, visual tokens, tokens, real-world applications
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advancements in vision-language models (VLMs) have expanded their potential for real-world applications, enabling these models to perform complex reasoning on images. In the widely used fully autoregressive transformer-based models like LLaVA, projected visual tokens are prepended to textual tokens. Oftentimes, visual tokens are significantly more than prompt tokens, resulting in increased computational overhead during both training and inference. In this paper, we propose Visual Compact Token Registers (Victor), a method that reduces the number of visual tokens by summarizing them into a smaller set of register tokens. Victor adds a few learnable register tokens after the visual tokens and summarizes the visual information into these registers using the first few layers in the language tower of VLMs. After these few layers, all visual tokens are discarded, significantly improving computational efficiency for both training and inference. Notably, our method is easy to implement and requires a small number of new trainable parameters with minimal impact on model performance. In our experiment, with merely 8 visual registers–about 1% of the original tokens–Victor shows less than a 4% accuracy drop while reducing the total training time by 43% and boosting the inference throughput by 3.3X.
摘要:近年来,视觉-语言模型 (Vision-Language Models, VLMs) 的进步扩展了其在实际应用中的潜力,使这些模型能够在图像上进行复杂的推理。在广泛使用的全自回归 Transformer 模型(如 LLaVA)中,经过投影的视觉 Token 被前置于文本 Token 之前。通常,视觉 Token 的数量远多于提示 Token,导致训练和推理过程中的计算开销显著增加。本文提出了一种名为视觉紧凑 Token 寄存器 (Visual Compact Token Registers, Victor) 的方法,通过将视觉 Token 总结为一组较小的寄存器 Token 来减少视觉 Token 的数量。Victor 在视觉 Token 之后添加了少量可学习的寄存器 Token,并利用 VLM 语言塔的前几层将视觉信息总结到这些寄存器中。经过这几层后,所有视觉 Token 都被丢弃,从而显著提高了训练和推理的计算效率。值得注意的是,我们的方法易于实现,仅需少量新的可训练参数,对模型性能的影响极小。在我们的实验中,仅使用 8 个视觉寄存器(约占原始 Token 数量的 1%),Victor 在准确率下降不到 4% 的情况下,将总训练时间减少了 43%,并将推理吞吐量提升至原来的 3.3 倍。
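
Victor"把大量视觉 Token 压缩进少量寄存器、随后丢弃视觉 Token"的流程,可以用一个不含学习过程的玩具示意(真实模型中寄存器是可学习的,并由语言塔前几层通过注意力填充;下面仅以分块平均代替,函数名为本文假设):

```python
import math

def summarize_into_registers(visual_tokens, num_registers):
    """把 N 个视觉 Token 向量压缩为 num_registers 个寄存器向量,
    之后即可丢弃全部视觉 Token。此处每个寄存器取一段连续 Token 的均值。"""
    n = len(visual_tokens)
    chunk = math.ceil(n / num_registers)
    registers = []
    for start in range(0, n, chunk):
        block = visual_tokens[start:start + chunk]
        dim = len(block[0])
        registers.append([sum(v[d] for v in block) / len(block) for d in range(dim)])
    return registers
```

按摘要中的设定,8 个寄存器即可替代约百倍数量的视觉 Token,这正是训练与推理开销下降的来源。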

[NLP-72] Towards Cross-Cultural Machine Translation with Retrieval-Augmented Generation from Multilingual Knowledge Graphs EMNLP2024

【速读】: 该论文试图解决包含实体名称的文本在跨文化翻译中的挑战,特别是由文化相关指称和创译 (transcreation) 导致的语言间差异问题。解决方案的关键在于提出了XC-Translate基准和KG-MT方法。XC-Translate是首个大规模手动创建的专注于包含文化微妙实体名称文本的机器翻译基准。KG-MT则通过密集检索机制将多语言知识图谱的信息整合到神经机器翻译模型中,从而显著提升了翻译性能,相较于NLLB-200和GPT-4分别取得了129%和62%的相对改进。

链接: https://arxiv.org/abs/2410.14057
作者: Simone Conia,Daniel Lee,Min Li,Umar Farooq Minhas,Saloni Potdar,Yunyao Li
关键词-EN: challenging task, Translating text, cultural-related references, references can vary, vary significantly
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at EMNLP 2024

点击查看摘要

Abstract:Translating text that contains entity names is a challenging task, as cultural-related references can vary significantly across languages. These variations may also be caused by transcreation, an adaptation process that entails more than transliteration and word-for-word translation. In this paper, we address the problem of cross-cultural translation on two fronts: (i) we introduce XC-Translate, the first large-scale, manually-created benchmark for machine translation that focuses on text that contains potentially culturally-nuanced entity names, and (ii) we propose KG-MT, a novel end-to-end method to integrate information from a multilingual knowledge graph into a neural machine translation model by leveraging a dense retrieval mechanism. Our experiments and analyses show that current machine translation systems and large language models still struggle to translate texts containing entity names, whereas KG-MT outperforms state-of-the-art approaches by a large margin, obtaining a 129% and 62% relative improvement compared to NLLB-200 and GPT-4, respectively.
摘要:翻译包含实体名称的文本是一项具有挑战性的任务,因为与文化相关的指称在不同语言之间可能存在显著差异。这些差异也可能由创译 (transcreation) 引起,这是一种超越音译和逐字翻译的改编过程。在本文中,我们从两个方面着手解决跨文化翻译问题:(i) 我们引入了 XC-Translate,这是首个大规模、人工创建的机器翻译基准,专注于包含潜在文化细微差别的实体名称的文本;(ii) 我们提出了 KG-MT,一种新颖的端到端方法,通过利用密集检索机制将多语言知识图谱中的信息整合到神经机器翻译模型中。我们的实验和分析表明,当前的机器翻译系统和大语言模型在翻译包含实体名称的文本时仍然面临困难,而 KG-MT 则大幅超越了最先进的方法,相对于 NLLB-200 和 GPT-4 分别获得了 129% 和 62% 的相对改进。

[NLP-73] From Isolated Conversations to Hierarchical Schemas: Dynamic Tree Memory Representation for LLMs

【速读】: 该论文试图解决大型语言模型在长期记忆管理方面的挑战,特别是如何有效组织、检索和整合信息。解决方案的关键是引入了一种名为MemTree的算法,该算法利用动态的树状结构记忆表示来优化信息管理。MemTree通过层次化组织记忆,每个节点包含聚合的文本内容、对应的语义嵌入以及不同抽象层次的信息,从而动态适应新旧信息的语义嵌入计算与比较,增强模型的上下文感知能力。这种结构化的记忆管理方法使得MemTree在处理复杂推理和扩展交互方面比传统的扁平查找表方法更为有效。

链接: https://arxiv.org/abs/2410.14052
作者: Alireza Rezazadeh,Zichao Li,Wei Wei,Yujia Bao
关键词-EN: Recent advancements, large language models, effective long-term memory, context windows, advancements in large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advancements in large language models have significantly improved their context windows, yet challenges in effective long-term memory management remain. We introduce MemTree, an algorithm that leverages a dynamic, tree-structured memory representation to optimize the organization, retrieval, and integration of information, akin to human cognitive schemas. MemTree organizes memory hierarchically, with each node encapsulating aggregated textual content, corresponding semantic embeddings, and varying abstraction levels across the tree’s depths. Our algorithm dynamically adapts this memory structure by computing and comparing semantic embeddings of new and existing information to enrich the model’s context-awareness. This approach allows MemTree to handle complex reasoning and extended interactions more effectively than traditional memory augmentation methods, which often rely on flat lookup tables. Evaluations on benchmarks for multi-turn dialogue understanding and document question answering show that MemTree significantly enhances performance in scenarios that demand structured memory management.
摘要:近年来,大语言模型在上下文窗口方面的进步显著,然而在长期记忆管理的有效性方面仍存在挑战。我们提出了 MemTree,一种利用动态树结构记忆表示的算法,旨在优化信息的组织、检索和整合,类似于人类的认知图式 (cognitive schemas)。MemTree 以层次结构组织记忆,每个节点封装了聚合的文本内容、相应的语义嵌入,以及随树的深度变化的抽象级别。我们的算法通过计算和比较新旧信息的语义嵌入,动态调整这一记忆结构,以增强模型的上下文感知能力。这种方法使得 MemTree 在处理复杂推理和扩展交互方面,比通常依赖平面查找表的传统记忆增强方法更为有效。在多轮对话理解和文档问答基准测试中的评估表明,MemTree 在需要结构化记忆管理的场景中显著提升了性能。
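
MemTree"按语义嵌入把新记忆路由到树中合适位置"的骨架可示意如下(省略了论文中用 LLM 重新聚合节点摘要的步骤;阈值等参数均为本文假设):

```python
class Node:
    def __init__(self, text, embedding):
        self.text, self.embedding, self.children = text, embedding, []

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def insert(root, text, embedding, threshold=0.8):
    """沿树下行:若与最相似子节点的相似度超过阈值则继续深入,
    否则在当前节点下挂一个新叶子。"""
    node = root
    while node.children:
        best = max(node.children, key=lambda c: cosine(embedding, c.embedding))
        if cosine(embedding, best.embedding) < threshold:
            break
        node = best
    node.children.append(Node(text, embedding))
```

相似的记忆因此被归入同一子树,检索时只需沿高相似度的分支下行,而不必在平面查找表上全量扫描。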

[NLP-74] Learning Multimodal Cues of Children's Uncertainty SIGDIAL2023

【速读】: 该论文试图解决多模态AI系统在与用户协作解决问题或引导用户理解复杂概念时,如何理解和预测用户的不确定性问题。解决方案的关键在于首次引入发展心理学和认知心理学专家合作,标注了一个用于研究非言语不确定性线索的数据集,并通过分析不确定性与任务难度及表现的关系,开发了一种能够根据参与者的实时视频片段预测不确定性的多模态机器学习模型,其性能优于基线多模态 Transformer 模型。这一研究不仅有助于人机协作中的认知协调研究,还对手势理解和生成具有广泛的应用前景。

链接: https://arxiv.org/abs/2410.14050
作者: Qi Cheng,Mert İnan,Rahma Mbarki,Grace Grmek,Theresa Choi,Yiming Sun,Kimele Persaud,Jenny Wang,Malihe Alikhani
关键词-EN: achieving common ground, common ground, Understanding uncertainty plays, plays a critical, achieving common
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: SIGDIAL 2023

点击查看摘要

Abstract:Understanding uncertainty plays a critical role in achieving common ground (Clark et al.,1983). This is especially important for multimodal AI systems that collaborate with users to solve a problem or guide the user through a challenging concept. In this work, for the first time, we present a dataset annotated in collaboration with developmental and cognitive psychologists for the purpose of studying nonverbal cues of uncertainty. We then present an analysis of the data, studying different roles of uncertainty and its relationship with task difficulty and performance. Lastly, we present a multimodal machine learning model that can predict uncertainty given a real-time video clip of a participant, which we find improves upon a baseline multimodal transformer model. This work informs research on cognitive coordination between human-human and human-AI and has broad implications for gesture understanding and generation. The anonymized version of our data and code will be publicly available upon the completion of the required consent forms and data sheets.
摘要:理解不确定性在达成共识中起着关键作用 (Clark et al., 1983)。这对于与用户协作解决问题或引导用户理解复杂概念的多模态 AI 系统尤为重要。在本研究中,我们首次与发展心理学和认知心理学专家合作,标注了一个用于研究不确定性非言语线索的数据集。随后,我们对数据进行了分析,研究了不确定性的不同作用及其与任务难度和表现的关系。最后,我们提出了一种多模态机器学习模型,该模型能够根据参与者的实时视频片段预测不确定性,其性能优于基线多模态 Transformer 模型。本研究为人与人、人与 AI 之间的认知协调研究提供了启示,并对手势理解和生成具有广泛影响。在完成所需的同意书和数据表后,我们将公开匿名化的数据和代码。

[NLP-75] Learning Metadata-Agnostic Representations for Text-to-SQL In-Context Example Selection NEURIPS2024

【速读】: 该论文试图解决在自然语言到SQL查询任务中,如何选择最优的上下文学习示例以提高模型性能的问题。解决方案的关键在于提出了一种名为MARLO的方法,通过在共享嵌入空间中对齐自然语言问题和SQL查询的表示,利用查询结构来建模查询意图,而不依赖于底层数据库的元数据。这种方法能够选择结构和语义上与任务相关的示例,而非仅与特定领域或问题表述相关的示例,从而在Spider基准测试中显著提升了执行准确性,并降低了推理延迟。

链接: https://arxiv.org/abs/2410.14049
作者: Chuhong Mai,Ro-ee Tal,Thahir Mohamed
关键词-EN: In-context learning, task demonstrations added, powerful paradigm, paradigm where large, ICL
类目: Computation and Language (cs.CL)
备注: Accepted to NeurIPS 2024 Table Representation Learning workshop

点击查看摘要

Abstract:In-context learning (ICL) is a powerful paradigm where large language models (LLMs) benefit from task demonstrations added to the prompt. Yet, selecting optimal demonstrations is not trivial, especially for complex or multi-modal tasks where input and output distributions differ. We hypothesize that forming task-specific representations of the input is key. In this paper, we propose a method to align representations of natural language questions and those of SQL queries in a shared embedding space. Our technique, dubbed MARLO - Metadata-Agnostic Representation Learning for Text-tO-SQL - uses query structure to model querying intent without over-indexing on underlying database metadata (i.e. tables, columns, or domain-specific entities of a database referenced in the question or query). This allows MARLO to select examples that are structurally and semantically relevant for the task rather than examples that are spuriously related to a certain domain or question phrasing. When used to retrieve examples based on question similarity, MARLO shows superior performance compared to generic embedding models (on average +2.9%pt. in execution accuracy) on the Spider benchmark. It also outperforms the next best method that masks metadata information by +0.8%pt. in execution accuracy on average, while imposing a significantly lower inference latency.
摘要:上下文学习 (In-context learning, ICL) 是一种强大的范式,其中大语言模型 (Large Language Models, LLMs) 通过在提示中添加任务演示而受益。然而,选择最佳的演示并非易事,尤其是在输入和输出分布不同的复杂或多模态任务中。我们假设,形成任务特定的输入表示是关键。本文提出了一种方法,将自然语言问题和 SQL 查询的表示对齐在一个共享的嵌入空间中。我们的技术,称为 MARLO - 元数据无关表示学习用于文本到 SQL (Metadata-Agnostic Representation Learning for Text-tO-SQL),利用查询结构来建模查询意图,而不过度依赖底层数据库元数据(即问题或查询中引用的数据库的表、列或特定领域实体)。这使得 MARLO 能够选择结构上和语义上与任务相关的示例,而不是与特定领域或问题措辞虚假相关的示例。当用于基于问题相似性检索示例时,MARLO 在 Spider 基准测试中表现出优于通用嵌入模型的性能(执行准确率平均提高 2.9 个百分点)。它还以平均 0.8 个百分点的执行准确率优势,超过了次优的屏蔽元数据信息的方法,同时推理延迟显著更低。

[NLP-76] Efficient Retrieval of Temporal Event Sequences from Textual Descriptions

【速读】: 该论文试图解决从文本描述中检索时间事件序列的问题,这对于分析电子商务行为、监控社交媒体活动和追踪犯罪事件等应用至关重要。解决方案的关键在于引入TPP-LLM-Embedding模型,该模型基于TPP-LLM框架,结合大型语言模型和时间点过程,能够同时编码事件类型和时间,并通过池化生成序列级别的表示。通过使用相同的架构嵌入文本描述,确保序列和描述在共享的嵌入空间中,并优化对比损失以增强匹配对之间的相似性,从而实现高效的检索。

链接: https://arxiv.org/abs/2410.14043
作者: Zefang Liu,Yinzhu Quan
关键词-EN: monitoring social media, analyzing e-commerce behavior, social media activities, tracking criminal incidents, e-commerce behavior
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Retrieving temporal event sequences from textual descriptions is essential for applications such as analyzing e-commerce behavior, monitoring social media activities, and tracking criminal incidents. In this paper, we introduce TPP-LLM-Embedding, a unified model for efficiently embedding and retrieving event sequences based on natural language descriptions. Built on the TPP-LLM framework, which integrates large language models with temporal point processes, our model encodes both event types and times, generating a sequence-level representation through pooling. Textual descriptions are embedded using the same architecture, ensuring a shared embedding space for both sequences and descriptions. We optimize a contrastive loss based on similarity between these embeddings, bringing matching pairs closer and separating non-matching ones. TPP-LLM-Embedding enables efficient retrieval and demonstrates superior performance compared to baseline models across diverse datasets.
摘要:从文本描述中检索时间事件序列对于分析电子商务行为、监控社交媒体活动和追踪犯罪事件等应用至关重要。本文介绍了 TPP-LLM-Embedding,这是一种基于自然语言描述高效嵌入和检索事件序列的统一模型。该模型建立在 TPP-LLM 框架之上,该框架将大语言模型与时间点过程相结合,能够同时编码事件类型和时间,并通过池化生成序列级别的表示。文本描述使用相同的架构进行嵌入,确保序列和描述共享相同的嵌入空间。我们基于这些嵌入之间的相似性优化对比损失,使匹配的配对更接近,并分离非匹配的配对。TPP-LLM-Embedding 实现了高效的检索,并在多个数据集上展示了优于基线模型的性能。
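
TPP-LLM-Embedding 所述"拉近匹配对、推远非匹配对"的对比损失属于常见的 InfoNCE 一族,可示意如下(通用骨架,温度参数与具体形式为本文假设,并非论文原式):

```python
import math

def contrastive_loss(seq_embs, desc_embs, temperature=0.1):
    """批内对比:以 seq_embs[i] 与 desc_embs[i] 为正样本对,
    批内其余描述为负样本,最小化交叉熵。"""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    total = 0.0
    for i, s in enumerate(seq_embs):
        logits = [dot(s, d) / temperature for d in desc_embs]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -(logits[i] - log_z)
    return total / len(seq_embs)
```

正确配对的嵌入会得到远低于错位配对的损失,训练即据此把事件序列与其文本描述拉进同一嵌入空间,检索时只需比较向量相似度。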

[NLP-77] Style-Compress: An LLM-Based Prompt Compression Framework Considering Task-Specific Styles EMNLP2024

【速读】: 该论文试图解决在大语言模型使用过程中,如何高效压缩提示(prompt)以减少推理时间和计算成本的问题。解决方案的关键在于提出了Style-Compress框架,该框架利用一个小型语言模型通过风格变化和上下文学习,迭代生成并选择有效的压缩提示,作为任务特定的演示,从而使小型模型能够作为高效的压缩器,无需额外训练即可适应新任务。这种方法在多个任务中表现优于基线模型,且在压缩比为0.25或0.5时,压缩后的提示性能与原始提示相当或更优。

链接: https://arxiv.org/abs/2410.14042
作者: Xiao Pu,Tianxing He,Xiaojun Wan
关键词-EN: compression condenses contexts, condenses contexts, contexts while maintaining, maintaining their informativeness, usage scenarios
类目: Computation and Language (cs.CL)
备注: EMNLP 2024 Findings

点击查看摘要

Abstract:Prompt compression condenses contexts while maintaining their informativeness for different usage scenarios. It not only shortens the inference time and reduces computational costs during the usage of large language models, but also lowers expenses when using closed-source models. In a preliminary study, we discover that when instructing language models to compress prompts, different compression styles (e.g., extractive or abstractive) impact performance of compressed prompts on downstream tasks. Building on this insight, we propose Style-Compress, a lightweight framework that adapts a smaller language model to compress prompts for a larger model on a new task without additional training. Our approach iteratively generates and selects effective compressed prompts as task-specific demonstrations through style variation and in-context learning, enabling smaller models to act as efficient compressors with task-specific examples. Style-Compress outperforms two baseline compression models in four tasks: original prompt reconstruction, text summarization, multi-hop QA, and CoT reasoning. In addition, with only 10 samples and 100 queries for adaptation, prompts compressed by Style-Compress achieve performance on par with or better than original prompts at a compression ratio of 0.25 or 0.5.
摘要:提示压缩在保持信息丰富性的同时,简化了上下文以适应不同的使用场景。这不仅缩短了大语言模型使用时的推理时间,降低了计算成本,还减少了使用闭源模型时的费用。在初步研究中,我们发现,当指导语言模型压缩提示时,不同的压缩风格(例如,抽取式或生成式)会影响压缩提示在下游任务中的表现。基于这一发现,我们提出了 Style-Compress,一个轻量级框架,它使较小的语言模型能够在新任务上为较大的模型压缩提示,而无需额外的训练。我们的方法通过风格变化和上下文学习,迭代生成并选择有效的压缩提示作为任务特定的示范,使较小的模型能够作为具有任务特定示例的高效压缩器。Style-Compress 在四个任务中优于两个基线压缩模型:原始提示重建、文本摘要、多跳问答和 CoT 推理。此外,仅使用 10 个样本和 100 次查询进行适应,Style-Compress 压缩的提示在压缩比为 0.25 或 0.5 时,其性能与原始提示相当或更优。
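
Style-Compress"按不同风格生成压缩候选、再按下游任务表现择优留作示范"的主循环,可以抽象成如下骨架(compress_fn、score_fn 代表 LLM 调用与任务评分,均为本文假设的占位):

```python
def style_compress(prompt, styles, compress_fn, score_fn, n_demos=3):
    """对每种风格(如抽取式/生成式)生成一个压缩候选,
    按下游任务得分排序,保留得分最高的若干条作为上下文示范。"""
    candidates = [(style, compress_fn(prompt, style)) for style in styles]
    ranked = sorted(candidates, key=lambda c: score_fn(c[1]), reverse=True)
    return [text for _, text in ranked[:n_demos]]
```

选出的示范随后被拼入小模型的提示,使其无需再训练即可充当任务特定的压缩器。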

[NLP-78] From Barriers to Tactics: A Behavioral Science-Informed Agentic Workflow for Personalized Nutrition Coaching

【速读】: 该论文试图解决心血管代谢疾病患者在营养管理中面临的个性化障碍问题,解决方案的关键在于利用大型语言模型(LLM)驱动的代理工作流程,通过行为科学原理识别并针对性解决患者的具体饮食障碍。该系统通过两个专门的LLM代理,一个负责探测和识别患者饮食问题的根本原因,另一个则根据患者具体情况提供定制化的克服策略,从而实现个性化、可扩展且基于行为科学的营养指导。

链接: https://arxiv.org/abs/2410.14041
作者: Eric Yang,Tomas Garcia,Hannah Williams,Bhawesh Kumar,Martin Ramé,Eileen Rivera,Yiran Ma,Jonathan Amar,Caricia Catalani,Yugang Jia
关键词-EN: requires sustained positive, positive nutrition habits, sustained positive nutrition, Effective management, conditions requires sustained
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 22 pages

点击查看摘要

Abstract:Effective management of cardiometabolic conditions requires sustained positive nutrition habits, often hindered by complex and individualized barriers. Direct human management is simply not scalable, while previous attempts aimed at automating nutrition coaching lack the personalization needed to address these diverse challenges. This paper introduces a novel LLM-powered agentic workflow designed to provide personalized nutrition coaching by directly targeting and mitigating patient-specific barriers. Grounded in behavioral science principles, the workflow leverages a comprehensive mapping of nutrition-related barriers to corresponding evidence-based strategies. A specialized LLM agent intentionally probes for and identifies the root cause of a patient’s dietary struggles. Subsequently, a separate LLM agent delivers tailored tactics designed to overcome those specific barriers with patient context. We designed and validated our approach through a user study with individuals with cardiometabolic conditions, demonstrating the system’s ability to accurately identify barriers and provide personalized guidance. Furthermore, we conducted a large-scale simulation study, grounding on real patient vignettes and expert-validated metrics, to evaluate the system’s performance across a wide range of scenarios. Our findings demonstrate the potential of this LLM-powered agentic workflow to improve nutrition coaching by providing personalized, scalable, and behaviorally-informed interventions.
摘要:有效管理心血管代谢疾病需要持续的积极营养习惯,但这些习惯往往受到复杂且个性化的障碍阻碍。直接的人工管理无法实现规模化,而以往试图自动化营养指导的尝试缺乏应对这些多样挑战所需的个性化。本文介绍了一种由大语言模型 (LLM) 驱动的智能体工作流程,旨在通过直接针对并缓解患者特定的障碍来提供个性化的营养指导。该工作流程基于行为科学原理,利用营养相关障碍与相应循证策略之间的全面映射。一个专门的 LLM 智能体有目的地探查并识别患者饮食困难的根源。随后,另一个 LLM 智能体结合患者的具体情况,提供旨在克服这些特定障碍的量身定制的策略。我们通过一项针对心血管代谢疾病患者的用户研究设计并验证了我们的方法,展示了系统准确识别障碍并提供个性化指导的能力。此外,我们还基于真实患者案例和专家验证的指标进行了一项大规模模拟研究,以评估系统在广泛场景中的表现。我们的研究结果表明,这种由 LLM 驱动的智能体工作流程能够通过提供个性化、可扩展且基于行为科学的干预来改善营养指导。
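
文中"先探查障碍、再给出对应策略"的双智能体流程,可抽象为一个简单的流水线骨架(probe_fn 与"障碍→循证策略"映射的内容均为本文假设的占位,实际由两个 LLM 智能体完成):

```python
def coaching_turn(patient_msg, probe_fn, barrier_to_tactics):
    """第一步:探查智能体从患者话语中识别障碍标签;
    第二步:按映射取出针对该障碍的循证策略,未命中则回退。"""
    barrier = probe_fn(patient_msg)
    tactics = barrier_to_tactics.get(barrier, ["转交人工营养师跟进"])
    return barrier, tactics
```

把"识别根因"与"给出策略"拆为两步,也便于分别评估两个智能体各自的准确性。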

[NLP-79] Graph Neural Flows for Unveiling Systemic Interactions Among Irregularly Sampled Time Series NEURIPS2024

【速读】: 该论文试图解决在分析相互作用的系统时,单独分析各组成部分难以准确预测系统动态的问题。解决方案的关键在于开发了一种基于图的模型,通过有向无环图(DAG)来建模系统组件之间的条件依赖关系(一种因果表示形式),并与参数化常微分方程(ODEs)解曲线的连续时间模型协同学习该图。这种称为"图神经流"的技术相较于非基于图的方法,以及未建模条件依赖关系的基于图的方法,取得了显著的性能提升。

链接: https://arxiv.org/abs/2410.14030
作者: Giangiacomo Mercatali,Andre Freitas,Jie Chen
关键词-EN: Interacting systems, prevalent in nature, Interacting, conditional dependencies, Abstract
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: NeurIPS 2024. Code is available at this https URL

点击查看摘要

Abstract:Interacting systems are prevalent in nature. It is challenging to accurately predict the dynamics of the system if its constituent components are analyzed independently. We develop a graph-based model that unveils the systemic interactions of time series observed at irregular time points, by using a directed acyclic graph to model the conditional dependencies (a form of causal notation) of the system components and learning this graph in tandem with a continuous-time model that parameterizes the solution curves of ordinary differential equations (ODEs). Our technique, a graph neural flow, leads to substantial enhancements over non-graph-based methods, as well as graph-based methods without the modeling of conditional dependencies. We validate our approach on several tasks, including time series classification and forecasting, to demonstrate its efficacy.
摘要: 交互系统在自然界中普遍存在。如果仅独立分析系统的组成部分,就难以准确预测系统的动态变化。我们开发了一种基于图的模型,通过使用有向无环图来建模系统组件的条件依赖关系(一种因果表示形式),并与参数化常微分方程 (ODE) 解曲线的连续时间模型协同学习该图,从而揭示在不规则时间点观测到的时间序列之间的系统性交互。我们的技术——图神经流 (Graph Neural Flow)——相较于非基于图的方法,以及未建模条件依赖关系的基于图的方法,带来了显著的性能提升。我们在多个任务(包括时间序列分类和预测)上验证了该方法,以展示其有效性。
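
"用有向无环图约束各分量之间谁影响谁、再以连续时间模型刻画动态"的直觉,可以用一个手写动态的欧拉积分示意(真实方法中图结构与 ODE 参数都是学习得到的,此处均为固定的假设设定):

```python
def euler_rollout(parents, x0, dt=0.1, steps=10):
    """parents[i] 列出节点 i 在 DAG 中的父节点;
    每个节点朝其父节点状态之和衰减,图结构因此决定了分量间的影响方向。"""
    x = list(x0)
    for _ in range(steps):
        dx = [sum(x[p] for p in parents[i]) - x[i] for i in range(len(x))]
        x = [xi + dt * dxi for xi, dxi in zip(x, dx)]
    return x
```

在下面的测试里,节点 1 受节点 0 驱动而脱离零点,而没有父节点的孤立节点保持不变,体现了条件依赖结构对动态的约束。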

[NLP-80] Measuring and Modifying the Readability of English Texts with GPT-4

【速读】: 该论文试图解决的问题是大型语言模型(LLMs)是否能够可靠地评估和调整文本的可读性。解决方案的关键在于通过实证研究,发现GPT-4 Turbo和GPT-4o mini在“零样本”情况下生成的可读性估计与人类判断具有较高的相关性(r = 0.76 和 r = 0.74),超越了传统可读性公式和心理语言学指标的表现。此外,通过预注册的人类实验(N = 59),验证了GPT-4 Turbo能够有效地使文本变得更易读或更难读,尽管人类判断中仍存在较大的变异性。研究还讨论了该方法的局限性,包括适用范围的限制以及可读性概念的上下文依赖性。

链接: https://arxiv.org/abs/2410.14028
作者: Sean Trott(1),Pamela D. Rivière(1) ((1) Department of Cognitive Science, University of California San Diego)
关键词-EN: Large Language Models, Language Models, Large Language, success of Large, domains has raised
类目: Computation and Language (cs.CL)
备注: 9 pages, 6 figures, workshop TSAR 2024

点击查看摘要

Abstract:The success of Large Language Models (LLMs) in other domains has raised the question of whether LLMs can reliably assess and manipulate the readability of text. We approach this question empirically. First, using a published corpus of 4,724 English text excerpts, we find that readability estimates produced "zero-shot" from GPT-4 Turbo and GPT-4o mini exhibit relatively high correlation with human judgments (r = 0.76 and r = 0.74, respectively), out-performing estimates derived from traditional readability formulas and various psycholinguistic indices. Then, in a pre-registered human experiment (N = 59), we ask whether Turbo can reliably make text easier or harder to read. We find evidence to support this hypothesis, though considerable variance in human judgments remains unexplained. We conclude by discussing the limitations of this approach, including limited scope, as well as the validity of the "readability" construct and its dependence on context, audience, and goal.
摘要:大语言模型 (LLM) 在其他领域的成功引发了关于 LLM 是否能可靠评估和操控文本可读性的问题。我们通过实证研究来探讨这一问题。首先,使用一个包含 4,724 段英文文本的公开语料库,我们发现 GPT-4 Turbo 和 GPT-4o mini 在“零样本”情况下生成的可读性估计与人类判断具有较高的相关性 (分别为 r = 0.76 和 r = 0.74),超过了传统可读性公式和各种心理语言学指标的估计。接着,在一个预注册的人类实验中 (N = 59),我们探讨了 Turbo 是否能可靠地使文本变得更易读或更难读。实验结果支持了这一假设,尽管人类判断中仍存在相当大的变异未被解释。最后,我们讨论了这种方法的局限性,包括其有限的适用范围,以及“可读性”概念的有效性及其对上下文、受众和目标的依赖。
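
文中报告的 r = 0.76 / 0.74 即皮尔逊相关系数,衡量模型可读性估计与人类评分的线性一致性,其计算为:

```python
def pearson_r(xs, ys):
    """皮尔逊相关系数:协方差除以两列标准差之积,取值范围 [-1, 1]。"""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

完全同向的两列得 1,完全反向得 -1;0.76 表示模型估计与人类判断高度但非完全一致。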

[NLP-81] Generating Signed Language Instructions in Large-Scale Dialogue Systems NAACL2024

【速读】: 该论文试图解决在多模态对话AI平台上实现面向目标的对话系统中融入美国手语(ASL)指令的问题。解决方案的关键在于设计了一个基于大语言模型的手语翻译模块,结合基于Token的视频检索系统,以生成和传递手语教学内容。该系统通过触摸界面接收用户输入,利用检索方法和基于认知的手语词注(gloss)翻译生成ASL指令,并整合了聋人和听力障碍社区以及认知和ASL学习科学专家的见解,以确保手语指令的有效性和准确性。

链接: https://arxiv.org/abs/2410.14026
作者: Mert İnan,Katherine Atwell,Anthony Sicilia,Lorna Quandt,Malihe Alikhani
关键词-EN: American Sign Language, worldwide multimodal conversational, enhanced with American, Large Language Models, American Sign
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024) Industry Track

点击查看摘要

Abstract:We introduce a goal-oriented conversational AI system enhanced with American Sign Language (ASL) instructions, presenting the first implementation of such a system on a worldwide multimodal conversational AI platform. Accessible through a touch-based interface, our system receives input from users and seamlessly generates ASL instructions by leveraging retrieval methods and cognitively based gloss translations. Central to our design is a sign translation module powered by Large Language Models, alongside a token-based video retrieval system for delivering instructional content from recipes and wikiHow guides. Our development process is deeply rooted in a commitment to community engagement, incorporating insights from the Deaf and Hard-of-Hearing community, as well as experts in cognitive and ASL learning sciences. The effectiveness of our signing instructions is validated by user feedback, achieving ratings on par with those of the system in its non-signing variant. Additionally, our system demonstrates exceptional performance in retrieval accuracy and text-generation quality, measured by metrics such as BERTScore. We have made our codebase and datasets publicly accessible at this https URL, and a demo of our signed instruction video retrieval system is available at this https URL.
摘要:我们介绍了一种目标导向的对话式 AI 系统,该系统通过美国手语 (ASL) 指令进行了增强,并在一个全球性的多模态对话式 AI 平台上首次实现了此类系统的部署。通过基于触摸的界面,我们的系统接收用户输入,并利用检索方法和基于认知的手语词注 (gloss) 翻译,无缝生成 ASL 指令。我们的设计核心是一个由大语言模型 (Large Language Models) 驱动的手语翻译模块,以及一个基于 Token 的视频检索系统,用于从食谱和 wikiHow 指南中传递教学内容。我们的开发过程深深植根于对社区参与的承诺,融合了聋人和听力障碍社区以及认知与 ASL 学习科学领域专家的见解。我们的手语指令的有效性通过用户反馈得到了验证,其评分与该系统的非手语版本相当。此外,以 BERTScore 等指标衡量,我们的系统在检索准确性和文本生成质量方面表现出色。我们已将代码库和数据集公开在此 https URL,我们的手语指令视频检索系统演示可在此 https URL 获取。

[NLP-82] LLMs are Biased Teachers: Evaluating LLM Bias in Personalized Education

【速读】: 该论文旨在解决大型语言模型(LLMs)在个性化教育环境中可能存在的偏见问题,特别是这些模型作为“教师”角色时如何生成和选择针对不同人口统计群体的教育内容。解决方案的关键在于引入并应用两种偏见评分指标——平均绝对偏见(MAB)和最大差异偏见(MDB),以分析9种开源和闭源的先进LLMs。通过实验,论文揭示了这些模型在教育解释生成中不仅延续了典型偏见,还产生了反转的有害刻板印象。

链接: https://arxiv.org/abs/2410.14012
作者: Iain Weissburg,Sathvika Anand,Sharon Levy,Haewon Jeong
关键词-EN: large language models, concerns about inherent, gained prominence, increasing adoption, adoption of large
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 46 Pages, 55 Figures, dataset release pending publication

点击查看摘要

Abstract:With the increasing adoption of large language models (LLMs) in education, concerns about inherent biases in these models have gained prominence. We evaluate LLMs for bias in the personalized educational setting, specifically focusing on the models’ roles as “teachers”. We reveal significant biases in how models generate and select educational content tailored to different demographic groups, including race, ethnicity, sex, gender, disability status, income, and national origin. We introduce and apply two bias score metrics–Mean Absolute Bias (MAB) and Maximum Difference Bias (MDB)–to analyze 9 open and closed state-of-the-art LLMs. Our experiments, which utilize over 17,000 educational explanations across multiple difficulty levels and topics, uncover that models perpetuate both typical and inverted harmful stereotypes.
摘要:随着大语言模型 (LLMs) 在教育领域的日益普及,这些模型中固有的偏见问题引起了广泛关注。我们评估了 LLMs 在个性化教育环境中的偏见,特别关注这些模型作为"教师"的角色。我们揭示了模型在为不同人口统计群体(包括种族、民族、生理性别、社会性别、残疾状况、收入和国籍)生成和选择教育内容时存在的显著偏见。我们引入并应用了两种偏见评分指标——平均绝对偏见 (MAB) 和最大差异偏见 (MDB)——来分析 9 个开源和闭源的最新 LLMs。我们的实验利用了超过 17,000 条涵盖多个难度级别和主题的教育解释,发现模型既延续了典型的有害刻板印象,也产生了反转的有害刻板印象。
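
摘要未给出 MAB 与 MDB 的精确定义,按名称字面含义,一种合理的示意性实现如下(与论文原定义可能存在出入):

```python
def mean_absolute_bias(group_scores):
    """MAB:各人口统计群体得分相对跨群体均值的平均绝对偏差。"""
    mean = sum(group_scores.values()) / len(group_scores)
    return sum(abs(s - mean) for s in group_scores.values()) / len(group_scores)

def maximum_difference_bias(group_scores):
    """MDB:任意两个群体得分之间的最大差距。"""
    return max(group_scores.values()) - min(group_scores.values())
```

MAB 反映整体离散程度,MDB 则突出受影响最严重的群体对:两个指标同为 0 才说明各群体被同等对待。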

[NLP-83] Personalized Adaptation via In-Context Preference Learning

【速读】: 该论文试图解决现有强化学习从人类反馈(RLHF)方法在个性化方面的不足,即忽视个体用户偏好导致个性化效果不佳的问题。解决方案的关键在于提出了一种名为"偏好预训练Transformer(PPT)"的新方法,该方法利用Transformer的上下文学习能力,通过在线用户反馈动态适应个体偏好。PPT方法分为两个阶段:离线阶段使用依赖历史的损失函数训练单一策略模型,在线阶段通过上下文学习使模型适应用户偏好。这种方法在上下文多臂老虎机(contextual bandit)设置中展示了优越的个性化适应能力,并显著降低了计算成本。

链接: https://arxiv.org/abs/2410.14001
作者: Allison Lau,Younwoo Choi,Vahid Balazadeh,Keertana Chidambaram,Vasilis Syrgkanis,Rahul G. Krishnan
关键词-EN: Human Feedback, Reinforcement Learning, human preferences, RLHF, Human
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback (RLHF) is widely used to align Language Models (LMs) with human preferences. However, existing approaches often neglect individual user preferences, leading to suboptimal personalization. We present the Preference Pretrained Transformer (PPT), a novel approach for adaptive personalization using online user feedback. PPT leverages the in-context learning capabilities of transformers to dynamically adapt to individual preferences. Our approach consists of two phases: (1) an offline phase where we train a single policy model using a history-dependent loss function, and (2) an online phase where the model adapts to user preferences through in-context learning. We demonstrate PPT’s effectiveness in a contextual bandit setting, showing that it achieves personalized adaptation superior to existing methods while significantly reducing the computational costs. Our results suggest the potential of in-context learning for scalable and efficient personalization in large language models.
摘要:基于人类反馈的强化学习 (Reinforcement Learning from Human Feedback, RLHF) 广泛用于将语言模型 (Language Models, LMs) 与人类偏好对齐。然而,现有方法往往忽视了个体用户的偏好,导致个性化效果不佳。我们提出了偏好预训练 Transformer (Preference Pretrained Transformer, PPT),这是一种利用在线用户反馈进行自适应个性化的创新方法。PPT 利用 Transformer 的上下文学习能力,动态适应个体偏好。我们的方法包括两个阶段:(1) 离线阶段,我们使用依赖历史的损失函数训练单一策略模型;(2) 在线阶段,模型通过上下文学习适应用户偏好。我们在上下文多臂老虎机 (contextual bandit) 设置中展示了 PPT 的有效性,表明其在实现个性化适应方面优于现有方法,同时显著降低了计算成本。我们的结果表明,上下文学习在大语言模型中实现可扩展且高效的个性化方面具有潜力。

[NLP-84] RiTeK: A Dataset for Large Language Models Complex Reasoning over Textual Knowledge Graphs

【速读】: 该论文试图解决复杂现实问题在文本知识图谱(TKGs)中的准确检索问题,特别是面对标注数据稀缺和拓扑结构复杂的情况。解决方案的关键在于开发了一个名为RiTeK的数据集,该数据集具有广泛的拓扑结构,并合成了包含多样化拓扑结构、关系信息和复杂文本描述的真实用户查询。此外,论文提出了一种增强的蒙特卡洛树搜索(MCTS)方法,称为Relational MCTS,用于自动从文本图谱中提取特定查询的关系路径信息,从而增强大型语言模型(LLMs)的推理能力。实验结果表明,RiTeK数据集对当前的检索和LLM系统提出了显著挑战,而Relational MCTS方法在RiTeK上实现了最先进的性能。

链接: https://arxiv.org/abs/2410.13987
作者: Jiatan Huang,Mingchen Li,Zonghai Yao,Zhichao Yang,Yongkang Xiao,Feiyun Ouyang,Xiaohan Li,Shuo Han,Hong Yu
关键词-EN: Answering complex real-world, Large Language Models, complex real-world questions, Answering complex, textual knowledge graphs
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Answering complex real-world questions often requires accurate retrieval from textual knowledge graphs (TKGs). The scarcity of annotated data, along with intricate topological structures, makes this task particularly challenging. As the nature of relational path information could enhance the inference ability of Large Language Models (LLMs), efficiently retrieving more complex relational path information from TKGs presents another key challenge. To tackle these challenges, we first develop a Dataset for LLMs Complex Reasoning over Textual Knowledge Graphs (RiTeK) with a broad topological structure. We synthesize realistic user queries that integrate diverse topological structures, relational information, and complex textual descriptions. We conduct rigorous expert evaluation to validate the quality of our synthesized queries. Then, we introduce an enhanced Monte Carlo Tree Search (MCTS) method, Relational MCTS, to automatically extract relational path information from textual graphs for specific queries. Our dataset mainly covers the medical domain as the relation types and entities are complex and publicly available. Experimental results indicate that RiTeK poses significant challenges for current retrieval and LLM systems, while the proposed Relational MCTS method enhances LLM inference ability and achieves state-of-the-art performance on RiTeK.
摘要:回答复杂的现实世界问题通常需要从文本知识图谱 (Textual Knowledge Graphs, TKGs) 中进行准确检索。由于标注数据的稀缺性以及复杂的拓扑结构,这一任务显得尤为困难。由于关系路径信息的性质可以增强大语言模型 (Large Language Models, LLMs) 的推理能力,因此从 TKGs 中高效检索更复杂的关系路径信息成为另一个关键挑战。为了应对这些挑战,我们首先开发了一个面向大语言模型在文本知识图谱上进行复杂推理的数据集 (Dataset for LLMs Complex Reasoning over Textual Knowledge Graphs, RiTeK),该数据集具有广泛的拓扑结构。我们合成了融入多样化拓扑结构、关系信息和复杂文本描述的真实用户查询。通过严格的专家评估验证了我们合成查询的质量。随后,我们引入了一种增强的蒙特卡洛树搜索 (Monte Carlo Tree Search, MCTS) 方法,即关系 MCTS (Relational MCTS),用于自动从文本图谱中提取特定查询的关系路径信息。我们的数据集主要涵盖医疗领域,因为该领域的关系类型和实体复杂且公开可用。实验结果表明,RiTeK 对当前的检索和大语言模型系统提出了显著挑战,而提出的关系 MCTS 方法则增强了大语言模型的推理能力,并在 RiTeK 上实现了最先进的性能。
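
摘要未给出 Relational MCTS 的细节;下面用标准 MCTS 的四个步骤(选择、扩展、模拟、回传)在一个玩具知识图谱上搜索关系路径,示意其基本骨架。图结构与奖励函数均为假设,与 RiTeK 的医疗图谱无关。

```python
import math
import random

# Hypothetical toy graph: entity -> [(relation, neighbor), ...]
GRAPH = {
    "aspirin":    [("treats", "headache"), ("interacts_with", "warfarin")],
    "headache":   [("symptom_of", "migraine")],
    "warfarin":   [("treats", "thrombosis")],
    "migraine":   [],
    "thrombosis": [],
}

def reward(path):
    # Stand-in for a query-specific scorer: prefer paths reaching "migraine".
    return 1.0 if path and path[-1][1] == "migraine" else 0.0

class Node:
    def __init__(self, entity, path=()):
        self.entity, self.path = entity, tuple(path)
        self.children, self.visits, self.value = None, 0, 0.0

def ucb(parent, child, c=1.4):
    if child.visits == 0:
        return float("inf")
    return (child.value / child.visits
            + c * math.sqrt(math.log(parent.visits) / child.visits))

def rollout(node, depth=3):
    entity, path = node.entity, list(node.path)
    for _ in range(depth):                      # random simulation
        edges = GRAPH[entity]
        if not edges:
            break
        rel, entity = random.choice(edges)
        path.append((rel, entity))
    return reward(path)

def mcts(root_entity, iters=200, seed=0):
    random.seed(seed)
    root = Node(root_entity)
    for _ in range(iters):
        node, trail = root, [root]
        while node.children:                    # selection by UCB
            parent = node
            node = max(node.children, key=lambda ch: ucb(parent, ch))
            trail.append(node)
        if node.children is None:               # expansion
            node.children = [Node(e, node.path + ((r, e),))
                             for r, e in GRAPH[node.entity]]
        value = rollout(node)                   # simplified: rollout from here
        for n in trail:                         # backpropagation
            n.visits += 1
            n.value += value
    # Return the relation path of the most-visited first hop.
    best = max(root.children, key=lambda ch: ch.visits)
    return best.path

print(mcts("aspirin"))
```

在该玩具图上,搜索会偏向通往奖励实体的分支,返回根节点下访问次数最多的首跳关系。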

[NLP-85] Are LLMs Models of Distributional Semantics? A Case Study on Quantifiers

【速读】: 该论文试图解决的问题是评估分布语义学模型在处理模糊(如“许多”)和精确(如“超过一半”)量词时的表现,并探讨这些模型是否如预期那样在模糊量词上表现更好,而在精确量词上表现较差。解决方案的关键在于通过广泛的模型类型进行实证研究,结果发现大型语言模型(LLMs)在精确量词上的表现更接近人类判断,这与传统假设相悖,从而呼吁重新评估分布语义学模型的能力和局限性。

链接: https://arxiv.org/abs/2410.13984
作者: Zhang Enyan,Zewei Wang,Michael A. Lepori,Ellie Pavlick,Helena Aparicio
关键词-EN: Distributional semantics, distributional semantics models, natural language, semantics, Distributional
类目: Computation and Language (cs.CL)
备注: 9 Pages, 3 Figures

点击查看摘要

Abstract:Distributional semantics is the linguistic theory that a word’s meaning can be derived from its distribution in natural language (i.e., its use). Language models are commonly viewed as an implementation of distributional semantics, as they are optimized to capture the statistical features of natural language. It is often argued that distributional semantics models should excel at capturing graded/vague meaning based on linguistic conventions, but struggle with truth-conditional reasoning and symbolic processing. We evaluate this claim with a case study on vague (e.g. “many”) and exact (e.g. “more than half”) quantifiers. Contrary to expectations, we find that, across a broad range of models of various types, LLMs align more closely with human judgements on exact quantifiers versus vague ones. These findings call for a re-evaluation of the assumptions underpinning what distributional semantics models are, as well as what they can capture.
摘要:分布式语义学是一种语言学理论,认为一个词的意义可以通过其在自然语言中的分布(即使用情况)来推导。语言模型通常被视为分布式语义学的实现,因为它们被优化以捕捉自然语言的统计特征。人们常常认为,分布式语义学模型应擅长捕捉基于语言习惯的渐进/模糊意义,但在真值条件推理和符号处理方面表现不佳。我们通过一个关于模糊(如“许多”)和精确(如“超过一半”)量词的案例研究来评估这一观点。出乎意料的是,我们发现,在各种类型的模型中,大语言模型在精确量词上的表现更接近人类判断,而非模糊量词。这些发现要求我们重新评估支撑分布式语义学模型的假设,以及它们所能捕捉的内容。
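
二者的语义差异可以用真值条件直接写出来:"超过一半"有硬性阈值,"许多"只有软性、依赖语境的阈值。下例中 theta=0.4 是任意选取的示意值,并非语言学定论:

```python
def more_than_half(sample, predicate):
    """Exact quantifier: a hard, bivalent truth condition."""
    return sum(map(predicate, sample)) > len(sample) / 2

def many(sample, predicate, theta=0.4):
    """Vague quantifier: a graded judgment against a soft threshold.
    theta = 0.4 is an illustrative choice, not a linguistic fact."""
    rate = sum(map(predicate, sample)) / len(sample)
    return min(1.0, rate / theta)       # degree of truth in [0, 1]

def is_red(dot):
    return dot == "red"

dots = ["red"] * 6 + ["blue"] * 4
print(more_than_half(dots, is_red))     # True: 6/10 > 1/2
print(many(dots, is_red))               # 1.0: well above the soft threshold
```

精确量词对同一情形给出确定的真假,模糊量词则给出 [0, 1] 上的程度值,这正是论文所检验的对比。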

[NLP-86] Debiasing Large Vision-Language Models by Ablating Protected Attribute Representations NEURIPS

【速读】: 该论文试图解决大型视觉语言模型(LVLMs)如LLaVA在处理包含不同人口统计学特征的图像时,由于训练数据中的社会偏见而产生的偏差问题。解决方案的关键在于提出了一种新颖的去偏框架,通过在文本生成过程中直接剔除偏见属性,避免生成与受保护属性相关的文本,甚至不在模型内部表示这些属性。该方法无需重新训练模型,仅需少量代表性偏见输出样本(约1000个),即可有效减少模型生成受保护属性相关文本的倾向,同时保持对真实数据(如COCO数据集)的标注性能。实验结果表明,去偏后的模型在生成准确性上与基准偏见模型相当,表明去偏效果可以在不牺牲模型性能的情况下实现。

链接: https://arxiv.org/abs/2410.13976
作者: Neale Ratzlaff,Matthew Lyle Olson,Musashi Hinck,Shao-Yen Tseng,Vasudev Lal,Phillip Howard
关键词-EN: Large Vision Language, Vision Language Models, Large Vision, Vision Language, demonstrated impressive capabilities
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: NeurIPS workshop on SafeGenAI, 10 pages, 2 figures

点击查看摘要

Abstract:Large Vision Language Models (LVLMs) such as LLaVA have demonstrated impressive capabilities as general-purpose chatbots that can engage in conversations about a provided input image. However, their responses are influenced by societal biases present in their training datasets, leading to undesirable differences in how the model responds when presented with images depicting people of different demographics. In this work, we propose a novel debiasing framework for LVLMs by directly ablating biased attributes during text generation to avoid generating text related to protected attributes, or even representing them internally. Our method requires no training and a relatively small amount of representative biased outputs (~1000 samples). Our experiments show that not only can we minimize the propensity of LVLMs to generate text related to protected attributes, but we can even use synthetic data to inform the ablation while retaining captioning performance on real data such as COCO. Furthermore, we find the resulting generations from a debiased LVLM exhibit similar accuracy as a baseline biased model, showing that debiasing effects can be achieved without sacrificing model performance.
摘要:大型视觉语言模型 (Large Vision Language Models, LVLMs) 如 LLaVA 展示了作为通用聊天机器人的显著能力,能够就提供的输入图像进行对话。然而,其响应受到训练数据集中存在的社会偏见的影响,导致模型在处理描绘不同人群的图像时产生不理想的差异性响应。在本研究中,我们提出了一种新颖的去偏框架,通过在文本生成过程中直接消除偏见属性,以避免生成与受保护属性相关的文本,甚至避免在内部表示这些属性。我们的方法无需训练,并且仅需要相对少量的代表性偏见输出样本 (~1000 个样本)。实验结果表明,我们不仅能够最小化 LVLMs 生成与受保护属性相关文本的倾向,还可以利用合成数据来指导消除偏见,同时保留在真实数据如 COCO 上的标注性能。此外,我们发现去偏后的 LVLM 生成的结果与基线偏见模型相比,准确性相似,表明去偏效果可以在不牺牲模型性能的情况下实现。
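
论文的消融发生在生成过程中的内部表示上;其底层线性操作可以示意为"把激活投影到属性方向的正交补"。以下激活为合成数据,属性方向用均值差估计,均非 LLaVA 的真实表示:

```python
import numpy as np

def attribute_direction(biased_acts, neutral_acts):
    """Estimate a protected-attribute direction as the difference of mean
    activations (a simple difference-of-means probe)."""
    d = biased_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(acts, direction):
    """Project activations onto the orthogonal complement of `direction`,
    removing the component that encodes the attribute."""
    return acts - np.outer(acts @ direction, direction)

rng = np.random.default_rng(0)
hidden = 16
# Synthetic stand-ins for hidden activations; the attribute lives on one axis.
axis = rng.normal(size=hidden)
axis /= np.linalg.norm(axis)
neutral = rng.normal(size=(100, hidden))
biased = neutral + 3.0 * axis

d = attribute_direction(biased, neutral)
cleaned = ablate(biased, d)
print(float(np.abs(cleaned @ d).max()))   # ~0: attribute component removed
```

消融后,激活在属性方向上的分量恰好为零,模型后续计算便"看不到"该属性。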

[NLP-87] Detecting AI-Generated Texts in Cross-Domains

【速读】: 该论文试图解决现有工具在检测由大型语言模型(LLM)生成的文本时,面对新领域文本性能下降的问题。解决方案的关键在于训练一个名为RoBERTa-Ranker的排序分类器,该模型基于RoBERTa并使用包含多种人类和LLM生成文本的数据集进行训练。论文提出了一种仅需少量新领域标注数据的微调方法,使得该模型在不同领域和不同LLM生成的文本上均表现优异,超越了现有的DetectGPT和GPTZero模型。这种方法使得构建一个跨领域检测AI生成文本的单一系统变得可行且经济。

链接: https://arxiv.org/abs/2410.13966
作者: You Zhou,Jie Wang
关键词-EN: large language model, Existing tools, large language, performance can drop, drop when dealing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing tools to detect text generated by a large language model (LLM) have met with certain success, but their performance can drop when dealing with texts in new domains. To tackle this issue, we train a ranking classifier called RoBERTa-Ranker, a modified version of RoBERTa, as a baseline model using a dataset we constructed that includes a wider variety of texts written by humans and generated by various LLMs. We then present a method to fine-tune RoBERTa-Ranker that requires only a small amount of labeled data in a new domain. Experiments show that this fine-tuned domain-aware model outperforms the popular DetectGPT and GPTZero on both in-domain and cross-domain texts, where AI-generated texts may either be in a different domain or generated by a different LLM not used to generate the training datasets. This approach makes it feasible and economical to build a single system to detect AI-generated texts across various domains.
摘要:现有的用于检测大语言模型 (LLM) 生成文本的工具已取得了一定的成功,但在处理新领域文本时,其性能可能会下降。为了解决这一问题,我们训练了一个名为 RoBERTa-Ranker 的排序分类器,这是 RoBERTa 的一个改进版本,作为基准模型使用我们构建的数据集,该数据集包含了由人类编写和各种 LLM 生成的更广泛的文本。随后,我们提出了一种仅需少量新领域标注数据即可微调 RoBERTa-Ranker 的方法。实验表明,这种经过微调的领域感知模型在域内和跨域文本上均优于流行的 DetectGPT 和 GPTZero,其中 AI 生成的文本可能属于不同领域或由未用于生成训练数据集的不同 LLM 生成。这种方法使得构建一个能够跨多个领域检测 AI 生成文本的单一系统变得可行且经济。
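
微调的要点是"从已有检测器的参数热启动,再用少量新域标注样本继续训练"。下面用纯 numpy 逻辑回归在合成特征上示意该流程,特征与域漂移方式均为假设,并非 RoBERTa-Ranker 实现:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def train_logreg(X, y, w=None, lr=0.5, epochs=300):
    """Plain logistic regression by gradient descent; passing an existing
    `w` warm-starts training, which is the spirit of domain fine-tuning."""
    if w is None:
        w = np.zeros(X.shape[1])
    for _ in range(epochs):
        w = w - lr * X.T @ (sigmoid(X @ w) - y) / len(y)
    return w

rng = np.random.default_rng(0)
dim = 8
# "Source domain": AI-generated texts separated along direction u.
u = rng.normal(size=dim)
Xs = rng.normal(size=(400, dim))
ys = (Xs @ u > 0).astype(float)
w = train_logreg(Xs, ys)

# "New domain": the separating direction has drifted.
v = u + 1.5 * rng.normal(size=dim)
Xt = rng.normal(size=(400, dim))
yt = (Xt @ v > 0).astype(float)
acc_before = float(((sigmoid(Xt @ w) > 0.5) == yt).mean())

# Fine-tune with only 40 labeled examples from the new domain.
w_ft = train_logreg(Xt[:40], yt[:40], w=w.copy(), epochs=200)
acc_after = float(((sigmoid(Xt @ w_ft) > 0.5) == yt).mean())
print(acc_before, acc_after)
```

少量新域标注即可把决策方向从旧域拉向新域,这与论文"仅需少量标注数据"的设定一致。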

[NLP-88] From Single to Multi: How LLMs Hallucinate in Multi-Document Summarization

【速读】: 该论文试图解决多文档摘要任务中大语言模型(LLMs)产生的幻觉问题。解决方案的关键在于创建了两个新的多文档基准数据集,用于评估LLMs在处理多文档时的幻觉表现。研究发现,平均高达75%的生成摘要内容存在幻觉,且幻觉更可能出现在摘要的末尾。论文还指出,简单的后处理方法在减少幻觉方面效果有限,强调了需要更系统化的方法来有效缓解多文档摘要中的幻觉问题。

链接: https://arxiv.org/abs/2410.13961
作者: Catarina G. Belem,Pouya Pezeskhpour,Hayate Iso,Seiji Maekawa,Nikita Bhutani,Estevam Hruschka
关键词-EN: tasks remains largely, remains largely unexplored, large language models, single-document tasks, tasks remains
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Although many studies have investigated and reduced hallucinations in large language models (LLMs) for single-document tasks, research on hallucination in multi-document summarization (MDS) tasks remains largely unexplored. Specifically, it is unclear how the challenges arising from handling multiple documents (e.g., repetition and diversity of information) affect models' outputs. In this work, we investigate how hallucinations manifest in LLMs when summarizing topic-specific information from multiple documents. Since no benchmarks exist for investigating hallucinations in MDS, we use existing news and conversation datasets, annotated with topic-specific insights, to create two novel multi-document benchmarks. When evaluating 5 LLMs on our benchmarks, we observe that on average, up to 75% of the content in LLM-generated summary is hallucinated, with hallucinations more likely to occur towards the end of the summaries. Moreover, when summarizing non-existent topic-related information, gpt-3.5-turbo and GPT-4o still generate summaries about 79.35% and 44% of the time, raising concerns about their tendency to fabricate content. To understand the characteristics of these hallucinations, we manually evaluate 700+ insights and find that most errors stem from either failing to follow instructions or producing overly generic insights. Motivated by these observations, we investigate the efficacy of simple post-hoc baselines in mitigating hallucinations but find them only moderately effective. Our results underscore the need for more effective approaches to systematically mitigate hallucinations in MDS. We release our dataset and code at this http URL.
摘要:尽管许多研究已经探讨并减少了在单文档任务中大语言模型 (LLM) 的幻觉现象,但对于多文档摘要 (MDS) 任务中的幻觉研究仍处于初步阶段。具体而言,处理多文档时所面临的挑战(如信息的重复性和多样性)如何影响模型输出尚不清楚。在本研究中,我们探讨了在从多文档中总结特定主题信息时,LLM 如何表现出幻觉现象。由于目前尚无针对 MDS 中幻觉现象的基准测试,我们利用现有的新闻和对话数据集,并注释了特定主题的见解,创建了两个新的多文档基准测试。在评估 5 个 LLM 在我们的基准测试时,我们发现平均而言,LLM 生成的摘要中高达 75% 的内容存在幻觉,且幻觉更可能出现在摘要的末尾。此外,在总结不存在的话题相关信息时,gpt-3.5-turbo 和 GPT-4o 仍然分别有 79.35% 和 44% 的时间生成摘要,这引发了对其内容捏造倾向的担忧。为了理解这些幻觉的特征,我们手动评估了 700 多个见解,发现大多数错误源于未能遵循指令或生成过于泛泛的见解。基于这些观察,我们研究了简单的事后基线方法在减少幻觉方面的有效性,但发现它们仅具有中等效果。我们的结果强调了在 MDS 中系统性减少幻觉需要更有效的方法。我们在此 http URL 发布了数据集和代码。
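
摘要提到的"简单事后基线"之一,是按与源文档的词面重合度过滤摘要句。以下示意代码中的阈值 0.5 与示例文本均为假设:

```python
def support_score(sentence, documents):
    """Fraction of a sentence's content words that appear in any source
    document: a crude proxy for whether the sentence is grounded."""
    words = {w for w in sentence.lower().split() if len(w) > 3}
    if not words:
        return 1.0
    source = set(" ".join(documents).lower().split())
    return len(words & source) / len(words)

def filter_summary(summary_sentences, documents, threshold=0.5):
    """Post-hoc baseline: drop sentences with little lexical support."""
    return [s for s in summary_sentences
            if support_score(s, documents) >= threshold]

docs = ["The company reported rising profits in the third quarter.",
        "Analysts attributed the growth to strong overseas demand."]
summary = ["Profits rose in the third quarter.",
           "The chief executive announced a merger with a rival firm."]
print(filter_summary(summary, docs))
```

这类词面基线能去掉完全无据可依的句子,但对语义层面的细微捏造无能为力,与论文观察到的"仅中等有效"一致。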

[NLP-89] FinQAPT: Empowering Financial Decisions with End-to-End LLM-driven Question Answering Pipeline

【速读】: 该论文试图解决金融决策中从大量文档中提取相关信息的问题。解决方案的关键在于开发了一个名为FinQAPT的端到端管道,该管道通过查询识别相关财务报告,提取相关上下文,并利用大型语言模型(LLMs)执行下游任务。关键技术包括一种基于聚类的负采样技术以增强上下文提取,以及一种称为动态N-shot提示的新提示方法,以提升LLMs的数值问答能力。尽管在模块级别上达到了80.6%的先进准确率,但在管道级别上由于从财务报告中提取相关上下文的挑战,性能有所下降。

链接: https://arxiv.org/abs/2410.13959
作者: Kuldeep Singh,Simerjot Kaur,Charese Smiley
关键词-EN: Large Language Models, relevant information embedded, Financial decision-making hinges, leverages Large Language, decision-making hinges
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted in ICAIF 2024, 8 pages, 5 figures, 4 tables

点击查看摘要

Abstract:Financial decision-making hinges on the analysis of relevant information embedded in the enormous volume of documents in the financial domain. To address this challenge, we developed FinQAPT, an end-to-end pipeline that streamlines the identification of relevant financial reports based on a query, extracts pertinent context, and leverages Large Language Models (LLMs) to perform downstream tasks. To evaluate the pipeline, we experimented with various techniques to optimize the performance of each module using the FinQA dataset. We introduced a novel clustering-based negative sampling technique to enhance context extraction and a novel prompting method called Dynamic N-shot Prompting to boost the numerical question-answering capabilities of LLMs. At the module level, we achieved state-of-the-art accuracy on FinQA, attaining an accuracy of 80.6%. However, at the pipeline level, we observed decreased performance due to challenges in extracting relevant context from financial reports. We conducted a detailed error analysis of each module and the end-to-end pipeline, pinpointing specific challenges that must be addressed to develop a robust solution for handling complex financial tasks.
摘要:金融决策依赖于对金融领域大量文档中嵌入的相关信息的分析。为了应对这一挑战,我们开发了 FinQAPT,这是一个端到端的流程,能够根据查询快速识别相关的财务报告,提取相关上下文,并利用大语言模型 (LLMs) 执行下游任务。为了评估该流程,我们使用 FinQA 数据集对各个模块的优化技术进行了实验。我们引入了一种基于聚类的负采样技术来增强上下文提取,并提出了一种名为动态 N-shot 提示 (Dynamic N-shot Prompting) 的新提示方法,以提升 LLMs 的数值问题回答能力。在模块层面,我们在 FinQA 上达到了最先进的准确率,达到了 80.6%。然而,在流程层面,由于从财务报告中提取相关上下文的挑战,我们观察到性能有所下降。我们对每个模块和端到端流程进行了详细的错误分析,指出了必须解决的具体挑战,以开发出能够处理复杂金融任务的稳健解决方案。
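
基于聚类的负采样的思路:先对候选段落的嵌入做聚类,再从正例所在簇内采样"形似但非正确"的困难负例。以下嵌入为合成数据,k-means 也是极简实现,并非 FinQAPT 的具体做法:

```python
import numpy as np

def kmeans(X, k, iters=20):
    # Farthest-point initialization keeps this sketch deterministic.
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d))])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def hard_negatives(X, labels, positive_idx, n, seed=0):
    """Sample negatives from the positive's own cluster: contexts that look
    like the gold passage but are not it."""
    rng = np.random.default_rng(seed)
    pool = [i for i in np.flatnonzero(labels == labels[positive_idx])
            if i != positive_idx]
    return list(rng.choice(pool, size=min(n, len(pool)), replace=False))

rng = np.random.default_rng(1)
# Hypothetical passage embeddings forming two topical clusters.
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 4)),
               rng.normal(5.0, 0.3, size=(20, 4))])
labels = kmeans(X, k=2)
negs = hard_negatives(X, labels, positive_idx=3, n=5)
print(sorted(int(i) for i in negs))
```

同簇负例比随机负例更"难",迫使上下文提取器学会细粒度区分,这是该类负采样的通常动机。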

[NLP-90] Identifying High Consideration E-Commerce Search Queries EMNLP2024

【速读】: 该论文试图解决在电子商务中识别高考虑度(High Consideration, HC)查询的问题,以便通过针对性的用户体验(如定制的问答小部件)帮助用户做出购买决策。解决方案的关键在于提出了一种基于用户参与度的查询排序(Engagement-based Query Ranking, EQR)方法,该方法侧重于利用与用户行为、财务和目录信息相关的查询级别特征,而非传统的流行度信号,来预测用户在产品搜索过程中与相关购物知识内容的潜在参与水平。实验结果表明,该方法在离线实验中表现出强大的排序性能,并且在商业部署中显示出优于人工选择的查询对下游客户影响的提升。

链接: https://arxiv.org/abs/2410.13951
作者: Zhiyu Chen,Jason Choi,Besnik Fetahu,Shervin Malmasi
关键词-EN: missions typically require, typically require careful, substantial research investment, elaborate decision making, high consideration
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted by EMNLP 2024 (Industry Track)

点击查看摘要

Abstract:In e-commerce, high consideration search missions typically require careful and elaborate decision making, and involve a substantial research investment from customers. We consider the task of identifying High Consideration (HC) queries. Identifying such queries enables e-commerce sites to better serve user needs using targeted experiences such as curated QA widgets that help users reach purchase decisions. We explore the task by proposing an Engagement-based Query Ranking (EQR) approach, focusing on query ranking to indicate potential engagement levels with query-related shopping knowledge content during product search. Unlike previous studies on predicting trends, EQR prioritizes query-level features related to customer behavior, finance, and catalog information rather than popularity signals. We introduce an accurate and scalable method for EQR and present experimental results demonstrating its effectiveness. Offline experiments show strong ranking performance. Human evaluation shows a precision of 96% for HC queries identified by our model. The model was commercially deployed, and shown to outperform human-selected queries in terms of downstream customer impact, as measured through engagement.
摘要:在电子商务中,高考虑度的搜索任务通常需要用户进行谨慎且细致的决策,并涉及大量的研究投入。我们考虑识别高考虑度 (High Consideration, HC) 查询的任务。识别此类查询使电子商务网站能够通过针对性的体验更好地满足用户需求,例如定制的问答小部件,帮助用户做出购买决策。我们通过提出基于参与度的查询排序 (Engagement-based Query Ranking, EQR) 方法来探索这一任务,重点在于查询排序,以指示在产品搜索过程中与查询相关的购物知识内容的潜在参与水平。与以往预测趋势的研究不同,EQR 优先考虑与客户行为、财务和目录信息相关的查询级特征,而非流行度信号。我们引入了一种准确且可扩展的 EQR 方法,并展示了其实验结果,证明了其有效性。离线实验显示了强大的排序性能。人工评估显示,我们的模型识别 HC 查询的准确率为 96%。该模型已商业化部署,并在下游客户影响方面表现优于人工选择的查询,通过参与度进行衡量。

[NLP-91] Boosting LLM Translation Skills without General Ability Loss via Rationale Distillation

【速读】: 该论文试图解决大语言模型(LLMs)在机器翻译任务中性能提升的同时,避免因微调导致的指令遵循能力和人类偏好对齐能力的丧失,以及由此带来的潜在安全风险问题。解决方案的关键在于提出了一种名为RaDis(Rationale Distillation)的新方法,通过利用LLMs的强大生成能力为训练数据创建理性(rationales),并在训练过程中“重放”这些理性,以防止遗忘。这些理性封装了通用知识和安全原则,作为自我蒸馏的目标来规范训练过程。通过同时训练参考翻译和自我生成的理性,模型能够在学习新的翻译技能的同时,保持其整体通用能力。

链接: https://arxiv.org/abs/2410.13944
作者: Junhong Wu,Yang Zhao,Yangyifan Xu,Bing Liu,Chengqing Zong
关键词-EN: Large Language Models, Large Language, achieved impressive results, numerous NLP tasks, numerous NLP
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved impressive results across numerous NLP tasks but still encounter difficulties in machine translation. Traditional methods to improve translation have typically involved fine-tuning LLMs using parallel corpora. However, vanilla fine-tuning often leads to catastrophic forgetting of the instruction-following capabilities and alignment with human preferences, compromising their broad general abilities and introducing potential security risks. These abilities, which are developed using proprietary and unavailable training data, make existing continual instruction tuning methods ineffective. To overcome this issue, we propose a novel approach called RaDis (Rationale Distillation). RaDis harnesses the strong generative capabilities of LLMs to create rationales for training data, which are then “replayed” to prevent forgetting. These rationales encapsulate general knowledge and safety principles, acting as self-distillation targets to regulate the training process. By jointly training on both reference translations and self-generated rationales, the model can learn new translation skills while preserving its overall general abilities. Extensive experiments demonstrate that our method enhances machine translation performance while maintaining the broader capabilities of LLMs across other tasks. This work presents a pathway for creating more versatile LLMs that excel in specialized tasks without compromising generality and safety.
摘要:大语言模型 (LLMs) 在众多自然语言处理 (NLP) 任务中取得了显著成果,但在机器翻译方面仍面临挑战。传统提升翻译质量的方法通常涉及使用平行语料库对 LLMs 进行微调。然而,常规微调往往导致指令遵循能力和与人类偏好的一致性出现灾难性遗忘,损害其广泛通用能力并引入潜在的安全风险。这些能力是通过专有且不可用的训练数据开发的,使得现有的持续指令微调方法失效。为解决这一问题,我们提出了一种名为 RaDis (Rationale Distillation) 的新方法。RaDis 利用 LLMs 强大的生成能力为训练数据创建理由,这些理由随后被“重放”以防止遗忘。这些理由封装了通用知识和安全原则,作为自我蒸馏目标来规范训练过程。通过联合训练参考翻译和自我生成的理由,模型可以在学习新翻译技能的同时保留其整体通用能力。大量实验表明,我们的方法在提升机器翻译性能的同时,保持了 LLMs 在其他任务中的广泛能力。这项工作为创建在特定任务中表现出色且不牺牲通用性和安全性的多功能 LLMs 提供了途径。
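
RaDis 的"重放"落到数据层面,就是把自生成的理由样本与平行语料混入同一批次联合训练。极简示意(示例数据为占位字符串,并非论文的真实训练数据):

```python
import random

def replay_batches(translation_pairs, rationales, batch_size=4, seed=0):
    """Interleave self-generated rationales with parallel data so that each
    batch rehearses general knowledge while teaching translation."""
    examples = ([("translate", src, tgt) for src, tgt in translation_pairs]
                + [("rationale", src, r) for src, r in rationales])
    random.Random(seed).shuffle(examples)
    return [examples[i:i + batch_size]
            for i in range(0, len(examples), batch_size)]

pairs = [(f"src-{i}", f"tgt-{i}") for i in range(6)]
rats = [(f"src-{i}", f"rationale-{i}") for i in range(6)]
batches = replay_batches(pairs, rats)
kinds = [kind for batch in batches for kind, _, _ in batch]
print(len(batches), kinds.count("translate"), kinds.count("rationale"))
```

每个批次同时包含翻译目标与理由目标,理由作为自蒸馏信号在训练中被反复"重放",从而抑制对通用能力的遗忘。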

[NLP-92] Automatically Interpreting Millions of Features in Large Language Models

【速读】: 该论文试图解决深度神经网络中神经元激活难以解释的问题,特别是通过稀疏自编码器(SAE)将这些激活转换到更高维度的潜在空间后,如何自动化地生成和评估自然语言解释。解决方案的关键在于构建一个开源的自动化流水线,利用大型语言模型(LLMs)生成和评估SAE特征的自然语言解释。论文引入了五种新的评分技术,其中干预评分技术特别有效,能够评估特征干预后的解释性,从而揭示现有方法无法召回的特征。此外,论文还提出了生成更广泛适用解释的指南,并讨论了现有评分技术的缺陷。通过这些方法,论文验证了SAE潜在特征相比神经元更具解释性,即使在使用top-k后处理进行稀疏化的情况下。

链接: https://arxiv.org/abs/2410.13928
作者: Gonçalo Paulo,Alex Mallen,Caden Juang,Nora Belrose
关键词-EN: simple human-understandable interpretation, deep neural networks, higher-dimensional latent space, sparse autoencoders, human-understandable interpretation
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While the activations of neurons in deep neural networks usually do not have a simple human-understandable interpretation, sparse autoencoders (SAEs) can be used to transform these activations into a higher-dimensional latent space which may be more easily interpretable. However, these SAEs can have millions of distinct latent features, making it infeasible for humans to manually interpret each one. In this work, we build an open-source automated pipeline to generate and evaluate natural language explanations for SAE features using LLMs. We test our framework on SAEs of varying sizes, activation functions, and losses, trained on two different open-weight LLMs. We introduce five new techniques to score the quality of explanations that are cheaper to run than the previous state of the art. One of these techniques, intervention scoring, evaluates the interpretability of the effects of intervening on a feature, which we find explains features that are not recalled by existing methods. We propose guidelines for generating better explanations that remain valid for a broader set of activating contexts, and discuss pitfalls with existing scoring techniques. We use our explanations to measure the semantic similarity of independently trained SAEs, and find that SAEs trained on nearby layers of the residual stream are highly similar. Our large-scale analysis confirms that SAE latents are indeed much more interpretable than neurons, even when neurons are sparsified using top-k postprocessing. Our code is available at this https URL, and our explanations are available at this https URL.
摘要:尽管深度神经网络中神经元的激活通常不具备简单的人类可理解解释,但稀疏自编码器 (Sparse Autoencoders, SAEs) 可以将这些激活转换为更高维度的潜在空间,该空间可能更容易解释。然而,这些 SAE 可能拥有数百万个不同的潜在特征,使得人类手动解释每一个特征变得不可行。在本研究中,我们构建了一个开源自动化流水线,利用大语言模型 (Large Language Models, LLMs) 为 SAE 特征生成并评估自然语言解释。我们在不同大小、激活函数和损失函数的 SAE 上测试了我们的框架,这些 SAE 基于两个不同的开源权重大语言模型进行训练。我们引入了五种新技术来评估解释质量,这些技术比之前的最先进方法运行成本更低。其中一种技术,干预评分 (Intervention Scoring),评估了干预特征效果的可解释性,我们发现这种方法解释了现有方法无法回忆的特征。我们提出了生成更好解释的指导原则,这些解释在更广泛的激活上下文中仍然有效,并讨论了现有评分技术的陷阱。我们利用这些解释来测量独立训练的 SAE 之间的语义相似性,发现基于残差流相邻层的 SAE 训练结果高度相似。我们的大规模分析证实,即使使用 top-k 后处理对神经元进行稀疏化,SAE 潜在特征确实比神经元更具可解释性。我们的代码可在以下链接获取,解释结果也可在以下链接获取。
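
摘要中的 SAE 与 top-k 后处理的结构可以用几行 numpy 示意:编码到更宽的潜空间,只保留每个样本最大的 k 个激活,再解码回去。权重为随机初始化,未经训练,仅展示结构:

```python
import numpy as np

class TopKSAE:
    """Minimal sparse autoencoder: encode to a wide latent space, keep only
    the top-k activations, decode back. Untrained, a structural sketch."""
    def __init__(self, d_model, d_latent, k, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(0, d_model ** -0.5, size=(d_model, d_latent))
        self.W_dec = rng.normal(0, d_latent ** -0.5, size=(d_latent, d_model))
        self.k = k

    def encode(self, x):
        z = np.maximum(x @ self.W_enc, 0.0)         # ReLU pre-activations
        # Top-k postprocessing: zero out all but the k largest latents.
        idx = np.argsort(z, axis=-1)[..., :-self.k]
        z_sparse = z.copy()
        np.put_along_axis(z_sparse, idx, 0.0, axis=-1)
        return z_sparse

    def decode(self, z):
        return z @ self.W_dec

sae = TopKSAE(d_model=16, d_latent=64, k=8)
x = np.random.default_rng(1).normal(size=(5, 16))
z = sae.encode(x)
print((z != 0).sum(axis=-1))            # at most k active latents per input
```

每个输入至多激活 k 个潜特征,正是这种稀疏性使单个潜特征比稠密的神经元更易于用自然语言解释。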

[NLP-93] NSmark: Null Space Based Black-box Watermarking Defense Framework for Pre-trained Language Models

【速读】: 该论文试图解决预训练语言模型(PLMs)在黑盒设置下易受线性功能等价攻击(LFEA)的问题。解决方案的关键在于提出了一种名为NSmark的任务无关、黑盒水印方案,该方案通过利用输出矩阵的零空间不变性来抵抗LL-LFEA攻击。NSmark包括三个阶段:水印生成、水印嵌入和验证,通过数字签名和扩展频谱调制增强水印的鲁棒性,同时确保模型性能不受影响。实验结果表明,NSmark在预训练和下游任务中均表现出有效性、可靠性和鲁棒性。

链接: https://arxiv.org/abs/2410.13907
作者: Haodong Zhao,Jinming Hu,Peixuan Li,Fangqi Li,Jinrui Sha,Peixuan Chen,Zhuosheng Zhang,Gongshen Liu
关键词-EN: Pre-trained language models, critical intellectual property, Linear Functionality Equivalence, Pre-trained language, Functionality Equivalence Attacks
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Pre-trained language models (PLMs) have emerged as critical intellectual property (IP) assets that necessitate protection. Although various watermarking strategies have been proposed, they remain vulnerable to Linear Functionality Equivalence Attacks (LFEA), which can invalidate most existing white-box watermarks without prior knowledge of the watermarking scheme or training data. This paper further analyzes and extends the attack scenarios of LFEA to the commonly employed black-box settings for PLMs by considering Last-Layer outputs (dubbed LL-LFEA). We discover that the null space of the output matrix remains invariant against LL-LFEA attacks. Based on this finding, we propose NSmark, a task-agnostic, black-box watermarking scheme capable of resisting LL-LFEA attacks. NSmark consists of three phases: (i) watermark generation using the digital signature of the owner, enhanced by spread spectrum modulation for increased robustness; (ii) watermark embedding through an output mapping extractor that preserves PLM performance while maximizing watermark capacity; (iii) watermark verification, assessed by extraction rate and null space conformity. Extensive experiments on both pre-training and downstream tasks confirm the effectiveness, reliability, fidelity, and robustness of our approach. Code is available at this https URL.
摘要:预训练语言模型 (Pre-trained Language Models, PLMs) 已成为需要保护的关键知识产权 (Intellectual Property, IP) 资产。尽管已提出多种水印策略,但它们仍易受线性功能等价攻击 (Linear Functionality Equivalence Attacks, LFEA) 的影响,这种攻击可以在不了解水印方案或训练数据的情况下使大多数现有白盒水印失效。本文进一步分析并扩展了 LFEA 的攻击场景,将其应用于 PLMs 中常用的黑盒设置,通过考虑最后一层的输出 (称为 LL-LFEA)。我们发现,输出矩阵的零空间在 LL-LFEA 攻击下保持不变。基于这一发现,我们提出了 NSmark,一种任务无关的黑盒水印方案,能够抵抗 LL-LFEA 攻击。NSmark 包括三个阶段:(i) 使用所有者的数字签名生成水印,并通过扩频调制增强以提高鲁棒性;(ii) 通过输出映射提取器嵌入水印,同时最大化水印容量并保持 PLM 性能;(iii) 水印验证,通过提取率和零空间一致性进行评估。在预训练和下游任务上的广泛实验证实了我们的方法的有效性、可靠性、保真度和鲁棒性。代码可在以下链接获取:https URL。
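
NSmark 依赖的关键性质,即输出矩阵的零空间在 LL-LFEA 式线性变换下不变,可以用 numpy 直接验证:对每个输出向量左乘可逆矩阵 A 后,原零空间仍被湮灭。以下矩阵均为随机构造的示意数据:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 6, 10
# Stack n last-layer output vectors as columns of a rank-deficient matrix.
O = rng.normal(size=(d, 4)) @ rng.normal(size=(4, n))     # rank 4

def null_space(M, tol=1e-10):
    _, s, Vt = np.linalg.svd(M)
    rank = int((s > tol).sum())
    return Vt[rank:].T              # orthonormal basis of the null space

N = null_space(O)

# An LFEA-style attack replaces every output y with A @ y (A invertible) ...
A = rng.normal(size=(d, d)) + 3.0 * np.eye(d)
O_attacked = A @ O

# ... but the null space is untouched: A @ O @ v = A @ 0 = 0.
print(float(np.abs(O_attacked @ N).max()))
```

由于任何左乘线性变换都无法改变 O v = 0 这一关系,以零空间一致性为校验量的水印天然免疫此类攻击。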

[NLP-94] SoK: Prompt Hacking of Large Language Models

【速读】: 该论文试图解决大型语言模型(LLMs)应用中的安全性和鲁棒性问题,特别是针对prompt hacking攻击的三种类型:jailbreaking、leaking和injection。解决方案的关键在于提出了一种新的评估框架,将LLM的响应分类为五个不同的类别,超越了传统的二元分类,从而提供了更细致的AI行为分析,增强了系统的安全性和鲁棒性评估的精确性和针对性。

链接: https://arxiv.org/abs/2410.13901
作者: Baha Rababah,Shang(Tommy)Wu,Matthew Kwiatkowski,Carson Leung,Cuneyt Gurcan Akcora
关键词-EN: large language models, remain critical challenges, based applications remain, applications remain critical, language models
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
备注:

点击查看摘要

Abstract:The safety and robustness of large language models (LLMs) based applications remain critical challenges in artificial intelligence. Among the key threats to these applications are prompt hacking attacks, which can significantly undermine the security and reliability of LLM-based systems. In this work, we offer a comprehensive and systematic overview of three distinct types of prompt hacking: jailbreaking, leaking, and injection, addressing the nuances that differentiate them despite their overlapping characteristics. To enhance the evaluation of LLM-based applications, we propose a novel framework that categorizes LLM responses into five distinct classes, moving beyond the traditional binary classification. This approach provides more granular insights into the AI’s behavior, improving diagnostic precision and enabling more targeted enhancements to the system’s safety and robustness.
摘要:基于大语言模型 (LLM) 的应用程序的安全性和鲁棒性仍然是人工智能领域的关键挑战。在这些应用程序面临的主要威胁中,提示词攻击尤为突出,它们可以显著削弱基于 LLM 系统的安全性和可靠性。本文中,我们全面系统地概述了三种不同的提示词攻击类型:越狱 (jailbreaking)、泄露 (leaking) 和注入 (injection),尽管它们具有重叠的特征,但各自有着细微的差别。为了提升对基于 LLM 应用程序的评估,我们提出了一种新的框架,将 LLM 的响应划分为五个不同的类别,超越了传统的二元分类。这种方法提供了对 AI 行为更细致的洞察,提高了诊断的精确性,并能够更有针对性地增强系统的安全性和鲁棒性。

[NLP-95] Observing the Southern US Culture of Honor Using Large-Scale Social Media Analysis EMNLP2024

【速读】: 该论文试图解决的问题是验证美国南部地区(盛行荣誉文化)的互联网用户是否更倾向于在受到个人攻击时进行报复性攻击。解决方案的关键在于利用OpenAI的GPT-3.5 API进行用户地理位置定位和自动检测用户间的侮辱性言论,并通过手动标注数据集来验证GPT-3.5的性能,从而实现基于大型语言模型的大规模分析,并得出结论。

链接: https://arxiv.org/abs/2410.13887
作者: Juho Kim,Michael Guerzhoy
关键词-EN: governing interpersonal relations, culture of honor, individuals’ status, interpersonal relations, social system
类目: Social and Information Networks (cs.SI); Computation and Language (cs.CL)
备注: Accepted to SICon Workshop at the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024)

点击查看摘要

Abstract:A culture of honor refers to a social system where individuals’ status, reputation, and esteem play a central role in governing interpersonal relations. Past works have associated this concept with the United States (US) South and related with it various traits such as higher sensitivity to insult, a higher value on reputation, and a tendency to react violently to insults. In this paper, we hypothesize and confirm that internet users from the US South, where a culture of honor is more prevalent, are more likely to display a trait predicted by their belonging to a culture of honor. Specifically, we test the hypothesis that US Southerners are more likely to retaliate to personal attacks by personally attacking back. We leverage OpenAI’s GPT-3.5 API to both geolocate internet users and to automatically detect whether users are insulting each other. We validate the use of GPT-3.5 by measuring its performance on manually-labeled subsets of the data. Our work demonstrates the potential of formulating a hypothesis based on a conceptual framework, operationalizing it in a way that is amenable to large-scale LLM-aided analysis, manually validating the use of the LLM, and drawing a conclusion.
摘要:荣誉文化 (culture of honor) 指的是一种社会体系,其中个人的地位、声誉和尊重在人际关系中起着核心作用。过去的研究将这一概念与美国南部 (US South) 联系起来,并关联了诸如对侮辱的高度敏感性、对声誉的高度重视以及对侮辱倾向于暴力反应等特征。本文中,我们假设并证实了来自美国南部的互联网用户,由于该地区荣誉文化更为普遍,更可能表现出与其所属荣誉文化相关的特征。具体而言,我们测试了这样一个假设:美国南部的人更可能通过个人攻击来回击个人攻击。我们利用 OpenAI 的 GPT-3.5 API 来定位互联网用户的地理位置,并自动检测用户之间是否存在侮辱行为。通过测量 GPT-3.5 在手动标注数据子集上的表现,我们验证了其使用。本文展示了基于概念框架制定假设、将其操作化为适合大规模大语言模型 (LLM) 辅助分析的方式、手动验证 LLM 的使用,并得出结论的潜力。

[NLP-96] Stars Stripes and Silicon: Unravelling the ChatGPTs All-American Monochrome Cis-centric Bias

【速读】: 该论文试图解决大型语言模型(如ChatGPT)中存在的偏见、毒性、不可靠性和鲁棒性不足等问题。论文指出,这些问题主要源于模型训练数据的质与量,而非模型架构本身。解决方案的关键在于跨学科合作,包括研究人员、实践者和利益相关者的共同努力,以建立治理框架、监督机制和问责制度,从而减轻偏见语言模型对社会的负面影响。通过积极应对这些挑战,AI社区可以更好地利用大型语言模型的潜力,促进社会进步,同时避免加剧现有的不平等和偏见。

链接: https://arxiv.org/abs/2410.13868
作者: Federico Torrielli
关键词-EN: large language models, lack of robustness, robustness in large, large language, language models
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper investigates the challenges associated with bias, toxicity, unreliability, and lack of robustness in large language models (LLMs) such as ChatGPT. It emphasizes that these issues primarily stem from the quality and diversity of data on which LLMs are trained, rather than the model architectures themselves. As LLMs are increasingly integrated into various real-world applications, their potential to negatively impact society by amplifying existing biases and generating harmful content becomes a pressing concern. The paper calls for interdisciplinary efforts to address these challenges. Additionally, it highlights the need for collaboration between researchers, practitioners, and stakeholders to establish governance frameworks, oversight, and accountability mechanisms to mitigate the harmful consequences of biased LLMs. By proactively addressing these challenges, the AI community can harness the enormous potential of LLMs for the betterment of society without perpetuating harmful biases or exacerbating existing inequalities.
摘要:本文探讨了大型语言模型 (LLM) 如 ChatGPT 在偏见、毒性、不可靠性和缺乏鲁棒性方面面临的挑战。文章强调,这些问题主要源于 LLM 训练数据的质量和多样性,而非模型架构本身。随着 LLM 越来越多地集成到各种实际应用中,它们通过放大现有偏见和生成有害内容对社会产生负面影响的潜力成为一个紧迫的问题。本文呼吁跨学科合作来应对这些挑战。此外,文章强调了研究人员、从业者和利益相关者之间合作的重要性,以建立治理框架、监督机制和问责机制,从而减轻偏见 LLM 的有害后果。通过主动应对这些挑战,AI 社区可以利用 LLM 的巨大潜力造福社会,同时避免延续有害偏见或加剧现有不平等。

[NLP-97] E3D-GPT: Enhanced 3D Visual Foundation for Medical Vision-Language Model

【速读】: 该论文试图解决3D医学图像(如CT扫描)在训练数据有限和高维度问题下,限制3D医学视觉-语言模型发展的问题。解决方案的关键在于:通过收集大量未标注的3D CT数据,并利用自监督学习构建3D视觉基础模型,以提取3D视觉特征;同时,采用3D空间卷积技术来聚合和投影高层图像特征,降低计算复杂度并保留空间信息;最后,基于BIMCV-R和CT-RATE构建两个指令微调数据集,对3D视觉-语言模型进行微调,从而在报告生成、视觉问答和疾病诊断等方面展现出优于现有方法的性能。

链接: https://arxiv.org/abs/2410.14200
作者: Haoran Lai,Zihang Jiang,Qingsong Yao,Rongsheng Wang,Zhiyang He,Xiaodong Tao,Wei Wei,Weifu Lv,S.Kevin Zhou
关键词-EN: medical vision-language models, holds significant potential, models holds significant, medical vision-language, vision-language models holds
类目: Image and Video Processing (eess.IV); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The development of 3D medical vision-language models holds significant potential for disease diagnosis and patient treatment. However, compared to 2D medical images, 3D medical images, such as CT scans, face challenges related to limited training data and high dimension, which severely restrict the progress of 3D medical vision-language models. To address these issues, we collect a large amount of unlabeled 3D CT data and utilize self-supervised learning to construct a 3D visual foundation model for extracting 3D visual features. Then, we apply 3D spatial convolutions to aggregate and project high-level image features, reducing computational complexity while preserving spatial information. We also construct two instruction-tuning datasets based on BIMCV-R and CT-RATE to fine-tune the 3D vision-language model. Our model demonstrates superior performance compared to existing methods in report generation, visual question answering, and disease diagnosis. Code and data will be made publicly available soon.
摘要:三维医学视觉-语言模型的发展在疾病诊断和患者治疗方面具有巨大的潜力。然而,与二维医学图像相比,三维医学图像(如CT扫描)面临着训练数据有限和高维度的问题,这严重限制了三维医学视觉-语言模型的进展。为了解决这些问题,我们收集了大量未标注的三维CT数据,并利用自监督学习构建了一个三维视觉基础模型,用于提取三维视觉特征。随后,我们应用三维空间卷积来聚合和投影高级图像特征,在降低计算复杂度的同时保留空间信息。此外,我们基于BIMCV-R和CT-RATE构建了两个指令微调数据集,以对三维视觉-语言模型进行微调。我们的模型在报告生成、视觉问答和疾病诊断方面表现优于现有方法。代码和数据将很快公开发布。
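
论文使用可学习的 3D 空间卷积聚合特征;此处以无参数的 3D 非重叠均值池化示意"压缩空间位置数、保留空间布局"的效果,体积尺寸为假设值:

```python
import numpy as np

def aggregate_3d(features, kernel=2):
    """Downsample a (D, H, W, C) feature volume by non-overlapping 3D mean
    pooling, shrinking the number of spatial positions while keeping layout.
    A parameter-free stand-in for the learned 3D spatial convolutions."""
    D, H, W, C = features.shape
    f = features[:D - D % kernel, :H - H % kernel, :W - W % kernel]
    f = f.reshape(D // kernel, kernel,
                  H // kernel, kernel,
                  W // kernel, kernel, C)
    return f.mean(axis=(1, 3, 5))

vol = np.random.default_rng(0).normal(size=(8, 8, 8, 32))
tokens = aggregate_3d(vol)          # (4, 4, 4, 32): 8x fewer positions
print(tokens.shape)
```

位置数降为原来的 1/8,送入语言模型的视觉 token 随之减少,这正是降低 3D 输入计算复杂度的直观来源。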

[NLP-98] UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models

【速读】: 该论文试图解决如何有效评估大型语言模型(LLMs)在处理复杂现实世界金融任务中的能力问题。解决方案的关键在于引入UCFE(User-Centric Financial Expertise)基准,该基准采用混合方法,结合人类专家评估与动态任务特定交互,以模拟不断变化的金融场景。通过用户研究收集反馈并构建涵盖广泛用户意图和交互的数据集,利用LLM-as-Judge方法对12个LLM服务进行基准测试,结果显示基准评分与人类偏好高度一致(Pearson相关系数为0.78),验证了UCFE数据集和评估方法的有效性。

链接: https://arxiv.org/abs/2410.14059
作者: Yuzhe Yang,Yifei Zhang,Yan Hu,Yilin Guo,Ruoli Gan,Yueru He,Mingcong Lei,Xiao Zhang,Haining Wang,Qianqian Xie,Jimin Huang,Honghai Yu,Benyou Wang
关键词-EN: User-Centric Financial Expertise, large language models, handle complex real-world, Financial Expertise benchmark, complex real-world financial
类目: Computational Finance (q-fin.CP); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper introduces the UCFE: User-Centric Financial Expertise benchmark, an innovative framework designed to evaluate the ability of large language models (LLMs) to handle complex real-world financial tasks. UCFE benchmark adopts a hybrid approach that combines human expert evaluations with dynamic, task-specific interactions to simulate the complexities of evolving financial scenarios. Firstly, we conducted a user study involving 804 participants, collecting their feedback on financial tasks. Secondly, based on this feedback, we created our dataset that encompasses a wide range of user intents and interactions. This dataset serves as the foundation for benchmarking 12 LLM services using the LLM-as-Judge methodology. Our results show a significant alignment between benchmark scores and human preferences, with a Pearson correlation coefficient of 0.78, confirming the effectiveness of the UCFE dataset and our evaluation approach. UCFE benchmark not only reveals the potential of LLMs in the financial sector but also provides a robust framework for assessing their performance and user alignment. The benchmark dataset and evaluation code are available.
摘要:本文介绍了 UCFE:以用户为中心的金融专业知识基准,这是一个创新的框架,旨在评估大语言模型 (LLM) 处理复杂现实世界金融任务的能力。UCFE 基准采用了一种混合方法,结合了人类专家评估与动态、任务特定的交互,以模拟不断变化的金融场景的复杂性。首先,我们进行了一项涉及 804 名参与者的用户研究,收集了他们对金融任务的反馈。其次,基于这些反馈,我们创建了一个涵盖广泛用户意图和交互的数据集。该数据集作为使用 LLM-as-Judge 方法对 12 个 LLM 服务进行基准测试的基础。我们的结果显示,基准评分与人类偏好之间存在显著的一致性,Pearson 相关系数为 0.78,证实了 UCFE 数据集和我们的评估方法的有效性。UCFE 基准不仅揭示了 LLM 在金融领域的潜力,还提供了一个强大的框架,用于评估其性能与用户契合度。基准数据集和评估代码均已公开。

人工智能

[AI-0] SudoLM: Learning Access Control of Parametric Knowledge with Authorization Alignment

链接: https://arxiv.org/abs/2410.14676
作者: Qin Liu,Fei Wang,Chaowei Xiao,Muhao Chen
关键词-EN: Existing preference alignment, large language model, Existing preference, language model, parametric knowledge
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing preference alignment is a one-size-fits-all alignment mechanism, where the part of the large language model (LLM) parametric knowledge with non-preferred features is uniformly blocked to all the users. However, this part of knowledge can be useful to advanced users whose expertise qualifies them to handle these information. The one-size-fits-all alignment mechanism undermines LLM’s utility for these qualified users. To address this problem, we propose SudoLM, a framework that lets LLMs learn access control over specific parametric knowledge for users with different credentials via authorization alignment. SudoLM allows authorized users to unlock their access to all the parametric knowledge with an assigned SUDO key while blocking access to non-qualified users. Experiments on two application scenarios demonstrate that SudoLM effectively controls the user’s access to the parametric knowledge and maintains its general utility.

[AI-1] Enhancing Large Language Models Situated Faithfulness to External Contexts

链接: https://arxiv.org/abs/2410.14675
作者: Yukun Huang,Sanxing Chen,Hongyi Cai,Bhuwan Dhingra
关键词-EN: Large Language Models, Large Language, external information, Language Models, intentionally misleading
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are often augmented with external information as contexts, but this external information can sometimes be inaccurate or even intentionally misleading. We argue that robust LLMs should demonstrate situated faithfulness, dynamically calibrating their trust in external information based on their confidence in the internal knowledge and the external context. To benchmark this capability, we evaluate LLMs across several QA datasets, including a newly created dataset called RedditQA featuring in-the-wild incorrect contexts sourced from Reddit posts. We show that when provided with both correct and incorrect contexts, both open-source and proprietary models tend to overly rely on external information, regardless of its factual accuracy. To enhance situated faithfulness, we propose two approaches: Self-Guided Confidence Reasoning (SCR) and Rule-Based Confidence Reasoning (RCR). SCR enables models to self-access the confidence of external information relative to their own internal knowledge to produce the most accurate answer. RCR, in contrast, extracts explicit confidence signals from the LLM and determines the final answer using predefined rules. Our results show that for LLMs with strong reasoning capabilities, such as GPT-4o and GPT-4o mini, SCR outperforms RCR, achieving improvements of up to 24.2% over a direct input augmentation baseline. Conversely, for a smaller model like Llama-3-8B, RCR outperforms SCR. Fine-tuning SCR with our proposed Confidence Reasoning Direct Preference Optimization (CR-DPO) method improves performance on both seen and unseen datasets, yielding an average improvement of 8.9% on Llama-3-8B. In addition to quantitative results, we offer insights into the relative strengths of SCR and RCR. Our findings highlight promising avenues for improving situated faithfulness in LLMs. The data and code are released.
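上文提到的 RCR(基于规则的置信度推理)可以用一个极简的 Python 草图来说明其思路:比较模型对内部知识与外部上下文各自的置信度,再按预定义规则选取最终答案。其中的阈值、函数签名均为示意性假设,并非论文的原始实现。

```python
def rule_based_answer(internal_answer, internal_conf,
                      external_answer, external_conf, threshold=0.5):
    """RCR 思路的假设性示意:按预定义规则在内部知识与外部上下文之间做选择。
    阈值 threshold 与各规则均为示意,并非论文原始方法。"""
    # 规则1:外部信息置信度明显更高时,采纳外部上下文给出的答案
    if external_conf - internal_conf > threshold:
        return external_answer
    # 规则2:内部知识置信度明显更高时,坚持模型自身的答案
    if internal_conf - external_conf > threshold:
        return internal_answer
    # 规则3:两者接近时,默认信任置信度略高的一方
    return external_answer if external_conf >= internal_conf else internal_answer

print(rule_based_answer("巴黎", 0.9, "里昂", 0.2))  # 输出: 巴黎
```

实际系统中,置信度信号需要从 LLM 的输出(如对数概率或自报置信度)中提取,这正是 RCR 与 SCR 的差异所在。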

[AI-2] BiGR: Harnessing Binary Latent Codes for Image Generation and Improved Visual Representation Capabilities

链接: https://arxiv.org/abs/2410.14672
作者: Shaozhe Hao,Xuantong Liu,Xianbiao Qi,Shihao Zhao,Bojia Zi,Rong Xiao,Kai Han,Kwan-Yee K. Wong
关键词-EN: compact binary latent, focusing on enhancing, binary latent codes, conditional generative model, generative training
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL

点击查看摘要

Abstract:We introduce BiGR, a novel conditional image generation model using compact binary latent codes for generative training, focusing on enhancing both generation and representation capabilities. BiGR is the first conditional generative model that unifies generation and discrimination within the same framework. BiGR features a binary tokenizer, a masked modeling mechanism, and a binary transcoder for binary code prediction. Additionally, we introduce a novel entropy-ordered sampling method to enable efficient image generation. Extensive experiments validate BiGR’s superior performance in generation quality, as measured by FID-50k, and representation capabilities, as evidenced by linear-probe accuracy. Moreover, BiGR showcases zero-shot generalization across various vision tasks, enabling applications such as image inpainting, outpainting, editing, interpolation, and enrichment, without the need for structural modifications. Our findings suggest that BiGR unifies generative and discriminative tasks effectively, paving the way for further advancements in the field.

[AI-3] DiscoGraMS: Enhancing Movie Screen-Play Summarization using Movie Character-Aware Discourse Graph

链接: https://arxiv.org/abs/2410.14666
作者: Maitreya Prafulla Chitale,Uday Bindal,Rajakrishnan Rajkumar,Rahul Mishra
关键词-EN: Summarizing movie screenplays, standard document summarization, Summarizing movie, unique set, compared to standard
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Summarizing movie screenplays presents a unique set of challenges compared to standard document summarization. Screenplays are not only lengthy, but also feature a complex interplay of characters, dialogues, and scenes, with numerous direct and subtle relationships and contextual nuances that are difficult for machine learning models to accurately capture and comprehend. Recent attempts at screenplay summarization focus on fine-tuning transformer-based pre-trained models, but these models often fall short in capturing long-term dependencies and latent relationships, and frequently encounter the “lost in the middle” issue. To address these challenges, we introduce DiscoGraMS, a novel resource that represents movie scripts as a movie character-aware discourse graph (CaD Graph). This approach is well-suited for various downstream tasks, such as summarization, question-answering, and salience detection. The model aims to preserve all salient information, offering a more comprehensive and faithful representation of the screenplay’s content. We further explore a baseline method that combines the CaD Graph with the corresponding movie script through a late fusion of graph and text modalities, and we present very initial promising results.

[AI-4] Online Reinforcement Learning with Passive Memory

链接: https://arxiv.org/abs/2410.14665
作者: Anay Pattanaik,Lav R. Varshney
关键词-EN: leverages pre-collected data, reinforcement learning algorithm, online reinforcement learning, pre-collected data, online interaction
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper considers an online reinforcement learning algorithm that leverages pre-collected data (passive memory) from the environment for online interaction. We show that using passive memory improves performance and further provide theoretical guarantees for regret that turns out to be near-minimax optimal. Results show that the quality of passive memory determines sub-optimality of the incurred regret. The proposed approach and results hold in both continuous and discrete state-action spaces.

[AI-5] Real-time Fake News from Adversarial Feedback

链接: https://arxiv.org/abs/2410.14651
作者: Sanxing Chen,Yukun Huang,Bhuwan Dhingra
关键词-EN: fact-checking websites, knowledge cutoffs, show that existing, existing evaluations, based on conventional
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We show that existing evaluations for fake news detection based on conventional sources, such as claims on fact-checking websites, result in an increasing accuracy over time for LLM-based detectors – even after their knowledge cutoffs. This suggests that recent popular political claims, which form the majority of fake news on such sources, are easily classified using surface-level shallow patterns. Instead, we argue that a proper fake news detection dataset should test a model’s ability to reason factually about the current world by retrieving and reading related evidence. To this end, we develop a novel pipeline that leverages natural language feedback from a RAG-based detector to iteratively modify real-time news into deceptive fake news that challenges LLMs. Our iterative rewrite decreases the binary classification AUC by an absolute 17.5 percent for a strong RAG GPT-4o detector. Our experiments reveal the important role of RAG in both detecting and generating fake news, as retrieval-free LLM detectors are vulnerable to unseen events and adversarial attacks, while feedback from RAG detection helps discover more deceitful patterns in fake news.

[AI-6] Distance between Relevant Information Pieces Causes Bias in Long-Context LLMs

链接: https://arxiv.org/abs/2410.14641
作者: Runchu Tian,Yanghao Li,Yuepeng Fu,Siyang Deng,Qinyu Luo,Cheng Qian,Shuo Wang,Xin Cong,Zhong Zhang,Yesai Wu,Yankai Lin,Huadong Wang,Xiaojiang Liu
关键词-EN: effectively process long, process long inputs, relevant information, relevant information pieces, large language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: work in progress

点击查看摘要

Abstract:Positional bias in large language models (LLMs) hinders their ability to effectively process long inputs. A prominent example is the “lost in the middle” phenomenon, where LLMs struggle to utilize relevant information situated in the middle of the input. While prior research primarily focuses on single pieces of relevant information, real-world applications often involve multiple relevant information pieces. To bridge this gap, we present LongPiBench, a benchmark designed to assess positional bias involving multiple pieces of relevant information. Thorough experiments are conducted with five commercial and six open-source models. These experiments reveal that while most current models are robust against the “lost in the middle” issue, there exist significant biases related to the spacing of relevant information pieces. These findings highlight the importance of evaluating and reducing positional biases to advance LLM’s capabilities.

[AI-7] GenEOL: Harnessing the Generative Power of LLMs for Training-Free Sentence Embeddings

链接: https://arxiv.org/abs/2410.14635
作者: Raghuveer Thirukovalluru,Bhuwan Dhingra
关键词-EN: large language models, directly leverage pretrained, leverage pretrained large, pretrained large language, Training-free embedding methods
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Training-free embedding methods directly leverage pretrained large language models (LLMs) to embed text, bypassing the costly and complex procedure of contrastive learning. Previous training-free embedding methods have mainly focused on optimizing embedding prompts and have overlooked the benefits of utilizing the generative abilities of LLMs. We propose a novel method, GenEOL, which uses LLMs to generate diverse transformations of a sentence that preserve its meaning, and aggregates the resulting embeddings of these transformations to enhance the overall sentence embedding. GenEOL significantly outperforms the existing training-free embedding methods by an average of 2.85 points across several LLMs on the sentence semantic text similarity (STS) benchmark. Our analysis shows that GenEOL stabilizes representation quality across LLM layers and is robust to perturbations of embedding prompts. GenEOL also achieves notable gains on multiple clustering, reranking and pair-classification tasks from the MTEB benchmark.
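GenEOL 的核心步骤(生成保义变换、分别编码、聚合嵌入)可以用如下示意代码勾勒。真实流程中变换由 LLM 生成、编码由预训练语言模型完成;此处的 toy_embed 只是占位用的假设性编码器,变换也以简单的字符串函数代替。

```python
import numpy as np

def geneol_embed(sentence, transform_fns, embed_fn):
    """GenEOL 思路的最小示意:对原句及其多个保义变换分别编码,再平均聚合。
    transform_fns 与 embed_fn 为假设的接口,非论文原始实现。"""
    variants = [sentence] + [f(sentence) for f in transform_fns]
    embs = np.stack([embed_fn(v) for v in variants])
    return embs.mean(axis=0)  # 聚合后的句向量

def toy_embed(text):
    """玩具编码器:用文本哈希作随机种子生成确定性向量,仅作演示。"""
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    return rng.standard_normal(8)

vec = geneol_embed("the cat sat", [str.upper, str.title], toy_embed)
print(vec.shape)  # (8,)
```

摘要中报告的 2.85 分平均提升,正是把这种"多视角聚合"换掉单一提示式嵌入后在 STS 基准上测得的。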

[AI-8] On the Regularization of Learnable Embeddings for Time Series Processing

链接: https://arxiv.org/abs/2410.14630
作者: Luca Butera,Giovanni De Felice,Andrea Cini,Cesare Alippi
关键词-EN: time series, multiple time series, time series processing, individual features, processing multiple time
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In processing multiple time series, accounting for the individual features of each sequence can be challenging. To address this, modern deep learning methods for time series analysis combine a shared (global) model with local layers, specific to each time series, often implemented as learnable embeddings. Ideally, these local embeddings should encode meaningful representations of the unique dynamics of each sequence. However, when these are learned end-to-end as parameters of a forecasting model, they may end up acting as mere sequence identifiers. Shared processing blocks may then become reliant on such identifiers, limiting their transferability to new contexts. In this paper, we address this issue by investigating methods to regularize the learning of local learnable embeddings for time series processing. Specifically, we perform the first extensive empirical study on the subject and show how such regularizations consistently improve performance in widely adopted architectures. Furthermore, we show that methods preventing the co-adaptation of local and global parameters are particularly effective in this context. This hypothesis is validated by comparing several methods preventing the downstream models from relying on sequence identifiers, going as far as completely resetting the embeddings during training. The obtained results provide an important contribution to understanding the interplay between learnable local parameters and shared processing layers: a key challenge in modern time series processing models and a step toward developing effective foundation models for time series.

[AI-9] CELI: Controller-Embedded Language Model Interactions

链接: https://arxiv.org/abs/2410.14627
作者: Jan-Samuel Wagner,Dave DeCaprio,Abishek Chiffon Muthu Raja,Jonathan M. Holman,Lauren K. Brady,Sky C. Cheung,Hosein Barzekar,Eric Yang,Mark Anthony Martinez II,David Soong,Sriram Sridhar,Han Si,Brandon W. Higgs,Hisham Hamadeh,Scott Ogden
关键词-EN: introduce Controller-Embedded Language, Controller-Embedded Language Model, Language Model Interactions, control logic directly, Language Model
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 26 pages, 2 figures

点击查看摘要

Abstract:We introduce Controller-Embedded Language Model Interactions (CELI), a framework that integrates control logic directly within language model (LM) prompts, facilitating complex, multi-stage task execution. CELI addresses limitations of existing prompt engineering and workflow optimization techniques by embedding control logic directly within the operational context of language models, enabling dynamic adaptation to evolving task requirements. Our framework transfers control from the traditional programming execution environment to the LMs, allowing them to autonomously manage computational workflows while maintaining seamless interaction with external systems and functions. CELI supports arbitrary function calls with variable arguments, bridging the gap between LMs’ adaptive reasoning capabilities and conventional software paradigms’ structured control mechanisms. To evaluate CELI’s versatility and effectiveness, we conducted case studies in two distinct domains: code generation (HumanEval benchmark) and multi-stage content generation (Wikipedia-style articles). The results demonstrate notable performance improvements across a range of domains. CELI achieved a 4.9 percentage point improvement over the best reported score of the baseline GPT-4 model on the HumanEval code generation benchmark. In multi-stage content generation, 94.4% of CELI-produced Wikipedia-style articles met or exceeded first draft quality when optimally configured, with 44.4% achieving high quality. These outcomes underscore CELI’s potential for optimizing AI-driven workflows across diverse computational domains.

[AI-10] Benchmarking Deep Reinforcement Learning for Navigation in Denied Sensor Environments

链接: https://arxiv.org/abs/2410.14616
作者: Mariusz Wisniewski,Paraskevas Chatzithanos,Weisi Guo,Antonios Tsourdos
关键词-EN: Deep Reinforcement learning, Deep Reinforcement, Reinforcement learning, enable autonomous navigation, enable autonomous
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 31 pages, 19 figures. For associated code, see this https URL

点击查看摘要

Abstract:Deep Reinforcement learning (DRL) is used to enable autonomous navigation in unknown environments. Most research assume perfect sensor data, but real-world environments may contain natural and artificial sensor noise and denial. Here, we present a benchmark of both well-used and emerging DRL algorithms in a navigation task with configurable sensor denial effects. In particular, we are interested in comparing how different DRL methods (e.g. model-free PPO vs. model-based DreamerV3) are affected by sensor denial. We show that DreamerV3 outperforms other methods in the visual end-to-end navigation task with a dynamic goal - and other methods are not able to learn this. Furthermore, DreamerV3 generally outperforms other methods in sensor-denied environments. In order to improve robustness, we use adversarial training and demonstrate an improved performance in denied environments, although this generally comes with a performance cost on the vanilla environments. We anticipate this benchmark of different DRL methods and the usage of adversarial training to be a starting point for the development of more elaborate navigation strategies that are capable of dealing with uncertain and denied sensor readings.

[AI-11] Streaming Deep Reinforcement Learning Finally Works

链接: https://arxiv.org/abs/2410.14606
作者: Mohamed Elsayed,Gautham Vasan,A. Rupam Mahmood
关键词-EN: Natural intelligence processes, intelligence processes experience, real time, Natural intelligence, mimics natural learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Natural intelligence processes experience as a continuous stream, sensing, acting, and learning moment-by-moment in real time. Streaming learning, the modus operandi of classic reinforcement learning (RL) algorithms like Q-learning and TD, mimics natural learning by using the most recent sample without storing it. This approach is also ideal for resource-constrained, communication-limited, and privacy-sensitive applications. However, in deep RL, learners almost always use batch updates and replay buffers, making them computationally expensive and incompatible with streaming learning. Although the prevalence of batch deep RL is often attributed to its sample efficiency, a more critical reason for the absence of streaming deep RL is its frequent instability and failure to learn, which we refer to as stream barrier. This paper introduces the stream-x algorithms, the first class of deep RL algorithms to overcome stream barrier for both prediction and control and match sample efficiency of batch RL. Through experiments in Mujoco Gym, DM Control Suite, and Atari Games, we demonstrate stream barrier in existing algorithms and successful stable learning with our stream-x algorithms: stream Q, stream AC, and stream TD, achieving the best model-free performance in DM Control Dog environments. A set of common techniques underlies the stream-x algorithms, enabling their success with a single set of hyperparameters and allowing for easy extension to other algorithms, thereby reviving streaming RL.

[AI-12] How Does Data Diversity Shape the Weight Landscape of Neural Networks?

链接: https://arxiv.org/abs/2410.14602
作者: Yang Ba,Michelle V. Mancenido,Rong Pan
关键词-EN: weight decay, machine learning models, enhance the generalization, generalization of machine, Random Matrix Theory
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:To enhance the generalization of machine learning models to unseen data, techniques such as dropout, weight decay (L_2 regularization), and noise augmentation are commonly employed. While regularization methods (i.e., dropout and weight decay) are geared toward adjusting model parameters to prevent overfitting, data augmentation increases the diversity of the input training set, a method purported to improve accuracy and calibration error. In this paper, we investigate the impact of each of these techniques on the parameter space of neural networks, with the goal of understanding how they alter the weight landscape in transfer learning scenarios. To accomplish this, we employ Random Matrix Theory to analyze the eigenvalue distributions of pre-trained models, fine-tuned using these techniques but using different levels of data diversity, for the same downstream tasks. We observe that diverse data influences the weight landscape in a similar fashion as dropout. Additionally, we compare commonly used data augmentation methods with synthetic data created by generative models. We conclude that synthetic data can bring more diversity into real input data, resulting in a better performance on out-of-distribution test instances.
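文中借助随机矩阵理论分析预训练模型的特征值分布,其最基本的一步,即计算权重矩阵的经验谱,可以如下示意(仅为概念演示,并非论文的完整分析流程):

```python
import numpy as np

def esd(weight_matrix):
    """计算权重矩阵的经验谱分布(ESD):W^T W / n 的特征值,按降序返回。
    随机矩阵理论常用该谱刻画 dropout、数据增强等手段对权重空间的影响;
    此处仅为示意性实现。"""
    n = weight_matrix.shape[0]
    gram = weight_matrix.T @ weight_matrix / n  # 对称半正定的 Gram 矩阵
    return np.sort(np.linalg.eigvalsh(gram))[::-1]

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 64))  # 假想的一层全连接权重
eigs = esd(W)
print(eigs.shape)  # (64,)
```

对随机初始化的权重,该谱近似服从 Marchenko-Pastur 分布;训练与正则化手段会使谱形偏离该基线,这正是论文用来比较各技术的观测量。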

[AI-13] Teaching Models to Balance Resisting and Accepting Persuasion

链接: https://arxiv.org/abs/2410.14596
作者: Elias Stengel-Eskin,Peter Hase,Mohit Bansal
关键词-EN: Large language models, Large language, pose risks, models, PBT
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Code: this https URL

点击查看摘要

Abstract:Large language models (LLMs) are susceptible to persuasion, which can pose risks when models are faced with an adversarial interlocutor. We take a first step towards defending models against persuasion while also arguing that defense against adversarial (i.e. negative) persuasion is only half of the equation: models should also be able to accept beneficial (i.e. positive) persuasion to improve their answers. We show that optimizing models for only one side results in poor performance on the other. In order to balance positive and negative persuasion, we introduce Persuasion-Balanced Training (or PBT), which leverages multi-agent recursive dialogue trees to create data and trains models via preference optimization to accept persuasion when appropriate. PBT consistently improves resistance to misinformation and resilience to being challenged while also resulting in the best overall performance on holistic data containing both positive and negative persuasion. Crucially, we show that PBT models are better teammates in multi-agent debates. We find that without PBT, pairs of stronger and weaker models have unstable performance, with the order in which the models present their answers determining whether the team obtains the stronger or weaker model’s performance. PBT leads to better and more stable results and less order dependence, with the stronger model consistently pulling the weaker one up.

[AI-14] Temporal Fair Division of Indivisible Items

链接: https://arxiv.org/abs/2410.14593
作者: Edith Elkind,Alexander Lam,Mohamad Latifian,Tzeh Yuan Neoh,Nicholas Teh
关键词-EN: fair division model, items arrive sequentially, indivisible items arrive, arrive sequentially, immediately and irrevocably
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We study a fair division model where indivisible items arrive sequentially, and must be allocated immediately and irrevocably. Previous work on online fair division has shown impossibility results in achieving approximate envy-freeness under these constraints. In contrast, we consider an informed setting where the algorithm has complete knowledge of future items, and aim to ensure that the cumulative allocation at each round satisfies approximate envy-freeness – which we define as temporal envy-freeness up to one item (TEF1). We focus on settings where items can be exclusively goods or exclusively chores. For goods, while TEF1 allocations may not always exist, we identify several special cases where they do – two agents, two item types, generalized binary valuations, unimodal preferences – and provide polynomial-time algorithms for these cases. We also prove that determining the existence of a TEF1 allocation is NP-hard. For chores, we establish analogous results for the special cases, but present a slightly weaker intractability result. We also establish the incompatibility between TEF1 and Pareto-optimality, with the implication that it is intractable to find a TEF1 allocation that maximizes any p-mean welfare, even for two agents.
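TEF1(去一物品的时序无嫉妒)要求每件物品到达并分配后,累积分配都满足 EF1。下面给出一个定义层面的检查器示意,估值与分配均为虚构的玩具数据:

```python
def is_ef1(valuations, bundles):
    """检查分配是否满足「去一物品无嫉妒」(EF1,物品均为 goods)。
    valuations[i][g] 为智能体 i 对物品 g 的估值。"""
    for i, own in enumerate(bundles):
        vi_own = sum(valuations[i][g] for g in own)
        for j, other in enumerate(bundles):
            if i == j or not other:
                continue
            vi_other = sum(valuations[i][g] for g in other)
            # 允许从对方束中移除(价值最高的)一件物品后再比较
            if vi_own < vi_other - max(valuations[i][g] for g in other):
                return False
    return True

def is_tef1(valuations, arrival_order, assignment):
    """TEF1:物品按 arrival_order 逐件到达,每一轮后的累积分配都须满足 EF1。
    assignment[g] 给出物品 g 被分给的智能体编号;仅为定义层面的检查器示意。"""
    bundles = [set() for _ in range(len(valuations))]
    for g in arrival_order:
        bundles[assignment[g]].add(g)
        if not is_ef1(valuations, bundles):
            return False
    return True

# 两个智能体、三件物品的玩具实例
vals = [{0: 3, 1: 1, 2: 2}, {0: 1, 1: 3, 2: 2}]
print(is_tef1(vals, [0, 1, 2], {0: 0, 1: 1, 2: 0}))  # True
```

注意 TEF1 严格强于最终分配的 EF1:中间每一轮都要通过检查,这也是摘要中存在性与 NP 难结果的来源。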

[AI-15] Neural Combinatorial Clustered Bandits for Recommendation Systems

链接: https://arxiv.org/abs/2410.14586
作者: Baran Atalar,Carlee Joe-Wong
关键词-EN: individual base arms, base arm rewards, unknown reward functions, super arm, individual base
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We consider the contextual combinatorial bandit setting where in each round, the learning agent, e.g., a recommender system, selects a subset of “arms,” e.g., products, and observes rewards for both the individual base arms, which are a function of known features (called “context”), and the super arm (the subset of arms), which is a function of the base arm rewards. The agent’s goal is to simultaneously learn the unknown reward functions and choose the highest-reward arms. For example, the “reward” may represent a user’s probability of clicking on one of the recommended products. Conventional bandit models, however, employ restrictive reward function models in order to obtain performance guarantees. We make use of deep neural networks to estimate and learn the unknown reward functions and propose Neural UCB Clustering (NeUClust), which adopts a clustering approach to select the super arm in every round by exploiting underlying structure in the context space. Unlike prior neural bandit works, NeUClust uses a neural network to estimate the super arm reward and select the super arm, thus eliminating the need for a known optimization oracle. We non-trivially extend prior neural combinatorial bandit works to prove that NeUClust achieves \widetilde{O}(\widetilde{d}\sqrt{T}) regret, where \widetilde{d} is the effective dimension of a neural tangent kernel matrix, T the number of rounds. Experiments on real world recommendation datasets show that NeUClust achieves better regret and reward than other contextual combinatorial and neural bandit algorithms.

[AI-16] MCSFF: Multi-modal Consistency and Specificity Fusion Framework for Entity Alignment

链接: https://arxiv.org/abs/2410.14584
作者: Wei Ai,Wen Deng,Hongyi Chen,Jiayi Du,Tao Meng,Yuntao Shou
关键词-EN: enhancing knowledge graphs, improving information retrieval, question-answering systems, Multi-modal entity alignment, essential for enhancing
类目: Artificial Intelligence (cs.AI)
备注: 6 pages, 1 figures

点击查看摘要

Abstract:Multi-modal entity alignment (MMEA) is essential for enhancing knowledge graphs and improving information retrieval and question-answering systems. Existing methods often focus on integrating modalities through their complementarity but overlook the specificity of each modality, which can obscure crucial features and reduce alignment accuracy. To solve this, we propose the Multi-modal Consistency and Specificity Fusion Framework (MCSFF), which innovatively integrates both complementary and specific aspects of modalities. We utilize Scale Computing’s hyper-converged infrastructure to optimize IT management and resource allocation in large-scale data processing. Our framework first computes similarity matrices for each modality using modality embeddings to preserve their unique characteristics. Then, an iterative update method denoises and enhances modality features to fully express critical information. Finally, we integrate the updated information from all modalities to create enriched and precise entity representations. Experiments show our method outperforms current state-of-the-art MMEA baselines on the MMKG dataset, demonstrating its effectiveness and practical potential.
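MCSFF 所述流程(按模态计算相似度矩阵、迭代更新以去噪增强、最后融合各模态信息)可以用如下假设性草图表达;相似度定义、迭代规则与融合方式均为示意,并非论文原始公式:

```python
import numpy as np

def iterative_refine(feats, alpha=0.5, iters=3):
    """示意:用单模态相似度矩阵对该模态的实体特征做迭代传播/去噪。"""
    x = feats / np.linalg.norm(feats, axis=1, keepdims=True)  # 归一化实体特征
    sim = np.exp(x @ x.T)                                     # 模态内相似度矩阵
    sim = sim / sim.sum(axis=1, keepdims=True)                # 行归一化为传播算子
    for _ in range(iters):
        x = alpha * x + (1 - alpha) * sim @ x                 # 迭代更新,保留模态特异性
    return x

def fuse(modalities):
    """将各模态迭代更新后的特征拼接,得到融合后的实体表示。"""
    return np.concatenate([iterative_refine(m) for m in modalities], axis=1)

rng = np.random.default_rng(0)
# 假想的 5 个实体、两种模态(如图像 8 维、文本 4 维)的特征
ent = fuse([rng.standard_normal((5, 8)), rng.standard_normal((5, 4))])
print(ent.shape)  # (5, 12)
```

拼接而非求和的融合方式,对应摘要中"兼顾互补性与特异性"的思路:各模态独有的信息不会在融合时被相互抵消。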

[AI-17] Do LLMs estimate uncertainty well in instruction-following?

链接: https://arxiv.org/abs/2410.14582
作者: Juyeon Heo,Miao Xiong,Christina Heinze-Deml,Jaya Narain
关键词-EN: Large language models, precisely follow user, Large language, follow user instructions, valuable personal
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) could be valuable personal AI agents across various domains, provided they can precisely follow user instructions. However, recent studies have shown significant limitations in LLMs’ instruction-following capabilities, raising concerns about their reliability in high-stakes applications. Accurately estimating LLMs’ uncertainty in adhering to instructions is critical to mitigating deployment risks. We present, to our knowledge, the first systematic evaluation of the uncertainty estimation abilities of LLMs in the context of instruction-following. Our study identifies key challenges with existing instruction-following benchmarks, where multiple factors are entangled with uncertainty stems from instruction-following, complicating the isolation and comparison across methods and models. To address these issues, we introduce a controlled evaluation setup with two benchmark versions of data, enabling a comprehensive comparison of uncertainty estimation methods under various conditions. Our findings show that existing uncertainty methods struggle, particularly when models make subtle errors in instruction following. While internal model states provide some improvement, they remain inadequate in more complex scenarios. The insights from our controlled evaluation setups provide a crucial understanding of LLMs’ limitations and potential for uncertainty estimation in instruction-following tasks, paving the way for more trustworthy AI agents.

[AI-18] Optimizing Attention with Mirror Descent: Generalized Max-Margin Token Selection

链接: https://arxiv.org/abs/2410.14581
作者: Aaron Alvarado Kristanto Julistiono,Davoud Ataee Tarzanagh,Navid Azizan
关键词-EN: natural language processing, artificial intelligence, computer vision, revolutionized several domains, domains of artificial
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Attention mechanisms have revolutionized several domains of artificial intelligence, such as natural language processing and computer vision, by enabling models to selectively focus on relevant parts of the input data. While recent work has characterized the optimization dynamics of gradient descent (GD) in attention-based models and the structural properties of its preferred solutions, less is known about more general optimization algorithms such as mirror descent (MD). In this paper, we investigate the convergence properties and implicit biases of a family of MD algorithms tailored for softmax attention mechanisms, with the potential function chosen as the p -th power of the \ell_p -norm. Specifically, we show that these algorithms converge in direction to a generalized hard-margin SVM with an \ell_p -norm objective when applied to a classification problem using a softmax attention model. Notably, our theoretical results reveal that the convergence rate is comparable to that of traditional GD in simpler models, despite the highly nonlinear and nonconvex nature of the present problem. Additionally, we delve into the joint optimization dynamics of the key-query matrix and the decoder, establishing conditions under which this complex joint optimization converges to their respective hard-margin SVM solutions. Lastly, our numerical experiments on real data demonstrate that MD algorithms improve generalization over standard GD and excel in optimal token selection.
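To make the \ell_p setup concrete, here is a minimal, hypothetical (not from the paper) mirror descent step with potential \psi(w) = ||w||_p^p / p: weights are mapped to the dual space via sign(w)|w|^{p-1}, a gradient step is taken there, and the inverse map is applied. With p = 2 this reduces to ordinary gradient descent.

```python
import math

def mirror_descent_step(w, grad, lr, p=3.0):
    """One mirror descent step with potential psi(w) = ||w||_p^p / p.

    Mirror map: (grad psi)(w)_i = sign(w_i) * |w_i|^(p-1); its inverse
    raises the dual coordinate to the power 1/(p-1). With p = 2 this is
    exactly a plain gradient descent step.
    """
    q = p - 1.0
    w_new = []
    for wi, gi in zip(w, grad):
        zi = math.copysign(abs(wi) ** q, wi) - lr * gi  # step in the dual space
        w_new.append(math.copysign(abs(zi) ** (1.0 / q), zi))  # map back
    return w_new
```

The p parameter interpolates between GD-like updates (p = 2) and sparser implicit biases (larger p), which is the family the paper analyzes for softmax attention.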

[AI-19] Towards Unsupervised Validation of Anomaly-Detection Models

链接: https://arxiv.org/abs/2410.14579
作者: Lihi Idan
关键词-EN: highly challenging task, highly challenging, validation, challenging task, unsupervised model-validation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Unsupervised validation of anomaly-detection models is a highly challenging task. While the common practices for model validation involve a labeled validation set, such validation sets cannot be constructed when the underlying datasets are unlabeled. The lack of robust and efficient unsupervised model-validation techniques presents an acute challenge in the implementation of automated anomaly-detection pipelines, especially when there exists no prior knowledge of the model’s performance on similar datasets. This work presents a new paradigm to automated validation of anomaly-detection models, inspired by real-world, collaborative decision-making mechanisms. We focus on two commonly-used, unsupervised model-validation tasks – model selection and model evaluation – and provide extensive experimental results that demonstrate the accuracy and robustness of our approach on both tasks.

[AI-20] Large Language Models Are Overparameterized Text Encoders

链接: https://arxiv.org/abs/2410.14578
作者: Thennal D K,Tim Fischer,Chris Biemann
关键词-EN: supervised contrastive training, text, text embedding, Large language models, Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 pages of content + 1 for limitations and ethical considerations, 14 pages in total including references and appendix, 5+1 figures

点击查看摘要

Abstract:Large language models (LLMs) demonstrate strong performance as text embedding models when finetuned with supervised contrastive training. However, their large size balloons inference time and memory requirements. In this paper, we show that by pruning the last p% layers of an LLM before supervised training for only 1000 steps, we can achieve a proportional reduction in memory and inference time. We evaluate four different state-of-the-art LLMs on text embedding tasks and find that our method can prune up to 30% of layers with negligible impact on performance and up to 80% with only a modest drop. With only three lines of code, our method is easily implemented in any pipeline for transforming LLMs to text encoders. We also propose L3Prune, a novel layer-pruning strategy based on the model’s initial loss that provides two optimal pruning configurations: a large variant with negligible performance loss and a small variant for resource-constrained settings. On average, the large variant prunes 21% of the parameters with a -0.3 performance drop, and the small variant only suffers from a -5.1 decrease while pruning 74% of the model. We consider these results strong evidence that LLMs are overparameterized for text embedding tasks, and can be easily pruned.
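The core operation, cutting the last p% of layers, fits in a few lines. This is a toy illustration on a generic layer list (the function name is ours); the paper's method additionally picks the cut depth from the model's initial loss and finetunes briefly afterward.

```python
def prune_last_layers(layers, p):
    """Keep the first (1 - p) fraction of an ordered layer stack.

    `layers` is any ordered sequence of layer objects. Per the abstract,
    cutting ~30% costs little; the memory/latency saving is proportional
    to the number of layers removed.
    """
    assert 0.0 <= p < 1.0
    keep = max(1, round(len(layers) * (1.0 - p)))  # always keep at least one layer
    return layers[:keep]
```

For a 32-layer decoder, p = 0.30 keeps 22 layers, i.e. roughly the "negligible impact" regime the abstract reports.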

[AI-21] MomentumSMoE: Integrating Momentum into Sparse Mixture of Experts NEURIPS2024

链接: https://arxiv.org/abs/2410.14574
作者: Rachel S.Y. Teo,Tan M. Nguyen
关键词-EN: unlocking unparalleled scalability, Sparse Mixture, deep learning, key to unlocking, unlocking unparalleled
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注: 10 pages in the main text. Published at NeurIPS 2024. The code is available at this https URL

点击查看摘要

Abstract:Sparse Mixture of Experts (SMoE) has become the key to unlocking unparalleled scalability in deep learning. SMoE has the potential to exponentially increase parameter count while maintaining the efficiency of the model by only activating a small subset of these parameters for a given sample. However, it has been observed that SMoE suffers from unstable training and has difficulty adapting to new distributions, leading to the model’s lack of robustness to data contamination. To overcome these limitations, we first establish a connection between the dynamics of the expert representations in SMoEs and gradient descent on a multi-objective optimization problem. Leveraging our framework, we then integrate momentum into SMoE and propose a new family of SMoEs named MomentumSMoE. We theoretically prove and numerically demonstrate that MomentumSMoE is more stable and robust than SMoE. In particular, we verify the advantages of MomentumSMoE over SMoE on a variety of practical tasks including ImageNet-1K object recognition and WikiText-103 language modeling. We demonstrate the applicability of MomentumSMoE to many types of SMoE models, including those in the Sparse MoE model for vision (V-MoE) and the Generalist Language Model (GLaM). We also show that other advanced momentum-based optimization methods, such as Adam, can be easily incorporated into the MomentumSMoE framework for designing new SMoE models with even better performance, almost negligible additional computation cost, and simple implementations.
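As background for the abstract's "activating a small subset" point, a minimal top-k SMoE routing step might look as follows. This shows only the vanilla sparse routing that MomentumSMoE builds on, not the paper's momentum modification; all names are illustrative.

```python
import math

def smoe_forward(x, experts, router_logits, k=2):
    """Sparse MoE forward pass: run only the top-k experts by router score.

    `experts` is a list of callables, `router_logits` scores each expert.
    The softmax weights are renormalized over the selected k experts, so
    compute stays constant while the expert pool (parameter count) grows.
    """
    top = sorted(range(len(experts)), key=lambda i: router_logits[i])[-k:]
    weights = [math.exp(router_logits[i]) for i in top]
    total = sum(weights)
    return sum((wgt / total) * experts[i](x) for i, wgt in zip(top, weights))
```

MomentumSMoE's contribution is to treat the sequence of such expert outputs across layers as optimization steps and stabilize them with momentum, which this sketch does not attempt.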

[AI-22] Building Trust in Black-box Optimization: A Comprehensive Framework for Explainability

链接: https://arxiv.org/abs/2410.14573
作者: Nazanin Nezami,Hadis Anahideh
关键词-EN: Optimizing costly black-box, Optimizing costly, costly black-box functions, constrained evaluation budget, evaluation budget presents
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Optimizing costly black-box functions within a constrained evaluation budget presents significant challenges in many real-world applications. Surrogate Optimization (SO) is a common resolution, yet its proprietary nature introduced by the complexity of surrogate models and the sampling core (e.g., acquisition functions) often leads to a lack of explainability and transparency. While existing literature has primarily concentrated on enhancing convergence to global optima, the practical interpretation of newly proposed strategies remains underexplored, especially in batch evaluation settings. In this paper, we propose Inclusive Explainability Metrics for Surrogate Optimization (IEMSO), a comprehensive set of model-agnostic metrics designed to enhance the transparency, trustworthiness, and explainability of the SO approaches. Through these metrics, we provide both intermediate and post-hoc explanations to practitioners before and after performing expensive evaluations to gain trust. We consider four primary categories of metrics, each targeting a specific aspect of the SO process: Sampling Core Metrics, Batch Properties Metrics, Optimization Process Metrics, and Feature Importance. Our experimental evaluations demonstrate the significant potential of the proposed metrics across different benchmarks.

[AI-23] TransBox: EL++-closed Ontology Embedding

链接: https://arxiv.org/abs/2410.14571
作者: Hui Yang,Jiaoyan Chen,Uli Sattler
关键词-EN: Web Ontology Language, Description Logic, standard knowledge graphs, Web Ontology, Ontology Language
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:OWL (Web Ontology Language) ontologies, which are able to represent both relational and type facts as standard knowledge graphs and complex domain knowledge in Description Logic (DL) axioms, are widely adopted in domains such as healthcare and bioinformatics. Inspired by the success of knowledge graph embeddings, embedding OWL ontologies has gained significant attention in recent years. Current methods primarily focus on learning embeddings for atomic concepts and roles, enabling the evaluation based on normalized axioms through specially designed score functions. However, they often neglect the embedding of complex concepts, making it difficult to infer with more intricate axioms. This limitation reduces their effectiveness in advanced reasoning tasks, such as Ontology Learning and ontology-mediated Query Answering. In this paper, we propose EL++-closed ontology embeddings which are able to represent any logical expressions in DL via composition. Furthermore, we develop TransBox, an effective EL++-closed ontology embedding method that can handle many-to-one, one-to-many and many-to-many relations. Our extensive experiments demonstrate that TransBox often achieves state-of-the-art performance across various real-world datasets for predicting complex axioms.
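The abstract does not spell out TransBox's score functions, but geometric ontology embeddings in this family commonly model concepts as axis-aligned boxes and score a subsumption C ⊑ D by how far box C sticks out of box D. A hypothetical containment loss in that spirit (all names and the exact form are ours, for illustration only):

```python
def subsumption_loss(inner, outer):
    """Degree to which box `inner` sticks out of box `outer` (0.0 = contained).

    Each box is a (lower, upper) pair of coordinate tuples. An axiom C ⊑ D
    is geometrically satisfied when the loss for (box_C, box_D) is zero,
    which is the kind of differentiable signal box-based embeddings train on.
    """
    (il, iu), (ol, ou) = inner, outer
    lower_violation = sum(max(0.0, o - i) for o, i in zip(ol, il))
    upper_violation = sum(max(0.0, i - o) for i, o in zip(iu, ou))
    return lower_violation + upper_violation
```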

[AI-24] When LLMs Go Online: The Emerging Threat of Web-Enabled LLMs

链接: https://arxiv.org/abs/2410.14569
作者: Hanna Kim,Minkyoo Song,Seung Ho Na,Seungwon Shin,Kimin Lee
关键词-EN: Large Language Models, Language Models, Large Language, LLM agents, agentic systems capable
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have established them as agentic systems capable of planning and interacting with various tools. These LLM agents are often paired with web-based tools, enabling access to diverse sources and real-time information. Although these advancements offer significant benefits across various applications, they also increase the risk of malicious use, particularly in cyberattacks involving personal information. In this work, we investigate the risks associated with misuse of LLM agents in cyberattacks involving personal data. Specifically, we aim to understand: 1) how potent LLM agents can be when directed to conduct cyberattacks, 2) how cyberattacks are enhanced by web-based tools, and 3) how affordable and easy it becomes to launch cyberattacks using LLM agents. We examine three attack scenarios: the collection of Personally Identifiable Information (PII), the generation of impersonation posts, and the creation of spear-phishing emails. Our experiments reveal the effectiveness of LLM agents in these attacks: LLM agents achieved a precision of up to 95.9% in collecting PII, up to 93.9% of impersonation posts created by LLM agents were evaluated as authentic, and the click rate for links in spear phishing emails created by LLM agents reached up to 46.67%. Additionally, our findings underscore the limitations of existing safeguards in contemporary commercial LLMs, emphasizing the urgent need for more robust security measures to prevent the misuse of LLM agents.

[AI-25] RAG-ConfusionQA: A Benchmark for Evaluating LLMs on Confusing Questions

链接: https://arxiv.org/abs/2410.14567
作者: Zhiyuan Peng,Jinming Nian,Alexandre Evfimievski,Yi Fang
关键词-EN: Retrieval Augmented Generation, Retrieval Augmented, provide verifiable document-grounded, verifiable document-grounded responses, user inquiries
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: under review

点击查看摘要

Abstract:Conversational AI agents use Retrieval Augmented Generation (RAG) to provide verifiable document-grounded responses to user inquiries. However, many natural questions do not have good answers: about 25% contain false assumptions (Yu et al., 2023, CREPE), and over 50% are ambiguous (Min et al., 2020, AmbigQA). RAG agents need high-quality data to improve their responses to confusing questions. This paper presents a novel synthetic data generation method to efficiently create a diverse set of context-grounded confusing questions from a given document corpus. We conduct an empirical comparative evaluation of several large language models as RAG agents to measure the accuracy of confusion detection and appropriate response generation. We contribute a benchmark dataset to the public domain.

[AI-26] Boosting K-means for Big Data by Fusing Data Streaming with Global Optimization

链接: https://arxiv.org/abs/2410.14548
作者: Ravil Mussabayev,Rustam Mussabayev
关键词-EN: K-means clustering, optimize K-means clustering, Variable Neighborhood Search, deteriorates when confronted, confronted with massive
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:K-means clustering is a cornerstone of data mining, but its efficiency deteriorates when confronted with massive datasets. To address this limitation, we propose a novel heuristic algorithm that leverages the Variable Neighborhood Search (VNS) metaheuristic to optimize K-means clustering for big data. Our approach is based on the sequential optimization of the partial objective function landscapes obtained by restricting the Minimum Sum-of-Squares Clustering (MSSC) formulation to random samples from the original big dataset. Within each landscape, systematically expanding neighborhoods of the currently best (incumbent) solution are explored by reinitializing all degenerate and a varying number of additional centroids. Extensive and rigorous experimentation on a large number of real-world datasets reveals that by transforming the traditional local search into a global one, our algorithm significantly enhances the accuracy and efficiency of K-means clustering in big data environments, becoming the new state of the art in the field.
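A stripped-down sketch of the sample-then-optimize idea: run Lloyd's algorithm on a random sample of the big dataset and reinitialize degenerate (empty) centroids, which echoes one move of the paper's VNS neighborhoods. The systematic neighborhood-expansion search itself is omitted, and 1-D data plus all names are for illustration only.

```python
import random

def kmeans_sample(data, k, sample_size, iters=20, seed=0):
    """Lloyd's algorithm on a random sample of a big 1-D dataset.

    Empty (degenerate) clusters are reinitialized to a random sample point,
    mirroring the centroid-reinitialization move in the VNS neighborhoods.
    """
    rng = random.Random(seed)
    sample = rng.sample(data, min(sample_size, len(data)))
    srt = sorted(sample)
    # deterministic spread-out initialization for the demo
    centroids = [srt[i * (len(srt) - 1) // max(1, k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in sample:
            nearest = min(range(k), key=lambda c: (x - centroids[c]) ** 2)
            clusters[nearest].append(x)
        for j in range(k):
            centroids[j] = (sum(clusters[j]) / len(clusters[j])
                            if clusters[j] else rng.choice(sample))  # reinit degenerate
    return sorted(centroids)
```

The paper's algorithm repeats this over fresh samples, expanding neighborhoods of the incumbent solution to turn the local search into a global one.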

[AI-27] Tell me what I need to know: Exploring LLM-based (Personalized) Abstractive Multi-Source Meeting Summarization

链接: https://arxiv.org/abs/2410.14545
作者: Frederic Kirstein,Terry Ruas,Robert Kratel,Bela Gipp
关键词-EN: existing solutions struggle, digital communication, generate personalized, crucial in digital, existing solutions
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Meeting summarization is crucial in digital communication, but existing solutions struggle with salience identification to generate personalized, workable summaries, and context understanding to fully comprehend the meetings’ content. Previous attempts to address these issues by considering related supplementary resources (e.g., presentation slides) alongside transcripts are hindered by models’ limited context sizes and handling the additional complexities of the multi-source tasks, such as identifying relevant information in additional files and seamlessly aligning it with the meeting content. This work explores multi-source meeting summarization considering supplementary materials through a three-stage large language model approach: identifying transcript passages needing additional context, inferring relevant details from supplementary materials and inserting them into the transcript, and generating a summary from this enriched transcript. Our multi-source approach enhances model understanding, increasing summary relevance by ~9% and producing more content-rich outputs. We introduce a personalization protocol that extracts participant characteristics and tailors summaries accordingly, improving informativeness by ~10%. This work further provides insights on performance-cost trade-offs across four leading model families, including edge-device capable options. Our approach can be extended to similar complex generative tasks benefitting from additional resources and personalization, such as dialogue systems and action planning.

[AI-28] Computational Grounding of Responsibility Attribution and Anticipation in LTLf

链接: https://arxiv.org/abs/2410.14544
作者: Giuseppe De Giacomo,Emiliano Lorini,Timothy Parker,Gianmarco Parretti
关键词-EN: autonomous systems, machine ethics, area of autonomous, key notions, Responsibility
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Responsibility is one of the key notions in machine ethics and in the area of autonomous systems. It is a multi-faceted notion involving counterfactual reasoning about actions and strategies. In this paper, we study different variants of responsibility in a strategic setting based on LTLf. We show a connection with notions in reactive synthesis, including synthesis of winning, dominant, and best-effort strategies. This connection provides the building blocks for a computational grounding of responsibility including complexity characterizations and sound, complete, and optimal algorithms for attributing and anticipating responsibility.

[AI-29] Do LLMs “know” internally when they follow instructions?

链接: https://arxiv.org/abs/2410.14516
作者: Juyeon Heo,Christina Heinze-Deml,Oussama Elachqar,Shirley Ren,Udhay Nallasamy,Andy Miller,Kwan Ho Ryan Chan,Jaya Narain
关键词-EN: large language models, language models, constraints and guidelines, crucial for building, large language
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Instruction-following is crucial for building AI agents with large language models (LLMs), as these models must adhere strictly to user-provided constraints and guidelines. However, LLMs often fail to follow even simple and clear instructions. To improve instruction-following behavior and prevent undesirable outputs, a deeper understanding of how LLMs’ internal states relate to these outcomes is required. Our analysis of LLM internal states reveal a dimension in the input embedding space linked to successful instruction-following. We demonstrate that modifying representations along this dimension improves instruction-following success rates compared to random changes, without compromising response quality. Further investigation reveals that this dimension is more closely related to the phrasing of prompts rather than the inherent difficulty of the task or instructions. This discovery also suggests explanations for why LLMs sometimes fail to follow clear instructions and why prompt engineering is often effective, even when the content remains largely unchanged. This work provides insight into the internal workings of LLMs’ instruction-following, paving the way for reliable LLM agents.
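The intervention described, shifting representations along a learned dimension, can be sketched generically. In practice the direction would be estimated from model internals (e.g., contrasting hidden states of successful vs. failed instruction-following prompts); this toy version just adds a scaled unit direction to a hidden vector.

```python
def steer(hidden, direction, alpha):
    """Shift a hidden-state vector by `alpha` along a unit-normalized direction.

    `direction` is assumed to be the (hypothetical) instruction-following
    dimension found in embedding space; positive `alpha` moves the state
    toward the "follows instructions" side.
    """
    norm = sum(d * d for d in direction) ** 0.5
    unit = [d / norm for d in direction]
    return [h + alpha * u for h, u in zip(hidden, unit)]
```

The abstract's finding is that such targeted shifts raise instruction-following success rates more than random perturbations of the same magnitude.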

[AI-30] Efficient Annotator Reliability Assessment and Sample Weighting for Knowledge-Based Misinformation Detection on Social Media

链接: https://arxiv.org/abs/2410.14515
作者: Owen Cook,Charlie Grimshaw,Ben Wu,Sophie Dillon,Jack Hicks,Luke Jones,Thomas Smith,Matyas Szert,Xingyi Song
关键词-EN: potentially vulnerable people, targetting potentially vulnerable, Misinformation spreads rapidly, social media, confusing the truth
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注: 8 pages, 3 figures, 3 tables. Code available here: this https URL

点击查看摘要

Abstract:Misinformation spreads rapidly on social media, confusing the truth and targetting potentially vulnerable people. To effectively mitigate the negative impact of misinformation, it must first be accurately detected before applying a mitigation strategy, such as X’s community notes, which is currently a manual process. This study takes a knowledge-based approach to misinformation detection, modelling the problem similarly to one of natural language inference. The EffiARA annotation framework is introduced, aiming to utilise inter- and intra-annotator agreement to understand the reliability of each annotator and influence the training of large language models for classification based on annotator reliability. In assessing the EffiARA annotation framework, the Russo-Ukrainian Conflict Knowledge-Based Misinformation Classification Dataset (RUC-MCD) was developed and made publicly available. This study finds that sample weighting using annotator reliability performs the best, utilising both inter- and intra-annotator agreement and soft-label training. The highest classification performance achieved using Llama-3.2-1B was a macro-F1 of 0.757 and 0.740 using TwHIN-BERT-large.
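A simplified version of reliability-from-agreement weighting: EffiARA also folds in intra-annotator agreement and soft-label training, while this sketch keeps only pairwise raw agreement, and all names are ours.

```python
from itertools import combinations

def annotator_reliability(annotations):
    """Per-annotator reliability from pairwise inter-annotator agreement.

    `annotations` maps annotator -> {item_id: label}. Reliability is the
    mean raw agreement with every other annotator on shared items; these
    scores can then weight each annotator's samples during training.
    """
    names = list(annotations)
    agree = {a: [] for a in names}
    for a, b in combinations(names, 2):
        shared = annotations[a].keys() & annotations[b].keys()
        if not shared:
            continue
        rate = sum(annotations[a][i] == annotations[b][i] for i in shared) / len(shared)
        agree[a].append(rate)
        agree[b].append(rate)
    return {a: (sum(r) / len(r) if r else 0.0) for a, r in agree.items()}
```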

[AI-31] LEAD: Latent Realignment for Human Motion Diffusion

链接: https://arxiv.org/abs/2410.14508
作者: Nefeli Andreou,Xi Wang,Victoria Fernández Abrevaya,Marie-Paule Cani,Yiorgos Chrysanthou,Vicky Kalogeiton
关键词-EN: generate realistic human, realistic human motion, generate realistic, realistic human, natural language
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
*备注:

点击查看摘要

Abstract:Our goal is to generate realistic human motion from natural language. Modern methods often face a trade-off between model expressiveness and text-to-motion alignment. Some align text and motion latent spaces but sacrifice expressiveness; others rely on diffusion models producing impressive motions, but lacking semantic meaning in their latent space. This may compromise realism, diversity, and applicability. Here, we address this by combining latent diffusion with a realignment mechanism, producing a novel, semantically structured space that encodes the semantics of language. Leveraging this capability, we introduce the task of textual motion inversion to capture novel motion concepts from a few examples. For motion synthesis, we evaluate LEAD on HumanML3D and KIT-ML and show comparable performance to the state-of-the-art in terms of realism, diversity, and text-motion consistency. Our qualitative analysis and user study reveal that our synthesized motions are sharper, more human-like and comply better with the text compared to modern methods. For motion textual inversion, our method demonstrates improved capacity in capturing out-of-distribution characteristics in comparison to traditional VAEs.

[AI-32] SignAttention: On the Interpretability of Transformer Models for Sign Language Translation NEURIPS2024

链接: https://arxiv.org/abs/2410.14506
作者: Pedro Alejandro Dal Bianco,Oscar Agustín Stanchi,Facundo Manuel Quiroga,Franco Ronchetti,Enzo Ferrante
关键词-EN: Greek Sign Language, Transformer-based Sign Language, Sign Language Dataset, video-based Greek Sign, Sign Language Translation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted at IAI Workshop @ NeurIPS 2024

点击查看摘要

Abstract:This paper presents the first comprehensive interpretability analysis of a Transformer-based Sign Language Translation (SLT) model, focusing on the translation from video-based Greek Sign Language to glosses and text. Leveraging the Greek Sign Language Dataset, we examine the attention mechanisms within the model to understand how it processes and aligns visual input with sequential glosses. Our analysis reveals that the model pays attention to clusters of frames rather than individual ones, with a diagonal alignment pattern emerging between poses and glosses, which becomes less distinct as the number of glosses increases. We also explore the relative contributions of cross-attention and self-attention at each decoding step, finding that the model initially relies on video frames but shifts its focus to previously predicted tokens as the translation progresses. This work contributes to a deeper understanding of SLT models, paving the way for the development of more transparent and reliable translation systems essential for real-world applications.

[AI-33] ANT: Adaptive Noise Schedule for Time Series Diffusion Models NEURIPS2024

链接: https://arxiv.org/abs/2410.14488
作者: Seunghan Lee,Kibok Lee,Taeyoung Park
关键词-EN: generative artificial intelligence, Time series diffusion, series diffusion models, diffusion models, optimal noise schedule
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Advances in diffusion models for generative artificial intelligence have recently propagated to the time series (TS) domain, demonstrating state-of-the-art performance on various tasks. However, prior works on TS diffusion models often borrow the framework of existing works proposed in other domains without considering the characteristics of TS data, leading to suboptimal performance. In this work, we propose Adaptive Noise schedule for Time series diffusion models (ANT), which automatically predetermines proper noise schedules for given TS datasets based on their statistics representing non-stationarity. Our intuition is that an optimal noise schedule should satisfy the following desiderata: 1) It linearly reduces the non-stationarity of TS data so that all diffusion steps are equally meaningful, 2) the data is corrupted to the random noise at the final step, and 3) the number of steps is sufficiently large. The proposed method is practical for use in that it eliminates the necessity of finding the optimal noise schedule with a small additional cost to compute the statistics for given datasets, which can be done offline before training. We validate the effectiveness of our method across various tasks, including TS forecasting, refinement, and generation, on datasets from diverse domains. Code is available at this repository: this https URL.
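For contrast with ANT's adaptive schedules, here is the standard fixed linear beta schedule and the cumulative signal level ᾱ_t, which makes desideratum 2, data corrupted to near-pure noise at the final step, directly checkable. (This is the common DDPM baseline, not ANT's statistic-driven schedule.)

```python
def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Standard non-adaptive linear beta schedule for a diffusion model."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def alpha_bar(betas):
    """Cumulative signal level alpha_bar_t = prod_{s<=t} (1 - beta_s).

    The data is fully corrupted to noise when this approaches zero at the
    final step -- one of the desiderata the abstract lists.
    """
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out
```

ANT's point is that a fixed schedule like this ignores the dataset's non-stationarity; the adaptive variant chooses the schedule so the corruption is spread evenly across steps.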

[AI-34] Transfer Reinforcement Learning in Heterogeneous Action Spaces using Subgoal Mapping

链接: https://arxiv.org/abs/2410.14484
作者: Kavinayan P. Sivakumar,Yan Zhang,Zachary Bell,Scott Nivison,Michael M. Zavlanos
关键词-EN: learner agent, action spaces, expert agent, problem involving agents, expert agent policy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this paper, we consider a transfer reinforcement learning problem involving agents with different action spaces. Specifically, for any new unseen task, the goal is to use a successful demonstration of this task by an expert agent in its action space to enable a learner agent learn an optimal policy in its own different action space with fewer samples than those required if the learner was learning on its own. Existing transfer learning methods across different action spaces either require handcrafted mappings between those action spaces provided by human experts, which can induce bias in the learning procedure, or require the expert agent to share its policy parameters with the learner agent, which does not generalize well to unseen tasks. In this work, we propose a method that learns a subgoal mapping between the expert agent policy and the learner agent policy. Since the expert agent and the learner agent have different action spaces, their optimal policies can have different subgoal trajectories. We learn this subgoal mapping by training a Long Short Term Memory (LSTM) network for a distribution of tasks and then use this mapping to predict the learner subgoal sequence for unseen tasks, thereby improving the speed of learning by biasing the agent’s policy towards the predicted learner subgoal sequence. Through numerical experiments, we demonstrate that the proposed learning scheme can effectively find the subgoal mapping underlying the given distribution of tasks. Moreover, letting the learner agent imitate the expert agent’s policy with the learnt subgoal mapping can significantly improve the sample efficiency and training time of the learner agent in unseen new tasks.

[AI-35] DRL Optimization Trajectory Generation via Wireless Network Intent-Guided Diffusion Models for Optimizing Resource Allocation

链接: https://arxiv.org/abs/2410.14481
作者: Junjie Wu,Xuming Fang,Dusit Niyato,Jiacheng Wang,Jingyu Wang
关键词-EN: including low-altitude economies, including low-altitude, low-altitude economies, continues to expand, accompanied by increasing
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With the rapid advancements in wireless communication fields, including low-altitude economies, 6G, and Wi-Fi, the scale of wireless networks continues to expand, accompanied by increasing service quality demands. Traditional deep reinforcement learning (DRL)-based optimization models can improve network performance by solving non-convex optimization problems intelligently. However, they heavily rely on online deployment and often require extensive initial training. Online DRL optimization models typically make accurate decisions based on current channel state distributions. When these distributions change, their generalization capability diminishes, which hinders the responsiveness essential for real-time and high-reliability wireless communication networks. Furthermore, different users have varying quality of service (QoS) requirements across diverse scenarios, and conventional online DRL methods struggle to accommodate this variability. Consequently, exploring flexible and customized AI strategies is critical. We propose a wireless network intent (WNI)-guided trajectory generation model based on a generative diffusion model (GDM). This model can be generated and fine-tuned in real time to achieve the objective and meet the constraints of target intent networks, significantly reducing state information exposure during wireless communication. Moreover, The WNI-guided optimization trajectory generation can be customized to address differentiated QoS requirements, enhancing the overall quality of communication in future intelligent networks. Extensive simulation results demonstrate that our approach achieves greater stability in spectral efficiency variations and outperforms traditional DRL optimization models in dynamic communication systems.

[AI-36] How Do Training Methods Influence the Utilization of Vision Models? NEURIPS2024

链接: https://arxiv.org/abs/2410.14470
作者: Paul Gavrikov,Shashank Agnihotri,Margret Keuper,Janis Keuper
关键词-EN: contribute equally, learnable parameters, decision function, entire layers’ parameters, network decision function
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at the Interpretable AI: Past, Present and Future Workshop at NeurIPS 2024

点击查看摘要

Abstract:Not all learnable parameters (e.g., weights) contribute equally to a neural network’s decision function. In fact, entire layers’ parameters can sometimes be reset to random values with little to no impact on the model’s decisions. We revisit earlier studies that examined how architecture and task complexity influence this phenomenon and ask: is this phenomenon also affected by how we train the model? We conducted experimental evaluations on a diverse set of ImageNet-1k classification models to explore this, keeping the architecture and training data constant but varying the training pipeline. Our findings reveal that the training method strongly influences which layers become critical to the decision function for a given task. For example, improved training regimes and self-supervised training increase the importance of early layers while significantly under-utilizing deeper layers. In contrast, methods such as adversarial training display an opposite trend. Our preliminary results extend previous findings, offering a more nuanced understanding of the inner mechanics of neural networks. Code: this https URL

[AI-37] The Propensity for Density in Feed-forward Models

链接: https://arxiv.org/abs/2410.14461
作者: Nandi Schoots,Alex Jackson,Ali Kholmovaia,Peter McBurney,Murray Shanahan
关键词-EN: task tend, process of training, training a neural, solved with fewer, fewer weights
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Does the process of training a neural network to solve a task tend to use all of the available weights even when the task could be solved with fewer weights? To address this question we study the effects of pruning fully connected, convolutional and residual models while varying their widths. We find that the proportion of weights that can be pruned without degrading performance is largely invariant to model size. Increasing the width of a model has little effect on the density of the pruned model relative to the increase in absolute size of the pruned network. In particular, we find substantial prunability across a large range of model sizes, where our biggest model is 50 times as wide as our smallest model. We explore three hypotheses that could explain these findings.
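论文研究“能剪掉多少权重而不损性能”;其基本操作——按绝对值做幅度剪枝并统计剩余密度——可以这样示意(仅为概念性草图,并非论文的实验代码):

```python
import numpy as np

def magnitude_prune(w, frac):
    """Zero out the `frac` fraction of weights with smallest magnitude."""
    w = w.copy()
    k = int(frac * w.size)
    if k > 0:
        idx = np.argsort(np.abs(w), axis=None)[:k]
        w.flat[idx] = 0.0
    return w

def density(w):
    """Proportion of non-zero weights after pruning."""
    return float(np.count_nonzero(w) / w.size)
```

论文的观察即:对不同宽度的模型,使性能不降的最大 `frac` 大致不变。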

[AI-38] Toward Generalizing Visual Brain Decoding to Unseen Subjects

链接: https://arxiv.org/abs/2410.14445
作者: Xiangtao Kong,Kexin Huang,Ping Li,Lei Zhang
关键词-EN: decode visual information, Visual brain decoding, decode visual, visual information, brain decoding
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Visual brain decoding aims to decode visual information from human brain activities. Despite the great progress, one critical limitation of current brain decoding research lies in the lack of generalization capability to unseen subjects. Prior works typically focus on decoding brain activity of individuals based on the observation that different subjects exhibit different brain activities, while it remains unclear whether brain decoding can be generalized to unseen subjects. This study aims to answer this question. We first consolidate an image-fMRI dataset consisting of stimulus-image and fMRI-response pairs, involving 177 subjects in the movie-viewing task of the Human Connectome Project (HCP). This dataset allows us to investigate the brain decoding performance with the increase of participants. We then present a learning paradigm that applies uniform processing across all subjects, instead of employing different network heads or tokenizers for individuals as in previous methods, which can accommodate a large number of subjects to explore the generalization capability across different subjects. A series of experiments are conducted and we have the following findings. First, the network exhibits clear generalization capabilities with the increase of training subjects. Second, the generalization capability is common to popular network architectures (MLP, CNN and Transformer). Third, the generalization performance is affected by the similarity between subjects. Our findings reveal the inherent similarities in brain activities across individuals. With the emergence of larger and more comprehensive datasets, it is possible to train a brain decoding foundation model in the future. Code and models can be found at this https URL.

[AI-39] FashionR2R: Texture-preserving Rendered-to-Real Image Translation with Diffusion Models NEURIPS2024

链接: https://arxiv.org/abs/2410.14429
作者: Rui Hu,Qian He,Gaofeng He,Jiedong Zhuang,Huang Chen,Huafeng Liu,Huamin Wang
关键词-EN: producing lifelike clothed, Modeling and producing, lifelike clothed human, attracted researchers’ attention, clothed human images
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:Modeling and producing lifelike clothed human images has attracted researchers’ attention from different areas for decades, with the complexity from highly articulated and structured content. Rendering algorithms decompose and simulate the imaging process of a camera, while are limited by the accuracy of modeled variables and the efficiency of computation. Generative models can produce impressively vivid human images, however still lacking in controllability and editability. This paper studies photorealism enhancement of rendered images, leveraging generative power from diffusion models on the controlled basis of rendering. We introduce a novel framework to translate rendered images into their realistic counterparts, which consists of two stages: Domain Knowledge Injection (DKI) and Realistic Image Generation (RIG). In DKI, we adopt positive (real) domain finetuning and negative (rendered) domain embedding to inject knowledge into a pretrained Text-to-image (T2I) diffusion model. In RIG, we generate the realistic image corresponding to the input rendered image, with a Texture-preserving Attention Control (TAC) to preserve fine-grained clothing textures, exploiting the decoupled features encoded in the UNet structure. Additionally, we introduce SynFashion dataset, featuring high-quality digital clothing images with diverse textures. Extensive experimental results demonstrate the superiority and effectiveness of our method in rendered-to-real image translation.

[AI-40] Unlearning Backdoor Attacks for LLMs with Weak-to-Strong Knowledge Distillation

链接: https://arxiv.org/abs/2410.14425
作者: Shuai Zhao,Xiaobao Wu,Cong-Duy Nguyen,Meihuizi Jia,Yichao Feng,Luu Anh Tuan
关键词-EN: Parameter-efficient fine-tuning, bridge the gap, gap between large, PEFT, Parameter-efficient
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Parameter-efficient fine-tuning (PEFT) can bridge the gap between large language models (LLMs) and downstream tasks. However, PEFT has been proven vulnerable to malicious attacks. Research indicates that poisoned LLMs, even after PEFT, retain the capability to activate internalized backdoors when input samples contain predefined triggers. In this paper, we introduce a novel weak-to-strong unlearning algorithm to defend against backdoor attacks based on feature alignment knowledge distillation, named W2SDefense. Specifically, we first train a small-scale language model through full-parameter fine-tuning to serve as the clean teacher model. Then, this teacher model guides the large-scale poisoned student model in unlearning the backdoor, leveraging PEFT. Theoretical analysis suggests that W2SDefense has the potential to enhance the student model’s ability to unlearn backdoor features, preventing the activation of the backdoor. We conduct experiments on text classification tasks involving three state-of-the-art language models and three different backdoor attack algorithms. Our empirical results demonstrate the outstanding performance of W2SDefense in defending against backdoor attacks without compromising model performance.
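W2SDefense 的核心是让干净的小教师模型通过特征对齐知识蒸馏,引导被投毒的学生模型遗忘后门。下面是对该目标函数的一个示意性草图(任务损失与对齐损失的具体形式、权重 `lam` 均为假设,论文的实际实现可能不同):

```python
import numpy as np

def softmax_xent(logits, labels):
    """Mean cross-entropy of integer labels under softmax(logits)."""
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-logp[np.arange(len(labels)), labels].mean())

def w2s_objective(student_feats, teacher_feats, student_logits, labels, lam=1.0):
    """Schematic W2SDefense-style objective: clean-task loss plus MSE
    alignment of the (poisoned) student's features to the clean weak
    teacher's. The exact loss and feature layer in the paper may differ."""
    align = float(np.mean((student_feats - teacher_feats) ** 2))
    return softmax_xent(student_logits, labels) + lam * align
```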

[AI-41] An explainable machine learning approach for energy forecasting at the household level

链接: https://arxiv.org/abs/2410.14416
作者: Pauline Béraud,Margaux Rioux,Michel Babany,Philippe de La Chevasnerie,Damien Theis,Giacomo Teodori,Chloé Pinguet,Romane Rigaud,François Leclerc
关键词-EN: recurring research topic, recurring research, balance between production, Machine Learning, common Machine Learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Electricity forecasting has been a recurring research topic, as it is key to finding the right balance between production and consumption. While most papers focus on the national or regional scale, few address the household level. Disaggregated forecasting is a common topic in the Machine Learning (ML) literature but lacks the explainability that household energy forecasts require. This paper specifically targets the challenges of forecasting electricity use at the household level. It compares common Machine Learning algorithms on household electricity forecasts, weighing the pros and cons, including accuracy and explainability, with well-known key metrics. Furthermore, we also confront them with the business challenges specific to this sector, such as explainability or outlier resistance. We introduce a custom decision tree, aiming at providing a fair estimate of the energy consumption while being explainable and consistent with human intuition. We show that this novel method allows greater explainability without sacrificing much accuracy. The custom tree methodology can be used in various business use cases but is subject to limitations, such as a lack of resilience to outliers.
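论文强调的可解释决策树,其最基本的构件——在单个特征上寻找使平方误差最小的切分点——可以这样示意(只演示一次切分的“树桩”,并非论文的完整方法):

```python
import numpy as np

def best_split(x, y):
    """Threshold on one feature minimising the summed squared error
    of a single-split regression tree."""
    best_t, best_err = None, np.inf
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

def predict_stump(x, t, x_train, y_train):
    """Predict with the two leaf means of the fitted stump."""
    return np.where(x <= t, y_train[x_train <= t].mean(), y_train[x_train > t].mean())
```

完整的树即递归地在每个叶子上重复这一步;每个切分阈值都可直接读出并解释。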

[AI-42] Generative AI Pragmatics and Authenticity in Second Language Learning

链接: https://arxiv.org/abs/2410.14395
作者: Robert Godwin-Jones
关键词-EN: artificial intelligence, obvious benefits, benefits to integrating, integrating generative, language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:There are obvious benefits to integrating generative AI (artificial intelligence) into language learning and teaching. Those include using AI as a language tutor, creating learning materials, or assessing learner output. However, due to how AI systems understand human language, based on a mathematical model using statistical probability, they lack the lived experience to be able to use language with the same social awareness as humans. Additionally, there are built-in linguistic and cultural biases based on their training data, which is mostly in English and predominantly from Western sources. Those facts limit AI suitability for some language learning interactions. Studies have clearly shown that systems such as ChatGPT often do not produce language that is pragmatically appropriate. The lack of linguistic and cultural authenticity has important implications for how AI is integrated into second language acquisition as well as in instruction targeting development of intercultural communication competence.

[AI-43] Debug Smarter Not Harder: AI Agents for Error Resolution in Computational Notebooks EMNLP2024

链接: https://arxiv.org/abs/2410.14393
作者: Konstantin Grotov,Artem Borzilov,Maksim Krivobok,Timofey Bryksin,Yaroslav Zharov
关键词-EN: offering unprecedented interactivity, development process, research-related development, Large Language Models, offering unprecedented
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to EMNLP 2024 System Demonstrations

点击查看摘要

Abstract:Computational notebooks became indispensable tools for research-related development, offering unprecedented interactivity and flexibility in the development process. However, these benefits come at the cost of reproducibility and an increased potential for bugs. With the rise of code-fluent Large Language Models empowered with agentic techniques, smart bug-fixing tools with a high level of autonomy have emerged. However, those tools are tuned for classical script programming and still struggle with non-linear computational notebooks. In this paper, we present an AI agent designed specifically for error resolution in a computational notebook. We have developed an agentic system capable of exploring a notebook environment by interacting with it – similar to how a user would – and integrated the system into the JetBrains service for collaborative data science called Datalore. We evaluate our approach against the pre-existing single-action solution by comparing costs and conducting a user study. Users rate the error resolution capabilities of the agentic system higher but experience difficulties with UI. We share the results of the study and consider them valuable for further improving user-agent collaboration.

[AI-44] SurgeryV2: Bridging the Gap Between Model Merging and Multi-Task Learning with Deep Representation Surgery ICML2024

链接: https://arxiv.org/abs/2410.14389
作者: Enneng Yang,Li Shen,Zhenyi Wang,Guibing Guo,Xingwei Wang,Xiaocun Cao,Jie Zhang,Dacheng Tao
关键词-EN: raw training data, merged model, merged MTL model, MTL, merging-based multitask learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: This paper is an extended version of our previous work [ arXiv:2402.02705 ] presented at ICML 2024

点击查看摘要

Abstract:Model merging-based multitask learning (MTL) offers a promising approach for performing MTL by merging multiple expert models without requiring access to raw training data. However, in this paper, we examine the merged model’s representation distribution and uncover a critical issue of “representation bias”. This bias arises from a significant distribution gap between the representations of the merged and expert models, leading to the suboptimal performance of the merged MTL model. To address this challenge, we first propose a representation surgery solution called Surgery. Surgery is a lightweight, task-specific module that aligns the final layer representations of the merged model with those of the expert models, effectively alleviating bias and improving the merged model’s performance. Despite these improvements, a performance gap remains compared to the traditional MTL method. Further analysis reveals that representation bias phenomena exist at each layer of the merged model, and aligning representations only in the last layer is insufficient for fully reducing systemic bias because biases introduced at each layer can accumulate and interact in complex ways. To tackle this, we then propose a more comprehensive solution, deep representation surgery (also called SurgeryV2), which mitigates representation bias across all layers, and thus bridges the performance gap between model merging-based MTL and traditional MTL. Finally, we design an unsupervised optimization objective to optimize both the Surgery and SurgeryV2 modules. Our experimental results show that incorporating these modules into state-of-the-art (SOTA) model merging schemes leads to significant performance gains. Notably, our SurgeryV2 scheme reaches almost the same level as individual expert models or the traditional MTL model. The code is available at this https URL.
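Surgery 模块的思想是把合并模型的表征向专家模型的表征对齐。下面用一个岭回归闭式解的线性适配器做概念示意(论文中的 Surgery 是训练得到的轻量任务模块,这里的线性闭式解只是示意性替代):

```python
import numpy as np

def fit_surgery(merged_reps, expert_reps, ridge=1e-6):
    """Fit a linear adapter A so that merged_reps @ A approximates
    expert_reps (ridge least squares)."""
    d = merged_reps.shape[1]
    return np.linalg.solve(merged_reps.T @ merged_reps + ridge * np.eye(d),
                           merged_reps.T @ expert_reps)

def apply_surgery(A, reps):
    """Route representations through the adapter before the task head."""
    return reps @ A
```

SurgeryV2 相当于在每一层(而不只是最后一层)插入这样的对齐模块。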

[AI-45] Interpretable end-to-end Neurosymbolic Reinforcement Learning agents

链接: https://arxiv.org/abs/2410.14371
作者: Nils Grandien,Quentin Delfosse,Kristian Kersting
关键词-EN: slightly different environments, Deep reinforcement learning, rely on shortcut, generalizing to slightly, Deep reinforcement
类目: Artificial Intelligence (cs.AI)
*备注: 19 pages; 5 figures; 3 tables

点击查看摘要

Abstract:Deep reinforcement learning (RL) agents rely on shortcut learning, preventing them from generalizing to slightly different environments. To address this problem, symbolic methods, which use object-centric states, have been developed. However, comparing these methods to deep agents is not fair, as the latter operate on raw pixel-based states. In this work, we instantiate the symbolic SCoBots framework. SCoBots decompose RL tasks into intermediate, interpretable representations, culminating in action decisions based on a comprehensible set of object-centric relational concepts. This architecture aids in demystifying agent decisions. By combining explicit learning of object-centric representations from raw states, object-centric RL, and policy distillation via rule extraction, this work places itself within the neurosymbolic AI paradigm, blending the strengths of neural networks with symbolic AI. We present the first implementation of an end-to-end trained SCoBot and separately evaluate its components on different Atari games. The results demonstrate the framework’s potential to create interpretable and performant RL systems, and pave the way for future research directions in obtaining end-to-end interpretable RL agents.

[AI-46] CoMAL: Collaborative Multi-Agent Large Language Models for Mixed-Autonomy Traffic

链接: https://arxiv.org/abs/2410.14368
作者: Huaiyuan Yao,Longchao Da,Vishnu Nandam,Justin Turnau,Zhiwei Liu,Linsey Pang,Hua Wei
关键词-EN: traffic flow systematically, optimizing traffic flow, autonomous vehicles, great potential, potential to improve
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:The integration of autonomous vehicles into urban traffic has great potential to improve efficiency by reducing congestion and optimizing traffic flow systematically. In this paper, we introduce CoMAL (Collaborative Multi-Agent LLMs), a framework designed to address the mixed-autonomy traffic problem by collaboration among autonomous vehicles to optimize traffic flow. CoMAL is built upon large language models, operating in an interactive traffic simulation environment. It utilizes a Perception Module to observe surrounding agents and a Memory Module to store strategies for each agent. The overall workflow includes a Collaboration Module that encourages autonomous vehicles to discuss the effective strategy and allocate roles, a reasoning engine to determine optimal behaviors based on assigned roles, and an Execution Module that controls vehicle actions using a hybrid approach combining rule-based models. Experimental results demonstrate that CoMAL achieves superior performance on the Flow benchmark. Additionally, we evaluate the impact of different language models and compare our framework with reinforcement learning approaches. It highlights the strong cooperative capability of LLM agents and presents a promising solution to the mixed-autonomy traffic challenge. The code is available at this https URL.

[AI-47] Assistive AI for Augmenting Human Decision-making

链接: https://arxiv.org/abs/2410.14353
作者: Natabara Máté Gyöngyössy,Bernát Török,Csilla Farkas,Laura Lucaj,Attila Menyhárd,Krisztina Menyhárd-Balázs,András Simonyi,Patrick van der Smagt,Zsolt Ződi,András Lőrincz
关键词-EN: framework, Regulatory, lasting societal damage, Regulatory frameworks
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 37 pages, 6 figures

点击查看摘要

Abstract:Regulatory frameworks for the use of AI are emerging. However, they trail behind the fast-evolving malicious AI technologies that can quickly cause lasting societal damage. In response, we introduce a pioneering Assistive AI framework designed to enhance human decision-making capabilities. This framework aims to establish a trust network across various fields, especially within legal contexts, serving as a proactive complement to ongoing regulatory efforts. Central to our framework are the principles of privacy, accountability, and credibility. In our methodology, the foundation of reliability of information and information sources is built upon the ability to uphold accountability, enhance security, and protect privacy. This approach supports, filters, and potentially guides communication, thereby empowering individuals and communities to make well-informed decisions based on cutting-edge advancements in AI. Our framework uses the concept of Boards as proxies to collectively ensure that AI-assisted decisions are reliable, accountable, and in alignment with societal values and legal standards. Through a detailed exploration of our framework, including its main components, operations, and sample use cases, the paper shows how AI can assist in the complex process of decision-making while maintaining human oversight. The proposed framework not only extends regulatory landscapes but also highlights the synergy between AI technology and human judgement, underscoring the potential of AI to serve as a vital instrument in discerning reality from fiction and thus enhancing the decision-making process. Furthermore, we provide domain-specific use cases to highlight the applicability of our framework.

[AI-48] A Scientific Machine Learning Approach for Predicting and Forecasting Battery Degradation in Electric Vehicles

链接: https://arxiv.org/abs/2410.14347
作者: Sharv Murgai,Hrishikesh Bhagwat,Raj Abhijit Dandekar,Rajat Dandekar,Sreedath Panat
关键词-EN: mitigate climate change, Carbon emissions, alarming rate, posing a significant, climate change
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Carbon emissions are rising at an alarming rate, posing a significant threat to global efforts to mitigate climate change. Electric vehicles have emerged as a promising solution, but their reliance on lithium-ion batteries introduces the critical challenge of battery degradation. Accurate prediction and forecasting of battery degradation over both short and long time spans are essential for optimizing performance, extending battery life, and ensuring effective long-term energy management. This directly influences the reliability, safety, and sustainability of EVs, supporting their widespread adoption and aligning with key UN SDGs. In this paper, we present a novel approach to the prediction and long-term forecasting of battery degradation using a Scientific Machine Learning framework which integrates domain knowledge with neural networks, offering more interpretable and scientifically grounded solutions for both predicting short-term battery health and forecasting degradation over extended periods. This hybrid approach captures both known and unknown degradation dynamics, improving predictive accuracy while reducing data requirements. We incorporate ground-truth data to inform our models, ensuring that both the predictions and forecasts reflect practical conditions. On experimental data, the model achieved an MSE of 9.90 with the UDE and 11.55 with the NeuralODE, along with a loss of 1.6986 with the UDE and an MSE of 2.49 with the NeuralODE, demonstrating the enhanced precision of our approach. This integration of data-driven insights with SciML’s strengths in interpretability and scalability allows for robust battery management. By enhancing battery longevity and minimizing waste, our approach contributes to the sustainability of energy systems and accelerates the global transition toward cleaner, more responsible energy solutions, aligning with the UN’s SDG agenda.

[AI-49] Game Theory with Simulation in the Presence of Unpredictable Randomisation

链接: https://arxiv.org/abs/2410.14311
作者: Vojtech Kovarik,Nathaniel Sauerberg,Lewis Hammond,Vincent Conitzer
关键词-EN: traditional agents, mixed-strategy simulation, improve social welfare, simulation, improve social
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:AI agents will be predictable in certain ways that traditional agents are not. Where and how can we leverage this predictability in order to improve social welfare? We study this question in a game-theoretic setting where one agent can pay a fixed cost to simulate the other in order to learn its mixed strategy. As a negative result, we prove that, in contrast to prior work on pure-strategy simulation, enabling mixed-strategy simulation may no longer lead to improved outcomes for both players in all so-called “generalised trust games”. In fact, mixed-strategy simulation does not help in any game where the simulatee’s action can depend on that of the simulator. We also show that, in general, deciding whether simulation introduces Pareto-improving Nash equilibria in a given game is NP-hard. As positive results, we establish that mixed-strategy simulation can improve social welfare if the simulator has the option to scale their level of trust, if the players face challenges with both trust and coordination, or if maintaining some level of privacy is essential for enabling cooperation.

[AI-50] Transferring Tactile Data Across Sensors ICRA

链接: https://arxiv.org/abs/2410.14310
作者: Wadhah Zai El Amri,Malte Kuhlmann,Nicolás Navarro-Guerrero
关键词-EN: crucial in robotics, perception is essential, increasingly crucial, Tactile perception, Tactile
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Extended Abstract. Accepted in ICRA@40 (40th Anniversary of the IEEE International Conference on Robotics and Automation) 23-26 September, 2024 Rotterdam, Netherlands

点击查看摘要

Abstract:Tactile perception is essential for human interaction with the environment and is becoming increasingly crucial in robotics. Tactile sensors like the BioTac mimic human fingertips and provide detailed interaction data. Despite its utility in applications like slip detection and object identification, this sensor is now deprecated, making many existing datasets obsolete. This article introduces a novel method for translating data between tactile sensors by exploiting sensor deformation information rather than output signals. We demonstrate the approach by translating BioTac signals into the DIGIT sensor. Our framework consists of three steps: first, converting signal data into corresponding 3D deformation meshes; second, translating these 3D deformation meshes from one sensor to another; and third, generating output images using the converted meshes. Our approach enables the continued use of valuable datasets.

[AI-51] LoGU: Long-form Generation with Uncertainty Expressions

链接: https://arxiv.org/abs/2410.14309
作者: Ruihan Yang,Caiqi Zhang,Zhisong Zhang,Xinting Huang,Sen Yang,Nigel Collier,Dong Yu,Deqing Yang
关键词-EN: Large Language Models, Large Language, demonstrate impressive capabilities, factually incorrect content, generating factually incorrect
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:While Large Language Models (LLMs) demonstrate impressive capabilities, they still struggle with generating factually incorrect content (i.e., hallucinations). A promising approach to mitigate this issue is enabling models to express uncertainty when unsure. Previous research on uncertainty modeling has primarily focused on short-form QA, but real-world applications often require much longer responses. In this work, we introduce the task of Long-form Generation with Uncertainty (LoGU). We identify two key challenges: Uncertainty Suppression, where models hesitate to express uncertainty, and Uncertainty Misalignment, where models convey uncertainty inaccurately. To tackle these challenges, we propose a refinement-based data collection framework and a two-stage training pipeline. Our framework adopts a divide-and-conquer strategy, refining uncertainty based on atomic claims. The collected data are then used in training through supervised fine-tuning (SFT) and direct preference optimization (DPO) to enhance uncertainty expression. Extensive experiments on three long-form instruction following datasets show that our method significantly improves accuracy, reduces hallucinations, and maintains the comprehensiveness of responses.

[AI-52] SwaQuAD-24: QA Benchmark Dataset in Swahili

链接: https://arxiv.org/abs/2410.14289
作者: Alfred Malengo Kondoro
关键词-EN: Swahili Question Answering, Question Answering, Swahili Question, natural language processing, aimed at addressing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper proposes the creation of a Swahili Question Answering (QA) benchmark dataset, aimed at addressing the underrepresentation of Swahili in natural language processing (NLP). Drawing from established benchmarks like SQuAD, GLUE, KenSwQuAD, and KLUE, the dataset will focus on providing high-quality, annotated question-answer pairs that capture the linguistic diversity and complexity of Swahili. The dataset is designed to support a variety of applications, including machine translation, information retrieval, and social services like healthcare chatbots. Ethical considerations, such as data privacy, bias mitigation, and inclusivity, are central to the dataset development. Additionally, the paper outlines future expansion plans to include domain-specific content, multimodal integration, and broader crowdsourcing efforts. The Swahili QA dataset aims to foster technological innovation in East Africa and provide an essential resource for NLP research and applications in low-resource languages.

[AI-53] Advanced Underwater Image Quality Enhancement via Hybrid Super-Resolution Convolutional Neural Networks and Multi-Scale Retinex-Based Defogging Techniques

链接: https://arxiv.org/abs/2410.14285
作者: Yugandhar Reddy Gogireddy,Jithendra Reddy Gogireddy
关键词-EN: Convolutional Neural Networks, image degradation due, Super-Resolution Convolutional Neural, underwater image degradation, light scattering
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The difficulties of underwater image degradation due to light scattering, absorption, and fog-like particles, which lead to low resolution and poor visibility, are discussed in this study. We suggest a sophisticated hybrid strategy that combines Multi-Scale Retinex (MSR) defogging methods with Super-Resolution Convolutional Neural Networks (SRCNN) to address these problems. The Retinex algorithm mimics human visual perception to reduce uneven lighting and fogging, while the SRCNN component improves the spatial resolution of underwater images. Through the combination of these methods, we are able to enhance the clarity, contrast, and colour restoration of underwater images, offering a reliable way to improve image quality in difficult underwater conditions. The research conducts extensive experiments on real-world underwater datasets to further illustrate the efficacy of the suggested approach. In terms of sharpness, visibility, and feature retention, quantitative evaluation using metrics like the Structural Similarity Index Measure (SSIM) and Peak Signal-to-Noise Ratio (PSNR) demonstrates notable advances over conventional methods. For real-time underwater applications like marine exploration, underwater robotics, and autonomous underwater vehicles, where clear and high-resolution imaging is crucial for operational success, the combination of deep learning and conventional image processing techniques offers a computationally efficient framework with superior results.
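Multi-Scale Retinex 的核心公式是多尺度下 log(I) − log(G_σ * I) 的加权平均。下面给出一个 numpy 版本的最小实现示意(等权平均、可分离高斯模糊,尺度参数均为示例值,并非论文代码):

```python
import numpy as np

def _blur1d(a, k, r):
    """1-D convolution with edge padding, output length preserved."""
    return np.convolve(np.pad(a, r, mode="edge"), k, mode="valid")

def gaussian_blur(img, sigma):
    """Separable Gaussian blur of a 2-D image."""
    r = max(1, int(3 * sigma))
    x = np.arange(-r, r + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    k /= k.sum()
    img = np.apply_along_axis(_blur1d, 0, img, k, r)
    return np.apply_along_axis(_blur1d, 1, img, k, r)

def multi_scale_retinex(img, sigmas=(2, 8, 30), eps=1e-6):
    """MSR: equal-weight average of log(I) - log(G_sigma * I) over scales."""
    img = img.astype(float) + eps
    out = np.zeros_like(img)
    for s in sigmas:
        out += np.log(img) - np.log(gaussian_blur(img, s) + eps)
    return out / len(sigmas)
```

均匀光照的区域在 MSR 输出中趋近于 0,这正是它压制不均匀照明与雾化的原因。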

[AI-54] REEF: Representation Encoding Fingerprints for Large Language Models

链接: https://arxiv.org/abs/2410.14273
作者: Jie Zhang,Dongrui Liu,Chen Qian,Linfeng Zhang,Yong Liu,Yu Qiao,Jing Shao
关键词-EN: open-source Large Language, Large Language Models, Large Language, training LLMs costs, LLMs costs extensive
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Protecting the intellectual property of open-source Large Language Models (LLMs) is very important, because training LLMs costs extensive computational resources and data. Therefore, model owners and third parties need to identify whether a suspect model is a subsequent development of the victim model. To this end, we propose a training-free REEF to identify the relationship between the suspect and victim models from the perspective of LLMs’ feature representations. Specifically, REEF computes and compares the centered kernel alignment similarity between the representations of a suspect model and a victim model on the same samples. This training-free REEF does not impair the model’s general capabilities and is robust to sequential fine-tuning, pruning, model merging, and permutations. In this way, REEF provides a simple and effective way for third parties and models’ owners to protect LLMs’ intellectual property together. The code is available at this https URL.
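REEF 使用的 centered kernel alignment(CKA)相似度有一个常用的线性形式,可以几行代码实现(示意实现;论文作用于 LLM 各层在相同样本上的表征,这里仅用随机矩阵演示其性质):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two representation
    matrices of shape (n_samples, dim)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)
```

线性 CKA 对正交变换和整体缩放不变,这也是 REEF 对微调、剪枝、合并、置换等操作保持鲁棒的直观来源。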

[AI-55] Revisiting SLO and Goodput Metrics in LLM Serving

链接: https://arxiv.org/abs/2410.14257
作者: Zhibin Wang,Shipeng Li,Yuhang Zhou,Xue Li,Rong Gu,Nguyen Cam-Tu,Chen Tian,Sheng Zhong
关键词-EN: Large language models, Large language, LLM serving, achieved remarkable performance, LLM
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have achieved remarkable performance and are widely deployed in various applications, while the serving of LLM inference has raised concerns about user experience and serving throughput. Accordingly, service level objectives (SLOs) and goodput-the number of requests that meet SLOs per second-are introduced to evaluate the performance of LLM serving. However, existing metrics fail to capture the nature of user experience. We observe two ridiculous phenomena in existing metrics: 1) delaying token delivery can smooth the tail time between tokens (tail TBT) of a request and 2) dropping the request that fails to meet the SLOs midway can improve goodput. In this paper, we revisit SLO and goodput metrics in LLM serving and propose a unified metric framework smooth goodput including SLOs and goodput to reflect the nature of user experience in LLM serving. The framework can adapt to specific goals of different tasks by setting parameters. We re-evaluate the performance of different LLM serving systems under multiple workloads based on this unified framework and provide possible directions for future optimization of existing strategies. We hope that this framework can provide a unified standard for evaluating LLM serving and foster researches in the field of LLM serving optimization to move in a cohesive direction.
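goodput 的计算,以及论文指出的“需按每个 token 间隔(TBT)逐一检查,延迟交付才无法钻空子”的思路,可以这样示意(smooth goodput 的具体公式以论文为准,此处仅为概念性草图):

```python
import numpy as np

def goodput(latencies, slo, duration):
    """Classic goodput: number of requests meeting the latency SLO per second."""
    return sum(1 for t in latencies if t <= slo) / duration

def smooth_goodput(token_times, tbt_slo, duration):
    """Schematic 'smooth' variant: a request counts only if every
    time-between-tokens (TBT) gap meets the SLO, so delaying token
    delivery or dropping requests midway cannot inflate the metric."""
    ok = 0
    for times in token_times:
        gaps = np.diff(times)
        if gaps.size == 0 or gaps.max() <= tbt_slo:
            ok += 1
    return ok / duration
```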

[AI-56] Nova: An Iterative Planning and Search Approach to Enhance Novelty and Diversity of LLM Generated Ideas

Link: https://arxiv.org/abs/2410.14255
Authors: Xiang Hu,Hongyu Fu,Jinge Wang,Yifeng Wang,Zhikun Li,Renjun Xu,Yu Lu,Yaochu Jin,Lili Pan,Zhenzhong Lan
Keywords-EN: large language models, harnessing large language, Scientific innovation, generate research ideas, pivotal for humanity
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*Comments:

Click to view abstract

Abstract:Scientific innovation is pivotal for humanity, and harnessing large language models (LLMs) to generate research ideas could transform discovery. However, existing LLMs often produce simplistic and repetitive suggestions due to their limited ability in acquiring external knowledge for innovation. To address this problem, we introduce an enhanced planning and search methodology designed to boost the creative potential of LLM-based systems. Our approach involves an iterative process to purposely plan the retrieval of external knowledge, progressively enriching the idea generation with broader and deeper insights. Validation through automated and human assessments indicates that our framework substantially elevates the quality of generated ideas, particularly in novelty and diversity. The number of unique novel ideas produced by our framework is 3.4 times higher than without it. Moreover, our method outperforms the current state-of-the-art, generating at least 2.5 times more top-rated ideas based on 170 seed papers in a Swiss Tournament evaluation.

[AI-57] Synthesizing Post-Training Data for LLMs through Multi-Agent Simulation

Link: https://arxiv.org/abs/2410.14251
Authors: Shuo Tang,Xianghe Pang,Zexi Liu,Bohan Tang,Rui Ye,Xiaowen Dong,Yanfeng Wang,Siheng Chen
Keywords-EN: enabling large language, Post-training is essential, large language models, essential for enabling, enabling large
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*Comments:

Click to view abstract

Abstract:Post-training is essential for enabling large language models (LLMs) to follow human instructions. Inspired by the recent success of using LLMs to simulate human society, we leverage multi-agent simulation to automatically generate diverse text-based scenarios, capturing a wide range of real-world human needs. We propose MATRIX, a multi-agent simulator that creates realistic and scalable scenarios. Leveraging these outputs, we introduce a novel scenario-driven instruction generator MATRIX-Gen for controllable and highly realistic data synthesis. Extensive experiments demonstrate that our framework effectively generates both general and domain-specific data. Notably, on AlpacaEval 2 and Arena-Hard benchmarks, Llama-3-8B-Base, post-trained on datasets synthesized by MATRIX-Gen with just 20K instruction-response pairs, outperforms Meta’s Llama-3-8B-Instruct model, which was trained on over 10M pairs; see our project at this https URL.

[AI-58] Almost-Linear RNNs Yield Highly Interpretable Symbolic Codes in Dynamical Systems Reconstruction NEURIPS2024

Link: https://arxiv.org/abs/2410.14240
Authors: Manuel Brenner,Christoph Jürgen Hemmer,Zahra Monfared,Daniel Durstewitz
Keywords-EN: theory is fundamental, areas of science, PWL, Recurrent Neural Networks, PWL representations
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS); Chaotic Dynamics (nlin.CD); Data Analysis, Statistics and Probability (physics.data-an)
*Comments: 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

Click to view abstract

Abstract:Dynamical systems (DS) theory is fundamental for many areas of science and engineering. It can provide deep insights into the behavior of systems evolving in time, as typically described by differential or recursive equations. A common approach to facilitate mathematical tractability and interpretability of DS models involves decomposing nonlinear DS into multiple linear DS separated by switching manifolds, i.e. piecewise linear (PWL) systems. PWL models are popular in engineering and a frequent choice in mathematics for analyzing the topological properties of DS. However, hand-crafting such models is tedious and only possible for very low-dimensional scenarios, while inferring them from data usually gives rise to unnecessarily complex representations with very many linear subregions. Here we introduce Almost-Linear Recurrent Neural Networks (AL-RNNs) which automatically and robustly produce most parsimonious PWL representations of DS from time series data, using as few PWL nonlinearities as possible. AL-RNNs can be efficiently trained with any SOTA algorithm for dynamical systems reconstruction (DSR), and naturally give rise to a symbolic encoding of the underlying DS that provably preserves important topological properties. We show that for the Lorenz and Rössler systems, AL-RNNs discover, in a purely data-driven way, the known topologically minimal PWL representations of the corresponding chaotic attractors. We further illustrate on two challenging empirical datasets that interpretable symbolic encodings of the dynamics can be achieved, tremendously facilitating mathematical and computational analysis of the underlying systems.
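One update step of such an almost-linear RNN can be sketched in the piecewise-linear RNN style, z_t = A z_{t-1} + W φ(z_{t-1}) + C x_t + h, where the ReLU φ is applied to only a few latent units. The exact AL-RNN parameterization in the paper may differ; this is a minimal sketch under that assumption:

```python
import numpy as np

def al_rnn_step(z, x, A_diag, W, C, h, num_relu):
    """One AL-RNN update: only the last `num_relu` latent units pass through
    a ReLU; the remaining units evolve linearly, keeping the number of
    linear subregions (and hence the symbolic code) small."""
    phi = z.copy()
    if num_relu > 0:
        phi[-num_relu:] = np.maximum(0.0, phi[-num_relu:])
    return A_diag * z + W @ phi + C @ x + h

rng = np.random.default_rng(0)
M, K = 4, 2                                   # toy latent and input dimensions
z, x = rng.normal(size=M), rng.normal(size=K)
A_diag, h = rng.normal(size=M), rng.normal(size=M)
W, C = rng.normal(size=(M, M)), rng.normal(size=(M, K))
z_next = al_rnn_step(z, x, A_diag, W, C, h, num_relu=2)  # 2 of 4 units nonlinear
```

With `num_relu=0` the step is fully linear, and each sign pattern of the ReLU units indexes one linear subregion, which is what makes a symbolic encoding of the dynamics possible.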

[AI-59] Few-Shot Joint Multimodal Entity-Relation Extraction via Knowledge-Enhanced Cross-modal Prompt Model ACM-MM2024

Link: https://arxiv.org/abs/2410.14225
Authors: Li Yuan,Yi Cai,Junsheng Huang
Keywords-EN: Multimodal Entity-Relation Extraction, Joint Multimodal Entity-Relation, social media posts, Entity-Relation Extraction, Joint Multimodal
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*Comments: accepted by ACM MM 2024

Click to view abstract

Abstract:Joint Multimodal Entity-Relation Extraction (JMERE) is a challenging task that aims to extract entities and their relations from text-image pairs in social media posts. Existing methods for JMERE require large amounts of labeled data. However, gathering and annotating fine-grained multimodal data for JMERE poses significant challenges. Initially, we construct diverse and comprehensive multimodal few-shot datasets fitted to the original data distribution. To address the insufficient information in the few-shot setting, we introduce the Knowledge-Enhanced Cross-modal Prompt Model (KECPM) for JMERE. This method can effectively address the problem of insufficient information in the few-shot setting by guiding a large language model to generate supplementary background knowledge. Our proposed method comprises two stages: (1) a knowledge ingestion stage that dynamically formulates prompts based on semantic similarity to guide ChatGPT in generating relevant knowledge and employs self-reflection to refine the knowledge; (2) a knowledge-enhanced language model stage that merges the auxiliary knowledge with the original input and utilizes a transformer-based model to align with JMERE's required output format. We extensively evaluate our approach on a few-shot dataset derived from the JMERE dataset, demonstrating its superiority over strong baselines in terms of both micro and macro F_1 scores. Additionally, we present qualitative analyses and case studies to elucidate the effectiveness of our model.

[AI-60] Formal Explanations for Neuro-Symbolic AI

Link: https://arxiv.org/abs/2410.14219
Authors: Sushmita Paul,Jinqiang Yu,Jip J. Dekker,Alexey Ignatiev,Peter J. Stuckey
Keywords-EN: Artificial Intelligence, neural, algorithms face, face two significant, Neuro-symbolic artificial intelligence
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*Comments:

Click to view abstract

Abstract:Despite the practical success of Artificial Intelligence (AI), current neural AI algorithms face two significant issues. First, the decisions made by neural architectures are often prone to bias and brittleness. Second, when a chain of reasoning is required, neural systems often perform poorly. Neuro-symbolic artificial intelligence is a promising approach that tackles these (and other) weaknesses by combining the power of neural perception and symbolic reasoning. Meanwhile, the success of AI has made it critical to understand its behaviour, leading to the development of explainable artificial intelligence (XAI). While neuro-symbolic AI systems have important advantages over purely neural AI, we still need to explain their actions, which are obscured by the interactions of the neural and symbolic components. To address the issue, this paper proposes a formal approach to explaining the decisions of neuro-symbolic systems. The approach hinges on the use of formal abductive explanations and on solving the neuro-symbolic explainability problem hierarchically. Namely, it first computes a formal explanation for the symbolic component of the system, which serves to identify a subset of the individual parts of neural information that needs to be explained. This is followed by explaining only those individual neural inputs, independently of each other, which facilitates succinctness of hierarchical formal explanations and helps to increase the overall performance of the approach. Experimental results for a few complex reasoning tasks demonstrate practical efficiency of the proposed approach, in comparison to purely neural systems, from the perspective of explanation size, explanation time, training time, model sizes, and the quality of explanations reported.

[AI-61] Montessori-Instruct: Generate Influential Training Data Tailored for Student Learning

Link: https://arxiv.org/abs/2410.14208
Authors: Xiaochuan Li,Zichun Yu,Chenyan Xiong
Keywords-EN: inevitably introduces noisy, generative nature inevitably, nature inevitably introduces, misleading learning signals, large language models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: Codes and data are open-sourced at this https URL

Click to view abstract

Abstract:Synthetic data has been widely used to train large language models, but their generative nature inevitably introduces noisy, non-informative, and misleading learning signals. In this paper, we propose Montessori-Instruct, a novel data synthesis framework that tailors the data synthesis ability of the teacher language model toward the student language model’s learning process. Specifically, we utilize local data influence of synthetic training data points on students to characterize students’ learning preferences. Then, we train the teacher model with Direct Preference Optimization (DPO) to generate synthetic data tailored toward student learning preferences. Experiments with Llama3-8B-Instruct (teacher) and Llama3-8B (student) on Alpaca Eval and MT-Bench demonstrate that Montessori-Instruct significantly outperforms standard synthesis methods by 18.35% and 46.24% relatively. Our method also beats data synthesized by a stronger teacher model, GPT-4o. Further analysis confirms the benefits of teacher’s learning to generate more influential training data in the student’s improved learning, the advantages of local data influence in accurately measuring student preferences, and the robustness of Montessori-Instruct across different student models. Our code and data are open-sourced at this https URL.

[AI-62] Rationale Behind Essay Scores: Enhancing S-LLM's Multi-Trait Essay Scoring with Rationale Generated by LLMs

Link: https://arxiv.org/abs/2410.14202
Authors: SeongYeub Chu,JongWoo Kim,Bryan Wong,MunYong Yi
Keywords-EN: Existing automated essay, specific aspects evaluated, Rationale-based Multiple Trait, Existing automated, Multiple Trait Scoring
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Existing automated essay scoring (AES) has solely relied on essay text without using explanatory rationales for the scores, thereby forgoing an opportunity to capture the specific aspects evaluated by rubric indicators in a fine-grained manner. This paper introduces Rationale-based Multiple Trait Scoring (RMTS), a novel approach for multi-trait essay scoring that integrates prompt-engineering-based large language models (LLMs) with a fine-tuning-based essay scoring model using a smaller large language model (S-LLM). RMTS uses an LLM-based trait-wise rationale generation system where a separate LLM agent generates trait-specific rationales based on rubric guidelines, which the scoring model uses to accurately predict multi-trait scores. Extensive experiments on benchmark datasets, including ASAP, ASAP++, and Feedback Prize, show that RMTS significantly outperforms state-of-the-art models and vanilla S-LLMs in trait-specific scoring. By assisting quantitative assessment with fine-grained qualitative rationales, RMTS enhances the trait-wise reliability, providing partial explanations about essays.

[AI-63] Supervised Chain of Thought

Link: https://arxiv.org/abs/2410.14198
Authors: Xiang Zhang,Dujian Ding
Keywords-EN: advancing Artificial Intelligence, Large Language Models, revolutionized natural language, natural language processing, Artificial Intelligence
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) have revolutionized natural language processing and hold immense potential for advancing Artificial Intelligence. However, the core architecture of most mainstream LLMs – the Transformer – has inherent limitations in computational depth, rendering them theoretically incapable of solving many reasoning tasks that demand increasingly deep computations. Chain of Thought (CoT) prompting has emerged as a technique to address these architectural limitations, as evidenced by several theoretical studies. It offers a promising approach to solving complex reasoning tasks that were previously beyond the capabilities of these models. Despite its successes, CoT and its variants (such as Tree of Thought, Graph of Thought, etc.) rely on a “one-prompt-for-all” approach, using a single prompt structure (e.g., “think step by step”) for a wide range of tasks – from counting and sorting to solving mathematical and algorithmic problems. This approach poses significant challenges for models to generate the correct reasoning steps, as the model must navigate through a vast prompt template space to find the appropriate template for each task. In this work, we build upon previous theoretical analyses of CoT to demonstrate how the one-prompt-for-all approach can negatively affect the computability of LLMs. We partition the solution search space into two: the prompt space and the answer space. Our findings show that task-specific supervision is essential for navigating the prompt space accurately and achieving optimal performance. Through experiments with state-of-the-art LLMs, we reveal a gap in reasoning performance when supervision is applied versus when it is not.

[AI-64] Speciesism in Natural Language Processing Research

Link: https://arxiv.org/abs/2410.14194
Authors: Masashi Takeshita,Rafal Rzepka
Keywords-EN: Natural Language Processing, Natural Language, Language Processing, NLP, NLP research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*Comments: This article is a preprint and has not been peer-reviewed. The postprint has been accepted for publication in AI and Ethics. Please cite the final version of the article once it is published

Click to view abstract

Abstract:Natural Language Processing (NLP) research on AI Safety and social bias in AI has focused on safety for humans and social bias against human minorities. However, some AI ethicists have argued that the moral significance of nonhuman animals has been ignored in AI research. Therefore, the purpose of this study is to investigate whether there is speciesism, i.e., discrimination against nonhuman animals, in NLP research. First, we explain why nonhuman animals are relevant in NLP research. Next, we survey the findings of existing research on speciesism in NLP researchers, data, and models and further investigate this problem in this study. The findings of this study suggest that speciesism exists within researchers, data, and models, respectively. Specifically, our survey and experiments show that (a) among NLP researchers, even those who study social bias in AI, do not recognize speciesism or speciesist bias; (b) among NLP data, speciesist bias is inherent in the data annotated in the datasets used to evaluate NLP models; (c) OpenAI GPTs, recent NLP models, exhibit speciesist bias by default. Finally, we discuss how we can reduce speciesism in NLP research.

[AI-65] LLM The Genius Paradox: A Linguistic and Math Expert's Struggle with Simple Word-based Counting Problems

Link: https://arxiv.org/abs/2410.14166
Authors: Nan Xu,Xuezhe Ma
Keywords-EN: humans find trivial, trivial to handle, number of character, LLMs, Interestingly
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Interestingly, LLMs yet struggle with some basic tasks that humans find trivial to handle, e.g., counting the number of character r's in the word “strawberry”. There are several popular conjectures (e.g., tokenization, architecture and training data) regarding the reason for deficiency of LLMs in simple word-based counting problems, sharing the similar belief that such failure stems from model pretraining hence probably inevitable during deployment. In this paper, we carefully design multiple evaluation settings to investigate validity of prevalent conjectures. Meanwhile, we measure transferability of advanced mathematical and coding reasoning capabilities from specialized LLMs to simple counting tasks. Although specialized LLMs suffer from counting problems as well, we find conjectures about inherent deficiency of LLMs invalid and further seek opportunities to elicit knowledge and capabilities from LLMs that are beneficial to counting tasks. Compared with strategies such as finetuning and in-context learning that are commonly adopted to enhance performance on new or challenging tasks, we show that engaging reasoning is the most robust and efficient way to help LLMs better perceive tasks with more accurate responses. We hope our conjecture validation design could provide insights into the study of future critical failure modes of LLMs. Based on challenges in transferring advanced capabilities to much simpler tasks, we call for more attention to model capability acquisition and evaluation. We also highlight the importance of cultivating consciousness of “reasoning before responding” during model pretraining.

[AI-66] RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training

Link: https://arxiv.org/abs/2410.14154
Authors: Muhe Ding,Yang Ma,Pengda Qin,Jianlong Wu,Yuhong Li,Liqiang Nie
Keywords-EN: received substantial interest, recently received substantial, Large Language Models, Multimodal Large Language, Large Language
Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
*Comments: 10 pages, 6 figures, Journal

Click to view abstract

Abstract:Multimodal Large Language Models (MLLMs) have recently received substantial interest, which shows their emerging potential as general-purpose models for various vision-language tasks. MLLMs involve significant external knowledge within their parameters; however, it is challenging to continually update these models with the latest knowledge, which involves huge computational costs and poor interpretability. Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs. In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs. Considering the redundant information within vision modality, we first leverage the question to instruct the extraction of visual information through interactions with one set of learnable queries, minimizing irrelevant interference during retrieval and generation. Besides, we introduce a pre-trained multimodal adaptive fusion module to achieve question text-to-multimodal retrieval and integration of multimodal knowledge by projecting visual and language modalities into a unified semantic space. Furthermore, we present an Adaptive Selection Knowledge Generation (ASKG) strategy to train the generator to autonomously discern the relevance of retrieved knowledge, which realizes excellent denoising performance. Extensive experiments on open multimodal question-answering datasets demonstrate that RA-BLIP achieves significant performance and surpasses the state-of-the-art retrieval-augmented models.

[AI-67] Utilizing Large Language Models for Event Deconstruction to Enhance Multimodal Aspect-Based Sentiment Analysis

Link: https://arxiv.org/abs/2410.14150
Authors: Xiaoyong Huang,Heli Sun,Qunshu Gao,Wenjie Huang,Ruichen Cao
Keywords-EN: Aspect-Based Sentiment Analysis, Multimodal Aspect-Based Sentiment, making Multimodal Aspect-Based, Content continues to increase, User-Generated Content continues
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*Comments:

Click to view abstract

Abstract:With the rapid development of the internet, the richness of User-Generated Content continues to increase, making Multimodal Aspect-Based Sentiment Analysis (MABSA) a research hotspot. Existing studies have achieved certain results in MABSA, but they have not effectively addressed the analytical challenges in scenarios where multiple entities and sentiments coexist. This paper innovatively introduces Large Language Models (LLMs) for event decomposition and proposes MABSA-RL, a reinforcement learning framework for Multimodal Aspect-based Sentiment Analysis. This framework decomposes the original text into a set of events using LLMs, reducing the complexity of analysis, and introduces reinforcement learning to optimize model parameters. Experimental results show that MABSA-RL outperforms existing advanced methods on two benchmark datasets. This paper provides a new research perspective and method for multimodal aspect-level sentiment analysis.

[AI-68] CausalChat: Interactive Causal Model Development and Refinement Using Large Language Models

Link: https://arxiv.org/abs/2410.14146
Authors: Yanming Zhang,Akshith Kota,Eric Papenhausen,Klaus Mueller
Keywords-EN: Causal networks, detailed causal networks, complex relationships, Causal, construct causal networks
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*Comments:

Click to view abstract

Abstract:Causal networks are widely used in many fields to model the complex relationships between variables. A recent approach has sought to construct causal networks by leveraging the wisdom of crowds through the collective participation of humans. While this can yield detailed causal networks that model the underlying phenomena quite well, it requires a large number of individuals with domain understanding. We adopt a different approach: leveraging the causal knowledge that large language models, such as OpenAI’s GPT-4, have learned by ingesting massive amounts of literature. Within a dedicated visual analytics interface, called CausalChat, users explore single variables or variable pairs recursively to identify causal relations, latent variables, confounders, and mediators, constructing detailed causal networks through conversation. Each probing interaction is translated into a tailored GPT-4 prompt and the response is conveyed through visual representations which are linked to the generated text for explanations. We demonstrate the functionality of CausalChat across diverse data contexts and conduct user studies involving both domain experts and laypersons.

[AI-69] A Lightweight Multi Aspect Controlled Text Generation Solution For Large Language Models

Link: https://arxiv.org/abs/2410.14144
Authors: Chenyang Zhang,Jiayi Lin,Haibo Tong,Bingxuan Hou,Dongyu Zhang,Jialin Li,Junli Wang
Keywords-EN: Large language models, show remarkable abilities, Large language, Controllable Text Generation, show remarkable
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Large language models (LLMs) show remarkable abilities with instruction tuning. However, they fail to achieve ideal performance on target tasks when high-quality instruction tuning data for those tasks is lacking. Multi-Aspect Controllable Text Generation (MCTG) is a representative task for this dilemma, where aspect datasets are usually biased and correlated. Existing work exploits additional model structures and strategies for solutions, limiting adaptability to LLMs. To activate the MCTG ability of LLMs, we propose a lightweight MCTG pipeline based on data augmentation. We analyze bias and correlations in traditional datasets, and address these concerns with augmented control attributes and sentences. Augmented datasets are feasible for instruction tuning. In our experiments, LLMs perform better in MCTG after data augmentation, with a 20% accuracy rise and fewer aspect correlations.

[AI-70] ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom

Link: https://arxiv.org/abs/2410.14138
Authors: Jingqi Zhou,Sheng Wang,Jingwei Dong,Lei Li,Jiahui Gao,Lingpeng Kong,Chuan Wu
Keywords-EN: witnessed significant progress, visual understanding tasks, visual, witnessed significant, significant progress
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Large vision-language models (LVLMs) have witnessed significant progress on visual understanding tasks. However, they often prioritize language knowledge over image information on visual reasoning tasks, incurring performance degradation. To tackle this issue, we first identify the drawbacks of existing solutions (i.e., insufficient and irrelevant visual descriptions, and limited multi-modal capacities). We then decompose visual reasoning process into two stages: visual perception (i.e., eyesight) and textual reasoning (i.e., wisdom), and introduce a novel visual reasoning framework named ProReason. This framework features multi-run proactive perception and decoupled vision-reasoning capabilities. Briefly, given a multi-modal question, ProReason iterates proactive information collection and reasoning until the answer can be concluded with necessary and sufficient visual descriptions. Notably, the disassociation of capabilities allows seamless integration of existing large language models (LLMs) to compensate for the reasoning deficits of LVLMs. Our extensive experiments demonstrate that ProReason outperforms both existing multi-step reasoning frameworks and passive peer methods on a wide range of benchmarks for both open-source and closed-source models. In addition, with the assistance of LLMs, ProReason achieves a performance improvement of up to 15% on MMMU benchmark. Our insights into existing solutions and the decoupled perspective for feasible integration of LLMs illuminate future research on visual reasoning techniques, especially LLM-assisted ones.

[AI-71] Inverse Reinforcement Learning from Non-Stationary Learning Agents

Link: https://arxiv.org/abs/2410.14135
Authors: Kavinayan P. Sivakumar,Yi Shen,Zachary Bell,Scott Nivison,Boyuan Chen,Michael M. Zavlanos
Keywords-EN: trajectory data collected, inverse reinforcement learning, reinforcement learning problem, reward function, learning agent
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:In this paper, we study an inverse reinforcement learning problem that involves learning the reward function of a learning agent using trajectory data collected while this agent is learning its optimal policy. To address this problem, we propose an inverse reinforcement learning method that allows us to estimate the policy parameters of the learning agent which can then be used to estimate its reward function. Our method relies on a new variant of the behavior cloning algorithm, which we call bundle behavior cloning, and uses a small number of trajectories generated by the learning agent’s policy at different points in time to learn a set of policies that match the distribution of actions observed in the sampled trajectories. We then use the cloned policies to train a neural network model that estimates the reward function of the learning agent. We provide a theoretical analysis to show a complexity result on bound guarantees for our method that beats standard behavior cloning as well as numerical experiments for a reinforcement learning problem that validate the proposed method.
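A minimal sketch of the bundling idea, assuming a linear policy class and evenly sized temporal bundles (both illustrative choices; the paper's policy model and bundling rule may differ):

```python
import numpy as np

def fit_linear_policy(states, actions, lr=0.1, steps=500):
    """Clone a linear policy a ≈ W s by gradient descent on squared error."""
    W = np.zeros((actions.shape[1], states.shape[1]))
    for _ in range(steps):
        grad = (W @ states.T - actions.T) @ states / len(states)
        W -= lr * grad
    return W

def bundle_behavior_cloning(trajectories, num_bundles):
    """Group temporally adjacent trajectories into bundles and clone one
    policy per bundle, tracking how the learner's policy drifts over time."""
    size = len(trajectories) // num_bundles  # assume even division for brevity
    policies = []
    for i in range(num_bundles):
        bundle = trajectories[i * size:(i + 1) * size]
        S = np.concatenate([t["states"] for t in bundle])
        A = np.concatenate([t["actions"] for t in bundle])
        policies.append(fit_linear_policy(S, A))
    return policies
```

In the abstract's pipeline, the per-bundle cloned policies would then be used to train the network that estimates the learning agent's reward function.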

[AI-72] Towards Robust Transcription: Exploring Noise Injection Strategies for Training Data Augmentation

Link: https://arxiv.org/abs/2410.14122
Authors: Yonghyun Kim,Alexander Lerch
Keywords-EN: Automatic Piano Transcription, remains largely unexplored, significantly improved system, Automatic Piano, improved system performance
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*Comments: Accepted to the Late-Breaking Demo Session of the 25th International Society for Music Information Retrieval (ISMIR) Conference, 2024

Click to view abstract

Abstract:Recent advancements in Automatic Piano Transcription (APT) have significantly improved system performance, but the impact of noisy environments on the system performance remains largely unexplored. This study investigates the impact of white noise at various Signal-to-Noise Ratio (SNR) levels on state-of-the-art APT models and evaluates the performance of the Onsets and Frames model when trained on noise-augmented data. We hope this research provides valuable insights as preliminary work toward developing transcription models that maintain consistent performance across a range of acoustic conditions.
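Injecting white noise at a target SNR, as in this study's augmentation setup, typically follows the standard power-ratio recipe. This is a generic sketch, not the authors' exact pipeline:

```python
import numpy as np

def add_white_noise(signal, snr_db, rng=None):
    """Add white Gaussian noise scaled so the result has the requested SNR (dB)."""
    rng = rng or np.random.default_rng()
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))  # SNR = P_sig / P_noise
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

rng = np.random.default_rng(0)
sr = 16000
clean = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)  # 1 s of A440 as a stand-in for piano audio
noisy = add_white_noise(clean, snr_db=10.0, rng=rng)     # a moderately noisy condition
```

Sweeping `snr_db` over a range of levels reproduces the kind of noise-augmented training data the study evaluates the Onsets and Frames model on.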

[AI-73] FedMSE: Federated learning for IoT network intrusion detection

Link: https://arxiv.org/abs/2410.14121
Authors: Van Tuan Nguyen,Razvan Beuran
Keywords-EN: improving IoT network, paper proposes, improving IoT, federated learning approach, federated learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:This paper proposes a novel federated learning approach for improving IoT network intrusion detection. The rise of IoT has expanded the cyber attack surface, making traditional centralized machine learning methods insufficient due to concerns about data availability, computational resources, transfer costs, and especially privacy preservation. A semi-supervised federated learning model was developed to overcome these issues, combining the Shrink Autoencoder and Centroid one-class classifier (SAE-CEN). This approach enhances the performance of intrusion detection by effectively representing normal network data and accurately identifying anomalies in the decentralized strategy. Additionally, a mean square error-based aggregation algorithm (MSEAvg) was introduced to improve global model performance by prioritizing more accurate local models. The results obtained in our experimental setup, which uses various settings relying on the N-BaIoT dataset and Dirichlet distribution, demonstrate significant improvements in real-world heterogeneous IoT networks: detection accuracy rises from 93.98 \pm 2.90 to 97.30 \pm 0.49, learning costs are reduced by requiring only 50% of gateways to participate in the training process, and robustness holds in large-scale networks.
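The error-weighted aggregation idea behind MSEAvg can be sketched as inverse-MSE weighting of local model parameters; the exact weighting rule used in the paper may differ from this assumption:

```python
import numpy as np

def mse_avg(local_weights, local_mses, eps=1e-12):
    """Aggregate local model parameters, giving more weight to clients whose
    local models report lower mean squared error (inverse-error weighting)."""
    inv = 1.0 / (np.asarray(local_mses, dtype=float) + eps)
    coef = inv / inv.sum()
    return sum(c * w for c, w in zip(coef, local_weights))

# Two gateways: the more accurate one (MSE 0.1) dominates the global average.
w_global = mse_avg([np.array([1.0, 1.0]), np.array([3.0, 3.0])],
                   local_mses=[0.1, 0.3])
print(w_global)  # [1.5 1.5]
```

Compared with plain FedAvg (sample-count weighting), this biases the global model toward gateways whose local detectors fit their data best, matching the "prioritizing more accurate local models" described in the abstract.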

[AI-74] Skill Generalization with Verbs IROS2023

Link: https://arxiv.org/abs/2410.14118
Authors: Rachel Ma,Lyndon Lam,Benjamin A. Spiegel,Aditya Ganeshan,Roma Patel,Ben Abbatematteo,David Paulius,Stefanie Tellex,George Konidaris
Keywords-EN: understand natural language, language commands issued, natural language commands, issued by humans, understand natural
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: 7 pages + 2 pages (references), 6 figures. Accepted at IROS 2023. Code, dataset info and demo videos can be found at: this https URL

Click to view abstract

Abstract:It is imperative that robots can understand natural language commands issued by humans. Such commands typically contain verbs that signify what action should be performed on a given object and that are applicable to many objects. We propose a method for generalizing manipulation skills to novel objects using verbs. Our method learns a probabilistic classifier that determines whether a given object trajectory can be described by a specific verb. We show that this classifier accurately generalizes to novel object categories with an average accuracy of 76.69% across 13 object categories and 14 verbs. We then perform policy search over the object kinematics to find an object trajectory that maximizes classifier prediction for a given verb. Our method allows a robot to generate a trajectory for a novel object based on a verb, which can then be used as input to a motion planner. We show that our model can generate trajectories that are usable for executing five verb commands applied to novel instances of two different object categories on a real robot.

[AI-75] A Communication and Computation Efficient Fully First-order Method for Decentralized Bilevel Optimization

链接: https://arxiv.org/abs/2410.14115
作者: Min Wen,Chengchang Liu,Ahmed Abdelmoniem,Yipeng Zhou,Yuedong Xu
关键词-EN: decentralized Bilevel optimization, Bilevel optimization, decentralized bilevel, meta-learning and reinforcement, remains less explored
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC)
*备注: 19 Pages

点击查看摘要

Abstract:Bilevel optimization, crucial for hyperparameter tuning, meta-learning and reinforcement learning, remains less explored in the decentralized learning paradigm, such as decentralized federated learning (DFL). Typically, decentralized bilevel methods rely on both gradients and Hessian matrices to approximate hypergradients of upper-level models. However, acquiring and sharing the second-order oracle is compute- and communication-intensive. To overcome these challenges, this paper introduces a fully first-order decentralized method for decentralized bilevel optimization, C^2DFB, which is both compute- and communicate-efficient. In C^2DFB, each learning node optimizes a min-min-max problem to approximate the hypergradient by exclusively using gradient information. To reduce the traffic load in the inner loop of solving the lower-level problem, C^2DFB incorporates a lightweight communication protocol for efficiently transmitting compressed residuals of local parameters. Rigorous theoretical analysis ensures the algorithm's convergence, indicating a first-order oracle complexity of \tilde{\mathcal{O}}(\epsilon^{-4}). Experiments on hyperparameter tuning and hyper-representation tasks validate the superiority of C^2DFB across various topologies and heterogeneous data distributions.

[AI-76] Extreme Precipitation Nowcasting using Multi-Task Latent Diffusion Models

链接: https://arxiv.org/abs/2410.14103
作者: Li Chaorong,Ling Xudong,Yang Qiang,Qin Fengqing,Huang Yuanyuan
关键词-EN: Deep learning models, made remarkable strides, Deep learning, high precipitation intensity, precipitation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 12 pages, 6 figures

点击查看摘要

Abstract:Deep learning models have made remarkable strides in precipitation prediction, yet they continue to struggle with capturing the spatial details of the features of radar images, particularly over high precipitation intensity areas. This shortcoming is evident in the form of low forecast accuracy in the spatial positioning of radar echo images across varying precipitation intensity regions. To address this challenge, we introduce the multi-task latent diffusion model (MTLDM), a novel approach for precipitation prediction. The basic concept of the MTLDM is based on the understanding that the radar image representing precipitation is the result of multiple factors. Therefore, we adopt a divide-and-conquer approach, that is, we decompose the radar image using decomposition technology and then predict the decomposed sub-images separately. We conceptualize the precipitation image as a composition of various components corresponding to different precipitation intensities. The MTLDM decomposes the precipitation image into these distinct components and employs a dedicated task to predict each one. This method enables spatiotemporally consistent prediction of real-world precipitation areas up to 5-80 min in advance, outperforming existing state-of-the-art techniques across multiple evaluation metrics.

[AI-77] Multi-Source Spatial Knowledge Understanding for Immersive Visual Text-to-Speech

链接: https://arxiv.org/abs/2410.14101
作者: Shuwei He,Rui Liu,Haizhou Li
关键词-EN: multi-source spatial knowledge, spatial environmental image, multi-source spatial, spoken content, spatial knowledge
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: 5 pages, 1 figure

点击查看摘要

Abstract:Visual Text-to-Speech (VTTS) aims to take the spatial environmental image as the prompt to synthesize the reverberation speech for the spoken content. Previous research focused on the RGB modality for global environmental modeling, overlooking the potential of multi-source spatial knowledge like depth, speaker position, and environmental semantics. To address the issues, we propose a novel multi-source spatial knowledge understanding scheme for immersive VTTS, termed MS^2KU-VTTS. Specifically, we first prioritize the RGB image as the dominant source and consider the depth image, speaker position knowledge from object detection, and semantic captions from an image-understanding LLM as supplementary sources. Afterwards, we propose a serial interaction mechanism to deeply engage with both dominant and supplementary sources. The resulting multi-source knowledge is dynamically integrated. This enriched interaction and integration of multi-source spatial knowledge guides the speech generation model, enhancing the immersive spatial speech experience. Experimental results demonstrate that MS^2KU-VTTS surpasses existing baselines in generating immersive speech. Demos and code are available at: this https URL.

[AI-78] ST-MoE-BERT: A Spatial-Temporal Mixture-of-Experts Framework for Long-Term Cross-City Mobility Prediction

链接: https://arxiv.org/abs/2410.14099
作者: Haoyu He,Haozheng Luo,Qi R. Wang
关键词-EN: Predicting human mobility, multiple cities presents, cities presents significant, Predicting human, presents significant challenges
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 2nd ACM SIGSPATIAL International Workshop on the Human Mobility Prediction Challenge

点击查看摘要

Abstract:Predicting human mobility across multiple cities presents significant challenges due to the complex and diverse spatial-temporal dynamics inherent in different urban environments. In this study, we propose a robust approach to predict human mobility patterns called ST-MoE-BERT. Compared to existing methods, our approach frames the prediction task as a spatial-temporal classification problem. Our methodology integrates the Mixture-of-Experts architecture with BERT model to capture complex mobility dynamics and perform the downstream human mobility prediction task. Additionally, transfer learning is integrated to solve the challenge of data scarcity in cross-city prediction. We demonstrate the effectiveness of the proposed model on GEO-BLEU and DTW, comparing it to several state-of-the-art methods. Notably, ST-MoE-BERT achieves an average improvement of 8.29%.
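Of the two reported metrics, DTW has a compact textbook form; the sketch below is the standard dynamic-programming implementation for 1-D sequences, shown for reference only (the paper applies it to mobility trajectories):

```python
import math

def dtw(a, b):
    """Dynamic Time Warping distance between two 1-D sequences:
    the classic O(len(a)*len(b)) dynamic program with an
    absolute-difference local cost."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]
```

Because the alignment may warp time, a sequence with a repeated element still matches its unrepeated counterpart at zero cost, e.g. `dtw([1, 2, 3], [1, 1, 2, 3])` is 0.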

[AI-79] Towards Effective Planning Strategies for Dynamic Opinion Networks NEURIPS2024

链接: https://arxiv.org/abs/2410.14091
作者: Bharath Muppasani,Protik Nag,Vignesh Narayanan,Biplav Srivastava,Michael N. Huhns
关键词-EN: disseminating accurate information, under-explored intervention planning, intervention planning aimed, disseminating accurate, accurate information
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at NeurIPS 2024

点击查看摘要

Abstract:In this study, we investigate the under-explored intervention planning aimed at disseminating accurate information within dynamic opinion networks by leveraging learning strategies. Intervention planning involves identifying key nodes (search) and exerting control (e.g., disseminating accurate/official information through the nodes) to mitigate the influence of misinformation. However, as network size increases, the problem becomes computationally intractable. To address this, we first introduce a novel ranking algorithm (search) to identify key nodes for disseminating accurate information, which facilitates the training of neural network (NN) classifiers for scalable and generalized solutions. Second, we address the complexity of label generation (through search) by developing a Reinforcement Learning (RL)-based dynamic planning framework. We investigate NN-based RL planners tailored for dynamic opinion networks governed by two propagation models for the framework. Each model incorporates both binary and continuous opinion and trust representations. Our experimental results demonstrate that our ranking algorithm-based classifiers provide plans that enhance infection rate control, especially with increased action budgets. Moreover, reward strategies focusing on key metrics, such as the number of susceptible nodes and infection rates, outperform those prioritizing faster blocking strategies. Additionally, our findings reveal that Graph Convolutional Networks (GCNs)-based planners facilitate scalable centralized plans that achieve lower infection rates (higher control) across various network scenarios (e.g., Watts-Strogatz topology, varying action budgets, varying initial infected nodes, and varying degree of infected nodes).

[AI-80] In-context learning and Occam's razor

链接: https://arxiv.org/abs/2410.14086
作者: Eric Elmoznino,Tom Marty,Tejas Kasetty,Leo Gagnon,Sarthak Mittal,Mahan Fathi,Dhanya Sridhar,Guillaume Lajoie
关键词-EN: Free Lunch Theorem, Lunch Theorem states, Free Lunch, Lunch Theorem, Occam razor
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The goal of machine learning is generalization. While the No Free Lunch Theorem states that we cannot obtain theoretical guarantees for generalization without further assumptions, in practice we observe that simple models which explain the training data generalize best: a principle called Occam’s razor. Despite the need for simple models, most current approaches in machine learning only minimize the training error, and at best indirectly promote simplicity through regularization or architecture design. Here, we draw a connection between Occam’s razor and in-context learning: an emergent ability of certain sequence models like Transformers to learn at inference time from past observations in a sequence. In particular, we show that the next-token prediction loss used to train in-context learners is directly equivalent to a data compression technique called prequential coding, and that minimizing this loss amounts to jointly minimizing both the training error and the complexity of the model that was implicitly learned from context. Our theory and the empirical experiments we use to support it not only provide a normative account of in-context learning, but also elucidate the shortcomings of current in-context learning methods, suggesting ways in which they can be improved. We make our code available at this https URL.
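The claimed equivalence between next-token prediction loss and prequential coding can be illustrated with a toy online predictor: the prequential code length of a sequence is the summed log-loss of a model that predicts each symbol before updating on it. The Laplace-smoothed Bernoulli model below is only a minimal illustration, not the paper's Transformer setting:

```python
import math

def prequential_code_length(seq, alpha=1.0):
    """Prequential (predict-then-update) code length in bits of a
    binary sequence under a Laplace-smoothed Bernoulli model.
    This equals the summed next-token prediction loss in base 2."""
    counts = [alpha, alpha]  # pseudo-counts for symbols 0 and 1
    bits = 0.0
    for x in seq:
        p = counts[x] / (counts[0] + counts[1])  # predictive prob of x
        bits += -math.log2(p)                    # code length of this symbol
        counts[x] += 1                           # online update
    return bits

# A simple (constant) sequence compresses far better than a
# maximally mixed one of the same length.
low = prequential_code_length([1] * 16)
high = prequential_code_length([0, 1] * 8)
```

The constant sequence telescopes to exactly log2(17) ≈ 4.09 bits, while the alternating one costs over a bit per symbol: the code length tracks the simplicity of the model that explains the data.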

[AI-81] Interpreting Inflammation Prediction Model via Tag-based Cohort Explanation

链接: https://arxiv.org/abs/2410.14082
作者: Fanyu Meng,Jules Larke,Xin Liu,Zhaodan Kong,Xin Chen,Danielle Lemay,Ilias Tagkopoulos
关键词-EN: make intelligent decisions, revolutionizing nutrition science, Machine learning, intelligent decisions, learning is revolutionizing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Machine learning is revolutionizing nutrition science by enabling systems to learn from data and make intelligent decisions. However, the complexity of these models often leads to challenges in understanding their decision-making processes, necessitating the development of explainability techniques to foster trust and increase model transparency. An under-explored type of explanation is cohort explanation, which provides explanations to groups of instances with similar characteristics. Unlike traditional methods that focus on individual explanations or global model behavior, cohort explainability bridges the gap by providing unique insights at an intermediate granularity. We propose a novel framework for identifying cohorts within a dataset based on local feature importance scores, aiming to generate concise descriptions of the clusters via tags. We evaluate our framework on a food-based inflammation prediction model and demonstrate that the framework can generate reliable explanations that match domain knowledge.

[AI-82] Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers

链接: https://arxiv.org/abs/2410.14072
作者: Yuxin Wen,Qingqing Cao,Qichen Fu,Sachin Mehta,Mahyar Najibi
关键词-EN: perform complex reasoning, Recent advancements, visual tokens, tokens, real-world applications
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Recent advancements in vision-language models (VLMs) have expanded their potential for real-world applications, enabling these models to perform complex reasoning on images. In the widely used fully autoregressive transformer-based models like LLaVA, projected visual tokens are prepended to textual tokens. Oftentimes, visual tokens are significantly more than prompt tokens, resulting in increased computational overhead during both training and inference. In this paper, we propose Visual Compact Token Registers (Victor), a method that reduces the number of visual tokens by summarizing them into a smaller set of register tokens. Victor adds a few learnable register tokens after the visual tokens and summarizes the visual information into these registers using the first few layers in the language tower of VLMs. After these few layers, all visual tokens are discarded, significantly improving computational efficiency for both training and inference. Notably, our method is easy to implement and requires a small number of new trainable parameters with minimal impact on model performance. In our experiment, with merely 8 visual registers–about 1% of the original tokens–Victor shows less than a 4% accuracy drop while reducing the total training time by 43% and boosting the inference throughput by 3.3X.

[AI-83] FaceSaliencyAug: Mitigating Geographic Gender and Stereotypical Biases via Saliency-Based Data Augmentation

链接: https://arxiv.org/abs/2410.14070
作者: Teerath Kumar,Alessandra Mileo,Malika Bendechache
关键词-EN: pose significant challenges, Convolutional Neural Networks, models pose significant, vision models pose, computer vision models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted at Image Signal and Video processing

点击查看摘要

Abstract:Geographical, gender and stereotypical biases in computer vision models pose significant challenges to their performance and fairness. In this study, we present an approach named FaceSaliencyAug aimed at addressing the gender bias in Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). Leveraging the salient regions of faces detected by saliency, the proposed approach mitigates geographical and stereotypical biases in the datasets. FaceSaliencyAug randomly selects masks from a predefined search space and applies them to the salient region of face images, subsequently restoring the original image with masked salient region. The proposed augmentation strategy enhances data diversity, thereby improving model performance and debiasing effects. We quantify dataset diversity using Image Similarity Score (ISS) across five datasets, including Flickr Faces HQ (FFHQ), WIKI, IMDB, Labelled Faces in the Wild (LFW), UTK Faces, and Diverse Dataset. The proposed approach demonstrates superior diversity metrics, as evaluated by ISS-intra and ISS-inter algorithms. Furthermore, we evaluate the effectiveness of our approach in mitigating gender bias on CEO, Engineer, Nurse, and School Teacher datasets. We use the Image-Image Association Score (IIAS) to measure gender bias in these occupations. Our experiments reveal a reduction in gender bias for both CNNs and ViTs, indicating the efficacy of our method in promoting fairness and inclusivity in computer vision models.

[AI-84] Provable Benefits of Complex Parameterizations for Structured State Space Models NEURIPS2024

链接: https://arxiv.org/abs/2410.14067
作者: Yuval Ran-Milo,Eden Lumbroso,Edo Cohen-Karlik,Raja Giryes,Amir Globerson,Nadav Cohen
关键词-EN: Structured state space, linear dynamical systems, dynamical systems adhering, real SSM, prominent neural networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: 12 pages, 1 figure. Accepted to NeurIPS 2024

点击查看摘要

Abstract:Structured state space models (SSMs), the core engine behind prominent neural networks such as S4 and Mamba, are linear dynamical systems adhering to a specified structure, most notably diagonal. In contrast to typical neural network modules, whose parameterizations are real, SSMs often use complex parameterizations. Theoretically explaining the benefits of complex parameterizations for SSMs is an open problem. The current paper takes a step towards its resolution, by establishing formal gaps between real and complex diagonal SSMs. Firstly, we prove that while a moderate dimension suffices in order for a complex SSM to express all mappings of a real SSM, a much higher dimension is needed for a real SSM to express mappings of a complex SSM. Secondly, we prove that even if the dimension of a real SSM is high enough to express a given mapping, typically, doing so requires the parameters of the real SSM to hold exponentially large values, which cannot be learned in practice. In contrast, a complex SSM can express any given mapping with moderate parameter values. Experiments corroborate our theory, and suggest a potential extension of the theory that accounts for selectivity, a new architectural feature yielding state of the art performance.

[AI-85] On Partial Prototype Collapse in the DINO Family of Self-Supervised Methods BMVC2024

链接: https://arxiv.org/abs/2410.14060
作者: Hariprasath Govindarajan,Per Sidén,Jacob Roll,Fredrik Lindsten
关键词-EN: prominent self-supervised learning, self-supervised learning paradigm, mixture model, mixture model simultaneously, prominent self-supervised
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: First version of the paper appeared in OpenReview on 22 Sep 2023. Accepted to BMVC 2024

点击查看摘要

Abstract:A prominent self-supervised learning paradigm is to model the representations as clusters, or more generally as a mixture model. Learning to map the data samples to compact representations and fitting the mixture model simultaneously leads to the representation collapse problem. Regularizing the distribution of data points over the clusters is the prevalent strategy to avoid this issue. While this is sufficient to prevent full representation collapse, we show that a partial prototype collapse problem still exists in the DINO family of methods, that leads to significant redundancies in the prototypes. Such prototype redundancies serve as shortcuts for the method to achieve a marginal latent class distribution that matches the prescribed prior. We show that by encouraging the model to use diverse prototypes, the partial prototype collapse can be mitigated. Effective utilization of the prototypes enables the methods to learn more fine-grained clusters, encouraging more informative representations. We demonstrate that this is especially beneficial when pre-training on a long-tailed fine-grained dataset.

[AI-86] Towards Cross-Cultural Machine Translation with Retrieval-Augmented Generation from Multilingual Knowledge Graphs EMNLP2024

链接: https://arxiv.org/abs/2410.14057
作者: Simone Conia,Daniel Lee,Min Li,Umar Farooq Minhas,Saloni Potdar,Yunyao Li
关键词-EN: challenging task, Translating text, cultural-related references, references can vary, vary significantly
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted at EMNLP 2024

点击查看摘要

Abstract:Translating text that contains entity names is a challenging task, as cultural-related references can vary significantly across languages. These variations may also be caused by transcreation, an adaptation process that entails more than transliteration and word-for-word translation. In this paper, we address the problem of cross-cultural translation on two fronts: (i) we introduce XC-Translate, the first large-scale, manually-created benchmark for machine translation that focuses on text that contains potentially culturally-nuanced entity names, and (ii) we propose KG-MT, a novel end-to-end method to integrate information from a multilingual knowledge graph into a neural machine translation model by leveraging a dense retrieval mechanism. Our experiments and analyses show that current machine translation systems and large language models still struggle to translate texts containing entity names, whereas KG-MT outperforms state-of-the-art approaches by a large margin, obtaining a 129% and 62% relative improvement compared to NLLB-200 and GPT-4, respectively.

[AI-87] From Isolated Conversations to Hierarchical Schemas: Dynamic Tree Memory Representation for LLMs

链接: https://arxiv.org/abs/2410.14052
作者: Alireza Rezazadeh,Zichao Li,Wei Wei,Yujia Bao
关键词-EN: Recent advancements, large language models, effective long-term memory, context windows, advancements in large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advancements in large language models have significantly improved their context windows, yet challenges in effective long-term memory management remain. We introduce MemTree, an algorithm that leverages a dynamic, tree-structured memory representation to optimize the organization, retrieval, and integration of information, akin to human cognitive schemas. MemTree organizes memory hierarchically, with each node encapsulating aggregated textual content, corresponding semantic embeddings, and varying abstraction levels across the tree’s depths. Our algorithm dynamically adapts this memory structure by computing and comparing semantic embeddings of new and existing information to enrich the model’s context-awareness. This approach allows MemTree to handle complex reasoning and extended interactions more effectively than traditional memory augmentation methods, which often rely on flat lookup tables. Evaluations on benchmarks for multi-turn dialogue understanding and document question answering show that MemTree significantly enhances performance in scenarios that demand structured memory management.
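A highly simplified, hypothetical sketch of the embedding-based routing such a tree memory implies: new memories descend toward their most semantically similar branch, else attach as a new leaf. This omits MemTree's node summarization and abstraction levels, and the threshold rule is illustrative:

```python
import numpy as np

class Node:
    """One memory node: an embedding, its text, and child nodes."""
    def __init__(self, emb, text):
        self.emb, self.text, self.children = emb, text, []

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def insert(root, emb, text, threshold=0.7):
    """Route a new memory down the tree: descend into the most
    similar child while similarity clears `threshold`, otherwise
    attach a new leaf at the current node."""
    node = root
    while node.children:
        best = max(node.children, key=lambda c: cosine(c.emb, emb))
        if cosine(best.emb, emb) < threshold:
            break
        node = best
    node.children.append(Node(emb, text))
    return node  # the parent the new memory was attached under

root = Node(np.zeros(3), "root")
insert(root, np.array([1.0, 0.0, 0.0]), "topic A")
insert(root, np.array([0.9, 0.1, 0.0]), "topic A detail")  # nests under topic A
insert(root, np.array([0.0, 1.0, 0.0]), "topic B")         # new branch at root
```

Unlike a flat lookup table, related memories end up grouped under a shared ancestor, which is what enables retrieval at varying levels of abstraction.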

[AI-88] Best in Tau@LLMJudge: Criteria-Based Relevance Evaluation with Llama3

链接: https://arxiv.org/abs/2410.14044
作者: Naghmeh Farzi,Laura Dietz
关键词-EN: Traditional evaluation, human-annotated relevance labels, systems relies, costly at scale, relies on human-annotated
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Traditional evaluation of information retrieval (IR) systems relies on human-annotated relevance labels, which can be both biased and costly at scale. In this context, large language models (LLMs) offer an alternative by allowing us to directly prompt them to assign relevance labels for passages associated with each query. In this study, we explore alternative methods to directly prompt LLMs for assigned relevance labels, by exploring two hypotheses: Hypothesis 1 assumes that it is helpful to break down “relevance” into specific criteria - exactness, coverage, topicality, and contextual fit. We explore different approaches that prompt large language models (LLMs) to obtain criteria-level grades for all passages, and we consider various ways to aggregate criteria-level grades into a relevance label. Hypothesis 2 assumes that differences in linguistic style between queries and passages may negatively impact the automatic relevance label prediction. We explore whether improvements can be achieved by first synthesizing a summary of the passage in the linguistic style of a query, and then using this summary in place of the passage to assess its relevance. We include an empirical evaluation of our approaches based on data from the LLMJudge challenge run in Summer 2024, where our “Four Prompts” approach obtained the highest scores in Kendall’s tau.
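One hypothetical way to collapse criteria-level grades into a single relevance label is a thresholded weighted mean; the paper explores several aggregation strategies, so the rule, grade scale, and names below are illustrative only:

```python
def aggregate_relevance(grades, weights=None, scale=3):
    """Collapse criteria-level grades (e.g. exactness, coverage,
    topicality, contextual fit, each on 0..scale) into one relevance
    label by rounding their weighted mean. Note Python's round()
    uses banker's rounding, so a mean of 2.5 maps to 2.
    grades: dict mapping criterion name -> integer grade."""
    if weights is None:
        weights = {k: 1.0 for k in grades}   # unweighted by default
    total_w = sum(weights[k] for k in grades)
    mean = sum(weights[k] * g for k, g in grades.items()) / total_w
    return max(0, min(scale, round(mean)))

label = aggregate_relevance(
    {"exactness": 3, "coverage": 2, "topicality": 3, "contextual_fit": 2})
```

A per-criterion weighting (e.g. emphasizing exactness) would be a one-line change via the `weights` argument.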

[AI-89] Latent Weight Diffusion: Generating Policies from Trajectories

链接: https://arxiv.org/abs/2410.14040
作者: Shashank Hegde,Gautam Salhotra,Gaurav S. Sukhatme
关键词-EN: open-source robotic data, diffusion, manipulation and locomotion, increasing availability, availability of open-source
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:With the increasing availability of open-source robotic data, imitation learning has emerged as a viable approach for both robot manipulation and locomotion. Currently, large generalized policies are trained to predict controls or trajectories using diffusion models, which have the desirable property of learning multimodal action distributions. However, generalizability comes with a cost - namely, larger model size and slower inference. Further, there is a known trade-off between performance and action horizon for Diffusion Policy (i.e., diffusing trajectories): fewer diffusion queries accumulate greater trajectory tracking errors. Thus, it is common practice to run these models at high inference frequency, subject to robot computational constraints. To address these limitations, we propose Latent Weight Diffusion (LWD), a method that uses diffusion to learn a distribution over policies for robotic tasks, rather than over trajectories. Our approach encodes demonstration trajectories into a latent space and then decodes them into policies using a hypernetwork. We employ a diffusion denoising model within this latent space to learn its distribution. We demonstrate that LWD can reconstruct the behaviors of the original policies that generated the trajectory dataset. LWD offers the benefits of considerably smaller policy networks during inference and requires fewer diffusion model queries. When tested on the Metaworld MT10 benchmark, LWD achieves a higher success rate compared to a vanilla multi-task policy, while using models up to ~18x smaller during inference. Additionally, since LWD generates closed-loop policies, we show that it outperforms Diffusion Policy in long action horizon settings, with reduced diffusion queries during rollout. 

[AI-90] Generating Signed Language Instructions in Large-Scale Dialogue Systems NAACL2024

链接: https://arxiv.org/abs/2410.14026
作者: Mert İnan,Katherine Atwell,Anthony Sicilia,Lorna Quandt,Malihe Alikhani
关键词-EN: American Sign Language, worldwide multimodal conversational, enhanced with American, Large Language Models, American Sign
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
*备注: 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024) Industry Track

点击查看摘要

Abstract:We introduce a goal-oriented conversational AI system enhanced with American Sign Language (ASL) instructions, presenting the first implementation of such a system on a worldwide multimodal conversational AI platform. Accessible through a touch-based interface, our system receives input from users and seamlessly generates ASL instructions by leveraging retrieval methods and cognitively based gloss translations. Central to our design is a sign translation module powered by Large Language Models, alongside a token-based video retrieval system for delivering instructional content from recipes and wikiHow guides. Our development process is deeply rooted in a commitment to community engagement, incorporating insights from the Deaf and Hard-of-Hearing community, as well as experts in cognitive and ASL learning sciences. The effectiveness of our signing instructions is validated by user feedback, achieving ratings on par with those of the system in its non-signing variant. Additionally, our system demonstrates exceptional performance in retrieval accuracy and text-generation quality, measured by metrics such as BERTScore. We have made our codebase and datasets publicly accessible at this https URL, and a demo of our signed instruction video retrieval system is available at this https URL.

[AI-91] Vision-Language-Action Model and Diffusion Policy Switching Enables Dexterous Control of an Anthropomorphic Hand

链接: https://arxiv.org/abs/2410.14022
作者: Cheng Pan,Kai Junge,Josie Hughes
关键词-EN: autonomous dexterous manipulation, advance autonomous dexterous, hybrid control method, VLA model, VLA
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:To advance autonomous dexterous manipulation, we propose a hybrid control method that combines the relative advantages of a fine-tuned Vision-Language-Action (VLA) model and diffusion models. The VLA model provides language-commanded high-level planning, which is highly generalizable, while the diffusion model handles low-level interactions, offering the precision and robustness required for specific objects and environments. By incorporating a switching signal into the training data, we enable event-based transitions between these two models for a pick-and-place task where the target object and placement location are commanded through language. This approach is deployed on our anthropomorphic ADAPT Hand 2, a 13-DoF robotic hand, which incorporates compliance through series elastic actuation, allowing resilience to any interactions: showing the first use of a multi-fingered hand controlled with a VLA model. We demonstrate that this model-switching approach results in an over 80% success rate compared to under 40% when only using a VLA model, enabled by accurate near-object arm motion from the VLA model and a multi-modal grasping motion with error-recovery abilities from the diffusion model.

[AI-92] Whisker-Inspired Tactile Sensing: A Sim2Real Approach for Precise Underwater Contact Tracking

链接: https://arxiv.org/abs/2410.14005
作者: Hao Li,Chengyi Xing,Saad Khan,Miaoya Zhong,Mark R. Cutkosky
关键词-EN: Aquatic mammals, Fiber Bragg Grating, analyze water movements, inspiring the development, detect and discriminate
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Aquatic mammals, such as pinnipeds, utilize their whiskers to detect and discriminate objects and analyze water movements, inspiring the development of robotic whiskers for sensing contacts, surfaces, and water flows. We present the design and application of underwater whisker sensors based on Fiber Bragg Grating (FBG) technology. These passive whiskers are mounted along the robot’s exterior to sense its surroundings through light, non-intrusive contacts. For contact tracking, we employ a sim-to-real learning framework, which involves extensive data collection in simulation followed by a sim-to-real calibration process to transfer the model trained in simulation to the real world. Experiments with whiskers immersed in water indicate that our approach can track contact points with an accuracy of 2 mm, without requiring precise robot proprioception. We demonstrate that the approach also generalizes to unseen objects.

[AI-93] On the Learn-to-Optimize Capabilities of Transformers in In-Context Sparse Recovery

链接: https://arxiv.org/abs/2410.13981
作者: Renpu Liu,Ruida Zhou,Cong Shen,Jing Yang
关键词-EN: parameter updating based, perform in-context learning, contextual information provided, in-context learning, Von Oswald
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:An intriguing property of the Transformer is its ability to perform in-context learning (ICL), where the Transformer can solve different inference tasks without parameter updating based on the contextual information provided by the corresponding input-output demonstration pairs. It has been theoretically proved that ICL is enabled by the capability of Transformers to perform gradient-descent algorithms (Von Oswald et al., 2023a; Bai et al., 2024). This work takes a step further and shows that Transformers can perform learning-to-optimize (L2O) algorithms. Specifically, for the ICL sparse recovery (formulated as LASSO) tasks, we show that a K-layer Transformer can perform an L2O algorithm with a provable convergence rate linear in K. This provides a new perspective explaining the superior ICL capability of Transformers, even with only a few layers, which cannot be achieved by the standard gradient-descent algorithms. Moreover, unlike the conventional L2O algorithms that require the measurement matrix involved in training to match that in testing, the trained Transformer is able to solve sparse recovery problems generated with different measurement matrices. Besides, Transformers as an L2O algorithm can leverage structural information embedded in the training tasks to accelerate its convergence during ICL, and generalize across different lengths of demonstration pairs, where conventional L2O algorithms typically struggle or fail. Such theoretical findings are supported by our experimental results.
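For context, the classical iterative solver that a learned L2O model competes with is ISTA. The sketch below runs it on a small, deliberately well-conditioned LASSO instance; the paper's setting is the harder underdetermined sparse-recovery case, and the problem sizes and values here are arbitrary:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, y, lam, steps):
    # Iterative Shrinkage-Thresholding for min_x 0.5*||Ax - y||^2 + lam*||x||_1
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the smooth part
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        x = soft_threshold(x - A.T @ (A @ x - y) / L, lam / L)
    return x

rng = np.random.default_rng(1)
A = rng.normal(size=(60, 30))
x_true = np.zeros(30)
x_true[[3, 17, 22]] = [1.0, -2.0, 1.5]     # 3-sparse ground truth
y = A @ x_true                              # noiseless measurements

x_hat = ista(A, y, lam=0.01, steps=1000)
err = float(np.linalg.norm(x_hat - x_true))
support = set(np.argsort(np.abs(x_hat))[-3:])
```

The paper's claim is that a K-layer Transformer can emulate a learned variant of such an unrolled solver with convergence linear in K, without being tied to a single measurement matrix A.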

[AI-94] RecoveryChaining: Learning Local Recovery Policies for Robust Manipulation

链接: https://arxiv.org/abs/2410.13979
作者: Shivam Vats,Devesh K. Jha,Maxim Likhachev,Oliver Kroemer,Diego Romeres
关键词-EN: efficiently optimize diverse, optimize diverse objectives, long horizon tasks, solve complex manipulation, complex manipulation problems
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 8 pages, 9 figures

点击查看摘要

Abstract:Model-based planners and controllers are commonly used to solve complex manipulation problems as they can efficiently optimize diverse objectives and generalize to long horizon tasks. However, they are limited by the fidelity of their model which oftentimes leads to failures during deployment. To enable a robot to recover from such failures, we propose to use hierarchical reinforcement learning to learn a separate recovery policy. The recovery policy is triggered when a failure is detected based on sensory observations and seeks to take the robot to a state from which it can complete the task using the nominal model-based controllers. Our approach, called RecoveryChaining, uses a hybrid action space, where the model-based controllers are provided as additional nominal options, which allows the recovery policy to decide how to recover, when to switch to a nominal controller, and which controller to switch to, even with sparse rewards. We evaluate our approach in three multi-step manipulation tasks with sparse rewards, where it learns significantly more robust recovery policies than those learned by baselines. Finally, we successfully transfer recovery policies learned in simulation to a physical robot to demonstrate the feasibility of sim-to-real transfer with our method.
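The hybrid action space can be illustrated with a tiny tabular stand-in: on a detected failure, the learned recovery policy either executes a primitive action or hands control back to one of the nominal model-based controllers. The states, action names, and hand-coded policy table below are all invented for the sketch:

```python
# Illustrative RecoveryChaining-style control loop with a hybrid action space.

NOMINAL_CONTROLLERS = {"regrasp": "run regrasp controller",
                       "retract": "run retract controller"}

# Toy "learned" recovery policy: failure state -> (action kind, choice).
recovery_policy = {
    "object_slipped": ("switch", "regrasp"),   # return control to a nominal option
    "arm_stuck": ("primitive", "move_up"),     # keep recovering with a primitive
}

def step(state, failure_detected):
    if not failure_detected:
        return ("nominal", "continue task")
    kind, choice = recovery_policy[state]
    if kind == "switch":
        return ("nominal", NOMINAL_CONTROLLERS[choice])
    return ("recovery", choice)

ok = step("object_slipped", failure_detected=False)
recover = step("object_slipped", failure_detected=True)
stuck = step("arm_stuck", failure_detected=True)
```

In the paper this policy is learned with hierarchical RL under sparse rewards rather than hand-written.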

[AI-95] MarineFormer: A Transformer-based Navigation Policy Model for Collision Avoidance in Marine Environment

链接: https://arxiv.org/abs/2410.13973
作者: Ehsan Kazemi,Iman Soltani
关键词-EN: Unmanned Surface Vehicle, Surface Vehicle, Unmanned Surface, problem of Unmanned, dense marine environment
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this work, we investigate the problem of Unmanned Surface Vehicle (USV) navigation in a dense marine environment with a high-intensity current flow. The complexities arising from static and dynamic obstacles and the disturbance forces caused by current flow render existing navigation protocols inadequate for ensuring safety and avoiding collisions at sea. To learn a safe and efficient robot policy, we propose a novel methodology that leverages attention mechanisms to capture heterogeneous interactions of the agents with the static and moving obstacles and the flow disturbances from the environment in space and time. In particular, we refine a temporal function with MarineFormer, a Transformer navigation policy for spatially variable Marine environment, trained end-to-end with reinforcement learning (RL). MarineFormer uses foundational spatio-temporal graph attention with transformer architecture to process spatial attention and temporal sequences in an environment that simulates a 2D turbulent marine condition. We propose architectural modifications that improve the stability and learning speed of the recurrent models. The flow velocity estimation, which can be derived from flow simulations or sensors, is incorporated into a model-free RL framework to prevent the robot from entering into high-intensity current flow regions including intense vortices, while potentially leveraging the flow to assist in transportation. The investigated 2D marine environment encompasses flow singularities, including vortices, sinks, and sources, representing fundamental planar flow patterns associated with flood or maritime thunderstorms. Our proposed method is trained with a new reward model to deal with static and dynamic obstacles and disturbances from the current flow.

[AI-96] Detecting AI-Generated Texts in Cross-Domains

链接: https://arxiv.org/abs/2410.13966
作者: You Zhou,Jie Wang
关键词-EN: large language model, Existing tools, large language, performance can drop, drop when dealing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Existing tools to detect text generated by a large language model (LLM) have met with certain success, but their performance can drop when dealing with texts in new domains. To tackle this issue, we train a ranking classifier called RoBERTa-Ranker, a modified version of RoBERTa, as a baseline model using a dataset we constructed that includes a wider variety of texts written by humans and generated by various LLMs. We then present a method to fine-tune RoBERTa-Ranker that requires only a small amount of labeled data in a new domain. Experiments show that this fine-tuned domain-aware model outperforms the popular DetectGPT and GPTZero on both in-domain and cross-domain texts, where AI-generated texts may either be in a different domain or generated by a different LLM not used to generate the training datasets. This approach makes it feasible and economical to build a single system to detect AI-generated texts across various domains.
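Stripped to its essence, the fine-tuning recipe is "start from a base classifier and continue training on a handful of new-domain labels." The toy below does exactly that with a two-parameter logistic model instead of RoBERTa; the features, data, and hyperparameters are all invented for the sketch:

```python
import numpy as np

def train(X, y, w=None, lr=0.5, epochs=300):
    """Logistic regression by gradient descent; w resumes from a base model."""
    w = np.zeros(X.shape[1]) if w is None else w.copy()
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -30, 30)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

rng = np.random.default_rng(2)
# "Base domain": one text feature separates human (0) from AI (1) at 0.
x_base = rng.normal(size=(200, 1))
X_base = np.hstack([x_base, np.ones_like(x_base)])   # feature + bias column
y_base = (x_base[:, 0] > 0).astype(float)
w_base = train(X_base, y_base)

# "New domain": the boundary shifts; only 12 labeled examples are available.
x_new = rng.normal(size=(12, 1)) + 1.5
X_new = np.hstack([x_new, np.ones_like(x_new)])
y_new = (x_new[:, 0] > 1.5).astype(float)
w_tuned = train(X_new, y_new, w=w_base, epochs=150)  # fine-tune from base weights

base_acc = float(np.mean(((X_base @ w_base) > 0) == (y_base > 0.5)))
```

Warm-starting from the base weights is what lets a small labeled set adapt the model, mirroring the paper's claim that only a little new-domain data is needed.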

[AI-97] FinQAPT: Empowering Financial Decisions with End-to-End LLM-driven Question Answering Pipeline

链接: https://arxiv.org/abs/2410.13959
作者: Kuldeep Singh,Simerjot Kaur,Charese Smiley
关键词-EN: Large Language Models, relevant information embedded, Financial decision-making hinges, leverages Large Language, decision-making hinges
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Accepted in ICAIF 2024, 8 pages, 5 figures, 4 tables

点击查看摘要

Abstract:Financial decision-making hinges on the analysis of relevant information embedded in the enormous volume of documents in the financial domain. To address this challenge, we developed FinQAPT, an end-to-end pipeline that streamlines the identification of relevant financial reports based on a query, extracts pertinent context, and leverages Large Language Models (LLMs) to perform downstream tasks. To evaluate the pipeline, we experimented with various techniques to optimize the performance of each module using the FinQA dataset. We introduced a novel clustering-based negative sampling technique to enhance context extraction and a novel prompting method called Dynamic N-shot Prompting to boost the numerical question-answering capabilities of LLMs. At the module level, we achieved state-of-the-art accuracy on FinQA, attaining an accuracy of 80.6%. However, at the pipeline level, we observed decreased performance due to challenges in extracting relevant context from financial reports. We conducted a detailed error analysis of each module and the end-to-end pipeline, pinpointing specific challenges that must be addressed to develop a robust solution for handling complex financial tasks.
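At a high level, "Dynamic N-shot Prompting" selects the exemplars most relevant to the incoming query before assembling the prompt. The word-overlap similarity, the templates, and the toy FinQA-style exemplars below are assumptions for the sketch, not the paper's retrieval method:

```python
# Minimal dynamic few-shot prompt builder: rank exemplars by similarity to the
# query and keep the top N.

def similarity(a, b):
    """Crude word-overlap (Jaccard) similarity between two questions."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def build_prompt(query, exemplars, n=2):
    ranked = sorted(exemplars, key=lambda ex: similarity(query, ex["q"]), reverse=True)
    lines = [f"Q: {ex['q']}\nA: {ex['a']}" for ex in ranked[:n]]
    lines.append(f"Q: {query}\nA:")
    return "\n\n".join(lines)

exemplars = [
    {"q": "What is the net revenue growth rate?", "a": "12%"},
    {"q": "What is the debt to equity ratio?", "a": "0.8"},
    {"q": "How many employees were hired?", "a": "45"},
]
prompt = build_prompt("What is the revenue growth in 2020?", exemplars, n=1)
```

A production version would rank by embedding similarity and vary N per query, but the control flow is the same.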

[AI-98] Goal Inference from Open-Ended Dialog

链接: https://arxiv.org/abs/2410.13957
作者: Rachel Ma,Jingyi Qu,Andreea Bobu,Dylan Hadfield-Menell
关键词-EN: accomplish diverse user, diverse user goals, Large Language Models, embodied agents, agents to learn
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 6 pages + 2 page (references and appendix)

点击查看摘要

Abstract:We present an online method for embodied agents to learn and accomplish diverse user goals. While offline methods like RLHF can represent various goals but require large datasets, our approach achieves similar flexibility with online efficiency. We extract natural language goal representations from conversations with Large Language Models (LLMs). We prompt an LLM to role play as a human with different goals and use the corresponding likelihoods to run Bayesian inference over potential goals. As a result, our method can represent uncertainty over complex goals based on unrestricted dialog. We evaluate our method in grocery shopping and home robot assistance domains using a text-based interface and AI2Thor simulation respectively. Results show our method outperforms ablation baselines that lack either explicit goal representation or probabilistic inference.
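The Bayesian update at the core of the method is compact: multiply a prior over goals by the likelihood of each observed utterance under each goal, then normalize. In the paper those likelihoods come from an LLM role-playing each goal; the hand-coded likelihood table, goals, and utterances below are stand-ins:

```python
import math

# Minimal Bayesian goal inference over dialog.
GOALS = ["buy_groceries", "clean_kitchen"]
PRIOR = {g: 0.5 for g in GOALS}

# P(utterance | goal): in the paper these come from LLM token likelihoods.
LIKELIHOOD = {
    ("we need milk", "buy_groceries"): 0.6,
    ("we need milk", "clean_kitchen"): 0.1,
    ("the counter is dirty", "buy_groceries"): 0.1,
    ("the counter is dirty", "clean_kitchen"): 0.7,
}

def posterior(utterances, prior=PRIOR):
    log_post = {g: math.log(prior[g]) for g in GOALS}
    for u in utterances:                      # accumulate evidence in log space
        for g in GOALS:
            log_post[g] += math.log(LIKELIHOOD[(u, g)])
    z = sum(math.exp(v) for v in log_post.values())
    return {g: math.exp(v) / z for g, v in log_post.items()}

post = posterior(["we need milk"])
```

Working in log space keeps the update numerically stable when many utterances (or long token sequences) are scored.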

[AI-99] Identifying High Consideration E-Commerce Search Queries EMNLP2024

链接: https://arxiv.org/abs/2410.13951
作者: Zhiyu Chen,Jason Choi,Besnik Fetahu,Shervin Malmasi
关键词-EN: missions typically require, typically require careful, substantial research investment, elaborate decision making, high consideration
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Accepted by EMNLP 2024 (Industry Track)

点击查看摘要

Abstract:In e-commerce, high consideration search missions typically require careful and elaborate decision making, and involve a substantial research investment from customers. We consider the task of identifying High Consideration (HC) queries. Identifying such queries enables e-commerce sites to better serve user needs using targeted experiences such as curated QA widgets that help users reach purchase decisions. We explore the task by proposing an Engagement-based Query Ranking (EQR) approach, focusing on query ranking to indicate potential engagement levels with query-related shopping knowledge content during product search. Unlike previous studies on predicting trends, EQR prioritizes query-level features related to customer behavior, finance, and catalog information rather than popularity signals. We introduce an accurate and scalable method for EQR and present experimental results demonstrating its effectiveness. Offline experiments show strong ranking performance. Human evaluation shows a precision of 96% for HC queries identified by our model. The model was commercially deployed, and shown to outperform human-selected queries in terms of downstream customer impact, as measured through engagement.

[AI-100] The KnowWhereGraph Ontology

链接: https://arxiv.org/abs/2410.13948
作者: Cogan Shimizu,Shirly Stephe,Adrita Barua,Ling Cai,Antrea Christou,Kitty Currier,Abhilekha Dalal,Colby K. Fisher,Pascal Hitzler,Krzysztof Janowicz,Wenwen Li,Zilong Liu,Mohammad Saeid Mahdavinejad,Gengchen Mai,Dean Rehberger,Mark Schildhauer,Meilin Shi,Sanaz Saki Norouzi,Yuanyuan Tian,Sizhe Wang,Zhangyu Wang,Joseph Zalewski,Lu Zhou,Rui Zhu
关键词-EN: largest fully publicly, geospatial knowledge graphs, largest fully, fully publicly, publicly available geospatial
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:KnowWhereGraph is one of the largest fully publicly available geospatial knowledge graphs. It includes data from 30 layers on natural hazards (e.g., hurricanes, wildfires), climate variables (e.g., air temperature, precipitation), soil properties, crop and land-cover types, demographics, and human health, various place and region identifiers, among other themes. These have been leveraged through the graph by a variety of applications to address challenges in food security and agricultural supply chains; sustainability related to soil conservation practices and farm labor; and delivery of emergency humanitarian aid following a disaster. In this paper, we introduce the ontology that acts as the schema for KnowWhereGraph. This broad overview provides insight into the requirements and design specifications for the graph and its schema, including the development methodology (modular ontology modeling) and the resources utilized to implement, materialize, and deploy KnowWhereGraph with its end-user interfaces and public query SPARQL endpoint.

[AI-101] ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding

链接: https://arxiv.org/abs/2410.13924
作者: Guangda Ji,Silvan Weder,Francis Engelmann,Marc Pollefeys,Hermann Blum
关键词-EN: neural networks scales, neural networks, dense semantic annotations, dataset, large-scale
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The performance of neural networks scales with both their size and the amount of data they have been trained on. This is shown in both language and image generation. However, this requires scaling-friendly network architectures as well as large-scale datasets. Even though scaling-friendly architectures like transformers have emerged for 3D vision tasks, the GPT-moment of 3D vision remains distant due to the lack of training data. In this paper, we introduce ARKit LabelMaker, the first large-scale, real-world 3D dataset with dense semantic annotations. Specifically, we complement ARKitScenes dataset with dense semantic annotations that are automatically generated at scale. To this end, we extend LabelMaker, a recent automatic annotation pipeline, to serve the needs of large-scale pre-training. This involves extending the pipeline with cutting-edge segmentation models as well as making it robust to the challenges of large-scale processing. Further, we push forward the state-of-the-art performance on ScanNet and ScanNet200 dataset with prevalent 3D semantic segmentation models, demonstrating the efficacy of our generated dataset.

[AI-102] LLM Agent Honeypot: Monitoring AI Hacking Agents in the Wild

链接: https://arxiv.org/abs/2410.13919
作者: Reworr,Dmitrii Volkov
关键词-EN: introduce the LLM, LLM Honeypot, customized SSH honeypot, system for monitoring, monitoring autonomous
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We introduce the LLM Honeypot, a system for monitoring autonomous AI hacking agents. We deployed a customized SSH honeypot and applied prompt injections with temporal analysis to identify LLM-based agents among attackers. Over a trial run of a few weeks in a public environment, we collected 800,000 hacking attempts and identified 6 potential AI agents, which we plan to analyze in depth in future work. Our objective is to improve awareness of AI hacking agents and to enhance preparedness for their risks.
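One simple form of the temporal analysis is to flag sessions whose response times are both fast and unusually regular, since LLM agents tend to answer injected prompts almost instantly while humans are slower and more variable. The thresholds and timing data below are invented for the sketch; the paper combines timing with prompt-injection responses:

```python
# Toy temporal classifier for honeypot sessions based on response-time
# statistics (mean and variance, in seconds).

def classify_session(response_times, mean_threshold=1.5, var_threshold=0.25):
    mean = sum(response_times) / len(response_times)
    var = sum((t - mean) ** 2 for t in response_times) / len(response_times)
    if mean < mean_threshold and var < var_threshold:
        return "possible-llm-agent"
    return "likely-human"

fast_and_steady = classify_session([0.8, 0.9, 0.85, 0.8])
slow_and_bursty = classify_session([4.0, 12.5, 2.1, 30.0])
```

Timing alone is a weak signal (scripted bots are also fast), which is why the honeypot pairs it with prompt injections that only an LLM would follow.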

[AI-103] Leveraging Fine-Tuned Language Models for Efficient and Accurate Smart Contract Auditing

链接: https://arxiv.org/abs/2410.13918
作者: Zhiyuan Wei,Jing Sun,Zijian Zhang,Xianhao Zhang,Meng Li
关键词-EN: smart contracts, smart contract auditing, rise of blockchain, blockchain technologies, technologies has greatly
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 26 pages, 7 figures

点击查看摘要

Abstract:The rise of blockchain technologies has greatly accelerated the development and deployment of smart contracts. However, their inherent vulnerabilities and susceptibility to bugs have led to significant financial losses, underscoring the challenges in securing smart contracts. While traditional auditing methods are crucial, they often fall short in addressing the increasing complexity and volume of smart contracts. Recent advancements in Large Language Models (LLMs) offer promising solutions for enhancing software auditing by automatically identifying security vulnerabilities. Despite their potential, the practical application of these models is hindered by substantial computational demands. This paper investigates the feasibility of using smaller, fine-tuned models to achieve comparable or even superior results in smart contract auditing. We introduce the FTSmartAudit framework, which is designed to develop cost-effective, specialized models for smart contract auditing through the fine-tuning of LLMs. Our contributions include: (1) a single-task learning framework that streamlines data preparation, training, evaluation, and continuous learning; (2) a robust dataset generation method utilizing domain-specific knowledge distillation to produce high-quality datasets from advanced models like GPT-4o; (3) an adaptive learning strategy to maintain model accuracy and robustness; (4) the proven effectiveness of fine-tuned models in detecting specific vulnerabilities and complex logical errors; and (5) a framework that can be extended to other domains requiring LLM solutions. Our experimental results demonstrate that smaller models can surpass state-of-the-art commercial models and tools in detecting vulnerabilities in smart contracts.

[AI-104] A Simulation System Towards Solving Societal-Scale Manipulation

链接: https://arxiv.org/abs/2410.13915
作者: Maximilian Puelma Touzel,Sneheel Sarangi,Austin Welch,Gayatri Krishnakumar,Dan Zhao,Zachary Yang,Hao Yu,Ethan Kosak-Hine,Tom Gibbs,Andreea Musulan,Camille Thibault,Busra Tugce Gurbuz,Reihaneh Rabbany,Jean-François Godbout,Kellin Pelrine
关键词-EN: poses significant risks, AI-driven manipulation poses, manipulation poses significant, democratic processes, rise of AI-driven
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:The rise of AI-driven manipulation poses significant risks to societal trust and democratic processes. Yet, studying these effects in real-world settings at scale is ethically and logistically impractical, highlighting a need for simulation tools that can model these dynamics in controlled settings to enable experimentation with possible defenses. We present a simulation environment designed to address this. We elaborate upon the Concordia framework that simulates offline, 'real life' activity by adding online interactions to the simulation through social media with the integration of a Mastodon server. We improve simulation efficiency and information flow, and add a set of measurement tools, particularly longitudinal surveys. We demonstrate the simulator with a tailored example in which we track agents' political positions and show how partisan manipulation of agents can affect election results.

[AI-105] Large Language Model-driven Multi-Agent Simulation for News Diffusion Under Different Network Structures

链接: https://arxiv.org/abs/2410.13909
作者: Xinyi Li,Yu Xu,Yongfeng Zhang,Edward C. Malthouse
关键词-EN: raised critical concerns, critical concerns, democratic processes, proliferation of fake, digital age
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:The proliferation of fake news in the digital age has raised critical concerns, particularly regarding its impact on societal trust and democratic processes. Diverging from conventional agent-based simulation approaches, this work introduces an innovative approach by employing a large language model (LLM)-driven multi-agent simulation to replicate complex interactions within information ecosystems. We investigate key factors that facilitate news propagation, such as agent personalities and network structures, while also evaluating strategies to combat misinformation. Through simulations across varying network structures, we demonstrate the potential of LLM-based agents in modeling the dynamics of misinformation spread, validating the influence of agent traits on the diffusion process. Our findings emphasize the advantages of LLM-based simulations over traditional techniques, as they uncover underlying causes of information spread – such as agents promoting discussions – beyond the predefined rules typically employed in existing agent-based models. Additionally, we evaluate three countermeasure strategies, discovering that brute-force blocking influential agents in the network or announcing news accuracy can effectively mitigate misinformation. However, their effectiveness is influenced by the network structure, highlighting the importance of considering network structure in the development of future misinformation countermeasures.
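A useful baseline to contrast with the LLM-driven agents is classic probabilistic diffusion on a fixed network, where each informed node shares with a neighbor according to a fixed probability rather than through generated dialog. The ring graph, probabilities, and step count below are arbitrary choices for the sketch:

```python
import random

# Rule-based news diffusion on an undirected graph given as an edge list.
def simulate(edges, seed_nodes, share_prob, steps, rng):
    informed = set(seed_nodes)
    for _ in range(steps):
        newly = set()
        for u, v in edges:
            if u in informed and v not in informed and rng.random() < share_prob:
                newly.add(v)
            if v in informed and u not in informed and rng.random() < share_prob:
                newly.add(u)
        informed |= newly
    return informed

ring = [(i, (i + 1) % 10) for i in range(10)]   # a small ring network
rng = random.Random(0)
reached = simulate(ring, seed_nodes={0}, share_prob=1.0, steps=5, rng=rng)
no_spread = simulate(ring, seed_nodes={0}, share_prob=0.0, steps=5, rng=rng)
```

The paper's point is precisely that such predefined sharing rules miss mechanisms an LLM agent exhibits, like promoting discussion, which is why its agents decide to share through dialog instead of a fixed `share_prob`.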

[AI-106] NSmark: Null Space Based Black-box Watermarking Defense Framework for Pre-trained Language Models

链接: https://arxiv.org/abs/2410.13907
作者: Haodong Zhao,Jinming Hu,Peixuan Li,Fangqi Li,Jinrui Sha,Peixuan Chen,Zhuosheng Zhang,Gongshen Liu
关键词-EN: Pre-trained language models, critical intellectual property, Linear Functionality Equivalence, Pre-trained language, Functionality Equivalence Attacks
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Pre-trained language models (PLMs) have emerged as critical intellectual property (IP) assets that necessitate protection. Although various watermarking strategies have been proposed, they remain vulnerable to Linear Functionality Equivalence Attacks (LFEA), which can invalidate most existing white-box watermarks without prior knowledge of the watermarking scheme or training data. This paper further analyzes and extends the attack scenarios of LFEA to the commonly employed black-box settings for PLMs by considering Last-Layer outputs (dubbed LL-LFEA). We discover that the null space of the output matrix remains invariant against LL-LFEA attacks. Based on this finding, we propose NSmark, a task-agnostic, black-box watermarking scheme capable of resisting LL-LFEA attacks. NSmark consists of three phases: (i) watermark generation using the digital signature of the owner, enhanced by spread spectrum modulation for increased robustness; (ii) watermark embedding through an output mapping extractor that preserves PLM performance while maximizing watermark capacity; (iii) watermark verification, assessed by extraction rate and null space conformity. Extensive experiments on both pre-training and downstream tasks confirm the effectiveness, reliability, fidelity, and robustness of our approach. Code is available at this https URL.
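The key observation behind NSmark can be checked numerically: if an attacker applies an invertible linear map to the model's outputs (the LL-LFEA setting), the right null space of the output matrix is unchanged, since MOv = 0 whenever Ov = 0. The random matrices below are toy stand-ins for a PLM's output matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
O = rng.normal(size=(4, 6))          # rank-4 "output matrix" with a 2-dim null space

def null_space(M, tol=1e-10):
    """Orthonormal basis of the right null space via SVD."""
    _, s, vt = np.linalg.svd(M)
    rank = int(np.sum(s > tol))
    return vt[rank:].T

N = null_space(O)
M = rng.normal(size=(4, 4))          # invertible with probability 1
attacked = M @ O                     # LL-LFEA-style transformation of the outputs

# The original null-space basis still annihilates the attacked outputs.
residual = float(np.max(np.abs(attacked @ N)))
```

NSmark's verification step exploits exactly this invariance: the watermark is checked through null-space conformity, so it survives any such linear functionality-equivalent transformation.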

[AI-107] P4GCN: Vertical Federated Social Recommendation with Privacy-Preserving Two-Party Graph Convolution Networks

链接: https://arxiv.org/abs/2410.13905
作者: Zheng Wang,Wanwan Wang,Yimin Huang,Zhaopeng Peng,Ziqi Yang,Cheng Wang,Xiaoliang Fan
关键词-EN: social recommendation systems, recent years, commonly utilized, social, graph neural networks
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, graph neural networks (GNNs) have been commonly utilized for social recommendation systems. However, real-world scenarios often present challenges related to user privacy and business constraints, inhibiting direct access to valuable social information from other platforms. While many existing methods have tackled matrix factorization-based social recommendations without direct social data access, developing GNN-based federated social recommendation models under similar conditions remains largely unexplored. To address this issue, we propose a novel vertical federated social recommendation method leveraging privacy-preserving two-party graph convolution networks (P4GCN) to enhance recommendation accuracy without requiring direct access to sensitive social information. First, we introduce a Sandwich-Encryption module to ensure comprehensive data privacy during the collaborative computing process. Second, we provide a thorough theoretical analysis of the privacy guarantees, considering the participation of both curious and honest parties. Extensive experiments on four real-world datasets demonstrate that P4GCN outperforms state-of-the-art methods in terms of recommendation accuracy. The code is available at this https URL.

[AI-108] CoreGuard: Safeguarding Foundational Capabilities of LLMs Against Model Stealing in Edge Deployment

链接: https://arxiv.org/abs/2410.13903
作者: Qinfeng Li,Yangfan Xie,Tianyu Du,Zhiqiang Shen,Zhenghan Qin,Hao Peng,Xinkui Zhao,Xianwei Zhu,Jianwei Yin,Xuhong Zhang
关键词-EN: Proprietary large language, demonstrate exceptional generalization, large language models, demonstrate exceptional, exceptional generalization ability
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Proprietary large language models (LLMs) demonstrate exceptional generalization ability across various tasks. Additionally, deploying LLMs on edge devices is trending for efficiency and privacy reasons. However, edge deployment of proprietary LLMs introduces new security threats: attackers who obtain an edge-deployed LLM can easily use it as a base model for various tasks due to its high generalization ability, which we call foundational capability stealing. Unfortunately, existing model protection mechanisms are often task-specific and fail to protect general-purpose LLMs, as they mainly focus on protecting task-related parameters using trusted execution environments (TEEs). Although some recent TEE-based methods are able to protect the overall model parameters in a computation-efficient way, they still suffer from prohibitive communication costs between TEE and CPU/GPU, making it impractical to deploy for edge LLMs. To protect the foundational capabilities of edge LLMs, we propose CoreGuard, a computation- and communication-efficient model protection approach against model stealing on edge devices. The core component of CoreGuard is a lightweight and propagative authorization module residing in TEE. Extensive experiments show that CoreGuard achieves the same security protection as the black-box security guarantees with negligible overhead.

[AI-109] SoK: Prompt Hacking of Large Language Models

链接: https://arxiv.org/abs/2410.13901
作者: Baha Rababah,Shang (Tommy) Wu,Matthew Kwiatkowski,Carson Leung,Cuneyt Gurcan Akcora
关键词-EN: large language models, remain critical challenges, based applications remain, applications remain critical, language models
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
*备注:

点击查看摘要

Abstract:The safety and robustness of large language model (LLM)-based applications remain critical challenges in artificial intelligence. Among the key threats to these applications are prompt hacking attacks, which can significantly undermine the security and reliability of LLM-based systems. In this work, we offer a comprehensive and systematic overview of three distinct types of prompt hacking: jailbreaking, leaking, and injection, addressing the nuances that differentiate them despite their overlapping characteristics. To enhance the evaluation of LLM-based applications, we propose a novel framework that categorizes LLM responses into five distinct classes, moving beyond the traditional binary classification. This approach provides more granular insights into the AI’s behavior, improving diagnostic precision and enabling more targeted enhancements to the system’s safety and robustness.

[AI-110] Security of and by Generative AI platforms

链接: https://arxiv.org/abs/2410.13899
作者: Hari Hayagreevan,Souvik Khamaru
关键词-EN: highlights the dual, dual importance, securing generative, genAI, whitepaper highlights
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This whitepaper highlights the dual importance of securing generative AI (genAI) platforms and leveraging genAI for cybersecurity. As genAI technologies proliferate, their misuse poses significant risks, including data breaches, model tampering, and malicious content generation. Securing these platforms is critical to protect sensitive data, ensure model integrity, and prevent adversarial attacks. Simultaneously, genAI presents opportunities for enhancing security by automating threat detection, vulnerability analysis, and incident response. The whitepaper explores strategies for robust security frameworks around genAI systems, while also showcasing how genAI can empower organizations to anticipate, detect, and mitigate sophisticated cyber threats.

[AI-111] Deep Learning Based XIoT Malware Analysis: A Comprehensive Survey Taxonomy and Research Challenges

链接: https://arxiv.org/abs/2410.13894
作者: Rami Darwish,Mahmoud Abdelsalam,Sajad Khorsandroo
关键词-EN: fastest-growing computing industries, computing industries, fastest-growing computing, Internet, Things
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The Internet of Things (IoT) is one of the fastest-growing computing industries. By the end of 2027, more than 29 billion devices are expected to be connected. These smart devices can communicate with each other with and without human intervention. This rapid growth has led to the emergence of new types of malware. However, traditional malware detection methods, such as signature-based and heuristic-based techniques, are becoming increasingly ineffective against these new types of malware. Therefore, it has become indispensable to find practical solutions for detecting IoT malware. Machine Learning (ML) and Deep Learning (DL) approaches have proven effective in dealing with these new IoT malware variants, exhibiting high detection rates. In this paper, we bridge the research gap between IoT malware analysis and the wide adoption of deep learning in tackling problems in this domain. As such, we provide a comprehensive review on deep learning based malware analysis across various categories of the IoT domain (i.e. Extended Internet of Things (XIoT)), including Industrial IoT (IIoT), Internet of Medical Things (IoMT), Internet of Vehicles (IoV), and Internet of Battlefield Things (IoBT).

[AI-112] Can LLMs be Scammed? A Baseline Measurement Study

链接: https://arxiv.org/abs/2410.13893
作者: Udari Madhushani Sehwag,Kelly Patel,Francesca Mosca,Vineeth Ravi,Jessica Staddon
关键词-EN: current literature lacks, Large Language Models’, effectively resist scams, assessing Large Language, current literature
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Despite the importance of developing generative AI models that can effectively resist scams, current literature lacks a structured framework for evaluating their vulnerability to such threats. In this work, we address this gap by constructing a benchmark based on the FINRA taxonomy and systematically assessing Large Language Models’ (LLMs’) vulnerability to a variety of scam tactics. First, we incorporate 37 well-defined base scam scenarios reflecting the diverse scam categories identified by FINRA taxonomy, providing a focused evaluation of LLMs’ scam detection capabilities. Second, we utilize representative proprietary (GPT-3.5, GPT-4) and open-source (Llama) models to analyze their performance in scam detection. Third, our research provides critical insights into which scam tactics are most effective against LLMs and how varying persona traits and persuasive techniques influence these vulnerabilities. We reveal distinct susceptibility patterns across different models and scenarios, underscoring the need for targeted enhancements in LLM design and deployment.

[AI-113] S4ST: A Strong Self-transferable faSt and Simple Scale Transformation for Transferable Targeted Attack

链接: https://arxiv.org/abs/2410.13891
作者: Yongxiang Liu,Bowen Peng,Li Liu,Xiang Li
关键词-EN: deep neural networks, transferable targeted attacks, Transferable targeted, remain relatively underexplored, deep neural
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 16 pages, 18 figures

点击查看摘要

Abstract:Transferable targeted adversarial attacks (TTAs) against deep neural networks have been proven significantly more challenging than untargeted ones, yet they remain relatively underexplored. This paper sheds new light on performing highly efficient yet transferable targeted attacks leveraging the simple gradient-based baseline. Our research underscores the critical importance of image transformations within gradient calculations, marking a shift from the prevalent emphasis on loss functions to address the gradient vanishing problem. Moreover, we have developed two effective blind estimators that facilitate the design of transformation strategies to enhance targeted transferability under black-box conditions. The adversarial examples’ self-transferability to geometric transformations has been identified as strongly correlated with their black-box transferability, featuring these basic operations as potent yet overlapped proxies for facilitating targeted transferability. The surrogate self-alignment assessments further highlight simple scaling transformation’s exceptional efficacy, which rivals that of most advanced methods. Building on these insights, we introduce a scaling-centered transformation strategy termed Strong, Self-transferable, faSt, and Simple Scale Transformation (S4ST) to enhance transferable targeted attacks. In experiments conducted on the ImageNet-Compatible benchmark dataset, our proposed S4ST attains a SOTA average targeted transfer success rate across various challenging black-box models, outperforming the previous leading method by over 14% while requiring only 25% of the execution time. Additionally, our approach eclipses SOTA attacks considerably and exhibits remarkable effectiveness against real-world APIs. This work marks a significant leap forward in TTAs, revealing the realistic threats they pose and providing a practical generation method for future research.

[AI-114] Transformers Utilization in Chart Understanding: A Review of Recent Advances & Future Trends

链接: https://arxiv.org/abs/2410.13883
作者: Mirna Al-Shetairy,Hanan Hindy,Dina Khattab,Mostafa M. Aref
关键词-EN: involving chart interactions, interest in vision-language, chart interactions, involving chart, Chart Understanding
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, interest in vision-language tasks has grown, especially those involving chart interactions. These tasks are inherently multimodal, requiring models to process chart images, accompanying text, underlying data tables, and often user queries. Traditionally, Chart Understanding (CU) relied on heuristics and rule-based systems. However, recent advancements that have integrated transformer architectures significantly improved performance. This paper reviews prominent research in CU, focusing on State-of-The-Art (SoTA) frameworks that employ transformers within End-to-End (E2E) solutions. Relevant benchmarking datasets and evaluation techniques are analyzed. Additionally, this article identifies key challenges and outlines promising future directions for advancing CU solutions. Following the PRISMA guidelines, a comprehensive literature search is conducted across Google Scholar, focusing on publications from Jan’20 to Jun’24. After rigorous screening and quality assessment, 32 studies are selected for in-depth analysis. The CU tasks are categorized into a three-layered paradigm based on the cognitive task required. Recent advancements in the frameworks addressing various CU tasks are also reviewed. Frameworks are categorized into single-task or multi-task based on the number of tasks solvable by the E2E solution. Within multi-task frameworks, pre-trained and prompt-engineering-based techniques are explored. This review overviews leading architectures, datasets, and pre-training tasks. Despite significant progress, challenges remain in OCR dependency, handling low-resolution images, and enhancing visual reasoning. Future directions include addressing these challenges, developing robust benchmarks, and optimizing model efficiency. Additionally, integrating explainable AI techniques and exploring the balance between real and synthetic data are crucial for advancing CU research.

[AI-115] Deep Knowledge Tracing for Personalized Adaptive Learning at Historically Black Colleges and Universities

链接: https://arxiv.org/abs/2410.13876
作者: Ming-Mu Kuo,Xiangfang Li,Lijun Qian,Pamela Obiomon,Xishuang Dong
关键词-EN: Personalized adaptive learning, closely monitoring individual, Personalized adaptive, knowledge tracing, DKT
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Personalized adaptive learning (PAL) stands out by closely monitoring individual students’ progress and tailoring their learning paths to their unique knowledge and needs. A crucial technique for effective PAL implementation is knowledge tracing, which models students’ evolving knowledge to predict their future performance. Recent advancements in deep learning have significantly enhanced knowledge tracing through Deep Knowledge Tracing (DKT). However, there is limited research on DKT for Science, Technology, Engineering, and Math (STEM) education at Historically Black Colleges and Universities (HBCUs). This study builds a comprehensive dataset to investigate DKT for implementing PAL in STEM education at HBCUs, utilizing multiple state-of-the-art (SOTA) DKT models to examine knowledge tracing performance. The dataset includes 352,148 learning records for 17,181 undergraduate students across eight colleges at Prairie View A&M University (PVAMU). The SOTA DKT models employed include DKT, DKT+, DKVMN, SAKT, and KQN. Experimental results demonstrate the effectiveness of DKT models in accurately predicting students’ academic outcomes. Specifically, the SAKT and KQN models outperform others in terms of accuracy and AUC. These findings have significant implications for faculty members and academic advisors, providing valuable insights for identifying students at risk of academic underperformance before the end of the semester. Furthermore, this allows for proactive interventions to support students’ academic progress, potentially enhancing student retention and graduation rates.
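The knowledge-tracing update at the heart of this line of work is easiest to see in its classical predecessor, Bayesian Knowledge Tracing (BKT), which DKT replaces with a recurrent network. A minimal BKT sketch (parameter values are illustrative; this is not one of the DKT models evaluated in the paper):

```python
def bkt_update(p_know, correct, guess=0.2, slip=0.1, learn=0.15):
    """One Bayesian Knowledge Tracing step: Bayes-update the mastery
    estimate from an observed response, then apply the learning
    transition between practice opportunities."""
    if correct:
        num = p_know * (1 - slip)
        den = num + (1 - p_know) * guess
    else:
        num = p_know * slip
        den = num + (1 - p_know) * (1 - guess)
    posterior = num / den
    return posterior + (1 - posterior) * learn

# Track a student's mastery over a short response sequence.
p = 0.3
for outcome in [True, True, False, True]:
    p = bkt_update(p, outcome)
```

Correct responses raise the mastery estimate and incorrect ones lower it; deep knowledge tracing learns this update (and cross-skill interactions) from data instead of fixing the four parameters by hand.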

[AI-116] Explaining an image classifier with a generative model conditioned by uncertainty

链接: https://arxiv.org/abs/2410.13871
作者: Adrien Le Coz,Stéphane Herbin,Faouzi Adjed
关键词-EN: image classifier uncertainty, explain its behavior, propose to condition, condition a generative, generative model
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:We propose to condition a generative model by a given image classifier uncertainty in order to analyze and explain its behavior. Preliminary experiments on synthetic data and a corrupted version of MNIST dataset illustrate the idea.

[AI-117] Stars, Stripes, and Silicon: Unravelling the ChatGPT's All-American, Monochrome, Cis-centric Bias

链接: https://arxiv.org/abs/2410.13868
作者: Federico Torrielli
关键词-EN: large language models, lack of robustness, robustness in large, large language, language models
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:This paper investigates the challenges associated with bias, toxicity, unreliability, and lack of robustness in large language models (LLMs) such as ChatGPT. It emphasizes that these issues primarily stem from the quality and diversity of data on which LLMs are trained, rather than the model architectures themselves. As LLMs are increasingly integrated into various real-world applications, their potential to negatively impact society by amplifying existing biases and generating harmful content becomes a pressing concern. The paper calls for interdisciplinary efforts to address these challenges. Additionally, it highlights the need for collaboration between researchers, practitioners, and stakeholders to establish governance frameworks, oversight, and accountability mechanisms to mitigate the harmful consequences of biased LLMs. By proactively addressing these challenges, the AI community can harness the enormous potential of LLMs for the betterment of society without perpetuating harmful biases or exacerbating existing inequalities.

[AI-118] Asymptotically Optimal Change Detection for Unnormalized Pre- and Post-Change Distributions

链接: https://arxiv.org/abs/2410.14615
作者: Arman Adibi,Sanjeev Kulkarni,H. Vincent Poor,Taposh Banerjee,Vahid Tarokh
关键词-EN: Approximation Cumulative Sum, paper addresses, addresses the problem, problem of detecting, Cumulative Sum
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:This paper addresses the problem of detecting changes when only unnormalized pre- and post-change distributions are accessible. This situation happens in many scenarios in physics such as in ferromagnetism, crystallography, magneto-hydrodynamics, and thermodynamics, where the energy models are difficult to normalize. Our approach is based on the estimation of the Cumulative Sum (CUSUM) statistics, which is known to produce optimal performance. We first present an intuitively appealing approximation method. Unfortunately, this produces a biased estimator of the CUSUM statistics and may cause performance degradation. We then propose the Log-Partition Approximation Cumulative Sum (LPA-CUSUM) algorithm based on thermodynamic integration (TI) in order to estimate the log-ratio of normalizing constants of pre- and post-change distributions. It is proved that this approach gives an unbiased estimate of the log-partition function and the CUSUM statistics, and leads to an asymptotically optimal performance. Moreover, we derive a relationship between the required sample size for thermodynamic integration and the desired detection delay performance, offering guidelines for practical parameter selection. Numerical studies are provided demonstrating the efficacy of our approach.
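For reference, the underlying CUSUM recursion is simple when both densities are normalized and known; the paper's LPA-CUSUM addresses the harder unnormalized case. A minimal sketch with Gaussian pre- and post-change densities (threshold and parameters are illustrative):

```python
import math

def cusum(samples, logpdf_pre, logpdf_post, threshold):
    """Classical CUSUM: raise an alarm when the running statistic
    S_t = max(0, S_{t-1} + log(p1(x_t) / p0(x_t))) exceeds a threshold.
    Returns the 1-based alarm time, or None if no alarm is raised."""
    s = 0.0
    for t, x in enumerate(samples, start=1):
        s = max(0.0, s + logpdf_post(x) - logpdf_pre(x))
        if s > threshold:
            return t
    return None

def gaussian_logpdf(mu, sigma):
    return lambda x: (-0.5 * ((x - mu) / sigma) ** 2
                      - math.log(sigma * math.sqrt(2 * math.pi)))

# Deterministic toy data: mean shifts from 0 to 2 after sample 50.
data = [0.0] * 50 + [2.0] * 50
alarm = cusum(data, gaussian_logpdf(0.0, 1.0), gaussian_logpdf(2.0, 1.0),
              threshold=8.0)
```

Each post-change sample adds the log-likelihood ratio to the statistic; the max with zero resets it while the data still follow the pre-change density. When the densities are only known up to normalizing constants, this log-ratio is no longer computable directly, which is the gap LPA-CUSUM fills.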

[AI-119] Less is More: Selective Reduction of CT Data for Self-Supervised Pre-Training of Deep Learning Models with Contrastive Learning Improves Downstream Classification Performance

链接: https://arxiv.org/abs/2410.14524
作者: Daniel Wolf,Tristan Payer,Catharina Silvia Lisson,Christoph Gerhard Lisson,Meinrad Beer,Michael Götz,Timo Ropinski
关键词-EN: Self-supervised pre-training, widely used technique, Self-supervised, deep learning models, medical images
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Published in Computers in Biology and Medicine

点击查看摘要

Abstract:Self-supervised pre-training of deep learning models with contrastive learning is a widely used technique in image analysis. Current findings indicate a strong potential for contrastive pre-training on medical images. However, further research is necessary to incorporate the particular characteristics of these images. We hypothesize that the similarity of medical images hinders the success of contrastive learning in the medical imaging domain. To this end, we investigate different strategies based on deep embedding, information theory, and hashing in order to identify and reduce redundancy in medical pre-training datasets. The effect of these different reduction strategies on contrastive learning is evaluated on two pre-training datasets and several downstream classification tasks. In all of our experiments, dataset reduction leads to a considerable performance gain in downstream tasks, e.g., an AUC score improvement from 0.78 to 0.83 for the COVID CT Classification Grand Challenge, 0.97 to 0.98 for the OrganSMNIST Classification Challenge and 0.73 to 0.83 for a brain hemorrhage classification task. Furthermore, pre-training is up to nine times faster due to the dataset reduction. In conclusion, the proposed approach highlights the importance of dataset quality and provides a transferable approach to improve contrastive pre-training for classification downstream tasks on medical images.
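The hashing-based reduction strategies mentioned in the abstract can be illustrated by their simplest member: greedy near-duplicate removal under a perceptual-style average hash. A toy sketch on 2×2 grayscale "images" (function names, image data, and the distance threshold are illustrative, not the paper's):

```python
def average_hash(image):
    """Binary hash: 1 where a pixel is above the image mean (a crude
    stand-in for perceptual hashing of downsampled grayscale scans)."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    return tuple(1 if p > mean else 0 for p in pixels)

def hamming(h1, h2):
    return sum(a != b for a, b in zip(h1, h2))

def reduce_redundancy(images, max_distance=1):
    """Greedy dedup: keep an image only if its hash is farther than
    max_distance from every already-kept hash."""
    kept, hashes = [], []
    for img in images:
        h = average_hash(img)
        if all(hamming(h, k) > max_distance for k in hashes):
            kept.append(img)
            hashes.append(h)
    return kept

# Two near-identical "scans" and one structurally different one.
a = [[10, 10], [200, 200]]
b = [[11, 12], [199, 201]]   # near-duplicate of a
c = [[200, 10], [10, 200]]   # different structure
reduced = reduce_redundancy([a, b, c])
```

The near-duplicate `b` collapses onto the same hash as `a` and is dropped, while `c` survives; the paper's deep-embedding and information-theoretic variants play the same role with richer similarity measures.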

[AI-120] Learning to refine domain knowledge for biological network inference

链接: https://arxiv.org/abs/2410.14436
作者: Peiwen Li,Menghua Wu
关键词-EN: pose significant challenges, Perturbation experiments, discover causal relationships, causal structure learning, structure learning algorithms
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Perturbation experiments allow biologists to discover causal relationships between variables of interest, but the sparsity and high dimensionality of these data pose significant challenges for causal structure learning algorithms. Biological knowledge graphs can bootstrap the inference of causal structures in these situations, but since they compile vastly diverse information, they can bias predictions towards well-studied systems. Alternatively, amortized causal structure learning algorithms encode inductive biases through data simulation and train supervised models to recapitulate these synthetic graphs. However, realistically simulating biology is arguably even harder than understanding a specific system. In this work, we take inspiration from both strategies and propose an amortized algorithm for refining domain knowledge, based on data observations. On real and synthetic datasets, we show that our approach outperforms baselines in recovering ground truth causal graphs and identifying errors in the prior knowledge with limited interventional data.

[AI-121] Deep Learning Applications in Medical Image Analysis: Advancements Challenges and Future Directions

链接: https://arxiv.org/abs/2410.14131
作者: Aimina Ali Eli,Abida Ali
关键词-EN: Medical image analysis, contemporary healthcare, facilitating physicians, precise diagnosis, essential element
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Medical image analysis has emerged as an essential element of contemporary healthcare, facilitating physicians in achieving expedited and precise diagnosis. Recent breakthroughs in deep learning, a subset of artificial intelligence, have markedly revolutionized the analysis of medical pictures, improving the accuracy and efficiency of clinical procedures. Deep learning algorithms, especially convolutional neural networks (CNNs), have demonstrated remarkable proficiency in autonomously learning features from multidimensional medical pictures, including MRI, CT, and X-ray scans, without the necessity for manual feature extraction. These models have been utilized across multiple medical disciplines, including pathology, radiology, ophthalmology, and cardiology, where they aid in illness detection, classification, and segmentation tasks…

[AI-122] Ensemble-based large-eddy reconstruction of wind turbine inflow in a near-stationary atmospheric boundary layer through generative artificial intelligence

链接: https://arxiv.org/abs/2410.14024
作者: Alex Rybchuk,Luis A. Martínez-Tossas,Stefano Letizia,Nicholas Hamilton,Andy Scholbrock,Emina Maric,Daniel R. Houck,Thomas G. Herges,Nathaniel B. de Velder,Paula Doubrawa
关键词-EN: accurately reconstruct, dynamics of turbines, large-eddy, field experiments, technique
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI)
*备注: 30 pages, 15 figures

点击查看摘要

Abstract:To validate the second-by-second dynamics of turbines in field experiments, it is necessary to accurately reconstruct the winds going into the turbine. Current time-resolved inflow reconstruction techniques estimate wind behavior in unobserved regions using relatively simple spectral-based models of the atmosphere. Here, we develop a technique for time-resolved inflow reconstruction that is rooted in a large-eddy simulation model of the atmosphere. Our “large-eddy reconstruction” technique blends observations and atmospheric model information through a diffusion model machine learning algorithm, allowing us to generate probabilistic ensembles of reconstructions for a single 10-min observational period. Our generated inflows can be used directly by aeroelastic codes or as inflow boundary conditions in a large-eddy simulation. We verify the second-by-second reconstruction capability of our technique in three synthetic field campaigns, finding positive Pearson correlation coefficient values (0.20 < r < 0.85) between ground-truth and reconstructed streamwise velocity, as well as smaller positive correlation coefficient values for unobserved fields (spanwise velocity, vertical velocity, and temperature). We validate our technique in three real-world case studies by driving large-eddy simulations with reconstructed inflows and comparing to independent inflow measurements. The reconstructions are visually similar to measurements, follow desired power spectra properties, and track second-by-second behavior (0.25 < r < 0.75).

[AI-123] Approximating Auction Equilibria with Reinforcement Learning

链接: https://arxiv.org/abs/2410.13960
作者: Pranjal Rawat
关键词-EN: Proximal Policy Optimization, Traditional methods, auction complexity increases, complexity increases, Neural Fictitious Self-Play
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Traditional methods for computing equilibria in auctions become computationally intractable as auction complexity increases, particularly in multi-item and dynamic auctions. This paper introduces a self-play based reinforcement learning approach that employs advanced algorithms such as Proximal Policy Optimization and Neural Fictitious Self-Play to approximate Bayes-Nash equilibria. This framework allows for continuous action spaces, high-dimensional information states, and delayed payoffs. Through self-play, these algorithms can learn robust and near-optimal bidding strategies in auctions with known equilibria, including those with symmetric and asymmetric valuations, private and interdependent values, and multi-round auctions.
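For intuition about the equilibria these self-play algorithms approximate: a tiny discretized first-price auction can be attacked by iterated best response, and with two bidders and uniform private values the analytic Bayes–Nash bid is b(v) = v/2. A plain-Python sketch (not the paper's PPO/NFSP machinery; grid sizes and iteration count are arbitrary):

```python
def best_response(opponent_bids, values, bids):
    """For each of my values, pick the bid maximizing expected utility
    (value - bid) * P(win) against the opponent's current bid function,
    with opponent values uniform over `values` and ties split evenly."""
    n = len(values)
    strategy = {}
    for v in values:
        best_b, best_u = 0.0, float("-inf")
        for b in bids:
            wins = sum(
                1.0 if opponent_bids[w] < b else 0.5 if opponent_bids[w] == b else 0.0
                for w in values
            )
            u = (v - b) * wins / n
            if u > best_u:
                best_b, best_u = b, u
        strategy[v] = best_b
    return strategy

values = [i / 10 for i in range(11)]   # value grid 0.0 .. 1.0
bids = [i / 20 for i in range(21)]     # bid grid 0.0 .. 1.0
strategy = {v: 0.0 for v in values}    # start by bidding zero
for _ in range(50):                    # iterate best responses
    strategy = best_response(strategy, values, bids)
```

Best-response dynamics on a grid may cycle rather than converge exactly, which is one reason richer auctions call for the learned, smoothed strategies (PPO, NFSP) the paper studies; the rational bidder here still never bids above value and a zero-value bidder bids zero.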

[AI-124] Associative memory and dead neurons

链接: https://arxiv.org/abs/2410.13866
作者: Vladimir Fanaskov,Ivan Oseledets
关键词-EN: Large Associative Memory, John Hopfield introduced, Associative Memory Problem, Large Associative, Machine Learning
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:In “Large Associative Memory Problem in Neurobiology and Machine Learning,” Dmitry Krotov and John Hopfield introduced a general technique for the systematic construction of neural ordinary differential equations with non-increasing energy or Lyapunov function. We study this energy function and identify that it is vulnerable to the problem of dead neurons. Each point in the state space where the neuron dies is contained in a non-compact region with constant energy. In these flat regions, energy function alone does not completely determine all degrees of freedom and, as a consequence, can not be used to analyze stability or find steady states or basins of attraction. We perform a direct analysis of the dynamical system and show how to resolve problems caused by flat directions corresponding to dead neurons: (i) all information about the state vector at a fixed point can be extracted from the energy and Hessian matrix (of Lagrange function), (ii) it is enough to analyze stability in the range of Hessian matrix, (iii) if steady state touching flat region is stable the whole flat region is the basin of attraction. The analysis of the Hessian matrix can be complicated for realistic architectures, so we show that for a slightly altered dynamical system (with the same structure of steady states), one can derive a diverse family of Lyapunov functions that do not have flat regions corresponding to dead neurons. In addition, these energy functions allow one to use Lagrange functions with Hessian matrices that are not necessarily positive definite and even consider architectures with non-symmetric feedforward and feedback connections.
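The flat-region problem is already visible in a one-neuron toy energy: once the activation is clamped at zero, the energy is constant on an entire non-compact half-line and cannot distinguish states there. A minimal illustration (not the Krotov–Hopfield energy itself):

```python
def relu(z):
    return max(0.0, z)

def energy(state):
    """Toy energy with a single ReLU 'neuron': E(s) = relu(s)^2.
    On the half-line s < 0 the neuron is dead: the energy is identically
    zero, so energy alone cannot pin down the state, its stability, or
    its basin of attraction there."""
    return relu(state) ** 2

flat = [energy(s) for s in (-0.5, -2.0, -100.0)]   # constant on the flat region
slope = [energy(s) for s in (0.5, 1.0, 2.0)]       # informative where the neuron is active
```

The paper's remedy is to analyze the dynamics directly (via the Hessian of the Lagrange function) or to switch to alternative Lyapunov functions without such flat directions.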

计算机视觉

[CV-0] BiGR: Harnessing Binary Latent Codes for Image Generation and Improved Visual Representation Capabilities

链接: https://arxiv.org/abs/2410.14672
作者: Shaozhe Hao,Xuantong Liu,Xianbiao Qi,Shihao Zhao,Bojia Zi,Rong Xiao,Kai Han,Kwan-Yee K. Wong
关键词-EN: compact binary latent, focusing on enhancing, binary latent codes, conditional generative model, generative training
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Project page: this https URL

点击查看摘要

Abstract:We introduce BiGR, a novel conditional image generation model using compact binary latent codes for generative training, focusing on enhancing both generation and representation capabilities. BiGR is the first conditional generative model that unifies generation and discrimination within the same framework. BiGR features a binary tokenizer, a masked modeling mechanism, and a binary transcoder for binary code prediction. Additionally, we introduce a novel entropy-ordered sampling method to enable efficient image generation. Extensive experiments validate BiGR’s superior performance in generation quality, as measured by FID-50k, and representation capabilities, as evidenced by linear-probe accuracy. Moreover, BiGR showcases zero-shot generalization across various vision tasks, enabling applications such as image inpainting, outpainting, editing, interpolation, and enrichment, without the need for structural modifications. Our findings suggest that BiGR unifies generative and discriminative tasks effectively, paving the way for further advancements in the field.

[CV-1] NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples NEURIPS24

链接: https://arxiv.org/abs/2410.14669
作者: Baiqi Li,Zhiqiu Lin,Wenxuan Peng,Jean de Dieu Nyandwi,Daniel Jiang,Zixian Ma,Simran Khanuja,Ranjay Krishna,Graham Neubig,Deva Ramanan
关键词-EN: made significant progress, Vision-language models, progress in recent, made significant, significant progress
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: Accepted to NeurIPS 24; We open-source our dataset at: this https URL Project page at: this https URL

点击查看摘要

Abstract:Vision-language models (VLMs) have made significant progress in recent visual-question-answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However, are these models truly effective? In this work, we show that VLMs still struggle with natural images and questions that humans can easily answer, which we term natural adversarial samples. We also find it surprisingly easy to generate these VQA samples from natural image-text corpora using off-the-shelf models like CLIP and ChatGPT. We propose a semi-automated approach to collect a new benchmark, NaturalBench, for reliably evaluating VLMs with 10,000 human-verified VQA samples. Crucially, we adopt a vision-centric design by pairing each question with two images that yield different answers, preventing blind solutions from answering without using the images. This makes NaturalBench more challenging than previous benchmarks that can be solved with commonsense priors. We evaluate 53 state-of-the-art VLMs on NaturalBench, showing that models like LLaVA-OneVision, Cambrian-1, Llama3.2-Vision, Molmo, Qwen2-VL, and even GPT-4o lag 50%-70% behind human performance (over 90%). We analyze why NaturalBench is hard from two angles: (1) Compositionality: Solving NaturalBench requires diverse visio-linguistic skills, including understanding attribute bindings, object relationships, and advanced reasoning like logic and counting. To this end, unlike prior work that uses a single tag per sample, we tag each NaturalBench sample with 1 to 8 skill tags for fine-grained evaluation. (2) Biases: NaturalBench exposes severe biases in VLMs, as models often choose the same answer regardless of the image. Lastly, we apply our benchmark curation method to diverse data sources, including long captions (over 100 words) and non-English languages like Chinese and Hindi, highlighting its potential for dynamic evaluations of VLMs.

[CV-2] Parallel Backpropagation for Inverse of a Convolution with Application to Normalizing Flows

链接: https://arxiv.org/abs/2410.14634
作者: Sandeep Nagar,Girish Varma
关键词-EN: Image Deblurring, Normalizing Flows, Normalizing, Normalizing Flow backbones, Inverse Convolutions
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Probability (math.PR)
*备注: Preprint

点击查看摘要

Abstract:Inverse of an invertible convolution is an important operation that comes up in Normalizing Flows, Image Deblurring, etc. The naive algorithm for backpropagation of this operation using Gaussian elimination has running time O(n^3) where n is the number of pixels in the image. We give a fast parallel backpropagation algorithm with running time O(√n) for a square image and provide a GPU implementation of the same. Inverse Convolutions are usually used in Normalizing Flows in the sampling pass, making them slow. We propose to use Inverse Convolutions in the forward (image to latent vector) pass of the Normalizing flow. Since the sampling pass is the inverse of the forward pass, it will use convolutions only, resulting in efficient sampling times. We use our parallel backpropagation algorithm for optimizing the inverse convolution layer resulting in fast training times also. We implement this approach in various Normalizing Flow backbones, resulting in our Inverse-Flow models. We benchmark Inverse-Flow on standard datasets and show significantly improved sampling times with similar bits per dimension compared to previous models.
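The operation being inverted here is easiest to see in 1-D: a causal convolution with a nonzero leading tap is a lower-triangular Toeplitz system, so its exact inverse follows by forward substitution. A sequential O(nk) sketch for intuition (the paper's contribution is a parallel O(√n) algorithm, which this does not implement; filter taps are arbitrary):

```python
def causal_conv1d(x, w):
    """y[i] = sum_k w[k] * x[i-k], with zero padding on the left."""
    return [sum(w[k] * x[i - k] for k in range(len(w)) if i - k >= 0)
            for i in range(len(x))]

def inverse_causal_conv1d(y, w):
    """Recover x from y by forward substitution: the convolution matrix
    is lower-triangular Toeplitz with diagonal w[0], hence invertible
    whenever w[0] != 0, and each x[i] follows from x[0..i-1]."""
    assert w[0] != 0
    x = []
    for i in range(len(y)):
        acc = sum(w[k] * x[i - k] for k in range(1, len(w)) if i - k >= 0)
        x.append((y[i] - acc) / w[0])
    return x

x = [1.0, 2.0, -1.0, 0.5]
w = [2.0, 0.5, -0.25]          # hypothetical filter taps
y = causal_conv1d(x, w)
x_rec = inverse_causal_conv1d(y, w)   # recovers x exactly
```

The sequential data dependence of forward substitution is exactly what makes the naive inverse slow in the flow's sampling pass, and what the paper's parallel algorithm and forward/inverse role swap are designed to avoid.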

[CV-3] Swiss Army Knife: Synergizing Biases in Knowledge from Vision Foundation Models for Multi-Task Learning

链接: https://arxiv.org/abs/2410.14633
作者: Yuxiang Lu,Shengcao Cao,Yu-Xiong Wang
关键词-EN: Vision Foundation Models, demonstrated outstanding performance, Vision Foundation, numerous downstream tasks, Foundation Models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Vision Foundation Models (VFMs) have demonstrated outstanding performance on numerous downstream tasks. However, due to their inherent representation biases originating from different training paradigms, VFMs exhibit advantages and disadvantages across distinct vision tasks. Although amalgamating the strengths of multiple VFMs for downstream tasks is an intuitive strategy, effectively exploiting these biases remains a significant challenge. In this paper, we propose a novel and versatile “Swiss Army Knife” (SAK) solution, which adaptively distills knowledge from a committee of VFMs to enhance multi-task learning. Unlike existing methods that use a single backbone for knowledge transfer, our approach preserves the unique representation bias of each teacher by collaborating the lightweight Teacher-Specific Adapter Path modules with the Teacher-Agnostic Stem. Through dynamic selection and combination of representations with Mixture-of-Representations Routers, our SAK is capable of synergizing the complementary strengths of multiple VFMs. Extensive experiments show that our SAK remarkably outperforms the prior state of the art in multi-task learning by 10% on the NYUD-v2 benchmark, while also providing a flexible and robust framework that can readily accommodate more advanced model designs.

[CV-4] MultiOrg: A Multi-rater Organoid-detection Dataset

链接: https://arxiv.org/abs/2410.14612
作者: Christina Bukas,Harshavardhan Subramanian,Fenja See,Carina Steinchen,Ivan Ezhov,Gowtham Boosarpu,Sara Asgharpour,Gerald Burgstaller,Mareike Lehmann,Florian Kofler,Marie Piraud
关键词-EN: gained significant attention, High-throughput image analysis, disease prediction, recent years, drug discovery
类目: Computer Vision and Pattern Recognition (cs.CV); Cell Behavior (q-bio.CB)
*备注:

点击查看摘要

Abstract:High-throughput image analysis in the biomedical domain has gained significant attention in recent years, driving advancements in drug discovery, disease prediction, and personalized medicine. Organoids, specifically, are an active area of research, providing excellent models for human organs and their functions. Automating the quantification of organoids in microscopy images would provide an effective solution to overcome substantial manual quantification bottlenecks, particularly in high-throughput image analysis. However, there is a notable lack of open biomedical datasets, in contrast to other domains, such as autonomous driving, and, notably, only a few of them have attempted to quantify annotation uncertainty. In this work, we present MultiOrg, a comprehensive organoid dataset tailored for object detection tasks with uncertainty quantification. This dataset comprises over 400 high-resolution 2d microscopy images and curated annotations of more than 60,000 organoids. Most importantly, it includes three label sets for the test data, independently annotated by two experts at distinct time points. We additionally provide a benchmark for organoid detection, and make the best model available through an easily installable, interactive plugin for the popular image visualization tool Napari, to perform organoid quantification.

[CV-5] DRACO-DehazeNet: An Efficient Image Dehazing Network Combining Detail Recovery and a Novel Contrastive Learning Paradigm

链接: https://arxiv.org/abs/2410.14595
作者: Gao Yu Lee,Tanmoy Dam,Md Meftahul Ferdaus,Daniel Puiu Poenar,Vu Duong
关键词-EN: significant computational power, current learning-based approaches, consumed significant computational, clarifying images obscured, computational power
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Submitted to a journal and currently under review. Once the paper is accepted and published, the copyright will be transferred to the corresponding journal

点击查看摘要

Abstract:Image dehazing is crucial for clarifying images obscured by haze or fog, but current learning-based approaches are dependent on large volumes of training data and hence consume significant computational power. Additionally, their performance is often inadequate under non-uniform or heavy haze. To address these challenges, we developed the Detail Recovery And Contrastive DehazeNet, which facilitates efficient and effective dehazing via a dense dilated inverted residual block and an attention-based detail recovery network that tailors enhancements to specific dehazed scene contexts. A major innovation is its ability to train effectively with limited data, achieved through a novel quadruplet loss-based contrastive dehazing paradigm. This approach distinctly separates hazy and clear image features while also distinguishing lower-quality and higher-quality dehazed images obtained from each sub-module of our network, thereby refining the dehazing process to a larger extent. Extensive tests on a variety of benchmarked haze datasets demonstrated the superiority of our approach. The code repository for this work will be available soon.
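The quadruplet idea can be sketched with a generic quadruplet margin loss (in the style of Chen et al.'s quadruplet network); the paper's version is adapted to dehazing with hazy, clear, and lower-/higher-quality dehazed samples, so the pairs and margins below are purely illustrative:

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def quadruplet_loss(anchor, positive, neg1, neg2, m1=0.5, m2=0.25):
    """Generic quadruplet margin loss: pull the anchor toward the
    positive, push it away from neg1, and additionally keep the
    negative pair (neg1, neg2) farther apart than (anchor, positive)."""
    d_ap = sq_dist(anchor, positive)
    term1 = max(0.0, d_ap - sq_dist(anchor, neg1) + m1)
    term2 = max(0.0, d_ap - sq_dist(neg1, neg2) + m2)
    return term1 + term2

# Toy 2-d features: a well-separated quadruplet incurs zero loss.
loss_good = quadruplet_loss([0.0, 0.0], [0.1, 0.0], [2.0, 0.0], [0.0, 2.0])
# A quadruplet where the negative sits on top of the positive is penalized.
loss_bad = quadruplet_loss([0.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.1])
```

Compared with a triplet loss, the extra term imposes structure on pairs that do not contain the anchor, which is what lets the paper's variant rank lower- and higher-quality dehazed outputs against each other rather than only against the clear image.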

[CV-6] MomentumSMoE: Integrating Momentum into Sparse Mixture of Experts NEURIPS2024

链接: https://arxiv.org/abs/2410.14574
作者: Rachel S.Y. Teo,Tan M. Nguyen
关键词-EN: unlocking unparalleled scalability, Sparse Mixture, deep learning, key to unlocking, unlocking unparalleled
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注: 10 pages in the main text. Published at NeurIPS 2024. The code is available at this https URL

点击查看摘要

Abstract:Sparse Mixture of Experts (SMoE) has become the key to unlocking unparalleled scalability in deep learning. SMoE has the potential to exponentially increase parameter count while maintaining the efficiency of the model by only activating a small subset of these parameters for a given sample. However, it has been observed that SMoE suffers from unstable training and has difficulty adapting to new distributions, leading to the model’s lack of robustness to data contamination. To overcome these limitations, we first establish a connection between the dynamics of the expert representations in SMoEs and gradient descent on a multi-objective optimization problem. Leveraging our framework, we then integrate momentum into SMoE and propose a new family of SMoEs named MomentumSMoE. We theoretically prove and numerically demonstrate that MomentumSMoE is more stable and robust than SMoE. In particular, we verify the advantages of MomentumSMoE over SMoE on a variety of practical tasks including ImageNet-1K object recognition and WikiText-103 language modeling. We demonstrate the applicability of MomentumSMoE to many types of SMoE models, including those in the Sparse MoE model for vision (V-MoE) and the Generalist Language Model (GLaM). We also show that other advanced momentum-based optimization methods, such as Adam, can be easily incorporated into the MomentumSMoE framework for designing new SMoE models with even better performance, almost negligible additional computation cost, and simple implementations.
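The momentum integration described in the abstract can be illustrated with a toy heavy-ball update on a sparse MoE layer's output. Everything below is an illustrative sketch under stated assumptions (random linear experts, top-1 gating, a simple residual-plus-momentum update); the paper's actual formulation is derived from its multi-objective optimization framework.

```python
import numpy as np

rng = np.random.default_rng(0)
D, E = 4, 3                               # feature dim, number of experts
W_gate = rng.normal(size=(D, E))          # gating weights
W_exp = rng.normal(size=(E, D, D)) * 0.1  # one linear map per expert

def smoe(x):
    """Sparse MoE forward pass: route x to its top-1 expert only."""
    logits = x @ W_gate
    e = int(np.argmax(logits))
    return W_exp[e] @ x

def momentum_smoe_step(x, m, beta=0.9):
    """Heavy-ball style update: accumulate expert outputs in a momentum
    buffer instead of applying them directly to the residual stream."""
    m = beta * m + smoe(x)
    return x + m, m

x, m = rng.normal(size=D), np.zeros(D)
for _ in range(5):                        # unroll a few layers/steps
    x, m = momentum_smoe_step(x, m)
```

The momentum buffer smooths the sequence of expert outputs across steps, which is the mechanism the paper credits for improved stability and robustness.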

[CV-7] Multi-modal Pose Diffuser: A Multimodal Generative Conditional Pose Prior

链接: https://arxiv.org/abs/2410.14540
作者: Calvin-Khang Ta,Arindam Dutta,Rohit Kundu,Rohit Lal,Hannah Dela Cruz,Dripta S. Raychaudhuri,Amit Roy-Chowdhury
关键词-EN: Skinned Multi-Person Linear, Multi-Person Linear, Skinned Multi-Person, underline, providing a streamlined
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The Skinned Multi-Person Linear (SMPL) model plays a crucial role in 3D human pose estimation, providing a streamlined yet effective representation of the human body. However, ensuring the validity of SMPL configurations during tasks such as human mesh regression remains a significant challenge, highlighting the necessity for a robust human pose prior capable of discerning realistic human poses. To address this, we introduce MOPED: Multi-mOdal PosE Diffuser. MOPED is the first method to leverage a novel multi-modal conditional diffusion model as a prior for SMPL pose parameters. Our method offers powerful unconditional pose generation with the ability to condition on multi-modal inputs such as images and text. This capability enhances the applicability of our approach by incorporating additional context often overlooked in traditional pose priors. Extensive experiments across three distinct tasks-pose estimation, pose denoising, and pose completion-demonstrate that our multi-modal diffusion model-based prior significantly outperforms existing methods. These results indicate that our model captures a broader spectrum of plausible human poses.

[CV-8] CLIP-VAD: Exploiting Vision-Language Models for Voice Activity Detection

链接: https://arxiv.org/abs/2410.14509
作者: Andrea Appiani,Cigdem Beyan
关键词-EN: Voice Activity Detection, Voice Activity, Activity Detection, person is speaking, speaking and identifying
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Voice Activity Detection (VAD) is the process of automatically determining whether a person is speaking and identifying the timing of their speech in audiovisual data. Traditionally, this task has been tackled by processing either audio signals or visual data, or by combining both modalities through fusion or joint learning. In our study, drawing inspiration from recent advancements in visual-language models, we introduce a novel approach leveraging Contrastive Language-Image Pretraining (CLIP) models. The CLIP visual encoder analyzes video segments composed of the upper body of an individual, while the text encoder handles textual descriptions automatically generated through prompt engineering. Subsequently, embeddings from these encoders are fused through a deep neural network to perform VAD. Our experimental analysis across three VAD benchmarks showcases the superior performance of our method compared to existing visual VAD approaches. Notably, our approach outperforms several audio-visual methods despite its simplicity, and without requiring pre-training on extensive audio-visual datasets.
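The fusion step described above (CLIP image and text embeddings joined by a deep network that outputs a speaking probability) can be sketched as follows. The embedding dimension, MLP shape, and random weights are all placeholders; the paper trains this fusion network on VAD labels and uses real CLIP encoders.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # embedding dim (real CLIP uses 512 or 768; shrunk here for clarity)

# Stand-ins for CLIP encoder outputs: one upper-body video crop, one prompt.
img_emb = rng.normal(size=D)
txt_emb = rng.normal(size=D)

# Two-layer fusion MLP; weights are random here but learned in practice.
W1 = rng.normal(size=(2 * D, D)) * 0.1
W2 = rng.normal(size=(D, 1)) * 0.1

def vad_score(img_emb, txt_emb):
    """Concatenate the two modality embeddings and squash the fused
    representation to a per-segment speaking probability."""
    z = np.concatenate([img_emb, txt_emb])
    h = np.maximum(z @ W1, 0.0)                     # ReLU hidden layer
    return float(1.0 / (1.0 + np.exp(-(h @ W2)[0])))  # sigmoid output

p = vad_score(img_emb, txt_emb)
```

Thresholding `p` per video segment yields the binary speaking/non-speaking decision.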

[CV-9] LEAD: Latent Realignment for Human Motion Diffusion

链接: https://arxiv.org/abs/2410.14508
作者: Nefeli Andreou,Xi Wang,Victoria Fernández Abrevaya,Marie-Paule Cani,Yiorgos Chrysanthou,Vicky Kalogeiton
关键词-EN: generate realistic human, realistic human motion, generate realistic, realistic human, natural language
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
*备注:

点击查看摘要

Abstract:Our goal is to generate realistic human motion from natural language. Modern methods often face a trade-off between model expressiveness and text-to-motion alignment. Some align text and motion latent spaces but sacrifice expressiveness; others rely on diffusion models producing impressive motions, but lacking semantic meaning in their latent space. This may compromise realism, diversity, and applicability. Here, we address this by combining latent diffusion with a realignment mechanism, producing a novel, semantically structured space that encodes the semantics of language. Leveraging this capability, we introduce the task of textual motion inversion to capture novel motion concepts from a few examples. For motion synthesis, we evaluate LEAD on HumanML3D and KIT-ML and show comparable performance to the state-of-the-art in terms of realism, diversity, and text-motion consistency. Our qualitative analysis and user study reveal that our synthesized motions are sharper, more human-like and comply better with the text compared to modern methods. For motion textual inversion, our method demonstrates improved capacity in capturing out-of-distribution characteristics in comparison to traditional VAEs.

[CV-10] Neural Real-Time Recalibration for Infrared Multi-Camera Systems

链接: https://arxiv.org/abs/2410.14505
作者: Benyamin Mehmandar,Reza Talakoob,Charalambos Poullis
关键词-EN: calibration, real-time, traditional calibration techniques, infrared multi-camera systems, multi-camera infrared systems
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: real-time camera calibration, infrared camera, neural calibration

点击查看摘要

Abstract:Currently, there are no learning-free or neural techniques for real-time recalibration of infrared multi-camera systems. In this paper, we address the challenge of real-time, highly-accurate calibration of multi-camera infrared systems, a critical task for time-sensitive applications. Unlike traditional calibration techniques that lack adaptability and struggle with on-the-fly recalibrations, we propose a neural network-based method capable of dynamic real-time calibration. The proposed method integrates a differentiable projection model that directly correlates 3D geometries with their 2D image projections and facilitates the direct optimization of both intrinsic and extrinsic camera parameters. Key to our approach is the dynamic camera pose synthesis with perturbations in camera parameters, emulating realistic operational challenges to enhance model robustness. We introduce two model variants: one designed for multi-camera systems with onboard processing of 2D points, utilizing the direct 2D projections of 3D fiducials, and another for image-based systems, employing color-coded projected points for implicitly establishing correspondence. Through rigorous experimentation, we demonstrate our method is more accurate than traditional calibration techniques with or without perturbations while also being real-time, marking a significant leap in the field of real-time multi-camera system calibration. The source code can be found at this https URL
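The differentiable projection model at the core of the method maps 3D fiducials to 2D pixels through the intrinsics and extrinsics. A plain-numpy sketch of that pinhole projection is below; the paper embeds the analogous operation inside a network so K, R, and t can be optimized by gradient descent, and the specific values here are illustrative.

```python
import numpy as np

def project(points_3d, K, R, t):
    """Pinhole projection: world-frame 3D points -> pixel coordinates."""
    cam = points_3d @ R.T + t            # world -> camera frame
    uv = cam[:, :2] / cam[:, 2:3]        # perspective divide
    return uv @ K[:2, :2].T + K[:2, 2]   # apply focal lengths + principal point

# Example: a point on the optical axis lands exactly on the principal point.
K = np.array([[100.0, 0.0, 64.0],
              [0.0, 100.0, 64.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 1.0])
pix = project(np.array([[0.0, 0.0, 1.0]]), K, R, t)
```

Because every step is a differentiable tensor operation, the reprojection error between `pix` and observed fiducials can drive recalibration of both K and (R, t).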

[CV-11] How Do Training Methods Influence the Utilization of Vision Models? NEURIPS2024

链接: https://arxiv.org/abs/2410.14470
作者: Paul Gavrikov,Shashank Agnihotri,Margret Keuper,Janis Keuper
关键词-EN: contribute equally, learnable parameters, decision function, entire layers’ parameters, network decision function
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at the Interpretable AI: Past, Present and Future Workshop at NeurIPS 2024

点击查看摘要

Abstract:Not all learnable parameters (e.g., weights) contribute equally to a neural network’s decision function. In fact, entire layers’ parameters can sometimes be reset to random values with little to no impact on the model’s decisions. We revisit earlier studies that examined how architecture and task complexity influence this phenomenon and ask: is this phenomenon also affected by how we train the model? We conducted experimental evaluations on a diverse set of ImageNet-1k classification models to explore this, keeping the architecture and training data constant but varying the training pipeline. Our findings reveal that the training method strongly influences which layers become critical to the decision function for a given task. For example, improved training regimes and self-supervised training increase the importance of early layers while significantly under-utilizing deeper layers. In contrast, methods such as adversarial training display an opposite trend. Our preliminary results extend previous findings, offering a more nuanced understanding of the inner mechanics of neural networks. Code: this https URL
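The probe behind the abstract's core observation, re-randomizing one layer and measuring how many decisions flip, can be sketched on a toy MLP. The architecture, data, and flip-rate metric below are illustrative assumptions; the paper runs the analogous analysis on trained ImageNet-1k models.

```python
import numpy as np

rng = np.random.default_rng(0)
D, C, N = 8, 3, 200                       # feature dim, classes, samples
Ws = [rng.normal(size=(D, D)) for _ in range(3)] + [rng.normal(size=(D, C))]
X = rng.normal(size=(N, D))

def predict(X, Ws):
    """Forward pass of a small ReLU MLP, returning argmax class labels."""
    h = X
    for W in Ws[:-1]:
        h = np.maximum(h @ W, 0.0)
    return np.argmax(h @ Ws[-1], axis=1)

base = predict(X, Ws)

def criticality(layer):
    """Fraction of predictions that flip when one layer is re-randomized:
    a proxy for how critical that layer is to the decision function."""
    Ws2 = list(Ws)
    Ws2[layer] = rng.normal(size=Ws[layer].shape)
    return float(np.mean(predict(X, Ws2) != base))

scores = [criticality(l) for l in range(len(Ws))]
```

Comparing the per-layer `scores` across models trained with different pipelines is exactly the kind of comparison the paper reports.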

[CV-12] LUDVIG: Learning-free Uplifting of 2D Visual features to Gaussian Splatting scenes

链接: https://arxiv.org/abs/2410.14462
作者: Juliette Marrie,Romain Ménégaux,Michael Arbel,Diane Larlus,Julien Mairal
关键词-EN: Gaussian Splatting, represented by Gaussian, vision models, address the task, uplifting visual features
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We address the task of uplifting visual features or semantic masks from 2D vision models to 3D scenes represented by Gaussian Splatting. Whereas common approaches rely on iterative optimization-based procedures, we show that a simple yet effective aggregation technique yields excellent results. Applied to semantic masks from Segment Anything (SAM), our uplifting approach leads to segmentation quality comparable to the state of the art. We then extend this method to generic DINOv2 features, integrating 3D scene geometry through graph diffusion, and achieve competitive segmentation results despite DINOv2 not being trained on millions of annotated masks like SAM.
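The "simple yet effective aggregation" for uplifting 2D features onto Gaussians can be sketched as a render-weight-weighted average. The shapes and uniform weights below are placeholders; in the real pipeline the weights come from the Gaussian Splatting rasterizer and the 2D features from SAM or DINOv2.

```python
import numpy as np

rng = np.random.default_rng(0)
P, G, F = 6, 3, 4                          # pixels, gaussians, feature dim
alpha = rng.uniform(0.01, 1.0, size=(P, G))  # per-pixel rendering weights
feat2d = rng.normal(size=(P, F))             # 2D features per pixel

def uplift(alpha, feat2d):
    """Assign each gaussian the weighted average of the 2D features of
    the pixels it contributed to, weighted by its rendering weight."""
    num = alpha.T @ feat2d                 # (G, F) weighted feature sums
    den = alpha.sum(axis=0)[:, None]       # (G, 1) total weight per gaussian
    return num / den

feat3d = uplift(alpha, feat2d)
```

As a sanity check, a constant 2D feature map uplifts to the same constant on every gaussian, since the operation is a convex combination per gaussian.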

[CV-13] oward Generalizing Visual Brain Decoding to Unseen Subjects

链接: https://arxiv.org/abs/2410.14445
作者: Xiangtao Kong,Kexin Huang,Ping Li,Lei Zhang
关键词-EN: decode visual information, Visual brain decoding, decode visual, visual information, brain decoding
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Visual brain decoding aims to decode visual information from human brain activities. Despite the great progress, one critical limitation of current brain decoding research lies in the lack of generalization capability to unseen subjects. Prior works typically focus on decoding brain activity of individuals based on the observation that different subjects exhibit different brain activities, while it remains unclear whether brain decoding can be generalized to unseen subjects. This study aims to answer this question. We first consolidate an image-fMRI dataset consisting of stimulus-image and fMRI-response pairs, involving 177 subjects in the movie-viewing task of the Human Connectome Project (HCP). This dataset allows us to investigate the brain decoding performance with the increase of participants. We then present a learning paradigm that applies uniform processing across all subjects, instead of employing different network heads or tokenizers for individuals as in previous methods, which can accommodate a large number of subjects to explore the generalization capability across different subjects. A series of experiments are conducted and we have the following findings. First, the network exhibits clear generalization capabilities with the increase of training subjects. Second, the generalization capability is common to popular network architectures (MLP, CNN and Transformer). Third, the generalization performance is affected by the similarity between subjects. Our findings reveal the inherent similarities in brain activities across individuals. With the emerging of larger and more comprehensive datasets, it is possible to train a brain decoding foundation model in the this http URL and models can be found at this https URL.

[CV-14] FashionR2R: Texture-preserving Rendered-to-Real Image Translation with Diffusion Models NEURIPS2024

链接: https://arxiv.org/abs/2410.14429
作者: Rui Hu,Qian He,Gaofeng He,Jiedong Zhuang,Huang Chen,Huafeng Liu,Huamin Wang
关键词-EN: producing lifelike clothed, Modeling and producing, lifelike clothed human, attracted researchers’ attention, clothed human images
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:Modeling and producing lifelike clothed human images has attracted researchers’ attention from different areas for decades, with the complexity from highly articulated and structured content. Rendering algorithms decompose and simulate the imaging process of a camera, while are limited by the accuracy of modeled variables and the efficiency of computation. Generative models can produce impressively vivid human images, however still lacking in controllability and editability. This paper studies photorealism enhancement of rendered images, leveraging generative power from diffusion models on the controlled basis of rendering. We introduce a novel framework to translate rendered images into their realistic counterparts, which consists of two stages: Domain Knowledge Injection (DKI) and Realistic Image Generation (RIG). In DKI, we adopt positive (real) domain finetuning and negative (rendered) domain embedding to inject knowledge into a pretrained Text-to-image (T2I) diffusion model. In RIG, we generate the realistic image corresponding to the input rendered image, with a Texture-preserving Attention Control (TAC) to preserve fine-grained clothing textures, exploiting the decoupled features encoded in the UNet structure. Additionally, we introduce SynFashion dataset, featuring high-quality digital clothing images with diverse textures. Extensive experimental results demonstrate the superiority and effectiveness of our method in rendered-to-real image translation.

[CV-15] Variable Aperture Bokeh Rendering via Customized Focal Plane Guidance

链接: https://arxiv.org/abs/2410.14400
作者: Kang Chen,Shijun Yan,Aiwen Jiang,Han Li,Zhifeng Wang
关键词-EN: Bokeh, bokeh effect, Bokeh rendering, popular techniques, bokeh rendering method
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Bokeh rendering is one of the most popular techniques in photography. It can make photographs visually appealing by drawing viewers’ attention to a particular area of the image. However, achieving a satisfactory bokeh effect usually presents a significant challenge, since mobile cameras have restricted optical systems, while expensive high-end DSLR lenses with large apertures are needed. Therefore, many deep learning-based computational photography methods have been developed to mimic the bokeh effect in recent years. Nevertheless, most of these methods are limited to rendering the bokeh effect at a single fixed aperture. There is a lack of a user-friendly bokeh rendering method that can provide precise focal plane control and customised bokeh generation, as well as a lack of an authentic, realistic bokeh dataset that can promote bokeh learning on variable apertures. To address these two issues, in this paper, we propose an effective controllable bokeh rendering method, and contribute a Variable Aperture Bokeh Dataset (VABD). In the proposed method, the user can customize the focal plane to accurately locate concerned subjects and select target aperture information for bokeh rendering. Experimental results on the public EBB! benchmark dataset and our constructed dataset VABD demonstrate that the customized focal plane together with the aperture prompt can bootstrap the model to simulate realistic bokeh effects. The proposed method achieves competitive state-of-the-art performance with only 4.4M parameters, which is much lighter than mainstream computational bokeh models. The contributed dataset and source codes will be released on github this https URL.
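The physical intuition behind focal-plane and aperture control can be sketched with a thin-lens-style circle of confusion: blur radius grows with distance from the chosen focal plane and with aperture size. This is a simplified geometric model with illustrative constants, not the paper's learned rendering network.

```python
import numpy as np

def coc_radius(depth, focal_depth, aperture, max_radius=8.0):
    """Simplified circle-of-confusion radius (in pixels): zero on the
    user-selected focal plane, growing with defocus and aperture."""
    return np.minimum(max_radius, aperture * np.abs(depth - focal_depth) / depth)

# Per-pixel depths (meters); focus set on the 2 m plane with a wide aperture.
depths = np.array([1.0, 2.0, 4.0, 8.0])
radii = coc_radius(depths, focal_depth=2.0, aperture=4.0)
```

Pixels on the focal plane stay sharp (radius 0), while near and far pixels receive progressively larger blur kernels, which is the behavior a controllable bokeh renderer must reproduce.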

[CV-16] Dynamic Negative Guidance of Diffusion Models ICLR2025

链接: https://arxiv.org/abs/2410.14398
作者: Felix Koulischer,Johannes Deleu,Gabriel Raya,Thomas Demeester,Luca Ambrogioni
关键词-EN: Negative Prompting, undesired features, Dynamic Negative Guidance, widely utilized, prevent the generation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Paper currently under review. Submitted to ICLR 2025

点击查看摘要

Abstract:Negative Prompting (NP) is widely utilized in diffusion models, particularly in text-to-image applications, to prevent the generation of undesired features. In this paper, we show that conventional NP is limited by the assumption of a constant guidance scale, which may lead to highly suboptimal results, or even complete failure, due to the non-stationarity and state-dependence of the reverse process. Based on this analysis, we derive a principled technique called Dynamic Negative Guidance (DNG), which relies on a near-optimal time- and state-dependent modulation of the guidance without requiring additional training. Unlike NP, negative guidance requires estimating the posterior class probability during the denoising process, which is achieved with limited additional computational overhead by tracking the discrete Markov Chain during the generative process. We evaluate the performance of DNG on class-removal tasks on MNIST and CIFAR10, where we show that DNG leads to higher safety, preservation of class balance and image quality when compared with baseline methods. Furthermore, we show that it is possible to use DNG with Stable Diffusion to obtain more accurate and less invasive guidance than NP.
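The contrast between a constant negative-guidance scale and a posterior-modulated one can be shown schematically. The update rule, scale schedule, and externally supplied posterior below are all illustrative assumptions, not the paper's derivation; they only show the mechanism of scaling the negative direction by the current probability of the undesired class.

```python
import numpy as np

def dynamic_negative_guidance(eps_uncond, eps_neg, posterior_c, w_max=5.0):
    """Scale the negative-guidance direction by the current posterior
    probability of the undesired class. With posterior 0 the step reduces
    to plain unconditional denoising, so the guidance is non-invasive
    whenever the sample is already far from the undesired class."""
    w = w_max * posterior_c
    return eps_uncond - w * (eps_neg - eps_uncond)

# Toy noise predictions at one denoising step.
eps_u = np.array([0.2, -0.1])   # unconditional prediction
eps_n = np.array([0.5, 0.3])    # prediction conditioned on the negative prompt
out_off = dynamic_negative_guidance(eps_u, eps_n, posterior_c=0.0)
out_on = dynamic_negative_guidance(eps_u, eps_n, posterior_c=1.0)
```

Conventional NP corresponds to holding `posterior_c` fixed at 1 for every step and state, which is exactly the constant-scale assumption the paper argues against.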

[CV-17] SurgeryV2: Bridging the Gap Between Model Merging and Multi-Task Learning with Deep Representation Surgery ICML2024

链接: https://arxiv.org/abs/2410.14389
作者: Enneng Yang,Li Shen,Zhenyi Wang,Guibing Guo,Xingwei Wang,Xiaocun Cao,Jie Zhang,Dacheng Tao
关键词-EN: raw training data, merged model, merged MTL model, MTL, merging-based multitask learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: This paper is an extended version of our previous work [ arXiv:2402.02705 ] presented at ICML 2024

点击查看摘要

Abstract:Model merging-based multitask learning (MTL) offers a promising approach for performing MTL by merging multiple expert models without requiring access to raw training data. However, in this paper, we examine the merged model’s representation distribution and uncover a critical issue of “representation bias”. This bias arises from a significant distribution gap between the representations of the merged and expert models, leading to the suboptimal performance of the merged MTL model. To address this challenge, we first propose a representation surgery solution called Surgery. Surgery is a lightweight, task-specific module that aligns the final layer representations of the merged model with those of the expert models, effectively alleviating bias and improving the merged model’s performance. Despite these improvements, a performance gap remains compared to the traditional MTL method. Further analysis reveals that representation bias phenomena exist at each layer of the merged model, and aligning representations only in the last layer is insufficient for fully reducing systemic bias because biases introduced at each layer can accumulate and interact in complex ways. To tackle this, we then propose a more comprehensive solution, deep representation surgery (also called SurgeryV2), which mitigates representation bias across all layers, and thus bridges the performance gap between model merging-based MTL and traditional MTL. Finally, we design an unsupervised optimization objective to optimize both the Surgery and SurgeryV2 modules. Our experimental results show that incorporating these modules into state-of-the-art (SOTA) model merging schemes leads to significant performance gains. Notably, our SurgeryV2 scheme reaches almost the same level as individual expert models or the traditional MTL model. The code is available at this https URL.
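The core idea of a lightweight "surgery" module, mapping the merged model's representations back onto the expert's, can be sketched with a closed-form linear alignment. The linear map and least-squares objective below are illustrative stand-ins for the paper's learned task-specific module and its unsupervised objective.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 6
merged = rng.normal(size=(N, D))                   # merged-model representations
# Toy expert targets: an (unknown to us) linear transform of the merged reps.
expert = merged @ (rng.normal(size=(D, D)) * 0.5)

# "Surgery" as a least-squares linear module aligning merged -> expert reps.
W, *_ = np.linalg.lstsq(merged, expert, rcond=None)
aligned = merged @ W

# Representation bias, measured as mean squared gap to the expert reps.
bias_before = float(np.mean((merged - expert) ** 2))
bias_after = float(np.mean((aligned - expert) ** 2))
```

In this toy setting the gap is exactly linear, so the module removes the bias entirely; the paper's point is that applying such alignment at every layer (SurgeryV2), not just the last, is what closes the gap to traditional MTL.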

[CV-18] AnomalyNCD: Towards Novel Anomaly Class Discovery in Industrial Scenarios

链接: https://arxiv.org/abs/2410.14379
作者: Ziming Huang,Xurui Li,Haotian Liu,Feng Xue,Yuzhe Wang,Yu Zhou
关键词-EN: industrial scenario, anomaly, gain, anomalies, anomaly detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In the industrial scenario, anomaly detection could locate but cannot classify anomalies. To complete their capability, we study to automatically discover and recognize visual classes of industrial anomalies. In terms of multi-class anomaly classification, previous methods cluster anomalies represented by frozen pre-trained models but often fail due to poor discrimination. Novel class discovery (NCD) has the potential to tackle this. However, it struggles with non-prominent and semantically weak anomalies that challenge network learning focus. To address these, we introduce AnomalyNCD, a multi-class anomaly classification framework compatible with existing anomaly detection methods. This framework learns anomaly-specific features and classifies anomalies in a self-supervised manner. First, a technique called Main Element Binarization (MEBin) is designed, which segments primary anomaly regions into masks to alleviate the impact of incorrect detections on learning. Subsequently, we employ mask-guided contrastive representation learning to improve feature discrimination, which focuses network attention on isolated anomalous regions and reduces the confusion of erroneous inputs through re-corrected pseudo labels. Finally, to enable flexible classification at both region and image levels during inference, we develop a region merging strategy that determines the overall image category based on the classified anomaly regions. Our method outperforms the state-of-the-art works on the MVTec AD and MTD datasets. Compared with the current methods, AnomalyNCD combined with zero-shot anomaly detection method achieves a 10.8% F_1 gain, 8.8% NMI gain, and 9.5% ARI gain on MVTec AD, 12.8% F_1 gain, 5.7% NMI gain, and 10.8% ARI gain on MTD. The source code is available at this https URL.
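A minimal sketch of the Main Element Binarization step can be written as "threshold the anomaly map, then keep only the largest connected component as the main-element mask". The threshold and 4-connectivity below are assumptions; the paper's MEBin is more elaborate, but this captures the binarization-to-mask idea.

```python
import numpy as np
from collections import deque

def mebin(anom_map, thresh=0.5):
    """Threshold an anomaly score map and keep the largest 4-connected
    component as the primary anomaly mask (a simplified MEBin sketch)."""
    binary = anom_map >= thresh
    H, W = binary.shape
    seen = np.zeros_like(binary, dtype=bool)
    best = np.zeros_like(binary, dtype=bool)
    for sy in range(H):
        for sx in range(W):
            if binary[sy, sx] and not seen[sy, sx]:
                comp, q = [], deque([(sy, sx)])  # BFS over one component
                seen[sy, sx] = True
                while q:
                    y, x = q.popleft()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < H and 0 <= nx < W
                                and binary[ny, nx] and not seen[ny, nx]):
                            seen[ny, nx] = True
                            q.append((ny, nx))
                if len(comp) > best.sum():       # keep the biggest component
                    mask = np.zeros_like(binary, dtype=bool)
                    for y, x in comp:
                        mask[y, x] = True
                    best = mask
    return best

# A spurious single-pixel detection plus a genuine 3x3 anomaly blob.
anom = np.zeros((5, 5))
anom[0, 0] = 1.0
anom[2:5, 2:5] = 1.0
mask = mebin(anom)
```

Suppressing the single-pixel false positive is exactly the "alleviate the impact of incorrect detections" role the mask plays downstream.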

[CV-19] Impact of imperfect annotations on CNN training and performance for instance segmentation and classification in digital pathology

链接: https://arxiv.org/abs/2410.14365
作者: Laura Gálvez Jiménez,Christine Decaestecker
关键词-EN: Segmentation and classification, accurate diagnosis, digital pathology, pathology for accurate, numbers of instances
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Segmentation and classification of large numbers of instances, such as cell nuclei, are crucial tasks in digital pathology for accurate diagnosis. However, the availability of high-quality datasets for deep learning methods is often limited due to the complexity of the annotation process. In this work, we investigate the impact of noisy annotations on the training and performance of a state-of-the-art CNN model for the combined task of detecting, segmenting and classifying nuclei in histopathology images. In this context, we investigate the conditions for determining an appropriate number of training epochs to prevent overfitting to annotation noise during training. Our results indicate that the utilisation of a small, correctly annotated validation set is instrumental in avoiding overfitting and maintaining model performance to a large extent. Additionally, our findings underscore the beneficial role of pre-training.

[CV-20] Zero-shot Action Localization via the Confidence of Large Vision-Language Models

链接: https://arxiv.org/abs/2410.14340
作者: Josiah Aklilu,Xiaohan Wang,Serena Yeung-Levy
关键词-EN: minimally invasive surgery, dramatically enhance analysis, Precise action localization, Precise action, invasive surgery
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Precise action localization in untrimmed video is vital for fields such as professional sports and minimally invasive surgery, where the delineation of particular motions in recordings can dramatically enhance analysis. But in many cases, large scale datasets with video-label pairs for localization are unavailable, limiting the opportunity to fine-tune video-understanding models. Recent developments in large vision-language models (LVLM) address this need with impressive zero-shot capabilities in a variety of video understanding tasks. However, the adaptation of image-based LVLMs, with their powerful visual question answering capabilities, to action localization in long-form video is still relatively unexplored. To this end, we introduce a true ZEro-shot Action Localization method (ZEAL). Specifically, we leverage the built-in action knowledge of a large language model (LLM) to inflate actions into highly-detailed descriptions of the archetypal start and end of the action. These descriptions serve as queries to the LVLM for generating frame-level confidence scores which can be aggregated to produce localization outputs. The simplicity and flexibility of our method make it amenable to more capable LVLMs as they are developed, and we demonstrate remarkable results in zero-shot action localization on a challenging benchmark, without any training.
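The final aggregation step, turning per-frame start and end confidences into a localized segment, can be sketched with a brute-force search for the best (start, end) pair. The scoring rule (sum of the two confidences) and the toy scores are assumptions; the paper does not commit to this exact aggregation.

```python
def localize(start_conf, end_conf, min_len=1):
    """Pick the (start_frame, end_frame) pair maximizing the combined
    start and end confidence, with end at least min_len frames later."""
    best, best_seg = float("-inf"), None
    n = len(start_conf)
    for i in range(n):
        for j in range(i + min_len, n):
            s = start_conf[i] + end_conf[j]
            if s > best:
                best, best_seg = s, (i, j)
    return best_seg

# Toy frame-level confidences from the LVLM queried with the LLM's
# "archetypal start" and "archetypal end" descriptions.
seg = localize([0.1, 0.9, 0.2, 0.1], [0.0, 0.1, 0.2, 0.8])
```

Here the action is localized to frames 1 through 3, where the start and end confidences peak.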

[CV-21] Evaluating the evaluators: Towards human-aligned metrics for missing markers reconstruction

链接: https://arxiv.org/abs/2410.14334
作者: Taras Kucherenko,Derek Peristy,Judith Bütepage
关键词-EN: optical motion capture, motion capture systems, Animation data, optical motion, motion capture
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Animation data is often obtained through optical motion capture systems, which utilize a multitude of cameras to establish the position of optical markers. However, system errors or occlusions can result in missing markers, the manual cleaning of which can be time-consuming. This has sparked interest in machine learning-based solutions for missing marker reconstruction in the academic community. Most academic papers utilize a simplistic mean square error as the main metric. In this paper, we show that this metric does not correlate with subjective perception of the fill quality. We introduce and evaluate a set of better-correlated metrics that can drive progress in the field.

[CV-22] Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension

链接: https://arxiv.org/abs/2410.14332
作者: Yin Xie,Kaicheng Yang,Ninghua Yang,Weimo Deng,Xiangzi Dai,Tiancheng Gu,Yumeng Wang,Xiang An,Yongle Zhao,Ziyong Feng,Jiankang Deng
关键词-EN: Large Multimodal Models, Large Multimodal, Large Language Models, Recent advances, Large Language
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 18 pages, 11 figures

点击查看摘要

Abstract:Recent advances in Large Language Models (LLMs) have catalyzed the development of Large Multimodal Models (LMMs). However, existing research primarily focuses on tuning language and image instructions, ignoring the critical pretraining phase where models learn to process textual and visual modalities jointly. In this paper, we propose a new pretraining paradigm for LMMs to enhance the visual comprehension capabilities of LLMs by introducing a novel cross-modal comprehension stage. Specifically, we design a dynamically learnable prompt token pool and employ the Hungarian algorithm to replace part of the original visual tokens with the most relevant prompt tokens. Then, we conceptualize visual tokens as analogous to a “foreign language” for the LLMs and propose a mixed attention mechanism with bidirectional visual attention and unidirectional textual attention to comprehensively enhance the understanding of visual tokens. Meanwhile, we integrate a detailed caption generation task, leveraging rich descriptions to further facilitate LLMs in understanding visual semantic information. After pretraining on 1.5 million publicly accessible data samples, we present a new foundation model called Croc. Experimental results demonstrate that Croc achieves new state-of-the-art performance on massive vision-language benchmarks. To support reproducibility and facilitate further research, we release the training code and pre-trained model weights at this https URL.

[CV-23] Fast proxy centers for Jeffreys centroids: The Jeffreys-Fisher-Rao and the inductive Gauss-Bregman centers

链接: https://arxiv.org/abs/2410.14326
作者: Frank Nielsen
关键词-EN: Jeffreys centroid, mutually absolutely continuous, absolutely continuous probability, Jeffreys, continuous probability distributions
类目: Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 35 pages, 10 figures

点击查看摘要

Abstract:The symmetric Kullback-Leibler centroid also called the Jeffreys centroid of a set of mutually absolutely continuous probability distributions on a measure space provides a notion of centrality which has proven useful in many tasks including information retrieval, information fusion, and clustering in image, video and sound processing. However, the Jeffreys centroid is not available in closed-form for sets of categorical or normal distributions, two widely used statistical models, and thus need to be approximated numerically in practice. In this paper, we first propose the new Jeffreys-Fisher-Rao center defined as the Fisher-Rao midpoint of the sided Kullback-Leibler centroids as a plug-in replacement of the Jeffreys centroid. This Jeffreys-Fisher-Rao center admits a generic formula for uni-parameter exponential family distributions, and closed-form formula for categorical and normal distributions, matches exactly the Jeffreys centroid for same-mean normal distributions, and is experimentally observed in practice to be close to the Jeffreys centroid. Second, we define a new type of inductive centers generalizing the principle of Gauss arithmetic-geometric double sequence mean for pairs of densities of any given exponential family. This center is shown experimentally to approximate very well the Jeffreys centroid and is suggested to use when the Jeffreys-Fisher-Rao center is not available in closed form. Moreover, this Gauss-Bregman inductive center always converges and matches the Jeffreys centroid for sets of same-mean normal distributions. We report on our experiments demonstrating the use of the Jeffreys-Fisher-Rao and Gauss-Bregman centers instead of the Jeffreys centroid. Finally, we conclude this work by reinterpreting these fast proxy centers of Jeffreys centroids under the lens of dually flat spaces in information geometry.
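The "Gauss arithmetic-geometric double sequence mean" that the inductive Gauss-Bregman center generalizes is worth seeing in its classical scalar form; the sketch below shows only this motivating special case, not the paper's Bregman/exponential-family generalization.

```python
def agm(a, g, tol=1e-12):
    """Gauss's double sequence: repeatedly replace (a, g) by their
    arithmetic and geometric means; both sequences converge to a
    common limit squeezed strictly between the two starting means."""
    while abs(a - g) > tol:
        a, g = (a + g) / 2.0, (a * g) ** 0.5
    return a

m = agm(1.0, 2.0)   # lies strictly between sqrt(2) and 3/2
```

The paper's inductive center replaces the arithmetic and geometric means of scalars with the corresponding dual Bregman averagings of densities, and the always-converging double sequence is what makes the center cheap to compute.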

[CV-24] HiCo: Hierarchical Controllable Diffusion Model for Layout-to-image Generation NEURIPS2024

链接: https://arxiv.org/abs/2410.14324
作者: Bo Cheng,Yuhang Ma,Liebucha Wu,Shanyuan Liu,Ao Ma,Xiaoyu Wu,Dawei Leng,Yuhui Yin
关键词-EN: involves synthesizing images, synthesizing images based, generation involves synthesizing, involves synthesizing, synthesizing images
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: NeurIPS2024

点击查看摘要

Abstract:The task of layout-to-image generation involves synthesizing images based on the captions of objects and their spatial positions. Existing methods still struggle in complex layout generation, where common failure cases include missing objects, inconsistent lighting, conflicting view angles, etc. To effectively address these issues, we propose a Hierarchical Controllable (HiCo) diffusion model for layout-to-image generation, featuring an object-separable conditioning branch structure. Our key insight is to achieve spatial disentanglement through hierarchical modeling of layouts. We use a multi-branch structure to represent the hierarchy and aggregate the branches in a fusion module. To evaluate the performance of multi-objective controllable layout generation in natural scenes, we introduce the HiCo-7K benchmark, derived from the GRIT-20M dataset and manually cleaned. this https URL.

[CV-25] Advanced Underwater Image Quality Enhancement via Hybrid Super-Resolution Convolutional Neural Networks and Multi-Scale Retinex-Based Defogging Techniques

链接: https://arxiv.org/abs/2410.14285
作者: Yugandhar Reddy Gogireddy,Jithendra Reddy Gogireddy
关键词-EN: Convolutional Neural Networks, image degradation due, Super-Resolution Convolutional Neural, underwater image degradation, light scattering
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The difficulties of underwater image degradation due to light scattering, absorption, and fog-like particles, which lead to low resolution and poor visibility, are discussed in this study. We suggest a sophisticated hybrid strategy that combines Multi-Scale Retinex (MSR) defogging methods with Super-Resolution Convolutional Neural Networks (SRCNN) to address these problems. The Retinex algorithm mimics human visual perception to reduce uneven lighting and fogging, while the SRCNN component improves the spatial resolution of underwater images. Through the combination of these methods, we are able to enhance the clarity, contrast, and colour restoration of underwater images, offering a reliable way to improve image quality in difficult underwater conditions. The research conducts extensive experiments on real-world underwater datasets to further illustrate the efficacy of the suggested approach. In terms of sharpness, visibility, and feature retention, quantitative evaluation using metrics like the Structural Similarity Index Measure (SSIM) and Peak Signal-to-Noise Ratio (PSNR) demonstrates notable advances over conventional methods. For real-time underwater applications like marine exploration, underwater robotics, and autonomous underwater vehicles, where clear and high-resolution imaging is crucial for operational success, the combination of deep learning and conventional image processing techniques offers a computationally efficient framework with superior results.

[CV-26] Takin-ADA: Emotion Controllable Audio-Driven Animation with Canonical and Landmark Loss Optimization

链接: https://arxiv.org/abs/2410.14283
作者: Bin Lin,Yanzhen Yu,Jianhao Ye,Ruitao Lv,Yuguang Yang,Ruoye Xie,Pan Yu,Hongbin Zhou
关键词-EN: face critical challenges, imprecise audio-driven synchronization, including expression leakage, critical challenges, methods face critical
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: under review

点击查看摘要

Abstract:Existing audio-driven facial animation methods face critical challenges, including expression leakage, ineffective subtle expression transfer, and imprecise audio-driven synchronization. We discovered that these issues stem from limitations in motion representation and the lack of fine-grained control over facial expressions. To address these problems, we present Takin-ADA, a novel two-stage approach for real-time audio-driven portrait animation. In the first stage, we introduce a specialized loss function that enhances subtle expression transfer while reducing unwanted expression leakage. The second stage utilizes an advanced audio processing technique to improve lip-sync accuracy. Our method not only generates precise lip movements but also allows flexible control over facial expressions and head motions. Takin-ADA achieves high-resolution (512x512) facial animations at up to 42 FPS on an RTX 4090 GPU, outperforming existing commercial solutions. Extensive experiments demonstrate that our model significantly surpasses previous methods in video quality, facial dynamics realism, and natural head movements, setting a new benchmark in the field of audio-driven facial animation.

[CV-27] You Only Look Twice! for Failure Causes Identification of Drill Bits

链接: https://arxiv.org/abs/2410.14282
作者: Asma Yamani,Nehal Al-Otaiby,Haifa Al-Shemmeri,Imane Boudellioua
关键词-EN: drill bit failure, drill bit, safety threats, Efficient identification, drill bit damages
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:Efficient identification of the root causes of drill bit failure is crucial due to potential impacts such as operational losses, safety threats, and delays. Early recognition of these failures enables proactive maintenance, reducing risks and financial losses associated with unforeseen breakdowns and prolonged downtime. Thus, our study investigates various causes of drill bit failure using images of different blades. The process involves annotating cutters with their respective locations and damage types, followed by the development of two YOLO models for cutter location and damage detection, as well as multi-class multi-label Decision Tree and Random Forest models to identify the causes of failure by assessing the cutters' location and damage type. Additionally, RRFCI is proposed for the classification of failure causes. Notably, the cutter location detection model achieved a high score of 0.97 mAP, and the cutter damage detection model yielded 0.49 mAP. The rule-based approach outperformed both DT and RF in failure cause identification, achieving a macro-average F1-score of 0.94 across all damage causes. The integration of the complete automated pipeline successfully identified 100% of the 24 failure causes when tested on independent sets of ten drill bits, showcasing its potential to efficiently assist experts in identifying the root causes of drill bit damage.

[CV-28] ClearSR: Latent Low-Resolution Image Embeddings Help Diffusion-Based Real-World Super Resolution Models See Clearer

链接: https://arxiv.org/abs/2410.14279
作者: Yuhao Wan,Peng-Tao Jiang,Qibin Hou,Hao Zhang,Jinwei Chen,Ming-Ming Cheng,Bo Li
关键词-EN: diffusion-based real-world image, latent low-resolution image, present ClearSR, diffusion-based real-world, real-world image super-resolution
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We present ClearSR, a new method that can better take advantage of latent low-resolution image (LR) embeddings for diffusion-based real-world image super-resolution (Real-ISR). Previous Real-ISR models mostly focus on how to activate more generative priors of text-to-image diffusion models to make the output high-resolution (HR) images look better. However, since these methods rely too much on the generative priors, the content of the output images is often inconsistent with the input LR ones. To mitigate the above issue, in this work, we explore using latent LR embeddings to constrain the control signals from ControlNet, and extract LR information at both detail and structure levels. We show that the proper use of latent LR embeddings can produce higher-quality control signals, which enables the super-resolution results to be more consistent with the LR image and leads to clearer visual results. In addition, we also show that latent LR embeddings can be used to control the inference stage, allowing for the improvement of fidelity and generation ability simultaneously. Experiments demonstrate that our model can achieve better performance across multiple metrics on several test sets and generate more consistent SR results with LR images than existing methods. Our code will be made publicly available.

[CV-29] HYPNOS: Highly Precise Foreground-focused Diffusion Finetuning for Inanimate Objects ACCV

链接: https://arxiv.org/abs/2410.14265
作者: Oliverio Theophilus Nathanael,Jonathan Samuel Lumentut,Nicholas Hans Muliawan,Edbert Valencio Angky,Felix Indra Kurniadi,Alfi Yusrotis Zakiyyah,Jeklin Harefa
关键词-EN: computer vision studies, recent years, hot topic, topic in computer, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 26 pages, 12 figures, to appear on the Rich Media with Generative AI workshop in conjunction with Asian Conference on Computer Vision (ACCV) 2024

点击查看摘要

Abstract:In recent years, personalized diffusion-based text-to-image generative tasks have been a hot topic in computer vision studies. A robust diffusion model is determined by its ability to perform near-perfect reconstruction of certain product outcomes given few related input samples. Unfortunately, the current prominent diffusion-based finetuning technique falls short in maintaining foreground object consistency while being constrained to produce diverse backgrounds in the image outcome. In the worst scenario, an overfitting issue may occur, meaning that the foreground object becomes less controllable: for example, the input prompt information is transferred ambiguously to both foreground and background regions, instead of the supposed background region only. To tackle these issues, we propose Hypnos, a highly precise foreground-focused diffusion finetuning technique. On the image level, this strategy works best for inanimate object generation tasks, and to do so, Hypnos implements two main approaches, namely: (i) a content-centric prompting strategy and (ii) the utilization of our additional foreground-focused discriminative module. The utilized module is connected with the diffusion model and finetuned with our proposed set of supervision mechanisms. Combining the strategies above yields the foreground-background disentanglement capability of the diffusion model. Our experimental results show that the proposed strategy gives more robust performance and visually pleasing results compared to the former technique. We also provide extensive studies to assess these outcomes, which reveal how personalization behaves with respect to several training conditions.

[CV-30] Vision-Language Navigation with Energy-Based Policy

链接: https://arxiv.org/abs/2410.14250
作者: Rui Liu,Wenguan Wang,Yi Yang
关键词-EN: Existing VLN models, Vision-language navigation, VLN models, requires an agent, human instructions
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Vision-language navigation (VLN) requires an agent to execute actions following human instructions. Existing VLN models are optimized through expert demonstrations by supervised behavioural cloning or incorporating manual reward engineering. While straightforward, these efforts overlook the accumulation of errors in the Markov decision process, and struggle to match the distribution of the expert policy. Going beyond this, we propose an Energy-based Navigation Policy (ENP) to model the joint state-action distribution using an energy-based model. At each step, low energy values correspond to the state-action pairs that the expert is most likely to perform, and vice versa. Theoretically, the optimization objective is equivalent to minimizing the forward divergence between the occupancy measure of the expert and ours. Consequently, ENP learns to globally align with the expert policy by maximizing the likelihood of the actions and modeling the dynamics of the navigation states in a collaborative manner. With a variety of VLN architectures, ENP achieves promising performances on R2R, REVERIE, RxR, and R2R-CE, unleashing the power of existing VLN models.
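
To make the energy-policy relation concrete: if an energy function scores state-action pairs, the induced policy is the Boltzmann distribution over actions, with low energy marking expert-likely pairs. The sketch below uses a toy bilinear energy with made-up dimensions, purely to illustrate the principle (it is not the paper's navigation network):

```python
import numpy as np

def energy(state, action, W):
    """Toy energy function: a bilinear score over a state-action pair.
    Lower energy = a pair the (hypothetical) expert is more likely to take."""
    return -state @ W @ action

def policy(state, actions, W):
    """Boltzmann policy induced by the energy: pi(a|s) proportional to exp(-E(s, a))."""
    energies = np.array([energy(state, a, W) for a in actions])
    logits = -energies
    logits -= logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))                    # illustrative parameter matrix
state = rng.normal(size=4)
actions = [rng.normal(size=3) for _ in range(5)]
probs = policy(state, actions, W)
```

Training then amounts to shaping the energy so expert state-action pairs receive low values, which the abstract relates to minimizing a forward divergence between occupancy measures.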

[CV-31] ERDDCI: Exact Reversible Diffusion via Dual-Chain Inversion for High-Quality Image Editing

链接: https://arxiv.org/abs/2410.14247
作者: Jimin Dai,Yingzhen Zhang,Shuo Chen,Jian Yang,Lei Luo
关键词-EN: Exact Reversible Diffusion, successfully applied, diffusion process, image editing, Diffusion
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion models (DMs) have been successfully applied to real image editing. These models typically invert images into latent noise vectors used to reconstruct the original images (known as inversion), and then edit them during the inference process. However, recent popular DMs often rely on the assumption of local linearization, where the noise injected during the inversion process is expected to approximate the noise removed during the inference process. While DM efficiently generates images under this assumption, it can also accumulate errors during the diffusion process due to the assumption, ultimately negatively impacting the quality of real image reconstruction and editing. To address this issue, we propose a novel method, referred to as ERDDCI (Exact Reversible Diffusion via Dual-Chain Inversion). ERDDCI uses the new Dual-Chain Inversion (DCI) for joint inference to derive an exact reversible diffusion process. By using DCI, our method effectively avoids the cumbersome optimization process in existing inversion approaches and achieves high-quality image editing. Additionally, to accommodate image operations under high guidance scales, we introduce a dynamic control strategy that enables more refined image reconstruction and editing. Our experiments demonstrate that ERDDCI significantly outperforms state-of-the-art methods in a 50-step diffusion process. It achieves rapid and precise image reconstruction with an SSIM of 0.999 and an LPIPS of 0.001, and also delivers competitive results in image editing.

[CV-32] PReP: Efficient context-based shape retrieval for missing parts

链接: https://arxiv.org/abs/2410.14245
作者: Vlassis Fotis,Ioannis Romanelis,Georgios Mylonas,Athanasios Kalogeras,Konstantinos Moustakas
关键词-EN: point cloud domain, cloud domain, paper we study, study the problem, point cloud
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper we study the problem of shape part retrieval in the point cloud domain. Shape retrieval methods in the literature rely on the presence of an existing query object, but what if the part we are looking for is not available? We present the Part Retrieval Pipeline (PReP), a pipeline that creatively utilizes metric learning techniques along with a trained classification model to measure the suitability of potential replacement parts from a database, as part of an application scenario targeting the circular economy. Through an innovative training procedure with increasing difficulty, it is able to learn to recognize suitable parts relying only on shape context. Thanks to its low parameter size and computational requirements, it can be used to sort through a warehouse of potentially tens of thousands of spare parts in just a few seconds. We also establish an alternative baseline approach to compare against, and extensively document the unique challenges associated with this task, as well as identify the design choices to solve them.

[CV-33] Pseudo-label Refinement for Improving Self-Supervised Learning Systems

链接: https://arxiv.org/abs/2410.14242
作者: Zia-ur-Rehman,Arif Mahmood,Wenxiong Kang
关键词-EN: gained significant attention, leveraging clustering-based pseudo-labels, Self-supervised learning systems, SLR algorithm, human annotations
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Self-supervised learning systems have gained significant attention in recent years by leveraging clustering-based pseudo-labels to provide supervision without the need for human annotations. However, the noise in these pseudo-labels caused by the clustering methods poses a challenge to the learning process, leading to degraded performance. In this work, we propose a pseudo-label refinement (SLR) algorithm to address this issue. The cluster labels from the previous epoch are projected into the current epoch's cluster-label space, and a linear combination of the new label and the projected label is computed as a soft refined label containing information from the previous epoch's clusters as well as from the current epoch. In contrast to the common practice of using the maximum value as a cluster/class indicator, we employ hierarchical clustering on these soft pseudo-labels to generate refined hard labels. This approach better utilizes the information embedded in the soft labels, outperforming the simple maximum-value approach for hard-label generation. The effectiveness of the proposed SLR algorithm is evaluated in the context of person re-identification (Re-ID) using unsupervised domain adaptation (UDA). Experimental results demonstrate that the modified Re-ID baseline, incorporating the SLR algorithm, achieves significantly improved mean Average Precision (mAP) performance in various UDA tasks, including real-to-synthetic, synthetic-to-real, and different real-to-real scenarios. These findings highlight the efficacy of the SLR algorithm in enhancing the performance of self-supervised learning systems.
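
The soft-label step described above can be sketched in a few lines: project the previous epoch's soft labels into the current cluster space, then blend them with the current hard assignments. The mixing weight `alpha` and the identity projection used here are hypothetical placeholders, not values from the paper:

```python
import numpy as np

def refine_labels(prev_soft, curr_onehot, projection, alpha=0.5):
    """Blend projected previous-epoch soft labels with current hard assignments.
    prev_soft: (n, k_prev) soft labels, projection: (k_prev, k_curr) mapping,
    curr_onehot: (n, k_curr) current cluster assignments."""
    projected = prev_soft @ projection                  # previous -> current label space
    soft = alpha * curr_onehot + (1 - alpha) * projected
    return soft / soft.sum(axis=1, keepdims=True)       # renormalise rows to sum to 1

prev = np.array([[0.7, 0.3], [0.2, 0.8]])
curr = np.array([[1.0, 0.0], [0.0, 1.0]])
P = np.eye(2)                                           # identity projection for the toy case
refined = refine_labels(prev, curr, P)
```

The paper then runs hierarchical clustering over these soft labels to produce refined hard labels, rather than simply taking the argmax.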

[CV-34] Storyboard guided Alignment for Fine-grained Video Action Recognition

链接: https://arxiv.org/abs/2410.14238
作者: Enqi Liu,Liyuan Pan,Yan Yang,Yiran Zhong,Zhijing Wu,Xinxiao Wu,Liu Liu
关键词-EN: video-text matching problem, global video semantics, matching problem, atomic actions, video
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Fine-grained video action recognition can be conceptualized as a video-text matching problem. Previous approaches often rely on global video semantics to consolidate video embeddings, which can lead to misalignment in video-text pairs due to a lack of understanding of action semantics at an atomic granularity level. To tackle this challenge, we propose a multi-granularity framework based on two observations: (i) videos with different global semantics may share similar atomic actions or appearances, and (ii) atomic actions within a video can be momentary, slow, or even non-directly related to the global video semantics. Inspired by the concept of storyboarding, which disassembles a script into individual shots, we enhance global video semantics by generating fine-grained descriptions using a pre-trained large language model. These detailed descriptions capture common atomic actions depicted in videos. A filtering metric is proposed to select the descriptions that correspond to the atomic actions present in both the videos and the descriptions. By employing global semantics and fine-grained descriptions, we can identify key frames in videos and utilize them to aggregate embeddings, thereby making the embedding more accurate. Extensive experiments on various video action recognition datasets demonstrate superior performance of our proposed method in supervised, few-shot, and zero-shot settings.

[CV-35] MambaSCI: Efficient Mamba-UNet for Quad-Bayer Patterned Video Snapshot Compressive Imaging NEURIPS2024

链接: https://arxiv.org/abs/2410.14214
作者: Zhenghao Pan,Haijin Zeng,Jiezhang Cao,Yongyong Chen,Kai Zhang,Yong Xu
关键词-EN: single Bayer-patterned measurement, snapshot compressive imaging, capture multiple sequential, color video SCI, Bayer-patterned measurement
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Color video snapshot compressive imaging (SCI) employs computational imaging techniques to capture multiple sequential video frames in a single Bayer-patterned measurement. With the increasing popularity of quad-Bayer pattern in mainstream smartphone cameras for capturing high-resolution videos, mobile photography has become more accessible to a wider audience. However, existing color video SCI reconstruction algorithms are designed based on the traditional Bayer pattern. When applied to videos captured by quad-Bayer cameras, these algorithms often result in color distortion and ineffective demosaicing, rendering them impractical for primary equipment. To address this challenge, we propose the MambaSCI method, which leverages the Mamba and UNet architectures for efficient reconstruction of quad-Bayer patterned color video SCI. To the best of our knowledge, our work presents the first algorithm for quad-Bayer patterned SCI reconstruction, and also the initial application of the Mamba model to this task. Specifically, we customize Residual-Mamba-Blocks, which residually connect the Spatial-Temporal Mamba (STMamba), Edge-Detail-Reconstruction (EDR) module, and Channel Attention (CA) module. Respectively, STMamba is used to model long-range spatial-temporal dependencies with linear complexity, EDR is for better edge-detail reconstruction, and CA is used to compensate for the missing channel information interaction in Mamba model. Experiments demonstrate that MambaSCI surpasses state-of-the-art methods with lower computational and memory costs. PyTorch style pseudo-code for the core modules is provided in the supplementary materials.

[CV-36] Shape Transformation Driven by Active Contour for Class-Imbalanced Semi-Supervised Medical Image Segmentation

链接: https://arxiv.org/abs/2410.14210
作者: Yuliang Gu,Yepeng Liu,Zhichao Sun,Jinchi Zhu,Yongchao Xu,Laurent Najman(LIGM)
关键词-EN: demands expert knowledge, images demands expert, medical images demands, medical image segmentation, demands expert
类目: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Annotating 3D medical images demands expert knowledge and is time-consuming. As a result, semi-supervised learning (SSL) approaches have gained significant interest in 3D medical image segmentation. The significant size differences among various organs in the human body lead to imbalanced class distribution, which is a major challenge in the real-world application of these SSL approaches. To address this issue, we develop a novel Shape Transformation driven by Active Contour (STAC), that enlarges smaller organs to alleviate imbalanced class distribution across different organs. Inspired by curve evolution theory in active contour methods, STAC employs a signed distance function (SDF) as the level set function, to implicitly represent the shape of organs, and deforms voxels in the direction of the steepest descent of SDF (i.e., the normal vector). To ensure that the voxels far from expansion organs remain unchanged, we design an SDF-based weight function to control the degree of deformation for each voxel. We then use STAC as a data-augmentation process during the training stage. Experimental results on two benchmark datasets demonstrate that the proposed method significantly outperforms some state-of-the-art methods. Source code is publicly available at this https URL.
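
The deformation rule the abstract describes, moving voxels along the SDF's steepest-descent direction with an SDF-based weight so that distant voxels stay fixed, can be sketched on a toy analytic sphere SDF. The Gaussian weight and step size here are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def sphere_sdf(p, r=1.0):
    # Signed distance to a sphere of radius r: negative inside, positive outside.
    return np.linalg.norm(p, axis=-1) - r

def sdf_normal(p, eps=1e-4):
    # Central-difference gradient of the SDF; for a true SDF this is the unit normal.
    g = np.zeros_like(p)
    for i in range(p.shape[-1]):
        d = np.zeros(p.shape[-1]); d[i] = eps
        g[..., i] = (sphere_sdf(p + d) - sphere_sdf(p - d)) / (2 * eps)
    return g / np.maximum(np.linalg.norm(g, axis=-1, keepdims=True), 1e-12)

def deform(points, step=0.2, sigma=0.3):
    # Move each point along the steepest descent of the SDF (i.e. against the
    # normal), scaled by an SDF-based weight so points far from the surface
    # remain essentially unchanged.
    sdf = sphere_sdf(points)[..., None]
    weight = np.exp(-(sdf / sigma) ** 2)     # hypothetical weight function
    return points - step * weight * sdf_normal(points)

pts = np.array([[1.1, 0.0, 0.0],    # just outside the surface: moves inward
                [5.0, 0.0, 0.0]])   # far away: weight ~ 0, stays put
out = deform(pts)
```

Pulling nearby space toward the organ in this way effectively enlarges the small structure once the label volume is resampled, which is the class-balancing effect the method targets.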

[CV-37] Text-to-Image Representativity Fairness Evaluation Framework

链接: https://arxiv.org/abs/2410.14201
作者: Asma Yamani,Malak Baslyman
关键词-EN: searches or artists, progressing rapidly, source of advertisement, advertisement and media, Representativity Fairness Evaluation
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Text-to-Image generative systems are progressing rapidly to be a source of advertisement and media and could soon serve as image searches or artists. However, there is a significant concern about the representativity bias these models embody and how these biases can propagate in the social fabric after fine-tuning them. Therefore, continuously monitoring and evaluating these models for fairness is important. To address this issue, we propose the Text-to-Image (TTI) Representativity Fairness Evaluation Framework. In this framework, we evaluate three aspects of a TTI system: diversity, inclusion, and quality. For each aspect, human-based and model-based approaches are proposed and evaluated for their ability to capture the bias and whether they can substitute each other. The framework starts by suggesting the prompts for generating the images for the evaluation based on the context and the sensitive attributes under study. Then the three aspects are evaluated using the proposed approaches. Based on the evaluation, a decision is made regarding the representativity bias within the TTI system. The evaluation of our framework on Stable Diffusion shows that the framework can effectively capture the bias in TTI systems. The results also confirm that our proposed model-based approaches can substitute human-based approaches in three out of four components with high correlation, which could potentially reduce costs and automate the process. The study suggests that continual learning of the model on more inclusive data across disadvantaged minorities such as Indians and Middle Easterners is essential to mitigate current stereotyping and lack of inclusiveness.

[CV-38] Rethinking Transformer for Long Contextual Histopathology Whole Slide Image Analysis NEURIPS-2024

链接: https://arxiv.org/abs/2410.14195
作者: Honglin Li,Yunlong Zhang,Pingyi Chen,Zhongyi Shui,Chenglu Zhu,Lin Yang
关键词-EN: Histopathology Whole Slide, Slide Image, clinical cancer diagnosis, routines of doctors, gold standard
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: NeurIPS-2024. arXiv admin note: text overlap with arXiv:2311.12885

点击查看摘要

Abstract:Histopathology Whole Slide Image (WSI) analysis serves as the gold standard for clinical cancer diagnosis in the daily routines of doctors. To develop computer-aided diagnosis model for WSIs, previous methods typically employ Multi-Instance Learning to enable slide-level prediction given only slide-level labels. Among these models, vanilla attention mechanisms without pairwise interactions have traditionally been employed but are unable to model contextual information. More recently, self-attention models have been utilized to address this issue. To alleviate the computational complexity of long sequences in large WSIs, methods like HIPT use region-slicing, and TransMIL employs approximation of full self-attention. Both approaches suffer from suboptimal performance due to the loss of key information. Moreover, their use of absolute positional embedding struggles to effectively handle long contextual dependencies in shape-varying WSIs. In this paper, we first analyze how the low-rank nature of the long-sequence attention matrix constrains the representation ability of WSI modelling. Then, we demonstrate that the rank of attention matrix can be improved by focusing on local interactions via a local attention mask. Our analysis shows that the local mask aligns with the attention patterns in the lower layers of the Transformer. Furthermore, the local attention mask can be implemented during chunked attention calculation, reducing the quadratic computational complexity to linear with a small local bandwidth. Building on this, we propose a local-global hybrid Transformer for both computational acceleration and local-global information interactions modelling. Our method, Long-contextual MIL (LongMIL), is evaluated through extensive experiments on various WSI tasks to validate its superiority. Our code will be available at this http URL.
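
The local attention mask at the heart of the method restricts each token to a diagonal band of neighbours. A minimal sketch follows; for clarity it materialises the full score matrix before masking, whereas the paper computes it chunk by chunk to reach linear cost in the sequence length (the bandwidth and sizes here are illustrative):

```python
import numpy as np

def local_attention(x, bandwidth=2):
    """Self-attention restricted to a diagonal band: token i attends only to
    tokens j with |i - j| <= bandwidth. Chunked computation of the band
    (not shown) reduces the O(n^2) cost here to O(n * bandwidth)."""
    n, d = x.shape
    scores = (x @ x.T) / np.sqrt(d)
    idx = np.arange(n)
    band = np.abs(idx[:, None] - idx[None, :]) <= bandwidth   # band mask
    scores = np.where(band, scores, -np.inf)                  # kill out-of-band pairs
    scores -= scores.max(axis=1, keepdims=True)               # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)
    return w @ x, w

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 4))          # 8 patch tokens, 4-dim features
out, attn = local_attention(tokens, bandwidth=2)
```

Because the masked attention matrix is banded, its rank is no longer limited by the global low-rank structure the paper analyses, which is the motivation for the local-global hybrid design.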

[CV-39] Neural Signed Distance Function Inference through Splatting 3D Gaussians Pulled on Zero-Level Set NEURIPS2024

链接: https://arxiv.org/abs/2410.14189
作者: Wenyuan Zhang,Yu-Shen Liu,Zhizhong Han
关键词-EN: neural SDF, based surface reconstruction, Gaussians, multi-view based surface, SDF
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by NeurIPS 2024. Project page: this https URL

点击查看摘要

Abstract:It is vital to infer a signed distance function (SDF) in multi-view based surface reconstruction. 3D Gaussian splatting (3DGS) provides a novel perspective for volume rendering, and shows advantages in rendering efficiency and quality. Although 3DGS provides a promising neural rendering option, it is still hard to infer SDFs for surface reconstruction with 3DGS due to the discreteness, the sparseness, and the off-surface drift of 3D Gaussians. To resolve these issues, we propose a method that seamlessly merges 3DGS with the learning of neural SDFs. Our key idea is to more effectively constrain the SDF inference with multi-view consistency. To this end, we dynamically align 3D Gaussians on the zero-level set of the neural SDF using neural pulling, and then render the aligned 3D Gaussians through differentiable rasterization. Meanwhile, we update the neural SDF by pulling neighboring space to the pulled 3D Gaussians, which progressively refines the signed distance field near the surface. With both differentiable pulling and splatting, we jointly optimize 3D Gaussians and the neural SDF with both RGB and geometry constraints, which recovers more accurate, smooth, and complete surfaces with more geometry details. Our numerical and visual comparisons show our superiority over the state-of-the-art results on the widely used benchmarks.
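
The "pulling to the zero-level set" operation is the familiar projection q = p - s(p) * grad(s)(p) / |grad(s)(p)|, which lands exactly on the surface when s is a true SDF (a learned SDF needs a few iterations). A toy sketch on an analytic sphere SDF, standing in for the neural SDF:

```python
import numpy as np

def sdf(p, r=1.0):
    # Analytic sphere SDF standing in for a learned neural SDF.
    return np.linalg.norm(p, axis=-1) - r

def grad(p, eps=1e-4):
    # Central-difference gradient of the SDF.
    g = np.zeros_like(p)
    for i in range(p.shape[-1]):
        d = np.zeros(p.shape[-1]); d[i] = eps
        g[..., i] = (sdf(p + d) - sdf(p - d)) / (2 * eps)
    return g

def pull_to_surface(p):
    """One pulling step: q = p - s(p) * grad/|grad|. Exact for a true SDF;
    a learned SDF would iterate this projection."""
    g = grad(p)
    g /= np.maximum(np.linalg.norm(g, axis=-1, keepdims=True), 1e-12)
    return p - sdf(p)[..., None] * g

pts = np.array([[0.3, 0.4, 0.0],   # inside the unit sphere
                [2.0, 0.0, 1.0]])  # outside the unit sphere
on_surface = pull_to_surface(pts)
```

In the paper this projection keeps the 3D Gaussians glued to the zero-level set during joint optimization, so splatting and SDF learning constrain each other.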

[CV-40] MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems

链接: https://arxiv.org/abs/2410.14179
作者: Zifeng Zhu,Mengzhao Jia,Zhihan Zhang,Lang Li,Meng Jiang
关键词-EN: Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, Large Language
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: 18 pages, 9 figures

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated impressive abilities across various tasks, including visual question answering and chart comprehension, yet existing benchmarks for chart-related tasks fall short in capturing the complexity of real-world multi-chart scenarios. Current benchmarks primarily focus on single-chart tasks, neglecting the multi-hop reasoning required to extract and integrate information from multiple charts, which is essential in practical applications. To fill this gap, we introduce MultiChartQA, a benchmark that evaluates MLLMs’ capabilities in four key areas: direct question answering, parallel question answering, comparative reasoning, and sequential reasoning. Our evaluation of a wide range of MLLMs reveals significant performance gaps compared to humans. These results highlight the challenges in multi-chart comprehension and the potential of MultiChartQA to drive advancements in this field. Our code and data are available at this https URL

[CV-41] Feature Augmentation based Test-Time Adaptation

链接: https://arxiv.org/abs/2410.14178
作者: Younggeol Cho,Youngrae Kim,Junho Yoon,Seunghoon Hong,Dongman Lee
关键词-EN: Test-time adaptation, unseen domain, domain without accessing, accessing the source, based Test-time Adaptation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages

点击查看摘要

Abstract:Test-time adaptation (TTA) allows a model to be adapted to an unseen domain without accessing the source data. Due to the nature of practical environments, TTA has a limited amount of data for adaptation. Recent TTA methods further restrict this by filtering input data for reliability, making the effective data size even smaller and limiting adaptation potential. To address this issue, we propose Feature Augmentation based Test-time Adaptation (FATA), a simple method that fully utilizes the limited amount of input data through feature augmentation. FATA employs Normalization Perturbation to augment features and adapts the model using the FATA loss, which makes the outputs of the augmented and original features similar. FATA is model-agnostic and can be seamlessly integrated into existing models without altering the model architecture. We demonstrate the effectiveness of FATA on various models and scenarios on ImageNet-C and Office-Home, validating its superiority in diverse real-world conditions.
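
One common way to realise a normalization-perturbation style feature augmentation is to jitter per-channel statistics with multiplicative noise. The sketch below is one such scheme under assumed choices (the strength `alpha` and the exact noise form are illustrative, not necessarily FATA's formulation):

```python
import numpy as np

def normalization_perturbation(feat, alpha=0.1, rng=None):
    """Augment a (batch, channels) feature matrix by jittering its
    per-channel mean and standard deviation with multiplicative noise."""
    if rng is None:
        rng = np.random.default_rng(0)
    mu = feat.mean(axis=0, keepdims=True)
    sigma = feat.std(axis=0, keepdims=True) + 1e-6
    noise_mu = 1 + alpha * rng.standard_normal(mu.shape)
    noise_sigma = 1 + alpha * rng.standard_normal(sigma.shape)
    normalized = (feat - mu) / sigma
    # Re-scale and re-shift with perturbed statistics.
    return normalized * (sigma * noise_sigma) + mu * noise_mu

features = np.random.default_rng(1).normal(size=(16, 8))
augmented = normalization_perturbation(features)
```

The adaptation loss then encourages the model's outputs on `features` and `augmented` to agree, squeezing more supervision out of each test batch.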

[CV-42] Learning autonomous driving from aerial imagery IROS2024

链接: https://arxiv.org/abs/2410.14177
作者: Varun Murali,Guy Rosman,Sertac Karaman,Daniela Rus
关键词-EN: ground vehicles solely, Neural Radiance Field, aerial imagery, perception to control, solely from aerial
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: Presented at IROS 2024

点击查看摘要

Abstract:In this work, we consider the problem of learning end to end perception to control for ground vehicles solely from aerial imagery. Photogrammetric simulators allow the synthesis of novel views through the transformation of pre-generated assets into novel views. However, they have a large setup cost, require careful collection of data and often human effort to create usable simulators. We use a Neural Radiance Field (NeRF) as an intermediate representation to synthesize novel views from the point of view of a ground vehicle. These novel viewpoints can then be used for several downstream autonomous navigation applications. In this work, we demonstrate the utility of novel view synthesis through the application of training a policy for end to end learning from images and depth data. In a traditional real to sim to real framework, the collected data would be transformed into a visual simulator which could then be used to generate novel views. In contrast, using a NeRF allows a compact representation and the ability to optimize over the parameters of the visual simulator as more data is gathered in the environment. We demonstrate the efficacy of our method in a custom built mini-city environment through the deployment of imitation policies on robotic cars. We additionally consider the task of place localization and demonstrate that our method is able to relocalize the car in the real world.

[CV-43] DaRePlane: Direction-aware Representations for Dynamic Scene Reconstruction

链接: https://arxiv.org/abs/2410.14169
作者: Ange Lou,Benjamin Planche,Zhongpai Gao,Yamin Li,Tianyu Luan,Hao Ding,Meng Zheng,Terrence Chen,Ziyan Wu,Jack Noble
关键词-EN: Numerous recent approaches, addressing slow training, neural radiance fields, slow training times, Numerous recent
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: arXiv admin note: substantial text overlap with arXiv:2403.02265

点击查看摘要

Abstract:Numerous recent approaches to modeling and re-rendering dynamic scenes leverage plane-based explicit representations, addressing slow training times associated with models like neural radiance fields (NeRF) and Gaussian splatting (GS). However, merely decomposing 4D dynamic scenes into multiple 2D plane-based representations is insufficient for high-fidelity re-rendering of scenes with complex motions. In response, we present DaRePlane, a novel direction-aware representation approach that captures scene dynamics from six different directions. This learned representation undergoes an inverse dual-tree complex wavelet transformation (DTCWT) to recover plane-based information. Within NeRF pipelines, DaRePlane computes features for each space-time point by fusing vectors from these recovered planes, which are then passed to a tiny MLP for color regression. When applied to Gaussian splatting, DaRePlane computes the features of Gaussian points, followed by a tiny multi-head MLP for spatial-time deformation prediction. Notably, to address redundancy introduced by the six real and six imaginary direction-aware wavelet coefficients, we introduce a trainable masking approach, mitigating storage issues without significant performance decline. To demonstrate the generality and efficiency of DaRePlane, we test it on both regular and surgical dynamic scenes, for both NeRF and GS systems. Extensive experiments show that DaRePlane yields state-of-the-art performance in novel view synthesis for various complex dynamic scenes.

[CV-44] Optimal DLT-based Solutions for the Perspective-n-Point

链接: https://arxiv.org/abs/2410.14164
作者: Sébastien Henry,John A. Christian
关键词-EN: modified normalized direct, direct linear transform, normalized direct linear, algorithm for solving, propose a modified
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: 8 pages, 6 figures, 2 tables

点击查看摘要

Abstract:We propose a modified normalized direct linear transform (DLT) algorithm for solving the perspective-n-point (PnP) problem with much better behavior than the conventional DLT. The modification consists of analytically weighting the different measurements in the linear system with a negligible increase in computational load. Our approach exhibits clear improvements – in both performance and runtime – when compared to popular methods such as EPnP, CPnP, RPnP, and OPnP. Our new non-iterative solution approaches that of the true optimal found via Gauss-Newton optimization, but at a fraction of the computational cost. Our optimal DLT (oDLT) implementation, as well as the experiments, are released in open source.
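The "normalized" in normalized DLT refers to Hartley-style preconditioning of the point measurements before building the linear system: translate the points to their centroid and scale them so the mean distance to the origin is sqrt(2). A minimal sketch of that standard step follows; the paper's actual contribution (analytic per-measurement weighting of the linear system) is not reproduced here.

```python
import math

def hartley_normalize(points):
    """Hartley normalization of 2-D points: subtract the centroid and
    rescale so the mean distance from the origin becomes sqrt(2). This
    conditioning step is what makes the 'normalized' DLT well-behaved."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    mean_dist = sum(math.hypot(p[0] - cx, p[1] - cy) for p in points) / n
    s = math.sqrt(2) / mean_dist
    return [((p[0] - cx) * s, (p[1] - cy) * s) for p in points]
```

In a full DLT pipeline the same similarity transform is later undone on the recovered projection matrix.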

[CV-45] Unlabeled Action Quality Assessment Based on Multi-dimensional Adaptive Constrained Dynamic Time Warping

链接: https://arxiv.org/abs/2410.14161
作者: Renguang Chen,Guolong Zheng,Xu Yang,Zhide Chen,Jiwu Shu,Wencheng Yang,Kexin Zhu,Chen Feng
关键词-EN: online exercise executions, necessitates effective methods, exercise necessitates effective, action quality assessment, popularity of online
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The growing popularity of online sports and exercise necessitates effective methods for evaluating the quality of online exercise executions. Previous action quality assessment methods, which relied on labeled scores from motion videos, exhibited slightly lower accuracy and discriminability. This limitation hindered their rapid application to newly added exercises. To address this problem, this paper presents an unlabeled Multi-Dimensional Exercise Distance Adaptive Constrained Dynamic Time Warping (MED-ACDTW) method for action quality assessment. Our approach uses an athletic version of DTW to compare features from template and test videos, eliminating the need for score labels during training. The result shows that utilizing both 2D and 3D spatial dimensions, along with multiple human body features, improves the accuracy by 2-3% compared to using either 2D or 3D pose estimation alone. Additionally, employing MED for score calculation enhances the precision of frame distance matching, which significantly boosts overall discriminability. The adaptive constraint scheme enhances the discriminability of action quality assessment by approximately 30%. Furthermore, to address the absence of a standardized perspective in sports class evaluations, we introduce a new dataset called BGym.
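At the heart of any DTW-based comparison is a banded dynamic program over a frame-distance matrix. Below is a minimal, hedged sketch with a fixed Sakoe-Chiba band and a 1-D absolute-difference cost; the paper's multi-dimensional exercise distance (MED) and adaptive constraint scheme are not reproduced.

```python
def constrained_dtw(a, b, window):
    """DTW distance between 1-D sequences a and b, restricted to a
    Sakoe-Chiba band of half-width `window` around the diagonal."""
    n, m = len(a), len(b)
    window = max(window, abs(n - m))  # band must at least cover the diagonal
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # best of match, insertion, deletion
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

Tightening or adapting the band is what trades off flexibility in temporal alignment against discriminability, which is the knob the adaptive constraint scheme tunes.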

[CV-46] Assessing Open-world Forgetting in Generative Image Model Customization

链接: https://arxiv.org/abs/2410.14159
作者: Héctor Laria,Alex Gomez-Villa,Imad Eddine Marouf,Kai Wang,Bogdan Raducanu,Joost van de Weijer
关键词-EN: significantly enhanced image, enhanced image generation, Recent advances, image generation capabilities, significantly enhanced
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: Project page: this https URL

点击查看摘要

Abstract:Recent advances in diffusion models have significantly enhanced image generation capabilities. However, customizing these models with new classes often leads to unintended consequences that compromise their reliability. We introduce the concept of open-world forgetting to emphasize the vast scope of these unintended alterations, contrasting it with the well-studied closed-world forgetting, which is measurable by evaluating performance on a limited set of classes or skills. Our research presents the first comprehensive investigation into open-world forgetting in diffusion models, focusing on semantic and appearance drift of representations. We utilize zero-shot classification to analyze semantic drift, revealing that even minor model adaptations lead to unpredictable shifts affecting areas far beyond newly introduced concepts, with dramatic drops in zero-shot classification of up to 60%. Additionally, we observe significant changes in texture and color of generated content when analyzing appearance drift. To address these issues, we propose a mitigation strategy based on functional regularization, designed to preserve original capabilities while accommodating new concepts. Our study aims to raise awareness of unintended changes due to model customization and advocates for the analysis of open-world forgetting in future research on model customization and finetuning methods. Furthermore, we provide insights for developing more robust adaptation methodologies.

[CV-47] Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment

链接: https://arxiv.org/abs/2410.14148
作者: Chenhang Cui,An Zhang,Yiyang Zhou,Zhaorun Chen,Gelei Deng,Huaxiu Yao,Tat-Seng Chua
关键词-EN: large language models, enhancing the interaction, linguistic modalities, large language, recent advancements
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: 23 pages

点击查看摘要

Abstract:The recent advancements in large language models (LLMs) and pre-trained vision models have accelerated the development of vision-language large models (VLLMs), enhancing the interaction between visual and linguistic modalities. Despite their notable success across various domains, VLLMs face challenges in modality alignment, which can lead to issues like hallucinations and unsafe content generation. Current alignment techniques often rely on coarse feedback and external datasets, limiting scalability and performance. In this paper, we propose FiSAO (Fine-Grained Self-Alignment Optimization), a novel self-alignment method that utilizes the model’s own visual encoder as a fine-grained verifier to improve vision-language alignment without the need for additional data. By leveraging token-level feedback from the vision encoder, FiSAO significantly improves vision-language alignment, even surpassing traditional preference tuning methods that require additional data. Through both theoretical analysis and experimental validation, we demonstrate that FiSAO effectively addresses the misalignment problem in VLLMs, marking the first instance of token-level rewards being applied to such models.

[CV-48] Preview-based Category Contrastive Learning for Knowledge Distillation

链接: https://arxiv.org/abs/2410.14143
作者: Muhe Ding,Jianlong Wu,Xue Dong,Xiaojie Li,Pengda Qin,Tian Gan,Liqiang Nie
关键词-EN: model compression, larger model, smaller model, mainstream algorithm, compression by transferring
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 14 pages, 8 figures, Journal

点击查看摘要

Abstract:Knowledge distillation is a mainstream algorithm in model compression by transferring knowledge from the larger model (teacher) to the smaller model (student) to improve the performance of student. Despite many efforts, existing methods mainly investigate the consistency between instance-level feature representation or prediction, which neglects the category-level information and the difficulty of each sample, leading to undesirable performance. To address these issues, we propose a novel preview-based category contrastive learning method for knowledge distillation (PCKD). It first distills the structural knowledge of both instance-level feature correspondence and the relation between instance features and category centers in a contrastive learning fashion, which can explicitly optimize the category representation and explore the distinct correlation between representations of instances and categories, contributing to discriminative category centers and better classification results. Besides, we introduce a novel preview strategy to dynamically determine how much the student should learn from each sample according to their difficulty. Different from existing methods that treat all samples equally and curriculum learning that simply filters out hard samples, our method assigns a small weight for hard instances as a preview to better guide the student training. Extensive experiments on several challenging datasets, including CIFAR-100 and ImageNet, demonstrate the superiority over state-of-the-art methods.
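The preview strategy boils down to down-weighting hard samples instead of filtering them out, as curriculum learning would. A hedged toy version follows; the exponential form and the temperature parameter are illustrative assumptions, not the paper's exact weighting scheme.

```python
import math

def preview_weights(losses, temperature=1.0):
    """Map per-sample difficulty (here proxied by loss) to training
    weights: easy samples (low loss) get weight near 1, hard samples
    get a small but nonzero weight, so they still guide the student."""
    return [math.exp(-l / temperature) for l in losses]
```

Because the weights stay positive, hard instances are previewed rather than discarded.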

[CV-49] ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom

链接: https://arxiv.org/abs/2410.14138
作者: Jingqi Zhou,Sheng Wang,Jingwei Dong,Lei Li,Jiahui Gao,Lingpeng Kong,Chuan Wu
关键词-EN: witnessed significant progress, visual understanding tasks, visual, witnessed significant, significant progress
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large vision-language models (LVLMs) have witnessed significant progress on visual understanding tasks. However, they often prioritize language knowledge over image information on visual reasoning tasks, incurring performance degradation. To tackle this issue, we first identify the drawbacks of existing solutions (i.e., insufficient and irrelevant visual descriptions, and limited multi-modal capacities). We then decompose visual reasoning process into two stages: visual perception (i.e., eyesight) and textual reasoning (i.e., wisdom), and introduce a novel visual reasoning framework named ProReason. This framework features multi-run proactive perception and decoupled vision-reasoning capabilities. Briefly, given a multi-modal question, ProReason iterates proactive information collection and reasoning until the answer can be concluded with necessary and sufficient visual descriptions. Notably, the disassociation of capabilities allows seamless integration of existing large language models (LLMs) to compensate for the reasoning deficits of LVLMs. Our extensive experiments demonstrate that ProReason outperforms both existing multi-step reasoning frameworks and passive peer methods on a wide range of benchmarks for both open-source and closed-source models. In addition, with the assistance of LLMs, ProReason achieves a performance improvement of up to 15% on MMMU benchmark. Our insights into existing solutions and the decoupled perspective for feasible integration of LLMs illuminate future research on visual reasoning techniques, especially LLM-assisted ones.

[CV-50] ViConsFormer: Constituting Meaningful Phrases of Scene Texts using Transformer-based Method in Vietnamese Text-based Visual Question Answering

链接: https://arxiv.org/abs/2410.14132
作者: Nghia Hieu Nguyen,Tho Thanh Quan,Ngan Luu-Thuy Nguyen
关键词-EN: Text-based VQA, Vietnamese Text-based VQA, Text-based VQA datasets, scene texts, challenging task
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Text-based VQA is a challenging task that requires machines to use scene texts in given images to yield the most appropriate answer for the given question. The main challenge of text-based VQA is exploiting the meaning and information from scene texts. Recent studies tackled this challenge by considering the spatial information of scene texts in images via embedding 2D coordinates of their bounding boxes. In this study, we follow the definition of meaning from linguistics to introduce a novel method that effectively exploits the information from scene texts written in Vietnamese. Experimental results show that our proposed method obtains state-of-the-art results on two large-scale Vietnamese Text-based VQA datasets. The implementation can be found at this link.

[CV-51] Extreme Precipitation Nowcasting using Multi-Task Latent Diffusion Models

链接: https://arxiv.org/abs/2410.14103
作者: Li Chaorong,Ling Xudong,Yang Qiang,Qin Fengqing,Huang Yuanyuan
关键词-EN: Deep learning models, made remarkable strides, Deep learning, high precipitation intensity, precipitation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 12 pages, 6figures

点击查看摘要

Abstract:Deep learning models have made remarkable strides in precipitation prediction, yet they continue to struggle with capturing the spatial details of the features of radar images, particularly over high precipitation intensity areas. This shortcoming is evident in the form of low forecast accuracy in the spatial positioning of radar echo images across varying precipitation intensity regions. To address this challenge, we introduce the multi-task latent diffusion model (MTLDM), a novel approach for precipitation prediction. The basic concept of the MTLDM is based on the understanding that the radar image representing precipitation is the result of multiple factors. Therefore, we adopt a divide-and-conquer approach, that is, we decompose the radar image using decomposition technology and then predict the decomposed sub-images separately. We conceptualize the precipitation image as a composition of various components corresponding to different precipitation intensities. The MTLDM decomposes the precipitation image into these distinct components and employs a dedicated task to predict each one. This method enables spatiotemporally consistent prediction of real-world precipitation areas up to 5-80 min in advance, outperforming existing state-of-the-art techniques across multiple evaluation metrics.
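The divide-and-conquer idea — split the precipitation image into intensity-band components and predict each with a dedicated task — can be illustrated with a toy lossless decomposition. The thresholds and the banding rule below are illustrative assumptions; the paper's actual decomposition technology may differ.

```python
def decompose_by_intensity(image, thresholds):
    """Split a 2-D image into per-intensity-band components that sum
    back to the original, a toy version of MTLDM's decomposition.
    `thresholds` are the band edges; each pixel lands in exactly one
    band (lo, hi], so the decomposition is lossless."""
    bands = list(zip([float("-inf")] + thresholds,
                     thresholds + [float("inf")]))
    components = []
    for lo, hi in bands:
        components.append([[v if lo < v <= hi else 0 for v in row]
                           for row in image])
    return components
```

Each component can then be handed to its own prediction task and the forecasts recomposed by summation.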

[CV-52] Enhancing In-vehicle Multiple Object Tracking Systems with Embeddable Ising Machines

链接: https://arxiv.org/abs/2410.14093
作者: Kosuke Tatsumura,Yohei Hamakawa,Masaya Yamasaki,Koji Oya,Hiroshi Fujimoto
关键词-EN: autonomous mobile vehicles, comprises object detection, temporal association, needed in autonomous, mobile vehicles
类目: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Systems and Control (eess.SY)
*备注: 18 pages, 7 figures, 2 tables

点击查看摘要

Abstract:A cognitive function of tracking multiple objects, needed in autonomous mobile vehicles, comprises object detection and their temporal association. While great progress owing to machine learning has recently been seen in elaborating the similarity matrix between the objects that have been recognized and the objects detected in a current video frame, less attention has been paid to the assignment problem that finally determines the temporal association, which is a combinatorial optimization problem. Here we show an in-vehicle multiple object tracking system with a flexible assignment function for tracking through multiple long-term occlusion events. To solve the flexible assignment problem formulated as a nondeterministic polynomial time-hard problem, the system relies on an embeddable Ising machine based on a quantum-inspired algorithm called simulated bifurcation. Using a vehicle-mountable computing platform, we demonstrate a real-time system-wide throughput (23 frames per second on average) with the enhanced functionality.
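The temporal-association step described here is, at its core, an assignment problem. A brute-force reference solver makes the combinatorics concrete; exhaustive search is only feasible for a handful of objects, which is exactly why the paper offloads a more flexible formulation onto an embeddable Ising machine (not shown here).

```python
from itertools import permutations

def brute_force_assignment(cost):
    """Minimum-cost one-to-one assignment of n tracks to n detections
    by exhaustive search over all n! permutations. A reference baseline
    only; real trackers use Hungarian-style solvers or, as in this
    paper, a hardware Ising machine for the flexible variant."""
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_cost, best_perm = total, perm
    return best_perm, best_cost
```

Here `cost[i][j]` would be derived from the learned similarity matrix between track i and detection j.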

[CV-53] MMAD-Purify: A Precision-Optimized Framework for Efficient and Scalable Multi-Modal Attacks

链接: https://arxiv.org/abs/2410.14089
作者: Xinxin Liu,Zhongliang Guo,Siyuan Huang,Chun Pong Lau
关键词-EN: achieved remarkable performance, pose significant risks, Neural networks, networks have achieved, achieved remarkable
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Neural networks have achieved remarkable performance across a wide range of tasks, yet they remain susceptible to adversarial perturbations, which pose significant risks in safety-critical applications. With the rise of multimodality, diffusion models have emerged as powerful tools not only for generative tasks but also for various applications such as image editing, inpainting, and super-resolution. However, these models still lack robustness due to limited research on attacking them to enhance their resilience. Traditional attack techniques, such as gradient-based adversarial attacks and diffusion model-based methods, are hindered by computational inefficiencies and scalability issues due to their iterative nature. To address these challenges, we introduce an innovative framework that leverages the distilled backbone of diffusion models and incorporates a precision-optimized noise predictor to enhance the effectiveness of our attack framework. This approach not only enhances the attack’s potency but also significantly reduces computational costs. Our framework provides a cutting-edge solution for multi-modal adversarial attacks, ensuring reduced latency and the generation of high-fidelity adversarial examples with superior success rates. Furthermore, we demonstrate that our framework achieves outstanding transferability and robustness against purification defenses, outperforming existing gradient-based attack models in both effectiveness and efficiency.

[CV-54] Your Interest Your Summaries: Query-Focused Long Video Summarization

链接: https://arxiv.org/abs/2410.14087
作者: Nirav Patel,Payal Prajapati,Maitrik Shah
关键词-EN: Generating a concise, varying scene importance, informative video summary, concise and informative, subjective due
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: To appear at the 18th International Conference on Control, Automation, Robotics and Vision (ICARCV), December 2024, Dubai, UAE

点击查看摘要

Abstract:Generating a concise and informative video summary from a long video is important, yet subjective due to varying scene importance. Users’ ability to specify scene importance through text queries enhances the relevance of such summaries. This paper introduces an approach for query-focused video summarization, aiming to align video summaries closely with user queries. To this end, we propose the Fully Convolutional Sequence Network with Attention (FCSNA-QFVS), a novel approach designed for this task. Leveraging temporal convolutional and attention mechanisms, our model effectively extracts and highlights relevant content based on user-specified queries. Experimental validation on a benchmark dataset for query-focused video summarization demonstrates the effectiveness of our approach.

[CV-55] Self Supervised Deep Learning for Robot Grasping

链接: https://arxiv.org/abs/2410.14084
作者: Danyal Saqib,Wajahat Hussain
关键词-EN: Learning Based Robot, Learning Based, Based Robot Grasping, Based Robot, Convolutional Neural Network
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Learning Based Robot Grasping currently involves the use of labeled data. This approach has two major disadvantages. Firstly, labeling data for grasp points and angles is a strenuous process, so the dataset remains limited. Secondly, human labeling is prone to bias due to semantics. In order to solve these problems we propose a simpler self-supervised robotic setup, that will train a Convolutional Neural Network (CNN). The robot will label and collect the data during the training process. The idea is to make a robot that is less costly, small and easily maintainable in a lab setup. The robot will be trained on a large data set for several hundred hours and then the trained Neural Network can be mapped onto a larger grasping robot.

[CV-56] SAMReg: SAM-enabled Image Registration with ROI-based Correspondence

链接: https://arxiv.org/abs/2410.14083
作者: Shiqi Huang,Tingfa Xu,Ziyi Shen,Shaheer Ullah Saeed,Wen Yan,Dean Barratt,Yipeng Hu
关键词-EN: correspondence representation based, spatial correspondence representation, image registration, medical image registration, registration
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper describes a new spatial correspondence representation based on paired regions-of-interest (ROIs), for medical image registration. The distinct properties of the proposed ROI-based correspondence are discussed, in the context of potential benefits in clinical applications following image registration, compared with alternative correspondence-representing approaches, such as those based on sampled displacements and spatial transformation functions. These benefits include a clear connection between learning-based image registration and segmentation, which in turn motivates two cases of image registration approaches using (pre-)trained segmentation networks. Based on the segment anything model (SAM), a vision foundation model for segmentation, we develop a new registration algorithm SAMReg, which does not require any training (or training data), gradient-based fine-tuning or prompt engineering. The proposed SAMReg models are evaluated across five real-world applications, including intra-subject registration tasks with cardiac MR and lung CT, challenging inter-subject registration scenarios with prostate MR and retinal imaging, and an additional evaluation with a non-clinical example with aerial image registration. The proposed methods outperform both intensity-based iterative algorithms and DDF-predicting learning-based networks across tested metrics including Dice and target registration errors on anatomical structures, and further demonstrates competitive performance compared to weakly-supervised registration approaches that rely on fully-segmented training data. Open source code and examples are available at: this https URL.
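Among the metrics reported for SAMReg, Dice overlap between paired ROIs is straightforward to compute. A minimal version for flat binary masks follows; this is an illustration of the metric, not the paper's evaluation code.

```python
def dice_score(mask_a, mask_b):
    """Dice overlap between two binary masks given as flat 0/1 lists:
    2 * |A intersect B| / (|A| + |B|). Returns 1.0 for two empty masks."""
    inter = sum(a and b for a, b in zip(mask_a, mask_b))
    total = sum(mask_a) + sum(mask_b)
    return 2.0 * inter / total if total else 1.0
```

For ROI-based correspondence, the score would be computed per paired region after warping one mask into the other image's space.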

[CV-57] Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers

链接: https://arxiv.org/abs/2410.14072
作者: Yuxin Wen,Qingqing Cao,Qichen Fu,Sachin Mehta,Mahyar Najibi
关键词-EN: perform complex reasoning, Recent advancements, visual tokens, tokens, real-world applications
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Recent advancements in vision-language models (VLMs) have expanded their potential for real-world applications, enabling these models to perform complex reasoning on images. In the widely used fully autoregressive transformer-based models like LLaVA, projected visual tokens are prepended to textual tokens. Oftentimes, visual tokens significantly outnumber prompt tokens, resulting in increased computational overhead during both training and inference. In this paper, we propose Visual Compact Token Registers (Victor), a method that reduces the number of visual tokens by summarizing them into a smaller set of register tokens. Victor adds a few learnable register tokens after the visual tokens and summarizes the visual information into these registers using the first few layers in the language tower of VLMs. After these few layers, all visual tokens are discarded, significantly improving computational efficiency for both training and inference. Notably, our method is easy to implement and requires a small number of new trainable parameters with minimal impact on model performance. In our experiment, with merely 8 visual registers–about 1% of the original tokens–Victor shows less than a 4% accuracy drop while reducing the total training time by 43% and boosting the inference throughput by 3.3X.
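The summarization step — a small set of register tokens absorbing the information of many visual tokens — can be illustrated with single-head dot-product attention, registers acting as queries. This is a dependency-free toy sketch of the idea; the actual method performs this inside the first few language-tower layers of the VLM with learned projections.

```python
import math

def summarize_tokens(visual_tokens, registers):
    """Summarize N visual token vectors into K register vectors via
    softmax dot-product attention (registers as queries, tokens as
    keys/values). After this step the N visual tokens could be dropped."""
    summarized = []
    for q in registers:
        scores = [sum(qi * ti for qi, ti in zip(q, t)) for t in visual_tokens]
        mx = max(scores)                       # stabilize softmax
        weights = [math.exp(s - mx) for s in scores]
        z = sum(weights)
        weights = [w / z for w in weights]
        # weighted sum of visual tokens -> one register summary
        summarized.append([
            sum(w * t[d] for w, t in zip(weights, visual_tokens))
            for d in range(len(visual_tokens[0]))
        ])
    return summarized
```

With K much smaller than N (8 registers vs. hundreds of tokens), every subsequent layer pays attention cost over K tokens instead of N.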

[CV-58] FaceSaliencyAug: Mitigating Geographic Gender and Stereotypical Biases via Saliency-Based Data Augmentation

链接: https://arxiv.org/abs/2410.14070
作者: Teerath Kumar,Alessandra Mileo,Malika Bendechache
关键词-EN: pose significant challenges, Convolutional Neural Networks, models pose significant, vision models pose, computer vision models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted at Image Signal and Video processing

点击查看摘要

Abstract:Geographical, gender and stereotypical biases in computer vision models pose significant challenges to their performance and fairness. In this study, we present an approach named FaceSaliencyAug aimed at addressing the gender bias in Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). Leveraging the salient regions of faces detected by saliency, the proposed approach mitigates geographical and stereotypical biases in the datasets. FaceSaliencyAug randomly selects masks from a predefined search space and applies them to the salient region of face images, subsequently restoring the original image with masked salient region. The proposed augmentation strategy enhances data diversity, thereby improving model performance and debiasing effects. We quantify dataset diversity using Image Similarity Score (ISS) across six datasets: Flickr Faces HQ (FFHQ), WIKI, IMDB, Labelled Faces in the Wild (LFW), UTK Faces, and Diverse Dataset. The proposed approach demonstrates superior diversity metrics, as evaluated by ISS-intra and ISS-inter algorithms. Furthermore, we evaluate the effectiveness of our approach in mitigating gender bias on CEO, Engineer, Nurse, and School Teacher datasets. We use the Image-Image Association Score (IIAS) to measure gender bias in these occupations. Our experiments reveal a reduction in gender bias for both CNNs and ViTs, indicating the efficacy of our method in promoting fairness and inclusivity in computer vision models.

[CV-59] On Partial Prototype Collapse in the DINO Family of Self-Supervised Methods BMVC2024

链接: https://arxiv.org/abs/2410.14060
作者: Hariprasath Govindarajan,Per Sidén,Jacob Roll,Fredrik Lindsten
关键词-EN: prominent self-supervised learning, self-supervised learning paradigm, mixture model, mixture model simultaneously, prominent self-supervised
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: First version of the paper appeared in OpenReview on 22 Sep 2023. Accepted to BMVC 2024

点击查看摘要

Abstract:A prominent self-supervised learning paradigm is to model the representations as clusters, or more generally as a mixture model. Learning to map the data samples to compact representations and fitting the mixture model simultaneously leads to the representation collapse problem. Regularizing the distribution of data points over the clusters is the prevalent strategy to avoid this issue. While this is sufficient to prevent full representation collapse, we show that a partial prototype collapse problem still exists in the DINO family of methods, that leads to significant redundancies in the prototypes. Such prototype redundancies serve as shortcuts for the method to achieve a marginal latent class distribution that matches the prescribed prior. We show that by encouraging the model to use diverse prototypes, the partial prototype collapse can be mitigated. Effective utilization of the prototypes enables the methods to learn more fine-grained clusters, encouraging more informative representations. We demonstrate that this is especially beneficial when pre-training on a long-tailed fine-grained dataset.
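Partial prototype collapse manifests as near-duplicate prototype vectors. A simple diagnostic — counting prototype pairs whose cosine similarity exceeds a threshold — makes the redundancy measurable. This is an illustrative check, not the paper's analysis protocol; the 0.99 threshold is an assumption.

```python
import math

def redundant_prototype_pairs(prototypes, threshold=0.99):
    """Count pairs of prototype vectors whose cosine similarity exceeds
    `threshold` -- a crude proxy for the redundant, near-duplicate
    prototypes that characterize partial prototype collapse."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)
    pairs = 0
    for i in range(len(prototypes)):
        for j in range(i + 1, len(prototypes)):
            if cosine(prototypes[i], prototypes[j]) > threshold:
                pairs += 1
    return pairs
```

A healthy prototype set should drive this count toward zero, which is what encouraging diverse prototypes aims for.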

[CV-60] Learning Multimodal Cues of Childrens Uncertainty SIGDIAL2023

链接: https://arxiv.org/abs/2410.14050
作者: Qi Cheng,Mert İnan,Rahma Mbarki,Grace Grmek,Theresa Choi,Yiming Sun,Kimele Persaud,Jenny Wang,Malihe Alikhani
关键词-EN: achieving common ground, common ground, Understanding uncertainty plays, plays a critical, achieving common
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
*备注: SIGDIAL 2023

点击查看摘要

Abstract:Understanding uncertainty plays a critical role in achieving common ground (Clark et al., 1983). This is especially important for multimodal AI systems that collaborate with users to solve a problem or guide the user through a challenging concept. In this work, for the first time, we present a dataset annotated in collaboration with developmental and cognitive psychologists for the purpose of studying nonverbal cues of uncertainty. We then present an analysis of the data, studying different roles of uncertainty and its relationship with task difficulty and performance. Lastly, we present a multimodal machine learning model that can predict uncertainty given a real-time video clip of a participant, which we find improves upon a baseline multimodal transformer model. This work informs research on cognitive coordination between human-human and human-AI and has broad implications for gesture understanding and generation. The anonymized version of our data and code will be publicly available upon the completion of the required consent forms and data sheets.

[CV-61] Human Action Anticipation: A Survey

链接: https://arxiv.org/abs/2410.14045
作者: Bolin Lai,Sam Toyer,Tushar Nagarajan,Rohit Girdhar,Shengxin Zha,James M. Rehg,Kris Kitani,Kristen Grauman,Ruta Desai,Miao Liu
关键词-EN: Predicting future human, increasingly popular topic, computer vision, autonomous vehicles, digital assistants
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 30 pages, 9 figures, 12 tables

点击查看摘要

Abstract:Predicting future human behavior is an increasingly popular topic in computer vision, driven by the interest in applications such as autonomous vehicles, digital assistants and human-robot interactions. The literature on behavior prediction spans various tasks, including action anticipation, activity forecasting, intent prediction, goal prediction, and so on. Our survey aims to tie together this fragmented literature, covering recent technical innovations as well as the development of new large-scale datasets for model training and evaluation. We also summarize the widely-used metrics for different tasks and provide a comprehensive performance comparison of existing approaches on eleven action anticipation datasets. This survey serves as not only a reference for contemporary methodologies in action anticipation, but also a guideline for future research direction of this evolving landscape.

[CV-62] Probabilistic U-Net with Kendall Shape Spaces for Geometry-Aware Segmentations of Images

链接: https://arxiv.org/abs/2410.14017
作者: Jiyoung Park,Günay Doğan
关键词-EN: detecting distinct regions, probabilistic image segmentation, probabilistic, fundamental problems, problems in computer
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: 22 pages, 13 figures

点击查看摘要

Abstract:One of the fundamental problems in computer vision is image segmentation, the task of detecting distinct regions or objects in given images. Deep Neural Networks (DNN) have been shown to be very effective in segmenting challenging images, producing convincing segmentations. There is further need for probabilistic DNNs that can reflect the uncertainties from the input images and the models into the computed segmentations, in other words, new DNNs that can generate multiple plausible segmentations and their distributions depending on the input or the model uncertainties. While there are existing probabilistic segmentation models, many of them do not take into account the geometry or shape underlying the segmented regions. In this paper, we propose a probabilistic image segmentation model that can incorporate the geometry of a segmentation. Our proposed model builds on the Probabilistic U-Net of Kohl et al. (2018) to generate probabilistic segmentations, i.e., multiple likely segmentations for an input image. Our model also adopts the Kendall Shape Variational Auto-Encoder of Vadgama et al. (2023) to encode a Kendall shape space in the latent variable layers of the prior and posterior networks of the Probabilistic U-Net. Incorporating the shape space in this manner leads to a more robust segmentation with spatially coherent regions, respecting the underlying geometry in the input images.

[CV-63] Reproducibility study of “LICO: Explainable Models with Language-Image Consistency”

链接: https://arxiv.org/abs/2410.13989
作者: Luan Fletcher,Robert van der Klis,Martin Sedláček,Stefan Vasilev,Christos Athanasiadis
关键词-EN: growing reproducibility crisis, crisis in machine, brought forward, careful examination, machine learning
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 2 figures, Machine Learning Reproducibility Challenge 2024

点击查看摘要

Abstract:The growing reproducibility crisis in machine learning has brought forward a need for careful examination of research findings. This paper investigates the claims made by Lei et al. (2023) regarding their proposed method, LICO, for enhancing post-hoc interpretability techniques and improving image classification performance. LICO leverages natural language supervision from a vision-language model to enrich feature representations and guide the learning process. We conduct a comprehensive reproducibility study, employing (Wide) ResNets and established interpretability methods like Grad-CAM and RISE. We were mostly unable to reproduce the authors’ results. In particular, we did not find that LICO consistently led to improved classification performance or improvements in quantitative and qualitative measures of interpretability. Thus, our findings highlight the importance of rigorous evaluation and transparent reporting in interpretability research.

[CV-64] Debiasing Large Vision-Language Models by Ablating Protected Attribute Representations NEURIPS

链接: https://arxiv.org/abs/2410.13976
作者: Neale Ratzlaff,Matthew Lyle Olson,Musashi Hinck,Shao-Yen Tseng,Vasudev Lal,Phillip Howard
关键词-EN: Large Vision Language, Vision Language Models, Large Vision, Vision Language, demonstrated impressive capabilities
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: NeurIPS workshop on SafeGenAI, 10 pages, 2 figures

点击查看摘要

Abstract:Large Vision Language Models (LVLMs) such as LLaVA have demonstrated impressive capabilities as general-purpose chatbots that can engage in conversations about a provided input image. However, their responses are influenced by societal biases present in their training datasets, leading to undesirable differences in how the model responds when presented with images depicting people of different demographics. In this work, we propose a novel debiasing framework for LVLMs by directly ablating biased attributes during text generation to avoid generating text related to protected attributes, or even representing them internally. Our method requires no training and a relatively small amount of representative biased outputs (~1000 samples). Our experiments show that not only can we can minimize the propensity of LVLMs to generate text related to protected attributes, but we can even use synthetic data to inform the ablation while retaining captioning performance on real data such as COCO. Furthermore, we find the resulting generations from a debiased LVLM exhibit similar accuracy as a baseline biased model, showing that debiasing effects can be achieved without sacrificing model performance.
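A minimal sketch of attribute ablation, under the assumption (not stated in the abstract) that the protected-attribute direction is estimated as a difference of group means over representative biased outputs and removed from hidden states by orthogonal projection; the vectors and groups below are toy data.

```python
def mean(vectors):
    d = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(d)]

def ablate(h, direction):
    """Remove the component of hidden state h along the bias direction
    (projection ablation; one plausible reading of attribute ablation)."""
    norm_sq = sum(d * d for d in direction)
    coef = sum(a * b for a, b in zip(h, direction)) / norm_sq
    return [a - coef * d for a, d in zip(h, direction)]

# Toy: bias direction estimated as the difference of two group means.
group_a = [[1.0, 1.0], [1.2, 0.8]]
group_b = [[-1.0, 1.0], [-0.8, 1.2]]
ma, mb = mean(group_a), mean(group_b)
direction = [x - y for x, y in zip(ma, mb)]

h = [2.0, 3.0]
h_clean = ablate(h, direction)
# After ablation, h_clean has no component along the bias direction.
print(sum(a * b for a, b in zip(h_clean, direction)))  # ~ 0.0
```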

[CV-65] Satellite Streaming Video QoE Prediction: A Real-World Subjective Database and Network-Level Prediction Models

链接: https://arxiv.org/abs/2410.13952
作者: Bowen Chen,Zaixi Shang,Jae Won Chung,David Lerner,Werner Robitza,Rakesh Rao Ramachandra Rao,Alexander Raake,Alan C. Bovik
关键词-EN: exhibit unprecedented growth, Internet Service Providers, Service Providers find, continues to exhibit, unprecedented growth
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Demand for streaming services, including satellite, continues to exhibit unprecedented growth. Internet Service Providers find themselves at the crossroads of technological advancements and rising customer expectations. To stay relevant and competitive, these ISPs must ensure their networks deliver optimal video streaming quality, a key determinant of user satisfaction. Towards this end, it is important to have accurate Quality of Experience prediction models in place. However, achieving robust performance by these models requires extensive data sets labeled by subjective opinion scores on videos impaired by diverse playback disruptions. To bridge this data gap, we introduce the LIVE-Viasat Real-World Satellite QoE Database. This database consists of 179 videos recorded from real-world streaming services affected by various authentic distortion patterns. We also conducted a comprehensive subjective study involving 54 participants, who contributed both continuous-time opinion scores and endpoint (retrospective) QoE scores. Our analysis sheds light on various determinants influencing subjective QoE, such as stall events, spatial resolutions, bitrate, and certain network parameters. We demonstrate the usefulness of this unique new resource by evaluating the efficacy of prevalent QoE-prediction models on it. We also created a new model that maps the network parameters to predicted human perception scores, which can be used by ISPs to optimize the video streaming quality of their networks. Our proposed model, which we call SatQA, is able to accurately predict QoE using only network parameters, without any access to pixel data or video-specific metadata, estimated by Spearman’s Rank Order Correlation Coefficient (SROCC), Pearson Linear Correlation Coefficient (PLCC), and Root Mean Squared Error (RMSE), indicating high accuracy and reliability.
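The evaluation metrics named at the end (SROCC, PLCC, RMSE) are standard and can be computed directly. The snippet below is a plain-Python sketch; note the rank function ignores ties, which a proper SROCC implementation handles, and the toy scores are invented.

```python
import math

def pearson(x, y):
    """Pearson Linear Correlation Coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(x):
    # 1-based rank positions; ties not handled (fine for a sketch).
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def srocc(x, y):
    """Spearman's Rank Order Correlation: Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))

def rmse(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

pred = [1.0, 2.0, 3.0, 4.0]   # model-predicted QoE
mos  = [1.5, 2.5, 2.9, 4.2]   # subjective opinion scores
print(round(srocc(pred, mos), 3))  # 1.0: monotonically consistent
print(round(rmse(pred, mos), 3))
```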

[CV-66] ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding

链接: https://arxiv.org/abs/2410.13924
作者: Guangda Ji,Silvan Weder,Francis Engelmann,Marc Pollefeys,Hermann Blum
关键词-EN: neural networks scales, neural networks, dense semantic annotations, dataset, large-scale
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The performance of neural networks scales with both their size and the amount of data they have been trained on. This is shown in both language and image generation. However, this requires scaling-friendly network architectures as well as large-scale datasets. Even though scaling-friendly architectures like transformers have emerged for 3D vision tasks, the GPT-moment of 3D vision remains distant due to the lack of training data. In this paper, we introduce ARKit LabelMaker, the first large-scale, real-world 3D dataset with dense semantic annotations. Specifically, we complement ARKitScenes dataset with dense semantic annotations that are automatically generated at scale. To this end, we extend LabelMaker, a recent automatic annotation pipeline, to serve the needs of large-scale pre-training. This involves extending the pipeline with cutting-edge segmentation models as well as making it robust to the challenges of large-scale processing. Further, we push forward the state-of-the-art performance on ScanNet and ScanNet200 dataset with prevalent 3D semantic segmentation models, demonstrating the efficacy of our generated dataset.

[CV-67] GraspDiffusion: Synthesizing Realistic Whole-body Hand-Object Interaction

链接: https://arxiv.org/abs/2410.13911
作者: Patrick Kwon,Hanbyul Joo
关键词-EN: Recent generative models, generate humans interacting, synthesize high-quality images, Recent generative, synthesize high-quality
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent generative models can synthesize high-quality images but often fail to generate humans interacting with objects using their hands. This arises mostly from the model’s misunderstanding of such interactions, and the hardships of synthesizing intricate regions of the body. In this paper, we propose GraspDiffusion, a novel generative method that creates realistic scenes of human-object interaction. Given a 3D object mesh, GraspDiffusion first constructs life-like whole-body poses with control over the object’s location relative to the human body. This is achieved by separately leveraging the generative priors for 3D body and hand poses, optimizing them into a joint grasping pose. The resulting pose guides the image synthesis to correctly reflect the intended interaction, allowing the creation of realistic and diverse human-object interaction scenes. We demonstrate that GraspDiffusion can successfully tackle the relatively uninvestigated problem of generating full-bodied human-object interactions while outperforming previous methods. Code and models will be available at this https URL

[CV-68] Transformers Utilization in Chart Understanding: A Review of Recent Advances & Future Trends

链接: https://arxiv.org/abs/2410.13883
作者: Mirna Al-Shetairy,Hanan Hindy,Dina Khattab,Mostafa M. Aref
关键词-EN: involving chart interactions, interest in vision-language, chart interactions, involving chart, Chart Understanding
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, interest in vision-language tasks has grown, especially those involving chart interactions. These tasks are inherently multimodal, requiring models to process chart images, accompanying text, underlying data tables, and often user queries. Traditionally, Chart Understanding (CU) relied on heuristics and rule-based systems. However, recent advancements that have integrated transformer architectures significantly improved performance. This paper reviews prominent research in CU, focusing on State-of-The-Art (SoTA) frameworks that employ transformers within End-to-End (E2E) solutions. Relevant benchmarking datasets and evaluation techniques are analyzed. Additionally, this article identifies key challenges and outlines promising future directions for advancing CU solutions. Following the PRISMA guidelines, a comprehensive literature search is conducted across Google Scholar, focusing on publications from Jan’20 to Jun’24. After rigorous screening and quality assessment, 32 studies are selected for in-depth analysis. The CU tasks are categorized into a three-layered paradigm based on the cognitive task required. Recent advancements in the frameworks addressing various CU tasks are also reviewed. Frameworks are categorized into single-task or multi-task based on the number of tasks solvable by the E2E solution. Within multi-task frameworks, pre-trained and prompt-engineering-based techniques are explored. This review overviews leading architectures, datasets, and pre-training tasks. Despite significant progress, challenges remain in OCR dependency, handling low-resolution images, and enhancing visual reasoning. Future directions include addressing these challenges, developing robust benchmarks, and optimizing model efficiency. Additionally, integrating explainable AI techniques and exploring the balance between real and synthetic data are crucial for advancing CU research.

[CV-69] Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model

链接: https://arxiv.org/abs/2410.13882
作者: Long Le,Jason Xie,William Liang,Hung-Ju Wang,Yue Yang,Yecheng Jason Ma,Kyle Vedder,Arjun Krishna,Dinesh Jayaraman,Eric Eaton
关键词-EN: driving immersive experiences, driving immersive, advanced automation, immersive experiences, experiences and advanced
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Interactive 3D simulated objects are crucial in AR/VR, animations, and robotics, driving immersive experiences and advanced automation. However, creating these articulated objects requires extensive human effort and expertise, limiting their broader applications. To overcome this challenge, we present Articulate-Anything, a system that automates the articulation of diverse, complex objects from many input modalities, including text, images, and videos. Articulate-Anything leverages vision-language models (VLMs) to generate code that can be compiled into an interactable digital twin for use in standard 3D simulators. Our system exploits existing 3D asset datasets via a mesh retrieval mechanism, along with an actor-critic system that iteratively proposes, evaluates, and refines solutions for articulating the objects, self-correcting errors to achieve a robust outcome. Qualitative evaluations demonstrate Articulate-Anything’s capability to articulate complex and even ambiguous object affordances by leveraging rich grounded inputs. In extensive quantitative experiments on the standard PartNet-Mobility dataset, Articulate-Anything substantially outperforms prior work, increasing the success rate from 8.7-11.6% to 75% and setting a new bar for state-of-the-art performance. We further showcase the utility of our generated assets by using them to train robotic policies for fine-grained manipulation tasks that go beyond basic pick and place.

[CV-70] Explaining an image classifier with a generative model conditioned by uncertainty

链接: https://arxiv.org/abs/2410.13871
作者: Adrien Le Coz,Stéphane Herbin,Faouzi Adjed
关键词-EN: image classifier uncertainty, explain its behavior, propose to condition, condition a generative, generative model
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:We propose to condition a generative model by a given image classifier uncertainty in order to analyze and explain its behavior. Preliminary experiments on synthetic data and a corrupted version of MNIST dataset illustrate the idea.
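The "classifier uncertainty" used as the conditioning signal can be made concrete. A standard scalar choice is the predictive entropy of the classifier's softmax output; the abstract does not specify its measure, so this is an assumed, illustrative choice.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy of a classifier's softmax output, a common
    scalar uncertainty signal a generative model could be conditioned on."""
    return -sum(p * math.log(p) for p in probs if p > 0)

confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
print(predictive_entropy(confident) < predictive_entropy(uncertain))  # True
print(round(predictive_entropy(uncertain), 3))  # 1.386 == ln(4), the maximum
```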

[CV-71] A Hybrid Feature Fusion Deep Learning Framework for Leukemia Cancer Detection in Microscopic Blood Sample Using Gated Recurrent Unit and Uncertainty Quantification

链接: https://arxiv.org/abs/2410.14536
作者: Maksuda Akter,Rabea Khatun,Md Manowarul Islam
关键词-EN: Acute lymphoblastic leukemia, Acute lymphoblastic, adults and children, malignant form, common cancer
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Acute lymphoblastic leukemia (ALL) is the most malignant form of leukemia and the most common cancer in adults and children. Traditionally, leukemia is diagnosed by analyzing blood and bone marrow smears under a microscope, with additional cytochemical tests for confirmation. However, these methods are expensive, time consuming, and highly dependent on expert knowledge. In recent years, deep learning, particularly Convolutional Neural Networks (CNNs), has provided advanced methods for classifying microscopic smear images, aiding in the detection of leukemic cells. These approaches are quick, cost effective, and not subject to human bias. However, most methods lack the ability to quantify uncertainty, which could lead to critical misdiagnoses. In this research, hybrid deep learning models (InceptionV3-GRU, EfficientNetB3-GRU, MobileNetV2-GRU) were implemented to classify ALL. Bayesian optimization was used to fine tune the model’s hyperparameters and improve its performance. Additionally, Deep Ensemble uncertainty quantification was applied to address uncertainty during leukemia image classification. The proposed models were trained on the publicly available datasets ALL-IDB1 and ALL-IDB2. Their results were then aggregated at the score level using the sum rule. The parallel architecture used in these models offers a high level of confidence in differentiating between ALL and non-ALL cases. The proposed method achieved a remarkable detection accuracy rate of 100% on the ALL-IDB1 dataset, 98.07% on the ALL-IDB2 dataset, and 98.64% on the combined dataset, demonstrating its potential for accurate and reliable leukemia diagnosis.
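The score-level aggregation (sum rule) and deep-ensemble uncertainty described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the scores are invented, and variance of the positive-class probability across members is one common deep-ensemble uncertainty estimate, not necessarily the exact one used.

```python
def sum_rule(score_lists):
    """Score-level fusion: element-wise sum of class scores
    from several models."""
    n_classes = len(score_lists[0])
    return [sum(s[c] for s in score_lists) for c in range(n_classes)]

def ensemble_disagreement(score_lists):
    """Variance of the positive-class probability across ensemble
    members: a simple deep-ensemble uncertainty estimate."""
    ps = [s[1] for s in score_lists]
    m = sum(ps) / len(ps)
    return sum((p - m) ** 2 for p in ps) / len(ps)

# Softmax scores for classes [non-ALL, ALL] from three hybrid models.
scores = [[0.1, 0.9], [0.2, 0.8], [0.15, 0.85]]
fused = sum_rule(scores)
print(fused.index(max(fused)))  # 1 -> predicted ALL
print(ensemble_disagreement(scores))  # low value -> members agree
```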

[CV-72] Less is More: Selective Reduction of CT Data for Self-Supervised Pre-Training of Deep Learning Models with Contrastive Learning Improves Downstream Classification Performance

链接: https://arxiv.org/abs/2410.14524
作者: Daniel Wolf,Tristan Payer,Catharina Silvia Lisson,Christoph Gerhard Lisson,Meinrad Beer,Michael Götz,Timo Ropinski
关键词-EN: Self-supervised pre-training, widely used technique, Self-supervised, deep learning models, medical images
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Published in Computers in Biology and Medicine

点击查看摘要

Abstract:Self-supervised pre-training of deep learning models with contrastive learning is a widely used technique in image analysis. Current findings indicate a strong potential for contrastive pre-training on medical images. However, further research is necessary to incorporate the particular characteristics of these images. We hypothesize that the similarity of medical images hinders the success of contrastive learning in the medical imaging domain. To this end, we investigate different strategies based on deep embedding, information theory, and hashing in order to identify and reduce redundancy in medical pre-training datasets. The effect of these different reduction strategies on contrastive learning is evaluated on two pre-training datasets and several downstream classification tasks. In all of our experiments, dataset reduction leads to a considerable performance gain in downstream tasks, e.g., an AUC score improvement from 0.78 to 0.83 for the COVID CT Classification Grand Challenge, 0.97 to 0.98 for the OrganSMNIST Classification Challenge and 0.73 to 0.83 for a brain hemorrhage classification task. Furthermore, pre-training is up to nine times faster due to the dataset reduction. In conclusion, the proposed approach highlights the importance of dataset quality and provides a transferable approach to improve contrastive pre-training for classification downstream tasks on medical images.
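Among the reduction strategies the abstract lists, the hashing-based one is the simplest to sketch. The snippet below removes exact duplicates by content hash; this is a minimal stand-in, since the paper's embedding- and information-theory-based strategies also catch near-duplicates, and the toy byte-list "images" are invented.

```python
import hashlib

def dedup_by_hash(images):
    """Drop exact duplicates via content hashing, keeping the first
    occurrence of each distinct sample."""
    seen, kept = set(), []
    for img in images:
        digest = hashlib.sha256(bytes(img)).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(img)
    return kept

# Toy "images" as byte lists; two are identical.
dataset = [[0, 1, 2], [0, 1, 2], [3, 4, 5]]
print(len(dedup_by_hash(dataset)))  # 2
```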

[CV-73] An Integrated Deep Learning Model for Skin Cancer Detection Using Hybrid Feature Fusion Technique

链接: https://arxiv.org/abs/2410.14489
作者: Maksuda Akter,Rabea Khatun,Md. Alamin Talukder,Md. Manowarul Islam,Md. Ashraf Uddin
关键词-EN: potentially fatal disease, fatal disease caused, DNA damage, caused by DNA, potentially fatal
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Skin cancer is a serious and potentially fatal disease caused by DNA damage. Early detection significantly increases survival rates, making accurate diagnosis crucial. In this groundbreaking study, we present a hybrid framework based on Deep Learning (DL) that achieves precise classification of benign and malignant skin lesions. Our approach begins with dataset preprocessing to enhance classification accuracy, followed by training two separate pre-trained DL models, InceptionV3 and DenseNet121. By fusing the results of each model using the weighted sum rule, our system achieves exceptional accuracy rates. Specifically, we achieve a 92.27% detection accuracy rate, 92.33% sensitivity, 92.22% specificity, 90.81% precision, and 91.57% F1-score, outperforming existing models and demonstrating the robustness and trustworthiness of our hybrid approach. Our study represents a significant advance in skin cancer diagnosis and provides a promising foundation for further research in the field. With the potential to save countless lives through earlier detection, our hybrid deep-learning approach is a game-changer in the fight against skin cancer.
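The weighted sum rule fusion and the reported metrics (sensitivity, specificity, precision, F1) can be sketched as follows; the weights and toy predictions are illustrative, not the paper's tuned values or data.

```python
def weighted_fusion(p_a, p_b, w_a=0.5, w_b=0.5):
    """Weighted sum rule over two models' class probabilities
    (e.g. InceptionV3 and DenseNet121 outputs)."""
    return [w_a * a + w_b * b for a, b in zip(p_a, p_b)]

def binary_metrics(preds, labels):
    """Sensitivity, specificity, precision, F1 from binary predictions."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    sens = tp / (tp + fn)   # sensitivity (recall)
    spec = tn / (tn + fp)   # specificity
    prec = tp / (tp + fp)   # precision
    f1 = 2 * prec * sens / (prec + sens)
    return sens, spec, prec, f1

preds  = [1, 1, 0, 0, 1, 0]   # fused benign/malignant decisions
labels = [1, 0, 0, 0, 1, 1]
sens, spec, prec, f1 = binary_metrics(preds, labels)
print(sens == spec == prec)  # True for this toy split
```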

[CV-74] Integrating Deep Learning with Fundus and Optical Coherence Tomography for Cardiovascular Disease Prediction

链接: https://arxiv.org/abs/2410.14423
作者: Cynthia Maldonado-Garcia,Arezoo Zakeri,Alejandro F Frangi,Nishant Ravikumar
关键词-EN: reducing healthcare burden, reducing healthcare, healthcare burden, quality of life, CVD
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 15155))

点击查看摘要

Abstract:Early identification of patients at risk of cardiovascular diseases (CVD) is crucial for effective preventive care, reducing healthcare burden, and improving patients’ quality of life. This study demonstrates the potential of retinal optical coherence tomography (OCT) imaging combined with fundus photographs for identifying future adverse cardiac events. We used data from 977 patients who experienced CVD within a 5-year interval post-image acquisition, alongside 1,877 control participants without CVD, totaling 2,854 subjects. We propose a novel binary classification network based on a Multi-channel Variational Autoencoder (MCVAE), which learns a latent embedding of patients’ fundus and OCT images to classify individuals into two groups: those likely to develop CVD in the future and those who are not. Our model, trained on both imaging modalities, achieved promising results (AUROC 0.78 +/- 0.02, accuracy 0.68 +/- 0.002, precision 0.74 +/- 0.02, sensitivity 0.73 +/- 0.02, and specificity 0.68 +/- 0.01), demonstrating its efficacy in identifying patients at risk of future CVD events based on their retinal images. This study highlights the potential of retinal OCT imaging and fundus photographs as cost-effective, non-invasive alternatives for predicting cardiovascular disease risk. The widespread availability of these imaging techniques in optometry practices and hospitals further enhances their potential for large-scale CVD risk screening. Our findings contribute to the development of standardized, accessible methods for early CVD risk identification, potentially improving preventive care strategies and patient outcomes.

[CV-75] 2D-3D Deformable Image Registration of Histology Slide and Micro-CT with ML-based Initialization

链接: https://arxiv.org/abs/2410.14343
作者: Junan Chen,Matteo Ronchetti,Verena Stehl,Van Nguyen,Muhannad Al Kallaa,Mahesh Thalwaththe Gedara,Claudia Lölkes,Stefan Moser,Maximilian Seidl,Matthias Wieczorek
关键词-EN: Recent developments, virtual histology based, micro-computed tomography, broadened the perspective, perspective of pathological
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 4 figures

点击查看摘要

Abstract:Recent developments in the registration of histology and micro-computed tomography (μCT) have broadened the perspective of pathological applications such as virtual histology based on μCT. This topic remains challenging because of the low image quality of soft tissue CT. Additionally, soft tissue samples usually deform during the histology slide preparation, making it difficult to correlate the structures between histology slide and μCT. In this work, we propose a novel 2D-3D multi-modal deformable image registration method. The method uses a machine learning (ML) based initialization followed by the registration. The registration is finalized by an analytical out-of-plane deformation refinement. The method is evaluated on datasets acquired from tonsil and tumor tissues. μCTs of both phase-contrast and conventional absorption modalities are investigated. The registration results from the proposed method are compared with those from intensity- and keypoint-based methods. The comparison is conducted using both visual and fiducial-based evaluations. The proposed method demonstrates superior performance compared to the other two methods.

[CV-76] E3D-GPT: Enhanced 3D Visual Foundation for Medical Vision-Language Model

链接: https://arxiv.org/abs/2410.14200
作者: Haoran Lai,Zihang Jiang,Qingsong Yao,Rongsheng Wang,Zhiyang He,Xiaodong Tao,Wei Wei,Weifu Lv,S.Kevin Zhou
关键词-EN: medical vision-language models, holds significant potential, models holds significant, medical vision-language, vision-language models holds
类目: Image and Video Processing (eess.IV); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The development of 3D medical vision-language models holds significant potential for disease diagnosis and patient treatment. However, compared to 2D medical images, 3D medical images, such as CT scans, face challenges related to limited training data and high dimension, which severely restrict the progress of 3D medical vision-language models. To address these issues, we collect a large amount of unlabeled 3D CT data and utilize self-supervised learning to construct a 3D visual foundation model for extracting 3D visual features. Then, we apply 3D spatial convolutions to aggregate and project high-level image features, reducing computational complexity while preserving spatial information. We also construct two instruction-tuning datasets based on BIMCV-R and CT-RATE to fine-tune the 3D vision-language model. Our model demonstrates superior performance compared to existing methods in report generation, visual question answering, and disease diagnosis. Code and data will be made publicly available soon.

[CV-77] Deep Learning Applications in Medical Image Analysis: Advancements Challenges and Future Directions

链接: https://arxiv.org/abs/2410.14131
作者: Aimina Ali Eli,Abida Ali
关键词-EN: Medical image analysis, contemporary healthcare, facilitating physicians, precise diagnosis, essential element
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Medical image analysis has emerged as an essential element of contemporary healthcare, facilitating physicians in achieving expedited and precise diagnosis. Recent breakthroughs in deep learning, a subset of artificial intelligence, have markedly revolutionized the analysis of medical pictures, improving the accuracy and efficiency of clinical procedures. Deep learning algorithms, especially convolutional neural networks (CNNs), have demonstrated remarkable proficiency in autonomously learning features from multidimensional medical pictures, including MRI, CT, and X-ray scans, without the necessity for manual feature extraction. These models have been utilized across multiple medical disciplines, including pathology, radiology, ophthalmology, and cardiology, where they aid in illness detection, classification, and segmentation tasks…

[CV-78] Segmentation of Pediatric Brain Tumors using a Radiologically informed Deep Learning Cascade

链接: https://arxiv.org/abs/2410.14020
作者: Timothy Mulvany,Daniel Griffiths-King,Jan Novak,Heather Rose
关键词-EN: Diffuse Midline Glioma, Intrinsic Pontine Glioma, Diffuse Intrinsic Pontine, Pontine Glioma, Midline Glioma
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Monitoring of Diffuse Intrinsic Pontine Glioma (DIPG) and Diffuse Midline Glioma (DMG) brain tumors in pediatric patients is key for assessment of treatment response. Response Assessment in Pediatric Neuro-Oncology (RAPNO) guidelines recommend the volumetric measurement of these tumors using MRI. Segmentation challenges, such as the Brain Tumor Segmentation (BraTS) Challenge, promote development of automated approaches which are replicable, generalizable and accurate, to aid in these tasks. The current study presents a novel adaptation of existing nnU-Net approaches for pediatric brain tumor segmentation, submitted to the BraTS-PEDs 2024 challenge. We apply an adapted nnU-Net with hierarchical cascades to the segmentation task of the BraTS-PEDs 2024 challenge. The residual encoder variant of nnU-Net, used as our baseline model, already provides high quality segmentations. We incorporate multiple changes to the implementation of nnU-Net and devise a novel two-stage cascaded nnU-Net to segment the substructures of brain tumors from coarse to fine. Using outputs from the nnU-Net Residual Encoder (trained to segment CC, ED, ET and NET tumor labels from T1w, T1w-CE, T2w and T2-FLAIR MRI), these are passed to two additional models: one classifying ET versus NET and a second classifying CC vs ED using cascade learning. We use radiological guidelines to steer which multi parametric MRI (mpMRI) to use in these cascading models. Compared to a default nnU-Net and an ensembled nnU-Net as baseline approaches, our novel method provides robust segmentations for the BraTS-PEDs 2024 challenge, achieving mean Dice scores of 0.657, 0.904, 0.703, and 0.967, and HD95 of 76.2, 10.1, 111.0, and 12.3 for the ET, NET, CC and ED, respectively.
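The mean Dice scores reported above use the standard Dice overlap coefficient; a minimal version on flattened binary masks (the masks below are toy data, not challenge segmentations):

```python
def dice(mask_a, mask_b):
    """Dice coefficient between two binary masks given as flattened
    0/1 lists: 2|A∩B| / (|A| + |B|)."""
    inter = sum(a and b for a, b in zip(mask_a, mask_b))
    total = sum(mask_a) + sum(mask_b)
    return 2 * inter / total if total else 1.0  # empty vs empty: perfect

pred = [1, 1, 0, 1, 0, 0]  # predicted tumor subregion
gt   = [1, 0, 0, 1, 1, 0]  # ground-truth subregion
print(round(dice(pred, gt), 3))  # 0.667
```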

[CV-79] From Real Artifacts to Virtual Reference: A Robust Framework for Translating Endoscopic Images

链接: https://arxiv.org/abs/2410.13896
作者: Junyang Wu,Fangfang Xie,Jiayuan Sun,Yun Gu,Guang-Zhong Yang
关键词-EN: medical image analysis, multimodal medical image, plays a crucial, crucial role, role in multimodal
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Domain adaptation, which bridges the distributions across different modalities, plays a crucial role in multimodal medical image analysis. In endoscopic imaging, combining pre-operative data with intra-operative imaging is important for surgical planning and navigation. However, existing domain adaptation methods are hampered by distribution shift caused by in vivo artifacts, necessitating robust techniques for aligning noisy and artifact abundant patient endoscopic videos with clean virtual images reconstructed from pre-operative tomographic data for pose estimation during intraoperative guidance. This paper presents an artifact-resilient image translation method and an associated benchmark for this purpose. The method incorporates a novel "local-global" translation framework and a noise-resilient feature extraction strategy. For the former, it decouples the image translation process into a local step for feature denoising, and a global step for global style transfer. For feature extraction, a new contrastive learning strategy is proposed, which can extract noise-resilient features for establishing robust correspondence across domains. Detailed validation on both public and in-house clinical datasets has been conducted, demonstrating significantly improved performance compared to the current state-of-the-art.

机器学习

[LG-0] Decomposing The Dark Matter of Sparse Autoencoders

链接: https://arxiv.org/abs/2410.14670
作者: Joshua Engels,Logan Riggs,Max Tegmark
关键词-EN: Sparse autoencoders, decomposing language model, SAE error, SAE, promising technique
类目: Machine Learning (cs.LG)
*备注: Code at this https URL

点击查看摘要

Abstract:Sparse autoencoders (SAEs) are a promising technique for decomposing language model activations into interpretable linear features. However, current SAEs fall short of completely explaining model performance, resulting in “dark matter”: unexplained variance in activations. This work investigates dark matter as an object of study in its own right. Surprisingly, we find that much of SAE dark matter–about half of the error vector itself and 90% of its norm–can be linearly predicted from the initial activation vector. Additionally, we find that the scaling behavior of SAE error norms at a per token level is remarkably predictable: larger SAEs mostly struggle to reconstruct the same contexts as smaller SAEs. We build on the linear representation hypothesis to propose models of activations that might lead to these observations, including postulating a new type of “introduced error”; these insights imply that the part of the SAE error vector that cannot be linearly predicted (“nonlinear” error) might be fundamentally different from the linearly predictable component. To validate this hypothesis, we empirically analyze nonlinear SAE error and show that 1) it contains fewer not yet learned features, 2) SAEs trained on it are quantitatively worse, 3) it helps predict SAE per-token scaling behavior, and 4) it is responsible for a proportional amount of the downstream increase in cross entropy loss when SAE activations are inserted into the model. Finally, we examine two methods to reduce nonlinear SAE error at a fixed sparsity: inference time gradient pursuit, which leads to a very slight decrease in nonlinear error, and linear transformations from earlier layer SAE outputs, which leads to a larger reduction.
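
To make the "linearly predictable error" idea concrete, here is a hedged toy sketch (not the paper's code): a crude top-k "SAE" on random activations, with a linear map fitted by gradient descent to its reconstruction error. The dimensions, sample counts, and the toy encoder are all invented for illustration.

```python
import random

random.seed(0)
d, n = 6, 200

# Toy "activations": i.i.d. Gaussian feature vectors.
acts = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

# Toy "SAE": keep only the 3 largest-magnitude coordinates (a crude sparse code),
# so the reconstruction error is a deterministic but nonlinear function of the input.
def sae_reconstruct(x, k=3):
    top = set(sorted(range(d), key=lambda i: -abs(x[i]))[:k])
    return [x[i] if i in top else 0.0 for i in range(d)]

errs = [[x[i] - y[i] for i in range(d)]
        for x in acts for y in [sae_reconstruct(x)]]

# Fit a linear map W minimizing mean ||W x - e||^2 by gradient descent.
W = [[0.0] * d for _ in range(d)]
lr = 0.05
for _ in range(200):
    grad = [[0.0] * d for _ in range(d)]
    for x, e in zip(acts, errs):
        pred = [sum(W[i][j] * x[j] for j in range(d)) for i in range(d)]
        for i in range(d):
            diff = 2 * (pred[i] - e[i]) / n
            for j in range(d):
                grad[i][j] += diff * x[j]
    for i in range(d):
        for j in range(d):
            W[i][j] -= lr * grad[i][j]

# Fraction of the squared error norm explained by the linear predictor.
resid = total = 0.0
for x, e in zip(acts, errs):
    pred = [sum(W[i][j] * x[j] for j in range(d)) for i in range(d)]
    resid += sum((e[i] - pred[i]) ** 2 for i in range(d))
    total += sum(e[i] ** 2 for i in range(d))
explained = 1.0 - resid / total
```

In this toy, `explained` comes out strictly between 0 and 1: part of the (nonlinear) truncation error is linearly predictable from the activation, mirroring the paper's qualitative finding.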

[LG-1] Stochastic Gradient Descent Jittering for Inverse Problems: Alleviating the Accuracy-Robustness Tradeoff

链接: https://arxiv.org/abs/2410.14667
作者: Peimeng Guan,Mark A. Davenport
关键词-EN: Inverse problems aim, Inverse problems, perturbed measurements, reconstruct unseen data, problems aim
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Inverse problems aim to reconstruct unseen data from corrupted or perturbed measurements. While most work focuses on improving reconstruction quality, generalization accuracy and robustness are equally important, especially for safety-critical applications. Model-based architectures (MBAs), such as loop unrolling methods, are considered more interpretable and achieve better reconstructions. Empirical evidence suggests that MBAs are more robust to perturbations than black-box solvers, but the accuracy-robustness tradeoff in MBAs remains underexplored. In this work, we propose a simple yet effective training scheme for MBAs, called SGD jittering, which injects noise iteration-wise during reconstruction. We theoretically demonstrate that SGD jittering not only generalizes better than the standard mean squared error training but is also more robust to average-case attacks. We validate SGD jittering using denoising toy examples, seismic deconvolution, and single-coil MRI reconstruction. The proposed method achieves cleaner reconstructions for out-of-distribution data and demonstrates enhanced robustness to adversarial attacks.
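
A minimal sketch of iteration-wise noise injection in an unrolled, model-based solver, on an invented scalar inverse problem (an illustrative stand-in, not the paper's implementation):

```python
import random

random.seed(0)
a = 2.0   # forward operator of a toy scalar inverse problem y = a*x + noise

def unrolled(y, eta, jitter, rng):
    # Model-based reconstruction: 5 unrolled gradient steps on 0.5*(a*x - y)^2,
    # with noise injected into the iterate at every step (the "jittering").
    x = 0.0
    for _ in range(5):
        x -= eta * a * (a * x - y)
        x += rng.gauss(0, jitter)
    return x

# Train the step size eta with SGD; finite differences with common random
# numbers (same seed for both evaluations) keep the gradient estimate usable.
eta, lr, fd = 0.05, 0.01, 1e-3
for step in range(500):
    x_true = random.gauss(0, 1)
    y = a * x_true + random.gauss(0, 0.05)

    def loss(e):
        return (unrolled(y, e, jitter=0.02, rng=random.Random(step)) - x_true) ** 2

    eta -= lr * (loss(eta + fd) - loss(eta - fd)) / (2 * fd)
    eta = min(max(eta, 0.01), 0.4)   # project eta into a stable range

# Evaluate without jitter on fresh data.
test_err = 0.0
for _ in range(200):
    x_true = random.gauss(0, 1)
    y = a * x_true + random.gauss(0, 0.05)
    test_err += (unrolled(y, eta, 0.0, random) - x_true) ** 2 / 200
```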

[LG-2] DiscoGraMS: Enhancing Movie Screen-Play Summarization using Movie Character-Aware Discourse Graph

链接: https://arxiv.org/abs/2410.14666
作者: Maitreya Prafulla Chitale,Uday Bindal,Rajakrishnan Rajkumar,Rahul Mishra
关键词-EN: Summarizing movie screenplays, standard document summarization, Summarizing movie, unique set, compared to standard
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Summarizing movie screenplays presents a unique set of challenges compared to standard document summarization. Screenplays are not only lengthy, but also feature a complex interplay of characters, dialogues, and scenes, with numerous direct and subtle relationships and contextual nuances that are difficult for machine learning models to accurately capture and comprehend. Recent attempts at screenplay summarization focus on fine-tuning transformer-based pre-trained models, but these models often fall short in capturing long-term dependencies and latent relationships, and frequently encounter the "lost in the middle" issue. To address these challenges, we introduce DiscoGraMS, a novel resource that represents movie scripts as a movie character-aware discourse graph (CaD Graph). This approach is well-suited for various downstream tasks, such as summarization, question-answering, and salience detection. The model aims to preserve all salient information, offering a more comprehensive and faithful representation of the screenplay's content. We further explore a baseline method that combines the CaD Graph with the corresponding movie script through a late fusion of graph and text modalities, and we present initial, promising results.

[LG-3] Online Reinforcement Learning with Passive Memory

链接: https://arxiv.org/abs/2410.14665
作者: Anay Pattanaik,Lav R. Varshney
关键词-EN: leverages pre-collected data, reinforcement learning algorithm, online reinforcement learning, pre-collected data, online interaction
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper considers an online reinforcement learning algorithm that leverages pre-collected data (passive memory) from the environment for online interaction. We show that using passive memory improves performance and further provide theoretical guarantees for regret that turns out to be near-minimax optimal. Results show that the quality of passive memory determines sub-optimality of the incurred regret. The proposed approach and results hold in both continuous and discrete state-action spaces.
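
As a hedged illustration of mixing passive memory with online interaction (the toy chain MDP and the one-passive-update-per-step schedule below are our own, not the paper's algorithm):

```python
import random

random.seed(0)
N = 5   # chain states 0..4; reward 1 on reaching state 4

def env_step(s, a):   # a: 0 = left, 1 = right
    s2 = max(0, s - 1) if a == 0 else min(N - 1, s + 1)
    r = 1.0 if s2 == N - 1 else 0.0
    return s2, r, s2 == N - 1

# Passive memory: transitions gathered beforehand by a random behaviour policy.
passive = []
for _ in range(200):
    s, a = random.randrange(N - 1), random.randrange(2)
    s2, r, done = env_step(s, a)
    passive.append((s, a, r, s2, done))

Q = [[0.0, 0.0] for _ in range(N)]
alpha, gamma, eps = 0.2, 0.9, 0.2

def td_update(s, a, r, s2, done):
    target = r if done else r + gamma * max(Q[s2])
    Q[s][a] += alpha * (target - Q[s][a])

for _ in range(200):   # online interaction, epsilon-greedy
    s = 0
    for _ in range(20):
        a = random.randrange(2) if random.random() < eps else max((0, 1), key=lambda x: Q[s][x])
        s2, r, done = env_step(s, a)
        td_update(s, a, r, s2, done)          # fresh online sample: used once, not stored
        td_update(*random.choice(passive))    # plus one extra update from passive memory
        s = s2
        if done:
            break

greedy = [max((0, 1), key=lambda a: Q[s][a]) for s in range(N - 1)]
```

The passive updates propagate value information into states the online agent has not yet reached, which is the intuition behind the paper's claim that passive-memory quality governs the incurred regret.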

[LG-4] A Large Language Model-Driven Reward Design Framework via Dynamic Feedback for Reinforcement Learning

链接: https://arxiv.org/abs/2410.14660
作者: Shengjie Sun,Runze Liu,Jiafei Lyu,Jing-Wen Yang,Liangpeng Zhang,Xiu Li
关键词-EN: Large Language Models, Large Language, Language Models, Reinforcement Learning, shown significant potential
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown significant potential in designing reward functions for Reinforcement Learning (RL) tasks. However, obtaining high-quality reward code often involves human intervention, numerous LLM queries, or repetitive RL training. To address these issues, we propose CARD, an LLM-driven Reward Design framework that iteratively generates and improves reward function code. Specifically, CARD includes a Coder that generates and verifies the code, while an Evaluator provides dynamic feedback to guide the Coder in improving the code, eliminating the need for human feedback. In addition to process feedback and trajectory feedback, we introduce Trajectory Preference Evaluation (TPE), which evaluates the current reward function based on trajectory preferences. If the code fails the TPE, the Evaluator provides preference feedback, avoiding RL training at every iteration and making the reward function better aligned with the task objective. Empirical results on Meta-World and ManiSkill2 demonstrate that our method achieves an effective balance between task performance and token efficiency, outperforming or matching the baselines across all tasks. On 10 out of 12 tasks, CARD shows better or comparable performance to policies trained with expert-designed rewards, and our method even surpasses the oracle on 3 tasks.
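
The Coder/Evaluator loop can be sketched with stubs; the lambda "candidates" below stand in for LLM-generated reward code, and the preference data is invented:

```python
# Stub candidate reward functions, standing in for code an LLM "Coder" would emit.
candidates = [
    lambda traj: -sum(traj),   # misaligned: penalizes progress
    lambda traj: sum(traj),    # aligned with task progress
]

# Trajectory preference pairs (better, worse), e.g. derived from task success.
prefs = [([1, 1, 1], [0, 0, 0]), ([2, 1], [1, 0])]

def tpe(reward_fn):
    # Trajectory Preference Evaluation: accept a reward function only if it
    # ranks every preferred trajectory above its dispreferred counterpart.
    return all(reward_fn(hi) > reward_fn(lo) for hi, lo in prefs)

# Evaluator loop: try a candidate, and on TPE failure request a revision
# (here simply the next stub) -- no RL training needed for this check.
chosen = next(i for i, fn in enumerate(candidates) if tpe(fn))
```

The point of the sketch is the control flow: TPE screens reward candidates cheaply against trajectory preferences, so full RL training is not run on every iteration.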

[LG-5] Harnessing Causality in Reinforcement Learning With Bagged Decision Times

链接: https://arxiv.org/abs/2410.14659
作者: Daiqi Gao,Hsin-Yu Lai,Predrag Klasnja,Susan A. Murphy
关键词-EN: bagged decision times, decision times, reinforcement learning, consecutive decision times, bagged decision
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We consider reinforcement learning (RL) for a class of problems with bagged decision times. A bag contains a finite sequence of consecutive decision times. The transition dynamics are non-Markovian and non-stationary within a bag. Further, all actions within a bag jointly impact a single reward, observed at the end of the bag. Our goal is to construct an online RL algorithm to maximize the discounted sum of the bag-specific rewards. To handle non-Markovian transitions within a bag, we utilize an expert-provided causal directed acyclic graph (DAG). Based on the DAG, we construct the states as a dynamical Bayesian sufficient statistic of the observed history, which results in Markovian state transitions within and across bags. We then frame this problem as a periodic Markov decision process (MDP) that allows non-stationarity within a period. An online RL algorithm based on Bellman-equations for stationary MDPs is generalized to handle periodic MDPs. To justify the proposed RL algorithm, we show that our constructed state achieves the maximal optimal value function among all state constructions for a periodic MDP. Further we prove the Bellman optimality equations for periodic MDPs. We evaluate the proposed method on testbed variants, constructed with real data from a mobile health clinical trial.

[LG-6] Bridging the Training-Inference Gap in LLMs by Leveraging Self-Generated Tokens

链接: https://arxiv.org/abs/2410.14655
作者: Zhepeng Cen,Yao Liu,Siliang Zeng,Pratik Chaudhar,Huzefa Rangwala,George Karypis,Rasool Fakoor
关键词-EN: Language models, maximize the likelihood, model, past tokens, Language
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Language models are often trained to maximize the likelihood of the next token given past tokens in the training dataset. However, during inference time, they are utilized differently, generating text sequentially and auto-regressively by using previously generated tokens as input to predict the next one. Marginal differences in predictions at each step can cascade over successive steps, resulting in different distributions from what the models were trained for and potentially leading to unpredictable behavior. This paper proposes two simple approaches based on the model's own generations to address this discrepancy between training and inference time. Our first approach is Batch-Scheduled Sampling, where, during training, we stochastically choose between the ground-truth token from the dataset and the model's own generated token as input to predict the next token. This is done in an offline manner, modifying the context window by interleaving ground-truth tokens with those generated by the model. Our second approach is Reference-Answer-based Correction, where we explicitly incorporate a self-correction capability into the model during training. This enables the model to effectively self-correct the gaps between the generated sequences and the ground truth data without relying on an external oracle model. By incorporating our proposed strategies during training, we have observed an overall improvement in performance compared to baseline methods, as demonstrated by our extensive experiments using summarization, general question-answering, and math question-answering tasks.
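
A toy sketch of the scheduled-sampling idea with an invented bigram "model" (the paper applies this to LLM training; everything below, including the mixing probability, is illustrative):

```python
import random

random.seed(0)
vocab = ["a", "b", "c"]
seq = ["a", "b", "c"] * 4   # toy training sequence with a cyclic pattern

counts = {}   # bigram "model": counts[prev][next]

def model_sample(prev):
    # Sample the next token from the model's current distribution (uniform if unseen).
    row = counts.get(prev)
    if not row:
        return random.choice(vocab)
    toks = list(row)
    return random.choices(toks, weights=[row[t] for t in toks])[0]

p_model = 0.3   # chance of feeding the model's own token back in as context
for epoch in range(50):
    prev = seq[0]
    for t in range(1, len(seq)):
        target = seq[t]
        counts.setdefault(prev, {}).setdefault(target, 0)
        counts[prev][target] += 1   # learn P(target | context)
        # Scheduled sampling: the next context token is either the ground-truth
        # token or one sampled from the model's own current distribution.
        prev = model_sample(prev) if random.random() < p_model else target

greedy = {p: max(counts[p], key=counts[p].get) for p in counts}
```

Because the model sometimes conditions on its own (possibly wrong) tokens during training, it also learns transitions from off-trajectory contexts, which is the mechanism the paper exploits to close the train/inference gap.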

[LG-7] EvoPress: Towards Optimal Dynamic Model Compression via Evolutionary Search

链接: https://arxiv.org/abs/2410.14649
作者: Oliver Sieberling,Denis Kuznedelev,Eldar Kurtic,Dan Alistarh
关键词-EN: high computational costs, large language models, emph, high computational, computational costs
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The high computational costs of large language models (LLMs) have led to a flurry of research on LLM compression, via methods such as quantization, sparsification, or structured pruning. A new frontier in this area is given by *dynamic*, non-uniform compression methods, which adjust the compression levels (e.g., sparsity) per-block or even per-layer in order to minimize accuracy loss, while guaranteeing a global compression threshold. Yet, current methods rely on heuristics for identifying the "importance" of a given layer towards the loss, based on assumptions such as *error monotonicity*, i.e. that the end-to-end model compression error is proportional to the sum of layer-wise errors. In this paper, we revisit this area, and propose a new and general approach for dynamic compression that is provably optimal in a given input range. We begin from the motivating observation that, in general, *error monotonicity* does not hold for LLMs: compressed models with lower sum of per-layer errors can perform *worse* than models with higher error sums. To address this, we propose a new general evolutionary framework for dynamic LLM compression called EvoPress, which has provable convergence, and low sample and evaluation complexity. We show that these theoretical guarantees lead to highly competitive practical performance for dynamic compression of Llama, Mistral and Phi models. Via EvoPress, we set new state-of-the-art results across all compression approaches: structural pruning (block/layer dropping), unstructured sparsity, as well as quantization with dynamic bitwidths. Our code is available at this https URL.
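
A hedged toy of evolutionary search over per-layer compression levels (the error table, interaction term, and budget are invented; the real method searches configurations of actual LLMs):

```python
import random

random.seed(0)
L, K = 6, 4   # 6 layers, compression levels 0..3 (higher = more compressed)

# Toy per-layer error for each level; deliberately heterogeneous across layers.
err = [[0.0, 0.1 * (i + 1), 0.3 * (i % 3), 1.0 / (i + 1)] for i in range(L)]

def loss(cfg):
    # End-to-end toy loss: per-layer errors plus an interaction term, so the
    # total is NOT just the sum of layer errors (error monotonicity fails).
    base = sum(err[i][c] for i, c in enumerate(cfg))
    inter = 0.05 * sum(cfg[i] * cfg[i + 1] for i in range(L - 1))
    return base + inter

BUDGET = 9   # required total compression across all layers

def repair(cfg):
    cfg = list(cfg)
    while sum(cfg) < BUDGET:   # enforce the global compression budget
        i = random.randrange(L)
        cfg[i] = min(K - 1, cfg[i] + 1)
    return cfg

uniform = repair([BUDGET // L + 1] * L)   # naive uniform-allocation baseline
best = list(uniform)                      # seed the search from the baseline
for _ in range(500):                      # (1+1) evolutionary search
    child = list(best)
    for _ in range(random.randrange(1, 3)):   # mutate 1-2 layer levels
        child[random.randrange(L)] = random.randrange(K)
    child = repair(child)
    if loss(child) <= loss(best):
        best = child
```

Because the search evaluates the end-to-end loss directly, it can exploit configurations that a sum-of-layer-errors heuristic would mis-rank.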

[LG-8] HR-Bandit: Human-AI Collaborated Linear Recourse Bandit

链接: https://arxiv.org/abs/2410.14640
作者: Junyu Cao,Ruijiang Gao,Esmaeil Keyvanshokooh
关键词-EN: doctors frequently recommend, frequently recommend actionable, recommend actionable recourses, Recourse Linear UCB, Linear Recourse Bandit
类目: Machine Learning (cs.LG)
*备注: 18 pages

点击查看摘要

Abstract:Human doctors frequently recommend actionable recourses that allow patients to modify their conditions to access more effective treatments. Inspired by such healthcare scenarios, we propose the Recourse Linear UCB (RLinUCB) algorithm, which optimizes both action selection and feature modifications by balancing exploration and exploitation. We further extend this to the Human-AI Linear Recourse Bandit (HR-Bandit), which integrates human expertise to enhance performance. HR-Bandit offers three key guarantees: (i) a warm-start guarantee for improved initial performance, (ii) a human-effort guarantee to minimize required human interactions, and (iii) a robustness guarantee that ensures sublinear regret even when human decisions are suboptimal. Empirical results, including a healthcare case study, validate its superior performance against existing benchmarks.
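
For intuition, here is a plain LinUCB sketch, the standard linear-UCB rule that RLinUCB builds on; the recourse/feature-modification part is omitted, and the arms, noise level, and exploration constant are invented:

```python
import math, random

random.seed(0)

theta_true = [1.0, -0.5]                     # unknown reward parameter (for simulation)
arms = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # arm feature vectors
alpha = 1.0                                  # exploration strength

A = [[1.0, 0.0], [0.0, 1.0]]                 # ridge design matrix, A = I + sum x x^T
b = [0.0, 0.0]                               # b = sum r x

def inv2(M):
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]

def mv(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

picks = []
for t in range(400):
    Ainv = inv2(A)
    th = mv(Ainv, b)   # ridge estimate of theta
    # Optimistic score: estimated reward + confidence width.
    scores = [dot(th, x) + alpha * math.sqrt(dot(x, mv(Ainv, x))) for x in arms]
    k = max(range(len(arms)), key=lambda i: scores[i])
    x = arms[k]
    r = dot(theta_true, x) + random.gauss(0, 0.1)   # simulated noisy reward
    for i in range(2):
        b[i] += r * x[i]
        for j in range(2):
            A[i][j] += x[i] * x[j]
    picks.append(k)
```

As the confidence widths shrink, the rule concentrates on the truly best arm; the paper's contribution layers recourse (feature modification) and human feedback on top of this machinery.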

[LG-9] Convergence of Manifold Filter-Combine Networks NEURIPS

链接: https://arxiv.org/abs/2410.14639
作者: David R. Johnson,Joyce Chew,Siddharth Viswanath,Edward De Brouwer,Deanna Needell,Smita Krishnaswamy,Michael Perlmutter
关键词-EN: manifold neural networks, Manifold Filter-Combine Networks, understand manifold neural, neural networks, graph neural networks
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
*备注: Accepted to NeurIPS Workshop on Symmetry and Geometry in Neural Representations (Extended Abstract Track)

点击查看摘要

Abstract:In order to better understand manifold neural networks (MNNs), we introduce Manifold Filter-Combine Networks (MFCNs). The filter-combine framework parallels the popular aggregate-combine paradigm for graph neural networks (GNNs) and naturally suggests many interesting families of MNNs which can be interpreted as the manifold analog of various popular GNNs. We then propose a method for implementing MFCNs on high-dimensional point clouds that relies on approximating the manifold by a sparse graph. We prove that our method is consistent in the sense that it converges to a continuum limit as the number of data points tends to infinity.

[LG-10] Parallel Backpropagation for Inverse of a Convolution with Application to Normalizing Flows

链接: https://arxiv.org/abs/2410.14634
作者: Sandeep Nagar,Girish Varma
关键词-EN: Image Deblurring, Normalizing Flows, Normalizing, Normalizing Flow backbones, Inverse Convolutions
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Probability (math.PR)
*备注: Preprint

点击查看摘要

Abstract:Inverse of an invertible convolution is an important operation that comes up in Normalizing Flows, Image Deblurring, etc. The naive algorithm for backpropagation of this operation using Gaussian elimination has running time O(n^3) where n is the number of pixels in the image. We give a fast parallel backpropagation algorithm with running time O(√n) for a square image and provide a GPU implementation of the same. Inverse Convolutions are usually used in Normalizing Flows in the sampling pass, making them slow. We propose to use Inverse Convolutions in the forward (image to latent vector) pass of the Normalizing flow. Since the sampling pass is the inverse of the forward pass, it will use convolutions only, resulting in efficient sampling times. We use our parallel backpropagation algorithm for optimizing the inverse convolution layer resulting in fast training times also. We implement this approach in various Normalizing Flow backbones, resulting in our Inverse-Flow models. We benchmark Inverse-Flow on standard datasets and show significantly improved sampling times with similar bits per dimension compared to previous models.

[LG-11] On the Regularization of Learnable Embeddings for Time Series Processing

链接: https://arxiv.org/abs/2410.14630
作者: Luca Butera,Giovanni De Felice,Andrea Cini,Cesare Alippi
关键词-EN: time series, multiple time series, time series processing, individual features, processing multiple time
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In processing multiple time series, accounting for the individual features of each sequence can be challenging. To address this, modern deep learning methods for time series analysis combine a shared (global) model with local layers, specific to each time series, often implemented as learnable embeddings. Ideally, these local embeddings should encode meaningful representations of the unique dynamics of each sequence. However, when these are learned end-to-end as parameters of a forecasting model, they may end up acting as mere sequence identifiers. Shared processing blocks may then become reliant on such identifiers, limiting their transferability to new contexts. In this paper, we address this issue by investigating methods to regularize the learning of local learnable embeddings for time series processing. Specifically, we perform the first extensive empirical study on the subject and show how such regularizations consistently improve performance in widely adopted architectures. Furthermore, we show that methods preventing the co-adaptation of local and global parameters are particularly effective in this context. This hypothesis is validated by comparing several methods preventing the downstream models from relying on sequence identifiers, going as far as completely resetting the embeddings during training. The obtained results provide an important contribution to understanding the interplay between learnable local parameters and shared processing layers: a key challenge in modern time series processing models and a step toward developing effective foundation models for time series.
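
A toy sketch of regularizing local learnable embeddings alongside a shared global parameter (the series, the linear model, and the L2 penalty below are illustrative choices, not the paper's setup):

```python
import random

random.seed(0)

# Two toy series sharing slope a = 2.0 but with small individual offsets.
true_a, offsets = 2.0, [0.5, -0.5]
data = [[(x, true_a * x + off + random.gauss(0, 0.05))
         for x in [random.uniform(-1, 1) for _ in range(100)]]
        for off in offsets]

def train(lam):
    a, emb = 0.0, [0.0, 0.0]   # shared global parameter + per-series embeddings
    for epoch in range(200):
        for i, series in enumerate(data):
            for x, y in series:
                pred = a * x + emb[i]
                g = 2 * (pred - y)
                a -= 0.01 * g * x
                emb[i] -= 0.01 * (g + 2 * lam * emb[i])   # L2-regularized local embedding
    return a, emb

a_reg, emb_reg = train(lam=1.0)
a_free, emb_free = train(lam=0.0)
```

The penalty shrinks the local embeddings toward zero, so they carry only genuinely series-specific signal instead of acting as free per-series identifiers, while the shared slope is recovered either way.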

[LG-12] SIMformer: Single-Layer Vanilla Transformer Can Learn Free-Space Trajectory Similarity

链接: https://arxiv.org/abs/2410.14629
作者: Chuang Yang,Renhe Jiang,Xiaohang Xu,Chuan Xiao,Kaoru Sezaki
关键词-EN: Free-space trajectory similarity, quadratic time complexity, incur quadratic time, trajectory similarity calculation, Free-space trajectory
类目: Machine Learning (cs.LG); Databases (cs.DB); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Free-space trajectory similarity calculation, e.g., DTW, Hausdorff, and Frechet, often incurs quadratic time complexity, thus learning-based methods have been proposed to accelerate the computation. The core idea is to train an encoder to transform trajectories into representation vectors and then compute vector similarity to approximate the ground truth. However, existing methods face dual challenges of effectiveness and efficiency: 1) they all utilize Euclidean distance to compute representation similarity, which leads to the severe curse of dimensionality issue – reducing the distinguishability among representations and significantly affecting the accuracy of subsequent similarity search tasks; 2) most of them are trained in a triplet manner and often necessitate additional information which downgrades the efficiency; 3) previous studies, while emphasizing the scalability in terms of efficiency, overlooked the deterioration of effectiveness when the dataset size grows. To cope with these issues, we propose a simple, yet accurate, fast, scalable model that only uses a single-layer vanilla transformer encoder as the feature extractor and employs tailored representation similarity functions to approximate various ground truth similarity measures. Extensive experiments demonstrate our model significantly mitigates the curse of dimensionality issue and outperforms the state of the art in effectiveness, efficiency, and scalability.
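
The curse-of-dimensionality point, that Euclidean distance contrast shrinks as dimension grows, can be checked directly; this small experiment is ours, not the paper's:

```python
import math, random

random.seed(0)

def ratio_contrast(d, n=200):
    # Distance contrast: (max - min) / min over distances from a query to n points.
    pts = [[random.random() for _ in range(d)] for _ in range(n)]
    q = [random.random() for _ in range(d)]
    dists = [math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q))) for p in pts]
    return (max(dists) - min(dists)) / min(dists)

low_dim = ratio_contrast(2)     # large contrast: nearest and farthest differ a lot
high_dim = ratio_contrast(500)  # distances concentrate: contrast collapses
```

In high dimension, all pairwise Euclidean distances concentrate around the same value, which is exactly the loss of distinguishability that motivates the paper's tailored similarity functions.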

[LG-13] Enhancing AI Accessibility in Veterinary Medicine: Linking Classifiers and Electronic Health Records

链接: https://arxiv.org/abs/2410.14625
作者: Chun Yin Kong,Picasso Vasquez,Makan Farhoodimoghadam,Chris Brandt,Titus C. Brown,Krystle L. Reagan,Allison Zwingenberger,Stefan M. Keller
关键词-EN: integrating machine learning, clinical decision-making tools, electronic health records, rapidly evolving landscape, improve diagnostic accuracy
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the rapidly evolving landscape of veterinary healthcare, integrating machine learning (ML) clinical decision-making tools with electronic health records (EHRs) promises to improve diagnostic accuracy and patient care. However, the seamless integration of ML classifiers into existing EHRs in veterinary medicine is frequently hindered by the rigidity of EHR systems or the limited availability of IT resources. To address this shortcoming, we present Anna, a freely-available software solution that provides ML classifier results for EHR laboratory data in real-time.

[LG-14] Benchmarking Deep Reinforcement Learning for Navigation in Denied Sensor Environments

链接: https://arxiv.org/abs/2410.14616
作者: Mariusz Wisniewski,Paraskevas Chatzithanos,Weisi Guo,Antonios Tsourdos
关键词-EN: Deep Reinforcement learning, Deep Reinforcement, Reinforcement learning, enable autonomous navigation, enable autonomous
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 31 pages, 19 figures. For associated code, see this https URL

点击查看摘要

Abstract:Deep Reinforcement learning (DRL) is used to enable autonomous navigation in unknown environments. Most research assumes perfect sensor data, but real-world environments may contain natural and artificial sensor noise and denial. Here, we present a benchmark of both well-used and emerging DRL algorithms in a navigation task with configurable sensor denial effects. In particular, we are interested in comparing how different DRL methods (e.g. model-free PPO vs. model-based DreamerV3) are affected by sensor denial. We show that DreamerV3 outperforms other methods in the visual end-to-end navigation task with a dynamic goal - and other methods are not able to learn this. Furthermore, DreamerV3 generally outperforms other methods in sensor-denied environments. In order to improve robustness, we use adversarial training and demonstrate an improved performance in denied environments, although this generally comes with a performance cost on the vanilla environments. We anticipate this benchmark of different DRL methods and the usage of adversarial training to be a starting point for the development of more elaborate navigation strategies that are capable of dealing with uncertain and denied sensor readings.
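
A configurable sensor-denial effect can be sketched as an environment wrapper; the toy environment, heuristic policy, and zero-out denial model below are invented for illustration:

```python
import random

random.seed(0)

class ToyEnv:
    """1-D navigation toy: the observation is the signed distance to the goal."""
    def __init__(self):
        self.pos, self.goal = 0, 5

    def reset(self):
        self.pos = 0
        return float(self.goal - self.pos)

    def step(self, a):   # a in {-1, +1}
        self.pos += a
        obs = float(self.goal - self.pos)
        return obs, -abs(obs), obs == 0.0   # (observation, reward, done)

class SensorDenial:
    """Wrapper that randomly denies (zeroes out) the sensor reading."""
    def __init__(self, env, p_denied):
        self.env, self.p = env, p_denied

    def reset(self):
        return self._corrupt(self.env.reset())

    def step(self, a):
        obs, r, done = self.env.step(a)
        return self._corrupt(obs), r, done

    def _corrupt(self, obs):
        return 0.0 if random.random() < self.p else obs

env = SensorDenial(ToyEnv(), p_denied=0.3)
obs = env.reset()
denied = 0
for _ in range(100):
    a = 1 if obs > 0 else -1          # simple heuristic policy on the (noisy) obs
    obs, r, done = env.step(a)
    if obs == 0.0 and not done:
        denied += 1                   # a corrupted reading (true obs is nonzero here)
    if done:
        obs = env.reset()
```

Wrapping the environment keeps the denial model independent of the agent, which is what makes such a benchmark configurable across DRL methods.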

[LG-15] Streaming Deep Reinforcement Learning Finally Works

链接: https://arxiv.org/abs/2410.14606
作者: Mohamed Elsayed,Gautham Vasan,A. Rupam Mahmood
关键词-EN: Natural intelligence processes, intelligence processes experience, real time, Natural intelligence, mimics natural learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Natural intelligence processes experience as a continuous stream, sensing, acting, and learning moment-by-moment in real time. Streaming learning, the modus operandi of classic reinforcement learning (RL) algorithms like Q-learning and TD, mimics natural learning by using the most recent sample without storing it. This approach is also ideal for resource-constrained, communication-limited, and privacy-sensitive applications. However, in deep RL, learners almost always use batch updates and replay buffers, making them computationally expensive and incompatible with streaming learning. Although the prevalence of batch deep RL is often attributed to its sample efficiency, a more critical reason for the absence of streaming deep RL is its frequent instability and failure to learn, which we refer to as stream barrier. This paper introduces the stream-x algorithms, the first class of deep RL algorithms to overcome stream barrier for both prediction and control and match sample efficiency of batch RL. Through experiments in Mujoco Gym, DM Control Suite, and Atari Games, we demonstrate stream barrier in existing algorithms and successful stable learning with our stream-x algorithms: stream Q, stream AC, and stream TD, achieving the best model-free performance in DM Control Dog environments. A set of common techniques underlies the stream-x algorithms, enabling their success with a single set of hyperparameters and allowing for easy extension to other algorithms, thereby reviving streaming RL.
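
A hedged sketch of the streaming idea with classic tabular TD(0), which uses each transition once and then discards it (the paper's stream-x algorithms are deep-RL variants; this toy shows only the principle):

```python
import random

random.seed(0)

# Classic 5-state random walk: episodes end on the left (reward 0) or right (reward 1).
N, alpha = 5, 0.1
V = [0.5] * N

# Streaming TD(0): every transition updates V immediately and is then discarded --
# no replay buffer, no batching.
for _ in range(2000):
    s = N // 2
    while True:
        s2 = s + random.choice([-1, 1])
        if s2 < 0:                         # left terminal, reward 0
            V[s] += alpha * (0.0 - V[s])
            break
        if s2 >= N:                        # right terminal, reward 1
            V[s] += alpha * (1.0 - V[s])
            break
        V[s] += alpha * (V[s2] - V[s])     # bootstrap from the next state
        s = s2
```

The true values here are (i+1)/6 for states 0..4, and the streaming updates approach them without ever storing a sample, which is the resource profile the paper targets.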

[LG-16] Learning to Control the Smoothness of Graph Convolutional Network Features

链接: https://arxiv.org/abs/2410.14604
作者: Shih-Hsin Wang,Justin Baker,Cory Hauck,Bao Wang
关键词-EN: Oono and Suzuki, Cai and Wang, work of Oono, node classification, non-smooth feature components
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 48 pages

点击查看摘要

Abstract:The pioneering work of Oono and Suzuki [ICLR, 2020] and Cai and Wang [arXiv:2006.13318] initiated the analysis of the smoothness of graph convolutional network (GCN) features. Their results reveal an intricate empirical correlation between node classification accuracy and the ratio of smooth to non-smooth feature components. However, the optimal ratio that favors node classification is unknown, and the non-smooth features of deep GCN with ReLU or leaky ReLU activation function diminish. In this paper, we propose a new strategy to let GCN learn node features with a desired smoothness – adapting to data and tasks – to enhance node classification. Our approach has three key steps: (1) We establish a geometric relationship between the input and output of ReLU or leaky ReLU. (2) Building on our geometric insights, we augment the message-passing process of graph convolutional layers (GCLs) with a learnable term to modulate the smoothness of node features with computational efficiency. (3) We investigate the achievable ratio between smooth and non-smooth feature components for GCNs with the augmented message-passing scheme. Our extensive numerical results show that the augmented message-passing schemes significantly improve node classification for GCN and some related models.
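
Step (2)'s idea of modulating smoothness during message passing can be illustrated on a toy graph; the blending coefficient `s` is hand-set here, whereas the paper learns it:

```python
# Path graph 0-1-2-3 with alternating (maximally non-smooth) node features.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
x = [0.0, 1.0, 0.0, 1.0]

def propagate(x, s):
    # Augmented message passing: blend each node with its neighbourhood mean;
    # the coefficient s (learnable in the paper, hand-set here) modulates
    # how much smoothing the layer applies.
    out = []
    for i in range(len(x)):
        nbr = sum(x[j] for j in adj[i]) / len(adj[i])
        out.append((1 - s) * x[i] + s * nbr)
    return out

def dirichlet(x):
    # Dirichlet energy: sum of squared feature differences across edges
    # (a standard smoothness measure; smaller = smoother).
    return sum((x[i] - x[j]) ** 2 for i in adj for j in adj[i] if i < j)

before = dirichlet(x)
after = dirichlet(propagate(x, 0.5))
```

Varying `s` trades off between preserving the input features and averaging them over neighbourhoods, i.e. it directly controls the smooth/non-smooth ratio the abstract discusses.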

[LG-17] How Does Data Diversity Shape the Weight Landscape of Neural Networks?

链接: https://arxiv.org/abs/2410.14602
作者: Yang Ba,Michelle V. Mancenido,Rong Pan
关键词-EN: weight decay, machine learning models, enhance the generalization, generalization of machine, Random Matrix Theory
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:To enhance the generalization of machine learning models to unseen data, techniques such as dropout, weight decay ( L_2 regularization), and noise augmentation are commonly employed. While regularization methods (i.e., dropout and weight decay) are geared toward adjusting model parameters to prevent overfitting, data augmentation increases the diversity of the input training set, a method purported to improve accuracy and calibration error. In this paper, we investigate the impact of each of these techniques on the parameter space of neural networks, with the goal of understanding how they alter the weight landscape in transfer learning scenarios. To accomplish this, we employ Random Matrix Theory to analyze the eigenvalue distributions of pre-trained models, fine-tuned using these techniques but using different levels of data diversity, for the same downstream tasks. We observe that diverse data influences the weight landscape in a similar fashion as dropout. Additionally, we compare commonly used data augmentation methods with synthetic data created by generative models. We conclude that synthetic data can bring more diversity into real input data, resulting in a better performance on out-of-distribution test instances.
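
A toy version of the spectral analysis: adding a rank-one "signal" to an i.i.d. random matrix produces an outlier above the random-matrix bulk, which power iteration can detect (the matrix size and the 0.3 spike strength are invented for illustration):

```python
import math, random

random.seed(0)
n = 30

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

def top_singular(M, iters=200):
    # Power iteration on M^T M estimates the largest singular value of M.
    Mt = [[M[j][i] for j in range(n)] for i in range(n)]
    v = [random.gauss(0, 1) for _ in range(n)]
    for _ in range(iters):
        w = matvec(Mt, matvec(M, v))
        nrm = math.sqrt(sum(t * t for t in w))
        v = [t / nrm for t in w]
    w = matvec(M, v)
    return math.sqrt(sum(t * t for t in w))

# "Untrained" weights: pure i.i.d. noise with variance 1/n (bulk spectrum only,
# top singular value near the Marchenko-Pastur edge of 2).
noise = [[random.gauss(0, 1 / math.sqrt(n)) for _ in range(n)] for _ in range(n)]

# "Trained-looking" weights: the same noise plus a strong rank-one component,
# which shows up as a spectral outlier above the bulk.
u = [random.gauss(0, 1) for _ in range(n)]
spiked = [[noise[i][j] + 0.3 * u[i] * u[j] for j in range(n)] for i in range(n)]

s_noise = top_singular(noise)
s_spiked = top_singular(spiked)
```

Outliers above the random bulk are the spectral signature of learned structure, which is the kind of eigenvalue-distribution evidence the paper uses to compare regularization and data-diversity effects.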

[LG-18] Learning With Multi-Group Guarantees For Clusterable Subpopulations

链接: https://arxiv.org/abs/2410.14588
作者: Jessica Dai,Nika Haghtalab,Eric Zhao
关键词-EN: canonical desideratum, subpopulations, relevant subpopulations, prediction problems, meaningful subpopulations
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:A canonical desideratum for prediction problems is that performance guarantees should hold not just on average over the population, but also for meaningful subpopulations within the overall population. But what constitutes a meaningful subpopulation? In this work, we take the perspective that relevant subpopulations should be defined with respect to the clusters that naturally emerge from the distribution of individuals for which predictions are being made. In this view, a population refers to a mixture model whose components constitute the relevant subpopulations. We suggest two formalisms for capturing per-subgroup guarantees: first, by attributing each individual to the component from which they were most likely drawn, given their features; and second, by attributing each individual to all components in proportion to their relative likelihood of having been drawn from each component. Using online calibration as a case study, we study a variational algorithm that provides guarantees for each of these formalisms by handling all plausible underlying subpopulation structures simultaneously, and achieve an O(T^{1/2}) rate even when the subpopulations are not well-separated. In comparison, the more natural cluster-then-predict approach that first recovers the structure of the subpopulations and then makes predictions suffers from an O(T^{2/3}) rate and requires the subpopulations to be separable. Along the way, we prove that providing per-subgroup calibration guarantees for underlying clusters can be easier than learning the clusters: separation between median subgroup features is required for the latter but not the former.
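
The two attribution formalisms can be sketched on a known 1-D two-Gaussian mixture (the parameters are invented, and the paper works with unknown mixtures):

```python
import math, random

random.seed(0)

# 1-D mixture of two Gaussians (the "subpopulations"); parameters known for the sketch.
mus, sigma, w = [-2.0, 2.0], 1.0, [0.5, 0.5]

def pdf(x, mu):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def responsibilities(x):
    # Soft attribution: posterior probability of each component given x.
    ps = [w[k] * pdf(x, mus[k]) for k in range(2)]
    z = sum(ps)
    return [p / z for p in ps]

xs = [random.gauss(random.choice(mus), sigma) for _ in range(1000)]

# Formalism 1: hard attribution to the most-likely component.
hard = [max(range(2), key=lambda k: responsibilities(x)[k]) for x in xs]
# Formalism 2: soft attribution, in proportion to relative likelihood.
soft = [responsibilities(x) for x in xs]

# Soft attributions average back to the mixture weights.
avg = [sum(r[k] for r in soft) / len(soft) for k in range(2)]
```

A per-subgroup guarantee under formalism 1 holds on the hard partition, while under formalism 2 each individual contributes fractionally to every component's guarantee.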

[LG-19] Neuro-Symbolic Traders: Assessing the Wisdom of AI Crowds in Markets

链接: https://arxiv.org/abs/2410.14587
作者: Namid R. Stillman,Rory Baggott
关键词-EN: Deep generative models, Deep generative, generative models, Deep, generative
类目: Machine Learning (cs.LG); Computational Finance (q-fin.CP)
*备注: 8 pages, 4 figures, ACM format

点击查看摘要

Abstract:Deep generative models are becoming increasingly used as tools for financial analysis. However, it is unclear how these models will influence financial markets, especially when they infer financial value in a semi-autonomous way. In this work, we explore the interplay between deep generative models and market dynamics. We develop a form of virtual traders that use deep generative models to make buy/sell decisions, which we term neuro-symbolic traders, and expose them to a virtual market. Under our framework, neuro-symbolic traders are agents that use vision-language models to discover a model of the fundamental value of an asset. Agents develop this model as a stochastic differential equation, calibrated to market data using gradient descent. We test our neuro-symbolic traders on both synthetic data and real financial time series, including an equity stock, commodity, and a foreign exchange pair. We then expose several groups of neuro-symbolic traders to a virtual market environment. This market environment allows for feedback between the traders' beliefs about the underlying value and the observed price dynamics. We find that this leads to price suppression compared to the historical data, highlighting a future risk to market stability. Our work is a first step towards quantifying the effect of deep generative agents on market dynamics and sets out some of the potential risks and benefits of this approach in the future.
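
A minimal stand-in for calibrating an agent's fundamental-value SDE to market data: for a mean-reverting model the Euler-discretized dynamics are linear in price, so ordinary least squares recovers the parameters (the paper uses vision-language models to discover the SDE and gradient descent to calibrate it; this toy covers only the calibration step, with invented parameters):

```python
import random

random.seed(0)

# Synthetic "market" prices from a mean-reverting (Ornstein-Uhlenbeck) process.
kappa_true, mu_true, dt, sigma = 0.5, 10.0, 0.1, 0.2
prices = [8.0]
for _ in range(500):
    x = prices[-1]
    prices.append(x + kappa_true * (mu_true - x) * dt
                  + sigma * random.gauss(0, dt ** 0.5))

# Calibrate the fundamental-value model dX = kappa*(mu - X) dt to the path:
# one-step increments regress linearly on price, dX ~ a + b*X, with
# a = kappa*mu*dt and b = -kappa*dt, so least squares recovers both parameters.
xs = prices[:-1]
dx = [x1 - x0 for x0, x1 in zip(prices, prices[1:])]
n = len(xs)
mx, md = sum(xs) / n, sum(dx) / n
b = (sum((x - mx) * (y - md) for x, y in zip(xs, dx))
     / sum((x - mx) ** 2 for x in xs))
a = md - b * mx
kappa_hat = -b / dt
mu_hat = a / (kappa_hat * dt)
```

In the paper's virtual market, each agent's calibrated belief about the fundamental value then feeds back into the prices the other agents observe.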

[LG-20] Neural Combinatorial Clustered Bandits for Recommendation Systems

链接: https://arxiv.org/abs/2410.14586
作者: Baran Atalar,Carlee Joe-Wong
关键词-EN: individual base arms, base arm rewards, unknown reward functions, super arm, individual base
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We consider the contextual combinatorial bandit setting where in each round, the learning agent, e.g., a recommender system, selects a subset of “arms,” e.g., products, and observes rewards for both the individual base arms, which are a function of known features (called “context”), and the super arm (the subset of arms), which is a function of the base arm rewards. The agent’s goal is to simultaneously learn the unknown reward functions and choose the highest-reward arms. For example, the “reward” may represent a user’s probability of clicking on one of the recommended products. Conventional bandit models, however, employ restrictive reward function models in order to obtain performance guarantees. We make use of deep neural networks to estimate and learn the unknown reward functions and propose Neural UCB Clustering (NeUClust), which adopts a clustering approach to select the super arm in every round by exploiting underlying structure in the context space. Unlike prior neural bandit works, NeUClust uses a neural network to estimate the super arm reward and select the super arm, thus eliminating the need for a known optimization oracle. We non-trivially extend prior neural combinatorial bandit works to prove that NeUClust achieves \widetilde{O}(\widetilde{d}\sqrt{T}) regret, where \widetilde{d} is the effective dimension of a neural tangent kernel matrix and T the number of rounds. Experiments on real world recommendation datasets show that NeUClust achieves better regret and reward than other contextual combinatorial and neural bandit algorithms.
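
The underlying combinatorial-UCB primitive can be sketched with empirical means standing in for NeUClust's neural reward estimates and clustering, so this is an illustration of the setting, not the proposed method:

```python
import numpy as np

def select_super_arm(means, counts, t, k):
    """Pick the k base arms with the highest UCB scores as the super arm."""
    ucb = means + np.sqrt(2 * np.log(t) / np.maximum(counts, 1))
    return np.argsort(-ucb)[:k]

# Made-up empirical statistics for four base arms.
means = np.array([0.1, 0.5, 0.4, 0.9])
counts = np.array([10, 10, 2, 10])

chosen = select_super_arm(means, counts, t=50, k=2)
print(sorted(chosen.tolist()))  # [2, 3]: the undersampled arm 2 and the high-mean arm 3
```

The exploration bonus favors rarely-pulled arms; NeUClust replaces both the mean estimate and the super-arm selection with neural components.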

[LG-21] Optimizing Attention with Mirror Descent: Generalized Max-Margin Token Selection

链接: https://arxiv.org/abs/2410.14581
作者: Aaron Alvarado Kristanto Julistiono,Davoud Ataee Tarzanagh,Navid Azizan
关键词-EN: natural language processing, artificial intelligence, computer vision, revolutionized several domains, domains of artificial
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Attention mechanisms have revolutionized several domains of artificial intelligence, such as natural language processing and computer vision, by enabling models to selectively focus on relevant parts of the input data. While recent work has characterized the optimization dynamics of gradient descent (GD) in attention-based models and the structural properties of its preferred solutions, less is known about more general optimization algorithms such as mirror descent (MD). In this paper, we investigate the convergence properties and implicit biases of a family of MD algorithms tailored for softmax attention mechanisms, with the potential function chosen as the p -th power of the \ell_p -norm. Specifically, we show that these algorithms converge in direction to a generalized hard-margin SVM with an \ell_p -norm objective when applied to a classification problem using a softmax attention model. Notably, our theoretical results reveal that the convergence rate is comparable to that of traditional GD in simpler models, despite the highly nonlinear and nonconvex nature of the present problem. Additionally, we delve into the joint optimization dynamics of the key-query matrix and the decoder, establishing conditions under which this complex joint optimization converges to their respective hard-margin SVM solutions. Lastly, our numerical experiments on real data demonstrate that MD algorithms improve generalization over standard GD and excel in optimal token selection.
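
The mirror-descent family discussed above can be sketched for the p-norm potential psi(w) = ||w||_p^p / p: take a gradient step in the dual (mirror) space and map back. This is a generic MD step on a weight vector, not the paper's softmax-attention setup; for p = 2 it reduces to plain gradient descent:

```python
import numpy as np

def mirror_step(w, grad, lr, p):
    """One mirror-descent step with potential psi(w) = ||w||_p^p / p."""
    q = p - 1
    dual = np.sign(w) * np.abs(w) ** q                 # mirror map: gradient of psi
    dual = dual - lr * grad                            # gradient step in dual space
    return np.sign(dual) * np.abs(dual) ** (1.0 / q)   # inverse mirror map

w = np.array([1.0, -2.0, 0.5])
g = np.array([0.3, -0.1, 0.2])

# Sanity check: with p = 2 the update coincides with gradient descent.
print(np.allclose(mirror_step(w, g, 0.1, 2.0), w - 0.1 * g))  # True
```

Choosing p closer to 1 biases the iterates toward sparser solutions, which is one motivation for studying the implicit bias of this family.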

[LG-22] Towards Unsupervised Validation of Anomaly-Detection Models

链接: https://arxiv.org/abs/2410.14579
作者: Lihi Idan
关键词-EN: highly challenging task, highly challenging, validation, challenging task, unsupervised model-validation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Unsupervised validation of anomaly-detection models is a highly challenging task. While the common practices for model validation involve a labeled validation set, such validation sets cannot be constructed when the underlying datasets are unlabeled. The lack of robust and efficient unsupervised model-validation techniques presents an acute challenge in the implementation of automated anomaly-detection pipelines, especially when there exists no prior knowledge of the model’s performance on similar datasets. This work presents a new paradigm to automated validation of anomaly-detection models, inspired by real-world, collaborative decision-making mechanisms. We focus on two commonly-used, unsupervised model-validation tasks – model selection and model evaluation – and provide extensive experimental results that demonstrate the accuracy and robustness of our approach on both tasks.

[LG-23] Large Language Models Are Overparameterized Text Encoders

链接: https://arxiv.org/abs/2410.14578
作者: Thennal D K,Tim Fischer,Chris Biemann
关键词-EN: supervised contrastive training, text, text embedding, Large language models, Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 pages of content + 1 for limitations and ethical considerations, 14 pages in total including references and appendix, 5+1 figures

点击查看摘要

Abstract:Large language models (LLMs) demonstrate strong performance as text embedding models when finetuned with supervised contrastive training. However, their large size balloons inference time and memory requirements. In this paper, we show that by pruning the last p% layers of an LLM before supervised training for only 1000 steps, we can achieve a proportional reduction in memory and inference time. We evaluate four different state-of-the-art LLMs on text embedding tasks and find that our method can prune up to 30% of layers with negligible impact on performance and up to 80% with only a modest drop. With only three lines of code, our method is easily implemented in any pipeline for transforming LLMs to text encoders. We also propose L^3 Prune, a novel layer-pruning strategy based on the model’s initial loss that provides two optimal pruning configurations: a large variant with negligible performance loss and a small variant for resource-constrained settings. On average, the large variant prunes 21% of the parameters with a -0.3 performance drop, and the small variant only suffers from a -5.1 decrease while pruning 74% of the model. We consider these results strong evidence that LLMs are overparameterized for text embedding tasks, and can be easily pruned.
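
The core pruning operation is simple enough to sketch. Below, a list of block names stands in for transformer layers; the rounding rule is our simplification, and the actual L^3 Prune strategy picks the cut point from the model's initial loss:

```python
def prune_last_layers(layers, fraction):
    """Drop the trailing `fraction` of layers, keeping at least one."""
    keep = max(1, round(len(layers) * (1.0 - fraction)))
    return layers[:keep]

layers = [f"block_{i}" for i in range(32)]   # e.g. a 32-layer LLM
small = prune_last_layers(layers, 0.74)      # aggressive variant
large = prune_last_layers(layers, 0.21)      # conservative variant
print(len(small), len(large))                # 8 25
```

After pruning, the truncated model is briefly finetuned with the contrastive objective; memory and latency shrink roughly in proportion to the layers removed.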

[LG-24] MomentumSMoE: Integrating Momentum into Sparse Mixture of Experts NEURIPS2024

链接: https://arxiv.org/abs/2410.14574
作者: Rachel S.Y. Teo,Tan M. Nguyen
关键词-EN: unlocking unparalleled scalability, Sparse Mixture, deep learning, key to unlocking, unlocking unparalleled
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注: 10 pages in the main text. Published at NeurIPS 2024. The code is available at this https URL

点击查看摘要

Abstract:Sparse Mixture of Experts (SMoE) has become the key to unlocking unparalleled scalability in deep learning. SMoE has the potential to exponentially increase parameter count while maintaining the efficiency of the model by only activating a small subset of these parameters for a given sample. However, it has been observed that SMoE suffers from unstable training and has difficulty adapting to new distributions, leading to the model’s lack of robustness to data contamination. To overcome these limitations, we first establish a connection between the dynamics of the expert representations in SMoEs and gradient descent on a multi-objective optimization problem. Leveraging our framework, we then integrate momentum into SMoE and propose a new family of SMoEs named MomentumSMoE. We theoretically prove and numerically demonstrate that MomentumSMoE is more stable and robust than SMoE. In particular, we verify the advantages of MomentumSMoE over SMoE on a variety of practical tasks including ImageNet-1K object recognition and WikiText-103 language modeling. We demonstrate the applicability of MomentumSMoE to many types of SMoE models, including those in the Sparse MoE model for vision (V-MoE) and the Generalist Language Model (GLaM). We also show that other advanced momentum-based optimization methods, such as Adam, can be easily incorporated into the MomentumSMoE framework for designing new SMoE models with even better performance, almost negligible additional computation cost, and simple implementations.
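
The optimization primitive being integrated is classical heavy-ball momentum. The sketch below applies it to a plain quadratic rather than to expert representations, purely to show the update rule:

```python
import numpy as np

def momentum_step(x, grad, velocity, lr=0.1, mu=0.9):
    """Heavy-ball momentum: accumulate a velocity, then move along it."""
    velocity = mu * velocity - lr * grad
    return x + velocity, velocity

x = np.array([1.0, 1.0])
v = np.zeros_like(x)
grad = lambda x: 2 * x                    # gradient of f(x) = ||x||^2
for _ in range(300):
    x, v = momentum_step(x, grad(x), v)
print(np.allclose(x, 0.0, atol=1e-4))     # True: converges to the minimum
```

MomentumSMoE applies this idea to the dynamics of expert representations rather than to raw parameters, and the paper notes that richer momentum schemes (e.g. Adam-style updates) slot into the same framework.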

[LG-25] Building Trust in Black-box Optimization: A Comprehensive Framework for Explainability

链接: https://arxiv.org/abs/2410.14573
作者: Nazanin Nezami,Hadis Anahideh
关键词-EN: Optimizing costly black-box, Optimizing costly, costly black-box functions, constrained evaluation budget, evaluation budget presents
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Optimizing costly black-box functions within a constrained evaluation budget presents significant challenges in many real-world applications. Surrogate Optimization (SO) is a common resolution, yet its proprietary nature introduced by the complexity of surrogate models and the sampling core (e.g., acquisition functions) often leads to a lack of explainability and transparency. While existing literature has primarily concentrated on enhancing convergence to global optima, the practical interpretation of newly proposed strategies remains underexplored, especially in batch evaluation settings. In this paper, we propose Inclusive Explainability Metrics for Surrogate Optimization (IEMSO), a comprehensive set of model-agnostic metrics designed to enhance the transparency, trustworthiness, and explainability of the SO approaches. Through these metrics, we provide both intermediate and post-hoc explanations to practitioners before and after performing expensive evaluations to gain trust. We consider four primary categories of metrics, each targeting a specific aspect of the SO process: Sampling Core Metrics, Batch Properties Metrics, Optimization Process Metrics, and Feature Importance. Our experimental evaluations demonstrate the significant potential of the proposed metrics across different benchmarks.

[LG-26] Understanding the difficulty of low-precision post-training quantization of large language models

链接: https://arxiv.org/abs/2410.14570
作者: Zifei Xu,Sayeh Sharify,Wanzin Yazar,Tristan Webb,Xin Wang
关键词-EN: high parameter counts, computationally expensive, Large language models, high parameter, parameter counts
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models of high parameter counts are computationally expensive, yet can be made much more efficient by compressing their weights to very low numerical precision. This can be achieved either through post-training quantization by minimizing local, layer-wise quantization errors, or through quantization-aware fine-tuning by minimizing the global loss function. In this study, we discovered that, under the same data constraint, the former approach nearly always fared worse than the latter, a phenomenon particularly prominent when the numerical precision is very low. We further showed that this difficulty of post-training quantization arose from stark misalignment between optimization of the local and global objective functions. Our findings explain the limited utility of minimizing local quantization error and the importance of direct quantization-aware fine-tuning in the regime of large models at very low precision.
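
The "local, layer-wise quantization error" being minimized can be illustrated with a toy uniform symmetric quantizer (our simplification; real PTQ methods are considerably subtler). Note how sharply the local error grows at very low precision:

```python
import numpy as np

def quantize(w, bits):
    """Uniform symmetric quantization of a weight tensor to `bits` bits."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000)                    # stand-in for one layer's weights

err8 = np.mean((w - quantize(w, 8)) ** 2)    # local quantization error, 8-bit
err3 = np.mean((w - quantize(w, 3)) ** 2)    # much larger at very low precision
print(err3 > 100 * err8)                     # True
```

Post-training quantization minimizes this local error layer by layer; the paper's point is that at very low precision a small local error no longer tracks the global loss, which is why quantization-aware fine-tuning wins.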

[LG-27] Measuring Diversity: Axioms and Challenges

链接: https://arxiv.org/abs/2410.14556
作者: Mikhail Mironov,Liudmila Prokhorenkova
关键词-EN: recommender systems, image or molecule, molecule generation, generation to recommender, diversity
类目: Machine Learning (cs.LG)
*备注: 17 pages, 7 figures

点击查看摘要

Abstract:The concept of diversity is widely used in various applications: from image or molecule generation to recommender systems. Thus, being able to properly measure diversity is important. This paper addresses the problem of quantifying diversity for a set of objects. First, we make a systematic review of existing diversity measures and explore their undesirable behavior in some cases. Based on this review, we formulate three desirable properties (axioms) of a reliable diversity measure: monotonicity, uniqueness, and continuity. We show that none of the existing measures has all three properties and thus these measures are not suitable for quantifying diversity. Then, we construct two examples of measures that have all the desirable properties, thus proving that the list of axioms is not self-contradicting. Unfortunately, the constructed examples are too computationally complex for practical use, thus we pose an open problem of constructing a diversity measure that has all the listed properties and can be computed in practice.
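
One of the undesirable behaviours alluded to above is easy to reproduce: with average pairwise distance, merely duplicating an object changes the reported diversity, even though the set of distinct objects is unchanged (the example is ours, not from the paper):

```python
import numpy as np

def avg_pairwise_distance(xs):
    """Average distance over all ordered pairs of distinct indices."""
    xs = np.asarray(xs, dtype=float)
    n = len(xs)
    d = np.abs(xs[:, None] - xs[None, :])
    return d.sum() / (n * (n - 1))

base = [0.0, 1.0, 10.0]
with_duplicate = base + [10.0]   # same distinct objects, one repeated

a = avg_pairwise_distance(base)             # 20/3 ≈ 6.667
b = avg_pairwise_distance(with_duplicate)   # 6.5
print(a != b)  # True: the value changed although no new object was added
```

Behaviour like this is what motivates demanding explicit axioms (monotonicity, uniqueness, continuity) rather than reaching for the most convenient formula.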

[LG-28] Boosting K-means for Big Data by Fusing Data Streaming with Global Optimization

链接: https://arxiv.org/abs/2410.14548
作者: Ravil Mussabayev,Rustam Mussabayev
关键词-EN: K-means clustering, optimize K-means clustering, Variable Neighborhood Search, deteriorates when confronted, confronted with massive
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:K-means clustering is a cornerstone of data mining, but its efficiency deteriorates when confronted with massive datasets. To address this limitation, we propose a novel heuristic algorithm that leverages the Variable Neighborhood Search (VNS) metaheuristic to optimize K-means clustering for big data. Our approach is based on the sequential optimization of the partial objective function landscapes obtained by restricting the Minimum Sum-of-Squares Clustering (MSSC) formulation to random samples from the original big dataset. Within each landscape, systematically expanding neighborhoods of the currently best (incumbent) solution are explored by reinitializing all degenerate and a varying number of additional centroids. Extensive and rigorous experimentation on a large number of real-world datasets reveals that by transforming the traditional local search into a global one, our algorithm significantly enhances the accuracy and efficiency of K-means clustering in big data environments, becoming the new state of the art in the field.
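
Two of the ingredients above, Lloyd iterations on random samples of the big dataset and reinitialization of degenerate centroids, can be sketched as follows; the VNS neighborhood search itself is omitted, so this is only the skeleton:

```python
import numpy as np

def sampled_kmeans(data, k, sample_size=200, iters=20, seed=0):
    """Lloyd steps on fresh random samples, reinitializing empty centroids."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(iters):
        sample = data[rng.choice(len(data), size=sample_size, replace=False)]
        dists = ((sample[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = sample[labels == j]
            if len(members) == 0:      # degenerate centroid: reinitialize
                centroids[j] = sample[rng.integers(len(sample))]
            else:
                centroids[j] = members.mean(axis=0)
    return centroids

rng = np.random.default_rng(1)
blobs = np.concatenate([rng.normal(m, 0.3, size=(500, 2)) for m in (0.0, 5.0)])
c = sampled_kmeans(blobs, k=2)
print(c.shape)  # (2, 2); the two centroids land near the two blob centers
```

Working on samples keeps each Lloyd step cheap regardless of dataset size; the paper's contribution is to wrap such steps in a VNS loop that explores systematically expanding neighborhoods of the incumbent solution.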

[LG-29] Using Sentiment and Technical Analysis to Predict Bitcoin with Machine Learning

链接: https://arxiv.org/abs/2410.14532
作者: Arthur Emanuel de Oliveira Carosia
关键词-EN: gained significant attention, recent years due, Cryptocurrencies have gained, Technical Analysis indicators, gained significant
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:Cryptocurrencies have gained significant attention in recent years due to their decentralized nature and potential for financial innovation. Thus, the ability to accurately predict their prices has become a subject of great interest for investors, traders, and researchers. Some works in the literature show how Bitcoin’s market sentiment correlates with its price fluctuations in the market. However, papers that consider the sentiment of the market associated with financial Technical Analysis indicators in order to predict Bitcoin’s price are still scarce. In this paper, we present a novel approach for predicting Bitcoin price movements by combining the Fear & Greed Index, a measure of market sentiment, Technical Analysis indicators, and the potential of Machine Learning algorithms. This work represents a preliminary study on the importance of sentiment metrics in cryptocurrency forecasting. Our initial experiments demonstrate promising results considering investment returns, surpassing the Buy & Hold baseline, and offering valuable insights into combining sentiment and market indicators in a cryptocurrency prediction model.
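
The feature side of such a pipeline can be sketched by aligning a Technical Analysis indicator with a sentiment reading. The price series and sentiment values below are synthetic, and the feature layout is our assumption, not the paper's pipeline:

```python
import numpy as np

def sma(prices, window):
    """Simple moving average, a basic Technical Analysis indicator."""
    kernel = np.ones(window) / window
    return np.convolve(prices, kernel, mode="valid")

prices = np.array([100.0, 102, 101, 105, 107, 106, 110, 108])
sentiment = np.array([55, 60, 48, 70, 75, 72, 80, 65])  # e.g. a 0-100 index

ind = sma(prices, 3)
features = np.column_stack([ind, sentiment[2:]])  # align indicator with dates
print(features.shape)  # (6, 2): one row per day, indicator + sentiment
```

A classifier trained on such rows (predicting next-day direction, say) is the kind of model the paper benchmarks against a Buy & Hold strategy.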

[LG-30] Domain Adaptive Safety Filters via Deep Operator Learning

链接: https://arxiv.org/abs/2410.14528
作者: Lakshmideepakreddy Manda,Shaoru Chen,Mahyar Fazlyab
关键词-EN: Control Barrier Functions, constructing Control Barrier, Barrier Functions, safety-critical control systems, Learning-based approaches
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 63rd IEEE Conference on Decision and Control (CDC)

点击查看摘要

Abstract:Learning-based approaches for constructing Control Barrier Functions (CBFs) are increasingly being explored for safety-critical control systems. However, these methods typically require complete retraining when applied to unseen environments, limiting their adaptability. To address this, we propose a self-supervised deep operator learning framework that learns the mapping from environmental parameters to the corresponding CBF, rather than learning the CBF directly. Our approach leverages the residual of a parametric Partial Differential Equation (PDE), where the solution defines a parametric CBF approximating the maximal control invariant set. This framework accommodates complex safety constraints, higher relative degrees, and actuation limits. We demonstrate the effectiveness of the method through numerical experiments on navigation tasks involving dynamic obstacles.

[LG-31] Rethinking Distance Metrics for Counterfactual Explainability

链接: https://arxiv.org/abs/2410.14522
作者: Joshua Nathaniel Williams,Anurag Katakkar,Hoda Heidari,J. Zico Kolter
关键词-EN: Machine Learning, post-hoc explainability, counterfactual generation methods, settings in Machine, Learning
类目: Machine Learning (cs.LG)
*备注: 13 pages, 3 figures, 1 table

点击查看摘要

Abstract:Counterfactual explanations have been a popular method of post-hoc explainability for a variety of settings in Machine Learning. Such methods focus on explaining classifiers by generating new data points that are similar to a given reference, while receiving a more desirable prediction. In this work, we investigate a framing for counterfactual generation methods that considers counterfactuals not as independent draws from a region around the reference, but as jointly sampled with the reference from the underlying data distribution. Through this framing, we derive a distance metric, tailored for counterfactual similarity that can be applied to a broad range of settings. Through both quantitative and qualitative analyses of counterfactual generation methods, we show that this framing allows us to express more nuanced dependencies among the covariates.

[LG-32] Efficient Annotator Reliability Assessment and Sample Weighting for Knowledge-Based Misinformation Detection on Social Media

链接: https://arxiv.org/abs/2410.14515
作者: Owen Cook,Charlie Grimshaw,Ben Wu,Sophie Dillon,Jack Hicks,Luke Jones,Thomas Smith,Matyas Szert,Xingyi Song
关键词-EN: potentially vulnerable people, targetting potentially vulnerable, Misinformation spreads rapidly, social media, confusing the truth
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注: 8 pages, 3 figures, 3 tables. Code available here: this https URL

点击查看摘要

Abstract:Misinformation spreads rapidly on social media, confusing the truth and targeting potentially vulnerable people. To effectively mitigate the negative impact of misinformation, it must first be accurately detected before applying a mitigation strategy, such as X’s community notes, which is currently a manual process. This study takes a knowledge-based approach to misinformation detection, modelling the problem similarly to one of natural language inference. The EffiARA annotation framework is introduced, aiming to utilise inter- and intra-annotator agreement to understand the reliability of each annotator and influence the training of large language models for classification based on annotator reliability. In assessing the EffiARA annotation framework, the Russo-Ukrainian Conflict Knowledge-Based Misinformation Classification Dataset (RUC-MCD) was developed and made publicly available. This study finds that sample weighting using annotator reliability performs the best, utilising both inter- and intra-annotator agreement and soft-label training. The highest classification performance achieved using Llama-3.2-1B was a macro-F1 of 0.757 and 0.740 using TwHIN-BERT-large.

[LG-33] ANT: Adaptive Noise Schedule for Time Series Diffusion Models NEURIPS2024

链接: https://arxiv.org/abs/2410.14488
作者: Seunghan Lee,Kibok Lee,Taeyoung Park
关键词-EN: generative artificial intelligence, Time series diffusion, series diffusion models, diffusion models, optimal noise schedule
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Advances in diffusion models for generative artificial intelligence have recently propagated to the time series (TS) domain, demonstrating state-of-the-art performance on various tasks. However, prior works on TS diffusion models often borrow the framework of existing works proposed in other domains without considering the characteristics of TS data, leading to suboptimal performance. In this work, we propose Adaptive Noise schedule for Time series diffusion models (ANT), which automatically predetermines proper noise schedules for given TS datasets based on their statistics representing non-stationarity. Our intuition is that an optimal noise schedule should satisfy the following desiderata: 1) It linearly reduces the non-stationarity of TS data so that all diffusion steps are equally meaningful, 2) the data is corrupted to the random noise at the final step, and 3) the number of steps is sufficiently large. The proposed method is practical for use in that it eliminates the necessity of finding the optimal noise schedule with a small additional cost to compute the statistics for given datasets, which can be done offline before training. We validate the effectiveness of our method across various tasks, including TS forecasting, refinement, and generation, on datasets from diverse domains. Code is available at this repository: this https URL.

[LG-34] CaTs and DAGs: Integrating Directed Acyclic Graphs with Transformers and Fully-Connected Neural Networks for Causally Constrained Predictions

链接: https://arxiv.org/abs/2410.14485
作者: Matthew J. Vowels,Mathieu Rochat,Sina Akbari
关键词-EN: natural language processing, Artificial Neural Networks, Artificial Neural, including fully-connected networks, Neural Networks
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Artificial Neural Networks (ANNs), including fully-connected networks and transformers, are highly flexible and powerful function approximators, widely applied in fields like computer vision and natural language processing. However, their inability to inherently respect causal structures can limit their robustness, making them vulnerable to covariate shift and difficult to interpret and explain. This poses significant challenges for their reliability in real-world applications. In this paper, we introduce Causal Fully-Connected Neural Networks (CFCNs) and Causal Transformers (CaTs), two general model families designed to operate under predefined causal constraints, as specified by a Directed Acyclic Graph (DAG). These models retain the powerful function approximation abilities of traditional neural networks while adhering to the underlying structural constraints, improving robustness, reliability, and interpretability at inference time. This approach opens new avenues for deploying neural networks in more demanding, real-world scenarios where robustness and explainability are critical.
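
The causal-constraint idea can be illustrated in its simplest form: mask a linear layer's weights with the DAG adjacency so each output depends only on its parents (a far simpler construction than the actual CFCN/CaT architectures):

```python
import numpy as np

adj = np.array([[0, 1, 1],    # x0 -> x1, x0 -> x2
                [0, 0, 1],    # x1 -> x2
                [0, 0, 0]])   # x2 has no children

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3))
W_masked = W * adj            # W_masked[i, j] != 0 only if i is a parent of j

x = np.array([1.0, 2.0, 3.0])
out = x @ W_masked            # out[j] uses only the parents of node j

# Node 0 has no parents, so its output is invariant to the input.
x2 = np.array([-5.0, 9.0, 0.0])
print(np.isclose(out[0], (x2 @ W_masked)[0]))  # True: both are 0
```

Stacking such masked layers (or masking attention patterns analogously) keeps every prediction consistent with the DAG at inference time.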

[LG-35] Transfer Reinforcement Learning in Heterogeneous Action Spaces using Subgoal Mapping

链接: https://arxiv.org/abs/2410.14484
作者: Kavinayan P. Sivakumar,Yan Zhang,Zachary Bell,Scott Nivison,Michael M. Zavlanos
关键词-EN: learner agent, action spaces, expert agent, problem involving agents, expert agent policy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this paper, we consider a transfer reinforcement learning problem involving agents with different action spaces. Specifically, for any new unseen task, the goal is to use a successful demonstration of this task by an expert agent in its action space to enable a learner agent learn an optimal policy in its own different action space with fewer samples than those required if the learner was learning on its own. Existing transfer learning methods across different action spaces either require handcrafted mappings between those action spaces provided by human experts, which can induce bias in the learning procedure, or require the expert agent to share its policy parameters with the learner agent, which does not generalize well to unseen tasks. In this work, we propose a method that learns a subgoal mapping between the expert agent policy and the learner agent policy. Since the expert agent and the learner agent have different action spaces, their optimal policies can have different subgoal trajectories. We learn this subgoal mapping by training a Long Short Term Memory (LSTM) network for a distribution of tasks and then use this mapping to predict the learner subgoal sequence for unseen tasks, thereby improving the speed of learning by biasing the agent’s policy towards the predicted learner subgoal sequence. Through numerical experiments, we demonstrate that the proposed learning scheme can effectively find the subgoal mapping underlying the given distribution of tasks. Moreover, letting the learner agent imitate the expert agent’s policy with the learnt subgoal mapping can significantly improve the sample efficiency and training time of the learner agent in unseen new tasks.

[LG-36] Backdoored Retrievers for Prompt Injection Attacks on Retrieval Augmented Generation of Large Language Models

链接: https://arxiv.org/abs/2410.14479
作者: Cody Clop,Yannick Teglia
关键词-EN: Large Language Models, Large Language, Language Models, demonstrated remarkable capabilities, generating coherent text
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 12 pages, 5 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in generating coherent text but remain limited by the static nature of their training data. Retrieval Augmented Generation (RAG) addresses this issue by combining LLMs with up-to-date information retrieval, but also expands the attack surface of the system. This paper investigates prompt injection attacks on RAG, focusing on malicious objectives beyond misinformation, such as inserting harmful links, promoting unauthorized services, and initiating denial-of-service behaviors. We build upon existing corpus poisoning techniques and propose a novel backdoor attack aimed at the fine-tuning process of the dense retriever component. Our experiments reveal that corpus poisoning can achieve significant attack success rates through the injection of a small number of compromised documents into the retriever corpus. In contrast, backdoor attacks demonstrate even higher success rates but necessitate a more complex setup, as the victim must fine-tune the retriever using the attacker's poisoned dataset.

[LG-37] Laplace Transform Based Low-Complexity Learning of Continuous Markov Semigroups

链接: https://arxiv.org/abs/2410.14477
作者: Vladimir R. Kostic,Karim Lounici,Hélène Halconruy,Timothée Devergne,Pietro Novelli,Massimiliano Pontil
关键词-EN: Markov processes serve, real-world random processes, real-world random, Markov processes, universal model
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 35 pages

点击查看摘要

Abstract:Markov processes serve as a universal model for many real-world random processes. This paper presents a data-driven approach for learning these models through the spectral decomposition of the infinitesimal generator (IG) of the Markov semigroup. The unbounded nature of IGs complicates traditional methods such as vector-valued regression and Hilbert-Schmidt operator analysis. Existing techniques, including physics-informed kernel regression, are computationally expensive and limited in scope, with no recovery guarantees for transfer operator methods when the time-lag is small. We propose a novel method that leverages the IG’s resolvent, characterized by the Laplace transform of transfer operators. This approach is robust to time-lag variations, ensuring accurate eigenvalue learning even for small time-lags. Our statistical analysis applies to a broader class of Markov processes than current methods while reducing computational complexity from quadratic to linear in the state dimension. Finally, we illustrate the behaviour of our method in two experiments.
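
The identity the method builds on, that the generator's resolvent (lambda*I - L)^{-1} equals the Laplace transform of the transfer operators e^{tL}, can be checked numerically on a toy 2-state chain (crude quadrature, not the paper's estimator):

```python
import numpy as np

a, b = 2.0, 3.0
L = np.array([[-a, a], [b, -b]])       # generator of a 2-state Markov chain
lam = 1.0

# Transfer operator e^{tL} via eigendecomposition of L.
evals, evecs = np.linalg.eig(L)
inv_evecs = np.linalg.inv(evecs)
def transfer(t):
    return (evecs * np.exp(evals * t)) @ inv_evecs

# Laplace transform: integral of exp(-lam*t) * e^{tL} dt over [0, inf),
# truncated at t = 20 and approximated by a Riemann sum.
ts = np.linspace(0.0, 20.0, 20001)
dt = ts[1] - ts[0]
integral = sum(np.exp(-lam * t) * transfer(t) for t in ts) * dt

resolvent = np.linalg.inv(lam * np.eye(2) - L)
print(np.allclose(integral, resolvent, atol=1e-2))  # True
```

Because the integral averages over all time lags, resolvent-based estimators are far less sensitive to the choice of a single small time-lag than direct transfer-operator methods.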

[LG-38] Enhancing Cryptocurrency Market Forecasting: Advanced Machine Learning Techniques and Industrial Engineering Contributions

链接: https://arxiv.org/abs/2410.14475
作者: Jannatun Nayeem Pinky,Ramya Akula
关键词-EN: decentralized digital assets, experienced rapid growth, market capitalization nearing, digital assets, growth and adoption
类目: Machine Learning (cs.LG)
*备注: 63 pages, 6 figures

点击查看摘要

Abstract:Cryptocurrencies, as decentralized digital assets, have experienced rapid growth and adoption, with over 23,000 cryptocurrencies and a market capitalization nearing $1.1 trillion (about $3,400 per person in the US) as of 2023. This dynamic market presents significant opportunities and risks, highlighting the need for accurate price prediction models to manage volatility. This chapter comprehensively reviews machine learning (ML) techniques applied to cryptocurrency price prediction from 2014 to 2024. We explore various ML algorithms, including linear models, tree-based approaches, and advanced deep learning architectures such as transformers and large language models. Additionally, we examine the role of sentiment analysis in capturing market sentiment from textual data like social media posts and news articles to anticipate price fluctuations. With expertise in optimizing complex systems and processes, industrial engineers are pivotal in enhancing these models. They contribute by applying principles of process optimization, efficiency, and risk mitigation to improve computational performance and data management. This chapter highlights the evolving landscape of cryptocurrency price prediction, the integration of emerging technologies, and the significant role of industrial engineers in refining predictive models. By addressing current limitations and exploring future research directions, this chapter aims to advance the development of more accurate and robust prediction systems, supporting better-informed investment decisions and more stable market behavior.

[LG-39] How Do Training Methods Influence the Utilization of Vision Models? NEURIPS2024

链接: https://arxiv.org/abs/2410.14470
作者: Paul Gavrikov,Shashank Agnihotri,Margret Keuper,Janis Keuper
关键词-EN: contribute equally, learnable parameters, decision function, entire layers’ parameters, network decision function
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at the Interpretable AI: Past, Present and Future Workshop at NeurIPS 2024

点击查看摘要

Abstract:Not all learnable parameters (e.g., weights) contribute equally to a neural network’s decision function. In fact, entire layers’ parameters can sometimes be reset to random values with little to no impact on the model’s decisions. We revisit earlier studies that examined how architecture and task complexity influence this phenomenon and ask: is this phenomenon also affected by how we train the model? We conducted experimental evaluations on a diverse set of ImageNet-1k classification models to explore this, keeping the architecture and training data constant but varying the training pipeline. Our findings reveal that the training method strongly influences which layers become critical to the decision function for a given task. For example, improved training regimes and self-supervised training increase the importance of early layers while significantly under-utilizing deeper layers. In contrast, methods such as adversarial training display an opposite trend. Our preliminary results extend previous findings, offering a more nuanced understanding of the inner mechanics of neural networks. Code: this https URL

[LG-40] Electrocardiogram-Language Model for Few-Shot Question Answering with Meta Learning

Link: https://arxiv.org/abs/2410.14464
Authors: Jialu Tang,Tong Xia,Yuan Lu,Cecilia Mascolo,Aaqib Saeed
Keywords: requires specialized expertise, involving synthesizing insights, complex clinical queries, clinical queries posed, interpretation requires specialized
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Electrocardiogram (ECG) interpretation requires specialized expertise, often involving synthesizing insights from ECG signals with complex clinical queries posed in natural language. The scarcity of labeled ECG data coupled with the diverse nature of clinical inquiries presents a significant challenge for developing robust and adaptable ECG diagnostic systems. This work introduces a novel multimodal meta-learning method for few-shot ECG question answering, addressing the challenge of limited labeled data while leveraging the rich knowledge encoded within large language models (LLMs). Our LLM-agnostic approach integrates a pre-trained ECG encoder with a frozen LLM (e.g., LLaMA and Gemma) via a trainable fusion module, enabling the language model to reason about ECG data and generate clinically meaningful answers. Extensive experiments demonstrate superior generalization to unseen diagnostic tasks compared to supervised baselines, achieving notable performance even with limited ECG leads. For instance, in a 5-way 5-shot setting, our method using LLaMA-3.1-8B achieves accuracy of 84.6%, 77.3%, and 69.6% on single verify, choose and query question types, respectively. These results highlight the potential of our method to enhance clinical ECG interpretation by combining signal processing with the nuanced language understanding capabilities of LLMs, particularly in data-constrained scenarios.

[LG-41] The Propensity for Density in Feed-forward Models

Link: https://arxiv.org/abs/2410.14461
Authors: Nandi Schoots,Alex Jackson,Ali Kholmovaia,Peter McBurney,Murray Shanahan
Keywords: task tend, process of training, training a neural, solved with fewer, fewer weights
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Does the process of training a neural network to solve a task tend to use all of the available weights even when the task could be solved with fewer weights? To address this question we study the effects of pruning fully connected, convolutional and residual models while varying their widths. We find that the proportion of weights that can be pruned without degrading performance is largely invariant to model size. Increasing the width of a model has little effect on the density of the pruned model relative to the increase in absolute size of the pruned network. In particular, we find substantial prunability across a large range of model sizes, where our biggest model is 50 times as wide as our smallest model. We explore three hypotheses that could explain these findings.
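The kind of measurement behind this study can be illustrated concretely. The sketch below prunes a single linear layer's weights in order of magnitude until its outputs drift beyond a tolerance, and reports the prunable fraction. The magnitude criterion, matrix sizes, and tolerance are our illustrative assumptions, not the paper's protocol (which prunes trained networks and checks task performance).

```python
import numpy as np

def prunable_fraction(weights, inputs, tol):
    """Return the fraction of smallest-magnitude weights that can be zeroed
    while keeping the layer's outputs within `tol` of the originals
    (illustrative magnitude-pruning criterion, not the paper's protocol)."""
    outputs = inputs @ weights
    order = np.argsort(np.abs(weights), axis=None)  # flat indices, smallest first
    pruned = weights.copy()
    for i, idx in enumerate(order):
        pruned.flat[idx] = 0.0
        if np.abs(inputs @ pruned - outputs).max() > tol:
            return i / weights.size  # the i weights pruned so far were harmless
    return 1.0

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(32, 16))        # mostly tiny weights ...
W[:4, :4] = rng.normal(scale=1.0, size=(4, 4))   # ... plus a few dominant ones
X = rng.normal(size=(8, 32))
frac = prunable_fraction(W, X, tol=0.05)
print(f"prunable fraction: {frac:.2f}")
```

The paper's finding is that, for trained networks, this fraction stays roughly constant as the layer widths grow.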

[LG-42] FashionR2R: Texture-preserving Rendered-to-Real Image Translation with Diffusion Models NEURIPS2024

Link: https://arxiv.org/abs/2410.14429
Authors: Rui Hu,Qian He,Gaofeng He,Jiedong Zhuang,Huang Chen,Huafeng Liu,Huamin Wang
Keywords: producing lifelike clothed, Modeling and producing, lifelike clothed human, attracted researchers’ attention, clothed human images
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: Accepted by NeurIPS 2024

Click to view abstract

Abstract:Modeling and producing lifelike clothed human images has attracted researchers’ attention from different areas for decades, with the complexity from highly articulated and structured content. Rendering algorithms decompose and simulate the imaging process of a camera, but are limited by the accuracy of modeled variables and the efficiency of computation. Generative models can produce impressively vivid human images, however still lacking in controllability and editability. This paper studies photorealism enhancement of rendered images, leveraging generative power from diffusion models on the controlled basis of rendering. We introduce a novel framework to translate rendered images into their realistic counterparts, which consists of two stages: Domain Knowledge Injection (DKI) and Realistic Image Generation (RIG). In DKI, we adopt positive (real) domain finetuning and negative (rendered) domain embedding to inject knowledge into a pretrained Text-to-image (T2I) diffusion model. In RIG, we generate the realistic image corresponding to the input rendered image, with a Texture-preserving Attention Control (TAC) to preserve fine-grained clothing textures, exploiting the decoupled features encoded in the UNet structure. Additionally, we introduce the SynFashion dataset, featuring high-quality digital clothing images with diverse textures. Extensive experimental results demonstrate the superiority and effectiveness of our method in rendered-to-real image translation.

[LG-43] Predicting time-varying flux and balance in metabolic systems using structured neural-ODE processes

Link: https://arxiv.org/abs/2410.14426
Authors: Santanu Rathod,Pietro Lio,Xiao Zhang
Keywords: deep domain knowledge, neural ODE process, bypassing the demand, optimization problem, ODE process model
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:We develop a novel data-driven framework as an alternative to dynamic flux balance analysis, bypassing the demand for deep domain knowledge and manual efforts to formulate the optimization problem. The proposed framework is end-to-end, which trains a structured neural ODE process (SNODEP) model to estimate flux and balance samples using gene-expression time-series data. SNODEP is designed to circumvent the limitations of the standard neural ODE process model, including restricting the latent and decoder sampling distributions to be normal and lacking structure between context points for calculating the latent, thus more suitable for modeling the underlying dynamics of a metabolic system. Through comprehensive experiments (156 in total), we demonstrate that SNODEP not only predicts the unseen time points of real-world gene-expression data and the flux and balance estimates well but can even generalize to more challenging unseen knockout configurations and irregular data sampling scenarios, all essential for metabolic pathway analysis. We hope our work can serve as a catalyst for building more scalable and powerful models for genome-scale metabolic analysis. Our code is available at: this https URL.

[LG-44] An explainable machine learning approach for energy forecasting at the household level

Link: https://arxiv.org/abs/2410.14416
Authors: Pauline Béraud,Margaux Rioux,Michel Babany,Philippe de La Chevasnerie,Damien Theis,Giacomo Teodori,Chloé Pinguet,Romane Rigaud,François Leclerc
Keywords: recurring research topic, recurring research, balance between production, Machine Learning, common Machine Learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*Comments:

Click to view abstract

Abstract:Electricity forecasting has been a recurring research topic, as it is key to finding the right balance between production and consumption. While most papers are focused on the national or regional scale, few are interested in the household level. Desegregated forecast is a common topic in Machine Learning (ML) literature but lacks explainability that household energy forecasts require. This paper specifically targets the challenges of forecasting electricity use at the household level. This paper confronts common Machine Learning algorithms to electricity household forecasts, weighing the pros and cons, including accuracy and explainability with well-known key metrics. Furthermore, we also confront them in this paper with the business challenges specific to this sector such as explainability or outliers resistance. We introduce a custom decision tree, aiming at providing a fair estimate of the energy consumption, while being explainable and consistent with human intuition. We show that this novel method allows greater explainability without sacrificing much accuracy. The custom tree methodology can be used in various business use cases but is subject to limitations, such as a lack of resilience with outliers.

[LG-45] SNAC: Multi-Scale Neural Audio Codec

Link: https://arxiv.org/abs/2410.14411
Authors: Hubert Siuzdak,Florian Grötschla,Luca A. Lanzendörfer
Keywords: recently gained popularity, language modeling approaches, represent audio signals, Residual Vector Quantization, Neural audio
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*Comments:

Click to view abstract

Abstract:Neural audio codecs have recently gained popularity because they can represent audio signals with high fidelity at very low bitrates, making it feasible to use language modeling approaches for audio generation and understanding. Residual Vector Quantization (RVQ) has become the standard technique for neural audio compression using a cascade of VQ codebooks. This paper proposes the Multi-Scale Neural Audio Codec, a simple extension of RVQ where the quantizers can operate at different temporal resolutions. By applying a hierarchy of quantizers at variable frame rates, the codec adapts to the audio structure across multiple timescales. This leads to more efficient compression, as demonstrated by extensive objective and subjective evaluations. The code and model weights are open-sourced at this https URL.
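The variable-frame-rate idea can be sketched with a toy residual vector quantizer. This is not SNAC's architecture or training procedure: the codebooks below are random (with a zero codeword added so a stage can leave the residual untouched), the features are synthetic, and the strides are arbitrary; only the "each stage quantizes the residual at its own temporal resolution" mechanism is shown.

```python
import numpy as np

def quantize(x, codebook):
    """Nearest-codeword lookup, one codeword per row (frame) of x."""
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return codebook[d.argmin(axis=1)]

def multiscale_rvq(x, codebooks, strides):
    """Toy multi-scale residual VQ: each stage quantizes the running residual
    at its own temporal resolution (stride = frames pooled per code)."""
    residual, recon = x.copy(), np.zeros_like(x)
    for cb, s in zip(codebooks, strides):
        T = residual.shape[0] // s
        pooled = residual[:T * s].reshape(T, s, -1).mean(axis=1)  # downsample
        q = np.repeat(quantize(pooled, cb), s, axis=0)            # upsample
        recon[:T * s] += q
        residual[:T * s] -= q
    return recon

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 4))  # 16 frames of 4-dim features
codebooks = [np.vstack([np.zeros((1, 4)), rng.normal(size=(63, 4))])
             for _ in range(3)]
recon = multiscale_rvq(x, codebooks, strides=[4, 2, 1])  # coarse to fine
print(((x - recon) ** 2).mean() <= (x ** 2).mean())  # True
```

Because the zero codeword is always available, each stage can only reduce (never increase) the residual energy, so even random codebooks yield a reconstruction no worse than silence; learned codebooks do far better.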

[LG-46] Debug Smarter Not Harder: AI Agents for Error Resolution in Computational Notebooks EMNLP2024

Link: https://arxiv.org/abs/2410.14393
Authors: Konstantin Grotov,Artem Borzilov,Maksim Krivobok,Timofey Bryksin,Yaroslav Zharov
Keywords: offering unprecedented interactivity, development process, research-related development, Large Language Models, offering unprecedented
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: Accepted to EMNLP 2024 System Demonstrations

Click to view abstract

Abstract:Computational notebooks became indispensable tools for research-related development, offering unprecedented interactivity and flexibility in the development process. However, these benefits come at the cost of reproducibility and an increased potential for bugs. With the rise of code-fluent Large Language Models empowered with agentic techniques, smart bug-fixing tools with a high level of autonomy have emerged. However, those tools are tuned for classical script programming and still struggle with non-linear computational notebooks. In this paper, we present an AI agent designed specifically for error resolution in a computational notebook. We have developed an agentic system capable of exploring a notebook environment by interacting with it – similar to how a user would – and integrated the system into the JetBrains service for collaborative data science called Datalore. We evaluate our approach against the pre-existing single-action solution by comparing costs and conducting a user study. Users rate the error resolution capabilities of the agentic system higher but experience difficulties with UI. We share the results of the study and consider them valuable for further improving user-agent collaboration.

[LG-47] Personalizing Low-Rank Bayesian Neural Networks Via Federated Learning

Link: https://arxiv.org/abs/2410.14390
Authors: Boning Zhang,Dongzhu Liu,Osvaldo Simeone,Guanchu Wang,Dimitrios Pezaros,Guangxu Zhu
Keywords: support real-world decision-making, assign reliable confidence, reliable confidence estimates, real-world decision-making, support real-world
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:To support real-world decision-making, it is crucial for models to be well-calibrated, i.e., to assign reliable confidence estimates to their predictions. Uncertainty quantification is particularly important in personalized federated learning (PFL), as participating clients typically have small local datasets, making it difficult to unambiguously determine optimal model parameters. Bayesian PFL (BPFL) methods can potentially enhance calibration, but they often come with considerable computational and memory requirements due to the need to track the variances of all the individual model parameters. Furthermore, different clients may exhibit heterogeneous uncertainty levels owing to varying local dataset sizes and distributions. To address these challenges, we propose LR-BPFL, a novel BPFL method that learns a global deterministic model along with personalized low-rank Bayesian corrections. To tailor the local model to each client’s inherent uncertainty level, LR-BPFL incorporates an adaptive rank selection mechanism. We evaluate LR-BPFL across a variety of datasets, demonstrating its advantages in terms of calibration, accuracy, as well as computational and memory requirements.

[LG-48] SurgeryV2: Bridging the Gap Between Model Merging and Multi-Task Learning with Deep Representation Surgery ICML2024

Link: https://arxiv.org/abs/2410.14389
Authors: Enneng Yang,Li Shen,Zhenyi Wang,Guibing Guo,Xingwei Wang,Xiaocun Cao,Jie Zhang,Dacheng Tao
Keywords: raw training data, merged model, merged MTL model, MTL, merging-based multitask learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*Comments: This paper is an extended version of our previous work [ arXiv:2402.02705 ] presented at ICML 2024

Click to view abstract

Abstract:Model merging-based multitask learning (MTL) offers a promising approach for performing MTL by merging multiple expert models without requiring access to raw training data. However, in this paper, we examine the merged model’s representation distribution and uncover a critical issue of “representation bias”. This bias arises from a significant distribution gap between the representations of the merged and expert models, leading to the suboptimal performance of the merged MTL model. To address this challenge, we first propose a representation surgery solution called Surgery. Surgery is a lightweight, task-specific module that aligns the final layer representations of the merged model with those of the expert models, effectively alleviating bias and improving the merged model’s performance. Despite these improvements, a performance gap remains compared to the traditional MTL method. Further analysis reveals that representation bias phenomena exist at each layer of the merged model, and aligning representations only in the last layer is insufficient for fully reducing systemic bias because biases introduced at each layer can accumulate and interact in complex ways. To tackle this, we then propose a more comprehensive solution, deep representation surgery (also called SurgeryV2), which mitigates representation bias across all layers, and thus bridges the performance gap between model merging-based MTL and traditional MTL. Finally, we design an unsupervised optimization objective to optimize both the Surgery and SurgeryV2 modules. Our experimental results show that incorporating these modules into state-of-the-art (SOTA) model merging schemes leads to significant performance gains. Notably, our SurgeryV2 scheme reaches almost the same level as individual expert models or the traditional MTL model. The code is available at this https URL.

[LG-49] Unscrambling disease progression at scale: fast inference of event permutations with optimal transport NEURIPS2024

Link: https://arxiv.org/abs/2410.14388
Authors: Peter A. Wijeratne,Daniel C. Alexander
Keywords: infer group-level temporal, group-level temporal trajectories, models infer group-level, chronic degenerative condition, degenerative condition plays
Subjects: Machine Learning (cs.LG)
*Comments: Pre-print of version accepted to NeurIPS 2024

Click to view abstract

Abstract:Disease progression models infer group-level temporal trajectories of change in patients’ features as a chronic degenerative condition plays out. They provide unique insight into disease biology and staging systems with individual-level clinical utility. Discrete models consider disease progression as a latent permutation of events, where each event corresponds to a feature becoming measurably abnormal. However, permutation inference using traditional maximum likelihood approaches becomes prohibitive due to combinatoric explosion, severely limiting model dimensionality and utility. Here we leverage ideas from optimal transport to model disease progression as a latent permutation matrix of events belonging to the Birkhoff polytope, facilitating fast inference via optimisation of the variational lower bound. This enables a factor of 1000 times faster inference than the current state of the art and, correspondingly, supports models with several orders of magnitude more features than the current state of the art can consider. Experiments demonstrate the increase in speed, accuracy and robustness to noise in simulation. Further experiments with real-world imaging data from two separate datasets, one from Alzheimer’s disease patients, the other age-related macular degeneration, showcase, for the first time, pixel-level disease progression events in the brain and eye, respectively. Our method is low compute, interpretable and applicable to any progressive condition and data modality, giving it broad potential clinical utility.
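The Birkhoff-polytope relaxation at the heart of this speed-up can be illustrated with the standard Sinkhorn iteration from optimal transport, which maps an arbitrary score matrix onto a doubly-stochastic matrix (a "soft permutation" over event orderings). This is the generic tool, not the paper's variational inference scheme; the matrix size and iteration count below are arbitrary.

```python
import numpy as np

def sinkhorn(logits, n_iters=200):
    """Alternate row/column normalisation (Sinkhorn) to map a score matrix
    into the Birkhoff polytope of doubly-stochastic matrices, the standard
    continuous relaxation of permutation matrices."""
    P = np.exp(logits - logits.max())  # strictly positive entries
    for _ in range(n_iters):
        P /= P.sum(axis=1, keepdims=True)  # rows sum to 1
        P /= P.sum(axis=0, keepdims=True)  # columns sum to 1
    return P

rng = np.random.default_rng(0)
P = sinkhorn(rng.normal(size=(5, 5)))  # 5 events, 5 ordering positions
print(np.allclose(P.sum(axis=0), 1.0), np.allclose(P.sum(axis=1), 1.0))
```

Optimising over such relaxed permutations replaces the combinatoric search over all event orderings, which is what makes large event sets tractable.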

[LG-50] Dual-Label Learning With Irregularly Present Labels

Link: https://arxiv.org/abs/2410.14380
Authors: Mingqian Li,Qiao Han,Yiteng Zhai,Ruifeng Li,Yao Yang,Hongyang Chen
Keywords: samples exhibits irregular, exhibits irregular patterns, samples exhibits, fully labeled, irregular patterns
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:In multi-task learning, we often encounter the case when the presence of labels across samples exhibits irregular patterns: samples can be fully labeled, partially labeled or unlabeled. Taking drug analysis as an example, multiple toxicity properties of a drug molecule may not be concurrently available due to experimental limitations. It triggers a demand for a new training and inference mechanism that could accommodate irregularly present labels and maximize the utility of any available label information. In this work, we focus on the two-label learning task, and propose a novel training and inference framework, Dual-Label Learning (DLL). The DLL framework formulates the problem into a dual-function system, in which the two functions should simultaneously satisfy standard supervision, structural duality and probabilistic duality. DLL features a dual-tower model architecture that explicitly captures the information exchange between labels, aimed at maximizing the utility of partially available labels in understanding label correlation. During training, label imputation for missing labels is conducted as part of the forward propagation process, while during inference, labels are regarded as unknowns of a bivariate system of equations and are solved jointly. Theoretical analysis guarantees the feasibility of DLL, and extensive experiments are conducted to verify that by explicitly modeling label correlation and maximizing the utility of available labels, our method makes consistently better predictions than baseline approaches by up to a 10% gain in F1-score or MAPE. Remarkably, our method provided with data at a label missing rate as high as 60% can achieve similar or even better results than baseline approaches at a label missing rate of only 10%.

[LG-51] Fine-Tuning Pre-trained Language Models for Robust Causal Representation Learning

Link: https://arxiv.org/abs/2410.14375
Authors: Jialin Yu,Yuxiang Zhou,Yulan He,Nevin L. Zhang,Ricardo Silva
Keywords: pre-trained language models, pre-trained language, language models, data, Abstract
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
*Comments:

Click to view abstract

Abstract:The fine-tuning of pre-trained language models (PLMs) has been shown to be effective across various domains. By using domain-specific supervised data, the general-purpose representation derived from PLMs can be transformed into a domain-specific representation. However, these methods often fail to generalize to out-of-domain (OOD) data due to their reliance on non-causal representations, often described as spurious features. Existing methods either make use of adjustments with strong assumptions about lack of hidden common causes, or mitigate the effect of spurious features using multi-domain data. In this work, we investigate how fine-tuned pre-trained language models aid generalizability from single-domain scenarios under mild assumptions, targeting more general and practical real-world scenarios. We show that a robust representation can be derived through a so-called causal front-door adjustment, based on a decomposition assumption, using fine-tuned representations as a source of data augmentation. Comprehensive experiments in both synthetic and real-world settings demonstrate the superior generalizability of the proposed method compared to existing approaches. Our work thus sheds light on the domain generalization problem by introducing links between fine-tuning and causal mechanisms into representation learning.
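For reference, the front-door adjustment the abstract invokes is Pearl's standard identity: with treatment $X$, mediator $M$, and outcome $Y$, the causal effect is identified as

```latex
P(y \mid \mathrm{do}(x)) \;=\; \sum_{m} P(m \mid x) \sum_{x'} P(y \mid m, x')\, P(x')
```

This requires $M$ to intercept all directed paths from $X$ to $Y$ and to share no unblocked confounder with either variable. Reading the fine-tuned representation as (part of) such a mediator is our gloss on the abstract; the paper's contribution is justifying the adjustment via its decomposition assumption and fine-tuning-based augmentation.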

[LG-52] A Scientific Machine Learning Approach for Predicting and Forecasting Battery Degradation in Electric Vehicles

Link: https://arxiv.org/abs/2410.14347
Authors: Sharv Murgai,Hrishikesh Bhagwat,Raj Abhijit Dandekar,Rajat Dandekar,Sreedath Panat
Keywords: mitigate climate change, Carbon emissions, alarming rate, posing a significant, climate change
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Carbon emissions are rising at an alarming rate, posing a significant threat to global efforts to mitigate climate change. Electric vehicles have emerged as a promising solution, but their reliance on lithium-ion batteries introduces the critical challenge of battery degradation. Accurate prediction and forecasting of battery degradation over both short and long time spans are essential for optimizing performance, extending battery life, and ensuring effective long-term energy management. This directly influences the reliability, safety, and sustainability of EVs, supporting their widespread adoption and aligning with key UN SDGs. In this paper, we present a novel approach to the prediction and long-term forecasting of battery degradation using Scientific Machine Learning framework which integrates domain knowledge with neural networks, offering more interpretable and scientifically grounded solutions for both predicting short-term battery health and forecasting degradation over extended periods. This hybrid approach captures both known and unknown degradation dynamics, improving predictive accuracy while reducing data requirements. We incorporate ground-truth data to inform our models, ensuring that both the predictions and forecasts reflect practical conditions. The model achieved MSE of 9.90 with the UDE and 11.55 with the NeuralODE, in experimental data, a loss of 1.6986 with the UDE, and a MSE of 2.49 in the NeuralODE, demonstrating the enhanced precision of our approach. This integration of data-driven insights with SciML’s strengths in interpretability and scalability allows for robust battery management. By enhancing battery longevity and minimizing waste, our approach contributes to the sustainability of energy systems and accelerates the global transition toward cleaner, more responsible energy solutions, aligning with the UN’s SDG agenda.

[LG-53] Evaluating the evaluators: Towards human-aligned metrics for missing markers reconstruction

Link: https://arxiv.org/abs/2410.14334
Authors: Taras Kucherenko,Derek Peristy,Judith Bütepage
Keywords: optical motion capture, motion capture systems, Animation data, optical motion, motion capture
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Animation data is often obtained through optical motion capture systems, which utilize a multitude of cameras to establish the position of optical markers. However, system errors or occlusions can result in missing markers, the manual cleaning of which can be time-consuming. This has sparked interest in machine learning-based solutions for missing marker reconstruction in the academic community. Most academic papers utilize a simplistic mean square error as the main metric. In this paper, we show that this metric does not correlate with subjective perception of the fill quality. We introduce and evaluate a set of better-correlated metrics that can drive progress in the field.

[LG-54] Fast proxy centers for Jeffreys centroids: The Jeffreys-Fisher-Rao and the inductive Gauss-Bregman centers

Link: https://arxiv.org/abs/2410.14326
Authors: Frank Nielsen
Keywords: Jeffreys centroid, mutually absolutely continuous, absolutely continuous probability, Jeffreys, continuous probability distributions
Subjects: Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: 35 pages, 10 figures

Click to view abstract

Abstract:The symmetric Kullback-Leibler centroid also called the Jeffreys centroid of a set of mutually absolutely continuous probability distributions on a measure space provides a notion of centrality which has proven useful in many tasks including information retrieval, information fusion, and clustering in image, video and sound processing. However, the Jeffreys centroid is not available in closed-form for sets of categorical or normal distributions, two widely used statistical models, and thus need to be approximated numerically in practice. In this paper, we first propose the new Jeffreys-Fisher-Rao center defined as the Fisher-Rao midpoint of the sided Kullback-Leibler centroids as a plug-in replacement of the Jeffreys centroid. This Jeffreys-Fisher-Rao center admits a generic formula for uni-parameter exponential family distributions, and closed-form formula for categorical and normal distributions, matches exactly the Jeffreys centroid for same-mean normal distributions, and is experimentally observed in practice to be close to the Jeffreys centroid. Second, we define a new type of inductive centers generalizing the principle of Gauss arithmetic-geometric double sequence mean for pairs of densities of any given exponential family. This center is shown experimentally to approximate very well the Jeffreys centroid and is suggested to use when the Jeffreys-Fisher-Rao center is not available in closed form. Moreover, this Gauss-Bregman inductive center always converges and matches the Jeffreys centroid for sets of same-mean normal distributions. We report on our experiments demonstrating the use of the Jeffreys-Fisher-Rao and Gauss-Bregman centers instead of the Jeffreys centroid. Finally, we conclude this work by reinterpreting these fast proxy centers of Jeffreys centroids under the lens of dually flat spaces in information geometry.
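The "Gauss arithmetic-geometric double sequence mean" that the inductive center generalises is easy to state for two positive scalars. The sketch below is that classical scalar iteration only; the paper's Gauss-Bregman center iterates analogous paired operations on densities of an exponential family, not on numbers.

```python
from math import sqrt

def agm(a, b, tol=1e-12):
    """Gauss arithmetic-geometric mean: repeatedly replace the pair by its
    arithmetic and geometric means; the double sequence converges to a
    common limit sandwiched between the two starting values."""
    while abs(a - b) > tol * max(a, b):
        a, b = (a + b) / 2.0, sqrt(a * b)
    return (a + b) / 2.0

print(agm(1.0, 2.0))  # ≈ 1.4567910
```

Convergence is quadratic (the gap roughly squares each step), which is why such inductive double sequences make fast numerical proxies.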

[LG-55] Debiasing Mini-Batch Quadratics for Applications in Deep Learning

Link: https://arxiv.org/abs/2410.14325
Authors: Lukas Tatzel,Bálint Mucsányi,Osane Hackel,Philipp Hennig
Keywords: fundamental building block, machine learning methods, Quadratic approximations form, form a fundamental, fundamental building
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: Main text (including references): 13 pages, 6 figures; Supplements: 25 pages, 13 figures

Click to view abstract

Abstract:Quadratic approximations form a fundamental building block of machine learning methods. E.g., second-order optimizers try to find the Newton step into the minimum of a local quadratic proxy to the objective function; and the second-order approximation of a network’s loss function can be used to quantify the uncertainty of its outputs via the Laplace approximation. When computations on the entire training set are intractable - typical for deep learning - the relevant quantities are computed on mini-batches. This, however, distorts and biases the shape of the associated stochastic quadratic approximations in an intricate way with detrimental effects on applications. In this paper, we (i) show that this bias introduces a systematic error, (ii) provide a theoretical explanation for it, (iii) explain its relevance for second-order optimization and uncertainty quantification via the Laplace approximation in deep learning, and (iv) develop and evaluate debiasing strategies.

[LG-56] PTR: A Pre-trained Language Model for Trajectory Recovery

Link: https://arxiv.org/abs/2410.14281
Authors: Tonglong Wei,Yan Lin,Youfang Lin,Shengnan Guo,Jilin Hu,Gao Cong,Huaiyu Wan
Keywords: Spatiotemporal trajectory data, Spatiotemporal trajectory, hardware and platforms, extensively collected, collected and analyzed
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Spatiotemporal trajectory data is vital for web-of-things services and is extensively collected and analyzed by web-based hardware and platforms. However, issues such as service interruptions and network instability often lead to sparsely recorded trajectories, resulting in a loss of detailed movement data. As a result, recovering these trajectories to restore missing information becomes essential. Despite progress, several challenges remain unresolved. First, the lack of large-scale dense trajectory data hampers the performance of existing deep learning methods, which rely heavily on abundant data for supervised training. Second, current methods struggle to generalize across sparse trajectories with varying sampling intervals, necessitating separate re-training for each interval and increasing computational costs. Third, external factors crucial for the recovery of missing points are not fully incorporated. To address these challenges, we propose a framework called PTR. This framework mitigates the issue of limited dense trajectory data by leveraging the capabilities of pre-trained language models (PLMs). PTR incorporates an explicit trajectory prompt and is trained on datasets with multiple sampling intervals, enabling it to generalize effectively across different intervals in sparse trajectories. To capture external factors, we introduce an implicit trajectory prompt that models road conditions, providing richer information for recovering missing points. Additionally, we present a trajectory embedder that encodes trajectory points and transforms the embeddings of both observed and missing points into a format comprehensible to PLMs. Experimental results on two public trajectory datasets with three sampling intervals demonstrate the efficacy and scalability of PTR. 

[LG-57] Stochastic Quasi-Newton Optimization in Large Dimensions Including Deep Network Training

Link: https://arxiv.org/abs/2410.14270
Authors: Uttam Suman,Mariya Mamajiwala,Mukul Saxena,Ankit Tyagi,Debasish Roy
Keywords: possibly non-smooth objective, functions typically defined, dimensional design spaces, non-smooth objective functions, objective functions typically
Subjects: Machine Learning (cs.LG)
*Comments: 19 pages, 12 figures, 3 tables

Click to view abstract

Abstract:Our proposal is on a new stochastic optimizer for non-convex and possibly non-smooth objective functions typically defined over large dimensional design spaces. Towards this, we have tried to bridge noise-assisted global search and faster local convergence, the latter being the characteristic feature of a Newton-like search. Our specific scheme – acronymed FINDER (Filtering Informed Newton-like and Derivative-free Evolutionary Recursion), exploits the nonlinear stochastic filtering equations to arrive at a derivative-free update that has resemblance with the Newton search employing the inverse Hessian of the objective function. Following certain simplifications of the update to enable a linear scaling with dimension and a few other enhancements, we apply FINDER to a range of problems, starting with some IEEE benchmark objective functions to a couple of archetypal data-driven problems in deep networks to certain cases of physics-informed deep networks. The performance of the new method vis-à-vis the well-known Adam and a few others bears evidence to its promise and potentialities for large dimensional optimization problems of practical interest.

[LG-58] On time series clustering with k-means

链接: https://arxiv.org/abs/2410.14269
作者: Christopher Holder,Anthony Bagnall,Jason Lines
关键词-EN: distance-based partitional clustering, time series, long history, history of research, distance-based partitional
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:There is a long history of research into time series clustering using distance-based partitional clustering. Many of the most popular algorithms adapt k-means (also known as Lloyd’s algorithm) to exploit time dependencies in the data by specifying a time series distance function. However, these algorithms are often presented with k-means configured in various ways, altering key parameters such as the initialisation strategy. This variability makes it difficult to compare studies because k-means is known to be highly sensitive to its configuration. To address this, we propose a standard Lloyd’s-based model for TSCL that adopts an end-to-end approach, incorporating a specialised distance function not only in the assignment step but also in the initialisation and stopping criteria. By doing so, we create a unified structure for comparing seven popular Lloyd’s-based TSCL algorithms. This common framework enables us to more easily attribute differences in clustering performance to the distance function itself, rather than variations in the k-means configuration.
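The end-to-end Lloyd's framing described above can be sketched in a few lines: the same pluggable distance drives the assignment step (and, per the paper, should also drive initialisation and stopping). The toy below uses squared Euclidean as a stand-in for an elastic measure such as DTW; it is an illustration of the idea, not the paper's benchmark implementation.

```python
import random

def sqeuclidean(a, b):
    """Stand-in time series distance; an elastic measure such as DTW
    would be plugged in here in a real TSCL setting."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyds_tscl(series, k, distance, n_iters=20, seed=0):
    """Lloyd's algorithm with a pluggable distance used in the assignment
    step; a distance-aware initialisation and stopping criterion would
    slot in here as well."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(series, k)]
    labels = [0] * len(series)
    for _ in range(n_iters):
        # assignment step: nearest center under the specialised distance
        labels = [min(range(k), key=lambda j: distance(s, centers[j]))
                  for s in series]
        # update step: element-wise mean of each non-empty cluster
        for j in range(k):
            members = [s for s, l in zip(series, labels) if l == j]
            if members:
                centers[j] = [sum(v) / len(members) for v in zip(*members)]
    return labels, centers
```

Because the distance function is a parameter, two studies that differ only in initialisation or stopping rules can be compared on equal footing, which is the point of the proposed standard model.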

[LG-59] MoDification: Mixture of Depths Made Easy

链接: https://arxiv.org/abs/2410.14268
作者: Chen Zhang,Meizhi Zhong,Qimeng Wang,Xuantao Lu,Zheyu Ye,Chengqiang Lu,Yan Gao,Yao Hu,Kehai Chen,Min Zhang,Dawei Song
关键词-EN: serving large language, large language models, trending topic, topic in serving, serving large
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 12 pages, 9 figures, 5 tables, work in progress

点击查看摘要

Abstract:Long-context efficiency has recently become a trending topic in serving large language models (LLMs). Mixture of depths (MoD) has been proposed as a promising fit to bring down both latency and memory. In this paper, however, we discover that MoD can barely transform existing LLMs without costly training over an extensive number of tokens. To enable the transformation from any LLM to a MoD one, we show that the top-k operator in MoD should be promoted to a threshold-p operator, and that refinements to the architecture and data should be crafted alongside. All these designs form our method termed MoDification. Through a comprehensive set of experiments covering model scales from 3B to 70B, we show that MoDification strikes an excellent balance between efficiency and effectiveness. MoDification can achieve up to ~1.2x speedup in latency and ~1.8x reduction in memory compared to original LLMs, especially in long-context applications.
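The promotion from a top-k to a threshold-p router described in the abstract can be illustrated with a toy routing function: top-k always processes a fixed number of tokens, while threshold-p lets that count adapt per input. Names and scores below are illustrative, not the paper's implementation.

```python
def topk_route(scores, k):
    """Original MoD routing: exactly the k tokens with the highest
    router scores pass through the block; the rest skip it."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

def threshold_p_route(scores, p):
    """MoDification-style routing: any token whose router score exceeds
    the threshold p is processed, so the routed count adapts per input
    instead of being fixed in advance."""
    return {i for i, s in enumerate(scores) if s > p}

scores = [0.9, 0.2, 0.75, 0.4, 0.05]       # toy per-token router scores
fixed = topk_route(scores, 2)               # always 2 tokens: {0, 2}
adaptive = threshold_p_route(scores, 0.3)   # 3 tokens here: {0, 2, 3}
```

The practical upside is that a threshold needs no global sort over the sequence and tolerates inputs whose "hard token" count varies, which is what makes retrofitting pretrained LLMs feasible.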

[LG-60] Revisiting SLO and Goodput Metrics in LLM Serving

链接: https://arxiv.org/abs/2410.14257
作者: Zhibin Wang,Shipeng Li,Yuhang Zhou,Xue Li,Rong Gu,Nguyen Cam-Tu,Chen Tian,Sheng Zhong
关键词-EN: Large language models, Large language, LLM serving, achieved remarkable performance, LLM
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have achieved remarkable performance and are widely deployed in various applications, while the serving of LLM inference has raised concerns about user experience and serving throughput. Accordingly, service level objectives (SLOs) and goodput, the number of requests that meet SLOs per second, are introduced to evaluate the performance of LLM serving. However, existing metrics fail to capture the nature of user experience. We observe two ridiculous phenomena in existing metrics: 1) delaying token delivery can smooth the tail time between tokens (tail TBT) of a request, and 2) dropping a request that fails to meet the SLOs midway can improve goodput. In this paper, we revisit SLO and goodput metrics in LLM serving and propose a unified metric framework, smooth goodput, including SLOs and goodput to reflect the nature of user experience in LLM serving. The framework can adapt to the specific goals of different tasks by setting parameters. We re-evaluate the performance of different LLM serving systems under multiple workloads based on this unified framework and provide possible directions for future optimization of existing strategies. We hope that this framework can provide a unified standard for evaluating LLM serving and foster research in the field of LLM serving optimization to move in a cohesive direction.
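The goodput metric the abstract critiques is simple to compute. The sketch below is a minimal, assumed formulation (field names `finished`, `ttft`, `tbt` and the two SLO thresholds are illustrative), not the paper's smooth-goodput framework.

```python
def goodput(requests, window_s, slo_ttft, slo_tbt):
    """Completed requests per second that meet both SLOs: time to first
    token (TTFT) and the worst observed time between tokens (TBT)."""
    ok = sum(
        1 for r in requests
        if r["finished"]
        and r["ttft"] <= slo_ttft
        and max(r["tbt"], default=0.0) <= slo_tbt
    )
    return ok / window_s

reqs = [
    {"finished": True,  "ttft": 0.3, "tbt": [0.04, 0.05]},  # meets both SLOs
    {"finished": True,  "ttft": 1.2, "tbt": [0.04]},        # violates TTFT
    {"finished": False, "ttft": 0.2, "tbt": [0.03]},        # dropped midway
]
gp = goodput(reqs, window_s=2.0, slo_ttft=0.5, slo_tbt=0.1)  # 1 request / 2 s
```

Note that the dropped third request simply vanishes from the count, which hints at the pathology the authors highlight: a scheduler can abandon hopeless requests to free capacity and thereby raise goodput even though those users had a poor experience.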

[LG-61] RAZOR: Refining Accuracy by Zeroing Out Redundancies

链接: https://arxiv.org/abs/2410.14254
作者: Daniel Riccio,Genoveffa Tortora,Mara Sangiovanni
关键词-EN: imposing significant pressure, generating vast volumes, existing data analysis, imposing significant, proliferation of sensors
类目: Machine Learning (cs.LG)
*备注: 17 pages, 3 figures

点击查看摘要

Abstract:In many application domains, the proliferation of sensors and devices is generating vast volumes of data, imposing significant pressure on existing data analysis and data mining techniques. Nevertheless, an increase in data volume does not inherently imply an increase in informational content, as a substantial portion may be redundant or represent noise. This challenge is particularly evident in the deep learning domain, where the utility of additional data is contingent on its informativeness. In the absence of such, larger datasets merely exacerbate the computational cost and complexity of the learning process. To address these challenges, we propose RAZOR, a novel instance selection technique designed to extract a significantly smaller yet sufficiently informative subset from a larger set of instances without compromising the learning process. RAZOR has been specifically engineered to be robust, efficient, and scalable, making it suitable for large-scale datasets. Unlike many techniques in the literature, RAZOR is capable of operating in both supervised and unsupervised settings. Experimental results demonstrate that RAZOR outperforms recent state-of-the-art techniques in terms of both effectiveness and efficiency.

[LG-62] Pseudo-label Refinement for Improving Self-Supervised Learning Systems

链接: https://arxiv.org/abs/2410.14242
作者: Zia-ur-Rehman,Arif Mahmood,Wenxiong Kang
关键词-EN: gained significant attention, leveraging clustering-based pseudo-labels, Self-supervised learning systems, SLR algorithm, human annotations
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Self-supervised learning systems have gained significant attention in recent years by leveraging clustering-based pseudo-labels to provide supervision without the need for human annotations. However, the noise in these pseudo-labels caused by the clustering methods poses a challenge to the learning process leading to degraded performance. In this work, we propose a pseudo-label refinement (SLR) algorithm to address this issue. The cluster labels from the previous epoch are projected to the current epoch cluster-labels space and a linear combination of the new label and the projected label is computed as a soft refined label containing the information from the previous epoch clusters as well as from the current epoch. In contrast to the common practice of using the maximum value as a cluster/class indicator, we employ hierarchical clustering on these soft pseudo-labels to generate refined hard-labels. This approach better utilizes the information embedded in the soft labels, outperforming the simple maximum value approach for hard label generation. The effectiveness of the proposed SLR algorithm is evaluated in the context of person re-identification (Re-ID) using unsupervised domain adaptation (UDA). Experimental results demonstrate that the modified Re-ID baseline, incorporating the SLR algorithm, achieves significantly improved mean Average Precision (mAP) performance in various UDA tasks, including real-to-synthetic, synthetic-to-real, and different real-to-real scenarios. These findings highlight the efficacy of the SLR algorithm in enhancing the performance of self-supervised learning systems.
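The soft-refinement step described above (a linear combination of the current soft label with the previous epoch's label projected into the current label space) can be sketched directly; the projection itself is assumed already done, and the mixing weight `alpha` is a hypothetical parameter.

```python
def refine_soft_labels(curr_soft, prev_projected, alpha=0.5):
    """Soft refined label: a linear combination of the current-epoch soft
    cluster label and the previous-epoch label already projected into the
    current label space, so information from both epochs is retained."""
    return [
        [alpha * c + (1.0 - alpha) * p for c, p in zip(cur, prev)]
        for cur, prev in zip(curr_soft, prev_projected)
    ]

# a noisy current label is pulled toward the (projected) previous one
refined = refine_soft_labels([[0.75, 0.25]], [[1.0, 0.0]], alpha=0.5)  # [[0.875, 0.125]]
```

The paper then derives hard labels by hierarchically clustering these soft vectors rather than taking the per-sample argmax, which is the part the abstract credits for the improved mAP.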

[LG-63] Almost-Linear RNNs Yield Highly Interpretable Symbolic Codes in Dynamical Systems Reconstruction NEURIPS2024

链接: https://arxiv.org/abs/2410.14240
作者: Manuel Brenner,Christoph Jürgen Hemmer,Zahra Monfared,Daniel Durstewitz
关键词-EN: theory is fundamental, areas of science, PWL, Recurrent Neural Networks, PWL representations
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS); Chaotic Dynamics (nlin.CD); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:Dynamical systems (DS) theory is fundamental for many areas of science and engineering. It can provide deep insights into the behavior of systems evolving in time, as typically described by differential or recursive equations. A common approach to facilitate mathematical tractability and interpretability of DS models involves decomposing nonlinear DS into multiple linear DS separated by switching manifolds, i.e. piecewise linear (PWL) systems. PWL models are popular in engineering and a frequent choice in mathematics for analyzing the topological properties of DS. However, hand-crafting such models is tedious and only possible for very low-dimensional scenarios, while inferring them from data usually gives rise to unnecessarily complex representations with very many linear subregions. Here we introduce Almost-Linear Recurrent Neural Networks (AL-RNNs) which automatically and robustly produce most parsimonious PWL representations of DS from time series data, using as few PWL nonlinearities as possible. AL-RNNs can be efficiently trained with any SOTA algorithm for dynamical systems reconstruction (DSR), and naturally give rise to a symbolic encoding of the underlying DS that provably preserves important topological properties. We show that for the Lorenz and Rössler systems, AL-RNNs discover, in a purely data-driven way, the known topologically minimal PWL representations of the corresponding chaotic attractors. We further illustrate on two challenging empirical datasets that interpretable symbolic encodings of the dynamics can be achieved, tremendously facilitating mathematical and computational analysis of the underlying systems.
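The core AL-RNN idea, restricting the nonlinearity to as few latent units as possible, can be sketched as below. This is an idea sketch under the assumption that only the first `p` units pass through a ReLU while the rest stay linear; the paper's exact parameterisation may differ.

```python
def al_rnn_step(h, A, W, b, p):
    """Almost-linear recurrent update: a fully linear term A @ h plus a
    ReLU term W @ relu(h) in which only the first p of the latent units
    are allowed to be nonlinear (p = 0 gives a purely linear system)."""
    m = len(h)
    phi = [max(0.0, h[j]) if j < p else 0.0 for j in range(m)]
    return [
        b[i]
        + sum(A[i][j] * h[j] for j in range(m))
        + sum(W[i][j] * phi[j] for j in range(m))
        for i in range(m)
    ]
```

Each sign pattern of the `p` nonlinear units selects one linear subregion, which is what yields the symbolic encoding of the dynamics mentioned in the abstract.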

[LG-64] Unified Convergence Analysis for Score-Based Diffusion Models with Deterministic Samplers

链接: https://arxiv.org/abs/2410.14237
作者: Runjia Li,Qiwei Di,Quanquan Gu
关键词-EN: Score-based diffusion models, high-dimensional data distributions, Score-based diffusion, data distribution, original data distribution
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 68 pages

点击查看摘要

Abstract:Score-based diffusion models have emerged as powerful techniques for generating samples from high-dimensional data distributions. These models involve a two-phase process: first, injecting noise to transform the data distribution into a known prior distribution, and second, sampling to recover the original data distribution from noise. Among the various sampling methods, deterministic samplers stand out for their enhanced efficiency. However, analyzing these deterministic samplers presents unique challenges, as they preclude the use of established techniques such as Girsanov’s theorem, which are only applicable to stochastic samplers. Furthermore, existing analysis for deterministic samplers usually focuses on specific examples, lacking a generalized approach for general forward processes and various deterministic samplers. Our paper addresses these limitations by introducing a unified convergence analysis framework. To demonstrate the power of our framework, we analyze the variance-preserving (VP) forward process with the exponential integrator (EI) scheme, achieving iteration complexity of \tilde O(d^2/\epsilon) . Additionally, we provide a detailed analysis of Denoising Diffusion Implicit Models (DDIM)-type samplers, which have been underexplored in previous research, achieving polynomial iteration complexity.
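One of the deterministic samplers the analysis covers is DDIM; a single eta = 0 update is only a few lines. This is the standard DDIM formula shown for context (the paper's contribution is the convergence analysis, not the sampler itself).

```python
import math

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update (eta = 0): reconstruct the clean
    sample from the noise prediction, then re-noise it to the previous
    timestep without injecting fresh randomness."""
    x0_pred = (x_t - math.sqrt(1 - abar_t) * eps_pred) / math.sqrt(abar_t)
    return math.sqrt(abar_prev) * x0_pred + math.sqrt(1 - abar_prev) * eps_pred

# with a perfect noise prediction, stepping to abar_prev = 1 recovers x0
x0, eps = 2.0, 0.5
abar_t = 0.64
x_t = math.sqrt(abar_t) * x0 + math.sqrt(1 - abar_t) * eps
recovered = ddim_step(x_t, eps, abar_t, 1.0)  # ~= 2.0
```

Because no noise is drawn, the whole trajectory is a deterministic map of the initial noise, which is exactly why Girsanov-style arguments for stochastic samplers do not apply.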

[LG-65] G-NeuroDAVIS: A Neural Network model for generalized embedding data visualization and sample generation

链接: https://arxiv.org/abs/2410.14223
作者: Chayan Maitra,Rajat K. De
关键词-EN: Visualizing high-dimensional datasets, generalized embedding, visualizing high-dimensional data, Visualizing high-dimensional, long time
类目: Machine Learning (cs.LG)
*备注: 15 pages, 8 figures

点击查看摘要

Abstract:Visualizing high-dimensional datasets through a generalized embedding has been a challenge for a long time. Several methods have shown up for the same, but still, they have not been able to generate a generalized embedding, which not only can reveal the hidden patterns present in the data but also generate realistic high-dimensional samples from it. Motivated by this aspect, in this study, a novel generative model, called G-NeuroDAVIS, has been developed, which is capable of visualizing high-dimensional data through a generalized embedding, and thereby generating new samples. The model leverages advanced generative techniques to produce high-quality embedding that captures the underlying structure of the data more effectively than existing methods. G-NeuroDAVIS can be trained in both supervised and unsupervised settings. We rigorously evaluated our model through a series of experiments, demonstrating superior performance in classification tasks, which highlights the robustness of the learned representations. Furthermore, the conditional sample generation capability of the model has been described through qualitative assessments, revealing a marked improvement in generating realistic and diverse samples. G-NeuroDAVIS has outperformed the Variational Autoencoder (VAE) significantly in multiple key aspects, including embedding quality, classification performance, and sample generation capability. These results underscore the potential of our generative model to serve as a powerful tool in various applications requiring high-quality data generation and representation learning.

[LG-66] Formal Explanations for Neuro-Symbolic AI

链接: https://arxiv.org/abs/2410.14219
作者: Sushmita Paul,Jinqiang Yu,Jip J. Dekker,Alexey Ignatiev,Peter J. Stuckey
关键词-EN: Artificial Intelligence, neural, algorithms face, face two significant, Neuro-symbolic artificial intelligence
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注:

点击查看摘要

Abstract:Despite the practical success of Artificial Intelligence (AI), current neural AI algorithms face two significant issues. First, the decisions made by neural architectures are often prone to bias and brittleness. Second, when a chain of reasoning is required, neural systems often perform poorly. Neuro-symbolic artificial intelligence is a promising approach that tackles these (and other) weaknesses by combining the power of neural perception and symbolic reasoning. Meanwhile, the success of AI has made it critical to understand its behaviour, leading to the development of explainable artificial intelligence (XAI). While neuro-symbolic AI systems have important advantages over purely neural AI, we still need to explain their actions, which are obscured by the interactions of the neural and symbolic components. To address the issue, this paper proposes a formal approach to explaining the decisions of neuro-symbolic systems. The approach hinges on the use of formal abductive explanations and on solving the neuro-symbolic explainability problem hierarchically. Namely, it first computes a formal explanation for the symbolic component of the system, which serves to identify a subset of the individual parts of neural information that needs to be explained. This is followed by explaining only those individual neural inputs, independently of each other, which facilitates succinctness of hierarchical formal explanations and helps to increase the overall performance of the approach. Experimental results for a few complex reasoning tasks demonstrate practical efficiency of the proposed approach, in comparison to purely neural systems, from the perspective of explanation size, explanation time, training time, model sizes, and the quality of explanations reported.

[LG-67] Montessori-Instruct: Generate Influential Training Data Tailored for Student Learning

链接: https://arxiv.org/abs/2410.14208
作者: Xiaochuan Li,Zichun Yu,Chenyan Xiong
关键词-EN: inevitably introduces noisy, generative nature inevitably, nature inevitably introduces, misleading learning signals, large language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Codes and data are open-sourced at this https URL

点击查看摘要

Abstract:Synthetic data has been widely used to train large language models, but their generative nature inevitably introduces noisy, non-informative, and misleading learning signals. In this paper, we propose Montessori-Instruct, a novel data synthesis framework that tailors the data synthesis ability of the teacher language model toward the student language model’s learning process. Specifically, we utilize local data influence of synthetic training data points on students to characterize students’ learning preferences. Then, we train the teacher model with Direct Preference Optimization (DPO) to generate synthetic data tailored toward student learning preferences. Experiments with Llama3-8B-Instruct (teacher) and Llama3-8B (student) on Alpaca Eval and MT-Bench demonstrate that Montessori-Instruct significantly outperforms standard synthesis methods by 18.35% and 46.24% relatively. Our method also beats data synthesized by a stronger teacher model, GPT-4o. Further analysis confirms the benefits of teacher’s learning to generate more influential training data in the student’s improved learning, the advantages of local data influence in accurately measuring student preferences, and the robustness of Montessori-Instruct across different student models. Our code and data are open-sourced at this https URL.

[LG-68] Flexi-Fuzz least squares SVM for Alzheimers diagnosis: Tackling noise outliers and class imbalance

链接: https://arxiv.org/abs/2410.14207
作者: Mushir Akhtar,A. Quadir,M. Tanveer,Mohd. Arshad (for the Alzheimer’s Disease Neuroimaging Initiative)
关键词-EN: progressive cognitive decline, leading neurodegenerative condition, characterized by progressive, memory loss, neurodegenerative condition
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Alzheimer’s disease (AD) is a leading neurodegenerative condition and the primary cause of dementia, characterized by progressive cognitive decline and memory loss. Its progression, marked by shrinkage in the cerebral cortex, is irreversible. Numerous machine learning algorithms have been proposed for the early diagnosis of AD. However, they often struggle with the issues of noise, outliers, and class imbalance. To tackle the aforementioned limitations, in this article, we introduce a novel, robust, and flexible membership scheme called Flexi-Fuzz. This scheme integrates a novel flexible weighting mechanism, class probability, and imbalance ratio. The proposed flexible weighting mechanism assigns the maximum weight to samples within a specific proximity to the center, with a gradual decrease in weight beyond a certain threshold. This approach ensures that samples near the class boundary still receive significant weight, maintaining their influence in the classification process. Class probability is used to mitigate the impact of noisy samples, while the imbalance ratio addresses class imbalance. Leveraging this, we incorporate the proposed Flexi-Fuzz membership scheme into the least squares support vector machines (LSSVM) framework, resulting in a robust and flexible model termed Flexi-Fuzz-LSSVM. We determine the class-center using two methods: the conventional mean approach and an innovative median approach, leading to two model variants, Flexi-Fuzz-LSSVM-I and Flexi-Fuzz-LSSVM-II. To validate the effectiveness of the proposed Flexi-Fuzz-LSSVM models, we evaluated them on benchmark UCI and KEEL datasets, both with and without label noise. Additionally, we tested the models on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset for AD diagnosis. Experimental results demonstrate the superiority of the Flexi-Fuzz-LSSVM models over baseline models.
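The flexible weighting mechanism described above (maximum weight within a proximity of the class center, gradual decrease beyond a threshold) can be sketched with a simple membership function. The abstract does not give the exact functional form, so the exponential fall-off below is an assumption chosen only to illustrate the shape.

```python
import math

def flexi_weight(dist_to_center, r0, decay):
    """Hypothetical flexible membership weight: full weight (1.0) inside
    radius r0 of the class center, then a smooth exponential fall-off, so
    samples near the class boundary keep non-negligible influence."""
    if dist_to_center <= r0:
        return 1.0
    return math.exp(-decay * (dist_to_center - r0))
```

In the full scheme this weight would be combined with a class probability (to down-weight noisy samples) and an imbalance ratio before entering the LSSVM objective.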

[LG-69] xPerT: Extended Persistence Transformer

链接: https://arxiv.org/abs/2410.14193
作者: Sehun Kim
关键词-EN: persistent homology, compact summary, summary of persistent, captures the topological, topological features
类目: Machine Learning (cs.LG); Algebraic Topology (math.AT)
*备注:

点击查看摘要

Abstract:A persistence diagram provides a compact summary of persistent homology, which captures the topological features of a space at different scales. However, due to its nature as a set, incorporating it as a feature into a machine learning framework is challenging. Several methods have been proposed to use persistence diagrams as input for machine learning models, but they often require complex preprocessing steps and extensive hyperparameter tuning. In this paper, we propose a novel transformer architecture called the \textitExtended Persistence Transformer (xPerT), which is far more scalable than Persformer, an existing transformer for persistence diagrams. xPerT reduces GPU memory usage by over 90% and improves accuracy on multiple datasets. Additionally, xPerT does not require complex preprocessing steps or extensive hyperparameter tuning, making it easy to use in practice. Our code is available at this https URL.

[LG-70] Combining Hough Transform and Deep Learning Approaches to Reconstruct ECG Signals From Printouts

链接: https://arxiv.org/abs/2410.14185
作者: Felix Krones,Ben Walker,Terry Lyons,Adam Mahdi
关键词-EN: Moody PhysioNet Challenge, George B. Moody, Moody PhysioNet, presents our team, winning contribution
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:This work presents our team’s (SignalSavants) winning contribution to the 2024 George B. Moody PhysioNet Challenge. The Challenge had two goals: reconstruct ECG signals from printouts and classify them for cardiac diseases. Our focus was the first task. Despite many ECGs being digitally recorded today, paper ECGs remain common throughout the world. Digitising them could help build more diverse datasets and enable automated analyses. However, the presence of varying recording standards and poor image quality requires a data-centric approach for developing robust models that can generalise effectively. Our approach combines the creation of a diverse training set, Hough transform to rotate images, a U-Net based segmentation model to identify individual signals, and mask vectorisation to reconstruct the signals. We assessed the performance of our models using the 10-fold stratified cross-validation (CV) split of 21,799 recordings proposed by the PTB-XL dataset. On the digitisation task, our model achieved an average CV signal-to-noise ratio of 17.02 and an official Challenge score of 12.15 on the hidden set, securing first place in the competition. Our study shows the challenges of building robust, generalisable, digitisation approaches. Such models require large amounts of resources (data, time, and computational power) but have great potential in diversifying the data available.
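The Hough-transform rotation step in the pipeline above can be illustrated with a tiny pure-Python vote over line parameters: the dominant line through the ink pixels reveals the scan's skew. A real pipeline would use an optimised implementation on a full binarised image; this only shows the rotation-estimation idea.

```python
import math

def hough_dominant_angle(points, n_theta=180, rho_res=1.0):
    """Vote in (theta, rho) space: each ink pixel supports every line
    through it; the most-voted cell gives the dominant line's normal
    angle, from which a deskew rotation can be derived."""
    acc = {}
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = round((x * math.cos(theta) + y * math.sin(theta)) / rho_res)
            acc[(t, rho)] = acc.get((t, rho), 0) + 1
    t_best = max(acc, key=acc.get)[0]
    return math.pi * t_best / n_theta

# pixels along a horizontal baseline: the strongest line's normal is ~90 degrees
pts = [(float(x), 5.0) for x in range(21)]
angle = hough_dominant_angle(pts)  # close to pi/2
```

For an ECG printout, the gridlines provide many collinear pixels, so the strongest Hough cell tracks the paper's orientation even under noise.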

[LG-71] LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs

链接: https://arxiv.org/abs/2410.14182
作者: Yujun Zhou,Jingdong Yang,Kehan Guo,Pin-Yu Chen,Tian Gao,Werner Geyer,Nuno Moniz,Nitesh V Chawla,Xiangliang Zhang
关键词-EN: accidents pose significant, Laboratory accidents pose, pose significant risks, life and property, underscoring the importance
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 50 pages, 19 figures

点击查看摘要

Abstract:Laboratory accidents pose significant risks to human life and property, underscoring the importance of robust safety protocols. Despite advancements in safety training, laboratory personnel may still unknowingly engage in unsafe practices. With the increasing reliance on large language models (LLMs) for guidance in various fields, including laboratory settings, there is a growing concern about their reliability in critical safety-related decision-making. Unlike trained human researchers, LLMs lack formal lab safety education, raising questions about their ability to provide safe and accurate guidance. Existing research on LLM trustworthiness primarily focuses on issues such as ethical compliance, truthfulness, and fairness but fails to fully cover safety-critical real-world applications, like lab safety. To address this gap, we propose the Laboratory Safety Benchmark (LabSafety Bench), a comprehensive evaluation framework based on a new taxonomy aligned with Occupational Safety and Health Administration (OSHA) protocols. This benchmark includes 765 multiple-choice questions verified by human experts, assessing LLMs and vision language models (VLMs) performance in lab safety contexts. Our evaluations demonstrate that while GPT-4o outperforms human participants, it is still prone to critical errors, highlighting the risks of relying on LLMs in safety-critical environments. Our findings emphasize the need for specialized benchmarks to accurately assess the trustworthiness of LLMs in real-world safety applications.

[LG-72] Auto Detecting Cognitive Events Using Machine Learning on Pupillary Data

链接: https://arxiv.org/abs/2410.14174
作者: Quang Dang,Murat Kucukosmanoglu,Michael Anoruo,Golshan Kargosha,Sarah Conklin,Justin Brooks
关键词-EN: affects information processing, Assessing cognitive workload, Assessing cognitive, decision making, information processing
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Neurons and Cognition (q-bio.NC)
*备注: 10 pages, 7 figures

点击查看摘要

Abstract:Assessing cognitive workload is crucial for human performance as it affects information processing, decision making, and task execution. Pupil size is a valuable indicator of cognitive workload, reflecting changes in attention and arousal governed by the autonomic nervous system. Cognitive events are closely linked to cognitive workload as they activate mental processes and trigger cognitive responses. This study explores the potential of using machine learning to automatically detect cognitive events experienced by individuals. We framed the problem as a binary classification task, focusing on detecting stimulus onset across four cognitive tasks using CNN models and 1-second pupillary data. The results, measured by the Matthews correlation coefficient, ranged from 0.47 to 0.80, depending on the cognitive task. This paper discusses the trade-offs between generalization and specialization, model behavior when encountering unseen stimulus onset times, structural variances among cognitive tasks, factors influencing model predictions, and real-time simulation. These findings highlight the potential of machine learning techniques in detecting cognitive events based on pupil and eye movement responses, contributing to advancements in personalized learning and optimizing neurocognitive workload management.

[LG-73] Heavy-Tailed Diffusion Models

链接: https://arxiv.org/abs/2410.14171
作者: Kushagra Pandey,Jaideep Pathak,Yilun Xu,Stephan Mandt,Michael Pritchard,Arash Vahdat,Morteza Mardani
关键词-EN: distributions remains unclear, remains unclear, heavy-tailed distributions remains, capture heavy-tailed behavior, ability to capture
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 51 pages, Contains GIF animations and is best viewed with a dedicated pdf reader

点击查看摘要

Abstract:Diffusion models achieve state-of-the-art generation quality across many applications, but their ability to capture rare or extreme events in heavy-tailed distributions remains unclear. In this work, we show that traditional diffusion and flow-matching models with standard Gaussian priors fail to capture heavy-tailed behavior. We address this by repurposing the diffusion framework for heavy-tail estimation using multivariate Student-t distributions. We develop a tailored perturbation kernel and derive the denoising posterior based on the conditional Student-t distribution for the backward process. Inspired by \gamma -divergence for heavy-tailed distributions, we derive a training objective for heavy-tailed denoisers. The resulting framework introduces controllable tail generation using only a single scalar hyperparameter, making it easily tunable for diverse real-world distributions. As specific instantiations of our framework, we introduce t-EDM and t-Flow, extensions of existing diffusion and flow models that employ a Student-t prior. Remarkably, our approach is readily compatible with standard Gaussian diffusion models and requires only minimal code changes. Empirically, we show that our t-EDM and t-Flow outperform standard diffusion models in heavy-tail estimation on high-resolution weather datasets in which generating rare and extreme events is crucial.
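Swapping the Gaussian prior for the Student-t prior that t-EDM and t-Flow use comes down to sampling from a heavier-tailed distribution, which the classic Gaussian scale mixture achieves. This sketch only shows the prior swap; the paper's tailored perturbation kernel and denoising posterior go further.

```python
import math, random

def sample_student_t(nu, dim, rng):
    """Multivariate Student-t draw via a Gaussian scale mixture:
    t = z * sqrt(nu / chi2_nu). Heavier-tailed than the standard Gaussian
    prior of ordinary diffusion models (integer nu, for simplicity)."""
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))
    return [zi * math.sqrt(nu / chi2) for zi in z]

rng = random.Random(0)
n = 20000
extreme = sum(abs(sample_student_t(3, 1, rng)[0]) > 3.0 for _ in range(n)) / n
# for nu=3 roughly 3% of draws exceed |3|; a standard Gaussian gives ~0.27%
```

The single degrees-of-freedom parameter `nu` is the scalar knob the abstract mentions: small `nu` means heavy tails, and `nu -> infinity` recovers the Gaussian case.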

[LG-74] Assessing Open-world Forgetting in Generative Image Model Customization

链接: https://arxiv.org/abs/2410.14159
作者: Héctor Laria,Alex Gomez-Villa,Imad Eddine Marouf,Kai Wang,Bogdan Raducanu,Joost van de Weijer
关键词-EN: significantly enhanced image, enhanced image generation, Recent advances, image generation capabilities, significantly enhanced
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: Project page: this https URL

点击查看摘要

Abstract:Recent advances in diffusion models have significantly enhanced image generation capabilities. However, customizing these models with new classes often leads to unintended consequences that compromise their reliability. We introduce the concept of open-world forgetting to emphasize the vast scope of these unintended alterations, contrasting it with the well-studied closed-world forgetting, which is measurable by evaluating performance on a limited set of classes or skills. Our research presents the first comprehensive investigation into open-world forgetting in diffusion models, focusing on semantic and appearance drift of representations. We utilize zero-shot classification to analyze semantic drift, revealing that even minor model adaptations lead to unpredictable shifts affecting areas far beyond newly introduced concepts, with dramatic drops in zero-shot classification of up to 60%. Additionally, we observe significant changes in texture and color of generated content when analyzing appearance drift. To address these issues, we propose a mitigation strategy based on functional regularization, designed to preserve original capabilities while accommodating new concepts. Our study aims to raise awareness of unintended changes due to model customization and advocates for the analysis of open-world forgetting in future research on model customization and finetuning methods. Furthermore, we provide insights for developing more robust adaptation methodologies.

[LG-75] A Mirror Descent Perspective of Smoothed Sign Descent

链接: https://arxiv.org/abs/2410.14158
作者: Shuyang Wang,Diego Klabjan
关键词-EN: Recent work, work by Woodworth, dual dynamics, Recent, Woodworth
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Recent work by Woodworth et al. (2020) shows that the optimization dynamics of gradient descent for overparameterized problems can be viewed as low-dimensional dual dynamics induced by a mirror map, explaining the implicit regularization phenomenon from the mirror descent perspective. However, the methodology does not apply to algorithms where update directions deviate from true gradients, such as ADAM. We use the mirror descent framework to study the dynamics of smoothed sign descent with a stability constant \varepsilon for regression problems. We propose a mirror map that establishes equivalence to dual dynamics under some assumptions. By studying dual dynamics, we characterize the convergent solution as an approximate KKT point of minimizing a Bregman divergence style function, and show the benefit of tuning the stability constant \varepsilon to reduce the KKT error.
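The smoothed sign descent update with stability constant epsilon is commonly written as the direction g / (|g| + epsilon); the paper's exact normalisation may differ, so treat the sketch below as an assumed form for illustration.

```python
def smoothed_sign_step(x, grad, lr, eps):
    """One smoothed sign descent update with stability constant eps:
    direction g / (|g| + eps). As eps -> 0 this recovers sign descent;
    a large eps scales the step toward plain (normalised) gradient
    descent."""
    return [xi - lr * g / (abs(g) + eps) for xi, g in zip(x, grad)]

x_new = smoothed_sign_step([1.0], [4.0], lr=0.1, eps=1.0)   # step of 0.1*4/5
x_sign = smoothed_sign_step([1.0], [4.0], lr=0.1, eps=0.0)  # pure sign step
```

Tuning `eps` trades off between these regimes, which is how the paper's KKT-error bound motivates choosing the stability constant.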

[LG-76] Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning

链接: https://arxiv.org/abs/2410.14157
作者: Jiacheng Ye,Jiahui Gao,Shansan Gong,Lin Zheng,Xin Jiang,Zhenguo Li,Lingpeng Kong
关键词-EN: long-term planning tasks, reasoning and long-term, long-term planning, Multi-granularity Diffusion Modeling, Boolean Satisfiability Problems
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Autoregressive language models, despite their impressive capabilities, struggle with complex reasoning and long-term planning tasks. We introduce discrete diffusion models as a novel solution to these challenges. Through the lens of subgoal imbalance, we demonstrate how diffusion models effectively learn difficult subgoals that elude autoregressive approaches. We propose Multi-granularity Diffusion Modeling (MDM), which prioritizes subgoals based on difficulty during learning. On complex tasks like Countdown, Sudoku, and Boolean Satisfiability Problems, MDM significantly outperforms autoregressive models without using search techniques. For instance, MDM achieves 91.5% and 100% accuracy on Countdown and Sudoku, respectively, compared to 45.8% and 20.7% for autoregressive models. Our work highlights the potential of diffusion-based approaches in advancing AI capabilities for sophisticated language understanding and problem-solving tasks.

[LG-77] Wireless Human-Machine Collaboration in Industry 5.0

链接: https://arxiv.org/abs/2410.14153
作者: Gaoyang Pang,Wanchun Liu,Dusit Niyato,Daniel Quevedo,Branka Vucetic,Yonghui Li
关键词-EN: Wireless Human-Machine Collaboration, Human-Machine Collaboration, advancement for Industry, geographically distributed systems, enabling seamless interaction
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY)
*备注: Paper accepted by IEEE Transactions on Automatic Control

点击查看摘要

Abstract:Wireless Human-Machine Collaboration (WHMC) represents a critical advancement for Industry 5.0, enabling seamless interaction between humans and machines across geographically distributed systems. As the WHMC systems become increasingly important for achieving complex collaborative control tasks, ensuring their stability is essential for practical deployment and long-term operation. Stability analysis certifies how the closed-loop system will behave under model randomness, which is essential for systems operating with wireless communications. However, the fundamental stability analysis of the WHMC systems remains an unexplored challenge due to the intricate interplay between the stochastic nature of wireless communications, dynamic human operations, and the inherent complexities of control system dynamics. This paper establishes a fundamental WHMC model incorporating dual wireless loops for machine and human control. Our framework accounts for practical factors such as short-packet transmissions, fading channels, and advanced HARQ schemes. We model human control lag as a Markov process, which is crucial for capturing the stochastic nature of human interactions. Building on this model, we propose a stochastic cycle-cost-based approach to derive a stability condition for the WHMC system, expressed in terms of wireless channel statistics, human dynamics, and control parameters. Our findings are validated through extensive numerical simulations and a proof-of-concept experiment, where we developed and tested a novel wireless collaborative cart-pole control system. The results confirm the effectiveness of our approach and provide a robust framework for future research on WHMC systems in more complex environments.

[LG-78] CausalChat: Interactive Causal Model Development and Refinement Using Large Language Models

链接: https://arxiv.org/abs/2410.14146
作者: Yanming Zhang,Akshith Kota,Eric Papenhausen,Klaus Mueller
关键词-EN: Causal networks, detailed causal networks, complex relationships, Causal, construct causal networks
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Causal networks are widely used in many fields to model the complex relationships between variables. A recent approach has sought to construct causal networks by leveraging the wisdom of crowds through the collective participation of humans. While this can yield detailed causal networks that model the underlying phenomena quite well, it requires a large number of individuals with domain understanding. We adopt a different approach: leveraging the causal knowledge that large language models, such as OpenAI’s GPT-4, have learned by ingesting massive amounts of literature. Within a dedicated visual analytics interface, called CausalChat, users explore single variables or variable pairs recursively to identify causal relations, latent variables, confounders, and mediators, constructing detailed causal networks through conversation. Each probing interaction is translated into a tailored GPT-4 prompt and the response is conveyed through visual representations which are linked to the generated text for explanations. We demonstrate the functionality of CausalChat across diverse data contexts and conduct user studies involving both domain experts and laypersons.

[LG-79] Preview-based Category Contrastive Learning for Knowledge Distillation

链接: https://arxiv.org/abs/2410.14143
作者: Muhe Ding,Jianlong Wu,Xue Dong,Xiaojie Li,Pengda Qin,Tian Gan,Liqiang Nie
关键词-EN: model compression, larger model, smaller model, mainstream algorithm, compression by transferring
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 14 pages, 8 figures, Journal

点击查看摘要

Abstract:Knowledge distillation is a mainstream algorithm in model compression, transferring knowledge from the larger model (teacher) to the smaller model (student) to improve the performance of the student. Despite many efforts, existing methods mainly investigate the consistency between instance-level feature representations or predictions, which neglects category-level information and the difficulty of each sample, leading to undesirable performance. To address these issues, we propose a novel preview-based category contrastive learning method for knowledge distillation (PCKD). It first distills the structural knowledge of both instance-level feature correspondence and the relation between instance features and category centers in a contrastive learning fashion, which can explicitly optimize the category representation and explore the distinct correlation between representations of instances and categories, contributing to discriminative category centers and better classification results. Besides, we introduce a novel preview strategy to dynamically determine how much the student should learn from each sample according to their difficulty. Different from existing methods that treat all samples equally and curriculum learning that simply filters out hard samples, our method assigns a small weight to hard instances as a preview to better guide the student training. Extensive experiments on several challenging datasets, including CIFAR-100 and ImageNet, demonstrate its superiority over state-of-the-art methods.
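The "preview" idea of down-weighting hard samples, rather than filtering them out as curriculum learning does, can be sketched as a soft weighting of per-sample losses. This is an illustrative guess at one way to realize such a weighting, not the paper's actual formulation:

```python
import math


def preview_weights(losses, temperature=1.0):
    """Per-sample training weights that shrink for hard (high-loss)
    samples instead of discarding them: w_i proportional to exp(-loss_i / T),
    normalized to sum to 1. (A sketch of the preview idea, not PCKD itself.)
    """
    raw = [math.exp(-l / temperature) for l in losses]
    total = sum(raw)
    return [r / total for r in raw]
```

An easy sample (loss 0.1) then receives a larger weight than a hard one (loss 2.0), yet the hard sample still contributes a nonzero gradient signal.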

[LG-80] Hierarchical Conditional Multi-Task Learning for Streamflow Modeling

链接: https://arxiv.org/abs/2410.14137
作者: Shaoming Xu,Arvind Renganathan,Ankush Khandelwal,Rahul Ghosh,Xiang Li,Licheng Liu,Kshitij Tayal,Peter Harrington,Xiaowei Jia,Zhenong Jin,Jonh Nieber,Vipin Kumar
关键词-EN: systems involving intermediate, involving intermediate processes, intermediate processes driven, meteorological forces, involving intermediate
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Streamflow, vital for water resource management, is governed by complex hydrological systems involving intermediate processes driven by meteorological forces. While deep learning models have achieved state-of-the-art results of streamflow prediction, their end-to-end single-task learning approach often fails to capture the causal relationships within these systems. To address this, we propose Hierarchical Conditional Multi-Task Learning (HCMTL), a hierarchical approach that jointly models soil water and snowpack processes based on their causal connections to streamflow. HCMTL utilizes task embeddings to connect network modules, enhancing flexibility and expressiveness while capturing unobserved processes beyond soil water and snowpack. It also incorporates the Conditional Mini-Batch strategy to improve long time series modeling. We compare HCMTL with five baselines on a global dataset. HCMTL’s superior performance across hundreds of drainage basins over extended periods shows that integrating domain-specific causal knowledge into deep learning enhances both prediction accuracy and interpretability. This is essential for advancing our understanding of complex hydrological systems and supporting efficient water resource management to mitigate natural disasters like droughts and floods.

[LG-81] Inverse Reinforcement Learning from Non-Stationary Learning Agents

链接: https://arxiv.org/abs/2410.14135
作者: Kavinayan P. Sivakumar,Yi Shen,Zachary Bell,Scott Nivison,Boyuan Chen,Michael M. Zavlanos
关键词-EN: trajectory data collected, inverse reinforcement learning, reinforcement learning problem, reward function, learning agent
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this paper, we study an inverse reinforcement learning problem that involves learning the reward function of a learning agent using trajectory data collected while this agent is learning its optimal policy. To address this problem, we propose an inverse reinforcement learning method that allows us to estimate the policy parameters of the learning agent, which can then be used to estimate its reward function. Our method relies on a new variant of the behavior cloning algorithm, which we call bundle behavior cloning, and uses a small number of trajectories generated by the learning agent’s policy at different points in time to learn a set of policies that match the distribution of actions observed in the sampled trajectories. We then use the cloned policies to train a neural network model that estimates the reward function of the learning agent. We provide a theoretical analysis showing that the complexity bound for our method improves on that of standard behavior cloning, along with numerical experiments on a reinforcement learning problem that validate the proposed method.

[LG-82] Towards Robust Transcription: Exploring Noise Injection Strategies for Training Data Augmentation

链接: https://arxiv.org/abs/2410.14122
作者: Yonghyun Kim,Alexander Lerch
关键词-EN: Automatic Piano Transcription, remains largely unexplored, significantly improved system, Automatic Piano, improved system performance
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted to the Late-Breaking Demo Session of the 25th International Society for Music Information Retrieval (ISMIR) Conference, 2024

点击查看摘要

Abstract:Recent advancements in Automatic Piano Transcription (APT) have significantly improved system performance, but the impact of noisy environments on the system performance remains largely unexplored. This study investigates the impact of white noise at various Signal-to-Noise Ratio (SNR) levels on state-of-the-art APT models and evaluates the performance of the Onsets and Frames model when trained on noise-augmented data. We hope this research provides valuable insights as preliminary work toward developing transcription models that maintain consistent performance across a range of acoustic conditions.
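Mixing white noise into audio at a prescribed SNR, as in this study's augmentation setup, can be sketched generically. This is a standard white-noise-mixing routine written for illustration, not code from the paper:

```python
import math
import random


def add_white_noise(signal, snr_db, seed=0):
    """Return signal + white Gaussian noise scaled so that the resulting
    signal-to-noise ratio is (approximately) snr_db decibels."""
    rng = random.Random(seed)
    sig_power = sum(s * s for s in signal) / len(signal)
    # SNR(dB) = 10 * log10(signal_power / noise_power)
    noise_power = sig_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]
```

Applying this to a clean sine wave at 10 dB and re-measuring the empirical SNR recovers a value close to the target, which is the property an augmentation pipeline relies on.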

[LG-83] FedMSE: Federated learning for IoT network intrusion detection

链接: https://arxiv.org/abs/2410.14121
作者: Van Tuan Nguyen,Razvan Beuran
关键词-EN: improving IoT network, paper proposes, improving IoT, federated learning approach, federated learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper proposes a novel federated learning approach for improving IoT network intrusion detection. The rise of IoT has expanded the cyber attack surface, making traditional centralized machine learning methods insufficient due to concerns about data availability, computational resources, transfer costs, and especially privacy preservation. A semi-supervised federated learning model was developed to overcome these issues, combining the Shrink Autoencoder and Centroid one-class classifier (SAE-CEN). This approach enhances the performance of intrusion detection by effectively representing normal network data and accurately identifying anomalies in the decentralized strategy. Additionally, a mean square error-based aggregation algorithm (MSEAvg) was introduced to improve global model performance by prioritizing more accurate local models. The results obtained in our experimental setup, which uses various settings relying on the N-BaIoT dataset and Dirichlet distribution, demonstrate significant improvements in real-world heterogeneous IoT networks in detection accuracy from 93.98 \pm 2.90 to 97.30 \pm 0.49, reduced learning costs when requiring only 50% of gateways participating in the training process, and robustness in large-scale networks.
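The idea behind an MSE-based aggregation rule can be sketched as weighting each client's parameters inversely to its local error. The exact MSEAvg weighting used in the paper may differ; this is an illustrative stand-in:

```python
def mse_avg(local_weights, local_mses):
    """Aggregate local model parameter vectors, weighting each client
    inversely to its validation MSE so that more accurate local models
    dominate the global model. (Sketch of the MSEAvg idea, not the
    paper's exact scheme.)
    """
    inv = [1.0 / m for m in local_mses]
    total = sum(inv)
    coeffs = [w / total for w in inv]  # normalized inverse-MSE weights
    dim = len(local_weights[0])
    return [sum(c * w[i] for c, w in zip(coeffs, local_weights))
            for i in range(dim)]
```

For two clients with MSEs 1.0 and 3.0, the first client receives weight 0.75 and the second 0.25, so the aggregate leans toward the more accurate model.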

[LG-84] Skill Generalization with Verbs IROS2023

链接: https://arxiv.org/abs/2410.14118
作者: Rachel Ma,Lyndon Lam,Benjamin A. Spiegel,Aditya Ganeshan,Roma Patel,Ben Abbatematteo,David Paulius,Stefanie Tellex,George Konidaris
关键词-EN: understand natural language, language commands issued, natural language commands, issued by humans, understand natural
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages + 2 pages (references), 6 figures. Accepted at IROS 2023. Code, dataset info and demo videos can be found at: this https URL

点击查看摘要

Abstract:It is imperative that robots can understand natural language commands issued by humans. Such commands typically contain verbs that signify what action should be performed on a given object and that are applicable to many objects. We propose a method for generalizing manipulation skills to novel objects using verbs. Our method learns a probabilistic classifier that determines whether a given object trajectory can be described by a specific verb. We show that this classifier accurately generalizes to novel object categories with an average accuracy of 76.69% across 13 object categories and 14 verbs. We then perform policy search over the object kinematics to find an object trajectory that maximizes classifier prediction for a given verb. Our method allows a robot to generate a trajectory for a novel object based on a verb, which can then be used as input to a motion planner. We show that our model can generate trajectories that are usable for executing five verb commands applied to novel instances of two different object categories on a real robot.

[LG-85] A Communication and Computation Efficient Fully First-order Method for Decentralized Bilevel Optimization

链接: https://arxiv.org/abs/2410.14115
作者: Min Wen,Chengchang Liu,Ahmed Abdelmoniem,Yipeng Zhou,Yuedong Xu
关键词-EN: decentralized Bilevel optimization, Bilevel optimization, decentralized bilevel, meta-learning and reinforcement, remains less explored
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC)
*备注: 19 Pages

点击查看摘要

Abstract:Bilevel optimization, crucial for hyperparameter tuning, meta-learning and reinforcement learning, remains less explored in decentralized learning paradigms such as decentralized federated learning (DFL). Typically, decentralized bilevel methods rely on both gradients and Hessian matrices to approximate hypergradients of upper-level models; however, acquiring and sharing this second-order oracle is compute- and communication-intensive. To overcome these challenges, this paper introduces C^2DFB, a fully first-order decentralized method for decentralized bilevel optimization that is both compute- and communication-efficient. In C^2DFB, each learning node optimizes a min-min-max problem to approximate the hypergradient using gradient information exclusively. To reduce the traffic load in the inner loop of solving the lower-level problem, C^2DFB incorporates a lightweight communication protocol that efficiently transmits compressed residuals of local parameters. A rigorous theoretical analysis ensures convergence of the algorithm, requiring \tilde{\mathcal{O}}(\epsilon^{-4}) first-order oracle calls. Experiments on hyperparameter tuning and hyper-representation tasks validate the superiority of C^2DFB across various topologies and heterogeneous data distributions.

[LG-86] Improving Graph Neural Networks by Learning Continuous Edge Directions

链接: https://arxiv.org/abs/2410.14109
作者: Seong Ho Pahng,Sahand Hormoz
关键词-EN: Graph Neural Networks, Neural Networks, Graph Neural, continuous edge directions, edge directions
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) traditionally employ a message-passing mechanism that resembles diffusion over undirected graphs, which often leads to homogenization of node features and reduced discriminative power in tasks such as node classification. Our key insight for addressing this limitation is to assign fuzzy edge directions – that can vary continuously from node i pointing to node j to vice versa – to the edges of a graph so that features can preferentially flow in one direction between nodes to enable long-range information transmission across the graph. We also introduce a novel complex-valued Laplacian for directed graphs with fuzzy edges where the real and imaginary parts represent information flow in opposite directions. Using this Laplacian, we propose a general framework, called Continuous Edge Direction (CoED) GNN, for learning on graphs with fuzzy edges and prove its expressivity limits using a generalization of the Weisfeiler-Leman (WL) graph isomorphism test for directed graphs with fuzzy edges. Our architecture aggregates neighbor features scaled by the learned edge directions and processes the aggregated messages from in-neighbors and out-neighbors separately alongside the self-features of the nodes. Since continuous edge directions are differentiable, they can be learned jointly with the GNN weights via gradient-based optimization. CoED GNN is particularly well-suited for graph ensemble data where the graph structure remains fixed but multiple realizations of node features are available, such as in gene regulatory networks, web connectivity graphs, and power grids. We demonstrate through extensive experiments on both synthetic and real datasets that learning continuous edge directions significantly improves performance both for undirected and directed graphs compared with existing methods.
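The notion of a continuous (fuzzy) edge direction can be conveyed with a toy message-passing step in which a per-edge parameter α ∈ [0, 1] splits how much feature mass flows each way along the edge. The paper's actual architecture uses a complex-valued Laplacian and learned transformations; this sketch only illustrates the directionality idea, with scalar node features for simplicity:

```python
def coed_aggregate(features, edges, alphas):
    """One message-passing step with continuous edge directions.

    For edge (i, j) with direction parameter alpha in [0, 1], a fraction
    alpha of j's feature flows to i and (1 - alpha) of i's feature flows
    to j. alpha = 1 makes the edge fully directed j -> i; alpha = 0.5
    recovers symmetric (undirected) aggregation.
    """
    out = [0.0] * len(features)
    for (i, j), a in zip(edges, alphas):
        out[i] += a * features[j]
        out[j] += (1 - a) * features[i]
    return out
```

Because α is continuous, it is differentiable and could in principle be learned jointly with the network weights, which is the key property the paper exploits.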

[LG-87] Transfer Learning on Transformers for Building Energy Consumption Forecasting – A Comparative Study

链接: https://arxiv.org/abs/2410.14107
作者: Robert Spencer,Surangika Ranathunga,Mikael Boulic,Andries(Hennie)van Heerden,Teo Susnjak
关键词-EN: application of Transfer, Transfer Learning, Recurrent Neural Networks, Convolutional Neural Networks, Neural Networks
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study investigates the application of Transfer Learning (TL) on Transformer architectures to enhance building energy consumption forecasting. Transformers are a relatively new deep learning architecture, which has served as the foundation for groundbreaking technologies such as ChatGPT. While TL has been studied in the past, these studies considered either one TL strategy or used older deep learning models such as Recurrent Neural Networks or Convolutional Neural Networks. Here, we carry out an extensive empirical study on six different TL strategies and analyse their performance under varying feature spaces. In addition to the vanilla Transformer architecture, we also experiment with Informer and PatchTST, specifically designed for time series forecasting. We use 16 datasets from the Building Data Genome Project 2 to create building energy consumption forecasting models. Experiment results reveal that while TL is generally beneficial, especially when the target domain has no data, careful selection of the exact TL strategy should be made to gain the maximum benefit. This decision largely depends on the feature space properties such as the recorded weather features. We also note that PatchTST outperforms the other two Transformer variants (vanilla Transformer and Informer). We believe our findings would assist researchers in making informed decision in using TL and transformer architectures for building energy consumption forecasting.

[LG-88] DMGNN: Detecting and Mitigating Backdoor Attacks in Graph Neural Networks

链接: https://arxiv.org/abs/2410.14105
作者: Hao Sui,Bing Chen,Jiale Zhang,Chengcheng Zhu,Di Wu,Qinghua Lu,Guodong Long
关键词-EN: OOD backdoor attacks, backdoor attacks, graph backdoor attacks, Recent studies, multiple adversarial attacks
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 12 pages, 8 figures

点击查看摘要

Abstract:Recent studies have revealed that GNNs are highly susceptible to multiple adversarial attacks. Among these, graph backdoor attacks pose one of the most prominent threats, where attackers cause models to misclassify by learning the backdoored features with injected triggers and modified target labels during the training phase. Based on the features of the triggers, these attacks can be categorized into out-of-distribution (OOD) and in-distribution (ID) graph backdoor attacks: triggers with notable differences from the clean-sample feature distribution constitute OOD backdoor attacks, whereas the triggers in ID backdoor attacks are nearly identical to the clean-sample feature distribution. Existing methods can successfully defend against OOD backdoor attacks by comparing the feature distributions of triggers and clean samples, but they fail to mitigate stealthy ID backdoor attacks. Due to the lack of proper supervision signals, the main-task accuracy is negatively affected when defending against ID backdoor attacks. To bridge this gap, we propose DMGNN against OOD and ID graph backdoor attacks, which can powerfully eliminate stealthiness to guarantee defense effectiveness and improve model performance. Specifically, DMGNN can easily identify the hidden ID and OOD triggers by predicting label transitions based on counterfactual explanation. To further filter the diversity of generated explainable graphs and erase the influence of the trigger features, we present a reverse sampling pruning method to screen and discard the triggers directly at the data level. Extensive experimental evaluations on open graph datasets demonstrate that DMGNN far outperforms the state-of-the-art (SOTA) defense methods, reducing the attack success rate to 5% with almost negligible degradation in model performance (within 3.5%).

[LG-89] ST-MoE-BERT: A Spatial-Temporal Mixture-of-Experts Framework for Long-Term Cross-City Mobility Prediction

链接: https://arxiv.org/abs/2410.14099
作者: Haoyu He,Haozheng Luo,Qi R. Wang
关键词-EN: Predicting human mobility, multiple cities presents, cities presents significant, Predicting human, presents significant challenges
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 2nd ACM SIGSPATIAL International Workshop on the Human Mobility Prediction Challenge

点击查看摘要

Abstract:Predicting human mobility across multiple cities presents significant challenges due to the complex and diverse spatial-temporal dynamics inherent in different urban environments. In this study, we propose a robust approach to predict human mobility patterns called ST-MoE-BERT. Compared to existing methods, our approach frames the prediction task as a spatial-temporal classification problem. Our methodology integrates the Mixture-of-Experts architecture with the BERT model to capture complex mobility dynamics and perform the downstream human mobility prediction task. Additionally, transfer learning is integrated to solve the challenge of data scarcity in cross-city prediction. We demonstrate the effectiveness of the proposed model in terms of GEO-BLEU and DTW, comparing it to several state-of-the-art methods. Notably, ST-MoE-BERT achieves an average improvement of 8.29%.

[LG-90] Efficient Sparse PCA via Block-Diagonalization

链接: https://arxiv.org/abs/2410.14092
作者: Alberto Del Pia,Dekun Zhou,Yinglun Zhu
关键词-EN: Principal Component Analysis, Sparse Principal Component, Sparse PCA, Sparse PCA algorithm, Principal Component
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Sparse Principal Component Analysis (Sparse PCA) is a pivotal tool in data analysis and dimensionality reduction. However, Sparse PCA is a challenging problem in both theory and practice: it is known to be NP-hard and current exact methods generally require exponential runtime. In this paper, we propose a novel framework to efficiently approximate Sparse PCA by (i) approximating the general input covariance matrix with a re-sorted block-diagonal matrix, (ii) solving the Sparse PCA sub-problem in each block, and (iii) reconstructing the solution to the original problem. Our framework is simple and powerful: it can leverage any off-the-shelf Sparse PCA algorithm and achieve significant computational speedups, with a minor additive error that is linear in the approximation error of the block-diagonal matrix. Suppose g(k, d) is the runtime of an algorithm (approximately) solving Sparse PCA in dimension d and with sparsity value k . Our framework, when integrated with this algorithm, reduces the runtime to \mathcal{O}\left(\frac{d}{d^\star} \cdot g(k, d^\star) + d^2\right) , where d^\star \leq d is the largest block size of the block-diagonal matrix. For instance, integrating our framework with the Branch-and-Bound algorithm reduces the complexity from g(k, d) = \mathcal{O}(k^3 \cdot d^k) to \mathcal{O}(k^3 \cdot d \cdot (d^\star)^{k-1}) , demonstrating exponential speedups if d^\star is small. We perform large-scale evaluations on many real-world datasets: for the exact Sparse PCA algorithm, our method achieves an average speedup factor of 93.77, while maintaining an average approximation error of 2.15%; for the approximate Sparse PCA algorithm, our method achieves an average speedup factor of 6.77 and an average approximation error of merely 0.37%.
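The three-step pipeline (block detection, per-block solve, reconstruction) can be sketched in miniature. Here the per-block solver is a brute-force search over supports using power iteration, standing in for whatever off-the-shelf Sparse PCA algorithm one would plug in; the threshold parameter and helper names are my own:

```python
from itertools import combinations


def connected_blocks(cov, tau):
    """Group indices into blocks: connected components of the graph with
    an edge wherever the off-diagonal entry satisfies |cov[i][j]| > tau."""
    n = len(cov)
    seen, blocks = set(), []
    for s in range(n):
        if s in seen:
            continue
        stack, comp = [s], []
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            comp.append(u)
            for v in range(n):
                if v != u and v not in seen and abs(cov[u][v]) > tau:
                    stack.append(v)
        blocks.append(sorted(comp))
    return blocks


def leading_eigval(mat, iters=200):
    """Power iteration (max-norm) for the dominant eigenvalue of a small
    symmetric PSD matrix; sufficient for this toy sketch."""
    n = len(mat)
    v, lam = [1.0] * n, 1.0
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(abs(x) for x in w) or 1.0
        v = [x / lam for x in w]
    return lam


def sparse_pca_blockwise(cov, k, tau):
    """Approximate k-sparse PCA: split into blocks, brute-force the best
    support of size <= k inside each block, return (value, support)."""
    best = (0.0, ())
    for blk in connected_blocks(cov, tau):
        for r in range(1, min(k, len(blk)) + 1):
            for sub in combinations(blk, r):
                sm = [[cov[i][j] for j in sub] for i in sub]
                val = leading_eigval(sm)
                if val > best[0]:
                    best = (val, sub)
    return best
```

On a covariance that is already block-diagonal, the search never mixes coordinates across blocks, which is exactly the structural property the framework exploits to shrink the runtime from g(k, d) to per-block calls of g(k, d\*).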

[LG-91] Towards Effective Planning Strategies for Dynamic Opinion Networks NEURIPS2024

链接: https://arxiv.org/abs/2410.14091
作者: Bharath Muppasani,Protik Nag,Vignesh Narayanan,Biplav Srivastava,Michael N. Huhns
关键词-EN: disseminating accurate information, under-explored intervention planning, intervention planning aimed, disseminating accurate, accurate information
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at NeurIPS 2024

点击查看摘要

Abstract:In this study, we investigate the under-explored intervention planning aimed at disseminating accurate information within dynamic opinion networks by leveraging learning strategies. Intervention planning involves identifying key nodes (search) and exerting control (e.g., disseminating accurate/official information through the nodes) to mitigate the influence of misinformation. However, as network size increases, the problem becomes computationally intractable. To address this, we first introduce a novel ranking algorithm (search) to identify key nodes for disseminating accurate information, which facilitates the training of neural network (NN) classifiers for scalable and generalized solutions. Second, we address the complexity of label generation (through search) by developing a Reinforcement Learning (RL)-based dynamic planning framework. We investigate NN-based RL planners tailored for dynamic opinion networks governed by two propagation models for the framework. Each model incorporates both binary and continuous opinion and trust representations. Our experimental results demonstrate that our ranking algorithm-based classifiers provide plans that enhance infection rate control, especially with increased action budgets. Moreover, reward strategies focusing on key metrics, such as the number of susceptible nodes and infection rates, outperform those prioritizing faster blocking strategies. Additionally, our findings reveal that Graph Convolutional Networks (GCNs)-based planners facilitate scalable centralized plans that achieve lower infection rates (higher control) across various network scenarios (e.g., Watts-Strogatz topology, varying action budgets, varying initial infected nodes, and varying degree of infected nodes).

[LG-92] In-context learning and Occam's razor

链接: https://arxiv.org/abs/2410.14086
作者: Eric Elmoznino,Tom Marty,Tejas Kasetty,Leo Gagnon,Sarthak Mittal,Mahan Fathi,Dhanya Sridhar,Guillaume Lajoie
关键词-EN: Free Lunch Theorem, Lunch Theorem states, Free Lunch, Lunch Theorem, Occam razor
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The goal of machine learning is generalization. While the No Free Lunch Theorem states that we cannot obtain theoretical guarantees for generalization without further assumptions, in practice we observe that simple models which explain the training data generalize best: a principle called Occam’s razor. Despite the need for simple models, most current approaches in machine learning only minimize the training error, and at best indirectly promote simplicity through regularization or architecture design. Here, we draw a connection between Occam’s razor and in-context learning: an emergent ability of certain sequence models like Transformers to learn at inference time from past observations in a sequence. In particular, we show that the next-token prediction loss used to train in-context learners is directly equivalent to a data compression technique called prequential coding, and that minimizing this loss amounts to jointly minimizing both the training error and the complexity of the model that was implicitly learned from context. Our theory and the empirical experiments we use to support it not only provide a normative account of in-context learning, but also elucidate the shortcomings of current in-context learning methods, suggesting ways in which they can be improved. We make our code available at this https URL.
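Prequential coding, the compression view the paper equates with next-token-prediction loss, is easy to illustrate: encode each symbol with -log2 p(x_t | x_<t) under a sequential predictor that updates as it goes, and sum the code lengths. Below is a minimal sketch using a Laplace-smoothed Bernoulli predictor (my choice of predictor for illustration, not the paper's):

```python
import math


def prequential_code_length(bits):
    """Prequential code length (in bits) of a binary sequence: each symbol
    costs -log2 p(x_t | x_<t) under a running Laplace (add-one) estimate,
    and the estimate is updated after each symbol is 'encoded'."""
    ones, total, length = 0, 0, 0.0
    for b in bits:
        p_one = (ones + 1) / (total + 2)      # Laplace-smoothed estimate
        p = p_one if b == 1 else 1.0 - p_one
        length += -math.log2(p)
        ones += b
        total += 1
    return length
```

A regular sequence compresses well under this scheme: eight zeros cost log2(9) ≈ 3.17 bits rather than 8, mirroring the claim that minimizing prequential code length jointly minimizes training error and model complexity.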

[LG-93] Interpreting Inflammation Prediction Model via Tag-based Cohort Explanation

链接: https://arxiv.org/abs/2410.14082
作者: Fanyu Meng,Jules Larke,Xin Liu,Zhaodan Kong,Xin Chen,Danielle Lemay,Ilias Tagkopoulos
关键词-EN: make intelligent decisions, revolutionizing nutrition science, Machine learning, intelligent decisions, learning is revolutionizing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Machine learning is revolutionizing nutrition science by enabling systems to learn from data and make intelligent decisions. However, the complexity of these models often leads to challenges in understanding their decision-making processes, necessitating the development of explainability techniques to foster trust and increase model transparency. An under-explored type of explanation is cohort explanation, which provides explanations to groups of instances with similar characteristics. Unlike traditional methods that focus on individual explanations or global model behavior, cohort explainability bridges the gap by providing unique insights at an intermediate granularity. We propose a novel framework for identifying cohorts within a dataset based on local feature importance scores, aiming to generate concise descriptions of the clusters via tags. We evaluate our framework on a food-based inflammation prediction model and demonstrated that the framework can generate reliable explanations that match domain knowledge.
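Grouping instances by their local feature-importance profiles can be sketched very simply: assign each instance to a cohort according to its dominant feature, and use that feature's name as the cohort's tag. The paper's clustering-based pipeline is more sophisticated; this is only a toy illustration with hypothetical feature names:

```python
def cohort_tags(importances, feature_names):
    """Group instances whose local explanations share the same dominant
    feature, tagging each cohort with that feature's name. A simplified
    stand-in for the paper's clustering-and-tagging framework.

    importances: one row of local feature-importance scores per instance.
    Returns {tag: [instance indices]}.
    """
    cohorts = {}
    for idx, row in enumerate(importances):
        top = max(range(len(row)), key=lambda j: abs(row[j]))
        cohorts.setdefault(feature_names[top], []).append(idx)
    return cohorts
```

The resulting tags sit at the intermediate granularity the abstract describes: coarser than per-instance explanations, finer than a single global summary.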

[LG-94] Reward-free World Models for Online Imitation Learning

链接: https://arxiv.org/abs/2410.14081
作者: Shangzhe Li,Zhiao Huang,Hao Su
关键词-EN: acquire skills directly, enables agents, expert demonstrations, providing a compelling, agents to acquire
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Imitation learning (IL) enables agents to acquire skills directly from expert demonstrations, providing a compelling alternative to reinforcement learning. However, prior online IL approaches struggle with complex tasks characterized by high-dimensional inputs and complex dynamics. In this work, we propose a novel approach to online imitation learning that leverages reward-free world models. Our method learns environmental dynamics entirely in latent spaces without reconstruction, enabling efficient and accurate modeling. We adopt the inverse soft-Q learning objective, reformulating the optimization process in the Q-policy space to mitigate the instability associated with traditional optimization in the reward-policy space. By employing a learned latent dynamics model and planning for control, our approach consistently achieves stable, expert-level performance in tasks with high-dimensional observation or action spaces and intricate dynamics. We evaluate our method on a diverse set of benchmarks, including DMControl, MyoSuite, and ManiSkill2, demonstrating superior empirical performance compared to existing approaches.

[LG-95] FedPAE: Peer-Adaptive Ensemble Learning for Asynchronous and Model-Heterogeneous Federated Learning

链接: https://arxiv.org/abs/2410.14075
作者: Brianna Mueller,W. Nick Street,Stephen Baek,Qihang Lin,Jingyi Yang,Yankun Huang
关键词-EN: compromising data privacy, distributed data sources, enables multiple clients, enables multiple, Federated learning
类目: Machine Learning (cs.LG)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:Federated learning (FL) enables multiple clients with distributed data sources to collaboratively train a shared model without compromising data privacy. However, existing FL paradigms face challenges due to heterogeneity in client data distributions and system capabilities. Personalized federated learning (pFL) has been proposed to mitigate these problems, but often requires a shared model architecture and a central entity for parameter aggregation, resulting in scalability and communication issues. More recently, model-heterogeneous FL has gained attention due to its ability to support diverse client models, but existing methods are limited by their dependence on a centralized framework, synchronized training, and publicly available datasets. To address these limitations, we introduce Federated Peer-Adaptive Ensemble Learning (FedPAE), a fully decentralized pFL algorithm that supports model heterogeneity and asynchronous learning. Our approach utilizes a peer-to-peer model sharing mechanism and ensemble selection to achieve a more refined balance between local and global information. Experimental results show that FedPAE outperforms existing state-of-the-art pFL algorithms, effectively managing diverse client capabilities and demonstrating robustness against statistical heterogeneity.

[LG-96] Rethinking Optimal Transport in Offline Reinforcement Learning

链接: https://arxiv.org/abs/2410.14069
作者: Arip Asadulaev,Rostislav Korst,Alexander Korotin,Vage Egiazarian,Andrey Filchenkov,Evgeny Burnaev
关键词-EN: offline reinforcement learning, offline reinforcement, reinforcement learning, rethink offline reinforcement, reinforcement
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a novel algorithm for offline reinforcement learning using optimal transport. Typically, in offline reinforcement learning, the data is provided by various experts and some of them can be sub-optimal. To extract an efficient policy, it is necessary to stitch the best behaviors from the dataset. To address this problem, we rethink offline reinforcement learning as an optimal transportation problem. Based on this, we present an algorithm that aims to find a policy that maps states to a partial distribution of the best expert actions for each given state. We evaluate the performance of our algorithm on continuous control problems from the D4RL suite and demonstrate improvements over existing methods.
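The transport view can be illustrated with a generic entropic OT solver; the Sinkhorn iteration below is a standard textbook routine, not the authors' algorithm, and the state/action histograms and cost matrix are toy assumptions:

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, iters=1000):
    """Entropy-regularized optimal transport plan between histograms a and b."""
    K = np.exp(-cost / eps)        # Gibbs kernel of the cost matrix
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(iters):         # alternating marginal projections
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

# toy example: 3 states coupled to 2 expert-action modes; lower cost means
# the expert behavior is a better fit for that state
cost = np.array([[0.0, 1.0],
                 [1.0, 0.0],
                 [0.5, 0.5]])
plan = sinkhorn(cost, np.full(3, 1 / 3), np.full(2, 1 / 2))
```

Each row of `plan` is the (partial) distribution over expert behaviors assigned to that state, which is the shape of object the abstract describes.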

[LG-97] Provable Benefits of Complex Parameterizations for Structured State Space Models NEURIPS2024

链接: https://arxiv.org/abs/2410.14067
作者: Yuval Ran-Milo,Eden Lumbroso,Edo Cohen-Karlik,Raja Giryes,Amir Globerson,Nadav Cohen
关键词-EN: Structured state space, linear dynamical systems, dynamical systems adhering, real SSM, prominent neural networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: 12 pages, 1 figure. Accepted to NeurIPS 2024

点击查看摘要

Abstract:Structured state space models (SSMs), the core engine behind prominent neural networks such as S4 and Mamba, are linear dynamical systems adhering to a specified structure, most notably diagonal. In contrast to typical neural network modules, whose parameterizations are real, SSMs often use complex parameterizations. Theoretically explaining the benefits of complex parameterizations for SSMs is an open problem. The current paper takes a step towards its resolution, by establishing formal gaps between real and complex diagonal SSMs. Firstly, we prove that while a moderate dimension suffices in order for a complex SSM to express all mappings of a real SSM, a much higher dimension is needed for a real SSM to express mappings of a complex SSM. Secondly, we prove that even if the dimension of a real SSM is high enough to express a given mapping, typically, doing so requires the parameters of the real SSM to hold exponentially large values, which cannot be learned in practice. In contrast, a complex SSM can express any given mapping with moderate parameter values. Experiments corroborate our theory, and suggest a potential extension of the theory that accounts for selectivity, a new architectural feature yielding state of the art performance.
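The expressivity gap is easy to see even in one dimension. A single complex diagonal mode on the unit circle produces a period-4 oscillatory impulse response, while a single real mode can only decay or grow (at most flipping sign every step); the sketch below is illustrative, not the paper's formal construction:

```python
import numpy as np

def impulse_response(a, b, c, T):
    """Impulse response of a 1-D diagonal SSM: h[t] = Re(c * a**t * b)."""
    return np.array([np.real(c * a**t * b) for t in range(T)])

# one complex mode on the unit circle -> period-4 oscillation 1, 0, -1, 0, ...
h_complex = impulse_response(np.exp(1j * np.pi / 2), 1.0, 1.0, 8)

# one real mode -> monotone geometric decay, no oscillation of this kind
h_real = impulse_response(0.9, 1.0, 1.0, 8)
```

Matching `h_complex` with real diagonal modes requires pairing eigenvalues (i.e., higher dimension), which is the kind of dimension gap the paper formalizes.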

[LG-98] Lightweight Correlation-Aware Table Compression

链接: https://arxiv.org/abs/2410.14066
作者: Mihail Stoian,Alexander van Renen,Jan Kobiolka,Ping-Lin Kuo,Josif Grabocka,Andreas Kipf
关键词-EN: competitive compression ratios, data necessitates efficient, provide high scan, necessitates efficient, managing relational data
类目: Databases (cs.DB); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Third Table Representation Learning Workshop (TRL 2024)

点击查看摘要

Abstract:The growing adoption of data lakes for managing relational data necessitates efficient, open storage formats that provide high scan performance and competitive compression ratios. While existing formats achieve fast scans through lightweight encoding techniques, they have reached a plateau in terms of minimizing storage footprint. Recently, correlation-aware compression schemes have been shown to reduce file sizes further. Yet, current approaches either incur significant scan overheads or require manual specification of correlations, limiting their practicability. We present Virtual, a framework that integrates seamlessly with existing open formats to automatically leverage data correlations, achieving substantial compression gains while having minimal scan performance overhead. Experiments on this http URL datasets show that Virtual reduces file sizes by up to 40% compared to Apache Parquet.

[LG-99] Data-driven rainfall prediction at a regional scale: a case study with Ghana

链接: https://arxiv.org/abs/2410.14062
作者: Indrajit Kalita,Lucia Vilallonga,Yves Atchade
关键词-EN: volatile rainfall events, warming planet, climate change, expected to experience, experience the brunt
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With a warming planet, tropical regions are expected to experience the brunt of climate change, with more intense and more volatile rainfall events. Currently, state-of-the-art numerical weather prediction (NWP) models are known to struggle to produce skillful rainfall forecasts in tropical regions of Africa. There is thus a pressing need for improved rainfall forecasting in these regions. Over the last decade or so, the increased availability of large-scale meteorological datasets and the development of powerful machine learning models have opened up new opportunities for data-driven weather forecasting. Focusing on Ghana in this study, we use these tools to develop two U-Net convolutional neural network (CNN) models to predict 24h rainfall at 12h and 30h lead-time. The models were trained using data from the ERA5 reanalysis dataset and the GPM-IMERG dataset. Special attention was paid to interpretability: we developed a novel statistical methodology that allowed us to probe the relative importance of the meteorological variables input to our model, offering useful insights into the factors that drive precipitation in the Ghana region. Empirically, we found that our 12h lead-time model matches, and in some respects exceeds, the 18h lead-time forecasts produced by the ECMWF (as available in the TIGGE dataset). We also found that combining our data-driven model with classical NWP further improves forecast accuracy.

[LG-100] On Partial Prototype Collapse in the DINO Family of Self-Supervised Methods BMVC2024

链接: https://arxiv.org/abs/2410.14060
作者: Hariprasath Govindarajan,Per Sidén,Jacob Roll,Fredrik Lindsten
关键词-EN: prominent self-supervised learning, self-supervised learning paradigm, mixture model, mixture model simultaneously, prominent self-supervised
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: First version of the paper appeared in OpenReview on 22 Sep 2023. Accepted to BMVC 2024

点击查看摘要

Abstract:A prominent self-supervised learning paradigm is to model the representations as clusters, or more generally as a mixture model. Learning to map the data samples to compact representations and fitting the mixture model simultaneously leads to the representation collapse problem. Regularizing the distribution of data points over the clusters is the prevalent strategy to avoid this issue. While this is sufficient to prevent full representation collapse, we show that a partial prototype collapse problem still exists in the DINO family of methods, which leads to significant redundancies in the prototypes. Such prototype redundancies serve as shortcuts for the method to achieve a marginal latent class distribution that matches the prescribed prior. We show that by encouraging the model to use diverse prototypes, the partial prototype collapse can be mitigated. Effective utilization of the prototypes enables the methods to learn more fine-grained clusters, encouraging more informative representations. We demonstrate that this is especially beneficial when pre-training on a long-tailed fine-grained dataset.

[LG-101] From Isolated Conversations to Hierarchical Schemas: Dynamic Tree Memory Representation for LLMs

链接: https://arxiv.org/abs/2410.14052
作者: Alireza Rezazadeh,Zichao Li,Wei Wei,Yujia Bao
关键词-EN: Recent advancements, large language models, effective long-term memory, context windows, advancements in large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advancements in large language models have significantly improved their context windows, yet challenges in effective long-term memory management remain. We introduce MemTree, an algorithm that leverages a dynamic, tree-structured memory representation to optimize the organization, retrieval, and integration of information, akin to human cognitive schemas. MemTree organizes memory hierarchically, with each node encapsulating aggregated textual content, corresponding semantic embeddings, and varying abstraction levels across the tree’s depths. Our algorithm dynamically adapts this memory structure by computing and comparing semantic embeddings of new and existing information to enrich the model’s context-awareness. This approach allows MemTree to handle complex reasoning and extended interactions more effectively than traditional memory augmentation methods, which often rely on flat lookup tables. Evaluations on benchmarks for multi-turn dialogue understanding and document question answering show that MemTree significantly enhances performance in scenarios that demand structured memory management.
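The hierarchical routing idea can be sketched schematically; the bag-of-words "embeddings", cosine routing, and similarity threshold below are illustrative stand-ins, not MemTree's actual semantic embeddings or update rule:

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a semantic embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cos(a, b):
    num = sum(a[w] * b[w] for w in a)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

class MemNode:
    def __init__(self, text=""):
        self.text, self.emb, self.children = text, embed(text), []

    def insert(self, text, threshold=0.3):
        """Route new content to the most similar child, else open a new branch."""
        emb = embed(text)
        best, best_sim = None, threshold
        for c in self.children:
            s = cos(c.emb, emb)
            if s >= best_sim:
                best, best_sim = c, s
        if best is None:
            self.children.append(MemNode(text))   # new topic branch
        else:
            best.emb.update(emb)                  # enrich the subtree summary
            best.insert(text, threshold)          # recurse deeper
```

Related facts thus aggregate under a shared subtree while unrelated ones branch off, giving the varying abstraction levels the abstract describes.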

[LG-102] Human Action Anticipation: A Survey

链接: https://arxiv.org/abs/2410.14045
作者: Bolin Lai,Sam Toyer,Tushar Nagarajan,Rohit Girdhar,Shengxin Zha,James M. Rehg,Kris Kitani,Kristen Grauman,Ruta Desai,Miao Liu
关键词-EN: Predicting future human, increasingly popular topic, computer vision, autonomous vehicles, digital assistants
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 30 pages, 9 figures, 12 tables

点击查看摘要

Abstract:Predicting future human behavior is an increasingly popular topic in computer vision, driven by the interest in applications such as autonomous vehicles, digital assistants and human-robot interactions. The literature on behavior prediction spans various tasks, including action anticipation, activity forecasting, intent prediction, goal prediction, and so on. Our survey aims to tie together this fragmented literature, covering recent technical innovations as well as the development of new large-scale datasets for model training and evaluation. We also summarize the widely-used metrics for different tasks and provide a comprehensive performance comparison of existing approaches on eleven action anticipation datasets. This survey serves not only as a reference for contemporary methodologies in action anticipation, but also as a guide for future research directions in this evolving landscape.

[LG-103] From Barriers to Tactics: A Behavioral Science-Informed Agentic Workflow for Personalized Nutrition Coaching

链接: https://arxiv.org/abs/2410.14041
作者: Eric Yang,Tomas Garcia,Hannah Williams,Bhawesh Kumar,Martin Ramé,Eileen Rivera,Yiran Ma,Jonathan Amar,Caricia Catalani,Yugang Jia
关键词-EN: requires sustained positive, positive nutrition habits, sustained positive nutrition, Effective management, conditions requires sustained
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注: 22 pages

点击查看摘要

Abstract:Effective management of cardiometabolic conditions requires sustained positive nutrition habits, often hindered by complex and individualized barriers. Direct human management is simply not scalable, while previous attempts aimed at automating nutrition coaching lack the personalization needed to address these diverse challenges. This paper introduces a novel LLM-powered agentic workflow designed to provide personalized nutrition coaching by directly targeting and mitigating patient-specific barriers. Grounded in behavioral science principles, the workflow leverages a comprehensive mapping of nutrition-related barriers to corresponding evidence-based strategies. A specialized LLM agent intentionally probes for and identifies the root cause of a patient’s dietary struggles. Subsequently, a separate LLM agent delivers tailored tactics designed to overcome those specific barriers with patient context. We designed and validated our approach through a user study with individuals with cardiometabolic conditions, demonstrating the system’s ability to accurately identify barriers and provide personalized guidance. Furthermore, we conducted a large-scale simulation study, grounded in real patient vignettes and expert-validated metrics, to evaluate the system’s performance across a wide range of scenarios. Our findings demonstrate the potential of this LLM-powered agentic workflow to improve nutrition coaching by providing personalized, scalable, and behaviorally-informed interventions.

[LG-104] Latent Weight Diffusion: Generating Policies from Trajectories

链接: https://arxiv.org/abs/2410.14040
作者: Shashank Hegde,Gautam Salhotra,Gaurav S. Sukhatme
关键词-EN: open-source robotic data, diffusion, manipulation and locomotion, increasing availability, availability of open-source
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:With the increasing availability of open-source robotic data, imitation learning has emerged as a viable approach for both robot manipulation and locomotion. Currently, large generalized policies are trained to predict controls or trajectories using diffusion models, which have the desirable property of learning multimodal action distributions. However, generalizability comes with a cost - namely, larger model size and slower inference. Further, there is a known trade-off between performance and action horizon for Diffusion Policy (i.e., diffusing trajectories): fewer diffusion queries accumulate greater trajectory tracking errors. Thus, it is common practice to run these models at high inference frequency, subject to robot computational constraints. To address these limitations, we propose Latent Weight Diffusion (LWD), a method that uses diffusion to learn a distribution over policies for robotic tasks, rather than over trajectories. Our approach encodes demonstration trajectories into a latent space and then decodes them into policies using a hypernetwork. We employ a diffusion denoising model within this latent space to learn its distribution. We demonstrate that LWD can reconstruct the behaviors of the original policies that generated the trajectory dataset. LWD offers the benefits of considerably smaller policy networks during inference and requires fewer diffusion model queries. When tested on the Metaworld MT10 benchmark, LWD achieves a higher success rate compared to a vanilla multi-task policy, while using models up to ~18x smaller during inference. Additionally, since LWD generates closed-loop policies, we show that it outperforms Diffusion Policy in long action horizon settings, with reduced diffusion queries during rollout. 
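The decode step, from a latent vector to concrete policy weights via a hypernetwork, can be caricatured in a few lines; the single linear hypernetwork, the dimensions, and the tanh policy below are placeholder assumptions, not LWD's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, act_dim, z_dim = 4, 2, 8
n_params = obs_dim * act_dim + act_dim        # weights + bias of a linear policy

# hypernetwork: here just one linear map from latent space to parameter space
H = rng.normal(size=(z_dim, n_params)) * 0.1

def decode_policy(z):
    """Turn a latent code into a callable policy by emitting its weights."""
    theta = z @ H
    W = theta[: obs_dim * act_dim].reshape(obs_dim, act_dim)
    b = theta[obs_dim * act_dim:]
    return lambda obs: np.tanh(obs @ W + b)

policy = decode_policy(rng.normal(size=z_dim))
a = policy(np.zeros(obs_dim))
```

In LWD the latent code itself would be sampled from a diffusion model, so that one denoising pass yields an entire closed-loop policy rather than a trajectory.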

[LG-105] Sliding Puzzles Gym: A Scalable Benchmark for State Representation in Visual Reinforcement Learning

链接: https://arxiv.org/abs/2410.14038
作者: Bryan L. M. de Oliveira,Murilo L. da Luz,Bruno Brandão,Luana G. B. Martins,Telma W. de L. Soares,Luckeciano C. Melo
关键词-EN: agents encounter diverse, Learning effective visual, effective visual representations, effective visual, crucial in open-world
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Learning effective visual representations is crucial in open-world environments where agents encounter diverse and unstructured observations. This ability enables agents to extract meaningful information from raw sensory inputs, like pixels, which is essential for generalization across different tasks. However, evaluating representation learning separately from policy learning remains a challenge in most reinforcement learning (RL) benchmarks. To address this, we introduce the Sliding Puzzles Gym (SPGym), a benchmark that extends the classic 15-tile puzzle with variable grid sizes and observation spaces, including large real-world image datasets. SPGym allows scaling the representation learning challenge while keeping the latent environment dynamics and algorithmic problem fixed, providing a targeted assessment of agents’ ability to form compositional and generalizable state representations. Experiments with both model-free and model-based RL algorithms, with and without explicit representation learning components, show that as the representation challenge scales, SPGym effectively distinguishes agents based on their capabilities. Moreover, SPGym reaches difficulty levels where no tested algorithm consistently excels, highlighting key challenges and opportunities for advancing representation learning for decision-making research.

[LG-106] Modeling the Human Visual System: Comparative Insights from Response-Optimized and Task-Optimized Vision Models, Language Models, and Different Readout Mechanisms

链接: https://arxiv.org/abs/2410.14031
作者: Shreya Saha,Ishaan Chadha,Meenakshi khosla
关键词-EN: DNN approaches, neural responses, neural response prediction, past decade, advanced significantly
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Over the past decade, predictive modeling of neural responses in the primate visual system has advanced significantly, largely driven by various DNN approaches. These include models optimized directly for visual recognition, cross-modal alignment through contrastive objectives, neural response prediction from scratch, and large language model embeddings. Additionally, different readout mechanisms, ranging from fully linear to spatial-feature factorized methods, have been explored for mapping network activations to neural responses. Despite the diversity of these approaches, it remains unclear which method performs best across different visual regions. In this study, we systematically compare these approaches for modeling the human visual system and investigate alternative strategies to improve response predictions. Our findings reveal that for early to mid-level visual areas, response-optimized models with visual inputs offer superior prediction accuracy, while for higher visual regions, embeddings from LLMs based on detailed contextual descriptions of images and task-optimized models pretrained on large vision datasets provide the best fit. Through comparative analysis of these modeling approaches, we identified three distinct regions in the visual cortex: one sensitive primarily to perceptual features of the input that are not captured by linguistic descriptions, another attuned to fine-grained visual details representing semantic information, and a third responsive to abstract, global meanings aligned with linguistic content. We also highlight the critical role of readout mechanisms, proposing a novel scheme that modulates receptive fields and feature maps based on semantic content, resulting in an accuracy boost of 3-23% over existing SOTAs for all models and brain regions. Together, these findings offer key insights into building more precise models of the visual system.

[LG-107] Graph Neural Flows for Unveiling Systemic Interactions Among Irregularly Sampled Time Series NEURIPS2024

链接: https://arxiv.org/abs/2410.14030
作者: Giangiacomo Mercatali,Andre Freitas,Jie Chen
关键词-EN: Interacting systems, prevalent in nature, Interacting, conditional dependencies, Abstract
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注: NeurIPS 2024. Code is available at this https URL

点击查看摘要

Abstract:Interacting systems are prevalent in nature. It is challenging to accurately predict the dynamics of the system if its constituent components are analyzed independently. We develop a graph-based model that unveils the systemic interactions of time series observed at irregular time points, by using a directed acyclic graph to model the conditional dependencies (a form of causal notation) of the system components and learning this graph in tandem with a continuous-time model that parameterizes the solution curves of ordinary differential equations (ODEs). Our technique, a graph neural flow, leads to substantial enhancements over non-graph-based methods, as well as graph-based methods without the modeling of conditional dependencies. We validate our approach on several tasks, including time series classification and forecasting, to demonstrate its efficacy.
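A minimal sketch of the core object, a continuous-time system whose component couplings follow a DAG; the chain graph, linear dynamics, and Euler integrator below are illustrative assumptions, not the paper's learned graph neural flow:

```python
import numpy as np

# DAG adjacency: x0 -> x1 -> x2 (conditional-dependency structure)
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]])

def f(x):
    """Each component's derivative depends only on its DAG parents and itself."""
    return A.T @ x - x

def euler(x0, dt=0.01, steps=100):
    """Crude stand-in for an ODE solver over the solution curves."""
    x = x0.astype(float).copy()
    for _ in range(steps):
        x = x + dt * f(x)
    return x

x = euler(np.array([1.0, 0.0, 0.0]))
```

Starting from an impulse on `x0`, the influence propagates down the chain, which is the kind of systemic interaction the learned graph is meant to capture even when the series is sampled irregularly.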

[LG-108] Auditing and Enforcing Conditional Fairness via Optimal Transport

链接: https://arxiv.org/abs/2410.14029
作者: Mohsen Ghassemi,Alan Mishler,Niccolo Dalmasso,Luhao Zhang,Vamsi K. Potluru,Tucker Balch,Manuela Veloso
关键词-EN: demographic parity, Conditional demographic parity, additional feature, decision process, target demographic parity
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Conditional demographic parity (CDP) is a measure of the demographic parity of a predictive model or decision process when conditioning on an additional feature or set of features. Many algorithmic fairness techniques exist to target demographic parity, but CDP is much harder to achieve, particularly when the conditioning variable has many levels and/or when the model outputs are continuous. The problem of auditing and enforcing CDP is understudied in the literature. In light of this, we propose novel measures of conditional demographic disparity (CDD) which rely on statistical distances borrowed from the optimal transport literature. We further design and evaluate regularization-based approaches based on these CDD measures. Our methods, fairbit and fairlp, allow us to target CDP even when the conditioning variable has many levels. When model outputs are continuous, our methods target full equality of the conditional distributions, unlike other methods that only consider first moments or related proxy quantities. We validate the efficacy of our approaches on real-world datasets.
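On the auditing side, a CDD-style measure can be sketched as the average, over levels of the conditioning variable, of a statistical distance between the two groups' score distributions; the equal-sample 1-D Wasserstein-1 shortcut below is an illustrative simplification of the paper's OT-based measures, with made-up scores:

```python
def w1(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples:
    mean absolute difference of the sorted values."""
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

def conditional_disparity(scores_by_level):
    """Average, over conditioning levels, of the score-distribution gap
    between two demographic groups.

    scores_by_level: {level: (scores_group_a, scores_group_b)}, equal sizes.
    """
    gaps = [w1(a, b) for a, b in scores_by_level.values()]
    return sum(gaps) / len(gaps)
```

A value of zero means the full conditional score distributions match within every level, which is the strong notion of CDP the abstract contrasts with first-moment proxies.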

[LG-109] Identifying Privacy Personas

链接: https://arxiv.org/abs/2410.14023
作者: Olena Hrynenko,Andrea Cavallaro
关键词-EN: behavioural patterns, Privacy personas capture, Privacy, Privacy personas, privacy protection
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Privacy personas capture the differences in user segments with respect to one’s knowledge, behavioural patterns, level of self-efficacy, and perception of the importance of privacy protection. Modelling these differences is essential for appropriately choosing personalised communication about privacy (e.g. to increase literacy) and for defining suitable choices for privacy enhancing technologies (PETs). While various privacy personas have been derived in the literature, they group together people who differ from each other in terms of important attributes such as perceived or desired level of control, and motivation to use PET. To address this lack of granularity and comprehensiveness in describing personas, we propose eight personas that we derive by combining qualitative and quantitative analysis of the responses to an interactive educational questionnaire. We design an analysis pipeline that uses divisive hierarchical clustering and Boschloo’s statistical test of homogeneity of proportions to ensure that the elicited clusters differ from each other based on a statistical measure. Additionally, we propose a new measure for calculating distances between questionnaire responses, that accounts for the type of the question (closed- vs open-ended) used to derive traits. We show that the proposed privacy personas statistically differ from each other. We statistically validate the proposed personas and also compare them with personas in the literature, showing that they provide a more granular and comprehensive understanding of user segments, which will make it possible to better assist users with their privacy needs.

[LG-110] Conformal Prediction for Federated Graph Neural Networks with Missing Neighbor Information

链接: https://arxiv.org/abs/2410.14010
作者: Ömer Faruk Akgül,Rajgopal Kannan,Viktor Prasanna
关键词-EN: representing real-world objects, objects and interactions, play a crucial, crucial role, mining and machine
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graphs play a crucial role in data mining and machine learning, representing real-world objects and interactions. As graph datasets grow, managing large, decentralized subgraphs becomes essential, particularly within federated learning frameworks. These frameworks face significant challenges, including missing neighbor information, which can compromise model reliability in safety-critical settings. Deployment of federated learning models trained in such settings necessitates quantifying the uncertainty of the models. This study extends the applicability of Conformal Prediction (CP), a well-established method for uncertainty quantification, to federated graph learning. We specifically tackle the missing links issue in distributed subgraphs to minimize its adverse effects on CP set sizes. We discuss data dependencies across the distributed subgraphs and establish conditions for CP validity and precise test-time coverage. We introduce a Variational Autoencoder-based approach for reconstructing missing neighbors to mitigate the negative impact of missing data. Empirical evaluations on real-world datasets demonstrate the efficacy of our approach, yielding smaller prediction sets while ensuring coverage guarantees.
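For context, the standard (non-federated) split conformal recipe the paper extends looks like the sketch below; the calibration scores and class probabilities are toy values, and the federated/missing-neighbor machinery of the paper is not represented:

```python
import math

def conformal_quantile(cal_scores, alpha=0.1):
    """Finite-sample-corrected (1 - alpha) quantile of calibration
    nonconformity scores (here, 1 - p of the true class)."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    return sorted(cal_scores)[rank - 1]

def prediction_set(probs, qhat):
    """All labels whose nonconformity score 1 - p stays below the threshold."""
    return [k for k, p in enumerate(probs) if 1 - p <= qhat]
```

The coverage guarantee says the true label falls in the set with probability at least 1 - alpha; the paper's contribution is keeping such sets small and valid when subgraph neighbors are missing.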

[LG-111] Personalized Adaptation via In-Context Preference Learning

链接: https://arxiv.org/abs/2410.14001
作者: Allison Lau,Younwoo Choi,Vahid Balazadeh,Keertana Chidambaram,Vasilis Syrgkanis,Rahul G. Krishnan
关键词-EN: Human Feedback, Reinforcement Learning, human preferences, RLHF, Human
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback (RLHF) is widely used to align Language Models (LMs) with human preferences. However, existing approaches often neglect individual user preferences, leading to suboptimal personalization. We present the Preference Pretrained Transformer (PPT), a novel approach for adaptive personalization using online user feedback. PPT leverages the in-context learning capabilities of transformers to dynamically adapt to individual preferences. Our approach consists of two phases: (1) an offline phase where we train a single policy model using a history-dependent loss function, and (2) an online phase where the model adapts to user preferences through in-context learning. We demonstrate PPT’s effectiveness in a contextual bandit setting, showing that it achieves personalized adaptation superior to existing methods while significantly reducing the computational costs. Our results suggest the potential of in-context learning for scalable and efficient personalization in large language models.

[LG-112] Adversarial Inception for Bounded Backdoor Poisoning in Deep Reinforcement Learning ICLR2025

链接: https://arxiv.org/abs/2410.13995
作者: Ethan Rathbun,Christopher Amato,Alina Oprea
关键词-EN: Deep Reinforcement Learning, Reinforcement Learning, Deep Reinforcement, vulnerability of Deep, Recent works
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 10 pages, 5 figures, ICLR 2025

点击查看摘要

Abstract:Recent works have demonstrated the vulnerability of Deep Reinforcement Learning (DRL) algorithms against training-time, backdoor poisoning attacks. These attacks induce pre-determined, adversarial behavior in the agent upon observing a fixed trigger during deployment while allowing the agent to solve its intended task during training. Prior attacks rely on arbitrarily large perturbations to the agent’s rewards to achieve both of these objectives - leaving them open to detection. Thus, in this work, we propose a new class of backdoor attacks against DRL which achieve state of the art performance while minimally altering the agent’s rewards. These “inception” attacks train the agent to associate the targeted adversarial behavior with high returns by inducing a disjunction between the agent’s chosen action and the true action executed in the environment during training. We formally define these attacks and prove they can achieve both adversarial objectives. We then devise an online inception attack which significantly outperforms prior attacks under bounded reward constraints.

[LG-113] On the Learn-to-Optimize Capabilities of Transformers in In-Context Sparse Recovery

链接: https://arxiv.org/abs/2410.13981
作者: Renpu Liu,Ruida Zhou,Cong Shen,Jing Yang
关键词-EN: parameter updating based, perform in-context learning, contextual information provided, in-context learning, Von Oswald
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:An intriguing property of the Transformer is its ability to perform in-context learning (ICL), where the Transformer can solve different inference tasks without parameter updating based on the contextual information provided by the corresponding input-output demonstration pairs. It has been theoretically proved that ICL is enabled by the capability of Transformers to perform gradient-descent algorithms (Von Oswald et al., 2023a; Bai et al., 2024). This work takes a step further and shows that Transformers can perform learning-to-optimize (L2O) algorithms. Specifically, for the ICL sparse recovery (formulated as LASSO) tasks, we show that a K-layer Transformer can perform an L2O algorithm with a provable convergence rate linear in K. This provides a new perspective explaining the superior ICL capability of Transformers, even with only a few layers, which cannot be achieved by the standard gradient-descent algorithms. Moreover, unlike the conventional L2O algorithms that require the measurement matrix involved in training to match that in testing, the trained Transformer is able to solve sparse recovery problems generated with different measurement matrices. Besides, Transformers as an L2O algorithm can leverage structural information embedded in the training tasks to accelerate its convergence during ICL, and generalize across different lengths of demonstration pairs, where conventional L2O algorithms typically struggle or fail. Such theoretical findings are supported by our experimental results.
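The LASSO-based sparse recovery task above is classically solved by iterative soft-thresholding (ISTA), the kind of hand-designed iteration that the paper shows a K-layer Transformer can emulate and improve upon. A minimal NumPy sketch of ISTA (problem sizes and the regularization weight are illustrative choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 100, 5                        # measurements, dimension, sparsity
A = rng.standard_normal((n, d)) / np.sqrt(n)
x_true = np.zeros(d)
x_true[rng.choice(d, k, replace=False)] = rng.standard_normal(k)
y = A @ x_true

def ista(A, y, lam=0.05, iters=2000):
    """Iterative soft-thresholding for min_x 0.5*||Ax - y||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2           # Lipschitz constant of the smooth part
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        g = A.T @ (A @ x - y)               # gradient step on the quadratic term
        z = x - g / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return x

x_hat = ista(A, y)
err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
```

Each layer of the paper's Transformer construction plays a role roughly analogous to one such learned update, which is why network depth K maps to a convergence rate in K.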

[LG-114] Debiasing Large Vision-Language Models by Ablating Protected Attribute Representations NEURIPS

链接: https://arxiv.org/abs/2410.13976
作者: Neale Ratzlaff,Matthew Lyle Olson,Musashi Hinck,Shao-Yen Tseng,Vasudev Lal,Phillip Howard
关键词-EN: Large Vision Language, Vision Language Models, Large Vision, Vision Language, demonstrated impressive capabilities
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: NeurIPS workshop on SafeGenAI, 10 pages, 2 figures

点击查看摘要

Abstract:Large Vision Language Models (LVLMs) such as LLaVA have demonstrated impressive capabilities as general-purpose chatbots that can engage in conversations about a provided input image. However, their responses are influenced by societal biases present in their training datasets, leading to undesirable differences in how the model responds when presented with images depicting people of different demographics. In this work, we propose a novel debiasing framework for LVLMs by directly ablating biased attributes during text generation to avoid generating text related to protected attributes, or even representing them internally. Our method requires no training and a relatively small amount of representative biased outputs (~1000 samples). Our experiments show that not only can we minimize the propensity of LVLMs to generate text related to protected attributes, but we can even use synthetic data to inform the ablation while retaining captioning performance on real data such as COCO. Furthermore, we find the resulting generations from a debiased LVLM exhibit similar accuracy as a baseline biased model, showing that debiasing effects can be achieved without sacrificing model performance.
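The abstract does not spell out the ablation mechanics; one common training-free realization of "ablating" an attribute is to estimate a direction for it in activation space (e.g., from the ~1000 representative biased samples) and project that direction out of the hidden states at generation time. A toy sketch of such projection ablation, where the direction and activations are random stand-ins:

```python
import numpy as np

def ablate_direction(h, v):
    """Remove the component of activations h (rows) along direction v."""
    v = v / np.linalg.norm(v)
    return h - np.outer(h @ v, v)          # subtract the projection onto v

rng = np.random.default_rng(0)
H = rng.standard_normal((8, 16))           # a batch of hidden states (stand-in)
v = rng.standard_normal(16)                # attribute direction (stand-in)

H_abl = ablate_direction(H, v)
# after ablation, the activations carry no component along v
residual = H_abl @ (v / np.linalg.norm(v))
```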

[LG-115] Trojan Prompt Attacks on Graph Neural Networks

链接: https://arxiv.org/abs/2410.13974
作者: Minhua Lin,Zhiwei Zhang,Enyan Dai,Zongyu Wu,Yilong Wang,Xiang Zhang,Suhang Wang
关键词-EN: Graph Prompt Learning, Prompt Learning, adapt pre-trained GNN, GPL, GNN
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Graph Prompt Learning (GPL) has been introduced as a promising approach that uses prompts to adapt pre-trained GNN models to specific downstream tasks without requiring fine-tuning of the entire model. Despite the advantages of GPL, little attention has been given to its vulnerability to backdoor attacks, where an adversary can manipulate the model’s behavior by embedding hidden triggers. Existing graph backdoor attacks rely on modifying model parameters during training, but this approach is impractical in GPL as GNN encoder parameters are frozen after pre-training. Moreover, downstream users may fine-tune their own task models on clean datasets, further complicating the attack. In this paper, we propose TGPA, a backdoor attack framework designed specifically for GPL. TGPA injects backdoors into graph prompts without modifying pre-trained GNN encoders and ensures high attack success rates and clean accuracy. To address the challenge of model fine-tuning by users, we introduce a finetuning-resistant poisoning approach that maintains the effectiveness of the backdoor even after downstream model adjustments. Extensive experiments on multiple datasets under various settings demonstrate the effectiveness of TGPA in compromising GPL models with fixed GNN encoders.

[LG-116] Enhancing Generalization in Sparse Mixture of Experts Models: The Case for Increased Expert Activation in Compositional Tasks

链接: https://arxiv.org/abs/2410.13964
作者: Jinze Zhao
关键词-EN: Transformer models grow, Transformer models, ability to generalize, Large Language Models, compositional tasks
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As Transformer models grow in complexity, their ability to generalize to novel, compositional tasks becomes crucial. This study challenges conventional wisdom about sparse activation in Sparse Mixture of Experts (SMoE) models when faced with increasingly complex compositional tasks. Through experiments on the SRAVEN symbolic reasoning task and SKILL-MIX benchmark, we demonstrate that activating more experts improves performance on difficult tasks, with the optimal number of activated experts scaling with task complexity. Our findings reveal that pretrained SMoE-based Large Language Models achieve better results by increasing experts-per-token on challenging compositional tasks.
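For reference, the experts-per-token knob the study varies is the k in standard top-k SMoE routing. A toy sketch of a single SMoE layer where k can be raised for harder compositional inputs (the router weights and experts here are random placeholders):

```python
import numpy as np

def smoe_forward(x, router_w, experts, k):
    """Sparse MoE layer: route token x to its top-k experts, mix by softmax weights."""
    logits = router_w @ x
    top = np.argsort(logits)[-k:]                   # indices of the k highest-scoring experts
    w = np.exp(logits[top] - logits[top].max())     # stable softmax over the selected experts
    w /= w.sum()
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

rng = np.random.default_rng(0)
d, n_exp = 4, 8
router_w = rng.standard_normal((n_exp, d))
# each expert is a random linear map (placeholder for an FFN)
experts = [(lambda W: (lambda x: W @ x))(rng.standard_normal((d, d))) for _ in range(n_exp)]

x = rng.standard_normal(d)
y1 = smoe_forward(x, router_w, experts, k=1)   # conventional sparse activation
y4 = smoe_forward(x, router_w, experts, k=4)   # more experts per token for harder tasks
```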

[LG-117] FinQAPT: Empowering Financial Decisions with End-to-End LLM-driven Question Answering Pipeline

链接: https://arxiv.org/abs/2410.13959
作者: Kuldeep Singh,Simerjot Kaur,Charese Smiley
关键词-EN: Large Language Models, relevant information embedded, Financial decision-making hinges, leverages Large Language, decision-making hinges
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Accepted in ICAIF 2024, 8 pages, 5 figures, 4 tables

点击查看摘要

Abstract:Financial decision-making hinges on the analysis of relevant information embedded in the enormous volume of documents in the financial domain. To address this challenge, we developed FinQAPT, an end-to-end pipeline that streamlines the identification of relevant financial reports based on a query, extracts pertinent context, and leverages Large Language Models (LLMs) to perform downstream tasks. To evaluate the pipeline, we experimented with various techniques to optimize the performance of each module using the FinQA dataset. We introduced a novel clustering-based negative sampling technique to enhance context extraction and a novel prompting method called Dynamic N-shot Prompting to boost the numerical question-answering capabilities of LLMs. At the module level, we achieved state-of-the-art accuracy on FinQA, attaining an accuracy of 80.6%. However, at the pipeline level, we observed decreased performance due to challenges in extracting relevant context from financial reports. We conducted a detailed error analysis of each module and the end-to-end pipeline, pinpointing specific challenges that must be addressed to develop a robust solution for handling complex financial tasks.

[LG-118] Goal Inference from Open-Ended Dialog

链接: https://arxiv.org/abs/2410.13957
作者: Rachel Ma,Jingyi Qu,Andreea Bobu,Dylan Hadfield-Menell
关键词-EN: accomplish diverse user, diverse user goals, Large Language Models, embodied agents, agents to learn
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 6 pages + 2 page (references and appendix)

点击查看摘要

Abstract:We present an online method for embodied agents to learn and accomplish diverse user goals. While offline methods like RLHF can represent various goals but require large datasets, our approach achieves similar flexibility with online efficiency. We extract natural language goal representations from conversations with Large Language Models (LLMs). We prompt an LLM to role play as a human with different goals and use the corresponding likelihoods to run Bayesian inference over potential goals. As a result, our method can represent uncertainty over complex goals based on unrestricted dialog. We evaluate our method in grocery shopping and home robot assistance domains using a text-based interface and AI2Thor simulation respectively. Results show our method outperforms ablation baselines that lack either explicit goal representation or probabilistic inference.
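The Bayesian-inference step can be sketched concretely: given an LLM-scored log-likelihood of the user's utterance under each candidate goal (the role-play step), the posterior over goals is a standard Bayes update. The goals and log-likelihood numbers below are hypothetical:

```python
import numpy as np

def update_goal_posterior(prior, loglik_per_goal):
    """Bayes update: posterior proportional to prior * P(utterance | goal)."""
    log_post = np.log(prior) + loglik_per_goal
    log_post -= log_post.max()                 # shift for numerical stability
    post = np.exp(log_post)
    return post / post.sum()

goals = ["buy vegan groceries", "buy party snacks", "restock pantry"]
prior = np.ones(3) / 3
# hypothetical per-goal log-likelihoods an LLM assigns to "no meat or dairy, please"
loglik = np.array([-1.0, -6.0, -4.0])
posterior = update_goal_posterior(prior, loglik)
```

Repeating the update over successive dialog turns keeps a running distribution over goals, which is how uncertainty over complex goals is represented.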

[LG-119] Benchmarking Transcriptomics Foundation Models for Perturbation Analysis: one PCA still rules them all NEURIPS2024

链接: https://arxiv.org/abs/2410.13956
作者: Ihab Bendidi,Shawn Whitfield,Kian Kenyon-Dean,Hanene Ben Yedder,Yassir El Mesbahi,Emmanuel Noutahi,Alisandra K. Denton
关键词-EN: living organisms remains, organisms remains limited, remains limited due, interactions in living, living organisms
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Neurips 2024 AIDrugX Workshop

点击查看摘要

Abstract:Understanding the relationships among genes, compounds, and their interactions in living organisms remains limited due to technological constraints and the complexity of biological data. Deep learning has shown promise in exploring these relationships using various data types. However, transcriptomics, which provides detailed insights into cellular states, is still underused due to its high noise levels and limited data availability. Recent advancements in transcriptomics sequencing provide new opportunities to uncover valuable insights, especially with the rise of many new foundation models for transcriptomics, yet no benchmark has been made to robustly evaluate the effectiveness of these rising models for perturbation analysis. This article presents a novel biologically motivated evaluation framework and a hierarchy of perturbation analysis tasks for comparing the performance of pretrained foundation models to each other and to more classical techniques of learning from transcriptomics data. We compile diverse public datasets from different sequencing techniques and cell lines to assess models' performance. Our approach identifies scVI and PCA to be far better suited models for understanding biological perturbations in comparison to existing foundation models, especially in their application in real-world scenarios.

[LG-120] Nonlinear Stochastic Gradient Descent and Heavy-tailed Noise: A Unified Framework and High-probability Guarantees

链接: https://arxiv.org/abs/2410.13954
作者: Aleksandar Armacki,Shuhua Yu,Pranay Sharma,Gauri Joshi,Dragana Bajovic,Dusan Jakovetic,Soummya Kar
关键词-EN: study high-probability convergence, online learning, study high-probability, presence of heavy-tailed, nonlinear SGD methods
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 34 pages, 5 figures

点击查看摘要

Abstract:We study high-probability convergence in online learning, in the presence of heavy-tailed noise. To combat the heavy tails, a general framework of nonlinear SGD methods is considered, subsuming several popular nonlinearities like sign, quantization, component-wise and joint clipping. In our work the nonlinearity is treated in a black-box manner, allowing us to establish unified guarantees for a broad range of nonlinear methods. For symmetric noise and non-convex costs we establish convergence of gradient norm-squared, at a rate $\widetilde{\mathcal{O}}(t^{-1/4})$, while for the last iterate of strongly convex costs we establish convergence to the population optima, at a rate $\mathcal{O}(t^{-\zeta})$, where $\zeta \in (0,1)$ depends on noise and problem parameters. Further, if the noise is a (biased) mixture of symmetric and non-symmetric components, we show convergence to a neighbourhood of stationarity, whose size depends on the mixture coefficient, nonlinearity and noise. Compared to state-of-the-art, who only consider clipping and require unbiased noise with bounded $p$-th moments, $p \in (1,2]$, we provide guarantees for a broad class of nonlinearities, without any assumptions on noise moments. While the rate exponents in state-of-the-art depend on noise moments and vanish as $p \rightarrow 1$, our exponents are constant and strictly better whenever $p > 6/5$ for non-convex and $p > 8/7$ for strongly convex costs. Experiments validate our theory, demonstrating noise symmetry in real-life settings and showing that clipping is not always the optimal nonlinearity, further underlining the value of a general framework.
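To make the framework concrete, here is a toy instance of nonlinear SGD with a few of the nonlinearities named above (sign, component-wise and joint clipping), run on a strongly convex cost under heavy-tailed (Student-t, infinite-variance) gradient noise; all constants are illustrative:

```python
import numpy as np

# nonlinearities from the framework, applied to the stochastic gradient
def sign_nl(g):
    return np.sign(g)

def comp_clip(g, tau=1.0):
    return np.clip(g, -tau, tau)               # component-wise clipping

def joint_clip(g, tau=1.0):
    n = np.linalg.norm(g)
    return g if n <= tau else g * (tau / n)    # joint (norm) clipping

def nonlinear_sgd(grad_fn, x0, nonlinearity, steps=2000, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        noise = rng.standard_t(df=1.5, size=x.shape)  # heavy-tailed, infinite variance
        x = x - lr * nonlinearity(grad_fn(x) + noise)
    return x

# strongly convex toy cost f(x) = 0.5*||x||^2, gradient x, optimum at 0
x_star = nonlinear_sgd(lambda x: x, x0=[5.0, -3.0], nonlinearity=comp_clip)
```

Despite the infinite-variance noise, the clipped iterate ends up near the optimum; plain SGD (identity nonlinearity) would be destabilized by the occasional huge noise draw.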

[LG-121] On Diffusion Models for Multi-Agent Partial Observability: Shared Attractors Error Bounds and Composite Flow

链接: https://arxiv.org/abs/2410.13953
作者: Tonghan Wang,Heng Dong,Yanchen Jiang,David C. Parkes,Milind Tambe
关键词-EN: Multiagent systems grapple, Multiagent systems, decentralized POMDP, partial observability, systems grapple
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multiagent systems grapple with partial observability (PO), and the decentralized POMDP (Dec-POMDP) model highlights the fundamental nature of this challenge. Whereas recent approaches to address PO have appealed to deep learning models, providing a rigorous understanding of how these models and their approximation errors affect agents' handling of PO and their interactions remains a challenge. In addressing this challenge, we investigate reconstructing global states from local action-observation histories in Dec-POMDPs using diffusion models. We first find that diffusion models conditioned on local history represent possible states as stable fixed points. In collectively observable (CO) Dec-POMDPs, individual diffusion models conditioned on agents' local histories share a unique fixed point corresponding to the global state, while in non-CO settings, the shared fixed points yield a distribution of possible states given joint history. We further find that, with deep learning approximation errors, fixed points can deviate from true states and the deviation is negatively correlated to the Jacobian rank. Inspired by this low-rank property, we bound the deviation by constructing a surrogate linear regression model that approximates the local behavior of diffusion models. With this bound, we propose a composite diffusion process iterating over agents with theoretical convergence guarantees to the true state.
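The "shared stable fixed point" picture can be illustrated with a toy: two contraction maps (stand-ins for diffusion models conditioned on different agents' local histories in a collectively observable setting) share one fixed point, the global state, and iterating either map converges to it. Purely illustrative:

```python
import numpy as np

s_star = np.array([1.0, -2.0, 0.5])   # the shared global state

def make_denoiser(gain):
    """A contraction toward s_star with factor gain < 1 (toy diffusion-model stand-in)."""
    return lambda s: s_star + gain * (s - s_star)

f1, f2 = make_denoiser(0.5), make_denoiser(0.8)   # two agents' conditional models

s1 = np.zeros(3)
s2 = np.ones(3) * 10.0
for _ in range(200):
    s1, s2 = f1(s1), f2(s2)   # both iterations settle on the same fixed point
```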

[LG-122] Automatically Interpreting Millions of Features in Large Language Models

链接: https://arxiv.org/abs/2410.13928
作者: Gonçalo Paulo,Alex Mallen,Caden Juang,Nora Belrose
关键词-EN: simple human-understandable interpretation, deep neural networks, higher-dimensional latent space, sparse autoencoders, human-understandable interpretation
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:While the activations of neurons in deep neural networks usually do not have a simple human-understandable interpretation, sparse autoencoders (SAEs) can be used to transform these activations into a higher-dimensional latent space which may be more easily interpretable. However, these SAEs can have millions of distinct latent features, making it infeasible for humans to manually interpret each one. In this work, we build an open-source automated pipeline to generate and evaluate natural language explanations for SAE features using LLMs. We test our framework on SAEs of varying sizes, activation functions, and losses, trained on two different open-weight LLMs. We introduce five new techniques to score the quality of explanations that are cheaper to run than the previous state of the art. One of these techniques, intervention scoring, evaluates the interpretability of the effects of intervening on a feature, which we find explains features that are not recalled by existing methods. We propose guidelines for generating better explanations that remain valid for a broader set of activating contexts, and discuss pitfalls with existing scoring techniques. We use our explanations to measure the semantic similarity of independently trained SAEs, and find that SAEs trained on nearby layers of the residual stream are highly similar. Our large-scale analysis confirms that SAE latents are indeed much more interpretable than neurons, even when neurons are sparsified using top-k postprocessing. Our code is available at this https URL, and our explanations are available at this https URL.
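As background for readers unfamiliar with SAEs: a sparse autoencoder maps a model activation into a much wider latent space and keeps only a few active latents, and those latents are the "features" the pipeline then explains. A minimal top-k SAE sketch with untrained (random, tied) weights, shown only for the shapes and sparsity pattern:

```python
import numpy as np

class TopKSAE:
    """Minimal sparse autoencoder: encode into a wide latent, keep only top-k activations."""
    def __init__(self, d_model, d_latent, k, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_model)
        self.W_dec = self.W_enc.T.copy()       # tied weights for simplicity
        self.k = k

    def encode(self, h):
        z = np.maximum(self.W_enc @ h, 0.0)    # ReLU pre-activations
        cutoff = np.sort(z)[-self.k]           # value of the k-th largest activation
        return np.where(z >= cutoff, z, 0.0)   # zero out everything below it

    def decode(self, z):
        return self.W_dec @ z                  # reconstruct the original activation

sae = TopKSAE(d_model=32, d_latent=256, k=8)
h = np.random.default_rng(1).standard_normal(32)   # an LLM activation vector (stand-in)
z = sae.encode(h)                                  # at most k nonzero latent features
```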

[LG-123] FiTv2: Scalable and Improved Flexible Vision Transformer for Diffusion Model

链接: https://arxiv.org/abs/2410.13925
作者: ZiDong Wang,Zeyu Lu,Di Huang,Cai Zhou,Wanli Ouyang,and Lei Bai
关键词-EN: Nature is infinitely, infinitely resolution-free, Nature, Flexible Vision Transformer, Transformer
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2402.12376

点击查看摘要

Abstract:Nature is infinitely resolution-free. In the context of this reality, existing diffusion models, such as Diffusion Transformers, often face challenges when processing image resolutions outside of their trained domain. To address this limitation, we conceptualize images as sequences of tokens with dynamic sizes, rather than traditional methods that perceive images as fixed-resolution grids. This perspective enables a flexible training strategy that seamlessly accommodates various aspect ratios during both training and inference, thus promoting resolution generalization and eliminating biases introduced by image cropping. On this basis, we present the Flexible Vision Transformer (FiT), a transformer architecture specifically designed for generating images with unrestricted resolutions and aspect ratios. We further upgrade the FiT to FiTv2 with several innovative designs, including the Query-Key vector normalization, the AdaLN-LoRA module, a rectified flow scheduler, and a Logit-Normal sampler. Enhanced by a meticulously adjusted network structure, FiTv2 exhibits 2x the convergence speed of FiT. When incorporating advanced training-free extrapolation techniques, FiTv2 demonstrates remarkable adaptability in both resolution extrapolation and diverse resolution generation. Additionally, our exploration of the scalability of the FiTv2 model reveals that larger models exhibit better computational efficiency. Furthermore, we introduce an efficient post-training strategy to adapt a pre-trained model for the high-resolution generation. Comprehensive experiments demonstrate the exceptional performance of FiTv2 across a broad range of resolutions. We have released all the codes and models at this https URL to promote the exploration of diffusion transformer models for arbitrary-resolution image generation.

[LG-124] GBCT: An Efficient and Adaptive Granular-Ball Clustering Algorithm for Complex Data

链接: https://arxiv.org/abs/2410.13917
作者: Shuyin Xia,Bolun Shi,Yifan Wang,Jiang Xie,Guoyin Wang,Xinbo Gao
关键词-EN: fine-grained information, information and achieve, calculating the distance, implementing other calculations, calculations based
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Traditional clustering algorithms often focus on the most fine-grained information and achieve clustering by calculating the distance between each pair of data points or implementing other point-based calculations. This approach is inconsistent with the "global precedence" cognitive mechanism of the human brain, which results in those methods' poor efficiency, generalization ability and robustness. To address this problem, we propose a new clustering algorithm called granular-ball clustering (GBCT) via granular-ball computing. Firstly, GBCT generates a smaller number of granular-balls to represent the original data, and forms clusters according to the relationships between granular-balls, instead of the traditional point-wise relationships. At the same time, its coarse-grained characteristics are not susceptible to noise, so the algorithm is efficient and robust; besides, as granular-balls can fit various complex data, GBCT performs much better on non-spherical data sets than other traditional clustering methods. GBCT's completely new coarse-granularity representation method and cluster formation mode can also be used to improve other traditional methods.

[LG-125] Exogenous Matching: Learning Good Proposals for Tractable Counterfactual Estimation

链接: https://arxiv.org/abs/2410.13914
作者: Yikang Chen,Dehui du,Lili Tian
关键词-EN: named Exogenous Matching, Exogenous Matching, named Exogenous, Structural Causal Models, tractable and efficient
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We propose an importance sampling method for tractable and efficient estimation of counterfactual expressions in general settings, named Exogenous Matching. By minimizing a common upper bound of counterfactual estimators, we transform the variance minimization problem into a conditional distribution learning problem, enabling its integration with existing conditional distribution modeling approaches. We validate the theoretical results through experiments under various types and settings of Structural Causal Models (SCMs) and demonstrate that our method outperforms other existing importance sampling methods on counterfactual estimation tasks. We also explore the impact of injecting structural prior knowledge (counterfactual Markov boundaries) on the results. Finally, we apply this method to identifiable proxy SCMs and demonstrate the unbiasedness of the estimates, empirically illustrating the applicability of the method to practical scenarios.
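Exogenous Matching's specifics aren't reproduced here, but its backbone, (self-normalized) importance sampling with a proposal distribution, works as below; the target, proposal, and test function are toy choices with a known answer E_p[X^2] = 10:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal_pdf(x, mu, sigma):
    """Log-density of a univariate normal N(mu, sigma^2)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

# Estimate E_p[f(X)] for target p = N(3, 1) using a proposal q = N(2.5, 1.5^2).
f = lambda x: x ** 2                      # E_p[f] = mu^2 + var = 9 + 1 = 10
xs = rng.normal(2.5, 1.5, size=200_000)   # draws from the proposal q
log_w = log_normal_pdf(xs, 3.0, 1.0) - log_normal_pdf(xs, 2.5, 1.5)
w = np.exp(log_w)                         # importance weights p(x)/q(x)
estimate = np.sum(w * f(xs)) / np.sum(w)  # self-normalized importance sampling
```

A well-matched proposal keeps the weights well-behaved and the estimator's variance low; learning that proposal conditionally is exactly what the paper turns into a distribution-learning problem.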

[LG-126] Mitigating the Backdoor Effect for Multi-Task Model Merging via Safety-Aware Subspace

链接: https://arxiv.org/abs/2410.13910
作者: Jinluan Yang,Anke Tang,Didi Zhu,Zhengyu Chen,Li Shen,Fei Wu
关键词-EN: integrate multiple single-task, multiple single-task fine-tuned, gained significant attention, single-task fine-tuned models, Model merging
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 21 pages,8 figures

点击查看摘要

Abstract:Model merging has gained significant attention as a cost-effective approach to integrate multiple single-task fine-tuned models into a unified one that can perform well on multiple tasks. However, existing model merging techniques primarily focus on resolving conflicts between task-specific models, they often overlook potential security threats, particularly the risk of backdoor attacks in the open-source model ecosystem. In this paper, we first investigate the vulnerabilities of existing model merging methods to backdoor attacks, identifying two critical challenges: backdoor succession and backdoor transfer. To address these issues, we propose a novel Defense-Aware Merging (DAM) approach that simultaneously mitigates task interference and backdoor vulnerabilities. Specifically, DAM employs a meta-learning-based optimization method with dual masks to identify a shared and safety-aware subspace for model merging. These masks are alternately optimized: the Task-Shared mask identifies common beneficial parameters across tasks, aiming to preserve task-specific knowledge while reducing interference, while the Backdoor-Detection mask isolates potentially harmful parameters to neutralize security threats. This dual-mask design allows us to carefully balance the preservation of useful knowledge and the removal of potential vulnerabilities. Compared to existing merging methods, DAM achieves a more favorable balance between performance and security, reducing the attack success rate by 2-10 percentage points while sacrificing only about 1% in accuracy. Furthermore, DAM exhibits robust performance and broad applicability across various types of backdoor attacks and the number of compromised models involved in the merging process. We will release the codes and models soon.

[LG-127] P4GCN: Vertical Federated Social Recommendation with Privacy-Preserving Two-Party Graph Convolution Networks

链接: https://arxiv.org/abs/2410.13905
作者: Zheng Wang,Wanwan Wang,Yimin Huang,Zhaopeng Peng,Ziqi Yang,Cheng Wang,Xiaoliang Fan
关键词-EN: social recommendation systems, recent years, commonly utilized, social, graph neural networks
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, graph neural networks (GNNs) have been commonly utilized for social recommendation systems. However, real-world scenarios often present challenges related to user privacy and business constraints, inhibiting direct access to valuable social information from other platforms. While many existing methods have tackled matrix factorization-based social recommendations without direct social data access, developing GNN-based federated social recommendation models under similar conditions remains largely unexplored. To address this issue, we propose a novel vertical federated social recommendation method leveraging privacy-preserving two-party graph convolution networks (P4GCN) to enhance recommendation accuracy without requiring direct access to sensitive social information. First, we introduce a Sandwich-Encryption module to ensure comprehensive data privacy during the collaborative computing process. Second, we provide a thorough theoretical analysis of the privacy guarantees, considering the participation of both curious and honest parties. Extensive experiments on four real-world datasets demonstrate that P4GCN outperforms state-of-the-art methods in terms of recommendation accuracy. The code is available at this https URL.

[LG-128] A Formal Framework for Assessing and Mitigating Emergent Security Risks in Generative AI Models: Bridging Theory and Dynamic Risk Mitigation NEURIPS2024

链接: https://arxiv.org/abs/2410.13897
作者: Aviral Srivastava,Sourav Panda
关键词-EN: including large language, large language models, complex security risks, advance rapidly, large language
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: This paper was accepted in NeurIPS 2024 workshop on Red Teaming GenAI: What can we learn with Adversaries?

点击查看摘要

Abstract:As generative AI systems, including large language models (LLMs) and diffusion models, advance rapidly, their growing adoption has led to new and complex security risks often overlooked in traditional AI risk assessment frameworks. This paper introduces a novel formal framework for categorizing and mitigating these emergent security risks by integrating adaptive, real-time monitoring, and dynamic risk mitigation strategies tailored to generative models’ unique vulnerabilities. We identify previously under-explored risks, including latent space exploitation, multi-modal cross-attack vectors, and feedback-loop-induced model degradation. Our framework employs a layered approach, incorporating anomaly detection, continuous red-teaming, and real-time adversarial simulation to mitigate these risks. We focus on formal verification methods to ensure model robustness and scalability in the face of evolving threats. Though theoretical, this work sets the stage for future empirical validation by establishing a detailed methodology and metrics for evaluating the performance of risk mitigation strategies in generative AI systems. This framework addresses existing gaps in AI safety, offering a comprehensive road map for future research and implementation.

[LG-129] Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents

链接: https://arxiv.org/abs/2410.13886
作者: Priyanshu Kumar,Elaine Lau,Saranya Vijayakumar,Tu Trinh,Scale Red Team,Elaine Chang,Vaughn Robinson,Sean Hendryx,Shuyan Zhou,Matt Fredrikson,Summer Yue,Zifan Wang
关键词-EN: large language models, assisting dangerous activities, browser agents, large language, language models
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:For safety reasons, large language models (LLMs) are trained to refuse harmful user instructions, such as assisting dangerous activities. We study an open question in this work: does the desired safety refusal, typically enforced in chat contexts, generalize to non-chat and agentic use cases? Unlike chatbots, LLM agents equipped with general-purpose tools, such as web browsers and mobile devices, can directly influence the real world, making it even more crucial to refuse harmful instructions. In this work, we primarily focus on red-teaming browser agents, LLMs that manipulate information via web browsers. To this end, we introduce Browser Agent Red teaming Toolkit (BrowserART), a comprehensive test suite designed specifically for red-teaming browser agents. BrowserART consists of 100 diverse browser-related harmful behaviors (including original behaviors and ones sourced from HarmBench [Mazeika et al., 2024] and AirBench 2024 [Zeng et al., 2024b]) across both synthetic and real websites. Our empirical study on state-of-the-art browser agents reveals that, while the backbone LLM refuses harmful instructions as a chatbot, the corresponding agent does not. Moreover, attack methods designed to jailbreak refusal-trained LLMs in the chat settings transfer effectively to browser agents. With human rewrites, GPT-4o and o1-preview-based browser agents attempted 98 and 63 harmful behaviors (out of 100), respectively. We publicly release BrowserART and call on LLM developers, policymakers, and agent developers to collaborate on improving agent safety.

[LG-130] Transformers Utilization in Chart Understanding: A Review of Recent Advances & Future Trends

链接: https://arxiv.org/abs/2410.13883
作者: Mirna Al-Shetairy,Hanan Hindy,Dina Khattab,Mostafa M. Aref
关键词-EN: involving chart interactions, interest in vision-language, chart interactions, involving chart, Chart Understanding
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, interest in vision-language tasks has grown, especially those involving chart interactions. These tasks are inherently multimodal, requiring models to process chart images, accompanying text, underlying data tables, and often user queries. Traditionally, Chart Understanding (CU) relied on heuristics and rule-based systems. However, recent advancements that have integrated transformer architectures significantly improved performance. This paper reviews prominent research in CU, focusing on State-of-The-Art (SoTA) frameworks that employ transformers within End-to-End (E2E) solutions. Relevant benchmarking datasets and evaluation techniques are analyzed. Additionally, this article identifies key challenges and outlines promising future directions for advancing CU solutions. Following the PRISMA guidelines, a comprehensive literature search is conducted across Google Scholar, focusing on publications from Jan’20 to Jun’24. After rigorous screening and quality assessment, 32 studies are selected for in-depth analysis. The CU tasks are categorized into a three-layered paradigm based on the cognitive task required. Recent advancements in the frameworks addressing various CU tasks are also reviewed. Frameworks are categorized into single-task or multi-task based on the number of tasks solvable by the E2E solution. Within multi-task frameworks, pre-trained and prompt-engineering-based techniques are explored. This review overviews leading architectures, datasets, and pre-training tasks. Despite significant progress, challenges remain in OCR dependency, handling low-resolution images, and enhancing visual reasoning. Future directions include addressing these challenges, developing robust benchmarks, and optimizing model efficiency. Additionally, integrating explainable AI techniques and exploring the balance between real and synthetic data are crucial for advancing CU research.

[LG-131] Mixed-curvature decision trees and random forests ICLR2025

链接: https://arxiv.org/abs/2410.13879
作者: Philippe Chlenski,Quentin Chu,Raiyan R. Khan,Antonio Khalil Moretti,Itsik Pe’er
关键词-EN: Decision trees, random forest, extensions are workhorses, Decision, Euclidean
类目: Machine Learning (cs.LG)
*备注: 25 pages, 9 figures. Submitted to ICLR 2025

点击查看摘要

Abstract:Decision trees (DTs) and their random forest (RF) extensions are workhorses of classification and regression in Euclidean spaces. However, algorithms for learning in non-Euclidean spaces are still limited. We extend DT and RF algorithms to product manifolds: Cartesian products of several hyperbolic, hyperspherical, or Euclidean components. Such manifolds handle heterogeneous curvature while still factorizing neatly into simpler components, making them compelling embedding spaces for complex datasets. Our novel angular reformulation of DTs respects the geometry of the product manifold, yielding splits that are geodesically convex, maximum-margin, and composable. In the special cases of single-component manifolds, our method simplifies to its Euclidean or hyperbolic counterparts, or introduces hyperspherical DT algorithms, depending on the curvature. We benchmark our method on various classification, regression, and link prediction tasks on synthetic data, graph embeddings, mixed-curvature variational autoencoder latent spaces, and empirical data. Compared to six other classifiers, product DTs and RFs ranked first on 21 of 22 single-manifold benchmarks and 18 of 35 product manifold benchmarks, and placed in the top 2 on 53 of 57 benchmarks overall. This highlights the value of product DTs and RFs as straightforward yet powerful new tools for data analysis in product manifolds. Code for our paper is available at this https URL.

[LG-132] COOL: Efficient and Reliable Chain-Oriented Objective Logic with Neural Networks Feedback Control for Program Synthesis ICLR2025

链接: https://arxiv.org/abs/2410.13874
作者: Jipeng Han
关键词-EN: complex software development, lack fine-grained control, neural network, formal or neural-based, lack fine-grained
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 25 pages, 9 figures, submitted to ICLR 2025

点击查看摘要

Abstract:Program synthesis methods, whether formal or neural-based, lack fine-grained control and flexible modularity, which limits their adaptation to complex software development. These limitations stem from rigid Domain-Specific Language (DSL) frameworks and neural network incorrect predictions. To this end, we propose the Chain of Logic (CoL), which organizes synthesis stages into a chain and provides precise heuristic control to guide the synthesis process. Furthermore, by integrating neural networks with libraries and introducing a Neural Network Feedback Control (NNFC) mechanism, our approach modularizes synthesis and mitigates the impact of neural network mispredictions. Experiments on relational and symbolic synthesis tasks show that CoL significantly enhances the efficiency and reliability of DSL program synthesis across multiple metrics. Specifically, CoL improves accuracy by 70% while reducing tree operations by 91% and time by 95%. Additionally, NNFC further boosts accuracy by 6%, with a 64% reduction in tree operations under challenging conditions such as insufficient training data, increased difficulty, and multidomain synthesis. These improvements confirm COOL as a highly efficient and reliable program synthesis framework.

[LG-133] BLEND: Behavior-guided Neural Population Dynamics Modeling via Privileged Knowledge Distillation

链接: https://arxiv.org/abs/2410.13872
作者: Zhengrui Guo,Fangxu Zhou,Wei Wu,Qichen Sun,Lishuang Feng,Jinzhuo Wang,Hao Chen
关键词-EN: neuronal populations represents, computational neuroscience, neural, represents a key, key pursuit
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 20 pages, 5 figures, 3 tables

点击查看摘要

Abstract:Modeling the nonlinear dynamics of neuronal populations represents a key pursuit in computational neuroscience. Recent research has increasingly focused on jointly modeling neural activity and behavior to unravel their interconnections. Despite significant efforts, these approaches often necessitate either intricate model designs or oversimplified assumptions. Given the frequent absence of perfectly paired neural-behavioral datasets in real-world scenarios when deploying these models, a critical yet understudied research question emerges: how to develop a model that performs well using only neural activity as input at inference, while benefiting from the insights gained from behavioral signals during training? To this end, we propose BLEND, the behavior-guided neural population dynamics modeling framework via privileged knowledge distillation. By considering behavior as privileged information, we train a teacher model that takes both behavior observations (privileged features) and neural activities (regular features) as inputs. A student model is then distilled using only neural activity. Unlike existing methods, our framework is model-agnostic and avoids making strong assumptions about the relationship between behavior and neural activity. This allows BLEND to enhance existing neural dynamics modeling architectures without developing specialized models from scratch. Extensive experiments across neural population activity modeling and transcriptomic neuron identity prediction tasks demonstrate strong capabilities of BLEND, reporting over 50% improvement in behavioral decoding and over 15% improvement in transcriptomic neuron identity prediction after behavior-guided distillation. Furthermore, we empirically explore various behavior-guided distillation strategies within the BLEND framework and present a comprehensive analysis of effectiveness and implications for model performance. 

[LG-134] A Federated Learning Platform as a Service for Advancing Stroke Management in European Clinical Centers

链接: https://arxiv.org/abs/2410.13869
作者: Diogo Reis Santos,Albert Sund Aillet,Antonio Boiano,Usevalad Milasheuski,Lorenzo Giusti,Marco Di Gennaro,Sanaz Kianoush,Luca Barbieri,Monica Nicoli,Michele Carminati,Alessandro E. C. Redondi,Stefano Savazzi,Luigi Serio
关键词-EN: technologies holds transformative, holds transformative potential, artificial intelligence, technologies holds, rapid evolution
类目: Computers and Society (cs.CY); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The rapid evolution of artificial intelligence (AI) technologies holds transformative potential for the healthcare sector. In critical situations requiring immediate decision-making, healthcare professionals can leverage machine learning (ML) algorithms to prioritize and optimize treatment options, thereby reducing costs and improving patient outcomes. However, the sensitive nature of healthcare data presents significant challenges in terms of privacy and data ownership, hindering data availability and the development of robust algorithms. Federated Learning (FL) addresses these challenges by enabling collaborative training of ML models without the exchange of local data. This paper introduces a novel FL platform designed to support the configuration, monitoring, and management of FL processes. This platform operates on Platform-as-a-Service (PaaS) principles and utilizes the Message Queuing Telemetry Transport (MQTT) publish-subscribe protocol. Considering the production readiness and data sensitivity inherent in clinical environments, we emphasize the security of the proposed FL architecture, addressing potential threats and proposing mitigation strategies to enhance the platform’s trustworthiness. The platform has been successfully tested in various operational environments using a publicly available dataset, highlighting its benefits and confirming its efficacy.

[LG-135] Self-supervised contrastive learning performs non-linear system identification

链接: https://arxiv.org/abs/2410.14673
作者: Rodrigo González Laiz,Tobias Schmidt,Steffen Schneider
关键词-EN: brought tremendous success, Self-supervised learning, approaches have brought, tasks and domains, brought tremendous
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Self-supervised learning (SSL) approaches have brought tremendous success across many tasks and domains. It has been argued that these successes can be attributed to a link between SSL and identifiable representation learning: Temporal structure and auxiliary variables ensure that latent representations are related to the true underlying generative factors of the data. Here, we deepen this connection and show that SSL can perform system identification in latent space. We propose DynCL, a framework to uncover linear, switching linear and non-linear dynamics under a non-linear observation model, give theoretical guarantees and validate them empirically.

[LG-136] syren-new: Precise formulae for the linear and nonlinear matter power spectra with massive neutrinos and dynamical dark energy

链接: https://arxiv.org/abs/2410.14623
作者: Ce Sui,Deaglan J. Bartlett,Shivam Pandey,Harry Desmond,Pedro G. Ferreira,Benjamin D. Wandelt
关键词-EN: future large scale, large scale structure, scale structure surveys, dark energy, structure surveys aim
类目: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 18 pages, 15 figures

点击查看摘要

Abstract:Current and future large scale structure surveys aim to constrain the neutrino mass and the equation of state of dark energy. We aim to construct accurate and interpretable symbolic approximations to the linear and nonlinear matter power spectra as a function of cosmological parameters in extended \Lambda CDM models which contain massive neutrinos and non-constant equations of state for dark energy. This constitutes an extension of the syren-halofit emulators to incorporate these two effects, which we call syren-new (SYmbolic-Regression-ENhanced power spectrum emulator with NEutrinos and W_0-w_a). We also obtain a simple approximation to the derived parameter \sigma_8 as a function of the cosmological parameters for these models. Our results for the linear power spectrum are designed to emulate CLASS, whereas for the nonlinear case we aim to match the results of EuclidEmulator2. We compare our results to existing emulators and N-body simulations. Our analytic emulators for \sigma_8, the linear and nonlinear power spectra achieve root mean squared errors of 0.1%, 0.3% and 1.3%, respectively, across a wide range of cosmological parameters, redshifts and wavenumbers. We verify that emulator-related discrepancies are subdominant compared to observational errors and other modelling uncertainties when computing shear power spectra for LSST-like surveys. Our expressions have similar accuracy to existing (numerical) emulators, but are at least an order of magnitude faster, both on a CPU and GPU. Our work greatly improves the accuracy, speed and range of applicability of current symbolic approximations to the linear and nonlinear matter power spectra. We provide publicly available code for all symbolic approximations found.

[LG-137] JAMUN: Transferable Molecular Conformational Ensemble Generation with Walk-Jump Sampling

链接: https://arxiv.org/abs/2410.14621
作者: Ameya Daigavane,Bodhi P. Vani,Saeed Saremi,Joseph Kleinhenz,Joshua Rackers
关键词-EN: understanding protein function, cryptic pockets, Conformational ensembles, structures are immensely, immensely important
类目: Biological Physics (physics.bio-ph); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:Conformational ensembles of protein structures are immensely important both to understanding protein function, and for drug discovery in novel modalities such as cryptic pockets. Current techniques for sampling ensembles are computationally inefficient, or do not transfer to systems outside their training data. We present walk-Jump Accelerated Molecular ensembles with Universal Noise (JAMUN), a step towards the goal of efficiently sampling the Boltzmann distribution of arbitrary proteins. By extending Walk-Jump Sampling to point clouds, JAMUN enables ensemble generation at orders of magnitude faster rates than traditional molecular dynamics or state-of-the-art ML methods. Further, JAMUN is able to predict the stable basins of small peptides that were not seen during training.

[LG-138] Asymptotically Optimal Change Detection for Unnormalized Pre- and Post-Change Distributions

链接: https://arxiv.org/abs/2410.14615
作者: Arman Adibi,Sanjeev Kulkarni,H. Vincent Poor,Taposh Banerjee,Vahid Tarokh
关键词-EN: Approximation Cumulative Sum, paper addresses, addresses the problem, problem of detecting, Cumulative Sum
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:This paper addresses the problem of detecting changes when only unnormalized pre- and post-change distributions are accessible. This situation happens in many scenarios in physics such as in ferromagnetism, crystallography, magneto-hydrodynamics, and thermodynamics, where the energy models are difficult to normalize. Our approach is based on the estimation of the Cumulative Sum (CUSUM) statistics, which is known to produce optimal performance. We first present an intuitively appealing approximation method. Unfortunately, this produces a biased estimator of the CUSUM statistics and may cause performance degradation. We then propose the Log-Partition Approximation Cumulative Sum (LPA-CUSUM) algorithm based on thermodynamic integration (TI) in order to estimate the log-ratio of normalizing constants of pre- and post-change distributions. It is proved that this approach gives an unbiased estimate of the log-partition function and the CUSUM statistics, and leads to an asymptotically optimal performance. Moreover, we derive a relationship between the required sample size for thermodynamic integration and the desired detection delay performance, offering guidelines for practical parameter selection. Numerical studies are provided demonstrating the efficacy of our approach.
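The CUSUM recursion at the heart of this line of work is simple to state. Below is a minimal sketch of the classical CUSUM detector, assuming known, normalized Gaussian pre- and post-change log-densities for illustration; it is not the paper's LPA-CUSUM, which additionally estimates the log-partition terms via thermodynamic integration. The data values and threshold are hypothetical.

```python
def cusum(samples, logp_pre, logp_post, threshold):
    """Classical CUSUM: W_n = max(0, W_{n-1} + log p1(x_n) - log p0(x_n));
    declare a change as soon as W_n crosses the threshold."""
    w = 0.0
    for n, x in enumerate(samples):
        w = max(0.0, w + logp_post(x) - logp_pre(x))
        if w >= threshold:
            return n  # sample index at which the alarm is raised
    return None  # no change detected

# Illustrative unit-variance Gaussian mean shift from 0 to 2
# (both log-densities known up to the same additive constant).
logp0 = lambda x: -0.5 * x * x
logp1 = lambda x: -0.5 * (x - 2.0) ** 2
data = [0.1, -0.2, 0.0, 2.1, 1.9, 2.2, 1.8]  # change occurs at index 3
alarm = cusum(data, logp0, logp1, threshold=4.0)  # alarm at index 4
```

Note the recursion only needs the log-likelihood ratio, which is exactly the quantity that becomes inaccessible when the densities are unnormalized; replacing it with an unbiased estimate is the paper's contribution.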

[LG-139] Contractivity and linear convergence in bilinear saddle-point problems: An operator-theoretic approach

链接: https://arxiv.org/abs/2410.14592
作者: Colin Dirren,Mattia Bianchi,Panagiotis D. Grontas,John Lygeros,Florian Dörfler
关键词-EN: convex-concave bilinear saddle-point, suitable rank conditions, bilinear saddle-point problem, strongly convex, study the convex-concave
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the convex-concave bilinear saddle-point problem \min_x \max_y f(x) + y^\top Ax - g(y) , where both, only one, or none of the functions f and g are strongly convex, and suitable rank conditions on the matrix A hold. The solution of this problem is at the core of many machine learning tasks. By employing tools from operator theory, we systematically prove the contractivity (in turn, the linear convergence) of several first-order primal-dual algorithms, including the Chambolle-Pock method. Our approach results in concise and elegant proofs, and it yields new convergence guarantees and tighter bounds compared to known results.
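As a concrete instance of the algorithms analyzed, here is a minimal Chambolle-Pock sketch for the bilinear saddle-point problem with f(x) = g(y) = ||.||^2/2, so both functions are strongly convex and the contraction regime applies. The step sizes, iteration count, and quadratic choice of f and g are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def chambolle_pock(A, prox_f, prox_g, tau, sigma, iters=1000, theta=1.0):
    """Primal-dual iterations for min_x max_y f(x) + y^T A x - g(y);
    contractive (hence linearly convergent) when f and g are strongly
    convex and tau * sigma * ||A||_2^2 < 1."""
    m, n = A.shape
    x, y = np.ones(n), np.ones(m)
    x_bar = x.copy()
    for _ in range(iters):
        y = prox_g(y + sigma * (A @ x_bar), sigma)  # dual (ascent) step
        x_new = prox_f(x - tau * (A.T @ y), tau)    # primal (descent) step
        x_bar = x_new + theta * (x_new - x)         # extrapolation
        x = x_new
    return x, y

# With f(x) = g(y) = ||.||^2 / 2, prox_{t f}(v) = v / (1 + t) and the
# unique saddle point is (0, 0).
prox_sq = lambda v, t: v / (1.0 + t)
A = np.random.default_rng(0).standard_normal((3, 3))
step = 0.9 / np.linalg.norm(A, 2)   # ensures tau * sigma * ||A||_2^2 < 1
x, y = chambolle_pock(A, prox_sq, prox_sq, tau=step, sigma=step)
```

The iterates contract toward the saddle point (0, 0), which is the linear-convergence behavior the paper proves via operator theory.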

[LG-140] A Lipschitz spaces view of infinitely wide shallow neural networks

链接: https://arxiv.org/abs/2410.14591
作者: Francesca Bartolucci,Marcello Carioni,José A. Iglesias,Yury Korolev,Emanuele Naldi,Stefano Vigogna
关键词-EN: shallow neural networks, unbounded parameter spaces, neural networks, shallow neural, unbounded parameter
类目: Functional Analysis (math.FA); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 39 pages, 1 table

点击查看摘要

Abstract:We revisit the mean field parametrization of shallow neural networks, using signed measures on unbounded parameter spaces and duality pairings that take into account the regularity and growth of activation functions. This setting directly leads to the use of unbalanced Kantorovich-Rubinstein norms defined by duality with Lipschitz functions, and of spaces of measures dual to those of continuous functions with controlled growth. These allow to make transparent the need for total variation and moment bounds or penalization to obtain existence of minimizers of variational formulations, under which we prove a compactness result in strong Kantorovich-Rubinstein norm, and in the absence of which we show several examples demonstrating undesirable behavior. Further, the Kantorovich-Rubinstein setting enables us to combine the advantages of a completely linear parametrization and ensuing reproducing kernel Banach space framework with optimal transport insights. We showcase this synergy with representer theorems and uniform large data limits for empirical risk minimization, and in proposed formulations for distillation and fusion applications.

[LG-141] Diffusion-based Semi-supervised Spectral Algorithm for Regression on Manifolds

链接: https://arxiv.org/abs/2410.14539
作者: Weichun Xia,Jiaxin Jiang,Lei Shi
关键词-EN: diffusion-based spectral algorithm, tackle regression analysis, tackle regression, embedded within lower-dimensional, Traditional spectral algorithms
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce a novel diffusion-based spectral algorithm to tackle regression analysis on high-dimensional data, particularly data embedded within lower-dimensional manifolds. Traditional spectral algorithms often fall short in such contexts, primarily due to the reliance on predetermined kernel functions, which inadequately address the complex structures inherent in manifold-based data. By employing graph Laplacian approximation, our method uses the local estimation property of the heat kernel, offering an adaptive, data-driven approach to overcome this obstacle. Another distinct advantage of our algorithm lies in its semi-supervised learning framework, enabling it to fully use the additional unlabeled data. This ability enhances performance by allowing the algorithm to exploit the spectrum and curvature of the data manifold, providing a more comprehensive understanding of the dataset. Moreover, our algorithm performs in an entirely data-driven manner, operating directly within the intrinsic manifold structure of the data, without requiring any predefined manifold information. We provide a convergence analysis of our algorithm. Our findings reveal that the algorithm achieves a convergence rate that depends solely on the intrinsic dimension of the underlying manifold, thereby avoiding the curse of dimensionality associated with the higher ambient dimension.
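The graph Laplacian approximation the abstract relies on can be made concrete: build a weighted graph on the samples with Gaussian (heat-kernel) weights and take L = D - W. The bandwidth eps and the unnormalized variant below are illustrative choices, not necessarily the paper's.

```python
import numpy as np

def graph_laplacian(X, eps):
    """Unnormalized graph Laplacian L = D - W from heat-kernel weights
    W_ij = exp(-||x_i - x_j||^2 / eps), the standard discrete
    approximation to the manifold's heat diffusion operator."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # pairwise squared dists
    W = np.exp(-np.maximum(d2, 0.0) / eps)            # clamp tiny negatives
    np.fill_diagonal(W, 0.0)                          # no self-loops
    return np.diag(W.sum(axis=1)) - W

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
L = graph_laplacian(X, eps=1.0)
```

L is symmetric positive semi-definite with zero row sums; its low-lying spectrum is what a spectral regression algorithm of this kind works with.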

[LG-142] Comparing Differentiable and Dynamic Ray Tracing: Introducing the Multipath Lifetime Map

链接: https://arxiv.org/abs/2410.14535
作者: Jérome Eertmans,Enrico Maria Vittuci,Vittorio Degli Esposti,Laurent Jacques,Claude Oestges
关键词-EN: rapidly changing nature, Dynamic Ray Tracing, radio propagation modeling, propagation modeling tools, Ray Tracing frameworks
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 5 pages, 5 figures, 1 table, submitted to EuCAP 2025

点击查看摘要

Abstract:With the increasing presence of dynamic scenarios, such as Vehicle-to-Vehicle communications, radio propagation modeling tools must adapt to the rapidly changing nature of the radio channel. Recently, both Differentiable and Dynamic Ray Tracing frameworks have emerged to address these challenges. However, there is often confusion about how these approaches differ and which one should be used in specific contexts. In this paper, we provide an overview of these two techniques and a comparative analysis against two state-of-the-art tools: 3DSCAT from UniBo and Sionna from NVIDIA. To provide a more precise characterization of the scope of these methods, we introduce a novel simulation-based metric, the Multipath Lifetime Map, which enables the evaluation of spatial and temporal coherence in radio channels based only on the geometrical description of the environment. Finally, our metrics are evaluated on a classic urban street canyon scenario, yielding similar results to those obtained from measurement campaigns.

[LG-143] The Traveling Bandit: A Framework for Bayesian Optimization with Movement Costs

链接: https://arxiv.org/abs/2410.14533
作者: Qiyuan Chen,Raed Al Kontar
关键词-EN: Bayesian Optimization, input alterations incur, alterations incur varying, Traveling Salesman Problem, incur varying costs
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces a framework for Bayesian Optimization (BO) with metric movement costs, addressing a critical challenge in practical applications where input alterations incur varying costs. Our approach is a convenient plug-in that seamlessly integrates with the existing literature on batched algorithms, where designs within batches are observed following the solution of a Traveling Salesman Problem. The proposed method provides a theoretical guarantee of convergence in terms of movement costs for BO. Empirically, our method effectively reduces average movement costs over time while maintaining comparable regret performance to conventional BO methods. This framework also shows promise for broader applications in various bandit settings with movement costs.
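To make the batching step concrete: designs within a batch are visited in the order given by a Traveling Salesman tour, so that movement costs between consecutive evaluations are small. The greedy nearest-neighbor heuristic below is a hedged stand-in for whatever TSP solver the framework actually plugs in; the batch coordinates are hypothetical.

```python
def nearest_neighbor_tour(points, start=0):
    """Greedy TSP heuristic: from the current design, always move to the
    nearest unvisited one. Returns an ordering of the batch indices."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    unvisited = set(range(len(points))) - {start}
    tour = [start]
    while unvisited:
        last = points[tour[-1]]
        nxt = min(unvisited, key=lambda i: dist(points[i], last))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

# Two clusters of batch designs: visiting within a cluster first keeps
# movement cost low.
batch = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.1), (5.1, 4.9)]
order = nearest_neighbor_tour(batch)  # -> [0, 2, 1, 3]
```

Any improvement over a random visiting order directly reduces the cumulative movement cost that the paper's regret analysis accounts for.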

[LG-144] An Integrated Deep Learning Model for Skin Cancer Detection Using Hybrid Feature Fusion Technique

链接: https://arxiv.org/abs/2410.14489
作者: Maksuda Akter,Rabea Khatun,Md. Alamin Talukder,Md. Manowarul Islam,Md. Ashraf Uddin
关键词-EN: potentially fatal disease, fatal disease caused, DNA damage, caused by DNA, potentially fatal
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Skin cancer is a serious and potentially fatal disease caused by DNA damage. Early detection significantly increases survival rates, making accurate diagnosis crucial. In this groundbreaking study, we present a hybrid framework based on Deep Learning (DL) that achieves precise classification of benign and malignant skin lesions. Our approach begins with dataset preprocessing to enhance classification accuracy, followed by training two separate pre-trained DL models, InceptionV3 and DenseNet121. By fusing the results of each model using the weighted sum rule, our system achieves exceptional accuracy rates. Specifically, we achieve a 92.27% detection accuracy rate, 92.33% sensitivity, 92.22% specificity, 90.81% precision, and 91.57% F1-score, outperforming existing models and demonstrating the robustness and trustworthiness of our hybrid approach. Our study represents a significant advance in skin cancer diagnosis and provides a promising foundation for further research in the field. With the potential to save countless lives through earlier detection, our hybrid deep-learning approach is a game-changer in the fight against skin cancer.
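The weighted sum rule used to fuse the two backbones is straightforward to sketch. The fusion weights (0.6 / 0.4) and the example softmax outputs below are hypothetical, since the paper's exact values are not given here.

```python
def weighted_sum_fusion(p_a, p_b, w_a=0.5, w_b=0.5):
    """Weighted sum rule: combine two models' class-probability vectors
    element-wise; weights must sum to 1 so the result stays a distribution."""
    assert abs(w_a + w_b - 1.0) < 1e-9
    return [w_a * a + w_b * b for a, b in zip(p_a, p_b)]

# Hypothetical softmax outputs over (benign, malignant) for one lesion:
p_inception = [0.30, 0.70]   # e.g. from an InceptionV3-style branch
p_densenet = [0.55, 0.45]    # e.g. from a DenseNet121-style branch
fused = weighted_sum_fusion(p_inception, p_densenet, 0.6, 0.4)
label = "malignant" if fused[1] > fused[0] else "benign"
```

Here the fused vector is [0.40, 0.60], so the ensemble resolves the branches' disagreement in favor of the malignant class.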

[LG-145] Spectral Representations for Accurate Causal Uncertainty Quantification with Gaussian Processes

链接: https://arxiv.org/abs/2410.14483
作者: Hugh Dance,Peter Orbanz,Arthur Gretton
关键词-EN: robust decision making, Accurate uncertainty quantification, Accurate uncertainty, complex systems, non-parametric settings
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Accurate uncertainty quantification for causal effects is essential for robust decision making in complex systems, but remains challenging in non-parametric settings. One promising framework represents conditional distributions in a reproducing kernel Hilbert space and places Gaussian process priors on them to infer posteriors on causal effects, but requires restrictive nuclear dominant kernels and approximations that lead to unreliable uncertainty estimates. In this work, we introduce a method, IMPspec, that addresses these limitations via a spectral representation of the Hilbert space. We show that posteriors in this model can be obtained explicitly, by extending a result in Hilbert space regression theory. We also learn the spectral representation to optimise posterior calibration. Our method achieves state-of-the-art performance in uncertainty quantification and causal Bayesian optimisation across simulations and a healthcare application.

[LG-146] Flow-based Sampling for Entanglement Entropy and the Machine Learning of Defects

链接: https://arxiv.org/abs/2410.14466
作者: Andrea Bulgarelli,Elia Cellini,Karl Jansen,Stefan Kühn,Alessandro Nada,Shinichi Nakajima,Kim A. Nicoli,Marco Panero
关键词-EN: numerically calculate Rényi, calculate Rényi entanglement, Rényi entanglement entropies, calculate Rényi, Rényi entanglement
类目: Quantum Physics (quant-ph); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); High Energy Physics - Lattice (hep-lat)
*备注: 10 pages, 9 figures

点击查看摘要

Abstract:We introduce a novel technique to numerically calculate Rényi entanglement entropies in lattice quantum field theory using generative models. We describe how flow-based approaches can be combined with the replica trick using a custom neural-network architecture around a lattice defect connecting two replicas. Numerical tests for the \phi^4 scalar field theory in two and three dimensions demonstrate that our technique outperforms state-of-the-art Monte Carlo calculations, and exhibit a promising scaling with the defect size.

[LG-147] Learning to refine domain knowledge for biological network inference

链接: https://arxiv.org/abs/2410.14436
作者: Peiwen Li,Menghua Wu
关键词-EN: pose significant challenges, Perturbation experiments, discover causal relationships, causal structure learning, structure learning algorithms
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Perturbation experiments allow biologists to discover causal relationships between variables of interest, but the sparsity and high dimensionality of these data pose significant challenges for causal structure learning algorithms. Biological knowledge graphs can bootstrap the inference of causal structures in these situations, but since they compile vastly diverse information, they can bias predictions towards well-studied systems. Alternatively, amortized causal structure learning algorithms encode inductive biases through data simulation and train supervised models to recapitulate these synthetic graphs. However, realistically simulating biology is arguably even harder than understanding a specific system. In this work, we take inspiration from both strategies and propose an amortized algorithm for refining domain knowledge, based on data observations. On real and synthetic datasets, we show that our approach outperforms baselines in recovering ground truth causal graphs and identifying errors in the prior knowledge with limited interventional data.

[LG-148] A Bioinformatic Approach Validated Utilizing Machine Learning Algorithms to Identify Relevant Biomarkers and Crucial Pathways in Gallbladder Cancer

链接: https://arxiv.org/abs/2410.14433
作者: Rabea Khatun,Wahia Tasnim,Maksuda Akter,Md Manowarul Islam,Md. Ashraf Uddin,Md. Zulfiker Mahmud,Saurav Chandra Das
关键词-EN: biliary tract neoplasms, Gallbladder cancer, GBC, tract neoplasms, disease among biliary
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Gallbladder cancer (GBC) is the most frequent cause of disease among biliary tract neoplasms. Identifying the molecular mechanisms and biomarkers linked to GBC progression has been a significant challenge in scientific research. Few recent studies have explored the roles of biomarkers in GBC. Our study aimed to identify biomarkers in GBC using machine learning (ML) and bioinformatics techniques. We compared GBC tumor samples with normal samples to identify differentially expressed genes (DEGs) from two microarray datasets (GSE100363, GSE139682) obtained from the NCBI GEO database. A total of 146 DEGs were found, with 39 up-regulated and 107 down-regulated genes. Functional enrichment analysis of these DEGs was performed using Gene Ontology (GO) terms and REACTOME pathways through DAVID. The protein-protein interaction network was constructed using the STRING database. To identify hub genes, we applied three ranking algorithms: Degree, MNC, and Closeness Centrality. The intersection of hub genes from these algorithms yielded 11 hub genes. Simultaneously, two feature selection methods (Pearson correlation and recursive feature elimination) were used to identify significant gene subsets. We then developed ML models using SVM and RF on the GSE100363 dataset, with validation on GSE139682, to determine the gene subset that best distinguishes GBC samples. The hub genes outperformed the other gene subsets. Finally, NTRK2, COL14A1, SCN4B, ATP1A2, SLC17A7, SLIT3, COL7A1, CLDN4, CLEC3B, ADCYAP1R1, and MFAP4 were identified as crucial genes, with SLIT3, COL7A1, and CLDN4 being strongly linked to GBC development and prediction.
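Two of the three ranking algorithms used for hub-gene selection (Degree and Closeness Centrality) and their intersection can be sketched on a toy protein-protein interaction graph. The graph, the top-k cutoff, and the tie-breaking below are illustrative assumptions, not the study's STRING-derived network.

```python
from collections import deque

def degree_rank(adj, k):
    """Top-k nodes by degree (number of interaction partners)."""
    return set(sorted(adj, key=lambda v: len(adj[v]), reverse=True)[:k])

def closeness_rank(adj, k):
    """Top-k nodes by closeness centrality, using BFS shortest paths."""
    def closeness(v):
        dist = {v: 0}
        q = deque([v])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        return (len(dist) - 1) / sum(dist.values()) if len(dist) > 1 else 0.0
    return set(sorted(adj, key=closeness, reverse=True)[:k])

# Toy PPI network; node "A" interacts with everyone and is the clear hub.
adj = {"A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B"}, "D": {"A"}}
hubs = degree_rank(adj, 2) & closeness_rank(adj, 2)
```

Intersecting the top-ranked sets from several centrality measures, as the study does with Degree, MNC, and Closeness, keeps only genes that are central under every criterion.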

[LG-149] Integrating Deep Learning with Fundus and Optical Coherence Tomography for Cardiovascular Disease Prediction

链接: https://arxiv.org/abs/2410.14423
作者: Cynthia Maldonado-Garcia,Arezoo Zakeri,Alejandro F Frangi,Nishant Ravikumar
关键词-EN: reducing healthcare burden, reducing healthcare, healthcare burden, quality of life, CVD
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 15155))

点击查看摘要

Abstract:Early identification of patients at risk of cardiovascular diseases (CVD) is crucial for effective preventive care, reducing healthcare burden, and improving patients’ quality of life. This study demonstrates the potential of retinal optical coherence tomography (OCT) imaging combined with fundus photographs for identifying future adverse cardiac events. We used data from 977 patients who experienced CVD within a 5-year interval post-image acquisition, alongside 1,877 control participants without CVD, totaling 2,854 subjects. We propose a novel binary classification network based on a Multi-channel Variational Autoencoder (MCVAE), which learns a latent embedding of patients’ fundus and OCT images to classify individuals into two groups: those likely to develop CVD in the future and those who are not. Our model, trained on both imaging modalities, achieved promising results (AUROC 0.78 +/- 0.02, accuracy 0.68 +/- 0.002, precision 0.74 +/- 0.02, sensitivity 0.73 +/- 0.02, and specificity 0.68 +/- 0.01), demonstrating its efficacy in identifying patients at risk of future CVD events based on their retinal images. This study highlights the potential of retinal OCT imaging and fundus photographs as cost-effective, non-invasive alternatives for predicting cardiovascular disease risk. The widespread availability of these imaging techniques in optometry practices and hospitals further enhances their potential for large-scale CVD risk screening. Our findings contribute to the development of standardized, accessible methods for early CVD risk identification, potentially improving preventive care strategies and patient outcomes.

[LG-150] Asymptotic non-linear shrinkage formulas for weighted sample covariance

链接: https://arxiv.org/abs/2410.14420
作者: Benoit Oriol
关键词-EN: Ledoit and Péché, precision matrix estimators, spirit of Ledoit, precision matrix, matrix estimators
类目: Statistics Theory (math.ST); Machine Learning (cs.LG); Probability (math.PR); Applications (stat.AP); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We compute asymptotic non-linear shrinkage formulas for covariance and precision matrix estimators for weighted sample covariances, in the spirit of Ledoit and Péché. We detail explicitly the formulas for exponentially-weighted sample covariances. These new tools pave the way for applying non-linear shrinkage methods to weighted sample covariances. We show experimentally the performance of the asymptotic shrinkage formulas. Finally, we test the robustness of the theory to heavy-tailed distributions.
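For readers unfamiliar with the object being analyzed, the exponentially-weighted sample covariance can be sketched in a few lines of plain Python. The decay rate `alpha` and the normalization below are illustrative choices; the paper's non-linear shrinkage formulas themselves are not reproduced here.

```python
def exp_weighted_covariance(samples, alpha=0.1):
    """Exponentially-weighted sample covariance of d-dimensional samples.

    The i-th oldest sample gets weight proportional to (1 - alpha)^(age),
    so the most recent sample (last in the list) weighs the most;
    weights are normalised to sum to one. With alpha = 0 all weights are
    equal and the usual sample covariance is recovered.
    """
    n = len(samples)
    d = len(samples[0])
    raw = [(1 - alpha) ** (n - 1 - i) for i in range(n)]
    total = sum(raw)
    w = [r / total for r in raw]
    # weighted mean per coordinate
    mean = [sum(w[i] * samples[i][k] for i in range(n)) for k in range(d)]
    # weighted covariance matrix
    cov = [[sum(w[i] * (samples[i][a] - mean[a]) * (samples[i][b] - mean[b])
                for i in range(n))
            for b in range(d)] for a in range(d)]
    return cov
```

Shrinkage methods would then replace the eigenvalues of this matrix while keeping its eigenvectors.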

[LG-151] WeSpeR: Population spectrum retrieval and spectral density estimation of weighted sample covariance

链接: https://arxiv.org/abs/2410.14413
作者: Benoit Oriol
关键词-EN: weighted sample covariance, sample covariance shows, weighted sample, sample covariance, random behavior
类目: Statistics Theory (math.ST); Machine Learning (cs.LG); Probability (math.PR); Computation (stat.CO); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The spectrum of the weighted sample covariance shows an asymptotic non-random behavior when the dimension grows with the number of samples. In this setting, we prove that the asymptotic spectral distribution F of the weighted sample covariance has a continuous density on \mathbb{R}^{*}. We then address the practical problem of numerically finding this density. We propose a procedure to compute it, to determine the support of F, and to define an efficient grid on it. We use this procedure to design the \textit{WeSpeR} algorithm, which estimates the spectral density and retrieves the true spectral covariance spectrum. Empirical tests confirm the good properties of the \textit{WeSpeR} algorithm.

[LG-152] Investigating the Capabilities of Deep Learning for Processing and Interpreting One-Shot Multi-offset GPR Data: A Numerical Case Study for Lunar and Martian Environments

链接: https://arxiv.org/abs/2410.14386
作者: Iraklis Giannakis,Craig Warren,Antonios Giannopoulos,Georgios Leontidis,Yan Su,Feng Zhou,Javier Martin-Torres,Nectaria Diamanti
关键词-EN: mature geophysical method, gained increasing popularity, Ground-penetrating radar, past decade, mature geophysical
类目: Geophysics (physics.geo-ph); Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Ground-penetrating radar (GPR) is a mature geophysical method that has gained increasing popularity in planetary science over the past decade. GPR has been utilised in both Lunar and Martian missions, providing pivotal information regarding the near-surface geology of Terrestrial planets. Within that context, numerous processing pipelines have been suggested to address the unique challenges present in planetary setups. These processing pipelines often require manual tuning, resulting in ambiguous outputs open to non-unique interpretations. These pitfalls, combined with the large volume of planetary GPR data (kilometers in magnitude), highlight the necessity for automatic, objective and advanced processing and interpretation schemes. The current paper investigates the potential of deep learning for interpreting and processing GPR data. The one-shot multi-offset configuration is investigated via a coherent numerical case study, showcasing the potential of deep learning for A) reconstructing the dielectric distribution of the near surface of Terrestrial planets, and B) filling in missing or bad-quality traces. Special care was taken for the numerical data to be both realistic and challenging. Moreover, the generated synthetic data are properly labelled and made publicly available for training future data-driven pipelines and contributing towards developing pre-trained foundation models for GPR.

[LG-153] Optimizing importance weighting in the presence of sub-population shifts

链接: https://arxiv.org/abs/2410.14315
作者: Floris Holstege,Bram Wouters,Noud van Giersbergen,Cees Diks
关键词-EN: machine learning models, severely harm performance, severely harm, machine learning, test data
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Preprint. Currently under review

点击查看摘要

Abstract:A distribution shift between the training and test data can severely harm performance of machine learning models. Importance weighting addresses this issue by assigning different weights to data points during training. We argue that existing heuristics for determining the weights are suboptimal, as they neglect the increase of the variance of the estimated model due to the finite sample size of the training data. We interpret the optimal weights in terms of a bias-variance trade-off, and propose a bi-level optimization procedure in which the weights and model parameters are optimized simultaneously. We apply this optimization to existing importance weighting techniques for last-layer retraining of deep neural networks in the presence of sub-population shifts and show empirically that optimizing weights significantly improves generalization performance.
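A minimal sketch of the importance-weighted empirical risk the paper starts from, together with the classical inverse-frequency heuristic it argues is suboptimal. The group labels and weighting scheme below are hypothetical illustrations; the proposed bi-level optimization of the weights is not reproduced.

```python
def weighted_risk(losses, groups, group_weights):
    """Importance-weighted empirical risk.

    losses: per-example losses; groups: sub-population id per example;
    group_weights: weight applied to each sub-population.
    """
    total = sum(group_weights[g] for g in groups)
    return sum(group_weights[g] * l for l, g in zip(losses, groups)) / total


def inverse_frequency_weights(groups):
    """Classical heuristic: weight each sub-population by 1 / its count,
    so minority groups contribute as much as majority groups."""
    counts = {}
    for g in groups:
        counts[g] = counts.get(g, 0) + 1
    return {g: 1.0 / c for g, c in counts.items()}
```

The paper's point is that such fixed heuristics ignore the variance inflation from finite samples, which is why it optimizes the weights jointly with the model instead.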

[LG-154] Comparative Evaluation of Clustered Federated Learning Method

链接: https://arxiv.org/abs/2410.14212
作者: Michael Ben Ali(IRIT),Omar El-Rifai(IRIT),Imen Megdiche(IRIT, IRIT-SIG, INUC),André Peninou(IRIT, IRIT-SIG, UT2J),Olivier Teste(IRIT-SIG, IRIT, UT2J, UT)
关键词-EN: preserves data privacy, Federated Learning, Clustered Federated Learning, recent years, promising methods
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Over recent years, Federated Learning (FL) has proven to be one of the most promising methods of distributed learning that preserves data privacy. As the method evolved and was confronted with various real-world scenarios, new challenges have emerged. One such challenge is the presence of highly heterogeneous (often referred to as non-IID) data distributions among participants of the FL protocol. A popular solution to this hurdle is Clustered Federated Learning (CFL), which aims to partition clients into groups where the distributions are homogeneous. In the literature, state-of-the-art CFL algorithms are often tested using only a few cases of data heterogeneity, without systematically justifying the choices. Further, the taxonomy used for differentiating the different heterogeneity scenarios is not always straightforward. In this paper, we explore the performance of two state-of-the-art CFL algorithms with respect to a proposed taxonomy of data heterogeneities in federated learning (FL). We work with three image classification datasets and analyze the resulting clusters against the heterogeneity classes using extrinsic clustering metrics. Our objective is to provide a clearer understanding of the relationship between CFL performance and data heterogeneity scenarios.

[LG-155] Provable In-context Learning for Mixture of Linear Regressions using Transformers

链接: https://arxiv.org/abs/2410.14183
作者: Yanhao Jin,Krishnakumar Balasubramanian,Lifeng Lai
关键词-EN: in-context learning capabilities, linear regression models, high SNR regime, learning capabilities, theoretically investigate
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We theoretically investigate the in-context learning capabilities of transformers in the context of learning mixtures of linear regression models. For the case of two mixtures, we demonstrate the existence of transformers that can achieve an accuracy, relative to the oracle predictor, of order \tilde{\mathcal{O}}((d/n)^{1/4}) in the low signal-to-noise ratio (SNR) regime and \tilde{\mathcal{O}}(\sqrt{d/n}) in the high SNR regime, where n is the length of the prompt, and d is the dimension of the problem. Additionally, we derive in-context excess risk bounds of order \mathcal{O}(L/\sqrt{B}), where B denotes the number of (training) prompts, and L represents the number of attention layers. The order of L depends on whether the SNR is low or high. In the high SNR regime, we extend the results to K-component mixture models for finite K. Extensive simulations also highlight the advantages of transformers for this task, outperforming other baselines such as the Expectation-Maximization algorithm.
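The Expectation-Maximization baseline mentioned above can be sketched for the simplest case: two one-dimensional linear regressions with a known noise scale. The initialization and `sigma` below are illustrative assumptions, not the paper's setup.

```python
import math

def em_mixture_linreg(xs, ys, iters=30, sigma=1.0):
    """EM for a two-component mixture of linear regressions y = w_k * x + noise,
    with a known Gaussian noise scale sigma. Returns the two slopes, sorted."""
    w = [0.5, -0.5]   # initial slopes (symmetry-breaking guess)
    pi = [0.5, 0.5]   # mixing proportions
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x, y in zip(xs, ys):
            lik = [pi[k] * math.exp(-((y - w[k] * x) ** 2) / (2 * sigma ** 2))
                   for k in range(2)]
            z = sum(lik) or 1e-300
            resp.append([l / z for l in lik])
        # M-step: responsibility-weighted least-squares slope per component
        for k in range(2):
            num = sum(r[k] * x * y for r, x, y in zip(resp, xs, ys))
            den = sum(r[k] * x * x for r, x in zip(resp, xs)) or 1e-300
            w[k] = num / den
            pi[k] = sum(r[k] for r in resp) / len(xs)
    return sorted(w)
```

On well-separated data the responsibilities quickly harden and the slopes converge to the two true regression coefficients.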

[LG-156] Estimating the Causal Effects of T Cell Receptors

链接: https://arxiv.org/abs/2410.14127
作者: Eli N. Weinstein,Elizabeth B. Wood,David M. Blei
关键词-EN: cells impacts disease, impacts disease, central question, question in human, human immunology
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注:

点击查看摘要

Abstract:A central question in human immunology is how a patient’s repertoire of T cells impacts disease. Here, we introduce a method to infer the causal effects of T cell receptor (TCR) sequences on patient outcomes using observational TCR repertoire sequencing data and clinical outcomes data. Our approach corrects for unobserved confounders, such as a patient’s environment and life history, by using the patient’s immature, pre-selection TCR repertoire. The pre-selection repertoire can be estimated from nonproductive TCR data, which is widely available. It is generated by a randomized mutational process, V(D)J recombination, which provides a natural experiment. We show formally how to use the pre-selection repertoire to draw causal inferences, and develop a scalable neural-network estimator for our identification formula. Our method produces an estimate of the effect of interventions that add a specific TCR sequence to patient repertoires. As a demonstration, we use it to analyze the effects of TCRs on COVID-19 severity, uncovering potentially therapeutic TCRs that are (1) observed in patients, (2) bind SARS-CoV-2 antigens in vitro and (3) have strong positive effects on clinical outcomes.

[LG-157] A Statistical Machine Learning Approach for Adapting Reduced-Order Models using Projected Gaussian Process

链接: https://arxiv.org/abs/2410.14090
作者: Xiao Liu,Xinchao Liu
关键词-EN: Proper Orthogonal Decomposition, Grassmann Manifold, adapting POD basis, POD basis, Orthogonal Decomposition
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Dynamical Systems (math.DS); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:The Proper Orthogonal Decomposition (POD) computes the optimal basis modes that span a low-dimensional subspace where the Reduced-Order Models (ROMs) reside. Because a governing equation is often parameterized by a set of parameters, challenges immediately arise when one would like to investigate how systems behave differently over the parameter space (in design, control, uncertainty quantification and real-time operations). In this case, the POD basis needs to be updated so as to adapt the ROM such that it accurately captures the variation of a system's behavior over its parameter space. This paper proposes a Projected Gaussian Process (pGP) and formulates the problem of adapting the POD basis as a supervised statistical learning problem, in which the goal is to learn a mapping from the parameter space to the Grassmann Manifold that contains the optimal vector subspaces. A mapping is first found between the Euclidean space and the horizontal space of an orthogonal matrix that spans a reference subspace in the Grassmann Manifold. Then, a second mapping from the horizontal space to the Grassmann Manifold is established through the Exponential/Logarithm maps between the manifold and its tangent space. Finally, given a new parameter, the conditional distribution of a vector can be found in the Euclidean space using Gaussian Process (GP) regression, and this distribution is projected to the Grassmann Manifold, yielding the optimal subspace for the new parameter. The proposed statistical learning approach allows us to optimally estimate model parameters given data (i.e., the prediction/interpolation becomes problem-specific) and to quantify the uncertainty associated with the prediction. Numerical examples are presented to demonstrate the advantages of the proposed pGP for adapting the POD basis against parameter changes.
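The starting point, computing a POD basis from snapshots, can be illustrated with a plain power iteration that extracts the leading mode of the snapshot covariance. This is only background for the abstract; the Grassmann-manifold mappings and the pGP regression of the paper are not reproduced here.

```python
import math

def dominant_pod_mode(snapshots, iters=100):
    """Leading POD basis vector via power iteration on the snapshot
    covariance C = (1/n) * sum_x x x^T, without forming C explicitly."""
    d = len(snapshots[0])
    v = [1.0] * d
    for _ in range(iters):
        # w = C v: accumulate (x . v) * x over all snapshots
        w = [0.0] * d
        for x in snapshots:
            proj = sum(xi * vi for xi, vi in zip(x, v))
            for i in range(d):
                w[i] += proj * x[i] / len(snapshots)
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    return v
```

Subsequent POD modes would be obtained the same way after deflating the dominant direction; the paper's contribution is how this basis varies smoothly with the governing parameters.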

[LG-158] Gradual Domain Adaptation via Manifold-Constrained Distributionally Robust Optimization NEURIPS

链接: https://arxiv.org/abs/2410.14061
作者: Amir Hossein Saberi,Amir Najafi,Ala Emrani,Amin Behjati,Yasaman Zolfimoselo,Mahdi Shadrooy,Abolfazl Motahari,Babak H. Khalaj
关键词-EN: gradual domain adaptation, manifold-constrained data distributions, address the challenge, domain adaptation, class of manifold-constrained
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Published at Proceedings of Neural Information Processing Systems (NeurIPS) 2024

点击查看摘要

Abstract:The aim of this paper is to address the challenge of gradual domain adaptation within a class of manifold-constrained data distributions. In particular, we consider a sequence of T \ge 2 data distributions P_1, \ldots, P_T undergoing a gradual shift, where each pair of consecutive measures P_i, P_{i+1} are close to each other in Wasserstein distance. We have a supervised dataset of size n sampled from P_0, while for the subsequent distributions in the sequence, only unlabeled i.i.d. samples are available. Moreover, we assume that all distributions exhibit a known favorable attribute, such as (but not limited to) having intra-class soft/hard margins. In this context, we propose a methodology rooted in Distributionally Robust Optimization (DRO) with an adaptive Wasserstein radius. We theoretically show that this method guarantees that the classification error across all P_i's can be suitably bounded. Our bounds rely on a newly introduced \textit{compatibility measure}, which fully characterizes the error propagation dynamics along the sequence. Specifically, for inadequately constrained distributions, the error can escalate exponentially as we progress through the gradual shifts. Conversely, for appropriately constrained distributions, the error can be shown to be linear or even entirely eradicated. We have substantiated our theoretical findings through several experimental results.

[LG-159] Feedback Schrödinger Bridge Matching

链接: https://arxiv.org/abs/2410.14055
作者: Panagiotis Theodoropoulos,Nikolaos Komianos,Vincent Pacelli,Guan-Horng Liu,Evangelos A. Theodorou
关键词-EN: Recent advancements, advancements in diffusion, heavily relied, face a trade-off, Schrödinger Bridge Matching
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advancements in diffusion bridges for distribution transport problems have heavily relied on matching frameworks, yet existing methods often face a trade-off between scalability and access to optimal pairings during training. Fully unsupervised methods make minimal assumptions but incur high computational costs, limiting their practicality. On the other hand, imposing full supervision of the matching process with optimal pairings improves scalability; however, it can be infeasible in many applications. To strike a balance between scalability and minimal supervision, we introduce Feedback Schrödinger Bridge Matching (FSBM), a novel semi-supervised matching framework that incorporates a small portion (less than 8% of the entire dataset) of pre-aligned pairs as state feedback to guide the transport map of non-coupled samples, thereby significantly improving efficiency. This is achieved by formulating a static Entropic Optimal Transport (EOT) problem with an additional term capturing the semi-supervised guidance. The generalized EOT objective is then recast into a dynamic formulation to leverage the scalability of matching frameworks. Extensive experiments demonstrate that FSBM accelerates training and enhances generalization by leveraging coupled-pair guidance, opening new avenues for training matching frameworks with partially aligned datasets.

[LG-160] Tensor Decomposition with Unaligned Observations

链接: https://arxiv.org/abs/2410.14046
作者: Runshi Tang,Tamara Kolda,Anru R. Zhang
关键词-EN: addresses unaligned observations, canonical polyadic, unaligned observations, paper presents, presents a canonical
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Computation (stat.CO); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:This paper presents a canonical polyadic (CP) tensor decomposition that addresses unaligned observations. The mode with unaligned observations is represented using functions in a reproducing kernel Hilbert space (RKHS). We introduce a versatile loss function that effectively accounts for various types of data, including binary, integer-valued, and positive-valued types. Additionally, we propose an optimization algorithm for computing tensor decompositions with unaligned observations, along with a stochastic gradient method to enhance computational efficiency. A sketching algorithm is also introduced to further improve efficiency when using the \ell_2 loss function. To demonstrate the efficacy of our methods, we provide illustrative examples using both synthetic data and an early childhood human microbiome dataset.
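As background, a minimal rank-1 instance of the CP model with the squared (\ell_2) loss, plus the closed-form ALS update for one factor, can be sketched as follows. The RKHS representation of the unaligned mode, the generalized losses, and the sketching algorithm are beyond this illustration.

```python
def cp_rank1_tensor(a, b, c):
    """Rank-1 CP model of a 3-way tensor: X[i][j][k] = a[i] * b[j] * c[k]."""
    return [[[ai * bj * ck for ck in c] for bj in b] for ai in a]


def cp_fit_error(X, a, b, c):
    """Squared reconstruction error, the l2 member of the paper's loss family."""
    err = 0.0
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            for k, ck in enumerate(c):
                err += (X[i][j][k] - ai * bj * ck) ** 2
    return err


def als_update_a(X, b, c):
    """One alternating-least-squares step: with b and c fixed, the optimal
    factor a has the closed form a[i] = sum_jk X[i][j][k] b[j] c[k] / (|b|^2 |c|^2)."""
    denom = sum(bj * bj for bj in b) * sum(ck * ck for ck in c)
    return [sum(X[i][j][k] * b[j] * c[k]
                for j in range(len(b)) for k in range(len(c))) / denom
            for i in range(len(X))]
```

Cycling the analogous update over a, b, and c is the classical CP-ALS algorithm; the paper replaces one discrete mode with functions in an RKHS to handle unaligned observations.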

[LG-161] From Distributional Robustness to Robust Statistics: A Confidence Sets Perspective

链接: https://arxiv.org/abs/2410.14008
作者: Gabriel Chan,Bart Van Parys,Amine Bennouna
关键词-EN: distributionally robust optimization, classical robust statistics, robust optimization, robust statistics, distributionally robust
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We establish a connection between distributionally robust optimization (DRO) and classical robust statistics. We demonstrate that this connection arises naturally in the context of estimation under data corruption, where the goal is to construct "minimal" confidence sets for the unknown data-generating distribution. Specifically, we show that a DRO ambiguity set, based on the Kullback-Leibler divergence and total variation distance, is uniformly minimal, meaning it represents the smallest confidence set that contains the unknown distribution at a given confidence level. Moreover, we prove that when parametric assumptions are imposed on the unknown distribution, the ambiguity set is never larger than a confidence set based on the optimal estimator proposed by Huber. This insight reveals that the commonly observed conservatism of DRO formulations is not intrinsic to these formulations themselves but rather stems from the non-parametric framework in which they are employed.

[LG-162] Generalization for Least Squares Regression With Simple Spiked Covariances

链接: https://arxiv.org/abs/2410.13991
作者: Jiping Li,Rishi Sonthalia
关键词-EN: Random matrix theory, Random matrix, theory has proven, valuable tool, tool in analyzing
类目: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Random matrix theory has proven to be a valuable tool in analyzing the generalization of linear models. However, the generalization properties of even two-layer neural networks trained by gradient descent remain poorly understood. To understand the generalization performance of such networks, it is crucial to characterize the spectrum of the feature matrix at the hidden layer. Recent work has made progress in this direction by describing the spectrum after a single gradient step, revealing a spiked covariance structure. Yet, the generalization error for linear models with spiked covariances has not been previously determined. This paper addresses this gap by examining two simple models exhibiting spiked covariances. We derive their generalization error in the asymptotic proportional regime. Our analysis demonstrates that the eigenvector and eigenvalue corresponding to the spike significantly influence the generalization error.

[LG-163] Recurrent Neural Goodness-of-Fit Test for Time Series

链接: https://arxiv.org/abs/2410.13986
作者: Aoran Zhang,Wenbin Zhou,Liyan Xie,Shixiang Zhu
关键词-EN: advanced modeling techniques, Time series, finance and healthcare, modeling techniques, crucial across diverse
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 27 pages, 4 figures

点击查看摘要

Abstract:Time series data are crucial across diverse domains such as finance and healthcare, where accurate forecasting and decision-making rely on advanced modeling techniques. While generative models have shown great promise in capturing the intricate dynamics inherent in time series, evaluating their performance remains a major challenge. Traditional evaluation metrics fall short due to the temporal dependencies and potential high dimensionality of the features. In this paper, we propose the REcurrent NeurAL (RENAL) Goodness-of-Fit test, a novel and statistically rigorous framework for evaluating generative time series models. By leveraging recurrent neural networks, we transform the time series into conditionally independent data pairs, enabling the application of a chi-square-based goodness-of-fit test to the temporal dependencies within the data. This approach offers a robust, theoretically grounded solution for assessing the quality of generative models, particularly in settings with limited time sequences. We demonstrate the efficacy of our method across both synthetic and real-world datasets, outperforming existing methods in terms of reliability and accuracy. Our method fills a critical gap in the evaluation of time series generative models, offering a tool that is both practical and adaptable to high-stakes applications.
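The final step of the proposed pipeline, a chi-square goodness-of-fit test on binned data, can be sketched as follows. The recurrent transformation that produces the conditionally independent pairs is not reproduced, and the bin edges below are illustrative.

```python
def bin_counts(values, edges):
    """Histogram counts of values into the len(edges) - 1 bins defined by edges."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts


def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic: sum over bins of (O - E)^2 / E.
    Large values indicate the observed counts deviate from the model."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

In the RENAL framework, `observed` would come from binned transformed samples of the generative model and `expected` from the reference distribution, with the statistic compared against a chi-square quantile.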

[LG-164] MACK: Mismodeling Addressed with Contrastive Knowledge

链接: https://arxiv.org/abs/2410.13947
作者: Liam Rankin Sheldon,Dylan Sheldon Rankin,Philip Harris
关键词-EN: physics typically relies, energy physics typically, machine learning methods, typically relies, volumes of precise
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注: 13 pages, 4 figures, Submission to SciPost

点击查看摘要

Abstract:The use of machine learning methods in high energy physics typically relies on large volumes of precise simulation for training. As machine learning models become more complex they can become increasingly sensitive to differences between this simulation and the real data collected by experiments. We present a generic methodology based on contrastive learning which is able to greatly mitigate this negative effect. Crucially, the method does not require prior knowledge of the specifics of the mismodeling. While we demonstrate the efficacy of this technique using the task of jet-tagging at the Large Hadron Collider, it is applicable to a wide array of different tasks both in and out of the field of high energy physics.

[LG-165] On the Robustness of Machine Learning Models in Predicting Thermodynamic Properties: a Case of Searching for New Quasicrystal Approximants

链接: https://arxiv.org/abs/2410.13873
作者: Fedor S. Avilov,Roman A. Eremin,Semen A. Budennyy,Innokentiy S. Humonen
关键词-EN: artificial intelligence-assisted modeling, artificial intelligence-assisted, intelligence-assisted modeling, modeling of disordered, disordered crystals
类目: Computational Physics (physics.comp-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Although artificial intelligence-assisted modeling of disordered crystals is a widely used and well-tried method of new materials design, the issues of its robustness, reliability, and stability remain unresolved and are not even discussed enough. To highlight this, in this work we composed a series of nested datasets of intermetallic approximants of quasicrystals and trained various machine learning models on them. Our qualitative and, more importantly, quantitative assessment of the differences in the predictions clearly shows that different reasonable changes in the training sample can lead to completely different sets of predicted potentially new materials. We also showed the advantage of pre-training and proposed a simple yet effective trick of sequential training to increase stability.

[LG-166] Self-Supervised Pre-Training with Joint-Embedding Predictive Architecture Boosts ECG Classification Performance

链接: https://arxiv.org/abs/2410.13867
作者: Kuba Weimann,Tim O. F. Conrad
关键词-EN: heart arrhythmias requires, Accurate diagnosis, heart arrhythmias, interpretation of electrocardiograms, arrhythmias requires
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate diagnosis of heart arrhythmias requires the interpretation of electrocardiograms (ECG), which capture the electrical activity of the heart. Automating this process through machine learning is challenging due to the need for large annotated datasets, which are difficult and costly to collect. To address this issue, transfer learning is often employed, where models are pre-trained on large datasets and fine-tuned for specific ECG classification tasks with limited labeled data. Self-supervised learning has become a widely adopted pre-training method, enabling models to learn meaningful representations from unlabeled datasets. In this work, we explore the joint-embedding predictive architecture (JEPA) for self-supervised learning from ECG data. Unlike invariance-based methods, JEPA does not rely on hand-crafted data augmentations, and unlike generative methods, it predicts latent features rather than reconstructing input data. We create a large unsupervised pre-training dataset by combining ten public ECG databases, amounting to over one million records. We pre-train Vision Transformers using JEPA on this dataset and fine-tune them on various PTB-XL benchmarks. Our results show that JEPA outperforms existing invariance-based and generative approaches, achieving an AUC of 0.945 on the PTB-XL all statements task. JEPA consistently learns the highest quality representations, as demonstrated in linear evaluations, and proves advantageous for pre-training even in the absence of additional data.

[LG-167] Trans-Bifurcation Prediction of Dynamics in terms of Extreme Learning Machines with Control Inputs

链接: https://arxiv.org/abs/2410.13289
作者: Satoru Tadokoro,Akihiro Yamaguchi,Takao Namiki,Ichiro Tsuda
关键词-EN: additional control inputs, extreme learning machine, control inputs, dynamical systems, extending the extreme
类目: Chaotic Dynamics (nlin.CD); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:By extending the extreme learning machine by additional control inputs, we achieved almost complete reproduction of bifurcation structures of dynamical systems. The learning ability of the proposed neural network system is striking in that the entire structure of the bifurcations of a target one-parameter family of dynamical systems can be nearly reproduced by training on transient dynamics using only a few parameter values. Moreover, we propose a mechanism to explain this remarkable learning ability and discuss the relationship between the present results and similar results obtained by Kim et al.

信息检索

[IR-0] SIMformer: Single-Layer Vanilla Transformer Can Learn Free-Space Trajectory Similarity

链接: https://arxiv.org/abs/2410.14629
作者: Chuang Yang,Renhe Jiang,Xiaohang Xu,Chuan Xiao,Kaoru Sezaki
关键词-EN: Free-space trajectory similarity, quadratic time complexity, incur quadratic time, trajectory similarity calculation, Free-space trajectory
类目: Machine Learning (cs.LG); Databases (cs.DB); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Free-space trajectory similarity measures, e.g., DTW, Hausdorff, and Frechet, often incur quadratic time complexity, so learning-based methods have been proposed to accelerate the computation. The core idea is to train an encoder to transform trajectories into representation vectors and then compute vector similarity to approximate the ground truth. However, existing methods face dual challenges of effectiveness and efficiency: 1) they all utilize Euclidean distance to compute representation similarity, which leads to a severe curse-of-dimensionality issue, reducing the distinguishability among representations and significantly affecting the accuracy of subsequent similarity search tasks; 2) most of them are trained in a triplet manner and often necessitate additional information, which degrades efficiency; 3) previous studies, while emphasizing scalability in terms of efficiency, overlooked the deterioration of effectiveness when the dataset size grows. To cope with these issues, we propose a simple yet accurate, fast, and scalable model that uses only a single-layer vanilla transformer encoder as the feature extractor and employs tailored representation similarity functions to approximate various ground-truth similarity measures. Extensive experiments demonstrate that our model significantly mitigates the curse-of-dimensionality issue and outperforms the state of the art in effectiveness, efficiency, and scalability.
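As a reference point for the quadratic-time ground truth the model approximates, DTW between two free-space trajectories can be computed with the standard dynamic program (Euclidean point distance assumed):

```python
import math

def dtw(traj_a, traj_b):
    """Dynamic time warping distance between two 2-D point trajectories.
    Runs in O(n * m) time, the quadratic cost learned encoders try to avoid."""
    n, m = len(traj_a), len(traj_b)
    INF = float("inf")
    # D[i][j] = DTW cost of aligning the first i points of a with the first j of b
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(traj_a[i - 1], traj_b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]
```

A learned encoder replaces this per-pair dynamic program with one vector comparison per candidate, which is where the speedup comes from.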

[IR-1] Enhancing AI Accessibility in Veterinary Medicine: Linking Classifiers and Electronic Health Records

链接: https://arxiv.org/abs/2410.14625
作者: Chun Yin Kong,Picasso Vasquez,Makan Farhoodimoghadam,Chris Brandt,Titus C. Brown,Krystle L. Reagan,Allison Zwingenberger,Stefan M. Keller
关键词-EN: integrating machine learning, clinical decision-making tools, electronic health records, rapidly evolving landscape, improve diagnostic accuracy
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the rapidly evolving landscape of veterinary healthcare, integrating machine learning (ML) clinical decision-making tools with electronic health records (EHRs) promises to improve diagnostic accuracy and patient care. However, the seamless integration of ML classifiers into existing EHRs in veterinary medicine is frequently hindered by the rigidity of EHR systems or the limited availability of IT resources. To address this shortcoming, we present Anna, a freely-available software solution that provides ML classifier results for EHR laboratory data in real-time.

[IR-2] DiSCo Meets LLMs: A Unified Approach for Sparse Retrieval and Contextual Distillation in Conversational Search

链接: https://arxiv.org/abs/2410.14609
作者: Simon Lupart,Mohammad Aliannejadi,Evangelos Kanoulas
关键词-EN: Conversational Search, conversational context, conversational context modeling, Large Language Models, retrieving relevant documents
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Conversational Search (CS) is the task of retrieving relevant documents from a corpus within a conversational context, combining retrieval with conversational context modeling. With the explosion of Large Language Models (LLMs), the CS field has seen major improvements with LLMs rewriting user queries, accounting for conversational context. However, engaging LLMs at inference time harms efficiency. Current methods address this by distilling embeddings from human-rewritten queries to learn the context modeling task. Yet, these approaches predominantly focus on context modeling, and only treat the contrastive component of the retrieval task within a distillation-independent loss term. To address these limitations, we propose a new distillation method, as a relaxation of the previous objective, unifying retrieval and context modeling. We relax the existing training objectives by distilling similarity scores between conversations and documents, rather than relying solely on representation learning. Our proposed distillation objective allows for more freedom in the representation space and leverages the contrastive nature of document relevance. Through experiments on Learned Sparse Retrieval (LSR) across 5 CS datasets, our approach demonstrates substantial improvements in both in-domain and out-of-domain retrieval performance, outperforming the state of the art with gains of up to 6 points in recall for out-of-domain datasets. Additionally, through the relaxation of the objective, we propose a multi-teacher distillation, using multiple LLMs as teachers, yielding additional gains, and outperforming the teachers themselves in in-domain experiments. Finally, analysis of the sparsity of the models reveals that our distillation allows for better control over the sparsity of the trained models.

[IR-3] RAG-ConfusionQA: A Benchmark for Evaluating LLMs on Confusing Questions

链接: https://arxiv.org/abs/2410.14567
作者: Zhiyuan Peng,Jinming Nian,Alexandre Evfimievski,Yi Fang
关键词-EN: Retrieval Augmented Generation, Retrieval Augmented, provide verifiable document-grounded, verifiable document-grounded responses, user inquiries
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: under review

点击查看摘要

Abstract:Conversational AI agents use Retrieval Augmented Generation (RAG) to provide verifiable document-grounded responses to user inquiries. However, many natural questions do not have good answers: about 25% contain false assumptions (Yu et al., 2023, CREPE), and over 50% are ambiguous (Min et al., 2020, AmbigQA). RAG agents need high-quality data to improve their responses to confusing questions. This paper presents a novel synthetic data generation method to efficiently create a diverse set of context-grounded confusing questions from a given document corpus. We conduct an empirical comparative evaluation of several large language models as RAG agents to measure the accuracy of confusion detection and appropriate response generation. We contribute a benchmark dataset to the public domain.

[IR-4] SPFresh: Incremental In-Place Update for Billion-Scale Vector Search SOSP23

链接: https://arxiv.org/abs/2410.14452
作者: Yuming Xu,Hengyu Liang,Jin Li,Shuotao Xu,Qi Chen,Qianxi Zhang,Cheng Li,Ziyue Yang,Fan Yang,Yuqing Yang,Peng Cheng,Mao Yang
关键词-EN: Approximate Nearest Neighbor, Approximate Nearest, Nearest Neighbor Search, Nearest Neighbor, similar high-dimensional vectors
类目: Information Retrieval (cs.IR)
*备注: SOSP 23

点击查看摘要

Abstract:Approximate Nearest Neighbor Search (ANNS) is now widely used in various applications, ranging from information retrieval, question answering, and recommendation, to search for similar high-dimensional vectors. As the amount of vector data grows continuously, it becomes important to support updates to the vector index, the enabling technique that allows for efficient and accurate ANNS on vectors. Because of the curse of high dimensionality, it is often costly to identify the right neighbors of a single new vector, a necessary process for index update. To amortize update costs, existing systems maintain a secondary index to accumulate updates, which are periodically merged into the main index by globally rebuilding the entire index. However, this approach suffers from high fluctuations in search latency and accuracy, not to mention that rebuilds require substantial resources and are extremely time-consuming. We introduce SPFresh, a system that supports in-place vector updates. At the heart of SPFresh is LIRE, a lightweight incremental rebalancing protocol to split vector partitions and reassign vectors in the nearby partitions to adapt to data distribution shift. LIRE achieves low-overhead vector updates by only reassigning vectors at the boundary between partitions, where in a high-quality vector index the number of such vectors is deemed small. With LIRE, SPFresh provides superior query latency and accuracy to solutions based on global rebuild, with only 1% of DRAM and less than 10% of cores needed at the peak compared to the state-of-the-art, in a billion-scale vector index with a 1% daily vector update rate.
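LIRE 的关键在于:分裂分区后只对"边界附近"的向量做重分配。下面用 numpy 给出"边界向量识别"这一思路的玩具示意(margin 阈值与函数名均为假设,真实系统远比这复杂):

```python
import numpy as np

def boundary_vectors(vectors, centroids, margin=0.1):
    """Return indices of vectors lying near a partition boundary.

    A vector is 'boundary' when the gap between its nearest and
    second-nearest centroid distances (squared) falls below `margin`.
    Only these vectors need re-checking after a partition split,
    which is the intuition behind LIRE's low-overhead reassignment
    (a hypothetical simplification, not the paper's protocol).
    """
    # pairwise squared distances, shape (n_vectors, n_centroids)
    d = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    d_sorted = np.sort(d, axis=1)
    gap = d_sorted[:, 1] - d_sorted[:, 0]
    return np.nonzero(gap < margin)[0]
```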

[IR-5] ChartifyText: Automated Chart Generation from Data-Involved Texts via LLM

链接: https://arxiv.org/abs/2410.14331
作者: Songheng Zhang,Lei Wang,Toby Jia-Jun Li,Qiaomu Shen,Yixin Cao,Yong Wang
关键词-EN: public health, health and journalism, numerical values involved, involved are widely, Text documents
类目: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Text documents with numerical values involved are widely used in various applications such as scientific research, economy, public health and journalism. However, it is difficult for readers to quickly interpret such data-involved texts and gain deep insights. To fill this research gap, this work aims to automatically generate charts to accurately convey the underlying data and ideas to readers, which is essentially a challenging task. The challenges originate from text ambiguities, intrinsic sparsity and uncertainty of data in text documents, and subjective sentiment differences. Specifically, we propose ChartifyText, a novel fully-automated approach that leverages Large Language Models (LLMs) to convert complex data-involved texts to expressive charts. It consists of two major modules: tabular data inference and expressive chart generation. The tabular data inference module employs systematic prompt engineering to guide the LLM (e.g., GPT-4) to infer table data, where data ranges, uncertainties, missing data values and corresponding subjective sentiments are explicitly considered. The expressive chart generation module augments standard charts with intuitive visual encodings and concise texts to accurately convey the underlying data and insights. We extensively evaluate the effectiveness of ChartifyText on real-world data-involved text documents through case studies, in-depth interviews with three visualization experts, and a carefully-designed user study with 15 participants. The results demonstrate the usefulness and effectiveness of ChartifyText in helping readers efficiently and effectively make sense of data-involved texts.

[IR-6] Graph Neural Patching for Cold-Start Recommendations

链接: https://arxiv.org/abs/2410.14241
作者: Hao Chen,Yu Yang,Yuanchen Bei,Zefan Wang,Yue Xu,Feiran Huang
关键词-EN: recommender systems remains, cold start problem, critical challenge, warm users, existing warm users
类目: Information Retrieval (cs.IR)
*备注: 13 pages, accepted by Australasian Database Conference 2024. arXiv admin note: substantial text overlap with arXiv:2209.12215

点击查看摘要

Abstract:The cold start problem in recommender systems remains a critical challenge. Current solutions often train hybrid models on auxiliary data for both cold and warm users/items, potentially degrading the experience for the latter. This drawback limits their viability in practical scenarios where the satisfaction of existing warm users/items is paramount. Although graph neural networks (GNNs) excel at warm recommendations by effective collaborative signal modeling, they haven’t been effectively leveraged for the cold-start issue within a user-item graph, which is largely due to the lack of initial connections for cold user/item entities. Addressing this requires a GNN adept at cold-start recommendations without sacrificing performance for existing ones. To this end, we introduce Graph Neural Patching for Cold-Start Recommendations (GNP), a customized GNN framework with dual functionalities: GWarmer for modeling collaborative signal on existing warm users/items and Patching Networks for simulating and enhancing GWarmer’s performance on cold-start recommendations. Extensive experiments on three benchmark datasets confirm GNP’s superiority in recommending both warm and cold users/items.

[IR-7] Personalized Image Generation with Large Multimodal Models

链接: https://arxiv.org/abs/2410.14170
作者: Yiyan Xu,Wenjie Wang,Yang Zhang,Tang Biao,Peng Yan,Fuli Feng,Xiangnan He
关键词-EN: personalized image generation, personalized image, image generation, Personalized content filtering, Personalized
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Personalized content filtering, such as recommender systems, has become a critical infrastructure to alleviate information overload. However, these systems merely filter existing content and are constrained by its limited diversity, making it difficult to meet users’ varied content needs. To address this limitation, personalized content generation has emerged as a promising direction with broad applications. Nevertheless, most existing research focuses on personalized text generation, with relatively little attention given to personalized image generation. The limited work in personalized image generation faces challenges in accurately capturing users’ visual preferences and needs from noisy user-interacted images and complex multimodal instructions. Worse still, there is a lack of supervised data for training personalized image generation models. To overcome the challenges, we propose a Personalized Image Generation Framework named Pigeon, which adopts exceptional large multimodal models with three dedicated modules to capture users’ visual preferences and needs from noisy user history and multimodal instructions. To alleviate the data scarcity, we introduce a two-stage preference alignment scheme, comprising masked preference reconstruction and pairwise preference alignment, to align Pigeon with the personalized image generation task. We apply Pigeon to personalized sticker and movie poster generation, where extensive quantitative results and human evaluation highlight its superiority over various generative baselines.

[IR-8] Optimizing Retrieval-Augmented Generation with Elasticsearch for Enhanced Question-Answering Systems

链接: https://arxiv.org/abs/2410.14167
作者: Jiajing Chen,Runyuan Bao,Hongye Zheng,Zhen Qi,Jianjun Wei,Jiacheng Hu
关键词-EN: Retrieval Augmented Generation, large-scale language models, Augmented Generation, Stanford Question Answering, Question Answering Dataset
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:This study aims to improve the accuracy and quality of large-scale language models (LLMs) in answering questions by integrating Elasticsearch into the Retrieval Augmented Generation (RAG) framework. The experiment uses the Stanford Question Answering Dataset (SQuAD) version 2.0 as the test dataset and compares the performance of different retrieval methods, including traditional methods based on keyword matching or semantic similarity calculation, BM25-RAG and TF-IDF-RAG, and the newly proposed ES-RAG scheme. The results show that ES-RAG not only has obvious advantages in retrieval efficiency but also performs well in key indicators such as accuracy, which is 0.51 percentage points higher than TF-IDF-RAG. In addition, Elasticsearch’s powerful search capabilities and rich configuration options enable the entire question-answering system to better handle complex queries and provide more flexible and efficient responses based on the diverse needs of users. Future research directions can further explore how to optimize the interaction mechanism between Elasticsearch and LLM, such as introducing higher-level semantic understanding and context-awareness capabilities, to achieve a more intelligent and humanized question-answering experience.
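ES-RAG 与 BM25-RAG 的检索底层都依赖 BM25 排序函数(Elasticsearch 的默认相似度)。下面是 BM25 打分公式的纯 Python 示意,用于说明关键词匹配类基线的原理(k1、b 取常见默认值,并非 Elasticsearch 的实际实现):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`
    with the Okapi BM25 ranking function.
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency of each term
    df = Counter()
    for d in docs:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / norm
        scores.append(s)
    return scores
```

不含查询词的文档得分为 0;词频越高、文档越短,得分越高——这也是 BM25 优于裸 TF-IDF 的长度归一化之处。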

[IR-9] Towards Robust Transcription: Exploring Noise Injection Strategies for Training Data Augmentation

链接: https://arxiv.org/abs/2410.14122
作者: Yonghyun Kim,Alexander Lerch
关键词-EN: Automatic Piano Transcription, remains largely unexplored, significantly improved system, Automatic Piano, improved system performance
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted to the Late-Breaking Demo Session of the 25th International Society for Music Information Retrieval (ISMIR) Conference, 2024

点击查看摘要

Abstract:Recent advancements in Automatic Piano Transcription (APT) have significantly improved system performance, but the impact of noisy environments on the system performance remains largely unexplored. This study investigates the impact of white noise at various Signal-to-Noise Ratio (SNR) levels on state-of-the-art APT models and evaluates the performance of the Onsets and Frames model when trained on noise-augmented data. We hope this research provides valuable insights as preliminary work toward developing transcription models that maintain consistent performance across a range of acoustic conditions.
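摘要中"在不同 SNR 水平下注入白噪声"的做法可以示意如下:按目标信噪比缩放高斯白噪声功率后与信号混合(通用草图,并非论文的具体增广管线):

```python
import numpy as np

def add_white_noise(signal, snr_db, rng=None):
    """Mix white Gaussian noise into `signal` at a target SNR in dB.

    Noise power is scaled so that 10*log10(P_signal / P_noise)
    equals `snr_db` — the standard setup for SNR-controlled
    noise-augmentation experiments.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(p_noise), size=signal.shape)
    return signal + noise
```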

[IR-10] Lightweight Correlation-Aware Table Compression

链接: https://arxiv.org/abs/2410.14066
作者: Mihail Stoian,Alexander van Renen,Jan Kobiolka,Ping-Lin Kuo,Josif Grabocka,Andreas Kipf
关键词-EN: competitive compression ratios, data necessitates efficient, provide high scan, necessitates efficient, managing relational data
类目: Databases (cs.DB); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Third Table Representation Learning Workshop (TRL 2024)

点击查看摘要

Abstract:The growing adoption of data lakes for managing relational data necessitates efficient, open storage formats that provide high scan performance and competitive compression ratios. While existing formats achieve fast scans through lightweight encoding techniques, they have reached a plateau in terms of minimizing storage footprint. Recently, correlation-aware compression schemes have been shown to reduce file sizes further. Yet, current approaches either incur significant scan overheads or require manual specification of correlations, limiting their practicability. We present Virtual, a framework that integrates seamlessly with existing open formats to automatically leverage data correlations, achieving substantial compression gains while having minimal scan performance overhead. Experiments on this http URL datasets show that Virtual reduces file sizes by up to 40% compared to Apache Parquet.
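相关性感知压缩的核心想法:若某列(近似)是另一列的函数,就只存拟合系数加整数残差,而非整列原值。以下是线性相关情形的玩具示意(函数名为假设,并非 Virtual 的实际算法):

```python
import numpy as np

def virtualize_column(a, b):
    """If column b is (nearly) a linear function of column a, keep only
    the fit coefficients plus small integer residuals.

    Returns (slope, intercept, residuals); b reconstructs exactly as
    round(slope * a + intercept) + residuals, so the scheme is lossless.
    """
    slope, intercept = np.polyfit(a, b, 1)
    predicted = np.round(slope * np.asarray(a) + intercept).astype(int)
    residuals = np.asarray(b) - predicted
    return slope, intercept, residuals

def materialize_column(a, slope, intercept, residuals):
    """Reconstruct the original column from its virtual representation."""
    return np.round(slope * np.asarray(a) + intercept).astype(int) + residuals
```

当相关性强时残差接近全零,用轻量编码(如 RLE)即可把该列压到几乎为零的体积,这就是"进一步缩小文件"的来源。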

[IR-11] Best in Tau@LLMJudge: Criteria-Based Relevance Evaluation with Llama3

链接: https://arxiv.org/abs/2410.14044
作者: Naghmeh Farzi,Laura Dietz
关键词-EN: Traditional evaluation, human-annotated relevance labels, systems relies, costly at scale, relies on human-annotated
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Traditional evaluation of information retrieval (IR) systems relies on human-annotated relevance labels, which can be both biased and costly at scale. In this context, large language models (LLMs) offer an alternative by allowing us to directly prompt them to assign relevance labels for passages associated with each query. In this study, we explore alternative methods to directly prompt LLMs for assigned relevance labels, by exploring two hypotheses: Hypothesis 1 assumes that it is helpful to break down “relevance” into specific criteria - exactness, coverage, topicality, and contextual fit. We explore different approaches that prompt large language models (LLMs) to obtain criteria-level grades for all passages, and we consider various ways to aggregate criteria-level grades into a relevance label. Hypothesis 2 assumes that differences in linguistic style between queries and passages may negatively impact the automatic relevance label prediction. We explore whether improvements can be achieved by first synthesizing a summary of the passage in the linguistic style of a query, and then using this summary in place of the passage to assess its relevance. We include an empirical evaluation of our approaches based on data from the LLMJudge challenge run in Summer 2024, where our “Four Prompts” approach obtained the highest scores in Kendall’s tau.
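假设 1 中"先按准则打分、再聚合成相关性标签"的流程可以示意如下(准则名取自摘要,加权平均与取整方式均为假设,论文比较了多种聚合方案):

```python
def aggregate_relevance(grades, weights=None):
    """Fold criteria-level grades (e.g. 0-3 each for exactness, coverage,
    topicality, contextual fit) into a single relevance label on the
    same 0-3 scale via a weighted average.

    The criteria names come from the abstract; the equal default
    weights here are purely illustrative.
    """
    criteria = ["exactness", "coverage", "topicality", "contextual_fit"]
    weights = weights or {c: 1.0 for c in criteria}
    total_w = sum(weights[c] for c in criteria)
    score = sum(weights[c] * grades[c] for c in criteria) / total_w
    return round(score)  # map back onto the discrete relevance scale
```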

[IR-12] Efficient Retrieval of Temporal Event Sequences from Textual Descriptions

链接: https://arxiv.org/abs/2410.14043
作者: Zefang Liu,Yinzhu Quan
关键词-EN: monitoring social media, analyzing e-commerce behavior, social media activities, tracking criminal incidents, e-commerce behavior
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Retrieving temporal event sequences from textual descriptions is essential for applications such as analyzing e-commerce behavior, monitoring social media activities, and tracking criminal incidents. In this paper, we introduce TPP-LLM-Embedding, a unified model for efficiently embedding and retrieving event sequences based on natural language descriptions. Built on the TPP-LLM framework, which integrates large language models with temporal point processes, our model encodes both event types and times, generating a sequence-level representation through pooling. Textual descriptions are embedded using the same architecture, ensuring a shared embedding space for both sequences and descriptions. We optimize a contrastive loss based on similarity between these embeddings, bringing matching pairs closer and separating non-matching ones. TPP-LLM-Embedding enables efficient retrieval and demonstrates superior performance compared to baseline models across diverse datasets.
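摘要中"拉近匹配对、推远非匹配对"的对比损失,可用批内 InfoNCE 来示意:第 i 条事件序列与第 i 条文本描述互为正例(纯 numpy 草图,温度等超参为假设,并非作者实现):

```python
import numpy as np

def info_nce_loss(seq_emb, desc_emb, temperature=0.07):
    """In-batch contrastive (InfoNCE) loss between event-sequence
    embeddings and their matching text-description embeddings,
    where row i of each matrix forms the positive pair.
    """
    # cosine similarity matrix between the two embedding sets
    a = seq_emb / np.linalg.norm(seq_emb, axis=1, keepdims=True)
    b = desc_emb / np.linalg.norm(desc_emb, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    # cross-entropy with the diagonal as the positive class
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

正确配对时损失接近零,配对错位时损失显著升高,从而把序列和描述拉进同一嵌入空间以支持检索。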

[IR-13] FinQAPT: Empowering Financial Decisions with End-to-End LLM-driven Question Answering Pipeline

链接: https://arxiv.org/abs/2410.13959
作者: Kuldeep Singh,Simerjot Kaur,Charese Smiley
关键词-EN: Large Language Models, relevant information embedded, Financial decision-making hinges, leverages Large Language, decision-making hinges
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Accepted in ICAIF 2024, 8 pages, 5 figures, 4 tables

点击查看摘要

Abstract:Financial decision-making hinges on the analysis of relevant information embedded in the enormous volume of documents in the financial domain. To address this challenge, we developed FinQAPT, an end-to-end pipeline that streamlines the identification of relevant financial reports based on a query, extracts pertinent context, and leverages Large Language Models (LLMs) to perform downstream tasks. To evaluate the pipeline, we experimented with various techniques to optimize the performance of each module using the FinQA dataset. We introduced a novel clustering-based negative sampling technique to enhance context extraction and a novel prompting method called Dynamic N-shot Prompting to boost the numerical question-answering capabilities of LLMs. At the module level, we achieved state-of-the-art accuracy on FinQA, attaining an accuracy of 80.6%. However, at the pipeline level, we observed decreased performance due to challenges in extracting relevant context from financial reports. We conducted a detailed error analysis of each module and the end-to-end pipeline, pinpointing specific challenges that must be addressed to develop a robust solution for handling complex financial tasks.

[IR-14] Identifying High Consideration E-Commerce Search Queries EMNLP2024

链接: https://arxiv.org/abs/2410.13951
作者: Zhiyu Chen,Jason Choi,Besnik Fetahu,Shervin Malmasi
关键词-EN: missions typically require, typically require careful, substantial research investment, elaborate decision making, high consideration
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Accepted by EMNLP 2024 (Industry Track)

点击查看摘要

Abstract:In e-commerce, high consideration search missions typically require careful and elaborate decision making, and involve a substantial research investment from customers. We consider the task of identifying High Consideration (HC) queries. Identifying such queries enables e-commerce sites to better serve user needs using targeted experiences such as curated QA widgets that help users reach purchase decisions. We explore the task by proposing an Engagement-based Query Ranking (EQR) approach, focusing on query ranking to indicate potential engagement levels with query-related shopping knowledge content during product search. Unlike previous studies on predicting trends, EQR prioritizes query-level features related to customer behavior, finance, and catalog information rather than popularity signals. We introduce an accurate and scalable method for EQR and present experimental results demonstrating its effectiveness. Offline experiments show strong ranking performance. Human evaluation shows a precision of 96% for HC queries identified by our model. The model was commercially deployed, and shown to outperform human-selected queries in terms of downstream customer impact, as measured through engagement.

[IR-15] P4GCN: Vertical Federated Social Recommendation with Privacy-Preserving Two-Party Graph Convolution Networks

链接: https://arxiv.org/abs/2410.13905
作者: Zheng Wang,Wanwan Wang,Yimin Huang,Zhaopeng Peng,Ziqi Yang,Cheng Wang,Xiaoliang Fan
关键词-EN: social recommendation systems, recent years, commonly utilized, social, graph neural networks
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, graph neural networks (GNNs) have been commonly utilized for social recommendation systems. However, real-world scenarios often present challenges related to user privacy and business constraints, inhibiting direct access to valuable social information from other platforms. While many existing methods have tackled matrix factorization-based social recommendations without direct social data access, developing GNN-based federated social recommendation models under similar conditions remains largely unexplored. To address this issue, we propose a novel vertical federated social recommendation method leveraging privacy-preserving two-party graph convolution networks (P4GCN) to enhance recommendation accuracy without requiring direct access to sensitive social information. First, we introduce a Sandwich-Encryption module to ensure comprehensive data privacy during the collaborative computing process. Second, we provide a thorough theoretical analysis of the privacy guarantees, considering the participation of both curious and honest parties. Extensive experiments on four real-world datasets demonstrate that P4GCN outperforms state-of-the-art methods in terms of recommendation accuracy. The code is available at this https URL.

[IR-16] Classifying Peace in Global Media Using RAG and Intergroup Reciprocity

链接: https://arxiv.org/abs/2410.13865
作者: K. Lian(1),L. S. Liebovitch(1),M. Wild(1),H. West(1),P. T. Coleman(1),F. Chen(2),E. Kimani(2),K. Sieck(2) ((1) Columbia University, (2) Toyota Research Institute)
关键词-EN: Retrieval Augmented Generation, Negative Intergroup Reciprocity, Augmented Generation, Retrieval Augmented, Positive and Negative
类目: Information Retrieval (cs.IR)
*备注: 6 pages, 1 figure

点击查看摘要

Abstract:This paper presents a novel approach to identifying insights of peace in global media using a Retrieval Augmented Generation (RAG) model and concepts of Positive and Negative Intergroup Reciprocity (PIR/NIR). By refining the definitions of PIR and NIR, we offer a more accurate and meaningful analysis of intergroup relations as represented in media articles. Our methodology provides insights into the dynamics that contribute to or detract from peace at a national level.

附件下载

点击下载今日全部论文列表