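This post lists the latest papers retrieved from arXiv.org on 2024-10-14. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: the daily paper data is retrieved from arXiv.org and updated automatically at around 11:00 each morning.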

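Overview (2024-10-14)

476 papers were updated today, including:

  • Natural language processing: 83 papers (Computation and Language (cs.CL))
  • Artificial intelligence: 142 papers (Artificial Intelligence (cs.AI))
  • Computer vision: 96 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine learning: 184 papers (Machine Learning (cs.LG))

Natural Language Processing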

[NLP-0] Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models

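Quick read: The paper addresses the drop in safety alignment that vision-language models (VLMs) show once a vision module is integrated, a phenomenon it calls "safety alignment degradation". The key to the solution is Cross-Modality Representation Manipulation (CMRM), an inference-time representation intervention that aims to recover the safety alignment inherent in the VLM's LLM backbone while preserving the VLM's functional capabilities. Without additional training, the method substantially improves safety on multi-modal input, for example reducing the unsafe rate of LLaVA-7B from 61.53% to 3.15%.

Link: https://arxiv.org/abs/2410.09047
Authors: Qin Liu, Chao Shang, Ling Liu, Nikolaos Pappas, Jie Ma, Neha Anna John, Srikanth Doss, Lluis Marquez, Miguel Ballesteros, Yassine Benajiba
Keywords: vision module compared, safety alignment, Vision-Language Models, safety alignment ability, safety alignment degradation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint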

Abstract:The safety alignment ability of Vision-Language Models (VLMs) is prone to be degraded by the integration of the vision module compared to its LLM backbone. We investigate this phenomenon, dubbed as "safety alignment degradation" in this paper, and show that the challenge arises from the representation gap that emerges when introducing vision modality to VLMs. In particular, we show that the representations of multi-modal inputs shift away from that of text-only inputs which represent the distribution that the LLM backbone is optimized for. At the same time, the safety alignment capabilities, initially developed within the textual embedding space, do not successfully transfer to this new multi-modal representation space. To reduce safety alignment degradation, we introduce Cross-Modality Representation Manipulation (CMRM), an inference time representation intervention method for recovering the safety alignment ability that is inherent in the LLM backbone of VLMs, while simultaneously preserving the functional capabilities of VLMs. The empirical results show that our framework significantly recovers the alignment ability that is inherited from the LLM backbone with minimal impact on the fluency and linguistic capabilities of pre-trained VLMs even without additional training. Specifically, the unsafe rate of LLaVA-7B on multi-modal input can be reduced from 61.53% to as low as 3.15% with only inference-time intervention. WARNING: This paper contains examples of toxic or harmful language.
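
The abstract describes CMRM only as an inference-time intervention that counteracts the shift of multi-modal representations away from text-only ones. The sketch below illustrates that general idea under loud assumptions: the shift is estimated as a difference of mean hidden states and subtracted with a strength `alpha`. The function names, the mean-difference estimator, and `alpha` are all hypothetical and are not taken from the paper.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def estimate_shift(multimodal_states, text_only_states):
    """Assumed estimator: mean multi-modal state minus mean text-only state."""
    mm_mean = mean_vector(multimodal_states)
    text_mean = mean_vector(text_only_states)
    return [a - b for a, b in zip(mm_mean, text_mean)]


def intervene(hidden_state, shift, alpha=1.0):
    """Move a hidden state back toward the text-only region by alpha * shift."""
    return [h - alpha * s for h, s in zip(hidden_state, shift)]


# Toy 2-dimensional hidden states (illustrative numbers only).
text_only = [[0.0, 1.0], [2.0, 1.0]]
multimodal = [[3.0, 2.0], [5.0, 4.0]]

shift = estimate_shift(multimodal, text_only)
corrected = intervene([4.0, 3.0], shift, alpha=1.0)
```

With `alpha=0.0` the hidden state is left unchanged, so `alpha` acts as a knob between the original multi-modal behavior and the fully shifted one.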


[NLP-1] MiRAGeNews: Multimodal Realistic AI-Generated News Detection (EMNLP 2024)

Quick read: The paper targets the spread of AI-generated fake news. The key contribution is the MiRAGeNews dataset of 12,500 high-quality real and AI-generated image-caption pairs. Using it, the authors train a multi-modal detector (MiRAGe) that improves F-1 by 5.1% over state-of-the-art baselines on image-caption pairs from out-of-domain image generators and news publishers. Code and data are released to support future work on detecting AI-generated content.

Link: https://arxiv.org/abs/2410.09045
Authors: Runsheng Huang, Liam Dugan, Yue Yang, Chris Callison-Burch
Keywords: inflammatory or misleading, recent years, proliferation of inflammatory, increasingly common, common in recent
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: EMNLP 2024 Findings

Abstract:The proliferation of inflammatory or misleading “fake” news content has become increasingly common in recent years. Simultaneously, it has become easier than ever to use AI tools to generate photorealistic images depicting any scene imaginable. Combining these two – AI-generated fake news content – is particularly potent and dangerous. To combat the spread of AI-generated fake news, we propose the MiRAGeNews Dataset, a dataset of 12,500 high-quality real and AI-generated image-caption pairs from state-of-the-art generators. We find that our dataset poses a significant challenge to humans (60% F-1) and state-of-the-art multi-modal LLMs ( 24% F-1). Using our dataset we train a multi-modal detector (MiRAGe) that improves by +5.1% F-1 over state-of-the-art baselines on image-caption pairs from out-of-domain image generators and news publishers. We release our code and data to aid future work on detecting AI-generated content.

[NLP-2] AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation

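Quick read: The paper studies the vulnerability of transformer-based large language models (LLMs) to jailbreaking attacks and proposes an improvement to the optimization-based Greedy Coordinate Gradient (GCG) strategy. The key is a new method, AttnGCG, which manipulates the model's attention scores to make jailbreaking more effective. Experiments show consistent gains in attack efficacy across LLMs, averaging about 7% on the Llama-2 series and about 10% on the Gemma series. The method also transfers well to unseen harmful goals and to black-box LLMs such as GPT-3.5 and GPT-4. Attention-score visualizations give a clearer view of how attention manipulation enables more effective jailbreaking.

Link: https://arxiv.org/abs/2410.09040
Authors: Zijun Wang, Haoqin Tu, Jieru Mei, Bingchen Zhao, Yisen Wang, Cihang Xie
Keywords: Greedy Coordinate Gradient, transformer-based Large Language, optimization-based Greedy Coordinate, Large Language Models, Coordinate Gradient
Categories: Computation and Language (cs.CL)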

Abstract:This paper studies the vulnerabilities of transformer-based Large Language Models (LLMs) to jailbreaking attacks, focusing specifically on the optimization-based Greedy Coordinate Gradient (GCG) strategy. We first observe a positive correlation between the effectiveness of attacks and the internal behaviors of the models. For instance, attacks tend to be less effective when models pay more attention to system prompts designed to ensure LLM safety alignment. Building on this discovery, we introduce an enhanced method that manipulates models’ attention scores to facilitate LLM jailbreaking, which we term AttnGCG. Empirically, AttnGCG shows consistent improvements in attack efficacy across diverse LLMs, achieving an average increase of ~7% in the Llama-2 series and ~10% in the Gemma series. Our strategy also demonstrates robust attack transferability against both unseen harmful goals and black-box LLMs like GPT-3.5 and GPT-4. Moreover, we note our attention-score visualization is more interpretable, allowing us to gain better insights into how our targeted attention manipulation facilitates more effective jailbreaking. We release the code at this https URL.

[NLP-3] SimpleStrat: Diversifying Language Model Generation with Stratification

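Quick read: The paper addresses diverse response generation from large language models (LLMs), where the usual approach of raising the temperature both lowers generation quality and depends on the model's next-token probabilities matching the true distribution of answers. The key is SimpleStrat, which uses the language model itself to partition the generation space into strata, then at inference selects a random stratum and draws a sample from it. Diversity is evaluated with the new CoverageQA dataset and KL divergence; experiments show 0.05 higher recall than GPT-4o and an average KL divergence reduction of 0.36 compared to Llama 3.

Link: https://arxiv.org/abs/2410.09038
Authors: Justin Wong, Yury Orlovskiy, Michael Luo, Sanjit A. Seshia, Joseph E. Gonzalez
Keywords: Generating diverse responses, Generating diverse, synthetic data generation, search and synthetic, crucial for applications
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)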

Abstract:Generating diverse responses from large language models (LLMs) is crucial for applications such as planning/search and synthetic data generation, where diversity provides distinct answers across generations. Prior approaches rely on increasing temperature to increase diversity. However, contrary to popular belief, we show not only does this approach produce lower quality individual generations as temperature increases, but it depends on the model’s next-token probabilities being similar to the true distribution of answers. We propose SimpleStrat, an alternative approach that uses the language model itself to partition the space into strata. At inference, a random stratum is selected and a sample drawn from within the stratum. To measure diversity, we introduce CoverageQA, a dataset of underspecified questions with multiple equally plausible answers, and assess diversity by measuring KL Divergence between the output distribution and uniform distribution over valid ground truth answers. As computing probability per response/solution for proprietary models is infeasible, we measure recall on ground truth solutions. Our evaluation shows using SimpleStrat achieves higher recall by 0.05 compared to GPT-4o and 0.36 average reduction in KL Divergence compared to Llama 3.
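
The two diversity measures named in the abstract, KL divergence to a uniform distribution over valid answers and recall on ground-truth answers, can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the direction KL(P || U), the use of an empirical distribution built from samples, and the toy answer set are assumptions.

```python
import math
from collections import Counter


def kl_to_uniform(samples, valid_answers):
    """KL(P || U), where P is the empirical distribution of samples over valid answers."""
    counts = Counter(s for s in samples if s in valid_answers)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no sample matches a valid answer")
    k = len(valid_answers)
    kl = 0.0
    for answer in valid_answers:
        p = counts[answer] / total
        if p > 0:  # terms with p == 0 contribute 0 by convention
            kl += p * math.log(p * k)  # p * log(p / (1 / k))
    return kl


def recall(samples, valid_answers):
    """Fraction of valid answers that appear at least once among the samples."""
    return len(set(samples) & set(valid_answers)) / len(valid_answers)


# Hypothetical underspecified question with four equally plausible answers.
valid = ["CA", "NY", "TX", "WA"]
collapsed = ["CA"] * 8                 # the model always gives the same answer
spread = ["CA", "NY", "TX", "WA"] * 2  # answers cover the space evenly

kl_collapsed = kl_to_uniform(collapsed, valid)
kl_spread = kl_to_uniform(spread, valid)
```

A fully collapsed sampler reaches the maximum value log(k) and a recall of 1/k, while an evenly spread sampler reaches a KL of zero and a recall of 1.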

[NLP-4] Mentor-KD: Making Small Language Models Better Multi-step Reasoners (EMNLP 2024)

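Quick read: The paper addresses weak transfer of reasoning ability in knowledge distillation, caused by insufficient distillation sets from the LLM teacher in terms of data quality and soft label provision. The key is Mentor-KD, which introduces a mentor, an intermediate-sized task-specific fine-tuned model, to augment additional CoT annotations and provide soft labels to the student, improving the multi-step reasoning of small language models (LMs).

Link: https://arxiv.org/abs/2410.09037
Authors: Hojae Lee, Junho Kim, SangKeun Lee
Keywords: Large Language Models, displayed remarkable performances, Large Language, Language Models, displayed remarkable
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2024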

Abstract:Large Language Models (LLMs) have displayed remarkable performances across various complex tasks by leveraging Chain-of-Thought (CoT) prompting. Recently, studies have proposed a Knowledge Distillation (KD) approach, reasoning distillation, which transfers such reasoning ability of LLMs through fine-tuning language models of multi-step rationales generated by LLM teachers. However, they have inadequately considered two challenges regarding insufficient distillation sets from the LLM teacher model, in terms of 1) data quality and 2) soft label provision. In this paper, we propose Mentor-KD, which effectively distills the multi-step reasoning capability of LLMs to smaller LMs while addressing the aforementioned challenges. Specifically, we exploit a mentor, intermediate-sized task-specific fine-tuned model, to augment additional CoT annotations and provide soft labels for the student model during reasoning distillation. We conduct extensive experiments and confirm Mentor-KD’s effectiveness across various models and complex reasoning tasks.
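
The abstract says the mentor provides soft labels to the student but does not give the loss. A common way to use soft labels in knowledge distillation is a KL divergence between temperature-softened output distributions; the sketch below shows that generic form. The function names, the temperature value, and the choice of KL(mentor || student) are assumptions, not the paper's stated objective.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


def soft_label_loss(mentor_logits, student_logits, temperature=2.0):
    """KL(mentor || student) on temperature-softened distributions."""
    p = softmax(mentor_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy logits over three answer options (illustrative numbers only).
matched = soft_label_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0])
mismatched = soft_label_loss([2.0, 1.0, 0.0], [0.0, 1.0, 2.0])
```

The loss is zero when the student reproduces the mentor's distribution and grows as the two diverge, which is what lets the soft labels carry more signal than a single hard label.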

[NLP-5] PEAR: A Robust and Flexible Automation Framework for Ptychography Enabled by Multiple Large Language Model Agents

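Quick read: The paper addresses the low-throughput workflows and potential human bias that arise because parameter selection in ptychography, a computational imaging technique in X-ray and electron microscopy, traditionally relies on trial and error. The key is the "Ptychographic Experiment and Analysis Robot" (PEAR), a framework that uses large language models (LLMs) to automate data analysis. Its multi-agent design, covering knowledge retrieval, code generation, parameter recommendation, and image reasoning, significantly improves the workflow success rate. PEAR also supports several automation levels and customized local knowledge bases, giving flexibility across research environments.

Link: https://arxiv.org/abs/2410.09034
Authors: Xiangyu Yin, Chuqiao Shi, Yimo Han, Yi Jiang
Keywords: advanced computational imaging, computational imaging technique, technique in X-ray, X-ray and electron, electron microscopy
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: 18 pages, 5 figures, technical preview report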

Abstract:Ptychography is an advanced computational imaging technique in X-ray and electron microscopy. It has been widely adopted across scientific research fields, including physics, chemistry, biology, and materials science, as well as in industrial applications such as semiconductor characterization. In practice, obtaining high-quality ptychographic images requires simultaneous optimization of numerous experimental and algorithmic parameters. Traditionally, parameter selection often relies on trial and error, leading to low-throughput workflows and potential human bias. In this work, we develop the “Ptychographic Experiment and Analysis Robot” (PEAR), a framework that leverages large language models (LLMs) to automate data analysis in ptychography. To ensure high robustness and accuracy, PEAR employs multiple LLM agents for tasks including knowledge retrieval, code generation, parameter recommendation, and image reasoning. Our study demonstrates that PEAR’s multi-agent design significantly improves the workflow success rate, even with smaller open-weight models such as LLaMA 3.1 8B. PEAR also supports various automation levels and is designed to work with customized local knowledge bases, ensuring flexibility and adaptability across different research environments.

[NLP-6] AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

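Quick read: The paper addresses the robustness of large language models (LLMs) acting as agents to malicious use. The key is a new benchmark, AgentHarm, with 110 explicitly malicious agent tasks across 11 harm categories, which measures both whether models refuse harmful requests and whether jailbroken agents retain the capabilities needed to complete multi-step tasks. Evaluating a range of leading LLMs, the authors find that the models are surprisingly compliant with malicious requests even without jailbreaking, and that simple universal jailbreak templates can effectively jailbreak agents, enabling coherent, malicious multi-step behavior while retaining model capabilities. AgentHarm is publicly released for simple and reliable evaluation of attacks and defenses for LLM-based agents.

Link: https://arxiv.org/abs/2410.09024
Authors: Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, Xander Davies
Keywords: users design prompts, circumvent safety measures, users design, design prompts, prompts to circumvent
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)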

Abstract:The robustness of LLMs to jailbreak attacks, where users design prompts to circumvent safety measures and misuse model capabilities, has been studied primarily for LLMs acting as simple chatbots. Meanwhile, LLM agents – which use external tools and can execute multi-stage tasks – may pose a greater risk if misused, but their robustness remains underexplored. To facilitate research on LLM agent misuse, we propose a new benchmark called AgentHarm. The benchmark includes a diverse set of 110 explicitly malicious agent tasks (440 with augmentations), covering 11 harm categories including fraud, cybercrime, and harassment. In addition to measuring whether models refuse harmful agentic requests, scoring well on AgentHarm requires jailbroken agents to maintain their capabilities following an attack to complete a multi-step task. We evaluate a range of leading LLMs, and find (1) leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking, (2) simple universal jailbreak templates can be adapted to effectively jailbreak agents, and (3) these jailbreaks enable coherent and malicious multi-step agent behavior and retain model capabilities. We publicly release AgentHarm to enable simple and reliable evaluation of attacks and defenses for LLM-based agents. We publicly release the benchmark at this https URL.

[NLP-7] MedMobile: A mobile-sized language model with expert-level clinical capabilities

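Quick read: The paper addresses the computational cost and privacy concerns of large language models in medical applications. The key is MedMobile, a lightweight 3.8-billion-parameter language model that runs on a mobile device. Chain of thought, ensembling, and fine-tuning yield the largest performance gains, whereas retrieval-augmented generation does not bring significant improvement. MedMobile scores 75.7% on MedQA (USMLE), above the passing mark for physicians and close to the scores of models 100 times its size.

Link: https://arxiv.org/abs/2410.09019
Authors: Krithik Vishwanath, Jaden Stryker, Anton Alaykin, Daniel Alexander Alber, Eric Karl Oermann
Keywords: demonstrated expert-level reasoning, Language models, abilities in medicine, demonstrated expert-level, expert-level reasoning
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 5 figures (2 main, 3 supplementary)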

Abstract:Language models (LMs) have demonstrated expert-level reasoning and recall abilities in medicine. However, computational costs and privacy concerns are mounting barriers to wide-scale implementation. We introduce a parsimonious adaptation of phi-3-mini, MedMobile, a 3.8 billion parameter LM capable of running on a mobile device, for medical applications. We demonstrate that MedMobile scores 75.7% on the MedQA (USMLE), surpassing the passing mark for physicians (~60%), and approaching the scores of models 100 times its size. We subsequently perform a careful set of ablations, and demonstrate that chain of thought, ensembling, and fine-tuning lead to the greatest performance gains, while unexpectedly retrieval augmented generation fails to demonstrate significant improvements.
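
The abstract credits ensembling as one source of performance gains but does not describe the procedure. One simple form, shown here purely as an assumption, is majority voting over the final answers of several independently sampled chain-of-thought responses to a multiple-choice question. The function name and the tie-breaking rule are hypothetical.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer


# Final options extracted from five hypothetical sampled responses.
sampled_answers = ["B", "A", "B", "C", "B"]
ensemble_answer = majority_vote(sampled_answers)
```

Voting of this kind trades extra inference calls for stability, which matters on a mobile-sized model where a single sample is more likely to go astray.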

[NLP-8] Parameter-Efficient Fine-Tuning of State Space Models

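Quick read: The paper studies two questions: how existing parameter-efficient fine-tuning (PEFT) methods perform on deep state space models (SSMs), and which modules are most effective to fine-tune. The key finding is that LoRA (low-rank adaptation) remains effective on SSM-based models, and that applying LoRA to the linear projection matrices without modifying the SSM modules gives the best results. The paper also proposes LoRA with Selective Dimension tuning (SDLoRA), which selectively updates certain channels and states in the SSM modules while applying LoRA to the linear projection matrices, further improving performance.

Link: https://arxiv.org/abs/2410.09016
Authors: Kevin Galim, Wonjun Kang, Yuchen Zeng, Hyung Il Koo, Kangwook Lee
Keywords: Deep State Space, State Space Models, Deep State, Space Models, State Space
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Code is available at this https URL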

Abstract:Deep State Space Models (SSMs), such as Mamba (Gu & Dao, 2024), have emerged as powerful tools for language modeling, offering high performance with efficient inference and linear scaling in sequence length. However, the application of parameter-efficient fine-tuning (PEFT) methods to SSM-based models remains largely unexplored. This paper aims to systematically study two key questions: (i) How do existing PEFT methods perform on SSM-based models? (ii) Which modules are most effective for fine-tuning? We conduct an empirical benchmark of four basic PEFT methods on SSM-based models. Our findings reveal that prompt-based methods (e.g., prefix-tuning) are no longer effective, an empirical result further supported by theoretical analysis. In contrast, LoRA remains effective for SSM-based models. We further investigate the optimal application of LoRA within these models, demonstrating both theoretically and experimentally that applying LoRA to linear projection matrices without modifying SSM modules yields the best results, as LoRA is not effective at tuning SSM modules. To further improve performance, we introduce LoRA with Selective Dimension tuning (SDLoRA), which selectively updates certain channels and states on SSM modules while applying LoRA to linear projection matrices. Extensive experimental results show that this approach outperforms standard LoRA.
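
For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update applied to a single linear projection matrix: the frozen weight W is augmented with a scaled product of two small matrices B and A. It is a dependency-free toy, not the paper's code; the helper names and the tiny matrices are illustrative, and nothing here covers the SSM-specific SDLoRA part.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, b, a, alpha, rank):
    """Effective weight W + (alpha / rank) * B @ A; W itself stays frozen."""
    delta = matmul(b, a)
    scale = alpha / rank
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def lora_trainable_params(d_out, d_in, rank):
    """Parameters in B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)


# Toy 2x3 projection with a rank-1 adapter (illustrative numbers only).
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b = [[1.0], [2.0]]
a = [[0.5, 0.0, 1.0]]
adapted = lora_weight(w, b, a, alpha=2.0, rank=1)

# B is conventionally initialized to zero, so training starts from W unchanged.
unchanged = lora_weight(w, [[0.0], [0.0]], a, alpha=2.0, rank=1)
```

For a 4096 by 4096 projection, a rank-8 adapter trains 65,536 parameters instead of roughly 16.8 million, which is the efficiency argument behind applying LoRA to the projection matrices.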

[NLP-9] The Impact of Visual Information in Chinese Characters: Evaluating Large Models' Ability to Recognize and Utilize Radicals

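Quick read: The paper evaluates, and seeks to improve, how well large language models (LLMs) and vision-language models (VLMs) understand the visual elements of Chinese characters, such as radicals, composition structures, strokes, and stroke counts. The key is a benchmark for this understanding, together with adding radical information to prompts to improve Chinese language understanding tasks such as part-of-speech tagging, exploring the potential of sub-character information to enhance Chinese language processing (CLP).

Link: https://arxiv.org/abs/2410.09013
Authors: Xiaofeng Wu, Karl Stratos, Wei Xu
Keywords: glyphic writing system, Chinese incorporates information-rich, incorporates information-rich visual, meaning or pronunciation, information-rich visual features
Categories: Computation and Language (cs.CL)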

Abstract:The glyphic writing system of Chinese incorporates information-rich visual features in each character, such as radicals that provide hints about meaning or pronunciation. However, there has been no investigation into whether contemporary Large Language Models (LLMs) and Vision-Language Models (VLMs) can harness these sub-character features in Chinese through prompting. In this study, we establish a benchmark to evaluate LLMs’ and VLMs’ understanding of visual elements in Chinese characters, including radicals, composition structures, strokes, and stroke counts. Our results reveal that models surprisingly exhibit some, but still limited, knowledge of the visual information, regardless of whether images of characters are provided. To incite models’ ability to use radicals, we further experiment with incorporating radicals into the prompts for Chinese language understanding tasks. We observe consistent improvement in Part-Of-Speech tagging when providing additional information about radicals, suggesting the potential to enhance CLP by integrating sub-character information.

[NLP-10] SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights

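Quick read: The paper addresses the difficulty small language models have in identifying and correcting reasoning errors in complex mathematical reasoning. The key is SuperCorrect, a two-stage framework in which a large teacher model supervises and corrects the reasoning and reflection processes of a smaller student model. In the first stage, hierarchical high-level and detailed thought templates are extracted from the teacher to guide the student toward more fine-grained reasoning. In the second stage, cross-model collaborative direct preference optimization (DPO) uses the teacher's correction traces to strengthen the student's self-correction, so that it can locate and resolve erroneous thoughts, break its thought bottleneck, and acquire new skills and knowledge for challenging problems.

Link: https://arxiv.org/abs/2410.09008
Authors: Ling Yang, Zhaochen Yu, Tianjun Zhang, Minkai Xu, Joseph E. Gonzalez, Bin Cui, Shuicheng Yan
Keywords: shown significant improvements, Large language models, student model, LLaMA have shown, shown significant
Categories: Computation and Language (cs.CL)
Comments: Project: this https URL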

Abstract:Large language models (LLMs) like GPT-4, PaLM, and LLaMA have shown significant improvements in various reasoning tasks. However, smaller models such as Llama-3-8B and DeepSeekMath-Base still struggle with complex mathematical reasoning because they fail to effectively identify and correct reasoning errors. Recent reflection-based methods aim to address these issues by enabling self-reflection and self-correction, but they still face challenges in independently detecting errors in their reasoning steps. To overcome these limitations, we propose SuperCorrect, a novel two-stage framework that uses a large teacher model to supervise and correct both the reasoning and reflection processes of a smaller student model. In the first stage, we extract hierarchical high-level and detailed thought templates from the teacher model to guide the student model in eliciting more fine-grained reasoning thoughts. In the second stage, we introduce cross-model collaborative direct preference optimization (DPO) to enhance the self-correction abilities of the student model by following the teacher’s correction traces during training. This cross-model DPO approach teaches the student model to effectively locate and resolve erroneous thoughts with error-driven insights from the teacher model, breaking the bottleneck of its thoughts and acquiring new skills and knowledge to tackle challenging problems. Extensive experiments consistently demonstrate our superiority over previous methods. Notably, our SuperCorrect-7B model significantly surpasses powerful DeepSeekMath-7B by 7.8%/5.3% and Qwen2.5-Math-7B by 15.1%/6.3% on MATH/GSM8K benchmarks, achieving new SOTA performance among all 7B models. Code: this https URL
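
The second stage builds on direct preference optimization (DPO). As background, the standard DPO loss for one preference pair can be written in a few lines. How the pairs are constructed here is an assumption: the sketch treats a teacher-corrected trace as the preferred response and the student's erroneous trace as the rejected one, which the abstract suggests but does not spell out. All argument names and numbers are illustrative.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * log-ratio margin))."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Sequence log-probabilities (illustrative numbers only).
# chosen: teacher-corrected trace, rejected: the student's erroneous trace.
at_reference = dpo_loss(-12.0, -10.0, -12.0, -10.0)
after_update = dpo_loss(-9.0, -14.0, -12.0, -10.0)
```

When the policy equals the reference model the loss is log(2); it falls as the policy raises the likelihood of the preferred trace relative to the rejected one.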

[NLP-11] Hypothesis-only Biases in Large Language Model-Elicited Natural Language Inference

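Quick read: The paper asks whether replacing crowdsource workers with large language models (LLMs) to write natural language inference (NLI) hypotheses produces similar annotation artifacts. The authors recreate a portion of the Stanford NLI corpus with GPT-4, Llama-2, and Mistral 7b and train hypothesis-only classifiers to detect artifacts in the LLM-elicited hypotheses. BERT-based hypothesis-only classifiers reach 86-96% accuracy on these datasets, indicating hypothesis-only artifacts. LLM-generated hypotheses also contain frequent "give-aways"; for example, the phrase "swimming in a pool" appears in more than 10,000 contradictions generated by GPT-4. The results give empirical evidence that well-attested NLI biases persist in LLM-generated data.

Link: https://arxiv.org/abs/2410.08996
Authors: Grace Proebsting, Adam Poliak
Keywords: Natural Language Inference, write Natural Language, Language Inference, Natural Language, replacing crowdsource workers
Categories: Computation and Language (cs.CL)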

Abstract:We test whether replacing crowdsource workers with LLMs to write Natural Language Inference (NLI) hypotheses similarly results in annotation artifacts. We recreate a portion of the Stanford NLI corpus using GPT-4, Llama-2 and Mistral 7b, and train hypothesis-only classifiers to determine whether LLM-elicited hypotheses contain annotation artifacts. On our LLM-elicited NLI datasets, BERT-based hypothesis-only classifiers achieve between 86-96% accuracy, indicating these datasets contain hypothesis-only artifacts. We also find frequent “give-aways” in LLM-generated hypotheses, e.g. the phrase “swimming in a pool” appears in more than 10,000 contradictions generated by GPT-4. Our analysis provides empirical evidence that well-attested biases in NLI can persist in LLM-generated data.
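
The "give-away" phrases mentioned in the abstract can be surfaced with simple counting, without any classifier. The sketch below is a swapped-in technique, not the paper's BERT-based hypothesis-only classifier: it counts how often each n-gram in the hypotheses co-occurs with each label and reports the dominant label's share. The function name, thresholds, and toy examples are hypothetical.

```python
from collections import Counter, defaultdict


def giveaway_ngrams(examples, n=2, min_count=2):
    """Map each frequent n-gram to (dominant label, share of occurrences)."""
    label_counts = defaultdict(Counter)
    for hypothesis, label in examples:
        tokens = hypothesis.lower().split()
        for i in range(len(tokens) - n + 1):
            label_counts[" ".join(tokens[i:i + n])][label] += 1
    result = {}
    for ngram, counts in label_counts.items():
        total = sum(counts.values())
        if total >= min_count:
            label, top = counts.most_common(1)[0]
            result[ngram] = (label, top / total)
    return result


# Toy (hypothesis, label) pairs; the premise is deliberately ignored.
examples = [
    ("A man is swimming in a pool", "contradiction"),
    ("The dog is swimming in a pool", "contradiction"),
    ("A person is outside", "entailment"),
    ("A person is outside today", "neutral"),
]
stats = giveaway_ngrams(examples)
```

An n-gram whose share is close to 1.0 predicts the label on its own, which is exactly the kind of artifact a hypothesis-only classifier can exploit.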

[NLP-12] Science is Exploration: Computational Frontiers for Conceptual Metaphor Theory

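Quick read: The paper asks whether large language models (LLMs) can accurately identify and explain conceptual metaphors in natural language data. The key is a novel prompting technique based on metaphor annotation guidelines. With it, the paper shows that LLMs are a promising tool for large-scale computational research on conceptual metaphors and can apply procedural guidelines designed for human annotators, displaying a surprising depth of linguistic knowledge.

Link: https://arxiv.org/abs/2410.08991
Authors: Rebecca M. M. Hicke, Ross Deans Kristensen-McLachlan
Keywords: conceptual metaphors, Large Language Models, Metaphors, language, natural language
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to the 2024 Computational Humanities Research Conference (CHR)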

Abstract:Metaphors are everywhere. They appear extensively across all domains of natural language, from the most sophisticated poetry to seemingly dry academic prose. A significant body of research in the cognitive science of language argues for the existence of conceptual metaphors, the systematic structuring of one domain of experience in the language of another. Conceptual metaphors are not simply rhetorical flourishes but are crucial evidence of the role of analogical reasoning in human cognition. In this paper, we ask whether Large Language Models (LLMs) can accurately identify and explain the presence of such conceptual metaphors in natural language data. Using a novel prompting technique based on metaphor annotation guidelines, we demonstrate that LLMs are a promising tool for large-scale computational research on conceptual metaphors. Further, we show that LLMs are able to apply procedural guidelines designed for human annotators, displaying a surprising depth of linguistic knowledge.

[NLP-13] Towards Trustworthy Knowledge Graph Reasoning: An Uncertainty Aware Perspective

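Quick read: The paper addresses the lack of rigorous uncertainty estimation in frameworks that couple knowledge graphs with large language models (KG-LLM), which limits reliable deployment in high-stakes applications. The key is a new framework, Uncertainty Aware Knowledge-Graph Reasoning (UAG), which adds uncertainty quantification to the KG-LLM framework. UAG uses an uncertainty-aware multi-step reasoning framework that applies conformal prediction to give a theoretical guarantee on the prediction set, plus an error rate control module that adjusts the error rate of individual components. It meets pre-defined coverage rates while reducing prediction set/interval size by 40% on average over the baselines.

Link: https://arxiv.org/abs/2410.08985
Authors: Bo Ni, Yu Wang, Lu Cheng, Erik Blasch, Tyler Derr
Keywords: Large Language Models, coupled with Large, KG-based retrieval-augmented frameworks, Large Language, language model components
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)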

Abstract:Recently, Knowledge Graphs (KGs) have been successfully coupled with Large Language Models (LLMs) to mitigate their hallucinations and enhance their reasoning capability, such as in KG-based retrieval-augmented frameworks. However, current KG-LLM frameworks lack rigorous uncertainty estimation, limiting their reliable deployment in high-stakes applications. Directly incorporating uncertainty quantification into KG-LLM frameworks presents challenges due to their complex architectures and the intricate interactions between the knowledge graph and language model components. To address this gap, we propose a new trustworthy KG-LLM framework, Uncertainty Aware Knowledge-Graph Reasoning (UAG), which incorporates uncertainty quantification into the KG-LLM framework. We design an uncertainty-aware multi-step reasoning framework that leverages conformal prediction to provide a theoretical guarantee on the prediction set. To manage the error rate of the multi-step process, we additionally introduce an error rate control module to adjust the error rate within the individual components. Extensive experiments show that our proposed UAG can achieve any pre-defined coverage rate while reducing the prediction set/interval size by 40% on average over the baselines.
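Summary: In recent years, Knowledge Graphs (KGs) have been successfully combined with Large Language Models (LLMs) to mitigate hallucinations and strengthen reasoning, for example in KG-based retrieval-augmented frameworks. However, current KG-LLM frameworks lack rigorous uncertainty estimation, which limits their reliable deployment in high-stakes applications. Incorporating uncertainty quantification directly into KG-LLM frameworks is challenging because of their complex architectures and the intricate interactions between the knowledge graph and language model components. To close this gap, we propose a new trustworthy KG-LLM framework, Uncertainty Aware Knowledge-Graph Reasoning (UAG), which incorporates uncertainty quantification into the KG-LLM framework. We design an uncertainty-aware multi-step reasoning framework that uses conformal prediction to provide a theoretical guarantee on the prediction set. To manage the error rate of the multi-step process, we also introduce an error rate control module that adjusts the error rate within individual components. Extensive experiments show that UAG can achieve any pre-defined coverage rate while reducing the prediction set/interval size by 40% on average relative to the baselines.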
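
The conformal prediction step described in this entry can be illustrated with a minimal split-conformal sketch. This is not the UAG implementation: the nonconformity scores, the candidate answers, and the function names `conformal_threshold` and `prediction_set` are assumptions made for illustration, and the multi-step reasoning and error rate control module are not reproduced.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Return the split-conformal quantile of calibration nonconformity scores."""
    n = len(cal_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: keep every candidate.
        return float("inf")
    return sorted(cal_scores)[rank - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score is within the threshold."""
    return {c for c, s in candidate_scores.items() if s <= threshold}


# Hypothetical calibration scores (e.g. 1 - model confidence in the true answer).
cal_scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.80]
tau = conformal_threshold(cal_scores, alpha=0.2)

# Hypothetical candidate answers for one query, with their nonconformity scores.
candidates = {"Paris": 0.10, "Lyon": 0.50, "Berlin": 0.90}
answer_set = prediction_set(candidates, tau)
```

With n calibration scores and target error rate alpha, the threshold is the ceil((n + 1)(1 - alpha))-th smallest score. Under exchangeability of calibration and test data, the resulting set contains the true answer with probability at least 1 - alpha.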

[NLP-14] UniGlyph: A Seven-Segment Script for Universal Language Representation

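Quick read: This paper addresses inconsistent phonetic representation in cross-language communication and the limitations of traditional character sets. The key to the solution is UniGlyph, a universal phonetic system built on seven-segment characters whose flexible, consistent script structure, phonetic mapping, and transliteration rules represent the phonetic diversity of many languages accurately and compactly. Pitch and length markers keep the representation precise while the character set stays small, which suits AI applications such as natural language processing and multilingual speech recognition.

Link: https://arxiv.org/abs/2410.08974
Authors: G. V. Bency Sherin,A. Abijesh Euphrine,A. Lenora Moreen,L. Arun Jose
Keywords: designed to create, derived from seven-segment, UniGlyph, International Phonetic Alphabet, phonetic
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Symbolic Computation (cs.SC); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: This submission includes 23 pages and tables. No external funding has been received for this research. Acknowledgments to Jeseentha V. for contributions to the phonetic study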

Abstract:UniGlyph is a constructed language (conlang) designed to create a universal transliteration system using a script derived from seven-segment characters. The goal of UniGlyph is to facilitate cross-language communication by offering a flexible and consistent script that can represent a wide range of phonetic sounds. This paper explores the design of UniGlyph, detailing its script structure, phonetic mapping, and transliteration rules. The system addresses imperfections in the International Phonetic Alphabet (IPA) and traditional character sets by providing a compact, versatile method to represent phonetic diversity across languages. With pitch and length markers, UniGlyph ensures accurate phonetic representation while maintaining a small character set. Applications of UniGlyph include artificial intelligence integrations, such as natural language processing and multilingual speech recognition, enhancing communication across different languages. Future expansions are discussed, including the addition of animal phonetic sounds, where unique scripts are assigned to different species, broadening the scope of UniGlyph beyond human communication. This study presents the challenges and solutions in developing such a universal script, demonstrating the potential of UniGlyph to bridge linguistic gaps in cross-language communication, educational phonetics, and AI-driven applications.
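Summary: UniGlyph is a constructed language (conlang) designed to create a universal transliteration system using a script derived from seven-segment characters. Its goal is to facilitate cross-language communication by offering a flexible and consistent script that can represent a wide range of phonetic sounds. This paper explores the design of UniGlyph, detailing its script structure, phonetic mapping, and transliteration rules. The system addresses imperfections in the International Phonetic Alphabet (IPA) and traditional character sets by providing a compact, versatile way to represent phonetic diversity across languages. With pitch and length markers, UniGlyph ensures accurate phonetic representation while keeping the character set small. Applications include artificial intelligence integrations such as natural language processing and multilingual speech recognition, which improve communication across languages. Future expansions include adding animal phonetic sounds, with a unique script assigned to each species, extending UniGlyph beyond human communication. The study presents the challenges and solutions in developing such a universal script and shows UniGlyph's potential to bridge linguistic gaps in cross-language communication, educational phonetics, and AI-driven applications.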

[NLP-15] Extra Global Attention Designation Using Keyword Detection in Sparse Transformer Architectures

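Quick read: This paper addresses the difficulty sparse Transformers have in encoding long-range context, especially in linking topics discussed at the beginning and end of a document. The key to the solution is a method for selectively increasing global attention: keywords are prefixed to the document and given global attention, which improves abstractive summarization in zero-shot, few-shot, and fine-tuned settings on some benchmark data sets.

Link: https://arxiv.org/abs/2410.08971
Authors: Evan Lucas,Dylan Kangas,Timothy C Havens
Keywords: Longformer Encoder-Decoder, sparse transformer architecture, extension to Longformer, popular sparse transformer, propose an extension
Categories: Computation and Language (cs.CL)
Comments: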

Abstract:In this paper, we propose an extension to Longformer Encoder-Decoder, a popular sparse transformer architecture. One common challenge with sparse transformers is that they can struggle with encoding of long range context, such as connections between topics discussed at a beginning and end of a document. A method to selectively increase global attention is proposed and demonstrated for abstractive summarization tasks on several benchmark data sets. By prefixing the transcript with additional keywords and encoding global attention on these keywords, improvement in zero-shot, few-shot, and fine-tuned cases is demonstrated for some benchmark data sets.
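Summary: This paper proposes an extension to Longformer Encoder-Decoder, a popular sparse Transformer architecture. A common challenge with sparse Transformers is that they can struggle to encode long-range context, such as connections between topics discussed at the beginning and end of a document. The paper proposes a method to selectively increase global attention and demonstrates it on abstractive summarization tasks across several benchmark data sets. Prefixing the transcript with additional keywords and encoding global attention on those keywords yields improvements in zero-shot, few-shot, and fine-tuned cases on some benchmark data sets.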
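
The keyword-prefix idea can be sketched as building a global attention mask over the prefixed input. This is an illustrative sketch only, not the authors' code: the helper name `prefix_keywords`, the separator token, and the example keywords are assumptions, and the mask follows the Longformer convention of 1 for global attention and 0 for local attention.

```python
def prefix_keywords(tokens, keywords, sep="</s>"):
    """Prepend keywords to a token list and mark them for global attention.

    Returns the new token sequence and a mask with 1 = global, 0 = local.
    The separator token and the keyword list are illustrative assumptions.
    """
    tokens = list(tokens)
    keywords = list(keywords)
    sequence = keywords + [sep] + tokens
    # Only the keyword positions receive global attention.
    mask = [1] * len(keywords) + [0] * (len(tokens) + 1)
    return sequence, mask


transcript = "the committee met to discuss the budget".split()
sequence, mask = prefix_keywords(transcript, ["budget", "committee"])
```

In a real pipeline the mask would be built over subword token ids rather than whitespace tokens, and how the keywords are chosen is a separate design decision that this sketch leaves open.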

[NLP-16] NoVo: Norm Voting off Hallucinations with Attention Heads in Large Language Models

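Quick read: This paper addresses hallucinations in large language models (LLMs), particularly in high-stakes applications where factual accuracy is critical. The key to the solution is a lightweight method, Norm Voting (NoVo), which exploits the untapped potential of attention head norms. An efficient inference-only algorithm that needs only 30 random samples automatically selects truth-correlated head norms, and a simple voting algorithm then uses them to markedly improve factual accuracy on zero-shot multiple-choice questions (MCQs). NoVo generalizes well across many datasets, surpasses the current state of the art, and opens new research directions in LLM interpretability, robustness, and reliability.

Link: https://arxiv.org/abs/2410.08970
Authors: Zheng Yi Ho,Siyuan Liang,Sen Zhang,Yibing Zhan,Dacheng Tao
Keywords: Large Language Models, Language Models, Large Language, Hallucinations in Large, remain a major
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: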

Abstract:Hallucinations in Large Language Models (LLMs) remain a major obstacle, particularly in high-stakes applications where factual accuracy is critical. While representation editing and reading methods have made strides in reducing hallucinations, their heavy reliance on specialised tools and training on in-domain samples, makes them difficult to scale and prone to overfitting. This limits their accuracy gains and generalizability to diverse datasets. This paper presents a lightweight method, Norm Voting (NoVo), which harnesses the untapped potential of attention head norms to dramatically enhance factual accuracy in zero-shot multiple-choice questions (MCQs). NoVo begins by automatically selecting truth-correlated head norms with an efficient, inference-only algorithm using only 30 random samples, allowing NoVo to effortlessly scale to diverse datasets. Afterwards, selected head norms are employed in a simple voting algorithm, which yields significant gains in prediction accuracy. On TruthfulQA MC1, NoVo surpasses the current state-of-the-art and all previous methods by an astounding margin – at least 19 accuracy points. NoVo demonstrates exceptional generalization to 20 diverse datasets, with significant gains in over 90% of them, far exceeding all current representation editing and reading methods. NoVo also reveals promising gains to finetuning strategies and building textual adversarial defence. NoVo’s effectiveness with head norms opens new frontiers in LLM interpretability, robustness and reliability.
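Summary: Hallucinations in large language models (LLMs) remain a major obstacle, particularly in high-stakes applications where factual accuracy is critical. Representation editing and reading methods have made progress in reducing hallucinations, but their heavy reliance on specialised tools and on training with in-domain samples makes them difficult to scale and prone to overfitting, which limits their accuracy gains and their generalizability to diverse datasets. This paper presents a lightweight method, Norm Voting (NoVo), which harnesses the untapped potential of attention head norms to substantially improve factual accuracy on zero-shot multiple-choice questions (MCQs). NoVo first selects truth-correlated head norms automatically with an efficient, inference-only algorithm that uses only 30 random samples, so it scales easily to diverse datasets. The selected head norms then feed a simple voting algorithm, which yields significant gains in prediction accuracy. On TruthfulQA MC1, NoVo surpasses the current state of the art and all previous methods by at least 19 accuracy points. NoVo generalizes to 20 diverse datasets, with significant gains on over 90% of them, far exceeding all current representation editing and reading methods. NoVo also shows promising gains for finetuning strategies and for building textual adversarial defence. Its effectiveness with head norms opens new frontiers in LLM interpretability, robustness, and reliability.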

[NLP-17] Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements

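Quick read: This paper addresses the inflexibility of the one-size-fits-all approach to safety alignment in current large language models (LLMs), particularly given social norms that vary across cultures and regions and users' diverse safety needs. The key to the solution is the Controllable Safety Alignment (CoSA) framework, which uses safety configs, free-form natural language descriptions, to adjust the model's safety behavior at inference time without re-training. The core technique is CoSAlign, a data-centric method that lets LLMs adapt easily to diverse safety configs. The paper also devises a new controllability evaluation protocol that combines helpfulness and configured safety into CoSA-Score, and builds the CoSApien benchmark to evaluate models on real-world use cases.

Link: https://arxiv.org/abs/2410.08968
Authors: Jingyu Zhang,Ahmed Elgohary,Ahmed Magooda,Daniel Khashabi,Benjamin Van Durme
Keywords: content deemed unsafe, safety, diverse safety, large language models, current paradigm
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: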

Abstract:The current paradigm for safety alignment of large language models (LLMs) follows a one-size-fits-all approach: the model refuses to interact with any content deemed unsafe by the model provider. This approach lacks flexibility in the face of varying social norms across cultures and regions. In addition, users may have diverse safety needs, making a model with static safety standards too restrictive to be useful, as well as too costly to be re-aligned. We propose Controllable Safety Alignment (CoSA), a framework designed to adapt models to diverse safety requirements without re-training. Instead of aligning a fixed model, we align models to follow safety configs – free-form natural language descriptions of the desired safety behaviors – that are provided as part of the system prompt. To adjust model safety behavior, authorized users only need to modify such safety configs at inference time. To enable that, we propose CoSAlign, a data-centric method for aligning LLMs to easily adapt to diverse safety configs. Furthermore, we devise a novel controllability evaluation protocol that considers both helpfulness and configured safety, summarizing them into CoSA-Score, and construct CoSApien, a human-authored benchmark that consists of real-world LLM use cases with diverse safety requirements and corresponding evaluation prompts. We show that CoSAlign leads to substantial gains of controllability over strong baselines including in-context alignment. Our framework encourages better representation and adaptation to pluralistic human values in LLMs, and thereby increasing their practicality.
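Summary: The current paradigm for safety alignment of large language models (LLMs) follows a one-size-fits-all approach: the model refuses to interact with any content the model provider deems unsafe. This approach lacks flexibility in the face of social norms that vary across cultures and regions. In addition, users may have diverse safety needs, which makes a model with static safety standards too restrictive to be useful and too costly to re-align. We propose Controllable Safety Alignment (CoSA), a framework designed to adapt models to diverse safety requirements without re-training. Rather than aligning a fixed model, we align models to follow safety configs, free-form natural language descriptions of the desired safety behaviors that are provided as part of the system prompt. To adjust the model's safety behavior, authorized users only need to modify these safety configs at inference time. To enable this, we propose CoSAlign, a data-centric method for aligning LLMs to adapt easily to diverse safety configs. We also devise a novel controllability evaluation protocol that considers both helpfulness and configured safety, summarizing them into CoSA-Score, and we construct CoSApien, a human-authored benchmark of real-world LLM use cases with diverse safety requirements and corresponding evaluation prompts. We show that CoSAlign yields substantial controllability gains over strong baselines, including in-context alignment. Our framework encourages better representation of, and adaptation to, pluralistic human values in LLMs, thereby increasing their practicality.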

[NLP-18] Language Imbalance Driven Rewarding for Multilingual Self-improving

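Quick read: This paper addresses the imbalance in multilingual use of large language models (LLMs), in particular the gap in representation between “first-class” languages such as English and Chinese and many other languages. The key to the solution is Language Imbalance Driven Rewarding, which uses the inherent imbalance between dominant and non-dominant languages within an LLM as a reward signal; iterative DPO training then improves performance in non-dominant languages while also strengthening the dominant language. Two iterations of fine-tuning Meta-Llama-3-8B-Instruct improve multilingual performance on instruction-following and arithmetic reasoning tasks, with average gains of 7.46% in win rate on the X-AlpacaEval leaderboard and 13.9% in accuracy on the MGSM benchmark.

Link: https://arxiv.org/abs/2410.08964
Authors: Wen Yang,Junhong Wu,Chen Wang,Chengqing Zong,Jiajun Zhang
Keywords: Large Language Models, Large Language, Language Models, English and Chinese, Imbalance Driven Rewarding
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Work in progress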

Abstract:Large Language Models (LLMs) have achieved state-of-the-art performance across numerous tasks. However, these advancements have predominantly benefited “first-class” languages such as English and Chinese, leaving many other languages underrepresented. This imbalance, while limiting broader applications, generates a natural preference ranking between languages, offering an opportunity to bootstrap the multilingual capabilities of LLM in a self-improving manner. Thus, we propose \textit{Language Imbalance Driven Rewarding}, where the inherent imbalance between dominant and non-dominant languages within LLMs is leveraged as a reward signal. Iterative DPO training demonstrates that this approach not only enhances LLM performance in non-dominant languages but also improves the dominant language’s capacity, thereby yielding an iterative reward signal. Fine-tuning Meta-Llama-3-8B-Instruct over two iterations of this approach results in continuous improvements in multilingual performance across instruction-following and arithmetic reasoning tasks, evidenced by an average improvement of 7.46% win rate on the X-AlpacaEval leaderboard and 13.9% accuracy on the MGSM benchmark. This work serves as an initial exploration, paving the way for multilingual self-improvement of LLMs.
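Summary: Large language models (LLMs) have achieved state-of-the-art performance across numerous tasks. However, these advances have mainly benefited “first-class” languages such as English and Chinese, leaving many other languages underrepresented. This imbalance limits broader applications, but it also creates a natural preference ranking between languages, which offers an opportunity to bootstrap the multilingual capabilities of LLMs in a self-improving manner. We therefore propose Language Imbalance Driven Rewarding, in which the inherent imbalance between dominant and non-dominant languages within LLMs is used as a reward signal. Iterative DPO training shows that this approach not only improves LLM performance in non-dominant languages but also improves the dominant language's capacity, thereby yielding an iterative reward signal. Fine-tuning Meta-Llama-3-8B-Instruct over two iterations of this approach produces continuous improvements in multilingual performance on instruction-following and arithmetic reasoning tasks, with an average improvement of 7.46% in win rate on the X-AlpacaEval leaderboard and 13.9% in accuracy on the MGSM benchmark. This work is an initial exploration that paves the way for multilingual self-improvement of LLMs.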

[NLP-19] Towards Cross-Lingual LLM Evaluation for European Languages

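Quick read: This paper addresses the difficulty of evaluating large language models (LLMs) consistently and meaningfully across multiple European languages. The key to the solution is a cross-lingual evaluation approach that uses translated versions of five widely used benchmarks to assess 40 LLMs across 21 European languages. The main contributions are: 1) examining the effectiveness of translated benchmarks; 2) assessing the impact of different translation services; 3) offering a multilingual evaluation framework that includes newly created datasets: EU20-MMLU, EU20-HellaSwag, EU20-ARC, EU20-TruthfulQA, and EU20-GSM8K. The benchmarks and results are released publicly to encourage further research on multilingual LLM evaluation.

Link: https://arxiv.org/abs/2410.08928
Authors: Klaudia Thellmann,Bernhard Stadler,Michael Fromm,Jasper Schulze Buschhoff,Alex Jude,Fabio Barth,Johannes Leveling,Nicolas Flores-Herr,Joachim Köhler,René Jäkel,Mehdi Ali
Keywords: Large Language Models, rise of Large, revolutionized natural language, natural language processing, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: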

Abstract:The rise of Large Language Models (LLMs) has revolutionized natural language processing across numerous languages and tasks. However, evaluating LLM performance in a consistent and meaningful way across multiple European languages remains challenging, especially due to the scarcity of multilingual benchmarks. We introduce a cross-lingual evaluation approach tailored for European languages. We employ translated versions of five widely-used benchmarks to assess the capabilities of 40 LLMs across 21 European languages. Our contributions include examining the effectiveness of translated benchmarks, assessing the impact of different translation services, and offering a multilingual evaluation framework for LLMs that includes newly created datasets: EU20-MMLU, EU20-HellaSwag, EU20-ARC, EU20-TruthfulQA, and EU20-GSM8K. The benchmarks and results are made publicly available to encourage further research in multilingual LLM evaluation.
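Summary: The rise of large language models (LLMs) has revolutionized natural language processing across numerous languages and tasks. However, evaluating LLM performance consistently and meaningfully across multiple European languages remains challenging, especially because multilingual benchmarks are scarce. We introduce a cross-lingual evaluation approach tailored to European languages. We use translated versions of five widely used benchmarks to assess the capabilities of 40 LLMs across 21 European languages. Our contributions include examining the effectiveness of translated benchmarks, assessing the impact of different translation services, and offering a multilingual evaluation framework for LLMs that includes newly created datasets: EU20-MMLU, EU20-HellaSwag, EU20-ARC, EU20-TruthfulQA, and EU20-GSM8K. The benchmarks and results are publicly available to encourage further research on multilingual LLM evaluation.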

[NLP-20] AutoPersuade: A Framework for Evaluating and Explaining Persuasive Arguments

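Quick read: This paper addresses how to construct persuasive messages automatically. The key to the solution is AutoPersuade, a three-part framework: first, curate a large dataset of arguments with human evaluations; next, develop a novel topic model that identifies the argument features influencing persuasiveness; finally, use the model to predict the effectiveness of new arguments and to assess the causal impact of different components, which provides explanations. An experimental study on arguments for veganism validates the framework through human studies and out-of-sample predictions.

Link: https://arxiv.org/abs/2410.08917
Authors: Till Raphael Saenger,Musashi Hinck,Justin Grimmer,Brandon M. Stewart
Keywords: constructing persuasive messages, persuasive messages, three-part framework, framework for constructing, constructing persuasive
Categories: Computation and Language (cs.CL)
Comments: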

Abstract:We introduce AutoPersuade, a three-part framework for constructing persuasive messages. First, we curate a large dataset of arguments with human evaluations. Next, we develop a novel topic model to identify argument features that influence persuasiveness. Finally, we use this model to predict the effectiveness of new arguments and assess the causal impact of different components to provide explanations. We validate AutoPersuade through an experimental study on arguments for veganism, demonstrating its effectiveness with human studies and out-of-sample predictions.
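Summary: We introduce AutoPersuade, a three-part framework for constructing persuasive messages. First, we curate a large dataset of arguments with human evaluations. Next, we develop a novel topic model to identify the argument features that influence persuasiveness. Finally, we use this model to predict the effectiveness of new arguments and to assess the causal impact of different components, which provides explanations. We validate AutoPersuade through an experimental study on arguments for veganism, demonstrating its effectiveness with human studies and out-of-sample predictions.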

[NLP-21] Lifelong Event Detection via Optimal Transport EMNLP2024

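Quick read: This paper addresses catastrophic forgetting in Continual Event Detection (CED), where learning new tasks (new event types) degrades performance on earlier ones. The key to the solution is a new method, Lifelong Event Detection via Optimal Transport (LEDOT), which uses optimal transport principles to align the optimization of the classification module with the intrinsic nature of each class, as defined by the pre-trained language model. LEDOT combines replay sets, prototype latent representations, and a novel optimal transport component to mitigate catastrophic forgetting. Experiments on the MAVEN and ACE datasets show LEDOT consistently outperforming state-of-the-art baselines, and the authors describe it as a pioneering solution for continual event detection.

Link: https://arxiv.org/abs/2410.08905
Authors: Viet Dao,Van-Cuong Pham,Quyen Tran,Thanh-Thien Le,Linh Ngo Van,Thien Huu Nguyen
Keywords: coming event types, formidable challenge due, Continual Event Detection, Lifelong Event Detection, Event Detection
Categories: Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2024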

Abstract:Continual Event Detection (CED) poses a formidable challenge due to the catastrophic forgetting phenomenon, where learning new tasks (with new coming event types) hampers performance on previous ones. In this paper, we introduce a novel approach, Lifelong Event Detection via Optimal Transport (LEDOT), that leverages optimal transport principles to align the optimization of our classification module with the intrinsic nature of each class, as defined by their pre-trained language modeling. Our method integrates replay sets, prototype latent representations, and an innovative Optimal Transport component. Extensive experiments on MAVEN and ACE datasets demonstrate LEDOT’s superior performance, consistently outperforming state-of-the-art baselines. The results underscore LEDOT as a pioneering solution in continual event detection, offering a more effective and nuanced approach to addressing catastrophic forgetting in evolving environments.
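Summary: Continual Event Detection (CED) is a formidable challenge because of catastrophic forgetting, where learning new tasks (with newly arriving event types) degrades performance on previous ones. This paper introduces a novel approach, Lifelong Event Detection via Optimal Transport (LEDOT), which uses optimal transport principles to align the optimization of the classification module with the intrinsic nature of each class, as defined by their pre-trained language modeling. The method integrates replay sets, prototype latent representations, and an innovative Optimal Transport component. Extensive experiments on the MAVEN and ACE datasets show that LEDOT consistently outperforms state-of-the-art baselines. The results underscore LEDOT as a pioneering solution in continual event detection, offering a more effective and nuanced approach to catastrophic forgetting in evolving environments.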

[NLP-22] A Benchmark for Cross-Domain Argumentative Stance Classification on Social Media

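Quick read: This paper addresses the limited data diversity and high annotation cost in multidomain argumentative stance classification. The key to the solution is to use platform rules, expert-curated content, and large language models to generate argument pairs across domains automatically, avoiding time-consuming and labor-intensive manual annotation. The resulting multidomain benchmark contains 4,498 topical claims and 30,961 arguments spanning 21 domains, and is benchmarked in fully supervised, zero-shot, and few-shot settings, revealing the strengths and limitations of different methods.

Link: https://arxiv.org/abs/2410.08900
Authors: Jiaqing Yuan,Ruijie Xi,Munindar P. Singh
Keywords: stance classification plays, identifying authors’ viewpoints, Argumentative stance classification, stance classification, classification plays
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: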

Abstract:Argumentative stance classification plays a key role in identifying authors’ viewpoints on specific topics. However, generating diverse pairs of argumentative sentences across various domains is challenging. Existing benchmarks often come from a single domain or focus on a limited set of topics. Additionally, manual annotation for accurate labeling is time-consuming and labor-intensive. To address these challenges, we propose leveraging platform rules, readily available expert-curated content, and large language models to bypass the need for human annotation. Our approach produces a multidomain benchmark comprising 4,498 topical claims and 30,961 arguments from three sources, spanning 21 domains. We benchmark the dataset in fully supervised, zero-shot, and few-shot settings, shedding light on the strengths and limitations of different methodologies. We release the dataset and code in this study at hidden for anonymity.
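Summary: Argumentative stance classification plays a key role in identifying authors' viewpoints on specific topics. However, generating diverse pairs of argumentative sentences across many domains is challenging. Existing benchmarks often come from a single domain or cover a limited set of topics, and manual annotation for accurate labeling is time-consuming and labor-intensive. To address these challenges, we propose using platform rules, readily available expert-curated content, and large language models to bypass the need for human annotation. Our approach produces a multidomain benchmark comprising 4,498 topical claims and 30,961 arguments from three sources, spanning 21 domains. We benchmark the dataset in fully supervised, zero-shot, and few-shot settings, shedding light on the strengths and limitations of different methods. The dataset and code for this study are released at a location hidden for anonymity.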

[NLP-23] RoRA-VLM: Robust Retrieval-Augmented Vision Language Models

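Quick read: This paper addresses the weak performance of vision-language models (VLMs) on knowledge-intensive tasks, which stems mainly from the difficulty of accurately associating visual objects and scenes with their corresponding entities and background knowledge. The proposed solution is RORA-VLM, a novel and robust retrieval augmentation framework designed specifically for VLMs. Its key innovations are: 1) a two-stage retrieval process with image-anchored textual-query expansion that combines visual and textual information to retrieve the most relevant multimodal knowledge snippets; 2) greater resilience to irrelevant information in the retrieved multimodal knowledge, achieved by injecting adversarial noise into the retrieval-augmented training process and by filtering out extraneous visual information with a query-oriented visual token refinement strategy.

Link: https://arxiv.org/abs/2410.08876
Authors: Jingyuan Qi,Zhiyang Xu,Rulin Shao,Yang Chen,Jing Di,Yu Cheng,Qifan Wang,Lifu Huang
Keywords: Current vision-language models, multimodal knowledge snippets, retrieved multimodal knowledge, exhibit inferior performance, Current vision-language
Categories: Computation and Language (cs.CL)
Comments: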

Abstract:Current vision-language models (VLMs) still exhibit inferior performance on knowledge-intensive tasks, primarily due to the challenge of accurately encoding all the associations between visual objects and scenes to their corresponding entities and background knowledge. While retrieval augmentation methods offer an efficient way to integrate external knowledge, extending them to vision-language domain presents unique challenges in (1) precisely retrieving relevant information from external sources due to the inherent discrepancy within the multimodal queries, and (2) being resilient to the irrelevant, extraneous and noisy information contained in the retrieved multimodal knowledge snippets. In this work, we introduce RORA-VLM, a novel and robust retrieval augmentation framework specifically tailored for VLMs, with two key innovations: (1) a 2-stage retrieval process with image-anchored textual-query expansion to synergistically combine the visual and textual information in the query and retrieve the most relevant multimodal knowledge snippets; and (2) a robust retrieval augmentation method that strengthens the resilience of VLMs against irrelevant information in the retrieved multimodal knowledge by injecting adversarial noises into the retrieval-augmented training process, and filters out extraneous visual information, such as unrelated entities presented in images, via a query-oriented visual token refinement strategy. We conduct extensive experiments to validate the effectiveness and robustness of our proposed methods on three widely adopted benchmark datasets. Our results demonstrate that with a minimal amount of training instance, RORA-VLM enables the base model to achieve significant performance improvement and constantly outperform state-of-the-art retrieval-augmented VLMs on all benchmarks while also exhibiting a novel zero-shot domain transfer capability.
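Summary: Current vision-language models (VLMs) still perform poorly on knowledge-intensive tasks, mainly because it is hard to accurately encode all the associations between visual objects and scenes and their corresponding entities and background knowledge. Retrieval augmentation offers an efficient way to integrate external knowledge, but extending it to the vision-language domain raises two challenges: (1) precisely retrieving relevant information from external sources despite the inherent discrepancy within multimodal queries, and (2) staying resilient to the irrelevant, extraneous, and noisy information in the retrieved multimodal knowledge snippets. We introduce RORA-VLM, a novel and robust retrieval augmentation framework tailored to VLMs, with two key innovations: (1) a 2-stage retrieval process with image-anchored textual-query expansion that combines the visual and textual information in the query to retrieve the most relevant multimodal knowledge snippets; and (2) a robust retrieval augmentation method that strengthens VLMs' resilience to irrelevant information in the retrieved multimodal knowledge by injecting adversarial noise into the retrieval-augmented training process, and that filters out extraneous visual information, such as unrelated entities in images, through a query-oriented visual token refinement strategy. We run extensive experiments on three widely adopted benchmark datasets to validate the effectiveness and robustness of the proposed methods. The results show that, with a minimal number of training instances, RORA-VLM lets the base model achieve significant performance improvements and consistently outperform state-of-the-art retrieval-augmented VLMs on all benchmarks, while also exhibiting a novel zero-shot domain transfer capability.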

[NLP-24] Audio Description Generation in the Era of LLMs and VLMs: A Review of Transferable Generative AI Technologies

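Quick read: This paper addresses the high labor cost and long turnaround of producing audio descriptions (ADs). The key to the solution is to use recent advances in natural language processing (NLP) and computer vision (CV), particularly large language models (LLMs) and vision-language models (VLMs), to move toward automatic AD generation. The paper reviews how these technologies can be applied to generate ADs and identifies essential directions for future research.

Link: https://arxiv.org/abs/2410.08860
Authors: Yingqiang Gao,Lukas Fischer,Alexa Lintner,Sarah Ebling
Keywords: assist blind persons, acoustic commentaries designed, accessing digital media, digital media content, Audio descriptions
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: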

Abstract:Audio descriptions (ADs) function as acoustic commentaries designed to assist blind persons and persons with visual impairments in accessing digital media content on television and in movies, among other settings. As an accessibility service typically provided by trained AD professionals, the generation of ADs demands significant human effort, making the process both time-consuming and costly. Recent advancements in natural language processing (NLP) and computer vision (CV), particularly in large language models (LLMs) and vision-language models (VLMs), have allowed for getting a step closer to automatic AD generation. This paper reviews the technologies pertinent to AD generation in the era of LLMs and VLMs: we discuss how state-of-the-art NLP and CV technologies can be applied to generate ADs and identify essential research directions for the future.
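Summary: Audio descriptions (ADs) are acoustic commentaries designed to help blind persons and persons with visual impairments access digital media content on television, in movies, and in other settings. As an accessibility service typically provided by trained AD professionals, AD generation demands significant human effort, which makes the process time-consuming and costly. Recent advances in natural language processing (NLP) and computer vision (CV), particularly in large language models (LLMs) and vision-language models (VLMs), have brought automatic AD generation a step closer. This paper reviews the technologies relevant to AD generation in the era of LLMs and VLMs: we discuss how state-of-the-art NLP and CV technologies can be applied to generate ADs and identify essential research directions for the future.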

[NLP-25] Measuring the Inconsistency of Large Language Models in Preferential Ranking

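Quick read: This paper addresses the inability of large language models (LLMs) to provide consistent ordinal preference rankings. The key to the solution is a formalization of consistency based on order theory, with criteria such as transitivity, asymmetry, reversibility, and independence from irrelevant alternatives. Diagnostic experiments show that current LLMs fail these criteria, with strong positional bias and poor transitivity in particular, underscoring the need for further research on these limitations.

Link: https://arxiv.org/abs/2410.08851
Authors: Xiutian Zhao,Ke Wang,Wei Peng
Keywords: large language models’, hallucination issues persist, rankings remains underexplored, recent advancements, language models’
Categories: Computation and Language (cs.CL)
Comments: In Proceedings of the 1st Workshop on Towards Knowledgeable Language Models (KnowLLM 2024)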

Abstract:Despite large language models’ (LLMs) recent advancements, their bias and hallucination issues persist, and their ability to offer consistent preferential rankings remains underexplored. This study investigates the capacity of LLMs to provide consistent ordinal preferences, a crucial aspect in scenarios with dense decision space or lacking absolute answers. We introduce a formalization of consistency based on order theory, outlining criteria such as transitivity, asymmetry, reversibility, and independence from irrelevant alternatives. Our diagnostic experiments on selected state-of-the-art LLMs reveal their inability to meet these criteria, indicating a strong positional bias and poor transitivity, with preferences easily swayed by irrelevant alternatives. These findings highlight a significant inconsistency in LLM-generated preferential rankings, underscoring the need for further research to address these limitations.
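Summary: Despite the recent advances of large language models (LLMs), their bias and hallucination issues persist, and their ability to offer consistent preferential rankings remains underexplored. This study investigates the capacity of LLMs to provide consistent ordinal preferences, a crucial aspect in scenarios with a dense decision space or no absolute answers. We introduce a formalization of consistency based on order theory, with criteria such as transitivity, asymmetry, reversibility, and independence from irrelevant alternatives. Our diagnostic experiments on selected state-of-the-art LLMs show that they fail to meet these criteria, exhibiting strong positional bias and poor transitivity, with preferences easily swayed by irrelevant alternatives. These findings highlight significant inconsistency in LLM-generated preferential rankings and underscore the need for further research to address these limitations.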
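
Two of the criteria named in this entry, transitivity and asymmetry, can be checked mechanically on a set of pairwise preferences. The sketch below is not the paper's evaluation protocol: the representation of preferences as (winner, loser) pairs and the function names are assumptions chosen for illustration.

```python
from itertools import permutations


def transitivity_violations(prefers):
    """Return triples (a, b, c) where a > b and b > c hold but a > c does not."""
    items = {x for pair in prefers for x in pair}
    return sorted(
        (a, b, c)
        for a, b, c in permutations(items, 3)
        if (a, b) in prefers and (b, c) in prefers and (a, c) not in prefers
    )


def asymmetry_violations(prefers):
    """Return pairs that are preferred in both directions."""
    return sorted((a, b) for (a, b) in prefers if a < b and (b, a) in prefers)


# Hypothetical model outputs: each pair means "first item preferred over second".
cyclic = {("A", "B"), ("B", "C"), ("C", "A")}
consistent = {("A", "B"), ("B", "C"), ("A", "C")}

cyclic_issues = transitivity_violations(cyclic)
consistent_issues = transitivity_violations(consistent)
two_way = asymmetry_violations({("A", "B"), ("B", "A")})
```

A cycle such as A over B, B over C, C over A produces one violation per rotation of the triple, while the consistent ordering produces none.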

[NLP-26] Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization

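Quick read: This paper addresses "likelihood displacement" in training language models with Direct Preference Optimization (DPO) and its variants: although the model is trained to generate preferred responses more often, the likelihood of those responses often decreases during training. The paper shows that this can shift probability mass from preferred responses to responses with the opposite meaning, with catastrophic consequences; for example, when aligning a model to refuse unsafe prompts, it may move from refusals to harmful responses. The key to the solution is to identify and filter out the training samples that drive likelihood displacement, using the centered hidden embedding similarity (CHES) score. The CHES score identifies which samples in a dataset contribute most to likelihood displacement, so data filtering can mitigate unintentional unalignment.

Link: https://arxiv.org/abs/2410.08847
Authors: Noam Razin,Sadhika Malladi,Adithya Bhaskar,Danqi Chen,Sanjeev Arora,Boris Hanin
Keywords: Direct Preference Optimization, Direct Preference, Preference Optimization, likelihood displacement, variants are increasingly
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: Code available at this https URL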

Abstract:Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences. Although these methods are designed to teach a model to generate preferred responses more frequently relative to dispreferred responses, prior work has observed that the likelihood of preferred responses often decreases during training. The current work sheds light on the causes and implications of this counter-intuitive phenomenon, which we term likelihood displacement. We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with an opposite meaning. As a simple example, training a model to prefer \texttt{No} over \texttt{Never} can sharply increase the probability of \texttt{Yes}. Moreover, when aligning the model to refuse unsafe prompts, we show that such displacement can unintentionally lead to unalignment, by shifting probability mass from preferred refusal responses to harmful responses (e.g., reducing the refusal rate of Llama-3-8B-Instruct from 74.4% to 33.4%). We theoretically characterize that likelihood displacement is driven by preferences that induce similar embeddings, as measured by a centered hidden embedding similarity (CHES) score. Empirically, the CHES score enables identifying which training samples contribute most to likelihood displacement in a given dataset. Filtering out these samples effectively mitigated unintentional unalignment in our experiments. More broadly, our results highlight the importance of curating data with sufficiently distinct preferences, for which we believe the CHES score may prove valuable.
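Summary: Direct Preference Optimization (DPO) and its variants are increasingly used to align language models with human preferences. Although these methods are designed to teach a model to generate preferred responses more often than dispreferred ones, prior work has observed that the likelihood of preferred responses often decreases during training. This work sheds light on the causes and implications of this counter-intuitive phenomenon, which we term likelihood displacement. We show that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with the opposite meaning. As a simple example, training a model to prefer \texttt{No} over \texttt{Never} can sharply increase the probability of \texttt{Yes}. Moreover, when aligning a model to refuse unsafe prompts, such displacement can unintentionally lead to unalignment by shifting probability mass from preferred refusal responses to harmful responses (e.g., reducing the refusal rate of Llama-3-8B-Instruct from 74.4% to 33.4%). We characterize theoretically that likelihood displacement is driven by preferences that induce similar embeddings, as measured by a centered hidden embedding similarity (CHES) score. Empirically, the CHES score identifies which training samples contribute most to likelihood displacement in a given dataset, and filtering out these samples effectively mitigated unintentional unalignment in our experiments. More broadly, our results highlight the importance of curating data with sufficiently distinct preferences, for which we believe the CHES score may prove valuable.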
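
For context, the standard DPO objective helps explain why the likelihood of a preferred response can fall even as training proceeds. The formula below is the usual DPO loss as commonly stated, not something taken from this abstract; the symbols $y^{+}$ (preferred response), $y^{-}$ (dispreferred response), $\pi_{\mathrm{ref}}$ (reference model), and $\beta$ (temperature) are the conventional notation and are assumptions here.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, y^{+},\, y^{-})}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_{\theta}(y^{+} \mid x)}{\pi_{\mathrm{ref}}(y^{+} \mid x)}
        - \beta \log \frac{\pi_{\theta}(y^{-} \mid x)}{\pi_{\mathrm{ref}}(y^{-} \mid x)}
      \right)
    \right]
```

The loss depends only on the difference between the two log-ratios. It can therefore keep decreasing while $\log \pi_{\theta}(y^{+} \mid x)$ itself falls, as long as $\log \pi_{\theta}(y^{-} \mid x)$ falls faster, and the probability mass released in the process is free to move to other responses.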

[NLP-27] Enhancing Indonesian Automatic Speech Recognition: Evaluating Multilingual Models with Diverse Speech Variabilities

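Quick read: This paper addresses recognition accuracy problems in Indonesian automatic speech recognition (ASR) caused by insufficiently diverse training data. The key to the solution is to apply state-of-the-art speech recognition models, Massively Multilingual Speech (MMS) and Whisper, and to compile an Indonesian dataset covering multiple speech variabilities. The results show that the fine-tuned Whisper model achieves the best results across datasets, as indicated by lower word error rate (WER) and character error rate (CER), and that speaking style variability affects model performance the most.

Link: https://arxiv.org/abs/2410.08828
Authors: Aulia Adila,Dessi Lestari,Ayu Purwarianti,Dipta Tanaya,Kurniawati Azizah,Sakriani Sakti
Keywords: background noise conditions, speech, ideal speech recognition, noise conditions, Indonesian
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: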

Abstract:An ideal speech recognition model has the capability to transcribe speech accurately under various characteristics of speech signals, such as speaking style (read and spontaneous), speech context (formal and informal), and background noise conditions (clean and moderate). Building such a model requires a significant amount of training data with diverse speech characteristics. Currently, Indonesian data is dominated by read, formal, and clean speech, leading to a scarcity of Indonesian data with other speech variabilities. To develop Indonesian automatic speech recognition (ASR), we present our research on state-of-the-art speech recognition models, namely Massively Multilingual Speech (MMS) and Whisper, as well as compiling a dataset comprising Indonesian speech with variabilities to facilitate our study. We further investigate the models’ predictive ability to transcribe Indonesian speech data across different variability groups. The best results were achieved by the Whisper fine-tuned model across datasets with various characteristics, as indicated by the decrease in word error rate (WER) and character error rate (CER). Moreover, we found that speaking style variability affected model performance the most.
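WER and CER are both an edit distance divided by the reference length, counted over words and characters respectively. Below is a minimal sketch, assuming a non-empty reference and no text normalization beyond whitespace splitting. The Indonesian sentence is a made-up example, not data from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


reference = "saya makan nasi goreng"   # made-up example sentence
hypothesis = "saya makan nasi"         # the recognizer dropped the last word

print(wer(reference, hypothesis))  # 1 deleted word out of 4
print(cer(reference, hypothesis))  # 7 deleted characters out of 22
```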

[NLP-28] Retriever-and-Memory: Towards Adaptive Note-Enhanced Retrieval-Augmented Generation

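Quick read: This paper addresses a weakness of existing retrieval-augmented generation (RAG) methods on complex question answering: they predict retrieval timing inaccurately and do not sufficiently consider previously retrieved knowledge, which leads to insufficient information gathering and interaction and, in turn, low-quality answers. The key to the solution is a generic RAG approach called Adaptive Note-Enhanced RAG (Adaptive-Note). It combines an iterative information collector, an adaptive memory reviewer, and a task-oriented generator under a new Retriever-and-Memory paradigm, maintains an overarching view of knowledge growth, and uses an adaptive, note-based stop-exploration strategy to encourage sufficient information exploration and high-quality knowledge interaction.

Link: https://arxiv.org/abs/2410.08821
Authors: Ruobing Wang,Daren Zha,Shi Yu,Qingfei Zhao,Yuxuan Chen,Yixuan Wang,Shuo Wang,Yukun Yan,Zhenghao Liu,Xu Han,Zhiyuan Liu,Maosong Sun
Keywords: Large Language Models, Language Models, Large Language, hallucinated outputs generated, open-domain question-answering tasks
Categories: Computation and Language (cs.CL)
Comments: 15 pages, 2 figures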

Abstract:Retrieval-Augmented Generation (RAG) mitigates issues of the factual errors and hallucinated outputs generated by Large Language Models (LLMs) in open-domain question-answering tasks (OpenQA) via introducing external knowledge. For complex QA, however, existing RAG methods use LLMs to actively predict retrieval timing and directly use the retrieved information for generation, regardless of whether the retrieval timing accurately reflects the actual information needs, or sufficiently considers prior retrieved knowledge, which may result in insufficient information gathering and interaction, yielding low-quality answers. To address these, we propose a generic RAG approach called Adaptive Note-Enhanced RAG (Adaptive-Note) for complex QA tasks, which includes the iterative information collector, adaptive memory reviewer, and task-oriented generator, while following a new Retriever-and-Memory paradigm. Specifically, Adaptive-Note introduces an overarching view of knowledge growth, iteratively gathering new information in the form of notes and updating them into the existing optimal knowledge structure, enhancing high-quality knowledge interactions. In addition, we employ an adaptive, note-based stop-exploration strategy to decide “what to retrieve and when to stop” to encourage sufficient knowledge exploration. We conduct extensive experiments on five complex QA datasets, and the results demonstrate the superiority and effectiveness of our method and its components. The code and data are at this https URL.
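The collect, review, and stop cycle described in the abstract can be pictured as a small loop. This is only a rough sketch of the idea, not the authors' implementation. Every function name is hypothetical, and the toy retriever and the length-based "is the new note better" check stand in for what would be LLM calls.

```python
def adaptive_note_loop(question, retrieve, write_note, is_better, next_query,
                       max_steps=5):
    """Grow a note iteratively; stop when a new note adds nothing."""
    best_note, query = "", question
    for _ in range(max_steps):
        passages = retrieve(query)
        candidate = write_note(question, passages, best_note)
        if not is_better(question, candidate, best_note):
            break                      # note-based stop-exploration
        best_note = candidate          # keep the best knowledge so far
        query = next_query(question, best_note)
    return best_note


# --- Toy stand-ins (assumptions for illustration only) ---
CORPUS = ["Marie Curie was born in Warsaw.",
          "Warsaw is the capital of Poland."]


def _words(text):
    return {w.strip("?.").lower() for w in text.split()}


def retrieve(query):
    return [p for p in CORPUS if _words(query) & _words(p)]


def write_note(question, passages, note):
    kept = note.split("\n") if note else []
    return "\n".join(kept + [p for p in passages if p not in kept])


def is_better(question, candidate, note):
    return len(candidate) > len(note)


def next_query(question, note):
    return note  # search again using what the note already contains


note = adaptive_note_loop("Where was Marie Curie born?",
                          retrieve, write_note, is_better, next_query)
print(note)
```

In this toy run the first retrieval finds only the birthplace sentence. The second retrieval uses the note itself as the query and pulls in the sentence about Warsaw. The third adds nothing new, so the loop stops.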

[NLP-29] Which Demographics do LLMs Default to During Annotation?

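Quick read: This paper asks which demographic groups a large language model (LLM) tends to mimic when annotating text if no annotator demographics are specified. The key to the solution is to compare different prompt conditions (non-demographic, placebo, and demographics-conditioned prompts) and to evaluate which annotator attributes the LLM reflects when labeling text for politeness and offensiveness. The results show notable influences of gender, race, and age in demographic prompting, in contrast to previous studies that found no such effects.

Link: https://arxiv.org/abs/2410.08820
Authors: Christopher Bagdon,Aidan Combs,Lynn Greschner,Roman Klinger,Jiahui Li,Sean Papay,Nadine Probol,Yarik Menchaca Resendiz,Johannes Schäfer,Aswathy Velutharambath,Sabine Weber,Amelie Wührl
Keywords: woman might find, find it offensive, teenager might find, cultural background, assign in text
Categories: Computation and Language (cs.CL)
Comments: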

Abstract:Demographics and cultural background of annotators influence the labels they assign in text annotation – for instance, an elderly woman might find it offensive to read a message addressed to a “bro”, but a male teenager might find it appropriate. It is therefore important to acknowledge label variations to not under-represent members of a society. Two research directions developed out of this observation in the context of using large language models (LLM) for data annotations, namely (1) studying biases and inherent knowledge of LLMs and (2) injecting diversity in the output by manipulating the prompt with demographic information. We combine these two strands of research and ask the question to which demographics an LLM resorts to when no demographics is given. To answer this question, we evaluate which attributes of human annotators LLMs inherently mimic. Furthermore, we compare non-demographic conditioned prompts and placebo-conditioned prompts (e.g., “you are an annotator who lives in house number 5”) to demographics-conditioned prompts (“You are a 45 year old man and an expert on politeness annotation. How do you rate instance”). We study these questions for politeness and offensiveness annotations on the POPQUORN data set, a corpus created in a controlled manner to investigate human label variations based on demographics which has not been used for LLM-based analyses so far. We observe notable influences related to gender, race, and age in demographic prompting, which contrasts with previous studies that found no such effects.
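The three prompt conditions differ only in the persona sentence placed before the annotation instruction. Below is a minimal sketch of how such prompts could be assembled. The two persona sentences are quoted from the abstract. The instruction wording, the 1 to 5 scale, and the sample message are assumptions for illustration.

```python
def build_prompt(message, persona=None):
    """Prefix the annotation instruction with an optional persona sentence."""
    instruction = (
        "How polite is the following message on a scale from 1 to 5?\n"
        f'Message: "{message}"'
    )  # hypothetical instruction wording
    return f"{persona} {instruction}" if persona else instruction


CONDITIONS = {
    "no_demographics": None,
    "placebo": "You are an annotator who lives in house number 5.",
    "demographic": ("You are a 45 year old man and an expert on "
                    "politeness annotation."),
}

prompts = {name: build_prompt("Thanks, bro!", persona)
           for name, persona in CONDITIONS.items()}

for name, prompt in prompts.items():
    print(f"--- {name} ---\n{prompt}\n")
```

Comparing the model's labels across these conditions separates the effect of demographic content from the effect of merely having a persona sentence, which is what the placebo condition controls for.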

[NLP-30] StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-time Hybrid Information Structurization

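Quick read: This paper addresses the poor performance of existing retrieval-augmented generation (RAG) methods on knowledge-intensive reasoning tasks, where the useful information is scattered and therefore hard to identify accurately and to reason over globally. The key to the solution is a new framework, StructRAG, which identifies the optimal structure type for the task at hand, reconstructs the original documents into that structured format, and infers answers from the resulting structure. StructRAG achieves state-of-the-art performance across various knowledge-intensive tasks, particularly in challenging scenarios, which indicates its potential for enhancing large language models (LLMs) in complex real-world applications.

Link: https://arxiv.org/abs/2410.08815
Authors: Zhuoqun Li,Xuanang Chen,Haiyang Yu,Hongyu Lin,Yaojie Lu,Qiaoyu Tang,Fei Huang,Xianpei Han,Le Sun,Yongbin Li
Keywords: large language models, effectively enhance large, enhance large language, Retrieval-augmented generation, existing RAG methods
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: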

Abstract:Retrieval-augmented generation (RAG) is a key means to effectively enhance large language models (LLMs) in many knowledge-based tasks. However, existing RAG methods struggle with knowledge-intensive reasoning tasks, because useful information required to these tasks are badly scattered. This characteristic makes it difficult for existing RAG methods to accurately identify key information and perform global reasoning with such noisy augmentation. In this paper, motivated by the cognitive theories that humans convert raw information into various structured knowledge when tackling knowledge-intensive reasoning, we proposes a new framework, StructRAG, which can identify the optimal structure type for the task at hand, reconstruct original documents into this structured format, and infer answers based on the resulting structure. Extensive experiments across various knowledge-intensive tasks show that StructRAG achieves state-of-the-art performance, particularly excelling in challenging scenarios, demonstrating its potential as an effective solution for enhancing LLMs in complex real-world applications.

[NLP-31] A Social Context-aware Graph-based Multimodal Attentive Learning Framework for Disaster Content Classification during Emergencies

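Quick read: This paper addresses the classification of the large volume of unfiltered and diverse disaster-related content shared on social media during crises, with the aim of making disaster response and public safety efforts more effective. The key to the solution is CrisisSpot, a method that uses a graph-based neural network to capture complex relationships between textual and visual modalities and adds Social Context Features to incorporate user-centric and content-centric information. It also introduces Inverted Dual Embedded Attention (IDEA), which captures both harmonious and contrasting patterns in the data to enhance multimodal interactions. In experiments, CrisisSpot improves the average F1-score over state-of-the-art methods by 9.45% on the publicly available CrisisMMD dataset and by 5.01% on the newly introduced TSEqD dataset.

Link: https://arxiv.org/abs/2410.08814
Authors: Shahid Shafi Dar,Mohammad Zia Ur Rehman,Karan Bais,Mohammed Abdul Haseeb,Nagendra Kumara
Keywords: social media platforms, effective disaster response, times of crisis, public safety, prompt and precise
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: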

Abstract:In times of crisis, the prompt and precise classification of disaster-related information shared on social media platforms is crucial for effective disaster response and public safety. During such critical events, individuals use social media to communicate, sharing multimodal textual and visual content. However, due to the significant influx of unfiltered and diverse data, humanitarian organizations face challenges in leveraging this information efficiently. Existing methods for classifying disaster-related content often fail to model users’ credibility, emotional context, and social interaction information, which are essential for accurate classification. To address this gap, we propose CrisisSpot, a method that utilizes a Graph-based Neural Network to capture complex relationships between textual and visual modalities, as well as Social Context Features to incorporate user-centric and content-centric information. We also introduce Inverted Dual Embedded Attention (IDEA), which captures both harmonious and contrasting patterns within the data to enhance multimodal interactions and provide richer insights. Additionally, we present TSEqD (Turkey-Syria Earthquake Dataset), a large annotated dataset for a single disaster event, containing 10,352 samples. Through extensive experiments, CrisisSpot demonstrated significant improvements, achieving an average F1-score gain of 9.45% and 5.01% compared to state-of-the-art methods on the publicly available CrisisMMD dataset and the TSEqD dataset, respectively.

[NLP-32] PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning

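Quick read: This paper addresses the vulnerability of large language models (LLMs) to data poisoning attacks during preference learning. The key to the solution is PoisonBench, a benchmark for evaluating LLMs' susceptibility to data poisoning during preference learning. Using two distinct attack types across eight realistic scenarios, the study assesses 21 widely used models, exposes their vulnerability to poisoning, and highlights the need for more robust defenses against malicious models and data manipulation.

Link: https://arxiv.org/abs/2410.08811
Authors: Tingchen Fu,Mrinank Sharma,Philip Torr,Shay B. Cohen,David Krueger,Fazl Barez
Keywords: aligning current LLMs, Preference learning, data poisoning, data poisoning attacks, central component
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Tingchen Fu and Fazl Barez are core research contributors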

Abstract:Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks. To address this concern, we introduce PoisonBench, a benchmark for evaluating large language models’ susceptibility to data poisoning during preference learning. Data poisoning attacks can manipulate large language model responses to include hidden malicious content or biases, potentially causing the model to generate harmful or unintended outputs while appearing to function normally. We deploy two distinct attack types across eight realistic scenarios, assessing 21 widely-used models. Our findings reveal concerning trends: (1) Scaling up parameter size does not inherently enhance resilience against poisoning attacks; (2) There exists a log-linear relationship between the effects of the attack and the data poison ratio; (3) The effect of data poisoning can generalize to extrapolated triggers that are not included in the poisoned data. These results expose weaknesses in current preference learning techniques, highlighting the urgent need for more robust defenses against malicious models and data manipulation.

[NLP-33] Data Processing for the OpenGPT-X Model Family

Quick read: This paper addresses large-scale multilingual data preparation in the OpenGPT-X project, specifically how to process and prepare the data used to train multilingual large language models (LLMs). The key to the solution is to distinguish curated data from web data and to build a dedicated pipeline for each: curated data undergoes minimal filtering, whereas web data requires extensive filtering and deduplication. The paper also provides an in-depth analysis of the datasets, which increases transparency and alignment with European data regulations.

Link: https://arxiv.org/abs/2410.08800
Authors: Nicolo’ Brandizzi,Hammam Abdelwahab,Anirban Bhowmick,Lennard Helmer,Benny Jörg Stein,Pavel Denisov,Qasid Saleem,Michael Fromm,Mehdi Ali,Richard Rutmann,Farzad Naderi,Mohamad Saif Agy,Alexander Schwirjow,Fabian Küch,Luzian Hahn,Malte Ostendorff,Pedro Ortiz Suarez,Georg Rehm,Dennis Wegener,Nicolas Flores-Herr,Joachim Köhler,Johannes Leveling
Keywords: large language models, large-scale initiative aimed, high-performance multilingual large, multilingual large language, paper presents
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This paper presents a comprehensive overview of the data preparation pipeline developed for the OpenGPT-X project, a large-scale initiative aimed at creating open and high-performance multilingual large language models (LLMs). The project goal is to deliver models that cover all major European languages, with a particular focus on real-world applications within the European Union. We explain all data processing steps, starting with the data selection and requirement definition to the preparation of the final datasets for model training. We distinguish between curated data and web data, as each of these categories is handled by distinct pipelines, with curated data undergoing minimal filtering and web data requiring extensive filtering and deduplication. This distinction guided the development of specialized algorithmic solutions for both pipelines. In addition to describing the processing methodologies, we provide an in-depth analysis of the datasets, increasing transparency and alignment with European data regulations. Finally, we share key insights and challenges faced during the project, offering recommendations for future endeavors in large-scale multilingual data preparation for LLMs.
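The curated-versus-web split amounts to two pipelines with very different amounts of filtering. Below is a minimal sketch of that contrast. It assumes a word-count threshold and exact-match deduplication on normalized text, which are illustrative choices and not the project's actual filters. All function names and sample documents are hypothetical.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash equally."""
    return " ".join(text.lower().split())


def filter_curated(docs):
    """Curated data: minimal filtering (here: drop empty documents only)."""
    return [d for d in docs if d.strip()]


def filter_web(docs, min_words=5):
    """Web data: length filter plus exact deduplication on normalized text."""
    seen, kept = set(), []
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:
            continue                     # too short to be useful text
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue                     # duplicate of an earlier document
        seen.add(digest)
        kept.append(doc)
    return kept


web_docs = [
    "Das ist ein kurzer deutscher Beispielsatz.",
    "das  ist ein kurzer deutscher beispielsatz.",   # near-identical copy
    "Click here",                                    # navigation residue
    "Ceci est une autre phrase d'exemple.",
]
print(filter_web(web_docs))
```

Production pipelines usually add near-duplicate detection (for example MinHash) and language or quality classifiers. The sketch covers only the simplest case.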

[NLP-34] On the State of NLP Approaches to Modeling Depression in Social Media: A Post-COVID-19 Outlook

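Quick read: This paper surveys how natural language processing (NLP) techniques are used to model depression in social media, with a post-COVID-19 outlook. It outlines how state-of-the-art NLP approaches and new datasets have been used in light of the pandemic's impact on mental health. It also discusses the ethical issues involved in collecting and processing mental health data, including fairness, accountability, and ethics.

Link: https://arxiv.org/abs/2410.08793
Authors: Ana-Maria Bucur,Andreea-Codrina Moldovan,Krutika Parvatikar,Marcos Zampieri,Ashiqur R. KhudaBukhsh,Liviu P. Dinu
Keywords: mental health conditions, predicting mental health, mental health, Computational approaches, past years
Categories: Computation and Language (cs.CL)
Comments: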

Abstract:Computational approaches to predicting mental health conditions in social media have been substantially explored in the past years. Multiple surveys have been published on this topic, providing the community with comprehensive accounts of the research in this area. Among all mental health conditions, depression is the most widely studied due to its worldwide prevalence. The COVID-19 global pandemic, starting in early 2020, has had a great impact on mental health worldwide. Harsh measures employed by governments to slow the spread of the virus (e.g., lockdowns) and the subsequent economic downturn experienced in many countries have significantly impacted people’s lives and mental health. Studies have shown a substantial increase of above 50% in the rate of depression in the population. In this context, we present a survey on natural language processing (NLP) approaches to modeling depression in social media, providing the reader with a post-COVID-19 outlook. This survey contributes to the understanding of the impacts of the pandemic on modeling depression in social media. We outline how state-of-the-art approaches and new datasets have been used in the context of the COVID-19 pandemic. Finally, we also discuss ethical issues in collecting and processing mental health data, considering fairness, accountability, and ethics.

[NLP-35] Integrating Supertag Features into Neural Discontinuous Constituent Parsing

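Quick read: This paper addresses the difficulty that traditional constituency-based syntactic analysis has with non-local dependencies, which are common in languages such as German. The key to the solution is to introduce supertag information into transition-based discontinuous constituent parsing. The thesis explores two ways of doing so: feeding the output of a dedicated supertagger into the neural parser as additional input (pipeline), and jointly training one neural model for parsing and supertagging (multi-task learning). It also compares several frameworks (CCG, LTAG-spinal, LCFRS) and sequence labelling tasks (chunking, dependency parsing) in terms of their suitability as auxiliary tasks.

Link: https://arxiv.org/abs/2410.08766
Authors: Lukas Mielczarek
Keywords: natural-language processing, essential in natural-language, widely used description, parsing, DPTB for English
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
Comments: Bachelor’s Thesis. Supervised by Dr. Kilian Evang and Univ.-Prof. Dr. Laura Kallmeyer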

Abstract:Syntactic parsing is essential in natural-language processing, with constituent structure being one widely used description of syntax. Traditional views of constituency demand that constituents consist of adjacent words, but this poses challenges in analysing syntax with non-local dependencies, common in languages like German. Therefore, in a number of treebanks like NeGra and TIGER for German and DPTB for English, long-range dependencies are represented by crossing edges. Various grammar formalisms have been used to describe discontinuous trees - often with high time complexities for parsing. Transition-based parsing aims at reducing this factor by eliminating the need for an explicit grammar. Instead, neural networks are trained to produce trees given raw text input using supervised learning on large annotated corpora. An elegant proposal for a stack-free transition-based parser developed by Coavoux and Cohen (2019) successfully allows for the derivation of any discontinuous constituent tree over a sentence in worst-case quadratic time. The purpose of this work is to explore the introduction of supertag information into transition-based discontinuous constituent parsing. In lexicalised grammar formalisms like CCG (Steedman, 1989) informative categories are assigned to the words in a sentence and act as the building blocks for composing the sentence’s syntax. These supertags indicate a word’s structural role and syntactic relationship with surrounding items. The study examines incorporating supertag information by using a dedicated supertagger as additional input for a neural parser (pipeline) and by jointly training a neural model for both parsing and supertagging (multi-task). In addition to CCG, several other frameworks (LTAG-spinal, LCFRS) and sequence labelling tasks (chunking, dependency parsing) will be compared in terms of their suitability as auxiliary tasks for parsing.

[NLP-36] Measuring the Groundedness of Legal Question-Answering Systems EMNLP2024

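Quick read: This paper addresses the accuracy and trustworthiness of generative AI systems in legal question answering; the key is to assess the groundedness of generated responses. The approach benchmarks similarity-based metrics and natural language inference models that judge whether a response is well-founded in the given context, and explores different prompting strategies for large language models to improve the detection of ungrounded responses. The methods are validated on a newly created grounding classification corpus, where the best method reaches a macro-F1 score of 0.8, and their latency is measured to determine whether they suit real-world pipelines that may trigger additional manual verification or automated response regeneration.

Link: https://arxiv.org/abs/2410.08764
Authors: Dietrich Trautmann,Natalia Ostapuk,Quentin Grail,Adrian Alan Pol,Guglielmo Bonifazi,Shang Gao,Martin Gajek
Keywords: paramount importance, high-stakes domains, legal question-answering, generative AI systems, responses
Categories: Computation and Language (cs.CL)
Comments: to appear NLLP @ EMNLP 2024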

Abstract:In high-stakes domains like legal question-answering, the accuracy and trustworthiness of generative AI systems are of paramount importance. This work presents a comprehensive benchmark of various methods to assess the groundedness of AI-generated responses, aiming to significantly enhance their reliability. Our experiments include similarity-based metrics and natural language inference models to evaluate whether responses are well-founded in the given contexts. We also explore different prompting strategies for large language models to improve the detection of ungrounded responses. We validated the effectiveness of these methods using a newly created grounding classification corpus, designed specifically for legal queries and corresponding responses from retrieval-augmented prompting, focusing on their alignment with source material. Our results indicate potential in groundedness classification of generated responses, with the best method achieving a macro-F1 score of 0.8. Additionally, we evaluated the methods in terms of their latency to determine their suitability for real-world applications, as this step typically follows the generation process. This capability is essential for processes that may trigger additional manual verification or automated response regeneration. In summary, this study demonstrates the potential of various detection methods to improve the trustworthiness of generative AI in legal settings.
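Macro-F1, the metric reported for groundedness classification, is the unweighted mean of the per-class F1 scores, so the rarer "ungrounded" class counts as much as the "grounded" one. Below is a minimal sketch. The label names and the toy predictions are made up for illustration and are not the paper's data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels that occur."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (illustration only).
y_true = ["grounded", "grounded", "grounded", "grounded",
          "ungrounded", "ungrounded"]
y_pred = ["grounded", "grounded", "grounded", "ungrounded",
          "ungrounded", "grounded"]

# grounded: F1 = 0.75, ungrounded: F1 = 0.5
print(macro_f1(y_true, y_pred))
```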

[NLP-37] Developing a Pragmatic Benchmark for Assessing Korean Legal Language Understanding in Large Language Models EMNLP2024

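Quick read: This paper addresses the limited efficacy of large language models (LLMs) on non-standardized tasks and on legal tasks in languages other than English, focusing on Korean law. The key to the solution is KBL, a benchmark for assessing LLMs' understanding of Korean legal language. KBL comprises 7 legal knowledge tasks (510 examples), 4 legal reasoning tasks (288 examples), and the Korean bar exam (4 domains, 53 tasks, 2,510 examples); the first two datasets were developed in close collaboration with lawyers. The paper evaluates LLMs in a closed-book setting, where they rely solely on internal knowledge, and in a retrieval-augmented generation (RAG) setting that uses a corpus of Korean statutes and precedents. The results indicate substantial room for improvement.

Link: https://arxiv.org/abs/2410.08731
Authors: Yeeun Kim,Young Rok Choi,Eunkyung Choi,Jinhwan Choi,Hai Jin Park,Wonseok Hwang
Keywords: Uniform Bar Exam, Large language models, demonstrated remarkable performance, efficacy remains limited, passing the Uniform
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2024 Findings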

Abstract:Large language models (LLMs) have demonstrated remarkable performance in the legal domain, with GPT-4 even passing the Uniform Bar Exam in the U.S. However their efficacy remains limited for non-standardized tasks and tasks in languages other than English. This underscores the need for careful evaluation of LLMs within each legal system before application. Here, we introduce KBL, a benchmark for assessing the Korean legal language understanding of LLMs, consisting of (1) 7 legal knowledge tasks (510 examples), (2) 4 legal reasoning tasks (288 examples), and (3) the Korean bar exam (4 domains, 53 tasks, 2,510 examples). First two datasets were developed in close collaboration with lawyers to evaluate LLMs in practical scenarios in a certified manner. Furthermore, considering legal practitioners’ frequent use of extensive legal documents for research, we assess LLMs in both a closed book setting, where they rely solely on internal knowledge, and a retrieval-augmented generation (RAG) setting, using a corpus of Korean statutes and precedents. The results indicate substantial room and opportunities for improvement.

[NLP-38] From N-grams to Pre-trained Multilingual Models For Language Identification

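Quick read: This paper addresses language identification (LID) across 11 South African languages. For N-gram models, the key finding is that selecting an effective data size is crucial for establishing frequency distributions of the target languages, which in turn improves language ranking. The paper compares N-gram models with a range of pre-trained multilingual models (mBERT, RemBERT, XLM-r, AfriBERTa, Afro-XLMr, AfroLM, and Serengeti) and finds that Serengeti performs best on average. It also proposes a lightweight BERT-based LID model (za_BERT_lid), trained on the NHCLT + Vukzenzele corpus, that performs on par with the best-performing Afri-centric models.

Link: https://arxiv.org/abs/2410.08728
Authors: Thapelo Sindane,Vukosi Marivate
Keywords: South African languages, Large Pre-trained Multilingual, South African, Pre-trained Multilingual models, African languages
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: The paper has been accepted at The 4th International Conference on Natural Language Processing for Digital Humanities (NLP4DH 2024)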

Abstract:In this paper, we investigate the use of N-gram models and Large Pre-trained Multilingual models for Language Identification (LID) across 11 South African languages. For N-gram models, this study shows that effective data size selection remains crucial for establishing effective frequency distributions of the target languages, that efficiently model each language, thus, improving language ranking. For pre-trained multilingual models, we conduct extensive experiments covering a diverse set of massively pre-trained multilingual (PLM) models – mBERT, RemBERT, XLM-r, and Afri-centric multilingual models – AfriBERTa, Afro-XLMr, AfroLM, and Serengeti. We further compare these models with available large-scale Language Identification tools: Compact Language Detector v3 (CLD V3), AfroLID, GlotLID, and OpenLID to highlight the importance of focused-based LID. From these, we show that Serengeti is a superior model across models: N-grams to Transformers on average. Moreover, we propose a lightweight BERT-based LID model (za_BERT_lid) trained with NHCLT + Vukzenzele corpus, which performs on par with our best-performing Afri-centric models.
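An N-gram language identifier builds a character n-gram frequency profile per language and ranks languages by how closely the input matches each profile. Below is a minimal sketch using character trigrams and cosine similarity. These are generic choices, not necessarily the paper's setup, and the two one-sentence "training sets" are far too small for real use.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Frequency profile of character n-grams in lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_languages(text, profiles):
    """Return language codes sorted from best to worst match."""
    grams = char_ngrams(text)
    return sorted(profiles, key=lambda lang: cosine(grams, profiles[lang]),
                  reverse=True)


# Tiny illustrative training texts (Afrikaans and English).
profiles = {
    "afr": char_ngrams("goeie môre, hoe gaan dit met jou? ek is lief vir jou."),
    "eng": char_ngrams("good morning, how are you doing today? i love you."),
}

print(rank_languages("hoe gaan dit", profiles))
print(rank_languages("how are you", profiles))
```

How much text goes into each profile is the "data size" choice that the abstract says matters. With too little text the frequency distribution is noisy, and the ranking degrades for closely related languages.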

[NLP-39] On the token distance modeling ability of higher RoPE attention dimension

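Quick read: This paper asks how rotary position embedding (RoPE) captures long-range dependencies when the context length of a language model is extended. The key to the solution is a dimension-level analysis of the correlation between the hidden dimensions of an attention head and its contribution to capturing long-distance dependencies. This analysis identifies a particular type of attention head, named Positional Heads, that focuses strongly on long-range information interaction and, as ablation shows, plays a pivotal role in long input processing. The paper further shows a correlation between the efficiency of length extrapolation and the extension of these heads' high-dimensional attention allocation.

Link: https://arxiv.org/abs/2410.08703
Authors: Xiangyu Hong,Che Jiang,Biqing Qi,Fandong Meng,Mo Yu,Bowen Zhou,Jie Zhou
Keywords: Rotary position embedding, shown promising results, Rotary position, based on Rotary, position embedding
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: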

Abstract:Length extrapolation algorithms based on Rotary position embedding (RoPE) have shown promising results in extending the context length of language models. However, understanding how position embedding can capture longer-range contextual information remains elusive. Based on the intuition that different dimensions correspond to different frequency of changes in RoPE encoding, we conducted a dimension-level analysis to investigate the correlation between a hidden dimension of an attention head and its contribution to capturing long-distance dependencies. Using our correlation metric, we identified a particular type of attention heads, which we named Positional Heads, from various length-extrapolated models. These heads exhibit a strong focus on long-range information interaction and play a pivotal role in long input processing, as evidence by our ablation. We further demonstrate the correlation between the efficiency of length extrapolation and the extension of the high-dimensional attention allocation of these heads. The identification of Positional Heads provides insights for future research in long-text comprehension.
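The intuition that different dimensions change at different frequencies follows directly from how RoPE assigns a rotation frequency to each pair of dimensions. Below is a minimal sketch of the standard RoPE frequency schedule, assuming the commonly used base of 10000 and a head dimension of 128. Neither value is taken from this paper.

```python
import math


def rope_frequencies(head_dim, base=10000.0):
    """Rotation frequency (radians per token) for each dimension pair."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def rope_wavelengths(head_dim, base=10000.0):
    """Number of tokens after which each dimension pair repeats its rotation."""
    return [2 * math.pi / f for f in rope_frequencies(head_dim, base)]


waves = rope_wavelengths(head_dim=128)

print(f"lowest dimension pair:  {waves[0]:.2f} tokens per rotation")
print(f"highest dimension pair: {waves[-1]:.0f} tokens per rotation")
```

The lowest dimension pair completes a rotation roughly every six tokens, so it can only distinguish nearby positions. The highest pairs rotate so slowly that they stay distinguishable over tens of thousands of tokens. This is consistent with the paper's focus on higher dimensions for long-distance modeling.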

[NLP-40] SocialGaze: Improving the Integration of Human Social Norms in Large Language Models

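Quick read: This paper addresses the misalignment between large language models' (LLMs') judgments of social acceptance and human consensus. The key to the solution is SocialGaze, a multi-step prompting framework in which a language model verbalizes a social situation from multiple perspectives before forming a judgment. Experiments show that SocialGaze improves alignment with human judgments by up to 11 F1 points with the GPT-3.5 model.

Link: https://arxiv.org/abs/2410.08698
Authors: Anvesh Rao Vijjini,Rakesh R. Menon,Jiayi Fu,Shashank Srivastava,Snigdha Chaturvedi
Keywords: research has explored, explored enhancing, enhancing the reasoning, reasoning capabilities, capabilities of large
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: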

Abstract:While much research has explored enhancing the reasoning capabilities of large language models (LLMs) in the last few years, there is a gap in understanding the alignment of these models with social values and norms. We introduce the task of judging social acceptance. Social acceptance requires models to judge and rationalize the acceptability of people’s actions in social situations. For example, is it socially acceptable for a neighbor to ask others in the community to keep their pets indoors at night? We find that LLMs’ understanding of social acceptance is often misaligned with human consensus. To alleviate this, we introduce SocialGaze, a multi-step prompting framework, in which a language model verbalizes a social situation from multiple perspectives before forming a judgment. Our experiments demonstrate that the SocialGaze approach improves the alignment with human judgments by up to 11 F1 points with the GPT-3.5 model. We also identify biases and correlations in LLMs in assigning blame that is related to features such as the gender (males are significantly more likely to be judged unfairly) and age (LLMs are more aligned with humans for older narrators).
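
The multi-step prompting idea can be sketched as follows. This is a rough illustration only: the prompt wording, the `PERSPECTIVES` tuple, and the `llm` callable are all hypothetical stand-ins, and the actual SocialGaze prompts and steps are not given in the abstract.

```python
from typing import Callable, Sequence

# Hypothetical perspectives; the paper's actual choice of viewpoints may differ.
PERSPECTIVES = ("the narrator", "the other people involved")


def judge_social_acceptance(
    situation: str,
    llm: Callable[[str], str],
    perspectives: Sequence[str] = PERSPECTIVES,
) -> str:
    """Ask the model to verbalize each perspective, then ask for a judgment."""
    descriptions = []
    for who in perspectives:
        prompt = (
            f"Situation: {situation}\n"
            f"Describe this situation from the point of view of {who}."
        )
        descriptions.append(f"[{who}] {llm(prompt)}")

    # The final prompt sees every verbalized perspective before judging.
    final_prompt = (
        f"Situation: {situation}\n"
        + "\n".join(descriptions)
        + "\nConsidering these perspectives, is the narrator's action socially "
        "acceptable? Answer yes or no and give a brief rationale."
    )
    return llm(final_prompt)
```

Any function that maps a prompt string to a completion string can be passed as `llm`, which keeps the sketch independent of a particular model API.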

[NLP-41] AMPO: Automatic Multi-Branched Prompt Optimization

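[Quick Read]: This paper addresses the difficulty that existing automatic prompt optimization techniques have with diverse patterns in complex tasks. The key to the solution is AMPO, an automatic prompt optimization method that iteratively develops a multi-branched prompt, using failure cases as feedback, so that multiple patterns are handled better. Its three modules are Pattern Recognition, Branch Adjustment, and Branch Pruning. Across five tasks AMPO consistently achieves the best results, and its minimal search strategy makes optimization notably efficient.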

Link: https://arxiv.org/abs/2410.08696
Authors: Sheng Yang, Yurong Wu, Yan Gao, Zineng Zhou, Bin Benjamin Zhu, Xiaodi Sun, Jian-Guang Lou, Zhiming Ding, Anbang Hu, Yuan Fang, Yunsong Li, Junyan Chen, Linjun Yang
Keywords: large language models, language models, important to enhance, enhance the performance, performance of large
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 7 figures, 6 tables

Abstract:Prompt engineering is very important to enhance the performance of large language models (LLMs). When dealing with complex issues, prompt engineers tend to distill multiple patterns from examples and inject relevant solutions to optimize the prompts, achieving satisfying results. However, existing automatic prompt optimization techniques are only limited to producing single flow instructions, struggling with handling diverse patterns. In this paper, we present AMPO, an automatic prompt optimization method that can iteratively develop a multi-branched prompt using failure cases as feedback. Our goal is to explore a novel way of structuring prompts with multi-branches to better handle multiple patterns in complex tasks, for which we introduce three modules: Pattern Recognition, Branch Adjustment, and Branch Pruning. In experiments across five tasks, AMPO consistently achieves the best results. Additionally, our approach demonstrates significant optimization efficiency due to our adoption of a minimal search strategy.

[NLP-42] Guidelines for Fine-grained Sentence-level Arabic Readability Annotation

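[Quick Read]: This paper addresses the lack of comprehensive Arabic language resources aligned with diverse readability levels. The key to the solution is the Balanced Arabic Readability Evaluation Corpus (BAREC), a standardized reference for sentence-level readability across 19 levels, from kindergarten to postgraduate comprehension. BAREC combines manual annotation with AI-driven tools and aims to cover a wide range of genres, topics, and regional variations. The paper details its fine-grained annotation guidelines, demonstrated on 10,631 sentences/phrases (113,651 words), reports an average pairwise inter-annotator agreement of 79.9% as measured by Quadratic Weighted Kappa, and reports competitive results for automatic readability assessment.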

Link: https://arxiv.org/abs/2410.08674
Authors: Nizar Habash, Hanada Taha-Thomure, Khalid N. Elmadani, Zeina Zeino, Abdallah Abushmaes
Keywords: Arabic Readability Evaluation, Readability Evaluation Corpus, Balanced Arabic Readability, Arabic language resources, language resources aligned
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 3 figures

Abstract:This paper presents the foundational framework and initial findings of the Balanced Arabic Readability Evaluation Corpus (BAREC) project, designed to address the need for comprehensive Arabic language resources aligned with diverse readability levels. Inspired by the Taha/Arabi21 readability reference, BAREC aims to provide a standardized reference for assessing sentence-level Arabic text readability across 19 distinct levels, ranging in targets from kindergarten to postgraduate comprehension. Our ultimate goal with BAREC is to create a comprehensive and balanced corpus that represents a wide range of genres, topics, and regional variations through a multifaceted approach combining manual annotation with AI-driven tools. This paper focuses on our meticulous annotation guidelines, demonstrated through the analysis of 10,631 sentences/phrases (113,651 words). The average pairwise inter-annotator agreement, measured by Quadratic Weighted Kappa, is 79.9%, reflecting a high level of substantial agreement. We also report competitive results for benchmarking automatic readability assessment. We will make the BAREC corpus and guidelines openly accessible to support Arabic language research and education.
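
The agreement figure above is measured with Quadratic Weighted Kappa, which penalizes disagreements by the squared distance between the two assigned levels. A minimal sketch of the standard definition follows; it assumes levels are encoded as integers 0..num_levels-1 (so BAREC's 19 levels would map to 0..18), and the function name is illustrative rather than taken from the paper.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_levels):
    """Cohen's kappa with quadratic weights w_ij = (i - j)^2 / (K - 1)^2."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    if num_levels < 2:
        raise ValueError("at least two levels are required")

    n = len(rater_a)
    # observed[i][j]: number of items rated i by rater A and j by rater B.
    observed = [[0] * num_levels for _ in range(num_levels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_levels)) for j in range(num_levels)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independence of the two raters.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n

    if weighted_expected == 0.0:
        return 1.0  # both raters used one identical level for every item
    return 1.0 - weighted_observed / weighted_expected
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement.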

[NLP-43] QEFT: Quantization for Efficient Fine-Tuning of LLMs EMNLP2024

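[Quick Read]: This paper addresses how to optimize the fine-tuning of large language models (LLMs) while keeping inference efficient. The key to the solution is a new lightweight technique called Quantization for Efficient Fine-Tuning (QEFT). QEFT accelerates both inference and fine-tuning, rests on robust theoretical foundations, and offers high flexibility and good hardware compatibility. It matches the quality and versatility of full-precision parameter-efficient fine-tuning while using fewer resources.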

Link: https://arxiv.org/abs/2410.08661
Authors: Changhun Lee, Jun-gyu Jin, Younghyun Cho, Eunhyeok Park
Keywords: large language models, keeping inference efficient, highly important, rapid growth, large language
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at Findings of EMNLP 2024

Abstract:With the rapid growth in the use of fine-tuning for large language models (LLMs), optimizing fine-tuning while keeping inference efficient has become highly important. However, this is a challenging task as it requires improvements in all aspects, including inference speed, fine-tuning speed, memory consumption, and, most importantly, model quality. Previous studies have attempted to achieve this by combining quantization with fine-tuning, but they have failed to enhance all four aspects simultaneously. In this study, we propose a new lightweight technique called Quantization for Efficient Fine-Tuning (QEFT). QEFT accelerates both inference and fine-tuning, is supported by robust theoretical foundations, offers high flexibility, and maintains good hardware compatibility. Our extensive experiments demonstrate that QEFT matches the quality and versatility of full-precision parameter-efficient fine-tuning, while using fewer resources. Our code is available at this https URL.

[NLP-44] More than Memes: A Multimodal Topic Modeling Approach to Conspiracy Theories on Telegram

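[Quick Read]: This paper addresses how to analyze conspiracy theories and related content given the growing share of (audio-)visual data on social media. The key to the solution is multimodal topic modeling that combines BERTopic with CLIP to analyze textual and visual data jointly. The study analyzes about 40,000 German-language Telegram messages posted in October 2023, explores the potential and challenges of the approach for a medium-sized corpus of user-generated text-image content, and offers inter-modal topic analyses along with a qualitative case study of multimodal narrative strategies.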

Link: https://arxiv.org/abs/2410.08642
Authors: Elisabeth Steffen
Keywords: German-language Telegram channels, related content online, conspiracy theories, German-language Telegram, traditionally focused
Categories: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 11 pages, 11 figures

Abstract:Research on conspiracy theories and related content online has traditionally focused on textual data. To address the increasing prevalence of (audio-)visual data on social media, and to capture the evolving and dynamic nature of this communication, researchers have begun to explore the potential of unsupervised approaches for analyzing multimodal online content. Our research contributes to this field by exploring the potential of multimodal topic modeling for analyzing conspiracy theories in German-language Telegram channels. Our work uses the BERTopic topic modeling approach in combination with CLIP for the analysis of textual and visual data. We analyze a corpus of ~40,000 Telegram messages posted in October 2023 in 571 German-language Telegram channels known for disseminating conspiracy theories and other deceptive content. We explore the potentials and challenges of this approach for studying a medium-sized corpus of user-generated, text-image online content. We offer insights into the dominant topics across modalities, different text and image genres discovered during the analysis, quantitative inter-modal topic analyses, and a qualitative case study of textual, visual, and multimodal narrative strategies in the communication of conspiracy theories.

[NLP-45] Words as Beacons: Guiding RL Agents with High-Level Language Prompts

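[Quick Read]: This paper addresses the exploration problem in sparse-reward reinforcement learning (RL). It proposes a teacher-student RL framework in which large language models (LLMs) act as "teachers" that guide the agent by decomposing complex tasks into subgoals. The key is that LLMs can understand an environment's structure and purpose from a textual description and generate three types of subgoals: positional targets, object representations, and language-based instructions. On the MiniGrid benchmark, the method converges 30 to 200 times faster in training steps than recent baselines designed for sparse-reward environments.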

Link: https://arxiv.org/abs/2410.08632
Authors: Unai Ruiz-Gonzalez, Alain Andres, Pedro G. Bascoy, Javier Del Ser
Keywords: pose significant challenges, Large Language Models, incomplete learning processes, leverages Large Language, pose significant
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Sparse reward environments in reinforcement learning (RL) pose significant challenges for exploration, often leading to inefficient or incomplete learning processes. To tackle this issue, this work proposes a teacher-student RL framework that leverages Large Language Models (LLMs) as “teachers” to guide the agent’s learning process by decomposing complex tasks into subgoals. Due to their inherent capability to understand RL environments based on a textual description of structure and purpose, LLMs can provide subgoals to accomplish the task defined for the environment in a similar fashion to how a human would do. In doing so, three types of subgoals are proposed: positional targets relative to the agent, object representations, and language-based instructions generated directly by the LLM. More importantly, we show that it is possible to query the LLM only during the training phase, enabling agents to operate within the environment without any LLM intervention. We assess the performance of this proposed framework by evaluating three state-of-the-art open-source LLMs (Llama, DeepSeek, Qwen) eliciting subgoals across various procedurally generated environment of the MiniGrid benchmark. Experimental results demonstrate that this curriculum-based approach accelerates learning and enhances exploration in complex tasks, achieving up to 30 to 200 times faster convergence in training steps compared to recent baselines designed for sparse reward environments.

[NLP-46] Retrieving Contextual Information for Long-Form Question Answering using Weak Supervision EMNLP2024

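[Quick Read]: This paper addresses a shortcoming of existing retrievers in long-form question answering (LFQA): they miss contextual information that goes beyond the direct answer, and training data for such context is scarce. The key to the solution is to propose and compare weak supervision techniques that optimize retrieval for contextual information. Experiments show improved end-to-end QA performance on ASQA, with relevant page recall up by 14.7% and the groundedness of generated long-form answers up by 12.5%. Experiments on a conversational QA dataset further show that long-form answers often anticipate likely follow-up questions.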

Link: https://arxiv.org/abs/2410.08623
Authors: Philipp Christmann, Svitlana Vakulenko, Ionut Teodor Sorodoc, Bill Byrne, Adrià de Gispert
Keywords: aims at generating, generating in-depth answers, generating in-depth, Long-form question answering, providing relevant information
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Accepted at EMNLP 2024 (Findings)

Abstract:Long-form question answering (LFQA) aims at generating in-depth answers to end-user questions, providing relevant information beyond the direct answer. However, existing retrievers are typically optimized towards information that directly targets the question, missing out on such contextual information. Furthermore, there is a lack of training data for relevant context. To this end, we propose and compare different weak supervision techniques to optimize retrieval for contextual information. Experiments demonstrate improvements on the end-to-end QA performance on ASQA, a dataset for long-form question answering. Importantly, as more contextual information is retrieved, we improve the relevant page recall for LFQA by 14.7% and the groundedness of generated long-form answers by 12.5%. Finally, we show that long-form answers often anticipate likely follow-up questions, via experiments on a conversational QA dataset.

[NLP-47] StraGo: Harnessing Strategic Guidance for Prompt Optimization

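[Quick Read]: This paper addresses "prompt drifting" in existing prompt optimization methods, where newly generated prompts fix failures but adversely affect previously successful cases, and the heavy reliance of these methods on the intrinsic capabilities of large language models (LLMs). The key to the solution is StraGo (Strategic-Guided Optimization), which draws on insights from both successful and failed cases to identify the factors critical to the optimization objective. StraGo adopts a how-to-do methodology and uses in-context learning to formulate specific, actionable strategies that give detailed, step-by-step guidance, yielding stable and effective prompt improvements.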

Link: https://arxiv.org/abs/2410.08601
Authors: Yurong Wu, Yan Gao, Bin Benjamin Zhu, Zineng Zhou, Xiaodi Sun, Sheng Yang, Jian-Guang Lou, Zhiming Ding, Linjun Yang
Keywords: large language models, prompt optimization, engineering is pivotal, pivotal for harnessing, Prompt
Categories: Computation and Language (cs.CL)
Comments: 19 pages, 3 figures, 20 tables

Abstract:Prompt engineering is pivotal for harnessing the capabilities of large language models (LLMs) across diverse applications. While existing prompt optimization methods improve prompt effectiveness, they often lead to prompt drifting, where newly generated prompts can adversely impact previously successful cases while addressing failures. Furthermore, these methods tend to rely heavily on LLMs’ intrinsic capabilities for prompt optimization tasks. In this paper, we introduce StraGo (Strategic-Guided Optimization), a novel approach designed to mitigate prompt drifting by leveraging insights from both successful and failed cases to identify critical factors for achieving optimization objectives. StraGo employs a how-to-do methodology, integrating in-context learning to formulate specific, actionable strategies that provide detailed, step-by-step guidance for prompt optimization. Extensive experiments conducted across a range of tasks, including reasoning, natural language understanding, domain-specific knowledge, and industrial applications, demonstrate StraGo’s superior performance. It establishes a new state-of-the-art in prompt optimization, showcasing its ability to deliver stable and effective prompt improvements.

[NLP-48] Parameter-Efficient Fine-Tuning of Large Language Models using Semantic Knowledge Tuning

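[Quick Read]: This paper addresses a weakness of prompt and prefix tuning for large language models (LLMs): the tunable tokens lack semantic meaning, so they require extensive training and often still fall short. The key to the solution is Semantic Knowledge Tuning (SK-Tuning), which uses meaningful words instead of random tokens and relies on a fixed LLM's zero-shot capabilities to understand and process the semantic content of the prompt. The processed prompt is then integrated with the input text. The authors report faster training, fewer parameters, and better performance than other tuning methods on tasks such as text classification and understanding.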

Link: https://arxiv.org/abs/2410.08598
Authors: Nusrat Jahan Prottasha, Asif Mahmud, Md. Shohanur Islam Sobuj, Prakash Bhat, Md Kowsher, Niloofar Yousefi, Ozlem Ozmen Garibay
Keywords: low computational cost, gaining significant popularity, Large Language Models, Large Language, computational cost
Categories: Computation and Language (cs.CL)
Comments: Accepted in Nature Scientific Reports

Abstract:Large Language Models (LLMs) are gaining significant popularity in recent years for specialized tasks using prompts due to their low computational cost. Standard methods like prefix tuning utilize special, modifiable tokens that lack semantic meaning and require extensive training for best performance, often falling short. In this context, we propose a novel method called Semantic Knowledge Tuning (SK-Tuning) for prompt and prefix tuning that employs meaningful words instead of random tokens. This method involves using a fixed LLM to understand and process the semantic content of the prompt through zero-shot capabilities. Following this, it integrates the processed prompt with the input text to improve the model’s performance on particular tasks. Our experimental results show that SK-Tuning exhibits faster training times, fewer parameters, and superior performance on tasks such as text classification and understanding compared to other tuning methods. This approach offers a promising method for optimizing the efficiency and effectiveness of LLMs in processing language tasks.

[NLP-49] Baichuan-Omni Technical Report

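[Quick Read]: This paper addresses the lack of a high-performing open-source counterpart to GPT-4o. The key to the solution is Baichuan-Omni, presented as the first open-source 7B Multimodal Large Language Model (MLLM) that can concurrently process and analyze image, video, audio, and text. Starting from a 7B model, training proceeds through two stages, multimodal alignment and multitask fine-tuning, which equip the language model to handle visual and audio data effectively. The work is intended as a competitive baseline for the open-source community in multimodal understanding and real-time interaction.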

Link: https://arxiv.org/abs/2410.08565
Authors: Yadong Li, Haoze Sun, Mingan Lin, Tianpeng Li, Guosheng Dong, Tao Zhang, Bowen Ding, Wei Song, Zhenglin Cheng, Yuqi Huo, Song Chen, Xu Li, Da Pan, Shusen Zhang, Xin Wu, Zheng Liang, Jun Liu, Tao Zhang, Keer Lu, Yaqi Zhao, Yanjun Shen, Fan Yang, Kaicheng Yu, Tao Lin, Jianhua Xu, Zenan Zhou, Weipeng Chen
Keywords: high-performing open-source counterpart, salient multimodal capabilities, Large Language Model, multimodal interactive experience, Multimodal Large Language
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:The salient multimodal capabilities and interactive experience of GPT-4o highlight its critical role in practical applications, yet it lacks a high-performing open-source counterpart. In this paper, we introduce Baichuan-Omni, the first open-source 7B Multimodal Large Language Model (MLLM) adept at concurrently processing and analyzing modalities of image, video, audio, and text, while delivering an advanced multimodal interactive experience and strong performance. We propose an effective multimodal training schema starting with 7B model and proceeding through two stages of multimodal alignment and multitask fine-tuning across audio, image, video, and text modal. This approach equips the language model with the ability to handle visual and audio data effectively. Demonstrating strong performance across various omni-modal and multimodal benchmarks, we aim for this contribution to serve as a competitive baseline for the open-source community in advancing multimodal understanding and real-time interaction.

[NLP-50] Similar Phrases for Cause of Actions of Civil Cases

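[Quick Read]: This paper addresses the difficulty of filtering cases in the Taiwanese judicial system, where Cause of Actions (COAs) lack standardized labels. The key to the solution is to use embedding and clustering techniques to analyze the similarity between COAs based on the legal articles they cite. The study implements several similarity measures, including the Dice coefficient and Pearson's correlation coefficient, combines their rankings with an ensemble model, and uses social network analysis to identify clusters of related COAs. This reveals inconspicuous connections between COAs and has potential applications in legal research beyond civil law.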

Link: https://arxiv.org/abs/2410.08564
Authors: Ho-Chien Huang, Chao-Lin Liu
Keywords: Taiwanese judicial system, Taiwanese judicial, relevant legal judgments, identifying relevant legal, judicial system
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 10 pages, 4 figures, 3 tables (including appendix)

Abstract:In the Taiwanese judicial system, Cause of Actions (COAs) are essential for identifying relevant legal judgments. However, the lack of standardized COA labeling creates challenges in filtering cases using basic methods. This research addresses this issue by leveraging embedding and clustering techniques to analyze the similarity between COAs based on cited legal articles. The study implements various similarity measures, including Dice coefficient and Pearson’s correlation coefficient. An ensemble model combines rankings, and social network analysis identifies clusters of related COAs. This approach enhances legal analysis by revealing inconspicuous connections between COAs, offering potential applications in legal research beyond civil law.
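
The two similarity measures named in the abstract can be sketched as follows, assuming each COA is represented either by the set of legal articles it cites (for Dice) or by a vector of citation counts over a fixed list of articles (for Pearson). Both representations are assumptions made for illustration; the abstract does not spell out the exact features.

```python
import math


def dice_coefficient(articles_a: set, articles_b: set) -> float:
    """Dice = 2 * |A ∩ B| / (|A| + |B|) over the sets of cited articles."""
    if not articles_a and not articles_b:
        return 0.0
    return 2 * len(articles_a & articles_b) / (len(articles_a) + len(articles_b))


def pearson_correlation(counts_a, counts_b) -> float:
    """Pearson's r between two equal-length citation-count vectors."""
    if len(counts_a) != len(counts_b) or not counts_a:
        raise ValueError("vectors must be non-empty and of equal length")
    n = len(counts_a)
    mean_a = sum(counts_a) / n
    mean_b = sum(counts_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(counts_a, counts_b))
    std_a = math.sqrt(sum((a - mean_a) ** 2 for a in counts_a))
    std_b = math.sqrt(sum((b - mean_b) ** 2 for b in counts_b))
    if std_a == 0.0 or std_b == 0.0:
        return 0.0  # a constant vector has no defined correlation
    return cov / (std_a * std_b)
```

Dice only looks at which articles are shared, while Pearson also reflects how often each article is cited, so the two can rank COA pairs differently, which is presumably why an ensemble over rankings is useful.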

[NLP-51] Balancing Innovation and Privacy: Data Security Strategies in Natural Language Processing Applications

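[Quick Read]: This paper addresses privacy protection in natural language processing (NLP) applications, in particular the security of user data in common applications such as chatbots, sentiment analysis, and machine translation. The key to the solution is a new algorithm based on differential privacy, which adds random noise during data analysis so that results stay accurate and reliable while leakage of sensitive user information is prevented. Compared with traditional privacy methods such as data anonymization and homomorphic encryption, the authors report significant advantages in computational efficiency and scalability while maintaining high accuracy.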

Link: https://arxiv.org/abs/2410.08553
Authors: Shaobo Liu, Guiran Liu, Binrong Zhu, Yuanshuai Luo, Linxiao Wu, Rui Wang
Keywords: Natural Language Processing, Natural Language, Language Processing, privacy, privacy protection
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:This research addresses privacy protection in Natural Language Processing (NLP) by introducing a novel algorithm based on differential privacy, aimed at safeguarding user data in common applications such as chatbots, sentiment analysis, and machine translation. With the widespread application of NLP technology, the security and privacy protection of user data have become important issues that need to be solved urgently. This paper proposes a new privacy protection algorithm designed to effectively prevent the leakage of user sensitive information. By introducing a differential privacy mechanism, our model ensures the accuracy and reliability of data analysis results while adding random noise. This method not only reduces the risk caused by data leakage but also achieves effective processing of data while protecting user privacy. Compared to traditional privacy methods like data anonymization and homomorphic encryption, our approach offers significant advantages in terms of computational efficiency and scalability while maintaining high accuracy in data analysis. The proposed algorithm’s efficacy is demonstrated through performance metrics such as accuracy (0.89), precision (0.85), and recall (0.88), outperforming other methods in balancing privacy and utility. As privacy protection regulations become increasingly stringent, enterprises and developers must take effective measures to deal with privacy risks. Our research provides an important reference for the application of privacy protection technology in the field of NLP, emphasizing the need to achieve a balance between technological innovation and user privacy. In the future, with the continuous advancement of technology, privacy protection will become a core element of data-driven applications and promote the healthy development of the entire industry.
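
The abstract does not say which noise mechanism the algorithm uses. As a generic illustration of "adding random noise" under differential privacy, the sketch below applies the textbook Laplace mechanism to a counting query; the function names and the choice of a counting query are assumptions, not the paper's method.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) sample: the difference of two i.i.d. exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon: float, rng: random.Random = None) -> float:
    """Epsilon-differentially-private count of the items that satisfy `predicate`.

    A count changes by at most 1 when one record is added or removed, so its
    sensitivity is 1 and the Laplace noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

Smaller values of `epsilon` give stronger privacy but noisier answers, which is the privacy-utility trade-off the abstract refers to.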

[NLP-52] Humanity in AI: Detecting the Personality of Large Language Models

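[Quick Read]: This paper addresses two main problems in questionnaire-based personality detection for large language models (LLMs): hallucinations (inaccurate or irrelevant responses) and the sensitivity of responses to the order of the options. The key to the solution is to combine text mining with questionnaires: text mining extracts psychological features from the responses, which reduces the influence of option order and hallucinations. The scores from the two methods are normalized and their root mean square error is computed to validate the approach. The experiments indicate that LLMs do exhibit personality traits and that these derive from their pre-training data.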

Link: https://arxiv.org/abs/2410.08545
Authors: Baohua Zhan, Yongyi Huang, Wenyao Cui, Huaping Zhang, Jianyun Shang
Keywords: Large Language Models, Large Language, Language Models, personality, Large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Questionnaires are a common method for detecting the personality of Large Language Models (LLMs). However, their reliability is often compromised by two main issues: hallucinations (where LLMs produce inaccurate or irrelevant responses) and the sensitivity of responses to the order of the presented options. To address these issues, we propose combining text mining with questionnaires method. Text mining can extract psychological features from the LLMs’ responses without being affected by the order of options. Furthermore, because this method does not rely on specific answers, it reduces the influence of hallucinations. By normalizing the scores from both methods and calculating the root mean square error, our experiment results confirm the effectiveness of this approach. To further investigate the origins of personality traits in LLMs, we conduct experiments on both pre-trained language models (PLMs), such as BERT and GPT, as well as conversational models (ChatLLMs), such as ChatGPT. The results show that LLMs do contain certain personalities, for example, ChatGPT and ChatGLM exhibit the personality traits of ‘Conscientiousness’. Additionally, we find that the personalities of LLMs are derived from their pre-trained data. The instruction data used to train ChatLLMs can enhance the generation of data containing personalities and expose their hidden personality. We compare the results with the human average personality score, and we find that the personality of FLAN-T5 in PLMs and ChatGPT in ChatLLMs is more similar to that of a human, with score differences of 0.34 and 0.22, respectively.
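
The validation step (normalize the scores from both methods, then compute the root mean square error) can be sketched as below. Min-max normalization onto [0, 1] with known scale bounds is an assumption; the abstract does not state which normalization the authors use.

```python
import math


def min_max_normalize(scores, low: float, high: float):
    """Map scores from the scale [low, high] onto [0, 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return [(s - low) / (high - low) for s in scores]


def rmse(scores_a, scores_b) -> float:
    """Root mean square error between two equal-length score lists."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    squared = [(a - b) ** 2 for a, b in zip(scores_a, scores_b)]
    return math.sqrt(sum(squared) / len(squared))
```

Normalizing first matters because questionnaire scores and text-mining scores may live on different scales; a low RMSE afterwards suggests the two methods agree on the per-trait profile.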

[NLP-53] Scaling Laws for Predicting Downstream Performance in LLMs

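[Quick Read]: This paper addresses the precise estimation of the downstream performance of large language models (LLMs) before training. The key to the solution is a two-stage approach: first, statistics from a series of smaller sampling models are used to estimate a function that maps computational resources (e.g., FLOPs) to pre-training loss; then, after the critical "emergent phase", pre-training loss is mapped to downstream task performance. This FLP solution uses pre-training loss as a more computation-efficient metric, predicts LLM performance accurately, and significantly outperforms the direct FLOPs-to-Performance approach.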

Link: https://arxiv.org/abs/2410.08527
Authors: Yangyi Chen, Binxuan Huang, Yifan Gao, Zhengyang Wang, Jingfeng Yang, Heng Ji
Keywords: large language models, Precise estimation, performance, pre-training loss, downstream performance
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Precise estimation of downstream performance in large language models (LLMs) prior to training is essential for guiding their development process. Scaling laws analysis utilizes the statistics of a series of significantly smaller sampling language models (LMs) to predict the performance of the target LLM. For downstream performance prediction, the critical challenge lies in the emergent abilities in LLMs that occur beyond task-specific computational thresholds. In this work, we focus on the pre-training loss as a more computation-efficient metric for performance estimation. Our two-stage approach consists of first estimating a function that maps computational resources (e.g., FLOPs) to the pre-training Loss using a series of sampling models, followed by mapping the pre-training loss to downstream task Performance after the critical “emergent phase”. In preliminary experiments, this FLP solution accurately predicts the performance of LLMs with 7B and 13B parameters using a series of sampling LMs up to 3B, achieving error margins of 5% and 10%, respectively, and significantly outperforming the FLOPs-to-Performance approach. This motivates FLP-M, a fundamental approach for performance prediction that addresses the practical need to integrate datasets from multiple sources during pre-training, specifically blending general corpora with code data to accurately represent the common necessity. FLP-M extends the power law analytical function to predict domain-specific pre-training loss based on FLOPs across data sources, and employs a two-layer neural network to model the non-linear relationship between multiple domain-specific loss and downstream performance. By utilizing a 3B LLM trained on a specific ratio and a series of smaller sampling LMs, FLP-M can effectively forecast the performance of 3B and 7B LLMs across various data mixtures for most benchmarks within 10% error margins.
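
The two-stage FLOPs → Loss → Performance idea can be sketched as two chained fits. The functional forms here are assumptions made for illustration: a pure power law L = a · C^(-b) fitted in log space for stage one, and a straight line from loss to performance for stage two. The paper's actual functions may differ (for example, an irreducible-loss term), and all numbers passed in would come from real sampling models.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def fit_flops_to_loss(flops, losses):
    """Stage 1: fit L = a * C^(-b) by linear regression on (log C, log L)."""
    slope, intercept = fit_line([math.log(c) for c in flops], [math.log(l) for l in losses])
    return math.exp(intercept), -slope  # (a, b)


def fit_loss_to_performance(losses, scores):
    """Stage 2 (assumed linear): fit P = slope * L + intercept."""
    return fit_line(losses, scores)


def predict_performance(target_flops, a, b, slope, intercept):
    """Chain both stages to predict performance at a larger compute budget."""
    predicted_loss = a * target_flops ** (-b)
    return slope * predicted_loss + intercept
```

In this reading, stage two should only be fitted on sampling models that are already past the emergent phase, since performance below that point carries little signal about the loss-to-performance relationship.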

[NLP-54] “I Am the One and Only Your Cyber BFF”: Understanding the Impact of GenAI Requires Understanding the Impact of Anthropomorphic AI

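[Quick Read]: This paper addresses the increasingly anthropomorphic behavior of generative AI (GenAI) systems and its possible negative social impacts. It argues that anthropomorphism in AI development, deployment, and use remains vastly overlooked, understudied, and underspecified. The key argument is that the social impacts of generative AI cannot be thoroughly mapped without mapping the social impacts of anthropomorphic AI, and the paper outlines a call to action.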

Link: https://arxiv.org/abs/2410.08526
Authors: Myra Cheng, Alicia DeVrio, Lisa Egede, Su Lin Blodgett, Alexandra Olteanu
Keywords: generating outputs, anthropomorphic behaviors, increasingly prone, scholars increasingly raising, increasingly raising concerns
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Many state-of-the-art generative AI (GenAI) systems are increasingly prone to anthropomorphic behaviors, i.e., to generating outputs that are perceived to be human-like. While this has led to scholars increasingly raising concerns about possible negative impacts such anthropomorphic AI systems can give rise to, anthropomorphism in AI development, deployment, and use remains vastly overlooked, understudied, and underspecified. In this perspective, we argue that we cannot thoroughly map the social impacts of generative AI without mapping the social impacts of anthropomorphic AI, and outline a call to action.
摘要:许多最先进的生成式 AI (Generative AI) 系统越来越倾向于表现出拟人化行为,即生成被认为是类人的输出。尽管这导致学者们越来越多地对这种拟人化 AI 系统可能带来的负面影响表示担忧,但 AI 开发、部署和使用中的拟人化现象仍然被广泛忽视、研究不足且未明确界定。在此视角下,我们认为,如果不全面映射拟人化 AI 的社会影响,就无法彻底映射生成式 AI 的社会影响,并提出行动呼吁。

[NLP-55] Improving Legal Entity Recognition Using a Hybrid Transformer Model and Semantic Filtering Approach

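【Quick read】: This paper addresses the complexity and domain specificity of Legal Entity Recognition (LER) in automated legal workflows, in particular the handling of ambiguities and nested entity structures in legal documents. The key to the solution is a novel hybrid model that adds a semantic similarity-based filtering mechanism to Legal-BERT, a transformer model fine-tuned for legal text processing, to improve its accuracy and precision. On a dataset of 15,000 annotated legal documents the model reaches an F1 score of 93.4%, a significant improvement in precision and recall over previous methods.

Link: https://arxiv.org/abs/2410.08521
Authors: Duraimurugan Rajamanickam
Keywords: Legal Entity Recognition, Entity Recognition, automating legal workflows, compliance monitoring, contract analysis
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 7 pages, 1 table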

Abstract:Legal Entity Recognition (LER) is critical in automating legal workflows such as contract analysis, compliance monitoring, and litigation support. Existing approaches, including rule-based systems and classical machine learning models, struggle with the complexity of legal documents and domain specificity, particularly in handling ambiguities and nested entity structures. This paper proposes a novel hybrid model that enhances the accuracy and precision of Legal-BERT, a transformer model fine-tuned for legal text processing, by introducing a semantic similarity-based filtering mechanism. We evaluate the model on a dataset of 15,000 annotated legal documents, achieving an F1 score of 93.4%, demonstrating significant improvements in precision and recall over previous methods.
摘要:法律实体识别 (Legal Entity Recognition, LER) 在自动化法律工作流程中至关重要,如合同分析、合规监控和诉讼支持。现有的方法,包括基于规则的系统和经典机器学习模型,在处理法律文档的复杂性和领域特异性方面存在困难,特别是在处理歧义和嵌套实体结构时。本文提出了一种新颖的混合模型,通过引入基于语义相似性的过滤机制,增强了针对法律文本处理进行微调的 Transformer 模型 Legal-BERT 的准确性和精确度。我们在一个包含 15,000 份标注法律文档的数据集上评估了该模型,达到了 93.4% 的 F1 分数,显示出在精确度和召回率方面较之前方法的显著改进。
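
A semantic similarity-based filter of the kind mentioned above can be sketched as a post-processing step over a tagger's candidate entities. This is an illustrative assumption about how such a filter might look, not the paper's implementation: the prototype vectors, the candidate embeddings, the `0.8` threshold, and the function names are all hypothetical, and in practice the embeddings would come from a model such as Legal-BERT.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def semantic_filter(candidates, prototypes, threshold=0.8):
    """Keep candidates whose embedding is close to their label's prototype.

    candidates: list of (text, label, embedding) tuples from a tagger.
    prototypes: dict mapping each label to a reference embedding.
    """
    kept = []
    for text, label, vec in candidates:
        if cosine(vec, prototypes[label]) >= threshold:
            kept.append((text, label))
    return kept

# Toy 3-dimensional embeddings, made up for illustration only.
prototypes = {"COURT": [1.0, 0.0, 0.0], "PARTY": [0.0, 1.0, 0.0]}
candidates = [
    ("Supreme Court", "COURT", [0.9, 0.1, 0.0]),
    ("hereinafter", "PARTY", [0.1, 0.2, 0.9]),
]
kept = semantic_filter(candidates, prototypes)
```

Here the second candidate is dropped because its embedding points away from the `PARTY` prototype, which is the kind of false positive such a filter is meant to remove.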

[NLP-56] Generation with Dynamic Vocabulary EMNLP2024

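【Quick read】: This paper addresses the limits that a traditional static vocabulary places on the generation quality and efficiency of language models. The key to the solution is a new dynamic vocabulary that can involve arbitrary text spans during generation; these spans act as basic generation units, akin to tokens in a traditional static vocabulary. Generating multi-token spans atomically improves both quality and efficiency: compared with a standard language model, the MAUVE metric increases by 25% and latency decreases by 20%. The dynamic vocabulary can also be deployed in a plug-and-play way for downstream applications, for example applying it to different domains without training and generating reliable citations in question answering without compromising answer accuracy.

Link: https://arxiv.org/abs/2410.08481
Authors: Yanting Liu, Tao Ji, Changzhi Sun, Yuanbin Wu, Xiaoling Wang
Keywords: dynamic vocabulary, text spans, standard language model, arbitrary text spans, Abstract
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2024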

Abstract:We introduce a new dynamic vocabulary for language models. It can involve arbitrary text spans during generation. These text spans act as basic generation bricks, akin to tokens in the traditional static vocabularies. We show that the ability to generate multi-token spans atomically improves both generation quality and efficiency (compared to the standard language model, the MAUVE metric is increased by 25% and the latency is decreased by 20%). The dynamic vocabulary can be deployed in a plug-and-play way and is thus attractive for various downstream applications. For example, we demonstrate that the dynamic vocabulary can be applied to different domains in a training-free manner. It also helps to generate reliable citations in question answering tasks (substantially enhancing citation results without compromising answer accuracy).
摘要: 我们引入了一种新的动态词汇表用于语言模型。它能够在生成过程中涉及任意文本片段。这些文本片段作为基本生成单元,类似于传统静态词汇表中的 Token。我们展示了,生成多 Token 的原子能力不仅提高了生成质量和效率(与标准语言模型相比,MAUVE 指标提高了 25%,延迟降低了 20%)。动态词汇表可以以即插即用的方式部署,因此对各种下游应用具有吸引力。例如,我们证明了动态词汇表可以在无需训练的情况下应用于不同领域。它还有助于在问答任务中生成可靠的引用(显著提升引用结果而不影响答案准确性)。

[NLP-57] GIVE: Structured Reasoning with Knowledge Graph Inspired Veracity Extrapolation

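【Quick read】: This paper addresses the reliance of retrieval-based reasoning with large language models (LLMs) on dense, high-quality non-parametric knowledge sources, which are expensive and sometimes infeasible to build for scientific or corner domains. The key to the solution is the Graph Inspired Veracity Extrapolation (GIVE) framework, which integrates parametric and non-parametric memories and uses external structured knowledge to inspire the LLM to model the connections among relevant concepts, improving knowledge retrieval and faithful reasoning on very sparse knowledge graphs. Specifically, GIVE prompts the LLM to decompose the query into key concepts and attributes, construct groups of relevant entities, and build an augmented reasoning chain by probing potential relationships among node pairs across these groups. The method uses both factual and extrapolated links, and it enables GPT3.5-turbo to outperform advanced models such as GPT4 on reasoning-intensive benchmarks without any additional training cost.

Link: https://arxiv.org/abs/2410.08475
Authors: Jiashu He, Mingyu Derek Ma, Jinxuan Fan, Dan Roth, Wei Wang, Alejandro Ribeiro
Keywords: Existing retrieval-based reasoning, Existing retrieval-based, retrieval-based reasoning approaches, large language models, provide domain knowledge
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: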

Abstract:Existing retrieval-based reasoning approaches for large language models (LLMs) heavily rely on the density and quality of the non-parametric knowledge source to provide domain knowledge and explicit reasoning chain. However, inclusive knowledge sources are expensive and sometimes infeasible to build for scientific or corner domains. To tackle the challenges, we introduce Graph Inspired Veracity Extrapolation (GIVE), a novel reasoning framework that integrates the parametric and non-parametric memories to enhance both knowledge retrieval and faithful reasoning processes on very sparse knowledge graphs. By leveraging the external structured knowledge to inspire LLM to model the interconnections among relevant concepts, our method facilitates a more logical and step-wise reasoning approach akin to experts’ problem-solving, rather than gold answer retrieval. Specifically, the framework prompts LLMs to decompose the query into crucial concepts and attributes, construct entity groups with relevant entities, and build an augmented reasoning chain by probing potential relationships among node pairs across these entity groups. Our method incorporates both factual and extrapolated linkages to enable comprehensive understanding and response generation. Extensive experiments on reasoning-intense benchmarks on biomedical and commonsense QA demonstrate the effectiveness of our proposed method. Specifically, GIVE enables GPT3.5-turbo to outperform advanced models like GPT4 without any additional training cost, thereby underscoring the efficacy of integrating structured information and internal reasoning ability of LLMs for tackling specialized tasks with limited external resources.
摘要:现有的基于检索的大语言模型 (LLM) 推理方法严重依赖于非参数知识源的密度和质量,以提供领域知识和显式推理链。然而,对于科学或边缘领域,构建包含性知识源既昂贵又有时不可行。为应对这些挑战,我们引入了图启发式真实性外推 (Graph Inspired Veracity Extrapolation, GIVE),这是一种新颖的推理框架,它整合了参数和非参数记忆,以增强在非常稀疏的知识图谱上的知识检索和忠实推理过程。通过利用外部结构化知识来启发 LLM 建模相关概念之间的相互联系,我们的方法促进了更符合专家问题解决逻辑的逐步推理方法,而非仅依赖于黄金答案检索。具体而言,该框架提示 LLM 将查询分解为关键概念和属性,构建包含相关实体的实体组,并通过探测这些实体组中节点对之间的潜在关系来构建增强的推理链。我们的方法结合了事实和外推的链接,以实现全面理解和响应生成。在生物医学和常识问答等推理密集型基准上的广泛实验证明了我们提出方法的有效性。具体来说,GIVE 使 GPT3.5-turbo 在没有额外训练成本的情况下超越了 GPT4 等先进模型,从而突显了整合结构化信息和 LLM 内部推理能力以应对资源有限的专业任务的有效性。

[NLP-58] SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models

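【Quick read】: This paper addresses how to evaluate the reasoning ability of multimodal large language models (MLLMs) in complex sports scenarios. The key to the solution is a benchmark called SPORTU with two main components: SPORTU-text tests a model's understanding of sports rules and strategy through question answering, without visual input; SPORTU-video, with 1,701 slow-motion video clips and 12,048 QA pairs, assesses multi-level reasoning from simple sport recognition to complex foul detection and rule application. By evaluating LLMs on SPORTU-text with few-shot learning and chain-of-thought (CoT) prompting, and a range of MLLMs on SPORTU-video, the paper shows where current models fall short in sports understanding and reasoning and points to directions for improvement.

Link: https://arxiv.org/abs/2410.08474
Authors: Haotian Xia, Zhengbang Yang, Junbo Zou, Rhys Tracy, Yuqing Wang, Chi Lu, Christopher Lai, Yanjun He, Xun Shao, Zhuoqing Xie, Yuan-fang Wang, Weining Shen, Hanjie Chen
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, Large Language
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: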

Abstract:Multimodal Large Language Models (MLLMs) are advancing the ability to reason about complex sports scenarios by integrating textual and visual information. To comprehensively evaluate their capabilities, we introduce SPORTU, a benchmark designed to assess MLLMs across multi-level sports reasoning tasks. SPORTU comprises two key components: SPORTU-text, featuring 900 multiple-choice questions with human-annotated explanations for rule comprehension and strategy understanding. This component focuses on testing models’ ability to reason about sports solely through question-answering (QA), without requiring visual inputs; SPORTU-video, consisting of 1,701 slow-motion video clips across 7 different sports and 12,048 QA pairs, designed to assess multi-level reasoning, from simple sports recognition to complex tasks like foul detection and rule application. We evaluate four LLMs using few-shot learning and chain-of-thought (CoT) prompting on SPORTU-text. GPT-4o achieves the highest accuracy of 71%, but still falls short of human-level performance, highlighting room for improvement in rule comprehension and reasoning. The evaluation for the SPORTU-video part includes 7 proprietary and 6 open-source MLLMs. Experiments show that models fall short on hard tasks that require deep reasoning and rule-based understanding. Claude-3.5-Sonnet performs the best with only 52.6% accuracy on the hard task, showing large room for improvement. We hope that SPORTU will serve as a critical step toward evaluating models’ capabilities in sports understanding and reasoning.
摘要:多模态大语言模型 (MLLMs) 通过整合文本和视觉信息,正在提升对复杂体育场景的推理能力。为了全面评估其能力,我们引入了 SPORTU,这是一个旨在评估 MLLMs 在多层次体育推理任务中的基准。SPORTU 包含两个关键组成部分:SPORTU-text,包含 900 道多选题,并附有人工注释的解释,用于规则理解和策略理解。该部分专注于测试模型仅通过问答 (QA) 进行体育推理的能力,无需视觉输入;SPORTU-video,包含 1,701 段慢动作视频剪辑,涵盖 7 种不同体育项目和 12,048 对 QA 对,旨在评估从简单体育识别到复杂任务(如犯规检测和规则应用)的多层次推理。我们评估了四种主要利用少样本学习范式并辅以思维链 (CoT) 提示的大语言模型在 SPORTU-text 部分的表现。GPT-4o 达到了 71% 的最高准确率,但仍未达到人类水平的表现,突显了在规则理解和推理方面的改进空间。对 SPORTU-video 部分的评估包括 7 种专有和 6 种开源的多模态大语言模型。实验表明,模型在需要深度推理和基于规则理解的高难度任务上表现不足。Claude-3.5-Sonnet 在高难度任务中表现最佳,准确率仅为 52.6%,显示出巨大的改进空间。我们希望 SPORTU 将成为评估模型在体育理解和推理能力方面的重要一步。

[NLP-59] Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in CLIP EMNLP2024

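【Quick read】: This paper addresses the fact that text encoding in vision-language models such as CLIP does not adequately account for the differing importance of semantic elements. The key to the solution is the Semantic Token Reweighting framework (SToRI), which weights semantic elements differentially according to their contextual importance, giving finer control over text-embedding construction in response to data-driven insights and user preferences. The approach is shown to be effective on few-shot image classification and preference-driven image retrieval.

Link: https://arxiv.org/abs/2410.08469
Authors: Eunji Kim, Kyuhong Shim, Simyung Chang, Sungroh Yoon
Keywords: Vision-Language Models, translating textual input, embedding space shared, natural language, encoder within Vision-Language
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at EMNLP 2024 Findings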

Abstract:A text encoder within Vision-Language Models (VLMs) like CLIP plays a crucial role in translating textual input into an embedding space shared with images, thereby facilitating the interpretative analysis of vision tasks through natural language. Despite the varying significance of different textual elements within a sentence depending on the context, efforts to account for variation of importance in constructing text embeddings have been lacking. We propose a framework of Semantic Token Reweighting to build Interpretable text embeddings (SToRI), which incorporates controllability as well. SToRI refines the text encoding process in CLIP by differentially weighting semantic elements based on contextual importance, enabling finer control over emphasis responsive to data-driven insights and user preferences. The efficacy of SToRI is demonstrated through comprehensive experiments on few-shot image classification and image retrieval tailored to user preferences.
摘要:在像 CLIP 这样的视觉-语言模型 (Vision-Language Models, VLMs) 中,文本编码器在将文本输入转换为与图像共享的嵌入空间方面起着至关重要的作用,从而通过自然语言促进视觉任务的解释性分析。尽管句子中不同文本元素的重要性因上下文而异,但在构建文本嵌入时考虑重要性变化的努力仍然不足。我们提出了一种语义 Token 重加权框架,用于构建可解释的文本嵌入 (Semantic Token Reweighting to build Interpretable text embeddings, SToRI),该框架还融入了可控性。SToRI 通过根据上下文重要性对语义元素进行差异化加权,改进了 CLIP 中的文本编码过程,从而能够根据数据驱动的洞察和用户偏好实现更精细的强调控制。SToRI 的有效性通过针对用户偏好的少样本图像分类和图像检索的综合实验得到了验证。

[NLP-60] Simultaneous Reward Distillation and Preference Learning: Get You a Language Model Who Can Do Both

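【Quick read】: This paper addresses model drift and reward overfitting in traditional reward-model-based preference optimization (such as RLHF), as well as the degenerate policies that direct preference optimization methods (such as DPO) can produce. The key to the solution is DRDO (Direct Reward Distillation and policy-Optimization), which models rewards and preferences simultaneously through supervised knowledge distillation: it directly mimics the rewards assigned by an oracle while learning human preferences from a novel preference likelihood formulation, which avoids policy degeneracy and improves robustness to noisy preference signals and out-of-distribution settings.

Link: https://arxiv.org/abs/2410.08458
Authors: Abhijnan Nath, Changsoo Jung, Ethan Seefried, Nikhil Krishnaswamy
Keywords: building usable generative, usable generative large, generative large language, large language models, Direct Preference Optimization
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: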

Abstract:Reward modeling of human preferences is one of the cornerstones of building usable generative large language models (LLMs). While traditional RLHF-based alignment methods explicitly maximize the expected rewards from a separate reward model, more recent supervised alignment methods like Direct Preference Optimization (DPO) circumvent this phase to avoid problems including model drift and reward overfitting. Although popular due to their simplicity, DPO and similar direct alignment methods can still lead to degenerate policies, and rely heavily on the Bradley-Terry-based preference formulation to model reward differences between pairs of candidate outputs. This formulation is challenged by non-deterministic or noisy preference labels, for example when human scoring of two candidate outputs is of low confidence. In this paper, we introduce DRDO (Direct Reward Distillation and policy-Optimization), a supervised knowledge distillation-based preference alignment method that simultaneously models rewards and preferences to avoid such degeneracy. DRDO directly mimics rewards assigned by an oracle while learning human preferences from a novel preference likelihood formulation. Our experimental results on the Ultrafeedback and TL;DR datasets demonstrate that policies trained using DRDO surpass previous methods such as DPO and e-DPO in terms of expected rewards and are more robust, on average, to noisy preference signals as well as out-of-distribution (OOD) settings.
摘要:人类偏好的奖励建模是构建可用生成式大语言模型 (LLM) 的基石之一。传统的基于 RLHF 的对齐方法明确地最大化来自独立奖励模型的预期奖励,而最近的监督对齐方法,如直接偏好优化 (DPO),则绕过了这一阶段,以避免模型漂移和奖励过拟合等问题。尽管 DPO 及其类似直接对齐方法因其简单性而广受欢迎,但它们仍可能导致策略退化,并严重依赖基于 Bradley-Terry 的偏好公式来建模候选输出对之间的奖励差异。这种公式在面对非确定性或噪声偏好标签时面临挑战,例如人类对两个候选输出的评分信心较低。本文中,我们引入了 DRDO (直接奖励蒸馏与策略优化),这是一种基于监督知识蒸馏的偏好对齐方法,同时建模奖励和偏好以避免此类退化。DRDO 在学习人类偏好的同时,直接模仿由预言机分配的奖励,并采用了一种新颖的偏好似然公式。我们在 Ultrafeedback 和 TL;DR 数据集上的实验结果表明,使用 DRDO 训练的策略在预期奖励方面优于 DPO 和 e-DPO 等先前方法,并且在平均水平上对噪声偏好信号以及分布外 (OOD) 设置更为鲁棒。
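
For readers unfamiliar with the Bradley-Terry formulation that the abstract says DPO-style methods rely on, the sketch below shows the standard form: the probability that one output is preferred is the logistic function of the reward difference. This is the textbook model only, not DRDO's novel preference likelihood; the function names and reward values are hypothetical.

```python
import math

def bt_preference_prob(r_chosen, r_rejected):
    """Bradley-Terry probability that the chosen output is preferred.

    P(chosen > rejected) = sigmoid(r_chosen - r_rejected)
    """
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))

def bt_negative_log_likelihood(r_chosen, r_rejected):
    """Loss minimized when fitting rewards to a labeled preference pair."""
    return -math.log(bt_preference_prob(r_chosen, r_rejected))

# A confident pair: rewards are far apart, so the label is informative.
confident = bt_preference_prob(2.0, 0.0)

# A low-confidence pair: rewards are equal, so the model says 50/50
# even though the dataset still records a hard "chosen" label.
ambiguous = bt_preference_prob(1.0, 1.0)
```

The second case hints at why noisy or low-confidence labels are awkward for this formulation: the label is binary even when the underlying rewards are nearly tied.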

[NLP-61] ∀uto∃∨∧L: Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks

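【Quick read】: This paper addresses the objective evaluation of large language models (LLMs) on formal tasks with clear notions of correctness, such as translation and logical reasoning. The key to the solution is a new benchmark, ∀uto∃∨∧L, which auto-generates tasks at different levels of difficulty together with their ground truth, removing the dependence on human annotation, and uses automatically generated, randomized datasets to keep models from overfitting to static datasets. This automated approach lowers the cost and time of evaluation while improving its objectivity and breadth of applicability.

Link: https://arxiv.org/abs/2410.08437
Authors: Rushang Karia, Daniel Bramblett, Daksh Dobhal, Siddharth Srivastava
Keywords: Large Language Model, scaling Large Language, Language Model, Large Language, scaling Large
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: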

Abstract:This paper presents ∀uto∃∨∧L, a novel benchmark for scaling Large Language Model (LLM) assessment in formal tasks with clear notions of correctness, such as truth maintenance in translation and logical reasoning. ∀uto∃∨∧L is the first benchmarking paradigm that offers several key advantages necessary for scaling objective evaluation of LLMs without human labeling: (a) ability to evaluate LLMs of increasing sophistication by auto-generating tasks at different levels of difficulty; (b) auto-generation of ground truth that eliminates dependence on expensive and time-consuming human annotation; (c) the use of automatically generated, randomized datasets that mitigate the ability of successive LLMs to overfit to static datasets used in many contemporary benchmarks. Empirical analysis shows that an LLM’s performance on ∀uto∃∨∧L is highly indicative of its performance on a diverse array of other benchmarks focusing on translation and reasoning tasks, making it a valuable autonomous evaluation paradigm in settings where hand-curated datasets can be hard to obtain and/or update.
摘要:本文介绍了 ∀uto∃∨∧L,这是一种用于在具有明确正确性概念的正式任务(如翻译中的真值维护和逻辑推理)中扩展大语言模型 (LLM) 评估的新基准。∀uto∃∨∧L 是首个提供多个关键优势的基准范式,这些优势对于在不依赖人工标注的情况下扩展 LLM 的客观评估至关重要:(a) 通过自动生成不同难度级别的任务来评估日益复杂化的 LLM 的能力;(b) 自动生成基准真值,从而消除对昂贵且耗时的人工标注的依赖;(c) 使用自动生成的、随机化的数据集,这些数据集降低了后续 LLM 对许多当代基准中使用的静态数据集过拟合的能力。实证分析表明,LLM 在 ∀uto∃∨∧L 上的表现与其在专注于翻译和推理任务的其他多样化基准上的表现高度相关,这使得它在难以获取和/或更新手工数据集的环境中成为一个有价值的自主评估范式。

[NLP-62] Exploring the Role of Reasoning Structures for Constructing Proofs in Multi-Step Natural Language Reasoning with Large Language Models EMNLP2024

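【Quick read】: This paper addresses the ability of large language models (LLMs) to generate structured intermediate proof steps in complex multi-step reasoning tasks. The key to the solution is two techniques used with in-context learning, structure-aware demonstration and structure-aware pruning. The paper shows that both help LLMs construct proof structures better, improving the models' reasoning ability and explainability.

Link: https://arxiv.org/abs/2410.08436
Authors: Zi’ou Zheng, Christopher Malon, Martin Renqiang Min, Xiaodan Zhu
Keywords: Large Language Models, multi-step reasoning tasks, improving models’ explainability, complex multi-step reasoning, performing complex multi-step
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by EMNLP2024 main conference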

Abstract:When performing complex multi-step reasoning tasks, the ability of Large Language Models (LLMs) to derive structured intermediate proof steps is important for ensuring that the models truly perform the desired reasoning and for improving models’ explainability. This paper is centred around a focused study: whether the current state-of-the-art generalist LLMs can leverage the structures in a few examples to better construct the proof structures with in-context learning. Our study specifically focuses on structure-aware demonstration and structure-aware pruning. We demonstrate that they both help improve performance. A detailed analysis is provided to help understand the results.
摘要:在进行复杂的多步骤推理任务时,大语言模型 (LLM) 能否推导出结构化的中间证明步骤,对于确保模型真正执行所需的推理以及提高模型的可解释性至关重要。本文围绕一个重点研究展开:当前最先进的通用大语言模型是否能够利用少量示例中的结构,通过上下文学习更好地构建证明结构。我们的研究特别关注结构感知的演示和结构感知的剪枝。我们证明,这两者都有助于提高性能。本文还提供了详细的分析,以帮助理解这些结果。

[NLP-63] Retrieval Augmented Generation for 10 Large Language Models and its Generalizability in Assessing Medical Fitness

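【Quick read】: This paper addresses the lack of specialized clinical knowledge in large language models (LLMs) used for medical applications. The key to the solution is Retrieval Augmented Generation (RAG), which customizes the models for healthcare by incorporating 35 local and 23 international preoperative guidelines. Evaluating the accuracy, consistency, and safety of RAG models in determining fitness for surgery and providing preoperative instructions, the study finds that the GPT4 LLM-RAG model performs best, with 96.4% accuracy, no hallucinations, and correct instructions, and that responses are generated far faster than by clinicians (20 seconds vs. 10 minutes), demonstrating the efficiency, scalability, and reliability of LLM-RAG models for preoperative healthcare tasks.

Link: https://arxiv.org/abs/2410.08431
Authors: Yu He Ke, Liyuan Jin, Kabilan Elangovan, Hairil Rizal Abdullah, Nan Liu, Alex Tiong Heng Sia, Chai Rick Soh, Joshua Yi Min Tung, Jasmine Chiat Ling Ong, Chang-Fu Kuo, Shao-Chun Wu, Vesela P. Kovacheva, Daniel Shu Wei Ting
Keywords: Large Language Models, Large Language, Retrieval Augmented Generation, specialized clinical knowledge, lack specialized clinical
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: substantial text overlap with arXiv:2402.01733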

Abstract:Large Language Models (LLMs) show potential for medical applications but often lack specialized clinical knowledge. Retrieval Augmented Generation (RAG) allows customization with domain-specific information, making it suitable for healthcare. This study evaluates the accuracy, consistency, and safety of RAG models in determining fitness for surgery and providing preoperative instructions. We developed LLM-RAG models using 35 local and 23 international preoperative guidelines and tested them against human-generated responses. A total of 3,682 responses were evaluated. Clinical documents were processed using Llamaindex, and 10 LLMs, including GPT3.5, GPT4, and Claude-3, were assessed. Fourteen clinical scenarios were analyzed, focusing on seven aspects of preoperative instructions. Established guidelines and expert judgment were used to determine correct responses, with human-generated answers serving as comparisons. The LLM-RAG models generated responses within 20 seconds, significantly faster than clinicians (10 minutes). The GPT4 LLM-RAG model achieved the highest accuracy (96.4% vs. 86.6%, p=0.016), with no hallucinations and producing correct instructions comparable to clinicians. Results were consistent across both local and international guidelines. This study demonstrates the potential of LLM-RAG models for preoperative healthcare tasks, highlighting their efficiency, scalability, and reliability.
摘要:大语言模型 (LLMs) 在医疗应用中展现出潜力,但往往缺乏专门的临床知识。检索增强生成 (RAG) 允许通过特定领域的信息进行定制,使其适用于医疗保健领域。本研究评估了 RAG 模型在确定手术适应性和提供术前指导方面的准确性、一致性和安全性。我们开发了基于 35 份本地和 23 份国际术前指南的 LLM-RAG 模型,并将其与人工生成的响应进行对比测试。总共评估了 3,682 条响应。临床文档通过 Llamaindex 处理,并评估了包括 GPT3.5、GPT4 和 Claude-3 在内的 10 个 LLMs。分析了 14 个临床场景,重点关注术前指导的七个方面。通过既定指南和专家判断确定正确响应,人工生成的答案作为对比。LLM-RAG 模型在 20 秒内生成响应,显著快于临床医生(10 分钟)。GPT4 LLM-RAG 模型达到了最高准确率(96.4% vs. 86.6%,p=0.016),没有出现幻觉,并生成了与临床医生相当的正确指导。结果在本地和国际指南中均保持一致。本研究展示了 LLM-RAG 模型在术前医疗任务中的潜力,突显了其效率、可扩展性和可靠性。

[NLP-64] Understanding the Interplay between Parametric and Contextual Knowledge for Large Language Models

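【Quick read】: This paper asks how large language models (LLMs) can effectively integrate their internal parametric knowledge (PK) with external contextual knowledge (CK) when solving complex problems. The key to the solution is to categorize the relationship between PK and CK into four types (supportive, complementary, conflicting, and irrelevant) and to introduce the ECHOQA benchmark to evaluate LLMs under each type. The study finds that LLMs tend to suppress their PK when contextual information is available, even when that information is complementary or irrelevant. Tailored instructions can encourage LLMs to rely more on their PK, but they still struggle to leverage it fully, which reveals a key vulnerability of LLMs in knowledge-intensive tasks.

Link: https://arxiv.org/abs/2410.08414
Authors: Sitao Cheng, Liangming Pan, Xunjian Yin, Xinyi Wang, William Yang Wang
Keywords: Large language models, encode vast amounts, Large language, language models, encode vast
Categories: Computation and Language (cs.CL)
Comments: 27 pages, 8 figures and 17 tables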

Abstract:Large language models (LLMs) encode vast amounts of knowledge during pre-training (parametric knowledge, or PK) and can further be enhanced by incorporating contextual knowledge (CK). Can LLMs effectively integrate their internal PK with external CK to solve complex problems? In this paper, we investigate the dynamic interaction between PK and CK, categorizing their relationships into four types: Supportive, Complementary, Conflicting, and Irrelevant. To support this investigation, we introduce ECHOQA, a benchmark spanning scientific, factual, and commonsense knowledge. Our results show that LLMs tend to suppress their PK when contextual information is available, even when it is complementary or irrelevant. While tailored instructions can encourage LLMs to rely more on their PK, they still struggle to fully leverage it. These findings reveal a key vulnerability in LLMs, raising concerns about their reliability in knowledge-intensive tasks. Resources are available at this https URL Interplay.
摘要:大语言模型 (LLMs) 在预训练阶段编码了大量的知识(参数化知识,或 PK),并且可以通过融入上下文知识 (CK) 进一步增强。LLMs 能否有效地将内部 PK 与外部 CK 整合以解决复杂问题?本文探讨了 PK 与 CK 之间的动态交互,将其关系分为四种类型:支持型、互补型、冲突型和无关型。为支持这一研究,我们引入了 ECHOQA,这是一个涵盖科学、事实和常识知识的基准测试。我们的结果显示,当上下文信息可用时,LLMs 倾向于抑制其 PK,即使这些信息是互补或无关的。虽然定制的指令可以鼓励 LLMs 更多地依赖其 PK,但它们仍然难以充分利用它。这些发现揭示了 LLMs 的一个关键弱点,引发了对其在知识密集型任务中可靠性的担忧。资源可在以下链接获取:https URL Interplay。

[NLP-65] The Effects of Hallucinations in Synthetic Training Data for Relation Extraction ISWC’24

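【Quick read】: This paper addresses the hallucinations that generative data augmentation (GDA) introduces into relation extraction, which can degrade model performance. The key to the solution is identifying and detecting these hallucinations: the paper proposes two methods for separating 'hallucinated' text from 'clean' text, which reach F1 scores of 83.8% and 92.2% respectively. These methods help remove hallucinations and also estimate their prevalence within a dataset, which supports selecting high-quality data and improving relation extraction models.

Link: https://arxiv.org/abs/2410.08393
Authors: Steven Rogulsky, Nicholas Popovic, Michael Färber
Keywords: constructing knowledge graphs, Relation extraction, knowledge graphs, foundation for training, constructing knowledge
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted at KBC-LM@ISWC’24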

Abstract:Relation extraction is crucial for constructing knowledge graphs, with large high-quality datasets serving as the foundation for training, fine-tuning, and evaluating models. Generative data augmentation (GDA) is a common approach to expand such datasets. However, this approach often introduces hallucinations, such as spurious facts, whose impact on relation extraction remains underexplored. In this paper, we examine the effects of hallucinations on the performance of relation extraction on the document and sentence levels. Our empirical study reveals that hallucinations considerably compromise the ability of models to extract relations from text, with recall reductions between 19.1% and 39.2%. We identify that relevant hallucinations impair the model’s performance, while irrelevant hallucinations have a minimal impact. Additionally, we develop methods for the detection of hallucinations to improve data quality and model performance. Our approaches successfully classify texts as either ‘hallucinated’ or ‘clean,’ achieving high F1-scores of 83.8% and 92.2%. These methods not only assist in removing hallucinations but also help in estimating their prevalence within datasets, which is crucial for selecting high-quality data. Overall, our work confirms the profound impact of relevant hallucinations on the effectiveness of relation extraction models.
摘要:关系抽取对于构建知识图谱至关重要,而大规模高质量的数据集则是训练、微调和评估模型的基础。生成式数据增强 (Generative Data Augmentation, GDA) 是扩展此类数据集的常见方法。然而,这种方法往往引入幻觉 (hallucinations),例如虚假事实,其对关系抽取的影响尚未得到充分探索。本文中,我们研究了幻觉对文档和句子级别关系抽取性能的影响。我们的实证研究表明,幻觉显著削弱了模型从文本中抽取关系的能力,召回率下降幅度在 19.1% 到 39.2% 之间。我们发现,相关的幻觉会损害模型的性能,而无关的幻觉影响甚微。此外,我们开发了检测幻觉的方法,以提高数据质量和模型性能。我们的方法成功地将文本分类为“幻觉”或“干净”,F1 分数分别达到 83.8% 和 92.2%。这些方法不仅有助于去除幻觉,还能帮助估计数据集中幻觉的普遍性,这对于选择高质量数据至关重要。总体而言,我们的工作证实了相关幻觉对关系抽取模型有效性的深远影响。

[NLP-66] KV Prediction for Improved Time to First Token

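【Quick read】: This paper addresses the long time to first token (TTFT) of transformer-based language models on edge devices when prompts are long or batches are large. The key to the solution is a new method called KV Prediction: a small auxiliary model processes the prompt and produces an approximation of the KV cache needed by the base model, which then performs autoregressive generation without querying the auxiliary model again. This significantly reduces TTFT while maintaining high accuracy.

Link: https://arxiv.org/abs/2410.08391
Authors: Maxwell Horton, Qingqing Cao, Chenfan Sun, Yanzi Jin, Sachin Mehta, Mohammad Rastegari, Moin Nabi
Keywords: Inference with transformer-based, language models begins, transformer-based language models, prompt processing step, transformer-based language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: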

Abstract:Inference with transformer-based language models begins with a prompt processing step. In this step, the model generates the first output token and stores the KV cache needed for future generation steps. This prompt processing step can be computationally expensive, taking tens of seconds or more for billion-parameter models on edge devices when prompt lengths or batch sizes rise. This degrades user experience by introducing significant latency into the model’s outputs. To reduce the time spent producing the first output (known as the “time to first token”, or TTFT) of a pretrained model, we introduce a novel method called KV Prediction. In our method, a small auxiliary model is used to process the prompt and produce an approximation of the KV cache used by a base model. This approximated KV cache is then used with the base model for autoregressive generation without the need to query the auxiliary model again. We demonstrate that our method produces a Pareto-optimal efficiency-accuracy trade-off when compared to baselines. On TriviaQA, we demonstrate relative accuracy improvements in the range of 15%-50% across a range of TTFT FLOPs budgets. We also demonstrate accuracy improvements of up to 30% on HumanEval Python code completion at fixed TTFT FLOPs budgets. Additionally, we benchmark models on an Apple M2 Pro CPU and demonstrate that our improvement in FLOPs translates to a TTFT speedup on hardware. We release our code at this https URL.
摘要:基于 Transformer 的语言模型在进行推理时,首先需要进行提示处理步骤。在这一步骤中,模型生成第一个输出 Token,并存储未来生成步骤所需的 KV 缓存。当提示长度或批量大小增加时,这一提示处理步骤在边缘设备上的计算成本可能非常高,对于十亿参数的模型,可能需要数十秒甚至更长时间。这会通过引入显著的延迟来降低用户体验。为了减少生成第一个输出(称为“首个 Token 时间”,或 TTFT)所需的时间,我们引入了一种名为 KV 预测的新方法。在我们的方法中,使用一个小型辅助模型来处理提示,并生成基础模型所需 KV 缓存的近似值。然后,这个近似的 KV 缓存与基础模型一起用于自回归生成,无需再次查询辅助模型。我们证明,与基线方法相比,我们的方法在效率-准确性权衡方面达到了帕累托最优。在 TriviaQA 数据集上,我们在不同 TTFT FLOPs 预算范围内展示了 15%-50% 的相对准确性提升。在固定 TTFT FLOPs 预算下,我们在 HumanEval Python 代码补全任务中也展示了高达 30% 的准确性提升。此外,我们在 Apple M2 Pro CPU 上对模型进行了基准测试,并证明我们的 FLOPs 改进在硬件上转化为 TTFT 的加速。我们在此 https URL 上发布了代码。

[NLP-67] GUS-Net: Social Bias Classification in Text with Generalizations Unfairness and Stereotypes

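【Quick read】: This paper addresses bias detection in natural language processing (NLP), a critical problem given the growing use of large language models (LLMs) across domains. The key to the solution is GUS-Net, a bias detection approach that focuses on three main types of bias: (G)eneralizations, (U)nfairness, and (S)tereotypes. GUS-Net uses generative AI and automated agents to create a comprehensive synthetic dataset, enabling robust multi-label token classification. Its core innovation is incorporating the contextual encodings of pre-trained models, which improves the accuracy and depth of bias identification. Experiments show that GUS-Net outperforms state-of-the-art techniques in accuracy, F1-score, and Hamming Loss and captures a wide range of biases across diverse contexts.

Link: https://arxiv.org/abs/2410.08388
Authors: Maximus Powers, Hua Wei, Umang Mavani, Harshitha Reddy Jonala, Ansh Tiwari
Keywords: natural language processing, critical challenge, bias detection, large language models, bias
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: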

Abstract:The detection of bias in natural language processing (NLP) is a critical challenge, particularly with the increasing use of large language models (LLMs) in various domains. This paper introduces GUS-Net, an innovative approach to bias detection that focuses on three key types of biases: (G)eneralizations, (U)nfairness, and (S)tereotypes. GUS-Net leverages generative AI and automated agents to create a comprehensive synthetic dataset, enabling robust multi-label token classification. Our methodology enhances traditional bias detection methods by incorporating the contextual encodings of pre-trained models, resulting in improved accuracy and depth in identifying biased entities. Through extensive experiments, we demonstrate that GUS-Net outperforms state-of-the-art techniques, achieving superior performance in terms of accuracy, F1-score, and Hamming Loss. The findings highlight GUS-Net’s effectiveness in capturing a wide range of biases across diverse contexts, making it a valuable tool for social bias detection in text. This study contributes to the ongoing efforts in NLP to address implicit bias, providing a pathway for future research and applications in various fields. The Jupyter notebooks used to create the dataset and model are available at: this https URL. Warning: This paper contains examples of harmful language, and reader discretion is recommended.
摘要:自然语言处理 (NLP) 中的偏见检测是一个关键挑战,尤其是在大语言模型 (LLM) 在各个领域中日益广泛应用的背景下。本文介绍了 GUS-Net,一种创新的偏见检测方法,专注于三种关键类型的偏见:(G) 概括性偏见、(U) 不公平性偏见和 (S) 刻板印象偏见。GUS-Net 利用生成式 AI 和自动化智能体创建了一个全面的合成数据集,从而实现了强大的多标签 Token 分类。我们的方法通过结合预训练模型的上下文编码,增强了传统的偏见检测方法,从而在识别偏见实体时提高了准确性和深度。通过广泛的实验,我们证明 GUS-Net 优于最先进的技术,在准确率、F1-score 和 Hamming Loss 方面表现出色。研究结果凸显了 GUS-Net 在捕捉不同上下文中广泛偏见方面的有效性,使其成为文本中社会偏见检测的有价值工具。本研究为 NLP 领域解决隐性偏见的持续努力做出了贡献,为未来在各个领域的研究和应用提供了路径。用于创建数据集和模型的 Jupyter 笔记本可在以下链接获取:this https URL。警告:本文包含有害语言的示例,建议读者谨慎阅读。
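
Since Hamming Loss is one of the reported metrics for multi-label token classification, a small worked example may help. Hamming loss is the fraction of individual label slots predicted incorrectly. The sketch below assumes each token carries three binary labels in the order (G, U, S); the labels and predictions are made up for illustration and are not from the paper.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of (token, label) slots where prediction and truth differ."""
    total = 0
    wrong = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            total += 1
            wrong += int(t != p)
    return wrong / total

# One row per token; columns are the G, U, S labels (hypothetical data).
y_true = [
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
    [1, 0, 1],
]
y_pred = [
    [1, 0, 0],
    [0, 1, 1],  # spurious U label
    [0, 0, 0],
    [0, 0, 1],  # missed G label
]

loss = hamming_loss(y_true, y_pred)  # 2 wrong slots out of 12
```

Unlike exact-match accuracy, which would count two of the four tokens as entirely wrong, Hamming loss gives partial credit for the labels that were predicted correctly on those tokens. Lower is better.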

[NLP-68] Evaluating Transformer Models for Suicide Risk Detection on Social Media

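【Quick read】: This paper addresses the detection of suicide risk on social media. The key to the solution is applying advanced natural language processing, specifically transformer-based models such as DeBERTa and GPT-4o, fine-tuned to classify social media posts into four suicide-risk categories (indicator, ideation, behavior, and attempt). The fine-tuned GPT-4o model performed best and took second place in the competition, suggesting that general-purpose models with appropriate tuning can be applied effectively to automated suicide risk detection.

Link: https://arxiv.org/abs/2410.08375
Authors: Jakub Pokrywka, Jeremi I. Kaczmarek, Edward J. Gorzelańczyk
Keywords: social media, suicide risk, social media posts, potential life-saving implications, suicide risk detection
Categories: Computation and Language (cs.CL)
Comments: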

Abstract:The detection of suicide risk in social media is a critical task with potential life-saving implications. This paper presents a study on leveraging state-of-the-art natural language processing solutions for identifying suicide risk in social media posts as a submission for the “IEEE BigData 2024 Cup: Detection of Suicide Risk on Social Media” conducted by the kubapok team. We experimented with the following configurations of transformer-based models: fine-tuned DeBERTa, GPT-4o with CoT and few-shot prompting, and fine-tuned GPT-4o. The task setup was to classify social media posts into four categories: indicator, ideation, behavior, and attempt. Our findings demonstrate that the fine-tuned GPT-4o model outperforms two other configurations, achieving high accuracy in identifying suicide risk. Notably, our model achieved second place in the competition. By demonstrating that straightforward, general-purpose models can achieve state-of-the-art results, we propose that these models, combined with minimal tuning, may have the potential to be effective solutions for automated suicide risk detection on social media.

[NLP-69] Merging in a Bottle: Differentiable Adaptive Merging (DAM) and the Path from Averaging to Automation

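[Quick read]: This paper addresses the complexity of model merging, which arises from differences in training methods and fine-tuning, and proposes Differentiable Adaptive Merging (DAM), an efficient merging method. DAM optimizes model integration through scaling coefficients, which keeps computational demands low. The paper also compares merging techniques across a range of complexity and finds that simple averaging methods such as Model Soups perform competitively when model similarity is high, underscoring each technique's particular strengths and limitations.

Link: https://arxiv.org/abs/2410.08371
Authors: Thomas Gauthier-Caron, Shamane Siriwardhana, Elliot Stein, Malikeh Ehghaghi, Charles Goddard, Mark McQuade, Jacob Solawetz, Maxime Labonne
Keywords: requiring substantial retraining, separate language models, achieving a balance, substantial retraining, systems can combine
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages, 1 figure, and 3 tables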

Abstract:By merging models, AI systems can combine the distinct strengths of separate language models, achieving a balance between multiple capabilities without requiring substantial retraining. However, the integration process can be intricate due to differences in training methods and fine-tuning, typically necessitating specialized knowledge and repeated refinement. This paper explores model merging techniques across a spectrum of complexity, examining where automated methods like evolutionary strategies stand compared to hyperparameter-driven approaches such as DARE, TIES-Merging and simpler methods like Model Soups. In addition, we introduce Differentiable Adaptive Merging (DAM), an efficient, adaptive merging approach as an alternative to evolutionary merging that optimizes model integration through scaling coefficients, minimizing computational demands. Our findings reveal that even simple averaging methods, like Model Soups, perform competitively when model similarity is high, underscoring each technique’s unique strengths and limitations. We open-sourced DAM, including the implementation code and experiment pipeline, on GitHub: this https URL.
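
To make the contrast in this entry concrete, here is a minimal sketch of the two ends of the spectrum: uniform averaging in the style of Model Soups, and a weighted combination driven by scaling coefficients. It is an illustration under stated assumptions, not the DAM implementation: the abstract only says DAM optimizes integration through scaling coefficients, so the one-coefficient-per-model granularity, the hand-picked coefficient values, and the dict-of-lists stand-in for a state dict are all hypothetical.

```python
def merge(models, coeffs):
    """Combine parameter dicts as sum_i coeffs[i] * models[i][name]."""
    merged = {}
    for name in models[0]:
        size = len(models[0][name])
        merged[name] = [
            sum(c * m[name][j] for c, m in zip(coeffs, models))
            for j in range(size)
        ]
    return merged


def uniform_soup(models):
    """Plain averaging: every model gets the coefficient 1/n."""
    n = len(models)
    return merge(models, [1.0 / n] * n)


# Hypothetical toy "checkpoints" with identical parameter names and shapes.
model_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
model_b = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}

soup = uniform_soup([model_a, model_b])
# Fixed by hand here; a learned method would optimize these coefficients.
weighted = merge([model_a, model_b], [0.75, 0.25])
```

When the source models are close to each other in parameter space, the uniform and weighted results differ little, which is consistent with the abstract's observation that simple averaging is competitive when model similarity is high.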

[NLP-70] Revealing COVID-19's Social Dynamics: Diachronic Semantic Analysis of Vaccine and Symptom Discourse on Twitter

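[Quick read]: This paper addresses semantic shift, the change in word meaning over time, which complicates the analysis of social media text. It proposes an unsupervised dynamic word embedding method that uses word co-occurrence statistics and dynamic updating to capture longitudinal semantic shifts in social media data. The method needs no predefined anchor words and is designed to cope with data sparseness, imbalanced distributions, and synergistic semantic effects. On a large COVID-19 Twitter dataset, it reveals how the semantics of vaccine- and symptom-related entities evolved across pandemic stages and how these patterns may correlate with real-world statistics.

Link: https://arxiv.org/abs/2410.08352
Authors: Zeqiang Wang, Jiageng Wu, Yuqi Wang, Wei Wang, Jie Yang, Jon Johnson, Nishanth Sastry, Suparna De
Keywords: vast textual data, textual data generated, data generated daily, social impacts due, behavior of people
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Comments: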

Abstract:Social media is recognized as an important source for deriving insights into public opinion dynamics and social impacts due to the vast textual data generated daily and the ‘unconstrained’ behavior of people interacting on these platforms. However, such analyses prove challenging due to the semantic shift phenomenon, where word meanings evolve over time. This paper proposes an unsupervised dynamic word embedding method to capture longitudinal semantic shifts in social media data without predefined anchor words. The method leverages word co-occurrence statistics and dynamic updating to adapt embeddings over time, addressing the challenges of data sparseness, imbalanced distributions, and synergistic semantic effects. Evaluated on a large COVID-19 Twitter dataset, the method reveals semantic evolution patterns of vaccine- and symptom-related entities across different pandemic stages, and their potential correlations with real-world statistics. Our key contributions include the dynamic embedding technique, empirical analysis of COVID-19 semantic shifts, and discussions on enhancing semantic shift modeling for computational social science research. This study enables capturing longitudinal semantic dynamics on social media to understand public discourse and collective phenomena.

[NLP-71] Nonlinear second-order dynamics describe labial constriction trajectories across languages and contexts

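[Quick read]: This paper studies the dynamics of labial constriction trajectories during the production of /b/ and /m/ in English and Mandarin. Its central step is to identify and formalize an empirical regularity: the ratio of instantaneous displacement to instantaneous velocity generally follows an exponential decay curve from movement onset to movement offset. From this, the paper derives a nonlinear second-order dynamical system with only two parameters, T (the target state) and r (movement rapidity). Nonlinear regression shows that the model fits individual movement trajectories well, and trajectories simulated from the model capture key kinematic variables such as duration, peak velocity, and time to peak velocity. The model offers a new foundation for understanding articulatory dynamics and a framework for studying further influences such as prosody, inter-movement coordination, and stochastic noise.

Link: https://arxiv.org/abs/2410.08351
Authors: Michael C. Stern, Jason A. Shaw
Keywords: English and Mandarin, labial constriction trajectories, labial constriction, Mandarin, English
Categories: Computation and Language (cs.CL); Adaptation and Self-Organizing Systems (nlin.AO)
Comments: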

Abstract:We investigate the dynamics of labial constriction trajectories during the production of /b/ and /m/ in English and Mandarin. We find that, across languages and contexts, the ratio of instantaneous displacement to instantaneous velocity generally follows an exponential decay curve from movement onset to movement offset. We formalize this empirical discovery in a differential equation and, in combination with an assumption of point attractor dynamics, derive a nonlinear second-order dynamical system describing labial constriction trajectories. The equation has only two parameters, T and r. T corresponds to the target state and r corresponds to movement rapidity. Thus, each of the parameters corresponds to a phonetically relevant dimension of control. Nonlinear regression demonstrates that the model provides excellent fits to individual movement trajectories. Moreover, trajectories simulated from the model qualitatively match empirical trajectories, and capture key kinematic variables like duration, peak velocity, and time to achieve peak velocity. The model constitutes a proposal for the dynamics of individual articulatory movements, and thus offers a novel foundation from which to understand additional influences on articulatory kinematics like prosody, inter-movement coordination, and stochastic noise.

[NLP-72] Exploring Natural Language-Based Strategies for Efficient Number Learning in Children through Reinforcement Learning

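[Quick read]: This paper investigates how children learn numbers through the lens of reinforcement learning (RL), focusing on the effect of language instructions. Using state-of-the-art deep RL models, it simulates and analyzes how different forms of language instruction affect number acquisition in RL agents. It finds that certain linguistic structures improve numerical comprehension more effectively, and it predicts optimal sequences for presenting numbers that speed up learning. The results shed light on the interplay between language and numerical cognition, with implications for educational strategies and for AI systems designed to support early childhood learning.

Link: https://arxiv.org/abs/2410.08334
Authors: Tirthankar Mittra
Keywords: children learn numbers, paper investigates, investigates how children, children learn, reinforcement learning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: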

Abstract:This paper investigates how children learn numbers using the framework of reinforcement learning (RL), with a focus on the impact of language instructions. The motivation for using reinforcement learning stems from its parallels with psychological learning theories in controlled environments. By using state of the art deep reinforcement learning models, we simulate and analyze the effects of various forms of language instructions on number acquisition. Our findings indicate that certain linguistic structures more effectively improve numerical comprehension in RL agents. Additionally, our model predicts optimal sequences for presenting numbers to RL agents which enhance their speed of learning. This research provides valuable insights into the interplay between language and numerical cognition, with implications for both educational strategies and the development of artificial intelligence systems designed to support early childhood learning.

[NLP-73] Agents Thinking Fast and Slow: A Talker-Reasoner Architecture

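[Quick read]: This paper addresses the difficulty of having LLM-based agents handle both conversation and multi-step reasoning and planning in natural dialogue. It introduces the Talker-Reasoner architecture: the Talker (System 1) is fast and intuitive and synthesizes the conversational response, while the Reasoner (System 2) is slower and more deliberative, performing multi-step reasoning and planning, calling tools, and executing actions that produce the new agent state. The architecture's advantages include modularity and decreased latency, and the paper grounds the discussion in a sleep coaching agent to show real-world relevance.

Link: https://arxiv.org/abs/2410.08328
Authors: Konstantina Christakopoulou, Shibl Mourad, Maja Matarić
Keywords: Large language models, Large language, natural conversation, language models, models have enabled
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: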

Abstract:Large language models have enabled agents of all kinds to interact with users through natural conversation. Consequently, agents now have two jobs: conversing and planning/reasoning. Their conversational responses must be informed by all available information, and their actions must help to achieve goals. This dichotomy between conversing with the user and doing multi-step reasoning and planning can be seen as analogous to the human systems of “thinking fast and slow” as introduced by Kahneman. Our approach is comprised of a “Talker” agent (System 1) that is fast and intuitive, and tasked with synthesizing the conversational response; and a “Reasoner” agent (System 2) that is slower, more deliberative, and more logical, and is tasked with multi-step reasoning and planning, calling tools, performing actions in the world, and thereby producing the new agent state. We describe the new Talker-Reasoner architecture and discuss its advantages, including modularity and decreased latency. We ground the discussion in the context of a sleep coaching agent, in order to demonstrate real-world relevance.
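
The division of labor described in this entry can be sketched as two small components. This is a toy illustration, not the authors' system: the shared `Memory` object used as the hand-off, the keyword rule inside `Reasoner.update`, and all class and method names are hypothetical. The point it shows is that the Talker can answer immediately from whatever state exists, while the Reasoner updates that state separately.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Shared store (an assumption): the Reasoner writes state, the Talker reads it."""
    state: dict = field(default_factory=dict)
    history: list = field(default_factory=list)


class Reasoner:
    """System 2 stand-in: slower planning step that produces the new agent state."""

    def update(self, memory):
        last_user_turn = memory.history[-1] if memory.history else ""
        # A real Reasoner would do multi-step reasoning and tool calls here.
        plan = ["ask about bedtime"] if "sleep" in last_user_turn else []
        memory.state = {"plan": plan, "turns_seen": len(memory.history)}


class Talker:
    """System 1 stand-in: fast reply built from whatever state is available."""

    def respond(self, user_turn, memory):
        memory.history.append(user_turn)
        plan = memory.state.get("plan", [])
        return f"Next step: {plan[0]}" if plan else "Tell me more."


memory = Memory()
talker, reasoner = Talker(), Reasoner()
first = talker.respond("I can't sleep well", memory)  # no plan in memory yet
reasoner.update(memory)                               # runs after the turn
second = talker.respond("What should I do?", memory)  # now uses the plan
```

Keeping the two roles behind separate interfaces is one way to get the modularity the abstract mentions; letting the Talker reply without waiting for the Reasoner is one plausible source of the lower latency, though the abstract does not spell out the mechanism.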

[NLP-74] Evaluating Differentially Private Synthetic Data Generation in High-Stakes Domains EMNLP2024

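[Quick read]: This paper addresses the difficulty of anonymizing text data in high-stakes domains that involve private data, such as healthcare and social services. It explores using synthetic data generated by differentially private language models in place of real data, so that NLP can be developed in these domains without compromising privacy. The authors generate synthetic data for real high-stakes domains and propose and conduct use-inspired evaluations of data quality. The results show that earlier, simplistic evaluations failed to surface utility, privacy, and fairness issues in the synthetic data. Overall, the work underscores that synthetic data generation needs further improvement before it is a viable route to privacy-preserving data sharing.

Link: https://arxiv.org/abs/2410.08327
Authors: Krithika Ramesh, Nupoor Gandhi, Pulkit Madaan, Lisa Bauer, Charith Peris, Anjalie Field
Keywords: anonymizing text data, text data hinders, deployment of NLP, social services, difficulty of anonymizing
Categories: Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2024 (Findings)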

Abstract:The difficulty of anonymizing text data hinders the development and deployment of NLP in high-stakes domains that involve private data, such as healthcare and social services. Poorly anonymized sensitive data cannot be easily shared with annotators or external researchers, nor can it be used to train public models. In this work, we explore the feasibility of using synthetic data generated from differentially private language models in place of real data to facilitate the development of NLP in these domains without compromising privacy. In contrast to prior work, we generate synthetic data for real high-stakes domains, and we propose and conduct use-inspired evaluations to assess data quality. Our results show that prior simplistic evaluations have failed to highlight utility, privacy, and fairness issues in the synthetic data. Overall, our work underscores the need for further improvements to synthetic data generation for it to be a viable way to enable privacy-preserving data sharing.

[NLP-75] The language of sound search: Examining User Queries in Audio Search Engines

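[Quick read]: This paper addresses the gap between current research on text-based audio retrieval and real-world user needs and behaviours. It analyses two datasets, queries from a custom survey and query logs from the Freesound website, to reveal how users phrase queries with and without the constraints of an existing system. Survey queries are generally longer than Freesound queries, which suggests that users prefer more detailed queries when unconstrained, and queries in both datasets are predominantly keyword-based rather than full sentences. These findings inform the design of user-centred, effective text-based audio retrieval systems.

Link: https://arxiv.org/abs/2410.08324
Authors: Benno Weck, Frederic Font
Keywords: study examines textual, general audio retrieval, audio retrieval, audio retrieval systems, text-based audio retrieval
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted at DCASE 2024. Supplementary materials at this https URL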

Abstract:This study examines textual, user-written search queries within the context of sound search engines, encompassing various applications such as foley, sound effects, and general audio retrieval. Current research inadequately addresses real-world user needs and behaviours in designing text-based audio retrieval systems. To bridge this gap, we analysed search queries from two sources: a custom survey and Freesound website query logs. The survey was designed to collect queries for an unrestricted, hypothetical sound search engine, resulting in a dataset that captures user intentions without the constraints of existing systems. This dataset is also made available for sharing with the research community. In contrast, the Freesound query logs encompass approximately 9 million search requests, providing a comprehensive view of real-world usage patterns. Our findings indicate that survey queries are generally longer than Freesound queries, suggesting users prefer detailed queries when not limited by system constraints. Both datasets predominantly feature keyword-based queries, with few survey participants using full sentences. Key factors influencing survey queries include the primary sound source, intended usage, perceived location, and the number of sound sources. These insights are crucial for developing user-centred, effective text-based audio retrieval systems, enhancing our understanding of user behaviour in sound search contexts.

[NLP-76] Do You Know What You Are Talking About? Characterizing Query-Knowledge Relevance For Reliable Retrieval Augmented Generation

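[Quick read]: This paper addresses hallucination and misinformation in language models (LMs), and in particular the inaccurate responses a RAG system may produce when a user query falls outside the scope of the external knowledge corpus or when the corpus is out of date. It establishes a statistical framework that assesses how well a RAG system can answer a query by capturing the relevance of knowledge. Two testing procedures are proposed: an online test that applies goodness-of-fit (GoF) tests to each user query to detect out-of-knowledge queries with low knowledge relevance, and an offline test that examines a collection of user queries to detect significant shifts in the query distribution, which indicate that the corpus no longer adequately supports users' interests. A systematic evaluation on eight question-answering datasets indicates that the testing framework is an efficient way to improve the reliability of existing RAG systems.

Link: https://arxiv.org/abs/2410.08320
Authors: Zhuohang Li, Jiaxin Zhang, Chao Yan, Kamalika Das, Sricharan Kumar, Murat Kantarcioglu, Bradley A. Malin
Keywords: Language models, external knowledge corpus, hallucinations and misinformation, suffer from hallucinations, knowledge corpus
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: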

Abstract:Language models (LMs) are known to suffer from hallucinations and misinformation. Retrieval augmented generation (RAG) that retrieves verifiable information from an external knowledge corpus to complement the parametric knowledge in LMs provides a tangible solution to these problems. However, the generation quality of RAG is highly dependent on the relevance between a user’s query and the retrieved documents. Inaccurate responses may be generated when the query is outside of the scope of knowledge represented in the external knowledge corpus or if the information in the corpus is out-of-date. In this work, we establish a statistical framework that assesses how well a query can be answered by an RAG system by capturing the relevance of knowledge. We introduce an online testing procedure that employs goodness-of-fit (GoF) tests to inspect the relevance of each user query to detect out-of-knowledge queries with low knowledge relevance. Additionally, we develop an offline testing framework that examines a collection of user queries, aiming to detect significant shifts in the query distribution which indicates the knowledge corpus is no longer sufficiently capable of supporting the interests of the users. We demonstrate the capabilities of these strategies through a systematic evaluation on eight question-answering (QA) datasets, the results of which indicate that the new testing framework is an efficient solution to enhance the reliability of existing RAG systems.
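
The online test in this entry flags queries whose relevance to the corpus is unusually low. The abstract does not give the specific GoF statistics, so the sketch below substitutes a simpler stand-in: an empirical p-value that compares a query's best retrieval score with scores from queries known to be answerable. The scores, the `alpha` threshold, and the function names are hypothetical.

```python
def empirical_p_value(reference_scores, query_score):
    """Share of in-scope reference scores that are <= the query's score.

    A small value means the query's best retrieval score is unusually low
    compared with queries known to be covered by the corpus.
    """
    n_low = sum(1 for s in reference_scores if s <= query_score)
    # The +1 terms keep the p-value strictly positive for finite samples.
    return (n_low + 1) / (len(reference_scores) + 1)


def is_out_of_knowledge(reference_scores, query_score, alpha=0.1):
    """Flag the query when its empirical p-value falls below alpha."""
    return empirical_p_value(reference_scores, query_score) < alpha


# Hypothetical top-1 similarity scores for queries the corpus can answer.
reference = [0.71, 0.80, 0.65, 0.90, 0.77, 0.84, 0.69, 0.88, 0.73, 0.79,
             0.82, 0.75, 0.86, 0.68, 0.91, 0.74, 0.81, 0.70, 0.87]

flag_low = is_out_of_knowledge(reference, 0.30)  # far below every reference
flag_ok = is_out_of_knowledge(reference, 0.78)   # typical in-scope score
```

A flagged query could be routed to a fallback (for example, declining to answer) instead of being passed to generation; the offline test in the abstract applies the same idea to a whole batch of queries rather than one at a time.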

[NLP-77] MELO: An Evaluation Benchmark for Multilingual Entity Linking of Occupations RECSYS2024

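[Quick read]: This paper addresses multilingual entity linking of occupations. It introduces the Multilingual Entity Linking of Occupations (MELO) Benchmark, a collection of 48 datasets for evaluating the linking of entity mentions in 21 languages to the ESCO Occupations multilingual taxonomy. MELO is built from high-quality, pre-existing human annotations, and the paper runs experiments with simple lexical models and general-purpose sentence encoders (evaluated as bi-encoders in a zero-shot setup) to establish baselines for future research. The datasets and the source code for standardized evaluation are publicly available.

Link: https://arxiv.org/abs/2410.08319
Authors: Federico Retyk, Luis Gasco, Casimiro Pio Carrino, Daniel Deniz, Rabih Zbib
Keywords: ESCO Occupations multilingual, Multilingual Entity Linking, Occupations multilingual taxonomy, ESCO Occupations, Occupations multilingual
Categories: Computation and Language (cs.CL)
Comments: Accepted to the 4th Workshop on Recommender Systems for Human Resources (RecSys in HR 2024) as part of RecSys 2024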

Abstract:We present the Multilingual Entity Linking of Occupations (MELO) Benchmark, a new collection of 48 datasets for evaluating the linking of entity mentions in 21 languages to the ESCO Occupations multilingual taxonomy. MELO was built using high-quality, pre-existent human annotations. We conduct experiments with simple lexical models and general-purpose sentence encoders, evaluated as bi-encoders in a zero-shot setup, to establish baselines for future research. The datasets and source code for standardized evaluation are publicly available at this https URL
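
As an illustration of what a "simple lexical model" baseline for this task could look like, the sketch below ranks taxonomy entries by character trigram overlap with a mention. It is not the benchmark's code: the Jaccard scoring, the function names, and the three-entry taxonomy (with invented IDs, not ESCO data) are assumptions for the example.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of the lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link(mention, taxonomy, top_k=3):
    """Rank taxonomy entries by n-gram overlap with the mention."""
    query = char_ngrams(mention)
    scored = [(jaccard(query, char_ngrams(name)), concept_id)
              for concept_id, name in taxonomy.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [concept_id for _, concept_id in scored[:top_k]]


# Hypothetical mini-taxonomy; IDs and names are illustrative, not ESCO data.
taxonomy = {
    "occ-1": "software developer",
    "occ-2": "data scientist",
    "occ-3": "primary school teacher",
}
ranking = link("senior software developer", taxonomy)
```

A bi-encoder baseline has the same shape: replace `char_ngrams` with a sentence encoder and `jaccard` with cosine similarity, encoding mentions and taxonomy names independently.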

[NLP-78] HyperDPO: Hypernetwork-based Multi-Objective Fine-Tuning Framework

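[Quick read]: This paper addresses the Multi-Objective Fine-Tuning (MOFT) problem: fine-tuning an existing model simultaneously with datasets labeled with respect to different objectives. It proposes HyperDPO, a hypernetwork-based framework that extends Direct Preference Optimization (DPO) to MOFT settings; replacing the Bradley-Terry-Luce model in DPO with the Plackett-Luce model lets it handle a wide range of MOFT tasks involving listwise ranking datasets. HyperDPO offers an efficient one-shot training process for profiling the Pareto front of auxiliary objectives, plus flexible post-training control over trade-offs. The paper also proposes a Hyper Prompt Tuning design that conveys continuous weights across objectives to transformer-based models without altering their architecture.

Link: https://arxiv.org/abs/2410.08316
Authors: Yinuo Ren, Tesi Xiao, Michael Shavlovsky, Lexing Ying, Holakou Rahmanian
Keywords: Direct Preference Optimization, LLM alignment, efficient LLM alignment, Multi-Objective Fine-Tuning, faces the Multi-Objective
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Optimization and Control (math.OC)
Comments: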

Abstract:In LLM alignment and many other ML applications, one often faces the Multi-Objective Fine-Tuning (MOFT) problem, i.e. fine-tuning an existing model with datasets labeled w.r.t. different objectives simultaneously. To address the challenge, we propose the HyperDPO framework, a hypernetwork-based approach that extends the Direct Preference Optimization (DPO) technique, originally developed for efficient LLM alignment with preference data, to accommodate the MOFT settings. By substituting the Bradley-Terry-Luce model in DPO with the Plackett-Luce model, our framework is capable of handling a wide range of MOFT tasks that involve listwise ranking datasets. Compared with previous approaches, HyperDPO enjoys an efficient one-shot training process for profiling the Pareto front of auxiliary objectives, and offers flexible post-training control over trade-offs. Additionally, we propose a novel Hyper Prompt Tuning design, that conveys continuous weight across objectives to transformer-based models without altering their architecture. We demonstrate the effectiveness and efficiency of the HyperDPO framework through its applications to various tasks, including Learning-to-Rank (LTR) and LLM alignment, highlighting its viability for large-scale ML deployments.
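
The switch from the Bradley-Terry-Luce model to the Plackett-Luce model is what moves the objective from pairwise to listwise data. The sketch below shows the generic Plackett-Luce log-likelihood of a ranking and checks that it reduces to the pairwise Bradley-Terry form for two items. It is a textbook illustration, not the HyperDPO loss: in DPO the scores would be implicit rewards derived from policy and reference log-probabilities, whereas here they are plain numbers chosen for the example.

```python
import math


def plackett_luce_log_prob(scores_in_rank_order):
    """Log-probability of a full ranking under the Plackett-Luce model.

    scores_in_rank_order[0] is the score of the top-ranked item. At each
    position the chosen item competes (via softmax) with all items that
    have not been placed yet.
    """
    log_prob = 0.0
    for k in range(len(scores_in_rank_order)):
        remaining = scores_in_rank_order[k:]
        top = max(remaining)  # subtract the max for numerical stability
        log_norm = top + math.log(sum(math.exp(s - top) for s in remaining))
        log_prob += scores_in_rank_order[k] - log_norm
    return log_prob


def bradley_terry_log_prob(score_winner, score_loser):
    """Log-probability that the winner beats the loser (pairwise case)."""
    return -math.log1p(math.exp(-(score_winner - score_loser)))


pairwise_pl = plackett_luce_log_prob([1.2, 0.3])
pairwise_bt = bradley_terry_log_prob(1.2, 0.3)  # matches pairwise_pl
listwise = plackett_luce_log_prob([2.0, 1.0, 0.5])
```

With two items the last factor of the Plackett-Luce product is always 1, which is why the two functions agree; with longer lists each extra position adds another softmax term.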

[NLP-79] Privately Learning from Graphs with Applications in Fine-tuning Large Language Models

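[Quick read]: This paper addresses relational learning on graph data from sensitive domains such as finance and healthcare without leaking private information. It proposes a privacy-preserving relational learning pipeline that decouples dependencies in sampled relations during training and ensures differential privacy through a tailored application of DP-SGD (differentially private stochastic gradient descent). This targets the weakness of standard DP-SGD on relational data, where coupled training samples are inherently dependent. The method is used to fine-tune large language models (e.g., BERT, Llama2) and is evaluated on real-world relational data, where it significantly improves relational learning tasks while maintaining privacy guarantees during training.

Link: https://arxiv.org/abs/2410.08299
Authors: Haoteng Yin, Rongzhe Wei, Eli Chien, Pan Li
Keywords: complementing data modalities, Graphs offer unique, offer unique insights, interactions between entities, modalities like text
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: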

Abstract:Graphs offer unique insights into relationships and interactions between entities, complementing data modalities like text, images, and videos. By incorporating relational information from graph data, AI models can extend their capabilities beyond traditional tasks. However, relational data in sensitive domains such as finance and healthcare often contain private information, making privacy preservation crucial. Existing privacy-preserving methods, such as DP-SGD, which rely on gradient decoupling assumptions, are not well-suited for relational learning due to the inherent dependencies between coupled training samples. To address this challenge, we propose a privacy-preserving relational learning pipeline that decouples dependencies in sampled relations during training, ensuring differential privacy through a tailored application of DP-SGD. We apply this method to fine-tune large language models (LLMs) on sensitive graph data, and tackle the associated computational complexities. Our approach is evaluated on LLMs of varying sizes (e.g., BERT, Llama2) using real-world relational data from four text-attributed graphs. The results demonstrate significant improvements in relational learning tasks, all while maintaining robust privacy guarantees during training. Additionally, we explore the trade-offs between privacy, utility, and computational efficiency, offering insights into the practical deployment of our approach. Code is available at this https URL.
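
For readers unfamiliar with DP-SGD, the sketch below shows one standard update step: clip each example's gradient, sum, add Gaussian noise scaled to the clipping bound, then average. It is the generic recipe, not the paper's tailored variant for relational data, and the toy gradients and hyperparameter values are hypothetical. The per-example clipping is exactly where the independence assumption enters: each gradient is treated as depending on one record only, which does not hold when a training relation couples several entities.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One standard DP-SGD update: clip each example, sum, add noise, average."""
    batch_size = len(per_example_grads)
    clipped = [clip(g, max_norm) for g in per_example_grads]
    summed = [sum(column) for column in zip(*clipped)]
    sigma = noise_multiplier * max_norm  # noise scale tied to the clip bound
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / batch_size for s in summed]
    return [p - lr * g for p, g in zip(params, noisy_mean)]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # hypothetical per-example gradients
clipped_norm = math.sqrt(sum(g * g for g in clip(grads[0], 1.0)))
# noise_multiplier=0.0 disables the noise so the arithmetic is easy to follow.
no_noise = dp_sgd_step([0.0, 0.0], grads, lr=1.0, max_norm=1.0,
                       noise_multiplier=0.0, rng=rng)
```

In practice the noise multiplier is positive and chosen together with the batch size and number of steps to meet a target privacy budget; that accounting is omitted here.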

[NLP-80] Increasing the Difficulty of Automatically Generated Questions via Reinforcement Learning with Synthetic Preference

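[Quick read]: This paper addresses the lack of domain-specific machine reading comprehension (MRC) datasets in the cultural heritage sector. It presents a cost-effective approach that uses Reinforcement Learning from Human Feedback (RLHF) with synthetic preference data to generate MRC datasets of increased difficulty. Specifically, the method builds a difficulty metric from the performance of existing question-answering models on a subset of SQuAD and uses Proximal Policy Optimization (PPO) to generate more challenging questions. The paper also releases an open-source codebase and a set of three llama-2-chat adapters for reproducibility and adaptation.

Link: https://arxiv.org/abs/2410.08289
Authors: William Thorne, Ambrose Robinson, Bohua Peng, Chenghua Lin, Diana Maynard
Keywords: sector increasingly adopts, increasingly adopts technologies, personalised search experiences, Retrieval-Augmented Generation, heritage sector increasingly
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: To be published in NLP4DH 2024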

Abstract:As the cultural heritage sector increasingly adopts technologies like Retrieval-Augmented Generation (RAG) to provide more personalised search experiences and enable conversations with collections data, the demand for specialised evaluation datasets has grown. While end-to-end system testing is essential, it’s equally important to assess individual components. We target the final, answering task, which is well-suited to Machine Reading Comprehension (MRC). Although existing MRC datasets address general domains, they lack the specificity needed for cultural heritage information. Unfortunately, the manual creation of such datasets is prohibitively expensive for most heritage institutions. This paper presents a cost-effective approach for generating domain-specific MRC datasets with increased difficulty using Reinforcement Learning from Human Feedback (RLHF) from synthetic preference data. Our method leverages the performance of existing question-answering models on a subset of SQuAD to create a difficulty metric, assuming that more challenging questions are answered correctly less frequently. This research contributes: (1) A methodology for increasing question difficulty using PPO and synthetic data; (2) Empirical evidence of the method’s effectiveness, including human evaluation; (3) An in-depth error analysis and study of emergent phenomena; and (4) An open-source codebase and set of three llama-2-chat adapters for reproducibility and adaptation.
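
The abstract's assumption is that harder questions are answered correctly less often by existing QA models. One simple way to turn that into synthetic preference data is sketched below: score each question by the share of models that got it wrong, then prefer the harder question in every pair. The exact metric and pair construction used in the paper are not given in the abstract, so both functions, the example questions, and the per-model outcomes are hypothetical.

```python
from itertools import combinations


def difficulty(model_correct):
    """Difficulty = share of QA models that answered the question wrongly."""
    return 1.0 - sum(model_correct) / len(model_correct)


def preference_pairs(scores):
    """Build (chosen, rejected) pairs where the chosen question is harder."""
    pairs = []
    for (q1, d1), (q2, d2) in combinations(scores.items(), 2):
        if d1 > d2:
            pairs.append((q1, q2))
        elif d2 > d1:
            pairs.append((q2, q1))
        # Ties carry no preference signal and are skipped.
    return pairs


# Hypothetical outcomes: one boolean per QA model, True = answered correctly.
results = {
    "Who painted the portrait?": [True, True, True, True],
    "Which restoration changed the frame?": [True, False, False, True],
    "Why was the loan refused?": [False, False, False, True],
}
scores = {q: difficulty(r) for q, r in results.items()}
pairs = preference_pairs(scores)
```

Pairs of this form could then train a reward model for PPO, with the question generator rewarded for producing questions that resemble the "chosen" side.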

[NLP-81] LCMDC: Large-scale Chinese Medical Dialogue Corpora for Automatic Triage and Medical Consultation

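[Quick read]: This paper addresses the deficiencies of traditional healthcare systems exposed by the global COVID-19 pandemic, focusing on medical triage and consultation in online medical services. It constructs the Large-scale Chinese Medical Dialogue Corpora (LCMDC), comprising a coarse-grained triage dataset, a fine-grained diagnosis dataset, and a medical consultation dataset, to remedy the small size and narrow disease coverage of existing datasets. It further proposes a triage system that combines BERT-based supervised learning with prompt learning, and a GPT-based medical consultation model that uses reinforcement learning. To improve domain knowledge acquisition, the pre-trained language models (PLMs) are pre-trained on a self-constructed background corpus. Experimental results on LCMDC demonstrate the efficacy of the proposed systems.

Link: https://arxiv.org/abs/2410.03521
Authors: Xinyuan Wang, Haozhou Li, Dingfang Zheng, Qinke Peng
Keywords: pandemic underscored major, underscored major deficiencies, online medical services, traditional healthcare systems, pandemic underscored
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: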

Abstract:The global COVID-19 pandemic underscored major deficiencies in traditional healthcare systems, hastening the advancement of online medical services, especially in medical triage and consultation. However, existing studies face two main challenges. First, the scarcity of large-scale, publicly available, domain-specific medical datasets due to privacy concerns, with current datasets being small and limited to a few diseases, limiting the effectiveness of triage methods based on Pre-trained Language Models (PLMs). Second, existing methods lack medical knowledge and struggle to accurately understand professional terms and expressions in patient-doctor consultations. To overcome these obstacles, we construct the Large-scale Chinese Medical Dialogue Corpora (LCMDC), comprising a Coarse-grained Triage dataset with 439,630 samples, a Fine-grained Diagnosis dataset with 199,600 samples, and a Medical Consultation dataset with 472,418 items, thereby addressing the data shortage in this field. Moreover, we further propose a novel triage system that combines BERT-based supervised learning with prompt learning, as well as a GPT-based medical consultation model using reinforcement learning. To enhance domain knowledge acquisition, we pre-trained PLMs using our self-constructed background corpus. Experimental results on the LCMDC demonstrate the efficacy of our proposed systems.

[NLP-82] Exploring ASR-Based Wav2Vec2 for Automated Speech Disorder Assessment: Insights and Analysis

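[Quick read]: This paper addresses the interpretability problem of the Wav2Vec2 ASR-based model for automated speech disorder quality assessment in clinical applications. It conducts a layer-wise analysis to identify key layers and applies post-hoc explainable AI (XAI) methods, including Canonical Correlation Analysis (CCA) and visualization techniques, to track model evolution and visualize embeddings, clarifying the connection between the model's ASR dimension and clinical assessment dimensions.

Link: https://arxiv.org/abs/2410.08250
Authors: Tuan Nguyen, Corinne Fredouille, Alain Ghio, Mathieu Balaguer, Virginie Woisard
Keywords: Neck Cancer speech, Cancer speech contexts, Head and Neck, Neck Cancer, yielding impressive results
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted at the Spoken Language Technology (SLT) Conference 2024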

Abstract:With the rise of SSL and ASR technologies, the Wav2Vec2 ASR-based model has been fine-tuned for automated speech disorder quality assessment tasks, yielding impressive results and setting a new baseline for Head and Neck Cancer speech contexts. This demonstrates that the ASR dimension from Wav2Vec2 closely aligns with assessment dimensions. Despite its effectiveness, this system remains a black box with no clear interpretation of the connection between the model ASR dimension and clinical assessments. This paper presents the first analysis of this baseline model for speech quality assessment, focusing on intelligibility and severity tasks. We conduct a layer-wise analysis to identify key layers and compare different SSL and ASR Wav2Vec2 models based on pre-trained data. Additionally, post-hoc XAI methods, including Canonical Correlation Analysis (CCA) and visualization techniques, are used to track model evolution and visualize embeddings for enhanced interpretability.
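Summary: With the development of self-supervised learning (SSL) and automatic speech recognition (ASR) technologies, the Wav2Vec2 ASR-based model has been fine-tuned for automated speech disorder quality assessment, achieving notable results and setting a new baseline for Head and Neck Cancer speech contexts. This indicates that the ASR dimension of Wav2Vec2 aligns closely with the assessment dimensions. Effective as it is, the system remains a black box, and the connection between the model's ASR dimension and clinical assessments is unclear. This paper presents the first analysis of this baseline model, focusing on the intelligibility and severity tasks. A layer-wise analysis identifies the key layers, and different SSL and ASR Wav2Vec2 models are compared according to their pre-training data. Post-hoc XAI methods, including Canonical Correlation Analysis (CCA) and visualization techniques, are used to track model evolution and visualize embeddings for better interpretability.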

Artificial Intelligence

[AI-0] Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models

Link: https://arxiv.org/abs/2410.09047
Authors: Qin Liu,Chao Shang,Ling Liu,Nikolaos Pappas,Jie Ma,Neha Anna John,Srikanth Doss,Lluis Marquez,Miguel Ballesteros,Yassine Benajiba
Keywords: vision module compared, safety alignment, Vision-Language Models, safety alignment ability, safety alignment degradation
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint

Abstract:The safety alignment ability of Vision-Language Models (VLMs) is prone to be degraded by the integration of the vision module compared to its LLM backbone. We investigate this phenomenon, dubbed as “safety alignment degradation” in this paper, and show that the challenge arises from the representation gap that emerges when introducing vision modality to VLMs. In particular, we show that the representations of multi-modal inputs shift away from that of text-only inputs which represent the distribution that the LLM backbone is optimized for. At the same time, the safety alignment capabilities, initially developed within the textual embedding space, do not successfully transfer to this new multi-modal representation space. To reduce safety alignment degradation, we introduce Cross-Modality Representation Manipulation (CMRM), an inference time representation intervention method for recovering the safety alignment ability that is inherent in the LLM backbone of VLMs, while simultaneously preserving the functional capabilities of VLMs. The empirical results show that our framework significantly recovers the alignment ability that is inherited from the LLM backbone with minimal impact on the fluency and linguistic capabilities of pre-trained VLMs even without additional training. Specifically, the unsafe rate of LLaVA-7B on multi-modal input can be reduced from 61.53% to as low as 3.15% with only inference-time intervention. WARNING: This paper contains examples of toxic or harmful language.

[AI-1] Transforming In-Vehicle Network Intrusion Detection: VAE-based Knowledge Distillation Meets Explainable AI

Link: https://arxiv.org/abs/2410.09043
Authors: Muhammet Anil Yagiz,Pedram MohajerAnsari,Mert D. Pese,Polat Goktas
Keywords: robust in-vehicle network, ensuring robust in-vehicle, security is paramount, in-vehicle network, evolving landscape
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: In Proceedings of the Sixth Workshop on CPSIoT Security and Privacy (CPSIoTSec 24), October 14-18, 2024, Salt Lake City, UT, USA. ACM, New York, NY, USA

Abstract:In the evolving landscape of autonomous vehicles, ensuring robust in-vehicle network (IVN) security is paramount. This paper introduces an advanced intrusion detection system (IDS) called KD-XVAE that uses a Variational Autoencoder (VAE)-based knowledge distillation approach to enhance both performance and efficiency. Our model significantly reduces complexity, operating with just 1669 parameters and achieving an inference time of 0.3 ms per batch, making it highly suitable for resource-constrained automotive environments. Evaluations on the HCRL Car-Hacking dataset demonstrate exceptional capabilities, attaining perfect scores (Recall, Precision, F1 Score of 100%, and FNR of 0%) under multiple attack types, including DoS, Fuzzing, Gear Spoofing, and RPM Spoofing. Comparative analysis on the CICIoV2024 dataset further underscores its superiority over traditional machine learning models, achieving perfect detection metrics. We furthermore integrate Explainable AI (XAI) techniques to ensure transparency in the model’s decisions. The VAE compresses the original feature space into a latent space, on which the distilled model is trained. SHAP (SHapley Additive exPlanations) values provide insights into the importance of each latent dimension, mapped back to original features for intuitive understanding. Our paper advances the field by integrating state-of-the-art techniques, addressing critical challenges in the deployment of efficient, trustworthy, and reliable IDSes for autonomous vehicles, ensuring enhanced protection against emerging cyber threats.

[AI-2] SimpleStrat: Diversifying Language Model Generation with Stratification

Link: https://arxiv.org/abs/2410.09038
Authors: Justin Wong,Yury Orlovskiy,Michael Luo,Sanjit A. Seshia,Joseph E. Gonzalez
Keywords: Generating diverse responses, Generating diverse, synthetic data generation, search and synthetic, crucial for applications
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Generating diverse responses from large language models (LLMs) is crucial for applications such as planning/search and synthetic data generation, where diversity provides distinct answers across generations. Prior approaches rely on increasing temperature to increase diversity. However, contrary to popular belief, we show not only does this approach produce lower quality individual generations as temperature increases, but it depends on the model’s next-token probabilities being similar to the true distribution of answers. We propose SimpleStrat, an alternative approach that uses the language model itself to partition the space into strata. At inference, a random stratum is selected and a sample drawn from within the stratum. To measure diversity, we introduce CoverageQA, a dataset of underspecified questions with multiple equally plausible answers, and assess diversity by measuring KL Divergence between the output distribution and uniform distribution over valid ground truth answers. As computing probability per response/solution for proprietary models is infeasible, we measure recall on ground truth solutions. Our evaluations show that using SimpleStrat achieves higher recall by 0.05 compared to GPT-4o and 0.36 average reduction in KL Divergence compared to Llama 3.
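
The two ideas in this abstract, picking a stratum at random before sampling and scoring diversity as KL divergence against a uniform distribution over valid answers, can be illustrated with a small sketch. This is not the authors' implementation: the strata below are hand-written and hypothetical (the paper has the language model propose them), and the KL direction, KL(output || uniform), is an assumption.

```python
import math
import random
from collections import Counter

def stratified_sample(strata, rng):
    """Pick a stratum uniformly at random, then draw one answer from within it."""
    stratum = rng.choice(sorted(strata))
    return rng.choice(strata[stratum])

def kl_to_uniform(samples, valid_answers):
    """KL(empirical output || uniform over valid answers), in nats (assumed direction)."""
    counts = Counter(s for s in samples if s in valid_answers)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no sample matches a valid answer")
    k = len(valid_answers)
    kl = 0.0
    for answer in valid_answers:
        p = counts[answer] / total
        if p > 0:
            kl += p * math.log(p * k)  # p * log(p / (1/k))
    return kl

# Hypothetical strata for an underspecified prompt such as "Name a US state".
strata = {
    "west": ["California", "Oregon"],
    "south": ["Texas", "Florida"],
    "northeast": ["New York", "Maine"],
}
valid = [answer for group in strata.values() for answer in group]
rng = random.Random(0)
draws = [stratified_sample(strata, rng) for _ in range(6000)]
print(kl_to_uniform(draws, valid))            # small: draws cover all answers
print(kl_to_uniform(["Texas"] * 10, valid))   # log(6): a single repeated answer
```

A sampler that always returns the same answer reaches the maximum value log(k), while even coverage of the valid answers drives the score toward zero.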

[AI-3] Mentor-KD: Making Small Language Models Better Multi-step Reasoners EMNLP2024

Link: https://arxiv.org/abs/2410.09037
Authors: Hojae Lee,Junho Kim,SangKeun Lee
Keywords: Large Language Models, displayed remarkable performances, Large Language, Language Models, displayed remarkable
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2024

Abstract:Large Language Models (LLMs) have displayed remarkable performances across various complex tasks by leveraging Chain-of-Thought (CoT) prompting. Recently, studies have proposed a Knowledge Distillation (KD) approach, reasoning distillation, which transfers such reasoning ability of LLMs through fine-tuning language models of multi-step rationales generated by LLM teachers. However, they have inadequately considered two challenges regarding insufficient distillation sets from the LLM teacher model, in terms of 1) data quality and 2) soft label provision. In this paper, we propose Mentor-KD, which effectively distills the multi-step reasoning capability of LLMs to smaller LMs while addressing the aforementioned challenges. Specifically, we exploit a mentor, intermediate-sized task-specific fine-tuned model, to augment additional CoT annotations and provide soft labels for the student model during reasoning distillation. We conduct extensive experiments and confirm Mentor-KD’s effectiveness across various models and complex reasoning tasks.

[AI-4] PEAR: A Robust and Flexible Automation Framework for Ptychography Enabled by Multiple Large Language Model Agents

Link: https://arxiv.org/abs/2410.09034
Authors: Xiangyu Yin,Chuqiao Shi,Yimo Han,Yi Jiang
Keywords: advanced computational imaging, computational imaging technique, technique in X-ray, X-ray and electron, electron microscopy
Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: 18 pages, 5 figures, technical preview report

Abstract:Ptychography is an advanced computational imaging technique in X-ray and electron microscopy. It has been widely adopted across scientific research fields, including physics, chemistry, biology, and materials science, as well as in industrial applications such as semiconductor characterization. In practice, obtaining high-quality ptychographic images requires simultaneous optimization of numerous experimental and algorithmic parameters. Traditionally, parameter selection often relies on trial and error, leading to low-throughput workflows and potential human bias. In this work, we develop the “Ptychographic Experiment and Analysis Robot” (PEAR), a framework that leverages large language models (LLMs) to automate data analysis in ptychography. To ensure high robustness and accuracy, PEAR employs multiple LLM agents for tasks including knowledge retrieval, code generation, parameter recommendation, and image reasoning. Our study demonstrates that PEAR’s multi-agent design significantly improves the workflow success rate, even with smaller open-weight models such as LLaMA 3.1 8B. PEAR also supports various automation levels and is designed to work with customized local knowledge bases, ensuring flexibility and adaptability across different research environments.

[AI-5] AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

Link: https://arxiv.org/abs/2410.09024
Authors: Maksym Andriushchenko,Alexandra Souly,Mateusz Dziemian,Derek Duenas,Maxwell Lin,Justin Wang,Dan Hendrycks,Andy Zou,Zico Kolter,Matt Fredrikson,Eric Winsor,Jerome Wynne,Yarin Gal,Xander Davies
Keywords: users design prompts, circumvent safety measures, users design, design prompts, prompts to circumvent
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:The robustness of LLMs to jailbreak attacks, where users design prompts to circumvent safety measures and misuse model capabilities, has been studied primarily for LLMs acting as simple chatbots. Meanwhile, LLM agents – which use external tools and can execute multi-stage tasks – may pose a greater risk if misused, but their robustness remains underexplored. To facilitate research on LLM agent misuse, we propose a new benchmark called AgentHarm. The benchmark includes a diverse set of 110 explicitly malicious agent tasks (440 with augmentations), covering 11 harm categories including fraud, cybercrime, and harassment. In addition to measuring whether models refuse harmful agentic requests, scoring well on AgentHarm requires jailbroken agents to maintain their capabilities following an attack to complete a multi-step task. We evaluate a range of leading LLMs, and find (1) leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking, (2) simple universal jailbreak templates can be adapted to effectively jailbreak agents, and (3) these jailbreaks enable coherent and malicious multi-step agent behavior and retain model capabilities. We publicly release AgentHarm at this https URL to enable simple and reliable evaluation of attacks and defenses for LLM-based agents.

[AI-6] Software Engineering and Foundation Models: Insights from Industry Blogs Using a Jury of Foundation Models

Link: https://arxiv.org/abs/2410.09012
Authors: Hao Li,Cor-Paul Bezemer,Ahmed E. Hassan
Keywords: including software engineering, large language models, Foundation models, impacted many fields, including software
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:Foundation models (FMs) such as large language models (LLMs) have significantly impacted many fields, including software engineering (SE). The interaction between SE and FMs has led to the integration of FMs into SE practices (FM4SE) and the application of SE methodologies to FMs (SE4FM). While several literature surveys exist on academic contributions to these trends, we are the first to provide a practitioner’s view. We analyze 155 FM4SE and 997 SE4FM blog posts from leading technology companies, leveraging an FM-powered surveying approach to systematically label and summarize the discussed activities and tasks. We observed that while code generation is the most prominent FM4SE task, FMs are leveraged for many other SE activities such as code understanding, summarization, and API recommendation. The majority of blog posts on SE4FM are about model deployment and operation, and system architecture and orchestration. Although the emphasis is on cloud deployments, there is a growing interest in compressing FMs and deploying them on smaller devices such as edge or mobile devices. We outline eight future research directions inspired by our gained insights, aiming to bridge the gap between academic findings and real-world applications. Our study not only enriches the body of knowledge on practical applications of FM4SE and SE4FM but also demonstrates the utility of FMs as a powerful and efficient approach in conducting literature surveys within technical and grey literature domains. Our dataset, results, code and used prompts can be found in our online replication package at this https URL.

[AI-7] Hierarchical Universal Value Function Approximators

Link: https://arxiv.org/abs/2410.08997
Authors: Rushiv Arora
Keywords: estimating long-term returns, building universal approximators, parameterized manner, key advancements, key elements
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 12 pages, 10 figures, 3 appendices. Currently under review

Abstract:There have been key advancements to building universal approximators for multi-goal collections of reinforcement learning value functions – key elements in estimating long-term returns of states in a parameterized manner. We extend this to hierarchical reinforcement learning, using the options framework, by introducing hierarchical universal value function approximators (H-UVFAs). This allows us to leverage the added benefits of scaling, planning, and generalization expected in temporal abstraction settings. We develop supervised and reinforcement learning methods for learning embeddings of the states, goals, options, and actions in the two hierarchical value functions: $Q(s, g, o; \theta)$ and $Q(s, g, o, a; \theta)$. Finally, we demonstrate generalization of the H-UVFAs and show they outperform corresponding UVFAs.

[AI-8] SubZero: Random Subspace Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning

Link: https://arxiv.org/abs/2410.08989
Authors: Ziming Yu,Pan Zhou,Sike Wang,Jia Li,Hua Huang
Keywords: Large Language Models, Fine-tuning Large Language, Fine-tuning Large, Large Language, proven effective
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Fine-tuning Large Language Models (LLMs) has proven effective for a variety of downstream tasks. However, as LLMs grow in size, the memory demands for backpropagation become increasingly prohibitive. Zeroth-order (ZO) optimization methods offer a memory-efficient alternative by using forward passes to estimate gradients, but the variance of gradient estimates typically scales linearly with the model’s parameter dimension – a significant issue for LLMs. In this paper, we propose the random Subspace Zeroth-order (SubZero) optimization to address the challenges posed by LLMs’ high dimensionality. We introduce a low-rank perturbation tailored for LLMs that significantly reduces memory consumption while improving training performance. Additionally, we prove that our gradient estimation closely approximates the backpropagation gradient, exhibits lower variance than traditional ZO methods, and ensures convergence when combined with SGD. Experimental results show that SubZero enhances fine-tuning performance and achieves faster convergence compared to standard ZO approaches like MeZO across various language modeling tasks.
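
To make the forward-pass-only idea concrete, here is a minimal sketch of a two-point zeroth-order gradient estimate, with an optional restriction of the perturbation to a low-dimensional subspace. It is a generic textbook-style estimator under stated assumptions, not SubZero's layer-wise low-rank scheme: the function name `zo_gradient`, the orthonormal `basis` argument, and the toy quadratic loss are all hypothetical.

```python
import random

def zo_gradient(loss, x, basis=None, eps=1e-3, n_samples=1, rng=None):
    """Average of two-point estimates ((L(x+eps*z) - L(x-eps*z)) / (2*eps)) * z.

    With basis=None, z is a standard Gaussian vector in the full space.
    With an orthonormal basis (list of vectors), z is Gaussian inside the
    spanned subspace, so the estimate targets the projected gradient.
    """
    rng = rng or random.Random()
    d = len(x)
    grad = [0.0] * d
    for _ in range(n_samples):
        if basis is None:
            z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        else:
            coeffs = [rng.gauss(0.0, 1.0) for _ in basis]
            z = [sum(c * b[i] for c, b in zip(coeffs, basis)) for i in range(d)]
        plus = loss([xi + eps * zi for xi, zi in zip(x, z)])    # forward pass 1
        minus = loss([xi - eps * zi for xi, zi in zip(x, z)])   # forward pass 2
        scale = (plus - minus) / (2.0 * eps)
        for i in range(d):
            grad[i] += scale * z[i] / n_samples
    return grad

def quadratic(v):
    """Toy loss 0.5 * ||v||^2, whose true gradient at v is v itself."""
    return 0.5 * sum(vi * vi for vi in v)

x = [1.0, -2.0, 0.5]
full = zo_gradient(quadratic, x, n_samples=4000, rng=random.Random(0))
# Hypothetical 2-D subspace spanned by the first two coordinate axes.
sub = zo_gradient(quadratic, x, basis=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
                  n_samples=4000, rng=random.Random(1))
print(full)  # close to x
print(sub)   # close to [1.0, -2.0, 0.0], the gradient projected onto the subspace
```

Only loss evaluations are needed, so no activations have to be stored for backpropagation; the price is estimator variance, which is why many samples are averaged in this toy example.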

[AI-9] Towards Trustworthy Knowledge Graph Reasoning: An Uncertainty Aware Perspective

Link: https://arxiv.org/abs/2410.08985
Authors: Bo Ni,Yu Wang,Lu Cheng,Erik Blasch,Tyler Derr
Keywords: Large Language Models, coupled with Large, KG-based retrieval-augmented frameworks, Large Language, language model components
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Recently, Knowledge Graphs (KGs) have been successfully coupled with Large Language Models (LLMs) to mitigate their hallucinations and enhance their reasoning capability, such as in KG-based retrieval-augmented frameworks. However, current KG-LLM frameworks lack rigorous uncertainty estimation, limiting their reliable deployment in high-stakes applications. Directly incorporating uncertainty quantification into KG-LLM frameworks presents challenges due to their complex architectures and the intricate interactions between the knowledge graph and language model components. To address this gap, we propose a new trustworthy KG-LLM framework, Uncertainty Aware Knowledge-Graph Reasoning (UAG), which incorporates uncertainty quantification into the KG-LLM framework. We design an uncertainty-aware multi-step reasoning framework that leverages conformal prediction to provide a theoretical guarantee on the prediction set. To manage the error rate of the multi-step process, we additionally introduce an error rate control module to adjust the error rate within the individual components. Extensive experiments show that our proposed UAG can achieve any pre-defined coverage rate while reducing the prediction set/interval size by 40% on average over the baselines.
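
The coverage guarantee mentioned here comes from conformal prediction. A minimal sketch of the standard split-conformal recipe is shown below as an illustration; it does not reproduce UAG's multi-step reasoning or its error rate control module. The nonconformity score (one minus the model's probability for a candidate) and the candidate answers are assumptions made for the example.

```python
import math

def conformal_threshold(calibration_scores, alpha):
    """Return the ceil((n+1)(1-alpha))-th smallest calibration score.

    Scores are nonconformity values of the true answers on held-out data.
    If the required rank exceeds n, no finite threshold gives the guarantee.
    """
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return math.inf
    return sorted(calibration_scores)[k - 1]

def prediction_set(candidate_probs, threshold):
    """Keep every candidate whose score 1 - p does not exceed the threshold."""
    return {answer for answer, p in candidate_probs.items() if 1.0 - p <= threshold}

# Hypothetical calibration scores and candidate answers.
calibration = [i / 10 for i in range(1, 11)]
threshold = conformal_threshold(calibration, alpha=0.2)
answers = prediction_set({"Paris": 0.7, "Lyon": 0.2, "Nice": 0.05}, threshold)
print(threshold, answers)
```

Under exchangeability of calibration and test examples, sets built this way contain the true answer with probability at least 1 - alpha; lowering alpha raises the threshold and enlarges the sets.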

[AI-10] Overcoming Slow Decision Frequencies in Continuous Control: Model-Based Sequence Reinforcement Learning for Model-Free Control

Link: https://arxiv.org/abs/2410.08979
Authors: Devdhar Patel,Hava Siegelmann
Keywords: surpassing human-level control, human-level control capabilities, rapidly reaching, reaching and surpassing, surpassing human-level
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Reinforcement learning (RL) is rapidly reaching and surpassing human-level control capabilities. However, state-of-the-art RL algorithms often require timesteps and reaction times significantly faster than human capabilities, which is impractical in real-world settings and typically necessitates specialized hardware. We introduce Sequence Reinforcement Learning (SRL), an RL algorithm designed to produce a sequence of actions for a given input state, enabling effective control at lower decision frequencies. SRL addresses the challenges of learning action sequences by employing both a model and an actor-critic architecture operating at different temporal scales. We propose a “temporal recall” mechanism, where the critic uses the model to estimate intermediate states between primitive actions, providing a learning signal for each individual action within the sequence. Once training is complete, the actor can generate action sequences independently of the model, achieving model-free control at a slower frequency. We evaluate SRL on a suite of continuous control tasks, demonstrating that it achieves performance comparable to state-of-the-art algorithms while significantly reducing actor sample complexity. To better assess performance across varying decision frequencies, we introduce the Frequency-Averaged Score (FAS) metric. Our results show that SRL significantly outperforms traditional RL algorithms in terms of FAS, making it particularly suitable for applications requiring variable decision frequencies. Additionally, we compare SRL with model-based online planning, showing that SRL achieves superior FAS while leveraging the same model during training that online planners use for planning.

[AI-11] Learning Representations of Instruments for Partial Identification of Treatment Effects

Link: https://arxiv.org/abs/2410.08976
Authors: Jonas Schweisthal,Dennis Frauen,Maresa Schröder,Konstantin Hess,Niki Kilbertus,Stefan Feuerriegel
Keywords: observational data, data is important, average treatment effect, bounds, CATE
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:Reliable estimation of treatment effects from observational data is important in many disciplines such as medicine. However, estimation is challenging when unconfoundedness as a standard assumption in the causal inference literature is violated. In this work, we leverage arbitrary (potentially high-dimensional) instruments to estimate bounds on the conditional average treatment effect (CATE). Our contributions are three-fold: (1) We propose a novel approach for partial identification through a mapping of instruments to a discrete representation space so that we yield valid bounds on the CATE. This is crucial for reliable decision-making in real-world applications. (2) We derive a two-step procedure that learns tight bounds using a tailored neural partitioning of the latent instrument space. As a result, we avoid instability issues due to numerical approximations or adversarial training. Furthermore, our procedure aims to reduce the estimation variance in finite-sample settings to yield more reliable estimates. (3) We show theoretically that our procedure obtains valid bounds while reducing estimation variance. We further perform extensive experiments to demonstrate the effectiveness across various settings. Overall, our procedure offers a novel path for practitioners to make use of potentially high-dimensional instruments (e.g., as in Mendelian randomization).

[AI-12] ALVIN: Active Learning Via INterpolation EMNLP2024

Link: https://arxiv.org/abs/2410.08972
Authors: Michalis Korakakis,Andreas Vlachos,Adrian Weller
Keywords: active learning methods, Active Learning, Active Learning aims, typical active learning, minimize annotation effort
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to EMNLP 2024 (Main)

Abstract:Active Learning aims to minimize annotation effort by selecting the most useful instances from a pool of unlabeled data. However, typical active learning methods overlook the presence of distinct example groups within a class, whose prevalence may vary, e.g., in occupation classification datasets certain demographics are disproportionately represented in specific classes. This oversight causes models to rely on shortcuts for predictions, i.e., spurious correlations between input attributes and labels occurring in well-represented groups. To address this issue, we propose Active Learning Via INterpolation (ALVIN), which conducts intra-class interpolations between examples from under-represented and well-represented groups to create anchors, i.e., artificial points situated between the example groups in the representation space. By selecting instances close to the anchors for annotation, ALVIN identifies informative examples exposing the model to regions of the representation space that counteract the influence of shortcuts. Crucially, since the model considers these examples to be of high certainty, they are likely to be ignored by typical active learning methods. Experimental results on six datasets encompassing sentiment analysis, natural language inference, and paraphrase detection demonstrate that ALVIN outperforms state-of-the-art active learning methods in both in-distribution and out-of-distribution generalization.
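
The anchor construction described above can be pictured with a small sketch: interpolate between the representation of an under-represented example and that of a well-represented example from the same class, then pick the unlabeled instances closest to the resulting anchor. The 2-D vectors, the fixed mixing weight `lam`, and the Euclidean distance are illustrative assumptions; how ALVIN samples the weight and ranks candidates may differ.

```python
import math

def interpolate(a, b, lam):
    """Anchor = lam * a + (1 - lam) * b, a point between two representations."""
    return [lam * ai + (1.0 - lam) * bi for ai, bi in zip(a, b)]

def nearest_to_anchor(anchor, pool, k):
    """Indices of the k unlabeled pool points closest to the anchor."""
    order = sorted(range(len(pool)), key=lambda i: math.dist(anchor, pool[i]))
    return order[:k]

# Hypothetical 2-D representations from the same class.
under_represented = [0.0, 0.0]
well_represented = [4.0, 0.0]
anchor = interpolate(under_represented, well_represented, lam=0.5)

unlabeled_pool = [[0.1, 0.0], [2.2, 0.1], [3.9, 0.0], [1.5, 0.0]]
selected = nearest_to_anchor(anchor, unlabeled_pool, k=2)
print(anchor, selected)  # the two points lying between the groups are chosen
```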

[AI-13] NoVo: Norm Voting off Hallucinations with Attention Heads in Large Language Models

Link: https://arxiv.org/abs/2410.08970
Authors: Zheng Yi Ho,Siyuan Liang,Sen Zhang,Yibing Zhan,Dacheng Tao
Keywords: Large Language Models, Language Models, Large Language, Hallucinations in Large, remain a major
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Hallucinations in Large Language Models (LLMs) remain a major obstacle, particularly in high-stakes applications where factual accuracy is critical. While representation editing and reading methods have made strides in reducing hallucinations, their heavy reliance on specialised tools and training on in-domain samples, makes them difficult to scale and prone to overfitting. This limits their accuracy gains and generalizability to diverse datasets. This paper presents a lightweight method, Norm Voting (NoVo), which harnesses the untapped potential of attention head norms to dramatically enhance factual accuracy in zero-shot multiple-choice questions (MCQs). NoVo begins by automatically selecting truth-correlated head norms with an efficient, inference-only algorithm using only 30 random samples, allowing NoVo to effortlessly scale to diverse datasets. Afterwards, selected head norms are employed in a simple voting algorithm, which yields significant gains in prediction accuracy. On TruthfulQA MC1, NoVo surpasses the current state-of-the-art and all previous methods by an astounding margin – at least 19 accuracy points. NoVo demonstrates exceptional generalization to 20 diverse datasets, with significant gains in over 90% of them, far exceeding all current representation editing and reading methods. NoVo also reveals promising gains to finetuning strategies and building textual adversarial defence. NoVo’s effectiveness with head norms opens new frontiers in LLM interpretability, robustness and reliability.

[AI-14] Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements

Link: https://arxiv.org/abs/2410.08968
Authors: Jingyu Zhang,Ahmed Elgohary,Ahmed Magooda,Daniel Khashabi,Benjamin Van Durme
Keywords: content deemed unsafe, safety, diverse safety, large language models, current paradigm
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:The current paradigm for safety alignment of large language models (LLMs) follows a one-size-fits-all approach: the model refuses to interact with any content deemed unsafe by the model provider. This approach lacks flexibility in the face of varying social norms across cultures and regions. In addition, users may have diverse safety needs, making a model with static safety standards too restrictive to be useful, as well as too costly to be re-aligned. We propose Controllable Safety Alignment (CoSA), a framework designed to adapt models to diverse safety requirements without re-training. Instead of aligning a fixed model, we align models to follow safety configs – free-form natural language descriptions of the desired safety behaviors – that are provided as part of the system prompt. To adjust model safety behavior, authorized users only need to modify such safety configs at inference time. To enable that, we propose CoSAlign, a data-centric method for aligning LLMs to easily adapt to diverse safety configs. Furthermore, we devise a novel controllability evaluation protocol that considers both helpfulness and configured safety, summarizing them into CoSA-Score, and construct CoSApien, a human-authored benchmark that consists of real-world LLM use cases with diverse safety requirements and corresponding evaluation prompts. We show that CoSAlign leads to substantial gains of controllability over strong baselines including in-context alignment. Our framework encourages better representation and adaptation to pluralistic human values in LLMs, thereby increasing their practicality.

[AI-15] Language Imbalance Driven Rewarding for Multilingual Self-improving

Link: https://arxiv.org/abs/2410.08964
Authors: Wen Yang,Junhong Wu,Chen Wang,Chengqing Zong,Jiajun Zhang
Keywords: Large Language Models, Large Language, Language Models, English and Chinese, Imbalance Driven Rewarding
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Work in progress

Abstract:Large Language Models (LLMs) have achieved state-of-the-art performance across numerous tasks. However, these advancements have predominantly benefited “first-class” languages such as English and Chinese, leaving many other languages underrepresented. This imbalance, while limiting broader applications, generates a natural preference ranking between languages, offering an opportunity to bootstrap the multilingual capabilities of LLM in a self-improving manner. Thus, we propose Language Imbalance Driven Rewarding, where the inherent imbalance between dominant and non-dominant languages within LLMs is leveraged as a reward signal. Iterative DPO training demonstrates that this approach not only enhances LLM performance in non-dominant languages but also improves the dominant language’s capacity, thereby yielding an iterative reward signal. Fine-tuning Meta-Llama-3-8B-Instruct over two iterations of this approach results in continuous improvements in multilingual performance across instruction-following and arithmetic reasoning tasks, evidenced by an average improvement of 7.46% win rate on the X-AlpacaEval leaderboard and 13.9% accuracy on the MGSM benchmark. This work serves as an initial exploration, paving the way for multilingual self-improvement of LLMs.

[AI-16] Evaluating Federated Kolmogorov-Arnold Networks on Non-IID Data

Link: https://arxiv.org/abs/2410.08961
Authors: Arthur Mendonça Sasse,Claudio Miceli de Farias
Keywords: Federated Kolmogorov-Arnold Networks, Kolmogorov-Arnold Networks, Radial Basis Functions, initial stage, Layer Perceptrons
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures, for associated code see this https URL

Abstract:Federated Kolmogorov-Arnold Networks (F-KANs) have already been proposed, but their assessment is at an initial stage. We present a comparison between KANs (using B-splines and Radial Basis Functions as activation functions) and Multi-Layer Perceptrons (MLPs) with a similar number of parameters for 100 rounds of federated learning in the MNIST classification task using non-IID partitions with 100 clients. After 15 trials for each model, we show that the best accuracies achieved by MLPs can be achieved by Spline-KANs in half of the time (in rounds), with just a moderate increase in computing time.

[AI-17] On the Adversarial Transferability of Generalized “Skip Connections”

Link: https://arxiv.org/abs/2410.08950
Authors: Yisen Wang,Yichuan Mo,Dongxian Wu,Mingjie Li,Xingjun Ma,Zhouchen Lin
Keywords: skip connections, modern deep models, Skip, Skip Gradient Method, essential ingredient
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Skip connection is an essential ingredient for modern deep models to be deeper and more powerful. Despite their huge success in normal scenarios (state-of-the-art classification performance on natural examples), we investigate and identify an interesting property of skip connections under adversarial scenarios, namely, the use of skip connections allows easier generation of highly transferable adversarial examples. Specifically, in ResNet-like models (with skip connections), we find that using more gradients from the skip connections rather than the residual modules according to a decay factor during backpropagation allows one to craft adversarial examples with high transferability. The above method is termed as Skip Gradient Method (SGM). Although starting from ResNet-like models in vision domains, we further extend SGM to more advanced architectures, including Vision Transformers (ViTs) and models with length-varying paths and other domains, i.e. natural language processing. We conduct comprehensive transfer attacks against various models including ResNets, Transformers, Inceptions, Neural Architecture Search, and Large Language Models (LLMs). We show that employing SGM can greatly improve the transferability of crafted attacks in almost all cases. Furthermore, considering the big complexity for practical use, we further demonstrate that SGM can even improve the transferability on ensembles of models or targeted attacks and the stealthiness against current defenses. At last, we provide theoretical explanations and empirical insights on how SGM works. Our findings not only motivate new adversarial research into the architectural characteristics of models but also open up further challenges for secure model architecture design. Our code is available at this https URL.
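
The effect of the decay factor can be seen on a toy chain of scalar residual blocks, z_{i+1} = z_i + f_i(z_i). The sketch below assumes linear branches f_i(z) = w_i * z, so each block contributes a factor (1 + w_i) to the true input gradient; scaling the branch term by a decay factor `gamma` in (0, 1] shifts weight toward the skip path. The function name and the slopes are hypothetical, and this is an illustration of the idea rather than the authors' code.

```python
def input_gradient(branch_slopes, gamma=1.0):
    """Gradient of the chain output w.r.t. its input, with branch gradients
    scaled by gamma during the backward pass.

    gamma = 1.0 gives the ordinary gradient; gamma = 0.0 keeps only the
    identity (skip) paths.
    """
    grad = 1.0
    for slope in branch_slopes:
        grad *= 1.0 + gamma * slope  # skip path (1) + decayed residual branch
    return grad

slopes = [0.5, -0.2]                       # hypothetical residual-branch slopes
print(input_gradient(slopes, gamma=1.0))   # ordinary backpropagation
print(input_gradient(slopes, gamma=0.5))   # skip paths emphasised
print(input_gradient(slopes, gamma=0.0))   # skip paths only
```

In an attack, the decayed gradient would replace the ordinary one when computing the perturbation direction; the forward pass of the model is left unchanged.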

[AI-18] Transferable Belief Model on Quantum Circuits

Link: https://arxiv.org/abs/2410.08949
Authors: Qianli Zhou,Hao Luo,Lipeng Pan,Yong Deng,Eloi Bosse
Keywords: transferable belief model, Dempster-Shafer theory, enables agents, incomplete environments, belief
Categories: Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)

Abstract:The transferable belief model, as a semantic interpretation of Dempster-Shafer theory, enables agents to perform reasoning and decision making in imprecise and incomplete environments. The model offers distinct semantics for handling unreliable testimonies, allowing for a more reasonable and general process of belief transfer compared to the Bayesian approach. However, because both the belief masses and the structure of focal sets must be considered when updating belief functions, which leads to extra computational complexity during reasoning, the transferable belief model has gradually lost favor among researchers in recent developments. In this paper, we implement the transferable belief model on quantum circuits and demonstrate that belief functions offer a more concise and effective alternative to Bayesian approaches within the quantum computing framework. Furthermore, leveraging the unique characteristics of quantum computing, we propose several novel belief transfer approaches. More broadly, this paper introduces a new perspective on basic information representation for quantum AI models, suggesting that belief functions are more suitable than the Bayesian approach for handling uncertainty on quantum circuits.
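
To make concrete why updating belief functions involves both masses and focal sets, here is a minimal classical sketch of combining two mass functions. It shows the standard unnormalized conjunctive rule (which keeps conflict as mass on the empty set) and Dempster's normalization; it is not the quantum-circuit implementation from the paper, and the weather example and function names are invented for illustration.

```python
from collections import defaultdict


def conjunctive_combine(m1, m2):
    """Unnormalized conjunctive rule: m(A) = sum of m1(B) * m2(C) over B & C == A.

    Mass functions map frozenset focal sets to masses. Conflicting evidence
    ends up as mass on the empty set.
    """
    combined = defaultdict(float)
    for b, mass_b in m1.items():
        for c, mass_c in m2.items():
            combined[b & c] += mass_b * mass_c
    return dict(combined)


def dempster_normalize(m):
    """Dempster's rule: discard the empty-set mass and renormalize the rest."""
    conflict = m.get(frozenset(), 0.0)
    if conflict >= 1.0:
        raise ValueError("total conflict: sources cannot be combined")
    return {a: v / (1.0 - conflict) for a, v in m.items() if a}


# Hypothetical frame of discernment and two sources of evidence.
RAIN, SUN = frozenset({"rain"}), frozenset({"sun"})
EITHER = RAIN | SUN

source_1 = {RAIN: 0.6, EITHER: 0.4}
source_2 = {SUN: 0.5, EITHER: 0.5}

combined = conjunctive_combine(source_1, source_2)
normalized = dempster_normalize(combined)
```

The double loop over focal sets is the source of the extra cost the abstract mentions: in the worst case the number of focal sets grows exponentially with the size of the frame.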

[AI-19] The Dynamics of Social Conventions in LLM populations: Spontaneous Emergence, Collective Biases and Tipping Points

Link: https://arxiv.org/abs/2410.08948
Authors: Ariel Flint Ashery,Luca Maria Aiello,Andrea Baronchelli
Keywords: economic life, Large Language Model, Social, conventions, Large Language
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Physics and Society (physics.soc-ph)

Abstract:Social conventions are the foundation for social and economic life. As legions of AI agents increasingly interact with each other and with humans, their ability to form shared conventions will determine how effectively they will coordinate behaviors, integrate into society and influence it. Here, we investigate the dynamics of conventions within populations of Large Language Model (LLM) agents using simulated interactions. First, we show that globally accepted social conventions can spontaneously arise from local interactions between communicating LLMs. Second, we demonstrate how strong collective biases can emerge during this process, even when individual agents appear to be unbiased. Third, we examine how minority groups of committed LLMs can drive social change by establishing new social conventions. We show that once these minority groups reach a critical size, they can consistently overturn established behaviors. In all cases, contrasting the experimental results with predictions from a minimal multi-agent model allows us to isolate the specific role of LLM agents. Our results clarify how AI systems can autonomously develop norms without explicit programming and have implications for designing AI systems that align with human values and societal goals.

[AI-20] Meta-Transfer Learning Empowered Temporal Graph Networks for Cross-City Real Estate Appraisal

Link: https://arxiv.org/abs/2410.08947
Authors: Weijia Zhang,Jindong Han,Hao Liu,Wei Fan,Hao Wang,Hui Xiong
Keywords: Real estate appraisal, real property taxation, Real estate, real estate deals, estate appraisal
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages

Abstract:Real estate appraisal is important for a variety of endeavors such as real estate deals, investment analysis, and real property taxation. Recently, deep learning has shown great promise for real estate appraisal by harnessing substantial online transaction data from web platforms. Nonetheless, deep learning is data-hungry, and thus it may not be trivially applicable to enormous small cities with limited data. To this end, we propose Meta-Transfer Learning Empowered Temporal Graph Networks (MetaTransfer) to transfer valuable knowledge from multiple data-rich metropolises to the data-scarce city to improve valuation performance. Specifically, by modeling the ever-growing real estate transactions with associated residential communities as a temporal event heterogeneous graph, we first design an Event-Triggered Temporal Graph Network to model the irregular spatiotemporal correlations between evolving real estate transactions. Besides, we formulate the city-wide real estate appraisal as a multi-task dynamic graph link label prediction problem, where the valuation of each community in a city is regarded as an individual task. A Hypernetwork-Based Multi-Task Learning module is proposed to simultaneously facilitate intra-city knowledge sharing between multiple communities and task-specific parameters generation to accommodate the community-wise real estate price distribution. Furthermore, we propose a Tri-Level Optimization Based Meta-Learning framework to adaptively re-weight training transaction instances from multiple source cities to mitigate negative transfer, and thus improve the cross-city knowledge transfer effectiveness. Finally, extensive experiments based on five real-world datasets demonstrate the significant superiority of MetaTransfer compared with eleven baseline algorithms.

[AI-21] Maximizing the Potential of Synthetic Data: Insights from Random Matrix Theory

Link: https://arxiv.org/abs/2410.08942
Authors: Aymane El Firdoussi,Mohamed El Amine Seddik,Soufiane Hayou,Reda Alami,Ahmed Alzubaidi,Hakim Hacid
Keywords: gained attention, attention for training, Shumailov, Seddik, Synthetic data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST)

Abstract:Synthetic data has gained attention for training large language models, but poor-quality data can harm performance (see, e.g., Shumailov et al. (2023); Seddik et al. (2024)). A potential solution is data pruning, which retains only high-quality data based on a score function (human or machine feedback). Previous work Feng et al. (2024) analyzed models trained on synthetic data as sample size increases. We extend this by using random matrix theory to derive the performance of a binary classifier trained on a mix of real and pruned synthetic data in a high dimensional setting. Our findings identify conditions where synthetic data could improve performance, focusing on the quality of the generative model and verification strategy. We also show a smooth phase transition in synthetic label noise, contrasting with prior sharp behavior in infinite sample limits. Experiments with toy models and large language models validate our theoretical results.

[AI-22] Towards Cross-Lingual LLM Evaluation for European Languages

Link: https://arxiv.org/abs/2410.08928
Authors: Klaudia Thellmann,Bernhard Stadler,Michael Fromm,Jasper Schulze Buschhoff,Alex Jude,Fabio Barth,Johannes Leveling,Nicolas Flores-Herr,Joachim Köhler,René Jäkel,Mehdi Ali
Keywords: Large Language Models, rise of Large, revolutionized natural language, natural language processing, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The rise of Large Language Models (LLMs) has revolutionized natural language processing across numerous languages and tasks. However, evaluating LLM performance in a consistent and meaningful way across multiple European languages remains challenging, especially due to the scarcity of multilingual benchmarks. We introduce a cross-lingual evaluation approach tailored for European languages. We employ translated versions of five widely-used benchmarks to assess the capabilities of 40 LLMs across 21 European languages. Our contributions include examining the effectiveness of translated benchmarks, assessing the impact of different translation services, and offering a multilingual evaluation framework for LLMs that includes newly created datasets: EU20-MMLU, EU20-HellaSwag, EU20-ARC, EU20-TruthfulQA, and EU20-GSM8K. The benchmarks and results are made publicly available to encourage further research in multilingual LLM evaluation.

[AI-23] Zero-Shot Pupil Segmentation with SAM 2: A Case Study of Over 14 Million Images

Link: https://arxiv.org/abs/2410.08926
Authors: Virmarie Maquiling,Sean Anthony Byrne,Diederick C. Niehorster,Marco Carminati,Enkelejda Kasneci
Keywords: advancing gaze estimation, eye tracking technologies, vision foundation model, tracking technologies, explore the transformative
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Virmarie Maquiling and Sean Anthony Byrne contributed equally to this paper, 8 pages, 3 figures, CHI Case Study, pre-print

Abstract:We explore the transformative potential of SAM 2, a vision foundation model, in advancing gaze estimation and eye tracking technologies. By significantly reducing annotation time, lowering technical barriers through its ease of deployment, and enhancing segmentation accuracy, SAM 2 addresses critical challenges faced by researchers and practitioners. Utilizing its zero-shot segmentation capabilities with minimal user input (a single click per video), we tested SAM 2 on over 14 million eye images from diverse datasets, including virtual reality setups and the world’s largest unified dataset recorded using wearable eye trackers. Remarkably, in pupil segmentation tasks, SAM 2 matches the performance of domain-specific models trained solely on eye images, achieving competitive mean Intersection over Union (mIoU) scores of up to 93% without fine-tuning. Additionally, we provide our code and segmentation masks for these widely used datasets to promote further research.
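
For readers unfamiliar with the metric, the following sketch shows how Intersection over Union can be computed for binary masks and averaged. It assumes masks are nested lists of 0/1 values and that the mean is taken over images; the paper may average differently (for example, per dataset or per class), and the function names are illustrative.

```python
def iou(pred, target):
    """Intersection over Union of two equally sized binary masks."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly, so score them as 1.0.
    return intersection / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over (predicted mask, ground-truth mask) pairs."""
    scores = [iou(pred, target) for pred, target in pairs]
    return sum(scores) / len(scores)


# Two tiny hypothetical "pupil" masks and their ground truth.
pred_a, true_a = [[1, 1], [0, 0]], [[1, 0], [0, 0]]
pred_b, true_b = [[0, 1], [0, 1]], [[0, 1], [0, 1]]

score = mean_iou([(pred_a, true_a), (pred_b, true_b)])
```

An mIoU of 93% therefore means that, on average, the predicted and reference pupil regions overlap on 93% of the area they jointly cover.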

[AI-24] HyperPg – Prototypical Gaussians on the Hypersphere for Interpretable Deep Learning

Link: https://arxiv.org/abs/2410.08925
Authors: Maximilian Xiling Li,Korbinian Franz Rudolf,Nils Blank,Rudolf Lioutikov
Keywords: interpretable alternative, black-box deep learning, Prototype Learning methods, Learning methods provide, deep learning models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Prototype Learning methods provide an interpretable alternative to black-box deep learning models. Approaches such as ProtoPNet learn, which part of a test image “look like” known prototypical parts from training images, combining predictive power with the inherent interpretability of case-based reasoning. However, existing approaches have two main drawbacks: A) They rely solely on deterministic similarity scores without statistical confidence. B) The prototypes are learned in a black-box manner without human input. This work introduces HyperPg, a new prototype representation leveraging Gaussian distributions on a hypersphere in latent space, with learnable mean and variance. HyperPg prototypes adapt to the spread of clusters in the latent space and output likelihood scores. The new architecture, HyperPgNet, leverages HyperPg to learn prototypes aligned with human concepts from pixel-level annotations. Consequently, each prototype represents a specific concept such as color, image texture, or part of the image subject. A concept extraction pipeline built on foundation models provides pixel-level annotations, significantly reducing human labeling effort. Experiments on CUB-200-2011 and Stanford Cars datasets demonstrate that HyperPgNet outperforms other prototype learning architectures while using fewer parameters and training steps. Additionally, the concept-aligned HyperPg prototypes are learned transparently, enhancing model interpretability.

[AI-25] Exploring the Design Space of Cognitive Engagement Techniques with AI-Generated Code for Enhanced Learning

Link: https://arxiv.org/abs/2410.08922
Authors: Majeed Kazemitabaar,Oliver Huang,Sangho Suh,Austin Z. Henley,Tovi Grossman
Keywords: Large Language Models, Language Models, Large Language, Novice programmers, relying on Large
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 19 pages, 6 figures

Abstract:Novice programmers are increasingly relying on Large Language Models (LLMs) to generate code for learning programming concepts. However, this interaction can lead to superficial engagement, giving learners an illusion of learning and hindering skill development. To address this issue, we conducted a systematic design exploration to develop seven cognitive engagement techniques aimed at promoting deeper engagement with AI-generated code. In this paper, we describe our design process, the initial seven techniques and results from a between-subjects study (N=82). We then iteratively refined the top techniques and further evaluated them through a within-subjects study (N=42). We evaluate the friction each technique introduces, their effectiveness in helping learners apply concepts to isomorphic tasks without AI assistance, and their success in aligning learners’ perceived and actual coding abilities. Ultimately, our results highlight the most effective technique: guiding learners through the step-by-step problem-solving process, where they engage in an interactive dialog with the AI, prompting what needs to be done at each stage before the corresponding code is revealed.

[AI-26] Efficient Hyperparameter Importance Assessment for CNNs

Link: https://arxiv.org/abs/2410.08920
Authors: Ruinan Wang,Ian Nabney,Mohammad Golbabaee
Keywords: impacting models’ robustness, profoundly impacting models’, machine learning pipeline, Convolutional Neural Networks, profoundly impacting
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages

Abstract:Hyperparameter selection is an essential aspect of the machine learning pipeline, profoundly impacting models’ robustness, stability, and generalization capabilities. Given the complex hyperparameter spaces associated with Neural Networks and the constraints of computational resources and time, optimizing all hyperparameters becomes impractical. In this context, leveraging hyperparameter importance assessment (HIA) can provide valuable guidance by narrowing down the search space. This enables machine learning practitioners to focus their optimization efforts on the hyperparameters with the most significant impact on model performance while conserving time and resources. This paper aims to quantify the importance weights of some hyperparameters in Convolutional Neural Networks (CNNs) with an algorithm called N-RReliefF, laying the groundwork for applying HIA methodologies in the Deep Learning field. We conduct an extensive study by training over ten thousand CNN models across ten popular image classification datasets, thereby acquiring a comprehensive dataset containing hyperparameter configuration instances and their corresponding performance metrics. It is demonstrated that among the investigated hyperparameters, the top five important hyperparameters of the CNN model are the number of convolutional layers, learning rate, dropout rate, optimizer and epoch.

[AI-27] Test-driven Software Experimentation with LASSO: an LLM Benchmarking Example

Link: https://arxiv.org/abs/2410.08911
Authors: Marcus Kessel
Keywords: Empirical software engineering, Test-Driven Software Experiments, software engineering faces, Empirical software, Software Experiments
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:Empirical software engineering faces a critical gap: the lack of standardized tools for rapid development and execution of Test-Driven Software Experiments (TDSEs) - that is, experiments that involve the execution of software subjects and the observation and analysis of their “de facto” run-time behavior. In this paper we present a general-purpose analysis platform called LASSO that provides a minimal set of domain-specific languages and data structures to conduct TDSEs. By empowering users with an executable scripting language to design and execute TDSEs, LASSO enables efficient evaluation of run-time semantics and execution characteristics in addition to statically determined properties. We present an example TDSE that demonstrates the practical benefits of LASSO’s scripting capabilities for assessing the reliability of LLMs for code generation by means of a self-contained, reusable and extensible study script. The LASSO platform is freely available at: this https URL, and a demo video is available on YouTube: this https URL

[AI-28] A Benchmark for Cross-Domain Argumentative Stance Classification on Social Media

Link: https://arxiv.org/abs/2410.08900
Authors: Jiaqing Yuan,Ruijie Xi,Munindar P. Singh
Keywords: stance classification plays, identifying authors’ viewpoints, Argumentative stance classification, stance classification, classification plays
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Argumentative stance classification plays a key role in identifying authors’ viewpoints on specific topics. However, generating diverse pairs of argumentative sentences across various domains is challenging. Existing benchmarks often come from a single domain or focus on a limited set of topics. Additionally, manual annotation for accurate labeling is time-consuming and labor-intensive. To address these challenges, we propose leveraging platform rules, readily available expert-curated content, and large language models to bypass the need for human annotation. Our approach produces a multidomain benchmark comprising 4,498 topical claims and 30,961 arguments from three sources, spanning 21 domains. We benchmark the dataset in fully supervised, zero-shot, and few-shot settings, shedding light on the strengths and limitations of different methodologies. We release the dataset and code in this study at hidden for anonymity.

[AI-29] Utilizing ChatGPT in a Data Structures and Algorithms Course: A Teaching Assistant's Perspective

Link: https://arxiv.org/abs/2410.08899
Authors: Pooriya Jamie,Reyhaneh Hajihashemi,Sharareh Alipour
Keywords: Integrating large language, large language models, Integrating large, computer science education, large language
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)

Abstract:Integrating large language models (LLMs) like ChatGPT is revolutionizing the field of computer science education. These models offer new possibilities for enriching student learning and supporting teaching assistants (TAs) in providing prompt feedback and supplementary learning resources. This research delves into the use of ChatGPT in a data structures and algorithms (DSA) course, particularly when combined with TA supervision. The findings demonstrate that incorporating ChatGPT with structured prompts and active TA guidance enhances students’ understanding of intricate algorithmic concepts, boosts engagement, and elevates academic performance. However, challenges exist in addressing academic integrity and the limitations of LLMs in tackling complex problems. The study underscores the importance of active TA involvement in reducing students’ reliance on AI-generated content and amplifying the overall educational impact. The results suggest that while LLMs can be advantageous for education, their successful integration demands continuous oversight and a thoughtful balance between AI and human guidance.

[AI-30] Drama: Mamba-Enabled Model-Based Reinforcement Learning Is Sample and Parameter Efficient

Link: https://arxiv.org/abs/2410.08893
Authors: Wenlong Wang,Ivana Dusparic,Yucheng Shi,Ke Zhang,Vinny Cahill
Keywords: Model-based reinforcement learning, world model, offers a solution, data inefficiency, inefficiency that plagues
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Abstract:Model-based reinforcement learning (RL) offers a solution to the data inefficiency that plagues most model-free RL algorithms. However, learning a robust world model often demands complex and deep architectures, which are expensive to compute and train. Within the world model, dynamics models are particularly crucial for accurate predictions, and various dynamics-model architectures have been explored, each with its own set of challenges. Currently, recurrent neural network (RNN) based world models face issues such as vanishing gradients and difficulty in capturing long-term dependencies effectively. In contrast, use of transformers suffers from the well-known issues of self-attention mechanisms, where both memory and computational complexity scale as O(n^2), with n representing the sequence length. To address these challenges we propose a state space model (SSM) based world model, specifically based on Mamba, that achieves O(n) memory and computational complexity while effectively capturing long-term dependencies and facilitating the use of longer training sequences efficiently. We also introduce a novel sampling method to mitigate the suboptimality caused by an incorrect world model in the early stages of training, combining it with the aforementioned technique to achieve a normalised score comparable to other state-of-the-art model-based RL algorithms using only a 7 million trainable parameter world model. This model is accessible and can be trained on an off-the-shelf laptop. Our code is available at this https URL.
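
The O(n) claim follows from the recurrent form of a state space model: each step updates a fixed-size state, so cost grows linearly with sequence length. The sketch below is a deliberately simplified scalar SSM with fixed parameters `a`, `b`, and `c` (all hypothetical names). Mamba itself makes these parameters input-dependent and uses a hardware-aware scan, neither of which is reproduced here.

```python
def ssm_scan(inputs, a, b, c):
    """Linear state space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t   (state update)
    y_t = c * h_t                 (readout)

    One pass over the sequence with a single scalar state: O(n) time and
    O(1) extra memory, versus the O(n^2) pairwise scores of self-attention.
    """
    h = 0.0
    outputs = []
    for x in inputs:
        h = a * h + b * x
        outputs.append(c * h)
    return outputs


# An impulse at t=0 decays geometrically through the state.
impulse_response = ssm_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0)
# -> [2.0, 1.0, 0.5]
```

Information from early inputs persists in the state `h`, which is how such a model can carry long-term dependencies without attending over every earlier position.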

[AI-31] Federated Learning in Practice: Reflections and Projections

Link: https://arxiv.org/abs/2410.08892
Authors: Katharine Daly,Hubert Eichner,Peter Kairouz,H. Brendan McMahan,Daniel Ramage,Zheng Xu
Keywords: enables multiple entities, machine learning technique, Federated Learning, local data, technique that enables
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Abstract:Federated Learning (FL) is a machine learning technique that enables multiple entities to collaboratively learn a shared model without exchanging their local data. Over the past decade, FL systems have achieved substantial progress, scaling to millions of devices across various learning domains while offering meaningful differential privacy (DP) guarantees. Production systems from organizations like Google, Apple, and Meta demonstrate the real-world applicability of FL. However, key challenges remain, including verifying server-side DP guarantees and coordinating training across heterogeneous devices, limiting broader adoption. Additionally, emerging trends such as large (multi-modal) models and blurred lines between training, inference, and personalization challenge traditional FL frameworks. In response, we propose a redefined FL framework that prioritizes privacy principles rather than rigid definitions. We also chart a path forward by leveraging trusted execution environments and open-source ecosystems to address these challenges and facilitate future advancements in FL.
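
As a concrete picture of learning a shared model without exchanging local data, the sketch below shows the server-side aggregation step of federated averaging, a widely used FL algorithm. The abstract does not name a specific algorithm, so using federated averaging here is an assumption; the client data and function name are also illustrative, and privacy mechanisms such as differential privacy are omitted.

```python
def federated_average(client_updates):
    """Combine client models into one shared model.

    `client_updates` is a list of (num_examples, weights) pairs, where
    `weights` is a list of floats trained locally on that client. Only
    these weights reach the server; the raw examples never leave the client.
    Each client's contribution is proportional to its number of examples.
    """
    total_examples = sum(n for n, _ in client_updates)
    dim = len(client_updates[0][1])
    return [
        sum(n * weights[i] for n, weights in client_updates) / total_examples
        for i in range(dim)
    ]


# Two hypothetical clients with different amounts of local data.
updates = [
    (1, [0.0, 2.0]),
    (3, [4.0, 2.0]),
]
global_weights = federated_average(updates)
# -> [3.0, 2.0]
```

In a full round, the server would send `global_weights` back to a new sample of clients, each of which trains locally and returns an update for the next aggregation.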

[AI-32] Bank Loan Prediction Using Machine Learning Techniques

Link: https://arxiv.org/abs/2410.08886
Authors: F M Ahosanul Haque,Md. Mahedi Hassan
Keywords: machine learning, development of economies, ecosystem through consumer, consumer and business, bank loan approval
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 18 figures, 6 tables

Abstract:Banks are important for the development of economies in any financial ecosystem through consumer and business loans. Lending, however, presents risks; thus, banks have to determine the applicant’s financial position to reduce the probabilities of default. A number of banks have currently, therefore, adopted data analytics and state-of-the-art technology to arrive at better decisions in the process. The probability of payback is prescribed by a predictive modeling technique in which machine learning algorithms are applied. In this research project, we will apply several machine learning methods to further improve the accuracy and efficiency of loan approval processes. Our work focuses on the prediction of bank loan approval; we have worked on a dataset of 148,670 instances and 37 attributes using machine learning methods. The target property segregates the loan applications into “Approved” and “Denied” groups. Various machine learning techniques have been used, namely, Decision Tree Categorization, AdaBoosting, Random Forest Classifier, SVM, and GaussianNB. Following that, the models were trained and evaluated. Among these, the best-performing algorithm was AdaBoosting, which achieved an incredible accuracy of 99.99%. The results therefore show how ensemble learning works effectively to improve the prediction skills of loan approval decisions. The presented work points to the possibility of achieving extremely accurate and efficient loan prediction models that provide useful insights for applying machine learning to financial domains.

[AI-33] Online design of dynamic networks

Link: https://arxiv.org/abs/2410.08875
Authors: Duo Wang,Andrea Araldo,Mounim El Yacoubi
Keywords: planning phase, network, Designing, Carlo Tree Search, Designing a network
Categories: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph)
Comments: 14 pages

Abstract:Designing a network (e.g., a telecommunication or transport network) is mainly done offline, in a planning phase, prior to the operation of the network. On the other hand, a massive effort has been devoted to characterizing dynamic networks, i.e., those that evolve over time. The novelty of this paper is that we introduce a method for the online design of dynamic networks. The need to do so emerges when a network needs to operate in a dynamic and stochastic environment. In this case, one may wish to build a network over time, on the fly, in order to react to the changes of the environment and to keep certain performance targets. We tackle this online design problem with a rolling horizon optimization based on Monte Carlo Tree Search. The potential of online network design is showcased for the design of a futuristic dynamic public transport network, where bus lines are constructed on the fly to better adapt to a stochastic user demand. In such a scenario, we compare our results with state-of-the-art dynamic vehicle routing problem (VRP) resolution methods, simulating requests from a New York City taxi dataset. Differently from classic VRP methods, that extend vehicle trajectories in isolation, our method enables us to build a structured network of line buses, where complex user journeys are possible, thus increasing system performance.

[AI-34] Experiments with Choice in Dependently-Typed Higher-Order Logic

Link: https://arxiv.org/abs/2410.08874
Authors: Daniel Ranalter,Chad E. Brown,Cezary Kaliszyk
Keywords: extensional type theory, powerful extensional type, type theory, higher-order logic, enriching the language
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: 10 pages incl. references; published in the proceedings of LPAR25

Abstract:Recently an extension to higher-order logic – called DHOL – was introduced, enriching the language with dependent types, and creating a powerful extensional type theory. In this paper we propose two ways how choice can be added to DHOL. We extend the DHOL term structure by Hilbert’s indefinite choice operator \epsilon, define a translation of the choice terms to HOL choice that extends the existing translation from DHOL to HOL and show that the extension of the translation is complete and give an argument for soundness. We finally evaluate the extended translation on a set of dependent HOL problems that require choice.

[AI-35] The Good, the Bad and the Ugly: Watermarks, Transferable Attacks and Adversarial Defenses ICML2024

Link: https://arxiv.org/abs/2410.08864
Authors: Grzegorz Głuch,Berkant Turan,Sai Ganesh Nagarajan,Sebastian Pokutta
Keywords: extend existing definitions, transferable attack, formalize and extend, extend existing, existing definitions
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 42 pages, 6 figures, preliminary version published in ICML 2024 (Workshop on Theoretical Foundations of Foundation Models), see this https URL

Abstract:We formalize and extend existing definitions of backdoor-based watermarks and adversarial defenses as interactive protocols between two players. The existence of these schemes is inherently tied to the learning tasks for which they are designed. Our main result shows that for almost every discriminative learning task, at least one of the two – a watermark or an adversarial defense – exists. The term “almost every” indicates that we also identify a third, counterintuitive but necessary option, i.e., a scheme we call a transferable attack. By transferable attack, we refer to an efficient algorithm computing queries that look indistinguishable from the data distribution and fool all efficient defenders. To this end, we prove the necessity of a transferable attack via a construction that uses a cryptographic tool called homomorphic encryption. Furthermore, we show that any task that satisfies our notion of a transferable attack implies a cryptographic primitive, thus requiring the underlying task to be computationally complex. These two facts imply an “equivalence” between the existence of transferable attacks and cryptography. Finally, we show that the class of tasks of bounded VC-dimension has an adversarial defense, and a subclass of them has a watermark.

[AI-36] MATCH: Model-Aware TVM-based Compilation for Heterogeneous Edge Devices

Link: https://arxiv.org/abs/2410.08855
Authors: Mohamed Amine Hamdi,Francesco Daghero,Giuseppe Maria Sarda,Josse Van Delm,Arne Symons,Luca Benini,Marian Verhelst,Daniele Jahier Pagliari,Alessio Burrello
Keywords: Deep Neural Networks, Neural Networks, Deep Neural, Toggle, heterogeneous MCU family
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: 13 pages, 11 figures, 4 tables

Abstract:Streamlining the deployment of Deep Neural Networks (DNNs) on heterogeneous edge platforms, coupling within the same micro-controller unit (MCU) instruction processors and hardware accelerators for tensor computations, is becoming one of the crucial challenges of the TinyML field. The best-performing DNN compilation toolchains are usually deeply customized for a single MCU family, and porting to a different heterogeneous MCU family implies labor-intensive re-development of almost the entire compiler. On the opposite side, retargetable toolchains, such as TVM, fail to exploit the capabilities of custom accelerators, resulting in the generation of general but unoptimized code. To overcome this duality, we introduce MATCH, a novel TVM-based DNN deployment framework designed for easy agile retargeting across different MCU processors and accelerators, thanks to a customizable model-based hardware abstraction. We show that a general and retargetable mapping framework enhanced with hardware cost models can compete with and even outperform custom toolchains on diverse targets while only needing the definition of an abstract hardware model and a SoC-specific API. We tested MATCH on two state-of-the-art heterogeneous MCUs, GAP9 and DIANA. On the four DNN models of the MLPerf Tiny suite MATCH reduces inference latency by up to 60.88 times on DIANA, compared to using the plain TVM, thanks to the exploitation of the on-board HW accelerator. Compared to HTVM, a fully customized toolchain for DIANA, we still reduce the latency by 16.94%. On GAP9, using the same benchmarks, we improve the latency by 2.15 times compared to the dedicated DORY compiler, thanks to our heterogeneous DNN mapping approach that synergically exploits the DNN accelerator and the eight-cores cluster available on board. 
ACM classes: I.2.2; D.1.3. Cite as: arXiv:2410.08855 [cs.DC]. Submitted by Mohamed Amine Hamdi; [v1] Fri, 11 Oct 2024 14:32:06 UTC.

[AI-37] Hybrid LLM-DDQN based Joint Optimization of V2I Communication and Autonomous Driving

Link: https://arxiv.org/abs/2410.08854
Authors: Zijiang Yan, Hao Zhou, Hina Tabassum, Xue Liu
Keywords: Large language models, considerable interest recently, interest recently due, Large language, received considerable interest
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
Comments: Submission for possible publication

Click to view abstract

Abstract:Large language models (LLMs) have received considerable interest recently due to their outstanding reasoning and comprehension capabilities. This work explores applying LLMs to vehicular networks, aiming to jointly optimize vehicle-to-infrastructure (V2I) communications and autonomous driving (AD) policies. We deploy LLMs for AD decision-making to maximize traffic flow and avoid collisions for road safety, and a double deep Q-learning algorithm (DDQN) is used for V2I optimization to maximize the received data rate and reduce frequent handovers. In particular, for LLM-enabled AD, we employ the Euclidean distance to identify previously explored AD experiences, and then LLMs can learn from past good and bad decisions for further improvement. Then, LLM-based AD decisions will become part of states in V2I problems, and DDQN will optimize the V2I decisions accordingly. After that, the AD and V2I decisions are iteratively optimized until convergence. Such an iterative optimization approach can better explore the interactions between LLMs and conventional reinforcement learning techniques, revealing the potential of using LLMs for network optimization and management. Finally, the simulations demonstrate that our proposed hybrid LLM-DDQN approach outperforms the conventional DDQN algorithm, showing faster convergence and higher average rewards.
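
The abstract names double deep Q-learning (DDQN) for the V2I side but does not spell out the update. As a rough illustration only, the sketch below shows the standard Double DQN bootstrap target, in which the online network picks the next action and the target network scores it. The function name, the plain-list Q-values, and all numbers are hypothetical and not taken from the paper.

```python
def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Standard Double DQN target for a single transition.

    next_q_online / next_q_target hold per-action Q-values of the next state
    from the online and target networks (plain lists, for illustration).
    """
    if done:
        return reward
    # The online network selects the greedy next action ...
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # ... and the target network evaluates it, which curbs overestimation.
    return reward + gamma * next_q_target[best_action]


# Hypothetical numbers: three candidate V2I actions (e.g. base stations).
target = double_dqn_target(
    reward=1.0,
    next_q_online=[0.2, 0.9, 0.5],
    next_q_target=[0.3, 0.6, 0.8],
    gamma=0.9,
)
print(target)
```

In the hybrid scheme described above, the state fed to such a learner would additionally contain the LLM-based driving decision; how that is encoded is not stated in the abstract.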

[AI-38] Conformalized Interactive Imitation Learning: Handling Expert Shift and Intermittent Feedback

Link: https://arxiv.org/abs/2410.08852
Authors: Michelle Zhao, Reid Simmons, Henny Admoni, Aaditya Ramdas, Andrea Bajcsy
Keywords: interactive imitation learning, seeking additional feedback, actively seeking additional, Monte Carlo dropout, distribution shifts encountered
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Click to view abstract

Abstract:In interactive imitation learning (IL), uncertainty quantification offers a way for the learner (i.e. robot) to contend with distribution shifts encountered during deployment by actively seeking additional feedback from an expert (i.e. human) online. Prior works use mechanisms like ensemble disagreement or Monte Carlo dropout to quantify when black-box IL policies are uncertain; however, these approaches can lead to overconfident estimates when faced with deployment-time distribution shifts. Instead, we contend that we need uncertainty quantification algorithms that can leverage the expert human feedback received during deployment time to adapt the robot’s uncertainty online. To tackle this, we draw upon online conformal prediction, a distribution-free method for constructing prediction intervals online given a stream of ground-truth labels. Human labels, however, are intermittent in the interactive IL setting. Thus, from the conformal prediction side, we introduce a novel uncertainty quantification algorithm called intermittent quantile tracking (IQT) that leverages a probabilistic model of intermittent labels, maintains asymptotic coverage guarantees, and empirically achieves desired coverage levels. From the interactive IL side, we develop ConformalDAgger, a new approach wherein the robot uses prediction intervals calibrated by IQT as a reliable measure of deployment-time uncertainty to actively query for more expert feedback. We compare ConformalDAgger to prior uncertainty-aware DAgger methods in scenarios where the distribution shift is (and isn’t) present because of changes in the expert’s policy. We find that in simulated and hardware deployments on a 7DOF robotic manipulator, ConformalDAgger detects high uncertainty when the expert shifts and increases the number of interventions compared to baselines, allowing the robot to more quickly learn the new behavior.
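
To make the idea of tracking a quantile from intermittent labels more concrete, here is a minimal sketch of generic online quantile tracking in which the threshold is only updated on steps where an expert label arrives. This is not the paper's IQT algorithm: the function name, the inverse-probability step scaling (`lr / label_prob`), and the synthetic uniform scores are all assumptions made for illustration.

```python
import random


def track_quantile(scores, observed, alpha=0.1, lr=0.005, q0=0.0, label_prob=1.0):
    """Online (1 - alpha)-quantile tracking with intermittent feedback.

    scores:     nonconformity scores, one per time step
    observed:   booleans, True when an expert label was received
    label_prob: assumed probability of receiving a label (hypothetical model)
    """
    q = q0
    for score, has_label in zip(scores, observed):
        if not has_label:
            continue  # no feedback at this step: keep the current threshold
        miss = 1.0 if score > q else 0.0  # was the score outside the interval?
        # Raise q after a miss, lower it slightly otherwise; the step is
        # rescaled by 1 / label_prob to compensate for skipped updates.
        q += (lr / label_prob) * (miss - alpha)
    return q


rng = random.Random(0)
label_prob = 0.3
scores = [rng.random() for _ in range(20000)]
observed = [rng.random() < label_prob for _ in range(20000)]
q_final = track_quantile(scores, observed, alpha=0.1, label_prob=label_prob)
print(round(q_final, 2))
```

With uniform scores on [0, 1] and `alpha=0.1`, the threshold should settle near the 0.9 quantile; a robot could then ask for help whenever its prediction interval, sized by this threshold, becomes too wide.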

[AI-39] Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization

Link: https://arxiv.org/abs/2410.08847
Authors: Noam Razin, Sadhika Malladi, Adithya Bhaskar, Danqi Chen, Sanjeev Arora, Boris Hanin
Keywords: Direct Preference Optimization, Direct Preference, Preference Optimization, likelihood displacement, variants are increasingly
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: Code available at this https URL

Click to view abstract

Abstract:Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences. Although these methods are designed to teach a model to generate preferred responses more frequently relative to dispreferred responses, prior work has observed that the likelihood of preferred responses often decreases during training. The current work sheds light on the causes and implications of this counter-intuitive phenomenon, which we term likelihood displacement. We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with an opposite meaning. As a simple example, training a model to prefer \texttt{No} over \texttt{Never} can sharply increase the probability of \texttt{Yes}. Moreover, when aligning the model to refuse unsafe prompts, we show that such displacement can unintentionally lead to unalignment, by shifting probability mass from preferred refusal responses to harmful responses (e.g., reducing the refusal rate of Llama-3-8B-Instruct from 74.4% to 33.4%). We theoretically characterize that likelihood displacement is driven by preferences that induce similar embeddings, as measured by a centered hidden embedding similarity (CHES) score. Empirically, the CHES score enables identifying which training samples contribute most to likelihood displacement in a given dataset. Filtering out these samples effectively mitigated unintentional unalignment in our experiments. More broadly, our results highlight the importance of curating data with sufficiently distinct preferences, for which we believe the CHES score may prove valuable.
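
A small numeric sketch may help show why the DPO objective permits likelihood displacement: the standard per-example DPO loss depends only on the margin between the preferred and dispreferred log-ratios, so it can fall even when the preferred response becomes less likely. The log-probabilities below are made-up values, not results from the paper, and the function name is illustrative.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard per-example DPO loss: -log sigmoid(beta * margin).

    logp_w / logp_l:         policy log-probs of preferred / dispreferred response
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-beta * margin))


# Hypothetical log-probabilities. Before training the policy equals the reference.
before = dpo_loss(logp_w=-2.0, logp_l=-2.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
# After training, BOTH responses are less likely, the dispreferred one more so.
after = dpo_loss(logp_w=-3.0, logp_l=-6.0, ref_logp_w=-2.0, ref_logp_l=-2.0)

print(before, after)  # the loss went down although logp_w dropped from -2 to -3
```

The probability mass removed from both responses has to go somewhere else in the output space, which is the opening for the displacement toward unintended responses that the abstract describes.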

[AI-40] Public Transport Network Design for Equality of Accessibility via Message Passing Neural Networks and Reinforcement Learning

Link: https://arxiv.org/abs/2410.08841
Authors: Duo Wang, Maximilien Chau, Andrea Araldo
Keywords: Designing Public Transport, Public Transport, Transport Network Design, Designing Public, pollution and congestion
Categories: Artificial Intelligence (cs.AI)
Comments: 14 pages

Click to view abstract

Abstract:Designing Public Transport (PT) networks able to satisfy the mobility needs of people is essential to reduce the number of individual vehicles on the road, and thus pollution and congestion. Urban sustainability is thus tightly coupled to an efficient PT. Current approaches to Transport Network Design (TND) generally aim to optimize generalized cost, i.e., a unique number including operator and users’ costs. Since we understand the quality of PT as the capability of satisfying mobility needs, we focus instead on PT accessibility, i.e., the ease of reaching surrounding points of interest via PT. PT accessibility is generally unequally distributed in urban regions: suburbs generally suffer from poor PT accessibility, which condemns residents therein to be dependent on their private cars. We thus tackle the problem of designing bus lines so as to minimize the inequality in the geographical distribution of accessibility. We combine state-of-the-art Message Passing Neural Networks (MPNN) and Reinforcement Learning. We show the efficacy of our method against metaheuristics (classically used in TND) in a use case representing in simplified terms the city of Montreal.

[AI-41] Unveiling Molecular Secrets: An LLM-Augmented Linear Model for Explainable and Calibratable Molecular Property Prediction

Link: https://arxiv.org/abs/2410.08829
Authors: Zhuoran Li, Xu Sun, Wanyu Lin, Jiannong Cao
Keywords: scientific fields, material science, drug discovery, discovery and material, molecular property prediction
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Click to view abstract

Abstract:Explainable molecular property prediction is essential for various scientific fields, such as drug discovery and material science. Despite delivering intrinsic explainability, linear models struggle with capturing complex, non-linear patterns. Large language models (LLMs), on the other hand, yield accurate predictions through powerful inference capabilities yet fail to provide chemically meaningful explanations for their predictions. This work proposes a novel framework, called MoleX, which leverages LLM knowledge to build a simple yet powerful linear model for accurate molecular property prediction with faithful explanations. The core of MoleX is to model complicated molecular structure-property relationships using a simple linear model, augmented by LLM knowledge and a crafted calibration strategy. Specifically, to extract the maximum amount of task-relevant knowledge from LLM embeddings, we employ information bottleneck-inspired fine-tuning and sparsity-inducing dimensionality reduction. These informative embeddings are then used to fit a linear model for explainable inference. Moreover, we introduce residual calibration to address prediction errors stemming from linear models’ insufficient expressiveness of complex LLM embeddings, thus recovering the LLM’s predictive power and boosting overall accuracy. Theoretically, we provide a mathematical foundation to justify MoleX’s explainability. Extensive experiments demonstrate that MoleX outperforms existing methods in molecular property prediction, establishing a new milestone in predictive performance, explainability, and efficiency. In particular, MoleX enables CPU inference and accelerates large-scale dataset processing, achieving comparable performance 300x faster with 100,000 fewer parameters than LLMs. Additionally, the calibration improves model performance by up to 12.7% without compromising explainability.

[AI-42] One-shot Generative Domain Adaptation in 3D GANs

Link: https://arxiv.org/abs/2410.08824
Authors: Ziqiang Li, Yi Wu, Chaoyue Wang, Xue Rui, Bin Li
Keywords: necessitates extensive training, ensure stable training, extensive training data, generation necessitates extensive, image generation necessitates
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: IJCV

Click to view abstract

Abstract:3D-aware image generation necessitates extensive training data to ensure stable training and mitigate the risk of overfitting. This paper first considers a novel task known as One-shot 3D Generative Domain Adaptation (GDA), aimed at transferring a pre-trained 3D generator from one domain to a new one, relying solely on a single reference image. One-shot 3D GDA is characterized by the pursuit of specific attributes, namely, high fidelity, large diversity, cross-domain consistency, and multi-view consistency. Within this paper, we introduce 3D-Adapter, the first one-shot 3D GDA method, for diverse and faithful generation. Our approach begins by judiciously selecting a restricted weight set for fine-tuning, and subsequently leverages four advanced loss functions to facilitate adaptation. An efficient progressive fine-tuning strategy is also implemented to enhance the adaptation process. The synergy of these three technological components empowers 3D-Adapter to achieve remarkable performance, substantiated both quantitatively and qualitatively, across all desired properties of 3D GDA. Furthermore, 3D-Adapter seamlessly extends its capabilities to zero-shot scenarios, and preserves the potential for crucial tasks such as interpolation, reconstruction, and editing within the latent space of the pre-trained generator. Code will be available at this https URL.

[AI-43] SOLD: Reinforcement Learning with Slot Object-Centric Latent Dynamics

Link: https://arxiv.org/abs/2410.08822
Authors: Malte Mosbach, Jan Niklas Ewertz, Angel Villar-Corrales, Sven Behnke
Keywords: agent understanding, Learning, latent dynamics, latent, dynamics
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Click to view abstract

Abstract:Learning a latent dynamics model provides a task-agnostic representation of an agent’s understanding of its environment. Leveraging this knowledge for model-based reinforcement learning holds the potential to improve sample efficiency over model-free methods by learning inside imagined rollouts. Furthermore, because the latent space serves as input to behavior models, the informative representations learned by the world model facilitate efficient learning of desired skills. Most existing methods rely on holistic representations of the environment’s state. In contrast, humans reason about objects and their interactions, forecasting how actions will affect specific parts of their surroundings. Inspired by this, we propose Slot-Attention for Object-centric Latent Dynamics (SOLD), a novel algorithm that learns object-centric dynamics models in an unsupervised manner from pixel inputs. We demonstrate that the structured latent space not only improves model interpretability but also provides a valuable input space for behavior models to reason over. Our results show that SOLD outperforms DreamerV3, a state-of-the-art model-based RL algorithm, across a range of benchmark robotic environments that evaluate for both relational reasoning and low-level manipulation capabilities. Videos are available at this https URL.

[AI-44] StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-time Hybrid Information Structurization

Link: https://arxiv.org/abs/2410.08815
Authors: Zhuoqun Li, Xuanang Chen, Haiyang Yu, Hongyu Lin, Yaojie Lu, Qiaoyu Tang, Fei Huang, Xianpei Han, Le Sun, Yongbin Li
Keywords: large language models, effectively enhance large, enhance large language, Retrieval-augmented generation, existing RAG methods
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Click to view abstract

Abstract:Retrieval-augmented generation (RAG) is a key means to effectively enhance large language models (LLMs) in many knowledge-based tasks. However, existing RAG methods struggle with knowledge-intensive reasoning tasks, because the useful information required for these tasks is badly scattered. This characteristic makes it difficult for existing RAG methods to accurately identify key information and perform global reasoning with such noisy augmentation. In this paper, motivated by the cognitive theories that humans convert raw information into various structured knowledge when tackling knowledge-intensive reasoning, we propose a new framework, StructRAG, which can identify the optimal structure type for the task at hand, reconstruct original documents into this structured format, and infer answers based on the resulting structure. Extensive experiments across various knowledge-intensive tasks show that StructRAG achieves state-of-the-art performance, particularly excelling in challenging scenarios, demonstrating its potential as an effective solution for enhancing LLMs in complex real-world applications.

[AI-45] PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning

Link: https://arxiv.org/abs/2410.08811
Authors: Tingchen Fu, Mrinank Sharma, Philip Torr, Shay B. Cohen, David Krueger, Fazl Barez
Keywords: aligning current LLMs, Preference learning, data poisoning, data poisoning attacks, central component
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Tingchen Fu and Fazl Barez are core research contributors

Click to view abstract

Abstract:Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks. To address this concern, we introduce PoisonBench, a benchmark for evaluating large language models’ susceptibility to data poisoning during preference learning. Data poisoning attacks can manipulate large language model responses to include hidden malicious content or biases, potentially causing the model to generate harmful or unintended outputs while appearing to function normally. We deploy two distinct attack types across eight realistic scenarios, assessing 21 widely-used models. Our findings reveal concerning trends: (1) Scaling up parameter size does not inherently enhance resilience against poisoning attacks; (2) There exists a log-linear relationship between the effects of the attack and the data poison ratio; (3) The effect of data poisoning can generalize to extrapolated triggers that are not included in the poisoned data. These results expose weaknesses in current preference learning techniques, highlighting the urgent need for more robust defenses against malicious models and data manipulation.

[AI-46] DCNet: A Data-Driven Framework for DVL

Link: https://arxiv.org/abs/2410.08809
Authors: Zeev Yampolsky, Itzik Klein
Keywords: Autonomous underwater vehicles, Autonomous underwater, underwater robotic platforms, underwater vehicles, DVL
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 10 Pages, 9 Figures, 5 Tables

Click to view abstract

Abstract:Autonomous underwater vehicles (AUVs) are underwater robotic platforms used in a variety of applications. An AUV’s navigation solution relies heavily on the fusion of inertial sensors and Doppler velocity logs (DVL), where the latter delivers accurate velocity updates. To ensure accurate navigation, a DVL calibration is undertaken before the mission begins to estimate its error terms. During calibration, the AUV follows a complex trajectory and employs nonlinear estimation filters to estimate error terms. In this paper, we introduce DCNet, a data-driven framework that utilizes a two-dimensional convolution kernel in an innovative way. Using DCNet and our proposed DVL error model, we offer a rapid calibration procedure. This can be applied to a trajectory with a nearly constant velocity. To train and test our proposed approach, a 276-minute dataset of real recorded DVL measurements was used. We demonstrated an average improvement of 70% in accuracy and 80% improvement in calibration time, compared to the baseline approach, with a low-performance DVL. As a result of those improvements, an AUV employing a low-cost DVL can achieve higher accuracy, shorter calibration time, and apply a simple nearly constant velocity calibration trajectory. Our results also open up new applications for marine robotics utilizing low-cost, high-accuracy DVLs.

[AI-47] M3-Impute: Mask-guided Representation Learning for Missing Value Imputation

Link: https://arxiv.org/abs/2410.08794
Authors: Zhongyi Yu, Zhenghao Wu, Shuhan Zhong, Weifeng Su, S.-H. Gary Chan, Chul-Ho Lee, Weipeng Zhuo
Keywords: poses significant challenges, poses significant, significant challenges, analysis and machine, common problem
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Click to view abstract

Abstract:Missing values are a common problem that poses significant challenges to data analysis and machine learning. This problem necessitates the development of an effective imputation method to fill in the missing values accurately, thereby enhancing the overall quality and utility of the datasets. Existing imputation methods, however, fall short of explicitly considering the ‘missingness’ information in the data during the embedding initialization stage and modeling the entangled feature and sample correlations during the learning process, thus leading to inferior performance. We propose M^3-Impute, which aims to explicitly leverage the missingness information and such correlations with novel masking schemes. M^3-Impute first models the data as a bipartite graph and uses a graph neural network to learn node embeddings, where the refined embedding initialization process directly incorporates the missingness information. The embeddings are then optimized through M^3-Impute’s novel feature correlation unit (FRU) and sample correlation unit (SRU), which effectively capture feature and sample correlations for imputation. Experimental results on 25 benchmark datasets under three different missingness settings show the effectiveness of M^3-Impute by achieving 20 best and 4 second-best MAE scores on average.
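
The bipartite-graph view of a table with missing entries can be sketched in a few lines. This is only an illustration of the general data layout (sample nodes, feature nodes, one weighted edge per observed cell, plus a binary missingness mask); the helper name and node naming scheme are hypothetical and the paper's actual embedding initialization is not reproduced here.

```python
def table_to_bipartite(rows):
    """Turn a table with None for missing cells into bipartite edges and a mask.

    Returns:
        edges: (sample_node, feature_node, value) for every observed cell
        mask:  same shape as rows, 1 where a value is observed, 0 where missing
    """
    edges = []
    mask = []
    for i, row in enumerate(rows):
        mask_row = []
        for j, value in enumerate(row):
            observed = value is not None
            mask_row.append(1 if observed else 0)
            if observed:
                # Hypothetical node names: "s<i>" for samples, "f<j>" for features.
                edges.append((f"s{i}", f"f{j}", value))
        mask.append(mask_row)
    return edges, mask


# Made-up table: two samples, three features, three missing cells.
rows = [
    [1.0, None, 3.0],
    [None, 2.0, None],
]
edges, mask = table_to_bipartite(rows)
print(edges)
print(mask)
```

Under this layout, imputing a missing cell amounts to predicting the weight of an absent sample-feature edge, and the mask is the "missingness" signal that the abstract says is fed into embedding initialization.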

[AI-48] VLM See Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model

Link: https://arxiv.org/abs/2410.08792
Authors: Beichen Wang, Juexiao Zhang, Shuwen Dong, Irving Fang, Chen Feng
Keywords: Vision Language Models, Vision Language, Language Models, common sense reasoning, recently been adopted
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Click to view abstract

Abstract:Vision Language Models (VLMs) have recently been adopted in robotics for their capability in common sense reasoning and generalizability. Existing work has applied VLMs to generate task and motion planning from natural language instructions and simulate training data for robot learning. In this work, we explore using VLM to interpret human demonstration videos and generate robot task planning. Our method integrates keyframe selection, visual perception, and VLM reasoning into a pipeline. We named it SeeDo because it enables the VLM to “see” human demonstrations and explain the corresponding plans to the robot for it to “do”. To validate our approach, we collected a set of long-horizon human videos demonstrating pick-and-place tasks in three diverse categories and designed a set of metrics to comprehensively benchmark SeeDo against several baselines, including state-of-the-art video-input VLMs. The experiments demonstrate SeeDo’s superior performance. We further deployed the generated task plans in both a simulation environment and on a real robot arm.

[AI-49] F2A: An Innovative Approach for Prompt Injection by Utilizing Feign Security Detection Agents

Link: https://arxiv.org/abs/2410.08776
Authors: Yupeng Ren
Keywords: Large Language Models, Language Models, Large Language, numerous mature applications, safety detection results
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Click to view abstract

Abstract:With the rapid development of Large Language Models (LLMs), numerous mature applications of LLMs have emerged in the field of content safety detection. However, we have found that LLMs exhibit blind trust in safety detection agents. General LLMs can be compromised by hackers through this vulnerability. Hence, this paper proposed an attack named Feign Agent Attack (F2A). Through such malicious forgery methods, adding fake safety detection results into the prompt, the defense mechanism of LLMs can be bypassed, thereby obtaining harmful content and hijacking the normal this http URL, a series of experiments were conducted. In these experiments, the hijacking capability of F2A on LLMs was analyzed and demonstrated, exploring the fundamental reasons why LLMs blindly trust safety detection results. The experiments involved various scenarios where fake safety detection results were injected into prompts, and the responses were closely monitored to understand the extent of the vulnerability. Also, this paper provided a reasonable solution to this attack, emphasizing that it is important for LLMs to critically evaluate the results of augmented agents to prevent the generation of harmful content. By doing so, reliability and security can be significantly improved, protecting the LLMs from F2A.

[AI-50] Efficient Multi-Object Tracking on Edge Devices via Reconstruction-Based Channel Pruning

Link: https://arxiv.org/abs/2410.08769
Authors: Jan Müller, Adrian Pigors
Keywords: addressing critical security, Jetson Orin Nano, technologies presents, advancement of multi-object, presents the dual
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Click to view abstract

Abstract:The advancement of multi-object tracking (MOT) technologies presents the dual challenge of maintaining high performance while addressing critical security and privacy concerns. In applications such as pedestrian tracking, where sensitive personal data is involved, the potential for privacy violations and data misuse becomes a significant issue if data is transmitted to external servers. To mitigate these risks, processing data directly on an edge device, such as a smart camera, has emerged as a viable solution. Edge computing ensures that sensitive information remains local, thereby aligning with stringent privacy principles and significantly reducing network latency. However, the implementation of MOT on edge devices is not without its challenges. Edge devices typically possess limited computational resources, necessitating the development of highly optimized algorithms capable of delivering real-time performance under these constraints. The disparity between the computational requirements of state-of-the-art MOT algorithms and the capabilities of edge devices emphasizes a significant obstacle. To address these challenges, we propose a neural network pruning method specifically tailored to compress complex networks, such as those used in modern MOT systems. This approach optimizes MOT performance by ensuring high accuracy and efficiency within the constraints of limited edge devices, such as NVIDIA’s Jetson Orin Nano. By applying our pruning method, we achieve model size reductions of up to 70% while maintaining a high level of accuracy and further improving performance on the Jetson Orin Nano, demonstrating the effectiveness of our approach for edge computing applications.

[AI-51] Integrating Supertag Features into Neural Discontinuous Constituent Parsing

Link: https://arxiv.org/abs/2410.08766
Authors: Lukas Mielczarek
Keywords: natural-language processing, essential in natural-language, widely used description, parsing, DPTB for English
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
Comments: Bachelor’s Thesis. Supervised by Dr. Kilian Evang and Univ.-Prof. Dr. Laura Kallmeyer

Click to view abstract

Abstract:Syntactic parsing is essential in natural-language processing, with constituent structure being one widely used description of syntax. Traditional views of constituency demand that constituents consist of adjacent words, but this poses challenges in analysing syntax with non-local dependencies, common in languages like German. Therefore, in a number of treebanks like NeGra and TIGER for German and DPTB for English, long-range dependencies are represented by crossing edges. Various grammar formalisms have been used to describe discontinuous trees - often with high time complexities for parsing. Transition-based parsing aims at reducing this factor by eliminating the need for an explicit grammar. Instead, neural networks are trained to produce trees given raw text input using supervised learning on large annotated corpora. An elegant proposal for a stack-free transition-based parser developed by Coavoux and Cohen (2019) successfully allows for the derivation of any discontinuous constituent tree over a sentence in worst-case quadratic time. The purpose of this work is to explore the introduction of supertag information into transition-based discontinuous constituent parsing. In lexicalised grammar formalisms like CCG (Steedman, 1989) informative categories are assigned to the words in a sentence and act as the building blocks for composing the sentence’s syntax. These supertags indicate a word’s structural role and syntactic relationship with surrounding items. The study examines incorporating supertag information by using a dedicated supertagger as additional input for a neural parser (pipeline) and by jointly training a neural model for both parsing and supertagging (multi-task). In addition to CCG, several other frameworks (LTAG-spinal, LCFRS) and sequence labelling tasks (chunking, dependency parsing) will be compared in terms of their suitability as auxiliary tasks for parsing.

[AI-52] Unlocking FedNL: Self-Contained Compute-Optimized Implementation

Link: https://arxiv.org/abs/2410.08760
Authors: Konstantin Burlachenko, Peter Richtárik
Keywords: train Machine Learning, collaboratively train Machine, Machine Learning, Federated Newton Learn, enables intelligent agents
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Mathematical Software (cs.MS); Performance (cs.PF); Optimization and Control (math.OC)
Comments: 55 pages, 12 figures, 12 tables

Click to view abstract

Abstract:Federated Learning (FL) is an emerging paradigm that enables intelligent agents to collaboratively train Machine Learning (ML) models in a distributed manner, eliminating the need for sharing their local data. The recent work (arXiv:2106.02969) introduces a family of Federated Newton Learn (FedNL) algorithms, marking a significant step towards applying second-order methods to FL and large-scale optimization. However, the reference FedNL prototype exhibits three serious practical drawbacks: (i) it requires 4.8 hours to launch a single experiment on a server-grade workstation; (ii) the prototype only simulates a multi-node setting; (iii) integrating the prototype into resource-constrained applications is challenging. To bridge the gap between theory and practice, we present a self-contained implementation of FedNL, FedNL-LS, and FedNL-PP for single-node and multi-node settings. Our work resolves the aforementioned issues and reduces the wall-clock time by a factor of 1000. With this, FedNL outperforms alternatives for training logistic regression in the single-node setting (CVXPY, arXiv:1603.00943) and in the multi-node setting (Apache Spark, arXiv:1505.06807; Ray/Scikit-Learn, arXiv:1712.05889). Finally, we propose two practically oriented compressors for FedNL, adaptive TopLEK and cache-aware RandSeqK, which fulfill the theory of FedNL.

[AI-53] Enhancing GNNs with Architecture-Agnostic Graph Transformations: A Systematic Analysis

Link: https://arxiv.org/abs/2410.08759
Authors: Zhifei Li, Gerrit Großmann, Verena Wolf
Keywords: graph neural network, recent years, neural network, wide variety, GNN
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Click to view abstract

Abstract:In recent years, a wide variety of graph neural network (GNN) architectures have emerged, each with its own strengths, weaknesses, and complexities. Various techniques, including rewiring, lifting, and node annotation with centrality values, have been employed as pre-processing steps to enhance GNN performance. However, there are no universally accepted best practices, and the impact of architecture and pre-processing on performance often remains opaque. This study systematically explores the impact of various graph transformations as pre-processing steps on the performance of common GNN architectures across standard datasets. The models are evaluated based on their ability to distinguish non-isomorphic graphs, referred to as expressivity. Our findings reveal that certain transformations, particularly those augmenting node features with centrality measures, consistently improve expressivity. However, these gains come with trade-offs, as methods like graph encoding, while enhancing expressivity, introduce numerical inaccuracies in widely used Python packages. Additionally, we observe that these pre-processing techniques are limited when addressing complex tasks involving 3-WL and 4-WL indistinguishable graphs.
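As a rough illustration of the node-annotation idea mentioned in the abstract, the sketch below appends degree centrality to each node's feature vector before the graph would be handed to a GNN. The adjacency-list format and the function names `degree_centrality` and `augment_features` are assumptions made for this example, not the paper's code. The abstract speaks of centrality measures in general; degree centrality is used here only because it is the simplest one.

```python
def degree_centrality(adjacency):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(adjacency)
    if n <= 1:
        return {node: 0.0 for node in adjacency}
    return {node: len(neighbors) / (n - 1) for node, neighbors in adjacency.items()}


def augment_features(features, adjacency):
    """Return a copy of the node features with degree centrality appended."""
    centrality = degree_centrality(adjacency)
    return {node: list(vec) + [centrality[node]] for node, vec in features.items()}


# Toy path graph 0 - 1 - 2 with one constant input feature per node.
adjacency = {0: [1], 1: [0, 2], 2: [1]}
features = {0: [1.0], 1: [1.0], 2: [1.0]}
augmented = augment_features(features, adjacency)
```

With constant input features the three nodes are indistinguishable by their features alone; after augmentation the middle node carries a different value from the two end nodes, which is the kind of extra signal such pre-processing is meant to provide.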

[AI-54] Hespi: A pipeline for automatically detecting information from herbarium specimen sheets

Link: https://arxiv.org/abs/2410.08740
Authors: Robert Turnbull, Emily Fitzgerald, Karen Thompson, Joanne L. Birch
Keywords: conservation sciences, Optical Character Recognition, Specimen, data, Specimen sheet PIpeline
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Click to view abstract

Abstract:Specimen-associated biodiversity data are sought after for biological, environmental, climate, and conservation sciences. A rate shift is required for the extraction of data from specimen images to eliminate the bottleneck that the reliance on human-mediated transcription of these data represents. We applied advanced computer vision techniques to develop 'Hespi' (HErbarium Specimen sheet PIpeline), which extracts a pre-catalogue subset of collection data on the institutional labels on herbarium specimens from their digital images. The pipeline integrates two object detection models: the first detects bounding boxes around text-based labels, and the second detects bounding boxes around text-based data fields on the primary institutional label. The pipeline classifies text-based institutional labels as printed, typed, handwritten, or a combination and applies Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) for data extraction. The recognized text is then corrected against authoritative databases of taxon names. The extracted text is also corrected with the aid of a multimodal Large Language Model (LLM). Hespi accurately detects and extracts text for test datasets including specimen sheet images from international herbaria. The components of the pipeline are modular and users can train their own models with their own data and use them in place of the models provided.

[AI-55] Developing a Pragmatic Benchmark for Assessing Korean Legal Language Understanding in Large Language Models EMNLP2024

Link: https://arxiv.org/abs/2410.08731
Authors: Yeeun Kim, Young Rok Choi, Eunkyung Choi, Jinhwan Choi, Hai Jin Park, Wonseok Hwang
Keywords: Uniform Bar Exam, Large language models, demonstrated remarkable performance, efficacy remains limited, passing the Uniform
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2024 Findings

Click to view abstract

Abstract:Large language models (LLMs) have demonstrated remarkable performance in the legal domain, with GPT-4 even passing the Uniform Bar Exam in the U.S. However, their efficacy remains limited for non-standardized tasks and tasks in languages other than English. This underscores the need for careful evaluation of LLMs within each legal system before application. Here, we introduce KBL, a benchmark for assessing the Korean legal language understanding of LLMs, consisting of (1) 7 legal knowledge tasks (510 examples), (2) 4 legal reasoning tasks (288 examples), and (3) the Korean bar exam (4 domains, 53 tasks, 2,510 examples). The first two datasets were developed in close collaboration with lawyers to evaluate LLMs in practical scenarios in a certified manner. Furthermore, considering legal practitioners’ frequent use of extensive legal documents for research, we assess LLMs in both a closed-book setting, where they rely solely on internal knowledge, and a retrieval-augmented generation (RAG) setting, using a corpus of Korean statutes and precedents. The results indicate substantial room and opportunities for improvement.

[AI-56] From N-grams to Pre-trained Multilingual Models For Language Identification

Link: https://arxiv.org/abs/2410.08728
Authors: Thapelo Sindane, Vukosi Marivate
Keywords: South African languages, Large Pre-trained Multilingual, South African, Pre-trained Multilingual models, African languages
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: The paper has been accepted at The 4th International Conference on Natural Language Processing for Digital Humanities (NLP4DH 2024)

Click to view abstract

Abstract:In this paper, we investigate the use of N-gram models and Large Pre-trained Multilingual models for Language Identification (LID) across 11 South African languages. For N-gram models, this study shows that effective data size selection remains crucial for establishing effective frequency distributions of the target languages, which efficiently model each language and thus improve language ranking. For pre-trained multilingual models, we conduct extensive experiments covering a diverse set of massively pre-trained multilingual (PLM) models (mBERT, RemBERT, XLM-r) and Afri-centric multilingual models (AfriBERTa, Afro-XLMr, AfroLM, and Serengeti). We further compare these models with available large-scale Language Identification tools (Compact Language Detector v3 (CLD V3), AfroLID, GlotLID, and OpenLID) to highlight the importance of focus-based LID. From these, we show that, on average, Serengeti is the superior model across all model types, from N-grams to Transformers. Moreover, we propose a lightweight BERT-based LID model (za_BERT_lid) trained with the NHCLT + Vukzenzele corpus, which performs on par with our best-performing Afri-centric models.
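To make the N-gram side of the comparison concrete, here is a minimal sketch of character n-gram language ranking: each language gets a frequency profile built from training text, and an input is ranked by how much of its n-gram mass appears in each profile. The function names, the trigram order, the scoring rule, and the two tiny training strings are all assumptions for illustration; the paper's actual models and data are not reproduced here.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """All overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train_profiles(samples, n=3):
    """Build one n-gram frequency profile per language from {language: text}."""
    return {lang: Counter(char_ngrams(text, n)) for lang, text in samples.items()}


def rank_languages(text, profiles, n=3):
    """Rank languages by how often the text's n-grams occur in each profile."""
    grams = char_ngrams(text, n)
    scores = {}
    for lang, profile in profiles.items():
        total = sum(profile.values())
        # Relative frequency keeps large profiles from winning by size alone.
        scores[lang] = sum(profile[g] for g in grams) / total if total else 0.0
    return sorted(scores, key=scores.get, reverse=True)


# Tiny hypothetical training texts; real profiles need far more data.
samples = {"eng": "the cat and the dog", "afr": "die kat en die hond"}
profiles = train_profiles(samples)
```

Calling `rank_languages("die hond", profiles)` puts `"afr"` first. With so little training text the profiles are extremely sparse, which matches the abstract's point that the amount of data used to build the frequency distributions matters.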

[AI-57] On the token distance modeling ability of higher RoPE attention dimension

Link: https://arxiv.org/abs/2410.08703
Authors: Xiangyu Hong, Che Jiang, Biqing Qi, Fandong Meng, Mo Yu, Bowen Zhou, Jie Zhou
Keywords: Rotary position embedding, shown promising results, Rotary position, based on Rotary, position embedding
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Click to view abstract

Abstract:Length extrapolation algorithms based on Rotary position embedding (RoPE) have shown promising results in extending the context length of language models. However, understanding how position embedding can capture longer-range contextual information remains elusive. Based on the intuition that different dimensions correspond to different frequencies of change in RoPE encoding, we conducted a dimension-level analysis to investigate the correlation between a hidden dimension of an attention head and its contribution to capturing long-distance dependencies. Using our correlation metric, we identified a particular type of attention head, which we named Positional Heads, from various length-extrapolated models. These heads exhibit a strong focus on long-range information interaction and play a pivotal role in long input processing, as evidenced by our ablation. We further demonstrate the correlation between the efficiency of length extrapolation and the extension of the high-dimensional attention allocation of these heads. The identification of Positional Heads provides insights for future research in long-text comprehension.
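The intuition that different RoPE dimensions rotate at different frequencies can be seen in a small sketch of the standard RoPE formulation, where dimension pair i rotates by the angle position × base^(−2i/d). The helper names, the interleaved pairing of dimensions, and the default base of 10000 are assumptions for this example (some implementations pair dimensions differently). This is not the paper's correlation metric, only the underlying encoding.

```python
import math


def rope_frequencies(head_dim, base=10000.0):
    """Rotation frequency of each dimension pair; higher pairs rotate more slowly."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def apply_rope(vec, position, base=10000.0):
    """Rotate consecutive pairs (vec[2i], vec[2i+1]) by position * frequency_i."""
    out = []
    for i, theta in enumerate(rope_frequencies(len(vec), base)):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(position * theta), math.sin(position * theta)
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, standing in for an attention logit."""
    return sum(x * y for x, y in zip(a, b))


query = [1.0, 2.0, 3.0, 4.0]
key = [0.5, -1.0, 2.0, 0.0]
# Both pairs of positions are 3 tokens apart, so the logits should agree.
near = dot(apply_rope(query, 5), apply_rope(key, 2))
far = dot(apply_rope(query, 13), apply_rope(key, 10))
```

The two logits agree up to floating-point error because the rotated dot product depends only on the relative offset. The later, lower-frequency pairs change slowly with distance, which is why higher dimensions are natural candidates for tracking long-range position.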

[AI-58] Chain-of-Restoration: Multi-Task Image Restoration Models are Zero-Shot Step-by-Step Universal Image Restorers

Link: https://arxiv.org/abs/2410.08688
Authors: Jin Cao, Deyu Meng, Xiangyong Cao
Keywords: typically targeting isolated, previous works typically, works typically targeting, isolated degradation types, targeting isolated degradation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 9 figures

Click to view abstract

Abstract:Despite previous works typically targeting isolated degradation types, recent research has increasingly focused on addressing composite degradations which involve a complex interplay of multiple different isolated degradations. Recognizing the challenges posed by the exponential number of possible degradation combinations, we propose Universal Image Restoration (UIR), a new task setting that requires models to be trained on a set of degradation bases and then remove any degradation that these bases can potentially compose in a zero-shot manner. Inspired by the Chain-of-Thought which prompts LLMs to address problems step-by-step, we propose the Chain-of-Restoration (CoR), which instructs models to step-by-step remove unknown composite degradations. By integrating a simple Degradation Discriminator into pre-trained multi-task models, CoR facilitates the process where models remove one degradation basis per step, continuing this process until the image is fully restored from the unknown composite degradation. Extensive experiments show that CoR significantly improves model performance in removing composite degradations, achieving results comparable to or surpassing those of State-of-The-Art (SoTA) methods trained on all degradations. The code will be released at this https URL.

[AI-59] SmartPretrain: Model-Agnostic and Dataset-Agnostic Representation Learning for Motion Prediction

Link: https://arxiv.org/abs/2410.08669
Authors: Yang Zhou, Hao Shao, Letian Wang, Steven L. Waslander, Hongsheng Li, Yu Liu
Keywords: Predicting the future, motion prediction, autonomous vehicles, safely in dynamic, surrounding agents
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 11 pages, 5 figures

Click to view abstract

Abstract:Predicting the future motion of surrounding agents is essential for autonomous vehicles (AVs) to operate safely in dynamic, human-robot-mixed environments. However, the scarcity of large-scale driving datasets has hindered the development of robust and generalizable motion prediction models, limiting their ability to capture complex interactions and road geometries. Inspired by recent advances in natural language processing (NLP) and computer vision (CV), self-supervised learning (SSL) has gained significant attention in the motion prediction community for learning rich and transferable scene representations. Nonetheless, existing pre-training methods for motion prediction have largely focused on specific model architectures and a single dataset, limiting their scalability and generalizability. To address these challenges, we propose SmartPretrain, a general and scalable SSL framework for motion prediction that is both model-agnostic and dataset-agnostic. Our approach integrates contrastive and reconstructive SSL, leveraging the strengths of both generative and discriminative paradigms to effectively represent spatiotemporal evolution and interactions without imposing architectural constraints. Additionally, SmartPretrain employs a dataset-agnostic scenario sampling strategy that integrates multiple datasets, enhancing data volume, diversity, and robustness. Extensive experiments on multiple datasets demonstrate that SmartPretrain consistently improves the performance of state-of-the-art prediction models across datasets, data splits, and main metrics. For instance, SmartPretrain significantly reduces the MissRate of Forecast-MAE by 10.6%. These results highlight SmartPretrain’s effectiveness as a unified, scalable solution for motion prediction, breaking free from the limitations of the small-data regime. Code is available at this https URL

[AI-60] DeltaDQ: Ultra-High Delta Compression for Fine-Tuned LLMs via Group-wise Dropout and Separate Quantization

Link: https://arxiv.org/abs/2410.08666
Authors: Yanfeng Jiang, Zelan Yang, Bohua Chen, Shen Li, Yong Li, Tao Li
Keywords: Large language models, Large language, achieve exceptional performance, downstream tasks, supervised fine-tuning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Click to view abstract

Abstract:Large language models achieve exceptional performance on various downstream tasks through supervised fine-tuning. However, the diversity of downstream tasks and practical requirements makes deploying multiple full-parameter fine-tuned models challenging. Current methods that compress the delta weight struggle to achieve ultra-high compression and therefore fail to minimize the deployment overhead. To address this issue, we propose a novel distribution-driven delta compression framework, DeltaDQ, which utilizes Group-wise Dropout and Separate Quantization to achieve ultra-high compression for the delta weight. We have observed that the matrix-computed intermediate results for the delta weight exhibit extremely small variance and min-max range characteristics, referred to as Balanced Intermediate Results. Exploiting this phenomenon, we introduce Group-wise Dropout to perform dropout on the delta weight using an optimal group size. Furthermore, using Separate Quantization, sparse weights are quantized and decomposed to achieve a lower bit width. Experimental results show that DeltaDQ achieves 16x compression with improved accuracy compared to baselines for WizardMath and WizardCoder models across different parameter scales. Moreover, DeltaDQ demonstrates the ability to reach ultra-high compression ratios, achieving 128x compression for the WizardMath-7B model and 512x compression for the WizardMath-70B model.

[AI-61] DistDD: Distributed Data Distillation Aggregation through Gradient Matching

Link: https://arxiv.org/abs/2410.08665
Authors: Peiran Wang, Haohan Wang
Keywords: federated learning framework, distilling data directly, federated learning, traditional federated learning, clients’ devices
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Click to view abstract

Abstract:In this paper, we introduce DistDD, a novel approach within the federated learning framework that reduces the need for repetitive communication by distilling data directly on clients’ devices. Unlike traditional federated learning that requires iterative model updates across nodes, DistDD facilitates a one-time distillation process that extracts a global distilled dataset, maintaining the privacy standards of federated learning while significantly cutting down communication costs. By leveraging DistDD’s distilled dataset, FL developers can achieve just-in-time parameter tuning and neural architecture search (NAS) over FL without repeating the whole FL process multiple times. We provide a detailed convergence proof of the DistDD algorithm, reinforcing its mathematical stability and reliability for practical applications. Our experiments demonstrate the effectiveness and robustness of DistDD, particularly in non-i.i.d. and mislabeled data scenarios, showcasing its potential to handle complex real-world data challenges distinctively from conventional federated learning methods. We also evaluate DistDD in the NAS use case and demonstrate its effectiveness and communication savings there.

[AI-62] RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process

Link: https://arxiv.org/abs/2410.08660
Authors: Peiran Wang, Xiaogeng Liu, Chaowei Xiao
Keywords: innovative attack Retrieval-based, Retrieval-based Prompt Decomposition, Decomposition framework designed, large language models, attack Retrieval-based Prompt
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: text overlap with arXiv:2403.04783 by other authors

Click to view abstract

Abstract:In this study, we introduce RePD, an innovative Retrieval-based Prompt Decomposition framework designed to mitigate the risk of jailbreak attacks on large language models (LLMs). Despite rigorous pretraining and finetuning focused on ethical alignment, LLMs are still susceptible to jailbreak exploits. RePD operates on a one-shot learning model, wherein it accesses a database of pre-collected jailbreak prompt templates to identify and decompose harmful inquiries embedded within user prompts. This process integrates the decomposition of the jailbreak prompt and the user’s original query into a one-shot learning example that teaches the LLM to discern and separate malicious components. Consequently, the LLM is equipped to first neutralize any potentially harmful elements before addressing the user’s prompt in a manner that aligns with its ethical guidelines. RePD is versatile and compatible with a variety of open-source LLMs acting as agents. Through comprehensive experimentation with both harmful and benign prompts, we have demonstrated the efficacy of our proposed RePD in enhancing the resilience of LLMs against jailbreak attacks, without compromising their performance in responding to typical user requests.

[AI-63] Efficient line search for optimizing Area Under the ROC Curve in gradient descent

Link: https://arxiv.org/abs/2410.08635
Authors: Jadon Fowler, Toby Dylan Hocking
Keywords: Receiver Operating Characteristic, Receiver Operating, Operating Characteristic, Area Under Min, Recently the Area
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Click to view abstract

Abstract:Receiver Operating Characteristic (ROC) curves are useful for evaluation in binary classification and changepoint detection, but difficult to use for learning since the Area Under the Curve (AUC) is piecewise constant (gradient zero almost everywhere). Recently, the Area Under Min (AUM) of false positive and false negative rates has been proposed as a differentiable surrogate for AUC. In this paper, we study the piecewise linear/constant nature of the AUM/AUC, and propose new efficient path-following algorithms for choosing the learning rate that is optimal for each step of gradient descent (line search) when optimizing a linear model. Remarkably, our proposed line search algorithm has the same log-linear asymptotic time complexity as gradient descent with constant step size, but it computes a complete representation of the AUM/AUC as a function of step size. In our empirical study of binary classification problems, we verify that our proposed algorithm is fast and exact; in changepoint detection problems, we show that the proposed algorithm is just as accurate as grid search, but faster.
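The claim that AUC is piecewise constant in the step size can be checked with a brute-force sketch: move the predicted scores along a fixed direction and recompute AUC at several step sizes. This is only a grid evaluation for illustration, not the paper's path-following line search, and the function names and toy data are assumptions.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_along_step(scores, direction, labels, steps):
    """AUC of scores + step * direction for each candidate step size."""
    return [
        auc([s + t * d for s, d in zip(scores, direction)], labels)
        for t in steps
    ]


# Toy problem: two positives (label 1) and two negatives (label 0).
scores = [0.0, 1.0, 2.0, 3.0]
labels = [1, 0, 1, 0]
direction = [1.0, 0.0, 0.0, -1.0]  # hypothetical descent direction in score space
curve = auc_along_step(scores, direction, labels, [0.0, 0.5, 0.9, 1.25, 2.0])
```

On this toy data the AUC stays at 0.25 for step sizes 0.0, 0.5, and 0.9, jumps to 0.75 at 1.25, and reaches 1.0 at 2.0. It changes only where two scores cross, so its gradient with respect to the step size is zero almost everywhere, which is why a surrogate or an exact line search is needed.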

[AI-64] Words as Beacons: Guiding RL Agents with High-Level Language Prompts

Link: https://arxiv.org/abs/2410.08632
Authors: Unai Ruiz-Gonzalez, Alain Andres, Pedro G. Bascoy, Javier Del Ser
Keywords: pose significant challenges, Large Language Models, incomplete learning processes, leverages Large Language, pose significant
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Click to view abstract

Abstract:Sparse reward environments in reinforcement learning (RL) pose significant challenges for exploration, often leading to inefficient or incomplete learning processes. To tackle this issue, this work proposes a teacher-student RL framework that leverages Large Language Models (LLMs) as “teachers” to guide the agent’s learning process by decomposing complex tasks into subgoals. Due to their inherent capability to understand RL environments based on a textual description of structure and purpose, LLMs can provide subgoals to accomplish the task defined for the environment, much as a human would. In doing so, three types of subgoals are proposed: positional targets relative to the agent, object representations, and language-based instructions generated directly by the LLM. More importantly, we show that it is possible to query the LLM only during the training phase, enabling agents to operate within the environment without any LLM intervention. We assess the performance of the proposed framework by evaluating three state-of-the-art open-source LLMs (Llama, DeepSeek, Qwen) eliciting subgoals across various procedurally generated environments of the MiniGrid benchmark. Experimental results demonstrate that this curriculum-based approach accelerates learning and enhances exploration in complex tasks, achieving 30 to 200 times faster convergence in training steps compared to recent baselines designed for sparse reward environments.

[AI-65] Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation

Link: https://arxiv.org/abs/2410.08613
Authors: Zhe Dong, Yuzhe Sun, Yanfeng Gu, Tianzhu Liu
Keywords: remote sensing image, referring remote sensing, sensing image segmentation, remote sensing, sensing image
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Click to view abstract

Abstract:Given a natural language expression and a remote sensing image, the goal of referring remote sensing image segmentation (RRSIS) is to generate a pixel-level mask of the target object identified by the referring expression. In contrast to natural scenarios, expressions in RRSIS often involve complex geospatial relationships, with target objects of interest that vary significantly in scale and lack visual saliency, thereby increasing the difficulty of achieving precise segmentation. To address the aforementioned challenges, a novel RRSIS framework is proposed, termed the cross-modal bidirectional interaction model (CroBIM). Specifically, a context-aware prompt modulation (CAPM) module is designed to integrate spatial positional relationships and task-specific knowledge into the linguistic features, thereby enhancing the ability to capture the target object. Additionally, a language-guided feature aggregation (LGFA) module is introduced to integrate linguistic information into multi-scale visual features, incorporating an attention deficit compensation mechanism to enhance feature aggregation. Finally, a mutual-interaction decoder (MID) is designed to enhance cross-modal feature alignment through cascaded bidirectional cross-attention, thereby enabling precise segmentation mask prediction. To further foster research on RRSIS, we also construct RISBench, a new large-scale benchmark dataset comprising 52,472 image-language-label triplets. Extensive benchmarking on RISBench and two other prevalent datasets demonstrates the superior performance of the proposed CroBIM over existing state-of-the-art (SOTA) methods. The source code for CroBIM and the RISBench dataset will be publicly available at this https URL

[AI-66] Synth-SONAR: Sonar Image Synthesis with Enhanced Diversity and Realism via Dual Diffusion Models and GPT Prompting

Link: https://arxiv.org/abs/2410.08612
Authors: Purushothaman Natarajan, Kamal Basha, Athira Nambiar
Keywords: marine biology, Sonar, Sonar image synthesis, underwater exploration, crucial for advancing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 5 tables and 9 figures

Click to view abstract

Abstract:Sonar image synthesis is crucial for advancing applications in underwater exploration, marine biology, and defence. Traditional methods often rely on extensive and costly data collection using sonar sensors, jeopardizing data quality and diversity. To overcome these limitations, this study proposes a new sonar image synthesis framework, Synth-SONAR, leveraging diffusion models and GPT prompting. The key novelties of Synth-SONAR are threefold. First, it integrates Generative AI-based style injection techniques with publicly available real/simulated data, thereby producing one of the largest sonar data corpora for sonar research. Second, a dual text-conditioning sonar diffusion model hierarchy synthesizes coarse and fine-grained sonar images with enhanced quality and diversity. Third, high-level (coarse) and low-level (detailed) text-based sonar generation methods leverage advanced semantic information available in visual language models (VLMs) and GPT-prompting. During inference, the method generates diverse and realistic sonar images from textual prompts, bridging the gap between textual descriptions and sonar image generation. This marks the application of GPT-prompting in sonar imagery for the first time, to the best of our knowledge. Synth-SONAR achieves state-of-the-art results in producing high-quality synthetic sonar datasets, significantly enhancing their diversity and realism.

[AI-67] Conjugated Semantic Pool Improves OOD Detection with Pre-trained Vision-Language Models NEURIPS2024

Link: https://arxiv.org/abs/2410.08611
Authors: Mengyuan Chen, Junyu Gao, Changsheng Xu
Keywords: pre-trained vision-language model, potential OOD labels, OOD labels, extensive semantic pool, selecting potential OOD
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 28 pages, accepted by NeurIPS 2024

Click to view abstract

Abstract:A straightforward pipeline for zero-shot out-of-distribution (OOD) detection involves selecting potential OOD labels from an extensive semantic pool and then leveraging a pre-trained vision-language model to perform classification on both in-distribution (ID) and OOD labels. In this paper, we theorize that enhancing performance requires expanding the semantic pool, while increasing the expected probability of selected OOD labels being activated by OOD samples, and ensuring low mutual dependence among the activations of these OOD labels. A natural expansion manner is to adopt a larger lexicon; however, the inevitable introduction of numerous synonyms and uncommon words fails to meet the above requirements, indicating that viable expansion manners move beyond merely selecting words from a lexicon. Since OOD detection aims to correctly classify input images into ID/OOD class groups, we can “make up” OOD label candidates which are not standard class names but beneficial for the process. Observing that the original semantic pool is comprised of unmodified specific class names, we correspondingly construct a conjugated semantic pool (CSP) consisting of modified superclass names, each serving as a cluster center for samples sharing similar properties across different categories. Consistent with our established theory, expanding OOD label candidates with the CSP satisfies the requirements and outperforms existing works by 7.89% in FPR95. Codes are available in this https URL.

[AI-68] Text-To-Image with Generative Adversarial Networks

Link: https://arxiv.org/abs/2410.08608
Authors: Mehrshad Momen-Tayefeh
Keywords: Generating realistic images, Generating realistic, Generative Adversarial Networks, computer vision, field of computer
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Click to view abstract

Abstract:Generating realistic images from human-written text is one of the most challenging problems in the field of computer vision (CV). Existing text-to-image approaches can roughly reflect the meaning of the given descriptions. In this paper, our main purpose is to present a brief comparison of five different methods based on Generative Adversarial Networks (GANs) for generating images from text. Each model architecture synthesizes images at a different resolution; the best and worst obtained resolutions are 64×64 and 256×256, respectively. We also checked and compared some metrics that indicate the accuracy of each model. By comparing these approaches on their essential metrics, we identified the best model for this problem.

[AI-69] What killed the cat? Towards a logical formalization of curiosity (and suspense and surprise) in narratives

Link: https://arxiv.org/abs/2410.08597
Authors: Florence Dupin de Saint-Cyr (IRIT-ADRIA), Anne-Gwenn Bosser (Lab-STICC_COMMEDIA, ENIB, Lab-STICC), Benjamin Callac (Lab-STICC_COMMEDIA), Eric Maisel (Lab-STICC_COMMEDIA)
Keywords: narrative tension, provide a unified, heart of narrative, unified framework, curiosity
Categories: Artificial Intelligence (cs.AI)

Click to view abstract

Abstract:We provide a unified framework in which the three emotions at the heart of narrative tension (curiosity, suspense and surprise) are formalized. This framework is built on nonmonotonic reasoning which allows us to compactly represent the default behavior of the world and to simulate the affective evolution of an agent receiving a story. After formalizing the notions of awareness, curiosity, surprise and suspense, we explore the properties induced by our definitions and study the computational complexity of detecting them. We finally propose means to evaluate these emotions’ intensity for a given agent listening to a story.

[AI-70] VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understanding NEURIPS2024

Link: https://arxiv.org/abs/2410.08593
Authors: Houlun Chen, Xin Wang, Hong Chen, Zeyang Zhang, Wei Feng, Bin Huang, Jia Jia, Wenwu Zhu
Keywords: Corpus Moment Retrieval, Existing Video Corpus, Moment Retrieval, underline, hinders precise video
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by 38th NeurIPS Datasets and Benchmarks Track (NeurIPS 2024)

Click to view abstract

Abstract:Existing Video Corpus Moment Retrieval (VCMR) is limited to coarse-grained understanding, which hinders precise video moment localization when given fine-grained queries. In this paper, we propose a more challenging fine-grained VCMR benchmark requiring methods to localize the best-matched moment from the corpus with other partially matched candidates. To improve the dataset construction efficiency and guarantee high-quality data annotations, we propose VERIFIED, an automatic VidEo-text annotation pipeline to generate captions with RelIable FInE-grained statics and Dynamics. Specifically, we resort to large language models (LLM) and large multimodal models (LMM) with our proposed Statics and Dynamics Enhanced Captioning modules to generate diverse fine-grained captions for each video. To filter out the inaccurate annotations caused by the LLM hallucination, we propose a Fine-Granularity Aware Noise Evaluator where we fine-tune a video foundation model with disturbed hard-negatives augmented contrastive and matching losses. With VERIFIED, we construct a more challenging fine-grained VCMR benchmark containing Charades-FIG, DiDeMo-FIG, and ActivityNet-FIG which demonstrate a high level of annotation quality. We evaluate several state-of-the-art VCMR models on the proposed dataset, revealing that there is still significant scope for fine-grained video understanding in VCMR. Code and Datasets are in this https URL.

[AI-71] VIBES – Vision Backbone Efficient Selection WACV2025

Link: https://arxiv.org/abs/2410.08592
Authors: Joris Guerin,Shray Bansal,Amirreza Shaban,Paulo Mann,Harshvardhan Gazula
Keywords: specific target tasks, efficiently selecting high-performance, selecting high-performance pre-trained, high-performance pre-trained vision, target tasks
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 4 figures, under review at WACV 2025

Abstract:This work tackles the challenge of efficiently selecting high-performance pre-trained vision backbones for specific target tasks. Although exhaustive search within a finite set of backbones can solve this problem, it becomes impractical for large datasets and backbone pools. To address this, we introduce Vision Backbone Efficient Selection (VIBES), which aims to quickly find well-suited backbones, potentially trading off optimality for efficiency. We propose several simple yet effective heuristics to address VIBES and evaluate them across four diverse computer vision datasets. Our results show that these approaches can identify backbones that outperform those selected from generic benchmarks, even within a limited search budget of one hour on a single GPU. We reckon VIBES marks a paradigm shift from benchmarks to task-specific optimization.

[AI-72] ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression

Link: https://arxiv.org/abs/2410.08584
Authors: Yefei He,Feng Chen,Jing Liu,Wenqi Shao,Hong Zhou,Kaipeng Zhang,Bohan Zhuang
Keywords: scenarios involving high-resolution, involving high-resolution images, large vision-language models, fetching the key-value, images or videos
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages

Abstract: The efficiency of large vision-language models (LVLMs) is constrained by the computational bottleneck of the attention mechanism during the prefill phase and the memory bottleneck of fetching the key-value (KV) cache in the decoding phase, particularly in scenarios involving high-resolution images or videos. Visual content often exhibits substantial redundancy, resulting in highly sparse attention maps within LVLMs. This sparsity can be leveraged to accelerate attention computation or compress the KV cache through various approaches. However, most studies focus on addressing only one of these bottlenecks and do not adequately support dynamic adjustment of sparsity concerning distinct layers or tasks. In this paper, we present ZipVL, an efficient inference framework designed for LVLMs that resolves both computation and memory bottlenecks through a dynamic ratio allocation strategy of important tokens. This ratio is adaptively determined based on the layer-specific distribution of attention scores, rather than fixed hyper-parameters, thereby improving efficiency for less complex tasks while maintaining high performance for more challenging ones. Then we select important tokens based on their normalized attention scores and perform attention mechanism solely on those important tokens to accelerate the prefill phase. To mitigate the memory bottleneck in the decoding phase, we employ mixed-precision quantization to the KV cache, where high-bit quantization is used for caches of important tokens, while low-bit quantization is applied to those of less importance. Our experiments demonstrate that ZipVL can accelerate the prefill phase by 2.6× and reduce GPU memory usage by 50.0%, with a minimal accuracy reduction of only 0.2% on Video-MME benchmark over LongVA-7B model, effectively enhancing the generation efficiency of LVLMs.
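
The adaptive token ratio described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes the ratio is set by keeping the smallest set of tokens whose normalized attention mass reaches a threshold, and the names `select_important_tokens`, `kv_bit_widths`, `tau`, `high_bits` and `low_bits` are hypothetical.

```python
def select_important_tokens(attn_scores, tau=0.85):
    """Keep the fewest tokens whose normalized attention mass reaches tau.

    Assumption: a cumulative-mass threshold stands in for the paper's
    layer-adaptive ratio; the abstract does not spell out the exact rule.
    """
    total = sum(attn_scores)
    if total <= 0:
        raise ValueError("attention scores must have positive total mass")
    normalized = [s / total for s in attn_scores]
    # Visit tokens from highest to lowest normalized score.
    order = sorted(range(len(normalized)), key=lambda i: normalized[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += normalized[i]
        if mass >= tau:
            break
    return sorted(kept)


def kv_bit_widths(num_tokens, important, high_bits=4, low_bits=2):
    """Assign a (hypothetical) quantization bit width to each token's KV cache."""
    important = set(important)
    return [high_bits if i in important else low_bits for i in range(num_tokens)]


# A layer with peaked attention keeps few tokens ...
peaked = select_important_tokens([70, 20, 5, 3, 2])     # [0, 1]
# ... while a layer with flat attention keeps all of them.
flat = select_important_tokens([20, 20, 20, 20, 20])    # [0, 1, 2, 3, 4]
print(peaked, flat, kv_bit_widths(5, peaked))
```

With the same threshold, the number of retained tokens differs per layer, which is the behavior a fixed ratio hyper-parameter cannot provide.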

[AI-73] Intent-Enhanced Data Augmentation for Sequential Recommendation

Link: https://arxiv.org/abs/2410.08583
Authors: Shuai Chen,Zhoujun Li
Keywords: sequential recommendation algorithms, mine dynamic user, sequential recommendation, dynamic user intent, recommendation algorithms focuses
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 14 pages, 3 figures

Abstract: The research on intent-enhanced sequential recommendation algorithms focuses on how to better mine dynamic user intent based on user behavior data for sequential recommendation tasks. Various data augmentation methods are widely applied in current sequential recommendation algorithms, effectively enhancing the ability to capture user intent. However, these widely used data augmentation methods often rely on a large amount of random sampling, which can introduce excessive noise into the training data, blur user intent, and thus negatively affect recommendation performance. Additionally, these methods have limited approaches to utilizing augmented data, failing to fully leverage the augmented samples. We propose an intent-enhanced data augmentation method for sequential recommendation (IESRec), which constructs positive and negative samples based on user behavior sequences through intent-segment insertion. On one hand, the generated positive samples are mixed with the original training data, and they are trained together to improve recommendation performance. On the other hand, the generated positive and negative samples are used to build a contrastive loss function, enhancing recommendation performance through self-supervised training. Finally, the main recommendation task is jointly trained with the contrastive learning loss minimization task. Experiments on three real-world datasets validate the effectiveness of our IESRec model.

[AI-74] Integrating AI for Enhanced Feedback in Translation Revision- A Mixed-Methods Investigation of Student Engagement

Link: https://arxiv.org/abs/2410.08581
Authors: Simin Xu,Yanfang Su,Kanglong Liu
Keywords: Artificial Intelligence, revision process, remains understudied, translation education, application of Artificial
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract: Despite the well-established importance of feedback in education, the application of Artificial Intelligence (AI)-generated feedback, particularly from language models like ChatGPT, remains understudied in translation education. This study investigates the engagement of master’s students in translation with ChatGPT-generated feedback during their revision process. A mixed-methods approach, combining a translation-and-revision experiment with quantitative and qualitative analyses, was employed to examine the feedback, translations pre- and post-revision, the revision process, and student reflections. The results reveal complex interrelations among cognitive, affective, and behavioural dimensions influencing students’ engagement with AI feedback and their subsequent revisions. Specifically, the findings indicate that students invested considerable cognitive effort in the revision process, despite finding the feedback comprehensible. Additionally, they exhibited moderate affective satisfaction with the feedback model. Behaviourally, their actions were largely influenced by cognitive and affective factors, although some inconsistencies were observed. This research provides novel insights into the potential applications of AI-generated feedback in translation teaching and opens avenues for further investigation into the integration of AI tools in language teaching settings.

[AI-75] A Theoretical Framework for AI-driven data quality monitoring in high-volume data environments

Link: https://arxiv.org/abs/2410.08576
Authors: Nikhil Bangad,Vivekananda Jayaram,Manjunatha Sughaturu Krishnappa,Amey Ram Banarse,Darshan Mohan Bidkar,Akshay Nagpal,Vidyasagar Parlapalli
Keywords: monitoring system designed, quality monitoring system, paper presents, challenges of maintaining, data quality monitoring
Subjects: Artificial Intelligence (cs.AI)

Abstract: This paper presents a theoretical framework for an AI-driven data quality monitoring system designed to address the challenges of maintaining data quality in high-volume environments. We examine the limitations of traditional methods in managing the scale, velocity, and variety of big data and propose a conceptual approach leveraging advanced machine learning techniques. Our framework outlines a system architecture that incorporates anomaly detection, classification, and predictive analytics for real-time, scalable data quality management. Key components include an intelligent data ingestion layer, adaptive preprocessing mechanisms, context-aware feature extraction, and AI-based quality assessment modules. A continuous learning paradigm is central to our framework, ensuring adaptability to evolving data patterns and quality requirements. We also address implications for scalability, privacy, and integration within existing data ecosystems. While practical results are not provided, the paper lays a robust theoretical foundation for future research and implementations, advancing data quality management and encouraging the exploration of AI-driven solutions in dynamic environments.

[AI-76] Baichuan-Omni Technical Report

Link: https://arxiv.org/abs/2410.08565
Authors: Yadong Li,Haoze Sun,Mingan Lin,Tianpeng Li,Guosheng Dong,Tao Zhang,Bowen Ding,Wei Song,Zhenglin Cheng,Yuqi Huo,Song Chen,Xu Li,Da Pan,Shusen Zhang,Xin Wu,Zheng Liang,Jun Liu,Tao Zhang,Keer Lu,Yaqi Zhao,Yanjun Shen,Fan Yang,Kaicheng Yu,Tao Lin,Jianhua Xu,Zenan Zhou,Weipeng Chen
Keywords: high-performing open-source counterpart, salient multimodal capabilities, Large Language Model, multimodal interactive experience, Multimodal Large Language
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract: The salient multimodal capabilities and interactive experience of GPT-4o highlight its critical role in practical applications, yet it lacks a high-performing open-source counterpart. In this paper, we introduce Baichuan-Omni, the first open-source 7B Multimodal Large Language Model (MLLM) adept at concurrently processing and analyzing modalities of image, video, audio, and text, while delivering an advanced multimodal interactive experience and strong performance. We propose an effective multimodal training schema starting with a 7B model and proceeding through two stages of multimodal alignment and multitask fine-tuning across the audio, image, video, and text modalities. This approach equips the language model with the ability to handle visual and audio data effectively. Demonstrating strong performance across various omni-modal and multimodal benchmarks, we aim for this contribution to serve as a competitive baseline for the open-source community in advancing multimodal understanding and real-time interaction.

[AI-77] Learning General Representation of 12-Lead Electrocardiogram with a Joint-Embedding Predictive architecture

Link: https://arxiv.org/abs/2410.08559
Authors: Sehun Kim
Keywords: Embedding Predictive Architecture, Joint Embedding Predictive, ECG Joint Embedding, named ECG Joint, Predictive Architecture
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract: We propose a self-supervised learning method for 12-lead Electrocardiogram (ECG) analysis, named ECG Joint Embedding Predictive Architecture (ECG-JEPA). ECG-JEPA employs a masking strategy to learn semantic representations of ECG data. Unlike existing methods, ECG-JEPA predicts at the hidden representation level rather than reconstructing raw data. This approach offers several advantages in the ECG domain: (1) it avoids producing unnecessary details, such as noise, which is common in standard ECG; and (2) it addresses the limitations of naïve L2 loss between raw signals. Another key contribution is the introduction of a special masked attention tailored for 12-lead ECG data, Cross-Pattern Attention (CroPA). CroPA enables the model to effectively capture inter-patch relationships. Additionally, ECG-JEPA is highly scalable, allowing efficient training on large datasets. Our code is openly available at this https URL.

[AI-78] Balancing Innovation and Privacy: Data Security Strategies in Natural Language Processing Applications

Link: https://arxiv.org/abs/2410.08553
Authors: Shaobo Liu,Guiran Liu,Binrong Zhu,Yuanshuai Luo,Linxiao Wu,Rui Wang
Keywords: Natural Language Processing, Natural Language, Language Processing, privacy, privacy protection
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:This research addresses privacy protection in Natural Language Processing (NLP) by introducing a novel algorithm based on differential privacy, aimed at safeguarding user data in common applications such as chatbots, sentiment analysis, and machine translation. With the widespread application of NLP technology, the security and privacy protection of user data have become important issues that need to be solved urgently. This paper proposes a new privacy protection algorithm designed to effectively prevent the leakage of user sensitive information. By introducing a differential privacy mechanism, our model ensures the accuracy and reliability of data analysis results while adding random noise. This method not only reduces the risk caused by data leakage but also achieves effective processing of data while protecting user privacy. Compared to traditional privacy methods like data anonymization and homomorphic encryption, our approach offers significant advantages in terms of computational efficiency and scalability while maintaining high accuracy in data analysis. The proposed algorithm’s efficacy is demonstrated through performance metrics such as accuracy (0.89), precision (0.85), and recall (0.88), outperforming other methods in balancing privacy and utility. As privacy protection regulations become increasingly stringent, enterprises and developers must take effective measures to deal with privacy risks. Our research provides an important reference for the application of privacy protection technology in the field of NLP, emphasizing the need to achieve a balance between technological innovation and user privacy. In the future, with the continuous advancement of technology, privacy protection will become a core element of data-driven applications and promote the healthy development of the entire industry.
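
The abstract does not specify the noise mechanism behind its differential privacy algorithm. As a point of reference only, the sketch below shows the standard Laplace mechanism applied to a counting query (for example, how many user messages mention a keyword). It is a textbook technique swapped in for illustration, not the paper's algorithm, and `private_count`, `epsilon` and `sensitivity` are names chosen here.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with epsilon-differential privacy via the Laplace mechanism.

    A counting query changes by at most 1 when one user's record is added or
    removed, so its sensitivity is 1 and the noise scale is sensitivity / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)  # seeded only to make the illustration repeatable
noisy = [private_count(100, epsilon=0.5, rng=rng) for _ in range(10000)]
mean_noisy = sum(noisy) / len(noisy)
print(round(noisy[0], 2), round(mean_noisy, 2))
```

Each individual release is perturbed, while the average over many releases stays close to the true count, which is the privacy and utility trade-off the abstract refers to. A smaller `epsilon` gives stronger privacy and larger noise.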

[AI-79] Context-Aware Full Body Anonymization using Text-to-Image Diffusion Models

Link: https://arxiv.org/abs/2410.08551
Authors: Pascl Zwick,Kevin Roesch,Marvin Klemp,Oliver Bringmann
Keywords: real world datasets, plays a key, key role, role in protecting, protecting sensible information
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract: Anonymization plays a key role in protecting sensitive information of individuals in real-world datasets. Self-driving cars, for example, need high-resolution facial features to track people and their viewing direction to predict future behaviour and react accordingly. In order to protect people’s privacy whilst keeping important features in the dataset, it is important to replace the full body of a person with a highly detailed anonymized one. In contrast to face anonymization, full body replacement decreases the ability to recognize people by their hairstyle or clothes. In this paper, we propose a workflow for full body person anonymization utilizing Stable Diffusion as a generative backend. Text-to-image diffusion models, like Stable Diffusion, OpenAI’s DALL-E or Midjourney, have become very popular in recent times, being able to create photorealistic images from a single text prompt. We show that our method outperforms state-of-the-art anonymization pipelines with respect to image quality, resolution, Inception Score (IS) and Frechet Inception Distance (FID). Additionally, our method is invariant with respect to the image generator and thus able to be used with the latest models available.

[AI-80] Humanity in AI: Detecting the Personality of Large Language Models

Link: https://arxiv.org/abs/2410.08545
Authors: Baohua Zhan,Yongyi Huang,Wenyao Cui,Huaping Zhang,Jianyun Shang
Keywords: Large Language Models, Large Language, Language Models, personality, Large
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract: Questionnaires are a common method for detecting the personality of Large Language Models (LLMs). However, their reliability is often compromised by two main issues: hallucinations (where LLMs produce inaccurate or irrelevant responses) and the sensitivity of responses to the order of the presented options. To address these issues, we propose combining text mining with the questionnaire method. Text mining can extract psychological features from the LLMs’ responses without being affected by the order of options. Furthermore, because this method does not rely on specific answers, it reduces the influence of hallucinations. By normalizing the scores from both methods and calculating the root mean square error, our experimental results confirm the effectiveness of this approach. To further investigate the origins of personality traits in LLMs, we conduct experiments on both pre-trained language models (PLMs), such as BERT and GPT, as well as conversational models (ChatLLMs), such as ChatGPT. The results show that LLMs do contain certain personalities, for example, ChatGPT and ChatGLM exhibit the personality traits of ‘Conscientiousness’. Additionally, we find that the personalities of LLMs are derived from their pre-trained data. The instruction data used to train ChatLLMs can enhance the generation of data containing personalities and expose their hidden personality. We compare the results with the human average personality score, and we find that the personality of FLAN-T5 in PLMs and ChatGPT in ChatLLMs is more similar to that of a human, with score differences of 0.34 and 0.22, respectively.
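
The comparison step mentioned in the abstract (normalize the scores from both methods, then compute the root mean square error) can be sketched as follows. The score ranges and the five trait values are made up for illustration; the abstract gives neither the scales nor the normalization rule, so min-max scaling onto [0, 1] is an assumption.

```python
import math


def min_max_normalize(scores, low, high):
    """Map scores from a known scale [low, high] onto [0, 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return [(s - low) / (high - low) for s in scores]


def rmse(a, b):
    """Root mean square error between two equally long score vectors."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


# Hypothetical trait scores for one model: questionnaire answers on a 1-5
# scale, text-mining scores already expressed on a 0-1 scale.
questionnaire = [4.0, 3.0, 5.0, 2.0, 3.0]
text_mining = [0.8, 0.5, 0.9, 0.3, 0.5]

q_norm = min_max_normalize(questionnaire, low=1.0, high=5.0)
error = rmse(q_norm, text_mining)
print(q_norm, round(error, 4))
```

A small RMSE means the two measurement methods agree on the model's trait profile once both are on the same scale.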

[AI-81] Kaleidoscope: Learnable Masks for Heterogeneous Multi-agent Reinforcement Learning NEURIPS2024

Link: https://arxiv.org/abs/2410.08540
Authors: Xinran Li,Ling Pan,Jun Zhang
Keywords: multi-agent reinforcement learning, parameter sharing, reinforcement learning, enhance sample efficiency, commonly employed
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Accepted by the Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024)

Abstract: In multi-agent reinforcement learning (MARL), parameter sharing is commonly employed to enhance sample efficiency. However, the popular approach of full parameter sharing often leads to homogeneous policies among agents, potentially limiting the performance benefits that could be derived from policy diversity. To address this critical limitation, we introduce Kaleidoscope, a novel adaptive partial parameter sharing scheme that fosters policy heterogeneity while still maintaining high sample efficiency. Specifically, Kaleidoscope maintains one set of common parameters alongside multiple sets of distinct, learnable masks for different agents, dictating the sharing of parameters. It promotes diversity among policy networks by encouraging discrepancy among these masks, without sacrificing the efficiencies of parameter sharing. This design allows Kaleidoscope to dynamically balance high sample efficiency with a broad policy representational capacity, effectively bridging the gap between full parameter sharing and non-parameter sharing across various environments. We further extend Kaleidoscope to critic ensembles in the context of actor-critic algorithms, which could help improve value estimations. Our empirical evaluations across extensive environments, including multi-agent particle environment, multi-agent MuJoCo and StarCraft multi-agent challenge v2, demonstrate the superior performance of Kaleidoscope compared with existing parameter sharing approaches, showcasing its potential for performance enhancement in MARL. The code is publicly available at this https URL.
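
The idea of one shared parameter set combined with per-agent masks can be pictured with a toy sketch. It uses fixed binary masks and a mean pairwise L1 distance as the discrepancy measure; both are simplifying assumptions, since the abstract only says the masks are learnable and that discrepancy among them is encouraged.

```python
def masked_weights(shared, mask):
    """Agent-specific weights: shared parameters gated element-wise by the agent's mask."""
    return [w * m for w, m in zip(shared, mask)]


def mask_discrepancy(masks):
    """Mean pairwise L1 distance between agent masks (larger means more diverse)."""
    pairs = [(i, j) for i in range(len(masks)) for j in range(i + 1, len(masks))]
    if not pairs:
        return 0.0
    total = sum(
        sum(abs(a - b) for a, b in zip(masks[i], masks[j])) for i, j in pairs
    )
    return total / len(pairs)


shared = [0.5, 1.2, 0.3, 0.8]            # one common parameter vector
masks = [[1, 1, 0, 1], [1, 0, 1, 1]]     # hypothetical masks for two agents

agent_weights = [masked_weights(shared, m) for m in masks]
diversity = mask_discrepancy(masks)
print(agent_weights, diversity)
```

Positions where both masks are 1 remain fully shared, so gradients from both agents update the same parameter there, while positions where the masks differ give each agent its own effective network. In training, a term like `diversity` would be added to the objective with a sign that rewards larger values.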

[AI-82] VOVTrack: Exploring the Potentiality in Videos for Open-Vocabulary Object Tracking

Link: https://arxiv.org/abs/2410.08529
Authors: Zekun Qian,Ruize Han,Junhui Hou,Linqi Song,Wei Feng
Keywords: base classes, diverse object categories, unseen categories, represents a critical, categories
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract: Open-vocabulary multi-object tracking (OVMOT) represents a critical new challenge involving the detection and tracking of diverse object categories in videos, encompassing both seen categories (base classes) and unseen categories (novel classes). This issue amalgamates the complexities of open-vocabulary object detection (OVD) and multi-object tracking (MOT). Existing approaches to OVMOT often merge OVD and MOT methodologies as separate modules, predominantly focusing on the problem through an image-centric lens. In this paper, we propose VOVTrack, a novel method that integrates object states relevant to MOT and video-centric training to address this challenge from a video object tracking standpoint. First, we consider the tracking-related state of the objects during tracking and propose a new prompt-guided attention mechanism for more accurate localization and classification (detection) of the time-varying objects. Subsequently, we leverage raw video data without annotations for training by formulating a self-supervised object similarity learning technique to facilitate temporal object association (tracking). Experimental results underscore that VOVTrack outperforms existing methods, establishing itself as a state-of-the-art solution for the open-vocabulary tracking task.

[AI-83] Scaling Laws for Predicting Downstream Performance in LLMs

Link: https://arxiv.org/abs/2410.08527
Authors: Yangyi Chen,Binxuan Huang,Yifan Gao,Zhengyang Wang,Jingfeng Yang,Heng Ji
Keywords: large language models, Precise estimation, performance, pre-training loss, downstream performance
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Precise estimation of downstream performance in large language models (LLMs) prior to training is essential for guiding their development process. Scaling laws analysis utilizes the statistics of a series of significantly smaller sampling language models (LMs) to predict the performance of the target LLM. For downstream performance prediction, the critical challenge lies in the emergent abilities in LLMs that occur beyond task-specific computational thresholds. In this work, we focus on the pre-training loss as a more computation-efficient metric for performance estimation. Our two-stage approach consists of first estimating a function that maps computational resources (e.g., FLOPs) to the pre-training Loss using a series of sampling models, followed by mapping the pre-training loss to downstream task Performance after the critical “emergent phase”. In preliminary experiments, this FLP solution accurately predicts the performance of LLMs with 7B and 13B parameters using a series of sampling LMs up to 3B, achieving error margins of 5% and 10%, respectively, and significantly outperforming the FLOPs-to-Performance approach. This motivates FLP-M, a fundamental approach for performance prediction that addresses the practical need to integrate datasets from multiple sources during pre-training, specifically blending general corpora with code data to accurately represent the common necessity. FLP-M extends the power law analytical function to predict domain-specific pre-training loss based on FLOPs across data sources, and employs a two-layer neural network to model the non-linear relationship between multiple domain-specific loss and downstream performance. By utilizing a 3B LLM trained on a specific ratio and a series of smaller sampling LMs, FLP-M can effectively forecast the performance of 3B and 7B LLMs across various data mixtures for most benchmarks within 10% error margins.
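
The two-stage FLOPs to Loss to Performance idea can be sketched on synthetic data. The functional forms are assumptions made for illustration: a pure power law `loss = a * C**(-b)` for the first stage and a straight line from loss to performance for the second. The abstract only says that a function is estimated at each stage, and all numbers below are invented.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def fit_flops_to_loss(flops, losses):
    """Stage 1: fit loss = a * C**(-b) by linear regression in log-log space."""
    slope, intercept = fit_line(
        [math.log(c) for c in flops], [math.log(v) for v in losses]
    )
    return math.exp(intercept), -slope


def predict_performance(target_flops, a, b, w1, w0):
    """Chain both stages: FLOPs -> predicted loss -> predicted performance."""
    loss = a * target_flops ** (-b)
    return w1 * loss + w0


# Synthetic "sampling models": compute budgets, losses, benchmark scores.
flops = [1e18, 1e19, 1e20, 1e21]
losses = [20.0 * c ** (-0.05) for c in flops]
scores = [1.5 - 0.4 * v for v in losses]

a, b = fit_flops_to_loss(flops, losses)
w1, w0 = fit_line(losses, scores)        # Stage 2: loss -> performance
predicted = predict_performance(1e22, a, b, w1, w0)
print(round(a, 3), round(b, 3), round(predicted, 4))
```

In practice the second stage would be fitted only on models past the emergent phase mentioned in the abstract, and real measurements would not lie exactly on the fitted curves as this synthetic data does.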

[AI-84] “I Am the One and Only Your Cyber BFF”: Understanding the Impact of GenAI Requires Understanding the Impact of Anthropomorphic AI

Link: https://arxiv.org/abs/2410.08526
Authors: Myra Cheng,Alicia DeVrio,Lisa Egede,Su Lin Blodgett,Alexandra Olteanu
Keywords: generating outputs, anthropomorphic behaviors, increasingly prone, scholars increasingly raising, increasingly raising concerns
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Many state-of-the-art generative AI (GenAI) systems are increasingly prone to anthropomorphic behaviors, i.e., to generating outputs that are perceived to be human-like. While this has led to scholars increasingly raising concerns about possible negative impacts such anthropomorphic AI systems can give rise to, anthropomorphism in AI development, deployment, and use remains vastly overlooked, understudied, and underspecified. In this perspective, we argue that we cannot thoroughly map the social impacts of generative AI without mapping the social impacts of anthropomorphic AI, and outline a call to action.

[AI-85] Aerial Vision-and-Language Navigation via Semantic-Topo-Metric Representation Guided LLM Reasoning ICRA2025

Link: https://arxiv.org/abs/2410.08500
Authors: Yunpeng Gao,Zhigang Wang,Linglin Jing,Dong Wang,Xuelong Li,Bin Zhao
Keywords: Unmanned Aerial Vehicles, enabling Unmanned Aerial, task enabling Unmanned, enabling Unmanned, Aerial Vehicles
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Submitted to ICRA 2025

Abstract:Aerial Vision-and-Language Navigation (VLN) is a novel task enabling Unmanned Aerial Vehicles (UAVs) to navigate in outdoor environments through natural language instructions and visual cues. It remains challenging due to the complex spatial relationships in outdoor aerial scenes. In this paper, we propose an end-to-end zero-shot framework for aerial VLN tasks, where the large language model (LLM) is introduced as our agent for action prediction. Specifically, we develop a novel Semantic-Topo-Metric Representation (STMR) to enhance the spatial reasoning ability of LLMs. This is achieved by extracting and projecting instruction-related semantic masks of landmarks into a top-down map that contains the location information of surrounding landmarks. Further, this map is transformed into a matrix representation with distance metrics as the text prompt to the LLM, for action prediction according to the instruction. Experiments conducted in real and simulation environments have successfully proved the effectiveness and robustness of our method, achieving 15.9% and 12.5% improvements (absolute) in Oracle Success Rate (OSR) on AerialVLN-S dataset.
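
To make the map-to-prompt step concrete, the sketch below places landmarks on a small top-down grid centered on the UAV and serializes the grid as text. The grid size, cell width, landmark format and prompt wording are all assumptions for illustration; the abstract does not describe the actual STMR layout.

```python
def landmarks_to_matrix(landmarks, grid_size=5, cell_m=10.0):
    """Place (label, east_m, north_m) landmarks on a top-down grid centered on the UAV."""
    half = grid_size // 2
    grid = [["." for _ in range(grid_size)] for _ in range(grid_size)]
    grid[half][half] = "UAV"
    for label, east, north in landmarks:
        col = half + round(east / cell_m)    # east increases to the right
        row = half - round(north / cell_m)   # north increases upward
        inside = 0 <= row < grid_size and 0 <= col < grid_size
        if inside and (row, col) != (half, half):
            grid[row][col] = label
    return grid


def matrix_to_prompt(grid, cell_m=10.0):
    """Serialize the grid as plain text suitable for inclusion in an LLM prompt."""
    header = f"Top-down map, north is up, each cell is {cell_m:g} m wide:"
    rows = [" ".join(row) for row in grid]
    return "\n".join([header] + rows)


# Hypothetical landmarks, given as offsets in meters from the UAV.
landmarks = [("tower", 20.0, 10.0), ("lake", -10.0, -20.0)]
grid = landmarks_to_matrix(landmarks)
prompt = matrix_to_prompt(grid)
print(prompt)
```

Because each cell has a fixed width, distances and directions between the UAV and the landmarks can be read off the matrix, which is the kind of metric information the abstract says is passed to the LLM.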

[AI-86] A Systematic Review of Edge Case Detection in Automated Driving: Methods, Challenges and Future Directions

Link: https://arxiv.org/abs/2410.08491
Authors: Saeed Rahmani,Sabine Rieder,Erwin de Gelder,Marcel Sonntag,Jorge Lorente Mallada,Sytze Kalisvaart,Vahid Hashemi,Simeon C. Calvert
Keywords: edge case detection, edge cases, edge case, case detection, case detection methods
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: Preprint submitted to IEEE Transactions on Intelligent Transportation Systems

Abstract:The rapid development of automated vehicles (AVs) promises to revolutionize transportation by enhancing safety and efficiency. However, ensuring their reliability in diverse real-world conditions remains a significant challenge, particularly due to rare and unexpected situations known as edge cases. Although numerous approaches exist for detecting edge cases, there is a notable lack of a comprehensive survey that systematically reviews these techniques. This paper fills this gap by presenting a practical, hierarchical review and systematic classification of edge case detection and assessment methodologies. Our classification is structured on two levels: first, categorizing detection approaches according to AV modules, including perception-related and trajectory-related edge cases; and second, based on underlying methodologies and theories guiding these techniques. We extend this taxonomy by introducing a new class called “knowledge-driven” approaches, which is largely overlooked in the literature. Additionally, we review the techniques and metrics for the evaluation of edge case detection methods and identified edge cases. To our knowledge, this is the first survey to comprehensively cover edge case detection methods across all AV subsystems, discuss knowledge-driven edge cases, and explore evaluation techniques for detection methods. This structured and multi-faceted analysis aims to facilitate targeted research and modular testing of AVs. Moreover, by identifying the strengths and weaknesses of various approaches and discussing the challenges and future directions, this survey intends to assist AV developers, researchers, and policymakers in enhancing the safety and reliability of automated driving (AD) systems through effective edge case detection.

[AI-87] Personalized Item Embeddings in Federated Multimodal Recommendation

Link: https://arxiv.org/abs/2410.08478
Authors: Zhiwei Li,Guodong Long,Jing Jiang,Chengqi Zhang
Keywords: Federated recommendation systems, protecting user privacy, recommendation systems play, play a crucial, crucial role
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 4 figures, 5 tables, conference

Abstract:Federated recommendation systems play a crucial role in protecting user privacy. However, existing methods primarily rely on ID-based item embeddings, overlooking the rich multimodal information of items. To address this limitation, we propose a novel Federated Multimodal Recommendation System called FedMR. FedMR leverages a foundation model on the server side to encode multimodal data, such as images and text, associated with items. To tackle the challenge of data heterogeneity caused by varying user preferences, FedMR introduces a Mixing Feature Fusion Module on the client. This module dynamically adjusts the weights of different fusion strategies based on user interaction history, generating personalized item embeddings that capture fine-grained user preferences. FedMR is compatible with existing ID-based federated recommendation systems, improving their performances without modifying the original framework. Our experiments on four real-world multimodal recommendation datasets demonstrate the effectiveness of FedMR. Our code is available at this https URL.
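
The client-side module that weights several fusion strategies can be sketched as a softmax-weighted mixture. The three strategies used here (sum, element-wise product, element-wise max) and the per-user logits are hypothetical; the abstract does not list the strategies or say how the weights are derived from interaction history.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(id_emb, mm_emb, strategy_logits):
    """Mix several fusions of an ID embedding and a multimodal embedding.

    strategy_logits would be learned per user on the client; here they are
    supplied directly.
    """
    strategies = [
        [a + b for a, b in zip(id_emb, mm_emb)],       # sum
        [a * b for a, b in zip(id_emb, mm_emb)],       # element-wise product
        [max(a, b) for a, b in zip(id_emb, mm_emb)],   # element-wise max
    ]
    weights = softmax(strategy_logits)
    return [
        sum(w * s[k] for w, s in zip(weights, strategies))
        for k in range(len(id_emb))
    ]


id_emb = [1.0, 2.0]     # toy ID-based item embedding
mm_emb = [3.0, 0.5]     # toy multimodal embedding from the server-side encoder

uniform = fuse(id_emb, mm_emb, [0.0, 0.0, 0.0])     # equal weight per strategy
sum_heavy = fuse(id_emb, mm_emb, [50.0, 0.0, 0.0])  # this user favors the sum
print(uniform, sum_heavy)
```

Two users with different logits obtain different embeddings for the same item, which is what makes the item representation personalized while the server-side encoder stays shared.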

[AI-88] GIVE: Structured Reasoning with Knowledge Graph Inspired Veracity Extrapolation

Link: https://arxiv.org/abs/2410.08475
Authors: Jiashu He,Mingyu Derek Ma,Jinxuan Fan,Dan Roth,Wei Wang,Alejandro Ribeiro
Keywords: Existing retrieval-based reasoning, Existing retrieval-based, retrieval-based reasoning approaches, large language models, provide domain knowledge
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Existing retrieval-based reasoning approaches for large language models (LLMs) heavily rely on the density and quality of the non-parametric knowledge source to provide domain knowledge and explicit reasoning chain. However, inclusive knowledge sources are expensive and sometimes infeasible to build for scientific or corner domains. To tackle the challenges, we introduce Graph Inspired Veracity Extrapolation (GIVE), a novel reasoning framework that integrates the parametric and non-parametric memories to enhance both knowledge retrieval and faithful reasoning processes on very sparse knowledge graphs. By leveraging the external structured knowledge to inspire LLM to model the interconnections among relevant concepts, our method facilitates a more logical and step-wise reasoning approach akin to experts’ problem-solving, rather than gold answer retrieval. Specifically, the framework prompts LLMs to decompose the query into crucial concepts and attributes, construct entity groups with relevant entities, and build an augmented reasoning chain by probing potential relationships among node pairs across these entity groups. Our method incorporates both factual and extrapolated linkages to enable comprehensive understanding and response generation. Extensive experiments on reasoning-intense benchmarks on biomedical and commonsense QA demonstrate the effectiveness of our proposed method. Specifically, GIVE enables GPT3.5-turbo to outperform advanced models like GPT4 without any additional training cost, thereby underscoring the efficacy of integrating structured information and internal reasoning ability of LLMs for tackling specialized tasks with limited external resources.

[AI-89] Deeper Insights into Deep Graph Convolutional Networks: Stability and Generalization

Link: https://arxiv.org/abs/2410.08473
Authors: Guangrui Yang,Ming Li,Han Feng,Xiaosheng Zhuang
Keywords-EN: exhibiting promising performance, graph learning tasks, stability and generalization, deep GCNs, Graph convolutional networks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Notes: 44 pages, 3 figures, submitted to IEEE Trans. Pattern Anal. Mach. Intell. on 18-Jun-2024, under review

Abstract:Graph convolutional networks (GCNs) have emerged as powerful models for graph learning tasks, exhibiting promising performance in various domains. While their empirical success is evident, there is a growing need to understand their essential ability from a theoretical perspective. Existing theoretical research has primarily focused on the analysis of single-layer GCNs, while a comprehensive theoretical exploration of the stability and generalization of deep GCNs remains limited. In this paper, we bridge this gap by delving into the stability and generalization properties of deep GCNs, aiming to provide valuable insights by characterizing rigorously the associated upper bounds. Our theoretical results reveal that the stability and generalization of deep GCNs are influenced by certain key factors, such as the maximum absolute eigenvalue of the graph filter operators and the depth of the network. Our theoretical studies contribute to a deeper understanding of the stability and generalization properties of deep GCNs, potentially paving the way for developing more reliable and well-performing models.

[AI-90] ARCap: Collecting High-quality Human Demonstrations for Robot Learning with Augmented Reality Feedback ICRA2025

Link: https://arxiv.org/abs/2410.08464
Authors: Sirui Chen,Chen Wang,Kaden Nguyen,Li Fei-Fei,C. Karen Liu
Keywords-EN: shown promising results, Recent progress, progress in imitation, imitation learning, learning from human
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Notes: 8 pages, 8 Figures, submitted to ICRA 2025

Abstract:Recent progress in imitation learning from human demonstrations has shown promising results in teaching robots manipulation skills. To further scale up training datasets, recent works start to use portable data collection devices without the need for physical robot hardware. However, due to the absence of on-robot feedback during data collection, the data quality depends heavily on user expertise, and many devices are limited to specific robot embodiments. We propose ARCap, a portable data collection system that provides visual feedback through augmented reality (AR) and haptic warnings to guide users in collecting high-quality demonstrations. Through extensive user studies, we show that ARCap enables novice users to collect robot-executable data that matches robot kinematics and avoids collisions with the scenes. With data collected from ARCap, robots can perform challenging tasks, such as manipulation in cluttered environments and long-horizon cross-embodiment manipulation. ARCap is fully open-source and easy to calibrate; all components are built from off-the-shelf products. More details and results can be found on our website: this https URL

[AI-91] Why pre-training is beneficial for downstream classification tasks?

Link: https://arxiv.org/abs/2410.08455
Authors: Xin Jiang,Xu Cheng,Zechao Li
Keywords-EN: exhibited notable benefits, notable benefits, remain unclear, exhibited notable, boosting accuracy
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Pre-training has exhibited notable benefits to downstream tasks by boosting accuracy and speeding up convergence, but the exact reasons for these benefits remain unclear. To this end, we propose to quantitatively and explicitly explain the effects of pre-training on the downstream task from a novel game-theoretic view, which also sheds new light on the learning behavior of deep neural networks (DNNs). Specifically, we extract and quantify the knowledge encoded by the pre-trained model, and further track the changes of such knowledge during the fine-tuning process. Interestingly, we discover that only a small amount of the pre-trained model’s knowledge is preserved for the inference of downstream tasks. However, such preserved knowledge is very challenging for a model trained from scratch to learn. Thus, with the help of this exclusively learned and useful knowledge, the model fine-tuned from pre-training usually achieves better performance than the model trained from scratch. Besides, we discover that pre-training can guide the fine-tuned model to learn target knowledge for the downstream task more directly and quickly, which accounts for the faster convergence of the fine-tuned model.

[AI-92] JurEE not Judges: safeguarding llm interactions with small specialised Encoder Ensembles

Link: https://arxiv.org/abs/2410.08442
Authors: Dom Nasrabadi
Keywords-EN: encoder-only transformer models, transformer models designed, encoder-only transformer, LLM-based systems, designed to strengthen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:We introduce JurEE, an ensemble of efficient, encoder-only transformer models designed to strengthen safeguards in AI-User interactions within LLM-based systems. Unlike existing LLM-as-Judge methods, which often struggle with generalization across risk taxonomies and only provide textual outputs, JurEE offers probabilistic risk estimates across a wide range of prevalent risks. Our approach leverages diverse data sources and employs progressive synthetic data generation techniques, including LLM-assisted augmentation, to enhance model robustness and performance. We create an in-house benchmark composed of other reputable benchmarks such as the OpenAI Moderation Dataset and ToxicChat, where we find JurEE significantly outperforms baseline models, demonstrating superior accuracy, speed, and cost-efficiency. This makes it particularly suitable for applications requiring stringent content moderation, such as customer-facing chatbots. The encoder-ensemble’s modular design allows users to set tailored risk thresholds, enhancing its versatility across various safety-related applications. JurEE’s collective decision-making process, where each specialized encoder model contributes to the final output, not only improves predictive accuracy but also enhances interpretability. This approach provides a more efficient, performant, and economical alternative to traditional LLMs for large-scale implementations requiring robust content moderation.

[AI-93] ∀uto∃∨∧L: Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks

Link: https://arxiv.org/abs/2410.08437
Authors: Rushang Karia,Daniel Bramblett,Daksh Dobhal,Siddharth Srivastava
Keywords-EN: Large Language Model, scaling Large Language, Language Model, Large Language, scaling Large
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:This paper presents ∀uto∃∨∧L, a novel benchmark for scaling Large Language Model (LLM) assessment in formal tasks with clear notions of correctness, such as truth maintenance in translation and logical reasoning. ∀uto∃∨∧L is the first benchmarking paradigm that offers several key advantages necessary for scaling objective evaluation of LLMs without human labeling: (a) ability to evaluate LLMs of increasing sophistication by auto-generating tasks at different levels of difficulty; (b) auto-generation of ground truth that eliminates dependence on expensive and time-consuming human annotation; (c) the use of automatically generated, randomized datasets that mitigate the ability of successive LLMs to overfit to static datasets used in many contemporary benchmarks. Empirical analysis shows that an LLM’s performance on ∀uto∃∨∧L is highly indicative of its performance on a diverse array of other benchmarks focusing on translation and reasoning tasks, making it a valuable autonomous evaluation paradigm in settings where hand-curated datasets can be hard to obtain and/or update.

[AI-94] Exploring the Role of Reasoning Structures for Constructing Proofs in Multi-Step Natural Language Reasoning with Large Language Models EMNLP2024

Link: https://arxiv.org/abs/2410.08436
Authors: Zi’ou Zheng,Christopher Malon,Martin Renqiang Min,Xiaodan Zhu
Keywords-EN: Large Language Models, multi-step reasoning tasks, improving models’ explainability, complex multi-step reasoning, performing complex multi-step
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Accepted by EMNLP2024 main conference

Abstract:When performing complex multi-step reasoning tasks, the ability of Large Language Models (LLMs) to derive structured intermediate proof steps is important for ensuring that the models truly perform the desired reasoning and for improving models’ explainability. This paper is centred around a focused study: whether the current state-of-the-art generalist LLMs can leverage the structures in a few examples to better construct the proof structures with in-context learning. Our study specifically focuses on structure-aware demonstration and structure-aware pruning. We demonstrate that they both help improve performance. A detailed analysis is provided to help understand the results.

[AI-95] Symbolic Music Generation with Fine-grained Interactive Textural Guidance

Link: https://arxiv.org/abs/2410.08435
Authors: Tingyu Zhu,Haoyu Liu,Zhimin Jiang,Zeyu Zheng
Keywords-EN: limited data availability, Fine-grained Textural Guidance, generation presents unique, presents unique challenges, unique challenges due
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)

Abstract:The problem of symbolic music generation presents unique challenges due to the combination of limited data availability and the need for high precision in note pitch. To overcome these difficulties, we introduce Fine-grained Textural Guidance (FTG) within diffusion models to correct errors in the learned distributions. By incorporating FTG, the diffusion models improve the accuracy of music generation, which makes them well-suited for advanced tasks such as progressive music generation, improvisation and interactive music creation. We derive theoretical characterizations for both the challenges in symbolic music generation and the effect of the FTG approach. We provide numerical experiments and a demo page for interactive music generation with user input to showcase the effectiveness of our approach.

[AI-96] Retrieval Augmented Generation for 10 Large Language Models and its Generalizability in Assessing Medical Fitness

Link: https://arxiv.org/abs/2410.08431
Authors: Yu He Ke,Liyuan Jin,Kabilan Elangovan,Hairil Rizal Abdullah,Nan Liu,Alex Tiong Heng Sia,Chai Rick Soh,Joshua Yi Min Tung,Jasmine Chiat Ling Ong,Chang-Fu Kuo,Shao-Chun Wu,Vesela P. Kovacheva,Daniel Shu Wei Ting
Keywords-EN: Large Language Models, Large Language, Retrieval Augmented Generation, specialized clinical knowledge, lack specialized clinical
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: arXiv admin note: substantial text overlap with arXiv:2402.01733

Abstract:Large Language Models (LLMs) show potential for medical applications but often lack specialized clinical knowledge. Retrieval Augmented Generation (RAG) allows customization with domain-specific information, making it suitable for healthcare. This study evaluates the accuracy, consistency, and safety of RAG models in determining fitness for surgery and providing preoperative instructions. We developed LLM-RAG models using 35 local and 23 international preoperative guidelines and tested them against human-generated responses. A total of 3,682 responses were evaluated. Clinical documents were processed using Llamaindex, and 10 LLMs, including GPT3.5, GPT4, and Claude-3, were assessed. Fourteen clinical scenarios were analyzed, focusing on seven aspects of preoperative instructions. Established guidelines and expert judgment were used to determine correct responses, with human-generated answers serving as comparisons. The LLM-RAG models generated responses within 20 seconds, significantly faster than clinicians (10 minutes). The GPT4 LLM-RAG model achieved the highest accuracy (96.4% vs. 86.6%, p=0.016), with no hallucinations and producing correct instructions comparable to clinicians. Results were consistent across both local and international guidelines. This study demonstrates the potential of LLM-RAG models for preoperative healthcare tasks, highlighting their efficiency, scalability, and reliability.

[AI-97] Promptly Yours? A Human Subject Study on Prompt Inference in AI-Generated Art

Link: https://arxiv.org/abs/2410.08406
Authors: Khoi Trinh,Joseph Spracklen,Raveen Wijewickrama,Bimal Viswanath,Murtuza Jadliwala,Anindya Maiti
Keywords-EN: generating unique artworks, creators can purchase, unique artworks, emerging field, art has witnessed
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:The emerging field of AI-generated art has witnessed the rise of prompt marketplaces, where creators can purchase, sell, or share prompts for generating unique artworks. These marketplaces often assert ownership over prompts, claiming them as intellectual property. This paper investigates whether concealed prompts sold on prompt marketplaces can be considered as secure intellectual property, given that humans and AI tools may be able to approximately infer the prompts based on publicly advertised sample images accompanying each prompt on sale. Specifically, our survey aims to assess (i) how accurately can humans infer the original prompt solely by examining an AI-generated image, with the goal of generating images similar to the original image, and (ii) the possibility of improving upon individual human and AI prompt inferences by crafting human-AI combined prompts with the help of a large language model. Although previous research has explored the use of AI and machine learning to infer (and also protect against) prompt inference, we are the first to include humans in the loop. Our findings indicate that while humans and human-AI collaborations can infer prompts and generate similar images with high accuracy, they are not as successful as using the original prompt.

[AI-98] AgroGPT: Efficient Agricultural Vision-Language Model with Expert Tuning

Link: https://arxiv.org/abs/2410.08405
Authors: Muhammad Awais,Ali Husain Salem Abdulla Alharthi,Amandeep Kumar,Hisham Cholakkal,Rao Muhammad Anwer
Keywords-EN: Significant progress, capitalizing on vast, made in advancing, vast repositories, Significant
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Significant progress has been made in advancing large multimodal conversational models (LMMs), capitalizing on vast repositories of image-text data available online. Despite this progress, these models often encounter substantial domain gaps, hindering their ability to engage in complex conversations across new domains. Recent efforts have aimed to mitigate this issue, albeit relying on domain-specific image-text data to curate instruction-tuning data. However, many domains, such as agriculture, lack such vision-language data. In this work, we propose an approach to construct instruction-tuning data that harnesses vision-only data for the agriculture domain. We utilize diverse agricultural datasets spanning multiple domains, curate class-specific information, and employ large language models (LLMs) to construct an expert-tuning set, resulting in a 70k expert-tuning dataset called AgroInstruct. Subsequently, we expert-tuned and created AgroGPT, an efficient LMM that can hold complex agriculture-related conversations and provide useful insights. We also develop AgroEvals for evaluation and compare AgroGPT’s performance with large open and closed-source models. AgroGPT excels at identifying fine-grained agricultural concepts, can act as an agriculture expert, and provides helpful information for multimodal agriculture questions. The code, datasets, and models are available at this https URL.

[AI-99] The Effects of Hallucinations in Synthetic Training Data for Relation Extraction ISWC’24

Link: https://arxiv.org/abs/2410.08393
Authors: Steven Rogulsky,Nicholas Popovic,Michael Färber
Keywords-EN: constructing knowledge graphs, Relation extraction, knowledge graphs, foundation for training, constructing knowledge
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Notes: Accepted at KBC-LM@ISWC’24

Abstract:Relation extraction is crucial for constructing knowledge graphs, with large high-quality datasets serving as the foundation for training, fine-tuning, and evaluating models. Generative data augmentation (GDA) is a common approach to expand such datasets. However, this approach often introduces hallucinations, such as spurious facts, whose impact on relation extraction remains underexplored. In this paper, we examine the effects of hallucinations on the performance of relation extraction on the document and sentence levels. Our empirical study reveals that hallucinations considerably compromise the ability of models to extract relations from text, with recall reductions between 19.1% and 39.2%. We identify that relevant hallucinations impair the model’s performance, while irrelevant hallucinations have a minimal impact. Additionally, we develop methods for the detection of hallucinations to improve data quality and model performance. Our approaches successfully classify texts as either ‘hallucinated’ or ‘clean,’ achieving high F1-scores of 83.8% and 92.2%. These methods not only assist in removing hallucinations but also help in estimating their prevalence within datasets, which is crucial for selecting high-quality data. Overall, our work confirms the profound impact of relevant hallucinations on the effectiveness of relation extraction models.

[AI-100] KV Prediction for Improved Time to First Token

Link: https://arxiv.org/abs/2410.08391
Authors: Maxwell Horton,Qingqing Cao,Chenfan Sun,Yanzi Jin,Sachin Mehta,Mohammad Rastegari,Moin Nabi
Keywords-EN: Inference with transformer-based, language models begins, transformer-based language models, prompt processing step, transformer-based language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Inference with transformer-based language models begins with a prompt processing step. In this step, the model generates the first output token and stores the KV cache needed for future generation steps. This prompt processing step can be computationally expensive, taking tens of seconds or more for billion-parameter models on edge devices when prompt lengths or batch sizes rise. This degrades user experience by introducing significant latency into the model’s outputs. To reduce the time spent producing the first output (known as the "time to first token", or TTFT) of a pretrained model, we introduce a novel method called KV Prediction. In our method, a small auxiliary model is used to process the prompt and produce an approximation of the KV cache used by a base model. This approximated KV cache is then used with the base model for autoregressive generation without the need to query the auxiliary model again. We demonstrate that our method produces a Pareto-optimal efficiency-accuracy trade-off when compared to baselines. On TriviaQA, we demonstrate relative accuracy improvements in the range of 15%-50% across a range of TTFT FLOPs budgets. We also demonstrate accuracy improvements of up to 30% on HumanEval Python code completion at fixed TTFT FLOPs budgets. Additionally, we benchmark models on an Apple M2 Pro CPU and demonstrate that our improvement in FLOPs translates to a TTFT speedup on hardware. We release our code at this https URL.

[AI-101] KnowGraph: Knowledge-Enabled Anomaly Detection via Logical Reasoning on Graph Data CCS2024

Link: https://arxiv.org/abs/2410.08390
Authors: Andy Zhou,Xiaojun Xu,Ramesh Raghunathan,Alok Lal,Xinze Guan,Bin Yu,Bo Li
Keywords-EN: Graph Neural Networks, network traffic, pivotal in diverse, transaction networks, Neural Networks
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Accepted to ACM CCS 2024

Abstract:Graph-based anomaly detection is pivotal in diverse security applications, such as fraud detection in transaction networks and intrusion detection for network traffic. Standard approaches, including Graph Neural Networks (GNNs), often struggle to generalize across shifting data distributions. Meanwhile, real-world domain knowledge is more stable and a common existing component of real-world detection strategies. To explicitly integrate such knowledge into data-driven models such as GCNs, we propose KnowGraph, which integrates domain knowledge with data-driven learning for enhanced graph-based anomaly detection. KnowGraph comprises two principal components: (1) a statistical learning component that utilizes a main model for the overarching detection task, augmented by multiple specialized knowledge models that predict domain-specific semantic entities; (2) a reasoning component that employs probabilistic graphical models to execute logical inferences based on model outputs, encoding domain knowledge through weighted first-order logic formulas. Extensive experiments on these large-scale real-world datasets show that KnowGraph consistently outperforms state-of-the-art baselines in both transductive and inductive settings, achieving substantial gains in average precision when generalizing to completely unseen test graphs. Further ablation studies demonstrate the effectiveness of the proposed reasoning component in improving detection performance, especially under extreme class imbalance. These results highlight the potential of integrating domain knowledge into data-driven models for high-stakes, graph-based security applications.

[AI-102] GUS-Net: Social Bias Classification in Text with Generalizations Unfairness and Stereotypes

Link: https://arxiv.org/abs/2410.08388
Authors: Maximus Powers,Hua Wei,Umang Mavani,Harshitha Reddy Jonala,Ansh Tiwari
Keywords-EN: natural language processing, critical challenge, bias detection, large language models, bias
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:The detection of bias in natural language processing (NLP) is a critical challenge, particularly with the increasing use of large language models (LLMs) in various domains. This paper introduces GUS-Net, an innovative approach to bias detection that focuses on three key types of biases: (G)eneralizations, (U)nfairness, and (S)tereotypes. GUS-Net leverages generative AI and automated agents to create a comprehensive synthetic dataset, enabling robust multi-label token classification. Our methodology enhances traditional bias detection methods by incorporating the contextual encodings of pre-trained models, resulting in improved accuracy and depth in identifying biased entities. Through extensive experiments, we demonstrate that GUS-Net outperforms state-of-the-art techniques, achieving superior performance in terms of accuracy, F1-score, and Hamming Loss. The findings highlight GUS-Net’s effectiveness in capturing a wide range of biases across diverse contexts, making it a valuable tool for social bias detection in text. This study contributes to the ongoing efforts in NLP to address implicit bias, providing a pathway for future research and applications in various fields. The Jupyter notebooks used to create the dataset and model are available at: this https URL. Warning: This paper contains examples of harmful language, and reader discretion is recommended.

[AI-103] Language model developers should report train-test overlap

Link: https://arxiv.org/abs/2410.08385
Authors: Andy K Zhang,Kevin Klyman,Yifan Mai,Yoav Levine,Yian Zhang,Rishi Bommasani,Percy Liang
Keywords-EN: train-test overlap, train-test, overlap, results requires knowledge, measure train-test overlap
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Software Engineering (cs.SE)
Notes: 18 pages

Abstract:Language models are extensively evaluated, but correctly interpreting evaluation results requires knowledge of train-test overlap which refers to the extent to which the language model is trained on the very data it is being tested on. The public currently lacks adequate information about train-test overlap: most models have no public train-test overlap statistics, and third parties cannot directly measure train-test overlap since they do not have access to the training data. To make this clear, we document the practices of 30 model developers, finding that just 9 developers report train-test overlap: 4 developers release training data under open-source licenses, enabling the community to directly measure train-test overlap, and 5 developers publish their train-test overlap methodology and statistics. By engaging with language model developers, we provide novel information about train-test overlap for three additional developers. Overall, we take the position that language model developers should publish train-test overlap statistics and/or training data whenever they report evaluation results on public test sets. We hope our work increases transparency into train-test overlap to increase the community-wide trust in model evaluations.
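
To make the notion of train-test overlap concrete, here is a minimal Python sketch that flags a test example when it shares at least one word n-gram with the training corpus. The abstract does not say how any developer measures overlap, so the n-gram criterion, the whitespace tokenization, and the names `ngrams` and `overlap_fraction` are all assumptions made for illustration.

```python
from typing import List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase, whitespace-tokenized word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(train_docs: List[str], test_examples: List[str], n: int = 3) -> float:
    """Fraction of test examples sharing at least one n-gram with the training docs.

    Hypothetical criterion: real overlap analyses may use longer n-grams,
    different tokenizers, or fuzzy matching.
    """
    train_grams: Set[Tuple[str, ...]] = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    if not test_examples:
        return 0.0
    flagged = sum(1 for example in test_examples if ngrams(example, n) & train_grams)
    return flagged / len(test_examples)


if __name__ == "__main__":
    train = ["the cat sat on the mat"]
    test = ["a cat sat on a chair", "dogs bark loudly at night"]
    print(overlap_fraction(train, test, n=3))
```

A check like this needs the training documents themselves, which is the abstract's point about why third parties cannot measure overlap for models with closed training data.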

[AI-104] Optimizing Vital Sign Monitoring in Resource-Constrained Maternal Care: An RL-Based Restless Bandit Approach

Link: https://arxiv.org/abs/2410.08377
Authors: Niclas Boehmer,Yunfan Zhao,Guojun Xiong,Paula Rodriguez-Diaz,Paola Del Cueto Cibrian,Joseph Ngonzi,Adeline Boatin,Milind Tambe
Keywords-EN: significant global public, global public health, Maternal mortality remains, public health challenge, mortality remains
Categories: Artificial Intelligence (cs.AI)

Abstract:Maternal mortality remains a significant global public health challenge. One promising approach to reducing maternal deaths occurring during facility-based childbirth is through early warning systems, which require the consistent monitoring of mothers’ vital signs after giving birth. Wireless vital sign monitoring devices offer a labor-efficient solution for continuous monitoring, but their scarcity raises the critical question of how to allocate them most effectively. We devise an allocation algorithm for this problem by modeling it as a variant of the popular Restless Multi-Armed Bandit (RMAB) paradigm. In doing so, we identify and address novel, previously unstudied constraints unique to this domain, which render previous approaches for RMABs unsuitable and significantly increase the complexity of the learning and planning problem. To overcome these challenges, we adopt the popular Proximal Policy Optimization (PPO) algorithm from reinforcement learning to learn an allocation policy by training a policy and value function network. We demonstrate in simulations that our approach outperforms the best heuristic baseline by up to a factor of 4.

[AI-105] Merging in a Bottle: Differentiable Adaptive Merging (DAM) and the Path from Averaging to Automation

Link: https://arxiv.org/abs/2410.08371
Authors: Thomas Gauthier-Caron,Shamane Siriwardhana,Elliot Stein,Malikeh Ehghaghi,Charles Goddard,Mark McQuade,Jacob Solawetz,Maxime Labonne
Keywords-EN: requiring substantial retraining, separate language models, achieving a balance, substantial retraining, systems can combine
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 11 pages, 1 figure, and 3 tables

Abstract:By merging models, AI systems can combine the distinct strengths of separate language models, achieving a balance between multiple capabilities without requiring substantial retraining. However, the integration process can be intricate due to differences in training methods and fine-tuning, typically necessitating specialized knowledge and repeated refinement. This paper explores model merging techniques across a spectrum of complexity, examining where automated methods like evolutionary strategies stand compared to hyperparameter-driven approaches such as DARE, TIES-Merging and simpler methods like Model Soups. In addition, we introduce Differentiable Adaptive Merging (DAM), an efficient, adaptive merging approach as an alternative to evolutionary merging that optimizes model integration through scaling coefficients, minimizing computational demands. Our findings reveal that even simple averaging methods, like Model Soups, perform competitively when model similarity is high, underscoring each technique’s unique strengths and limitations. We open-sourced DAM, including the implementation code and experiment pipeline, on GitHub: this https URL.
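
As a rough illustration of the simplest end of that spectrum, the Python sketch below averages the parameters of models that share an architecture, which is the basic idea behind Model Soups. It also accepts per-model scaling coefficients, but these are supplied by hand here; how DAM actually learns its coefficients is not described in the abstract, so this is not the paper's method. The `StateDict` type and the `merge_models` name are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for a model checkpoint: parameter name -> flat list of weights.
StateDict = Dict[str, List[float]]


def merge_models(models: List[StateDict], coeffs: Optional[List[float]] = None) -> StateDict:
    """Merge models with identical parameter shapes as a weighted sum.

    With coeffs=None this is uniform averaging (the Model Soups idea).
    Custom coeffs are fixed inputs here, not learned as in DAM.
    """
    if coeffs is None:
        coeffs = [1.0 / len(models)] * len(models)
    if len(coeffs) != len(models):
        raise ValueError("need exactly one coefficient per model")
    merged: StateDict = {}
    for name, reference in models[0].items():
        merged[name] = [
            sum(c * m[name][k] for c, m in zip(coeffs, models))
            for k in range(len(reference))
        ]
    return merged


if __name__ == "__main__":
    model_a = {"w": [1.0, 2.0]}
    model_b = {"w": [3.0, 4.0]}
    print(merge_models([model_a, model_b]))              # uniform average
    print(merge_models([model_a, model_b], [0.8, 0.2]))  # hand-picked coefficients
```

Uniform averaging has nothing to tune, which fits the abstract's observation that it stays competitive when the merged models are already similar.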

[AI-106] Large Legislative Models: Towards Efficient AI Policymaking in Economic Simulations

Link: https://arxiv.org/abs/2410.08345
Authors: Henry Gasztowtt,Benjamin Smith,Vincent Zhu,Qinxun Bai,Edwin Zhang
Keywords-EN: broad societal benefit, AI-driven policymaking tools, economic policymaking presents, societal benefit, improvement of economic
Categories: Artificial Intelligence (cs.AI)

Abstract:The improvement of economic policymaking presents an opportunity for broad societal benefit, a notion that has inspired research towards AI-driven policymaking tools. AI policymaking holds the potential to surpass human performance through the ability to process data quickly at scale. However, existing RL-based methods exhibit sample inefficiency, and are further limited by an inability to flexibly incorporate nuanced information into their decision-making processes. Thus, we propose a novel method in which we instead utilize pre-trained Large Language Models (LLMs), as sample-efficient policymakers in socially complex multi-agent reinforcement learning (MARL) scenarios. We demonstrate significant efficiency gains, outperforming existing methods across three environments. Our code is available at this https URL.

[AI-107] Kernel Banzhaf: A Fast and Robust Estimator for Banzhaf Values

Link: https://arxiv.org/abs/2410.08336
Authors: Yurong Liu,R. Teal Witter,Flip Korn,Tarfah Alrashed,Dimitris Paparas,Juliana Freire
Keywords-EN: widely-used Shapley, Kernel Banzhaf, introduce Kernel Banzhaf, establishing Kernel Banzhaf, Kernel Banzhaf substantially
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Banzhaf values offer a simple and interpretable alternative to the widely-used Shapley values. We introduce Kernel Banzhaf, a novel algorithm inspired by KernelSHAP, that leverages an elegant connection between Banzhaf values and linear regression. Through extensive experiments on feature attribution tasks, we demonstrate that Kernel Banzhaf substantially outperforms other algorithms for estimating Banzhaf values in both sample efficiency and robustness to noise. Furthermore, we prove theoretical guarantees on the algorithm’s performance, establishing Kernel Banzhaf as a valuable tool for interpretable machine learning.
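
For readers unfamiliar with the quantity being estimated, the Python sketch below computes exact Banzhaf values by brute force, using the standard definition: a player's value is its marginal contribution averaged uniformly over all coalitions of the other players. This is not the Kernel Banzhaf estimator from the paper. It enumerates every coalition, so it is only feasible for a handful of players, which is why sampling-based estimators are needed at all. The `banzhaf_values` name and the toy games are illustrative assumptions.

```python
from itertools import combinations
from typing import Callable, FrozenSet, List


def banzhaf_values(n: int, value: Callable[[FrozenSet[int]], float]) -> List[float]:
    """Exact Banzhaf values for players 0..n-1 of a cooperative game.

    Player i's value is the average of value(S | {i}) - value(S) over all
    2**(n-1) coalitions S that exclude i. Cost grows exponentially in n.
    """
    result = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(len(others) + 1):
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                total += value(coalition | {i}) - value(coalition)
        result.append(total / 2 ** (n - 1))
    return result


def majority(coalition: FrozenSet[int]) -> float:
    """Toy 3-player voting game: a coalition wins with at least 2 members."""
    return 1.0 if len(coalition) >= 2 else 0.0


if __name__ == "__main__":
    print(banzhaf_values(3, majority))
```

In a feature attribution setting, `value` would instead evaluate the model with only the features in the coalition present, and each such evaluation is expensive.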

[AI-108] Exploring Natural Language-Based Strategies for Efficient Number Learning in Children through Reinforcement Learning

Link: https://arxiv.org/abs/2410.08334
Authors: Tirthankar Mittra
Keywords-EN: children learn numbers, paper investigates, investigates how children, children learn, reinforcement learning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Abstract:This paper investigates how children learn numbers using the framework of reinforcement learning (RL), with a focus on the impact of language instructions. The motivation for using reinforcement learning stems from its parallels with psychological learning theories in controlled environments. By using state-of-the-art deep reinforcement learning models, we simulate and analyze the effects of various forms of language instructions on number acquisition. Our findings indicate that certain linguistic structures more effectively improve numerical comprehension in RL agents. Additionally, our model predicts optimal sequences for presenting numbers to RL agents which enhance their speed of learning. This research provides valuable insights into the interplay between language and numerical cognition, with implications for both educational strategies and the development of artificial intelligence systems designed to support early childhood learning.

[AI-109] Level of agreement between emotions generated by Artificial Intelligence and human evaluation: a methodological proposal

Link: https://arxiv.org/abs/2410.08332
Authors: Miguel Carrasco,Cesar Gonzalez-Martin,Sonia Navajas-Torrente,Raul Dastres
Keywords: highly subjective, capable of conveying, experience is highly, emotions, conveying emotions
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 29 pages

Abstract:Images are capable of conveying emotions, but emotional experience is highly subjective. Advances in artificial intelligence have enabled the generation of images based on emotional descriptions. However, the level of agreement between the generated images and human emotional responses has not yet been evaluated. To address this, 20 artistic landscapes were generated using StyleGAN2-ADA. Four variants evoking positive emotions (contentment, amusement) and negative emotions (fear, sadness) were created for each image, resulting in 80 pictures. An online questionnaire was designed using this material, in which 61 observers classified the generated images. Statistical analyses were performed on the collected data to determine the level of agreement among participants and between the observers’ responses and the AI-generated emotions. A generally good level of agreement was found, with better results for negative emotions. However, the study confirms the subjectivity inherent in emotional evaluation.

[AI-110] Agents Thinking Fast and Slow: A Talker-Reasoner Architecture

Link: https://arxiv.org/abs/2410.08328
Authors: Konstantina Christakopoulou,Shibl Mourad,Maja Matarić
Keywords: Large language models, Large language, natural conversation, language models, models have enabled
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Large language models have enabled agents of all kinds to interact with users through natural conversation. Consequently, agents now have two jobs: conversing and planning/reasoning. Their conversational responses must be informed by all available information, and their actions must help to achieve goals. This dichotomy between conversing with the user and doing multi-step reasoning and planning can be seen as analogous to the human systems of “thinking fast and slow” as introduced by Kahneman. Our approach is comprised of a “Talker” agent (System 1) that is fast and intuitive, and tasked with synthesizing the conversational response; and a “Reasoner” agent (System 2) that is slower, more deliberative, and more logical, and is tasked with multi-step reasoning and planning, calling tools, performing actions in the world, and thereby producing the new agent state. We describe the new Talker-Reasoner architecture and discuss its advantages, including modularity and decreased latency. We ground the discussion in the context of a sleep coaching agent, in order to demonstrate real-world relevance.

[AI-111] UNIQ: Offline Inverse Q-learning for Avoiding Undesirable Demonstrations

Link: https://arxiv.org/abs/2410.08307
Authors: Huy Hoang,Tien Mai,Pradeep Varakantham
Keywords: avoids undesirable demonstrations, undesirable demonstrations, learning, offline imitation learning, learning policy
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:We address the problem of offline learning a policy that avoids undesirable demonstrations. Unlike conventional offline imitation learning approaches that aim to imitate expert or near-optimal demonstrations, our setting involves avoiding undesirable behavior (specified using undesirable demonstrations). To tackle this problem, unlike standard imitation learning where the aim is to minimize the distance between learning policy and expert demonstrations, we formulate the learning task as maximizing a statistical distance, in the space of state-action stationary distributions, between the learning policy and the undesirable policy. This significantly different approach results in a novel training objective that necessitates a new algorithm to address it. Our algorithm, UNIQ, tackles these challenges by building on the inverse Q-learning framework, framing the learning problem as a cooperative (non-adversarial) task. We then demonstrate how to efficiently leverage unlabeled data for practical training. Our method is evaluated on standard benchmark environments, where it consistently outperforms state-of-the-art baselines. The code implementation can be accessed at: this https URL.

[AI-112] Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning?

Link: https://arxiv.org/abs/2410.08292
Authors: Khashayar Gatmiry,Nikunj Saunshi,Sashank J. Reddi,Stefanie Jegelka,Sanjiv Kumar
Keywords: single forward pass, few-shot learning, forward pass, multi-step algorithms, remarkable capability
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:The remarkable capability of Transformers to do reasoning and few-shot learning, without any fine-tuning, is widely conjectured to stem from their ability to implicitly simulate multi-step algorithms, such as gradient descent, with their weights in a single forward pass. Recently, there has been progress in understanding this complex phenomenon from an expressivity point of view, by demonstrating that Transformers can express such multi-step algorithms. However, our knowledge about the more fundamental aspect of its learnability, beyond single-layer models, is very limited. In particular, can training Transformers enable convergence to algorithmic solutions? In this work we resolve this for in-context linear regression with linear looped Transformers, a multi-layer model with weight sharing that is conjectured to have an inductive bias to learn fixed-point iterative algorithms. More specifically, for this setting we show that the global minimizer of the population training loss implements multi-step preconditioned gradient descent, with a preconditioner that adapts to the data distribution. Furthermore, we show fast convergence for gradient flow on the regression loss, despite the non-convexity of the landscape, by proving a novel gradient dominance condition. To our knowledge, this is the first theoretical analysis for multi-layer Transformers in this setting. We further validate our theoretical findings through synthetic experiments.
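As a rough illustration of what "multi-step preconditioned gradient descent" means for linear regression, the sketch below applies the same update rule repeatedly, mirroring the weight sharing of a looped model. It is not the paper's construction: the diagonal preconditioner (the inverse mean square of each feature), the function name `looped_preconditioned_gd`, and the tiny dataset are all assumptions chosen to keep the example self-contained.

```python
def looped_preconditioned_gd(X, y, precond, n_loops):
    """Repeat one shared update, w <- w - P * grad, for n_loops steps.
    grad is the gradient of the halved mean squared error;
    P is a diagonal preconditioner given as a list."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_loops):
        residuals = [sum(X[r][j] * w[j] for j in range(d)) - y[r] for r in range(n)]
        grad = [sum(X[r][j] * residuals[r] for r in range(n)) / n for j in range(d)]
        w = [w[j] - precond[j] * grad[j] for j in range(d)]
    return w


# Hypothetical noiseless regression problem with two features.
X = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [2.0, 1.0]]
w_true = [3.0, -1.0]
y = [sum(x[j] * w_true[j] for j in range(2)) for x in X]

# Data-dependent diagonal preconditioner: inverse mean square of each feature.
precond = [len(X) / sum(x[j] ** 2 for x in X) for j in range(2)]

w_hat = looped_preconditioned_gd(X, y, precond, n_loops=40)
print(w_hat)
```

On this dataset each loop halves the error, so a few dozen loops recover `w_true` to high precision; a preconditioner that ignores the data scale would generally need a hand-tuned step size.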

[AI-113] Increasing the Difficulty of Automatically Generated Questions via Reinforcement Learning with Synthetic Preference

Link: https://arxiv.org/abs/2410.08289
Authors: William Thorne,Ambrose Robinson,Bohua Peng,Chenghua Lin,Diana Maynard
Keywords: sector increasingly adopts, increasingly adopts technologies, personalised search experiences, Retrieval-Augmented Generation, heritage sector increasingly
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: To be published in NLP4DH 2024

Abstract:As the cultural heritage sector increasingly adopts technologies like Retrieval-Augmented Generation (RAG) to provide more personalised search experiences and enable conversations with collections data, the demand for specialised evaluation datasets has grown. While end-to-end system testing is essential, it’s equally important to assess individual components. We target the final, answering task, which is well-suited to Machine Reading Comprehension (MRC). Although existing MRC datasets address general domains, they lack the specificity needed for cultural heritage information. Unfortunately, the manual creation of such datasets is prohibitively expensive for most heritage institutions. This paper presents a cost-effective approach for generating domain-specific MRC datasets with increased difficulty using Reinforcement Learning from Human Feedback (RLHF) from synthetic preference data. Our method leverages the performance of existing question-answering models on a subset of SQuAD to create a difficulty metric, assuming that more challenging questions are answered correctly less frequently. This research contributes: (1) A methodology for increasing question difficulty using PPO and synthetic data; (2) Empirical evidence of the method’s effectiveness, including human evaluation; (3) An in-depth error analysis and study of emergent phenomena; and (4) An open-source codebase and set of three llama-2-chat adapters for reproducibility and adaptation.
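The difficulty metric described in the abstract (harder questions are answered correctly less often) can be sketched as follows. This is a minimal illustration, not the authors' code: the scoring formula, the rule that the harder question in each pair is the preferred one, and all names and data are assumptions.

```python
from itertools import combinations


def difficulty(correct_flags):
    """Fraction of QA models that answered the question incorrectly."""
    if not correct_flags:
        raise ValueError("need at least one model result")
    return 1.0 - sum(correct_flags) / len(correct_flags)


def preference_pairs(results):
    """Score each question, then build (preferred, rejected) pairs in which
    the preferred question is the harder one. Ties produce no pair."""
    scores = {q: difficulty(flags) for q, flags in results.items()}
    pairs = []
    for a, b in combinations(scores, 2):
        if scores[a] == scores[b]:
            continue
        harder, easier = (a, b) if scores[a] > scores[b] else (b, a)
        pairs.append((harder, easier))
    return scores, pairs


# Hypothetical per-question correctness of four QA models.
results = {
    "q_easy": [True, True, True, True],
    "q_mid": [True, False, True, False],
    "q_hard": [False, False, False, True],
}
scores, pairs = preference_pairs(results)
print(scores)
print(pairs)
```

Pairs of this kind are the usual input format for training a reward model or for preference-based fine-tuning.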

[AI-114] FusionSense: Bridging Common Sense Vision and Touch for Robust Sparse-View Reconstruction

Link: https://arxiv.org/abs/2410.08282
Authors: Irving Fang,Kairui Shi,Xujin He,Siqi Tan,Yifan Wang,Hanwen Zhao,Hung-Jui Huang,Wenzhen Yuan,Chen Feng,Jing Zhang
Keywords: Humans effortlessly integrate, Humans effortlessly, effortlessly integrate common-sense, integrate common-sense knowledge, effortlessly integrate
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Abstract:Humans effortlessly integrate common-sense knowledge with sensory input from vision and touch to understand their surroundings. Emulating this capability, we introduce FusionSense, a novel 3D reconstruction framework that enables robots to fuse priors from foundation models with highly sparse observations from vision and tactile sensors. FusionSense addresses three key challenges: (i) How can robots efficiently acquire robust global shape information about the surrounding scene and objects? (ii) How can robots strategically select touch points on the object using geometric and common-sense priors? (iii) How can partial observations such as tactile signals improve the overall representation of the object? Our framework employs 3D Gaussian Splatting as a core representation and incorporates a hierarchical optimization strategy involving global structure construction, object visual hull pruning and local geometric constraints. This advancement results in fast and robust perception in environments with traditionally challenging objects that are transparent, reflective, or dark, enabling more downstream manipulation or navigation tasks. Experiments on real-world data suggest that our framework outperforms previous state-of-the-art sparse-view methods. All code and data are open-sourced on the project website.

[AI-115] Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content

Link: https://arxiv.org/abs/2410.08260
Authors: Qiuheng Wang,Yukai Shi,Jiarong Ou,Rui Chen,Ke Lin,Jiahao Wang,Boyuan Jiang,Haotian Yang,Mingwu Zheng,Xin Tao,Fei Yang,Pengfei Wan,Di Zhang
Keywords: visual generation technologies, generation technologies continue, continue to advance, expanded rapidly, technologies continue
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL

Abstract:As visual generation technologies continue to advance, the scale of video datasets has expanded rapidly, and the quality of these datasets is critical to the performance of video generation models. We argue that temporal splitting, detailed captions, and video quality filtering are three key factors that determine dataset quality. However, existing datasets exhibit various limitations in these areas. To address these challenges, we introduce Koala-36M, a large-scale, high-quality video dataset featuring accurate temporal splitting, detailed captions, and superior video quality. The core of our approach lies in improving the consistency between fine-grained conditions and video content. Specifically, we employ a linear classifier on probability distributions to enhance the accuracy of transition detection, ensuring better temporal consistency. We then provide structured captions for the split videos, with an average length of 200 words, to improve text-video alignment. Additionally, we develop a Video Training Suitability Score (VTSS) that integrates multiple sub-metrics, allowing us to filter high-quality videos from the original corpus. Finally, we incorporate several metrics into the training process of the generation model, further refining the fine-grained conditions. Our experiments demonstrate the effectiveness of our data processing pipeline and the quality of the proposed Koala-36M dataset. Our dataset and code will be released at this https URL.

[AI-116] AdaShadow: Responsive Test-time Model Adaptation in Non-stationary Mobile Environments

Link: https://arxiv.org/abs/2410.08256
Authors: Cheng Fang,Sicong Liu,Zimu Zhou,Bin Guo,Jiaqi Tang,Ke Ma,Zhiwen Yu
Keywords: deliver seamless user, seamless user experiences, unpredictable domain shifts, unpredictable domain, evolving environments
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: This paper is accepted by SenSys 2024. Copyright may be transferred without notice

Abstract:On-device adaptation to continual, unpredictable domain shifts is essential for mobile applications like autonomous driving and augmented reality to deliver seamless user experiences in evolving environments. Test-time adaptation (TTA) emerges as a promising solution by tuning model parameters with unlabeled live data immediately before prediction. However, TTA’s unique forward-backward-reforward pipeline notably increases the latency over standard inference, undermining the responsiveness in time-sensitive mobile applications. This paper presents AdaShadow, a responsive test-time adaptation framework for non-stationary mobile data distribution and resource dynamics via selective updates of adaptation-critical layers. Although the tactic is recognized in generic on-device training, TTA’s unsupervised and online context presents unique challenges in estimating layer importance and latency, as well as scheduling the optimal layer update plan. AdaShadow addresses these challenges with a backpropagation-free assessor to rapidly identify critical layers, a unit-based runtime predictor to account for resource dynamics in latency estimation, and an online scheduler for prompt layer update planning. Also, AdaShadow incorporates a memory I/O-aware computation reuse scheme to further reduce latency in the reforward pass. Results show that AdaShadow achieves the best accuracy-latency balance under continual shifts. At low memory and energy costs, AdaShadow provides a 2x to 3.5x speedup (ms-level) over state-of-the-art TTA methods with comparable accuracy and a 14.8% to 25.4% accuracy boost over efficient supervised methods with similar latency.

[AI-117] Generalization from Starvation: Hints of Universality in LLM Knowledge Graph Learning

Link: https://arxiv.org/abs/2410.08255
Authors: David D. Baek,Yuxiao Li,Max Tegmark
Keywords: MLP toy models, LLM in-context learning, Motivated by interpretability, MLP toy, networks represent knowledge
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 13 figures

Abstract:Motivated by interpretability and reliability, we investigate how neural networks represent knowledge during graph learning. We find hints of universality, where equivalent representations are learned across a range of model sizes (from 10^2 to 10^9 parameters) and contexts (MLP toy models, LLM in-context learning and LLM training). We show that these attractor representations optimize generalization to unseen examples by exploiting properties of knowledge graph relations (e.g. symmetry and meta-transitivity). We find experimental support for such universality by showing that LLMs and simpler neural networks can be stitched, i.e., by stitching the first part of one model to the last part of another, mediated only by an affine or almost affine transformation. We hypothesize that this dynamic toward simplicity and generalization is driven by “intelligence from starvation”: where overfitting is minimized by pressure to minimize the use of resources that are either scarce or competed for against other tasks.

[AI-118] Federated Graph Learning for Cross-Domain Recommendation NEURIPS’24

Link: https://arxiv.org/abs/2410.08249
Authors: Ziqi Yang,Zhaopeng Peng,Zihui Wang,Jianzhong Qi,Chaochao Chen,Weike Pan,Chenglu Wen,Cheng Wang,Xiaoliang Fan
Keywords: Cross-domain recommendation, data sparsity problem, offers a promising, promising solution, data sparsity
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by NeurIPS’24

Abstract:Cross-domain recommendation (CDR) offers a promising solution to the data sparsity problem by enabling knowledge transfer across source and target domains. However, many recent CDR models overlook crucial issues such as privacy as well as the risk of negative transfer (which negatively impacts model performance), especially in multi-domain settings. To address these challenges, we propose FedGCDR, a novel federated graph learning framework that securely and effectively leverages positive knowledge from multiple source domains. First, we design a positive knowledge transfer module that ensures privacy during inter-domain knowledge transmission. This module employs differential privacy-based knowledge extraction combined with a feature mapping mechanism, transforming source domain embeddings from federated graph attention networks into reliable domain knowledge. Second, we design a knowledge activation module to filter out potentially harmful or conflicting knowledge from source domains, addressing the issue of negative transfer. This module enhances target domain training by expanding the graph of the target domain to generate reliable domain attentions and fine-tunes the target model for improved negative knowledge filtering and more accurate predictions. We conduct extensive experiments on 16 popular domains of the Amazon dataset, demonstrating that FedGCDR significantly outperforms state-of-the-art methods.

[AI-119] Forecasting mortality associated emergency department crowding

Link: https://arxiv.org/abs/2410.08247
Authors: Jalmari Nevanlinna,Anna Eidstø,Jari Ylä-Mattila,Teemu Koivistoinen,Niku Oksala,Juho Kanniainen,Ari Palomäki,Antti Roine
Keywords: global public health, public health issue, Emergency department, global public, public health
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)

Abstract:Emergency department (ED) crowding is a global public health issue that has been repeatedly associated with increased mortality. Predicting future service demand would enable preventative measures aiming to eliminate crowding along with its detrimental effects. Recent findings in our ED indicate that occupancy ratios exceeding 90% are associated with increased 10-day mortality. In this paper, we aim to predict these crisis periods using retrospective data from a large Nordic ED with a LightGBM model. We provide predictions for the whole ED and individually for its different operational sections. We demonstrate that afternoon crowding can be predicted at 11 a.m. with an AUC of 0.82 (95% CI 0.78-0.86) and at 8 a.m. with an AUC up to 0.79 (95% CI 0.75-0.83). Consequently, we show that forecasting mortality-associated crowding using anonymous administrative data is feasible.

[AI-120] Flex-MoE: Modeling Arbitrary Modality Combination via the Flexible Mixture-of-Experts NEURIPS2024

Link: https://arxiv.org/abs/2410.08245
Authors: Sukwon Yun,Inyoung Choi,Jie Peng,Yangfan Wu,Jingxuan Bao,Qiyiwen Zhang,Jiayi Xin,Qi Long,Tianlong Chen
Keywords: gained increasing importance, Multimodal learning, modality combinations, arbitrary modality combinations, modality
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2024 Spotlight

Abstract:Multimodal learning has gained increasing importance across various fields, offering the ability to integrate data from diverse sources such as images, text, and personalized records, which are frequently observed in medical domains. However, in scenarios where some modalities are missing, many existing frameworks struggle to accommodate arbitrary modality combinations, often relying heavily on a single modality or complete data. This oversight of potential modality combinations limits their applicability in real-world situations. To address this challenge, we propose Flex-MoE (Flexible Mixture-of-Experts), a new framework designed to flexibly incorporate arbitrary modality combinations while maintaining robustness to missing data. The core idea of Flex-MoE is to first address missing modalities using a new missing modality bank that integrates observed modality combinations with the corresponding missing ones. This is followed by a uniquely designed Sparse MoE framework. Specifically, Flex-MoE first trains experts using samples with all modalities to inject generalized knowledge through the generalized router ($\mathcal{G}$-Router). The $\mathcal{S}$-Router then specializes in handling fewer modality combinations by assigning the top-1 gate to the expert corresponding to the observed modality combination. We evaluate Flex-MoE on the ADNI dataset, which encompasses four modalities in the Alzheimer’s Disease domain, as well as on the MIMIC-IV dataset. The results demonstrate the effectiveness of Flex-MoE, highlighting its ability to model arbitrary modality combinations in diverse missing modality scenarios. Code is available at this https URL.

[AI-121] RAB2-DEF: Dynamic and explainable defense against adversarial attacks in Federated Learning to fair poor clients

Link: https://arxiv.org/abs/2410.08244
Authors: Nuria Rodríguez-Barroso,M. Victoria Luzón,Francisco Herrera
Keywords: data privacy concerns, regulation is growing, data privacy, privacy concerns derived
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:As artificial intelligence becomes more popular, concern and the need for regulation are growing, including, among other requirements, data privacy. In this context, Federated Learning is proposed as a solution to data privacy concerns derived from different source data scenarios due to its distributed learning. The defense mechanisms proposed in the literature focus only on defending against adversarial attacks and on performance, leaving aside other important qualities such as explainability, fairness to poor quality clients, dynamism in terms of attack configuration and generality in terms of being resilient against different kinds of attacks. In this work, we propose RAB$^2$-DEF, a defense that is Resilient Against Byzantine and Backdoor attacks and is Dynamic, Explainable and Fair to poor clients, using local linear explanations. We test the performance of RAB$^2$-DEF on image datasets under both byzantine and backdoor attacks, compare it with state-of-the-art defenses, and show that RAB$^2$-DEF is a proper defense while also boosting the other qualities towards trustworthy artificial intelligence.

[AI-122] Self-Attention Mechanism in Multimodal Context for Banking Transaction Flow

Link: https://arxiv.org/abs/2410.08243
Authors: Cyrile Delestre,Yoann Sola
Keywords: Banking Transaction Flow, sequential data found, Transaction Flow, sequential data, data found
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Banking Transaction Flow (BTF) is a type of sequential data found in a number of banking activities such as marketing, credit risk or banking fraud. It is multimodal data composed of three modalities: a date, a numerical value and a wording. In this work, we propose an application of the self-attention mechanism to the processing of BTFs. We trained two general models on a large amount of BTFs in a self-supervised way: one RNN-based model and one Transformer-based model. We proposed a specific tokenization in order to be able to process BTFs. The performance of these two models was evaluated on two banking downstream tasks: a transaction categorization task and a credit risk task. The results show that fine-tuning these two pre-trained models outperforms the state-of-the-art approaches on both tasks.

[AI-123] LecPrompt: A Prompt-based Approach for Logical Error Correction with CodeBERT

Link: https://arxiv.org/abs/2410.08241
Authors: Zhenyu Xu,Victor S. Sheng
Keywords: raise compiler alerts, Logical errors, compiler alerts, making them hard, hard to detect
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)

Abstract:Logical errors in programming don’t raise compiler alerts, making them hard to detect. These silent errors can disrupt a program’s function or cause run-time issues. Their correction requires deep insight into the program’s logic, highlighting the importance of automated detection and repair. In this paper, we introduce LecPrompt to localize and repair logical errors, a prompt-based approach that harnesses the capabilities of CodeBERT, a transformer-based large language model trained on code. First, LecPrompt leverages a large language model to calculate perplexity and log probability metrics, pinpointing logical errors at both token and line levels. Through statistical analysis, it identifies tokens and lines that deviate significantly from the expected patterns recognized by large language models, marking them as potential error sources. Second, by framing the logical error correction challenge as a Masked Language Modeling (MLM) task, LecPrompt employs CodeBERT to autoregressively repair the identified error tokens. Finally, the soft-prompt method provides a novel solution in low-cost scenarios, ensuring that the model can be fine-tuned to the specific nuances of the logical error correction task without incurring high computational costs. To evaluate LecPrompt’s performance, we created a method to introduce logical errors into correct code and applied it to QuixBugs to produce the QuixBugs-LE dataset. Our evaluations on the QuixBugs-LE dataset for both Python and Java highlight the impressive capabilities of our method, LecPrompt. For Python, LecPrompt achieves a noteworthy 74.58% top-1 token-level repair accuracy and 27.4% program-level repair accuracy. In Java, LecPrompt delivers a 69.23% top-1 token-level repair accuracy and 24.7% full program-level repair accuracy.
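The localization step (flagging tokens whose log probability deviates strongly from the rest) can be sketched with a simple z-score rule. This is an illustrative assumption rather than the statistical test used in the paper: the threshold, the function names, and the hard-coded log probabilities are hypothetical, and a real pipeline would obtain the per-token log probabilities from a language model such as CodeBERT.

```python
import math
from statistics import mean, pstdev


def perplexity(logprobs):
    """Perplexity of a token sequence from natural-log token probabilities."""
    return math.exp(-mean(logprobs))


def flag_tokens(tokens, logprobs, z_threshold=1.5):
    """Return tokens whose log probability lies more than z_threshold
    standard deviations below the mean of the sequence."""
    mu, sigma = mean(logprobs), pstdev(logprobs)
    if sigma == 0:
        return []
    return [tok for tok, lp in zip(tokens, logprobs)
            if (lp - mu) / sigma < -z_threshold]


# Hypothetical scores for one line of code; "<=" is the unlikely token.
tokens = ["if", "x", "<=", "len", "(", "a", ")", ":"]
logprobs = [-0.2, -0.3, -4.0, -0.4, -0.1, -0.3, -0.1, -0.2]

suspects = flag_tokens(tokens, logprobs)
line_ppl = perplexity(logprobs)
print(suspects, round(line_ppl, 3))
```

The same rule applies at line level by comparing each line's perplexity against the distribution of perplexities over the file.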

[AI-124] New technologies and AI: envisioning future directions for UNSCR 1540

Link: https://arxiv.org/abs/2410.08216
Authors: Clara Punzi
Keywords: United Nations Security, Nations Security Council, Security Council Resolution, Artificial Intelligence, United Nations
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 5 pages, no figures, references in the footnotes

Abstract:This paper investigates the emerging challenges posed by the integration of Artificial Intelligence (AI) in the military domain, particularly within the context of United Nations Security Council Resolution 1540 (UNSCR 1540), which seeks to prevent the proliferation of weapons of mass destruction (WMDs). While the resolution initially focused on nuclear, chemical, and biological threats, the rapid advancement of AI introduces new complexities that were previously unanticipated. We critically analyze how AI can both exacerbate existing risks associated with WMDs (e.g., through the deployment of kamikaze drones and killer robots) and introduce novel threats (e.g., by exploiting the potential of Generative AI), thereby compromising international peace and security. The paper calls for an expansion of UNSCR 1540 to address the growing influence of AI technologies in the development, dissemination, and potential misuse of WMDs, urging the creation of a governance framework to mitigate these emerging risks.

[AI-125] An undetectable watermark for generative image models

Link: https://arxiv.org/abs/2410.07369
Authors: Sam Gunn,Xuandong Zhao,Dawn Song
Keywords: undetectable watermarking scheme, watermark, generative image models, undetectable watermarking, Christ and Gunn
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)

Abstract:We present the first undetectable watermarking scheme for generative image models. Undetectability ensures that no efficient adversary can distinguish between watermarked and un-watermarked images, even after making many adaptive queries. In particular, an undetectable watermark does not degrade image quality under any efficiently computable metric. Our scheme works by selecting the initial latents of a diffusion model using a pseudorandom error-correcting code (Christ and Gunn, 2024), a strategy which guarantees undetectability and robustness. We experimentally demonstrate that our watermarks are quality-preserving and robust using Stable Diffusion 2.1. Our experiments verify that, in contrast to every prior scheme we tested, our watermark does not degrade image quality. Our experiments also demonstrate robustness: existing watermark removal attacks fail to remove our watermark from images without significantly degrading the quality of the images. Finally, we find that we can robustly encode 512 bits in our watermark, and up to 2500 bits when the images are not subjected to watermark removal attacks. Our code is available at this https URL.

[AI-126] LCMDC: Large-scale Chinese Medical Dialogue Corpora for Automatic Triage and Medical Consultation

Link: https://arxiv.org/abs/2410.03521
Authors: Xinyuan Wang,Haozhou Li,Dingfang Zheng,Qinke Peng
Keywords: pandemic underscored major, underscored major deficiencies, online medical services, traditional healthcare systems, pandemic underscored
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The global COVID-19 pandemic underscored major deficiencies in traditional healthcare systems, hastening the advancement of online medical services, especially in medical triage and consultation. However, existing studies face two main challenges. First, the scarcity of large-scale, publicly available, domain-specific medical datasets due to privacy concerns, with current datasets being small and limited to a few diseases, limiting the effectiveness of triage methods based on Pre-trained Language Models (PLMs). Second, existing methods lack medical knowledge and struggle to accurately understand professional terms and expressions in patient-doctor consultations. To overcome these obstacles, we construct the Large-scale Chinese Medical Dialogue Corpora (LCMDC), comprising a Coarse-grained Triage dataset with 439,630 samples, a Fine-grained Diagnosis dataset with 199,600 samples, and a Medical Consultation dataset with 472,418 items, thereby addressing the data shortage in this field. Moreover, we further propose a novel triage system that combines BERT-based supervised learning with prompt learning, as well as a GPT-based medical consultation model using reinforcement learning. To enhance domain knowledge acquisition, we pre-trained PLMs using our self-constructed background corpus. Experimental results on the LCMDC demonstrate the efficacy of our proposed systems.

[AI-127] Learning Transferable Features for Implicit Neural Representations

Link: https://arxiv.org/abs/2409.09566
Authors: Kushal Vyas,Ahmed Imtiaz Humayun,Aniket Dashpute,Richard G. Baraniuk,Ashok Veeraraghavan,Guha Balakrishnan
Keywords: Implicit neural representations, Implicit neural, variety of applications, learned neural features, demonstrated success
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Implicit neural representations (INRs) have demonstrated success in a variety of applications, including inverse problems and neural rendering. An INR is typically trained to capture one signal of interest, resulting in learned neural features that are highly attuned to that signal. Although such features are assumed to be less generalizable, we explore their transferability for fitting similar signals. We introduce a new INR training framework, STRAINER, that learns transferable features for fitting INRs to new signals from a given distribution, faster and with better reconstruction quality. Owing to the sequential layer-wise affine operations in an INR, we propose to learn transferable representations by sharing initial encoder layers across multiple INRs with independent decoder layers. At test time, the learned encoder representations are transferred as initialization for an otherwise randomly initialized INR. We find STRAINER to yield extremely powerful initialization for fitting images from the same domain and allow for a gain of approximately +10 dB in signal quality early on compared to an untrained INR itself. STRAINER also provides a simple way to encode data-driven priors in INRs. We evaluate STRAINER on multiple in-domain and out-of-domain signal fitting tasks and inverse problems and further provide detailed analysis and discussion on the transferability of STRAINER’s features. Our demo can be accessed at this https URL.

[AI-128] Editing Massive Concepts in Text-to-Image Diffusion Models

Link: https://arxiv.org/abs/2403.13807
Authors: Tianwei Xiong,Yue Wu,Enze Xie,Yue Wu,Zhenguo Li,Xihui Liu
Keywords: generating outdated, biased content, risk of generating, diffusion models suffer, massive concept editing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page: this https URL. Code: this https URL

Abstract:Text-to-image diffusion models suffer from the risk of generating outdated, copyrighted, incorrect, and biased content. While previous methods have mitigated the issues on a small scale, it is essential to handle them simultaneously in larger-scale real-world scenarios. We propose a two-stage method, Editing Massive Concepts In Diffusion Models (EMCID). The first stage performs memory optimization for each individual concept with dual self-distillation from text alignment loss and diffusion noise prediction loss. The second stage conducts massive concept editing with multi-layer, closed form model editing. We further propose a comprehensive benchmark, named ImageNet Concept Editing Benchmark (ICEB), for evaluating massive concept editing for T2I models with two subtasks, free-form prompts, massive concept categories, and extensive evaluation metrics. Extensive experiments conducted on our proposed benchmark and previous benchmarks demonstrate the superior scalability of EMCID for editing up to 1,000 concepts, providing a practical approach for fast adjustment and re-deployment of T2I diffusion models in real-world applications.

[AI-129] Beyond Myopia: Learning from Positive and Unlabeled Data through Holistic Predictive Trends

Link: https://arxiv.org/abs/2310.04078
Authors: Xinrui Wang,Wenhai Wan,Chuanxin Geng,Shaoyuan LI,Songcan Chen
Keywords: Learning binary classifiers, Learning binary, binary classifiers, PUL, verifying negative
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 25 pages

Abstract:Learning binary classifiers from positive and unlabeled data (PUL) is vital in many real-world applications, especially when verifying negative examples is difficult. Despite the impressive empirical performance of recent PUL methods, challenges like accumulated errors and increased estimation bias persist due to the absence of negative labels. In this paper, we unveil an intriguing yet long-overlooked observation in PUL: resampling the positive data in each training iteration to ensure a balanced distribution between positive and unlabeled examples results in strong early-stage performance. Furthermore, predictive trends for positive and negative classes display distinctly different patterns. Specifically, the scores (output probability) of unlabeled negative examples consistently decrease, while those of unlabeled positive examples show largely chaotic trends. Instead of focusing on classification within individual time frames, we innovatively adopt a holistic approach, interpreting the scores of each example as a temporal point process (TPP). This reformulates the core problem of PUL as recognizing trends in these scores. We then propose a novel TPP-inspired measure for trend detection and prove its asymptotic unbiasedness in predicting changes. Notably, our method accomplishes PUL without requiring additional parameter tuning or prior assumptions, offering an alternative perspective for tackling this problem. Extensive experiments verify the superiority of our method, particularly in a highly imbalanced real-world setting, where it achieves improvements of up to 11.3% in key metrics. The code is available at this https URL.
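
The resampling observation in this abstract can be illustrated with a minimal sketch. It assumes one plausible reading, namely that each training iteration draws as many positives (with replacement) as unlabeled examples; the function name `balanced_pu_batch` and its parameters are hypothetical and are not taken from the paper or its code.

```python
import numpy as np


def balanced_pu_batch(positive_idx, unlabeled_idx, n_unlabeled, rng):
    """Draw one training batch with equal numbers of positive and unlabeled examples.

    Assumption: positives are the scarcer group, so they are resampled
    with replacement to match the size of the unlabeled draw.
    """
    # Unlabeled examples are drawn without replacement within a batch.
    unlabeled_batch = rng.choice(unlabeled_idx, size=n_unlabeled, replace=False)
    # Positives are resampled afresh on every call, i.e. on every iteration.
    positive_batch = rng.choice(positive_idx, size=n_unlabeled, replace=True)
    return positive_batch, unlabeled_batch


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    positives = np.array([0, 1, 2])      # indices of labeled positives
    unlabeled = np.arange(3, 103)        # indices of unlabeled examples
    pos_batch, unl_batch = balanced_pu_batch(positives, unlabeled, 32, rng)
    print(len(pos_batch), len(unl_batch))
```

Calling the function once per iteration gives a fresh balanced batch each time; how the paper then tracks per-example scores across iterations is not covered by this sketch.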

[AI-130] The structure of the token space for large language models

Link: https://arxiv.org/abs/2410.08993
Authors: Michael Robinson,Sourya Dey,Shauna Sweet
Keywords: Large language models, high dimensional ambient, dimensional ambient latent, ambient latent space, language models encode
Categories: Differential Geometry (math.DG); Artificial Intelligence (cs.AI)
Comments: 33 pages, 22 figures

Abstract:Large language models encode the correlational structure present in natural language by fitting segments of utterances (tokens) into a high dimensional ambient latent space upon which the models then operate. We assert that in order to develop a foundational, first-principles understanding of the behavior and limitations of large language models, it is crucial to understand the topological and geometric structure of this token subspace. In this article, we present estimators for the dimension and Ricci scalar curvature of the token subspace, and apply them to three open source large language models of moderate size: GPT2, LLEMMA7B, and MISTRAL7B. In all three models, using these measurements, we find that the token subspace is not a manifold, but is instead a stratified manifold, where on each of the individual strata, the Ricci curvature is significantly negative. We additionally find that the dimension and curvature correlate with generative fluency of the models, which suggests that these findings have implications for model behavior.

[AI-131] Conditional Generative Models for Contrast-Enhanced Synthesis of T1w and T1 Maps in Brain MRI

Link: https://arxiv.org/abs/2410.08894
Authors: Moritz Piening,Fabian Altekrüger,Gabriele Steidl,Elke Hattingen,Eike Steidl
Keywords: Gadolinium-based contrast agents, Gadolinium-based contrast, diagnosis in neuroradiology, vital tool, tool for tumor
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)

Abstract:Contrast enhancement by Gadolinium-based contrast agents (GBCAs) is a vital tool for tumor diagnosis in neuroradiology. Based on brain MRI scans of glioblastoma before and after Gadolinium administration, we address enhancement prediction by neural networks with two new contributions. Firstly, we study the potential of generative models, more precisely conditional diffusion and flow matching, for uncertainty quantification in virtual enhancement. Secondly, we examine the performance of T1 scans from quantitative MRI versus T1-weighted scans. In contrast to T1-weighted scans, these scans have the advantage of a physically meaningful and thereby comparable voxel range. To compare network prediction performance of these two modalities with incompatible gray-value scales, we propose to evaluate segmentations of contrast-enhanced regions of interest using Dice and Jaccard scores. Across models, we observe better segmentations with T1 scans than with T1-weighted scans.
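
For readers unfamiliar with the two overlap scores named in this abstract, the sketch below computes the standard Dice and Jaccard coefficients for a pair of binary masks. It is not the authors' evaluation code; the function name and the convention that two empty masks count as perfect agreement are assumptions made for illustration.

```python
import numpy as np


def dice_and_jaccard(pred_mask, true_mask):
    """Return (Dice, Jaccard) overlap scores for two binary masks of equal shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    intersection = np.logical_and(pred, true).sum()
    union = np.logical_or(pred, true).sum()
    total = pred.sum() + true.sum()
    # Assumed convention: two empty masks are treated as perfect agreement.
    dice = 2.0 * intersection / total if total > 0 else 1.0
    jaccard = intersection / union if union > 0 else 1.0
    return float(dice), float(jaccard)


if __name__ == "__main__":
    predicted = [[1, 1], [0, 0]]   # two voxels predicted as enhanced
    reference = [[1, 0], [0, 0]]   # one voxel enhanced in the reference
    print(dice_and_jaccard(predicted, reference))
```

Both scores depend only on which voxels are labeled, not on their intensities, which is why they allow a comparison across modalities with incompatible gray-value scales.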

[AI-132] Symmetry-Constrained Generation of Diverse Low-Bandgap Molecules with Monte Carlo Tree Search

Link: https://arxiv.org/abs/2410.08833
Authors: Akshay Subramanian,James Damewood,Juno Nam,Kevin P. Greenman,Avni P. Singhal,Rafael Gómez-Bombarelli
Keywords: mechanical flexibility, next-generation electronic devices, electronic devices due, solution processability, promising avenue
Categories: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)

Abstract:Organic optoelectronic materials are a promising avenue for next-generation electronic devices due to their solution processability, mechanical flexibility, and tunable electronic properties. In particular, near-infrared (NIR) sensitive molecules have unique applications in night-vision equipment and biomedical imaging. Molecular engineering has played a crucial role in developing non-fullerene acceptors (NFAs) such as the Y-series molecules, which have significantly improved the power conversion efficiency (PCE) of solar cells and enhanced spectral coverage in the NIR region. However, systematically designing molecules with targeted optoelectronic properties while ensuring synthetic accessibility remains a challenge. To address this, we leverage structural priors from domain-focused, patent-mined datasets of organic electronic molecules using a symmetry-aware fragment decomposition algorithm and a fragment-constrained Monte Carlo Tree Search (MCTS) generator. Our approach generates candidates that retain symmetry constraints from the patent dataset, while also exhibiting red-shifted absorption, as validated by TD-DFT calculations.

[AI-133] radarODE-MTL: A Multi-Task Learning Framework with Eccentric Gradient Alignment for Robust Radar-Based ECG Reconstruction

Link: https://arxiv.org/abs/2410.08656
Authors: Yuanyuan Zhang,Rui Yang,Yutao Yue,Eng Gee Lim
Keywords: vital sign monitoring, Millimeter-wave radar, vital sign recovery, accurate vital sign, vital sign
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)

Abstract:Millimeter-wave radar is promising to provide robust and accurate vital sign monitoring in an unobtrusive manner. However, the radar signal might be distorted in propagation by ambient noise or random body movement, ruining the subtle cardiac activities and destroying the vital sign recovery. In particular, the recovery of electrocardiogram (ECG) signal heavily relies on the deep-learning model and is sensitive to noise. Therefore, this work creatively deconstructs the radar-based ECG recovery into three individual tasks and proposes a multi-task learning (MTL) framework, radarODE-MTL, to increase the robustness against consistent and abrupt noises. In addition, to alleviate the potential conflicts in optimizing individual tasks, a novel multi-task optimization strategy, eccentric gradient alignment (EGA), is proposed to dynamically trim the task-specific gradients based on task difficulties in orthogonal space. The proposed radarODE-MTL with EGA is evaluated on the public dataset with prominent improvements in accuracy, and the performance remains consistent under noises. The experimental results indicate that radarODE-MTL could reconstruct accurate ECG signals robustly from radar signals and imply the application prospect in real-life situations. The code is available at: this http URL.

[AI-134] SOAK: Same/Other/All K-fold cross-validation for estimating similarity of patterns in data subsets

Link: https://arxiv.org/abs/2410.08643
Authors: Toby Dylan Hocking,Gabrielle Thibault,Cameron Scott Bodine,Paul Nelson Arellano,Alexander F Shenkin,Olivia Jasmine Lindly
Keywords: time period, machine learning, real-world applications, applications of machine, data
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:In many real-world applications of machine learning, we are interested to know if it is possible to train on the data that we have gathered so far, and obtain accurate predictions on a new test data subset that is qualitatively different in some respect (time period, geographic region, etc). Another question is whether data subsets are similar enough so that it is beneficial to combine subsets during model training. We propose SOAK, Same/Other/All K-fold cross-validation, a new method which can be used to answer both questions. SOAK systematically compares models which are trained on different subsets of data, and then used for prediction on a fixed test subset, to estimate the similarity of learnable/predictable patterns in data subsets. We show results of using SOAK on six new real data sets (with geographic/temporal subsets, to check if predictions are accurate on new subsets), 3 image pair data sets (subsets are different image types, to check that we get smaller prediction error on similar images), and 11 benchmark data sets with predefined train/test splits (to check similarity of predefined splits).
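
The Same/Other/All idea can be sketched as a split generator. This is an illustrative reading of the abstract, not the authors' implementation: the function name `soak_splits`, the round-robin fold assignment, and the choice to exclude the held-out fold from every training set are assumptions.

```python
from collections import defaultdict


def soak_splits(subset_labels, n_folds=3):
    """Yield (test_subset, fold, train_kind, train_idx, test_idx) tuples.

    train_kind is "same" (train on the test subset's other folds),
    "other" (train on the remaining subsets), or "all" (both combined).
    """
    by_subset = defaultdict(list)
    for idx, label in enumerate(subset_labels):
        by_subset[label].append(idx)

    # Assumed fold assignment: round-robin within each subset.
    fold_of = {}
    for indices in by_subset.values():
        for position, idx in enumerate(indices):
            fold_of[idx] = position % n_folds

    for test_subset in sorted(by_subset):
        for fold in range(n_folds):
            test_idx = [i for i in by_subset[test_subset] if fold_of[i] == fold]
            same = [i for i in by_subset[test_subset] if fold_of[i] != fold]
            other = [
                i
                for label in sorted(by_subset)
                if label != test_subset
                for i in by_subset[label]
                if fold_of[i] != fold
            ]
            for kind, train_idx in (
                ("same", same),
                ("other", other),
                ("all", sorted(same + other)),
            ):
                yield test_subset, fold, kind, train_idx, test_idx


if __name__ == "__main__":
    labels = ["a"] * 6 + ["b"] * 6   # two hypothetical data subsets
    for split in soak_splits(labels, n_folds=3):
        print(split)
```

Fitting the same model on each of the three training sets and comparing their errors on the fixed test fold then indicates whether the other subsets carry patterns that transfer to the test subset.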

[AI-135] CryoFM: A Flow-based Foundation Model for Cryo-EM Densities

Link: https://arxiv.org/abs/2410.08631
Authors: Yi Zhou,Yilai Li,Jing Yuan,Quanquan Gu
Keywords: drug discovery, high resolution, density maps, powerful technique, biology and drug
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)

Abstract:Cryo-electron microscopy (cryo-EM) is a powerful technique in structural biology and drug discovery, enabling the study of biomolecules at high resolution. Significant advancements by structural biologists using cryo-EM have led to the production of over 38,626 protein density maps at various resolutions. However, cryo-EM data processing algorithms have yet to fully benefit from our knowledge of biomolecular density maps, with only a few recent models being data-driven but limited to specific tasks. In this study, we present CryoFM, a foundation model designed as a generative model, learning the distribution of high-quality density maps and generalizing effectively to downstream tasks. Built on flow matching, CryoFM is trained to accurately capture the prior distribution of biomolecular density maps. Furthermore, we introduce a flow posterior sampling method that leverages CryoFM as a flexible prior for several downstream tasks in cryo-EM and cryo-electron tomography (cryo-ET) without the need for fine-tuning, achieving state-of-the-art performance on most tasks and demonstrating its potential as a foundational model for broader applications in these fields.

[AI-136] ViT3D Alignment of LLaMA3: 3D Medical Image Report Generation

Link: https://arxiv.org/abs/2410.08588
Authors: Siyou Li,Beining Xu,Yihao Luo,Dong Nie,Le Zhang
Keywords: medical report generation, Automatic medical report, produce detailed text, detailed text reports, automatic MRG
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Automatic medical report generation (MRG), which aims to produce detailed text reports from medical images, has emerged as a critical task in this domain. MRG systems can enhance radiological workflows by reducing the time and effort required for report writing, thereby improving diagnostic efficiency. In this work, we present a novel approach for automatic MRG utilizing a multimodal large language model. Specifically, we employ the 3D Vision Transformer (ViT3D) image encoder introduced from M3D-CLIP to process 3D scans and use the Asclepius-Llama3-8B as the language model to generate the text reports by auto-regressive decoding. The experiment shows that our model achieved an average Green score of 0.3 on the MRG task validation set and an average accuracy of 0.61 on the visual question answering (VQA) task validation set, outperforming the baseline model. Our approach demonstrates the effectiveness of the ViT3D alignment of LLaMA3 for automatic MRG and VQA tasks by tuning the model on a small dataset.

[AI-137] VoxelPrompt: A Vision-Language Agent for Grounded Medical Image Analysis

Link: https://arxiv.org/abs/2410.08397
Authors: Andrew Hoopes,Victor Ion Butoi,John V. Guttag,Adrian V. Dalca
Keywords: agent-driven vision-language framework, tackles diverse radiological, analytical metrics, agent-driven vision-language, joint modeling
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 5 figures, vision-language agent, medical image analysis, neuroimage foundation model

Abstract:We present VoxelPrompt, an agent-driven vision-language framework that tackles diverse radiological tasks through joint modeling of natural language, image volumes, and analytical metrics. VoxelPrompt is multi-modal and versatile, leveraging the flexibility of language interaction while providing quantitatively grounded image analysis. Given a variable number of 3D medical volumes, such as MRI and CT scans, VoxelPrompt employs a language agent that iteratively predicts executable instructions to solve a task specified by an input prompt. These instructions communicate with a vision network to encode image features and generate volumetric outputs (e.g., segmentations). VoxelPrompt interprets the results of intermediate instructions and plans further actions to compute discrete measures (e.g., tumor growth across a series of scans) and present relevant outputs to the user. We evaluate this framework in a sandbox of diverse neuroimaging tasks, and we show that the single VoxelPrompt model can delineate hundreds of anatomical and pathological features, measure many complex morphological properties, and perform open-language analysis of lesion characteristics. VoxelPrompt carries out these objectives with accuracy similar to that of fine-tuned, single-task models for segmentation and visual question-answering, while facilitating a much larger range of tasks. Therefore, by supporting accurate image processing with language interaction, VoxelPrompt provides comprehensive utility for numerous imaging tasks that traditionally require specialized models to address.

[AI-138] Exploring ASR-Based Wav2Vec2 for Automated Speech Disorder Assessment: Insights and Analysis

Link: https://arxiv.org/abs/2410.08250
Authors: Tuan Nguyen,Corinne Fredouille,Alain Ghio,Mathieu Balaguer,Virginie Woisard
Keywords: Neck Cancer speech, Cancer speech contexts, Head and Neck, Neck Cancer, yielding impressive results
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted at the Spoken Language Technology (SLT) Conference 2024

Abstract:With the rise of SSL and ASR technologies, the Wav2Vec2 ASR-based model has been fine-tuned for automated speech disorder quality assessment tasks, yielding impressive results and setting a new baseline for Head and Neck Cancer speech contexts. This demonstrates that the ASR dimension from Wav2Vec2 closely aligns with assessment dimensions. Despite its effectiveness, this system remains a black box with no clear interpretation of the connection between the model ASR dimension and clinical assessments. This paper presents the first analysis of this baseline model for speech quality assessment, focusing on intelligibility and severity tasks. We conduct a layer-wise analysis to identify key layers and compare different SSL and ASR Wav2Vec2 models based on pre-trained data. Additionally, post-hoc XAI methods, including Canonical Correlation Analysis (CCA) and visualization techniques, are used to track model evolution and visualize embeddings for enhanced interpretability.

[AI-139] A Survey of Spatio-Temporal EEG data Analysis: from Models to Applications

Link: https://arxiv.org/abs/2410.08224
Authors: Pengfei Wang,Huanran Zheng,Silong Dai,Yiqiao Wang,Xiaotian Gu,Yuanbin Wu,Xiaoling Wang
Keywords: witnessed remarkable advancements, recent years, field of electroencephalography, analysis has witnessed, remarkable advancements
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: submitted to IECE Chinese Journal of Information Fusion

Abstract:In recent years, the field of electroencephalography (EEG) analysis has witnessed remarkable advancements, driven by the integration of machine learning and artificial intelligence. This survey aims to encapsulate the latest developments, focusing on emerging methods and technologies that are poised to transform our comprehension and interpretation of brain activity. We delve into self-supervised learning methods that enable the robust representation of brain signals, which are fundamental for a variety of downstream applications. We also explore emerging discriminative methods, including graph neural networks (GNN), foundation models, and large language models (LLMs)-based approaches. Furthermore, we examine generative technologies that harness EEG data to produce images or text, offering novel perspectives on brain activity visualization and interpretation. The survey provides an extensive overview of these cutting-edge techniques, their current applications, and the profound implications they hold for future research and clinical practice. The relevant literature and open-source materials have been compiled and are consistently being refreshed at this https URL.

[AI-140] Embedding an ANN-Based Crystal Plasticity Model into the Finite Element Framework using an ABAQUS User-Material Subroutine

Link: https://arxiv.org/abs/2410.08214
Authors: Yuqing He,Yousef Heider,Bernd Markert
Keywords: trained Neural Networks, Neural Networks, Finite Element, incorporating trained Neural, trained Neural
Categories: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: 11 pages, 7 figures

Abstract:This manuscript presents a practical method for incorporating trained Neural Networks (NNs) into the Finite Element (FE) framework using a user material (UMAT) subroutine. The work exemplifies crystal plasticity, a complex inelastic non-linear path-dependent material response, with a wide range of applications in ABAQUS UMAT. However, this approach can be extended to other material behaviors and FE tools. The use of a UMAT subroutine serves two main purposes: (1) it predicts and updates the stress or other mechanical properties of interest directly from the strain history; (2) it computes the Jacobian matrix either through backpropagation or numerical differentiation, which plays an essential role in the solution convergence. By implementing NNs in a UMAT subroutine, a trained machine learning model can be employed as a data-driven constitutive law within the FEM framework, preserving multiscale information that conventional constitutive laws often neglect or average. The versatility of this method makes it a powerful tool for integrating machine learning into mechanical simulation. While this approach is expected to provide higher accuracy in reproducing realistic material behavior, the reliability of the solution process and the convergence conditions require special attention. While the theory of the model is explained in [Heider et al. 2020], exemplary source code is also made available for interested readers [this https URL].
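
The numerical-differentiation route to the Jacobian mentioned in point (2) can be sketched with central finite differences. This is a Python illustration only: an actual ABAQUS UMAT is a Fortran subroutine, and the trained network is replaced here by a hypothetical linear-elastic stand-in (`toy_stress`, `toy_stiffness`, and the step size are all assumptions, not taken from the paper).

```python
import numpy as np


def numerical_jacobian(stress_fn, strain, step=1e-6):
    """Approximate d(stress)/d(strain) column by column with central differences."""
    strain = np.asarray(strain, dtype=float)
    n_stress = np.asarray(stress_fn(strain)).size
    jacobian = np.zeros((n_stress, strain.size))
    for j in range(strain.size):
        delta = np.zeros_like(strain)
        delta[j] = step
        forward = np.asarray(stress_fn(strain + delta))
        backward = np.asarray(stress_fn(strain - delta))
        jacobian[:, j] = (forward - backward) / (2.0 * step)
    return jacobian


def toy_stiffness(lam=1.0, mu=0.5):
    """Isotropic stiffness in Voigt notation (engineering shear strains)."""
    stiffness = np.zeros((6, 6))
    stiffness[:3, :3] = lam
    stiffness[np.arange(3), np.arange(3)] += 2.0 * mu
    stiffness[np.arange(3, 6), np.arange(3, 6)] = mu
    return stiffness


def toy_stress(strain):
    """Hypothetical stand-in for a trained network mapping strain to stress."""
    return toy_stiffness() @ np.asarray(strain, dtype=float)


if __name__ == "__main__":
    print(numerical_jacobian(toy_stress, np.zeros(6)))
```

For the linear stand-in the finite-difference result reproduces the stiffness matrix; for a path-dependent network the stress function would also depend on the strain history, and backpropagation is the alternative the abstract names.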

[AI-141] A Review of Electromagnetic Elimination Methods for low-field portable MRI scanner

Link: https://arxiv.org/abs/2406.17804
Authors: Wanyu Bian
Keywords: eliminating electromagnetic interference, deep learning, deep learning methods, EMI, EMI elimination
Categories: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Abstract:This paper presents a comprehensive analysis of both conventional and deep learning methods for eliminating electromagnetic interference (EMI) in MRI systems. We explore the underlying principles and implementation of traditional analytical and adaptive EMI elimination techniques, as well as cutting-edge deep learning approaches. Through a detailed comparison, the strengths and limitations of each method are highlighted. Recent advancements in active EMI elimination utilizing multiple external EMI receiver coils and analytical techniques are discussed alongside the superior performance of deep learning methods, which leverage neural networks trained on extensive MRI data. While deep learning methods demonstrate significant improvements in EMI suppression, enhancing diagnostic capabilities and accessibility of MRI technology, they also introduce potential security and safety concerns, especially in production and commercial applications. This study underscores the need to address these challenges to fully realize the benefits of deep learning in EMI elimination. The findings suggest a balanced approach, combining the reliability of conventional methods with the advanced capabilities of deep learning, to develop more robust and effective EMI suppression strategies in MRI systems.

Computer Vision

[CV-0] SceneCraft: Layout-Guided 3D Scene Generation NEURIPS2024

Link: https://arxiv.org/abs/2410.09049
Authors: Xiuyu Yang,Yunze Man,Jun-Kun Chen,Yu-Xiong Wang
Keywords: modeling tools, task with traditional, tedious and challenging, challenging task, user specifications
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2024. Code: this https URL Project Page: this https URL

Abstract:The creation of complex 3D scenes tailored to user specifications has been a tedious and challenging task with traditional 3D modeling tools. Although some pioneering methods have achieved automatic text-to-3D generation, they are generally limited to small-scale scenes with restricted control over the shape and texture. We introduce SceneCraft, a novel method for generating detailed indoor scenes that adhere to textual descriptions and spatial layout preferences provided by users. Central to our method is a rendering-based technique, which converts 3D semantic layouts into multi-view 2D proxy maps. Furthermore, we design a semantic and depth conditioned diffusion model to generate multi-view images, which are used to learn a neural radiance field (NeRF) as the final scene representation. Without the constraints of panorama image generation, we surpass previous methods in supporting complicated indoor space generation beyond a single room, even as complicated as a whole multi-bedroom apartment with irregular shapes and layouts. Through experimental analysis, we demonstrate that our method significantly outperforms existing approaches in complex indoor scene generation with diverse textures, consistent geometry, and realistic visual quality. Code and more results are available at: this https URL

[CV-1] MiRAGeNews: Multimodal Realistic AI-Generated News Detection EMNLP2024

Link: https://arxiv.org/abs/2410.09045
Authors: Runsheng Huang,Liam Dugan,Yue Yang,Chris Callison-Burch
Keywords: inflammatory or misleading, recent years, proliferation of inflammatory, increasingly common, common in recent
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: EMNLP 2024 Findings

Abstract:The proliferation of inflammatory or misleading “fake” news content has become increasingly common in recent years. Simultaneously, it has become easier than ever to use AI tools to generate photorealistic images depicting any scene imaginable. Combining these two – AI-generated fake news content – is particularly potent and dangerous. To combat the spread of AI-generated fake news, we propose the MiRAGeNews Dataset, a dataset of 12,500 high-quality real and AI-generated image-caption pairs from state-of-the-art generators. We find that our dataset poses a significant challenge to humans (60% F-1) and state-of-the-art multi-modal LLMs (24% F-1). Using our dataset we train a multi-modal detector (MiRAGe) that improves by +5.1% F-1 over state-of-the-art baselines on image-caption pairs from out-of-domain image generators and news publishers. We release our code and data to aid future work on detecting AI-generated content.

[CV-2] Alberta Wells Dataset: Pinpointing Oil and Gas Wells from Satellite Imagery

Link: https://arxiv.org/abs/2410.09032
Authors: Pratinav Seth,Michelle Lin,Brefo Dwamena Yaw,Jade Boutot,Mary Kang,David Rolnick
Keywords: leaching methane, oil and gas, atmosphere and toxic, toxic compounds, Alberta Energy Regulator
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Millions of abandoned oil and gas wells are scattered across the world, leaching methane into the atmosphere and toxic compounds into the groundwater. Many of these locations are unknown, preventing the wells from being plugged and their polluting effects averted. Remote sensing is a relatively unexplored tool for pinpointing abandoned wells at scale. We introduce the first large-scale benchmark dataset for this problem, leveraging medium-resolution multi-spectral satellite imagery from Planet Labs. Our curated dataset comprises over 213,000 wells (abandoned, suspended, and active) from Alberta, a region with especially high well density, sourced from the Alberta Energy Regulator and verified by domain experts. We evaluate baseline algorithms for well detection and segmentation, showing the promise of computer vision approaches but also significant room for improvement.

[CV-3] CVAM-Pose: Conditional Variational Autoencoder for Multi-Object Monocular Pose Estimation BMVC2024

Link: https://arxiv.org/abs/2410.09010
Authors: Jianyu Zhao,Wei Quan,Bogdan J. Matuszewski
Keywords: Estimating rigid objects’, Estimating rigid, rigid objects’ poses, computer vision, augmented reality
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: BMVC 2024, oral presentation, the main paper and supplementary materials are included

Abstract:Estimating rigid objects’ poses is one of the fundamental problems in computer vision, with a range of applications across automation and augmented reality. Most existing approaches adopt a one-network-per-object-class strategy, depend heavily on objects’ 3D models and depth data, and employ a time-consuming iterative refinement, which could be impractical for some applications. This paper presents a novel approach, CVAM-Pose, for multi-object monocular pose estimation that addresses these limitations. The CVAM-Pose method employs a label-embedded conditional variational autoencoder network to implicitly abstract regularised representations of multiple objects in a single low-dimensional latent space. This autoencoding process uses only images captured by a projective camera and is robust to objects’ occlusion and scene clutter. The classes of objects are one-hot encoded and embedded throughout the network. The proposed label-embedded pose regression strategy interprets the learnt latent space representations utilising continuous pose representations. Ablation tests and systematic evaluations demonstrate the scalability and efficiency of the CVAM-Pose method for multi-object scenarios. The proposed CVAM-Pose outperforms competing latent space approaches. For example, it is respectively 25% and 20% better than AAE and Multi-Path methods, when evaluated using the $\mathrm{AR}_{\mathrm{VSD}}$ metric on the Linemod-Occluded dataset. It also achieves results somewhat comparable to methods reliant on 3D models reported in BOP challenges. Code available: this https URL

[CV-4] Semantic Score Distillation Sampling for Compositional Text-to-3D Generation

Link: https://arxiv.org/abs/2410.09009
Authors: Ling Yang, Zixiang Zhang, Junlin Han, Bohan Zeng, Runjia Li, Philip Torr, Wentao Zhang
Keywords: textual descriptions remains, Score Distillation Sampling, Distillation Sampling, assets from textual, vision research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project: this https URL

Abstract:Generating high-quality 3D assets from textual descriptions remains a pivotal challenge in computer graphics and vision research. Due to the scarcity of 3D data, state-of-the-art approaches utilize pre-trained 2D diffusion priors, optimized through Score Distillation Sampling (SDS). Despite progress, crafting complex 3D scenes featuring multiple objects or intricate interactions is still difficult. To tackle this, recent methods have incorporated box or layout guidance. However, these layout-guided compositional methods often struggle to provide fine-grained control, as they are generally coarse and lack expressiveness. To overcome these challenges, we introduce a novel SDS approach, Semantic Score Distillation Sampling (SemanticSDS), designed to effectively improve the expressiveness and accuracy of compositional text-to-3D generation. Our approach integrates new semantic embeddings that maintain consistency across different rendering views and clearly differentiate between various objects and parts. These embeddings are transformed into a semantic map, which directs a region-specific SDS process, enabling precise optimization and compositional generation. By leveraging explicit semantic guidance, our method unlocks the compositional capabilities of existing pre-trained diffusion models, thereby achieving superior quality in 3D content generation, particularly for complex objects and scenes. Experimental results demonstrate that our SemanticSDS framework is highly effective for generating state-of-the-art complex 3D content. Code: this https URL

[CV-5] DA-Ada: Learning Domain-Aware Adapter for Domain Adaptive Object Detection NEURIPS2024

Link: https://arxiv.org/abs/2410.09004
Authors: Haochen Li, Rui Zhang, Hantao Yao, Xin Zhang, Yifan Hao, Xinkai Song, Xiaqing Li, Yongwei Zhao, Ling Li, Yunji Chen
Keywords: generalize detectors trained, annotated source domain, Domain, knowledge, aims to generalize
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2024

Abstract:Domain adaptive object detection (DAOD) aims to generalize detectors trained on an annotated source domain to an unlabelled target domain. As the visual-language models (VLMs) can provide essential general knowledge on unseen images, freezing the visual encoder and inserting a domain-agnostic adapter can learn domain-invariant knowledge for DAOD. However, the domain-agnostic adapter is inevitably biased to the source domain. It discards some beneficial knowledge discriminative on the unlabelled domain, i.e., domain-specific knowledge of the target domain. To solve the issue, we propose a novel Domain-Aware Adapter (DA-Ada) tailored for the DAOD task. The key point is exploiting domain-specific knowledge between the essential general knowledge and domain-invariant knowledge. DA-Ada consists of the Domain-Invariant Adapter (DIA) for learning domain-invariant knowledge and the Domain-Specific Adapter (DSA) for injecting the domain-specific knowledge from the information discarded by the visual encoder. Comprehensive experiments over multiple DAOD tasks show that DA-Ada can efficiently infer a domain-aware visual encoder for boosting domain adaptive object detection. Our code is available at this https URL.

[CV-6] DEL: Discrete Element Learner for Learning 3D Particle Dynamics with Neural Rendering

Link: https://arxiv.org/abs/2410.08983
Authors: Jiaxu Wang, Jingkai Sun, Junhao He, Ziyi Zhang, Qiang Zhang, Mingyuan Sun, Renjing Xu
Keywords: show great potential, simulating particle dynamics, great potential, potential for simulating, per-particle correspondences
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)

Abstract:Learning-based simulators show great potential for simulating particle dynamics when 3D groundtruth is available, but per-particle correspondences are not always accessible. The development of neural rendering presents a new solution to this field to learn 3D dynamics from 2D images by inverse rendering. However, existing approaches still suffer from ill-posed natures resulting from the 2D to 3D uncertainty, for example, specific 2D images can correspond with various 3D particle distributions. To mitigate such uncertainty, we consider a conventional, mechanically interpretable framework as the physical priors and extend it to a learning-based version. In brief, we incorporate the learnable graph kernels into the classic Discrete Element Analysis (DEA) framework to implement a novel mechanics-integrated learning system. In this case, the graph network kernels are only used for approximating some specific mechanical operators in the DEA framework rather than the whole dynamics mapping. By integrating the strong physics priors, our methods can effectively learn the dynamics of various materials from the partial 2D observations in a unified manner. Experiments show that our approach outperforms other learned simulators by a large margin in this context and is robust to different renderers, fewer training samples, and fewer camera views.

[CV-7] Rapid Grassmannian Averaging with Chebyshev Polynomials ICLR2025

Link: https://arxiv.org/abs/2410.08956
Authors: Brighton Ancelin, Alex Saad-Falcon, Kason Ancelin, Justin Romberg
Keywords: Rapid Grassmannian Averaging, Decentralized Rapid Grassmannian, Grassmannian Averaging, Rapid Grassmannian, Grassmannian
Categories: Numerical Analysis (math.NA); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Submitted to ICLR 2025

Abstract:We propose new algorithms to efficiently average a collection of points on a Grassmannian manifold in both the centralized and decentralized settings. Grassmannian points are used ubiquitously in machine learning, computer vision, and signal processing to represent data through (often low-dimensional) subspaces. While averaging these points is crucial to many tasks (especially in the decentralized setting), existing methods unfortunately remain computationally expensive due to the non-Euclidean geometry of the manifold. Our proposed algorithms, Rapid Grassmannian Averaging (RGrAv) and Decentralized Rapid Grassmannian Averaging (DRGrAv), overcome this challenge by leveraging the spectral structure of the problem to rapidly compute an average using only small matrix multiplications and QR factorizations. We provide a theoretical guarantee of optimality and present numerical experiments which demonstrate that our algorithms outperform state-of-the-art methods in providing high accuracy solutions in minimal time. Additional experiments showcase the versatility of our algorithms to tasks such as K-means clustering on video motion data, establishing RGrAv and DRGrAv as powerful tools for generic Grassmannian averaging.
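To make the "small matrix multiplications and QR factorizations" idea concrete, here is a minimal sketch of a plain orthogonal (subspace) iteration for the chordal mean of subspaces. This is not RGrAv or DRGrAv: it omits the Chebyshev-polynomial acceleration described in the abstract, and the function names, the choice of the chordal mean as the averaging criterion, and the iteration count are assumptions made for illustration only.

```python
import numpy as np

def orthonormal_basis(a):
    """Return an orthonormal basis for the column space of `a` via reduced QR."""
    q, _ = np.linalg.qr(a)
    return q

def chordal_mean_orth_iter(points, iters=100, seed=0):
    """Approximate the chordal mean of k-dimensional subspaces of R^n.

    points: list of (n, k) matrices with orthonormal columns.
    The chordal mean is spanned by the top-k eigenvectors of sum_i U_i U_i^T.
    The n x n sum is never formed: each step uses only thin products and one QR.
    NOTE: plain orthogonal iteration, shown as a baseline (hypothetical helper).
    """
    n, k = points[0].shape
    rng = np.random.default_rng(seed)
    x = orthonormal_basis(rng.standard_normal((n, k)))
    for _ in range(iters):
        # Apply sum_i U_i U_i^T to x through thin (n, k) and (k, k) products.
        y = sum(u @ (u.T @ x) for u in points)
        x = orthonormal_basis(y)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    base = orthonormal_basis(rng.standard_normal((20, 3)))
    pts = [orthonormal_basis(base + 0.05 * rng.standard_normal((20, 3)))
           for _ in range(10)]
    mean_basis = chordal_mean_orth_iter(pts)
    print(mean_basis.shape)
```

Convergence of this baseline depends on the gap between the k-th and (k+1)-th eigenvalues of the summed projectors, which is the kind of spectral structure a polynomial acceleration scheme can exploit.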

[CV-8] Parallel Watershed Partitioning: GPU-Based Hierarchical Image Segmentation

Link: https://arxiv.org/abs/2410.08946
Authors: Varduhi Yeghiazaryan, Yeva Gabrielyan, Irina Voiculescu
Keywords: processing applications rely, applications rely, Abstract, similar, image
Categories: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:Many image processing applications rely on partitioning an image into disjoint regions whose pixels are ‘similar.’ The watershed and waterfall transforms are established mathematical morphology pixel clustering techniques. They are both relevant to modern applications where groups of pixels are to be decided upon in one go, or where adjacency information is relevant. We introduce three new parallel partitioning algorithms for GPUs. By repeatedly applying watershed algorithms, we produce waterfall results which form a hierarchy of partition regions over an input image. Our watershed algorithms attain competitive execution times in both 2D and 3D, processing an 800 megavoxel image in less than 1.4 sec. We also show how to use this fully deterministic image partitioning as a pre-processing step to machine learning based semantic segmentation. This replaces the role of superpixel algorithms, and results in comparable accuracy and faster training times.

[CV-9] MeshGS: Adaptive Mesh-Aligned Gaussian Splatting for High-Quality Rendering ACCV

Link: https://arxiv.org/abs/2410.08941
Authors: Jaehoon Choi, Yonghan Lee, Hyungtae Lee, Heesung Kwon, Dinesh Manocha
Keywords: Gaussian splats, Gaussian, Gaussian splatting, loosely-bound Gaussian splats, splats
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ACCV (Asian Conference on Computer Vision) 2024

Abstract: Recently, 3D Gaussian splatting has gained attention for its capability to generate high-fidelity rendering results. At the same time, most applications such as games, animation, and AR/VR use mesh-based representations to represent and render 3D scenes. We propose a novel approach that integrates mesh representation with 3D Gaussian splats to perform high-quality rendering of reconstructed real-world scenes. In particular, we introduce a distance-based Gaussian splatting technique to align the Gaussian splats with the mesh surface and remove redundant Gaussian splats that do not contribute to the rendering. We consider the distance between each Gaussian splat and the mesh surface to distinguish between tightly-bound and loosely-bound Gaussian splats. The tightly-bound splats are flattened and aligned well with the mesh geometry. The loosely-bound Gaussian splats are used to account for the artifacts in reconstructed 3D meshes in terms of rendering. We present a training strategy of binding Gaussian splats to the mesh geometry that takes both types of splats into account. In this context, we introduce several regularization techniques aimed at precisely aligning tightly-bound Gaussian splats with the mesh surface during the training process. We validate the effectiveness of our method on large, unbounded scenes from the mip-NeRF 360 and Deep Blending datasets. Our method surpasses recent mesh-based neural rendering techniques by achieving a 2 dB higher PSNR, and outperforms mesh-based Gaussian splatting methods by 1.3 dB PSNR, particularly on the outdoor mip-NeRF 360 dataset, demonstrating better rendering quality. We provide analyses for each type of Gaussian splat and achieve a reduction in the number of Gaussian splats by 30% compared to the original 3D Gaussian splatting.
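To put the reported PSNR gains in perspective, the sketch below uses the standard definition PSNR = 10 log10(MAX^2 / MSE) and converts a gain in dB into the implied ratio of mean squared errors. The function names are hypothetical, and the conversion assumes PSNR is computed from a single aggregate MSE; benchmark numbers averaged per image will only follow it approximately.

```python
import math

def psnr(mse, max_val=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_val ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """MSE_new / MSE_old implied by a PSNR improvement of `gain_db` dB."""
    return 10.0 ** (-gain_db / 10.0)

if __name__ == "__main__":
    for gain in (2.0, 1.3):
        print(gain, "dB gain -> MSE ratio", mse_ratio_for_gain(gain))
```

Under that assumption, a 2 dB gain corresponds to an MSE ratio of about 0.63 and a 1.3 dB gain to about 0.74.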

[CV-10] Zero-Shot Pupil Segmentation with SAM 2: A Case Study of Over 14 Million Images

Link: https://arxiv.org/abs/2410.08926
Authors: Virmarie Maquiling, Sean Anthony Byrne, Diederick C. Niehorster, Marco Carminati, Enkelejda Kasneci
Keywords: advancing gaze estimation, eye tracking technologies, vision foundation model, tracking technologies, explore the transformative
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Virmarie Maquiling and Sean Anthony Byrne contributed equally to this paper, 8 pages, 3 figures, CHI Case Study, pre-print

Abstract: We explore the transformative potential of SAM 2, a vision foundation model, in advancing gaze estimation and eye tracking technologies. By significantly reducing annotation time, lowering technical barriers through its ease of deployment, and enhancing segmentation accuracy, SAM 2 addresses critical challenges faced by researchers and practitioners. Utilizing its zero-shot segmentation capabilities with minimal user input (a single click per video), we tested SAM 2 on over 14 million eye images from diverse datasets, including virtual reality setups and the world’s largest unified dataset recorded using wearable eye trackers. Remarkably, in pupil segmentation tasks, SAM 2 matches the performance of domain-specific models trained solely on eye images, achieving competitive mean Intersection over Union (mIoU) scores of up to 93% without fine-tuning. Additionally, we provide our code and segmentation masks for these widely used datasets to promote further research.
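For readers unfamiliar with the metric quoted above, the following is a minimal sketch of Intersection over Union on binary pupil masks. It assumes mIoU is the plain average of per-frame IoU values and that two empty masks count as a perfect match; the paper may aggregate differently, and the function names are illustrative.

```python
import numpy as np

def iou(pred, target):
    """Intersection over Union of two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Both masks are empty: treated here as a perfect match (a convention).
        return 1.0
    return float(np.logical_and(pred, target).sum() / union)

def mean_iou(preds, targets):
    """Average IoU over paired masks (assumed here to be per-frame masks)."""
    return float(np.mean([iou(p, t) for p, t in zip(preds, targets)]))

if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    target = [[1, 0], [0, 0]]
    print(iou(pred, target))
```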

[CV-11] HyperPg – Prototypical Gaussians on the Hypersphere for Interpretable Deep Learning

Link: https://arxiv.org/abs/2410.08925
Authors: Maximilian Xiling Li, Korbinian Franz Rudolf, Nils Blank, Rudolf Lioutikov
Keywords: interpretable alternative, black-box deep learning, Prototype Learning methods, Learning methods provide, deep learning models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract: Prototype Learning methods provide an interpretable alternative to black-box deep learning models. Approaches such as ProtoPNet learn which parts of a test image “look like” known prototypical parts from training images, combining predictive power with the inherent interpretability of case-based reasoning. However, existing approaches have two main drawbacks: A) They rely solely on deterministic similarity scores without statistical confidence. B) The prototypes are learned in a black-box manner without human input. This work introduces HyperPg, a new prototype representation leveraging Gaussian distributions on a hypersphere in latent space, with learnable mean and variance. HyperPg prototypes adapt to the spread of clusters in the latent space and output likelihood scores. The new architecture, HyperPgNet, leverages HyperPg to learn prototypes aligned with human concepts from pixel-level annotations. Consequently, each prototype represents a specific concept such as color, image texture, or part of the image subject. A concept extraction pipeline built on foundation models provides pixel-level annotations, significantly reducing human labeling effort. Experiments on CUB-200-2011 and Stanford Cars datasets demonstrate that HyperPgNet outperforms other prototype learning architectures while using fewer parameters and training steps. Additionally, the concept-aligned HyperPg prototypes are learned transparently, enhancing model interpretability.

[CV-12] Efficient Hyperparameter Importance Assessment for CNNs

Link: https://arxiv.org/abs/2410.08920
Authors: Ruinan Wang, Ian Nabney, Mohammad Golbabaee
Keywords: impacting models’ robustness, profoundly impacting models’, machine learning pipeline, Convolutional Neural Networks, profoundly impacting
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages

Abstract:Hyperparameter selection is an essential aspect of the machine learning pipeline, profoundly impacting models’ robustness, stability, and generalization capabilities. Given the complex hyperparameter spaces associated with Neural Networks and the constraints of computational resources and time, optimizing all hyperparameters becomes impractical. In this context, leveraging hyperparameter importance assessment (HIA) can provide valuable guidance by narrowing down the search space. This enables machine learning practitioners to focus their optimization efforts on the hyperparameters with the most significant impact on model performance while conserving time and resources. This paper aims to quantify the importance weights of some hyperparameters in Convolutional Neural Networks (CNNs) with an algorithm called N-RReliefF, laying the groundwork for applying HIA methodologies in the Deep Learning field. We conduct an extensive study by training over ten thousand CNN models across ten popular image classification datasets, thereby acquiring a comprehensive dataset containing hyperparameter configuration instances and their corresponding performance metrics. It is demonstrated that among the investigated hyperparameters, the top five important hyperparameters of the CNN model are the number of convolutional layers, learning rate, dropout rate, optimizer and epoch.

[CV-13] Calibrated Cache Model for Few-Shot Vision-Language Model Adaptation

Link: https://arxiv.org/abs/2410.08895
Authors: Kun Ding, Qiang Yu, Haojian Zhang, Gaofeng Meng, Shiming Xiang
Keywords: Cache-based approaches stand, adapting vision-language models, Cache-based approaches, cache model, existing cache model
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: submitted to IJCV

Abstract:Cache-based approaches stand out as both effective and efficient for adapting vision-language models (VLMs). Nonetheless, the existing cache model overlooks three crucial aspects. 1) Pre-trained VLMs are mainly optimized for image-text similarity, neglecting the importance of image-image similarity, leading to a gap between pre-training and adaptation. 2) The current cache model is based on the Nadaraya-Watson (N-W) estimator, which disregards the intricate relationships among training samples while constructing weight function. 3) Under the condition of limited samples, the logits generated by cache model are of high uncertainty, directly using these logits without accounting for the confidence could be problematic. This work presents three calibration modules aimed at addressing the above challenges. Similarity Calibration refines the image-image similarity by using unlabeled images. We add a learnable projection layer with residual connection on top of the pre-trained image encoder of CLIP and optimize the parameters by minimizing self-supervised contrastive loss. Weight Calibration introduces a precision matrix into the weight function to adequately model the relation between training samples, transforming the existing cache model to a Gaussian Process (GP) regressor, which could be more accurate than N-W estimator. Confidence Calibration leverages the predictive variances computed by GP Regression to dynamically re-scale the logits of cache model, ensuring that the cache model’s outputs are appropriately adjusted based on their confidence levels. Besides, to reduce the high complexity of GPs, we further propose a group-based learning strategy. Integrating the above designs, we propose both training-free and training-required variants. Extensive experiments on 11 few-shot classification datasets validate that the proposed methods can achieve state-of-the-art performance.
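The contrast between a Nadaraya-Watson cache and a Gaussian Process regressor can be sketched in a few lines. The code below is an illustration under stated assumptions, not the paper's implementation: the exponential cosine kernel, the values of `beta` and `noise`, and all function names are hypothetical, and features are assumed to be L2-normalised. The N-W predictor weights cached labels by similarity to the test sample only, whereas the GP predictor also multiplies by the inverse of the train-train kernel matrix (the precision matrix) and yields a predictive variance.

```python
import numpy as np

def kernel(a, b, beta=5.0):
    """exp(-beta * (1 - cosine similarity)) for L2-normalised rows of a and b."""
    return np.exp(-beta * (1.0 - a @ b.T))

def nw_predict(test, keys, labels_onehot, beta=5.0):
    """Nadaraya-Watson estimate: normalised kernel weights times one-hot labels."""
    w = kernel(test, keys, beta)
    w = w / w.sum(axis=1, keepdims=True)
    return w @ labels_onehot

def gp_predict(test, keys, labels_onehot, beta=5.0, noise=0.1):
    """GP regression mean and predictive variance with the same kernel.

    `noise` is an assumed observation-noise level added to the diagonal.
    """
    k_train = kernel(keys, keys, beta) + noise * np.eye(len(keys))
    k_test = kernel(test, keys, beta)
    mean = k_test @ np.linalg.solve(k_train, labels_onehot)
    # Prior variance is kernel(x, x) = 1 for normalised features.
    var = 1.0 - np.sum(k_test * np.linalg.solve(k_train, k_test.T).T, axis=1)
    return mean, var

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    keys = rng.standard_normal((6, 4))
    keys /= np.linalg.norm(keys, axis=1, keepdims=True)
    labels = np.eye(3)[[0, 0, 1, 1, 2, 2]]
    mean, var = gp_predict(keys[:2], keys, labels)
    print(mean.shape, var.shape)
```

The variance returned by `gp_predict` is the quantity that a confidence-calibration step could use to rescale logits; how exactly the paper does this rescaling is not specified in the abstract, so it is left out here.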

[CV-14] Exploiting Memory-aware Q-distribution Prediction for Nuclear Fusion via Modern Hopfield Network

Link: https://arxiv.org/abs/2410.08889
Authors: Qingchuan Ma, Shiao Wang, Tong Zheng, Xiaodong Dai, Yifeng Wang, Qingquan Yang, Xiao Wang
Keywords: clean energy solutions, advancing clean energy, long-term stable nuclear, Modern Hopfield Networks, nuclear fusion task
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: This study addresses the critical challenge of predicting the Q-distribution in the long-term stable nuclear fusion task, a key component for advancing clean energy solutions. We introduce an innovative deep learning framework that employs Modern Hopfield Networks to incorporate associative memory from historical shots. Utilizing a newly compiled dataset, we demonstrate the effectiveness of our approach in enhancing Q-distribution prediction. The proposed method represents a significant advancement by leveraging historical memory information for the first time in this context, showcasing improved prediction accuracy and contributing to the optimization of nuclear fusion research.

[CV-15] Can GPTs Evaluate Graphic Design Based on Design Principles? SIGGRAPH

Link: https://arxiv.org/abs/2410.08885
Authors: Daichi Haraguchi, Naoto Inoue, Wataru Shimoda, Hayato Mitani, Seiichi Uchida, Kota Yamaguchi
Keywords: foundation models show, models show promising, show promising capability, Large Multimodal Models, Recent advancements
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted to SIGGRAPH Asia 2024 (Technical Communications Track)

Abstract:Recent advancements in foundation models show promising capability in graphic design generation. Several studies have started employing Large Multimodal Models (LMMs) to evaluate graphic designs, assuming that LMMs can properly assess their quality, but it is unclear if the evaluation is reliable. One way to evaluate the quality of graphic design is to assess whether the design adheres to fundamental graphic design principles, which are the designer’s common practice. In this paper, we compare the behavior of GPT-based evaluation and heuristic evaluation based on design principles using human annotations collected from 60 subjects. Our experiments reveal that, while GPTs cannot distinguish small details, they have a reasonably good correlation with human annotation and exhibit a similar tendency to heuristic metrics based on design principles, suggesting that they are indeed capable of assessing the quality of graphic design. Our dataset is available at this https URL .

[CV-16] Multi-modal Fusion based Q-distribution Prediction for Controlled Nuclear Fusion

Link: https://arxiv.org/abs/2410.08879
Authors: Shiao Wang, Yifeng Wang, Qingchuan Ma, Xiao Wang, Ning Yan, Qingquan Yang, Guosheng Xu, Jin Tang
Keywords: crucial research direction, solving prediction challenges, deep learning emerging, controlled nuclear fusion, crucial research
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Q-distribution prediction is a crucial research direction in controlled nuclear fusion, with deep learning emerging as a key approach to solving prediction challenges. In this paper, we leverage deep learning techniques to tackle the complexities of Q-distribution prediction. Specifically, we explore multimodal fusion methods in computer vision, integrating 2D line image data with the original 1D data to form a bimodal input. Additionally, we employ the Transformer’s attention mechanism for feature extraction and the interactive fusion of bimodal information. Extensive experiments validate the effectiveness of our approach, significantly reducing prediction errors in Q-distribution.

[CV-17] Audio Description Generation in the Era of LLMs and VLMs: A Review of Transferable Generative AI Technologies

Link: https://arxiv.org/abs/2410.08860
Authors: Yingqiang Gao, Lukas Fischer, Alexa Lintner, Sarah Ebling
Keywords: assist blind persons, acoustic commentaries designed, accessing digital media, digital media content, Audio descriptions
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Audio descriptions (ADs) function as acoustic commentaries designed to assist blind persons and persons with visual impairments in accessing digital media content on television and in movies, among other settings. As an accessibility service typically provided by trained AD professionals, the generation of ADs demands significant human effort, making the process both time-consuming and costly. Recent advancements in natural language processing (NLP) and computer vision (CV), particularly in large language models (LLMs) and vision-language models (VLMs), have allowed for getting a step closer to automatic AD generation. This paper reviews the technologies pertinent to AD generation in the era of LLMs and VLMs: we discuss how state-of-the-art NLP and CV technologies can be applied to generate ADs and identify essential research directions for the future.

[CV-18] Learning Interaction-aware 3D Gaussian Splatting for One-shot Hand Avatars NEURIPS2024

Link: https://arxiv.org/abs/2410.08840
Authors: Xuan Huang, Hanhui Li, Wanquan Liu, Xiaodan Liang, Yiqiang Yan, Yuhao Cheng, Chengqiang Gao
Keywords: create animatable avatars, Gaussian Splatting, propose to create, create animatable, animatable avatars
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to NeurIPS 2024

Abstract: In this paper, we propose to create animatable avatars for interacting hands with 3D Gaussian Splatting (GS) and single-image inputs. Existing GS-based methods designed for single subjects often yield unsatisfactory results due to limited input views, various hand poses, and occlusions. To address these challenges, we introduce a novel two-stage interaction-aware GS framework that exploits cross-subject hand priors and refines 3D Gaussians in interacting areas. Particularly, to handle hand variations, we disentangle the 3D presentation of hands into optimization-based identity maps and learning-based latent geometric features and neural texture maps. Learning-based features are captured by trained networks to provide reliable priors for poses, shapes, and textures, while optimization-based identity maps enable efficient one-shot fitting of out-of-distribution hands. Furthermore, we devise an interaction-aware attention module and a self-adaptive Gaussian refinement module. These modules enhance image rendering quality in areas with intra- and inter-hand interactions, overcoming the limitations of existing GS-based methods. Our proposed method is validated via extensive experiments on the large-scale InterHand2.6M dataset, and it significantly improves the state-of-the-art performance in image quality. Project Page: this https URL.

[CV-19] Towards virtual painting recolouring using Vision Transformer on X-Ray Fluorescence datacubes

Link: https://arxiv.org/abs/2410.08826
Authors: Alessandro Bombini, Fernando García-Avello Bofías, Francesca Giambi, Chiara Ruberto
Keywords: perform virtual painting, virtual painting recolouring, Deep Variational Embedding, X-Ray Fluorescence, Variational Embedding network
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applied Physics (physics.app-ph)
Comments: v1: 20 pages, 10 figures; link to code repository

Abstract: In this contribution, we define (and test) a pipeline to perform virtual painting recolouring using raw data of X-Ray Fluorescence (XRF) analysis on pictorial artworks. To circumvent the small dataset size, we generate a synthetic dataset, starting from a database of XRF spectra; furthermore, to ensure a better generalisation capacity (and to tackle the issue of in-memory size and inference time), we define a Deep Variational Embedding network to embed the XRF spectra into a lower dimensional, K-Means friendly, metric space. We thus train a set of models to assign coloured images to embedded XRF images. We report here the devised pipeline performances in terms of visual quality metrics, and we close on a discussion on the results.

[CV-20] One-shot Generative Domain Adaptation in 3D GANs

Link: https://arxiv.org/abs/2410.08824
Authors: Ziqiang Li, Yi Wu, Chaoyue Wang, Xue Rui, Bin Li
Keywords: necessitates extensive training, ensure stable training, extensive training data, generation necessitates extensive, image generation necessitates
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: IJCV

Abstract:3D-aware image generation necessitates extensive training data to ensure stable training and mitigate the risk of overfitting. This paper first considers a novel task known as One-shot 3D Generative Domain Adaptation (GDA), aimed at transferring a pre-trained 3D generator from one domain to a new one, relying solely on a single reference image. One-shot 3D GDA is characterized by the pursuit of specific attributes, namely, high fidelity, large diversity, cross-domain consistency, and multi-view consistency. Within this paper, we introduce 3D-Adapter, the first one-shot 3D GDA method, for diverse and faithful generation. Our approach begins by judiciously selecting a restricted weight set for fine-tuning, and subsequently leverages four advanced loss functions to facilitate adaptation. An efficient progressive fine-tuning strategy is also implemented to enhance the adaptation process. The synergy of these three technological components empowers 3D-Adapter to achieve remarkable performance, substantiated both quantitatively and qualitatively, across all desired properties of 3D GDA. Furthermore, 3D-Adapter seamlessly extends its capabilities to zero-shot scenarios, and preserves the potential for crucial tasks such as interpolation, reconstruction, and editing within the latent space of the pre-trained generator. Code will be available at this https URL.

[CV-21] LIME-Eval: Rethinking Low-light Image Enhancement Evaluation via Object Detection

Link: https://arxiv.org/abs/2410.08810
Authors: Mingjia Li, Hao Zhao, Xiaojie Guo
Keywords: paired ground-truth information, high-level vision tasks, ground-truth information, high-level vision, absence of paired
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: Due to the nature of enhancement, namely the absence of paired ground-truth information, high-level vision tasks have recently been employed to evaluate the performance of low-light image enhancement. A widely used approach is to see how accurately an object detector trained on enhanced low-light images by different candidates can perform with respect to annotated semantic labels. In this paper, we first demonstrate that the mentioned approach is generally prone to overfitting, which diminishes its measurement reliability. In search of a proper evaluation metric, we propose LIME-Bench, the first online benchmark platform designed to collect human preferences for low-light enhancement, providing a valuable dataset for validating the correlation between human perception and automated evaluation metrics. We then customize LIME-Eval, a novel evaluation framework that utilizes detectors pre-trained on standard-lighting datasets without object annotations, to judge the quality of enhanced images. By adopting an energy-based strategy to assess the accuracy of output confidence maps, our LIME-Eval can simultaneously bypass biases associated with retraining detectors and circumvent the reliance on annotations for dim images. Comprehensive experiments are provided to reveal the effectiveness of our LIME-Eval. Our benchmark platform (this https URL) and code (this https URL) are available online.
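The abstract does not spell out its energy-based strategy, so the sketch below only illustrates the generic free-energy score commonly used in energy-based out-of-distribution detection, E(x) = -T log sum_i exp(z_i / T), computed from a vector of logits z. Whether LIME-Eval uses this exact form is an assumption; the function name and the temperature default are hypothetical. The general idea is that a confident, peaked output has lower energy than a flat one, which gives an annotation-free signal.

```python
import math

def free_energy(logits, temperature=1.0):
    """Free energy -T * logsumexp(logits / T), computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * lse

if __name__ == "__main__":
    confident = [10.0, 0.0, 0.0]   # peaked logits (illustrative values)
    uncertain = [1.0, 1.0, 1.0]    # flat logits (illustrative values)
    print(free_energy(confident) < free_energy(uncertain))
```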

[CV-22] CoTCoNet: An Optimized Coupled Transformer-Convolutional Network with an Adaptive Graph Reconstruction for Leukemia Detection

Link: https://arxiv.org/abs/2410.08797
Authors: Chandravardhan Singh Raghaw, Arnav Sharma, Shubhi Bansa, Mohammad Zia Ur Rehman, Nagendra Kumar
Keywords: accurate blood smear, blood smear analysis, Swift and accurate, effective diagnostic method, accurate blood
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Swift and accurate blood smear analysis is an effective diagnostic method for leukemia and other hematological malignancies. However, manual leukocyte count and morphological evaluation using a microscope is time-consuming and prone to errors. Conventional image processing methods also exhibit limitations in differentiating cells due to the visual similarity between malignant and benign cell morphology. This limitation is further compounded by the skewed training data that hinders the extraction of reliable and pertinent features. In response to these challenges, we propose an optimized Coupled Transformer Convolutional Network (CoTCoNet) framework for the classification of leukemia, which employs a well-designed transformer integrated with a deep convolutional network to effectively capture comprehensive global features and scalable spatial patterns, enabling the identification of complex and large-scale hematological features. Further, the framework incorporates a graph-based feature reconstruction module to reveal the hidden or unobserved hard-to-see biological features of leukocyte cells and employs a Population-based Meta-Heuristic Algorithm for feature selection and optimization. To mitigate data imbalance issues, we employ a synthetic leukocyte generator. In the evaluation phase, we initially assess CoTCoNet on a dataset containing 16,982 annotated cells, and it achieves remarkable accuracy and F1-Score rates of 0.9894 and 0.9893, respectively. To broaden the generalizability of our model, we evaluate it across four publicly available diverse datasets, which include the aforementioned dataset. This evaluation demonstrates that our method outperforms current state-of-the-art approaches. We also incorporate an explainability approach in the form of feature visualization closely aligned with cell annotations to provide a deeper understanding of the framework.

[CV-23] VLM See Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model

Link: https://arxiv.org/abs/2410.08792
Authors: Beichen Wang,Juexiao Zhang,Shuwen Dong,Irving Fang,Chen Feng
Keywords: Vision Language Models, Vision Language, Language Models, common sense reasoning, recently been adopted
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Vision Language Models (VLMs) have recently been adopted in robotics for their capability in common sense reasoning and generalizability. Existing work has applied VLMs to generate task and motion planning from natural language instructions and simulate training data for robot learning. In this work, we explore using VLM to interpret human demonstration videos and generate robot task planning. Our method integrates keyframe selection, visual perception, and VLM reasoning into a pipeline. We named it SeeDo because it enables the VLM to “see” human demonstrations and explain the corresponding plans to the robot for it to “do”. To validate our approach, we collected a set of long-horizon human videos demonstrating pick-and-place tasks in three diverse categories and designed a set of metrics to comprehensively benchmark SeeDo against several baselines, including state-of-the-art video-input VLMs. The experiments demonstrate SeeDo’s superior performance. We further deployed the generated task plans in both a simulation environment and on a real robot arm.

[CV-24] VideoSAM: Open-World Video Segmentation

Link: https://arxiv.org/abs/2410.08781
Authors: Pinxue Guo,Zixu Zhao,Jianxiong Gao,Chongruo Wu,Tong He,Zheng Zhang,Tianjun Xiao,Wenqiang Zhang
Keywords: autonomous driving, essential for advancing, open-world settings, settings where continuous, continuous perception
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Video segmentation is essential for advancing robotics and autonomous driving, particularly in open-world settings where continuous perception and object association across video frames are critical. While the Segment Anything Model (SAM) has excelled in static image segmentation, extending its capabilities to video segmentation poses significant challenges. We tackle two major hurdles: a) SAM’s embedding limitations in associating objects across frames, and b) granularity inconsistencies in object segmentation. To this end, we introduce VideoSAM, an end-to-end framework designed to address these challenges by improving object tracking and segmentation consistency in dynamic environments. VideoSAM integrates an agglomerated backbone, RADIO, enabling object association through similarity metrics and introduces Cycle-ack-Pairs Propagation with a memory mechanism for stable object tracking. Additionally, we incorporate an autoregressive object-token mechanism within the SAM decoder to maintain consistent granularity across frames. Our method is extensively evaluated on the UVO and BURST benchmarks, and robotic videos from RoboTAP, demonstrating its effectiveness and robustness in real-world scenarios. All codes will be available.

[CV-25] HpEIS: Learning Hand Pose Embeddings for Multimedia Interactive Systems

Link: https://arxiv.org/abs/2410.08779
Authors: Songpei Xu,Xuri Ge,Chaitanya Kaul,Roderick Murray-Smith
Keywords: Embedding Interactive System, Hand-pose Embedding Interactive, Variational Autoencoder, Embedding Interactive, two-dimensional visual space
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: 6 pages, 8 figures, 3 tables

Abstract:We present a novel Hand-pose Embedding Interactive System (HpEIS) as a virtual sensor, which maps users’ flexible hand poses to a two-dimensional visual space using a Variational Autoencoder (VAE) trained on a variety of hand poses. HpEIS enables visually interpretable and guidable support for user explorations in multimedia collections, using only a camera as an external hand pose acquisition device. We identify general usability issues associated with system stability and smoothing requirements through pilot experiments with expert and inexperienced users. We then design stability and smoothing improvements, including hand-pose data augmentation, an anti-jitter regularisation term added to loss function, stabilising post-processing for movement turning points and smoothing post-processing based on One Euro Filters. In target selection experiments (n=12), we evaluate HpEIS by measures of task completion time and the final distance to target points, with and without the gesture guidance window condition. Experimental responses indicate that HpEIS provides users with a learnable, flexible, stable and smooth mid-air hand movement interaction experience.
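
The abstract names One Euro Filters as the basis for its smoothing post-processing. The following is a minimal sketch of the generic One Euro Filter for a single scalar signal, not the authors' implementation. The class name, the parameter defaults, and the idea of running one filter per coordinate of the two-dimensional embedding are assumptions made for illustration.

```python
import math


def smoothing_factor(dt: float, cutoff_hz: float) -> float:
    """Exponential-smoothing weight of a low-pass filter with the given cutoff."""
    r = 2.0 * math.pi * cutoff_hz * dt
    return r / (r + 1.0)


class OneEuroFilter:
    """Adaptive low-pass filter for one scalar signal.

    The cutoff rises with the (smoothed) speed of the signal, so slow
    movements are smoothed strongly and fast movements lag less.
    """

    def __init__(self, t0: float, x0: float, min_cutoff: float = 1.0,
                 beta: float = 0.0, d_cutoff: float = 1.0) -> None:
        self.min_cutoff = min_cutoff  # baseline cutoff in Hz (assumed default)
        self.beta = beta              # speed coefficient (assumed default)
        self.d_cutoff = d_cutoff      # cutoff for the derivative estimate
        self.t_prev = t0
        self.x_prev = x0
        self.dx_prev = 0.0

    def __call__(self, t: float, x: float) -> float:
        dt = t - self.t_prev
        if dt <= 0.0:
            raise ValueError("timestamps must be strictly increasing")
        # Low-pass filter the finite-difference derivative.
        a_d = smoothing_factor(dt, self.d_cutoff)
        dx = (x - self.x_prev) / dt
        dx_hat = a_d * dx + (1.0 - a_d) * self.dx_prev
        # Faster motion -> higher cutoff -> less smoothing.
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = smoothing_factor(dt, cutoff)
        x_hat = a * x + (1.0 - a) * self.x_prev
        self.t_prev, self.x_prev, self.dx_prev = t, x_hat, dx_hat
        return x_hat
```

For a two-dimensional point, one could keep two independent filters, one per coordinate, and feed each new camera frame's timestamp and coordinate to them.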

[CV-26] Efficient Multi-Object Tracking on Edge Devices via Reconstruction-Based Channel Pruning

Link: https://arxiv.org/abs/2410.08769
Authors: Jan Müller,Adrian Pigors
Keywords: addressing critical security, Jetson Orin Nano, technologies presents, advancement of multi-object, presents the dual
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:The advancement of multi-object tracking (MOT) technologies presents the dual challenge of maintaining high performance while addressing critical security and privacy concerns. In applications such as pedestrian tracking, where sensitive personal data is involved, the potential for privacy violations and data misuse becomes a significant issue if data is transmitted to external servers. To mitigate these risks, processing data directly on an edge device, such as a smart camera, has emerged as a viable solution. Edge computing ensures that sensitive information remains local, thereby aligning with stringent privacy principles and significantly reducing network latency. However, the implementation of MOT on edge devices is not without its challenges. Edge devices typically possess limited computational resources, necessitating the development of highly optimized algorithms capable of delivering real-time performance under these constraints. The disparity between the computational requirements of state-of-the-art MOT algorithms and the capabilities of edge devices emphasizes a significant obstacle. To address these challenges, we propose a neural network pruning method specifically tailored to compress complex networks, such as those used in modern MOT systems. This approach optimizes MOT performance by ensuring high accuracy and efficiency within the constraints of limited edge devices, such as NVIDIA’s Jetson Orin Nano. By applying our pruning method, we achieve model size reductions of up to 70% while maintaining a high level of accuracy and further improving performance on the Jetson Orin Nano, demonstrating the effectiveness of our approach for edge computing applications.

[CV-27] Look Gauss No Pose: Novel View Synthesis using Gaussian Splatting without Accurate Pose Initialization IROS2024

Link: https://arxiv.org/abs/2410.08743
Authors: Christian Schmidt,Jens Piekenbrinck,Bastian Leibe
Keywords: posed input images, Gaussian Splatting, Gaussian Splatting framework, novel-view synthesis, input images
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in IROS 2024

Abstract:3D Gaussian Splatting has recently emerged as a powerful tool for fast and accurate novel-view synthesis from a set of posed input images. However, like most novel-view synthesis approaches, it relies on accurate camera pose information, limiting its applicability in real-world scenarios where acquiring accurate camera poses can be challenging or even impossible. We propose an extension to the 3D Gaussian Splatting framework by optimizing the extrinsic camera parameters with respect to photometric residuals. We derive the analytical gradients and integrate their computation with the existing high-performance CUDA implementation. This enables downstream tasks such as 6-DoF camera pose estimation as well as joint reconstruction and camera refinement. In particular, we achieve rapid convergence and high accuracy for pose estimation on real-world scenes. Our method enables fast reconstruction of 3D scenes without requiring accurate pose information by jointly optimizing geometry and camera poses, while achieving state-of-the-art results in novel-view synthesis. Our approach is considerably faster to optimize than most competing methods, and several times faster in rendering. We show results on real-world scenes and complex trajectories through simulated environments, achieving state-of-the-art results on LLFF while reducing runtime by two to four times compared to the most efficient competing method. Source code will be available at this https URL .

[CV-28] Hespi: A pipeline for automatically detecting information from herbarium specimen sheets

Link: https://arxiv.org/abs/2410.08740
Authors: Robert Turnbull,Emily Fitzgerald,Karen Thompson,Joanne L. Birch
Keywords: conservation sciences, Optical Character Recognition, Specimen, data, Specimen sheet PIpeline
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Abstract:Specimen associated biodiversity data are sought after for biological, environmental, climate, and conservation sciences. A rate shift is required for the extraction of data from specimen images to eliminate the bottleneck that the reliance on human-mediated transcription of these data represents. We applied advanced computer vision techniques to develop the ‘Hespi’ (HErbarium Specimen sheet PIpeline), which extracts a pre-catalogue subset of collection data on the institutional labels on herbarium specimens from their digital images. The pipeline integrates two object detection models; the first detects bounding boxes around text-based labels and the second detects bounding boxes around text-based data fields on the primary institutional label. The pipeline classifies text-based institutional labels as printed, typed, handwritten, or a combination and applies Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) for data extraction. The recognized text is then corrected against authoritative databases of taxon names. The extracted text is also corrected with the aid of a multimodal Large Language Model (LLM). Hespi accurately detects and extracts text for test datasets including specimen sheet images from international herbaria. The components of the pipeline are modular and users can train their own models with their own data and use them in place of the models provided.

[CV-29] MMLF: Multi-modal Multi-class Late Fusion for Object Detection with Uncertainty Estimation

Link: https://arxiv.org/abs/2410.08739
Authors: Qihang Yang,Yang Zhao,Hong Cheng
Keywords: driving necessitates advanced, necessitates advanced object, Autonomous driving necessitates, single-modal approaches, late fusion
Categories: Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)

Abstract:Autonomous driving necessitates advanced object detection techniques that integrate information from multiple modalities to overcome the limitations associated with single-modal approaches. The challenges of aligning diverse data in early fusion and the complexities, along with overfitting issues introduced by deep fusion, underscore the efficacy of late fusion at the decision level. Late fusion ensures seamless integration without altering the original detector’s network structure. This paper introduces a pioneering Multi-modal Multi-class Late Fusion method, designed for late fusion to enable multi-class detection. Fusion experiments conducted on the KITTI validation and official test datasets illustrate substantial performance improvements, presenting our model as a versatile solution for multi-modal object detection in autonomous driving. Moreover, our approach incorporates uncertainty analysis into the classification fusion process, rendering our model more transparent and trustworthy and providing more reliable insights into category predictions.

[CV-30] Gradients Stand-in for Defending Deep Leakage in Federated Learning

Link: https://arxiv.org/abs/2410.08734
Authors: H. Yi,H. Ren,C. Hu,Y. Li,J. Deng,X. Xie
Keywords: Federated Learning, localizing sensitive data, shifting the paradigm, reinforce privacy protections, paradigm towards localizing
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Federated Learning (FL) has become a cornerstone of privacy protection, shifting the paradigm towards localizing sensitive data while only sending model gradients to a central server. This strategy is designed to reinforce privacy protections and minimize the vulnerabilities inherent in centralized data storage systems. Despite its innovative approach, recent empirical studies have highlighted potential weaknesses in FL, notably regarding the exchange of gradients. In response, this study introduces a novel, efficacious method aimed at safeguarding against gradient leakage, namely, “AdaDefense”. Following the idea that model convergence can be achieved by using different types of optimization methods, we suggest using a local stand-in rather than the actual local gradient for global gradient aggregation on the central server. This proposed approach not only effectively prevents gradient leakage, but also ensures that the overall performance of the model remains largely unaffected. Delving into the theoretical dimensions, we explore how gradients may inadvertently leak private information and present a theoretical framework supporting the efficacy of our proposed method. Extensive empirical tests, supported by popular benchmark experiments, validate that our approach maintains model integrity and is robust against gradient leakage, marking an important step in our pursuit of safe and efficient FL.

[CV-31] Impact of Surface Reflections in Maritime Obstacle Detection BMVC

Link: https://arxiv.org/abs/2410.08713
Authors: Samed Yalçın,Hazım Kemal Ekenel
Keywords: Maritime obstacle detection, unmanned surface vehicles, Maritime obstacle, obstacle detection aims, obstacle detection
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at RROW2024 Workshop @ British Machine Vision Conference (BMVC) 2024

Abstract:Maritime obstacle detection aims to detect possible obstacles for autonomous driving of unmanned surface vehicles. In the context of maritime obstacle detection, the water surface can act like a mirror under certain circumstances, causing reflections on imagery. Previous works have indicated surface reflections as a source of false positives for object detectors in maritime obstacle detection tasks. In this work, we show that surface reflections indeed adversely affect detector performance. We measure the effect of reflections by testing on two custom datasets, which we make publicly available. The first one contains imagery with reflections, while in the second reflections are inpainted. We show that the reflections reduce mAP by 1.2 to 9.6 points across various detectors. To remove false positives on reflections, we propose a novel filtering approach named Heatmap Based Sliding Filter. We show that the proposed method reduces the total number of false positives by 34.64% while minimally affecting true positives. We also conduct qualitative analysis and show that the proposed method indeed removes false positives on the reflections. The datasets can be found on this https URL.

[CV-32] Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping

Link: https://arxiv.org/abs/2410.08695
Authors: Yue Yang,Shuibai Zhang,Wenqi Shao,Kaipeng Zhang,Yi Bin,Yu Wang,Ping Luo
Keywords: Large Vision-Language Models, demonstrated remarkable capabilities, Large Vision-Language, Vision-Language Models, perception and reasoning
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across multimodal tasks such as visual perception and reasoning, leading to good performance on various multimodal evaluation benchmarks. However, these benchmarks keep a static nature and overlap with the pre-training data, resulting in fixed complexity constraints and data contamination issues. This raises the concern regarding the validity of the evaluation. To address these two challenges, we introduce a dynamic multimodal evaluation protocol called Vision-Language Bootstrapping (VLB). VLB provides a robust and comprehensive assessment for LVLMs with reduced data contamination and flexible complexity. To this end, VLB dynamically generates new visual question-answering samples through a multimodal bootstrapping module that modifies both images and language, while ensuring that newly generated samples remain consistent with the original ones by a judge module. By composing various bootstrapping strategies, VLB offers dynamic variants of existing benchmarks with diverse complexities, enabling the evaluation to co-evolve with the ever-evolving capabilities of LVLMs. Extensive experimental results across multiple benchmarks, including SEEDBench, MMBench, and MME, show that VLB significantly reduces data contamination and exposes performance limitations of LVLMs.

[CV-33] Chain-of-Restoration: Multi-Task Image Restoration Models are Zero-Shot Step-by-Step Universal Image Restorers

Link: https://arxiv.org/abs/2410.08688
Authors: Jin Cao,Deyu Meng,Xiangyong Cao
Keywords: typically targeting isolated, previous works typically, works typically targeting, isolated degradation types, targeting isolated degradation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 9 figures

Abstract:Despite previous works typically targeting isolated degradation types, recent research has increasingly focused on addressing composite degradations which involve a complex interplay of multiple different isolated degradations. Recognizing the challenges posed by the exponential number of possible degradation combinations, we propose Universal Image Restoration (UIR), a new task setting that requires models to be trained on a set of degradation bases and then remove any degradation that these bases can potentially compose in a zero-shot manner. Inspired by the Chain-of-Thought which prompts LLMs to address problems step-by-step, we propose the Chain-of-Restoration (CoR), which instructs models to step-by-step remove unknown composite degradations. By integrating a simple Degradation Discriminator into pre-trained multi-task models, CoR facilitates the process where models remove one degradation basis per step, continuing this process until the image is fully restored from the unknown composite degradation. Extensive experiments show that CoR significantly improves model performance in removing composite degradations, achieving results comparable to or surpassing those of State-of-The-Art (SoTA) methods trained on all degradations. The code will be released at this https URL.
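
The step-by-step loop described in the abstract (identify one degradation basis, remove it, repeat until the image is clean) can be sketched as follows. This is only an illustration of the control flow, not the paper's code: the function names, the `max_steps` cap, and the toy "image" (a set of degradation names standing in for pixels) are all hypothetical, and the toy discriminator replaces the learned Degradation Discriminator.

```python
from typing import Callable, Dict, FrozenSet, List, Optional, Tuple

# Toy stand-in for an image: the set of degradations still present in it.
Image = FrozenSet[str]


def chain_of_restoration(
    image: Image,
    discriminate: Callable[[Image], Optional[str]],
    restorers: Dict[str, Callable[[Image], Image]],
    max_steps: int = 10,
) -> Tuple[Image, List[str]]:
    """Remove one degradation basis per step until none is detected."""
    applied: List[str] = []
    for _ in range(max_steps):  # hypothetical safety cap on the chain length
        basis = discriminate(image)
        if basis is None:  # discriminator reports a clean image
            break
        image = restorers[basis](image)
        applied.append(basis)
    return image, applied


def toy_discriminator(image: Image) -> Optional[str]:
    """Hypothetical discriminator: report the alphabetically first degradation."""
    return min(image) if image else None


# Hypothetical per-basis restorers: each removes exactly one degradation.
toy_restorers: Dict[str, Callable[[Image], Image]] = {
    name: (lambda img, name=name: img - {name})
    for name in ("blur", "haze", "rain")
}
```

Calling `chain_of_restoration(frozenset({"rain", "haze"}), toy_discriminator, toy_restorers)` walks through two steps, one per degradation, and returns the restored toy image together with the order in which the bases were removed.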

[CV-34] Uncertainty Estimation and Out-of-Distribution Detection for LiDAR Scene Semantic Segmentation ECCV

Link: https://arxiv.org/abs/2410.08687
Authors: Hanieh Shojaei,Qianqian Zou,Max Mehltretter
Keywords: environments requires autonomous, requires autonomous vehicles, Safe navigation, LiDAR scene segmentation, Gaussian Mixture Model
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication in the Proceedings of the European Conference on Computer Vision (ECCV) 2024

Abstract:Safe navigation in new environments requires autonomous vehicles and robots to accurately interpret their surroundings, relying on LiDAR scene segmentation, out-of-distribution (OOD) obstacle detection, and uncertainty computation. We propose a method to distinguish in-distribution (ID) from OOD samples and quantify both epistemic and aleatoric uncertainties using the feature space of a single deterministic model. After training a semantic segmentation network, a Gaussian Mixture Model (GMM) is fitted to its feature space. OOD samples are detected by checking if their squared Mahalanobis distances to each Gaussian component conform to a chi-squared distribution, eliminating the need for an additional OOD training set. Given that the estimated mean and covariance matrix of a multivariate Gaussian distribution follow Gaussian and Inverse-Wishart distributions, multiple GMMs are generated by sampling from these distributions to assess epistemic uncertainty through classification variability. Aleatoric uncertainty is derived from the entropy of responsibility values within Gaussian components. Comparing our method with deep ensembles and logit-sampling for uncertainty computation demonstrates its superior performance in real-world applications for quantifying epistemic and aleatoric uncertainty, as well as detecting OOD samples. While deep ensembles miss some highly uncertain samples, our method successfully detects them and assigns high epistemic uncertainty.
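
To make the Mahalanobis-distance check concrete, here is a minimal sketch under strong simplifying assumptions: features are two-dimensional, the Gaussian components are given rather than fitted, and the "conforms to a chi-squared distribution" test is reduced to a single quantile threshold. For a d-dimensional Gaussian the squared Mahalanobis distance follows a chi-squared distribution with d degrees of freedom; with d = 2 the quantile has the closed form -2 ln(1 - p). All function names and the 0.99 level are illustrative choices, not taken from the paper.

```python
import math
from typing import Sequence, Tuple

Vec2 = Tuple[float, float]
Mat2 = Tuple[Tuple[float, float], Tuple[float, float]]


def mahalanobis_sq_2d(x: Vec2, mean: Vec2, cov: Mat2) -> float:
    """Squared Mahalanobis distance of x to a 2-D Gaussian (mean, cov)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0.0:
        raise ValueError("covariance must be positive definite")
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # v^T inv(cov) v, with inv(cov) = [[d, -b], [-c, a]] / det
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


def chi2_quantile_2dof(p: float) -> float:
    """Quantile of the chi-squared distribution with 2 degrees of freedom."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be in (0, 1)")
    return -2.0 * math.log(1.0 - p)


def is_ood(x: Vec2, components: Sequence[Tuple[Vec2, Mat2]], p: float = 0.99) -> bool:
    """Flag x as OOD if it lies outside the p-level region of every component."""
    threshold = chi2_quantile_2dof(p)
    return all(mahalanobis_sq_2d(x, m, s) > threshold for m, s in components)
```

In higher dimensions the same idea applies, but the quantile would have to come from a general chi-squared implementation and the covariance inverse from a linear-algebra routine.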

[CV-35] Gait Sequence Upsampling using Diffusion Models for single LiDAR sensors

Link: https://arxiv.org/abs/2410.08680
Authors: Jeongho Ahn,Kazuto Nakashima,Koki Yoshino,Yumi Iwashita,Ryo Kurazume
Keywords: traditional RGB cameras, RGB cameras, gait-based person identification, traditional RGB, varying lighting conditions
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recently, 3D LiDAR has emerged as a promising technique in the field of gait-based person identification, serving as an alternative to traditional RGB cameras, due to its robustness under varying lighting conditions and its ability to capture 3D geometric information. However, long capture distances or the use of low-cost LiDAR sensors often result in sparse human point clouds, leading to a decline in identification performance. To address these challenges, we propose a sparse-to-dense upsampling model for pedestrian point clouds in LiDAR-based gait recognition, named LidarGSU, which is designed to improve the generalization capability of existing identification models. Our method utilizes diffusion probabilistic models (DPMs), which have shown high fidelity in generative tasks such as image completion. In this work, we leverage DPMs on sparse sequential pedestrian point clouds as conditional masks in a video-to-video translation approach, applied in an inpainting manner. We conducted extensive experiments on the SUSTech1K dataset to evaluate the generative quality and recognition performance of the proposed method. Furthermore, we demonstrate the applicability of our upsampling model using a real-world dataset, captured with a low-resolution sensor across varying measurement distances.

[CV-36] Bukva: Russian Sign Language Alphabet

Link: https://arxiv.org/abs/2410.08675
Authors: Karina Kvanchiani,Petr Surovtsev,Alexander Nagaev,Elizaveta Petrova,Alexander Kapitanov
Keywords: Russian Sign Language, Russian fingerspelling alphabet, paper investigates, Russian fingerspelling, dactyl
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint. Title: “Bukva: Russian Sign Language Alphabet”. 9 pages

Abstract:This paper investigates the recognition of the Russian fingerspelling alphabet, also known as the Russian Sign Language (RSL) dactyl. Dactyl is a component of sign languages where distinct hand movements represent individual letters of a written language. This method is used to spell words without specific signs, such as proper nouns or technical terms. The alphabet learning simulator is an essential isolated dactyl recognition application. There is a notable issue of data shortage in isolated dactyl recognition: existing Russian dactyl datasets lack subject heterogeneity, contain insufficient samples, or cover only static signs. We provide Bukva, the first full-fledged open-source video dataset for RSL dactyl recognition. It contains 3,757 videos with more than 101 samples for each RSL alphabet sign, including dynamic ones. We utilized crowdsourcing platforms to increase the subject’s heterogeneity, resulting in the participation of 155 deaf and hard-of-hearing experts in the dataset creation. We use a TSM (Temporal Shift Module) block to handle static and dynamic signs effectively, achieving 83.6% top-1 accuracy with a real-time inference with CPU only. The dataset, demo code, and pre-trained models are publicly available.
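
The abstract relies on a TSM (Temporal Shift Module) block to cover both static and dynamic signs. As a rough illustration of the core shift operation only (not the authors' model or training code), the sketch below works on plain Python lists of shape `[time][channel]`. The function name and the `fold_div=8` default are assumptions that mirror the commonly used setting of shifting one eighth of the channels in each temporal direction.

```python
from typing import List


def temporal_shift(frames: List[List[float]], fold_div: int = 8) -> List[List[float]]:
    """Shift a fraction of channels along time, zero-padding at the ends.

    frames[t][c] is the feature of channel c at time step t.
    - channels [0, fold): take the value from the next frame (t + 1)
    - channels [fold, 2 * fold): take the value from the previous frame (t - 1)
    - remaining channels: left unchanged
    """
    num_steps = len(frames)
    num_channels = len(frames[0]) if frames else 0
    fold = num_channels // fold_div
    shifted = [[0.0] * num_channels for _ in range(num_steps)]
    for t in range(num_steps):
        for c in range(num_channels):
            if c < fold:
                src = t + 1
            elif c < 2 * fold:
                src = t - 1
            else:
                src = t
            if 0 <= src < num_steps:  # out-of-range sources stay zero
                shifted[t][c] = frames[src][c]
    return shifted
```

Because the shift only moves existing values between neighbouring frames, it adds temporal mixing without extra multiplications, which is consistent with the CPU-only real-time inference the abstract reports.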

[CV-37] SpikeBottleNet: Energy Efficient Spike Neural Network Partitioning for Feature Compression in Device-Edge Co-Inference Systems ECAI-2024

Link: https://arxiv.org/abs/2410.08673
Authors: Maruf Hassan,Steven Davy
Keywords: intelligent mobile applications, mobile applications highlights, deploying powerful deep, powerful deep learning, deep learning models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The paper consists of 7 pages and 3 figures. It was submitted to ECAI-2024, and the authors are currently working on improving it based on the review

Abstract:The advent of intelligent mobile applications highlights the crucial demand for deploying powerful deep learning models on resource-constrained mobile devices. An effective solution in this context is the device-edge co-inference framework, which partitions a deep neural network between a mobile device and a nearby edge server. This approach requires balancing on-device computations and communication costs, often achieved through compressed intermediate feature transmission. Conventional deep neural network architectures require continuous data processing, leading to substantial energy consumption by edge devices. This motivates exploring binary, event-driven activations enabled by spiking neural networks (SNNs), known for their extreme energy efficiency. In this research, we propose a novel architecture named SpikeBottleNet, a significant improvement to the existing architecture by integrating SNNs. A key aspect of our investigation is the development of an intermediate feature compression technique specifically designed for SNNs. This technique leverages a split computing approach for SNNs to partition complex architectures, such as Spike ResNet50. By incorporating the power of SNNs within device-edge co-inference systems, experimental results demonstrate that our SpikeBottleNet achieves a significant bit compression ratio of up to 256x in the final convolutional layer while maintaining high classification accuracy with only a 2.5% reduction. Moreover, compared to the baseline BottleNet++ architecture, our framework reduces the transmitted feature size at earlier splitting points by 75%. Furthermore, in terms of the energy efficiency of edge devices, our methodology surpasses the baseline by a factor of up to 98, demonstrating significant enhancements in both efficiency and performance.

[CV-38] SmartPretrain: Model-Agnostic and Dataset-Agnostic Representation Learning for Motion Prediction

Link: https://arxiv.org/abs/2410.08669
Authors: Yang Zhou,Hao Shao,Letian Wang,Steven L. Waslander,Hongsheng Li,Yu Liu
Keywords: Predicting the future, motion prediction, autonomous vehicles, safely in dynamic, surrounding agents
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 11 pages, 5 figures

Abstract:Predicting the future motion of surrounding agents is essential for autonomous vehicles (AVs) to operate safely in dynamic, human-robot-mixed environments. However, the scarcity of large-scale driving datasets has hindered the development of robust and generalizable motion prediction models, limiting their ability to capture complex interactions and road geometries. Inspired by recent advances in natural language processing (NLP) and computer vision (CV), self-supervised learning (SSL) has gained significant attention in the motion prediction community for learning rich and transferable scene representations. Nonetheless, existing pre-training methods for motion prediction have largely focused on specific model architectures and a single dataset, limiting their scalability and generalizability. To address these challenges, we propose SmartPretrain, a general and scalable SSL framework for motion prediction that is both model-agnostic and dataset-agnostic. Our approach integrates contrastive and reconstructive SSL, leveraging the strengths of both generative and discriminative paradigms to effectively represent spatiotemporal evolution and interactions without imposing architectural constraints. Additionally, SmartPretrain employs a dataset-agnostic scenario sampling strategy that integrates multiple datasets, enhancing data volume, diversity, and robustness. Extensive experiments on multiple datasets demonstrate that SmartPretrain consistently improves the performance of state-of-the-art prediction models across datasets, data splits and main metrics. For instance, SmartPretrain significantly reduces the MissRate of Forecast-MAE by 10.6%. These results highlight SmartPretrain’s effectiveness as a unified, scalable solution for motion prediction, breaking free from the limitations of the small-data regime. Codes are available at this https URL

[CV-39] E-Motion: Future Motion Simulation via Event Sequence Diffusion NEURIPS2024

Link: https://arxiv.org/abs/2410.08649
Authors: Song Wu,Zhiyu Zhu,Junhui Hou,Guangming Shi,Jinjian Wu
Keywords: typical object future, Forecasting a typical, object future motion, typical object, critical task
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2024

Abstract:Forecasting a typical object’s future motion is a critical task for interpreting and interacting with dynamic environments in computer vision. Event-based sensors, which could capture changes in the scene with exceptional temporal granularity, may potentially offer a unique opportunity to predict future motion with a level of detail and precision previously unachievable. Inspired by that, we propose to integrate the strong learning capacity of the video diffusion model with the rich motion information of an event camera as a motion simulation framework. Specifically, we initially employ pre-trained stable video diffusion models to adapt the event sequence dataset. This process facilitates the transfer of extensive knowledge from RGB videos to an event-centric domain. Moreover, we introduce an alignment mechanism that utilizes reinforcement learning techniques to enhance the reverse generation trajectory of the diffusion model, ensuring improved performance and accuracy. Through extensive testing and validation, we demonstrate the effectiveness of our method in various complex scenarios, showcasing its potential to revolutionize motion flow prediction in computer vision applications such as autonomous vehicle guidance, robotic navigation, and interactive media. Our findings suggest a promising direction for future research in enhancing the interpretative power and predictive accuracy of computer vision systems.

[CV-40] Boosting Open-Vocabulary Object Detection by Handling Background Samples ICONIP2024

Link: https://arxiv.org/abs/2410.08645
Authors: Ruizhe Zeng,Lu Zhang,Xu Yang,Zhiyong Liu
Keywords: candidate vocabulary list, accurately detecting objects, open-vocabulary detectors, Open-vocabulary object detection, task of accurately
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 5 figures, Accepted to ICONIP 2024

Abstract:Open-vocabulary object detection is the task of accurately detecting objects from a candidate vocabulary list that includes both base and novel categories. Currently, numerous open-vocabulary detectors have achieved success by leveraging the impressive zero-shot capabilities of CLIP. However, we observe that CLIP models struggle to effectively handle background images (i.e. images without corresponding labels) due to their language-image learning methodology. This limitation results in suboptimal performance for open-vocabulary detectors that rely on CLIP when processing background samples. In this paper, we propose Background Information Representation for open-vocabulary Detector (BIRDet), a novel approach to address the limitations of CLIP in handling background samples. Specifically, we design Background Information Modeling (BIM) to replace the single, fixed background embedding in mainstream open-vocabulary detectors with dynamic scene information, and prompt it into image-related background representations. This method effectively enhances the ability to classify oversized regions as background. Besides, we introduce Partial Object Suppression (POS), an algorithm that utilizes the ratio of overlap area to address the issue of misclassifying partial regions as foreground. Experiments on OV-COCO and OV-LVIS benchmarks demonstrate that our proposed model is capable of achieving performance enhancements across various open-vocabulary detectors.

[CV-41] More than Memes: A Multimodal Topic Modeling Approach to Conspiracy Theories on Telegram

Link: https://arxiv.org/abs/2410.08642
Authors: Elisabeth Steffen
Keywords: German-language Telegram channels, related content online, conspiracy theories, German-language Telegram, traditionally focused
Categories: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 11 pages, 11 figures

Abstract:Research on conspiracy theories and related content online has traditionally focused on textual data. To address the increasing prevalence of (audio-)visual data on social media, and to capture the evolving and dynamic nature of this communication, researchers have begun to explore the potential of unsupervised approaches for analyzing multimodal online content. Our research contributes to this field by exploring the potential of multimodal topic modeling for analyzing conspiracy theories in German-language Telegram channels. Our work uses the BERTopic topic modeling approach in combination with CLIP for the analysis of textual and visual data. We analyze a corpus of ~40,000 Telegram messages posted in October 2023 in 571 German-language Telegram channels known for disseminating conspiracy theories and other deceptive content. We explore the potentials and challenges of this approach for studying a medium-sized corpus of user-generated, text-image online content. We offer insights into the dominant topics across modalities, different text and image genres discovered during the analysis, quantitative inter-modal topic analyses, and a qualitative case study of textual, visual, and multimodal narrative strategies in the communication of conspiracy theories.

[CV-42] Multi-Source Temporal Attention Network for Precipitation Nowcasting

Link: https://arxiv.org/abs/2410.08641
Authors: Rafael Pablos Sarabia,Joachim Nyborg,Morten Birk,Jeppe Liborius Sjørup,Anders Lillevang Vesterholt,Ira Assent
Keywords: Precipitation nowcasting, climate change, industries and plays, plays a significant, significant role
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Precipitation nowcasting is crucial across various industries and plays a significant role in mitigating and adapting to climate change. We introduce an efficient deep learning model for precipitation nowcasting, capable of predicting rainfall up to 8 hours in advance with greater accuracy than existing operational physics-based and extrapolation-based models. Our model leverages multi-source meteorological data and physics-based forecasts to deliver high-resolution predictions in both time and space. It captures complex spatio-temporal dynamics through temporal attention networks and is optimized using data quality maps and dynamic thresholds. Experiments demonstrate that our model outperforms state-of-the-art methods and highlight its potential for fast, reliable responses to evolving weather conditions.

[CV-43] Natural Language Induced Adversarial Images ACM-MM2024

Link: https://arxiv.org/abs/2410.08620
Authors: Xiaopei Zhu,Peiyang Xu,Guanning Zeng,Yingpeng Dong,Xiaolin Hu
Keywords: deep learning models, adversarial, shows the vulnerability, build more robust, Research
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Camera-ready version. To appear in ACM MM 2024

Abstract:Research of adversarial attacks is important for AI security because it shows the vulnerability of deep learning models and helps to build more robust models. Adversarial attacks on images are most widely studied, which include noise-based attacks, image editing-based attacks, and latent space-based attacks. However, the adversarial examples crafted by these methods often lack sufficient semantic information, making it challenging for humans to understand the failure modes of deep learning models under natural conditions. To address this limitation, we propose a natural language induced adversarial image attack method. The core idea is to leverage a text-to-image model to generate adversarial images given input prompts, which are maliciously constructed to lead to misclassification for a target model. To adopt commercial text-to-image models for synthesizing more natural adversarial images, we propose an adaptive genetic algorithm (GA) for optimizing discrete adversarial prompts without requiring gradients and an adaptive word space reduction method for improving query efficiency. We further used CLIP to maintain the semantic consistency of the generated images. In our experiments, we found that some high-frequency semantic information such as “foggy”, “humid”, “stretching”, etc. can easily cause classifier errors. This adversarial semantic information exists not only in generated images but also in photos captured in the real world. We also found that some adversarial semantic information can be transferred to unknown classification tasks. Furthermore, our attack method can transfer to different text-to-image models (e.g., Midjourney, DALL-E 3, etc.) and image classifiers. Our code is available at: this https URL.
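
The abstract mentions a gradient-free genetic algorithm over discrete prompts. As a rough illustration of that family of methods, here is a plain (non-adaptive) GA over word slots. It is not the paper's adaptive GA or its word space reduction: in the paper's setting the fitness would come from querying a text-to-image model and the target classifier, whereas `toy_fitness` below is a hypothetical stand-in, and every name and parameter is an assumption.

```python
import random
from typing import Callable, List, Sequence, Set


def toy_fitness(prompt: Sequence[str], target_words: Set[str]) -> int:
    """Placeholder fitness: counts how many chosen words are in a target set."""
    return sum(1 for w in prompt if w in target_words)


def evolve(
    vocab_slots: Sequence[Sequence[str]],
    fitness: Callable[[List[str]], float],
    pop_size: int = 20,
    generations: int = 30,
    mutation_rate: float = 0.2,
    seed: int = 0,
) -> List[str]:
    """Search for a high-fitness prompt; one word is chosen per slot.

    Requires at least two slots (for one-point crossover) and pop_size >= 4.
    """
    rng = random.Random(seed)
    population = [[rng.choice(slot) for slot in vocab_slots] for _ in range(pop_size)]
    for _ in range(generations):
        # Elitist selection: the better half survives unchanged.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(vocab_slots))  # one-point crossover
            child = a[:cut] + b[cut:]
            for k, slot in enumerate(vocab_slots):    # per-slot mutation
                if rng.random() < mutation_rate:
                    child[k] = rng.choice(slot)
            children.append(child)
        population = parents + children
    return max(population, key=fitness)
```

Because the search only calls `fitness`, it needs no gradients, which is what makes this style of optimization usable against black-box commercial generators.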

[CV-44] Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation

Link: https://arxiv.org/abs/2410.08613
Authors: Zhe Dong,Yuzhe Sun,Yanfeng Gu,Tianzhu Liu
Keywords: remote sensing image, referring remote sensing, sensing image segmentation, remote sensing, sensing image
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Given a natural language expression and a remote sensing image, the goal of referring remote sensing image segmentation (RRSIS) is to generate a pixel-level mask of the target object identified by the referring expression. In contrast to natural scenarios, expressions in RRSIS often involve complex geospatial relationships, with target objects of interest that vary significantly in scale and lack visual saliency, thereby increasing the difficulty of achieving precise segmentation. To address the aforementioned challenges, a novel RRSIS framework is proposed, termed the cross-modal bidirectional interaction model (CroBIM). Specifically, a context-aware prompt modulation (CAPM) module is designed to integrate spatial positional relationships and task-specific knowledge into the linguistic features, thereby enhancing the ability to capture the target object. Additionally, a language-guided feature aggregation (LGFA) module is introduced to integrate linguistic information into multi-scale visual features, incorporating an attention deficit compensation mechanism to enhance feature aggregation. Finally, a mutual-interaction decoder (MID) is designed to enhance cross-modal feature alignment through cascaded bidirectional cross-attention, thereby enabling precise segmentation mask prediction. To further foster research on RRSIS, we also construct RISBench, a new large-scale benchmark dataset comprising 52,472 image-language-label triplets. Extensive benchmarking on RISBench and two other prevalent datasets demonstrates the superior performance of the proposed CroBIM over existing state-of-the-art (SOTA) methods. The source code for CroBIM and the RISBench dataset will be publicly available at this https URL

[CV-45] Synth-SONAR: Sonar Image Synthesis with Enhanced Diversity and Realism via Dual Diffusion Models and GPT Prompting

Link: https://arxiv.org/abs/2410.08612
Authors: Purushothaman Natarajan,Kamal Basha,Athira Nambiar
Keywords: marine biology, Sonar, Sonar image synthesis, underwater exploration, crucial for advancing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 5 tables and 9 figures

Abstract:Sonar image synthesis is crucial for advancing applications in underwater exploration, marine biology, and defence. Traditional methods often rely on extensive and costly data collection using sonar sensors, jeopardizing data quality and diversity. To overcome these limitations, this study proposes a new sonar image synthesis framework, Synth-SONAR, leveraging diffusion models and GPT prompting. The key novelties of Synth-SONAR are threefold: First, it integrates Generative AI-based style injection techniques with publicly available real/simulated data, producing one of the largest sonar data corpora for sonar research. Second, a dual text-conditioning sonar diffusion model hierarchy synthesizes coarse and fine-grained sonar images with enhanced quality and diversity. Third, high-level (coarse) and low-level (detailed) text-based sonar generation methods leverage advanced semantic information available in visual language models (VLMs) and GPT-prompting. During inference, the method generates diverse and realistic sonar images from textual prompts, bridging the gap between textual descriptions and sonar image generation. This marks the application of GPT-prompting in sonar imagery for the first time, to the best of our knowledge. Synth-SONAR achieves state-of-the-art results in producing high-quality synthetic sonar datasets, significantly enhancing their diversity and realism.

[CV-46] Conjugated Semantic Pool Improves OOD Detection with Pre-trained Vision-Language Models NEURIPS2024

Link: https://arxiv.org/abs/2410.08611
Authors: Mengyuan Chen,Junyu Gao,Changsheng Xu
Keywords: pre-trained vision-language model, potential OOD labels, OOD labels, extensive semantic pool, selecting potential OOD
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 28 pages, accepted by NeurIPS 2024

Abstract:A straightforward pipeline for zero-shot out-of-distribution (OOD) detection involves selecting potential OOD labels from an extensive semantic pool and then leveraging a pre-trained vision-language model to perform classification on both in-distribution (ID) and OOD labels. In this paper, we theorize that enhancing performance requires expanding the semantic pool, while increasing the expected probability of selected OOD labels being activated by OOD samples, and ensuring low mutual dependence among the activations of these OOD labels. A natural expansion manner is to adopt a larger lexicon; however, the inevitable introduction of numerous synonyms and uncommon words fails to meet the above requirements, indicating that viable expansion manners move beyond merely selecting words from a lexicon. Since OOD detection aims to correctly classify input images into ID/OOD class groups, we can “make up” OOD label candidates which are not standard class names but beneficial for the process. Observing that the original semantic pool is comprised of unmodified specific class names, we correspondingly construct a conjugated semantic pool (CSP) consisting of modified superclass names, each serving as a cluster center for samples sharing similar properties across different categories. Consistent with our established theory, expanding OOD label candidates with the CSP satisfies the requirements and outperforms existing works by 7.89% in FPR95. Codes are available in this https URL.
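
The pipeline in the first sentence of the abstract (classify an image against both ID labels and selected OOD labels) can be sketched as follows. This is a minimal illustration under stated assumptions: embeddings are plain Python lists standing in for vision-language model features, and the scoring rule (the softmax mass that falls on ID labels) is a common choice for this kind of pipeline, not necessarily the exact score used in the paper. The function names and the `temperature` value are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def id_score(
    image_emb: Sequence[float],
    id_label_embs: List[Sequence[float]],
    ood_label_embs: List[Sequence[float]],
    temperature: float = 0.01,
) -> float:
    """Fraction of softmax probability assigned to in-distribution labels.

    A score near 1 suggests an ID image; a score near 0 suggests OOD.
    """
    sims = [cosine(image_emb, t) for t in list(id_label_embs) + list(ood_label_embs)]
    m = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp((s - m) / temperature) for s in sims]
    return sum(exps[: len(id_label_embs)]) / sum(exps)
```

An image would then be flagged as OOD when `id_score(...)` falls below a chosen threshold. Under this scoring, an OOD label only helps when OOD images activate it more strongly than any ID label, which is the property the abstract's requirements on the semantic pool are concerned with.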

[CV-47] Text-To-Image with Generative Adversarial Networks

Link: https://arxiv.org/abs/2410.08608
Authors: Mehrshad Momen-Tayefeh
Keywords: Generating realistic images, Generating realistic, Generative Adversarial Networks, computer vision, field of computer
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Generating realistic images from human texts is one of the most challenging problems in the field of computer vision (CV). The meaning of descriptions given can be roughly reflected by existing text-to-image approaches. In this paper, our main purpose is to propose a brief comparison between five different methods based on Generative Adversarial Networks (GANs) to generate images from text. In addition, each model architecture synthesizes images at a different resolution. Furthermore, the best and worst obtained resolutions are 64×64 and 256×256, respectively. However, we checked and compared some metrics that measure the accuracy of each model. Also, by doing this study, we found out the best model for this problem by comparing the essential metrics of these different approaches.

[CV-48] VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understanding NEURIPS2024

Link: https://arxiv.org/abs/2410.08593
Authors: Houlun Chen,Xin Wang,Hong Chen,Zeyang Zhang,Wei Feng,Bin Huang,Jia Jia,Wenwu Zhu
Keywords: Corpus Moment Retrieval, Existing Video Corpus, Moment Retrieval, hinders precise video
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by 38th NeurIPS Datasets Benchmarks Track (NeurIPS 2024)

Abstract:Existing Video Corpus Moment Retrieval (VCMR) is limited to coarse-grained understanding, which hinders precise video moment localization when given fine-grained queries. In this paper, we propose a more challenging fine-grained VCMR benchmark requiring methods to localize the best-matched moment from the corpus with other partially matched candidates. To improve the dataset construction efficiency and guarantee high-quality data annotations, we propose VERIFIED, an automatic VidEo-text annotation pipeline to generate captions with RelIable FInE-grained statics and Dynamics. Specifically, we resort to large language models (LLM) and large multimodal models (LMM) with our proposed Statics and Dynamics Enhanced Captioning modules to generate diverse fine-grained captions for each video. To filter out the inaccurate annotations caused by the LLM hallucination, we propose a Fine-Granularity Aware Noise Evaluator where we fine-tune a video foundation model with disturbed hard-negatives augmented contrastive and matching losses. With VERIFIED, we construct a more challenging fine-grained VCMR benchmark containing Charades-FIG, DiDeMo-FIG, and ActivityNet-FIG which demonstrate a high level of annotation quality. We evaluate several state-of-the-art VCMR models on the proposed dataset, revealing that there is still significant scope for fine-grained video understanding in VCMR. Code and Datasets are in this https URL.

[CV-49] VIBES – Vision Backbone Efficient Selection WACV2025

Link: https://arxiv.org/abs/2410.08592
Authors: Joris Guerin,Shray Bansal,Amirreza Shaban,Paulo Mann,Harshvardhan Gazula
Keywords: specific target tasks, efficiently selecting high-performance, selecting high-performance pre-trained, high-performance pre-trained vision, target tasks
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 4 figures, under review at WACV 2025

Abstract:This work tackles the challenge of efficiently selecting high-performance pre-trained vision backbones for specific target tasks. Although exhaustive search within a finite set of backbones can solve this problem, it becomes impractical for large datasets and backbone pools. To address this, we introduce Vision Backbone Efficient Selection (VIBES), which aims to quickly find well-suited backbones, potentially trading off optimality for efficiency. We propose several simple yet effective heuristics to address VIBES and evaluate them across four diverse computer vision datasets. Our results show that these approaches can identify backbones that outperform those selected from generic benchmarks, even within a limited search budget of one hour on a single GPU. We reckon VIBES marks a paradigm shift from benchmarks to task-specific optimization.

[CV-50] ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression

Link: https://arxiv.org/abs/2410.08584
Authors: Yefei He,Feng Chen,Jing Liu,Wenqi Shao,Hong Zhou,Kaipeng Zhang,Bohan Zhuang
Keywords: scenarios involving high-resolution, involving high-resolution images, large vision-language models, fetching the key-value, images or videos
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages

Abstract:The efficiency of large vision-language models (LVLMs) is constrained by the computational bottleneck of the attention mechanism during the prefill phase and the memory bottleneck of fetching the key-value (KV) cache in the decoding phase, particularly in scenarios involving high-resolution images or videos. Visual content often exhibits substantial redundancy, resulting in highly sparse attention maps within LVLMs. This sparsity can be leveraged to accelerate attention computation or compress the KV cache through various approaches. However, most studies focus on addressing only one of these bottlenecks and do not adequately support dynamic adjustment of sparsity concerning distinct layers or tasks. In this paper, we present ZipVL, an efficient inference framework designed for LVLMs that resolves both computation and memory bottlenecks through a dynamic ratio allocation strategy of important tokens. This ratio is adaptively determined based on the layer-specific distribution of attention scores, rather than fixed hyper-parameters, thereby improving efficiency for less complex tasks while maintaining high performance for more challenging ones. Then we select important tokens based on their normalized attention scores and perform attention mechanism solely on those important tokens to accelerate the prefill phase. To mitigate the memory bottleneck in the decoding phase, we employ mixed-precision quantization to the KV cache, where high-bit quantization is used for caches of important tokens, while low-bit quantization is applied to those of less importance. Our experiments demonstrate that ZipVL can accelerate the prefill phase by 2.6× and reduce GPU memory usage by 50.0%, with a minimal accuracy reduction of only 0.2% on Video-MME benchmark over LongVA-7B model, effectively enhancing the generation efficiency of LVLMs.
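
The two ideas in the abstract, an adaptive ratio of important tokens and mixed-precision KV cache quantization, can be sketched in a few lines. This is an illustrative reading, not ZipVL's implementation: it assumes the ratio comes from keeping the smallest set of tokens whose normalized attention scores sum to a threshold `tau`, and it uses a simple uniform min-max quantizer. All function names, `tau`, and the bit widths are assumptions.

```python
from typing import List, Sequence


def select_important_tokens(scores: Sequence[float], tau: float = 0.95) -> List[int]:
    """Keep the fewest tokens whose normalized scores accumulate to >= tau.

    The number of kept tokens adapts to the score distribution: peaked
    (sparse) attention keeps few tokens, flat attention keeps many.
    """
    total = sum(scores)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected, cumulative = [], 0.0
    for i in order:
        selected.append(i)
        cumulative += scores[i] / total
        if cumulative >= tau:
            break
    return sorted(selected)


def quantize(values: Sequence[float], bits: int) -> List[float]:
    """Uniform min-max quantization to 2**bits levels, then dequantization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    step = (hi - lo) / ((1 << bits) - 1)
    return [lo + round((v - lo) / step) * step for v in values]


def compress_kv(
    cache: Sequence[Sequence[float]],
    important: Sequence[int],
    high_bits: int = 4,
    low_bits: int = 2,
) -> List[List[float]]:
    """Quantize each token's cached vector: more bits for important tokens."""
    important_set = set(important)
    return [
        quantize(vec, high_bits if t in important_set else low_bits)
        for t, vec in enumerate(cache)
    ]
```

With scores `[50, 30, 10, 5, 5]` and `tau=0.75`, only the first two tokens are kept; a flatter distribution would keep more, which is the sense in which the ratio is not a fixed hyper-parameter.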

[CV-51] DeBiFormer: Vision Transformer with Deformable Agent Bi-level Routing Attention

Link: https://arxiv.org/abs/2410.08582
Authors: Nguyen Huu Bao Long,Chenyu Zhang,Yuzhi Shi,Tsubasa Hirakawa,Takayoshi Yamashita,Tohgoroh Matsui,Hironobu Fujiyoshi
Keywords: demonstrated superior performance, Deformable Bi-level Routing, Bi-level Routing Attention, demonstrated superior, superior performance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 7 figures. arXiv admin note: text overlap with arXiv:2303.08810 by other authors

Abstract:Vision Transformers with various attention modules have demonstrated superior performance on vision tasks. While using sparsity-adaptive attention, such as in DAT, has yielded strong results in image classification, the key-value pairs selected by deformable points lack semantic relevance when fine-tuning for semantic segmentation tasks. The query-aware sparsity attention in BiFormer seeks to focus each query on top-k routed regions. However, during attention calculation, the selected key-value pairs are influenced by too many irrelevant queries, reducing attention on the more important ones. To address these issues, we propose the Deformable Bi-level Routing Attention (DBRA) module, which optimizes the selection of key-value pairs using agent queries and enhances the interpretability of queries in attention maps. Based on this, we introduce the Deformable Bi-level Routing Attention Transformer (DeBiFormer), a novel general-purpose vision transformer built with the DBRA module. DeBiFormer has been validated on various computer vision tasks, including image classification, object detection, and semantic segmentation, providing strong evidence of its effectiveness. Code is available at this https URL
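
The "top-k routed regions" idea attributed to BiFormer in the abstract can be illustrated with a small sketch: summarize each region by the mean of its token vectors, score region pairs by dot product, and let each query region attend only to its k best-matching key regions. This is a simplified illustration of region-level routing only; it does not implement the deformable agent queries that DBRA adds, and the function names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def topk_routing(
    region_queries: Sequence[Sequence[Vector]],
    region_keys: Sequence[Sequence[Vector]],
    k: int,
) -> List[List[int]]:
    """For each query region, return the indices of its top-k key regions."""
    q_means = [mean_vector(r) for r in region_queries]
    k_means = [mean_vector(r) for r in region_keys]
    routes = []
    for q in q_means:
        # Region-to-region affinity: dot product of region-level summaries.
        affinity = [sum(a * b for a, b in zip(q, km)) for km in k_means]
        ranked = sorted(range(len(affinity)), key=lambda j: affinity[j], reverse=True)
        routes.append(ranked[:k])
    return routes
```

Token-level attention would then be computed only between a query region's tokens and the tokens gathered from its routed regions, which is where the sparsity saving comes from.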

[CV-52] Diffusion-Based Depth Inpainting for Transparent and Reflective Objects

Link: https://arxiv.org/abs/2410.08567
Authors: Tianyu Sun,Dingchang Hu,Yixiang Dai,Guijin Wang
Keywords: imaging techniques due, Transparent and reflective, reflective objects, everyday lives, present a significant
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Transparent and reflective objects, which are common in our everyday lives, present a significant challenge to 3D imaging techniques due to their unique visual and optical properties. Faced with these types of objects, RGB-D cameras fail to capture the real depth value with their accurate spatial information. To address this issue, we propose DITR, a diffusion-based Depth Inpainting framework specifically designed for Transparent and Reflective objects. This network consists of two stages, including a Region Proposal stage and a Depth Inpainting stage. DITR dynamically analyzes the optical and geometric depth loss and inpaints them automatically. Furthermore, comprehensive experimental results demonstrate that DITR is highly effective in depth inpainting tasks of transparent and reflective objects with robust adaptability.

[CV-53] Baichuan-Omni Technical Report

Link: https://arxiv.org/abs/2410.08565
Authors: Yadong Li,Haoze Sun,Mingan Lin,Tianpeng Li,Guosheng Dong,Tao Zhang,Bowen Ding,Wei Song,Zhenglin Cheng,Yuqi Huo,Song Chen,Xu Li,Da Pan,Shusen Zhang,Xin Wu,Zheng Liang,Jun Liu,Tao Zhang,Keer Lu,Yaqi Zhao,Yanjun Shen,Fan Yang,Kaicheng Yu,Tao Lin,Jianhua Xu,Zenan Zhou,Weipeng Chen
Keywords: high-performing open-source counterpart, salient multimodal capabilities, Large Language Model, multimodal interactive experience, Multimodal Large Language
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:The salient multimodal capabilities and interactive experience of GPT-4o highlight its critical role in practical applications, yet it lacks a high-performing open-source counterpart. In this paper, we introduce Baichuan-Omni, the first open-source 7B Multimodal Large Language Model (MLLM) adept at concurrently processing and analyzing modalities of image, video, audio, and text, while delivering an advanced multimodal interactive experience and strong performance. We propose an effective multimodal training schema starting with a 7B model and proceeding through two stages of multimodal alignment and multitask fine-tuning across the audio, image, video, and text modalities. This approach equips the language model with the ability to handle visual and audio data effectively. Demonstrating strong performance across various omni-modal and multimodal benchmarks, we aim for this contribution to serve as a competitive baseline for the open-source community in advancing multimodal understanding and real-time interaction.

[CV-54] Context-Aware Full Body Anonymization using Text-to-Image Diffusion Models

Link: https://arxiv.org/abs/2410.08551
Authors: Pascal Zwick,Kevin Roesch,Marvin Klemp,Oliver Bringmann
Keywords: real world datasets, plays a key, key role, role in protecting, protecting sensible information
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Anonymization plays a key role in protecting sensitive information of individuals in real-world datasets. Self-driving cars, for example, need high-resolution facial features to track people and their viewing direction to predict future behaviour and react accordingly. In order to protect people’s privacy whilst keeping important features in the dataset, it is important to replace the full body of a person with a highly detailed anonymized one. In contrast to face anonymization, full body replacement decreases the ability to recognize people by their hairstyle or clothes. In this paper, we propose a workflow for full body person anonymization utilizing Stable Diffusion as a generative backend. Text-to-image diffusion models, like Stable Diffusion, OpenAI’s DALL-E or Midjourney, have recently become very popular, being able to create photorealistic images from a single text prompt. We show that our method outperforms state-of-the-art anonymization pipelines with respect to image quality, resolution, Inception Score (IS) and Fréchet Inception Distance (FID). Additionally, our method is invariant with respect to the image generator and can thus be used with the latest models available.

[CV-55] Quality Prediction of AI Generated Images and Videos: Emerging Trends and Opportunities

Link: https://arxiv.org/abs/2410.08534
Authors: Abhijay Ghildyal,Yuanhan Chen,Saman Zadtootaghaj,Nabajeet Barman,Alan C. Bovik
Keywords: creating realistic images, generation models capable, video generation models, Video Quality Assessment, Image Quality Assessment
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: “The abstract field cannot be longer than 1,920 characters”, the abstract appearing here is slightly shorter than that in the PDF file

Abstract:The advent of AI has influenced many aspects of human life, from self-driving cars and intelligent chatbots to text-based image and video generation models capable of creating realistic images and videos based on user prompts (text-to-image, image-to-image, and image-to-video). AI-based methods for image and video super resolution, video frame interpolation, denoising, and compression have already gathered significant attention and interest in the industry and some solutions are already being implemented in real-world products and services. However, to achieve widespread integration and acceptance, AI-generated and enhanced content must be visually accurate, adhere to intended use, and maintain high visual quality to avoid degrading the end user’s quality of experience (QoE). One way to monitor and control the visual “quality” of AI-generated and -enhanced content is by deploying Image Quality Assessment (IQA) and Video Quality Assessment (VQA) models. However, most existing IQA and VQA models measure visual fidelity in terms of “reconstruction” quality against a pristine reference content and were not designed to assess the quality of “generative” artifacts. To address this, newer metrics and models have recently been proposed, but their performance evaluation and overall efficacy have been limited by datasets that were too small or otherwise lack representative content and/or distortion capacity; and by performance measures that can accurately report the success of an IQA/VQA model for “GenAI”. This paper examines the current shortcomings and possibilities presented by AI-generated and enhanced image and video content, with a particular focus on end-user perceived quality. Finally, we discuss open questions and make recommendations for future work on the “GenAI” quality assessment problems, towards further progressing on this interesting and relevant field of research. 

[CV-56] Diffusion Models Need Visual Priors for Image Generation

Link: https://arxiv.org/abs/2410.08531
Authors: Xiaoyu Yue,Zidong Wang,Zeyu Lu,Shuyang Sun,Meng Wei,Wanli Ouyang,Lei Bai,Luping Zhou
Keywords: Conventional class-guided diffusion, Conventional class-guided, correct semantic content, models generally succeed, class-guided diffusion models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint

Abstract:Conventional class-guided diffusion models generally succeed in generating images with correct semantic content, but often struggle with texture details. This limitation stems from the usage of class priors, which only provide coarse and limited conditional information. To address this issue, we propose Diffusion on Diffusion (DoD), an innovative multi-stage generation framework that first extracts visual priors from previously generated samples, then provides rich guidance for the diffusion model leveraging visual priors from the early stages of diffusion sampling. Specifically, we introduce a latent embedding module that employs a compression-reconstruction approach to discard redundant detail information from the conditional samples in each stage, retaining only the semantic information for guidance. We evaluate DoD on the popular ImageNet-256×256 dataset, reducing training cost by 7× compared to SiT and DiT with even better performance in terms of the FID-50K score. Our largest model DoD-XL achieves an FID-50K score of 1.83 with only 1 million training steps, which surpasses other state-of-the-art methods without bells and whistles during inference.

[CV-57] Ego3DT: Tracking Every 3D Object in Ego-centric Videos

Link: https://arxiv.org/abs/2410.08530
Authors: Shengyu Hao,Wenhao Chai,Zhonghan Zhao,Meiqi Sun,Wendi Hu,Jieyang Zhou,Yixian Zhao,Qi Li,Yizhou Wang,Xi Li,Gaoang Wang
Keywords: brought ego-centric perspectives, contemporary research, growing interest, interest in embodied, embodied intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted by ACM Multimedia 2024

Abstract:The growing interest in embodied intelligence has brought ego-centric perspectives to contemporary research. One significant challenge within this realm is the accurate localization and tracking of objects in ego-centric videos, primarily due to the substantial variability in viewing angles. Addressing this issue, this paper introduces a novel zero-shot approach for the 3D reconstruction and tracking of all objects from the ego-centric video. We present Ego3DT, a novel framework that initially identifies and extracts detection and segmentation information of objects within the ego environment. Utilizing information from adjacent video frames, Ego3DT dynamically constructs a 3D scene of the ego view using a pre-trained 3D scene reconstruction model. Additionally, we have innovated a dynamic hierarchical association mechanism for creating stable 3D tracking trajectories of objects in ego-centric videos. Moreover, the efficacy of our approach is corroborated by extensive experiments on two newly compiled datasets, with 1.04x - 2.90x in HOTA, showcasing the robustness and accuracy of our method in diverse ego-centric scenarios.

[CV-58] VOVTrack: Exploring the Potentiality in Videos for Open-Vocabulary Object Tracking

Link: https://arxiv.org/abs/2410.08529
Authors: Zekun Qian,Ruize Han,Junhui Hou,Linqi Song,Wei Feng
Keywords: base classes, diverse object categories, unseen categories, represents a critical, categories
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Open-vocabulary multi-object tracking (OVMOT) represents a critical new challenge involving the detection and tracking of diverse object categories in videos, encompassing both seen categories (base classes) and unseen categories (novel classes). This issue amalgamates the complexities of open-vocabulary object detection (OVD) and multi-object tracking (MOT). Existing approaches to OVMOT often merge OVD and MOT methodologies as separate modules, predominantly focusing on the problem through an image-centric lens. In this paper, we propose VOVTrack, a novel method that integrates object states relevant to MOT and video-centric training to address this challenge from a video object tracking standpoint. First, we consider the tracking-related state of the objects during tracking and propose a new prompt-guided attention mechanism for more accurate localization and classification (detection) of the time-varying objects. Subsequently, we leverage raw video data without annotations for training by formulating a self-supervised object similarity learning technique to facilitate temporal object association (tracking). Experimental results underscore that VOVTrack outperforms existing methods, establishing itself as a state-of-the-art solution for the open-vocabulary tracking task.

[CV-59] A Bayesian Approach to Weakly-supervised Laparoscopic Image Segmentation MICCAI2024

Link: https://arxiv.org/abs/2410.08509
Authors: Zhou Zheng,Yuichiro Hayashi,Masahiro Oda,Takayuki Kitasaka,Kensaku Mori
Keywords: study weakly-supervised laparoscopic, study weakly-supervised, comprehensive Bayesian framework, weakly-supervised laparoscopic image, Bayesian deep learning
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Early acceptance at MICCAI 2024. Supplementary material included. Minor typo corrections in notation have been made

Abstract:In this paper, we study weakly-supervised laparoscopic image segmentation with sparse annotations. We introduce a novel Bayesian deep learning approach designed to enhance both the accuracy and interpretability of the model’s segmentation, founded upon a comprehensive Bayesian framework, ensuring a robust and theoretically validated method. Our approach diverges from conventional methods that directly train using observed images and their corresponding weak annotations. Instead, we estimate the joint distribution of both images and labels given the acquired data. This facilitates the sampling of images and their high-quality pseudo-labels, enabling the training of a generalizable segmentation model. Each component of our model is expressed through probabilistic formulations, providing a coherent and interpretable structure. This probabilistic nature benefits accurate and practical learning from sparse annotations and equips our model with the ability to quantify uncertainty. Extensive evaluations with two public laparoscopic datasets demonstrated the efficacy of our method, which consistently outperformed existing methods. Furthermore, our method was adapted for scribble-supervised cardiac multi-structure segmentation, presenting competitive performance compared to previous methods. The code is available at this https URL.

[CV-60] SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models

Link: https://arxiv.org/abs/2410.08474
Authors: Haotian Xia,Zhengbang Yang,Junbo Zou,Rhys Tracy,Yuqing Wang,Chi Lu,Christopher Lai,Yanjun He,Xun Shao,Zhuoqing Xie,Yuan-fang Wang,Weining Shen,Hanjie Chen
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, Large Language
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Abstract:Multimodal Large Language Models (MLLMs) are advancing the ability to reason about complex sports scenarios by integrating textual and visual information. To comprehensively evaluate their capabilities, we introduce SPORTU, a benchmark designed to assess MLLMs across multi-level sports reasoning tasks. SPORTU comprises two key components. The first, SPORTU-text, features 900 multiple-choice questions with human-annotated explanations for rule comprehension and strategy understanding; it tests models’ ability to reason about sports solely through question-answering (QA), without requiring visual inputs. The second, SPORTU-video, consists of 1,701 slow-motion video clips across 7 different sports and 12,048 QA pairs, designed to assess multi-level reasoning, from simple sports recognition to complex tasks like foul detection and rule application. We evaluate four prevalent LLMs on SPORTU-text, mainly using few-shot learning supplemented by chain-of-thought (CoT) prompting. GPT-4o achieves the highest accuracy of 71%, but still falls short of human-level performance, highlighting room for improvement in rule comprehension and reasoning. The evaluation for the SPORTU-video part includes 7 proprietary and 6 open-source MLLMs. Experiments show that models fall short on hard tasks that require deep reasoning and rule-based understanding. Claude-3.5-Sonnet performs the best with only 52.6% accuracy on the hard task, showing large room for improvement. We hope that SPORTU will serve as a critical step toward evaluating models’ capabilities in sports understanding and reasoning.

[CV-61] DAT: Dialogue-Aware Transformer with Modality-Group Fusion for Human Engagement Estimation

Link: https://arxiv.org/abs/2410.08470
Authors: Jia Li,Yangchen Yu,Yin Chen,Yu Zhang,Peng Jia,Yunbo Xu,Ziqiang Li,Meng Wang,Richang Hong
Keywords: attracting increasing research, increasing research interests, understanding human social, Engagement estimation plays, human social behaviors
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Comments: 1st Place on the NoXi Base dataset in the Multi-Domain Engagement Estimation Challenge held by MultiMediate 24, accepted by ACM Multimedia 2024. The source code is available at this https URL

Abstract:Engagement estimation plays a crucial role in understanding human social behaviors, attracting increasing research interests in fields such as affective computing and human-computer interaction. In this paper, we propose a Dialogue-Aware Transformer framework (DAT) with Modality-Group Fusion (MGF), which relies solely on audio-visual input and is language-independent, for estimating human engagement in conversations. Specifically, our method employs a modality-group fusion strategy that independently fuses audio and visual features within each modality for each person before inferring the entire audio-visual content. This strategy significantly enhances the model’s performance and robustness. Additionally, to better estimate the target participant’s engagement levels, the introduced Dialogue-Aware Transformer considers both the participant’s behavior and cues from their conversational partners. Our method was rigorously tested in the Multi-Domain Engagement Estimation Challenge held by MultiMediate’24, demonstrating notable improvements in engagement-level regression precision over the baseline model. Notably, our approach achieves a CCC score of 0.76 on the NoXi Base test set and an average CCC of 0.64 across the NoXi Base, NoXi-Add, and MPIIGI test sets.
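
The CCC scores quoted above can be made concrete with a small sketch. It assumes that CCC stands for Lin's concordance correlation coefficient, which the abstract does not expand; the function name `concordance_correlation` is hypothetical and is not taken from the authors' code.

```python
from statistics import fmean


def concordance_correlation(predictions, targets):
    """Lin's concordance correlation coefficient (assumed meaning of CCC).

    CCC = 2 * cov(p, t) / (var(p) + var(t) + (mean(p) - mean(t)) ** 2),
    using population (divide-by-n) variance and covariance.
    """
    if len(predictions) != len(targets) or len(predictions) < 2:
        raise ValueError("need two sequences of equal length >= 2")

    mean_p = fmean(predictions)
    mean_t = fmean(targets)
    var_p = fmean([(p - mean_p) ** 2 for p in predictions])
    var_t = fmean([(t - mean_t) ** 2 for t in targets])
    cov = fmean([(p - mean_p) * (t - mean_t)
                 for p, t in zip(predictions, targets)])

    denominator = var_p + var_t + (mean_p - mean_t) ** 2
    if denominator == 0:
        raise ValueError("CCC is undefined for two identical constant sequences")
    return 2 * cov / denominator
```

Unlike plain Pearson correlation, this measure penalizes a constant offset between predictions and targets: predictions that track the targets perfectly but are shifted by a fixed amount score below 1.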

[CV-62] Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in CLIP EMNLP2024

Link: https://arxiv.org/abs/2410.08469
Authors: Eunji Kim,Kyuhong Shim,Simyung Chang,Sungroh Yoon
Keywords: Vision-Language Models, translating textual input, embedding space shared, natural language, encoder within Vision-Language
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at EMNLP 2024 Findings

Abstract:A text encoder within Vision-Language Models (VLMs) like CLIP plays a crucial role in translating textual input into an embedding space shared with images, thereby facilitating the interpretative analysis of vision tasks through natural language. Despite the varying significance of different textual elements within a sentence depending on the context, efforts to account for variation of importance in constructing text embeddings have been lacking. We propose a framework of Semantic Token Reweighting to build Interpretable text embeddings (SToRI), which incorporates controllability as well. SToRI refines the text encoding process in CLIP by differentially weighting semantic elements based on contextual importance, enabling finer control over emphasis responsive to data-driven insights and user preferences. The efficacy of SToRI is demonstrated through comprehensive experiments on few-shot image classification and image retrieval tailored to user preferences.
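
The abstract does not say how the token weights enter the text encoder. One way to picture token reweighting, assumed here purely for illustration and not claimed to be the SToRI formulation, is to scale each token's unnormalized attention weight by a per-token importance weight before normalizing. The function name `reweighted_attention` is hypothetical.

```python
import math


def reweighted_attention(scores, token_weights):
    """Softmax over attention scores with per-token importance weights.

    Illustrative assumption: token j receives attention proportional to
    token_weights[j] * exp(scores[j]). With all weights equal to 1 this
    reduces to the ordinary softmax.
    """
    if len(scores) != len(token_weights) or not scores:
        raise ValueError("scores and token_weights must be non-empty and equal length")
    if any(w < 0 for w in token_weights):
        raise ValueError("token weights must be non-negative")

    # Subtract the maximum score for numerical stability.
    max_score = max(scores)
    unnormalized = [w * math.exp(s - max_score)
                    for s, w in zip(scores, token_weights)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("at least one token weight must be positive")
    return [u / total for u in unnormalized]
```

In this sketch, raising a token's weight above 1 increases its share of attention and lowering it toward 0 suppresses the token, which is the kind of user-controllable emphasis the abstract describes.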

[CV-63] Aligned Divergent Pathways for Omni-Domain Generalized Person Re-Identification ICECET2024

Link: https://arxiv.org/abs/2410.08466
Authors: Eugene P.W. Ang,Shan Lin,Alex C. Kot
Keywords: Person Re-identification, Person ReID, Person, advanced significantly, Generalization Person ReID
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 2024 International Conference on Electrical, Computer and Energy Technologies (ICECET)

Abstract:Person Re-identification (Person ReID) has advanced significantly in fully supervised and domain generalized Person ReID. However, methods developed for one task domain transfer poorly to the other. An ideal Person ReID method should be effective regardless of the number of domains involved in training or testing. Furthermore, given training data from the target domain, it should perform at least as well as state-of-the-art (SOTA) fully supervised Person ReID methods. We call this paradigm Omni-Domain Generalization Person ReID, referred to as ODG-ReID, and propose a way to achieve this by expanding compatible backbone architectures into multiple diverse pathways. Our method, Aligned Divergent Pathways (ADP), first converts a base architecture into a multi-branch structure by copying the tail of the original backbone. We design our module Dynamic Max-Deviance Adaptive Instance Normalization (DyMAIN) that encourages learning of generalized features that are robust to omni-domain directions and apply DyMAIN to the branches of ADP. Our proposed Phased Mixture-of-Cosines (PMoC) coordinates a mix of stable and turbulent learning rate schedules among branches for further diversified learning. Finally, we realign the feature space between branches with our proposed Dimensional Consistency Metric Loss (DCML). ADP outperforms the state-of-the-art (SOTA) results for multi-source domain generalization and supervised ReID within the same domain. Furthermore, our method demonstrates improvement on a wide range of single-source domain generalization benchmarks, achieving Omni-Domain Generalization over Person ReID tasks.

[CV-64] Diverse Deep Feature Ensemble Learning for Omni-Domain Generalized Person Re-identification

Link: https://arxiv.org/abs/2410.08460
Authors: Eugene P.W. Ang,Shan Lin,Alex C. Kot
Keywords: Person Re-identification, Person ReID, Person, Re-identification, ReID
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICMIP '24: Proceedings of the 2024 9th International Conference on Multimedia and Image Processing, Pages 64 - 71

Abstract:Person Re-identification (Person ReID) has progressed to a level where single-domain supervised Person ReID performance has saturated. However, such methods experience a significant drop in performance when trained and tested across different datasets, motivating the development of domain generalization techniques. However, our research reveals that domain generalization methods significantly underperform single-domain supervised methods on single dataset benchmarks. An ideal Person ReID method should be effective regardless of the number of domains involved, and when test domain data is available for training it should perform as well as state-of-the-art (SOTA) fully supervised methods. This is a paradigm that we call Omni-Domain Generalization Person ReID (ODG-ReID). We propose a way to achieve ODG-ReID by creating deep feature diversity with self-ensembles. Our method, Diverse Deep Feature Ensemble Learning (D2FEL), deploys unique instance normalization patterns that generate multiple diverse views and recombines these views into a compact encoding. To the best of our knowledge, our work is one of few to consider omni-domain generalization in Person ReID, and we advance the study of applying feature ensembles in Person ReID. D2FEL significantly improves and matches the SOTA performance for major domain generalization and single-domain supervised benchmarks.

[CV-65] A Unified Deep Semantic Expansion Framework for Domain-Generalized Person Re-identification

Link: https://arxiv.org/abs/2410.08456
Authors: Eugene P.W. Ang,Shan Lin,Alex C. Kot
Keywords: Supervised Person Re-identification, Supervised Person, achieved excellent performance, Person Re-identification, Generalized Person Re-identification
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Neurocomputing Volume 600, 1 October 2024, 128120. 15 pages

Abstract:Supervised Person Re-identification (Person ReID) methods have achieved excellent performance when training and testing within one camera network. However, they usually suffer from considerable performance degradation when applied to different camera systems. In recent years, many Domain Adaptation Person ReID methods have been proposed, achieving impressive performance without requiring labeled data from the target domain. However, these approaches still need the unlabeled data of the target domain during the training process, making them impractical in many real-world scenarios. Our work focuses on the more practical Domain Generalized Person Re-identification (DG-ReID) problem. Given one or more source domains, it aims to learn a generalized model that can be applied to unseen target domains. One promising research direction in DG-ReID is the use of implicit deep semantic feature expansion, and our previous method, Domain Embedding Expansion (DEX), is one such example that achieves powerful results in DG-ReID. However, in this work we show that DEX and other similar implicit deep semantic feature expansion methods, due to limitations in their proposed loss function, fail to reach their full potential on large evaluation benchmarks as they have a tendency to saturate too early. Leveraging on this analysis, we propose Unified Deep Semantic Expansion, our novel framework that unifies implicit and explicit semantic feature expansion techniques in a single framework to mitigate this early over-fitting and achieve a new state-of-the-art (SOTA) in all DG-ReID benchmarks. Further, we apply our method on more general image retrieval tasks, also surpassing the current SOTA in all of these benchmarks by wide margins.

[CV-66] HorGait: Advancing Gait Recognition with Efficient High-Order Spatial Interactions in LiDAR Point Clouds

Link: https://arxiv.org/abs/2410.08454
Authors: Jiaxing Hao,Yanxi Wang,Zhigang Chang,Hongmin Gao,Zihao Cheng,Chen Wu,Xin Zhao,Peiye Fang,Rachmat Muwardi
Keywords: remote biometric technology, extreme lighting conditions, Transformer architecture, Gait recognition, Transformer
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Gait recognition is a remote biometric technology that utilizes the dynamic characteristics of human movement to identify individuals even under various extreme lighting conditions. Due to the limitation in spatial perception capability inherent in 2D gait representations, LiDAR can directly capture 3D gait features and represent them as point clouds, reducing environmental and lighting interference in recognition while significantly advancing privacy protection. For complex 3D representations, shallow networks fail to achieve accurate recognition, making vision Transformers the foremost prevalent method. However, the prevalence of dumb patches has limited the widespread use of Transformer architecture in gait recognition. This paper proposes a method named HorGait, which utilizes a hybrid model with a Transformer architecture for gait recognition on the planar projection of 3D point clouds from LiDAR. Specifically, it employs a hybrid model structure called LHM Block to achieve input adaptation, long-range, and high-order spatial interaction of the Transformer architecture. Additionally, it uses large convolutional kernel CNNs to segment the input representation, replacing attention windows to reduce dumb patches. We conducted extensive experiments, and the results show that HorGait achieves state-of-the-art performance among Transformer architecture methods on the SUSTech1K dataset, verifying that the hybrid model can complete the full Transformer process and perform better in point cloud planar projection. The outstanding performance of HorGait offers new insights for the future application of the Transformer architecture in gait recognition.

[CV-67] Human Stone Toolmaking Action Grammar (HSTAG): A Challenging Benchmark for Fine-grained Motor Behavior Recognition

Link: https://arxiv.org/abs/2410.08410
Authors: Cheng Liu,Xuyang Yan,Zekun Zhang,Cheng Ding,Tianhao Zhao,Shaya Jannati,Cynthia Martinez,Dietrich Stout
Keywords: past decade, Human Stone Toolmaking, witnessed the development, growing number, Toolmaking Action Grammar
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures, accepted by the 11th IEEE International Conference on Data Science and Advanced Analytics (DSAA)

Abstract:Action recognition has witnessed the development of a growing number of novel algorithms and datasets in the past decade. However, the majority of public benchmarks were constructed around activities of daily living and annotated at a rather coarse-grained level, which lacks diversity in domain-specific datasets, especially for rarely seen domains. In this paper, we introduce Human Stone Toolmaking Action Grammar (HSTAG), a meticulously annotated video dataset showcasing previously undocumented stone toolmaking behaviors, which can be used for investigating the applications of advanced artificial intelligence techniques in understanding a rapid succession of complex interactions between two hand-held objects. HSTAG consists of 18,739 video clips that record 4.5 hours of experts’ activities in stone toolmaking. Its unique features include (i) brief action durations and frequent transitions, mirroring the rapid changes inherent in many motor behaviors; (ii) multiple angles of view and switches among multiple tools, increasing intra-class variability; (iii) unbalanced class distributions and high similarity among different action sequences, adding difficulty in capturing distinct patterns for each action. Several mainstream action recognition models are used to conduct experimental analysis, which showcases the challenges and uniqueness of HSTAG (see this https URL).

[CV-68] Optimizing YOLO Architectures for Optimal Road Damage Detection and Classification: A Comparative Study from YOLOv7 to YOLOv10

Link: https://arxiv.org/abs/2410.08409
Authors: Vung Pham,Lan Dong Thi Ngoc,Duy-Linh Bui
Keywords: Maintaining roadway infrastructure, sustainable transportation system, Maintaining roadway, ensuring a safe, transportation system
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Invited paper in the Optimized Road Damage Detection Challenge (ORDDC’2024), a track in the IEEE BigData 2024 Challenge

Abstract:Maintaining roadway infrastructure is essential for ensuring a safe, efficient, and sustainable transportation system. However, manual data collection for detecting road damage is time-consuming, labor-intensive, and poses safety risks. Recent advancements in artificial intelligence, particularly deep learning, offer a promising solution for automating this process using road images. This paper presents a comprehensive workflow for road damage detection using deep learning models, focusing on optimizations for inference speed while preserving detection accuracy. Specifically, to accommodate hardware limitations, large images are cropped, and lightweight models are utilized. Additionally, an external pothole dataset is incorporated to enhance the detection of this underrepresented damage class. The proposed approach employs multiple model architectures, including a custom YOLOv7 model with Coordinate Attention layers and a Tiny YOLOv7 model, which are trained and combined to maximize detection performance. The models are further reparameterized to optimize inference efficiency. Experimental results demonstrate that the ensemble of the custom YOLOv7 model with three Coordinate Attention layers and the default Tiny YOLOv7 model achieves an F1 score of 0.7027 with an inference speed of 0.0547 seconds per image. The complete pipeline, including data preprocessing, model training, and inference scripts, is publicly available on the project’s GitHub repository, enabling reproducibility and facilitating further research.
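
To put the two reported numbers in context, the sketch below shows the standard F1 definition and converts per-image latency into throughput. The abstract does not report precision or recall, so any values passed to `f1_score` are placeholders; both function names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def images_per_second(seconds_per_image):
    """Convert a per-image latency in seconds into throughput."""
    if seconds_per_image <= 0:
        raise ValueError("latency must be positive")
    return 1.0 / seconds_per_image


# The reported latency of 0.0547 s per image corresponds to roughly
# 18 images per second.
throughput = images_per_second(0.0547)
```

Because F1 is a harmonic mean, it stays close to the lower of the two inputs, so a detector cannot reach a high F1 by trading away either precision or recall.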

[CV-69] AgroGPT: Efficient Agricultural Vision-Language Model with Expert Tuning

Link: https://arxiv.org/abs/2410.08405
Authors: Muhammad Awais,Ali Husain Salem Abdulla Alharthi,Amandeep Kumar,Hisham Cholakkal,Rao Muhammad Anwer
Keywords: Significant progress, capitalizing on vast, made in advancing, vast repositories, Significant
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Significant progress has been made in advancing large multimodal conversational models (LMMs), capitalizing on vast repositories of image-text data available online. Despite this progress, these models often encounter substantial domain gaps, hindering their ability to engage in complex conversations across new domains. Recent efforts have aimed to mitigate this issue, albeit relying on domain-specific image-text data to curate instruction-tuning data. However, many domains, such as agriculture, lack such vision-language data. In this work, we propose an approach to construct instruction-tuning data that harnesses vision-only data for the agriculture domain. We utilize diverse agricultural datasets spanning multiple domains, curate class-specific information, and employ large language models (LLMs) to construct an expert-tuning set, resulting in a 70k expert-tuning dataset called AgroInstruct. Subsequently, we expert-tuned and created AgroGPT, an efficient LMM that can hold complex agriculture-related conversations and provide useful insights. We also develop AgroEvals for evaluation and compare AgroGPT’s performance with large open and closed-source models. AgroGPT excels at identifying fine-grained agricultural concepts, can act as an agriculture expert, and provides helpful information for multimodal agriculture questions. The code, datasets, and models are available at this https URL.

[CV-70] Are We Ready for Real-Time LiDAR Semantic Segmentation in Autonomous Driving? IROS2024

Link: https://arxiv.org/abs/2410.08365
Authors: Samir Abou Haidar,Alexandre Chariot,Mehdi Darouich,Cyril Joly,Jean-Emmanuel Deschaud
Keywords: point clouds typically, clouds typically generated, point clouds, detection and recognition, perception framework
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IROS 2024 PPNIV Workshop

Abstract:Within a perception framework for autonomous mobile and robotic systems, semantic analysis of 3D point clouds typically generated by LiDARs is key to numerous applications, such as object detection and recognition, and scene reconstruction. Scene semantic segmentation can be achieved by directly integrating 3D spatial data with specialized deep neural networks. Although this type of data provides rich geometric information regarding the surrounding environment, it also presents numerous challenges: its unstructured and sparse nature, its unpredictable size, and its demanding computational requirements. These characteristics hinder the real-time semantic analysis, particularly on resource-constrained hardware architectures that constitute the main computational components of numerous robotic applications. Therefore, in this paper, we investigate various 3D semantic segmentation methodologies and analyze their performance and capabilities for resource-constrained inference on embedded NVIDIA Jetson platforms. We evaluate them for a fair comparison through a standardized training protocol and data augmentations, providing benchmark results on the Jetson AGX Orin and AGX Xavier series for two large-scale outdoor datasets: SemanticKITTI and nuScenes.

[CV-71] Time Traveling to Defend Against Adversarial Example Attacks in Image Classification

Link: https://arxiv.org/abs/2410.08338
Authors: Anthony Etim,Jakub Szefer
Keywords: traffic sign, traffic sign classification, critical threat, sign, Adversarial attacks
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Adversarial example attacks have emerged as a critical threat to machine learning. Adversarial attacks in image classification abuse various, minor modifications to the image that confuse the image classification neural network – while the image still remains recognizable to humans. One important domain where the attacks have been applied is in the automotive setting with traffic sign classification. Researchers have demonstrated that adding stickers, shining light, or adding shadows are all different means to make machine learning inference algorithms mis-classify the traffic signs. This can cause potentially dangerous situations as a stop sign is recognized as a speed limit sign causing vehicles to ignore it and potentially leading to accidents. To address these attacks, this work focuses on enhancing defenses against such adversarial attacks. This work shifts the advantage to the user by introducing the idea of leveraging historical images and majority voting. While the attacker modifies a traffic sign that is currently being processed by the victim’s machine learning inference, the victim can gain advantage by examining past images of the same traffic sign. This work introduces the notion of ‘‘time traveling’’ and uses historical Street View images accessible to anybody to perform inference on different, past versions of the same traffic sign. In the evaluation, the proposed defense has 100% effectiveness against latest adversarial example attack on traffic sign classification algorithm.
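
The majority-voting idea can be sketched in a few lines. This is a minimal illustration, assuming the classifier has already produced one label for the current image and one per historical image of the same sign; how the paper retrieves and aligns the Street View images is not described in the abstract, and the function names are hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label in a non-empty list of labels."""
    if not labels:
        raise ValueError("need at least one label")
    return Counter(labels).most_common(1)[0][0]


def classify_with_history(current_label, historical_labels):
    """Combine the label of the (possibly attacked) current image with
    labels predicted on past images of the same sign."""
    return majority_vote([current_label, *historical_labels])


# A stop sign that the attacked image misclassifies as a speed limit sign
# is outvoted by three historical images of the same sign.
decision = classify_with_history("speed_limit", ["stop", "stop", "stop"])
```

The defense relies on the attacker being able to alter only the sign as it appears now: past images were captured before the modification, so they outvote the single manipulated prediction.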

[CV-72] Level of agreement between emotions generated by Artificial Intelligence and human evaluation: a methodological proposal

Link: https://arxiv.org/abs/2410.08332
Authors: Miguel Carrasco,Cesar Gonzalez-Martin,Sonia Navajas-Torrente,Raul Dastres
Keywords: highly subjective, capable of conveying, experience is highly, emotions, conveying emotions
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 29 pages

Abstract:Images are capable of conveying emotions, but emotional experience is highly subjective. Advances in artificial intelligence have enabled the generation of images based on emotional descriptions. However, the level of agreement between the generative images and human emotional responses has not yet been evaluated. To address this, 20 artistic landscapes were generated using StyleGAN2-ADA. Four variants evoking positive emotions (contentment, amusement) and negative emotions (fear, sadness) were created for each image, resulting in 80 pictures. An online questionnaire was designed using this material, in which 61 observers classified the generated images. Statistical analyses were performed on the collected data to determine the level of agreement among participants, between the observer’s responses, and the AI-generated emotions. A generally good level of agreement was found, with better results for negative emotions. However, the study confirms the subjectivity inherent in emotional evaluation.
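
The abstract does not name the agreement statistic used. As a rough illustration only, the sketch below computes the simplest possible measure: the fraction of observer responses that match the emotion the generator was asked to evoke, per emotion. The function name and the input layout are assumptions.

```python
from collections import defaultdict


def agreement_by_emotion(responses):
    """Fraction of responses matching the intended emotion, per emotion.

    `responses` is an iterable of (intended_emotion, chosen_emotion) pairs,
    one pair per observer judgment (assumed layout).
    """
    matches = defaultdict(int)
    totals = defaultdict(int)
    for intended, chosen in responses:
        totals[intended] += 1
        if intended == chosen:
            matches[intended] += 1
    return {emotion: matches[emotion] / totals[emotion] for emotion in totals}
```

Raw percent agreement does not correct for agreement expected by chance, so a published analysis would more likely rely on a chance-corrected coefficient; this sketch only shows the shape of the comparison.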

[CV-73] Neural Architecture Search of Hybrid Models for NPU-CIM Heterogeneous AR/VR Devices

Link: https://arxiv.org/abs/2410.08326
Authors: Yiwei Zhao,Ziyun Li,Win-San Khwa,Xiaoyu Sun,Sai Qian Zhang,Syed Shakib Sarwar,Kleber Hugo Stangherlin,Yi-Lun Lu,Jorge Tomas Gomez,Jae-Sun Seo,Phillip B. Gibbons,Barbara De Salvo,Chiao Liu
Keywords: Augmented Reality applications, Virtual Reality, Augmented Reality, Reality applications, Reality and Augmented
Categories: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR); Machine Learning (cs.LG); Performance (cs.PF)

Abstract:Low-Latency and Low-Power Edge AI is essential for Virtual Reality and Augmented Reality applications. Recent advances show that hybrid models, combining convolution layers (CNN) and transformers (ViT), often achieve superior accuracy/performance tradeoff on various computer vision and machine learning (ML) tasks. However, hybrid ML models can pose system challenges for latency and energy-efficiency due to their diverse nature in dataflow and memory access patterns. In this work, we leverage the architecture heterogeneity from Neural Processing Units (NPU) and Compute-In-Memory (CIM) and perform diverse execution schemas to efficiently execute these hybrid models. We also introduce H4H-NAS, a Neural Architecture Search framework to design efficient hybrid CNN/ViT models for heterogeneous edge systems with both NPU and CIM. Our H4H-NAS approach is powered by a performance estimator built with NPU performance results measured on real silicon, and CIM performance based on industry IPs. H4H-NAS searches hybrid CNN/ViT models with fine granularity and achieves significant (up to 1.34%) top-1 accuracy improvement on ImageNet dataset. Moreover, results from our Algo/HW co-design reveal up to 56.08% overall latency and 41.72% energy improvements by introducing such heterogeneous computing over baseline solutions. The framework guides the design of hybrid network architectures and system architectures of NPU+CIM heterogeneous systems.

[CV-74] Music Genre Classification using Large Language Models

Link: https://arxiv.org/abs/2410.08321
Authors: Mohamed El Amine Meguenani,Alceu de Souza Britto Jr.,Alessandro Lameiras Koerich
Keywords: pre-trained large language, large language models, paper exploits, capabilities of pre-trained, pre-trained large
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Comments: 7 pages

Abstract:This paper exploits the zero-shot capabilities of pre-trained large language models (LLMs) for music genre classification. The proposed approach splits audio signals into 20 ms chunks and processes them through convolutional feature encoders, a transformer encoder, and additional layers for coding audio units and generating feature vectors. The extracted feature vectors are used to train a classification head. During inference, predictions on individual chunks are aggregated for a final genre classification. We conducted a comprehensive comparison of LLMs, including WavLM, HuBERT, and wav2vec 2.0, with traditional deep learning architectures like 1D and 2D convolutional neural networks (CNNs) and the audio spectrogram transformer (AST). Our findings demonstrate the superior performance of the AST model, achieving an overall accuracy of 85.5%, surpassing all other models evaluated. These results highlight the potential of LLMs and transformer-based architectures for advancing music information retrieval tasks, even in zero-shot scenarios.
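
The chunk-then-aggregate pipeline described above can be sketched as follows. The abstract does not say how chunk predictions are aggregated, so averaging per-chunk class probabilities is an assumption here, as are the function names and the choice to drop a trailing partial chunk.

```python
def split_into_chunks(samples, sample_rate, chunk_ms=20):
    """Split a sequence of audio samples into fixed-length chunks.

    A trailing partial chunk is dropped (an illustrative choice).
    """
    chunk_len = sample_rate * chunk_ms // 1000
    if chunk_len <= 0:
        raise ValueError("chunk length must be at least one sample")
    last_start = len(samples) - chunk_len
    return [samples[i:i + chunk_len]
            for i in range(0, last_start + 1, chunk_len)]


def aggregate_chunk_probabilities(chunk_probabilities):
    """Average per-chunk class probabilities and return the index of the
    class with the highest mean (assumed aggregation rule)."""
    if not chunk_probabilities:
        raise ValueError("need at least one chunk prediction")
    num_classes = len(chunk_probabilities[0])
    totals = [0.0] * num_classes
    for probabilities in chunk_probabilities:
        for k, p in enumerate(probabilities):
            totals[k] += p
    means = [t / len(chunk_probabilities) for t in totals]
    return max(range(num_classes), key=means.__getitem__)
```

At a 16 kHz sample rate, a 20 ms chunk holds 320 samples, so even a short clip yields many chunk-level predictions to aggregate.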

[CV-75] FusionSense: Bridging Common Sense Vision and Touch for Robust Sparse-View Reconstruction

Link: https://arxiv.org/abs/2410.08282
Authors: Irving Fang,Kairui Shi,Xujin He,Siqi Tan,Yifan Wang,Hanwen Zhao,Hung-Jui Huang,Wenzhen Yuan,Chen Feng,Jing Zhang
Keywords: Humans effortlessly integrate, Humans effortlessly, effortlessly integrate common-sense, integrate common-sense knowledge, effortlessly integrate
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Abstract:Humans effortlessly integrate common-sense knowledge with sensory input from vision and touch to understand their surroundings. Emulating this capability, we introduce FusionSense, a novel 3D reconstruction framework that enables robots to fuse priors from foundation models with highly sparse observations from vision and tactile sensors. FusionSense addresses three key challenges: (i) How can robots efficiently acquire robust global shape information about the surrounding scene and objects? (ii) How can robots strategically select touch points on the object using geometric and common-sense priors? (iii) How can partial observations such as tactile signals improve the overall representation of the object? Our framework employs 3D Gaussian Splatting as a core representation and incorporates a hierarchical optimization strategy involving global structure construction, object visual hull pruning and local geometric constraints. This advancement results in fast and robust perception in environments with traditionally challenging objects that are transparent, reflective, or dark, enabling more downstream manipulation or navigation tasks. Experiments on real-world data suggest that our framework outperforms previously state-of-the-art sparse-view methods. All code and data are open-sourced on the project website.

[CV-76] Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis

Link: https://arxiv.org/abs/2410.08261
Authors: Jinbin Bai, Tian Ye, Wei Chow, Enxin Song, Qing-Guo Chen, Xiangtai Li, Zhen Dong, Lei Zhu, Shuicheng Yan
Keywords: made significant strides, paradigm remains fundamentally, Stable Diffusion, unified language-vision models, complicating the development
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Diffusion models, such as Stable Diffusion, have made significant strides in visual generation, yet their paradigm remains fundamentally different from autoregressive language models, complicating the development of unified language-vision models. Recent efforts like LlamaGen have attempted autoregressive image generation using discrete VQVAE tokens, but the large number of tokens involved renders this approach inefficient and slow. In this work, we present Meissonic, which elevates non-autoregressive masked image modeling (MIM) text-to-image to a level comparable with state-of-the-art diffusion models like SDXL. By incorporating a comprehensive suite of architectural innovations, advanced positional encoding strategies, and optimized sampling conditions, Meissonic substantially improves MIM’s performance and efficiency. Additionally, we leverage high-quality training data, integrate micro-conditions informed by human preference scores, and employ feature compression layers to further enhance image fidelity and resolution. Our model not only matches but often exceeds the performance of existing models like SDXL in generating high-quality, high-resolution images. Extensive experiments validate Meissonic’s capabilities, demonstrating its potential as a new standard in text-to-image synthesis. We release a model checkpoint capable of producing 1024 × 1024 resolution images.

[CV-77] Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content

Link: https://arxiv.org/abs/2410.08260
Authors: Qiuheng Wang, Yukai Shi, Jiarong Ou, Rui Chen, Ke Lin, Jiahao Wang, Boyuan Jiang, Haotian Yang, Mingwu Zheng, Xin Tao, Fei Yang, Pengfei Wan, Di Zhang
Keywords: visual generation technologies, generation technologies continue, continue to advance, expanded rapidly, technologies continue
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL

Abstract:As visual generation technologies continue to advance, the scale of video datasets has expanded rapidly, and the quality of these datasets is critical to the performance of video generation models. We argue that temporal splitting, detailed captions, and video quality filtering are three key factors that determine dataset quality. However, existing datasets exhibit various limitations in these areas. To address these challenges, we introduce Koala-36M, a large-scale, high-quality video dataset featuring accurate temporal splitting, detailed captions, and superior video quality. The core of our approach lies in improving the consistency between fine-grained conditions and video content. Specifically, we employ a linear classifier on probability distributions to enhance the accuracy of transition detection, ensuring better temporal consistency. We then provide structured captions for the split videos, with an average length of 200 words, to improve text-video alignment. Additionally, we develop a Video Training Suitability Score (VTSS) that integrates multiple sub-metrics, allowing us to filter high-quality videos from the original corpus. Finally, we incorporate several metrics into the training process of the generation model, further refining the fine-grained conditions. Our experiments demonstrate the effectiveness of our data processing pipeline and the quality of the proposed Koala-36M dataset. Our dataset and code will be released at this https URL.

[CV-78] In Search of Forgotten Domain Generalization

Link: https://arxiv.org/abs/2410.08258
Authors: Prasanna Mayilvahanan, Roland S. Zimmermann, Thaddäus Wiedemer, Evgenia Rusak, Attila Juhos, Matthias Bethge, Wieland Brendel
Keywords: OOD, generalize to unseen, strictly OOD, OOD generalization, model OOD performance
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Out-of-Domain (OOD) generalization is the ability of a model trained on one or more domains to generalize to unseen domains. In the ImageNet era of computer vision, evaluation sets for measuring a model’s OOD performance were designed to be strictly OOD with respect to style. However, the emergence of foundation models and expansive web-scale datasets has obfuscated this evaluation process, as datasets cover a broad range of domains and risk test domain contamination. In search of the forgotten domain generalization, we create large-scale datasets subsampled from LAION – LAION-Natural and LAION-Rendition – that are strictly OOD to corresponding ImageNet and DomainNet test sets in terms of style. Training CLIP models on these datasets reveals that a significant portion of their performance is explained by in-domain examples. This indicates that the OOD generalization challenges from the ImageNet era still prevail and that training on web-scale data merely creates the illusion of OOD generalization. Furthermore, through a systematic exploration of combining natural and rendition datasets in varying proportions, we identify optimal mixing ratios for model generalization across these domains. Our datasets and results re-enable meaningful assessment of OOD robustness at scale – a crucial prerequisite for improving model robustness.

[CV-79] Neural Material Adaptor for Visual Grounding of Intrinsic Dynamics NEURIPS2024

Link: https://arxiv.org/abs/2410.08257
Authors: Junyi Cao, Shanyan Guan, Yanhao Ge, Wei Li, Xiaokang Yang, Chao Ma
Keywords: humans effortlessly discern, effortlessly discern intrinsic, Neural Material Adaptor, modern AI systems, systems often struggle
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: NeurIPS 2024, the project page: this https URL

Abstract:While humans effortlessly discern intrinsic dynamics and adapt to new scenarios, modern AI systems often struggle. Current methods for visual grounding of dynamics either use pure neural-network-based simulators (black box), which may violate physical laws, or traditional physical simulators (white box), which rely on expert-defined equations that may not fully capture actual dynamics. We propose the Neural Material Adaptor (NeuMA), which integrates existing physical laws with learned corrections, facilitating accurate learning of actual dynamics while maintaining the generalizability and interpretability of physical priors. Additionally, we propose Particle-GS, a particle-driven 3D Gaussian Splatting variant that bridges simulation and observed images, allowing image gradients to be back-propagated to optimize the simulator. Comprehensive experiments on various dynamics in terms of grounded particle accuracy, dynamic rendering quality, and generalization ability demonstrate that NeuMA can accurately capture intrinsic dynamics.

[CV-80] Finetuning YOLOv9 for Vehicle Detection: Deep Learning for Intelligent Transportation Systems in Dhaka Bangladesh

Link: https://arxiv.org/abs/2410.08230
Authors: Shahriar Ahmad Fahim
Keywords: caused numerous transportation, vehicle detection system, numerous transportation challenges, Intelligent Transportation Systems, Rapid urbanization
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 16 pages, 10 figures

Abstract:Rapid urbanization in megacities around the world, like Dhaka, has caused numerous transportation challenges that need to be addressed. Emerging technologies of deep learning and artificial intelligence can help us solve these problems to move towards Intelligent Transportation Systems (ITS) in the city. The government of Bangladesh recognizes the integration of ITS to ensure smart mobility as a vital step towards the development plan “Smart Bangladesh Vision 2041”, but faces challenges in understanding ITS, its effects, and directions to implement. A vehicle detection system can pave the way to understanding traffic congestion, finding mobility patterns, and ensuring traffic surveillance. This paper therefore proposes a fine-tuned object detector, a YOLOv9 model trained on a Bangladesh-based dataset to detect native vehicles. Results show that the fine-tuned YOLOv9 model achieved a mean Average Precision (mAP) of 0.934 at the Intersection over Union (IoU) threshold of 0.5, which a comparison shows to be state-of-the-art performance relative to past studies on Bangladesh-based datasets. Next, suggesting that the model be deployed on roadside CCTV (closed-circuit television) cameras, the paper proposes a conceptual technique for processing the model's output data in a graph structure, creating a vehicle detection system for the city. Finally, applications of such a vehicle detection system are discussed, with a framework showing how it can address further ITS research questions, to provide a rationale for policymakers to implement the proposed system in the city.
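The IoU threshold of 0.5 mentioned above determines when a predicted box counts as a match for a ground-truth box. The following is a minimal sketch of that check, assuming axis-aligned boxes in `(x1, y1, x2, y2)` form; the helper names `iou` and `is_match` are hypothetical and not taken from the paper or from any YOLOv9 codebase.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x1 < x2 and y1 < y2


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """A prediction matches a ground-truth box when IoU reaches the threshold."""
    return iou(pred, truth) >= threshold
```

Computing mAP additionally requires ranking detections by confidence and averaging precision over recall levels and classes, which this sketch leaves out.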

[CV-81] Improving Spiking Neural Network Accuracy With Color Model Information Encoded Bit Planes

Link: https://arxiv.org/abs/2410.08229
Authors: Nhan T. Luu, Thang C. Truong, Duong T. Luu
Keywords: Spiking neural networks, small memory footprint, low energy consumption, Spiking neural, neural networks
Categories: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Image and Video Processing (eess.IV)

Abstract:Spiking neural networks (SNNs) have emerged as a promising paradigm in computational neuroscience and artificial intelligence, offering advantages such as low energy consumption and small memory footprint. However, their practical adoption is constrained by several challenges, prominent among them being performance optimization. In this study, we present a novel approach to enhance the performance of SNNs through a new encoding method that exploits bit planes derived from various color models of input image data for spike encoding. Our proposed technique is designed to improve the computational accuracy of SNNs compared to conventional methods without increasing model size. Through extensive experimental validation, we demonstrate the effectiveness of our encoding strategy in achieving performance gain across multiple computer vision tasks. To the best of our knowledge, this is the first research endeavor applying color spaces within the context of SNNs. By leveraging the unique characteristics of color spaces, we hope to unlock new potential in SNN performance, potentially paving the way for more efficient and effective SNN models in future research and applications.
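Bit-plane decomposition, the building block behind the encoding described above, can be sketched as follows. This is a minimal sketch, assuming 8-bit integer channels stored as nested lists; the channel could come from any color model, and how the paper turns planes into spike trains is not specified in the abstract. The names `bit_planes` and `from_bit_planes` are hypothetical.

```python
from typing import List

Image = List[List[int]]  # one channel, row-major, values in [0, 2**bits - 1]


def bit_planes(channel: Image, bits: int = 8) -> List[Image]:
    """Split one channel into binary planes, least significant bit first."""
    return [
        [[(px >> b) & 1 for px in row] for row in channel]
        for b in range(bits)
    ]


def from_bit_planes(planes: List[Image]) -> Image:
    """Inverse of bit_planes: recombine binary planes into the channel."""
    rows, cols = len(planes[0]), len(planes[0][0])
    return [
        [sum(planes[b][r][c] << b for b in range(len(planes))) for c in range(cols)]
        for r in range(rows)
    ]


# Hypothetical 2x2 channel.
channel = [[0, 255], [5, 128]]
planes = bit_planes(channel)
```

Each plane is already binary, so it can be read as a spike/no-spike map without any rate-coding step, which is one plausible reason bit planes suit spiking inputs.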

[CV-82] Learning Transferable Features for Implicit Neural Representations

Link: https://arxiv.org/abs/2409.09566
Authors: Kushal Vyas, Ahmed Imtiaz Humayun, Aniket Dashpute, Richard G. Baraniuk, Ashok Veeraraghavan, Guha Balakrishnan
Keywords: Implicit neural representations, Implicit neural, variety of applications, learned neural features, demonstrated success
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Implicit neural representations (INRs) have demonstrated success in a variety of applications, including inverse problems and neural rendering. An INR is typically trained to capture one signal of interest, resulting in learned neural features that are highly attuned to that signal. Although such features are assumed to be less generalizable, we explore their transferability for fitting similar signals. We introduce a new INR training framework, STRAINER, that learns transferable features for fitting INRs to new signals from a given distribution, faster and with better reconstruction quality. Owing to the sequential layer-wise affine operations in an INR, we propose to learn transferable representations by sharing initial encoder layers across multiple INRs with independent decoder layers. At test time, the learned encoder representations are transferred as initialization for an otherwise randomly initialized INR. We find that STRAINER yields an extremely powerful initialization for fitting images from the same domain and allows a gain of ≈ +10 dB in signal quality early on compared to an untrained INR itself. STRAINER also provides a simple way to encode data-driven priors in INRs. We evaluate STRAINER on multiple in-domain and out-of-domain signal fitting tasks and inverse problems and further provide detailed analysis and discussion on the transferability of STRAINER’s features. Our demo can be accessed at this https URL.

[CV-83] Editing Massive Concepts in Text-to-Image Diffusion Models

Link: https://arxiv.org/abs/2403.13807
Authors: Tianwei Xiong, Yue Wu, Enze Xie, Yue Wu, Zhenguo Li, Xihui Liu
Keywords: generating outdated, biased content, risk of generating, diffusion models suffer, massive concept editing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page: this https URL. Code: this https URL

Abstract:Text-to-image diffusion models suffer from the risk of generating outdated, copyrighted, incorrect, and biased content. While previous methods have mitigated the issues on a small scale, it is essential to handle them simultaneously in larger-scale real-world scenarios. We propose a two-stage method, Editing Massive Concepts In Diffusion Models (EMCID). The first stage performs memory optimization for each individual concept with dual self-distillation from text alignment loss and diffusion noise prediction loss. The second stage conducts massive concept editing with multi-layer, closed form model editing. We further propose a comprehensive benchmark, named ImageNet Concept Editing Benchmark (ICEB), for evaluating massive concept editing for T2I models with two subtasks, free-form prompts, massive concept categories, and extensive evaluation metrics. Extensive experiments conducted on our proposed benchmark and previous benchmarks demonstrate the superior scalability of EMCID for editing up to 1,000 concepts, providing a practical approach for fast adjustment and re-deployment of T2I diffusion models in real-world applications.

[CV-84] A foundation model for generalizable disease diagnosis in chest X-ray images

Link: https://arxiv.org/abs/2410.08861
Authors: Lijian Xu, Ziyu Ni, Hao Sun, Hongsheng Li, Shaoting Zhang
Keywords: Medical artificial intelligence, providing robust tools, Medical artificial, unlabelled CXR images, artificial intelligence
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Medical artificial intelligence (AI) is revolutionizing the interpretation of chest X-ray (CXR) images by providing robust tools for disease diagnosis. However, the effectiveness of these AI models is often limited by their reliance on large amounts of task-specific labeled data and their inability to generalize across diverse clinical settings. To address these challenges, we introduce CXRBase, a foundational model designed to learn versatile representations from unlabelled CXR images, facilitating efficient adaptation to various clinical tasks. CXRBase is initially trained on a substantial dataset of 1.04 million unlabelled CXR images using self-supervised learning methods. This approach allows the model to discern meaningful patterns without the need for explicit labels. After this initial phase, CXRBase is fine-tuned with labeled data to enhance its performance in disease detection, enabling accurate classification of chest diseases. CXRBase provides a generalizable solution to improve model performance and alleviate the annotation workload of experts to enable broad clinical AI applications from chest imaging.

[CV-85] On the impact of key design aspects in simulated Hybrid Quantum Neural Networks for Earth Observation

Link: https://arxiv.org/abs/2410.08677
Authors: Lorenzo Papa, Alessandro Sebastianelli, Gabriele Meoni, Irene Amerini
Keywords: improving machine learning, computing has introduced, introduced novel perspectives, perspectives for tackling, tackling and improving
Categories: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Quantum computing has introduced novel perspectives for tackling and improving machine learning tasks. Moreover, the integration of quantum technologies with well-known deep learning (DL) architectures has emerged as a potential research trend gaining traction across various domains, such as Earth Observation (EO) and many other research fields. However, prior related works in EO literature have mainly focused on convolutional architectural advancements, leaving several essential topics unexplored. Consequently, this research investigates, through three case studies, fundamental aspects of hybrid quantum machine models for EO tasks, aiming to provide a solid groundwork for future research towards more adequate simulations and looking at the post-NISQ era. In more detail, (1) we first investigate how different quantum libraries behave when training hybrid quantum models, assessing their computational efficiency and effectiveness. (2) Second, we analyze the stability/sensitivity to initialization values (i.e., seed values) in both traditional models and their quantum-enhanced counterparts. (3) Finally, we explore the benefits of hybrid quantum attention-based models in EO applications, examining how integrating quantum circuits into ViTs can improve model performance.

[CV-86] Fully Unsupervised Dynamic MRI Reconstruction via Diffeo-Temporal Equivariance

Link: https://arxiv.org/abs/2410.08646
Authors: Andrew Wang, Mike Davies
Keywords: Reconstructing dynamic MRI, free breathing motion, resolution real-time imaging, MRI image sequences, higher spatiotemporal resolution
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Pre-print

Abstract:Reconstructing dynamic MRI image sequences from undersampled accelerated measurements is crucial for faster and higher spatiotemporal resolution real-time imaging of cardiac motion, free breathing motion and many other applications. Classical paradigms, such as gated cine MRI, assume periodicity, disallowing imaging of true motion. Supervised deep learning methods are fundamentally flawed as, in dynamic imaging, ground truth fully-sampled videos are impossible to truly obtain. We propose an unsupervised framework to learn to reconstruct dynamic MRI sequences from undersampled measurements alone by leveraging natural geometric spatiotemporal equivariances of MRI. Dynamic Diffeomorphic Equivariant Imaging (DDEI) significantly outperforms state-of-the-art unsupervised methods such as SSDU on highly accelerated dynamic cardiac imaging. Our method is agnostic to the underlying neural network architecture and can be used to adapt the latest models and post-processing approaches. Our code and video demos are at this https URL.

[CV-87] ViT3D Alignment of LLaMA3: 3D Medical Image Report Generation

Link: https://arxiv.org/abs/2410.08588
Authors: Siyou Li, Beining Xu, Yihao Luo, Dong Nie, Le Zhang
Keywords: medical report generation, Automatic medical report, produce detailed text, detailed text reports, automatic MRG
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Automatic medical report generation (MRG), which aims to produce detailed text reports from medical images, has emerged as a critical task in this domain. MRG systems can enhance radiological workflows by reducing the time and effort required for report writing, thereby improving diagnostic efficiency. In this work, we present a novel approach for automatic MRG utilizing a multimodal large language model. Specifically, we employ the 3D Vision Transformer (ViT3D) image encoder from M3D-CLIP to process 3D scans and use Asclepius-Llama3-8B as the language model to generate the text reports by auto-regressive decoding. Experiments show that our model achieves an average Green score of 0.3 on the MRG task validation set and an average accuracy of 0.61 on the visual question answering (VQA) task validation set, outperforming the baseline model. Our approach demonstrates the effectiveness of the ViT3D alignment of LLaMA3 for automatic MRG and VQA tasks by tuning the model on a small dataset.

[CV-88] CAS-GAN for Contrast-free Angiography Synthesis

Link: https://arxiv.org/abs/2410.08490
Authors: De-Xing Huang, Xiao-Hu Zhou, Mei-Jiang Gui, Xiao-Liang Xie, Shi-Qi Liu, Shuang-Yi Wang, Hao Li, Tian-Yu Xiang, Zeng-Guang Hou
Keywords: posing substantial health, substantial health risks, numerous interventional procedures, Iodinated contrast agents, interventional procedures
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures

Abstract:Iodinated contrast agents are widely utilized in numerous interventional procedures, yet pose substantial health risks to patients. This paper presents CAS-GAN, a novel GAN framework that serves as a “virtual contrast agent” to synthesize X-ray angiographies via disentanglement representation learning and vessel semantic guidance, thereby reducing the reliance on iodinated agents during interventional procedures. Specifically, our approach disentangles X-ray angiographies into background and vessel components, leveraging medical prior knowledge. A specialized predictor then learns to map the interrelationships between these components. Additionally, a vessel semantic-guided generator and a corresponding loss function are introduced to enhance the visual fidelity of generated images. Experimental results on the XCAD dataset demonstrate the state-of-the-art performance of our CAS-GAN, achieving an FID of 5.94 and an MMD of 0.017. These promising results highlight CAS-GAN’s potential for clinical applications.

[CV-89] Beyond GFVC: A Progressive Face Video Compression Framework with Adaptive Visual Tokens

Link: https://arxiv.org/abs/2410.08485
Authors: Bolin Chen, Shanzhi Yin, Zihan Zhang, Jie Chen, Ru-Ling Liao, Lingyu Zhu, Shiqi Wang, Yan Ye
Keywords: deep generative models, Face Video Compression, diverse application functionalities, Generative Face Video, face video coding
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recently, deep generative models have greatly advanced the progress of face video coding towards promising rate-distortion performance and diverse application functionalities. Beyond traditional hybrid video coding paradigms, Generative Face Video Compression (GFVC) relying on the strong capabilities of deep generative models and the philosophy of early Model-Based Coding (MBC) can facilitate the compact representation and realistic reconstruction of visual face signal, thus achieving ultra-low bitrate face video communication. However, these GFVC algorithms are sometimes faced with unstable reconstruction quality and limited bitrate ranges. To address these problems, this paper proposes a novel Progressive Face Video Compression framework, namely PFVC, that utilizes adaptive visual tokens to realize exceptional trade-offs between reconstruction robustness and bandwidth intelligence. In particular, the encoder of the proposed PFVC projects the high-dimensional face signal into adaptive visual tokens in a progressive manner, whilst the decoder can further reconstruct these adaptive visual tokens for motion estimation and signal synthesis with different granularity levels. Experimental results demonstrate that the proposed PFVC framework can achieve better coding flexibility and superior rate-distortion performance in comparison with the latest Versatile Video Coding (VVC) codec and the state-of-the-art GFVC algorithms. The project page can be found at this https URL.

[CV-90] VoxelPrompt: A Vision-Language Agent for Grounded Medical Image Analysis

Link: https://arxiv.org/abs/2410.08397
Authors: Andrew Hoopes, Victor Ion Butoi, John V. Guttag, Adrian V. Dalca
Keywords: agent-driven vision-language framework, tackles diverse radiological, analytical metrics, agent-driven vision-language, joint modeling
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 5 figures, vision-language agent, medical image analysis, neuroimage foundation model

Abstract:We present VoxelPrompt, an agent-driven vision-language framework that tackles diverse radiological tasks through joint modeling of natural language, image volumes, and analytical metrics. VoxelPrompt is multi-modal and versatile, leveraging the flexibility of language interaction while providing quantitatively grounded image analysis. Given a variable number of 3D medical volumes, such as MRI and CT scans, VoxelPrompt employs a language agent that iteratively predicts executable instructions to solve a task specified by an input prompt. These instructions communicate with a vision network to encode image features and generate volumetric outputs (e.g., segmentations). VoxelPrompt interprets the results of intermediate instructions and plans further actions to compute discrete measures (e.g., tumor growth across a series of scans) and present relevant outputs to the user. We evaluate this framework in a sandbox of diverse neuroimaging tasks, and we show that the single VoxelPrompt model can delineate hundreds of anatomical and pathological features, measure many complex morphological properties, and perform open-language analysis of lesion characteristics. VoxelPrompt carries out these objectives with accuracy similar to that of fine-tuned, single-task models for segmentation and visual question-answering, while facilitating a much larger range of tasks. Therefore, by supporting accurate image processing with language interaction, VoxelPrompt provides comprehensive utility for numerous imaging tasks that traditionally require specialized models to address.

[CV-91] A Real Benchmark Swell Noise Dataset for Performing Seismic Data Denoising via Deep Learning

Link: https://arxiv.org/abs/2410.08231
Authors: Pablo M. Barros, Roosevelt de L. Sardinha, Giovanny A. M. Arboleda, Lessandro de S. S. Valente, Isabelle R. V. de Melo, Albino Aveleda, André Bulcão, Sergio L. Netto, Alexandre G. Evsukoff
Keywords: deep learning, computer vision, creation of open, tested and compared, compared with reproducible
Categories: Geophysics (physics.geo-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:The recent development of deep learning (DL) methods for computer vision has been driven by the creation of open benchmark datasets on which new algorithms can be tested and compared with reproducible results. Although DL methods have many applications in geophysics, few real seismic datasets are available for benchmarking DL models, especially for denoising real data, which is one of the main problems in seismic data processing scenarios in the oil and gas industry. This article presents a benchmark dataset composed of synthetic seismic data corrupted with noise extracted from a filtering process implemented on real data. In this work, a comparison between two well-known DL-based denoising models is conducted on this dataset, which is proposed as a benchmark for accelerating the development of new solutions for seismic data denoising. This work also introduces a new evaluation metric that can capture small variations in model results. The results show that DL models are effective at denoising seismic data, but some issues remain to be solved.

[CV-92] Multi-Atlas Brain Network Classification through Consistency Distillation and Complementary Information Fusion

Link: https://arxiv.org/abs/2410.08228
Authors: Jiaxing Xu, Mengcheng Lan, Xia Dong, Kai He, Wei Zhang, Qingtian Bian, Yiping Ke
Keywords: identifying distinctive patterns, identifying distinctive, brain, brain network classification, atlases
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:In the realm of neuroscience, identifying distinctive patterns associated with neurological disorders via brain networks is crucial. Resting-state functional magnetic resonance imaging (fMRI) serves as a primary tool for mapping these networks by correlating blood-oxygen-level-dependent (BOLD) signals across different brain regions, defined as regions of interest (ROIs). Constructing these brain networks involves using atlases to parcellate the brain into ROIs based on various hypotheses of brain division. However, there is no standard atlas for brain network classification, leading to limitations in detecting abnormalities in disorders. Some recent methods have proposed utilizing multiple atlases, but they neglect consistency across atlases and lack ROI-level information exchange. To tackle these limitations, we propose an Atlas-Integrated Distillation and Fusion network (AIDFusion) to improve brain network classification using fMRI data. AIDFusion addresses the challenge of utilizing multiple atlases by employing a disentangle Transformer to filter out inconsistent atlas-specific information and distill distinguishable connections across atlases. It also incorporates subject- and population-level consistency constraints to enhance cross-atlas consistency. Additionally, AIDFusion employs an inter-atlas message-passing mechanism to fuse complementary information across brain regions. Experimental results on four datasets of different diseases demonstrate the effectiveness and efficiency of AIDFusion compared to state-of-the-art methods. A case study illustrates that AIDFusion extracts patterns that are both interpretable and consistent with established neuroscience findings.

[CV-93] Removal of clouds from satellite images using time compositing techniques

Link: https://arxiv.org/abs/2410.08223
Authors: Atma Bharathi Mani, Nagashree TR, Manavalan P, Diwakar PG
Keywords: function, quantitative study, deterrent to qualitative, qualitative and quantitative, min
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 8 figures

Abstract:Clouds in satellite images are a deterrent to qualitative and quantitative study. Time compositing methods compare a series of co-registered images and retrieve only those pixels that have comparatively lesser cloud cover for the resultant image. Two different approaches of time compositing were tested. The first method recoded the clouds to value 0 on all the constituent images and ran a ‘max’ function. The second method directly ran a ‘min’ function without recoding on all the images for the resultant image. The ‘max’ function gave a highly mottled image while the ‘min’ function gave a superior quality image with smoother texture. Persistent clouds on all constituent images were retained in both methods, but they were readily identifiable and easily extractable in the ‘max’ function image as they were recoded to 0, while those in the ‘min’ function image appeared with varying DN values. Hence a hybrid technique was created which recodes the clouds to value 255 and runs a ‘min’ function. This method preserved the quality of the ‘min’ function and the advantage of retrieving clouds as in the ‘max’ function image. The models were created using Erdas Imagine Modeler 9.1, and MODIS 250 m resolution images of coastal Karnataka from May and June 2008 were used. A detailed investigation of the different methods is described and the scope for automating different techniques is discussed.
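The three compositing variants described above can be sketched as follows. This is a minimal sketch, assuming 8-bit single-band images stored as nested lists and a precomputed boolean cloud mask per image; the abstract does not say how clouds are identified, and the paper used Erdas Imagine Modeler rather than Python. The function names are hypothetical.

```python
from typing import Callable, Iterable, List

Image = List[List[int]]   # co-registered single-band image, DN values 0..255
Mask = List[List[bool]]   # True where a pixel is flagged as cloud (assumed given)


def composite(images: List[Image], reducer: Callable[[Iterable[int]], int]) -> Image:
    """Apply a per-pixel reducer (min or max) across the image stack."""
    rows, cols = len(images[0]), len(images[0][0])
    return [
        [reducer(img[r][c] for img in images) for c in range(cols)]
        for r in range(rows)
    ]


def recode(image: Image, cloud_mask: Mask, value: int) -> Image:
    """Replace cloud pixels with a fixed DN value."""
    return [
        [value if is_cloud else px for px, is_cloud in zip(row, mask_row)]
        for row, mask_row in zip(image, cloud_mask)
    ]


def max_composite(images: List[Image], masks: List[Mask]) -> Image:
    """Method 1: recode clouds to 0, then take the per-pixel maximum."""
    return composite([recode(i, m, 0) for i, m in zip(images, masks)], max)


def hybrid_composite(images: List[Image], masks: List[Mask]) -> Image:
    """Hybrid: recode clouds to 255, then take the per-pixel minimum."""
    return composite([recode(i, m, 255) for i, m in zip(images, masks)], min)


# Hypothetical 1x3 stack: the third pixel is cloudy in both dates.
imgs = [[[200, 50, 210]], [[60, 220, 190]]]
msks = [[[True, False, True]], [[False, True, True]]]
plain_min = composite(imgs, min)           # Method 2: no recoding
hybrid = hybrid_composite(imgs, msks)
```

In the example, the persistent cloud keeps an arbitrary DN value under the plain 'min' but becomes a uniform 255 under the hybrid, which mirrors the abstract's point that recoding makes persistent clouds easy to extract.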

[CV-94] A Visual-Analytical Approach for Automatic Detection of Cyclonic Events in Satellite Observations

Link: https://arxiv.org/abs/2410.08218
Authors: Akash Agrawal, Mayesh Mohapatra, Abhinav Raja, Paritosh Tiwari, Vishwajeet Pattanaik, Neeru Jaiswal, Arpit Agarwal, Punit Rathore
Keywords: catastrophic weather events, holds crucial significance, predicting catastrophic weather, North Indian Ocean, tropical cyclones holds
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: 10 pages, 22 figures

Abstract:Estimating the location and intensity of tropical cyclones holds crucial significance for predicting catastrophic weather events. In this study, we approach this task as a detection and regression challenge, specifically over the North Indian Ocean (NIO) region, where best-track location and wind speed information serve as the labels. The current process for cyclone detection and intensity estimation involves physics-based simulation studies, which are time-consuming; using only image features would automate the process and enable significantly faster and more accurate predictions. While conventional methods typically necessitate substantial prior knowledge for training, we are exploring alternative approaches to enhance efficiency. This research focuses specifically on cyclone detection, intensity estimation and related aspects using only image input and data-driven approaches, which will lead to faster inference and automate the process, as opposed to the current NWP models being utilized at SAC. For algorithm development, a novel two-stage detection and intensity estimation module is proposed. In the first-level detection, we try to localize the cyclone over an entire image as captured by INSAT3D over the NIO. For the intensity estimation task, we propose a CNN-LSTM network, which works on the cyclone-centered images, utilizing a ResNet-18 backbone, by which we are able to capture both temporal and spatial characteristics.

[CV-95] A Review of Electromagnetic Elimination Methods for low-field portable MRI scanner

Link: https://arxiv.org/abs/2406.17804
Authors: Wanyu Bian
Keywords: eliminating electromagnetic interference, deep learning, deep learning methods, EMI, EMI elimination
Categories: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Abstract:This paper presents a comprehensive analysis of both conventional and deep learning methods for eliminating electromagnetic interference (EMI) in MRI systems. We explore the underlying principles and implementation of traditional analytical and adaptive EMI elimination techniques, as well as cutting-edge deep learning approaches. Through a detailed comparison, the strengths and limitations of each method are highlighted. Recent advancements in active EMI elimination utilizing multiple external EMI receiver coils and analytical techniques are discussed alongside the superior performance of deep learning methods, which leverage neural networks trained on extensive MRI data. While deep learning methods demonstrate significant improvements in EMI suppression, enhancing diagnostic capabilities and accessibility of MRI technology, they also introduce potential security and safety concerns, especially in production and commercial applications. This study underscores the need to address these challenges to fully realize the benefits of deep learning in EMI elimination. The findings suggest a balanced approach, combining the reliability of conventional methods with the advanced capabilities of deep learning, to develop more robust and effective EMI suppression strategies in MRI systems.

Machine Learning

[LG-0] Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models

Link: https://arxiv.org/abs/2410.09047
Authors: Qin Liu, Chao Shang, Ling Liu, Nikolaos Pappas, Jie Ma, Neha Anna John, Srikanth Doss, Lluis Marquez, Miguel Ballesteros, Yassine Benajiba
Keywords: vision module compared, safety alignment, Vision-Language Models, safety alignment ability, safety alignment degradation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint

Abstract:The safety alignment ability of Vision-Language Models (VLMs) is prone to be degraded by the integration of the vision module compared to its LLM backbone. We investigate this phenomenon, dubbed as “safety alignment degradation” in this paper, and show that the challenge arises from the representation gap that emerges when introducing vision modality to VLMs. In particular, we show that the representations of multi-modal inputs shift away from that of text-only inputs which represent the distribution that the LLM backbone is optimized for. At the same time, the safety alignment capabilities, initially developed within the textual embedding space, do not successfully transfer to this new multi-modal representation space. To reduce safety alignment degradation, we introduce Cross-Modality Representation Manipulation (CMRM), an inference time representation intervention method for recovering the safety alignment ability that is inherent in the LLM backbone of VLMs, while simultaneously preserving the functional capabilities of VLMs. The empirical results show that our framework significantly recovers the alignment ability that is inherited from the LLM backbone with minimal impact on the fluency and linguistic capabilities of pre-trained VLMs even without additional training. Specifically, the unsafe rate of LLaVA-7B on multi-modal input can be reduced from 61.53% to as low as 3.15% with only inference-time intervention. WARNING: This paper contains examples of toxic or harmful language.

[LG-1] Alberta Wells Dataset: Pinpointing Oil and Gas Wells from Satellite Imagery

Link: https://arxiv.org/abs/2410.09032
Authors: Pratinav Seth, Michelle Lin, Brefo Dwamena Yaw, Jade Boutot, Mary Kang, David Rolnick
Keywords: leaching methane, oil and gas, atmosphere and toxic, toxic compounds, Alberta Energy Regulator
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Millions of abandoned oil and gas wells are scattered across the world, leaching methane into the atmosphere and toxic compounds into the groundwater. Many of these locations are unknown, preventing the wells from being plugged and their polluting effects averted. Remote sensing is a relatively unexplored tool for pinpointing abandoned wells at scale. We introduce the first large-scale benchmark dataset for this problem, leveraging medium-resolution multi-spectral satellite imagery from Planet Labs. Our curated dataset comprises over 213,000 wells (abandoned, suspended, and active) from Alberta, a region with especially high well density, sourced from the Alberta Energy Regulator and verified by domain experts. We evaluate baseline algorithms for well detection and segmentation, showing the promise of computer vision approaches but also significant room for improvement.

[LG-2] AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

Link: https://arxiv.org/abs/2410.09024
Authors: Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, Xander Davies
Keywords: users design prompts, circumvent safety measures, users design, design prompts, prompts to circumvent
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:The robustness of LLMs to jailbreak attacks, where users design prompts to circumvent safety measures and misuse model capabilities, has been studied primarily for LLMs acting as simple chatbots. Meanwhile, LLM agents, which use external tools and can execute multi-stage tasks, may pose a greater risk if misused, but their robustness remains underexplored. To facilitate research on LLM agent misuse, we propose a new benchmark called AgentHarm. The benchmark includes a diverse set of 110 explicitly malicious agent tasks (440 with augmentations), covering 11 harm categories including fraud, cybercrime, and harassment. In addition to measuring whether models refuse harmful agentic requests, scoring well on AgentHarm requires jailbroken agents to maintain their capabilities following an attack to complete a multi-step task. We evaluate a range of leading LLMs, and find (1) leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking, (2) simple universal jailbreak templates can be adapted to effectively jailbreak agents, and (3) these jailbreaks enable coherent and malicious multi-step agent behavior and retain model capabilities. We publicly release AgentHarm at this https URL to enable simple and reliable evaluation of attacks and defenses for LLM-based agents.

[LG-3] Parameter-Efficient Fine-Tuning of State Space Models

Link: https://arxiv.org/abs/2410.09016
Authors: Kevin Galim, Wonjun Kang, Yuchen Zeng, Hyung Il Koo, Kangwook Lee
Keywords: Deep State Space, State Space Models, Deep State, Space Models, State Space
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Code is available at this https URL

Abstract:Deep State Space Models (SSMs), such as Mamba (Gu & Dao, 2024), have emerged as powerful tools for language modeling, offering high performance with efficient inference and linear scaling in sequence length. However, the application of parameter-efficient fine-tuning (PEFT) methods to SSM-based models remains largely unexplored. This paper aims to systematically study two key questions: (i) How do existing PEFT methods perform on SSM-based models? (ii) Which modules are most effective for fine-tuning? We conduct an empirical benchmark of four basic PEFT methods on SSM-based models. Our findings reveal that prompt-based methods (e.g., prefix-tuning) are no longer effective, an empirical result further supported by theoretical analysis. In contrast, LoRA remains effective for SSM-based models. We further investigate the optimal application of LoRA within these models, demonstrating both theoretically and experimentally that applying LoRA to linear projection matrices without modifying SSM modules yields the best results, as LoRA is not effective at tuning SSM modules. To further improve performance, we introduce LoRA with Selective Dimension tuning (SDLoRA), which selectively updates certain channels and states on SSM modules while applying LoRA to linear projection matrices. Extensive experimental results show that this approach outperforms standard LoRA.
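For readers unfamiliar with LoRA, the idea of applying it to a linear projection matrix can be sketched as follows. This is a generic, pure-Python illustration of the standard low-rank update (frozen weight plus a scaled product of two thin matrices), not the paper's SDLoRA; the helper names `init_lora`, `lora_forward`, and `trainable_parameters`, the row-vector convention, and the initialization scale are assumptions made for this sketch.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def init_lora(d_in, d_out, rank, seed=0):
    """Create the two thin LoRA factors: `down` is random, `up` starts at zero."""
    rng = random.Random(seed)
    down = [[rng.gauss(0.0, 0.02) for _ in range(rank)] for _ in range(d_in)]
    up = [[0.0] * d_out for _ in range(rank)]
    return down, up

def lora_forward(x, weight, down, up, alpha):
    """Compute x @ weight + (alpha / rank) * x @ down @ up.

    `weight` stays frozen; only `down` and `up` would be trained.
    """
    rank = len(up)
    base = matmul(x, weight)
    update = matmul(matmul(x, down), up)
    scale = alpha / rank
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]

def trainable_parameters(d_in, d_out, rank):
    """Number of trainable values in the two LoRA factors."""
    return rank * (d_in + d_out)
```

Because `up` starts at zero, the adapted layer initially reproduces the frozen projection exactly, and the number of trainable values grows with `rank * (d_in + d_out)` rather than `d_in * d_out`.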

[LG-4] Hierarchical Universal Value Function Approximators

Link: https://arxiv.org/abs/2410.08997
Authors: Rushiv Arora
Keywords: estimating long-term returns, building universal approximators, parameterized manner, key advancements, key elements
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 12 pages, 10 figures, 3 appendices. Currently under review

Abstract:There have been key advancements to building universal approximators for multi-goal collections of reinforcement learning value functions, which are key elements in estimating long-term returns of states in a parameterized manner. We extend this to hierarchical reinforcement learning, using the options framework, by introducing hierarchical universal value function approximators (H-UVFAs). This allows us to leverage the added benefits of scaling, planning, and generalization expected in temporal abstraction settings. We develop supervised and reinforcement learning methods for learning embeddings of the states, goals, options, and actions in the two hierarchical value functions: $Q(s, g, o; \theta)$ and $Q(s, g, o, a; \theta)$. Finally, we demonstrate generalization of the H-UVFAs and show they outperform corresponding UVFAs.

[LG-5] Science is Exploration: Computational Frontiers for Conceptual Metaphor Theory

Link: https://arxiv.org/abs/2410.08991
Authors: Rebecca M. M. Hicke, Ross Deans Kristensen-McLachlan
Keywords: conceptual metaphors, Large Language Models, Metaphors, language, natural language
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to the 2024 Computational Humanities Research Conference (CHR)

Abstract:Metaphors are everywhere. They appear extensively across all domains of natural language, from the most sophisticated poetry to seemingly dry academic prose. A significant body of research in the cognitive science of language argues for the existence of conceptual metaphors, the systematic structuring of one domain of experience in the language of another. Conceptual metaphors are not simply rhetorical flourishes but are crucial evidence of the role of analogical reasoning in human cognition. In this paper, we ask whether Large Language Models (LLMs) can accurately identify and explain the presence of such conceptual metaphors in natural language data. Using a novel prompting technique based on metaphor annotation guidelines, we demonstrate that LLMs are a promising tool for large-scale computational research on conceptual metaphors. Further, we show that LLMs are able to apply procedural guidelines designed for human annotators, displaying a surprising depth of linguistic knowledge.

[LG-6] SubZero: Random Subspace Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning

Link: https://arxiv.org/abs/2410.08989
Authors: Ziming Yu, Pan Zhou, Sike Wang, Jia Li, Hua Huang
Keywords: Large Language Models, Fine-tuning Large Language, Fine-tuning Large, Large Language, proven effective
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Fine-tuning Large Language Models (LLMs) has proven effective for a variety of downstream tasks. However, as LLMs grow in size, the memory demands for backpropagation become increasingly prohibitive. Zeroth-order (ZO) optimization methods offer a memory-efficient alternative by using forward passes to estimate gradients, but the variance of gradient estimates typically scales linearly with the model’s parameter dimension, a significant issue for LLMs. In this paper, we propose the random Subspace Zeroth-order (SubZero) optimization to address the challenges posed by LLMs’ high dimensionality. We introduce a low-rank perturbation tailored for LLMs that significantly reduces memory consumption while improving training performance. Additionally, we prove that our gradient estimation closely approximates the backpropagation gradient, exhibits lower variance than traditional ZO methods, and ensures convergence when combined with SGD. Experimental results show that SubZero enhances fine-tuning performance and achieves faster convergence compared to standard ZO approaches like MeZO across various language modeling tasks.
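To make "using forward passes to estimate gradients" concrete, here is a minimal sketch of the generic two-point zeroth-order estimator along a random direction. It is not SubZero itself: the abstract does not specify the exact perturbation, so `low_rank_direction` (a product of two thin Gaussian factors) is only a hypothetical stand-in for a low-rank perturbation, and the quadratic toy loss is an assumption chosen so the result is easy to check.

```python
import random

def zo_directional_estimate(loss, theta, direction, eps=1e-3):
    """Two-point zeroth-order gradient estimate along `direction`.

    Uses two forward evaluations of `loss` and no backpropagation:
    slope = (loss(theta + eps*z) - loss(theta - eps*z)) / (2*eps),
    and the estimate is slope * z.
    """
    plus = [t + eps * d for t, d in zip(theta, direction)]
    minus = [t - eps * d for t, d in zip(theta, direction)]
    slope = (loss(plus) - loss(minus)) / (2.0 * eps)
    return [slope * d for d in direction]

def low_rank_direction(rows, cols, rank, rng):
    """Hypothetical low-rank perturbation for a rows x cols weight matrix.

    Builds U @ V^T from two thin Gaussian factors and returns it flattened
    row by row; only (rows + cols) * rank random numbers are drawn.
    """
    u = [[rng.gauss(0.0, 1.0) for _ in range(rank)] for _ in range(rows)]
    v = [[rng.gauss(0.0, 1.0) for _ in range(rank)] for _ in range(cols)]
    return [sum(u[i][k] * v[j][k] for k in range(rank))
            for i in range(rows) for j in range(cols)]

def quadratic_loss(theta):
    """Toy loss whose true gradient is 2 * theta."""
    return sum(t * t for t in theta)
```

For the toy quadratic loss the estimated slope equals the true directional derivative `2 * dot(theta, z)` up to floating-point error, which makes the estimator easy to sanity-check before applying it to a real model.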

[LG-7] DEL: Discrete Element Learner for Learning 3D Particle Dynamics with Neural Rendering

Link: https://arxiv.org/abs/2410.08983
Authors: Jiaxu Wang, Jingkai Sun, Junhao He, Ziyi Zhang, Qiang Zhang, Mingyuan Sun, Renjing Xu
Keywords: show great potential, simulating particle dynamics, great potential, potential for simulating, per-particle correspondences
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)

Abstract:Learning-based simulators show great potential for simulating particle dynamics when 3D groundtruth is available, but per-particle correspondences are not always accessible. The development of neural rendering presents a new solution to this field to learn 3D dynamics from 2D images by inverse rendering. However, existing approaches still suffer from ill-posed natures resulting from the 2D to 3D uncertainty, for example, specific 2D images can correspond with various 3D particle distributions. To mitigate such uncertainty, we consider a conventional, mechanically interpretable framework as the physical priors and extend it to a learning-based version. In brief, we incorporate the learnable graph kernels into the classic Discrete Element Analysis (DEA) framework to implement a novel mechanics-integrated learning system. In this case, the graph network kernels are only used for approximating some specific mechanical operators in the DEA framework rather than the whole dynamics mapping. By integrating the strong physics priors, our methods can effectively learn the dynamics of various materials from the partial 2D observations in a unified manner. Experiments show that our approach outperforms other learned simulators by a large margin in this context and is robust to different renderers, fewer training samples, and fewer camera views.

[LG-8] Overcoming Slow Decision Frequencies in Continuous Control: Model-Based Sequence Reinforcement Learning for Model-Free Control

Link: https://arxiv.org/abs/2410.08979
Authors: Devdhar Patel, Hava Siegelmann
Keywords: surpassing human-level control, human-level control capabilities, rapidly reaching, reaching and surpassing, surpassing human-level
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Reinforcement learning (RL) is rapidly reaching and surpassing human-level control capabilities. However, state-of-the-art RL algorithms often require timesteps and reaction times significantly faster than human capabilities, which is impractical in real-world settings and typically necessitates specialized hardware. We introduce Sequence Reinforcement Learning (SRL), an RL algorithm designed to produce a sequence of actions for a given input state, enabling effective control at lower decision frequencies. SRL addresses the challenges of learning action sequences by employing both a model and an actor-critic architecture operating at different temporal scales. We propose a “temporal recall” mechanism, where the critic uses the model to estimate intermediate states between primitive actions, providing a learning signal for each individual action within the sequence. Once training is complete, the actor can generate action sequences independently of the model, achieving model-free control at a slower frequency. We evaluate SRL on a suite of continuous control tasks, demonstrating that it achieves performance comparable to state-of-the-art algorithms while significantly reducing actor sample complexity. To better assess performance across varying decision frequencies, we introduce the Frequency-Averaged Score (FAS) metric. Our results show that SRL significantly outperforms traditional RL algorithms in terms of FAS, making it particularly suitable for applications requiring variable decision frequencies. Additionally, we compare SRL with model-based online planning, showing that SRL achieves superior FAS while leveraging the same model during training that online planners use for planning.

[LG-9] Learning Representations of Instruments for Partial Identification of Treatment Effects

Link: https://arxiv.org/abs/2410.08976
Authors: Jonas Schweisthal, Dennis Frauen, Maresa Schröder, Konstantin Hess, Niki Kilbertus, Stefan Feuerriegel
Keywords: observational data, data is important, average treatment effect, bounds, CATE
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:Reliable estimation of treatment effects from observational data is important in many disciplines such as medicine. However, estimation is challenging when unconfoundedness as a standard assumption in the causal inference literature is violated. In this work, we leverage arbitrary (potentially high-dimensional) instruments to estimate bounds on the conditional average treatment effect (CATE). Our contributions are three-fold: (1) We propose a novel approach for partial identification through a mapping of instruments to a discrete representation space so that we yield valid bounds on the CATE. This is crucial for reliable decision-making in real-world applications. (2) We derive a two-step procedure that learns tight bounds using a tailored neural partitioning of the latent instrument space. As a result, we avoid instability issues due to numerical approximations or adversarial training. Furthermore, our procedure aims to reduce the estimation variance in finite-sample settings to yield more reliable estimates. (3) We show theoretically that our procedure obtains valid bounds while reducing estimation variance. We further perform extensive experiments to demonstrate the effectiveness across various settings. Overall, our procedure offers a novel path for practitioners to make use of potentially high-dimensional instruments (e.g., as in Mendelian randomization).

[LG-10] ALVIN: Active Learning Via INterpolation EMNLP2024

Link: https://arxiv.org/abs/2410.08972
Authors: Michalis Korakakis, Andreas Vlachos, Adrian Weller
Keywords: active learning methods, Active Learning, Active Learning aims, typical active learning, minimize annotation effort
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to EMNLP 2024 (Main)

Abstract:Active Learning aims to minimize annotation effort by selecting the most useful instances from a pool of unlabeled data. However, typical active learning methods overlook the presence of distinct example groups within a class, whose prevalence may vary, e.g., in occupation classification datasets certain demographics are disproportionately represented in specific classes. This oversight causes models to rely on shortcuts for predictions, i.e., spurious correlations between input attributes and labels occurring in well-represented groups. To address this issue, we propose Active Learning Via INterpolation (ALVIN), which conducts intra-class interpolations between examples from under-represented and well-represented groups to create anchors, i.e., artificial points situated between the example groups in the representation space. By selecting instances close to the anchors for annotation, ALVIN identifies informative examples exposing the model to regions of the representation space that counteract the influence of shortcuts. Crucially, since the model considers these examples to be of high certainty, they are likely to be ignored by typical active learning methods. Experimental results on six datasets encompassing sentiment analysis, natural language inference, and paraphrase detection demonstrate that ALVIN outperforms state-of-the-art active learning methods in both in-distribution and out-of-distribution generalization.
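The anchor idea in this abstract (interpolate between representations of under-represented and well-represented examples of the same class, then pick unlabeled instances near the resulting points) can be sketched roughly as below. The abstract does not give the interpolation weights or the distance measure, so the fixed weight `lam`, the Euclidean distance, and the function names are assumptions for illustration only.

```python
import math

def interpolate(rep_under, rep_well, lam):
    """Create an anchor between two same-class representations.

    `lam` = 1.0 returns the under-represented example, 0.0 the well-represented one.
    """
    return [lam * u + (1.0 - lam) * w for u, w in zip(rep_under, rep_well)]

def nearest_to_anchors(anchors, pool, k):
    """Return indices of the k unlabeled pool items closest to any anchor."""
    def distance_to_closest_anchor(rep):
        return min(math.dist(rep, anchor) for anchor in anchors)

    ranked = sorted(range(len(pool)),
                    key=lambda i: distance_to_closest_anchor(pool[i]))
    return ranked[:k]

# Toy 2-D representations (hypothetical values).
anchor = interpolate([0.0, 0.0], [2.0, 2.0], lam=0.5)      # [1.0, 1.0]
pool = [[5.0, 5.0], [1.1, 0.9], [0.0, 3.0], [1.0, 1.5]]
selected = nearest_to_anchors([anchor], pool, k=2)          # [1, 3]
```

The selected indices are the instances that would be sent for annotation in this simplified picture.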

[LG-11] Evaluating Federated Kolmogorov-Arnold Networks on Non-IID Data

Link: https://arxiv.org/abs/2410.08961
Authors: Arthur Mendonça Sasse, Claudio Miceli de Farias
Keywords: Federated Kolmogorov-Arnold Networks, Kolmogorov-Arnold Networks, Radial Basis Functions, initial stage, Layer Perceptrons
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures, for associated code see this https URL

Abstract:Federated Kolmogorov-Arnold Networks (F-KANs) have already been proposed, but their assessment is at an initial stage. We present a comparison between KANs (using B-splines and Radial Basis Functions as activation functions) and Multi-Layer Perceptrons (MLPs) with a similar number of parameters for 100 rounds of federated learning in the MNIST classification task using non-IID partitions with 100 clients. After 15 trials for each model, we show that the best accuracies achieved by MLPs can be achieved by Spline-KANs in half of the time (in rounds), with just a moderate increase in computing time.

[LG-12] Rapid Grassmannian Averaging with Chebyshev Polynomials ICLR2025

Link: https://arxiv.org/abs/2410.08956
Authors: Brighton Ancelin, Alex Saad-Falcon, Kason Ancelin, Justin Romberg
Keywords: Rapid Grassmannian Averaging, Decentralized Rapid Grassmannian, Grassmannian Averaging, Rapid Grassmannian, Grassmannian
Categories: Numerical Analysis (math.NA); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Submitted to ICLR 2025

Abstract:We propose new algorithms to efficiently average a collection of points on a Grassmannian manifold in both the centralized and decentralized settings. Grassmannian points are used ubiquitously in machine learning, computer vision, and signal processing to represent data through (often low-dimensional) subspaces. While averaging these points is crucial to many tasks (especially in the decentralized setting), existing methods unfortunately remain computationally expensive due to the non-Euclidean geometry of the manifold. Our proposed algorithms, Rapid Grassmannian Averaging (RGrAv) and Decentralized Rapid Grassmannian Averaging (DRGrAv), overcome this challenge by leveraging the spectral structure of the problem to rapidly compute an average using only small matrix multiplications and QR factorizations. We provide a theoretical guarantee of optimality and present numerical experiments which demonstrate that our algorithms outperform state-of-the-art methods in providing high accuracy solutions in minimal time. Additional experiments showcase the versatility of our algorithms to tasks such as K-means clustering on video motion data, establishing RGrAv and DRGrAv as powerful tools for generic Grassmannian averaging.

[LG-13] On the Adversarial Transferability of Generalized “Skip Connections”

Link: https://arxiv.org/abs/2410.08950
Authors: Yisen Wang, Yichuan Mo, Dongxian Wu, Mingjie Li, Xingjun Ma, Zhouchen Lin
Keywords: skip connections, modern deep models, Skip, Skip Gradient Method, essential ingredient
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Skip connections are an essential ingredient for modern deep models to be deeper and more powerful. Despite their huge success in normal scenarios (state-of-the-art classification performance on natural examples), we investigate and identify an interesting property of skip connections under adversarial scenarios, namely, the use of skip connections allows easier generation of highly transferable adversarial examples. Specifically, in ResNet-like models (with skip connections), we find that using more gradients from the skip connections rather than the residual modules according to a decay factor during backpropagation allows one to craft adversarial examples with high transferability. The above method is termed the Skip Gradient Method (SGM). Although starting from ResNet-like models in vision domains, we further extend SGM to more advanced architectures, including Vision Transformers (ViTs) and models with length-varying paths, and to other domains such as natural language processing. We conduct comprehensive transfer attacks against various models including ResNets, Transformers, Inceptions, Neural Architecture Search, and Large Language Models (LLMs). We show that employing SGM can greatly improve the transferability of crafted attacks in almost all cases. Furthermore, considering the high complexity of practical use, we further demonstrate that SGM can even improve the transferability on ensembles of models or targeted attacks and the stealthiness against current defenses. Finally, we provide theoretical explanations and empirical insights on how SGM works. Our findings not only motivate new adversarial research into the architectural characteristics of models but also open up further challenges for secure model architecture design. Our code is available at this https URL.
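The decay-factor idea behind SGM can be illustrated on a toy scalar residual stack. For a block `z_next = z + f(z)`, the ordinary input gradient picks up a factor `1 + f'(z)` per block; down-weighting the residual-module term with a decay factor `gamma` gives `1 + gamma * f'(z)`, so relatively more gradient flows through the skip path. The specific block `f(z) = w * tanh(z)`, the function names, and the scalar setting are assumptions made for this sketch, not the authors' implementation.

```python
import math

def forward(x, weights):
    """Scalar residual stack: z_next = z + w * tanh(z). Returns all activations."""
    activations = [x]
    for w in weights:
        z = activations[-1]
        activations.append(z + w * math.tanh(z))
    return activations

def input_gradient(x, weights, gamma=1.0):
    """Gradient of the output with respect to the input.

    gamma = 1.0 gives the ordinary gradient; gamma < 1.0 shrinks the
    contribution of each residual module relative to its skip connection.
    """
    activations = forward(x, weights)
    grad = 1.0
    for z, w in zip(activations[:-1], weights):
        residual_grad = w * (1.0 - math.tanh(z) ** 2)  # derivative of w * tanh(z)
        grad *= 1.0 + gamma * residual_grad
    return grad
```

With `gamma=0.0` the gradient comes from the skip connections alone and equals 1 for this toy stack, while `gamma=1.0` recovers standard backpropagation.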

[LG-14] Meta-Transfer Learning Empowered Temporal Graph Networks for Cross-City Real Estate Appraisal

Link: https://arxiv.org/abs/2410.08947
Authors: Weijia Zhang, Jindong Han, Hao Liu, Wei Fan, Hao Wang, Hui Xiong
Keywords: Real estate appraisal, real property taxation, Real estate, real estate deals, estate appraisal
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages

Abstract:Real estate appraisal is important for a variety of endeavors such as real estate deals, investment analysis, and real property taxation. Recently, deep learning has shown great promise for real estate appraisal by harnessing substantial online transaction data from web platforms. Nonetheless, deep learning is data-hungry, and thus it may not be trivially applicable to the many small cities with limited data. To this end, we propose Meta-Transfer Learning Empowered Temporal Graph Networks (MetaTransfer) to transfer valuable knowledge from multiple data-rich metropolises to the data-scarce city to improve valuation performance. Specifically, by modeling the ever-growing real estate transactions with associated residential communities as a temporal event heterogeneous graph, we first design an Event-Triggered Temporal Graph Network to model the irregular spatiotemporal correlations between evolving real estate transactions. In addition, we formulate the city-wide real estate appraisal as a multi-task dynamic graph link label prediction problem, where the valuation of each community in a city is regarded as an individual task. A Hypernetwork-Based Multi-Task Learning module is proposed to simultaneously facilitate intra-city knowledge sharing between multiple communities and task-specific parameter generation to accommodate the community-wise real estate price distribution. Furthermore, we propose a Tri-Level Optimization Based Meta-Learning framework to adaptively re-weight training transaction instances from multiple source cities to mitigate negative transfer, and thus improve the cross-city knowledge transfer effectiveness. Finally, extensive experiments based on five real-world datasets demonstrate the significant superiority of MetaTransfer compared with eleven baseline algorithms.

[LG-15] Maximizing the Potential of Synthetic Data: Insights from Random Matrix Theory

Link: https://arxiv.org/abs/2410.08942
Authors: Aymane El Firdoussi, Mohamed El Amine Seddik, Soufiane Hayou, Reda Alami, Ahmed Alzubaidi, Hakim Hacid
Keywords: gained attention, attention for training, Shumailov, Seddik, Synthetic data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST)

Abstract:Synthetic data has gained attention for training large language models, but poor-quality data can harm performance (see, e.g., Shumailov et al. (2023); Seddik et al. (2024)). A potential solution is data pruning, which retains only high-quality data based on a score function (human or machine feedback). Previous work (Feng et al., 2024) analyzed models trained on synthetic data as sample size increases. We extend this by using random matrix theory to derive the performance of a binary classifier trained on a mix of real and pruned synthetic data in a high dimensional setting. Our findings identify conditions where synthetic data could improve performance, focusing on the quality of the generative model and verification strategy. We also show a smooth phase transition in synthetic label noise, contrasting with prior sharp behavior in infinite sample limits. Experiments with toy models and large language models validate our theoretical results.

[LG-16] Enhancing Motion Variation in Text-to-Motion Models via Pose and Video Conditioned Editing

Link: https://arxiv.org/abs/2410.08931
Authors: Clayton Leite, Yu Xiao
Keywords: garnering significant attention, significant attention, sequences of human, human poses, poses from textual
Categories: Machine Learning (cs.LG)

Abstract:Text-to-motion models that generate sequences of human poses from textual descriptions are garnering significant attention. However, due to data scarcity, the range of motions these models can produce is still limited. For instance, current text-to-motion models cannot generate a motion of kicking a football with the instep of the foot, since the training data only includes martial arts kicks. We propose a novel method that uses short video clips or images as conditions to modify existing basic motions. In this approach, the model’s understanding of a kick serves as the prior, while the video or image of a football kick acts as the posterior, enabling the generation of the desired motion. By incorporating these additional modalities as conditions, our method can create motions not present in the training set, overcoming the limitations of text-motion datasets. A user study with 26 participants demonstrated that our approach produces unseen motions with realism comparable to commonly represented motions in text-motion datasets (e.g., HumanML3D), such as walking, running, squatting, and kicking.

[LG-17] Towards Cross-Lingual LLM Evaluation for European Languages

Link: https://arxiv.org/abs/2410.08928
Authors: Klaudia Thellmann, Bernhard Stadler, Michael Fromm, Jasper Schulze Buschhoff, Alex Jude, Fabio Barth, Johannes Leveling, Nicolas Flores-Herr, Joachim Köhler, René Jäkel, Mehdi Ali
Keywords: Large Language Models, rise of Large, revolutionized natural language, natural language processing, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The rise of Large Language Models (LLMs) has revolutionized natural language processing across numerous languages and tasks. However, evaluating LLM performance in a consistent and meaningful way across multiple European languages remains challenging, especially due to the scarcity of multilingual benchmarks. We introduce a cross-lingual evaluation approach tailored for European languages. We employ translated versions of five widely-used benchmarks to assess the capabilities of 40 LLMs across 21 European languages. Our contributions include examining the effectiveness of translated benchmarks, assessing the impact of different translation services, and offering a multilingual evaluation framework for LLMs that includes newly created datasets: EU20-MMLU, EU20-HellaSwag, EU20-ARC, EU20-TruthfulQA, and EU20-GSM8K. The benchmarks and results are made publicly available to encourage further research in multilingual LLM evaluation.

[LG-18] HyperPg – Prototypical Gaussians on the Hypersphere for Interpretable Deep Learning

Link: https://arxiv.org/abs/2410.08925
Authors: Maximilian Xiling Li, Korbinian Franz Rudolf, Nils Blank, Rudolf Lioutikov
Keywords: interpretable alternative, black-box deep learning, Prototype Learning methods, Learning methods provide, deep learning models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Prototype Learning methods provide an interpretable alternative to black-box deep learning models. Approaches such as ProtoPNet learn which parts of a test image “look like” known prototypical parts from training images, combining predictive power with the inherent interpretability of case-based reasoning. However, existing approaches have two main drawbacks: A) They rely solely on deterministic similarity scores without statistical confidence. B) The prototypes are learned in a black-box manner without human input. This work introduces HyperPg, a new prototype representation leveraging Gaussian distributions on a hypersphere in latent space, with learnable mean and variance. HyperPg prototypes adapt to the spread of clusters in the latent space and output likelihood scores. The new architecture, HyperPgNet, leverages HyperPg to learn prototypes aligned with human concepts from pixel-level annotations. Consequently, each prototype represents a specific concept such as color, image texture, or part of the image subject. A concept extraction pipeline built on foundation models provides pixel-level annotations, significantly reducing human labeling effort. Experiments on CUB-200-2011 and Stanford Cars datasets demonstrate that HyperPgNet outperforms other prototype learning architectures while using fewer parameters and training steps. Additionally, the concept-aligned HyperPg prototypes are learned transparently, enhancing model interpretability.

[LG-19] DiffPO: A causal diffusion model for learning distributions of potential outcomes

Link: https://arxiv.org/abs/2410.08924
Authors: Yuchen Ma, Valentyn Melnychuk, Jonas Schweisthal, Stefan Feuerriegel
Keywords: Predicting potential outcomes, Predicting potential, potential outcomes, interventions from observational, observational data
Categories: Machine Learning (cs.LG)

Abstract:Predicting potential outcomes of interventions from observational data is crucial for decision-making in medicine, but the task is challenging due to the fundamental problem of causal inference. Existing methods are largely limited to point estimates of potential outcomes with no uncertainty quantification; thus, the full information about the distributions of potential outcomes is typically ignored. In this paper, we propose a novel causal diffusion model called DiffPO, which is carefully designed for reliable inferences in medicine by learning the distribution of potential outcomes. In DiffPO, we leverage a tailored conditional denoising diffusion model to learn complex distributions, where we address the selection bias through a novel orthogonal diffusion loss. Another strength of DiffPO is that it is highly flexible (e.g., it can also be used to estimate different causal quantities such as CATE). Across a wide range of experiments, we show that our method achieves state-of-the-art performance.

[LG-20] Path-minimizing Latent ODEs for improved extrapolation and inference

Link: https://arxiv.org/abs/2410.08923
Authors: Matt L. Sampson, Peter Melchior
Keywords: complicated non-linear dynamics, provide flexible descriptions, predicting complicated non-linear, Latent ODE models, ODE models provide
Categories: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM)
Comments: 20 pages, 11 figures

Abstract:Latent ODE models provide flexible descriptions of dynamic systems, but they can struggle with extrapolation and predicting complicated non-linear dynamics. The latent ODE approach implicitly relies on encoders to identify unknown system parameters and initial conditions, whereas the evaluation times are known and directly provided to the ODE solver. This dichotomy can be exploited by encouraging time-independent latent representations. By replacing the common variational penalty in latent space with an \ell_2 penalty on the path length of each system, the models learn data representations that can easily be distinguished from those of systems with different configurations. This results in faster training, smaller models, more accurate interpolation and long-time extrapolation compared to the baseline ODE models with GRU, RNN, and LSTM encoder/decoders on tests with damped harmonic oscillator, self-gravitating fluid, and predator-prey systems. We also demonstrate superior results for simulation-based inference of the Lotka-Volterra parameters and initial conditions by using the latents as data summaries for a conditional normalizing flow. Our change to the training loss is agnostic to the specific recognition network used by the decoder and can therefore easily be adopted by other latent ODE models.
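As a rough illustration of the path-length penalty mentioned in the abstract above, the sketch below sums the Euclidean distances between consecutive latent states of one trajectory. The abstract does not give the exact form of the penalty, so the function name `path_length_penalty` and this particular definition are assumptions made for illustration, not the authors' implementation.

```python
import math

def path_length_penalty(latents):
    """Sum of Euclidean distances between consecutive latent states.

    `latents` is a list of latent vectors (lists of floats), one per time step.
    This is a hypothetical reading of "an l2 penalty on the path length".
    """
    total = 0.0
    for prev, curr in zip(latents, latents[1:]):
        total += math.sqrt(sum((c - p) ** 2 for p, c in zip(prev, curr)))
    return total

# A time-independent latent representation has zero path length.
static_path = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
# A latent that moves from (0, 0) to (3, 4) and then stays put has length 5.
moving_path = [[0.0, 0.0], [3.0, 4.0], [3.0, 4.0]]

print(path_length_penalty(static_path))  # 0.0
print(path_length_penalty(moving_path))  # 5.0
```

Under this reading, minimizing the penalty pushes each system's latent trajectory toward a single point, which matches the abstract's stated goal of time-independent latent representations.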

[LG-21] Efficient Hyperparameter Importance Assessment for CNNs

Link: https://arxiv.org/abs/2410.08920
Authors: Ruinan Wang, Ian Nabney, Mohammad Golbabaee
Keywords: impacting models’ robustness, profoundly impacting models’, machine learning pipeline, Convolutional Neural Networks, profoundly impacting
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages

Abstract:Hyperparameter selection is an essential aspect of the machine learning pipeline, profoundly impacting models’ robustness, stability, and generalization capabilities. Given the complex hyperparameter spaces associated with Neural Networks and the constraints of computational resources and time, optimizing all hyperparameters becomes impractical. In this context, leveraging hyperparameter importance assessment (HIA) can provide valuable guidance by narrowing down the search space. This enables machine learning practitioners to focus their optimization efforts on the hyperparameters with the most significant impact on model performance while conserving time and resources. This paper aims to quantify the importance weights of some hyperparameters in Convolutional Neural Networks (CNNs) with an algorithm called N-RReliefF, laying the groundwork for applying HIA methodologies in the Deep Learning field. We conduct an extensive study by training over ten thousand CNN models across ten popular image classification datasets, thereby acquiring a comprehensive dataset containing hyperparameter configuration instances and their corresponding performance metrics. It is demonstrated that among the investigated hyperparameters, the top five important hyperparameters of the CNN model are the number of convolutional layers, learning rate, dropout rate, optimizer and epoch.

[LG-22] An End-to-End Deep Learning Method for Solving Nonlocal Allen-Cahn and Cahn-Hilliard Phase-Field Models

Link: https://arxiv.org/abs/2410.08914
Authors: Yuwei Geng, Olena Burkovska, Lili Ju, Guannan Zhang, Max Gunzburger
Keywords: phase-field models, nonlocal phase-field models, propose an efficient, solving nonlocal Allen-Cahn, models
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)

Abstract:We propose an efficient end-to-end deep learning method for solving nonlocal Allen-Cahn (AC) and Cahn-Hilliard (CH) phase-field models. One motivation for this effort is that discretized partial differential equation-based AC or CH phase-field models result in diffuse interfaces between phases, and the only recourse for remediation is to severely refine the spatial grids in the vicinity of the true moving sharp interface, whose width is determined by a grid-independent parameter that is substantially larger than the local grid size. In this work, we introduce non-mass conserving nonlocal AC or CH phase-field models with regular, logarithmic, or obstacle double-well potentials. Because of non-locality, some of these models feature totally sharp interfaces separating phases. The discretization of such models can lead to a transition between phases that is only a single grid cell wide. Another motivation is to use deep learning approaches to ameliorate the otherwise high cost of solving discretized nonlocal phase-field models. To this end, loss functions of the customized neural networks are defined using the residual of the fully discrete approximations of the AC or CH models, which results from applying a Fourier collocation method and a temporal semi-implicit approximation. To address the long-range interactions in the models, we tailor the architecture of the neural network by incorporating a nonlocal kernel as an input channel to the neural network model. We then provide the results of extensive computational experiments to illustrate the accuracy, structure-preserving properties, predictive capabilities, and cost reductions of the proposed method.

[LG-23] Low-Dimension-to-High-Dimension Generalization And Its Implications for Length Generalization

Link: https://arxiv.org/abs/2410.08898
Authors: Yang Chen, Yitao Liang, Zhouchen Lin
Keywords: LDHD generalization, high-dimensional testing space, LDHD, generalization, inherent LDHD generalization
Categories: Machine Learning (cs.LG)

Abstract:Low-Dimension-to-High-Dimension (LDHD) generalization is a special case of Out-of-Distribution (OOD) generalization, where the training data are restricted to a low-dimensional subspace of the high-dimensional testing space. Assuming that each instance is generated from a latent variable and the dimension of the latent variable reflects the problem scale, the inherent scaling challenge in length generalization can be captured by the LDHD generalization in the latent space. We theoretically demonstrate that LDHD generalization is generally unattainable without exploiting prior knowledge to provide appropriate inductive bias. Specifically, we explore LDHD generalization in Boolean functions. We verify that different architectures trained with (S)GD converge to min-degree interpolators w.r.t. different independent sets. LDHD generalization is achievable if and only if the target function coincides with this inductive bias. Applying the insights from LDHD generalization to length generalization, we explain the effectiveness of CoT as changing the structure of the latent space to enable better LDHD generalization. We also propose a principle for position embedding design to handle both the inherent LDHD generalization and nuisances such as the data format. Following the principle, we propose a novel position embedding called RPE-Square that remedies the RPE for dealing with the data format nuisance.

[LG-24] MAD-TD: Model-Augmented Data stabilizes High Update Ratio RL

Link: https://arxiv.org/abs/2410.08896
Authors: Claas A Voelcker, Marcel Hussing, Eric Eaton, Amir-massoud Farahmand, Igor Gilitschenski
Keywords: Building deep reinforcement, Building deep, deep reinforcement learning, proven notoriously challenging, agents that find
Categories: Machine Learning (cs.LG)

Abstract:Building deep reinforcement learning (RL) agents that find a good policy with few samples has proven notoriously challenging. To achieve sample efficiency, recent work has explored updating neural networks with large numbers of gradient steps for every new sample. While such high update-to-data (UTD) ratios have shown strong empirical performance, they also introduce instability to the training process. Previous approaches need to rely on periodic neural network parameter resets to address this instability, but restarting the training process is infeasible in many real-world applications and requires tuning the resetting interval. In this paper, we focus on one of the core difficulties of stable training with limited samples: the inability of learned value functions to generalize to unobserved on-policy actions. We mitigate this issue directly by augmenting the off-policy RL training process with a small amount of data generated from a learned world model. Our method, Model-Augmented Data for Temporal Difference learning (MAD-TD) uses small amounts of generated data to stabilize high UTD training and achieve competitive performance on the most challenging tasks in the DeepMind control suite. Our experiments further highlight the importance of employing a good model to generate data, MAD-TD’s ability to combat value overestimation, and its practical stability gains for continued learning.

[LG-25] Drama: Mamba-Enabled Model-Based Reinforcement Learning Is Sample and Parameter Efficient

Link: https://arxiv.org/abs/2410.08893
Authors: Wenlong Wang, Ivana Dusparic, Yucheng Shi, Ke Zhang, Vinny Cahill
Keywords: Model-based reinforcement learning, world model, offers a solution, data inefficiency, inefficiency that plagues
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Abstract:Model-based reinforcement learning (RL) offers a solution to the data inefficiency that plagues most model-free RL algorithms. However, learning a robust world model often demands complex and deep architectures, which are expensive to compute and train. Within the world model, dynamics models are particularly crucial for accurate predictions, and various dynamics-model architectures have been explored, each with its own set of challenges. Currently, recurrent neural network (RNN) based world models face issues such as vanishing gradients and difficulty in capturing long-term dependencies effectively. In contrast, use of transformers suffers from the well-known issues of self-attention mechanisms, where both memory and computational complexity scale as O(n^2), with n representing the sequence length. To address these challenges we propose a state space model (SSM) based world model, specifically based on Mamba, that achieves O(n) memory and computational complexity while effectively capturing long-term dependencies and facilitating the use of longer training sequences efficiently. We also introduce a novel sampling method to mitigate the suboptimality caused by an incorrect world model in the early stages of training, combining it with the aforementioned technique to achieve a normalised score comparable to other state-of-the-art model-based RL algorithms using only a 7 million trainable parameter world model. This model is accessible and can be trained on an off-the-shelf laptop. Our code is available at this https URL.
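To make the O(n^2) versus O(n) contrast in the abstract above concrete, here is a minimal sketch of a scalar, time-invariant linear state-space recurrence processed in one pass. It is not Mamba (which uses input-dependent, selective parameters) and not the paper's world model; the names `linear_scan`, `scan_step_count`, and `attention_pair_count`, and the parameters `a`, `b`, `c`, are hypothetical and chosen only for illustration.

```python
def linear_scan(xs, a=0.9, b=1.0, c=1.0):
    """Run h_t = a * h_{t-1} + b * x_t and y_t = c * h_t in a single pass.

    Each step touches only the previous hidden state, so the work and the
    memory for the state grow linearly with the sequence length.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys

def scan_step_count(n):
    """Number of recurrence updates for a length-n sequence."""
    return n

def attention_pair_count(n):
    """Number of query-key pairs scored by full self-attention."""
    return n * n

# An impulse at t = 0 decays geometrically through the hidden state.
print(linear_scan([1.0, 0.0, 0.0], a=0.5))  # [1.0, 0.5, 0.25]
print(scan_step_count(1024), attention_pair_count(1024))  # 1024 1048576
```

The second print shows why longer training sequences are cheaper for a recurrence of this kind: doubling the sequence length doubles the scan work but quadruples the number of attention pairs.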

[LG-26] Federated Learning in Practice: Reflections and Projections

Link: https://arxiv.org/abs/2410.08892
Authors: Katharine Daly, Hubert Eichner, Peter Kairouz, H. Brendan McMahan, Daniel Ramage, Zheng Xu
Keywords: enables multiple entities, machine learning technique, Federated Learning, local data, technique that enables
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Abstract:Federated Learning (FL) is a machine learning technique that enables multiple entities to collaboratively learn a shared model without exchanging their local data. Over the past decade, FL systems have achieved substantial progress, scaling to millions of devices across various learning domains while offering meaningful differential privacy (DP) guarantees. Production systems from organizations like Google, Apple, and Meta demonstrate the real-world applicability of FL. However, key challenges remain, including verifying server-side DP guarantees and coordinating training across heterogeneous devices, limiting broader adoption. Additionally, emerging trends such as large (multi-modal) models and blurred lines between training, inference, and personalization challenge traditional FL frameworks. In response, we propose a redefined FL framework that prioritizes privacy principles rather than rigid definitions. We also chart a path forward by leveraging trusted execution environments and open-source ecosystems to address these challenges and facilitate future advancements in FL.

[LG-27] Bank Loan Prediction Using Machine Learning Techniques

Link: https://arxiv.org/abs/2410.08886
Authors: F M Ahosanul Haque, Md. Mahedi Hassan
Keywords: machine learning, development of economies, ecosystem through consumer, consumer and business, bank loan approval
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 18 figures, 6 tables

Abstract:Banks are important for the development of economies in any financial ecosystem through consumer and business loans. Lending, however, presents risks; thus, banks have to determine the applicant’s financial position to reduce the probability of default. Many banks have therefore adopted data analytics and state-of-the-art technology to arrive at better decisions in the process. The probability of payback is estimated by a predictive modeling technique in which machine learning algorithms are applied. In this research project, we apply several machine learning methods to further improve the accuracy and efficiency of loan approval processes. Our work focuses on the prediction of bank loan approval; we have worked on a dataset of 148,670 instances and 37 attributes using machine learning methods. The target property segregates the loan applications into “Approved” and “Denied” groups. Various machine learning techniques have been used, namely, Decision Tree Categorization, AdaBoosting, Random Forest Classifier, SVM, and GaussianNB. Following that, the models were trained and evaluated. Among these, the best-performing algorithm was AdaBoosting, which achieved an accuracy of 99.99%. The results therefore show how ensemble learning works effectively to improve the prediction of loan approval decisions. The presented work points to the possibility of achieving extremely accurate and efficient loan prediction models that provide useful insights for applying machine learning to financial domains.

[LG-28] Interdependency Matters: Graph Alignment for Multivariate Time Series Anomaly Detection

Link: https://arxiv.org/abs/2410.08877
Authors: Yuanyi Wang, Haifeng Sun, Chengsen Wang, Mengde Zhu, Jingyu Wang, Wei Tang, Qi Qi, Zirui Zhuang, Jianxin Liao
Keywords: Anomaly detection, MTS Anomaly Detection, multivariate time series, mining and industry, Anomaly
Categories: Machine Learning (cs.LG); Databases (cs.DB); Information Retrieval (cs.IR); Multimedia (cs.MM)

Abstract:Anomaly detection in multivariate time series (MTS) is crucial for various applications in data mining and industry. Current industrial methods typically approach anomaly detection as an unsupervised learning task, aiming to identify deviations by estimating the normal distribution in noisy, label-free datasets. These methods increasingly incorporate interdependencies between channels through graph structures to enhance accuracy. However, the role of interdependencies is more critical than previously understood, as shifts in interdependencies between MTS channels from normal to anomalous data are significant. This observation suggests that anomalies could be detected by changes in these interdependency graph series. To capitalize on this insight, we introduce MADGA (MTS Anomaly Detection via Graph Alignment), which redefines anomaly detection as a graph alignment (GA) problem that explicitly utilizes interdependencies for anomaly detection. MADGA dynamically transforms subsequences into graphs to capture the evolving interdependencies, and graph alignment is performed between these graphs, optimizing an alignment plan that minimizes cost, effectively minimizing the distance for normal data and maximizing it for anomalous data. Uniquely, our GA approach involves explicit alignment of both nodes and edges, employing Wasserstein distance for nodes and Gromov-Wasserstein distance for edges. To our knowledge, this is the first application of GA to MTS anomaly detection that explicitly leverages interdependency for this purpose. Extensive experiments on diverse real-world datasets validate the effectiveness of MADGA, demonstrating its capability to detect anomalies and differentiate interdependencies, consistently achieving state-of-the-art performance across various scenarios.

[LG-29] Fragile Giants: Understanding the Susceptibility of Models to Subpopulation Attacks

Link: https://arxiv.org/abs/2410.08872
Authors: Isha Gupta, Hidde Lycklama, Emanuel Opel, Evan Rose, Anwar Hithnawi
Keywords: machine learning models, increasingly complex, machine learning, robustness and trustworthiness, subpopulation poisoning
Categories: Machine Learning (cs.LG)

Abstract:As machine learning models become increasingly complex, concerns about their robustness and trustworthiness have become more pressing. A critical vulnerability of these models is data poisoning attacks, where adversaries deliberately alter training data to degrade model performance. One particularly stealthy form of these attacks is subpopulation poisoning, which targets distinct subgroups within a dataset while leaving overall performance largely intact. The ability of these attacks to generalize within subpopulations poses a significant risk in real-world settings, as they can be exploited to harm marginalized or underrepresented groups within the dataset. In this work, we investigate how model complexity influences susceptibility to subpopulation poisoning attacks. We introduce a theoretical framework that explains how overparameterized models, due to their large capacity, can inadvertently memorize and misclassify targeted subpopulations. To validate our theory, we conduct extensive experiments on large-scale image and text datasets using popular model architectures. Our results show a clear trend: models with more parameters are significantly more vulnerable to subpopulation poisoning. Moreover, we find that attacks on smaller, human-interpretable subgroups often go undetected by these models. These results highlight the need to develop defenses that specifically address subpopulation vulnerabilities.

[LG-30] Can we hop in general? A discussion of benchmark selection and design using the Hopper environment

Link: https://arxiv.org/abs/2410.08870
Authors: Claas A Voelcker, Marcel Hussing
Keywords: benchmark-driven testing, current RL community, fundamental paradigm, Empirical, Benchmark choices
Categories: Machine Learning (cs.LG)

Abstract:Empirical, benchmark-driven testing is a fundamental paradigm in the current RL community. While using off-the-shelf benchmarks in reinforcement learning (RL) research is a common practice, this choice is rarely discussed. Benchmark choices are often made based on intuitive ideas like “legged robots” or “visual observations”. In this paper, we argue that benchmarking in RL needs to be treated as a scientific discipline itself. To illustrate our point, we present a case study on different variants of the Hopper environment to show that the selection of standard benchmarking suites can drastically change how we judge performance of algorithms. The field does not have a cohesive notion of what the different Hopper environments are representative of; they do not even seem to be representative of each other. Our experimental results suggest a larger issue in the deep RL literature: benchmark choices are neither commonly justified, nor does there exist a language that could be used to justify the selection of certain environments. This paper concludes with a discussion of the requirements for proper discussion and evaluations of benchmarks and recommends steps to start a dialogue towards this goal.

[LG-31] Evolution of SAE Features Across Layers in LLMs

Link: https://arxiv.org/abs/2410.08869
Authors: Daniel Balcells, Benjamin Lerner, Michael Oesterle, Ediz Ucar, Stefan Heimersheim
Keywords: Sparse Autoencoders, transformer-based language models, typically defined independently, Autoencoders for transformer-based, transformer-based language
Categories: Machine Learning (cs.LG)

Abstract:Sparse Autoencoders for transformer-based language models are typically defined independently per layer. In this work we analyze statistical relationships between features in adjacent layers to understand how features evolve through a forward pass. We provide a graph visualization interface for features and their most similar next-layer neighbors, and build communities of related features across layers. We find that a considerable number of features are passed through from a previous layer, some features can be expressed as quasi-boolean combinations of previous features, and some features become more specialized in later layers.

[LG-32] Improved Sample Complexity for Global Convergence of Actor-Critic Algorithms

Link: https://arxiv.org/abs/2410.08868
Authors: Navdeep Kumar, Priyank Agrawal, Giorgia Ramponi, Kfir Yehuda Levy, Shie Mannor
Keywords: improved sample complexity, significantly improved sample, global sample complexity, local convergence results, sample complexity
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:In this paper, we establish the global convergence of the actor-critic algorithm with a significantly improved sample complexity of O(\epsilon^{-3}), advancing beyond the existing local convergence results. Previous works provide local convergence guarantees with a sample complexity of O(\epsilon^{-2}) for bounding the squared gradient of the return, which translates to a global sample complexity of O(\epsilon^{-4}) using the gradient domination lemma. In contrast to traditional methods that employ decreasing step sizes for both the actor and critic, we demonstrate that a constant step size for the critic is sufficient to ensure convergence in expectation. This key insight reveals that using a decreasing step size for the actor alone is sufficient to handle the noise for both the actor and critic. Our findings provide theoretical support for the practical success of many algorithms that rely on constant step sizes.
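The step from a local rate of order \epsilon^{-2} to a global rate of order \epsilon^{-4} in the abstract above can be reconstructed as follows. This is a sketch under assumptions: the gradient domination lemma is taken in its common form with a problem-dependent constant C, and the symbols J, J^*, \theta, \delta, and N are notation introduced here, not taken from the paper.

```latex
% Assumed local guarantee: N(\delta) = O(\delta^{-2}) samples suffice for
%   E[ \|\nabla J(\theta)\|^2 ] \le \delta .
% Assumed gradient domination: J^* - J(\theta) \le C \, \|\nabla J(\theta)\| .
\begin{align*}
\mathbb{E}\big[J^* - J(\theta)\big]
  &\le C \, \mathbb{E}\big[\|\nabla J(\theta)\|\big] \\
  &\le C \sqrt{\mathbb{E}\big[\|\nabla J(\theta)\|^2\big]}
     && \text{(Jensen's inequality)} \\
  &\le C \sqrt{\delta}.
\end{align*}
% To reach a global gap of \epsilon, set C\sqrt{\delta} = \epsilon,
% i.e. \delta = \epsilon^2 / C^2, which gives
\[
N(\delta) = O(\delta^{-2}) = O\!\big(C^4 \epsilon^{-4}\big) = O(\epsilon^{-4}).
\]
```

The square root introduced by the gradient domination step is what doubles the exponent, and it is this O(\epsilon^{-4}) figure that the paper's O(\epsilon^{-3}) result improves on.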

[LG-33] Prediction by Machine Learning Analysis of Genomic Data Phenotypic Frost Tolerance in Perccottus glenii

Link: https://arxiv.org/abs/2410.08867
Authors: Lilin Fan, Xuqing Chai, Zhixiong Tian, Yihang Qiao, Zhen Wang, Yifan Zhang
Keywords: Random Forest model, optimal encoding method, Biotechnology Information database, optimal classification model, Traditional biological analysis
Categories: Machine Learning (cs.LG)
Comments: 18 pages

Abstract:Analysis of the genome sequence of Perccottus glenii, the only fish known to possess freeze tolerance, holds significant importance for understanding how organisms adapt to extreme environments. Traditional biological analysis methods are time-consuming and have limited accuracy. To address these issues, we employ machine learning techniques to analyze the gene sequences of Perccottus glenii, with Neodontobutis hainanens as a comparative group. First, we propose five gene sequence vectorization methods and a method for handling ultra-long gene sequences. We conduct a comparative study of three vectorization methods (ordinal encoding, One-Hot encoding, and K-mer encoding) to identify the optimal encoding method. Second, we construct four classification models: Random Forest, LightGBM, XGBoost, and Decision Tree. The dataset used by these classification models was extracted from the National Center for Biotechnology Information database, and we vectorized the sequence matrices using the optimal encoding method, K-mer. The Random Forest model, which is the optimal model, achieved a classification accuracy of up to 99.98%. Lastly, we used SHAP values to conduct an interpretable analysis of the optimal classification model. Through ten-fold cross-validation and the AUC metric, we identified the top 10 features that contribute the most to the model’s classification accuracy. This demonstrates that machine learning methods can effectively replace traditional manual analysis in identifying genes associated with the freeze tolerance phenotype in Perccottus glenii.
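For readers unfamiliar with K-mer encoding, the sketch below shows one common way to turn a nucleotide string into a fixed-length count vector. The abstract does not state the value of k, the alphabet handling, or any normalization used by the authors, so the function name `kmer_vector` and the defaults `k=2` and `alphabet="ACGT"` are illustrative assumptions.

```python
from collections import Counter
from itertools import product

def kmer_vector(sequence, k=2, alphabet="ACGT"):
    """Count every overlapping length-k substring of `sequence`.

    Returns one count per possible k-mer, in lexicographic order over
    `alphabet`, so sequences of any length map to len(alphabet) ** k features.
    """
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    return [counts.get(kmer, 0) for kmer in vocabulary]

vector = kmer_vector("ACGT", k=2)
print(len(vector))  # 16 possible 2-mers over A, C, G, T
print(sum(vector))  # 3 overlapping 2-mers: AC, CG, GT
```

Because the output length depends only on k and the alphabet, vectors of this kind can be fed directly to tree-based classifiers such as those named in the abstract.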

[LG-34] The Good, the Bad and the Ugly: Watermarks, Transferable Attacks and Adversarial Defenses ICML2024

Link: https://arxiv.org/abs/2410.08864
Authors: Grzegorz Głuch, Berkant Turan, Sai Ganesh Nagarajan, Sebastian Pokutta
Keywords: extend existing definitions, transferable attack, formalize and extend, extend existing, existing definitions
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 42 pages, 6 figures, preliminary version published in ICML 2024 (Workshop on Theoretical Foundations of Foundation Models), see this https URL

Abstract:We formalize and extend existing definitions of backdoor-based watermarks and adversarial defenses as interactive protocols between two players. The existence of these schemes is inherently tied to the learning tasks for which they are designed. Our main result shows that for almost every discriminative learning task, at least one of the two – a watermark or an adversarial defense – exists. The term “almost every” indicates that we also identify a third, counterintuitive but necessary option, i.e., a scheme we call a transferable attack. By transferable attack, we refer to an efficient algorithm computing queries that look indistinguishable from the data distribution and fool all efficient defenders. To this end, we prove the necessity of a transferable attack via a construction that uses a cryptographic tool called homomorphic encryption. Furthermore, we show that any task that satisfies our notion of a transferable attack implies a cryptographic primitive, thus requiring the underlying task to be computationally complex. These two facts imply an “equivalence” between the existence of transferable attacks and cryptography. Finally, we show that the class of tasks of bounded VC-dimension has an adversarial defense, and a subclass of them has a watermark.

[LG-35] Hybrid LLM-DDQN based Joint Optimization of V2I Communication and Autonomous Driving

Link: https://arxiv.org/abs/2410.08854
Authors: Zijiang Yan, Hao Zhou, Hina Tabassum, Xue Liu
Keywords: Large language models, considerable interest recently, interest recently due, Large language, received considerable interest
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
Comments: Submission for possible publication

Abstract:Large language models (LLMs) have received considerable interest recently due to their outstanding reasoning and comprehension capabilities. This work explores applying LLMs to vehicular networks, aiming to jointly optimize vehicle-to-infrastructure (V2I) communications and autonomous driving (AD) policies. We deploy LLMs for AD decision-making to maximize traffic flow and avoid collisions for road safety, and a double deep Q-learning algorithm (DDQN) is used for V2I optimization to maximize the received data rate and reduce frequent handovers. In particular, for LLM-enabled AD, we employ the Euclidean distance to identify previously explored AD experiences, and then LLMs can learn from past good and bad decisions for further improvement. Then, LLM-based AD decisions will become part of states in V2I problems, and DDQN will optimize the V2I decisions accordingly. After that, the AD and V2I decisions are iteratively optimized until convergence. Such an iterative optimization approach can better explore the interactions between LLMs and conventional reinforcement learning techniques, revealing the potential of using LLMs for network optimization and management. Finally, the simulations demonstrate that our proposed hybrid LLM-DDQN approach outperforms the conventional DDQN algorithm, showing faster convergence and higher average rewards.
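
To make the DDQN half of this pipeline concrete, here is a minimal sketch of the standard double Q-learning target, plus a Euclidean nearest-neighbour lookup of the kind the abstract mentions for retrieving past driving experiences. The function names, the memory layout, and all numbers are hypothetical illustrations, not the authors' implementation.

```python
import math

def ddqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Standard double DQN target for a single transition."""
    if done:
        return reward
    # The online network selects the next action ...
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # ... and the target network evaluates it, which reduces overestimation.
    return reward + gamma * next_q_target[best_action]

def nearest_experience(state, memory):
    """Return the stored (state, note) pair closest to `state` in Euclidean distance."""
    return min(memory, key=lambda item: math.dist(state, item[0]))

# Made-up Q-values for three actions in the next state.
target = ddqn_target(1.0, next_q_online=[0.2, 0.9, 0.5],
                     next_q_target=[0.3, 0.4, 0.8], gamma=0.5)

# Made-up experience memory: (state vector, free-text outcome).
memory = [((0.0, 0.0), "kept lane"), ((3.0, 4.0), "changed lane")]
closest = nearest_experience((2.5, 3.5), memory)
```

Note that the target uses the target network's value for the action the online network prefers (index 1 here), not the target network's own maximum.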

[LG-36] Conformalized Interactive Imitation Learning: Handling Expert Shift and Intermittent Feedback

Link: https://arxiv.org/abs/2410.08852
Authors: Michelle Zhao,Reid Simmons,Henny Admoni,Aaditya Ramdas,Andrea Bajcsy
Keywords: interactive imitation learning, seeking additional feedback, actively seeking additional, Monte Carlo dropout, distribution shifts encountered
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Abstract:In interactive imitation learning (IL), uncertainty quantification offers a way for the learner (i.e. robot) to contend with distribution shifts encountered during deployment by actively seeking additional feedback from an expert (i.e. human) online. Prior works use mechanisms like ensemble disagreement or Monte Carlo dropout to quantify when black-box IL policies are uncertain; however, these approaches can lead to overconfident estimates when faced with deployment-time distribution shifts. Instead, we contend that we need uncertainty quantification algorithms that can leverage the expert human feedback received during deployment time to adapt the robot’s uncertainty online. To tackle this, we draw upon online conformal prediction, a distribution-free method for constructing prediction intervals online given a stream of ground-truth labels. Human labels, however, are intermittent in the interactive IL setting. Thus, from the conformal prediction side, we introduce a novel uncertainty quantification algorithm called intermittent quantile tracking (IQT) that leverages a probabilistic model of intermittent labels, maintains asymptotic coverage guarantees, and empirically achieves desired coverage levels. From the interactive IL side, we develop ConformalDAgger, a new approach wherein the robot uses prediction intervals calibrated by IQT as a reliable measure of deployment-time uncertainty to actively query for more expert feedback. We compare ConformalDAgger to prior uncertainty-aware DAgger methods in scenarios where the distribution shift is (and isn’t) present because of changes in the expert’s policy. We find that in simulated and hardware deployments on a 7DOF robotic manipulator, ConformalDAgger detects high uncertainty when the expert shifts and increases the number of interventions compared to baselines, allowing the robot to more quickly learn the new behavior.
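
The idea of tracking a quantile online while labels arrive only intermittently can be sketched as follows. This is a generic online quantile-tracking update with inverse-probability weighting for missing labels; the weighting scheme, the function name, and the toy stream are assumptions of this sketch and may differ from the paper's IQT algorithm.

```python
def quantile_tracking_step(q, score, alpha, lr, observed=True, p_obs=1.0):
    """One online quantile-tracking update of the interval radius q."""
    if not observed:
        # No expert label this step: nothing to compare against, keep q.
        return q
    miscovered = 1.0 if score > q else 0.0
    # Grow q after a miss, shrink it slightly after a hit; dividing by p_obs
    # compensates (in expectation) for the steps where no label arrives.
    return q + (lr / p_obs) * (miscovered - alpha)

q = 1.0
# Toy stream of (nonconformity score, was an expert label observed?).
stream = [(2.0, True), (0.4, False), (0.5, True)]
history = []
for score, observed in stream:
    q = quantile_tracking_step(q, score, alpha=0.1, lr=0.5,
                               observed=observed, p_obs=0.5)
    history.append(q)
```

A large `q` means wide prediction intervals, which a learner could use as a signal to query the expert.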

[LG-37] Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization

Link: https://arxiv.org/abs/2410.08847
Authors: Noam Razin,Sadhika Malladi,Adithya Bhaskar,Danqi Chen,Sanjeev Arora,Boris Hanin
Keywords: Direct Preference Optimization, Direct Preference, Preference Optimization, likelihood displacement, variants are increasingly
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: Code available at this https URL

Abstract:Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences. Although these methods are designed to teach a model to generate preferred responses more frequently relative to dispreferred responses, prior work has observed that the likelihood of preferred responses often decreases during training. The current work sheds light on the causes and implications of this counter-intuitive phenomenon, which we term likelihood displacement. We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with an opposite meaning. As a simple example, training a model to prefer “No” over “Never” can sharply increase the probability of “Yes”. Moreover, when aligning the model to refuse unsafe prompts, we show that such displacement can unintentionally lead to unalignment, by shifting probability mass from preferred refusal responses to harmful responses (e.g., reducing the refusal rate of Llama-3-8B-Instruct from 74.4% to 33.4%). We theoretically characterize that likelihood displacement is driven by preferences that induce similar embeddings, as measured by a centered hidden embedding similarity (CHES) score. Empirically, the CHES score enables identifying which training samples contribute most to likelihood displacement in a given dataset. Filtering out these samples effectively mitigated unintentional unalignment in our experiments. More broadly, our results highlight the importance of curating data with sufficiently distinct preferences, for which we believe the CHES score may prove valuable.
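
A small numeric sketch shows why the DPO objective permits this behaviour: the standard DPO loss depends only on the margin between the preferred and dispreferred log-ratios, so it can fall even when the preferred response becomes less likely. The log-probabilities below are made up for illustration and are not taken from the paper.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) pair."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # -log(sigmoid(beta * margin)), written in a numerically friendly form.
    return math.log1p(math.exp(-beta * margin))

# Reference model assigns both responses log-probability -5.
ref_w, ref_l = -5.0, -5.0

# Before training: the policy equals the reference model.
loss_before = dpo_loss(-5.0, -5.0, ref_w, ref_l)

# After training (hypothetical): BOTH log-probabilities dropped,
# but the dispreferred one dropped more.
loss_after = dpo_loss(-7.0, -12.0, ref_w, ref_l)
```

Here the loss improves although the preferred response went from log-probability -5 to -7; the freed probability mass has to land on some other response, which is the displacement the abstract describes.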

[LG-38] A physics-guided neural network for flooding area detection using SAR imagery and local river gauge observations

Link: https://arxiv.org/abs/2410.08837
Authors: Monika Gierszewska,Tomasz Berezowski
Keywords: flooding extent area, water, water extent areas, flooding area, valley is related
Categories: Machine Learning (cs.LG)
Comments: 18 pages, 6 figures, 57 cited references

Abstract:The flooding extent area in a river valley is related to river gauge observations. The higher the water elevation, the larger the flooding area. Due to synthetic aperture radar’s (SAR) capabilities to penetrate through clouds, radar images have been commonly used to estimate flooding extent area with various methods, from simple thresholding to deep learning models. In this study, we propose a physics-guided neural network for flooding area detection. Our approach takes as input data the Sentinel 1 time-series images and the water elevations in the river assigned to each image. We apply the Pearson correlation coefficient between the predicted sum of water extent areas and the local water level observations of river water elevations as the loss function. The effectiveness of our method is evaluated in five different study areas by comparing the predicted water maps with reference water maps obtained from digital terrain models and optical satellite images. The highest Intersection over Union (IoU) score achieved by our models was 0.89 for the water class and 0.96 for the non-water class. Additionally, we compared the results with other unsupervised methods. The proposed neural network provided a higher IoU than the other methods, especially for SAR images registered during low water elevation in the river.
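
The correlation-based loss can be sketched in a few lines. The version below sums per-pixel water probabilities into one predicted area per image and returns the negative Pearson correlation with the gauge readings; using the negative (rather than, say, one minus the correlation) is an assumption of this sketch, and the tiny 2x2 maps are invented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

def correlation_loss(water_prob_maps, gauge_levels):
    """Negative correlation between predicted water area and water level."""
    # Summing per-pixel probabilities gives a soft water area per image.
    areas = [sum(sum(row) for row in prob_map) for prob_map in water_prob_maps]
    return -pearson(areas, gauge_levels)

maps = [
    [[0.5, 0.5], [0.0, 0.0]],   # soft area 1.0
    [[1.0, 0.5], [0.5, 0.0]],   # soft area 2.0
    [[1.0, 1.0], [1.0, 0.5]],   # soft area 3.5
]
levels = [10.0, 11.0, 12.5]      # hypothetical gauge readings per image
loss = correlation_loss(maps, levels)
```

Because no pixel-level labels enter the loss, training in this style needs only one gauge reading per image.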

[LG-39] Unveiling Molecular Secrets: An LLM-Augmented Linear Model for Explainable and Calibratable Molecular Property Prediction

Link: https://arxiv.org/abs/2410.08829
Authors: Zhuoran Li,Xu Sun,Wanyu Lin,Jiannong Cao
Keywords: scientific fields, material science, drug discovery, discovery and material, molecular property prediction
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Explainable molecular property prediction is essential for various scientific fields, such as drug discovery and material science. Despite delivering intrinsic explainability, linear models struggle with capturing complex, non-linear patterns. Large language models (LLMs), on the other hand, yield accurate predictions through powerful inference capabilities yet fail to provide chemically meaningful explanations for their predictions. This work proposes a novel framework, called MoleX, which leverages LLM knowledge to build a simple yet powerful linear model for accurate molecular property prediction with faithful explanations. The core of MoleX is to model complicated molecular structure-property relationships using a simple linear model, augmented by LLM knowledge and a crafted calibration strategy. Specifically, to extract the maximum amount of task-relevant knowledge from LLM embeddings, we employ information bottleneck-inspired fine-tuning and sparsity-inducing dimensionality reduction. These informative embeddings are then used to fit a linear model for explainable inference. Moreover, we introduce residual calibration to address prediction errors stemming from linear models’ insufficient expressiveness of complex LLM embeddings, thus recovering the LLM’s predictive power and boosting overall accuracy. Theoretically, we provide a mathematical foundation to justify MoleX’s explainability. Extensive experiments demonstrate that MoleX outperforms existing methods in molecular property prediction, establishing a new milestone in predictive performance, explainability, and efficiency. In particular, MoleX enables CPU inference and accelerates large-scale dataset processing, achieving comparable performance 300x faster with 100,000 fewer parameters than LLMs. Additionally, the calibration improves model performance by up to 12.7% without compromising explainability.

[LG-40] Do Unlearning Methods Remove Information from Language Model Weights?

Link: https://arxiv.org/abs/2410.08827
Authors: Aghyad Deeb,Fabien Roger
Keywords: Large Language Models’, Language Models’ knowledge, Large Language, perform cyber-security attacks, Language Models’
Categories: Machine Learning (cs.LG)

Abstract:Large Language Models’ knowledge of how to perform cyber-security attacks, create bioweapons, and manipulate humans poses risks of misuse. Previous work has proposed methods to unlearn this knowledge. Historically, it has been unclear whether unlearning techniques are removing information from the model weights or just making it harder to access. To disentangle these two objectives, we propose an adversarial evaluation method to test for the removal of information from model weights: we give an attacker access to some facts that were supposed to be removed, and using those, the attacker tries to recover other facts from the same distribution that cannot be guessed from the accessible facts. We show that using fine-tuning on the accessible facts can recover 88% of the pre-unlearning accuracy when applied to current unlearning methods, revealing the limitations of these methods in removing information from the model weights.

[LG-41] Towards virtual painting recolouring using Vision Transformer on X-Ray Fluorescence datacubes

Link: https://arxiv.org/abs/2410.08826
Authors: Alessandro Bombini,Fernando García-Avello Bofías,Francesca Giambi,Chiara Ruberto
Keywords: perform virtual painting, virtual painting recolouring, Deep Variational Embedding, X-Ray Fluorescence, Variational Embedding network
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applied Physics (physics.app-ph)
Comments: v1: 20 pages, 10 figures; link to code repository

Abstract:In this contribution, we define (and test) a pipeline to perform virtual painting recolouring using raw data of X-Ray Fluorescence (XRF) analysis on pictorial artworks. To circumvent the small dataset size, we generate a synthetic dataset, starting from a database of XRF spectra; furthermore, to ensure a better generalisation capacity (and to tackle the issue of in-memory size and inference time), we define a Deep Variational Embedding network to embed the XRF spectra into a lower dimensional, K-Means friendly, metric space. We thus train a set of models to assign coloured images to embedded XRF images. We report here the devised pipeline performances in terms of visual quality metrics, and we close on a discussion on the results.
ACM classes: I.4.m; J.2

[LG-42] SOLD: Reinforcement Learning with Slot Object-Centric Latent Dynamics

Link: https://arxiv.org/abs/2410.08822
Authors: Malte Mosbach,Jan Niklas Ewertz,Angel Villar-Corrales,Sven Behnke
Keywords: agent understanding, Learning, latent dynamics, latent, dynamics
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Abstract:Learning a latent dynamics model provides a task-agnostic representation of an agent’s understanding of its environment. Leveraging this knowledge for model-based reinforcement learning holds the potential to improve sample efficiency over model-free methods by learning inside imagined rollouts. Furthermore, because the latent space serves as input to behavior models, the informative representations learned by the world model facilitate efficient learning of desired skills. Most existing methods rely on holistic representations of the environment’s state. In contrast, humans reason about objects and their interactions, forecasting how actions will affect specific parts of their surroundings. Inspired by this, we propose Slot-Attention for Object-centric Latent Dynamics (SOLD), a novel algorithm that learns object-centric dynamics models in an unsupervised manner from pixel inputs. We demonstrate that the structured latent space not only improves model interpretability but also provides a valuable input space for behavior models to reason over. Our results show that SOLD outperforms DreamerV3, a state-of-the-art model-based RL algorithm, across a range of benchmark robotic environments that evaluate for both relational reasoning and low-level manipulation capabilities. Videos are available at this https URL.

[LG-43] Uncertainty-Aware Optimal Treatment Selection for Clinical Time Series NEURIPS2024

Link: https://arxiv.org/abs/2410.08816
Authors: Thomas Schwarz,Cecilia Casolo,Niki Kilbertus
Keywords: optimize treatment outcomes, frames is essential, predict and optimize, time frames, ability to predict
Categories: Machine Learning (cs.LG)
Comments: appeared at the workshop on Causal Representation Learning at NeurIPS 2024 (oral)

Abstract:In personalized medicine, the ability to predict and optimize treatment outcomes across various time frames is essential. Additionally, the ability to select cost-effective treatments within specific budget constraints is critical. Despite recent advancements in estimating counterfactual trajectories, a direct link to optimal treatment selection based on these estimates is missing. This paper introduces a novel method integrating counterfactual estimation techniques and uncertainty quantification to recommend personalized treatment plans adhering to predefined cost constraints. Our approach is distinctive in its handling of continuous treatment variables and its incorporation of uncertainty quantification to improve prediction reliability. We validate our method using two simulated datasets, one focused on the cardiovascular system and the other on COVID-19. Our findings indicate that our method has robust performance across different counterfactual estimation baselines, showing that introducing uncertainty quantification in these settings helps the current baselines in finding more reliable and accurate treatment selection. The robustness of our method across various settings highlights its potential for broad applicability in personalized healthcare solutions.

[LG-44] Don’t Transform the Code, Code the Transforms: Towards Precise Code Rewriting using LLMs

Link: https://arxiv.org/abs/2410.08806
Authors: Chris Cummins,Volker Seeker,Jordi Armengol-Estapé,Aram H. Markosyan,Gabriel Synnaeve,Hugh Leather
Keywords: refactoring and optimizing, fast and correct, code, Tools, LLM
Categories: Machine Learning (cs.LG)

Abstract:Tools for rewriting, refactoring and optimizing code should be fast and correct. Large language models (LLMs), by their nature, possess neither of these qualities. Yet, there remains tremendous opportunity in using LLMs to improve code. We explore the use of LLMs not to transform code, but to code transforms. We propose a chain-of-thought approach to synthesizing code transformations from a small number of input/output code examples that incorporates execution and feedback. Unlike the direct rewrite approach, LLM-generated transformations are easy to inspect, debug, and validate. The logic of the rewrite is explicitly coded and easy to adapt. The compute required to run code transformations is minute compared to that of LLM rewriting. We test our approach on 16 Python code transformations and find that LLM-generated transforms are perfectly precise for 7 of them and less imprecise than direct LLM rewriting on the others. We hope to encourage further research into improving the precision of LLM code rewriting.
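
To illustrate what a coded transform looks like, as opposed to asking a model to rewrite each file directly, here is a hand-written example using Python's standard `ast` module. The specific rewrite (`name ** 2` to `name * name`) and the class name are chosen for illustration only and are not one of the paper's 16 transformations.

```python
import ast

class SquareToMultiply(ast.NodeTransformer):
    """Rewrite `name ** 2` as `name * name`, leaving everything else alone."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # transform nested expressions first
        is_square = (
            isinstance(node.op, ast.Pow)
            and isinstance(node.right, ast.Constant)
            and type(node.right.value) is int
            and node.right.value == 2
            # Only plain names: duplicating a call could repeat side effects.
            and isinstance(node.left, ast.Name)
        )
        if not is_square:
            return node
        duplicate = ast.Name(id=node.left.id, ctx=ast.Load())
        return ast.BinOp(left=node.left, op=ast.Mult(), right=duplicate)

def apply_transform(source):
    """Parse source, apply the transform, and return the rewritten code."""
    tree = SquareToMultiply().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+

result = apply_transform("y = a ** 2 + f(b) ** 2")
```

The rewrite rule is explicit and testable: `a ** 2` is rewritten, while `f(b) ** 2` is deliberately left untouched because duplicating the call could change behaviour.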

[LG-45] Batched Energy-Entropy acquisition for Bayesian Optimization NEURIPS2024

Link: https://arxiv.org/abs/2410.08804
Authors: Felix Teufel,Carsten Stahlhut,Jesper Ferkinghoff-Borg
Keywords: attractive machine learning, machine learning framework, performing sample-efficient global, sample-efficient global optimization, Bayesian optimization
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 14 pages (+31 appendix), 21 figures. Accepted at NeurIPS 2024

Abstract:Bayesian optimization (BO) is an attractive machine learning framework for performing sample-efficient global optimization of black-box functions. The optimization process is guided by an acquisition function that selects points to acquire in each round of BO. In batched BO, when multiple points are acquired in parallel, commonly used acquisition functions are often high-dimensional and intractable, leading to the use of sampling-based alternatives. We propose a statistical physics inspired acquisition function for BO with Gaussian processes that can natively handle batches. Batched Energy-Entropy acquisition for BO (BEEBO) enables tight control of the explore-exploit trade-off of the optimization process and generalizes to heteroskedastic black-box problems. We demonstrate the applicability of BEEBO on a range of problems, showing competitive performance to existing methods.

[LG-46] M3-Impute: Mask-guided Representation Learning for Missing Value Imputation

Link: https://arxiv.org/abs/2410.08794
Authors: Zhongyi Yu,Zhenghao Wu,Shuhan Zhong,Weifeng Su,S.-H. Gary Chan,Chul-Ho Lee,Weipeng Zhuo
Keywords: poses significant challenges, poses significant, significant challenges, analysis and machine, common problem
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Missing values are a common problem that poses significant challenges to data analysis and machine learning. This problem necessitates the development of an effective imputation method to fill in the missing values accurately, thereby enhancing the overall quality and utility of the datasets. Existing imputation methods, however, fall short of explicitly considering the ‘missingness’ information in the data during the embedding initialization stage and modeling the entangled feature and sample correlations during the learning process, thus leading to inferior performance. We propose M³-Impute, which aims to explicitly leverage the missingness information and such correlations with novel masking schemes. M³-Impute first models the data as a bipartite graph and uses a graph neural network to learn node embeddings, where the refined embedding initialization process directly incorporates the missingness information. They are then optimized through M³-Impute’s novel feature correlation unit (FRU) and sample correlation unit (SRU) that effectively capture feature and sample correlations for imputation. Experiment results on 25 benchmark datasets under three different missingness settings show the effectiveness of M³-Impute by achieving 20 best and 4 second-best MAE scores on average.
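
The bipartite-graph view of a table with missing entries can be sketched as below: samples and features become two node sets, each observed cell becomes an edge carrying its value, and a binary mask records which cells are present. The node naming and data layout are hypothetical, and the sketch covers only this data representation, not the model itself.

```python
def to_bipartite(rows):
    """Convert a table (None marks a missing cell) into edges and a mask."""
    edges = []  # (sample node, feature node, observed value)
    mask = []   # 1 where a value is observed, 0 where it is missing
    for i, row in enumerate(rows):
        mask.append([0 if value is None else 1 for value in row])
        for j, value in enumerate(row):
            if value is not None:
                edges.append((f"s{i}", f"f{j}", value))
    return edges, mask

table = [
    [1.0, None],
    [None, 2.0],
    [3.0, 4.0],
]
edges, mask = to_bipartite(table)
```

Imputation then amounts to predicting values for the sample-feature pairs that have no edge, and the mask is the "missingness" signal that could be fed into the embedding initialization.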

[LG-47] VLM See Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model

Link: https://arxiv.org/abs/2410.08792
Authors: Beichen Wang,Juexiao Zhang,Shuwen Dong,Irving Fang,Chen Feng
Keywords: Vision Language Models, Vision Language, Language Models, common sense reasoning, recently been adopted
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Vision Language Models (VLMs) have recently been adopted in robotics for their capability in common sense reasoning and generalizability. Existing work has applied VLMs to generate task and motion planning from natural language instructions and simulate training data for robot learning. In this work, we explore using VLM to interpret human demonstration videos and generate robot task planning. Our method integrates keyframe selection, visual perception, and VLM reasoning into a pipeline. We named it SeeDo because it enables the VLM to “see” human demonstrations and explain the corresponding plans to the robot for it to “do”. To validate our approach, we collected a set of long-horizon human videos demonstrating pick-and-place tasks in three diverse categories and designed a set of metrics to comprehensively benchmark SeeDo against several baselines, including state-of-the-art video-input VLMs. The experiments demonstrate SeeDo’s superior performance. We further deployed the generated task plans in both a simulation environment and on a real robot arm.

[LG-48] Superpipeline: A Universal Approach for Reducing GPU Memory Usage in Large Models

Link: https://arxiv.org/abs/2410.08791
Authors: Reza Abbasi,Sernam Lim
Keywords: GPU memory, models, Superpipeline, computer vision, GPU
Categories: Machine Learning (cs.LG)

Abstract:The rapid growth in machine learning models, especially in natural language processing and computer vision, has led to challenges when running these models on hardware with limited resources. This paper introduces Superpipeline, a new framework designed to optimize the execution of large AI models on constrained hardware during both training and inference. Our approach involves dynamically managing model execution by dividing models into individual layers and efficiently transferring these layers between GPU and CPU memory. Superpipeline reduces GPU memory usage by up to 60% in our experiments while maintaining model accuracy and acceptable processing speeds. This allows models that would otherwise exceed available GPU memory to run effectively. Unlike existing solutions that focus mainly on inference or specific model types, Superpipeline can be applied to large language models (LLMs), vision-language models (VLMs), and vision-based models. We tested Superpipeline’s performance across various models and hardware setups. The method includes two key parameters that allow fine-tuning the balance between GPU memory use and processing speed. Importantly, Superpipeline does not require retraining or changing model parameters, ensuring that the original model’s output remains unchanged. Superpipeline’s simplicity and flexibility make it useful for researchers and professionals working with advanced AI models on limited hardware. It enables the use of larger models or bigger batch sizes on existing hardware, potentially speeding up innovation across many machine learning applications. This work marks an important step toward making advanced AI models more accessible and optimizing their deployment in resource-limited environments. The code for Superpipeline is available at this https URL.
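
The layer-by-layer execution idea can be illustrated with a toy simulation in pure Python. No real GPU transfers happen here: "resident" layers are simply tracked in a list, the eviction policy (drop the oldest layer) and the `max_resident` knob are assumptions of this sketch, and none of it reflects Superpipeline's actual API or parameters.

```python
def run_with_offloading(layers, x, max_resident=2):
    """Run layers in order while keeping at most `max_resident` loaded."""
    resident = []       # indices of layers currently "on the GPU"
    peak_resident = 0
    for index, layer in enumerate(layers):
        if len(resident) == max_resident:
            resident.pop(0)          # move the oldest layer back to "CPU"
        resident.append(index)       # load the layer that is about to run
        peak_resident = max(peak_resident, len(resident))
        x = layer(x)                 # the computation itself is unchanged
    return x, peak_resident

# Four stand-in "layers"; a real model would use network modules instead.
layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3, lambda v: v * v]
output, peak = run_with_offloading(layers, 1, max_resident=2)
```

The output matches what running all four layers in sequence would give, while the peak number of simultaneously loaded layers stays at the chosen limit; the price in a real system is the transfer time for each load and eviction.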

[LG-49] Efficient Differentiable Discovery of Causal Order

Link: https://arxiv.org/abs/2410.08787
Authors: Mathieu Chevalley,Arash Mehrjou,Patrick Schwab
Keywords: Directed Acyclic Graph, Acyclic Graph, Directed Acyclic, Chevalley, score-based method
Categories: Machine Learning (cs.LG)

Abstract:In the algorithm Intersort, Chevalley et al. (2024) proposed a score-based method to discover the causal order of variables in a Directed Acyclic Graph (DAG) model, leveraging interventional data to outperform existing methods. However, as a score-based method over the permutahedron, Intersort is computationally expensive and non-differentiable, limiting its ability to be utilised in problems involving large-scale datasets, such as those in genomics and climate models, or to be integrated into end-to-end gradient-based learning frameworks. We address this limitation by reformulating Intersort using differentiable sorting and ranking techniques. Our approach enables scalable and differentiable optimization of causal orderings, allowing the continuous score function to be incorporated as a regularizer in downstream tasks. Empirical results demonstrate that causal discovery algorithms benefit significantly from regularizing on the causal order, underscoring the effectiveness of our method. Our work opens the door to efficiently incorporating regularization for causal order into the training of differentiable models and thereby addresses a long-standing limitation of purely associational supervised learning.
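
For intuition about differentiable ranking, the sketch below computes a soft rank from pairwise sigmoids: each item's rank is the smoothed count of items with a lower score. This pairwise relaxation is one common choice and is an assumption here; the paper may rely on a different differentiable sorting operator.

```python
import math

def soft_rank(scores, temperature=0.1):
    """Smooth approximation of 0-based ranks (0 = lowest score)."""
    def sigmoid(z):
        # tanh form avoids overflow for large |z|.
        return 0.5 * (1.0 + math.tanh(0.5 * z))

    ranks = []
    for i, score_i in enumerate(scores):
        # Smoothly count how many other items score below item i.
        rank = sum(
            sigmoid((score_i - score_j) / temperature)
            for j, score_j in enumerate(scores)
            if j != i
        )
        ranks.append(rank)
    return ranks

scores = [0.3, 2.0, 1.0]
sharp = soft_rank(scores, temperature=0.01)     # close to the hard ranks
smooth = soft_rank(scores, temperature=1000.0)  # close to the average rank
```

A low temperature recovers the hard ordering, while a higher one trades exactness for smoother gradients, which is what lets an ordering score act as a regularizer in gradient-based training.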

[LG-50] Integrating Expert Judgment and Algorithmic Decision Making: An Indistinguishability Framework

Link: https://arxiv.org/abs/2410.08783
Authors: Rohan Alur,Loren Laine,Darrick K. Li,Dennis Shung,Manish Raghavan,Devavrat Shah
Keywords: human-AI collaboration, decision tasks, collaboration in prediction, tasks, Abstract
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (stat.ML)
Comments: arXiv admin note: substantial text overlap with arXiv:2402.00793

Abstract:We introduce a novel framework for human-AI collaboration in prediction and decision tasks. Our approach leverages human judgment to distinguish inputs which are algorithmically indistinguishable, or “look the same” to any feasible predictive algorithm. We argue that this framing clarifies the problem of human-AI collaboration in prediction and decision tasks, as experts often form judgments by drawing on information which is not encoded in an algorithm’s training data. Algorithmic indistinguishability yields a natural test for assessing whether experts incorporate this kind of “side information”, and further provides a simple but principled method for selectively incorporating human feedback into algorithmic predictions. We show that this method provably improves the performance of any feasible algorithmic predictor and precisely quantify this improvement. We demonstrate the utility of our framework in a case study of emergency room triage decisions, where we find that although algorithmic risk scores are highly competitive with physicians, there is strong evidence that physician judgments provide signal which could not be replicated by any predictive algorithm. This insight yields a range of natural decision rules which leverage the complementary strengths of human experts and predictive algorithms.

[LG-51] Causal machine learning for predicting treatment outcomes

Link: https://arxiv.org/abs/2410.08770
Authors: Stefan Feuerriegel,Dennis Frauen,Valentyn Melnychuk,Jonas Schweisthal,Konstantin Hess,Alicia Curth,Stefan Bauer,Niki Kilbertus,Isaac S. Kohane,Mihaela van der Schaar
Keywords: Causal machine learning, outcomes including efficacy, predicting treatment outcomes, treatment outcomes including, offers flexible
Categories: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
Comments: Accepted version; not Version of Record

Abstract:Causal machine learning (ML) offers flexible, data-driven methods for predicting treatment outcomes including efficacy and toxicity, thereby supporting the assessment and safety of drugs. A key benefit of causal ML is that it allows for estimating individualized treatment effects, so that clinical decision-making can be personalized to individual patient profiles. Causal ML can be used in combination with both clinical trial data and real-world data, such as clinical registries and electronic health records, but caution is needed to avoid biased or incorrect predictions. In this Perspective, we discuss the benefits of causal ML (relative to traditional statistical or ML approaches) and outline the key components and steps. Finally, we provide recommendations for the reliable use of causal ML and effective translation into the clinic.

[LG-52] Unlocking FedNL: Self-Contained Compute-Optimized Implementation

Link: https://arxiv.org/abs/2410.08760
Authors: Konstantin Burlachenko,Peter Richtárik
Keywords: train Machine Learning, collaboratively train Machine, Machine Learning, Federated Newton Learn, enables intelligent agents
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Mathematical Software (cs.MS); Performance (cs.PF); Optimization and Control (math.OC)
Comments: 55 pages, 12 figures, 12 tables

Abstract:Federated Learning (FL) is an emerging paradigm that enables intelligent agents to collaboratively train Machine Learning (ML) models in a distributed manner, eliminating the need for sharing their local data. The recent work (arXiv:2106.02969) introduces a family of Federated Newton Learn (FedNL) algorithms, marking a significant step towards applying second-order methods to FL and large-scale optimization. However, the reference FedNL prototype exhibits three serious practical drawbacks: (i) It requires 4.8 hours to launch a single experiment on a server-grade workstation; (ii) The prototype only simulates the multi-node setting; (iii) Prototype integration into resource-constrained applications is challenging. To bridge the gap between theory and practice, we present a self-contained implementation of FedNL, FedNL-LS, and FedNL-PP for single-node and multi-node settings. Our work resolves the aforementioned issues and reduces the wall clock time by a factor of 1000. With this, FedNL outperforms alternatives for training logistic regression: CVXPY (arXiv:1603.00943) in the single-node setting, and Apache Spark (arXiv:1505.06807) and Ray/Scikit-Learn (arXiv:1712.05889) in the multi-node setting. Finally, we propose two practice-oriented compressors for FedNL, adaptive TopLEK and cache-aware RandSeqK, which fulfill the theory of FedNL.
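
As background for the compressors mentioned at the end of the abstract, here is a sketch of the plain Top-K compressor that such variants build on: keep the k entries of largest magnitude and zero out the rest. This is the textbook operator, not the paper's adaptive TopLEK or RandSeqK, and the flat list stands in for whatever update is being communicated.

```python
def top_k(vector, k):
    """Keep the k largest-magnitude entries of `vector`, zero the others."""
    by_magnitude = sorted(range(len(vector)),
                          key=lambda i: abs(vector[i]), reverse=True)
    keep = set(by_magnitude[:k])
    return [value if i in keep else 0.0 for i, value in enumerate(vector)]

update = [0.1, -3.0, 2.0, 0.5]
compressed = top_k(update, k=2)
```

Only the surviving values and their indices need to be sent, which is where the communication savings come from.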

[LG-53] Enhancing GNNs with Architecture-Agnostic Graph Transformations: A Systematic Analysis

Link: https://arxiv.org/abs/2410.08759
Authors: Zhifei Li,Gerrit Großmann,Verena Wolf
Keywords: graph neural network, recent years, neural network, wide variety, GNN
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In recent years, a wide variety of graph neural network (GNN) architectures have emerged, each with its own strengths, weaknesses, and complexities. Various techniques, including rewiring, lifting, and node annotation with centrality values, have been employed as pre-processing steps to enhance GNN performance. However, there are no universally accepted best practices, and the impact of architecture and pre-processing on performance often remains opaque. This study systematically explores the impact of various graph transformations as pre-processing steps on the performance of common GNN architectures across standard datasets. The models are evaluated based on their ability to distinguish non-isomorphic graphs, referred to as expressivity. Our findings reveal that certain transformations, particularly those augmenting node features with centrality measures, consistently improve expressivity. However, these gains come with trade-offs, as methods like graph encoding, while enhancing expressivity, introduce numerical inaccuracies in widely-used Python packages. Additionally, we observe that these pre-processing techniques are limited when addressing complex tasks involving 3-WL and 4-WL indistinguishable graphs.

[LG-54] Zero-Shot Offline Imitation Learning via Optimal Transport

Link: https://arxiv.org/abs/2410.08751
Authors: Thomas Rupf, Marco Bagatella, Nico Gürtler, Jonas Frey, Georg Martius
Keywords: reproducing unseen behavior, learning algorithms hold, imitation learning algorithms, test time, algorithms hold
Categories: Machine Learning (cs.LG)

Abstract:Zero-shot imitation learning algorithms hold the promise of reproducing unseen behavior from as little as a single demonstration at test time. Existing practical approaches view the expert demonstration as a sequence of goals, enabling imitation with a high-level goal selector, and a low-level goal-conditioned policy. However, this framework can suffer from myopic behavior: the agent’s immediate actions towards achieving individual goals may undermine long-term objectives. We introduce a novel method that mitigates this issue by directly optimizing the occupancy matching objective that is intrinsic to imitation learning. We propose to lift a goal-conditioned value function to a distance between occupancies, which are in turn approximated via a learned world model. The resulting method can learn from offline, suboptimal data, and is capable of non-myopic, zero-shot imitation, as we demonstrate in complex, continuous benchmarks.

[LG-55] Gradients Stand-in for Defending Deep Leakage in Federated Learning

Link: https://arxiv.org/abs/2410.08734
Authors: H. Yi, H. Ren, C. Hu, Y. Li, J. Deng, X. Xie
Keywords: Federated Learning, localizing sensitive data, shifting the paradigm, reinforce privacy protections, paradigm towards localizing
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Federated Learning (FL) has become a cornerstone of privacy protection, shifting the paradigm towards localizing sensitive data while only sending model gradients to a central server. This strategy is designed to reinforce privacy protections and minimize the vulnerabilities inherent in centralized data storage systems. Despite its innovative approach, recent empirical studies have highlighted potential weaknesses in FL, notably regarding the exchange of gradients. In response, this study introduces a novel, efficacious method aimed at safeguarding against gradient leakage, namely, "AdaDefense". Following the idea that model convergence can be achieved by using different types of optimization methods, we suggest using a local stand-in rather than the actual local gradient for global gradient aggregation on the central server. This proposed approach not only effectively prevents gradient leakage, but also ensures that the overall performance of the model remains largely unaffected. Delving into the theoretical dimensions, we explore how gradients may inadvertently leak private information and present a theoretical framework supporting the efficacy of our proposed method. Extensive empirical tests, supported by popular benchmark experiments, validate that our approach maintains model integrity and is robust against gradient leakage, marking an important step in our pursuit of safe and efficient FL.

[LG-56] Preferential Normalizing Flows

Link: https://arxiv.org/abs/2410.08710
Authors: Petrus Mikkola, Luigi Acerbi, Arto Klami
Keywords: high-dimensional probability distribution, notoriously challenging, reward modeling, noisy judgments, judgments is notoriously
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Notes: 29 pages

Abstract:Eliciting a high-dimensional probability distribution from an expert via noisy judgments is notoriously challenging, yet useful for many applications, such as prior elicitation and reward modeling. We introduce a method for eliciting the expert’s belief density as a normalizing flow based solely on preferential questions such as comparing or ranking alternatives. This allows eliciting in principle arbitrarily flexible densities, but flow estimation is susceptible to the challenge of collapsing or diverging probability mass that makes it difficult in practice. We tackle this problem by introducing a novel functional prior for the flow, motivated by a decision-theoretic argument, and show empirically that the belief density can be inferred as the function-space maximum a posteriori estimate. We demonstrate our method by eliciting multivariate belief densities of simulated experts, including the prior belief of a general-purpose large language model over a real-world dataset.

[LG-57] Distillation of Discrete Diffusion through Dimensional Correlations NEURIPS2024

Link: https://arxiv.org/abs/2410.08709
Authors: Satoshi Hayakawa, Yuhta Takida, Masaaki Imaizumi, Hiromi Wakaki, Yuki Mitsufuji
Keywords: demonstrated exceptional performances, demonstrated exceptional, exceptional performances, fields of generative, discrete diffusion
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Notes: To be presented at Machine Learning and Compression Workshop @ NeurIPS 2024

Abstract:Diffusion models have demonstrated exceptional performances in various fields of generative modeling. While they often outperform competitors including VAEs and GANs in sample quality and diversity, they suffer from slow sampling speed due to their iterative nature. Recently, distillation techniques and consistency models are mitigating this issue in continuous domains, but discrete diffusion models have some specific challenges towards faster generation. Most notably, in the current literature, correlations between different dimensions (pixels, locations) are ignored, both by its modeling and loss functions, due to computational limitations. In this paper, we propose “mixture” models in discrete diffusion that are capable of treating dimensional correlations while remaining scalable, and we provide a set of loss functions for distilling the iterations of existing models. Two primary theoretical insights underpin our approach: first, that dimensionally independent models can well approximate the data distribution if they are allowed to conduct many sampling steps, and second, that our loss functions enable mixture models to distill such many-step conventional models into just a few steps by learning the dimensional correlations. We empirically demonstrate that our proposed method for discrete diffusions works in practice, by distilling a continuous-time discrete diffusion model pretrained on the CIFAR-10 dataset.

[LG-58] Uncertainty Estimation and Out-of-Distribution Detection for LiDAR Scene Semantic Segmentation ECCV

Link: https://arxiv.org/abs/2410.08687
Authors: Hanieh Shojaei, Qianqian Zou, Max Mehltretter
Keywords: environments requires autonomous, requires autonomous vehicles, Safe navigation, LiDAR scene segmentation, Gaussian Mixture Model
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted for publication in the Proceedings of the European Conference on Computer Vision (ECCV) 2024

Abstract:Safe navigation in new environments requires autonomous vehicles and robots to accurately interpret their surroundings, relying on LiDAR scene segmentation, out-of-distribution (OOD) obstacle detection, and uncertainty computation. We propose a method to distinguish in-distribution (ID) from OOD samples and quantify both epistemic and aleatoric uncertainties using the feature space of a single deterministic model. After training a semantic segmentation network, a Gaussian Mixture Model (GMM) is fitted to its feature space. OOD samples are detected by checking if their squared Mahalanobis distances to each Gaussian component conform to a chi-squared distribution, eliminating the need for an additional OOD training set. Given that the estimated mean and covariance matrix of a multivariate Gaussian distribution follow Gaussian and Inverse-Wishart distributions, multiple GMMs are generated by sampling from these distributions to assess epistemic uncertainty through classification variability. Aleatoric uncertainty is derived from the entropy of responsibility values within Gaussian components. Comparing our method with deep ensembles and logit-sampling for uncertainty computation demonstrates its superior performance in real-world applications for quantifying epistemic and aleatoric uncertainty, as well as detecting OOD samples. While deep ensembles miss some highly uncertain samples, our method successfully detects them and assigns high epistemic uncertainty.
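The Mahalanobis-distance check described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: features are 2-dimensional so that the chi-squared quantile has a closed form, the mixture components are given rather than fitted, and the decision rule (flag a sample as OOD when its distance exceeds the quantile for every component) is one plausible reading of the abstract. All function names are hypothetical.

```python
import math


def squared_mahalanobis_2d(x, mean, cov):
    """Squared Mahalanobis distance of a 2D point to a Gaussian (mean, cov)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # The inverse of [[a, b], [c, d]] is 1/det * [[d, -b], [-c, a]].
    return (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det


def chi2_quantile_2dof(p):
    """Quantile of the chi-squared distribution with 2 degrees of freedom.

    For 2 degrees of freedom the CDF is 1 - exp(-x / 2), so it inverts exactly.
    """
    return -2.0 * math.log(1.0 - p)


def is_ood(x, components, p=0.99):
    """Assumed rule: OOD if the point is implausible under every component."""
    threshold = chi2_quantile_2dof(p)
    return all(squared_mahalanobis_2d(x, m, c) > threshold for m, c in components)


# Two unit-covariance components standing in for a fitted GMM.
identity = ((1.0, 0.0), (0.0, 1.0))
components = [((0.0, 0.0), identity), ((5.0, 5.0), identity)]
near_sample = (0.5, 0.5)
far_sample = (10.0, -10.0)
```

Here `is_ood(near_sample, components)` is false because the point lies close to the first component, while `is_ood(far_sample, components)` is true because it is far from both. In higher dimensions the quantile would come from a statistics library instead of the closed form.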

[LG-59] Efficiently Scanning and Resampling Spatio-Temporal Tasks with Irregular Observations

Link: https://arxiv.org/abs/2410.08681
Authors: Bryce Ferenczi, Michael Burke, Tom Drummond
Keywords: aimed at combining, efficiency of recurrent, recurrent models, parallelism of multi-head, multi-head attention
Categories: Machine Learning (cs.LG)
Notes: 11 pages, 10 figures

Abstract:Various works have aimed at combining the inference efficiency of recurrent models and training parallelism of multi-head attention for sequence modeling. However, most of these works focus on tasks with fixed-dimension observation spaces, such as individual tokens in language modeling or pixels in image completion. To handle an observation space of varying size, we propose a novel algorithm that alternates between cross-attention between a 2D latent state and observation, and a discounted cumulative sum over the sequence dimension to efficiently accumulate historical information. We find this resampling cycle is critical for performance. To evaluate efficient sequence modeling in this domain, we introduce two multi-agent intention tasks: simulated agents chasing bouncing particles and micromanagement analysis in professional StarCraft II games. Our algorithm achieves comparable accuracy with a lower parameter count, faster training and inference compared to existing methods.
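A discounted cumulative sum over the sequence dimension can be written as a simple recurrence. The sketch below assumes the form y_t = discount * y_{t-1} + x_t on scalar inputs; the abstract does not give the exact formulation, and the paper applies the operation to a 2D latent state rather than to scalars.

```python
def discounted_cumsum(values, discount):
    """Accumulate history with exponential decay: y_t = discount * y_{t-1} + x_t."""
    out = []
    running = 0.0
    for value in values:
        running = discount * running + value
        out.append(running)
    return out


history = discounted_cumsum([1.0, 1.0, 1.0], discount=0.5)
```

With a discount of 0.5 and three unit inputs, `history` is `[1.0, 1.5, 1.75]`: older observations contribute less with each step. Because the recurrence is linear, it can also be evaluated with a parallel scan during training while remaining a constant-memory update at inference time.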

[LG-60] DeltaDQ: Ultra-High Delta Compression for Fine-Tuned LLMs via Group-wise Dropout and Separate Quantization

Link: https://arxiv.org/abs/2410.08666
Authors: Yanfeng Jiang, Zelan Yang, Bohua Chen, Shen Li, Yong Li, Tao Li
Keywords: Large language models, Large language, achieve exceptional performance, downstream tasks, supervised fine-tuning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Large language models achieve exceptional performance on various downstream tasks through supervised fine-tuning. However, the diversity of downstream tasks and practical requirements makes deploying multiple full-parameter fine-tuned models challenging. Current methods that compress the delta weight struggle to achieve ultra-high compression, failing to minimize the deployment overhead. To address the above issue, we propose a novel distribution-driven delta compression framework DeltaDQ, which utilizes Group-wise Dropout and Separate Quantization to achieve ultra-high compression for the delta weight. We have observed that the matrix-computed intermediate results for the delta weight exhibit extremely small variance and min-max range characteristics, referred to as Balanced Intermediate Results. Exploiting this phenomenon, we introduce Group-wise Dropout to perform dropout on the delta weight using an optimal group size. Furthermore, using Separate Quantization, sparse weights are quantized and decomposed to achieve a lower bit. Experimental results show that DeltaDQ achieves 16x compression with improved accuracy compared to baselines for WizardMath and WizardCoder models across different parameter scales. Moreover, DeltaDQ demonstrates the ability for ultra-high compression ratio, achieving 128x compression for the WizardMath-7B model and 512x compression for the WizardMath-70B model.

[LG-61] DistDD: Distributed Data Distillation Aggregation through Gradient Matching

Link: https://arxiv.org/abs/2410.08665
Authors: Peiran Wang, Haohan Wang
Keywords: federated learning framework, distilling data directly, federated learning, traditional federated learning, clients’ devices
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In this paper, we introduce DistDD, a novel approach within the federated learning framework that reduces the need for repetitive communication by distilling data directly on clients’ devices. Unlike traditional federated learning that requires iterative model updates across nodes, DistDD facilitates a one-time distillation process that extracts a global distilled dataset, maintaining the privacy standards of federated learning while significantly cutting down communication costs. By leveraging DistDD’s distilled dataset, FL developers can achieve just-in-time parameter tuning and neural architecture search (NAS) over FL without repeating the whole FL process multiple times. We provide a detailed convergence proof of the DistDD algorithm, reinforcing its mathematical stability and reliability for practical applications. Our experiments demonstrate the effectiveness and robustness of DistDD, particularly in non-i.i.d. and mislabeled data scenarios, showcasing its potential to handle complex real-world data challenges distinctively from conventional federated learning methods. We also evaluate DistDD in the NAS use case and demonstrate its effectiveness and communication savings.

[LG-62] QEFT: Quantization for Efficient Fine-Tuning of LLMs EMNLP2024

Link: https://arxiv.org/abs/2410.08661
Authors: Changhun Lee, Jun-gyu Jin, Younghyun Cho, Eunhyeok Park
Keywords: large language models, keeping inference efficient, highly important, rapid growth, large language
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: Accepted at Findings of EMNLP 2024

Abstract:With the rapid growth in the use of fine-tuning for large language models (LLMs), optimizing fine-tuning while keeping inference efficient has become highly important. However, this is a challenging task as it requires improvements in all aspects, including inference speed, fine-tuning speed, memory consumption, and, most importantly, model quality. Previous studies have attempted to achieve this by combining quantization with fine-tuning, but they have failed to enhance all four aspects simultaneously. In this study, we propose a new lightweight technique called Quantization for Efficient Fine-Tuning (QEFT). QEFT accelerates both inference and fine-tuning, is supported by robust theoretical foundations, offers high flexibility, and maintains good hardware compatibility. Our extensive experiments demonstrate that QEFT matches the quality and versatility of full-precision parameter-efficient fine-tuning, while using fewer resources. Our code is available at this https URL.

[LG-63] Carefully Structured Compression: Efficiently Managing StarCraft II Data

Link: https://arxiv.org/abs/2410.08659
Authors: Bryce Ferenczi, Rhys Newbury, Michael Burke, Tom Drummond
Keywords: simple image label, image label pairs, overlooked input costs, plain text, Creation and storage
Categories: Machine Learning (cs.LG)
Notes: 14 pages, 7 figures

Abstract:Creation and storage of datasets are often overlooked input costs in machine learning, as many datasets are simple image label pairs or plain text. However, datasets with more complex structures, such as those from the real time strategy game StarCraft II, require more deliberate thought and strategy to reduce cost of ownership. We introduce a serialization framework for StarCraft II that reduces the cost of dataset creation and storage, as well as improving usage ergonomics. We benchmark against the most comparable existing dataset from AlphaStar-Unplugged and highlight the benefit of our framework in terms of both the cost of creation and storage. We use our dataset to train deep learning models that exceed the performance of comparable models trained on other datasets. The dataset conversion and usage framework introduced is open source and can be used as a framework for datasets with similar characteristics such as digital twin simulations. Pre-converted StarCraft II tournament data is also available online.

[LG-64] Finite Sample Complexity Analysis of Binary Segmentation

Link: https://arxiv.org/abs/2410.08654
Authors: Toby Dylan Hocking
Keywords: Binary segmentation, classic greedy algorithm, sequential data set, likelihood function, classic greedy
Categories: Machine Learning (cs.LG); Computation (stat.CO)

Abstract:Binary segmentation is the classic greedy algorithm which recursively splits a sequential data set by optimizing some loss or likelihood function. Binary segmentation is widely used for changepoint detection in data sets measured over space or time, and as a sub-routine for decision tree learning. In theory it should be extremely fast for N data and K splits, O(N K) in the worst case, and O(N log K) in the best case. In this paper we describe new methods for analyzing the time and space complexity of binary segmentation for a given finite N, K, and minimum segment length parameter. First, we describe algorithms that can be used to compute the best and worst case number of splits the algorithm must consider. Second, we describe synthetic data that achieve the best and worst case and which can be used to test for correct implementation of the algorithm. Finally, we provide an empirical analysis of real data which suggests that binary segmentation is often close to optimal speed in practice.
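A minimal sketch of binary segmentation follows, assuming the square loss (change in mean) as the loss function. For brevity it re-evaluates every current segment on each iteration, so it always does work proportional to N per split; an efficient implementation would cache the best split of each segment. Function names are hypothetical and this is not the author's code.

```python
def segment_cost(prefix, prefix_sq, start, end):
    """Sum of squared deviations from the mean on data[start:end]."""
    n = end - start
    total = prefix[end] - prefix[start]
    total_sq = prefix_sq[end] - prefix_sq[start]
    return total_sq - total * total / n


def best_split(prefix, prefix_sq, start, end, min_len):
    """Return (loss_decrease, split_index), or None if no split is allowed."""
    whole = segment_cost(prefix, prefix_sq, start, end)
    best = None
    for t in range(start + min_len, end - min_len + 1):
        left = segment_cost(prefix, prefix_sq, start, t)
        right = segment_cost(prefix, prefix_sq, t, end)
        gain = whole - left - right
        if best is None or gain > best[0]:
            best = (gain, t)
    return best


def binary_segmentation(data, n_splits, min_len=1):
    """Greedily apply the split with the largest loss decrease, n_splits times."""
    prefix, prefix_sq = [0.0], [0.0]
    for x in data:
        prefix.append(prefix[-1] + x)
        prefix_sq.append(prefix_sq[-1] + x * x)
    segments = [(0, len(data))]
    changepoints = []
    for _ in range(n_splits):
        candidates = []
        for start, end in segments:
            split = best_split(prefix, prefix_sq, start, end, min_len)
            if split is not None:
                candidates.append((split[0], split[1], start, end))
        if not candidates:
            break  # every remaining segment is too short to split
        _, t, start, end = max(candidates)
        segments.remove((start, end))
        segments.extend([(start, t), (t, end)])
        changepoints.append(t)
    return sorted(changepoints)


changes = binary_segmentation([0, 0, 0, 5, 5, 5, 9, 9, 9], n_splits=2)
```

On this toy sequence with three constant levels, the two splits land at indices 3 and 6, the boundaries between the levels.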

[LG-65] Edge AI Collaborative Learning: Bayesian Approaches to Uncertainty Estimation

Link: https://arxiv.org/abs/2410.08651
Authors: Gleb Radchenko, Victoria Andrea Fill
Keywords: Internet of Things, capabilities of Internet, Recent advancements, significantly enhanced, edge computing
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)

Abstract:Recent advancements in edge computing have significantly enhanced the AI capabilities of Internet of Things (IoT) devices. However, these advancements introduce new challenges in knowledge exchange and resource management, particularly addressing the spatiotemporal data locality in edge computing environments. This study examines algorithms and methods for deploying distributed machine learning within autonomous, network-capable, AI-enabled edge devices. We focus on determining confidence levels in learning outcomes considering the spatial variability of data encountered by independent agents. Using collaborative mapping as a case study, we explore the application of the Distributed Neural Network Optimization (DiNNO) algorithm extended with Bayesian neural networks (BNNs) for uncertainty estimation. We implement a 3D environment simulation using the Webots platform to simulate collaborative mapping tasks, decouple the DiNNO algorithm into independent processes for asynchronous network communication in distributed learning, and integrate distributed uncertainty estimation using BNNs. Our experiments demonstrate that BNNs can effectively support uncertainty estimation in a distributed learning context, with precise tuning of learning hyperparameters crucial for effective uncertainty assessment. Notably, applying Kullback-Leibler divergence for parameter regularization resulted in a 12-30% reduction in validation loss during distributed BNN training compared to other regularization strategies.

[LG-66] Multi-Source Temporal Attention Network for Precipitation Nowcasting

Link: https://arxiv.org/abs/2410.08641
Authors: Rafael Pablos Sarabia, Joachim Nyborg, Morten Birk, Jeppe Liborius Sjørup, Anders Lillevang Vesterholt, Ira Assent
Keywords: Precipitation nowcasting, climate change, industries and plays, plays a significant, significant role
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Precipitation nowcasting is crucial across various industries and plays a significant role in mitigating and adapting to climate change. We introduce an efficient deep learning model for precipitation nowcasting, capable of predicting rainfall up to 8 hours in advance with greater accuracy than existing operational physics-based and extrapolation-based models. Our model leverages multi-source meteorological data and physics-based forecasts to deliver high-resolution predictions in both time and space. It captures complex spatio-temporal dynamics through temporal attention networks and is optimized using data quality maps and dynamic thresholds. Experiments demonstrate that our model outperforms state-of-the-art, and highlight its potential for fast reliable responses to evolving weather conditions.

[LG-67] Efficient line search for optimizing Area Under the ROC Curve in gradient descent

Link: https://arxiv.org/abs/2410.08635
Authors: Jadon Fowler, Toby Dylan Hocking
Keywords: Receiver Operating Characteristic, Receiver Operating, Operating Characteristic, Area Under Min, Recently the Area
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:Receiver Operating Characteristic (ROC) curves are useful for evaluation in binary classification and changepoint detection, but difficult to use for learning since the Area Under the Curve (AUC) is piecewise constant (gradient zero almost everywhere). Recently the Area Under Min (AUM) of false positive and false negative rates has been proposed as a differentiable surrogate for AUC. In this paper we study the piecewise linear/constant nature of the AUM/AUC, and propose new efficient path-following algorithms for choosing the learning rate which is optimal for each step of gradient descent (line search), when optimizing a linear model. Remarkably, our proposed line search algorithm has the same log-linear asymptotic time complexity as gradient descent with constant step size, but it computes a complete representation of the AUM/AUC as a function of step size. In our empirical study of binary classification problems, we verify that our proposed algorithm is fast and exact; in changepoint detection problems we show that the proposed algorithm is just as accurate as grid search, but faster.
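The claim that AUC is piecewise constant can be checked with a short sketch. It uses the standard pairwise-ranking formulation of AUC (the fraction of positive/negative pairs ordered correctly, ties counted as one half); the toy scores and labels are made up for illustration, and the sketch does not implement the AUM surrogate or the paper's line search.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    total = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
base = auc([0.10, 0.40, 0.35, 0.80], labels)
nudged = auc([0.12, 0.41, 0.36, 0.79], labels)   # same ordering of scores
crossed = auc([0.10, 0.40, 0.45, 0.80], labels)  # one pair changes order
```

`base` and `nudged` are both 0.75 because small score changes leave the ranking intact, so the gradient of AUC with respect to the scores is zero there. `crossed` jumps to 1.0 once a positive score passes a negative one, which is why a surrogate with useful slope information is needed for gradient-based learning.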

[LG-68] GAI-Enabled Explainable Personalized Federated Semi-Supervised Learning

Link: https://arxiv.org/abs/2410.08634
Authors: Yubo Peng, Feibo Jiang, Li Dong, Kezhi Wang, Kun Yang
Keywords: commonly distributed algorithm, training artificial intelligence, mobile users, artificial intelligence, real-world scenarios
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)

Abstract:Federated learning (FL) is a common distributed algorithm for mobile users (MUs) training artificial intelligence (AI) models. However, several challenges arise when applying FL to real-world scenarios, such as label scarcity, non-IID data, and unexplainability. As a result, we propose an explainable personalized FL framework, called XPFL. First, we introduce a generative AI (GAI) assisted personalized federated semi-supervised learning, called GFed. Particularly, in local training, we utilize a GAI model to learn from large unlabeled data and apply knowledge distillation-based semi-supervised learning to train the local FL model using the knowledge acquired from the GAI model. In global aggregation, we obtain the new local FL model by fusing the local and global FL models in specific proportions, allowing each local model to incorporate knowledge from others while preserving its personalized characteristics. Second, we propose an explainable AI mechanism for FL, named XFed. Specifically, in local training, we apply a decision tree to match the input and output of the local FL model. In global aggregation, we utilize t-distributed stochastic neighbor embedding (t-SNE) to visualize the local models before and after aggregation. Finally, simulation results validate the effectiveness of the proposed XPFL framework.

[LG-69] Transformers Provably Solve Parity Efficiently with Chain of Thought NEURIPS2024

Link: https://arxiv.org/abs/2410.08633
Authors: Juno Kim, Taiji Suzuki
Keywords: generating intermediate states, solve complex problems, recursively generating intermediate, analogous to fine-tuning, theoretical analysis
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Notes: NeurIPS 2024 M3L Workshop

Abstract:This work provides the first theoretical analysis of training transformers to solve complex problems by recursively generating intermediate states, analogous to fine-tuning for chain-of-thought (CoT) reasoning. We consider training a one-layer transformer to solve the fundamental k-parity problem, extending the work on RNNs by Wies et al. (2023). We establish three key results: (1) any finite-precision gradient-based algorithm, without intermediate supervision, requires substantial iterations to solve parity with finite samples. (2) In contrast, when intermediate parities are incorporated into the loss function, our model can learn parity in one gradient update when aided by teacher forcing, where ground-truth labels of the reasoning chain are provided at each generation step. (3) Even without teacher forcing, where the model must generate CoT chains end-to-end, parity can be learned efficiently if augmented data is employed to internally verify the soundness of intermediate steps. These results rigorously show that task decomposition and stepwise reasoning naturally arise from optimizing transformers with CoT; moreover, self-consistency checking can improve reasoning ability, aligning with empirical studies of CoT.
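To make "intermediate parities" concrete, the sketch below computes the k-parity of a bit string as a chain of running XORs over the k relevant positions. The sequential ordering is an assumption made for illustration; the paper may decompose the task differently (for example hierarchically), and the function name is hypothetical.

```python
def parity_chain(bits, subset):
    """Return the running XOR over the selected positions, one entry per step."""
    chain = []
    running = 0
    for index in subset:
        running ^= bits[index]
        chain.append(running)
    return chain


bits = [1, 0, 1, 1, 0]
subset = [0, 2, 3]          # the k = 3 positions that determine the label
chain = parity_chain(bits, subset)
label = chain[-1]
```

Here `chain` is `[1, 0, 1]` and the final entry is the parity label. Supervising only `label` corresponds to the setting without intermediate supervision, while supervising every entry of `chain` corresponds to training with intermediate parities in the loss.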

[LG-70] Words as Beacons: Guiding RL Agents with High-Level Language Prompts

Link: https://arxiv.org/abs/2410.08632
Authors: Unai Ruiz-Gonzalez, Alain Andres, Pedro G. Bascoy, Javier Del Ser
Keywords: pose significant challenges, Large Language Models, incomplete learning processes, leverages Large Language, pose significant
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Sparse reward environments in reinforcement learning (RL) pose significant challenges for exploration, often leading to inefficient or incomplete learning processes. To tackle this issue, this work proposes a teacher-student RL framework that leverages Large Language Models (LLMs) as “teachers” to guide the agent’s learning process by decomposing complex tasks into subgoals. Due to their inherent capability to understand RL environments based on a textual description of structure and purpose, LLMs can provide subgoals to accomplish the task defined for the environment in a similar fashion to how a human would do. In doing so, three types of subgoals are proposed: positional targets relative to the agent, object representations, and language-based instructions generated directly by the LLM. More importantly, we show that it is possible to query the LLM only during the training phase, enabling agents to operate within the environment without any LLM intervention. We assess the performance of this proposed framework by evaluating three state-of-the-art open-source LLMs (Llama, DeepSeek, Qwen) eliciting subgoals across various procedurally generated environments of the MiniGrid benchmark. Experimental results demonstrate that this curriculum-based approach accelerates learning and enhances exploration in complex tasks, achieving 30 to 200 times faster convergence in training steps compared to recent baselines designed for sparse reward environments.

[LG-71] Towards Cross-domain Few-shot Graph Anomaly Detection ICDM2024

Link: https://arxiv.org/abs/2410.08629
Authors: Jiazhen Chen, Sichao Fu, Zhibin Zhang, Zheng Ma, Mingbin Feng, Tony S. Wirjanto, Qinmu Peng
Keywords: unlabeled test nodes, garnered increasing attention, recently garnered increasing, abundant unlabeled test, labeled training nodes
Categories: Machine Learning (cs.LG)
Notes: Accepted by 24th IEEE International Conference on Data Mining (ICDM 2024)

Abstract:Few-shot graph anomaly detection (GAD) has recently garnered increasing attention, which aims to discern anomalous patterns among abundant unlabeled test nodes under the guidance of a limited number of labeled training nodes. Existing few-shot GAD approaches typically adopt meta-training methods trained on richly labeled auxiliary networks to facilitate rapid adaptation to target networks that possess sparse labels. However, these proposed methods often assume that the auxiliary and target networks share the same data distribution, an assumption that rarely holds in practical settings. This paper explores a more prevalent and complex scenario of cross-domain few-shot GAD, where the goal is to identify anomalies within sparsely labeled target graphs using auxiliary graphs from a related, yet distinct domain. The challenge here is nontrivial owing to inherent data distribution discrepancies between the source and target domains, compounded by the uncertainties of sparse labeling in the target domain. In this paper, we propose a simple and effective framework, termed CDFS-GAD, specifically designed to tackle the aforementioned challenges. CDFS-GAD first introduces a domain-adaptive graph contrastive learning module, which is aimed at enhancing cross-domain feature alignment. Then, a prompt tuning module is further designed to extract domain-specific features tailored to each domain. Moreover, a domain-adaptive hypersphere classification loss is proposed to enhance the discrimination between normal and anomalous instances under minimal supervision, utilizing domain-sensitive norms. Lastly, a self-training strategy is introduced to further refine the predicted scores, enhancing its reliability in few-shot settings. Extensive experiments on twelve real-world cross-domain data pairs demonstrate the effectiveness of the proposed CDFS-GAD framework in comparison to various existing GAD methods.

[LG-72] Synth-SONAR: Sonar Image Synthesis with Enhanced Diversity and Realism via Dual Diffusion Models and GPT Prompting

Link: https://arxiv.org/abs/2410.08612
Authors: Purushothaman Natarajan, Kamal Basha, Athira Nambiar
Keywords: marine biology, Sonar, Sonar image synthesis, underwater exploration, crucial for advancing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 12 pages, 5 tables and 9 figures

Abstract:Sonar image synthesis is crucial for advancing applications in underwater exploration, marine biology, and defence. Traditional methods often rely on extensive and costly data collection using sonar sensors, jeopardizing data quality and diversity. To overcome these limitations, this study proposes a new sonar image synthesis framework, Synth-SONAR, leveraging diffusion models and GPT prompting. The key novelties of Synth-SONAR are threefold: First, Generative AI-based style injection techniques are integrated with publicly available real/simulated data, producing one of the largest sonar data corpora for sonar research. Second, a dual text-conditioning sonar diffusion model hierarchy synthesizes coarse and fine-grained sonar images with enhanced quality and diversity. Third, high-level (coarse) and low-level (detailed) text-based sonar generation methods leverage advanced semantic information available in visual language models (VLMs) and GPT-prompting. During inference, the method generates diverse and realistic sonar images from textual prompts, bridging the gap between textual descriptions and sonar image generation. This marks the application of GPT-prompting in sonar imagery for the first time, to the best of our knowledge. Synth-SONAR achieves state-of-the-art results in producing high-quality synthetic sonar datasets, significantly enhancing their diversity and realism.

[LG-73] Text-To-Image with Generative Adversarial Networks

Link: https://arxiv.org/abs/2410.08608
Authors: Mehrshad Momen-Tayefeh
Keywords: Generating realistic images, Generating realistic, Generative Adversarial Networks, computer vision, field of computer
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Generating realistic images from human texts is one of the most challenging problems in the field of computer vision (CV). The meaning of descriptions given can be roughly reflected by existing text-to-image approaches. In this paper, our main purpose is to present a brief comparison between five different methods based on Generative Adversarial Networks (GANs) for generating images from text. In addition, each model architecture synthesizes images at a different resolution. Furthermore, the best and worst obtained resolutions are 64×64 and 256×256, respectively. However, we checked and compared some metrics that indicate the accuracy of each model. Also, by doing this study, we found the best model for this problem by comparing the essential metrics of these different approaches.

[LG-74] MergePrint: Robust Fingerprinting against Merging Large Language Models

Link: https://arxiv.org/abs/2410.08604
Authors: Shojiro Yamabe,Tsubasa Takahashi,Futa Waseda,Koki Wataoka
Keywords: training large language, large language models, protecting their intellectual, increasingly critical, cost of training
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Under review

Abstract:As the cost of training large language models (LLMs) rises, protecting their intellectual property has become increasingly critical. Model merging, which integrates multiple expert models into a single model capable of performing multiple tasks, presents a growing risk of unauthorized and malicious usage. While fingerprinting techniques have been studied for asserting model ownership, existing methods have primarily focused on fine-tuning, leaving model merging underexplored. To address this gap, we propose a novel fingerprinting method MergePrint that embeds robust fingerprints designed to preserve ownership claims even after model merging. By optimizing against a pseudo-merged model, which simulates post-merged model weights, MergePrint generates fingerprints that remain detectable after merging. Additionally, we optimize the fingerprint inputs to minimize performance degradation, enabling verification through specific outputs from targeted inputs. This approach provides a practical fingerprinting strategy for asserting ownership in cases of misappropriation through model merging.

[LG-75] VIBES – Vision Backbone Efficient Selection WACV2025

Link: https://arxiv.org/abs/2410.08592
Authors: Joris Guerin,Shray Bansal,Amirreza Shaban,Paulo Mann,Harshvardhan Gazula
Keywords: specific target tasks, efficiently selecting high-performance, selecting high-performance pre-trained, high-performance pre-trained vision, target tasks
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 4 figures, under review at WACV 2025

Abstract:This work tackles the challenge of efficiently selecting high-performance pre-trained vision backbones for specific target tasks. Although exhaustive search within a finite set of backbones can solve this problem, it becomes impractical for large datasets and backbone pools. To address this, we introduce Vision Backbone Efficient Selection (VIBES), which aims to quickly find well-suited backbones, potentially trading off optimality for efficiency. We propose several simple yet effective heuristics to address VIBES and evaluate them across four diverse computer vision datasets. Our results show that these approaches can identify backbones that outperform those selected from generic benchmarks, even within a limited search budget of one hour on a single GPU. We reckon VIBES marks a paradigm shift from benchmarks to task-specific optimization.

[LG-76] Retraining-Free Merging of Sparse Mixture-of-Experts via Hierarchical Clustering

Link: https://arxiv.org/abs/2410.08589
Authors: I-Chun Chen,Hsu-Shen Liu,Wei-Fang Sun,Chen-Hao Chao,Yen-Chang Hsu,Chun-Yi Lee
Keywords: represent a significant, significant breakthrough, breakthrough in large, Sparse, language model development
Subjects: Machine Learning (cs.LG)
Comments: Code: this https URL

Abstract:Sparse Mixture-of-Experts (SMoE) models represent a significant breakthrough in large language model development. These models enable performance improvements without a proportional increase in inference costs. By selectively activating a small set of parameters during task execution, SMoEs enhance model capacity. However, their deployment remains challenging due to the substantial memory footprint required to accommodate the growing number of experts. This constraint renders them less feasible in environments with limited hardware resources. To address this challenge, we propose Hierarchical Clustering for Sparsely activated Mixture of Experts (HC-SMoE), a task-agnostic expert merging framework that reduces SMoE model parameters without retraining. Unlike previous methods, HC-SMoE employs hierarchical clustering based on expert outputs. This approach ensures that the merging process remains unaffected by routing decisions. The output-based clustering strategy captures functional similarities between experts, offering an adaptable solution for models with numerous experts. We validate our approach through extensive experiments on eight zero-shot language tasks and demonstrate its effectiveness in large-scale SMoE models such as Qwen and Mixtral. Our comprehensive results demonstrate that HC-SMoE consistently achieves strong performance, which highlights its potential for real-world deployment.
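The output-based clustering idea described in this abstract can be illustrated with a small sketch. This is not the HC-SMoE implementation: it assumes each expert is summarized by a hypothetical mean-output vector, uses average-linkage agglomerative clustering with Euclidean distance, and merges each group by simple parameter averaging. All function names are illustrative, and router handling is left out.

```python
def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cluster_experts(expert_outputs, num_groups):
    """Average-linkage agglomerative clustering on per-expert output vectors.

    expert_outputs[i] is an assumed summary (e.g. mean output on calibration
    data) of expert i. Returns a list of clusters, each a list of expert ids.
    """
    clusters = [[i] for i in range(len(expert_outputs))]

    def linkage(c1, c2):
        pairs = [(i, j) for i in c1 for j in c2]
        total = sum(euclidean(expert_outputs[i], expert_outputs[j]) for i, j in pairs)
        return total / len(pairs)

    while len(clusters) > num_groups:
        # Find the two closest clusters and merge them.
        a, b = min(
            ((p, q) for p in range(len(clusters)) for q in range(p + 1, len(clusters))),
            key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def merge_experts(expert_weights, clusters):
    """Merge each cluster into one expert by averaging (an assumed merge rule)."""
    return [mean_vector([expert_weights[i] for i in c]) for c in clusters]
```

Because the clustering only looks at expert outputs, the grouping in this sketch does not depend on how often the router selects each expert, which is the property the abstract highlights.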

[LG-77] Logarithmic Regret for Unconstrained Submodular Maximization Stochastic Bandit

Link: https://arxiv.org/abs/2410.08578
Authors: Julien Zhou(Thoth, STATIFY),Pierre Gaillard(Thoth),Thibaud Rahier,Julyan Arbel(STATIFY)
Keywords: stochastic bandit feedback, submodular maximization problem, unconstrained submodular maximization, online unconstrained submodular, Online USM
Subjects: Machine Learning (cs.LG); Combinatorics (math.CO); Optimization and Control (math.OC); Machine Learning (stat.ML)

Abstract:We address the online unconstrained submodular maximization problem (Online USM), in a setting with stochastic bandit feedback. In this framework, a decision-maker receives noisy rewards from a nonmonotone submodular function, taking values in a known bounded interval. This paper proposes Double-Greedy - Explore-then-Commit (DG-ETC), adapting the Double-Greedy approach from the offline and online full-information settings. DG-ETC satisfies an O(d log(dT)) problem-dependent upper bound for the 1/2-approximate pseudo-regret, as well as an O(d T^{2/3} log(dT)^{1/3}) problem-free one at the same time, outperforming existing approaches. To that end, we introduce a notion of hardness for submodular functions, characterizing how difficult it is to maximize them with this type of strategy.
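To make the explore-then-commit idea concrete, here is a minimal sketch, not the paper's DG-ETC algorithm. It assumes the deterministic Double-Greedy rule (the paper's 1/2-approximation guarantee points to a randomized variant), a fixed number of noisy samples per query, and a hypothetical cut-function reward on a three-node path graph. The names `dg_etc`, `cut_value` and `make_noisy_oracle` are illustrative.

```python
import random


def dg_etc(elements, noisy_f, samples_per_query):
    """Explore-then-commit sketch of deterministic Double-Greedy.

    For each element, the marginal gains of adding it to X and of removing it
    from Y are estimated by averaging noisy evaluations; the decision is then
    committed and never revisited.
    """
    def estimate(subset):
        s = frozenset(subset)
        return sum(noisy_f(s) for _ in range(samples_per_query)) / samples_per_query

    x, y = set(), set(elements)
    for e in elements:
        gain_add = estimate(x | {e}) - estimate(x)
        gain_remove = estimate(y - {e}) - estimate(y)
        if gain_add >= gain_remove:
            x.add(e)
        else:
            y.discard(e)
    return x  # x and y coincide once every element has been processed


# Hypothetical reward: the cut function of a path graph 0 - 1 - 2,
# which is submodular and nonmonotone.
EDGES = [(0, 1), (1, 2)]


def cut_value(subset):
    """Number of edges with exactly one endpoint in the subset."""
    return sum((u in subset) != (v in subset) for u, v in EDGES)


def make_noisy_oracle(sigma, seed=0):
    """Return a reward oracle with additive Gaussian noise (an assumption)."""
    rng = random.Random(seed)
    return lambda s: cut_value(s) + rng.gauss(0.0, sigma)
```

With a noise-free oracle, `dg_etc([0, 1, 2], make_noisy_oracle(0.0), 1)` selects the two endpoints of the path, which cut both edges. With noise, more samples per query make the committed decisions more reliable at the cost of a longer exploration phase.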

[LG-78] Similar Phrases for Cause of Actions of Civil Cases

Link: https://arxiv.org/abs/2410.08564
Authors: Ho-Chien Huang,Chao-Lin Liu
Keywords: Taiwanese judicial system, Taiwanese judicial, relevant legal judgments, identifying relevant legal, judicial system
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 10 pages, 4 figures, 3 tables (including appendix)

Abstract:In the Taiwanese judicial system, Cause of Actions (COAs) are essential for identifying relevant legal judgments. However, the lack of standardized COA labeling creates challenges in filtering cases using basic methods. This research addresses this issue by leveraging embedding and clustering techniques to analyze the similarity between COAs based on cited legal articles. The study implements various similarity measures, including Dice coefficient and Pearson’s correlation coefficient. An ensemble model combines rankings, and social network analysis identifies clusters of related COAs. This approach enhances legal analysis by revealing inconspicuous connections between COAs, offering potential applications in legal research beyond civil law.
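The two similarity measures named in this abstract are standard and can be sketched in a few lines. The example below assumes each COA is represented either by the set of legal articles it cites (for the Dice coefficient) or by a vector of citation counts (for Pearson's correlation). The article identifiers are hypothetical and not taken from the paper.

```python
from math import sqrt


def dice_coefficient(a, b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) between two sets."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def pearson(x, y):
    """Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    std_x = sqrt(sum((xi - mean_x) ** 2 for xi in x))
    std_y = sqrt(sum((yi - mean_y) ** 2 for yi in y))
    # Return 0.0 when either vector is constant and the correlation is undefined.
    return cov / (std_x * std_y) if std_x and std_y else 0.0


# Hypothetical COAs described by the legal articles they cite.
coa_a = {"art_184", "art_195", "art_213"}
coa_b = {"art_184", "art_195", "art_767"}
similarity = dice_coefficient(coa_a, coa_b)
```

Here the two COAs share two of their three cited articles, so the Dice coefficient is 2·2 / (3 + 3) = 2/3.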

[LG-79] Learning General Representation of 12-Lead Electrocardiogram with a Joint-Embedding Predictive architecture

Link: https://arxiv.org/abs/2410.08559
Authors: Sehun Kim
Keywords: Embedding Predictive Architecture, Joint Embedding Predictive, ECG Joint Embedding, named ECG Joint, Predictive Architecture
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:We propose a self-supervised learning method for 12-lead Electrocardiogram (ECG) analysis, named ECG Joint Embedding Predictive Architecture (ECG-JEPA). ECG-JEPA employs a masking strategy to learn semantic representations of ECG data. Unlike existing methods, ECG-JEPA predicts at the hidden representation level rather than reconstructing raw data. This approach offers several advantages in the ECG domain: (1) it avoids producing unnecessary details, such as noise, which is common in standard ECG; and (2) it addresses the limitations of naïve L2 loss between raw signals. Another key contribution is the introduction of a special masked attention tailored for 12-lead ECG data, Cross-Pattern Attention (CroPA). CroPA enables the model to effectively capture inter-patch relationships. Additionally, ECG-JEPA is highly scalable, allowing efficient training on large datasets. Our code is openly available at this https URL.

[LG-80] MUSO: Achieving Exact Machine Unlearning in Over-Parameterized Regimes

Link: https://arxiv.org/abs/2410.08557
Authors: Ruikai Yang,Mingzhen He,Zhengbao He,Youmei Qiu,Xiaolin Huang
Keywords: well-trained model behave, Machine unlearning, well-trained model, Machine, specific data
Subjects: Machine Learning (cs.LG)

Abstract:Machine unlearning (MU) aims to make a well-trained model behave as if it had never been trained on specific data. In today’s over-parameterized models, dominated by neural networks, a common approach is to manually relabel data and fine-tune the well-trained model. It can approximate the MU model in the output space, but the question remains whether it can achieve exact MU, i.e., in the parameter space. We answer this question by employing random feature techniques to construct an analytical framework. Under the premise of model optimization via stochastic gradient descent, we theoretically demonstrate that over-parameterized linear models can achieve exact MU through relabeling specific data. We also extend this work to real-world nonlinear networks and propose an alternating optimization algorithm that unifies the tasks of unlearning and relabeling. The algorithm’s effectiveness, confirmed through numerical experiments, highlights its superior performance in unlearning across various scenarios compared to current state-of-the-art methods, particularly excelling over similar relabeling-based MU approaches.

[LG-81] Score Neural Operator: A Generative Model for Learning and Generalizing Across Multiple Probability Distributions

Link: https://arxiv.org/abs/2410.08549
Authors: Xinyu Liao,Aoyang Qin,Jacob Seidman,Junqi Wang,Wei Wang,Paris Perdikaris
Keywords: Score Neural Operator, existing generative models, Neural Operator, Score Neural, score
Subjects: Machine Learning (cs.LG)

Abstract:Most existing generative models are limited to learning a single probability distribution from the training data and cannot generalize to novel distributions for unseen data. An architecture that can generate samples from both trained datasets and unseen probability distributions would mark a significant breakthrough. Recently, score-based generative models have gained considerable attention for their comprehensive mode coverage and high-quality image synthesis, as they effectively learn an operator that maps a probability distribution to its corresponding score function. In this work, we introduce the Score Neural Operator, which learns the mapping from multiple probability distributions to their score functions within a unified framework. We employ latent space techniques to facilitate the training of score matching, which tends to over-fit in the original image pixel space, thereby enhancing sample generation quality. Our trained Score Neural Operator demonstrates the ability to predict score functions of probability measures beyond the training space and exhibits strong generalization performance in both 2-dimensional Gaussian Mixture Models and 1024-dimensional MNIST double-digit datasets. Importantly, our approach offers significant potential for few-shot learning applications, where a single image from a new distribution can be leveraged to generate multiple distinct images from that distribution.

[LG-82] Kaleidoscope: Learnable Masks for Heterogeneous Multi-agent Reinforcement Learning NEURIPS2024

Link: https://arxiv.org/abs/2410.08540
Authors: Xinran Li,Ling Pan,Jun Zhang
Keywords: multi-agent reinforcement learning, parameter sharing, reinforcement learning, enhance sample efficiency, commonly employed
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Accepted by the Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024)

Abstract:In multi-agent reinforcement learning (MARL), parameter sharing is commonly employed to enhance sample efficiency. However, the popular approach of full parameter sharing often leads to homogeneous policies among agents, potentially limiting the performance benefits that could be derived from policy diversity. To address this critical limitation, we introduce Kaleidoscope, a novel adaptive partial parameter sharing scheme that fosters policy heterogeneity while still maintaining high sample efficiency. Specifically, Kaleidoscope maintains one set of common parameters alongside multiple sets of distinct, learnable masks for different agents, dictating the sharing of parameters. It promotes diversity among policy networks by encouraging discrepancy among these masks, without sacrificing the efficiencies of parameter sharing. This design allows Kaleidoscope to dynamically balance high sample efficiency with a broad policy representational capacity, effectively bridging the gap between full parameter sharing and non-parameter sharing across various environments. We further extend Kaleidoscope to critic ensembles in the context of actor-critic algorithms, which could help improve value estimations. Our empirical evaluations across extensive environments, including multi-agent particle environment, multi-agent MuJoCo and StarCraft multi-agent challenge v2, demonstrate the superior performance of Kaleidoscope compared with existing parameter sharing approaches, showcasing its potential for performance enhancement in MARL. The code is publicly available at this https URL.

[LG-83] Robust Offline Policy Learning with Observational Data from Multiple Sources

Link: https://arxiv.org/abs/2410.08537
Authors: Aldo Gael Carranza,Susan Athey
Keywords: diverse target settings, observational bandit feedback, bandit feedback data, target settings, observational bandit
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: arXiv admin note: substantial text overlap with arXiv:2305.12407

Abstract:We consider the problem of using observational bandit feedback data from multiple heterogeneous data sources to learn a personalized decision policy that robustly generalizes across diverse target settings. To achieve this, we propose a minimax regret optimization objective to ensure uniformly low regret under general mixtures of the source distributions. We develop a policy learning algorithm tailored to this objective, combining doubly robust offline policy evaluation techniques and no-regret learning algorithms for minimax optimization. Our regret analysis shows that this approach achieves the minimal worst-case mixture regret up to a moderated vanishing rate of the total data across all sources. Our analysis, extensions, and experimental results demonstrate the benefits of this approach for learning robust decision policies from multiple data sources.

[LG-84] Scaling Laws for Predicting Downstream Performance in LLMs

Link: https://arxiv.org/abs/2410.08527
Authors: Yangyi Chen,Binxuan Huang,Yifan Gao,Zhengyang Wang,Jingfeng Yang,Heng Ji
Keywords: large language models, Precise estimation, performance, pre-training loss, downstream performance
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Precise estimation of downstream performance in large language models (LLMs) prior to training is essential for guiding their development process. Scaling laws analysis utilizes the statistics of a series of significantly smaller sampling language models (LMs) to predict the performance of the target LLM. For downstream performance prediction, the critical challenge lies in the emergent abilities in LLMs that occur beyond task-specific computational thresholds. In this work, we focus on the pre-training loss as a more computation-efficient metric for performance estimation. Our two-stage approach consists of first estimating a function that maps computational resources (e.g., FLOPs) to the pre-training Loss using a series of sampling models, followed by mapping the pre-training loss to downstream task Performance after the critical “emergent phase”. In preliminary experiments, this FLP solution accurately predicts the performance of LLMs with 7B and 13B parameters using a series of sampling LMs up to 3B, achieving error margins of 5% and 10%, respectively, and significantly outperforming the FLOPs-to-Performance approach. This motivates FLP-M, a fundamental approach for performance prediction that addresses the practical need to integrate datasets from multiple sources during pre-training, specifically blending general corpora with code data to accurately represent the common necessity. FLP-M extends the power law analytical function to predict domain-specific pre-training loss based on FLOPs across data sources, and employs a two-layer neural network to model the non-linear relationship between multiple domain-specific loss and downstream performance. By utilizing a 3B LLM trained on a specific ratio and a series of smaller sampling LMs, FLP-M can effectively forecast the performance of 3B and 7B LLMs across various data mixtures for most benchmarks within 10% error margins.
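The two-stage FLOPs-to-Loss-to-Performance idea can be sketched as two small regressions. This is an illustration under stated assumptions, not the paper's implementation: stage one assumes a pure power law `loss = a * flops**(-b)` fitted in log-log space, and stage two assumes a linear map from loss to performance beyond the emergent phase. The helper names are hypothetical.

```python
from math import exp, log


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_flops_to_loss(flops, losses):
    """Stage 1: fit loss = a * flops**(-b) by regressing log(loss) on log(flops)."""
    slope, intercept = fit_line([log(c) for c in flops], [log(v) for v in losses])
    return exp(intercept), -slope  # (a, b)


def fit_loss_to_performance(losses, scores):
    """Stage 2: fit score = w * loss + c (assumed linear past the emergent phase)."""
    return fit_line(losses, scores)  # (w, c)


def predict_performance(target_flops, a, b, w, c):
    """Chain both stages: FLOPs -> predicted loss -> predicted performance."""
    predicted_loss = a * target_flops ** (-b)
    return w * predicted_loss + c
```

In use, both fits would be made on the small sampling LMs, and `predict_performance` would then be evaluated at the compute budget of the target LLM. Routing the prediction through the loss, rather than fitting FLOPs to performance directly, is the design choice the abstract argues for.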

[LG-85] IGNN-Solver: A Graph Neural Solver for Implicit Graph Neural Networks

Link: https://arxiv.org/abs/2410.08524
Authors: Junchao Lin,Zenan Ling,Zhanbo Feng,Feng Zhou,Jingwen Xu,Robert C Qiu
Keywords: capturing long-range dependencies, exhibit strong expressive, strong expressive power, recently demonstrated remarkable, demonstrated remarkable performance
Subjects: Machine Learning (cs.LG)

Abstract:Implicit graph neural networks (IGNNs), which exhibit strong expressive power with a single layer, have recently demonstrated remarkable performance in capturing long-range dependencies (LRD) in underlying graphs while effectively mitigating the over-smoothing problem. However, IGNNs rely on computationally expensive fixed-point iterations, which lead to significant speed and scalability limitations, hindering their application to large-scale graphs. To achieve fast fixed-point solving for IGNNs, we propose a novel graph neural solver, IGNN-Solver, which leverages the generalized Anderson Acceleration method, parameterized by a small GNN, and learns iterative updates as a graph-dependent temporal process. Extensive experiments demonstrate that the IGNN-Solver significantly accelerates inference, achieving a 1.5× to 8× speedup without sacrificing accuracy. Moreover, this advantage becomes increasingly pronounced as the graph scale grows, facilitating its large-scale deployment in real-world applications.
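For readers unfamiliar with Anderson Acceleration, the sketch below shows the classical depth-1 variant applied to a generic fixed-point problem x = g(x). It is not IGNN-Solver: the paper describes a generalized version whose mixing is parameterized by a small GNN, whereas here the mixing coefficient comes from the usual least-squares formula. The function name and the toy map are illustrative.

```python
from math import cos


def anderson_fixed_point(g, x0, tol=1e-10, max_iter=100):
    """Solve x = g(x) with classical depth-1 Anderson acceleration.

    x0 and g(x) are lists of floats. Each step combines the two most recent
    iterates so that the combined residual g(x) - x is as small as possible.
    """
    def residual(x, gx):
        return [a - b for a, b in zip(gx, x)]

    x_prev, g_prev = x0, g(x0)
    f_prev = residual(x_prev, g_prev)
    x = g_prev  # the first step is a plain fixed-point update
    for _ in range(max_iter):
        gx = g(x)
        f = residual(x, gx)
        if max(abs(r) for r in f) < tol:
            break
        df = [a - b for a, b in zip(f, f_prev)]
        denom = sum(d * d for d in df)
        # Least-squares mixing coefficient; fall back to a plain update if degenerate.
        gamma = sum(a * d for a, d in zip(f, df)) / denom if denom > 0 else 0.0
        x_next = [a - gamma * (a - b) for a, b in zip(gx, g_prev)]
        x_prev, g_prev, f_prev, x = x, gx, f, x_next
    return x


# Toy example: the fixed point of cos(x), roughly 0.739.
solution = anderson_fixed_point(lambda v: [cos(v[0])], [1.0])
```

Plain fixed-point iteration on this toy map converges only linearly, while the accelerated version reaches the same tolerance in far fewer evaluations of `g`. Reducing the number of such evaluations is the kind of saving the abstract targets for IGNN inference.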

[LG-86] Evaluating the effects of Data Sparsity on the Link-level Bicycling Volume Estimation: A Graph Convolutional Neural Network Approach

Link: https://arxiv.org/abs/2410.08522
Authors: Mohit Gupta,Debjit Bhowmick,Meead Saberi,Shirui Pan,Ben Beck
Keywords: Accurate bicycling volume, making informed decisions, bicycling volume estimation, Accurate bicycling, volume estimation
Subjects: Machine Learning (cs.LG)

Abstract:Accurate bicycling volume estimation is crucial for making informed decisions about future investments in bicycling infrastructure. Traditional link-level volume estimation models are effective for motorised traffic but face significant challenges when applied to the bicycling context because of sparse data and the intricate nature of bicycling mobility patterns. To the best of our knowledge, we present the first study to utilize a Graph Convolutional Network (GCN) architecture to model link-level bicycling volumes. We estimate the Annual Average Daily Bicycle (AADB) counts across the City of Melbourne, Australia using Strava Metro bicycling count data. To evaluate the effectiveness of the GCN model, we benchmark it against traditional machine learning models, such as linear regression, support vector machines, and random forest. Our results show that the GCN model performs better than these traditional models in predicting AADB counts, demonstrating its ability to capture the spatial dependencies inherent in bicycle traffic data. We further investigate how varying levels of data sparsity affect the performance of the GCN architecture. The GCN architecture performs well at sparsity levels up to 80%, but its limitations become apparent as the data sparsity increases further, emphasizing the need for further research on handling extreme data sparsity in bicycling volume estimation. Our findings offer valuable insights for city planners aiming to improve bicycling infrastructure and promote sustainable transportation.

[LG-87] Improving Legal Entity Recognition Using a Hybrid Transformer Model and Semantic Filtering Approach

Link: https://arxiv.org/abs/2410.08521
Authors: Duraimurugan Rajamanickam
Keywords: Legal Entity Recognition, Entity Recognition, automating legal workflows, compliance monitoring, contract analysis
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 7 pages, 1 table

Abstract:Legal Entity Recognition (LER) is critical in automating legal workflows such as contract analysis, compliance monitoring, and litigation support. Existing approaches, including rule-based systems and classical machine learning models, struggle with the complexity of legal documents and domain specificity, particularly in handling ambiguities and nested entity structures. This paper proposes a novel hybrid model that enhances the accuracy and precision of Legal-BERT, a transformer model fine-tuned for legal text processing, by introducing a semantic similarity-based filtering mechanism. We evaluate the model on a dataset of 15,000 annotated legal documents, achieving an F1 score of 93.4%, demonstrating significant improvements in precision and recall over previous methods.

[LG-88] Distributionally robust self-supervised learning for tabular data NEURIPS2024

Link: https://arxiv.org/abs/2410.08511
Authors: Shantanu Ghosh,Tiankang Xie,Mikhail Kuznetsov
Keywords: Empirical Risk Minimization, Risk Minimization, Empirical Risk, exhibit systematic errors, Machine learning
Subjects: Machine Learning (cs.LG)
Comments: TRL Workshop@NeurIPS2024

Abstract:Machine learning (ML) models trained using Empirical Risk Minimization (ERM) often exhibit systematic errors on specific subpopulations of tabular data, known as error slices. Learning robust representation in presence of error slices is challenging, especially in self-supervised settings during the feature reconstruction phase, due to high cardinality features and the complexity of constructing error sets. Traditional robust representation learning methods are largely focused on improving worst group performance in supervised setting in computer vision, leaving a gap in approaches tailored for tabular data. We address this gap by developing a framework to learn robust representation in tabular data during self-supervised pre-training. Our approach utilizes an encoder-decoder model trained with Masked Language Modeling (MLM) loss to learn robust latent representations. This paper applies the Just Train Twice (JTT) and Deep Feature Reweighting (DFR) methods during the pre-training phase for tabular data. These methods fine-tune the ERM pre-trained model by up-weighting error-prone samples or creating balanced datasets for specific categorical features. This results in specialized models for each feature, which are then used in an ensemble approach to enhance downstream classification performance. This methodology improves robustness across slices, thus enhancing overall generalization performance. Extensive experiments across various datasets demonstrate the efficacy of our approach.

[LG-89] Adversarial Training Can Provably Improve Robustness: Theoretical Analysis of Feature Learning Process Under Structured Data NEURIPS2024

Link: https://arxiv.org/abs/2410.08503
Authors: Binghui Li,Yuanzhi Li
Keywords: Adversarial training, Adversarial, training, training deep neural, non-robust feature
Subjects: Machine Learning (cs.LG)
Comments: 34 pages, Mathematics of Modern Machine Learning Workshop at NeurIPS 2024

Abstract:Adversarial training is a widely-applied approach to training deep neural networks to be robust against adversarial perturbation. However, although adversarial training has achieved empirical success in practice, it still remains unclear why adversarial examples exist and how adversarial training methods improve model robustness. In this paper, we provide a theoretical understanding of adversarial examples and adversarial training algorithms from the perspective of feature learning theory. Specifically, we focus on a multiple classification setting, where the structured data can be composed of two types of features: the robust features, which are resistant to perturbation but sparse, and the non-robust features, which are susceptible to perturbation but dense. We train a two-layer smoothed ReLU convolutional neural network to learn our structured data. First, we prove that by using standard training (gradient descent over the empirical risk), the network learner primarily learns the non-robust feature rather than the robust feature, which thereby leads to the adversarial examples that are generated by perturbations aligned with negative non-robust feature directions. Then, we consider the gradient-based adversarial training algorithm, which runs gradient ascent to find adversarial examples and runs gradient descent over the empirical risk at adversarial examples to update models. We show that the adversarial training method can provably strengthen the robust feature learning and suppress the non-robust feature learning to improve the network robustness. Finally, we also empirically validate our theoretical findings with experiments on real-image datasets, including MNIST, CIFAR10 and SVHN.

[LG-90] On a Hidden Property in Computational Imaging

Link: https://arxiv.org/abs/2410.08498
Authors: Yinan Feng,Yinpeng Chen,Yueh Lee,Youzuo Lin
Keywords: Computed Tomography, Full Waveform Inversion, Full Waveform, medical applications, seismic waveform data
Subjects: Machine Learning (cs.LG)

Abstract:Computational imaging plays a vital role in various scientific and medical applications, such as Full Waveform Inversion (FWI), Computed Tomography (CT), and Electromagnetic (EM) inversion. These methods address inverse problems by reconstructing physical properties (e.g., the acoustic velocity map in FWI) from measurement data (e.g., seismic waveform data in FWI), where both modalities are governed by complex mathematical equations. In this paper, we empirically demonstrate that despite their differing governing equations, three inverse problems (FWI, CT, and EM inversion) share a hidden property within their latent spaces. Specifically, using FWI as an example, we show that both modalities (the velocity map and seismic waveform data) follow the same set of one-way wave equations in the latent space, yet have distinct initial conditions that are linearly correlated. This suggests that after projection into the latent embedding space, the two modalities correspond to different solutions of the same equation, connected through their initial conditions. Our experiments confirm that this hidden property is consistent across all three imaging problems, providing a novel perspective for understanding these computational imaging tasks.

[LG-91] Towards Sharper Risk Bounds for Minimax Problems

Link: https://arxiv.org/abs/2410.08497
Authors: Bowei Zhu,Shaojie Li,Yong Liu
Keywords: reinforcement learning, adversarial training, machine learning, Minimax problems, achieved success
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Minimax problems have achieved success in machine learning areas such as adversarial training, robust optimization, and reinforcement learning. For theoretical analysis, current optimal excess risk bounds, which are composed of generalization error and optimization error, present 1/n-rates in strongly-convex-strongly-concave (SC-SC) settings. Existing studies mainly focus on minimax problems with specific algorithms for optimization error, with only a few studies on generalization performance, which limits the achievable excess risk bounds. In this paper, we study the generalization bounds measured by the gradients of primal functions using uniform localized convergence. We obtain a sharper high probability generalization error bound for nonconvex-strongly-concave (NC-SC) stochastic minimax problems. Furthermore, we provide dimension-independent results under the Polyak-Lojasiewicz condition for the outer layer. Based on our generalization error bound, we analyze some popular algorithms such as empirical saddle point (ESP), gradient descent ascent (GDA) and stochastic gradient descent ascent (SGDA). We derive better excess primal risk bounds with further reasonable assumptions, which, to the best of our knowledge, are n times faster than existing results in minimax problems.

[LG-92] Personalized Item Embeddings in Federated Multimodal Recommendation

Link: https://arxiv.org/abs/2410.08478
Authors: Zhiwei Li,Guodong Long,Jing Jiang,Chengqi Zhang
Keywords: Federated recommendation systems, protecting user privacy, recommendation systems play, play a crucial, crucial role
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 4 figures, 5 tables, conference

Abstract:Federated recommendation systems play a crucial role in protecting user privacy. However, existing methods primarily rely on ID-based item embeddings, overlooking the rich multimodal information of items. To address this limitation, we propose a novel Federated Multimodal Recommendation System called FedMR. FedMR leverages a foundation model on the server side to encode multimodal data, such as images and text, associated with items. To tackle the challenge of data heterogeneity caused by varying user preferences, FedMR introduces a Mixing Feature Fusion Module on the client. This module dynamically adjusts the weights of different fusion strategies based on user interaction history, generating personalized item embeddings that capture fine-grained user preferences. FedMR is compatible with existing ID-based federated recommendation systems, improving their performances without modifying the original framework. Our experiments on four real-world multimodal recommendation datasets demonstrate the effectiveness of FedMR. Our code is available at this https URL.

[LG-93] Deeper Insights into Deep Graph Convolutional Networks: Stability and Generalization

Link: https://arxiv.org/abs/2410.08473
Authors: Guangrui Yang,Ming Li,Han Feng,Xiaosheng Zhuang
Keywords: exhibiting promising performance, graph learning tasks, stability and generalization, deep GCNs, Graph convolutional networks
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 44 pages, 3 figures, submitted to IEEE Trans. Pattern Anal. Mach. Intell. on 18-Jun-2024, under review

Abstract:Graph convolutional networks (GCNs) have emerged as powerful models for graph learning tasks, exhibiting promising performance in various domains. While their empirical success is evident, there is a growing need to understand their essential ability from a theoretical perspective. Existing theoretical research has primarily focused on the analysis of single-layer GCNs, while a comprehensive theoretical exploration of the stability and generalization of deep GCNs remains limited. In this paper, we bridge this gap by delving into the stability and generalization properties of deep GCNs, aiming to provide valuable insights by characterizing rigorously the associated upper bounds. Our theoretical results reveal that the stability and generalization of deep GCNs are influenced by certain key factors, such as the maximum absolute eigenvalue of the graph filter operators and the depth of the network. Our theoretical studies contribute to a deeper understanding of the stability and generalization properties of deep GCNs, potentially paving the way for developing more reliable and well-performing models.

[LG-94] Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in CLIP EMNLP2024

Link: https://arxiv.org/abs/2410.08469
Authors: Eunji Kim,Kyuhong Shim,Simyung Chang,Sungroh Yoon
Keywords: Vision-Language Models, translating textual input, embedding space shared, natural language, encoder within Vision-Language
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at EMNLP 2024 Findings

Abstract:A text encoder within Vision-Language Models (VLMs) like CLIP plays a crucial role in translating textual input into an embedding space shared with images, thereby facilitating the interpretative analysis of vision tasks through natural language. Despite the varying significance of different textual elements within a sentence depending on the context, efforts to account for variation of importance in constructing text embeddings have been lacking. We propose a framework of Semantic Token Reweighting to build Interpretable text embeddings (SToRI), which incorporates controllability as well. SToRI refines the text encoding process in CLIP by differentially weighting semantic elements based on contextual importance, enabling finer control over emphasis responsive to data-driven insights and user preferences. The efficacy of SToRI is demonstrated through comprehensive experiments on few-shot image classification and image retrieval tailored to user preferences.

[LG-95] Driving Privacy Forward: Mitigating Information Leakage within Smart Vehicles through Synthetic Data Generation

Link: https://arxiv.org/abs/2410.08462
Authors: Krish Parikh
Keywords: Smart vehicles produce, vehicles produce large, produce large amounts, Smart vehicles, vehicles produce
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:Smart vehicles produce large amounts of data, much of which is sensitive and at risk of privacy breaches. As attackers increasingly exploit anonymised metadata within these datasets to profile drivers, it’s important to find solutions that mitigate this information leakage without hindering innovation and ongoing research. Synthetic data has emerged as a promising tool to address these privacy concerns, as it allows for the replication of real-world data relationships while minimising the risk of revealing sensitive information. In this paper, we examine the use of synthetic data to tackle these challenges. We start by proposing a comprehensive taxonomy of 14 in-vehicle sensors, identifying potential attacks and categorising their vulnerability. We then focus on the most vulnerable signals, using the Passive Vehicular Sensor (PVS) dataset to generate synthetic data with a Tabular Variational Autoencoder (TVAE) model, which included over 1 million data points. Finally, we evaluate this against 3 core metrics: fidelity, utility, and privacy. Our results show that we achieved 90.1% statistical similarity and 78% classification accuracy when tested on its original intent while also preventing the profiling of the driver. The code can be found at this https URL

[LG-96] Simultaneous Reward Distillation and Preference Learning: Get You a Language Model Who Can Do Both

Link: https://arxiv.org/abs/2410.08458
Authors: Abhijnan Nath,Changsoo Jung,Ethan Seefried,Nikhil Krishnaswamy
Keywords: building usable generative, usable generative large, generative large language, large language models, Direct Preference Optimization
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Abstract:Reward modeling of human preferences is one of the cornerstones of building usable generative large language models (LLMs). While traditional RLHF-based alignment methods explicitly maximize the expected rewards from a separate reward model, more recent supervised alignment methods like Direct Preference Optimization (DPO) circumvent this phase to avoid problems including model drift and reward overfitting. Although popular due to their simplicity, DPO and similar direct alignment methods can still lead to degenerate policies, and rely heavily on the Bradley-Terry-based preference formulation to model reward differences between pairs of candidate outputs. This formulation is challenged by non-deterministic or noisy preference labels, for example when human scoring of two candidate outputs is of low confidence. In this paper, we introduce DRDO (Direct Reward Distillation and policy-Optimization), a supervised knowledge distillation-based preference alignment method that simultaneously models rewards and preferences to avoid such degeneracy. DRDO directly mimics rewards assigned by an oracle while learning human preferences from a novel preference likelihood formulation. Our experimental results on the Ultrafeedback and TL;DR datasets demonstrate that policies trained using DRDO surpass previous methods such as DPO and e-DPO in terms of expected rewards and are more robust, on average, to noisy preference signals as well as out-of-distribution (OOD) settings.
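
For readers unfamiliar with the baseline the abstract contrasts against, here is a minimal sketch of the Bradley-Terry preference probability and the standard per-pair DPO loss. This is the well-known DPO formulation, not DRDO itself; the function and argument names are illustrative, and `beta=0.1` is just a placeholder value.

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def bradley_terry_prob(reward_chosen: float, reward_rejected: float) -> float:
    """P(chosen preferred over rejected) under the Bradley-Terry model."""
    return sigmoid(reward_chosen - reward_rejected)

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the policy and under the
    frozen reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(sigmoid(margin))

# Equal rewards give a 50/50 preference; a policy identical to the
# reference yields a loss of log(2).
print(bradley_terry_prob(1.0, 1.0))
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
```

Because the loss depends only on the difference between the two implicit rewards, a low-confidence or noisy label flips the sign of the margin, which is the weakness the abstract points to.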

[LG-97] Unity is Power: Semi-Asynchronous Collaborative Training of Large-Scale Models with Structured Pruning in Resource-Limited Clients

Link: https://arxiv.org/abs/2410.08457
Authors: Yan Li,Mingyi Li,Xiao Zhang,Guangwei Xu,Feng Chen,Yuan Yuan,Yifei Zou,Mengying Zhao,Jianbo Lu,Dongxiao Yu
Keywords: weak computing power, massive heterogeneous weak, heterogeneous weak computing, collaboratively train large-scale, dispersed datasets
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 24 Pages, 12 figures

Abstract:In this work, we study how to release the potential of massive heterogeneous weak computing power to collaboratively train large-scale models on dispersed datasets. In order to improve both efficiency and accuracy in resource-adaptive collaborative learning, we take the first step to consider the unstructured pruning, varying submodel architectures, knowledge loss, and straggler challenges simultaneously. We propose a novel semi-asynchronous collaborative training framework, namely Co-S$^2$P, with data distribution-aware structured pruning and cross-block knowledge transfer mechanism to address the above concerns. Furthermore, we provide theoretical proof that Co-S$^2$P can achieve asymptotic optimal convergence rate of $O(1/\sqrt{N^{*}EQ})$. Finally, we conduct extensive experiments on a real-world hardware testbed, in which 16 heterogeneous Jetson devices can be united to train large-scale models with parameters up to 0.11 billion. The experimental results demonstrate that Co-S$^2$P improves accuracy by up to 8.8% and resource utilization by up to $1.2\times$ compared to state-of-the-art methods, while reducing memory consumption by approximately 22% and training time by about 24% on all resource-limited devices.

[LG-98] Why pre-training is beneficial for downstream classification tasks?

Link: https://arxiv.org/abs/2410.08455
Authors: Xin Jiang,Xu Cheng,Zechao Li
Keywords: exhibited notable benefits, notable benefits, remain unclear, exhibited notable, boosting accuracy
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Pre-training has exhibited notable benefits to downstream tasks by boosting accuracy and speeding up convergence, but the exact reasons for these benefits remain unclear. To this end, we propose to quantitatively and explicitly explain the effects of pre-training on the downstream task from a novel game-theoretic view, which also sheds new light on the learning behavior of deep neural networks (DNNs). Specifically, we extract and quantify the knowledge encoded by the pre-trained model, and further track the changes of such knowledge during the fine-tuning process. Interestingly, we discover that only a small amount of the pre-trained model’s knowledge is preserved for the inference of downstream tasks. However, such preserved knowledge is very challenging for a model trained from scratch to learn. Thus, with the help of this exclusively learned and useful knowledge, the model fine-tuned from pre-training usually achieves better performance than the model trained from scratch. Besides, we discover that pre-training can guide the fine-tuned model to learn target knowledge for the downstream task more directly and quickly, which accounts for the faster convergence of the fine-tuned model.

[LG-99] AdvDiffuser: Generating Adversarial Safety-Critical Driving Scenarios via Guided Diffusion

Link: https://arxiv.org/abs/2410.08453
Authors: Yuting Xie,Xianda Guo,Cong Wang,Kunhua Liu,Long Chen
Keywords: hold significant importance, hold significant, significant importance, training and testing, testing of autonomous
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:Safety-critical scenarios are infrequent in natural driving environments but hold significant importance for the training and testing of autonomous driving systems. The prevailing approach involves generating safety-critical scenarios automatically in simulation by introducing adversarial adjustments to natural environments. These adjustments are often tailored to specific tested systems, thereby disregarding their transferability across different systems. In this paper, we propose AdvDiffuser, an adversarial framework for generating safety-critical driving scenarios through guided diffusion. By incorporating a diffusion model to capture plausible collective behaviors of background vehicles and a lightweight guide model to effectively handle adversarial scenarios, AdvDiffuser facilitates transferability. Experimental results on the nuScenes dataset demonstrate that AdvDiffuser, trained on offline driving logs, can be applied to various tested systems with minimal warm-up episode data and outperform other existing methods in terms of realism, diversity, and adversarial performance.

[LG-100] The Proof of Kolmogorov-Arnold May Illuminate Neural Network Learning

Link: https://arxiv.org/abs/2410.08451
Authors: Michael H. Freedman
Keywords: Kolmogorov and Arnold, Neural Networks, answering Hilbert, theory of Neural, laid the foundations
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG)

Abstract:Kolmogorov and Arnold, in answering Hilbert’s 13th problem (in the context of continuous functions), laid the foundations for the modern theory of Neural Networks (NNs). Their proof divides the representation of a multivariate function into two steps: The first (non-linear) inter-layer map gives a universal embedding of the data manifold into a single hidden layer whose image is patterned in such a way that a subsequent dynamic can then be defined to solve for the second inter-layer map. I interpret this pattern as “minor concentration” of the almost everywhere defined Jacobians of the interlayer map. Minor concentration amounts to sparsity for higher exterior powers of the Jacobians. We present a conceptual argument for how such sparsity may set the stage for the emergence of successively higher order concepts in today’s deep NNs and suggest two classes of experiments to test this hypothesis.

[LG-101] Finite Sample and Large Deviations Analysis of Stochastic Gradient Algorithm with Correlated Noise

Link: https://arxiv.org/abs/2410.08449
Authors: George Yin,Vikram Krishnamurthy
Keywords: stochastic gradient algorithm, finite sample regret, decreasing step size, step size stochastic, size stochastic gradient
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:We analyze the finite sample regret of a decreasing step size stochastic gradient algorithm. We assume correlated noise and use a perturbed Lyapunov function as a systematic approach for the analysis. Finally we analyze the escape time of the iterates using large deviations theory.
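
To make the setting concrete, the following is a small simulation sketch of a decreasing step size stochastic gradient iteration driven by correlated noise. Everything specific here is an assumption chosen for illustration rather than taken from the paper: the quadratic objective, the AR(1) noise model with coefficient `rho`, and the step size 1/k.

```python
import random

def sgd_with_correlated_noise(grad, theta0: float, num_steps: int,
                              rho: float = 0.8, seed: int = 0) -> float:
    """Run theta_{k+1} = theta_k - (1/k) * (grad(theta_k) + noise_k).

    noise_k follows an AR(1) process, so consecutive gradient errors are
    correlated rather than independent.
    """
    rng = random.Random(seed)
    theta = theta0
    noise = 0.0
    for k in range(1, num_steps + 1):
        noise = rho * noise + rng.gauss(0.0, 1.0)  # correlated noise sample
        step = 1.0 / k                             # decreasing step size
        theta -= step * (grad(theta) + noise)
    return theta

# Hypothetical objective f(theta) = 0.5 * (theta - 3)^2 with minimizer 3.
theta_hat = sgd_with_correlated_noise(lambda t: t - 3.0,
                                      theta0=0.0, num_steps=20000)
print(theta_hat)
```

For this quadratic and this step size, the error after K steps equals the negative average of the noise samples, so correlation in the noise inflates the error relative to the independent case without preventing convergence.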

[LG-102] Slow Convergence of Interacting Kalman Filters in Word-of-Mouth Social Learning

Link: https://arxiv.org/abs/2410.08447
Authors: Vikram Krishnamurthy,Cristian Rojas
Keywords: Kalman filter, Kalman filter receives, Kalman filter agents, subsequent Kalman filter, previous Kalman filter
Subjects: Machine Learning (cs.LG); Theoretical Economics (econ.TH); Signal Processing (eess.SP)

Abstract:We consider word-of-mouth social learning involving $m$ Kalman filter agents that operate sequentially. The first Kalman filter receives the raw observations, while each subsequent Kalman filter receives a noisy measurement of the conditional mean of the previous Kalman filter. The prior is updated by the $m$-th Kalman filter. When $m=2$, and the observations are noisy measurements of a Gaussian random variable, the covariance goes to zero as $k^{-1/3}$ for $k$ observations, instead of $O(k^{-1})$ in the standard Kalman filter. In this paper we prove that for $m$ agents, the covariance decreases to zero as $k^{-(2^m-1)}$, i.e., the learning slows down exponentially with the number of agents. We also show that by artificially weighing the prior at each time, the learning rate can be made optimal as $k^{-1}$. The implication is that in word-of-mouth social learning, artificially re-weighing the prior can yield the optimal learning rate.
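
The baseline rate quoted in the abstract can be checked with a few lines. The sketch below implements only the standard single-agent scalar Kalman covariance recursion for a constant Gaussian state observed in Gaussian noise; it does not model the word-of-mouth chain, and the function name and parameter values are illustrative.

```python
def posterior_variances(prior_var: float, noise_var: float, num_obs: int) -> list:
    """Posterior variance after each observation y_k = x + v_k, v_k ~ N(0, noise_var).

    The state x is a fixed Gaussian random variable, so there is no
    prediction step and only the measurement update is applied.
    """
    variances = []
    p = prior_var
    for _ in range(num_obs):
        gain = p / (p + noise_var)  # Kalman gain
        p = (1.0 - gain) * p        # measurement update of the covariance
        variances.append(p)
    return variances

variances = posterior_variances(prior_var=1.0, noise_var=1.0, num_obs=1000)
# k * P_k approaches noise_var, i.e. the covariance decays like 1/k.
for k in (10, 100, 1000):
    print(k, variances[k - 1], k * variances[k - 1])
```

The recursion has the closed form 1/P_k = 1/P_0 + k/R, which is the $O(k^{-1})$ benchmark against which the slower multi-agent rates are compared.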

[LG-103] JurEE not Judges: safeguarding llm interactions with small specialised Encoder Ensembles

Link: https://arxiv.org/abs/2410.08442
Authors: Dom Nasrabadi
Keywords: encoder-only transformer models, transformer models designed, encoder-only transformer, LLM-based systems, designed to strengthen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:We introduce JurEE, an ensemble of efficient, encoder-only transformer models designed to strengthen safeguards in AI-User interactions within LLM-based systems. Unlike existing LLM-as-Judge methods, which often struggle with generalization across risk taxonomies and only provide textual outputs, JurEE offers probabilistic risk estimates across a wide range of prevalent risks. Our approach leverages diverse data sources and employs progressive synthetic data generation techniques, including LLM-assisted augmentation, to enhance model robustness and performance. We create an in-house benchmark comprising other reputable benchmarks such as the OpenAI Moderation Dataset and ToxicChat, where we find JurEE significantly outperforms baseline models, demonstrating superior accuracy, speed, and cost-efficiency. This makes it particularly suitable for applications requiring stringent content moderation, such as customer-facing chatbots. The encoder-ensemble’s modular design allows users to set tailored risk thresholds, enhancing its versatility across various safety-related applications. JurEE’s collective decision-making process, where each specialized encoder model contributes to the final output, not only improves predictive accuracy but also enhances interpretability. This approach provides a more efficient, performant, and economical alternative to traditional LLMs for large-scale implementations requiring robust content moderation.

[LG-104] Reinforcement Learning for Control of Non-Markovian Cellular Population Dynamics NEURIPS

Link: https://arxiv.org/abs/2410.08439
Authors: Josiah C. Kratz,Jacob Adamczyk
Keywords: exhibit a remarkable, bacteria to cancer, remarkable ability, ability to adapt, adapt to fluctuating
Subjects: Machine Learning (cs.LG); Populations and Evolution (q-bio.PE)
Comments: Accepted at NeurIPS ML4PS Workshop 2024

Abstract:Many organisms and cell types, from bacteria to cancer cells, exhibit a remarkable ability to adapt to fluctuating environments. Additionally, cells can leverage memory of past environments to better survive previously-encountered stressors. From a control perspective, this adaptability poses significant challenges in driving cell populations toward extinction, and is thus an open question with great clinical significance. In this work, we focus on drug dosing in cell populations exhibiting phenotypic plasticity. For specific dynamical models switching between resistant and susceptible states, exact solutions are known. However, when the underlying system parameters are unknown, and for complex memory-based systems, obtaining the optimal solution is currently intractable. To address this challenge, we apply reinforcement learning (RL) to identify informed dosing strategies to control cell populations evolving under novel non-Markovian dynamics. We find that model-free deep RL is able to recover exact solutions and control cell populations even in the presence of long-range temporal dynamics.

[LG-105] Symbolic Music Generation with Fine-grained Interactive Textural Guidance

Link: https://arxiv.org/abs/2410.08435
Authors: Tingyu Zhu,Haoyu Liu,Zhimin Jiang,Zeyu Zheng
Keywords: limited data availability, Fine-grained Textural Guidance, generation presents unique, presents unique challenges, unique challenges due
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)

Abstract:The problem of symbolic music generation presents unique challenges due to the combination of limited data availability and the need for high precision in note pitch. To overcome these difficulties, we introduce Fine-grained Textural Guidance (FTG) within diffusion models to correct errors in the learned distributions. By incorporating FTG, the diffusion models improve the accuracy of music generation, which makes them well-suited for advanced tasks such as progressive music generation, improvisation and interactive music creation. We derive theoretical characterizations for both the challenges in symbolic music generation and the effect of the FTG approach. We provide numerical experiments and a demo page for interactive music generation with user input to showcase the effectiveness of our approach.

[LG-106] MYCROFT: Towards Effective and Efficient External Data Augmentation

Link: https://arxiv.org/abs/2410.08432
Authors: Zain Sarwar,Van Tran,Arjun Nitin Bhagoji,Nick Feamster,Ben Y. Zhao,Supriyo Chakraborty
Keywords: require large amounts, Machine learning, model trainers, data, require large
Subjects: Machine Learning (cs.LG)
Comments: 10 pages, 3 figures, 3 tables

Abstract:Machine learning (ML) models often require large amounts of data to perform well. When the available data is limited, model trainers may need to acquire more data from external sources. Often, useful data is held by private entities who are hesitant to share their data due to propriety and privacy concerns. This makes it challenging and expensive for model trainers to acquire the data they need to improve model performance. To address this challenge, we propose Mycroft, a data-efficient method that enables model trainers to evaluate the relative utility of different data sources while working with a constrained data-sharing budget. By leveraging feature space distances and gradient matching, Mycroft identifies small but informative data subsets from each owner, allowing model trainers to maximize performance with minimal data exposure. Experimental results across four tasks in two domains show that Mycroft converges rapidly to the performance of the full-information baseline, where all data is shared. Moreover, Mycroft is robust to noise and can effectively rank data owners by utility. Mycroft can pave the way for democratized training of high performance ML models.

[LG-107] A phase transition in sampling from Restricted Boltzmann Machines

Link: https://arxiv.org/abs/2410.08423
Authors: Youngwoo Kwon,Qian Qin,Guanyang Wang,Yuchen Wei
Keywords: Restricted Boltzmann Machines, undirected graphical models, one-parameter Restricted Boltzmann, Restricted Boltzmann, Boltzmann Machines
Subjects: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Mathematical Physics (math-ph); Probability (math.PR); Computation (stat.CO)
Comments: 43 pages, 4 figures

Abstract:Restricted Boltzmann Machines are a class of undirected graphical models that play a key role in deep learning and unsupervised learning. In this study, we prove a phase transition phenomenon in the mixing time of the Gibbs sampler for a one-parameter Restricted Boltzmann Machine. Specifically, the mixing time varies logarithmically, polynomially, and exponentially with the number of vertices depending on whether the parameter $c$ is above, equal to, or below a critical value $c_\star \approx -5.87$. A key insight from our analysis is the link between the Gibbs sampler and a dynamical system, which we utilize to quantify the former based on the behavior of the latter. To study the critical case $c = c_\star$, we develop a new isoperimetric inequality for the sampler’s stationary distribution by showing that the distribution is nearly log-concave.

[LG-108] Generalizable autoregressive modeling of time series through functional narratives

Link: https://arxiv.org/abs/2410.08421
Authors: Ran Liu,Wenrui Ma,Ellen Zippi,Hadi Pouransari,Jingyun Xiao,Chris Sandino,Behrooz Mahasseni,Juri Minxha,Erdrin Azemi,Eva L. Dyer,Ali Moin
Keywords: Time series, learn time series, Time series data, Time, series
Subjects: Machine Learning (cs.LG)

Abstract:Time series data are inherently functions of time, yet current transformers often learn time series by modeling them as mere concatenations of time periods, overlooking their functional properties. In this work, we propose a novel objective for transformers that learn time series by re-interpreting them as temporal functions. We build an alternative sequence of time series by constructing degradation operators of different intensity in the functional space, creating augmented variants of the original sample that are abstracted or simplified to different degrees. Based on the new set of generated sequences, we train an autoregressive transformer that progressively recovers the original sample from the most simplified variant. Analogous to the next word prediction task in languages that learns narratives by connecting different words, our autoregressive transformer aims to learn the Narratives of Time Series (NoTS) by connecting different functions in time. Theoretically, we justify the construction of the alternative sequence through its advantages in approximating functions. When learning time series data with transformers, constructing sequences of temporal functions allows for a broader class of approximable functions (e.g., differentiation) compared to sequences of time periods, leading to a 26% performance improvement in synthetic feature regression experiments. Experimentally, we validate NoTS in 3 different tasks across 22 real-world datasets, where we show that NoTS significantly outperforms other pre-training methods by up to 6%. Additionally, combining NoTS on top of existing transformer architectures can consistently boost the performance. Our results demonstrate the potential of NoTS as a general-purpose dynamic learner, offering a viable alternative for developing foundation models for time series analysis.

[LG-109] Bilinear MLPs enable weight-based mechanistic interpretability

Link: https://arxiv.org/abs/2410.08417
Authors: Michael T. Pearce,Thomas Dooms,Alice Rigg,Jose M. Oramas,Lee Sharkey
Keywords: networks remains elusive, deep neural networks, neural networks remains, remains elusive, deep neural
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:A mechanistic understanding of how MLPs do computation in deep neural networks remains elusive. Current interpretability work can extract features from hidden activations over an input dataset but generally cannot explain how MLP weights construct features. One challenge is that element-wise nonlinearities introduce higher-order interactions and make it difficult to trace computations through the MLP layer. In this paper, we analyze bilinear MLPs, a type of Gated Linear Unit (GLU) without any element-wise nonlinearity that nevertheless achieves competitive performance. Bilinear MLPs can be fully expressed in terms of linear operations using a third-order tensor, allowing flexible analysis of the weights. Analyzing the spectra of bilinear MLP weights using eigendecomposition reveals interpretable low-rank structure across toy tasks, image classification, and language modeling. We use this understanding to craft adversarial examples, uncover overfitting, and identify small language model circuits directly from the weights alone. Our results demonstrate that bilinear layers serve as an interpretable drop-in replacement for current activation functions and that weight-based interpretability is viable for understanding deep-learning models.
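
The tensor view mentioned in the abstract can be illustrated on a toy layer. The sketch below assumes a bilinear layer of the form (Wx) ⊙ (Vx) with no biases and no output projection, builds the third-order tensor explicitly, and eigendecomposes the symmetric interaction matrix for one output direction. The sizes, random weights, and names such as `interaction_matrix` are illustrative and not taken from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 4, 3
W = rng.normal(size=(d_hidden, d_in))
V = rng.normal(size=(d_hidden, d_in))

def bilinear_layer(x: np.ndarray) -> np.ndarray:
    """Gated linear unit without an element-wise nonlinearity."""
    return (W @ x) * (V @ x)

# Third-order tensor B[a, i, j] = W[a, i] * V[a, j]; output a equals x^T B[a] x.
B = np.einsum("ai,aj->aij", W, V)

def interaction_matrix(u: np.ndarray) -> np.ndarray:
    """Symmetric matrix Q with u . bilinear_layer(x) == x^T Q x."""
    q = np.einsum("a,aij->ij", u, B)
    return 0.5 * (q + q.T)  # symmetrizing leaves the quadratic form unchanged

x = rng.normal(size=d_in)
u = rng.normal(size=d_hidden)  # hypothetical output direction of interest

Q = interaction_matrix(u)
eigvals, eigvecs = np.linalg.eigh(Q)

direct = u @ bilinear_layer(x)
via_tensor = x @ Q @ x
via_eig = np.sum(eigvals * (eigvecs.T @ x) ** 2)
print(direct, via_tensor, via_eig)
```

All three values agree, so the eigenvectors of Q describe the input directions that drive the chosen output, and they are read off from the weights alone without running any data through the layer.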

[LG-110] What is Left After Distillation? How Knowledge Transfer Impacts Fairness and Bias

Link: https://arxiv.org/abs/2410.08407
Authors: Aida Mohammadshahi,Yani Ioannou
Keywords: Deep Neural Network, Neural Network compression, Network compression method, Deep Neural, Neural Network
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Machine Learning (stat.ML)

Abstract:Knowledge Distillation is a commonly used Deep Neural Network compression method, which often maintains overall generalization performance. However, we show that even for balanced image classification datasets, such as CIFAR-100, Tiny ImageNet and ImageNet, as many as 41% of the classes are statistically significantly affected by distillation when comparing class-wise accuracy (i.e. class bias) between a teacher/distilled student or distilled student/non-distilled student model. Changes in class bias are not necessarily an undesirable outcome when considered outside of the context of a model’s usage. Using two common fairness metrics, Demographic Parity Difference (DPD) and Equalized Odds Difference (EOD) on models trained with the CelebA, Trifeature, and HateXplain datasets, our results suggest that increasing the distillation temperature improves the distilled student model’s fairness – for DPD, the distilled student even surpasses the fairness of the teacher model at high temperatures. This study highlights the uneven effects of Knowledge Distillation on certain classes and its potentially significant role in fairness, emphasizing that caution is warranted when using distilled models for sensitive application domains.
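
The two fairness metrics named in the abstract have simple definitions for binary predictions. The sketch below is a plain-Python illustration under common conventions: DPD is the gap in positive-prediction rate across groups, and EOD is the larger of the true-positive-rate gap and the false-positive-rate gap. The paper's exact computation may differ, the toy data is made up, and every group is assumed to contain both labels.

```python
def demographic_parity_difference(y_pred, groups) -> float:
    """Max minus min positive-prediction rate across groups."""
    rates = []
    for g in sorted(set(groups)):
        preds = [p for p, gg in zip(y_pred, groups) if gg == g]
        rates.append(sum(preds) / len(preds))
    return max(rates) - min(rates)

def _positive_rate(y_true, y_pred, groups, group, label) -> float:
    """Positive-prediction rate within one group, restricted to one true label."""
    preds = [p for t, p, g in zip(y_true, y_pred, groups)
             if g == group and t == label]
    return sum(preds) / len(preds)

def equalized_odds_difference(y_true, y_pred, groups) -> float:
    """Larger of the TPR gap and the FPR gap across groups."""
    names = sorted(set(groups))
    tprs = [_positive_rate(y_true, y_pred, groups, g, 1) for g in names]
    fprs = [_positive_rate(y_true, y_pred, groups, g, 0) for g in names]
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))

# Made-up toy data with two groups of four examples each.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

dpd = demographic_parity_difference(y_pred, groups)
eod = equalized_odds_difference(y_true, y_pred, groups)
print(dpd, eod)
```

Both metrics are 0 for a perfectly group-balanced classifier, so lower values indicate a fairer model under these definitions.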

[LG-111] Identifying Money Laundering Subgraphs on the Blockchain

Link: https://arxiv.org/abs/2410.08394
Authors: Kiwhan Song,Mohamed Ali Dhraief,Muhua Xu,Locke Cai,Xuhao Chen,Arvind,Jie Chen
Keywords: money laundering crimes, involves the identification, Anti-Money Laundering, AML, money laundering
Subjects: Machine Learning (cs.LG); General Finance (q-fin.GN)
Comments: ICAIF 2024. Code is available at this https URL

Abstract:Anti-Money Laundering (AML) involves the identification of money laundering crimes in financial activities, such as cryptocurrency transactions. Recent studies advanced AML through the lens of graph-based machine learning, modeling the web of financial transactions as a graph and developing graph methods to identify suspicious activities. For instance, a recent effort on opensourcing datasets and benchmarks, Elliptic2, treats a set of Bitcoin addresses, considered to be controlled by the same entity, as a graph node and transactions among entities as graph edges. This modeling reveals the “shape” of a money laundering scheme - a subgraph on the blockchain. Despite the attractive subgraph classification results benchmarked by the paper, competitive methods remain expensive to apply due to the massive size of the graph; moreover, existing methods require candidate subgraphs as inputs which may not be available in practice. In this work, we introduce RevTrack, a graph-based framework that enables large-scale AML analysis with a lower cost and a higher accuracy. The key idea is to track the initial senders and the final receivers of funds; these entities offer a strong indication of the nature (licit vs. suspicious) of their respective subgraph. Based on this framework, we propose RevClassify, which is a neural network model for subgraph classification. Additionally, we address the practical problem where subgraph candidates are not given, by proposing RevFilter. This method identifies new suspicious subgraphs by iteratively filtering licit transactions, using RevClassify. Benchmarking these methods on Elliptic2, a new standard for AML, we show that RevClassify outperforms state-of-the-art subgraph classification techniques in both cost and accuracy. Furthermore, we demonstrate the effectiveness of RevFilter in discovering new suspicious subgraphs, confirming its utility for practical AML.
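
The key idea stated in the abstract, tracking initial senders and final receivers of funds, can be pictured on a toy directed transaction graph. The sketch below is a simplified reading and not the RevTrack algorithm: it treats an initial sender as an entity that only sends within the subgraph and a final receiver as one that only receives. The function name and the example edges are hypothetical.

```python
def initial_senders_and_final_receivers(edges):
    """Return (senders-only nodes, receivers-only nodes) of a directed edge list.

    Each edge is a (sender, receiver) pair representing a transaction
    between two entities.
    """
    senders = {s for s, _ in edges}
    receivers = {r for _, r in edges}
    return senders - receivers, receivers - senders

# Hypothetical subgraph: funds flow from "a" through "b" and "c" to "d".
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
sources, sinks = initial_senders_and_final_receivers(edges)
print(sources, sinks)
```

Summarizing a subgraph by its endpoints avoids processing every intermediate hop, which is consistent with the abstract's emphasis on lower cost.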

[LG-112] KnowGraph: Knowledge-Enabled Anomaly Detection via Logical Reasoning on Graph Data CCS2024

Link: https://arxiv.org/abs/2410.08390
Authors: Andy Zhou,Xiaojun Xu,Ramesh Raghunathan,Alok Lal,Xinze Guan,Bin Yu,Bo Li
Keywords: Graph Neural Networks, network traffic, pivotal in diverse, transaction networks, Neural Networks
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to ACM CCS 2024

Abstract:Graph-based anomaly detection is pivotal in diverse security applications, such as fraud detection in transaction networks and intrusion detection for network traffic. Standard approaches, including Graph Neural Networks (GNNs), often struggle to generalize across shifting data distributions. Meanwhile, real-world domain knowledge is more stable and a common existing component of real-world detection strategies. To explicitly integrate such knowledge into data-driven models such as GCNs, we propose KnowGraph, which integrates domain knowledge with data-driven learning for enhanced graph-based anomaly detection. KnowGraph comprises two principal components: (1) a statistical learning component that utilizes a main model for the overarching detection task, augmented by multiple specialized knowledge models that predict domain-specific semantic entities; (2) a reasoning component that employs probabilistic graphical models to execute logical inferences based on model outputs, encoding domain knowledge through weighted first-order logic formulas. Extensive experiments on these large-scale real-world datasets show that KnowGraph consistently outperforms state-of-the-art baselines in both transductive and inductive settings, achieving substantial gains in average precision when generalizing to completely unseen test graphs. Further ablation studies demonstrate the effectiveness of the proposed reasoning component in improving detection performance, especially under extreme class imbalance. These results highlight the potential of integrating domain knowledge into data-driven models for high-stakes, graph-based security applications.

[LG-113] Heating Up Quasi-Monte Carlo Graph Random Features: A Diffusion Kernel Perspective

Link: https://arxiv.org/abs/2410.08389
Authors: Brooke Feinberg,Aiwen Li
Keywords: recently introduced class, quasi-graph random features, Inverse Cosine kernels, Ladder graphs, yield lower variance
Subjects: Machine Learning (cs.LG); Combinatorics (math.CO)
Comments: 18 pages, 16 figures

Abstract:We build upon a recently introduced class of quasi-graph random features (q-GRFs), which have demonstrated the ability to yield lower variance estimators of the 2-regularized Laplacian kernel (Choromanski 2023). Our research investigates whether similar results can be achieved with alternative kernel functions, specifically the Diffusion (or Heat), Matérn, and Inverse Cosine kernels. We find that the Diffusion kernel performs most similarly to the 2-regularized Laplacian, and we further explore graph types that benefit from the previously established antithetic termination procedure. Specifically, we explore Erdős-Rényi and Barabási-Albert random graph models, Binary Trees, and Ladder graphs, with the goal of identifying combinations of specific kernel and graph type that benefit from antithetic termination. We assert that q-GRFs achieve lower variance estimators of the Diffusion (or Heat) kernel on Ladder graphs. However, the number of rungs on the Ladder graphs impacts the algorithm’s performance; further theoretical results supporting our experimentation are forthcoming. This work builds upon some of the earliest Quasi-Monte Carlo methods for kernels defined on combinatorial objects, paving the way for kernel-based learning algorithms and future real-world applications in various domains.
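
For orientation, the kernels compared in the abstract can be computed exactly on a small Ladder graph, which is the ground truth that random-feature estimators approximate. The sketch below makes several assumptions that may not match the paper: it uses the unnormalized combinatorial Laplacian L, takes the diffusion kernel as exp(-σ²L) and the 2-regularized Laplacian kernel as (I + σ²L)^{-2}, and picks an arbitrary σ². It does not implement q-GRFs or antithetic termination.

```python
import numpy as np

def ladder_graph_adjacency(rungs: int) -> np.ndarray:
    """Adjacency matrix of a ladder graph with the given number of rungs."""
    n = 2 * rungs
    adj = np.zeros((n, n))
    for i in range(rungs):
        adj[i, rungs + i] = adj[rungs + i, i] = 1.0  # rung between the rails
        if i + 1 < rungs:
            adj[i, i + 1] = adj[i + 1, i] = 1.0      # first rail
            adj[rungs + i, rungs + i + 1] = adj[rungs + i + 1, rungs + i] = 1.0
    return adj

def exact_kernels(adj: np.ndarray, sigma2: float = 0.1):
    """Exact diffusion and 2-regularized Laplacian kernels via eigendecomposition."""
    lap = np.diag(adj.sum(axis=1)) - adj
    eigvals, eigvecs = np.linalg.eigh(lap)
    diffusion = (eigvecs * np.exp(-sigma2 * eigvals)) @ eigvecs.T
    reg_laplacian_2 = (eigvecs * (1.0 + sigma2 * eigvals) ** -2.0) @ eigvecs.T
    return lap, diffusion, reg_laplacian_2

adj = ladder_graph_adjacency(rungs=4)
lap, diffusion, reg_laplacian_2 = exact_kernels(adj, sigma2=0.1)
print(np.abs(diffusion - reg_laplacian_2).max())
```

Changing `rungs` gives a quick way to see how the two kernels drift apart or stay close as the graph grows, which is the kind of comparison the abstract reports.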

[LG-114] Language model developers should report train-test overlap

Link: https://arxiv.org/abs/2410.08385
Authors: Andy K Zhang,Kevin Klyman,Yifan Mai,Yoav Levine,Yian Zhang,Rishi Bommasani,Percy Liang
Keywords: train-test overlap, train-test, overlap, results requires knowledge, measure train-test overlap
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Software Engineering (cs.SE)
Comments: 18 pages

Abstract:Language models are extensively evaluated, but correctly interpreting evaluation results requires knowledge of train-test overlap which refers to the extent to which the language model is trained on the very data it is being tested on. The public currently lacks adequate information about train-test overlap: most models have no public train-test overlap statistics, and third parties cannot directly measure train-test overlap since they do not have access to the training data. To make this clear, we document the practices of 30 model developers, finding that just 9 developers report train-test overlap: 4 developers release training data under open-source licenses, enabling the community to directly measure train-test overlap, and 5 developers publish their train-test overlap methodology and statistics. By engaging with language model developers, we provide novel information about train-test overlap for three additional developers. Overall, we take the position that language model developers should publish train-test overlap statistics and/or training data whenever they report evaluation results on public test sets. We hope our work increases transparency into train-test overlap to increase the community-wide trust in model evaluations.
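The abstract does not prescribe a particular overlap statistic. One simple possibility, sketched below, is the fraction of test documents that share at least one word n-gram with the training data. The function names, whitespace tokenization, and default n = 3 are assumptions for illustration only.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in `text` (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(train_docs, test_docs, n=3):
    """Fraction of test documents sharing at least one n-gram with training data."""
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    if not test_docs:
        return 0.0
    hits = sum(1 for doc in test_docs if ngrams(doc, n) & train_grams)
    return hits / len(test_docs)


train = ["the cat sat on the mat"]
test = ["a cat sat on a chair", "dogs bark loudly at night"]
print(overlap_fraction(train, test))
```

Computing even this crude statistic requires access to the training corpus, which is the reason third parties cannot measure overlap for closed models.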

[LG-115] Merging in a Bottle: Differentiable Adaptive Merging (DAM) and the Path from Averaging to Automation

Link: https://arxiv.org/abs/2410.08371
Authors: Thomas Gauthier-Caron,Shamane Siriwardhana,Elliot Stein,Malikeh Ehghaghi,Charles Goddard,Mark McQuade,Jacob Solawetz,Maxime Labonne
Keywords: requiring substantial retraining, separate language models, achieving a balance, substantial retraining, systems can combine
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages, 1 figure, and 3 tables

Abstract:By merging models, AI systems can combine the distinct strengths of separate language models, achieving a balance between multiple capabilities without requiring substantial retraining. However, the integration process can be intricate due to differences in training methods and fine-tuning, typically necessitating specialized knowledge and repeated refinement. This paper explores model merging techniques across a spectrum of complexity, examining where automated methods like evolutionary strategies stand compared to hyperparameter-driven approaches such as DARE, TIES-Merging and simpler methods like Model Soups. In addition, we introduce Differentiable Adaptive Merging (DAM), an efficient, adaptive merging approach as an alternative to evolutionary merging that optimizes model integration through scaling coefficients, minimizing computational demands. Our findings reveal that even simple averaging methods, like Model Soups, perform competitively when model similarity is high, underscoring each technique’s unique strengths and limitations. We open-sourced DAM, including the implementation code and experiment pipeline, on GitHub: this https URL.
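To make the averaging end of this spectrum concrete, here is a minimal sketch of Model Soups-style uniform averaging, with optional fixed per-model scaling coefficients. It is not the DAM implementation: DAM learns its coefficients, whereas `merge_weights` and its `coeffs` argument are hypothetical names that take hand-picked values, and the weights are plain lists rather than tensors.

```python
def merge_weights(models, coeffs=None):
    """Merge models given as {parameter_name: [floats]} by a weighted sum.

    With coeffs=None every model gets weight 1/len(models) (uniform averaging).
    """
    if not models:
        raise ValueError("need at least one model")
    if coeffs is None:
        coeffs = [1.0 / len(models)] * len(models)
    if len(coeffs) != len(models):
        raise ValueError("one coefficient per model is required")
    merged = {}
    for name in models[0]:
        columns = zip(*(m[name] for m in models))  # same parameter across models
        merged[name] = [sum(c * v for c, v in zip(coeffs, col)) for col in columns]
    return merged


model_a = {"w": [1.0, 2.0]}
model_b = {"w": [3.0, 4.0]}
print(merge_weights([model_a, model_b]))
print(merge_weights([model_a, model_b], coeffs=[0.25, 0.75]))
```

Uniform averaging has nothing to tune, which fits the abstract's observation that it is competitive when the merged models are already similar.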

[LG-116] ElasticTok: Adaptive Tokenization for Image and Video

Link: https://arxiv.org/abs/2410.08368
Authors: Wilson Yan,Matei Zaharia,Volodymyr Mnih,Pieter Abbeel,Aleksandra Faust,Hao Liu
Keywords: learning general purpose, general purpose vision, video tokenization remains, purpose vision models, processing long video
Categories: Machine Learning (cs.LG)

Abstract:Efficient video tokenization remains a key bottleneck in learning general purpose vision models that are capable of processing long video sequences. Prevailing approaches are restricted to encoding videos to a fixed number of tokens, where too few tokens will result in overly lossy encodings, and too many tokens will result in prohibitively long sequence lengths. In this work, we introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens. To enable this in a computationally scalable way, we propose a masking technique that drops a random number of tokens at the end of each frame’s token encoding. During inference, ElasticTok can dynamically allocate tokens when needed – more complex data can leverage more tokens, while simpler data only needs a few tokens. Our empirical evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage, paving the way for future development of more powerful multimodal models, world models, and agents.
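The tail-dropping idea can be pictured as a mask that keeps a random-length prefix of each frame's tokens. The sketch below is only an illustration of that idea: the uniform sampling, the `min_keep` parameter, and the function name `random_tail_mask` are assumptions, since the abstract does not say how the number of dropped tokens is drawn.

```python
import random


def random_tail_mask(num_tokens, min_keep=1, rng=random):
    """Return a 0/1 mask that keeps a random-length prefix of a frame's tokens.

    1 marks a kept token and 0 a dropped one; all dropped tokens are at the end.
    """
    keep = rng.randint(min_keep, num_tokens)  # inclusive on both ends
    return [1] * keep + [0] * (num_tokens - keep)


rng = random.Random(0)
frame_masks = [random_tail_mask(8, rng=rng) for _ in range(3)]  # one mask per frame
```

Because the kept tokens always form a prefix, a decoder trained this way can be asked at inference time to reconstruct from any prefix length, which is what allows the token budget to vary with content.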

[LG-117] Towards Optimal Environmental Policies: Policy Learning under Arbitrary Bipartite Network Interference

Link: https://arxiv.org/abs/2410.08362
Authors: Raphael C. Kim,Falco J. Bargagli-Stoffi,Kevin L. Chen,Rachel C. Nethery
Keywords: substantial effect, air pollution, mortality burdens, power, hazardous air pollution
Categories: Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:The substantial effect of air pollution on cardiovascular disease and mortality burdens is well-established. Emissions-reducing interventions on coal-fired power plants – a major source of hazardous air pollution – have proven to be an effective, but costly, strategy for reducing pollution-related health burdens. Targeting the power plants that achieve maximum health benefits while satisfying realistic cost constraints is challenging. The primary difficulty lies in quantifying the health benefits of intervening at particular plants. This is further complicated because interventions are applied on power plants, while health impacts occur in potentially distant communities, a setting known as bipartite network interference (BNI). In this paper, we introduce novel policy learning methods based on Q- and A-Learning to determine the optimal policy under arbitrary BNI. We derive asymptotic properties and demonstrate finite sample efficacy in simulations. We apply our novel methods to a comprehensive dataset of Medicare claims, power plant data, and pollution transport networks. Our goal is to determine the optimal strategy for installing power plant scrubbers to minimize ischemic heart disease (IHD) hospitalizations under various cost constraints. We find that annual IHD hospitalization rates could be reduced in a range from 20.66-44.51 per 10,000 person-years through optimal policies under different cost constraints.

[LG-118] Minimax Hypothesis Testing for the Bradley-Terry-Luce Model

Link: https://arxiv.org/abs/2410.08360
Authors: Anuran Makur,Japneet Singh
Keywords: BTL model, BTL, BTL model endows, underlying BTL model, alpha
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Statistics Theory (math.ST)
Comments: 54 pages, 6 figures

Abstract:The Bradley-Terry-Luce (BTL) model is one of the most widely used models for ranking a collection of items or agents based on pairwise comparisons among them. Given $n$ agents, the BTL model endows each agent $i$ with a latent skill score $\alpha_i > 0$ and posits that the probability that agent $i$ is preferred over agent $j$ is $\alpha_i/(\alpha_i + \alpha_j)$. In this work, our objective is to formulate a hypothesis test that determines whether a given pairwise comparison dataset, with $k$ comparisons per pair of agents, originates from an underlying BTL model. We formalize this testing problem in the minimax sense and define the critical threshold of the problem. We then establish upper bounds on the critical threshold for general induced observation graphs (satisfying mild assumptions) and develop lower bounds for complete induced graphs. Our bounds demonstrate that for complete induced graphs, the critical threshold scales as $\Theta((nk)^{-1/2})$ in a minimax sense. In particular, our test statistic for the upper bounds is based on a new approximation we derive for the separation distance between general pairwise comparison models and the class of BTL models. To further assess the performance of our statistical test, we prove upper bounds on the type I and type II probabilities of error. Much of our analysis is conducted within the context of a fixed observation graph structure, where the graph possesses certain “nice” properties, such as expansion and bounded principal ratio. Additionally, we derive several auxiliary results, such as bounds on principal ratios of graphs, $\ell^2$-bounds on BTL parameter estimation under model mismatch, stability of rankings under the BTL model, etc. We validate our theoretical results through experiments on synthetic and real-world datasets and propose a data-driven permutation testing approach to determine test thresholds.
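The BTL preference probability described in the abstract is a one-line formula. The sketch below evaluates it for a small set of skill scores; the function name and the example scores are illustrative assumptions, and the paper's hypothesis test itself is not reproduced here.

```python
def btl_preference_prob(alpha, i, j):
    """Probability that agent i is preferred over agent j under the BTL model."""
    if alpha[i] <= 0 or alpha[j] <= 0:
        raise ValueError("skill scores must be strictly positive")
    return alpha[i] / (alpha[i] + alpha[j])


skills = [1.0, 3.0, 4.0]  # hypothetical latent skill scores
pairwise = [[btl_preference_prob(skills, i, j) if i != j else None
             for j in range(len(skills))]
            for i in range(len(skills))]
```

Only ratios of skill scores matter: multiplying every score by the same positive constant leaves all pairwise probabilities unchanged.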

[LG-119] Metalic: Meta-Learning In-Context with Protein Language Models

Link: https://arxiv.org/abs/2410.08355
Authors: Jacob Beck,Shikha Surana,Manus McAuliffe,Oliver Bent,Thomas D. Barrett,Juan Jose Garau Luis,Paul Duckworth
Keywords: Predicting the biophysical, silico protein design, fitness prediction, prediction, biophysical and functional
Categories: Machine Learning (cs.LG)

Abstract:Predicting the biophysical and functional properties of proteins is essential for in silico protein design. Machine learning has emerged as a promising technique for such prediction tasks. However, the relative scarcity of in vitro annotations means that these models often have little, or no, specific data on the desired fitness prediction task. As a result of limited data, protein language models (PLMs) are typically trained on general protein sequence modeling tasks, and then fine-tuned, or applied zero-shot, to protein fitness prediction. When no task data is available, the models make strong assumptions about the correlation between the protein sequence likelihood and fitness scores. In contrast, we propose meta-learning over a distribution of standard fitness prediction tasks, and demonstrate positive transfer to unseen fitness prediction tasks. Our method, called Metalic (Meta-Learning In-Context), uses in-context learning and fine-tuning, when data is available, to adapt to new tasks. Crucially, fine-tuning enables considerable generalization, even though it is not accounted for during meta-training. Our fine-tuned models achieve strong results with 18 times fewer parameters than state-of-the-art models. Moreover, our method sets a new state-of-the-art in low-data settings on ProteinGym, an established fitness-prediction benchmark. Due to data scarcity, we believe meta-learning will play a pivotal role in advancing protein engineering.

[LG-120] Simultaneous Weight and Architecture Optimization for Neural Networks NEURIPS2024

Link: https://arxiv.org/abs/2410.08339
Authors: Zitong Huang,Mansooreh Montazerin,Ajitesh Srivastava
Keywords: Neural Architecture Search, Neural networks, Neural, compact neural networks, trained by choosing
Categories: Machine Learning (cs.LG)
Comments: Accepted to NeurIPS 2024 FITML (Fine-Tuning in Modern Machine Learning) Workshop

Abstract:Neural networks are trained by choosing an architecture and training the parameters. The choice of architecture is often by trial and error or with Neural Architecture Search (NAS) methods. While NAS provides some automation, it often relies on discrete steps that optimize the architecture and then train the parameters. We introduce a novel neural network training framework that fundamentally transforms the process by learning architecture and parameters simultaneously with gradient descent. With the appropriate setting of the loss function, it can discover sparse and compact neural networks for given datasets. Central to our approach is a multi-scale encoder-decoder, in which the encoder embeds pairs of neural networks with similar functionalities close to each other (irrespective of their architectures and weights). To train a neural network with a given dataset, we randomly sample a neural network embedding in the embedding space and then perform gradient descent using our custom loss function, which incorporates a sparsity penalty to encourage compactness. The decoder generates a neural network corresponding to the embedding. Experiments demonstrate that our framework can discover sparse and compact neural networks maintaining a high performance.

[LG-121] Kernel Banzhaf: A Fast and Robust Estimator for Banzhaf Values

Link: https://arxiv.org/abs/2410.08336
Authors: Yurong Liu,R. Teal Witter,Flip Korn,Tarfah Alrashed,Dimitris Paparas,Juliana Freire
Keywords: widely-used Shapley, Kernel Banzhaf, introduce Kernel Banzhaf, establishing Kernel Banzhaf, Kernel Banzhaf substantially
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Banzhaf values offer a simple and interpretable alternative to the widely-used Shapley values. We introduce Kernel Banzhaf, a novel algorithm inspired by KernelSHAP, that leverages an elegant connection between Banzhaf values and linear regression. Through extensive experiments on feature attribution tasks, we demonstrate that Kernel Banzhaf substantially outperforms other algorithms for estimating Banzhaf values in both sample efficiency and robustness to noise. Furthermore, we prove theoretical guarantees on the algorithm’s performance, establishing Kernel Banzhaf as a valuable tool for interpretable machine learning.
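For readers unfamiliar with the quantity being estimated, the sketch below computes exact Banzhaf values by brute force: each player's marginal contribution averaged uniformly over all coalitions of the other players. This is the standard definition, not the Kernel Banzhaf algorithm, and it costs on the order of 2^n value-function calls, which is the cost that sampling estimators aim to avoid. The function name and the toy majority game are assumptions for illustration.

```python
from itertools import combinations


def banzhaf_values(n, value):
    """Exact Banzhaf value of each of n players for a set function `value`.

    `value` maps a frozenset of player indices to a number.
    """
    scores = []
    for i in range(n):
        others = [p for p in range(n) if p != i]
        total = 0.0
        for size in range(len(others) + 1):
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                # Marginal contribution of player i to this coalition.
                total += value(coalition | {i}) - value(coalition)
        scores.append(total / 2 ** (n - 1))
    return scores


def majority(coalition):
    """Toy game: a coalition wins when it holds at least 2 of 3 players."""
    return 1.0 if len(coalition) >= 2 else 0.0


print(banzhaf_values(3, majority))
```

In a feature attribution setting, the players would be features and `value` would evaluate the model with only the features in the coalition present.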

[LG-122] Exploring Natural Language-Based Strategies for Efficient Number Learning in Children through Reinforcement Learning

Link: https://arxiv.org/abs/2410.08334
Authors: Tirthankar Mittra
Keywords: children learn numbers, paper investigates, investigates how children, children learn, reinforcement learning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Abstract:This paper investigates how children learn numbers using the framework of reinforcement learning (RL), with a focus on the impact of language instructions. The motivation for using reinforcement learning stems from its parallels with psychological learning theories in controlled environments. By using state of the art deep reinforcement learning models, we simulate and analyze the effects of various forms of language instructions on number acquisition. Our findings indicate that certain linguistic structures more effectively improve numerical comprehension in RL agents. Additionally, our model predicts optimal sequences for presenting numbers to RL agents which enhance their speed of learning. This research provides valuable insights into the interplay between language and numerical cognition, with implications for both educational strategies and the development of artificial intelligence systems designed to support early childhood learning.

[LG-123] Physics and Deep Learning in Computational Wave Imaging

Link: https://arxiv.org/abs/2410.08329
Authors: Youzuo Lin,Shihang Feng,James Theiler,Yinpeng Chen,Umberto Villa,Jing Rao,John Greenhall,Cristian Pantea,Mark A. Anastasio,Brendt Wohlberg
Keywords: extracts hidden structure, analyzing wave signals, extracts hidden, hidden structure, structure and physical
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 29 pages, 11 figures

Abstract:Computational wave imaging (CWI) extracts hidden structure and physical properties of a volume of material by analyzing wave signals that traverse that volume. Applications include seismic exploration of the Earth’s subsurface, acoustic imaging and non-destructive testing in material science, and ultrasound computed tomography in medicine. Current approaches for solving CWI problems can be divided into two categories: those rooted in traditional physics, and those based on deep learning. Physics-based methods stand out for their ability to provide high-resolution and quantitatively accurate estimates of acoustic properties within the medium. However, they can be computationally intensive and are susceptible to ill-posedness and nonconvexity typical of CWI problems. Machine learning-based computational methods have recently emerged, offering a different perspective to address these challenges. Diverse scientific communities have independently pursued the integration of deep learning in CWI. This review delves into how contemporary scientific machine-learning (ML) techniques, and deep neural networks in particular, have been harnessed to tackle CWI problems. We present a structured framework that consolidates existing research spanning multiple domains, including computational imaging, wave physics, and data science. This study concludes with important lessons learned from existing ML-based methods and identifies technical hurdles and emerging trends through a systematic analysis of the extensive literature on this topic.

[LG-124] Agents Thinking Fast and Slow: A Talker-Reasoner Architecture

Link: https://arxiv.org/abs/2410.08328
Authors: Konstantina Christakopoulou,Shibl Mourad,Maja Matarić
Keywords: Large language models, Large language, natural conversation, language models, models have enabled
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Large language models have enabled agents of all kinds to interact with users through natural conversation. Consequently, agents now have two jobs: conversing and planning/reasoning. Their conversational responses must be informed by all available information, and their actions must help to achieve goals. This dichotomy between conversing with the user and doing multi-step reasoning and planning can be seen as analogous to the human systems of “thinking fast and slow” as introduced by Kahneman. Our approach is comprised of a “Talker” agent (System 1) that is fast and intuitive, and tasked with synthesizing the conversational response; and a “Reasoner” agent (System 2) that is slower, more deliberative, and more logical, and is tasked with multi-step reasoning and planning, calling tools, performing actions in the world, and thereby producing the new agent state. We describe the new Talker-Reasoner architecture and discuss its advantages, including modularity and decreased latency. We ground the discussion in the context of a sleep coaching agent, in order to demonstrate real-world relevance.

[LG-125] Neural Architecture Search of Hybrid Models for NPU-CIM Heterogeneous AR/VR Devices

Link: https://arxiv.org/abs/2410.08326
Authors: Yiwei Zhao,Ziyun Li,Win-San Khwa,Xiaoyu Sun,Sai Qian Zhang,Syed Shakib Sarwar,Kleber Hugo Stangherlin,Yi-Lun Lu,Jorge Tomas Gomez,Jae-Sun Seo,Phillip B. Gibbons,Barbara De Salvo,Chiao Liu
Keywords: Augmented Reality applications, Virtual Reality, Augmented Reality, Reality applications, Reality and Augmented
Categories: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR); Machine Learning (cs.LG); Performance (cs.PF)

Abstract:Low-Latency and Low-Power Edge AI is essential for Virtual Reality and Augmented Reality applications. Recent advances show that hybrid models, combining convolution layers (CNN) and transformers (ViT), often achieve superior accuracy/performance tradeoff on various computer vision and machine learning (ML) tasks. However, hybrid ML models can pose system challenges for latency and energy-efficiency due to their diverse nature in dataflow and memory access patterns. In this work, we leverage the architecture heterogeneity from Neural Processing Units (NPU) and Compute-In-Memory (CIM) and perform diverse execution schemas to efficiently execute these hybrid models. We also introduce H4H-NAS, a Neural Architecture Search framework to design efficient hybrid CNN/ViT models for heterogeneous edge systems with both NPU and CIM. Our H4H-NAS approach is powered by a performance estimator built with NPU performance results measured on real silicon, and CIM performance based on industry IPs. H4H-NAS searches hybrid CNN/ViT models with fine granularity and achieves significant (up to 1.34%) top-1 accuracy improvement on ImageNet dataset. Moreover, results from our Algo/HW co-design reveal up to 56.08% overall latency and 41.72% energy improvements by introducing such heterogeneous computing over baseline solutions. The framework guides the design of hybrid network architectures and system architectures of NPU+CIM heterogeneous systems.

[LG-126] The language of sound search: Examining User Queries in Audio Search Engines

Link: https://arxiv.org/abs/2410.08324
Authors: Benno Weck,Frederic Font
Keywords: study examines textual, general audio retrieval, audio retrieval, audio retrieval systems, text-based audio retrieval
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted at DCASE 2024. Supplementary materials at this https URL

Abstract:This study examines textual, user-written search queries within the context of sound search engines, encompassing various applications such as foley, sound effects, and general audio retrieval. Current research inadequately addresses real-world user needs and behaviours in designing text-based audio retrieval systems. To bridge this gap, we analysed search queries from two sources: a custom survey and Freesound website query logs. The survey was designed to collect queries for an unrestricted, hypothetical sound search engine, resulting in a dataset that captures user intentions without the constraints of existing systems. This dataset is also made available for sharing with the research community. In contrast, the Freesound query logs encompass approximately 9 million search requests, providing a comprehensive view of real-world usage patterns. Our findings indicate that survey queries are generally longer than Freesound queries, suggesting users prefer detailed queries when not limited by system constraints. Both datasets predominantly feature keyword-based queries, with few survey participants using full sentences. Key factors influencing survey queries include the primary sound source, intended usage, perceived location, and the number of sound sources. These insights are crucial for developing user-centred, effective text-based audio retrieval systems, enhancing our understanding of user behaviour in sound search contexts.

[LG-127] Do You Know What You Are Talking About? Characterizing Query-Knowledge Relevance For Reliable Retrieval Augmented Generation

Link: https://arxiv.org/abs/2410.08320
Authors: Zhuohang Li,Jiaxin Zhang,Chao Yan,Kamalika Das,Sricharan Kumar,Murat Kantarcioglu,Bradley A. Malin
Keywords: Language models, external knowledge corpus, hallucinations and misinformation, suffer from hallucinations, knowledge corpus
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Language models (LMs) are known to suffer from hallucinations and misinformation. Retrieval augmented generation (RAG) that retrieves verifiable information from an external knowledge corpus to complement the parametric knowledge in LMs provides a tangible solution to these problems. However, the generation quality of RAG is highly dependent on the relevance between a user’s query and the retrieved documents. Inaccurate responses may be generated when the query is outside of the scope of knowledge represented in the external knowledge corpus or if the information in the corpus is out-of-date. In this work, we establish a statistical framework that assesses how well a query can be answered by an RAG system by capturing the relevance of knowledge. We introduce an online testing procedure that employs goodness-of-fit (GoF) tests to inspect the relevance of each user query to detect out-of-knowledge queries with low knowledge relevance. Additionally, we develop an offline testing framework that examines a collection of user queries, aiming to detect significant shifts in the query distribution which indicates the knowledge corpus is no longer sufficiently capable of supporting the interests of the users. We demonstrate the capabilities of these strategies through a systematic evaluation on eight question-answering (QA) datasets, the results of which indicate that the new testing framework is an efficient solution to enhance the reliability of existing RAG systems.

[LG-128] HyperDPO: Hypernetwork-based Multi-Objective Fine-Tuning Framework

Link: https://arxiv.org/abs/2410.08316
Authors: Yinuo Ren,Tesi Xiao,Michael Shavlovsky,Lexing Ying,Holakou Rahmanian
Keywords: Direct Preference Optimization, LLM alignment, efficient LLM alignment, Multi-Objective Fine-Tuning, faces the Multi-Objective
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Optimization and Control (math.OC)

Abstract:In LLM alignment and many other ML applications, one often faces the Multi-Objective Fine-Tuning (MOFT) problem, i.e. fine-tuning an existing model with datasets labeled w.r.t. different objectives simultaneously. To address the challenge, we propose the HyperDPO framework, a hypernetwork-based approach that extends the Direct Preference Optimization (DPO) technique, originally developed for efficient LLM alignment with preference data, to accommodate the MOFT settings. By substituting the Bradley-Terry-Luce model in DPO with the Plackett-Luce model, our framework is capable of handling a wide range of MOFT tasks that involve listwise ranking datasets. Compared with previous approaches, HyperDPO enjoys an efficient one-shot training process for profiling the Pareto front of auxiliary objectives, and offers flexible post-training control over trade-offs. Additionally, we propose a novel Hyper Prompt Tuning design, that conveys continuous weight across objectives to transformer-based models without altering their architecture. We demonstrate the effectiveness and efficiency of the HyperDPO framework through its applications to various tasks, including Learning-to-Rank (LTR) and LLM alignment, highlighting its viability for large-scale ML deployments.
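The move from pairwise to listwise data rests on the Plackett-Luce model, which scores a full ranking as a sequence of softmax choices over the items not yet placed. The sketch below computes that log-likelihood in plain Python. It shows the standard model only, not the HyperDPO objective, and the function name and log-score parameterization are assumptions for illustration.

```python
import math


def plackett_luce_log_likelihood(scores, ranking):
    """Log-probability of `ranking` (best item first) under Plackett-Luce.

    `scores[item]` is a real-valued log-score; at each position the next item
    is chosen by a softmax over the items that have not been placed yet.
    """
    log_lik = 0.0
    for pos in range(len(ranking)):
        remaining = [scores[item] for item in ranking[pos:]]
        top = max(remaining)  # subtract the max for numerical stability
        log_norm = top + math.log(sum(math.exp(s - top) for s in remaining))
        log_lik += scores[ranking[pos]] - log_norm
    return log_lik


print(plackett_luce_log_likelihood([2.0, 0.5, -1.0], [0, 1, 2]))
```

With exactly two items the expression reduces to the Bradley-Terry-Luce probability with skills exp(score), which is why Plackett-Luce is a natural listwise generalization of the pairwise model used in DPO.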

[LG-129] Dynamics of Concept Learning and Compositional Generalization

Link: https://arxiv.org/abs/2410.08309
Authors: Yongyi Yang,Core Francisco Park,Ekdeep Singh Lubana,Maya Okawa,Wei Hu,Hidenori Tanaka
Keywords: primitive concepts underlying, data-generating process, compositional data-generating process, learning dynamics, manipulate primitive concepts
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Prior work has shown that text-conditioned diffusion models can learn to identify and manipulate primitive concepts underlying a compositional data-generating process, enabling generalization to entirely novel, out-of-distribution compositions. Beyond performance evaluations, these studies develop a rich empirical phenomenology of learning dynamics, showing that models generalize sequentially, respecting the compositional hierarchy of the data-generating process. Moreover, concept-centric structures within the data significantly influence a model’s speed of learning the ability to manipulate a concept. In this paper, we aim to better characterize these empirical results from a theoretical standpoint. Specifically, we propose an abstraction of prior work’s compositional generalization problem by introducing a structured identity mapping (SIM) task, where a model is trained to learn the identity mapping on a Gaussian mixture with structurally organized centroids. We mathematically analyze the learning dynamics of neural networks trained on this SIM task and show that, despite its simplicity, SIM’s learning dynamics capture and help explain key empirical observations on compositional generalization with diffusion models identified in prior work. Our theory also offers several new insights – e.g., we find a novel mechanism for non-monotonic learning dynamics of test loss in early phases of training. We validate our new predictions by training a text-conditioned diffusion model, bridging our simplified framework and complex generative models. Overall, this work establishes the SIM task as a meaningful theoretical abstraction of concept learning dynamics in modern generative models.

[LG-130] Machine Learning for Missing Value Imputation

Link: https://arxiv.org/abs/2410.08308
Authors: Abu Fuad Ahmad,Khaznah Alshammari,Istiaque Ahmed,MD Shohel Sayed
Keywords: Missing Value Imputation, recent times, considerable number, address the issue, MVI methods
Categories: Machine Learning (cs.LG)

Abstract:In recent times, a considerable number of research studies have been carried out to address the issue of Missing Value Imputation (MVI). MVI aims to provide a primary solution for datasets that have one or more missing attribute values. The advancements in Artificial Intelligence (AI) drive the development of new and improved machine learning (ML) algorithms and methods. The advancements in ML have opened up significant opportunities for effectively imputing these missing values. The main objective of this article is to conduct a comprehensive and rigorous review, as well as analysis, of the state-of-the-art ML applications in MVI methods. This analysis seeks to enhance researchers’ understanding of the subject and facilitate the development of robust and impactful interventions in data preprocessing for Data Analytics. The review is performed following the Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) technique. More than 100 articles published between 2014 and 2023 are critically reviewed, considering the methods and findings. Furthermore, the latest literature is examined to scrutinize the trends in MVI methods and their evaluation. The accomplishments and limitations of the existing literature are discussed in detail. The survey concludes by identifying the current gaps in research and providing suggestions for future research directions and emerging trends in related fields of interest.

[LG-131] UNIQ: Offline Inverse Q-learning for Avoiding Undesirable Demonstrations

Link: https://arxiv.org/abs/2410.08307
Authors: Huy Hoang,Tien Mai,Pradeep Varakantham
Keywords: avoids undesirable demonstrations, undesirable demonstrations, learning, offline imitation learning, learning policy
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:We address the problem of offline learning a policy that avoids undesirable demonstrations. Unlike conventional offline imitation learning approaches that aim to imitate expert or near-optimal demonstrations, our setting involves avoiding undesirable behavior (specified using undesirable demonstrations). To tackle this problem, unlike standard imitation learning where the aim is to minimize the distance between learning policy and expert demonstrations, we formulate the learning task as maximizing a statistical distance, in the space of state-action stationary distributions, between the learning policy and the undesirable policy. This significantly different approach results in a novel training objective that necessitates a new algorithm to address it. Our algorithm, UNIQ, tackles these challenges by building on the inverse Q-learning framework, framing the learning problem as a cooperative (non-adversarial) task. We then demonstrate how to efficiently leverage unlabeled data for practical training. Our method is evaluated on standard benchmark environments, where it consistently outperforms state-of-the-art baselines. The code implementation can be accessed at: this https URL.

[LG-132] Randomized Asymmetric Chain of LoRA: The First Meaningful Theoretical Framework for Low-Rank Adaptation

Link: https://arxiv.org/abs/2410.08305
Authors: Grigory Malinovsky,Umberto Michieli,Hasan Abed Al Kader Hammoud,Taha Ceritli,Hayder Elesedy,Mete Ozay,Peter Richtárik
Keywords: adapting large foundational, large foundational models, specific tasks, adapting large, large foundational
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 36 pages, 4 figures, 2 algorithms

Abstract:Fine-tuning has become a popular approach to adapting large foundational models to specific tasks. As the size of models and datasets grows, parameter-efficient fine-tuning techniques are increasingly important. One of the most widely used methods is Low-Rank Adaptation (LoRA), with adaptation update expressed as the product of two low-rank matrices. While LoRA was shown to possess strong performance in fine-tuning, it often under-performs when compared to full-parameter fine-tuning (FPFT). Although many variants of LoRA have been extensively studied empirically, their theoretical optimization analysis is heavily under-explored. The starting point of our work is a demonstration that LoRA and its two extensions, Asymmetric LoRA and Chain of LoRA, indeed encounter convergence issues. To address these issues, we propose Randomized Asymmetric Chain of LoRA (RAC-LoRA) – a general optimization framework that rigorously analyzes the convergence rates of LoRA-based methods. Our approach inherits the empirical benefits of LoRA-style heuristics, but introduces several small but important algorithmic modifications which turn it into a provably convergent method. Our framework serves as a bridge between FPFT and low-rank adaptation. We provide provable guarantees of convergence to the same solution as FPFT, along with the rate of convergence. Additionally, we present a convergence analysis for smooth, non-convex loss functions, covering gradient descent, stochastic gradient descent, and federated learning settings. Our theoretical findings are supported by experimental results.
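
The abstract describes the LoRA update as the product of two low-rank matrices. A minimal sketch of that plain LoRA update (not the RAC-LoRA method proposed in the paper) might look as follows; the helper names, the `scale` parameter, and the toy matrices are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_update(w, b, a, scale=1.0):
    """Return W + scale * (B @ A); W stays frozen, only B and A would be trained."""
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_params(d_out, d_in, rank):
    """Parameter counts as (full fine-tuning, LoRA with the given rank)."""
    return d_out * d_in, rank * (d_out + d_in)


# Toy 3x4 frozen weight with a rank-1 update (B is 3x1, A is 1x4).
W = [[0.0] * 4 for _ in range(3)]
B = [[1.0], [2.0], [3.0]]
A = [[1.0, 0.0, -1.0, 0.5]]
W_adapted = lora_update(W, B, A)
```

For a hypothetical 4096 x 4096 layer, `trainable_params(4096, 4096, 8)` shows why the low-rank form is parameter-efficient: the rank-8 factors hold far fewer entries than the full matrix.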

[LG-133] Global Lyapunov functions: a long-standing open problem in mathematics with symbolic transformers

Link: https://arxiv.org/abs/2410.08304
Authors: Alberto Alfarano, François Charton, Amaury Hayat
Keywords: complex reasoning tasks, spectacular progress, language models, reasoning tasks, models still struggle
Categories: Machine Learning (cs.LG)

Abstract:Despite their spectacular progress, language models still struggle on complex reasoning tasks, such as advanced mathematics. We consider a long-standing open problem in mathematics: discovering a Lyapunov function that ensures the global stability of a dynamical system. This problem has no known general solution, and algorithmic solvers only exist for some small polynomial systems. We propose a new method for generating synthetic training samples from random solutions, and show that sequence-to-sequence transformers trained on such datasets perform better than algorithmic solvers and humans on polynomial systems, and can discover new Lyapunov functions for non-polynomial systems.

[LG-134] A Framework to Enable Algorithmic Design Choice Exploration in DNNs

Link: https://arxiv.org/abs/2410.08300
Authors: Timothy L. Cronin IV, Sanmukh Kuppannagari
Keywords: demonstrated significant success, deep neural networks, Deep learning technologies, neural networks, demonstrated significant
Categories: Machine Learning (cs.LG)
Comments: IEEE HPEC 2024

Abstract:Deep learning technologies, particularly deep neural networks (DNNs), have demonstrated significant success across many domains. This success has been accompanied by substantial advancements and innovations in the algorithms behind the operations required by DNNs. These enhanced algorithms hold the potential to greatly increase the performance of DNNs. However, discovering the best-performing algorithm for a DNN and altering the DNN to use such an algorithm is a difficult and time-consuming task. To address this, we introduce an open-source framework which provides easy-to-use, fine-grained algorithmic control for DNNs, enabling algorithmic exploration and selection. Along with built-in high-performance implementations of common deep learning operations, the framework enables users to implement and select their own algorithms to be utilized by the DNN. The framework’s built-in accelerated implementations are shown to yield outputs equivalent to, and exhibit performance similar to, implementations in PyTorch, a popular DNN framework. Moreover, the framework incurs no additional performance overhead, meaning that performance depends solely on the algorithms chosen by the user.

[LG-135] Privately Learning from Graphs with Applications in Fine-tuning Large Language Models

Link: https://arxiv.org/abs/2410.08299
Authors: Haoteng Yin, Rongzhe Wei, Eli Chien, Pan Li
Keywords: complementing data modalities, Graphs offer unique, offer unique insights, interactions between entities, modalities like text
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)

Abstract:Graphs offer unique insights into relationships and interactions between entities, complementing data modalities like text, images, and videos. By incorporating relational information from graph data, AI models can extend their capabilities beyond traditional tasks. However, relational data in sensitive domains such as finance and healthcare often contain private information, making privacy preservation crucial. Existing privacy-preserving methods, such as DP-SGD, which rely on gradient decoupling assumptions, are not well-suited for relational learning due to the inherent dependencies between coupled training samples. To address this challenge, we propose a privacy-preserving relational learning pipeline that decouples dependencies in sampled relations during training, ensuring differential privacy through a tailored application of DP-SGD. We apply this method to fine-tune large language models (LLMs) on sensitive graph data, and tackle the associated computational complexities. Our approach is evaluated on LLMs of varying sizes (e.g., BERT, Llama2) using real-world relational data from four text-attributed graphs. The results demonstrate significant improvements in relational learning tasks, all while maintaining robust privacy guarantees during training. Additionally, we explore the trade-offs between privacy, utility, and computational efficiency, offering insights into the practical deployment of our approach. Code is available at this https URL.
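
The abstract refers to DP-SGD, which assumes each training sample contributes its own separable gradient. A minimal sketch of one standard DP-SGD gradient step (clip each per-sample gradient, sum, add Gaussian noise, average) is given below; it is not the relational pipeline proposed in the paper, and the function names, parameter names, and toy gradients are illustrative assumptions.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]


def dp_sgd_gradient(per_sample_grads, max_norm, noise_multiplier, rng):
    """Clip each per-sample gradient, sum, add Gaussian noise, then average."""
    clipped = [clip(g, max_norm) for g in per_sample_grads]
    dim = len(per_sample_grads[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm  # noise scale tied to the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_sample_grads) for v in noisy]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # L2 norms 5.0 and 0.5
# With the noise switched off, only the clipping effect remains.
noiseless = dp_sgd_gradient(grads, max_norm=1.0, noise_multiplier=0.0, rng=rng)
```

Clipping bounds the influence of any single sample only when samples are independent; when one relation couples several samples, as the abstract notes, that per-sample bound no longer directly applies.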

[LG-136] Impact of Missing Values in Machine Learning: A Comprehensive Analysis

Link: https://arxiv.org/abs/2410.08295
Authors: Abu Fuad Ahmad, Md Shohel Sayeed, Khaznah Alshammari, Istiaque Ahmed
Keywords: Machine learning, big data analysis, data mining, big data, missing
Categories: Machine Learning (cs.LG)

Abstract:Machine learning (ML) has become a ubiquitous tool across various domains of data mining and big data analysis. The efficacy of ML models depends heavily on high-quality datasets, which are often complicated by the presence of missing values. Consequently, the performance and generalization of ML models are at risk in the face of such datasets. This paper aims to examine the nuanced impact of missing values on ML workflows, including their types, causes, and consequences. Our analysis focuses on the challenges posed by missing values, including biased inferences, reduced predictive power, and increased computational burdens. The paper further explores strategies for handling missing values, including imputation techniques and removal strategies, and investigates how missing values affect model evaluation metrics and introduces complexities in cross-validation and model selection. The study employs case studies and real-world examples to illustrate the practical implications of addressing missing values. Finally, the discussion extends to future research directions, emphasizing the need for handling missing values ethically and transparently. The primary goal of this paper is to provide insights into the pervasive impact of missing values on ML models and guide practitioners toward effective strategies for achieving robust and reliable model outcomes.
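
The abstract contrasts removal strategies with imputation techniques. As a rough illustration of the difference, the sketch below applies listwise deletion and column-mean imputation to a toy table; the function names and data are assumptions for illustration, and missing entries are represented as `None`.

```python
from statistics import mean


def drop_incomplete(rows):
    """Listwise deletion: keep only rows with no missing (None) values."""
    return [row for row in rows if all(v is not None for v in row)]


def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column.

    Assumes every column has at least one observed value.
    """
    columns = list(zip(*rows))
    col_means = [mean(v for v in col if v is not None) for col in columns]
    return [[col_means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


data = [[1.0, 10.0], [2.0, None], [None, 30.0], [4.0, 40.0]]
kept = drop_incomplete(data)   # only the fully observed rows remain
imputed = mean_impute(data)    # every row remains, gaps filled with column means
```

Deletion halves this toy dataset, while mean imputation keeps every row at the cost of shrinking the variance of the filled columns.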

[LG-137] Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning?

Link: https://arxiv.org/abs/2410.08292
Authors: Khashayar Gatmiry, Nikunj Saunshi, Sashank J. Reddi, Stefanie Jegelka, Sanjiv Kumar
Keywords: single forward pass, few-shot learning, forward pass, multi-step algorithms, remarkable capability
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:The remarkable capability of Transformers to do reasoning and few-shot learning, without any fine-tuning, is widely conjectured to stem from their ability to implicitly simulate multi-step algorithms – such as gradient descent – with their weights in a single forward pass. Recently, there has been progress in understanding this complex phenomenon from an expressivity point of view, by demonstrating that Transformers can express such multi-step algorithms. However, our knowledge about the more fundamental aspect of its learnability, beyond single layer models, is very limited. In particular, can training Transformers enable convergence to algorithmic solutions? In this work we resolve this for in-context linear regression with linear looped Transformers – a multi-layer model with weight sharing that is conjectured to have an inductive bias to learn fixed-point iterative algorithms. More specifically, for this setting we show that the global minimizer of the population training loss implements multi-step preconditioned gradient descent, with a preconditioner that adapts to the data distribution. Furthermore, we show a fast convergence for gradient flow on the regression loss, despite the non-convexity of the landscape, by proving a novel gradient dominance condition. To our knowledge, this is the first theoretical analysis for a multi-layer Transformer in this setting. We further validate our theoretical findings through synthetic experiments.
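
The abstract states that the learned model implements multi-step preconditioned gradient descent for linear regression. The sketch below runs that algorithm directly (it does not build or train a Transformer), using the update w <- w - P * grad(w); the diagonal preconditioner, the toy data, and all names are illustrative assumptions rather than the paper's construction.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def gradient(xs, ys, w):
    """Gradient of the loss (1/2n) * sum_i (x_i . w - y_i)^2 with respect to w."""
    n = len(xs)
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        residual = sum(a * b for a, b in zip(x, w)) - y
        for j, xj in enumerate(x):
            grad[j] += residual * xj / n
    return grad


def preconditioned_gd(xs, ys, precond, steps):
    """Run w <- w - P @ grad(w) for a fixed number of steps, starting from zero."""
    w = [0.0] * len(xs[0])
    for _ in range(steps):
        step = matvec(precond, gradient(xs, ys, w))
        w = [wi - si for wi, si in zip(w, step)]
    return w


# Noiseless data generated by w* = [2, -1]; the two features differ in scale.
xs = [[1.0, 0.0], [0.0, 10.0], [1.0, 10.0], [2.0, -10.0]]
ys = [2.0 * a - 1.0 * b for a, b in xs]
n = len(xs)
# Diagonal preconditioner: inverse mean squared value of each feature.
precond = [[n / sum(x[0] ** 2 for x in xs), 0.0],
           [0.0, n / sum(x[1] ** 2 for x in xs)]]
w_hat = preconditioned_gd(xs, ys, precond, steps=50)
```

Because the preconditioner rescales each coordinate by the data's own second moments, the iteration converges to the generating weights despite the badly scaled features, which is the sense in which a preconditioner "adapts to the data distribution".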

[LG-138] Towards Foundation Models for Mixed Integer Linear Programming

Link: https://arxiv.org/abs/2410.08288
Authors: Sirui Li, Janardhan Kulkarni, Ishai Menache, Cathy Wu, Beibin Li
Keywords: Mixed Integer Linear, Integer Linear Programming, Mixed Integer, Linear Programming, Integer Linear
Categories: Machine Learning (cs.LG)

Abstract:Mixed Integer Linear Programming (MILP) is essential for modeling complex decision-making problems but faces challenges in computational tractability and requires expert formulation. Current deep learning approaches for MILP focus on specific problem classes and do not generalize to unseen classes. To address this shortcoming, we take a foundation model training approach, where we train a single deep learning model on a diverse set of MILP problems to generalize across problem classes. As existing datasets for MILP lack diversity and volume, we introduce MILP-Evolve, a novel LLM-based evolutionary framework that is capable of generating a large set of diverse MILP classes with an unlimited amount of instances. We study our methodology on three key learning tasks that capture diverse aspects of MILP: (1) integrality gap prediction, (2) learning to branch, and (3) a new task of aligning MILP instances with natural language descriptions. Our empirical results show that models trained on the data generated by MILP-Evolve achieve significant improvements on unseen problems, including MIPLIB benchmarks. Our work highlights the potential of moving towards a foundation model approach for MILP that can generalize to a broad range of MILP applications. We are committed to fully open-sourcing our work to advance further research.

[LG-139] Neural Material Adaptor for Visual Grounding of Intrinsic Dynamics NEURIPS2024

Link: https://arxiv.org/abs/2410.08257
Authors: Junyi Cao, Shanyan Guan, Yanhao Ge, Wei Li, Xiaokang Yang, Chao Ma
Keywords: humans effortlessly discern, effortlessly discern intrinsic, Neural Material Adaptor, modern AI systems, systems often struggle
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: NeurIPS 2024, the project page: this https URL

Abstract:While humans effortlessly discern intrinsic dynamics and adapt to new scenarios, modern AI systems often struggle. Current methods for visual grounding of dynamics either use pure neural-network-based simulators (black box), which may violate physical laws, or traditional physical simulators (white box), which rely on expert-defined equations that may not fully capture actual dynamics. We propose the Neural Material Adaptor (NeuMA), which integrates existing physical laws with learned corrections, facilitating accurate learning of actual dynamics while maintaining the generalizability and interpretability of physical priors. Additionally, we propose Particle-GS, a particle-driven 3D Gaussian Splatting variant that bridges simulation and observed images, allowing image gradients to be back-propagated to optimize the simulator. Comprehensive experiments on various dynamics in terms of grounded particle accuracy, dynamic rendering quality, and generalization ability demonstrate that NeuMA can accurately capture intrinsic dynamics.

[LG-140] AdaShadow: Responsive Test-time Model Adaptation in Non-stationary Mobile Environments

Link: https://arxiv.org/abs/2410.08256
Authors: Cheng Fang, Sicong Liu, Zimu Zhou, Bin Guo, Jiaqi Tang, Ke Ma, Zhiwen Yu
Keywords: deliver seamless user, seamless user experiences, unpredictable domain shifts, unpredictable domain, evolving environments
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: This paper is accepted by SenSys 2024. Copyright may be transferred without notice

Abstract:On-device adaptation to continual, unpredictable domain shifts is essential for mobile applications like autonomous driving and augmented reality to deliver seamless user experiences in evolving environments. Test-time adaptation (TTA) emerges as a promising solution by tuning model parameters with unlabeled live data immediately before prediction. However, TTA’s unique forward-backward-reforward pipeline notably increases the latency over standard inference, undermining the responsiveness in time-sensitive mobile applications. This paper presents AdaShadow, a responsive test-time adaptation framework for non-stationary mobile data distribution and resource dynamics via selective updates of adaptation-critical layers. Although the tactic is recognized in generic on-device training, TTA’s unsupervised and online context presents unique challenges in estimating layer importance and latency, as well as scheduling the optimal layer update plan. AdaShadow addresses these challenges with a backpropagation-free assessor to rapidly identify critical layers, a unit-based runtime predictor to account for resource dynamics in latency estimation, and an online scheduler for prompt layer update planning. Also, AdaShadow incorporates a memory I/O-aware computation reuse scheme to further reduce latency in the reforward pass. Results show that AdaShadow achieves the best accuracy-latency balance under continual shifts. At low memory and energy costs, AdaShadow provides a 2x to 3.5x speedup (ms-level) over state-of-the-art TTA methods with comparable accuracy and a 14.8% to 25.4% accuracy boost over efficient supervised methods with similar latency.

[LG-141] Generalization from Starvation: Hints of Universality in LLM Knowledge Graph Learning

Link: https://arxiv.org/abs/2410.08255
Authors: David D. Baek, Yuxiao Li, Max Tegmark
Keywords: MLP toy models, LLM in-context learning, Motivated by interpretability, MLP toy, networks represent knowledge
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 13 figures

Abstract:Motivated by interpretability and reliability, we investigate how neural networks represent knowledge during graph learning. We find hints of universality, where equivalent representations are learned across a range of model sizes (from 10^2 to 10^9 parameters) and contexts (MLP toy models, LLM in-context learning and LLM training). We show that these attractor representations optimize generalization to unseen examples by exploiting properties of knowledge graph relations (e.g. symmetry and meta-transitivity). We find experimental support for such universality by showing that LLMs and simpler neural networks can be stitched, i.e., by stitching the first part of one model to the last part of another, mediated only by an affine or almost affine transformation. We hypothesize that this dynamic toward simplicity and generalization is driven by “intelligence from starvation”: where overfitting is minimized by pressure to minimize the use of resources that are either scarce or competed for against other tasks.

[LG-142] Federated Graph Learning for Cross-Domain Recommendation NEURIPS’24

Link: https://arxiv.org/abs/2410.08249
Authors: Ziqi Yang, Zhaopeng Peng, Zihui Wang, Jianzhong Qi, Chaochao Chen, Weike Pan, Chenglu Wen, Cheng Wang, Xiaoliang Fan
Keywords: Cross-domain recommendation, data sparsity problem, offers a promising, promising solution, data sparsity
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by NeurIPS’24

Abstract:Cross-domain recommendation (CDR) offers a promising solution to the data sparsity problem by enabling knowledge transfer across source and target domains. However, many recent CDR models overlook crucial issues such as privacy as well as the risk of negative transfer (which negatively impacts model performance), especially in multi-domain settings. To address these challenges, we propose FedGCDR, a novel federated graph learning framework that securely and effectively leverages positive knowledge from multiple source domains. First, we design a positive knowledge transfer module that ensures privacy during inter-domain knowledge transmission. This module employs differential privacy-based knowledge extraction combined with a feature mapping mechanism, transforming source domain embeddings from federated graph attention networks into reliable domain knowledge. Second, we design a knowledge activation module to filter out potentially harmful or conflicting knowledge from source domains, addressing the issues of negative transfer. This module enhances target domain training by expanding the graph of the target domain to generate reliable domain attentions and fine-tunes the target model for improved negative knowledge filtering and more accurate predictions. We conduct extensive experiments on 16 popular domains of the Amazon dataset, demonstrating that FedGCDR significantly outperforms state-of-the-art methods.

[LG-143] Forecasting mortality associated emergency department crowding

Link: https://arxiv.org/abs/2410.08247
Authors: Jalmari Nevanlinna, Anna Eidstø, Jari Ylä-Mattila, Teemu Koivistoinen, Niku Oksala, Juho Kanniainen, Ari Palomäki, Antti Roine
Keywords: global public health, public health issue, Emergency department, global public, public health
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)

Abstract:Emergency department (ED) crowding is a global public health issue that has been repeatedly associated with increased mortality. Predicting future service demand would enable preventative measures aiming to eliminate crowding along with its detrimental effects. Recent findings in our ED indicate that occupancy ratios exceeding 90% are associated with increased 10-day mortality. In this paper, we aim to predict these crisis periods using retrospective data from a large Nordic ED with a LightGBM model. We provide predictions for the whole ED and individually for its different operational sections. We demonstrate that afternoon crowding can be predicted at 11 a.m. with an AUC of 0.82 (95% CI 0.78-0.86) and at 8 a.m. with an AUC up to 0.79 (95% CI 0.75-0.83). Consequently, we show that forecasting mortality-associated crowding using anonymous administrative data is feasible.

[LG-144] Flex-MoE: Modeling Arbitrary Modality Combination via the Flexible Mixture-of-Experts NEURIPS2024

Link: https://arxiv.org/abs/2410.08245
Authors: Sukwon Yun, Inyoung Choi, Jie Peng, Yangfan Wu, Jingxuan Bao, Qiyiwen Zhang, Jiayi Xin, Qi Long, Tianlong Chen
Keywords: gained increasing importance, Multimodal learning, modality combinations, arbitrary modality combinations, modality
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2024 Spotlight

Abstract:Multimodal learning has gained increasing importance across various fields, offering the ability to integrate data from diverse sources such as images, text, and personalized records, which are frequently observed in medical domains. However, in scenarios where some modalities are missing, many existing frameworks struggle to accommodate arbitrary modality combinations, often relying heavily on a single modality or complete data. This oversight of potential modality combinations limits their applicability in real-world situations. To address this challenge, we propose Flex-MoE (Flexible Mixture-of-Experts), a new framework designed to flexibly incorporate arbitrary modality combinations while maintaining robustness to missing data. The core idea of Flex-MoE is to first address missing modalities using a new missing modality bank that integrates observed modality combinations with the corresponding missing ones. This is followed by a uniquely designed Sparse MoE framework. Specifically, Flex-MoE first trains experts using samples with all modalities to inject generalized knowledge through the generalized router ($\mathcal{G}$-Router). The $\mathcal{S}$-Router then specializes in handling fewer modality combinations by assigning the top-1 gate to the expert corresponding to the observed modality combination. We evaluate Flex-MoE on the ADNI dataset, which encompasses four modalities in the Alzheimer’s Disease domain, as well as on the MIMIC-IV dataset. The results demonstrate the effectiveness of Flex-MoE, highlighting its ability to model arbitrary modality combinations in diverse missing modality scenarios. Code is available at this https URL.

[LG-145] RAB2-DEF: Dynamic and explainable defense against adversarial attacks in Federated Learning to fair poor clients

Link: https://arxiv.org/abs/2410.08244
Authors: Nuria Rodríguez-Barroso, M. Victoria Luzón, Francisco Herrera
Keywords: data privacy concerns, regulation is growing, data privacy, privacy concerns derived
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:As artificial intelligence becomes more popular, concern about it and the need for regulation are growing, including, among other requirements, data privacy. In this context, Federated Learning is proposed as a solution to data privacy concerns derived from different source data scenarios due to its distributed learning. The defense mechanisms proposed in the literature focus only on defending against adversarial attacks and on performance, leaving aside other important qualities such as explainability, fairness to poor quality clients, dynamism in terms of attack configuration and generality in terms of being resilient against different kinds of attacks. In this work, we propose RAB$^2$-DEF, a defense that is Resilient Against Byzantine and Backdoor attacks and is Dynamic, Explainable and Fair to poor clients, using local linear explanations. We test the performance of RAB$^2$-DEF on image datasets under both Byzantine and backdoor attacks, considering the state-of-the-art defenses, and find that RAB$^2$-DEF is a proper defense while also boosting the other qualities towards trustworthy artificial intelligence.

[LG-146] Self-Attention Mechanism in Multimodal Context for Banking Transaction Flow

Link: https://arxiv.org/abs/2410.08243
Authors: Cyrile Delestre, Yoann Sola
Keywords: Banking Transaction Flow, sequential data found, Transaction Flow, sequential data, data found
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Banking Transaction Flow (BTF) is sequential data found in a number of banking activities such as marketing, credit risk or banking fraud. It is multimodal data composed of three modalities: a date, a numerical value and a wording. We propose in this work an application of the self-attention mechanism to the processing of BTFs. We trained two general models on a large amount of BTFs in a self-supervised way: one RNN-based model and one Transformer-based model. We proposed a specific tokenization in order to be able to process BTFs. The performance of these two models was evaluated on two banking downstream tasks: a transaction categorization task and a credit risk task. The results show that fine-tuning these two pre-trained models yields better performance than the state-of-the-art approaches for both tasks.

[LG-147] NetDiff: Deep Graph Denoising Diffusion for Ad Hoc Network Topology Generation

Link: https://arxiv.org/abs/2410.08238
Authors: Félix Marcoccia, Cédric Adjih, Paul Mühlethaler
Keywords: work introduces NetDiff, diffusion probabilistic architecture, network link topologies, hoc network link, introduces NetDiff
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)

Abstract:This work introduces NetDiff, an expressive graph denoising diffusion probabilistic architecture that generates wireless ad hoc network link topologies. Such networks, with directional antennas, can achieve unmatched performance when the communication links are designed to provide good geometric properties, notably by reducing interference between these links while respecting diverse physical constraints. How to craft such a link assignment algorithm remains an open problem. Deep graph generation offers multiple advantages compared to traditional approaches: it relieves the network nodes of the communication burden caused by the search for viable links and avoids resorting to heavy combinatorial methods to find a good link topology. Denoising diffusion also provides a built-in method to update the network over time. Given that graph neural networks sometimes tend to struggle with global, structural properties, we augment the popular graph transformer with cross-attentive modulation tokens in order to improve global control over the predicted topology. We also incorporate simple node and edge features, as well as additional loss terms, to facilitate compliance with the physical constraints of the network topology. A network evolution algorithm based on partial diffusion is also proposed to maintain a stable network topology over time when the nodes move. Our results show that the generated links are realistic, present structural properties similar to the dataset graphs’, and require only minor corrections and verification steps to be operational.

[LG-148] A Recurrent Neural Network Approach to the Answering Machine Detection Problem

Link: https://arxiv.org/abs/2410.08235
Authors: Kemal Altwlkany, Sead Delalic, Elmedin Selmanovic, Adis Alihodzic, Ivica Lovric
Keywords: cloud communications, paramount importance, telecommunications and cloud, answered an outbound, outbound call
Categories: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: 6 pages, 2 figures, 2024 47th MIPRO ICT and Electronics Convention (MIPRO)

Abstract:In the field of telecommunications and cloud communications, accurately and in real-time detecting whether a human or an answering machine has answered an outbound call is of paramount importance. This problem is of particular significance during campaigns as it enhances service quality, efficiency and cost reduction through precise caller identification. Despite the significance of the field, it remains inadequately explored in the existing literature. This paper presents an innovative approach to answering machine detection that leverages transfer learning through the YAMNet model for feature extraction. The YAMNet architecture facilitates the training of a recurrent-based classifier, enabling real-time processing of audio streams, as opposed to fixed-length recordings. The results demonstrate an accuracy of over 96% on the test set. Furthermore, we conduct an in-depth analysis of misclassified samples and reveal that an accuracy exceeding 98% can be achieved with the integration of a silence detection algorithm, such as the one provided by FFmpeg.

[LG-149] Finetuning YOLOv9 for Vehicle Detection: Deep Learning for Intelligent Transportation Systems in Dhaka Bangladesh

Link: https://arxiv.org/abs/2410.08230
Authors: Shahriar Ahmad Fahim
Keywords: caused numerous transportation, vehicle detection system, numerous transportation challenges, Intelligent Transportation Systems, Rapid urbanization
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 16 pages, 10 figures

Abstract:Rapid urbanization in megacities around the world, like Dhaka, has caused numerous transportation challenges that need to be addressed. Emerging technologies of deep learning and artificial intelligence can help us solve these problems to move towards Intelligent Transportation Systems (ITS) in the city. The government of Bangladesh recognizes the integration of ITS to ensure smart mobility as a vital step towards the development plan “Smart Bangladesh Vision 2041”, but faces challenges in understanding ITS, its effects, and directions to implement. A vehicle detection system can pave the way to understanding traffic congestion, finding mobility patterns, and ensuring traffic surveillance. So, this paper proposes a fine-tuned object detector, the YOLOv9 model, trained on a Bangladesh-based dataset to detect native vehicles. Results show that the fine-tuned YOLOv9 model achieved a mean Average Precision (mAP) of 0.934 at the Intersection over Union (IoU) threshold of 0.5, achieving state-of-the-art performance over past studies on Bangladesh-based datasets, shown through a comparison. Later, by suggesting the model to be deployed on CCTVs (closed circuit television) on the roads, a conceptual technique is proposed to process the vehicle detection model output data in a graph structure creating a vehicle detection system in the city. Finally, applications of such a vehicle detection system are discussed, showing a framework on how it can solve further ITS research questions, to provide a rationale for policymakers to implement the proposed vehicle detection system in the city.
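
The abstract reports mAP at an Intersection over Union (IoU) threshold of 0.5. As a small illustration of what that threshold means, the sketch below computes IoU for two axis-aligned boxes and checks whether a prediction counts as a match; the box format `(x_min, y_min, x_max, y_max)` and the function names are assumptions, not taken from the paper or from any YOLO implementation.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap extent along each axis, clamped to zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, true_box, threshold=0.5):
    """A prediction counts as a true positive when its IoU reaches the threshold."""
    return iou(pred_box, true_box) >= threshold
```

For example, a 2 x 2 predicted box shifted by one unit against a 2 x 2 ground-truth box overlaps on one third of the union, so it would not count as a detection at the 0.5 threshold.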

[LG-150] Learning Bipedal Walking for Humanoid Robots in Challenging Environments with Obstacle Avoidance

Link: https://arxiv.org/abs/2410.08212
Authors: Marwan Hamze (LISV), Mitsuharu Morisawa (AIST), Eiichi Yoshida (CNRS-AIST JRL)
Keywords: Deep reinforcement learning, achieve dynamic walking, Deep reinforcement, dynamic walking, Deep
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: Robomech, May 2024, Utsunomiya, Japan

Abstract:Deep reinforcement learning has seen successful implementations on humanoid robots to achieve dynamic walking. However, these implementations have so far been successful only in simple environments void of obstacles. In this paper, we aim to achieve bipedal locomotion in an environment where obstacles are present using policy-based reinforcement learning. By adding simple distance reward terms to a state-of-the-art reward function that can achieve basic bipedal locomotion, the trained policy succeeds in navigating the robot towards the desired destination without colliding with the obstacles along the way.

[LG-151] An undetectable watermark for generative image models

Link: https://arxiv.org/abs/2410.07369
Authors: Sam Gunn, Xuandong Zhao, Dawn Song
Keywords: undetectable watermarking scheme, watermark, generative image models, undetectable watermarking, Christ and Gunn
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)

Abstract:We present the first undetectable watermarking scheme for generative image models. Undetectability ensures that no efficient adversary can distinguish between watermarked and un-watermarked images, even after making many adaptive queries. In particular, an undetectable watermark does not degrade image quality under any efficiently computable metric. Our scheme works by selecting the initial latents of a diffusion model using a pseudorandom error-correcting code (Christ and Gunn, 2024), a strategy which guarantees undetectability and robustness. We experimentally demonstrate that our watermarks are quality-preserving and robust using Stable Diffusion 2.1. Our experiments verify that, in contrast to every prior scheme we tested, our watermark does not degrade image quality. Our experiments also demonstrate robustness: existing watermark removal attacks fail to remove our watermark from images without significantly degrading the quality of the images. Finally, we find that we can robustly encode 512 bits in our watermark, and up to 2500 bits when the images are not subjected to watermark removal attacks. Our code is available at this https URL.

[LG-152] LCMDC: Large-scale Chinese Medical Dialogue Corpora for Automatic Triage and Medical Consultation

Link: https://arxiv.org/abs/2410.03521
Authors: Xinyuan Wang, Haozhou Li, Dingfang Zheng, Qinke Peng
Keywords (EN): pandemic underscored major, underscored major deficiencies, online medical services, traditional healthcare systems, pandemic underscored
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The global COVID-19 pandemic underscored major deficiencies in traditional healthcare systems, hastening the advancement of online medical services, especially in medical triage and consultation. However, existing studies face two main challenges. First, large-scale, publicly available, domain-specific medical datasets are scarce due to privacy concerns; current datasets are small and limited to a few diseases, which limits the effectiveness of triage methods based on Pre-trained Language Models (PLMs). Second, existing methods lack medical knowledge and struggle to accurately understand professional terms and expressions in patient-doctor consultations. To overcome these obstacles, we construct the Large-scale Chinese Medical Dialogue Corpora (LCMDC), comprising a Coarse-grained Triage dataset with 439,630 samples, a Fine-grained Diagnosis dataset with 199,600 samples, and a Medical Consultation dataset with 472,418 items, thereby addressing the data shortage in this field. We further propose a novel triage system that combines BERT-based supervised learning with prompt learning, as well as a GPT-based medical consultation model using reinforcement learning. To enhance domain knowledge acquisition, we pre-trained PLMs using our self-constructed background corpus. Experimental results on the LCMDC demonstrate the efficacy of our proposed systems.

[LG-153] Editing Massive Concepts in Text-to-Image Diffusion Models

Link: https://arxiv.org/abs/2403.13807
Authors: Tianwei Xiong, Yue Wu, Enze Xie, Yue Wu, Zhenguo Li, Xihui Liu
Keywords (EN): generating outdated, biased content, risk of generating, diffusion models suffer, massive concept editing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page: this https URL. Code: this https URL

Abstract:Text-to-image diffusion models suffer from the risk of generating outdated, copyrighted, incorrect, and biased content. While previous methods have mitigated the issues on a small scale, it is essential to handle them simultaneously in larger-scale real-world scenarios. We propose a two-stage method, Editing Massive Concepts In Diffusion Models (EMCID). The first stage performs memory optimization for each individual concept with dual self-distillation from text alignment loss and diffusion noise prediction loss. The second stage conducts massive concept editing with multi-layer, closed form model editing. We further propose a comprehensive benchmark, named ImageNet Concept Editing Benchmark (ICEB), for evaluating massive concept editing for T2I models with two subtasks, free-form prompts, massive concept categories, and extensive evaluation metrics. Extensive experiments conducted on our proposed benchmark and previous benchmarks demonstrate the superior scalability of EMCID for editing up to 1,000 concepts, providing a practical approach for fast adjustment and re-deployment of T2I diffusion models in real-world applications.

[LG-154] Beyond Myopia: Learning from Positive and Unlabeled Data through Holistic Predictive Trends

Link: https://arxiv.org/abs/2310.04078
Authors: Xinrui Wang, Wenhai Wan, Chuanxin Geng, Shaoyuan LI, Songcan Chen
Keywords (EN): Learning binary classifiers, Learning binary, binary classifiers, PUL, verifying negative
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 25 pages

Abstract:Learning binary classifiers from positive and unlabeled data (PUL) is vital in many real-world applications, especially when verifying negative examples is difficult. Despite the impressive empirical performance of recent PUL methods, challenges like accumulated errors and increased estimation bias persist due to the absence of negative labels. In this paper, we unveil an intriguing yet long-overlooked observation in PUL: resampling the positive data in each training iteration to ensure a balanced distribution between positive and unlabeled examples results in strong early-stage performance. Furthermore, predictive trends for positive and negative classes display distinctly different patterns. Specifically, the scores (output probability) of unlabeled negative examples consistently decrease, while those of unlabeled positive examples show largely chaotic trends. Instead of focusing on classification within individual time frames, we innovatively adopt a holistic approach, interpreting the scores of each example as a temporal point process (TPP). This reformulates the core problem of PUL as recognizing trends in these scores. We then propose a novel TPP-inspired measure for trend detection and prove its asymptotic unbiasedness in predicting changes. Notably, our method accomplishes PUL without requiring additional parameter tuning or prior assumptions, offering an alternative perspective for tackling this problem. Extensive experiments verify the superiority of our method, particularly in a highly imbalanced real-world setting, where it achieves improvements of up to 11.3% in key metrics. The code is available at this https URL.

[LG-155] Linear Convergence of Diffusion Models Under the Manifold Hypothesis

Link: https://arxiv.org/abs/2410.09046
Authors: Peter Potaptchik, Iskander Azangulov, George Deligiannidis
Keywords (EN): Score-matching generative models, complex high-dimensional data, high-dimensional data distributions, Score-matching generative, proven successful
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:Score-matching generative models have proven successful at sampling from complex high-dimensional data distributions. In many applications, this distribution is believed to concentrate on a much lower $d$-dimensional manifold embedded into $D$-dimensional space; this is known as the manifold hypothesis. The current best-known convergence guarantees are either linear in $D$ or polynomial (superlinear) in $d$. The latter exploits a novel integration scheme for the backward SDE. We take the best of both worlds and show that the number of steps diffusion models require in order to converge in Kullback-Leibler (KL) divergence is linear (up to logarithmic terms) in the intrinsic dimension $d$. Moreover, we show that this linear dependency is sharp.

[LG-156] Variance reduction combining pre-experiment and in-experiment data

Link: https://arxiv.org/abs/2410.09027
Authors: Zhexiao Lin, Pablo Crespo
Keywords (EN): essential in data-driven, Online controlled experiments, CUPED and CUPAC, Online controlled, variance
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Econometrics (econ.EM); Applications (stat.AP)
Comments: 18 pages

Abstract:Online controlled experiments (A/B testing) are essential in data-driven decision-making for many companies. Increasing the sensitivity of these experiments, particularly with a fixed sample size, relies on reducing the variance of the estimator for the average treatment effect (ATE). Existing methods like CUPED and CUPAC use pre-experiment data to reduce variance, but their effectiveness depends on the correlation between the pre-experiment data and the outcome. In contrast, in-experiment data is often more strongly correlated with the outcome and thus more informative. In this paper, we introduce a novel method that combines both pre-experiment and in-experiment data to achieve greater variance reduction than CUPED and CUPAC, without introducing bias or additional computation complexity. We also establish asymptotic theory and provide consistent variance estimators for our method. Applying this method to multiple online experiments at Etsy, we reach substantial variance reduction over CUPAC with the inclusion of only a few in-experiment covariates. These results highlight the potential of our approach to significantly improve experiment sensitivity and accelerate decision-making.
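
As background for the baseline named in this abstract, the sketch below shows the classic CUPED adjustment, which subtracts the part of the outcome explained by a pre-experiment covariate. It is not the combined pre-experiment and in-experiment method the paper proposes. The function name `cuped_adjust` and the synthetic data are assumptions made for illustration.

```python
import random
import statistics


def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes y - theta * (x - mean(x))."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    # theta = Cov(x, y) / Var(x) minimizes the variance of the adjusted outcome.
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    var_x = sum((xi - x_bar) ** 2 for xi in x)
    theta = cov_xy / var_x
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]


random.seed(0)
# Hypothetical data: x is a pre-experiment metric, y the in-experiment outcome.
x = [random.gauss(10.0, 2.0) for _ in range(2000)]
y = [0.8 * xi + random.gauss(0.0, 1.0) for xi in x]
y_adj = cuped_adjust(y, x)

raw_var = statistics.variance(y)      # variance of the unadjusted outcome
adj_var = statistics.variance(y_adj)  # variance after removing the covariate signal
```

The adjustment leaves the sample mean unchanged, so a difference in adjusted means still estimates the treatment effect, while the variance shrinks by roughly the squared correlation between `x` and `y`.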

[LG-157] Analyzing Neural Scaling Laws in Two-Layer Networks with Power-Law Data Spectra

Link: https://arxiv.org/abs/2410.09005
Authors: Roman Worschech, Bernd Rosenow
Keywords (EN): training data size, Neural scaling laws, scaling laws describe, deep neural networks, neural networks scales
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Neural scaling laws describe how the performance of deep neural networks scales with key factors such as training data size, model complexity, and training time, often following power-law behaviors over multiple orders of magnitude. Despite their empirical observation, the theoretical understanding of these scaling laws remains limited. In this work, we employ techniques from statistical mechanics to analyze one-pass stochastic gradient descent within a student-teacher framework, where both the student and teacher are two-layer neural networks. Our study primarily focuses on the generalization error and its behavior in response to data covariance matrices that exhibit power-law spectra. For linear activation functions, we derive analytical expressions for the generalization error, exploring different learning regimes and identifying conditions under which power-law scaling emerges. Additionally, we extend our analysis to non-linear activation functions in the feature learning regime, investigating how power-law spectra in the data covariance matrix impact learning dynamics. Importantly, we find that the length of the symmetric plateau depends on the number of distinct eigenvalues of the data covariance matrix and the number of hidden units, demonstrating how these plateaus behave under various configurations. In addition, our results reveal a transition from exponential to power-law convergence in the specialized phase when the data covariance matrix possesses a power-law spectrum. This work contributes to the theoretical understanding of neural scaling laws and provides insights into optimizing learning performance in practical scenarios involving complex data structures.

[LG-158] Optimal Downsampling for Imbalanced Classification with Generalized Linear Models

Link: https://arxiv.org/abs/2410.08994
Authors: Yan Chen, Jose Blanchet, Krzysztof Dembczynski, Laura Fee Nern, Aaron Flores
Keywords (EN): highly imbalanced classification, imbalanced classification models, imbalanced classification, classification models, optimal downsampling
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Downsampling or under-sampling is a technique that is utilized in the context of large and highly imbalanced classification models. We study optimal downsampling for imbalanced classification using generalized linear models (GLMs). We propose a pseudo maximum likelihood estimator and study its asymptotic normality in the context of increasingly imbalanced populations relative to an increasingly large sample size. We provide theoretical guarantees for the introduced estimator. Additionally, we compute the optimal downsampling rate using a criterion that balances statistical accuracy and computational efficiency. Our numerical experiments, conducted on both synthetic and empirical data, further validate our theoretical results, and demonstrate that the introduced estimator outperforms commonly available alternatives.
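
To make the downsampling idea concrete, here is a minimal sketch of the textbook approach: keep all positives, keep each negative with a fixed probability, then rescale the odds to undo the sampling. This is a standard correction and an assumption on our part. It is not the pseudo maximum likelihood estimator or the optimal rate derived in the paper. The helper names and the 1% positive rate are hypothetical.

```python
import math
import random


def downsample_negatives(rows, keep_rate, rng):
    """Keep every positive row and each negative row with probability keep_rate."""
    return [(x, y) for x, y in rows if y == 1 or rng.random() < keep_rate]


def correct_probability(p_sampled, keep_rate):
    """Map a probability estimated on downsampled data back to the full population.

    Dropping negatives inflates the odds by 1 / keep_rate, so we scale them back.
    """
    odds_sampled = p_sampled / (1.0 - p_sampled)
    odds_full = keep_rate * odds_sampled
    return odds_full / (1.0 + odds_full)


rng = random.Random(0)
# Hypothetical imbalanced data: roughly 1% positives.
rows = [(i, 1 if rng.random() < 0.01 else 0) for i in range(100_000)]
sampled = downsample_negatives(rows, keep_rate=0.05, rng=rng)

rate_full = sum(y for _, y in rows) / len(rows)
rate_sampled = sum(y for _, y in sampled) / len(sampled)
rate_corrected = correct_probability(rate_sampled, keep_rate=0.05)

# For a logistic model the same correction is an intercept shift of log(keep_rate).
intercept_shift = math.log(0.05)
```

The downsampled set is far smaller and far more balanced, yet the corrected positive rate stays close to the rate in the full data.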

[LG-159] Online-to-PAC generalization bounds under graph-mixing dependencies

Link: https://arxiv.org/abs/2410.08977
Authors: Baptiste Abélès, Eugenio Clerico, Gergely Neu
Keywords (EN): training data set, data set made, statistical learning require, Traditional generalization results, require a training
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 13 pages (10 main + 3 supplementary material). All authors contributed equally

Abstract:Traditional generalization results in statistical learning require a training data set made of independently drawn examples. Most of the recent efforts to relax this independence assumption have considered either purely temporal (mixing) dependencies, or graph-dependencies, where non-adjacent vertices correspond to independent random variables. Both approaches have their own limitations, the former requiring a temporal ordered structure, and the latter lacking a way to quantify the strength of inter-dependencies. In this work, we bridge these two lines of work by proposing a framework where dependencies decay with graph distance. We derive generalization bounds leveraging the online-to-PAC framework, by deriving a concentration result and introducing an online learning framework incorporating the graph structure. The resulting high-probability generalization guarantees depend on both the mixing rate and the graph’s chromatic number.

[LG-160] Lifted Coefficient of Determination: Fast model-free prediction intervals and likelihood-free model comparison

Link: https://arxiv.org/abs/2410.08958
Authors: Daniel Salnikov, Kevin Michalewicz, Dan Leonte
Keywords (EN): lifted linear model, observations increases, lifted linear, Lifted Coefficient, derive model-free prediction
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 14 pages, 5 figures

Abstract:We propose the lifted linear model, and derive model-free prediction intervals that become tighter as the correlation between predictions and observations increases. These intervals motivate the Lifted Coefficient of Determination, a model comparison criterion for arbitrary loss functions in prediction-based settings, e.g., regression, classification or counts. We extend the prediction intervals to more general error distributions, and propose a fast model-free outlier detection algorithm for regression. Finally, we illustrate the framework via numerical experiments.

[LG-161] KinDEL: DNA-Encoded Library Dataset for Kinase Inhibitors

Link: https://arxiv.org/abs/2410.08938
Authors: Benson Chen, Tomasz Danel, Patrick J. McEnaney, Nikhil Jain, Kirill Novikov, Spurti Umesh Akki, Joshua L. Turnbull, Virja Atul Pandya, Boris P. Belotserkovskii, Jared Bryce Weaver, Ankita Biswas, Dat Nguyen, Gabriel H. S. Dreiman, Mohammad Sultan, Nathaniel Stanley, Daniel M Whalen, Divya Kanichar, Christoph Klein, Emily Fox, R. Edward Watts
Keywords (EN): small molecule libraries, DNA-Encoded Libraries, diverse chemical spaces, characterize diverse chemical, combinatorial small molecule
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)

Abstract:DNA-Encoded Libraries (DEL) are combinatorial small molecule libraries that offer an efficient way to characterize diverse chemical spaces. Selection experiments using DELs are pivotal to drug discovery efforts, enabling high-throughput screens for hit finding. However, limited availability of public DEL datasets hinders the advancement of computational techniques designed to process such data. To bridge this gap, we present KinDEL, one of the first large, publicly available DEL datasets on two kinases: Mitogen-Activated Protein Kinase 14 (MAPK14) and Discoidin Domain Receptor Tyrosine Kinase 1 (DDR1). Interest in this data modality is growing due to its ability to generate extensive supervised chemical data that densely samples around select molecular structures. Demonstrating one such application of the data, we benchmark different machine learning techniques to develop predictive models for hit identification; in particular, we highlight recent structure-based probabilistic approaches. Finally, we provide biophysical assay data, both on- and off-DNA, to validate our models on a smaller subset of molecules. Data and code for our benchmarks can be found at: this https URL.

[LG-162] The Effect of Personalization in FedProx: A Fine-grained Analysis on Statistical Accuracy and Communication Efficiency

Link: https://arxiv.org/abs/2410.08934
Authors: Xin Yu, Zelin He, Ying Sun, Lingzhou Xue, Runze Li
Keywords (EN): effective federated learning, federated learning method, simple yet effective, effective federated, federated learning
Categories: Machine Learning (stat.ML); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Statistics Theory (math.ST); Computation (stat.CO)

Abstract:FedProx is a simple yet effective federated learning method that enables model personalization via regularization. Despite remarkable success in practice, a rigorous analysis of how such a regularization provably improves the statistical accuracy of each client’s local model hasn’t been fully established. Setting the regularization strength heuristically presents a risk, as an inappropriate choice may even degrade accuracy. This work fills in the gap by analyzing the effect of regularization on statistical accuracy, thereby providing a theoretical guideline for setting the regularization strength for achieving personalization. We prove that by adaptively choosing the regularization strength under different statistical heterogeneity, FedProx can consistently outperform pure local training and achieve a nearly minimax-optimal statistical rate. In addition, to shed light on resource allocation, we design an algorithm, provably showing that stronger personalization reduces communication complexity without increasing the computation cost overhead. Finally, our theory is validated on both synthetic and real-world datasets and its generalizability is verified in a non-convex setting.

[LG-163] Deep Learning Algorithms for Mean Field Optimal Stopping in Finite Space and Discrete Time

Link: https://arxiv.org/abs/2410.08850
Authors: Lorenzo Magnino, Yuchen Zhu, Mathieu Laurière
Keywords (EN): Optimal stopping, optimal stopping problems, field optimal stopping, multi-agent optimal stopping, discrete-time optimal stopping
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract:Optimal stopping is a fundamental problem in optimization that has found applications in risk management, finance, economics, and recently in the fields of computer science. We extend the standard framework to a multi-agent setting, named multi-agent optimal stopping (MAOS), where a group of agents cooperatively solves finite-space, discrete-time optimal stopping problems. Solving the finite-agent case is computationally prohibitive when the number of agents is very large, so this work studies the mean field optimal stopping (MFOS) problem, obtained as the number of agents approaches infinity. We prove that MFOS provides a good approximate solution to MAOS. We also prove a dynamic programming principle (DPP), based on the theory of mean field control. We then propose two deep learning methods: one simulates full trajectories to learn optimal decisions, whereas the other leverages DPP with backward induction; both methods train neural networks for the optimal stopping decisions. We demonstrate the effectiveness of these approaches through numerical experiments on 6 different problems in spatial dimension up to 300. To the best of our knowledge, this is the first work to study MFOS in finite space and discrete time, and to propose efficient and scalable computational methods for this type of problem.

[LG-164] Calibrated Computation-Aware Gaussian Processes

Link: https://arxiv.org/abs/2410.08796
Authors: Disha Hegde, Mohamed Adil, Jon Cockayne
Keywords (EN): Computation-aware Gaussian processes, Gaussian processes, preventing application, probabilistic linear solver, large regression problems
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)

Abstract:Gaussian processes are notorious for scaling cubically with the size of the training set, preventing application to very large regression problems. Computation-aware Gaussian processes (CAGPs) tackle this scaling issue by exploiting probabilistic linear solvers to reduce complexity, widening the posterior with additional computational uncertainty due to reduced computation. However, the most commonly used CAGP framework results in (sometimes dramatically) conservative uncertainty quantification, making the posterior unrealistic in practice. In this work, we prove that if the utilised probabilistic linear solver is calibrated, in a rigorous statistical sense, then so too is the induced CAGP. We thus propose a new CAGP framework, CAGP-GS, based on using Gauss-Seidel iterations for the underlying probabilistic linear solver. CAGP-GS performs favourably compared to existing approaches when the test set is low-dimensional and few iterations are performed. We test the calibratedness on a synthetic problem, and compare the performance to existing approaches on a large-scale global temperature regression problem.
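
For readers unfamiliar with the iteration that CAGP-GS builds on, the sketch below shows plain deterministic Gauss-Seidel sweeps for a linear system A x = b, the kind of system that Gaussian process regression must solve. It does not implement the probabilistic linear solver or the uncertainty calibration described in the abstract. The 3x3 system is a hypothetical example chosen so that the exact solution is a vector of ones.

```python
def gauss_seidel(A, b, num_iters):
    """Approximately solve A x = b with Gauss-Seidel sweeps, starting from x = 0."""
    n = len(b)
    x = [0.0] * n
    for _ in range(num_iters):
        for i in range(n):
            # Use the newest available values of x: entries before i were
            # already updated in this sweep, entries after i come from the last one.
            off_diagonal = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - off_diagonal) / A[i][i]
    return x


# Hypothetical symmetric, diagonally dominant system with solution [1, 1, 1].
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [5.0, 5.0, 3.0]

x_few = gauss_seidel(A, b, num_iters=2)    # cheap, approximate answer
x_many = gauss_seidel(A, b, num_iters=25)  # close to the exact solution
```

Stopping after a few sweeps leaves a residual error, which is the "reduced computation" that a computation-aware method has to account for in its posterior.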

[LG-165] Losing dimensions: Geometric memorization in generative diffusion

Link: https://arxiv.org/abs/2410.08727
Authors: Beatrice Achilli, Enrico Ventura, Gianluigi Silvestri, Bao Pham, Gabriel Raya, Dmitry Krotov, Carlo Lucibello, Luca Ambrogioni
Keywords (EN): machine learning models, learning models deeply, models deeply connected, Generative diffusion processes, machine learning
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Generative diffusion processes are state-of-the-art machine learning models deeply connected with fundamental concepts in statistical physics. Depending on the dataset size and the capacity of the network, their behavior is known to transition from an associative memory regime to a generalization phase in a phenomenon that has been described as a glassy phase transition. Here, using statistical physics techniques, we extend the theory of memorization in generative diffusion to manifold-supported data. Our theoretical and experimental findings indicate that different tangent subspaces are lost due to memorization effects at different critical times and dataset sizes, which depend on the local variance of the data along their directions. Perhaps counterintuitively, we find that, under some conditions, subspaces of higher variance are lost first due to memorization effects. This leads to a selective loss of dimensionality where some prominent features of the data are memorized without a full collapse on any individual training point. We validate our theory with a comprehensive set of experiments on networks trained both in image datasets and on linear manifolds, which result in a remarkable qualitative agreement with the theoretical predictions.

[LG-166] SOAK: Same/Other/All K-fold cross-validation for estimating similarity of patterns in data subsets

Link: https://arxiv.org/abs/2410.08643
Authors: Toby Dylan Hocking, Gabrielle Thibault, Cameron Scott Bodine, Paul Nelson Arellano, Alexander F Shenkin, Olivia Jasmine Lindly
Keywords (EN): time period, machine learning, real-world applications, applications of machine, data
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:In many real-world applications of machine learning, we are interested to know if it is possible to train on the data that we have gathered so far, and obtain accurate predictions on a new test data subset that is qualitatively different in some respect (time period, geographic region, etc). Another question is whether data subsets are similar enough so that it is beneficial to combine subsets during model training. We propose SOAK, Same/Other/All K-fold cross-validation, a new method which can be used to answer both questions. SOAK systematically compares models which are trained on different subsets of data, and then used for prediction on a fixed test subset, to estimate the similarity of learnable/predictable patterns in data subsets. We show results of using SOAK on six new real data sets (with geographic/temporal subsets, to check if predictions are accurate on new subsets), 3 image pair data sets (subsets are different image types, to check that we get smaller prediction error on similar images), and 11 benchmark data sets with predefined train/test splits (to check similarity of predefined splits).
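
The Same/Other/All comparison described above can be sketched as follows. This is one reading of the abstract, not the authors' implementation: for each target subset and each fold, the test set is the target subset's rows in that fold, and three models are trained on the remaining rows of the same subset, of the other subsets, or of all subsets. The generator `soak_splits`, the mean-predictor "model", and the two synthetic subsets are assumptions made for illustration.

```python
import random
import statistics


def soak_splits(subsets, n_folds, seed=0):
    """Yield (target, fold, kind, train_idx, test_idx) for Same/Other/All CV."""
    n = len(subsets)
    order = list(range(n))
    random.Random(seed).shuffle(order)
    folds = [0] * n
    for position, i in enumerate(order):
        folds[i] = position % n_folds  # balanced random fold assignment
    for target in sorted(set(subsets)):
        for k in range(n_folds):
            test = [i for i in range(n) if subsets[i] == target and folds[i] == k]
            held_in = [i for i in range(n) if folds[i] != k]
            train = {
                "same": [i for i in held_in if subsets[i] == target],
                "other": [i for i in held_in if subsets[i] != target],
                "all": held_in,
            }
            for kind, train_idx in train.items():
                yield target, k, kind, train_idx, test


random.seed(1)
# Hypothetical data: two subsets whose outcomes have very different means.
subsets = ["A"] * 200 + ["B"] * 200
y = ([random.gauss(0.0, 1.0) for _ in range(200)]
     + [random.gauss(5.0, 1.0) for _ in range(200)])

errors = {}
for target, k, kind, train_idx, test_idx in soak_splits(subsets, n_folds=5):
    prediction = statistics.fmean(y[i] for i in train_idx)  # trivial "model"
    mse = statistics.fmean((y[i] - prediction) ** 2 for i in test_idx)
    errors.setdefault((target, kind), []).append(mse)

mean_error = {key: statistics.fmean(values) for key, values in errors.items()}
```

Because the two synthetic subsets differ, training on "other" gives the largest test error and "all" falls in between. Similar subsets would instead show "all" matching or beating "same".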

[LG-167] CryoFM: A Flow-based Foundation Model for Cryo-EM Densities

Link: https://arxiv.org/abs/2410.08631
Authors: Yi Zhou, Yilai Li, Jing Yuan, Quanquan Gu
Keywords (EN): drug discovery, high resolution, density maps, powerful technique, biology and drug
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)

Abstract:Cryo-electron microscopy (cryo-EM) is a powerful technique in structural biology and drug discovery, enabling the study of biomolecules at high resolution. Significant advancements by structural biologists using cryo-EM have led to the production of over 38,626 protein density maps at various resolutions. However, cryo-EM data processing algorithms have yet to fully benefit from our knowledge of biomolecular density maps, with only a few recent models being data-driven but limited to specific tasks. In this study, we present CryoFM, a foundation model designed as a generative model, learning the distribution of high-quality density maps and generalizing effectively to downstream tasks. Built on flow matching, CryoFM is trained to accurately capture the prior distribution of biomolecular density maps. Furthermore, we introduce a flow posterior sampling method that leverages CryoFM as a flexible prior for several downstream tasks in cryo-EM and cryo-electron tomography (cryo-ET) without the need for fine-tuning, achieving state-of-the-art performance on most tasks and demonstrating its potential as a foundational model for broader applications in these fields.

[LG-168] GPR Full-Waveform Inversion through Adaptive Filtering of Model Parameters and Gradients Using CNN

Link: https://arxiv.org/abs/2410.08568
Authors: Peng Jiang, Kun Wang, Jiaxing Wang, Zeliang Feng, Shengjie Qiao, Runhuai Deng, Fengkai Zhang
Keywords (EN): GPR full-waveform inversion, subsurface property model, property model iteratively, entire waveform information, GPR full-waveform
Categories: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
Comments: 16 pages, 6 figures

Abstract:GPR full-waveform inversion optimizes the subsurface property model iteratively to match the entire waveform information. However, the model gradients derived from wavefield continuation often contain errors, such as ghost values and excessively large values at transmitter and receiver points. Furthermore, models updated based on these gradients frequently exhibit unclear characterization of anomalous bodies or false anomalies, making it challenging to obtain accurate inversion results. To address these issues, we introduced a novel full-waveform inversion (FWI) framework that incorporates an embedded convolutional neural network (CNN) to adaptively filter model parameters and gradients. Specifically, we embedded the CNN module before the forward modeling process and ensured the entire FWI process remains differentiable. This design leverages the auto-grad tool of the deep learning library, allowing model values to pass through the CNN module during forward computation and model gradients to pass through the CNN module during backpropagation. Experiments have shown that filtering the model parameters during forward computation and the model gradients during backpropagation can ultimately yield high-quality inversion results.

[LG-169] Adaptive Constraint Integration for Simultaneously Optimizing Crystal Structures with Multiple Targeted Properties

Link: https://arxiv.org/abs/2410.08562
Authors: Akihiro Fujii, Yoshitaka Ushiku, Koji Shimizu, Anh Khoa Augustin Lu, Satoshi Watanabe
Keywords (EN): finding crystal structures, targeted properties, Adaptive Crystal Synthesizer, Simultaneous Multi-property Optimization, finding crystal
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)

Abstract:In materials science, finding crystal structures that have targeted properties is crucial. While recent methodologies such as Bayesian optimization and deep generative models have made some advances on this issue, these methods often face difficulties in adaptively incorporating various constraints, such as electrical neutrality and targeted properties optimization, while keeping the desired specific crystal structure. To address these challenges, we have developed the Simultaneous Multi-property Optimization using Adaptive Crystal Synthesizer (SMOACS), which utilizes state-of-the-art property prediction models and their gradients to directly optimize input crystal structures for targeted properties simultaneously. SMOACS enables the integration of adaptive constraints into the optimization process without necessitating model retraining. Thanks to this feature, SMOACS has succeeded in simultaneously optimizing targeted properties while maintaining perovskite structures, even with models trained on diverse crystal types. We have demonstrated the band gap optimization while meeting a challenging constraint, that is, maintaining electrical neutrality in large atomic configurations up to 135 atom sites, where the verification of the electrical neutrality is challenging. The properties of the most promising materials have been confirmed by density functional theory calculations.

[LG-170] Kolmogorov-Arnold Neural Networks for High-Entropy Alloys Design

Link: https://arxiv.org/abs/2410.08452
Authors: Yagnik Bandyopadhyay, Harshil Avlani, Houlong L. Zhuang
Keywords (EN): yielding numerous valuable, numerous valuable insights, KAN classification model, deep learning-based machine, KAN
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)

Abstract:A wide range of deep learning-based machine learning techniques are extensively applied to the design of high-entropy alloys (HEAs), yielding numerous valuable insights. Kolmogorov-Arnold Networks (KAN) is a recently developed architecture that aims to improve both the accuracy and interpretability of input features. In this work, we explore three different datasets for HEA design and demonstrate the application of KAN for both classification and regression models. In the first example, we use a KAN classification model to predict the probability of single-phase formation in high-entropy carbide ceramics based on various properties such as mixing enthalpy and valence electron concentration. In the second example, we employ a KAN regression model to predict the yield strength and ultimate tensile strength of HEAs based on their chemical composition and process conditions including annealing time, cold rolling percentage, and homogenization temperature. The third example involves a KAN classification model to determine whether a certain composition is an HEA or non-HEA, followed by a KAN regressor model to predict the bulk modulus of the identified HEA, aiming to identify HEAs with high bulk modulus. In all three examples, KAN either outperforms or matches the multilayer perceptron (MLP) in accuracy, measured by F1 score for classification and by mean squared error (MSE) and coefficient of determination (R^2) for regression, demonstrating the efficacy of KAN in handling both classification and regression tasks. We provide a promising direction for future research to explore advanced machine learning techniques, which lead to more accurate predictions and better interpretability of complex materials, ultimately accelerating the discovery and optimization of HEAs with desirable properties.

[LG-171] Nesterov acceleration in benignly non-convex landscapes

Link: https://arxiv.org/abs/2410.08395
Authors: Kanan Gupta, Stephan Wojtowytsch
Keywords (EN): strongly convex setting, notoriously non-convex optimization, non-convex optimization problems, momentum-based optimization algorithms, convex setting
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:While momentum-based optimization algorithms are commonly used in the notoriously non-convex optimization problems of deep learning, their analysis has historically been restricted to the convex and strongly convex setting. In this article, we partially close this gap between theory and practice and demonstrate that virtually identical guarantees can be obtained in optimization problems with a ‘benign’ non-convexity. We show that these weaker geometric assumptions are well justified in overparametrized deep learning, at least locally. Variations of this result are obtained for a continuous time model of Nesterov’s accelerated gradient descent algorithm (NAG), the classical discrete time version of NAG, and versions of NAG with stochastic gradient estimates with purely additive noise and with noise that exhibits both additive and multiplicative scaling.
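For readers unfamiliar with the classical discrete-time version of NAG mentioned in the abstract, the following is a minimal sketch of the textbook update with constant momentum. It is not the authors' code: the function name `nag`, the quadratic test problem, and the step-size and momentum values are all assumptions chosen for illustration, and the example is strongly convex rather than benignly non-convex.

```python
import numpy as np

def nag(grad, x0, lr, momentum, steps):
    """Textbook Nesterov accelerated gradient with constant momentum (illustrative)."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        lookahead = x + momentum * v              # gradient is evaluated at the look-ahead point
        v = momentum * v - lr * grad(lookahead)   # velocity update
        x = x + v                                 # position update
    return x

# Hypothetical test problem: f(x) = 0.5 * x^T A x, whose gradient is A x.
A = np.diag([1.0, 10.0])
x_final = nag(lambda x: A @ x, x0=[1.0, 1.0], lr=0.1, momentum=0.5, steps=200)
print(np.linalg.norm(x_final))
```

The only difference from heavy-ball momentum is that the gradient is taken at `x + momentum * v` rather than at `x`.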

[LG-172] Upper Bounds for Learning in Reproducing Kernel Hilbert Spaces for Orbits of an Iterated Function System

Link: https://arxiv.org/abs/2410.08361
Authors: Priyanka Roy, Susanne Saminger-Platz
Keywords: reproducing kernel Hilbert, kernel Hilbert spaces, key problems, learning theory, kernel Hilbert
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Functional Analysis (math.FA)

Abstract:One of the key problems in learning theory is to compute a function $f$ that closely approximates the relationship between some input $x$ and corresponding output $y$, such that $y \approx f(x)$. This approximation is based on sample points $(x_t, y_t)_{t=1}^{m}$, where the function $f$ can be approximated within reproducing kernel Hilbert spaces using various learning algorithms. In the context of learning theory, it is usually customary to assume that the sample points are drawn independently and identically distributed (i.i.d.) from an unknown underlying distribution. However, we relax this i.i.d. assumption by considering an input sequence $(x_t)_{t \in \mathbb{N}}$ as a trajectory generated by an iterated function system, which forms a particular Markov chain, with $(y_t)_{t \in \mathbb{N}}$ corresponding to an observation sequence when the model is in the corresponding state $x_t$. For such a process, we approximate the function $f$ using the Markov chain stochastic gradient algorithm and estimate the error by deriving upper bounds within reproducing kernel Hilbert spaces.

[LG-173] Avoiding mode collapse in diffusion models fine-tuned with reinforcement learning

Link: https://arxiv.org/abs/2410.08315
Authors: Roberto Barceló, Cristóbal Alcázar, Felipe Tobar
Keywords: proven promising, promising for aligning, Fine-tuning foundation models, downstream objectives, Fine-tuning foundation
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Fine-tuning foundation models via reinforcement learning (RL) has proven promising for aligning to downstream objectives. In the case of diffusion models (DMs), though RL training improves alignment from early timesteps, critical issues such as training instability and mode collapse arise. We address these drawbacks by exploiting the hierarchical nature of DMs: we train them dynamically at each epoch with a tailored RL method, allowing for continual evaluation and step-by-step refinement of the model performance (or alignment). Furthermore, we find that not every denoising step needs to be fine-tuned to align DMs to downstream tasks. Consequently, in addition to clipping, we regularise model parameters at distinct learning phases via a sliding-window approach. Our approach, termed Hierarchical Reward Fine-tuning (HRF), is validated on the Denoising Diffusion Policy Optimisation method, where we show that models trained with HRF achieve better preservation of diversity in downstream tasks, thus enhancing the fine-tuning robustness and at uncompromising mean rewards.

[LG-174] Correspondence of NNGP Kernel and the Matern Kernel

Link: https://arxiv.org/abs/2410.08311
Authors: Amanda Muyskens, Benjamin W. Priest, Imene R. Goumiri, Michael D. Schneider
Keywords: recently gained popularity, Matern kernel, Kernels representing limiting, representing limiting cases, NNGP kernel
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 17 pages, 11 figures

Abstract:Kernels representing limiting cases of neural network architectures have recently gained popularity. However, the application and performance of these new kernels compared to existing options, such as the Matern kernel, is not well studied. We take a practical approach to explore the neural network Gaussian process (NNGP) kernel and its application to data in Gaussian process regression. We first demonstrate the necessity of normalization to produce valid NNGP kernels and explore related numerical challenges. We further demonstrate that the predictions from this model are quite inflexible, and therefore do not vary much over the valid hyperparameter sets. We then demonstrate a surprising result that the predictions given from the NNGP kernel correspond closely to those given by the Matern kernel under specific circumstances, which suggests a deep similarity between overparameterized deep neural networks and the Matern kernel. Finally, we demonstrate the performance of the NNGP kernel as compared to the Matern kernel on three benchmark data cases, and we conclude that for its flexibility and practical performance, the Matern kernel is preferred to the novel NNGP in practical applications.
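As background for the Matern kernel that the abstract compares against, here is a minimal sketch of the standard closed form for smoothness ν = 3/2, k(r) = σ²(1 + √3·r/ℓ)·exp(−√3·r/ℓ). The function name `matern32`, the one-dimensional inputs, and the parameter values are assumptions for illustration. The abstract does not say which smoothness the paper uses, and this sketch does not implement the NNGP kernel.

```python
import numpy as np

def matern32(x1, x2, lengthscale=1.0, variance=1.0):
    """Matern kernel with smoothness nu = 3/2 between two sets of 1-D inputs."""
    x1 = np.asarray(x1, dtype=float).reshape(-1, 1)
    x2 = np.asarray(x2, dtype=float).reshape(1, -1)
    r = np.abs(x1 - x2) / lengthscale   # scaled pairwise distances
    s = np.sqrt(3.0) * r
    return variance * (1.0 + s) * np.exp(-s)

# Hypothetical inputs: Gram matrix over three points.
K = matern32([0.0, 1.0, 2.5], [0.0, 1.0, 2.5], lengthscale=1.0, variance=2.0)
print(K)
```

The lengthscale and variance are the hyperparameters that a Gaussian process regression fit would tune.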

[LG-175] LSTM networks provide efficient cyanobacterial blooms forecasting even with incomplete spatio-temporal data

Link: https://arxiv.org/abs/2410.08237
Authors: Claudia Fournier, Raul Fernandez-Fernandez, Samuel Cirés, José A. López-Orozco, Eva Besada-Portas, Antonio Quesada
Keywords: threatening ecosystem function, toxin-producing strains predominate, frequent dominant species, water quality, inland waters
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Abstract:Cyanobacteria are the most frequent dominant species of algal blooms in inland waters, threatening ecosystem function and water quality, especially when toxin-producing strains predominate. Enhanced by anthropogenic activities and global warming, cyanobacterial blooms are expected to increase in frequency and global distribution. Early warning systems (EWS) for cyanobacterial bloom development allow timely implementation of management measures, reducing the risks associated with these blooms. In this paper, we propose an effective EWS for cyanobacterial bloom forecasting, which uses 6 years of incomplete high-frequency spatio-temporal data from multiparametric probes, including phycocyanin (PC) fluorescence as a proxy for cyanobacteria. A probe-agnostic and replicable method is proposed to pre-process the data and to generate time series specific to cyanobacterial bloom forecasting. Using these pre-processed data, six different non-site/species-specific predictive models were compared, including the autoregressive and multivariate versions of Linear Regression, Random Forest, and Long Short-Term Memory (LSTM) neural networks. Results were analyzed for seven forecasting time horizons ranging from 4 to 28 days, evaluated with a hybrid system that combined regression metrics (MSE, R², MAPE) for PC values, classification metrics (Accuracy, F1, Kappa) for a proposed alarm level of 10 µg PC/L, and a forecasting-specific metric to measure prediction improvement over the displaced signal (skill). The multivariate version of LSTM showed the best and most consistent results across all forecasting horizons and metrics, achieving accuracies of up to 90% in predicting the proposed PC alarm level. Additionally, positive skill values indicated its outstanding effectiveness in forecasting cyanobacterial blooms from 16 to 28 days in advance.
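The abstract does not give the formula for its skill metric. One common way to measure improvement over the displaced signal (a persistence forecast that repeats the observation from `horizon` steps earlier) is an MSE ratio, sketched below. The function name `skill_vs_persistence`, the MSE-based definition, and the toy series are assumptions and may differ from what the paper actually computes.

```python
import numpy as np

def skill_vs_persistence(y_true, y_pred, horizon):
    """Skill = 1 - MSE(model) / MSE(persistence); positive means the model beats the displaced signal."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    target = y_true[horizon:]       # values being forecast
    model = y_pred[horizon:]        # model forecasts aligned with the targets
    baseline = y_true[:-horizon]    # displaced signal: observation from `horizon` steps earlier
    mse_model = np.mean((target - model) ** 2)
    mse_baseline = np.mean((target - baseline) ** 2)
    return 1.0 - mse_model / mse_baseline

# Hypothetical toy series (not data from the paper).
series = np.array([1.0, 2.0, 4.0, 7.0, 11.0, 16.0])
perfect = skill_vs_persistence(series, series, horizon=2)
persistence_forecast = np.concatenate([series[:2], series[:-2]])
no_gain = skill_vs_persistence(series, persistence_forecast, horizon=2)
print(perfect, no_gain)
```

Under this definition a perfect forecast scores 1, a forecast identical to the displaced signal scores 0, and negative values mean the model does worse than persistence.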

[LG-176] A Real Benchmark Swell Noise Dataset for Performing Seismic Data Denoising via Deep Learning

Link: https://arxiv.org/abs/2410.08231
Authors: Pablo M. Barros, Roosevelt de L. Sardinha, Giovanny A. M. Arboleda, Lessandro de S. S. Valente, Isabelle R. V. de Melo, Albino Aveleda, André Bulcão, Sergio L. Netto, Alexandre G. Evsukoff
Keywords: deep learning, computer vision, creation of open, tested and compared, compared with reproducible
Subjects: Geophysics (physics.geo-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:The recent development of deep learning (DL) methods for computer vision has been driven by the creation of open benchmark datasets on which new algorithms can be tested and compared with reproducible results. Although DL methods have many applications in geophysics, few real seismic datasets are available for benchmarking DL models, especially for denoising real data, which is one of the main problems in seismic data processing scenarios in the oil and gas industry. This article presents a benchmark dataset composed of synthetic seismic data corrupted with noise extracted from a filtering process implemented on real data. In this work, a comparison between two well-known DL-based denoising models is conducted on this dataset, which is proposed as a benchmark for accelerating the development of new solutions for seismic data denoising. This work also introduces a new evaluation metric that can capture small variations in model results. The results show that DL models are effective at denoising seismic data, but some issues remain to be solved.

[LG-177] Multi-Atlas Brain Network Classification through Consistency Distillation and Complementary Information Fusion

Link: https://arxiv.org/abs/2410.08228
Authors: Jiaxing Xu, Mengcheng Lan, Xia Dong, Kai He, Wei Zhang, Qingtian Bian, Yiping Ke
Keywords: identifying distinctive patterns, identifying distinctive, brain, brain network classification, atlases
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:In the realm of neuroscience, identifying distinctive patterns associated with neurological disorders via brain networks is crucial. Resting-state functional magnetic resonance imaging (fMRI) serves as a primary tool for mapping these networks by correlating blood-oxygen-level-dependent (BOLD) signals across different brain regions, defined as regions of interest (ROIs). Constructing these brain networks involves using atlases to parcellate the brain into ROIs based on various hypotheses of brain division. However, there is no standard atlas for brain network classification, leading to limitations in detecting abnormalities in disorders. Some recent methods have proposed utilizing multiple atlases, but they neglect consistency across atlases and lack ROI-level information exchange. To tackle these limitations, we propose an Atlas-Integrated Distillation and Fusion network (AIDFusion) to improve brain network classification using fMRI data. AIDFusion addresses the challenge of utilizing multiple atlases by employing a disentangle Transformer to filter out inconsistent atlas-specific information and distill distinguishable connections across atlases. It also incorporates subject- and population-level consistency constraints to enhance cross-atlas consistency. Additionally, AIDFusion employs an inter-atlas message-passing mechanism to fuse complementary information across brain regions. Experimental results on four datasets of different diseases demonstrate the effectiveness and efficiency of AIDFusion compared to state-of-the-art methods. A case study illustrates that AIDFusion extracts patterns that are both interpretable and consistent with established neuroscience findings.

[LG-178] EarthquakeNPP: Benchmark Datasets for Earthquake Forecasting with Neural Point Processes

Link: https://arxiv.org/abs/2410.08226
Authors: Samuel Stockman, Daniel Lawson, Maximilian Werner
Keywords: Classical point process, Neural Point Processes, epidemic-type aftershock sequence, point process models, Classical point
Subjects: Geophysics (physics.geo-ph); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)

Abstract:Classical point process models, such as the epidemic-type aftershock sequence (ETAS) model, have been widely used for forecasting the event times and locations of earthquakes for decades. Recent advances have led to Neural Point Processes (NPPs), which promise greater flexibility and improvements over classical models. However, the currently-used benchmark dataset for NPPs does not represent an up-to-date challenge in the seismological community since it lacks a key earthquake sequence from the region and improperly splits training and testing data. Furthermore, initial earthquake forecast benchmarking lacks a comparison to state-of-the-art earthquake forecasting models typically used by the seismological community. To address these gaps, we introduce EarthquakeNPP: a collection of benchmark datasets to facilitate testing of NPPs on earthquake data, accompanied by a credible implementation of the ETAS model. The datasets cover a range of small to large target regions within California, dating from 1971 to 2021, and include different methodologies for dataset generation. In a benchmarking experiment, we compare three spatio-temporal NPPs against ETAS and find that none outperform ETAS in either spatial or temporal log-likelihood. These results indicate that current NPP implementations are not yet suitable for practical earthquake forecasting. However, EarthquakeNPP will serve as a platform for collaboration between the seismology and machine learning communities with the goal of improving earthquake predictability.

[LG-179] A Survey of Spatio-Temporal EEG data Analysis: from Models to Applications

Link: https://arxiv.org/abs/2410.08224
Authors: Pengfei Wang, Huanran Zheng, Silong Dai, Yiqiao Wang, Xiaotian Gu, Yuanbin Wu, Xiaoling Wang
Keywords: witnessed remarkable advancements, recent years, field of electroencephalography, analysis has witnessed, remarkable advancements
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: submitted to IECE Chinese Journal of Information Fusion

Abstract:In recent years, the field of electroencephalography (EEG) analysis has witnessed remarkable advancements, driven by the integration of machine learning and artificial intelligence. This survey aims to encapsulate the latest developments, focusing on emerging methods and technologies that are poised to transform our comprehension and interpretation of brain activity. We delve into self-supervised learning methods that enable the robust representation of brain signals, which are fundamental for a variety of downstream applications. We also explore emerging discriminative methods, including graph neural networks (GNN), foundation models, and large language models (LLMs)-based approaches. Furthermore, we examine generative technologies that harness EEG data to produce images or text, offering novel perspectives on brain activity visualization and interpretation. The survey provides an extensive overview of these cutting-edge techniques, their current applications, and the profound implications they hold for future research and clinical practice. The relevant literature and open-source materials have been compiled and are consistently being refreshed at this https URL.

[LG-180] Variational Source-Channel Coding for Semantic Communication

Link: https://arxiv.org/abs/2410.08222
Authors: Yulong Feng, Jing Xu, Liujun Hu, Guanghui Yu
Keywords: pivotal bridge connecting, Semantic communication, communication technology emerges, communication, Semantic
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)

Abstract:Semantic communication technology emerges as a pivotal bridge connecting AI with classical communication. The current semantic communication systems are generally modeled as an Auto-Encoder (AE). AE lacks a deep integration of AI principles with communication strategies due to its inability to effectively capture channel dynamics. This gap makes it difficult to justify the need for joint source-channel coding (JSCC) and to explain why performance improves. This paper begins by exploring lossless and lossy communication, highlighting that the inclusion of data distortion distinguishes semantic communication from classical communication. It breaks the conditions for the separation theorem to hold and explains why the amount of data transferred by semantic communication is less. Therefore, employing JSCC becomes imperative for achieving optimal semantic communication. Moreover, a Variational Source-Channel Coding (VSCC) method is proposed for constructing semantic communication systems based on data distortion theory, integrating variational inference and channel characteristics. Using a deep learning network, we develop a semantic communication system employing the VSCC method and demonstrate its capability for semantic transmission. We also establish semantic communication systems of equivalent complexity employing the AE method and the VAE method. Experimental results reveal that the VSCC model offers superior interpretability compared to AE model, as it clearly captures the semantic features of the transmitted data, represented as the variance of latent variables in our experiments. In addition, VSCC model exhibits superior semantic transmission capabilities compared to VAE model. At the same level of data distortion evaluated by PSNR, VSCC model exhibits stronger human interpretability, which can be partially assessed by SSIM.

[LG-181] A Visual-Analytical Approach for Automatic Detection of Cyclonic Events in Satellite Observations

Link: https://arxiv.org/abs/2410.08218
Authors: Akash Agrawal, Mayesh Mohapatra, Abhinav Raja, Paritosh Tiwari, Vishwajeet Pattanaik, Neeru Jaiswal, Arpit Agarwal, Punit Rathore
Keywords: catastrophic weather events, holds crucial significance, predicting catastrophic weather, North Indian Ocean, tropical cyclones holds
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: 10 pages, 22 figures

Abstract:Estimating the location and intensity of tropical cyclones holds crucial significance for predicting catastrophic weather events. In this study, we approach this task as a detection and regression challenge, specifically over the North Indian Ocean (NIO) region, where best-track location and wind speed information serve as the labels. The current process for cyclone detection and intensity estimation involves physics-based simulation studies, which are time-consuming; using only image features will automate the process for significantly faster and more accurate predictions. While conventional methods typically necessitate substantial prior knowledge for training, we are exploring alternative approaches to enhance efficiency. This research focuses specifically on cyclone detection, intensity estimation and related aspects using only image input and data-driven approaches, which will lead to faster inference and automate the process, as opposed to the current NWP models being utilized at SAC. For algorithm development, a novel two-stage detection and intensity estimation module is proposed. In the first-level detection, we try to localize the cyclone over an entire image as captured by INSAT3D over the NIO. For the intensity estimation task, we propose a CNN-LSTM network with a ResNet-18 backbone, which works on the cyclone-centered images and captures both temporal and spatial characteristics.

[LG-182] Embedding an ANN-Based Crystal Plasticity Model into the Finite Element Framework using an ABAQUS User-Material Subroutine

Link: https://arxiv.org/abs/2410.08214
Authors: Yuqing He, Yousef Heider, Bernd Markert
Keywords: trained Neural Networks, Neural Networks, Finite Element, incorporating trained Neural, trained Neural
Subjects: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: 11 pages, 7 figures

Abstract:This manuscript presents a practical method for incorporating trained Neural Networks (NNs) into the Finite Element (FE) framework using a user material (UMAT) subroutine. The work exemplifies crystal plasticity, a complex inelastic non-linear path-dependent material response, with a wide range of applications in ABAQUS UMAT. However, this approach can be extended to other material behaviors and FE tools. The use of a UMAT subroutine serves two main purposes: (1) it predicts and updates the stress or other mechanical properties of interest directly from the strain history; (2) it computes the Jacobian matrix either through backpropagation or numerical differentiation, which plays an essential role in the solution convergence. By implementing NNs in a UMAT subroutine, a trained machine learning model can be employed as a data-driven constitutive law within the FEM framework, preserving multiscale information that conventional constitutive laws often neglect or average. The versatility of this method makes it a powerful tool for integrating machine learning into mechanical simulation. While this approach is expected to provide higher accuracy in reproducing realistic material behavior, the reliability of the solution process and the convergence conditions require special attention. The theory of the model is explained in [Heider et al. 2020], and exemplary source code is made available for interested readers [this https URL].

[LG-183] A Review of Electromagnetic Elimination Methods for low-field portable MRI scanner

Link: https://arxiv.org/abs/2406.17804
Authors: Wanyu Bian
Keywords: eliminating electromagnetic interference, deep learning, deep learning methods, EMI, EMI elimination
Subjects: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Abstract:This paper presents a comprehensive analysis of both conventional and deep learning methods for eliminating electromagnetic interference (EMI) in MRI systems. We explore the underlying principles and implementation of traditional analytical and adaptive EMI elimination techniques, as well as cutting-edge deep learning approaches. Through a detailed comparison, the strengths and limitations of each method are highlighted. Recent advancements in active EMI elimination utilizing multiple external EMI receiver coils and analytical techniques are discussed alongside the superior performance of deep learning methods, which leverage neural networks trained on extensive MRI data. While deep learning methods demonstrate significant improvements in EMI suppression, enhancing diagnostic capabilities and accessibility of MRI technology, they also introduce potential security and safety concerns, especially in production and commercial applications. This study underscores the need to address these challenges to fully realize the benefits of deep learning in EMI elimination. The findings suggest a balanced approach, combining the reliability of conventional methods with the advanced capabilities of deep learning, to develop more robust and effective EMI suppression strategies in MRI systems.

Information Retrieval

[IR-0] Interdependency Matters: Graph Alignment for Multivariate Time Series Anomaly Detection

Link: https://arxiv.org/abs/2410.08877
Authors: Yuanyi Wang, Haifeng Sun, Chengsen Wang, Mengde Zhu, Jingyu Wang, Wei Tang, Qi Qi, Zirui Zhuang, Jianxin Liao
Keywords: Anomaly detection, MTS Anomaly Detection, multivariate time series, mining and industry, Anomaly
Subjects: Machine Learning (cs.LG); Databases (cs.DB); Information Retrieval (cs.IR); Multimedia (cs.MM)

Abstract:Anomaly detection in multivariate time series (MTS) is crucial for various applications in data mining and industry. Current industrial methods typically approach anomaly detection as an unsupervised learning task, aiming to identify deviations by estimating the normal distribution in noisy, label-free datasets. These methods increasingly incorporate interdependencies between channels through graph structures to enhance accuracy. However, the role of interdependencies is more critical than previously understood, as shifts in interdependencies between MTS channels from normal to anomalous data are significant. This observation suggests that anomalies could be detected by changes in these interdependency graph series. To capitalize on this insight, we introduce MADGA (MTS Anomaly Detection via Graph Alignment), which redefines anomaly detection as a graph alignment (GA) problem that explicitly utilizes interdependencies for anomaly detection. MADGA dynamically transforms subsequences into graphs to capture the evolving interdependencies, and graph alignment is performed between these graphs, optimizing an alignment plan that minimizes cost, effectively minimizing the distance for normal data and maximizing it for anomalous data. Uniquely, our GA approach involves explicit alignment of both nodes and edges, employing Wasserstein distance for nodes and Gromov-Wasserstein distance for edges. To our knowledge, this is the first application of GA to MTS anomaly detection that explicitly leverages interdependency for this purpose. Extensive experiments on diverse real-world datasets validate the effectiveness of MADGA, demonstrating its capability to detect anomalies and differentiate interdependencies, consistently achieving state-of-the-art across various scenarios.

[IR-1] A Methodology for Evaluating RAG Systems: A Case Study On Configuration Dependency Validation

Link: https://arxiv.org/abs/2410.08801
Authors: Sebastian Simon, Alina Mailach, Johannes Dorn, Norbert Siegmund
Keywords: large language models, Retrieval-augmented generation, RAG systems, RAG, missing knowledge
Subjects: Software Engineering (cs.SE); Information Retrieval (cs.IR)

Abstract:Retrieval-augmented generation (RAG) is an umbrella of different components, design decisions, and domain-specific adaptations to enhance the capabilities of large language models and counter their limitations regarding hallucination and outdated and missing knowledge. Since it is unclear which design decisions lead to a satisfactory performance, developing RAG systems is often experimental and needs to follow a systematic and sound methodology to gain sound and reliable results. However, there is currently no generally accepted methodology for RAG evaluation despite a growing interest in this technology. In this paper, we propose a first blueprint of a methodology for a sound and reliable evaluation of RAG systems and demonstrate its applicability on a real-world software engineering research task: the validation of configuration dependencies across software technologies. In summary, we make two novel contributions: (i) a novel, reusable methodological design for evaluating RAG systems, including a demonstration that represents a guideline, and (ii) a RAG system, which has been developed following this methodology, that achieves the highest accuracy in the field of dependency validation. For the blueprint’s demonstration, the key insights are the crucial role of choosing appropriate baselines and metrics, the necessity for systematic RAG refinements derived from qualitative failure analysis, as well as the reporting practices of key design decisions to foster replication and evaluation.

[IR-2] Hespi: A pipeline for automatically detecting information from hebarium specimen sheets

Link: https://arxiv.org/abs/2410.08740
Authors: Robert Turnbull, Emily Fitzgerald, Karen Thompson, Joanne L. Birch
Keywords: conservation sciences, Optical Character Recognition, Specimen, data, Specimen sheet PIpeline
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Abstract:Specimen-associated biodiversity data are sought after for biological, environmental, climate, and conservation sciences. A rate shift is required for the extraction of data from specimen images to eliminate the bottleneck that the reliance on human-mediated transcription of these data represents. We applied advanced computer vision techniques to develop the ‘Hespi’ (HErbarium Specimen sheet PIpeline), which extracts a pre-catalogue subset of collection data on the institutional labels on herbarium specimens from their digital images. The pipeline integrates two object detection models; the first detects bounding boxes around text-based labels and the second detects bounding boxes around text-based data fields on the primary institutional label. The pipeline classifies text-based institutional labels as printed, typed, handwritten, or a combination and applies Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) for data extraction. The recognized text is then corrected against authoritative databases of taxon names. The extracted text is also corrected with the aid of a multimodal Large Language Model (LLM). Hespi accurately detects and extracts text for test datasets including specimen sheet images from international herbaria. The components of the pipeline are modular and users can train their own models with their own data and use them in place of the models provided.

[IR-3] Retrieving Contextual Information for Long-Form Question Answering using Weak Supervision EMNLP2024

Link: https://arxiv.org/abs/2410.08623
Authors: Philipp Christmann, Svitlana Vakulenko, Ionut Teodor Sorodoc, Bill Byrne, Adrià de Gispert
Keywords: aims at generating, generating in-depth answers, generating in-depth, Long-form question answering, providing relevant information
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Accepted at EMNLP 2024 (Findings)

Abstract:Long-form question answering (LFQA) aims at generating in-depth answers to end-user questions, providing relevant information beyond the direct answer. However, existing retrievers are typically optimized towards information that directly targets the question, missing out on such contextual information. Furthermore, there is a lack of training data for relevant context. To this end, we propose and compare different weak supervision techniques to optimize retrieval for contextual information. Experiments demonstrate improvements on the end-to-end QA performance on ASQA, a dataset for long-form question answering. Importantly, as more contextual information is retrieved, we improve the relevant page recall for LFQA by 14.7% and the groundedness of generated long-form answers by 12.5%. Finally, we show that long-form answers often anticipate likely follow-up questions, via experiments on a conversational QA dataset.

[IR-4] Intent-Enhanced Data Augmentation for Sequential Recommendation

Link: https://arxiv.org/abs/2410.08583
Authors: Shuai Chen, Zhoujun Li
Keywords: sequential recommendation algorithms, mine dynamic user, sequential recommendation, dynamic user intent, recommendation algorithms focuses
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 14 pages, 3 figures

Abstract:The research on intent-enhanced sequential recommendation algorithms focuses on how to better mine dynamic user intent based on user behavior data for sequential recommendation tasks. Various data augmentation methods are widely applied in current sequential recommendation algorithms, effectively enhancing the ability to capture user intent. However, these widely used data augmentation methods often rely on a large amount of random sampling, which can introduce excessive noise into the training data, blur user intent, and thus negatively affect recommendation performance. Additionally, these methods have limited approaches to utilizing augmented data, failing to fully leverage the augmented samples. We propose an intent-enhanced data augmentation method for sequential recommendation (IESRec), which constructs positive and negative samples based on user behavior sequences through intent-segment insertion. On one hand, the generated positive samples are mixed with the original training data, and they are trained together to improve recommendation performance. On the other hand, the generated positive and negative samples are used to build a contrastive loss function, enhancing recommendation performance through self-supervised training. Finally, the main recommendation task is jointly trained with the contrastive learning loss minimization task. Experiments on three real-world datasets validate the effectiveness of our IESRec model.

[IR-5] Improving Legal Entity Recognition Using a Hybrid Transformer Model and Semantic Filtering Approach

Link: https://arxiv.org/abs/2410.08521
Authors: Duraimurugan Rajamanickam
Keywords: Legal Entity Recognition, Entity Recognition, automating legal workflows, compliance monitoring, contract analysis
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 7 pages, 1 table

Abstract:Legal Entity Recognition (LER) is critical in automating legal workflows such as contract analysis, compliance monitoring, and litigation support. Existing approaches, including rule-based systems and classical machine learning models, struggle with the complexity of legal documents and domain specificity, particularly in handling ambiguities and nested entity structures. This paper proposes a novel hybrid model that enhances the accuracy and precision of Legal-BERT, a transformer model fine-tuned for legal text processing, by introducing a semantic similarity-based filtering mechanism. We evaluate the model on a dataset of 15,000 annotated legal documents, achieving an F1 score of 93.4%, demonstrating significant improvements in precision and recall over previous methods.

[IR-6] Personalized Item Embeddings in Federated Multimodal Recommendation

Link: https://arxiv.org/abs/2410.08478
Authors: Zhiwei Li,Guodong Long,Jing Jiang,Chengqi Zhang
Keywords: Federated recommendation systems, protecting user privacy, recommendation systems play, play a crucial, crucial role
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 4 figures, 5 tables, conference

Abstract:Federated recommendation systems play a crucial role in protecting user privacy. However, existing methods primarily rely on ID-based item embeddings, overlooking the rich multimodal information of items. To address this limitation, we propose a novel Federated Multimodal Recommendation System called FedMR. FedMR leverages a foundation model on the server side to encode multimodal data, such as images and text, associated with items. To tackle the challenge of data heterogeneity caused by varying user preferences, FedMR introduces a Mixing Feature Fusion Module on the client. This module dynamically adjusts the weights of different fusion strategies based on user interaction history, generating personalized item embeddings that capture fine-grained user preferences. FedMR is compatible with existing ID-based federated recommendation systems, improving their performances without modifying the original framework. Our experiments on four real-world multimodal recommendation datasets demonstrate the effectiveness of FedMR. Our code is available at this https URL.

[IR-7] The Effects of Hallucinations in Synthetic Training Data for Relation Extraction ISWC’24

Link: https://arxiv.org/abs/2410.08393
Authors: Steven Rogulsky,Nicholas Popovic,Michael Färber
Keywords: constructing knowledge graphs, Relation extraction, knowledge graphs, foundation for training, constructing knowledge
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted at KBC-LM@ISWC’24

Abstract:Relation extraction is crucial for constructing knowledge graphs, with large high-quality datasets serving as the foundation for training, fine-tuning, and evaluating models. Generative data augmentation (GDA) is a common approach to expand such datasets. However, this approach often introduces hallucinations, such as spurious facts, whose impact on relation extraction remains underexplored. In this paper, we examine the effects of hallucinations on the performance of relation extraction on the document and sentence levels. Our empirical study reveals that hallucinations considerably compromise the ability of models to extract relations from text, with recall reductions between 19.1% and 39.2%. We identify that relevant hallucinations impair the model’s performance, while irrelevant hallucinations have a minimal impact. Additionally, we develop methods for the detection of hallucinations to improve data quality and model performance. Our approaches successfully classify texts as either ‘hallucinated’ or ‘clean,’ achieving high F1-scores of 83.8% and 92.2%. These methods not only assist in removing hallucinations but also help in estimating their prevalence within datasets, which is crucial for selecting high-quality data. Overall, our work confirms the profound impact of relevant hallucinations on the effectiveness of relation extraction models.

[IR-8] Revealing COVID-19's Social Dynamics: Diachronic Semantic Analysis of Vaccine and Symptom Discourse on Twitter

Link: https://arxiv.org/abs/2410.08352
Authors: Zeqiang Wang,Jiageng Wu,Yuqi Wang,Wei Wang,Jie Yang,Jon Johnson,Nishanth Sastry,Suparna De
Keywords: vast textual data, textual data generated, data generated daily, social impacts due, behavior of people
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)

Abstract:Social media is recognized as an important source for deriving insights into public opinion dynamics and social impacts due to the vast textual data generated daily and the ‘unconstrained’ behavior of people interacting on these platforms. However, such analyses prove challenging due to the semantic shift phenomenon, where word meanings evolve over time. This paper proposes an unsupervised dynamic word embedding method to capture longitudinal semantic shifts in social media data without predefined anchor words. The method leverages word co-occurrence statistics and dynamic updating to adapt embeddings over time, addressing the challenges of data sparseness, imbalanced distributions, and synergistic semantic effects. Evaluated on a large COVID-19 Twitter dataset, the method reveals semantic evolution patterns of vaccine- and symptom-related entities across different pandemic stages, and their potential correlations with real-world statistics. Our key contributions include the dynamic embedding technique, empirical analysis of COVID-19 semantic shifts, and discussions on enhancing semantic shift modeling for computational social science research. This study enables capturing longitudinal semantic dynamics on social media to understand public discourse and collective phenomena.

[IR-9] The Language of Sound Search: Examining User Queries in Audio Search Engines

Link: https://arxiv.org/abs/2410.08324
Authors: Benno Weck,Frederic Font
Keywords: study examines textual, general audio retrieval, audio retrieval, audio retrieval systems, text-based audio retrieval
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted at DCASE 2024. Supplementary materials at this https URL

Abstract:This study examines textual, user-written search queries within the context of sound search engines, encompassing various applications such as foley, sound effects, and general audio retrieval. Current research inadequately addresses real-world user needs and behaviours in designing text-based audio retrieval systems. To bridge this gap, we analysed search queries from two sources: a custom survey and Freesound website query logs. The survey was designed to collect queries for an unrestricted, hypothetical sound search engine, resulting in a dataset that captures user intentions without the constraints of existing systems. This dataset is also made available for sharing with the research community. In contrast, the Freesound query logs encompass approximately 9 million search requests, providing a comprehensive view of real-world usage patterns. Our findings indicate that survey queries are generally longer than Freesound queries, suggesting users prefer detailed queries when not limited by system constraints. Both datasets predominantly feature keyword-based queries, with few survey participants using full sentences. Key factors influencing survey queries include the primary sound source, intended usage, perceived location, and the number of sound sources. These insights are crucial for developing user-centred, effective text-based audio retrieval systems, enhancing our understanding of user behaviour in sound search contexts.
