本篇博文主要展示 2024-11-06 从 arXiv.org 论文网站获取的最新论文列表,每日自动更新,按照 NLP、CV、ML、AI、IR 五个大方向区分。若需要邮件定时接收,请在评论区留下你的邮箱地址。

说明:每日论文数据从 arXiv.org 获取,每天中午 12:00 左右定时自动更新。

友情提示:如果您需要邮箱接收每日论文数据,请在评论处留下你的邮箱地址。

目录

概览 (2024-11-06)

今日共更新472篇论文,其中:

  • 自然语言处理76篇(Computation and Language (cs.CL))
  • 人工智能143篇(Artificial Intelligence (cs.AI))
  • 计算机视觉112篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习172篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning

【速读】: 该论文试图解决金融领域多模态模型评估的不足问题,特别是在处理金融特有的图形图像(如蜡烛图和技术指标图)和专业知识(如期货和换手率)时,通用领域的基准测试无法有效评估这些模型的性能。解决方案的关键在于提出了MME-Finance,这是一个面向实际应用的双语开放式视觉问答(VQA)基准测试。其核心特点包括:1) 构建反映用户实际需求的图表(如电脑截图和手机拍摄图像);2) 根据金融领域的查询偏好创建问题;3) 由具有10年以上金融行业经验的专家进行问题标注。此外,论文还开发了一个定制的金融评估系统,首次在多模态评估过程中引入视觉信息。通过广泛的实验评估,发现主流的多模态大语言模型(MLLMs)在MME-Finance上的表现显著低于通用基准测试,尤其是在与金融相关的类别上,如蜡烛图和技术指标图。

链接: https://arxiv.org/abs/2411.03314
作者: Ziliang Gan,Yu Lu,Dong Zhang,Haohan Li,Che Liu,Jian Liu,Ji Liu,Haipang Wu,Chaoyou Fu,Zenglin Xu,Rongjunchen Zhang,Yong Dai
关键词-EN: financial, multimodal models, models, rapid development, guided the rapid
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Project Page: this https URL

点击查看摘要

Abstract:In recent years, multimodal benchmarks for general domains have guided the rapid development of multimodal models on general tasks. However, the financial field has its peculiarities. It features unique graphical images (e.g., candlestick charts, technical indicator charts) and possesses a wealth of specialized financial knowledge (e.g., futures, turnover rate). Therefore, benchmarks from general fields often fail to measure the performance of multimodal models in the financial domain, and thus cannot effectively guide the rapid development of large financial models. To promote the development of large financial multimodal models, we propose MME-Finance, a bilingual open-ended and practical usage-oriented Visual Question Answering (VQA) benchmark. The characteristics of our benchmark are finance and expertise, which include constructing charts that reflect the actual usage needs of users (e.g., computer screenshots and mobile photography), creating questions according to the preferences in financial domain inquiries, and annotating questions by experts with 10+ years of experience in the financial industry. Additionally, we have developed a custom-designed financial evaluation system in which visual information is first introduced in the multi-modal evaluation process. Extensive experimental evaluations of 19 mainstream MLLMs are conducted to test their perception, reasoning, and cognition capabilities. The results indicate that models performing well on general benchmarks cannot do well on MME-Finance; for instance, the top-performing open-source and closed-source models obtain 65.69 (Qwen2VL-72B) and 63.18 (GPT-4o), respectively. Their performance is particularly poor in categories most relevant to finance, such as candlestick charts and technical indicator charts. In addition, we propose a Chinese version, which helps compare performance of MLLMs under a Chinese context.
摘要:近年来,通用领域的多模态基准推动了多模态模型在通用任务上的快速发展。然而,金融领域具有其独特性,包括独特的图形图像(如蜡烛图、技术指标图)和丰富的专业金融知识(如期货、换手率)。因此,通用领域的基准往往无法准确衡量多模态模型在金融领域的表现,从而无法有效指导大型金融模型的快速发展。为促进大型金融多模态模型的发展,我们提出了MME-Finance,这是一个面向实际应用的双语开放式视觉问答(VQA)基准。我们的基准特点在于金融性和专业性,包括构建反映用户实际使用需求的图表(如电脑截图和手机拍摄),根据金融领域查询偏好创建问题,并由具有10年以上金融行业经验的专家进行问题标注。此外,我们还开发了一个定制的金融评估系统,在多模态评估过程中首次引入视觉信息。我们对19个主流多模态大语言模型(MLLMs)进行了广泛的实验评估,测试了它们的感知、推理和认知能力。结果表明,在通用基准上表现良好的模型在MME-Finance上的表现不佳;例如,表现最好的开源和闭源模型分别获得了65.69(Qwen2VL-72B)和63.18(GPT-4o)的分数。它们在金融相关性最强的类别(如蜡烛图和技术指标图)中的表现尤为糟糕。此外,我们还提出了一个中文版本,有助于在中文语境下比较MLLMs的性能。
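
摘要中"模型在蜡烛图等金融特有类别上表现明显偏低"的结论,本质上来自按类别汇总得分的比较。下面用纯 Python 给出一个按类别计算平均分的极简示意(类别名与分数均为虚构数据,并非 MME-Finance 的官方评估系统):

```python
from collections import defaultdict

def per_category_scores(records):
    """按类别汇总模型得分:records 为 (类别, 得分) 列表,返回各类别平均分。"""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, score in records:
        sums[category] += score
        counts[category] += 1
    return {c: sums[c] / counts[c] for c in sums}

# 虚构示例:通用类别得分较高,金融特有类别(如蜡烛图)得分偏低
records = [("OCR", 80.0), ("OCR", 90.0), ("candlestick", 40.0), ("candlestick", 50.0)]
scores = per_category_scores(records)
```

这种分类别统计能直接暴露"总分尚可、金融类目拉胯"的模式。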

[NLP-1] LLMs for Domain Generation Algorithm Detection

【速读】: 该论文试图解决基于大型语言模型 (Large Language Models, LLMs) 的域生成算法 (Domain Generation Algorithms, DGAs) 检测问题。解决方案的关键在于采用了两种重要技术:上下文学习 (In-Context Learning, ICL) 和监督微调 (Supervised Fine-Tuning, SFT)。SFT 通过使用特定领域的数据提升了检测性能,而 ICL 则帮助检测模型快速适应新威胁,无需大量重新训练。实验结果表明,基于 SFT 的 LLM DGA 检测器在检测基于单词的 DGA 域方面表现尤为出色,达到了 94% 的准确率,且误报率 (False Positive Rate, FPR) 仅为 4%,超越了使用注意力层的最先进模型。

链接: https://arxiv.org/abs/2411.03307
作者: Reynier Leyva La O,Carlos A. Catania,Tatiana Parlanti
关键词-EN: domain generation algorithms, large language models, generation algorithms, work analyzes, large language
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:This work analyzes the use of large language models (LLMs) for detecting domain generation algorithms (DGAs). We perform a detailed evaluation of two important techniques: In-Context Learning (ICL) and Supervised Fine-Tuning (SFT), showing how they can improve detection. SFT increases performance by using domain-specific data, whereas ICL helps the detection model to quickly adapt to new threats without requiring much retraining. We use Meta’s Llama3 8B model, on a custom dataset with 68 malware families and normal domains, covering several hard-to-detect schemes, including recent word-based DGAs. Results proved that LLM-based methods can achieve competitive results in DGA detection. In particular, the SFT-based LLM DGA detector outperforms state-of-the-art models using attention layers, achieving 94% accuracy with a 4% false positive rate (FPR) and excelling at detecting word-based DGA domains.
摘要:本文分析了使用大语言模型 (LLM) 进行域生成算法 (DGA) 检测的应用。我们详细评估了两种重要技术:上下文学习 (In-Context Learning, ICL) 和监督微调 (Supervised Fine-Tuning, SFT),展示了它们如何提升检测效果。SFT 通过使用特定领域的数据来提高性能,而 ICL 则帮助检测模型快速适应新威胁,无需大量重新训练。我们采用了 Meta 的 Llama3 8B 模型,在包含 68 种恶意软件家族和正常域的自定义数据集上进行实验,涵盖了多种难以检测的方案,包括最新的基于单词的 DGA。结果表明,基于 LLM 的方法在 DGA 检测中能够取得有竞争力的结果。特别是,基于 SFT 的 LLM DGA 检测器在准确率上超越了使用注意力层的最新模型,达到了 94% 的准确率,误报率 (FPR) 仅为 4%,并且在检测基于单词的 DGA 域方面表现尤为出色。
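
摘要中的上下文学习 (ICL) 思路,可以用"把少量标注示例拼进提示"来粗略示意。以下是一个假设性的提示构造函数,域名与标签均为虚构;论文实际使用的是 Meta 的 Llama3 8B 与自建的 68 族恶意软件数据集:

```python
def build_icl_prompt(examples, query_domain):
    """构造 ICL 提示:将少量 (域名, 标签) 示例拼接在待检测域名之前。"""
    lines = ["判断以下域名是否由域生成算法 (DGA) 生成,回答 DGA 或 legit。"]
    for domain, label in examples:
        lines.append(f"域名: {domain} -> {label}")
    # 末尾留空标签,由模型补全
    lines.append(f"域名: {query_domain} ->")
    return "\n".join(lines)

examples = [("google.com", "legit"), ("xjw3kqpz.net", "DGA")]
prompt = build_icl_prompt(examples, "paintbutterzebra.com")
```

ICL 的好处正如摘要所说:出现新威胁时只需替换示例,无需重新训练。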

[NLP-2] VERITAS: A Unified Approach to Reliability Evaluation

【速读】: 该论文试图解决大型语言模型(LLMs)在知识密集型场景中因无法准确合成上下文信息而导致的输出不可靠问题。解决方案的关键在于引入一个强大的事实核查系统,以检测各种格式下的幻觉现象。论文提出了VERITAS系列幻觉检测模型,旨在灵活适应多样化的上下文环境,同时最小化延迟和成本。VERITAS在主要幻觉检测基准测试中达到了最先进的平均性能,相较于类似规模的模型提升了10%,并在LLM-as-a-judge设置下接近GPT-4 Turbo的性能。

链接: https://arxiv.org/abs/2411.03300
作者: Rajkumar Ramamurthy,Meghana Arakkal Rajeev,Oliver Molenschot,James Zou,Nazneen Rajani
关键词-EN: Large language models, Large language, accurate response, fail to synthesize, synthesize information
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) often fail to synthesize information from their context to generate an accurate response. This renders them unreliable in knowledge intensive settings where reliability of the output is key. A critical component for reliable LLMs is the integration of a robust fact-checking system that can detect hallucinations across various formats. While several open-access fact-checking models are available, their functionality is often limited to specific tasks, such as grounded question-answering or entailment verification, and they perform less effectively in conversational settings. On the other hand, closed-access models like GPT-4 and Claude offer greater flexibility across different contexts, including grounded dialogue verification, but are hindered by high costs and latency. In this work, we introduce VERITAS, a family of hallucination detection models designed to operate flexibly across diverse contexts while minimizing latency and costs. VERITAS achieves state-of-the-art results considering average performance on all major hallucination detection benchmarks, with a 10% increase in average performance compared to similar-sized models, and gets close to the performance of GPT-4 Turbo in the LLM-as-a-judge setting.
摘要:大语言模型(LLMs)在生成准确响应时,往往无法从上下文中综合信息,这使得它们在知识密集型环境中不可靠,因为在这些环境中,输出的可靠性至关重要。可靠的 LLMs 的一个关键组成部分是集成一个强大的事实核查系统,该系统能够在各种格式中检测幻觉。尽管有多个开放访问的事实核查模型可用,但它们的功能通常局限于特定任务,如基于事实的问题回答或蕴涵验证,并且在对话环境中表现较差。另一方面,闭源模型如 GPT-4 和 Claude 在不同上下文中提供了更大的灵活性,包括基于事实的对话验证,但受限于高成本和高延迟。在本研究中,我们引入了 VERITAS,这是一系列幻觉检测模型,旨在灵活应对多样化的上下文,同时最小化延迟和成本。VERITAS 在所有主要幻觉检测基准的平均性能上达到了最先进的结果,与类似规模的模型相比,平均性能提高了 10%,并且在 LLM-as-a-judge 设置下接近 GPT4 turbo 的性能。
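
为了直观说明"对照上下文检测幻觉"这一任务形式,下面给出一个基于词覆盖率的极简启发式草图。它只用于示意任务的输入输出,远非 VERITAS 真实模型的能力:

```python
def support_score(context, answer):
    """回答中词语被上下文覆盖的比例,作为"回答是否有依据"的粗糙信号。"""
    ctx_words = set(context.lower().split())
    ans_words = answer.lower().split()
    if not ans_words:
        return 0.0
    covered = sum(1 for w in ans_words if w in ctx_words)
    return covered / len(ans_words)

def is_hallucinated(context, answer, threshold=0.5):
    """覆盖率低于阈值时标记为疑似幻觉(仅为示意性启发式)。"""
    return support_score(context, answer) < threshold
```

真实系统需要处理同义改写、蕴涵与多轮对话,这正是摘要中开源核查模型与 GPT-4 之间差距的来源。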

[NLP-3] SMoA: Improving Multi-agent Large Language Models with Sparse Mixture-of-Agents

【速读】: 该论文试图解决多智能体系统在增强大型语言模型(LLMs)性能时,由于智能体之间密集交互导致的效率和多样性问题。解决方案的关键在于提出了一种稀疏混合智能体(SMoA)框架,通过引入响应选择和早期停止机制来稀疏化智能体之间的信息流,从而在性能和效率之间取得平衡。此外,借鉴稀疏混合专家(SMoE)框架中的专家多样性原则,为每个LLM智能体分配不同的角色描述,以促进多样化和发散性思维。实验结果表明,SMoA在保持传统混合智能体方法性能的同时,显著降低了计算成本,并展现出更高的稳定性和扩展潜力。

链接: https://arxiv.org/abs/2411.03284
作者: Dawei Li,Zhen Tan,Peijia Qian,Yifan Li,Kumar Satvik Chaudhary,Lijie Hu,Jiayi Shen
关键词-EN: Large Language Models, Language Models, Large Language, scaling agents potentially, agents potentially hampers
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注: Under Review

点击查看摘要

Abstract:While multi-agent systems have been shown to significantly enhance the performance of Large Language Models (LLMs) across various tasks and applications, the dense interaction between scaling agents potentially hampers their efficiency and diversity. To address these challenges, we draw inspiration from the sparse mixture-of-experts (SMoE) and propose a sparse mixture-of-agents (SMoA) framework to improve the efficiency and diversity of multi-agent LLMs. Unlike completely connected structures, SMoA introduces novel Response Selection and Early Stopping mechanisms to sparsify information flows among individual LLM agents, striking a balance between performance and efficiency. Additionally, inspired by the expert diversity principle in SMoE frameworks for workload balance between experts, we assign distinct role descriptions to each LLM agent, fostering diverse and divergent thinking. Extensive experiments on reasoning, alignment, and fairness benchmarks demonstrate that SMoA achieves performance comparable to traditional mixture-of-agents approaches but with significantly lower computational costs. Further analysis reveals that SMoA is more stable, has a greater capacity to scale, and offers considerable potential through hyper-parameter optimization. Code and data will be available at: this https URL.
摘要:尽管多智能体系统已被证明能够显著提升大语言模型 (LLM) 在各种任务和应用中的性能,但智能体之间的密集交互可能会影响其效率和多样性。为解决这些问题,我们借鉴了稀疏混合专家 (SMoE) 的思想,提出了一种稀疏混合智能体 (SMoA) 框架,以提高多智能体 LLM 的效率和多样性。与完全连接的结构不同,SMoA 引入了新的响应选择和早期停止机制,以稀疏化个体 LLM 智能体之间的信息流,从而在性能和效率之间取得平衡。此外,受 SMoE 框架中专家多样性原则的启发,我们为每个 LLM 智能体分配了独特的角色描述,以促进多样化和发散性思维。在推理、对齐和公平性基准上的广泛实验表明,SMoA 在性能上可与传统的混合智能体方法相媲美,但计算成本显著降低。进一步的分析显示,SMoA 更为稳定,具有更大的扩展能力,并通过超参数优化展现出巨大的潜力。代码和数据将在以下链接提供:this https URL。
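
摘要描述的"响应选择 + 早期停止"控制流可以粗略示意如下。其中 agents、judge 与停止阈值均为假设性的占位实现,并非论文代码:

```python
def smoa_round(agents, question, judge, max_kept=2, stop_threshold=0.9):
    """SMoA 单轮的极简示意:只保留评分最高的 max_kept 个回答(响应选择),
    最高分超过阈值则提前停止(早期停止)。agents 为返回字符串的函数列表,
    judge 为返回 [0, 1] 分数的打分函数。"""
    responses = [agent(question) for agent in agents]
    scored = sorted(responses, key=judge, reverse=True)
    kept = scored[:max_kept]                    # 稀疏化:只向下一轮传递少数回答
    stop_early = judge(kept[0]) >= stop_threshold
    return kept, stop_early

# 虚构的智能体与打分函数,仅用于演示控制流
agents = [lambda q: "42",
          lambda q: "the answer is 42",
          lambda q: "I think the answer to this question is 42"]
judge = lambda r: min(len(r) / 20.0, 1.0)
kept, stop = smoa_round(agents, "6*7?", judge)
```

相比全连接的混合智能体,每轮只转发 max_kept 条回答,这就是摘要所说"性能与效率之间的平衡"。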

[NLP-4] DiffLM: Controllable Synthetic Data Generation via Diffusion Language Models

【速读】: 该论文试图解决大语言模型(LLMs)在生成高质量结构化数据时面临的挑战,特别是由于LLMs对目标数据分布的理解有限以及提示工程的复杂性。解决方案的关键在于引入DiffLM,这是一个基于变分自编码器(VAE)的可控数据合成框架。DiffLM通过以下两个关键创新来解决这些问题:(1) 利用扩散模型在学到的潜在分布中保留更多原始分布和格式结构的信息;(2) 通过插拔式潜在特征注入模块,将目标分布知识的学习与LLM的生成目标解耦。此外,为了解决VAE潜在表示与真实数据分布之间的显著差异,框架中引入了潜在扩散模块,以学习一个完全表达的潜在分布。实验结果表明,DiffLM在生成高质量数据方面表现优异,某些情况下在下游任务中的性能甚至超过了真实数据。

链接: https://arxiv.org/abs/2411.03250
作者: Ying Zhou,Xinyao Wang,Yulei Niu,Yaojie Shen,Lexin Tang,Fan Chen,Ben He,Le Sun,Longyin Wen
关键词-EN: Recent advancements, large language models, data, advancements in large, large language
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 17 pages, 8 figures

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have significantly enhanced their knowledge and generative capabilities, leading to a surge of interest in leveraging LLMs for high-quality data synthesis. However, synthetic data generation via prompting LLMs remains challenging due to LLMs’ limited understanding of target data distributions and the complexity of prompt engineering, especially for structured formatted data. To address these issues, we introduce DiffLM, a controllable data synthesis framework based on variational autoencoder (VAE), which further (1) leverages diffusion models to preserve more information of original distribution and format structure in the learned latent distribution and (2) decouples the learning of target distribution knowledge from the LLM’s generative objectives via a plug-and-play latent feature injection module. As we observed significant discrepancies between the VAE’s latent representations and the real data distribution, the latent diffusion module is introduced into our framework to learn a fully expressive latent distribution. Evaluations on seven real-world datasets with structured formatted data (i.e., Tabular, Code and Tool data) demonstrate that DiffLM generates high-quality data, with performance on downstream tasks surpassing that of real data by 2-7 percent in certain cases. The data and code will be publicly available upon completion of internal review.
摘要:近年来,大语言模型 (LLM) 的进步显著提升了其知识和生成能力,引发了利用 LLM 进行高质量数据合成的广泛兴趣。然而,通过提示 LLM 生成合成数据仍然面临挑战,主要原因是 LLM 对目标数据分布的理解有限以及提示工程的复杂性,尤其是在处理结构化格式数据时。为解决这些问题,我们提出了 DiffLM,这是一个基于变分自编码器 (VAE) 的可控数据合成框架。该框架进一步(1)利用扩散模型在学习的潜在分布中保留原始分布和格式结构的更多信息,以及(2)通过一个即插即用的潜在特征注入模块,将目标分布知识的学习与 LLM 的生成目标解耦。我们观察到 VAE 的潜在表示与真实数据分布之间存在显著差异,因此引入了潜在扩散模块,以学习一个完全表达的潜在分布。在七个包含结构化格式数据(即表格数据、代码数据和工具数据)的真实世界数据集上的评估表明,DiffLM 生成的数据质量高,在某些情况下,下游任务的性能比真实数据高出 2-7 个百分点。数据和代码将在内部审查完成后公开发布。
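
作为背景,VAE 的重参数化采样与摘要所说的"即插即用潜在特征注入"可以用如下假设性草图示意。注入方式这里简化为向提示嵌入序列前拼接一个软 token;DiffLM 真实模块的结构摘要中并未给出:

```python
import math
import random

def sample_latent(mu, log_var, rng):
    """VAE 重参数化采样:z = mu + sigma * eps,其中 sigma = exp(log_var / 2)。"""
    return [m + math.exp(0.5 * lv) * rng.gauss(0, 1) for m, lv in zip(mu, log_var)]

def inject_latent(prompt_embeddings, z):
    """假设性的"潜在特征注入":把潜变量作为一个软 token 拼到提示嵌入序列之前,
    使生成目标与潜在分布的学习解耦(仅为示意,非论文实现)。"""
    return [z] + prompt_embeddings

rng = random.Random(0)
z = sample_latent([0.0, 0.0], [0.0, 0.0], rng)     # log_var = 0 即 sigma = 1
seq = inject_latent([[1.0, 1.0], [2.0, 2.0]], z)   # 注入后序列长度 +1
```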

[NLP-5] Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning NEURIPS2024

【速读】: 该论文试图解决Transformer架构在处理序列数据时由于离散化近似导致的截断误差问题,从而提升模型性能。解决方案的关键在于引入一个预测-校正学习框架,该框架包括一个高阶预测器和一个多步校正器,以最小化截断误差。此外,论文还提出了一种基于指数移动平均的系数学习方法,以增强高阶预测器的性能。通过这些改进,模型在多个自然语言处理任务上表现出色,显著优于现有模型。

链接: https://arxiv.org/abs/2411.03042
作者: Bei Li,Tong Zheng,Rui Wang,Jiahao Liu,Qingyan Guo,Junliang Guo,Xu Tan,Tong Xiao,Jingbo Zhu,Jingang Wang,Xunliang Cai
关键词-EN: Ordinary Differential Equations, Differential Equations, multi-particle dynamical systems, Ordinary Differential, inspired significant advancements
类目: Computation and Language (cs.CL)
备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:Residual networks, as discrete approximations of Ordinary Differential Equations (ODEs), have inspired significant advancements in neural network design, including multistep methods, high-order methods, and multi-particle dynamical systems. The precision of the solution to ODEs significantly affects parameter optimization, thereby impacting model performance. In this work, we present a series of advanced explorations of Transformer architecture design to minimize the error compared to the true "solution." First, we introduce a predictor-corrector learning framework to minimize truncation errors, which consists of a high-order predictor and a multistep corrector. Second, we propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor. Extensive experiments on large-scale machine translation, abstractive summarization, language modeling, and natural language understanding benchmarks demonstrate the superiority of our approach. On the WMT’14 English-German and English-French tasks, our model achieved BLEU scores of 30.95 and 44.27, respectively. Furthermore, on the OPUS multilingual machine translation task, our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU, using only 1/3 parameters. Notably, it also beats LLama models by 5.7 accuracy points on the LM Harness Evaluation.
摘要:残差网络作为常微分方程(Ordinary Differential Equations, ODEs)的离散近似,极大地推动了神经网络设计的进步,包括多步法、高阶方法和多粒子动力系统。ODEs解的精度显著影响参数优化,从而影响模型性能。在本研究中,我们提出了一系列针对Transformer架构设计的深入探索,旨在最小化与真实“解”的误差。首先,我们引入了一个预测-校正学习框架来最小化截断误差,该框架由一个高阶预测器和一个多步校正器组成。其次,我们提出了一种基于指数移动平均的系数学习方法,以增强我们的高阶预测器。在大规模机器翻译、抽象摘要、语言建模和自然语言理解基准的广泛实验中,我们的方法表现出了优越性。在WMT’14英德和英法任务中,我们的模型分别达到了30.95和44.27的BLEU分数。此外,在OPUS多语言机器翻译任务中,我们的模型仅使用1/3的参数,就超越了强大的3.8B DeepNet,平均高出2.9 SacreBLEU。值得注意的是,在LM Harness评估中,它还比LLama模型高出5.7个准确度点。
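
摘要中"高阶预测器 + 多步校正器"的数值分析原型,就是教科书中的 Adams-Bashforth 预测加梯形(Adams-Moulton)校正;指数移动平均的系数更新也只有一行。以下示意的是这两个公式本身,而非论文在 Transformer 中的具体实现:

```python
def predictor_corrector_step(f, t, h, y_prev, y_curr):
    """两步 Adams-Bashforth 预测 + 梯形校正,求解 dy/dt = f(t, y) 的一步。"""
    # 预测:y* = y_n + h/2 * (3*f(t_n, y_n) - f(t_{n-1}, y_{n-1}))
    y_pred = y_curr + h / 2 * (3 * f(t, y_curr) - f(t - h, y_prev))
    # 校正:y_{n+1} = y_n + h/2 * (f(t_n, y_n) + f(t_{n+1}, y*))
    y_corr = y_curr + h / 2 * (f(t, y_curr) + f(t + h, y_pred))
    return y_corr

def ema_update(coef, grad, beta=0.9):
    """指数移动平均的系数更新:c <- beta * c + (1 - beta) * g。"""
    return beta * coef + (1 - beta) * grad

# dy/dt = 1 时,一步应恰好前进 h
y_next = predictor_corrector_step(lambda t, y: 1.0, 0.0, 0.1, 0.0, 0.0)
```

预测-校正的组合比单纯的前向(显式)步进截断误差更小,这正是论文把它迁移到残差结构上的动机。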

[NLP-6] Self-Compositional Data Augmentation for Scientific Keyphrase Generation

【速读】: 该论文试图解决的关键问题是生成式关键词 (keyphrase generation) 模型在训练过程中对大量标注数据的需求,而获取这些标注文档既困难又昂贵。为解决这一问题,论文提出了一种自组合数据增强方法 (self-compositional data augmentation method)。该方法的核心在于通过测量训练文档之间基于共享关键词的相关性,将相似文档组合生成合成样本。这种方法的优势在于能够在不依赖外部数据或资源的情况下,生成保持领域一致性的额外训练样本,从而提升关键词生成的性能。实验结果表明,该方法在多个跨三个不同领域的数据集上均能持续改进关键词生成效果,并通过计算机科学领域生成的关键词的定性分析进一步证实了其代表性属性的提升。

链接: https://arxiv.org/abs/2411.03039
作者: Mael Houbre,Florian Boudin,Beatrice Daille,Akiko Aizawa
关键词-EN: achieve good performance, require large amounts, generation require large, good performance, require large
类目: Computation and Language (cs.CL)
备注: Accepted to JCDL 2024. This version is not the final camera-ready version.

点击查看摘要

Abstract:State-of-the-art models for keyphrase generation require large amounts of training data to achieve good performance. However, obtaining keyphrase-labeled documents can be challenging and costly. To address this issue, we present a self-compositional data augmentation method. More specifically, we measure the relatedness of training documents based on their shared keyphrases, and combine similar documents to generate synthetic samples. The advantage of our method lies in its ability to create additional training samples that keep domain coherence, without relying on external data or resources. Our results on multiple datasets spanning three different domains, demonstrate that our method consistently improves keyphrase generation. A qualitative analysis of the generated keyphrases for the Computer Science domain confirms this improvement towards their representativity property.
摘要:目前最先进的生成关键词模型需要大量的训练数据才能达到良好的性能。然而,获取带有关键词标注的文档可能既困难又昂贵。为了解决这一问题,我们提出了一种自组合数据增强方法。更具体地说,我们根据训练文档共享的关键词来衡量它们的相关性,并将相似的文档组合起来生成合成样本。我们方法的优势在于,它能够创建保持领域一致性的额外训练样本,而无需依赖外部数据或资源。我们在涵盖三个不同领域的多个数据集上的实验结果表明,我们的方法能够持续提升关键词生成的效果。对计算机科学领域生成关键词的定性分析进一步证实了这一改进,使其更具代表性。
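
"按共享关键词合并相似文档、生成合成训练样本"这一步可以直接用纯 Python 示意。以下是一个假设性的最简实现,文档与关键词均为虚构:

```python
def self_compose(docs, min_shared=1):
    """自组合数据增强的极简示意:两篇文档共享至少 min_shared 个关键词时,
    拼接文本并合并关键词集合,得到一条合成样本。docs 为 (文本, 关键词集合) 列表。"""
    synthetic = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            text_i, kps_i = docs[i]
            text_j, kps_j = docs[j]
            if len(kps_i & kps_j) >= min_shared:   # 共享关键词 = 领域相关
                synthetic.append((text_i + " " + text_j, kps_i | kps_j))
    return synthetic

docs = [
    ("Doc about neural keyphrase generation.", {"keyphrase generation", "neural network"}),
    ("Doc about keyphrase generation datasets.", {"keyphrase generation", "dataset"}),
    ("Doc about image segmentation.", {"segmentation"}),
]
augmented = self_compose(docs)
```

合成样本只由训练集内部文档组合而成,这正是摘要强调的"无需外部数据或资源、保持领域一致性"。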

[NLP-7] Leveraging Large Language Models in Code Question Answering: Baselines and Issues

【速读】: 该论文试图解决在Python源代码上进行问答的问题,旨在为软件工程师和项目经理提供关于软件产品已实现功能的帮助信息。解决方案的关键在于使用大型语言模型(Large Language Models)进行微调,以实现对Python代码的问答系统。具体来说,论文提出了通过在统一的问题和答案数据集上微调大型语言模型来实现这一目标。为了提高答案质量,研究者测试了不同预处理方式的数据集,包括未经语法修正的数据集、经过语法修正的数据集以及增加了生成摘要的数据集。实验结果表明,语法修正对测试指标值有积极影响,同时也揭示了当前研究领域中公共真实问答数据集质量不佳的问题。

链接: https://arxiv.org/abs/2411.03012
作者: Georgy Andryushchenko,Vladimir Ivanov,Vladimir Makharev,Elizaveta Tukhtina,Aidar Valeev
关键词-EN: source code question, code question answering, Question answering, source code, software product
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages, 3 figures, Accepted to NLP (CCIS) @ AIST’24

点击查看摘要

Abstract:Question answering over source code provides software engineers and project managers with helpful information about the implemented features of a software product. This paper presents a work devoted to using large language models for question answering over source code in Python. The proposed method for implementing a source code question answering system involves fine-tuning a large language model on a unified dataset of questions and answers for Python code. To achieve the highest quality answers, we tested various models trained on datasets preprocessed in different ways: a dataset without grammar correction, a dataset with grammar correction, and a dataset augmented with the generated summaries. The model answers were also analyzed for errors manually. We report BLEU-4, BERTScore F1, BLEURT, and Exact Match metric values, along with the conclusions from the manual error analysis. The obtained experimental results highlight the current problems of the research area, such as poor quality of the public genuine question-answering datasets. In addition, the findings include the positive effect of the grammar correction of the training data on the testing metric values. The addressed findings and issues could be important for other researchers who attempt to improve the quality of source code question answering solutions. The training and evaluation code is publicly available at this https URL.
摘要:源代码问答为软件工程师和项目经理提供了关于软件产品已实现功能的宝贵信息。本文介绍了一项利用大语言模型进行 Python 源代码问答的工作。提出的源代码问答系统实现方法包括对大语言模型进行微调,使其适应于一个统一的 Python 代码问答数据集。为了获得最高质量的答案,我们测试了多种在不同预处理方式下训练的模型:包括未经语法修正的数据集、经过语法修正的数据集,以及增加了生成摘要的数据集。此外,我们还对模型答案进行了手动错误分析。我们报告了 BLEU-4、BERTScore F1、BLEURT 和 Exact Match 等指标的数值,并结合手动错误分析得出了结论。实验结果突显了当前研究领域存在的问题,如公开的真实问答数据集质量不佳。此外,研究结果还包括了训练数据语法修正对测试指标值的积极影响。这些发现和问题对于试图提升源代码问答解决方案质量的其他研究人员可能具有重要意义。训练和评估代码已在以下链接公开:https URL。
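
文中报告的指标里,Exact Match 最容易说明。下面按其常见定义(规范化后完全匹配的比例)给出示意;BLEU-4、BERTScore F1 与 BLEURT 需要额外依赖,这里不展开:

```python
def exact_match(predictions, references):
    """Exact Match:预测与参考答案在小写化、压缩空白后完全相同的比例。"""
    assert len(predictions) == len(references)

    def norm(s):
        return " ".join(s.lower().split())

    hits = sum(1 for p, r in zip(predictions, references) if norm(p) == norm(r))
    return hits / len(predictions)

em = exact_match(
    ["The function returns None.", "uses a  loop"],
    ["the function returns none.", "uses a for loop"],
)
```

Exact Match 对措辞极其敏感,这也是为什么论文同时报告软匹配指标并辅以人工错误分析。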

[NLP-8] Growing a Tail: Increasing Output Diversity in Large Language Models

【速读】: 该论文试图解决的问题是大型语言模型在生成多样性输出时的表现,特别是当多样性被期望时,模型的输出是否足够多样化。解决方案的关键在于通过三种方法来增加模型输出的多样性:1) 通过温度采样增加生成随机性;2) 提示模型从不同视角回答问题;3) 聚合多个模型的输出。这些措施的综合应用显著提高了模型输出的多样性,使其接近人类回答的多样性水平。这一发现对希望保护文化多样性的AI政策具有重要意义,因为文化多样性是民主社会结构的重要组成部分。

链接: https://arxiv.org/abs/2411.02989
作者: Michal Shur-Ofry,Bar Horowitz-Amsalem,Adir Rahamim,Yonatan Belinkov
关键词-EN: large language models, large language, models’ output diversity, language models, models
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:How diverse are the outputs of large language models when diversity is desired? We examine the diversity of responses of various models to questions with multiple possible answers, comparing them with human responses. Our findings suggest that models’ outputs are highly concentrated, reflecting a narrow, mainstream ‘worldview’, in comparison to humans, whose responses exhibit a much longer-tail. We examine three ways to increase models’ output diversity: 1) increasing generation randomness via temperature sampling; 2) prompting models to answer from diverse perspectives; 3) aggregating outputs from several models. A combination of these measures significantly increases models’ output diversity, reaching that of humans. We discuss implications of these findings for AI policy that wishes to preserve cultural diversity, an essential building block of a democratic social fabric.
摘要:当需要多样性时,大语言模型的输出有多多样化?我们研究了不同模型对具有多种可能答案的问题的响应多样性,并将其与人类响应进行比较。研究结果表明,与人类相比,模型的输出高度集中,反映了一种狭窄的主流“世界观”,而人类的响应则表现出更长的尾部特征。我们探讨了三种提高模型输出多样性的方法:1) 通过温度采样增加生成随机性;2) 提示模型从不同角度回答问题;3) 汇总多个模型的输出。这些措施的组合显著提高了模型的输出多样性,达到了与人类相当的水平。我们讨论了这些发现对希望保护文化多样性的 AI 政策的启示,文化多样性是民主社会结构的重要基石。
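
方法 1(温度采样)背后的分布变换是 p_i ∝ exp(logit_i / T):温度越高分布越平坦、采样越多样;T 趋近 0 时退化为贪心选择。下面用纯 Python 示意这一公式:

```python
import math

def temperature_softmax(logits, temperature=1.0):
    """带温度的 softmax:p_i = exp(logit_i / T) / sum_j exp(logit_j / T)。"""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # 减去最大值保证数值稳定
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

low_t = temperature_softmax([2.0, 1.0, 0.0], temperature=0.5)   # 更尖锐
high_t = temperature_softmax([2.0, 1.0, 0.0], temperature=5.0)  # 更平坦
```

论文将温度采样与多视角提示、跨模型聚合三者组合,才把输出多样性提升到接近人类的水平。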

[NLP-9] [Vision Paper] PRObot: Enhancing Patient-Reported Outcome Measures for Diabetic Retinopathy using Chatbots and Generative AI

【速读】: 该论文试图解决糖尿病视网膜病变患者报告结局测量 (PROMs) 的收集问题,特别是现有方法在数据收集上的局限性,如仅依赖定性调查数据或静态问卷的有限回答选项。解决方案的关键在于利用大型语言模型 (LLM) 开发一个交互式聊天机器人应用 (PROBot LLM-PROM),通过该应用,患者可以提供关于其生活质量和治疗进展的详细反馈。该应用能够根据患者的个体挑战提出定制化问题,并利用机器学习技术推断出传统的PROM分数,供临床医生评估治疗状态。这种方法旨在提高患者对医疗系统和治疗的依从性,从而减少后续视力损害的发生。

链接: https://arxiv.org/abs/2411.02973
作者: Maren Pielka,Tobias Schneider,Jan Terheyden,Rafet Sifa
关键词-EN: large language model, patient-reported outcome measures, based chatbot application, language model, outcome measures
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present an outline of the first large language model (LLM) based chatbot application in the context of patient-reported outcome measures (PROMs) for diabetic retinopathy. By utilizing the capabilities of current LLMs, we enable patients to provide feedback about their quality of life and treatment progress via an interactive application. The proposed framework offers significant advantages over the current approach, which encompasses only qualitative collection of survey data or a static survey with limited answer options. Using the PROBot LLM-PROM application, patients will be asked tailored questions about their individual challenges, and can give more detailed feedback on the progress of their treatment. Based on this input, we will use machine learning to infer conventional PROM scores, which can be used by clinicians to evaluate the treatment status. The goal of the application is to improve adherence to the healthcare system and treatments, and thus ultimately reduce cases of subsequent vision impairment. The approach needs to be further validated using a survey and a clinical study.
摘要:我们提出了首个基于大语言模型 (LLM) 的聊天机器人应用框架,用于糖尿病视网膜病变患者的自我报告结果测量 (PROMs)。通过利用当前大语言模型的能力,我们使患者能够通过交互式应用提供关于其生活质量和治疗进展的反馈。该框架相较于当前仅包含定性收集调查数据或静态调查且答案选项有限的方法,具有显著优势。使用 PROBot LLM-PROM 应用,患者将被问及与其个人挑战相关的定制问题,并能提供关于治疗进展的更详细反馈。基于这些输入,我们将利用机器学习推断传统 PROM 评分,这些评分可被临床医生用于评估治疗状态。该应用的目标是提高患者对医疗系统和治疗的依从性,从而最终减少后续视力损伤的病例。该方法需要通过调查和临床研究进一步验证。

[NLP-10] Grounding Natural Language to SQL Translation with Data-Based Self-Explanations

【速读】: 该论文试图解决自然语言到SQL(NL2SQL)翻译模型在初次尝试时可能无法生成最佳SQL输出的问题。解决方案的关键是提出了CycleSQL,这是一个迭代框架,通过引入基于数据的自然语言解释(NL explanations)作为自我反馈,并利用这些反馈来验证和迭代改进翻译的正确性。CycleSQL的核心思想是通过自我评估和反馈循环,持续提升现有模型的翻译准确性,从而在多个基准测试中显著提高翻译性能。

链接: https://arxiv.org/abs/2411.02948
作者: Yuankai Fan,Tonghui Ren,Can Huang,Zhenying He,X. Sean Wang
关键词-EN: Natural Language Interfaces, Databases empower non-technical, Interfaces for Databases, Natural Language, Language Interfaces
类目: Databases (cs.DB); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Natural Language Interfaces for Databases empower non-technical users to interact with data using natural language (NL). Advanced approaches, utilizing either neural sequence-to-sequence or more recent sophisticated large-scale language models, typically implement NL to SQL (NL2SQL) translation in an end-to-end fashion. However, like humans, these end-to-end translation models may not always generate the best SQL output on their first try. In this paper, we propose CycleSQL, an iterative framework designed for end-to-end translation models to autonomously generate the best output through self-evaluation. The main idea of CycleSQL is to introduce data-grounded NL explanations of query results as self-provided feedback, and use the feedback to validate the correctness of the translation iteratively, hence improving the overall translation accuracy. Extensive experiments, including quantitative and qualitative evaluations, are conducted to study CycleSQL by applying it to seven existing translation models on five widely used benchmarks. The results show that 1) the feedback loop introduced in CycleSQL can consistently improve the performance of existing models, and in particular, by applying CycleSQL to RESDSQL, obtains a translation accuracy of 82.0% (+2.6%) on the validation set, and 81.6% (+3.2%) on the test set of Spider benchmark; 2) the generated NL explanations can also provide insightful information for users, aiding in the comprehension of translation results and consequently enhancing the interpretability of NL2SQL translation.
摘要:自然语言数据库接口使非技术人员能够使用自然语言 (NL) 与数据进行交互。先进的方案,无论是采用神经序列到序列模型还是更新的复杂大规模语言模型,通常都以端到端的方式实现自然语言到 SQL (NL2SQL) 的转换。然而,与人类相似,这些端到端翻译模型在初次尝试时未必总能生成最佳的 SQL 输出。本文提出了 CycleSQL,这是一个迭代框架,旨在通过自我评估使端到端翻译模型自主生成最佳输出。CycleSQL 的主要思想是引入基于数据的查询结果自然语言解释作为自我提供的反馈,并利用这些反馈来迭代验证翻译的正确性,从而提高整体翻译的准确性。我们进行了广泛的实验,包括定量和定性评估,通过将 CycleSQL 应用于七个现有翻译模型在五个广泛使用的基准上进行研究。结果表明:1) CycleSQL 引入的反馈循环能够持续提升现有模型的性能,特别是将 CycleSQL 应用于 RESDSQL 时,在 Spider 基准的验证集上获得了 82.0% (+2.6%) 的翻译准确率,在测试集上达到了 81.6% (+3.2%);2) 生成的自然语言解释也能为用户提供有价值的信息,有助于理解翻译结果,从而增强 NL2SQL 翻译的可解释性。
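
CycleSQL 的"翻译 -> 执行 -> 生成基于数据的解释 -> 校验 -> 重试"循环可以抽象为如下控制流。四个组件函数均由调用方注入(真实系统中包装 LLM 与数据库),此处用虚构的桩函数演示第二轮才通过校验的情形:

```python
def cycle_translate(question, translate, execute, explain, validate, max_iters=3):
    """CycleSQL 式迭代循环的假设性草图:解释作为自反馈进入下一轮翻译。"""
    feedback = None
    sql = None
    for _ in range(max_iters):
        sql = translate(question, feedback)
        result = execute(sql)
        explanation = explain(question, sql, result)
        if validate(question, explanation):
            return sql, explanation          # 校验通过,接受当前翻译
        feedback = explanation               # 否则把解释作为反馈重试
    return sql, None                         # 达到迭代上限仍未通过校验

# 虚构的桩函数:第一次翻译漏掉 WHERE 条件,第二次借助反馈修正
attempts = []
def translate(q, fb):
    attempts.append(fb)
    return "SELECT name FROM users" if fb is None else \
           "SELECT name FROM users WHERE age > 30"
def execute(sql): return ["Alice"] if "WHERE" in sql else ["Alice", "Bob", "Carol"]
def explain(q, sql, res): return f"查询返回 {len(res)} 行"
def validate(q, expl): return "1 行" in expl

sql, expl = cycle_translate("30岁以上用户的名字?", translate, execute, explain, validate)
```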

[NLP-11] Capturing research literature attitude towards Sustainable Development Goals: an LLM -based topic modeling approach

【速读】: 该论文试图解决如何从大量科研文献中提取与可持续发展目标(Sustainable Development Goals, SDGs)相关讨论的问题。解决方案的关键在于构建一个完全自动化的管道,包括:1) 从Scopus数据库获取内容并准备专门针对五组SDGs的数据集;2) 利用BERTopic模型进行主题建模,通过LLM(Large Language Model)计算嵌入表示,并在连续空间中表示科学摘要;3) 通过超参数优化器高效地找到最佳配置,以适应新的大数据集;4) 通过关键词搜索和主题频率时间序列提取,实现主题探索;5) 生成交互式仪表盘,展示主题的时间演变,增强结果的可解释性和可探索性。该方法允许用户捕捉2006-2023年间科学摘要中对SDGs态度的演变,并确保结果的可重复性和方法的通用性。

链接: https://arxiv.org/abs/2411.02943
作者: Francesco Invernici,Francesca Curati,Jelena Jakimov,Amirhossein Samavi,Anna Bernasconi
关键词-EN: Sustainable Development Goals, Development Goals, world is facing, facing a multitude, human civilization
类目: Computation and Language (cs.CL)
备注: 27 pages, 8 figures, 5 tables

点击查看摘要

Abstract:The world is facing a multitude of challenges that hinder the development of human civilization and the well-being of humanity on the planet. The Sustainable Development Goals (SDGs) were formulated by the United Nations in 2015 to address these global challenges by 2030. Natural language processing techniques can help uncover discussions on SDGs within research literature. We propose a completely automated pipeline to 1) fetch content from the Scopus database and prepare datasets dedicated to five groups of SDGs; 2) perform topic modeling, a statistical technique used to identify topics in large collections of textual data; and 3) enable topic exploration through keywords-based search and topic frequency time series extraction. For topic modeling, we leverage the stack of BERTopic scaled up to be applied on large corpora of textual documents (we find hundreds of topics on hundreds of thousands of documents), introducing i) a novel LLM-based embeddings computation for representing scientific abstracts in the continuous space and ii) a hyperparameter optimizer to efficiently find the best configuration for any new big datasets. We additionally produce the visualization of results on interactive dashboards reporting topics’ temporal evolution. Results are made inspectable and explorable, contributing to the interpretability of the topic modeling process. Our proposed LLM-based topic modeling pipeline for big-text datasets allows users to capture insights on the evolution of the attitude toward SDGs within scientific abstracts in the 2006-2023 time span. All the results are reproducible by using our system; the workflow can be generalized to be applied at any point in time to any big corpus of textual documents.
摘要:世界正面临着诸多阻碍人类文明发展和地球上人类福祉的挑战。联合国于2015年制定了可持续发展目标(SDGs),旨在到2030年应对这些全球性挑战。自然语言处理技术能够帮助揭示研究文献中关于SDGs的讨论。我们提出了一种完全自动化的流程,包括:1) 从Scopus数据库获取内容并准备专门针对五组SDGs的数据集;2) 进行主题建模,这是一种用于在大规模文本数据集中识别主题的统计技术;3) 通过基于关键词的搜索和主题频率时间序列提取来实现主题探索。在主题建模方面,我们利用了扩展至适用于大规模文本语料库的BERTopic(我们在数十万份文档中发现了数百个主题),并引入了两种创新方法:i) 一种基于大语言模型(LLM)的嵌入计算方法,用于在连续空间中表示科学摘要;ii) 一种超参数优化器,用于高效地为任何新的大数据集找到最佳配置。此外,我们通过交互式仪表盘对结果进行可视化,展示了主题的时间演变。结果的可视化和可探索性有助于提高主题建模过程的可解释性。我们提出的基于大语言模型的主题建模流程适用于大规模文本数据集,使用户能够捕捉到2006年至2023年间科学摘要中对SDGs态度变化的洞察。所有结果均可通过我们的系统复现;该工作流程可以推广应用于任何时间点的任何大规模文本语料库。
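
流水线第 3 步的"主题频率时间序列提取"可以用一个极简的计数函数示意。真实流水线的主题来自 BERTopic 与 LLM 嵌入,此处直接用关键词命中代替,数据为虚构:

```python
from collections import Counter

def topic_time_series(papers, topic_keywords):
    """统计每年摘要中命中任一主题关键词的篇数,得到主题频率时间序列。
    papers 为 (年份, 摘要文本) 列表,topic_keywords 为小写关键词集合。"""
    counts = Counter()
    for year, abstract in papers:
        text = abstract.lower()
        if any(kw in text for kw in topic_keywords):
            counts[year] += 1
    return dict(sorted(counts.items()))

papers = [
    (2006, "A study on climate change mitigation."),
    (2020, "Climate policies and clean energy transitions."),
    (2020, "Deep learning for protein folding."),
]
series = topic_time_series(papers, {"climate", "clean energy"})
```

把这样的序列按主题画到仪表盘上,就得到了文中所说的主题时间演变视图。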

[NLP-12] A Post-Training Enhanced Optimization Approach for Small Language Models

【速读】: 该论文试图解决小语言模型在持续后训练阶段的优化问题,提出了基于大模型数据指导的小语言模型持续后训练对齐数据构建方法。解决方案的关键在于通过大模型的数据指导,优化对齐数据的多样性和准确性,从而提升小语言模型的性能。实验结果表明,该方法在SFT、KTO以及SFT-KTO两阶段后训练和模型权重融合实验中均显著提升了小语言模型的表现。

链接: https://arxiv.org/abs/2411.02939
作者: Keke Zhai
关键词-EN: small language models, alignment data construction, small language, data construction method, Supervised Fine Tuning
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper delves into the continuous post-training optimization methods for small language models, and proposes a continuous post-training alignment data construction method for small language models. The core of this method is based on the data guidance of large models, optimizing the diversity and accuracy of alignment data. In addition, to verify the effectiveness of the methods in this paper, we used Qwen2-0.5B-Instruct model as the baseline model for small language models, using the alignment dataset constructed by our proposed method, we trained and compared several groups of experiments, including SFT (Supervised Fine Tuning) post-training experiment and KTO (Kahneman Tversky optimization) post-training experiment, as well as SFT-KTO two-stage post-training experiment and model weight fusion experiment. Finally, we evaluated and analyzed the performance of post-training models, and confirmed that the continuous post-training optimization method proposed by us can significantly improve the performance of small language models.
摘要:本文深入探讨了小型语言模型的持续后训练优化方法,并提出了一种针对小型语言模型的持续后训练对齐数据构建方法。该方法的核心在于基于大模型的数据指导,优化对齐数据的多样性和准确性。此外,为了验证本文方法的有效性,我们采用了Qwen2-0.5B-Instruct模型作为小型语言模型的基线模型,使用我们提出的方法构建的对齐数据集,进行了多组实验的训练与比较,包括SFT(监督微调)后训练实验、KTO(Kahneman Tversky优化)后训练实验,以及SFT-KTO两阶段后训练实验和模型权重融合实验。最后,我们对后训练模型的性能进行了评估与分析,并确认了我们提出的持续后训练优化方法能够显著提升小型语言模型的性能。
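
文中的"模型权重融合实验"通常指对同构模型的参数做线性插值。摘要未给出融合细节,以下仅为该思路的假设性示意(把权重字典简化为标量字典):

```python
def fuse_weights(state_a, state_b, alpha=0.5):
    """权重融合的极简示意:逐参数线性插值 w = alpha * a + (1 - alpha) * b,
    要求两份权重字典的键(即模型结构)完全一致。"""
    assert state_a.keys() == state_b.keys()
    return {k: alpha * state_a[k] + (1 - alpha) * state_b[k] for k in state_a}

# 虚构示例:融合 SFT 与 KTO 两个后训练阶段得到的权重
sft = {"layer.w": 1.0, "layer.b": 0.0}
kto = {"layer.w": 3.0, "layer.b": 2.0}
fused = fuse_weights(sft, kto, alpha=0.5)
```

实际操作中插值对象是整个 state dict 的张量,alpha 常按验证集表现调优。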

[NLP-13] Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent

【速读】: 该论文试图解决多模态大语言模型(MLLMs)中存在的“幻觉”问题,特别是通过多模态检索增强生成(mRAG)方法来缓解这一问题。然而,现有的启发式mRAG方法通常预定义固定的检索过程,导致两个主要问题:非自适应检索查询和检索查询过载。这些问题在当前的知识寻求型视觉问答(VQA)数据集中无法充分体现,因为这些数据集通常可以通过标准的两步检索轻松获取所需知识。为解决这一数据集差距,论文构建了Dyn-VQA数据集,包含三种类型的“动态”问题,这些问题需要复杂的检索策略,且检索查询、工具和时间都是可变的。实验表明,现有的启发式mRAG方法在处理这些动态问题时表现不佳,因为其检索过程过于僵化。因此,论文提出了第一个自适应规划代理OmniSearch,其核心思想是模拟人类在解决问题时的行为,动态地将复杂的多模态问题分解为子问题链,并进行检索操作。广泛的实验证明了OmniSearch的有效性,并为推进mRAG提供了方向。

链接: https://arxiv.org/abs/2411.02937
作者: Yangning Li,Yinghui Li,Xingyu Wang,Yong Jiang,Zhen Zhang,Xinran Zheng,Hui Wang,Hai-Tao Zheng,Philip S. Yu,Fei Huang,Jingren Zhou
关键词-EN: Retrieval Augmented Generation, Augmented Generation, large language models, Non-adaptive Retrieval Queries, Multimodal Retrieval Augmented
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multimodal Retrieval Augmented Generation (mRAG) plays an important role in mitigating the “hallucination” issue inherent in multimodal large language models (MLLMs). Although promising, existing heuristic mRAGs typically predefined fixed retrieval processes, which causes two issues: (1) Non-adaptive Retrieval Queries. (2) Overloaded Retrieval Queries. However, these flaws cannot be adequately reflected by current knowledge-seeking visual question answering (VQA) datasets, since the most required knowledge can be readily obtained with a standard two-step retrieval. To bridge the dataset gap, we first construct Dyn-VQA dataset, consisting of three types of “dynamic” questions, which require complex knowledge retrieval strategies variable in query, tool, and time: (1) Questions with rapidly changing answers. (2) Questions requiring multi-modal knowledge. (3) Multi-hop questions. Experiments on Dyn-VQA reveal that existing heuristic mRAGs struggle to provide sufficient and precisely relevant knowledge for dynamic questions due to their rigid retrieval processes. Hence, we further propose the first self-adaptive planning agent for multimodal retrieval, OmniSearch. The underlying idea is to emulate the human behavior in question solution which dynamically decomposes complex multimodal questions into sub-question chains with retrieval action. Extensive experiments prove the effectiveness of our OmniSearch, also provide direction for advancing mRAG. The code and dataset will be open-sourced at this https URL.
摘要:多模态检索增强生成 (Multimodal Retrieval Augmented Generation, mRAG) 在缓解多模态大语言模型 (Multimodal Large Language Models, MLLMs) 固有的“幻觉”问题上发挥着重要作用。尽管前景广阔,现有的启发式 mRAG 通常预定义固定的检索过程,这导致了两个问题:(1) 非自适应检索查询。(2) 检索查询过载。然而,这些缺陷在当前的知识寻求型视觉问答 (Visual Question Answering, VQA) 数据集中无法充分体现,因为最需要的知识可以通过标准的两步检索轻松获取。为了弥合数据集的差距,我们首先构建了 Dyn-VQA 数据集,该数据集包含三种类型的“动态”问题,这些问题需要复杂的知识检索策略,这些策略在查询、工具和时间上都是可变的:(1) 答案快速变化的问题。(2) 需要多模态知识的问题。(3) 多跳问题。在 Dyn-VQA 上的实验表明,现有的启发式 mRAG 由于其僵化的检索过程,难以为动态问题提供足够且精确相关的知识。因此,我们进一步提出了首个用于多模态检索的自适应规划智能体,即 OmniSearch。其核心思想是模拟人类在解决问题时的行为,动态地将复杂的多模态问题分解为带有检索动作的子问题链。广泛的实验证明了我们 OmniSearch 的有效性,同时也为推进 mRAG 提供了方向。代码和数据集将在以下链接开源:https URL。
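OmniSearch 的核心是把复杂多模态问题动态分解为"子问题 + 检索动作"的链条。下面用桩函数给出该循环的一个极简骨架(plan_fn、retrieve_fn、answer_fn 均为假设接口,并非论文开源代码):

```python
def self_adaptive_search(question, plan_fn, retrieve_fn, answer_fn, max_steps=5):
    """模拟自适应规划:逐步生成子问题并检索,直到规划器认为信息已足够。"""
    context = []                               # 已完成的 (子问题, 证据) 链
    for _ in range(max_steps):
        sub_q = plan_fn(question, context)     # 规划器依据已有证据决定下一个子问题
        if sub_q is None:                      # 规划器判断无需继续检索
            break
        context.append((sub_q, retrieve_fn(sub_q)))
    return answer_fn(question, context)

# 用桩函数演示一个两跳问题
def plan_fn(q, ctx):
    steps = ["X 的导演是谁?", "该导演的出生年份?"]
    return steps[len(ctx)] if len(ctx) < len(steps) else None

kb = {"X 的导演是谁?": "导演是 Y", "该导演的出生年份?": "Y 生于 1970 年"}
retrieve_fn = lambda sub_q: kb[sub_q]
answer_fn = lambda q, ctx: ctx[-1][1]

print(self_adaptive_search("X 的导演生于哪一年?", plan_fn, retrieve_fn, answer_fn))
# 输出: Y 生于 1970 年
```

真实系统中 plan_fn 由大模型担任,并可动态选择检索工具(文本搜索、图像搜索等),这正是"检索查询、工具和时间都可变"的含义。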

[NLP-14] Textual Aesthetics in Large Language Models

【速读】: 该论文试图解决文本美学(textual aesthetics)在大型语言模型(LLMs)中的应用问题。解决方案的关键在于引入了一个美学打磨流程,并构建了一个名为TexAes的文本美学数据集。论文提出了一种基于直接偏好优化的文本美学驱动的微调方法,称为TAPO(Textual Aesthetics-Powered Optimization),该方法在不牺牲内容正确性的前提下,利用文本美学进行模型优化。此外,论文还开发了两种基于文本和图像分析的文本美学评估方法。实验结果表明,使用文本美学数据和TAPO微调方法不仅提高了美学评分,还提升了在AlpacaEval和Arena-hard等通用评估数据集上的表现。

链接: https://arxiv.org/abs/2411.02930
作者: Lingjie Jiang,Shaohan Huang,Xun Wu,Furu Wei
关键词-EN: textual aesthetics, crucial metric, aesthetics, textual, image generation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Image aesthetics is a crucial metric in the field of image generation. However, textual aesthetics has not been sufficiently explored. With the widespread application of large language models (LLMs), previous work has primarily focused on the correctness of content and the helpfulness of responses. Nonetheless, providing responses with textual aesthetics is also an important factor for LLMs, which can offer a cleaner layout and ensure greater consistency and coherence in content. In this work, we introduce a pipeline for aesthetics polishing and help construct a textual aesthetics dataset named TexAes. We propose a textual aesthetics-powered fine-tuning method based on direct preference optimization, termed TAPO, which leverages textual aesthetics without compromising content correctness. Additionally, we develop two evaluation methods for textual aesthetics based on text and image analysis, respectively. Our experiments demonstrate that using textual aesthetics data and employing the TAPO fine-tuning method not only improves aesthetic scores but also enhances performance on general evaluation datasets such as AlpacaEval and Arena-hard.
摘要:图像美学在图像生成领域是一个关键的评价指标。然而,文本美学尚未得到充分探索。随着大语言模型(LLMs)的广泛应用,以往的研究主要集中在内容的正确性和响应的有用性上。然而,提供具有文本美学的响应对于大语言模型来说也是一个重要因素,它可以提供更清晰的布局,并确保内容的一致性和连贯性。在本研究中,我们引入了一个美学打磨流程,并帮助构建了一个名为 TexAes 的文本美学数据集。我们提出了一种基于直接偏好优化的文本美学驱动的微调方法,称为 TAPO,该方法在不牺牲内容正确性的前提下利用文本美学。此外,我们分别基于文本和图像分析开发了两种文本美学评估方法。我们的实验表明,使用文本美学数据并采用 TAPO 微调方法不仅提高了美学评分,还提升了在 AlpacaEval 和 Arena-hard 等通用评估数据集上的表现。

[NLP-15] Membership Inference Attacks against Large Vision-Language Models NEURIPS2024

【速读】: 该论文试图解决视觉-语言模型 (Vision-Language Models, VLLMs) 中数据安全问题,特别是训练数据中可能包含的敏感信息(如私人照片和医疗记录)的检测问题。解决方案的关键在于引入首个针对不同VLLMs的成员推理攻击 (Membership Inference Attack, MIA) 基准,并提出一种新的MIA流程,专门用于标记级别的图像检测。此外,论文还提出了一种新的评估指标——MaxRényi-K%,该指标基于模型输出的置信度,适用于文本和图像数据,旨在深化对VLLMs中MIA的理解和方法论。

链接: https://arxiv.org/abs/2411.02902
作者: Zhan Li,Yongtao Wu,Yihang Chen,Francesco Tonin,Elias Abad Rocamora,Volkan Cevher
关键词-EN: exhibit promising capabilities, Large vision-language models, processing multi-modal tasks, Large vision-language, exhibit promising
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: NeurIPS 2024

点击查看摘要

Abstract:Large vision-language models (VLLMs) exhibit promising capabilities for processing multi-modal tasks across various application scenarios. However, their emergence also raises significant data security concerns, given the potential inclusion of sensitive information, such as private photos and medical records, in their training datasets. Detecting inappropriately used data in VLLMs remains a critical and unresolved issue, mainly due to the lack of standardized datasets and suitable methodologies. In this study, we introduce the first membership inference attack (MIA) benchmark tailored for various VLLMs to facilitate training data detection. Then, we propose a novel MIA pipeline specifically designed for token-level image detection. Lastly, we present a new metric called MaxRényi-K%, which is based on the confidence of the model output and applies to both text and image data. We believe that our work can deepen the understanding and methodology of MIAs in the context of VLLMs. Our code and datasets are available at this https URL.
摘要:大型视觉语言模型(VLLMs)在处理跨多种应用场景的多模态任务方面展现出显著潜力。然而,由于其训练数据集中可能包含敏感信息,如私人照片和医疗记录,这些模型的出现也引发了重大的数据安全问题。在VLLMs中检测不当使用的数据仍然是一个关键且未解决的问题,主要原因是缺乏标准化的数据集和合适的方法论。在本研究中,我们引入了首个针对多种VLLMs的成员推断攻击(MIA)基准,以促进训练数据的检测。随后,我们提出了一种新颖的MIA流程,专门设计用于Token级别的图像检测。最后,我们引入了一种新的度量标准,称为MaxRényi-K%,该度量基于模型输出的置信度,并适用于文本和图像数据。我们相信,我们的工作能够加深对VLLMs背景下MIA的理解和方法论。我们的代码和数据集可通过以下链接获取:https URL。
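论文提出的 MaxRényi-K% 指标基于模型输出的置信度,其精确定义以论文为准;这里仅给出其数学基础,即 Rényi 熵 H_α(p) = ln(Σ p_i^α) / (1 - α) 的一个极简实现作为参考(alpha 取值为示意):

```python
import math

def renyi_entropy(probs, alpha=2.0):
    """离散分布的 Rényi 熵: H_α(p) = ln(Σ p_i^α) / (1 - α)。"""
    assert abs(alpha - 1.0) > 1e-9, "alpha=1 时退化为香农熵,需单独处理"
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)

# 均匀分布的 Rényi 熵对任意 alpha 都等于 ln(n);分布越尖锐,熵越低,置信度越高
uniform = renyi_entropy([0.25] * 4, alpha=2.0)
confident = renyi_entropy([0.97, 0.01, 0.01, 0.01], alpha=2.0)
print(round(uniform, 4), confident < uniform)  # 1.3863 True
```

成员推断攻击的直觉正在于此:模型对训练中见过的样本往往输出更尖锐(熵更低)的分布,因而逐 Token 的置信度统计量可以作为成员信号。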

[NLP-16] The Translation of Circumlocution in Arabic Short Stories into English

【速读】: 该论文试图解决阿拉伯语中的迂回表达(circumlocution)在英译过程中的翻译问题,特别是如何准确地将阿拉伯语中的迂回表达转化为英语。解决方案的关键在于运用Nida(1964)的翻译理论作为框架,分析源文本和目标文本中的迂回表达实例,并评估所采用的翻译策略的适当性。研究揭示了阿拉伯语迂回表达与英语元话语(metadiscourse)类别在文本和人际功能方面的显著相似性,但也指出了翻译过程中遇到的挑战,如难以准确传达迂回表达的细微差别,导致翻译策略中常采用添加、删减和修改等方法。

链接: https://arxiv.org/abs/2411.02887
作者: Dalal Waadallah Shehab
关键词-EN: renowned Arabic authors, Arabic authors, corpus of short, short stories, stories by renowned
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This study investigates the translation of circumlocution from Arabic to English in a corpus of short stories by renowned Arabic authors. By analyzing the source and target texts, the study aims to identify and categorize circumlocution instances in Arabic and their corresponding renditions in English. The study employs Nida’s (1964) translation theory as a framework to assess the appropriateness of the translation strategies employed. It examines the extent to which translators successfully rendered Arabic circumlocution into English, identifying potential challenges and limitations in the translation process. The findings reveal significant similarities between Arabic circumlocution categories and English metadiscourse categories, particularly in terms of textual and interpersonal functions. However, the study also highlights instances where translators encountered difficulties in accurately conveying the nuances of circumlocution, often resorting to strategies like addition, subtraction, and alteration.
摘要:本研究探讨了在著名阿拉伯作家短篇小说集中,阿拉伯语迂回表达法向英语的翻译问题。通过分析源文本和目标文本,研究旨在识别并分类阿拉伯语中的迂回表达实例及其在英语中的对应翻译。研究采用Nida(1964)的翻译理论作为框架,评估所采用翻译策略的适当性。研究考察了译者在将阿拉伯语迂回表达成功转化为英语方面的程度,识别了翻译过程中可能遇到的挑战和局限。研究结果显示,阿拉伯语迂回表达类别与英语元话语类别之间存在显著相似性,特别是在文本和人际功能方面。然而,研究也指出了译者在准确传达迂回表达细微差别时遇到的困难,通常采用添加、删减和改变等策略。

[NLP-17] TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection

【速读】: 该论文试图解决大型语言模型(LLMs)在处理长上下文时面临的两个主要问题:由于序列长度超出分布导致的性能下降,以及由于注意力机制的二次计算复杂性导致的推理时间过长。解决方案的关键在于提出了一种名为动态令牌级KV缓存选择(Dynamic Token-Level KV Cache Selection, TokenSelect)的方法。TokenSelect通过利用非连续注意力稀疏性的观察,使用查询-键点积来衡量每个注意力头在令牌级别的KV缓存关键性,并通过每个注意力头的软投票机制,选择性地仅涉及少量关键KV缓存令牌进行注意力计算,从而在不牺牲准确性的前提下显著提高计算效率。此外,论文还设计了基于连续查询相似性的选择缓存,并实现了高效的点积内核,进一步减少了令牌选择的额外开销,从而在长上下文推理中实现了高达23.84倍的注意力计算加速和2.28倍的端到端延迟加速。

链接: https://arxiv.org/abs/2411.02886
作者: Wei Wu,Zhuoshi Pan,Chao Wang,Liyi Chen,Yunchu Bai,Kun Fu,Zheng Wang,Hui Xiong
关键词-EN: large language models, LLM-powered search systems, handle longer contexts, capability for Web, Web applications
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:With the development of large language models (LLMs), the ability to handle longer contexts has become a key capability for Web applications such as cross-document understanding and LLM-powered search systems. However, this progress faces two major challenges: performance degradation due to sequence lengths out-of-distribution, and excessively long inference times caused by the quadratic computational complexity of attention. These issues hinder the application of LLMs in long-context scenarios. In this paper, we propose Dynamic Token-Level KV Cache Selection (TokenSelect), a model-agnostic, training-free method for efficient and accurate long-context inference. TokenSelect builds upon the observation of non-contiguous attention sparsity, using Query-Key dot products to measure per-head KV Cache criticality at token-level. By per-head soft voting mechanism, TokenSelect selectively involves a small number of critical KV cache tokens in the attention calculation without sacrificing accuracy. To further accelerate TokenSelect, we designed the Selection Cache based on observations of consecutive Query similarity and implemented efficient dot product kernel, significantly reducing the overhead of token selection. A comprehensive evaluation of TokenSelect demonstrates up to 23.84x speedup in attention computation and up to 2.28x acceleration in end-to-end latency, while providing superior performance compared to state-of-the-art long-context inference methods.
摘要:随着大语言模型(LLM)的发展,处理更长上下文的能力已成为跨文档理解和基于LLM的搜索系统等Web应用的关键能力。然而,这一进展面临两大挑战:由于序列长度超出分布范围导致的性能下降,以及由注意力机制的二次计算复杂性引起的推理时间过长。这些问题阻碍了LLM在长上下文场景中的应用。本文提出了一种模型无关、无需训练的动态Token级KV缓存选择方法(TokenSelect),用于高效且准确的长上下文推理。TokenSelect基于非连续注意力稀疏性的观察,利用Query-Key点积来衡量每个注意力头在Token级的KV缓存关键性。通过每个注意力头的软投票机制,TokenSelect在不影响准确性的前提下,仅选择少量关键的KV缓存Token参与注意力计算。为进一步加速TokenSelect,我们设计了基于连续Query相似性的选择缓存,并实现了高效的点积内核,显著减少了Token选择的开销。对TokenSelect的综合评估显示,注意力计算速度提升了高达23.84倍,端到端延迟加速了高达2.28倍,同时在性能上优于最先进的长上下文推理方法。
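TokenSelect 用 Query-Key 点积衡量每个注意力头下 KV 缓存 Token 的关键性,再经各头"软投票"选出少量关键 Token 参与注意力计算。下面给出该打分与选择逻辑的一个极简 numpy 示意(softmax 归一化与 top-k 选取为示意性实现,细节以论文为准):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def token_select(query, keys, k):
    """query: (H, d) 当前步各头的查询; keys: (H, T, d) 各头 KV 缓存的键。
    返回被选中参与注意力计算的 k 个 Token 下标。"""
    scores = np.einsum("hd,htd->ht", query, keys)  # 每头的 Token 级 Q·K 关键性
    votes = softmax(scores, axis=-1)               # 头内归一化,即"软投票"
    criticality = votes.sum(axis=0)                # 跨头累加票数
    return np.argsort(-criticality)[:k]            # 关键性最高的 k 个 Token

keys = np.array([
    [[1., 0.], [0., 1.], [-1., 0.], [2., 2.]],  # 头 0 的 4 个缓存 Token 的键
    [[0., 1.], [1., 0.], [0., -1.], [2., 2.]],  # 头 1
])                                              # 形状 (H=2, T=4, d=2)
query = np.array([[2., 2.], [2., 2.]])          # 两个头的查询都与第 3 号 Token 最对齐
selected = token_select(query, keys, k=2)
print(3 in selected)  # True
```

真实实现中这一选择发生在每个解码步,且配合论文中的 Selection Cache 来摊销连续相似查询的选择开销。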

[NLP-18] Graph-DPEP: Decomposed Plug and Ensemble Play for Few-Shot Document Relation Extraction with Graph-of-Thoughts Reasoning

【速读】: 该论文试图解决使用生成式大型语言模型 (Generative LLMs) 进行文档级关系抽取 (Document-level Relation Extraction, DocRE) 任务时的挑战,特别是由于DocRE任务的结构化输出格式难以直接转换为自然语言文本,以及少样本学习中信息不足导致的实体关系抽取困难。解决方案的关键在于提出了Graph-DPEP框架,该框架将结构化输出表示为图结构的三元组,并通过以下三个关键步骤实现:1) 采用“分解-插件”方法,通过类型空间分解减轻区分所有关系类型的负担;2) 引入校准器 (Verifier) 来调整生成结果并识别被忽略的查询实体对;3) 开发“集成-重演”策略,利用与缺失查询对相关的子图中的推理思路重新应用生成,以解决缺失问题。通过这些方法,Graph-DPEP框架在公开基准测试中展示了优于现有提示技术和替代语言模型的性能。

链接: https://arxiv.org/abs/2411.02864
作者: Tao Zhang,Ning Yan,Masood Mortazavi,Hoang H. Nguyen,Zhongfen Deng,Philip S. Yu
关键词-EN: Large language models, demonstrated impressive few-shot, impressive few-shot learning, few-shot learning capability, Large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Large language models (LLMs) pre-trained on massive corpora have demonstrated impressive few-shot learning capability on many NLP tasks. Recasting an NLP task into a text-to-text generation task is a common practice so that generative LLMs can be prompted to resolve it. However, performing document-level relation extraction (DocRE) tasks with generative LLM models is still challenging due to the structured output format of DocRE, which complicates the conversion to plain text. Limited information available in few-shot samples and prompt instructions induce further difficulties and challenges in relation extraction for mentioned entities in a document. In this paper, we represent the structured output as a graph-style triplet rather than natural language expressions and leverage generative LLMs for the DocRE task. Our approach, the Graph-DPEP framework is grounded in the reasoning behind triplet explanation thoughts presented in natural language. In this framework, we first introduce a ``decomposed-plug" method for performing the generation from LLMs over prompts with type-space decomposition to alleviate the burden of distinguishing all relation types. Second, we employ a verifier for calibrating the generation and identifying overlooked query entity pairs. Third, we develop “ensemble-play”, reapplying generation on the entire type list by leveraging the reasoning thoughts embedded in a sub-graph associated with the missing query pair to address the missingness issue. Through extensive comparisons with existing prompt techniques and alternative Language Models (LLMs), our framework demonstrates superior performance on publicly available benchmarks in experiments.
摘要:在大规模语料库上预训练的大语言模型 (LLMs) 在许多自然语言处理 (NLP) 任务中展示了令人印象深刻的少样本学习能力。将 NLP 任务重新构造成文本到文本的生成任务是一种常见做法,以便生成式 LLMs 可以通过提示来解决这些任务。然而,使用生成式 LLM 模型执行文档级关系抽取 (DocRE) 任务仍然具有挑战性,因为 DocRE 的结构化输出格式使得转换为纯文本变得复杂。少样本样本和提示指令中有限的信息进一步增加了文档中提及实体关系抽取的难度和挑战。在本文中,我们将结构化输出表示为图风格的三元组,而不是自然语言表达,并利用生成式 LLMs 来处理 DocRE 任务。我们的方法,即 Graph-DPEP 框架,基于自然语言中三元组解释思想的推理。在该框架中,我们首先引入了一种“分解-插件”方法,通过类型空间分解来减轻区分所有关系类型的负担,从而在 LLMs 上执行生成。其次,我们使用一个校准器来校正生成结果并识别被忽略的查询实体对。第三,我们开发了“集成-重演”方法,通过利用与缺失查询对相关的子图中嵌入的推理思想,在整个类型列表上重新应用生成,以解决缺失问题。通过与现有提示技术和替代语言模型 (LLMs) 的广泛比较,我们的框架在公开基准测试中展示了优越的性能。

[NLP-19] Learning to Unify Audio Visual and Text for Audio-Enhanced Multilingual Visual Answer Localization

【速读】: 该论文试图解决多语言视觉答案定位 (Multilingual Visual Answer Localization, MVAL) 任务中现有方法忽视音频模态的问题,导致输入信息不完整和性能不佳。解决方案的关键在于提出了一种统一的音视文跨模态定位 (Audio-Visual-Textual Span Localization, AVTSL) 方法,通过整合音频模态来增强视觉和文本表示。具体来说,该方法包括三个模态特定的预测器:音视预测器、视觉预测器和文本预测器,并通过动态三角损失 (Dynamic Triangular Loss, DTL) 函数引入音视文一致性模块,确保各模态预测结果的一致性和全面性。实验结果表明,该方法显著优于现有的最先进 (SOTA) 方法,证明了音频模态的有效性。

链接: https://arxiv.org/abs/2411.02851
作者: Zhibin Wen,Bin Li
关键词-EN: Multilingual Visual Answer, multilingual question, Multilingual Visual, Visual Answer Localization, MVAL task
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:The goal of Multilingual Visual Answer Localization (MVAL) is to locate a video segment that answers a given multilingual question. Existing methods either focus solely on visual modality or integrate visual and subtitle modalities. However, these methods neglect the audio modality in videos, consequently leading to incomplete input information and poor performance in the MVAL task. In this paper, we propose a unified Audio-Visual-Textual Span Localization (AVTSL) method that incorporates audio modality to augment both visual and textual representations for the MVAL task. Specifically, we integrate features from three modalities and develop three predictors, each tailored to the unique contributions of the fused modalities: an audio-visual predictor, a visual predictor, and a textual predictor. Each predictor generates predictions based on its respective modality. To maintain consistency across the predicted results, we introduce an Audio-Visual-Textual Consistency module. This module utilizes a Dynamic Triangular Loss (DTL) function, allowing each modality’s predictor to dynamically learn from the others. This collaborative learning ensures that the model generates consistent and comprehensive answers. Extensive experiments show that our proposed method outperforms several state-of-the-art (SOTA) methods, which demonstrates the effectiveness of the audio modality.
摘要:多语言视觉答案定位 (Multilingual Visual Answer Localization, MVAL) 的目标是定位一个视频片段,该片段能够回答给定的多语言问题。现有方法要么仅关注视觉模态,要么结合视觉和字幕模态。然而,这些方法忽略了视频中的音频模态,从而导致输入信息不完整,并在 MVAL 任务中表现不佳。本文提出了一种统一的视听文本跨模态定位 (Audio-Visual-Textual Span Localization, AVTSL) 方法,该方法通过引入音频模态来增强视觉和文本表示,以提升 MVAL 任务的表现。具体而言,我们整合了三种模态的特征,并开发了三个预测器,每个预测器针对融合模态的独特贡献进行定制:视听预测器、视觉预测器和文本预测器。每个预测器根据其对应的模态生成预测结果。为了保持预测结果的一致性,我们引入了一个视听文本一致性模块。该模块利用动态三角损失 (Dynamic Triangular Loss, DTL) 函数,使每个模态的预测器能够动态地从其他模态中学习。这种协作学习确保了模型生成一致且全面的答案。大量实验表明,我们提出的方法优于几种最先进 (State-of-the-Art, SOTA) 方法,这证明了音频模态的有效性。
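三个预测器之间的一致性约束,可以抽象为对各模态预测分布做两两距离惩罚(论文中动态三角损失 DTL 的具体形式以原文为准,这里用对称 KL 作为示意性的替代):

```python
import math

def kl(p, q, eps=1e-12):
    """离散分布 p 相对 q 的 KL 散度(加 eps 防止 log(0))。"""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consistency_loss(dists):
    """各预测分布的两两对称 KL 之和,越小表示各模态预测越一致。"""
    loss = 0.0
    for i in range(len(dists)):
        for j in range(i + 1, len(dists)):
            loss += kl(dists[i], dists[j]) + kl(dists[j], dists[i])
    return loss

aligned = [[0.7, 0.2, 0.1]] * 3                               # 三个预测器意见一致
split = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7], [0.3, 0.4, 0.3]]   # 意见分裂
print(consistency_loss(aligned) < consistency_loss(split))  # True
```

训练中最小化这类一致性损失,会促使视听、视觉、文本三个预测器互相学习,输出彼此兼容的答案区间。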

[NLP-20] PersianRAG: A Retrieval-Augmented Generation System for Persian Language

【速读】: 该论文试图解决在波斯语(Persian)这一低资源语言环境下应用检索增强生成模型(Retrieval Augmented Generation, RAG)所面临的挑战。解决方案的关键在于针对波斯语的预处理、嵌入(embedding)、检索、提示构造(prompt construction)、语言建模以及响应评估等环节提出创新方法,并构建了一个名为PersianRAG的实际应用系统。通过在多个波斯语基准数据集上的实验,论文展示了PersianRAG框架在波斯语问答任务中的增强能力。

链接: https://arxiv.org/abs/2411.02832
作者: Hossein Hosseini,Mohammad Siobhan Zare,Amir Hossein Mohammadi,Arefeh Kazemi,Zahra Zojaji,Mohammad Ali Nematbakhsh
关键词-EN: integrate large-scale pre-trained, large-scale pre-trained generative, shown significant success, Retrieval augmented generation, external retrieval mechanisms
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval augmented generation (RAG) models, which integrate large-scale pre-trained generative models with external retrieval mechanisms, have shown significant success in various natural language processing (NLP) tasks. However, applying RAG models in Persian language as a low-resource language, poses distinct challenges. These challenges primarily involve the preprocessing, embedding, retrieval, prompt construction, language modeling, and response evaluation of the system. In this paper, we address the challenges towards implementing a real-world RAG system for Persian language called PersianRAG. We propose novel solutions to overcome these obstacles and evaluate our approach using several Persian benchmark datasets. Our experimental results demonstrate the capability of the PersianRAG framework to enhance question answering task in Persian.
摘要:检索增强生成 (Retrieval augmented generation, RAG) 模型通过将大规模预训练生成模型与外部检索机制相结合,在多种自然语言处理 (Natural Language Processing, NLP) 任务中展现了显著的成功。然而,将 RAG 模型应用于波斯语这一低资源语言时,面临着独特的挑战。这些挑战主要涉及系统的预处理、嵌入、检索、提示构建、语言建模以及响应评估等方面。本文针对实现名为 PersianRAG 的波斯语 RAG 系统所面临的挑战进行了探讨。我们提出了创新性的解决方案以克服这些障碍,并通过多个波斯语基准数据集对我们的方法进行了评估。实验结果表明,PersianRAG 框架能够有效提升波斯语问答任务的性能。
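文中涉及的预处理、嵌入、检索、提示构建等环节,可以抽象为一个与语言无关的 RAG 流程骨架。下面是一个极简示意(检索用词袋余弦相似度代替真实语义嵌入,并非 PersianRAG 的实际实现):

```python
from collections import Counter
import math

def cosine(a, b):
    """两个词频向量(Counter)的余弦相似度。"""
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def retrieve(query, docs, top_k=1):
    """按与查询的相似度对文档排序,返回前 top_k 篇。"""
    q = Counter(query.split())
    return sorted(docs, key=lambda d: cosine(q, Counter(d.split())),
                  reverse=True)[:top_k]

def build_prompt(query, contexts):
    ctx = "\n".join(contexts)
    return f"根据以下资料回答问题。\n资料:\n{ctx}\n问题:{query}\n回答:"

docs = ["tehran is the capital of iran",
        "persian is an indo european language"]
ctx = retrieve("what is the capital of iran", docs)
prompt = build_prompt("what is the capital of iran", ctx)
print(ctx[0])  # tehran is the capital of iran
```

对于波斯语这类低资源语言,骨架本身不变,难点恰恰在于论文所述的各环节替换:波斯语分词与归一化、波斯语嵌入模型,以及面向波斯语的提示与评估。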

[NLP-21] Mixtures of In-Context Learners

【速读】: 该论文试图解决在上下文学习(In-context Learning, ICL)中,由于不加区分地使用示例导致Transformer大语言模型(LLMs)复杂度呈二次增长,从而耗尽内存的问题。解决方案的关键是提出了一种名为“上下文学习者混合体”(Mixtures of In-Context Learners, MoICL)的新方法,该方法将示例子集视为专家,并通过学习权重函数来合并这些专家的输出分布,从而在不增加上下文窗口或内存负担的情况下,提高学习效果。MoICL在多个分类数据集上展示了性能提升,并且在处理域外、不平衡或噪声示例时表现出更强的鲁棒性。

链接: https://arxiv.org/abs/2411.02830
作者: Giwon Hong,Emile van Krieken,Edoardo Ponti,Nikolay Malkin,Pasquale Minervini
关键词-EN: Transformer LLMs, complexity of Transformer, adapts LLMs, model parameters, fine-tuning the model
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In-context learning (ICL) adapts LLMs by providing demonstrations without fine-tuning the model parameters; however, it does not differentiate between demonstrations and quadratically increases the complexity of Transformer LLMs, exhausting the memory. As a solution, we propose Mixtures of In-Context Learners (MoICL), a novel approach to treat subsets of demonstrations as experts and learn a weighting function to merge their output distributions based on a training set. In our experiments, we show performance improvements on 5 out of 7 classification datasets compared to a set of strong baselines (up to +13% compared to ICL and LENS). Moreover, we enhance the Pareto frontier of ICL by reducing the inference time needed to achieve the same performance with fewer demonstrations. Finally, MoICL is more robust to out-of-domain (up to +11%), imbalanced (up to +49%), or noisy demonstrations (up to +38%) or can filter these out from datasets. Overall, MoICL is a more expressive approach to learning from demonstrations without exhausting the context window or memory.
摘要:上下文学习(In-context Learning, ICL)通过提供示例来适应大语言模型(LLM),而无需微调模型参数;然而,它并未区分示例之间的差异,并且使 Transformer 大语言模型的复杂性呈二次方增长,从而耗尽了内存。为此,我们提出了一种名为“上下文学习者混合体”(Mixtures of In-Context Learners, MoICL)的新方法,该方法将示例子集视为专家,并学习一个加权函数,以根据训练集合并这些专家的输出分布。在我们的实验中,与一组强大的基线相比,MoICL 在 7 个分类数据集中的 5 个上展示了性能提升(与 ICL 和 LENS 相比,最高提升达 +13%)。此外,通过减少实现相同性能所需的示例数量,MoICL 提升了 ICL 的帕累托前沿,从而缩短了推理时间。最后,MoICL 对域外示例(最高提升 +11%)、不平衡示例(最高提升 +49%)或噪声示例(最高提升 +38%)具有更强的鲁棒性,或者能够从数据集中过滤掉这些示例。总体而言,MoICL 是一种更具表达力的学习示例的方法,而不会耗尽上下文窗口或内存。
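MoICL 把若干示例子集各自视为一个专家,并学习权重函数来合并它们的输出分布。合并这一步可以示意如下(专家分布与权重均为构造数据;真实方法中权重在训练集上学得):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def moicl_merge(expert_dists, logits):
    """expert_dists: 每个专家在词表上的 next-token 分布; logits: 可学习的专家权重。"""
    w = softmax(logits)
    vocab = len(expert_dists[0])
    return [sum(w[i] * expert_dists[i][v] for i in range(len(w)))
            for v in range(vocab)]

# 两个专家、三词词表;等权重(logits 相同)时结果是简单平均
p1, p2 = [0.8, 0.1, 0.1], [0.2, 0.2, 0.6]
merged = moicl_merge([p1, p2], logits=[0.0, 0.0])
print([round(x, 3) for x in merged])  # [0.5, 0.15, 0.35]
```

由于每个专家只需看到一个示例子集,单次前向的上下文长度得以缩短;权重还能学成负值被 softmax 压低,从而自动过滤噪声或域外示例。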

[NLP-22] DroidSpeak: Enhancing Cross-LLM Communication

【速读】: 该论文试图解决在多智能体系统中使用大型语言模型(LLMs)进行跨智能体通信时,由于需要传递完整的上下文信息而导致的预填充阶段延迟问题。解决方案的关键是引入了一种名为DroidSpeak的新框架,该框架通过重用中间数据(如输入嵌入E-cache和键值缓存KV-cache)来优化跨LLM通信。这种方法避免了重新处理整个上下文的必要性,从而显著加速了上下文整合过程,同时保持了任务性能的质量。实验结果表明,DroidSpeak能够将预填充阶段的延迟减少至多2.78倍,且对准确性的影响可以忽略不计。

链接: https://arxiv.org/abs/2411.02820
作者: Yuhan Liu,Esha Choukse,Shan Lu,Junchen Jiang,Madan Musuvathi
关键词-EN: utilizing Large Language, systems utilizing Large, agents traditionally relies, Large Language Models, utilizing Large
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In multi-agent systems utilizing Large Language Models (LLMs), communication between agents traditionally relies on natural language. This communication often includes the full context of the query so far, which can introduce significant prefill-phase latency, especially with long contexts. We introduce DroidSpeak, a novel framework to target this cross-LLM communication by leveraging the reuse of intermediate data, such as input embeddings (E-cache) and key-value caches (KV-cache). We efficiently bypass the need to reprocess entire contexts for fine-tuned versions of the same foundational model. This approach allows faster context integration while maintaining the quality of task performance. Experimental evaluations demonstrate DroidSpeak’s ability to significantly accelerate inter-agent communication, achieving up to a 2.78x speedup in prefill latency with negligible loss in accuracy. Our findings underscore the potential to create more efficient and scalable multi-agent systems.
摘要:在利用大语言模型 (LLM) 的多智能体系统中,智能体之间的通信传统上依赖于自然语言。这种通信通常包括迄今为止查询的完整上下文,这会引入显著的前填充阶段延迟,尤其是在上下文较长的情况下。我们提出了 DroidSpeak,这是一个新颖的框架,旨在通过利用中间数据的重复使用,如输入嵌入 (E-cache) 和键值缓存 (KV-cache),来解决跨 LLM 通信的问题。我们有效地避免了为同一基础模型的微调版本重新处理整个上下文的需求。这种方法在保持任务性能质量的同时,实现了更快的上下文整合。实验评估表明,DroidSpeak 能够显著加速智能体间的通信,前填充阶段的延迟最高可提升 2.78 倍,且准确性损失可忽略不计。我们的研究结果强调了创建更高效和可扩展的多智能体系统的潜力。
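DroidSpeak 的出发点是:同一基础模型的微调版本之间,共享前缀对应的中间数据(E-cache/KV-cache)可以直接复用,免去重复预填充。下面用一个计数桩模型示意"按前缀复用缓存"的效果(encode_token 为假设的逐 Token 编码接口,用整数代替真实的 KV 张量):

```python
class PrefillCache:
    """按前缀缓存中间结果,只对缓存未命中的后缀做增量计算。
    真实系统缓存的是各层的 KV 张量,这里仅演示复用逻辑。"""
    def __init__(self, encode_token):
        self.encode_token = encode_token
        self.cache = {}      # 前缀元组 -> 该前缀对应的"KV"列表
        self.computed = 0    # 实际编码过的 Token 数

    def prefill(self, tokens):
        kv = []
        for i in range(len(tokens)):
            prefix = tuple(tokens[:i + 1])
            if prefix in self.cache:
                kv = list(self.cache[prefix])      # 直接复用已有前缀的结果
            else:
                kv = kv + [self.encode_token(tokens[i])]
                self.computed += 1
                self.cache[prefix] = list(kv)
        return kv

cache = PrefillCache(encode_token=lambda t: sum(ord(c) for c in t))
cache.prefill(["你", "好", "世", "界"])   # 首次:4 个 Token 全部计算
cache.prefill(["你", "好", "再", "见"])   # 共享前缀 ["你", "好"] 被复用,只新算 2 个
print(cache.computed)  # 6
```

预填充延迟正比于实际编码的 Token 数,因此智能体间反复传递的长公共上下文越多,这种复用带来的加速越显著。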


[NLP-23] The Evolution of RWKV: Advancements in Efficient Language Modeling

【速读】: 该论文试图解决语言模型中训练效率与推理效率之间的矛盾,解决方案的关键在于提出了Receptance Weighted Key Value (RWKV)架构。RWKV通过引入一种新颖的线性注意力机制,成功结合了Transformer的训练效率和RNN的推理效率。这一创新不仅提升了模型在不同领域的适应性,还显著优于传统模型,展示了其在深度学习中的广泛应用潜力。

链接: https://arxiv.org/abs/2411.02795
作者: Akul Datta
关键词-EN: Receptance Weighted Key, efficient language modeling, Receptance Weighted, Weighted Key, emphasizing its advancements
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper reviews the development of the Receptance Weighted Key Value (RWKV) architecture, emphasizing its advancements in efficient language modeling. RWKV combines the training efficiency of Transformers with the inference efficiency of RNNs through a novel linear attention mechanism. We examine its core innovations, adaptations across various domains, and performance advantages over traditional models. The paper also discusses challenges and future directions for RWKV as a versatile architecture in deep learning.
摘要:本文回顾了Receptance Weighted Key Value (RWKV)架构的发展历程,重点强调了其在高效语言建模方面的进步。RWKV通过一种新颖的线性注意力机制,结合了Transformer的训练效率和RNN的推理效率。我们探讨了其核心创新、在不同领域的适应性以及相对于传统模型的性能优势。此外,本文还讨论了RWKV作为深度学习中一种多功能架构所面临的挑战和未来发展方向。
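RWKV 的线性注意力可以写成随时间递推的加权平均:推理时只需维护两个按指数衰减的累加器,而无需回看全部历史,这正是它兼得 RNN 推理效率的原因。下面是该递推思想的一个极简标量示意(省略了 RWKV 对当前 Token 的 u 奖励项等细节):

```python
import math

def wkv_recurrent(ks, vs, w):
    """线性注意力递推: a_t = e^{-w} a_{t-1} + e^{k_t} v_t, b_t 同理, 输出 a_t / b_t。"""
    a, b, outs = 0.0, 0.0, []
    decay = math.exp(-w)
    for k, v in zip(ks, vs):
        a = decay * a + math.exp(k) * v
        b = decay * b + math.exp(k)
        outs.append(a / b)
    return outs

def wkv_direct(ks, vs, w):
    """与递推等价的直接求和形式 Σ e^{-(t-i)w + k_i} v_i / Σ e^{-(t-i)w + k_i},用于核对。"""
    outs = []
    for t in range(len(ks)):
        wts = [math.exp(-(t - i) * w + ks[i]) for i in range(t + 1)]
        outs.append(sum(wt * vs[i] for i, wt in enumerate(wts)) / sum(wts))
    return outs

ks, vs = [0.1, -0.3, 0.7], [1.0, 2.0, 3.0]
r, d = wkv_recurrent(ks, vs, w=0.5), wkv_direct(ks, vs, w=0.5)
print(all(abs(x - y) < 1e-9 for x, y in zip(r, d)))  # True
```

直接求和形式对序列长度是二次开销,而递推形式每步只做常数次运算,这就是"训练可并行、推理按 RNN 递推"的取舍所在。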

[NLP-24] Toward Robust Incomplete Multimodal Sentiment Analysis via Hierarchical Representation Learning NEURIPS2024

【速读】: 该论文试图解决多模态情感分析 (Multimodal Sentiment Analysis, MSA) 任务中不确定模态缺失的问题。解决方案的关键在于提出了一个分层表示学习框架 (Hierarchical Representation Learning Framework, HRLF)。具体来说,该框架通过细粒度表示因子分解模块,利用跨模态转换和情感语义重构,将模态分解为与情感相关的表示和模态特定的表示,从而充分提取有价值的情感信息。此外,引入分层互信息最大化机制,逐步最大化多尺度表示之间的互信息,以对齐和重构表示中的高层语义。最后,通过分层对抗学习机制,进一步对齐和适应情感相关表示的潜在分布,生成鲁棒的联合多模态表示。实验结果表明,HRLF在不确定模态缺失的情况下显著提升了MSA的性能。

链接: https://arxiv.org/abs/2411.02793
作者: Mingcheng Li,Dingkang Yang,Yang Liu,Shunli Wang,Jiawei Chen,Shuaibing Wang,Jinjie Wei,Yue Jiang,Qingyao Xu,Xiaolu Hou,Mingyang Sun,Ziyun Qian,Dongliang Kou,Lihua Zhang
关键词-EN: important research area, recognize human sentiment, Multimodal Sentiment Analysis, Sentiment Analysis, sentiment analysis compared
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:Multimodal Sentiment Analysis (MSA) is an important research area that aims to understand and recognize human sentiment through multiple modalities. The complementary information provided by multimodal fusion promotes better sentiment analysis compared to utilizing only a single modality. Nevertheless, in real-world applications, many unavoidable factors may lead to situations of uncertain modality missing, thus hindering the effectiveness of multimodal modeling and degrading the model’s performance. To this end, we propose a Hierarchical Representation Learning Framework (HRLF) for the MSA task under uncertain missing modalities. Specifically, we propose a fine-grained representation factorization module that sufficiently extracts valuable sentiment information by factorizing modality into sentiment-relevant and modality-specific representations through crossmodal translation and sentiment semantic reconstruction. Moreover, a hierarchical mutual information maximization mechanism is introduced to incrementally maximize the mutual information between multi-scale representations to align and reconstruct the high-level semantics in the representations. Ultimately, we propose a hierarchical adversarial learning mechanism that further aligns and adapts the latent distribution of sentiment-relevant representations to produce robust joint multimodal representations. Comprehensive experiments on three datasets demonstrate that HRLF significantly improves MSA performance under uncertain modality missing cases.
摘要:多模态情感分析 (Multimodal Sentiment Analysis, MSA) 是一个重要的研究领域,旨在通过多种模态理解并识别人类情感。多模态融合提供的互补信息相较于单一模态的使用,能够促进更优的情感分析。然而,在实际应用中,许多不可避免的因素可能导致模态缺失的不确定情况,从而阻碍多模态建模的有效性并降低模型的性能。为此,我们提出了一种针对不确定模态缺失情况下的多模态情感分析任务的分层表示学习框架 (Hierarchical Representation Learning Framework, HRLF)。具体而言,我们提出了一种细粒度表示分解模块,通过跨模态翻译和情感语义重构,将模态分解为与情感相关和模态特定的表示,从而充分提取有价值的情感信息。此外,引入了一种分层互信息最大化机制,以逐步最大化多尺度表示之间的互信息,从而对齐并重构表示中的高层语义。最终,我们提出了一种分层对抗学习机制,进一步对齐并适应情感相关表示的潜在分布,以生成鲁棒的联合多模态表示。在三个数据集上的综合实验表明,HRLF 在不确定模态缺失情况下显著提升了多模态情感分析的性能。

[NLP-25] Language Models and Cycle Consistency for Self-Reflective Machine Translation

【速读】: 该论文试图解决如何在没有目标语言真实翻译的情况下,评估机器翻译(MT)质量和大型语言模型(LLM)翻译能力的问题。解决方案的关键在于利用循环一致性(cycle consistency)作为隐式评估指标。具体来说,论文提出了一种新颖的框架,通过将源语言A的句子翻译成目标语言B,再将其翻译回源语言A,并比较原始句子与回译句子之间的循环一致性,来间接评估目标语言B中的翻译质量。这种方法不仅避免了依赖目标语言的真实翻译,还能通过循环一致性的度量来评估LLM的翻译能力,并且通过多次前向传递(forward passes)来提高翻译质量。实验结果表明,更大的LLM或更多的前向传递次数能够提高循环一致性,这与模型规模和测试时计算量的扩展规律相一致。

链接: https://arxiv.org/abs/2411.02791
作者: Jianqiao Wangni
关键词-EN: leverages large language, translation, LLM, paper introduces, framework that leverages
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:This paper introduces a novel framework that leverages large language models (LLMs) for machine translation (MT). We start with one conjecture: an ideal translation should contain complete and accurate information for a strong enough LLM to recover the original sentence. We generate multiple translation candidates from a source language A to a target language B, and subsequently translate these candidates back to the original language A. By evaluating the cycle consistency between the original and back-translated sentences using metrics such as token-level precision and accuracy, we implicitly estimate the translation quality in language B, without knowing its ground-truth. This also helps to evaluate the LLM translation capability, only with monolingual corpora. For each source sentence, we identify the translation candidate with optimal cycle consistency with the original sentence as the final answer. Our experiments demonstrate that larger LLMs, or the same LLM with more forward passes during inference, exhibit increased cycle consistency, aligning with the LLM model size scaling law and test-time computation scaling law. This work provide methods for, 1) to implicitly evaluate translation quality of a sentence in the target language, 2), to evaluate capability of LLM for any-to-any-language translation, and 3), how to generate a better translation for a specific LLM.
摘要:本文介绍了一种利用大语言模型 (LLM) 进行机器翻译 (MT) 的新框架。我们首先提出一个猜想:一个理想的翻译应当包含足够完整和准确的信息,使得足够强大的 LLM 能够恢复原始句子。我们从源语言 A 生成多个翻译候选到目标语言 B,然后将这些候选翻译回原始语言 A。通过使用 Token 级别的精确度和准确性等指标评估原始句子和回译句子之间的循环一致性,我们隐式地估计了目标语言 B 中的翻译质量,而无需知道其真实值。这也有助于仅使用单语语料库评估 LLM 的翻译能力。对于每个源句子,我们识别出与原始句子具有最佳循环一致性的翻译候选作为最终答案。我们的实验表明,更大的 LLM 或同一 LLM 在推理过程中进行更多前向传递时,循环一致性有所增加,这与 LLM 模型规模扩展定律和测试时计算扩展定律相一致。这项工作提供了以下方法:1) 隐式评估目标语言中句子的翻译质量;2) 评估 LLM 进行任意到任意语言翻译的能力;3) 如何为特定 LLM 生成更好的翻译。
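上述循环一致性可以落实为:把每个候选翻译回译后与原句做 Token 级精确率比较,选一致性最高的候选作为最终答案。极简示意如下(用空格分词和 multiset 精确率;back_translate_fn 为假设的回译接口,这里用桩数据代替):

```python
from collections import Counter

def token_precision(original, back_translated):
    """Token 级精确率:回译句中有多大比例的 Token 出现在原句中。"""
    orig, back = Counter(original.split()), Counter(back_translated.split())
    if not back:
        return 0.0
    overlap = sum(min(back[t], orig[t]) for t in back)
    return overlap / sum(back.values())

def pick_by_cycle_consistency(source, candidates, back_translate_fn):
    """对每个候选翻译做回译打分,返回循环一致性最高的候选。"""
    scored = [(token_precision(source, back_translate_fn(c)), c)
              for c in candidates]
    return max(scored)[1]

# 桩回译:模拟两个候选翻译各自的回译结果
back = {"cand_a": "the cat sat on the mat", "cand_b": "a feline was seated"}
best = pick_by_cycle_consistency("the cat sat on the mat",
                                 ["cand_a", "cand_b"],
                                 back_translate_fn=lambda c: back[c])
print(best)  # cand_a
```

真实设置中正译与回译均由同一个 LLM 完成,候选数和前向次数越多,可供挑选的高一致性翻译也越多,这与论文观察到的测试时计算扩展规律一致。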

[NLP-26] Memory Augmented Cross-encoders for Controllable Personalized Search

【速读】: 该论文试图解决个性化搜索中用户控制与新颖性发现之间的矛盾问题。解决方案的关键在于引入了一种可控的个性化搜索模型,称为CtrlCE。该模型通过在交叉编码器(cross-encoder)中加入可编辑的记忆模块,使得模型能够基于用户的历史交互数据进行条件化处理,并支持用户对个性化结果的控制。此外,论文还提出了一种校准混合模型,用于确定何时需要个性化处理,从而在必要时才请求用户输入以实现控制。这种方法在多个个性化搜索数据集上展示了其有效性,既能实现个性化,又能满足用户对搜索结果控制的关键需求。

链接: https://arxiv.org/abs/2411.02790
作者: Sheshera Mysore,Garima Dhanania,Kishor Patil,Surya Kallumadi,Andrew McCallum,Hamed Zamani
关键词-EN: improve retrieval results, improve retrieval, personalization, Personalized search, represents a problem
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: Work in progress

点击查看摘要

Abstract:Personalized search represents a problem where retrieval models condition on historical user interaction data in order to improve retrieval results. However, personalization is commonly perceived as opaque and not amenable to control by users. Further, personalization necessarily limits the space of items that users are exposed to. Therefore, prior work notes a tension between personalization and users’ ability for discovering novel items. While discovery of novel items in personalization setups may be resolved through search result diversification, these approaches do little to allow user control over personalization. Therefore, in this paper, we introduce an approach for controllable personalized search. Our model, CtrlCE presents a novel cross-encoder model augmented with an editable memory constructed from users historical items. Our proposed memory augmentation allows cross-encoder models to condition on large amounts of historical user data and supports interaction from users permitting control over personalization. Further, controllable personalization for search must account for queries which don’t require personalization, and in turn user control. For this, we introduce a calibrated mixing model which determines when personalization is necessary. This allows system designers using CtrlCE to only obtain user input for control when necessary. In multiple datasets of personalized search, we show CtrlCE to result in effective personalization as well as fulfill various key goals for controllable personalized search.
摘要:个性化搜索代表了一个问题,即检索模型利用历史用户交互数据来改善检索结果。然而,个性化通常被认为是晦涩难懂且不易受用户控制的。此外,个性化必然限制了用户接触到的项目空间。因此,先前的工作指出个性化与用户发现新项目的能力之间存在紧张关系。尽管在个性化设置中可以通过搜索结果多样化来解决新项目的发现问题,但这些方法对用户控制个性化的帮助甚微。因此,本文提出了一种可控的个性化搜索方法。我们的模型 CtrlCE 引入了一种新颖的跨编码器模型,该模型通过从用户历史项目构建的可编辑记忆进行增强。我们提出的记忆增强使得跨编码器模型能够基于大量历史用户数据进行条件化,并支持用户交互,从而允许用户对个性化进行控制。此外,可控的个性化搜索必须考虑那些不需要个性化的查询,以及随之而来的用户控制。为此,我们引入了一个校准混合模型,用于确定何时需要个性化。这使得使用 CtrlCE 的系统设计者仅在必要时获取用户输入以进行控制。在多个个性化搜索数据集中,我们展示了 CtrlCE 不仅实现了有效的个性化,还满足了可控个性化搜索的多个关键目标。

[NLP-27] Novelty-focused R&D landscaping using transformer and local outlier factor

【速读】: 该论文试图解决在研究与开发(R&D)领域中,如何系统性地构建和导航R&D景观,特别是如何评估研究提案的新颖性问题。解决方案的关键在于综合使用基于Transformer的语言模型和局部异常因子(Local Outlier Factor, LOF)。通过进一步训练的Transformer模型捕捉研究提案的语义意义,构建全面的R&D景观,然后利用LOF量化新提案在年度景观中的新颖性,通过评估每个提案与其他提案的差异性来实现。这一系统过程和量化结果旨在为R&D规划和路线图提供决策支持。

链接: https://arxiv.org/abs/2411.02738
作者: Jaewoong Choi
关键词-EN: emphasized predictive analysis, predictive analysis based, research proposals, specifically patents, academic literature
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While numerous studies have explored the field of research and development (R&D) landscaping, the preponderance of these investigations has emphasized predictive analysis based on R&D outcomes, specifically patents, and academic literature. However, the value of research proposals and novelty analysis has seldom been addressed. This study proposes a systematic approach to constructing and navigating the R&D landscape that can be utilized to guide organizations to respond in a reproducible and timely manner to the challenges presented by an increasing number of research proposals. At the heart of the proposed approach is the composite use of the transformer-based language model and the local outlier factor (LOF). The semantic meaning of the research proposals is captured with our further-trained transformers, thereby constructing a comprehensive R&D landscape. Subsequently, the novelty of the newly selected research proposals within the annual landscape is quantified on a numerical scale utilizing the LOF by assessing the dissimilarity of each proposal to others preceding and within the same year. A case study examining research proposals in the energy and resource sector in South Korea is presented. The systematic process and quantitative outcomes are expected to be useful decision-support tools, providing future insights regarding R&D planning and roadmapping.
摘要:尽管众多研究已经探索了研发(R&D)领域的布局,但大多数研究主要集中在基于研发成果(特别是专利和学术文献)的预测分析上。然而,研究提案的价值和新颖性分析却鲜有涉及。本研究提出了一种系统的方法来构建和导航研发布局,该方法可用于指导组织以可重复和及时的方式应对日益增多的研究提案带来的挑战。该方法的核心是综合运用基于Transformer的语言模型和局部异常因子(LOF)。通过进一步训练的Transformer,捕捉研究提案的语义意义,从而构建一个全面的研发布局。随后,利用LOF评估每个提案与之前及同年其他提案的差异性,量化年度布局中新选研究提案的新颖性,并将其数值化。本文还展示了一个案例研究,考察了韩国能源和资源领域的研究提案。该系统过程和量化结果有望成为有用的决策支持工具,为未来的研发规划和路线图提供洞察。
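该方法的核心是用 LOF 对提案嵌入的新颖性打分。下面给出一个基于 scikit-learn 的最小示意:嵌入此处用随机向量代替论文中 Transformer 编码得到的提案向量,`n_neighbors` 等参数亦为假设值,并非论文原设置。

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
past_embeddings = rng.normal(size=(200, 64))     # 往年提案的嵌入(示意数据)
new_embeddings = rng.normal(size=(5, 64)) + 3.0  # 新提案的嵌入(刻意偏离历史分布)

# novelty=True:仅用历史数据拟合,再对新样本打分
lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof.fit(past_embeddings)

# score_samples 越小越异常;取负号后数值越大表示相对历史提案越新颖
novelty_scores = -lof.score_samples(new_embeddings)
```

实际使用时,可按年度对新提案的 `novelty_scores` 排序,得到数值化的新颖性排名。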

[NLP-28] A Natural Language Processing Approach to Support Biomedical Data Harmonization: Leveraging Large Language Models

【速读】: 该论文试图解决生物医学研究中数据集变量匹配的自动化问题,特别是如何在大规模、多样化的样本中实现无偏结果。解决方案的关键在于利用大型语言模型 (LLM) 和集成学习 (ensemble learning) 技术,通过自然语言处理 (NLP) 方法自动匹配不同数据集中的变量。具体方法包括:1) 基于LLM的变量匹配;2) 模糊匹配;3) 集成学习方法,使用随机森林模型 (Random Forest, RF) 整合前两种方法的结果。实验结果表明,集成学习方法在变量匹配的准确性和效率上优于单一方法,其中LLM生成的特征对RF模型的性能贡献最大。

链接: https://arxiv.org/abs/2411.02730
作者: Zexu Li,Suraj P. Prabhu,Zachary T. Popp,Shubhi S. Jain,Vijetha Balakundi,Ting Fang Alvin Ang,Rhoda Au,Jinying Chen
关键词-EN: Biomedical research requires, Biomedical research, produce unbiased results, diverse samples, matching
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 32 pages, 2 figures

点击查看摘要

Abstract:Biomedical research requires large, diverse samples to produce unbiased results. Automated methods for matching variables across datasets can accelerate this process. Research in this area has been limited, primarily focusing on lexical matching and ontology based semantic matching. We aimed to develop new methods, leveraging large language models (LLM) and ensemble learning, to automate variable matching. Methods: We utilized data from two GERAS cohort (European and Japan) studies to develop variable matching methods. We first manually created a dataset by matching 352 EU variables with 1322 candidate JP variables, where matched variable pairs were positive and unmatched pairs were negative instances. Using this dataset, we developed and evaluated two types of natural language processing (NLP) methods, which matched variables based on variable labels and definitions from data dictionaries: (1) LLM-based and (2) fuzzy matching. We then developed an ensemble-learning method, using the Random Forest model, to integrate individual NLP methods. RF was trained and evaluated on 50 trials. Each trial had a random split (4:1) of training and test sets, with the model’s hyperparameters optimized through cross-validation on the training set. For each EU variable, 1322 candidate JP variables were ranked based on NLP-derived similarity scores or RF’s probability scores, denoting their likelihood to match the EU variable. Ranking performance was measured by top-n hit ratio (HRn) and mean reciprocal rank (MRR). Results: E5 performed best among individual methods, achieving 0.90 HR-30 and 0.70 MRR. RF performed better than E5 on all metrics over 50 trials (P < 0.001) and achieved an average HR-30 of 0.98 and MRR of 0.73. LLM-derived features contributed most to RF’s performance. One major cause of errors in automatic variable matching was ambiguous variable definitions within data dictionaries.
摘要:生物医学研究需要大量且多样化的样本以产生无偏的结果。自动化方法用于跨数据集匹配变量可以加速这一过程。该领域的研究一直有限,主要集中在词汇匹配和基于本体的语义匹配上。我们的目标是开发新的方法,利用大语言模型 (LLM) 和集成学习来自动化变量匹配。方法:我们利用了两个 GERAS 队列(欧洲和日本)研究的数据来开发变量匹配方法。首先,我们通过手动匹配 352 个欧盟变量与 1322 个候选日本变量,创建了一个数据集,其中匹配的变量对为正实例,未匹配的对为负实例。利用此数据集,我们开发并评估了两种类型的自然语言处理 (NLP) 方法,这些方法基于数据字典中的变量标签和定义进行变量匹配:(1) 基于 LLM 的方法和 (2) 模糊匹配方法。然后,我们开发了一种集成学习方法,使用随机森林模型 (RF) 来整合个体 NLP 方法。RF 在 50 次试验中进行了训练和评估。每次试验都有随机的训练和测试集分割(4:1),并通过训练集上的交叉验证优化模型的超参数。对于每个欧盟变量,1322 个候选日本变量根据 NLP 导出的相似度分数或 RF 的概率分数进行排序,表示它们与欧盟变量匹配的可能性。排序性能通过 top-n 命中率 (HRn) 和平均倒数排名 (MRR) 来衡量。结果:在个体方法中,E5 表现最佳,达到 0.90 HR-30 和 0.70 MRR。RF 在所有指标上均优于 E5,在 50 次试验中表现更好(P < 0.001),平均 HR-30 为 0.98,MRR 为 0.73。LLM 导出的特征对 RF 的性能贡献最大。自动变量匹配中错误的一个主要原因是数据字典中变量定义的模糊性。
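论文以 top-n 命中率 (HRn) 和平均倒数排名 (MRR) 衡量候选变量的排序性能,两者的计算方式可用如下最小示例说明(示例数据为虚构):

```python
def hit_ratio_at_n(ranked_lists, gold, n=30):
    """top-n 命中率:正确匹配出现在前 n 个候选中的查询占比。"""
    hits = sum(1 for ranked, g in zip(ranked_lists, gold) if g in ranked[:n])
    return hits / len(gold)

def mean_reciprocal_rank(ranked_lists, gold):
    """平均倒数排名:正确匹配排名的倒数的均值(未命中计 0)。"""
    total = sum(1.0 / (ranked.index(g) + 1)
                for ranked, g in zip(ranked_lists, gold) if g in ranked)
    return total / len(gold)

ranked = [["v3", "v1", "v2"], ["v9", "v8", "v7"]]  # 两个查询的候选排序
gold = ["v1", "v7"]                                # 各自的正确匹配
print(hit_ratio_at_n(ranked, gold, n=2))   # 0.5
print(mean_reciprocal_rank(ranked, gold))  # (1/2 + 1/3) / 2 ≈ 0.4167
```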

[NLP-29] Multimodal Commonsense Knowledge Distillation for Visual Question Answering AAAI2025

【速读】: 该论文试图解决现有多模态大语言模型 (Multimodal Large Language Models, MLLMs) 和视觉语言预训练模型 (Visual Language Pretrained Models, VLPMs) 在处理需要外部常识知识 (commonsense knowledge) 的视觉问答 (Visual Question Answering, VQA) 任务时遇到的挑战,主要问题包括生成高质量提示 (prompts) 的困难和高计算成本的微调 (fine-tuning)。解决方案的关键在于提出了一种基于图的常识知识蒸馏框架,通过在常识知识、视觉对象和问题之间构建统一的关系图 (relational graph),并利用图卷积网络 (Graph Convolutional Network, GCN) 在教师-学生环境中进行知识蒸馏。该框架具有灵活性,可以与任何类型的教师和学生模型结合,无需进一步微调,并在ScienceQA数据集上取得了竞争性的性能。

链接: https://arxiv.org/abs/2411.02722
作者: Shuo Yang,Siwen Luo,Soyeon Caren Han
关键词-EN: Existing Multimodal Large, Visual Language Pretrained, Multimodal Large Language, Large Language Models, Language Pretrained Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: AAAI 2025 (Accepted, Oral)

点击查看摘要

Abstract:Existing Multimodal Large Language Models (MLLMs) and Visual Language Pretrained Models (VLPMs) have shown remarkable performances in the general Visual Question Answering (VQA). However, these models struggle with VQA questions that require external commonsense knowledge due to the challenges in generating high-quality prompts and the high computational costs of fine-tuning. In this work, we propose a novel graph-based multimodal commonsense knowledge distillation framework that constructs a unified relational graph over commonsense knowledge, visual objects and questions through a Graph Convolutional Network (GCN) following a teacher-student environment. This proposed framework is flexible with any type of teacher and student models without further fine-tuning, and has achieved competitive performances on the ScienceQA dataset.
摘要:现有的多模态大语言模型 (Multimodal Large Language Models, MLLMs) 和视觉语言预训练模型 (Visual Language Pretrained Models, VLPMs) 在一般的视觉问答 (Visual Question Answering, VQA) 任务中表现出色。然而,这些模型在处理需要外部常识知识的 VQA 问题时遇到了困难,主要原因是生成高质量提示的挑战以及微调的高计算成本。在本研究中,我们提出了一种基于图的多模态常识知识蒸馏框架,该框架通过图卷积网络 (Graph Convolutional Network, GCN) 在教师-学生环境中构建了一个统一的常识知识、视觉对象和问题的关系图。该框架具有灵活性,可以与任何类型的教师和学生模型结合使用,而无需进一步微调,并在 ScienceQA 数据集上取得了具有竞争力的表现。
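该框架的核心是用 GCN 在统一关系图上传播知识。一层标准的图卷积传播可用 NumPy 简要示意(仅展示 GCN 的基本运算,与论文的具体网络结构无关,示例图与权重均为虚构):

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """一次标准 GCN 传播:ReLU(D^{-1/2} (A+I) D^{-1/2} X W)。"""
    a_hat = adj + np.eye(adj.shape[0])                      # 加自环
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))  # 度归一化
    return np.maximum(d_inv_sqrt @ a_hat @ d_inv_sqrt @ feats @ weight, 0.0)

adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # 3 节点链状图
feats = np.eye(3)         # 节点初始特征(one-hot 示意)
weight = np.ones((3, 2))  # 可学习权重矩阵(此处用常数示意)
out = gcn_layer(adj, feats, weight)
```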

[NLP-30] Game Plot Design with an LLM-powered Assistant: An Empirical Study with Game Designers

【速读】: 该论文试图解决游戏设计师在创作回合制游戏叙事时面临的挑战,特别是如何生成沉浸式叙事并进行有效测试和迭代。解决方案的关键在于引入了一个名为GamePlot的生成式AI (LLM) 助手,该助手通过支持设计师在协作游戏过程中创建和测试叙事,并在过程中不断优化剧情,从而提升叙事的沉浸感和设计师的创作满意度。研究结果表明,尽管LLM在生成复杂和创新内容方面存在局限性,但GamePlot在提高用户满意度和叙事所有权感方面表现出色,同时也揭示了不同用户群体对AI助手的期望差异,强调了针对不同用户群体定制AI助手的重要性。

链接: https://arxiv.org/abs/2411.02714
作者: Seyed Hossein Alavi,Weijia Xu,Nebojsa Jojic,Daniel Kennett,Raymond T. Ng,Sudha Rao,Haiyan Zhang,Bill Dolan,Vered Shwartz
关键词-EN: crafting immersive narratives, collaborative game play, supports game designers, introduce GamePlot, crafting immersive
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:We introduce GamePlot, an LLM-powered assistant that supports game designers in crafting immersive narratives for turn-based games, and allows them to test these games through a collaborative game play and refine the plot throughout the process. Our user study with 14 game designers shows high levels of both satisfaction with the generated game plots and sense of ownership over the narratives, but also reconfirms that LLMs are limited in their ability to generate complex and truly innovative content. We also show that diverse user populations have different expectations from AI assistants, and encourage researchers to study how tailoring assistants to diverse user groups could potentially lead to increased job satisfaction and greater creativity and innovation over time.
摘要:我们介绍了 GamePlot,这是一个由大语言模型 (LLM) 驱动的助手,旨在支持游戏设计师为回合制游戏创作沉浸式叙事,并通过协作游戏玩法测试这些游戏,并在整个过程中不断完善剧情。我们对 14 名游戏设计师进行的用户研究表明,他们对生成的游戏剧情满意度高,并对叙事拥有强烈的归属感,但同时也再次确认了大语言模型在生成复杂且真正创新内容方面的能力有限。我们还发现,不同用户群体对 AI 助手有不同的期望,并鼓励研究人员探讨如何针对不同用户群体定制助手,以期随着时间的推移提高工作满意度,并促进更大的创造力和创新。

[NLP-31] Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios

【速读】: 该论文试图解决多模态大语言模型(MLLMs)在面对误导性信息时响应一致性的问题。解决方案的关键在于提出了一种两阶段管道:首先收集MLLMs在没有误导信息情况下的响应,然后通过特定的误导指令收集含有误导信息的响应。通过计算误导率,并捕捉两组响应之间的正确到错误和错误到正确的转变,可以有效度量模型的响应不确定性。最终,论文建立了一个名为**Multimodal Uncertainty Benchmark (MUB)**的基准,使用显式和隐式的误导指令全面评估MLLMs在多个领域的脆弱性。实验结果显示,所有开源和闭源的MLLMs对误导指令都非常敏感,平均误导率超过86%。为增强MLLMs的鲁棒性,论文进一步通过引入显式和隐式的误导数据对所有开源MLLMs进行微调,显著降低了误导率。

链接: https://arxiv.org/abs/2411.02708
作者: Yunkai Dang,Mengxi Gao,Yibo Yan,Xin Zou,Yanggan Gu,Aiwei Liu,Xuming Hu
关键词-EN: Multimodal Large Language, Large Language Models, trustworthy multimodal intelligence, developing trustworthy multimodal, Large Language
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Ensuring that Multimodal Large Language Models (MLLMs) maintain consistency in their responses is essential for developing trustworthy multimodal intelligence. However, existing benchmarks include many samples where all MLLMs exhibit high response uncertainty when encountering misleading information, requiring even 5-15 response attempts per sample to effectively assess uncertainty. Therefore, we propose a two-stage pipeline: first, we collect MLLMs’ responses without misleading information, and then gather misleading ones via specific misleading instructions. By calculating the misleading rate, and capturing both correct-to-incorrect and incorrect-to-correct shifts between the two sets of responses, we can effectively metric the model’s response uncertainty. Eventually, we establish a Multimodal Uncertainty Benchmark (MUB) that employs both explicit and implicit misleading instructions to comprehensively assess the vulnerability of MLLMs across diverse domains. Our experiments reveal that all open-source and close-source MLLMs are highly susceptible to misleading instructions, with an average misleading rate exceeding 86%. To enhance the robustness of MLLMs, we further fine-tune all open-source MLLMs by incorporating explicit and implicit misleading data, which demonstrates a significant reduction in misleading rates. Our code is available at: this https URL
摘要:确保多模态大语言模型 (Multimodal Large Language Models, MLLMs) 在响应中保持一致性对于开发可信赖的多模态智能至关重要。然而,现有基准包含许多样本,在这些样本中,所有 MLLMs 在遇到误导性信息时都表现出高度的响应不确定性,甚至需要每样本进行 5-15 次响应尝试才能有效评估不确定性。因此,我们提出了一种两阶段流程:首先,我们收集 MLLMs 在没有误导性信息情况下的响应,然后通过特定的误导性指令收集误导性响应。通过计算误导率,并捕捉两组响应之间的正确到错误和错误到正确的转变,我们可以有效度量模型的响应不确定性。最终,我们建立了一个名为 多模态不确定性基准 (Multimodal Uncertainty Benchmark, MUB) 的基准,该基准采用显式和隐式的误导性指令,全面评估 MLLMs 在多个领域中的脆弱性。我们的实验表明,所有开源和闭源的 MLLMs 对误导性指令都非常敏感,平均误导率超过 86%。为了增强 MLLMs 的鲁棒性,我们进一步通过引入显式和隐式的误导性数据对所有开源 MLLMs 进行微调,结果显示误导率显著降低。我们的代码可在以下链接获取:this https URL。
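论文通过统计前后两组回答之间"对→错"与"错→对"的翻转来度量误导率,具体口径以论文为准;下述函数仅示意这一统计方式(示例数据为虚构):

```python
def misleading_stats(before, after, gold):
    """统计加入误导指令前后回答的翻转情况,返回 (翻转率, 对→错数, 错→对数)。"""
    c2w = sum(1 for b, a, g in zip(before, after, gold) if b == g and a != g)
    w2c = sum(1 for b, a, g in zip(before, after, gold) if b != g and a == g)
    return (c2w + w2c) / len(gold), c2w, w2c

before = ["A", "B", "C", "D"]  # 无误导信息时的回答
after = ["B", "B", "C", "C"]   # 加入误导指令后的回答
gold = ["A", "B", "C", "C"]    # 标准答案
rate, c2w, w2c = misleading_stats(before, after, gold)
print(rate, c2w, w2c)  # 0.5 1 1
```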

[NLP-32] RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Manipulation

【速读】: 该论文试图解决现有中间策略表示方法在指导机器人执行操作任务时,要么提供的上下文不足,要么提供过于具体的上下文导致策略不够鲁棒的问题。解决方案的关键在于提出了一种基于“可供性”(affordances)的策略条件化方法,即RT-Affordance模型。该模型通过捕捉任务关键阶段机器人姿态的“可供性”,提供了一种既表达性强又轻量级的抽象表示,便于用户指定,并能有效利用大规模互联网数据集进行知识迁移。RT-Affordance模型采用分层结构,首先根据任务语言生成“可供性”计划,然后基于该计划条件化策略以执行操作任务。该方法能够灵活整合异质监督源,包括大规模网络数据集和机器人轨迹,并通过廉价的领域内“可供性”图像训练模型,从而在新任务上无需额外收集昂贵的机器人轨迹即可实现高效学习。实验结果表明,RT-Affordance在多个新任务上的性能超过现有方法50%以上,并显示出对新环境的鲁棒性。

链接: https://arxiv.org/abs/2411.02704
作者: Soroush Nasiriany,Sean Kirmani,Tianli Ding,Laura Smith,Yuke Zhu,Danny Driess,Dorsa Sadigh,Ted Xiao
关键词-EN: explore how intermediate, generalization by providing, providing guidance, intermediate policy representations, representations
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We explore how intermediate policy representations can facilitate generalization by providing guidance on how to perform manipulation tasks. Existing representations such as language, goal images, and trajectory sketches have been shown to be helpful, but these representations either do not provide enough context or provide over-specified context that yields less robust policies. We propose conditioning policies on affordances, which capture the pose of the robot at key stages of the task. Affordances offer expressive yet lightweight abstractions, are easy for users to specify, and facilitate efficient learning by transferring knowledge from large internet datasets. Our method, RT-Affordance, is a hierarchical model that first proposes an affordance plan given the task language, and then conditions the policy on this affordance plan to perform manipulation. Our model can flexibly bridge heterogeneous sources of supervision including large web datasets and robot trajectories. We additionally train our model on cheap-to-collect in-domain affordance images, allowing us to learn new tasks without collecting any additional costly robot trajectories. We show on a diverse set of novel tasks how RT-Affordance exceeds the performance of existing methods by over 50%, and we empirically demonstrate that affordances are robust to novel settings. Videos available at this https URL
摘要:我们探讨了如何通过提供关于如何执行操作任务的指导,利用中间策略表示来促进泛化。现有的表示方法,如语言、目标图像和轨迹草图,已被证明是有帮助的,但这些表示要么提供的上下文不足,要么提供了过度指定的上下文,导致策略的鲁棒性较差。我们提出将策略条件化于功能性(affordances),这些功能性捕捉了任务关键阶段机器人的姿态。功能性提供了表达性强且轻量级的抽象,易于用户指定,并通过从大型互联网数据集中转移知识来促进高效学习。我们的方法,RT-Affordance,是一个分层模型,首先根据任务语言提出一个功能性计划,然后根据该功能性计划条件化策略以执行操作。我们的模型能够灵活地桥接包括大型网络数据集和机器人轨迹在内的异质监督源。此外,我们在易于收集的领域内功能性图像上训练我们的模型,使得我们能够在不收集任何额外昂贵机器人轨迹的情况下学习新任务。我们在一系列多样的新任务上展示了RT-Affordance如何超越现有方法超过50%的性能,并通过实证证明功能性对新设置具有鲁棒性。视频可在以下链接获取:https URL

[NLP-33] JEL: Applying End-to-End Neural Entity Linking in JPMorgan Chase

【速读】: 该论文试图解决企业知识图谱中实体链接的问题,特别是将文本来源中的提及(如公司名称)与知识图谱中的实体进行准确匹配。解决方案的关键在于提出了一种新颖的端到端神经实体链接模型(JEL),该模型利用最小上下文信息和边际损失 (margin loss) 来生成实体嵌入,并结合 Wide & Deep Learning 模型分别匹配字符和语义信息。JEL模型在金融新闻中公司名称的实体链接任务上达到了最先进的性能,并且该方法可直接应用于其他需要针对其独特数据进行实体链接解决方案的企业。

链接: https://arxiv.org/abs/2411.02695
作者: Wanying Ding,Vinay K. Chaudhri,Naren Chittar,Krishna Konakanchi
关键词-EN: capturing key relationship, knowledge graph, leveraging knowledge graphs, compelling abstraction, abstraction for capturing
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 8 pages, 4 figures, IAAI-21

点击查看摘要

Abstract:Knowledge Graphs have emerged as a compelling abstraction for capturing key relationship among the entities of interest to enterprises and for integrating data from heterogeneous sources. JPMorgan Chase (JPMC) is leading this trend by leveraging knowledge graphs across the organization for multiple mission critical applications such as risk assessment, fraud detection, investment advice, etc. A core problem in leveraging a knowledge graph is to link mentions (e.g., company names) that are encountered in textual sources to entities in the knowledge graph. Although several techniques exist for entity linking, they are tuned for entities that exist in Wikipedia, and fail to generalize for the entities that are of interest to an enterprise. In this paper, we propose a novel end-to-end neural entity linking model (JEL) that uses minimal context information and a margin loss to generate entity embeddings, and a Wide & Deep Learning model to match character and semantic information respectively. We show that JEL achieves the state-of-the-art performance to link mentions of company names in financial news with entities in our knowledge graph. We report on our efforts to deploy this model in the company-wide system to generate alerts in response to financial news. The methodology used for JEL is directly applicable and usable by other enterprises who need entity linking solutions for data that are unique to their respective situations.
摘要:知识图谱已成为捕捉企业感兴趣实体间关键关系并整合来自异构数据源数据的引人注目的抽象表示。摩根大通(JPMorgan Chase, JPMC)正引领这一趋势,通过在整个组织内利用知识图谱,应用于多个关键任务,如风险评估、欺诈检测、投资建议等。利用知识图谱的核心问题之一是将文本源中遇到的提及(如公司名称)与知识图谱中的实体进行关联。尽管存在多种实体链接技术,但它们主要针对维基百科中的实体进行优化,无法泛化到企业感兴趣的实体。本文提出了一种新颖的端到端神经实体链接模型(JEL),该模型利用最小上下文信息和边际损失生成实体嵌入,并采用 Wide & Deep Learning 模型分别匹配字符和语义信息。实验结果表明,JEL在将金融新闻中的公司名称提及与知识图谱中的实体进行链接方面达到了最先进的性能。我们还报告了在公司范围内系统中部署此模型以响应金融新闻生成警报的努力。JEL所采用的方法可直接适用于其他需要为其独特数据场景提供实体链接解决方案的企业。
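JEL 用边际损失学习实体嵌入,使提及靠近正确实体、远离错误实体。一种常见的边际排序损失形式如下(仅为通用示意,论文中的具体损失与相似度定义可能不同,示例向量为虚构):

```python
import numpy as np

def margin_loss(mention, pos_entity, neg_entity, margin=0.5):
    """边际排序损失:正确实体的相似度需比错误实体至少高出 margin。"""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(0.0, margin - cos(mention, pos_entity) + cos(mention, neg_entity))

m = np.array([1.0, 0.0])     # 提及的嵌入(示意)
pos = np.array([1.0, 0.1])   # 正确实体的嵌入(示意)
neg = np.array([0.0, 1.0])   # 错误实体的嵌入(示意)
loss = margin_loss(m, pos, neg)  # 排序正确且间隔足够时损失为 0
```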

[NLP-34] On the loss of context-awareness in general instruction fine-tuning

【速读】: 该论文试图解决预训练大型语言模型(LLMs)在经过监督微调(SFT)后可能失去上下文感知能力的问题。具体来说,论文发现当使用聊天模板应用于输入提示时,指令微调后的LLMs在提取和理解用户提供的上下文信息并据此做出响应的能力上有所下降。解决方案的关键在于提出了两种方法来缓解这一问题:一是对用户提示进行后验注意力引导(post-hoc attention steering),二是通过上下文依赖指示器(context-dependency indicator)进行条件指令微调(conditional instruction fine-tuning)。实验结果表明,这两种方法在不影响模型遵循指令能力的前提下,有效地恢复了上下文感知能力。

链接: https://arxiv.org/abs/2411.02688
作者: Yihan Wang,Andrew Bai,Nanyun Peng,Cho-Jui Hsieh
关键词-EN: Large Language Models, Pretrained Large Language, Large Language, require post-training methods, Language Models
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Pretrained Large Language Models (LLMs) require post-training methods such as supervised fine-tuning (SFT) on instruction-response pairs to enable instruction following. However, this process can potentially harm existing capabilities learned during pretraining. In this paper, we investigate the loss of context awareness after SFT, defined as the capability to extract and understand information from the user-provided context and respond accordingly. We are the first to identify and show that the loss of context-awareness appears on instruction-finetuned LLMs when the chat template is applied to the input prompts. We identify the performance decline is partially caused by the bias embedded into the chat template to focus less on the user-provided context. Based on these observations, we propose two methods to mitigate the loss of context awareness in instruct models: post-hoc attention steering on user prompts and conditional instruction fine-tuning with a context-dependency indicator. Empirical experiments on 4 context-dependent downstream tasks and 3 pretrained LLMs of different sizes show that our methods effectively mitigate the loss of context awareness without compromising the general ability to follow instructions. Our findings also strongly advocate the necessity to carefully benchmark context awareness after instruction fine-tuning.
摘要:预训练的大语言模型 (LLM) 需要通过在指令-响应对上进行监督微调 (SFT) 等后训练方法来实现指令跟随。然而,这一过程可能会损害预训练期间学到的现有能力。本文探讨了在 SFT 后上下文感知能力的丧失,定义为从用户提供的上下文中提取和理解信息并据此做出响应的能力。我们是首个识别并展示在指令微调的 LLM 中,当聊天模板应用于输入提示时,上下文感知能力丧失的现象。我们发现性能下降部分是由于聊天模板中嵌入的偏差,使得对用户提供的上下文关注度降低。基于这些观察,我们提出了两种方法来缓解指令模型中上下文感知能力的丧失:对用户提示进行事后注意力引导和基于上下文依赖指示器的条件指令微调。在 4 个上下文依赖的下游任务和 3 个不同规模预训练 LLM 上的实证实验表明,我们的方法有效地缓解了上下文感知能力的丧失,同时不损害模型遵循指令的通用能力。我们的研究结果还强烈主张在指令微调后仔细基准测试上下文感知能力的必要性。

[NLP-35] Wave Network: An Ultra-Small Language Model

【速读】: 该论文试图解决在保持高精度的前提下,如何构建一个超小型语言模型的问题。解决方案的关键在于提出了一种创新的标记表示和更新方法,即使用复数向量(complex vector)来表示每个标记,该向量包含两个部分:一个幅度向量(magnitude vector)表示输入文本的全局语义,一个相位向量(phase vector)捕捉单个标记与全局语义之间的关系。通过这种表示方法,Wave网络在AG News文本分类任务中表现优异,显著超越了使用BERT预训练嵌入的单层Transformer模型,并且在视频内存使用和训练时间上大幅减少,接近预训练和微调后的BERT base模型的精度。

链接: https://arxiv.org/abs/2411.02674
作者: Xin Zhang,Victor S.Sheng
关键词-EN: innovative token representation, Wave network, propose an innovative, representation and update, update method
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose an innovative token representation and update method in a new ultra-small language model: the Wave network. Specifically, we use a complex vector to represent each token, encoding both global and local semantics of the input text. A complex vector consists of two components: a magnitude vector representing the global semantics of the input text, and a phase vector capturing the relationships between individual tokens and global semantics. Experiments on the AG News text classification task demonstrate that, when generating complex vectors from randomly initialized token embeddings, our single-layer Wave Network achieves 90.91% accuracy with wave interference and 91.66% with wave modulation – outperforming a single Transformer layer using BERT pre-trained embeddings by 19.23% and 19.98%, respectively, and approaching the accuracy of the pre-trained and fine-tuned BERT base model (94.64%). Additionally, compared to BERT base, the Wave Network reduces video memory usage and training time by 77.34% and 85.62% during wave modulation. In summary, we used a 2.4-million-parameter small language model to achieve accuracy comparable to a 100-million-parameter BERT model in text classification.
摘要:我们提出了一种创新的 Token 表示和更新方法,应用于一种新型超小型语言模型:Wave 网络。具体而言,我们使用一个复向量来表示每个 Token,该向量同时编码了输入文本的全局和局部语义。一个复向量由两个部分组成:一个表示输入文本全局语义的幅值向量,以及一个捕捉单个 Token 与全局语义之间关系的相位向量。在 AG News 文本分类任务的实验中,我们发现,当从随机初始化的 Token 嵌入生成复向量时,我们的单层 Wave 网络在波干扰和波调制下分别达到了 90.91% 和 91.66% 的准确率——分别比使用 BERT 预训练嵌入的单层 Transformer 高出 19.23% 和 19.98%,并且接近预训练和微调后的 BERT 基础模型(94.64%)的准确率。此外,与 BERT 基础模型相比,Wave 网络在波调制过程中减少了 77.34% 的视频内存使用和 85.62% 的训练时间。总之,我们使用了一个仅 240 万参数的小型语言模型,在文本分类任务中达到了与 1 亿参数的 BERT 模型相媲美的准确率。
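论文用幅度向量与相位向量构成的复向量表示 token。下面用 NumPy 粗略示意这种表示,以及"波干涉"(复向量相加)与"波调制"(复向量逐元素相乘)两种更新方式;其中全局语义与相位的具体构造是本文为演示所作的假设,并非论文原公式:

```python
import numpy as np

def to_complex(token_emb):
    """token_emb 形状为 (token 数, 维度),返回同形状的复向量表示。"""
    # 幅度向量:以各维在全部 token 上的 L2 范数代表全局语义(假设的构造)
    magnitude = np.sqrt((token_emb ** 2).sum(axis=0, keepdims=True))
    # 相位向量:刻画单个 token 与全局语义的关系(假设的构造)
    phase = np.arctan2(token_emb, magnitude)
    return magnitude * np.exp(1j * phase)

emb = np.random.default_rng(1).normal(size=(4, 8))  # 4 个 token、8 维(示意)
z = to_complex(emb)
interference = z.sum(axis=0)     # 波干涉:复向量相加
modulation = np.prod(z, axis=0)  # 波调制:复向量逐元素相乘
```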

[NLP-36] Zebra-Llama: A Context-Aware Large Language Model for Democratizing Rare Disease Knowledge

【速读】: 该论文试图解决罕见疾病(Rare diseases)在医疗领域中面临的独特挑战,特别是诊断延迟和信息碎片化问题。解决方案的关键在于开发了一种专门针对罕见疾病的上下文感知语言模型——Zebra-Llama,该模型通过高精度检索增强生成(Retrieval Augmented Generation, RAG)技术,专注于埃勒斯-当洛斯综合征(Ehlers-Danlos Syndrome, EDS)这一案例研究。通过采用一种新颖的上下文感知微调方法,结合医学文献、患者经验和临床资源中的问题以及专家精心编制的回答进行训练,Zebra-Llama在处理EDS相关查询时展示了显著的改进,特别是在全面性、准确性、清晰度和引用可靠性方面。该模型不仅为EDS提供了更易获取和可靠的信息,还为开发其他罕见疾病的专用AI解决方案奠定了基础。

链接: https://arxiv.org/abs/2411.02657
作者: Karthik Soman,Andrew Langdon,Catalina Villouta,Chinmay Agrawal,Lashaw Salta,Braian Peetoom,Gianmarco Bellucci,Orion J Buske
关键词-EN: Large Language Models, Retrieval Augmented Generation, present unique challenges, diseases present unique, suffering from delayed
类目: Computation and Language (cs.CL)
备注: 26 pages, 4 figures, 1 supplementary figure

点击查看摘要

Abstract:Rare diseases present unique challenges in healthcare, often suffering from delayed diagnosis and fragmented information landscapes. The scarcity of reliable knowledge in these conditions poses a distinct challenge for Large Language Models (LLMs) in supporting clinical management and delivering precise patient information, underscoring the need for focused training on these ‘zebra’ cases. We present Zebra-Llama, a specialized context-aware language model with high precision Retrieval Augmented Generation (RAG) capability, focusing on Ehlers-Danlos Syndrome (EDS) as our case study. EDS, affecting 1 in 5,000 individuals, exemplifies the complexities of rare diseases with its diverse symptoms, multiple subtypes, and evolving diagnostic criteria. By implementing a novel context-aware fine-tuning methodology trained on questions derived from medical literature, patient experiences, and clinical resources, along with expertly curated responses, Zebra-Llama demonstrates unprecedented capabilities in handling EDS-related queries. On a test set of real-world questions collected from EDS patients and clinicians, medical experts evaluated the responses generated by both models, revealing Zebra-Llama’s substantial improvements over base model (Llama 3.1-8B-Instruct) in thoroughness (77.5% vs. 70.1%), accuracy (83.0% vs. 78.8%), clarity (74.7% vs. 72.0%) and citation reliability (70.6% vs. 52.3%). Released as an open-source resource, Zebra-Llama not only provides more accessible and reliable EDS information but also establishes a framework for developing specialized AI solutions for other rare conditions. This work represents a crucial step towards democratizing expert-level knowledge in rare disease management, potentially transforming how healthcare providers and patients navigate the complex landscape of rare diseases.
摘要:罕见病在医疗领域呈现出独特的挑战,常常面临诊断延迟和信息碎片化的问题。这些疾病中可靠知识的匮乏对大语言模型(Large Language Models, LLMs)在支持临床管理和提供精确患者信息方面构成了显著挑战,突显了针对这些“斑马”病例进行专门训练的必要性。我们提出了Zebra-Llama,这是一种专门针对罕见病的高精度检索增强生成(Retrieval Augmented Generation, RAG)能力的上下文感知语言模型,以埃勒斯-当洛斯综合征(Ehlers-Danlos Syndrome, EDS)作为我们的案例研究。EDS影响每5000人中的1人,以其多样化的症状、多种亚型和不断演变的诊断标准,体现了罕见病的复杂性。通过实施一种新颖的上下文感知微调方法,该方法基于从医学文献、患者经验和临床资源中提取的问题以及专家精心策划的回答进行训练,Zebra-Llama在处理EDS相关查询方面展示了前所未有的能力。在一个由EDS患者和临床医生收集的真实世界问题测试集中,医学专家评估了两种模型生成的回答,结果显示Zebra-Llama在全面性(77.5% vs. 70.1%)、准确性(83.0% vs. 78.8%)、清晰度(74.7% vs. 72.0%)和引用可靠性(70.6% vs. 52.3%)方面均显著优于基础模型(Llama 3.1-8B-Instruct)。作为开源资源发布的Zebra-Llama不仅提供了更易获取和可靠的EDS信息,还为开发其他罕见病的专门AI解决方案奠定了框架。这项工作代表了在罕见病管理中实现专家级知识普及化的关键一步,有可能改变医疗提供者和患者在复杂罕见病领域中的导航方式。

[NLP-37] A Comparative Analysis of Counterfactual Explanation Methods for Text Classifiers

【速读】: 该论文试图解决文本分类器解释性问题,特别是通过生成反事实解释(Counterfactual Explanations)来理解和调试文本分类器。解决方案的关键在于评估和比较五种生成反事实解释的方法,这些方法包括传统的基于梯度的白盒方法和基于大型语言模型(LLMs)的新方法。研究结果表明,传统方法在生成有效改变分类器输出的反事实解释方面表现出色,而基于LLMs的方法则在生成自然且语言上合理的反事实文本方面表现优异,但往往无法有效改变分类器输出。因此,论文建议开发结合传统梯度方法和LLM技术优势的新方法,以生成高质量、有效且合理的文本反事实解释。

链接: https://arxiv.org/abs/2411.02643
作者: Stephen McAleese,Mark Keane
关键词-EN: minimally altered text, altered text inputs, producing minimally altered, classifier output, Counterfactual explanations
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages

点击查看摘要

Abstract:Counterfactual explanations can be used to interpret and debug text classifiers by producing minimally altered text inputs that change a classifier’s output. In this work, we evaluate five methods for generating counterfactual explanations for a BERT text classifier on two datasets using three evaluation metrics. The results of our experiments suggest that established white-box substitution-based methods are effective at generating valid counterfactuals that change the classifier’s output. In contrast, newer methods based on large language models (LLMs) excel at producing natural and linguistically plausible text counterfactuals but often fail to generate valid counterfactuals that alter the classifier’s output. Based on these results, we recommend developing new counterfactual explanation methods that combine the strengths of established gradient-based approaches and newer LLM-based techniques to generate high-quality, valid, and plausible text counterfactual explanations.
摘要:反事实解释可以通过生成最小程度修改的文本输入来解释和调试文本分类器,这些修改会改变分类器的输出。在本研究中,我们评估了五种方法在两个数据集上为 BERT 文本分类器生成反事实解释的效果,使用了三种评估指标。实验结果表明,传统的基于白盒替换的方法在生成能够改变分类器输出的有效反事实解释方面表现出色。相比之下,基于大语言模型 (LLM) 的新方法在生成自然且语言上合理的文本反事实解释方面表现优异,但往往无法生成能够改变分类器输出的有效反事实解释。基于这些结果,我们建议开发新的反事实解释方法,结合传统基于梯度的方法和新型基于 LLM 的技术的优势,以生成高质量、有效且合理的文本反事实解释。
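基于替换的白盒反事实生成,其基本思路是寻找能翻转分类器输出的最小词替换。下面是一个与具体论文方法无关的玩具示意(分类器与替换表均为虚构,仅用于说明搜索过程):

```python
def find_counterfactual(tokens, classify, substitutions):
    """逐位置尝试候选替换,返回第一个使分类器输出翻转的改动后文本。"""
    base = classify(tokens)
    for i, tok in enumerate(tokens):
        for sub in substitutions.get(tok, []):
            cand = tokens[:i] + [sub] + tokens[i + 1:]
            if classify(cand) != base:
                return cand
    return None

classify = lambda toks: "pos" if "good" in toks else "neg"  # 玩具分类器
subs = {"good": ["bad", "poor"], "movie": ["film"]}         # 虚构替换表
print(find_counterfactual(["a", "good", "movie"], classify, subs))
# → ['a', 'bad', 'movie']
```

真实场景中,替换候选通常来自掩码语言模型或同义词表,且需同时约束改动幅度与文本流畅度。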

[NLP-38] Extracting Unlearned Information from LLMs with Activation Steering NEURIPS2024

【速读】: 该论文试图解决的问题是:尽管大型语言模型 (LLMs) 经过训练后可以通过“遗忘”技术移除敏感信息,但这些信息仍可能被恶意攻击者通过各种手段提取出来。现有的攻击方法只能生成可能包含目标信息的候选输出集合,无法精确确定实际包含目标信息的输出。论文提出的解决方案之关键是激活引导 (Activation Steering),这是一种用于从已遗忘的 LLMs 中精确检索信息的方法。论文还引入了匿名化激活引导 (Anonymized Activation Steering) 这一新方法来生成引导向量,并开发了一种简单的词频方法来从候选集合中精确识别出正确的答案。通过在多种遗忘技术和数据集上的评估,论文证明了激活引导在恢复一般知识(如广为人知的虚构角色)方面的有效性,同时也揭示了在检索特定信息(如非公众人物的详细信息)方面的局限性,从而突显了当前遗忘技术的一个严重漏洞。

链接: https://arxiv.org/abs/2411.02631
作者: Atakan Seyitoğlu,Aleksei Kuvshinov,Leo Schwinn,Stephan Günnemann
关键词-EN: Large Language Models, Large Language, pretraining of Large, Language Models, unintended consequence
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at NeurIPS 2024 Workshop Safe Generative AI

点击查看摘要

Abstract:An unintended consequence of the vast pretraining of Large Language Models (LLMs) is the verbatim memorization of fragments of their training data, which may contain sensitive or copyrighted information. In recent years, unlearning has emerged as a solution to effectively remove sensitive knowledge from models after training. Yet, recent work has shown that supposedly deleted information can still be extracted by malicious actors through various attacks. Still, current attacks retrieve sets of possible candidate generations and are unable to pinpoint the output that contains the actual target information. We propose activation steering as a method for exact information retrieval from unlearned LLMs. We introduce a novel approach to generating steering vectors, named Anonymized Activation Steering. Additionally, we develop a simple word frequency method to pinpoint the correct answer among a set of candidates when retrieving unlearned information. Our evaluation across multiple unlearning techniques and datasets demonstrates that activation steering successfully recovers general knowledge (e.g., widely known fictional characters) while revealing limitations in retrieving specific information (e.g., details about non-public individuals). Overall, our results demonstrate that exact information retrieval from unlearned models is possible, highlighting a severe vulnerability of current unlearning techniques.
摘要:大规模预训练大语言模型 (LLM) 的一个意外后果是其训练数据片段的逐字记忆,这些数据可能包含敏感或受版权保护的信息。近年来,“遗忘”技术作为一种解决方案出现,旨在在模型训练后有效移除敏感知识。然而,最近的研究表明,被认为已删除的信息仍可能通过各种攻击手段被恶意行为者提取出来。尽管如此,当前的攻击方法只能检索出可能的候选生成集,无法准确定位包含实际目标信息的内容。我们提出了一种名为“激活引导”的方法,用于从已遗忘的 LLM 中精确检索信息。我们引入了一种新颖的方法来生成引导向量,称为“匿名化激活引导”。此外,我们还开发了一种简单的词频方法,用于在检索已遗忘信息时从候选集中准确定位正确答案。我们在多种遗忘技术和数据集上的评估表明,激活引导成功恢复了通用知识(例如,广为人知的虚构人物),但在检索特定信息(例如,非公众人物的细节)方面显示出局限性。总体而言,我们的研究结果表明,从已遗忘模型中精确检索信息是可能的,这突显了当前遗忘技术的一个严重漏洞。
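论文用简单的词频法在候选生成集合中定位正确答案。一种可能的实现思路是:统计所有候选输出中各词的出现频率,选取词频总和最高的候选(具体口径以论文为准,示例数据为虚构):

```python
from collections import Counter

def pick_answer(candidates):
    """按候选内各词在全部候选中的词频总和排序,取得分最高的候选。"""
    freq = Counter(w for c in candidates for w in c.split())
    return max(candidates, key=lambda c: sum(freq[w] for w in c.split()))

candidates = ["harry potter", "harry smith", "john potter"]
print(pick_answer(candidates))  # harry potter
```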

[NLP-39] TeleOracle: Fine-Tuned Retrieval-Augmented Generation with Long-Context Support for Network

【速读】: 该论文试图解决电信行业中大型语言模型(LLMs)在边缘设备上的部署限制和文档不一致性问题。解决方案的关键在于提出了TeleOracle,一个专为电信领域设计的检索增强生成(RAG)系统,基于Phi-2小型语言模型(SLM)构建。TeleOracle通过两阶段检索器,结合语义分块和混合关键词与语义搜索,提升了上下文检索的效率。此外,通过在推理过程中扩展上下文窗口和采用低秩适应进行高效微调,显著提高了模型在开放式查询任务中的性能。实验结果显示,TeleOracle在电信领域的问答(QnA)任务中,相较于基础Phi-2模型,准确率提升了30%,达到81.20%,并且在忠实度评分上优于更大规模的LLMs。

链接: https://arxiv.org/abs/2411.02617
作者: Nouf Alabbasi,Omar Erak,Omar Alhussein,Ismail Lotfi,Sami Muhaidat,Merouane Debbah
关键词-EN: telecommunications industry rapid, industry rapid evolution, rapid evolution demands, evolution demands intelligent, managing complex networks
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
备注:

点击查看摘要

Abstract:The telecommunications industry’s rapid evolution demands intelligent systems capable of managing complex networks and adapting to emerging technologies. While large language models (LLMs) show promise in addressing these challenges, their deployment in telecom environments faces significant constraints due to edge device limitations and inconsistent documentation. To bridge this gap, we present TeleOracle, a telecom-specialized retrieval-augmented generation (RAG) system built on the Phi-2 small language model (SLM). To improve context retrieval, TeleOracle employs a two-stage retriever that incorporates semantic chunking and hybrid keyword and semantic search. Additionally, we expand the context window during inference to enhance the model’s performance on open-ended queries. We also employ low-rank adaptation for efficient fine-tuning. A thorough analysis of the model’s performance indicates that our RAG framework is effective in aligning Phi-2 to the telecom domain in a downstream question and answer (QnA) task, achieving a 30% improvement in accuracy over the base Phi-2 model, reaching an overall accuracy of 81.20%. Notably, we show that our model not only performs on par with the much larger LLMs but also achieves a higher faithfulness score, indicating higher adherence to the retrieved context.
摘要:电信行业的快速发展要求智能系统能够管理复杂的网络并适应新兴技术。尽管大语言模型 (LLM) 在应对这些挑战方面显示出潜力,但其在电信环境中的部署面临显著限制,主要源于边缘设备的局限性和文档的不一致性。为了弥合这一差距,我们提出了 TeleOracle,这是一个基于 Phi-2 小型语言模型 (SLM) 的电信专用检索增强生成 (RAG) 系统。为了改进上下文检索,TeleOracle 采用了一个两阶段检索器,结合了语义分块和混合关键词与语义搜索。此外,我们在推理过程中扩展了上下文窗口,以增强模型对开放式查询的处理能力。我们还采用了低秩适应技术进行高效的微调。对模型性能的全面分析表明,我们的 RAG 框架在下游问答 (QnA) 任务中有效地将 Phi-2 对齐到电信领域,准确率比基础 Phi-2 模型提高了 30%,总体准确率达到 81.20%。值得注意的是,我们展示了我们的模型不仅与更大的 LLM 表现相当,而且在忠实度评分上更高,表明其对检索上下文的更高依从性。
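文中"混合关键词与语义搜索"的打分思路可以粗略示意如下(权重 w 与词项重叠打分均为假设性简化,实际系统中关键词部分通常使用 BM25 等更成熟的打分函数):

```python
import numpy as np

def hybrid_score(query_terms, doc_terms, q_emb, d_emb, w=0.5):
    # 关键词部分: 查询词项与文档块词项的重叠率 (BM25 的粗略替代)
    keyword = len(set(query_terms) & set(doc_terms)) / max(len(set(query_terms)), 1)
    # 语义部分: 查询与文档块嵌入的余弦相似度
    semantic = float(q_emb @ d_emb /
                     (np.linalg.norm(q_emb) * np.linalg.norm(d_emb)))
    # 线性加权融合两种信号
    return w * keyword + (1 - w) * semantic
```

两阶段检索可据此先对语义分块后的文档块粗排,再对 top-k 结果做更精细的重排。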

[NLP-40] Investigating Idiomaticity in Word Representations

【速读】: 该论文试图解决的问题是如何评估和提升词表示模型对多词表达(multiword expressions)习语性的捕捉能力。解决方案的关键在于提出了一个包含32,200个句子的数据集,其中包括了名词复合词在不同习语性水平上的最小对(minimal pairs),以及这些复合词的释义和在自然语境中的出现情况。通过这些数据,论文定义了两个细粒度指标:亲和度(Affinity)和缩放相似度(Scaled Similarity),用于评估模型对习语性变化的敏感度。研究结果表明,尽管现有模型在表面相似度上表现良好,但它们尚未能准确捕捉习语性,且模型的上下文捕捉能力仍停留在词汇层面的线索,未能深入整合语义线索以理解习语性。

链接: https://arxiv.org/abs/2411.02610
作者: Wei He,Tiago Kramer Vieira,Marcos Garcia,Carolina Scarton,Marco Idiart,Aline Villavicencio
关键词-EN: express complex ideas, eager beaver, enthusiastic person, integral part, express complex
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Idiomatic expressions are an integral part of human languages, often used to express complex ideas in compressed or conventional ways (e.g. eager beaver as a keen and enthusiastic person). However, their interpretations may not be straightforwardly linked to the meanings of their individual components in isolation and this may have an impact for compositional approaches. In this paper, we investigate to what extent word representation models are able to go beyond compositional word combinations and capture multiword expression idiomaticity and some of the expected properties related to idiomatic meanings. We focus on noun compounds of varying levels of idiomaticity in two languages (English and Portuguese), presenting a dataset of minimal pairs containing human idiomaticity judgments for each noun compound at both type and token levels, their paraphrases and their occurrences in naturalistic and sense-neutral contexts, totalling 32,200 sentences. We propose this set of minimal pairs for evaluating how well a model captures idiomatic meanings, and define a set of fine-grained metrics of Affinity and Scaled Similarity, to determine how sensitive the models are to perturbations that may lead to changes in idiomaticity. The results obtained with a variety of representative and widely used models indicate that, despite superficial indications to the contrary in the form of high similarities, idiomaticity is not yet accurately represented in current models. Moreover, the performance of models with different levels of contextualisation suggests that their ability to capture context is not yet able to go beyond more superficial lexical clues provided by the words and to actually incorporate the relevant semantic clues needed for idiomaticity.
摘要:习语表达是人类语言的重要组成部分,常用于以压缩或常规的方式表达复杂概念(例如,“eager beaver”表示积极热情的人)。然而,习语的解释可能无法直接从其独立组成部分的意义中推导出来,这对组合性方法可能产生影响。本文探讨了词表示模型在多大程度上能够超越组合性词组合,捕捉多词表达的习语性及其相关的一些预期属性。我们专注于两种语言(英语和葡萄牙语)中不同习语程度的名词复合词,提供了一个包含人类习语判断的数据集,涵盖了类型和Token级别的名词复合词、其释义及其在自然和意义中立的上下文中的出现,总计32,200个句子。我们提出这一组最小对来评估模型捕捉习语意义的能力,并定义了一组细粒度的亲和度和缩放相似度指标,以确定模型对可能导致习语性变化的扰动的敏感性。通过多种代表性和广泛使用的模型获得的结果表明,尽管表面上有高相似度的迹象,但习语性在当前模型中尚未得到准确表示。此外,不同上下文化程度的模型的性能表明,它们捕捉上下文的能力尚未能超越由词语提供的更表面的词汇线索,并实际纳入习语性所需的相应语义线索。
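评估模型是否捕捉习语义的一个直观(且高度简化的)做法,是比较复合词与其习语释义、与字面替换之间的嵌入相似度之差。注意这只是示意性写法,并非论文中 Affinity 与 Scaled Similarity 指标的精确定义:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def idiomatic_gap(compound_emb, paraphrase_emb, literal_emb):
    # 若模型真正编码了习语义, 复合词嵌入应更接近其习语释义,
    # 而非仅由字面成分替换得到的表达; 差值越大, 习语性捕捉越好
    return cosine(compound_emb, paraphrase_emb) - cosine(compound_emb, literal_emb)
```

论文的发现相当于:当前模型在该类对比下的差值普遍偏小,说明高相似度更多来自表面词汇线索。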

[NLP-41] FactTest: Factuality Testing in Large Language Models with Statistical Guarantees

【速读】: 该论文试图解决大语言模型 (Large Language Models, LLMs) 在生成内容时容易产生幻觉 (hallucinations) 和非事实内容的问题,特别是在高风险领域中,严格控制第一类错误 (Type I errors) 的概率至关重要。解决方案的关键在于提出了一个名为 FactTest 的新框架,该框架通过统计方法评估 LLM 在给定问题下提供正确答案的置信度,并确保高概率的正确性保证。具体来说,论文将事实性测试 (factuality testing) 形式化为假设检验问题,以在用户指定的显著性水平上强制执行第一类错误的上限。此外,该框架在温和条件下确保了第二类错误 (Type II errors) 的强控制,并且能够扩展以在存在协变量偏移 (covariate shifts) 的情况下保持有效性。该方法是无分布假设的,适用于任意数量的人工标注样本,并且模型无关,可应用于任何黑箱或白箱语言模型。实验结果表明,FactTest 有效地检测幻觉并提高了模型在未知问题上的拒绝回答能力,准确率提高了超过 40%。

链接: https://arxiv.org/abs/2411.02603
作者: Fan Nie,Xiaotian Hou,Shuhang Lin,James Zou,Huaxiu Yao,Linjun Zhang
关键词-EN: Large Language Models, non-factual content undermines, Large Language, propensity of Large, incorrectly classifying hallucinations
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:The propensity of Large Language Models (LLMs) to generate hallucinations and non-factual content undermines their reliability in high-stakes domains, where rigorous control over Type I errors (the conditional probability of incorrectly classifying hallucinations as truthful content) is essential. Despite its importance, formal verification of LLM factuality with such guarantees remains largely unexplored. In this paper, we introduce FactTest, a novel framework that statistically assesses whether an LLM can confidently provide correct answers to given questions with high-probability correctness guarantees. We formulate factuality testing as a hypothesis testing problem to enforce an upper bound of Type I errors at user-specified significance levels. Notably, we prove that our framework also ensures strong Type II error control under mild conditions and can be extended to maintain its effectiveness when covariate shifts exist. These analyses are amenable to the principled NP framework. Our approach is distribution-free and works for any number of human-annotated samples. It is model-agnostic and applies to any black-box or white-box LM. Extensive experiments on question-answering (QA) and multiple-choice benchmarks demonstrate that FactTest effectively detects hallucinations and improves the model’s ability to abstain from answering unknown questions, leading to an over 40% accuracy improvement.
摘要:大语言模型 (LLM) 在生成幻觉内容和非事实内容方面的倾向性,严重影响了其在高风险领域中的可靠性,在这些领域中,对第一类错误(将幻觉内容错误分类为真实内容的条件概率)的严格控制至关重要。尽管其重要性不言而喻,但具有此类保证的大语言模型事实性正式验证仍未得到充分探索。本文中,我们提出了 FactTest,这是一种新颖的框架,能够统计评估大语言模型在给定问题下能否以高概率正确性保证自信地提供正确答案。我们将事实性测试形式化为假设检验问题,以在用户指定的显著性水平上强制执行第一类错误的上限。值得注意的是,我们证明了在温和条件下,我们的框架还能确保强有力的第二类错误控制,并且可以扩展以在存在协变量偏移的情况下保持其有效性。这些分析适用于基于 NP 框架的原理性方法。我们的方法是分布无关的,适用于任意数量的人工标注样本。它与模型无关,适用于任何黑箱或白箱语言模型。在问答 (QA) 和多项选择基准上的广泛实验表明,我们的方法能有效检测幻觉内容,并提升模型拒绝回答未知问题的能力,从而使准确率提高了超过 40%。
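将事实性测试形式化为假设检验、并在显著性水平 alpha 下控制第一类错误,其核心可以用一个保守的分位数阈值来示意(标定集的构造与函数名均为假设,论文的实际方法请以原文为准):

```python
import numpy as np

def calibrate_threshold(hallucination_scores, alpha=0.05):
    """在已知为幻觉的标定样本置信度上取保守分位数作为作答阈值。

    模型只有在置信度严格高于该阈值时才作答, 使得
    "将幻觉误判为真实内容"的概率 (第一类错误) 近似不超过 alpha。
    """
    scores = np.sort(np.asarray(hallucination_scores))
    n = len(scores)
    # (n+1)*(1-alpha) 的上取整是常见的有限样本保守修正
    k = min(int(np.ceil((n + 1) * (1 - alpha))), n)
    return scores[k - 1]
```

使用时,置信度低于阈值的问题应选择"拒答",这对应了论文中模型拒绝回答未知问题的能力。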

[NLP-42] Vocal Sandbox: Continual Learning and Adaptation for Situated Human-Robot Collaboration

【速读】: 该论文试图解决在情境环境中实现无缝人机协作的问题。解决方案的关键在于引入Vocal Sandbox框架,该框架通过设计轻量级且可解释的学习算法,使系统能够从多种教学模式(如口语对话、对象关键点和运动示范)中适应和持续学习多层次的抽象概念。用户可以通过实时教学与机器人共同适应,例如通过展示新的低级技能(如“围绕物体跟踪”)并提供轨迹可视化,或通过口语对话教授高级规划行为(如“打包物体”),这些行为由预训练的语言模型合成,作为低级技能的组合。实验结果表明,该框架在协作礼品袋装配和LEGO定格动画制作中显著减少了主动监督需求,提高了自主性能和用户满意度。

链接: https://arxiv.org/abs/2411.02599
作者: Jennifer Grannen,Siddharth Karamcheti,Suvir Mirchandani,Percy Liang,Dorsa Sadigh
关键词-EN: enabling seamless human-robot, introduce Vocal Sandbox, seamless human-robot collaboration, Vocal Sandbox, situated environments
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: Published at CoRL 2024. 24 pages, 8 figures. Project Page: this https URL

点击查看摘要

Abstract:We introduce Vocal Sandbox, a framework for enabling seamless human-robot collaboration in situated environments. Systems in our framework are characterized by their ability to adapt and continually learn at multiple levels of abstraction from diverse teaching modalities such as spoken dialogue, object keypoints, and kinesthetic demonstrations. To enable such adaptation, we design lightweight and interpretable learning algorithms that allow users to build an understanding and co-adapt to a robot’s capabilities in real-time, as they teach new behaviors. For example, after demonstrating a new low-level skill for “tracking around” an object, users are provided with trajectory visualizations of the robot’s intended motion when asked to track a new object. Similarly, users teach high-level planning behaviors through spoken dialogue, using pretrained language models to synthesize behaviors such as “packing an object away” as compositions of low-level skills - concepts that can be reused and built upon. We evaluate Vocal Sandbox in two settings: collaborative gift bag assembly and LEGO stop-motion animation. In the first setting, we run systematic ablations and user studies with 8 non-expert participants, highlighting the impact of multi-level teaching. Across 23 hours of total robot interaction time, users teach 17 new high-level behaviors with an average of 16 novel low-level skills, requiring 22.1% less active supervision compared to baselines and yielding more complex autonomous performance (+19.7%) with fewer failures (-67.1%). Qualitatively, users strongly prefer Vocal Sandbox systems due to their ease of use (+20.6%) and overall performance (+13.9%). Finally, we pair an experienced system-user with a robot to film a stop-motion animation; over two hours of continuous collaboration, the user teaches progressively more complex motion skills to shoot a 52 second (232 frame) movie.
摘要:我们介绍了 Vocal Sandbox,这是一个旨在实现情境环境中无缝人机协作的框架。该框架中的系统具有从多种教学模式(如口语对话、对象关键点和动觉示范)中适应和持续学习多层次抽象的能力。为了实现这种适应性,我们设计了轻量级且可解释的学习算法,使用户能够在实时教学新行为的同时,逐步理解和共同适应机器人的能力。例如,在演示了围绕对象进行“跟踪”的新低级技能后,当要求机器人跟踪新对象时,用户会获得机器人预期运动的轨迹可视化。同样,用户通过口语对话教授高级规划行为,利用预训练的语言模型将行为(如“打包对象”)合成为低级技能的组合——这些概念可以重复使用并进一步构建。我们在两种情境中评估了 Vocal Sandbox:协作礼品袋组装和 LEGO 定格动画制作。在第一种情境中,我们进行了系统的消融实验和用户研究,共有 8 名非专业参与者,突出了多层次教学的影响。在总共 23 小时的机器人交互时间内,用户教授了 17 种新的高级行为,平均包含 16 种新颖的低级技能,相比基线方法减少了 22.1% 的主动监督,并实现了更复杂的自主表现(+19.7%),失败率更低(-67.1%)。定性上,用户强烈偏好 Vocal Sandbox 系统,因其易用性(+20.6%)和整体性能(+13.9%)。最后,我们将一名有经验的系统用户与机器人配对,拍摄定格动画;在连续两小时的协作中,用户逐步教授了更复杂的运动技能,以拍摄一段 52 秒(232 帧)的电影。

[NLP-43] “It's a conversation, not a quiz”: A Risk Taxonomy and Reflection Tool for LLM Adoption in Public Health

【速读】: 该论文试图解决在公共卫生领域中采用大型语言模型(LLMs)所带来的潜在风险评估问题。解决方案的关键在于通过与健康专业人员和健康问题经历者的焦点小组讨论,构建一个风险分类体系(risk taxonomy),该体系区分并情境化了LLMs在传统健康传播中可能引入的潜在危害。具体而言,该分类体系包括四个风险维度:个体行为、以人为本的护理、信息生态系统和科技问责制。每个维度下都详细讨论了特定的风险,并提供了示例反思问题,以帮助从业者采用风险反思的方法。这一工作为计算和公共卫生领域的专家提供了一个共享的词汇和反思工具,以便共同预测、评估和减轻在决定何时使用LLM能力(或不使用)以及如何在使用时减少伤害时的风险。

链接: https://arxiv.org/abs/2411.02594
作者: Jiawei Zhou,Amy Z. Chen,Darshi Shah,Laura Schwab Reese,Munmun De Choudhury
关键词-EN: large language models, Recent breakthroughs, accessible information sources, public health, language models
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent breakthroughs in large language models (LLMs) have generated both interest and concern about their potential adoption as accessible information sources or communication tools across different domains. In public health – where stakes are high and impacts extend across populations – adopting LLMs poses unique challenges that require thorough evaluation. However, structured approaches for assessing potential risks in public health remain under-explored. To address this gap, we conducted focus groups with health professionals and health issue experiencers to unpack their concerns, situated across three distinct and critical public health issues that demand high-quality information: vaccines, opioid use disorder, and intimate partner violence. We synthesize participants’ perspectives into a risk taxonomy, distinguishing and contextualizing the potential harms LLMs may introduce when positioned alongside traditional health communication. This taxonomy highlights four dimensions of risk in individual behaviors, human-centered care, information ecosystem, and technology accountability. For each dimension, we discuss specific risks and example reflection questions to help practitioners adopt a risk-reflexive approach. This work offers a shared vocabulary and reflection tool for experts in both computing and public health to collaboratively anticipate, evaluate, and mitigate risks in deciding when to employ LLM capabilities (or not) and how to mitigate harm when they are used.
摘要:近年来,大语言模型 (Large Language Model, LLM) 的突破性进展引发了对其在不同领域作为可访问信息源或沟通工具的潜在应用的兴趣和担忧。在公共卫生领域——一个风险高、影响广泛的行业——采用 LLM 带来了独特的挑战,需要进行全面评估。然而,针对公共卫生领域潜在风险的结构化评估方法仍未得到充分探索。为了填补这一空白,我们与健康专业人员和健康问题经历者进行了焦点小组讨论,以解构他们在三个不同且关键的公共卫生问题上的担忧,这些问题需要高质量的信息:疫苗、阿片类药物使用障碍和亲密伴侣暴力。我们将参与者的观点综合成一个风险分类法,区分并情境化了 LLM 在传统健康沟通旁引入的潜在危害。该分类法突出了个体行为、以人为本的护理、信息生态系统和科技问责制四个维度的风险。对于每个维度,我们讨论了具体的风险,并提供了示例反思问题,以帮助从业者采用风险反思的方法。这项工作为计算和公共卫生领域的专家提供了一个共享的词汇和反思工具,以协作预测、评估和减轻在决定何时使用(或不使用)LLM 能力以及如何在使用时减少伤害时的风险。

[NLP-44] Geometry of orofacial neuromuscular signals: speech articulation decoding using surface electromyography

【速读】: 该论文试图解决非侵入性表面肌电图(sEMG)在恢复因神经肌肉疾病、中风、创伤或头颈癌手术及治疗导致的失语患者语音输出中的应用问题。解决方案的关键在于通过收集来自多个发音部位的sEMG信号,并利用这些信号进行无声语音的解码,从而实现流畅和自然的沟通。具体来说,研究解决了以下几个关键问题:1) 口腔面部sEMG信号的数据结构;2) 个体间sEMG信号分布的偏移;3) sEMG信号在无声语音发音中覆盖整个英语音素空间的能力;4) 非侵入性sEMG基无声语音接口的泛化能力。研究通过实验证明,sEMG信号具有图数据结构,信号分布偏移可以通过基的变化来描述,且小型神经网络能够在少量数据训练下解码无声语音,并在不同个体间表现良好。

链接: https://arxiv.org/abs/2411.02591
作者: Harshavardhana T. Gowda,Zachary D. McNaughton,Lee M. Miller
关键词-EN: neck cancer surgery, speak intelligibly due, neck cancer, cancer surgery, radiotherapy toxicity
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Each year, millions of individuals lose the ability to speak intelligibly due to causes such as neuromuscular disease, stroke, trauma, and head/neck cancer surgery (e.g. laryngectomy) or treatment (e.g. radiotherapy toxicity to the speech articulators). Effective communication is crucial for daily activities, and losing the ability to speak leads to isolation, depression, anxiety, and a host of detrimental sequelae. Noninvasive surface electromyography (sEMG) has shown promise to restore speech output in these individuals. The goal is to collect sEMG signals from multiple articulatory sites as people silently produce speech and then decode the signals to enable fluent and natural communication. Currently, many fundamental properties of orofacial neuromuscular signals relating to speech articulation remain unanswered. They include questions relating to 1) the data structure of the orofacial sEMG signals, 2)the signal distribution shift of sEMG across individuals, 3) ability of sEMG signals to span the entire English language phonetic space during silent speech articulations, and 4) the generalization capability of non-invasive sEMG based silent speech interfaces. We address these questions through a series of experiments involving healthy human subjects. We show that sEMG signals evince graph data structure and that the signal distribution shift is given by a change of basis. Furthermore, we show that silently voiced articulations spanning the entire English language phonetic space can be decoded using small neural networks which can be trained with little data and that such architectures work well across individuals. To ensure transparency and reproducibility, we open-source all the data and codes used in this study.
摘要:每年,数百万个体因神经肌肉疾病、中风、创伤以及头颈部癌症手术(如喉切除术)或治疗(如放射治疗对言语发音器官的毒性)而失去清晰说话的能力。有效的沟通对于日常活动至关重要,失去说话能力会导致孤立、抑郁、焦虑以及一系列有害的后果。非侵入性表面肌电图(sEMG)在这些个体中显示出恢复言语输出的潜力。目标是收集人们在无声产生言语时来自多个发音部位的sEMG信号,然后解码这些信号以实现流畅自然的沟通。目前,与言语发音相关的口面部神经肌肉信号的许多基本特性仍未得到解答。这些问题包括:1) 口面部sEMG信号的数据结构,2) sEMG信号在个体间的分布变化,3) sEMG信号在无声言语发音时能否覆盖整个英语语音空间,以及4) 非侵入性sEMG基无声言语接口的泛化能力。我们通过一系列涉及健康人类受试者的实验来解决这些问题。我们展示了sEMG信号表现出图数据结构,并且信号分布变化可以通过基的变化来表示。此外,我们证明了使用小型神经网络可以解码覆盖整个英语语音空间的无声发音,这些网络可以在少量数据上进行训练,并且这种架构在个体间表现良好。为了确保透明度和可重复性,我们公开了本研究中使用的所有数据和代码。
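"个体间 sEMG 信号分布偏移可由基变换刻画"这一结论,可以用最小二乘求解一个线性变换来粗略示意(纯属假设性演示,变量名与构造均非论文原始代码):

```python
import numpy as np

def fit_change_of_basis(X_src, X_tgt):
    # 求解线性映射 W, 使 X_src @ W 在最小二乘意义下最接近 X_tgt,
    # 即把源受试者的信号坐标系 (基) 对齐到目标受试者
    W, *_ = np.linalg.lstsq(X_src, X_tgt, rcond=None)
    return W
```

若分布偏移确实只是基的变化,则用少量配对数据估计出的 W 即可把一名受试者训练的解码器迁移到另一名受试者。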

[NLP-45] Context-Informed Machine Translation of Manga using Multimodal Large Language Models

【速读】: 该论文试图解决漫画翻译中由于手工翻译耗时耗力而导致大部分漫画无法走出日本本土市场的问题。解决方案的关键在于利用多模态大型语言模型 (multimodal large language models, LLMs) 的视觉组件来提升翻译质量,并通过优化翻译单元大小、上下文长度以及提出一种高效的标记方法来改进漫画翻译。此外,论文还引入了首个日波平行漫画翻译数据集,并贡献了一个开源软件套件,以便于未来研究中对LLMs在漫画翻译中的性能进行基准测试。研究结果表明,所提出的方法在日英翻译中达到了最先进水平,并为日波翻译设定了新的标准。

链接: https://arxiv.org/abs/2411.02589
作者: Philip Lippmann,Konrad Skublicki,Joshua Tanner,Shonosuke Ishiwatari,Jie Yang
关键词-EN: domestic Japanese market, Japanese market, domestic Japanese, manga translation, significant time
类目: Computation and Language (cs.CL)
备注: Preprint. Under review

点击查看摘要

Abstract:Due to the significant time and effort required for handcrafting translations, most manga never leave the domestic Japanese market. Automatic manga translation is a promising potential solution. However, it is a budding and underdeveloped field and presents complexities even greater than those found in standard translation due to the need to effectively incorporate visual elements into the translation process to resolve ambiguities. In this work, we investigate to what extent multimodal large language models (LLMs) can provide effective manga translation, thereby assisting manga authors and publishers in reaching wider audiences. Specifically, we propose a methodology that leverages the vision component of multimodal LLMs to improve translation quality and evaluate the impact of translation unit size, context length, and propose a token efficient approach for manga translation. Moreover, we introduce a new evaluation dataset – the first parallel Japanese-Polish manga translation dataset – as part of a benchmark to be used in future research. Finally, we contribute an open-source software suite, enabling others to benchmark LLMs for manga translation. Our findings demonstrate that our proposed methods achieve state-of-the-art results for Japanese-English translation and set a new standard for Japanese-Polish.
摘要:由于手工翻译需要大量的时间和精力,大多数漫画从未离开日本国内市场。自动漫画翻译是一个有前景的解决方案。然而,这是一个新兴且尚未充分发展的领域,由于需要在翻译过程中有效整合视觉元素以解决歧义,其复杂性甚至超过了标准翻译。在本研究中,我们探讨了多模态大语言模型 (LLMs) 在多大程度上能够提供有效的漫画翻译,从而帮助漫画作者和出版商接触更广泛的受众。具体而言,我们提出了一种利用多模态 LLMs 的视觉组件来提高翻译质量的方法,并评估了翻译单元大小、上下文长度对翻译效果的影响,同时提出了一种高效的 Token 利用方法用于漫画翻译。此外,我们引入了一个新的评估数据集——首个日波平行漫画翻译数据集——作为未来研究的一个基准。最后,我们贡献了一套开源软件工具,使其他人能够对 LLMs 进行漫画翻译的基准测试。我们的研究结果表明,我们提出的方法在日本-英语翻译方面达到了最先进水平,并为日本-波兰语翻译设定了新的标准。

[NLP-46] A Big Data-empowered System for Real-time Detection of Regional Discriminatory Comments on Vietnamese Social Media ATC

【速读】: 该论文试图解决越南社会中存在的区域歧视问题,特别是针对越南社交媒体上的区域歧视性评论的检测。解决方案的关键在于提出了一个名为“越南区域歧视评论检测 (Detection of Regional Discriminatory Comments on Vietnamese Social Media)”的任务,并构建了ViRDC数据集,该数据集包含了来自社交媒体平台的评论,为后续研究和开发提供了宝贵资源。论文的核心创新在于结合了机器学习和迁移学习模型,并开发了一个基于Apache Spark框架的系统,该系统具备流处理能力,能够实时处理来自社交媒体网络的数据,确保系统的可扩展性和响应性,从而实现对越南区域歧视的实时检测。

链接: https://arxiv.org/abs/2411.02587
作者: An Nghiep Huynh,Thanh Dat Do,Trong Hop Do
关键词-EN: Regional discrimination, Vietnamese Regional Discrimination, persistent social issue, Regional Discrimination Comments, Regional
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: accepted by 2024 International Conference on Advanced Technologies for Communications (ATC) Program

点击查看摘要

Abstract:Regional discrimination is a persistent social issue in Vietnam. While existing research has explored hate speech in the Vietnamese language, the specific issue of regional discrimination remains under-addressed. Previous studies primarily focused on model development without considering practical system implementation. In this work, we propose a task called Detection of Regional Discriminatory Comments on Vietnamese Social Media, leveraging the power of machine learning and transfer learning models. We have built the ViRDC (Vietnamese Regional Discrimination Comments) dataset, which contains comments from social media platforms, providing a valuable resource for further research and development. Our approach integrates streaming capabilities to process real-time data from social media networks, ensuring the system’s scalability and responsiveness. We developed the system on the Apache Spark framework to efficiently handle increasing data inputs during streaming. Our system offers a comprehensive solution for the real-time detection of regional discrimination in Vietnam.
摘要:区域歧视是越南长期存在的社会问题。尽管现有研究已探讨了越南语中的仇恨言论,但区域歧视这一具体问题仍未得到充分关注。以往的研究主要集中在模型开发上,而未考虑实际系统实施。在本研究中,我们提出了一项名为“越南社交媒体区域歧视评论检测”的任务,利用机器学习和迁移学习模型的力量。我们构建了ViRDC(Vietnamese Regional Discrimination Comments)数据集,该数据集包含来自社交媒体平台的评论,为后续研究和开发提供了宝贵的资源。我们的方法集成了流处理能力,以处理来自社交媒体网络的实时数据,确保系统的可扩展性和响应性。我们在Apache Spark框架上开发了该系统,以高效处理流处理过程中不断增加的数据输入。我们的系统为越南区域歧视的实时检测提供了一个全面的解决方案。

[NLP-47] Social Support Detection from Social Media Texts

【速读】: 该论文试图解决在线社区中社会支持识别的问题,提出了社会支持检测 (Social Support Detection, SSD) 作为自然语言处理 (NLP) 任务。解决方案的关键在于通过结合语言学、心理语言学、情绪和情感信息的多特征组合,以及使用神经网络模型和多种词嵌入技术,来提高模型在识别和支持类型分类任务中的性能。实验结果表明,整合这些特征在检测社会支持及区分其面向个体或群体方面具有显著效果,最佳结果在不同子任务中的准确率范围为0.72到0.82。

链接: https://arxiv.org/abs/2411.02580
作者: Zahra Ahani,Moein Shahiki Tash,Fazlourrahman Balouchzahi,Luis Ramos,Grigori Sidorov,Alexander Gelbukh
关键词-EN: Social Support Detection, Social support, Support Detection, introduces Social Support, plays a pivotal
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Social support, conveyed through a multitude of interactions and platforms such as social media, plays a pivotal role in fostering a sense of belonging, aiding resilience in the face of challenges, and enhancing overall well-being. This paper introduces Social Support Detection (SSD) as a Natural language processing (NLP) task aimed at identifying supportive interactions within online communities. The study presents the task of Social Support Detection (SSD) in three subtasks: two binary classification tasks and one multiclass task, with labels detailed in the dataset section. We conducted experiments on a dataset comprising 10,000 YouTube comments. Traditional machine learning models were employed, utilizing various feature combinations that encompass linguistic, psycholinguistic, emotional, and sentiment information. Additionally, we experimented with neural network-based models using various word embeddings to enhance the performance of our models across these subtasks. The results reveal a prevalence of group-oriented support in online dialogues, reflecting broader societal patterns. The findings demonstrate the effectiveness of integrating psycholinguistic, emotional, and sentiment features with n-grams in detecting social support and distinguishing whether it is directed toward an individual or a group. The best results for different subtasks across all experiments range from 0.72 to 0.82.
摘要:社会支持通过多种互动和平台(如社交媒体)传达,在培养归属感、增强面对挑战的韧性以及提升整体幸福感方面发挥着关键作用。本文介绍了社会支持检测 (Social Support Detection, SSD) 作为一项自然语言处理 (Natural Language Processing, NLP) 任务,旨在识别在线社区中的支持性互动。研究将社会支持检测任务分为三个子任务:两个二分类任务和一个多分类任务,标签细节在数据集部分详细说明。我们在包含 10,000 条 YouTube 评论的数据集上进行了实验。采用了传统的机器学习模型,利用了包含语言学、心理语言学、情绪和情感信息的各种特征组合。此外,我们还尝试了基于神经网络的模型,使用不同的词嵌入方法来提升模型在这些任务中的表现。实验结果显示,在线对话中普遍存在面向群体的支持,反映了更广泛的社会模式。研究结果表明,将心理语言学、情绪和情感特征与 n-gram 结合使用在检测社会支持以及区分其是面向个人还是群体方面具有有效性。不同子任务在所有实验中的最佳结果范围从 0.72 到 0.82。
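摘要中"将 n-gram 与心理语言学、情绪与情感特征结合"的特征工程,可以用一个极简的特征抽取函数示意(其中的支持词典 SUPPORT_LEXICON 仅为假设示例):

```python
from collections import Counter

# 假设性的支持性词汇表, 实际工作中应使用心理语言学词典等资源
SUPPORT_LEXICON = frozenset({"help", "support", "encourage", "together"})

def extract_features(text):
    tokens = text.lower().split()
    return {
        "unigrams": Counter(tokens),                  # 1-gram 词袋特征
        "bigrams": Counter(zip(tokens, tokens[1:])),  # 2-gram 特征
        "lexicon_hits": sum(t in SUPPORT_LEXICON for t in tokens),  # 词典特征
    }
```

这类特征拼接后即可喂给论文中所用的传统机器学习分类器。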

[NLP-48] MM-Embed: Universal Multimodal Retrieval with Multimodal LLM s

【速读】: 该论文试图解决现有检索模型在处理多模态(multimodal)和多样化检索任务时的局限性问题。解决方案的关键在于引入多模态大语言模型(MLLMs)并进行精细调整,以实现“通用多模态检索”(universal multimodal retrieval)。具体措施包括:1) 在10个数据集上对MLLM进行微调,以处理包含文本和图像的复杂查询,并通过模态感知硬负样本挖掘(modality-aware hard negative mining)缓解模态偏差(modality bias);2) 持续微调通用多模态检索器,以增强其文本检索能力同时保持多模态检索能力;3) 利用现成的MLLMs作为零样本重排序器(zero-shot rerankers),通过提示和重排序(prompt-and-reranking)进一步优化复杂查询的检索结果。这些方法使得MM-Embed模型在多模态检索基准M-BEIR上达到最先进性能,并在MTEB检索基准上超越了现有的文本检索模型NV-Embed-v1。

链接: https://arxiv.org/abs/2411.02571
作者: Sheng-Chieh Lin,Chankyu Lee,Mohammad Shoeybi,Jimmy Lin,Bryan Catanzaro,Wei Ping
关键词-EN: straightforward search scenario, retrieval, retrieval tasks, multimodal retrieval, retrieval models typically
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: We release the model weights at: this https URL

点击查看摘要

Abstract:State-of-the-art retrieval models typically address a straightforward search scenario, where retrieval tasks are fixed (e.g., finding a passage to answer a specific question) and only a single modality is supported for both queries and retrieved results. This paper introduces techniques for advancing information retrieval with multimodal large language models (MLLMs), enabling a broader search scenario, termed universal multimodal retrieval, where multiple modalities and diverse retrieval tasks are accommodated. To this end, we first study fine-tuning an MLLM as a bi-encoder retriever on 10 datasets with 16 retrieval tasks. Our empirical results show that the fine-tuned MLLM retriever is capable of understanding challenging queries, composed of both text and image, but underperforms a smaller CLIP retriever in cross-modal retrieval tasks due to modality bias from MLLMs. To address the issue, we propose modality-aware hard negative mining to mitigate the modality bias exhibited by MLLM retrievers. Second, we propose to continually fine-tune the universal multimodal retriever to enhance its text retrieval capability while maintaining multimodal retrieval capability. As a result, our model, MM-Embed, achieves state-of-the-art performance on the multimodal retrieval benchmark M-BEIR, which spans multiple domains and tasks, while also surpassing the state-of-the-art text retrieval model, NV-Embed-v1, on MTEB retrieval benchmark. Finally, we explore to prompt the off-the-shelf MLLMs as the zero-shot rerankers to refine the ranking of the candidates from the multimodal retriever. We find that through prompt-and-reranking, MLLMs can further improve multimodal retrieval when the user queries (e.g., text-image composed queries) are more complex and challenging to understand. These findings also pave the way to advance universal multimodal retrieval in the future.
摘要:当前最先进的检索模型通常处理的是一种直接的搜索场景,其中检索任务是固定的(例如,找到一段文本以回答特定问题),并且仅支持单一模态的查询和检索结果。本文介绍了一系列利用多模态大语言模型 (MLLMs) 推进信息检索的技术,从而实现了一种更广泛的搜索场景,称为通用多模态检索,其中可以容纳多种模态和多样化的检索任务。为此,我们首先研究了在包含16个检索任务的10个数据集上对MLLM进行微调,将其作为双编码器检索器。我们的实证结果表明,微调后的MLLM检索器能够理解由文本和图像组成的复杂查询,但在跨模态检索任务中表现不如更小的CLIP检索器,这主要是由于MLLMs的模态偏差所致。为了解决这一问题,我们提出了模态感知的硬负样本挖掘方法,以减轻MLLM检索器表现出的模态偏差。其次,我们提出持续微调通用多模态检索器,以增强其文本检索能力,同时保持多模态检索能力。结果显示,我们的模型MM-Embed在多模态检索基准M-BEIR上达到了最先进的性能,该基准涵盖了多个领域和任务,并且在MTEB检索基准上也超越了最先进的文本检索模型NV-Embed-v1。最后,我们探索了将现成的MLLMs作为零样本重排序器,以优化多模态检索器候选结果的排序。我们发现,通过提示和重排序,MLLMs在用户查询(例如,由文本和图像组成的复杂查询)更加复杂和难以理解时,能够进一步提高多模态检索的性能。这些发现也为未来推进通用多模态检索铺平了道路。

[NLP-49] Enhancing Risk Assessment in Transformers with Loss-at-Risk Functions

【速读】: 该论文试图解决传统网络损失函数(如均方误差 (MSE))在极端风险条件下对金融市场重大损失的低估问题。解决方案的关键在于引入了一种新的损失函数——风险损失 (Loss-at-Risk),该函数结合了风险价值 (VaR) 和条件风险价值 (CVaR),并将其集成到Transformer模型中。这一创新使得Transformer模型能够识别潜在的极端损失,从而提高其在高风险金融决策中的预测和管理能力。通过在高度波动的金融数据集上进行实验,研究结果表明,风险损失函数不仅提升了Transformer模型的风险评估能力,同时保持了其在决策和推理方面的核心优势。

链接: https://arxiv.org/abs/2411.02558
作者: Jinghan Zhang,Henry Xie,Xinhao Zhang,Kunpeng Liu
关键词-EN: Square Error, tools are essential, precise risk assessment, risk, risk assessment tools
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Accepted by ICKG 2024

点击查看摘要

Abstract:In the financial field, precise risk assessment tools are essential for decision-making. Recent studies have challenged the notion that traditional network loss functions like Mean Square Error (MSE) are adequate, especially under extreme risk conditions that can lead to significant losses during market upheavals. Transformers and Transformer-based models are now widely used in financial forecasting owing to their outstanding performance in time-series-related predictions. However, these models typically lack sensitivity to extreme risks and often underestimate great financial losses. To address this problem, we introduce a novel loss function, the Loss-at-Risk, which incorporates Value at Risk (VaR) and Conditional Value at Risk (CVaR) into Transformer models. This integration allows Transformer models to recognize potential extreme losses and further improves their capability to handle high-stakes financial decisions. Moreover, we conduct a series of experiments with highly volatile financial datasets to demonstrate that our Loss-at-Risk function improves the Transformers’ risk prediction and management capabilities without compromising their decision-making accuracy or efficiency. The results demonstrate that integrating risk-aware metrics during training enhances the Transformers’ risk assessment capabilities while preserving their core strengths in decision-making and reasoning across diverse scenarios.
摘要:在金融领域,精确的风险评估工具对于决策至关重要。近期研究对传统网络损失函数(如均方误差 (MSE))在极端风险条件下的适用性提出了质疑,尤其是在市场动荡期间可能导致重大损失的情况下。Transformer及其基于Transformer的模型因其卓越的时间序列预测性能而在金融预测中得到广泛应用。然而,这些模型通常对极端风险缺乏敏感性,往往低估了巨大的财务损失。为解决这一问题,我们引入了一种新型损失函数——风险损失 (Loss-at-Risk),该函数将风险价值 (VaR) 和条件风险价值 (CVaR) 融入到Transformer模型中。这种整合使得Transformer模型能够识别潜在的极端损失,并进一步增强其处理高风险金融决策的能力。此外,我们通过一系列高波动性金融数据集的实验,证明了风险损失函数在不降低决策准确性或效率的情况下,提升了Transformer的风险预测和管理能力。实验结果表明,在训练过程中整合风险感知指标,不仅增强了Transformer的风险评估能力,同时保留了其在各种场景下决策和推理的核心优势。
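将 VaR 与 CVaR 融入训练损失的思路,可以在逐样本损失上这样示意(其中权重 lam 为假设的超参数,具体组合形式与论文未必一致):

```python
import numpy as np

def loss_at_risk(sample_losses, alpha=0.95, lam=0.5):
    # VaR: 逐样本损失分布的 alpha 分位数;
    # CVaR: 超过 VaR 的尾部损失的平均值
    losses = np.asarray(sample_losses)
    var = np.quantile(losses, alpha)
    tail = losses[losses >= var]
    cvar = tail.mean() if tail.size else var
    # 在平均损失 (如 MSE) 之上, 对尾部极端损失额外加权惩罚
    return losses.mean() + lam * cvar
```

相比单纯的 MSE,该损失会放大训练批次中极端误差的梯度贡献,从而使模型对市场动荡时的重大损失更敏感。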

[NLP-50] Leveraging Transformer-Based Models for Predicting Inflection Classes of Words in an Endangered Sami Language

【速读】: 该论文试图解决濒危乌拉尔语系语言Skolt Sami的词汇和形态句法特征分类问题,由于该语言具有复杂的形态结构且可用数据有限,传统的分类方法难以有效应用。解决方案的关键在于构建一个端到端的处理流程,包括数据提取、数据增强以及训练基于transformer的模型来预测屈折类别。通过这种方法,不仅能够提高有限状态转换器(Finite-State Transducers, FSTs)的词汇覆盖率,还能为研究人员提供系统的语言学文档,特别是对于从文学作品和母语者中发现的新的词汇。该模型在词性分类上达到了平均加权F1分数1.00,在屈折类别分类上达到了0.81,显示出其在处理Skolt Sami语言复杂性方面的有效性。

链接: https://arxiv.org/abs/2411.02556
作者: Khalid Alnajjar,Mika Hämäläinen,Jack Rueter
关键词-EN: Uralic language characterized, Skolt Sami, analyzing Skolt Sami, endangered Uralic language, Uralic language
类目: Computation and Language (cs.CL)
备注: IWCLUL 2024

点击查看摘要

Abstract:This paper presents a methodology for training a transformer-based model to classify lexical and morphosyntactic features of Skolt Sami, an endangered Uralic language characterized by complex morphology. The goal of our approach is to create an effective system for understanding and analyzing Skolt Sami, given the limited data availability and linguistic intricacies inherent to the language. Our end-to-end pipeline includes data extraction, augmentation, and training a transformer-based model capable of predicting inflection classes. The motivation behind this work is to support language preservation and revitalization efforts for minority languages like Skolt Sami. Accurate classification not only helps improve the state of Finite-State Transducers (FSTs) by providing greater lexical coverage but also contributes to systematic linguistic documentation for researchers working with newly discovered words from literature and native speakers. Our model achieves an average weighted F1 score of 1.00 for POS classification and 0.81 for inflection class classification. The trained model and code will be released publicly to facilitate future research in endangered NLP.
摘要:本文提出了一种基于 Transformer 模型的训练方法,用于分类 Skolt Sami 这一濒危乌拉尔语系的词汇和形态句法特征。Skolt Sami 以其复杂的形态学特征著称。鉴于该语言数据有限且语言结构复杂,我们的目标是构建一个有效的系统来理解和分析 Skolt Sami。我们的端到端流程包括数据提取、增强以及训练一个能够预测屈折类别的 Transformer 模型。这项工作的动机在于支持像 Skolt Sami 这样的少数语言的保存和复兴工作。准确的分类不仅有助于通过提供更广泛的词汇覆盖来改进有限状态转换器 (FST) 的状态,还为研究人员提供了系统的语言记录,这些研究人员需要处理从文学作品和母语者那里新发现的词汇。我们的模型在词性分类上达到了 1.00 的加权平均 F1 分数,在屈折类别分类上达到了 0.81 的加权平均 F1 分数。训练好的模型和代码将公开发布,以促进濒危自然语言处理 (NLP) 领域的未来研究。

[NLP-51] TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives NEURIPS2024

【速读】: 该论文试图解决现有对比语言-图像预训练(Contrastive Language-Image Pretraining, CLIP)模型在训练数据缺乏组合多样性时,其组合推理能力受限的问题。解决方案的关键在于引入一种新的对比预训练策略,即TripletCLIP。该方法通过上下文学习生成“困难”负样本的文本描述,并利用文本到图像生成器合成相应的负样本图像,以此交替训练CLIP模型。这种方法显著提升了CLIP在组合能力上的表现,在SugarCrepe基准测试中实现了超过9%的绝对提升,并在零样本图像分类和图像检索任务中取得了改进。

链接: https://arxiv.org/abs/2411.02545
作者: Maitreya Patel,Abhiram Kusumba,Sheng Cheng,Changhoon Kim,Tejas Gokhale,Chitta Baral,Yezhou Yang
关键词-EN: Contrastive Language-Image Pretraining, Language-Image Pretraining, learn representations, maximize the mutual, mutual information
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted at: NeurIPS 2024 | Project Page: this https URL

点击查看摘要

Abstract:Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations. This makes the nature of the training data a significant factor in the efficacy of CLIP for downstream tasks. However, the lack of compositional diversity in contemporary image-text datasets limits the compositional reasoning ability of CLIP. We show that generating "hard" negative captions via in-context learning and synthesizing corresponding negative images with text-to-image generators offers a solution. We introduce a novel contrastive pre-training strategy that leverages these hard negative captions and images in an alternating fashion to train CLIP. We demonstrate that our method, named TripletCLIP, when applied to existing datasets such as CC3M and CC12M, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark on an equal computational budget, as well as improvements in zero-shot image classification and image retrieval. Our code, models, and data are available at: this https URL
摘要:对比语言-图像预训练 (Contrastive Language-Image Pretraining, CLIP) 模型通过最大化文本和视觉模态之间的互信息来学习表示。这使得训练数据的性质成为影响 CLIP 在下游任务中效能的重要因素。然而,当代图像-文本数据集在组合多样性方面的不足限制了 CLIP 的组合推理能力。我们展示了一种通过上下文学习生成“困难”负样本描述,并利用文本到图像生成器合成相应负样本图像的解决方案。我们提出了一种新的对比预训练策略,该策略交替利用这些困难负样本描述和图像来训练 CLIP。我们证明,当应用于现有数据集如 CC3M 和 CC12M 时,我们的方法(命名为 TripletCLIP)增强了 CLIP 的组合能力,在同等计算预算下,在 SugarCrepe 基准测试中实现了超过 9% 的绝对提升,同时在零样本图像分类和图像检索方面也取得了改进。我们的代码、模型和数据可在以下链接获取:this https URL。
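为说明"困难负样本"如何进入对比损失,下面给出一个高度简化的示意:每张图像的正样本描述与其合成的困难负样本描述一起参与 softmax 对比。函数签名与温度参数 tau 均为示例假设,并非 TripletCLIP 的原始实现。

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def triplet_clip_loss(img, pos_txt, neg_txt, tau=0.07):
    """每个样本:正描述与其"困难"负描述做二元 softmax 对比(InfoNCE 的极简形式)。"""
    img, pos_txt, neg_txt = l2norm(img), l2norm(pos_txt), l2norm(neg_txt)
    pos = (img * pos_txt).sum(-1) / tau  # 图像与正描述的相似度
    neg = (img * neg_txt).sum(-1) / tau  # 图像与困难负描述的相似度
    m = np.maximum(pos, neg)             # 数值稳定的 log-sum-exp
    lse = m + np.log(np.exp(pos - m) + np.exp(neg - m))
    return float((lse - pos).mean())     # 每个样本的 -log softmax(正样本)
```

论文中的完整方法还包含对应的合成负样本图像,并与标准 CLIP 批内对比损失交替训练,此处仅演示困难负样本进入损失的基本方式。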

[NLP-52] MILU: A Multi-task Indic Language Understanding Benchmark

【速读】: 该论文试图解决在低资源和语言多样性丰富的语言环境中评估大型语言模型(Large Language Models, LLMs)的挑战,特别是针对使用非拉丁字母的语言,如印度语系。现有基准主要集中在英语上,导致在这些语言中评估LLM能力的显著差距。解决方案的关键是引入MILU(Multi task Indic Language Understanding Benchmark),这是一个综合评估基准,涵盖8个领域和42个主题,跨越11种印度语言,反映了一般和文化特定的知识。MILU的设计以印度为中心,包含区域和州级考试的材料,涵盖本地历史、艺术、节日、法律以及标准学科如科学和数学。通过评估42个LLM,研究发现当前LLM在MILU上表现不佳,GPT-4o以72%的平均准确率领先。多语言模型在跨语言表现上优于特定语言微调模型,后者仅略优于随机基线。模型在高资源语言中的表现优于低资源语言。领域分析显示,模型在艺术、人文、法律和治理等文化相关领域的表现较差,而在STEM等一般领域表现较好。MILU作为首个专注于印度语言的基准,为全面文化评估迈出了关键一步。

链接: https://arxiv.org/abs/2411.02538
作者: Sshubam Verma,Mohammed Safi Ur Rahman Khan,Vishwajeet Kumar,Rudra Murthy,Jaydeep Sen
关键词-EN: Evaluating Large Language, Evaluating Large, challenge in NLP, spoken in India, Large Language Models
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Evaluating Large Language Models (LLMs) in low-resource and linguistically diverse languages remains a significant challenge in NLP, particularly for languages using non-Latin scripts like those spoken in India. Existing benchmarks predominantly focus on English, leaving substantial gaps in assessing LLM capabilities in these languages. We introduce MILU, a Multi task Indic Language Understanding Benchmark, a comprehensive evaluation benchmark designed to address this gap. MILU spans 8 domains and 42 subjects across 11 Indic languages, reflecting both general and culturally specific knowledge. With an India-centric design, it incorporates material from regional and state-level examinations, covering topics such as local history, arts, festivals, and laws, alongside standard subjects like science and mathematics. We evaluate over 42 LLMs, and find that current LLMs struggle with MILU, with GPT-4o achieving the highest average accuracy at 72 percent. Open multilingual models outperform language-specific fine-tuned models, which perform only slightly better than random baselines. Models also perform better in high resource languages as compared to low resource ones. Domain-wise analysis indicates that models perform poorly in culturally relevant areas like Arts and Humanities, Law and Governance compared to general fields like STEM. To the best of our knowledge, MILU is the first of its kind benchmark focused on Indic languages, serving as a crucial step towards comprehensive cultural evaluation. All code, benchmarks, and artifacts will be made publicly available to foster open research.
摘要:在资源匮乏且语言多样性丰富的环境中评估大语言模型(LLMs)仍然是自然语言处理(NLP)中的一个重大挑战,尤其是在使用非拉丁字母的语言中,如印度所使用的语言。现有的基准测试主要集中在英语上,导致在这些语言中评估LLM能力存在显著的空白。我们引入了MILU,一个多任务印度语言理解基准,这是一个旨在填补这一空白的综合评估基准。MILU涵盖了8个领域和42个主题,跨越11种印度语言,反映了通用知识和文化特定知识。基于印度中心的设计,MILU包含了来自地区和州级考试的材料,涵盖了地方历史、艺术、节日和法律等主题,以及科学和数学等标准科目。我们评估了超过42个LLM,发现当前的LLM在MILU上表现不佳,其中GPT-4o取得了最高的平均准确率,为72%。开放的多语言模型在表现上优于特定语言的微调模型,后者仅略优于随机基线。与低资源语言相比,模型在高资源语言中的表现更好。按领域分析表明,模型在艺术和人文、法律和治理等文化相关领域的表现较差,而在STEM等通用领域的表现较好。据我们所知,MILU是首个专注于印度语言的同类基准,是迈向全面文化评估的关键一步。所有代码、基准和相关材料将公开发布,以促进开放研究。

[NLP-53] INQUIRE: A Natural World Text-to-Image Retrieval Benchmark NEURIPS2024

【速读】: 该论文试图解决多模态视觉-语言模型在专家级文本到图像检索任务中的挑战,特别是针对需要细致图像理解和领域专业知识的任务。解决方案的关键在于引入了一个名为INQUIRE的文本到图像检索基准,其中包括一个新的大型数据集iNaturalist 2024 (iNat24),包含五百万张自然世界图像和250个专家级检索查询。这些查询涵盖物种识别、环境、行为和外观等多个类别,并配对有详尽标注的相关图像,共计33,000个匹配项。INQUIRE通过两个核心检索任务进行评估:(1) INQUIRE-Fullrank,即全数据集排序任务;(2) INQUIRE-Rerank,即对前100个检索结果进行重排序的任务。研究表明,当前最先进的多模态模型在INQUIRE上表现不佳,mAP@50未能超过50%,但通过更强大的多模态模型进行重排序可以提升检索性能,表明仍有显著改进空间。该基准旨在缩小AI能力与实际科学研究需求之间的差距,推动检索系统的发展,以加速生态和生物多样性研究。

链接: https://arxiv.org/abs/2411.02537
作者: Edward Vendrow,Omiros Pantazis,Alexander Shepard,Gabriel Brostow,Kate E. Jones,Oisin Mac Aodha,Sara Beery,Grant Van Horn
关键词-EN: introduce INQUIRE, INQUIRE, INQUIRE includes iNaturalist, queries, retrieval benchmark designed
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Published in NeurIPS 2024, Datasets and Benchmarks Track

点击查看摘要

Abstract:We introduce INQUIRE, a text-to-image retrieval benchmark designed to challenge multimodal vision-language models on expert-level queries. INQUIRE includes iNaturalist 2024 (iNat24), a new dataset of five million natural world images, along with 250 expert-level retrieval queries. These queries are paired with all relevant images comprehensively labeled within iNat24, comprising 33,000 total matches. Queries span categories such as species identification, context, behavior, and appearance, emphasizing tasks that require nuanced image understanding and domain expertise. Our benchmark evaluates two core retrieval tasks: (1) INQUIRE-Fullrank, a full dataset ranking task, and (2) INQUIRE-Rerank, a reranking task for refining top-100 retrievals. Detailed evaluation of a range of recent multimodal models demonstrates that INQUIRE poses a significant challenge, with the best models failing to achieve an mAP@50 above 50%. In addition, we show that reranking with more powerful multimodal models can enhance retrieval performance, yet there remains a significant margin for improvement. By focusing on scientifically-motivated ecological challenges, INQUIRE aims to bridge the gap between AI capabilities and the needs of real-world scientific inquiry, encouraging the development of retrieval systems that can assist with accelerating ecological and biodiversity research. Our dataset and code are available at this https URL
摘要:我们介绍了 INQUIRE,这是一个文本到图像检索基准,旨在挑战多模态视觉语言模型在专家级查询上的表现。INQUIRE 包括 iNaturalist 2024 (iNat24),一个包含五百万张自然世界图像的新数据集,以及 250 个专家级检索查询。这些查询与 iNat24 中所有相关图像进行了全面标注,总计 33,000 个匹配项。查询涵盖物种识别、环境、行为和外观等类别,强调需要细致图像理解和领域专业知识的任务。我们的基准评估了两个核心检索任务:(1) INQUIRE-Fullrank,一个全数据集排序任务,以及 (2) INQUIRE-Rerank,一个用于优化前 100 个检索结果的重新排序任务。对一系列近期多模态模型的详细评估表明,INQUIRE 提出了重大挑战,最佳模型未能达到 mAP@50 超过 50% 的成绩。此外,我们展示了使用更强大的多模态模型进行重新排序可以提升检索性能,但仍有显著的改进空间。通过聚焦于科学驱动的生态挑战,INQUIRE 旨在弥合 AI 能力与现实世界科学探究需求之间的差距,促进开发能够加速生态和生物多样性研究的检索系统。我们的数据集和代码可通过此 https URL 获取。
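基准中报告的 mAP@50 指标可以分解为:对每个查询计算其前 50 个检索结果上的平均精确率 (AP@50),再对所有查询取均值。下面是一个示意性的 AP@k 计算,分母取前 k 内命中的相关项数,这是常见约定之一,可能与基准官方实现的规范化细节略有差异。

```python
def average_precision_at_k(ranked_relevance, k=50):
    """AP@k:ranked_relevance 为按检索排名排列的 0/1 相关性标注。"""
    hits, score = 0, 0.0
    for i, rel in enumerate(ranked_relevance[:k]):
        if rel:
            hits += 1
            score += hits / (i + 1)  # 每个命中位置处的精确率
    return score / hits if hits else 0.0
```

INQUIRE 报告的 mAP@50 即对全部 250 个专家级查询的 AP@50 取平均。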

[NLP-54] Towards Leveraging News Media to Support Impact Assessment of AI Technologies

【速读】: 该论文试图解决现有影响评估框架(IAs)在评估AI技术对公众社会行为、政策以及文化和地理背景的影响时可能存在的偏差问题。解决方案的关键在于通过微调大型语言模型(LLMs)来捕捉和生成来自全球多样化新闻报道中的负面影响,从而在影响评估中引入更多元化的视角。研究结果表明,微调后的开源LLMs(如Mistral-7B)不仅能够生成高质量的负面影响描述,而且在涵盖影响的广泛类别方面优于GPT-4。

链接: https://arxiv.org/abs/2411.02536
作者: Mowafak Allaham,Kimon Kieslich,Nicholas Diakopoulos
关键词-EN: public social behavior, geographical contexts shaping, Expert-driven frameworks, social behavior, inadvertently overlook
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: arXiv admin note: text overlap with arXiv:2401.18028

点击查看摘要

Abstract:Expert-driven frameworks for impact assessments (IAs) may inadvertently overlook the effects of AI technologies on the public’s social behavior, policy, and the cultural and geographical contexts shaping the perception of AI and the impacts around its use. This research explores the potentials of fine-tuning LLMs on negative impacts of AI reported in a diverse sample of articles from 266 news domains spanning 30 countries around the world to incorporate more diversity into IAs. Our findings highlight (1) the potential of fine-tuned open-source LLMs in supporting IA of AI technologies by generating high-quality negative impacts across four qualitative dimensions: coherence, structure, relevance, and plausibility, and (2) the efficacy of small open-source LLM (Mistral-7B) fine-tuned on impacts from news media in capturing a wider range of categories of impacts that GPT-4 had gaps in covering.
摘要:专家驱动的影响评估框架(Impact Assessments, IAs)可能无意中忽略了AI技术对公众社会行为、政策以及塑造AI认知和文化地理背景的影响。本研究探讨了通过微调大语言模型(LLMs)来捕捉全球30个国家266个新闻域名中多样化的文章所报道的AI负面影响的潜力,以期在IAs中纳入更多元化的视角。我们的研究结果表明:(1)通过微调的开源大语言模型,可以在支持AI技术影响评估方面生成高质量的负面影响,这些影响在四个定性维度上表现出色:连贯性、结构、相关性和合理性;(2)经过新闻媒体影响数据微调的小型开源大语言模型(Mistral-7B)在捕捉GPT-4未能涵盖的更广泛影响类别方面表现出了更高的效能。

[NLP-55] A Comprehensive Study on Quantization Techniques for Large Language Models

【速读】: 该论文试图解决大型语言模型 (Large Language Models, LLMs) 在资源受限的物联网设备和嵌入式系统中的部署问题。解决方案的关键在于量化技术 (Quantization),通过降低模型数值的精度,将其转换为更小的离散值集合,从而减小模型的大小并加速推理过程。论文详细分析了量化技术的数学理论、常见方法及其在LLMs中的应用,特别是探讨了几种主要量化方法的算法和性能表现。

链接: https://arxiv.org/abs/2411.02530
作者: Jiedong Lang,Zhehao Guo,Shuyu Huang
关键词-EN: Large Language Models, Large Language, demonstrates excellent performance, Language Models, Transformer model
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have been extensively researched and used in both academia and industry since the rise in popularity of the Transformer model, which demonstrates excellent performance in AI. However, the computational demands of LLMs are immense, and the energy resources required to run them are often limited. For instance, popular models like GPT-3, with 175 billion parameters and a storage requirement of 350 GB, present significant challenges for deployment on resource-constrained IoT devices and embedded systems. These systems often lack the computational capacity to handle such large models. Quantization, a technique that reduces the precision of model values to a smaller set of discrete values, offers a promising solution by reducing the size of LLMs and accelerating inference. In this research, we provide a comprehensive analysis of quantization techniques within the machine learning field, with a particular focus on their application to LLMs. We begin by exploring the mathematical theory of quantization, followed by a review of common quantization methods and how they are implemented. Furthermore, we examine several prominent quantization methods applied to LLMs, detailing their algorithms and performance outcomes.
摘要:自 Transformer 模型兴起以来,大语言模型 (Large Language Models, LLMs) 在学术界和工业界得到了广泛的研究和应用,其在人工智能领域表现出色。然而,LLMs 的计算需求巨大,运行所需的能源资源往往有限。例如,流行的 GPT-3 模型拥有 1750 亿参数,存储需求高达 350 GB,这给资源受限的物联网设备和嵌入式系统的部署带来了显著挑战。这些系统通常缺乏处理如此大规模模型的计算能力。量化 (Quantization) 技术通过将模型值的精度降低到一组较小的离散值,提供了一种有前景的解决方案,可以减小 LLMs 的规模并加速推理。在本研究中,我们对机器学习领域中的量化技术进行了全面分析,特别关注其在 LLMs 中的应用。我们首先探讨了量化的数学理论,随后回顾了常见的量化方法及其具体实现。此外,我们还研究了几种应用于 LLMs 的著名量化方法,详细介绍了它们的算法和性能结果。
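摘要中提到的量化思想可以用最简单的对称均匀量化来说明:把浮点权重线性映射到 int8,推理时再乘回 scale 反量化。以下代码仅演示基本原理,实际的 LLM 量化方法(如按组量化、GPTQ 等)要复杂得多。

```python
import numpy as np

def quantize_int8(w):
    """对称均匀量化:浮点权重映射为 int8,返回反量化所需的 scale。"""
    scale = max(float(np.abs(w).max()) / 127.0, 1e-12)  # 防止全零权重导致除零
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """反量化:int8 权重乘回 scale,恢复近似的浮点权重。"""
    return q.astype(np.float32) * scale
```

量化误差的上界约为 scale 的一半,这也解释了为何位宽越低(离散值越少)、scale 越大时精度损失越明显。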

[NLP-56] What Goes Into a LM Acceptability Judgment? Rethinking the Impact of Frequency and Length

【速读】: 该论文试图解决在比较语言模型(LMs)与人类语言能力时,模型概率受序列长度和词汇单字频率影响的问题。现有研究通常假设所有模型需要相同的调整来控制这些影响,而论文提出了一个新的链接理论——MORCELA,通过学习参数来估计长度和单字频率的最优调整水平,从而更准确地反映模型的语言能力。关键在于MORCELA能够根据数据动态调整这些影响因素,而不是采用统一的调整方法,从而在评估模型语言能力时提供更精确的结果。

链接: https://arxiv.org/abs/2411.02528
作者: Lindia Tjuatja,Graham Neubig,Tal Linzen,Sophie Hao
关键词-EN: unigram frequency, unigram frequency effects, linguistic capabilities, capabilities of language, lexical items
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:When comparing the linguistic capabilities of language models (LMs) with humans using LM probabilities, factors such as the length of the sequence and the unigram frequency of lexical items have a significant effect on LM probabilities in ways that humans are largely robust to. Prior works in comparing LM and human acceptability judgments treat these effects uniformly across models, making a strong assumption that models require the same degree of adjustment to control for length and unigram frequency effects. We propose MORCELA, a new linking theory between LM scores and acceptability judgments where the optimal level of adjustment for these effects is estimated from data via learned parameters for length and unigram frequency. We first show that MORCELA outperforms a commonly used linking theory for acceptability–SLOR (Pauls and Klein, 2012; Lau et al. 2017)–across two families of transformer LMs (Pythia and OPT). Furthermore, we demonstrate that the assumed degrees of adjustment in SLOR for length and unigram frequency overcorrect for these confounds, and that larger models require a lower relative degree of adjustment for unigram frequency, though a significant amount of adjustment is still necessary for all models. Finally, our subsequent analysis shows that larger LMs’ lower susceptibility to frequency effects can be explained by an ability to better predict rarer words in context.
摘要:在通过语言模型 (LM) 概率比较语言模型与人类的语言能力时,序列长度和词汇项的单字频率等因素对 LM 概率有显著影响,而人类对此类因素则表现出较强的鲁棒性。以往在比较 LM 与人类可接受性判断的工作中,通常对这些效应在不同模型间进行统一处理,即假设模型需要相同程度的调整以控制长度和单字频率效应。我们提出了 MORCELA,这是一种新的链接理论,旨在将 LM 评分与可接受性判断相结合,其中对这些效应的最佳调整水平通过学习到的长度和单字频率参数从数据中估计得出。我们首先展示了 MORCELA 在两个 Transformer LM 系列(Pythia 和 OPT)中,优于常用的可接受性链接理论——SLOR(Pauls 和 Klein, 2012; Lau 等, 2017)。此外,我们证明 SLOR 中假设的长度和单字频率调整程度对这些混淆因素过度校正,并且较大的模型对单字频率的相对调整程度较低,尽管所有模型仍需进行一定程度的调整。最后,我们的后续分析表明,较大的 LM 对频率效应的较低敏感性可以通过其在上下文中更好地预测罕见词的能力来解释。
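作为参照,SLOR (Pauls & Klein, 2012) 用单字频率与句长对模型对数概率做固定程度的校正;MORCELA 的核心思想是把这些固定系数换成从数据学到的参数。下面的草图给出 SLOR,以及一个带可学习系数的假设性参数化,beta、gamma 的具体函数形式为示意,并非论文的原始定义。

```python
def slor(logp_model, token_unigram_logps):
    """SLOR:模型对数概率减去各 Token 单字对数概率之和,再除以句长。"""
    n = len(token_unigram_logps)
    return (logp_model - sum(token_unigram_logps)) / n

def morcela_style(logp_model, token_unigram_logps, beta=1.0, gamma=1.0):
    """假设性的可学习参数化:gamma 调节单字频率校正强度,beta 调节长度校正强度。
    beta = gamma = 1 时退化为 SLOR。"""
    n = len(token_unigram_logps)
    return (logp_model - gamma * sum(token_unigram_logps)) / (n ** beta)
```

论文的发现可以用这一形式直观理解:SLOR 相当于把校正系数固定为 1,而对较大的模型,单字频率对应的最优系数低于 1,即存在过度校正。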

[NLP-57] Evaluating the Impact of Lab Test Results on Large Language Models Generated Differential Diagnoses from Clinical Case Vignettes

【速读】: 该论文试图解决的问题是如何通过实验室测试结果提高大型语言模型(LLMs)在临床鉴别诊断(DDx)中的准确性。解决方案的关键在于评估和比较不同LLMs在有和没有实验室数据情况下的诊断准确性,特别是GPT-4和Mixtral在包含实验室结果时的表现显著优于其他模型,表明实验室数据对提升诊断准确性具有重要作用。

链接: https://arxiv.org/abs/2411.02523
作者: Balu Bhasuran,Qiao Jin,Yuzhang Xie,Carl Yang,Karim Hanna,Jennifer Costa,Cindy Shavor,Zhiyong Lu,Zhe He
关键词-EN: healthcare providers systematically, providers systematically distinguish, share similar symptoms, Top, crucial for medicine
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Differential diagnosis is crucial for medicine as it helps healthcare providers systematically distinguish between conditions that share similar symptoms. This study assesses the impact of lab test results on differential diagnoses (DDx) made by large language models (LLMs). Clinical vignettes from 50 case reports from PubMed Central were created incorporating patient demographics, symptoms, and lab results. Five LLMs GPT-4, GPT-3.5, Llama-2-70b, Claude-2, and Mixtral-8x7B were tested to generate Top 10, Top 5, and Top 1 DDx with and without lab data. A comprehensive evaluation involving GPT-4, a knowledge graph, and clinicians was conducted. GPT-4 performed best, achieving 55% accuracy for Top 1 diagnoses and 60% for Top 10 with lab data, with lenient accuracy up to 80%. Lab results significantly improved accuracy, with GPT-4 and Mixtral excelling, though exact match rates were low. Lab tests, including liver function, metabolic/toxicology panels, and serology/immune tests, were generally interpreted correctly by LLMs for differential diagnosis.
摘要:在医学领域,鉴别诊断(Differential Diagnosis, DDx)至关重要,因为它帮助医疗提供者系统地区分具有相似症状的不同疾病。本研究评估了实验室检测结果对大语言模型(Large Language Models, LLMs)进行鉴别诊断的影响。研究从PubMed Central的50份病例报告中提取了临床情景,这些情景包含了患者的人口统计学信息、症状及实验室检测结果。研究测试了五种大语言模型:GPT-4、GPT-3.5、Llama-2-70b、Claude-2和Mixtral-8x7B,分别在有无实验室数据的情况下生成前10、前5和前1的鉴别诊断。通过GPT-4、知识图谱和临床医生的综合评估,GPT-4表现最佳,在有实验室数据的情况下,前1诊断的准确率为55%,前10诊断的准确率为60%,宽松的准确率可达80%。实验室结果显著提高了准确性,GPT-4和Mixtral表现尤为突出,尽管精确匹配率较低。实验室检测,包括肝功能、代谢/毒理学面板以及血清学/免疫学检测,大语言模型在鉴别诊断中通常能正确解读。

[NLP-58] Fantastic LLMs for Preference Data Annotation and How to (not) Find Them

【速读】: 该论文试图解决大型语言模型(LLMs)偏好调优中高质量人类偏好数据收集成本高、时间长的问题。解决方案的关键在于引入定制化密度比率(Customized Density Ratio, CDR),利用开源LLMs进行数据标注。具体来说,CDR使用一个对齐良好的LLM和一个对齐较差的LLM之间的对数密度比作为奖励信号,通过探索221对不同的LLM组合,证明了LLM性能差距越大,奖励泛化效果越好。此外,通过特定标准和偏好范例调整密度比率奖励函数,进一步提升了跨领域和目标领域的性能。实验结果显示,基于Mistral-7B模型的CDR在RewardBench上达到82.6分,优于现有的训练奖励函数,并在安全和推理领域表现出与最先进模型相当的竞争力。最终,使用CDR标注的偏好数据对Llama-3-8B-Instruct进行偏好调优,显著提升了其在ArenaHard和Length-Controlled AlpacaEval 2.0上的胜率,以及在MT-Bench上的得分。

链接: https://arxiv.org/abs/2411.02481
作者: Guangxuan Xu,Kai Xu,Shivchander Sudalairaj,Hao Wang,Akash Srivastava
关键词-EN: high-quality human preference, relies on high-quality, time-consuming to gather, tuning of large, expensive and time-consuming
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Preference tuning of large language models (LLMs) relies on high-quality human preference data, which is often expensive and time-consuming to gather. While existing methods can use trained reward models or proprietary models as judges for preference annotation, they have notable drawbacks: training reward models remains dependent on initial human data, and using proprietary models imposes license restrictions that inhibit commercial usage. In this paper, we introduce customized density ratio (CDR) that leverages open-source LLMs for data annotation, offering an accessible and effective solution. Our approach uses the log-density ratio between a well-aligned LLM and a less aligned LLM as a reward signal. We explore 221 different LLM pairs and empirically demonstrate that increasing the performance gap between paired LLMs correlates with better reward generalization. Furthermore, we show that tailoring the density ratio reward function with specific criteria and preference exemplars enhances performance across domains and within target areas. In our experiment using density ratio from a pair of Mistral-7B models, CDR achieves a RewardBench score of 82.6, outperforming the best in-class trained reward functions and demonstrating competitive performance against SoTA models in Safety (91.0) and Reasoning (88.0) domains. We use CDR to annotate an on-policy preference dataset with which we preference tune Llama-3-8B-Instruct with SimPO. The final model achieves a 37.4% (+15.1%) win rate on ArenaHard and a 40.7% (+17.8%) win rate on Length-Controlled AlpacaEval 2.0, along with a score of 8.0 on MT-Bench.
摘要:大语言模型(LLM)的偏好调优依赖于高质量的人类偏好数据,这些数据的收集通常成本高昂且耗时。尽管现有方法可以使用训练好的奖励模型或专有模型作为偏好标注的评判标准,但它们存在显著缺陷:训练奖励模型仍然依赖于初始的人类数据,而使用专有模型则带来了许可限制,阻碍了商业应用。本文中,我们引入了定制密度比(CDR),利用开源大语言模型进行数据标注,提供了一种可访问且有效的解决方案。我们的方法使用一个对齐良好的大语言模型与一个对齐较差的大语言模型之间的对数密度比作为奖励信号。我们探索了221对不同的大语言模型,并通过实证证明,增加配对大语言模型之间的性能差距与更好的奖励泛化能力相关。此外,我们展示了根据特定标准和偏好示例定制密度比奖励函数,可以提升跨领域和目标区域内的性能。在我们的实验中,使用一对Mistral-7B模型的密度比,CDR在RewardBench上达到了82.6分,超过了同类最佳的训练奖励函数,并在安全(91.0)和推理(88.0)领域展示了与最先进模型相媲美的性能。我们使用CDR标注了一个在线策略偏好数据集,并使用SimPO对Llama-3-8B-Instruct进行偏好调优。最终模型在ArenaHard上的胜率为37.4%(+15.1%),在Length-Controlled AlpacaEval 2.0上的胜率为40.7%(+17.8%),并在MT-Bench上获得了8.0分。

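CDR 的核心奖励信号非常简单:同一回答在对齐良好模型与对齐较差模型下的对数概率之差(即对数密度比)。下面的草图演示如何用这一分数对候选回答排序,以生成偏好标注;其中对数概率以事先算好的数值传入,函数名均为示意。

```python
def cdr_reward(logp_aligned, logp_unaligned):
    """CDR 奖励:对齐模型与未对齐模型对同一回答的对数概率之差。"""
    return logp_aligned - logp_unaligned

def rank_responses(responses, aligned_lp, unaligned_lp):
    """按密度比奖励从高到低排序,分数最高者可作为偏好数据中的优选回答。"""
    scored = [(r, cdr_reward(a, u))
              for r, a, u in zip(responses, aligned_lp, unaligned_lp)]
    return sorted(scored, key=lambda t: t[1], reverse=True)
```

直觉上,对齐模型明显偏好、未对齐模型并不偏好的回答获得高奖励;论文进一步表明,配对模型间的性能差距越大,该奖励的泛化效果越好。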

[NLP-59] A Comparative Analysis of Instruction Fine-Tuning LLMs for Financial Text Classification

【速读】: 该论文试图解决大型语言模型(LLMs)在处理金融文本分类任务时表现不佳的问题,特别是由于金融文本的技术性和专业性。解决方案的关键在于通过指令微调(instruction fine-tuning)和模型合并(model merging)技术来提升LLMs在金融领域的性能。具体来说,研究者对Mistral-7B、Llama3-8B和Phi3-mini等较小规模的LLMs进行了指令微调,并在四个金融分类任务上进行了微调,显著提升了任务特定的性能。此外,通过模型合并技术,将单任务领域特定的微调模型与基础模型结合,显著增强了零样本(zero-shot)性能,甚至在某些数据集上超过了原始模型的准确性。这些方法有效增强了LLMs在复杂金融任务中的适应性和鲁棒性。

链接: https://arxiv.org/abs/2411.02476
作者: Sorouralsadat Fatemi,Yuheng Hu,Maryam Mousavi
关键词-EN: Natural Language Processing, diverse Natural Language, Large Language Models, Large Language, Language Processing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities across diverse Natural Language Processing (NLP) tasks, including language understanding, reasoning, and generation. However, general-domain LLMs often struggle with financial tasks due to the technical and specialized nature of financial texts. This study investigates the efficacy of instruction fine-tuning smaller-scale LLMs, including Mistral-7B, Llama3-8B, and Phi3-mini, to enhance their performance in financial text classification tasks. We fine-tuned both instruction-tuned and base models across four financial classification tasks, achieving significant improvements in task-specific performance. Furthermore, we evaluated the zero-shot capabilities of these fine-tuned models on three unseen complex financial tasks, including argument classification, deal completeness classification, and causal classification. Our results indicate while base model fine-tuning led to greater degradation, instruction-tuned models maintained more robust performance. To address this degradation, we employed model merging techniques, integrating single-task domain-specific fine-tuned models with the base model. Using this merging method resulted in significant enhancements in zero-shot performance, even exceeding the original model’s accuracy on certain datasets. Our findings underscore the effectiveness of instruction fine-tuning and model merging for adapting LLMs to specialized financial text classification tasks.
摘要:大语言模型 (LLMs) 在多种自然语言处理 (NLP) 任务中展示了令人印象深刻的能力,包括语言理解、推理和生成。然而,通用领域的大语言模型在处理金融任务时常常遇到困难,这是由于金融文本的技术性和专业性所致。本研究探讨了通过指令微调较小规模的大语言模型,包括 Mistral-7B、Llama3-8B 和 Phi3-mini,以提升其在金融文本分类任务中的表现。我们对指令微调和基础模型进行了四项金融分类任务的微调,显著提升了任务特定的性能。此外,我们还评估了这些微调模型在三个未见过的复杂金融任务中的零样本能力,包括论点分类、交易完整性分类和因果分类。结果表明,尽管基础模型微调导致了更大的性能下降,但指令微调模型保持了更稳健的表现。为了解决这种性能下降问题,我们采用了模型合并技术,将单任务领域特定的微调模型与基础模型整合。使用这种合并方法显著提升了零样本性能,甚至在某些数据集上超过了原始模型的准确性。我们的研究结果强调了指令微调和模型合并对于使大语言模型适应专业金融文本分类任务的有效性。
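摘要中用于恢复零样本能力的"模型合并"可以用最简单的权重空间平均来示意:把各单任务微调模型相对基础模型的参数增量(任务向量)取平均后加回基础模型。论文摘要未说明具体合并算法(如任务向量加权、TIES 等),以下仅为假设性草图,参数以简化的字典形式表示。

```python
def merge_models(base, finetuned_list, alpha=1.0):
    """权重空间合并:base 与各微调模型均为 {参数名: 数值列表} 的简化表示。"""
    merged = {}
    for name, w in base.items():
        # 各微调模型相对基础模型的平均增量(任务向量的简单平均)
        delta = [
            sum(ft[name][i] - w[i] for ft in finetuned_list) / len(finetuned_list)
            for i in range(len(w))
        ]
        merged[name] = [w[i] + alpha * delta[i] for i in range(len(w))]
    return merged
```

真实场景中参数是各层的张量而非列表,且合并系数 alpha 往往需要在验证集上调优,此处仅演示合并的代数结构。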

[NLP-60] Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control

【速读】: 该论文试图解决在大语言模型 (LLMs) 中同时满足多个特定要求(如诚实性和安全性)的难题。传统方法依赖于大量数据进行人类反馈强化学习 (RLHF),而该论文提出了一种无需训练的新方法——稀疏激活控制 (Sparse Activation Control)。其关键在于深入挖掘 LLMs 的内在机制,识别并定位与特定任务紧密相关的组件(即注意力头),这些组件具有稀疏特性,允许对不同任务进行近似独立的控制。通过这种方法,模型能够在安全、事实性和偏见等问题上同时与人类偏好对齐。

链接: https://arxiv.org/abs/2411.02461
作者: Yuxin Xiao,Chaoqun Wan,Yonggang Zhang,Wenxiao Wang,Binbin Lin,Xiaofei He,Xu Shen,Jieping Ye
关键词-EN: Large Language Models, Large Language, application of Large, continue to advance, advance rapidly
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As the development and application of Large Language Models (LLMs) continue to advance rapidly, enhancing their trustworthiness and aligning them with human preferences has become a critical area of research. Traditional methods rely heavily on extensive data for Reinforcement Learning from Human Feedback (RLHF), but representation engineering offers a new, training-free approach. This technique leverages semantic features to control the representation of LLM's intermediate hidden states, enabling the model to meet specific requirements such as increased honesty or heightened safety awareness. However, a significant challenge arises when attempting to fulfill multiple requirements simultaneously. It proves difficult to encode various semantic contents, like honesty and safety, into a singular semantic feature, restricting its practicality. In this work, we address this issue through "Sparse Activation Control". By delving into the intrinsic mechanisms of LLMs, we manage to identify and pinpoint components that are closely related to specific tasks within the model, i.e., attention heads. These heads display sparse characteristics that allow for near-independent control over different tasks. Our experiments, conducted on the open-source Llama series models, have yielded encouraging results. The models were able to align with human preferences on issues of safety, factuality, and bias concurrently.
摘要:随着大语言模型(Large Language Models, LLMs)的开发与应用不断迅速推进,提升其可信度并使其与人类偏好相一致已成为一个关键的研究领域。传统方法主要依赖于大量数据进行基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF),但表示工程提供了一种新的、无需训练的方法。该技术利用语义特征来控制LLM中间隐藏状态的表示,使模型能够满足特定的要求,如提高诚实度或增强安全意识。然而,在尝试同时满足多个要求时,这一方法面临重大挑战。将多种语义内容(如诚实和安全)编码到单一语义特征中显得困难重重,限制了其实际应用。在本研究中,我们通过“稀疏激活控制”(Sparse Activation Control)解决了这一问题。通过深入探究LLMs的内在机制,我们成功识别并定位了模型中与特定任务密切相关的组件,即注意力头(attention heads)。这些头显示出稀疏特性,使得对不同任务的近独立控制成为可能。我们在开源的Llama系列模型上进行的实验取得了令人鼓舞的结果。这些模型能够在安全、事实性和偏见等问题上同时与人类偏好保持一致。

[NLP-61] Code-Switching Curriculum Learning for Multilingual Transfer in LLMs

【速读】: 该论文试图解决大型语言模型(LLMs)在跨语言迁移学习中由于预训练数据不平衡导致的性能下降问题。解决方案的关键是提出了代码转换课程学习(Code-Switching Curriculum Learning, CSCL),通过模拟人类第二语言习得的过程,特别是代码转换(code-switching)的实践,逐步训练模型。CSCL包括三个阶段:1) 词级别代码转换,2) 句子级别代码转换,3) 单语语料库训练。实验结果表明,CSCL显著提升了模型对韩语等语言的迁移能力,并且在高资源和低资源语言中均表现出色,有效缓解了语言资源与安全对齐之间的虚假关联,为LLMs提供了更公平的语言迁移框架。

链接: https://arxiv.org/abs/2411.02460
作者: Haneul Yoo,Cheonbok Park,Sangdoo Yun,Alice Oh,Hwaran Lee
关键词-EN: Large language models, performance drops drastically, Large language, CSCL, exhibit near human-level
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) now exhibit near human-level performance in various tasks, but their performance drops drastically after a handful of high-resource languages due to the imbalance in pre-training data. Inspired by the human process of second language acquisition, particularly code-switching (the practice of language alternation in a conversation), we propose code-switching curriculum learning (CSCL) to enhance cross-lingual transfer for LLMs. CSCL mimics the stages of human language learning by progressively training models with a curriculum consisting of 1) token-level code-switching, 2) sentence-level code-switching, and 3) monolingual corpora. Using Qwen 2 as our underlying model, we demonstrate the efficacy of the CSCL in improving language transfer to Korean, achieving significant performance gains compared to monolingual continual pre-training methods. Ablation studies reveal that both token- and sentence-level code-switching significantly enhance cross-lingual transfer and that curriculum learning amplifies these effects. We also extend our findings into various languages, including Japanese (high-resource) and Indonesian (low-resource), and using two additional models (Gemma 2 and Phi 3.5). We further show that CSCL mitigates spurious correlations between language resources and safety alignment, presenting a robust, efficient framework for more equitable language transfer in LLMs. We observe that CSCL is effective for low-resource settings where high-quality, monolingual corpora for language transfer are hardly available.
摘要:大语言模型(LLMs)目前在多种任务中展现出接近人类水平的表现,但由于预训练数据的不平衡,其在少数高资源语言之外的表现急剧下降。受人类第二语言习得过程的启发,特别是代码转换(code-switching,即在对话中交替使用语言的实践),我们提出了代码转换课程学习(CSCL),以增强LLMs的跨语言迁移能力。CSCL通过逐步训练模型,模拟人类语言学习的阶段,包括1) Token级别的代码转换,2) 句子级别的代码转换,以及3) 单语语料库。我们以Qwen 2为基础模型,展示了CSCL在提升韩语迁移效果方面的有效性,相较于单语持续预训练方法,取得了显著的性能提升。消融研究表明,Token级别和句子级别的代码转换均显著增强了跨语言迁移能力,而课程学习进一步放大了这些效果。我们还将其研究扩展到多种语言,包括高资源的日语和低资源的印尼语,并使用了另外两个模型(Gemma 2和Phi 3.5)。进一步的研究表明,CSCL减轻了语言资源与安全对齐之间的虚假关联,提供了一个稳健、高效的框架,以实现LLMs中更公平的语言迁移。我们观察到,在高质量单语语料库难以获取的低资源环境下,CSCL同样有效。
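CSCL 课程的第一阶段(Token 级代码转换)可以用一个很小的函数示意:按照课程进度设定的概率 p,把句中出现在双语词典里的词替换为目标语言的对应词。词典内容、替换概率与采样方式均为示例假设,并非论文的原始数据构建流程。

```python
import random

def token_code_switch(tokens, lexicon, p=0.3, rng=None):
    """Token 级代码转换:以概率 p 把出现在双语词典中的词替换为对应译词。"""
    rng = rng or random.Random(0)
    return [lexicon.get(t) if rng.random() < p and t in lexicon else t
            for t in tokens]
```

课程学习体现在随训练阶段调整数据形态:先用较小的 p 做词级替换,再过渡到整句替换(句级代码转换),最后用目标语言的单语语料训练。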

[NLP-62] A Multi-Task Role-Playing Agent Capable of Imitating Character Linguistic Styles

【速读】: 该论文试图解决当前角色扮演代理 (Role-Playing Agents, RPAs) 在模仿角色语言风格和处理多轮对话之外任务时表现不佳的问题。解决方案的关键在于开发了一个名为 MRstyle 的多任务角色扮演数据集,该数据集包含了大量真实人物及其引言,并涵盖了七个不同的任务领域。基于此数据集,论文提出了 StyleRPA,一个多任务角色扮演代理 (Multi-Task Role-Playing Agent, MRPA),它在对话、字典查询、写作、故事生成、产品描述、音乐评论和开放式问答等七个任务上显著优于现有的开源大型语言模型 (Large Language Models, LLMs) 和角色扮演代理基线。

链接: https://arxiv.org/abs/2411.02457
作者: Siyuan Chen,Qingyi Si,Chenxu Yang,Yunzhi Liang,Zheng Lin,Huan Liu,Weiping Wang
关键词-EN: large language models, Role-Playing Agents, language models, current Role-Playing Agents, Multi-Task Role-Playing Agent
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The advent of large language models (LLMs) has significantly propelled the advancement of Role-Playing Agents (RPAs). However, current Role-Playing Agents predominantly focus on mimicking a character’s fundamental attributes while neglecting the replication of linguistic style, and they are incapable of effectively replicating characters when performing tasks beyond multi-turn dialogues, which results in generated responses that lack authenticity. The reason current RPAs lack this capability is due to the nature of existing character datasets, which lack collections of character quotations and are limited to multi-turn dialogue tasks, constraining the RPA’s performance across other task domains and failing to mimic a character’s linguistic style. To address this gap, we developed a multi-task role-playing dataset named MRstyle, which encompasses a substantial number of real individuals along with their quotations and covers seven different tasks. On this basis, we develop StyleRPA, a Multi-Task Role-Playing Agent (MRPA) that significantly outperforms recent open-source LLMs and RPAs baselines on 7 tasks including Dialogue, Dictionary, Composition, Story Generation, Product Description, Music Commentary, and Open Question Answering. The code and data will be released.
摘要:大语言模型(LLM)的出现极大地推动了角色扮演智能体(RPA)的发展。然而,当前的角色扮演智能体主要集中在模仿角色的基本属性,而忽视了语言风格的复制,并且在执行多轮对话以外的任务时,无法有效复制角色的表现,导致生成的回复缺乏真实性。造成当前RPA缺乏这一能力的原因在于现有角色数据集的性质,这些数据集缺乏角色引用的收集,并且仅限于多轮对话任务,限制了RPA在其他任务领域的表现,并无法模仿角色的语言风格。为了填补这一空白,我们开发了一个名为MRstyle的多任务角色扮演数据集,该数据集包含大量真实人物及其引用,并涵盖了七种不同的任务。在此基础上,我们开发了StyleRPA,这是一个多任务角色扮演智能体(MRPA),在包括对话、词典、作文、故事生成、产品描述、音乐评论和开放问题回答在内的七项任务中,显著优于最近的开源大语言模型和RPA基线。代码和数据将会发布。

[NLP-63] An Exploration of Higher Education Course Evaluation by Large Language Models

【速读】: 该论文试图解决传统课程评估方法中的主观性、反馈延迟、效率低下以及对创新教学方法评估不足的问题。解决方案的关键在于利用大型语言模型 (LLMs) 进行自动化课程评估。通过在100门课程中的实验,研究发现LLMs能够有效评估课程,其效果依赖于适当的微调和提示工程,并且生成的评估结果表现出显著的合理性和可解释性。

链接: https://arxiv.org/abs/2411.02455
作者: Bo Yuan,Jiazi Hu
关键词-EN: higher education pedagogy, education pedagogy, critical component, component in higher, higher education
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Course evaluation is a critical component in higher education pedagogy. It not only serves to identify limitations in existing course designs and provide a basis for curricular innovation, but also to offer quantitative insights for university administrative decision-making. Traditional evaluation methods, primarily comprising student surveys, instructor self-assessments, and expert reviews, often encounter challenges, including inherent subjectivity, feedback delays, inefficiencies, and limitations in addressing innovative teaching approaches. Recent advancements in large language models (LLMs) within artificial intelligence (AI) present promising new avenues for enhancing course evaluation processes. This study explores the application of LLMs in automated course evaluation from multiple perspectives and conducts rigorous experiments across 100 courses at a major university in China. The findings indicate that: (1) LLMs can be an effective tool for course evaluation; (2) their effectiveness is contingent upon appropriate fine-tuning and prompt engineering; and (3) LLM-generated evaluation results demonstrate a notable level of rationality and interpretability.
摘要:课程评估是高等教育教学法中的关键组成部分。它不仅有助于识别现有课程设计中的局限性,并为课程创新提供依据,还能为大学行政决策提供量化见解。传统的评估方法,主要包括学生调查、教师自我评估和专家评审,常常面临挑战,如固有的主观性、反馈延迟、效率低下以及难以应对创新教学方法的局限性。近年来,人工智能(AI)领域内大语言模型(LLMs)的进步为提升课程评估过程提供了新的途径。本研究从多个角度探讨了LLMs在自动化课程评估中的应用,并在中国的某所重点大学对100门课程进行了严格的实验。研究结果表明:(1)LLMs可以作为有效的课程评估工具;(2)其有效性取决于适当的微调和提示工程;(3)LLM生成的评估结果显示出显著的合理性和可解释性。

[NLP-64] Graph-based Confidence Calibration for Large Language Models

【速读】: 该论文试图解决大型语言模型(LLMs)在回答问题时准确估计其答案正确性的问题。解决方案的关键在于结合LLM的自一致性(self-consistency)与标注数据,训练一个辅助模型来估计其回答的正确性。具体来说,该方法通过构建一个加权图来表示LLM对同一问题的多个回答之间的自一致性,并根据这些回答与正确答案的相似度分配正确性标签。随后,训练一个图神经网络(Graph Neural Network)来估计回答的正确概率。实验结果表明,该方法在多个广泛采用的基准数据集上显著优于最新的方法,并且在域外(out-of-domain, OOD)数据上的泛化能力也得到了显著提升。

链接: https://arxiv.org/abs/2411.02454
作者: Yukun Li,Sijia Wang,Lifu Huang,Li-Ping Liu
关键词-EN: large language models, provide accurate confidence, accurate confidence estimations, confidence estimation model, improving the reliability
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:One important approach to improving the reliability of large language models (LLMs) is to provide accurate confidence estimations regarding the correctness of their answers. However, developing a well-calibrated confidence estimation model is challenging, as mistakes made by LLMs can be difficult to detect. We propose a novel method combining the LLM’s self-consistency with labeled data and training an auxiliary model to estimate the correctness of its responses to questions. This auxiliary model predicts the correctness of responses based solely on their consistent information. To set up the learning problem, we use a weighted graph to represent the consistency among the LLM’s multiple responses to a question. Correctness labels are assigned to these responses based on their similarity to the correct answer. We then train a graph neural network to estimate the probability of correct responses. Experiments demonstrate that the proposed approach substantially outperforms several of the most recent methods in confidence calibration across multiple widely adopted benchmark datasets. Furthermore, the proposed approach significantly improves the generalization capability of confidence calibration on out-of-domain (OOD) data.
摘要:提高大语言模型 (LLM) 可靠性的一个重要方法是提供对其答案正确性的准确置信度估计。然而,开发一个校准良好的置信度估计模型是具有挑战性的,因为 LLM 所犯的错误可能难以检测。我们提出了一种新颖的方法,结合 LLM 的自一致性与标注数据,并训练一个辅助模型来估计其对问题的回答的正确性。该辅助模型仅基于回答的一致性信息来预测回答的正确性。为了设置学习问题,我们使用加权图来表示 LLM 对同一问题的多个回答之间的一致性。根据这些回答与正确答案的相似性,为其分配正确性标签。然后,我们训练一个图神经网络来估计正确回答的概率。实验表明,所提出的方法在多个广泛采用的基准数据集上的置信度校准方面显著优于几种最新的方法。此外,所提出的方法显著提高了置信度校准在域外 (OOD) 数据上的泛化能力。
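
下面给出自一致性图置信度思想的一个极简示意(并非论文实现):用词重叠相似度近似回答间的一致性并构建加权图,再以每个节点的平均边权代替论文中训练的图神经网络,粗略估计置信度。函数名与相似度度量均为示意性假设。

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    # 词重叠相似度,作为回答一致性的廉价近似(示意性选择)
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def consistency_confidence(responses: list[str]) -> list[float]:
    # 论文在加权一致性图上训练图神经网络来预测正确概率;
    # 此处仅用节点的平均边权作粗略代理,用来说明图的构建方式
    n = len(responses)
    if n < 2:
        return [1.0] * n
    weights = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        w = jaccard(responses[i], responses[j])
        weights[i][j] = weights[j][i] = w
    return [sum(row) / (n - 1) for row in weights]
```

与多数采样回答一致的回答得分更高,可作为正确性的粗略信号;实际校准效果依赖论文中的标注数据与 GNN 训练。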

[NLP-65] High-performance automated abstract screening with large language model ensembles

【速读】: 该论文试图解决系统性综述中摘要筛选这一劳动密集型任务的效率问题。解决方案的关键在于利用大型语言模型(LLMs)进行零样本二分类(zero-shot binary classification),以评估其在摘要筛选中的准确性。通过在Cochrane Library的完整期刊中进行试验,研究者发现LLMs在敏感性、精确性和平衡准确性方面均优于人类研究人员。最佳的LLM-提示组合在更大规模的试验中表现出一致的高敏感性,尽管精确性有所下降,但通过LLM-人类和LLM-LLM的集成方法,可以在保持完美敏感性的同时提高精确性。论文强调了领域特定验证的重要性,并指出LLMs可以在保持或提高准确性和敏感性的同时,显著减少系统性综述中的人力成本。

链接: https://arxiv.org/abs/2411.02451
作者: Rohan Sanghera,Arun James Thirunavukarasu,Marc El Khoury,Jessica O’Logbon,Yuqing Chen,Archie Watt,Mustafa Mahmood,Hamid Butt,George Nishimura,Andrew Soltan
关键词-EN: Large language models, tasks requiring processing, language models, excel in tasks, input text
类目: Computation and Language (cs.CL)
备注: RS and AJT are joint-first authors

点击查看摘要

Abstract:Large language models (LLMs) excel in tasks requiring processing and interpretation of input text. Abstract screening is a labour-intensive component of systematic review involving repetitive application of inclusion and exclusion criteria on a large volume of studies identified by a literature search. Here, LLMs (GPT-3.5 Turbo, GPT-4 Turbo, GPT-4o, Llama 3 70B, Gemini 1.5 Pro, and Claude Sonnet 3.5) were trialled on systematic reviews in a full issue of the Cochrane Library to evaluate their accuracy in zero-shot binary classification for abstract screening. Trials over a subset of 800 records identified optimal prompting strategies and demonstrated superior performance of LLMs to human researchers in terms of sensitivity (LLM_max = 1.000, human_max = 0.775), precision (LLM_max = 0.927, human_max = 0.911), and balanced accuracy (LLM_max = 0.904, human_max = 0.865). The best performing LLM-prompt combinations were trialled across every replicated search result (n = 119,691), and exhibited consistent sensitivity (range 0.756-1.000) but diminished precision (range 0.004-0.096). 66 LLM-human and LLM-LLM ensembles exhibited perfect sensitivity with a maximal precision of 0.458, with less observed performance drop in larger trials. Significant variation in performance was observed between reviews, highlighting the importance of domain-specific validation before deployment. LLMs may reduce the human labour cost of systematic review with maintained or improved accuracy and sensitivity. Systematic review is the foundation of evidence-based medicine, and LLMs can contribute to increasing the efficiency and quality of this mode of research.
摘要:大语言模型(LLMs)在需要处理和解释输入文本的任务中表现出色。摘要筛选是系统评价中劳动密集型的一部分,涉及在大量通过文献检索识别的研究中重复应用纳入和排除标准。在此,我们试验了多种大语言模型(包括 GPT-3.5 Turbo、GPT-4 Turbo、GPT-4o、Llama 3 70B、Gemini 1.5 Pro 和 Claude Sonnet 3.5)在 Cochrane Library 完整一期系统评价中的表现,以评估其在零样本二分类摘要筛选中的准确性。通过对 800 条记录的子集进行试验,我们确定了最佳的提示策略,并展示了 LLMs 在敏感性(LLM_max = 1.000,human_max = 0.775)、精确度(LLM_max = 0.927,human_max = 0.911)和平衡准确度(LLM_max = 0.904,human_max = 0.865)方面优于人类研究者的表现。最佳的 LLM-提示组合在所有复制的搜索结果(n = 119,691)中进行了试验,并表现出一致的敏感性(范围 0.756-1.000)但精确度有所下降(范围 0.004-0.096)。66 个 LLM-人类和 LLM-LLM 组合展示了完美的敏感性,最大精确度为 0.458,在大规模试验中观察到的性能下降较少。不同评价之间的性能存在显著差异,突显了在部署前进行领域特定验证的重要性。LLMs 可能在保持或提高准确性和敏感性的同时,降低系统评价的人力成本。系统评价是循证医学的基础,LLMs 可以提高这种研究模式的效率和质量。
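
文中"LLM-LLM / LLM-人类组合保持完美敏感性"的做法可以用一个并集式集成来示意:只要任一成员判定纳入即纳入,排除只在全体一致时发生,因此集成敏感性不低于任何单个成员,代价是精确度下降。以下为示意性草图,函数与数据均为假设:

```python
def or_ensemble(votes: dict[str, list[bool]]) -> list[bool]:
    # 并集集成:任一筛选者判定"纳入"(True)即纳入该记录
    members = list(votes.values())
    n_records = len(members[0])
    return [any(m[i] for m in members) for i in range(n_records)]

def sensitivity(pred: list[bool], truth: list[bool]) -> float:
    # 敏感性 = 真阳性 / (真阳性 + 假阴性)
    tp = sum(p and t for p, t in zip(pred, truth))
    fn = sum((not p) and t for p, t in zip(pred, truth))
    return tp / (tp + fn) if tp + fn else 1.0
```

这种组合规则只会提高敏感性,与摘要中报告的"完美敏感性、精确度受限"的权衡一致。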

[NLP-66] Rate, Explain and Cite (REC): Enhanced Explanation and Attribution in Automatic Evaluation by Large Language Models

【速读】: 该论文试图解决生成式大型语言模型(LLMs)在文本生成过程中存在的质量问题,特别是事实不准确和幻觉现象。解决方案的关键在于引入两个经过微调的通用LLM自动评估器,REC-12B和REC-70B,这些模型能够从忠实度(faithfulness)、指令遵循(instruction following)、连贯性(coherence)和完整性(completeness)等多个维度对生成的文本进行评估。这些模型不仅提供评分,还提供详细的解释和可验证的引用,从而增强内容的可信度。此外,模型支持多种引用模式,以适应不同的延迟和粒度需求。通过在多个基准上的广泛评估,REC-70B在内容评估方面表现出色,提供更高质量的解释和引用,且偏差最小,在RewardBench排行榜上以TextEval-Llama3.1-70B的名称位列第一。

链接: https://arxiv.org/abs/2411.02448
作者: Aliyah R. Hsu,James Zhu,Zhichao Wang,Bin Bi,Shubham Mehrotra,Shiva K. Pentyala,Katherine Tan,Xiang-Bo Mao,Roshanak Omrani,Sougata Chaudhuri,Regunathan Radhakrishnan,Sitaram Asur,Claire Na Cheng,Bin Yu
关键词-EN: demonstrated impressive proficiency, making them valuable, text-generation tasks, demonstrated impressive, impressive proficiency
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLMs have demonstrated impressive proficiency in generating coherent and high-quality text, making them valuable across a range of text-generation tasks. However, rigorous evaluation of this generated content is crucial, as ensuring its quality remains a significant challenge due to persistent issues such as factual inaccuracies and hallucinations. This paper introduces two fine-tuned general-purpose LLM autoevaluators, REC-12B and REC-70B, specifically designed to evaluate generated text across several dimensions: faithfulness, instruction following, coherence, and completeness. These models not only provide ratings for these metrics but also offer detailed explanations and verifiable citations, thereby enhancing trust in the content. Moreover, the models support various citation modes, accommodating different requirements for latency and granularity. Extensive evaluations on diverse benchmarks demonstrate that our general-purpose LLM auto-evaluator, REC-70B, outperforms state-of-the-art LLMs, excelling in content evaluation by delivering better quality explanations and citations with minimal bias. It achieves Rank #1 as a generative model on the RewardBench leaderboard under the model name TextEval-Llama3.1-70B. Our REC dataset and models are released at this https URL.
摘要:大语言模型 (LLM) 在生成连贯且高质量文本方面展示了令人印象深刻的熟练度,使其在各种文本生成任务中具有重要价值。然而,对生成内容的严格评估至关重要,因为确保其质量仍然是一个重大挑战,主要原因是存在事实不准确和幻觉等问题。本文介绍了两个经过微调的通用大语言模型自动评估器,REC-12B 和 REC-70B,这些模型专门设计用于评估生成文本的多个维度:忠实性、指令遵循性、连贯性和完整性。这些模型不仅为这些指标提供评分,还提供详细的解释和可验证的引用,从而增强对内容的信任。此外,这些模型支持多种引用模式,适应不同要求的延迟和粒度。在多样化的基准测试中进行的广泛评估表明,我们的通用大语言模型自动评估器 REC-70B 优于最先进的大语言模型,在内容评估方面表现出色,能够提供质量更高的解释和引用,且偏差最小。它在 RewardBench 排行榜上作为生成模型排名第一,模型名为 TextEval-Llama3.1-70B。我们的 REC 数据集和模型已在指定网址发布。

[NLP-67] TODO: Enhancing LLM Alignment with Ternary Preferences

【速读】: 该论文试图解决现有大型语言模型(LLMs)在人类意图对齐过程中,标准对齐技术如直接偏好优化(Direct Preference Optimization, DPO)依赖的二元Bradley-Terry (BT)模型难以捕捉复杂人类偏好,特别是在存在噪声或不一致标签以及频繁平局(ties)的情况下。解决方案的关键在于引入了一种扩展的BT模型,即平局导向的Bradley-Terry模型(Tie-rank Oriented Bradley-Terry model, TOBT),该模型明确纳入了平局情况,从而能够更细致地表示偏好。基于此,论文提出了平局导向的直接偏好优化(Tie-rank Oriented Direct Preference Optimization, TODO)算法,利用TOBT的三元排序系统来提升偏好对齐效果。实验结果表明,TODO在多个模型和数据集上均优于DPO,显示出其在偏好对齐方面的优越性和广泛适用性。

链接: https://arxiv.org/abs/2411.02442
作者: Yuxiang Guo,Lu Yin,Bo Jiang,Jiaqi Zhang
关键词-EN: Aligning large language, Direct Preference Optimization, Aligning large, Tie-rank Oriented Direct, large language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Aligning large language models (LLMs) with human intent is critical for enhancing their performance across a variety of tasks. Standard alignment techniques, such as Direct Preference Optimization (DPO), often rely on the binary Bradley-Terry (BT) model, which can struggle to capture the complexities of human preferences – particularly in the presence of noisy or inconsistent labels and frequent ties. To address these limitations, we introduce the Tie-rank Oriented Bradley-Terry model (TOBT), an extension of the BT model that explicitly incorporates ties, enabling more nuanced preference representation. Building on this, we propose Tie-rank Oriented Direct Preference Optimization (TODO), a novel alignment algorithm that leverages TOBT’s ternary ranking system to improve preference alignment. In evaluations on Mistral-7B and Llama 3-8B models, TODO consistently outperforms DPO in modeling preferences across both in-distribution and out-of-distribution datasets. Additional assessments using MT Bench and benchmarks such as Piqa, ARC-c, and MMLU further demonstrate TODO’s superior alignment performance. Notably, TODO also shows strong results in binary preference alignment, highlighting its versatility and potential for broader integration into LLM alignment. The implementation details can be found in this https URL.
摘要:将大语言模型 (LLM) 与人类意图对齐对于提升其在多种任务中的表现至关重要。标准的对齐技术,如直接偏好优化 (DPO),通常依赖于二元 Bradley-Terry (BT) 模型,该模型在处理噪声或不一致标签以及频繁的平局时,难以捕捉人类偏好的复杂性。为解决这些局限性,我们引入了平局导向的 Bradley-Terry 模型 (TOBT),这是 BT 模型的一个扩展,明确纳入了平局情况,从而能够更细致地表示偏好。在此基础上,我们提出了平局导向的直接偏好优化 (TODO),这是一种新颖的对齐算法,利用 TOBT 的三元排序系统来改进偏好对齐。在 Mistral-7B 和 Llama 3-8B 模型上的评估显示,TODO 在分布内和分布外数据集上的偏好建模中均持续优于 DPO。使用 MT Bench 以及 Piqa、ARC-c 和 MMLU 等基准进行的额外评估进一步证明了 TODO 在对齐性能上的优越性。值得注意的是,TODO 在二元偏好对齐中也显示出强劲的结果,突显了其多功能性和在更广泛的大语言模型对齐中整合的潜力。实现细节可在此 https URL 中找到。
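
摘要未给出 TOBT 的具体参数化;作为对照,下面给出经典的 Rao-Kupper 平局扩展(在 Bradley-Terry 模型中引入平局的一种标准方式)的示意实现,仅用于说明"三元排序"的概率形式,并非论文中的 TOBT:

```python
def tie_aware_bt(r_i: float, r_j: float, theta: float = 1.2):
    # Rao-Kupper 平局扩展:r_i、r_j 为正的强度参数,theta > 1 控制平局带宽
    # 返回 (P[i 胜], P[平局], P[j 胜]),三者之和为 1
    p_i = r_i / (r_i + theta * r_j)
    p_j = r_j / (r_j + theta * r_i)
    return p_i, 1.0 - p_i - p_j, p_j
```

当两个回答强度相近时,平局概率最大;theta 越大,模型越倾向于判定平局,这正是二元 BT 模型缺失的自由度。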

[NLP-68] Narrative Analysis of True Crime Podcasts With Knowledge Graph-Augmented Large Language Models

【速读】: 该论文试图解决大型语言模型(LLMs)在处理复杂叙事弧和包含冲突信息的叙事时遇到的困难。解决方案的关键在于将知识图谱(Knowledge Graphs, KGs)与LLMs结合,形成KG-augmented LLMs(KGLLMs)。通过这种增强,KGLLMs在理解真实犯罪播客数据时,不仅在准确性和可解释性上有所提升,还能更好地处理对抗性提示和冲突信息,同时在主题建模和文本摘要方面表现更优。

链接: https://arxiv.org/abs/2411.02435
作者: Xinyi Leng,Jason Liang,Jack Mauro,Xu Wang,Andrea L. Bertozzi,James Chapman,Junyuan Lin,Bohan Chen,Chenchen Ye,Temple Daniel,P. Jeffrey Brantingham
关键词-EN: Large Language Models, Large Language, natural language, reader or viewer, Narrative data spans
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 9 Pages, 3 Figures, GTA3 Workshop-2024, October 2024, 33rd International Conference on Information and Knowledge Management, Boise, Idaho, USA

点击查看摘要

Abstract:Narrative data spans all disciplines and provides a coherent model of the world to the reader or viewer. Recent advancements in machine learning and Large Language Models (LLMs) have enabled great strides in analyzing natural language. However, LLMs still struggle with complex narrative arcs as well as narratives containing conflicting information. Recent work indicates LLMs augmented with external knowledge bases can improve the accuracy and interpretability of the resulting models. In this work, we analyze the effectiveness of applying knowledge graphs (KGs) in understanding true-crime podcast data from both classical Natural Language Processing (NLP) and LLM approaches. We directly compare KG-augmented LLMs (KGLLMs) with classical methods for KG construction, topic modeling, and sentiment analysis. Additionally, the KGLLM allows us to query the knowledge base in natural language and test its ability to factually answer questions. We examine the robustness of the model to adversarial prompting in order to test the model's ability to deal with conflicting information. Finally, we apply classical methods to understand more subtle aspects of the text such as the use of hearsay and sentiment in narrative construction and propose future directions. Our results indicate that KGLLMs outperform LLMs on a variety of metrics, are more robust to adversarial prompts, and are more capable of summarizing the text into topics.
摘要:叙事数据跨越所有学科,并为读者或观众提供了一个连贯的世界模型。机器学习和大型语言模型(Large Language Models, LLMs)的最新进展在分析自然语言方面取得了巨大进步。然而,大语言模型在处理复杂的叙事弧线和包含冲突信息的叙事时仍面临挑战。最近的研究表明,通过增强外部知识库,LLMs可以提高结果模型的准确性和可解释性。在本研究中,我们分析了在理解和处理真实犯罪播客数据时,应用知识图谱(Knowledge Graphs, KGs)在经典自然语言处理(Natural Language Processing, NLP)和大语言模型方法中的有效性。我们直接比较了增强知识图谱的大语言模型(KGLLMs)与经典方法在知识图谱构建、主题建模和情感分析方面的表现。此外,KGLLM允许我们使用自然语言查询知识库,并测试其事实性回答问题的能力。我们考察了模型对对抗性提示的鲁棒性,以测试其处理冲突信息的能力。最后,我们应用经典方法来理解文本中更细微的方面,如叙事构建中传闻和情感的使用,并提出了未来的研究方向。我们的结果表明,KGLLMs在多种指标上优于LLMs,对对抗性提示更具鲁棒性,并且更擅长将文本总结为主题。

[NLP-69] SLED: Self Logits Evolution Decoding for Improving Factuality in Large Language Models

【速读】: 该论文试图解决大型语言模型(LLMs)输出中存在的不可靠或事实错误的问题。解决方案的关键是引入了一种名为自对数进化解码(Self Logits Evolution Decoding, SLED)的新型解码框架。SLED通过对比最终层与早期层的输出对数(logits),利用近似梯度方法激活模型内部的潜在知识,从而引导输出自我精炼,有效提升事实准确性。该方法无需依赖外部知识库或进一步微调,能够在多种任务和模型架构(如LLaMA 2、LLaMA 3、Gemma及混合专家模型MoE)上显著提高事实准确性(最高达20%),同时保持自然语言流畅性和极低的延迟开销。

链接: https://arxiv.org/abs/2411.02433
作者: Jianyi Zhang,Da-Cheng Juan,Cyrus Rashtchian,Chun-Sung Ferng,Heinrich Jiang,Yiran Chen
关键词-EN: demonstrated remarkable capabilities, Large language models, remarkable capabilities, factually incorrect, Large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities, but their outputs can sometimes be unreliable or factually incorrect. To address this, we introduce Self Logits Evolution Decoding (SLED), a novel decoding framework that enhances the truthfulness of LLMs without relying on external knowledge bases or requiring further fine-tuning. From an optimization perspective, our SLED framework leverages the latent knowledge embedded within the LLM by contrasting the output logits from the final layer with those from early layers. It then utilizes an approximate gradient approach to enable latent knowledge to guide the self-refinement of outputs, thereby effectively improving factual accuracy. Extensive experiments have been conducted on established benchmarks across a diverse range of model families (LLaMA 2, LLaMA 3, Gemma) and scales (from 2B to 70B), including more advanced architectural configurations such as the mixture of experts (MoE). Our evaluation spans a wide variety of tasks, including multi-choice, open-generation, and adaptations to chain-of-thought reasoning tasks. The results demonstrate that SLED consistently improves factual accuracy by up to 20% compared to existing decoding methods while maintaining natural language fluency and negligible latency overhead. Furthermore, it can be flexibly combined with other decoding methods to further enhance their performance.
摘要:大语言模型 (LLM) 展示了显著的能力,但其输出有时可能不可靠或事实不准确。为了解决这一问题,我们引入了自我 Logits 进化解码 (Self Logits Evolution Decoding, SLED),这是一种新颖的解码框架,能够在不依赖外部知识库或进一步微调的情况下增强 LLM 的真实性。从优化的角度来看,我们的 SLED 框架通过对比最终层与早期层的输出 Logits,利用了嵌入在 LLM 中的潜在知识。随后,它采用近似梯度方法,使潜在知识能够指导输出的自我精炼,从而有效提高事实准确性。我们在多个模型家族(如 LLaMA 2、LLaMA 3、Gemma)和规模(从 2B 到 70B)的成熟基准上进行了广泛的实验,包括更先进的架构配置,如专家混合 (Mixture of Experts, MoE)。我们的评估涵盖了多种任务,包括多选题、开放生成以及适应链式思维推理任务。结果表明,与现有的解码方法相比,SLED 在保持自然语言流畅性和极低的延迟开销的同时,事实准确性提高了高达 20%。此外,它还可以灵活地与其他解码方法结合,进一步增强其性能。
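
SLED 通过对比末层与早期层的 logits 引导输出自我精炼;其真实更新使用近似梯度,此处仅给出同一思路下更简单的"层间对比"草图(放大从早期层到末层概率上升的 Token),并非论文算法本身:

```python
import numpy as np

def logsumexp(x):
    # 数值稳定的 log-sum-exp,用于把 logits 归一化为对数概率
    x = np.asarray(x, dtype=float)
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

def layer_contrast_logits(final_logits, early_logits, alpha: float = 1.0):
    # 将末层与早期层对数概率之差作为修正项加回末层分布:
    # 概率从早期层到末层上升的 Token 被进一步放大(简化的层间对比)
    final_logp = np.asarray(final_logits, dtype=float) - logsumexp(final_logits)
    early_logp = np.asarray(early_logits, dtype=float) - logsumexp(early_logits)
    return final_logp + alpha * (final_logp - early_logp)
```

直觉上,跨层"越来越确定"的 Token 更可能对应模型内部的事实知识;论文用近似梯度做的是比这更精细的迭代式 logits 进化。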

[NLP-70] Can LLMs make trade-offs involving stipulated pain and pleasure states?

【速读】: 该论文试图解决的问题是:大型语言模型(LLMs)是否能够在选择场景中再现快乐和痛苦的动力作用,这一问题与关于LLM是否具有情感体验状态(sentience)的争论相关。解决方案的关键在于设计了一个简单的游戏,通过调整痛苦惩罚和快乐奖励的强度,观察不同LLM在达到特定强度阈值后是否从最大化点数转向最小化痛苦或最大化快乐。研究发现,Claude 3.5 Sonnet、Command R+、GPT-4o和GPT-4o mini在某些情况下表现出从点数最大化转向痛苦最小化或快乐最大化的行为,而LLaMa 3.1-405b对设定的快乐奖励和痛苦惩罚表现出一定程度的敏感性。Gemini 1.5 Pro和PaLM 2则始终优先避免痛苦,而无论强度如何都倾向于优先选择点数。这些发现对探讨LLM是否具有情感体验状态的可能性具有重要意义。

链接: https://arxiv.org/abs/2411.02432
作者: Geoff Keeling,Winnie Street,Martyna Stachaczyk,Daria Zakharova,Iulia M. Comsa,Anastasiya Sakovych,Isabella Logothesis,Zejia Zhang,Blaise Agüera y Arcas,Jonathan Birch
关键词-EN: human decision making, Large Language Models, resolving motivational conflicts, Pleasure, play an important
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Pleasure and pain play an important role in human decision making by providing a common currency for resolving motivational conflicts. While Large Language Models (LLMs) can generate detailed descriptions of pleasure and pain experiences, it is an open question whether LLMs can recreate the motivational force of pleasure and pain in choice scenarios - a question which may bear on debates about LLM sentience, understood as the capacity for valenced experiential states. We probed this question using a simple game in which the stated goal is to maximise points, but where either the points-maximising option is said to incur a pain penalty or a non-points-maximising option is said to incur a pleasure reward, providing incentives to deviate from points-maximising behaviour. Varying the intensity of the pain penalties and pleasure rewards, we found that Claude 3.5 Sonnet, Command R+, GPT-4o, and GPT-4o mini each demonstrated at least one trade-off in which the majority of responses switched from points-maximisation to pain-minimisation or pleasure-maximisation after a critical threshold of stipulated pain or pleasure intensity is reached. LLaMa 3.1-405b demonstrated some graded sensitivity to stipulated pleasure rewards and pain penalties. Gemini 1.5 Pro and PaLM 2 prioritised pain-avoidance over points-maximisation regardless of intensity, while tending to prioritise points over pleasure regardless of intensity. We discuss the implications of these findings for debates about the possibility of LLM sentience.
摘要:愉悦与痛苦在人类决策过程中扮演着重要角色,它们通过提供一种共同的“货币”来解决动机冲突。尽管大语言模型 (LLM) 能够生成关于愉悦和痛苦体验的详细描述,但一个尚未解决的问题是,LLM 是否能在选择场景中重现愉悦和痛苦的动机力量——这一问题可能与关于 LLM 是否具备感知能力 (sentience) 的争论相关,这里的感知能力指的是能够体验到带有情感色彩的状态。我们通过一个简单的游戏来探究这一问题,游戏的目标是最大化得分,但在某些情况下,最大化得分的选项会带来痛苦惩罚,或者非最大化得分的选项会带来愉悦奖励,从而激励参与者偏离最大化得分的行为。通过改变痛苦惩罚和愉悦奖励的强度,我们发现 Claude 3.5 Sonnet、Command R+、GPT-4o 和 GPT-4o mini 在至少一种权衡中,当规定的痛苦或愉悦强度达到临界值后,大多数响应从得分最大化转向痛苦最小化或愉悦最大化。LLaMa 3.1-405b 对规定的愉悦奖励和痛苦惩罚表现出一定的分级敏感性。Gemini 1.5 Pro 和 PaLM 2 无论强度如何,都优先避免痛苦而非最大化得分,同时倾向于优先考虑得分而非愉悦,无论强度如何。我们讨论了这些发现对关于 LLM 感知能力可能性的争论的影响。

[NLP-71] Generative Emotion Cause Explanation in Multimodal Conversations

【速读】: 该论文试图解决多模态对话中情感原因的详细解释问题。现有研究通常仅通过子句选择方法定位情感原因,而未提供情感原因的详细解释。为此,论文提出了一个新的任务——多模态对话情感原因解释 (Multimodal Conversation Emotion Cause Explanation, MCECE),旨在为目标话语生成详细的情感原因解释。解决方案的关键在于开发了一个新的数据集 (ECEM),该数据集结合了视频片段和角色情感的详细解释,并提出了一种名为 FAME-Net 的新方法,利用大型语言模型 (Large Language Models, LLMs) 分析视觉数据,准确解读视频中通过面部表情传达的情感,从而有效捕捉对话中个体的情感原因。实验结果表明,FAME-Net 在新的数据集上显著优于多个优秀的大型语言模型基线。

链接: https://arxiv.org/abs/2411.02430
作者: Lin Wang,Xiaocui Yang,Shi Feng,Daling Wang,Yifei Zhang
关键词-EN: rich emotional content, carries rich emotional, human communication, carries rich, making the exploration
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multimodal conversation, a crucial form of human communication, carries rich emotional content, making the exploration of the causes of emotions within it a research endeavor of significant importance. However, existing research on the causes of emotions typically uses clause selection methods to locate the reason utterance, without providing a detailed explanation of the emotional causes. In this paper, we propose a new task, Multimodal Conversation Emotion Cause Explanation (MCECE), aiming to generate a detailed explanation of the emotional cause to the target utterance within a multimodal conversation scenario. Building upon the MELD dataset, we develop a new dataset (ECEM) that integrates video clips with detailed explanations of character emotions, facilitating an in-depth examination of the causal factors behind emotional expressions in multimodal conversations. A novel approach, FAME-Net, is further proposed that harnesses the power of Large Language Models (LLMs) to analyze visual data and accurately interpret the emotions conveyed through facial expressions in videos. By exploiting the contagion effect of facial emotions, FAME-Net effectively captures the emotional causes of individuals engaged in conversations. Our experimental results on the newly constructed dataset show that FAME-Net significantly outperforms several excellent large language model baselines. Code and dataset are available at this https URL.
摘要:多模态对话,作为人类交流的重要形式,蕴含丰富的情感内容,使得探究其中的情感原因成为一项具有重要意义的研究工作。然而,现有关于情感原因的研究通常采用子句选择方法来定位原因话语,而未提供情感原因的详细解释。本文提出了一项新的任务,即多模态对话情感原因解释(Multimodal Conversation Emotion Cause Explanation, MCECE),旨在生成多模态对话场景中目标话语的情感原因的详细解释。基于MELD数据集,我们开发了一个新的数据集(ECEM),该数据集整合了视频片段与角色情感的详细解释,便于深入探讨多模态对话中情感表达的因果因素。进一步地,我们提出了一种新颖的方法——FAME-Net,该方法利用大语言模型(Large Language Models, LLMs)分析视觉数据,并准确解读视频中通过面部表情传达的情感。通过利用面部情感的传染效应,FAME-Net有效地捕捉了对话中个体的情感原因。我们在新构建的数据集上的实验结果表明,FAME-Net显著优于多个优秀的大语言模型基线。代码和数据集可通过此 https URL 获取。

[NLP-72] IdeaBench: Benchmarking Large Language Models for Research Idea Generation

【速读】: 该论文试图解决生成式大型语言模型(LLMs)在科学发现和假设生成任务中缺乏系统性评估框架的问题。解决方案的关键在于提出了IdeaBench,这是一个包含全面数据集和评估框架的基准系统,用于标准化评估LLMs生成研究想法的能力。IdeaBench通过模拟人类研究者的过程,将LLMs定位为特定领域的研究者,并基于人类研究者考虑的相同背景,最大化利用LLMs的参数化知识来动态生成新的研究想法。评估框架包括两个阶段:首先使用GPT-4o根据用户指定的质量指标(如新颖性和可行性)对生成的想法进行排序,实现可扩展的个性化评估;其次通过计算“洞察分数”(Insight Score)来量化所选质量指标,从而评估生成研究想法的质量。

链接: https://arxiv.org/abs/2411.02429
作者: Sikun Guo,Amir Hassan Shariatmadari,Guangzhi Xiong,Albert Huang,Eric Xie,Stefan Bekiranov,Aidong Zhang
关键词-EN: Large Language Models, Large Language, Language Models, including scientific discovery, scientific discovery
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have transformed how people interact with artificial intelligence (AI) systems, achieving state-of-the-art results in various tasks, including scientific discovery and hypothesis generation. However, the lack of a comprehensive and systematic evaluation framework for generating research ideas using LLMs poses a significant obstacle to understanding and assessing their generative capabilities in scientific discovery. To address this gap, we propose IdeaBench, a benchmark system that includes a comprehensive dataset and an evaluation framework for standardizing the assessment of research idea generation using LLMs. Our dataset comprises titles and abstracts from a diverse range of influential papers, along with their referenced works. To emulate the human process of generating research ideas, we profile LLMs as domain-specific researchers and ground them in the same context considered by human researchers. This maximizes the utilization of the LLMs' parametric knowledge to dynamically generate new research ideas. We also introduce an evaluation framework for assessing the quality of generated research ideas. Our evaluation framework is a two-stage process: first, using GPT-4o to rank ideas based on user-specified quality indicators such as novelty and feasibility, enabling scalable personalization; and second, calculating a relative-ranking-based "Insight Score" to quantify the chosen quality indicator. The proposed benchmark system will be a valuable asset for the community to measure and compare different LLMs, ultimately advancing the automation of the scientific discovery process.
摘要:大语言模型 (LLMs) 已经彻底改变了人们与人工智能 (AI) 系统的互动方式,在包括科学发现和假设生成在内的多种任务中取得了最先进的结果。然而,缺乏一个全面且系统的评估框架来利用 LLMs 生成研究想法,这成为理解和评估其在科学发现中生成能力的一大障碍。为了填补这一空白,我们提出了 IdeaBench,这是一个包含全面数据集和评估框架的基准系统,用于标准化评估 LLMs 生成研究想法的过程。我们的数据集包括来自不同领域有影响力的论文的标题和摘要,以及它们引用的文献。为了模拟人类生成研究想法的过程,我们将 LLMs 定位为特定领域的研究人员,并使它们处于与人类研究人员相同的情境中。这最大化地利用了 LLMs 的参数知识,以动态生成新的研究想法。我们还引入了一个评估框架,用于评估生成研究想法的质量。我们的评估框架是一个两阶段过程:首先,使用 GPT-4o 根据用户指定的质量指标(如新颖性和可行性)对想法进行排序,实现可扩展的个性化;其次,基于“洞察力分数”计算相对排名,以量化所选质量指标。所提出的基准系统将成为社区衡量和比较不同 LLMs 的宝贵资源,最终推动科学发现过程的自动化。
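
摘要未给出"洞察分数 (Insight Score)"的具体公式;下面是一个纯属假设的示意:将模型生成的想法在候选池中的名次归一化到 [0, 1](名次 1 最佳)后取平均,仅用于说明"基于相对排名"的量化思路,并非论文定义:

```python
def insight_score(ranks: list[int], pool_size: int) -> float:
    # 假设性公式:rank=1(最佳)记 1 分,rank=pool_size(最差)记 0 分,取平均
    return sum((pool_size - r) / (pool_size - 1) for r in ranks) / len(ranks)
```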

[NLP-73] AI on My Shoulder: Supporting Emotional Labor in Front-Office Roles with an LLM-based Empathetic Coworker

【速读】: 该论文试图解决客户服务代表(Client-Service Representatives, CSRs)在与不礼貌客户频繁互动中面临的心理健康问题。解决方案的关键是设计并评估了一个名为Pro-Pilot的基于大型语言模型(LLM)的助手,旨在帮助CSRs在与不礼貌客户互动时调节情绪。通过对比分析665条由人类和Pro-Pilot生成的支持信息,证明了Pro-Pilot在应对各种不礼貌事件中的适应性和同理心表现。此外,143名CSRs评估认为Pro-Pilot的同理心比人类信息更真诚和可操作。尽管存在部署挑战和共享经验的不可替代性,Pro-Pilot在帮助CSRs避免负面思维、重新集中注意力以及人性化客户方面显示出潜力,强调了同理心作为前台角色中AI助手的关键功能。

链接: https://arxiv.org/abs/2411.02408
作者: Vedant Das Swain,Qiuyue “Joy” Zhong,Jash Rajesh Parekh,Yechan Jeon,Roy Zimmerman,Mary Czerwinski,Jina Suh,Varun Mishra,Koustuv Saha,Javier Hernandez
关键词-EN: Client-Service Representatives, vital to organizations, Representatives, Pro-Pilot, clients
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Client-Service Representatives (CSRs) are vital to organizations. Frequent interactions with disgruntled clients, however, disrupt their mental well-being. To help CSRs regulate their emotions while interacting with uncivil clients, we designed Pro-Pilot, an LLM-powered assistant, and evaluated its efficacy, perception, and use. Our comparative analyses between 665 human and Pro-Pilot-generated support messages demonstrate Pro-Pilot’s ability to adapt to and demonstrate empathy in various incivility incidents. Additionally, 143 CSRs assessed Pro-Pilot’s empathy as more sincere and actionable than human messages. Finally, we interviewed 20 CSRs who interacted with Pro-Pilot in a simulation exercise. They reported that Pro-Pilot helped them avoid negative thinking, recenter thoughts, and humanize clients; showing potential for bridging gaps in coworker support. Yet, they also noted deployment challenges and emphasized the irreplaceability of shared experiences. We discuss future designs and societal implications of AI-mediated emotional labor, underscoring empathy as a critical function for AI assistants in front-office roles.
摘要:客户服务代表(Client-Service Representatives, CSRs)对组织至关重要。然而,频繁与不满的客户互动会扰乱他们的心理健康。为了帮助 CSRs 在与不礼貌客户互动时调节情绪,我们设计了 Pro-Pilot,一个由大语言模型(LLM)驱动的助手,并评估了其有效性、感知和使用情况。我们对 665 条人类和 Pro-Pilot 生成的支持信息进行的比较分析表明,Pro-Pilot 能够适应并表现出对各种不礼貌事件的同理心。此外,143 名 CSRs 评估 Pro-Pilot 的同理心比人类信息更真诚和可操作。最后,我们采访了 20 名在模拟练习中与 Pro-Pilot 互动的 CSRs。他们报告说,Pro-Pilot 帮助他们避免负面思维,重新集中注意力,并将客户人性化;显示出在弥补同事支持差距方面的潜力。然而,他们也指出了部署挑战,并强调了共享经验的不可替代性。我们讨论了 AI 介导的情感劳动的未来设计与社会影响,强调了同理心作为前台角色中 AI 助手的关键功能。

[NLP-74] Enhancing Retrieval Performance: An Ensemble Approach For Hard Negative Mining

【速读】: 该论文试图解决在信息检索中,如何从大规模文档库中有效选择负样本(negative pairs)以训练跨编码器(cross-encoder)模型的问题。解决方案的关键在于提出了一种高效的硬负样本挖掘技术(hard negative mining technique),该技术能够在企业数据集上进行跨编码器重排序模型的训练,特别是在具有特定领域上下文的情况下。通过同时学习相似性和非相似性,该方法显著提升了检索系统的性能,并对生成式 AI 系统(如 Retrieval-Augmented Generation (RAG) 和 Reasoning and Action Agents (ReAct))的性能产生了积极影响。

链接: https://arxiv.org/abs/2411.02404
作者: Hansa Meghwani
关键词-EN: Ranking consistently emerges, information retrieval research, Ranking consistently, consistently emerges, primary focus
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Master’s thesis

点击查看摘要

Abstract:Ranking consistently emerges as a primary focus in information retrieval research. Retrieval and ranking models serve as the foundation for numerous applications, including web search, open domain QA, enterprise domain QA, and text-based recommender systems. Typically, these models undergo training on triplets consisting of binary relevance assignments, comprising one positive and one negative passage. However, their utilization involves a context where a significantly more nuanced understanding of relevance is necessary, especially when re-ranking a large pool of potentially relevant passages. Although collecting positive examples through user feedback like impressions or clicks is straightforward, identifying suitable negative pairs from a vast pool of possibly millions or even billions of documents poses a greater challenge. Generating a substantial number of negative pairs is often necessary to maintain the high quality of the model. Several approaches have been suggested in the literature to tackle the issue of selecting suitable negative pairs from an extensive corpus. This study focuses on explaining the crucial role of hard negatives in the training process of cross-encoder models, specifically aiming to explain the performance gains observed with hard negative sampling compared to random sampling. We have developed a robust hard negative mining technique for efficient training of cross-encoder re-rank models on an enterprise dataset which has domain-specific context. We provide a novel perspective to enhance retrieval models, ultimately influencing the performance of advanced LLM systems like Retrieval-Augmented Generation (RAG) and Reasoning and Action Agents (ReAct). The proposed approach demonstrates that learning both similarity and dissimilarity simultaneously with cross-encoders improves the performance of retrieval systems.
摘要:排序一直是信息检索研究中的主要关注点。检索和排序模型是众多应用的基础,包括网页搜索、开放领域问答、企业领域问答以及基于文本的推荐系统。通常,这些模型在由二元相关性分配组成的三元组上进行训练,包括一个正样本和一个负样本。然而,它们的应用场景需要对相关性有更为细致的理解,尤其是在对大量潜在相关段落进行重新排序时。尽管通过用户反馈(如印象或点击)收集正样本是直接的,但从可能包含数百万甚至数十亿文档的庞大池中识别合适的负样本对则更具挑战性。为了保持模型的高质量,通常需要生成大量的负样本对。文献中已经提出了几种方法来解决从广泛语料库中选择合适负样本对的问题。本研究重点解释了在交叉编码器模型的训练过程中,硬负样本的关键作用,特别是旨在解释与随机采样相比,硬负样本采样带来的性能提升。我们开发了一种稳健的硬负样本挖掘技术,用于在具有特定领域背景的企业数据集上高效训练交叉编码器重新排序模型。我们提供了一种新颖的视角来增强检索模型,最终影响高级大语言模型系统(如检索增强生成 (RAG) 和推理与行动智能体 (ReAct))的性能。所提出的方法表明,通过交叉编码器同时学习相似性和非相似性,可以提高检索系统的性能。
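
硬负样本挖掘的通用形式可以用双编码器检索器示意:对查询打分排序后,把得分最高、但未被标注为正例的文档作为训练负样本,它们正是检索器最容易与正例混淆的文档。以下仅为一般性草图(点积打分为示意选择),不代表论文中的集成方法:

```python
import numpy as np

def mine_hard_negatives(query_vec, doc_vecs, positive_ids, k: int = 4):
    # 点积打分 -> 按得分降序 -> 跳过正例 -> 取前 k 个作为硬负样本
    scores = np.asarray(doc_vecs) @ np.asarray(query_vec)
    ranked = np.argsort(-scores)
    hard = [int(i) for i in ranked if int(i) not in positive_ids]
    return hard[:k]
```

挖出的硬负样本随后与正例配成三元组,供交叉编码器重排序模型训练,使其同时学习相似与不相似。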

[NLP-75] Decomposition Dilemmas: Does Claim Decomposition Boost or Burden Fact-Checking Performance?

【TL;DR】: This paper tackles the inconsistent impact of the Decompose-Then-Verify paradigm on end-to-end fact-checking performance. The key to its solution is an in-depth analysis of decomposition error types and their effect on downstream verification performance, which reveals a trade-off between the accuracy gains from decomposition and the noise it introduces. Through error-case inspection and experiments, the study proposes a categorization of decomposition errors, offering a new perspective on the instability of current systems and guidance for future work on improving claim decomposition in fact-checking pipelines.

Link: https://arxiv.org/abs/2411.02400
Authors: Qisheng Hu,Quanyu Long,Wenya Wang
Keywords-EN: pipelines increasingly adopt, Fact-checking pipelines increasingly, veracity decision, increasingly adopt, texts are broken
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 23 pages, 3 figures

Abstract:Fact-checking pipelines increasingly adopt the Decompose-Then-Verify paradigm, where texts are broken down into smaller claims for individual verification and subsequently combined for a veracity decision. While decomposition is widely adopted in such pipelines, its effects on final fact-checking performance remain underexplored. Some studies have reported improvements from decomposition, while others have observed performance declines, indicating its inconsistent impact. To date, no comprehensive analysis has been conducted to understand this variability. To address this gap, we present an in-depth analysis that explicitly examines the impact of decomposition on downstream verification performance. Through error case inspection and experiments, we introduce a categorization of decomposition errors and reveal a trade-off between accuracy gains and the noise introduced through decomposition. Our analysis provides new insights into understanding current systems’ instability and offers guidance for future studies toward improving claim decomposition in fact-checking pipelines.

Artificial Intelligence

[AI-0] Inference Optimal VLMs Need Only One Visual Token but Larger Models

Link: https://arxiv.org/abs/2411.03312
Authors: Kevin Y. Li,Sachin Goyal,Joao D. Semedo,J. Zico Kolter
Keywords-EN: Vision Language Models, Vision Language, demonstrated strong capabilities, Language Models, demonstrated strong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks. However, their real-world deployment is often constrained by high latency during inference due to substantial compute required to process the large number of input tokens (predominantly from the image) by the LLM. To reduce inference costs, one can either downsize the LLM or reduce the number of input image-tokens, the latter of which has been the focus of many recent works around token compression. However, it is unclear what the optimal trade-off is, as both the factors directly affect the VLM performance. We first characterize this optimal trade-off between the number of visual tokens and LLM parameters by establishing scaling laws that capture variations in performance with these two factors. Our results reveal a surprising trend: for visual reasoning tasks, the inference-optimal behavior in VLMs, i.e., minimum downstream error at any given fixed inference compute, is achieved when using the largest LLM that fits within the inference budget while minimizing visual token count - often to a single token. While the token reduction literature has mainly focused on maintaining base model performance by modestly reducing the token count (e.g., 5-10×), our results indicate that the compute-optimal inference regime requires operating under even higher token compression ratios. Based on these insights, we take some initial steps towards building approaches tailored for high token compression settings. Code is available at this https URL.
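The trade-off the abstract characterizes can be made concrete with a toy grid search. The scaling-law form and every constant below are invented for illustration (the paper fits its own laws to real measurements); the point is only that when error falls faster in LLM parameters than in visual token count, the budget-constrained minimum lands at the largest feasible LLM with very few visual tokens.

```python
def downstream_error(n_params, n_tokens, a=100.0, b=0.05, alpha=0.35, beta=0.1):
    # Hypothetical scaling law: downstream error decays polynomially in
    # both LLM parameter count and visual token count (constants invented).
    return a * n_params ** -alpha + b * n_tokens ** -beta

def inference_optimal(budget, param_grid, token_grid):
    """Grid-search the (LLM size, visual token count) pair with minimum
    error, subject to a FLOPs-style budget proportional to params * tokens."""
    feasible = [(downstream_error(n, t), n, t)
                for n in param_grid for t in token_grid if n * t <= budget]
    return min(feasible)

params = [1e9, 3e9, 7e9, 13e9, 34e9]  # candidate LLM sizes
tokens = [1, 4, 16, 64, 256, 576]     # candidate visual token counts
err, best_n, best_t = inference_optimal(budget=3.4e10,
                                        param_grid=params, token_grid=tokens)
print(best_n, best_t)  # the largest feasible LLM, paired with a single visual token
```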

[AI-1] Out-of-Distribution Recovery with Object-Centric Keypoint Inverse Policy For Visuomotor Imitation Learning

Link: https://arxiv.org/abs/2411.03294
Authors: George Jiayuan Gao,Tianyu Li,Nadia Figueroa
Keywords-EN: address the challenges, visuomotor policy learning, policy, recovery policy, policy learning
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:We propose an object-centric recovery policy framework to address the challenges of out-of-distribution (OOD) scenarios in visuomotor policy learning. Previous behavior cloning (BC) methods rely heavily on a large amount of labeled data coverage, failing in unfamiliar spatial states. Without relying on extra data collection, our approach learns a recovery policy constructed by an inverse policy inferred from object keypoint manifold gradient in the original training data. The recovery policy serves as a simple add-on to any base visuomotor BC policy, agnostic to a specific method, guiding the system back towards the training distribution to ensure task success even in OOD situations. We demonstrate the effectiveness of our object-centric framework in both simulation and real robot experiments, achieving an improvement of 77.7% over the base policy in OOD. Project Website: this https URL

[AI-2] Interaction2Code: How Far Are We From Automatic Interactive Webpage Generation?

Link: https://arxiv.org/abs/2411.03292
Authors: Jingyu Xiao,Yuxuan Wan,Yintong Huo,Zhiyao Xu,Michael R.Lyu
Keywords-EN: Converting webpage design, Converting webpage, labor-intensive and time-consuming, design into functional, step for building
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Converting webpage design into functional UI code is a critical step for building websites, which can be labor-intensive and time-consuming. To automate this design-to-code transformation process, various automated methods using learning-based networks and multi-modal large language models (MLLMs) have been proposed. However, these studies were merely evaluated on a narrow range of static web pages and ignored dynamic interaction elements, making them less practical for real-world website deployment. To fill in the blank, we present the first systematic investigation of MLLMs in generating interactive webpages. Specifically, we first formulate the Interaction-to-Code task and build the Interaction2Code benchmark that contains 97 unique web pages and 213 distinct interactions, spanning 15 webpage types and 30 interaction categories. We then conduct comprehensive experiments on three state-of-the-art (SOTA) MLLMs using both automatic metrics and human evaluations, thereby summarizing six findings accordingly. Our experimental results highlight the limitations of MLLMs in generating fine-grained interactive features and managing interactions with complex transformations and subtle visual modifications. We further analyze failure cases and their underlying causes, identifying 10 common failure types and assessing their severity. Additionally, our findings reveal three critical influencing factors, i.e., prompts, visual saliency, and textual descriptions, that can enhance the interaction generation performance of MLLMs. Based on these findings, we elicit implications for researchers and developers, providing a foundation for future advancements in this field. Datasets and source code are available at this https URL. 

[AI-3] The Future of Intelligent Healthcare: A Systematic Analysis and Discussion on the Integration and Impact of Robots Using Large Language Models for Healthcare

Link: https://arxiv.org/abs/2411.03287
Authors: Souren Pashangpour,Goldie Nejat
Keywords-EN: large language models, significant demand put, language models, large language, address the significant
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Systems and Control (eess.SY)
Comments:

Abstract:The potential use of large language models (LLMs) in healthcare robotics can help address the significant demand put on healthcare systems around the world with respect to an aging demographic and a shortage of healthcare professionals. Even though LLMs have already been integrated into medicine to assist both clinicians and patients, the integration of LLMs within healthcare robots has not yet been explored for clinical settings. In this perspective paper, we investigate the groundbreaking developments in robotics and LLMs to uniquely identify the needed system requirements for designing health-specific LLM-based robots in terms of multimodal communication through human-robot interactions (HRIs), semantic reasoning, and task planning. Furthermore, we discuss the ethical issues, open challenges, and potential future research directions for this emerging innovative field.

[AI-4] Causal Responsibility Attribution for Human-AI Collaboration

Link: https://arxiv.org/abs/2411.03275
Authors: Yahang Qi,Bernhard Schölkopf,Zhijing Jin
Keywords-EN: Artificial Intelligence, increasingly influence decision-making, systems increasingly influence, increasingly influence, influence decision-making
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Applications (stat.AP)
Comments:

Abstract:As Artificial Intelligence (AI) systems increasingly influence decision-making across various fields, the need to attribute responsibility for undesirable outcomes has become essential, though complicated by the complex interplay between humans and AI. Existing attribution methods based on actual causality and Shapley values tend to disproportionately blame agents who contribute more to an outcome and rely on real-world measures of blameworthiness that may misalign with responsible AI standards. This paper presents a causal framework using Structural Causal Models (SCMs) to systematically attribute responsibility in human-AI systems, measuring overall blameworthiness while employing counterfactual reasoning to account for agents’ expected epistemic levels. Two case studies illustrate the framework’s adaptability in diverse human-AI collaboration scenarios.

[AI-5] Discovering Data Structures: Nearest Neighbor Search and Beyond

Link: https://arxiv.org/abs/2411.03253
Authors: Omar Salemohamed,Laurent Charlin,Shivam Garg,Vatsal Sharan,Gregory Valiant
Keywords-EN: propose a general, data, data structures, framework, structures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
Comments:

Abstract:We propose a general framework for end-to-end learning of data structures. Our framework adapts to the underlying data distribution and provides fine-grained control over query and space complexity. Crucially, the data structure is learned from scratch, and does not require careful initialization or seeding with candidate data structures/algorithms. We first apply this framework to the problem of nearest neighbor search. In several settings, we are able to reverse-engineer the learned data structures and query algorithms. For 1D nearest neighbor search, the model discovers optimal distribution (in)dependent algorithms such as binary search and variants of interpolation search. In higher dimensions, the model learns solutions that resemble k-d trees in some regimes, while in others, they have elements of locality-sensitive hashing. The model can also learn useful representations of high-dimensional data and exploit them to design effective data structures. We also adapt our framework to the problem of estimating frequencies over a data stream, and believe it could also be a powerful discovery tool for new problems.
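For intuition about the 1D algorithms the model rediscovers, the two classical strategies can be written side by side. This is textbook code, not the learned data structures themselves: on uniformly distributed keys, interpolation search probes where the key "should" sit and typically needs far fewer comparisons than binary search.

```python
def binary_search(arr, key):
    lo, hi, probes = 0, len(arr) - 1, 0
    while lo <= hi:
        probes += 1
        mid = (lo + hi) // 2
        if arr[mid] == key:
            return mid, probes
        if arr[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, probes

def interpolation_search(arr, key):
    """Probe where the key 'should' be under a uniform-distribution assumption."""
    lo, hi, probes = 0, len(arr) - 1, 0
    while lo <= hi and arr[lo] <= key <= arr[hi]:
        probes += 1
        if arr[hi] == arr[lo]:
            pos = lo
        else:
            pos = lo + (key - arr[lo]) * (hi - lo) // (arr[hi] - arr[lo])
        if arr[pos] == key:
            return pos, probes
        if arr[pos] < key:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1, probes

data = list(range(0, 1_000_000, 7))  # perfectly uniform keys
idx_b, probes_b = binary_search(data, 700_007)
idx_i, probes_i = interpolation_search(data, 700_007)
print(probes_b, probes_i)  # interpolation needs far fewer probes here
```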

[AI-6] Spontaneous Emergence of Agent Individuality through Social Interactions in LLM-Based Communities

Link: https://arxiv.org/abs/2411.03252
Authors: Ryosuke Takata,Atsushi Masumori,Takashi Ikegami
Keywords-EN: Large Language Model, Language Model, Large Language, study the emergence, emergence of agency
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:We study the emergence of agency from scratch by using Large Language Model (LLM)-based agents. In previous studies of LLM-based agents, each agent’s characteristics, including personality and memory, have traditionally been predefined. We focused on how individuality, such as behavior, personality, and memory, can be differentiated from an undifferentiated state. The present LLM agents engage in cooperative communication within a group simulation, exchanging context-based messages in natural language. By analyzing this multi-agent simulation, we report valuable new insights into how social norms, cooperation, and personality traits can emerge spontaneously. This paper demonstrates that autonomously interacting LLM-powered agents generate hallucinations and hashtags to sustain communication, which, in turn, increases the diversity of words within their interactions. Each agent’s emotions shift through communication, and as they form communities, the personalities of the agents emerge and evolve accordingly. This computational modeling approach and its findings will provide a new method for analyzing collective artificial intelligence.

[AI-7] On the Detection of Non-Cooperative RISs: Scan B-Testing via Deep Support Vector Data Description

Link: https://arxiv.org/abs/2411.03237
Authors: George Stamatelis,Panagiotis Gavriilidis,Aymen Fakhreddine,George C. Alexandropoulos
Keywords-EN: Reconfigurable Intelligent Surfaces, Orthogonal Frequency-Division Multiplexing, unknown characteristics lying, Reconfigurable Intelligent, MIMO OFDM system
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 6 pages, 4 figures, submitted to an IEEE conference

Abstract:In this paper, we study the problem of promptly detecting the presence of non-cooperative activity from one or more Reconfigurable Intelligent Surfaces (RISs) with unknown characteristics lying in the vicinity of a Multiple-Input Multiple-Output (MIMO) communication system using Orthogonal Frequency-Division Multiplexing (OFDM) transmissions. We first present a novel wideband channel model incorporating RISs as well as non-reconfigurable stationary surfaces, which captures both the effect of the RIS actuation time on the channel in the frequency domain as well as the difference between changing phase configurations during or among transmissions. Considering that RISs may operate under the coordination of a third-party system, and thus, may negatively impact the communication of the intended MIMO OFDM system, we present a novel RIS activity detection framework that is unaware of the distribution of the phase configuration of any of the non-cooperative RISs. In particular, capitalizing on the knowledge of the data distribution at the multi-antenna receiver, we design a novel online change point detection statistic that combines a deep support vector data description model with the scan B-test. The presented numerical investigations demonstrate the improved detection accuracy as well as decreased computational complexity of the proposed RIS detection approach over existing change point detection schemes.
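The shape of such an online change-point detector can be sketched with a deliberately simple mean-shift statistic. This stand-in replaces both the deep SVDD model and the scan B-test with a windowed z-score on raw samples; the window size, threshold, and synthetic stream below are all illustrative assumptions, not the paper's design.

```python
def online_changepoint(stream, window=20, threshold=3.0):
    """Flag the first time the mean of the last `window` samples drifts
    more than `threshold` standard errors from the historical baseline.
    A toy stand-in for the paper's deep-SVDD + scan B-test statistic."""
    baseline, recent = [], []
    for t, x in enumerate(stream):
        recent.append(x)
        if len(recent) > window:
            baseline.append(recent.pop(0))  # retire old samples into history
        if len(baseline) >= window:
            mu = sum(baseline) / len(baseline)
            var = sum((b - mu) ** 2 for b in baseline) / (len(baseline) - 1)
            se = max((var / window) ** 0.5, 1e-12)
            if abs(sum(recent) / window - mu) / se > threshold:
                return t  # change declared at this sample
    return None

pre  = [((i * 37) % 10) / 10 for i in range(100)]             # stationary signal
post = [((i * 37) % 10) / 10 + 5.0 for i in range(100, 160)]  # mean shift at t=100
t_detect = online_changepoint(pre + post)
print(t_detect)
```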

[AI-8] Formal Logic-guided Robust Federated Learning against Poisoning Attacks

Link: https://arxiv.org/abs/2411.03231
Authors: Dung Thuy Nguyen,Ziyan An,Taylor T. Johnson,Meiyi Ma,Kevin Leach
Keywords-EN: centralized Machine Learning, centralized Machine, Machine Learning, offers a promising, enabling decentralized
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Logic in Computer Science (cs.LO)
Comments: arXiv admin note: text overlap with arXiv:2305.00328 by other authors

Abstract:Federated Learning (FL) offers a promising solution to the privacy concerns associated with centralized Machine Learning (ML) by enabling decentralized, collaborative learning. However, FL is vulnerable to various security threats, including poisoning attacks, where adversarial clients manipulate the training data or model updates to degrade overall model performance. Recognizing this threat, researchers have focused on developing defense mechanisms to counteract poisoning attacks in FL systems. However, existing robust FL methods predominantly focus on computer vision tasks, leaving a gap in addressing the unique challenges of FL with time series data. In this paper, we present FLORAL, a defense mechanism designed to mitigate poisoning attacks in federated learning for time-series tasks, even in scenarios with heterogeneous client data and a large number of adversarial participants. Unlike traditional model-centric defenses, FLORAL leverages logical reasoning to evaluate client trustworthiness by aligning their predictions with global time-series patterns, rather than relying solely on the similarity of client updates. Our approach extracts logical reasoning properties from clients, then hierarchically infers global properties, and uses these to verify client updates. Through formal logic verification, we assess the robustness of each client contribution, identifying deviations indicative of adversarial behavior. Experimental results on two datasets demonstrate the superior performance of our approach compared to existing baseline methods, highlighting its potential to enhance the robustness of FL to time series applications. Notably, FLORAL reduced the prediction error by 93.27% in the best-case scenario compared to the second-best baseline. Our code is available at this https URL.
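For contrast with FLORAL's logic-based verification, a classical model-centric defense of the kind the abstract says it departs from, one that relies purely on the numerical similarity of client updates, can be sketched in a few lines (the update vectors and trim level are toy assumptions):

```python
def trimmed_mean(updates, trim=1):
    """Coordinate-wise trimmed mean: per coordinate, drop the `trim`
    largest and smallest client values before averaging, bounding the
    influence of a bounded number of poisoned updates."""
    dim = len(updates[0])
    agg = []
    for j in range(dim):
        col = sorted(u[j] for u in updates)
        kept = col[trim:len(col) - trim]
        agg.append(sum(kept) / len(kept))
    return agg

honest = [[0.9, -0.1], [1.1, 0.1], [1.0, 0.0]]
poisoned = [[50.0, -50.0]]  # one adversarial client pushing the model away
print(trimmed_mean(honest + poisoned, trim=1))  # stays close to the honest mean
```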

[AI-9] Knowledge Graphs of Driving Scenes to Empower the Emerging Capabilities of Neurosymbolic AI

Link: https://arxiv.org/abs/2411.03225
Authors: Ruwan Wickramarachchi,Cory Henson,Amit Sheth
Keywords-EN: era of Generative, perception to cognition, powerful approach, spanning from perception, Neurosymbolic
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages

Abstract:In the era of Generative AI, Neurosymbolic AI is emerging as a powerful approach for tasks spanning from perception to cognition. The use of Neurosymbolic AI has been shown to achieve enhanced capabilities, including improved grounding, alignment, explainability, and reliability. However, due to its nascent stage, there is a lack of widely available real-world benchmark datasets tailored to Neurosymbolic AI tasks. To address this gap and support the evaluation of current and future methods, we introduce DSceneKG – a suite of knowledge graphs of driving scenes built from real-world, high-quality scenes from multiple open autonomous driving datasets. In this article, we detail the construction process of DSceneKG and highlight its application in seven different tasks. DSceneKG is publicly accessible at: this https URL

[AI-10] Beyond Grid Data: Exploring Graph Neural Networks for Earth Observation

Link: https://arxiv.org/abs/2411.03223
Authors: Shan Zhao,Zhaiyu Chen,Zhitong Xiong,Yilei Shi,Sudipan Saha,Xiao Xiang Zhu
Keywords-EN: grid-like data structures, Earth Observation, applications typically limited, Graph Neural Networks, deep learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication in Geoscience and Remote Sensing Magazine (GRSM)

Abstract:Earth Observation (EO) data analysis has been significantly revolutionized by deep learning (DL), with applications typically limited to grid-like data structures. Graph Neural Networks (GNNs) emerge as an important innovation, propelling DL into the non-Euclidean domain. Naturally, GNNs can effectively tackle the challenges posed by diverse modalities, multiple sensors, and the heterogeneous nature of EO data. To introduce GNNs in the related domains, our review begins by offering fundamental knowledge on GNNs. Then, we summarize the generic problems in EO, to which GNNs can offer potential solutions. Following this, we explore a broad spectrum of GNNs’ applications to scientific problems in Earth systems, covering areas such as weather and climate analysis, disaster management, air quality monitoring, agriculture, land cover classification, hydrological process modeling, and urban modeling. The rationale behind adopting GNNs in these fields is explained, alongside methodologies for organizing graphs and designing favorable architectures for various tasks. Furthermore, we highlight methodological challenges of implementing GNNs in these domains and possible solutions that could guide future research. While acknowledging that GNNs are not a universal solution, we conclude the paper by comparing them with other popular architectures like transformers and analyzing their potential synergies.

[AI-11] GIS Copilot: Towards an Autonomous GIS Agent for Spatial Analysis

Link: https://arxiv.org/abs/2411.03205
Authors: Temitope Akinboyewa,Zhenlong Li,Huan Ning,M. Naser Lessani
Keywords-EN: Recent advancements, GIS Copilot, GIS, GIS Copilot demonstrates, offer promising capabilities
Subjects: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments:

Abstract:Recent advancements in Generative AI offer promising capabilities for spatial analysis. Despite their potential, the integration of generative AI with established GIS platforms remains underexplored. In this study, we propose a framework for integrating LLMs directly into existing GIS platforms, using QGIS as an example. Our approach leverages the reasoning and programming capabilities of LLMs to autonomously generate spatial analysis workflows and code through an informed agent that has comprehensive documentation of key GIS tools and parameters. The implementation of this framework resulted in the development of a “GIS Copilot” that allows GIS users to interact with QGIS using natural language commands for spatial analysis. The GIS Copilot was evaluated based on three complexity levels: basic tasks that require one GIS tool and typically involve one data layer to perform simple operations; intermediate tasks involving multi-step processes with multiple tools, guided by user instructions; and advanced tasks which involve multi-step processes that require multiple tools but are not guided by user instructions, requiring the agent to independently decide on and execute the necessary steps. The evaluation reveals that the GIS Copilot demonstrates strong potential in automating foundational GIS operations, with a high success rate in tool selection and code generation for basic and intermediate tasks, while challenges remain in achieving full autonomy for more complex tasks. This study contributes to the emerging vision of Autonomous GIS, providing a pathway for non-experts to engage with geospatial analysis with minimal prior expertise. While full autonomy is yet to be achieved, the GIS Copilot demonstrates significant potential for simplifying GIS workflows and enhancing decision-making processes.

[AI-12] On Improved Conditioning Mechanisms and Pre-training Strategies for Diffusion Models NEURIPS2024

Link: https://arxiv.org/abs/2411.03177
Authors: Tariq Berrada Ifriqi,Pietro Astolfi,Melissa Hall,Reyhane Askari-Hemmat,Yohann Benchetrit,Marton Havasi,Matthew Muckley,Karteek Alahari,Adriana Romero-Soriano,Jakob Verbeek,Michal Drozdzal
Keywords-EN: enabled unprecedented quality, LDM training recipes, latent diffusion models, Large-scale training, performing LDM training
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted as a conference paper (poster) for NeurIPS 2024

Abstract:Large-scale training of latent diffusion models (LDMs) has enabled unprecedented quality in image generation. However, the key components of the best performing LDM training recipes are oftentimes not available to the research community, preventing apples-to-apples comparisons and hindering the validation of progress in the field. In this work, we perform an in-depth study of LDM training recipes focusing on the performance of models and their training efficiency. To ensure apples-to-apples comparisons, we re-implement five previously published models with their corresponding recipes. Through our study, we explore the effects of (i)~the mechanisms used to condition the generative model on semantic information (e.g., text prompt) and control metadata (e.g., crop size, random flip flag, etc.) on the model performance, and (ii)~the transfer of the representations learned on smaller and lower-resolution datasets to larger ones on the training efficiency and model performance. We then propose a novel conditioning mechanism that disentangles semantic and control metadata conditionings and sets a new state-of-the-art in class-conditional generation on the ImageNet-1k dataset – with FID improvements of 7% on 256 and 8% on 512 resolutions – as well as text-to-image generation on the CC12M dataset – with FID improvements of 8% on 256 and 23% on 512 resolution.

[AI-13] Navigating Extremes: Dynamic Sparsity in Large Output Space NEURIPS2024

Link: https://arxiv.org/abs/2411.03171
Authors: Nasib Ullah,Erik Schultheis,Mike Lasby,Yani Ioannou,Rohit Babbar
Keywords-EN: Dynamic Sparse Training, Dynamic Sparse, DST, alternative to post-training, post-training pruning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages, 7 figures, NeurIPS 2024

Abstract:In recent years, Dynamic Sparse Training (DST) has emerged as an alternative to post-training pruning for generating efficient models. In principle, DST allows for a more memory-efficient training process, as it maintains sparsity throughout the entire training run. However, current DST implementations fail to capitalize on this in practice. Because sparse matrix multiplication is much less efficient than dense matrix multiplication on GPUs, most implementations simulate sparsity by masking weights. In this paper, we leverage recent advances in semi-structured sparse training to apply DST in the domain of classification with large output spaces, where memory-efficiency is paramount. With a label space of possibly millions of candidates, the classification layer alone will consume several gigabytes of memory. Switching from a dense layer to a fixed fan-in sparse layer updated with sparse evolutionary training (SET), however, severely hampers training convergence, especially at the largest label spaces. We find that poor gradient flow from the sparse classifier to the dense text encoder makes it difficult to learn good input representations. By employing an intermediate layer or adding an auxiliary training objective, we recover most of the generalisation performance of the dense model. Overall, we demonstrate the applicability and practical benefits of DST in a challenging domain – characterized by a highly skewed label distribution that differs substantially from typical DST benchmark datasets – which enables end-to-end training with millions of labels on commodity hardware.

[AI-14] Machine Learning Innovations in CPR: A Comprehensive Survey on Enhanced Resuscitation Techniques

Link: https://arxiv.org/abs/2411.03131
Authors: Saidul Islam,Gaith Rjoub,Hanae Elmekki,Jamal Bentahar,Witold Pedrycz,Robin Cohen
Keywords-EN: Machine Learning, Artificial Intelligence, survey paper explores, role of Machine, Cardiopulmonary Resuscitation
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:This survey paper explores the transformative role of Machine Learning (ML) and Artificial Intelligence (AI) in Cardiopulmonary Resuscitation (CPR). It examines the evolution from traditional CPR methods to innovative ML-driven approaches, highlighting the impact of predictive modeling, AI-enhanced devices, and real-time data analysis in improving resuscitation outcomes. The paper provides a comprehensive overview, classification, and critical analysis of current applications, challenges, and future directions in this emerging field.

[AI-15] Evaluating Machine Learning Models against Clinical Protocols for Enhanced Interpretability and Continuity of Care

Link: https://arxiv.org/abs/2411.03105
Authors: Christel Sirocchi,Muhammad Suffian,Federico Sabbatini,Alessandro Bogliolo,Sara Montagna
Keywords-EN: decision-making relies heavily, Machine Learning, clinical, relies heavily, models
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:In clinical practice, decision-making relies heavily on established protocols, often formalised as rules. Concurrently, Machine Learning (ML) models, trained on clinical data, aspire to integrate into medical decision-making processes. However, despite the growing number of ML applications, their adoption into clinical practice remains limited. Two critical concerns arise, relevant to the notions of consistency and continuity of care: (a) accuracy - the ML model, albeit more accurate, might introduce errors that would not have occurred by applying the protocol; (b) interpretability - ML models operating as black boxes might make predictions based on relationships that contradict established clinical knowledge. In this context, the literature suggests using ML models integrating domain knowledge for improved accuracy and interpretability. However, there is a lack of appropriate metrics for comparing ML models with clinical rules in addressing these challenges. Accordingly, in this article, we first propose metrics to assess the accuracy of ML models with respect to the established protocol. Secondly, we propose an approach to measure the distance of explanations provided by two rule sets, with the goal of comparing the explanation similarity between clinical rule-based systems and rules extracted from ML models. The approach is validated on the Pima Indians Diabetes dataset by training two neural networks - one exclusively on data, and the other integrating a clinical protocol. Our findings demonstrate that the integrated ML model achieves comparable performance to that of a fully data-driven model while exhibiting superior accuracy relative to the clinical protocol, ensuring enhanced continuity of care. Furthermore, we show that our integrated model provides explanations for predictions that align more closely with the clinical protocol compared to the data-driven model.
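A minimal version of the first kind of metric the abstract proposes, a model's accuracy measured relative to the established protocol, might look like the sketch below; the metric names and toy labels are illustrative, not the paper's exact definitions. The interesting quantities are the cases where the model errs although the protocol would not have, since those are the errors that threaten continuity of care.

```python
def protocol_relative_report(y_true, y_protocol, y_model):
    """Compare an ML model against an established clinical protocol, case by case."""
    n = len(y_true)
    # Cases the protocol got wrong but the model corrected, and vice versa.
    fixed      = sum(p != t and m == t for t, p, m in zip(y_true, y_protocol, y_model))
    introduced = sum(p == t and m != t for t, p, m in zip(y_true, y_protocol, y_model))
    return {
        "model_accuracy":    sum(m == t for t, m in zip(y_true, y_model)) / n,
        "protocol_accuracy": sum(p == t for t, p in zip(y_true, y_protocol)) / n,
        "errors_fixed_by_model": fixed,
        "errors_introduced_by_model": introduced,
    }

y_true     = [1, 0, 1, 1, 0, 0, 1, 0]
y_protocol = [1, 0, 0, 1, 0, 1, 1, 0]  # protocol errs on cases 2 and 5
y_model    = [1, 0, 1, 1, 1, 0, 1, 0]  # model fixes both, but errs on case 4
report = protocol_relative_report(y_true, y_protocol, y_model)
print(report)
```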

[AI-16] Local Lesion Generation is Effective for Capsule Endoscopy Image Data Augmentation in a Limited Data Setting

Link: https://arxiv.org/abs/2411.03098
Authors: Adrian B. Chłopowiec,Adam R. Chłopowiec,Krzysztof Galus,Wojciech Cebula,Martin Tabakov
Keywords-EN: Generative Adversarial Networks, Adversarial Networks, deep learning models, challenge deep learning, Generative Adversarial
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 45 pages, 27 figures

Abstract:Limited medical imaging datasets challenge deep learning models by increasing risks of overfitting and reduced generalization, particularly in Generative Adversarial Networks (GANs), where discriminators may overfit, leading to training divergence. This constraint also impairs classification models trained on small datasets. Generative Data Augmentation (GDA) addresses this by expanding training datasets with synthetic data, although it requires training a generative model. We propose and evaluate two local lesion generation approaches to address the challenge of augmenting small medical image datasets. The first approach employs the Poisson Image Editing algorithm, a classical image processing technique, to create realistic image composites that outperform current state-of-the-art methods. The second approach introduces a novel generative method, leveraging a fine-tuned Image Inpainting GAN to synthesize realistic lesions within specified regions of real training images. A comprehensive comparison of the two proposed methods demonstrates that effective local lesion generation in a data-constrained setting allows for reaching new state-of-the-art results in capsule endoscopy lesion classification. Combination of our techniques achieves a macro F1-score of 33.07%, surpassing the previous best result by 7.84 percentage points (p.p.) on the highly imbalanced Kvasir Capsule Dataset, a benchmark for capsule endoscopy. To the best of our knowledge, this work is the first to apply a fine-tuned Image Inpainting GAN for GDA in medical imaging, demonstrating that an image-conditional GAN can be adapted effectively to limited datasets to generate high-quality examples, facilitating effective data augmentation. Additionally, we show that combining this GAN-based approach with classical image processing techniques further enhances the results.
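The Poisson Image Editing idea, pasting the source's gradients rather than its pixels and solving for values that match the target at the region boundary, can be shown in one dimension with pure Python (real lesion compositing solves the analogous 2D problem, e.g. via OpenCV's seamlessClone; everything below is a didactic sketch with made-up signals).

```python
def poisson_blend_1d(target, source, lo, hi, iters=500):
    """Seamless 1D blend: inside (lo, hi), solve f'' = source'' with
    Dirichlet boundary values taken from `target` (Gauss-Seidel sweeps)."""
    f = list(target)
    for _ in range(iters):
        for i in range(lo + 1, hi):
            lap = source[i - 1] - 2 * source[i] + source[i + 1]  # guidance field
            f[i] = 0.5 * (f[i - 1] + f[i + 1] - lap)
    return f

target = [10.0] * 11                      # flat background
source = [x * x / 10 for x in range(11)]  # curved "lesion" profile
blended = poisson_blend_1d(target, source, lo=2, hi=8)
# Boundaries stay glued to the target; the interior inherits the
# source's curvature, shifted so the composite is seamless.
print(blended[2], blended[5], blended[8])
```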

[AI-17] HFGaussian: Learning Generalizable Gaussian Human with Integrated Human Features

链接: https://arxiv.org/abs/2411.03086
作者: Arnab Dey,Cheng-You Lu,Andrew I. Comport,Srinath Sridhar,Chin-Teng Lin,Jean Martinet
关键词-EN: radiance field rendering, field rendering show, rendering show promising, show promising results, Recent advancements
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advancements in radiance field rendering show promising results in 3D scene representation, where Gaussian splatting-based techniques emerge as state-of-the-art due to their quality and efficiency. Gaussian splatting is widely used for various applications, including 3D human representation. However, previous 3D Gaussian splatting methods either use parametric body models as additional information or fail to provide any underlying structure, like human biomechanical features, which are essential for different applications. In this paper, we present a novel approach called HFGaussian that can estimate novel views and human features, such as the 3D skeleton, 3D key points, and dense pose, from sparse input images in real time at 25 FPS. The proposed method leverages generalizable Gaussian splatting technique to represent the human subject and its associated features, enabling efficient and generalizable reconstruction. By incorporating a pose regression network and the feature splatting technique with Gaussian splatting, HFGaussian demonstrates improved capabilities over existing 3D human methods, showcasing the potential of 3D human representations with integrated biomechanics. We thoroughly evaluate our HFGaussian method against the latest state-of-the-art techniques in human Gaussian splatting and pose estimation, demonstrating its real-time, state-of-the-art performance.

[AI-18] Self-supervised cross-modality learning for uncertainty-aware object detection and recognition in applications which lack pre-labelled training data

链接: https://arxiv.org/abs/2411.03082
作者: Irum Mehboob,Li Sun,Alireza Astegarpanah,Rustam Stolkin
关键词-EN: lacking annotated training, deep neural network, RGB images, applications lacking annotated, annotated training datasets
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: 16 pages

点击查看摘要

Abstract:This paper shows how an uncertainty-aware, deep neural network can be trained to detect, recognise and localise objects in 2D RGB images, in applications lacking annotated training datasets. We propose a self-supervising teacher-student pipeline, in which a relatively simple teacher classifier, trained with only a few labelled 2D thumbnails, automatically processes a larger body of unlabelled RGB-D data to teach a student network based on a modified YOLOv3 architecture. Firstly, 3D object detection with back projection is used to automatically extract and teach 2D detection and localisation information to the student network. Secondly, a weakly supervised 2D thumbnail classifier, with minimal training on a small number of hand-labelled images, is used to teach object category recognition. Thirdly, we use a Gaussian Process (GP) to encode and teach a robust uncertainty estimation functionality, so that the student can output confidence scores with each categorization. The resulting student significantly outperforms the same YOLO architecture trained directly on the same amount of labelled data. Our GP-based approach yields robust and meaningful uncertainty estimations for complex industrial object classifications. The end-to-end network is also capable of real-time processing, needed for robotics applications. Our method can be applied to many important industrial tasks, where labelled datasets are typically unavailable. In this paper, we demonstrate an example of detection, localisation, and object category recognition of nuclear mixed-waste materials in highly cluttered and unstructured scenes. This is critical for robotic sorting and handling of legacy nuclear waste, which poses complex environmental remediation challenges in many nuclearised nations.

[AI-19] Enhancing DP-SGD through Non-monotonous Adaptive Scaling Gradient Weight

链接: https://arxiv.org/abs/2411.03059
作者: Tao Huang,Qingyu Huang,Xin Shi,Jiayang Meng,Guolong Zheng,Xu Yang,Xun Yi
关键词-EN: protecting sensitive data, maintaining model utility, Differentially Private Per-sample, Differentially Private, Differentially Private Stochastic
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the domain of deep learning, the challenge of protecting sensitive data while maintaining model utility is significant. Traditional Differential Privacy (DP) techniques such as Differentially Private Stochastic Gradient Descent (DP-SGD) typically employ strategies like direct or per-sample adaptive gradient clipping. These methods, however, compromise model accuracy due to their critical influence on gradient handling, particularly neglecting the significant contribution of small gradients during later training stages. In this paper, we introduce an enhanced version of DP-SGD, named Differentially Private Per-sample Adaptive Scaling Clipping (DP-PSASC). This approach replaces traditional clipping with non-monotonous adaptive gradient scaling, which alleviates the need for intensive threshold setting and rectifies the disproportionate weighting of smaller gradients. Our contribution is twofold. First, we develop a novel gradient scaling technique that effectively assigns proper weights to gradients, particularly small ones, thus improving learning under differential privacy. Second, we integrate a momentum-based method into DP-PSASC to reduce bias from stochastic sampling, enhancing convergence rates. Our theoretical and empirical analyses confirm that DP-PSASC preserves privacy and delivers superior performance across diverse datasets, setting new standards for privacy-sensitive applications.
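The abstract does not spell out the exact non-monotonous scaling function, so the sketch below uses a common smooth stand-in, g · C/(‖g‖ + γ), to contrast hard per-sample clipping with adaptive scaling; all function names and constants here are illustrative, not the authors' code:

```python
import math, random

def clip(grad, C):
    """Standard DP-SGD per-sample clipping: hard-cap the L2 norm at C."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, C / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def adaptive_scale(grad, C, gamma=0.01):
    """Smooth per-sample scaling as a stand-in for the paper's
    non-monotonous scaling: every gradient is rescaled by C/(||g||+gamma),
    so small gradients keep proportionally more weight than under hard
    clipping while the L2 sensitivity stays bounded by C."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = C / (norm + gamma)
    return [g * scale for g in grad]

def dp_mean_gradient(per_sample_grads, C, sigma, rng, scaler):
    """Scale each sample's gradient, sum, add Gaussian noise calibrated
    to the sensitivity C, and average over the batch."""
    n, dim = len(per_sample_grads), len(per_sample_grads[0])
    total = [0.0] * dim
    for g in per_sample_grads:
        for i, v in enumerate(scaler(g, C)):
            total[i] += v
    return [(t + rng.gauss(0.0, sigma * C)) / n for t in total]

rng = random.Random(0)
noisy = dp_mean_gradient([[3.0, 4.0], [0.1, 0.0]], C=1.0, sigma=1.0,
                         rng=rng, scaler=adaptive_scale)
```

Note how a tiny gradient is left almost untouched by hard clipping but strongly upweighted by the smooth scaler, which is the "disproportionate weighting of smaller gradients" the abstract says DP-PSASC rectifies.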

[AI-20] ATM: Improving Model Merging by Alternating Tuning and Merging

链接: https://arxiv.org/abs/2411.03055
作者: Luca Zhou,Daniele Solombrino,Donato Crisostomi,Maria Sofia Bucarelli,Fabrizio Silvestri,Emanuele Rodolà
关键词-EN: recently emerged, cost-efficient paradigm, task vectors, Model merging, task
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Main paper: 10 Pages, 11 figures, 2 tables

点击查看摘要

Abstract:Model merging has recently emerged as a cost-efficient paradigm for multi-task learning. Among current approaches, task arithmetic stands out for its simplicity and effectiveness. In this paper, we motivate the effectiveness of task vectors by linking them to multi-task gradients. We show that in a single-epoch scenario, task vectors are mathematically equivalent to the gradients obtained via gradient descent in a multi-task setting, and still approximate these gradients in subsequent epochs. Furthermore, we show that task vectors perform optimally when equality is maintained, and their effectiveness is largely driven by the first epoch’s gradient. Building on this insight, we propose viewing model merging as a single step in an iterative process that Alternates between Tuning and Merging (ATM). This method acts as a bridge between model merging and multi-task gradient descent, achieving state-of-the-art results with the same data and computational requirements. We extensively evaluate ATM across diverse settings, achieving up to 20% higher accuracy in computer vision and NLP tasks, compared to the best baselines. Moreover, we provide both empirical and theoretical support for its effectiveness, demonstrating increased orthogonality between task vectors and proving that ATM minimizes an upper bound on the loss obtained by jointly finetuning all tasks.
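The tune/merge alternation can be illustrated on a toy two-task quadratic problem. With a single SGD step per round, each task vector equals minus the learning rate times that task's gradient, so merging by averaging recovers multi-task gradient descent; everything below is a hypothetical sketch, not the authors' implementation:

```python
def sgd_step(theta, grad_fn, lr):
    """One gradient descent step on a parameter vector."""
    g = grad_fn(theta)
    return [t - lr * gi for t, gi in zip(theta, g)]

def atm(theta0, task_grad_fns, lr=0.1, rounds=50):
    """Alternating Tuning and Merging: each round, finetune a copy of the
    shared model on every task, then merge by adding the average task
    vector (finetuned minus shared) back into the shared model."""
    theta = list(theta0)
    for _ in range(rounds):
        task_vectors = []
        for grad_fn in task_grad_fns:
            tuned = sgd_step(theta, grad_fn, lr)  # one tuning step per round
            task_vectors.append([a - b for a, b in zip(tuned, theta)])
        mean_tv = [sum(vs) / len(task_grad_fns) for vs in zip(*task_vectors)]
        theta = [t + v for t, v in zip(theta, mean_tv)]
    return theta

# Two quadratic "tasks" with optima at 0 and 4; the optimum of the
# averaged loss is their midpoint, 2.
task_a = lambda th: [2 * (th[0] - 0.0)]
task_b = lambda th: [2 * (th[0] - 4.0)]
theta = atm([10.0], [task_a, task_b])
```

With one step per round the merged update is exactly one multi-task gradient step, so the iterate converges to the joint optimum, which is the equivalence the paper exploits.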

[AI-21] Gradient-Guided Conditional Diffusion Models for Private Image Reconstruction: Analyzing Adversarial Impacts of Differential Privacy and Denoising

链接: https://arxiv.org/abs/2411.03053
作者: Tao Huang,Jiayang Meng,Hong Chen,Guolong Zheng,Xu Yang,Xun Yi,Hua Wang
关键词-EN: conditional diffusion models, diffusion models, gradient-guided conditional diffusion, reconstructing private images, diffusion model generation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We investigate the construction of gradient-guided conditional diffusion models for reconstructing private images, focusing on the adversarial interplay between differential privacy noise and the denoising capabilities of diffusion models. While current gradient-based reconstruction methods struggle with high-resolution images due to computational complexity and prior knowledge requirements, we propose two novel methods that require minimal modifications to the diffusion model’s generation process and eliminate the need for prior knowledge. Our approach leverages the strong image generation capabilities of diffusion models to reconstruct private images starting from randomly generated noise, even when a small amount of differentially private noise has been added to the gradients. We also conduct a comprehensive theoretical analysis of the impact of differential privacy noise on the quality of reconstructed images, revealing the relationship among noise magnitude, the architecture of attacked models, and the attacker’s reconstruction capability. Additionally, extensive experiments validate the effectiveness of our proposed methods and the accuracy of our theoretical findings, suggesting new directions for privacy risk auditing using conditional diffusion models.

[AI-22] HumanVLM: Foundation for Human-Scene Vision-Language Model

链接: https://arxiv.org/abs/2411.03034
作者: Dawei Dai,Xu Long,Li Yutang,Zhang Yuanhui,Shuyin Xia
关键词-EN: diverse social applications, recent advancements predominantly, advancements predominantly rely, social applications, vision-language
类目: Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注: 34 pages,11 figures

点击查看摘要

Abstract:Human-scene vision-language tasks are increasingly prevalent in diverse social applications, yet recent advancements predominantly rely on models specifically tailored to individual tasks. Emerging research indicates that large vision-language models (VLMs) can enhance performance across various downstream vision-language understanding tasks. However, general-domain models often underperform in specialized fields. This study introduces a domain-specific Large Vision-Language Model, Human-Scene Vision-Language Model (HumanVLM), designed to provide a foundation for human-scene Vision-Language tasks. Specifically, (1) we create a large-scale human-scene multimodal image-text dataset (HumanCaption-10M) sourced from the Internet to facilitate domain-specific alignment; (2) develop a captioning approach for human-centered images, capturing human faces, bodies, and backgrounds, and construct a high-quality Human-Scene image-text dataset (HumanCaptionHQ, about 311k pairs) that contains as much detailed information as possible about humans; (3) using HumanCaption-10M and HumanCaptionHQ, we train HumanVLM. In the experiments, we evaluate HumanVLM across various downstream tasks, where it demonstrates superior overall performance among multimodal models of comparable scale, particularly excelling in human-related tasks and significantly outperforming similar models, including Qwen2VL and ChatGPT-4o. HumanVLM, alongside the data introduced, will stimulate research in human-centered fields.

[AI-23] Adaptive Genetic Selection based Pinning Control with Asymmetric Coupling for Multi-Network Heterogeneous Vehicular Systems

链接: https://arxiv.org/abs/2411.03027
作者: Weian Guo,Ruizhi Sha,Li Li,Lun Zhang,Dongyang Li
关键词-EN: reduce communication bandwidth, communication bandwidth requirements, alleviate computational load, cloud platforms, reduce communication
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:To alleviate computational load on RSUs and cloud platforms, reduce communication bandwidth requirements, and provide a more stable vehicular network service, this paper proposes an optimized pinning control approach for heterogeneous multi-network vehicular ad-hoc networks (VANETs). In such networks, vehicles participate in multiple task-specific networks with asymmetric coupling and dynamic topologies. We first establish a rigorous theoretical foundation by proving the stability of pinning control strategies under both single and multi-network conditions, deriving sufficient stability conditions using Lyapunov theory and linear matrix inequalities (LMIs). Building on this theoretical groundwork, we propose an adaptive genetic algorithm tailored to select optimal pinning nodes, effectively balancing LMI constraints while prioritizing overlapping nodes to enhance control efficiency. Extensive simulations across various network scales demonstrate that our approach achieves rapid consensus with a reduced number of control nodes, particularly when leveraging network overlaps. This work provides a comprehensive solution for efficient control node selection in complex vehicular networks, offering practical implications for deploying large-scale intelligent transportation systems.

[AI-24] DA-MoE: Addressing Depth-Sensitivity in Graph-Level Analysis through Mixture of Experts

链接: https://arxiv.org/abs/2411.03025
作者: Zelin Yao,Chuang Liu,Xianke Meng,Yibing Zhan,Jia Wu,Shirui Pan,Wenbin Hu
关键词-EN: processing graph-structured data, GNN layers, Graph, GNN, gaining popularity
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 8pages

点击查看摘要

Abstract:Graph neural networks (GNNs) are gaining popularity for processing graph-structured data. In real-world scenarios, graph data within the same dataset can vary significantly in scale. This variability leads to depth-sensitivity, where the optimal depth of GNN layers depends on the scale of the graph data. Empirically, fewer layers are sufficient for message passing in smaller graphs, while larger graphs typically require deeper networks to capture long-range dependencies and global features. However, existing methods generally use a fixed number of GNN layers to generate representations for all graphs, overlooking the depth-sensitivity issue in graph structure data. To address this challenge, we propose the depth adaptive mixture of expert (DA-MoE) method, which incorporates two main improvements to the GNN backbone: (1) DA-MoE employs different GNN layers, each considered an expert with its own parameters. Such a design allows the model to flexibly aggregate information at different scales, effectively addressing the depth-sensitivity issue in graph data. (2) DA-MoE utilizes a GNN to capture the structural information instead of the linear projections in the gating network. Thus, the gating network enables the model to capture complex patterns and dependencies within the data. By leveraging these improvements, each expert in DA-MoE specifically learns distinct graph patterns at different scales. Furthermore, comprehensive experiments on the TU dataset and open graph benchmark (OGB) have shown that DA-MoE consistently surpasses existing baselines on various tasks, including graph, node, and link-level analyses. The code is available at this https URL.
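The core mixing step, a gating network weighting per-depth GNN experts, can be sketched as a softmax-weighted sum over expert readouts; the expert outputs and gate scores below are made-up illustrations, not values from the paper:

```python
import math

def da_moe_readout(expert_outputs, gate_scores):
    """Combine per-depth expert representations: softmax the gate scores
    (one score per expert, e.g. produced by a gating GNN on the input
    graph) and return the weighted sum of the expert readouts."""
    m = max(gate_scores)
    exps = [math.exp(s - m) for s in gate_scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(expert_outputs[0])
    rep = [sum(w * out[i] for w, out in zip(weights, expert_outputs))
           for i in range(dim)]
    return rep, weights

# Two hypothetical experts: a shallow GNN stack (suited to a small graph)
# and a deep one; here the gate strongly prefers the shallow expert.
rep, weights = da_moe_readout([[1.0, 0.0], [0.0, 1.0]], [6.0, 0.0])
```

When one gate score dominates, the combined representation collapses onto that expert's output, which is how the model adapts its effective depth per graph.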

[AI-25] Flashy Backdoor: Real-world Environment Backdoor Attack on SNNs with DVS Cameras

链接: https://arxiv.org/abs/2411.03022
作者: Roberto Riaño,Gorka Abad,Stjepan Picek,Aitor Urbieta
关键词-EN: Deep Neural Networks, Spiking Neural Networks, Neural Networks, traditional Deep Neural, Deep Neural
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:While security vulnerabilities in traditional Deep Neural Networks (DNNs) have been extensively studied, the susceptibility of Spiking Neural Networks (SNNs) to adversarial attacks remains mostly underexplored. Until now, the mechanisms to inject backdoors into SNN models have been limited to digital scenarios; thus, we present the first evaluation of backdoor attacks in real-world environments. We begin by assessing the applicability of existing digital backdoor attacks and identifying their limitations for deployment in physical environments. To address each of the found limitations, we present three novel backdoor attack methods on SNNs, i.e., Framed, Strobing, and Flashy Backdoor. We also assess the effectiveness of traditional backdoor procedures and defenses adapted for SNNs, such as pruning, fine-tuning, and fine-pruning. The results show that while these procedures and defenses can mitigate some attacks, they often fail against stronger methods like Flashy Backdoor or sacrifice too much clean accuracy, rendering the models unusable. Overall, all our methods can achieve up to a 100% Attack Success Rate while maintaining high clean accuracy in every tested dataset. Additionally, we evaluate the stealthiness of the triggers with commonly used metrics, finding them highly stealthy. Thus, we propose new alternatives more suited for identifying poisoned samples in these scenarios. Our results show that further research is needed to ensure the security of SNN-based systems against backdoor attacks and their safe application in real-world scenarios. The code, experiments, and results are available in our repository.

[AI-26] Hierarchical Orchestra of Policies NEURIPS

链接: https://arxiv.org/abs/2411.03008
作者: Thomas P Cannon,Özgür Simsek
关键词-EN: Continual reinforcement learning, major challenge due, experience catastrophic forgetting, reinforcement learning poses, Continual reinforcement
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted as a poster. NeurIPS IMOL

点击查看摘要

Abstract:Continual reinforcement learning poses a major challenge due to the tendency of agents to experience catastrophic forgetting when learning sequential tasks. In this paper, we introduce a modularity-based approach, called Hierarchical Orchestra of Policies (HOP), designed to mitigate catastrophic forgetting in lifelong reinforcement learning. HOP dynamically forms a hierarchy of policies based on a similarity metric between the current observations and previously encountered observations in successful tasks. Unlike other state-of-the-art methods, HOP does not require task labelling, allowing for robust adaptation in environments where boundaries between tasks are ambiguous. Our experiments, conducted across multiple tasks in a procedurally generated suite of environments, demonstrate that HOP significantly outperforms baseline methods in retaining knowledge across tasks and performs comparably to state-of-the-art transfer methods that require task labelling. Moreover, HOP achieves this without compromising performance when tasks remain constant, highlighting its versatility.
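One way to picture HOP's label-free policy selection is nearest-prototype matching on observations, falling back to a fresh policy when nothing is similar enough. The cosine metric, threshold, and policy names below are our stand-ins for the paper's unspecified similarity metric:

```python
import math

def select_policy(obs, library, threshold=0.8):
    """Pick the stored policy whose prototype observation is most similar
    to the current observation; when no stored policy is similar enough,
    fall back to learning a new policy. No task labels are needed, which
    mirrors HOP's robustness to ambiguous task boundaries."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0
    best_name, best_sim = None, -1.0
    for name, proto in library.items():
        s = cos(obs, proto)
        if s > best_sim:
            best_name, best_sim = name, s
    return best_name if best_sim >= threshold else "new_policy"

# Hypothetical library of previously successful policies.
selected = select_policy([0.9, 0.1], {"walk": [1.0, 0.0], "jump": [0.0, 1.0]})
```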

[AI-27] Data Quality Awareness: A Journey from Traditional Data Management to Data Science Systems

链接: https://arxiv.org/abs/2411.03007
作者: Sijie Dong,Soror Sahri,Themis Palpanas
关键词-EN: Artificial intelligence, data science, data, data science systems, significantly impacting
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
*备注: 12 pages

点击查看摘要

Abstract:Artificial intelligence (AI) has transformed various fields, significantly impacting our daily lives. A major factor in AI success is high-quality data. In this paper, we present a comprehensive review of the evolution of data quality (DQ) awareness from traditional data management systems to modern data-driven AI systems, which are integral to data science. We synthesize the existing literature, highlighting the quality challenges and techniques that have evolved from traditional data management to data science including big data and ML fields. As data science systems support a wide range of activities, our focus in this paper lies specifically in the analytics aspect driven by machine learning. We use the cause-effect connection between the quality challenges of ML and those of big data to allow a more thorough understanding of emerging DQ challenges and the related quality awareness techniques in data science systems. To the best of our knowledge, our paper is the first to provide a review of DQ awareness spanning traditional and emergent data science systems. We hope that readers will find this journey through the evolution of data quality awareness insightful and valuable.

[AI-28] Controlling for Unobserved Confounding with Large Language Model Classification of Patient Smoking Status NEURIPS2024

链接: https://arxiv.org/abs/2411.03004
作者: Samuel Lee,Zach Wood-Doughty
关键词-EN: evidence-based medicine, fundamental goal, goal of evidence-based, Causal understanding, Causal
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Advancements In Medical Foundation Models: Explainability, Robustness, Security, and Beyond (AIM-FM) at NeurIPS 2024

点击查看摘要

Abstract:Causal understanding is a fundamental goal of evidence-based medicine. When randomization is impossible, causal inference methods allow the estimation of treatment effects from retrospective analysis of observational data. However, such analyses rely on a number of assumptions, often including that of no unobserved confounding. In many practical settings, this assumption is violated when important variables are not explicitly measured in the clinical record. Prior work has proposed to address unobserved confounding with machine learning by imputing unobserved variables and then correcting for the classifier’s mismeasurement. When such a classifier can be trained and the necessary assumptions are met, this method can recover an unbiased estimate of a causal effect. However, such work has been limited to synthetic data, simple classifiers, and binary variables. This paper extends this methodology by using a large language model trained on clinical notes to predict patients’ smoking status, which would otherwise be an unobserved confounder. We then apply a measurement error correction on the categorical predicted smoking status to estimate the causal effect of transthoracic echocardiography on mortality in the MIMIC dataset.
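The mismeasurement correction step can be illustrated with the standard matrix method for a binary proxy: invert the classifier's misclassification matrix (estimated on validation data) to recover the true class distribution from the predicted one. The sensitivity/specificity numbers below are hypothetical, not from the paper:

```python
def correct_distribution(p_pred, confusion):
    """Matrix-method measurement error correction for a binary proxy.
    confusion[i][j] = P(classifier predicts i | true class j). Solving
    confusion @ p_true = p_pred recovers the true class distribution
    from the predicted one (2x2 case, solved in closed form)."""
    (a, b), (c, d) = confusion
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("classifier is uninformative; matrix not invertible")
    p0 = (d * p_pred[0] - b * p_pred[1]) / det
    p1 = (-c * p_pred[0] + a * p_pred[1]) / det
    return [p0, p1]

# Hypothetical example: a smoking-status classifier with 90% sensitivity
# and 80% specificity labels 45% of patients as smokers.
confusion = [[0.9, 0.2],   # P(pred = smoker | true smoker / non-smoker)
             [0.1, 0.8]]
p_true = correct_distribution([0.45, 0.55], confusion)
```

Here the corrected smoking prevalence is 0.25/0.7 ≈ 0.357, below the naive 0.45, showing how ignoring the classifier's false positives would bias the downstream causal estimate.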

[AI-29] Accelerating Task Generalisation with Multi-Level Hierarchical Options ICLR2025

链接: https://arxiv.org/abs/2411.02998
作者: Thomas P Cannon,Özgür Simsek
关键词-EN: Creating reinforcement learning, Creating reinforcement, Fracture Cluster Options, introduces Fracture Cluster, reinforcement learning
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 10 pages, under review for ICLR 2025

点击查看摘要

Abstract:Creating reinforcement learning agents that generalise effectively to new tasks is a key challenge in AI research. This paper introduces Fracture Cluster Options (FraCOs), a multi-level hierarchical reinforcement learning method that achieves state-of-the-art performance on difficult generalisation tasks. FraCOs identifies patterns in agent behaviour and forms options based on the expected future usefulness of those patterns, enabling rapid adaptation to new tasks. In tabular settings, FraCOs demonstrates effective transfer and improves performance as it grows in hierarchical depth. We evaluate FraCOs against state-of-the-art deep reinforcement learning algorithms in several complex procedurally generated environments. Our results show that FraCOs achieves higher in-distribution and out-of-distribution performance than competitors.

[AI-30] SUDS: A Strategy for Unsupervised Drift Sampling

链接: https://arxiv.org/abs/2411.02995
作者: Christofer Fellicious,Lorenz Wendlinger,Mario Gancarski,Jelena Mitrovic,Michael Granitzer
关键词-EN: encounters concept drift, Supervised machine learning, Existing drift detection, drift detection, Supervised machine
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 9 pages, 5 tables, 3 figures

点击查看摘要

Abstract:Supervised machine learning often encounters concept drift, where the data distribution changes over time, degrading model performance. Existing drift detection methods focus on identifying these shifts but often overlook the challenge of acquiring labeled data for model retraining after a shift occurs. We present the Strategy for Drift Sampling (SUDS), a novel method that selects homogeneous samples for retraining using existing drift detection algorithms, thereby enhancing model adaptability to evolving data. SUDS seamlessly integrates with current drift detection techniques. We also introduce the Harmonized Annotated Data Accuracy Metric (HADAM), a metric that evaluates classifier performance in relation to the quantity of annotated data required to achieve the stated performance, thereby taking into account the difficulty of acquiring labeled data. Our contributions are twofold: SUDS combines drift detection with strategic sampling to improve the retraining process, and HADAM provides a metric that balances classifier performance with the amount of labeled data, ensuring efficient resource utilization. Empirical results demonstrate the efficacy of SUDS in optimizing labeled data use in dynamic environments, significantly improving the performance of machine learning applications in real-world scenarios. Our code is open source and available at this https URL

[AI-31] Confidence Calibration of Classifiers with Many Classes NEURIPS2024

链接: https://arxiv.org/abs/2411.02988
作者: Adrien Le Coz,Stéphane Herbin,Faouzi Adjed
关键词-EN: maximum predicted class, predicted class probability, classification models based, models based, maximum predicted
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: NeurIPS 2024; code available at this https URL

点击查看摘要

Abstract:For classification models based on neural networks, the maximum predicted class probability is often used as a confidence score. This score rarely predicts well the probability of making a correct prediction and requires a post-processing calibration step. However, many confidence calibration methods fail for problems with many classes. To address this issue, we transform the problem of calibrating a multiclass classifier into calibrating a single surrogate binary classifier. This approach allows for more efficient use of standard calibration methods. We evaluate our approach on numerous neural networks used for image or text classification and show that it significantly enhances existing calibration methods.
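The reduction can be sketched concretely: treat the max class probability as a binary score for "the top prediction is correct" and apply any standard binary calibrator, here histogram binning; the toy data are illustrative:

```python
def calibrate_top_label(confidences, correct, n_bins=10):
    """Reduce multiclass calibration to a binary problem: the surrogate
    classifier's score is the max predicted class probability and its
    label is whether the top prediction was correct. Histogram binning
    maps each confidence bin to the empirical accuracy in that bin."""
    counts = [0.0] * n_bins
    hits = [0.0] * n_bins
    for conf, ok in zip(confidences, correct):
        b = min(int(conf * n_bins), n_bins - 1)  # conf=1.0 goes to last bin
        counts[b] += 1.0
        hits[b] += 1.0 if ok else 0.0
    # Per-bin empirical accuracy; fall back to the bin midpoint if empty.
    table = [hits[b] / counts[b] if counts[b] else (b + 0.5) / n_bins
             for b in range(n_bins)]
    def calibrated(conf):
        return table[min(int(conf * n_bins), n_bins - 1)]
    return calibrated

# Overconfident toy classifier: reports 0.95 but is right only 60% of the time.
confs = [0.95] * 10
labels = [True] * 6 + [False] * 4
cal = calibrate_top_label(confs, labels)
```

Because the surrogate problem is binary, the number of classes never enters the calibration step, which is why the approach scales to problems with many classes.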

[AI-32] Autonomous Decision Making for UAV Cooperative Pursuit-Evasion Game with Reinforcement Learning

链接: https://arxiv.org/abs/2411.02983
作者: Yang Zhao,Zidong Nie,Kangsheng Dong,Qinghua Huang,Xuelong Li
关键词-EN: unmanned aerial vehicle, pursuit-evasion game, UAV cooperative pursuit-evasion, UAV pursuit-evasion game, cooperative pursuit-evasion game
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO)
*备注: 11 pages, 12 figures, 31 conference

点击查看摘要

Abstract:The application of intelligent decision-making in unmanned aerial vehicle (UAV) is increasing, and with the development of UAV 1v1 pursuit-evasion game, multi-UAV cooperative game has emerged as a new challenge. This paper proposes a deep reinforcement learning-based model for decision-making in multi-role UAV cooperative pursuit-evasion game, to address the challenge of enabling UAV to autonomously make decisions in complex game environments. In order to enhance the training efficiency of the reinforcement learning algorithm in UAV pursuit-evasion game environment that has high-dimensional state-action space, this paper proposes multi-environment asynchronous double deep Q-network with priority experience replay algorithm to effectively train the UAV’s game policy. Furthermore, aiming to improve cooperation ability and task completion efficiency, as well as minimize the cost of UAVs in the pursuit-evasion game, this paper focuses on the allocation of roles and targets within multi-UAV environment. The cooperative game decision model with varying numbers of UAVs are obtained by assigning diverse tasks and roles to the UAVs in different scenarios. The simulation results demonstrate that the proposed method enables autonomous decision-making of the UAVs in pursuit-evasion game scenarios and exhibits significant capabilities in cooperation.
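The "priority experience replay" component follows the standard prioritized replay scheme (Schaul et al., 2015): transition i is sampled with probability proportional to its priority raised to α, and each draw receives an importance-sampling weight. A minimal sketch, with parameter values chosen for illustration:

```python
import random

def per_sample(priorities, batch_size, alpha=0.6, beta=0.4, rng=None):
    """Prioritized experience replay sampling: draw transition i with
    probability p_i^alpha / sum_j p_j^alpha, and attach the
    importance-sampling weight (N * P(i))^-beta, normalised by the
    maximum possible weight so all weights lie in (0, 1]."""
    rng = rng or random.Random(0)
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    probs = [s / total for s in scaled]
    idx = rng.choices(range(len(priorities)), weights=probs, k=batch_size)
    n = len(priorities)
    weights = [(n * probs[i]) ** -beta for i in idx]
    w_max = max((n * p) ** -beta for p in probs)
    return idx, [w / w_max for w in weights]

# A high-TD-error transition dominates the sampled minibatch.
idx, weights = per_sample([10.0, 0.1, 0.1, 0.1], batch_size=32)
```

High-priority transitions are replayed far more often, which is what accelerates training in the high-dimensional pursuit-evasion state-action space, while the weights correct the induced sampling bias.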

[AI-33] Region-Guided Attack on the Segment Anything Model (SAM)

链接: https://arxiv.org/abs/2411.02974
作者: Xiaoliang Liu,Furao Shen,Jian Zhao
关键词-EN: demonstrating exceptional performance, demonstrating exceptional, medical imaging, exceptional performance, autonomous driving
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:The Segment Anything Model (SAM) is a cornerstone of image segmentation, demonstrating exceptional performance across various applications, particularly in autonomous driving and medical imaging, where precise segmentation is crucial. However, SAM is vulnerable to adversarial attacks that can significantly impair its functionality through minor input perturbations. Traditional techniques, such as FGSM and PGD, are often ineffective in segmentation tasks due to their reliance on global perturbations that overlook spatial nuances. Recent methods like Attack-SAM-K and UAD have begun to address these challenges, but they frequently depend on external cues and do not fully leverage the structural interdependencies within segmentation processes. This limitation underscores the need for a novel adversarial strategy that exploits the unique characteristics of segmentation tasks. In response, we introduce the Region-Guided Attack (RGA), designed specifically for SAM. RGA utilizes a Region-Guided Map (RGM) to manipulate segmented regions, enabling targeted perturbations that fragment large segments and expand smaller ones, resulting in erroneous outputs from SAM. Our experiments demonstrate that RGA achieves high success rates in both white-box and black-box scenarios, emphasizing the need for robust defenses against such sophisticated attacks. RGA not only reveals SAM’s vulnerabilities but also lays the groundwork for developing more resilient defenses against adversarial threats in image segmentation.

[AI-34] Speaker Emotion Recognition: Leveraging Self-Supervised Models for Feature Extraction Using Wav2Vec2 and HuBERT

链接: https://arxiv.org/abs/2411.02964
作者: Pourya Jafarzadeh,Amir Mohammad Rostami,Padideh Choobdar
关键词-EN: Speaker Emotion Recognition, SER task, SER, emotion, Speech
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Speech is the most natural way of expressing ourselves as humans. Identifying emotion from speech is a nontrivial task due to the ambiguous definition of emotion itself. Speaker Emotion Recognition (SER) is essential for understanding human emotional behavior. The SER task is challenging due to the variety of speakers, background noise, complexity of emotions, and speaking styles. It has many applications in education, healthcare, customer service, and Human-Computer Interaction (HCI). Previously, conventional machine learning methods such as SVM, HMM, and KNN have been used for the SER task. In recent years, deep learning methods have become popular, with convolutional neural networks and recurrent neural networks being used for SER tasks. The input of these methods is mostly spectrograms and hand-crafted features. In this work, we study the use of self-supervised transformer-based models, Wav2Vec2 and HuBERT, to determine the emotion of speakers from their voice. The models automatically extract features from raw audio signals, which are then used for the classification task. The proposed solution is evaluated on reputable datasets, including RAVDESS, SHEMO, SAVEE, AESDD, and Emo-DB. The results show the effectiveness of the proposed method on different datasets. Moreover, the model has been used for real-world applications like call center conversations, and the results demonstrate that the model accurately predicts emotions.

[AI-35] A Mamba Foundation Model for Time Series Forecasting

链接: https://arxiv.org/abs/2411.02941
作者: Haoyu Ma,Yushu Chen,Wenlai Zhao,Jinzhe Yang,Yingsheng Ji,Xinghua Xu,Xiaozhu Liu,Hao Jing,Shengzhuo Liu,Guangwen Yang
关键词-EN: predicting rapidly evolving, rapidly evolving patterns, Time series foundation, demonstrated strong performance, Time series
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Time series foundation models have demonstrated strong performance in zero-shot learning, making them well-suited for predicting rapidly evolving patterns in real-world applications where relevant training data are scarce. However, most of these models rely on the Transformer architecture, which incurs quadratic complexity as input length increases. To address this, we introduce TSMamba, a linear-complexity foundation model for time series forecasting built on the Mamba architecture. The model captures temporal dependencies through both forward and backward Mamba encoders, achieving high prediction accuracy. To reduce reliance on large datasets and lower training costs, TSMamba employs a two-stage transfer learning process that leverages pretrained Mamba LLMs, allowing effective time series modeling with a moderate training set. In the first stage, the forward and backward backbones are optimized via patch-wise autoregressive prediction; in the second stage, the model trains a prediction head and refines other components for long-term forecasting. While the backbone assumes channel independence to manage varying channel numbers across datasets, a channel-wise compressed attention module is introduced to capture cross-channel dependencies during fine-tuning on specific multivariate datasets. Experiments show that TSMamba’s zero-shot performance is comparable to state-of-the-art time series foundation models, despite using significantly less training data. It also achieves competitive or superior full-shot performance compared to task-specific prediction models. The code will be made publicly available.
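The first-stage objective described above, patch-wise autoregressive prediction, amounts to splitting a series into fixed-length patches and using each patch as the input whose target is the next patch. A minimal sketch of that data preparation (the patch length is an assumption, not a value from the paper):

```python
# Sketch: building patch-wise autoregressive (input, target) pairs for a
# univariate series, as in the first training stage described above.

def make_patches(series, patch_len):
    """Split a series into consecutive non-overlapping patches."""
    n = len(series) // patch_len
    return [series[i * patch_len:(i + 1) * patch_len] for i in range(n)]

def autoregressive_pairs(patches):
    """Each patch is an input whose training target is the next patch."""
    return [(patches[i], patches[i + 1]) for i in range(len(patches) - 1)]

series = list(range(12))               # toy univariate series 0..11
patches = make_patches(series, 4)      # [[0..3], [4..7], [8..11]]
pairs = autoregressive_pairs(patches)
print(pairs[0])                        # ([0, 1, 2, 3], [4, 5, 6, 7])
```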

[AI-36] Domain Expansion and Boundary Growth for Open-Set Single-Source Domain Generalization

链接: https://arxiv.org/abs/2411.02920
作者: Pengkun Jiao,Na Zhao,Jingjing Chen,Yu-Gang Jiang
关键词-EN: Open-set single-source domain, single-source domain, unknown target domains, single-source domain generalization, label shifts
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: TMM 2024

点击查看摘要

Abstract:Open-set single-source domain generalization aims to use a single-source domain to learn a robust model that can be generalized to unknown target domains with both domain shifts and label shifts. The scarcity of the source domain and the unknown data distribution of the target domain pose a great challenge for domain-invariant feature learning and unknown class recognition. In this paper, we propose a novel learning approach based on domain expansion and boundary growth to expand the scarce source samples and enlarge the boundaries across the known classes that indirectly broaden the boundary between the known and unknown classes. Specifically, we achieve domain expansion by employing both background suppression and style augmentation on the source data to synthesize new samples. Then we force the model to distill consistent knowledge from the synthesized samples so that the model can learn domain-invariant information. Furthermore, we realize boundary growth across classes by using edge maps as an additional modality of samples when training multi-binary classifiers. In this way, it enlarges the boundary between the inliers and outliers, and consequently improves the unknown class recognition during open-set generalization. Extensive experiments show that our approach can achieve significant improvements and reach state-of-the-art performance on several cross-domain image classification datasets.
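One common way to realize the style augmentation mentioned above is AdaIN-style statistic swapping: renormalize content values to the mean and standard deviation of a style source. The sketch below shows the idea on a single toy channel; it is a generic stand-in, not the paper's exact augmentation, and real implementations operate on feature maps or image channels.

```python
import math

# Sketch: AdaIN-style style augmentation -- renormalize one channel to the
# mean/std of a style channel. A stand-in for the style augmentation above.

def channel_stats(x):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return mean, math.sqrt(var)

def stylize(content, style, eps=1e-6):
    """Shift/scale content values to match the style channel's statistics."""
    c_mean, c_std = channel_stats(content)
    s_mean, s_std = channel_stats(style)
    return [(v - c_mean) / (c_std + eps) * s_std + s_mean for v in content]

content = [0.0, 1.0, 2.0, 3.0]    # toy channel values
style = [10.0, 10.0, 20.0, 20.0]
out = stylize(content, style)
m, s = channel_stats(out)
print(round(m, 3), round(s, 3))   # matches the style stats: 15.0 5.0
```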

[AI-37] Exploring the Interplay Between Video Generation and World Models in Autonomous Driving: A Survey

链接: https://arxiv.org/abs/2411.02914
作者: Ao Fu,Yi Zhou,Tao Zhou,Yi Yang,Bojun Gao,Qun Li,Guobin Wu,Ling Shao
关键词-EN: World models, models, World, video generation, role in enhancing
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:World models and video generation are pivotal technologies in the domain of autonomous driving, each playing a critical role in enhancing the robustness and reliability of autonomous systems. World models, which simulate the dynamics of real-world environments, and video generation models, which produce realistic video sequences, are increasingly being integrated to improve situational awareness and decision-making capabilities in autonomous vehicles. This paper investigates the relationship between these two technologies, focusing on how their structural parallels, particularly in diffusion-based models, contribute to more accurate and coherent simulations of driving scenarios. We examine leading works such as JEPA, Genie, and Sora, which exemplify different approaches to world model design, thereby highlighting the lack of a universally accepted definition of world models. These diverse interpretations underscore the field’s evolving understanding of how world models can be optimized for various autonomous driving tasks. Furthermore, this paper discusses the key evaluation metrics employed in this domain, such as Chamfer distance for 3D scene reconstruction and Fréchet Inception Distance (FID) for assessing the quality of generated video content. By analyzing the interplay between video generation and world models, this survey identifies critical challenges and future research directions, emphasizing the potential of these technologies to jointly advance the performance of autonomous driving systems. The findings presented in this paper aim to provide a comprehensive understanding of how the integration of video generation and world models can drive innovation in the development of safer and more reliable autonomous vehicles.
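The Chamfer distance mentioned above as a 3D scene-reconstruction metric can be written down compactly. The sketch below uses one common squared-distance convention; definitions vary across papers (summed vs. averaged directions, squared vs. unsquared distances).

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point sets (squared-L2 form).

    One common convention: the average nearest-neighbor squared distance
    from each set to the other, summed over both directions.
    """
    def sq(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    def one_way(src, dst):
        return sum(min(sq(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(0.0, 1.0), (1.0, 1.0)]
print(chamfer_distance(a, b))  # each point is squared-distance 1 from its match -> 2.0
```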

[AI-38] WASHtsApp – A RAG-powered WhatsApp Chatbot for supporting rural African clean water access sanitation and hygiene

链接: https://arxiv.org/abs/2411.02850
作者: Simon Kloker,Alex Cedric Luyima,Matthew Bazanya
关键词-EN: educate rural African, rural African communities, clean water access, rural African, African communities
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
*备注: Working Paper

点击查看摘要

Abstract:This paper introduces WASHtsApp, a WhatsApp-based chatbot designed to educate rural African communities on clean water access, sanitation, and hygiene (WASH) principles. WASHtsApp leverages a Retrieval-Augmented Generation (RAG) approach to address the limitations of previous approaches with limited reach or missing contextualization. The paper details the development process, employing Design Science Research Methodology. The evaluation consisted of two phases: content validation by four WASH experts and community validation by potential users. Content validation confirmed WASHtsApp’s ability to provide accurate and relevant WASH-related information. Community validation indicated high user acceptance and perceived usefulness of the chatbot. The paper concludes by discussing the potential for further development, including incorporating local languages and user data analysis for targeted interventions. It also proposes future research cycles focused on wider deployment and leveraging user data for educational purposes.
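The RAG approach above retrieves relevant WASH material before generation. A minimal sketch of that retrieval step, with a word-overlap score over a hypothetical toy corpus — production systems (and presumably WASHtsApp) use vector embeddings rather than word overlap:

```python
# Sketch: the retrieval step of a RAG pipeline -- pick the document with
# the highest word overlap with the query, then prepend it to the prompt.
# Toy corpus and scoring; real systems use embedding similarity.

corpus = [
    "Boil water for at least one minute to make it safe to drink.",
    "Wash hands with soap after using the latrine.",
    "Store drinking water in a clean covered container.",
]

def score(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query, docs):
    return max(docs, key=lambda d: score(query, d))

def build_prompt(query, docs):
    context = retrieve(query, docs)
    return f"Context: {context}\nQuestion: {query}\nAnswer:"

query = "How do I make water safe to drink?"
print(build_prompt(query, corpus))
```

The generated prompt is then sent to the language model, which answers grounded in the retrieved context.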

[AI-39] Dissecting the Failure of Invariant Learning on Graphs

链接: https://arxiv.org/abs/2411.02847
作者: Qixun Wang,Yifei Wang,Yisen Wang,Xianghua Ying
关键词-EN: Structural Causal Model, Invariant Risk Minimization, Enhancing node-level, node-level OOD, area of research
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Enhancing node-level Out-Of-Distribution (OOD) generalization on graphs remains a crucial area of research. In this paper, we develop a Structural Causal Model (SCM) to theoretically dissect the performance of two prominent invariant learning methods – Invariant Risk Minimization (IRM) and Variance-Risk Extrapolation (VREx) – in node-level OOD settings. Our analysis reveals a critical limitation: due to the lack of class-conditional invariance constraints, these methods may struggle to accurately identify the structure of the predictive invariant ego-graph and consequently rely on spurious features. To address this, we propose Cross-environment Intra-class Alignment (CIA), which explicitly eliminates spurious features by aligning cross-environment representations conditioned on the same class, bypassing the need for explicit knowledge of the causal pattern structure. To adapt CIA to node-level OOD scenarios where environment labels are hard to obtain, we further propose CIA-LRA (Localized Reweighting Alignment) that leverages the distribution of neighboring labels to selectively align node representations, effectively distinguishing and preserving invariant features while removing spurious ones, all without relying on environment labels. We theoretically prove CIA-LRA’s effectiveness by deriving an OOD generalization error bound based on PAC-Bayesian analysis. Experiments on graph OOD benchmarks validate the superiority of CIA and CIA-LRA, marking a significant advancement in node-level OOD generalization. The codes are available at this https URL.
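The core of CIA — aligning cross-environment representations conditioned on the same class — can be illustrated with a simple penalty: for each class, measure how far apart the per-environment mean representations are. This is a 1-d scalar simplification for illustration only; the paper's method operates on learned node embeddings and, in CIA-LRA, avoids environment labels entirely.

```python
# Sketch: cross-environment intra-class alignment penalty -- within each
# class, penalize spread between per-environment mean representations.
# Simplified 1-d version of the idea described above.

def class_env_means(reps, labels, envs):
    """Mean representation per (class, environment) pair."""
    sums, counts = {}, {}
    for r, y, e in zip(reps, labels, envs):
        sums[(y, e)] = sums.get((y, e), 0.0) + r
        counts[(y, e)] = counts.get((y, e), 0) + 1
    return {k: sums[k] / counts[k] for k in sums}

def alignment_penalty(reps, labels, envs):
    """Sum of squared gaps between environment means, within each class."""
    means = class_env_means(reps, labels, envs)
    penalty = 0.0
    for y in set(labels):
        env_means = [m for (cls, _), m in means.items() if cls == y]
        mu = sum(env_means) / len(env_means)
        penalty += sum((m - mu) ** 2 for m in env_means)
    return penalty

# Toy 1-d representations: class 0 drifts between environments, class 1 doesn't.
reps   = [0.0, 0.2, 1.0, 1.2, 5.0, 5.0]
labels = [0,   0,   0,   0,   1,   1]
envs   = ["a", "a", "b", "b", "a", "b"]
print(alignment_penalty(reps, labels, envs))  # positive: class 0 is misaligned
```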

[AI-40] Correlation of Object Detection Performance with Visual Saliency and Depth Estimation

链接: https://arxiv.org/abs/2411.02844
作者: Matthias Bartolo,Dylan Seychell
关键词-EN: detection techniques continue, complementary visual tasks, object detection, object detection techniques, Pascal VOC
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Code Available at: this https URL

点击查看摘要

Abstract:As object detection techniques continue to evolve, understanding their relationships with complementary visual tasks becomes crucial for optimising model architectures and computational resources. This paper investigates the correlations between object detection accuracy and two fundamental visual tasks: depth prediction and visual saliency prediction. Through comprehensive experiments using state-of-the-art models (DeepGaze IIE, Depth Anything, DPT-Large, and Itti’s model) on COCO and Pascal VOC datasets, we find that visual saliency shows consistently stronger correlations with object detection accuracy (mA \rho up to 0.459 on Pascal VOC) compared to depth prediction (mA \rho up to 0.283). Our analysis reveals significant variations in these correlations across object categories, with larger objects showing correlation values up to three times higher than smaller objects. These findings suggest incorporating visual saliency features into object detection architectures could be more beneficial than depth information, particularly for specific object categories. The observed category-specific variations also provide insights for targeted feature engineering and dataset design improvements, potentially leading to more efficient and accurate object detection systems.
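The correlation values reported above are rank correlations; a minimal Spearman's ρ, computed as the Pearson correlation of ranks, looks like this (no tie handling, and the accuracy/saliency numbers below are hypothetical, not from the paper):

```python
def spearman_rho(x, y):
    """Spearman rank correlation via Pearson correlation of ranks.

    Minimal version: assumes no ties, so ranks form a permutation.
    """
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    mean = (len(x) - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var

# Detection accuracy vs. saliency score for five hypothetical categories.
accuracy = [0.2, 0.5, 0.4, 0.9, 0.7]
saliency = [0.1, 0.3, 0.2, 0.8, 0.6]
print(spearman_rho(accuracy, saliency))  # 1.0: the two rankings are identical
```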

[AI-41] Conditional Vendi Score: An Information-Theoretic Approach to Diversity Evaluation of Prompt-based Generative Models

链接: https://arxiv.org/abs/2411.02817
作者: Mohammad Jalali,Azim Ospanov,Amin Gohari,Farzan Farnia
关键词-EN: commonly evaluated based, generated data, Text-conditioned generation models, input text prompt, generative models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:Text-conditioned generation models are commonly evaluated based on the quality of the generated data and its alignment with the input text prompt. On the other hand, several applications of prompt-based generative models require sufficient diversity in the generated data to ensure the models’ capability of generating image and video samples possessing a variety of features. However, most existing diversity metrics are designed for unconditional generative models, and thus cannot distinguish the diversity arising from variations in text prompts and that contributed by the generative model itself. In this work, our goal is to quantify the prompt-induced and model-induced diversity in samples generated by prompt-based models. We propose an information-theoretic approach for internal diversity quantification, where we decompose the kernel-based entropy H(X) of the generated data X into the sum of the conditional entropy H(X|T), given text variable T, and the mutual information I(X; T) between the text and data variables. We introduce the Conditional-Vendi score based on H(X|T) to quantify the internal diversity of the model and the Information-Vendi score based on I(X; T) to measure the statistical relevance between the generated data and text prompts. We provide theoretical results to statistically interpret these scores and relate them to the unconditional Vendi score. We conduct several numerical experiments to show the correlation between the Conditional-Vendi score and the internal diversity of text-conditioned generative models. The codebase is available at this https URL.
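The decomposition H(X) = H(X|T) + I(X; T) can be sketched numerically with a matrix-based entropy: take the Shannon entropy of the eigenvalues of a trace-normalized kernel matrix for H(X), a sample-weighted average of per-prompt-group entropies for H(X|T), and their difference for I(X; T). This is an illustrative estimator with an assumed RBF kernel and bandwidth, not the paper's exact scores.

```python
import numpy as np

# Sketch: matrix-based entropy decomposition H(X) = H(X|T) + I(X;T).
# Illustrative estimator; kernel choice and bandwidth are assumptions.

def kernel_entropy(features, bandwidth=1.0):
    """Entropy of the eigenvalues of a trace-normalized RBF kernel matrix."""
    d2 = np.sum((features[:, None, :] - features[None, :, :]) ** 2, axis=-1)
    k = np.exp(-d2 / (2 * bandwidth ** 2))
    eig = np.linalg.eigvalsh(k / np.trace(k))
    eig = eig[eig > 1e-12]
    return float(-np.sum(eig * np.log(eig)))

def decompose(features, prompts):
    h_x = kernel_entropy(features)
    n = len(prompts)
    h_x_given_t = sum(
        (np.sum(prompts == t) / n) * kernel_entropy(features[prompts == t])
        for t in np.unique(prompts)
    )
    return h_x, h_x_given_t, h_x - h_x_given_t  # H(X), H(X|T), I(X;T)

rng = np.random.default_rng(0)
prompts = np.array([0] * 20 + [1] * 20)
# Prompt 1 samples are shifted: part of the diversity comes from the prompt.
features = rng.normal(size=(40, 2)) + prompts[:, None] * 3.0
h_x, h_cond, mi = decompose(features, prompts)
print(h_x, h_cond, mi)
```

With well-separated prompt clusters, the prompt-induced term I(X; T) approaches log 2, while H(X|T) captures the model's internal diversity within each prompt.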

[AI-42] DeepContext: A Context-aware Cross-platform and Cross-framework Tool for Performance Profiling and Analysis of Deep Learning Workloads

链接: https://arxiv.org/abs/2411.02797
作者: Qidong Zhao,Hao Wu,Yuming Hao,Zilingfeng Ye,Jiajia Li,Xu Liu,Keren Zhou
关键词-EN: deep learning models, deep learning, heterogeneous computing environments, deep learning frameworks, Effective performance profiling
类目: Performance (cs.PF); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Effective performance profiling and analysis are essential for optimizing training and inference of deep learning models, especially given the growing complexity of heterogeneous computing environments. However, existing tools often lack the capability to provide comprehensive program context information and performance optimization insights for sophisticated interactions between CPUs and GPUs. This paper introduces DeepContext, a novel profiler that links program contexts across high-level Python code, deep learning frameworks, underlying libraries written in C/C++, as well as device code executed on GPUs. DeepContext incorporates measurements of both coarse- and fine-grained performance metrics for major deep learning frameworks, such as PyTorch and JAX, and is compatible with GPUs from both Nvidia and AMD, as well as various CPU architectures, including x86 and ARM. In addition, DeepContext integrates a novel GUI that allows users to quickly identify hotspots and an innovative automated performance analyzer that suggests potential optimizations to users based on performance metrics and program context. Through detailed use cases, we demonstrate how DeepContext can help users identify and analyze performance issues to enable quick and effective optimization of deep learning workloads. We believe DeepContext is a valuable tool for users seeking to optimize complex deep learning workflows across multiple compute environments.

[AI-43] Specialized Foundation Models Struggle to Beat Supervised Baselines

链接: https://arxiv.org/abs/2411.02796
作者: Zongzhe Xu,Ritvik Gupta,Wenduo Cheng,Alexander Shen,Junhong Shen,Ameet Talwalkar,Mikhail Khodak
关键词-EN: vision and text, success for vision, rapidly expanded, pretraining large models, massive data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Genomics (q-bio.GN)
*备注: The first two authors contributed equally. The order was determined by coin flip

点击查看摘要

Abstract:Following its success for vision and text, the “foundation model” (FM) paradigm – pretraining large models on massive data, then fine-tuning on target tasks – has rapidly expanded to domains in the sciences, engineering, healthcare, and beyond. Has this achieved what the original FMs accomplished, i.e. the supplanting of traditional supervised learning in their domains? To answer we look at three modalities – genomics, satellite imaging, and time series – with multiple recent FMs and compare them to a standard supervised learning workflow: model development, hyperparameter tuning, and training, all using only data from the target task. Across these three specialized domains, we find that it is consistently possible to train simple supervised models – no more complicated than a lightly modified wide ResNet or UNet – that match or even outperform the latest foundation models. Our work demonstrates that the benefits of large-scale pretraining have yet to be realized in many specialized areas, reinforces the need to compare new FMs to strong, well-tuned baselines, and introduces two new, easy-to-use, open-source, and automated workflows for doing so.

[AI-44] When to Localize? A Risk-Constrained Reinforcement Learning Approach

链接: https://arxiv.org/abs/2411.02788
作者: Chak Lam Shek,Kasra Torshizi,Troi Williams,Pratap Tokekar
关键词-EN: standard navigation pipeline, lower navigational errors, navigation pipeline, navigational errors, standard navigation
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In a standard navigation pipeline, a robot localizes at every time step to lower navigational errors. However, in some scenarios, a robot needs to selectively localize when it is expensive to obtain observations. For example, an underwater robot surfacing to localize too often hinders it from searching for critical items underwater, such as black boxes from crashed aircraft. On the other hand, if the robot never localizes, poor state estimates cause failure to find the items due to inadvertently leaving the search area or entering hazardous, restricted areas. Motivated by these scenarios, we investigate approaches to help a robot determine “when to localize?” We formulate this as a bi-criteria optimization problem: minimize the number of localization actions while ensuring the probability of failure (due to collision or not reaching a desired goal) remains bounded. In recent work, we showed how to formulate this active localization problem as a constrained Partially Observable Markov Decision Process (POMDP), which was solved using an online POMDP solver. However, this approach is too slow and requires full knowledge of the robot transition and observation models. In this paper, we present RiskRL, a constrained Reinforcement Learning (RL) framework that overcomes these limitations. RiskRL uses particle filtering and recurrent Soft Actor-Critic network to learn a policy that minimizes the number of localizations while ensuring the probability of failure constraint is met. Our numerical experiments show that RiskRL learns a robust policy that outperforms the baseline by at least 13% while also generalizing to unseen environments.
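RiskRL maintains the robot's belief over its state with particle filtering. A minimal bootstrap particle-filter step (predict with a motion model, weight by an observation likelihood, resample) for 1-d localization is sketched below; it is a generic illustration with assumed noise parameters, not the paper's implementation.

```python
import math
import random

# Sketch: one bootstrap particle-filter step for 1-d robot localization.
# Generic illustration; noise parameters are assumptions.

def pf_step(particles, control, observation, rng,
            motion_noise=0.1, obs_noise=0.5):
    # Predict: move every particle by the control input plus process noise.
    predicted = [p + control + rng.gauss(0, motion_noise) for p in particles]
    # Weight: Gaussian likelihood of the observation under each particle.
    weights = [math.exp(-((observation - p) ** 2) / (2 * obs_noise ** 2))
               for p in predicted]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Resample: draw particles in proportion to their weights.
    return rng.choices(predicted, weights=weights, k=len(particles))

rng = random.Random(0)
particles = [rng.uniform(-5.0, 5.0) for _ in range(200)]
true_pos = 0.0
for _ in range(10):               # the robot moves +1 per step
    true_pos += 1.0
    obs = true_pos + rng.gauss(0, 0.5)
    particles = pf_step(particles, 1.0, obs, rng)
estimate = sum(particles) / len(particles)
print(round(estimate, 2))         # the particle mean tracks the true position
```

In the paper's setting, observations are only available when the robot chooses to localize; between localizations, only the predict step would run and the belief would spread.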

[AI-45] Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment

链接: https://arxiv.org/abs/2411.02785
作者: Jason Vega,Junsheng Huang,Gaokai Zhang,Hangoo Kang,Minjia Zhang,Gagandeep Singh
关键词-EN: Large Language Models, Large Language, Language Models, critical objective, Safety alignment
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Under peer review

点击查看摘要

Abstract:Safety alignment of Large Language Models (LLMs) has recently become a critical objective of model developers. In response, a growing body of work has been investigating how safety alignment can be bypassed through various jailbreaking methods, such as adversarial attacks. However, these jailbreak methods can be rather costly or involve a non-trivial amount of creativity and effort, introducing the assumption that malicious users are high-resource or sophisticated. In this paper, we study how simple random augmentations to the input prompt affect safety alignment effectiveness in state-of-the-art LLMs, such as Llama 3 and Qwen 2. We perform an in-depth evaluation of 17 different models and investigate the intersection of safety under random augmentations with multiple dimensions: augmentation type, model size, quantization, fine-tuning-based defenses, and decoding strategies (e.g., sampling temperature). We show that low-resource and unsophisticated attackers, i.e., “stochastic monkeys”, can significantly improve their chances of bypassing alignment with just 25 random augmentations per prompt.
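The kind of cheap character-level random augmentation studied above can be sketched in a few lines: sample small random edits (insert, swap, substitute) of a prompt and collect the variants. The three edit types here are generic stand-ins, not necessarily the paper's exact augmentation suite.

```python
import random
import string

# Sketch: cheap character-level random augmentations of a prompt.
# Edit types are generic stand-ins for the augmentations studied above.

def augment(prompt, n_edits, rng):
    chars = list(prompt)
    for _ in range(n_edits):
        op = rng.choice(["insert", "swap", "substitute"])
        i = rng.randrange(len(chars))
        if op == "insert":
            chars.insert(i, rng.choice(string.ascii_lowercase))
        elif op == "swap" and len(chars) > 1:
            j = rng.randrange(len(chars))
            chars[i], chars[j] = chars[j], chars[i]
        else:
            chars[i] = rng.choice(string.ascii_lowercase)
    return "".join(chars)

rng = random.Random(7)
prompt = "Please answer the following question."
# 25 sampled variants of the prompt, one random edit each.
variants = {augment(prompt, 1, rng) for _ in range(25)}
print(len(variants), "distinct variants")
```

Each variant would then be submitted to the target model; the attack succeeds if any variant slips past the safety alignment.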

[AI-46] EcoCropsAID: Economic Crops Aerial Image Dataset for Land Use Classification

链接: https://arxiv.org/abs/2411.02762
作者: Sangdaow Noppitak,Emmanuel Okafor,Olarik Surinta
关键词-EN: Google Earth application, Google Earth, aerial images captured, Earth application, comprehensive collection
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 12 pages, 7 figures

点击查看摘要

Abstract:The EcoCropsAID dataset is a comprehensive collection of 5,400 aerial images captured between 2014 and 2018 using the Google Earth application. This dataset focuses on five key economic crops in Thailand: rice, sugarcane, cassava, rubber, and longan. The images were collected at various crop growth stages: early cultivation, growth, and harvest, resulting in significant variability within each category and similarities across different categories. These variations, coupled with differences in resolution, color, and contrast introduced by multiple remote imaging sensors, present substantial challenges for land use classification. The dataset is an interdisciplinary resource that spans multiple research domains, including remote sensing, geoinformatics, artificial intelligence, and computer vision. The unique features of the EcoCropsAID dataset offer opportunities for researchers to explore novel approaches, such as extracting spatial and temporal features, developing deep learning architectures, and implementing transformer-based models. The EcoCropsAID dataset provides a valuable platform for advancing research in land use classification, with implications for optimizing agricultural practices and enhancing sustainable development. This study explicitly investigates the use of deep learning algorithms to classify economic crop areas in northeastern Thailand, utilizing satellite imagery to address the challenges posed by diverse patterns and similarities across categories.

[AI-47] A Bayesian explanation of machine learning models based on modes and functional ANOVA

链接: https://arxiv.org/abs/2411.02746
作者: Quan Long
关键词-EN: focus on providing, providing reasons, XAI, Bayesian inverse problem, label
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:Most methods in explainable AI (XAI) focus on providing reasons for the prediction of a given set of features. However, we solve an inverse explanation problem: given the deviation of a label, find the reasons for this deviation. We use a Bayesian framework to recover the “true” features, conditioned on the observed label value. We efficiently explain the deviation of a label value from the mode by identifying and ranking the influential features using the “distances” in the ANOVA functional decomposition. We show that the new method is more human-intuitive and robust than methods based on mean values, e.g., SHapley Additive exPlanations (SHAP values). The extra costs of solving a Bayesian inverse problem are dimension-independent.

[AI-48] V-DPO: Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization EMNLP2024

链接: https://arxiv.org/abs/2411.02712
作者: Yuxi Xie,Guanzhen Li,Xiao Xu,Min-Yen Kan
关键词-EN: Large vision-language models, Large Language Model, output textual response, input visual content, Large vision-language
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: EMNLP 2024 Findings; 9 pages, 6 figures, 5 tables (16 pages, 8 figures, 8 tables including references and appendices)

点击查看摘要

Abstract:Large vision-language models (LVLMs) suffer from hallucination, resulting in misalignment between the output textual response and the input visual content. Recent research indicates that the over-reliance on the Large Language Model (LLM) backbone, as one cause of the LVLM hallucination, inherently introduces bias from language priors, leading to insufficient context attention to the visual inputs. We tackle this issue of hallucination by mitigating such over-reliance through preference learning. We propose Vision-guided Direct Preference Optimization (V-DPO) to enhance visual context learning at training time. To interpret the effectiveness and generalizability of V-DPO on different types of training data, we construct a synthetic dataset containing both response- and image-contrast preference pairs, compared against existing human-annotated hallucination samples. Our approach achieves significant improvements compared with baseline methods across various hallucination benchmarks. Our analysis indicates that V-DPO excels in learning from image-contrast preference data, demonstrating its superior ability to elicit and understand nuances of visual context. Our code is publicly available at this https URL.
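V-DPO builds on Direct Preference Optimization. The standard DPO loss for one preference pair is sketched below; the vision-guided part (how response- and image-contrast pairs are constructed) is not reproduced here, and the log-probabilities are toy values.

```python
import math

# Sketch: the standard DPO loss that V-DPO builds on, for one preference
# pair: -log sigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l))),
# where w is the chosen response and l the rejected one.

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy prefers the chosen response more than the reference does: low loss.
low = dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-6.0, ref_logp_l=-8.0)
# Policy prefers the rejected response: higher loss.
high = dpo_loss(logp_w=-9.0, logp_l=-5.0, ref_logp_w=-8.0, ref_logp_l=-6.0)
print(low < high)  # True
```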

[AI-49] JPEC: A Novel Graph Neural Network for Competitor Retrieval in Financial Knowledge Graphs SIGIR’24

链接: https://arxiv.org/abs/2411.02692
作者: Wanying Ding,Manoj Cherukumalli,Santosh Chikoti,Vinay K. Chaudhri
关键词-EN: complex data effectively, analyze complex data, data effectively, gained popularity, ability to organize
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
*备注: 5 pages, 4 figures, accepted by SIGIR’24

点击查看摘要

Abstract:Knowledge graphs have gained popularity for their ability to organize and analyze complex data effectively. When combined with graph embedding techniques, such as graph neural networks (GNNs), knowledge graphs become a potent tool in providing valuable insights. This study explores the application of graph embedding in identifying competitors from a financial knowledge graph. Existing state-of-the-art (SOTA) models face challenges due to the unique attributes of our knowledge graph, including directed and undirected relationships, attributed nodes, and minimal annotated competitor connections. To address these challenges, we propose a novel graph embedding model, JPEC (JPMorgan Proximity Embedding for Competitor Detection), which utilizes a graph neural network to learn from both first-order and second-order node proximity together with vital features for competitor retrieval. JPEC outperformed most existing models in extensive experiments, showcasing its effectiveness in competitor retrieval.

[AI-50] Geometry of naturalistic object representations in recurrent neural network models of working memory

链接: https://arxiv.org/abs/2411.02685
作者: Xiaoxuan Lei,Takuya Ito,Pouya Bashivan
关键词-EN: central cognitive ability, cognitive ability crucial, Working memory, intelligent decision-making, ability crucial
类目: Artificial Intelligence (cs.AI); Computational Geometry (cs.CG); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Working memory is a central cognitive ability crucial for intelligent decision-making. Recent experimental and computational work studying working memory has primarily used categorical (i.e., one-hot) inputs, rather than ecologically relevant, multidimensional naturalistic ones. Moreover, studies have primarily investigated working memory during single or few cognitive tasks. As a result, an understanding of how naturalistic object information is maintained in working memory in neural networks is still lacking. To bridge this gap, we developed sensory-cognitive models, comprising a convolutional neural network (CNN) coupled with a recurrent neural network (RNN), and trained them on nine distinct N-back tasks using naturalistic stimuli. By examining the RNN’s latent space, we found that: (1) Multi-task RNNs represent both task-relevant and irrelevant information simultaneously while performing tasks; (2) The latent subspaces used to maintain specific object properties in vanilla RNNs are largely shared across tasks, but highly task-specific in gated RNNs such as GRU and LSTM; (3) Surprisingly, RNNs embed objects in new representational spaces in which individual object features are less orthogonalized relative to the perceptual space; (4) The transformation of working memory encodings (i.e., embedding of visual inputs in the RNN latent space) into memory was shared across stimuli, yet the transformations governing the retention of a memory in the face of incoming distractor stimuli were distinct across time. Our findings indicate that goal-driven RNNs employ chronological memory subspaces to track information over short time spans, enabling testable predictions with neural data.
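The N-back tasks used to train these models have a simple label structure: respond "match" when the current stimulus equals the one presented N steps earlier. A minimal sketch of that label generation (symbols stand in for the naturalistic object images, and comparing whole stimuli is a simplification of the property-specific variants):

```python
# Sketch: target labels for an N-back working-memory task -- True at
# position i iff the stimulus matches the one shown N steps earlier.

def n_back_labels(stimuli, n):
    return [i >= n and stimuli[i] == stimuli[i - n]
            for i in range(len(stimuli))]

stimuli = ["cat", "dog", "cat", "cat", "dog", "cat"]
print(n_back_labels(stimuli, 2))
# [False, False, True, False, False, True]
```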

[AI-51] owards Intelligent Augmented Reality (iAR): A Taxonomy of Context an Architecture for iAR and an Empirical Study

链接: https://arxiv.org/abs/2411.02684
作者: Shakiba Davari,Daniel Stover,Alexander Giovannelli,Cory Ilo,Doug A. Bowman
关键词-EN: Augmented Reality, Recent advancements, enhancing interface effectiveness, advancements in Augmented, research have highlighted
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advancements in Augmented Reality (AR) research have highlighted the critical role of context awareness in enhancing interface effectiveness and user experience. This underscores the need for intelligent AR (iAR) interfaces that dynamically adapt across various contexts to provide optimal experiences. In this paper, we (a) propose a comprehensive framework for context-aware inference and adaptation in iAR, (b) introduce a taxonomy that describes context through quantifiable input data, and (c) present an architecture that outlines the implementation of our proposed framework and taxonomy within iAR. Additionally, we present an empirical AR experiment to observe user behavior and record user performance, context, and user-specified adaptations to the AR interfaces within a context-switching scenario. We (d) explore the nuanced relationships between context and user adaptations in this scenario and discuss the significance of our framework in identifying these patterns. This experiment emphasizes the significance of context-awareness in iAR and provides a preliminary training dataset for this specific scenario.

[AI-52] Fair In-Context Learning via Latent Concept Variables

链接: https://arxiv.org/abs/2411.02671
作者: Karuna Bhaila,Minh-Hao Van,Kennedy Edemacu,Chen Zhao,Feng Chen,Xintao Wu
关键词-EN: ability of large, large language models, ICL, large language, facilitated by serialization
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 12 pages

点击查看摘要

Abstract:The emerging in-context learning (ICL) ability of large language models (LLMs) has prompted their use for predictive tasks in various domains with different types of data facilitated by serialization methods. However, with increasing applications in high-stakes domains, it has been shown that LLMs can inherit social bias and discrimination from their pre-training data. In this work, we investigate this inherent bias in LLMs during in-context learning with tabular data. We focus on an optimal demonstration selection approach that utilizes latent concept variables for resource-efficient task adaptation. We design data augmentation strategies that reduce correlation between predictive outcomes and sensitive variables helping to promote fairness during latent concept learning. We utilize the learned concept and select demonstrations from a training dataset to obtain fair predictions during inference while maintaining model utility. The latent concept variable is learned using a smaller internal LLM and the selected demonstrations can be used for inference with larger external LLMs. We empirically verify that the fair latent variable approach improves fairness results on tabular datasets compared to multiple heuristic demonstration selection methods.

[AI-53] From Twitter to Reasoner: Understand Mobility Travel Modes and Sentiment Using Large Language Models ITSC2024

链接: https://arxiv.org/abs/2411.02666
作者: Kangrui Ruan,Xinyang Wang,Xuan Di
关键词-EN: improve service quality, regulate mobility services, individuals’ travel choices, Social media, service quality
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 6 pages; Accepted by ITSC 2024

点击查看摘要

Abstract:Social media has become an important platform for people to express their opinions towards transportation services and infrastructure, which holds the potential for researchers to gain a deeper understanding of individuals’ travel choices, for transportation operators to improve service quality, and for policymakers to regulate mobility services. A significant challenge, however, lies in the unstructured nature of social media data. In other words, textual data like social media is not labeled, and large-scale manual annotations are cost-prohibitive. In this study, we introduce a novel methodological framework utilizing Large Language Models (LLMs) to infer the mentioned travel modes from social media posts, and reason people’s attitudes toward the associated travel mode, without the need for manual annotation. We compare different LLMs along with various prompting engineering methods in light of human assessment and LLM verification. We find that most social media posts manifest negative rather than positive sentiments. We thus identify the contributing factors to these negative posts and, accordingly, propose recommendations to traffic operators and policymakers.

[AI-54] Explanations that reveal all through the definition of encoding NEURIPS2024

链接: https://arxiv.org/abs/2411.02664
作者: Aahlad Puli,Nhi Nguyen,Rajesh Ranganath
关键词-EN: Feature attributions attempt, Feature attributions, predictive power, drive predictive power, inputs drive predictive
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 35 pages, 7 figures, 6 tables, 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:Feature attributions attempt to highlight what inputs drive predictive power. Good attributions or explanations are thus those that produce inputs that retain this predictive power; accordingly, evaluations of explanations score their quality of prediction. However, evaluations produce scores better than what appears possible from the values in the explanation for a class of explanations, called encoding explanations. Probing for encoding remains a challenge because there is no general characterization of what gives the extra predictive power. We develop a definition of encoding that identifies this extra predictive power via conditional dependence and show that the definition fits existing examples of encoding. This definition implies, in contrast to encoding explanations, that non-encoding explanations contain all the informative inputs used to produce the explanation, giving them a “what you see is what you get” property, which makes them transparent and simple to use. Next, we prove that existing scores (ROAR, FRESH, EVAL-X) do not rank non-encoding explanations above encoding ones, and develop STRIPE-X which ranks them correctly. After empirically demonstrating the theoretical insights, we use STRIPE-X to uncover encoding in LLM-generated explanations for predicting the sentiment in movie reviews.

[AI-55] M-CELS: Counterfactual Explanation for Multivariate Time Series Data Guided by Learned Saliency Maps ICMLA

链接: https://arxiv.org/abs/2411.02649
作者: Peiyu Li,Omar Bahri,Soukaina Filali Boubrahimi,Shah Muhammad Hamdi
关键词-EN: received great attention, multivariate time series, time series classification, time series, multivariate time
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at ICMLA 2024. arXiv admin note: text overlap with arXiv:2410.20539

点击查看摘要

Abstract:Over the past decade, multivariate time series classification has received great attention. Machine learning (ML) models for multivariate time series classification have made significant strides and achieved impressive success in a wide range of applications and tasks. The challenge of many state-of-the-art ML models is a lack of transparency and interpretability. In this work, we introduce M-CELS, a counterfactual explanation model designed to enhance interpretability in multidimensional time series classification tasks. Our experimental validation involves comparing M-CELS with leading state-of-the-art baselines, utilizing seven real-world time-series datasets from the UEA repository. The results demonstrate the superior performance of M-CELS in terms of validity, proximity, and sparsity, reinforcing its effectiveness in providing transparent insights into the decisions of machine learning models applied to multivariate time series data.

[AI-56] Intelligent Video Recording Optimization using Activity Detection for Surveillance Systems

链接: https://arxiv.org/abs/2411.02632
作者: Youssef Elmir,Hayet Touati,Ouassila Melizou
关键词-EN: managing vast amounts, leading to inefficient, event retrieval, struggle with managing, managing vast
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 19 pages, 6 figures, This manuscript has been accepted for publication in ACTA UNIVERSITATIS SAPIENTIAE, Informatica with minor revisions

点击查看摘要

Abstract:Surveillance systems often struggle with managing vast amounts of footage, much of which is irrelevant, leading to inefficient storage and challenges in event retrieval. This paper addresses these issues by proposing an optimized video recording solution focused on activity detection. The proposed approach utilizes a hybrid method that combines motion detection via frame subtraction with object detection using YOLOv9. This strategy specifically targets the recording of scenes involving human or car activity, thereby reducing unnecessary footage and optimizing storage usage. The developed model demonstrates superior performance, achieving precision metrics of 0.855 for car detection and 0.884 for person detection, and reducing the storage requirements by two-thirds compared to traditional surveillance systems that rely solely on motion detection. This significant reduction in storage highlights the effectiveness of the proposed approach in enhancing surveillance system efficiency. Nonetheless, some limitations persist, particularly the occurrence of false positives and false negatives in adverse weather conditions, such as strong winds.
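
The hybrid gating idea above (a cheap frame-subtraction motion check first, object detection only on frames where motion fires) can be sketched in a few lines. This is a minimal illustration, not the authors' code: `detect_objects` is a hypothetical stand-in for a YOLOv9 inference call, and the thresholds are placeholder values.

```python
def motion_detected(prev_frame, frame, diff_thresh=25, area_frac=0.01):
    """Cheap frame-subtraction gate: flag motion when enough pixels change."""
    total = changed = 0
    for prev_row, cur_row in zip(prev_frame, frame):
        for p, c in zip(prev_row, cur_row):
            total += 1
            if abs(c - p) > diff_thresh:
                changed += 1
    return changed / total > area_frac

def should_record(prev_frame, frame, detect_objects):
    """Record only when the cheap motion gate fires AND the (more expensive)
    object detector confirms a person or car, mirroring the hybrid strategy."""
    if not motion_detected(prev_frame, frame):
        return False
    labels = detect_objects(frame)  # stand-in for a YOLOv9 inference call
    return bool({"person", "car"} & set(labels))

# toy 8x8 grayscale frames: a static background vs. one with a bright blob
background = [[0] * 8 for _ in range(8)]
current = [row[:] for row in background]
for r in range(2):
    for c in range(2):
        current[r][c] = 200

print(should_record(background, background, lambda f: []))       # False
print(should_record(background, current, lambda f: ["person"]))  # True
```

Running the detector only behind the motion gate is what yields the storage savings: most frames are rejected by the inexpensive subtraction step before any neural inference happens.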

[AI-57] EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector

链接: https://arxiv.org/abs/2411.02625
作者: Deok-Hyeon Cho,Hyung-Seok Oh,Seung-Bin Kim,Seong-Whan Lee
关键词-EN: challenges remain owing, achieved significant progress, emotional speech datasets, technology has achieved, recent years
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Emotional text-to-speech (TTS) technology has achieved significant progress in recent years; however, challenges remain owing to the inherent complexity of emotions and limitations of the available emotional speech datasets and models. Previous studies typically relied on limited emotional speech datasets or required extensive manual annotations, restricting their ability to generalize across different speakers and emotional styles. In this paper, we present EmoSphere++, an emotion-controllable zero-shot TTS model that can control emotional style and intensity to resemble natural human speech. We introduce a novel emotion-adaptive spherical vector that models emotional style and intensity without human annotation. Moreover, we propose a multi-level style encoder that can ensure effective generalization for both seen and unseen speakers. We also introduce additional loss functions to enhance the emotion transfer performance for zero-shot scenarios. We employ a conditional flow matching-based decoder to achieve high-quality and expressive emotional TTS in a few sampling steps. Experimental results demonstrate the effectiveness of the proposed framework.
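
As a rough illustration of what an "emotion-adaptive spherical vector" could look like, the sketch below converts an arousal-valence-dominance (AVD) point, taken relative to a neutral origin, into spherical coordinates, so that the radius models emotion intensity and the two angles model emotional style. The AVD parameterization and the neutral origin are assumptions made for illustration, not the paper's exact formulation.

```python
import math

def avd_to_spherical(arousal, valence, dominance):
    """Map an arousal-valence-dominance offset from a neutral point to
    spherical coordinates: radius ~ emotional intensity, angles ~ style."""
    r = math.sqrt(arousal ** 2 + valence ** 2 + dominance ** 2)
    theta = math.acos(dominance / r) if r > 0 else 0.0  # polar angle
    phi = math.atan2(valence, arousal)                  # azimuthal angle
    return r, theta, phi

r, theta, phi = avd_to_spherical(0.6, -0.8, 0.0)
print(round(r, 3))  # 1.0: a full-intensity emotion on the unit sphere
```

Decoupling intensity (radius) from style (angles) is what would let a controller scale how strongly an emotion is rendered without changing which emotion it is.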

[AI-58] Enhancing Indoor Mobility with Connected Sensor Nodes: A Real-Time Delay-Aware Cooperative Perception Approach

链接: https://arxiv.org/abs/2411.02624
作者: Minghao Ning,Yaodong Cui,Yufeng Yang,Shucheng Huang,Zhenan Liu,Ahmad Reza Alghooneh,Ehsan Hashemi,Amir Khajepour
关键词-EN: intelligent mobility platforms, mobility platforms operating, dynamic indoor environments, Lidar Camera Fusion, cooperative perception system
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:This paper presents a novel real-time, delay-aware cooperative perception system designed for intelligent mobility platforms operating in dynamic indoor environments. The system contains a network of multi-modal sensor nodes and a central node that collectively provide perception services to mobility platforms. The proposed Hierarchical Clustering Considering the Scanning Pattern and Ground Contacting Feature based Lidar Camera Fusion improve intra-node perception for crowded environment. The system also features delay-aware global perception to synchronize and aggregate data across nodes. To validate our approach, we introduced the Indoor Pedestrian Tracking dataset, compiled from data captured by two indoor sensor nodes. Our experiments, compared to baselines, demonstrate significant improvements in detection accuracy and robustness against delays. The dataset is available in the repository: this https URL

[AI-59] Learning to Assist Humans without Inferring Rewards NEURIPS

链接: https://arxiv.org/abs/2411.02623
作者: Vivek Myers,Evan Ellis,Sergey Levine,Benjamin Eysenbach,Anca Dragan
关键词-EN: humans’ lives easier, make humans’ lives, lives easier, make humans’, humans’ lives
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Conference on Neural Information Processing Systems (NeurIPS), 2024

点击查看摘要

Abstract:Assistive agents should make humans’ lives easier. Classically, such assistance is studied through the lens of inverse reinforcement learning, where an assistive agent (e.g., a chatbot, a robot) infers a human’s intention and then selects actions to help the human reach that goal. This approach requires inferring intentions, which can be difficult in high-dimensional settings. We build upon prior work that studies assistance through the lens of empowerment: an assistive agent aims to maximize the influence of the human’s actions such that they exert a greater control over the environmental outcomes and can solve tasks in fewer steps. We lift the major limitation of prior work in this area–scalability to high-dimensional settings–with contrastive successor representations. We formally prove that these representations estimate a similar notion of empowerment to that studied by prior work and provide a ready-made mechanism for optimizing it. Empirically, our proposed method outperforms prior methods on synthetic benchmarks, and scales to Overcooked, a cooperative game setting. Theoretically, our work connects ideas from information theory, neuroscience, and reinforcement learning, and charts a path for representations to play a critical role in solving assistive problems.

[AI-60] Pseudo-Probability Unlearning: Towards Efficient and Privacy-Preserving Machine Unlearning

链接: https://arxiv.org/abs/2411.02622
作者: Zihao Zhao,Yijiang Li,Yuchen Yang,Wenqing Zhang,Nuno Vasconcelos,Yinzhi Cao
关键词-EN: General Data Protection, Data Protection Regulation, Protection Regulation, addressing biased data, Machine unlearning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Machine unlearning–enabling a trained model to forget specific data–is crucial for addressing biased data and adhering to privacy regulations like the General Data Protection Regulation (GDPR)'s “right to be forgotten”. Recent works have paid little attention to privacy concerns, leaving the data intended for forgetting vulnerable to membership inference attacks. Moreover, they often come with high computational overhead. In this work, we propose Pseudo-Probability Unlearning (PPU), a novel method that enables models to forget data efficiently and in a privacy-preserving manner. Our method replaces the final-layer output probabilities of the neural network with pseudo-probabilities for the data to be forgotten. These pseudo-probabilities follow either a uniform distribution or align with the model’s overall distribution, enhancing privacy and reducing risk of membership inference attacks. Our optimization strategy further refines the predictive probability distributions and updates the model’s weights accordingly, ensuring effective forgetting with minimal impact on the model’s overall performance. Through comprehensive experiments on multiple benchmarks, our method achieves over 20% improvements in forgetting error compared to the state-of-the-art. Additionally, our method enhances privacy by preventing the forgotten set from being inferred to around random guesses.

[AI-61] Decoupled Data Augmentation for Improving Image Classification

链接: https://arxiv.org/abs/2411.02592
作者: Ruoxin Chen,Zhe Wang,Ke-Yue Zhang,Shuang Wu,Jiamu Sun,Shouli Wang,Taiping Yao,Shouhong Ding
关键词-EN: Recent advancements, enhancing image classification, shown promise, promise in enhancing, Recent
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advancements in image mixing and generative data augmentation have shown promise in enhancing image classification. However, these techniques face the challenge of balancing semantic fidelity with diversity. Specifically, image mixing involves interpolating two images to create a new one, but this pixel-level interpolation can compromise fidelity. Generative augmentation uses text-to-image generative models to synthesize or modify images, often limiting diversity to avoid generating out-of-distribution data that potentially affects accuracy. We propose that this fidelity-diversity dilemma partially stems from the whole-image paradigm of existing methods. Since an image comprises the class-dependent part (CDP) and the class-independent part (CIP), where each part has fundamentally different impacts on the image’s fidelity, treating different parts uniformly can therefore be misleading. To address this fidelity-diversity dilemma, we introduce Decoupled Data Augmentation (De-DA), which resolves the dilemma by separating images into CDPs and CIPs and handling them adaptively. To maintain fidelity, we use generative models to modify real CDPs under controlled conditions, preserving semantic consistency. To enhance diversity, we replace the image’s CIP with inter-class variants, creating diverse CDP-CIP combinations. Additionally, we implement an online randomized combination strategy during training to generate numerous distinct CDP-CIP combinations cost-effectively. Comprehensive empirical evaluations validate the effectiveness of our method.
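
The online randomized combination step, pairing class-dependent parts (CDPs) with inter-class class-independent parts (CIPs) at training time, reduces to cheap independent sampling. A toy sketch, with string placeholders standing in for the actual image parts and the compositing step left out:

```python
import random

def randomized_combinations(cdps, cips, n, seed=0):
    """Sample class-dependent and class-independent parts independently to
    generate many distinct augmented combinations at negligible cost."""
    rng = random.Random(seed)
    return [(rng.choice(cdps), rng.choice(cips)) for _ in range(n)]

pairs = randomized_combinations(
    ["cat_fg", "dog_fg"],                # CDPs (semantically controlled)
    ["grass_bg", "road_bg", "snow_bg"],  # CIPs (inter-class variants)
    n=4,
)
print(len(pairs))  # 4 CDP-CIP combinations drawn from 2 x 3 possibilities
```

Because the two parts are drawn independently, the number of reachable combinations grows multiplicatively with the part pools, which is the source of the diversity without extra generation cost.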

[AI-62] Multi-Agent Decision Transformers for Dynamic Dispatching in Material Handling Systems Leveraging Enterprise Big Data

链接: https://arxiv.org/abs/2411.02584
作者: Xian Yeow Lee,Haiyan Wang,Daisuke Katsumata,Takaharu Matsui,Chetan Gupta
关键词-EN: Decision Transformers, ensuring efficient operations, Transformers, Decision, Dynamic dispatching rules
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Dynamic dispatching rules that allocate resources to tasks in real-time play a critical role in ensuring efficient operations of many automated material handling systems across industries. Traditionally, the dispatching rules deployed are typically the result of manually crafted heuristics based on domain experts’ knowledge. Generating these rules is time-consuming and often sub-optimal. As enterprises increasingly accumulate vast amounts of operational data, there is significant potential to leverage this big data to enhance the performance of automated systems. One promising approach is to use Decision Transformers, which can be trained on existing enterprise data to learn better dynamic dispatching rules for improving system throughput. In this work, we study the application of Decision Transformers as dynamic dispatching policies within an actual multi-agent material handling system and identify scenarios where enterprises can effectively leverage Decision Transformers on existing big data to gain business value. Our empirical results demonstrate that Decision Transformers can improve the material handling system’s throughput by a considerable amount when the heuristic originally used in the enterprise data exhibits moderate performance and involves no randomness. When the original heuristic has strong performance, Decision Transformers can still improve the throughput but with a smaller improvement margin. However, when the original heuristics contain an element of randomness or when the performance of the dataset is below a certain threshold, Decision Transformers fail to outperform the original heuristic. These results highlight both the potential and limitations of Decision Transformers as dispatching policies for automated industrial material handling systems.

[AI-63] ViTally Consistent: Scaling Biological Representation Learning for Cell Microscopy NEURIPS2024

链接: https://arxiv.org/abs/2411.02572
作者: Kian Kenyon-Dean,Zitong Jerry Wang,John Urbanik,Konstantin Donhauser,Jason Hartford,Saber Saberian,Nil Sahin,Ihab Bendidi,Safiye Celik,Marta Fay,Juan Sebastian Rodriguez Vera,Imran S Haque,Oren Kraus
关键词-EN: molecular biology research, drug discovery, discovery and molecular, molecular biology, biology research
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: NeurIPS 2024 Foundation Models for Science Workshop (38th Conference on Neural Information Processing Systems). 18 pages, 7 figures

点击查看摘要

Abstract:Large-scale cell microscopy screens are used in drug discovery and molecular biology research to study the effects of millions of chemical and genetic perturbations on cells. To use these images in downstream analysis, we need models that can map each image into a feature space that represents diverse biological phenotypes consistently, in the sense that perturbations with similar biological effects have similar representations. In this work, we present the largest foundation model for cell microscopy data to date, a new 1.9 billion-parameter ViT-G/8 MAE trained on over 8 billion microscopy image crops. Compared to a previously published ViT-L/8 MAE, our new model achieves a 60% improvement in linear separability of genetic perturbations and obtains the best overall performance on whole-genome biological relationship recall and replicate consistency benchmarks. Beyond scaling, we developed two key methods that improve performance: (1) training on a curated and diverse dataset; and, (2) using biologically motivated linear probing tasks to search across each transformer block for the best candidate representation of whole-genome screens. We find that many self-supervised vision transformers, pretrained on either natural or microscopy images, yield significantly more biologically meaningful representations of microscopy images in their intermediate blocks than in their typically used final blocks. More broadly, our approach and results provide insights toward a general strategy for successfully building foundation models for large-scale biological data.

[AI-64] The Intersectionality Problem for Algorithmic Fairness

链接: https://arxiv.org/abs/2411.02569
作者: Johannes Himmelreich,Arbie Hsu,Kristian Lum,Ellen Veomett
关键词-EN: intersection of multiple, multiple groups, algorithmic fairness, achieving fairness, unmet challenge
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 18 pages, 3 figures

点击查看摘要

Abstract:A yet unmet challenge in algorithmic fairness is the problem of intersectionality, that is, achieving fairness across the intersection of multiple groups – and verifying that such fairness has been attained. Because intersectional groups tend to be small, verifying whether a model is fair raises statistical as well as moral-methodological challenges. This paper (1) elucidates the problem of intersectionality in algorithmic fairness, (2) develops desiderata to clarify the challenges underlying the problem and guide the search for potential solutions, (3) illustrates the desiderata and potential solutions by sketching a proposal using simple hypothesis testing, and (4) evaluates, partly empirically, this proposal against the proposed desiderata.
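
The statistical difficulty the paper highlights, that intersectional groups are small and fairness therefore hard to verify, is easy to see with the kind of simple hypothesis test the proposal sketches. Below, a pooled two-proportion z-test fails to flag a 25-point gap in positive-prediction rate for a 12-member intersectional subgroup, while easily flagging the same gap at larger sample sizes (the numbers are invented for illustration):

```python
import math

def two_proportion_z(pos_a, n_a, pos_b, n_b):
    """z statistic for H0: group A's positive-prediction rate equals group B's
    (pooled two-proportion test)."""
    p_a, p_b = pos_a / n_a, pos_b / n_b
    pooled = (pos_a + pos_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

z_small = two_proportion_z(3, 12, 500, 1000)      # tiny intersectional group
z_large = two_proportion_z(250, 1000, 500, 1000)  # same 25-point gap, n=1000
print(abs(z_small) > 1.96, abs(z_large) > 1.96)   # False True
```

The same disparity that is overwhelming evidence at n=1000 is statistically invisible at n=12, which is exactly the moral-methodological bind for intersectional fairness auditing.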

[AI-65] PIAST: A Multimodal Piano Dataset with Audio Symbolic and Text

链接: https://arxiv.org/abs/2411.02551
作者: Hayeon Bang,Eunjin Choi,Megan Finch,Seungheon Doh,Seolhee Lee,Gyeong-Hoon Lee,Juan Nam
关键词-EN: Music Information Retrieval, Information Retrieval, Music Information, piano solo music, significant area
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注: Accepted for publication at the 3rd Workshop on NLP for Music and Audio (NLP4MusA 2024)

点击查看摘要

Abstract:While piano music has become a significant area of study in Music Information Retrieval (MIR), there is a notable lack of datasets for piano solo music with text labels. To address this gap, we present PIAST (PIano dataset with Audio, Symbolic, and Text), a piano music dataset. Utilizing a piano-specific taxonomy of semantic tags, we collected 9,673 tracks from YouTube and added human annotations for 2,023 tracks by music experts, resulting in two subsets: PIAST-YT and PIAST-AT. Both include audio, text, tag annotations, and transcribed MIDI utilizing state-of-the-art piano transcription and beat tracking models. Among many possible tasks with the multi-modal dataset, we conduct music tagging and retrieval using both audio and MIDI data and report baseline performances to demonstrate its potential as a valuable resource for MIR research.

[AI-66] GraphXAIN: Narratives to Explain Graph Neural Networks

链接: https://arxiv.org/abs/2411.02540
作者: Mateusz Cedro,David Martens
关键词-EN: Graph Neural Networks, Neural Networks, pose interpretability challenges, Graph Neural, interpretability challenges
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 15 pages, 9 figures

点击查看摘要

Abstract:Graph Neural Networks (GNNs) are a powerful technique for machine learning on graph-structured data, yet they pose interpretability challenges, especially for non-expert users. Existing GNN explanation methods often yield technical outputs such as subgraphs and feature importance scores, which are not easily understood. Building on recent insights from social science and other Explainable AI (XAI) methods, we propose GraphXAIN, a natural language narrative that explains individual predictions made by GNNs. We present a model-agnostic and explainer-agnostic XAI approach that complements graph explainers by generating GraphXAINs, using Large Language Models (LLMs) and integrating graph data, individual predictions from GNNs, explanatory subgraphs, and feature importances. We define XAI Narratives and XAI Descriptions, highlighting their distinctions and emphasizing the importance of narrative principles in effective explanations. By incorporating natural language narratives, our approach supports graph practitioners and non-expert users, aligning with social science research on explainability and enhancing user understanding and trust in complex GNN models. We demonstrate GraphXAIN’s capabilities on a real-world graph dataset, illustrating how its generated narratives can aid understanding compared to traditional graph explainer outputs or other descriptive explanation methods.

[AI-67] Strongly Topology-preserving GNNs for Brain Graph Super-resolution MICCAI-2024

链接: https://arxiv.org/abs/2411.02525
作者: Pragya Singh,Islem Rekik
关键词-EN: highly relevant task, Brain graph super-resolution, under-explored yet highly, highly relevant, relevant task
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to PRIME-MICCAI-2024

点击查看摘要

Abstract:Brain graph super-resolution (SR) is an under-explored yet highly relevant task in network neuroscience. It circumvents the need for costly and time-consuming medical imaging data collection, preparation, and processing. Current SR methods leverage graph neural networks (GNNs) thanks to their ability to natively handle graph-structured datasets. However, most GNNs perform node feature learning, which presents two significant limitations: (1) they require computationally expensive methods to learn complex node features capable of inferring connectivity strength or edge features, which do not scale to larger graphs; and (2) computations in the node space fail to adequately capture higher-order brain topologies such as cliques and hubs. However, numerous studies have shown that brain graph topology is crucial in identifying the onset and presence of various neurodegenerative disorders like Alzheimer and Parkinson. Motivated by these challenges and applications, we propose our STP-GSR framework. It is the first graph SR architecture to perform representation learning in higher-order topological space. Specifically, using the primal-dual graph formulation from graph theory, we develop an efficient mapping from the edge space of our low-resolution (LR) brain graphs to the node space of a high-resolution (HR) dual graph. This approach ensures that node-level computations on this dual graph correspond naturally to edge-level learning on our HR brain graphs, thereby enforcing strong topological consistency within our framework. Additionally, our framework is GNN layer agnostic and can easily learn from smaller, scalable GNNs, reducing computational requirements. We comprehensively benchmark our framework across seven key topological measures and observe that it significantly outperforms the previous state-of-the-art methods and baselines.
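
The primal-dual step the abstract relies on is the classic line-graph construction: every edge of the LR brain graph becomes a node of the dual graph, and two dual nodes are adjacent when their primal edges share an endpoint, so edge-level learning on the primal graph becomes node-level learning on the dual. A minimal sketch of that mapping (not the authors' implementation):

```python
from itertools import combinations

def edge_dual(edges):
    """Line-graph (primal-dual) construction: primal edges become dual nodes;
    dual nodes are connected when their primal edges share an endpoint."""
    dual_nodes = list(edges)
    dual_edges = [(e1, e2) for e1, e2 in combinations(edges, 2)
                  if set(e1) & set(e2)]
    return dual_nodes, dual_edges

# a triangle: its line graph is again a triangle
dual_nodes, dual_edges = edge_dual([(0, 1), (1, 2), (2, 0)])
print(len(dual_nodes), len(dual_edges))  # 3 3
```

Any node-level GNN layer run on `dual_nodes`/`dual_edges` then reasons directly about connectivity strengths of the primal graph, which is the topological consistency the framework enforces.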

[AI-68] Digitizing Touch with an Artificial Multimodal Fingertip

链接: https://arxiv.org/abs/2411.02479
作者: Mike Lambeta,Tingfan Wu,Ali Sengul,Victoria Rose Most,Nolan Black,Kevin Sawyer,Romeo Mercado,Haozhi Qi,Alexander Sohn,Byron Taylor,Norb Tydingco,Gregg Kammerer,Dave Stroud,Jake Khatha,Kurt Jenkins,Kyle Most,Neal Stein,Ricardo Chavira,Thomas Craven-Bartle,Eric Sanchez,Yitian Ding,Jitendra Malik,Roberto Calandra
关键词-EN: crucial sensing modality, information about object, object properties, properties and interactions, physical environment
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 28 pages

点击查看摘要

Abstract:Touch is a crucial sensing modality that provides rich information about object properties and interactions with the physical environment. Humans and robots both benefit from using touch to perceive and interact with the surrounding environment (Johansson and Flanagan, 2009; Li et al., 2020; Calandra et al., 2017). However, no existing systems provide rich, multi-modal digital touch-sensing capabilities through a hemispherical compliant embodiment. Here, we describe several conceptual and technological innovations to improve the digitization of touch. These advances are embodied in an artificial finger-shaped sensor with advanced sensing capabilities. Significantly, this fingertip contains high-resolution sensors (~8.3 million taxels) that respond to omnidirectional touch, capture multi-modal signals, and use on-device artificial intelligence to process the data in real time. Evaluations show that the artificial fingertip can resolve spatial features as small as 7 um, sense normal and shear forces with a resolution of 1.01 mN and 1.27 mN, respectively, perceive vibrations up to 10 kHz, sense heat, and even sense odor. Furthermore, it embeds an on-device AI neural network accelerator that acts as a peripheral nervous system on a robot and mimics the reflex arc found in humans. These results demonstrate the possibility of digitizing touch with superhuman performance. The implications are profound, and we anticipate potential applications in robotics (industrial, medical, agricultural, and consumer-level), virtual reality and telepresence, prosthetics, and e-commerce. Toward digitizing touch at scale, we open-source a modular platform to facilitate future research on the nature of touch.

[AI-69] Imagining and building wise machines: The centrality of AI metacognition

链接: https://arxiv.org/abs/2411.02478
作者: Samuel G. B. Johnson,Amir-Hossein Karimi,Yoshua Bengio,Nick Chater,Tobias Gerstenberg,Kate Larson,Sydney Levine,Melanie Mitchell,Iyad Rahwan,Bernhard Schölkopf,Igor Grossmann
关键词-EN: increasingly sophisticated performance, Recent advances, produced systems capable, artificial intelligence, advances in artificial
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
*备注: 26 pages, 1 figure, 2 tables

点击查看摘要

Abstract:Recent advances in artificial intelligence (AI) have produced systems capable of increasingly sophisticated performance on cognitive tasks. However, AI systems still struggle in critical ways: unpredictable and novel environments (robustness), lack of transparency in their reasoning (explainability), challenges in communication and commitment (cooperation), and risks due to potential harmful actions (safety). We argue that these shortcomings stem from one overarching failure: AI systems lack wisdom. Drawing from cognitive and social sciences, we define wisdom as the ability to navigate intractable problems - those that are ambiguous, radically uncertain, novel, chaotic, or computationally explosive - through effective task-level and metacognitive strategies. While AI research has focused on task-level strategies, metacognition - the ability to reflect on and regulate one’s thought processes - is underdeveloped in AI systems. In humans, metacognitive strategies such as recognizing the limits of one’s knowledge, considering diverse perspectives, and adapting to context are essential for wise decision-making. We propose that integrating metacognitive capabilities into AI systems is crucial for enhancing their robustness, explainability, cooperation, and safety. By focusing on developing wise AI, we suggest an alternative to aligning AI with specific human values - a task fraught with conceptual and practical difficulties. Instead, wise AI systems can thoughtfully navigate complex situations, account for diverse human values, and avoid harmful actions. We discuss potential approaches to building wise AI, including benchmarking metacognitive abilities and training AI systems to employ wise reasoning. Prioritizing metacognition in AI research will lead to systems that act not only intelligently but also wisely in complex, real-world situations.

[AI-70] Building a Synthetic Vascular Model: Evaluation in an Intracranial Aneurysms Detection Scenario

链接: https://arxiv.org/abs/2411.02477
作者: Rafic Nader,Florent Autrusseau,Vincent L’Allinec,Romain Bourcier
关键词-EN: cerebral vascular tree, bifurcations and intracranial, vascular tree, including the cerebral, intracranial aneurysms
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 12 pages, 9 figures, accepted for publication in IEEE Trans. on Medical Imaging. arXiv admin note: substantial text overlap with arXiv:2403.18734

点击查看摘要

Abstract:We hereby present a full synthetic model, able to mimic the various constituents of the cerebral vascular tree, including the cerebral arteries, bifurcations and intracranial aneurysms. This model intends to provide a substantial dataset of brain arteries which could be used by a 3D convolutional neural network to efficiently detect Intra-Cranial Aneurysms. The cerebral aneurysms most often occur on a particular structure of the vascular tree named the Circle of Willis. Various studies have been conducted to detect and monitor the aneurysms and those based on Deep Learning achieve the best performance. Specifically, in this work, we propose a full synthetic 3D model able to mimic the brain vasculature as acquired by Magnetic Resonance Angiography, Time Of Flight principle. Among the various MRI modalities, this latter allows for a good rendering of the blood vessels and is non-invasive. Our model has been designed to simultaneously mimic the arteries’ geometry, the aneurysm shape, and the background noise. The vascular tree geometry is modeled thanks to an interpolation with 3D Spline functions, and the statistical properties of the background noise is collected from angiography acquisitions and reproduced within the model. In this work, we thoroughly describe the synthetic vasculature model, we build up a neural network designed for aneurysm segmentation and detection, finally, we carry out an in-depth evaluation of the performance gap gained thanks to the synthetic model data augmentation.

[AI-71] Energy-Aware Dynamic Neural Inference

链接: https://arxiv.org/abs/2411.02471
作者: Marcello Bullo,Seifallah Jardak,Pietro Carnelli,Deniz Gündüz
关键词-EN: algorithms into energy-limited, sustainable operation, deep learning, energy-harvesting end-devices, growing demand
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Systems and Control (eess.SY)
*备注: ©2024 IEEE. This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:The growing demand for intelligent applications beyond the network edge, coupled with the need for sustainable operation, are driving the seamless integration of deep learning (DL) algorithms into energy-limited, and even energy-harvesting end-devices. However, the stochastic nature of ambient energy sources often results in insufficient harvesting rates, failing to meet the energy requirements for inference and causing significant performance degradation in energy-agnostic systems. To address this problem, we consider an on-device adaptive inference system equipped with an energy-harvester and finite-capacity energy storage. We then allow the device to reduce the run-time execution cost on-demand, by either switching between differently-sized neural networks, referred to as multi-model selection (MMS), or by enabling earlier predictions at intermediate layers, called early exiting (EE). The model to be employed, or the exit point is then dynamically chosen based on the energy storage and harvesting process states. We also study the efficacy of integrating the prediction confidence into the decision-making process. We derive a principled policy with theoretical guarantees for confidence-aware and -agnostic controllers. Moreover, in multi-exit networks, we study the advantages of taking decisions incrementally, exit-by-exit, by designing a lightweight reinforcement learning-based controller. Experimental results show that, as the rate of the ambient energy increases, energy- and confidence-aware control schemes show approximately 5% improvement in accuracy compared to their energy-aware confidence-agnostic counterparts. Incremental approaches achieve even higher accuracy, particularly when the energy storage capacity is limited relative to the energy consumption of the inference model.
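The early-exiting (EE) scheme described above can be illustrated with a minimal confidence-based sketch. This is not the paper's energy- and storage-aware policy (which also conditions on the harvesting process and uses an RL controller); it only shows the confidence-thresholding step, and the function name `early_exit_predict` and the threshold value are illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit_predict(exit_logits, threshold=0.9):
    """Return (exit_index, predicted_class): take the first intermediate
    exit whose softmax confidence clears the threshold (saving the energy
    of deeper layers), otherwise fall back to the final exit."""
    for i, logits in enumerate(exit_logits):
        probs = softmax(logits)
        conf = max(probs)
        if conf >= threshold:
            return i, probs.index(conf)
    probs = softmax(exit_logits[-1])
    return len(exit_logits) - 1, probs.index(max(probs))
```

In an energy-aware variant, the threshold itself would be modulated by the current energy-storage state, which is the decision problem the paper studies.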

[AI-72] Benchmarking XAI Explanations with Human-Aligned Evaluations

链接: https://arxiv.org/abs/2411.02470
作者: Rémi Kazmierczak,Steve Azzolin,Eloïse Berthier,Anna Hedström,Patricia Delhomme,Nicolas Bousquet,Goran Frehse,Massimiliano Mancini,Baptiste Caramiaux,Andrea Passerini,Gianni Franchi
关键词-EN: Perceptual Assessment System, Cats Dogs Cars, Artificial intelligence, Perceptual Assessment, Assessment System
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: this https URL

点击查看摘要

Abstract:In this paper, we introduce PASTA (Perceptual Assessment System for explanaTion of Artificial intelligence), a novel framework for a human-centric evaluation of XAI techniques in computer vision. Our first key contribution is a human evaluation of XAI explanations on four diverse datasets (COCO, Pascal Parts, Cats Dogs Cars, and MonumAI) which constitutes the first large-scale benchmark dataset for XAI, with annotations at both the image and concept levels. This dataset allows for robust evaluation and comparison across various XAI methods. Our second major contribution is a data-based metric for assessing the interpretability of explanations. It mimics human preferences, based on a database of human evaluations of explanations in the PASTA-dataset. With its dataset and metric, the PASTA framework provides consistent and reliable comparisons between XAI techniques, in a way that is scalable but still aligned with human evaluations. Additionally, our benchmark allows for comparisons between explanations across different modalities, an aspect previously unaddressed. Our findings indicate that humans tend to prefer saliency maps over other explanation types. Moreover, we provide evidence that human assessments show a low correlation with existing XAI metrics that are numerically simulated by probing the model.

[AI-73] Modeling and Simulation of a Multi Robot System Architecture

链接: https://arxiv.org/abs/2411.02468
作者: Ahmed R. Sadik,Christian Goerick,Manuel Muehlig
关键词-EN: Multi Robot System, intelligent cyberphysical system, Multi Robot, MRS solution architecture, MRS
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:A Multi Robot System (MRS) is the infrastructure of an intelligent cyberphysical system, where the robots understand the need of the human, and hence cooperate together to fulfill this need. Modeling an MRS is a crucial aspect of designing the proper system architecture, because this model can be used to simulate and measure the performance of the proposed architecture. However, an MRS solution architecture modeling is a very difficult problem, as it contains many dependent behaviors that dynamically change due to the current status of the overall system. In this paper, we introduce a general purpose MRS case study, where the humans initiate requests that are achieved by the available robots. These requests require different plans that use the current capabilities of the available robots. After proposing an architecture that defines the solution components, three steps are followed. First is modeling these components via Business Process Model and Notation (BPMN) language. BPMN provides a graphical notation to precisely represent the behaviors of every component, which is an essential need to model the solution. Second is to simulate these components behaviors and interaction in form of software agents. Java Agent DEvelopment (JADE) middleware has been used to develop and simulate the proposed model. JADE is based on a reactive agent approach, therefore it can dynamically represent the interaction among the solution components. Finally is to analyze the performance of the solution by defining a number of quantitative measurements, which can be obtained while simulating the system model in JADE middleware, therefore the solution can be analyzed and compared to another architecture.

[AI-74] See it Think it Sorted: Large Multimodal Models are Few-shot Time Series Anomaly Analyzers

链接: https://arxiv.org/abs/2411.02465
作者: Jiaxin Zhuang,Leon Yan,Zhenwei Zhang,Ruiqi Wang,Jiawei Zhang,Yuantao Gu
关键词-EN: increasingly vital due, Time series, time series data, Time series anomaly, Large Multimodal Models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: Under review

点击查看摘要

Abstract:Time series anomaly detection (TSAD) is becoming increasingly vital due to the rapid growth of time series data across various sectors. Anomalies in web service data, for example, can signal critical incidents such as system failures or server malfunctions, necessitating timely detection and response. However, most existing TSAD methodologies rely heavily on manual feature engineering or require extensive labeled training data, while also offering limited interpretability. To address these challenges, we introduce a pioneering framework called the Time Series Anomaly Multimodal Analyzer (TAMA), which leverages the power of Large Multimodal Models (LMMs) to enhance both the detection and interpretation of anomalies in time series data. By converting time series into visual formats that LMMs can efficiently process, TAMA leverages few-shot in-context learning capabilities to reduce dependence on extensive labeled datasets. Our methodology is validated through rigorous experimentation on multiple real-world datasets, where TAMA consistently outperforms state-of-the-art methods in TSAD tasks. Additionally, TAMA provides rich, natural language-based semantic analysis, offering deeper insights into the nature of detected anomalies. Furthermore, we contribute one of the first open-source datasets that includes anomaly detection labels, anomaly type labels, and contextual description, facilitating broader exploration and advancement within this critical field. Ultimately, TAMA not only excels in anomaly detection but also provides a comprehensive approach for understanding the underlying causes of anomalies, pushing TSAD forward through innovative methodologies and insights.
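TAMA's first step, converting a time series into a visual format the LMM can process, can be sketched with a toy rasterizer. The paper's actual rendering pipeline is not specified here; `series_to_image` and its binary-grid output are assumptions for illustration only.

```python
def series_to_image(series, height=8):
    """Rasterize a 1-D series into a height x len(series) binary grid
    (1 marks the curve), a toy stand-in for the chart images fed to an LMM."""
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1.0
    img = [[0] * len(series) for _ in range(height)]
    for x, v in enumerate(series):
        y = int((v - lo) / span * (height - 1))
        img[height - 1 - y][x] = 1   # row 0 is the top of the image
    return img
```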

[AI-75] You are out of context!

链接: https://arxiv.org/abs/2411.02464
作者: Giancarlo Cobino,Simone Farci
关键词-EN: vector space representation, research proposes, vector space, space representation, models based
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This research proposes a novel drift detection methodology for machine learning (ML) models based on the concept of "deformation" in the vector space representation of data. Recognizing that new data can act as forces stretching, compressing, or twisting the geometric relationships learned by a model, we explore various mathematical frameworks to quantify this deformation. We investigate measures such as eigenvalue analysis of covariance matrices to capture global shape changes, local density estimation using kernel density estimation (KDE), and Kullback-Leibler divergence to identify subtle shifts in data concentration. Additionally, we draw inspiration from continuum mechanics by proposing a "strain tensor" analogy to capture multi-faceted deformations across different data types. This requires careful estimation of the displacement field, and we delve into strategies ranging from density-based approaches to manifold learning and neural network methods. By continuously monitoring these deformation metrics and correlating them with model performance, we aim to provide a sensitive, interpretable, and adaptable drift detection system capable of distinguishing benign data evolution from true drift, enabling timely interventions and ensuring the reliability of machine learning systems in dynamic environments. Addressing the computational challenges of this methodology, we discuss mitigation strategies like dimensionality reduction, approximate algorithms, and parallelization for real-time and large-scale applications. The method’s effectiveness is demonstrated through experiments on real-world text data, focusing on detecting context shifts in Generative AI. Our results, supported by publicly available code, highlight the benefits of this deformation-based approach in capturing subtle drifts that traditional statistical methods often miss.
Furthermore, we present a detailed application example within the healthcare domain, showcasing the methodology’s potential in diverse fields. Future work will focus on further improving computational efficiency and exploring additional applications across different ML domains.
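One of the measures mentioned, Kullback-Leibler divergence between binned data distributions, can be sketched as a simple drift score. The binning scheme, the `drift_score` name, and the smoothing constant are illustrative assumptions, not the paper's implementation.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions, smoothed to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def histogram(data, bins, lo, hi):
    """Normalized histogram of `data` over [lo, hi]."""
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for x in data:
        counts[min(int((x - lo) / width), bins - 1)] += 1
    return [c / len(data) for c in counts]

def drift_score(reference, new, bins=10):
    """How far the new batch's value distribution drifts from the reference."""
    lo = min(min(reference), min(new))
    hi = max(max(reference), max(new))
    p = histogram(new, bins, lo, hi)        # distribution of the new batch
    q = histogram(reference, bins, lo, hi)  # reference distribution
    return kl_divergence(p, q)
```

A monitoring loop would track this score over time and flag drift when it rises above a calibrated baseline.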

[AI-76] Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study

链接: https://arxiv.org/abs/2411.02462
作者: André Storhaug,Jingyue Li
关键词-EN: enhanced programmers’ productivity, significantly enhanced programmers’, GitHub Copilot, Copilot has significantly, large language models
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 12 pages, 3 figures, 4 tables, 1 listing

点击查看摘要

Abstract:The advent of large language models (LLMs) like GitHub Copilot has significantly enhanced programmers’ productivity, particularly in code generation. However, these models often struggle with real-world tasks without fine-tuning. As LLMs grow larger and more performant, fine-tuning for specialized tasks becomes increasingly expensive. Parameter-efficient fine-tuning (PEFT) methods, which fine-tune only a subset of model parameters, offer a promising solution by reducing the computational costs of tuning LLMs while maintaining their performance. Existing studies have explored using PEFT and LLMs for various code-related tasks and found that the effectiveness of PEFT techniques is task-dependent. The application of PEFT techniques in unit test generation remains underexplored. The state-of-the-art is limited to using LLMs with full fine-tuning to generate unit tests. This paper investigates both full fine-tuning and various PEFT methods, including LoRA, (IA)^3, and prompt tuning, across different model architectures and sizes. We use well-established benchmark datasets to evaluate their effectiveness in unit test generation. Our findings show that PEFT methods can deliver performance comparable to full fine-tuning for unit test generation, making specialized fine-tuning more accessible and cost-effective. Notably, prompt tuning is the most effective in terms of cost and resource utilization, while LoRA approaches the effectiveness of full fine-tuning in several cases.
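Of the PEFT methods compared, LoRA freezes the pretrained weight W and learns only a low-rank update scaled by alpha/r. A dependency-free sketch of the effective weight W + (alpha/r)·B·A follows; the class name and pure-Python matrix code are illustrative (real implementations use a framework such as Hugging Face PEFT).

```python
def matmul(A, B):
    """Naive matrix product of nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

class LoRALinear:
    """Frozen weight W (out x in) plus a trainable low-rank update:
    W_eff = W + (alpha / r) * B @ A, with A (r x in) and B (out x r).
    Only A and B would be updated during fine-tuning."""
    def __init__(self, W, A, B, alpha=1.0):
        self.W, self.A, self.B = W, A, B
        self.r = len(A)          # rank = number of rows of A
        self.alpha = alpha

    def effective_weight(self):
        scale = self.alpha / self.r
        delta = matmul(self.B, self.A)
        return [[w + scale * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.W, delta)]

    def forward(self, x):
        """Apply the merged weight to an input vector x."""
        return [sum(w * xi for w, xi in zip(row, x))
                for row in self.effective_weight()]
```

The appeal for unit test generation, as the study finds, is that only r·(in + out) parameters per layer are tuned instead of in·out.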

[AI-77] Learning World Models for Unconstrained Goal Navigation NEURIPS2024

链接: https://arxiv.org/abs/2411.02446
作者: Yuanlin Duan,Wensen Mao,He Zhu
关键词-EN: goal-conditioned reinforcement learning, world models, world models offers, sparse rewards, offers a promising
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: NeurIPS2024 Poster. arXiv admin note: substantial text overlap with arXiv:2411.01396

点击查看摘要

Abstract:Learning world models offers a promising avenue for goal-conditioned reinforcement learning with sparse rewards. By allowing agents to plan actions or exploratory goals without direct interaction with the environment, world models enhance exploration efficiency. The quality of a world model hinges on the richness of data stored in the agent’s replay buffer, with expectations of reasonable generalization across the state space surrounding recorded trajectories. However, challenges arise in generalizing learned world models to state transitions backward along recorded trajectories or between states across different trajectories, hindering their ability to accurately model real-world dynamics. To address these challenges, we introduce a novel goal-directed exploration algorithm, MUN (short for “World Models for Unconstrained Goal Navigation”). This algorithm is capable of modeling state transitions between arbitrary subgoal states in the replay buffer, thereby facilitating the learning of policies to navigate between any “key” states. Experimental results demonstrate that MUN strengthens the reliability of world models and significantly improves the policy’s capacity to generalize across new goal settings.

[AI-78] Entropic Hetero-Associative Memory

链接: https://arxiv.org/abs/2411.02438
作者: Rafael Morales,Luis A. Pineda
关键词-EN: Entropic Associative Memory, Associative Memory holds, Entropic Associative, Associative Memory, Memory holds objects
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 17 pages, 5 figures

点击查看摘要

Abstract:The Entropic Associative Memory holds objects in a 2D relation or "memory plane" using a finite table as the medium. Memory objects are stored by reinforcing simultaneously the cells used by the cue, implementing a form of Hebb's learning rule. Stored objects are "overlapped" on the medium, hence the memory is indeterminate and has an entropy value at each state. The retrieval operation constructs an object from the cue and such indeterminate content. In this paper we present the extension to the hetero-associative case in which these properties are preserved. Pairs of hetero-associated objects, possibly of different domain and/or modalities, are held in a 4D relation. The memory retrieval operation selects a largely indeterminate 2D memory plane that is specific to the input cue; however, there is no cue left to retrieve an object from such latter plane. We propose three incremental methods to address such missing cue problem, which we call random, sample and test, and search and test. The model is assessed with composite recollections consisting of manuscript digits and letters selected from the MNIST and the EMNIST corpora, respectively, such that cue digits retrieve their associated letters and vice versa. We show the memory performance and illustrate the memory retrieval operation using all three methods. The system shows promise for storing, recognizing and retrieving very large sets of objects with very limited computing resources.
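The Hebbian register/recognize cycle on a 2D table can be sketched minimally. This toy omits the entropy computation, the probabilistic retrieval, and the 4D hetero-associative extension the paper introduces; the class and method names are assumptions.

```python
class EntropicMemory:
    """Toy associative table: cues reinforce cells (a form of Hebb's rule);
    overlapped storage makes the stored content indeterminate."""
    def __init__(self, rows, cols):
        self.table = [[0] * cols for _ in range(rows)]

    def register(self, cue):
        """cue: iterable of (row, col) cells active for this object."""
        for r, c in cue:
            self.table[r][c] += 1

    def recognize(self, cue):
        """Accept a cue only if every one of its cells has been reinforced."""
        return all(self.table[r][c] > 0 for r, c in cue)
```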

[AI-79] TypeScore: A Text Fidelity Metric for Text-to-Image Generative Models

链接: https://arxiv.org/abs/2411.02437
作者: Georgia Gabriela Sampaio,Ruixiang Zhang,Shuangfei Zhai,Jiatao Gu,Josh Susskind,Navdeep Jaitly,Yizhe Zhang
关键词-EN: remains a challenge, text, generative models remains, models, performance rapidly improves
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Evaluating text-to-image generative models remains a challenge, despite the remarkable progress being made in their overall performances. While existing metrics like CLIPScore work for coarse evaluations, they lack the sensitivity to distinguish finer differences as model performance rapidly improves. In this work, we focus on the text rendering aspect of these models, which provides a lens for evaluating a generative model’s fine-grained instruction-following capabilities. To this end, we introduce a new evaluation framework called TypeScore to sensitively assess a model’s ability to generate images with high-fidelity embedded text by following precise instructions. We argue that this text generation capability serves as a proxy for general instruction-following ability in image synthesis. TypeScore uses an additional image description model and leverages an ensemble dissimilarity measure between the original and extracted text to evaluate the fidelity of the rendered text. Our proposed metric demonstrates greater resolution than CLIPScore to differentiate popular image generation models across a range of instructions with diverse text styles. Our study also evaluates how well these vision-language models (VLMs) adhere to stylistic instructions, disentangling style evaluation from embedded-text fidelity. Through human evaluation studies, we quantitatively meta-evaluate the effectiveness of the metric. Comprehensive analysis is conducted to explore factors such as text length, captioning models, and current progress towards human parity on this task. The framework provides insights into remaining gaps in instruction-following for image generation with embedded text.
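The core idea, comparing the instructed text against text extracted back from the rendered image, can be sketched with a single normalized edit-distance term. TypeScore itself uses an image description model and an ensemble dissimilarity measure; `text_fidelity` below is a simplified stand-in, not the paper's metric.

```python
def levenshtein(a, b):
    """Edit distance between two strings via the standard two-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def text_fidelity(intended, extracted):
    """1.0 = rendered text matches the instruction exactly; 0.0 = no overlap."""
    if not intended and not extracted:
        return 1.0
    d = levenshtein(intended, extracted)
    return 1 - d / max(len(intended), len(extracted))
```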

[AI-80] NMformer: A Transformer for Noisy Modulation Classification in Wireless Communication

链接: https://arxiv.org/abs/2411.02428
作者: Atik Faysal,Mohammad Rostami,Reihaneh Gh. Roshan,Huaxia Wang,Nikhil Muralidhar
关键词-EN: base classifier, Modulation classification, modulation images, ambient noises, signals intertwine
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modulation classification is a very challenging task since the signals intertwine with various ambient noises. Methods are required that can classify them without adding extra steps like denoising, which introduces computational complexity. In this study, we propose a vision transformer (ViT) based model named NMformer to predict the channel modulation images with different noise levels in wireless communication. Since ViTs are most effective for RGB images, we generated constellation diagrams from the modulated signals. The diagrams provide the information from the signals in a 2-D representation form. We trained NMformer on 106,800 modulation images to build the base classifier and only used 3,000 images to fine-tune for specific tasks. Our proposed model has two different kinds of prediction setups: in-distribution and out-of-distribution. Our model achieves 4.67% higher accuracy than the base classifier when finetuned and tested on high signal-to-noise ratios (SNRs) in-distribution classes. Moreover, the fine-tuned low SNR task achieves a higher accuracy than the base classifier. The fine-tuned classifier becomes much more effective than the base classifier by achieving higher accuracy when predicted, even on unseen data from out-of-distribution classes. Extensive experiments show the effectiveness of NMformer for a wide range of SNRs.
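The constellation diagrams NMformer classifies are produced by plotting modulated symbols in the I/Q plane. A sketch of the symbol-mapping step for QPSK follows; the paper covers multiple modulation schemes and adds channel noise before rendering, and this Gray mapping is a standard textbook choice rather than one taken from the paper.

```python
import math

# Gray-coded QPSK mapping: adjacent constellation points differ by one bit.
_QPSK_MAP = {(0, 0): (1, 1), (0, 1): (-1, 1),
             (1, 1): (-1, -1), (1, 0): (1, -1)}

def qpsk_symbols(bits):
    """Map an even-length bit sequence to I/Q points with unit symbol energy.
    Scatter-plotting these points (plus channel noise) yields a
    constellation diagram like those used as classifier inputs."""
    assert len(bits) % 2 == 0, "QPSK consumes bits in pairs"
    scale = 1 / math.sqrt(2)
    return [(_QPSK_MAP[(bits[i], bits[i + 1])][0] * scale,
             _QPSK_MAP[(bits[i], bits[i + 1])][1] * scale)
            for i in range(0, len(bits), 2)]
```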

[AI-81] Development of CODO: A Comprehensive Tool for COVID-19 Data Representation Analysis and Visualization

链接: https://arxiv.org/abs/2411.02423
作者: Biswanath Dutta,Debanjali Bain
关键词-EN: Artificial intelligence, indispensable for managing, managing and processing, processing the vast, vast amounts
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: 15 pages, 4 figures, journal

点击查看摘要

Abstract:Artificial intelligence (AI) has become indispensable for managing and processing the vast amounts of data generated during the COVID-19 pandemic. Ontology, which formalizes knowledge within a domain using standardized vocabularies and relationships, plays a crucial role in AI by enabling automated reasoning, data integration, semantic interoperability, and extracting meaningful insights from extensive datasets. The diversity of COVID-19 datasets poses challenges in comprehending this information for both human and machines. Existing COVID-19 ontologies are designed to address specific aspects of the pandemic but lack comprehensive coverage across all essential dimensions. To address this gap, CODO, an integrated ontological model has been developed encompassing critical facets of COVID-19 information such as aetiology, epidemiology, transmission, pathogenesis, diagnosis, prevention, genomics, therapeutic safety, and more. This paper reviews CODO since its inception in 2020, detailing its developments and highlighting CODO as a tool for the aggregation, representation, analysis, and visualization of diverse COVID-19 data. The major contribution of this paper is to provide a summary of the development of CODO, and outline the overall development and evaluation approach. By adhering to best practices and leveraging W3C standards, CODO ensures data integration and semantic interoperability, supporting effective navigation of COVID-19 complexities across various domains.

[AI-82] XAI-FUNGI: Dataset resulting from the user study on comprehensibility of explainable AI algorithms

链接: https://arxiv.org/abs/2411.02419
作者: Szymon Bobek,Paloma Korycińska,Monika Krakowska,Maciej Mozolewski,Dorota Rak,Magdalena Zych,Magdalena Wójcik,Grzegorz J. Nalepa
关键词-EN: explainable artificial intelligence, artificial intelligence, paper introduces, comprehensibility of explainable, explainable artificial
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces a dataset that is the result of a user study on the comprehensibility of explainable artificial intelligence (XAI) algorithms. The study participants were recruited from 149 candidates to form three groups representing experts in the domain of mycology (DE), students with a data science and visualization background (IT) and students from social sciences and humanities (SSH). The main part of the dataset contains 39 transcripts of interviews during which participants were asked to complete a series of tasks and questions related to the interpretation of explanations of decisions of a machine learning model trained to distinguish between edible and inedible mushrooms. The transcripts were complemented with additional data that includes visualizations of explanations presented to the user, results from thematic analysis, recommendations of improvements of explanations provided by the participants, and the initial survey results that allow to determine the domain knowledge of the participant and data analysis literacy. The transcripts were manually tagged to allow for automatic matching between the text and other data related to particular fragments. In the advent of the area of rapid development of XAI techniques, the need for a multidisciplinary qualitative evaluation of explainability is one of the emerging topics in the community. Our dataset allows not only to reproduce the study we conducted, but also to open a wide range of possibilities for the analysis of the material we gathered.

[AI-83] A Persuasion-Based Prompt Learning Approach to Improve Smishing Detection through Data Augmentation

链接: https://arxiv.org/abs/2411.02403
作者: Ho Sung Shim,Hyoungjun Park,Kyuhan Lee,Jang-Sun Park,Seonhye Kang
关键词-EN: holds significance due, illicitly obtain personal, obtain personal information, Smishing, holds significance
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Smishing, which aims to illicitly obtain personal information from unsuspecting victims, holds significance due to its negative impacts on our society. In prior studies, as a tool to counteract smishing, machine learning (ML) has been widely adopted, which filters and blocks smishing messages before they reach potential victims. However, a number of challenges remain in ML-based smishing detection, with the scarcity of annotated datasets being one major hurdle. Specifically, given the sensitive nature of smishing-related data, there is a lack of publicly accessible data that can be used for training and evaluating ML models. Additionally, the nuanced similarities between smishing messages and other types of social engineering attacks such as spam messages exacerbate the challenge of smishing classification with limited resources. To tackle this challenge, we introduce a novel data augmentation method utilizing a few-shot prompt learning approach. What sets our approach apart from extant methods is the use of the principles of persuasion, a psychology theory which explains the underlying mechanisms of smishing. By designing prompts grounded in the persuasion principles, our augmented dataset could effectively capture various, important aspects of smishing messages, enabling ML models to be effectively trained. Our evaluation within a real-world context demonstrates that our augmentation approach produces more diverse and higher-quality smishing data instances compared to other cutting-edging approaches, leading to substantial improvements in the ability of ML models to detect the subtle characteristics of smishing messages. Moreover, our additional analyses reveal that the performance improvement provided by our approach is more pronounced when used with ML models that have a larger number of parameters, demonstrating its effectiveness in training large-scale ML models.
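The persuasion-grounded few-shot prompt construction can be sketched as a simple template. The principle descriptions and the `build_prompt` wording below are invented placeholders; the paper's actual prompts are not reproduced here.

```python
# Hypothetical principle descriptions -- placeholders, not the paper's prompts.
PERSUASION_PRINCIPLES = {
    "scarcity": "creates urgency through a limited-time offer",
    "authority": "impersonates a trusted institution",
    "liking": "builds rapport with a friendly, personal tone",
}

def build_prompt(principle, examples):
    """Compose a few-shot generation prompt grounded in one persuasion
    principle, for producing synthetic smishing training data."""
    description = PERSUASION_PRINCIPLES[principle]
    shots = "\n".join(f"- {ex}" for ex in examples)
    return (f"Generate one new smishing-style message that {description}, "
            f"in the style of these examples:\n{shots}")
```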

[AI-84] Advanced computer vision for extracting georeferenced vehicle trajectories from drone imagery

链接: https://arxiv.org/abs/2411.02136
作者: Robert Fonod,Haechan Cho,Hwasoo Yeo,Nikolas Geroliminis
关键词-EN: addressing key challenges, high-altitude drone footage, extracting georeferenced vehicle, Songdo International Business, International Business District
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a framework for extracting georeferenced vehicle trajectories from high-altitude drone footage, addressing key challenges in urban traffic monitoring and limitations of traditional ground-based systems. We employ state-of-the-art computer vision and deep learning to create an end-to-end pipeline that enhances vehicle detection, tracking, and trajectory stabilization. Conducted in the Songdo International Business District, South Korea, the study used a multi-drone experiment over 20 intersections, capturing approximately 12TB of 4K video data over four days. We developed a novel track stabilization method that uses detected vehicle bounding boxes as exclusion masks during image registration, which, combined with advanced georeferencing techniques, accurately transforms vehicle coordinates into real-world geographical data. Additionally, our framework includes robust vehicle dimension estimation and detailed road segmentation for in-depth traffic analysis. The framework produced two high-quality datasets: the Songdo Traffic dataset, comprising nearly 1 million unique vehicle trajectories, and the Songdo Vision dataset, containing over 5,000 human-annotated frames with about 300,000 vehicle instances in four classes. Comparisons between drone-derived data and high-precision sensor data from an instrumented probe vehicle highlight the accuracy and consistency of our framework’s extraction in dense urban settings. By publicly releasing these datasets and the pipeline source code, this work sets new benchmarks for data quality, reproducibility, and scalability in traffic research. Results demonstrate the potential of integrating drone technology with advanced computer vision for precise, cost-effective urban traffic monitoring, providing valuable resources for the research community to develop intelligent transportation systems and improve traffic management strategies.
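The georeferencing step, mapping pixel coordinates to geographic coordinates, commonly uses an affine geotransform. Below is a sketch using the GDAL-style six-coefficient convention; this is an assumption for illustration, since the paper's pipeline (which also includes the mask-based track stabilization) is more involved.

```python
def pixel_to_geo(px, py, gt):
    """Apply a GDAL-style affine geotransform to pixel coordinates (px, py).
    gt = (origin_x, pixel_width, row_rotation,
          origin_y, col_rotation, pixel_height)"""
    x0, dx, rx, y0, ry, dy = gt
    return (x0 + px * dx + py * rx,
            y0 + px * ry + py * dy)
```

For a north-up image, the rotation terms are 0 and the pixel height is negative, so y decreases as the row index grows.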

[AI-85] DeMod: A Holistic Tool with Explainable Detection and Personalized Modification for Toxicity Censorship

链接: https://arxiv.org/abs/2411.01844
作者: Yaqiong Li,Peng Zhang,Hansu Gu,Tun Lu,Siyuan Qiao,Yubo Shu,Yiyang Shao,Ning Gu
关键词-EN: supporting toxicity censorship, tools supporting toxicity, social posts, automated approaches, toxicity censorship
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Although there have been automated approaches and tools supporting toxicity censorship for social posts, most of them focus on detection. Toxicity censorship is a complex process, wherein detection is just an initial task and a user can have further needs such as rationale understanding and content modification. For this problem, we conduct a needfinding study to investigate people’s diverse needs in toxicity censorship and then build a ChatGPT-based censorship tool named DeMod accordingly. DeMod is equipped with the features of explainable Detection and personalized Modification, providing fine-grained detection results, detailed explanations, and personalized modification suggestions. We also implemented the tool and recruited 35 Weibo users for evaluation. The results suggest DeMod’s multiple strengths like the richness of functionality, the accuracy of censorship, and ease of use. Based on the findings, we further propose several insights into the design of content censorship systems.

[AI-86] Integrating Saliency Ranking and Reinforcement Learning for Enhanced Object Detection ALT

链接: https://arxiv.org/abs/2408.06803
作者: Matthias Bartolo,Dylan Seychell,Josef Bajada
关键词-EN: based visual attention, combine reinforcement learning, visual attention methods, saliency ranking techniques, based visual
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Resultant work from Dissertation, Department of AI, University of Malta. Code available at: this https URL

点击查看摘要

Abstract:With the ever-growing variety of object detection approaches, this study explores a series of experiments that combine reinforcement learning (RL)-based visual attention methods with saliency ranking techniques to investigate transparent and sustainable solutions. By integrating saliency ranking for initial bounding box prediction and subsequently applying RL techniques to refine these predictions through a finite set of actions over multiple time steps, this study aims to enhance RL object detection accuracy. Presented as a series of experiments, this research investigates the use of various image feature extraction methods and explores diverse Deep Q-Network (DQN) architectural variations for deep reinforcement learning-based localisation agent training. Additionally, we focus on optimising the detection pipeline at every step by prioritising lightweight and faster models, while also incorporating the capability to classify detected objects, a feature absent in previous RL approaches. We show that by evaluating the performance of these trained agents using the Pascal VOC 2007 dataset, faster and more optimised models were developed. Notably, the best mean Average Precision (mAP) achieved in this study was 51.4, surpassing benchmarks set by RL-based single object detectors in the literature.
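The finite action set used to refine a bounding box over time steps, and the IoU signal that typically drives the reward, can be sketched as follows. The specific actions, step size, and (x, y, w, h) layout are illustrative; the paper's DQN agent and action space may differ.

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes; a natural
    reward signal for an RL localisation agent."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    return inter / (aw * ah + bw * bh - inter)

def apply_action(box, action, step=0.1):
    """One refinement step from a finite action set applied to an
    (x, y, w, h) box, scaled by the box's own size."""
    x, y, w, h = box
    moves = {
        "left":   (x - step * w, y, w, h),
        "right":  (x + step * w, y, w, h),
        "up":     (x, y - step * h, w, h),
        "down":   (x, y + step * h, w, h),
        "wider":  (x, y, w * (1 + step), h),
        "taller": (x, y, w, h * (1 + step)),
        "shrink": (x, y, w * (1 - step), h * (1 - step)),
        "stop":   box,
    }
    return moves[action]
```

A training loop would reward the agent by the change in IoU with the ground-truth box after each action.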

[AI-87] AtlasSeg: Atlas Prior Guided Dual-U-Net for Cortical Segmentation in Fetal Brain MRI

链接: https://arxiv.org/abs/2411.02867
作者: Haoan Xu,Tianshu Zheng,Xinyi Xu,Yao Shen,Jiwei Sun,Cong Sun,Guangbin Wang,Dan Wu
关键词-EN: remains challenging due, dynamically changing anatomical, changing anatomical anatomy, Accurate tissue segmentation, MRI remains challenging
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Accurate tissue segmentation in fetal brain MRI remains challenging due to the dynamically changing anatomy and contrast during fetal development. To enhance segmentation accuracy throughout gestation, we introduced AtlasSeg, a dual-U-shape convolution network incorporating gestational age (GA) specific information as guidance. By providing a publicly available fetal brain atlas with segmentation labels at the corresponding GA, AtlasSeg effectively extracted the contextual features of age-specific patterns in the atlas branch and generated tissue segmentation in the segmentation branch. Multi-scale attentive atlas feature fusions were constructed in all stages during encoding and decoding, giving rise to a dual-U-shape network to assist feature flow and information interactions between the two branches. AtlasSeg outperformed six well-known segmentation networks in both our internal fetal brain MRI dataset and the external FeTA dataset. Ablation experiments demonstrate the efficiency of atlas guidance and the attention mechanism. The proposed AtlasSeg demonstrated superior segmentation performance against other convolution networks with higher segmentation accuracy, and may facilitate fetal brain MRI analysis in large-scale fetal brain studies.

[AI-88] Active Prompt Tuning Enables GPT-4o To Do Efficient Classification Of Microscopy Images

链接: https://arxiv.org/abs/2411.02639
作者: Abhiram Kandiyana,Peter R. Mouton,Yaroslav Kolinko,Lawrence O. Hall,Dmitry Goldgof
关键词-EN: classifying cellular features, Traditional deep learning-based, deep learning-based methods, images require time, traditional Convolutional Neural
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Traditional deep learning-based methods for classifying cellular features in microscopy images require time- and labor-intensive processes for training models. Among the current limitations are major time commitments from domain experts for accurate ground truth preparation; and the need for a large amount of input image data. We previously proposed a solution that overcomes these challenges using OpenAI’s GPT-4(V) model on a pilot dataset (Iba-1 immuno-stained tissue sections from 11 mouse brains). Results on the pilot dataset were equivalent in accuracy and with a substantial improvement in throughput efficiency compared to the baseline using a traditional Convolutional Neural Net (CNN)-based approach. The present study builds upon this framework using a second unique and substantially larger dataset of microscopy images. Our current approach uses a newer and faster model, GPT-4o, along with improved prompts. It was evaluated on a microscopy image dataset captured at low (10x) magnification from cresyl-violet-stained sections through the cerebellum of a total of 18 mouse brains (9 Lurcher mice, 9 wild-type controls). We used our approach to classify these images either as a control group or Lurcher mutant. Using 6 mice in the prompt set, the results were correct classification for 11 out of the 12 mice (92%) with 96% higher efficiency, reduced image requirements, and lower demands on time and effort of domain experts compared to the baseline method (snapshot ensemble of CNN models). These results confirm that our approach is effective across multiple datasets from different brain regions and magnifications, with minimal overhead.

[AI-89] Advanced XR-Based 6-DOF Catheter Tracking System for Immersive Cardiac Intervention Training

链接: https://arxiv.org/abs/2411.02611
作者: Mohsen Annabestani,Sandhya Sriram,S. Chiu Wong,Alexandros Sigaras,Bobak Mosadegh
关键词-EN: Extended Reality, complex cardiac interventions, technologies are gaining, gaining traction, traction as effective
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Extended Reality (XR) technologies are gaining traction as effective tools for medical training and procedural guidance, particularly in complex cardiac interventions. This paper presents a novel system for real-time 3D tracking and visualization of intracardiac echocardiography (ICE) catheters, with precise measurement of the roll angle. A custom 3D-printed setup, featuring orthogonal cameras, captures biplane video of the catheter, while a specialized computer vision algorithm reconstructs its 3D trajectory, localizing the tip with sub-millimeter accuracy and tracking the roll angle in real-time. The system’s data is integrated into an interactive Unity-based environment, rendered through the Meta Quest 3 XR headset, combining a dynamically tracked catheter with a patient-specific 3D heart model. This immersive environment allows testing the importance of 3D depth perception, in comparison to 2D projections, as a form of visualization in XR. Our experimental study, conducted using the ICE catheter with six participants, suggests that 3D visualization is not necessarily beneficial over 2D views offered by the XR system, although all cardiologists saw its utility for pre-operative training, planning, and intra-operative guidance. The proposed system qualitatively shows great promise in transforming catheter-based interventions, particularly ICE procedures, by improving visualization, interactivity, and skill development.

[AI-90] Computing critical exponents in 3D Ising model via pattern recognition/deep learning approach

链接: https://arxiv.org/abs/2411.02604
作者: Timothy A. Burt
关键词-EN: Convolutional Neural Network, supervised Deep Learning, Finite-Size Scaling Analysis, Neural Network, Deep Learning
类目: Computational Physics (physics.comp-ph); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this study, we computed three critical exponents ( \alpha, \beta, \gamma ) for the 3D Ising model with the Metropolis Algorithm using Finite-Size Scaling Analysis on six cube length scales (L=20,30,40,60,80,90), and performed a supervised Deep Learning (DL) approach (3D Convolutional Neural Network or CNN) to train a neural network on specific conformations of spin states. We find one can effectively reduce the information in thermodynamic ensemble-averaged quantities vs. reduced temperature t (magnetization per spin m(t) , specific heat per spin c(t) , magnetic susceptibility per spin \chi(t) ) to \textit{six} latent classes. We also demonstrate our CNN on a subset of L=20 conformations and achieve a train/test accuracy of 0.92 and 0.6875, respectively. However, more work remains to be done to quantify the feasibility of computing critical exponents from the output class labels (binned m, c, \chi ) from this approach and interpreting the results from DL models trained on systems in Condensed Matter Physics in general.
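For readers unfamiliar with the simulation side, the Metropolis update that generates the spin conformations fed to such a CNN can be sketched as below. The lattice size, inverse temperature, and sweep count are illustrative choices, not the paper's settings:

```python
import numpy as np

def metropolis_sweep(spins, beta, rng):
    """One Metropolis sweep over a 3D Ising lattice with periodic boundaries."""
    L = spins.shape[0]
    for _ in range(spins.size):
        x, y, z = rng.integers(0, L, size=3)
        # Sum of the six nearest neighbours (periodic boundary conditions).
        nb = (spins[(x + 1) % L, y, z] + spins[x - 1, y, z]
              + spins[x, (y + 1) % L, z] + spins[x, y - 1, z]
              + spins[x, y, (z + 1) % L] + spins[x, y, z - 1])
        dE = 2.0 * spins[x, y, z] * nb  # energy change if this spin flips
        if dE <= 0 or rng.random() < np.exp(-beta * dE):
            spins[x, y, z] *= -1
    return spins

rng = np.random.default_rng(0)
L = 8
spins = rng.choice([-1, 1], size=(L, L, L))
for _ in range(20):
    metropolis_sweep(spins, beta=0.5, rng=rng)  # beta well above the 3D critical value
m = abs(spins.mean())  # magnetization per spin, one of the binned observables
```

Finite-size scaling then repeats such runs across the six lattice sizes and fits the critical exponents from how m(t), c(t), and \chi(t) scale with L.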

[AI-91] Weakly supervised deep learning model with size constraint for prostate cancer detection in multiparametric MRI and generalization to unseen domains

链接: https://arxiv.org/abs/2411.02466
作者: Robin Trombetta(MYRIAD),Olivier Rouvière(HCL),Carole Lartizien(MYRIAD)
关键词-EN: medical segmentation tasks, shown promising performance, shown promising, Fully supervised, data
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Fully supervised deep models have shown promising performance for many medical segmentation tasks. Still, the deployment of these tools in clinics is limited by the very time-consuming collection of manually expert-annotated data. Moreover, most of the state-of-the-art models have been trained and validated on moderately homogeneous datasets. It is known that deep learning methods are often greatly degraded by domain or label shifts and are yet to be built in such a way as to be robust to unseen data or label distributions. In the clinical setting, this problem is particularly relevant as the deployment institutions may have different scanners or acquisition protocols than those from which the data has been collected to train the model. In this work, we propose to address these two challenges on the detection of clinically significant prostate cancer (csPCa) from bi-parametric MRI. We evaluate the method proposed by (Kervadec et al., 2018), which introduces a size constraint loss to produce fine semantic cancer lesion segmentations from weak circle scribble annotations. Performance of the model is based on two public (PI-CAI and Prostate158) and one private databases. First, we show that the model achieves on-par performance with strong fully supervised baseline models, both on in-distribution validation data and unseen test images. Second, we observe a performance decrease for both fully supervised and weakly supervised models when tested on unseen data domains. This confirms the crucial need for efficient domain adaptation methods if deep learning models are aimed to be deployed in a clinical environment. Finally, we show that ensemble predictions from multiple trainings increase generalization performance.

[AI-92] Diagnostic Performance of Deep Learning for Predicting Gliomas IDH and 1p/19q Status in MRI: A Systematic Review and Meta-Analysis

链接: https://arxiv.org/abs/2411.02426
作者: Somayeh Farahani,Marjaneh Hejazi,Mehnaz Tabassum,Antonio Di Ieva,Neda Mahdavifar,Sidong Liu
关键词-EN: primary brain tumors, common primary brain, brain tumors, common primary, primary brain
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Gliomas, the most common primary brain tumors, show high heterogeneity in histological and molecular characteristics. Accurate molecular profiling, like isocitrate dehydrogenase (IDH) mutation and 1p/19q codeletion, is critical for diagnosis, treatment, and prognosis. This review evaluates MRI-based deep learning (DL) models’ efficacy in predicting these biomarkers. Following PRISMA guidelines, we systematically searched major databases (PubMed, Scopus, Ovid, and Web of Science) up to February 2024, screening studies that utilized DL to predict IDH and 1p/19q codeletion status from MRI data of glioma patients. We assessed the quality and risk of bias using the radiomics quality score and QUADAS-2 tool. Our meta-analysis used a bivariate model to compute pooled sensitivity, specificity, and meta-regression to assess inter-study heterogeneity. Of the 565 articles, 57 were selected for qualitative synthesis, and 52 underwent meta-analysis. The pooled estimates showed high diagnostic performance, with validation sensitivity, specificity, and area under the curve (AUC) of 0.84 [prediction interval (PI): 0.67-0.93, I2=51.10%, p < 0.05], 0.87 [PI: 0.49-0.98, I2=82.30%, p < 0.05], and 0.89 for IDH prediction, and 0.76 [PI: 0.28-0.96, I2=77.60%, p < 0.05], 0.85 [PI: 0.49-0.97, I2=80.30%, p < 0.05], and 0.90 for 1p/19q prediction, respectively. Meta-regression analyses revealed significant heterogeneity influenced by glioma grade, data source, inclusion of non-radiomics data, MRI sequences, segmentation and feature extraction methods, and validation techniques. DL models demonstrate strong potential in predicting molecular biomarkers from MRI scans, with significant variability influenced by technical and clinical factors. Thorough external validation is necessary to increase clinical utility.

计算机视觉

[CV-0] Classification Done Right for Vision-Language Pre-Training NEURIPS2024

链接: https://arxiv.org/abs/2411.03313
作者: Huang Zilong,Ye Qinghao,Kang Bingyi,Feng Jiashi,Fan Haoqi
关键词-EN: super simple classification, simple classification method, super simple, method for vision-language, vision-language pre-training
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:We introduce SuperClass, a super simple classification method for vision-language pre-training on image-text data. Unlike its contrastive counterpart CLIP, which contrasts against a text encoder, SuperClass directly utilizes tokenized raw text as supervised classification labels, without the need for additional text filtering or selection. Due to the absence of text encoding as a contrastive target, SuperClass does not require a text encoder and does not need to maintain a large batch size as CLIP does. SuperClass demonstrated superior performance on various downstream tasks, including classic computer vision benchmarks and vision-language downstream tasks. We further explored the scaling behavior of SuperClass with respect to model size, training length, and data size, and reported encouraging results and comparisons to CLIP. this https URL
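The core idea, treating the tokenized caption as a bag of classification labels so no text encoder is needed, can be sketched as below. This is an illustrative reading only: the function name and the plain sigmoid binary cross-entropy are assumptions, and the paper's actual loss and any token weighting may differ:

```python
import numpy as np

def superclass_loss(image_logits, token_ids, vocab_size):
    """Bag-of-tokens classification loss: every token id appearing in an
    image's caption is treated as a positive class for that image.

    image_logits: (batch, vocab_size) scores from the vision backbone's head.
    token_ids:    list of token-id lists, one tokenized caption per image.
    """
    targets = np.zeros((len(token_ids), vocab_size))
    for i, ids in enumerate(token_ids):
        targets[i, ids] = 1.0  # the tokenizer output itself is the label
    # Numerically stable sigmoid binary cross-entropy.
    log_p = -np.logaddexp(0.0, -image_logits)      # log sigmoid(z)
    log_not_p = -np.logaddexp(0.0, image_logits)   # log (1 - sigmoid(z))
    return float(-(targets * log_p + (1.0 - targets) * log_not_p).mean())
```

Because the target is just a multi-hot vector over the vocabulary, there is no pairwise contrast and hence no need for the large batches CLIP relies on.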

[CV-1] DiT4Edit: Diffusion Transformer for Image Editing

链接: https://arxiv.org/abs/2411.03286
作者: Kunyu Feng,Yue Ma,Bingyuan Wang,Chenyang Qi,Haozhe Chen,Qifeng Chen,Zeyu Wang
关键词-EN: shape-aware object editing, methods for shape-aware, Diffusion Transformer-based image, Diffusion Transformers, recent advances
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Despite recent advances in UNet-based image editing, methods for shape-aware object editing in high-resolution images are still lacking. Compared to UNet, Diffusion Transformers (DiT) demonstrate superior capabilities to effectively capture the long-range dependencies among patches, leading to higher-quality image generation. In this paper, we propose DiT4Edit, the first Diffusion Transformer-based image editing framework. Specifically, DiT4Edit uses the DPM-Solver inversion algorithm to obtain the inverted latents, reducing the number of steps compared to the DDIM inversion algorithm commonly used in UNet-based frameworks. Additionally, we design unified attention control and patches merging, tailored for transformer computation streams. This integration allows our framework to generate higher-quality edited images faster. Our design leverages the advantages of DiT, enabling it to surpass UNet structures in image editing, especially in high-resolution and arbitrary-size images. Extensive experiments demonstrate the strong performance of DiT4Edit across various editing scenarios, highlighting the potential of Diffusion Transformers in supporting image editing.

[CV-2] ShadowMamba: State-Space Model with Boundary-Region Selective Scan for Shadow Removal

链接: https://arxiv.org/abs/2411.03260
作者: Xiujin Zhu,Chee-Onn Chow,Joon Huang Chuah
关键词-EN: low-level vision problem, typical low-level vision, shadow removal, Image shadow removal, shadow
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Image shadow removal is a typical low-level vision problem, where the presence of shadows leads to abrupt changes in brightness in certain regions, affecting the accuracy of upstream tasks. Current shadow removal methods still face challenges such as residual boundary artifacts, and capturing feature information at shadow boundaries is crucial for removing shadows and eliminating residual boundary artifacts. Recently, Mamba has achieved remarkable success in computer vision by globally modeling long-sequence information with linear complexity. However, when applied to image shadow removal, the original Mamba scanning method overlooks the semantic continuity of shadow boundaries as well as the continuity of semantics within the same region. Based on the unique characteristics of shadow images, this paper proposes a novel selective scanning method called boundary-region selective scanning. This method scans boundary regions, shadow regions, and non-shadow regions independently, bringing pixels of the same region type closer together in the long sequence, especially focusing on the local information at the boundaries, which is crucial for shadow removal. This method combines with global scanning and channel scanning to jointly accomplish the shadow removal. We name our model ShadowMamba, the first Mamba-based model for shadow removal. Extensive experimental results show that our method outperforms current state-of-the-art models across most metrics on multiple datasets. The code for ShadowMamba is available at (Code will be released upon acceptance).

[CV-3] Decoupling Fine Detail and Global Geometry for Compressed Depth Map Super-Resolution ECCV2024

链接: https://arxiv.org/abs/2411.03239
作者: Huan Zheng,Wencheng Han,Jianbing Shen
关键词-EN: gained significant attention, significant attention due, Recovering high-quality depth, consumer-grade depth cameras, Recovering high-quality
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: The 1st solution for the ECCV 2024 AIM Compressed Depth Upsampling Challenge

点击查看摘要

Abstract:Recovering high-quality depth maps from compressed sources has gained significant attention due to the limitations of consumer-grade depth cameras and the bandwidth restrictions during data transmission. However, current methods still suffer from two challenges. First, bit-depth compression produces a uniform depth representation in regions with subtle variations, hindering the recovery of detailed information. Second, densely distributed random noise reduces the accuracy of estimating the global geometric structure of the scene. To address these challenges, we propose a novel framework, termed geometry-decoupled network (GDNet), for compressed depth map super-resolution that decouples the high-quality depth map reconstruction process by handling global and detailed geometric features separately. To be specific, we propose the fine geometry detail encoder (FGDE), which is designed to aggregate fine geometry details in high-resolution low-level image features while simultaneously enriching them with complementary information from low-resolution context-level image features. In addition, we develop the global geometry encoder (GGE) that aims at suppressing noise and extracting global geometric information effectively via constructing compact feature representation in a low-rank space. We conduct experiments on multiple benchmark datasets, demonstrating that our GDNet significantly outperforms current methods in terms of geometric consistency and detail recovery. In the ECCV 2024 AIM Compressed Depth Upsampling Challenge, our solution won the 1st place award. Our codes will be available.

[CV-4] Topograph: An Efficient Graph-Based Framework for Strictly Topology Preserving Image Segmentation

链接: https://arxiv.org/abs/2411.03228
作者: Laurin Lux,Alexander H. Berger,Alexander Weers,Nico Stucki,Daniel Rueckert,Ulrich Bauer,Johannes C. Paetzold
关键词-EN: neglecting topological accuracy, pixel-wise loss functions, Topological correctness plays, image segmentation tasks, correctness plays
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Topological correctness plays a critical role in many image segmentation tasks, yet most networks are trained using pixel-wise loss functions, such as Dice, neglecting topological accuracy. Existing topology-aware methods often lack robust topological guarantees, are limited to specific use cases, or impose high computational costs. In this work, we propose a novel, graph-based framework for topologically accurate image segmentation that is both computationally efficient and generally applicable. Our method constructs a component graph that fully encodes the topological information of both the prediction and ground truth, allowing us to efficiently identify topologically critical regions and aggregate a loss based on local neighborhood information. Furthermore, we introduce a strict topological metric capturing the homotopy equivalence between the union and intersection of prediction-label pairs. We formally prove the topological guarantees of our approach and empirically validate its effectiveness on binary and multi-class datasets. Our loss demonstrates state-of-the-art performance with up to fivefold faster loss computation compared to persistent homology methods.
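A toy version of the topological quantity at stake: counting 4-connected foreground components (the 0-dimensional Betti number) in prediction and label and comparing them. The paper's component graph and loss are far richer than this; the sketch only illustrates the kind of mismatch a topology-aware loss penalizes that a pixel-wise loss like Dice cannot see:

```python
from collections import deque

def count_components(mask):
    """Count 4-connected foreground components in a binary mask (Betti number b0)."""
    H, W = len(mask), len(mask[0])
    seen = [[False] * W for _ in range(H)]
    n = 0
    for i in range(H):
        for j in range(W):
            if mask[i][j] and not seen[i][j]:
                n += 1
                # Breadth-first flood fill over this component.
                q = deque([(i, j)])
                seen[i][j] = True
                while q:
                    x, y = q.popleft()
                    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        nx, ny = x + dx, y + dy
                        if 0 <= nx < H and 0 <= ny < W and mask[nx][ny] and not seen[nx][ny]:
                            seen[nx][ny] = True
                            q.append((nx, ny))
    return n

def topology_error(pred, gt):
    """A crude topological discrepancy: difference in component counts."""
    return abs(count_components(pred) - count_components(gt))
```

Two masks can have nearly identical Dice scores yet different component counts, which is exactly the failure mode topologically accurate segmentation targets.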

[CV-5] Kernel Orthogonality does not necessarily imply a Decrease in Feature Map Redundancy in CNNs: Convolutional Similarity Minimization

链接: https://arxiv.org/abs/2411.03226
作者: Zakariae Belmekki,Jun Li,Patrick Reuter,David Antonio Gómez Jáuregui,Karl Jenkins
关键词-EN: Deep Learning due, Convolutional Neural Networks, Neural Networks, Deep Learning, Learning due
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Convolutional Neural Networks (CNNs) have been heavily used in Deep Learning due to their success in various tasks. Nonetheless, it has been observed that CNNs suffer from redundancy in feature maps, leading to inefficient capacity utilization. Efforts to mitigate and solve this problem led to the emergence of multiple methods, amongst which is kernel orthogonality, achieved through various means. In this work, we challenge the common belief that kernel orthogonality leads to a decrease in feature map redundancy, which is, supposedly, the ultimate objective behind kernel orthogonality. We prove, theoretically and empirically, that kernel orthogonality has an unpredictable effect on feature map similarity and does not necessarily decrease it. Based on our theoretical result, we propose an effective method to reduce feature map similarity independently of the input of the CNN. This is done by minimizing a novel loss function we call Convolutional Similarity. Empirical results show that minimizing the Convolutional Similarity increases the performance of classification models and can accelerate their convergence. Furthermore, using our proposed method pushes towards a more efficient use of the capacity of models, allowing the use of significantly smaller models to achieve the same levels of performance.
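Feature-map redundancy of the kind discussed here can be measured directly as pairwise cosine similarity between channel activations. The sketch below is a simplified, input-dependent stand-in: the paper's Convolutional Similarity loss is defined differently (it is independent of the CNN's input), so treat this only as an illustration of what "feature map similarity" means:

```python
import numpy as np

def feature_map_similarity(fmaps):
    """Mean absolute off-diagonal cosine similarity between channel feature maps.

    fmaps: (channels, H, W) activations for a single input. Returns a scalar
    in [0, 1]; higher values indicate more redundant channels.
    """
    C = fmaps.shape[0]
    flat = fmaps.reshape(C, -1)
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    gram = flat @ flat.T                      # (C, C) cosine similarities
    off_diag = gram[~np.eye(C, dtype=bool)]   # drop each channel's self-similarity
    return float(np.abs(off_diag).mean())
```

The paper's claim, restated in these terms, is that making the kernels orthogonal does not guarantee this quantity decreases.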

[CV-6] Pre-trained Visual Dynamics Representations for Efficient Policy Learning ECCV2024

链接: https://arxiv.org/abs/2411.03169
作者: Hao Luo,Bohan Zhou,Zongqing Lu
关键词-EN: Reinforcement Learning, Visual Dynamics Representations, Pre-trained Visual Dynamics, Visual Dynamics, purely video data
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: ECCV 2024

点击查看摘要

Abstract:Pre-training for Reinforcement Learning (RL) with purely video data is a valuable yet challenging problem. Although in-the-wild videos are readily available and inhere a vast amount of prior world knowledge, the absence of action annotations and the common domain gap with downstream tasks hinder utilizing videos for RL pre-training. To address the challenge of pre-training with videos, we propose Pre-trained Visual Dynamics Representations (PVDR) to bridge the domain gap between videos and downstream tasks for efficient policy learning. By adopting video prediction as a pre-training task, we use a Transformer-based Conditional Variational Autoencoder (CVAE) to learn visual dynamics representations. The pre-trained visual dynamics representations capture the visual dynamics prior knowledge in the videos. This abstract prior knowledge can be readily adapted to downstream tasks and aligned with executable actions through online adaptation. We conduct experiments on a series of robotics visual control tasks and verify that PVDR is an effective form for pre-training with videos to promote policy learning.

[CV-7] GarVerseLOD: High-Fidelity 3D Garment Reconstruction from a Single In-the-Wild Image using a Dataset with Levels of Details

链接: https://arxiv.org/abs/2411.03047
作者: Zhongjin Luo,Haolin Liu,Chenghong Li,Wanghao Du,Zirong Jin,Wanhu Sun,Yinyu Nie,Weikai Chen,Xiaoguang Han
关键词-EN: Neural implicit functions, brought impressive advances, clothed human digitization, Neural implicit, implicit functions
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: Project page: this https URL

点击查看摘要

Abstract:Neural implicit functions have brought impressive advances to the state-of-the-art of clothed human digitization from multiple or even single images. However, despite the progress, current arts still have difficulty generalizing to unseen images with complex cloth deformation and body poses. In this work, we present GarVerseLOD, a new dataset and framework that paves the way to achieving unprecedented robustness in high-fidelity 3D garment reconstruction from a single unconstrained image. Inspired by the recent success of large generative models, we believe that one key to addressing the generalization challenge lies in the quantity and quality of 3D garment data. Towards this end, GarVerseLOD collects 6,000 high-quality cloth models with fine-grained geometry details manually created by professional artists. In addition to the scale of training data, we observe that having disentangled granularities of geometry can play an important role in boosting the generalization capability and inference accuracy of the learned model. We hence craft GarVerseLOD as a hierarchical dataset with levels of details (LOD), spanning from detail-free stylized shape to pose-blended garment with pixel-aligned details. This allows us to make this highly under-constrained problem tractable by factorizing the inference into easier tasks, each narrowed down with smaller searching space. To ensure GarVerseLOD can generalize well to in-the-wild images, we propose a novel labeling paradigm based on conditional diffusion models to generate extensive paired images for each garment model with high photorealism. We evaluate our method on a massive amount of in-the-wild images. Experimental results demonstrate that GarVerseLOD can generate standalone garment pieces with significantly better quality than prior approaches. Project page: this https URL

[CV-8] Evaluation of handwriting kinematics and pressure for differential diagnosis of Parkinsons disease

链接: https://arxiv.org/abs/2411.03044
作者: Peter Drotár,Jiří Mekyska,Irena Rektorová,Lucia Masarová,Zdeněk Smékal,Marcos Faundez-Zanuy
关键词-EN: PaHaW Parkinson disease, Parkinson disease handwriting, Parkinson disease, PaHaW Parkinson, disease handwriting database
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 23 pages

点击查看摘要

Abstract:Objective: We present the PaHaW Parkinson’s disease handwriting database, consisting of handwriting samples from Parkinson’s disease (PD) patients and healthy controls. Our goal is to show that kinematic features and pressure features in handwriting can be used for the differential diagnosis of PD. Methods and Material: The database contains records from 37 PD patients and 38 healthy controls performing eight different handwriting tasks. The tasks include drawing an Archimedean spiral, repetitively writing orthographically simple syllables and words, and writing of a sentence. In addition to the conventional kinematic features related to the dynamics of handwriting, we investigated new pressure features based on the pressure exerted on the writing surface. To discriminate between PD patients and healthy subjects, three different classifiers were compared: K-nearest neighbors (K-NN), ensemble AdaBoost classifier, and support vector machines (SVM). Results: For predicting PD based on kinematic and pressure features of handwriting, the best performing model was SVM with classification accuracy of Pacc = 81.3% (sensitivity Psen = 87.4% and specificity of Pspe = 80.9%). When evaluated separately, pressure features proved to be relevant for PD diagnosis, yielding Pacc = 82.5% compared to Pacc = 75.4% using kinematic features. Conclusion: Experimental results showed that an analysis of kinematic and pressure features during handwriting can help assess subtle characteristics of handwriting and discriminate between PD patients and healthy controls.
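The kinematic features mentioned are typically simple statistics of the pen trajectory's time derivatives (velocity, acceleration, jerk). As a hedged sketch, the feature names and the finite-difference scheme below are illustrative and are not the PaHaW feature set:

```python
import numpy as np

def kinematic_features(x, y, t):
    """Basic kinematic descriptors of a pen trajectory sampled at times t."""
    dt = np.diff(t)
    vx, vy = np.diff(x) / dt, np.diff(y) / dt
    speed = np.hypot(vx, vy)                 # instantaneous pen speed
    accel = np.diff(speed) / dt[1:]          # rate of change of speed
    jerk = np.diff(accel) / dt[2:]           # rate of change of acceleration
    return {
        "mean_speed": float(speed.mean()),
        "speed_std": float(speed.std()),
        "mean_abs_jerk": float(np.abs(jerk).mean()),
    }
```

Vectors of such features (plus pressure statistics from the tablet) would then be fed to a classifier such as the SVM reported in the paper.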

[CV-9] Judge Like a Real Doctor: Dual Teacher Sample Consistency Framework for Semi-supervised Medical Image Classification

链接: https://arxiv.org/abs/2411.03041
作者: Zhang Qixiang,Yang Yuxiang,Zu Chen,Zhang Jianjia,Wu Xi,Zhou Jiliu,Wang Yan
关键词-EN: high annotation cost, Absolute Location consistency, Semi-supervised learning, popular solution, solution to alleviate
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by IEEE Transactions on Emerging Topics in Computational Intelligence

点击查看摘要

Abstract:Semi-supervised learning (SSL) is a popular solution to alleviate the high annotation cost in medical image classification. As a main branch of SSL, consistency regularization engages in imposing consensus between the predictions of a single sample from different views, termed Absolute Location consistency (AL-c). However, AL-c alone may be insufficient. Just as when diagnosing a case in practice, besides the case itself, the doctor usually refers to certain related trustworthy cases to make more reliable decisions. Thus, we argue that solely relying on AL-c may ignore the relative differences across samples, which we interpret as relative locations, and only exploit limited information from one perspective. To address this issue, we propose a Sample Consistency Mean Teacher (SCMT) which not only incorporates AL-c but also additionally enforces consistency between the samples' relative similarities to their related samples, called Relative Location consistency (RL-c). AL-c and RL-c conduct consistency regularization from two different perspectives, jointly extracting more diverse semantic information for classification. On the other hand, due to the highly similar structures in medical images, the sample distribution could be overly dense in feature space, making their relative locations susceptible to noise. To tackle this problem, we further develop a Sample Scatter Mean Teacher (SSMT) by utilizing contrastive learning to sparsify the sample distribution and obtain robust and effective relative locations. Extensive experiments on different datasets demonstrate the superiority of our method.

[CV-10] Rethinking Decoders for Transformer-based Semantic Segmentation: Compression is All You Need NEURIPS2024 UAI

链接: https://arxiv.org/abs/2411.03033
作者: Qishuai Wen,Chun-Guang Li
关键词-EN: typically adopt Transformer, extract additional embeddings, methods for Transformer-based, Transformer-based semantic segmentation, adopt Transformer decoders
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: NeurIPS2024. Code: this https URL

点击查看摘要

Abstract:State-of-the-art methods for Transformer-based semantic segmentation typically adopt Transformer decoders that are used to extract additional embeddings from image embeddings via cross-attention, refine either or both types of embeddings via self-attention, and project image embeddings onto the additional embeddings via dot-product. Despite their remarkable success, these empirical designs still lack theoretical justifications or interpretations, thus hindering potentially principled improvements. In this paper, we argue that there are fundamental connections between semantic segmentation and compression, especially between the Transformer decoders and Principal Component Analysis (PCA). From such a perspective, we derive a white-box, fully attentional DEcoder for PrIncipled semantiC segmenTation (DEPICT), with the interpretations as follows: 1) the self-attention operator refines image embeddings to construct an ideal principal subspace that aligns with the supervision and retains most information; 2) the cross-attention operator seeks to find a low-rank approximation of the refined image embeddings, which is expected to be a set of orthonormal bases of the principal subspace and corresponds to the predefined classes; 3) the dot-product operation yields compact representation for image embeddings as segmentation masks. Experiments conducted on the ADE20K dataset find that DEPICT consistently outperforms its black-box counterpart, Segmenter, and it is lightweight and more robust.
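The PCA reading of the decoder can be made concrete: project the (centered) image embeddings onto their top-k principal directions, where the orthonormal basis plays the role of the k class embeddings and the projection scores play the role of mask logits. The sketch below illustrates that interpretation only; it is not the DEPICT implementation:

```python
import numpy as np

def pca_mask_decoder(embeddings, k):
    """Project centered image embeddings onto their top-k principal directions.

    embeddings: (num_pixels, dim). Returns (num_pixels, k) projection scores
    (read as segmentation mask logits) and the (k, dim) orthonormal basis
    (read as the k class embeddings).
    """
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Right singular vectors give the principal directions of the embeddings.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]                       # orthonormal rows span the principal subspace
    return centered @ basis.T, basis
```

In the paper's terms, self-attention would prepare `embeddings` so that this subspace aligns with the supervision, and cross-attention would recover `basis`.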

[CV-11] FEDLAD: Federated Evaluation of Deep Leakage Attacks and Defenses

链接: https://arxiv.org/abs/2411.03019
作者: Isaac Baglin,Xiatian Zhu,Simon Hadfield
关键词-EN: Deep Leakage Attacks, Deep Leakage, learning paradigm designed, Federated Learning, evaluating Deep Leakage
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages

点击查看摘要

Abstract:Federated Learning is a privacy-preserving decentralized machine learning paradigm designed to collaboratively train models across multiple clients by exchanging gradients with a server while keeping private data local. Nevertheless, recent research has revealed that the security of Federated Learning is compromised, as private ground truth data can be recovered through a gradient inversion technique known as Deep Leakage. While these attacks are crafted with a focus on applications in Federated Learning, they are generally not evaluated in realistic scenarios. This paper introduces the FEDLAD Framework (Federated Evaluation of Deep Leakage Attacks and Defenses), a comprehensive benchmark for evaluating Deep Leakage attacks and defenses within a realistic Federated context. By implementing a unified benchmark that encompasses multiple state-of-the-art Deep Leakage techniques and various defense strategies, our framework facilitates the evaluation and comparison of the efficacy of these methods across different datasets and training states. This work highlights a crucial trade-off between privacy and model accuracy in Federated Learning and aims to advance the understanding of security challenges in decentralized machine learning systems, stimulate future research, and enhance reproducibility in evaluating Deep Leakage attacks and defenses.
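For intuition, a toy gradient-inversion (Deep-Leakage-style) sketch in NumPy, not the FEDLAD implementation: the linear model, the regression target, the learning rate, and the finite-difference optimization are all simplifying assumptions. The attacker only sees the gradient of the loss w.r.t. the weights and searches for a dummy input whose gradient matches it.

```python
import numpy as np

rng = np.random.default_rng(42)
W = rng.normal(size=(3, 4)) * 0.5      # toy "model" weights (hypothetical)
x_true = rng.normal(size=4)            # private client datum
t = rng.normal(size=3)                 # regression target

def grad_of_loss(x):
    # Gradient of 0.5*||Wx - t||^2 w.r.t. W -- what the client would share.
    r = W @ x - t
    return np.outer(r, x)

g_true = grad_of_loss(x_true)          # gradient observed by the attacker

def attack_objective(x):
    # Deep-Leakage-style objective: match the shared gradient.
    return 0.5 * np.sum((grad_of_loss(x) - g_true) ** 2)

x = rng.normal(size=4)                 # attacker's dummy datum
f0 = attack_objective(x)
eps, lr = 1e-6, 0.01
for _ in range(3000):                  # plain gradient descent via finite differences
    g = np.array([(attack_objective(x + eps * e) - attack_objective(x - eps * e)) / (2 * eps)
                  for e in np.eye(4)])
    x -= lr * g
f1 = attack_objective(x)
```

Driving the gradient-matching objective down pulls the dummy datum toward the private one, which is the leak these defenses try to block.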

[CV-12] CRT-Fusion: Camera Radar Temporal Fusion Using Motion Information for 3D Object Detection NEURIPS2024

链接: https://arxiv.org/abs/2411.03013
作者: Jisong Kim,Minjae Seong,Jun Won Choi
关键词-EN: Motion Guided Temporal, vehicles and robotics, Guided Temporal Fusion, Motion Feature Estimator, critical component
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at NeurIPS2024

点击查看摘要

Abstract:Accurate and robust 3D object detection is a critical component in autonomous vehicles and robotics. While recent radar-camera fusion methods have made significant progress by fusing information in the bird’s-eye view (BEV) representation, they often struggle to effectively capture the motion of dynamic objects, leading to limited performance in real-world scenarios. In this paper, we introduce CRT-Fusion, a novel framework that integrates temporal information into radar-camera fusion to address this challenge. Our approach comprises three key modules: Multi-View Fusion (MVF), Motion Feature Estimator (MFE), and Motion Guided Temporal Fusion (MGTF). The MVF module fuses radar and image features within both the camera view and bird’s-eye view, thereby generating a more precise unified BEV representation. The MFE module conducts two simultaneous tasks: estimation of pixel-wise velocity information and BEV segmentation. Based on the velocity and the occupancy score map obtained from the MFE module, the MGTF module aligns and fuses feature maps across multiple timestamps in a recurrent manner. By considering the motion of dynamic objects, CRT-Fusion can produce robust BEV feature maps, thereby improving detection accuracy and robustness. Extensive evaluations on the challenging nuScenes dataset demonstrate that CRT-Fusion achieves state-of-the-art performance for radar-camera-based 3D object detection. Our approach outperforms the previous best method in terms of NDS by +1.7%, while also surpassing the leading approach in mAP by +1.4%. These significant improvements in both metrics showcase the effectiveness of our proposed fusion strategy in enhancing the reliability and accuracy of 3D object detection.

[CV-13] Precise Drive with VLM: First Prize Solution for PRCV 2024 Drive LM challenge

链接: https://arxiv.org/abs/2411.02999
作者: Bin Huang,Siyu Wang,Yuanpeng Chen,Yidan Wu,Hui Song,Zifan Ding,Jing Leng,Chengpeng Liang,Peng Xue,Junliang Zhang,Tiankun Zhao
关键词-EN: technical report outlines, PRCV Challenge, focusing on cognition, technical report, report outlines
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This technical report outlines the methodologies we applied for the PRCV Challenge, focusing on cognition and decision-making in driving scenarios. We employed InternVL-2.0, a pioneering open-source multi-modal model, and enhanced it by refining both the model input and training methodologies. For the input data, we strategically concatenated and formatted the multi-view images. It is worth mentioning that we utilized the coordinates of the original images without transformation. In terms of model training, we initially pre-trained the model on publicly available autonomous driving scenario datasets to bolster its alignment capabilities of the challenge tasks, followed by fine-tuning on the DriveLM-nuscenes Dataset. During the fine-tuning phase, we innovatively modified the loss function to enhance the model’s precision in predicting coordinate values. These approaches ensure that our model possesses advanced cognitive and decision-making capabilities in driving scenarios. Consequently, our model achieved a score of 0.6064, securing the first prize on the competition’s final results.

[CV-14] PV-faultNet: Optimized CNN Architecture to detect defects resulting efficient PV production

链接: https://arxiv.org/abs/2411.02997
作者: Eiffat E Zaman,Rahima Khanam
关键词-EN: fundamental building block, renewable energy, green energy, global shift, shift towards renewable
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The global shift towards renewable energy has made PV cell manufacturing pivotal, as PV cells are the fundamental building block of green energy. The manufacturing process, however, is complex, and defects introduced along the way can undermine overall efficiency. At the moment, defect detection relies on manual inspection, which is prone to bias and is time- and cost-inefficient. Automated solutions have been proposed, but most are resource-intensive and therefore impractical in production environments. In that context, this study presents PV-faultNet, a lightweight Convolutional Neural Network (CNN) architecture optimized for efficient, real-time defect detection in photovoltaic (PV) cells, designed to be deployable on resource-limited production devices. Addressing computational challenges in industrial PV manufacturing environments, the model includes only 2.92 million parameters, significantly reducing processing demands without sacrificing accuracy. Comprehensive data augmentation techniques were implemented to tackle data scarcity, thus enhancing model generalization and maintaining a balance between precision and recall. The proposed model achieved high performance with 91% precision, 89% recall, and a 90% F1 score, demonstrating its effectiveness for scalable quality control in PV production.
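The 2.92M-parameter figure can be sanity-checked with standard layer parameter counts. Below is a hypothetical lightweight stack, not the actual PV-faultNet architecture, showing how convolutional and dense layers add up to a budget of the same order:

```python
def conv2d_params(c_in, c_out, k):
    # k*k kernel per (in, out) channel pair, plus one bias per output channel.
    return k * k * c_in * c_out + c_out

def dense_params(n_in, n_out):
    # One weight per (in, out) pair, plus one bias per output unit.
    return n_in * n_out + n_out

# Hypothetical lightweight stack, NOT the actual PV-faultNet layout.
layers = [conv2d_params(3, 32, 3),
          conv2d_params(32, 64, 3),
          conv2d_params(64, 128, 3),
          dense_params(128 * 8 * 8, 256),   # assumes an 8x8 feature map before flatten
          dense_params(256, 2)]             # binary defect / no-defect head
total = sum(layers)
```

Even in this sketch the first dense layer dominates, which is why lightweight designs keep the flattened feature map small.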

[CV-15] Efficient and Effective Adaptation of Multimodal Foundation Models in Sequential Recommendation SIGIR2024

链接: https://arxiv.org/abs/2411.02992
作者: Junchen Fu,Xuri Ge,Xin Xin,Alexandros Karatzoglou,Ioannis Arapakis,Kaiwen Zheng,Yongxin Ni,Joemon M. Jose
关键词-EN: advanced representation learning, revolutionized sequential recommender, sequential recommender systems, Multimodal foundation models, representation learning
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
*备注: The extension of IISAN in SIGIR2024

点击查看摘要

Abstract:Multimodal foundation models (MFMs) have revolutionized sequential recommender systems through advanced representation learning. While Parameter-efficient Fine-tuning (PEFT) is commonly used to adapt these models, studies often prioritize parameter efficiency, neglecting GPU memory and training speed. To address this, we introduced the IISAN framework, significantly enhancing efficiency. However, IISAN was limited to symmetrical MFMs and identical text and image encoders, preventing the use of state-of-the-art Large Language Models. To overcome this, we developed IISAN-Versa, a versatile plug-and-play architecture compatible with both symmetrical and asymmetrical MFMs. IISAN-Versa employs a Decoupled PEFT structure and utilizes both intra- and inter-modal adaptation. It effectively handles asymmetry through a simple yet effective combination of group layer-dropping and dimension transformation alignment. Our research demonstrates that IISAN-Versa effectively adapts large text encoders, and we further identify a scaling effect where larger encoders generally perform better. IISAN-Versa also demonstrates strong versatility in our defined multimodal scenarios, which include raw titles and captions generated from images and videos. Additionally, IISAN-Versa achieved state-of-the-art performance on the Microlens public benchmark. We will release our code and datasets to support future research.

[CV-16] CAD-NeRF: Learning NeRFs from Uncalibrated Few-view Images by CAD Model Retrieval

链接: https://arxiv.org/abs/2411.02979
作者: Xin Wen,Xuening Zhu,Renjiao Yi,Zhifeng Wang,Chenyang Zhu,Kai Xu
关键词-EN: shown great potential, realistic rendered images, neural radiance fields, neural radiance, shown great
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: The article has been accepted by Frontiers of Computer Science (FCS)

点击查看摘要

Abstract:Reconstructing from multi-view images is a longstanding problem in 3D vision, where neural radiance fields (NeRFs) have shown great potential and produce realistic rendered images of novel views. Currently, most NeRF methods require accurate camera poses, a large number of input images, or even both. Reconstructing a NeRF from few-view images without poses is challenging and highly ill-posed. To address this problem, we propose CAD-NeRF, a method that reconstructs from fewer than 10 images without any known poses. Specifically, we build a mini library of several CAD models from ShapeNet and render them from many random views. Given sparse-view input images, we run model and pose retrieval from the library to obtain a model with a similar shape, serving as the density supervision and pose initialization. Here we propose a multi-view pose retrieval method to avoid pose conflicts among views, which is a new and unseen problem in uncalibrated NeRF methods. Then, the geometry of the object is trained under CAD guidance, and the deformation of the density field and camera poses are optimized jointly. Texture and density are then trained and fine-tuned as well. All training phases are self-supervised. Comprehensive evaluations on synthetic and real images show that CAD-NeRF successfully learns accurate densities with a large deformation from retrieved CAD models, demonstrating its generalization ability.

[CV-17] Exploring Seasonal Variability in the Context of Neural Radiance Fields for 3D Reconstruction on Satellite Imagery

链接: https://arxiv.org/abs/2411.02972
作者: Liv Kåreborn,Erica Ingerstad,Amanda Berg,Justus Karlsson,Leif Haglund
关键词-EN: Neural Radiance Fields, Radiance Fields, Neural Radiance, seasonal predictive capabilities, capabilities of Neural
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this work, the seasonal predictive capabilities of Neural Radiance Fields (NeRF) applied to satellite images are investigated. Focusing on the utilization of satellite data, the study explores how Sat-NeRF, a novel approach in computer vision, performs in predicting seasonal variations across different months. Through comprehensive analysis and visualization, the study examines the model’s ability to capture and predict seasonal changes, highlighting specific challenges and strengths. Results showcase the impact of the sun direction on predictions, revealing nuanced details in seasonal transitions, such as snow cover, color accuracy, and texture representation in different landscapes. Given these results, we propose Planet-NeRF, an extension to Sat-NeRF capable of incorporating seasonal variability through a set of month embedding vectors. Comparative evaluations reveal that Planet-NeRF outperforms prior models in the case where seasonal changes are present. The extensive evaluation combined with the proposed method offers promising avenues for future research in this domain.

[CV-18] Multi-modal NeRF Self-Supervision for LiDAR Semantic Segmentation IROS

链接: https://arxiv.org/abs/2411.02969
作者: Xavier Timoneda,Markus Herb,Fabian Duerr,Daniel Goehring,Fisher Yu
关键词-EN: autonomous driving perception, driving perception consisting, LiDAR Semantic Segmentation, autonomous driving, consisting of associating
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2024

点击查看摘要

Abstract:LiDAR Semantic Segmentation is a fundamental task in autonomous driving perception consisting of associating each LiDAR point to a semantic label. Fully-supervised models have widely tackled this task, but they require labels for each scan, which either limits their domain or requires impractical amounts of expensive annotations. Camera images, which are generally recorded alongside LiDAR pointclouds, can be processed by the widely available 2D foundation models, which are generic and dataset-agnostic. However, distilling knowledge from 2D data to improve LiDAR perception raises domain adaptation challenges. For example, the classical perspective projection suffers from the parallax effect produced by the position shift between both sensors at their respective capture times. We propose a Semi-Supervised Learning setup to leverage unlabeled LiDAR pointclouds alongside distilled knowledge from the camera images. To self-supervise our model on the unlabeled scans, we add an auxiliary NeRF head and cast rays from the camera viewpoint over the unlabeled voxel features. The NeRF head predicts densities and semantic logits at each sampled ray location which are used for rendering pixel semantics. Concurrently, we query the Segment-Anything (SAM) foundation model with the camera image to generate a set of unlabeled generic masks. We fuse the masks with the rendered pixel semantics from LiDAR to produce pseudo-labels that supervise the pixel predictions. During inference, we drop the NeRF head and run our model with only LiDAR. We show the effectiveness of our approach in three public LiDAR Semantic Segmentation benchmarks: nuScenes, SemanticKITTI and ScribbleKITTI.

[CV-19] Mapping Africa Settlements: High Resolution Urban and Rural Map by Deep Learning and Satellite Imagery

链接: https://arxiv.org/abs/2411.02935
作者: Mohammad Kakooei,James Bailie,Albin Söderberg,Albin Becevic,Adel Daoud
关键词-EN: Accurate Land, Land Cover, sustainable development, natural resources, essential for understanding
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate Land Use and Land Cover (LULC) maps are essential for understanding the drivers of sustainable development, in terms of its complex interrelationships between human activities and natural resources. However, existing LULC maps often lack precise urban and rural classifications, particularly in diverse regions like Africa. This study presents a novel construction of a high-resolution rural-urban map using deep learning techniques and satellite imagery. We developed a deep learning model based on the DeepLabV3 architecture, which was trained on satellite imagery from Landsat-8 and the ESRI LULC dataset, augmented with human settlement data from the GHS-SMOD. The model utilizes semantic segmentation to classify land into detailed categories, including urban and rural areas, at a 10-meter resolution. Our findings demonstrate that incorporating LULC along with urban and rural classifications significantly enhances the model's ability to accurately distinguish between urban, rural, and non-human settlement areas. Therefore, our maps can support more informed decision-making for policymakers, researchers, and stakeholders. We release a continent-wide urban-rural map covering the years 2016 and 2022.

[CV-20] Fried deconvolution

链接: https://arxiv.org/abs/2411.02890
作者: Jerome Gilles,Stanley Osher
关键词-EN: long range imaging, range imaging, paper we present, approach to deblur, deblur the effect
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper we present a new approach to deblurring the effect of atmospheric turbulence in long range imaging. Our method is based on an analytical formulation, the Fried kernel, of the atmosphere modulation transfer function (MTF) and a framelet-based deconvolution algorithm. An important parameter is the refractive index structure constant, which normally requires dedicated measurements to be known. We therefore propose a method that provides a good estimate of this parameter directly from the input blurred image. The final algorithms are very easy to implement and show very good results on both simulated blur and real images.
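A minimal NumPy sketch of MTF-based deconvolution in the Fourier domain. A Gaussian OTF stands in for the Fried kernel, and a classic Wiener filter stands in for the framelet-based algorithm, so all parameters here are illustrative assumptions:

```python
import numpy as np

def gaussian_otf(shape, sigma):
    # Stand-in for the Fried MTF: a Gaussian low-pass in the frequency domain.
    fy = np.fft.fftfreq(shape[0])[:, None]
    fx = np.fft.fftfreq(shape[1])[None, :]
    return np.exp(-2 * (np.pi * sigma) ** 2 * (fx ** 2 + fy ** 2))

def wiener_deconvolve(blurred, otf, nsr=1e-3):
    # Classic Wiener filter: H* / (|H|^2 + NSR), applied in Fourier space.
    H = otf
    G = np.fft.fft2(blurred)
    F = np.conj(H) * G / (np.abs(H) ** 2 + nsr)
    return np.real(np.fft.ifft2(F))

rng = np.random.default_rng(0)
sharp = np.zeros((64, 64)); sharp[28:36, 28:36] = 1.0   # toy scene: a bright square
H = gaussian_otf(sharp.shape, sigma=1.5)
blurred = np.real(np.fft.ifft2(np.fft.fft2(sharp) * H))  # simulate the turbulence MTF
restored = wiener_deconvolve(blurred, H)
err_blur = np.mean((blurred - sharp) ** 2)
err_rest = np.mean((restored - sharp) ** 2)
```

The NSR term regularizes the inversion at frequencies the MTF has crushed, which is the same role the paper's estimated turbulence parameter plays in shaping the Fried kernel.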

[CV-21] Turbulence stabilization

链接: https://arxiv.org/abs/2411.02889
作者: Yu Mao,Jerome Gilles
关键词-EN: atmospheric turbulence, recently developed, stabilized image, sequence of frames, frames acquired
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We recently developed a new approach to obtain a stabilized image from a sequence of frames acquired through atmospheric turbulence. The goal of this algorithm is to remove the geometric distortions caused by atmospheric motion. The method is based on a variational formulation and is efficiently solved by the use of Bregman iterations and the operator splitting method. In this paper we study the influence of the choice of the regularizing term in the model, and we experiment with some of the most widely used regularization constraints available in the literature.

[CV-22] Enhancing Adversarial Robustness via Uncertainty-Aware Distributional Adversarial Training

链接: https://arxiv.org/abs/2411.02871
作者: Junhao Dong,Xinghua Qu,Z. Jane Wang,Yew-Soon Ong
关键词-EN: adversarial, Adversarial training, practical deployment, remarkable achievements, achievements in deep
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Despite remarkable achievements in deep learning across various domains, its inherent vulnerability to adversarial examples still remains a critical concern for practical deployment. Adversarial training has emerged as one of the most effective defensive techniques for improving model robustness against such malicious inputs. However, existing adversarial training schemes often lead to limited generalization against diverse adversaries due to their overreliance on a point-by-point augmentation strategy that maps each clean example to its adversarial counterpart during training. In addition, adversarial examples can induce significant disruptions in the statistical information w.r.t. the target model, thereby introducing substantial uncertainty and challenges to modeling the distribution of adversarial examples. To circumvent these issues, in this paper, we propose a novel uncertainty-aware distributional adversarial training method, which enforces adversary modeling by leveraging both the statistical information of adversarial examples and its corresponding uncertainty estimation, with the goal of augmenting the diversity of adversaries. Considering the potentially negative impact induced by aligning adversaries to misclassified clean examples, we also refine the alignment reference based on the statistical proximity to clean examples during adversarial training, thereby reframing adversarial training within a distribution-to-distribution matching framework interacting between the clean and adversarial domains. Furthermore, we design an introspective gradient alignment approach via matching input gradients between these domains without introducing external models. Extensive experiments across four benchmark datasets and various network architectures demonstrate that our approach achieves state-of-the-art adversarial robustness and maintains natural performance.

[CV-23] Centerness-based Instance-aware Knowledge Distillation with Task-wise Mutual Lifting for Object Detection on Drone Imagery

链接: https://arxiv.org/abs/2411.02861
作者: Bowei Du,Zhixuan Liao,Yanan Zhang,Zhi Cai,Jiaxin Chen,Di Huang
关键词-EN: Developing accurate, aerial scenes, accurate and efficient, efficient detectors, challenging due
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Developing accurate and efficient detectors for drone imagery is challenging due to the inherent complexity of aerial scenes. While some existing methods aim to achieve high accuracy by utilizing larger models, their computational cost is prohibitive for drones. Recently, Knowledge Distillation (KD) has shown promising potential for maintaining satisfactory accuracy while significantly compressing models in general object detection. Considering the advantages of KD, this paper presents the first attempt to adapt it to object detection on drone imagery and addresses two intrinsic issues: (1) the low foreground-background ratio and (2) small instances and complex backgrounds, which lead to inadequate training and thus insufficient distillation. Therefore, we propose a task-wise Lightweight Mutual Lifting (Light-ML) module with a Centerness-based Instance-aware Distillation (CID) strategy. The Light-ML module mutually harmonizes the classification and localization branches by channel shuffling and convolution, integrating teacher supervision across different tasks during back-propagation, thus facilitating the training of the student model. The CID strategy extracts valuable regions surrounding instances through the centerness of proposals, enhancing distillation efficacy. Experiments on the VisDrone, UAVDT, and COCO benchmarks demonstrate that the proposed approach improves the accuracy of existing state-of-the-art KD methods with comparable computational requirements. Codes will be available upon acceptance.

[CV-24] Continual Audio-Visual Sound Separation NEURIPS2024

链接: https://arxiv.org/abs/2411.02860
作者: Weiguo Pian,Yiyang Nan,Shijian Deng,Shentong Mo,Yunhui Guo,Yapeng Tian
关键词-EN: continuously separate sound, audio-visual sound separation, separate sound sources, previously learned classes, audio-visual sound
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: NeurIPS 2024

点击查看摘要

Abstract:In this paper, we introduce a novel continual audio-visual sound separation task, aiming to continuously separate sound sources for new classes while preserving performance on previously learned classes, with the aid of visual guidance. This problem is crucial for practical visually guided auditory perception as it can significantly enhance the adaptability and robustness of audio-visual sound separation models, making them more applicable for real-world scenarios where encountering new sound sources is commonplace. The task is inherently challenging as our models must not only effectively utilize information from both modalities in current tasks but also preserve their cross-modal association in old tasks to mitigate catastrophic forgetting during audio-visual continual learning. To address these challenges, we propose a novel approach named ContAV-Sep (Continual Audio-Visual Sound Separation). ContAV-Sep presents a novel Cross-modal Similarity Distillation Constraint (CrossSDC) to uphold cross-modal semantic similarity through incremental tasks and retain previously acquired knowledge of semantic similarity in old models, mitigating the risk of catastrophic forgetting. The CrossSDC can seamlessly integrate into the training process of different audio-visual sound separation frameworks. Experiments demonstrate that ContAV-Sep can effectively mitigate catastrophic forgetting and achieve significantly better performance compared to other continual learning baselines for audio-visual sound separation. Code is available at: this https URL.

[CV-25] OLAF: A Plug-and-Play Framework for Enhanced Multi-object Multi-part Scene Parsing ECCV

链接: https://arxiv.org/abs/2411.02858
作者: Pranav Gupta,Rishubh Singh,Pradeep Shenoy,Ravikiran Sarvadevabhatla
关键词-EN: Multi-object multi-part scene, complexity scales exponentially, Multi-object multi-part, multi-part scene segmentation, scene objects
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in The European Conference on Computer Vision (ECCV) 2024

点击查看摘要

Abstract:Multi-object multi-part scene segmentation is a challenging task whose complexity scales exponentially with part granularity and the number of scene objects. To address the task, we propose a plug-and-play approach termed OLAF. First, we augment the input (RGB) with channels containing object-based structural cues (fg/bg mask, boundary edge mask). We propose a weight adaptation technique which enables regular (RGB) pre-trained models to process the augmented (5-channel) input in a stable manner during optimization. In addition, we introduce an encoder module termed LDF to provide low-level dense feature guidance. This assists segmentation, particularly for smaller parts. OLAF enables significant mIoU gains of 3.3 (Pascal-Parts-58) and 3.5 (Pascal-Parts-108) over the SOTA model. On the most challenging variant (Pascal-Parts-201), the gain is 4.0. Experimentally, we show that OLAF's broad applicability enables gains across multiple architectures (CNN, U-Net, Transformer) and datasets. The code is available at this http URL
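One common way to let an RGB-pretrained first convolution accept a 5-channel input is to zero-initialize the weights of the two new channels, so the network initially reproduces its RGB behaviour. This is an assumption for illustration, not necessarily OLAF's actual weight adaptation technique:

```python
import numpy as np

rng = np.random.default_rng(0)
k, c_out = 3, 8
w_rgb = rng.normal(size=(c_out, 3, k, k))      # pretrained first-conv weights (RGB)

# Extend to 5 input channels (RGB + fg/bg mask + boundary mask), zero-init the
# new channels so the augmented model initially matches the RGB-only output.
w_aug = np.concatenate([w_rgb, np.zeros((c_out, 2, k, k))], axis=1)

patch_rgb = rng.normal(size=(3, k, k))                         # one receptive field
patch_aug = np.concatenate([patch_rgb, rng.normal(size=(2, k, k))], axis=0)

out_rgb = np.tensordot(w_rgb, patch_rgb, axes=3)   # conv response at one location
out_aug = np.tensordot(w_aug, patch_aug, axes=3)
```

Because the extra-channel weights start at zero, the augmented conv is a drop-in replacement at initialization, and the structural cues only influence the output as those weights are learned.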

[CV-26] Analyzing Poverty through Intra-Annual Time-Series: A Wavelet Transform Approach

链接: https://arxiv.org/abs/2411.02855
作者: Mohammad Kakooei,Klaudia Solska,Adel Daoud
关键词-EN: Sustainable Development Goals, Development Goals, Sustainable Development, Reducing global poverty, Reducing global
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Reducing global poverty is a key objective of the Sustainable Development Goals (SDGs). Achieving this requires high-frequency, granular data to capture neighborhood-level changes, particularly in data-scarce regions such as low- and middle-income countries. To fill in these data gaps, recent computer vision methods combine machine learning (ML) with earth observation (EO) data to improve poverty estimation. However, while much progress has been made, these methods often omit intra-annual variations, which are crucial for estimating poverty in agriculturally dependent countries. We explored the impact of integrating intra-annual NDVI information with annual multi-spectral data on model accuracy. To evaluate our method, we created a simulated dataset using Landsat imagery and nighttime light data to evaluate EO-ML methods that use intra-annual EO data. Additionally, we evaluated our method against the Demographic and Health Survey (DHS) dataset across Africa. Our results indicate that integrating specific NDVI-derived features with multi-spectral data provides valuable insights for poverty analysis, emphasizing the importance of retaining intra-annual information.
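For reference, NDVI itself is a simple band ratio computed from the near-infrared and red reflectances; a minimal NumPy sketch (the epsilon guard against division by zero is an implementation assumption):

```python
import numpy as np

def ndvi(nir, red, eps=1e-8):
    # Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red).
    # Values near 1 indicate dense vegetation, near 0 bare soil, negative water.
    return (nir - red) / (nir + red + eps)

nir = np.array([0.5, 0.6, 0.2])   # toy per-pixel reflectances
red = np.array([0.1, 0.3, 0.2])
v = ndvi(nir, red)
```

Computing this index per month and stacking the results is one way to expose the intra-annual vegetation signal the paper argues matters for agriculturally dependent regions.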

[CV-27] Advances in Photoacoustic Imaging Reconstruction and Quantitative Analysis for Biomedical Applications

链接: https://arxiv.org/abs/2411.02843
作者: Lei Wang,Weiming Zeng,Kai Long,Rongfeng Lan,Li Liu,Wai Ting Siok,Nizhuan Wang
关键词-EN: ensuring enhanced safety, acoustic penetration depth, innovative biomedical imaging, biomedical imaging modality, represents an innovative
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Photoacoustic imaging (PAI) represents an innovative biomedical imaging modality that harnesses the advantages of optical resolution and acoustic penetration depth while ensuring enhanced safety. Despite its promising potential across a diverse array of preclinical and clinical applications, the clinical implementation of PAI faces significant challenges, including the trade-off between penetration depth and spatial resolution, as well as the demand for faster imaging speeds. This paper explores the fundamental principles underlying PAI, with a particular emphasis on three primary implementations: photoacoustic computed tomography (PACT), photoacoustic microscopy (PAM), and photoacoustic endoscopy (PAE). We undertake a critical assessment of their respective strengths and practical limitations. Furthermore, recent developments in utilizing conventional or deep learning (DL) methodologies for image reconstruction and artefact mitigation across PACT, PAM, and PAE are outlined, demonstrating considerable potential to enhance image quality and accelerate imaging processes. This paper also examines recent developments in quantitative analysis within PAI, including the quantification of haemoglobin concentration, oxygen saturation, and other physiological parameters within tissues. Finally, our discussion encompasses current trends and future directions in PAI research while emphasizing the transformative impact of deep learning on advancing PAI.

[CV-28] Test-Time Dynamic Image Fusion NEURIPS2024

链接: https://arxiv.org/abs/2411.02840
作者: Bing Cao,Yinan Xia,Yi Ding,Changqing Zhang,Qinghua Hu
关键词-EN: dynamic image fusion, image fusion, image fusion lies, comprehensively integrating effective, dynamic image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:The inherent challenge of image fusion lies in capturing the correlation of multi-source images and comprehensively integrating effective information from different sources. Most existing techniques fail to perform dynamic image fusion while notably lacking theoretical guarantees, leading to potential deployment risks in this field. Is it possible to conduct dynamic image fusion with a clear theoretical justification? In this paper, we give our solution from a generalization perspective. We proceed to reveal the generalized form of image fusion and derive a new test-time dynamic image fusion paradigm. It provably reduces the upper bound of generalization error. Specifically, we decompose the fused image into multiple components corresponding to its source data. The decomposed components represent the effective information from the source data, thus the gap between them reflects the Relative Dominability (RD) of the uni-source data in constructing the fusion image. Theoretically, we prove that the key to reducing generalization error hinges on the negative correlation between the RD-based fusion weight and the uni-source reconstruction loss. Intuitively, RD dynamically highlights the dominant regions of each source and can be naturally converted to the corresponding fusion weight, achieving robust results. Extensive experiments and discussions with in-depth analysis on multiple benchmarks confirm our findings and superiority. Our code is available at this https URL.
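A toy NumPy sketch of the core idea, with hypothetical names and a softmax weighting standing in for the paper's RD-based weights: fusion weights are made negatively correlated with each uni-source reconstruction loss, so the source that reconstructs its component better dominates the fused result.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dynamic_fuse(components, recon_losses, temperature=1.0):
    # Test-time dynamic fusion sketch: weights are negatively correlated with
    # each uni-source reconstruction loss (a stand-in for Relative Dominability).
    w = softmax(-np.asarray(recon_losses) / temperature)
    fused = sum(wi * ci for wi, ci in zip(w, components))
    return fused, w

a = np.full((4, 4), 1.0)   # component from source A (e.g. infrared)
b = np.full((4, 4), 3.0)   # component from source B (e.g. visible)
fused, w = dynamic_fuse([a, b], recon_losses=[0.1, 0.9])
```

With the lower loss on source A, the fused image lands closer to A's component, illustrating how the weighting adapts per input at test time.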

[CV-29] Lost in Context: The Influence of Context on Feature Attribution Methods for Object Recognition

链接: https://arxiv.org/abs/2411.02833
作者: Sayanta Adhikari,Rishav Kumar,Konda Reddy Mopuri,Rajalakshmi Pachamuthu
关键词-EN: Contextual information plays, object recognition models, significantly affect accuracy, object recognition, underscoring models’ dependence
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Published in ICVGIP 2024

点击查看摘要

Abstract:Contextual information plays a critical role in object recognition models within computer vision, where changes in context can significantly affect accuracy, underscoring models' dependence on contextual cues. This study investigates how context manipulation influences both model accuracy and feature attribution, providing insights into the reliance of object recognition models on contextual information as understood through the lens of feature attribution methods. We employ a range of feature attribution techniques to decipher the reliance of deep neural networks on context in object recognition tasks. Using the ImageNet-9 and our curated ImageNet-CS datasets, we conduct experiments to evaluate the impact of contextual variations, analyzed through feature attribution methods. Our findings reveal several key insights: (a) Correctly classified images predominantly emphasize object volume attribution over context volume attribution. (b) The dependence on context remains relatively stable across different context modifications, irrespective of classification accuracy. (c) Context change exerts a more pronounced effect on model performance than context perturbations. (d) Surprisingly, context attribution in `no-information' scenarios is non-trivial. Our research moves beyond traditional methods by assessing the implications of broad-level modifications on object recognition, either in the object or its context.
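As a reminder of what the simplest feature attribution method computes, here is vanilla gradient saliency for a linear classifier in NumPy (a toy stand-in; the study uses deep networks and a range of attribution techniques): for a linear model, the gradient of a class score w.r.t. the input is exactly that class's weight row.

```python
import numpy as np

def saliency(W, x, cls):
    # Vanilla gradient attribution: |d(score_cls)/dx|. For scores = W @ x,
    # the gradient of scores[cls] w.r.t. x is simply the row W[cls].
    scores = W @ x
    grad = W[cls]
    return np.abs(grad), scores

W = np.array([[2.0, 0.0, -1.0],
              [0.0, 3.0, 0.5]])        # toy 2-class linear classifier
x = np.array([1.0, 2.0, 0.5])          # toy "image" with 3 input features
attr, scores = saliency(W, x, cls=int(np.argmax(W @ x)))
```

Attribution maps like this are what the study aggregates into object-volume versus context-volume attribution when context is manipulated.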

[CV-30] LiVOS: Light Video Object Segmentation with Gated Linear Matching

链接: https://arxiv.org/abs/2411.02818
作者: Qin Liu,Jianfeng Wang,Zhengyuan Yang,Linjie Li,Kevin Lin,Marc Niethammer,Lijuan Wang
关键词-EN: Semi-supervised video object, store past frame, past frame features, video object segmentation, Semi-supervised video
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Code & models: this https URL

点击查看摘要

Abstract:Semi-supervised video object segmentation (VOS) has been largely driven by space-time memory (STM) networks, which store past frame features in a spatiotemporal memory to segment the current frame via softmax attention. However, STM networks face memory limitations due to the quadratic complexity of softmax matching, restricting their applicability as video length and resolution increase. To address this, we propose LiVOS, a lightweight memory network that employs linear matching via linear attention, reformulating memory matching into a recurrent process that reduces the quadratic attention matrix to a constant-size, spatiotemporal-agnostic 2D state. To enhance selectivity, we introduce gated linear matching, where a data-dependent gate matrix is multiplied with the state matrix to control what information to retain or discard. Experiments on diverse benchmarks demonstrated the effectiveness of our method. It achieved 64.8 JF on MOSE and 85.1 JF on DAVIS, surpassing all non-STM methods and narrowing the gap with STM-based approaches. For longer and higher-resolution videos, it matched STM-based methods with 53% less GPU memory and supports 4096p inference on a 32G consumer-grade GPU–a previously cost-prohibitive capability–opening the door for long and high-resolution video foundation models.
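The gated linear matching described in the abstract can be sketched in a few lines. Below is a hypothetical NumPy toy (the shapes and the fixed gate matrix are our simplifications, not the paper's exact formulation) showing how all past keys and values fold into a constant-size 2D state instead of a memory that grows with video length:

```python
import numpy as np

def gated_linear_matching(q, K, V, G):
    """Fold all past frames into one (d_k, d_v) state via gated outer products.

    Toy sketch of linear matching: memory cost is independent of the number
    of stored frames T. Here the gate G is a fixed matrix for simplicity;
    in the paper it is data-dependent.
    """
    d_k, d_v = K.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))
    for k_t, v_t in zip(K, V):
        S = G * S + np.outer(k_t, v_t)   # gate controls what to retain/discard
    return q @ S                          # constant-size readout

rng = np.random.default_rng(0)
T, d_k, d_v = 128, 16, 8
K = rng.normal(size=(T, d_k))
V = rng.normal(size=(T, d_v))
q = rng.normal(size=d_k)
G = 1.0 / (1.0 + np.exp(-rng.normal(size=(d_k, d_v))))  # gate values in (0, 1)

out = gated_linear_matching(q, K, V, G)
print(out.shape)  # the state stays (d_k, d_v) however long the video is
```

With an all-ones gate the recurrence reduces to plain (ungated) linear attention, q K^T V, which is what the gating is meant to improve upon.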

[CV-31] ChatGPT in Research and Education: Exploring Benefits and Threats

链接: https://arxiv.org/abs/2411.02816
作者: Abu Saleh Musa Miah,Md Mahbubur Rahman Tusher,Md. Moazzem Hossain,Md Mamun Hossain,Md Abdur Rahim,Md Ekramul Hamid,Md. Saiful Islam,Jungpil Shin
关键词-EN: advanced artificial intelligence, artificial intelligence technologies, recent years, advanced artificial, intelligence technologies
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In recent years, advanced artificial intelligence technologies, such as ChatGPT, have significantly impacted various fields, including education and research. Developed by OpenAI, ChatGPT is a powerful language model that presents numerous opportunities for students and educators. It offers personalized feedback, enhances accessibility, enables interactive conversations, assists with lesson preparation and evaluation, and introduces new methods for teaching complex subjects. However, ChatGPT also poses challenges to traditional education and research systems. These challenges include the risk of cheating on online exams, the generation of human-like text that may compromise academic integrity, a potential decline in critical thinking skills, and difficulties in assessing the reliability of information generated by AI. This study examines both the opportunities and challenges ChatGPT brings to education from the perspectives of students and educators. Specifically, it explores the role of ChatGPT in helping students develop their subjective skills. To demonstrate its effectiveness, we conducted several subjective experiments using ChatGPT, such as generating solutions from subjective problem descriptions. Additionally, surveys were conducted with students and teachers to gather insights into how ChatGPT supports subjective learning and teaching. The results and analysis of these surveys are presented to highlight the impact of ChatGPT in this context.

[CV-32] ERUP-YOLO: Enhancing Object Detection Robustness for Adverse Weather Condition by Unified Image-Adaptive Processing

链接: https://arxiv.org/abs/2411.02799
作者: Yuka Ogino,Yuho Shoji,Takahiro Toizumi,Atsushi Ito
关键词-EN: image-adaptive object detection, object detection, later-stage object detections, image-adaptive object, object detection method
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We propose an image-adaptive object detection method for adverse weather conditions such as fog and low-light. Our framework employs differentiable preprocessing filters to perform image enhancement suitable for later-stage object detections. Our framework introduces two differentiable filters: a Bézier curve-based pixel-wise (BPW) filter and a kernel-based local (KBL) filter. These filters unify the functions of classical image processing filters and improve performance of object detection. We also propose a domain-agnostic data augmentation strategy using the BPW filter. Our method does not require data-specific customization of the filter combinations, parameter ranges, and data augmentation. We evaluate our proposed approach, called Enhanced Robustness by Unified Image Processing (ERUP)-YOLO, by applying it to the YOLOv3 detector. Experiments on adverse weather datasets demonstrate that our proposed filters match or exceed the expressiveness of conventional methods and our ERUP-YOLO achieved superior performance in a wide range of adverse weather conditions, including fog and low-light conditions.
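As a concrete illustration of a Bézier curve-based pixel-wise filter, here is a minimal NumPy sketch of a cubic Bézier tone curve with endpoints fixed at 0 and 1. Treating the normalized pixel intensity directly as the curve parameter is our simplification and not necessarily the paper's exact BPW formulation:

```python
import numpy as np

def bezier_tone_curve(img, p1, p2):
    """Cubic Bezier intensity mapping with endpoints fixed at 0 and 1.

    img: intensities in [0, 1]; p1, p2: learnable control values. The map is
    smooth in p1 and p2, so it can sit inside a differentiable pipeline.
    """
    t = np.clip(img, 0.0, 1.0)
    return (3 * (1 - t) ** 2 * t * p1      # influence of control point p1
            + 3 * (1 - t) * t ** 2 * p2    # influence of control point p2
            + t ** 3)                      # endpoint term (p3 = 1)

img = np.linspace(0.0, 1.0, 5)
print(bezier_tone_curve(img, p1=0.5, p2=0.95))  # lifts shadows and mid-tones
```

A useful sanity check: setting p1 = 1/3 and p2 = 2/3 recovers the identity curve B(t) = t, so the filter can learn to be a no-op.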

[CV-33] Real-Time Text Detection with Similar Mask in Traffic Industrial and Natural Scenes

链接: https://arxiv.org/abs/2411.02794
作者: Xu Han,Junyu Gao,Chuang Yang,Yuan Yuan,Qi Wang
关键词-EN: include mass information, scene include mass, transportation scene include, intelligent transportation scene, intelligent transportation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Texts on the intelligent transportation scene include mass information. Fully harnessing this information is one of the critical drivers for advancing intelligent transportation. Unlike the general scene, detecting text in transportation has extra demands, such as fast inference speed in addition to high accuracy. Most existing real-time text detection methods are based on the shrink mask, which loses some geometry semantic information and needs complex post-processing. In addition, the previous method usually focuses on correct output, which ignores feature correction and lacks guidance during the intermediate process. To this end, we propose an efficient multi-scene text detector that contains an effective text representation similar mask (SM) and a feature correction module (FCM). Unlike previous methods, the former aims to preserve the geometric information of the instances as much as possible. Its post-processing saves 50% of the time, accurately and efficiently reconstructing text contours. The latter encourages false positive features to move away from the positive feature center, optimizing the predictions from the feature level. Some ablation studies demonstrate the efficiency of the SM and the effectiveness of the FCM. Moreover, the deficiency of existing traffic datasets (such as the low-quality annotation or closed source data unavailability) motivated us to collect and annotate a traffic text dataset, which introduces motion blur. In addition, to validate the scene robustness of the SM-Net, we conduct experiments on traffic, industrial, and natural scene datasets. Extensive experiments verify it achieves state-of-the-art (SOTA) performance on several benchmarks. The code and dataset are available at: this https URL.

[CV-34] Advancing Recycling Efficiency: A Comparative Analysis of Deep Learning Models in Waste Classification

链接: https://arxiv.org/abs/2411.02779
作者: Zhanshan Qiao
关键词-EN: waste classification, recycling, deep learning models, Convolutional Neural Network
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by the 6th International Conference on Computing and Data Science (CONF-CDS 2024), 12 pages, 8 figures, references added

点击查看摘要

Abstract:With the ongoing increase in the worldwide population and escalating consumption habits, there’s a surge in the amount of waste generated. This situation poses considerable challenges for waste management and the optimization of recycling processes. This research tackles the pressing issue of waste classification for recycling by analyzing various deep learning models, including Convolutional Neural Network (CNN), AlexNet, ResNet, ResNet50 plus Support Vector Machine (SVM), and transformers, across a wide array of waste categories. This research meticulously compares these models on several targets like parameter settings, category accuracy, total accuracy and model parameters to establish a uniform evaluation standard. This research presents a novel method that incorporates SVM with deep learning frameworks, particularly ResNet50. The results indicate the method significantly boosts accuracy in complex waste classification. Notably, the transformer model outshines others in average accuracy, showcasing its aptitude for intricate classification tasks. To improve performance in poorly performing categories, the research advocates for enlarging the dataset, employing data augmentation, and leveraging sophisticated models such as transformers, along with refining training strategies. This research paves the way for future advancements in multi-category waste recycling and underscores the pivotal role of deep learning in promoting environmental sustainability.
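The ResNet50-plus-SVM idea amounts to training a max-margin classifier on deep features. A minimal NumPy sketch follows, with two Gaussian blobs standing in for CNN features (in the paper's pipeline these would come from a backbone such as ResNet50) and a hand-rolled hinge-loss linear SVM in place of a library solver:

```python
import numpy as np

# Stand-in for deep features: two separable Gaussian blobs, one per class.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 1.0, size=(100, 16)),
               rng.normal(2.0, 1.0, size=(100, 16))])
y = np.array([-1] * 100 + [1] * 100)

def train_linear_svm(X, y, lr=0.05, lam=0.01, epochs=300):
    """Full-batch gradient descent on the L2-regularized hinge loss."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        mask = y * (X @ w + b) < 1                # margin violators
        w -= lr * (lam * w - (y[mask, None] * X[mask]).sum(axis=0) / n)
        b -= lr * (-y[mask].sum() / n)
    return w, b

w, b = train_linear_svm(X, y)
acc = (np.sign(X @ w + b) == y).mean()
print(acc)  # well-separated features make this an easy problem for the SVM
```

The point of the hybrid design is exactly this division of labor: the backbone produces features in which classes are (near-)linearly separable, and the SVM supplies a robust max-margin decision boundary on top.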

[CV-35] FedBlock: A Blockchain Approach to Federated Learning against Backdoor Attacks

链接: https://arxiv.org/abs/2411.02773
作者: Duong H. Nguyen,Phi L. Nguyen,Truong T. Nguyen,Hieu H. Pham,Duc A. Tran
关键词-EN: private data locally, data locally stored, machine learning method, Federated Learning, private data
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注: This paper has been accepted as a full paper for the IEEE Special Session Federated Learning on Big Data 2024 (IEEE BigData 2024)

点击查看摘要

Abstract:Federated Learning (FL) is a machine learning method for training with private data locally stored in distributed machines without gathering them into one place for central learning. Despite its promises, FL is prone to critical security risks. First, because FL depends on a central server to aggregate local training models, this is a single point of failure. The server might function maliciously. Second, due to its distributed nature, FL might encounter backdoor attacks by participating clients. They can poison the local model before submitting to the server. Either type of attack, on the server or the client side, would severely degrade learning accuracy. We propose FedBlock, a novel blockchain-based FL framework that addresses both of these security risks. FedBlock is uniquely desirable in that it involves only smart contract programming, thus deployable atop any blockchain network. Our framework is substantiated with a comprehensive evaluation study using real-world datasets. Its robustness against backdoor attacks is competitive with the literature of FL backdoor defense. The latter, however, does not address the server risk as we do.
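The server-side risk the abstract describes can be made concrete with a tiny NumPy simulation. The coordinate-wise median below is a generic robust-aggregation defense used only for illustration; FedBlock's actual defense lives in blockchain smart contracts and may differ:

```python
import numpy as np

def fedavg(updates):
    """Plain federated averaging (vulnerable to a single poisoned update)."""
    return np.mean(updates, axis=0)

def robust_aggregate(updates):
    """Coordinate-wise median: one generic defense against a poisoned update.
    Illustrative only; not FedBlock's actual smart-contract logic."""
    return np.median(updates, axis=0)

rng = np.random.default_rng(2)
honest = [np.ones(4) + rng.normal(0.0, 0.01, 4) for _ in range(9)]
backdoor = [np.full(4, 100.0)]              # one malicious client's update
updates = np.stack(honest + backdoor)

print(fedavg(updates))            # dragged far from the honest value 1.0
print(robust_aggregate(updates))  # stays near 1.0
```

With nine honest clients near 1.0 and one attacker submitting 100.0, the mean lands around 10.9 while the median stays within the honest cluster, which is why aggregation choice matters for backdoor robustness.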

[CV-36] One-Stage-TFS: Thai One-Stage Fingerspelling Dataset for Fingerspelling Recognition Frameworks

链接: https://arxiv.org/abs/2411.02768
作者: Siriwiwat Lata,Sirawan Phiphitphatphaisit,Emmanuel Okafor,Olarik Surinta
关键词-EN: Thai One-Stage Fingerspelling, Thai sign language, comprehensive resource designed, Maha Sarakham University, Rajabhat Maha Sarakham
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 9 figures

点击查看摘要

Abstract:The Thai One-Stage Fingerspelling (One-Stage-TFS) dataset is a comprehensive resource designed to advance research in hand gesture recognition, explicitly focusing on the recognition of Thai sign language. This dataset comprises 7,200 images capturing 15 one-stage consonant gestures performed by undergraduate students from Rajabhat Maha Sarakham University, Thailand. The contributors include both expert students from the Special Education Department with proficiency in Thai sign language and students from other departments without prior sign language experience. Images were collected between July and December 2021 using a DSLR camera, with contributors demonstrating hand gestures against both simple and complex backgrounds. The One-Stage-TFS dataset presents challenges in detecting and recognizing hand gestures, offering opportunities to develop novel end-to-end recognition frameworks. Researchers can utilize this dataset to explore deep learning methods, such as YOLO, EfficientDet, RetinaNet, and Detectron, for hand detection, followed by feature extraction and recognition using techniques like convolutional neural networks, transformers, and adaptive feature fusion networks. The dataset is accessible via the Mendeley Data repository and supports a wide range of applications in computer science, including deep learning, computer vision, and pattern recognition, thereby encouraging further innovation and exploration in these fields.

[CV-37] Label Critic: Design Data Before Models

链接: https://arxiv.org/abs/2411.02753
作者: Pedro R. A. S. Bassi,Qilong Wu,Wenxuan Li,Sergio Decherchi,Andrea Cavalli,Alan Yuille,Zongwei Zhou
关键词-EN: datasets rapidly expand, medical datasets rapidly, Best-AI Labels, rapidly expand, expensive and time-consuming
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:As medical datasets rapidly expand, creating detailed annotations of different body structures becomes increasingly expensive and time-consuming. We consider that requesting radiologists to create detailed annotations is unnecessarily burdensome and that pre-existing AI models can largely automate this process. Following the spirit of “don’t use a sledgehammer on a nut”, we find that, rather than creating annotations from scratch, radiologists only have to review and edit errors if the Best-AI Labels have mistakes. To obtain the Best-AI Labels among multiple AI Labels, we developed an automatic tool, called Label Critic, that can assess label quality through tireless pairwise comparisons. Extensive experiments demonstrate that, when incorporated with our developed Image-Prompt pairs, pre-existing Large Vision-Language Models (LVLM), trained on natural images and texts, achieve 96.5% accuracy when choosing the best label in a pair-wise comparison, without extra fine-tuning. By transforming the manual annotation task (30-60 min/scan) into an automatic comparison task (15 sec/scan), we effectively reduce the manual efforts required from radiologists by an order of magnitude. When the Best-AI Labels are sufficiently accurate (81% depending on body structures), they will be directly adopted as the gold-standard annotations for the dataset, with lower-quality AI Labels automatically discarded. Label Critic can also check the label quality of a single AI Label with 71.8% accuracy when no alternatives are available for comparison, prompting radiologists to review and edit if the estimated quality is low (19% depending on body structures).
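Selecting the Best-AI Label by pairwise comparison reduces to a single-elimination tournament over candidate labels. A tiny sketch in plain Python, where the `better` comparator and the quality scores are hypothetical stand-ins for the LVLM judge that Label Critic builds from image-prompt pairs:

```python
def best_label(labels, better):
    """Keep whichever label the pairwise judge prefers, one duel at a time.

    `better(a, b)` returns True if label a beats label b; it stands in for
    the LVLM comparator (hypothetical interface, not the paper's exact API).
    """
    champ = labels[0]
    for cand in labels[1:]:
        if better(cand, champ):
            champ = cand
    return champ

# Hypothetical per-label quality scores standing in for the LVLM's judgment.
quality = {"ai_label_a": 0.71, "ai_label_b": 0.93, "ai_label_c": 0.66}
print(best_label(list(quality), lambda a, b: quality[a] > quality[b]))
```

For n candidate labels this needs only n-1 comparisons per scan, which is what turns a 30-60 minute annotation task into a seconds-long review task.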

[CV-38] Efficient Feature Aggregation and Scale-Aware Regression for Monocular 3D Object Detection

链接: https://arxiv.org/abs/2411.02747
作者: Yifan Wang,Xiaochen Yang,Fanqi Pu,Qingmin Liao,Wenming Yang
关键词-EN: attracted great attention, low cost, attracted great, simplicity and low, great attention due
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Monocular 3D object detection has attracted great attention due to simplicity and low cost. Existing methods typically follow conventional 2D detection paradigms, first locating object centers and then predicting 3D attributes via neighboring features. However, these methods predominantly rely on progressive cross-scale feature aggregation and focus solely on local information, which may result in a lack of global awareness and the omission of small-scale objects. In addition, due to large variation in object scales across different scenes and depths, inaccurate receptive fields often lead to background noise and degraded feature representation. To address these issues, we introduce MonoASRH, a novel monocular 3D detection framework composed of Efficient Hybrid Feature Aggregation Module (EH-FAM) and Adaptive Scale-Aware 3D Regression Head (ASRH). Specifically, EH-FAM employs multi-head attention with a global receptive field to extract semantic features for small-scale objects and leverages lightweight convolutional modules to efficiently aggregate visual features across different scales. The ASRH encodes 2D bounding box dimensions and then fuses scale features with the semantic features aggregated by EH-FAM through a scale-semantic feature fusion module. The scale-semantic feature fusion module guides ASRH in learning dynamic receptive field offsets, incorporating scale priors into 3D position prediction for better scale-awareness. Extensive experiments on the KITTI and Waymo datasets demonstrate that MonoASRH achieves state-of-the-art performance.

[CV-39] DDFAV: Remote Sensing Large Vision Language Models Dataset and Evaluation Benchmark

链接: https://arxiv.org/abs/2411.02733
作者: Haodong Li,Haicheng Qu,Xiaofeng Zhang
关键词-EN: vision language models, large vision language, shown excellent results, remote sensing, language models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:With the rapid development of large vision language models (LVLMs), these models have shown excellent results in various multimodal tasks. Since LVLMs are prone to hallucinations and there are currently few datasets and evaluation methods specifically designed for remote sensing, their performance is typically poor when applied to remote sensing tasks. To address these issues, this paper introduces a high quality remote sensing LVLMs dataset, DDFAV, created using data augmentation and data mixing strategies. Next, a training instruction set is produced based on some high-quality remote sensing images selected from the proposed dataset. Finally, we develop a remote sensing LVLMs hallucination evaluation method RSPOPE based on the proposed dataset and evaluate the zero-shot capabilities of different LVLMs. Our proposed dataset, instruction set, and evaluation method files are available at this https URL.
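A POPE-style hallucination evaluation scores a model's yes/no answers to object-presence questions. The sketch below follows the common POPE formulation (accuracy, precision, recall, F1, and yes-rate); RSPOPE's exact protocol may differ in details:

```python
def pope_scores(answers, labels):
    """POPE-style yes/no hallucination metrics over question-answer pairs.

    `answers` are the model's responses, `labels` the ground truth; a high
    yes-rate relative to the label distribution signals over-affirmation.
    """
    tp = sum(a == "yes" and l == "yes" for a, l in zip(answers, labels))
    fp = sum(a == "yes" and l == "no" for a, l in zip(answers, labels))
    fn = sum(a == "no" and l == "yes" for a, l in zip(answers, labels))
    tn = sum(a == "no" and l == "no" for a, l in zip(answers, labels))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {"accuracy": (tp + tn) / len(labels),
            "precision": precision,
            "recall": recall,
            "f1": 2 * precision * recall / (precision + recall),
            "yes_rate": (tp + fp) / len(labels)}

labels  = ["yes", "yes", "no", "no", "no", "yes"]   # toy ground truth
answers = ["yes", "no", "no", "yes", "no", "yes"]   # toy model responses
print(pope_scores(answers, labels))
```

In this toy run the model answers "yes" half the time against a 50% yes label rate, so the yes-rate alone would not reveal hallucination; the precision/recall split does.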

[CV-40] CIT: Rethinking Class-incremental Semantic Segmentation with a Class Independent Transformation

链接: https://arxiv.org/abs/2411.02715
作者: Jinchao Ge,Bowen Zhang,Akide Liu,Minh Hieu Phan,Qi Chen,Yangyang Shu,Yang Zhao
关键词-EN: Class-incremental semantic segmentation, Class-incremental semantic, learn to segment, segment previous, latest data
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 5 figures

点击查看摘要

Abstract:Class-incremental semantic segmentation (CSS) requires that a model learn to segment new classes without forgetting how to segment previous ones: this is typically achieved by distilling the current knowledge and incorporating the latest data. However, bypassing iterative distillation by directly transferring outputs of initial classes to the current learning task is not supported in existing class-specific CSS methods. Via Softmax, they enforce dependency between classes and adjust the output distribution at each learning step, resulting in a large probability distribution gap between initial and current tasks. We introduce a simple, yet effective Class Independent Transformation (CIT) that converts the outputs of existing semantic segmentation models into class-independent forms with negligible cost or performance loss. By utilizing class-independent predictions facilitated by CIT, we establish an accumulative distillation framework, ensuring equitable incorporation of all class information. We conduct extensive experiments on various segmentation architectures, including DeepLabV3, Mask2Former, and SegViTv2. Results from these experiments show minimal task forgetting across different datasets, with less than 5% for ADE20K in the most challenging 11 task configurations and less than 1% across all configurations for the PASCAL VOC 2012 dataset.
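The softmax coupling that CIT removes is easy to demonstrate numerically: adding one class shifts every old class's softmax probability, while independent per-class sigmoids (one class-independent form in the spirit of CIT, not necessarily its exact transformation) leave old outputs untouched:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

old = np.array([2.0, 0.5, -1.0])    # logits for the initial classes
new = np.append(old, 3.0)           # one incremental class added later

# Softmax couples classes: old-class probabilities shift when a class arrives.
print(softmax(old)[:3])
print(softmax(new)[:3])
# Per-class sigmoids do not: the old outputs are unchanged.
print(sigmoid(old))
print(sigmoid(new)[:3])
```

This is why class-independent outputs can be transferred directly across incremental steps, enabling the accumulative distillation the abstract describes.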

[CV-41] Full Field Digital Mammography Dataset from a Population Screening Program

链接: https://arxiv.org/abs/2411.02710
作者: Edward Kendall,Paraham Hajishafiezahramini,Matthew Hamilton,Gregory Doyle,Nancy Wadden,Oscar Meruvia-Pastor
关键词-EN: Breast cancer presents, largest cancer risk, Breast cancer, world to women, Breast
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Breast cancer presents the second largest cancer risk in the world to women. Early detection of cancer has been shown to be effective in reducing mortality. Population screening programs schedule regular mammography imaging for participants, promoting early detection. Currently, such screening programs require manual reading. False-positive errors in the reading process unnecessarily lead to costly follow-up and patient anxiety. Automated methods promise to provide more efficient, consistent and effective reading. To facilitate their development, a number of datasets have been created. With the aim of specifically targeting population screening programs, we introduce NL-Breast-Screening, a dataset from a Canadian provincial screening program. The dataset consists of 5997 mammography exams, each of which has four standard views and is biopsy-confirmed. Cases where radiologist reading was a false-positive are identified. NL-Breast is made publicly available as a new resource to promote advances in automation for population screening programs.

[CV-42] Transferable polychromatic optical encoder for neural networks

链接: https://arxiv.org/abs/2411.02697
作者: Minho Choi,Jinlin Xiang,Anna Wirth-Singh,Seung-Hwan Baek,Eli Shlizerman,Arka Majumdar
关键词-EN: Artificial neural networks, providing unprecedented performance, Artificial neural, neural networks, providing unprecedented
类目: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
*备注: 21 pages, 4 figures, 2 tables

点击查看摘要

Abstract:Artificial neural networks (ANNs) have fundamentally transformed the field of computer vision, providing unprecedented performance. However, these ANNs for image processing demand substantial computational resources, often hindering real-time operation. In this paper, we demonstrate an optical encoder that can perform convolution simultaneously in three color channels during the image capture, effectively implementing several initial convolutional layers of an ANN. Such an optical encoding results in ~24,000 times reduction in computational operations, with a state-of-the-art classification accuracy (~73.2%) in free-space optical system. In addition, our analog optical encoder, trained for CIFAR-10 data, can be transferred to the ImageNet subset, High-10, without any modifications, and still exhibits moderate accuracy. Our results evidence the potential of a hybrid optical/digital computer vision system in which the optical frontend can pre-process an ambient scene to reduce the energy and latency of the whole computer vision system.

[CV-43] Multi-Transmotion: Pre-trained Model for Human Motion Prediction

链接: https://arxiv.org/abs/2411.02673
作者: Yang Gao,Po-Chien Luan,Alexandre Alahi
关键词-EN: autonomous vehicle navigation, predict human behaviors, human motion prediction, behaviors is crucial, social robotics
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: CoRL 2024

点击查看摘要

Abstract:The ability of intelligent systems to predict human behaviors is crucial, particularly in fields such as autonomous vehicle navigation and social robotics. However, the complexity of human motion has prevented the development of a standardized dataset for human motion prediction, thereby hindering the establishment of pre-trained models. In this paper, we address these limitations by integrating multiple datasets, encompassing both trajectory and 3D pose keypoints, to propose a pre-trained model for human motion prediction. We merge seven distinct datasets across varying modalities and standardize their formats. To facilitate multimodal pre-training, we introduce Multi-Transmotion, an innovative transformer-based model designed for cross-modality pre-training. Additionally, we present a novel masking strategy to capture rich representations. Our methodology demonstrates competitive performance across various datasets on several downstream tasks, including trajectory prediction in the NBA and JTA datasets, as well as pose prediction in the AMASS and 3DPW datasets. The code is publicly available: this https URL

[CV-44] Semantic-Aligned Adversarial Evolution Triangle for High-Transferability Vision-Language Attack

链接: https://arxiv.org/abs/2411.02669
作者: Xiaojun Jia,Sensen Gao,Qing Guo,Ke Ma,Yihao Huang,Simeng Qin,Yang Liu,Ivor Tsang Fellow,Xiaochun Cao
关键词-EN: Vision-language pre-training, practical VLP models, adversarial, VLP models, excel at interpreting
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Vision-language pre-training (VLP) models excel at interpreting both images and text but remain vulnerable to multimodal adversarial examples (AEs). Advancing the generation of transferable AEs, which succeed across unseen models, is key to developing more robust and practical VLP models. Previous approaches augment image-text pairs to enhance diversity within the adversarial example generation process, aiming to improve transferability by expanding the contrast space of image-text features. However, these methods focus solely on diversity around the current AEs, yielding limited gains in transferability. To address this issue, we propose to increase the diversity of AEs by leveraging the intersection regions along the adversarial trajectory during optimization. Specifically, we propose sampling from adversarial evolution triangles composed of clean, historical, and current adversarial examples to enhance adversarial diversity. We provide a theoretical analysis to demonstrate the effectiveness of the proposed adversarial evolution triangle. Moreover, we find that redundant inactive dimensions can dominate similarity calculations, distorting feature matching and making AEs model-dependent with reduced transferability. Hence, we propose to generate AEs in the semantic image-text feature contrast space, which can project the original feature space into a semantic corpus subspace. The proposed semantic-aligned subspace can reduce the image feature redundancy, thereby improving adversarial transferability. Extensive experiments across different datasets and models demonstrate that the proposed method can effectively improve adversarial transferability and outperform state-of-the-art adversarial attack methods. The code is released at this https URL.
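Sampling from the adversarial evolution triangle amounts to drawing convex combinations of the clean, historical, and current adversarial examples. A toy NumPy sketch (Dirichlet(1,1,1) weights give uniform barycentric sampling; the paper's actual sampling scheme may differ in detail):

```python
import numpy as np

def sample_evolution_triangle(x_clean, x_hist, x_curr, n, rng):
    """Draw n points from the triangle spanned by the three examples.

    Dirichlet(1, 1, 1) weights are uniform over the simplex, so each sample
    is a uniform barycentric combination of the triangle's vertices.
    """
    w = rng.dirichlet(np.ones(3), size=n)        # (n, 3), each row sums to 1
    pts = np.stack([x_clean, x_hist, x_curr])    # (3, d) triangle vertices
    return w @ pts                               # (n, d) points in the triangle

rng = np.random.default_rng(3)
x_clean = rng.normal(size=8)                         # clean input
x_hist = x_clean + rng.normal(scale=0.1, size=8)     # historical AE
x_curr = x_clean + rng.normal(scale=0.2, size=8)     # current AE
samples = sample_evolution_triangle(x_clean, x_hist, x_curr, 100, rng)
print(samples.shape)
```

Because every sample is a convex combination, all drawn points stay inside the coordinate-wise bounds of the three vertices, which is what makes them mild, trajectory-aligned perturbations rather than arbitrary noise.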

[CV-45] Data-Driven Hierarchical Open Set Recognition ICRA

链接: https://arxiv.org/abs/2411.02635
作者: Andrew Hannum,Max Conway,Mario Lopez,André Harrison
关键词-EN: open set recognition, utilizing constrained agglomerative, constrained agglomerative clustering, requiring manual relational, manual relational information
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted as Extended Abstract to the IEEE ICRA@40 2024

点击查看摘要

Abstract:This paper presents a novel data-driven hierarchical approach to open set recognition (OSR) for robust perception in robotics and computer vision, utilizing constrained agglomerative clustering to automatically build a hierarchy of known classes in embedding space without requiring manual relational information. The method, demonstrated on the Animals with Attributes 2 (AwA2) dataset, achieves competitive results with an AUC ROC score of 0.82 and utility score of 0.85, while introducing two classification approaches (score-based and traversal-based) and a new Concentration Centrality (CC) metric for measuring hierarchical classification consistency. Although not surpassing existing models in accuracy, the approach provides valuable additional information about unknown classes through automatically generated hierarchies, requires no supplementary information beyond typical supervised model requirements, and introduces the Class Concentration Centrality (CCC) metric for evaluating unknown class placement consistency, with future work aimed at improving accuracy, validating the CC metric, and expanding to Large-Scale Open-Set Classification Protocols for ImageNet.
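Building a class hierarchy by agglomerative clustering in embedding space can be sketched with a naive average-linkage implementation. This is an unconstrained toy (the paper uses constrained agglomerative clustering on real embeddings), kept deliberately small:

```python
import numpy as np

def agglomerate(X):
    """Naive average-linkage agglomerative clustering.

    Returns the merge history as (id_a, id_b) pairs; merged clusters get
    fresh ids starting at len(X), mirroring the usual linkage convention.
    """
    clusters = {i: [i] for i in range(len(X))}
    merges, next_id = [], len(X)
    while len(clusters) > 1:
        ids = list(clusters)
        best, pair = np.inf, None
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                a, b = ids[i], ids[j]
                d = np.mean([np.linalg.norm(X[p] - X[q])
                             for p in clusters[a] for q in clusters[b]])
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b))
        next_id += 1
    return merges

# Two tight, well-separated "known classes": members merge first, then the
# two class-level clusters join at the top of the hierarchy.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.2, 5.0]])
print(agglomerate(X))
```

The resulting merge tree is exactly the kind of automatically generated hierarchy the method traverses when placing unknown classes.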

[CV-46] Tracking Tumors under Deformation from Partial Point Clouds using Occupancy Networks IROS2024

链接: https://arxiv.org/abs/2411.02619
作者: Pit Henrich,Jiawei Liu,Jiawei Ge,Samuel Schmidgall,Lauren Shepard,Ahmed Ezzat Ghazi,Franziska Mathis-Ullrich,Axel Krieger
关键词-EN: determine their position, preoperative CT scans, track tumors, tumors, increased operation time
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at IROS 2024

点击查看摘要

Abstract:To track tumors during surgery, information from preoperative CT scans is used to determine their position. However, as the surgeon operates, the tumor may be deformed which presents a major hurdle for accurately resecting the tumor, and can lead to surgical inaccuracy, increased operation time, and excessive margins. This issue is particularly pronounced in robot-assisted partial nephrectomy (RAPN), where the kidney undergoes significant deformations during operation. Toward addressing this, we introduce an occupancy network-based method for the localization of tumors within kidney phantoms undergoing deformations at interactive speeds. We validate our method by introducing a 3D hydrogel kidney phantom embedded with exophytic and endophytic renal tumors. It closely mimics real tissue mechanics to simulate kidney deformation during in vivo surgery, providing excellent contrast and clear delineation of tumor margins to enable automatic threshold-based segmentation. Our findings indicate that the proposed method can localize tumors in moderately deforming kidneys with a margin of 6mm to 10mm, while providing essential volumetric 3D information at over 60Hz. This capability directly enables downstream tasks such as robotic resection.
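An occupancy-based localizer queries a field of points and reads off where the field is "inside". The NumPy sketch below replaces the trained occupancy network with an analytic spherical tumor (a stand-in, not the paper's learned model) and localizes it as the occupied centroid of a query grid:

```python
import numpy as np

def occupancy(points, center, radius=0.2):
    """Stand-in for a trained occupancy network: 1 inside a spherical
    'tumor'. The real network is learned from partial point clouds."""
    return (np.linalg.norm(points - center, axis=1) < radius).astype(float)

def localize(center, resolution=40):
    """Query a dense grid and take the occupied centroid as the estimated
    tumor position; grid resolution trades accuracy against query cost."""
    g = np.linspace(-1.0, 1.0, resolution)
    grid = np.stack(np.meshgrid(g, g, g), axis=-1).reshape(-1, 3)
    occ = occupancy(grid, center)
    return grid[occ > 0.5].mean(axis=0)

true_center = np.array([0.25, -0.1, 0.3])
est = localize(true_center)
print(np.linalg.norm(est - true_center))  # localization error in grid units
```

With a 40-point grid per axis the spacing is about 0.05, so the centroid estimate lands within roughly one grid cell of the true center; a learned occupancy field would additionally track the deformation over time.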

[CV-47] Tracker: Tracking Based Vector HD Mapping using Top-Down Road Images

链接: https://arxiv.org/abs/2411.02588
作者: Mohammad Mahdavian,Mo Chen,Yu Zhang
关键词-EN: top-down road images, propose a tracking-based, top-down road, tracking-based HD mapping, tile images
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we propose a tracking-based HD mapping algorithm for top-down road images, referred to as tile images. While HD maps traditionally rely on perspective camera images, our approach shows that tile images can also be effectively utilized, offering valuable contributions to this research area as it can be the start of a new path in HD mapping algorithms. We modified the BEVFormer layers to generate BEV masks from tile images, which are then used by the model to generate divider and boundary lines. Our model was tested with both color and intensity images, and we present quantitative and qualitative results to demonstrate its performance.

[CV-48] Real-Time Detection for Small UAVs: Combining YOLO and Multi-frame Motion Analysis

链接: https://arxiv.org/abs/2411.02582
作者: Juanqin Liu,Leonardo Plotegher,Eloy Roura,Cristino de Souza Junior,Shaoming He
关键词-EN: Unmanned Aerial Vehicle, Unmanned Aerial, Aerial Vehicle, mitigating security risks, detection technology plays
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Unmanned Aerial Vehicle (UAV) detection technology plays a critical role in mitigating security risks and safeguarding privacy in both military and civilian applications. However, traditional detection methods face significant challenges in identifying UAV targets with extremely small pixels at long distances. To address this issue, we propose the Global-Local YOLO-Motion (GL-YOMO) detection algorithm, which combines You Only Look Once (YOLO) object detection with multi-frame motion detection techniques, markedly enhancing the accuracy and stability of small UAV target detection. The YOLO detection algorithm is optimized through multi-scale feature fusion and attention mechanisms, while the integration of the Ghost module further improves efficiency. Additionally, a motion detection approach based on template matching is being developed to augment detection capabilities for minute UAV targets. The system utilizes a global-local collaborative detection strategy to achieve high precision and efficiency. Experimental results on a self-constructed fixed-wing UAV dataset demonstrate that the GL-YOMO algorithm significantly enhances detection accuracy and stability, underscoring its potential in UAV detection applications.
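The multi-frame motion cue that complements the appearance detector can be sketched with simple frame differencing. This is a deliberately simplified stand-in for GL-YOMO's motion branch (the actual method adds template matching and fuses with YOLO detections):

```python
import numpy as np

def motion_mask(frames, thresh=10.0):
    """Flag pixels that change between any pair of consecutive frames.

    frames: (T, H, W) intensity stack; a pixel is 'moving' if its maximum
    consecutive-frame difference exceeds the threshold.
    """
    diffs = np.abs(np.diff(frames.astype(float), axis=0))  # (T-1, H, W)
    return diffs.max(axis=0) > thresh

# Synthetic sequence: a tiny bright "UAV" drifting right on a dark background.
frames = np.zeros((3, 32, 32))
for t in range(3):
    frames[t, 10:12, 5 + 4 * t:7 + 4 * t] = 255.0

mask = motion_mask(frames)
ys, xs = np.nonzero(mask)
print(ys.min(), ys.max(), xs.min(), xs.max())
```

Even a 2x2-pixel target produces a clean motion corridor across frames, which is why motion evidence helps where a single-frame detector sees too few pixels to fire.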

[CV-49] TI-PREGO: Chain of Thought and In-Context Learning for Online Mistake Detection in PRocedural EGOcentric Videos

链接: https://arxiv.org/abs/2411.02570
作者: Leonardo Plini,Luca Scofano,Edoardo De Matteis,Guido Maria D’Amely di Melendugno,Alessandro Flaborea,Andrea Sanchietti,Giovanni Maria Farinella,Fabio Galasso,Antonino Furnari
关键词-EN: Identifying procedural errors, Identifying procedural, skill-based training, critical yet challenging, procedural errors online
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Identifying procedural errors online from egocentric videos is a critical yet challenging task across various domains, including manufacturing, healthcare, and skill-based training. The nature of such mistakes is inherently open-set, as unforeseen or novel errors may occur, necessitating robust detection systems that do not rely on prior examples of failure. Currently, however, no technique effectively detects open-set procedural mistakes online. We propose a dual branch architecture to address this problem in an online fashion: one branch continuously performs step recognition from the input egocentric video, while the other anticipates future steps based on the recognition module’s output. Mistakes are detected as mismatches between the currently recognized action and the action predicted by the anticipation module. The recognition branch takes input frames, predicts the current action, and aggregates frame-level results into action tokens. The anticipation branch, specifically, leverages the solid pattern-matching capabilities of Large Language Models (LLMs) to predict action tokens based on previously predicted ones. Given the online nature of the task, we also thoroughly benchmark the difficulties associated with per-frame evaluations, particularly the need for accurate and timely predictions in dynamic online scenarios. Extensive experiments on two procedural datasets demonstrate the challenges and opportunities of leveraging a dual-branch architecture for mistake detection, showcasing the effectiveness of our proposed approach. In a thorough evaluation including recognition and anticipation variants and state-of-the-art models, our method reveals its robustness and effectiveness in online applications. 
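The paper's core detection rule (a mistake is flagged when the recognized action disagrees with what the anticipation branch predicted) can be sketched in a few lines. The `anticipate` callable stands in for the LLM-based anticipation branch; all names here are illustrative, not from the paper:

```python
def detect_mistakes(recognized, anticipate):
    """Compare each recognized action token with the action the
    anticipation module predicts from the preceding history.

    `anticipate` maps a tuple of past tokens to the predicted next
    token; a prediction/observation mismatch is flagged as a mistake.
    """
    mistakes = []
    for i in range(1, len(recognized)):
        history = tuple(recognized[:i])
        predicted = anticipate(history)
        if predicted != recognized[i]:
            mistakes.append((i, predicted, recognized[i]))
    return mistakes
```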

[CV-50] Continual LLaVA: Continual Instruction Tuning in Large Vision-Language Models

链接: https://arxiv.org/abs/2411.02564
作者: Meng Cao,Yuyang Liu,Yingfei Liu,Tiancai Wang,Jiahua Dong,Henghui Ding,Xiangyu Zhang,Ian Reid,Xiaodan Liang
关键词-EN: Vision Language Models, Large Vision Language, tailoring Large Vision, Language Models, continual instruction tuning
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Instruction tuning constitutes a prevalent technique for tailoring Large Vision Language Models (LVLMs) to meet individual task requirements. To date, most of the existing approaches are confined to single-task adaptation, whereas the requirements in real-world scenarios are inherently varied and continually evolving. Thus an ideal LVLM should sustain continual instruction tuning in the face of stream-task distributions (i.e., different domains, emerging capabilities, and new datasets) while minimizing the forgetting of previously acquired knowledge. To achieve this, we propose a new benchmark for COntinuAl inStruction Tuning on LVLMs (COAST), which encompasses the aforementioned domain-incremental, capability-incremental, and dataset-incremental configurations. In terms of methodology, we propose Continual LLaVA, a rehearsal-free method tailored for continual instruction tuning in LVLMs. To circumvent the additional overhead associated with experience replay, we freeze LVLMs and construct the dual increment embeddings for each input instruction to facilitate parameter-efficient tuning. Specifically, the increment embeddings can be decomposed into two principal components: 1) intrinsic increment embeddings to encode task-specific characteristics. To achieve this, we set up a low-rank pool containing candidate embeddings, from which we select the relevant ones based on their similarity with the user instructions; 2) contextual increment embeddings to investigate the inter-dependencies across tasks. In this regard, the low-rank embeddings chosen in the previous tasks are aggregated via learnable weighted sum to provide complementary hints. Extensive experiments indicate that the proposed Continual LLaVA outperforms previous methods by significantly reducing the forgetting during the continual instruction tuning process.
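A hedged sketch of the two increment-embedding components described above: similarity-based selection from a low-rank pool (intrinsic), and a softmax-weighted sum over embeddings chosen in earlier tasks (contextual). This is our NumPy simplification, not the authors' code:

```python
import numpy as np

def select_increment_embeddings(instruction_vec, pool, k=2):
    """Pick the k pool embeddings most similar to the instruction,
    mirroring the similarity-based selection of intrinsic increments."""
    sims = pool @ instruction_vec
    sims = sims / (np.linalg.norm(pool, axis=1)
                   * np.linalg.norm(instruction_vec) + 1e-8)
    idx = np.argsort(sims)[::-1][:k]
    return idx, pool[idx]

def aggregate_contextual(prev_embeddings, weights):
    """Weighted sum of embeddings selected in previous tasks; `weights`
    stand in for the learnable logits mentioned in the abstract."""
    w = np.exp(weights - weights.max())
    w = w / w.sum()
    return (w[:, None] * prev_embeddings).sum(axis=0)
```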

[CV-51] Segment Anything for Dendrites from Electron Microscopy

链接: https://arxiv.org/abs/2411.02562
作者: Zewen Zhuo,Ilya Belevich,Ville Leinonen,Eija Jokitalo,Tarja Malm,Alejandra Sierra,Jussi Tohka
关键词-EN: diseased brain tissue, electron microscopy, brain tissue, cellular structures, structures in electron
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Segmentation of cellular structures in electron microscopy (EM) images is fundamental to analyzing the morphology of neurons and glial cells in the healthy and diseased brain tissue. Current neuronal segmentation applications are based on convolutional neural networks (CNNs) and do not effectively capture global relationships within images. Here, we present DendriteSAM, a vision foundation model based on Segment Anything, for interactive and automatic segmentation of dendrites in EM images. The model is trained on high-resolution EM data from healthy rat hippocampus and is tested on diseased rat and human data. Our evaluation results demonstrate better mask quality compared to the original and other fine-tuned models, leveraging the features learned during training. This study introduces the first implementation of vision foundation models in dendrite segmentation, paving the path for computer-assisted diagnosis of neuronal anomalies.

[CV-52] Map++: Towards User-Participatory Visual SLAM Systems with Efficient Map Expansion and Sharing

链接: https://arxiv.org/abs/2411.02553
作者: Xinran Zhang,Hanqi Zhu,Yifan Duan,Wuyang Zhang,Longfei Shangguan,Yu Zhang,Jianmin Ji,Yanyong Zhang
关键词-EN: Constructing precise, future map-based systems, self-driving and navigation, development of future, future map-based
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: 15 pages, 15 figures. Accepted by MobiCom 2024

点击查看摘要

Abstract:Constructing precise 3D maps is crucial for the development of future map-based systems such as self-driving and navigation. However, generating these maps in complex environments, such as multi-level parking garages or shopping malls, remains a formidable challenge. In this paper, we introduce a participatory sensing approach that delegates map-building tasks to map users, thereby enabling cost-effective and continuous data collection. The proposed method harnesses the collective efforts of users, facilitating the expansion and ongoing update of the maps as the environment evolves. We realized this approach by developing Map++, an efficient system that functions as a plug-and-play extension, supporting participatory map-building based on existing SLAM algorithms. Map++ addresses a plethora of scalability issues in this participatory map-building system by proposing a set of lightweight, application-layer protocols. We evaluated Map++ in four representative settings: an indoor garage, an outdoor plaza, a public SLAM benchmark, and a simulated environment. The results demonstrate that Map++ can reduce traffic volume by approximately 46% with negligible degradation in mapping accuracy, i.e., less than 0.03m compared to the baseline system. It can support approximately 2 \times as many concurrent users as the baseline under the same network bandwidth. Additionally, for users who travel on already-mapped trajectories, they can directly utilize the existing maps for localization and save 47% of the CPU usage.

[CV-53] Modeling Uncertainty in 3D Gaussian Splatting through Continuous Semantic Splatting

链接: https://arxiv.org/abs/2411.02547
作者: Joey Wilson,Marcelino Almeida,Min Sun,Sachit Mahajan,Maani Ghaffari,Parker Ewen,Omid Ghasemalizadeh,Cheng-Hao Kuo,Arnie Sen
关键词-EN: Gaussian Splatting, probabilistically updating, updating and rasterizing, rasterizing semantic maps, Gaussian
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we present a novel algorithm for probabilistically updating and rasterizing semantic maps within 3D Gaussian Splatting (3D-GS). Although previous methods have introduced algorithms which learn to rasterize features in 3D-GS for enhanced scene understanding, 3D-GS can fail without warning which presents a challenge for safety-critical robotic applications. To address this gap, we propose a method which advances the literature of continuous semantic mapping from voxels to ellipsoids, combining the precise structure of 3D-GS with the ability to quantify uncertainty of probabilistic robotic maps. Given a set of images, our algorithm performs a probabilistic semantic update directly on the 3D ellipsoids to obtain an expectation and variance through the use of conjugate priors. We also propose a probabilistic rasterization which returns per-pixel segmentation predictions with quantifiable uncertainty. We compare our method with similar probabilistic voxel-based methods to verify our extension to 3D ellipsoids, and perform ablation studies on uncertainty quantification and temporal smoothing.
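The "probabilistic semantic update ... through the use of conjugate priors" described above is naturally modeled as a Dirichlet-Categorical update per primitive. A minimal sketch of that update and the resulting expectation and variance (our own simplification; the paper applies this to 3D ellipsoids and adds a probabilistic rasterizer on top):

```python
import numpy as np

def update_dirichlet(alpha, observation_counts):
    """Conjugate update: add per-class observation counts to the
    Dirichlet concentration parameters of one map primitive."""
    return alpha + observation_counts

def dirichlet_expectation_variance(alpha):
    """Posterior mean and per-class variance of the class probabilities
    under a Dirichlet(alpha) posterior."""
    a0 = alpha.sum()
    mean = alpha / a0
    var = alpha * (a0 - alpha) / (a0 ** 2 * (a0 + 1))
    return mean, var
```

More observations concentrate the posterior, so the per-class variance shrinks, which is exactly the uncertainty signal a safety-critical consumer of the map would monitor.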

[CV-54] SPACE: 3D Spatial Co-operation and Exploration Framework for Robust Mapping and Coverage with Multi-Robot Systems

链接: https://arxiv.org/abs/2411.02524
作者: Sai Krishna Ghanta,Ramviyas Parasuraman
关键词-EN: significantly enhance efficiency, hold immense potential, deploying multiple robots, exploration hold immense, service and logistics
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:In indoor environments, multi-robot visual (RGB-D) mapping and exploration hold immense potential for application in domains such as domestic service and logistics, where deploying multiple robots in the same environment can significantly enhance efficiency. However, there are two primary challenges: (1) the “ghosting trail” effect, which occurs due to overlapping views of robots impacting the accuracy and quality of point cloud reconstruction, and (2) the oversight of visual reconstructions in selecting the most effective frontiers for exploration. Given these challenges are interrelated, we address them together by proposing a new semi-distributed framework (SPACE) for spatial cooperation in indoor environments that enables enhanced coverage and 3D mapping. SPACE leverages geometric techniques, including “mutual awareness” and a “dynamic robot filter,” to overcome spatial mapping constraints. Additionally, we introduce a novel spatial frontier detection system and map merger, integrated with an adaptive frontier assigner for optimal coverage balancing the exploration and reconstruction objectives. In extensive ROS-Gazebo simulations, SPACE demonstrated superior performance over state-of-the-art approaches in both exploration and mapping metrics.
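The classic frontier definition underlying detectors like the one described above, a free cell bordering unknown space, can be sketched on a 2D occupancy grid (our simplification; SPACE's spatial frontier system works in 3D and adds visual-reconstruction criteria):

```python
import numpy as np

FREE, OCCUPIED, UNKNOWN = 0, 1, -1

def detect_frontiers(grid: np.ndarray):
    """Return (row, col) pairs of free cells with at least one
    4-connected unknown neighbour: the classic frontier definition."""
    frontiers = []
    rows, cols = grid.shape
    for r in range(rows):
        for c in range(cols):
            if grid[r, c] != FREE:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols \
                        and grid[nr, nc] == UNKNOWN:
                    frontiers.append((r, c))
                    break
    return frontiers
```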

[CV-55] NeRF-Aug: Data Augmentation for Robotics with Neural Radiance Fields

链接: https://arxiv.org/abs/2411.02482
作者: Eric Zhu,Mara Levy,Matthew Gwilliam,Abhinav Shrivastava
关键词-EN: long standing challenge, generalize to unknown, long standing, standing challenge, unknown objects
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Training a policy that can generalize to unknown objects is a long standing challenge within the field of robotics. The performance of a policy often drops significantly in situations where an object in the scene was not seen during training. To solve this problem, we present NeRF-Aug, a novel method that is capable of teaching a policy to interact with objects that are not present in the dataset. This approach differs from existing approaches by leveraging the speed and photorealism of a neural radiance field for augmentation. NeRF-Aug both creates more photorealistic data and runs 3.83 times faster than existing methods. We demonstrate the effectiveness of our method on 4 tasks with 11 novel objects that have no expert demonstration data. We achieve an average 69.1% success rate increase over existing methods. See video results at this https URL.

[CV-56] A Study of Data Augmentation Techniques to Overcome Data Scarcity in Wound Classification using Deep Learning

链接: https://arxiv.org/abs/2411.02456
作者: Harini Narayanan,Sindhu Ghanta
关键词-EN: incurring high costs, Chronic wounds, data augmentation, affecting millions, data
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Chronic wounds are a significant burden on individuals and the healthcare system, affecting millions of people and incurring high costs. Wound classification using deep learning techniques is a promising approach for faster diagnosis and treatment initiation. However, lack of high quality data to train the ML models is a major challenge to realize the potential of ML in wound care. In fact, data limitations are the biggest challenge in studies using medical or forensic imaging today. We study data augmentation techniques that can be used to overcome the data scarcity limitations and unlock the potential of deep learning based solutions. In our study we explore a range of data augmentation techniques from geometric transformations of wound images to advanced GANs, to enrich and expand datasets. Using the Keras, Tensorflow, and Pandas libraries, we implemented the data augmentation techniques that can generate realistic wound images. We show that geometric data augmentation can improve classification performance, F1 scores, by up to 11% on top of state-of-the-art models, across several key classes of wounds. Our experiments with GAN based augmentation prove the viability of using DE-GANs to generate wound images with richer variations. Our study and results show that data augmentation is a valuable privacy-preserving tool with huge potential to overcome the data scarcity limitations and we believe it will be part of any real-world ML-based wound care system.
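The geometric branch of the augmentations studied above (flips and rotations of wound images) can be sketched without Keras; this is a minimal stand-in, not the paper's pipeline, and the GAN-based branch is a separate, much heavier component:

```python
import numpy as np

def geometric_augment(image: np.ndarray):
    """Generate simple geometric variants of one image: horizontal and
    vertical flips plus the three 90-degree rotations, alongside the
    original.  Labels are invariant to these transforms, so each
    variant is a valid extra training sample."""
    return [
        image,
        np.fliplr(image),
        np.flipud(image),
        np.rot90(image, 1),
        np.rot90(image, 2),
        np.rot90(image, 3),
    ]
```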

[CV-57] Goal-Oriented Semantic Communication for Wireless Visual Question Answering with Scene Graphs

链接: https://arxiv.org/abs/2411.02452
作者: Sige Liu,Nan Li,Yansha Deng
关键词-EN: computational capabilities escalate, communication falls short, capabilities escalate, stringent requirements, falls short
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:As demands for communication and computational capabilities escalate, traditional bit-oriented communication falls short of these stringent requirements, especially for mission-critical and computation-intensive applications. Visual Question Answering (VQA), a representative application, has adopted edge computing to mitigate local computational constraints and accelerate visual perception with natural language. However, it encounters significant communication challenges such as limited bandwidth, reduced transmission power, and increased noise levels, leading to considerable latency and reduced efficiency in image and question transmission. We propose a goal-oriented semantic communication (GSC) framework that focuses on effectively extracting and transmitting semantic information most relevant to the VQA goals, improving the answering accuracy and enhancing the effectiveness and efficiency. The objective is to maximize the answering accuracy, and we propose a scene graphs (SG)-based image semantic extraction and ranking approach to prioritize the semantic information based on the goal of questions. Experimental results demonstrate that our GSC framework improves answering accuracy by up to 59% under Rayleigh channels while reducing total latency by up to 65% compared to traditional bit-oriented transmission.
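A crude stand-in for the scene-graph ranking idea: order (subject, relation, object) triples by lexical overlap with the question, then transmit the top-ranked ones first. The paper's ranking is learned and goal-oriented; this only illustrates the prioritization step:

```python
def rank_triples(question: str, triples):
    """Order scene-graph triples by how many of their words appear in
    the question, a simple proxy for goal relevance."""
    q_words = set(question.lower().split())

    def relevance(triple):
        words = " ".join(triple).lower().split()
        return sum(w in q_words for w in words)

    return sorted(triples, key=relevance, reverse=True)
```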

[CV-58] WiCV@CVPR2024: The Thirteenth Women In Computer Vision Workshop at the Annual CVPR Conference

链接: https://arxiv.org/abs/2411.02445
作者: Asra Aslam,Sachini Herath,Ziqi Huang,Estefania Talavera,Deblina Bhattacharjee,Himangi Mittal,Vanessa Staderini,Mengwei Ren,Azade Farshad
关键词-EN: United States, Computer Vision Workshop, computer vision community, Computer Vision, organized alongside
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: arXiv admin note: substantial text overlap with arXiv:2309.12768

点击查看摘要

Abstract:In this paper, we present the details of Women in Computer Vision Workshop - WiCV 2024, organized alongside the CVPR 2024 in Seattle, Washington, United States. WiCV aims to amplify the voices of underrepresented women in the computer vision community, fostering increased visibility in both academia and industry. We believe that such events play a vital role in addressing gender imbalances within the field. The annual WiCV@CVPR workshop offers a)~opportunity for collaboration between researchers from minority groups, b) mentorship for female junior researchers, c) financial support to presenters to alleviate financial burdens and d)~a diverse array of role models who can inspire younger researchers at the outset of their careers. In this paper, we present a comprehensive report on the workshop program, historical trends from the past WiCV@CVPR events, and a summary of statistics related to presenters, attendees, and sponsorship for the WiCV 2024 workshop.

[CV-59] Cross-D Conv: Cross-Dimensional Transferable Knowledge Base via Fourier Shifting Operation

链接: https://arxiv.org/abs/2411.02441
作者: Mehmet Can Yavuz,Yang Yang
关键词-EN: biomedical imaging analysis, significant challenge, Cross-D Conv operation, biomedical imaging, superior real-world applicability
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 6 pages, 3 figures, 2 tables, 1 algorithm, conference

点击查看摘要

Abstract:In biomedical imaging analysis, the dichotomy between 2D and 3D data presents a significant challenge. While 3D volumes offer superior real-world applicability, they are less available for each modality and not easy to train in large scale, whereas 2D samples are abundant but less comprehensive. This paper introduces the Cross-D Conv operation, a novel approach that bridges the dimensional gap by learning the phase shifting in the Fourier domain. Our method enables seamless weight transfer between 2D and 3D convolution operations, effectively facilitating cross-dimensional learning. The proposed architecture leverages the abundance of 2D training data to enhance 3D model performance, offering a practical solution to the multimodal data scarcity challenge in 3D medical model pretraining. Experimental validation on the RadImagenet (2D) and multimodal (3D) sets demonstrates that our approach achieves comparable or superior performance in feature quality assessment comparable to conventional methods. The enhanced convolution operation presents new opportunities for developing efficient classification and segmentation models in medical imaging. This work represents an advancement in cross-dimensional and multi-modal medical image analysis, offering a robust framework for utilizing 2D priors in 3D model pretraining or vice versa while maintaining computational efficiency.

[CV-60] MA2: A Self-Supervised and Motion Augmenting Autoencoder for Gait-Based Automatic Disease Detection

链接: https://arxiv.org/abs/2411.03129
作者: Yiqun Liu,Ke Zhang,Yin Zhu
关键词-EN: Ground reaction force, Ground reaction, reaction force, force exerted, body in contact
类目: Biological Physics (physics.bio-ph); Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 11 figures, article

点击查看摘要

Abstract:Ground reaction force (GRF) is the force exerted by the ground on a body in contact with it. GRF-based automatic disease detection (ADD) has become an emerging medical diagnosis method, which aims to learn and identify disease patterns corresponding to different gait pressures based on deep learning methods. Although existing ADD methods can save doctors time in making diagnoses, training deep models still struggles with the cost caused by the labeling engineering for a large number of gait diagnostic data for subjects. On the other hand, the accuracy of the deep model under the unified benchmark GRF dataset and the generalization ability on scalable gait datasets need to be further improved. To address these issues, we propose MA2, a GRF-based self-supervised and motion augmenting auto-encoder, which models the ADD task as an encoder-decoder paradigm. In the encoder, we introduce an embedding block including the 3-layer 1D convolution for extracting the token and a mask generator to randomly mask out the sequence of tokens to maximize the model’s potential to capture high-level, discriminative, intrinsic representations. Thereafter, the decoder utilizes this information to reconstruct the pixel sequence of the origin input and calculate the reconstruction loss to optimize the network. Moreover, the backbone of an auto-encoder is multi-head self-attention that can consider the global information of the token from the input, not just the local neighborhood. This allows the model to capture generalized contextual information. Extensive experiments demonstrate MA2 has SOTA performance of 90.91% accuracy on 1% limited pathological GRF samples with labels, and good generalization ability of 78.57% accuracy on scalable Parkinson disease dataset.
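The mask generator that "randomly mask[s] out the sequence of tokens" can be sketched as follows (our simplification in NumPy; MA2 masks embedded GRF tokens inside the encoder rather than raw values):

```python
import numpy as np

def random_mask(tokens: np.ndarray, mask_ratio: float, rng=None):
    """Randomly hide a fraction of a token sequence, masked-autoencoder
    style; returns the visible tokens and a boolean mask (True = hidden)
    so the decoder can be asked to reconstruct the hidden positions."""
    if rng is None:
        rng = np.random.default_rng()
    n = tokens.shape[0]
    n_mask = int(round(n * mask_ratio))
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=n_mask, replace=False)] = True
    return tokens[~mask], mask
```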

[CV-61] Investigating the Applicability of a Snapshot Computed Tomography Imaging Spectrometer for the Prediction of Brix and pH of Grapes

链接: https://arxiv.org/abs/2411.03114
作者: Mads Svanborg Peters,Mads Juul Ahlebæk,Mads Toudal Frandsen,Bjarke Jørgensen,Christian Hald Jessen,Andreas Krogh Carlsen,Wei-Chih Huang,René Lynge Eriksen
关键词-EN: Tomography Imaging Spectroscopy, Computed Tomography Imaging, Squares Regression, Computed Tomography, Partial Least Squares
类目: Applied Physics (physics.app-ph); Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
*备注: 15 pages, 10 figures

点击查看摘要

Abstract:In this paper, a recently developed snapshot hyperspectral imaging (HSI) system based on Computed Tomography Imaging Spectroscopy (CTIS) is utilized to determine Brix and pH values in Sheegene 20 table grapes through Partial Least Squares Regression (PLSR) modeling. The performance of the CTIS system is compared with that of a state-of-the-art line scan HSI system by imaging 100 grapes across both platforms. Reference measurements of Brix and pH values are obtained directly using a refractometer and a pH meter, as these parameters are essential for assessing the quality of table and wine grapes. The findings indicate that the spectra captured by the CTIS camera correlate well with the reference measurements, despite the system’s narrower spectral range. The CTIS camera’s advantages, including its lower cost, portability, and reduced susceptibility to motion errors, highlight its potential for promising in-field applications in grape quality assessment.
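Single-response Partial Least Squares Regression, as used here to map grape spectra to Brix and pH, can be written out with the NIPALS algorithm in a few lines of NumPy. This is a textbook sketch, not the authors' code; in practice scikit-learn's `PLSRegression` is the usual choice:

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit single-response PLS regression via NIPALS.  Returns the
    regression coefficients and intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                      # weight vector
        w = w / np.linalg.norm(w)
        t = Xk @ w                         # scores
        tt = t @ t
        p = Xk.T @ t / tt                  # X loadings
        qk = (yk @ t) / tt                 # y loading
        Xk = Xk - np.outer(t, p)           # deflate X
        yk = yk - qk * t                   # deflate y
        W.append(w); P.append(p); q.append(qk)
    W, P, q = np.column_stack(W), np.column_stack(P), np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return coef, y_mean - x_mean @ coef

def pls1_predict(X, coef, intercept):
    return X @ coef + intercept
```

With as many components as (full-rank) predictors and noise-free linear data, PLS1 reduces to ordinary least squares; the value of PLS for spectra is using far fewer components than wavelengths.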

[CV-62] Exploiting the Segment Anything Model (SAM) for Lung Segmentation in Chest X-ray Images

链接: https://arxiv.org/abs/2411.03064
作者: Gabriel Bellon de Carvalho,Jurandy Almeida
关键词-EN: ambitious tool designed, separate individual objects, released in April, Meta AI released, semantic interpretation
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Segment Anything Model (SAM), a new AI model from Meta AI released in April 2023, is an ambitious tool designed to identify and separate individual objects within a given image through semantic interpretation. The advanced capabilities of SAM are the result of its training with millions of images and masks, and a few days after its release, several researchers began testing the model on medical images to evaluate its performance in this domain. With this perspective in focus – i.e., optimizing work in the healthcare field – this work proposes the use of this new technology to evaluate and study chest X-ray images. The approach adopted for this work, with the aim of improving the model’s performance for lung segmentation, involved a transfer learning process, specifically the fine-tuning technique. After applying this adjustment, a substantial improvement was observed in the evaluation metrics used to assess SAM’s performance compared to the masks provided by the datasets. The results obtained by the model after the adjustments were satisfactory and similar to cutting-edge neural networks, such as U-Net.
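Mask-quality comparisons like the one described are commonly reported with the Dice coefficient; the abstract does not name its exact metrics, so this is a generic sketch of how such a score is computed:

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, target: np.ndarray) -> float:
    """Dice similarity between two binary masks: 2|A∩B| / (|A| + |B|).
    A standard way to score a predicted segmentation (e.g. a SAM lung
    mask) against dataset ground truth."""
    pred, target = pred.astype(bool), target.astype(bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # both masks empty: perfect agreement by convention
    return 2.0 * np.logical_and(pred, target).sum() / denom
```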

[CV-63] LDPM: Towards undersampled MRI reconstruction with MR-VAE and Latent Diffusion Prior

链接: https://arxiv.org/abs/2411.02951
作者: Xingjian Tang,Jingwei Guan,Linge Li,Youmei Zhang,Mengye Lyu,Li Yan
关键词-EN: applications including MRI, powerful generative model, MRI reconstruction, diffusion model-based MRI, including MRI reconstruction
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion model, as a powerful generative model, has found a wide range of applications including MRI reconstruction. However, most existing diffusion model-based MRI reconstruction methods operate directly in pixel space, which makes their optimization and inference computationally expensive. Latent diffusion models were introduced to address this problem in natural image processing, but directly applying them to MRI reconstruction still faces many challenges, including the lack of control over the generated results, the adaptability of Variational AutoEncoder (VAE) to MRI, and the exploration of applicable data consistency in latent space. To address these challenges, a Latent Diffusion Prior based undersampled MRI reconstruction (LDPM) method is proposed. A sketcher module is utilized to provide appropriate control and balance the quality and fidelity of the reconstructed MR images. A VAE adapted for MRI tasks (MR-VAE) is explored, which can serve as the backbone for future MR-related tasks. Furthermore, a variation of the DDIM sampler, called the Dual-Stage Sampler, is proposed to achieve high-fidelity reconstruction in the latent space. The proposed method achieves competitive results on fastMRI datasets, and the effectiveness of each module is demonstrated in ablation experiments.

[CV-64] A Symmetric Dynamic Learning Framework for Diffeomorphic Medical Image Registration

链接: https://arxiv.org/abs/2411.02888
作者: Jinqiu Deng,Ke Chen,Mingke Li,Daoping Zhang,Chong Chen,Alejandro F. Frangi,Jianping Zhang
关键词-EN: medical imaging applications, medical imaging, imaging applications, preserve the topology, registration
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages,7 figures

点击查看摘要

Abstract:Diffeomorphic image registration is crucial for various medical imaging applications because it can preserve the topology of the transformation. This study introduces DCCNN-LSTM-Reg, a learning framework that evolves dynamically and learns a symmetrical registration path by satisfying a specified control increment system. This framework aims to obtain symmetric diffeomorphic deformations between moving and fixed images. To achieve this, we combine deep learning networks with diffeomorphic mathematical mechanisms to create a continuous and dynamic registration architecture, which consists of multiple Symmetric Registration (SR) modules cascaded on five different scales. Specifically, our method first uses two U-nets with shared parameters to extract multiscale feature pyramids from the images. We then develop an SR-module comprising a sequential CNN-LSTM architecture to progressively correct the forward and reverse multiscale deformation fields using control increment learning and the homotopy continuation technique. Through extensive experiments on three 3D registration tasks, we demonstrate that our method outperforms existing approaches in both quantitative and qualitative evaluations.

[CV-65] Artificial Intelligence-Enhanced Couinaud Segmentation for Precision Liver Cancer Therapy

链接: https://arxiv.org/abs/2411.02815
作者: Liang Qiu,Wenhao Chi,Xiaohan Xing,Praveenbalaji Rajendran,Mingjie Li,Yuming Jiang,Oscar Pastor-Serrano,Sen Yang,Xiyue Wang,Yuanfeng Ji,Qiang Wen
关键词-EN: improving survival rates, necessitates accurately delineating, cancer necessitates accurately, accurately delineating liver, protect healthy tissue
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Precision therapy for liver cancer necessitates accurately delineating liver sub-regions to protect healthy tissue while targeting tumors, which is essential for reducing recurrence and improving survival rates. However, the segmentation of hepatic segments, known as Couinaud segmentation, is challenging due to indistinct sub-region boundaries and the need for extensive annotated datasets. This study introduces LiverFormer, a novel Couinaud segmentation model that effectively integrates global context with low-level local features based on a 3D hybrid CNN-Transformer architecture. Additionally, a registration-based data augmentation strategy is equipped to enhance the segmentation performance with limited labeled data. Evaluated on CT images from 123 patients, LiverFormer demonstrated high accuracy and strong concordance with expert annotations across various metrics, allowing for enhanced treatment planning for surgery and radiation therapy. It has great potential to reduce complications and minimize potential damage to surrounding tissue, leading to improved outcomes for patients undergoing complex liver cancer treatments.

[CV-66] NEOviz: Uncertainty-Driven Visual Analysis of Asteroid Trajectories

链接: https://arxiv.org/abs/2411.02812
作者: Fangfei Lan,Malin Ejdbo,Joachim Moeyens,Bei Wang,Anders Ynnerman,Alexander Bock
关键词-EN: Solar System, interactive visualization system, visualization system designed, system designed, designed to assist
类目: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:We introduce NEOviz, an interactive visualization system designed to assist planetary defense experts in the visual analysis of the movements of near-Earth objects in the Solar System that might prove hazardous to Earth. Asteroids are often discovered using optical telescopes and their trajectories are calculated from images, resulting in an inherent asymmetric uncertainty in their position and velocity. Consequently, we typically cannot determine the exact trajectory of an asteroid, and an ensemble of trajectories must be generated to estimate an asteroid’s movement over time. When propagating these ensembles over decades, it is challenging to visualize the varying paths and determine their potential impact on Earth, which could cause catastrophic damage. NEOviz equips experts with the necessary tools to effectively analyze the existing catalog of asteroid observations. In particular, we present a novel approach for visualizing the 3D uncertainty region through which an asteroid travels, while providing accurate spatial context in relation to system-critical infrastructure such as Earth, the Moon, and artificial satellites. Furthermore, we use NEOviz to visualize the divergence of asteroid trajectories, capturing high-variance events in an asteroid’s orbital properties. For potential impactors, we combine the 3D visualization with an uncertainty-aware impact map to illustrate the potential risks to human populations. NEOviz was developed with continuous input from members of the planetary defense community through a participatory design process. It is exemplified in three real-world use cases and evaluated via expert feedback interviews.

[CV-67] Foundation AI Model for Medical Image Segmentation

链接: https://arxiv.org/abs/2411.02745
作者: Rina Bao,Erfan Darzi,Sheng He,Chuan-Heng Hsiao,Mohammad Arafat Hussain,Jingpeng Li,Atle Bjornerud,Ellen Grant,Yangming Ou
关键词-EN: demonstrate broad generalizability, Generative Pre-trained Transformer, Chat Generative Pre-trained, models, Foundation models
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Foundation models refer to artificial intelligence (AI) models that are trained on massive amounts of data and demonstrate broad generalizability across various tasks with high accuracy. These models offer versatile, one-for-many or one-for-all solutions, eliminating the need for developing task-specific AI models. Examples of such foundation models include the Chat Generative Pre-trained Transformer (ChatGPT) and the Segment Anything Model (SAM). These models have been trained on millions to billions of samples and have shown wide-ranging and accurate applications in numerous tasks such as text processing (using ChatGPT) and natural image segmentation (using SAM). In medical image segmentation - finding target regions in medical images - there is a growing need for these one-for-many or one-for-all foundation models. Such models could obviate the need to develop thousands of task-specific AI models, which is currently standard practice in the field. They can also be adapted to tasks with datasets too small for effective training. We discuss two paths to achieve foundation models for medical image segmentation and comment on progress, challenges, and opportunities. One path is to adapt or fine-tune existing models, originally developed for natural images, for use with medical images. The second path entails building models from scratch, exclusively training on medical images.

[CV-68] TransUNext: towards a more advanced U-shaped framework for automatic vessel segmentation in the fundus image

链接: https://arxiv.org/abs/2411.02724
作者: Xiang Li,Mingsi Liu,Lixin Duan
关键词-EN: fundus vessel images, retinal vessel segmentation, Automatic and accurate, fundus vessel, vessel images
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Purpose: Automatic and accurate segmentation of fundus vessel images has become an essential prerequisite for computer-aided diagnosis of ophthalmic diseases such as diabetes mellitus. High-precision retinal vessel segmentation still faces difficulties due to the low contrast between the branch ends of retinal vessels and the background, the long and thin spans of vessels, and the variable morphology of the optic disc and optic cup in fundus vessel images. Methods: We propose TransUNext, a more advanced U-shaped architecture built on a hybrid Transformer and CNN, which integrates an Efficient Self-attention Mechanism into the encoder and decoder of U-Net to capture both local features and global dependencies with minimal computational overhead. Meanwhile, the Global Multi-Scale Fusion (GMSF) module is further introduced to upgrade skip-connections, fuse high-level semantic and low-level detailed information, and eliminate high- and low-level semantic differences. Inspired by ConvNeXt, the TransNeXt Block is designed to optimize the computational complexity of each base block in U-Net and avoid the information loss caused by dimension compression when information is converted between feature spaces of different dimensions. Results: We evaluated the proposed method on four public datasets: DRIVE, STARE, CHASE-DB1, and HRF. The AUC (area under the ROC curve) values were 0.9867, 0.9869, 0.9910, and 0.9887, respectively, exceeding other state-of-the-art methods.

[CV-69] FUSECAPS: Investigating Feature Fusion Based Framework for Capsule Endoscopy Image Classification

链接: https://arxiv.org/abs/2411.02637
作者: Bidisha Chakraborty,Shree Mitra
关键词-EN: class imbalance issues, classifying endoscopic images, imbalance issues, endoscopic images, order to improve
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In order to improve model accuracy, generalization, and handling of class imbalance, this work offers a robust methodology for classifying endoscopic images. We suggest a hybrid feature extraction method that combines convolutional neural networks (CNNs), multi-layer perceptrons (MLPs), and radiomics. This combination enables rich, multi-scale feature extraction, capturing both deep and handcrafted representations. These features are then used by a classification head to classify diseases, producing a model with higher generalization and accuracy. In this framework we achieved a validation accuracy of 76.2% on the capsule endoscopy video frame classification task.
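
The fusion idea above, concatenating deep and handcrafted feature vectors before a shared classification head, can be sketched minimally. The feature dimensions, the linear head, and all values below are hypothetical illustrations, not taken from the paper:

```python
# Illustrative sketch only; the paper's real extractors and head are not shown.
# Hypothetical CNN, MLP, and radiomics feature vectors are fused by simple
# concatenation, then scored by a stand-in linear classification head.

def fuse_features(cnn_feats, mlp_feats, radiomics_feats):
    """Concatenate the three feature vectors into one fused representation."""
    return list(cnn_feats) + list(mlp_feats) + list(radiomics_feats)

def linear_head(fused, weights, biases):
    """One score per class: dot product of fused features with class weights."""
    return [sum(w * x for w, x in zip(ws, fused)) + b
            for ws, b in zip(weights, biases)]

# Toy usage with made-up 2-D / 1-D / 2-D features and a 2-class head.
fused = fuse_features([0.1, 0.2], [0.3], [0.4, 0.5])
scores = linear_head(fused, [[1, 0, 0, 0, 0], [0, 0, 0, 0, 1]], [0.0, 0.0])
```

In a real pipeline the three inputs would be a CNN embedding, an MLP embedding, and a radiomics descriptor for the same frame, and the head would be trained jointly.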

[CV-70] Towards more efficient agricultural practices via transformer-based crop type classification

链接: https://arxiv.org/abs/2411.02627
作者: E. Ulises Moya-Sánchez,Yazid S. Mikail,Daisy Nyang’anyi,Michael J. Smith,Isabella Smythe
关键词-EN: Machine learning, increase crop production, climate change, learning has great, great potential
类目: Geophysics (physics.geo-ph); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Machine learning has great potential to increase crop production and resilience to climate change. Accurate maps of where crops are grown are a key input to a number of downstream policy and research applications. In this proposal, we present preliminary work showing that it is possible to accurately classify crops from time series derived from Sentinel 1 and 2 satellite imagery in Mexico using a pixel-based binary crop/non-crop time series transformer model. We also find preliminary evidence that meta-learning approaches supplemented with data from similar agro-ecological zones may improve model performance. Due to these promising results, we propose further development of this method with the goal of accurate multi-class crop classification in Jalisco, Mexico via meta-learning with a dataset comprising similar agro-ecological zones.

[CV-71] Divergent Domains Convergent Grading: Enhancing Generalization in Diabetic Retinopathy Grading WACV2025

链接: https://arxiv.org/abs/2411.02614
作者: Sharon Chokuwa,Muhammad Haris Khan
关键词-EN: Diabetic Retinopathy, global blindness cases, blindness cases, global blindness, deep learning method
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at WACV 2025

点击查看摘要

Abstract:Diabetic Retinopathy (DR) constitutes 5% of global blindness cases. While numerous deep learning approaches have sought to enhance traditional DR grading methods, they often falter when confronted with new out-of-distribution data thereby impeding their widespread application. In this study, we introduce a novel deep learning method for achieving domain generalization (DG) in DR grading and make the following contributions. First, we propose a new way of generating image-to-image diagnostically relevant fundus augmentations conditioned on the grade of the original fundus image. These augmentations are tailored to emulate the types of shifts in DR datasets thus increase the model’s robustness. Second, we address the limitations of the standard classification loss in DG for DR fundus datasets by proposing a new DG-specific loss, domain alignment loss; which ensures that the feature vectors from all domains corresponding to the same class converge onto the same manifold for better domain generalization. Third, we tackle the coupled problem of data imbalance across DR domains and classes by proposing to employ Focal loss which seamlessly integrates with our new alignment loss. Fourth, due to inevitable observer variability in DR diagnosis that induces label noise, we propose leveraging self-supervised pretraining. This approach ensures that our DG model remains robust against early susceptibility to label noise, even when only a limited dataset of non-DR fundus images is available for pretraining. Our method demonstrates significant improvements over the strong Empirical Risk Minimization baseline and other recently proposed state-of-the-art DG methods for DR grading. Code is available at this https URL.
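
The abstract names Focal loss as the tool for data imbalance; below is the standard binary focal loss formulation (Lin et al.), given only to illustrate the loss being plugged in, not the authors' code. The probabilities and hyperparameter defaults are my own:

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Standard binary focal loss (Lin et al.): down-weights easy examples
    so training focuses on hard, often minority-class, samples.
    p is the predicted probability of the positive class; y is 0 or 1."""
    p_t = p if y == 1 else 1.0 - p          # probability of the true class
    a_t = alpha if y == 1 else 1.0 - alpha  # class-balance weight
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confident correct prediction contributes almost nothing, while a badly
# misclassified sample dominates -- exactly the behavior that helps imbalance.
easy = focal_loss(0.95, 1)  # well-classified positive
hard = focal_loss(0.10, 1)  # misclassified positive
```

With `gamma=0` and `alpha=0.5` the modulating factor disappears and the loss reduces to half the ordinary cross-entropy, which is a handy sanity check.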

[CV-72] Multi-modal Spatial Clustering for Spatial Transcriptomics Utilizing High-resolution Histology Images

链接: https://arxiv.org/abs/2411.02534
作者: Bingjun Li,Mostafa Karami,Masum Shah Junayed,Sheida Nabavi
关键词-EN: complex biological functions, intricate cellular environment, biological functions, complex biological, histology image features
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages

点击查看摘要

Abstract:Understanding the intricate cellular environment within biological tissues is crucial for uncovering insights into complex biological functions. While single-cell RNA sequencing has significantly enhanced our understanding of cellular states, it lacks the spatial context necessary to fully comprehend the cellular environment. Spatial transcriptomics (ST) addresses this limitation by enabling transcriptome-wide gene expression profiling while preserving spatial context. One of the principal challenges in ST data analysis is spatial clustering, which reveals spatial domains based on the spots within a tissue. Modern ST sequencing procedures typically include a high-resolution histology image, which has been shown in previous studies to be closely connected to gene expression profiles. However, current spatial clustering methods often fail to fully integrate high-resolution histology image features with gene expression data, limiting their ability to capture critical spatial and cellular interactions. In this study, we propose the spatial transcriptomics multi-modal clustering (stMMC) model, a novel contrastive learning-based deep learning approach that integrates gene expression data with histology image features through a multi-modal parallel graph autoencoder. We tested stMMC against four state-of-the-art baseline models: Leiden, GraphST, SpaGCN, and stLearn on two public ST datasets with 13 sample slices in total. The experiments demonstrated that stMMC outperforms all the baseline models in terms of ARI and NMI. An ablation study further validated the contributions of contrastive learning and the incorporation of histology image features. 

[CV-73] Chronic Obstructive Pulmonary Disease Prediction Using Deep Convolutional Network

链接: https://arxiv.org/abs/2411.02449
作者: Shahran Rahman Alve,Muhammad Zawad Mahmud,Samiha Islam,Mohammad Monirujjaman Khan
关键词-EN: made a big, big difference, difference in helping, helping to solve, deep learning
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 Pages, 11 Figures

点击查看摘要

Abstract:AI and deep learning are two recent innovations that have made a big difference in helping to solve problems in the clinical space. Using clinical imaging and sound examination, they also help clinicians spot diseases early and correctly. Because there are not enough trained clinical professionals, practitioners are turning to such innovations to cope with growing patient loads. Aside from serious health problems like cancer and diabetes, the effects of respiratory infections are also slowly getting worse and becoming dangerous for society. Respiratory diseases need to be found early and treated quickly, so listening to the sounds of the lungs is proving to be a very helpful tool along with chest X-rays. The presented research uses deep learning based on Convolutional Neural Networks to assist clinical specialists by giving a detailed and thorough analysis of clinical respiratory sound data for Chronic Obstructive Pulmonary Disease detection. In our experiments we used MFCC, Mel-Spectrogram, Chroma, Chroma (Constant-Q), and Chroma CENS features from the Librosa library. The system could also grade the severity of the infection as mild, moderate, or severe. The test results agree with the outcome of the proposed deep learning approach: classification accuracy reaches 96% on the ICBHI dataset. We also used ten-fold cross-validation to make the evaluation of the new deep learning approach easier to interpret. With a 96 percent accuracy rate, the suggested network outperforms the alternatives; without cross-validation, the model is 90% accurate.

机器学习

[LG-0] Oblivious Defense in ML Models: Backdoor Removal without Detection

链接: https://arxiv.org/abs/2411.03279
作者: Shafi Goldwasser,Jonathan Shafer,Neekon Vafa,Vinod Vaikuntanathan
关键词-EN: machine learning, machine learning systems, machine learning model, learning, machine
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:As society grows more reliant on machine learning, ensuring the security of machine learning systems against sophisticated attacks becomes a pressing concern. A recent result of Goldwasser, Kim, Vaikuntanathan, and Zamir (2022) shows that an adversary can plant undetectable backdoors in machine learning models, allowing the adversary to covertly control the model’s behavior. Backdoors can be planted in such a way that the backdoored machine learning model is computationally indistinguishable from an honest model without backdoors. In this paper, we present strategies for defending against backdoors in ML models, even if they are undetectable. The key observation is that it is sometimes possible to provably mitigate or even remove backdoors without needing to detect them, using techniques inspired by the notion of random self-reducibility. This depends on properties of the ground-truth labels (chosen by nature), and not of the proposed ML model (which may be chosen by an attacker). We give formal definitions for secure backdoor mitigation, and proceed to show two types of results. First, we show a “global mitigation” technique, which removes all backdoors from a machine learning model under the assumption that the ground-truth labels are close to a Fourier-heavy function. Second, we consider distributions where the ground-truth labels are close to a linear or polynomial function in $\mathbb{R}^n$. Here, we show “local mitigation” techniques, which remove backdoors with high probability for every input of interest, and are computationally cheaper than global mitigation. All of our constructions are black-box, so our techniques work without needing access to the model’s representation (i.e., its code or parameters). Along the way we prove a simple result for robust mean estimation.
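
The abstract closes by mentioning "a simple result for robust mean estimation". The paper's exact estimator is not reproduced here, but the classic median-of-means estimator is a minimal sketch of this kind of outlier-robust primitive; the data and block count below are illustrative:

```python
import random
import statistics

def median_of_means(xs, k=10):
    """Classic median-of-means: shuffle, split into k blocks, average each
    block, and return the median of the block means. A small number of
    outliers can corrupt only a few blocks, so the median stays stable."""
    xs = list(xs)
    random.shuffle(xs)
    blocks = [xs[i::k] for i in range(k)]
    return statistics.median(statistics.mean(b) for b in blocks)

# 97 clean points and 3 huge outliers: the plain mean is dragged far away
# from the clean value, while median-of-means is unaffected.
random.seed(0)
data = [1.0] * 97 + [1000.0] * 3
robust = median_of_means(data, k=10)
naive = statistics.mean(data)
```

Because at most 3 of the 10 blocks can contain an outlier, the median of the block means always lands on a clean block here, regardless of the shuffle.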

[LG-1] Graph-Based Semi-Supervised Segregated Lipschitz Learning

链接: https://arxiv.org/abs/2411.03273
作者: Farid Bozorgnia,Yassine Belkheiri,Abderrahim Elmoataz
关键词-EN: Lipschitz Learning, paper presents, graph-based semi-supervised learning, semi-supervised learning framework, semi-supervised learning
类目: Machine Learning (cs.LG); Discrete Mathematics (cs.DM)
*备注:

点击查看摘要

Abstract:This paper presents an approach to semi-supervised classification of data using Lipschitz learning on graphs. We develop a graph-based semi-supervised learning framework that leverages the properties of the infinity Laplacian to propagate labels in a dataset where only a few samples are labeled. By extending the theory of spatial segregation from the Laplace operator to the infinity Laplace operator, both in continuum and discrete settings, our approach provides a robust method for dealing with class imbalance, a common challenge in machine learning. Experimental validation on several benchmark datasets demonstrates that our method not only improves classification accuracy compared to existing methods but also ensures efficient label propagation in scenarios with limited labeled data.
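
The discrete infinity Laplacian underlying Lipschitz learning has a particularly simple fixed-point form: each unlabeled node takes the midpoint of its largest and smallest neighbor values. A toy sketch on a path graph follows (not the authors' implementation; the graph and labels are made up):

```python
def infinity_laplacian_propagate(values, labeled, neighbors, iters=200):
    """Discrete infinity-Laplacian label propagation: every unlabeled node
    is repeatedly replaced by the midpoint of its largest and smallest
    neighbor values, a fixed point of the graph infinity Laplacian."""
    u = dict(values)
    for _ in range(iters):
        for node, nbrs in neighbors.items():
            if node not in labeled:
                nbr_vals = [u[n] for n in nbrs]
                u[node] = 0.5 * (max(nbr_vals) + min(nbr_vals))
    return u

# Toy path graph 0-1-2-3-4 with the endpoints labeled 0.0 and 1.0;
# the infinity-harmonic interpolation along a path is linear.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
init = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 1.0}
u = infinity_laplacian_propagate(init, labeled={0, 4}, neighbors=neighbors)
```

On this path the iteration converges to the linear interpolation 0, 0.25, 0.5, 0.75, 1 between the two labeled endpoints, which is the hallmark of an infinity-harmonic extension.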

[LG-2] Stable Matching with Ties: Approximation Ratios and Learning

链接: https://arxiv.org/abs/2411.03270
作者: Shiyun Lin,Simon Mauras,Nadav Merlis,Vianney Perchet
关键词-EN: strict preferences, utility, preferences, stable, Optimal Stable Share
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the problem of matching markets with ties, where one side of the market does not necessarily have strict preferences over members at its other side. For example, workers do not always have strict preferences over jobs, students can give the same ranking for different schools, and more. In particular, assume w.l.o.g. that workers’ preferences are determined by their utility from being matched to each job, which might admit ties. Notably, in contrast to classical two-sided markets with strict preferences, there is no longer a single stable matching that simultaneously maximizes the utility for all workers. We aim to guarantee each worker the largest possible share of the utility in her best possible stable matching. We call the ratio between the worker’s best possible stable utility and her assigned utility the \emph{Optimal Stable Share} (OSS)-ratio. We first prove that distributions over stable matchings cannot guarantee an OSS-ratio that is sublinear in the number of workers. Instead, randomizing over possibly non-stable matchings, we show how to achieve a tight logarithmic OSS-ratio. Then, we analyze the case where the real utility is not necessarily known and can only be approximated. In particular, we provide an algorithm that guarantees a similar fraction of the utility compared to the best possible utility. Finally, we move to a bandit setting, where we select a matching at each round and only observe the utilities for matches we perform. We show how to utilize our results for approximate utilities to gracefully interpolate between problems without ties and problems with statistical ties (small suboptimality gaps).

[LG-3] Proxy-informed Bayesian transfer learning with unknown sources

链接: https://arxiv.org/abs/2411.03263
作者: Sabina J. Sloman,Julien Martinelli,Samuel Kaski
关键词-EN: requires leveraging prior, leveraging prior knowledge, data requires leveraging, target task, training data requires
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Generalization outside the scope of one’s training data requires leveraging prior knowledge about the effects that transfer, and the effects that don’t, between different data sources. Bayesian transfer learning is a principled paradigm for specifying this knowledge, and refining it on the basis of data from the source (training) and target (prediction) tasks. We address the challenging transfer learning setting where the learner (i) cannot fine-tune in the target task, and (ii) does not know which source data points correspond to the same task (i.e., the data sources are unknown). We propose a proxy-informed robust method for probabilistic transfer learning (PROMPT), which provides a posterior predictive estimate tailored to the structure of the target task, without requiring the learner have access to any outcome information from the target task. Instead, PROMPT relies on the availability of proxy information. PROMPT uses the same proxy information for two purposes: (i) estimation of effects specific to the target task, and (ii) construction of a robust reweighting of the source data for estimation of effects that transfer between tasks. We provide theoretical results on the effect of this reweighting on the risk of negative transfer, and demonstrate application of PROMPT in two synthetic settings.

[LG-4] Enhancing Transformer Training Efficiency with Dynamic Dropout

链接: https://arxiv.org/abs/2411.03236
作者: Hanrui Yan,Dan Shao
关键词-EN: validation loss improvements, introduce Dynamic Dropout, Dynamic Dropout, dropout rate based, loss improvements
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce Dynamic Dropout, a novel regularization technique designed to enhance the training efficiency of Transformer models by dynamically adjusting the dropout rate based on training epochs or validation loss improvements. This approach addresses the challenge of balancing regularization and model capacity, which is crucial for achieving fast convergence and high performance. Our method involves modifying the GPT model to accept a variable dropout rate and updating dropout layers during training using schedules such as linear decay, exponential decay, and validation loss-based adjustments. Extensive experiments on the Shakespeare_char dataset demonstrate that Dynamic Dropout significantly accelerates training and improves inference efficiency compared to a baseline model with a fixed dropout rate. The validation loss-based adjustment schedule provided the best overall performance, highlighting the potential of Dynamic Dropout as a valuable technique for training large-scale Transformer models.
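
The three schedules the abstract names can be sketched as below. The function names, default rates, and clipping bounds are assumptions of mine, not the paper's hyperparameters:

```python
import math

def linear_decay(epoch, total_epochs, p0=0.5, p_min=0.05):
    """Dropout rate falls linearly from p0 to p_min over training."""
    frac = min(epoch / max(total_epochs - 1, 1), 1.0)
    return p0 + (p_min - p0) * frac

def exponential_decay(epoch, p0=0.5, rate=0.05, p_min=0.05):
    """Dropout rate decays exponentially toward a floor p_min."""
    return max(p0 * math.exp(-rate * epoch), p_min)

def val_loss_adjust(p, improved, step=0.05, p_min=0.05, p_max=0.7):
    """Validation-loss-based schedule: relax dropout while the validation
    loss keeps improving, tighten it again when progress stalls."""
    p = p - step if improved else p + step
    return min(max(p, p_min), p_max)
```

In a training loop one of these would be called once per epoch and the returned rate written into the model's dropout layers before the next pass.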

[LG-5] Interpretable Predictive Models for Healthcare via Rational Logistic Regression ALT

链接: https://arxiv.org/abs/2411.03224
作者: Thiti Suttaket,L Vivek Harsha Vardhan,Stanley Kok
关键词-EN: electronic health records, digital data recently, health records, healthcare sector, sector has experienced
类目: Machine Learning (cs.LG)
*备注: ICIS 2021 Proceedings ( see this https URL )

点击查看摘要

Abstract:The healthcare sector has experienced a rapid accumulation of digital data recently, especially in the form of electronic health records (EHRs). EHRs constitute a precious resource that IS researchers could utilize for clinical applications (e.g., morbidity prediction). Deep learning seems like the obvious choice to exploit this surfeit of data. However, numerous studies have shown that deep learning does not enjoy the same kind of success on EHR data as it has in other domains; simple models like logistic regression are frequently as good as sophisticated deep learning ones. Inspired by this observation, we develop a novel model called rational logistic regression (RLR) that has standard logistic regression (LR) as its special case (and thus inherits LR’s inductive bias that aligns with EHR data). RLR has rational series as its theoretical underpinnings, works on longitudinal time-series data, and learns interpretable patterns. Empirical comparisons on real-world clinical tasks demonstrate RLR’s efficacy.

[LG-6] A Machine Learning Approach for the Efficient Estimation of Ground-Level Air Temperature in Urban Areas

链接: https://arxiv.org/abs/2411.03162
作者: Iñigo Delgado-Enales,Joshua Lizundia-Loiola,Patricia Molina-Costa,Javier Del Ser
关键词-EN: increasingly populated cities, Century face, Urban Heat Island, increasingly populated, face the challenge
类目: Machine Learning (cs.LG)
*备注: 39 pages, 8 figures, 2 tables, under review

点击查看摘要

Abstract:The increasingly populated cities of the 21st century face the challenge of being sustainable and resilient spaces for their inhabitants. However, climate change, among other problems, makes these objectives difficult to achieve. The Urban Heat Island (UHI) phenomenon that occurs in cities, increasing their thermal stress, is one of the stumbling blocks to achieving a more sustainable city. The ability to estimate temperatures with a high degree of accuracy allows for the identification of the highest-priority areas in cities where urban improvements need to be made to reduce thermal discomfort. In this work we explore the usefulness of image-to-image deep neural networks (DNNs) for correlating spatial and meteorological variables of an urban area with street-level air temperature. The air temperature at street level is estimated both spatially and temporally for a specific use case, and compared with existing, well-established numerical models. Based on the obtained results, deep neural networks are confirmed to be a faster and less computationally expensive alternative to numerical models for estimating ground-level air temperature.

[LG-7] User Centric Semantic Communications

链接: https://arxiv.org/abs/2411.03127
作者: Xunze Liu,Yifei Sun,Zhaorui Wang,Lizhao You,Haoyuan Pan,Fangxin Wang,Shuguang Cui
关键词-EN: reduce bandwidth usage, semantic information, Current studies, efficiently extracting semantic, extracting semantic information
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Current studies on semantic communications mainly focus on efficiently extracting semantic information to reduce bandwidth usage between a transmitter and a user. Although significant progress has been made in semantic communications, a fundamental design problem is that the semantic information is extracted based on certain criteria at the transmitter side alone, without considering the user’s actual requirements. As a result, critical information that is of primary concern to the user may be lost. In such cases, the semantic transmission becomes meaningless to the user, as all received information is irrelevant to the user’s interests. To solve this problem, this paper presents a user-centric semantic communication system, where the user sends its request for the desired semantic information to the transmitter at the start of each transmission. Then, the transmitter extracts the required semantic information accordingly. A key challenge is how the transmitter can understand the user’s requests for semantic information and extract the required semantic information in a reasonable and robust manner. We solve this challenge by designing a well-structured framework and leveraging off-the-shelf products, such as GPT-4, along with several specialized tools for detection and estimation. Evaluation results demonstrate the feasibility and effectiveness of the proposed user-centric semantic communication system.

[LG-8] Near-Optimal Dynamic Regret for Adversarial Linear Mixture MDPs NEURIPS2024

链接: https://arxiv.org/abs/2411.03107
作者: Long-Fei Li,Peng Zhao,Zhi-Hua Zhou
关键词-EN: unknown transition, study episodic linear, full-information feedback, study episodic, rewards under full-information
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: NeurIPS 2024

点击查看摘要

Abstract:We study episodic linear mixture MDPs with the unknown transition and adversarial rewards under full-information feedback, employing dynamic regret as the performance measure. We start with in-depth analyses of the strengths and limitations of the two most popular methods: occupancy-measure-based and policy-based methods. We observe that while the occupancy-measure-based method is effective in addressing non-stationary environments, it encounters difficulties with the unknown transition. In contrast, the policy-based method can deal with the unknown transition effectively but faces challenges in handling non-stationary environments. Building on this, we propose a novel algorithm that combines the benefits of both methods. Specifically, it employs (i) an occupancy-measure-based global optimization with a two-layer structure to handle non-stationary environments; and (ii) a policy-based variance-aware value-targeted regression to tackle the unknown transition. We bridge these two parts by a novel conversion. Our algorithm enjoys an $\widetilde{\mathcal{O}}(d \sqrt{H^3 K} + \sqrt{HK}(H + \bar{P}_K))$ dynamic regret, where $d$ is the feature dimension, $H$ is the episode length, $K$ is the number of episodes, and $\bar{P}_K$ is the non-stationarity measure. We show it is minimax optimal up to logarithmic factors by establishing a matching lower bound. To the best of our knowledge, this is the first work that achieves near-optimal dynamic regret for adversarial linear mixture MDPs with the unknown transition without prior knowledge of the non-stationarity measure.

[LG-9] Speech Separation with Pretrained Frontend to Minimize Domain Mismatch

链接: https://arxiv.org/abs/2411.03085
作者: Wupeng Wang,Zexu Pan,Xinke Li,Shuai Wang,Haizhou Li
关键词-EN: DIP frontend, Speech separation, separate individual speech, individual speech signals, speech separation models
类目: ound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注: IEEE/ACM Transactions on Audio, Speech, and Language Processing

点击查看摘要

Abstract:Speech separation seeks to separate individual speech signals from a speech mixture. Typically, most separation models are trained on synthetic data due to the unavailability of target reference in real-world cocktail party scenarios. As a result, there exists a domain gap between real and synthetic data when deploying speech separation models in real-world applications. In this paper, we propose a self-supervised domain-invariant pretrained (DIP) frontend that is exposed to mixture data without the need for target reference speech. The DIP frontend utilizes a Siamese network with two innovative pretext tasks, mixture predictive coding (MPC) and mixture invariant coding (MIC), to capture shared contextual cues between real and synthetic unlabeled mixtures. Subsequently, we freeze the DIP frontend as a feature extractor when training the downstream speech separation models on synthetic data. By pretraining the DIP frontend with the contextual cues, we expect that the speech separation skills learned from synthetic data can be effectively transferred to real data. To benefit from the DIP frontend, we introduce a novel separation pipeline to align the feature resolution of the separation models. We evaluate the speech separation quality on standard benchmarks and real-world datasets. The results confirm the superiority of our DIP frontend over existing speech separation models. This study underscores the potential of large-scale pretraining to enhance the quality and intelligibility of speech separation in real-world applications.

[LG-10] Alpha and Prejudice: Improving alpha-sized Worst-case Fairness via Intrinsic Reweighting

链接: https://arxiv.org/abs/2411.03068
作者: Jing Li,Yinghua Yao,Yuangang Pan,Xuanqian Wang,Ivor W. Tsang,Xiuju Fu
关键词-EN: achieves group parity, demographics achieves group, achieves group, group parity, worst-off group
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Worst-case fairness with off-the-shelf demographics achieves group parity by maximizing the model utility of the worst-off group. Nevertheless, demographic information is often unavailable in practical scenarios, which impedes the use of such a direct max-min formulation. Recent advances have reframed this learning problem by introducing the lower bound of the minimal partition ratio, denoted as $\alpha$, as side information, referred to as “$\alpha$-sized worst-case fairness” in this paper. We first justify the practical significance of this setting by presenting noteworthy evidence from the data privacy perspective, which has been overlooked by existing research. Without imposing specific requirements on loss functions, we propose reweighting the training samples based on their intrinsic importance to fairness. Given the global nature of the worst-case formulation, we further develop a stochastic learning scheme to simplify the training process without compromising model performance. Additionally, we address the issue of outliers and provide a robust variant to handle potential outliers during model training. Our theoretical analysis and experimental observations reveal the connections between the proposed approaches and existing “fairness-through-reweighting” studies, with extensive experimental results on fairness benchmarks demonstrating the superiority of our methods.
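
To make the max-min reweighting idea concrete, here is a deliberately simplified sketch that upweights the worst-off group when group membership is known. The paper's setting is harder, since demographics are unavailable and only a partition-ratio lower bound is given; all names and values below are illustrative:

```python
def worst_group_weights(losses, groups, boost=2.0):
    """Toy max-min reweighting with *known* group labels (the paper's
    setting is harder: no demographics, only a partition-size bound).
    Samples of the group with the worst average loss get upweighted."""
    per_group = {}
    for loss, g in zip(losses, groups):
        per_group.setdefault(g, []).append(loss)
    worst = max(per_group, key=lambda g: sum(per_group[g]) / len(per_group[g]))
    return [boost if g == worst else 1.0 for g in groups]

# Group "b" has the higher average loss, so its samples are upweighted.
w = worst_group_weights([0.1, 0.2, 0.9, 0.8], ["a", "a", "b", "b"])
```

Training on the reweighted loss then pushes the optimizer to improve the worst-off group first, which is the spirit of the max-min formulation.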

[LG-11] Can Transformers Smell Like Humans? NEURIPS2024

链接: https://arxiv.org/abs/2411.03038
作者: Farzaneh Taleb,Miguel Vasco,Antônio H. Ribeiro,Mårten Björkman,Danica Kragic
关键词-EN: brain encodes stimuli, human olfactory perception, olfactory perception, human brain encodes, human olfactory
类目: Machine Learning (cs.LG)
*备注: Spotlight paper at NeurIPS 2024

点击查看摘要

Abstract:The human brain encodes stimuli from the environment into representations that form a sensory perception of the world. Despite recent advances in understanding visual and auditory perception, olfactory perception remains an under-explored topic in the machine learning community due to the lack of large-scale datasets annotated with labels of human olfactory perception. In this work, we ask the question of whether pre-trained transformer models of chemical structures encode representations that are aligned with human olfactory perception, i.e., can transformers smell like humans? We demonstrate that representations encoded from transformers pre-trained on general chemical structures are highly aligned with human olfactory perception. We use multiple datasets and different types of perceptual representations to show that the representations encoded by transformer models are able to predict: (i) labels associated with odorants provided by experts; (ii) continuous ratings provided by human participants with respect to pre-defined descriptors; and (iii) similarity ratings between odorants provided by human participants. Finally, we evaluate the extent to which this alignment is associated with physicochemical features of odorants known to be relevant for olfactory decoding.

[LG-12] Graph Agnostic Causal Bayesian Optimisation

链接: https://arxiv.org/abs/2411.03028
作者: Sumantrak Mukherjee,Mengyan Zhang,Seth Flaxman,Sebastian Josef Vollmer
关键词-EN: Causal Bayesian Optimisation, Causal Bayesian, Bayesian Optimisation, Agnostic Causal Bayesian, study causal Bayesian
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study the problem of globally optimising a target variable of an unknown causal graph on which a sequence of soft or hard interventions can be performed. The problem of optimising the target variable associated with a causal graph is formalised as Causal Bayesian Optimisation (CBO). We study the CBO problem under the cumulative regret objective with unknown causal graphs for two settings, namely structural causal models with hard interventions and function networks with soft interventions. We propose Graph Agnostic Causal Bayesian Optimisation (GACBO), an algorithm that actively discovers the causal structure that contributes to achieving optimal rewards. GACBO seeks to balance exploiting the actions that give the best rewards against exploring the causal structures and functions. To the best of our knowledge, our work is the first to study causal Bayesian optimization with cumulative regret objectives in scenarios where the graph is unknown or partially known. We show our proposed algorithm outperforms baselines in simulated experiments and real-world applications.

[LG-13] Testing Generalizability in Causal Inference AISTATS2025

链接: https://arxiv.org/abs/2411.03021
作者: Daniel de Vassimon Manela,Linying Yang,Robin J. Evans
关键词-EN: scenarios requires addressing, diverse real-world scenarios, real-world scenarios requires, observed data ranges, Ensuring robust model
类目: Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 17 pages, 10 figures, Under review at AISTATS 2025

点击查看摘要

Abstract:Ensuring robust model performance across diverse real-world scenarios requires addressing both transportability across domains with covariate shifts and extrapolation beyond observed data ranges. However, there is no formal procedure for statistically evaluating generalizability in machine learning algorithms, particularly in causal inference. Existing methods often rely on arbitrary metrics like AUC or MSE and focus predominantly on toy datasets, providing limited insights into real-world applicability. To address this gap, we propose a systematic and quantitative framework for evaluating model generalizability under covariate distribution shifts, specifically within causal inference settings. Our approach leverages the frugal parameterization, allowing for flexible simulations from fully and semi-synthetic benchmarks, offering comprehensive evaluations for both mean and distributional regression methods. By basing simulations on real data, our method ensures more realistic evaluations, which is often missing in current work relying on simplified datasets. Furthermore, using simulations and statistical testing, our framework is robust and avoids over-reliance on conventional metrics. Grounded in real-world data, it provides realistic insights into model performance, bridging the gap between synthetic evaluations and practical applications.

[LG-14] Transformer-Based Fault-Tolerant Control for Fixed-Wing UAVs Using Knowledge Distillation and In-Context Adaptation

链接: https://arxiv.org/abs/2411.02975
作者: Francisco Giral,Ignacio Gómez,Ricardo Vinuesa,Soledad Le-Clainche
关键词-EN: Unmanned Aerial Vehicles, fixed-wing Unmanned Aerial, Aerial Vehicles, Unmanned Aerial, Flight Control Systems
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:This study presents a transformer-based approach for fault-tolerant control in fixed-wing Unmanned Aerial Vehicles (UAVs), designed to adapt in real time to dynamic changes caused by structural damage or actuator failures. Unlike traditional Flight Control Systems (FCSs) that rely on classical control theory and struggle under severe alterations in dynamics, our method directly maps outer-loop reference values – altitude, heading, and airspeed – into control commands using the in-context learning and attention mechanisms of transformers, thus bypassing inner-loop controllers and fault-detection layers. Employing a teacher-student knowledge distillation framework, the proposed approach trains a student agent with partial observations by transferring knowledge from a privileged expert agent with full observability, enabling robust performance across diverse failure scenarios. Experimental results demonstrate that our transformer-based controller outperforms industry-standard FCS and state-of-the-art reinforcement learning (RL) methods, maintaining high tracking accuracy and stability in nominal conditions and extreme failure cases, highlighting its potential for enhancing UAV operational safety and reliability.

[LG-15] Embedding Safety into RL: A New Take on Trust Region Methods

链接: https://arxiv.org/abs/2411.02957
作者: Nikola Milosevic,Johannes Müller,Nico Scherf
关键词-EN: Reinforcement Learning, Markov Decision Processes, Constrained Markov Decision, Decision Processes, solve a wide
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) agents are able to solve a wide variety of tasks but are prone to producing unsafe behaviors. Constrained Markov Decision Processes (CMDPs) provide a popular framework for incorporating safety constraints. However, common solution methods often compromise reward maximization by being overly conservative or allow unsafe behavior during training. We propose Constrained Trust Region Policy Optimization (C-TRPO), a novel approach that modifies the geometry of the policy space based on the safety constraints and yields trust regions composed exclusively of safe policies, ensuring constraint satisfaction throughout training. We theoretically study the convergence and update properties of C-TRPO and highlight connections to TRPO, Natural Policy Gradient (NPG), and Constrained Policy Optimization (CPO). Finally, we demonstrate experimentally that C-TRPO significantly reduces constraint violations while achieving competitive reward maximization compared to state-of-the-art CMDP algorithms.

[LG-16] IMUDiffusion: A Diffusion Model for Multivariate Time Series Synthetisation for Inertial Motion Capturing Systems

链接: https://arxiv.org/abs/2411.02954
作者: Heiko Oppel,Michael Munz
关键词-EN: unlike video-based motion, motion capturing systems, Kinematic sensors, video-based motion capturing, daily activities due
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Kinematic sensors are often used to analyze movement behaviors in sports and daily activities due to their ease of use and lack of spatial restrictions, unlike video-based motion capturing systems. Still, the generation, and especially the labeling of motion data for specific activities can be time-consuming and costly. Additionally, many models struggle with limited data, which limits their performance in recognizing complex movement patterns. To address those issues, generating synthetic data can help expand the diversity and variability. In this work, we propose IMUDiffusion, a probabilistic diffusion model specifically designed for multivariate time series generation. Our approach enables the generation of high-quality time series sequences which accurately capture the dynamics of human activities. Moreover, by joining our dataset with synthetic data, we achieve a significant improvement in the performance of our baseline human activity classifier. In some cases, we are able to improve the macro F1-score by almost 30%. IMUDiffusion provides a valuable tool for generating realistic human activity movements and enhance the robustness of models in scenarios with limited training data.
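
IMUDiffusion belongs to the DDPM family of probabilistic diffusion models, whose forward (noising) process has a well-known closed form. The sketch below illustrates that generic forward step on a toy one-channel sensor trace; it is not the paper's architecture, and the schedule and data are illustrative:

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Closed-form DDPM forward noising q(x_t | x_0).

    x0 is a multivariate time series of shape (timesteps, channels);
    alpha_bar_t = prod_{s<=t} (1 - beta_s) interpolates from clean data
    (t = 0) to near-pure Gaussian noise (t = T - 1).
    """
    alphas_bar = np.cumprod(1.0 - betas)
    a = alphas_bar[t]
    noise = rng.normal(size=x0.shape)
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise

rng = np.random.default_rng(0)
x0 = np.sin(np.linspace(0.0, 6.28, 100))[:, None]  # toy 1-channel IMU-like trace
betas = np.linspace(1e-4, 0.02, 1000)              # a standard linear schedule
xt = forward_diffuse(x0, t=999, betas=betas, rng=rng)
print(xt.shape)  # same shape as x0, content close to pure noise at the final step
```

A trained reverse network would then denoise step by step to synthesize new sequences.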

[LG-17] A scalable generative model for dynamical system reconstruction from neuroimaging data NEURIPS2024

链接: https://arxiv.org/abs/2411.02949
作者: Eric Volkmann,Alena Brändle,Daniel Durstewitz,Georgia Koppe
关键词-EN: Data-driven inference, generative dynamics underlying, natural sciences, observed time series, set of observed
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Chaotic Dynamics (nlin.CD); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:Data-driven inference of the generative dynamics underlying a set of observed time series is of growing interest in machine learning and the natural sciences. In neuroscience, such methods promise to alleviate the need to handcraft models based on biophysical principles and allow to automatize the inference of inter-individual differences in brain dynamics. Recent breakthroughs in training techniques for state space models (SSMs) specifically geared toward dynamical systems (DS) reconstruction (DSR) enable to recover the underlying system including its geometrical (attractor) and long-term statistical invariants from even short time series. These techniques are based on control-theoretic ideas, like modern variants of teacher forcing (TF), to ensure stable loss gradient propagation while training. However, as it currently stands, these techniques are not directly applicable to data modalities where current observations depend on an entire history of previous states due to a signal’s filtering properties, as common in neuroscience (and physiology more generally). Prominent examples are the blood oxygenation level dependent (BOLD) signal in functional magnetic resonance imaging (fMRI) or Ca ^2+ imaging data. Such types of signals render the SSM’s decoder model non-invertible, a requirement for previous TF-based methods. Here, exploiting the recent success of control techniques for training SSMs, we propose a novel algorithm that solves this problem and scales exceptionally well with model dimensionality and filter length. We demonstrate its efficiency in reconstructing dynamical systems, including their state space geometry and long-term temporal properties, from just short BOLD time series.

[LG-18] Time-Causal VAE: Robust Financial Time Series Generator

链接: https://arxiv.org/abs/2411.02947
作者: Beatrice Acciaio,Stephan Eckstein,Songyan Hou
关键词-EN: time-causal variational autoencoder, generated time series, time series, market time series, time series data
类目: Machine Learning (cs.LG); Computational Finance (q-fin.CP)
*备注:

点击查看摘要

Abstract:We build a time-causal variational autoencoder (TC-VAE) for robust generation of financial time series data. Our approach imposes a causality constraint on the encoder and decoder networks, ensuring a causal transport from the real market time series to the fake generated time series. Specifically, we prove that the TC-VAE loss provides an upper bound on the causal Wasserstein distance between market distributions and generated distributions. Consequently, the TC-VAE loss controls the discrepancy between optimal values of various dynamic stochastic optimization problems under real and generated distributions. To further enhance the model’s ability to approximate the latent representation of the real market distribution, we integrate a RealNVP prior into the TC-VAE framework. Finally, extensive numerical experiments show that TC-VAE achieves promising results on both synthetic and real market data. This is done by comparing real and generated distributions according to various statistical distances, demonstrating the effectiveness of the generated data for downstream financial optimization tasks, as well as showcasing that the generated data reproduces stylized facts of real financial market data.

[LG-19] P-MOSS: Learned Scheduling For Indexes Over NUMA Servers Using Low-Level Hardware Statistics

链接: https://arxiv.org/abs/2411.02933
作者: Yeasir Rayhan,Walid G. Aref
关键词-EN: Dennard scaling broke, CPU stalled, CPU chip, NUMA processors, Dennard scaling
类目: Databases (cs.DB); Machine Learning (cs.LG); Performance (cs.PF)
*备注:

点击查看摘要

Abstract:Ever since the Dennard scaling broke down in the early 2000s and the frequency of the CPU stalled, vendors have started to increase the core count in each CPU chip at the expense of introducing heterogeneity, thus ushering the era of NUMA processors. Since then, the heterogeneity in the design space of hardware has only increased to the point that DBMS performance may vary significantly up to an order of magnitude in modern servers. An important factor that affects performance includes the location of the logical cores where the DBMS queries are scheduled, and the locations of the data that the queries access. This paper introduces P-MOSS, a learned spatial scheduling framework that schedules query execution to certain logical cores, and places data accordingly to certain integrated memory controllers (IMC), to integrate hardware consciousness into the system. In the spirit of hardware-software synergy, P-MOSS solely guides its scheduling decision based on low-level hardware statistics collected by performance monitoring counters with the aid of a Decision Transformer. Experimental evaluation is performed in the context of the B-tree and R-tree indexes. Performance results demonstrate that P-MOSS has up to 6x improvement over traditional schedules in terms of query throughput.

[LG-20] Privacy-Preserving Graph-Based Machine Learning with Fully Homomorphic Encryption for Collaborative Anti-Money Laundering

链接: https://arxiv.org/abs/2411.02926
作者: Fabrianne Effendi,Anupam Chattopadhyay
关键词-EN: Combating money laundering, money laundering networks, Combating money, money laundering, AML machine learning
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 14th International Conference on Security, Privacy, and Applied Cryptographic Engineering (SPACE) 2024

点击查看摘要

Abstract:Combating money laundering has become increasingly complex with the rise of cybercrime and digitalization of financial transactions. Graph-based machine learning techniques have emerged as promising tools for Anti-Money Laundering (AML) detection, capturing intricate relationships within money laundering networks. However, the effectiveness of AML solutions is hindered by data silos within financial institutions, limiting collaboration and overall efficacy. This research presents a novel privacy-preserving approach for collaborative AML machine learning, facilitating secure data sharing across institutions and borders while preserving privacy and regulatory compliance. Leveraging Fully Homomorphic Encryption (FHE), computations are directly performed on encrypted data, ensuring the confidentiality of financial data. Notably, FHE over the Torus (TFHE) was integrated with graph-based machine learning using Zama Concrete ML. The research contributes two key privacy-preserving pipelines. First, the development of a privacy-preserving Graph Neural Network (GNN) pipeline was explored. Optimization techniques like quantization and pruning were used to render the GNN FHE-compatible. Second, a privacy-preserving graph-based XGBoost pipeline leveraging Graph Feature Preprocessor (GFP) was successfully developed. Experiments demonstrated strong predictive performance, with the XGBoost model consistently achieving over 99% accuracy, F1-score, precision, and recall on the balanced AML dataset in both unencrypted and FHE-encrypted inference settings. On the imbalanced dataset, the incorporation of graph-based features improved the F1-score by 8%. The research highlights the need to balance the trade-off between privacy and computational efficiency.
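
The quantization step used to make models FHE-compatible can be sketched generically: schemes like TFHE compute over small integers, so floating-point weights must be mapped onto a low-bit grid. This is plain uniform symmetric quantization, not Concrete ML's actual quantizer, and the weight values are illustrative:

```python
import numpy as np

def quantize(x, bits=8):
    """Uniform symmetric quantization to signed integers.

    Maps floats to the range [-(2^(bits-1) - 1), 2^(bits-1) - 1];
    dequantization is q * scale, with error bounded by scale / 2.
    """
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    q = np.round(x / scale).astype(np.int32)
    return q, scale

w = np.array([-0.50, 0.12, 0.31, 0.49])   # toy model weights
q, scale = quantize(w, bits=8)
print(q, np.max(np.abs(q * scale - w)))    # integer codes, small round-off error
```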

[LG-21] Theoretically Guaranteed Distribution Adaptable Learning

链接: https://arxiv.org/abs/2411.02921
作者: Chao Xu,Xijia Tang,Guoqing Liu,Yuhua Qian,Chenping Hou
关键词-EN: open environment applications, evolving data distributions, Distribution Adaptable Learning, data distributions, data distribution evolving
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In many open environment applications, data are collected in the form of a stream, which exhibits an evolving distribution over time. How to design algorithms to track these evolving data distributions with provable guarantees, particularly in terms of the generalization ability, remains a formidable challenge. To handle this crucial but rarely studied problem and take a further step toward robust artificial intelligence, we propose a novel framework called Distribution Adaptable Learning (DAL). It enables the model to effectively track the evolving data distributions. By Encoding Feature Marginal Distribution Information (EFMDI), we broke the limitations of optimal transport to characterize the environmental changes and enable model reuse across diverse data distributions. It can enhance the reusable and evolvable properties of DAL in accommodating evolving distributions. Furthermore, to obtain the model interpretability, we not only analyze the generalization error bound of the local step in the evolution process, but also investigate the generalization error bound associated with the entire classifier trajectory of the evolution based on the Fisher-Rao distance. For demonstration, we also present two special cases within the framework, together with their optimizations and convergence analyses. Experimental results over both synthetic and real-world data distribution evolving tasks validate the effectiveness and practical utility of the proposed framework.

[LG-22] Photon: Federated LLM Pre-Training

链接: https://arxiv.org/abs/2411.02908
作者: Lorenzo Sani,Alex Iacob,Zeyu Cao,Royson Lee,Bill Marino,Yan Gao,Dongqi Cai,Zexi Li,Wanru Zhao,Xinchi Qiu,Nicholas D. Lane
关键词-EN: Scaling large language, Scaling large, large language models, demands extensive data, demands extensive
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 13 pages, 9 appendix pages, 10 figures, 3 algorithms, 8 tables

点击查看摘要

Abstract:Scaling large language models (LLMs) demands extensive data and computing resources, which are traditionally constrained to data centers by the high-bandwidth requirements of distributed training. Low-bandwidth methods like federated learning (FL) could enable collaborative training of larger models across weakly-connected GPUs if they can effectively be used for pre-training. To achieve this, we introduce Photon, the first complete system for federated end-to-end LLM training, leveraging cross-silo FL for global-scale training with minimal communication overheads. Using Photon, we train the first federated family of decoder-only LLMs from scratch. We show that: (1) Photon can train model sizes up to 7B in a federated fashion while reaching an even better perplexity than centralized pre-training; (2) Photon model training time decreases with available compute, achieving a similar compute-time trade-off to centralized; and (3) Photon outperforms the wall-time of baseline distributed training methods by 35% via communicating 64x-512x less. Our proposal is robust to data heterogeneity and converges twice as fast as previous methods like DiLoCo. This surprising data efficiency stems from a unique approach combining small client batch sizes with extremely high learning rates, enabled by federated averaging’s robustness to hyperparameters. Photon thus represents the first economical system for global internet-wide LLM pre-training.
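
The aggregation primitive Photon builds on can be illustrated with plain federated averaging: each client trains locally, and only model weights (never data) are communicated and averaged, weighted by local dataset size. This is a generic FedAvg sketch with toy numbers, not Photon's actual system, which adds cross-silo scheduling and communication reduction on top:

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """One communication round of federated averaging (FedAvg).

    Returns the size-weighted mean of the clients' model parameters,
    which becomes the next global model.
    """
    total = sum(client_sizes)
    avg = np.zeros_like(client_weights[0])
    for w, n in zip(client_weights, client_sizes):
        avg += (n / total) * w  # contribution proportional to local data size
    return avg

# two clients with different amounts of local data (toy 2-parameter "models")
w_global = federated_average(
    [np.array([1.0, 2.0]), np.array([3.0, 4.0])],
    client_sizes=[1, 3],
)
print(w_global)  # [2.5 3.5]
```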

[LG-23] The Unreasonable Effectiveness of LLMs for Query Optimization NEURIPS2024

链接: https://arxiv.org/abs/2411.02862
作者: Peter Akioyamen,Zixuan Yi,Ryan Marcus
关键词-EN: machine learning strategies, reinforcement learning schemes, complex machine learning, customized reinforcement learning, Recent work
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注: To appear in the Machine Learning for Systems Workshop at NeurIPS 2024

点击查看摘要

Abstract:Recent work in database query optimization has used complex machine learning strategies, such as customized reinforcement learning schemes. Surprisingly, we show that LLM embeddings of query text contain useful semantic information for query optimization. Specifically, we show that a simple binary classifier deciding between alternative query plans, trained only on a small number of labeled embedded query vectors, can outperform existing heuristic systems. Although we only present some preliminary results, an LLM-powered query optimizer could provide significant benefits, both in terms of performance and simplicity.
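
The paper's core claim is that a small binary classifier over embedded query vectors can pick between two alternative plans. The sketch below mimics that setup on fabricated vectors (the "embeddings" and labels are synthetic stand-ins, not real LLM embeddings, and the classifier is a minimal hand-rolled logistic regression):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fabricated stand-ins for LLM query embeddings, labeled by which of two
# candidate plans was faster (1 = plan A, 0 = plan B). We make the toy
# problem linearly separable so a linear classifier can solve it.
X = rng.normal(size=(200, 8))
true_w = rng.normal(size=8)
y = (X @ true_w > 0).astype(float)

# Simple binary classifier: logistic regression via gradient descent.
w = np.zeros(8)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))      # predicted P(plan A faster)
    w -= 0.5 * X.T @ (p - y) / len(y)       # gradient step on log-loss

pred = 1.0 / (1.0 + np.exp(-(X @ w))) > 0.5
acc = (pred == y.astype(bool)).mean()
print(f"training accuracy: {acc:.2f}")
```

In the paper's setting the feature vectors come from an LLM embedding of the query text; everything else stays this simple.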

[LG-24] SpiDR: A Reconfigurable Digital Compute-in-Memory Spiking Neural Network Accelerator for Event-based Perception

链接: https://arxiv.org/abs/2411.02854
作者: Deepika Sharma,Shubham Negi,Trishit Dutta,Amogh Agrawal,Kaushik Roy
关键词-EN: Dynamic Vision Sensors, Spiking Neural Networks, Spiking Neural, Vision Sensors, Dynamic Vision
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: 9 pages, 17 figures

点击查看摘要

Abstract:Spiking Neural Networks (SNNs), with their inherent recurrence, offer an efficient method for processing the asynchronous temporal data generated by Dynamic Vision Sensors (DVS), making them well-suited for event-based vision applications. However, existing SNN accelerators suffer from limitations in adaptability to diverse neuron models, bit precisions and network sizes, inefficient membrane potential (Vmem) handling, and limited sparse optimizations. In response to these challenges, we propose a scalable and reconfigurable digital compute-in-memory (CIM) SNN accelerator SpiDR with a set of key features: 1) It uses in-memory computations and reconfigurable operating modes to minimize data movement associated with weight and Vmem data structures while efficiently adapting to different workloads. 2) It supports multiple weight/Vmem bit precision values, enabling a trade-off between accuracy and energy efficiency and enhancing adaptability to diverse application demands. 3) A zero-skipping mechanism for sparse inputs significantly reduces energy usage by leveraging the inherent sparsity of spikes without introducing high overheads for low sparsity. 4) Finally, the asynchronous handshaking mechanism maintains the computational efficiency of the pipeline for variable execution times of different computation units. We fabricated SpiDR in 65 nm Taiwan Semiconductor Manufacturing Company (TSMC) low-power (LP) technology. It demonstrates competitive performance (scaled to the same technology node) to other digital SNN accelerators proposed in the recent literature and supports advanced reconfigurability. It achieves up to 5 TOPS/W energy efficiency at 95% input sparsity with 4-bit weights and 7-bit Vmem precision.
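
The zero-skipping idea exploits the fact that spikes are binary and mostly zero: only weight columns at spiking indices need to be fetched and accumulated. The software sketch below is a toy illustration of that mechanism, not the accelerator's datapath:

```python
import numpy as np

def sparse_mac(spikes, weights):
    """Zero-skipping multiply-accumulate for a binary spike vector.

    Because spikes are 0/1, the matrix-vector product reduces to summing
    the weight columns at the spiking indices; all zeros are skipped.
    """
    active = np.flatnonzero(spikes)         # indices of spiking inputs only
    return weights[:, active].sum(axis=1)   # equivalent to weights @ spikes

rng = np.random.default_rng(0)
spikes = (rng.random(16) < 0.1).astype(np.int8)  # ~90% sparse spike vector
weights = rng.normal(size=(4, 16))
out = sparse_mac(spikes, weights)
print(np.allclose(out, weights @ spikes))  # True: same result, far fewer MACs
```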

[LG-25] ADOPT: Modified Adam Can Converge with Any beta_2 with the Optimal Rate NEURIPS2024

链接: https://arxiv.org/abs/2411.02853
作者: Shohei Taniguchi,Keno Harada,Gouki Minegishi,Yuta Oshima,Seong Cheol Jeong,Go Nagahara,Tomoshi Iiyama,Masahiro Suzuki,Yusuke Iwasawa,Yutaka Matsuo
关键词-EN: popular optimization algorithms, popular optimization, optimization algorithms, Adam, ADOPT
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:Adam is one of the most popular optimization algorithms in deep learning. However, it is known that Adam does not converge in theory unless choosing a hyperparameter, i.e., \beta_2, in a problem-dependent manner. There have been many attempts to fix the non-convergence (e.g., AMSGrad), but they require an impractical assumption that the gradient noise is uniformly bounded. In this paper, we propose a new adaptive gradient method named ADOPT, which achieves the optimal convergence rate of \mathcal{O}(1/\sqrt{T}) with any choice of \beta_2 without depending on the bounded noise assumption. ADOPT addresses the non-convergence issue of Adam by removing the current gradient from the second moment estimate and changing the order of the momentum update and the normalization by the second moment estimate. We also conduct intensive numerical experiments, and verify that our ADOPT achieves superior results compared to Adam and its variants across a wide range of tasks, including image classification, generative modeling, natural language processing, and deep reinforcement learning. The implementation is available at this https URL.
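
The two modifications to Adam described above can be written down directly. This is a sketch based on the abstract's description, not the authors' reference implementation; the initialization of the second moment and all hyperparameters below are illustrative:

```python
import numpy as np

def adopt_step(theta, grad, m, v, lr=1e-2, b1=0.9, b2=0.999, eps=1e-6):
    """One ADOPT-style update (sketch), differing from Adam in two ways:

    (1) the gradient is normalized by the *previous* second-moment estimate,
        so the current gradient is excluded from its own normalizer;
    (2) momentum is applied *after* the normalization, not before.
    """
    g_norm = grad / np.maximum(np.sqrt(v), eps)  # normalize by old v first
    m = b1 * m + (1.0 - b1) * g_norm             # momentum on normalized gradient
    theta = theta - lr * m
    v = b2 * v + (1.0 - b2) * grad ** 2          # only now fold in current grad
    return theta, m, v

# minimize f(x) = x^2 from x = 1.0 (v initialized to 1 for simplicity)
theta, m, v = np.array([1.0]), np.zeros(1), np.ones(1)
for _ in range(2000):
    grad = 2.0 * theta
    theta, m, v = adopt_step(theta, grad, m, v)
print(theta)  # close to the minimizer 0
```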

[LG-26] Adversarial multi-task underwater acoustic target recognition: towards robustness against various influential factors

链接: https://arxiv.org/abs/2411.02848
作者: Yuan Xie,Ji Xu,Jiawei Ren,Junfeng Li
关键词-EN: practical maritime applications, passive sonar faces, sonar faces numerous, Underwater acoustic target, faces numerous challenges
类目: ound (cs.SD); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Underwater acoustic target recognition based on passive sonar faces numerous challenges in practical maritime applications. One of the main challenges lies in the susceptibility of signal characteristics to diverse environmental conditions and data acquisition configurations, which can lead to instability in recognition systems. While significant efforts have been dedicated to addressing these influential factors in other domains of underwater acoustics, they are often neglected in the field of underwater acoustic target recognition. To overcome this limitation, this study designs auxiliary tasks that model influential factors (e.g., source range, water column depth, or wind speed) based on available annotations and adopts a multi-task framework to connect these factors to the recognition task. Furthermore, we integrate an adversarial learning mechanism into the multi-task framework to prompt the model to extract representations that are robust against influential factors. Through extensive experiments and analyses on the ShipsEar dataset, our proposed adversarial multi-task model demonstrates its capacity to effectively model the influential factors and achieve state-of-the-art performance on the 12-class recognition task.

[LG-27] On the Comparison between Multi-modal and Single-modal Contrastive Learning

链接: https://arxiv.org/abs/2411.02837
作者: Wei Huang,Andi Han,Yongqiang Chen,Yuan Cao,Zhiqiang Xu,Taiji Suzuki
关键词-EN: Multi-modal contrastive learning, contrastive learning, single-modal contrastive learning, modern machine learning, learning
类目: Machine Learning (cs.LG)
*备注: 51 pages, 1 figure, 1 table

点击查看摘要

Abstract:Multi-modal contrastive learning with language supervision has presented a paradigm shift in modern machine learning. By pre-training on a web-scale dataset, multi-modal contrastive learning can learn high-quality representations that exhibit impressive robustness and transferability. Despite its empirical success, the theoretical understanding is still in its infancy, especially regarding its comparison with single-modal contrastive learning. In this work, we introduce a feature learning theory framework that provides a theoretical foundation for understanding the differences between multi-modal and single-modal contrastive learning. Based on a data generation model consisting of signal and noise, our analysis is performed on a ReLU network trained with the InfoMax objective function. Through a trajectory-based optimization analysis and generalization characterization on downstream tasks, we identify the critical factor, which is the signal-to-noise ratio (SNR), that impacts the generalizability in downstream tasks of both multi-modal and single-modal contrastive learning. Through the cooperation between the two modalities, multi-modal learning can achieve better feature learning, leading to improvements in performance in downstream tasks compared to single-modal learning. Our analysis provides a unified framework that can characterize the optimization and generalization of both single-modal and multi-modal contrastive learning. Empirical experiments on both synthetic and real-world datasets further consolidate our theoretical findings.
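
The InfoMax-type contrastive objective analyzed in the paper can be made concrete with a generic InfoNCE loss: matched pairs (two views or two modalities of the same sample) sit on the diagonal of the similarity matrix and are pulled together relative to all mismatched pairs. The data below is synthetic and the loss is a standard sketch, not the paper's exact objective:

```python
import numpy as np

def infonce_loss(z1, z2, tau=0.5):
    """InfoNCE-style contrastive loss over paired representations.

    z1[i] and z2[i] are representations of the same sample from two
    views/modalities; the loss is the mean cross-entropy of picking the
    matched partner among all candidates in the batch.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                          # cosine similarity / temperature
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))               # matched pairs on the diagonal

rng = np.random.default_rng(0)
signal = rng.normal(size=(32, 16))
aligned = infonce_loss(signal, signal + 0.01 * rng.normal(size=(32, 16)))
mismatched = infonce_loss(signal, rng.normal(size=(32, 16)))
print(aligned < mismatched)  # aligned pairs yield a much lower loss
```

In the multi-modal case the two inputs come from different modalities (e.g., image and text encoders); the single-modal case uses two augmented views, which is exactly the structural difference the paper's SNR analysis compares.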

[LG-28] CE-CoLLM : Efficient and Adaptive Large Language Models Through Cloud-Edge Collaboration

链接: https://arxiv.org/abs/2411.02829
作者: Hongpeng Jin,Yanzhao Wu
关键词-EN: Large Language Models, Large Language, achieved remarkable success, Language Models, human-like intelligence
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved remarkable success in serving end-users with human-like intelligence. However, LLMs demand high computational resources, making it challenging to deploy them to satisfy various performance objectives, such as meeting the resource constraints on edge devices close to end-users or achieving high accuracy with ample resources. In this paper, we introduce CE-CoLLM, a novel cloud-edge collaboration framework that supports efficient and adaptive LLM inference for end-users at the edge with two modes, (1) low-latency edge standalone inference and (2) highly accurate cloud-edge collaborative inference. First, we show that the inherent high communication costs for transmitting LLM contextual information between the edge and cloud dominate the overall latency, making it inefficient and costly to deploy LLMs using cloud-edge collaboration. Second, we propose several critical techniques to address this challenge, including early-exit mechanism, cloud context manager, and quantization in cloud-edge collaboration to enable not only low-latency standalone edge inference but also efficient and adaptive cloud-edge collaborative inference for LLMs. Third, we perform comprehensive experimental analysis, which demonstrates that CE-CoLLM significantly reduces inference time by up to 13.81% and cloud computation costs by up to 84.55% compared to the popular cloud-based LLM deployment, while maintaining comparable model accuracy. The proposed approach effectively shifts the computational load to the edge, reduces the communication overhead, scales efficiently with multiple edge clients, and provides reliable LLM deployment using cloud-edge collaboration.
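
The early-exit mechanism named above can be sketched as a confidence test after each layer: if an intermediate exit head is confident enough, the edge device answers locally and nothing is sent to the cloud. This is a minimal toy with hypothetical softmax exit heads; CE-CoLLM's actual heads and threshold policy are not reproduced:

```python
import numpy as np

def early_exit_inference(hidden, exit_heads, threshold=0.9):
    """Return (exit_depth, predicted_label) using confidence-based early exit.

    Each exit head maps the hidden state to class logits; inference stops
    at the first layer whose softmax confidence clears the threshold.
    """
    for depth, head in enumerate(exit_heads):
        logits = hidden @ head
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        if probs.max() >= threshold:
            return depth, int(probs.argmax())       # confident: exit at the edge
    return len(exit_heads), int(probs.argmax())     # fell through: defer to cloud

# toy example: the first head is uninformative, the second is confident
hidden = np.array([1.0, -1.0])
heads = [
    np.zeros((2, 3)),                    # uniform probabilities, below threshold
    np.array([[4.0, 0.0, 0.0],
              [0.0, 4.0, 0.0]]),         # strongly prefers class 0 here
]
depth, label = early_exit_inference(hidden, heads)
print(depth, label)  # exits at depth 1 with label 0
```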

[LG-29] Layer-Adaptive State Pruning for Deep State Space Models

链接: https://arxiv.org/abs/2411.02824
作者: Minseon Gwak,Seongrok Moon,Joohwan Ko,PooGyeon Park
关键词-EN: sacrificed model capacity, alleviate computational costs, computational costs caused, dimension optimization methods, training search space
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Due to the lack of state dimension optimization methods, deep state space models (SSMs) have sacrificed model capacity, training search space, or stability to alleviate the computational costs caused by high state dimensions. In this work, we provide a structured pruning method for SSMs, Layer-Adaptive STate pruning (LAST), which reduces the state dimension of each layer by minimizing model-level energy loss, extending modal truncation for a single system. LAST scores are evaluated using the $\mathcal{H}_{\infty}$ norms of subsystems for each state, with layer-wise energy normalization. The scores serve as global pruning criteria, enabling cross-layer comparison of states and layer-adaptive pruning. Across various sequence benchmarks, LAST optimizes previous SSMs, revealing the redundancy and compressibility of their state spaces. Notably, we demonstrate that, on average, pruning 33% of states still maintains performance, with only 0.52% accuracy loss, in multi-input multi-output SSMs without retraining. Code is available at this https URL.
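The scoring-and-pruning step can be illustrated with a toy sketch: per-state scores (standing in for the subsystem $\mathcal{H}_{\infty}$ norms) are energy-normalized within each layer, and the globally lowest fraction of states is pruned, enabling the cross-layer comparison the abstract describes. Inputs here are hypothetical; the real method computes scores from the trained SSM.

```python
def last_prune(scores, prune_frac=0.33):
    """scores: per-layer lists of per-state scores (e.g. H-infinity norms).
    Returns, per layer, the indices of states kept after global pruning."""
    normalized = []
    for li, layer in enumerate(scores):
        total = sum(layer) or 1.0           # layer-wise energy normalization
        normalized += [(s / total, li, si) for si, s in enumerate(layer)]
    normalized.sort()                        # global, cross-layer ranking
    k = int(len(normalized) * prune_frac)
    pruned = {(li, si) for _, li, si in normalized[:k]}
    return [[si for si in range(len(layer)) if (li, si) not in pruned]
            for li, layer in enumerate(scores)]
```

Because the criterion is global, layers with redundant states lose more of them (layer-adaptive pruning) instead of every layer losing a fixed share.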

[LG-30] Sparse Orthogonal Parameters Tuning for Continual Learning

链接: https://arxiv.org/abs/2411.02813
作者: Kun-Peng Ning,Hai-Jian Ke,Yu-Yang Liu,Jia-Yu Yao,Yong-Hong Tian,Li Yuan
关键词-EN: recently gained attention, successive downstream tasks, learning methods based, Continual learning methods, recently gained
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Continual learning methods based on pre-trained models (PTM) have recently gained attention; these methods adapt to successive downstream tasks without catastrophic forgetting. They typically refrain from updating the pre-trained parameters and instead employ additional adapters, prompts, and classifiers. In this paper, we investigate, from a novel perspective, the benefit of sparse orthogonal parameters for continual learning. We find that merging the sparse orthogonality of models learned from multiple streaming tasks has great potential for addressing catastrophic forgetting. Leveraging this insight, we propose a novel yet effective method called SoTU (Sparse Orthogonal Parameters TUning). We hypothesize that the effectiveness of SoTU lies in the transformation of knowledge learned from multiple domains into the fusion of orthogonal delta parameters. Experimental evaluations on diverse CL benchmarks demonstrate the effectiveness of the proposed approach. Notably, SoTU achieves optimal feature representation for streaming data without necessitating complex classifier designs, making it a Plug-and-Play solution.
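The delta-merging idea can be sketched simply: compute each task's parameter delta against the pre-trained weights, keep only its largest-magnitude entries (sparse deltas with barely overlapping supports are approximately orthogonal), and sum them back onto the backbone. This is a toy, flat-vector sketch; magnitude-based sparsification is an assumption here, not necessarily the paper's exact scheme.

```python
def sparsify(delta, keep_frac=0.1):
    """Keep only the top-|keep_frac| entries of a delta by magnitude."""
    k = max(1, int(len(delta) * keep_frac))
    thresh = sorted(map(abs, delta), reverse=True)[k - 1]
    return [d if abs(d) >= thresh else 0.0 for d in delta]

def merge_deltas(pretrained, task_deltas, keep_frac=0.1):
    """Fuse sparse per-task deltas onto the frozen pre-trained weights."""
    merged = list(pretrained)
    for delta in task_deltas:
        for i, d in enumerate(sparsify(delta, keep_frac)):
            merged[i] += d
    return merged
```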

[LG-31] Query-Efficient Adversarial Attack Against Vertical Federated Graph Learning

链接: https://arxiv.org/abs/2411.02809
作者: Jinyin Chen,Wenbo Mu,Luxin Zhang,Guohan Huang,Haibin Zheng,Yao Cheng
关键词-EN: captured wide attention, wide attention due, Graph neural network, adversarial attack, centralized adversarial attacks
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph neural networks (GNNs) have captured wide attention due to their capability of graph representation learning for graph-structured data. However, distributed data silos limit the performance of GNNs. Vertical federated learning (VFL), an emerging technique for processing distributed data, makes it possible for GNNs to handle distributed graph-structured data. Despite the prosperous development of vertical federated graph learning (VFGL), the robustness of VFGL against adversarial attack has not yet been explored. Although numerous adversarial attacks against centralized GNNs have been proposed, their attack performance is challenged in the VFGL scenario. To the best of our knowledge, this is the first work to explore adversarial attacks against VFGL. A query-efficient hybrid adversarial attack framework, denoted as NA2 (short for Neuron-based Adversarial Attack), is proposed to significantly improve centralized adversarial attacks against VFGL. Specifically, a malicious client manipulates its local training data to improve its contribution in a stealthy fashion. Then a shadow model is established based on the manipulated data to simulate the behavior of the server model in VFGL. As a result, the shadow model can improve the attack success rate of various centralized attacks with a few queries. Extensive experiments on five real-world benchmarks demonstrate that NA2 improves the performance of centralized adversarial attacks against VFGL, achieving state-of-the-art performance even under a potential adaptive defense where the defender knows the attack method. Additionally, we provide interpretable experiments on the effectiveness of NA2 via sensitive neuron identification and t-SNE visualization.

[LG-32] Advancing Robust Underwater Acoustic Target Recognition through Multi-task Learning and Multi-Gate Mixture-of-Experts

链接: https://arxiv.org/abs/2411.02787
作者: Yuan Xie,Jiawei Ren,Junfeng Li,Ji Xu
关键词-EN: prominent research area, Underwater acoustic target, acoustic target recognition, acoustic recognition models, Underwater acoustic
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Underwater acoustic target recognition has emerged as a prominent research area within the field of underwater acoustics. However, the current availability of authentic underwater acoustic signal recordings remains limited, which hinders data-driven acoustic recognition models from learning robust patterns of targets from a limited set of intricate underwater signals, thereby compromising their stability in practical applications. To overcome these limitations, this study proposes a recognition framework called M3 (Multi-task, Multi-gate, Multi-expert) to enhance the model’s ability to capture robust patterns by making it aware of the inherent properties of targets. In this framework, an auxiliary task that focuses on target properties, such as estimating target size, is designed. The auxiliary task then shares parameters with the recognition task to realize multi-task learning. This paradigm allows the model to concentrate on shared information across tasks and identify robust patterns of targets in a regularized manner, thereby enhancing the model’s generalization ability. Moreover, M3 incorporates multi-expert and multi-gate mechanisms, allowing for the allocation of distinct parameter spaces to various underwater signals. This enables the model to process intricate signal patterns in a fine-grained and differentiated manner. To evaluate the effectiveness of M3, extensive experiments were conducted on the ShipsEar underwater ship-radiated noise dataset. The results substantiate that M3 outperforms the most advanced single-task recognition models, achieving state-of-the-art performance.
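The multi-gate mixture-of-experts backbone described here follows the standard MMoE pattern: a set of shared experts whose outputs each task mixes through its own softmax gate. A dependency-free sketch with scalar "features" (real experts and gates are learned networks; these callables are placeholders):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mmoe(x, experts, gates):
    """One gate per task mixes the shared experts' outputs."""
    expert_outs = [e(x) for e in experts]
    outputs = []
    for gate in gates:                       # one gate per task
        w = softmax(gate(x))
        outputs.append(sum(wi * o for wi, o in zip(w, expert_outs)))
    return outputs
```

Each task thus gets its own soft allocation over the shared parameter spaces, which is what lets intricate signal types be processed in a differentiated way.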

[LG-33] BrainBits: How Much of the Brain are Generative Reconstruction Methods Using? NEURIPS2024

链接: https://arxiv.org/abs/2411.02783
作者: David Mayo,Christopher Wang,Asa Harbin,Abdulrahman Alabdulkareem,Albert Eaton Shaw,Boris Katz,Andrei Barbu
关键词-EN: neural recordings, higher fidelity text, evaluating stimuli reconstruction, stimuli reconstruction results, tempting to assume
类目: Machine Learning (cs.LG)
*备注: 23 pages, 16 figures, Accepted at NeurIPS 2024

点击查看摘要

Abstract:When evaluating stimuli reconstruction results, it is tempting to assume that higher fidelity text and image generation is due to an improved understanding of the brain or more powerful signal extraction from neural recordings. However, in practice, new reconstruction methods could improve performance for at least three other reasons: learning more about the distribution of stimuli, becoming better at reconstructing text or images in general, or exploiting weaknesses in current image and/or text evaluation metrics. Here we disentangle how much of the reconstruction is due to these other factors vs. productively using the neural recordings. We introduce BrainBits, a method that uses a bottleneck to quantify the amount of signal extracted from neural recordings that is actually necessary to reproduce a method’s reconstruction fidelity. We find that it takes surprisingly little information from the brain to produce reconstructions with high fidelity. In these cases, it is clear that the priors of the methods’ generative models are so powerful that the outputs they produce extrapolate far beyond the neural signal they decode. Given that reconstructing stimuli can be improved independently by either improving signal extraction from the brain or by building more powerful generative models, improving the latter may fool us into thinking we are improving the former. We propose that methods should report a method-specific random baseline, a reconstruction ceiling, and a curve of performance as a function of bottleneck size, with the ultimate goal of using more of the neural recordings.
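The proposed performance-vs-bottleneck curve can be caricatured in a few lines: restrict the decoder to the first k components of the neural signal and trace the fidelity score as a function of k. This is a toy sketch with hypothetical `decode` and `score` callables; the paper's bottleneck is learned, not a simple truncation.

```python
def bottleneck_curve(signal, decode, score, sizes):
    """Score reconstructions while the decoder sees only the first k
    components of the neural signal; a flat curve means the generative
    prior, not the brain, is doing the work."""
    return [(k, score(decode([v if i < k else 0.0 for i, v in enumerate(signal)])))
            for k in sizes]
```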

[LG-34] How much is a noisy image worth? Data Scaling Laws for Ambient Diffusion

链接: https://arxiv.org/abs/2411.02780
作者: Giannis Daras,Yeshwanth Cherapanamjeri,Constantinos Daskalakis
关键词-EN: generative models depends, quality of generative, clean data, data, clean
类目: Machine Learning (cs.LG)
*备注: Work in progress

点击查看摘要

Abstract:The quality of generative models depends on the quality of the data they are trained on. Creating large-scale, high-quality datasets is often expensive and sometimes impossible, e.g., in certain scientific applications where there is no access to clean data due to physical or instrumentation constraints. Ambient Diffusion and related frameworks train diffusion models with solely corrupted data (which are usually cheaper to acquire) but ambient models significantly underperform models trained on clean data. We study this phenomenon at scale by training more than 80 models on data with different corruption levels across three datasets ranging from 30,000 to \approx 1.3M samples. We show that it is impossible, at these sample sizes, to match the performance of models trained on clean data when only training on noisy data. Yet, a combination of a small set of clean data (e.g., 10% of the total dataset) and a large set of highly noisy data suffices to reach the performance of models trained solely on similar-size datasets of clean data, and in particular to achieve near state-of-the-art performance. We provide theoretical evidence for our findings by developing novel sample complexity bounds for learning from Gaussian Mixtures with heterogeneous variances. Our theoretical model suggests that, for large enough datasets, the effective marginal utility of a noisy sample is exponentially worse than that of a clean sample. Providing a small set of clean samples can significantly reduce the sample size requirements for noisy data, as we also observe in our experiments.

[LG-35] Deep learning-based modularized loading protocol for parameter estimation of Bouc-Wen class models

链接: https://arxiv.org/abs/2411.02776
作者: Sebin Oh,Junho Song,Taeyong Kim
关键词-EN: modularized deep learning-based, deep learning-based loading, study proposes, proposes a modularized, modularized deep
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:This study proposes a modularized deep learning-based loading protocol for optimal parameter estimation of Bouc-Wen (BW) class models. The protocol consists of two key components: optimal loading history construction and CNN-based rapid parameter estimation. Each component is decomposed into independent sub-modules tailored to distinct hysteretic behaviors (basic hysteresis, structural degradation, and pinching effect), making the protocol adaptable to diverse hysteresis models. Three independent CNN architectures are developed to capture the path-dependent nature of these hysteretic behaviors. By training these CNN architectures on diverse loading histories, minimal loading sequences, termed \textit{loading history modules}, are identified and then combined to construct an optimal loading history. The three CNN models, trained on the respective loading history modules, serve as rapid parameter estimators. Numerical evaluation of the protocol, including nonlinear time history analysis of a 3-story steel moment frame and fragility curve construction for a 3-story reinforced concrete frame, demonstrates that the proposed protocol significantly reduces total analysis time while maintaining or improving estimation accuracy. The proposed protocol can be extended to other hysteresis models, suggesting a systematic approach for identifying general hysteresis models.

[LG-36] New random projections for isotropic kernels using stable spectral distributions

链接: https://arxiv.org/abs/2411.02770
作者: Nicolas Langrené,Xavier Warin,Pierre Gruet
关键词-EN: Rahimi and Recht, Random Fourier Features, Random Fourier, Fourier Features, kernels
类目: Machine Learning (cs.LG); Probability (math.PR); Computation (stat.CO); Machine Learning (stat.ML)
*备注: 16 pages, 16 figures

点击查看摘要

Abstract:Rahimi and Recht [31] introduced the idea of decomposing shift-invariant kernels by randomly sampling from their spectral distribution. This famous technique, known as Random Fourier Features (RFF), is in principle applicable to any shift-invariant kernel whose spectral distribution can be identified and simulated. In practice, however, it is usually applied to the Gaussian kernel because of its simplicity, since its spectral distribution is also Gaussian. Clearly, simple spectral sampling formulas would be desirable for broader classes of kernel functions. In this paper, we propose to decompose spectral kernel distributions as a scale mixture of \alpha -stable random vectors. This provides a simple and ready-to-use spectral sampling formula for a very large class of multivariate shift-invariant kernels, including exponential power kernels, generalized Matérn kernels, generalized Cauchy kernels, as well as newly introduced kernels such as the Beta, Kummer, and Tricomi kernels. In particular, we show that the spectral densities of all these kernels are scale mixtures of the multivariate Gaussian distribution. This provides a very simple way to modify existing Random Fourier Features software based on Gaussian kernels to cover a much richer class of multivariate kernels. This result has broad applications for support vector machines, kernel ridge regression, Gaussian processes, and other kernel-based machine learning techniques for which the random Fourier features technique is applicable.
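The practical payoff of the scale-mixture view is that ordinary Gaussian-kernel RFF code can be reused by drawing one random scale per frequency vector. A pure-Python sketch under that view, where `sample_scale` is a hypothetical sampler for the kernel's mixing distribution (a constant scale of 1 recovers the plain Gaussian-kernel RFF):

```python
import math
import random

def sample_omegas(dim, n_features, sample_scale, rng):
    """omega = sqrt(s) * g, with g ~ N(0, I) and s drawn once per
    frequency vector from the kernel's mixing distribution."""
    omegas = []
    for _ in range(n_features):
        s = math.sqrt(sample_scale(rng))
        omegas.append([s * rng.gauss(0.0, 1.0) for _ in range(dim)])
    return omegas

def rff(x, omegas, phases):
    """Standard random Fourier feature map z(x) with random phases."""
    d = len(omegas)
    return [math.sqrt(2.0 / d) * math.cos(sum(w * xi for w, xi in zip(om, x)) + b)
            for om, b in zip(omegas, phases)]

def approx_kernel(x, y, omegas, phases):
    """k(x, y) ~= z(x) . z(y)."""
    return sum(a * b for a, b in zip(rff(x, omegas, phases), rff(y, omegas, phases)))
```

With `sample_scale` returning 1.0, the estimate should converge to the Gaussian kernel exp(-||x-y||^2 / 2); other mixing distributions yield the Matérn, Cauchy, and related kernels listed in the abstract.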

[LG-37] A Convex Relaxation Approach to Generalization Analysis for Parallel Positively Homogeneous Networks AISTATS2025

链接: https://arxiv.org/abs/2411.02767
作者: Uday Kiran Reddy Tadipatri,Benjamin D. Haeffele,Joshua Agterberg,René Vidal
关键词-EN: positively homogeneous neural, positively homogeneous maps, parallel positively homogeneous, input-output map decomposes, homogeneous neural networks
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
*备注: Under review by AISTATS 2025

点击查看摘要

Abstract:We propose a general framework for deriving generalization bounds for parallel positively homogeneous neural networks–a class of neural networks whose input-output map decomposes as the sum of positively homogeneous maps. Examples of such networks include matrix factorization and sensing, single-layer multi-head attention mechanisms, tensor factorization, deep linear and ReLU networks, and more. Our general framework is based on linking the non-convex empirical risk minimization (ERM) problem to a closely related convex optimization problem over prediction functions, which provides a global, achievable lower-bound to the ERM problem. We exploit this convex lower-bound to perform generalization analysis in the convex space while controlling the discrepancy between the convex model and its non-convex counterpart. We apply our general framework to a wide variety of models ranging from low-rank matrix sensing, to structured matrix sensing, two-layer linear networks, two-layer ReLU networks, and single-layer multi-head attention mechanisms, achieving generalization bounds with a sample complexity that scales almost linearly with the network width.

[LG-38] Fast robust approximate message passing

链接: https://arxiv.org/abs/2411.02764
作者: Misha Ivkov,Tselil Schramm
关键词-EN: implementing approximate-message passing, separable AMP algorithm, give a fast, approximate-message passing, procedure for implementing
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 22 pages, 2 figures

点击查看摘要

Abstract:We give a fast, spectral procedure for implementing approximate message passing (AMP) algorithms robustly. For any quadratic optimization problem over symmetric matrices X with independent subgaussian entries, and any separable AMP algorithm \mathcal{A} , our algorithm performs a spectral pre-processing step and then mildly modifies the iterates of \mathcal{A} . If given the perturbed input X + E \in \mathbb{R}^{n \times n} for any E supported on an \varepsilon n \times \varepsilon n principal minor, our algorithm outputs a solution \hat{v} which is guaranteed to be close to the output of \mathcal{A} on the uncorrupted X , with \|\mathcal{A}(X) - \hat{v}\|_2 \le f(\varepsilon) \|\mathcal{A}(X)\|_2 , where f depends only on \varepsilon and f(\varepsilon) \to 0 as \varepsilon \to 0 .

[LG-39] DEMONet: Underwater Acoustic Target Recognition based on Multi-Expert Network and Cross-Temporal Variational Autoencoder

链接: https://arxiv.org/abs/2411.02758
作者: Yuan Xie,Xiaowei Zhang,Jiawei Ren,Ji Xu
关键词-EN: acoustic recognition system, dynamic motion states, underwater acoustic recognition, complex underwater environment, physical characteristics
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Building a robust underwater acoustic recognition system in real-world scenarios is challenging due to the complex underwater environment and the dynamic motion states of targets. A promising optimization approach is to leverage the intrinsic physical characteristics of targets, which remain invariable regardless of environmental conditions, to provide robust insights. However, our study reveals that while physical characteristics exhibit robust properties, they may lack class-specific discriminative patterns. Consequently, directly incorporating physical characteristics into model training can potentially introduce unintended inductive biases, leading to performance degradation. To utilize the benefits of physical characteristics while mitigating possible detrimental effects, we propose DEMONet in this study, which utilizes the detection of envelope modulation on noise (DEMON) to provide robust insights into the shaft frequency or blade counts of targets. DEMONet is a multi-expert network that allocates various underwater signals to their best-matched expert layer based on DEMON spectra for fine-grained signal processing. In this design, DEMON spectra are solely responsible for providing implicit physical characteristics without establishing a mapping relationship with the target category. Furthermore, to mitigate noise and spurious modulation spectra in DEMON features, we introduce a cross-temporal alignment strategy and employ a variational autoencoder (VAE) to reconstruct noise-resistant DEMON spectra to replace the raw DEMON features. The effectiveness of the proposed DEMONet with cross-temporal VAE was primarily evaluated on the DeepShip dataset and our proprietary datasets. Experimental results demonstrated that our approach could achieve state-of-the-art performance on both datasets.
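DEMON itself is classical signal processing: square-law envelope detection followed by spectral analysis of the low-frequency modulation band, whose peaks reveal shaft and blade rates. A self-contained toy sketch (naive DFT instead of an FFT, and none of the paper's band-pass filtering or VAE denoising):

```python
import math

def demon_spectrum(signal, fs, max_mod_hz=50):
    """Square-law envelope detection + naive DFT of the mean-removed
    envelope; peaks indicate propeller modulation lines."""
    env = [s * s for s in signal]            # square-law detector
    mean = sum(env) / len(env)
    env = [e - mean for e in env]            # remove DC before the DFT
    n = len(env)
    spectrum = []
    for f in range(1, max_mod_hz + 1):       # scan the modulation band
        re = sum(e * math.cos(2 * math.pi * f * i / fs) for i, e in enumerate(env))
        im = sum(e * math.sin(2 * math.pi * f * i / fs) for i, e in enumerate(env))
        spectrum.append((f, math.hypot(re, im) / n))
    return spectrum
```

On a 200 Hz carrier amplitude-modulated at 10 Hz, the strongest modulation line should sit at 10 Hz, with a weaker harmonic at 20 Hz from the squaring.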

[LG-40] An information-matching approach to optimal experimental design and active learning

链接: https://arxiv.org/abs/2411.02740
作者: Yonatan Kurniawan(1),Tracianne B. Neilsen(1),Benjamin L. Francis(2),Alex M. Stankovic(3),Mingjian Wen(4),Ilia Nikiforov(5),Ellad B. Tadmor(5),Vasily V. Bulatov(6),Vincenzo Lordi(6),Mark K. Transtrum(1, 2, and 3) ((1) Brigham Young University, Provo, UT, USA, (2) Achilles Heel Technologies, Orem, UT, USA, (3) SLAC National Accelerator Laboratory, Menlo Park, CA, USA, (4) University of Houston, Houston, TX, USA, (5) University of Minnesota, Minneapolis, MN, USA, (6) Lawrence Livermore National Laboratory)
关键词-EN: mathematical models heavily, expensive and challenging, efficacy of mathematical, models heavily depends, Fisher Information Matrix
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Applied Physics (physics.app-ph); Computational Physics (physics.comp-ph); Data Analysis, Statistics and Probability (physics.data-an)
*备注:

点击查看摘要

Abstract:The efficacy of mathematical models heavily depends on the quality of the training data, yet collecting sufficient data is often expensive and challenging. Many modeling applications require inferring parameters only as a means to predict other quantities of interest (QoI). Because models often contain many unidentifiable (sloppy) parameters, QoIs often depend on a relatively small number of parameter combinations. Therefore, we introduce an information-matching criterion based on the Fisher Information Matrix to select the most informative training data from a candidate pool. This method ensures that the selected data contain sufficient information to learn only those parameters that are needed to constrain downstream QoIs. It is formulated as a convex optimization problem, making it scalable to large models and datasets. We demonstrate the effectiveness of this approach across various modeling problems in diverse scientific fields, including power systems and underwater acoustics. Finally, we use information-matching as a query function within an Active Learning loop for materials science applications. In all these applications, we find that a relatively small set of optimal training data can provide the necessary information for achieving precise predictions. These results are encouraging for diverse future applications, particularly active learning in large machine learning models.
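The selection step can be illustrated with a greedy D-optimal stand-in for the paper's convex program: each candidate experiment contributes a sensitivity row (derivatives of the prediction with respect to the parameters), and we repeatedly add whichever row most increases the determinant of the regularized Fisher Information Matrix. A two-parameter toy sketch, not the authors' actual formulation:

```python
def fim(rows, p):
    """Unit-noise FIM = J^T J; each row of J holds d(prediction)/d(theta)."""
    return [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def greedy_d_optimal(rows, n_pick, ridge=1e-9):
    """Greedily add the candidate whose sensitivity row most increases
    det(FIM + ridge * I) for a two-parameter model."""
    chosen = []
    for _ in range(n_pick):
        def gain(r):
            m = fim(chosen + [r], 2)
            return det2([[m[0][0] + ridge, m[0][1]],
                         [m[1][0], m[1][1] + ridge]])
        best = max((r for r in rows if r not in chosen), key=gain)
        chosen.append(best)
    return chosen
```

The greedy picks naturally cover complementary parameter directions, mirroring the goal of constraining only the combinations the QoIs depend on.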

[LG-41] Compositional simulation-based inference for time series

链接: https://arxiv.org/abs/2411.02728
作者: Manuel Gloeckler,Shoji Toyota,Kenji Fukumizu,Jakob H. Macke
关键词-EN: Amortized simulation-based inference, perform Bayesian inference, Bayesian inference, Amortized simulation-based, perform Bayesian
类目: Machine Learning (cs.LG)
*备注: 26 pages, submitted for a publication

点击查看摘要

Abstract:Amortized simulation-based inference (SBI) methods train neural networks on simulated data to perform Bayesian inference. While this approach avoids the need for tractable likelihoods, it often requires a large number of simulations and has been challenging to scale to time-series data. Scientific simulators frequently emulate real-world dynamics through thousands of single-state transitions over time. We propose an SBI framework that can exploit such Markovian simulators by locally identifying parameters consistent with individual state transitions. We then compose these local results to obtain a posterior over parameters that align with the entire time series observation. We focus on applying this approach to neural posterior score estimation but also show how it can be applied, e.g., to neural likelihood (ratio) estimation. We demonstrate that our approach is more simulation-efficient than directly estimating the global posterior on several synthetic benchmark tasks and simulators used in ecology and epidemiology. Finally, we validate scalability and simulation efficiency of our approach by applying it to a high-dimensional Kolmogorov flow simulator with around one million dimensions in the data domain.

[LG-42] Carbon price fluctuation prediction using blockchain information A new hybrid machine learning approach

链接: https://arxiv.org/abs/2411.02709
作者: H. Wang,Y. Pang,D. Shang
关键词-EN: hybrid machine learning, DILATED Convolutional Neural, Convolutional Neural Networks, DILATED CNN-LSTM framework, Long Short-Term Memory
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 26 pages, 2 figures

点击查看摘要

Abstract:In this study, a novel hybrid machine learning approach is proposed for carbon price fluctuation prediction. Specifically, a research framework integrating DILATED Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) neural network algorithms is proposed. The advantage of the combined framework is that it makes feature extraction more efficient. Then, based on the DILATED CNN-LSTM framework, L1 and L2 parameter norm penalties are adopted as regularization methods for prediction. Motivated by the high correlation between energy indicator prices and blockchain information reported in previous literature, we primarily include indicators related to blockchain information through the regularization process. Based on the above methods, this paper uses a large dataset to carry out carbon price prediction. The experimental results show that the DILATED CNN-LSTM framework is superior to the traditional CNN-LSTM architecture, and that blockchain information can effectively predict the price. Among the parameter norm penalties used as regularization, Ridge Regression (RR, as L2 regularization) performs better than the Smoothly Clipped Absolute Deviation penalty (SCAD, as L1 regularization) in price forecasting. Thus, the proposed RR-DILATED CNN-LSTM approach can effectively and accurately predict the fluctuation trend of the carbon price. The new forecasting methods and theoretical framework proposed in this study therefore provide a new basis for trend prediction and for evaluating digital-asset policy, represented by the carbon price, for both academia and practitioners.
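The "DILATED" component refers to dilated (atrous) causal convolution, which widens the receptive field over the price series without adding parameters. A dependency-free sketch of a single dilated causal filter (the actual model stacks such layers and feeds them into an LSTM; kernel and dilation here are illustrative):

```python
def dilated_causal_conv(x, kernel, dilation):
    """y[t] = sum_k kernel[k] * x[t - k * dilation], zero-padded on the left,
    so each output only depends on past and current inputs (causal)."""
    out = []
    for t in range(len(x)):
        s = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                s += w * x[idx]
        out.append(s)
    return out
```

With dilation 2, each output mixes the current sample with the one two steps back; stacking layers with dilations 1, 2, 4, ... grows the look-back window exponentially.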

[LG-43] Visually Analyze SHAP Plots to Diagnose Misclassifications in ML-based Intrusion Detection ICDM2024

链接: https://arxiv.org/abs/2411.02670
作者: Maraz Mia,Mir Mehedi A. Pritom,Tariqul Islam,Kamrul Hasan
关键词-EN: commonly adopted detective, intrusion detection system, Intrusion detection, adopted detective security, detective security measures
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 10 pages, 14 figures, accepted in the MLC Workshop of the International Conference on Data Mining Conference (ICDM 2024)

点击查看摘要

Abstract:Intrusion detection has been a commonly adopted detective security measure to safeguard systems and networks from various threats. A robust intrusion detection system (IDS) can essentially mitigate threats by providing alerts. In network-based IDS, we typically deal with cyber threats like distributed denial of service (DDoS), spoofing, reconnaissance, brute-force attacks, botnets, and so on. To detect these threats, various machine learning (ML) and deep learning (DL) models have been proposed. However, one of the key challenges with these predictive approaches is the presence of false positive (FP) and false negative (FN) instances. These FPs and FNs within any black-box intrusion detection system (IDS) make the decision-making task of an analyst further complicated. In this paper, we propose an explainable artificial intelligence (XAI) based visual analysis approach using overlapping SHAP plots that presents the feature explanation to identify potential false positives and false negatives in IDS. Our approach can further provide guidance to security analysts for effective decision-making. We present case studies with multiple publicly available network traffic datasets to showcase the efficacy of our approach for identifying false positive and false negative instances. Our use-case scenarios provide clear guidance for analysts on how to use the visual analysis approach for reliable courses of action against such threats.

[LG-44] Pricing and Competition for Generative AI NEURIPS2024

链接: https://arxiv.org/abs/2411.02661
作者: Rafid Mahmood
关键词-EN: classical machine learning, natural language prompts, Compared to classical, binary user satisfaction, machine learning
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: NeurIPS 2024; 10 pages

点击查看摘要

Abstract:Compared to classical machine learning (ML) models, generative models offer a new usage paradigm where (i) a single model can be used for many different tasks out-of-the-box; (ii) users interact with this model over a series of natural language prompts; and (iii) the model is ideally evaluated on binary user satisfaction with respect to model outputs. Given these characteristics, we explore the problem of how developers of new generative AI software can release and price their technology. We first develop a comparison of two different models for a specific task with respect to user cost-effectiveness. We then model the pricing problem of generative AI software as a game between two different companies who sequentially release their models before users choose their preferred model for each task. Here, the price optimization problem becomes piecewise continuous where the companies must choose a subset of the tasks on which to be cost-effective and forgo revenue for the remaining tasks. In particular, we reveal the value of market information by showing that a company who deploys later after knowing their competitor’s price can always secure cost-effectiveness on at least one task, whereas the company who is the first-to-market must price their model in a way that incentivizes higher prices from the latecomer in order to gain revenue. Most importantly, we find that if the different tasks are sufficiently similar, the first-to-market model may become cost-ineffective on all tasks regardless of how this technology is priced.

[LG-45] Fair and Welfare-Efficient Constrained Multi-matchings under Uncertainty NEURIPS2024

链接: https://arxiv.org/abs/2411.02654
作者: Elita Lobo,Justin Payan,Cyrus Cousins,Yair Zick
关键词-EN: maintaining group fairness, study fair allocation, group fairness, study fair, maintaining group
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: 37 pages, 3 figures, to appear in NeurIPS 2024

点击查看摘要

Abstract:We study fair allocation of constrained resources, where a market designer optimizes overall welfare while maintaining group fairness. In many large-scale settings, utilities are not known in advance, but are instead observed after realizing the allocation. We therefore estimate agent utilities using machine learning. Optimizing over estimates requires trading-off between mean utilities and their predictive variances. We discuss these trade-offs under two paradigms for preference modeling – in the stochastic optimization regime, the market designer has access to a probability distribution over utilities, and in the robust optimization regime they have access to an uncertainty set containing the true utilities with high probability. We discuss utilitarian and egalitarian welfare objectives, and we explore how to optimize for them under stochastic and robust paradigms. We demonstrate the efficacy of our approaches on three publicly available conference reviewer assignment datasets. The approaches presented enable scalable constrained resource allocation under uncertainty for many combinations of objectives and preference models.

[LG-46] Fine Grained Insider Risk Detection

链接: https://arxiv.org/abs/2411.02645
作者: Birkett Huber,Casper Neo,Keiran Sampson,Alex Kantchelian,Brett Ksobiech,Yanis Pavlidis
关键词-EN: detect departures, departures from business-justified, business-justified workflows, support agents, actions
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a method to detect departures from business-justified workflows among support agents. Our goal is to assist auditors in identifying agent actions that cannot be explained by the activity within their surrounding context, where normal activity patterns are established from historical data. We apply our method to help audit millions of actions of over three thousand support agents. We collect logs from the tools used by support agents and construct a bipartite graph of Actions and Entities representing all the actions of the agents, as well as background information about entities. From this graph, we sample subgraphs rooted on security-significant actions taken by the agents. Each subgraph captures the relevant context of the root action in terms of other actions, entities and their relationships. We then prioritize the rooted subgraphs for auditor review using feed-forward and graph neural networks, as well as nearest neighbors techniques. To alleviate the issue of scarce labeling data, we use contrastive learning and domain-specific data augmentations. Expert auditors label the top-ranked subgraphs as "worth auditing" or "not worth auditing" based on the company’s business policies. This system finds subgraphs that are worth auditing with high enough precision to be used in production.
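The context-capture step is, at its core, a bounded-hop traversal from each security-significant action in the Action-Entity graph. A minimal BFS sketch with a hypothetical adjacency dict (the real system also attaches features to nodes and feeds the subgraphs to neural rankers):

```python
from collections import deque

def rooted_subgraph(adj, root, max_hops=2):
    """Collect all actions/entities reachable within max_hops of a
    security-significant root action in the bipartite graph."""
    hops = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue                        # stop expanding at the hop limit
        for neighbor in adj.get(node, []):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return set(hops)
```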

[LG-47] Dynamic Weight Adjusting Deep Q-Networks for Real-Time Environmental Adaptation

链接: https://arxiv.org/abs/2411.02559
作者: Xinhao Zhang,Jinghan Zhang,Wujun Si,Kunpeng Liu
关键词-EN: generating efficient solutions, Deep Reinforcement Learning, Deep Reinforcement, shown excellent performance, complex tasks
类目: Machine Learning (cs.LG)
*备注: Accepted by ICKG2024

点击查看摘要

Abstract:Deep Reinforcement Learning has shown excellent performance in generating efficient solutions for complex tasks. However, its efficacy is often limited by static training modes and heavy reliance on vast data from stable environments. To address these shortcomings, this study explores integrating dynamic weight adjustments into Deep Q-Networks (DQN) to enhance their adaptability. We implement these adjustments by modifying the sampling probabilities in the experience replay to make the model focus more on pivotal transitions as indicated by real-time environmental feedback and performance metrics. We design a novel Interactive Dynamic Evaluation Method (IDEM) for DQN that successfully navigates dynamic environments by prioritizing significant transitions based on environmental feedback and learning progress. Additionally, when faced with rapid changes in environmental conditions, IDEM-DQN shows improved performance compared to baseline methods. Our results indicate that under circumstances requiring rapid adaptation, IDEM-DQN can more effectively generalize and stabilize learning. Extensive experiments across various settings confirm that IDEM-DQN outperforms standard DQN models, particularly in environments characterized by frequent and unpredictable changes.
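The dynamic sampling-probability adjustment described in the abstract can be sketched as a tiny replay buffer whose draw weights track a per-transition priority signal. All names and parameters below are hypothetical illustrations; the actual IDEM method combines richer environmental feedback and performance metrics than this single priority scalar.

```python
import random

class DynamicReplayBuffer:
    """Toy replay buffer whose sampling probabilities are re-weighted by a
    per-transition priority (e.g. a TD-error magnitude or feedback signal)."""

    def __init__(self, capacity=1000, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha        # how strongly priorities skew the sampling
        self.transitions = []     # (state, action, reward, next_state) tuples
        self.priorities = []

    def add(self, transition, priority=1.0):
        if len(self.transitions) >= self.capacity:
            self.transitions.pop(0)
            self.priorities.pop(0)
        self.transitions.append(transition)
        self.priorities.append(max(priority, 1e-6))

    def update_priority(self, idx, new_priority):
        # called with fresh environmental feedback during training
        self.priorities[idx] = max(new_priority, 1e-6)

    def sample(self, k, rng=random):
        # pivotal transitions (high priority) are drawn more often
        weights = [p ** self.alpha for p in self.priorities]
        return rng.choices(range(len(self.transitions)), weights=weights, k=k)
```

A transition whose priority is raised after environmental feedback then dominates subsequent mini-batches, which is the mechanism the paper builds IDEM on.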

[LG-48] Pretrained transformer efficiently learns low-dimensional target functions in-context NEURIPS2024

链接: https://arxiv.org/abs/2411.02544
作者: Kazusato Oko,Yujin Song,Taiji Suzuki,Denny Wu
关键词-EN: efficiently learn in-context, ICL, pretrained transformer, linear function classes, efficiently learn
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Transformers can efficiently learn in-context from example demonstrations. Most existing theoretical analyses studied the in-context learning (ICL) ability of transformers for linear function classes, where it is typically shown that the minimizer of the pretraining loss implements one gradient descent step on the least squares objective. However, this simplified linear setting arguably does not demonstrate the statistical efficiency of ICL, since the pretrained transformer does not outperform directly solving linear regression on the test prompt. In this paper, we study ICL of a nonlinear function class via transformer with nonlinear MLP layer: given a class of \textit{single-index} target functions $f_*(\boldsymbol{x}) = \sigma_*(\langle\boldsymbol{x},\boldsymbol{\beta}\rangle)$, where the index features $\boldsymbol{\beta}\in\mathbb{R}^d$ are drawn from a $r$-dimensional subspace, we show that a nonlinear transformer optimized by gradient descent (with a pretraining sample complexity that depends on the \textit{information exponent} of the link functions $\sigma_*$) learns $f_*$ in-context with a prompt length that only depends on the dimension of the distribution of target functions $r$; in contrast, any algorithm that directly learns $f_*$ on test prompt yields a statistical complexity that scales with the ambient dimension $d$. Our result highlights the adaptivity of the pretrained transformer to low-dimensional structures of the function class, which enables sample-efficient ICL that outperforms estimators that only have access to the in-context data.

[LG-49] Enhancing Graph Neural Networks in Large-scale Traffic Incident Analysis with Concurrency Hypothesis

链接: https://arxiv.org/abs/2411.02542
作者: Xiwen Chen,Sayed Pedram Haeri Boroujeni,Xin Shu,Huayu Li,Abolfazl Razi
关键词-EN: improved safety interventions, persistently high rate, traffic-related deaths highlights, reducing road fatalities, Average Neighbor Crash
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: Accepted by Sigspatial 2024

点击查看摘要

Abstract:Despite recent progress in reducing road fatalities, the persistently high rate of traffic-related deaths highlights the necessity for improved safety interventions. Leveraging large-scale graph-based nationwide road network data across 49 states in the USA, our study first posits the Concurrency Hypothesis from intuitive observations, suggesting a significant likelihood of incidents occurring at neighboring nodes within the road network. To quantify this phenomenon, we introduce two novel metrics, Average Neighbor Crash Density (ANCD) and Average Neighbor Crash Continuity (ANCC), and subsequently employ them in statistical tests to validate the hypothesis rigorously. Building upon this foundation, we propose the Concurrency Prior (CP) method, a powerful approach designed to enhance the predictive capabilities of general Graph Neural Network (GNN) models in semi-supervised traffic incident prediction tasks. Our method allows GNNs to incorporate concurrent incident information, as mentioned in the hypothesis, via tokenization with negligible extra parameters. The extensive experiments, utilizing real-world data across states and cities in the USA, demonstrate that integrating CP into 12 state-of-the-art GNN architectures leads to significant improvements, with gains ranging from 3% to 13% in F1 score and 1.3% to 9% in AUC metrics. The code is publicly available at this https URL.
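To make the neighborhood intuition behind the ANCD metric concrete, the helper below averages crash counts over each incident node's neighbors on a toy adjacency-list road graph. This is a simplified reading of the metric for illustration, not the paper's exact definition.

```python
def average_neighbor_crash_density(adj, crashes):
    """Toy ANCD-style statistic: for every node that has at least one crash,
    average the crash counts of its neighbors, then average those values
    over all such nodes.

    adj     -- dict mapping node -> list of neighboring nodes
    crashes -- dict mapping node -> crash count
    """
    densities = []
    for node, count in crashes.items():
        if count == 0 or not adj.get(node):
            continue  # only nodes with incidents contribute
        neighbors = adj[node]
        densities.append(sum(crashes.get(n, 0) for n in neighbors) / len(neighbors))
    return sum(densities) / len(densities) if densities else 0.0
```

A high value means incidents tend to have incident-heavy neighborhoods, which is exactly the concurrency the hypothesis posits.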

[LG-50] Towards Harmless Rawlsian Fairness Regardless of Demographic Prior

链接: https://arxiv.org/abs/2411.02467
作者: Xuanqian Wang,Jing Li,Ivor W. Tsang,Yew-Soon Ong
关键词-EN: Due to privacy, group fairness advocate, security concerns, recent advancements, privacy and security
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Due to privacy and security concerns, recent advancements in group fairness advocate for model training regardless of demographic information. However, most methods still require prior knowledge of demographics. In this study, we explore the potential for achieving fairness without compromising its utility when no prior demographics are provided to the training set, namely \textit{harmless Rawlsian fairness}. We ascertain that such a fairness requirement with no prior demographic information essentially promotes training losses to exhibit a Dirac delta distribution. To this end, we propose a simple but effective method named VFair to minimize the variance of training losses inside the optimal set of empirical losses. This problem is then optimized by a tailored dynamic update approach that operates in both loss and gradient dimensions, directing the model towards relatively fairer solutions while preserving its intact utility. Our experimental findings indicate that regression tasks, which are relatively unexplored in the literature, can achieve significant fairness improvement through VFair regardless of any prior, whereas classification tasks usually do not because of their quantized utility measurements. The implementation of our method is publicly available at this https URL.
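A minimal sketch of the variance-penalized objective at the heart of VFair: uniform per-sample losses are preferred over unequal ones at the same mean. Here `lam` is a hypothetical fixed trade-off weight; the paper's tailored dynamic update operates in both loss and gradient dimensions rather than through a fixed penalty.

```python
def vfair_objective(losses, lam=1.0):
    """Toy VFair-style scalarization: mean training loss plus a penalty on
    the variance of the per-sample losses (lam is a hypothetical weight)."""
    m = sum(losses) / len(losses)
    var = sum((l - m) ** 2 for l in losses) / len(losses)
    return m + lam * var
```

Two loss profiles with identical mean then score differently: the Dirac-delta-like (uniform) profile wins, which is the fairness behavior the abstract describes.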

[LG-51] MADOD: Generalizing OOD Detection to Unseen Domains via G-Invariance Meta-Learning

链接: https://arxiv.org/abs/2411.02444
作者: Haoliang Wang,Chen Zhao,Feng Chen
关键词-EN: challenging traditional domain, traditional domain generalization, face simultaneous covariate, machine learning applications, OOD detection
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: IEEE International Conference on Big Data 2024

点击查看摘要

Abstract:Real-world machine learning applications often face simultaneous covariate and semantic shifts, challenging traditional domain generalization and out-of-distribution (OOD) detection methods. We introduce Meta-learned Across Domain Out-of-distribution Detection (MADOD), a novel framework designed to address both shifts concurrently. MADOD leverages meta-learning and G-invariance to enhance model generalizability and OOD detection in unseen domains. Our key innovation lies in task construction: we randomly designate in-distribution classes as pseudo-OODs within each meta-learning task, simulating OOD scenarios using existing data. This approach, combined with energy-based regularization, enables the learning of robust, domain-invariant features while calibrating decision boundaries for effective OOD detection. Operating in a test domain-agnostic setting, MADOD eliminates the need for adaptation during inference, making it suitable for scenarios where test data is unavailable. Extensive experiments on real-world and synthetic datasets demonstrate MADOD’s superior performance in semantic OOD detection across unseen domains, achieving an AUPR improvement of 8.48% to 20.81%, while maintaining competitive in-distribution classification accuracy, representing a significant advancement in handling both covariate and semantic shifts.
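The pseudo-OOD task construction can be illustrated in a few lines: within one meta-learning episode, randomly chosen in-distribution classes are relabeled as OOD. Function and argument names below are hypothetical, and MADOD's energy-based regularization and G-invariance learning are not shown.

```python
import random

def make_meta_task(dataset, n_pseudo_ood=1, rng=random):
    """Task-construction sketch: randomly designate some in-distribution
    classes as pseudo-OOD for one episode (binary label: 0 = ID, 1 = OOD).

    dataset -- list of (sample, class_label) pairs
    """
    classes = sorted({y for _, y in dataset})
    pseudo_ood = set(rng.sample(classes, n_pseudo_ood))
    episode = [(x, 1 if y in pseudo_ood else 0) for x, y in dataset]
    return episode, pseudo_ood
```

Because the pseudo-OOD classes change from episode to episode, the detector is trained to separate ID from OOD without ever seeing true OOD data, as the abstract describes.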

[LG-52] Data Matters: The Case of Predicting Mobile Cellular Traffic

链接: https://arxiv.org/abs/2411.02418
作者: Natalia Vesselinova,Matti Harjula,Pauliina Ilmonen
关键词-EN: mobile cellular operators, sustain smart cities, base stations’ traffic, Accurate predictions, base stations’
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: 6 pages, 2 figures, 4 tables

点击查看摘要

Abstract:Accurate predictions of base stations’ traffic load are essential to mobile cellular operators and their users as they support the efficient use of network resources and sustain smart cities and roads. Traditionally, cellular network time-series have been considered for this prediction task. More recently, exogenous factors such as points of presence and other environmental knowledge have been introduced to facilitate cellular traffic forecasting. In this study, we focus on smart roads and explore road traffic measures to model the processes underlying cellular traffic generation with the goal to improve prediction performance. Comprehensive experiments demonstrate that by employing road flow and speed, in addition to cellular network metrics, cellular load prediction errors can be reduced by as much as 56.5 %. The code and more detailed results are available on this https URL.

[LG-53] Fairness Evaluation with Item Response Theory

链接: https://arxiv.org/abs/2411.02414
作者: Ziqi Xu,Sevvandi Kandanaarachchi,Cheng Soon Ong,Eirini Ntoutsi
关键词-EN: Item Response Theory, Response Theory, assess student ability, Item Response, educational psychometrics
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Item Response Theory (IRT) has been widely used in educational psychometrics to assess student ability, as well as the difficulty and discrimination of test questions. In this context, discrimination specifically refers to how effectively a question distinguishes between students of different ability levels, and it does not carry any connotation related to fairness. In recent years, IRT has been successfully used to evaluate the predictive performance of Machine Learning (ML) models, but this paper marks its first application in fairness evaluation. In this paper, we propose a novel Fair-IRT framework to evaluate a set of predictive models on a set of individuals, while simultaneously eliciting specific parameters, namely, the ability to make fair predictions (a feature of predictive models), as well as the discrimination and difficulty of individuals that affect the prediction results. Furthermore, we conduct a series of experiments to comprehensively understand the implications of these parameters for fairness evaluation. Detailed explanations for item characteristic curves (ICCs) are provided for particular individuals. We propose the flatness of ICCs to disentangle the unfairness between individuals and predictive models. The experiments demonstrate the effectiveness of this framework as a fairness evaluation tool. Two real-world case studies illustrate its potential application in evaluating fairness in both classification and regression tasks. Our paper aligns well with the Responsible Web track by proposing a Fair-IRT framework to evaluate fairness in ML models, which directly contributes to the development of a more inclusive, equitable, and trustworthy AI.
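For readers unfamiliar with IRT, a two-parameter logistic item characteristic curve (ICC) makes the ability/discrimination/difficulty vocabulary concrete; the curve's flatness, which the paper uses to disentangle unfairness, is governed by the discrimination parameter. This is a minimal textbook sketch, not the Fair-IRT parameterization itself.

```python
import math

def icc_2pl(theta, a, b):
    """Two-parameter logistic ICC: probability of a positive response for
    ability theta, discrimination a, and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))
```

At `theta == b` the probability is exactly 0.5; a small `a` yields a flat curve whose response barely varies with ability, which is the flatness notion the framework exploits.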

[LG-54] Slicing for AI: An Online Learning Framework for Network Slicing Supporting AI Services

链接: https://arxiv.org/abs/2411.02412
作者: Menna Helmy,Alaa Awad Abdellatif,Naram Mhaisen,Amr Mohamed,Aiman Erbad
关键词-EN: network slicing strategies, innovative network slicing, requires innovative network, customized network slices, meet Quality
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The forthcoming 6G networks will embrace a new realm of AI-driven services that require innovative network slicing strategies, namely slicing for AI, which involves the creation of customized network slices to meet the Quality of Service (QoS) requirements of diverse AI services. This poses challenges due to the time-varying dynamics of users' behavior and mobile networks. Thus, this paper proposes an online learning framework to optimize the allocation of computational and communication resources to AI services, while considering their unique key performance indicators (KPIs), such as accuracy, latency, and cost. We define a problem of optimizing the total accuracy while balancing conflicting KPIs, prove its NP-hardness, and propose an online learning framework for solving it in dynamic environments. We present a basic online solution and two variations employing a pre-learning elimination method for reducing the decision space to expedite the learning. Furthermore, we propose a biased decision space subset selection by incorporating prior knowledge to enhance the learning speed without compromising performance and present two alternatives for handling the selected subset. Our results depict the efficiency of the proposed solutions in converging to the optimal decisions, while reducing the decision space and improving time complexity.

[LG-55] Optimal Transport Maps are Good Voice Converters

链接: https://arxiv.org/abs/2411.02402
作者: Arip Asadulaev,Rostislav Korst,Vitalii Shutov,Alexander Korotin,Yaroslav Grebnyak,Vahe Egiazarian,Evgeny Burnaev
关键词-EN: style transfer problems, neural network-based methods, neural network-based, transfer problems, effectively applied
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Recently, neural network-based methods for computing optimal transport maps have been effectively applied to style transfer problems. However, the application of these methods to voice conversion is underexplored. In our paper, we fill this gap by investigating optimal transport as a framework for voice conversion. We present a variety of optimal transport algorithms designed for different data representations, such as mel-spectrograms and latent representations of self-supervised speech models. For the mel-spectrogram data representation, we achieve strong results in terms of Frechet Audio Distance (FAD). This performance is consistent with our theoretical analysis, which suggests that our method provides an upper bound on the FAD between the target and generated distributions. Within the latent space of the WavLM encoder, we achieved state-of-the-art results and outperformed existing methods even with limited reference speaker data.

[LG-56] A Personal data Value at Risk Approach

链接: https://arxiv.org/abs/2411.03217
作者: Luis Enriquez
关键词-EN: data protection, risk management, data protection vulnerability, main data protection, protection risk management
类目: Risk Management (q-fin.RM); Machine Learning (cs.LG)
*备注: 22 pages

点击查看摘要

Abstract:What if the main data protection vulnerability is risk management? Data protection merges three disciplines: data protection law, information security, and risk management. Nonetheless, very little research has been done in the field of data protection risk management, where subjectivity and superficiality are the dominant state of the art. Since the GDPR tells you what to do, but not how to do it, approaching GDPR compliance is still a gray zone, where the trend is to use rules of thumb. Considering that the most important goal of risk management is to reduce uncertainty in order to make informed decisions, risk management for the protection of the rights and freedoms of data subjects cannot be disconnected from the impact materialization that data controllers and processors need to assess. This paper proposes a quantitative approach to data protection risk-based compliance from a data controller's perspective, with the aim of proposing a mindset change in which data protection impact assessments can be improved by using data protection analytics, quantitative risk analysis, and calibrated expert opinions.

[LG-57] Online Data Collection for Efficient Semiparametric Inference

链接: https://arxiv.org/abs/2411.03195
作者: Shantanu Gupta,Zachary C. Lipton,David Childers
关键词-EN: statistical data fusion, studied statistical data, online data collection, works have studied, studied statistical
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While many works have studied statistical data fusion, they typically assume that the various datasets are given in advance. However, in practice, estimation requires difficult data collection decisions like determining the available data sources, their costs, and how many samples to collect from each source. Moreover, this process is often sequential because the data collected at a given time can improve collection decisions in the future. In our setup, given access to multiple data sources and budget constraints, the agent must sequentially decide which data source to query to efficiently estimate a target parameter. We formalize this task using Online Moment Selection, a semiparametric framework that applies to any parameter identified by a set of moment conditions. Interestingly, the optimal budget allocation depends on the (unknown) true parameters. We present two online data collection policies, Explore-then-Commit and Explore-then-Greedy, that use the parameter estimates at a given time to optimally allocate the remaining budget in the future steps. We prove that both policies achieve zero regret (assessed by asymptotic MSE) relative to an oracle policy. We empirically validate our methods on both synthetic and real-world causal effect estimation tasks, demonstrating that the online data collection policies outperform their fixed counterparts.
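The Explore-then-Commit idea can be sketched for the simple case of estimating per-source means under a total sample budget: a pilot phase probes every source equally, then the remaining budget is committed proportionally to the estimated noise levels (a Neyman-style allocation). The paper's framework handles general moment conditions and costs; all names below are illustrative.

```python
import random
import statistics

def explore_then_commit(sources, budget, explore_per_source=30, rng=random):
    """Explore-then-Commit sketch for online data collection.

    sources -- dict mapping source name -> zero-argument sampling function
    budget  -- total number of samples the agent may draw
    """
    # Exploration phase: a fixed pilot budget per source.
    samples = {name: [draw() for _ in range(explore_per_source)]
               for name, draw in sources.items()}
    remaining = budget - explore_per_source * len(sources)
    # Commit phase: allocate the rest proportionally to estimated noise.
    stds = {name: statistics.pstdev(vals) for name, vals in samples.items()}
    total = sum(stds.values()) or 1.0
    alloc = {name: int(remaining * s / total) for name, s in stds.items()}
    for name, extra in alloc.items():
        samples[name].extend(sources[name]() for _ in range(extra))
    return {name: statistics.mean(vals) for name, vals in samples.items()}, alloc
```

Noisier sources receive more of the committed budget, which is the behavior that lets such policies match an oracle allocation asymptotically.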

[LG-58] Insights into Lunar Mineralogy: An Unsupervised Approach for Clustering of the Moon Mineral Mapper (M3) spectral data

链接: https://arxiv.org/abs/2411.03186
作者: Freja Thoresen,Igor Drozdovskiy,Aidan Cowley,Magdelena Laban,Sebastien Besse,Sylvain Blunier
关键词-EN: Moon Mineral Mapper, machine learning-based clustering, mapping spectral features, Mineral Mapper, Moon surface
类目: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a novel method for mapping spectral features of the Moon using machine learning-based clustering of hyperspectral data from the Moon Mineral Mapper (M3) imaging spectrometer. The method uses a convolutional variational autoencoder to reduce the dimensionality of the spectral data and extract features of the spectra. Then, a k-means algorithm is applied to cluster the latent variables into five distinct groups, corresponding to dominant spectral features, which are related to the mineral composition of the Moon’s surface. The resulting global spectral cluster map shows the distribution of the five clusters on the Moon, which consist of a mixture of, among others, plagioclase, pyroxene, olivine, and Fe-bearing minerals across the Moon’s surface. The clusters are compared to the mineral maps from the Kaguya mission, which showed that the locations of the clusters overlap with the locations of high wt% of minerals such as plagioclase, clinopyroxene, and olivine. The paper demonstrates the usefulness of unbiased unsupervised learning for lunar mineral exploration and provides a comprehensive analysis of lunar mineralogy.

[LG-59] Blind Estimation of Sub-band Acoustic Parameters from Ambisonics Recordings using Spectro-Spatial Covariance Features ICASSP2025

链接: https://arxiv.org/abs/2411.03172
作者: Hanyu Meng,Jeroen Breebaart,Jeremy Stoddard,Vidhyasaharan Sethu,Eliathamby Ambikairajah
关键词-EN: Estimating frequency-varying acoustic, enhancing immersive perception, Estimating frequency-varying, spatial audio creation, realistic spatial audio
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)
*备注: Submitted to ICASSP2025

点击查看摘要

Abstract:Estimating frequency-varying acoustic parameters is essential for enhancing immersive perception in realistic spatial audio creation. In this paper, we propose a unified framework that blindly estimates reverberation time (T60), direct-to-reverberant ratio (DRR), and clarity (C50) across 10 frequency bands using first-order Ambisonics (FOA) speech recordings as inputs. The proposed framework utilizes a novel feature named Spectro-Spatial Covariance Vector (SSCV), efficiently representing temporal, spectral as well as spatial information of the FOA signal. Our models significantly outperform existing single-channel methods with only spectral information, reducing estimation errors by more than half for all three acoustic parameters. Additionally, we introduce FOA-Conv3D, a novel back-end network for effectively utilising the SSCV feature with a 3D convolutional encoder. FOA-Conv3D outperforms the convolutional neural network (CNN) and recurrent convolutional neural network (CRNN) backends, achieving lower estimation errors and accounting for a higher proportion of variance (PoV) for all 3 acoustic parameters.

[LG-60] Efficient Hamiltonian structure and trace distance learning of Gaussian states

链接: https://arxiv.org/abs/2411.03163
作者: Marco Fanizza,Cambyse Rouzé,Daniel Stilck França
关键词-EN: Gaussian graphical models, temperature bosonic Gaussian, positive temperature bosonic, bosonic Gaussian states, learning Gaussian graphical
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 43 pages, 1 figure

点击查看摘要

Abstract:In this work, we initiate the study of Hamiltonian learning for positive temperature bosonic Gaussian states, the quantum generalization of the widely studied problem of learning Gaussian graphical models. We obtain efficient protocols, both in sample and computational complexity, for the task of inferring the parameters of their underlying quadratic Hamiltonian under the assumption of bounded temperature, squeezing, displacement and maximal degree of the interaction graph. Our protocol only requires heterodyne measurements, which are often experimentally feasible, and has a sample complexity that scales logarithmically with the number of modes. Furthermore, we show that it is possible to learn the underlying interaction graph in a similar setting and sample complexity. Taken together, our results put the status of the quantum Hamiltonian learning problem for continuous variable systems in a much more advanced state when compared to spins, where state-of-the-art results are either unavailable or quantitatively inferior to ours. In addition, we use our techniques to obtain the first results on learning Gaussian states in trace distance with a quadratic scaling in precision and polynomial in the number of modes, albeit imposing certain restrictions on the Gaussian states. Our main technical innovations are several continuity bounds for the covariance and Hamiltonian matrix of a Gaussian state, which are of independent interest, combined with what we call the local inversion technique. In essence, the local inversion technique allows us to reliably infer the Hamiltonian of a Gaussian state by only estimating in parallel submatrices of the covariance matrix whose size scales with the desired precision, but not the number of modes. This way we bypass the need to obtain precise global estimates of the covariance matrix, controlling the sample complexity.

[LG-61] Unleashing the power of novel conditional generative approaches for new materials discovery

链接: https://arxiv.org/abs/2411.03156
作者: Lev Novitskiy,Vladimir Lazarev,Mikhail Tiutiulnikov,Nikita Vakhrameev,Roman Eremin,Innokentiy Humonen,Andrey Kuznetsov,Denis Dimitrov,Semen Budennyy
关键词-EN: long time, finding a candidate, crystal structure design, structure, approaches
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:For a very long time, computational approaches to the design of new materials have relied on an iterative process of finding a candidate material and modeling its properties. AI has played a crucial role in this regard, helping to accelerate the discovery and optimization of crystal properties and structures through advanced computational methodologies and data-driven approaches. To address the problem of new materials design and accelerate the search for new materials, we have applied the latest generative approaches to the problem of crystal structure design, attempting to solve the inverse problem: given properties, generate a structure that satisfies them without relying on supercomputer power. In our work we propose two approaches: 1) conditional structure modification: optimization of the stability of an arbitrary atomic configuration, using the energy difference between the most energetically favorable structure and all its less stable polymorphs, and 2) conditional structure generation. We used a representation for materials that includes the following information: lattice, atom coordinates, atom types, chemical features, space group and formation energy of the structure. The loss function was optimized to take into account the periodic boundary conditions of crystal structures. We applied the Diffusion models approach and Flow matching alongside a standard Autoencoder (AE), and compared the results of the models and approaches. As a metric for the study, the physical PyMatGen matcher was employed: we compare the target structure with the generated one using default tolerances. So far, our modifier and generator produce structures with the desired properties with accuracies of 41% and 82%, respectively. To demonstrate the efficiency of the proposed methodology, inference has been carried out, resulting in several potentially new structures with formation energies below the AFLOW-derived convex hulls.

[LG-62] Correlating Variational Autoencoders Natively For Multi-View Imputation NEURIPS2024

链接: https://arxiv.org/abs/2411.03097
作者: Ella S. C. Orme,Marina Evangelou,Ulrich Paquet
关键词-EN: latent spaces, source often exhibit, Multi-view data, correlation, spaces
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Accepted at ‘UniReps: 2nd Edition of the Workshop on Unifying Representations in Neural Models’, a workshop at NeurIPS 2024

点击查看摘要

Abstract:Multi-view data from the same source often exhibit correlation. This is mirrored in correlation between the latent spaces of separate variational autoencoders (VAEs) trained on each data-view. A multi-view VAE approach is proposed that incorporates a joint prior with a non-zero correlation structure between the latent spaces of the VAEs. By enforcing such correlation structure, more strongly correlated latent spaces are uncovered. Using conditional distributions to move between these latent spaces, missing views can be imputed and used for downstream analysis. Learning this correlation structure involves maintaining validity of the prior distribution, as well as a successful parameterization that allows end-to-end learning.
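The non-zero correlation structure in the joint prior can be made concrete with the textbook construction of correlated Gaussian variables, shown here for one latent dimension per view. This is only a toy illustration; the paper learns a full cross-view correlation structure end-to-end while keeping the prior a valid distribution.

```python
import math
import random

def sample_correlated_latents(rho, n, rng=random):
    """Draw n pairs of 1-D latent variables with a correlated joint prior
    (Pearson correlation rho), via the standard Gaussian construction
    z2 = rho * z1 + sqrt(1 - rho^2) * noise."""
    pairs = []
    for _ in range(n):
        u1, u2 = rng.gauss(0, 1), rng.gauss(0, 1)
        pairs.append((u1, rho * u1 + math.sqrt(1.0 - rho * rho) * u2))
    return pairs
```

Conditioning one latent on the other through this joint distribution is what allows a missing view to be imputed from an observed one.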

[LG-63] Blending Ensemble for Classification with Genetic-algorithm generated Alpha factors and Sentiments (GAS)

链接: https://arxiv.org/abs/2411.03035
作者: Quechen Yang
关键词-EN: Algorithm-generated Alpha Sentiment, Genetic Algorithm-generated Alpha, understanding and predicting, increasing maturity, maturity and expansion
类目: Computational Finance (q-fin.CP); Machine Learning (cs.LG); Trading and Market Microstructure (q-fin.TR)
*备注:

点击查看摘要

Abstract:With the increasing maturity and expansion of the cryptocurrency market, understanding and predicting its price fluctuations has become an important issue in the field of financial engineering. This article introduces an innovative Genetic Algorithm-generated Alpha Sentiment (GAS) blending ensemble model specifically designed to predict Bitcoin market trends. The model integrates advanced ensemble learning methods, feature selection algorithms, and in-depth sentiment analysis to effectively capture the complexity and variability of daily Bitcoin trading data. The GAS framework combines 34 Alpha factors with 8 news economic sentiment factors to provide deep insights into Bitcoin price fluctuations by accurately analyzing market sentiment and technical indicators. The core of this study is using a stacked model (including LightGBM, XGBoost, and Random Forest Classifier) for trend prediction which demonstrates excellent performance in traditional buy-and-hold strategies. In addition, this article also explores the effectiveness of using genetic algorithms to automate alpha factor construction as well as enhancing predictive models through sentiment analysis. Experimental results show that the GAS model performs competitively in daily Bitcoin trend prediction especially when analyzing highly volatile financial assets with rich data.
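The blending step of such a stacked ensemble can be illustrated without any ML libraries: base-model probabilities are combined with weights selected on a validation split. The grid search below is a minimal stand-in for training a meta-classifier; in the paper the stacked level combines LightGBM, XGBoost and Random Forest outputs.

```python
import itertools

def blend_predictions(base_probs, weights):
    """Weighted average of base-model probabilities for one sample."""
    s = sum(weights)
    return sum(w * p for w, p in zip(weights, base_probs)) / s

def fit_blend_weights(val_probs, val_labels, grid=(0.0, 0.5, 1.0)):
    """Pick blending weights by validation accuracy (a toy meta-learner).

    val_probs  -- list of per-sample tuples of base-model probabilities
    val_labels -- list of binary labels
    """
    best, best_acc = None, -1.0
    for w in itertools.product(grid, repeat=len(val_probs[0])):
        if sum(w) == 0:
            continue
        preds = [blend_predictions(p, w) >= 0.5 for p in val_probs]
        acc = sum(int(pr) == y for pr, y in zip(preds, val_labels)) / len(val_labels)
        if acc > best_acc:
            best, best_acc = w, acc
    return best, best_acc
```

In this sketch an uninformative base model simply receives zero weight; a trained meta-classifier generalizes the same idea to soft, data-dependent combinations.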

[LG-64] Neural Networks and (Virtual) Extended Formulations

链接: https://arxiv.org/abs/2411.03006
作者: Christoph Hertrich,Georg Loho
关键词-EN: modern machine learning, extension complexity, linear activation functions, rectified linear units, mathrm
类目: Combinatorics (math.CO); Computational Complexity (cs.CC); Discrete Mathematics (cs.DM); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Neural networks with piecewise linear activation functions, such as rectified linear units (ReLU) or maxout, are among the most fundamental models in modern machine learning. We make a step towards proving lower bounds on the size of such neural networks by linking their representative capabilities to the notion of the extension complexity $\mathrm{xc}(P)$ of a polytope $P$, a well-studied quantity in combinatorial optimization and polyhedral geometry. To this end, we propose the notion of virtual extension complexity $\mathrm{vxc}(P)=\min\{\mathrm{xc}(Q)+\mathrm{xc}(R)\mid P+Q=R\}$. This generalizes $\mathrm{xc}(P)$ and describes the number of inequalities needed to represent the linear optimization problem over $P$ as a difference of two linear programs. We prove that $\mathrm{vxc}(P)$ is a lower bound on the size of a neural network that optimizes over $P$. While it remains open to derive strong lower bounds on virtual extension complexity, we show that powerful results on the ordinary extension complexity can be converted into lower bounds for monotone neural networks, that is, neural networks with only nonnegative weights. Furthermore, we show that one can efficiently optimize over a polytope $P$ using a small virtual extended formulation. We therefore believe that virtual extension complexity deserves to be studied independently from neural networks, just like the ordinary extension complexity. As a first step in this direction, we derive an example showing that extension complexity can go down under Minkowski sum.

[LG-65] Sparse Reconstruction of Wavefronts using an Over-Complete Phase Dictionary

链接: https://arxiv.org/abs/2411.02985
作者: S. Howard,N. Weisse,J. Schroeder,C. Barbero,B. Alonso,I. Sola,P. Norreys,A. Döpp
关键词-EN: including adaptive optics, phase contrast imaging, including adaptive, adaptive optics, contrast imaging
类目: Optics (physics.optics); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Wavefront reconstruction is a critical component in various optical systems, including adaptive optics, interferometry, and phase contrast imaging. Traditional reconstruction methods often employ either the Cartesian (pixel) basis or the Zernike polynomial basis. While the Cartesian basis is adept at capturing high-frequency features, it is susceptible to overfitting and inefficiencies due to the high number of degrees of freedom. The Zernike basis efficiently represents common optical aberrations but struggles with complex or non-standard wavefronts such as optical vortices, Bessel beams, or wavefronts with sharp discontinuities. This paper introduces a novel approach to wavefront reconstruction using an over-complete phase dictionary combined with sparse representation techniques. By constructing a dictionary that includes a diverse set of basis functions - ranging from Zernike polynomials to specialized functions representing optical vortices and other complex modes - we enable a more flexible and efficient representation of complex wavefronts. Furthermore, a trainable affine transform is implemented to account for misalignment. Utilizing principles from compressed sensing and sparse coding, we enforce sparsity in the coefficient space to avoid overfitting and enhance robustness to noise.
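The sparse-representation idea in the abstract can be sketched with a generic toy: an over-complete dictionary of atoms and a greedy orthogonal matching pursuit that picks a few atoms to explain the signal. This is a minimal illustration of sparse coding, not the authors' phase dictionary or wavefront pipeline.

```python
import numpy as np

def omp(D, y, k):
    """Greedy orthogonal matching pursuit: approximate y with at most k atoms of D."""
    residual, support = y.copy(), []
    coeffs = np.zeros(D.shape[1])
    for _ in range(k):
        # atom most correlated with the current residual
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        # least-squares refit of the signal on the selected atoms
        sol, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ sol
    coeffs[support] = sol
    return coeffs

# toy over-complete dictionary: 8-dimensional signals, 20 unit-norm atoms
rng = np.random.default_rng(0)
D = rng.standard_normal((8, 20))
D /= np.linalg.norm(D, axis=0)

y = 1.5 * D[:, 3] - 0.7 * D[:, 11]   # signal synthesized from two atoms
x = omp(D, y, k=2)
print(np.nonzero(x)[0])              # atoms selected by the pursuit
print(np.linalg.norm(y - D @ x))     # residual of the sparse fit
```

Enforcing `k` far smaller than the dictionary size is what gives the robustness to noise and overfitting that the abstract describes.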

[LG-66] Gradient Descent Finds Over-Parameterized Neural Networks with Sharp Generalization for Nonparametric Regression: A Distribution-Free Analysis

链接: https://arxiv.org/abs/2411.02904
作者: Yingzhen Yang,Ping Li
关键词-EN: neural network trained, two-layer neural network, Neural Tangent Kernel, kernel regression trained, study nonparametric regression
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: arXiv admin note: substantial text overlap with arXiv:2407.11353

点击查看摘要

Abstract:We study nonparametric regression by an over-parameterized two-layer neural network trained by gradient descent (GD) in this paper. We show that, if the neural network is trained by GD with early stopping, then the trained network renders a sharp rate of the nonparametric regression risk of \mathcal{O}(\varepsilon_n^2) , which is the same rate as that for the classical kernel regression trained by GD with early stopping, where \varepsilon_n is the critical population rate of the Neural Tangent Kernel (NTK) associated with the network and n is the size of the training data. It is remarked that our result does not require distributional assumptions on the training data, in strong contrast with many existing results which rely on specific distributions such as the spherical uniform data distribution or distributions satisfying certain restrictive conditions. The rate \mathcal{O}(\varepsilon_n^2) is known to be minimax optimal for specific cases, such as the case that the NTK has a polynomial eigenvalue decay rate which happens under certain distributional assumptions. Our result formally fills the gap between training a classical kernel regression model and training an over-parameterized but finite-width neural network by GD for nonparametric regression without distributional assumptions. We also provide confirmative answers to certain open questions or address particular concerns in the literature of training over-parameterized neural networks by GD with early stopping for nonparametric regression, including the characterization of the stopping time, the lower bound for the network width, and the constant learning rate used in GD.
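The classical baseline the abstract compares against, kernel regression trained by GD with early stopping, can be sketched as follows. The toy data, RBF bandwidth, and step size are all illustrative choices, not the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(40)  # noisy targets

# RBF (Gaussian) kernel matrix over all points
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.02)

# split even/odd points into train/validation sets
tr, va = np.arange(0, 40, 2), np.arange(1, 40, 2)
alpha = np.zeros(40)
lr, best_va, best_alpha = 1e-3, np.inf, alpha.copy()

for step in range(2000):
    resid = K[tr] @ alpha - y[tr]
    alpha -= lr * K[tr].T @ resid      # GD step on the squared training loss
    va_loss = np.mean((K[va] @ alpha - y[va]) ** 2)
    if va_loss < best_va:
        # early stopping: keep the iterate with the best validation error
        best_va, best_alpha = va_loss, alpha.copy()

print(best_va)  # validation MSE at the early-stopped iterate
```

Stopping at the validation minimum rather than running GD to convergence is what controls the regression risk in this regime.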

[LG-67] Generalization and Risk Bounds for Recurrent Neural Networks

链接: https://arxiv.org/abs/2411.02784
作者: Xuewei Cheng,Ke Huang,Shujie Ma
关键词-EN: Recurrent Neural Networks, Recurrent Neural, Neural Networks, achieved great success, sequential data
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recurrent Neural Networks (RNNs) have achieved great success in the prediction of sequential data. However, their theoretical studies are still lagging behind because of their complex interconnected structures. In this paper, we establish a new generalization error bound for vanilla RNNs, and provide a unified framework to calculate the Rademacher complexity that can be applied to a variety of loss functions. When the ramp loss is used, we show that our bound is tighter than the existing bounds based on the same assumptions on the Frobenius and spectral norms of the weight matrices and a few mild conditions. Our numerical results show that our new generalization bound is the tightest among all existing bounds in three public datasets. Our bound improves the second tightest one by an average percentage of 13.80% and 3.01% when the tanh and ReLU activation functions are used, respectively. Moreover, we derive a sharp estimation error bound for RNN-based estimators obtained through empirical risk minimization (ERM) in multi-class classification problems when the loss function satisfies a Bernstein condition.

[LG-68] Expressivity of deterministic quantum computation with one qubit

链接: https://arxiv.org/abs/2411.02751
作者: Yujin Kim,Daniel K. Park
关键词-EN: Deterministic quantum computation, practical interest due, Deterministic quantum, significant theoretical, interest due
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 12 pages, 5 figures

点击查看摘要

Abstract:Deterministic quantum computation with one qubit (DQC1) is of significant theoretical and practical interest due to its computational advantages in certain problems, despite its subuniversality with limited quantum resources. In this work, we introduce parameterized DQC1 as a quantum machine learning model. We demonstrate that the gradient of the measurement outcome of a DQC1 circuit with respect to its gate parameters can be computed directly using the DQC1 protocol. This allows for gradient-based optimization of DQC1 circuits, positioning DQC1 as the sole quantum protocol for both training and inference. We then analyze the expressivity of the parameterized DQC1 circuits, characterizing the set of learnable functions, and show that DQC1-based machine learning (ML) is as powerful as quantum neural networks based on universal computation. Our findings highlight the potential of DQC1 as a practical and versatile platform for ML, capable of rivaling more complex quantum computing models while utilizing simpler quantum resources.

[LG-69] Elliptical Wishart distributions: information geometry maximum likelihood estimator performance analysis and statistical learning

链接: https://arxiv.org/abs/2411.02726
作者: Imen Ayadi,Florent Bouchard,Frédéric Pascal
关键词-EN: Elliptical Wishart distributions, Elliptical Wishart, Wishart distributions, Wishart, paper deals
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:This paper deals with Elliptical Wishart distributions - which generalize the Wishart distribution - in the context of signal processing and machine learning. Two algorithms to compute the maximum likelihood estimator (MLE) are proposed: a fixed point algorithm and a Riemannian optimization method based on the derived information geometry of Elliptical Wishart distributions. The existence and uniqueness of the MLE are characterized as well as the convergence of both estimation algorithms. Statistical properties of the MLE are also investigated such as consistency, asymptotic normality and an intrinsic version of Fisher efficiency. On the statistical learning side, novel classification and clustering methods are designed. For the t -Wishart distribution, the performance of the MLE and statistical learning algorithms are evaluated on both simulated and real EEG and hyperspectral data, showcasing the interest of our proposed methods.

[LG-70] Point processes with event time uncertainty

链接: https://arxiv.org/abs/2411.02694
作者: Xiuyuan Cheng,Tingnan Gong,Yao Xie
关键词-EN: dependent event data, widely used statistical, uncovering the temporal, temporal patterns, patterns in dependent
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Point processes are widely used statistical models for uncovering the temporal patterns in dependent event data. In many applications, the event time cannot be observed exactly, calling for the incorporation of time uncertainty into the modeling of point process data. In this work, we introduce a framework to model time-uncertain point processes possibly on a network. We start by deriving the formulation in the continuous-time setting under a few assumptions motivated by application scenarios. After imposing a time grid, we obtain a discrete-time model that facilitates inference and can be computed by first-order optimization methods such as Gradient Descent or variational inequality (VI) methods using batch-based Stochastic Gradient Descent (SGD). The parameter recovery guarantee is proved for VI inference at an O(1/k) convergence rate using k SGD steps. Our framework handles non-stationary processes by modeling the influence kernel as a matrix (or tensor on a network) and it covers the stationary process, such as the classical Hawkes process, as a special case. We experimentally show that the proposed approach outperforms previous General Linear model (GLM) baselines on simulated and real data and reveals meaningful causal relations on a Sepsis-associated Derangements dataset.

[LG-71] Multi-modal deformable image registration using untrained neural networks

链接: https://arxiv.org/abs/2411.02672
作者: Quang Luong Nhat Nguyen,Ruiming Cao,Laura Waller
关键词-EN: Image registration techniques, techniques usually assume, lacks a general, general method, method
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Image registration techniques usually assume that the images to be registered are of a certain type (e.g. single- vs. multi-modal, 2D vs. 3D, rigid vs. deformable), and no general method exists that works for data under all conditions. We propose a registration method that utilizes neural networks for image representation. Our method uses untrained networks with limited representation capacity as an implicit prior that guides the registration toward a good solution. Unlike previous approaches that are specialized for specific data types, our method handles both rigid and non-rigid, as well as single- and multi-modal registration, without requiring changes to the model or objective function. We have performed a comprehensive evaluation study using a variety of datasets and demonstrated promising performance.

[LG-72] A Trust-Region Algorithm for Noisy Equality Constrained Optimization

链接: https://arxiv.org/abs/2411.02665
作者: Shigeng Sun,Jorge Nocedal
关键词-EN: modified Byrd-Omojokun, gradient evaluations, address the challenges, challenges posed, function and gradient
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces a modified Byrd-Omojokun (BO) trust region algorithm to address the challenges posed by noisy function and gradient evaluations. The original BO method was designed to solve equality constrained problems and it forms the backbone of some interior point methods for general large-scale constrained optimization. A key strength of the BO method is its robustness in handling problems with rank-deficient constraint Jacobians. The algorithm proposed in this paper introduces a new criterion for accepting a step and for updating the trust region that makes use of an estimate of the noise in the problem. The analysis presented here gives conditions under which the iterates converge to regions of stationary points of the problem, determined by the level of noise. This analysis is more complex than for line search methods because the trust region carries (noisy) information from previous iterates. Numerical tests illustrate the practical performance of the algorithm.

[LG-73] Deep operator neural network applied to efficient computation of asteroid surface temperature and the Yarkovsky effect

链接: https://arxiv.org/abs/2411.02653
作者: Shunjing Zhao,Hanlun Lei,Xian Shi
关键词-EN: Solar System, thermal property-based studies, Surface temperature distribution, property-based studies, studies about irregular
类目: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: accepted for publication in “Astronomy & Astrophysics”

点击查看摘要

Abstract:Surface temperature distribution is crucial for thermal property-based studies about irregular asteroids in our Solar System. While direct numerical simulations could model surface temperatures with high fidelity, they often take a significant amount of computational time, especially for problems where temperature distributions are required to be repeatedly calculated. To this end, deep operator neural network (DeepONet) provides a powerful tool due to its high computational efficiency and generalization ability. In this work, we applied DeepONet to the modelling of asteroid surface temperatures. Results show that the trained network is able to predict temperature with an accuracy of ~1% on average, while the computational cost is five orders of magnitude lower, hence enabling thermal property analysis in a multidimensional parameter space. As a preliminary application, we analyzed the orbital evolution of asteroids through direct N-body simulations embedded with instantaneous Yarkovsky effect inferred by DeepONet-based thermophysical modeling. Taking asteroids (3200) Phaethon and (89433) 2001 WM41 as examples, we show the efficacy and efficiency of our AI-based approach.

[LG-74] Classifier Chain Networks for Multi-Label Classification

链接: https://arxiv.org/abs/2411.02638
作者: Daniel J. W. Touw,Michel van de Velden
关键词-EN: classifier chain network, classifier chain, analyzing multi-labeled data, chain network, chain
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 30 pages, 7 figures

点击查看摘要

Abstract:The classifier chain is a widely used method for analyzing multi-labeled data sets. In this study, we introduce a generalization of the classifier chain: the classifier chain network. The classifier chain network enables joint estimation of model parameters and allows accounting for the influence of earlier label predictions on subsequent classifiers in the chain. Through simulations, we evaluate the classifier chain network’s performance against multiple benchmark methods, demonstrating competitive results even in scenarios that deviate from its modeling assumptions. Furthermore, we propose a new measure for detecting conditional dependencies between labels and illustrate the classifier chain network’s effectiveness using an empirical data set.
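The classical classifier chain that the paper generalizes is available directly in scikit-learn: each classifier in the chain is trained on the features plus the labels predicted by the earlier links. A minimal sketch on synthetic multi-label data (assuming scikit-learn is installed; this shows the standard chain, not the paper's network):

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import ClassifierChain

# synthetic multi-label problem: 4 binary labels per sample
X, Y = make_multilabel_classification(n_samples=300, n_features=20,
                                      n_classes=4, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)

# each link sees the features plus the predictions of all earlier links
chain = ClassifierChain(LogisticRegression(max_iter=1000),
                        order=[0, 1, 2, 3], random_state=0)
chain.fit(X_tr, Y_tr)
Y_pred = chain.predict(X_te)
print((Y_pred == Y_te).mean())  # per-label accuracy on the test split
```

The fixed `order` makes the dependency structure explicit; the paper's network replaces this sequential, separately-fitted construction with joint parameter estimation.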

[LG-75] Optimization Algorithm Design via Electric Circuits

链接: https://arxiv.org/abs/2411.02573
作者: Stephen P. Boyd,Tetiana Parshakova,Ernest K. Ryu,Jaewook J. Suh
关键词-EN: electric RLC circuits, electric RLC, RLC circuits, convex optimization algorithm, convex optimization
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a novel methodology for convex optimization algorithm design using ideas from electric RLC circuits. Given an optimization problem, the first stage of the methodology is to design an appropriate electric circuit whose continuous-time dynamics converge to the solution of the optimization problem at hand. Then, the second stage is an automated, computer-assisted discretization of the continuous-time dynamics, yielding a provably convergent discrete-time algorithm. Our methodology recovers many classical (distributed) optimization algorithms and enables users to quickly design and explore a wide range of new algorithms with convergence guarantees.
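The two-stage idea can be illustrated on the simplest dissipative dynamics (an analogy only, not the paper's automated RLC discretization): a circuit-like gradient flow x'(t) = -grad f(x) settles at the minimizer, and its forward-Euler discretization is exactly gradient descent.

```python
import numpy as np

# Stage 1 (continuous dynamics): x'(t) = -(A x - b), the gradient flow of the
# convex quadratic f(x) = 1/2 x^T A x - b^T x; its equilibrium solves A x = b.
def grad_f(x):
    return A @ x - b

A = np.array([[3.0, 1.0], [1.0, 2.0]])  # symmetric positive definite
b = np.array([1.0, 1.0])

# Stage 2 (discretization): forward Euler with step h recovers gradient descent
x, h = np.zeros(2), 0.1
for _ in range(200):
    x = x - h * grad_f(x)

print(x)                      # approaches the equilibrium A^{-1} b
print(np.linalg.solve(A, b))
```

The step size must respect the fastest "circuit" time constant (here h * lambda_max(A) < 2), which is the kind of condition a computer-assisted discretization must certify.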

[LG-76] A Directional Rockafellar-Uryasev Regression

链接: https://arxiv.org/abs/2411.02557
作者: Alberto Arletti
关键词-EN: Big Data, Data datasets suffer, Big Data, Data, meta data information
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 8 figures, 19 pages, 5 tables

点击查看摘要

Abstract:Most Big Data datasets suffer from selection bias. For example, X (Twitter) training observations differ largely from the testing offline observations as individuals on Twitter are generally more educated, democratic or left-leaning. Therefore, one major obstacle to reliable estimation is the differences between training and testing data. How can researchers make use of such data even in the presence of non-ignorable selection mechanisms? A number of methods have been developed for this issue, such as distributionally robust optimization (DRO) or learning fairness. A possible avenue to reducing the effect of bias is meta-information. Researchers, being field experts, might have prior information on the form and extent of selection bias affecting their dataset, and in which direction the selection might cause the estimate to change, e.g. over or under estimation. At the same time, there is no direct way to leverage these types of information in learning. I propose a loss function which takes into account two types of meta data information given by the researcher: quantity and direction (under or over sampling) of bias in the training set. Estimation with the proposed loss function is then implemented through a neural network, the directional Rockafellar-Uryasev (dRU) regression model. I test the dRU model on a biased training dataset, a Big Data online-drawn electoral poll. I apply the proposed model using meta data information coherent with the political and sampling information obtained from previous studies. The results show that including meta information improves the electoral results predictions compared to a model that does not include them.

[LG-77] Distributionally Robust Optimization

链接: https://arxiv.org/abs/2411.02549
作者: Daniel Kuhn,Soroosh Shafiee,Wolfram Wiesemann
关键词-EN: Distributionally robust optimization, uncertain problem parameters, studies decision problems, Distributionally robust, probability distribution governing
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Distributionally robust optimization (DRO) studies decision problems under uncertainty where the probability distribution governing the uncertain problem parameters is itself uncertain. A key component of any DRO model is its ambiguity set, that is, a family of probability distributions consistent with any available structural or statistical information. DRO seeks decisions that perform best under the worst distribution in the ambiguity set. This worst case criterion is supported by findings in psychology and neuroscience, which indicate that many decision-makers have a low tolerance for distributional ambiguity. DRO is rooted in statistics, operations research and control theory, and recent research has uncovered its deep connections to regularization techniques and adversarial training in machine learning. This survey presents the key findings of the field in a unified and self-contained manner.

[LG-78] Generative Unfolding with Distribution Mapping

链接: https://arxiv.org/abs/2411.02495
作者: Anja Butter,Sascha Diefenbacher,Nathan Huetsch,Vinicius Mikuni,Benjamin Nachman,Sofia Palacios Schweitzer,Tilman Plehn
关键词-EN: Machine learning enables, learning enables unbinned, highly-differential cross section, cross section measurements, Machine learning
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注:

点击查看摘要

Abstract:Machine learning enables unbinned, highly-differential cross section measurements. A recent idea uses generative models to morph a starting simulation into the unfolded data. We show how to extend two morphing techniques, Schrödinger Bridges and Direct Diffusion, in order to ensure that the models learn the correct conditional probabilities. This brings distribution mapping to a similar level of accuracy as the state-of-the-art conditional generative unfolding methods. Numerical results are presented with a standard benchmark dataset of single jet substructure as well as for a new dataset describing a 22-dimensional phase space of Z + 2-jets.

[LG-79] First observations of the seiche that shook the world

链接: https://arxiv.org/abs/2411.02469
作者: Thomas Monahan,Tianning Tang,Stephen Roberts,Thomas A. A. Adcock
关键词-EN: observed globally, mHz seismic signal, September, Surface Water Ocean, Water Ocean Topography
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
*备注: 19 pages, 9 figures

点击查看摘要

Abstract:On September 16th, 2023, an anomalous 10.88 mHz seismic signal was observed globally, persisting for 9 days. One month later an identical signal appeared, lasting for another week. Several studies have theorized that these signals were produced by seiches which formed after two landslide generated mega-tsunamis in an East-Greenland fjord. This theory is supported by seismic inversions, and analytical and numerical modeling, but no direct observations have been made – until now. Using data from the new Surface Water Ocean Topography mission, we present the first observations of this phenomenon. By ruling out other oceanographic processes, we validate the seiche theory of previous authors and independently estimate its initial amplitude at 7.9 m using Bayesian machine learning and seismic data. This study demonstrates the value of satellite altimetry for studying extreme events, while also highlighting the need for specialized methods to address the altimetric data’s limitations, namely temporal sparsity. These data and approaches will help in understanding future unseen extremes driven by climate change.

[LG-80] Super-Resolution without High-Resolution Labels for Black Hole Simulations

链接: https://arxiv.org/abs/2411.02453
作者: Thomas Helfer,Thomas D.P. Edwards,Jessica Dafflon,Kaze W.K. Wong,Matthew Lyle Olson
关键词-EN: Black Hole mergers, generating Black Hole, Black Hole simulations, Black Hole, Hole mergers
类目: General Relativity and Quantum Cosmology (gr-qc); Machine Learning (cs.LG)
*备注: Code available at this https URL and data at this https URL

点击查看摘要

Abstract:Generating high-resolution simulations is key for advancing our understanding of one of the universe’s most violent events: Black Hole mergers. However, generating Black Hole simulations is limited by prohibitive computational costs and scalability issues, reducing the simulation’s fidelity and resolution achievable within reasonable time frames and resources. In this work, we introduce a novel method that circumvents these limitations by applying a super-resolution technique without directly needing high-resolution labels, leveraging the Hamiltonian and momentum constraints, fundamental equations in general relativity that govern the dynamics of spacetime. We demonstrate that our method achieves a reduction in constraint violation by one to two orders of magnitude and generalizes effectively to out-of-distribution simulations.

[LG-81] A Coverage-Guided Testing Framework for Quantum Neural Networks

链接: https://arxiv.org/abs/2411.02450
作者: Minqi Shao,Jianjun Zhao
关键词-EN: Quantum Neural Networks, combine quantum computing, leveraging quantum properties, Neural Networks, improve machine learning
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Quantum Neural Networks (QNNs) combine quantum computing and neural networks, leveraging quantum properties such as superposition and entanglement to improve machine learning models. These quantum characteristics enable QNNs to potentially outperform classical neural networks in tasks such as quantum chemistry simulations, optimization problems, and quantum-enhanced machine learning. However, they also introduce significant challenges in verifying the correctness and reliability of QNNs. To address this, we propose QCov, a set of test coverage criteria specifically designed for QNNs to systematically evaluate QNN state exploration during testing, focusing on superposition and entanglement. These criteria help detect quantum-specific defects and anomalies. Extensive experiments on benchmark datasets and QNN models validate QCov’s effectiveness in identifying quantum-specific defects and guiding fuzz testing, thereby improving QNN robustness and reliability.

[LG-82] An Efficient Hierarchical Preconditioner-Learner Architecture for Reconstructing Multi-scale Basis Functions of High-dimensional Subsurface Fluid Flow

链接: https://arxiv.org/abs/2411.02431
作者: Peiqi Li,Jie Chen
关键词-EN: subsurface fluid flow, fluid flow, Fourier Neural Operators, subsurface fluid, Fourier Preconditioner-based Hierarchical
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注: 20 pages, 9 figures

点击查看摘要

Abstract:Modeling subsurface fluid flow in porous media is crucial for applications such as oil and gas exploration. However, the inherent heterogeneity and multi-scale characteristics of these systems pose significant challenges in accurately reconstructing fluid flow behaviors. To address this issue, we proposed Fourier Preconditioner-based Hierarchical Multiscale Net (FP-HMsNet), an efficient hierarchical preconditioner-learner architecture that combines Fourier Neural Operators (FNO) with multi-scale neural networks to reconstruct multi-scale basis functions of high-dimensional subsurface fluid flow. Using a dataset comprising 102,757 training samples, 34,252 validation samples, and 34,254 test samples, we ensured the reliability and generalization capability of the model. Experimental results showed that FP-HMsNet achieved an MSE of 0.0036, an MAE of 0.0375, and an R^2 of 0.9716 on the testing set, significantly outperforming existing models and demonstrating exceptional accuracy and generalization ability. Additionally, robustness tests revealed that the model maintained stability under various levels of noise interference. Ablation studies confirmed the critical contribution of the preconditioner and multi-scale pathways to the model’s performance. Compared to current models, FP-HMsNet not only achieved lower errors and higher accuracy but also demonstrated faster convergence and improved computational efficiency, establishing itself as the state-of-the-art (SOTA) approach. This model offers a novel method for efficient and accurate subsurface fluid flow modeling, with promising potential for more complex real-world applications.

信息检索

[IR-0] Self-supervised Hierarchical Representation for Medication Recommendation

链接: https://arxiv.org/abs/2411.03143
作者: Yuliang Liang,Yuting Liu,Yizhou Dang,Enneng Yang,Guibing Guo,Wei Cai,Jianzhe Zhao,Xingwei Wang
关键词-EN: patient health history, medication combinations based, Medication recommender, medication combinations, Chronic Respiratory Diseases
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:A medication recommender suggests appropriate medication combinations based on a patient’s health history, e.g., diagnoses and procedures. Existing works represent different diagnoses/procedures as well-separated one-hot encodings. However, they ignore the latent hierarchical structures of these medical terms, undermining the generalization performance of the model. For example, “Respiratory Diseases”, “Chronic Respiratory Diseases” and “Chronic Bronchitis” have a hierarchical relationship, progressing from general to specific. To address this issue, we propose a novel hierarchical encoder named HIER to hierarchically represent diagnoses and procedures, which is based on standard medical codes and compatible with any existing methods. Specifically, the proposed method learns relation embedding with a self-supervised objective for incorporating the neighbor hierarchical structure. Additionally, we develop the position encoding to explicitly introduce global hierarchical position. Extensive experiments demonstrate significant and consistent improvements in recommendation accuracy across four baselines and two real-world clinical datasets.
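The hierarchy of standard medical codes that the paper exploits is visible in the codes themselves: ICD-style codes share prefixes, so a code's ancestor chain, from general to specific, can be read off by truncation. A hypothetical sketch (the helpers below are our illustration, not the paper's HIER encoder):

```python
# Hypothetical helpers: derive a code's general-to-specific ancestor chain
# from its character prefixes (skipping prefixes that end on a separator).
def ancestors(code: str):
    """e.g. 'J44.9' -> ['J', 'J4', 'J44', 'J44.9']"""
    return [code[:i] for i in range(1, len(code) + 1)
            if not code[:i].endswith(".")]

# a shared-ancestor count gives a crude hierarchical similarity between codes
def shared_depth(a: str, b: str) -> int:
    return len(set(ancestors(a)) & set(ancestors(b)))

print(ancestors("J44.9"))
print(shared_depth("J44.9", "J44.1"))  # sibling codes share three ancestors
```

A one-hot encoding would score "J44.9" and "J44.1" as unrelated; a hierarchy-aware representation can reuse the shared prefix structure, which is the intuition behind the paper's relation embedding.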

[IR-1] HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems

链接: https://arxiv.org/abs/2411.02959
作者: Jiejun Tan,Zhicheng Dou,Wen Wang,Mang Wang,Weipeng Chen,Ji-Rong Wen
关键词-EN: HTML, RAG, improve knowledge capabilities, RAG systems, HTML sources
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has been shown to improve knowledge capabilities and alleviate the hallucination problem of LLMs. The Web is a major source of external knowledge used in RAG systems, and many commercial systems such as ChatGPT and Perplexity have used Web search engines as their major retrieval systems. Typically, such RAG systems retrieve search results, download HTML sources of the results, and then extract plain texts from the HTML sources. Plain text documents or chunks are fed into the LLMs to augment the generation. However, much of the structural and semantic information inherent in HTML, such as headings and table structures, is lost during this plain-text-based RAG process. To alleviate this problem, we propose HtmlRAG, which uses HTML instead of plain text as the format of retrieved knowledge in RAG. We believe HTML is better than plain text in modeling knowledge in external documents, and most LLMs possess robust capacities to understand HTML. However, utilizing HTML presents new challenges. HTML contains additional content such as tags, JavaScript, and CSS specifications, which bring extra input tokens and noise to the RAG system. To address this issue, we propose HTML cleaning, compression, and pruning strategies, to shorten the HTML while minimizing the loss of information. Specifically, we design a two-step block-tree-based pruning method that prunes useless HTML blocks and keeps only the relevant part of the HTML. Experiments on six QA datasets confirm the superiority of using HTML in RAG systems.
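The HTML-cleaning step described above can be sketched with the standard library alone. This is a hypothetical simplification (the tag allow-list is ours, and the paper's block-tree pruning is not reproduced): drop scripts, styles, and attributes while keeping headings, paragraphs, and table structure, shrinking token count without flattening to plain text.

```python
from html.parser import HTMLParser

# hypothetical allow-list of structural tags worth keeping for the LLM
KEEP = {"h1", "h2", "h3", "p", "ul", "ol", "li", "table", "tr", "td", "th"}
SKIP = {"script", "style"}  # content inside these is discarded entirely

class HtmlCleaner(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._skip_depth += 1
        elif tag in KEEP:
            self.out.append(f"<{tag}>")  # attributes discarded to save tokens

    def handle_endtag(self, tag):
        if tag in SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in KEEP:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.out.append(data.strip())

def clean_html(html: str) -> str:
    cleaner = HtmlCleaner()
    cleaner.feed(html)
    return "".join(cleaner.out)

raw = '<div class="x"><script>track()</script><h1>Title</h1><p>Body text</p></div>'
print(clean_html(raw))  # <h1>Title</h1><p>Body text</p>
```

The output keeps the heading/table semantics that plain-text extraction would lose, which is the motivation the abstract gives for feeding HTML rather than text to the LLM.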

[IR-2] Enhancing EmoBot: An In-Depth Analysis of User Satisfaction and Faults in an Emotion-Aware Chatbot

链接: https://arxiv.org/abs/2411.02831
作者: Taseen Mubassira,Mehedi Hasan,A. B. M. Alim Al Iislam
关键词-EN: detection aspect, community has traditionally, traditionally shown, shown a keen, keen interest
类目: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
*备注: 3 pages, extended abstract

点击查看摘要

Abstract:The research community has traditionally shown a keen interest in emotion modeling, with a notable emphasis on the detection aspect. In contrast, the exploration of emotion generation has received less attention. This study delves into an existing state-of-the-art emotional chatbot, EmoBot, designed for generating emotions in general-purpose conversations. This research involves a comprehensive examination, including a survey to evaluate EmoBot’s proficiency in key dimensions like usability, accuracy, and overall user satisfaction, with a specific focus on fault tolerance. By closely examining the chatbot’s operations, we identified some noteworthy shortcomings in the existing model. We propose some solutions designed to address and overcome the identified issues.

[IR-3] Leveraging Vision-Language Models for Manufacturing Feature Recognition in CAD Designs

链接: https://arxiv.org/abs/2411.02810
作者: Muhammad Tayyab Khan,Lequn Chen,Ye Han Ng,Wenhe Feng,Nicholas Yew Jin Tan,Seung Ki Moon
关键词-EN: Automatic feature recognition, actionable manufacturing information, transforming design knowledge, Automatic feature, Traditional AFR methods
类目: Computational Engineering, Finance, and Science (cs.CE); Information Retrieval (cs.IR)
*备注: Paper has been submitted to The ASME Journal of Computing and Information Science in Engineering (JCISE)

点击查看摘要

Abstract:Automatic feature recognition (AFR) is essential for transforming design knowledge into actionable manufacturing information. Traditional AFR methods, which rely on predefined geometric rules and large datasets, are often time-consuming and lack generalizability across various manufacturing features. To address these challenges, this study investigates vision-language models (VLMs) for automating the recognition of a wide range of manufacturing features in CAD designs without the need for extensive training datasets or predefined rules. Instead, prompt engineering techniques, such as multi-view query images, few-shot learning, sequential reasoning, and chain-of-thought, are applied to enable recognition. The approach is evaluated on a newly developed CAD dataset containing designs of varying complexity relevant to machining, additive manufacturing, sheet metal forming, molding, and casting. Five VLMs, including three closed-source models (GPT-4o, Claude-3.5-Sonnet, and Claude-3.0-Opus) and two open-source models (LLava and MiniCPM), are evaluated on this dataset with ground truth features labelled by experts. Key metrics include feature quantity accuracy, feature name matching accuracy, hallucination rate, and mean absolute error (MAE). Results show that Claude-3.5-Sonnet achieves the highest feature quantity accuracy (74%) and name-matching accuracy (75%) with the lowest MAE (3.2), while GPT-4o records the lowest hallucination rate (8%). In contrast, open-source models have higher hallucination rates (30%) and lower accuracies (40%). This study demonstrates the potential of VLMs to automate feature recognition in CAD designs within diverse manufacturing scenarios.
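The evaluation metrics named above can be computed straightforwardly once predicted and ground-truth features are paired per design. The helpers below are a minimal sketch under assumed definitions (quantity accuracy as exact count match, hallucination as a predicted name absent from the ground truth); the paper's precise formulations may differ.

```python
def evaluate_predictions(pred_counts, true_counts):
    """Compare per-design predicted feature counts against expert labels.
    Returns (feature_quantity_accuracy, mean_absolute_error)."""
    assert len(pred_counts) == len(true_counts) and true_counts
    exact = sum(p == t for p, t in zip(pred_counts, true_counts))
    mae = sum(abs(p - t) for p, t in zip(pred_counts, true_counts)) / len(true_counts)
    return exact / len(true_counts), mae

def hallucination_rate(pred_names, true_names):
    """Fraction of predicted feature names absent from the ground-truth set."""
    if not pred_names:
        return 0.0
    truth = set(true_names)
    return sum(name not in truth for name in pred_names) / len(pred_names)

# Toy example: 4 CAD designs, predicted vs. expert-labelled feature counts.
acc, mae = evaluate_predictions([3, 5, 2, 7], [3, 4, 2, 9])
print(acc, mae)  # 0.5 0.75
print(hallucination_rate(["hole", "slot", "fin"], ["hole", "slot"]))
```

Name-matching accuracy would additionally require fuzzy or semantic matching between predicted and expert feature labels, which is omitted here.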

[IR-4] Towards Context-Aware Adaptation in Extended Reality: A Design Space for XR Interfaces and an Adaptive Placement Strategy

链接: https://arxiv.org/abs/2411.02607
作者: Shakiba Davari,Doug A. Bowman
关键词-EN: Extended Reality, ameliorate traditional displays’ space limitations, adaptive placement strategy
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); Graphics (cs.GR); Information Retrieval (cs.IR); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:By converting the entire 3D space around the user into a screen, Extended Reality (XR) can ameliorate traditional displays’ space limitations and facilitate the consumption of multiple pieces of information at a time. However, if designed inappropriately, these XR interfaces can overwhelm the user and complicate information access. In this work, we explored the design dimensions that can be adapted to enable suitable presentation and interaction within an XR interface. To investigate a specific use case of context-aware adaptations within our proposed design space, we concentrated on the spatial layout of the XR content and investigated non-adaptive and adaptive placement strategies. In this paper, we (1) present a comprehensive design space for XR interfaces, (2) propose Environment-referenced, an adaptive placement strategy that uses a relevant intermediary from the environment within a Hybrid Frame of Reference (FoR) for each XR object, and (3) evaluate the effectiveness of this adaptive placement strategy and a non-adaptive Body-Fixed placement strategy in four contextual scenarios varying in terms of social setting and user mobility in the environment. The performance of these placement strategies from our within-subjects user study emphasized the importance of intermediaries’ relevance to the user’s focus. These findings underscore the importance of context-aware interfaces, indicating that the appropriate use of an adaptive content placement strategy in a context can significantly improve task efficiency, accuracy, and usability.

Attachment Download

Click to download today’s full paper list